Loading gene expression data requires at least a raw data file, a subject sample mapping file, and a platform definition already present in tranSMART. Uploading the platform definition is discussed in Section 7. The sample remapping file is optional. A description of the input files follows, together with information on invoking the gene expression data upload script. If you also have clinical data from the same study, it is important to upload the clinical data first: gene expression data can be linked to subjects who already have clinical data in the database, but not the other way around.
Figure 9: Ontology of the example study
Figure 10: Conceptual representation of the ETL process for gene expression data.
6.1 Raw Intensity Input Files
The raw intensity values can be located in one or more tab-delimited text files. The first row contains the column headers. The first column must be named ID_REF and holds the probe ID from which the intensity was measured. All subsequent column names are sample IDs. These sample IDs are used in the subject sample mapping file to map sample IDs to subject IDs from the clinical data.
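As a sketch, a raw intensity file for two samples could be created as follows. The probe and sample IDs are made up for illustration; the filename matches the example file used later in this section.

```shell
# Hypothetical raw intensity file (tab-delimited): ID_REF plus one column per sample.
printf 'ID_REF\tSAM0001\tSAM0002\n' >  raw.GSEXXXX_expression.txt
printf '1007_s_at\t215.3\t198.7\n'  >> raw.GSEXXXX_expression.txt
printf '1053_at\t97.2\t104.5\n'     >> raw.GSEXXXX_expression.txt
```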
6.2 Subject Sample Mapping File
The subject sample mapping file is a tab-delimited text file. It holds information on each of the samples in the raw data files. Table 5 describes the columns that make up this file.
Table 5: Description of columns in subject sample mapping file
|1||STUDY_ID||Identifier of the study. Remember that study IDs are always uppercase||GSEXXXX|
|2||SITE_ID||Identifier of the site where the samples were acquired|
|3||SUBJECT_ID||Identifier of the subject. This should correspond to the identifier of a subject in the clinical data file|
|4||SAMPLE_ID||Identifier of the sample. This should correspond to a column name in the raw intensity data file||SAMXXXX|
|5||PLATFORM||The platform identifier||GPL201|
|6||TISSUETYPE||Tissue type from which the sample was collected||Synovial Tissue|
|7||ATTR1||Custom attribute 1. The value of this field can be used in the category code to make it part of the ontology|
|8||ATTR2||Custom attribute 2. The value of this field can be used in the category code to make it part of the ontology|
|9||CATEGORY_CD||Category code where this gene expression data will be inserted in the ontology. You can use the keywords PLATFORM, TISSUETYPE, ATTR1 and ATTR2 here as well: PLATFORM will be replaced by the description of the platform in the PLATFORM column; the other keywords will be replaced by their values in the corresponding columns. Use + to separate ontology levels and _ for spaces.||Biomarker Data+Gene Expression+PLATFORM+TISSUETYPE|
|10||SOURCE_CD||Identifier of data source||GEO|
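To make this concrete, a hypothetical two-sample mapping file following Table 5 could be written as below. All IDs are placeholders, and a header row is assumed here (as with the other tab-delimited inputs); SITE_ID, ATTR1 and ATTR2 are left empty. With the CATEGORY_CD shown, the samples end up under the platform description and tissue type in the ontology.

```shell
# Hypothetical subject sample mapping file; columns as described in Table 5.
{
  printf 'STUDY_ID\tSITE_ID\tSUBJECT_ID\tSAMPLE_ID\tPLATFORM\tTISSUETYPE\tATTR1\tATTR2\tCATEGORY_CD\tSOURCE_CD\n'
  printf 'GSEXXXX\t\tSUBJ01\tSAM0001\tGPL201\tSynovial Tissue\t\t\tBiomarker Data+Gene Expression+PLATFORM+TISSUETYPE\tGEO\n'
  printf 'GSEXXXX\t\tSUBJ02\tSAM0002\tGPL201\tSynovial Tissue\t\t\tBiomarker Data+Gene Expression+PLATFORM+TISSUETYPE\tGEO\n'
} > Subject_Sample_Mapping.txt
```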
6.3 Sample Remapping File
The sample remapping file is optional; it is a tab-delimited text file. You can use this file to rename specific sample IDs in particular input files to something else. Table 6 describes the columns that make up this file.
Table 6: Description of columns in the sample remapping file
|1||REMAP_DATA_FILENAME||Name of the file that holds the data to be remapped||raw.GSEXXXX_expression.txt|
|2||CURRENT_SAMPLE_ID||The sample ID to be renamed|
|3||NEW_SAMPLE_ID||The new name of the sample ID|
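For example, a remapping file that renames one sample in the raw data file could look like this. A header row is assumed, and the third column header used below is an assumption based on the pattern of the other columns; the IDs are placeholders.

```shell
# Hypothetical sample remapping file; columns as described in Table 6.
printf 'REMAP_DATA_FILENAME\tCURRENT_SAMPLE_ID\tNEW_SAMPLE_ID\n' > Sample_Remapping.txt
printf 'raw.GSEXXXX_expression.txt\tGSM000001\tSAM0001\n'        >> Sample_Remapping.txt
```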
6.4 Upload Script
Make sure you have prepared your data in the formats listed above. The Kettle script for gene expression data is located at transmart-ETL/Kettle/postgres/Kettle-ETL/load_gene_expression_data.kjb. It is important to know that tranSMART stores three projections of the data: the raw data, the log-transformed data and the z-score. When exporting data you can choose which projection you want. It is therefore important to set the DATA_TYPE parameter (described below) to the correct value, since it determines how each projection is computed from your input.
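As a rough illustration of how the log and z-score projections can be derived from raw intensities, the sketch below log2-transforms the values for a single probe and computes a z-score trimmed to [-2.5, 2.5]. This is an illustration only; it assumes per-probe z-scores across samples, and tranSMART's exact computation may differ.

```shell
# Illustration: log2 values and trimmed z-scores for one probe across three samples.
ZOUT=$(echo "215.3 198.7 240.1" | awk '{
  n = NF
  for (i = 1; i <= n; i++) { lg[i] = log($i) / log(2); sum += lg[i] }
  mean = sum / n
  for (i = 1; i <= n; i++) ss += (lg[i] - mean) ^ 2
  sd = sqrt(ss / (n - 1))                    # sample standard deviation of the log values
  for (i = 1; i <= n; i++) {
    z = (lg[i] - mean) / sd
    if (z >  2.5) z =  2.5                   # trim to the interval [-2.5, 2.5]
    if (z < -2.5) z = -2.5
    printf "raw=%s log2=%.3f zscore=%.3f\n", $i, lg[i], z
  }
}')
printf '%s\n' "$ZOUT"
```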
Table 7 describes the parameters that can be passed to the load_gene_expression_data.kjb script. Use either Kitchen or Spoon as described above to load the script and set its parameters.
Table 7: Gene expression data upload parameters
|BULK_LOADER_PATH||x||Path to your psql executable, e.g. C:\Program Files\PostgreSQL\bin\psql.exe or /usr/bin/psql. Only required if LOAD_TYPE is set to L.|
|DATA_FILE_PREFIX||x||Prefix for the filenames of raw gene expression data files|
|DATA_LOCATION||x||Full path to the input files|
|DATA_TYPE||Type of the input data. Can be R, L or T.|
R: The data are raw intensity values; no transformation has occurred. In this case the log (base LOG_BASE) and the z-score are calculated. The z-score is calculated from the log values and trimmed to the interval [-2.5, 2.5].
L: The data have been log-transformed. They are uploaded to the log projection and a z-score is calculated from this, trimmed to the interval [-2.5, 2.5]. The raw intensity is derived using LOG_BASE.
T: The data will be uploaded to the log projection with no additional transformation. The data are also loaded to the z-score projection, but there they are trimmed to the interval [-2.5, 2.5].
|FilePivot_LOCATION||x||Full path to directory where FilePivot.jar is located. This file should be located in transmart-ETL/Kettle/postgres/|
|JAVA_LOCATION||Full path to the directory where java (or java.exe) is located.|
|LOAD_TYPE||How the data are loaded into the database. Can be I, L or F.|
I: Load the data by generating an insert statement for each row. This is the preferred method for loading to the database.
L: Load the data through the Postgres bulk loader. This can be more efficient than the I option, although in most cases the performance difference is negligible. You need to set BULK_LOADER_PATH as well to use this option.
F: Instead of loading to the database, write to a file. The data will be written to a file called <STUDY_ID>_clinical_data, where <STUDY_ID> is the study ID you configured with the STUDY_ID parameter.
|LOG_BASE||2||The log base to use when log transforming raw data. Also used to derive the raw intensity value from already log transformed data.|
|MAP_FILENAME||x||Filename of the subject-to-sample mapping file|
|SAMPLE_REMAP_FILENAME||Filename of the sample remapping file. Omit this parameter or set it to NOSAMPLEREMAP if there is no sample remapping.|
|SAMPLE_SUFFIX||If all sample IDs have a common suffix that you wish to remove, you can specify that suffix here. Note: the suffix will not be removed if a SAMPLE_REMAP_FILENAME is specified.|
|SECURITY_REQUIRED||Whether access to this study is restricted. Can be N or Y.|
N: Indicates this is public data. Any user logged in to tranSMART can view this data.
Y: A tranSMART administrator needs to give explicit access to each user who requires access to this data.
|SORT_DIR||x||Full path to a directory where temporary files can be stored for sorting|
|SOURCE_CD||STD||Only samples with a matching SOURCE_CD column in the subject sample mapping file will be imported. Samples with an empty (null) SOURCE_CD will be imported regardless of the value of this parameter|
|STUDY_ID||x||Unique identifier of the study. This will be transformed to all caps before being used|
|TOP_NODE||x||The string that defines the node under which this data will be inserted. For example: \Public Studies\Breast_Cancer_Kao_GSE20685\Gene Expression|
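Putting these parameters together, a Kitchen invocation might look like the following sketch. All paths, the study ID and the parameter values are placeholders for your own setup; the command is echoed rather than executed so you can inspect it first.

```shell
# Hypothetical Kitchen invocation; adjust every path and parameter value.
CMD="/opt/data-integration/kitchen.sh \
 -file=/opt/transmart-ETL/Kettle/postgres/Kettle-ETL/load_gene_expression_data.kjb \
 -param:STUDY_ID=GSEXXXX \
 -param:DATA_LOCATION=/data/GSEXXXX/Expression_data \
 -param:DATA_FILE_PREFIX=raw.GSEXXXX \
 -param:MAP_FILENAME=Subject_Sample_Mapping.txt \
 -param:DATA_TYPE=R \
 -param:LOAD_TYPE=I \
 -param:FilePivot_LOCATION=/opt/transmart-ETL/Kettle/postgres \
 -param:SORT_DIR=/tmp \
 -param:TOP_NODE='\\Public Studies\\GSEXXXX\\Gene Expression'"
printf '%s\n' "$CMD"
```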
In this example we will add gene expression data to the mock study. If you have not uploaded the mock study yet, please refer to Section 5.6. The gene expression data can be found in the Expression data directory included with this document. The file raw.GSEXXXX_expression.txt contains the raw expression values. The subject sample mapping is defined in Subject_Sample_Mapping.txt.
If you are on a Linux system, open the load_gene_expression.sh file. On a Windows system, open the load_gene_expression.bat file. Change the DATA_INTEGRATION_PATH and TRANSMART_ETL_PATH variables to the location of the Data Integration software suite and the tranSMART-ETL repository respectively. In contrast to the clinical data upload script, we cannot refer to the data location by a relative path: the data location is also passed to the file pivoter, so it needs to be an absolute path. Complete the file by defining the DATA_LOCATION parameter. If the Java executable cannot be found automatically by your system, you must specify its location with the JAVA_LOCATION parameter. In a terminal, navigate to the Expression data directory and start the script. On Linux, type bash load_gene_expression.sh; on Windows, type load_gene_expression.bat.
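For reference, the variable section of an edited load_gene_expression.sh might look like the sketch below. Every path is a placeholder for your own installation.

```shell
# Placeholder paths -- point these at your own installations.
DATA_INTEGRATION_PATH=/opt/data-integration
TRANSMART_ETL_PATH=/opt/transmart-ETL
DATA_LOCATION=/data/GSEXXXX/Expression_data    # must be an absolute path
# JAVA_LOCATION=/usr/lib/jvm/default/bin       # only if java cannot be found automatically
```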
Alternatively, you can use Spoon as described in section 4: hit the run button and copy the parameter values from the script into the appropriate input fields in Spoon. Just like when uploading clinical data, Kitchen will produce a large amount of output. Check the last few lines of output. If everything went well, these lines should look like this:
INFO 07-05 08:56:52,708 - Kitchen - Finished!
INFO 07-05 08:56:52,708 - Kitchen - Start=2015/05/07 08:56:45.339,
INFO 07-05 08:56:52,708 - Kitchen - Processing ended after 7 seconds.
Kitchen will also tell you if something went wrong:
INFO 07-05 08:56:19,419 - Kitchen - Finished!
ERROR 07-05 08:56:19,419 - Kitchen - Finished with errors
INFO 07-05 08:56:19,419 - Kitchen - Start=2015/05/07 08:56:18.982,
INFO 07-05 08:56:19,419 - Kitchen - Processing ended after 0 seconds.
Notice the extra line warning you about errors. If there are errors, double-check the script parameters and variables. Log in to your tranSMART instance to admire your work! Open the study node. You should now see an extra node compared to the ontology shown in Figure 9. The gene expression data has been added to the study and linked to the existing subjects thanks to the subject sample mapping file. Your ontology should look like the one shown in Figure 11.
Figure 11: Ontology of the example study, after uploading gene expression data