The most common data type for tranSMART high-dimensional data is mRNA expression, usually from microarrays. These are the oldest data-loading procedures, and loaders for other data types have been derived from them.
Data values can be loaded in two forms: as raw data values or as log-intensity values. In either case the other form is calculated at load time, and Z-scores are also calculated. Data values can also be loaded directly as Z-scores, but this is not recommended: it leaves the raw and log-intensity values empty, which will break various features unless the user is aware of the missing data.
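The relationship between the three forms can be sketched as follows. This is an illustrative Python sketch assuming a log2 transform and per-probe standardization; it is not the exact formula used by the stored procedure, and the raw values are invented:

```python
import math
from statistics import mean, stdev

# Hypothetical raw intensities for one probe across four samples.
raw = [120.0, 250.0, 80.0, 410.0]

# Log-intensity form: log2 of each raw value.
log2_vals = [math.log2(v) for v in raw]

# Z-score form: per-probe standardization of the log2 values.
mu, sigma = mean(log2_vals), stdev(log2_vals)
zscores = [(v - mu) / sigma for v in log2_vals]
```

Loading only Z-scores means neither `raw` nor `log2_vals` can be reconstructed, which is why that route is discouraged.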
Data values are linked to a unique identifier, usually a probe ID from a microarray. A platform definition must be loaded first to map these IDs to gene identifiers; the gene locus name can then be used and displayed in analysis functions.
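Conceptually, the platform definition behaves like a probe-to-gene lookup. The probe IDs, gene symbols, and value below are illustrative examples, not entries from any particular platform file:

```python
# Hypothetical platform annotation: probe ID -> gene symbol, as loaded
# from a platform definition file (IDs and symbols are illustrative).
annotation = {"1007_s_at": "DDR1", "1053_at": "RPA2"}

# A measured value is keyed by probe ID; analysis functions can then
# display the mapped gene symbol instead of the bare probe ID.
probe_id, value = "1007_s_at", 7.83
gene = annotation.get(probe_id, probe_id)  # fall back to the probe ID if unmapped
```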
See The JnJ ETL: Guide for Gene Expression Data for a description of the data files used by Kettle loading. These files can be placed in an appropriate directory structure for use by other data-loading tools.
Using the make_expression_<STUDY> loader, the script launched is samples/(postgres|oracle)/load_expression.sh.
This script checks the input values; in particular, STUDY_ID must match the first column of the mapping file.
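That check can be sketched as follows. The tab-separated file layout and the `study_id_matches` helper are illustrative assumptions, not the actual shell code:

```python
import csv
import io

# Hypothetical tab-separated subject-sample mapping content; the first
# column of every row carries the study ID.
mapping_text = (
    "GSE1234\tSITE1\tSUBJ1\tSAMPLE1\n"
    "GSE1234\tSITE1\tSUBJ2\tSAMPLE2\n"
)

def study_id_matches(study_id, text):
    # Every first-column value must equal the supplied STUDY_ID.
    rows = csv.reader(io.StringIO(text), delimiter="\t")
    return all(row[0] == study_id for row in rows if row)
```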
Once validated, the script launches the Kettle job load_gene_expression_data.kjb. This job calls another job, load_all_gene_expression_files_for_study.kjb.
It also calls run_i2b2_process_mrna_data to run the stored procedure that loads the data.
Other Kettle transform steps validate the inputs and write to the audit log table.
The stored procedure tm_cz.i2b2_process_mrna_data performs the data loading; it is used by Kettle and by other ETL tools.
When loading raw data (DATA_TYPE=R) into the temporary working table tm_wz.wt_subject_mrna_probeset, any value not greater than zero is ignored. For other data types this check is applied later: the raw value is retained, and the log-intensity calculation is adjusted.
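A minimal sketch of the raw-data (DATA_TYPE=R) filter, using invented row data rather than the actual working-table contents:

```python
import math

# Hypothetical (probe, sample, raw_value) rows headed for the working
# table tm_wz.wt_subject_mrna_probeset.
rows = [("p1", "s1", 120.0), ("p1", "s2", 0.0), ("p1", "s3", -5.0), ("p2", "s1", 64.0)]

# With DATA_TYPE=R, values not greater than zero are dropped up front,
# so the log2 transform is always defined for what remains.
kept = [(p, s, v) for (p, s, v) in rows if v > 0]
log2_rows = [(p, s, v, math.log2(v)) for (p, s, v) in kept]
```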
This procedure validates the data in temporary tables and loads it into the database.
Data is stored in deapp.de_subject_microarray_data, which is partitioned. In Postgres, the partition number is written to the audit log during loading. It can be found in a message in tm_cz.cz_audit_log (with JOB_ID equal to the job ID reported by the loading job), and in the log file created in the expression directory, in the form:
Create partition deapp.de_subject_microarray_data_89
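The partition number can be recovered from a message of the form documented above with a simple pattern match. This is an illustrative Python snippet, not part of the ETL code:

```python
import re

# An audit-log message of the documented form; the trailing number is
# the partition created for this load.
message = "Create partition deapp.de_subject_microarray_data_89"
match = re.search(r"de_subject_microarray_data_(\d+)$", message)
partition = int(match.group(1))  # assumes the message matched the pattern
```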
Subject/sample relations are in deapp.de_subject_sample_mapping
Patients (if not already defined by clinical data for the study) are added
Observations are created for the study and platform.
See i2b2-tranSMART Foundation Curated Data for over 200 studies with expression data from GEO, TCGA, and other sources; these are ready to load with transmart-data.