Gene Expression Data Loading Instructions
High Dimensional (HDD) Gene Expression Data in tranSMART is historically defined as data generated by measuring gene expression levels using microarray technology. Data generated by other technologies can also be loaded as "Expression", as long as each detected entity can be mapped to a gene and each probe/reagent used for detection is unique (one entity per probe; multiple probes for the same entity are allowed).
Gene Expression Data Layout (sample)
To load Expression Data, you should have the following files in your study directory:
<STUDY_NAME>_<STUDY_ID>
  ...
  \-ExpressionData
    \-<STUDY_NAME>_<STUDY_ID>_Subject_Sample_Mapping_File.txt
    \-<STUDY_NAME>_<STUDY_ID>_Gene_Expression_Data_<DATA_TYPE>.txt
    \-<GENE_PLATFORM_NAME>.txt
  ...
Mapping file (sample)
The file contains the mapping between samples and their corresponding subjects. It also contains additional information about the samples, such as tissue type and optional attributes (attr1, attr2). Finally, it has category_cd, which is used to determine the path to the sample-related expression data.
The mapping file should contain 9 columns (study_id, site_id, subject_id, sample_cd, platform, tissuetype, attr1, attr2, category_cd).
- study_id - study identifier (should be same for all samples)
- site_id - sample's site. Optional
- subject_id - subject identifier
- sample_cd - sample code, should match the corresponding record in the data file
- platform - gene platform ID. It must be UPPERCASE and must not start with "GSE".
- tissuetype - tissue type (e.g. Blood)
- attr1 - custom attribute 1. Optional
- attr2 - custom attribute 2. Optional
- category_cd - multi-level category, with levels separated by the '+' symbol, used to build the path in the tree
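The column rules above can be checked before loading. The following is a minimal sketch, not part of the loader itself. It assumes the mapping file is tab-separated, has a header row with the nine column names, and that only the columns marked Optional may be empty; the function name `validate_mapping` is hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = [
    "study_id", "site_id", "subject_id", "sample_cd", "platform",
    "tissuetype", "attr1", "attr2", "category_cd",
]
# Columns the description above marks as "Optional".
OPTIONAL_COLUMNS = {"site_id", "attr1", "attr2"}


def validate_mapping(text, delimiter="\t"):
    """Return a list of human-readable problems found in mapping file text.

    Assumes (not confirmed by the loader documentation) a tab-separated
    file whose first row is a header with the nine column names.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return ["file is empty"]
    header = [name.strip().lower() for name in rows[0]]
    if header != EXPECTED_COLUMNS:
        return [f"unexpected header: {rows[0]}"]

    problems = []
    study_ids = set()
    for line_no, values in enumerate(rows[1:], start=2):
        if len(values) != len(EXPECTED_COLUMNS):
            problems.append(
                f"line {line_no}: expected 9 columns, got {len(values)}"
            )
            continue
        row = dict(zip(EXPECTED_COLUMNS, (v.strip() for v in values)))
        study_ids.add(row["study_id"])
        # Every non-optional column must have a value.
        for name in EXPECTED_COLUMNS:
            if name not in OPTIONAL_COLUMNS and not row[name]:
                problems.append(f"line {line_no}: {name} is empty")
        # Platform ID must be uppercase and must not start with "GSE".
        platform = row["platform"]
        if platform != platform.upper() or platform.startswith("GSE"):
            problems.append(f"line {line_no}: bad platform ID {platform!r}")
    # study_id should be the same for all samples.
    if len(study_ids) > 1:
        problems.append(f"more than one study_id: {sorted(study_ids)}")
    return problems
```

An empty list means no problems were found under these assumptions; it does not guarantee that the load will succeed.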
Usually category_cd is converted to a path as is. For example, category_cd=Expression Data+GPL100+Blood produces the following path in the tree under the study root: //Expression Data/GPL100/Blood/. You can also use special keywords as tokens, which are automatically replaced with the corresponding values:
- PLATFORM - this token is replaced with the Platform Title from the Platform File for the Platform ID indicated in the platform column
- TISSUETYPE - value from the tissuetype column
- ATTR1 - value from the attr1 column
- ATTR2 - value from the attr2 column
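The token replacement described above can be sketched as follows. This is an illustration under assumptions, not the loader's actual code: it assumes each token must occupy a whole level of category_cd, that a level resolving to an empty optional attribute is skipped, and that the platform ID is used when no Platform Title is known. The function name `build_tree_path` and the `platform_titles` lookup are hypothetical.

```python
def build_tree_path(category_cd, row, platform_titles):
    """Turn a category_cd into a '/'-separated path under the study root.

    category_cd: levels separated by '+', e.g. 'Expression Data+PLATFORM'
    row: dict with one mapping file record (column name -> value)
    platform_titles: dict mapping platform ID -> Platform Title
        (hypothetical lookup built from the platform files)
    """
    replacements = {
        # Assumption: fall back to the platform ID if the title is unknown.
        "PLATFORM": platform_titles.get(row["platform"], row["platform"]),
        "TISSUETYPE": row["tissuetype"],
        "ATTR1": row["attr1"],
        "ATTR2": row["attr2"],
    }
    levels = []
    for level in category_cd.split("+"):
        level = level.strip()
        # Levels that are not tokens are kept as is.
        value = replacements.get(level, level)
        # Assumption: a token that resolves to an empty value is dropped.
        if value:
            levels.append(value)
    return "/".join(levels)
```

For example, with platform=GPL100, tissuetype=Blood, and a platform title of "Test GEX Platform", the category_cd `Expression Data+PLATFORM+TISSUETYPE` resolves to `Expression Data/Test GEX Platform/Blood` in this sketch.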
Data file (sample)
The file contains gene expression data. The meaning of the values is defined by the last letter of the data file name.
The first column contains probes. All other columns contain values for the corresponding samples (defined in the header).
The last symbol in the data file name (before the extension) is one of the following letters:
R - raw data. Values are raw data, which are transformed to calculate the log2 value and z-score.
L - log2 data. Values are log2 data; raw values are restored from them and the z-score is calculated.
Z - z-score data. Values are z-scores. A value is written to z-score without modification if it is in the range (-2.5; 2.5); otherwise it is truncated to this range.
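The three data types can be illustrated with a short sketch. It is not the loader's implementation: it assumes the z-score is computed per probe across samples from the log2 values using the population standard deviation, and that computed z-scores are clamped to the same range as loaded ones. The exact formula used by the loader is not stated here. All function names are hypothetical.

```python
import math
import os
import statistics

Z_LIMIT = 2.5  # z-scores are truncated to the range (-2.5; 2.5)


def data_type_from_filename(path):
    """Return 'R', 'L' or 'Z' from the last symbol before the extension."""
    stem = os.path.splitext(os.path.basename(path))[0]
    letter = stem[-1:]
    if letter not in ("R", "L", "Z"):
        raise ValueError(f"unknown data type letter in {path!r}")
    return letter


def clamp(z):
    """Truncate a z-score to [-Z_LIMIT, Z_LIMIT]."""
    return max(-Z_LIMIT, min(Z_LIMIT, z))


def zscores(log2_values):
    """Assumption: z-score per probe across samples, population stdev."""
    mean = statistics.mean(log2_values)
    spread = statistics.pstdev(log2_values)
    if spread == 0:
        return [0.0] * len(log2_values)
    return [clamp((v - mean) / spread) for v in log2_values]


def transform_probe(values, data_type):
    """Return (raw, log2, zscore) lists for one probe's row of values."""
    if data_type == "R":
        raw = list(values)
        # Raw values must be positive for log2 to be defined.
        log2 = [math.log2(v) for v in raw]
        return raw, log2, zscores(log2)
    if data_type == "L":
        log2 = list(values)
        raw = [2.0 ** v for v in log2]  # restore raw values
        return raw, log2, zscores(log2)
    if data_type == "Z":
        # Only the z-score is known; it is written as is, truncated to range.
        return None, None, [clamp(v) for v in values]
    raise ValueError(f"unknown data type: {data_type!r}")
```

Note that raw values of zero or below cannot be log2-transformed; how the loader treats such values is outside this sketch.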
Platform file (sample)
GEO platform files for gene array data can be downloaded and used directly in the format provided. A custom platform file can also be created. The platform file should start with three metadata lines containing the platform ID, name, and species. These lines must come at the beginning of the file, before the table header, and each must start with the ‘#’ symbol. Example:
#PLATFORM_TITLE: Test GEX Platform
#SPECIES: Homo Sapiens
|ID||GENE SYMBOL||ENTREZ_GENE_ID||GENE TITLE|
|1007_s_at||DDR1||780||discoidin domain receptor tyrosine kinase 1|
|1053_at||RFC2||5982||replication factor C (activator 1) 2, 40kDa|
|117_at||HSPA6||3310||heat shock 70kDa protein 6 (HSP70B')|
|121_at||PAX8||7849||paired box 8|
|1255_g_at||GUCA1A||2978||guanylate cyclase activator 1A (retina)|
- NOTE: ID and ENTREZ_GENE_ID are mandatory. Other columns are optional. GENE SYMBOL is loaded but then updated from the Dictionary using ENTREZ_GENE_ID. If GENE SYMBOL does not match ENTREZ_GENE_ID, it is replaced with the correct one.
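A custom platform file in the layout above could be read with a sketch like the one below. It assumes the table is tab-separated and that metadata lines have the form `#KEY: value`; both are assumptions for illustration, and the function name `parse_platform_file` is hypothetical. It only checks that the mandatory ID and ENTREZ_GENE_ID columns are present.

```python
import csv
import io

REQUIRED_COLUMNS = {"ID", "ENTREZ_GENE_ID"}


def parse_platform_file(text, delimiter="\t"):
    """Return (metadata, rows) parsed from the text of a platform file.

    metadata: dict built from the leading '#KEY: value' lines
    rows: list of dicts, one per probe, keyed by the table header
    """
    metadata = {}
    table_lines = []
    for line in text.splitlines():
        # Metadata lines are only recognised before the table header.
        if not table_lines and line.startswith("#"):
            key, _, value = line[1:].partition(":")
            metadata[key.strip()] = value.strip()
        elif line.strip():
            table_lines.append(line)

    reader = csv.DictReader(
        io.StringIO("\n".join(table_lines)),
        delimiter=delimiter,
        quoting=csv.QUOTE_NONE,  # gene titles may contain quote characters
    )
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing mandatory columns: {sorted(missing)}")
    return metadata, list(reader)
```

With the example file shown earlier, `metadata["SPECIES"]` would be "Homo Sapiens" and the first row's ENTREZ_GENE_ID would be the string "780".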