Clinical Data Loading Instructions
Please note that tMDataLoader is Study ID case-insensitive. Study IDs ABCEFD, ABcDeFD, abcdefd, etc. are treated as the same Study ID.
tMDataLoader will load multiple categorical values for the same variable for the same subject ID (e.g. Subject1 - Diagnosis - Osteoarthritis; Subject1 - Diagnosis - COPD). Multiple numerical values for the same variable for the same subject are not allowed and will generate error.
Before uploading, the studies need to be placed in the corresponding directory on the server, that you specified in the tMDataLoader configuration file. It should have the structure following the example below. Nested folders are supported.
The first level of the directory defines the top category in the Dataset Explorer tree. The last level (before Clinical/Expression folders) defines the study name that will be used in the tree. All folders in between define intermediate folders in the Dataset Explorer tree.
To load Clinical Data you should have the following folders and files in your study directory:
<STUDY_NAME>_<STUDY_ID> ... \-ClinicalData (or ClinicalDataToUpload) \-<STUDY_NAME>_<STUDY_ID>_Mapping_File.txt \-<CLINICAL_DATA_FILE1>.txt ... \-<CLINICAL_DATA_FILEN>.txt ...
Clinical Data Layout (sample)
Mapping file (sample)
The mapping file should contain 5 required columns (filename, category_cd, col_nbr, data_label, data_label_source) and may contain special optional "baseline" column for "Timestamp" data loading and variable_type column for Timepoint and Timestamp data loading. variable_type and additional validation_rules columns are also used for data validation.
- column 1: filename - references to data file
- column 2: category_cd - multi-level variable category, separated by '+' symbol
- column 3: col_nbr - column number in the data file that contains values for the variable being loaded
- column 4: data_label - name of the variable, optional, could be empty
NOTE: There are 3 controlled terminology data_labels required by the tranSMART Summary Statistics functionality
- Age or AGE or Age (AGE)
- Sex or SEX or Sex (SEX)
- Race or RACE or Race (RACE)
If different terminology used or these labels are included into category_cd instead of data_label column, Summary Statistics for age, race and sex for the selected subsets will not be generated automatically.
If there is more than one "Age" Data_Label anywhere in the study, higher age value will be used for Summary Statistics.
- column 5: data_label_source - in conjunction with special value (
\) in data_label may be used to form dynamic variable name
- column 6: baseline (special column used for Timestamp data load)
- column 7: variable_type - type of variable (Numerical, Text, Date, Categorical, Timepoint, Timestamp) (see "Summary Statistic" and Serial data loading section bellow)
- column 8: validation_rules - list of variable specifications separated by
;(see "Summary Statistic" bellow)
Usually it contains just a name of variable, but it also can hold some special values:
- STUDY_ID - the marker for column with study identifier, you should use same study ID for all clinical data files. If the STUDY_ID is missing from the mapping file, the data from the affected data file is not loaded and no error is generated.
- SUBJ_ID - the marker for column with subject identifier
- VISIT_NAME - the marker for column with visit name. Optional. If not specified, then visit name is ignored. If VISIT_NAME is used, all SUBJ_ID must have a value in this column (e.g. Screening, Visit 1, Follow Up, etc.)
- OMIT - may occurs multiple times, used to mark a column as omitted (data is not loaded from this column)
NOTE: You can also omit columns from loading by not including them in the mapping file. tMDataLoader allows to skip columns, list column in any order and load data from the same column multiple times using different category_cd. If data from the same data column is loaded more then once, tMDataLoader will give a warning "... column X is already mapped...".
Category_cd contains path to variable separated by
+. Target variable path will be combined from the study path,
data_value (for non-numeric variables) and visit_name. Empty values will be omitted from path.
Effectively result path is concatenation of the followings:
I.e., if study loaded to
/Test Studies/Test Study_GSE0/,
data_label = 'Sex',
data_value = 'Female',
visit_name = 'v1',
category_cd = 'Clinical Data+Demographics', then result path will be
'/Test Studies/Test Study_GSE0/Sex/Female/v1/'.
You can change rules for
If you enable
visitNameFirst option either in config or with
--visit-name-first command line switcher, then
DATAVALUE will be reordered:
STUDY_PATH+CATEGORY_CD+VISITNAME+DATALABEL+DATAVALUE. For the sample above you will get
'/Test Studies/Test Study_GSE0/Sex/v1/Female/'.
If you specify
DATAVALUE in category_cd, then they will not be added to the end of path. For example, it may be useful if you want to place
DATALABEL, then you can just use the following value for category_cd:
'Clinical Data+Demographics+VISITNAME+DATALABEL+DATAVALUE' or just:
Both variants will produce the same variable path:
'/Test Studies/Test Study_GSE0/v1/Sex/Female/'.
If you want to use
+ in your category_cd, then read '
+ special handling' section.
For advanced path building using multiple variables read 'Using tags' section.
To prevent auto-expanding
VISITNAME read 'Using terminator symbol' section.
NOTE: category_cd only affects the variable's path, not the variable's value. If you change DATAVALUE location in path it will be still associated with the the same variable value.
By default, if subject has only one visit, then it will be removed from variable paths to avoid unnecessary nesting, but in some cases it is an undesirable behavior. In this case you can use either
always-set-visit-name command line switcher or
alwaysSetVisitName config option. It will prevent visit name removing even if it is just one.
Some of symbols will be pre-processed while loading:
|in data value replaced with
-if an occurrence in the middle, removed otherwise.
|in data label replaced with
)or empty '()' in data value removed
%in data label, data value and variable path (category_cd) replaced with Pct
+in both data label and data value replaced with
' and '
_in variable path (category_cd) and data_label replaced with
(plus)in data label, data value and variable path replaced with
+ in category_cd used as separator for categories and pre-processed by default in data values and data labels it has special placeholder
(plus) to keep it as plus sign.
I.e. for category_cd=
Clinical Data+A(plus)B, data_label=
A+B(plus)C following variable path will be produced:
/Clinical Data/A+B/D and E+F/A and B+C/
In some cases may be useful to use other variables for variable path. A typical sample of such usage is the
visit name variable. It comes from other column, but participate in variable path. This behavior may be extended to as much variables as needed using tags. Tag is a reference to other variable with a syntax
For example, if you want to group
Sex variable by
Race, then you can use following
For Male Asian subject we will get the following variable path:
If you want to fix category_cd and disable auto-appending DATALABEL, DATAVALUE and VISITNAME to end of the variable path you have option to use terminator symbol. It looks like
+$ at the end of
For example, to prevent adding
VISITNAME you can write
Clinical Data+DATALABEL+DATAVALUE+$ and it will produce variable path with just DATALABEL and DATAVALUE.
+$ will be removed from the variable path.
If we have
data_label=Sex, 'data_value'=Female, then you will get:
Another example, if we want only keep DATAVALUE in path, then for
Clinical Data+DATAVALUE+$ and the same data as in previous sample you will get:
If you upload Clinical Data for same study again, then, by default, old previously loaded variables will be removed and new variables will be loaded instead. But you can set merge mode in mapping file to change this behavior.
It can be done in head lines comments (see samples here):
The following modes are available:
- REPLACE - default behavior, no head line comment needed. If an existing study was not deleted first, it will be deleted by the ETL and replaced with the new copy of the same study being loaded (if same study ID, name, path). An attempt to load a new copy of the study with the same ID into a different path will generate an error.
- UPDATE - all clinical data for patients in the existing, previously loaded study will be removed and new data for these patients will be loaded. If a variable is not included in the UPDATE data set, all values for this variable for patients included in the UPDATE data set will be removed. Data for subjects not included in the UPDATE data set will not be affected.
Note: most of the time you would want to use UPDATE_VARIABLE. Just UPDATE mode helpful when you want to replace all or a lot of data for a subset of subjects.
- UPDATE_VARIABLE - old values for variables for patients in the "UPDATE_VARIABLE" data set are removed and replaced with the new ones. Other variables for these patients are not affected.
- APPEND - New variables (not present in the existing study) will be appended to the tree according to the APPEND data set mapping file. Categorical values for variables in the APPEND data set that are already present in the existing study will not be affected (e.g Existing study Subject 1 - Sex - Female; APPEND Subject 1 - Sex - Male. Both, female and male will be associated with Subject 1). Old numerical values for existing variables that are included in the APPEND data set will be rewritten.
If variable_type is defined as either Timepoint or Timestamp, tMDataLoader will create the required record that will allow to plot loaded data on the right scale using Advanced Workflow Line Graph. Timepoint can have formats such as "1 hour 30 min", "2 h", "Day 1", "Day 01". Timestamp is used in a standard format "5/16/13 19:10". Timestamp functionality requires two columns in the data file: actual timestamp for the measurement/observation and baseline timestamp to calculate timepoint (time interval). Links to examples are below.
Test Study Link for Serial LDD data Timepoint
Test Study Link for Serial LDD data Timestamp
For each clinical data upload the ETL generates SummaryStatistic.txt report in ClinicalData directory. It contains variable metrics and requested validations results. Output depends on
validation_rules specified in mapping file. By default all variables has
'Text' type and SUBJ_ID has an special variable type:
The SummaryStatistic.txt has following columns:
- File - clinical data file
- Variable - name of a variable
- Variable Type - a variable type (as defined in the mapping file)
- N - non empty values count
- null - empty values count
- Mean - mean value. Only for
- Median - median value. Only for
- IQR - inter quartile range. Only for
- Min - minimal value. Only for
- Max - maximal value. Only for
- SD - standard deviation. Only for
- Count - different values counts (i.e. Male: 20, Female: 100). Only for
- Required - flags if the variable marked as required
- Validation rule - list of range constraints on variable. Only for
- QC missing data - lists subject ids (
IDvariable's values) where value is missing for required variables
- QC data range - lists subject ids grouped by violated constraints (if any specified in Validation Rule)
- ID - special variable type for SUBJ_ID. Used to identify records with validation errors
- Text - default variable type. No special behavior defined
- Date - semantic variable type. No special behavior defined. In future may be used in validation rules
- Numerical - variable type for numeric variables. Different metrics (like Mean, SD, IQR) calculated for these variables. Range validation rules may be applied.
- Categorical - variable type for categorical variables.
Countcolumn contains information about values distribution
List of validation rules should contains in validation_rules column in the mapping file separated by
;. Validation rule should be from following list (
Y should be substituted with numbers). All rules only applies to
Numerical variables and rules violations reported in
QC data range if no otherwise specified:
- Required - may be applied to any type of variable. Validates if variable value present. Subjects with missing values for the variable will be reported in 'QC missing data' column in the SummaryStatistic.txt.
X, Greater than
X- variable's value should be greater than
X, Greater than or equal to
X- variable's value should be greater than or equal to
X, Lesser than
X- variable's value should be lesser than
X, Lesser than or equal to
X- variable's value should be lesser than or equal to
Y- variable's value should be between
Example: >10; <=60; Between 50 to 100
Any rule can be specialized by additional condition on related variable using
when .. is keyword. Validation rule should be followed by ', when "
condition_on_related_variable'. The rule should be applied only to variable's values when condition on related variable for same subject satisfied.
condition_on_related_variable should be from following list (
S should be substituted with a string):
- blank - related variable value should be blank
- present - related variable value should present
- equal to "
S" - related variable value should be equal to
- not equal to "
S" - related variable value should not be equal to
- any range condition from previous list
Example: >10, when "Age">=90; <=50, when "Sex" is equal to "Female"; 10-20, when "Sex" is present; 20-30, when "Sex" is blank
For Postgres (in wrk_clinical_data.sql)
- study_id character varying(25)
- site_id character varying(50)
- subject_id character varying(100)
- visit_name character varying(100)
- data_label character varying(500)
- data_value character varying(500)
- category_cd character varying(250)
- category_path character varying(1000)
For Oracle (in wrk_clinical_data.sql)
- STUDY_ID VARCHAR2(25 BYTE)
- SITE_ID VARCHAR2(50 BYTE) SUBJECT_ID VARCHAR2(100 BYTE)
- VISIT_NAME VARCHAR2(100 BYTE)
- DATA_LABEL VARCHAR2(500 BYTE)
- DATA_VALUE VARCHAR2(500 BYTE)
- CATEGORY_CD VARCHAR2(250 BYTE)
- ETL_JOB_ID NUMBER(220)
- ETL_DATE DATE
- USUBJID VARCHAR2(200 BYTE)
- CATEGORY_PATH VARCHAR2(1000 BYTE)