Loading clinical data requires at least a raw data file and a column mapping file. The word mapping and record exclusion files are optional. Below follows a description of the input files, followed by information on invoking the clinical data upload script. The filenames do not need to follow a predefined convention, you can specify each filename as a parameter for the upload script. Since all input files will be referred to relative to a single source directory, it is advisable to keep all of the input files in the same directory.
Figure 1: Conceptual representation of the ETL process for clinical data.
Raw Data File
Generate one or more files which contain the data that you wish to upload into tranSMART. Each file should be a tab-delimited text file with one header row, a column for the subject identifier and a column for each attribute. The following section describes the column mapping file, which will define the meaning of each column.
Column Mapping File
The column mapping file is a tab-delimited file with at least four columns and is used to map columns from the raw data files to concepts in the Dataset Explorer ontology tree. Table 1 describes the structure of this file. If you choose to include one or more of the optional columns listed below, they must appear as the nth column, where n is the position listed below.
Table 1: Description of columns in the column mapping file
|1||FILENAME||The name of the raw data file containing the data. If your raw data is distributed across multiple files, they can all be referenced with one column mapping file.||GSEXXXX clinical.txt|
|2||CATEGORY_CD||This describes the position of the concept in the study structure within tranSMART||Clinical Data+Alcohol Habits|
|3||COL_NBR||The column number in the data file that should be mapped||3|
The data label you wish to assign to this column of the data file. Kettle will automatically pull the column heading and use it as the data label if you leave this empty. If you would like to map the column heading to a new label, enter it here. The reserved words below instruct Kettle to perform specific actions. Reserved words must be fully capitalized to be recognized.
OMIT: Skip the column
SUBJ_ID: The data in this column is the subject ID.
SITE_ID: The ID of the site where this data was acquired.
VISIT_NAME: Use the data in this column as the visit name. A subject's attributes can be measured at multiple points in time. Each concept will be extended with the distinct values found in this column.
DATA_LABEL: Treat the data in this column as a data label for another column.
n: Maps the data as a data value of a column designated as a DATA_LABEL column.
VISIT_NAME_2: A second level of visit name. The total visit name will become VISIT_NAME\VISIT_NAME_2
SEQ_COL: If a subject's attribute was measured multiple times, this column indicates the sequence number of the measurement. The values in this column should therefore be integer and strictly positive.
UNITS: Units for the value in column ALT_DATA_LABEL.
In addition to these reserved words, there exist the following reserved concept names: Age, Race, Sex and Gender. Sex and Gender are synonyms. If a column carries this label, it will be automatically incorporated in the summary statistics page of the Dataset Explorer. In contrast to the reserved keywords, the match on these concept names is not case sensitive. The concept will appear in the ontology tree with the same capitalization as you have written it here.
Note: Each raw data file must include one column that carries the label SUBJ ID. All other reserved words are optional.
Use this column if you wish to use the data in a column as a data label for another column, or if you wish to aggregate multiple data values. Aggregate functions are applied over rows which have the same values all columns except for COL_NBR.
This field specifies the number of the column that should be used as the source for the data label. This column cannot be left blank for any row where the n is used in the data label field. In rare circumstances where multiple columns must be integrated together, use the following convention:
Append an A to the column number if you want to have the value from the other column added as a level in the ontology after the data label of the column.
Append a B to the column number if you want to have the value from the other column added as a level in the ontology before the data label of the column (for example, 4A, 6B, etc.). The default is A if not specified. If you want aggregation, set this field to MIN, MEAN or MAX for the minimum, mean or maximum respectively.
|Use this column if you wish to map the record to a controlled terminology (for example, SNOMED or MedDRA).||L-85B02|
Word Mapping File
The word mapping file is an optional file that allows a data value to be transformed into another data value. Any exact match in the specified column of the raw data file will be transformed to the new data value. The word mapping file is primarily used to map categorical values to a controlled vocabulary, and also to change unknown and null values into a value that can be displayed in tranSMART. A word mapping file must be a tab-delimited text file with four columns, which are described in Table 2.
Table 2: Description of columns in the word mapping file
|1||FILENAME||The name of the file containing the data you wish to remap.||GSEXXXX clinical.txt|
|2||COL_NBR||The column number within the raw clinical data file that should be mapped.||3|
|3||FROM_VALUE||The original data value of the record||RA|
|4||TO_VALUE||The new data value you wish to display||Rheumatoid Arthritis|
Record Exclusion File
The record exclusion file is a tab delimited text file that is used to exclude records with a specific value in a particular column of the raw data file from the upload process. If the column value for a particular line matches the exclude value, the entire line is skipped. No data for that subject will be found in the database. Table 3 describes the columns that make up this file.
Table 3: Description of columns in the record exclusion file
|1||FILENAME||The name of the file containing the data you want to filter||GSEXXXX clinical.txt|
|2||COL_NBR||The column number within the raw clinical data file that should be checked.||3|
|3||VALUE||The data value you wish to exclude||N/A|
Make sure you have prepared your data in the formats listed above. The Kettle-script for clinical data is located at transmart-ETL/Kettle/postgres/Kettle-ETL/create_clinical_data.kjb
Table 4 describes the parameters that can be passed to the create_clinical_data.kjb script. Use either Kettle or Spoon as described above to load the script and set its parameters.
Table 4: Clinical data upload parameters
|COLUMN_MAP_FILE||x||Filename of the column mapping file|
|DATA_LOCATION||x||Full path to the input files. All other filenames are relative to this path.|
Path to your psql executable. E.g. C:\Program Files\PostgreSQL\bin\psql.exe
Only used if LOAD_TYPE = L.
|HIGHLIGHT_STUDY||N||Y will cause the study name to be shown in green in the ontology|
I: load the data by generating an insert statement for each row. This is the preferred method for loading to the database.
L: load the data through the Postgres bulk loader. This can be more efficient than the I option, although in most cases the performance difference is negligible. You need to set PSQL_PATH as well to use this option.
F: instead of loading to the database, write to a file. The data will be written to a file called <STUDY_ID>_clinical_data, where <STUDY_ID> is the study ID you configured with the STUDY_ID parameter.
|SECURITY_REQUIRED||N||N: Indicates this is a public study. Any user logged in to tranSMART can view this study.
Y: A tranSMART administrator needs to give explicit access to each user who requires access to this study.
|SORT_DIR||$HOME||Full path to a directory where temporary files can be stored for sorting|
|STUDY_ID||x||Unique identifier of the study.
Note:The study ID will be transformed to uppercase during ETL. This means that all future references to the study ID must be in uppercase as well.
|TOP_NODE||x||The string that defines the top node of the ontology, including the full name of the study. For example: \Public Studies\Breast_Cancer_Kao_GSE20685\|
|WORD_MAP_FILE||x||Optional file that remaps values when reading source files.|
|RECORD_EXCLUSION_FILE||x||Optional file defining filters to exclude records.|