Loading clinical data requires at least a raw data file and a column mapping file. The word mapping and record exclusion files are optional. Below follows a description of the input files, followed by information on invoking the clinical data upload script. The filenames do not need to follow a predefined convention; you can specify each filename as a parameter of the upload script. Since all input files are referred to relative to a single source directory, it is advisable to keep all of the input files in the same directory.
Figure 8: Conceptual representation of the ETL process for clinical data.
5.1 Raw Data File
Generate one or more files which contain the data that you wish to upload into tranSMART. Each file should be a tab-delimited text file with one header row, a column for the subject identifier and a column for each attribute. The following section describes the column mapping file, which will define the meaning of each column.
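As a minimal sketch, the following shell snippet creates a hypothetical raw data file in this format. The filename, subject IDs and attribute columns (disease_state, age, DAS28) are invented for illustration; your own study will have different columns.

```shell
# Create a tab-delimited raw data file with one header row, a subject
# identifier column and one column per attribute. All names below are
# hypothetical examples, not required values.
printf 'subject_id\tdisease_state\tage\tDAS28\n' >  GSEXXXX_clinical.txt
printf 'S001\tRA\t54\t4.2\n'                     >> GSEXXXX_clinical.txt
printf 'S002\tControl\t61\t\n'                   >> GSEXXXX_clinical.txt
```

Note that the second subject has an empty DAS28 cell; missing values like this are allowed in the raw data file.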
5.2 Column Mapping File
The column mapping file is a tab-delimited file with at least four columns and is used to map columns from the raw data files to concepts in the Dataset Explorer ontology tree. Table 1 describes the structure of this file. If you choose to include one or more of the optional columns listed below, they must appear as the n-th column, where n is the position listed below.
Table 1: Description of columns in the column mapping file
|1||FILENAME||The name of the raw data file containing the data. If your raw data is distributed across multiple files, they can all be referenced from a single column mapping file.||GSEXXXX_clinical.txt|
|2||CATEGORY_CD||The position of the concept in the study structure within tranSMART.||Clinical Data+Alcohol Habits|
|3||COL_NBR||The number of the column in the raw data file that should be mapped.||3|
|4||DATA_LABEL||The data label you wish to assign to this column of the data file. If you leave this empty, Kettle will automatically use the column heading as the data label. If you would like to map the column heading to a new label, enter it here. The reserved words below instruct Kettle to perform specific actions; they must be fully capitalized to be recognized.
OMIT: Skip the column
SUBJ_ID: The data in this column is the subject ID.
SITE_ID: The ID of the site where this data was acquired.
VISIT_NAME: Use the data in this column as the visit name. A subject's attributes can be measured at multiple points in time. Each concept will be extended with the distinct values found in this column.
DATA_LABEL: Treat the data in this column as a data label for another column.
n: Map the data in this column as data values for column number n, which must be designated as a DATA_LABEL column.
VISIT_NAME_2: A second level of visit name. The total visit name will become VISIT_NAME\VISIT_NAME_2
SEQ_COL: If a subject's attribute was measured multiple times, this column indicates the sequence number of the measurement. The values in this column should therefore be strictly positive integers.
UNITS: Units for the value in column ALT_DATA_LABEL.
In addition to these reserved words, the following reserved concept names exist: Age, Race, Sex and Gender (Sex and Gender are synonyms). If a column carries one of these labels, it will automatically be incorporated in the summary statistics page of the Dataset Explorer. In contrast to the reserved words, the match on these concept names is not case sensitive; the concept will appear in the ontology tree with the same capitalization as you have written it here.
Note: Each raw data file must include one column that carries the label SUBJ_ID. All other reserved words are optional.
|5||DATA_LABEL_SOURCE||Use this column if you wish to use the data in another column as a data label for this column, or if you wish to aggregate multiple data values. Aggregate functions are applied over rows that have the same values in all columns except COL_NBR. This field specifies the number of the column that should be used as the source for the data label; it cannot be left blank for any row where n is used in the data label field. In the rare circumstances where multiple columns must be integrated together, use the following convention:
Append an A to the column number if you want the value from the other column added as a level in the ontology after the data label of the column.
Append a B to the column number if you want the value from the other column added as a level in the ontology before the data label of the column (for example, 4A, 6B). The default is A if no letter is specified. If you want aggregation, set this field to MIN, MEAN or MAX for the minimum, mean or maximum respectively.
|6||CONTROLLED_VOCAB_CD||Use this column if you wish to map the record to a controlled terminology (for example, SNOMED or MedDRA).||L-85B02|
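To make the table concrete, the following sketch writes a hypothetical column mapping file for a raw data file whose columns are subject ID, disease state, age and a DAS28 score. All filenames, category codes and labels here are invented for illustration; only the four-column tab-delimited layout and the SUBJ_ID reserved word come from the description above.

```shell
# Hypothetical column mapping file: FILENAME, CATEGORY_CD, COL_NBR, DATA_LABEL.
# Column 1 of the raw file is marked with the reserved word SUBJ_ID;
# the other columns get an ontology category and a data label.
{
  printf 'GSEXXXX_clinical.txt\tSubjects\t1\tSUBJ_ID\n'
  printf 'GSEXXXX_clinical.txt\tClinical Data\t2\tDisease State\n'
  printf 'GSEXXXX_clinical.txt\tSubjects\t3\tAge\n'
  printf 'GSEXXXX_clinical.txt\tClinical Data\t4\tDAS28\n'
} > GSEXXXX_columns.txt
```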
5.3 Word Mapping File
The word mapping file is an optional file that allows a data value to be transformed into another data value. Any exact match in the specified column of the raw data file will be transformed to the new data value. The word mapping file is primarily used to map categorical values to a controlled vocabulary, and also to change unknown and null values into a value that can be displayed in tranSMART. A word mapping file must be a tab-delimited text file with four columns, which are described in Table 2.
Table 2: Description of columns in the word mapping file
|1||FILENAME||The name of the file containing the data you wish to remap.||GSEXXXX_clinical.txt|
|2||COL_NBR||The column number within the raw clinical data file that should be mapped.||3|
|3||FROM_VALUE||The original data value of the record||RA|
|4||TO_VALUE||The new data value you wish to display||Rheumatoid Arthritis|
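A word mapping file with a single rule can be sketched as follows. The filename and column number are hypothetical; the rule itself (remapping RA to Rheumatoid Arthritis) mirrors the example values in Table 2.

```shell
# Hypothetical word mapping file: FILENAME, COL_NBR, FROM_VALUE, TO_VALUE.
# Every exact occurrence of "RA" in column 2 of the raw file will be
# replaced by "Rheumatoid Arthritis" during the upload.
printf 'GSEXXXX_clinical.txt\t2\tRA\tRheumatoid Arthritis\n' > GSEXXXX_words.txt
```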
5.4 Record Exclusion File
The record exclusion file is a tab-delimited text file that is used to exclude records with a specific value in a particular column of the raw data file from the upload process. If the column value of a line matches the exclude value, the entire line is skipped and none of its data will end up in the database. Table 3 describes the columns that make up this file.
Table 3: Description of columns in the record exclusion file
|1||FILENAME||The name of the file containing the data you want to filter.||GSEXXXX_clinical.txt|
|2||COL_NBR||The column number within the raw clinical data file that should be checked.||3|
|3||VALUE||The data value you wish to exclude||N/A|
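A record exclusion file is equally small. In this sketch the filename and column number are hypothetical; the rule drops every raw data line whose value in that column is NA, matching the example values in Table 3.

```shell
# Hypothetical record exclusion file: FILENAME, COL_NBR, VALUE.
# Any line of the raw file with the value "NA" in column 4 is skipped.
printf 'GSEXXXX_clinical.txt\t4\tNA\n' > GSEXXXX_record_exclusion.txt
```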
5.5 Upload Script
Make sure you have prepared your data in the formats listed above. The Kettle-script for clinical data is located at transmart-ETL/Kettle/postgres/Kettle-ETL/create_clinical_data.kjb
Table 4 describes the parameters that can be passed to the create_clinical_data.kjb script. Use either Kitchen or Spoon as described above to load the script and set its parameters.
Table 4: Clinical data upload parameters
|COLUMN_MAP_FILE||x||Filename of the column mapping file|
|DATA_LOCATION||x||Full path to the input files. All other filenames are relative to this path.|
|PSQL_PATH|| ||Path to your psql executable, e.g. C:\Program Files\PostgreSQL\bin\psql.exe. Only used if LOAD_TYPE = L.|
|HIGHLIGHT_STUDY||N||Y will cause the study name to be shown in green in the ontology|
|LOAD_TYPE|| ||I: Load the data by generating an insert statement for each row. This is the preferred method for loading to the database.
L: Load the data through the Postgres bulk loader. This can be more efficient than the I option, although in most cases the performance difference is negligible. You need to set PSQL_PATH as well to use this option.
F: Instead of loading to the database, write the data to a file called <STUDY_ID>_clinical_data, where <STUDY_ID> is the study ID you configured with the STUDY_ID parameter.|
|SECURITY_REQUIRED||N||N: Indicates this is a public study. Any user logged in to tranSMART can view this study.
Y: A tranSMART administrator needs to give explicit access to each user who requires access to this study.|
|SORT_DIR||$HOME||Full path to a directory where temporary files can be stored for sorting|
|STUDY_ID||x||Unique identifier of the study. Note: the study ID will be transformed to uppercase during ETL, so all future references to the study ID must be in uppercase as well.|
|TOP_NODE||x||The string that defines the top node of the ontology, including the full name of the study. For example: \Public Studies\Breast_Cancer_Kao_GSE20685\|
|WORD_MAP_FILE||x||Optional file that remaps values when reading source files.|
|RECORD_EXCLUSION_FILE||x||Optional file defining filters to exclude records.|
In this example we will upload a small mock study, using the file formats described above. The files can be found in the Clinical data directory included with this document. The clinical data upload example consists of four files: GSEXXXX_clinical.txt, GSEXXXX_columns.txt, GSEXXXX_words.txt and GSEXXXX_record_exclusion.txt. It is recommended to study the contents of each file in order to understand the upload process.
GSEXXXX_clinical.txt contains the raw clinical data. Notice that the DAS28 column has one missing value and one value set to NA. In this example, we want to exclude the subject whose DAS28 score is NA, so we create a record exclusion file with one line defining the exclusion parameters. Additionally, we want to rename the value RA in the disease_state column to Rheumatoid Arthritis; the word mapping file does that for us. Finally, we use the column mapping file to define the study ontology.
Now we are ready to start the ETL process. As the top node of our study, we choose \Public studies\GSEXXXX\. You can review the meaning of all parameters in Table 4. Since typing out all parameters every time can be tedious, we recommend keeping a small script file within the study directory. If you need to re-upload the data at a later point, you can simply run this script again. The Clinical data directory contains scripts for both Linux and Windows environments.
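Such a wrapper script might look like the following minimal sketch. Every path, filename and parameter value below is a placeholder, and the provided load_clinical.sh may differ in detail; the `-file=` and `-param:NAME=value` options are the standard Kitchen command-line syntax.

```shell
#!/bin/sh
# Sketch of a clinical data upload script -- all paths are placeholders.
DATA_INTEGRATION_PATH=/opt/data-integration   # location of the Kettle suite
TRANSMART_ETL_PATH=/opt/tranSMART-ETL         # location of the tranSMART-ETL repo

"$DATA_INTEGRATION_PATH/kitchen.sh" \
  -file="$TRANSMART_ETL_PATH/Kettle/postgres/Kettle-ETL/create_clinical_data.kjb" \
  -param:DATA_LOCATION=. \
  -param:COLUMN_MAP_FILE=GSEXXXX_columns.txt \
  -param:WORD_MAP_FILE=GSEXXXX_words.txt \
  -param:RECORD_EXCLUSION_FILE=GSEXXXX_record_exclusion.txt \
  -param:STUDY_ID=GSEXXXX \
  -param:TOP_NODE='\Public studies\GSEXXXX\' \
  -param:SECURITY_REQUIRED=N \
  -param:LOAD_TYPE=I
```

Keeping the script next to the input files is what allows DATA_LOCATION to be set to the current directory.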
On a Linux machine, open the file load_clinical.sh in the study directory; on a Windows machine, open load_clinical.bat. Change the DATA_INTEGRATION_PATH and TRANSMART_ETL_PATH variables to the locations of the Data Integration software suite and the tranSMART-ETL repository respectively. Notice that because we keep the upload script together with the data files, we can set the DATA_LOCATION parameter to . (the current directory). Finally, check the location of your psql executable and make sure it corresponds to the PSQL_PATH parameter. In a terminal, navigate to the Clinical data directory and start the upload script: on Linux, type bash load_clinical.sh; on Windows, type load_clinical.bat.
Alternatively, you can use Spoon as described in section 4: hit the run button and copy all parameter values from the script into the appropriate input fields in Spoon. When run from the script, Kitchen will produce a large amount of output. Check the last few lines of that output. If everything went well, they should look like this:
INFO 07-05 08:56:52,708 - Kitchen - Finished!
INFO 07-05 08:56:52,708 - Kitchen - Start=2015/05/07 08:56:45.339,
INFO 07-05 08:56:52,708 - Kitchen - Processing ended after 7 seconds.
Kitchen will also tell you if something went wrong:
INFO 07-05 08:56:19,419 - Kitchen - Finished!
ERROR 07-05 08:56:19,419 - Kitchen - Finished with errors
INFO 07-05 08:56:19,419 - Kitchen - Start=2015/05/07 08:56:18.982,
INFO 07-05 08:56:19,419 - Kitchen - Processing ended after 0 seconds.
Notice the extra line warning you about errors. If there are errors, double-check the script parameters and variables.
Log in to your tranSMART instance to admire your work! Open the study node. The ontology should look like the one shown in Figure 9. Notice that although we have 20 subjects in our input, only 19 were uploaded to tranSMART: remember that we excluded the subject with the NA value for DAS28 using the record exclusion file. Additionally, there was a subject with a missing DAS28 score. As a result, there are 19 subjects in the study, of which only 18 have the DAS28 concept. Finally, we can see that the RA value was mapped to Rheumatoid Arthritis.