Files included in export:
The job information submitted to tranSMART, which gives an overview of the input data for the analysis. Under parameters, jobType indicates which analysis and image will be produced, and the selected variables are listed with their full concept path names.
In case of high-dimensional data, this file contains information on which genes were selected for the analysis. If binning of high-dimensional data was done for the group variable, this will be indicated in this file.
Variables selected: The items independentVariable and groupByVariable show the full path name of the items selected in tranSMART to use as input. The independent and group by variable reflect the two input box names in the user interface: independent and outcome, respectively. Combining several items gives information on the type of variable selected and if the group variable was binned or not. variablesConceptPaths gives a summary of all used concept paths.
divIndependentVariableType and divGroupByVariable indicate the type of variables that were used. Values include CLINICAL for categorical and numeric low-dimensional data types, or the type of high-dimensional data that was used, for example “mrna” in case of an mRNA expression dataset. The independent variable is always a numerical or high-dimensional data node. The groupByVariable is always a categorical or binned variable.
Binning: There are three different options for binning:
- EDP; Evenly Distribute Population
- ESB; Evenly Spaced Bins
- Manual binning
In case of EDP or ESB for numerical or high-dimensional variables, the file will contain the following parameters to provide information on which variable was used to bin:
- binning; Either True or False
- numberOfBins; Number of bins defined by the algorithm
- binDistribution; EDP or ESB
- binVariable; IND, which stands for independent variable. Note that for logistic regression this is incorrectly displayed and the groupByVariable is the item that is always used for binning.
Information on the actual bin boundaries is available from the outputfile.txt in the column named X.
In case of the Manual binning option the above items are also included in the jobInfo but the item manualBinning will be set to True instead of False, indicating that manual bins were defined. Additionally, an item named binRanges is added to reflect the different bins manually defined. Note that both the numberOfBins and binVariable are still relevant in this case, while binDistribution is not.
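As an illustration of how the binning items relate to each other, the following minimal sketch summarises them from a dictionary of jobInfo parameters. The function name `describe_binning`, the assumption that values arrive as the strings "True"/"False", and the example `binRanges` value are hypothetical; the actual layout of the file may differ.

```python
def describe_binning(job_info):
    """Summarise the binning items of a jobInfo parameter dictionary.

    Assumes (hypothetically) that flags are stored as the text "True"/"False".
    """
    def is_true(key):
        return str(job_info.get(key, "False")).strip().lower() == "true"

    n_bins = job_info.get("numberOfBins")
    if is_true("manualBinning"):
        # binDistribution is not relevant for manual bins; binRanges lists them.
        return f"manual binning, {n_bins} bins: {job_info.get('binRanges')}"
    if is_true("binning"):
        # binDistribution is EDP or ESB for automatically defined bins.
        return f"{job_info.get('binDistribution')} binning, {n_bins} bins"
    return "no binning"


# Example values are made up for illustration only.
print(describe_binning({
    "binning": "True",
    "manualBinning": "False",
    "numberOfBins": "2",
    "binDistribution": "EDP",
    "binVariable": "IND",
}))
```

Note that binVariable is deliberately not interpreted here, since for logistic regression the groupByVariable is always the binned item regardless of the displayed value.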
The data with four columns describing the input data:
- PATIENT_NUM; Subject identifier
- X; the outcome variable options. For a binned numerical variable this indicates the bin boundaries. A maximum of two groups is possible in this analysis. This column is plotted on the Y-axis
- Y; the independent variable options, which must be numerical or high-dimensional concepts. This column is plotted on the X-axis
Optional columns when using multiple independent variables are GROUP and GROUP.1 (only possible when using high dimensional data with multiple probes per gene/protein or when selecting multiple genes/proteins). Depending on the type of data that was used as input, the GROUP can refer to either the X or the Y column.
- In case of high dimensional data for only the outcome variable the GROUP displays probe, gene or protein names for X.
- In case of high dimensional data for only the independent variable the GROUP column displays probe, gene or protein names for Y.
- In case of a high dimensional node for both, GROUP refers to X and GROUP.1 refers to Y, each showing the probe, gene or protein names for its column.
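The rules above for GROUP and GROUP.1 can be sketched as a small lookup. This is only an illustration: it assumes the data file is tab-separated with a header row, and the helper names `read_output` and `group_column_meaning`, as well as the sample rows, are hypothetical.

```python
import csv
import io


def read_output(text, delimiter="\t"):
    """Parse the data file text; assumes a header row and tab-separated values."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    rows = list(reader)
    return reader.fieldnames, rows


def group_column_meaning(columns, hd_outcome, hd_independent):
    """Return which data column (X or Y) each GROUP column annotates."""
    if hd_outcome and hd_independent:
        meaning = {"GROUP": "X", "GROUP.1": "Y"}
    elif hd_outcome:
        meaning = {"GROUP": "X"}
    elif hd_independent:
        meaning = {"GROUP": "Y"}
    else:
        meaning = {}
    # Only report GROUP columns that are actually present in the file.
    return {col: target for col, target in meaning.items() if col in columns}


# Made-up sample: high-dimensional data used for the independent variable only.
sample = "PATIENT_NUM\tX\tY\tGROUP\n1\tcase\t5.2\tTP53\n2\tcontrol\t3.1\tTP53\n"
columns, rows = read_output(sample)
print(group_column_meaning(columns, hd_outcome=False, hd_independent=True))
```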
JSON representation of jobInfo.txt.
A text file with the results of the generalized linear model (glm) algorithm in R. The I stands for the intercept and Y is the name of the independent variable input. For more information on the glm function, see the R documentation for glm.
A text file with the summary of the glm algorithm in R. The call used to model the data using glm is shown. In the coefficients table, Y represents the independent variables used as input.
An image file with two plots. The first plot shows the estimator over all the values supplied for the independent variable. The second plot is a ROC curve indicating the quality of the model, together with the AUC score.
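To make the AUC score easier to interpret, here is a minimal sketch that computes it as the fraction of (positive, negative) pairs in which the positive case receives the higher model score, counting ties as half. This is not the code tranSMART or R uses to draw the plot; the function name `roc_auc` and the example numbers are assumptions for illustration.

```python
def roc_auc(labels, scores):
    """AUC: probability that a random positive outscores a random negative."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up labels (two outcome groups coded 0/1) and fitted probabilities.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

An AUC of 1.0 means the model separates the two outcome groups perfectly, while 0.5 corresponds to a model that does no better than chance.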