Machine learning is bringing a paradigm shift to healthcare by changing the process of disease diagnosis and prognosis in clinics and hospitals. This development equips doctors and medical staff with tools to evaluate their hypotheses and hence make more precise decisions. Although most current research in the literature seeks to develop techniques and methods for predicting one particular clinical outcome, this approach is far from the reality of clinical decision making in which you have to consider several factors simultaneously. In addition, it is difficult to follow the recent progress concretely as there is a lack of consistency in benchmark datasets and task definitions in the field of Genomics. To address the aforementioned issues, we provide a clinical Meta-Dataset derived from the publicly available data hub called The Cancer Genome Atlas Program (TCGA) that contains 174 tasks. We believe those tasks could be good proxy tasks to develop methods which can work on a few samples of gene expression data. Also, learning to predict multiple clinical variables using gene-expression data is an important task due to the variety of phenotypes in clinical problems and lack of samples for some of the rare variables. The defined tasks cover a wide range of clinical problems including predicting tumor tissue site, white cell count, histological type, family history of cancer, gender, and many others which we explain later in the paper. Each task represents an independent dataset. We use regression and neural network baselines for all the tasks using only 150 samples and compare their performance.READ FULL TEXT VIEW PDF
Current machine learning has made great progress on computer vision and ...
Background Precise prediction of cancer types is vital for cancer diagno...
Disease-modifying treatments are currently assessed in neurodegenerative...
A vast majority of deep learning methods are built to automate diagnosti...
We propose a methodology for the identification of transcription factors...
Clinical predictions using clinical data by computational methods are co...
Formalin-fixed paraffin-embedded (FFPE) samples have great potential for...
Most researchers develop new methods for one particular clinical task at a time, as an example, we can name survival prediction (Wieringen et al., 2009) or tumor cell type classification (Lyu and Haque, 2018). This approach is far from the reality of clinical decision making in which you have to consider several clinical variables simultaneously which are often performed by doctors and clinical staff. In some cases, those prediction tasks are correlated with each other. For instance, by knowing about gender, alcohol history documents, and tumor tissue site, we can achieve a more reliable result of cancer mortality. To tackle a precise outcome for many clinical tasks, it’s more practical to consider inter-related tasks. In addition, there are some tasks that do not have enough number of samples. By considering a few-shot learning regime, we leverage the samples from a collection of tasks and construct a prior knowledge for a general prediction.
Additionally, while there has been an extreme growth in Machine Learning research for Genomics, several barriers have slowed the progress. One particular barrier is the absence of global publicly available benchmark datasets in the field of gene expression works to evaluate the performance of models, facilitate reproducibility and allow the community to focus on a set of challenges. There are consistent public benchmarks in the field of computer vision for image classification tasks, for example, ImageNet(Russakovsky et al., 2015) and Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2015) which allowed a significant improvement in the classification error from 25% in 2011 to 95% in 2017 for benchmark tasks. In contrast, practical progress in genomics applications has been difficult to measure due to variability and inconsistency in data sets and task definitions. Although there is a fairly well-known challenge called DREAM that can be found at http://dreamchallenges.org/, it has not imposed itself as a standard challenge such as ImageNet.
In this paper, we propose a collection of 174 TCGA benchmark clinical tasks that can be used in a multi-task learning framework. All the tasks are classification problems and the input space is gene-expression data with 20530 genes. We apply independent regression and neural models that will predict across any general clinical tasks. The purpose of each dataset that can be inferred by its name is predicting a clinical variable of a particular cancer type. For instance, one of the datasets is named ’gender-LAML’ means we predict ’gender’ for ’Acute Myeloid Leukemia’ samples.
We evaluate the performance of each clinical task using logistic regression and a neural network that will predict across corresponding cancer study. This framework is an initiative to develop Meta-Learning techniques that make use of the interrelated clinical tasks of gene-expression data to learn effective classification models despite the small sample size of each individual task. The code is accessible at:https://github.com/mandanasmi/TCGABenchmark.
This paper proposes a public Meta-Dataset that provides 174 defined clinical tasks. TCGA Meta-Dataset also includes a meta-dataloader which is available on the Github repository. The organization of the paper is as follows: Section 2 presents related works to this paper, and Section 3 gives an overview of the public TCGA dataset and the defined tasks. Section 4 explains the data-loaders available in the TCGA Meta-Dataset. Section 5 gives a detail of the experimental and training setup, the baselines we used, the evaluation method, the architecture as well as a comparison of models’ performance on different tasks. In the end, Section 6 concludes the work and explains how we plan to extend this in future work.
A recent paper in Meta-Learning literature which is closely related to this work is Meta-Dataset by (Triantafillou et al., 2019) that offers an environment for training and evaluating meta-learners for few-shot image classification by evaluating various baselines and meta-learners on this Meta-Dataset. This benchmark includes 10 datasets from various image classification tasks including ILSVRC-2012 (ImageNet) (Russakovsky et al., 2015), Omniglot (Lake et al., 2019), Aircraft (Maji et al., 2013), MSCOCO (Lin et al., 2014), CUB-200-2011 (Birds) (Wah et al., 2011), Describable Textures (Cimpoi et al., 2016), Quick Draw (Jongejan et al.,2016), Fungi (Schroeder and Cui, 2018), VGG Flower (Nilsback and Zisserman, 2008) and Traffic Signs (Houben et al., 2013). Despite having a small number of tasks, developing this benchmark opens the door to the use of multiple datasets for few-shot learning by leveraging the use of samples from different dataset to construct the prior knowledge from multiple data sources and allows researchers to evaluate more challenging generalization problem.
In the literature of genomics applications, most of the works are developing algorithms and techniques for only solving one particular clinical task at a time. To the best of our knowledge, there is no particular work which is doing general clinical feature prediction using gene expression data, although one of the works that has a similar initiative is "Multitask learning and benchmarking with clinical time-series data" by (Harutyunyan et al., 2019), they are only considering 4 different tasks from MIMIC-III database which is different from what we have done. This difference is coming from the number of defined tasks, the dataset that the tasks are originated from as well as defining a few-shot setting for each task in our Meta-Dataset.
The Cancer Genome Atlas (TCGA) program collected clinicopathologic annotation data along with multi-platform molecular profiles of more than 11,000 human tumors across 33 different cancer types (Liu et al., 2018). We use the gene expression profile (20530 genes) called IlluminaHiSeq that is obtained from RNA-Seq using some alignment and quantification algorithm. We consider all cancer studies available in TCGA to predict the phenotypes corresponding to them. To perform the prediction, we restrict ourselves to work only with samples for which TCGA has both RNA-Seq and a given phenotype. There are 44 clinical variables (phenotypes) and 25 cancer studies. Tasks are combinations of a phenotype to predict and a cancer study. The defined tasks cover a range of clinical problems including modeling tumor tissue site, white cell count, histological type, family history of cancer, gender, pancan DNAMethyl which is the pan-cancer patterns of DNA methylation, pancan miRNA which is the pan-cancer data analysis of microRNA, animal allergy history, asthema history, mental status changes and a few others. You can find the full list of tasks in figure 1 and the Github Repo. There are 174 tasks in total in which 121 of them have enough number of samples by considering the tasks that have at least 150 samples. We use gene expression data from tumor samples of a particular cancer study to predict a phenotype. For example, a task called (’EVENT’, ’KIRP’) means we predict a phenotype called _EVENT which is an overall survival indicator (1=death, 0=censor) for TCGA Kidney Papillary Cell Carcinoma (KIRP) tumor samples. There are 174 tasks in TCGA Meta-Dataset, a full list of tasks besides the baselines’ performance summary can be found in Figure 1.
We offer a meta-dataloader for TCGA that allows loading a batch of tasks that can be iterated over to generate datasets. We also provide the possibility of loading only one clinical task by specifying the phenotype and the cancer study. The file TCGA.py includes the detail of meta-dataloader. You can see an example below in which the variable ’datasets’ is the collection of all tasks and ’task’ is one of these datasets named (’PAM50CallRNAseq’, ’BRCA’). TCGA meta-dataloader is a part of Torchmeta Library (Deleu et al., 2019) .
We used three supervised models to evaluate the performance of each task using regression and neural network baselines. For the majority class predictions, we use a dummy classifier from Scikit-learn(Pedregosa et al., 2011)
, for logistic regression we use a 1-layer neural network (zero-hidden layer neural net with Softmax output) from PyTorch(Paszke et al., 2017)
. As the third baseline, after a hyperparameter search, we use a 3-layers network with 128 and 64 of hidden state size respectively, a learning rate equal to, batch size of 32 and weight decay of 0.0 (no regularization effect). We use Adam (Kingma and Ba, 2014) for optimization in the neural network and LBFGS (Liu and Nocedal, 1989)
for logistic regression. We run the network for 250 epochs for each task and over 10 different trials (by changing seed from 0 to 9). All tasks have 150 samples in total. We split 150 samples between train and test sets and valid set if needed. Train and test sets contain 50 and 100 samples respectively.
MLP has the highest overall performance compared to Logistic Regression and Majority. The main limitation of logistic regression is that it cannot model the interactions between input variables. While the MLP is dealing better with clinical classification tasks, logistic regression has higher performance on gender tasks. In table 1
, you can see the average accuracy of each baseline over all tasks as well as the standard deviation. This result has been concluded over 10 repeats.
|Neural Network (MLP)||71.19||14.62|
We proposed a Meta-Dataset of tasks where each task is combinations of a clinical variable (phenotype) to predict and a cancer study. This dataset can be used as a testbed for developing algorithms for genomics applications which can leverage a few-shot learning regime. We evaluated the performance of clinical tasks using supervised models and compared their accuracy to provide a baseline of what current techniques are able to achieve on the defined tasks. For future work, we can apply meta-learning methods to construct the shared structural information within interrelated clinical tasks by using gene-expression data. As there is a lack of number of samples in the few-shot learning regime, by using meta-learning methods we leverage the samples from a collection of tasks and construct a prior knowledge for a general prediction. Also, studying the way that we can transfer these features from one clinical task to another could be an interesting direction.