
The TCGA Meta-Dataset Clinical Benchmark

by   Mandana Samiei, et al.

Machine learning is bringing a paradigm shift to healthcare by changing the process of disease diagnosis and prognosis in clinics and hospitals. This development equips doctors and medical staff with tools to evaluate their hypotheses and hence make more precise decisions. Although most current research seeks to develop methods for predicting one particular clinical outcome, this approach is far from the reality of clinical decision making, in which several factors must be considered simultaneously. In addition, it is difficult to follow recent progress concretely because benchmark datasets and task definitions in the field of genomics lack consistency. To address these issues, we provide a clinical Meta-Dataset, derived from the publicly available data hub The Cancer Genome Atlas Program (TCGA), that contains 174 tasks. We believe these tasks can serve as good proxy tasks for developing methods that work with only a few samples of gene expression data. Learning to predict multiple clinical variables from gene-expression data is also an important task because of the variety of phenotypes in clinical problems and the lack of samples for some rare variables. The defined tasks cover a wide range of clinical problems, including predicting tumor tissue site, white cell count, histological type, family history of cancer, gender, and many others that we explain later in the paper. Each task represents an independent dataset. We apply regression and neural network baselines to all the tasks using only 150 samples and compare their performance.



1 Introduction

Most researchers develop new methods for one particular clinical task at a time, for example survival prediction (Wieringen et al., 2009) or tumor cell type classification (Lyu and Haque, 2018). This approach is far from the reality of clinical decision making, in which doctors and clinical staff have to consider several clinical variables simultaneously. In some cases, these prediction tasks are correlated with each other. For instance, knowing the gender, the documented alcohol history, and the tumor tissue site can give a more reliable estimate of cancer mortality. To obtain precise outcomes for many clinical tasks, it is therefore more practical to consider inter-related tasks together. In addition, some tasks do not have enough samples. By considering a few-shot learning regime, we leverage the samples from a collection of tasks and construct prior knowledge for general prediction.

Additionally, while Machine Learning research for genomics has grown rapidly, several barriers have slowed its progress. One particular barrier is the absence of global, publicly available benchmark datasets for gene expression with which to evaluate the performance of models, facilitate reproducibility and allow the community to focus on a shared set of challenges. Computer vision has consistent public benchmarks for image classification, for example ImageNet and the Large Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2015), which allowed a significant improvement on the benchmark tasks, from a classification error of about 25% in 2011 to an accuracy of about 95% in 2017. In contrast, practical progress in genomics applications has been difficult to measure due to variability and inconsistency in datasets and task definitions. Although there is a fairly well-known challenge called DREAM, it has not established itself as a standard challenge in the way ImageNet has.

In this paper, we propose a collection of 174 TCGA benchmark clinical tasks that can be used in a multi-task learning framework. All the tasks are classification problems, and the input space is gene-expression data with 20530 genes. We apply independent regression and neural network models to each clinical task. The purpose of each dataset can be inferred from its name: predicting a clinical variable for a particular cancer type. For instance, the dataset named ’gender-LAML’ predicts ’gender’ for ’Acute Myeloid Leukemia’ samples.

We evaluate the performance on each clinical task using logistic regression and a neural network trained on the corresponding cancer study. This framework is a first step toward developing Meta-Learning techniques that exploit the interrelated clinical tasks of gene-expression data to learn effective classification models despite the small sample size of each individual task. The code is accessible at:

1.1 Summary of Contributions

This paper proposes a public Meta-Dataset that provides 174 defined clinical tasks. The TCGA Meta-Dataset also includes a meta-dataloader, which is available in the GitHub repository. The paper is organized as follows: Section 2 presents related work, and Section 3 gives an overview of the public TCGA dataset and the defined tasks. Section 4 explains the data-loaders available in the TCGA Meta-Dataset. Section 5 details the experimental and training setup, the baselines, the evaluation method and the architecture, and compares the models’ performance on different tasks. Finally, Section 6 concludes the work and explains how we plan to extend it in the future.

2 Related Works

A recent paper in the Meta-Learning literature that is closely related to this work is Meta-Dataset (Triantafillou et al., 2019), which offers an environment for training and evaluating meta-learners for few-shot image classification and evaluates various baselines and meta-learners on it. This benchmark includes 10 datasets from various image classification tasks: ILSVRC-2012 (ImageNet) (Russakovsky et al., 2015), Omniglot (Lake et al., 2019), Aircraft (Maji et al., 2013), MSCOCO (Lin et al., 2014), CUB-200-2011 (Birds) (Wah et al., 2011), Describable Textures (Cimpoi et al., 2016), Quick Draw (Jongejan et al., 2016), Fungi (Schroeder and Cui, 2018), VGG Flower (Nilsback and Zisserman, 2008) and Traffic Signs (Houben et al., 2013). Despite its small number of tasks, this benchmark opens the door to using multiple datasets for few-shot learning: samples from different datasets are used to construct prior knowledge from multiple data sources, which allows researchers to evaluate a more challenging generalization problem.

In the genomics literature, most works develop algorithms and techniques for solving only one particular clinical task at a time. To the best of our knowledge, no existing work performs general clinical feature prediction using gene expression data. One work with a similar initiative is "Multitask learning and benchmarking with clinical time-series data" (Harutyunyan et al., 2019), but it considers only 4 different tasks from the MIMIC-III database. Our work differs in the number of defined tasks, in the dataset from which the tasks originate, and in defining a few-shot setting for each task in our Meta-Dataset.

3 Task Definition

The Cancer Genome Atlas (TCGA) program collected clinicopathologic annotation data along with multi-platform molecular profiles of more than 11,000 human tumors across 33 different cancer types (Liu et al., 2018). We use the gene expression profile (20530 genes) called IlluminaHiSeq, which is obtained from RNA-Seq using an alignment and quantification algorithm. We consider all cancer studies available in TCGA and predict the phenotypes corresponding to them. To perform the prediction, we restrict ourselves to samples for which TCGA has both RNA-Seq data and the given phenotype. There are 44 clinical variables (phenotypes) and 25 cancer studies. Each task is a combination of a phenotype to predict and a cancer study. The defined tasks cover a range of clinical problems including modeling tumor tissue site, white cell count, histological type, family history of cancer, gender, pancan DNAMethyl (the pan-cancer patterns of DNA methylation), pancan miRNA (the pan-cancer data analysis of microRNA), animal allergy history, asthma history, mental status changes and a few others. There are 174 tasks in total, of which 121 have enough samples, meaning at least 150 samples. We use gene expression data from tumor samples of a particular cancer study to predict a phenotype. For example, the task (’EVENT’, ’KIRP’) predicts the phenotype _EVENT, an overall survival indicator (1=death, 0=censor), for TCGA Kidney Papillary Cell Carcinoma (KIRP) tumor samples. The full list of tasks, together with a summary of the baselines’ performance, can be found in Figure 1 and in the GitHub repository.
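The way tasks are formed from (phenotype, cancer study) pairs can be illustrated with a minimal sketch. It is not the dataset's actual construction code: the function `build_tasks`, the record layout and the toy records are assumptions made for illustration, and the sample threshold is lowered so the toy data produces a result.

```python
from collections import defaultdict


def build_tasks(records, min_samples=150):
    """Group labelled samples into (phenotype, study) tasks.

    `records` is an iterable of dicts with hypothetical keys:
    'sample_id', 'study', 'has_rnaseq', and 'phenotypes' (a dict mapping
    phenotype name -> label, with None marking a missing label).
    """
    tasks = defaultdict(list)
    for rec in records:
        if not rec["has_rnaseq"]:
            continue  # keep only samples that have gene-expression data
        for phenotype, label in rec["phenotypes"].items():
            if label is None:
                continue  # keep only samples with a value for this phenotype
            tasks[(phenotype, rec["study"])].append((rec["sample_id"], label))
    # Drop tasks that do not reach the minimum number of samples.
    return {key: rows for key, rows in tasks.items() if len(rows) >= min_samples}


# Toy records for illustration only; these are not real TCGA samples.
records = [
    {"sample_id": "s1", "study": "LAML", "has_rnaseq": True,
     "phenotypes": {"gender": "F", "_EVENT": 1}},
    {"sample_id": "s2", "study": "LAML", "has_rnaseq": True,
     "phenotypes": {"gender": "M", "_EVENT": None}},
    {"sample_id": "s3", "study": "LAML", "has_rnaseq": False,
     "phenotypes": {"gender": "M", "_EVENT": 0}},
    {"sample_id": "s4", "study": "KIRP", "has_rnaseq": True,
     "phenotypes": {"_EVENT": 0}},
]

tasks = build_tasks(records, min_samples=2)
print(sorted(tasks))
```

With the threshold set to 150, the same filter would correspond to the selection of tasks that have enough samples for the experiments.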

4 Data-loaders for TCGA Meta-Dataset

We offer a meta-dataloader for TCGA that loads a batch of tasks, which can then be iterated over to generate datasets. We also provide the possibility of loading a single clinical task by specifying the phenotype and the cancer study. The file includes the details of the meta-dataloader. The example below shows both uses: the variable ’datasets’ is the collection of all tasks and ’task’ is one of these datasets, named (’PAM50CallRNAseq’, ’BRCA’). The TCGA meta-dataloader is part of the Torchmeta library (Deleu et al., 2019).

    # Example for Section 4: load all tasks, then a single task
    import collections

    datasets = metadataloader.TCGA.TCGAMeta(download=True, minsamplesperclass=10)
    task = metadataloader.TCGA.TCGATask(('PAM50CallRNAseq', 'BRCA'))
    print(  # output: ('PAM50CallRNAseq', 'BRCA')
    print(task._samples.shape)  # output: (956, 20530)
    print(collections.Counter(task.labels))  # output: Counter({2: 434, 3: 194, 0: 142, 4: 119, 1: 67})

5 Experimental Results

We evaluate each task with three supervised baselines: a majority-class predictor, logistic regression and a neural network. For the majority-class predictions, we use a dummy classifier from Scikit-learn (Pedregosa et al., 2011). For logistic regression, we use a 1-layer neural network (a neural net with no hidden layer and a Softmax output) from PyTorch (Paszke et al., 2017). As the third baseline, after a hyperparameter search, we use a 3-layer network with hidden layers of size 128 and 64 respectively, a learning rate equal to , a batch size of 32 and a weight decay of 0.0 (no regularization effect). We use Adam (Kingma and Ba, 2014) for optimization in the neural network and LBFGS (Liu and Nocedal, 1989) for logistic regression. We train the network for 250 epochs on each task and repeat this over 10 different trials (by changing the seed from 0 to 9). Every task uses 150 samples in total, which we split between train and test sets, plus a validation set if needed. The train and test sets contain 50 and 100 samples respectively.
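The evaluation protocol for the simplest baseline can be sketched in plain Python: draw 150 samples per seed, split them into 50 training and 100 test samples, and predict the most frequent training label. This is only an illustration under stated assumptions. The helper names `split_task` and `majority_accuracy` are hypothetical, random subsampling per seed is assumed rather than taken from the paper, and the labels are synthetic values that merely reuse the class counts printed in Section 4.

```python
import random
from collections import Counter


def split_task(labels, seed, n_train=50, n_test=100):
    """Draw n_train + n_test distinct sample indices and split them."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(labels)), n_train + n_test)
    return idx[:n_train], idx[n_train:]


def majority_accuracy(labels, train_idx, test_idx):
    """Predict the most frequent training label for every test sample."""
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    hits = sum(labels[i] == majority for i in test_idx)
    return hits / len(test_idx)


# Synthetic labels with the class counts shown in Section 4 (956 samples).
labels = [2] * 434 + [3] * 194 + [0] * 142 + [4] * 119 + [1] * 67

accuracies = []
for seed in range(10):  # seeds 0 to 9, one trial per seed
    train_idx, test_idx = split_task(labels, seed)
    accuracies.append(majority_accuracy(labels, train_idx, test_idx))

print(sum(accuracies) / len(accuracies))  # mean accuracy over the 10 trials
```

The same split function could feed the logistic regression and MLP baselines, so that all three models see identical train and test samples for a given seed.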


The MLP has the highest overall performance compared to logistic regression and the majority classifier. The main limitation of logistic regression is that it cannot model the interactions between input variables. While the MLP handles most clinical classification tasks better, logistic regression performs better on gender tasks. Table 1 reports the average accuracy of each baseline over all tasks, together with the standard deviation. These results are computed over 10 repeats.

Model                  Accuracy (%)   Standard Deviation
Majority               66.18          16.87
Logistic Regression    68.78          15.91
Neural Network (MLP)   71.19          14.62

Table 1: Accuracy and standard deviation of the three baselines over all tasks.

Figure 1: Model performance over all datasets. Sub-figures show the performance of the Majority classifier, Logistic Regression and the MLP across all tasks.
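One plausible way to obtain a summary like Table 1 is to average each task's accuracy over its trials and then take the mean and standard deviation across tasks. The sketch below assumes exactly that, with a population standard deviation; the paper does not state which aggregation was used. The function `summarize` and all accuracy values are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, pstdev

# Hypothetical per-trial accuracies (%) for three tasks; not real results.
per_trial_accuracy = {
    ("gender", "LAML"): [90.0, 92.0, 88.0],
    ("_EVENT", "KIRP"): [70.0, 74.0, 72.0],
    ("PAM50CallRNAseq", "BRCA"): [60.0, 58.0, 62.0],
}


def summarize(per_trial_accuracy):
    """Average over trials per task, then aggregate across tasks."""
    task_means = [mean(values) for values in per_trial_accuracy.values()]
    # Assumption: the spread is measured across tasks, not across trials.
    return mean(task_means), pstdev(task_means)


avg, std = summarize(per_trial_accuracy)
print(f"{avg:.2f} {std:.2f}")
```

Measuring the spread across tasks would explain why the standard deviations in Table 1 are large: task difficulty varies far more than the seed-to-seed variation within one task.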

6 Conclusion and Future works

We proposed a Meta-Dataset of tasks in which each task is a combination of a clinical variable (phenotype) to predict and a cancer study. This dataset can be used as a testbed for developing algorithms for genomics applications that leverage a few-shot learning regime. We evaluated the performance on the clinical tasks using supervised models and compared their accuracy to provide a baseline of what current techniques can achieve on the defined tasks. For future work, meta-learning methods can be applied to capture the structural information shared among interrelated clinical tasks using gene-expression data. Because samples are scarce in the few-shot learning regime, meta-learning methods can leverage the samples from a collection of tasks and construct prior knowledge for general prediction. Studying how to transfer learned features from one clinical task to another could also be an interesting direction.


References

  • Cimpoi et al. (2016) Cimpoi, M., Maji, S., Kokkinos, I., and Vedaldi, A. (2016). Deep filter banks for texture recognition, description, and segmentation. International Journal of Computer Vision, 118(1):65–94.
  • Deleu et al. (2019) Deleu, T., Würfl, T., Samiei, M., Cohen, J. P., and Bengio, Y. (2019). Torchmeta: A Meta-Learning library for PyTorch. Available at:
  • Harutyunyan et al. (2019) Harutyunyan, H., Khachatrian, H., Kale, D. C., Ver Steeg, G., and Galstyan, A. (2019). Multitask learning and benchmarking with clinical time series data. volume 6.
  • Houben et al. (2013) Houben, S., Stallkamp, J., Salmen, J., Schlipsing, M., and Igel, C. (2013). Detection of traffic signs in real-world images: The German traffic sign detection benchmark.
  • Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego, 2015.
  • Lake et al. (2019) Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. (2019). The Omniglot Challenge: A 3-Year Progress Report.
  • Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T., editors, Computer Vision – ECCV 2014, pages 740–755, Cham. Springer International Publishing.
  • Liu and Nocedal (1989) Liu, D. C. and Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Math. Program., 45(1-3):503–528.
  • Liu et al. (2018) Liu, J., Lichtenberg, T., Hoadley, K. A., Poisson, L. M., Lazar, A. J., Cherniack, A. D., Kovatich, A. J., Benz, C. C., Levine, D. A., Lee, A. V., Omberg, L., Wolf, D. M., Shriver, C. D., Thorsson, V., The Cancer Genome Atlas Research Network, and Hu, H. (2018). Abstract 3287: An integrated TCGA pan-cancer clinical data resource to drive high quality survival outcome analytics. Cancer Research, 78(13 Supplement).
  • Lyu and Haque (2018) Lyu, B. and Haque, A. (2018). Deep learning based tumor type classification using gene expression data. ACM-BCB, pages 89–96.
  • Maji et al. (2013) Maji, S., Rahtu, E., Kannala, J., Blaschko, M. B., and Vedaldi, A. (2013). Fine-grained visual classification of aircraft. CoRR, abs/1306.5151.
  • Nilsback and Zisserman (2008) Nilsback, M.-E. and Zisserman, A. (2008). Automated flower classification over a large number of classes. pages 722–729.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. 31st Conference on Neural Information Processing Systems (NIPS).
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res., 12:2825–2830.
  • Russakovsky et al. (2015) Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252.
  • Schroeder and Cui (2018) Schroeder, B. and Cui, Y. (2018). Fgvcx fungi classification challenge. fungi_comp.
  • Triantafillou et al. (2019) Triantafillou, E., Zhu, T., Dumoulin, V., Lamblin, P., Xu, K., Goroshin, R., Gelada, C., Swersky, K., Manzagol, P., and Larochelle, H. (2019). Meta-dataset: A dataset of datasets for learning to learn from few examples. CoRR, abs/1903.03096.
  • Wah et al. (2011) Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
  • Wieringen et al. (2009) Wieringen, W., Kun, D., Hampel, R., and Boulesteix, A.-L. (2009). Survival prediction using gene expression data: A review and comparison. Computational Statistics & Data Analysis, 53:1590–1603.