UFDN-TCGA
None
view repo
We introduce a Unified Disentanglement Network (UFDN) trained on The Cancer Genome Atlas (TCGA). We demonstrate that the UFDN learns a biologically relevant latent space of gene expression data by applying our network to two classification tasks of cancer status and cancer type. Our UFDN specific algorithms perform comparably to random forest methods. The UFDN allows for continuous, partial interpolation between distinct cancer types. Furthermore, we perform an analysis of differentially expressed genes between skin cutaneous melanoma(SKCM) samples and the same samples interpolated into glioblastoma (GBM). We demonstrate that our interpolations learn relevant metagenes that recapitulate known glioblastoma mechanisms and suggest possible starting points for investigations into the metastasis of SKCM into GBM.
READ FULL TEXT VIEW PDFNone
Deep learning is being applied to many difficult problems in genomics and medicine. Alipanahi et al. used deep learning to learn site specific binding patterns of DNA and RNA-binding proteins [1]. Zhou et al. were able to predict non-coding variants using deep learning [2]. Google has even produced an improved variant caller known as DeepVariant [3].
More specifically, deep learning has been applied to understanding cancer prognosis. Chaudhary et al. were able to robustly predict survival in liver cancer [4]. Cruz-Roa et al. leveraged deep learning to quantify the extent of breast cancer tumors in imaging data [5]. Other groups have trained networks to identify metastatic breast cancer and lymph node metastasis [6, 7].
Nevertheless, there is little work in machine learning being done on what changes are occurring at a gene expression level in cancer samples. Understanding the genomic basis of cancer will yield better treatments and prognosis for patients
[8]. There are significant questions remaining in oncology about the relationships between different cancer types. For instance, while there is an association between melanoma, a type of skin cancer, and glioblastoma, a type of brain cancer, little is known about the molecular underpinnings of this relationship [9, 10].Recently, deep generative models such as variational auto encoders (VAEs) and generative adversarial networks (GANs) have made large advances in image, audio, and text generation
[11, 12, 13]. VAEs and GANs learn generative distributions on lower-dimensional encodings of input data [14]. VAEs have found genomic applications. Rampasek et al. applied VAEs to learn drug responses based on gene expression data [15]. Way et al. trained a VAE called Tybalt to encode The Cancer Genome Atlas (TCGA) [14]. Huang et al. have developed a theory of cancer development as a progression along a low dimensional space, justifying exploration of cancer metastasis using machine learning algorithms that learn low dimensional representations [16].A new VAE-GAN hybrid architecture known as the Unified Feature Disentanglement Network (UFDN) learns fundamental features that distinguish input domains [17]. For multiple input data types, such as photographs, sketches, and watercolor paintings, the UFDN learns an VAE encoding of the data domains and trains a discriminator in the latent space to discriminate between domain types. Then, the UFDN can subsequently encode data from one domain and decode the data into a different domain[17]. An additional GAN distinguishes between real/fake images in the pixel space to promote high quality decodings[17].
In this work, we apply this new UFDN architecture to TCGA RNA-Seq data and learn a latent space embedding that allows us to convert between different cancer types given gene expression data. Given gene expression levels in a cancer sample of domain , we can predict gene expression levels as if that cancer sample were of domain . This represents a generative, personalized model of metastasis. We can sample points in our latent space encoding and decode them into any new cancer domain.
Additionally, we can partially interpolate between cancer domains. UFDN decoding is not strictly binary—input data can be decoded into a mix of output domains. We investigate partial interpolations of one cancer type into another, mimicking the progressive nature of metastasis.
We analyze the performance of our TCGA-trained UFDN on two tasks: predicting whether a sample is from cancerous or normal tissue and predicting which cancer sub-type a sample consists of. Additionally, we investigate partial interpolations from skin cutaneous melanoma (SKCM) TCGA samples to glioblastoma (GBM) by looking at differential expression of genes. We compute metagenes that summarize gene expression changes using integrative non-negative matrix factorization. Finally, we analyze Gene Ontology (GO) term enrichment in highly activated metagenes for each interpolated dataset.
The data consisted of 10,433 samples of RNA-Seq gene expression levels across 33 cancer types for 20,501 genes from TCGA obtained via the R Package curatedTCGA [18, 19]
. For the purpose of this work, we only considered the RSEM (RNA-Seq by Expectation Maximization
[20]) normalized expression levels. We divided 70%, 20%, and 10% to train, test, and holdout datasets, respectively.Tybalt demonstrated that preprocessing gene expression levels by scaling gene-wise expression levels (across all samples) to between 0 and 1 yields a trainable latent space [14]
. We adapted this procedure by first clipping expression levels to fall within 3 standard deviations from the mean of gene-wise expression levels followed by the same min-max normalization of Tybalt
[14].Liu et al. develop a UFDN as a combination of an encoder , a generator , and two discriminators: in the latent space and in the pixel space [17]. In our application, pixel space is replaced by “gene expression space.” takes input data and encodes it in a latent space. In our UFDN, we encode gene expression using fully connected networks. learns to discriminate between domains, or cancer types. Then generator
uses a latent space encoding and a domain vector
to produce gene expression data in domain [17]. Our UFDN uses since there are 33 cancer types in TCGA.We define a partial interpolation with parameter of an input of domain to domain to be the decoding of the input into into a composition of domains and , with weight given to domain . That is, the domain vector of the partial interpolation has components , , and remaining components zero. For instance, a 0.25-GBM interpolation means an input has been decoded with and original domain entry is .
In the pixel space (or gene expression space), learns to distinguish between samples that have been decoded to their original domain or a new domain [17]. The network is trained by iterative stochastic gradient updates to , , and . For a more detailed exposition of the architecture of and gradient updates for training the UFDN, please see Section 3 of Liu et al. 2018 [17].
The encoder and generator are single layer networks, each with 500 hidden units, that learn a 100 dimensional latent space. The feature space discriminator is a single layer network with 64 hidden units and the pixel space discriminator
is a two layer network with 500 and 100 hidden units. All networks are fully connected with leaky relu activation functions. We use 50,000 iterations of Adam updates with a learning rate of
.We attempted two classification tasks using the UFDN. The first was classifying a sample as tumor or normal. This is referred to as the cancer status task. The second task was predicting cancer domain, one of 33 sub-types in the TCGA.
In order to solve these tasks, we developed 3 algorithms using UFDN:
UDFN-MSE: classify a sample’s type by encoding the sample and decoding it into all 33 domains, predicting the type of the domain with lowest reconstruction error as defined by mean square error (MSE).
Unsupervised UFDN: Inspired by the unsupervised domain adaptation experiments from Liu et al.[17], this algorithm predicts cancer status by encoding a sample into the latent space, then decoding it into the mesothelioma domain, regardless of input domain. We trained a random forest classifier to predict cancer status on mesothelioma training data. Use the prediction of this classifier to predict cancer status in the original input domain. The motivation for this approach is that the classifier trained on mesothelioma data is strong but the test data of interest is of a different cancer type.
Semi-supervised UFDN: A hybrid of the two above algorithms used to predict cancer status and type. First, predict cancer type using UDFN-MSE. Then, predict cancer status using a random forest classifier trained on that specific type’s status data.
We encoded 95 samples of SKCM (skin cutaneous melanoma) from our test set partition of the TCGA into our latent space using our trained UFDN. Then, we interpolated the samples into glioblastoma (GBM) at four different fractions of interpolation: 25%, 50%, and 75%, and 100%. The 100% interpolation represents a prediction of gene expression levels of the SKCM samples as GBM per sample.
In order to analyze how gene expression changed between SKCM samples and these samples as GBM, we performed a differential expression analysis using edgeR [21, 22]
. This is an R package that uses a negative binomial distribution model to analyze significant gene expression changes between two groups
[21, 22]. Although normally edgeR works with raw read counts, more recently the package creator has stated that RSEM normalized reads are also suitable for use with edgeR [23].We applied the inverse transformation of our min-max normalization to our four interpolated datasets since our UFDN decodes gene expression levels to the range [0,1]. Then we used edgeR to find differentially expressed genes between SKCM samples and 100% GBM interpolated samples. A p-value threshold for differential expression was set at to control for false discovery.
Analyzing every single gene the significantly changed between SKCM and GBM would be a challenge, so we used integrative Non-negative Matrix Factorization (IntNMF) to learn metagenes that summarized gene expression changes [24]. IntNMF learns a reduced dimensionality representation across multiple datasets [24]. IntNMF learns a shared basis matrix and where is the number of features (here, the differentially expressed genes) and is the number of metagenes, . Each dataset is described by a learned matrix where is the number of samples in the dataset [24]. Each row of represents the linear combination of metagenes of that combine to reconstruct the original sample in [24]. We chose based on an analysis of the reconstruction error , where is the Frobenius norm. We learned and for each dataset using the R package IntNMF [24].
Every element of column is non-negative and represents the contribution of gene to the -th metagene [24]. Each element of the -th row of represents the contribution of metagene to the -th sample of the -th dataset. We can analyze how these metagenes change over the different interpolation datasets in order to understand how gene expression is changing [24].
Finally, to understand the broad composition of the metagenes discovered by IntNMF, we used Gene Ontology (GO) enrichment analysis. GO terms are an ontology of three categories: biological processes, molecular function, and cellular component. They link together information about the functions and relationships of genes and proteins. topGO is an R package that analyzes if GO terms, which have been mapped to genes, show up more often than expected in a set of genes and associated scores for each gene [25].
We used a Kolmogorov-Smirnov like test known as Gene Score Enrichment Analysis that calculates p-values of enrichment based on a score for each gene[25]. In our work, we did this test on each metagene derived from IntNMF with the score for gene as [25]. By looking at the top scoring GO terms for each metagene, we understand what sort of genes are changing as we interpolate between cancer types [25].
All our code is available at https://github.com/bkompa/UFDN-TCGA. We used Liu et al.’s implementation of UFDN as a starting point but had to expand the architecture to work with an arbitrary number of domains. We wrote all other code used for analysis with the various packages mentioned above.
First, we validated that our UFDN learned a non-trivial latent space representation of TCGA RNA-Seq data. We projected both the TCGA data and latent space encodings into UMAP space [26]. UMAP learns a Riemann manifold representation of the data [26]. We used hyper-parameters spread=2.0 and min_dist=.01 to produce Figure 2. We observed distinct clusters by cancer types for both the original data and encodings. We proceeded in our downstream analysis confident that our UFDN had learned how to discern between cancer types based on these UMAP projections.
Next, we estimated the ability of our UFDN to take data from a source domain (original cancer type) and interpolate these data into a target domain (new cancer type). We considered the fraction of the
nearest neighbors, in the training data, of the interpolated samples that were in the target domain as a measure of success. These decoding rates are shown in Figure 3. There were certain cancers that the UFDN was able to more robustly interpolate into. These included glioblastoma, acute myloid leukemia, mesothelioma, and prostate adenocarcinoma, among others. Difficult cancers to interpolate into were sarcomas, which are a heterogeneous subcategory of soft tissue cancers and cervical squamous cell carcinoma.Algorithm | Cancer Status Acc (Train/Test) | Cancer Type Acc (Train/Test) |
---|---|---|
Random Forests | 99.60%/98.41% | 99.65%/95.20% |
UFDN-MSE | — | 96.51%/94.10% |
Unsupervised UDFN | 95.60%/86.14% | — |
Semi-supervised UDFN | 99.60%/98.41% | 96.51%/94.10% |
Finally, we analyzed our UFDN’s performance on two classification tasks: cancer status prediction and cancer type prediction. Table 1 reports the performances of our three UFDN classification algorithms as compared to a random forest baseline. The random forests had a maximum depth of 15 and were composed of 100 trees. The semi-supervised UFDN algorithm was able to match the performance of random forests on the cancer status task and was comparable on the cancer type task. Other UFDN algorithms were less successful compared to the baseline.
After interpolating 95 samples of SKCM from the test set into GBM, we analyzed which genes had significant changes in expression between the SKCM and 1.0-GBM samples. Using edgeR, we looked for genes that had differential expression that exceeded a significance threshold of . There were 10,557 genes that exceeded this threshold. Figure 4 shows the plot of average log fold change versus average log counts per million and highlights the differentially expressed genes between the two groups.
For the 10,557 differential expressed genes, we learned a shared basis using IntNMF. By varying the rank of that basis, we were able to decrease the reconstruction error across datasets SKCM, 0.25-GBM, 0.5-GBM, 0.75-GBM, and 1.0-GBM. Figure 5 reports how affected the reconstruction error. We chose for subsequent analysis based on the inflection point of this reconstruction curve. Hutchins et al. suggest that this is an optimal way to select for NMF [27]. was also chosen for computational considerations. Optimizing and for took nearly 7 hours and increasing much more would significantly increase this considerable time requirement.
Finally, we visualized the rows of for each dataset in SKCM, 0.25-GBM, 0.50-GBM, 0.75-GBM, 1.00-GBM. The columns of each heatmap in Figure 6 represent the relative activation of the respective metagene. As interpolation towards GBM increases, distinct metagenes increase their responsibility for reconstructing . In SKCM, metagene 36 has the most representation in the data. For 0.25-GBM, 0.50-GBM, and 0.75-GBM, it was metagenes 15, 32, and 1, respectively.
In the 1.00-GBM heatmap (Figure 6 E), we saw the increased activation of metagene 23. When we took 33 samples of TCGA GBM data from the test set and learned the matrix that minimized reconstruction error for the same, fixed, learned previously by IntNMF, we observed the same metagene 23 dominating (Figure 6 F).
We proceeded to analyze the dominant metagene for every dataset for GO term enrichment. In the interest of space, we only report the top 15 most enriched GO terms for metagene 23 based on p-value. Table 2 reports the GO term as well as p-value for each term.
GO ID | Term | p-value |
---|---|---|
GO:0003676 | Nucleic acid binding | 5.20E-19 |
GO:0003735 | Structural constituent of ribosome | 2.70E-15 |
GO:0003723 | RNA binding | 3.90E-14 |
GO:0003677 | DNA binding | 1.60E-12 |
GO:0005198 | Structural molecule activity | 3.80E-12 |
GO:0000981 | DNA-binding transcription factor activit… | 4.70E-12 |
GO:0003700 | DNA-binding transcription factor activit… | 3.50E-11 |
GO:0140110 | Transcription regulator activity | 2.80E-09 |
GO:0008376 | Acetylgalactosaminyltransferase activity | 4.10E-08 |
GO:0043492 | ATPase activity, coupled to movement of … | 1.00E-07 |
GO:0060089 | Molecular transducer activity | 1.30E-07 |
GO:0004126 | Cytidine deaminase activity | 2.10E-07 |
GO:0019239 | Deaminase activity | 4.50E-07 |
GO:0048020 | CCR chemokine receptor binding | 7.30E-07 |
GO:0008009 | Chemokine activity | 8.10E-07 |
Our UFDN was able to learn a biologically relevant latent space encoding of TCGA data. Classification task results in Table 1 indicate that our UFDN was able to compete with random forests that were trained on all 20,501 gene expression features. This indicates our algorithm was able to learn an efficient, useful embedding of gene expression data. Figure 2 demonstrates that we learned an encoding space that clustered cancers of the same domain. This likely facilitates successful interpolation and classification between cancer domains. Additionally, our UFDN could robustly interpolate into many cancer domains. Although Figure 3 demonstrates that not every cancer domain was easy for the UFDN to decode into, one thing to note is that almost every column (target domain) had at least one element with high decoding fraction. It’s possible to consider multiple interpolations and potentially from converting from domain A to B to C would have a higher success rate than going from A to C.
We learned 10,557 differentially expressed genes between SKCM and 1.0-GBM interpolated samples as demonstrated in Figure 4. This reduction in dimensionality allowed us to make IntNMF computationally tractable. The lower number of genes considered in IntNMF, the faster the learning of the shared basis and dataset specific . Analysis of the reconstruction error from IntNMF informed our choice of 60 metagenes (see Figure 5). In Figure 6, we investigated how linear combinations of these distinct metagenes reconstructed samples from many partially interpolated datasets. We observed unique metagenes increasing activation for each partial interpolation. This is an approximation of how gene expression profiles change during metastasis.
When we learned , the representation of TCGA GBM samples with respect to the basis , something remarkable happened. Note that was not informed by the TCGA dataset at all. was simply the shared basis trained by IntNMF on interpolation datasets SKCM (equivalently, 0.00-GBM), 0.25-GBM, 0.5-GBM, 0.75-GBM, and 1.0-GBM. Yet when and were compared side by side in Figure 6 E&F, their metagene activation profiles were dominated by the same metagene 23. Therefore, our interpolation from SKCM to GBM successfully recapitulated observed gene expression activity.
Furthermore, when we explored several of the GO terms identified by a GO term enrichment analysis, metagene 23 was enriched for terms related to glioblastoma. GO:0008376 represents a glycoprotein with a known association to glioblastoma [28, 29]. GO:0004126 refers to cytidine deaminase activity. Cytidine deaminase gene therapy has been identified as a potential treatment for glioblastoma[30, 31]. GO:0048020 and GO:0008009 are associated with chemokines, which are implicated in glioblastoma development [32, 33]. Our metagenes learned glioblastoma-specific genes and our UFDN interpolated skin cancer samples to glioblastoma. Further analysis of the metagenes activated during interpolations 0.25-GBM, 0.50-GBM, and 0.75-GBM could provide starting points for the investigation of the metastasis pathway from SKCM to GBM. This could help explain the association between melanoma and glioblastoma that is not currently understood [9, 10].
Our UFDN learned a biologically relevant latent space that facilitated meaningful interpolations between cancer domains. Our latent space can be used to generate more examples of transitions between cancers types. Our interpolations from SKCM to GBM have feasible biological interpretations and suggest possible gene expression changes during the mysterious transition from melanoma to glioblastoma.
We acknowledge the helpful suggestions of Dr. Devavrat Shah and Flora Meng on this project. Kompa is also indebted to the feedback of Scott Nanda and Kathryn Almon.
Voice conversion from unaligned corpora using variational autoencoding wasserstein generative adversarial networks.
April 2017.-d-galactosamine: PolypeptideN-Acetylgalactosaminyltransferase, designated pp-GalNAc-T13, that is specifically expressed in neurons and synthesizes GalNAc
-Serine/Threonine antigen. J. Biol. Chem., 278(1):573–584, 2003.