Learning a Generative Model of Cancer Metastasis

by Benjamin Kompa, et al.
Harvard University

We introduce a Unified Feature Disentanglement Network (UFDN) trained on The Cancer Genome Atlas (TCGA). We demonstrate that the UFDN learns a biologically relevant latent space of gene expression data by applying our network to two classification tasks: cancer status and cancer type. Our UFDN-specific algorithms perform comparably to random forest methods. The UFDN allows for continuous, partial interpolation between distinct cancer types. Furthermore, we perform an analysis of differentially expressed genes between skin cutaneous melanoma (SKCM) samples and the same samples interpolated into glioblastoma (GBM). We demonstrate that our interpolations learn relevant metagenes that recapitulate known glioblastoma mechanisms and suggest possible starting points for investigations into the metastasis of SKCM into GBM.




1 Introduction

Deep learning is being applied to many difficult problems in genomics and medicine. Alipanahi et al. used deep learning to learn site specific binding patterns of DNA and RNA-binding proteins [1]. Zhou et al. were able to predict non-coding variants using deep learning [2]. Google has even produced an improved variant caller known as DeepVariant [3].

More specifically, deep learning has been applied to understanding cancer prognosis. Chaudhary et al. were able to robustly predict survival in liver cancer [4]. Cruz-Roa et al. leveraged deep learning to quantify the extent of breast cancer tumors in imaging data [5]. Other groups have trained networks to identify metastatic breast cancer and lymph node metastasis [6, 7].

Nevertheless, little machine learning work has examined the changes that occur at the gene expression level in cancer samples. Understanding the genomic basis of cancer will yield better treatments and prognosis for patients [8]. There are significant questions remaining in oncology about the relationships between different cancer types. For instance, while there is an association between melanoma, a type of skin cancer, and glioblastoma, a type of brain cancer, little is known about the molecular underpinnings of this relationship [9, 10].

Recently, deep generative models such as variational autoencoders (VAEs) and generative adversarial networks (GANs) have made large advances in image, audio, and text generation [11, 12, 13]. VAEs and GANs learn generative distributions on lower-dimensional encodings of input data [14]. VAEs have found genomic applications. Rampasek et al. applied VAEs to learn drug responses based on gene expression data [15]. Way et al. trained a VAE called Tybalt to encode The Cancer Genome Atlas (TCGA) [14]. Huang et al. have developed a theory of cancer development as a progression along a low dimensional space, justifying exploration of cancer metastasis using machine learning algorithms that learn low dimensional representations [16].

A new VAE-GAN hybrid architecture known as the Unified Feature Disentanglement Network (UFDN) learns fundamental features that distinguish input domains [17]. For multiple input data types, such as photographs, sketches, and watercolor paintings, the UFDN learns a VAE encoding of the data domains and trains a discriminator in the latent space to discriminate between domain types. The UFDN can then encode data from one domain and decode the data into a different domain [17]. An additional GAN distinguishes between real and fake images in the pixel space to promote high quality decodings [17].

In this work, we apply this new UFDN architecture to TCGA RNA-Seq data and learn a latent space embedding that allows us to convert between different cancer types given gene expression data. Given gene expression levels in a cancer sample of domain $A$, we can predict gene expression levels as if that cancer sample were of domain $B$. This represents a generative, personalized model of metastasis. We can sample points in our latent space encoding and decode them into any new cancer domain.

Additionally, we can partially interpolate between cancer domains. UFDN decoding is not strictly binary—input data can be decoded into a mix of output domains. We investigate partial interpolations of one cancer type into another, mimicking the progressive nature of metastasis.

We analyze the performance of our TCGA-trained UFDN on two tasks: predicting whether a sample is from cancerous or normal tissue and predicting which cancer sub-type a sample consists of. Additionally, we investigate partial interpolations from skin cutaneous melanoma (SKCM) TCGA samples to glioblastoma (GBM) by looking at differential expression of genes. We compute metagenes that summarize gene expression changes using integrative non-negative matrix factorization. Finally, we analyze Gene Ontology (GO) term enrichment in highly activated metagenes for each interpolated dataset.

Figure 1: The overall workflow of our project. We aimed to identify the crucial changes in gene expression as cancer metastasizes from the original location to a new location. We encoded RNA-Seq samples from skin cutaneous melanoma, decoded them into glioblastoma, and then applied 3 bioinformatics tools to analyze which sets of genes were changing between cancer types.

2 Methods

2.1 Data Preprocessing

The data consisted of 10,433 samples of RNA-Seq gene expression levels across 33 cancer types for 20,501 genes from TCGA, obtained via the R package curatedTCGAData [18, 19]. For the purpose of this work, we only considered the RSEM (RNA-Seq by Expectation Maximization [20]) normalized expression levels. We divided the samples into train, test, and holdout datasets in proportions of 70%, 20%, and 10%, respectively.
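
The split described above can be illustrated with a minimal Python sketch. It assumes a plain random shuffle of sample indices; the text does not state whether the split was random or stratified by cancer type, and the function name `split_indices` and the seed are hypothetical.

```python
import random

def split_indices(n_samples, fractions=(0.7, 0.2, 0.1), seed=0):
    """Shuffle sample indices and cut them into train/test/holdout lists.

    Assumption: a plain random split; stratification by cancer type
    is not implemented here.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(fractions[0] * n_samples)
    n_test = int(fractions[1] * n_samples)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    holdout = indices[n_train + n_test:]  # remainder, roughly the last 10%
    return train, test, holdout

train, test, holdout = split_indices(10433)
print(len(train), len(test), len(holdout))
```

Giving the remainder to the holdout set guarantees that every sample lands in exactly one partition even when the fractions do not divide the sample count evenly.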

Tybalt demonstrated that preprocessing gene expression levels by scaling gene-wise expression levels (across all samples) to between 0 and 1 yields a trainable latent space [14]. We adapted this procedure by first clipping expression levels to fall within 3 standard deviations of the gene-wise mean, followed by the same min-max normalization used by Tybalt.
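
As a hedged illustration of this preprocessing, the sketch below clips one gene's expression values to within 3 standard deviations of its mean and then min-max scales them to [0, 1]. The function name `clip_and_scale`, the use of the population standard deviation, and the handling of constant genes are assumptions made for the example, not details taken from our pipeline.

```python
import statistics

def clip_and_scale(values, n_sd=3.0):
    """Clip one gene's values to mean +/- n_sd standard deviations,
    then min-max scale the clipped values to [0, 1].

    Returns the scaled values and the (min, max) of the clipped values,
    which are needed later to invert the scaling.
    Assumption: population standard deviation (statistics.pstdev).
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    lower, upper = mean - n_sd * sd, mean + n_sd * sd
    clipped = [min(max(v, lower), upper) for v in values]
    c_min, c_max = min(clipped), max(clipped)
    if c_max == c_min:
        # A constant gene carries no signal; map it to zeros (assumption).
        return [0.0] * len(values), (c_min, c_max)
    scaled = [(v - c_min) / (c_max - c_min) for v in clipped]
    return scaled, (c_min, c_max)

# One gene measured across 21 samples, with a single extreme outlier.
expression = [1.0] * 20 + [100.0]
scaled, bounds = clip_and_scale(expression)
```

Clipping before scaling keeps a single extreme sample from compressing all other samples of that gene toward zero.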


2.2 UFDN

2.2.1 Theory

Liu et al. develop a UFDN as a combination of an encoder $E$, a generator $G$, and two discriminators: $D_v$ in the latent space and $D_x$ in the pixel space [17]. In our application, pixel space is replaced by “gene expression space.” $E$ takes input data and encodes it in a latent space. In our UFDN, we encode gene expression using fully connected networks. $D_v$ learns to discriminate between domains, or cancer types. The generator $G$ then uses a latent space encoding $z$ and a domain vector $v$ to produce gene expression data in the domain specified by $v$ [17]. Our UFDN uses $v \in \mathbb{R}^{33}$ since there are 33 cancer types in TCGA.

We define a partial interpolation with parameter $\alpha$ of an input of domain $A$ to domain $B$ to be the decoding of the input into a composition of domains $A$ and $B$, with weight $\alpha$ given to domain $B$. That is, the domain vector $v$ of the partial interpolation has components $v_B = \alpha$, $v_A = 1 - \alpha$, and remaining components zero. For instance, a 0.25-GBM interpolation means an input has been decoded with $v_{\text{GBM}} = 0.25$, and the original domain entry is $0.75$.
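
The domain vector of a partial interpolation can be written down directly. The following sketch is illustrative only: the function name `interpolation_vector` and the integer domain indices are hypothetical, and the ordering of the 33 TCGA domains is not specified here.

```python
def interpolation_vector(source, target, alpha, n_domains=33):
    """Build the domain vector for a partial interpolation.

    The target domain receives weight alpha, the source domain receives
    1 - alpha, and every other component is zero.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    vector = [0.0] * n_domains
    vector[source] += 1.0 - alpha
    vector[target] += alpha
    return vector

# Hypothetical indices: domain 0 stands in for SKCM, domain 5 for GBM.
v = interpolation_vector(source=0, target=5, alpha=0.25)
```

With `alpha=0.0` the vector reduces to the one-hot encoding of the source domain, and with `alpha=1.0` to the one-hot encoding of the target domain.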

In the pixel space (or gene expression space), the discriminator $D_x$ learns to distinguish between samples that have been decoded to their original domain and samples that have been decoded to a new domain [17]. The network is trained by iterative stochastic gradient updates to the encoder, the generator, and the discriminators. For a more detailed exposition of the architecture and of the gradient updates used to train the UFDN, please see Section 3 of Liu et al. 2018 [17].

The encoder and generator are single layer networks, each with 500 hidden units, that learn a 100 dimensional latent space. The feature space discriminator is a single layer network with 64 hidden units, and the pixel space discriminator $D_x$ is a two layer network with 500 and 100 hidden units. All networks are fully connected with leaky ReLU activation functions. We use 50,000 iterations of Adam updates with a learning rate of


2.2.2 Classification Tasks

We attempted two classification tasks using the UFDN. The first was classifying a sample as tumor or normal. This is referred to as the cancer status task. The second task was predicting cancer domain, one of 33 sub-types in the TCGA.

In order to solve these tasks, we developed 3 algorithms using UFDN:

  • UFDN-MSE: classify a sample’s type by encoding the sample and decoding it into all 33 domains, predicting the domain with the lowest reconstruction error as measured by mean squared error (MSE).

  • Unsupervised UFDN: Inspired by the unsupervised domain adaptation experiments from Liu et al. [17], this algorithm predicts cancer status by encoding a sample into the latent space, then decoding it into the mesothelioma domain, regardless of input domain. We trained a random forest classifier to predict cancer status on mesothelioma training data. We then use the prediction of this classifier as the predicted cancer status in the original input domain. The motivation for this approach is that the classifier trained on mesothelioma data is strong but the test data of interest is of a different cancer type.

  • Semi-supervised UFDN: A hybrid of the two above algorithms used to predict cancer status and type. First, predict cancer type using UFDN-MSE. Then, predict cancer status using a random forest classifier trained on that specific type’s status data.
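
The UFDN-MSE rule in the list above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: `encode` and `decode` are placeholders for the trained encoder and generator, and `toy_encode`, `toy_decode`, and `PROTOTYPES` are hypothetical stand-ins invented so the example runs without a trained network.

```python
def one_hot(index, n_domains):
    """One-hot domain vector selecting a single domain."""
    return [1.0 if i == index else 0.0 for i in range(n_domains)]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def predict_domain_by_mse(sample, encode, decode, n_domains):
    """Encode once, decode into every domain, return the best-fitting domain."""
    z = encode(sample)
    errors = [mse(sample, decode(z, one_hot(d, n_domains)))
              for d in range(n_domains)]
    return min(range(n_domains), key=errors.__getitem__)

# Hypothetical stand-ins for a trained network: three domains, three genes.
PROTOTYPES = [[0.1, 0.9, 0.2], [0.8, 0.1, 0.7], [0.5, 0.5, 0.5]]

def toy_encode(sample):
    return sample  # placeholder for the trained encoder

def toy_decode(z, domain_vector):
    # Placeholder for the trained generator: ignores z and mixes prototypes.
    n_genes = len(PROTOTYPES[0])
    return [sum(w * proto[g] for w, proto in zip(domain_vector, PROTOTYPES))
            for g in range(n_genes)]

predicted = predict_domain_by_mse([0.75, 0.15, 0.65], toy_encode, toy_decode, 3)
```

In the semi-supervised variant, the predicted domain would then select which per-type random forest is used to predict cancer status.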

2.3 Interpolation Analysis

We encoded 95 samples of SKCM (skin cutaneous melanoma) from our test set partition of the TCGA into our latent space using our trained UFDN. Then, we interpolated the samples into glioblastoma (GBM) at four different fractions of interpolation: 25%, 50%, 75%, and 100%. The 100% interpolation represents a per-sample prediction of the gene expression levels of the SKCM samples as GBM.

In order to analyze how gene expression changed between SKCM samples and these samples as GBM, we performed a differential expression analysis using edgeR [21, 22]. This is an R package that uses a negative binomial distribution model to analyze significant gene expression changes between two groups [21, 22]. Although edgeR normally works with raw read counts, the package creator has more recently stated that RSEM normalized reads are also suitable for use with edgeR [23].

We applied the inverse transformation of our min-max normalization to our four interpolated datasets, since our UFDN decodes gene expression levels to the range [0,1]. Then we used edgeR to find differentially expressed genes between SKCM samples and 100% GBM interpolated samples. We set a p-value threshold for differential expression to control for false discovery.
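
For concreteness, inverting a gene-wise min-max scaling only requires the minimum and maximum that were used to scale that gene. The helper below is a minimal sketch; the name `inverse_min_max` and the example bounds are hypothetical.

```python
def inverse_min_max(scaled, gene_min, gene_max):
    """Map values in [0, 1] back to the original expression range of one gene."""
    return [gene_min + s * (gene_max - gene_min) for s in scaled]

# Decoded values for one gene whose training range was [2.0, 10.0].
restored = inverse_min_max([0.0, 0.5, 1.0], gene_min=2.0, gene_max=10.0)
```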

Analyzing every single gene that significantly changed between SKCM and GBM would be a challenge, so we used integrative Non-negative Matrix Factorization (IntNMF) to learn metagenes that summarized gene expression changes [24]. IntNMF learns a reduced dimensionality representation across multiple datasets [24]. IntNMF learns a shared basis matrix $H \in \mathbb{R}^{k \times g}$, where $g$ is the number of features (here, the differentially expressed genes) and $k$ is the number of metagenes, $k \ll g$. Each dataset $X_i$ is described by a learned matrix $W_i \in \mathbb{R}^{n_i \times k}$, where $n_i$ is the number of samples in the dataset [24]. Each row of $W_i$ represents the linear combination of metagenes of $H$ that combine to reconstruct the original sample in $X_i$ [24]. We chose $k$ based on an analysis of the reconstruction error $\lVert X_i - W_i H \rVert_F$, where $\lVert \cdot \rVert_F$ is the Frobenius norm. We learned $H$ and $W_i$ for each dataset using the R package IntNMF [24].

Every element $H_{kj}$ of column $j$ of $H$ is non-negative and represents the contribution of gene $j$ to the $k$-th metagene [24]. Each element $k$ of the $j$-th row of $W_i$ represents the contribution of metagene $k$ to the $j$-th sample of the $i$-th dataset. We can analyze how these metagenes change over the different interpolation datasets in order to understand how gene expression is changing [24].

Finally, to understand the broad composition of the metagenes discovered by IntNMF, we used Gene Ontology (GO) enrichment analysis. GO terms are an ontology of three categories: biological processes, molecular function, and cellular component. They link together information about the functions and relationships of genes and proteins. topGO is an R package that analyzes if GO terms, which have been mapped to genes, show up more often than expected in a set of genes and associated scores for each gene [25].

We used a Kolmogorov-Smirnov like test, known as Gene Set Enrichment Analysis (GSEA), that calculates p-values of enrichment based on a score for each gene [25]. In our work, we ran this test on each metagene derived from IntNMF, with a score assigned to each gene [25]. By looking at the top scoring GO terms for each metagene, we can understand what sorts of genes are changing as we interpolate between cancer types [25].

2.4 Code

All our code is available at https://github.com/bkompa/UFDN-TCGA. We used Liu et al.’s implementation of UFDN as a starting point but had to expand the architecture to work with an arbitrary number of domains. We wrote all other code used for analysis with the various packages mentioned above.

3 Results

3.1 UFDN Training and Performance

First, we validated that our UFDN learned a non-trivial latent space representation of TCGA RNA-Seq data. We projected both the TCGA data and latent space encodings into UMAP space [26]. UMAP learns a Riemannian manifold representation of the data [26]. We used hyper-parameters spread=2.0 and min_dist=0.01 to produce Figure 2. We observed distinct clusters by cancer type for both the original data and the encodings. Based on these UMAP projections, we proceeded with our downstream analysis confident that our UFDN had learned how to discern between cancer types.

Figure 2: UMAP projections of the RNA-Seq TCGA data (Figure 2A) and UFDN latent space encodings of said data (Figure 2B). The full 20,501 dimensional representation of gene expression levels has more cancer-specific clusters. The 100 dimensional latent space encodings of these samples still clustered in the UMAP space, though to a lesser extent.

Next, we estimated the ability of our UFDN to take data from a source domain (original cancer type) and interpolate these data into a target domain (new cancer type). We considered the fraction of the nearest neighbors, in the training data, of the interpolated samples that were in the target domain as a measure of success. These decoding rates are shown in Figure 3. There were certain cancers that the UFDN was able to interpolate into more robustly. These included glioblastoma, acute myeloid leukemia, mesothelioma, and prostate adenocarcinoma, among others. Cancers that were difficult to interpolate into included sarcomas, which are a heterogeneous subcategory of soft tissue cancers, and cervical squamous cell carcinoma.
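
The decoding rate described above can be sketched as follows. This is an illustration under assumptions: Euclidean distance is assumed as the neighbor metric, the neighbor count `k` is a free parameter, and the function name `target_neighbor_fraction` and the toy two-dimensional points are hypothetical.

```python
import math

def target_neighbor_fraction(decoded, train_points, train_labels, target, k):
    """Fraction of the k nearest training neighbors, pooled over all decoded
    samples, that belong to the target domain.

    Assumption: Euclidean distance in expression space.
    """
    hits = 0
    for sample in decoded:
        order = sorted(range(len(train_points)),
                       key=lambda i: math.dist(sample, train_points[i]))
        hits += sum(1 for i in order[:k] if train_labels[i] == target)
    return hits / (k * len(decoded))

# Toy data: two well-separated clusters standing in for two cancer types.
train_points = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_labels = ["SKCM", "SKCM", "SKCM", "GBM", "GBM", "GBM"]
decoded = [[5.2, 5.1], [0.4, 0.2]]  # one lands in GBM, one stays in SKCM

rate = target_neighbor_fraction(decoded, train_points, train_labels, "GBM", k=3)
```

A rate near 1 indicates that decoded samples sit among real training samples of the target domain, while a rate near 0 indicates that the decoding stayed close to other domains.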

Figure 3: The fraction of nearest neighbors that were in the target domain (the rows of the figures) after decoding from a source domain (the columns of the figures). Some domains were noticeably more difficult to interpolate into. Glioblastoma had strong interpolation results across .

| Algorithm | Cancer Status Acc (Train/Test) | Cancer Type Acc (Train/Test) |
|---|---|---|
| Random Forests | 99.60%/98.41% | 99.65%/95.20% |
| UFDN-MSE | n/a | 96.51%/94.10% |
| Unsupervised UFDN | 95.60%/86.14% | n/a |
| Semi-supervised UFDN | 99.60%/98.41% | 96.51%/94.10% |

Table 1: Results on two classification tasks compared to a random forest baseline.

Finally, we analyzed our UFDN’s performance on two classification tasks: cancer status prediction and cancer type prediction. Table 1 reports the performances of our three UFDN classification algorithms as compared to a random forest baseline. The random forests had a maximum depth of 15 and were composed of 100 trees. The semi-supervised UFDN algorithm was able to match the performance of random forests on the cancer status task and was comparable on the cancer type task. Other UFDN algorithms were less successful compared to the baseline.

3.2 Gene Expression Changes

After interpolating 95 samples of SKCM from the test set into GBM, we analyzed which genes had significant changes in expression between the SKCM and 1.0-GBM samples. Using edgeR, we looked for genes whose differential expression exceeded our significance threshold. There were 10,557 genes that exceeded this threshold. Figure 4 shows the plot of average log fold change versus average log counts per million and highlights the differentially expressed genes between the two groups.

Figure 4: Differentially expressed (DE) genes at the Bonferroni corrected p-value threshold. These genes are shown in red, while non-DE genes are in black. 10,557 genes were differentially expressed between skin cancer samples and 1.0-glioblastoma interpolated skin cancer samples.

For the 10,557 differentially expressed genes, we learned a shared basis $H$ using IntNMF. By varying the rank $k$ of that basis, we were able to decrease the reconstruction error across datasets SKCM, 0.25-GBM, 0.5-GBM, 0.75-GBM, and 1.0-GBM. Figure 5 reports how $k$ affected the reconstruction error. We chose $k = 60$ for subsequent analysis based on the inflection point of this reconstruction curve. Hutchins et al. suggest that this is an optimal way to select $k$ for NMF [27]. $k = 60$ was also chosen for computational considerations. Optimizing $H$ and $W_i$ for $k = 60$ took nearly 7 hours, and increasing $k$ much more would significantly increase this considerable time requirement.

Figure 5: Reconstruction error based on the Frobenius norm from IntNMF versus $k$, the rank of $H$ and $W_i$ in IntNMF, shown on the x-axis as the number of metagenes. For subsequent analysis, $k$ was chosen to be 60, as the error is nearly at an inflection point and plateauing.

Finally, we visualized the rows of $W_i$ for each dataset in SKCM, 0.25-GBM, 0.50-GBM, 0.75-GBM, 1.00-GBM. The columns of each heatmap in Figure 6 represent the relative activation of the respective metagene. As interpolation towards GBM increases, distinct metagenes increase their responsibility for reconstructing $X_i$. In SKCM, metagene 36 has the most representation in the data. For 0.25-GBM, 0.50-GBM, and 0.75-GBM, it was metagenes 15, 32, and 1, respectively.
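
One simple way to summarize a heatmap of this kind is to pick the metagene with the largest mean activation across samples. The sketch below assumes that criterion, which is an illustrative choice rather than the procedure behind Figure 6; the function name `dominant_metagene` and the toy matrix are hypothetical, and indices are 0-based.

```python
def dominant_metagene(w):
    """Return the 0-based index of the column of w with the largest mean.

    w is a samples x metagenes matrix stored as a list of rows.
    Assumption: 'dominant' means highest mean activation across samples.
    """
    n_metagenes = len(w[0])
    column_means = [sum(row[j] for row in w) / len(w)
                    for j in range(n_metagenes)]
    return max(range(n_metagenes), key=column_means.__getitem__)

# Toy activation matrix: 2 samples x 3 metagenes.
top = dominant_metagene([[0.1, 0.7, 0.2], [0.0, 0.9, 0.1]])
```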

In the 1.00-GBM heatmap (Figure 6 E), we saw the increased activation of metagene 23. When we took 33 samples of TCGA GBM data from the test set and learned the matrix $W_{\text{GBM}}$ that minimized reconstruction error for the same fixed $H$ learned previously by IntNMF, we observed the same metagene 23 dominating (Figure 6 F).
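
Fitting a non-negative weight matrix against a fixed basis can be sketched with the standard multiplicative update for the Frobenius objective. The optimizer used for this step is not specified in the text, so the update rule here is an assumption chosen for illustration; the helper names and the toy matrices are hypothetical, and the code uses plain Python lists rather than an NMF library.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(x, w, h):
    """Frobenius norm of X - W H."""
    wh = matmul(w, h)
    return sum((xv - rv) ** 2
               for x_row, r_row in zip(x, wh)
               for xv, rv in zip(x_row, r_row)) ** 0.5

def fit_weights_fixed_basis(x, h, n_iter=500, seed=0, eps=1e-12):
    """Find non-negative W (samples x metagenes) with X ~= W H, H held fixed.

    Uses the multiplicative update W <- W * (X H^T) / (W H H^T), which keeps
    every entry of W non-negative.
    """
    rng = random.Random(seed)
    n_samples, k = len(x), len(h)
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n_samples)]
    h_t = transpose(h)
    x_ht = matmul(x, h_t)  # samples x k, constant across iterations
    h_ht = matmul(h, h_t)  # k x k, constant across iterations
    for _ in range(n_iter):
        w_hht = matmul(w, h_ht)
        w = [[w[i][j] * x_ht[i][j] / (w_hht[i][j] + eps) for j in range(k)]
             for i in range(n_samples)]
    return w

# Toy problem: 2 metagenes over 4 genes, 3 samples built from known weights.
h = [[1.0, 0.0, 2.0, 1.0], [0.0, 3.0, 1.0, 1.0]]
w_true = [[1.0, 2.0], [0.5, 3.0], [2.0, 1.0]]
x = matmul(w_true, h)
w_fit = fit_weights_fixed_basis(x, h)
print(frobenius_error(x, w_fit, h))
```

Because the basis is never updated, any agreement between the fitted weights of held-out samples and those of the interpolated samples cannot come from the basis adapting to the held-out data.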

Figure 6: Heatmap visualization of the $W_i$ matrices for each interpolation of the SKCM test data set. No row or column reordering was done, to keep a consistent metagene order across datasets. A full interpolation of SKCM data into GBM data results in a consistent activation of metagene 23 (Figure 6E). This is replicated in Figure 6F, which shows $W_{\text{GBM}}$ optimized against the fixed basis learned for the other 5 datasets.

We proceeded to analyze the dominant metagene for every dataset for GO term enrichment. In the interest of space, we only report the top 15 most enriched GO terms for metagene 23 based on p-value. Table 2 reports the GO term as well as p-value for each term.

| GO ID | Term | p-value |
|---|---|---|
| GO:0003676 | Nucleic acid binding | 5.20E-19 |
| GO:0003735 | Structural constituent of ribosome | 2.70E-15 |
| GO:0003723 | RNA binding | 3.90E-14 |
| GO:0003677 | DNA binding | 1.60E-12 |
| GO:0005198 | Structural molecule activity | 3.80E-12 |
| GO:0000981 | DNA-binding transcription factor activit… | 4.70E-12 |
| GO:0003700 | DNA-binding transcription factor activit… | 3.50E-11 |
| GO:0140110 | Transcription regulator activity | 2.80E-09 |
| GO:0008376 | Acetylgalactosaminyltransferase activity | 4.10E-08 |
| GO:0043492 | ATPase activity, coupled to movement of … | 1.00E-07 |
| GO:0060089 | Molecular transducer activity | 1.30E-07 |
| GO:0004126 | Cytidine deaminase activity | 2.10E-07 |
| GO:0019239 | Deaminase activity | 4.50E-07 |
| GO:0048020 | CCR chemokine receptor binding | 7.30E-07 |
| GO:0008009 | Chemokine activity | 8.10E-07 |

Table 2: The top 15 Gene Ontology terms enriched in metagene 23.

4 Discussion

Our UFDN was able to learn a biologically relevant latent space encoding of TCGA data. Classification task results in Table 1 indicate that our UFDN was able to compete with random forests that were trained on all 20,501 gene expression features. This indicates our algorithm was able to learn an efficient, useful embedding of gene expression data. Figure 2 demonstrates that we learned an encoding space that clustered cancers of the same domain. This likely facilitates successful interpolation and classification between cancer domains. Additionally, our UFDN could robustly interpolate into many cancer domains. Although Figure 3 demonstrates that not every cancer domain was easy for the UFDN to decode into, almost every column (target domain) had at least one element with a high decoding fraction. It is therefore possible to chain multiple interpolations: converting from domain A to B to C could potentially have a higher success rate than going directly from A to C.

We identified 10,557 differentially expressed genes between SKCM and 1.0-GBM interpolated samples, as shown in Figure 4. This reduction in dimensionality made IntNMF computationally tractable. The fewer genes considered in IntNMF, the faster the learning of the shared basis $H$ and the dataset-specific $W_i$. Analysis of the reconstruction error from IntNMF informed our choice of 60 metagenes (see Figure 5). In Figure 6, we investigated how linear combinations of these distinct metagenes reconstructed samples from many partially interpolated datasets. We observed unique metagenes increasing activation for each partial interpolation. This is an approximation of how gene expression profiles change during metastasis.

When we learned $W_{\text{GBM}}$, the representation of TCGA GBM samples with respect to the basis $H$, something remarkable happened. Note that $H$ was not informed by the TCGA GBM samples at all. $H$ was simply the shared basis trained by IntNMF on interpolation datasets SKCM (equivalently, 0.00-GBM), 0.25-GBM, 0.5-GBM, 0.75-GBM, and 1.0-GBM. Yet when $W_{1.0\text{-GBM}}$ and $W_{\text{GBM}}$ were compared side by side in Figure 6 E&F, their metagene activation profiles were dominated by the same metagene 23. Therefore, our interpolation from SKCM to GBM successfully recapitulated observed gene expression activity.

Furthermore, when we explored several of the GO terms identified by a GO term enrichment analysis, metagene 23 was enriched for terms related to glioblastoma. GO:0008376 refers to acetylgalactosaminyltransferase activity, a glycosylation function with a known association to glioblastoma [28, 29]. GO:0004126 refers to cytidine deaminase activity. Cytidine deaminase gene therapy has been identified as a potential treatment for glioblastoma [30, 31]. GO:0048020 and GO:0008009 are associated with chemokines, which are implicated in glioblastoma development [32, 33]. Our metagenes learned glioblastoma-specific genes and our UFDN interpolated skin cancer samples to glioblastoma. Further analysis of the metagenes activated during interpolations 0.25-GBM, 0.50-GBM, and 0.75-GBM could provide starting points for the investigation of the metastasis pathway from SKCM to GBM. This could help explain the association between melanoma and glioblastoma that is not currently understood [9, 10].

5 Conclusion

Our UFDN learned a biologically relevant latent space that facilitated meaningful interpolations between cancer domains. Our latent space can be used to generate more examples of transitions between cancer types. Our interpolations from SKCM to GBM have feasible biological interpretations and suggest possible gene expression changes during the poorly understood transition from melanoma to glioblastoma.


We acknowledge the helpful suggestions of Dr. Devavrat Shah and Flora Meng on this project. Kompa is also indebted to the feedback of Scott Nanda and Kathryn Almon.


  • [1] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol., 33(8):831–838, August 2015.
  • [2] Jian Zhou and Olga G Troyanskaya. Predicting effects of noncoding variants with deep learning-based sequence model. Nat. Methods, 12(10):931–934, October 2015.
  • [3] Chris Anderson. Google’s AI tool DeepVariant promises significantly fewer genome errors. Clinical OMICs, 5(1):33–33, January 2018.
  • [4] Kumardeep Chaudhary, Olivier B Poirion, Liangqun Lu, and Lana X Garmire. Deep Learning–Based Multi-Omics integration robustly predicts survival in liver cancer. Clin. Cancer Res., 24(6):1248–1259, March 2018.
  • [5] Angel Cruz-Roa, Hannah Gilmore, Ajay Basavanhally, Michael Feldman, Shridar Ganesan, Natalie N C Shih, John Tomaszewski, Fabio A González, and Anant Madabhushi. Accurate and reproducible invasive breast cancer detection in whole-slide images: A deep learning approach for quantifying tumor extent. Sci. Rep., 7:46450, April 2017.
  • [6] Dayong Wang, Aditya Khosla, Rishab Gargeya, Humayun Irshad, and Andrew H Beck. Deep learning for identifying metastatic breast cancer. June 2016.
  • [7] Babak Ehteshami Bejnordi, Mitko Veta, Paul Johannes van Diest, Bram van Ginneken, Nico Karssemeijer, Geert Litjens, Jeroen A W M van der Laak, Meyke Hermsen, Quirine F Manson, Maschenka Balkenhol, Oscar Geessink, Nikolaos Stathonikos, Marcory Crf van Dijk, Peter Bult, Francisco Beca, Andrew H Beck, Dayong Wang, Aditya Khosla, Rishab Gargeya, Humayun Irshad, Aoxiao Zhong, Qi Dou, Quanzheng Li, Hao Chen, Huang-Jing Lin, Pheng-Ann Heng, Christian Haß, Elia Bruni, Quincy Wong, Ugur Halici, Mustafa Ümit Öner, Rengul Cetin-Atalay, Matt Berseth, Vitali Khvatkov, Alexei Vylegzhanin, Oren Kraus, Muhammad Shaban, Nasir Rajpoot, Ruqayya Awan, Korsuk Sirinukunwattana, Talha Qaiser, Yee-Wah Tsang, David Tellez, Jonas Annuscheit, Peter Hufnagl, Mira Valkonen, Kimmo Kartasalo, Leena Latonen, Pekka Ruusuvuori, Kaisa Liimatainen, Shadi Albarqouni, Bharti Mungal, Ami George, Stefanie Demirci, Nassir Navab, Seiryo Watanabe, Shigeto Seno, Yoichi Takenaka, Hideo Matsuda, Hady Ahmady Phoulady, Vassili Kovalev, Alexander Kalinovsky, Vitali Liauchuk, Gloria Bueno, M Milagro Fernandez-Carrobles, Ismael Serrano, Oscar Deniz, Daniel Racoceanu, and Rui Venâncio. Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA, 318(22):2199–2210, December 2017.
  • [8] Bissan Al-Lazikani, Udai Banerji, and Paul Workman. Combinatorial drug therapy for cancer in the post-genomic era. Nat. Biotechnol., 30(7):679–692, July 2012.
  • [9] A S Desai and S A Grossman. Association of melanoma with glioblastoma multiforme. J. Clin. Orthod., 26(15_suppl):2082–2082, May 2008.
  • [10] Peter M Scarbrough, Igor Akushevich, Margaret Wrensch, and Dora Il’yasova. Exploring the association between melanoma and glioma risks. Ann. Epidemiol., 24(6):469–474, June 2014.
  • [11] Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, and Hsin-Min Wang. Voice conversion from unaligned corpora using variational autoencoding Wasserstein generative adversarial networks. April 2017.
  • [12] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. December 2015.
  • [13] Yunchen Pu, Zhe Gan, Ricardo Henao, Xin Yuan, Chunyuan Li, Andrew Stevens, and Lawrence Carin. Variational autoencoder for deep learning of images, labels and captions. In D D Lee, M Sugiyama, U V Luxburg, I Guyon, and R Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2352–2360. Curran Associates, Inc., 2016.
  • [14] Gregory P Way and Casey S Greene. Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pac. Symp. Biocomput., 23:80–91, 2018.
  • [15] Ladislav Rampasek, Daniel Hidru, Petr Smirnov, Benjamin Haibe-Kains, and Anna Goldenberg. Dr.VAE: Drug response variational autoencoder. June 2017.
  • [16] Sui Huang, Ingemar Ernberg, and Stuart Kauffman. Cancer attractors: a systems view of tumors from a gene network dynamics and developmental perspective. Semin. Cell Dev. Biol., 20(7):869–876, September 2009.
  • [17] Alexander H Liu, Yen-Cheng Liu, Yu-Ying Yeh, and Yu-Chiang Frank Wang. A unified feature disentangler for Multi-Domain image translation and manipulation. In S Bengio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi, and R Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2595–2604. Curran Associates, Inc., 2018.
  • [18] Cancer Genome Atlas Research Network, John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya Shmulevich, Chris Sander, and Joshua M Stuart. The cancer genome atlas Pan-Cancer analysis project. Nat. Genet., 45(10):1113–1120, October 2013.
  • [19] Marcel Ramos. curatedTCGAData: Curated data from the cancer genome atlas (TCGA) as MultiAssayExperiment objects, 2018.
  • [20] Bo Li and Colin N Dewey. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12:323, August 2011.
  • [21] Mark D Robinson, Davis J McCarthy, and Gordon K Smyth. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1):139–140, January 2010.
  • [22] Davis J McCarthy, Yunshun Chen, and Gordon K Smyth. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res., 40(10):4288–4297, May 2012.
  • [23] Gordon Smyth. EdgeR bioconductor support. https://support.bioconductor.org/p/65890/#65910, April 2015. Accessed: 2018-12-11.
  • [24] Prabhakar Chalise and Brooke L Fridley. Integrative clustering of multi-level ‘omic data based on non-negative matrix factorization algorithm. PLoS One, 12(5):e0176278, 2017.
  • [25] Adrian Alexa and Jorg Rahnenfuhrer. topGO: enrichment analysis for gene ontology. R package version, 2(0), 2010.
  • [26] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. February 2018.
  • [27] Lucie N Hutchins, Sean M Murphy, Priyam Singh, and Joel H Graber. Position-dependent motif characterization using non-negative matrix factorization. Bioinformatics, 24(23):2684–2690, December 2008.
  • [28] Yan Zhang, Hiroko Iwasaki, Han Wang, Takashi Kudo, Timothy B Kalka, Thierry Hennet, Tomomi Kubota, Lamei Cheng, Niro Inaba, Masanori Gotoh, and Others. Cloning and characterization of a new human UDP-N-Acetyl-α-d-galactosamine: Polypeptide N-Acetylgalactosaminyltransferase, designated pp-GalNAc-T13, that is specifically expressed in neurons and synthesizes GalNAc α-Serine/Threonine antigen. J. Biol. Chem., 278(1):573–584, 2003.
  • [29] Roger A Kroes, Glyn Dawson, and Joseph R Moskal. Focused microarray analysis of glyco-gene expression in human glioblastomas. J. Neurochem., 103:14–24, 2007.
  • [30] Ute Fischer, Sabine Steffens, Susanne Frank, Nikolai G Rainov, Klaus Schulze-Osthoff, and Christof M Kramm. Mechanisms of thymidine kinase/ganciclovir and cytosine deaminase/ 5-fluorocytosine suicide gene therapy-induced cell death in glioma cells. Oncogene, 24(7):1231–1243, February 2005.
  • [31] C Ryan Miller, Christopher R Williams, Donald J Buchsbaum, and G Yancey Gillespie. Intratumoral 5-fluorouracil produced by cytosine deaminase/5-fluorocytosine gene therapy is effective for experimental human glioblastomas. Cancer Res., 62(3):773–780, February 2002.
  • [32] Yan Zhou, Peter H Larsen, Chunhai Hao, and V Wee Yong. CXCR4 is a major chemokine receptor on glioma cells and mediates their survival. J. Biol. Chem., 277(51):49481–49487, December 2002.
  • [33] S A Rempel, S Dudas, S Ge, and J A Gutiérrez. Identification and localization of the cytokine SDF1 and its receptor, CXC chemokine receptor 4, to regions of necrosis and angiogenesis in human glioblastoma. Clin. Cancer Res., 6(1):102–111, January 2000.