1 Introduction
Many applications have multiview data. Notably, massive amounts of patient data with multiomic profiling have been accumulated during the past few years. For example, the TCGA network Hutter & Zenklusen (2018) has generated comprehensive multiomic molecular profiling for more than 10,000 patients from 33 cancer types. Each type of omic data (e.g., genomic, transcriptomic, proteomic, and epigenomic) represents one view of the same set of patients. Each view has a different feature set (for example, gene features, miRNA features, protein features, etc.) and can provide information complementary to the other views. Integrative analysis of multiomic data is important for predicting cancer (sub)types and disease progression, but is very challenging. Currently, most results generated by the TCGA network are based on statistical analysis, though machine learning approaches are increasingly used to tackle these problems.
Meanwhile, in the past decade, deep learning has brought about significant breakthroughs in computer vision, speech recognition, natural language processing and other fields LeCun et al. (2015). However, conventional deep learning models require massive training data with clearly defined structures (such as images, audio, and natural languages), and are not suitable for multiomic integrative analysis.
In this work, we propose a model termed Multiview Factorization AutoEncoder (MAE), which combines ideas from multiview learning Zhao et al. (2017) and matrix factorization Bell et al. (2009) with deep learning, in order to utilize the great representation power of deep learning models. The backbone of a Multiview Factorization AutoEncoder model consists of multiple autoencoders (one for each view) as submodules, and a submodule that combines multiple views for supervised learning.
To alleviate overfitting and overcome the “big p, small N” problem, we incorporate molecular interaction networks as graph constraints into our training objective. These feature interaction networks are derived from public knowledgebases, thus enabling our model to incorporate domain knowledge.
Since all views are from the same set of patients, the patient similarity networks derived from these learned views should share some information, too. As a result, in addition to feature interaction network constraints, we also added patient similarity network constraints to our training objective. Our model equipped with feature interaction network and patient similarity network constraints performs better than traditional machine learning methods such as SVM and Random Forest on the TCGA Pancancer dataset Hutter & Zenklusen (2018).
2 Related work
Multiomic data analysis has been a hot topic in cancer genomics Hutter & Zenklusen (2018); Ebrahim et al. (2016); Henry et al. (2014). Most of the work has focused on comprehensive molecular characterization of individual cancer types Hutter & Zenklusen (2018); Shen et al. (2018), mainly employing statistical analysis of molecular features associated with clinical outcomes. Machine learning approaches have also been applied to study individual omic data types Malta et al. (2018) and to integrate multiomic data Way et al. (2018); Angione et al. (2016). These approaches mainly employ traditional machine learning techniques, for example, logistic regression Malta et al. (2018), random forest Way et al. (2018), and similarity network fusion Angione et al. (2016).
Many machine learning approaches for multiomic data analysis fall into the category of unsupervised clustering, such as iCluster Shen et al. (2012), SNF Wang et al. (2014), ANF Ma & Zhang (2017), etc. These approaches are based either on probabilistic models Shen et al. (2012) or on network-based regularization Hofree et al. (2013). While deep learning approaches have been applied to sequencing data Alipanahi et al. (2015); Boža et al. (2017), imaging data Wang et al. (2016), medical records Pham et al. (2016), and other individual data types, few have focused on integrating multiomic data. Cotraining, coregularization and margin consistency approaches have been developed for multiview learning Zhao et al. (2017), while integrating deep learning with multiview learning is still an open frontier Zhao et al. (2017).
Our work is closely related to multimodality deep learning Baltrušaitis et al. (2018), which has been successfully applied to combine audio and video features Ngiam et al. (2011) by employing shared feature representations. In addition to integrating multimodality data, our proposed approach can learn feature representations and object (patient) embeddings simultaneously, which enables us to integrate feature interaction networks as domain knowledge, as well as to enforce view similarity network constraints in the training objective.
Besides adding regularizers to training objectives, another way to incorporate biological networks into the model is to directly encode biological networks into the model architecture Hu et al. (2018); Ma et al. (2018). However, these approaches usually require subcellular hierarchical molecular networks, for which high-quality data are not available for humans (though a few datasets are available for simple organisms such as bacteria). Given that most human biological interaction networks, such as protein-protein interaction networks, are highly incomplete and noisy, adding network regularizers to the training objective instead of directly encoding the noisy interaction network into the model architecture provides more flexibility and reduces the risk of adopting a potentially wrong model architecture.
Our work is also related to matrix factorization Bell et al. (2009) and autoencoders, which will be reviewed briefly in the next section along with the detailed description of our proposed method.
3 Multiview Factorization AutoEncoder
Notations
Suppose there are $N$ samples and $V$ types of omic data. We refer to each omic data type as a view.
We represent the $V$ types of omic data using a series of sample-feature matrices: $M^{(v)} \in \mathbb{R}^{N \times p^{(v)}}$, $v = 1, 2, \dots, V$. $p^{(v)}$ is the feature dimension for view $v$.
In the following, we first describe a framework for a single view, then describe how to integrate multiple views. When describing a single view, we drop the superscript $(v)$ for simplicity. When describing a matrix $A$, we use $A_{ij}$ to represent the element in the $i$th row and $j$th column, $A_{i\cdot}$ to represent the $i$th row vector, and $A_{\cdot j}$ to represent the $j$th column vector.
Let $M \in \mathbb{R}^{N \times p}$ be a sample-feature matrix, with rows corresponding to samples and columns to features. These features are not independent. We can represent the interactions among these features with a graph $G \in \mathbb{R}^{p \times p}$. For example, if the features are protein expressions, then $G$ can be a protein-protein interaction network, which can be obtained from public knowledgebases such as STRING Szklarczyk et al. (2014), Reactome Croft et al. (2013), etc. $G$ can be a weighted graph with nonnegative elements, or an unweighted graph with elements being either 0 or 1.
Denote $L_G = D - G$ as the graph Laplacian of $G$ ($D$ is a diagonal matrix with $D_{ii} = \sum_{j} G_{ij}$).
3.1 Low-rank matrix factorization
Matrix factorization Bell et al. (2009) and its variants are commonly used for dimensionality reduction and clustering. In many real-world applications it is reasonable to assume that $M$ has low rank. In order to identify the underlying sample clusters, low-rank matrix factorization can be applied to $M$:
$M \approx XY$, where $X \in \mathbb{R}^{N \times k}$ and $Y \in \mathbb{R}^{k \times p}$.
In order to find a good solution $(X, Y)$, some constraints are usually added as regularizers in the objective function or enforced in the learning algorithm. For example, if $M$ is a nonnegative matrix (e.g., a gene feature count matrix), Nonnegative Matrix Factorization (NMF) Lee & Seung (2001) is often used to ensure both $X$ and $Y$ are nonnegative.
In general, the objective function can be formulated as follows:
$$\min_{X, Y} \; \|M - XY\|_F^2 + R(X, Y) \qquad (1)$$
$R(X, Y)$ is a regularizer for $X$ and $Y$. For example, $R(X, Y)$ can include $\ell_1$ and $\ell_2$ norms for $X$ and $Y$. More importantly, structural constraints based on biological interaction networks can also be incorporated into $R(X, Y)$, which will be discussed later.
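As a rough illustration of the unregularized case (the regularizer dropped), the sketch below factorizes a toy matrix with a truncated SVD, which gives the best rank-$k$ approximation in Frobenius norm. The truncated SVD is a stand-in technique chosen for illustration, not the learning algorithm used in this paper; the function name `low_rank_factorize` and the toy sizes are assumptions.

```python
import numpy as np

def low_rank_factorize(M, k):
    """Return X (N x k) and Y (k x p) such that X @ Y is the best rank-k
    approximation of M in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    X = U[:, :k] * s[:k]   # sample-factor matrix
    Y = Vt[:k, :]          # factor-feature matrix
    return X, Y

rng = np.random.default_rng(0)
# Toy data: an exactly rank-3 matrix with 20 samples and 8 features.
M = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 8))
X, Y = low_rank_factorize(M, k=3)
rel_error = np.linalg.norm(M - X @ Y) / np.linalg.norm(M)
```

Because the toy matrix has rank exactly 3, the relative reconstruction error is at the level of floating-point noise; with real omic data the error would depend on the chosen $k$.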
Interpretation
Suppose there are $k$ factors that fully characterize these $N$ samples. $X$ can be seen as a sample-factor matrix. These factors are not directly observable. Instead, we observe $M$, which can be seen as a linear transformation of $X$, and $Y$ can be seen as the matrix of such a linear transformation from $\mathbb{R}^k$ to $\mathbb{R}^p$. The rows of $Y$ can be seen as a basis for the underlying factor space. Therefore $M$ is generated by a linear transformation from $X$, the inherent nonredundant representation of samples. In a sense, this formulation can be seen as a shallow linear generative model.
Limitations
The limitations of this simple matrix factorization model arise from its shallow linear structure. The representation power of linear models is very limited. In most cases, the transformations are nonlinear. To increase the model representation capacity, we will discuss nonlinear factorization with multilayer neural networks, which can approximate complex nonlinear transformations given sufficient data.
3.2 Nonlinear factorization with AutoEncoder
Instead of direct matrix factorization, we can use an autoencoder to reconstruct the observable sample-feature matrix $M$. While direct matrix factorization, which can be seen as a one-layer autoencoder, is limited in its ability to model nonlinear relationships, a multilayer autoencoder can approximate complex nonlinear transformations well.
We use a multilayer neural network with parameters $\Theta_e$ as the encoder:
$$X = \mathrm{Encoder}(M; \Theta_e) \in \mathbb{R}^{N \times k} \qquad (2)$$
Again $X$ can be seen as a nonredundant factor matrix that contains essential information for all samples. We are using a multilayer neural network to transform the observable sample-feature matrix $M$ to its latent representation $X$.
The decoder is a transformation from the latent factor space to the reconstructed feature space:
$$Z = \mathrm{Decoder}(X; \Theta_d) \in \mathbb{R}^{N \times p} \qquad (3)$$
As the entire autoencoder is a multilayer neural network, we can arbitrarily split it into the encoder and the decoder components. For the convenience of incorporating biological interaction networks into the framework, we make the encoder (Eq. 2) contain all layers but the last one, and the decoder contain only the last linear layer. The parameter of the decoder (Eq. 3) is simply a linear transformation matrix as in matrix factorization:
$$\Theta_d = Y \in \mathbb{R}^{k \times p} \qquad (4)$$
Therefore the reconstructed signal is
$$Z = \mathrm{Encoder}(M; \Theta_e) \, Y = XY \qquad (5)$$
The reconstruction error can be calculated with the Frobenius norm: $\|M - Z\|_F^2$.
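The split between a multilayer encoder and a single linear decoder layer can be sketched as a plain forward pass. This is a minimal NumPy illustration rather than the authors' PyTorch implementation: the ReLU activation, the layer sizes, the weight initialization, and the names `init_params`, `encode` and `reconstruct` are all assumptions made for the example.

```python
import numpy as np

def init_params(p, hidden, k, rng):
    """Random weights: two encoder layers plus the linear decoder matrix Y (k x p)."""
    return {
        "W1": rng.normal(scale=0.1, size=(p, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.1, size=(hidden, k)),
        "b2": np.zeros(k),
        "Y": rng.normal(scale=0.1, size=(k, p)),
    }

def encode(M, params):
    """Encoder: every layer except the last one (ReLU is an assumed activation)."""
    h = np.maximum(M @ params["W1"] + params["b1"], 0.0)
    return np.maximum(h @ params["W2"] + params["b2"], 0.0)

def reconstruct(M, params):
    """Return the latent sample representation X and the reconstruction Z = X @ Y."""
    X = encode(M, params)
    Z = X @ params["Y"]    # decoder: one linear layer, as in matrix factorization
    return X, Z

rng = np.random.default_rng(0)
M = rng.normal(size=(16, 30))                      # 16 samples, 30 features
params = init_params(p=30, hidden=12, k=5, rng=rng)
X, Z = reconstruct(M, params)
recon_error = np.linalg.norm(M - Z, "fro") ** 2    # squared Frobenius norm
```

Keeping the decoder as a single matrix means its columns can be read directly as feature embeddings, which is what makes graph constraints on features easy to attach.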
This formulation is different from matrix factorization in that the encoder is a multilayer neural network that can learn complex nonlinear transformations through backpropagation. In addition, the output of the encoder $X$ can be seen as the learned representations for samples, and $Y$ can be seen as learned feature representations (we can regard the columns of $Y$ as learned vector representations in $\mathbb{R}^k$ for features). With learned patient and feature representations, we can calculate patient similarity networks and feature interaction networks, and add network regularizers to our training objective.
3.3 Incorporate biological knowledge as network regularizers
Let $G \in \mathbb{R}^{p \times p}$ be the interaction matrix among genomic features. $G$ can be obtained from biological knowledgebases such as STRING Szklarczyk et al. (2014) and Reactome Croft et al. (2013).
With the factorization autoencoder model, we can learn a feature representation $Y$. Ideally this representation should be “consistent” with the biological interaction network of these features. We use a graph Laplacian regularizer to “punish” the inconsistency between the learned feature representation $Y$ and the feature interaction network $G$:
$$\mathrm{Trace}(Y L_G Y^T) = \frac{1}{2} \sum_{i,j} G_{ij} \, \|Y_{\cdot i} - Y_{\cdot j}\|^2 \qquad (6)$$
In Eq. 6, $L_G$ is the graph Laplacian matrix of $G$. $G_{ij}$ can be regarded as a “similarity” (interaction) measure between feature $i$ and feature $j$. Each feature $i$ is represented as a $k$-dimensional vector $Y_{\cdot i}$. The Euclidean distance between features $i$ and $j$ in the learned feature space is simply $\|Y_{\cdot i} - Y_{\cdot j}\|$. $\mathrm{Trace}(Y L_G Y^T)$ can serve as a surrogate for the loss measuring the inconsistency between the learned feature representation $Y$ and the existing interaction network $G$. To see this, consider a case where $Y$ is highly inconsistent with $G$: suppose that whenever $G_{ij}$ is large (i.e., features $i$ and $j$ are similar based on existing knowledge), $\|Y_{\cdot i} - Y_{\cdot j}\|$ is also large (i.e., features $i$ and $j$ are very different based on learned representations). Then the loss, which consists of the terms $G_{ij} \|Y_{\cdot i} - Y_{\cdot j}\|^2$ and accounts for the level of inconsistency between the learned feature representation and biological knowledge, will be large, too.
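The link between the trace form of the graph Laplacian regularizer and the pairwise-distance penalty can be checked numerically. The sketch below assumes a symmetric, nonnegative interaction matrix `G` (p x p) and a feature embedding `Y` (k x p) whose columns are feature vectors; the function names and the toy data are illustrative only.

```python
import numpy as np

def graph_laplacian(G):
    """L = D - G, where D is the diagonal degree matrix with D_ii = sum_j G_ij."""
    return np.diag(G.sum(axis=1)) - G

def laplacian_regularizer(Y, G):
    """Trace(Y L_G Y^T) for a k x p embedding Y and a p x p network G."""
    return np.trace(Y @ graph_laplacian(G) @ Y.T)

def pairwise_penalty(Y, G):
    """0.5 * sum_ij G_ij * ||Y[:, i] - Y[:, j]||^2, computed term by term."""
    p = G.shape[0]
    total = 0.0
    for i in range(p):
        for j in range(p):
            total += G[i, j] * np.sum((Y[:, i] - Y[:, j]) ** 2)
    return 0.5 * total

rng = np.random.default_rng(0)
A = rng.random((6, 6))
G = (A + A.T) / 2            # symmetric, nonnegative toy network
np.fill_diagonal(G, 0.0)
Y = rng.normal(size=(3, 6))  # 6 features embedded in 3 dimensions

trace_value = laplacian_regularizer(Y, G)
pair_value = pairwise_penalty(Y, G)
```

The two values agree up to floating-point error, and an embedding in which all features share the same vector gives a penalty of zero, the fully "consistent" extreme.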
The objective function for the aforementioned factorization AutoEncoder model incorporating biological interaction networks through the graph Laplacian regularizer is as follows:
$$\min_{\Theta_e, Y} \; \|M - Z\|_F^2 + \alpha \, \mathrm{Trace}(Y L_G Y^T) \qquad (7)$$
$\alpha$ is a hyperparameter to balance the reconstruction loss and the network regularization term. To ensure the network regularization term has a fixed range, we also normalize $G$ and $Y$ so that $\mathrm{Trace}(Y L_G Y^T)$ stays within a fixed range in the implementation of our model. More specifically, we set , (this also ensures that ). This facilitates multiview integration as all the network regularizers from multiple views are on the same scale.
3.4 Multiview Factorization AutoEncoder with network constraints
Eq. 7 shows the objective for a single view. We can easily extend it to multiple views:
$$\min_{\{\Theta_e^{(v)}, Y^{(v)}\}} \; \sum_{v=1}^{V} \left( \|M^{(v)} - Z^{(v)}\|_F^2 + \alpha \, \mathrm{Trace}\big(Y^{(v)} L_{G^{(v)}} Y^{(v)T}\big) \right) \qquad (8)$$
Here for each of the $V$ views, we use a separate autoencoder. We combine all the reconstruction losses and feature interaction network regularizers together as the overall loss in Eq. 8.
As mentioned before, $X^{(v)}$ can be regarded as the learned latent factor representation for the $N$ samples. Based on $X^{(v)}$, we can derive a patient similarity network $S^{(v)}$ (which can also be used for spectral clustering). There are multiple ways to calculate a similarity network. Here we use cosine similarity as an example:
$$S_{ij} = \frac{X_{i\cdot} \cdot X_{j\cdot}}{\|X_{i\cdot}\| \, \|X_{j\cdot}\|} \qquad (9)$$
For each view $v$, we get a patient similarity network $S^{(v)}$ (Eq. 9 omits the superscript for clarity). In addition, the outputs of multiple encoders can be combined:
$$X = \sum_{v=1}^{V} X^{(v)} \qquad (10)$$
This idea is very much like ResNet He et al. (2016). Another possible approach is simply concatenating all views together like DenseNet Huang et al. (2017). We have tried both in our experiments and the results are similar. We can then use the fused view $X$ to calculate a patient similarity network using Eq. 9 again.
Since the patient similarity networks $S^{(v)}$, $v = 1, \dots, V$, are about the same set of patients and thus related to each other, we can fuse them together into a single network $S$ (this is a special case of affinity network fusion Ma & Zhang (2017)):
(11) 
Just like the feature interaction network regularizer (Eq. 6), we can add a regularization term on view similarity:
$$\sum_{v=1}^{V} \mathrm{Trace}\big(X^{(v)T} L_S X^{(v)}\big) \qquad (12)$$
$L_S$ is the graph Laplacian of the fused patient similarity network $S$. Adding this term to Eq. 8, we get the new objective function:
$$\min_{\{\Theta_e^{(v)}, Y^{(v)}\}} \; \sum_{v=1}^{V} \left( \|M^{(v)} - Z^{(v)}\|_F^2 + \alpha \, \mathrm{Trace}\big(Y^{(v)} L_{G^{(v)}} Y^{(v)T}\big) + \beta \, \mathrm{Trace}\big(X^{(v)T} L_S X^{(v)}\big) \right) \qquad (13)$$
There are two kinds of networks involved in our framework: molecular interaction networks and patient similarity networks. For each type of omic data, there is one corresponding interaction network $G^{(v)}$. Unlike patient similarity networks, different molecular interaction networks involve different feature sets and cannot be directly merged. Patient similarity networks from multiple views, however, are all about the same set of patients, and thus can be fused into a combined patient similarity network using techniques such as affinity network fusion Ma & Zhang (2017).
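The patient-side pipeline described in this section (a cosine similarity network per view, fusion across views, and a Laplacian penalty on each view's embedding) might look roughly like the sketch below. Fusing by a simple average is an assumption made for illustration, as one simple instance of network fusion; the function names and toy embeddings are likewise hypothetical.

```python
import numpy as np

def cosine_similarity_network(X, eps=1e-12):
    """S_ij = cosine similarity between rows i and j of X (N x k)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, eps)
    return Xn @ Xn.T

def view_similarity_regularizer(views, S):
    """Sum over views of Trace(X^T L_S X), with L_S the graph Laplacian of S."""
    L = np.diag(S.sum(axis=1)) - S
    return sum(np.trace(X.T @ L @ X) for X in views)

rng = np.random.default_rng(0)
# Nonnegative toy embeddings (e.g., ReLU outputs) keep similarities in [0, 1].
views = [rng.random((10, 4)) for _ in range(3)]
S_views = [cosine_similarity_network(X) for X in views]
S_fused = sum(S_views) / len(S_views)   # assumption: fuse by simple averaging
penalty = view_similarity_regularizer(views, S_fused)
```

The penalty is small when patients that are similar in the fused network also have nearby embeddings in every view, which is the consistency the regularizer is meant to encourage.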
3.5 Supervised learning with multiview factorization autoencoder
The proposed framework with the objective function Eq. 13 can be used for unsupervised learning (up to now, we have not used labeled data) when multiple-view data and feature interaction networks are available. When class labels or other target variables are available, we can apply the proposed model to supervised learning by adding another loss term to Eq. 13:
$$\min_{\{\Theta_e^{(v)}, Y^{(v)}\}, C} \; \mathcal{L}\Big(T, \Big(\sum_{v=1}^{V} X^{(v)}\Big) C\Big) + \eta \sum_{v=1}^{V} \|M^{(v)} - Z^{(v)}\|_F^2 + \alpha \sum_{v=1}^{V} \mathrm{Trace}\big(Y^{(v)} L_{G^{(v)}} Y^{(v)T}\big) + \beta \sum_{v=1}^{V} \mathrm{Trace}\big(X^{(v)T} L_S X^{(v)}\big) \qquad (14)$$
The first part is either a classification loss (e.g., cross entropy loss) or a regression loss (e.g., mean squared error for continuous target variables). $T$ denotes the true class labels or other continuous target variables available for training the model.
As in Eq. 10, $\sum_{v} X^{(v)}$ refers to the sum of the last hidden layers of the autoencoders (the output of the last hidden layer is also the encoder output). This represents the learned patient representations combining multiple views. $C$ denotes the weights of the last fully connected layer typically used in neural network models for classification tasks.
The second part is the reconstruction loss for all the submodule autoencoders. The third and fourth terms are the graph Laplacian constraints for molecular interaction networks and learned patient similarity networks as in Eq. 6 and Eq. 12. $\eta$, $\alpha$ and $\beta$ are nonnegative hyperparameters adjusting the weights of the reconstruction loss, feature interaction network loss, and patient similarity network loss.
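To make the four-part supervised objective concrete, the sketch below evaluates it once on toy arrays. It is a NumPy illustration under assumed names: `eta`, `alpha`, `beta` for the three weights, `C` for the classifier weights, and random stand-ins for the encoder outputs. It computes the loss value only and is not the authors' training code.

```python
import numpy as np

def laplacian(W):
    """Graph Laplacian D - W of a symmetric nonnegative matrix W."""
    return np.diag(W.sum(axis=1)) - W

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def total_loss(Ms, Xs, Ys, Gs, S, C, labels, eta=1.0, alpha=1.0, beta=1.0):
    """Classification + reconstruction + feature-network + patient-network terms."""
    cls = cross_entropy(sum(Xs) @ C, labels)          # views combined by summation
    recon = sum(np.linalg.norm(M - X @ Y, "fro") ** 2
                for M, X, Y in zip(Ms, Xs, Ys))
    feat = sum(np.trace(Y @ laplacian(G) @ Y.T) for Y, G in zip(Ys, Gs))
    L_S = laplacian(S)
    sim = sum(np.trace(X.T @ L_S @ X) for X in Xs)
    return cls + eta * recon + alpha * feat + beta * sim

rng = np.random.default_rng(0)
N, k, dims = 8, 3, (5, 4)                             # two toy views
Ms = [rng.normal(size=(N, p)) for p in dims]
Xs = [rng.random((N, k)) for _ in dims]               # stand-ins for encoder outputs
Ys = [rng.normal(size=(k, p)) for p in dims]
Gs = []
for p in dims:
    A = rng.random((p, p))
    Gs.append((A + A.T) / 2)                          # symmetric toy networks
S = np.full((N, N), 0.5)                              # toy fused patient network
C = rng.normal(size=(k, 2))
labels = rng.integers(0, 2, size=N)

loss_all = total_loss(Ms, Xs, Ys, Gs, S, C, labels)
loss_no_graph = total_loss(Ms, Xs, Ys, Gs, S, C, labels, alpha=0.0, beta=0.0)
```

Setting `alpha` or `beta` to zero switches off the corresponding graph constraint, so the same function covers variants of the objective with and without network terms.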
The whole framework is end-to-end differentiable. A simple illustration of the whole framework combining two views with two-hidden-layer autoencoders is depicted in Fig. 1. We implement the model using PyTorch (https://pytorch.org/). Code will be made publicly available.
4 Experiments
4.1 Dataset
We downloaded the TCGA Pancancer dataset Hutter & Zenklusen (2018) and selected patients based on these criteria: 1) the patient's gene expression, miRNA expression, protein expression, and DNA methylation data as well as clinical data are all available; and 2) the patient's cancer type has at least 100 patients. In total, 6179 patients with 21 different cancer types were selected for analysis.
4.1.1 Target clinical variable
We use four types of omic data (i.e., gene expression, miRNA expression, protein expression and DNA methylation) to predict the Progression-Free Interval (PFI) event and the Overall Survival (OS) event. PFI and OS are derived clinical (binary) outcome endpoints Liu et al. (2018). Both endpoints are relatively accurate, and are recommended for predictive tasks when available Liu et al. (2018). PFI is preferred over OS given the relatively short follow-up time.
PFI=1 means the patient had a new tumor event in a fixed period, such as a progression of disease, local recurrence, distant metastasis, or new primary tumor, or died with the cancer without a new tumor event. PFI=1 implies the treatment outcome is unfavorable. PFI=0 means the patient did not have a new tumor event in a fixed period, or was censored otherwise. There are 4268 patients with PFI=0 and 1911 patients with PFI=1. OS=1 means the patient died from any cause based on follow-up data; OS=0 otherwise. There are 4460 patients with OS=0 and 1719 patients with OS=1. PFI and OS are the same for most cases (4941 out of 6179, or 80%). As PFI is preferable to OS, we mainly use PFI as a binary target.
Since this is a highly unbalanced dataset and all the models including the baseline methods use prediction scores to decide binary labels, we report AUC (Area Under the ROC Curve) score as the main metric to evaluate classification performances. Other measures are similar to AUC but are less comprehensive.
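For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen positive case receives a higher prediction score than a randomly chosen negative case. The pure-Python sketch below computes it by direct pair counting on a made-up example; the function name and the toy scores are illustrative and not taken from the experiments.

```python
def auc_score(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_score(labels, scores)   # 3 of the 4 positive/negative pairs are ordered correctly
```

Because it depends only on the ranking of scores, the value does not change with the decision threshold or with the ratio of positive to negative cases.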
4.1.2 Data preprocessing
For gene features, we performed log transformation and removed outliers. After filtering out genes with either low mean or low variance, 4942 gene features were kept for downstream analysis. For DNA methylation data, we removed features with low mean and variance. 4753 methylation features (i.e., beta values associated with CpG islands) were selected for analysis. For miRNA features, we also performed log transformation and removed outliers. For protein expression (RPPA) data, we removed nine features with NA values. 662 miRNA features and 189 protein features were kept for analysis. In total, there are 10,546 features from four omic types. For each of the four types of features, we normalize it to have zero mean and standard deviation equal to 1.
Molecular interaction networks
We downloaded the PPI database from STRING (v10.5) Szklarczyk et al. (2014) (https://string-db.org/). There are more than ten million protein-protein interactions with confidence scores between 0 and 1000. Since most interaction edges have a low confidence score, we selected about 1.5 million interaction edges with confidence scores of at least 400. For gene and protein expression features, we extracted a subnetwork from this PPI network. Since the gene-gene interaction network is too sparse, we performed a one-step random walk (i.e., multiplying the interaction network by itself), removed outliers and normalized it. For miRNA and methylation features, we first map miRNA/methylation features to gene (protein) features, and then calculate a miRNA-miRNA and a methylation-methylation interaction network. Take miRNA data as an example. Let $A$ be the adjacency matrix for the miRNA-protein mapping (this matrix is derived from miRDB (http://www.mirdb.org) miRNA target prediction scores), and $G_{\text{protein}}$ be the protein-protein interaction network; then the miRNA-miRNA interaction network is calculated as follows:
$$G_{\text{miRNA}} = A \, G_{\text{protein}} \, A^T$$
We normalized all four feature interaction matrices so that their Frobenius norms are all equal to 1.
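The network preprocessing steps can be illustrated on a toy example. The sketch below assumes the miRNA-miRNA network is obtained by projecting the PPI network through the miRNA-protein mapping as `A @ G_protein @ A.T` (the form implied by the matrix dimensions, treated here as an assumption), and it omits outlier removal; all function names and toy matrices are hypothetical.

```python
import numpy as np

def frobenius_normalize(G):
    """Scale G so that its Frobenius norm equals 1."""
    return G / np.linalg.norm(G, "fro")

def one_step_random_walk(G):
    """Densify a sparse network by multiplying it by itself."""
    return G @ G

def project_network(A, G_protein):
    """Assumed form A G A^T: features are linked through their target proteins."""
    return A @ G_protein @ A.T

# Toy PPI network over 4 proteins (symmetric, unweighted chain).
G_protein = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]], dtype=float)
# Toy mapping: 2 miRNAs x 4 proteins (1 = predicted target).
A = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)

G_mirna = frobenius_normalize(project_network(A, G_protein))
G_gene = frobenius_normalize(one_step_random_walk(G_protein))
```

In this toy case the two miRNAs become connected because their target proteins interact, while the random walk step adds edges between proteins that share a neighbor.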
All the processed sample-feature matrices and feature interaction matrices will be provided upon reasonable request.
We randomly split the dataset into training set: 4326 patients, or 70%; validation set: 618 patients, or 10%; and test set: 1235 patients, or 20%. We trained different models on the training set, and evaluated them on the validation set. We chose the model with the best validation accuracy to make predictions on test set, and reported the AUC score on test set. We shuffled the data and repeated the process ten times, and reported the average AUC as the final metric for model performances.
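The repeated random split can be sketched with the standard library alone. The split sizes below are the ones reported in the text; the seeds and the function name `split_indices` are assumptions for illustration.

```python
import random

def split_indices(n, n_train, n_val, seed):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Sizes reported in the text: 4326 / 618 / 1235 out of 6179 patients.
splits = [split_indices(6179, 4326, 618, seed) for seed in range(10)]
train, val, test = splits[0]
```

Each of the ten splits would then be used for one train/validate/test run, and the test AUC averaged over the runs.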
4.2 Results
We chose six traditional methods as well as a plain neural network model as baselines. The six traditional methods are SVM, Decision Tree, Naive Bayes, kNN, Random Forest, and AdaBoost. Traditional models such as SVM only accept one feature matrix as input, so we used the concatenated feature matrix, which has 6179 rows and 10,546 columns (features), as model input. For kNN, we used the same value of k for all experiments. We used a linear kernel for SVM. We used 10 estimators in Random Forest and 50 estimators in AdaBoost.
For the plain autoencoder model with a classification head, we used a three-layer neural network. The input layer has 10,546 units (features). Both the first and second hidden layers have 100 hidden units. The last layer also has 10,546 units (i.e., the reconstruction of the input). We added a classification head, which is a linear layer with two output units corresponding to the two classes. This plain autoencoder model uses the concatenated feature matrix as input and thus is view agnostic.
To facilitate fair comparisons, all of our proposed Multiview Factorization AutoEncoder models share the same model architecture (i.e., two hidden layers each with 100 hidden units for each of the four submodule autoencoders), but the training objectives are different. Since this dataset has four different data types, our model has four autoencoders as submodules, each of which encodes one type of data (one view). Fig. 1 shows our model structure (note that in our experiments we have four views instead of the two shown in the figure). We combine the outputs of the four autoencoders (i.e., the outputs of the last hidden layers) by adding them together (Eq. 10) for classification tasks.
The training objective for the Multiview Factorization AutoEncoder without graph constraints includes only the first two terms in Eq. 14. The objective for the Multiview Factorization AutoEncoder with feature interaction network constraints (feat_int) includes the first three terms in Eq. 14. The objective for the Multiview Factorization AutoEncoder with patient view similarity network constraints (view_sim) includes the first two and the last terms in Eq. 14. And the objective for the Multiview Factorization AutoEncoder with both feature interaction and view similarity network constraints includes all four terms in Eq. 14.
As our proposed model with network constraints is end-to-end differentiable, we trained it with Adam Kingma & Ba (2014) with weight decay. The learning rate is kept at its initial value for the first 500 iterations and then decreased by a factor of 10 for another 500 iterations. Models with the best validation accuracies are used for prediction on the test set.
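The step schedule described above (a constant rate for 500 iterations, then a tenfold reduction for the next 500) can be written as a small function. The base learning rate used below is a placeholder, not the value used in the experiments, and the function name is hypothetical.

```python
def learning_rate(iteration, base_lr, drop_at=500, factor=10.0):
    """Step schedule: base_lr before `drop_at`, base_lr / factor from then on."""
    return base_lr if iteration < drop_at else base_lr / factor

# base_lr=1e-3 is a placeholder value chosen only for this illustration.
schedule = [learning_rate(t, base_lr=1e-3) for t in range(1000)]
```

In a PyTorch training loop, the same behavior could be obtained with a built-in step scheduler; the hand-written version is shown only to make the schedule explicit.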
The average AUC scores (10 runs) for predicting PFI and OS using these models are shown in Table 1. Our proposed models (in bold font) achieved better AUC scores for predicting both PFI and OS. Note that traditional methods such as SVM do not perform as well as deep learning models. This may be due to the superior representation power of deep learning. Though our proposed Multiview Factorization AutoEncoder is only slightly better than the plain autoencoder model, adding feature interaction and view similarity network constraints further improved the classification performance. Note that both Multiview Factorization AutoEncoder + view_sim and Multiview Factorization AutoEncoder + feat_int + view_sim achieve the best AUC for PFI prediction. Adding both feature interaction and patient view similarity network constraints improves the model performance only very slightly. We suspect one main reason is that the dataset itself contains a lot of noise (due to the nature of multiomic data) and the feature interaction networks derived from public knowledgebases are incomplete and noisy, too. If a larger dataset consisting of hundreds of thousands of patients were available, we would expect our proposed model with more network constraints to generalize better.
Model Name  AUC (OS)  AUC (PFI) 

SVM  0.699  0.625 
Decision Tree  0.670  0.634 
Naive Bayes  0.655  0.644 
kNN  0.706  0.659 
Random Forest  0.720  0.661 
AdaBoost  0.716  0.689 
Plain AutoEncoder Model  0.758  0.716 
Multiview Factorization AutoEncoder without graph constraints  0.761  0.717 
Multiview Factorization AutoEncoder + feat_int  0.765  0.721 
Multiview Factorization AutoEncoder + view_sim  0.763  0.724 
Multiview Factorization AutoEncoder + feat_int + view_sim  0.766  0.724 
In addition, we tried using DenseNet Huang et al. (2017) and ResNet He et al. (2016) as the backbone of the autoencoders instead of multilayer perceptrons, and we experimented with different numbers of hidden units and hidden layers. Using DenseNet as the backbone with three hidden layers, each with 100 units, achieved the best AUC score (0.725) for predicting PFI. But the results are not significantly different and thus are not presented here.
4.2.1 Learned feature embeddings preserve interaction network structure
Our proposed model learns patient representations and feature embeddings simultaneously. While patients differ from dataset to dataset, the genomic features (such as gene features) and their interaction networks come from domain knowledge, and thus persist regardless of which dataset we are using. Since we have a regularization term in the loss to ensure the learned feature embeddings are consistent with feature interaction networks, we would like to know if the model is able to learn an embedding that is “compatible” with the domain knowledge of interaction networks. We plotted the loss term from one typical run of training our model with feature interaction network constraints in Fig. 2. This regularization term decreased to nearly zero very quickly, which suggests that the information from feature interaction networks is assimilated into the model, or more specifically, into the weights of the decoders in the model. We found that many independent runs show very similar loss curves, which suggests the model is able to robustly learn a feature embedding that preserves the feature interaction network information.
5 Conclusion
Multiomic integrative analysis is important for cancer genomics. While multiomic data has the “big p, small N” problem, biological knowledge can be leveraged for large-scale data integration and knowledge discovery. A number of databases such as STRING Szklarczyk et al. (2014), Reactome Pathways Croft et al. (2013), etc., can be used to extract biological interaction networks. Intelligently integrating these biological networks into a model is crucial for mining multiomic data. We proposed the Multiview Factorization AutoEncoder model with network constraints to integrate multiomic data and molecular interaction networks for multiomic data analysis.
Our model contains multiple factorization autoencoders as submodules for different views, and combines multiple views through their high-level latent representations. The factorization autoencoder utilizes a deep architecture for the encoder and a shallow architecture for the decoder. On the one hand this increases the overall model representation power; on the other it provides a natural way to integrate graph constraints into the model.
Our model learns patient embeddings and feature embeddings simultaneously, enabling us to add network constraints on both feature interaction networks and patient view similarity networks. Our approach can be applied to large-scale multiomic datasets to learn embeddings for molecular entities, subject to network constraints that ensure the learned representations are consistent with molecular interaction networks. Meanwhile, the model can produce patient representations for each view. As the latent patient representations from multiple views should be similar to each other, we added a network regularizer to encourage the learned patient representations in multiple views to be consistent with one another with respect to patient similarity networks.
The experimental results on the TCGA Pancancer dataset show that our proposed model with feature interaction network and patient view similarity network constraints outperforms traditional methods and plain deep learning autoencoder models. Though we mainly focused our discussion on multiomic data analysis, as a general approach, our proposed method can be applied to any other multiview data with feature interaction networks. Future work may focus on using more advanced autoencoders (e.g., adversarial autoencoder Makhzani et al. (2015)) and encoding some domain knowledge directly into the model architecture.
References
 Alipanahi et al. (2015) Alipanahi, B., Delong, A., Weirauch, M. T., and Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature biotechnology, 33(8):831, 2015.
 Angione et al. (2016) Angione, C., Conway, M., and Lió, P. Multiplex methods provide effective integration of multiomic data in genome-scale models. BMC bioinformatics, 17(4):83, 2016.
 Baltrušaitis et al. (2018) Baltrušaitis, T., Ahuja, C., and Morency, L.-P. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 Bell et al. (2009) Bell, R., Koren, Y., and Volinsky, C. Matrix factorization techniques for recommender systems. Computer, 42:30–37, 2009. ISSN 0018-9162. doi: 10.1109/MC.2009.263.
 Boža et al. (2017) Boža, V., Brejová, B., and Vinař, T. DeepNano: deep recurrent neural networks for base calling in MinION nanopore reads. PloS one, 12(6):e0178751, 2017.
 Croft et al. (2013) Croft, D., Mundo, A. F., Haw, R., Milacic, M., Weiser, J., Wu, G., Caudy, M., Garapati, P., Gillespie, M., Kamdar, M. R., et al. The reactome pathway knowledgebase. Nucleic acids research, 42(D1):D472–D477, 2013.
 Ebrahim et al. (2016) Ebrahim, A., Brunk, E., Tan, J., O’brien, E. J., Kim, D., Szubin, R., Lerman, J. A., Lechner, A., Sastry, A., Bordbar, A., et al. Multiomic data integration enables discovery of hidden biological regularities. Nature communications, 7:13091, 2016.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Henry et al. (2014) Henry, V. J., Bandrowski, A. E., Pepin, A.-S., Gonzalez, B. J., and Desfeux, A. Omictools: an informative directory for multiomic data analysis. Database, 2014, 2014.
 Hofree et al. (2013) Hofree, M., Shen, J. P., Carter, H., Gross, A., and Ideker, T. Network-based stratification of tumor mutations. Nature methods, 10(11):1108, 2013.
 Hu et al. (2018) Hu, Z., Yang, Z., Salakhutdinov, R., Liang, X., Qin, L., Dong, H., and Xing, E. Deep generative models with learnable knowledge constraints. arXiv preprint arXiv:1806.09764, 2018.
 Huang et al. (2017) Huang, G., Liu, Z., van der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269, July 2017. doi: 10.1109/CVPR.2017.243.
 Hutter & Zenklusen (2018) Hutter, C. and Zenklusen, J. C. The cancer genome atlas: Creating lasting value beyond its data. Cell, 173(2):283–285, 2018.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436, 2015.
 Lee & Seung (2001) Lee, D. D. and Seung, H. S. Algorithms for nonnegative matrix factorization. In Advances in neural information processing systems, pp. 556–562, 2001.
 Liu et al. (2018) Liu, J., Lichtenberg, T., Hoadley, K. A., Poisson, L. M., Lazar, A. J., Cherniack, A. D., Kovatich, A. J., Benz, C. C., Levine, D. A., Lee, A. V., et al. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell, 173(2):400–416, 2018.
 Ma et al. (2018) Ma, J., Yu, M. K., Fong, S., Ono, K., Sage, E., Demchak, B., Sharan, R., and Ideker, T. Using deep learning to model the hierarchical structure and function of a cell. Nature methods, 15(4):290, 2018.
 Ma & Zhang (2017) Ma, T. and Zhang, A. Integrate multiomic data using affinity network fusion (ANF) for cancer patient clustering. In Bioinformatics and Biomedicine (BIBM), 2017 IEEE International Conference on, pp. 398–403. IEEE, 2017.
 Makhzani et al. (2015) Makhzani, A., Shlens, J., Jaitly, N., Goodfellow, I., and Frey, B. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 Malta et al. (2018) Malta, T. M., Sokolov, A., Gentles, A. J., Burzykowski, T., Poisson, L., Weinstein, J. N., Kamińska, B., Huelsken, J., Omberg, L., Gevaert, O., et al. Machine learning identifies stemness features associated with oncogenic dedifferentiation. Cell, 173(2):338–354, 2018.
 Ngiam et al. (2011) Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., and Ng, A. Y. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pp. 689–696, 2011.
 Pham et al. (2016) Pham, T., Tran, T., Phung, D., and Venkatesh, S. Deepcare: A deep dynamic memory model for predictive medicine. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 30–41. Springer, 2016.
 Shen et al. (2018) Shen, H., Shih, J., Hollern, D. P., Wang, L., Bowlby, R., Tickoo, S. K., Thorsson, V., Mungall, A. J., Newton, Y., Hegde, A. M., et al. Integrated molecular characterization of testicular germ cell tumors. Cell reports, 23(11):3392–3406, 2018.
 Shen et al. (2012) Shen, R., Mo, Q., Schultz, N., Seshan, V. E., Olshen, A. B., Huse, J., Ladanyi, M., and Sander, C. Integrative subtype discovery in glioblastoma using iCluster. PloS one, 7(4):e35236, 2012.
 Szklarczyk et al. (2014) Szklarczyk, D., Franceschini, A., Wyder, S., Forslund, K., Heller, D., Huerta-Cepas, J., Simonovic, M., Roth, A., Santos, A., Tsafou, K. P., et al. STRING v10: protein–protein interaction networks, integrated over the tree of life. Nucleic acids research, 43(D1):D447–D452, 2014.
 Wang et al. (2014) Wang, B., Mezlini, A. M., Demir, F., Fiume, M., Tu, Z., Brudno, M., Haibe-Kains, B., and Goldenberg, A. Similarity network fusion for aggregating data types on a genomic scale. Nature methods, 11(3):333, 2014.
 Wang et al. (2016) Wang, D., Khosla, A., Gargeya, R., Irshad, H., and Beck, A. H. Deep learning for identifying metastatic breast cancer. arXiv preprint arXiv:1606.05718, 2016.
 Way et al. (2018) Way, G. P., Sanchez-Vega, F., La, K., Armenia, J., Chatila, W. K., Luna, A., Sander, C., Cherniack, A. D., Mina, M., Ciriello, G., et al. Machine learning detects pan-cancer Ras pathway activation in The Cancer Genome Atlas. Cell reports, 23(1):172–180, 2018.
 Zhao et al. (2017) Zhao, J., Xie, X., Xu, X., and Sun, S. Multiview learning overview: Recent progress and new challenges. Information Fusion, 38:43–54, 2017.