1 Introduction
Technological advances in high-throughput biology enable integrative analyses that use information across multiple omics layers – including genomics, epigenomics, transcriptomics, proteomics, and metabolomics – to deliver a more comprehensive understanding of biological systems [1, 2]. Unfortunately, owing to limitations of experimental design or to the composition of different data platforms (e.g., TCGA; http://cancergenome.nih.gov/), integrated samples commonly have one or more entirely missing omics layers, with various missing patterns. Learning from such incomplete observations is challenging. Discarding samples with missing omics greatly reduces the sample size (especially when integrating many omics layers) [3], and simple mean imputation can seriously distort the marginal and joint distributions of the data [4].

In this paper, we model multi-omics data integration as learning from incomplete multi-view observations, where we refer to the observations from each omics layer as views (e.g., DNA copy number and mRNA expression). However, direct application of existing multi-view learning methods does not address the key challenge of handling missing views when integrating multi-omics data, because these methods are typically designed for complete-view observations and assume that all views are available for every sample [5]. Therefore, our goal is to develop a model that not only learns the complex intra-view and inter-view interactions that are relevant for the target task, but also flexibly integrates the observed views, regardless of their view-missing patterns, in a single framework.
Contribution. Toward this goal, we propose a deep variational information bottleneck (IB) approach for incomplete multi-view observations, which we refer to as DeepIMV. Our method consists of four network components: a set of view-specific encoders, a set of view-specific predictors, a product-of-experts (PoE) module, and a multi-view predictor. More specifically, for flexible integration of the observed views regardless of the view-missing patterns, we model the joint representations as a PoE over the marginal representations, which are further utilized by the multi-view predictor. Thus, the joint representations combine both common and complementary information across the observed views. The entire network is trained under the IB principle [6, 1], which encourages the marginal and joint representations to focus on the intra-view and inter-view interactions, respectively, that are relevant to the target. Experiments on real-world datasets show that our method consistently achieves gains from data integration and significantly outperforms state-of-the-art benchmarks with respect to measures of prediction performance.
2 Related Works
2.1 Multi-View Learning
Complete Multi-View Observations. To utilize information across multiple views, a variety of methods have been proposed in recent years. Canonical correlation analysis (CCA) [8] and its kernel and deep learning extensions [9, 6, 7] are representative methods that aim to learn a common (latent) space shared between two views such that their canonical correlation is maximized, in a purely unsupervised fashion. The learned representations can then be used for supervised learning as a downstream task. Meanwhile, some existing methods [12, 13, 14] have focused on fully utilizing the label information when seeking a common (latent) space that is not only shared between the views but also discriminative for the target. Although these methods have shown promising performance in various applications [5], they are designed for complete-view observations.

Incomplete Multi-View Observations. Most existing methods require multi-stage training – that is, constructing complete multi-view observations with imputation methods [15, 16] and then training a multi-view model – or rely on auxiliary inference steps to generate the missing views [17, 18, 19]. A few methods flexibly handle incomplete multi-view observations. The authors of [8] introduced a generative model that uses a latent factorization, enabling cross-view generation without multi-stage training regimes or additional inference steps. Matrix factorization was extended in [10] to learn low-dimensional representations that capture the joint aspects across different views. However, these methods are trained in a purely unsupervised fashion; thus, while information relevant to the reconstruction of the views will be well captured in the learned representations, information relevant to the target task may be lost.
Our work is most closely related to CPMNets [9]. Both methods aim to find representations in a common space for supervised learning of the target. A notable distinction from CPMNets is how we integrate incomplete multi-view observations. In CPMNets, the authors directly learn the mapping from latent representations to the original views without utilizing any encoder structure.^2 Instead, for all the training samples, the corresponding latent representations are randomly initialized and then iteratively updated to minimize the reconstruction and classification losses. We conjecture that this approach has two limitations: First, the method relies on reconstruction to find the latent representations of the testing samples, which makes it difficult to capture task-relevant information. Second, random initialization means that there are no inherent relations across the latent representations of the training samples; thus, all training samples must be updated at the same time, which increases the memory burden during training. In contrast, DeepIMV utilizes a posterior factorization for flexible integration of the observed views and constrains the latent representations to be task-relevant, thereby effectively capturing the complementary information for predicting the target. In addition, CPMNets is designed only for classification tasks; extending it to regression tasks can lose the advantages of its cluster-friendly representations.

^2 The "encoding networks" in [9] denote networks that reconstruct the original views from latent representations; there are no networks that map the original views to latent representations (in our context, encoders).
2.2 Information Bottleneck
The information bottleneck (IB) principle [6, 17, 23] is an information-theoretic approach that formalizes the intuitive idea of a task-relevant representation in terms of the fundamental trade-off between having a concise representation and having good predictive power. The IB framework [1] has recently been applied to multi-view problems [24, 25]. The authors of [24] combined the marginal representations from each view into a joint representation using an auxiliary layer, on which the IB principle is applied. However, this work cannot handle missing views during either training or testing. In [25], the IB principle was extended to the two-view unsupervised setting to learn robust representations that are common to both views. Thus, it is not applicable to multi-omics data integration, where the goal is to flexibly integrate incomplete views (i.e., observations from different omics layers) that often contain both common and complementary information, in a supervised fashion.
3 Incomplete Multi-View Problem
The presence of missing views remains an inevitable and prevalent problem in multi-omics data integration. To address this, we start off by defining such an integrative analysis as an incomplete multi-view problem in which some of the views may be missing with arbitrary view-missing patterns.
Notation. Let $\mathbf{X}^{v} \in \mathcal{X}^{v}$ be a random variable for the $d_{v}$-dimensional input features from the $v$-th view, and let $Y \in \mathcal{Y}$ be a random variable for the output label. We say the $v$-th view is missing if $\mathbf{x}^{v} = \emptyset$ and available if $\mathbf{x}^{v} \neq \emptyset$, where $\mathbf{x}^{v}$ is a realization of $\mathbf{X}^{v}$. To accommodate arbitrary missing views, we denote the set of observed views by $\mathcal{O} \subseteq \mathcal{V}$, which we refer to as a view-missing pattern. Here, $\mathcal{V} = \{1, \dots, V\}$ is the complete set of available views. We can then define a random variable for a multi-view observation as $\bar{\mathbf{X}} = (\mathbf{X}^{v})_{v \in \mathcal{O}}$; we say the observation is incomplete when $\mathcal{O} \subsetneq \mathcal{V}$ and complete when $\mathcal{O} = \mathcal{V}$. Throughout the paper, we will often use lowercase letters to denote realizations of a random variable.

Definition 1. (Incomplete Multi-View Problem) Suppose that we have a training dataset $\mathcal{D} = \{(\bar{\mathbf{x}}_{i}, y_{i})\}_{i=1}^{N}$ which contains one or more samples with missing views, i.e., $\mathcal{O}_{i} \subsetneq \mathcal{V}$ for some $i$. Then, we define an incomplete multi-view problem as a supervised learning problem of predicting the target $y$ – that is, classification for discrete $\mathcal{Y}$ and regression for continuous $\mathcal{Y}$ – for a new multi-view observation $\bar{\mathbf{x}}$ with an arbitrary view-missing pattern.^3

^3 Note that a similar definition has been introduced in [9] for classification tasks.
Solving an incomplete multi-view problem involves two main challenges: First, we want to learn representations in a common space that leverage both the marginal and the joint aspects of the observed views for prediction. Second, the learned representations must flexibly integrate incomplete observations with various view-missing patterns in a unified framework.
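To make the setting concrete, here is a small sketch of how an incomplete multi-view sample might be represented in code; the view count and the per-view feature dimensions are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d = {1: 100, 2: 100, 3: 100, 4: 100}   # per-view feature dimensions (illustrative)

def make_sample(observed_views):
    """Return {view index: feature vector}; a missing view is stored as None."""
    return {v: (rng.standard_normal(d[v]) if v in observed_views else None)
            for v in d}

# A sample with view-missing pattern O = {1, 3}: views 2 and 4 are missing.
x = make_sample(observed_views={1, 3})
observed = {v for v, xv in x.items() if xv is not None}
print(observed)
```

The observation is incomplete because `observed` is a strict subset of the full view set.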
4 Method: DeepIMV
To address these challenges, we propose a deep variational information bottleneck approach, which we refer to as DeepIMV (source code is available in the Supplementary Material), that consists of four network components, as illustrated in Figure 1:

- a set of view-specific encoders, parameterized by $\theta = \{\theta_{v}\}_{v \in \mathcal{V}}$, each of which stochastically maps observations from one individual view into a common latent space;
- a product-of-experts (PoE) module that integrates the marginal latent representations into a joint latent representation in the common space;
- a multi-view predictor, parameterized by $\phi$, which provides label predictions based on the joint representations; and
- a set of view-specific predictors, parameterized by $\psi = \{\psi_{v}\}_{v \in \mathcal{V}}$, each of which provides label predictions based on the marginal latent representations determined by the corresponding view-specific encoder.
We will describe each component in turn. Throughout the remainder of the paper, we write $f(\cdot)$ for a deterministic mapping and $p(\cdot|\cdot)$ for a stochastic mapping.
4.1 Toward Task-Relevant Representations
Finding task-relevant representations in a common latent space that contain both the marginal and the joint aspects of the observations is crucial for solving the incomplete multi-view problem. To this end, we apply the IB principle [1, 23], since it learns task-relevant representations by discarding as much information about the input as possible that is irrelevant to the target task, thereby encouraging the predictor to be robust to overfitting.
Let $\mathcal{Z}$ be the common latent space. We consider the marginal representation $\mathbf{z}^{v} \in \mathcal{Z}$ to be a stochastic encoding of $\mathbf{x}^{v}$, i.e., $\mathbf{z}^{v} \sim p_{\theta_{v}}(\mathbf{z}|\mathbf{x}^{v})$, defined by the $v$-th view-specific encoder for $v \in \mathcal{V}$. Similarly, we consider the joint representation $\mathbf{z} \in \mathcal{Z}$ to be a stochastic encoding of $\bar{\mathbf{x}}$, defined by the encoder block that combines the outputs of the view-specific encoders. Then, given a latent representation $\mathbf{z}$ drawn from $p_{\theta}(\mathbf{z}|\bar{\mathbf{x}})$, the multi-view predictor estimates the target label via the distribution $q_{\phi}(\mathbf{y}|\mathbf{z})$. To learn the joint aspects of the observed views for predicting the target $\mathbf{y}$, we apply the IB principle to $\mathbf{z}$ based on the following loss:

$\mathcal{L}_{joint} = \mathbb{E}_{p_{\theta}(\mathbf{z}|\bar{\mathbf{x}})}\big[ -\log q_{\phi}(\mathbf{y}|\mathbf{z}) \big] + \beta \, D_{KL}\big( p_{\theta}(\mathbf{z}|\bar{\mathbf{x}}) \,\|\, p(\mathbf{z}) \big) \qquad (1)$

where $\beta \geq 0$ is a coefficient chosen to balance the two information quantities, and $D_{KL}(\cdot\|\cdot)$ denotes the Kullback-Leibler (KL) divergence. Detailed derivations can be found in the Supplementary Material.
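For diagonal-Gaussian posteriors with a standard-normal prior (the parametric family used in Section 4.2), both terms of the loss in (1) have closed forms. Below is a minimal numpy sketch for a binary target; the value of beta is arbitrary, and the expectation over z is elided by evaluating the predictor's logits directly, so this is an illustration rather than the authors' implementation:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), per sample."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - log_var - 1.0, axis=-1)

def joint_ib_loss(logits, y, mu, log_var, beta=0.01):
    """Sketch of Eq. (1): negative log-likelihood plus beta times the KL term."""
    p = 1.0 / (1.0 + np.exp(-logits))                 # sigmoid prediction q(y=1|z)
    nll = -(y * np.log(p + 1e-12) + (1 - y) * np.log(1.0 - p + 1e-12))
    return float(np.mean(nll + beta * kl_to_standard_normal(mu, log_var)))
```

When the posterior equals the prior (`mu = 0`, `log_var = 0`), the KL term vanishes and only the prediction loss remains.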
4.2 Product-of-Experts for Incomplete Views
One question remains outstanding: how should we design the joint representation such that it integrates the marginal representations of the observed views under arbitrary view-missing patterns? Our solution is to use a product-of-experts (PoE), which factorizes the joint posterior into a product of the marginal posteriors $p_{\theta_{v}}(\mathbf{z}|\mathbf{x}^{v})$ for $v \in \mathcal{O}$. Formally, the joint posterior can be defined as follows [8]:

$p_{\theta}(\mathbf{z}|\bar{\mathbf{x}}) = \frac{1}{C} \, p(\mathbf{z}) \prod_{v \in \mathcal{O}} p_{\theta_{v}}(\mathbf{z}|\mathbf{x}^{v}) \qquad (2)$

where $C$ is a normalizing constant.
Factorizing the joint posterior with a PoE is more promising for solving an incomplete multi-view problem than a mixture-of-experts (MoE), i.e., $p_{\theta}(\mathbf{z}|\bar{\mathbf{x}}) = \frac{1}{|\mathcal{O}|}\sum_{v \in \mathcal{O}} p_{\theta_{v}}(\mathbf{z}|\mathbf{x}^{v})$: First, by employing a PoE, we can simply ignore the missing views when finding the joint representation of an input, regardless of its view-missing pattern. Hence, we can fully utilize samples with incomplete views for training without discarding them (or applying view completion in advance) [19], and we can avoid auxiliary inference steps that generate the missing views [17, 18] during both training and testing. Second, aside from its flexibility, a PoE (i.e., the joint posterior) can produce a much sharper distribution than the individual experts (i.e., the marginal posteriors), allowing each expert to specialize in a particular aspect of the target task [26]. This is a desirable property for multi-omics data integration, where individual views often contain view-specific (complementary) information or uneven amounts of information about the target.
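The claim that the product is sharper than its experts is easy to verify numerically; here is a toy 1-D comparison with made-up variances, contrasting the combined variance of a PoE against that of a uniform MoE:

```python
import numpy as np

# Two 1-D Gaussian "experts" with the same mean but different confidence.
mu1, var1 = 0.0, 1.0
mu2, var2 = 0.0, 4.0

# PoE: precisions add, so the product is sharper than any single expert.
poe_prec = 1.0 / var1 + 1.0 / var2
poe_var = 1.0 / poe_prec                       # sharper than min(var1, var2)

# MoE: a uniform mixture; its variance is at least the average of the parts.
moe_var = 0.5 * (var1 + var2) + 0.5 * (mu1**2 + mu2**2) - (0.5 * (mu1 + mu2))**2

print(poe_var, moe_var)                        # → 0.8 2.5
```

The PoE contracts toward the most confident expert, while the MoE spreads mass across all of them.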
Computing the Joint Posterior. Suppose that the marginal posteriors take the form $p_{\theta_{v}}(\mathbf{z}|\mathbf{x}^{v}) = \mathcal{N}(\mu_{v}, \sigma_{v}^{2})$ for $v \in \mathcal{O}$ and that the prior takes the form $p(\mathbf{z}) = \mathcal{N}(\mu_{0}, \sigma_{0}^{2})$ (typically, a spherical Gaussian). Because a product of Gaussians is itself Gaussian [27], we can derive the joint posterior as $p_{\theta}(\mathbf{z}|\bar{\mathbf{x}}) = \mathcal{N}(\mu, \sigma^{2})$, where $\mu = \big(\mu_{0}\sigma_{0}^{-2} + \sum_{v \in \mathcal{O}} \mu_{v}\sigma_{v}^{-2}\big)\,\sigma^{2}$ and $\sigma^{2} = \big(\sigma_{0}^{-2} + \sum_{v \in \mathcal{O}} \sigma_{v}^{-2}\big)^{-1}$. Hence, we can efficiently compute the joint posterior of an incomplete multi-view observation in terms of the available marginal posteriors.
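This closed-form product is straightforward to implement; a numpy sketch under the diagonal-Gaussian assumption, with a standard-normal prior N(0, I) folded in as one extra expert:

```python
import numpy as np

def poe_gaussian(mus, log_vars):
    """Product of diagonal-Gaussian experts over the observed views only,
    including a standard-normal prior N(0, I) as an extra expert.
    mus, log_vars: lists of arrays, one per *observed* view (at least one)."""
    precision = np.ones_like(mus[0])           # prior expert: mean 0, precision 1
    weighted_mu = np.zeros_like(mus[0])
    for mu_v, lv_v in zip(mus, log_vars):
        t_v = np.exp(-lv_v)                    # per-dimension precision 1/sigma_v^2
        precision += t_v
        weighted_mu += t_v * mu_v
    var = 1.0 / precision
    return weighted_mu * var, var              # joint mean and variance
```

A missing view is simply omitted from `mus` and `log_vars`, which is exactly the flexibility the PoE affords.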
4.3 Building View-Specific Expertise
Training a PoE can be difficult, however, requiring artificial sub-sampling of the observed views [8] or variants of contrastive divergence [26] to ensure that the individual views are learned faithfully. To mitigate this issue, we instead introduce a set of view-specific predictors and apply the IB principle to the marginal representations, allowing each encoder to build view-specific expertise for predicting the target.

Formally, for each view $v$, given a latent representation $\mathbf{z}^{v}$ drawn from $p_{\theta_{v}}(\mathbf{z}|\mathbf{x}^{v})$, the view-specific predictor estimates $\mathbf{y}^{v}$, a random variable for the target that can be described solely by the corresponding view $\mathbf{x}^{v}$, via the distribution $q_{\psi_{v}}(\mathbf{y}|\mathbf{z})$. We then apply the IB principle to the marginal representations to capture the view-specific aspects of the observed views for predicting the target, based on the following loss:

$\mathcal{L}_{v} = \mathbb{E}_{p_{\theta_{v}}(\mathbf{z}|\mathbf{x}^{v})}\big[ -\log q_{\psi_{v}}(\mathbf{y}|\mathbf{z}) \big] + \beta \, D_{KL}\big( p_{\theta_{v}}(\mathbf{z}|\mathbf{x}^{v}) \,\|\, p(\mathbf{z}) \big) \qquad (3)$

where $\beta \geq 0$ is a balancing coefficient. Minimizing (3) encourages $\mathbf{z}^{v}$ to become a minimal sufficient statistic of $\mathbf{x}^{v}$ for $\mathbf{y}$ [28], enforcing the marginal representations to learn the view-specific (possibly complementary) aspects of the target, which eventually eases the training of the PoE.
4.4 Training
We train the overall network – the view-specific encoder-predictor pairs and the multi-view predictor – by minimizing a combination of the marginal and joint IB losses:

$\mathcal{L}_{total} = \mathcal{L}_{joint} + \alpha \sum_{v \in \mathcal{V}} \mathcal{L}_{v} \qquad (4)$

where $\alpha \geq 0$ is a hyperparameter that trades off between the two losses.
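The pieces above can be assembled into one schematic forward pass. The numpy sketch below is not the authors' implementation: the encoders and predictors are reduced to linear maps, the predictors are evaluated at the posterior means rather than at sampled z, and all dimensions, weights, and the coefficients beta and alpha are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w_mu, w_lv):
    """View-specific encoder: map x to the mean / log-variance of q(z | x^v)."""
    return x @ w_mu, x @ w_lv

def poe(mus, log_vars):
    """Product-of-experts over the observed views, with an N(0, I) prior expert."""
    prec = np.ones_like(mus[0])
    wmu = np.zeros_like(mus[0])
    for m, lv in zip(mus, log_vars):
        t = np.exp(-lv)
        prec += t
        wmu += t * m
    var = 1.0 / prec
    return wmu * var, var

def kl(mu, var):
    """D_KL( N(mu, diag(var)) || N(0, I) )."""
    return 0.5 * np.sum(var + mu**2 - np.log(var) - 1.0)

def bce(logit, y):
    """Binary cross-entropy on a single logit."""
    p = 1.0 / (1.0 + np.exp(-logit))
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1.0 - p + 1e-12))

# Toy setting: 2 observed views, view dimension 5, latent dimension 3.
d_v, d_z = 5, 3
views = [rng.standard_normal(d_v) for _ in range(2)]
y = 1.0
enc = [(rng.standard_normal((d_v, d_z)) * 0.1, rng.standard_normal((d_v, d_z)) * 0.1)
       for _ in views]
w_joint = rng.standard_normal(d_z) * 0.1
w_marg = [rng.standard_normal(d_z) * 0.1 for _ in views]
beta, alpha = 0.01, 1.0

mus, lvs = zip(*[encode(x, wm, wl) for x, (wm, wl) in zip(views, enc)])
mu_j, var_j = poe(list(mus), list(lvs))

# Joint IB loss (Eq. 1) on the PoE representation ...
loss_joint = bce(mu_j @ w_joint, y) + beta * kl(mu_j, var_j)
# ... plus the marginal IB losses (Eq. 3), one per observed view ...
loss_marg = sum(bce(m @ w, y) + beta * kl(m, np.exp(lv))
                for m, lv, w in zip(mus, lvs, w_marg))
# ... combined as in Eq. (4).
total = loss_joint + alpha * loss_marg
```

In an actual implementation, `total` would be minimized by gradient descent over all encoder and predictor parameters jointly.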
The pseudocode of DeepIMV is provided in Algorithm 1.^5

^5 Here, we assume a classification task and slightly abuse notation: we write $\mathbf{y}$ to denote a one-hot vector and $y_{k}$ to denote the $k$-th element of $\mathbf{y}$.

5 Experiments
Throughout the experiments, we evaluate different multi-view learning methods on two real-world multi-omics datasets, collected by the Cancer Genome Atlas (TCGA; https://www.cancer.gov/tcga) and by the Cancer Cell Line Encyclopedia (CCLE) [29], in the context of integrating multi-omics observations for predicting 1-year mortality and drug sensitivity of cancer cells, respectively.
Benchmarks. We compare DeepIMV with 2 baselines and 6 state-of-the-art multi-view learning methods. The baselines include a pre-integration method that simply concatenates the observations from multiple views (denoted as Base1) and a post-integration method that combines the predictions of individual predictors trained on each view as an ensemble (denoted as Base2). The multi-view learning methods that assume complete multi-view observations are GCCA [5], DCCA [6], and DCCAE [7], and the methods that flexibly integrate incomplete multi-view observations are MVAE [8], CPMNets [9], and MOFA [10]. Table 1 summarizes the key characteristics. It is worth highlighting that i) for the methods that cannot handle incomplete multi-view observations, we use mean values (and further utilize the reconstructed inputs of MVAE in Tables 3 and 6) to impute missing views of training and testing samples, and ii) for the methods that do not utilize label information, we train a multi-layer perceptron (MLP) on the learned representations as a downstream task.
Methods  | Task-Oriented | Incomplete Views
Base1    | ✓ | ✗
Base2    | ✓ | ✗
GCCA     | ✗ | ✗
DCCA     | ✗ | ✗
DCCAE    | ✗ | ✗
MVAE     | ✗ | ✓
CPMNets  | ✓ | ✓
MOFA     | ✗ | ✓
DeepIMV  | ✓ | ✓
To focus our experiments on the integrative analysis and to overcome the "curse of dimensionality" in high-dimensional multi-omics data, we extracted low-dimensional representations (i.e., 100 features) using kernel PCA (with polynomial kernels) on each view [11] to train all the multi-view learning methods except MOFA, which is well known for capturing sparse factors across multiple views. Please see the Supplementary Material for more details.

Implementation details, sensitivity analyses on the hyperparameters of DeepIMV, and details on the benchmarks can be found in the Supplementary Material. Throughout the experiments, all results are reported based on 10 and 100 random 64/16/20 train/validation/test splits for the TCGA dataset and the CCLE dataset, respectively.
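This preprocessing step can be sketched from scratch; the snippet below implements polynomial-kernel PCA with plain numpy (scikit-learn's `KernelPCA` is an off-the-shelf alternative). The kernel degree and offset are illustrative assumptions, as the text does not specify them:

```python
import numpy as np

def poly_kernel_pca(X, n_components=100, degree=3, coef0=1.0):
    """Kernel PCA with a polynomial kernel (X X^T + coef0)^degree."""
    K = (X @ X.T + coef0) ** degree
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # center in feature space
    eigval, eigvec = np.linalg.eigh(Kc)          # ascending eigenvalue order
    idx = np.argsort(eigval)[::-1][:n_components]
    eigval, eigvec = eigval[idx], eigvec[:, idx]
    # Project the training points onto the leading principal components.
    return eigvec * np.sqrt(np.clip(eigval, 0.0, None))

# Toy usage with made-up data: 30 samples, 8 raw features, 5 components kept.
rng = np.random.default_rng(1)
Z = poly_kernel_pca(rng.standard_normal((30, 8)), n_components=5, degree=2)
```

The returned columns are ordered by decreasing explained variance, so truncating to 100 components per view (as above) keeps the dominant structure of each omics layer.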
Methods  | 1 View                  | 2 Views                 | 3 Views                 | 4 Views
         | complete | incomplete   | complete | incomplete   | complete | incomplete   | complete | incomplete
Base1    | 0.660±0.04 | 0.675±0.02 | 0.722±0.03 | 0.739±0.02 | 0.750±0.02 | 0.765±0.02 | 0.766±0.02 | 0.781±0.01
Base2    | 0.711±0.02 | 0.717±0.02 | 0.746±0.01 | 0.766±0.00 | 0.767±0.02 | 0.775±0.01 | 0.783±0.02 | 0.790±0.01
GCCA     | 0.680±0.02 | 0.650±0.03 | 0.737±0.02 | 0.737±0.03 | 0.764±0.01 | 0.769±0.02 | 0.783±0.01 | 0.792±0.01
DCCA     | 0.702±0.01 | 0.638±0.03 | 0.745±0.03 | 0.761±0.02 | 0.758±0.02 | 0.775±0.01 | 0.776±0.02 | 0.784±0.01
DCCAE    | 0.623±0.04 | 0.605±0.04 | 0.747±0.03 | 0.763±0.01 | 0.774±0.02 | 0.775±0.01 | 0.776±0.02 | 0.778±0.02
MVAE     | 0.592±0.05 | 0.589±0.04 | 0.677±0.02 | 0.674±0.02 | 0.731±0.02 | 0.730±0.01 | 0.774±0.01 | 0.781±0.01
CPMNets  | 0.700±0.02 | 0.709±0.01 | 0.748±0.02 | 0.761±0.02 | 0.766±0.01 | 0.771±0.01 | 0.781±0.01 | 0.788±0.01
MOFA     | 0.681±0.03 | 0.646±0.01 | 0.732±0.01 | 0.734±0.01 | 0.756±0.01 | 0.764±0.02 | 0.781±0.02 | 0.785±0.02
DeepIMV  | 0.701±0.02 | 0.724±0.02 | 0.757±0.02 | 0.772±0.01 | 0.776±0.01 | 0.791±0.01 | 0.783±0.01 | 0.801±0.01
5.1 Results: TCGA Dataset
Dataset Description. We analyze 1-year mortality based on comprehensive observations from multiple omics layers on 7,295 cancer cell lines (i.e., samples). The data consists of observations from 4 distinct views on each cell line across 3 different omics layers: (View 1) mRNA expressions, (View 2) DNA methylation, (View 3) microRNA expressions, and (View 4) reverse phase protein array. Among the 7,295 samples, 3,282 have incomplete multi-view observations with various view-missing patterns: the average missing rates were 0.90, 0.76, 0.87, and 0.26 for View 1, View 2, View 3, and View 4, respectively.
5.1.1 MultiOmics Data Integration
We start off by exploring the benefit of integrating multi-omics data, specifically of augmenting the training set with samples that have incomplete multi-view observations. To this end, Table 2 compares the predictions of the different multi-view learning methods in terms of the area under the receiver operating characteristic curve (AUROC) when samples with incomplete views are added to the already available samples with complete views for training. For evaluating the performance, we artificially created missing views for the held-out testing samples by varying the number of observed views from 1 to 4.
There are several things to highlight from Table 2: First, DeepIMV integrates samples with incomplete views best: its performance improvements were the most significant, and it outperformed the benchmarks regardless of the number of observed views. Second, even when trained only with complete-view samples, our method better handles different view-missing patterns during testing, as it provided the highest performance (except for 1 View) with partially observed views. Third, MVAE and MOFA sacrifice discriminative power, since their latent representations focus on retaining the information of the input for view generation (reconstruction), which results in discarding task-relevant discriminative information.
Methods  | Mean Impt. | MVAE Impt.
Base1    | 0.765±0.02 | 0.771±0.01
Base2    | 0.775±0.01 | 0.784±0.01
GCCA     | 0.769±0.02 | 0.774±0.01
DCCA     | 0.775±0.01 | 0.784±0.02
DCCAE    | 0.775±0.01 | 0.773±0.02
MVAE     | 0.730±0.01 | –
CPMNets  | 0.771±0.01 | –
MOFA     | 0.764±0.02 | –
DeepIMV  | 0.791±0.01 | –
In addition, the proposed method still outperformed the benchmarks that do not handle incomplete views, when we replaced the mean imputation method with an advanced multiview imputation method (i.e., MVAE) as shown in Table 3. (More results can be found in the Supplementary Material.)
Methods                | 1 View     | 2 Views    | 3 Views    | 4 Views
MoE                    | 0.629±0.03 | 0.691±0.02 | 0.736±0.02 | 0.768±0.01
MoE with marginal IBs  | 0.712±0.01 | 0.766±0.01 | 0.786±0.01 | 0.790±0.01
PoE                    | 0.655±0.04 | 0.719±0.03 | 0.755±0.03 | 0.783±0.02
PoE with marginal IBs  | 0.724±0.02 | 0.772±0.01 | 0.791±0.01 | 0.801±0.01
Visualization. We visually compare the principal component analysis (PCA) projections of the latent representations of the proposed method under different combinations of observed views in Figure 2. Here, DeepIMV is trained with both complete and incomplete multi-view samples. The PCA projections of the latent representations become more discriminative and more representative of the target labels as DeepIMV incorporates more views during testing. This is highlighted by how the PCA projections of a representative cell line labeled with 1-year mortality, marked by star dots, move away from the class boundary as the proposed method collects more views, from the single view with the highest score in Table 5 to the complete set of views. Also, DeepIMV was able to achieve representations very similar to those of the complete views without incorporating all the observed views. In particular, the PCA projections of the latent representations without View 4 (which provided the smallest score; see Table 5) were almost the same as those incorporating all views. This highlights a potential role of DeepIMV in designing multi-omics experiments, by advising which omics layers need not be measured for cost-efficient predictions.
0.319  0.506  0.487  0.157  0.562 
5.1.2 Ablation Study
In Table 4, we study the effect of i) utilizing a PoE over an MoE, and ii) introducing the view-specific predictors and the marginal IB losses in (3), in order to provide additional insight into the source of gain. As discussed in Section 4.2, a PoE allows the encoders to specialize in analyzing their corresponding views and to build different expertise. Furthermore, introducing the view-specific predictors and marginal IB losses encourages the view-specific encoders to focus on the (possibly complementary) task-relevant information of each view and thus eases the training associated with the PoE factorization. We observe that these components clearly contribute to DeepIMV's performance improvement. Here, we use the same experimental setting as in Section 5.1.1.
5.2 Results: CCLE Dataset
Dataset Description. We analyze the sensitivities of heterogeneous cell lines to 4 different drugs – Irinotecan, Panobinostat, Lapatinib, and PLX4720 – based on multi-omics observations on 504 cancer cell lines (i.e., samples). Drug response was converted to a binary label by dividing the cell lines into quartiles ranked by ActArea; the top 25% were assigned to the "sensitive" class and the rest to the "non-sensitive" class. The data consists of observations from 6 distinct views on each cell line across 5 different omics layers: (View 1) DNA copy number, (View 2) DNA methylation, (View 3) mRNA expressions, (View 4) microRNA expressions, (View 5) reverse phase protein array, and (View 6) metabolites.

Incomplete View Construction. We artificially construct incomplete multi-view observations with the following procedure: i) randomly select a fraction of the samples, given by the rate of samples with missing views, and ii) create incomplete multi-view observations by choosing one of the possible view-missing patterns for each selected sample. (Obviously, at least one view must be observed.)
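One plausible implementation of this construction (the exact sampling scheme used by the authors may differ): a boolean observation mask in which a chosen fraction of samples receives a random view-missing pattern while at least one view is always kept:

```python
import numpy as np

def make_missing_patterns(n_samples, n_views, p_incomplete, rng):
    """Boolean mask, True = observed. A fraction p_incomplete of samples gets
    a random view-missing pattern; at least one view is always observed."""
    mask = np.ones((n_samples, n_views), dtype=bool)
    chosen = rng.choice(n_samples, size=int(p_incomplete * n_samples), replace=False)
    for i in chosen:
        keep = rng.integers(n_views)              # one view that must survive
        drop = rng.random(n_views) < 0.5          # randomly drop the others
        mask[i] = ~drop
        mask[i, keep] = True
    return mask

rng = np.random.default_rng(0)
mask = make_missing_patterns(200, 6, 0.5, rng)    # e.g., CCLE-like: 6 views
```

Each row of `mask` is one sample's view-missing pattern; feeding it to the PoE amounts to dropping the `False` columns from the product.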
5.2.1 MultiOmics Data Integration
We explore the benefit of incorporating more samples and more views for predicting the drug sensitivities of heterogeneous cell lines. To this end, we progressively grow the set of available views and include all samples that have observations from at least one view in this set; that is, given a set of available views, we integrate for training every sample whose view-missing pattern overlaps that set. In the top row of Figure 3, we compare the predictions of the different multi-view methods in terms of AUROC as the set of available views grows. There are several things to highlight from this figure: First, our method provides better discriminative performance on all the tested drug sensitivity datasets (most of the time) as the number of integrated views increases. Second, the performances of DCCA and DCCAE saturate, since these methods can utilize at most two views, whereas GCCA provides consistently increasing performance since it generalizes to multiple views. Third, MVAE and MOFA sacrifice discriminative, task-relevant information, since their latent representations focus on retaining the information of the input for view generation (reconstruction).
Methods  | Mean Impt. | MVAE Impt.
Base1    | 0.751±0.01 | 0.758±0.01
Base2    | 0.728±0.02 | 0.752±0.01
GCCA     | 0.709±0.01 | 0.715±0.01
DCCA     | 0.714±0.01 | 0.717±0.01
DCCAE    | 0.688±0.01 | 0.697±0.01
MVAE     | 0.678±0.01 | –
CPMNets  | 0.702±0.01 | –
MOFA     | 0.727±0.02 | –
DeepIMV  | 0.768±0.01 | –
In addition, Table 6 shows that the proposed method still outperformed the benchmarks that do not handle incomplete views, when the mean imputation method is replaced by an advanced multiview imputation method (i.e., MVAE). (More results can be found in the Supplementary Material.)
5.2.2 Robustness to Missing Views
Next, we evaluate how robust the multi-view learning methods are with respect to the view-missing rate. The bottom row of Figure 3 shows the AUROC performance as the rate of samples with missing views ranges from 0 (all complete) to 1 (all incomplete). We highlight the following observations: First, our method outperforms all the benchmarks on the Irinotecan and Panobinostat datasets, and provides performance comparable to the best-performing benchmark on the Lapatinib and PLX4720 datasets, across different missing rates. Second, while other methods often fail, DeepIMV provides the most robust performance as the rate of samples with missing views increases. Third, DCCA and DCCAE show poor performance since these methods do not fully utilize the available views. Last, MVAE and MOFA follow a trend similar to that of the previous observation in Section 5.2.1.
6 Conclusion
Our proposed method finds intra-view and inter-view interactions that are relevant for predicting the target labels by flexibly integrating the available views, regardless of the view-missing patterns, in a unified framework. Throughout the experiments, we evaluated DeepIMV on real-world multi-omics datasets and showed that our method significantly outperforms existing multi-view learning methods in terms of prediction performance. In the future, further work may investigate incorporating sparsity in different omics data to address high dimensionality.
Acknowledgment
This work was supported by the National Science Foundation (NSF) (Grant Number: 1722516) and the Office of Naval Research (ONR).
References
 [1] Y. Hasin, M. Seldin, and A. Lusis. Multi-omics approaches to disease. Genome Biology, 18(83), 2017.
 [2] I. Subramanian, S. Verma, S. Kumar, A. Jere, and K. Anamika. Multi-omics data integration, interpretation, and its application. Bioinformatics and Biology Insights, 14, 2020.
 [3] N. Rappoport and R. Shamir. NEMO: Cancer subtyping by integration of partial multi-omic data. Bioinformatics, 35(18):3348–3356, 2019.
 [4] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data, 2nd Edition. Wiley, 2002.
 [5] Y. Li, M. Yang, and Z. Zhang. A survey of multiview representation learning. IEEE Transactions on Knowledge and Data Engineering, 31:1863–1883, 2019.
 [6] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing (Allerton 1999), 1999.
 [7] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.
 [8] H. Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
 [9] S. Akaho. A kernel method for canonical correlation analysis. arXiv preprint arXiv:cs/0609071, 2006.

 [10] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), 2013.
 [11] W. Wang, R. Arora, K. Livescu, and J. Bilmes. On deep multi-view representation learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), 2015.
 [12] T. Diethe, D. R. Hardoon, and J. ShaweTaylor. Multiview fisher discriminant analysis. In NIPS Workshop on Learning from Multiple Sources, 2008.

 [13] M. Kan, S. Shan, H. Zhang, S. Lao, and X. Chen. Multi-view discriminant analysis. In Proceedings of the 12th European Conference on Computer Vision (ECCV 2012), 2012.
 [14] K. Jia, J. Lin, M. Tan, and D. Tao. Deep multi-view learning using neuron-wise correlation-maximizing regularizers. IEEE Transactions on Image Processing, 28(10), 2019.
 [15] T. Cai, T. T. Cai, and A. Zhang. Structured matrix completion with applications to genomic data integration. Journal of the American Statistical Association, 111(514):621–623, 2016.

 [16] L. Tran, X. Liu, J. Zhou, and R. Jin. Missing modalities imputation via cascaded residual autoencoder. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017.
 [17] M. R. Amini, N. Usunier, and C. Goutte. Learning from multiple partially observed views – an application to multilingual text categorization. In Proceedings of the 22nd Conference on Neural Information Processing Systems (NIPS 2009), 2009.
 [18] Y.H. H. Tsai, P. P. Liang, A. Zadeh, L.P. Morency, and R. Salakhutdinov. Learning factorized multimodal representations. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), 2019.
 [19] Y. Shi, N. Siddharth, B. Paige, and P. H.S. Torr. Variational mixtureofexperts autoencoders for multimodal deep generative models. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.
 [20] M. Wu and N. Goodman. Multimodal generative models for scalable weaklysupervised learning. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), 2018.
 [21] R. Argelaguet, B. Velten, D. Arnol, S. Dietrich, T. Zenz, J. Marioni, F. Buettner, W. Huber, and O. Stegle. Multi-omics factor analysis – a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol, 14(6):e8124, 2018.
 [22] C. Zhang, Z. Han, Y. Cui, H. Fu, J. T. Zhou, and Q. Hu. CPM-Nets: Cross partial multi-view networks. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.
 [23] A. Achille and S. Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19, 2018.
 [24] Q. Wang, C. Boudreau, Q. Luo, P.N. Tan, and J. Zhou. Deep multiview information bottleneck. In Proceedings of the 2019 SIAM International Conference on Data Mining, 2019.
 [25] M. Federici, A. Dutta, P. Forré, N. Kushman, and Z. Akata. Learning robust representations via multiview information bottleneck. In Proceedings of the 8th International Conference on Learning Representations (ICLR 2020), 2020.
 [26] Geoffrey E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.
 [27] Y. Cao and D. J. Fleet. Generalized product of experts for automatic and principled fusion of gaussian process predictions. arXiv preprint arXiv:1410.7827, 2014.
 [28] O. Shamir, S. Sabato, and N. Tishby. Learning and generalization with the information bottleneck. Theor. Comput. Sci., 411(29):2696–2711, June 2010.
 [29] J. Barretina, G. Caponigro, and N. Stransky et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature, 483:603–607, 2012.
 [30] J. R. Kettenring. Canonical analysis of several sets of variables. Biometrika, 58(3):433–451, 1971.

[31]
Y. Shiokawa, Y. Date, and J. Kikuchi.
Application of kernel principal component analysis and computational machine learning to exploration of metabolites strongly associated with diet.
Scientific Reports, 8(3426), February 2018.
Appendix A Details on Variational IB Loss
A.1 Derivation of Variational IB Loss in (1) and (2)
Following [1], the variational information bottleneck (IB) loss for the joint latent representation $Z$ can be given as

(S.1)  $\mathcal{L}_{IB} = -I(Z; Y) + \beta \, I(Z; X),$

where $\beta \ge 0$ is a coefficient chosen to balance between the two information quantities. Here, $I(Z; Y)$ and $I(Z; X)$ can be bounded using variational approximations, i.e., $q(y|z)$ for $p(y|z)$ and $q(z)$ for $p(z)$, as follows:

(S.2)  $I(Z; Y) = \int p(y, z) \log \frac{p(y|z)}{p(y)} \, dy \, dz \;\ge\; \int p(y, z) \log q(y|z) \, dy \, dz + H(Y),$

and

(S.3)  $I(Z; X) = \int p(x, z) \log \frac{p(z|x)}{p(z)} \, dx \, dz \;\le\; \int p(x) \, D_{KL}\big(p(z|x) \,\|\, q(z)\big) \, dx,$

where $D_{KL}(\cdot \,\|\, \cdot)$ denotes the Kullback–Leibler (KL) divergence between the two distributions. Since the entropy of the labels $H(Y)$ is independent of our optimization procedure, we can simply ignore it and approximate the IB loss by plugging (S.2) and (S.3) into (S.1):

$\mathcal{L}_{IB} \;\approx\; \mathbb{E}_{p(x,y)} \Big[ \mathbb{E}_{p(z|x)} \big[ -\log q(y|z) \big] + \beta \, D_{KL}\big( p(z|x) \,\|\, q(z) \big) \Big].$

We can similarly derive the IB loss for each marginal representation $z^{(v)}$, $v = 1, \ldots, V$.
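For concreteness, the approximated IB loss above can be estimated with a single Monte Carlo sample via the reparameterization trick. The sketch below (numpy only; the array shapes, the toy predictor, and the diagonal-Gaussian encoder with variational prior $q(z) = \mathcal{N}(0, I)$ are illustrative assumptions, not the authors' implementation) computes the two terms for a binary target:

```python
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def vib_loss(mu, log_var, y_true, predict_proba, beta, rng):
    """One-sample Monte Carlo estimate of the variational IB loss:
    E[-log q(y|z)] + beta * KL(p(z|x) || q(z)), averaged over the batch."""
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps           # reparameterization trick
    p_y = predict_proba(z)                         # q(y|z) for the positive class
    nll = -np.log(np.clip(np.where(y_true == 1, p_y, 1.0 - p_y), 1e-12, None))
    return float(np.mean(nll + beta * kl_diag_gaussian(mu, log_var)))

# toy usage with a fixed (untrained) predictor
rng = np.random.default_rng(0)
mu = rng.standard_normal((4, 8))
log_var = -np.ones((4, 8))
y = np.array([0, 1, 1, 0])
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
loss = vib_loss(mu, log_var, y, lambda z: sigmoid(z.mean(axis=1)), beta=0.01, rng=rng)
```

In a real implementation the encoder outputs `mu` and `log_var`, and `predict_proba` is the multi-view predictor head; both terms are then differentiated through by the deep-learning framework.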
A.2 Training DeepIMV via IB Losses
Different network components are trained based on the joint and marginal IB losses, respectively. More specifically, the parameters of the view-specific encoders and the multi-view predictor are updated based on the joint IB loss, while those of the view-specific encoders and the view-specific predictors are updated based on the marginal IB losses. Figure S.1 depicts the network components trained via the joint and the marginal IB losses.
Appendix B Implementation Details
Among the 4 network components of DeepIMV, we use multi-layer perceptrons (MLPs) as the baseline architecture for the view-specific encoders, the view-specific predictors, and the multi-view predictor. (Note that the PoE module has no trainable parameters.) The number of hidden units and the number of layers in each component are optimized by cross-validation: among the candidate configurations, we choose the one with the minimum total loss in (4) on the validation set. The dimension of the latent representations is chosen in the same fashion, and the same activation function is applied at each hidden layer. The parameters are initialized by Xavier initialization [2] and optimized via the Adam optimizer [3]. We utilize dropout [4] to regularize the network; for the CCLE dataset, the network is further regularized with an additional weight penalty, while no additional regularization is used for the TCGA dataset. For the balancing coefficients α and β, we assume α_v = α for all views v = 1, …, V for convenience. Please refer to our sensitivity analysis in Section S.E.2 for the selection of α and β.
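The PoE fusion itself is closed-form for Gaussian experts: with diagonal-Gaussian marginal posteriors and a N(0, I) prior expert (as in MVAE [8]), the joint precision is the sum of the observed experts' precisions, so missing views are simply skipped. A minimal numpy sketch (names and shapes are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def poe(mus, log_vars, observed):
    """Product-of-experts fusion of per-view Gaussian posteriors with a
    N(0, I) prior expert; only the views flagged in `observed` contribute.
    mus, log_vars: lists of (latent_dim,) arrays; observed: list of bools."""
    precision = np.ones_like(mus[0])          # prior expert N(0, I)
    weighted_mu = np.zeros_like(mus[0])       # prior mean is 0
    for mu, lv, obs in zip(mus, log_vars, observed):
        if obs:
            t = np.exp(-lv)                   # expert precision 1 / sigma^2
            precision += t
            weighted_mu += t * mu
    joint_var = 1.0 / precision
    joint_mu = weighted_mu * joint_var        # precision-weighted mean
    return joint_mu, np.log(joint_var)

# two views observed, one missing: the missing view is simply skipped
mus = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([9.0, 9.0])]
lvs = [np.zeros(2), np.zeros(2), np.zeros(2)]
joint_mu, joint_lv = poe(mus, lvs, observed=[True, True, False])
```

Because the fusion is a fixed deterministic function of the per-view posteriors, it introduces no parameters of its own, which is why the PoE module needs no training.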
Appendix C Details of the Benchmarks
We compare DeepIMV with 2 baseline methods (i.e., Base1 and Base2) and 6 state-of-the-art multi-view learning methods (i.e., GCCA [5], DCCA [6], DCCAE [7], MVAE [8], CPM-Nets [9], and MOFA [10]).
For the baseline methods (i.e., Base1 and Base2), we directly use the observations from multiple views and the corresponding labels to train an MLP. For the unsupervised multi-view learning methods (i.e., GCCA, DCCA, DCCAE, MVAE, and MOFA), we use a two-step approach to provide predictions on the target labels: first, we train each method to find representations of the multi-view observations in a common (latent) space; then, we use the learned representations and the corresponding labels to train an MLP as a downstream task. For the MLPs used in the downstream task of each benchmark, we apply the same activation function at each hidden layer and use dropout to regularize the network. The details of the benchmarks are described as follows:


Base1: To handle multi-view observations, we concatenate features from multiple views as a pre-integration step and train a baseline network on the concatenated features. The number of hidden units and the number of layers are selected based on the validation set.

Base2: We separately train an MLP for each individual view, and then form an ensemble by averaging the predictions from the observed views as a post-integration step. For each per-view MLP, the number of hidden units and the number of layers are selected based on the validation set.

GCCA (https://github.com/rupy/GCCA) [5]: GCCA generalizes the CCA framework to more than two views. To provide predictions on the target label, we first train GCCA to find representations in a common space, and then train an MLP on the concatenated representations. The dimension of the common space is selected to give the best prediction performance on the validation set. For the downstream MLP, the number of hidden units and the number of layers are selected based on the validation set.

DCCA (https://ttic.uchicago.edu/~wwang5/dccae.html) [6] and DCCAE [7]: DCCA extracts low-dimensional representations for observations from two views in a common space by training neural networks to maximize the canonical correlation between the extracted representations. Similarly, DCCAE extracts low-dimensional representations by training autoencoders to optimize a combination of a reconstruction loss and the canonical correlation. To make predictions on the target task, we first train each method to find latent representations, and then train a baseline network on the concatenated representations. For implementation, we utilize MLPs for DCCA and DCCAE; the number of hidden units and the number of layers are selected based on the validation loss, the same activation function is applied at each hidden layer, and the dimension of the latent representations is selected based on the prediction performance on the validation set. It is worth highlighting that, since DCCA and DCCAE can utilize only two views, we select the two best-performing views whenever more than two views are available. For the downstream MLP, the number of hidden units and the number of layers are selected based on the validation set.

MVAE (https://github.com/mhw32/multimodal-vae-public) [8]: MVAE learns latent representations for incomplete multi-view observations that can generate the original views under the VAE framework. We modify the publicly available code since it only supports observations from two views. We implement the VAE components using MLPs, where the number of hidden units, the number of layers, and the dimension of the latent representations are selected based on the validation loss; the same activation function is applied at each hidden layer. To make predictions on the target task, we first train MVAE to find latent representations of the incomplete multi-view observations, and then train an MLP on the learned representations; the number of hidden units and the number of layers of this downstream MLP are selected based on the validation set.

CPM-Nets (https://github.com/hanmenghan/CPM_Nets) [9]: CPM-Nets learns representations in a common space to provide predictions on the target classification task based on incomplete multi-view observations. We implement each component of CPM-Nets using MLPs, where the number of hidden units, the number of layers, and the dimension of the latent representations are selected based on the prediction performance on the validation set; the same activation function is applied at each hidden layer.

MOFA (https://pypi.org/project/mofapy/) [10]: MOFA infers a low-dimensional representation of the data in terms of a small number of (latent) factors that capture the joint aspects across different views. For training MOFA, we used the original views (i.e., views without conducting kernel PCA), as MOFA is well-known for capturing sparse factors across multiple views. The initial number of factors is fixed in advance. For the downstream MLP, the number of hidden units and the number of layers are selected based on the validation set.
It is worth highlighting that, among the benchmarks, MVAE, CPM-Nets, and MOFA can flexibly handle incomplete multi-view observations during training. Hence, for training the other benchmarks (except for Base2), we use mean imputation for missing views. For Base2, we train the baseline network for each individual view using the samples that have observations for the corresponding view.
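As a minimal illustration of the mean-imputation step (hypothetical array shapes; per-feature means are taken over the samples that observe each view):

```python
import numpy as np

def mean_impute_views(views, masks):
    """views: list of (n_samples, d_v) arrays holding arbitrary values where
    a view is missing; masks: list of (n_samples,) boolean arrays, True when
    the sample observes that view. Missing rows are replaced by the
    per-feature mean over the observed samples of the same view."""
    imputed = []
    for x, m in zip(views, masks):
        x = x.copy()
        x[~m] = x[m].mean(axis=0)   # feature-wise mean of observed samples
        imputed.append(x)
    return imputed

# one view with 2 features; the third sample is missing and gets imputed
x1 = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]])
m1 = np.array([True, True, False])
(out,) = mean_impute_views([x1], [m1])
```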
Appendix D Obtaining Multi-Omics Datasets
D.1 TCGA Dataset
For constructing multiple views and the labels, the following datasets were downloaded from http://gdac.broadinstitute.org:


DNA methylation (epigenomics): Methylation_Preprocess.Level_3.2016012800.0.0.tar.gz

microRNA expression (transcriptomics): miRseq_Preprocess.Level_3.2016012800.0.0.tar.gz

mRNA expression (transcriptomics): mRNAseq_Preprocess.Level_3.2016012800.0.0.tar.gz

RPPA (proteomics): RPPA_AnnotateWithGene.Level_3.2016012800.0.0.tar.gz

clinical labels: Clinical_Pick_Tier1.Level_4.2016012800.0.0.tar.gz
Time to death or censoring in the clinical labels was converted to a binary label for 1-year mortality.
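This conversion can be sketched as follows; treating samples censored before the horizon as unlabeled is a common convention, and the paper's exact handling of such cases is an assumption here:

```python
import numpy as np

def one_year_mortality(time_days, event, horizon=365):
    """Convert (time to death or censoring, death indicator) into a binary
    1-year mortality label. Samples censored before the horizon carry no
    label and are returned as -1 (a common convention; an assumption here)."""
    time_days = np.asarray(time_days, dtype=float)
    event = np.asarray(event, dtype=bool)
    label = np.full(time_days.shape, -1, dtype=int)
    label[(time_days <= horizon) & event] = 1   # died within 1 year
    label[time_days > horizon] = 0              # known to survive 1 year
    return label

# died day 100; censored day 200 (ambiguous); censored day 400; died day 50
labels = one_year_mortality([100, 200, 400, 50], [True, False, False, True])
```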
D.2 CCLE Dataset
For constructing multiple views and the labels, the following datasets were downloaded from https://portals.broadinstitute.org/ccle/data:


DNA copy number (genomics): CCLE_copynumber_byGene_20131203.txt

DNA methylation (epigenomics): CCLE_RRBS_enh_CpG_clusters_20181119.txt

microRNA expression (transcriptomics): CCLE_miRNA_20181103.gct.txt

mRNA expression (transcriptomics): CCLE_RNAseq_genes_counts_20180929.gct

RPPA (proteomics): CCLE_RPPA_20181003.csv

metabolites (metabolomics): CCLE_metabolomics_20190502.csv

drug sensitivities: CCLE_NP24.2009_Drug_data_2015.02.24.csv
Drug response was converted to a binary label by dividing cell lines into quartiles ranked by ActArea; the top 25% were assigned to the “sensitive” class and the rest were assigned to the “non-sensitive” class.
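A sketch of this quartile-based binarization (the strict-inequality handling of values exactly at the cut-off is an assumption):

```python
import numpy as np

def sensitivity_labels(act_area):
    """Binarize drug response: cell lines in the top quartile of ActArea
    are labeled 'sensitive' (1), the remaining 75% 'non-sensitive' (0)."""
    act_area = np.asarray(act_area, dtype=float)
    threshold = np.quantile(act_area, 0.75)   # 75th-percentile cut-off
    return (act_area > threshold).astype(int)

labels = sensitivity_labels([0.1, 0.5, 0.9, 1.3, 2.0, 2.5, 3.1, 4.0])
```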
For both datasets, we imputed missing values within the observed views with mean values. To focus our experiments on the integrative analysis and to avoid the “curse of dimensionality” in the high-dimensional multi-omics data, we extracted low-dimensional representations (i.e., 100 features) from each view using kernel PCA with polynomial kernels [11].
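A direct numpy sketch of polynomial kernel PCA (the kernel degree and offset are illustrative assumptions; the paper states only that polynomial kernels were used to extract 100 features per view):

```python
import numpy as np

def poly_kernel_pca(X, n_components=100, degree=3, coef0=1.0):
    """Project X (n_samples, n_features) onto the top principal components
    in a polynomial-kernel feature space. `degree` and `coef0` are
    illustrative assumptions, not values reported in the paper."""
    n = X.shape[0]
    K = (X @ X.T + coef0) ** degree                # polynomial Gram matrix
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one     # double-center the kernel
    eigval, eigvec = np.linalg.eigh(Kc)            # ascending eigenvalues
    idx = np.argsort(eigval)[::-1][:n_components]  # keep the top components
    eigval = np.clip(eigval[idx], 0.0, None)       # guard numerical noise
    return eigvec[:, idx] * np.sqrt(eigval)        # projections of X

rng = np.random.default_rng(0)
Z = poly_kernel_pca(rng.standard_normal((30, 500)), n_components=5)
```

In practice one would fit such a projection per view on the training split and carry the fitted components over to validation and test samples.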
Appendix E Additional Experiments
Table S.1: AUROC performance on the TCGA dataset under mean imputation (mean impt.) and MVAE-based imputation (MVAE impt.), for each number of available views. MVAE, CPM-Nets, and DeepIMV do not depend on the imputation method, so a single value is reported.

Methods  | 1 View (mean impt. / MVAE impt.) | 2 Views                 | 3 Views                 | 4 Views
Base1    | 0.675±0.02 / 0.679±0.02          | 0.739±0.02 / 0.744±0.01 | 0.765±0.02 / 0.771±0.01 | 0.781±0.01 / 0.780±0.02
Base2    | 0.717±0.02 / 0.717±0.02          | 0.766±0.00 / 0.765±0.02 | 0.775±0.01 / 0.784±0.01 | 0.790±0.01 / 0.790±0.01
GCCA     | 0.650±0.03 / 0.660±0.01          | 0.737±0.03 / 0.737±0.03 | 0.769±0.02 / 0.774±0.01 | 0.792±0.01 / 0.794±0.00
DCCA     | 0.638±0.03 / 0.671±0.02          | 0.761±0.02 / 0.763±0.02 | 0.775±0.01 / 0.784±0.02 | 0.784±0.01 / 0.794±0.01
DCCAE    | 0.605±0.04 / 0.626±0.03          | 0.763±0.01 / 0.763±0.03 | 0.775±0.01 / 0.773±0.02 | 0.778±0.02 / 0.779±0.01
MVAE     | 0.589±0.04                       | 0.674±0.02              | 0.730±0.01              | 0.781±0.01
CPM-Nets | 0.709±0.01                       | 0.761±0.02              | 0.771±0.01              | 0.788±0.01
DeepIMV  | 0.724±0.02                       | 0.772±0.01              | 0.791±0.01              | 0.801±0.01
Table S.2: AUROC performance on the CCLE dataset under mean imputation (mean impt.) and MVAE-based imputation (MVAE impt.), for each drug. MVAE, CPM-Nets, MOFA, and DeepIMV do not depend on the imputation method, so a single value is reported.

Methods  | Irinotecan              | Panobinostat            | Lapatinib               | PLX4720
Base1    | 0.736±0.01 / 0.726±0.02 | 0.751±0.01 / 0.758±0.01 | 0.600±0.01 / 0.632±0.01 | 0.633±0.01 / 0.630±0.01
Base2    | 0.738±0.02 / 0.730±0.02 | 0.728±0.02 / 0.752±0.01 | 0.641±0.01 / 0.627±0.01 | 0.632±0.02 / 0.631±0.02
GCCA     | 0.694±0.01 / 0.698±0.02 | 0.709±0.01 / 0.715±0.01 | 0.620±0.01 / 0.619±0.01 | 0.615±0.02 / 0.617±0.01
DCCA     | 0.647±0.02 / 0.662±0.02 | 0.714±0.01 / 0.717±0.01 | 0.610±0.01 / 0.608±0.02 | 0.568±0.02 / 0.567±0.01
DCCAE    | 0.661±0.02 / 0.660±0.02 | 0.688±0.01 / 0.697±0.01 | 0.565±0.01 / 0.559±0.01 | 0.554±0.01 / 0.563±0.01
MVAE     | 0.672±0.01              | 0.678±0.01              | 0.603±0.01              | 0.593±0.01
CPM-Nets | 0.675±0.02              | 0.702±0.01              | 0.648±0.01              | 0.635±0.01
MOFA     | 0.708±0.02              | 0.727±0.02              | 0.585±0.02              | 0.559±0.02
DeepIMV  | 0.752±0.01              | 0.768±0.01              | 0.641±0.01              | 0.640±0.01
E.1 Additional Experiments with Multi-View Imputations
We also imputed observations from missing views by utilizing the reconstructed inputs of MVAE, which can flexibly integrate incomplete multi-view observations regardless of the view-missing patterns. Tables S.1 and S.2 show the AUROC performance when the two imputation methods are used for the multi-view learning methods (except for MVAE, CPM-Nets, MOFA, and DeepIMV, which do not depend on the imputation method), for the TCGA dataset and the CCLE dataset, respectively. For the TCGA dataset, all the methods are trained with both complete-view and incomplete-view samples; for the CCLE dataset, missing views are constructed synthetically. The benchmarks trained with MVAE-imputed observations did not always improve over those trained with mean-imputed observations, since reconstructing the inputs can fail to retain information that is relevant for predicting the target. Even when MVAE-based imputation improves the discriminative performance, our method still outperforms the benchmarks for all the datasets except for Lapatinib of the CCLE dataset.
E.2 Sensitivity Analysis – Effects of α and β
In this section, we provide a sensitivity analysis using the TCGA dataset to examine the effects of α and β on the prediction performance of DeepIMV. Figure S.2 shows the AUROC performance of our method with respect to different values of α and β, respectively. For training the variants of the proposed method, we used both complete-view and incomplete-view samples.
The Effect of α. As shown in Figure S.2(a), the discriminative performance drops when α is too large, since an overly high α makes DeepIMV focus on view-specific aspects, which may come at the cost of the joint aspects of the observed views that matter for predicting the target. Conversely, too small a value of α makes it difficult to learn task-relevant information from the observed views, since the marginal representations fail to capture the information in each view that is important for predicting the target.
The Effect of β. Similar to the findings of the extensive experiments in [1], β, which balances between having a representation that is concise and one that provides good prediction power, plays an important role in DeepIMV. As shown in Figure S.2(b), the classification performance drops when β is too large, since an overly high β blocks information from the input that is required to provide good predictions on the target task. For small values of β, we observe that DeepIMV overfits, since the view-specific encoders learn to be more deterministic, thereby reducing the benefits of regularization. (Note that, in such cases, early stopping is used to prevent overfitting.)
Throughout our experiments, we fix α and β separately for the TCGA dataset and for the CCLE dataset.
References
 [1] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), 2017.

 [2] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS 2010), 2010.
 [3] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [4] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 14(1), January 2014.
 [5] J. R. Kettenring. Canonical analysis of several sets of variables. Biometrika, 58(3):433–451, 1971.
 [6] G. Andrew, R. Arora, J. Bilmes, and K. Livescu. Deep canonical correlation analysis. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), 2013.
 [7] W. Wang, R. Arora, K. Livescu, and J. Bilmes. On deep multi-view representation learning. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), 2015.
 [8] M. Wu and N. Goodman. Multimodal generative models for scalable weaklysupervised learning. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), 2018.
 [9] Changqing Zhang, Zongbo Han, Yajie Cui, Huazhu Fu, Joey T. Zhou, and Qinghua Hu. CPM-Nets: Cross partial multi-view networks. In Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.
 [10] R. Argelaguet, B. Velten, D. Arnol, S. Dietrich, T. Zenz, J. Marioni, F. Buettner, W. Huber, and O. Stegle. Multi-omics factor analysis – a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol, 14(6):3348–3356, 2018.
 [11] Y. Shiokawa, Y. Date, and J. Kikuchi. Application of kernel principal component analysis and computational machine learning to exploration of metabolites strongly associated with diet. Scientific Reports, 8(3426), February 2018.