Multi-view Information Bottleneck Without Variational Approximation

Qi Zhang, et al.
Xi'an Jiaotong University

By "intelligently" fusing the complementary information across different views, multi-view learning is able to improve the performance of classification tasks. In this work, we extend the information bottleneck principle to a supervised multi-view learning scenario and use the recently proposed matrix-based Rényi's α-order entropy functional to optimize the resulting objective directly, without the need for variational approximation or adversarial training. Empirical results on both synthetic and real-world datasets suggest that our method enjoys improved robustness to noise and redundant information in each view, especially given limited training samples. Code is available at <>.






1 Introduction

Multi-view learning has become popular in many real-world applications, such as the integration of multi-omics data including genomics, epigenomics, proteomics, and metabolomics [12]. Traditional multi-view learning methods include canonical correlation analysis (CCA) [8] and its nonlinear extensions such as kernel canonical correlation analysis (KCCA) [6]. However, both CCA and KCCA are restricted to only two views.

Due to the remarkable success achieved by deep neural networks (DNNs), there is a recent trend to leverage the power of DNNs to improve the performance of multi-view learning [25]. Andrew et al. [5] proposed deep canonical correlation analysis (DCCA) to perform complex nonlinear transformations on multi-view data. Wang et al. [22] proposed deep canonically correlated autoencoders (DCCAE) to further improve the performance of DCCA by adding an autoencoder reconstruction error as regularization.

Recently, the notable Information Bottleneck (IB) principle [17] has been extended to the multi-view learning problem to compress redundant or task-irrelevant information in the input views and preserve only the most task-relevant features [24, 13]. However, parameterizing the IB principle with DNNs is not a trivial task. A notorious reason is that mutual information estimation in high-dimensional spaces is widely acknowledged to be intractable or infeasible. To this end, existing deep multi-view IB approaches (e.g., [12, 20, 2, 9, 15]) usually use variational approximation or adversarial training to maximize a lower bound of the original IB objective. However, the tightness of these derived lower bounds is hard to guarantee in practice, especially when only limited training data are available.

In this work, instead of evaluating a lower bound of the mutual information terms in deep multi-view information bottleneck, we demonstrate that it is feasible to directly optimize the IB objective without any variational approximation or adversarial training. We term our methodology Multi-view matrix-Entropy based Information Bottleneck (MEIB) and make the following contributions:

  • We introduce the matrix-based Rényi's $\alpha$-order entropy functional [14] to estimate mutual information directly, which not only equips our MEIB with a simple and tractable objective but also provides a deterministic compression of input variables from different views.

  • Empirical results suggest that our MEIB outperforms other competing methods in terms of robustness to noise and redundant information, especially given limited training samples in each view.

  • Our studies also raise a few new insights into the design and implementation of deep multi-view IB methods, as will be discussed in the last section.

2 Preliminaries

2.1 Information Bottleneck Principle

Let us denote the input random variable as $X$ and the desired output variable as $Y$ (e.g., class labels). The information bottleneck (IB) approach [17] considers extracting a compressed representation $Z$ from $X$ that is relevant for predicting $Y$. Formally, this objective is formulated as finding $Z$ such that the mutual information $I(Z;Y)$ is maximized, while keeping the mutual information $I(X;Z)$ below a threshold $r$:

$$\max_{Z \in \Delta} \; I(Z;Y) \quad \text{s.t.} \quad I(X;Z) \leq r, \qquad (1)$$

where $\Delta$ is the set of random variables $Z$ that obey the Markov chain $Y \rightarrow X \rightarrow Z$, which means that $Z$ is conditionally independent of $Y$ given $X$, and any information that $Z$ has about $Y$ comes from $X$.
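A direct consequence of this Markov structure, worth making explicit, is the data processing inequality: no representation extracted from the input can carry more label information than the input itself.

```latex
% Data processing inequality implied by the Markov chain Y -> X -> Z:
% processing X cannot create new information about Y, so
I(Z;Y) \;\leq\; I(X;Y),
% with equality when Z is a sufficient statistic of X for Y.
```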

In practice, it is hard to solve the constrained optimization problem of Eq. (1) directly, and $Z$ is instead found by maximizing the following IB Lagrangian:

$$\mathcal{L}_{\text{IB}} = I(Z;Y) - \beta I(X;Z), \qquad (2)$$

where $\beta \geq 0$ is a Lagrange multiplier that controls the trade-off between sufficiency (the performance on the task, as quantified by $I(Z;Y)$) and minimality (the complexity of the representation, as measured by $I(X;Z)$). In this sense, the IB principle also provides a natural approximation of a minimal sufficient statistic [10].

2.2 Information Bottleneck for Multi-view Learning

Multi-view learning comes down to the problem of machine learning from data represented by multiple distinct feature sets $\{X^{(i)}\}_{i=1}^{V}$, where $V$ denotes the number of views [16].

The main challenge of applying the IB principle here is that the mutual information terms are computationally intractable. To address this problem, several works resort to variational approximation to estimate mutual information, such as [12, 20, 2, 9, 15]. The main idea of variational approximation is to develop a variational bound on the sufficiency-redundancy trade-off. Specifically, the intractable distributions are replaced by known distributions, such as Gaussians, to calculate a lower bound of the original objective; the optimal solution is then obtained by maximizing this lower bound. The IB principle has also been applied to unsupervised multi-view learning. For example, [19] compresses redundant information with a collaborative multi-view IB network. Meanwhile, [9] assumes that each view contains the same task-relevant information, and therefore suggests maximizing the mutual information between the latent representations extracted from each view to compress redundant information. These studies also maximize some sample-based differentiable lower bound of mutual information instead of directly optimizing the objective. As mentioned above, such methods share the drawback that the tightness of the approximation is difficult to guarantee.
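As a concrete sketch of the variational idea used by these methods, the compression term admits an upper bound obtained by replacing the intractable marginal $p(z)$ with a tractable prior $q(z)$ (e.g., a standard Gaussian), since the gap is a non-negative KL divergence:

```latex
% Variational upper bound on the compression term (as in VIB-style methods):
I(X;Z)
= \mathbb{E}_{p(x)}\!\big[D_{\mathrm{KL}}\big(p(z \mid x)\,\|\,q(z)\big)\big]
  - D_{\mathrm{KL}}\big(p(z)\,\|\,q(z)\big)
\;\leq\; \mathbb{E}_{p(x)}\!\big[D_{\mathrm{KL}}\big(p(z \mid x)\,\|\,q(z)\big)\big].
```

Minimizing this upper bound on $I(X;Z)$ (together with a lower bound on $I(Z;Y)$) yields a lower bound of the IB objective; its tightness depends on how well $q(z)$ matches the true marginal $p(z)$.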

3 Methodology

Figure 1: An illustration of our proposed Multi-view matrix-Entropy based Information Bottleneck (MEIB). MEIB learns a robust joint representation by striking a trade-off between the relevance $I(Z;Y)$ and the compression $I(X^{(i)};Z^{(i)})$ of each view.

The architecture of our MEIB is illustrated in Fig. 1. Specifically, we consider a general supervised multi-view learning scenario in which we aim to learn a joint yet robust representation $Z$ given $V$ different views of features $\{X^{(i)}\}_{i=1}^{V}$ and their labels $Y$. To this end, our MEIB consists of $V$ separate encoders $f^{(i)}$ ($i = 1, \dots, V$), a feature fusion network $g$, and a (nonlinear) classifier. Each encoder $f^{(i)}$ maps the view-specific data $X^{(i)}$ to a latent representation $Z^{(i)}$ to remove noise and redundant information contained in $X^{(i)}$. The latent representations $\{Z^{(i)}\}_{i=1}^{V}$ are then fused by the network $g$ to obtain a joint and compact representation $Z$.

Note that, in multi-view unsupervised clustering, a common strategy to fuse view-specific latent representations is to take the form $Z = \sum_{i=1}^{V} w_i Z^{(i)}$, s.t. $w_i \geq 0$ and $\sum_{i=1}^{V} w_i = 1$ [18]. Here, instead of using this naïve weighted linear combination, our fusion network $g$ is more general and absorbs the linear combination as a special case.
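To see concretely why a general fusion network absorbs the weighted combination as a special case, consider the following minimal NumPy sketch (function names and shapes are illustrative assumptions, not the paper's implementation): a linear fusion layer acting on the concatenated view representations reproduces any convex combination by stacking scaled identity blocks.

```python
import numpy as np

def weighted_fusion(zs, w):
    # Naive convex combination: Z = sum_i w_i * Z_i with w_i >= 0, sum_i w_i = 1.
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return sum(wi * zi for wi, zi in zip(w, zs))

def linear_fusion(zs, W):
    # General linear fusion applied to the concatenation [Z_1, ..., Z_V].
    return np.concatenate(zs, axis=-1) @ W

def combination_weights(w, d):
    # Stacked scaled-identity blocks: with this choice of W, the general
    # linear fusion collapses to the weighted combination above.
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    return np.vstack([wi * np.eye(d) for wi in w])
```

A nonlinear fusion network (e.g., an MLP on the concatenation) strictly generalizes this linear special case.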

Therefore, the overall objective of our MEIB can be expressed as:

$$\max \; I(Z;Y) - \sum_{i=1}^{V} \beta_i I(X^{(i)};Z^{(i)}), \qquad (3)$$

where $\beta_i$ refers to the regularization parameter for the $i$-th view.

For the first term $I(Z;Y)$, it can be replaced with the risk of predicting $Y$ from $Z$ as measured by the cross-entropy loss¹ [1, 4]. Therefore, Eq. (3) is equivalent to:

$$\min \; \mathcal{L}_{\text{CE}}(Y, \hat{Y}) + \sum_{i=1}^{V} \beta_i I(X^{(i)};Z^{(i)}), \qquad (4)$$

where $\hat{Y}$ denotes the prediction made by the classifier from $Z$.

¹The same strategy has also been used in the variational information bottleneck (VIB) [3], the nonlinear information bottleneck (NIB) [11], and the deep deterministic information bottleneck (DIB) [27].
In this sense, the main challenge in optimizing Eq. (3) or Eq. (4) is that the exact computation of the compression term $I(X^{(i)};Z^{(i)})$ is almost impossible or intractable due to the high dimensionality of the data.

In this work, we address this issue by simply making use of the matrix-based Rényi's $\alpha$-order entropy functional to estimate $I(X^{(i)};Z^{(i)})$ in each view. Specifically, given $N$ pairs of samples from a mini-batch of size $N$ in the $i$-th view, i.e., $\{(x_j, z_j)\}_{j=1}^{N}$, each $x_j$ is an input sample and each $z_j$ denotes its latent representation produced by the encoder². We can view both $X$ and $Z$ as random vectors. According to [14], the entropy of $X$ can be defined over the eigenspectrum of a (normalized) Gram matrix $A$ (where $K_{jl} = \kappa(x_j, x_l)$ and $\kappa$ is a Gaussian kernel) as:

$$H_{\alpha}(A) = \frac{1}{1-\alpha} \log_2 \left( \sum_{j=1}^{N} \lambda_j(A)^{\alpha} \right), \qquad (5)$$

where $\alpha \in (0,1) \cup (1,\infty)$. $A$ is the normalized version of $K$, i.e., $A_{jl} = \frac{1}{N} \frac{K_{jl}}{\sqrt{K_{jj} K_{ll}}}$, and $\lambda_j(A)$ denotes the $j$-th eigenvalue of $A$.

²For simplicity, we omit the subscript of the view index in the remainder of this section.

The entropy of $Z$ can be measured similarly on the eigenspectrum of another (normalized) Gram matrix $B$.

Further, the joint entropy for $X$ and $Z$ can be defined as:

$$H_{\alpha}(A, B) = H_{\alpha}\left( \frac{A \circ B}{\operatorname{tr}(A \circ B)} \right), \qquad (6)$$

where $\circ$ denotes the Hadamard (or element-wise) product.

Given Eqs. (5) and (6), the matrix-based Rényi's $\alpha$-order mutual information, in analogy to Shannon's mutual information, is given by:

$$I_{\alpha}(A; B) = H_{\alpha}(A) + H_{\alpha}(B) - H_{\alpha}(A, B). \qquad (7)$$
The differentiability of the matrix-based Rényi's $\alpha$-order entropy functional has been derived in [26]. In practice, differentiable eigenvalue decomposition is readily available in mainstream deep learning frameworks such as TensorFlow and PyTorch.

Compared with variational approximation or adversarial training, which requires an additional auxiliary network (e.g., the mutual information neural estimator, or MINE [7]) to approximate a lower bound of mutual information values, Eq. (7) is mathematically well defined and computationally efficient, especially for large networks. Moreover, there are only two hyper-parameters: the order $\alpha$ and the kernel width $\sigma$. Throughout this work, we set $\alpha$ close to 1 to approximate the Shannon entropy. For $\sigma$, we evaluate the $k$ nearest distances of each sample and take the mean; we then choose $\sigma$ as the average of these mean values over all samples.
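For concreteness, Eqs. (5)-(7) can be sketched in NumPy as follows. This is an illustrative re-implementation under our own naming, not the released code; a practical training loop would use a differentiable eigendecomposition (e.g., `torch.linalg.eigvalsh`) so the estimate can be backpropagated.

```python
import numpy as np

def gram_matrix(x, sigma):
    # Gaussian Gram matrix K_jl = exp(-||x_j - x_l||^2 / (2 * sigma^2)).
    sq = np.sum(x**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def renyi_entropy(K, alpha=1.01):
    # Eq. (5): normalize K so that trace(A) = 1 (for a Gaussian kernel the
    # diagonal is 1, so K / trace(K) equals (1/N) K_jl / sqrt(K_jj K_ll)),
    # then H = 1/(1-alpha) * log2(sum_j lambda_j(A)^alpha).
    A = K / np.trace(K)
    lam = np.linalg.eigvalsh(A)
    lam = np.clip(lam, 0.0, None)  # guard against tiny negative eigenvalues
    return np.log2(np.sum(lam**alpha)) / (1.0 - alpha)

def joint_entropy(Kx, Kz, alpha=1.01):
    # Eq. (6): entropy of the normalized Hadamard product A \circ B.
    return renyi_entropy(Kx * Kz, alpha)

def mutual_information(Kx, Kz, alpha=1.01):
    # Eq. (7): I(A;B) = H(A) + H(B) - H(A,B).
    return (renyi_entropy(Kx, alpha) + renyi_entropy(Kz, alpha)
            - joint_entropy(Kx, Kz, alpha))
```

Note that everything is computed from a single mini-batch Gram matrix per variable, so no auxiliary estimation network is needed; the cost is one $N \times N$ eigendecomposition per view.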

4 Experiments

In this section, we compare the performance of our MEIB with several popular multi-view learning methods on both synthetic and real-world datasets. The selected competitors include linear CCA [8], DCCA [5], DCCAE [22], DNN, and deep IB [20]. Here, DNN refers to the network shown in Fig. 1 trained with the cross-entropy loss alone. We repeat each experiment 5 times to obtain average classification errors.

4.1 Synthetic data

The method for synthesizing data is the same as in [20]. We sample points per class from one of two Gaussian distributions to form the latent representation $h$. To distort the classification, we concatenate extra features to $h$ and synthesize new features, where the extra features for each view are drawn from two further Gaussian distributions. A view-specific nonlinear transformation is then applied to the concatenated features, accompanied by additive Gaussian noise at noise level $\eta$. Of the synthetic data, a portion is used for training and the remainder for testing.
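The generation procedure above can be sketched as follows. The dimensions, class means, and view-specific nonlinearities in this snippet are illustrative assumptions for the sketch, not the paper's exact settings.

```python
import numpy as np

def make_two_view_data(n_per_class=100, d_latent=5, d_extra=5, noise=0.1, seed=0):
    """Sketch of the two-view synthesis: class-conditional latent h drawn from
    one of two Gaussians, extra distractor dimensions appended, then a
    view-specific nonlinear map with additive noise at level `noise`."""
    rng = np.random.default_rng(seed)
    y = np.repeat([0, 1], n_per_class)
    mu = np.ones(d_latent)
    # Latent representation: N(+mu, I) for class 0, N(-mu, I) for class 1.
    h = rng.normal(size=(2 * n_per_class, d_latent)) + np.where(y[:, None] == 0, mu, -mu)
    views = []
    for f in (np.tanh, np.sin):  # view-specific nonlinearities (assumed)
        extra = rng.normal(size=(2 * n_per_class, d_extra))  # redundant distractors
        x = np.concatenate([h, extra], axis=1)
        views.append(f(x) + noise * rng.normal(size=x.shape))
    return views[0], views[1], y
```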

4.1.1 Robustness to Noise

To evaluate the robustness of all competing methods to noise, we vary the noise level $\eta$ over a range of values, with the dimensions of the extra features and the latent representation, as well as the sample size per class, held fixed. The regularization parameters $\beta_1$ and $\beta_2$ are tuned over a grid of candidate values. For a fair comparison, in DNN, deep IB, and our MEIB, the encoder for each view is an MLP with fully connected layers (a deeper one for view 1 and a single hidden layer for view 2), the fusion and classification layers are shared across methods, and the same activation function is used throughout.

Figure 2: “Classification error v.s. noise level” for all methods.

The experimental results on noise robustness are shown in Fig. 2. We can see that the performance of all competing methods degrades as the noise level $\eta$ increases. However, our MEIB and deep IB are consistently better than the others, which indicates the advantage of the bottleneck regularization, i.e., the term $I(X^{(i)};Z^{(i)})$, in suppressing noise.

4.1.2 Robustness to Redundant Dimensions

We further test the robustness of all competing methods to redundant dimensions. To this end, we gradually increase the dimension of the extra features. All other settings are the same as in Section 4.1.1, except that the noise factor $\eta$ is fixed. The experimental results are shown in Fig. 3 (left). Again, our MEIB and deep IB are superior to the other methods. We also plot the norms of the learned weights (by MEIB) associated with the informative input dimensions and the additional redundant dimensions. As can be seen in Fig. 3 (right), our MEIB is able to identify the redundant feature dimensions and assign small network weights to them.

Figure 3: Robustness to redundant dimensions

4.2 Real-world dataset

We evaluate our proposed method on three real-world multi-view datasets. Two-view MNIST: the first view is obtained by rotating the raw images by an angle drawn from a uniform distribution; the second view is obtained by adding uniformly distributed background noise. BBC consists of documents from BBC news in five topical areas, described by two views of textual features. Wisconsin X-Ray MicroBeam (XRMB) [21] consists of two views of acoustic and articulatory features. The performance of the different methods is shown in Table 1, which further indicates the superiority of our MEIB.

Method       MNIST         BBC           XRMB
linear CCA   0.423±0.002   0.168±0.002   0.461±0.007
DCCA         0.255±0.002   0.346±0.012   0.379±0.008
DCCAE        0.374±0.004   0.279±0.002   0.383±0.001
DNN          0.195±0.018   0.079±0.011   0.222±0.010
deep IB      0.194±0.022   0.077±0.018   0.217±0.008
MEIB         0.173±0.022   0.076±0.023   0.181±0.010

Table 1: Average classification error (mean±std over 5 runs) on the three real-world datasets.

5 Conclusion and Future Work

We developed the Multi-view matrix-Entropy based Information Bottleneck (MEIB) for multi-view learning. Using the recently proposed matrix-based Rényi's $\alpha$-order entropy functional, MEIB can be optimized directly by SGD or Adam, without variational approximation or adversarial training. Empirical results show that our MEIB outperforms other competing methods in terms of robustness to noise and redundant information contained in the source data.

Our study also raises a few insights (or open problems) concerning the general IB approach and its implementation: 1) When is the matrix-based entropy necessary? As shown in Fig. 4, our advantage becomes weaker as the number of samples in each view increases. This makes sense, because a larger sample size makes the variational approximation, or the adversarial training with an additional network, more stable; the lower bound of the mutual information also becomes tighter.

Figure 4: Classification error (on synthetic data) of all competing methods with respect to sample sizes in each view.

2) Practical concerns for more than two views? Extending existing multi-view learning methods to more than two views is always a challenging problem [9]. Although our framework can be simply extended to an arbitrary number of views by adding a new regularization term $\beta_i I(X^{(i)};Z^{(i)})$ for each additional view, doing so introduces an additional hyper-parameter as well. Although we observed that the performance of our MEIB is stable over a suitable range of $\beta_i$ (Fig. 5 right), we also observed that increasing the number of views does not continually improve classification performance (Fig. 5 left). This suggests that the IB regularization term alone is insufficient. One possible direction is to take into consideration the "synergistic" information [23] amongst all views in the future.

Figure 5: Left: error with respect to the number of views for MEIB; Right: error with respect to different values of the regularization parameters $\beta_1$ and $\beta_2$ (tuned over a grid).


  • [1] A. Achille and S. Soatto (2018) Information dropout: learning optimal representations through noisy computation. IEEE TPAMI 40 (12), pp. 2897–2905. Cited by: §3.
  • [2] I. E. Aguerri and A. Zaidi (2019) Distributed variational representation learning. IEEE TPAMI 43 (1), pp. 120–138. Cited by: §1, §2.2.
  • [3] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2017) Deep variational information bottleneck. In ICLR, Cited by: footnote 1.
  • [4] R. A. Amjad and B. C. Geiger (2019) Learning representations for neural network-based classification using the information bottleneck principle. IEEE TPAMI 42 (9), pp. 2225–2239. Cited by: §3.
  • [5] G. Andrew, R. Arora, J. Bilmes, and K. Livescu (2013) Deep canonical correlation analysis. In ICML, pp. 1247–1255. Cited by: §1, §4.
  • [6] R. Arora and K. Livescu (2012) Kernel cca for multi-view learning of acoustic features using articulatory measurements. In SMLSLP, Cited by: §1.
  • [7] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm (2018) Mutual information neural estimation. In ICML, pp. 531–540. Cited by: §3.
  • [8] K. Chaudhuri, S. M. Kakade, K. Livescu, and K. Sridharan (2009) Multi-view clustering via canonical correlation analysis. In ICML, pp. 129–136. Cited by: §1, §4.
  • [9] M. Federici, A. Dutta, P. Forré, N. Kushmann, and Z. Akata (2020) Learning robust representations via multi-view information bottleneck. In ICLR, Cited by: §1, §2.2, §5.
  • [10] R. Gilad-Bachrach, A. Navot, and N. Tishby (2003) An information theoretic tradeoff between complexity and accuracy. In Learning Theory and Kernel Machines, pp. 595–609. Cited by: §2.1.
  • [11] A. Kolchinsky, B. D. Tracey, and D. H. Wolpert (2019) Nonlinear information bottleneck. Entropy 21 (12), pp. 1181. Cited by: footnote 1.
  • [12] C. Lee and M. Schaar (2021) A variational information bottleneck approach to multi-omics data integration. In AISTATS, pp. 1513–1521. Cited by: §1, §1, §2.2.
  • [13] Z. Lou, Y. Ye, and X. Yan (2013) The multi-feature information bottleneck with application to unsupervised image categorization. In IJCAI, Cited by: §1.
  • [14] L. G. Sanchez Giraldo, M. Rao, and J. C. Principe (2014) Measures of entropy from data using infinitely divisible kernels. IEEE TIT 61 (1), pp. 535–548. Cited by: 1st item, §3.
  • [15] J. Song et al. (2021) Multicolor image classification using the multimodal information bottleneck network (mmib-net) for detecting diabetic retinopathy. Optics Express 29 (14), pp. 22732–22748. Cited by: §1, §2.2.
  • [16] S. Sun (2013) A survey of multi-view machine learning. NCAA 23 (7), pp. 2031–2038. Cited by: §2.2.
  • [17] N. Tishby, F. C. Pereira, and W. Bialek (1999) The information bottleneck method. In Allerton, pp. 368–377. Cited by: §1, §2.1.
  • [18] D. J. Trosten, S. Lokse, R. Jenssen, and M. Kampffmeyer (2021) Reconsidering representation alignment for multi-view clustering. In CVPR, pp. 1255–1265. Cited by: §3.
  • [19] Z. Wan, C. Zhang, P. Zhu, and Q. Hu (2021) Multi-view information-bottleneck representation learning. In AAAI, pp. 10085–10092. Cited by: §2.2.
  • [20] Q. Wang, C. Boudreau, Q. Luo, P. Tan, and J. Zhou (2019) Deep multi-view information bottleneck. In SDM, pp. 37–45. Cited by: §1, §2.2, §4.1, §4.
  • [21] W. Wang, R. Arora, K. Livescu, and J. A. Bilmes (2015) Unsupervised learning of acoustic features via deep canonical correlation analysis. In ICASSP, pp. 4590–4594. Cited by: §4.2.
  • [22] W. Wang, R. Arora, K. Livescu, and J. Bilmes (2015) On deep multi-view representation learning. In ICML, pp. 1083–1092. Cited by: §1, §4.
  • [23] P. L. Williams and R. D. Beer (2010) Nonnegative decomposition of multivariate information. arXiv preprint arXiv:1004.2515. Cited by: §5.
  • [24] C. Xu, D. Tao, and C. Xu (2014) Large-margin multi-view information bottleneck. IEEE TPAMI 36 (8), pp. 1559–1572. Cited by: §1.
  • [25] X. Yan, S. Hu, Y. Mao, Y. Ye, and H. Yu (2021) Deep multi-view learning methods: a review. Neurocomputing. Cited by: §1.
  • [26] S. Yu, F. Alesiani, X. Yu, R. Jenssen, and J. Principe (2021) Measuring dependence with matrix-based entropy functional. In AAAI, pp. 10781–10789. Cited by: §3.
  • [27] X. Yu, S. Yu, and J. C. Príncipe (2021) Deep deterministic information bottleneck with matrix-based entropy functional. In ICASSP, pp. 3160–3164. Cited by: footnote 1.