Multi-view Information Bottleneck Based on Matrix Entropy
By "intelligently" fusing the complementary information across different views, multi-view learning is able to improve the performance of classification tasks. In this work, we extend the information bottleneck principle to a supervised multi-view learning scenario and use the recently proposed matrix-based Rényi's α-order entropy functional to optimize the resulting objective directly, without the necessity of variational approximation or adversarial training. Empirical results on both synthetic and real-world datasets suggest that our method enjoys improved robustness to noise and redundant information in each view, especially given limited training samples. Code is available at <https://github.com/archy666/MEIB>.
Multi-view learning has become popular in many real-world applications, such as the analysis of multi-omics data spanning genomics, epigenomics, proteomics, and metabolomics. Traditional multi-view learning methods include canonical correlation analysis (CCA) and its nonlinear extensions such as kernel canonical correlation analysis (KCCA). However, both CCA and KCCA are restricted to only two views.
Owing to the remarkable success of deep neural networks (DNNs), there is a recent trend to leverage the power of DNNs to improve the performance of multi-view learning. Andrew et al. proposed deep canonical correlation analysis (DCCA) to perform complex nonlinear transformations on multi-view data. Wang et al. proposed deep canonically correlated autoencoders (DCCAE), which further improve on DCCA by adding an autoencoder reconstruction error as regularization.
Recently, the notable Information Bottleneck (IB) principle has been extended to multi-view learning problems to compress redundant or task-irrelevant information in the input views and preserve only the most task-relevant features [24, 13]. However, parameterizing the IB principle with DNNs is not a trivial task. A notorious reason is that mutual information estimation in high-dimensional spaces is widely acknowledged to be intractable. To this end, existing deep multi-view IB approaches (e.g., [12, 20, 2, 9, 15]) usually use variational approximation or adversarial training to maximize a lower bound of the original IB objective. However, the tightness of those derived lower bounds is hard to guarantee in practice, especially when only limited training data are available.
In this work, instead of evaluating a lower bound of the mutual information terms in the deep multi-view information bottleneck, we demonstrate that it is feasible to directly optimize the IB objective without any variational approximation or adversarial training. We term our methodology the Multi-view matrix-based Entropy Information Bottleneck (MEIB) and make the following contributions:
We introduce the matrix-based Rényi's α-order entropy functional to estimate the mutual information I(x; z) directly, which gives our MEIB a simple and tractable objective and also provides a deterministic compression of the input variables from different views.
Empirical results suggest that our MEIB outperforms other competing methods in terms of robustness to noise and redundant information, especially given limited training samples in each view.
Our studies also raise a few new insights into the design and implementation of deep multi-view IB methods, as will be discussed in the last section.
Let us denote the input random variable as x and the desired output variable as y (e.g., class labels). The information bottleneck (IB) approach considers extracting a compressed representation z from x that is relevant for predicting y. Formally, this objective is formulated as finding z such that the mutual information I(z; y) is maximized, while keeping the mutual information I(x; z) below a threshold r:

    max_{z ∈ Δ} I(z; y)  s.t.  I(x; z) ≤ r,    (1)

where Δ is the set of random variables z that obey the Markov chain y → x → z, which means that z is conditionally independent of y given x, and any information that z has about y comes from x.
In practice, the constrained optimization problem of Eq. (1) is hard to solve directly, and z is instead found by maximizing the following IB Lagrangian:

    L_IB = I(z; y) − β I(x; z),    (2)

where β ≥ 0 is a Lagrange multiplier that controls the trade-off between sufficiency (the performance on the task, as quantified by I(z; y)) and minimality (the complexity of the representation, as measured by I(x; z)). In this sense, the IB principle also provides a natural approximation of a minimal sufficient statistic.
Multi-view learning comes down to the problem of machine learning from data represented by multiple distinct feature sets.
The main challenge in applying the IB principle is that the mutual information terms are computationally intractable. To address this problem, several studies resort to variational approximation to estimate mutual information, such as [12, 20, 2, 9, 15]. The main idea of variational approximation is to develop a variational bound on the sufficiency-redundancy trade-off. Specifically, the intractable distributions are replaced by known distributions, such as the Gaussian, to compute a lower bound of the original objective; an approximate solution is then obtained by maximizing this lower bound. The IB principle has also been applied to unsupervised multi-view learning. For example, one line of work compresses redundant information with a collaborative multi-view IB network, while another assumes that each view contains the same task-relevant information and therefore suggests maximizing the mutual information between the latent representations extracted from each view to compress redundant information. These studies likewise maximize a sample-based differentiable lower bound on mutual information instead of directly optimizing the objective. As mentioned above, such methods share the drawback that the tightness of the approximation is difficult to guarantee.
The architecture of our MEIB is illustrated in Fig. 1. Specifically, we consider a general supervised multi-view learning scenario in which we aim to learn a joint yet robust representation z given V different views of features {x_i}_{i=1}^V and their labels y. To this end, our MEIB consists of V separate encoders f_i (i = 1, …, V), a feature fusion network g, and a (nonlinear) classifier. Each encoder f_i maps view-specific data x_i to a latent representation z_i to remove the noise and redundant information contained in x_i. The latent representations {z_i} are then fused by the network g to obtain a joint and compact representation z.
Note that, in multi-view unsupervised clustering, a common strategy to fuse view-specific latent representations is the weighted linear combination z = Σ_i w_i z_i, s.t. w_i ≥ 0 and Σ_i w_i = 1. Here, instead of using this naïve weighted linear combination, our fusion network g is more general and absorbs the linear combination as a special case.
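The contrast between the naive weighted combination and a more general fusion map can be sketched as follows. This is a minimal NumPy illustration with made-up sizes; the actual fusion network g in MEIB is a trained neural network, and the function names here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent representations from V = 2 view-specific encoders
# (batch of 4 samples, latent dimension 8; sizes are illustrative only).
z1 = rng.standard_normal((4, 8))
z2 = rng.standard_normal((4, 8))

def weighted_fusion(latents, weights):
    """Naive convex combination: z = sum_i w_i z_i, with w_i >= 0, sum_i w_i = 1."""
    w = np.asarray(weights, dtype=float)
    assert np.all(w >= 0) and np.isclose(w.sum(), 1.0)
    return sum(wi * zi for wi, zi in zip(w, latents))

def mlp_fusion(latents, W, b):
    """A more general fusion map g: concatenate, then a dense layer
    (kept linear here so the claim below is exact)."""
    cat = np.concatenate(latents, axis=1)  # (batch, sum of latent dims)
    return cat @ W + b                     # (batch, fused dim)

z_lin = weighted_fusion([z1, z2], [0.5, 0.5])
W = rng.standard_normal((16, 8)) * 0.1     # hypothetical fusion weights
z_gen = mlp_fusion([z1, z2], W, np.zeros(8))
```

Choosing W as stacked scaled identity blocks makes `mlp_fusion` reproduce the weighted combination exactly, which is the sense in which g "absorbs the linear combination as a special case".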
Therefore, the overall objective of our MEIB can be expressed as:

    max I(z; y) − Σ_{i=1}^V β_i I(x_i; z_i),    (3)

where β_i refers to the regularization parameter for the i-th view.
For the first term I(z; y), it can be replaced with the risk associated with the prediction performance on y according to the cross-entropy loss [1, 4] (the same strategy has also been used in the variational information bottleneck (VIB), the nonlinear information bottleneck (NIB), and the deep deterministic information bottleneck (DIB)). Therefore, Eq. (3) is equivalent to:

    min L_CE(ŷ, y) + Σ_{i=1}^V β_i I(x_i; z_i),    (4)

where L_CE denotes the cross-entropy loss between the prediction ŷ and the ground-truth label y.
In this sense, the main challenge in optimizing Eq. (3) or Eq. (4) is that the exact computation of the compression terms I(x_i; z_i) is intractable due to the high dimensionality of the data.
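Assembling the objective of Eq. (4) is straightforward once per-view compression estimates are available; a minimal sketch, with hypothetical mutual information values and β's standing in for the terms MEIB computes with the matrix-based estimator of Eq. (7), might look like:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross-entropy given predicted class probabilities and integer labels."""
    n = probs.shape[0]
    return float(-np.mean(np.log(probs[np.arange(n), labels] + 1e-12)))

def meib_objective(probs, labels, mi_estimates, betas):
    """Eq. (4): prediction loss plus beta-weighted compression terms.
    `mi_estimates` stands in for the per-view I(x_i; z_i) values."""
    compression = sum(b * mi for b, mi in zip(betas, mi_estimates))
    return cross_entropy(probs, labels) + compression

# Toy usage with made-up predictions and regularization strengths.
probs = np.array([[0.9, 0.1], [0.2, 0.8]])
labels = np.array([0, 1])
loss = meib_objective(probs, labels, mi_estimates=[0.5, 0.3], betas=[0.01, 0.02])
```

In training, `mi_estimates` would be recomputed on every mini-batch so that gradients flow through both the prediction and compression terms.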
In this work, we address this issue by making use of the matrix-based Rényi's α-order entropy functional to estimate I(x; z) in each view. Specifically, consider N pairs of samples from a mini-batch of size N in the i-th view, i.e., {(x_j, z_j)}_{j=1}^N, where each x_j is an input sample and each z_j denotes its latent representation produced by the encoder (for simplicity, we drop the view index in the remainder of this section). We can view both x and z as random vectors. The entropy of x can then be defined over the eigenspectrum of a (normalized) Gram matrix A (obtained from K, where K_jl = κ(x_j, x_l) and κ is a Gaussian kernel) as:

    S_α(A) = (1 / (1 − α)) log2 [ Σ_{j=1}^N λ_j(A)^α ],    (5)

where α > 0 and α ≠ 1. A is the normalized version of K, i.e., A = K / tr(K). λ_j(A) denotes the j-th eigenvalue of A.
The entropy of z can be measured similarly on the eigenspectrum of another (normalized) Gram matrix B.
Further, the joint entropy of x and z can be defined as:

    S_α(A, B) = S_α( (A ∘ B) / tr(A ∘ B) ),    (6)

where ∘ denotes the Hadamard (or element-wise) product. The mutual information between x and z is then estimated as:

    I_α(x; z) = S_α(A) + S_α(B) − S_α(A, B).    (7)

The differentiability of the matrix-based Rényi's α-order entropy functional has been derived in previous work.
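The estimator of Eqs. (5)-(7) can be sketched in NumPy as follows. This is a minimal, non-differentiable illustration; the function names are hypothetical, and the default order close to 1 is an assumed choice rather than the paper's exact setting:

```python
import numpy as np

def gram_gaussian(X, sigma):
    """N x N Gram matrix with a Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * sigma**2))

def renyi_entropy(A, alpha=1.01):
    """Eq. (5): S_alpha(A) = log2(sum_j lambda_j(A)^alpha) / (1 - alpha).
    A must be symmetric with unit trace."""
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)  # guard tiny negative eigenvalues
    return np.log2(np.sum(lam**alpha)) / (1.0 - alpha)

def matrix_mutual_information(X, Z, sigma_x, sigma_z, alpha=1.01):
    """Eqs. (6)-(7): I_alpha(x; z) = S(A) + S(B) - S(A, B), with the joint
    entropy computed on the trace-normalized Hadamard product."""
    Kx, Kz = gram_gaussian(X, sigma_x), gram_gaussian(Z, sigma_z)
    A, B = Kx / np.trace(Kx), Kz / np.trace(Kz)
    AB = A * B                                       # Hadamard product
    AB = AB / np.trace(AB)
    return renyi_entropy(A, alpha) + renyi_entropy(B, alpha) - renyi_entropy(AB, alpha)

# Toy usage: input batch and a toy "encoder" output.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
Z = np.tanh(X @ rng.standard_normal((3, 2)))
mi = matrix_mutual_information(X, Z, sigma_x=1.0, sigma_z=1.0)
```

Sanity checks follow directly from Eq. (5): a batch of identical samples gives an all-ones K with a single nonzero eigenvalue and hence zero entropy, while well-separated samples give A ≈ I/N and the maximal entropy log2 N.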
Compared with variational approximation or adversarial training, which require an additional auxiliary network (e.g., the mutual information neural estimator, MINE) to approximate a lower bound of mutual information, Eq. (7) is mathematically well defined and computationally efficient, especially for large networks. Moreover, there are only two hyper-parameters: the order α and the kernel width σ. Throughout this work, we set α close to 1 to approximate the Shannon entropy. For σ, we evaluate the k nearest-neighbor distances of each sample and take the mean; we then choose σ as the average of these mean values over all samples.
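The kernel-width heuristic can be sketched as follows; this is a minimal NumPy version with a hypothetical function name, and the specific value of k is illustrative rather than the paper's setting:

```python
import numpy as np

def knn_kernel_width(X, k=10):
    """Choose sigma as the average, over all samples, of each sample's
    mean distance to its k nearest neighbors."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    d = np.sqrt(d2)
    np.fill_diagonal(d, np.inf)              # exclude self-distances
    k = min(k, X.shape[0] - 1)
    knn = np.sort(d, axis=1)[:, :k]          # each row: k smallest distances
    return float(np.mean(knn.mean(axis=1)))
```

A data-driven σ of this kind adapts the kernel scale to the current mini-batch, which matters because the Gram-matrix eigenspectrum (and thus the entropy estimate) is sensitive to the kernel width.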
In this section, we compare the performance of our MEIB with several popular multi-view learning methods on both synthetic and real-world datasets. The selected competitors include linear CCA, DCCA, DCCAE, a plain DNN, and deep IB. Here, DNN refers to the network shown in Fig. 1 trained only with the cross-entropy loss. We repeat each experiment 5 times and report average classification errors.
The method for synthesizing data is the same as in prior work. We sample points per class from one of two Gaussian distributions to form the latent representation. To make classification harder, we concatenate extra features to the latent representation to synthesize new features. The extra features for view 1 and for view 2 are each drawn from a pair of Gaussian distributions. A nonlinear transformation is then applied, accompanied by additive Gaussian noise whose scale is controlled by a noise level. Thus, the generation of each view can be expressed as x_i = T_i(t_i) + ε, i = 1, 2. Of all the synthetic data, one portion is used for training and the rest for testing.
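The generation scheme above can be sketched as follows. All dimensions, class means, noise scales, and sample counts here are hypothetical placeholders, not the paper's settings, and the random nonlinear maps stand in for the unspecified transformations T_i:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper's exact values are not reproduced here.
N_PER_CLASS, LATENT_DIM, EXTRA_DIM = 100, 5, 10

def make_view(h, noise_level, hidden_dim=32):
    """One synthetic view: concatenate class-irrelevant extra features to the
    latent representation, apply a random nonlinear map, then add noise."""
    n = h.shape[0]
    extra = rng.standard_normal((n, EXTRA_DIM))     # redundant features
    t = np.concatenate([h, extra], axis=1)
    W1 = rng.standard_normal((t.shape[1], hidden_dim))
    W2 = rng.standard_normal((hidden_dim, hidden_dim))
    x = np.tanh(t @ W1) @ W2                        # stand-in for T_i
    return x + noise_level * rng.standard_normal(x.shape)

# Latent representation: one Gaussian per class, with separated means.
h0 = rng.normal(loc=-1.0, scale=1.0, size=(N_PER_CLASS, LATENT_DIM))
h1 = rng.normal(loc=+1.0, scale=1.0, size=(N_PER_CLASS, LATENT_DIM))
h = np.vstack([h0, h1])
y = np.repeat([0, 1], N_PER_CLASS)

x1 = make_view(h, noise_level=0.1)                  # view 1
x2 = make_view(h, noise_level=0.1)                  # view 2
```

Because the two views share the same latent h but carry independent extra features and noise, only the shared, task-relevant information survives an ideal bottleneck.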
To evaluate the robustness of all competing methods to noise, we vary the noise level over a range of values. The dimensions of the extra features and the latent representation are fixed, as is the sample size per class. The trade-off parameters β_1 and β_2 are tuned over a grid of candidate values. For a fair comparison, in DNN, deep IB, and our MEIB, the encoder for each view is an MLP with several fully connected layers for view 1 and a single hidden layer for view 2, the fusion and classification layers share the same structure across methods, and the same activation function is used in all layers.
The experimental results on noise robustness are shown in Fig. 2. The performance of all competing methods decreases as the noise level increases. However, our MEIB and deep IB are consistently better than the others, which indicates the advantage of the bottleneck regularization, i.e., I(x_i; z_i), in suppressing noise.
We further test the robustness of all competing methods to redundant dimensions. To this end, we increase the dimensionality of the extra features over a range of values. All other settings are the same as in Section 4.1.1, except that the noise factor is fixed. The experimental results are shown in Fig. 3 (left). Again, our MEIB and deep IB are superior to the other methods. We also plot the norms of the learned weights (by MEIB) associated with the input dimensions corresponding to informative features and those corresponding to the additional redundant features. As can be seen in Fig. 3 (right), our MEIB is able to identify redundant feature dimensions and assigns small network weights to them.
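The weight-norm diagnostic can be sketched as follows, using made-up weight matrices and scales purely for illustration:

```python
import numpy as np

def input_dim_weight_norms(W):
    """L2 norm of the first-layer weights attached to each input dimension;
    small norms flag dimensions the encoder has effectively pruned."""
    return np.linalg.norm(W, axis=1)

rng = np.random.default_rng(0)
# Hypothetical trained first-layer weights: 5 informative input dims followed
# by 10 redundant ones driven toward zero by the bottleneck regularization.
W = np.vstack([rng.standard_normal((5, 16)),
               0.01 * rng.standard_normal((10, 16))])
norms = input_dim_weight_norms(W)
print(norms[:5].mean() > norms[5:].mean())  # True: informative dims carry more weight
```

Plotting `norms` against the input-dimension index reproduces the kind of diagnostic shown in Fig. 3 (right).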
We evaluate our proposed method on 3 real-world multi-view datasets. MNIST has two views: the first view is obtained by rotating the raw images by an angle drawn from a uniform distribution, and the second view is obtained by adding uniformly distributed background noise. BBC (http://mlg.ucd.ie/datasets/bbc.html) consists of documents from BBC news in five topical areas, represented by two views with distinct feature sets. Wisconsin X-Ray MicroBeam (XRMB) likewise consists of two views with different feature dimensionalities. The performance of the different methods is shown in Table 1, which further indicates the superiority of our MEIB.
We developed the Multi-view matrix-based Entropy Information Bottleneck (MEIB) for multi-view learning. Using the recently proposed matrix-based Rényi's α-order entropy functional, MEIB can be optimized directly by SGD or Adam, without variational approximation or adversarial training. Empirical results show that our MEIB outperforms competing methods in terms of robustness to noise and redundant information contained in the source data.
Our study also raises a few insights (or open problems) concerning the general IB approach and its implementation: 1) When is the matrix-based entropy necessary? As shown in Fig. 4, our advantage weakens as the number of samples in each view increases. This makes sense, because a larger sample size makes variational approximation, or adversarial training with an additional network, more stable; the lower bound on the mutual information also becomes tighter.
2) Practical concerns for more than two views? Extending existing multi-view learning methods to more than two views remains a challenging problem. Although our framework extends to an arbitrary number of views by simply adding a new regularization term β_i I(x_i; z_i) per view, each such term introduces an additional hyper-parameter β_i. Although we observed that the performance of our MEIB is stable over a suitable range of β_i (Fig. 5, right), we also observed that increasing the number of views does not continually improve classification performance (Fig. 5, left). This suggests that the IB regularization term alone is insufficient. One possible direction is to take the "synergistic" information among all views into consideration in future work.