
Mutual Information Gradient Estimation for Representation Learning

Mutual Information (MI) plays an important role in representation learning. However, MI is unfortunately intractable in continuous and high-dimensional settings. Recent advances establish tractable and scalable MI estimators to discover useful representations. However, most of the existing methods are not capable of providing an accurate, low-variance estimate of MI when the MI is large. We argue that directly estimating the gradients of MI is more appealing for representation learning than estimating MI itself. To this end, we propose the Mutual Information Gradient Estimator (MIGE) for representation learning based on the score estimation of implicit distributions. MIGE exhibits tight and smooth gradient estimation of MI in high-dimensional and large-MI settings. We expand the applications of MIGE to both unsupervised learning of deep representations based on InfoMax and the Information Bottleneck method. Experimental results indicate significant performance improvement in learning useful representations.




1 Introduction

Mutual information (MI) is an appealing metric widely used in information theory and machine learning to quantify the amount of shared information between a pair of random variables. Specifically, given a pair of random variables $(\mathbf{x}, \mathbf{z})$, the MI, denoted by $I(\mathbf{x}, \mathbf{z})$, is defined as

$$I(\mathbf{x}, \mathbf{z}) = \mathbb{E}_{p(\mathbf{x}, \mathbf{z})}\left[\log \frac{p(\mathbf{x}, \mathbf{z})}{p(\mathbf{x})\,p(\mathbf{z})}\right],$$

where $\mathbb{E}_{p(\mathbf{x}, \mathbf{z})}$ is the expectation over the given distribution. Since MI is invariant to invertible and smooth transformations, it can capture non-linear statistical dependencies between variables (Kinney and Atwal, 2014). These appealing properties make it a fundamental measure of true dependence. Therefore, MI has found applications in a wide range of machine learning tasks, including feature selection (Kwak and Choi, 2002; Fleuret, 2004; Peng et al., 2005), clustering (Müller et al., 2012; Ver Steeg and Galstyan, 2015), and causality (Butte and Kohane, 1999). It has also been pervasively used in science, such as biomedical sciences (Maes et al., 1997), computational biology (Krishnaswamy et al., 2014), and computational neuroscience (Palmer et al., 2015).
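To make the definition concrete, the following sketch computes the MI (in nats) of a small discrete joint distribution, where the expectation reduces to a finite sum; the probability table is purely illustrative:

```python
import numpy as np

# Illustrative joint distribution p(x, z) of two binary variables (rows: x, cols: z)
p_xz = np.array([[0.4, 0.1],
                 [0.1, 0.4]])
p_x = p_xz.sum(axis=1, keepdims=True)   # marginal p(x)
p_z = p_xz.sum(axis=0, keepdims=True)   # marginal p(z)

# I(x; z) = E_{p(x,z)}[ log p(x,z) / (p(x) p(z)) ]
mi = (p_xz * np.log(p_xz / (p_x * p_z))).sum()
```

For this table the variables are positively dependent, so the MI is strictly positive; for an independent joint (an outer product of the marginals), the same sum is exactly zero.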

Recently, there has been a revival of methods in unsupervised representation learning based on MI. A seminal work is the InfoMax principle (Linsker, 1988), where given an input instance $\mathbf{x}$, the goal of the InfoMax principle is to learn a representation $\mathbf{z}$ by maximizing the MI between the input and its representation. A growing set of recent works have demonstrated promising empirical performance in unsupervised representation learning via MI maximization (Krause et al., 2010; Hu et al., 2017; Alemi et al., 2018b; Oord et al., 2018; Hjelm et al., 2019). Another closely related work is the Information Bottleneck method (Tishby et al., 2000; Alemi et al., 2017), where MI is used to limit the contents of representations. Specifically, the representations are learned by extracting task-related information from the original data while being constrained to discard parts that are irrelevant to the task. Several recent works have also suggested that by controlling the amount of information between learned representations and the original data, one can tune desired characteristics of trained models such as generalization error (Tishby and Zaslavsky, 2015; Vera et al., 2018), robustness (Alemi et al., 2017), and detection of out-of-distribution data (Alemi et al., 2018a).

Despite playing a pivotal role across a variety of domains, MI is notoriously intractable. Exact computation is only tractable for discrete variables, or for a limited family of problems where the probability distributions are known. For more general problems, MI is challenging to analytically compute or estimate from samples. A variety of MI estimators have been developed over the years, including likelihood-ratio estimators (Suzuki et al., 2008), binning (Fraser and Swinney, 1986; Darbellay and Vajda, 1999; Shwartz-Ziv and Tishby, 2017), k-nearest neighbors (Kozachenko and Leonenko, 1987; Kraskov et al., 2004; Pérez-Cruz, 2008; Singh and Póczos, 2016), and kernel density estimators (Moon et al., 1995; Kwak and Choi, 2002; Kandasamy et al., 2015). However, few of these mutual information estimators scale well with dimension and sample size in machine learning problems (Gao et al., 2015).

In order to overcome the intractability of MI in continuous and high-dimensional settings, Alemi et al. (2017) combine the variational bounds of Barber and Agakov (2003) with neural networks for the estimation. However, a tractable density is required for the approximate distribution due to the variational approximation, which limits its application to general-purpose estimation since the underlying distributions are often unknown. Alternatively, the Mutual Information Neural Estimation (MINE, Belghazi et al. (2018)) and the Jensen-Shannon MI estimator (JSD, Hjelm et al. (2019)) enable differentiable and tractable estimation of MI by training a discriminator to distinguish samples coming from the joint distribution from those of the product of the marginals. In detail, MINE employs a lower bound on the MI based on the Donsker-Varadhan representation of the KL-divergence, while JSD follows the f-GAN formulation of the KL-divergence. In general, these estimators are often noisy and can lead to unstable training due to their dependence on the discriminator used to estimate the bounds of mutual information. As pointed out by Poole et al. (2019), these unnormalized critic-based estimators of MI exhibit high variance and are challenging to tune. An alternative low-variance choice of MI estimator is Information Noise-Contrastive Estimation (InfoNCE, Oord et al. (2018)), which introduces Noise-Contrastive Estimation with flexible critics parameterized by neural networks as a bound to approximate MI. Nonetheless, its estimate saturates at the log of the batch size and suffers from high bias. Despite their modeling power, none of these estimators are capable of providing accurate, low-variance estimates of MI when the MI is large and the batch size is small (Poole et al., 2019). As supported by the theoretical findings of McAllester and Statos (2018), any distribution-free high-confidence lower bound on entropy requires a sample size exponential in the size of the bound. More discussion of the bounds of MI and their relationships can be found in Poole et al. (2019).

In summary, existing estimators first approximate MI and then use these approximations to optimize the associated parameters. For estimating MI based on any finite number of samples, there exists an infinite number of functions, with arbitrarily diverse gradients, that can perfectly approximate the true MI at these samples. However, these approximate functions can lead to unstable training and poor performance in optimization due to the discrepancy between the gradients of the approximation and those of the true MI. Estimating the gradients of MI rather than MI itself may therefore be a better approach for MI optimization. To this end, to the best of our knowledge, we are the first to propose the Mutual Information Gradient Estimator (MIGE) for representation learning. In detail, we estimate the score function of an implicit distribution, $\nabla_{\mathbf{z}} \log p(\mathbf{z})$, to achieve general-purpose MI gradient estimation for representation learning. In particular, for high-dimensional inputs such as text, images and videos, score function estimation via the Spectral Stein Gradient Estimator (SSGE) (Shi et al., 2018) is computationally expensive and complex. We thus propose an efficient high-dimensional score function estimator to make SSGE scalable. To this end, we derive a new reparameterization trick for the representation distribution based on the lower-variance reparameterization trick proposed by Roeder et al. (2017).

We summarize the contributions of this paper as follows:

  • We propose the Mutual Information Gradient Estimator (MIGE) for representation learning based on the score function estimation of implicit distributions. Compared with MINE and MINE-$f$, MIGE provides a tighter and smoother gradient estimation of MI in a high-dimensional and large-MI setting, as shown in Figure 1 of Section 4.

  • We propose the Scalable SSGE to alleviate the exorbitant computational cost of SSGE in high-dimensional settings.

  • To learn meaningful representations, we apply MIGE as the gradient estimator for both InfoMax and the Information Bottleneck method, and achieve improved performance over their corresponding competitors.

Figure 1: Estimation performance of MINE, MINE-$f$ and MIGE. Each estimation was repeated an additional 20 times and plotted as light curves. Top: true MI and the corresponding estimates of MINE and MINE-$f$. Bottom: true gradient and the corresponding estimates of MINE, MINE-$f$ and MIGE. Our approach MIGE appears only in the bottom figures since it directly gives a gradient estimate. As we observe, MIGE gives more stable, smooth and accurate results.

2 Scalable Spectral Stein Gradient Estimator

Score estimation of implicit distributions has been widely explored in the past few years (Song et al., 2019; Li and Turner, 2017; Shi et al., 2018). A promising method of score estimation is the Stein gradient estimator (Li and Turner, 2017; Shi et al., 2018), which is proposed for implicit distributions. It is inspired by generalized Stein’s identity  (Gorham and Mackey, 2015; Liu and Wang, 2016) as follows.

Stein’s identity. Let $q(\mathbf{x})$ be a continuously differentiable (also called smooth) density supported on $\mathcal{X} \subseteq \mathbb{R}^d$, and let $\mathbf{h}(\mathbf{x}) = [h_1(\mathbf{x}), h_2(\mathbf{x}), \dots, h_{d'}(\mathbf{x})]^\top$ be a smooth vector-valued function. Further, assume the boundary condition on $\mathbf{h}$:

$$q(\mathbf{x})\,\mathbf{h}(\mathbf{x}) = \mathbf{0}, \ \forall \mathbf{x} \in \partial\mathcal{X}, \quad \text{or} \quad \lim_{\|\mathbf{x}\| \to \infty} q(\mathbf{x})\,\mathbf{h}(\mathbf{x}) = \mathbf{0} \ \text{if} \ \mathcal{X} = \mathbb{R}^d.$$

Under this condition, the following identity can be easily checked using integration by parts:

$$\mathbb{E}_{q}\left[\mathbf{h}(\mathbf{x})\,\nabla_{\mathbf{x}} \log q(\mathbf{x})^\top + \nabla_{\mathbf{x}} \mathbf{h}(\mathbf{x})\right] = \mathbf{0}. \tag{3}$$

Here $\mathbf{h}$ is said to be in the Stein class of $q$ if Stein’s identity Eq. (3) holds. Monte Carlo estimation of the expectation in Eq. (3) builds the connection between $\nabla_{\mathbf{x}} \log q(\mathbf{x})$ and the samples from $q$ in Stein’s identity. Motivated by Stein’s identity, Shi et al. (2018) proposed the Spectral Stein Gradient Estimator (SSGE) for implicit distributions, based on Stein’s identity and a spectral decomposition of kernel operators where the eigenfunctions are approximated by the Nyström method. Below we briefly review SSGE; more details can be found in Shi et al. (2018).

Specifically, we denote the target gradient function to estimate by $\mathbf{g}(\mathbf{x}) = \nabla_{\mathbf{x}} \log q(\mathbf{x})$. The $i$-th component of the gradient is $g_i(\mathbf{x}) = \partial \log q(\mathbf{x}) / \partial x_i$. We assume $g_i \in \mathcal{L}^2(q)$, and $\{\psi_j\}_{j \geq 1}$ denotes an orthonormal basis of $\mathcal{L}^2(q)$. We can expand $g_i$ into the spectral series $g_i(\mathbf{x}) = \sum_{j=1}^{\infty} \beta_{ij}\,\psi_j(\mathbf{x})$. The value of the eigenfunction $\psi_j$ at $\mathbf{x}$ can be approximated by the Nyström method (Xu et al., 2015). Due to the orthonormality of the eigenfunctions, there is a constraint under the probability measure $q(\cdot)$: $\int \psi_i(\mathbf{x})\,\psi_j(\mathbf{x})\,q(\mathbf{x})\,d\mathbf{x} = \delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta. Based on this constraint, we can obtain the following equation for $\psi_j$:

$$\int k(\mathbf{x}, \mathbf{y})\,\psi_j(\mathbf{y})\,q(\mathbf{y})\,d\mathbf{y} = \mu_j\,\psi_j(\mathbf{x}), \tag{4}$$

where $k(\mathbf{x}, \mathbf{y})$ is a positive-definite kernel function. The left side of the above equation can be approximated by a Monte Carlo estimate using i.i.d. samples $\mathbf{x}^1, \dots, \mathbf{x}^M$ from $q$: $\frac{1}{M}\mathbf{K}\mathbf{u}_j \approx \mu_j\,\mathbf{u}_j$, where $\mathbf{K}$ is the Gram matrix with $K_{mn} = k(\mathbf{x}^m, \mathbf{x}^n)$ and $\mathbf{u}_j = [\psi_j(\mathbf{x}^1), \dots, \psi_j(\mathbf{x}^M)]^\top$. We can solve this eigenvalue problem by choosing the $J$ largest eigenvalues $\lambda_1 \geq \dots \geq \lambda_J$ of $\mathbf{K}$, where $\mathbf{u}_j$ denotes the $j$-th eigenvector of the Gram matrix. The Nyström approximation of $\psi_j$ can then be obtained by combining with Eq. (4):

$$\hat{\psi}_j(\mathbf{x}) = \frac{\sqrt{M}}{\lambda_j} \sum_{m=1}^{M} u_{jm}\,k(\mathbf{x}, \mathbf{x}^m).$$

Furthermore, based on the orthonormality of $\{\psi_j\}$ and Stein’s identity, we can easily obtain $\beta_{ij} = -\mathbb{E}_{q}\left[\partial \psi_j(\mathbf{x}) / \partial x_i\right]$. By taking the derivative of both sides of Eq. (4), we can show that

$$\frac{\partial \psi_j(\mathbf{x})}{\partial x_i} = \frac{1}{\mu_j} \int \frac{\partial k(\mathbf{x}, \mathbf{y})}{\partial x_i}\,\psi_j(\mathbf{y})\,q(\mathbf{y})\,d\mathbf{y}.$$

Then we can estimate $\beta_{ij}$ as

$$\hat{\beta}_{ij} = -\frac{1}{M} \sum_{m=1}^{M} \frac{\partial \hat{\psi}_j(\mathbf{x}^m)}{\partial x_i}.$$

Finally, by truncating the expansion to the first $J$ terms and plugging in the Nyström approximations of $\psi_j$, we get the score estimator:

$$\hat{g}_i(\mathbf{x}) = \sum_{j=1}^{J} \hat{\beta}_{ij}\,\hat{\psi}_j(\mathbf{x}).$$
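To make the estimator concrete, here is a minimal NumPy sketch of the spectral score estimator above; the RBF bandwidth `sigma`, the truncation level `J`, and the helper names are illustrative choices, not the paper's settings:

```python
import numpy as np

def rbf_kernel(X, Y, sigma):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def rbf_grad_x(x, Y, sigma):
    # Gradient d k(x, y_l) / d x for one query point x: shape (M, d)
    diff = x[None, :] - Y
    k = np.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))
    return -diff * k[:, None] / sigma ** 2

def ssge_score(samples, x, J=6, sigma=1.0):
    """Estimate g(x) = grad_x log q(x) from i.i.d. samples of q."""
    M, d = samples.shape
    K = rbf_kernel(samples, samples, sigma)
    eigvals, eigvecs = np.linalg.eigh(K)           # ascending order
    lam = eigvals[::-1][:J]                        # J largest eigenvalues
    U = eigvecs[:, ::-1][:, :J]                    # matching eigenvectors (M, J)

    # Nystrom eigenfunction derivatives at each sample, then
    # beta_j = -(1/M) * sum_m d psi_j(x^m) / d x
    grad_psi = np.zeros((M, J, d))
    for m in range(M):
        dk = rbf_grad_x(samples[m], samples, sigma)             # (M, d)
        grad_psi[m] = (np.sqrt(M) / lam)[:, None] * (U.T @ dk)  # (J, d)
    beta = -grad_psi.mean(axis=0)                  # (J, d)

    # Nystrom eigenfunctions at the query point, then g(x) = sum_j beta_j psi_j(x)
    kx = rbf_kernel(x[None, :], samples, sigma)[0]  # (M,)
    psi_x = np.sqrt(M) / lam * (U.T @ kx)           # (J,)
    return psi_x @ beta                             # (d,)
```

As a sanity check, for samples from a one-dimensional standard Gaussian the true score is $-x$, so the estimate should be negative at positive query points and positive at negative ones.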
In general, representation learning for large-scale datasets is usually costly in terms of storage and computation. For instance, the dimension of images in the STL-10 dataset is $96 \times 96 \times 3$ (i.e., the vector length is 27648). This makes it almost impossible to directly estimate the gradient of MI between the input and the representation. To alleviate this problem, we introduce random projection (RP) (Bingham and Mannila, 2001) to reduce the dimension of the input.

We briefly review RP; more details can be found in Bingham and Mannila (2001). RP projects the original $d$-dimensional data into a $k$-dimensional ($k \ll d$) subspace. Concretely, let the matrix $X_{N \times d}$ denote the original set of $N$ $d$-dimensional data points. The projection of the original data is obtained by introducing a random matrix $R_{d \times k}$ whose columns have unit length, as follows (Bingham and Mannila, 2001):

$$X^{RP}_{N \times k} = X_{N \times d}\, R_{d \times k}.$$

After RP, the Euclidean distance between two original data vectors can be approximated by the Euclidean distance of the projected vectors in the reduced space:

$$\|\mathbf{x}_1 - \mathbf{x}_2\| \approx \sqrt{d / k}\, \|R^\top \mathbf{x}_1 - R^\top \mathbf{x}_2\|, \tag{8}$$

where $\mathbf{x}_1$ and $\mathbf{x}_2$ denote the two data vectors in the original high-dimensional space.
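As a quick numerical check of Eq. (8), the following sketch (with illustrative sizes, e.g. flattened CIFAR-scale vectors down to 256 dimensions) projects Gaussian data and compares a pairwise distance before and after RP:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, k = 100, 3072, 256             # illustrative sizes only
X = rng.normal(size=(N, d))

# Random matrix whose columns have unit length
R = rng.normal(size=(d, k))
R /= np.linalg.norm(R, axis=0, keepdims=True)

X_rp = X @ R                         # (N, k) projected data

# Eq. (8): pairwise Euclidean distance is preserved up to a sqrt(d/k) factor
orig = np.linalg.norm(X[0] - X[1])
proj = np.sqrt(d / k) * np.linalg.norm(X_rp[0] - X_rp[1])
```

For these sizes the relative distortion of a single pairwise distance is typically a few percent, which is what makes RP a reasonable front end for the kernel computations in SSGE.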

Based on the principle of RP, we can derive a Scalable Spectral Stein Gradient Estimator, which is an efficient high-dimensional score function estimator. One can show that the RBF kernel satisfies Stein’s identity (Liu and Wang, 2016), and Shi et al. (2018) also show that it is a promising choice for SSGE with a lower error bound. To reduce the computation of the kernel similarities in SSGE in high-dimensional settings, we replace the input of SSGE with projections obtained by RP, according to the approximation in Eq. (8), for the computation of the RBF kernel.

3 Mutual Information Gradient Estimator

As gradient estimation is a straightforward and effective approach to optimization, we propose a gradient estimator for MI based on score estimation of implicit distributions, called the Mutual Information Gradient Estimator (MIGE). In this section, we focus on the three most general cases of MI gradient estimation for representation learning, and derive the corresponding MI gradient estimator for each circumstance.

We outline the general setting of training an encoder to learn a representation. Let $\mathcal{X}$ and $\mathcal{Z}$ be the input and representation domains, and let $E_\psi : \mathcal{X} \to \mathcal{Z}$ with parameters $\psi$ denote a continuous and (almost everywhere) differentiable parametric function, which is usually a neural network, namely an encoder. $p(\mathbf{x})$ denotes the empirical distribution given the input data $\mathbf{x} \in \mathcal{X}$. We can obtain the representation of the input data through the encoder, $\mathbf{z} = E_\psi(\mathbf{x})$. $p(\mathbf{z})$ is defined as the marginal distribution induced by pushing samples from $p(\mathbf{x})$ through the encoder $E_\psi$. We also define $p(\mathbf{x}, \mathbf{z})$ as the joint distribution with marginals $p(\mathbf{x})$ and $p(\mathbf{z})$, which is determined by the encoder $E_\psi$.

Circumstance I.  Given that the encoder is deterministic, our goal is to estimate the gradient of the MI between the input $\mathbf{x}$ and the encoder output $\mathbf{z} = E_\psi(\mathbf{x})$ w.r.t. the encoder parameters $\psi$. There is a close relationship between mutual information and entropy: $I(\mathbf{x}, \mathbf{z}) = H(\mathbf{x}) + H(\mathbf{z}) - H(\mathbf{x}, \mathbf{z})$. Here the data entropy $H(\mathbf{x})$ is not relevant to $\psi$, so the optimization of $I(\mathbf{x}, \mathbf{z})$ with respect to $\psi$ can neglect this term. We decompose the gradients of the entropies $H(\mathbf{z})$ and $H(\mathbf{x}, \mathbf{z})$ as (see Appendix A):

$$\nabla_\psi H(\mathbf{z}) = -\nabla_\psi \mathbb{E}_{p(\mathbf{z})}[\log p(\mathbf{z})], \qquad \nabla_\psi H(\mathbf{x}, \mathbf{z}) = -\nabla_\psi \mathbb{E}_{p(\mathbf{x}, \mathbf{z})}[\log p(\mathbf{x}, \mathbf{z})].$$

Hence, we can represent the gradient of the MI between input and encoder output w.r.t. the encoder parameters as follows:

$$\nabla_\psi I(\mathbf{x}, \mathbf{z}) = -\nabla_\psi \mathbb{E}_{p(\mathbf{z})}[\log p(\mathbf{z})] + \nabla_\psi \mathbb{E}_{p(\mathbf{x}, \mathbf{z})}[\log p(\mathbf{x}, \mathbf{z})]. \tag{10}$$

However, Eq. (10) is intractable since an expectation w.r.t. $p(\mathbf{z})$ is not directly differentiable w.r.t. $\psi$. Roeder et al. (2017) proposed a general variant of the standard reparameterization trick for the variational evidence lower bound, which demonstrates lower variance. To address the above problem, we adapt this trick for MI gradient estimation in representation learning. Specifically, we can obtain samples from the marginal distribution $p(\mathbf{z})$ by pushing samples from the empirical data distribution $p(\mathbf{x})$ through $E_\psi$. Hence we can reparameterize the representation variable using a differentiable transformation, $\mathbf{z} = E_\psi(\mathbf{x})$ with $\mathbf{x} \sim p(\mathbf{x})$, where the empirical data distribution $p(\mathbf{x})$ is independent of the encoder parameters $\psi$. This reparameterization rewrites the expectations w.r.t. $p(\mathbf{z})$ and $p(\mathbf{x}, \mathbf{z})$ such that their Monte Carlo estimates are differentiable w.r.t. $\psi$.

Relying on this reparameterization trick, we can represent the gradient of MI w.r.t. the encoder parameters in Eq. (10) as follows:

$$\nabla_\psi I(\mathbf{x}, \mathbf{z}) = \mathbb{E}_{p(\mathbf{x})}\left[\big(\nabla_{\mathbf{z}} \log p(\mathbf{x}, \mathbf{z}) - \nabla_{\mathbf{z}} \log p(\mathbf{z})\big)^\top \big|_{\mathbf{z} = E_\psi(\mathbf{x})}\, \nabla_\psi E_\psi(\mathbf{x})\right],$$

where the score functions $\nabla_{\mathbf{z}} \log p(\mathbf{z})$ and $\nabla_{\mathbf{z}} \log p(\mathbf{x}, \mathbf{z})$ can be estimated based on i.i.d. samples from the implicit densities (Shi et al., 2018; Song et al., 2019). The samples from the joint distribution $p(\mathbf{x}, \mathbf{z})$ are produced as follows: we sample observations $\mathbf{x}^{(i)}$ from the empirical distribution $p(\mathbf{x})$; the corresponding samples of $\mathbf{z}$ are then obtained through $\mathbf{z}^{(i)} = E_\psi(\mathbf{x}^{(i)})$. Hence we can also estimate $\nabla_{\mathbf{z}} \log p(\mathbf{x}, \mathbf{z})$ based on i.i.d. samples from $p(\mathbf{x}, \mathbf{z})$. The Jacobian $\nabla_\psi E_\psi(\mathbf{x})$ is computed directly with backpropagation.

Circumstance II.   Assume that we encode the input $\mathbf{x}$ into a latent space that reflects useful structure in the data, $\mathbf{z} = E_\psi(\mathbf{x})$. Next, we summarize this latent variable into a final representation $\mathbf{y} = F(\mathbf{z}) = F(E_\psi(\mathbf{x}))$ by a smooth function $F$. The gradient estimator of the MI between $\mathbf{z}$ and $\mathbf{y}$ is represented via the data reparameterization trick as follows:

$$\nabla_\psi I(\mathbf{z}, \mathbf{y}) = \mathbb{E}_{p(\mathbf{x})}\left[\nabla_{(\mathbf{z}, \mathbf{y})} \log p(\mathbf{z}, \mathbf{y})^\top\, \nabla_\psi \big(E_\psi(\mathbf{x}), F(E_\psi(\mathbf{x}))\big)\right] - \mathbb{E}_{p(\mathbf{x})}\left[\nabla_{\mathbf{z}} \log p(\mathbf{z})^\top\, \nabla_\psi E_\psi(\mathbf{x})\right] - \mathbb{E}_{p(\mathbf{x})}\left[\nabla_{\mathbf{y}} \log p(\mathbf{y})^\top\, \nabla_\psi F(E_\psi(\mathbf{x}))\right].$$
Circumstance III.   Consider a stochastic encoder $\mathbf{z} = E_\psi(\mathbf{x}, \boldsymbol{\epsilon})$, where $\boldsymbol{\epsilon}$ is an auxiliary noise variable with independent marginal $p(\boldsymbol{\epsilon})$. By utilizing the data reparameterization trick, we can represent the gradient of the conditional entropy as follows (see Appendix A):

$$\nabla_\psi H(\mathbf{z} \mid \mathbf{x}) = -\mathbb{E}_{p(\mathbf{x})p(\boldsymbol{\epsilon})}\left[\nabla_{\mathbf{z}} \log p(\mathbf{z} \mid \mathbf{x})^\top \big|_{\mathbf{z} = E_\psi(\mathbf{x}, \boldsymbol{\epsilon})}\, \nabla_\psi E_\psi(\mathbf{x}, \boldsymbol{\epsilon})\right], \tag{14}$$

where the term $\nabla_{\mathbf{z}} \log p(\mathbf{z} \mid \mathbf{x})$ can be easily estimated by score estimation.

Based on the conditional entropy gradient estimation in Eq. (14) and $I(\mathbf{x}, \mathbf{z}) = H(\mathbf{z}) - H(\mathbf{z} \mid \mathbf{x})$, the gradient estimator of the MI between input and encoder output can be represented as follows:

$$\nabla_\psi I(\mathbf{x}, \mathbf{z}) = \mathbb{E}_{p(\mathbf{x})p(\boldsymbol{\epsilon})}\left[\big(\nabla_{\mathbf{z}} \log p(\mathbf{z} \mid \mathbf{x}) - \nabla_{\mathbf{z}} \log p(\mathbf{z})\big)^\top \big|_{\mathbf{z} = E_\psi(\mathbf{x}, \boldsymbol{\epsilon})}\, \nabla_\psi E_\psi(\mathbf{x}, \boldsymbol{\epsilon})\right].$$
In practical MI optimization, we can construct MIGE of the full dataset based on mini-batch Monte Carlo estimates. We have provided an algorithm description for MIGE in Appendix B.
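As a sketch of the mini-batch estimate under Circumstance I, the example below uses a linear toy encoder and replaces the two SSGE score estimates with analytically chosen stand-ins (the Gaussian-style scores and the matrix `A` are purely illustrative), showing only how estimated scores combine with the encoder Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, dz = 256, 4, 2
W = rng.normal(size=(dz, d))           # encoder parameters: z = W x
X = rng.normal(size=(N, d))            # mini-batch from p(x)
Z = X @ W.T                            # reparameterized samples z = E_psi(x)

# Stand-in score estimates (in MIGE these come from SSGE on samples of z / (x, z)).
# Hypothetically take grad_z log p(z) = -z (as if p(z) were standard normal), and
score_marginal = -Z
# grad_z log p(x, z) = -(z - A x) for an arbitrary fixed A (illustrative only).
A = rng.normal(size=(dz, d))
score_joint = -(Z - X @ A.T)

# For a linear encoder, grad_W of E[(s_joint - s_marginal)^T W x] is E[s x^T]:
grad_I = (score_joint - score_marginal).T @ X / N   # (dz, d) MI gradient estimate
```

In practice the two score terms would be detached from the computation graph and the same contraction would be obtained by backpropagating a surrogate loss through the encoder.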

4 Toy Experiment

Recently, MINE and MINE-$f$ have enabled effective estimation of MI in continuous and high-dimensional settings. To compare with MINE and MINE-$f$, we evaluate MIGE on the correlated Gaussian problem taken from Belghazi et al. (2018).

Experimental Settings. We consider two random variables $\mathbf{x}$ and $\mathbf{z}$ ($\mathbf{x}, \mathbf{z} \in \mathbb{R}^d$), jointly drawn from a $2d$-dimensional multivariate Gaussian distribution. The component-wise correlation of $\mathbf{x}$ and $\mathbf{z}$ is defined as follows: $\mathrm{corr}(x_i, z_j) = \delta_{ij}\,\rho$, where $\delta_{ij}$ is Kronecker’s delta and $\rho \in (-1, 1)$ is the correlation coefficient. Since MI is invariant to smooth transformations of $(\mathbf{x}, \mathbf{z})$, we only consider standardized Gaussians for the marginal distributions $p(\mathbf{x})$ and $p(\mathbf{z})$. The gradient of MI w.r.t. $\rho$ has the analytical solution $\nabla_\rho I(\mathbf{x}, \mathbf{z}) = d\rho / (1 - \rho^2)$, since $I(\mathbf{x}, \mathbf{z}) = -\frac{d}{2}\log(1 - \rho^2)$. We apply MINE and MINE-$f$ to estimate the MI of $(\mathbf{x}, \mathbf{z})$ by sampling from the correlated Gaussian distribution and its marginal distributions, and the corresponding gradient of MI w.r.t. $\rho$ can be computed by backpropagation implemented in PyTorch.
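The closed form used above can be checked numerically; this sketch compares the analytical gradient with a finite difference of the MI (the values of $\rho$ and $d$ are illustrative):

```python
import numpy as np

def mi_gaussian(rho, d):
    # I(x, z) = -(d / 2) * log(1 - rho^2) for component-wise correlation rho
    return -0.5 * d * np.log(1.0 - rho ** 2)

def mi_grad(rho, d):
    # Analytical gradient: dI/drho = d * rho / (1 - rho^2)
    return d * rho / (1.0 - rho ** 2)

# Finite-difference check of the closed form at rho = 0.6, d = 20
rho, d, eps = 0.6, 20, 1e-6
fd = (mi_gaussian(rho + eps, d) - mi_gaussian(rho - eps, d)) / (2 * eps)
```

At $\rho = 0.6$ and $d = 20$ the analytical gradient is $20 \cdot 0.6 / 0.64 = 18.75$, and the finite difference agrees to several decimal places.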

Results. Fig. 1 presents our experimental results for different dimensions $d$. In the low-dimensional case, all the estimators give promising estimates of MI and its gradient. However, the MI estimates of MINE and MINE-$f$ are unstable due to their reliance on a discriminator to produce an estimate of the bound on MI; hence, as shown in Fig. 1, the corresponding estimates of MI and its gradient are not smooth. As the dimension and the absolute value of the correlation coefficient increase, MINE and MINE-$f$ apparently struggle to reach the true MI, and their gradient estimates are thus highly biased. This phenomenon is more significant in the high-dimensional or large-MI cases. In contrast, MIGE demonstrates a significant improvement over MINE and MINE-$f$ when estimating the MI gradient between twenty-dimensional random variables: on this analyzable problem, the gradient curves estimated by our method are far superior in terms of smoothness and tightness in the high-dimensional, large-MI setting.

5 Applications

To demonstrate the performance in downstream tasks, we deploy MIGE in Deep InfoMax (Hjelm et al., 2019) and the Information Bottleneck (Tishby et al., 2000) respectively, namely replacing the original MI estimators with MIGE. We find that MIGE achieves higher and more stable classification accuracy, indicating its good gradient estimation performance in practical applications.

5.1 Deep InfoMax

Discovering useful representations from unlabeled data is a core problem of deep learning. Recently, a growing set of methods has been explored to train deep neural network encoders by maximizing the mutual information between their inputs and outputs. A number of methods based on tractable variational lower bounds, such as JSD and infoNCE, have been proposed to improve the estimation of MI between the high-dimensional input/output pairs of deep neural networks (Hjelm et al., 2019). To compare with JSD and infoNCE, we expand the application of MIGE to unsupervised learning of deep representations based on the InfoMax principle.

Experimental Settings. For consistent comparison, we follow the experiments of Deep InfoMax (DIM) and set the experimental setup as in Hjelm et al. (2019). We test DIM on the image datasets CIFAR-10, CIFAR-100 and STL-10 to evaluate MIGE. For the high-dimensional images in STL-10, directly applying SSGE is almost impossible since it results in exorbitant computational cost; our proposed Scalable SSGE is therefore applied to reduce the dimension of the images and achieve a reasonable computational cost. As mentioned in Hjelm et al. (2019), a non-linear classifier is chosen to evaluate the learned representations: after representation learning, we freeze the parameters of the encoder and train a non-linear classifier using the representation as input. The same classifiers are used for all methods. Our baseline results are taken directly from Hjelm et al. (2019) or obtained by running the authors' code.

Model CIFAR-10 CIFAR-100
conv fc(1024) Y(64) conv fc(1024) Y(64)
DIM (JSD) 55.81% 45.73% 40.67% 28.41% 22.16% 16.50%
DIM (JSD + PM) 52.2% 52.84% 43.17% 24.40% 18.22% 15.22%
DIM (infoNCE) 51.82% 42.81% 37.79% 24.60% 16.54% 12.96%
DIM (infoNCE + PM) 56.77% 49.42% 42.68% 25.51% 20.15% 15.35%
MIGE 57.95% 57.09% 53.75% 29.86% 27.91% 25.84%
Table 1: CIFAR-10 and CIFAR-100 classification accuracy (top 1) of downstream tasks compared with vanilla DIM. JSD and infoNCE are MI estimators, and PM denotes matching representations to a prior distribution (Hjelm et al., 2019).

Results. As shown in Table 1, MIGE outperforms all competitive models in the DIM experiments on CIFAR-10 and CIFAR-100. Beyond the numerical improvements, it is notable that our model shows a smaller accuracy decrease across layers than DIM (JSD) and DIM (infoNCE). The results indicate that, compared to variational lower-bound methods, MIGE gives a more favorable gradient direction and demonstrates more power in controlling information flow without significant loss. With the aid of Random Projection, we can evaluate on bigger datasets, e.g., STL-10. Table 2 shows the results of the DIM experiments on STL-10; we observe a significant improvement over the baselines when applying RP to 512 dimensions. Note that our proposed gradient estimator can also be extended to the multi-view setting (i.e., with local and global features) of DIM, but this is beyond the scope of this paper; more discussion can be found in Appendix C.

Model STL-10
conv fc(1024) Y(64)
DIM (JSD) 42.03% 30.28% 28.09%
DIM (infoNCE) 43.13% 35.80% 34.44%
MIGE unaffordable computational cost
MIGE + RP to 512d 52.00% 48.14% 44.89%
Figure 2: STL-10 Y(64) classification accuracy (top 1) with different RP dimension.
Table 2: STL-10 classification accuracy (top 1) of downstream tasks compared with vanilla DIM. The dimension of STL-10 images (27648) results in exorbitant computational cost. Random Projection (RP) is applied to reduce the dimension.

Ablation Study. To verify the effect of different Random Projection dimensions on classification accuracy in the DIM experiments, we conduct an ablation study on STL-10 with the above experimental settings. Varying the RP dimension $k$, we measure the classification accuracy of Y(64), as shown in Fig. 2. We find that classification accuracy increases with the RP dimension from 16 to 128. Beyond that, the approximation in Eq. (8) saturates with further increase of the RP dimension, while bringing extra computational cost.

5.2 Information Bottleneck

Information Bottleneck (IB) has been widely applied to a variety of application domains, such as classification (Tishby and Zaslavsky, 2015; Alemi et al., 2017; Chalk et al., 2016; Kolchinsky et al., 2017), clustering (Slonim and Tishby, 2000), and coding theory and quantization (Zeitler et al., 2008; Courtade and Wesel, 2011). In particular, given the input variable $\mathbf{x}$ and the target variable $y$, the goal of IB is to learn a representation of $\mathbf{x}$ (denoted by the variable $\mathbf{z}$) that satisfies the following characteristics:

  • $\mathbf{z}$ is sufficient for the target $y$; that is, all information about the target contained in $\mathbf{x}$ should also be contained in $\mathbf{z}$. In optimization, it should hold that $I(\mathbf{z}; y) = I(\mathbf{x}; y)$.

  • $\mathbf{z}$ is minimal. In order not to contain irrelevant information that is not related to $y$, $\mathbf{z}$ is required to contain the smallest amount of information among all sufficient representations.

The objective function of IB is written as follows:

$$\min_{p(\mathbf{z} \mid \mathbf{x})} I(\mathbf{x}; \mathbf{z}) \quad \text{s.t.} \quad I(\mathbf{z}; y) = I(\mathbf{x}; y).$$

Equivalently, by introducing a Lagrangian multiplier $\beta$, the IB method can maximize the objective function $I(\mathbf{z}; y) - \beta I(\mathbf{x}; \mathbf{z})$. Further, it is generally acknowledged that $I(\mathbf{z}; y) = H(y) - H(y \mid \mathbf{z})$, and $H(y)$ is constant. Hence we can also minimize the objective function of the following form:

$$J_{IB} = H(y \mid \mathbf{z}) + \beta\, I(\mathbf{x}; \mathbf{z}),$$

where $\beta$ plays a role in trading off sufficiency and minimality. Note that the above formulas omit the parameters for simplicity.
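As a toy illustration of this trade-off, the following evaluates an objective of the form $H(y \mid \mathbf{z}) + \beta\, I(\mathbf{x}; \mathbf{z})$ on small discrete distributions; the probability tables and $\beta$ are entirely illustrative stand-ins for the quantities a trained encoder and decoder would induce:

```python
import numpy as np

def entropy(p):
    # Shannon entropy in nats, ignoring zero-probability entries
    p = p[p > 0]
    return -(p * np.log(p)).sum()

def mutual_info(p_joint):
    # I = H(row marginal) + H(col marginal) - H(joint)
    px = p_joint.sum(axis=1)
    pz = p_joint.sum(axis=0)
    return entropy(px) + entropy(pz) - entropy(p_joint.ravel())

# Hypothetical joints induced by an encoder/decoder pair (illustrative numbers)
p_xz = np.array([[0.45, 0.05],
                 [0.05, 0.45]])        # p(x, z): strong input-code dependence
p_yz = np.array([[0.40, 0.10],
                 [0.10, 0.40]])        # p(y, z): decoder statistics
beta = 0.1

h_y_given_z = entropy(p_yz.ravel()) - entropy(p_yz.sum(axis=0))  # H(y | z)
J = h_y_given_z + beta * mutual_info(p_xz)                       # IB objective
```

Increasing $\beta$ penalizes $I(\mathbf{x}; \mathbf{z})$ more heavily, pushing the code toward minimality at the cost of a larger conditional entropy $H(y \mid \mathbf{z})$.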

To overcome the intractability of MI in continuous and high-dimensional settings, Alemi et al. (2017) present a variational approximation to IB, which adopts a deep neural network encoder to produce a conditional multivariate normal distribution, called the Deep Variational Bottleneck (DVB). Recently, DVB has been exploited to restrict the capacity of discriminators in GANs (Peng et al., 2019). However, a tractable density is required for the approximate posterior in DVB due to its reliance on a variational approximation, while MIGE has no such requirement.

To evaluate our method, we compare MIGE-IB with DVB and MINE-IB in the IB application. We demonstrate an implementation of the IB objective on permutation-invariant MNIST using MIGE.

Experiments. For consistent comparison, we adopt the same architecture and empirical settings used in Alemi et al. (2017), except that an initial learning rate of 2e-4 is set for the Adam optimizer, with exponential decay by a factor of 0.96 every 2 epochs. The implementation of DVB is available from its authors. Under these experimental settings, we use our MI gradient estimator to replace the MI estimator in the DVB experiment. The threshold of the score function's Stein gradient estimator, a hyper-parameter of the Spectral Stein Gradient Estimator (SSGE), is used to set the bandwidth of the RBF kernel. Our results can be seen in Table 3, which shows that our proposed MIGE-IB outperforms DVB and MINE-IB.

Model Misclass rate
Baseline 1.38%
Dropout 1.34%
Confidence penalty 1.36%
Label Smoothing 1.4%
DVB 1.13%
MINE-IB 1.11%
MIGE-IB (ours) 1.05%
Table 3: Permutation-invariant MNIST misclassification rates. Results other than our model's are cited from Belghazi et al. (2018).

6 Conclusion

In this paper, we present a gradient estimator, called the Mutual Information Gradient Estimator (MIGE), to avoid the various problems encountered in direct mutual information estimation. We demonstrate the effectiveness of gradient estimation of MI over direct MI estimation by applying it in unsupervised and supervised representation learning. Experimental results indicate remarkable improvement over direct MI estimation in the Deep InfoMax method and the Information Bottleneck method.


This work was partially funded by the National Key R&D Program of China (No. 2018YFB1005100 & No. 2018YFB1005104).


  • A. A. Alemi, I. Fischer, and J. V. Dillon (2018a) Uncertainty in the variational information bottleneck. arXiv preprint arXiv:1807.00906. Cited by: §1.
  • A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy (2017) Deep variational information bottleneck. In ICLR, Cited by: §1, §1, §5.2, §5.2, §5.2.
  • A. A. Alemi, B. Poole, I. Fischer, J. V. Dillon, R. A. Saurous, and K. Murphy (2018b) Fixing a broken ELBO. In ICML, Cited by: §1.
  • D. Barber and F. V. Agakov (2003) The IM algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems. Cited by: §1.
  • M. I. Belghazi, A. Baratin, S. Rajeswar, S. Ozair, Y. Bengio, A. Courville, and R. D. Hjelm (2018) Mine: mutual information neural estimation. ICML. Cited by: §1, §4, Table 3.
  • E. Bingham and H. Mannila (2001) Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, San Francisco, CA, USA, pp. 245–250. Cited by: §2, §2.
  • A. J. Butte and I. S. Kohane (1999) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In Biocomputing 2000, pp. 418–429. Cited by: §1.
  • M. Chalk, O. Marre, and G. Tkacik (2016) Relevant sparse codes with variational information bottleneck. In Advances in Neural Information Processing Systems, pp. 1957–1965. Cited by: §5.2.
  • T. A. Courtade and R. D. Wesel (2011) Multiterminal source coding with an entropy-based distortion measure. In 2011 IEEE International Symposium on Information Theory Proceedings, pp. 2040–2044. Cited by: §5.2.
  • G. A. Darbellay and I. Vajda (1999) Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory 45 (4), pp. 1315–1321. Cited by: §1.
  • F. Fleuret (2004) Fast binary feature selection with conditional mutual information. Journal of Machine learning research 5 (Nov), pp. 1531–1555. Cited by: §1.
  • A. M. Fraser and H. L. Swinney (1986) Independent coordinates for strange attractors from mutual information. Physical review A 33 (2), pp. 1134. Cited by: §1.
  • S. Gao, G. Ver Steeg, and A. Galstyan (2015) Efficient estimation of mutual information for strongly dependent variables. In Artificial intelligence and statistics, pp. 277–286. Cited by: §1.
  • J. Gorham and L. Mackey (2015) Measuring sample quality with stein’s method. In Advances in Neural Information Processing Systems, pp. 226–234. Cited by: §2.
  • R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio (2019) Learning deep representations by mutual information estimation and maximization. In ICLR, Cited by: Appendix C, §1, §1, §5.1, §5.1, Table 1, §5.
  • W. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama (2017) Learning discrete representations via information maximizing self-augmented training. In ICML, Cited by: §1.
  • K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman, et al. (2015) Nonparametric von mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, pp. 397–405. Cited by: §1.
  • J. B. Kinney and G. S. Atwal (2014) Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences 111 (9), pp. 3354–3359. Cited by: §1.
  • A. Kolchinsky, B. D. Tracey, and D. H. Wolpert (2017) Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436. Cited by: §5.2.
  • L. Kozachenko and N. N. Leonenko (1987) Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii 23 (2), pp. 9–16. Cited by: §1.
  • A. Kraskov, H. Stögbauer, and P. Grassberger (2004) Estimating mutual information. Physical review E 69 (6), pp. 066138. Cited by: §1.
  • A. Krause, P. Perona, and R. G. Gomes (2010) Discriminative clustering by regularized information maximization. In Advances in neural information processing systems, Cited by: §1.
  • S. Krishnaswamy, M. H. Spitzer, M. Mingueneau, S. C. Bendall, O. Litvin, E. Stone, D. Pe'er, and G. P. Nolan (2014) Conditional density-based analysis of T cell signaling in single-cell data. Science 346 (6213), pp. 1250689. Cited by: §1.
  • N. Kwak and C. Choi (2002) Input feature selection by mutual information based on Parzen window. IEEE Transactions on Pattern Analysis & Machine Intelligence (12), pp. 1667–1671. Cited by: §1, §1.
  • Y. Li and R. E. Turner (2017) Gradient estimators for implicit models. arXiv preprint arXiv:1705.07107. Cited by: §2.
  • R. Linsker (1988) Self-organization in a perceptual network. Computer 21 (3), pp. 105–117. Cited by: §1.
  • Q. Liu and D. Wang (2016) Stein variational gradient descent: a general purpose Bayesian inference algorithm. In Advances in neural information processing systems, pp. 2378–2386. Cited by: §2, §2.
  • F. Maes, A. Collignon, D. Vandermeulen, G. Marchal, and P. Suetens (1997) Multimodality image registration by maximization of mutual information. IEEE transactions on Medical Imaging 16 (2), pp. 187–198. Cited by: §1.
  • D. McAllester and K. Stratos (2018) Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251. Cited by: §1.
  • Y. Moon, B. Rajagopalan, and U. Lall (1995) Estimation of mutual information using kernel density estimators. Physical Review E 52 (3), pp. 2318. Cited by: §1.
  • A. C. Müller, S. Nowozin, and C. H. Lampert (2012) Information theoretic clustering using minimum spanning trees. In Joint DAGM (German Association for Pattern Recognition) and OAGM Symposium, pp. 205–215. Cited by: §1.
  • A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §1, §1.
  • S. E. Palmer, O. Marre, M. J. Berry, and W. Bialek (2015) Predictive information in a sensory population. Proceedings of the National Academy of Sciences 112 (22), pp. 6908–6913. Cited by: §1.
  • H. Peng, F. Long, and C. Ding (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis & Machine Intelligence (8), pp. 1226–1238. Cited by: §1.
  • X. B. Peng, A. Kanazawa, S. Toyer, P. Abbeel, and S. Levine (2019) Variational discriminator bottleneck: improving imitation learning, inverse RL, and GANs by constraining information flow. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. Cited by: §5.2.
  • F. Pérez-Cruz (2008) Kullback-Leibler divergence estimation of continuous distributions. In 2008 IEEE international symposium on information theory, pp. 1666–1670. Cited by: §1.
  • B. Poole, S. Ozair, A. van den Oord, A. Alemi, and G. Tucker (2019) On variational bounds of mutual information. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 5171–5180. Cited by: §1.
  • G. Roeder, Y. Wu, and D. K. Duvenaud (2017) Sticking the landing: simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems, pp. 6925–6934. Cited by: §1, §3.
  • J. Shi, S. Sun, and J. Zhu (2018) A spectral approach to gradient estimation for implicit distributions. arXiv preprint arXiv:1806.02925. Cited by: §1, §2, §2, §2, §3.
  • R. Shwartz-Ziv and N. Tishby (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810. Cited by: §1.
  • S. Singh and B. Póczos (2016) Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In Advances in neural information processing systems, pp. 1217–1225. Cited by: §1.
  • N. Slonim and N. Tishby (2000) Document clustering using word clusters via the information bottleneck method. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pp. 208–215. Cited by: §5.2.
  • Y. Song, S. Garg, J. Shi, and S. Ermon (2019) Sliced score matching: a scalable approach to density and score estimation. arXiv preprint arXiv:1905.07088. Cited by: §2, §3.
  • T. Suzuki, M. Sugiyama, J. Sese, and T. Kanamori (2008) Approximating mutual information by maximum likelihood density ratio estimation. In New challenges for feature selection in data mining and knowledge discovery, pp. 5–20. Cited by: §1.
  • N. Tishby, F. C. Pereira, and W. Bialek (2000) The information bottleneck method. arXiv preprint physics/0004057. Cited by: §1, §5.
  • N. Tishby and N. Zaslavsky (2015) Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5. Cited by: §1, §5.2.
  • M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic (2019) On mutual information maximization for representation learning. ArXiv abs/1907.13625. Cited by: Appendix C.
  • G. Ver Steeg and A. Galstyan (2015) Maximally informative hierarchical representations of high-dimensional data. In Artificial Intelligence and Statistics, pp. 1004–1012. Cited by: §1.
  • M. Vera, P. Piantanida, and L. R. Vega (2018) The role of the information bottleneck in representation learning. In IEEE International Symposium on Information Theory (ISIT), pp. 1580–1584. Cited by: §1.
  • Z. Xu, R. Jin, B. Shen, and S. Zhu (2015) Nyström approximation for sparse kernel methods: theoretical analysis and empirical evaluation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, B. Bonet and S. Koenig (Eds.), pp. 3115–3121. External Links: Link Cited by: §2.
  • G. Zeitler, R. Koetter, G. Bauch, and J. Widmer (2008) Design of network coding functions in multihop relay networks. In 2008 5th International Symposium on Turbo Codes and Related Topics, pp. 249–254. Cited by: §5.2.

Appendix A Derivation of Gradient Estimates for Entropy

Unconditional Entropy Given that the encoder $z = E_\theta(x)$ is deterministic, our goal is to optimize the entropy $H(q_\theta)$, where $q_\theta$ is short for the distribution $q_\theta(z)$ of the representation $z$ w.r.t. its parameters $\theta$. We can decompose the gradient of the entropy of $q_\theta$ as:

\[
\nabla_\theta H(q_\theta) = -\nabla_\theta \mathbb{E}_{q_\theta(z)}[\log q_\theta(z)]
= -\mathbb{E}_{p(x)}\!\left[\nabla_z \log q_\theta(z)\big|_{z=E_\theta(x)}\,\nabla_\theta E_\theta(x)\right]
- \mathbb{E}_{q_\theta(z)}\!\left[\nabla_\theta \log q_\theta(z)\right]
\]

The second term on the right side of the equation can be calculated:

\[
\mathbb{E}_{q_\theta(z)}\!\left[\nabla_\theta \log q_\theta(z)\right]
= \int q_\theta(z)\,\frac{\nabla_\theta q_\theta(z)}{q_\theta(z)}\,dz
= \nabla_\theta \int q_\theta(z)\,dz = \nabla_\theta 1 = 0.
\]

Therefore, the gradient of the entropy of $q_\theta$ becomes

\[
\nabla_\theta H(q_\theta) = -\mathbb{E}_{p(x)}\!\left[\nabla_z \log q_\theta(z)\big|_{z=E_\theta(x)}\,\nabla_\theta E_\theta(x)\right].
\]
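As a sanity check of this estimator, one can compare it against the analytic entropy gradient in a toy setting where the score is available in closed form. The NumPy sketch below is an illustration of ours (fixed linear encoder $z = Wx$ with Gaussian inputs; in MIGE the score would instead be estimated from samples): it recovers the analytic gradient $\nabla_W H = (WW^\top)^{-1}W$ of $H = \frac{1}{2}\log\det(2\pi e\,WW^\top)$ up to Monte-Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)
B = 200_000                                   # mini-batch size
W = np.array([[1.0, 0.5, 0.0, 0.2],
              [0.3, 1.2, 0.4, 0.0]])          # linear encoder z = W x

x = rng.normal(size=(B, 4))                   # x ~ N(0, I)
z = x @ W.T                                   # deterministic representations

# z ~ N(0, W W^T), so the score has the closed form -(W W^T)^{-1} z;
# in MIGE it would be estimated from the samples {z_i} instead.
C = W @ W.T
score = -np.linalg.solve(C, z.T).T            # rows: grad_z log q(z_i)

# Entropy-gradient estimate: -(1/B) sum_i score(z_i) x_i^T
grad_est = -(score.T @ x) / B

# Analytic gradient of H = 0.5 log det(2*pi*e W W^T) w.r.t. W
grad_true = np.linalg.solve(C, W)
```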
Conditional Entropy Consider a nondeterministic encoder $z = E_\theta(x, \epsilon)$, where $\epsilon$ is an auxiliary variable with independent marginal $p(\epsilon)$. The distribution $q_\theta(z|x)$ is determined by $p(\epsilon)$ and the encoder parameters $\theta$; the auxiliary variable $\epsilon$ introduces randomness into the encoder. First, we decompose the gradient of the conditional entropy as follows:

\[
\nabla_\theta H(q_\theta(z|x)) = -\nabla_\theta \mathbb{E}_{p(x)q_\theta(z|x)}[\log q_\theta(z|x)]
= -\mathbb{E}_{p(x)}\!\left[\nabla_\theta\, \mathbb{E}_{q_\theta(z|x)}[\log q_\theta(z|x)]\right],
\tag{A}
\]

where the term $\mathbb{E}_{q_\theta(z|x)}[\nabla_\theta \log q_\theta(z|x)]$ vanishes by the same argument as in the unconditional case, so only the gradient flowing through the samples remains. Note that $z = E_\theta(x, \epsilon)$ with $\epsilon \sim p(\epsilon)$, such that we can apply the reparameterization trick to the gradient estimator of conditional entropy in Eq. (A):

\[
\nabla_\theta H(q_\theta(z|x)) = -\mathbb{E}_{p(x)p(\epsilon)}\!\left[\nabla_z \log q_\theta(z|x)\big|_{z=E_\theta(x,\epsilon)}\,\nabla_\theta E_\theta(x, \epsilon)\right].
\]
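The conditional estimator admits the same kind of check. For a Gaussian-noise encoder $z = Wx + \sigma\epsilon$, the conditional entropy $H(q_\theta(z|x)) = \frac{1}{2}\log\det(2\pi e\,\sigma^2 I)$ does not depend on $W$, so the estimated gradient should vanish up to Monte-Carlo error. The sketch below is again an illustration of ours under these assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
B, sigma = 200_000, 0.5
W = np.array([[1.0, 0.5, 0.0, 0.2],
              [0.3, 1.2, 0.4, 0.0]])

x = rng.normal(size=(B, 4))
eps = rng.normal(size=(B, 2))                 # auxiliary noise, eps ~ p(eps)
z = x @ W.T + sigma * eps                     # reparameterized encoder

# q(z|x) = N(Wx, sigma^2 I), so grad_z log q(z|x) = -(z - Wx) / sigma^2
score = -(z - x @ W.T) / sigma**2

# Conditional-entropy gradient estimate: -(1/B) sum_i score_i x_i^T.
# It should be ~0, since H(z|x) is free of W.
grad_est = -(score.T @ x) / B
```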
Appendix B MIGE Algorithm Description

Our proposed MIGE is described in Algorithm 1.

  1. Sampling: Draw $B$ samples $\{x_i\}_{i=1}^{B}$ from the data distribution $p(x)$, where $B$ denotes the mini-batch size, then compute the corresponding outputs of the encoder $z_i = E_\theta(x_i)$.
  2. Estimate the score function: compute $\hat{g}(z_i) \approx \nabla_z \log q_\theta(z)\big|_{z=z_i}$ from the representation samples $\{z_i\}_{i=1}^{B}$.
  3. Estimate the entropy gradient: $\nabla_\theta H(q_\theta) \approx -\frac{1}{B}\sum_{i=1}^{B} \hat{g}(z_i)\,\nabla_\theta E_\theta(x_i)$.
  4. Estimate the MI gradient: $\nabla_\theta I(x, z) = \nabla_\theta H(q_\theta)$ for the deterministic encoder.
Algorithm 1 MIGE (Circumstance I)
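Step 2 relies on a sample-based score estimator such as the spectral Stein gradient estimator of Shi et al. (2018). A compact 1-D sketch follows (our illustration; the median-distance bandwidth and the truncation level `J` are ad hoc choices, not the paper's settings):

```python
import numpy as np

def ssge_score(samples, queries, J=6):
    """1-D spectral Stein gradient estimator: approximate grad_z log q(z)."""
    M = len(samples)
    # RBF kernel with the median-distance bandwidth heuristic.
    dist = np.abs(samples[:, None] - samples[None, :])
    h = np.median(dist[dist > 0])
    kern = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * h**2))

    K = kern(samples, samples)
    lam, U = np.linalg.eigh(K)                 # ascending eigenvalues
    lam, U = lam[::-1][:J], U[:, ::-1][:, :J]  # keep the top-J eigenpairs

    # Nystrom eigenfunctions: psi_j(x) = sqrt(M)/lam_j * sum_m u_jm k(x, x_m)
    psi_q = np.sqrt(M) / lam * (kern(queries, samples) @ U)

    # Stein identity: beta_j = -(1/M) sum_m psi_j'(x_m), with
    # d/dx k(x, y) = -(x - y)/h^2 * k(x, y) evaluated at the samples.
    dK = -(samples[:, None] - samples[None, :]) / h**2 * K
    dpsi = np.sqrt(M) / lam * (dK @ U)
    beta = -dpsi.mean(axis=0)

    return psi_q @ beta                        # estimated score at the queries

# Demo: for samples from N(0, 1) the true score is -z.
rng = np.random.default_rng(0)
zs = rng.normal(size=1000)
qs = np.linspace(-1.5, 1.5, 7)
est = ssge_score(zs, qs)
```

The same construction extends to vector-valued $z$ by evaluating the kernel on vectors and estimating each coordinate of the score separately.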

Appendix C Discussion on DIM(L)

DIM(L) (Hjelm et al., 2019) is the state-of-the-art unsupervised model for representation learning: it maximizes the average MI between the high-level representation and local patches of the image, and achieves even higher classification accuracy than supervised learning. As shown in Table 4, we apply MIGE to DIM(L) and surprisingly find a significant performance gap relative to DIM(L).

To our knowledge, the principle behind DIM(L) is still unclear. Tschannen et al. (2019) argue that maximizing tighter bounds in DIM(L) can lead to worse results, that the success of these methods cannot be attributed to the properties of MI alone, and that they depend strongly on the inductive bias in both the choice of feature extractor architecture and the parameterization of the employed MI estimators. For MIGE, we are investigating the underlying reasons, e.g., the distributions of the patches.

Model                  | CIFAR-10                  | CIFAR-100
                       | conv    fc(1024)  Y(64)   | conv    fc(1024)  Y(64)
DIM(L) (JSD)           | 72.16%  67.99%    66.35%  | 41.65%  39.60%    39.66%
DIM(L) (JSD + PM)      | 73.25%  73.62%    66.96%  | 48.13%  45.92%    39.60%
DIM(L) (infoNCE)       | 75.05%  70.68%    69.24%  | 44.11%  42.97%    42.74%
DIM(L) (infoNCE + PM)  | 75.21%  75.57%    69.13%  | 49.74%  47.72%    41.61%
MIGE                   | 59.72%  56.14%    54.01%  | 30.00%  28.96%    27.65%
Table 4: CIFAR-10 and CIFAR-100 classification accuracy (top 1) of downstream tasks compared with vanilla DIM(L).