1 Introduction
Mutual information (MI) is a measure widely used in information theory and machine learning to quantify the amount of information shared between a pair of random variables. Specifically, given a pair of random variables $(x, y)$, the MI, denoted by $I(x; y)$, is defined as

$$I(x; y) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right], \tag{1}$$

where $\mathbb{E}$ is the expectation over the given distribution. Since MI is invariant to invertible and smooth transformations, it can capture nonlinear statistical dependencies between variables (Kinney and Atwal, 2014). These properties make it a fundamental measure of true dependence. MI has therefore found applications in a wide range of machine learning tasks, including feature selection (Kwak and Choi, 2002; Fleuret, 2004; Peng et al., 2005), clustering (Müller et al., 2012; Ver Steeg and Galstyan, 2015), and causality (Butte and Kohane, 1999). It has also been used pervasively in the sciences, such as biomedical sciences (Maes et al., 1997), computational biology (Krishnaswamy et al., 2014), and computational neuroscience (Palmer et al., 2015).

Recently, there has been a revival of methods in unsupervised representation learning based on MI. A seminal work is the InfoMax principle (Linsker, 1988): given an input instance $x$, the goal is to learn a representation $z$ by maximizing the MI between the input and its representation. A growing set of recent works has demonstrated promising empirical performance in unsupervised representation learning via MI maximization (Krause et al., 2010; Hu et al., 2017; Alemi et al., 2018b; Oord et al., 2018; Hjelm et al., 2019). Another closely related line of work is the Information Bottleneck method (Tishby et al., 2000; Alemi et al., 2017), where MI is used to limit the contents of representations. Specifically, the representations are learned by extracting task-related information from the original data while being constrained to discard parts that are irrelevant to the task. Several recent works have also suggested that by controlling the amount of information between learned representations and the original data, one can tune desired characteristics of trained models such as generalization error (Tishby and Zaslavsky, 2015; Vera et al., 2018), robustness (Alemi et al., 2017), and detection of out-of-distribution data (Alemi et al., 2018a).
Despite playing a pivotal role across a variety of domains, MI is notoriously intractable. Exact computation is only tractable for discrete variables, or for a limited family of problems where the probability distributions are known. For more general problems, MI is challenging to compute analytically or to estimate from samples. A variety of MI estimators have been developed over the years, including likelihood-ratio estimators (Suzuki et al., 2008), binning (Fraser and Swinney, 1986; Darbellay and Vajda, 1999; Shwartz-Ziv and Tishby, 2017), k-nearest neighbors (Kozachenko and Leonenko, 1987; Kraskov et al., 2004; Pérez-Cruz, 2008; Singh and Póczos, 2016), and kernel density estimators (Moon et al., 1995; Kwak and Choi, 2002; Kandasamy et al., 2015). However, few of these estimators scale well with dimension and sample size in machine learning problems (Gao et al., 2015).

To overcome the intractability of MI in continuous and high-dimensional settings, Alemi et al. (2017) combine the variational bounds of Barber and Agakov (2003) with neural networks for the estimation. However, the variational approximation requires a tractable density for the approximate distribution. This limits its use for general-purpose estimation, since the underlying distributions are often unknown. Alternatively, Mutual Information Neural Estimation (MINE, Belghazi et al. (2018)) and the Jensen-Shannon MI estimator (JSD, Hjelm et al. (2019)) enable differentiable and tractable estimation of MI by training a discriminator to distinguish samples coming from the joint distribution from samples coming from the product of the marginals. In detail, MINE employs a lower bound on the MI based on the Donsker-Varadhan representation of the KL-divergence, and JSD follows the formulation of the f-GAN KL-divergence. In general, these estimators are often noisy and can lead to unstable training because they depend on the discriminator used to estimate the bounds of MI. As pointed out by Poole et al. (2019), these unnormalized critic estimators of MI exhibit high variance and are challenging to tune. A lower-variance alternative is Information Noise-Contrastive Estimation (InfoNCE, Oord et al. (2018)), which uses Noise-Contrastive Estimation with flexible critics parameterized by neural networks as a bound to approximate MI. Nonetheless, its estimate saturates at the log of the batch size and suffers from high bias. Despite their modeling power, none of these estimators can provide an accurate, low-variance estimate of MI when the MI is large and the batch size is small (Poole et al., 2019). As supported by the theoretical findings of McAllester and Stratos (2018), any distribution-free high-confidence lower bound on entropy requires a sample size exponential in the size of the bound. For further discussion of the bounds of MI and their relationships, see Poole et al. (2019).

In summary, existing estimators first approximate MI and then use these approximations to optimize the associated parameters. For any finite number of samples, there exist infinitely many functions, with arbitrarily diverse gradients, that perfectly match the true MI at these samples. Such approximate functions can lead to unstable training and poor optimization performance because their gradients differ from those of the true MI. Estimating the gradient of MI, rather than MI itself, may therefore be a better approach for MI optimization. To the best of our knowledge, we are the first to propose a Mutual Information Gradient Estimator (MIGE) for representation learning. In detail, we estimate the score function of an implicit distribution, $\nabla_x \log q(x)$, to achieve general-purpose MI gradient estimation for representation learning. For high-dimensional inputs such as text, images and videos, score function estimation via the Spectral Stein Gradient Estimator (SSGE) (Shi et al., 2018) is computationally expensive and complex. We thus propose an efficient high-dimensional score function estimator to make SSGE scalable. To this end, we derive a new reparameterization trick for the representation distribution based on the lower-variance reparameterization trick proposed by Roeder et al. (2017).
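The argument that matching values at finitely many samples does not pin down gradients can be illustrated with a small sketch. The functions `f` and `g` below are hypothetical stand-ins (they are not MI estimators from the paper): they agree at every integer sample point, yet their derivatives at those points differ by a constant offset.

```python
import math

AMP = 3.0  # hypothetical amplitude of the gradient mismatch

def f(x):
    """Reference function, standing in for the true objective."""
    return x * x

def g(x):
    """Agrees with f at every integer x but has a different slope there."""
    return x * x + AMP * math.sin(2.0 * math.pi * x) / (2.0 * math.pi)

def df(x):
    return 2.0 * x

def dg(x):
    return 2.0 * x + AMP * math.cos(2.0 * math.pi * x)

samples = [0, 1, 2, 3]
# Largest disagreement in value and smallest disagreement in slope at the samples.
value_gap = max(abs(f(x) - g(x)) for x in samples)
grad_gap = min(abs(df(x) - dg(x)) for x in samples)
print(f"max value gap: {value_gap:.2e}, min gradient gap: {grad_gap:.2f}")
```

An optimizer that follows the gradient of `g` would take different steps from one that follows `f`, even though the two are indistinguishable by value on the sample set.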
We summarize the contributions of this paper as follows:

We propose the Mutual Information Gradient Estimator (MIGE) for representation learning, based on score function estimation of implicit distributions. Compared with MINE and MINE-$f$, MIGE provides a tighter and smoother gradient estimate of MI in the high-dimensional and large-MI setting, as shown in Figure 1 of Section 4.

We propose the Scalable SSGE to alleviate the exorbitant computational cost of SSGE in high-dimensional settings.

To learn meaningful representations, we apply SSGE as the gradient estimator for both InfoMax and Information Bottleneck, and achieve better performance than the corresponding competitors.
2 Scalable Spectral Stein Gradient Estimator
Score estimation of implicit distributions has been widely explored in the past few years (Song et al., 2019; Li and Turner, 2017; Shi et al., 2018). A promising method of score estimation is the Stein gradient estimator (Li and Turner, 2017; Shi et al., 2018), which is proposed for implicit distributions. It is inspired by generalized Stein’s identity (Gorham and Mackey, 2015; Liu and Wang, 2016) as follows.
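Stein’s identity. Let $q(x)$ be a continuously differentiable (also called smooth) density supported on $\mathcal{X} \subseteq \mathbb{R}^d$, and let $h(x) = [h_1(x), \dots, h_{d'}(x)]^\top$ be a smooth vector function. Further, the boundary condition on $h$ is:

$$q(x)\,h(x)\big|_{\partial \mathcal{X}} = 0, \quad \text{or} \quad \lim_{\|x\| \to \infty} q(x)\,h(x) = 0 \ \ \text{if} \ \mathcal{X} = \mathbb{R}^d. \tag{2}$$

Under this condition, the following identity can be checked using integration by parts:

$$\mathbb{E}_{q}\left[h(x)\,\nabla_x \log q(x)^\top + \nabla_x h(x)\right] = 0. \tag{3}$$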
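As a quick numerical sanity check of Stein's identity, the sketch below uses a case where everything is known in closed form. It assumes a one-dimensional standard Gaussian density, whose score is $-x$, and the test function $h(x) = \sin(x)$; both choices are illustrative and not taken from the paper.

```python
import math
import random

def stein_residual(num_samples=200_000, seed=0):
    """Monte Carlo estimate of E_q[h(x) * score(x) + h'(x)] for q = N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        score = -x                      # d/dx log q(x) for a standard Gaussian
        total += math.sin(x) * score + math.cos(x)
    return total / num_samples

residual = stein_residual()
print(f"Stein residual: {residual:+.4f}")  # should be close to zero
```

The residual shrinks at the usual Monte Carlo rate as `num_samples` grows. Score estimators based on Stein's identity invert this relationship: the samples are given and the score is the unknown.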
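Here $h$ is said to be in the Stein class of $q$ if Stein’s identity, Eq. (3), holds. Monte Carlo estimation of the expectation in Eq. (3) connects $\nabla_x \log q(x)$ with samples from $q(x)$. Motivated by Stein’s identity, Shi et al. (2018) proposed the Spectral Stein Gradient Estimator (SSGE) for implicit distributions, based on a spectral decomposition of kernel operators in which the eigenfunctions are approximated by the Nyström method. Below we briefly review SSGE; see Shi et al. (2018) for details.

We denote the target gradient function by $g(x) = \nabla_x \log q(x)$. The $i$-th component of the gradient is $g_i(x) = \nabla_{x_i} \log q(x)$. We assume $g_1, \dots, g_d \in L^2(\mathcal{X}, q)$, and let $\{\psi_j\}_{j \ge 1}$ denote an orthonormal basis of $L^2(\mathcal{X}, q)$. We can expand $g_i$ into the spectral series $g_i(x) = \sum_{j=1}^{\infty} \beta_{ij}\,\psi_j(x)$. The value of the $j$-th eigenfunction $\psi_j$ at $x$ can be approximated by the Nyström method (Xu et al., 2015). Due to the orthonormality of the eigenfunctions $\{\psi_j\}_{j \ge 1}$, there is a constraint under the probability measure $q(\cdot)$: $\int \psi_i(x)\,\psi_j(x)\,q(x)\,dx = \delta_{ij}$, where $\delta_{ij} = \mathbb{1}[i = j]$. Based on this constraint, we obtain the following equation for $\psi_j$:

$$\int k(x, x')\,\psi_j(x')\,q(x')\,dx' = \mu_j\,\psi_j(x), \tag{4}$$

where $k(\cdot, \cdot)$ is a kernel function. The left side of the above equation can be approximated by a Monte Carlo estimate using i.i.d. samples $x^1, \dots, x^M$ from $q$: $\frac{1}{M} K \boldsymbol{\psi}_j \approx \mu_j \boldsymbol{\psi}_j$, where $K$ is the Gram matrix with $K_{mn} = k(x^m, x^n)$ and $\boldsymbol{\psi}_j = [\psi_j(x^1), \dots, \psi_j(x^M)]^\top$. We solve this eigenvalue problem by choosing the $J$ largest eigenvalues $\lambda_1 \ge \dots \ge \lambda_J$ of $K$; $u_j$ denotes the $j$-th eigenvector of the Gram matrix. Combined with Eq. (4), the approximation of $\psi_j$ is

$$\psi_j(x) \approx \hat{\psi}_j(x) = \frac{\sqrt{M}}{\lambda_j} \sum_{m=1}^{M} u_{jm}\,k(x, x^m).$$

Furthermore, based on the orthonormality of $\{\psi_j\}_{j \ge 1}$ and Stein’s identity, we obtain $\beta_{ij} = -\mathbb{E}_q\left[\nabla_{x_i} \psi_j(x)\right]$. By taking the derivative of both sides of Eq. (4), we can show that:

$$\mu_j\,\nabla_{x_i} \psi_j(x) = \nabla_{x_i} \int k(x, x')\,\psi_j(x')\,q(x')\,dx' = \int \nabla_{x_i} k(x, x')\,\psi_j(x')\,q(x')\,dx'. \tag{5}$$

Then we can estimate $\nabla_{x_i} \psi_j(x)$ as follows:

$$\hat{\nabla}_{x_i} \psi_j(x) \approx \frac{1}{\mu_j M} \sum_{m=1}^{M} \nabla_{x_i} k(x, x^m)\,\psi_j(x^m). \tag{6}$$

Finally, by truncating the expansion to the first $J$ terms and plugging in the Nyström approximations of $\{\psi_j\}_{j=1}^{J}$, we get the score estimator:

$$\hat{g}_i(x) = \sum_{j=1}^{J} \hat{\beta}_{ij}\,\hat{\psi}_j(x), \qquad \hat{\beta}_{ij} = -\frac{1}{M} \sum_{m=1}^{M} \nabla_{x_i} \hat{\psi}_j(x^m). \tag{7}$$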
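The estimator in Eq. (7) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The RBF kernel, the median-distance bandwidth, the number of eigenfunctions `num_eig`, and the one-dimensional Gaussian test data are all assumptions made for the example.

```python
import numpy as np

def rbf_gram(a, b, sigma):
    """RBF kernel matrix between the rows of a (n, d) and b (m, d)."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def ssge_score(samples, query, num_eig=6):
    """Estimate grad_x log q(x) at `query` from `samples` drawn from q."""
    M = samples.shape[0]
    diff = samples[:, None, :] - samples[None, :, :]           # (M, M, d)
    dist = np.sqrt((diff**2).sum(-1))
    # Bandwidth from the median pairwise distance (assumed heuristic).
    sigma = np.median(dist[np.triu_indices(M, k=1)])
    K = rbf_gram(samples, samples, sigma)                       # Gram matrix
    eigvals, eigvecs = np.linalg.eigh(K)                        # ascending order
    lam = eigvals[::-1][:num_eig]                               # J largest eigenvalues
    U = eigvecs[:, ::-1][:, :num_eig]                           # (M, J) eigenvectors
    # Gradient of k(x^m, x^n) with respect to its first argument x^m.
    grad_K = -diff / sigma**2 * K[:, :, None]                   # (M, M, d)
    # Gradients of the Nystrom eigenfunctions evaluated at the samples.
    grad_psi = np.sqrt(M) / lam[None, :, None] * np.einsum("mnd,nj->mjd", grad_K, U)
    beta = -grad_psi.mean(axis=0)                               # (J, d) coefficients
    # Nystrom eigenfunctions evaluated at the query points.
    psi_q = np.sqrt(M) / lam[None, :] * (rbf_gram(query, samples, sigma) @ U)
    return psi_q @ beta                                         # (Q, d) score estimates

rng = np.random.default_rng(0)
samples = rng.standard_normal((300, 1))      # q = N(0, 1), true score is -x
query = np.array([[-1.0], [0.0], [1.0]])
est = ssge_score(samples, query)
print(np.hstack([query, est]))               # columns: x, estimated score
```

The pairwise difference tensor has size $M \times M \times d$, which is the cost that becomes prohibitive for high-dimensional inputs.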
In general, representation learning for large-scale datasets is costly in terms of storage and computation. For instance, the dimension of images in the STL-10 dataset is $96 \times 96 \times 3$ (i.e., the vector length is 27648). This makes it almost impossible to directly estimate the gradient of MI between the input and the representation. To alleviate this problem, we introduce random projection (RP) (Bingham and Mannila, 2001) to reduce the dimension of $x$.

We briefly review RP; see Bingham and Mannila (2001) for details. RP projects the original $d$-dimensional data into a $k$-dimensional ($k \ll d$) subspace. Concretely, let the matrix $X \in \mathbb{R}^{N \times d}$ denote the original set of $N$ $d$-dimensional data points. The projection $X^{RP}$ of the original data is obtained by introducing a random matrix $R \in \mathbb{R}^{d \times k}$ whose columns have unit length (Bingham and Mannila, 2001): $X^{RP} = XR$. After RP, the Euclidean distance between two original data vectors can be approximated by the scaled Euclidean distance between the projected vectors in the reduced space:

$$\|x_1 - x_2\| \approx \sqrt{d/k}\,\left\|R^\top x_1 - R^\top x_2\right\|, \tag{8}$$

where $x_1$ and $x_2$ denote two data vectors in the original high-dimensional space.

Based on the principle of RP, we derive a Scalable Spectral Stein Gradient Estimator, which is an efficient high-dimensional score function estimator. One can show that the RBF kernel satisfies Stein’s identity (Liu and Wang, 2016). Shi et al. (2018) also show that it is a promising choice for SSGE, with a lower error bound. To reduce the cost of computing the kernel similarities of SSGE in high-dimensional settings, we replace the input of SSGE with projections obtained by RP, relying on the approximation in Eq. (8) when computing the RBF kernel.
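The distance approximation in Eq. (8) is easy to check numerically. The sketch below assumes a Gaussian random matrix with columns normalized to unit length, and hypothetical sizes `d = 2000` and `k = 256`; it compares the original distance with the rescaled projected distance, which is the quantity an RBF kernel would consume.

```python
import numpy as np

def random_projection(data, k, rng):
    """Project rows of `data` (N, d) to k dimensions with unit-length random columns."""
    d = data.shape[1]
    R = rng.standard_normal((d, k))
    R /= np.linalg.norm(R, axis=0, keepdims=True)   # normalize each column
    return data @ R

rng = np.random.default_rng(0)
d, k = 2000, 256                                     # hypothetical dimensions
x = rng.standard_normal((2, d))                      # two data vectors
proj = random_projection(x, k, rng)

dist_orig = np.linalg.norm(x[0] - x[1])
dist_proj = np.sqrt(d / k) * np.linalg.norm(proj[0] - proj[1])
ratio = dist_proj / dist_orig
print(f"original: {dist_orig:.2f}, rescaled projected: {dist_proj:.2f}, ratio: {ratio:.3f}")
```

Because the RBF kernel depends on the inputs only through pairwise distances, the kernel matrix computed on the $k$-dimensional projections approximates the one computed on the original data at a fraction of the cost.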
3 Mutual Information Gradient Estimator
As gradient estimation is a straightforward and effective approach in optimization, we propose a gradient estimator for MI based on score estimation of implicit distributions, called the Mutual Information Gradient Estimator (MIGE). In this section, we focus on the three most general cases of MI gradient estimation for representation learning and derive the corresponding MI gradient estimator for each.

We outline the general setting of training an encoder to learn a representation. Let $\mathcal{X}$ and $\mathcal{Z}$ be the domains of $x$ and $z$, and let $E_\psi : \mathcal{X} \to \mathcal{Z}$ with parameters $\psi$ denote a continuous and (almost everywhere) differentiable parametric function, usually a neural network, namely an encoder. $p(x)$ denotes the empirical distribution of the input data $x \in \mathcal{X}$. We obtain the representation of the input data through the encoder, $z = E_\psi(x)$. $q_\psi(z)$ is defined as the marginal distribution induced by pushing samples from $p(x)$ through the encoder $E_\psi(\cdot)$. We also define $q_\psi(x, z)$ as the joint distribution of $x$ and $z$, which is determined by the encoder $E_\psi(\cdot)$.

Circumstance I. Given that the encoder $E_\psi(\cdot)$ is deterministic, our goal is to estimate the gradient of the MI between the input $x$ and the encoder output $z$ w.r.t. the encoder parameters $\psi$. MI and entropy are closely related: $I(x; z) = H(x) + H(z) - H(x, z)$. Here $H(x)$ is the data entropy, which does not depend on $\psi$, so the optimization of $I(x; z)$ w.r.t. $\psi$ can neglect the term $H(x)$. We decompose the gradients of the entropies of $q_\psi(z)$ and $q_\psi(x, z)$ as (see Appendix A):

$$\nabla_\psi H(z) = -\nabla_\psi \mathbb{E}_{q_\psi(z)}\left[\log q(z)\right], \qquad \nabla_\psi H(x, z) = -\nabla_\psi \mathbb{E}_{q_\psi(x, z)}\left[\log q(x, z)\right], \tag{9}$$

where $\log q(\cdot)$ without the subscript is treated as fixed with respect to $\psi$. Hence, we can represent the gradient of the MI between the input $x$ and the encoder output $z$ w.r.t. the encoder parameters $\psi$ as follows:

$$\nabla_\psi I(x; z) = -\nabla_\psi \mathbb{E}_{q_\psi(z)}\left[\log q(z)\right] + \nabla_\psi \mathbb{E}_{q_\psi(x, z)}\left[\log q(x, z)\right]. \tag{10}$$

However, this equation is intractable, since an expectation w.r.t. $q_\psi(z)$ is not directly differentiable w.r.t. $\psi$. Roeder et al. (2017) proposed a general variant of the standard reparameterization trick for the variational evidence lower bound, which exhibits lower variance. To address the above problem, we adapt this trick to the MI gradient estimator for representation learning. Specifically, we can obtain samples from the marginal distribution of $z$ by pushing samples from the empirical data distribution $p(x)$ through $E_\psi(\cdot)$. Hence we can reparameterize the representation variable $z$ using a differentiable transformation, $z = E_\psi(x)$ with $x \sim p(x)$, where the empirical data distribution $p(x)$ is independent of the encoder parameters $\psi$. This reparameterization rewrites the expectations w.r.t. $q_\psi(z)$ and $q_\psi(x, z)$ such that their Monte Carlo estimates are differentiable w.r.t. $\psi$.

Relying on this reparameterization trick, we can represent the gradient of MI w.r.t. the encoder parameters $\psi$ in Eq. (10) as follows:

$$\nabla_\psi I(x; z) = -\mathbb{E}_{p(x)}\left[\nabla_z \log q(E_\psi(x))\,\nabla_\psi E_\psi(x)\right] + \mathbb{E}_{p(x)}\left[\nabla_{(x, z)} \log q(x, E_\psi(x))\,\nabla_\psi (x, E_\psi(x))\right], \tag{11}$$

where the score function $\nabla_z \log q(E_\psi(x))$ can be estimated from i.i.d. samples from the implicit density $q_\psi(z)$ (Shi et al., 2018; Song et al., 2019). The samples from the joint distribution $q_\psi(x, z)$ are produced as follows: we sample observations from the empirical distribution $p(x)$; the corresponding samples of $z$ are then obtained through $E_\psi(\cdot)$. Hence we can also estimate $\nabla_{(x, z)} \log q(x, E_\psi(x))$ from i.i.d. samples from $q_\psi(x, z)$. The terms $\nabla_\psi E_\psi(x)$ and $\nabla_\psi (x, E_\psi(x))$ are computed directly from the encoder $E_\psi(x)$.
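The reparameterized entropy gradient can be verified in a case where the score is known analytically. The sketch below is a toy check under stated assumptions, not the estimator used in the experiments: the encoder is the hypothetical linear map $z = \psi x$ with $x \sim N(0, 1)$, so $q_\psi(z) = N(0, \psi^2)$, the score is $-z/\psi^2$, and $\nabla_\psi H(z) = 1/\psi$ exactly. In practice the analytic score would be replaced by an estimate such as SSGE.

```python
import random

def entropy_grad_mc(psi, num_samples=200_000, seed=0):
    """Monte Carlo estimate of -E_x[ score(z) * dz/dpsi ] for z = psi * x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        z = psi * x                 # deterministic encoder E_psi(x)
        score = -z / psi**2         # analytic score of N(0, psi^2); held fixed w.r.t. psi
        dz_dpsi = x                 # gradient of the encoder output w.r.t. psi
        total += -score * dz_dpsi
    return total / num_samples

psi = 2.0
estimate = entropy_grad_mc(psi)
exact = 1.0 / psi                   # d/dpsi of 0.5 * log(2 * pi * e * psi^2)
print(f"estimate: {estimate:.4f}, exact: {exact:.4f}")
```

Only the path through the sample $z = E_\psi(x)$ is differentiated; the score is treated as a constant, which is the lower-variance form described above.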
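Circumstance II. Assume that we encode the input into a latent data space $y = C_\psi(x)$ that reflects useful structure in the data. Next, we summarize this latent variable into the final representation by the function $f_\psi$, $z = E_\psi(x) = f_\psi \circ C_\psi(x)$. The gradient estimator of the MI between $y$ and $z$ is represented by the data reparameterization trick as follows:

$$\nabla_\psi I(y; z) = \nabla_\psi H(y) + \nabla_\psi H(z) - \nabla_\psi H(y, z) \tag{12}$$

$$= -\mathbb{E}_{p(x)}\left[\nabla_y \log q(C_\psi(x))\,\nabla_\psi C_\psi(x)\right] - \mathbb{E}_{p(x)}\left[\nabla_z \log q(E_\psi(x))\,\nabla_\psi E_\psi(x)\right] + \mathbb{E}_{p(x)}\left[\nabla_{(y, z)} \log q(C_\psi(x), E_\psi(x))\,\nabla_\psi (C_\psi(x), E_\psi(x))\right]. \tag{13}$$

Circumstance III. Consider a stochastic encoder function $E_\psi(x, \epsilon)$, where $\epsilon$ is an auxiliary variable with independent marginal $p(\epsilon)$. Using the data reparameterization trick, we can represent the gradient of the conditional entropy $H(z|x)$ as follows (see Appendix A):

$$\nabla_\psi H(z|x) = -\mathbb{E}_{p(x)}\left[\mathbb{E}_{p(\epsilon)}\left[\nabla_z \log q(E_\psi(x, \epsilon)\,|\,x)\,\nabla_\psi E_\psi(x, \epsilon)\right]\right], \tag{14}$$

where the term $\nabla_z \log q(E_\psi(x, \epsilon)\,|\,x)$ can be estimated by score estimation.

Based on the conditional entropy gradient in Eq. (14), the gradient estimator of the MI between the input and the encoder output can be represented as follows:

$$\nabla_\psi I(x; z) = \nabla_\psi H(z) - \nabla_\psi H(z|x) \tag{15}$$

$$= -\mathbb{E}_{p(x)}\left[\mathbb{E}_{p(\epsilon)}\left[\nabla_z \log q(E_\psi(x, \epsilon))\,\nabla_\psi E_\psi(x, \epsilon)\right]\right] + \mathbb{E}_{p(x)}\left[\mathbb{E}_{p(\epsilon)}\left[\nabla_z \log q(E_\psi(x, \epsilon)\,|\,x)\,\nabla_\psi E_\psi(x, \epsilon)\right]\right]. \tag{16}$$

In practical MI optimization, we construct MIGE for the full dataset from mini-batch Monte Carlo estimates. An algorithm description of MIGE is provided in Appendix B.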
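For a stochastic encoder, the decomposition of the MI gradient into a marginal-entropy term and a conditional-entropy term can be checked on a toy model with closed-form scores. The sketch below assumes the hypothetical encoder $z = x + \psi\epsilon$ with $x, \epsilon \sim N(0, 1)$, for which $q_\psi(z) = N(0, 1 + \psi^2)$, $q_\psi(z|x) = N(x, \psi^2)$, and $I(x; z) = \tfrac{1}{2}\log(1 + 1/\psi^2)$. The analytic scores stand in for the score estimates a real pipeline would use.

```python
import random

def mi_grad_mc(psi, num_samples=200_000, seed=0):
    """Monte Carlo MI gradient for the toy stochastic encoder z = x + psi * eps."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(0.0, 1.0)
        eps = rng.gauss(0.0, 1.0)
        z = x + psi * eps
        dz_dpsi = eps
        score_marginal = -z / (1.0 + psi**2)       # score of q(z) = N(0, 1 + psi^2)
        score_conditional = -(z - x) / psi**2      # score of q(z|x) = N(x, psi^2)
        # grad H(z) - grad H(z|x), each written as minus E[score * dz/dpsi].
        total += (-score_marginal + score_conditional) * dz_dpsi
    return total / num_samples

psi = 1.0
estimate = mi_grad_mc(psi)
exact = -1.0 / (psi * (1.0 + psi**2))              # derivative of 0.5 * log(1 + 1/psi^2)
print(f"estimate: {estimate:.4f}, exact: {exact:.4f}")
```

The negative sign matches intuition: increasing the noise scale $\psi$ reduces the information that $z$ carries about $x$.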
4 Toy Experiment
Recently, MINE and MINE-$f$ have enabled effective computation of MI in continuous and high-dimensional settings. To compare with MINE and MINE-$f$, we evaluate MIGE on the correlated Gaussian problem taken from Belghazi et al. (2018).

Experimental Settings. We consider two random variables $x$ and $y$ ($x, y \in \mathbb{R}^d$), drawn from a $2d$-dimensional multivariate Gaussian distribution. The component-wise correlation of $x$ and $y$ is defined as $\mathrm{corr}(x_i, y_j) = \delta_{ij}\,\rho$, $\rho \in (-1, 1)$, where $\delta_{ij}$ is Kronecker’s delta and $\rho$ is the correlation coefficient. Since MI is invariant to smooth transformations of $x$ and $y$, we only consider standardized Gaussians for the marginal distributions $p(x)$ and $p(y)$. The gradient of MI w.r.t. $\rho$ has the analytical solution $\nabla_\rho I(x; y) = \frac{\rho d}{1 - \rho^2}$. We apply MINE and MINE-$f$ to estimate the MI of $(x, y)$ by sampling from the correlated Gaussian distribution and its marginal distributions; the corresponding gradient of MI w.r.t. $\rho$ is computed by backpropagation in PyTorch.
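The analytical gradient used as ground truth can be reproduced from the closed-form MI of this Gaussian model, $I(x; y) = -\tfrac{d}{2}\log(1 - \rho^2)$, which is a standard result rather than a formula quoted from the text above. The sketch below compares the analytical gradient with a central finite difference; the values of `rho`, `dim`, and the step `h` are arbitrary choices for illustration.

```python
import math

def gaussian_mi(rho, dim):
    """MI of two dim-dimensional Gaussians with component-wise correlation rho."""
    return -0.5 * dim * math.log(1.0 - rho * rho)

def gaussian_mi_grad(rho, dim):
    """Analytical gradient of the MI with respect to rho."""
    return dim * rho / (1.0 - rho * rho)

rho, dim, h = 0.5, 20, 1e-6
finite_diff = (gaussian_mi(rho + h, dim) - gaussian_mi(rho - h, dim)) / (2.0 * h)
analytic = gaussian_mi_grad(rho, dim)
print(f"analytic: {analytic:.6f}, finite difference: {finite_diff:.6f}")
```

Both the MI and its gradient diverge as $|\rho| \to 1$, which is the large-MI regime where estimator quality matters most.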
Results. Fig. 1 presents our experimental results for different dimensions $d$. In the low-dimensional case, all the estimators give promising estimates of MI and its gradient. However, the MI estimates of MINE and MINE-$f$ are unstable because they rely on a discriminator to estimate the bound on MI. Hence, as shown in Fig. 1, the corresponding estimates of MI and its gradient are not smooth. As the dimension $d$ and the absolute value of the correlation coefficient $|\rho|$ increase, MINE and MINE-$f$ struggle to reach the true MI, and their gradient estimates are thus highly biased. This phenomenon is more pronounced in the high-dimensional or large-MI case. In contrast, MIGE shows a significant improvement over MINE and MINE-$f$ when estimating the MI gradient between twenty-dimensional random variables $x$ and $y$. On this analytically tractable problem, the gradient curve estimated by our method is clearly smoother and tighter than those of the two baselines in the high-dimensional and large-MI setting.
5 Applications
To demonstrate performance in downstream tasks, we apply MIGE to Deep InfoMax (Hjelm et al., 2019) and Information Bottleneck (Tishby et al., 2000), replacing the original MI estimators with MIGE. We find that MIGE achieves higher and more stable classification accuracy, which indicates good gradient estimation performance in practical applications.
5.1 Deep InfoMax
Discovering useful representations from unlabeled data is a core problem in deep learning. Recently, a growing set of methods has been explored to train deep neural network encoders by maximizing the mutual information between their input and output. A number of methods based on tractable variational lower bounds, such as JSD and infoNCE, have been proposed to improve the estimation of MI between high-dimensional input/output pairs of deep neural networks (Hjelm et al., 2019). To compare with JSD and infoNCE, we extend the application of MIGE to unsupervised learning of deep representations based on the InfoMax principle.

Experimental Settings. For a consistent comparison, we follow the experimental setup of Deep InfoMax (DIM) as in Hjelm et al. (2019) (code available at https://github.com/rdevon/DIM). We test DIM on the image datasets CIFAR-10, CIFAR-100 and STL-10 to evaluate MIGE. For the high-dimensional images in STL-10, directly applying SSGE is almost impossible because of its exorbitant computational cost. We therefore apply our proposed Scalable SSGE to reduce the dimension of the images and achieve a reasonable computational cost. As in Hjelm et al. (2019), a nonlinear classifier is used to evaluate our representation: after learning the representation, we freeze the parameters of the encoder and train a nonlinear classifier using the representation as the input. The same classifiers are used for all methods. Our baseline results are either copied directly from Hjelm et al. (2019) or obtained by running the authors' code.

Table 1. Classification accuracy of DIM variants and MIGE on CIFAR-10 and CIFAR-100.

Model                  CIFAR-10                        CIFAR-100
                       conv      fc(1024)  Y(64)       conv      fc(1024)  Y(64)
DIM (JSD)              55.81%    45.73%    40.67%      28.41%    22.16%    16.50%
DIM (JSD + PM)         52.2%     52.84%    43.17%      24.40%    18.22%    15.22%
DIM (infoNCE)          51.82%    42.81%    37.79%      24.60%    16.54%    12.96%
DIM (infoNCE + PM)     56.77%    49.42%    42.68%      25.51%    20.15%    15.35%
MIGE                   57.95%    57.09%    53.75%      29.86%    27.91%    25.84%
Results. As shown in Table 1, MIGE outperforms all the competing models in the DIM experiments on CIFAR-10 and CIFAR-100. Besides the numerical improvements, it is notable that the accuracy of our model decreases less across layers than that of DIM (JSD) and DIM (infoNCE). The results indicate that, compared to variational lower bound methods, MIGE gives a more favorable gradient direction and is better at controlling information flow without significant loss. With the aid of random projection, we can evaluate on larger datasets, e.g., STL-10. Table 2 shows the results of the DIM experiments on STL-10. We observe a significant improvement over the baselines when projecting to 512 dimensions. Note that our proposed gradient estimator can also be extended to the multi-view setting of DIM (i.e., with local and global features), but this is beyond the scope of this paper. See Appendix C for further discussion.

Table 2. Classification accuracy of DIM variants and MIGE on STL-10.

Model                  STL-10
                       conv      fc(1024)  Y(64)
DIM (JSD)              42.03%    30.28%    28.09%
DIM (infoNCE)          43.13%    35.80%    34.44%
MIGE                   unaffordable computational cost
MIGE + RP to 512d      52.00%    48.14%    44.89%

Ablation Study. To verify the effect of the random projection dimension on classification accuracy in the DIM experiments, we conduct an ablation study on STL-10 with the above experimental settings. Varying the RP dimension $k$, we measure the classification accuracy of Y(64), shown in Fig. 2. We find that the classification accuracy increases as the RP dimension grows from 16 to 128. Beyond that, the approximation in Eq. (8) saturates as the RP dimension increases further, while incurring extra computational cost.
5.2 Information Bottleneck
Information Bottleneck (IB) has been widely applied in a variety of domains, such as classification (Tishby and Zaslavsky, 2015; Alemi et al., 2017; Chalk et al., 2016; Kolchinsky et al., 2017), clustering (Slonim and Tishby, 2000), and coding theory and quantization (Zeitler et al., 2008; Courtade and Wesel, 2011). In particular, given the input variable $X$ and the target variable $Y$, the goal of IB is to learn a representation of $X$ (denoted by the variable $Z$) that satisfies the following characteristics:

$Z$ is sufficient for the target $Y$, that is, all information about the target $Y$ contained in $X$ should also be contained in $Z$. In optimization, $I(Z; Y)$ should be maximized.

$Z$ is minimal. In order not to contain information that is irrelevant to $Y$, $Z$ is required to contain the least information among all sufficient representations.

The objective function of IB is written as follows:

$$\max\ I(Z; Y) \quad \text{s.t.} \quad I(Z; X) \le c. \tag{17}$$

Equivalently, by introducing a Lagrange multiplier $\beta$, the IB method can maximize the objective function $I(Z; Y) - \beta\,I(Z; X)$. Further, since $I(Z; Y) = H(Y) - H(Y|Z)$ and $H(Y)$ is constant, we can also minimize an objective function of the following form:

$$\mathcal{L}_{IB} = H(Y|Z) + \beta\,I(Z; X), \tag{18}$$

where $\beta \ge 0$ trades off sufficiency against minimality. Note that the above formulas omit the parameters for simplicity.
To overcome the intractability of MI in continuous and high-dimensional settings, Alemi et al. (2017) present a variational approximation to IB, called Deep Variational Bottleneck (DVB), which uses a deep neural network encoder to produce a conditional multivariate normal distribution. Recently, DVB has been used to restrict the capacity of discriminators in GANs (Peng et al., 2019). However, DVB requires a tractable density for the approximate posterior because it relies on a variational approximation, whereas MIGE does not.

To evaluate our method, we compare MIGE-IB with DVB and MINE-IB in the IB application. We demonstrate an implementation of the IB objective on permutation-invariant MNIST using MIGE.

Experiments. For a consistent comparison, we adopt the same architecture and empirical settings as Alemi et al. (2017), except that the initial learning rate of the Adam optimizer is set to 2e-4 and is decayed exponentially by a factor of 0.96 every 2 epochs. The implementation of DVB is available from its authors (https://github.com/alexalemi/vib_demo). Under these experimental settings, we use our MI gradient estimator to replace the MI estimator in the DVB experiment. The threshold of the score function’s Stein gradient estimator is set as . The threshold is a hyperparameter of the Spectral Stein Gradient Estimator (SSGE), and it is used to set the bandwidth of the RBF kernel. Our results are shown in Table 3; they show that our proposed MIGE-IB outperforms DVB and MINE-IB.

Table 3. Misclassification rate on permutation-invariant MNIST.

Model                  Misclass rate
Baseline               1.38%
Dropout                1.34%
Confidence penalty     1.36%
Label Smoothing        1.4%
DVB                    1.13%
MINE-IB                1.11%
MIGE-IB (ours)         1.05%
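The learning-rate schedule described in the experimental settings can be written as a small function. This sketch assumes a staircase schedule, in which the rate is multiplied by 0.96 once every 2 full epochs; the text does not say whether the decay is stepwise or continuous, so this interpretation and the function name `learning_rate` are assumptions.

```python
def learning_rate(epoch, base_lr=2e-4, decay=0.96, every=2):
    """Staircase exponential decay: multiply by `decay` once per `every` epochs."""
    return base_lr * decay ** (epoch // every)

# Learning rates for the first six epochs (0-indexed).
lrs = [learning_rate(epoch) for epoch in range(6)]
for epoch, lr in enumerate(lrs):
    print(f"epoch {epoch}: lr = {lr:.6e}")
```

A continuous variant would replace the integer division with `epoch / every`, giving a slightly smaller rate on odd epochs.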
6 Conclusion
In this paper, we present a gradient estimator, called the Mutual Information Gradient Estimator (MIGE), to avoid the various problems encountered in direct mutual information estimation. We demonstrate the effectiveness of gradient estimation of MI over direct MI estimation by applying it to unsupervised and supervised representation learning. Experimental results show a marked improvement over MI estimation in the Deep InfoMax method and the Information Bottleneck method.
Acknowledgement
This work was partially funded by the National Key R&D Program of China (No. 2018YFB1005100 & No. 2018YFB1005104).
References
Alemi et al. (2018a). Uncertainty in the variational information bottleneck. arXiv preprint arXiv:1807.00906.
Alemi et al. (2017). Deep variational information bottleneck. In ICLR.
Alemi et al. (2018b). Fixing a broken ELBO. In ICML.
Barber and Agakov (2003). The IM algorithm: a variational approach to information maximization. In Advances in Neural Information Processing Systems.
Belghazi et al. (2018). MINE: mutual information neural estimation. In ICML.
Bingham and Mannila (2001). Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, pp. 245–250.
Butte and Kohane (1999). Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In Biocomputing 2000, pp. 418–429.
Chalk et al. (2016). Relevant sparse codes with variational information bottleneck. In Advances in Neural Information Processing Systems, pp. 1957–1965.
Courtade and Wesel (2011). Multiterminal source coding with an entropy-based distortion measure. In 2011 IEEE International Symposium on Information Theory Proceedings, pp. 2040–2044.
Darbellay and Vajda (1999). Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory 45 (4), pp. 1315–1321.
Fleuret (2004). Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research 5 (Nov), pp. 1531–1555.
Fraser and Swinney (1986). Independent coordinates for strange attractors from mutual information. Physical Review A 33 (2), pp. 1134.
Gao et al. (2015). Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pp. 277–286.
Gorham and Mackey (2015). Measuring sample quality with Stein’s method. In Advances in Neural Information Processing Systems, pp. 226–234.
Hjelm et al. (2019). Learning deep representations by mutual information estimation and maximization. In ICLR.
Hu et al. (2017). Learning discrete representations via information maximizing self-augmented training. In ICML.
Kandasamy et al. (2015). Nonparametric von Mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, pp. 397–405.
Kinney and Atwal (2014). Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences 111 (9), pp. 3354–3359.
Kolchinsky et al. (2017). Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436.
Kozachenko and Leonenko (1987). Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii 23 (2), pp. 9–16.
Kraskov et al. (2004). Estimating mutual information. Physical Review E 69 (6), pp. 066138.
Krause et al. (2010). Discriminative clustering by regularized information maximization. In Advances in Neural Information Processing Systems.
Krishnaswamy et al. (2014). Conditional density-based analysis of T cell signaling in single-cell data. Science 346 (6213), pp. 1250689.
Kwak and Choi (2002). Input feature selection by mutual information based on Parzen window. IEEE Transactions on Pattern Analysis & Machine Intelligence (12), pp. 1667–1671.
Li and Turner (2017). Gradient estimators for implicit models. arXiv preprint arXiv:1705.07107.
Linsker (1988). Self-organization in a perceptual network. Computer 21 (3), pp. 105–117.
Liu and Wang (2016). Stein variational gradient descent: a general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pp. 2378–2386.
Maes et al. (1997). Multimodality image registration by maximization of mutual information. IEEE Transactions on Medical Imaging 16 (2), pp. 187–198.
McAllester and Stratos (2018). Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251.
Moon et al. (1995). Estimation of mutual information using kernel density estimators. Physical Review E 52 (3), pp. 2318.
Müller et al. (2012). Information theoretic clustering using minimum spanning trees. In Joint DAGM (German Association for Pattern Recognition) and OAGM Symposium, pp. 205–215.
Oord et al. (2018). Representation learning with contrastive predictive coding. NIPS.
Palmer et al. (2015). Predictive information in a sensory population. Proceedings of the National Academy of Sciences 112 (22), pp. 6908–6913.
Peng et al. (2005). Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis & Machine Intelligence (8), pp. 1226–1238.
Peng et al. (2019). Variational discriminator bottleneck: improving imitation learning, inverse RL, and GANs by constraining information flow. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019.
Pérez-Cruz (2008). Kullback-Leibler divergence estimation of continuous distributions. In 2008 IEEE International Symposium on Information Theory, pp. 1666–1670.
Poole et al. (2019). On variational bounds of mutual information. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 5171–5180.
Roeder et al. (2017). Sticking the landing: simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems, pp. 6925–6934.
Shi et al. (2018). A spectral approach to gradient estimation for implicit distributions. arXiv preprint arXiv:1806.02925.
Shwartz-Ziv and Tishby (2017). Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
Singh and Póczos (2016). Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In Advances in Neural Information Processing Systems, pp. 1217–1225.
Slonim and Tishby (2000). Document clustering using word clusters via the information bottleneck method. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 208–215.
Song et al. (2019). Sliced score matching: a scalable approach to density and score estimation. arXiv preprint arXiv:1905.07088.
Suzuki et al. (2008). Approximating mutual information by maximum likelihood density ratio estimation. In New Challenges for Feature Selection in Data Mining and Knowledge Discovery, pp. 5–20.
Tishby et al. (2000). The information bottleneck method. arXiv preprint physics/0004057.
Tishby and Zaslavsky (2015). Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pp. 1–5.
Tschannen et al. (2019). On mutual information maximization for representation learning. ArXiv abs/1907.13625.
Ver Steeg and Galstyan (2015). Maximally informative hierarchical representations of high-dimensional data. In Artificial Intelligence and Statistics, pp. 1004–1012.
Vera et al. (2018). The role of the information bottleneck in representation learning. In IEEE International Symposium on Information Theory (ISIT), pp. 1580–1584.
Xu et al. (2015). Nyström approximation for sparse kernel methods: theoretical analysis and empirical evaluation. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA, B. Bonet and S. Koenig (Eds.), pp. 3115–3121.
Zeitler et al. (2008). Design of network coding functions in multi-hop relay networks. In 2008 5th International Symposium on Turbo Codes and Related Topics, pp. 249–254.
Appendix A Derivation of Gradient Estimates for Entropy
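Unconditional Entropy. Given that the encoder $E_\psi(\cdot)$ is deterministic, our goal is to optimize the entropy $H(q) = -\mathbb{E}_{q_\psi(z)}\left[\log q_\psi(z)\right]$, where $q$ is short for the distribution $q_\psi(z)$ of the representation $z$ w.r.t. its parameters $\psi$. We can decompose the gradient of the entropy of $q_\psi(z)$ as:

$$\nabla_\psi H(q) = -\nabla_\psi \mathbb{E}_{q_\psi(z)}\left[\log q(z)\right] - \mathbb{E}_{q(z)}\left[\nabla_\psi \log q_\psi(z)\right], \tag{19}$$

where $q(z)$ without the subscript is treated as fixed with respect to $\psi$. The second term on the right side of the equation can be calculated:

$$\mathbb{E}_{q(z)}\left[\nabla_\psi \log q_\psi(z)\right] = \int q(z)\,\frac{\nabla_\psi q_\psi(z)}{q(z)}\,dz = \nabla_\psi \int q_\psi(z)\,dz = 0. \tag{20}$$

Therefore, the gradient of the entropy of $q_\psi(z)$ becomes

$$\nabla_\psi H(q) = -\nabla_\psi \mathbb{E}_{q_\psi(z)}\left[\log q(z)\right]. \tag{21}$$

Conditional Entropy. Consider a non-deterministic encoder function $E_\psi(x, \epsilon)$, where $\epsilon$ is an auxiliary variable with independent marginal $p(\epsilon)$. The distribution $q_\psi(z|x)$ is determined by $\epsilon$ and the encoder parameters $\psi$. The auxiliary variable $\epsilon$ introduces randomness into the encoder. First, we decompose the gradient of the conditional entropy as follows:

$$\nabla_\psi H(z|x) = -\mathbb{E}_{p(x)}\left[\nabla_\psi \mathbb{E}_{q_\psi(z|x)}\left[\log q(z|x)\right]\right] - \mathbb{E}_{p(x)}\left[\mathbb{E}_{q(z|x)}\left[\nabla_\psi \log q_\psi(z|x)\right]\right] = -\mathbb{E}_{p(x)}\left[\nabla_\psi \mathbb{E}_{q_\psi(z|x)}\left[\log q(z|x)\right]\right]. \tag{22}$$

Note that $z = E_\psi(x, \epsilon)$, so we can apply the reparameterization trick to the gradient estimator of the conditional entropy in Eq. (22):

$$\nabla_\psi H(z|x) = -\mathbb{E}_{p(x)}\left[\mathbb{E}_{p(\epsilon)}\left[\nabla_z \log q(E_\psi(x, \epsilon)\,|\,x)\,\nabla_\psi E_\psi(x, \epsilon)\right]\right]. \tag{23}$$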
Appendix B MIGE Algorithm Description
The algorithm description of our proposed MIGE is stated in Algorithm 1.
Appendix C Discussion on DIM(L)
DIM(L) (Hjelm et al., 2019) is the state-of-the-art unsupervised model for representation learning. It maximizes the average MI between the high-level representation and local patches of the image, and achieves an even higher classification accuracy than supervised learning. As shown in Table 4, when we apply MIGE to DIM(L), we surprisingly find a significant performance gap relative to DIM(L).

To our knowledge, the principle behind DIM(L) is still unclear. Tschannen et al. (2019) argue that maximizing tighter bounds in DIM(L) can lead to worse results, that the success of these methods cannot be attributed to the properties of MI alone, and that they depend strongly on the inductive bias in both the choice of feature extractor architectures and the parameterization of the employed MI estimators. For MIGE, we are investigating the underlying reasons, e.g., by examining the distributions of the patches.

Table 4. Classification accuracy of DIM(L) variants and MIGE on CIFAR-10 and CIFAR-100.

Model                    CIFAR-10                        CIFAR-100
                         conv      fc(1024)  Y(64)       conv      fc(1024)  Y(64)
DIM(L) (JSD)             72.16%    67.99%    66.35%      41.65%    39.60%    39.66%
DIM(L) (JSD + PM)        73.25%    73.62%    66.96%      48.13%    45.92%    39.6%
DIM(L) (infoNCE)         75.05%    70.68%    69.24%      44.11%    42.97%    42.74%
DIM(L) (infoNCE + PM)    75.21%    75.57%    69.13%      49.74%    47.72%    41.61%
MIGE                     59.72%    56.14%    54.01%      30.0%     28.96%    27.65%