1 Introduction
Mutual information is a fundamental quantity for measuring the relationship between random variables. In data science it has found applications in a wide range of domains and tasks, including biomedical sciences (Maes et al., 1997), blind source separation (BSS, e.g., independent component analysis, Hyvärinen et al., 2004), information bottleneck (IB, Tishby et al., 2000), feature selection (Kwak & Choi, 2002; Peng et al., 2005), and causality (Butte & Kohane, 2000).
Put simply, mutual information quantifies the dependence of two random variables $X$ and $Z$. It has the form,
$$I(X;Z) = \int_{\mathcal{X} \times \mathcal{Z}} \log \frac{d\mathbb{P}_{XZ}}{d\mathbb{P}_X \otimes \mathbb{P}_Z} \, d\mathbb{P}_{XZ}, \tag{1}$$
where $\mathbb{P}_{XZ}$ is the joint probability distribution, and $\mathbb{P}_X$ and $\mathbb{P}_Z$ are the marginals. In contrast to correlation, mutual information captures nonlinear statistical dependencies between variables, and thus can act as a measure of true dependence (Kinney & Atwal, 2014).
Despite being a pivotal quantity across data science, mutual information has historically been difficult to compute (Paninski, 2003). Exact computation is only tractable for discrete variables (as the sum can be computed exactly), or for a limited family of problems where the probability distributions are known. For more general problems, this is not possible. Common approaches are nonparametric
(e.g., binning, likelihood-ratio estimators based on support vector machines, nonparametric kernel-density estimators; see Fraser & Swinney, 1986; Darbellay & Vajda, 1999; Suzuki et al., 2008; Kwak & Choi, 2002; Moon et al., 1995; Kraskov et al., 2004), or rely on approximate Gaussianity of the data distribution (e.g., Edgeworth expansion, Van Hulle, 2005). Unfortunately, these estimators typically do not scale well with sample size or dimension (Gao et al., 2014), and thus cannot be said to be general-purpose. Other recent works include Kandasamy et al. (2017); Singh & Póczos (2016); Moon et al. (2017).
In order to achieve a general-purpose estimator, we rely on the well-known characterization of the mutual information as the Kullback-Leibler (KL) divergence (Kullback, 1997) between the joint distribution and the product of the marginals (i.e., $I(X;Z) = D_{KL}(\mathbb{P}_{XZ} \,\|\, \mathbb{P}_X \otimes \mathbb{P}_Z)$). Recent work uses a dual formulation to cast the estimation of f-divergences (including the KL-divergence, see Nguyen et al., 2010) as part of an adversarial game between competing deep neural networks (Nowozin et al., 2016). This approach is the cornerstone of generative adversarial networks (GANs, Goodfellow et al., 2014), which train a generative model without any explicit assumptions about the underlying distribution of the data.
In this paper we demonstrate that exploiting dual optimization to estimate divergences goes beyond the minimax objective as formalized in GANs. We leverage this strategy to offer a general-purpose parametric neural estimator of mutual information based on dual representations of the KL-divergence (Ruderman et al., 2012), which we show is valuable in settings that do not necessarily involve an adversarial game. Our estimator is scalable, flexible, and completely trainable via backpropagation. The contributions of this paper are as follows:

We introduce the Mutual Information Neural Estimator (MINE), which is scalable, flexible, and completely trainable via backprop, and we provide a thorough theoretical analysis.

We show that the utility of this estimator transcends the minimax objective as formalized in GANs, such that it can be used in mutual information estimation, maximization, and minimization.

We apply MINE to palliate mode-dropping in GANs and to improve reconstructions and inference in Adversarially Learned Inference (ALI, Dumoulin et al., 2016) on large-scale datasets.
2 Background
2.1 Mutual Information
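Mutual information is a Shannon entropy-based measure of dependence between random variables. The mutual information between $X$ and $Z$ can be understood as the decrease of the uncertainty in $X$ given $Z$:
$$I(X;Z) := H(X) - H(X \mid Z), \tag{2}$$
where $H$ is the Shannon entropy, and $H(X \mid Z)$ is the conditional entropy of $X$ given $Z$. As stated in Eqn. 1 and the discussion above, the mutual information is equivalent to the Kullback-Leibler (KL) divergence between the joint, $\mathbb{P}_{XZ}$, and the product of the marginals $\mathbb{P}_X \otimes \mathbb{P}_Z$:
$$I(X;Z) = D_{KL}(\mathbb{P}_{XZ} \,\|\, \mathbb{P}_X \otimes \mathbb{P}_Z), \tag{3}$$
where $D_{KL}$ is defined as
$$D_{KL}(\mathbb{P} \,\|\, \mathbb{Q}) := \mathbb{E}_{\mathbb{P}}\left[\log \frac{d\mathbb{P}}{d\mathbb{Q}}\right], \tag{4}$$
whenever $\mathbb{P}$ is absolutely continuous with respect to $\mathbb{Q}$, and infinity otherwise. (Footnote 1: Although the discussion is more general, we can think of $\mathbb{P}$ and $\mathbb{Q}$ as being distributions on some compact domain $\Omega \subset \mathbb{R}^d$, with densities $p$ and $q$ with respect to the Lebesgue measure $\lambda$, so that $D_{KL} = \int p \log \frac{p}{q} \, d\lambda$.)
The intuitive meaning of Eqn. 3 is clear: the larger the divergence between the joint and the product of the marginals, the stronger the dependence between $X$ and $Z$. This divergence, hence the mutual information, vanishes for fully independent variables.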
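As a small illustration of the two views of mutual information in this subsection, the following sketch computes it for a discrete joint table, once as the KL divergence between the joint and the product of the marginals and once as the entropy difference of Eqn. 2. The function names and the two example tables are illustrative assumptions, not part of the paper.

```python
import math

def mutual_information(joint):
    """Mutual information in nats of a table joint[i][j] = P(X=i, Z=j)."""
    px = [sum(row) for row in joint]           # marginal of X
    pz = [sum(col) for col in zip(*joint)]     # marginal of Z
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxz in enumerate(row):
            if pxz > 0.0:                      # 0 * log 0 is taken as 0
                mi += pxz * math.log(pxz / (px[i] * pz[j]))
    return mi

def entropy(p):
    """Shannon entropy in nats of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def conditional_entropy(joint):
    """H(X | Z) computed as H(X, Z) - H(Z)."""
    pz = [sum(col) for col in zip(*joint)]
    flat = [p for row in joint for p in row]
    return entropy(flat) - entropy(pz)

# Hypothetical example tables: one dependent pair, one independent pair.
dependent = [[0.4, 0.1], [0.1, 0.4]]
independent = [[0.25, 0.25], [0.25, 0.25]]

mi_dep = mutual_information(dependent)
mi_ind = mutual_information(independent)
px_dep = [sum(row) for row in dependent]
entropy_gap = entropy(px_dep) - conditional_entropy(dependent)
print(mi_dep, entropy_gap, mi_ind)
```

The two computations agree for the dependent table, and the independent table gives zero, matching the statement that the divergence vanishes for independent variables.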
2.2 Dual representations of the KL-divergence
Key technical ingredients of MINE are dual representations of the KL-divergence. We will primarily work with the Donsker-Varadhan representation (Donsker & Varadhan, 1983), which results in a tighter estimator, but will also consider the dual f-divergence representation (Keziou, 2003; Nguyen et al., 2010; Nowozin et al., 2016).
The Donsker-Varadhan representation.
The following theorem gives a representation of the KL-divergence (Donsker & Varadhan, 1983):
Theorem 1 (Donsker-Varadhan representation).
The KL divergence admits the following dual representation:
$$D_{KL}(\mathbb{P} \,\|\, \mathbb{Q}) = \sup_{T : \Omega \to \mathbb{R}} \mathbb{E}_{\mathbb{P}}[T] - \log\left(\mathbb{E}_{\mathbb{Q}}[e^{T}]\right), \tag{5}$$
where the supremum is taken over all functions $T$ such that the two expectations are finite.
Proof.
See the Supplementary Material.
A straightforward consequence of Theorem 1 is as follows. Let $\mathcal{F}$ be any class of functions $T : \Omega \to \mathbb{R}$ satisfying the integrability constraints of the theorem. We then have the lower bound (Footnote 3: The bound in Eqn. 6 is known as the compression lemma in the PAC-Bayes literature (Banerjee, 2006).):
$$D_{KL}(\mathbb{P} \,\|\, \mathbb{Q}) \ge \sup_{T \in \mathcal{F}} \mathbb{E}_{\mathbb{P}}[T] - \log\left(\mathbb{E}_{\mathbb{Q}}[e^{T}]\right). \tag{6}$$
Note also that the bound is tight for optimal functions $T^*$ that relate the distributions to the Gibbs density as,
$$d\mathbb{P} = \frac{e^{T^*}}{\mathbb{E}_{\mathbb{Q}}[e^{T^*}]} \, d\mathbb{Q}. \tag{7}$$
The f-divergence representation.
It is worthwhile to compare the Donsker-Varadhan representation to the f-divergence representation proposed in Nguyen et al. (2010); Nowozin et al. (2016), which leads to the following bound:
$$D_{KL}(\mathbb{P} \,\|\, \mathbb{Q}) \ge \sup_{T \in \mathcal{F}} \mathbb{E}_{\mathbb{P}}[T] - \mathbb{E}_{\mathbb{Q}}[e^{T-1}]. \tag{8}$$
Although the bounds in Eqns. 6 and 8 are tight for sufficiently large families $\mathcal{F}$, the Donsker-Varadhan bound is stronger in the sense that, for any fixed $T$, the right hand side of Eqn. 6 is larger than the right hand side of Eqn. 8. (Footnote 4: To see this, just apply the identity $x \ge e \log x$ with $x = \mathbb{E}_{\mathbb{Q}}[e^{T}]$.) We refer to the work by Ruderman et al. (2012) for a derivation of both representations in Eqns. 6 and 8 from the unifying perspective of Fenchel duality. In Section 3 we discuss versions of MINE based on these two representations, and numerical comparisons are performed in Section 4.
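The comparison between the two bounds can be checked numerically on a small discrete example. The sketch below evaluates the Donsker-Varadhan objective and the f-divergence objective for a fixed function, and then at their respective optimal functions. The distributions `p`, `q` and the test function `t_arbitrary` are arbitrary illustrative choices.

```python
import math

def kl(p, q):
    """KL divergence in nats between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def dv_objective(p, q, t):
    """Donsker-Varadhan objective E_P[T] - log E_Q[exp(T)] for a fixed T."""
    e_p = sum(pi * ti for pi, ti in zip(p, t))
    e_q = sum(qi * math.exp(ti) for qi, ti in zip(q, t))
    return e_p - math.log(e_q)

def f_objective(p, q, t):
    """f-divergence objective E_P[T] - E_Q[exp(T - 1)] for a fixed T."""
    e_p = sum(pi * ti for pi, ti in zip(p, t))
    e_q = sum(qi * math.exp(ti - 1.0) for qi, ti in zip(q, t))
    return e_p - e_q

# Hypothetical distributions on three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

t_arbitrary = [0.7, -0.2, 0.1]
# log density ratio: optimal for the Donsker-Varadhan objective (up to a constant)
t_dv = [math.log(pi / qi) for pi, qi in zip(p, q)]
# 1 + log density ratio: optimal for the f-divergence objective
t_f = [1.0 + t for t in t_dv]

print(kl(p, q))
print(dv_objective(p, q, t_arbitrary), f_objective(p, q, t_arbitrary))
print(dv_objective(p, q, t_dv), f_objective(p, q, t_f))
```

Both objectives reach the KL divergence at their own optimum, while for the same fixed function the Donsker-Varadhan value is never smaller. The Donsker-Varadhan objective is also unchanged by adding a constant to the function, which the f-divergence objective is not.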
3 The Mutual Information Neural Estimator
In this section we formulate the framework of the Mutual Information Neural Estimator (MINE). We define MINE and present a theoretical analysis of its consistency and convergence properties.
3.1 Method
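Using both Eqn. 3 for the mutual information and the dual representation of the KL-divergence, the idea is to choose $\mathcal{F}$ to be the family of functions $T_\theta : \mathcal{X} \times \mathcal{Z} \to \mathbb{R}$ parametrized by a deep neural network with parameters $\theta \in \Theta$. We call this network the statistics network. We exploit the bound:
$$I(X;Z) \ge I_\Theta(X,Z), \tag{9}$$
where $I_\Theta(X,Z)$ is the neural information measure defined as
$$I_\Theta(X,Z) = \sup_{\theta \in \Theta} \mathbb{E}_{\mathbb{P}_{XZ}}[T_\theta] - \log\left(\mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}[e^{T_\theta}]\right). \tag{10}$$
The expectations in Eqn. 10 are estimated using empirical samples from $\mathbb{P}_{XZ}$ and $\mathbb{P}_X \otimes \mathbb{P}_Z$, or by shuffling the samples from the joint distribution along the batch axis. (Footnote 5: Note that samples $\bar{x} \sim \mathbb{P}_X$ and $\bar{z} \sim \mathbb{P}_Z$ from the marginals are obtained by simply dropping $x, z$ from samples $(\bar{x}, z)$ and $(x, \bar{z}) \sim \mathbb{P}_{XZ}$.) The objective can be maximized by gradient ascent.
It should be noted that Eqn. 10 actually defines a new class of information measures. The expressive power of neural networks ensures that they can approximate the mutual information with arbitrary accuracy.
In what follows, given a distribution $\mathbb{P}$, we denote by $\hat{\mathbb{P}}^{(n)}$ the empirical distribution associated to $n$ i.i.d. samples.
Definition 3.1 (Mutual Information Neural Estimator (MINE)).
Let $\mathcal{F} = \{T_\theta\}_{\theta \in \Theta}$ be the set of functions parametrized by a neural network. MINE is defined as,
$$\widehat{I(X;Z)}_n = \sup_{\theta \in \Theta} \mathbb{E}_{\hat{\mathbb{P}}^{(n)}_{XZ}}[T_\theta] - \log\left(\mathbb{E}_{\hat{\mathbb{P}}^{(n)}_X \otimes \hat{\mathbb{P}}^{(n)}_Z}[e^{T_\theta}]\right). \tag{11}$$
Details on the implementation of MINE are provided in Algorithm 1. An analogous definition and algorithm also hold for the f-divergence formulation in Eqn. 8, which we refer to as MINE-f. Since Eqn. 8 lower-bounds Eqn. 6, it generally leads to a looser estimator of the mutual information, and numerical comparisons of MINE with MINE-f can be found in Section 4. However, in a mini-batch setting, the SGD gradients of MINE are biased. We address this in the next section.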
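The following is a minimal sketch of the estimation procedure described above, under strong simplifying assumptions: a three-parameter quadratic function stands in for the statistics network, the data are a correlated Gaussian pair, and full-batch gradient ascent is used. None of these choices come from the paper, and all names are hypothetical. Samples from the product of the marginals are obtained by shuffling one coordinate of the joint samples.

```python
import math
import random

def features(x, z):
    # Quadratic features; the stand-in statistics function is their weighted sum.
    return (x * z, x * x, z * z)

def statistic(theta, feats):
    return sum(t * f for t, f in zip(theta, feats))

def log_mean_exp(values):
    m = max(values)  # subtract the maximum for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def dv_estimate(theta, joint_feats, marginal_feats):
    """Empirical Donsker-Varadhan objective for a fixed parameter vector."""
    mean_joint = sum(statistic(theta, f) for f in joint_feats) / len(joint_feats)
    return mean_joint - log_mean_exp([statistic(theta, f) for f in marginal_feats])

def fit(joint_feats, marginal_feats, steps=600, lr=0.05):
    """Full-batch gradient ascent on the empirical objective."""
    theta = [0.0, 0.0, 0.0]
    # Gradient of the first term: mean feature vector over the joint samples.
    joint_mean = [sum(col) / len(joint_feats) for col in zip(*joint_feats)]
    for _ in range(steps):
        ts = [statistic(theta, f) for f in marginal_feats]
        m = max(ts)
        weights = [math.exp(t - m) for t in ts]
        total = sum(weights)
        # Gradient of the log term: exp(T)-weighted mean over marginal samples.
        tilted_mean = [
            sum(w * f[k] for w, f in zip(weights, marginal_feats)) / total
            for k in range(3)
        ]
        theta = [t + lr * (a - b) for t, a, b in zip(theta, joint_mean, tilted_mean)]
    return theta

rng = random.Random(0)
rho, n = 0.8, 1000
xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
zs = [rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0) for x in xs]
shuffled_zs = zs[:]
rng.shuffle(shuffled_zs)  # breaks the pairing, giving product-of-marginals samples

joint_feats = [features(x, z) for x, z in zip(xs, zs)]
marginal_feats = [features(x, z) for x, z in zip(xs, shuffled_zs)]

theta = fit(joint_feats, marginal_feats)
estimate = dv_estimate(theta, joint_feats, marginal_feats)
true_mi = -0.5 * math.log(1.0 - rho * rho)  # closed form for this Gaussian pair
print(estimate, true_mi)
```

The quadratic family is chosen here only because it contains the log density ratio of a bivariate Gaussian, so the estimate should be close to the closed-form value up to sampling noise. With a neural network in place of `statistic`, the same objective would be maximized by backpropagation.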
3.2 Correcting the bias from the stochastic gradients
A naive application of stochastic gradient estimation leads to the gradient estimate:
$$\hat{G}_B = \mathbb{E}_B[\nabla_\theta T_\theta] - \frac{\mathbb{E}_B[\nabla_\theta T_\theta \, e^{T_\theta}]}{\mathbb{E}_B[e^{T_\theta}]}, \tag{12}$$
where, in the second term, the expectations are over the samples of a minibatch $B$. This is a biased estimate of the full-batch gradient. (Footnote 6: From the optimization point of view, the f-divergence formulation has the advantage of making the use of SGD with unbiased gradients straightforward.)
Fortunately, the bias can be reduced by replacing the estimate in the denominator by an exponential moving average. For small learning rates, this improved MINE gradient estimator can be made to have arbitrarily small bias.
We found in our experiments that this improves all-around performance of MINE.
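A sketch of the correction described above, for a hypothetical one-parameter statistics function $T_\theta(x, z) = \theta x z$. The naive estimate divides by the minibatch mean of $e^{T_\theta}$, while the corrected estimate divides by an exponential moving average of that mean. The averaging rate and the initialization of the average are assumptions made for illustration.

```python
import math

def batch_terms(theta, joint_batch, marginal_batch):
    """Minibatch quantities for T_theta(x, z) = theta * x * z."""
    pos = sum(x * z for x, z in joint_batch) / len(joint_batch)
    exps = [math.exp(theta * x * z) for x, z in marginal_batch]
    num = sum(x * z * e for (x, z), e in zip(marginal_batch, exps)) / len(exps)
    den = sum(exps) / len(exps)
    return pos, num, den

def naive_gradient(theta, joint_batch, marginal_batch):
    """Gradient estimate with the minibatch mean in the denominator."""
    pos, num, den = batch_terms(theta, joint_batch, marginal_batch)
    return pos - num / den

def corrected_gradient(theta, joint_batch, marginal_batch, ema=None, rate=0.01):
    """Same estimate, with the denominator replaced by a moving average.

    Returns the gradient and the updated moving average, which the caller
    carries over to the next minibatch.
    """
    pos, num, den = batch_terms(theta, joint_batch, marginal_batch)
    ema = den if ema is None else (1.0 - rate) * ema + rate * den
    return pos - num / ema, ema

# Tiny hypothetical minibatches of (x, z) pairs.
joint_batch = [(1.0, 1.0), (-1.0, -1.0)]
marginal_batch = [(1.0, -1.0), (-1.0, 1.0)]

g_naive = naive_gradient(0.5, joint_batch, marginal_batch)
g_first, ema = corrected_gradient(0.5, joint_batch, marginal_batch)
g_next, ema_next = corrected_gradient(0.5, joint_batch, marginal_batch, ema=1.0, rate=0.5)
print(g_naive, g_first, g_next)
```

On the first call the moving average is initialized to the minibatch mean, so the two estimates coincide. On later calls the denominator changes slowly across minibatches instead of fluctuating with each one.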
3.3 Theoretical properties
In this section we analyze the consistency and convergence properties of MINE. All the proofs can be found in the Supplementary Material.
3.3.1 Consistency
MINE relies on a choice of a statistics network and $n$ samples from the data distribution $\mathbb{P}_{XZ}$.
Definition 3.2 (Strong consistency).
The estimator $\widehat{I(X;Z)}_n$ is strongly consistent if for all $\epsilon > 0$, there exists a positive integer $N$ and a choice of statistics network such that:
$$\forall n \ge N, \quad |I(X;Z) - \widehat{I(X;Z)}_n| \le \epsilon, \quad \text{a.e.},$$
where the probability is over a set of samples.
In a nutshell, the question of consistency is divided into two problems: an approximation problem related to the size of the family, $\mathcal{F}$, and an estimation problem related to the use of empirical measures. The first problem is addressed by universal approximation theorems for neural networks (Hornik, 1989). For the second problem, classical consistency theorems for extremum estimators apply (Van de Geer, 2000) under mild conditions on the parameter space.
This leads to the two lemmas below. The first lemma states that the neural information measures $I_\Theta(X,Z)$, defined in Eqn. 10, can approximate the mutual information with arbitrary accuracy:
Lemma 1 (approximation).
Let $\epsilon > 0$. There exists a neural network parametrizing functions $T_\theta$ with parameters $\theta$ in some compact domain $\Theta \subset \mathbb{R}^k$, such that
$$|I(X;Z) - I_\Theta(X,Z)| \le \epsilon.$$
The second lemma states the almost sure convergence of MINE to a neural information measure as the number of samples goes to infinity:
Lemma 2 (estimation).
Let $\epsilon > 0$. Given a family of neural network functions $T_\theta$ with parameters $\theta$ in some bounded domain $\Theta \subset \mathbb{R}^k$, there exists an $N \in \mathbb{N}$, such that
$$\forall n \ge N, \quad |\widehat{I(X;Z)}_n - I_\Theta(X,Z)| \le \epsilon, \quad \text{a.e.} \tag{13}$$
Combining the two lemmas with the triangle inequality, we have,
Theorem 2.
MINE is strongly consistent.
3.3.2 Sample complexity
In this section we discuss the sample complexity of our estimator. Since the focus here is on the empirical estimation problem, we assume that the mutual information is well enough approximated by the neural information measure $I_\Theta(X,Z)$. The theorem below is a refinement of Lemma 2: it gives how many samples we need for an empirical estimation of the neural information measure at a given accuracy and with high confidence.
We make the following assumptions: the functions are bounded (i.e., ) and Lipschitz with respect to the parameters . The domain is bounded, so that for some constant . The theorem below shows a sample complexity of , where is the dimension of the parameter space.
Theorem 3.
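Given any values $\epsilon, \delta$ of the desired accuracy and confidence parameters, we have,
$$\Pr\left(|\widehat{I(X;Z)}_n - I_\Theta(X,Z)| \le \epsilon\right) \ge 1 - \delta, \tag{14}$$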
whenever the number of samples satisfies
(15) 
4 Empirical comparisons
Before diving into applications, we perform some simple empirical evaluation and comparisons of MINE. The objective is to show that MINE is effectively able to estimate mutual information and account for nonlinear dependence.
4.1 Comparing MINE to nonparametric estimation
We compare MINE and MINE-f to the k-NN-based non-parametric estimator found in Kraskov et al. (2004). In our experiment, we consider multivariate Gaussian random variables, $X_a$ and $X_b$, with component-wise correlation, $\mathrm{corr}(X_a^i, X_b^j) = \delta_{ij} \rho$, where $\rho \in (-1, 1)$ and $\delta_{ij}$ is Kronecker’s delta. As the mutual information is invariant to continuous bijective transformations of the considered variables, it is enough to consider standardized Gaussian marginals. We also compare MINE (using the Donsker-Varadhan representation in Eqn. 6) and MINE-f (based on the f-divergence representation in Eqn. 8).
Our results are presented in Fig. 1. We observe that both MINE and Kraskov’s estimation are virtually indistinguishable from the ground truth when estimating the mutual information between bivariate Gaussians. MINE shows marked improvement over Kraskov’s when estimating the mutual information between twenty-dimensional random variables. We also remark that MINE provides a tighter estimate of the mutual information than MINE-f.
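For reference, the ground truth in this Gaussian setting has a closed form. With standardized marginals and correlation $\rho$ between matching components only, the component pairs are mutually independent, so the mutual information is the sum of $d$ bivariate terms, $-\frac{d}{2}\log(1 - \rho^2)$ nats. A short sketch (the function name is an illustrative choice):

```python
import math

def gaussian_mutual_information(rho, dim):
    """Mutual information in nats between two dim-dimensional standardized
    Gaussians whose matching components have correlation rho and whose
    non-matching components are uncorrelated."""
    return -0.5 * dim * math.log(1.0 - rho * rho)

for rho in (0.0, 0.5, 0.9):
    print(rho, gaussian_mutual_information(rho, 1), gaussian_mutual_information(rho, 20))
```

The value grows linearly with the dimension and without bound as $|\rho|$ approaches 1.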
4.2 Capturing nonlinear dependencies
An important property of mutual information between random variables with relationship $Y = f(X) + \sigma \odot \epsilon$, where $f$ is a deterministic nonlinear transformation and $\epsilon$ is random noise, is that it is invariant to the deterministic nonlinear transformation, and should only depend on the amount of noise, $\sigma \odot \epsilon$. This important property, which guarantees the quantification of dependence without bias for the relationship, is called equitability (Kinney & Atwal, 2014). Our results (Fig. 2) show that MINE captures this important property.
5 Applications
In this section, we use MINE to present applications of mutual information and compare to competing methods designed to achieve the same goals. Specifically, by using MINE to maximize the mutual information, we are able to improve mode representation and reconstruction of generative models. Finally, by minimizing mutual information, we are able to effectively implement the information bottleneck in a continuous setting.
5.1 Maximizing mutual information to improve GANs
Mode collapse (Che et al., 2016; Dumoulin et al., 2016; Donahue et al., 2016; Salimans et al., 2016; Metz et al., 2017; Saatchi & Wilson, 2017; Nguyen et al., 2017; Lin et al., 2017; Ghosh et al., 2017) is a common pathology of generative adversarial networks (GANs, Goodfellow et al., 2014), where the generator fails to produce samples with sufficient diversity (i.e., poorly represents some modes).
GANs as formulated in Goodfellow et al. (2014) consist of two components: a discriminator, $D : \mathcal{X} \to [0, 1]$, and a generator, $G : \mathcal{Z} \to \mathcal{X}$, where $\mathcal{X}$ is a domain such as a compact subspace of $\mathbb{R}^n$. Given that $Z \in \mathcal{Z}$ follows some simple prior distribution (e.g., a spherical Gaussian with density $\mathbb{P}_Z$), the goal of the generator is to match its output distribution to a target distribution, $\mathbb{P}_X$ (specified by the data samples). The discriminator and generator are optimized through the value function,
$$\min_G \max_D V(D, G) := \mathbb{E}_{\mathbb{P}_X}[\log D(X)] + \mathbb{E}_{\mathbb{P}_Z}[\log(1 - D(G(Z)))]. \tag{16}$$
A natural approach to diminish mode collapse would be regularizing the generator’s loss with the negentropy of the samples. As the sample entropy is intractable, we propose to use the mutual information as a proxy.
Following Chen et al. (2016), we write the prior as the concatenation of noise and code variables, $Z = [\epsilon, c]$. We propose to palliate mode collapse by maximizing the mutual information between the samples and the code, $I(G([\epsilon, c]); c)$. The generator objective then becomes,
$$\arg\max_G \; \mathbb{E}[\log D(G([\epsilon, c]))] + \beta I(G([\epsilon, c]); c), \tag{17}$$
where $\beta$ weights the mutual information term.
As the samples $G([\epsilon, c])$ are differentiable w.r.t. the parameters of $G$, and the statistics network is a differentiable function, we can maximize the mutual information using backpropagation and gradient ascent by only specifying this additional loss term. Since the mutual information is theoretically unbounded, we use adaptive gradient clipping (see the Supplementary Material) to ensure that the generator receives learning signals similar in magnitude from the discriminator and the statistics network.
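The exact clipping rule is given in the Supplementary Material and is not reproduced in this section. As one plausible reading of the description above, the sketch below rescales the mutual information gradient so that its norm never exceeds that of the discriminator gradient. The rule, the function names and the example vectors are all assumptions made for illustration.

```python
import math

def l2_norm(vector):
    return math.sqrt(sum(c * c for c in vector))

def clip_to_reference(grad_mi, grad_adv):
    """Rescale grad_mi so that its norm is at most the norm of grad_adv.

    Assumed rule: the direction of grad_mi is kept and only its magnitude
    is reduced, and only when it exceeds the adversarial gradient's norm.
    """
    norm_mi = l2_norm(grad_mi)
    norm_adv = l2_norm(grad_adv)
    if norm_mi == 0.0 or norm_mi <= norm_adv:
        return list(grad_mi)
    scale = norm_adv / norm_mi
    return [scale * c for c in grad_mi]

# Hypothetical generator gradients from the two loss terms.
grad_adv = [0.3, -0.4]   # from the discriminator term, norm 0.5
grad_mi = [6.0, 8.0]     # from the statistics network term, norm 10

clipped = clip_to_reference(grad_mi, grad_adv)
combined = [a + m for a, m in zip(grad_adv, clipped)]
print(clipped, combined)
```

With a rule of this kind, an unbounded mutual information term cannot dominate the update, since its contribution is capped at the scale of the adversarial signal.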
Related works on mode-dropping
Methods to address mode dropping in GANs can readily be found in the literature. Salimans et al. (2016) use minibatch discrimination. In the same spirit, Lin et al. (2017) successfully mitigates mode dropping in GANs by modifying the discriminator to make decisions on multiple real or generated samples. Ghosh et al. (2017) uses multiple generators that are encouraged to generate different parts of the target distribution. Nguyen et al. (2017) uses two discriminators to minimize the KL and reverse KL divergences between the target and generated distributions. Che et al. (2016) learns a reconstruction distribution, then teaches the generator to sample from it, the intuition being that the reconstruction distribution is a denoised or smoothed version of the data distribution, and thus easier to learn. Srivastava et al. (2017) minimizes the reconstruction error in the latent space of bidirectional GANs (Dumoulin et al., 2016; Donahue et al., 2016). Metz et al. (2017) includes many steps of the discriminator’s optimization as part of the generator’s objective. While Chen et al. (2016) maximizes the mutual information between the code and the samples, it does so by minimizing a variational upper bound on the conditional entropy (Barber & Agakov, 2003), therefore ignoring the entropy of the samples. Chen et al. (2016) makes no claim about mode-dropping.
Experiments: Spiral, 25-Gaussians datasets
We apply MINE to improve mode coverage when training a generative adversarial network (GAN, Goodfellow et al., 2014). We demonstrate using Eqn. 17 on the spiral and the 25-Gaussians datasets, comparing two models, one with the mutual information weight $\beta = 0$ (which corresponds to the orthodox GAN as in Goodfellow et al. (2014)) and one with a nonzero $\beta$, which corresponds to mutual information maximization.
Our results on the spiral (Fig. 3) and the 25-Gaussians (Fig. 4) experiments both show improved mode coverage over the baseline with no mutual information objective. This supports our hypothesis that maximizing mutual information helps against mode-dropping in this simple setting.
Experiment: Stacked MNIST
Following Che et al. (2016); Metz et al. (2017); Srivastava et al. (2017); Lin et al. (2017), we quantitatively assess MINE’s ability to diminish mode dropping on the stacked MNIST dataset, which is constructed by stacking three randomly sampled MNIST digits. As a consequence, stacked MNIST offers 1000 modes. Using the same architecture and training protocol as in Srivastava et al. (2017); Lin et al. (2017), we train a GAN on the constructed dataset and use a pretrained classifier on 26,000 samples to count the number of modes in the samples, as well as to compute the KL divergence between the sample and expected data distributions. Our results in Table 1 demonstrate the effectiveness of MINE in preventing mode collapse on Stacked MNIST.
Stacked MNIST
Model  Modes (Max 1000)  KL
DCGAN
ALI
Unrolled GAN
VEEGAN
PacGAN
GAN+MINE (Ours)
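The two quantities reported in the table can be computed from classifier predictions as sketched below. The sketch assumes that each generated image has already been classified into three digits, that the expected data distribution is uniform over the 1000 digit triples, and that the divergence is taken from the sample distribution to the data distribution. These assumptions and all names are illustrative rather than taken from the paper.

```python
import math
from collections import Counter

NUM_MODES = 1000  # three stacked digits, 10 * 10 * 10 combinations

def stacked_label(d0, d1, d2):
    """Map three predicted digits to a single mode index in [0, 999]."""
    return 100 * d0 + 10 * d1 + d2

def mode_statistics(predicted_modes, num_modes=NUM_MODES):
    """Return (number of distinct modes, KL(sample || uniform) in nats)."""
    counts = Counter(predicted_modes)
    total = len(predicted_modes)
    uniform = 1.0 / num_modes
    kl = sum(
        (c / total) * math.log((c / total) / uniform) for c in counts.values()
    )
    return len(counts), kl

# Hypothetical predictions: perfect coverage versus complete collapse.
balanced = list(range(NUM_MODES)) * 26          # 26,000 samples, every mode equally often
collapsed = [stacked_label(0, 0, 7)] * 26000    # every sample lands on one mode

print(mode_statistics(balanced))
print(mode_statistics(collapsed))
```

Perfect coverage gives 1000 modes and zero divergence, while complete collapse gives a single mode and a divergence of $\log 1000$ nats, which bracket the range in which any model's result falls.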
5.2 Maximizing mutual information to improve inference in bidirectional adversarial models
Adversarial bidirectional models were introduced in Adversarially Learned Inference (ALI, Dumoulin et al., 2016) and BiGAN (Donahue et al., 2016) and are an extension of GANs which incorporate a reverse model, jointly trained with the generator. These models formulate the problem in terms of the value function in Eqn. 16 between two joint distributions induced by the forward (encoder) and reverse (decoder) models, respectively. (Footnote 7: We switch to density notations for convenience throughout this section.)
One goal of bidirectional models is to do inference as well as to learn a good generative model. Reconstructions are one desirable property of a model that does both inference and generation, but in practice ALI can lack fidelity (i.e., reconstructs less faithfully than desired, see Li et al., 2017; Ulyanov et al., 2017; Belghazi et al., 2018). To demonstrate the connection to mutual information, it can be shown (see the Supplementary Material for details) that the reconstruction error, , is bounded by,
(18) 
If the joint distributions are matched, tends to , which is fixed as long as the prior, , is itself fixed. Subsequently, maximizing the mutual information minimizes the expected reconstruction error.
Assuming that the generator is the same as with GANs in the previous section, the objectives for training a bidirectional adversarial model then become:
(19) 
Related works
Ulyanov et al. (2017) improves reconstruction quality by forgoing the discriminator and expressing the adversarial game between the encoder and decoder.
Kumar et al. (2017) augments the bidirectional objective by considering the reconstruction and the corresponding encodings as an additional fake pair.
Belghazi et al. (2018) shows that a Markovian hierarchical generator in a bidirectional adversarial model provides a hierarchy of reconstructions with increasing levels of fidelity (increasing reconstruction quality).
Li et al. (2017) shows that the expected reconstruction error can be diminished by minimizing the conditional entropy of the observables given the latent representations.
The conditional entropy being intractable for a general posterior, Li et al. (2017) proposes to augment the generator’s loss with an adversarial cycle consistency loss (Zhu et al., 2017) between the observables and their reconstructions.
Experiment: ALI+MINE
In this section we compare MINE to existing bidirectional adversarial models. As the decoder’s density is generally intractable, we use three different metrics to measure the fidelity of the reconstructions with respect to the samples: the Euclidean reconstruction error; reconstruction accuracy, which is the proportion of labels preserved by the reconstruction as identified by a pretrained classifier; and the multi-scale structural similarity metric (MS-SSIM, Wang et al., 2004) between the observables and their reconstructions.
We train MINE on datasets of increasing order of complexity: a toy dataset composed of 25-Gaussians, MNIST (LeCun, 1998), and the CelebA dataset (Liu et al., 2015). Fig. 6 shows the reconstruction ability of MINE compared to ALI. Although ALICE does perfect reconstruction (which is in its explicit formulation), we observe significant mode-dropping in the sample space. MINE does a balanced job of reconstructing along with capturing all the modes of the underlying data distribution.
Next, we measure the fidelity of the reconstructions over ALI, ALICE, and MINE. Tbl. 2 compares MINE to the existing baselines in terms of Euclidean reconstruction errors, reconstruction accuracy, and MS-SSIM. On MNIST, MINE outperforms ALI in terms of reconstruction errors by a good margin and is competitive with ALICE with respect to reconstruction accuracy and MS-SSIM. Our results show that MINE’s effect on reconstructions is even more dramatic when compared to ALI and ALICE on the CelebA dataset.
Model  Recons. Error  Recons. Acc.(%)  MS-SSIM 
MNIST  
ALI  14.24  45.95  0.97 
ALICE()  3.20  99.03  0.97 
ALICE(Adv.)  5.20  98.17  0.98 
MINE  9.73  96.10  0.99 
CelebA  
ALI  53.75  57.49  0.81 
ALICE()  8.01  32.22  0.93 
ALICE(Adv.)  92.56  48.95  0.51 
MINE  36.11  76.08  0.99 
5.3 Information Bottleneck
The Information Bottleneck (IB, Tishby et al., 2000) is an information theoretic method for extracting relevant information, or yielding a representation, that an input $X \in \mathcal{X}$ contains about an output $Y \in \mathcal{Y}$. An optimal representation of $X$ would capture the relevant factors and compress $X$ by diminishing the irrelevant parts which do not contribute to the prediction of $Y$. IB was recently covered in the context of deep learning (Tishby & Zaslavsky, 2015), and as such can be seen as a process to construct an approximation of the minimal sufficient statistics of the data. IB seeks an encoder, $q(Z \mid X)$, that induces the Markovian structure $X \to Z \to Y$. This is done by minimizing the IB Lagrangian,
$$\mathcal{L}[q(Z \mid X)] = H(Y \mid Z) + \beta I(X;Z), \tag{20}$$
which appears as a standard cross-entropy loss augmented with a regularizer promoting minimality of the representation (Achille & Soatto, 2017). Here we propose to estimate the regularizer with MINE.
Related works
In the discrete setting, Tishby et al. (2000) uses the Blahut-Arimoto algorithm (Arimoto, 1972), which can be understood as cyclical coordinate ascent in function spaces. While IB is successful and popular in a discrete setting, its application to the continuous setting was stifled by the intractability of the continuous mutual information. Nonetheless, IB was applied in the case of jointly Gaussian random variables in Chechik et al. (2005).
In order to overcome the intractability of $I(X;Z)$ in the continuous setting, Alemi et al. (2016); Kolchinsky et al. (2017); Chalk et al. (2016) exploit the variational bound of Barber & Agakov (2003) to approximate the conditional entropy in $I(X;Z)$. These approaches differ only in their treatment of the marginal distribution of the bottleneck variable: Alemi et al. (2016) assumes a standard multivariate normal marginal distribution, Chalk et al. (2016) uses a Student-t distribution, and Kolchinsky et al. (2017) uses non-parametric estimators. Due to their reliance on a variational approximation, these methods require a tractable density for the approximate posterior, while MINE does not.
Experiment: Permutation-invariant MNIST classification
Here, we demonstrate an implementation of the IB objective on permutation-invariant MNIST using MINE. We compare to the Deep Variational Bottleneck (DVB, Alemi et al., 2016) and use the same empirical setup. As the DVB relies on a variational bound on the conditional entropy, it requires a tractable density, and Alemi et al. (2016) opts for a conditional Gaussian encoder. As MINE does not require a tractable density, we consider three types of encoders: a Gaussian encoder as in Alemi et al. (2016); an additive noise encoder; and a propagated noise encoder. Our results can be seen in Tbl. 3, where the MINE variants achieve lower misclassification rates than the baselines, with the additive noise encoder performing best.
Model  Misclass. rate(%) 
Baseline  1.38% 
Dropout  1.34% 
Confidence penalty  1.36% 
Label Smoothing  1.40% 
DVB  1.13% 
DVB + Additive noise  1.06% 
MINE(Gaussian) (ours)  1.11% 
MINE(Propagated) (ours)  1.10% 
MINE(Additive) (ours)  1.01% 
6 Conclusion
We proposed a mutual information estimator, which we called the mutual information neural estimator (MINE), that is scalable in dimension and sample size. We demonstrated the efficiency of this estimator by applying it in a number of settings. First, a mutual information term can be introduced to alleviate the mode-dropping issue in generative adversarial networks (GANs, Goodfellow et al., 2014). Mutual information can also be used to improve inference and reconstructions in adversarially learned inference (ALI, Dumoulin et al., 2016). Finally, we showed that our estimator allows for tractable application of information bottleneck methods (Tishby et al., 2000) in a continuous setting.
7 Acknowledgements
We would like to thank Martin Arjovsky, Caglar Gulcehre, Marcin Moczulski, Negar Rostamzadeh, Thomas Boquet, Ioannis Mitliagkas, Pedro Oliveira Pinheiro for helpful comments, as well as Samsung and IVADO for their support.
References
 Achille & Soatto (2017) Achille, A. and Soatto, S. Emergence of invariance and disentanglement in deep representations. arXiv preprint 1706.01350v2[cs.LG], 2017.
 Alemi et al. (2016) Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
 Arimoto (1972) Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Transactions on Information Theory, 18(1):14–20, 1972.
 Banerjee (2006) Banerjee, A. On bayesian bounds. ICML, pp. 81–88, 2006.
 Barber & Agakov (2003) Barber, D. and Agakov, F. The im algorithm: a variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pp. 201–208. MIT Press, 2003.
 Belghazi et al. (2018) Belghazi, M. I., Rajeswar, S., Mastropietro, O., Mitrovic, J., Rostamzadeh, N., and Courville, A. Hierarchical adversarially learned inference. arXiv preprint arXiv:1802.01071, 2018.
 Butte & Kohane (2000) Butte, A. J. and Kohane, I. S. Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In Pac Symp Biocomput, volume 5, pp. 26, 2000.
 Chalk et al. (2016) Chalk, M., Marre, O., and Tkacik, G. Relevant sparse codes with variational information bottleneck. In Advances in Neural Information Processing Systems, pp. 1957–1965, 2016.
 Che et al. (2016) Che, T., Li, Y., Jacob, A. P., Bengio, Y., and Li, W. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
 Chechik et al. (2005) Chechik, G., Globerson, A., Tishby, N., and Weiss, Y. Information bottleneck for gaussian variables. Journal of Machine Learning Research, 6(Jan):165–188, 2005.
 Chen et al. (2016) Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.
 Clevert et al. (2015) Clevert, D., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (elus). CoRR, abs/1511.07289, 2015.
 Darbellay & Vajda (1999) Darbellay, G. A. and Vajda, I. Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory, 45(4):1315–1321, 1999.
 Donahue et al. (2016) Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
 Donsker & Varadhan (1983) Donsker, M. and Varadhan, S. Asymptotic evaluation of certain markov process expectations for large time, iv. Communications on Pure and Applied Mathematics, 36(2):183–212, 1983.
 Dumoulin et al. (2016) Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., and Courville, A. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 Fraser & Swinney (1986) Fraser, A. M. and Swinney, H. L. Independent coordinates for strange attractors from mutual information. Physical review A, 33(2):1134, 1986.
 Gao et al. (2014) Gao, S., Ver Steeg, G., and Galstyan, A. Efficient estimation of mutual information for strongly dependent variables. Arxiv preprint arXiv:1411.2003[cs.IT], 2014.
 Ghosh et al. (2017) Ghosh, A., Kulharia, V., Namboodiri, V., Torr, P. H., and Dokania, P. K. Multi-agent diverse generative adversarial networks. arXiv preprint arXiv:1704.02906, 2017.
 Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680, 2014.
 Györfi & van der Meulen (1987) Györfi, L. and van der Meulen, E. C. Density-free convergence properties of various estimators of entropy. Computational Statistics and Data Analysis, 5:425–436, 1987.
 Hornik (1989) Hornik, K. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.
 Hyvärinen et al. (2004) Hyvärinen, A., Karhunen, J., and Oja, E. Independent component analysis, volume 46. John Wiley & Sons, 2004.
 Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167.
 Kandasamy et al. (2017) Kandasamy, K., Krishnamurthy, A., Poczos, B., Wasserman, L., and Robins, J. Nonparametric von mises estimators for entropies, divergences and mutual informations. NIPS, 2017.
 Keziou (2003) Keziou, A. Dual representation of φ-divergences and applications. 336:857–862, 05 2003.
 Kingma & Ba (2014) Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL http://arxiv.org/abs/1412.6980.
 Kinney & Atwal (2014) Kinney, J. B. and Atwal, G. S. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 111(9):3354–3359, 2014.
 Kolchinsky et al. (2017) Kolchinsky, A., Tracey, B. D., and Wolpert, D. H. Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436, 2017.
 Kraskov et al. (2004) Kraskov, A., Stögbauer, H., and Grassberger, P. Estimating mutual information. Physical review E, 69(6):066138, 2004.
 Kullback (1997) Kullback, S. Information theory and statistics. Courier Corporation, 1997.
 Kumar et al. (2017) Kumar, A., Sattigeri, P., and Fletcher, P. T. Improved semi-supervised learning with gans using manifold invariances. arXiv preprint arXiv:1705.08850, 2017.
 Kwak & Choi (2002) Kwak, N. and Choi, C.-H. Input feature selection by mutual information based on parzen window. IEEE transactions on pattern analysis and machine intelligence, 24(12):1667–1671, 2002.

 LeCun (1998) LeCun, Y. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
 Li et al. (2017) Li, C., Liu, H., Chen, C., Pu, Y., Chen, L., Henao, R., and Carin, L. Towards understanding adversarial learning for joint distribution matching. arXiv preprint arXiv:1709.01215, 2017.
 Lin et al. (2017) Lin, Z., Khetan, A., Fanti, G., and Oh, S. Pacgan: The power of two samples in generative adversarial networks. arXiv preprint arXiv:1712.04086, 2017.

 Liu et al. (2015) Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.
 Maes et al. (1997) Maes, F., Collignon, A., Vandermeulen, D., Marchal, G., and Suetens, P. Multimodality image registration by maximization of mutual information. IEEE transactions on Medical Imaging, 16(2):187–198, 1997.
 Metz et al. (2017) Metz, L., Poole, B., Pfau, D., and Sohl-Dickstein, J. Unrolled generative adversarial networks. 2017. URL https://openreview.net/pdf?id=BydrOIcle.
 Moon et al. (2017) Moon, K., Sricharan, K., and Hero III, A. O. Ensemble estimation of mutual information. arXiv preprint arXiv:1701.08083, 2017.
 Moon et al. (1995) Moon, Y.-I., Rajagopalan, B., and Lall, U. Estimation of mutual information using kernel density estimators. Physical Review E, 52(3):2318, 1995.
 Nguyen et al. (2017) Nguyen, T., Le, T., Vu, H., and Phung, D. Dual discriminator generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2667–2677, 2017.
 Nguyen et al. (2010) Nguyen, X., Wainwright, M. J., and Jordan, M. I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
 Nowozin et al. (2016) Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.
 Paninski (2003) Paninski, L. Estimation of entropy and mutual information. Neural computation, 15(6):1191–1253, 2003.
 Peng et al. (2005) Peng, H., Long, F., and Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on pattern analysis and machine intelligence, 27(8):1226–1238, 2005.
 Pereyra et al. (2017) Pereyra, G., Tucker, G., Chorowski, J., Kaiser, Ł., and Hinton, G. Regularizing neural networks by penalizing confident output distributions. ICLR Workshop, 2017.
 Radford et al. (2015) Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 Ruderman et al. (2012) Ruderman, A., Reid, M., García-García, D., and Petterson, J. Tighter variational representations of f-divergences via restriction to probability measures. arXiv preprint arXiv:1206.4664, 2012.
 Saatchi & Wilson (2017) Saatchi, Y. and Wilson, A. G. Bayesian gan. In Advances in Neural Information Processing Systems, pp. 3625–3634, 2017.
 Salimans et al. (2016) Salimans, T., Goodfellow, I. J., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. arXiv preprint arXiv:1606.03498, 2016.
 Shalev-Schwartz & Ben-David (2014) Shalev-Schwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge university press, 2014.
 Singh & Póczos (2016) Singh, S. and Póczos, B. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. arXiv preprint arXiv:1606.01554, 2016.
 Srivastava et al. (2017) Srivastava, A., Valkov, L., Russell, C., Gutmann, M., and Sutton, C. Veegan: Reducing mode collapse in gans using implicit variational learning. arXiv preprint arXiv:1705.07761, 2017.
 Suzuki et al. (2008) Suzuki, T., Sugiyama, M., Sese, J., and Kanamori, T. Approximating mutual information by maximum likelihood density ratio estimation. In New challenges for feature selection in data mining and knowledge discovery, pp. 5–20, 2008.
 Tishby & Zaslavsky (2015) Tishby, N. and Zaslavsky, N. Deep learning and the information bottleneck principle. In Information Theory Workshop (ITW), 2015 IEEE, pp. 1–5. IEEE, 2015.
 Tishby et al. (2000) Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. arXiv preprint physics/0004057, 2000.
 Ulyanov et al. (2017) Ulyanov, D., Vedaldi, A., and Lempitsky, V. Adversarial generator-encoder networks. arXiv preprint arXiv:1704.02304, 2017.
 Van de Geer (2000) Van de Geer, S. Empirical Processes in M-estimation. Cambridge University Press, 2000.
 Van Hulle (2005) Van Hulle, M. M. Edgeworth approximation of multivariate differential entropy. Neural computation, 17(9):1903–1910, 2005.
 Wang et al. (2004) Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13:600–612, 2004.
 Zhu et al. (2017) Zhu, J.-Y., Park, T., Isola, P., and Efros, A. A. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.
8 Appendix
In this Appendix, we provide additional experiment details and spell out the proofs omitted in the text.
8.1 Experimental Details
8.1.1 Adaptive Clipping
Here we assume we are in the context of GANs described in Sections 5.1 and 5.2, where the mutual information shows up as a regularizer in the generator objective.
Notice that the generator is updated by two gradients. The first is the gradient of the generator's loss $\mathcal{L}_g$ with respect to the generator's parameters $\theta$, $g_u := \partial \mathcal{L}_g / \partial \theta$. The second flows from the mutual information estimate to the generator, $g_m := -\beta\, \partial I(X;Z) / \partial \theta$. If left unchecked, because mutual information is unbounded, the latter can overwhelm the former, leading to a failure mode in which the generator puts all of its attention on maximizing the mutual information and ignores the adversarial game with the discriminator. We propose to adaptively clip the gradient from the mutual information so that its Frobenius norm is at most that of the gradient from the discriminator. Defining $g_a$ to be the adapted gradient flowing from the statistics network to the generator, we have
$$g_a = \min\left(\|g_u\|, \|g_m\|\right) \frac{g_m}{\|g_m\|}. \tag{21}$$
Note that adaptive clipping can be considered in any situation where MINE is to be maximized.
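As an illustration of the clipping rule described above, the following minimal sketch rescales the mutual-information gradient so that its norm never exceeds that of the adversarial gradient. Gradients are represented as flat lists of floats, and the function names and the small `eps` safeguard are assumptions made for this example rather than part of the original method.

```python
import math

def frobenius_norm(grad):
    """Euclidean (Frobenius) norm of a gradient stored as a flat list."""
    return math.sqrt(sum(g * g for g in grad))

def adaptive_clip(g_u, g_m, eps=1e-12):
    """Rescale g_m so that its norm is at most the norm of g_u.

    g_u: gradient of the generator's adversarial loss.
    g_m: gradient flowing from the mutual information estimate.
    eps: hypothetical safeguard against division by zero.
    """
    norm_u = frobenius_norm(g_u)
    norm_m = frobenius_norm(g_m)
    scale = min(norm_u, norm_m) / (norm_m + eps)
    return [scale * g for g in g_m]

# Example: the mutual information gradient is 20 times larger than g_u.
g_u = [0.3, -0.4]   # norm 0.5
g_m = [6.0, 8.0]    # norm 10.0
g_a = adaptive_clip(g_u, g_m)

# The generator update then combines both contributions.
total_grad = [u + a for u, a in zip(g_u, g_a)]
```

When the mutual-information gradient is already the smaller of the two, the scale factor is approximately one and the gradient passes through essentially unchanged, so the rule only intervenes in the failure mode described above.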
8.1.2 GAN+MINE: Spiral and 25-Gaussians
In this section we give the details of the mode-dropping experiments on the spiral and 25-Gaussians datasets. For both datasets we use 100,000 examples sampled from the target distributions, using a standard deviation of
in the case of 25-Gaussians, and using additive noise for the spiral. The generator for the GAN consists of two fully connected layers with units in each layer, with batch normalization (Ioffe & Szegedy, 2015) and LeakyReLU as the activation function, as in Dumoulin et al. (2016). The discriminator and statistics networks have three fully connected layers with units each. We use the Adam (Kingma & Ba, 2014) optimizer with a learning rate of . Both the GAN baseline and GAN+MINE were trained for iterations with a mini-batch size of .
8.1.3 GAN+MINE: Stacked-MNIST
Here we describe the experimental setup and architectural details of the Stacked-MNIST task with GAN+MINE. We use exactly the experimental setup followed and reported in PacGAN (Lin et al., 2017) and VEEGAN (Srivastava et al., 2017). We use a pre-trained classifier to classify generated samples on each of the three stacked channels. Evaluation is done on 26,000 test samples, as in the baselines. We train GAN+MINE for 50 epochs on
samples. Details of the generator and discriminator networks are given below in Table 4 and Table 5. The statistics network has the same architecture as the discriminator in DCGAN, with ELU (Clevert et al., 2015) as the activation function for the individual layers and without batch normalization, as highlighted in Table 6. In order to condition the statistics network on the latent variable, we use linear MLPs at each layer, whose outputs are reshaped to the number of feature maps. The output of the linear MLPs is then added as a dynamic bias.
Table 4: Generator
Layer | Number of outputs | Kernel size | Stride | Activation function
Input | 100 | | |
Fully-connected | 2*2*512 | | | ReLU
Transposed convolution | 4*4*256 | | 2 | ReLU
Transposed convolution | 7*7*128 | | 2 | ReLU
Transposed convolution | 14*14*64 | | 2 | ReLU
Transposed convolution | 28*28*3 | | 2 | Tanh
Table 5: Discriminator
Layer | Number of outputs | Kernel size | Stride | Activation function
Input | | | |
Convolution | 14*14*64 | | 2 | ReLU
Convolution | 7*7*128 | | 2 | ReLU
Convolution | 4*4*256 | | 2 | ReLU
Convolution | 2*2*512 | | 2 | ReLU
Fully-connected | 1 | 1 | Valid | Sigmoid
Table 6: Statistics Network
Layer | Number of outputs | Kernel size | Stride | Activation function
Input | | | |
Convolution | 14*14*16 | | 2 | ELU
Convolution | 7*7*32 | | 2 | ELU
Convolution | 4*4*64 | | 2 | ELU
Flatten | | | |
Fully-connected | 1024 | 1 | Valid | None
Fully-connected | 1 | 1 | Valid | None
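The spatial sizes listed in the convolutional tables can be checked with a short calculation. The sketch below assumes "same" padding, which the tables do not state explicitly, so that a convolution with stride s maps a side length n to ceil(n / s); the helper names are hypothetical.

```python
import math

def same_padding_out(size, stride):
    """Output side length of a strided convolution, assuming 'same' padding."""
    return math.ceil(size / stride)

def trace_sizes(size, n_layers, stride=2):
    """Side lengths after each of n_layers strided convolutions."""
    sizes = [size]
    for _ in range(n_layers):
        sizes.append(same_padding_out(sizes[-1], stride))
    return sizes

# Four stride-2 convolutions on a 28x28 input, as in the discriminator.
discriminator_sizes = trace_sizes(28, 4)
# Three stride-2 convolutions, as in the statistics network.
statistics_sizes = trace_sizes(28, 3)
print(discriminator_sizes)  # [28, 14, 7, 4, 2]
print(statistics_sizes)     # [28, 14, 7, 4]
```

Under this assumption the sequence 28, 14, 7, 4, 2 matches the side lengths given for the Stacked-MNIST discriminator, and the first four values match those of the statistics network.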
8.1.4 ALI+MINE: MNIST and CelebA
In this section we give the details of the experimental setup and the network architectures used for the task of improving reconstructions and representations in bidirectional adversarial models with MINE. The generator and discriminator network architectures, along with the hyperparameter setup used in these tasks, are similar to those used in DCGAN (Radford et al., 2015).
The statistics network was conditioned on the latent code as in the Stacked-MNIST experiments. We used Adam as the optimizer with a learning rate of 0.0001. We trained the model for a total of iterations on CelebA and iterations on MNIST, both with a mini-batch size of .
Table 7: Encoder (MNIST)
Layer | Number of outputs | Kernel size | Stride | Activation function
Input | 28*28*129 | | |
Convolution | 14*14*64 | | 2 | ReLU
Convolution | 7*7*128 | | 2 | ReLU
Convolution | 4*4*256 | | 2 | ReLU
Convolution | 256 | | Valid | ReLU
Fully-connected | 128 | | | None
Table 8: Decoder (MNIST)
Layer | Number of outputs | Kernel size | Stride | Activation function
Input | 128 | | |
Fully-connected | 4*4*256 | | | ReLU
Transposed convolution | 7*7*128 | | 2 | ReLU
Transposed convolution | 14*14*64 | | 2 | ReLU
Transposed convolution | 28*28*1 | | 2 | Tanh
Table 9: Discriminator (MNIST)
Layer | Number of outputs | Kernel size | Stride | Activation function
Input | | | |
Convolution | 14*14*64 | | 2 | LeakyReLU
Convolution | 7*7*128 | | 2 | LeakyReLU
Convolution | 4*4*256 | | 2 | LeakyReLU
Flatten | | | |
Concatenate | | | |
Fully-connected | 1024 | | | LeakyReLU
Fully-connected | 1 | | | Sigmoid
Table 10: Statistics Network (MNIST)
Layer | Number of outputs | Kernel size | Stride | Activation function
Input | | | |
Convolution | 14*14*64 | | 2 | LeakyReLU
Convolution | 7*7*128 | | 2 | LeakyReLU
Convolution | 4*4*256 | | 2 | LeakyReLU
Flatten | | | |
Fully-connected | 1 | | | None
Table 11: Encoder (CelebA)
Layer | Number of outputs | Kernel size | Stride | Activation function
Input | 64*64*259 | | |
Convolution | 32*32*64 | | 2 | ReLU
Convolution | 16*16*128 | | 2 | ReLU
Convolution | 8*8*256 | | 2 | ReLU
Convolution | 4*4*512 | | 2 | ReLU
Convolution | 512 | | Valid | ReLU
Fully-connected | 256 | | | None
Table 12: Decoder (CelebA)
Layer | Number of outputs | Kernel size | Stride | Activation function
Input | 256 | | |
Fully-connected | 4*4*512 | | | ReLU
Transposed convolution | 8*8*256 | | 2 | ReLU
Transposed convolution | 16*16*128 | | 2 | ReLU
Transposed convolution | 32*32*64 | | 2 | ReLU
Transposed convolution | 64*64*3 | | 2 | Tanh
Table 13: Discriminator (CelebA)
Layer | Number of outputs | Kernel size | Stride | Activation function
Input | | | |
Convolution | 32*32*64 | | 2 | LeakyReLU
Convolution | 16*16*128 | | 2 | LeakyReLU
Convolution | 8*8*256 | | 2 | LeakyReLU
Convolution | 4*4*512 | | 2 | LeakyReLU
Flatten | | | |
Concatenate | | | |
Fully-connected | 1024 | | | LeakyReLU
Fully-connected | 1 | | | Sigmoid
Table 14: Statistics Network (CelebA)
Layer | Number of outputs | Kernel size | Stride | Activation function
Input | | | |
Convolution | 32*32*16 | | 2 | ELU
Convolution | 16*16*32 | | 2 | ELU
Convolution | 8*8*64 | | 2 | ELU
Convolution | 4*4*128 | | 2 | ELU
Flatten | | | |
Fully-connected | 1 | | | None
8.1.5 Information bottleneck with MINE
In this section we outline the network details and hyperparameters used for the information bottleneck task using MINE. To keep the comparison fair, all hyperparameters and architectures are those outlined in Alemi et al. (2016). The statistics network, a two-layer MLP with additive noise at each layer and 512 ELU (Clevert et al., 2015) activations, is outlined in Table 15.
Table 15: Statistics Network
Layer | Number of outputs | Activation function
Input | |
Gaussian noise (std=0.3) | |
Dense layer | 512 | ELU
Gaussian noise (std=0.5) | |
Dense layer | 512 | ELU
Gaussian noise (std=0.5) | |
Dense layer | 1 | None
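To make the layer ordering of this statistics network concrete, here is a minimal pure-Python sketch of its forward pass: Gaussian noise is added before each dense layer, and ELU is applied after the two hidden layers. The input width of 8, the weight initialization, and all function names are hypothetical choices for illustration; only the noise levels and layer widths come from the table.

```python
import math
import random

def elu(x, alpha=1.0):
    """Exponential linear unit."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def dense(x, weights, bias):
    """Affine map; weights holds one row of coefficients per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def add_noise(x, std, rng):
    """Additive Gaussian noise with the given standard deviation."""
    return [xi + rng.gauss(0.0, std) for xi in x]

def init_layer(n_in, n_out, rng):
    """Hypothetical uniform initialization, used only for this sketch."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def statistics_network(x, layers, rng, stds=(0.3, 0.5, 0.5)):
    """Noise -> dense -> ELU, twice, then noise -> dense with no activation."""
    h = x
    for i, (weights, bias) in enumerate(layers):
        h = add_noise(h, stds[i], rng)
        h = dense(h, weights, bias)
        if i < len(layers) - 1:
            h = [elu(v) for v in h]
    return h[0]

rng = random.Random(0)
dims = [8, 512, 512, 1]  # input width 8 is an assumption
layers = [init_layer(a, b, rng) for a, b in zip(dims[:-1], dims[1:])]
score = statistics_network([0.1] * 8, layers, rng)
```

In practice the input would be the concatenation of the two variables whose mutual information is being estimated, and the scalar output would play the role of the statistic fed to the estimator.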
8.2 Proofs
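8.2.1 Donsker-Varadhan Representation
Theorem 4 (Theorem 1 restated).
The KL divergence admits the following dual representation:
$$D_{KL}(\mathbb{P} \,\|\, \mathbb{Q}) = \sup_{T:\Omega \to \mathbb{R}} \mathbb{E}_{\mathbb{P}}[T] - \log\left(\mathbb{E}_{\mathbb{Q}}[e^{T}]\right), \tag{22}$$
where the supremum is taken over all functions $T$ such that the two expectations are finite.
Proof.
A simple proof goes as follows. For a given function $T$, consider the Gibbs distribution $\mathbb{G}$ defined by $d\mathbb{G} = \frac{1}{Z} e^{T} d\mathbb{Q}$, where $Z = \mathbb{E}_{\mathbb{Q}}[e^{T}]$. By construction,
$$\mathbb{E}_{\mathbb{P}}[T] - \log Z = \mathbb{E}_{\mathbb{P}}\left[\log \frac{d\mathbb{G}}{d\mathbb{Q}}\right]. \tag{23}$$
Let $\Delta$ be the gap,
$$\Delta := D_{KL}(\mathbb{P} \,\|\, \mathbb{Q}) - \left(\mathbb{E}_{\mathbb{P}}[T] - \log\left(\mathbb{E}_{\mathbb{Q}}[e^{T}]\right)\right). \tag{24}$$
Using Eqn. 23, we can write $\Delta$ as a KL divergence:
$$\Delta = \mathbb{E}_{\mathbb{P}}\left[\log \frac{d\mathbb{P}}{d\mathbb{Q}} - \log \frac{d\mathbb{G}}{d\mathbb{Q}}\right] = \mathbb{E}_{\mathbb{P}}\left[\log \frac{d\mathbb{P}}{d\mathbb{G}}\right] = D_{KL}(\mathbb{P} \,\|\, \mathbb{G}). \tag{25}$$
The positivity of the KL divergence gives $\Delta \ge 0$. We have thus shown that for any $T$,
$$D_{KL}(\mathbb{P} \,\|\, \mathbb{Q}) \ge \mathbb{E}_{\mathbb{P}}[T] - \log\left(\mathbb{E}_{\mathbb{Q}}[e^{T}]\right), \tag{26}$$
and the inequality is preserved upon taking the supremum over the right-hand side. Finally, the identity (25) also shows that this bound is tight whenever $\mathbb{G} = \mathbb{P}$, namely for optimal functions $T^*$ taking the form $T^* = \log \frac{d\mathbb{P}}{d\mathbb{Q}} + C$ for some constant $C \in \mathbb{R}$. ∎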
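The two claims of the proof, that every function gives a lower bound on the KL divergence and that the bound is attained by the log density ratio up to an additive constant, can be checked numerically on a small discrete example. The distributions, the shift of 1.7, and the function names below are arbitrary choices made for this sketch.

```python
import math
import random

def kl_divergence(p, q):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def dv_bound(p, q, t):
    """Donsker-Varadhan objective E_P[T] - log E_Q[exp(T)] for a table t."""
    e_p = sum(pi * ti for pi, ti in zip(p, t))
    e_q_exp = sum(qi * math.exp(ti) for qi, ti in zip(q, t))
    return e_p - math.log(e_q_exp)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
kl = kl_divergence(p, q)

# Optimal function: log density ratio plus an arbitrary constant.
t_star = [math.log(pi / qi) + 1.7 for pi, qi in zip(p, q)]
tight = dv_bound(p, q, t_star)

# Arbitrary functions never exceed the KL divergence.
rng = random.Random(0)
random_bounds = [dv_bound(p, q, [rng.uniform(-3.0, 3.0) for _ in p])
                 for _ in range(1000)]
largest_random = max(random_bounds)
```

Here `tight` coincides with `kl` up to floating-point error regardless of the constant shift, while `largest_random` stays below `kl`, in line with the inequality (26).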
8.2.2 Consistency Proofs
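This section presents the proofs of the lemmas and of the consistency theorem stated in Section 3.3.1.
In what follows, we assume that the input space $\Omega = \mathcal{X} \times \mathcal{Z}$ is a compact domain of $\mathbb{R}^d$, and all measures are absolutely continuous with respect to the Lebesgue measure. We will restrict to families of feedforward functions with continuous activations, with a single output neuron, so that a given architecture defines a continuous mapping $(\omega, \theta) \mapsto T_\theta(\omega)$ from $\Omega \times \Theta$ to $\mathbb{R}$.
To avoid unnecessarily heavy notation, we denote $\mathbb{P} = \mathbb{P}_{XZ}$ and $\mathbb{Q} = \mathbb{P}_X \otimes \mathbb{P}_Z$ for the joint distribution and product of marginals, and $\mathbb{P}_n$, $\mathbb{Q}_n$ for their empirical versions. We will use the notation $\hat{I}(T)$ for the quantity:
$$\hat{I}(T) = \mathbb{E}_{\mathbb{P}}[T] - \log\left(\mathbb{E}_{\mathbb{Q}}[e^{T}]\right), \tag{27}$$
so that $I_\Theta(X,Z) = \sup_{\theta \in \Theta} \hat{I}(T_\theta)$.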
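For intuition, the empirical counterpart of this quantity can be evaluated from samples for any fixed function. The sketch below uses correlated Gaussians, for which the mutual information and the optimal log density ratio are known in closed form. Drawing product-of-marginals samples by shuffling one coordinate is a common practical choice and an assumption of this example, as are all function names.

```python
import math
import random

def dv_value(t, joint_pairs, marginal_pairs):
    """Empirical E_P[T] - log E_Q[exp(T)] for a fixed function t(x, z)."""
    joint_term = sum(t(x, z) for x, z in joint_pairs) / len(joint_pairs)
    exp_mean = (sum(math.exp(t(x, z)) for x, z in marginal_pairs)
                / len(marginal_pairs))
    return joint_term - math.log(exp_mean)

def sample_correlated_gaussians(n, rho, rng):
    """Pairs (x, z) of standard normals with correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        pairs.append((x, z))
    return pairs

rho = 0.5
rng = random.Random(0)
joint = sample_correlated_gaussians(20000, rho, rng)

# Approximate samples from the product of marginals by shuffling z.
zs = [z for _, z in joint]
rng.shuffle(zs)
marginal = [(x, z) for (x, _), z in zip(joint, zs)]

def t_star(x, z):
    """Log density ratio of the joint to the product of marginals."""
    quad = rho ** 2 * (x * x + z * z) - 2.0 * rho * x * z
    return -0.5 * math.log(1.0 - rho ** 2) - quad / (2.0 * (1.0 - rho ** 2))

true_mi = -0.5 * math.log(1.0 - rho ** 2)
estimate = dv_value(t_star, joint, marginal)
null_estimate = dv_value(lambda x, z: 0.0, joint, marginal)
```

With the optimal function the estimate lands close to the true mutual information, whereas the constant function yields zero, illustrating why the supremum over a rich enough family matters.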
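Lemma 3 (Lemma 1 restated).
Let $\eta > 0$. There exists a family of neural network functions $T_\theta$ with parameters $\theta$ in some compact domain $\Theta \subset \mathbb{R}^k$, such that
$$|I(X,Z) - I_\Theta(X,Z)| \le \eta, \tag{28}$$
where
$$I_\Theta(X,Z) = \sup_{\theta \in \Theta} \mathbb{E}_{\mathbb{P}_{XZ}}[T_\theta] - \log\left(\mathbb{E}_{\mathbb{P}_X \otimes \mathbb{P}_Z}[e^{T_\theta}]\right). \tag{29}$$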
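Proof.
Let $T^* = \log \frac{d\mathbb{P}}{d\mathbb{Q}}$. By construction, $T^*$ satisfies:
$$\mathbb{E}_{\mathbb{P}}[T^*] = I(X,Z), \qquad \mathbb{E}_{\mathbb{Q}}[e^{T^*}] = 1. \tag{30}$$
For a function $T$, the (positive) gap $I(X,Z) - \hat{I}(T)$ can be written as
$$I(X,Z) - \hat{I}(T) = \mathbb{E}_{\mathbb{P}}[T^* - T] + \log \mathbb{E}_{\mathbb{Q}}[e^{T}] \le \mathbb{E}_{\mathbb{P}}[T^* - T] + \mathbb{E}_{\mathbb{Q}}[e^{T} - e^{T^*}], \tag{31}$$
where we used the inequality $\log x \le x - 1$.
Fix $\eta > 0$. We first consider the case where $T^*$ is bounded from above by a constant $M$. By the universal approximation theorem (see Corollary 2.2 of Hornik (1989); specifically, the argument relies on the density of feedforward network functions in the space $L^1(\Omega, \mu)$ of functions integrable with respect to the measure $\mu = \mathbb{P} + \mathbb{Q}$), we may choose a feedforward network function $T_{\hat\theta} \le M$ such that
$$\mathbb{E}_{\mathbb{P}}|T^* - T_{\hat\theta}| \le \frac{\eta}{2} \quad \text{and} \quad \mathbb{E}_{\mathbb{Q}}|T^* - T_{\hat\theta}| \le \frac{\eta}{2} e^{-M}. \tag{32}$$
Since $\exp$ is Lipschitz continuous with constant $e^{M}$ on $(-\infty, M]$, we have
$$\mathbb{E}_{\mathbb{Q}}|e^{T^*} - e^{T_{\hat\theta}}| \le e^{M}\, \mathbb{E}_{\mathbb{Q}}|T^* - T_{\hat\theta}| \le \frac{\eta}{2}. \tag{33}$$
From Eqn. 31 and the triangular inequality, we then obtain:
$$|I(X,Z) - \hat{I}(T_{\hat\theta})| \le \mathbb{E}_{\mathbb{P}}|T^* - T_{\hat\theta}| + \mathbb{E}_{\mathbb{Q}}|e^{T^*} - e^{T_{\hat\theta}}| \le \frac{\eta}{2} + \frac{\eta}{2} = \eta. \tag{34}$$
In the general case, the idea is to partition $\Omega$ into two subsets $\{T^* > M\}$ and $\{T^* \le M\}$ for a suitably chosen large value of $M$. For a given subset $S \subset \Omega$, we will denote by $\mathbb{1}_S$ its characteristic function, $\mathbb{1}_S(\omega) = 1$ if $\omega \in S$ and $0$ otherwise. $T^*$ is integrable with respect to $\mathbb{P}$ (this can be seen from the identity $\mathbb{E}_{\mathbb{P}}\left|\log \frac{d\mathbb{P}}{d\mathbb{Q}}\right| \le D_{KL}(\mathbb{P} \,\|\, \mathbb{Q}) + 4\sqrt{D_{KL}(\mathbb{P} \,\|\, \mathbb{Q})}$ of Györfi & van der Meulen, 1987), and $e^{T^*}$ is integrable with respect to $\mathbb{Q}$. By the dominated convergence theorem, we may therefore choose $M$ so that the expectations $\mathbb{E}_{\mathbb{P}}[T^* \mathbb{1}_{T^* > M}]$ and $\mathbb{E}_{\mathbb{Q}}[e^{T^*} \mathbb{1}_{T^* > M}]$ are lower than $\eta/4$. As above, we then use the universal approximation theorem to find a feedforward network function $T_{\hat\theta}$, which we can assume without loss of generality to be bounded from above by $M$, such that
$$\mathbb{E}_{\mathbb{P}}|T^* - T_{\hat\theta}| \le \frac{\eta}{2} \quad \text{and} \quad \mathbb{E}_{\mathbb{Q}}\left[|T^* - T_{\hat\theta}|\, \mathbb{1}_{T^* \le M}\right] \le \frac{\eta}{4} e^{-M}. \tag{35}$$
We then write
$$\mathbb{E}_{\mathbb{Q}}|e^{T^*} - e^{T_{\hat\theta}}| = \mathbb{E}_{\mathbb{Q}}\left[|e^{T^*} - e^{T_{\hat\theta}}|\, \mathbb{1}_{T^* \le M}\right] + \mathbb{E}_{\mathbb{Q}}\left[|e^{T^*} - e^{T_{\hat\theta}}|\, \mathbb{1}_{T^* > M}\right]$$
$$\le e^{M}\, \mathbb{E}_{\mathbb{Q}}\left[|T^* - T_{\hat\theta}|\, \mathbb{1}_{T^* \le M}\right] + \mathbb{E}_{\mathbb{Q}}\left[e^{T^*} \mathbb{1}_{T^* > M}\right] \tag{36}$$
$$\le \frac{\eta}{4} + \frac{\eta}{4} = \frac{\eta}{2}, \tag{37}$$
where the inequality in the second line arises from the convexity and positivity of $\exp$. Eqns. 35 and 36, together with the triangular inequality, lead to Eqn. 34, which proves the Lemma.
∎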
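Lemma 4 (Lemma 2 restated).
Let $\eta > 0$. Given a family $\mathcal{F}$ of neural network functions $T_\theta$ with parameters $\theta$ in some compact domain $\Theta \subset \mathbb{R}^k$, there exists $N \in \mathbb{N}$ such that
$$\forall n \ge N, \quad \Pr\left(\left|\widehat{I(X;Z)}_n - I_\Theta(X,Z)\right| \le \eta\right) = 1. \tag{38}$$
Proof.
We start by using the triangular inequality to write,
$$\left|\widehat{I(X;Z)}_n - \sup_{T_\theta \in \mathcal{F}} \hat{I}(T_\theta)\right| \le \sup_{T_\theta \in \mathcal{F}} \left|\mathbb{E}_{\mathbb{P}}[T_\theta] - \mathbb{E}_{\mathbb{P}_n}[T_\theta]\right| + \sup_{T_\theta \in \mathcal{F}} \left|\log \mathbb{E}_{\mathbb{Q}}[e^{T_\theta}] - \log \mathbb{E}_{\mathbb{Q}_n}[e^{T_\theta}]\right|. \tag{39}$$
The continuous function $(\theta, \omega) \mapsto T_\theta(\omega)$, defined on the compact domain $\Theta \times \Omega$, is bounded. So the functions $T_\theta$ are uniformly bounded by a constant $M$, i.e., $|T_\theta| \le M$ for all $\theta \in \Theta$. Since $\log$ is Lipschitz continuous with constant $e^{M}$ in the interval $[e^{-M}, e^{M}]$, we have
$$\left|\log \mathbb{E}_{\mathbb{Q}}[e^{T_\theta}] - \log \mathbb{E}_{\mathbb{Q}_n}[e^{T_\theta}]\right| \le e^{M} \left|\mathbb{E}_{\mathbb{Q}}[e^{T_\theta}] - \mathbb{E}_{\mathbb{Q}_n}[e^{T_\theta}]\right|. \tag{40}$$
Since $\Theta$ is compact and the feedforward network functions are continuous, the families of functions $T_\theta$ and $e^{T_\theta}$ satisfy the uniform law of large numbers (Van de Geer, 2000). Given $\eta > 0$ we can thus choose $N \in \mathbb{N}$ such that for all $n \ge N$ and with probability one,
$$\sup_{T_\theta \in \mathcal{F}} \left|\mathbb{E}_{\mathbb{P}}[T_\theta] - \mathbb{E}_{\mathbb{P}_n}[T_\theta]\right| \le \frac{\eta}{2} \quad \text{and} \quad \sup_{T_\theta \in \mathcal{F}} \left|\mathbb{E}_{\mathbb{Q}}[e^{T_\theta}] - \mathbb{E}_{\mathbb{Q}_n}[e^{T_\theta}]\right| \le \frac{\eta}{2} e^{-M}. \tag{41}$$
Together with Eqns. 39 and 40, this leads to
$$\left|\widehat{I(X;Z)}_n - \sup_{T_\theta \in \mathcal{F}} \hat{I}(T_\theta)\right| \le \frac{\eta}{2} + \frac{\eta}{2} = \eta. \tag{42}$$
∎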
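Theorem 5 (Theorem 2 restated).
MINE is strongly consistent.
Proof.
Let $\epsilon > 0$. We apply Lemmas 3 and 4 to find a family of neural network functions $\mathcal{F}$ and $N \in \mathbb{N}$ such that (28) and (38) hold with $\eta = \epsilon/2$. By the triangular inequality, for all $n \ge N$ and with probability one, we have
$$\left|I(X,Z) - \widehat{I(X;Z)}_n\right| \le \left|I(X,Z) - I_\Theta(X,Z)\right| + \left|\widehat{I(X;Z)}_n - I_\Theta(X,Z)\right| \le \epsilon, \tag{43}$$
which proves consistency. ∎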
8.2.3 Sample complexity proof
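Theorem 6 (Theorem 3 restated).
Assume that the functions $T_\theta$ in $\mathcal{F}$ are bounded (i.e., $|T_\theta| \le M$) and $L$-Lipschitz with respect to the parameters $\theta$. The domain $\Theta \subset \mathbb{R}^d$ is bounded, so that $\|\theta\| \le K$ for some constant $K$. Given any values $\epsilon, \delta$ of the desired accuracy and confidence parameters, we have,
$$\Pr\left(\left|\widehat{I(X;Z)}_n - I_\Theta(X,Z)\right| \le \epsilon\right) \ge 1 - \delta, \tag{44}$$
whenever the number $n$ of samples satisfies
$$n \ge \frac{2M^2\left(d \log(16KL\sqrt{d}/\epsilon) + 2dM + \log(2/\delta)\right)}{\epsilon^2}. \tag{45}$$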
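To get a feel for how such a bound scales, the following sketch evaluates a sample-size requirement numerically. It assumes that the bound has the form n ≥ 2M²(d log(16KL√d/ε) + 2dM + log(2/δ)) / ε², with M the bound on the network outputs, L the Lipschitz constant, K the radius of the parameter domain, and d its dimension. Both this reading of the bound and the function name are assumptions of the example.

```python
import math

def mine_sample_bound(eps, delta, M, K, L, d):
    """Smallest integer n satisfying the assumed sample complexity bound.

    Assumed form (not taken verbatim from the text):
    n >= 2 M^2 (d log(16 K L sqrt(d) / eps) + 2 d M + log(2 / delta)) / eps^2
    """
    covering_term = d * math.log(16.0 * K * L * math.sqrt(d) / eps)
    confidence_term = math.log(2.0 / delta)
    numerator = 2.0 * M ** 2 * (covering_term + 2.0 * d * M + confidence_term)
    return math.ceil(numerator / eps ** 2)

# Hypothetical constants, chosen only to show the scaling behaviour.
n_coarse = mine_sample_bound(eps=0.2, delta=0.05, M=1.0, K=1.0, L=1.0, d=10)
n_fine = mine_sample_bound(eps=0.1, delta=0.05, M=1.0, K=1.0, L=1.0, d=10)
n_wide = mine_sample_bound(eps=0.1, delta=0.05, M=1.0, K=1.0, L=1.0, d=100)
```

Under this form, halving the accuracy parameter roughly quadruples the requirement, and the requirement grows approximately linearly, up to a logarithmic factor, with the parameter dimension.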
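Proof.
The assumptions of Lemma 2 apply, so let us begin with Eqns. 39 and 40. By the Hoeffding inequality, for any function $f$,
$$\Pr\left(\left|\mathbb{E}_{\mathbb{Q}}[f] - \mathbb{E}_{\mathbb{Q}_n}[f]\right| > \frac{\epsilon}{4}\right) \le 2 \exp\left(-\frac{\epsilon^2 n}{2M^2}\right). \tag{46}$$
To extend this inequality to a uniform inequality over all functions $T_\theta$ and $e^{T_\theta}$, the standard technique is to choose a minimal cover of the domain $\Theta \subset \mathbb{R}^d$ by a finite set of small balls of radius $\eta$, $\Theta \subset \cup_j B_\eta(\theta_j)$, and to use the union bound. The minimal cardinality of such a covering is bounded by the covering number $N_\eta(\Theta)$ of $\Theta$, known to satisfy (Shalev-Schwartz & Ben-David, 2014)
$$N_\eta(\Theta) \le \left(\frac{2K\sqrt{d}}{\eta}\right)^{d}. \tag{47}$$
Successively applying a union bound in Eqn. 46 with the sets of functions $\{T_{\theta_j}\}_j$ and $\{e^{T_{\theta_j}}\}_j$ gives
$$\Pr\left(\max_j \left|\mathbb{E}_{\mathbb{Q}}[T_{\theta_j}] - \mathbb{E}_{\mathbb{Q}_n}[T_{\theta_j}]\right| > \frac{\epsilon}{4}\right) \le 2 N_\eta(\Theta) \exp\left(-\frac{\epsilon^2 n}{2M^2}\right) \tag{48}$$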
and