1 Introduction
Mutual information (MI) is a fundamental measure of the dependence between two random variables. Mathematically, the MI between variables $x$ and $y$ is defined as
$$I(x; y) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(x, y)}{p(x)\,p(y)}\right]. \tag{1}$$
This important tool has been applied in a wide range of scientific fields, including statistics (Granger and Lin, 1994; Jiang et al., 2015), bioinformatics (Lachmann et al., 2016; Zea et al., 2016), robotics (Julian et al., 2014; Charrow et al., 2015), and machine learning (Chen et al., 2016; Alemi et al., 2016; Hjelm et al., 2018; Cheng et al., 2020b).
In machine learning, especially in deep learning frameworks, MI is typically used as a criterion or a regularizer in loss functions, to encourage or limit the dependence between variables. MI maximization has been studied extensively in various tasks, e.g., representation learning (Hjelm et al., 2018; Hu et al., 2017), generative models (Chen et al., 2016), information distillation (Ahn et al., 2019), and reinforcement learning (Florensa et al., 2017). Recently, MI minimization has attracted increasing attention for its applications in disentangled representation learning (Chen et al., 2018), style transfer (Kazemi et al., 2018), domain adaptation (Gholami et al., 2018), fairness (Kamishima et al., 2011), and the information bottleneck (Alemi et al., 2016).

However, only in a few special cases can one calculate the exact value of mutual information, since the calculation requires closed forms of density functions and a tractable log-density ratio between the joint and marginal distributions. In most machine learning tasks, only samples from the joint distribution are accessible. Therefore, sample-based MI estimation methods have been proposed. To approximate MI, most previous works focused on lower-bound estimation (Chen et al., 2016; Belghazi et al., 2018; Oord et al., 2018), which is unsuitable for MI minimization tasks, because decreasing a lower bound does not guarantee that the MI itself decreases. In contrast, MI upper bound estimation has received little attention in the literature. Among the existing MI upper bounds, Alemi et al. (2016) fix one of the marginal distributions ($p(y)$ in (1)) to a standard Gaussian and obtain a variational upper bound in closed form. However, the Gaussian marginal assumption is unduly strong, which prevents the upper bound from estimating MI with low bias. Poole et al. (2019) point out a leave-one-out upper bound, which provides tighter MI estimation when the sample size is large. However, it suffers from high numerical instability in practice when applied to MI minimization models.

To overcome the defects of previous MI estimators, we introduce a Contrastive Log-ratio Upper Bound (CLUB). Specifically, CLUB bridges mutual information estimation with contrastive learning (Oord et al., 2018), where MI is estimated by the difference of conditional probabilities between positive and negative sample pairs. Further, we extend CLUB to a variational form (vCLUB) for scenarios where the conditional distribution $p(y|x)$ is unknown, by approximating $p(y|x)$ with a neural network. We theoretically prove that, with a good variational approximation, vCLUB can either provide reliable MI estimation or remain a valid MI upper bound. Based on this new bound, we propose an MI minimization algorithm, and further accelerate it via a negative sampling strategy. The main contributions of this paper are summarized as follows.

We introduce a Contrastive Log-ratio Upper Bound (CLUB) of mutual information, which is not only reliable as a mutual information estimator, but also trainable in gradient-descent frameworks.

We extend CLUB with a variational network approximation, and provide a theoretical analysis of the properties of this variational bound.

We develop a CLUB-based MI minimization algorithm, and accelerate it with a negative sampling strategy.

We compare CLUB with previous MI estimators on both simulation studies and real-world applications, which demonstrate that CLUB not only achieves a better bias-variance trade-off in estimation, but is also more effective when applied to MI minimization.
2 Background
Although it is widely used in numerous applications, mutual information (MI) remains challenging to estimate accurately, especially when the closed forms of the distributions are unknown or intractable. Earlier MI estimation approaches include non-parametric binning (Darbellay and Vajda, 1999), kernel density estimation (Härdle et al., 2004), likelihood-ratio estimation (Suzuki et al., 2008), and nearest neighbor entropy estimation (Kraskov et al., 2004). These methods fail to provide reliable approximations when the data dimension increases (Belghazi et al., 2018). Also, the gradient of these estimators is difficult to calculate, which makes them inapplicable to back-propagation frameworks for MI optimization tasks.

To obtain differentiable and scalable MI estimation, recent approaches use deep neural networks to construct variational MI estimators. Most of these estimators focus on MI maximization problems and provide MI lower bounds. Specifically, Barber and Agakov (2003) replace the conditional distribution $p(x|y)$ with an auxiliary distribution $q_\theta(x|y)$ and obtain the Barber-Agakov (BA) bound:
$$I(x; y) = H(x) + \mathbb{E}_{p(x, y)}[\log p(x|y)] \ge H(x) + \mathbb{E}_{p(x, y)}[\log q_\theta(x|y)] =: I_{\text{BA}}, \tag{2}$$
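where $H(x)$ is the entropy of variable $x$. Belghazi et al. (2018) introduce the Mutual Information Neural Estimator (MINE), which treats MI as the Kullback-Leibler (KL) divergence (Kullback, 1997) between the joint and marginal distributions, and converts it into the dual representation:
$$I(x; y) \ge \mathbb{E}_{p(x, y)}[f(x, y)] - \log\left(\mathbb{E}_{p(x)p(y)}\left[e^{f(x, y)}\right]\right) =: I_{\text{MINE}}, \tag{3}$$
where $f(\cdot, \cdot)$ is a score function (or a critic) approximated by a neural network. Nguyen, Wainwright, and Jordan (NWJ) (Nguyen et al., 2010) derive another lower bound based on the MI divergence representation:
$$I(x; y) \ge \mathbb{E}_{p(x, y)}[f(x, y)] - \mathbb{E}_{p(x)p(y)}\left[e^{f(x, y) - 1}\right] =: I_{\text{NWJ}}. \tag{4}$$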
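More recently, based on Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010), an MI lower bound called InfoNCE was introduced in Oord et al. (2018):
$$I_{\text{NCE}} := \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{f(x_i, y_i)}}{\frac{1}{N}\sum_{j=1}^{N} e^{f(x_i, y_j)}}\right], \tag{5}$$
where the expectation is over $N$ samples $\{(x_i, y_i)\}_{i=1}^{N}$ drawn from the joint distribution $p(x, y)$.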
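The sample versions of the three critic-based lower bounds (3)-(5) can be sketched as follows. This is a minimal NumPy illustration, not the reference implementation: it assumes the critic has already been evaluated on every pair of a batch, giving a hypothetical matrix `scores[i, j] = f(x_i, y_j)`, and it treats the off-diagonal pairs as approximate samples from $p(x)p(y)$. The names `log_mean_exp` and `lower_bounds` are introduced here for illustration only.

```python
import numpy as np

def log_mean_exp(a, axis=None):
    """Numerically stable log(mean(exp(a))) over the given axis."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.mean(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)

def lower_bounds(scores):
    """Batch estimates of MINE (3), NWJ (4), and InfoNCE (5).

    scores[i, j] = f(x_i, y_j); diagonal entries are the positive pairs.
    Assumption: off-diagonal pairs stand in for samples from p(x)p(y).
    """
    n = scores.shape[0]
    pos = np.diag(scores)
    neg = scores[~np.eye(n, dtype=bool)]          # the N(N-1) negative pairs
    mine = pos.mean() - log_mean_exp(neg)
    nwj = pos.mean() - np.exp(neg - 1.0).mean()
    nce = np.mean(pos - log_mean_exp(scores, axis=1))
    return float(mine), float(nwj), float(nce)

# A very confident critic: InfoNCE saturates near log(N).
_, _, nce_saturated = lower_bounds(100.0 * np.eye(8))
```

Note that the InfoNCE estimate can never exceed $\log N$, which is one reason it shows low variance but high bias when the true MI is large.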
Unlike the above MI lower bounds, which have been studied extensively, MI upper bounds remain largely unexplored. Most existing MI upper bounds require the conditional distribution $p(y|x)$ to be known. For example, Alemi et al. (2016) introduce a variational marginal approximation $r(y)$ to build a variational upper bound (VUB):
$$I(x; y) = \mathbb{E}_{p(x, y)}\left[\log \frac{p(y|x)}{r(y)}\right] - \mathrm{KL}(p(y)\,\|\,r(y)) \le \mathbb{E}_{p(x, y)}\left[\log \frac{p(y|x)}{r(y)}\right] =: I_{\text{VUB}}. \tag{6}$$
The inequality is based on the fact that the KL-divergence is always non-negative. To be a good MI estimator, this upper bound requires a well-learned density approximation $r(y)$ to $p(y)$, so that the difference $\mathrm{KL}(p(y)\,\|\,r(y))$ is small. However, learning a good marginal approximation $r(y)$ without any additional information, known as the density estimation problem (Magdon-Ismail and Atiya, 1999), is challenging, especially when variable $y$ is in a high-dimensional space. In practice, Alemi et al. (2016) fix $r(y)$ as a standard normal distribution, $r(y) = \mathcal{N}(y \mid 0, I)$, which results in a high-bias MI estimation. With $N$ sample pairs $\{(x_i, y_i)\}_{i=1}^{N}$, Poole et al. (2019) replace $r(y)$ with a Monte Carlo approximation $r_i(y) = \frac{1}{N-1}\sum_{j \ne i} p(y|x_j) \approx p(y)$ and derive a leave-one-out upper bound (L1Out):
$$I_{\text{L1Out}} := \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N} \log \frac{p(y_i|x_i)}{\frac{1}{N-1}\sum_{j \ne i} p(y_i|x_j)}\right]. \tag{7}$$
This bound does not require any additional parameters, but depends heavily on a sufficient sample size to achieve a satisfactory Monte Carlo approximation. In practice, L1Out suffers from numerical instability when applied to real-world MI minimization problems.
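To make the structure of the leave-one-out bound concrete, the following is a hedged NumPy sketch of the batch estimator in (7). It assumes a precomputed matrix `logp[i, j] = log p(y_j | x_i)` (a convention chosen here, not taken from the paper) and uses a log-sum-exp over the masked column so that the logarithm of the probability sum does not underflow. The function name `l1out` is hypothetical.

```python
import numpy as np

def l1out(logp):
    """Leave-one-out upper bound estimate, Eq. (7).

    logp[i, j] = log p(y_j | x_i), so column i holds log p(y_i | x_j) for all j.
    """
    n = logp.shape[0]
    pos = np.diag(logp)
    # Mask out the positive pair j = i before averaging the probabilities.
    masked = np.where(np.eye(n, dtype=bool), -np.inf, logp)
    m = masked.max(axis=0)                                   # per-column max
    log_mix = m + np.log(np.exp(masked - m).sum(axis=0)) - np.log(n - 1)
    return float(np.mean(pos - log_mix))

# Log-densities this small would underflow if exponentiated naively.
lp = np.full((4, 4), -1000.0)
np.fill_diagonal(lp, -1.0)
value = l1out(lp)
```

The logarithm is applied after the sum over probabilities, so the estimate is dominated by the largest negative-pair density in each column; this nonlinearity is what distinguishes L1Out from the linear log-ratio form introduced later.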
To compare our method with the aforementioned MI upper bounds in more general scenarios (i.e., when $p(y|x)$ is unknown), we use a neural network $q_\theta(y|x)$ to approximate $p(y|x)$, and develop variational versions of VUB and L1Out as:
$$I_{\text{vVUB}} := \mathbb{E}_{p(x, y)}\left[\log \frac{q_\theta(y|x)}{r(y)}\right], \tag{8}$$
$$I_{\text{vL1Out}} := \mathbb{E}\left[\frac{1}{N}\sum_{i=1}^{N} \log \frac{q_\theta(y_i|x_i)}{\frac{1}{N-1}\sum_{j \ne i} q_\theta(y_i|x_j)}\right]. \tag{9}$$
We discuss theoretical properties of these two variational bounds in the Supplementary Material. In a simulation study (Section 4.1), we find that variational L1Out achieves better MI estimation than previous lower bounds. However, the numerical instability problem remains for variational L1Out in real-world applications (as shown in Section 4.4). To the best of our knowledge, we provide the first variational versions of the VUB and L1Out upper bounds, and study their properties both theoretically and empirically.
3 Proposed Method
Suppose we have sample pairs $\{(x_i, y_i)\}_{i=1}^{N}$ drawn from an unknown or intractable distribution $p(x, y)$. We aim to derive an upper bound estimator of the mutual information $I(x; y)$ based on the given samples. In a range of machine learning tasks (e.g., the information bottleneck), one of the conditional distributions between variables $x$ and $y$ ($p(y|x)$ or $p(x|y)$) can be known. To efficiently utilize this additional information, we first derive a mutual information (MI) upper bound under the assumption that one of the conditional distributions is provided (suppose $p(y|x)$ is provided, without loss of generality). Then, we extend the bound to more general cases where no conditional distribution is known. Finally, we develop an MI minimization algorithm based on the derived bound.
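3.1 CLUB with $p(y|x)$ Known
With the conditional distribution $p(y|x)$, our MI Contrastive Log-ratio Upper Bound (CLUB) is defined as:
$$I_{\text{CLUB}}(x; y) := \mathbb{E}_{p(x, y)}[\log p(y|x)] - \mathbb{E}_{p(x)}\mathbb{E}_{p(y)}[\log p(y|x)]. \tag{10}$$
To show that $I_{\text{CLUB}}(x; y)$ is an upper bound of $I(x; y)$, we calculate the gap $\Delta$ between them:
$$\begin{aligned}
\Delta &:= I_{\text{CLUB}}(x; y) - I(x; y) \\
&= \mathbb{E}_{p(x, y)}[\log p(y|x)] - \mathbb{E}_{p(x)}\mathbb{E}_{p(y)}[\log p(y|x)] - \mathbb{E}_{p(x, y)}[\log p(y|x) - \log p(y)] \\
&= \mathbb{E}_{p(x, y)}[\log p(y)] - \mathbb{E}_{p(x)}\mathbb{E}_{p(y)}[\log p(y|x)] \\
&= \mathbb{E}_{p(y)}\left[\log p(y) - \mathbb{E}_{p(x)}[\log p(y|x)]\right].
\end{aligned} \tag{11}$$
By the definition of the marginal distribution, we have $p(y) = \int p(y|x)\,p(x)\,dx = \mathbb{E}_{p(x)}[p(y|x)]$. Since $\log(\cdot)$ is a concave function, Jensen's inequality gives $\log p(y) = \log\left(\mathbb{E}_{p(x)}[p(y|x)]\right) \ge \mathbb{E}_{p(x)}[\log p(y|x)]$. Applying this inequality to equation (11), we conclude that the gap $\Delta$ is always non-negative. Therefore, $I_{\text{CLUB}}(x; y)$ is an upper bound of $I(x; y)$. The bound is tight when $p(y|x)$ has the same value for any $x$, which means variables $x$ and $y$ are independent. We summarize the above discussion in the following Theorem 3.1.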
Theorem 3.1.
For two random variables $x$ and $y$,
$$I(x; y) \le I_{\text{CLUB}}(x; y). \tag{12}$$
Equality is achieved if and only if and are independent.
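As an illustrative check of Theorem 3.1 (a worked example added here under a simplifying assumption, not part of the original derivation), consider a scalar Gaussian pair with $x \sim \mathcal{N}(0, 1)$ and $y \mid x \sim \mathcal{N}(\rho x, 1 - \rho^2)$, so that $y \sim \mathcal{N}(0, 1)$ and the correlation is $\rho$. The normalizing constants of $\log p(y|x)$ cancel in (10), leaving only the quadratic terms:

$$\begin{aligned}
I_{\text{CLUB}}(x; y) &= \frac{\mathbb{E}_{p(x)p(y)}\left[(y - \rho x)^2\right] - \mathbb{E}_{p(x, y)}\left[(y - \rho x)^2\right]}{2(1 - \rho^2)}
= \frac{(1 + \rho^2) - (1 - \rho^2)}{2(1 - \rho^2)} = \frac{\rho^2}{1 - \rho^2}, \\
I(x; y) &= -\tfrac{1}{2}\log(1 - \rho^2).
\end{aligned}$$

Writing $t = \rho^2 \in [0, 1)$, the elementary inequality $-\log(1 - t) \le t / (1 - t)$ gives $I(x; y) \le \tfrac{1}{2}\, t/(1 - t) \le I_{\text{CLUB}}(x; y)$, with equality exactly at $\rho = 0$, i.e., when $x$ and $y$ are independent. For instance, $\rho = 0.5$ gives $I(x; y) \approx 0.144$ and $I_{\text{CLUB}}(x; y) \approx 0.333$.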
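With sample pairs $\{(x_i, y_i)\}_{i=1}^{N}$, $I_{\text{CLUB}}(x; y)$ has an unbiased estimation as:
$$\hat{I}_{\text{CLUB}} = \frac{1}{N}\sum_{i=1}^{N} \log p(y_i|x_i) - \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} \log p(y_j|x_i) = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\left[\log p(y_i|x_i) - \log p(y_j|x_i)\right]. \tag{13}$$
In the estimator $\hat{I}_{\text{CLUB}}$, $\log p(y_i|x_i)$ provides the conditional log-likelihood of the positive sample pair $(x_i, y_i)$; $\{\log p(y_j|x_i)\}_{i \ne j}$ provide the conditional log-likelihoods of the negative sample pairs $(x_i, y_j)$. The difference between $\log p(y_i|x_i)$ and $\log p(y_j|x_i)$ is the contrastive probability log-ratio between two conditional distributions. Therefore, we name this novel MI upper bound estimator the Contrastive Log-ratio Upper Bound (CLUB). Compared with previous MI neural estimators, CLUB has a simpler form as a linear combination of log-ratios between positive and negative sample pairs. The linear form of log-ratios improves the numerical stability of calculating CLUB and its gradient, which we discuss in detail in Section 3.3.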
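The estimator in (13) can be sketched in a few lines of NumPy. The example below is a hedged illustration: it assumes a known Gaussian conditional $y \mid x \sim \mathcal{N}(\rho x, (1 - \rho^2) I)$ chosen here for convenience, and the helper names `gaussian_logp_matrix` and `club` are hypothetical. For this conditional the population value of CLUB is $d\rho^2 / (1 - \rho^2)$, which the batch estimate should approach.

```python
import numpy as np

def gaussian_logp_matrix(x, y, rho):
    """logp[i, j] = log p(y_j | x_i) for y | x ~ N(rho * x, (1 - rho^2) I)."""
    var = 1.0 - rho ** 2
    d = x.shape[1]
    diff = y[None, :, :] - rho * x[:, None, :]            # shape (N, N, d)
    return -0.5 * (diff ** 2).sum(-1) / var - 0.5 * d * np.log(2 * np.pi * var)

def club(logp):
    """Eq. (13): mean positive log-likelihood minus the mean over all N^2 pairs."""
    return float(np.diag(logp).mean() - logp.mean())

rng = np.random.default_rng(0)
n, d, rho = 512, 4, 0.6                                    # illustrative sizes
x = rng.standard_normal((n, d))
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal((n, d))

true_mi = -0.5 * d * np.log(1 - rho ** 2)                  # Gaussian MI
estimate = club(gaussian_logp_matrix(x, y, rho))           # should lie above true_mi
```

Because the estimate is just a difference of two averages of log-densities, no exponentials or logarithms of sums are needed, which is the source of the numerical stability noted above.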
3.2 CLUB with Conditional Distributions Unknown
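When one of the conditional distributions $p(y|x)$ or $p(x|y)$ is provided, the MI can be directly upper-bounded by equation (13) with samples $\{(x_i, y_i)\}_{i=1}^{N}$. Unfortunately, in a large number of machine learning tasks, the conditional relation between variables is unavailable.
To further extend the CLUB estimator to more general scenarios, we use a variational distribution $q_\theta(y|x)$ with parameter $\theta$ to approximate $p(y|x)$. Consequently, a variational CLUB term (vCLUB) is defined by:
$$I_{\text{vCLUB}}(x; y) := \mathbb{E}_{p(x, y)}[\log q_\theta(y|x)] - \mathbb{E}_{p(x)}\mathbb{E}_{p(y)}[\log q_\theta(y|x)]. \tag{14}$$
Similar to the MI upper bound estimator in (13), the unbiased estimator for vCLUB with samples $\{(x_i, y_i)\}_{i=1}^{N}$ is:
$$\hat{I}_{\text{vCLUB}} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\left[\log q_\theta(y_i|x_i) - \log q_\theta(y_j|x_i)\right] = \frac{1}{N}\sum_{i=1}^{N}\left[\log q_\theta(y_i|x_i) - \frac{1}{N}\sum_{j=1}^{N} \log q_\theta(y_j|x_i)\right]. \tag{15}$$
Using the variational approximation $q_\theta(y|x)$, vCLUB is no longer guaranteed to be an upper bound of $I(x; y)$. However, vCLUB shares good properties with CLUB. We claim that with a good variational approximation $q_\theta(y|x)$, vCLUB can still be an MI upper bound or become a reliable MI estimator. The following analyses support this claim.
Let $q_\theta(x, y) = q_\theta(y|x)\,p(x)$ be the variational joint distribution induced by $q_\theta(y|x)$. Generally, we have the following Theorem 3.2. Note that when $x$ and $y$ are independent, $I_{\text{vCLUB}}(x; y)$ has exactly the same value as $I(x; y)$, without requiring any additional assumption on $q_\theta(y|x)$. However, unlike in Theorem 3.1, where independence is a necessary and sufficient condition, the "independence between $x$ and $y$" becomes sufficient but not necessary to conclude "$I_{\text{vCLUB}}(x; y) = I(x; y)$", due to the variational approximation $q_\theta(y|x)$.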
Theorem 3.2.
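Denote $q_\theta(x, y) = q_\theta(y|x)\,p(x)$. If
$$\mathrm{KL}\left(p(x, y)\,\|\,q_\theta(x, y)\right) \le \mathrm{KL}\left(p(x)p(y)\,\|\,q_\theta(x, y)\right),$$
then $I(x; y) \le I_{\text{vCLUB}}(x; y)$. The equality holds when $x$ and $y$ are independent.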
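Theorem 3.2 shows that vCLUB remains an MI upper bound if the variational joint distribution $q_\theta(x, y)$ is "closer" to $p(x, y)$ than to $p(x)p(y)$. Therefore, minimizing $\mathrm{KL}(p(x, y)\,\|\,q_\theta(x, y))$ helps the condition in Theorem 3.2 to be satisfied. We show that $\mathrm{KL}(p(x, y)\,\|\,q_\theta(x, y))$ can be minimized by maximizing the log-likelihood of $q_\theta(y|x)$, because of the following equation:
$$\begin{aligned}
\min_\theta \mathrm{KL}\left(p(x, y)\,\|\,q_\theta(x, y)\right) &= \min_\theta \mathbb{E}_{p(x, y)}\left[\log\left(p(y|x)\,p(x)\right) - \log\left(q_\theta(y|x)\,p(x)\right)\right] \\
&= \min_\theta \mathbb{E}_{p(x, y)}[\log p(y|x)] - \mathbb{E}_{p(x, y)}[\log q_\theta(y|x)].
\end{aligned} \tag{16}$$
The right-hand side of equation (16) equals $\min_\theta \mathrm{KL}(p(y|x)\,\|\,q_\theta(y|x))$, and its first term does not depend on the parameter $\theta$. Therefore, $\min_\theta \mathrm{KL}(p(x, y)\,\|\,q_\theta(x, y))$ is equivalent to maximizing the second term, $\mathbb{E}_{p(x, y)}[\log q_\theta(y|x)]$. With samples $\{(x_i, y_i)\}_{i=1}^{N}$, we can maximize the log-likelihood function $\mathcal{L}(\theta) := \frac{1}{N}\sum_{i=1}^{N} \log q_\theta(y_i|x_i)$, which is the unbiased estimation of $\mathbb{E}_{p(x, y)}[\log q_\theta(y|x)]$.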
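The two steps above (fit $q_\theta(y|x)$ by maximum likelihood, then evaluate (15)) can be sketched as follows. To keep the example dependency-free, the neural network is replaced by a linear-Gaussian model $q(y|x) = \mathcal{N}(y \mid xA + b, \mathrm{diag}(v))$, whose maximum-likelihood fit is an ordinary least-squares problem; this substitution, the data-generating process, and all function names are assumptions made for illustration.

```python
import numpy as np

def fit_q(x, y):
    """Maximum-likelihood fit of the linear-Gaussian stand-in for q_theta(y|x)."""
    xb = np.hstack([x, np.ones((len(x), 1))])             # append a bias column
    w, *_ = np.linalg.lstsq(xb, y, rcond=None)            # least squares = Gaussian MLE of the mean
    var = np.mean((y - xb @ w) ** 2, axis=0)              # MLE of each output variance
    return w, var

def log_q_matrix(x, y, w, var):
    """logq[i, j] = log q(y_j | x_i)."""
    mean = np.hstack([x, np.ones((len(x), 1))]) @ w       # shape (N, d)
    diff = y[None, :, :] - mean[:, None, :]               # shape (N, N, d)
    return (-0.5 * diff ** 2 / var - 0.5 * np.log(2 * np.pi * var)).sum(axis=-1)

def vclub(logq):
    """Eq. (15): positive-pair mean minus the mean over all pairs."""
    return float(np.diag(logq).mean() - logq.mean())

rng = np.random.default_rng(0)
n, d, rho = 512, 4, 0.6                                    # illustrative sizes
x = rng.standard_normal((n, d))
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal((n, d))

w, var = fit_q(x, y)
estimate = vclub(log_q_matrix(x, y, w, var))
true_mi = -0.5 * d * np.log(1 - rho ** 2)
```

In this toy setting the model family contains the true conditional, so the fitted $q$ is close to $p(y|x)$ and the vCLUB estimate stays above the true MI, as Theorem 3.2 suggests.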
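In practice, the variational distribution $q_\theta(y|x)$ is usually implemented with neural networks. By enlarging the network capacity (i.e., adding layers and neurons) and applying gradient ascent to the log-likelihood $\mathcal{L}(\theta)$, we can obtain a far more accurate approximation $q_\theta(y|x)$ to $p(y|x)$, thanks to the high expressiveness of neural networks (Hu et al., 2019; Oymak and Soltanolkotabi, 2019). Therefore, to further discuss the properties of vCLUB, we assume the neural network approximation $q_\theta$ achieves $\mathrm{KL}(p(y|x)\,\|\,q_\theta(y|x)) \le \varepsilon$ with a small number $\varepsilon > 0$. In the Supplementary Material, we quantitatively discuss the reasonableness of this assumption. Consider the KL-divergence between $p(x)p(y)$ and $q_\theta(x, y)$. If $\mathrm{KL}(p(x)p(y)\,\|\,q_\theta(x, y)) \ge \mathrm{KL}(p(x, y)\,\|\,q_\theta(x, y))$, then by Theorem 3.2, vCLUB is already an MI upper bound. Otherwise, if $\mathrm{KL}(p(x)p(y)\,\|\,q_\theta(x, y)) < \mathrm{KL}(p(x, y)\,\|\,q_\theta(x, y))$, we have the following corollary:
Corollary 3.3.
Given $\mathrm{KL}(p(y|x)\,\|\,q_\theta(y|x)) \le \varepsilon$, if
$$\mathrm{KL}\left(p(x, y)\,\|\,q_\theta(x, y)\right) > \mathrm{KL}\left(p(x)p(y)\,\|\,q_\theta(x, y)\right),$$
then $|I(x; y) - I_{\text{vCLUB}}(x; y)| \le \varepsilon$.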
3.3 CLUB in MI Minimization
One of the major applications of MI upper bounds is mutual information minimization. In general, MI minimization aims to reduce the correlation between two variables $x$ and $y$ by selecting an optimal parameter $\sigma$ of the joint variational distribution $p_\sigma(x, y)$. In some application scenarios, additional conditional information between $x$ and $y$ is known. For example, in the information bottleneck task, the joint distribution between the input $x$ and the bottleneck representation $y$ is $p_\sigma(x, y) = p_\sigma(y|x)\,p(x)$. Then the MI upper bound can be calculated directly based on Eqn. (13).

For cases in which the conditional information between $x$ and $y$ remains unclear, we propose an MI minimization algorithm using the vCLUB estimator. At each training iteration, we first obtain a batch of samples $\{(x_i, y_i)\}_{i=1}^{N}$ from $p_\sigma(x, y)$. Then we update the variational approximation $q_\theta(y|x)$ by maximizing the log-likelihood $\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N} \log q_\theta(y_i|x_i)$. After $q_\theta(y|x)$ is updated, we calculate the vCLUB estimator $\hat{I}_{\text{vCLUB}}$ as described in (15). Finally, the gradient of $\hat{I}_{\text{vCLUB}}$ is calculated and back-propagated to the parameters of $p_\sigma(x, y)$. The reparameterization trick (Kingma and Welling, 2013) ensures the gradient back-propagates through the sampled embeddings $y_i$. Updating the joint distribution $p_\sigma(x, y)$ changes the conditional distribution $p_\sigma(y|x)$. Therefore, we need to update the approximation network $q_\theta(y|x)$ again. Consequently, $q_\theta(y|x)$ and $p_\sigma(x, y)$ are updated alternately during training (as shown in Algorithm 1 without sampling).
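The alternating scheme can be illustrated on a deliberately small toy problem. The sketch below is not the paper's Algorithm 1: it assumes a one-parameter "encoder" $y = a x + \epsilon$ (so $\sigma$ is the scalar `a`), a linear-Gaussian $q(y|x)$ fitted in closed form instead of a neural network trained by gradient ascent, and a central finite difference in place of back-propagation. Keeping the noise `eps` fixed plays the role of the reparameterization trick.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal(n)
eps = rng.standard_normal(n)            # fixed noise: reparameterized samples y = a * x + eps

def fit_q(x, y):
    """Update q: maximum-likelihood fit of q(y|x) = N(w * x + b, v)."""
    w, b = np.polyfit(x, y, 1)          # slope and intercept
    v = np.mean((y - w * x - b) ** 2)
    return w, b, v

def vclub(x, y, q):
    """vCLUB estimate of Eq. (15) under a fixed q (additive constants cancel)."""
    w, b, v = q
    logq = -0.5 * (y[None, :] - (w * x + b)[:, None]) ** 2 / v   # logq[i, j] ~ log q(y_j | x_i)
    return float(np.diag(logq).mean() - logq.mean())

a, lr, h = 1.5, 0.1, 1e-4               # initial dependence strength, step size, FD width
history = []
for step in range(100):
    q = fit_q(x, a * x + eps)                        # step 1: update q with the encoder fixed
    loss = lambda a_: vclub(x, a_ * x + eps, q)      # step 2: vCLUB as a function of a, q fixed
    grad = (loss(a + h) - loss(a - h)) / (2 * h)     # stand-in for back-propagation
    a -= lr * grad                                   # step 3: update the encoder parameter
    history.append(loss(a))
```

In this toy run the parameter `a` is driven toward zero, where $x$ and $y$ become (nearly) independent and the vCLUB estimate vanishes; a real model would replace each step with a gradient update of the corresponding network.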
In each training iteration, the vCLUB estimator requires calculation of all conditional distributions $\{q_\theta(y_j|x_i)\}_{i,j=1}^{N}$, which leads to $O(N^2)$ computational complexity. To accelerate the training, we use stochastic sampling to approximate the mean of conditional probabilities in $\hat{I}_{\text{vCLUB}}$ (Eqn. (15)), and obtain a sampled vCLUB estimator:
$$\hat{I}_{\text{vCLUB-S}} = \frac{1}{N}\sum_{i=1}^{N}\left[\log q_\theta(y_i|x_i) - \log q_\theta(y_{k'_i}|x_i)\right], \tag{17}$$
with $k'_i$ uniformly selected from indices $\{1, 2, \dots, N\}$. With this sampling strategy, the computational complexity in each iteration can be reduced to $O(N)$ (as shown in Algorithm 1 using sampling). A similar sampling strategy can also be applied to CLUB when $p(y|x)$ is known. This stochastic sampling estimator not only provides an unbiased estimation of $I_{\text{vCLUB}}$, but also connects MI minimization with negative sampling, a commonly used training strategy (Grover and Leskovec, 2016; Chen et al., 2019; Cheng et al., 2020a), in which for each positive data pair $(x_i, y_i)$, a negative pair $(x_i, y_{k'_i})$ is sampled. The mutual information is minimized by reducing the positive conditional probability while enlarging the negative conditional probability. Although previous MI upper bounds also utilize negative data pairs (such as L1Out in (7)), they cannot remain unbiased when accelerated with negative sampling, because of the nonlinear log function applied after the linear probability summation. The unbiasedness of our sampled CLUB is due to its form as a linear summation of log-ratios. In the experiments, we find the sampled vCLUB not only provides comparable MI estimation performance, but also improves the model's generalization ability.
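The sampled estimator in (17) can be sketched as below. For brevity the example applies the sampling strategy to CLUB with a known Gaussian conditional $y \mid x \sim \mathcal{N}(\rho x, (1 - \rho^2) I)$ rather than to a learned $q_\theta$; that choice, and the names `log_p` and `club_sample`, are assumptions made for illustration.

```python
import numpy as np

def log_p(y, x, rho):
    """Row-wise log p(y_i | x_i) for y | x ~ N(rho * x, (1 - rho^2) I)."""
    var = 1.0 - rho ** 2
    return (-0.5 * (y - rho * x) ** 2 / var - 0.5 * np.log(2 * np.pi * var)).sum(axis=1)

def club_sample(x, y, rho, rng):
    """Eq. (17): one uniformly sampled negative index k'_i per positive pair, O(N) cost."""
    k = rng.integers(0, x.shape[0], size=x.shape[0])
    return float(np.mean(log_p(y, x, rho) - log_p(y[k], x, rho)))

rng = np.random.default_rng(0)
n, d, rho = 256, 4, 0.6                                    # illustrative sizes
x = rng.standard_normal((n, d))
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal((n, d))

# Averaging many sampled estimates should approach the full O(N^2) estimator.
sampled = float(np.mean([club_sample(x, y, rho, rng) for _ in range(200)]))
```

Each call touches only $N$ log-densities, and because the estimator is linear in those log-densities, averaging over the random indices recovers the full double sum in expectation.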
4 Experiments
We first show the performance of CLUB as an MI estimator on tractable toy (simulated) cases. Then we evaluate the minimization ability of CLUB on two real-world applications: Information Bottleneck (IB) and Unsupervised Domain Adaptation (UDA). In the information bottleneck, the conditional distribution $p(y|x)$ is known, so we compare both CLUB and variational CLUB (vCLUB) estimators. In the other experiments, for which $p(y|x)$ is unknown, all the tested upper bounds require a variational approximation. Where there is no ambiguity, we refer to the variational upper bounds (e.g., vCLUB) by their original names (e.g., CLUB) for simplicity.
4.1 MI Estimation Quality
Following the setup from Poole et al. (2019), we apply CLUB as an MI estimator in two toy tasks: (i) estimating MI with samples $\{(x_i, y_i)\}$ drawn jointly from a multivariate Gaussian distribution with correlation $\rho$; (ii) estimating MI with samples $\{(x_i, (W y_i)^3)\}$, where $(x_i, y_i)$ still comes from a Gaussian with correlation $\rho$, and $W$ is a full-rank matrix. Since the transformation $y \mapsto (W y)^3$ is smooth and bijective, the mutual information is invariant (Kraskov et al., 2004), $I(x; y) = I(x; (W y)^3)$. For both tasks, $x$ and $y$ have the same dimension $d$. Under Gaussian distributions, the true MI value can be calculated as $I(x; y) = -\frac{d}{2}\log(1 - \rho^2)$, and therefore we set the true MI value between 2 and 10 by varying the value of $\rho$. At each true MI value, we sample data batches 4000 times, with batch size equal to 64, for the training of variational MI estimators.
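The data generation for both toy tasks can be sketched as follows. The inversion $\rho = \sqrt{1 - e^{-2I/d}}$ follows directly from the Gaussian MI formula; the dimension `d = 20`, the random choice of `w`, and the function names are placeholders assumed for illustration rather than the settings used in the experiments.

```python
import numpy as np

def rho_for_mi(mi, d):
    """Invert I = -(d / 2) * log(1 - rho^2) for the per-dimension correlation rho."""
    return np.sqrt(1.0 - np.exp(-2.0 * mi / d))

def sample_pairs(mi, d, n, rng, w=None):
    """Draw n pairs with the requested true MI; pass w for the Cubic setup."""
    rho = rho_for_mi(mi, d)
    x = rng.standard_normal((n, d))
    y = rho * x + np.sqrt(1.0 - rho ** 2) * rng.standard_normal((n, d))
    if w is not None:
        y = (y @ w.T) ** 3              # y -> (W y)^3, elementwise cube; MI is unchanged
    return x, y

rng = np.random.default_rng(0)
d = 20                                   # hypothetical dimension, for illustration only
x, y = sample_pairs(mi=4.0, d=d, n=20000, rng=rng)          # Gaussian setup
w = rng.standard_normal((d, d))          # a random Gaussian matrix is full-rank with probability one
xc, yc = sample_pairs(mi=4.0, d=d, n=64, rng=rng, w=w)      # Cubic setup, one batch of 64
```

Sweeping `mi` over the target values and drawing a fresh batch at each training step reproduces the step-function schedule against which the estimators are compared.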
Figure 2: Estimation quality of the MI estimators, with one column for the Gaussian setup and one for the Cubic setup. In each column, the estimation metrics are bias, variance, and mean squared error (MSE). In each plot, the metric is reported at true MI values varying from 2 to 10.
We compare our method with baselines including MINE (Belghazi et al., 2018), NWJ (Nguyen et al., 2010), InfoNCE (Oord et al., 2018), VUB (Alemi et al., 2016) and L1Out (Poole et al., 2019). Since the conditional distribution $p(y|x)$ is unknown in this simulation setup, all upper bounds (VUB, L1Out, CLUB) are calculated with an auxiliary approximation network $q_\theta(y|x)$. The approximation network has the same structure for all upper bounds, parameterized in a Gaussian family, with mean and variance inferred by neural networks. On the other hand, all the MI lower bounds (MINE, NWJ, InfoNCE) require learning a value function $f(x, y)$. For a fair comparison, we give the value function and the neural approximation one hidden layer with the same number of hidden units. For Gaussian setup, the number of hidden units is ; for Cubic setup, the number of hidden units is . On top of the hidden layer outputs, we add the ReLU activation function. The learning rate for all estimators is set to .

We report in Figure 1 the estimated MI values at each training step. The estimation of VUB has a far larger bias than the other methods, so we provide its results in the Supplementary Material. Lower bound estimators, such as NWJ, MINE, and InfoNCE, provide estimated values mainly under the step function of true MI values, while L1Out, CLUB and Sampled CLUB (CLUBSample) estimate values above the step function, which supports our theoretical analysis of CLUB with variational approximation. The numerical results of bias and variance in the estimation are reported in Figure 2. Among these methods, CLUB and CLUBSample have the lowest bias. The bias difference between CLUB and CLUBSample is insignificant, supporting our claim in Section 3.3 that CLUBSample is an unbiased stochastic approximation of CLUB. L1Out also has a small bias, slightly larger than that of CLUB. NWJ and InfoNCE have the lowest variance under both setups. CLUBSample has larger variance than CLUB and L1Out due to the use of the sampling strategy. When considering the bias-variance trade-off as the mean squared estimation error (MSE, equal to bias$^2$ + variance), CLUB outperforms the other estimators, while L1Out and CLUBSample also provide competitive performance.
4.2 Time Efficiency of MI Estimators
Besides the estimation quality comparison, we further study the time efficiency of different MI estimators. We conduct the comparison under the same experimental setup as the Gaussian case in Section 4.1. Each MI estimator is tested with batch sizes from 32 to 512. We measure the total time cost of the whole estimation process and average it over the estimation steps. In Figure 3, we report the average estimation time costs of different MI estimators. MINE and CLUBSample have the highest computational efficiency; both have $O(N)$ computational complexity with respect to the sample size $N$, because of the negative sampling strategy. Among the other methods, CLUB has the highest estimation speed, thanks to its simple form as a mean of log-ratios, which can be easily accelerated by matrix multiplication. Leave-one-out (L1Out) has the highest time cost, because it requires "leaving out" the positive sample pair each time in the denominator of equation (7).
4.3 MI Minimization in Information Bottleneck
The Information Bottleneck (IB) (Tishby et al., 2000) is an information-theoretical method for latent representation learning. Given an input source $x$ and a corresponding output target $y$, the information bottleneck aims to learn an encoder $p_\sigma(z|x)$, such that the compressed latent code $z$ is highly relevant to the target $y$, with irrelevant source information from $x$ being filtered out. In other words, IB seeks the sufficient statistics of $x$ with respect to $y$ (Alemi et al., 2016), with minimum information used from $x$. To address this task, an objective is introduced as
$$\min_{p_\sigma(z|x)} \; -I(y; z) + \beta\, I(x; z), \tag{18}$$
where the hyperparameter $\beta > 0$. Following the same setup as Alemi et al. (2016), we apply the IB technique to permutation-invariant MNIST classification. The input $x$ is a vector converted from a $28 \times 28$ image of a handwritten number, and the output $y$ is the class label of this number. The stochastic encoder $p_\sigma(z|x)$ is implemented in a Gaussian variational family, $p_\sigma(z|x) = \mathcal{N}(z \mid \mu_\sigma(x), \Sigma_\sigma(x))$, where $\mu_\sigma$ and $\Sigma_\sigma$ are two fully-connected neural networks.

For the first part of the IB objective (18), the MI between the target $y$ and the latent code $z$ is maximized. We use the same strategy as in the deep variational information bottleneck (DVB) (Alemi et al., 2016), where a variational classifier $q_\phi(y|z)$ is introduced to implement a Barber-Agakov MI lower bound (Eqn. (2)) of $I(y; z)$. The second term in the IB objective requires MI minimization between the input $x$ and the latent representation $z$. DVB (Alemi et al., 2016) uses the MI variational upper bound (VUB) (Eqn. (6)) for the minimization of $I(x; z)$. Since the closed form of $p_\sigma(z|x)$ is already known as a Gaussian distribution parameterized by neural networks, we can directly apply our CLUB estimator for minimizing $I(x; z)$. Alternatively, the variational CLUB can also be applied in this scenario. Besides CLUB and vCLUB, we compare previous methods such as MINE, NWJ, InfoNCE, and L1Out. The misclassification rates for different MI estimators are reported in Table 1.

| Method | Misclass. rate (%) |
|---|---|
| NWJ (Nguyen et al., 2010) | 1.29 |
| MINE (Belghazi et al., 2018) | 1.17 |
| InfoNCE (Oord et al., 2018) | 1.24 |
| DVB (VUB) (Alemi et al., 2016) | 1.13 |
| L1Out (Poole et al., 2019) | - |
| CLUB | 1.12 |
| CLUB (Sample) | 1.10 |
| vCLUB | 1.10 |
| vCLUB (Sample) | 1.06 |
MINE achieves the lowest misclassification error among lower bound estimators. Although it provides good MI estimation in the Gaussian simulation study, L1Out suffers from numerical instability in MI optimization and fails during training. Both CLUB and vCLUB estimators outperform previous methods in bottleneck representation learning, with lower misclassification rates. Note that the sampled versions of CLUB and vCLUB improve the accuracy compared with the original CLUB and vCLUB, respectively, which supports the claim that the negative sampling strategy improves the model's robustness. Moreover, using the variational approximation $q_\theta(z|x)$ attains even higher accuracy than using the ground truth $p_\sigma(z|x)$ for CLUB. Although $p_\sigma(z|x)$ provides more accurate MI estimation, the variational approximation $q_\theta(z|x)$ can add noise into the gradient of CLUB. Both the sampling and the variational approximation increase the randomness in the model, which helps to improve the model's generalization ability (Hinton et al., 2012; Belghazi et al., 2018).
4.4 MI Minimization in Domain Adaptation
Another important application of MI minimization is disentangled representation learning (DRL) (Kim and Mnih, 2018; Chen et al., 2018; Locatello et al., 2019). Specifically, we aim to encode the data into several separate embedding parts, each with different semantic meanings. The semantically disentangled representations help improve the performance of deep learning models, especially in the fields of conditional generation (Ma et al., 2018), style transfer (John et al., 2019), and domain adaptation (Gholami et al., 2018). To learn (ideally) independent disentangled representations, one effective solution is to minimize the mutual information among different latent embedding parts.
We compare the performance of MI estimators for learning disentangled representations in unsupervised domain adaptation (UDA) tasks. In UDA, we have images $x^s \in \mathcal{X}^s$ from the source domain and $x^t \in \mathcal{X}^t$ from the target domain. While each source image $x^s$ has a corresponding label $y^s$, no label information is available for observations in the target domain. The objective is to learn a model based on data $\{x^s, y^s\}$ and $\{x^t\}$, which not only performs well in source domain classification, but also provides satisfactory predictions in the target domain.
To solve this problem, we use the information-theoretical framework inspired by Gholami et al. (2018). Specifically, two feature extractors are introduced: the domain encoder $E_d$ and the content encoder $E_c$. The former encodes the domain information from an observation $x$ into a domain embedding $z_d = E_d(x)$; the latter outputs a content embedding $z_c = E_c(x)$ based on an input data point $x$. As shown in Figure 4, the content embedding $z_c^s$ from the source domain is further used as an input to a content classifier $C(\cdot)$ to predict the corresponding class label, with a content loss $\mathcal{L}_c$. The domain embedding $z_d$ (including $z_d^s$ and $z_d^t$) is input to a domain discriminator $D(\cdot)$ to predict whether the observation comes from the source domain or the target domain, with a domain loss $\mathcal{L}_d$. Since the content information and the domain information should be independent, we minimize the mutual information $I(z_c; z_d)$ between the content embedding $z_c$ and the domain embedding $z_d$. The final objective is (shown in Figure 4):
$$\min \; I(z_c; z_d) + \lambda_c \mathcal{L}_c + \lambda_d \mathcal{L}_d, \tag{19}$$
where $\lambda_c, \lambda_d > 0$ are hyperparameters.
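| Method | M→MM | M→U | U→M | SV→M | C→S | S→C |
|---|---|---|---|---|---|---|
| Source-Only | 59.9 | 76.7 | 63.4 | 67.1 | - | - |
| MI-based Disentangling Framework | | | | | | |
| NWJ | 83.3 | 98.3 | 91.1 | 86.5 | 78.2 | 71.0 |
| MINE | 88.4 | 98.1 | 94.8 | 83.4 | 77.9 | 70.5 |
| InfoNCE | 85.5 | 98.3 | 92.7 | 84.1 | 77.4 | 69.4 |
| VUB | 76.4 | 97.1 | 96.3 | 81.5 | - | - |
| L1Out | 76.2 | 96.3 | 93.9 | - | 77.8 | 69.2 |
| CLUB | 93.7 | 98.9 | 97.7 | 89.7 | 78.7 | 71.8 |
| CLUB-S | 94.6 | 98.9 | 98.1 | 90.6 | 79.1 | 72.3 |
| Other Frameworks | | | | | | |
| DANN | 81.5 | 77.1 | 73.0 | 71.1 | - | - |
| DSN | 83.2 | 91.3 | - | 76.0 | - | - |
| MCD | 93.5 | 94.2 | 94.1 | 92.6 | 78.1 | 69.2 |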
We apply different MI estimators to the framework (19), and evaluate the performance on several DA benchmark datasets, including MNIST, MNIST-M, USPS, SVHN, CIFAR-10, and STL. A detailed description of the datasets and model setups is in the Supplementary Material. Besides the proposed information-theoretical UDA model, we also compare the performance with other UDA frameworks: DANN (Ganin et al., 2016), DSN (Bousmalis et al., 2016), and MCD (Saito et al., 2018). The numerical results are shown in Table 2. From the results, we find our MI-based disentangling framework is competitive with previous UDA methods. Among different MI estimators, the Sampled CLUB uniformly outperforms other competitive methods on four DA tasks. The stochastic sampling in CLUBSample improves the model's generalization ability and protects the model from overfitting. The other two MI upper bounds, VUB and L1Out, fail to train a satisfactory UDA model; their results are worse than those of the MI lower bound estimators. With L1Out, the training loss cannot even decrease on the most challenging SVHN→MNIST task, due to numerical instability.
5 Conclusions
We have introduced a novel mutual information upper bound called the Contrastive Log-ratio Upper Bound (CLUB). This MI estimator can be extended to a variational version for general scenarios in which only samples of the joint distribution are obtainable. Based on the variational CLUB, we have proposed a new MI minimization algorithm, and further accelerated it with a negative sampling strategy. We have studied the properties of CLUB both theoretically and empirically. Experimental results on simulation studies and real-world applications show the attractive performance of CLUB on both MI estimation and MI minimization tasks. This work provides insight into the connection between mutual information and widespread machine learning training strategies, including contrastive learning and negative sampling. We believe the proposed CLUB estimator will have broad applications in reducing the correlation between different model parts, especially in the domains of interpretable machine learning, controllable generation, and fairness.
References
Ahn et al. (2019). Variational information distillation for knowledge transfer. In CVPR.
Alemi et al. (2016). Deep variational information bottleneck. arXiv preprint arXiv:1612.00410.
Barber and Agakov (2003). The IM algorithm: a variational approach to information maximization. In NeurIPS.
Belghazi et al. (2018). Mutual information neural estimation. In ICML.
Bousmalis et al. (2016). Domain separation networks. In NeurIPS.
Charrow et al. (2015). Information-theoretic mapping using Cauchy-Schwarz quadratic mutual information. In ICRA.
Chen et al. (2019). Improving textual network embedding with global attention via optimal transport. In ACL.
Chen et al. (2018). Isolating sources of disentanglement in variational autoencoders. In NeurIPS.
Chen et al. (2016). InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS.
Cheng et al. (2020a). Dynamic embedding on textual networks via a Gaussian process. In AAAI.
Cheng et al. (2020b). Improving disentangled text representation learning with information-theoretic guidance. arXiv preprint arXiv:2006.00693.
Dai et al. (2019). Contrastively smoothed class alignment for unsupervised domain adaptation. arXiv preprint arXiv:1909.05288.
Darbellay and Vajda (1999). Estimation of the information by an adaptive partitioning of the observation space. IEEE Transactions on Information Theory.
Florensa et al. (2017). Stochastic neural networks for hierarchical reinforcement learning. arXiv preprint arXiv:1704.03012.
Ganin et al. (2016). Domain-adversarial training of neural networks. JMLR.
Gholami et al. (2018). Unsupervised multi-target domain adaptation: an information theoretic approach. arXiv preprint arXiv:1810.11547.
Granger and Lin (1994). Using the mutual information coefficient to identify lags in nonlinear models. Journal of Time Series Analysis.
Grover and Leskovec (2016). node2vec: scalable feature learning for networks. In KDD.
Gutmann and Hyvärinen (2010). Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In AISTATS.
Härdle et al. (2004). Nonparametric and semiparametric models. Springer Science & Business Media.
Hinton et al. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Hjelm et al. (2018). Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
Hu et al. (2019). Understanding generalization of deep neural networks trained with noisy labels. arXiv preprint arXiv:1905.11368.
Hu et al. (2017). Learning discrete representations via information maximizing self-augmented training. In ICML.
Jiang et al. (2015). Nonparametric k-sample tests via dynamic slicing. Journal of the American Statistical Association.
John et al. (2019). Disentangled representation learning for non-parallel text style transfer. In ACL.
Julian et al. (2014). On mutual information-based control of range sensing robots for mapping applications. The International Journal of Robotics Research.
Kamishima et al. (2011). Fairness-aware learning through regularization approach. In IEEE 11th International Conference on Data Mining Workshops.
Kazemi et al. (2018). Unsupervised image-to-image translation using domain-specific variational information bound. In NeurIPS.
Kim and Mnih (2018). Disentangling by factorising. In ICML.
Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Kraskov et al. (2004). Estimating mutual information. Physical Review E.
Kullback (1997). Information theory and statistics. Courier Corporation.
Lachmann et al. (2016). ARACNe-AP: gene network reverse engineering through adaptive partitioning inference of mutual information. Bioinformatics.
Locatello et al. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML.
Ma et al. (2018). Disentangled person image generation. In CVPR.
Magdon-Ismail and Atiya (1999). Neural networks for density estimation. In NeurIPS.
Nguyen et al. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory.
Oord et al. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Oymak and Soltanolkotabi (2019). Overparameterized nonlinear learning: gradient descent takes the shortest path? In ICML.
Poole et al. (2019). On variational bounds of mutual information. In ICML.
Saito et al. (2018). Maximum classifier discrepancy for unsupervised domain adaptation. In CVPR.
Suzuki et al. (2008). Approximating mutual information by maximum likelihood density ratio estimation. In New Challenges for Feature Selection in Data Mining and Knowledge Discovery.
Tishby et al. (2000). The information bottleneck method. arXiv preprint physics/0004057.
Zea et al. (2016). MIToS.jl: mutual information tools for protein sequence analysis in the Julia language. Bioinformatics.
Appendix A Proofs of Theorems
Proof of Theorem 3.2.
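We calculate the gap between $I_{\text{vCLUB}}(x; y)$ and $I(x; y)$:
$$\begin{aligned}
\tilde{\Delta} &:= I_{\text{vCLUB}}(x; y) - I(x; y) \\
&= \mathbb{E}_{p(x, y)}[\log q_\theta(y|x)] - \mathbb{E}_{p(x)}\mathbb{E}_{p(y)}[\log q_\theta(y|x)] - \mathbb{E}_{p(x, y)}[\log p(y|x) - \log p(y)] \\
&= \left[\mathbb{E}_{p(y)}[\log p(y)] - \mathbb{E}_{p(x)p(y)}[\log q_\theta(y|x)]\right] - \left[\mathbb{E}_{p(x, y)}[\log p(y|x)] - \mathbb{E}_{p(x, y)}[\log q_\theta(y|x)]\right] \\
&= \mathbb{E}_{p(x)p(y)}\left[\log \frac{p(y)}{q_\theta(y|x)}\right] - \mathbb{E}_{p(x, y)}\left[\log \frac{p(y|x)}{q_\theta(y|x)}\right] \\
&= \mathrm{KL}\left(p(x)p(y)\,\|\,q_\theta(x, y)\right) - \mathrm{KL}\left(p(x, y)\,\|\,q_\theta(x, y)\right).
\end{aligned}$$
Therefore, $I_{\text{vCLUB}}(x; y)$ is an upper bound of $I(x; y)$ if and only if $\mathrm{KL}(p(x)p(y)\,\|\,q_\theta(x, y)) \ge \mathrm{KL}(p(x, y)\,\|\,q_\theta(x, y))$.
If $x$ and $y$ are independent, $p(x, y) = p(x)p(y)$. Then $\mathrm{KL}(p(x)p(y)\,\|\,q_\theta(x, y)) = \mathrm{KL}(p(x, y)\,\|\,q_\theta(x, y))$ and $\tilde{\Delta} = 0$. Therefore, $I_{\text{vCLUB}}(x; y) = I(x; y)$, and the equality holds. ∎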
Proof of Corollary 3.3.
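If $\mathrm{KL}(p(x, y)\,\|\,q_\theta(x, y)) > \mathrm{KL}(p(x)p(y)\,\|\,q_\theta(x, y))$, then by the gap computed in the proof of Theorem 3.2, $I(x; y) > I_{\text{vCLUB}}(x; y)$ and
$$|I(x; y) - I_{\text{vCLUB}}(x; y)| = \mathrm{KL}\left(p(x, y)\,\|\,q_\theta(x, y)\right) - \mathrm{KL}\left(p(x)p(y)\,\|\,q_\theta(x, y)\right) \le \mathrm{KL}\left(p(x, y)\,\|\,q_\theta(x, y)\right) = \mathrm{KL}\left(p(y|x)\,\|\,q_\theta(y|x)\right) \le \varepsilon,$$
where the first inequality uses the non-negativity of the KL-divergence. ∎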