1 Introduction
Conditional mutual information (CMI) is a fundamental information-theoretic quantity that extends the appealing properties of mutual information (MI) to conditional settings. For three continuous random variables $X$, $Y$ and $Z$, the conditional mutual information is defined as
$$I(X;Y|Z) = \int p(x,y,z)\,\log\frac{p(x,y|z)}{p(x|z)\,p(y|z)}\,dx\,dy\,dz,$$
assuming that the distributions admit the respective densities $p(\cdot)$. One of the striking features of MI and CMI is that they can capture non-linear dependencies between the variables. In scenarios where the Pearson correlation is zero even though the two random variables are dependent, mutual information can recover the truth. Likewise, for conditional independence of three random variables $X$, $Y$ and $Z$, conditional mutual information provides a strong guarantee, i.e., $I(X;Y|Z) = 0 \iff X \perp Y \mid Z$.
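For intuition, the definition above can be evaluated exactly when the three variables are discrete, replacing the integral with a sum. The sketch below is our own illustration (function and variable names are ours, not from the paper):

```python
import numpy as np

def cmi_discrete(p):
    """I(X;Y|Z) in nats from a joint pmf table p[x, y, z]."""
    p = np.asarray(p, dtype=float)
    p_z  = p.sum(axis=(0, 1))   # p(z)
    p_xz = p.sum(axis=1)        # p(x, z)
    p_yz = p.sum(axis=0)        # p(y, z)
    total = 0.0
    for x in range(p.shape[0]):
        for y in range(p.shape[1]):
            for z in range(p.shape[2]):
                if p[x, y, z] > 0:
                    # p(x,y|z) / (p(x|z) p(y|z)) == p(x,y,z) p(z) / (p(x,z) p(y,z))
                    total += p[x, y, z] * np.log(
                        p[x, y, z] * p_z[z] / (p_xz[x, z] * p_yz[y, z]))
    return total

# Example: X and Y conditionally independent given Z gives I(X;Y|Z) = 0.
p = np.zeros((2, 2, 2))
for z in range(2):
    px = [0.3, 0.7] if z else [0.7, 0.3]
    py = [0.6, 0.4] if z else [0.2, 0.8]
    for x in range(2):
        for y in range(2):
            p[x, y, z] = 0.5 * px[x] * py[y]
print(cmi_discrete(p))  # ~0.0
```

Conversely, a table where $X = Y$ given either value of $Z$ yields $I(X;Y|Z) = \log 2$ nats.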
The conditional setting is even more interesting, as the dependence between $X$ and $Y$ can change based on how they are connected to the conditioning variable $Z$. For instance, consider a simple Markov chain $X \to Z \to Y$. Here, $I(X;Y|Z) = 0$. But a slightly different relation, $X \to Z \leftarrow Y$, has $I(X;Y|Z) > 0$ even though $X$ and $Y$ may be independent as a pair. It is a well-known fact in Bayesian networks that a node is independent of its non-descendants given its parents. CMI goes beyond stating whether the pair $(X, Y)$ is conditionally dependent or not: it also quantifies the strength of the dependence.

1.1 Prior Art
The literature is replete with works applying CMI to data-driven knowledge discovery. Fleuret (2004) used CMI for fast binary feature selection to improve classification accuracy. Loeckx et al. (2010) improved non-rigid image registration by using CMI as a similarity measure instead of global mutual information. CMI has been used to infer gene-regulatory networks (Liang and Wang 2008) and protein modulation (Giorgi et al. 2014) from gene-expression data. Causal discovery (Li et al. 2011; Hlinka et al. 2013; Vejmelka and Paluš 2008) is yet another application area of CMI estimation.
Despite its widespread use, estimation of conditional mutual information remains a challenge. One naive method is to estimate the joint and conditional densities from data and plug them into the expression for CMI. But density estimation is not sample-efficient and is often harder than estimating the quantity of interest directly. The most widely used technique instead expresses CMI through appropriate arithmetic of differential-entropy estimators: $I(X;Y|Z) = h(X,Z) + h(Y,Z) - h(Z) - h(X,Y,Z)$, where $h(X) = -\int p(x)\log p(x)\,dx$ is known as the differential entropy.
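The entropy-difference route can be made concrete with the classical Kozachenko-Leonenko nearest-neighbor entropy estimator (Kozachenko and Leonenko 1987). The sketch below is our own illustration, not the authors' code; it uses brute-force distance computation for clarity:

```python
import math
import numpy as np

def psi(m):
    """Digamma at a positive integer: psi(m) = -gamma + H_{m-1}."""
    return -0.5772156649015329 + (1.0 / np.arange(1, m)).sum()

def kl_entropy(x, k=3):
    """Kozachenko-Leonenko estimate of h(X) in nats; x has shape (n, d)."""
    n, d = x.shape
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    eps = np.sort(dist, axis=1)[:, k - 1]            # distance to k-th neighbor
    log_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    return psi(n) - psi(k) + log_ball + d * np.mean(np.log(eps))

def cmi_entropy_sum(x, y, z, k=3):
    """I(X;Y|Z) = h(X,Z) + h(Y,Z) - h(Z) - h(X,Y,Z)."""
    xz, yz, xyz = np.hstack([x, z]), np.hstack([y, z]), np.hstack([x, y, z])
    return (kl_entropy(xz, k) + kl_entropy(yz, k)
            - kl_entropy(z, k) - kl_entropy(xyz, k))

rng = np.random.default_rng(0)
g = rng.normal(size=(2000, 1))
print(kl_entropy(g))  # ~0.5*log(2*pi*e) = 1.419 for N(0,1)
```

For three independent Gaussians the entropy-sum CMI is close to zero, since the individual biases largely cancel; the text's point is that this cancellation degrades badly as dimension grows.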
The differential-entropy estimation problem has been studied extensively (Beirlant et al. 1997; Nemenman et al. 2002; Miller 2003; Lee 2010; Leśniewicz 2014; Sricharan et al. 2012; Singh and Póczos 2014); the entropy can be estimated based either on kernel-density estimates (Kandasamy et al. 2015; Gao et al. 2016) or on nearest-neighbor estimates (Sricharan et al. 2013; Jiao et al. 2018; Pál et al. 2010; Kozachenko and Leonenko 1987; Singh et al. 2003; Singh and Póczos 2016). Building on top of nearest-neighbor estimates and breaking the entropy-sum paradigm, a coupled estimator (which we address henceforth as KSG) was proposed by Kraskov et al. (2004). It generalizes to mutual information and conditional mutual information, as well as to other multivariate information measures, including estimation in scenarios where the distribution can be mixed (Runge 2018; Frenzel and Pompe 2007; Gao et al. 2017, 2018; Vejmelka and Paluš 2008; Rahimzamani et al. 2018).
The $k$-NN approach has the advantage that it naturally adapts to the data density and does not require extensive tuning of kernel bandwidths. However, all these approaches suffer from the curse of dimensionality and are unable to scale well with dimension. Moreover, Gao et al. (2015) showed that exponentially many samples are required (as MI grows) for accurate estimation using $k$-NN based estimators. This brings us to the central motivation of this work: can we propose estimators for conditional mutual information that estimate well even in high dimensions?
1.2 Our Contribution
In this paper, we explore various ways of estimating CMI by leveraging tools from classifiers and generative models. To the best of our knowledge, this is the first work that deviates from the framework of $k$-NN and kernel-based CMI estimation and introduces neural networks to solve this problem.
The main contributions of the paper can be summarized as follows:
Classifier-Based MI Estimation: We propose a novel KL-divergence estimator based on the classifier two-sample approach that is more stable than, and outperforms, recent neural methods (Belghazi et al. 2018).
Divergence-Based CMI Estimation: We express CMI as the KL divergence between two distributions, $p(x,y,z)$ and $q(x,y,z) = p(x|z)p(y,z)$, and explore candidate generators for obtaining samples from $q$. The CMI estimate is then obtained from the divergence estimator.
Difference-Based CMI Estimation: Using the improved MI estimates and the chain-rule relation $I(X;Y|Z) = I(X;Y,Z) - I(X;Z)$, we show that estimating CMI as a difference of two MI estimates performs best among the methods proposed in this paper (such as divergence-based CMI estimation) as well as KSG.
Improved Performance in High Dimensions: On both linear and non-linear datasets, all our estimators perform significantly better than KSG. Surprisingly, our estimators perform well even in high dimensions, where KSG fails to obtain reasonable estimates beyond even a handful of dimensions.
Improved Performance in Conditional Independence Testing: As an application of CMI estimation, we use our best estimator for conditional independence testing (CIT) and obtain improved performance compared to the state-of-the-art CIT tester on both synthetic and real datasets.
2 Estimation of Conditional Mutual Information
The CMI estimation problem from finite samples can be stated as follows. Let us consider three random variables $X$, $Y$, $Z$ with joint distribution $p(x,y,z)$. Let the dimensions of the random variables be $d_x$, $d_y$ and $d_z$ respectively. We are given $n$ samples $\{(x_i, y_i, z_i)\}_{i=1}^{n}$ drawn i.i.d. from $p(x,y,z)$, so $x_i \in \mathbb{R}^{d_x}$, $y_i \in \mathbb{R}^{d_y}$ and $z_i \in \mathbb{R}^{d_z}$. The goal is to estimate $I(X;Y|Z)$ from these $n$ samples.
2.1 Divergence Based CMI Estimation
Definition 1.
The Kullback-Leibler (KL) divergence between two distributions $P$ and $Q$ is given as:
$$D_{KL}(P\,\|\,Q) = \mathbb{E}_{P}\left[\log\frac{dP}{dQ}\right].$$
Definition 2.
Conditional mutual information (CMI) can be expressed as a KL divergence between the two distributions $p(x,y,z)$ and $q(x,y,z) = p(x|z)p(y,z)$, i.e.,
$$I(X;Y|Z) = D_{KL}\big(p(x,y,z)\,\|\,p(x|z)\,p(y,z)\big).$$
The definition of CMI as a KL divergence naturally leads to the question: can we estimate CMI using an estimator for KL divergence? The problem is still non-trivial, since we are only given samples from $p(x,y,z)$, while the divergence estimator would also require samples from $q(x,y,z) = p(x|z)p(y,z)$. This further boils down to whether we can learn the conditional distribution $p(x|z)$.
2.1.1 Generative Models
We now explore various techniques to learn the conditional distribution $p(x|z)$ from the given samples. This problem is fundamentally different from drawing independent samples from the marginals $p(x)$ and $p(y)$ given the joint $p(x,y)$. In that simpler setting, we can simply permute the data to obtain $\{(x_i, y_{\sigma(i)})\}$ (where $\sigma$ denotes a permutation with $\sigma(i) \neq i$), which emulates samples drawn from $p(x)p(y)$. But such a permutation scheme does not work for $p(x|z)p(y,z)$, since it would destroy the dependence between $X$ and $Z$. The problem is solved using recent advances in generative models, which aim to learn an unknown underlying distribution from samples.
Conditional Generative Adversarial Network (CGAN): There exist extensions of the basic GAN framework (Goodfellow et al. 2014) to conditional settings, such as CGAN (Mirza and Osindero 2014). Once trained, the CGAN can generate samples from the generator network by feeding in the conditioning value $z$ together with a noise vector.
Conditional Variational Autoencoder (CVAE): Similar to CGAN, the conditional variant of the VAE, CVAE (Kingma and Welling 2013; Sohn et al. 2015), aims to maximize the conditional log-likelihood. The input to the decoder network is the value of $z$ and a latent vector sampled from a standard Gaussian. The decoder gives the conditional mean and conditional variance (parametric functions of $z$ and the latent vector), from which $x$ is then sampled.

$k$-NN based permutation: A simpler algorithm for generating the conditional is to permute the $x$ values among points whose $z$ values are close. Such methods are popular in the conditional independence testing literature (Sen et al. 2017; Doran et al. 2014). For a given point $(x_i, y_i, z_i)$, we find the nearest neighbor of $z_i$; say it is $z_j$, with corresponding data point $(x_j, y_j, z_j)$. Then $(x_j, y_i, z_i)$ is (approximately) a sample from $p(x|z)p(y,z)$.
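The $k$-NN permutation above can be sketched in a few lines (our illustration; brute-force neighbor search, hypothetical names):

```python
import numpy as np

def knn_permute_x(x, z, k=1, seed=0):
    """Replace each x_i by the x of a random one of z_i's k nearest z-neighbors.

    Paired with the original (y_i, z_i), the swapped x approximates a sample
    from p(x|z) p(y,z)."""
    rng = np.random.default_rng(seed)
    n = len(z)
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)               # exclude the point itself
    nbrs = np.argsort(dist, axis=1)[:, :k]       # k nearest z-neighbors
    pick = nbrs[np.arange(n), rng.integers(0, k, size=n)]
    return x[pick]

# Sanity check: when x is a near-deterministic function of z, the swapped x'
# stays close to the original x, so the conditional p(x|z) is preserved.
rng = np.random.default_rng(1)
z = rng.normal(size=(500, 1))
x = z[:, 0] + 0.01 * rng.normal(size=500)
x_swap = knn_permute_x(x, z)
print(np.corrcoef(x, x_swap)[0, 1])  # close to 1
```

By contrast, a global permutation of $x$ would drive this correlation to zero, destroying the $X$-$Z$ dependence the target distribution must keep.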
Now that we have outlined multiple techniques for sampling from $p(x|z)$, we next proceed to the problem of estimating the KL divergence.
2.1.2 Divergence Estimation
Recently, Belghazi et al. (2018) proposed a neural-network-based estimator of mutual information (MINE) that utilizes lower bounds on the KL divergence. Since MI is a special case of KL divergence, their neural estimator extends to divergence estimation as well. The estimator is trained using back-propagation and was shown to outperform traditional methods for MI estimation. The core idea of MINE rests on a dual representation of the KL divergence. The two main lower bounds used by MINE are stated below.
Definition 3.
The Donsker-Varadhan representation expresses the KL divergence as a supremum over functions,
$$D_{KL}(P\,\|\,Q) = \sup_{f \in \mathcal{F}}\; \mathbb{E}_{P}[f] - \log\big(\mathbb{E}_{Q}[e^{f}]\big), \qquad (1)$$
where the function class $\mathcal{F}$ includes those functions that lead to finite values of the expectations.
Definition 4.
The f-divergence bound gives a lower bound on the KL divergence:
$$D_{KL}(P\,\|\,Q) \ge \sup_{f \in \mathcal{F}}\; \mathbb{E}_{P}[f] - \mathbb{E}_{Q}[e^{f-1}]. \qquad (2)$$
MINE uses a neural network to represent the function class and uses gradient descent to maximize the RHS in the above bounds.
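To make bound (1) concrete: with the optimal critic (the true log-likelihood ratio), the Donsker-Varadhan objective attains the KL divergence exactly, while any other critic yields a smaller value. A self-contained check of ours on two Gaussians $P = \mathcal{N}(0,1)$ and $Q = \mathcal{N}(1,1)$, for which $D_{KL}(P\,\|\,Q) = \tfrac{1}{2}$ and $\log\frac{dP}{dQ}(x) = \tfrac{1}{2} - x$:

```python
import numpy as np

def dv_bound(f, xp, xq):
    """Donsker-Varadhan objective: E_P[f] - log E_Q[exp f]."""
    return np.mean(f(xp)) - np.log(np.mean(np.exp(f(xq))))

rng = np.random.default_rng(0)
xp = rng.normal(0.0, 1.0, 100_000)      # samples from P = N(0,1)
xq = rng.normal(1.0, 1.0, 100_000)      # samples from Q = N(1,1)
f_star = lambda x: 0.5 - x              # the optimal critic: log dP/dQ
f_weak = lambda x: 0.25 * (0.5 - x)     # a scaled-down, sub-optimal critic
print(dv_bound(f_star, xp, xq))         # ~0.5, the true KL divergence
print(dv_bound(f_weak, xp, xq))         # strictly smaller lower bound
```

MINE searches for the critic $f$ by gradient steps on a neural network; the snippet only verifies the variational characterization itself.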
Even though this framework is flexible and straightforward to apply, it has several practical limitations. The estimation is very sensitive to the choice of hyper-parameters (hidden units/layers) and training configuration (batch size, learning rate, number of steps). We found the optimization process to be unstable and to diverge at high dimensions (Section 4, Experimental Results). Our findings echo those of Poole et al., who found the networks difficult to tune even in toy problems.
2.2 Difference Based CMI Estimation
Another seemingly simple approach to estimating CMI is to express it as a difference of two mutual information terms by invoking the chain rule, i.e.,
$$I(X;Y|Z) = I(X;Y,Z) - I(X;Z).$$
As stated before, since mutual information is a special case of KL divergence, viz. $I(X;Y) = D_{KL}\big(p(x,y)\,\|\,p(x)p(y)\big)$, this again calls for a stable, scalable, sample-efficient KL-divergence estimator, which we present in the next section.

3 Classifier Based MI Estimation
In their seminal work on independence testing, Lopez-Paz and Oquab (2016) introduced the classifier two-sample test to distinguish between samples coming from two unknown distributions $P$ and $Q$. The idea was also adopted for conditional independence testing by Sen et al. (2017). The basic principle is to train a binary classifier by labeling samples from $P$ as $1$ and those coming from $Q$ as $0$, and to test the null hypothesis $H_0: P = Q$. Under the null, the test accuracy of the binary classifier will be close to $0.5$; it will be away from $0.5$ under the alternative. The accuracy of the binary classifier can then be carefully used to define $p$-values for the test.

We propose to use the classifier two-sample principle for estimating the likelihood ratio $dP/dQ$. While the existing literature has instances of using the likelihood ratio for MI estimation, the algorithms used to estimate the likelihood ratio are quite different from ours. Both Suzuki et al. (2008) and Nguyen et al. (2008) formulate likelihood-ratio estimation as a convex relaxation by leveraging Legendre-Fenchel duality. But the performance of these methods depends on the choice of suitable kernels, and they suffer from the same disadvantages mentioned in the Introduction.
3.1 Problem Formulation
Given i.i.d. samples $\{x_i\}_{i=1}^{n} \sim P$ and i.i.d. samples $\{x'_j\}_{j=1}^{m} \sim Q$, we want to estimate $D_{KL}(P\,\|\,Q)$. We label the points drawn from $P$ as $y = 1$ and those from $Q$ as $y = 0$. A binary classifier is then trained on this supervised classification task. Let the classifier's prediction for a point $u$ be $\gamma(u) = \Pr(y = 1 \mid u)$ (where $\Pr$ denotes probability). Then the point-wise likelihood ratio for the data point $u$ is given by $\hat{L}(u) = \gamma(u)/(1 - \gamma(u))$.

The following Proposition is elementary and has already been observed in Belghazi et al. (2018) (proof of Theorem 4). We restate it here for completeness and quick reference.
Proposition 1.
The optimal function in the Donsker-Varadhan representation (1) is the one that computes the point-wise log-likelihood ratio, i.e., $f^{*}(u) = \log\big(dP/dQ\,(u)\big)$ (assuming the ratio is well-defined, i.e., $Q(u) > 0$ wherever $P(u) > 0$).
Based on Proposition 1, the next step is to substitute the estimated point-wise likelihood ratio into (1) to obtain an estimate of the KL divergence:
$$\hat{D}_{KL}(P\,\|\,Q) = \frac{1}{n}\sum_{i=1}^{n}\log\hat{L}(x_i) \;-\; \log\Big(\frac{1}{m}\sum_{j=1}^{m}\hat{L}(x'_j)\Big). \qquad (3)$$
We obtain an estimate of mutual information from (3) as $\hat{I}(X;Y) = \hat{D}_{KL}\big(p(x,y)\,\|\,p(x)p(y)\big)$. This classifier-based estimator for MI (Classifier-MI) has the following theoretical properties under Assumptions (A1)-(A4) (stated in Section 9).
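Putting the pieces together, Classifier-MI can be sketched end-to-end in a few dozen lines. The code below is our illustration, not the authors' implementation: it swaps the MLP for a plain-numpy logistic regression over quadratic features (needed so the model can represent the Gaussian log-likelihood ratio, which a purely linear classifier cannot, as discussed in Section 3.2), and for brevity it trains and evaluates on the same samples:

```python
import numpy as np

def fit_logistic(F, labels, lr=0.3, steps=15000):
    """Full-batch gradient descent on the binary cross-entropy loss."""
    w = np.zeros(F.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-F @ w))
        w -= lr * F.T @ (p - labels) / len(labels)
    return w

def feats(a, b):
    # quadratic features: enough to express the Gaussian log-likelihood ratio
    return np.column_stack([np.ones_like(a), a, b, a * a, b * b, a * b])

rng = np.random.default_rng(1)
n, rho = 4000, 0.8
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=n)   # corr(x, y) = rho
y_perm = rng.permutation(y)                                # samples from p(x)p(y)

F = np.vstack([feats(x, y), feats(x, y_perm)])
labels = np.r_[np.ones(n), np.zeros(n)]                    # 1 ~ joint, 0 ~ product
w = fit_logistic(F, labels)

log_ratio = lambda a, b: feats(a, b) @ w                   # log-odds = log-likelihood ratio
mi_hat = np.mean(log_ratio(x, y)) - np.log(np.mean(np.exp(log_ratio(x, y_perm))))
print(mi_hat)  # true MI = -0.5*log(1 - rho^2), about 0.51
```

The classifier's log-odds serve directly as the estimated $\log\hat{L}$, which is plugged into the Donsker-Varadhan form (3).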
Theorem 1.
Under Assumptions (A1)-(A4), Classifier-MI is consistent, i.e., given $\varepsilon, \delta > 0$, there exists $n \in \mathbb{N}$ such that with probability at least $1 - \delta$ we have $|\hat{I}(X;Y) - I(X;Y)| \le \varepsilon$.
Proof.
Here, we provide a sketch of the proof. The classifier is trained to minimize the binary cross-entropy (BCE) loss on the training set, obtaining the minimizer $\hat{\gamma}$. By the generalization bound of the classifier, the loss value of $\hat{\gamma}$ on the test set is close to the loss obtained by the best classifier in the family, which itself is close to the global minimizer of the BCE (as a function $\gamma$) by the Universal Function Approximation Theorem for neural networks (Hornik et al. 1989).
The BCE loss is strongly convex in $\gamma$, so closeness in loss links $\hat{\gamma}$ to the global minimizer $\gamma^{*}$, and hence the plug-in estimate converges to the true MI. ∎
While consistency characterizes the estimator in the large-sample regime, it is not clear what guarantees we obtain for finite samples. The following theorem shows that even for a small number of samples, the produced MI estimate is a true lower bound on the mutual information with high probability.
Theorem 2.
Under Assumptions (A1)-(A4), the finite-sample estimate from Classifier-MI is a lower bound on the true MI value with high probability: given $n$ test samples, for any $\gamma \in (0,1)$ we have
$$\Pr\Big(I(X;Y) \ge \hat{I}(X;Y) - O\big(\sqrt{\log(1/\gamma)/n}\,\big)\Big) \ge 1 - \gamma,$$
where the constant hidden in the $O(\cdot)$ is independent of $n$ and the dimension of the data.
3.2 Probability Calibration
The estimation of likelihood ratio from classifier predictions
hinges on the fact that the classifier is wellcalibrated. As a rule of thumb, classifiers trained directly on the cross entropy loss are wellcalibrated. But boosted decision trees would introduce distortions in the likelihoodratio estimates. There is an extensive literature devoted to obtaining better calibrated classifiers that can be used to improve the estimation further
(Lakshminarayanan et al. 2017; NiculescuMizil and Caruana 2005; Guo et al. 2017). We experimented with Gradient Boosted Decision Trees and multilayer perceptron trained on the logloss in our algorithms. Multilayer perceptron gave better estimates and so is used in all the experiments. Supplementary Figures show that the neural networks used in our estimators are wellcalibrated.
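A quick way to check calibration, as in the supplementary figures mentioned above, is a reliability summary: bin the predicted probabilities and compare each bin's mean prediction with its empirical label frequency. A minimal sketch of ours, applied to a synthetically perfect predictor:

```python
import numpy as np

def reliability(probs, labels, n_bins=10):
    """(mean predicted prob, empirical frequency, count) per confidence bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(probs, edges) - 1, 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        m = idx == b
        if m.any():
            rows.append((probs[m].mean(), labels[m].mean(), int(m.sum())))
    return rows

# A perfectly calibrated predictor: labels drawn as Bernoulli(probs),
# so each bin's frequency matches its mean predicted probability.
rng = np.random.default_rng(0)
probs = rng.uniform(size=200_000)
labels = (rng.uniform(size=probs.size) < probs).astype(float)
for conf, freq, cnt in reliability(probs, labels):
    print(f"{conf:.2f} {freq:.2f} {cnt}")
```

For a miscalibrated classifier the two columns diverge, and that divergence propagates multiplicatively into the likelihood-ratio estimate $\gamma/(1-\gamma)$.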
Even though logistic regression is well-calibrated and might seem an attractive candidate for classification in sparse-sample regimes, we show that linear classifiers cannot be used to estimate mutual information by the two-sample approach. For this, we consider the simple setting of estimating the mutual information of two correlated Gaussian random variables as a counter-example.

Lemma 1.
A linear classifier with marginal features fails at classifier two-sample MI estimation.
Proof.
Consider two correlated Gaussians in $2$ dimensions, $(X, Y) \sim \mathcal{N}\!\big(0, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\big)$, where $\rho$ is the Pearson correlation. The marginals are standard Gaussians, $\mathcal{N}(0, 1)$. Suppose we are trying to estimate the mutual information $I(X;Y) = -\frac{1}{2}\log(1-\rho^2)$. The classifier decision boundary would seek the set where $p(x,y) = p(x)p(y)$; the log-likelihood ratio depends on the features only through the product $xy$, so no linear function of $(x, y)$ can represent this boundary. ∎
The decision boundary is a rectangular hyperbola ($xy = c$). Restricted to linear boundaries, the classifier would return $\gamma \approx 0.5$ as its prediction for either class (leading to $\hat{I} \approx 0$), even when $X$ and $Y$ are highly correlated and the mutual information is high.
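This failure is easy to reproduce numerically (our illustration): a logistic regression on the raw marginal features $(x, y)$ cannot separate joint from product samples, because the class means and marginals coincide; the learned weights stay near zero and the plug-in Donsker-Varadhan estimate collapses to about $0$ despite strong correlation:

```python
import numpy as np

def fit_logistic(F, labels, lr=0.3, steps=2000):
    """Full-batch gradient descent on the binary cross-entropy loss."""
    w = np.zeros(F.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-F @ w))
        w -= lr * F.T @ (p - labels) / len(labels)
    return w

rng = np.random.default_rng(0)
n, rho = 4000, 0.8
x = rng.normal(size=n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
y_perm = rng.permutation(y)                    # samples from p(x)p(y)

feats = lambda a, b: np.column_stack([np.ones_like(a), a, b])   # linear only
F = np.vstack([feats(x, y), feats(x, y_perm)])
labels = np.r_[np.ones(n), np.zeros(n)]
w = fit_logistic(F, labels)

log_ratio = lambda a, b: feats(a, b) @ w
mi_hat = np.mean(log_ratio(x, y)) - np.log(np.mean(np.exp(log_ratio(x, y_perm))))
print(mi_hat)  # ~0, though the true MI = -0.5*log(1-rho^2) is about 0.51
```

Adding the product feature $xy$ (or using a non-linear classifier such as an MLP) restores the ability to represent the hyperbolic boundary.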
We first use the classifier two-sample estimator to compute the mutual information of two correlated Gaussians (the setting of Belghazi et al. 2018) across sample sizes. This setting also provides a way to choose reasonable hyper-parameters, which are then used throughout all the synthetic experiments. We also plot the estimates of f-MINE and KSG to ensure that we are able to make them work in simple settings. In this toy setting, all estimators estimate the mutual information accurately, as shown in Figure 1.
3.3 Modular Approach to CMI Estimation
Our classifier-based divergence estimator does not encounter an optimization problem involving exponentials. MINE optimizing (1) has biased gradients, while the estimator based on (2) uses a weaker lower bound (Belghazi et al. 2018). By contrast, our classifier is trained on the cross-entropy loss, which has unbiased gradients; we then plug the likelihood-ratio estimates into the tighter Donsker-Varadhan bound (1), thereby achieving the best of both worlds. Equipped with a KL-divergence estimator, we can now couple it with the generators of Section 2.1.1, or use the expression of CMI as a difference of two MIs (which we address from now on as MI-Diff.). Algorithm 1 describes CMI estimation by tying together the generator and classifier blocks. For MI-Diff., the function block “Classifier” in Algorithm 1 has to be used twice: once for estimating $I(X;Y,Z)$ and once for $I(X;Z)$. For mutual information, the second set of samples required by “Classifier” is obtained by permuting the samples of one variable.
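The modularity of MI-Diff. is easy to see in code: it only needs an MI primitive applied twice. A toy sketch of ours with discrete data and a simple plug-in MI estimator (standing in for the classifier block) recovers the chain rule $I(X;Y|Z) = I(X;(Y,Z)) - I(X;Z)$:

```python
import numpy as np

def mi_plugin(a, b):
    """Plug-in MI (nats) for 1-D integer-valued samples a, b."""
    ja = np.unique(a, return_inverse=True)[1]
    jb = np.unique(b, return_inverse=True)[1]
    joint = np.zeros((ja.max() + 1, jb.max() + 1))
    np.add.at(joint, (ja, jb), 1.0)
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def cmi_diff(x, y, z, mi=mi_plugin):
    yz = y * (z.max() + 1) + z        # encode the pair (y, z) as one symbol
    return mi(x, yz) - mi(x, z)

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 2, n)
z = rng.integers(0, 2, n)
y = x ^ z          # XOR: X and Y are independent marginally, dependent given Z
print(cmi_diff(x, y, z))   # ~log 2 = 0.693 nats
```

Swapping `mi_plugin` for the classifier-based estimator gives the continuous, high-dimensional version described above.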
For the classifier coupled with a generator, the generated distribution $g(x|z)$ may deviate from the target $p(x|z)$, introducing a different kind of bias. The following lemma shows how such a bias can be corrected by subtracting the KL divergence of the sub-tuple $(X, Z)$ from the divergence of the entire triple $(X, Y, Z)$. We note that such a clean relationship does not hold for general divergence measures; indeed, more sophisticated conditions are required for the total-variation metric (Sen et al. 2018).

Lemma 2 (Bias Cancellation).
The estimation error due to an incorrect generated distribution $g(x|z)$ can be accounted for using the following relation:
$$I(X;Y|Z) = D_{KL}\big(p(x,y,z)\,\|\,g(x|z)\,p(y,z)\big) - D_{KL}\big(p(x,z)\,\|\,g(x|z)\,p(z)\big).$$
[Algorithm 1: CCMI — couple a generator for $p(x|z)$ with the classifier-based divergence estimator.]
4 Experimental Results
In this section, we compare the performance of various estimators on the CMI estimation task. We used the classifier-based divergence estimator (Section 3) and MINE in our experiments. Belghazi et al. (2018) proposed two MINE variants, namely Donsker-Varadhan (DV) MINE and f-MINE. The f-MINE has unbiased gradients, and we found it to have performance similar to DV-MINE, albeit with lower variance. So we used f-MINE in all our experiments.
The “generator”+“divergence estimator” notation will be used to denote the various estimators. For instance, if we use CVAE for generation and couple it with f-MINE, we denote the estimator as CVAE+f-MINE. When coupled with the classifier-based divergence block, it is denoted CVAE+Classifier. For MI-Diff. we similarly write MI-Diff.+“divergence estimator”.
We compare our estimators with the widely used KSG estimator.¹ (¹The CMI-estimator implementation in the Non-parametric Entropy Estimation Toolbox, https://github.com/gregversteeg/NPEET, is used.) For f-MINE, we used the code provided to us by the authors (Belghazi et al. 2018). The same hyper-parameter setting is used across all our synthetic datasets for all estimators (including generators and divergence blocks); the Supplementary contains the hyper-parameter values. For KSG, we vary the neighbor parameter $k$ and report the results for the best $k$ for each dataset.
4.1 Linear Relations
We start with the simple setting where the three random variables $X$, $Y$, $Z$ are related in a linear fashion. We consider two linear models: in Model I, $Y$ depends (besides $X$) only on the first coordinate $Z_1$ of the conditioning set, while in Model II the coordinates of the conditioning set combine linearly, through a random unit-norm vector, to produce $Y$.
Here $Z \sim \mathcal{U}(a, b)^{d_z}$ means that each coordinate of $Z$ is drawn i.i.d. from a uniform distribution between $a$ and $b$; similar notation is used for the Gaussian, $\mathcal{N}(\mu, \sigma^2)$. $Z_1$ is the first dimension of $Z$. The unit-norm random vector of mixing coefficients is drawn once and kept constant for all $n$ points during dataset preparation.

As is common in the literature on causal discovery and independence testing (Sen et al. 2017; Doran et al. 2014), the dimensions of $X$ and $Y$ are kept as $1$, while $d_z$ can scale. Our estimators are general enough to accommodate multi-dimensional $X$ and $Y$ by considering the concatenated vectors $(X, Z)$ and $(Y, Z)$. This has applications in learning interactions between modules in Bayesian networks (Segal et al. 2005) or dependence between groups of variables (Entner and Hoyer 2012; Parviainen and Kaski 2016), such as distinct functional groups of proteins/genes instead of individual entities. Both linear models are representative of problems encountered in the graphical-models and independence-testing literature. In Model I, the conditioning set can keep growing with independent variables $Z_i$, while $Y$ depends only on $X$ and $Z_1$. In Model II, the variables in the conditioning set combine linearly to produce $Y$. It is also easy to obtain the ground-truth CMI value in such models by numerical integration.
For both models, we generate datasets with a varying number of samples $n$ and varying dimension $d_z$ to study their effect on estimator performance: the sample size is varied with $d_z$ held fixed, and $d_z$ is varied with the sample size held fixed.
Several observations stand out from the experiments. (1) KSG estimates are accurate at very low dimension but degrade drastically with increasing $d_z$, even when the conditioning variables are completely independent and do not influence $X$ and $Y$ (Model I). (2) Increasing the sample size does not rescue KSG once the dimension is even moderate; the dimension issue is more acute than sample scarcity. (3) The estimates from f-MINE deviate more from the truth at low sample sizes, and at high dimensions the instability shows clearly when the estimate suddenly goes negative (truncated in the plots to maintain their scale). (4) All our classifier-based estimators obtain reasonable estimates even at high dimensions, with MI-Diff.+Classifier performing the best.
4.2 Non-Linear Relations
Here, we study models where the underlying relations between $X$, $Y$ and $Z$ are non-linear. The non-linear bounded functions $f_1$ and $f_2$ are drawn uniformly at random from a fixed function family for each dataset. $U$ is a random unit-norm vector whose entries are drawn at random and then normalized; once generated, $U$ is kept fixed for a particular dataset. As before, $d_x = d_y = 1$ while $d_z$ can scale. The noise variables are drawn i.i.d. Gaussian.

We vary the noise level for each dimension $d_z$, and the dimension itself is then varied, giving rise to a collection of datasets indexed by increasing $d_z$.
Obtaining Ground Truth: Since it is not possible to obtain the ground-truth CMI value in such complicated settings through a closed-form expression, we resort to the relation $I(X;Y|Z) = I(X;Y|W)$, where $W = U^{T}Z$: the dependence of $Y$ on $Z$ is completely captured once $W$ is given. The triple $(X, Y, W)$ is only $3$-dimensional, so its CMI can be estimated accurately using KSG. We generate a large number of samples separately for each dataset to estimate $I(X;Y|W)$ and use it as the ground truth.
We observed behavior similar to the linear models for our estimators in the non-linear setting. (1) KSG continues to return low estimates, although in this setup the true CMI values are themselves low. (2) Up to moderate dimensions, we find all our estimators closely tracking the ground truth, but in higher dimensions they lose accuracy. (3) MI-Diff.+Classifier is again the best estimator, giving CMI estimates close to the ground truth even at the highest dimensions tested.
From the above experiments, we find MI-Diff.+Classifier to be the most accurate and stable estimator. We use this combination for our downstream applications and henceforth refer to it as CCMI.
5 Application to Conditional Independence Testing
As a testimony to accurate CMI estimation, we apply CCMI to the problem of conditional independence testing (CIT). Here, we are given samples $\{(x_i, y_i, z_i)\}_{i=1}^{n}$ from a distribution $p(x,y,z)$. The hypothesis test in CIT is to distinguish the null $H_0: X \perp Y \mid Z$ from the alternative $H_1: X \not\perp Y \mid Z$.
We seek to design a CIT tester using CMI estimation, based on the fact that $I(X;Y|Z) = 0 \iff X \perp Y \mid Z$. A simple approach is to reject the null if $\hat{I}(X;Y|Z) > 0$ and accept it otherwise; the CMI estimate can thus serve as a proxy for the $p$-value. CIT based on CMI estimation has been studied by Runge (2018), where the author uses KSG for CMI estimation and a $k$-NN based permutation scheme to generate a $p$-value: the $p$-value is computed as the fraction of permuted datasets whose CMI estimate exceeds that of the original dataset. The same approach could be adopted for CCMI to obtain a $p$-value. But since we report the AuROC (area under the receiver operating characteristic curve), raw CMI estimates suffice.
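Since AuROC needs only a ranking of the scores, it can be computed directly from the CMI estimates as the probability that a dependent (non-CI) dataset's score exceeds a conditionally independent dataset's score; this rank form equals the area under the ROC curve. A small sketch of ours:

```python
import numpy as np

def auroc(scores_pos, scores_neg):
    """P(score of a positive > score of a negative), ties counted half."""
    pos = np.asarray(scores_pos, float)[:, None]
    neg = np.asarray(scores_neg, float)[None, :]
    return float(np.mean(pos > neg) + 0.5 * np.mean(pos == neg))

# e.g. CMI estimates on dependent (non-CI) vs conditionally independent datasets
print(auroc([0.4, 0.9, 0.3], [0.01, -0.02, 0.05]))  # 1.0: perfectly separated
```

An AuROC of $0.5$ corresponds to random guessing, which is the behavior the experiments below report for CCIT at high dimension.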
5.1 Post Non-Linear Noise: Synthetic Data
In this experiment, we generate data based on the post non-linear noise model, similar to Sen et al. (2017). As before, $X$ and $Y$ are one-dimensional while $Z$ can scale in dimension; the data are generated by passing noisy linear mixtures of $Z$ (and, under the alternative, of $X$) through bounded non-linearities.

The entries of the random mixing vectors (matrices when the dimensions exceed one) are drawn at random, and the vectors are normalized to have unit norm. This differs from the implementation in Sen et al. (2017), where the mixing constant is the same in all datasets; by varying it, we obtain a tougher problem in which the true CMI value can be quite low even for a dependent dataset.
The mixing parameters are kept constant for generating the points of a single dataset and are varied across datasets. We vary $d_z$ and simulate multiple datasets for each dimension. Our algorithm is compared with the state-of-the-art CIT tester of Sen et al. (2017), known as CCIT. We used the implementation provided by the authors² (²https://github.com/rajatsen91/CCIT) and ran CCIT with bootstrapping. For each dataset collection, an AuROC value is obtained. Figure 4 shows the mean AuROC values over multiple runs for both testers as $d_z$ varies. While both algorithms perform accurately at low dimension, the performance of CCIT degrades as the dimension grows, eventually approaching random guessing, whereas CCMI retains a high mean AuROC even at the largest tested dimension.
Since the AuROC metric sweeps over thresholds to find the best performance, it is not immediately clear what precision and recall are obtained from CCMI when we threshold the CMI estimate at $0$ (rejecting or accepting the null based on it). So we plotted the histogram of CMI estimates separately for the CI and non-CI datasets at a fixed dimension. Figure 3(b) shows a clear demarcation of CMI estimates between the two dataset categories, and choosing the threshold as $0$ gave high precision and recall.

5.2 Flow-Cytometry: Real Data
To extend our estimator beyond simulated settings, we use CMI estimation to test for conditional independence in the protein network data used in Sen et al. (2017). The consensus graph in Sachs et al. (2005) is used as the ground truth, and we obtained CI and non-CI relations from this Bayesian network. The basic philosophy is that a protein is independent of all other proteins in the network given its parents, children, and parents of its children (its Markov blanket). Moreover, for the non-CI cases, we note that a direct edge between two proteins would never render them conditionally independent, so the conditioning set can be chosen at random from the other proteins. These two settings are used to obtain the CI and non-CI datasets. The number of samples in each dataset is small, and the dimension of $Z$ varies across datasets.
For the flow-cytometry data, since the number of samples is very small, we train the classifier for fewer epochs to prevent over-fitting, keeping every other hyper-parameter the same. CCMI is compared with CCIT on the real data, and the mean AuROC curves over multiple runs are plotted in Figure 5. The superior performance of CCMI over CCIT is retained in the sparse-data regime.

6 Conclusion and Future Directions
In this work, we explored various CMI estimators by drawing on recent advances in generative models and classifiers. We proposed a new divergence estimator based on classifier two-sample estimation and built several conditional mutual information estimators using this primitive. We demonstrated their efficacy in a variety of practical settings. Future work will aim to approximate the null distribution of CCMI so that we can efficiently compute $p$-values for the conditional independence testing problem.
7 Acknowledgments
This work was supported by NSF awards 1651236 and 1703403 and NIH grant 5R01HG008164.
References
 Beirlant et al. (1997) Jan Beirlant, Edward J Dudewicz, László Györfi, and Edward C Van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17–39, 1997.

 Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, 2018.
 Doran et al. (2014) Gary Doran, Krikamol Muandet, Kun Zhang, and Bernhard Schölkopf. A permutation-based kernel conditional independence test. In 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014), 2014.
 Entner and Hoyer (2012) Doris Entner and Patrik O Hoyer. Estimating a causal order among groups of variables in linear models. In International Conference on Artificial Neural Networks, pages 84–91. Springer, 2012.
 Fleuret (2004) François Fleuret. Fast binary feature selection with conditional mutual information. Journal of Machine learning research, 5(Nov):1531–1555, 2004.
 Frenzel and Pompe (2007) Stefan Frenzel and Bernd Pompe. Partial mutual information for coupling analysis of multivariate time series. Physical review letters, 99(20):204101, 2007.
 Gao et al. (2015) Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pages 277–286, 2015.
 Gao et al. (2016) Weihao Gao, Sewoong Oh, and Pramod Viswanath. Breaking the bandwidth barrier: Geometrical adaptive entropy estimation. In Advances in Neural Information Processing Systems, pages 2460–2468, 2016.
 Gao et al. (2017) Weihao Gao, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Estimating mutual information for discretecontinuous mixtures. In Advances in Neural Information Processing Systems, pages 5988–5999, 2017.
 Gao et al. (2018) Weihao Gao, Sewoong Oh, and Pramod Viswanath. Demystifying fixed nearest neighbor information estimators. IEEE Transactions on Information Theory, 64(8):5629–5661, 2018.
 Giorgi et al. (2014) Federico M Giorgi, Gonzalo Lopez, Jung H Woo, Brygida Bisikirska, Andrea Califano, and Mukesh Bansal. Inferring protein modulation from gene expression data using conditional mutual information. PloS one, 9(10):e109569, 2014.
 Goodfellow et al. (2014) Ian Goodfellow, Jean PougetAbadie, Mehdi Mirza, Bing Xu, David WardeFarley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
 Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1321–1330. JMLR. org, 2017.
 Hlinka et al. (2013) Jaroslav Hlinka, David Hartman, Martin Vejmelka, Jakob Runge, Norbert Marwan, Jürgen Kurths, and Milan Paluš. Reliability of inference of directed climate networks using conditional mutual information. Entropy, 15(6):2023–2045, 2013.
 Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
 Jiao et al. (2018) Jiantao Jiao, Weihao Gao, and Yanjun Han. The nearest neighbor information estimator is adaptively near minimax rateoptimal. In Advances in neural information processing systems, 2018.
 Kandasamy et al. (2015) Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, et al. Nonparametric von mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, pages 397–405, 2015.
 Kingma and Welling (2013) Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kozachenko and Leonenko (1987) LF Kozachenko and Nikolai N Leonenko. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.
 Kraskov et al. (2004) Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical review E, 69(6):066138, 2004.
 Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
Lee (2010) Intae Lee. Sample-spacings-based density and entropy estimators for spherically invariant multidimensional data. Neural Computation, 22(8):2208–2227, 2010.
 Leśniewicz (2014) Marek Leśniewicz. Expected entropy as a measure and criterion of randomness of binary sequences. Przegląd Elektrotechniczny, 90(1):42–46, 2014.
 Li et al. (2011) Zhaohui Li, Gaoxiang Ouyang, Duan Li, and Xiaoli Li. Characterization of the causality between spike trains with permutation conditional mutual information. Physical Review E, 84(2):021929, 2011.
Liang and Wang (2008) Kuo-Ching Liang and Xiaodong Wang. Gene regulatory network reconstruction using conditional mutual information. EURASIP Journal on Bioinformatics and Systems Biology, 2008(1):253894, 2008.
 Loeckx et al. (2010) Dirk Loeckx, Pieter Slagmolen, Frederik Maes, Dirk Vandermeulen, and Paul Suetens. Nonrigid image registration using conditional mutual information. IEEE transactions on medical imaging, 29(1):19–29, 2010.
Lopez-Paz and Oquab (2016) David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.
 Miller (2003) Erik G Miller. A new class of entropy estimators for multidimensional densities. In Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03). 2003 IEEE International Conference on, volume 3, pages III–297. IEEE, 2003.
 Mirza and Osindero (2014) Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
Mohri et al. (2018) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT Press, 2018.
 Nemenman et al. (2002) Ilya Nemenman, Fariel Shafee, and William Bialek. Entropy and inference, revisited. In Advances in neural information processing systems, pages 471–478, 2002.
Nguyen et al. (2008) Xuan-Long Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in neural information processing systems, pages 1089–1096, 2008.
 Niculescu-Mizil and Caruana (2005) Alexandru Niculescu-Mizil and Rich Caruana. Obtaining calibrated probabilities from boosting. In UAI, 2005.
 Pál et al. (2010) Dávid Pál, Barnabás Póczos, and Csaba Szepesvári. Estimation of rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Advances in Neural Information Processing Systems, pages 1849–1857, 2010.
 Parviainen and Kaski (2016) Pekka Parviainen and Samuel Kaski. Bayesian networks for variable groups. In Conference on Probabilistic Graphical Models, pages 380–391, 2016.
Poole et al. Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A Alemi, and George Tucker. On variational lower bounds of mutual information.
 Rahimzamani et al. (2018) Arman Rahimzamani, Himanshu Asnani, Pramod Viswanath, and Sreeram Kannan. Estimators for multivariate information measures in general probability spaces. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.
Runge (2018) Jakob Runge. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018.
 Sachs et al. (2005) Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.
 Segal et al. (2005) Eran Segal, Dana Pe’er, Aviv Regev, Daphne Koller, and Nir Friedman. Learning module networks. Journal of Machine Learning Research, 6(Apr):557–588, 2005.
Sen et al. (2017) Rajat Sen, Ananda Theertha Suresh, Karthikeyan Shanmugam, Alexandros G Dimakis, and Sanjay Shakkottai. Model-powered conditional independence test. In Advances in Neural Information Processing Systems, pages 2951–2961, 2017.
 Sen et al. (2018) Rajat Sen, Karthikeyan Shanmugam, Himanshu Asnani, Arman Rahimzamani, and Sreeram Kannan. Mimic and classify: A meta-algorithm for conditional independence testing. arXiv preprint arXiv:1806.09708, 2018.
 Singh et al. (2003) Harshinder Singh, Neeraj Misra, Vladimir Hnizdo, Adam Fedorowicz, and Eugene Demchuk. Nearest neighbor estimates of entropy. American journal of mathematical and management sciences, 23(3-4):301–321, 2003.
 Singh and Póczos (2014) Shashank Singh and Barnabás Póczos. Exponential concentration of a density functional estimator. In Advances in Neural Information Processing Systems, pages 3032–3040, 2014.
Singh and Póczos (2016) Shashank Singh and Barnabás Póczos. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In Advances in Neural Information Processing Systems, pages 1217–1225, 2016.
 Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems 28. 2015.
 Sricharan et al. (2012) Kumar Sricharan, Raviv Raich, and Alfred O Hero. Estimation of nonlinear functionals of densities with confidence. IEEE Transactions on Information Theory, 58(7):4135–4159, 2012.
 Sricharan et al. (2013) Kumar Sricharan, Dennis Wei, and Alfred O Hero. Ensemble estimators for multivariate entropy estimation. IEEE transactions on information theory, 59(7):4374–4388, 2013.
 Suzuki et al. (2008) Taiji Suzuki, Masashi Sugiyama, Jun Sese, and Takafumi Kanamori. Approximating mutual information by maximum likelihood density ratio estimation. In New challenges for feature selection in data mining and knowledge discovery, pages 5–20, 2008.
 Vejmelka and Paluš (2008) Martin Vejmelka and Milan Paluš. Inferring the directionality of coupling with conditional mutual information. Physical Review E, 77(2):026214, 2008.
8 Supplementary
8.1 Hyperparameters
We provide the experimental settings and hyperparameters for ease of reproducibility of the results.
Hyperparameter  Value
Hidden Units  64
# Hidden Layers  2 (Inp-64-64-Out)
Activation  ReLU
Batch Size  
Learning Rate  
Optimizer  Adam
()  
# Epochs  
Regularizer  L2 ()
Hyperparameter  Value
Hidden Units  256
# Hidden Layers  2 (Inp-256-256-Out)
Activation  Leaky ReLU()
Batch Size  
Learning Rate  
Optimizer  Adam
()  
# Epochs  
Noise dimension  
Noise distribution  
Hyperparameter  Value
Hidden Units  256
# Hidden Layers  2 (Inp-256-256-Out)
Activation  Leaky ReLU()
Batch Size  
Learning Rate  
Optimizer  Adam
()  
# Epochs  
Dropout  
Latent dimension  
Hyperparameter  Value
Hidden Units  64
# Hidden Layers  1 (Inp-64-Out)
Activation  ReLU
Batch Size  ( for DV-MINE)
Learning Rate  
Optimizer  Adam
()  
# Epochs  
Tester  AuROC 
CCIT  
CCMI 
8.2 Calibration Curve
While Niculescu-Mizil and Caruana (2005) showed that neural networks for binary classification produce well-calibrated outputs, Guo et al. (2017) found miscalibration in deep networks with batch normalization and no L2 regularization. In our experiments, the classifier is shallow, consisting of only two hidden layers with a relatively small number of hidden units. There is no batch normalization or dropout. Instead, we use L2 regularization, which was shown in Guo et al. (2017) to be favorable for calibration. Figure 6 shows that our classifiers are well-calibrated.
8.3 Choosing Optimal Hyperparameter
The Donsker-Varadhan representation (1) is a lower bound on the true MI (which is the supremum over all functions). So, for any classifier parameter, the plug-in estimate computed on the test samples will be less than or equal to the true value with high probability (Theorem 2). We illustrate this using estimation of MI for correlated Gaussians in Figure 7. The estimated values lie below the true values of MI. Thus, the optimal hyperparameter is the one that returns the maximum value of the MI estimate on the test set.
Once we have this block that returns the maximum MI estimate after searching over hyperparameters, the CMI estimate in CCMI is obtained as the difference of two MI estimates, calling this block twice.
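The selection logic above can be sketched as follows. This is a minimal illustration, not the paper's code: `train_and_estimate_mi` is a hypothetical stand-in for training the classifier with a given hyperparameter and evaluating the test-set MI estimate, and the CMI is formed as the difference I(X;Y,Z) - I(X;Z) of two such maxima.

```python
# Sketch: each plug-in estimate is (w.h.p.) a lower bound on the true MI,
# so we keep the hyperparameter that maximizes the test-set estimate.
# `train_and_estimate_mi` is a hypothetical stand-in, not the paper's API.

def best_mi_estimate(data, hyperparams, train_and_estimate_mi):
    """Search hyperparameters; return the largest test-set MI estimate."""
    return max(train_and_estimate_mi(data, hp) for hp in hyperparams)

def ccmi_estimate(xyz, xz, hyperparams, train_and_estimate_mi):
    """CMI as a difference of two MI estimates, calling the block twice."""
    return (best_mi_estimate(xyz, hyperparams, train_and_estimate_mi)
            - best_mi_estimate(xz, hyperparams, train_and_estimate_mi))

# Toy usage: a fake estimator that looks up precomputed estimates.
fake = {("xyz", 64): 0.8, ("xyz", 256): 0.9, ("xz", 64): 0.3, ("xz", 256): 0.25}
est = lambda data, hp: fake[(data, hp)]
cmi = ccmi_estimate("xyz", "xz", [64, 256], est)  # 0.9 - 0.3 = 0.6
```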
We also plot the AuROC curves for the two choices of the number of hidden units on the flow-cytometry data (Figure 8(b)) and the post-nonlinear noise synthetic data (Figure 8(a)). When the number of samples is high, the estimates are fairly robust to the hyperparameter choice (Figures 7(b), 8(a)). But in the sparse-sample regime, a proper choice of hyperparameter can improve performance (Figure 8(b)).
8.4 Additional Figures and Tables

For the Flow-Cytometry dataset, we used number of hidden units = for the Classifier and trained for epochs. Table 6 shows the mean AuROC values for the two CIT testers.

Figure 8 shows the distribution of points from the two classes (the joint and the product of marginals). Here the classifier would return the same prediction of 0.5 for either class (leading to an MI estimate of 0), even though the two variables are highly correlated and the mutual information is high.
9 Theoretical Properties of CCMI
In this section, we explore some of the theoretical properties of CCMI. Let the samples drawn from the distribution $p$ be labeled $y = 1$ and those drawn from $q$ be labeled $y = 0$. Let $n$ denote the number of samples. The positive-label probability for a given point $x$ is denoted as $\gamma(x) = \Pr(y = 1 \mid x)$. When the prediction is from a classifier with parameter $\theta$, it is denoted as $\gamma_\theta(x)$. The argument of $\gamma$ is dropped when it is understood from the context.
The following assumptions are used throughout this Section.

Assumption (A1): The underlying data distributions $p$ and $q$ admit densities supported on a compact subset $\mathcal{X} \subset \mathbb{R}^d$.

Assumption (A2): There exist constants $\beta_l, \beta_u$ such that $0 < \beta_l \le p(x), q(x) \le \beta_u < \infty$ for all $x \in \mathcal{X}$.

Assumption (A3): We clip the predictions in the algorithm such that $\tau \le \gamma_\theta(x) \le 1 - \tau$, with $\tau > 0$.

Assumption (A4): The classifier class is parameterized by $\theta$ in some compact domain $\Theta \subset \mathbb{R}^p$. There exists a constant $L$ such that $\|\theta\| \le L$ for all $\theta \in \Theta$, and the output of the classifier is Lipschitz with respect to the parameters $\theta$.
Notation and Computation Procedure

In the case of mutual information estimation, a data point represents the concatenated pair $(x, y)$. To be precise, samples from the joint distribution $p(x, y)$ are labeled $1$ and samples from the product of marginals $p(x)p(y)$ are labeled $0$.

In the proofs below, we need to compute the Lipschitz constant for various functions. The general procedure for those computations is as follows:
we bound the Lipschitz constant using $|f(a) - f(b)| \le \sup_c |f'(c)| \, |a - b|$. The functions encountered in the proofs are continuous, differentiable and have bounded domains.
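The procedure can be illustrated for $f = \log$ on the clipped range $[\tau, 1 - \tau]$, where $\sup|f'| = 1/\tau$, so $\log$ is $(1/\tau)$-Lipschitz there. The value of $\tau$ below is an illustrative choice, not the paper's setting.

```python
# Lipschitz constant via the derivative bound: |log a - log b| <= (1/tau)|a - b|
# on [tau, 1 - tau], since sup of |d/dg log(g)| = 1/g equals 1/tau there.
import math

tau = 0.1  # illustrative clipping level
lip = 1.0 / tau

# Spot-check the Lipschitz bound on a grid of pairs in the domain.
grid = [tau + i * (1 - 2 * tau) / 50 for i in range(51)]
ok = all(abs(math.log(a) - math.log(b)) <= lip * abs(a - b) + 1e-12
         for a in grid for b in grid)
```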

The binary cross-entropy loss estimated from $n$ samples is
$$\hat{\mathcal{L}}_n(\gamma_\theta) = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_i \log \gamma_\theta(x_i) + (1 - y_i)\log\big(1 - \gamma_\theta(x_i)\big)\Big] \qquad (4)$$
When computed on the train samples (resp. test samples), it is denoted as $\hat{\mathcal{L}}_{\mathrm{train}}$ (resp. $\hat{\mathcal{L}}_{\mathrm{test}}$). The population mean over the joint distribution of data and labels is
$$\mathcal{L}(\gamma_\theta) = -\,\mathbb{E}_{(x,y)}\Big[y \log \gamma_\theta(x) + (1 - y)\log\big(1 - \gamma_\theta(x)\big)\Big] \qquad (5)$$
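The empirical loss in (4) can be computed directly; here is a stdlib-only sketch, with predictions clipped in the spirit of Assumption (A3). The clipping level `tau` is an illustrative default, not the paper's setting.

```python
# Empirical binary cross-entropy, as in equation (4).
import math

def bce_loss(preds, labels, tau=1e-3):
    """Mean binary cross-entropy; preds are the gamma_theta(x_i)."""
    total = 0.0
    for g, y in zip(preds, labels):
        g = min(max(g, tau), 1.0 - tau)  # clip predictions to [tau, 1 - tau]
        total += -(y * math.log(g) + (1 - y) * math.log(1.0 - g))
    return total / len(preds)

# Confident, correct predictions give a small loss.
loss = bce_loss([0.9, 0.1], [1, 0])  # = -log(0.9) ~ 0.105
```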
The estimate of MI from the test samples for classifier parameter $\theta$ is given by
$$\hat{I}_n(\theta) = \frac{1}{n}\sum_{i : y_i = 1} \log \frac{\gamma_\theta(x_i)}{1 - \gamma_\theta(x_i)} \;-\; \log\left(\frac{1}{n}\sum_{i : y_i = 0} \frac{\gamma_\theta(x_i)}{1 - \gamma_\theta(x_i)}\right)$$
The population estimate for classifier parameter $\theta$ is given by
$$I(\theta) = \mathbb{E}_{p}\left[\log \frac{\gamma_\theta(x)}{1 - \gamma_\theta(x)}\right] - \log \mathbb{E}_{q}\left[\frac{\gamma_\theta(x)}{1 - \gamma_\theta(x)}\right]$$
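The plug-in computation can be sketched from classifier predictions alone (our variable names, not the paper's code); the likelihood ratio $\gamma/(1-\gamma)$ is the relation established in Lemma 3 below.

```python
# Donsker-Varadhan plug-in estimate from classifier predictions:
# mean log-ratio on the joint samples minus log of the mean ratio
# on the marginal samples.
import math

def dv_mi_estimate(gamma_joint, gamma_marginal):
    ratio = lambda g: g / (1.0 - g)
    first = sum(math.log(ratio(g)) for g in gamma_joint) / len(gamma_joint)
    second = math.log(sum(ratio(g) for g in gamma_marginal) / len(gamma_marginal))
    return first - second

# An uninformative classifier (gamma = 0.5 everywhere) yields estimate 0,
# since the implied likelihood ratio is identically 1.
mi_hat = dv_mi_estimate([0.5, 0.5], [0.5, 0.5])  # = 0.0
```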
Theorem 3 (Theorem 1 restated).
Classifier-MI is consistent, i.e., given $\epsilon, \delta > 0$, there exists $n \in \mathbb{N}$ such that with probability at least $1 - \delta$, we have $|\hat{I}_n - I(X; Y)| \le \epsilon$.
Intuition of Proof
The classifier is trained to minimize the empirical risk on the train set and obtains the minimizer $\hat{\theta}$. From the generalization bound of the classifier, the loss value on the test set is close to the loss obtained by the best optimizer in the classifier family ($\theta^*$), which itself is close to the loss from the global optimizer (viz. $\gamma^*$) by the Universal Function Approximation Theorem of neural networks.
The loss is strongly convex in $\gamma$. Closeness in loss then links to closeness in $\gamma$, i.e., $\gamma_{\hat{\theta}} \approx \gamma^*$, which in turn implies that the MI estimate is close to the true value.
Lemma 3 (Likelihood-Ratio from Cross-Entropy Loss).
The pointwise minimizer $\gamma^*$ of the binary cross-entropy loss is related to the likelihood ratio as $\frac{p(x)}{q(x)} = \frac{\gamma^*(x)}{1 - \gamma^*(x)}$, where $\gamma^*(x) = \Pr(y = 1 \mid x)$ and $y$ is the label of point $x$.
Proof.
The binary cross-entropy loss as a function of $\gamma$ is defined in (5). Now, with equal numbers of samples from the two classes, $\Pr(y = 1) = \Pr(y = 0) = \frac{1}{2}$ and $\Pr(x \mid y = 1) = p(x)$.
Similarly, $\Pr(x \mid y = 0) = q(x)$.
Using these in the expression for $\mathcal{L}(\gamma)$, we obtain
$$\mathcal{L}(\gamma) = -\frac{1}{2}\int_{\mathcal{X}} \Big[p(x)\log\gamma(x) + q(x)\log\big(1 - \gamma(x)\big)\Big]\,dx$$
The pointwise minimizer of $\mathcal{L}(\gamma)$ gives $\gamma^*(x) = \frac{p(x)}{p(x) + q(x)}$, which rearranges to the stated likelihood ratio. ∎
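The pointwise minimization can be checked numerically. The density values below are illustrative numbers: at a point with $p(x) = 3$ and $q(x) = 1$, grid search over the pointwise objective should land near $\gamma^* = p/(p+q) = 0.75$, so that $\gamma^*/(1-\gamma^*) = p/q = 3$.

```python
# Numeric check of Lemma 3: minimize -(p*log(g) + q*log(1-g)) over g.
import math

def pointwise_bce(g, p, q):
    return -(p * math.log(g) + q * math.log(1.0 - g))

p, q = 3.0, 1.0  # illustrative density values at a fixed point x
grid = [i / 10000 for i in range(1, 10000)]  # gamma in (0, 1)
g_star = min(grid, key=lambda g: pointwise_bce(g, p, q))
```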
Lemma 4 (Function Approximation).
Given $\epsilon > 0$, there exists $\theta \in \Theta$ such that $|\mathcal{L}(\gamma_\theta) - \mathcal{L}(\gamma^*)| \le \epsilon$.
Proof.
The last layer of the neural network being a sigmoid (followed by clipping to $[\tau, 1 - \tau]$) ensures that the outputs are bounded. So by the Universal Function Approximation Theorem for multilayer feedforward neural networks (Hornik et al. 1989), there exists a parameter $\theta$ such that $\sup_x |\gamma_\theta(x) - \gamma^*(x)| < \epsilon'$, where $\gamma_\theta$ is the estimated classifier prediction function with parameter $\theta$. So $|\mathcal{L}(\gamma_\theta) - \mathcal{L}(\gamma^*)| \le K\epsilon'$, since $\mathcal{L}$ is Lipschitz continuous with constant $K$. Choose $\epsilon' = \epsilon / K$ to complete the proof.
∎
Lemma 5 (Generalization).
Given $\epsilon, \delta > 0$, there exists $n \in \mathbb{N}$ such that with probability at least $1 - \delta$, we have $|\hat{\mathcal{L}}_{\mathrm{test}}(\gamma_\theta) - \mathcal{L}(\gamma_\theta)| \le \epsilon$ for all $\theta \in \Theta$.
Proof.
Let $Z_i = y_i \log \gamma_\theta(x_i) + (1 - y_i)\log\big(1 - \gamma_\theta(x_i)\big)$ denote the per-sample loss, which is bounded since the predictions are clipped (Assumption (A3)).
From Hoeffding’s inequality,
$$\Pr\Big(\big|\hat{\mathcal{L}}_{\mathrm{train}}(\gamma_\theta) - \mathcal{L}(\gamma_\theta)\big| > \epsilon\Big) \le 2\exp\left(-\frac{2n\epsilon^2}{B^2}\right)$$
where $B$ is the range of the bounded loss.
Similarly, for the test samples,
$$\Pr\Big(\big|\hat{\mathcal{L}}_{\mathrm{test}}(\gamma_\theta) - \mathcal{L}(\gamma_\theta)\big| > \epsilon\Big) \le 2\exp\left(-\frac{2n\epsilon^2}{B^2}\right) \qquad (6)$$
We want this to hold for all parameters $\theta \in \Theta$. This is obtained using the covering number of the compact domain $\Theta$. We use small balls of radius $r$ centered at points $\theta_1, \ldots, \theta_N$ so that every $\theta \in \Theta$ lies within $r$ of some center. The covering number $N$ is finite as $\Theta$ is compact, and it is bounded in terms of $r$ and the diameter of $\Theta$.
Using the union bound over these finitely many hypotheses,
$$\Pr\Big(\exists\, j : \big|\hat{\mathcal{L}}_{\mathrm{test}}(\gamma_{\theta_j}) - \mathcal{L}(\gamma_{\theta_j})\big| > \epsilon\Big) \le 2N\exp\left(-\frac{2n\epsilon^2}{B^2}\right) \qquad (7)$$
Choose the radius $r$ small enough that the Lipschitz continuity of the loss in $\theta$ extends the bound from the centers to all of $\Theta$ (Mohri et al. 2018). Solving for the number of samples with the right-hand side set to $\delta$, we obtain $n = O\!\left(\frac{B^2}{\epsilon^2}\log\frac{N}{\delta}\right)$.
So for this $n$, with probability at least $1 - \delta$, the stated uniform bound holds. ∎
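The Hoeffding step in the proof above can be illustrated by a small Monte Carlo experiment. The sample size, deviation $\epsilon$, and Uniform(0,1) data below are our illustrative choices (variables bounded in $[0,1]$, as the clipped and rescaled losses are): the empirical exceedance frequency should stay below the bound $2\exp(-2n\epsilon^2)$.

```python
# Monte Carlo check of Hoeffding's inequality for [0, 1]-bounded variables:
# Pr(|sample mean - mu| > eps) <= 2 * exp(-2 * n * eps^2).
import math
import random

random.seed(0)
n, eps, trials = 200, 0.1, 2000
mu = 0.5  # mean of Uniform(0, 1)
exceed = 0
for _ in range(trials):
    sample_mean = sum(random.random() for _ in range(n)) / n
    if abs(sample_mean - mu) > eps:
        exceed += 1
empirical = exceed / trials
bound = 2 * math.exp(-2 * n * eps ** 2)  # = 2 * exp(-4)
```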
Lemma 6 (Convergence to minimizer).
Given $\epsilon > 0$, there exists $\delta > 0$ such that whenever $\mathcal{L}(\gamma_\theta) - \mathcal{L}(\gamma^*) \le \delta$, we have
$\|\gamma_\theta - \gamma^*\| \le \epsilon$,
where $\|f\|^2 = \int_{\mathcal{X}} f(x)^2\,dx$ and $\mu(\mathcal{X})$ is the Lebesgue measure of the compact set $\mathcal{X}$.
Proof.
$\mathcal{L}(\gamma)$ is strongly convex as a function of $\gamma$ under Assumption (A2), since the densities are bounded away from $0$ and $\infty$. Using the Taylor expansion for strongly convex functions, we have
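In sketch form, with $m > 0$ denoting the strong-convexity constant (our notation), this step reads:

```latex
% Strong convexity at the minimizer \gamma^* (the gradient term vanishes):
\mathcal{L}(\gamma_\theta) \;\ge\; \mathcal{L}(\gamma^*)
  + \frac{m}{2}\,\lVert \gamma_\theta - \gamma^* \rVert^2 ,
% so a loss gap of at most \delta forces closeness of the functions:
\mathcal{L}(\gamma_\theta) - \mathcal{L}(\gamma^*) \le \delta
  \;\Longrightarrow\;
\lVert \gamma_\theta - \gamma^* \rVert \le \sqrt{2\delta/m} .
```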