 # CCMI : Classifier based Conditional Mutual Information Estimation

Conditional Mutual Information (CMI) is a measure of conditional dependence between random variables X and Y, given another random variable Z. It can be used to quantify conditional dependence among variables in many data-driven inference problems such as graphical models, causal learning, feature selection and time-series analysis. While k-nearest neighbor (kNN) based estimators as well as kernel-based methods have been widely used for CMI estimation, they suffer severely from the curse of dimensionality. In this paper, we leverage advances in classifiers and generative models to design methods for CMI estimation. Specifically, we introduce an estimator for KL-Divergence based on the likelihood ratio by training a classifier to distinguish the observed joint distribution from the product distribution. We then show how to construct several CMI estimators using this basic divergence estimator by drawing ideas from conditional generative models. We demonstrate that the estimates from our proposed approaches do not degrade in performance with increasing dimension and obtain significant improvement over the widely used KSG estimator. Finally, as an application of accurate CMI estimation, we use our best estimator for conditional independence testing and achieve superior performance over the state-of-the-art tester on both simulated and real data-sets.


## 1 Introduction

Conditional mutual information (CMI) is a fundamental information-theoretic quantity that extends the nice properties of mutual information (MI) to conditional settings. For three continuous random variables $X$, $Y$, and $Z$, the conditional mutual information is defined as:

$$I(X;Y|Z) = \iiint p(x,y,z) \log \frac{p(x,y,z)}{p(x,z)\,p(y|z)} \, dx\, dy\, dz$$

assuming that the distributions admit the respective densities. One of the striking features of MI and CMI is that they can capture non-linear dependencies between the variables. In scenarios where the Pearson correlation is zero even though the two random variables are dependent, mutual information can recover the truth. Likewise, in the sense of conditional independence for three random variables $X$, $Y$, and $Z$, conditional mutual information provides a strong guarantee, i.e., $X \perp Y \mid Z \iff I(X;Y|Z) = 0$.
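As a quick sanity check on this definition, CMI has a closed form for jointly Gaussian variables in terms of covariance determinants (a standard fact, not from the paper). The function and covariance below are our own illustrative constructions.

```python
import numpy as np

def gaussian_cmi(cov, ix, iy, iz):
    """I(X;Y|Z) for jointly Gaussian variables, from the covariance matrix:
    0.5 * ( log|S_XZ| + log|S_YZ| - log|S_Z| - log|S_XYZ| )."""
    def logdet(idx):
        return np.linalg.slogdet(cov[np.ix_(idx, idx)])[1]
    return 0.5 * (logdet(ix + iz) + logdet(iy + iz)
                  - logdet(iz) - logdet(ix + iy + iz))

# X = Z + e1 and Y = Z + e2 with independent noise terms:
# X is independent of Y given Z, so I(X;Y|Z) should be exactly zero.
cov = np.array([[2.0, 1.0, 1.0],   # variable order: X, Y, Z
                [1.0, 2.0, 1.0],
                [1.0, 1.0, 1.0]])
cmi = gaussian_cmi(cov, ix=[0], iy=[1], iz=[2])
```

Note that the unconditional correlation between $X$ and $Y$ here is non-zero, yet the conditional mutual information vanishes; only conditioning on $Z$ reveals this.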

The conditional setting is even more interesting, as the dependence between $X$ and $Y$ can potentially change based on how they are connected to the conditioning variable. For instance, consider a simple Markov chain $X \to Z \to Y$. Here, $I(X;Y|Z) = 0$. But a slightly different relation $X \to Z \leftarrow Y$ has $I(X;Y|Z) > 0$, even though $X$ and $Y$ may be independent as a pair. It is a well-known fact in Bayesian networks that a node is independent of its non-descendants given its parents. CMI goes beyond stating whether the pair $(X, Y)$ is conditionally dependent or not; it also provides a quantitative strength of dependence.

### 1.1 Prior Art

The literature is replete with works aimed at applying CMI for data-driven knowledge discovery. Fleuret (2004) used CMI for fast binary feature selection to improve classification accuracy. Loeckx et al. (2010) improved non-rigid image registration by using CMI as a similarity measure instead of global mutual information. CMI has been used to infer gene-regulatory networks (Liang and Wang 2008) or protein modulation (Giorgi et al. 2014) from gene expression data. Causal discovery (Li et al. 2011; Hlinka et al. 2013; Vejmelka and Paluš 2008) is yet another application area of CMI estimation.

Despite its widespread use, estimation of conditional mutual information remains a challenge. One naive method is to estimate the joint and conditional densities from data and plug them into the expression for CMI. But density estimation is not sample efficient and is often more difficult than estimating the desired quantity directly. The most widely used technique expresses CMI in terms of appropriate arithmetic of differential entropy estimators (referred to here as the $\Sigma H$ estimator): $I(X;Y|Z) = h(X,Z) + h(Y,Z) - h(Z) - h(X,Y,Z)$, where $h(X) = -\int p(x) \log p(x)\, dx$ is known as the differential entropy.

The differential entropy estimation problem has been studied extensively by Beirlant et al. (1997); Nemenman et al. (2002); Miller (2003); Lee (2010); Leśniewicz (2014); Sricharan et al. (2012); Singh and Póczos (2014), and the entropies can be estimated either from kernel-density estimates (Kandasamy et al. 2015; Gao et al. 2016) or from $k$-nearest-neighbor estimates (Sricharan et al. 2013; Jiao et al. 2018; Pál et al. 2010; Kozachenko and Leonenko 1987; Singh et al. 2003; Singh and Póczos 2016). Building on top of $k$-nearest-neighbor estimates and breaking the paradigm of $\Sigma H$ estimation, a coupled estimator (which we address henceforth as KSG) was proposed by Kraskov et al. (2004). It generalizes to mutual information, conditional mutual information, as well as other multivariate information measures, including estimation in scenarios where the distribution can be mixed (Runge 2018; Frenzel and Pompe 2007; Gao et al. 2017, 2018; Vejmelka and Paluš 2008; Rahimzamani et al. 2018).

The kNN approach has the advantage that it can naturally adapt to the data density and does not require extensive tuning of kernel bandwidths. However, all these approaches suffer from the curse of dimensionality and are unable to scale well with dimension. Moreover, Gao et al. (2015) showed that exponentially many samples are required (as MI grows) for accurate estimation using kNN based estimators. This brings us to the central motivation of this work: Can we propose estimators for conditional mutual information that estimate well even in high dimensions?
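For reference, the kNN style of CMI estimation discussed above can be sketched as follows. This is our own simplified illustration of a KSG/Frenzel-Pompe type estimator (function name and defaults are assumptions), not the exact implementation benchmarked later.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def knn_cmi(x, y, z, k=5):
    """kNN CMI sketch: the k-th neighbour distance in the joint (x,y,z) space
    fixes a radius per point; neighbour counts within that radius in the
    (x,z), (y,z) and (z) subspaces are combined via digamma terms."""
    x, y, z = (np.atleast_2d(a.T).T for a in (x, y, z))   # ensure 2-D columns
    xyz = np.hstack([x, y, z])
    rho = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)[0][:, -1]

    def count(pts):
        tree = cKDTree(pts)
        # subtract 1 to exclude the point itself
        return np.array([len(tree.query_ball_point(p, r - 1e-12, p=np.inf)) - 1
                         for p, r in zip(pts, rho)])

    n_xz = count(np.hstack([x, z]))
    n_yz = count(np.hstack([y, z]))
    n_z = count(z)
    return digamma(k) + np.mean(digamma(n_z + 1) - digamma(n_xz + 1)
                                - digamma(n_yz + 1))
```

On low-dimensional data this works well; the point of the discussion above is that the neighbour counts become uninformative as the dimension of $Z$ grows.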

### 1.2 Our Contribution

In this paper, we explore various ways of estimating CMI by leveraging tools from classifiers and generative models. To the best of our knowledge, this is the first work that deviates from the framework of kNN and kernel based CMI estimation and introduces neural networks to solve this problem.

The main contributions of the paper can be summarized as follows:

Classifier Based MI Estimation: We propose a novel KL-divergence estimator based on the classifier two-sample approach that is more stable and performs better than recent neural methods (Belghazi et al. 2018).
Divergence Based CMI Estimation: We express CMI as the KL-divergence between the two distributions $p(x,y,z)$ and $p(x,z)p(y|z)$, and explore candidate generators for obtaining samples from $p(x,z)p(y|z)$. The CMI estimate is then obtained from the divergence estimator.
Difference Based CMI Estimation: Using the improved MI estimates and the difference relation $I(X;Y|Z) = I(X;Y,Z) - I(X;Z)$, we show that estimating CMI as a difference of two MI estimates performs best among the several methods proposed in this paper, such as divergence based CMI estimation, as well as KSG.
Improved Performance in High Dimensions: On both linear and non-linear data-sets, all our estimators perform significantly better than KSG. Surprisingly, our estimators perform well even for dimensions as high as 100, while KSG fails to obtain reasonable estimates beyond even moderate dimensions.
Improved Performance in Conditional Independence Testing: As an application of CMI estimation, we use our best estimator for conditional independence testing (CIT) and obtain improved performance compared to the state-of-the-art CIT tester on both synthetic and real data-sets.

## 2 Estimation of Conditional Mutual Information

The CMI estimation problem from finite samples can be stated as follows. Consider three random variables $X, Y, Z \sim p(x,y,z)$, where $p(x,y,z)$ is the joint distribution. Let the dimensions of the random variables be $d_x$, $d_y$ and $d_z$ respectively. We are given $n$ samples $\{(x_i, y_i, z_i)\}_{i=1}^{n}$ drawn i.i.d from $p(x,y,z)$, so $x_i \in \mathbb{R}^{d_x}$, $y_i \in \mathbb{R}^{d_y}$ and $z_i \in \mathbb{R}^{d_z}$. The goal is to estimate $I(X;Y|Z)$ from these samples.

### 2.1 Divergence Based CMI Estimation

###### Definition 1.

The Kullback-Leibler (KL) divergence between two distributions $p$ and $q$ is given as:

$$D_{KL}(p\,||\,q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx$$
###### Definition 2.

Conditional Mutual Information (CMI) can be expressed as a KL-divergence between the two distributions $p(x,y,z)$ and $p(x,z)p(y|z)$, i.e.,

$$I(X;Y|Z) = D_{KL}\big(p(x,y,z)\,||\,p(x,z)p(y|z)\big)$$

The definition of CMI as a KL-divergence naturally leads to the question: Can we estimate CMI using an estimator for divergence? The problem is still non-trivial, since we are given samples only from $p(x,y,z)$, while the divergence estimator would also require samples from $p(x,z)p(y|z)$. This further boils down to whether we can learn the distribution $p(y|z)$.

#### 2.1.1 Generative Models

We now explore various techniques to learn the conditional distribution $p(y|z)$ given samples $\{(x_i, y_i, z_i)\}_{i=1}^{n}$. This problem is fundamentally different from drawing independent samples from the marginals $p(x)$ and $p(y)$ given the joint $p(x,y)$. In that simpler setting, we can simply permute the data to obtain $\{(x_i, y_{\sigma(i)})\}$ ($\sigma$ denotes a permutation, $\sigma(i) \neq i$), which emulates samples drawn from $p(x)p(y)$. But such a permutation scheme does not work for $p(y|z)$, since it would destroy the dependence between $Y$ and $Z$. The problem is solved using recent advances in generative models, which aim to learn an unknown underlying distribution from samples.

Conditional Generative Adversarial Network (CGAN): There exist extensions of the basic GAN framework (Goodfellow et al. 2014) to conditional settings, such as CGAN (Mirza and Osindero 2014). Once trained, the CGAN can generate samples from the generator network as $\hat{y} = G(z, \eta)$, where $\eta$ is a noise vector.

Conditional Variational Autoencoder (CVAE): Similar to CGAN, the conditional variant of the VAE, CVAE (Kingma and Welling 2013; Sohn et al. 2015), aims to maximize the conditional log-likelihood. The input to the decoder network is the value of $z$ and a latent vector $\eta$ sampled from a standard Gaussian. The decoder gives the conditional mean and conditional variance (parametric functions of $z$ and $\eta$), from which $\hat{y}$ is then sampled.

kNN based permutation: A simpler algorithm for generating samples from the conditional is to permute the $y$ values of data points whose $z$ values are close. Such methods are popular in the conditional independence testing literature (Sen et al. 2017; Doran et al. 2014). For a given point $(x_i, y_i, z_i)$, we find the $k$-nearest neighbor of $z_i$; say it is $z_j$, with corresponding data point $(x_j, y_j, z_j)$. Then $(x_i, y_j, z_i)$ is treated as a sample from $p(x,z)p(y|z)$.
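This permutation step is simple enough to sketch directly. The implementation below is our own illustration (the function name and the default `k` are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_permute(y, z, k=1):
    """For each point i, replace y_i by the y of the k-th nearest neighbour
    of z_i (excluding i itself). The tuples (x_i, y_j, z_i) built from the
    returned values emulate samples from p(x,z) p(y|z)."""
    z = np.atleast_2d(z.T).T                  # ensure z has shape (n, d_z)
    _, idx = cKDTree(z).query(z, k=k + 1)     # idx[:, 0] is the point itself
    return y[idx[:, -1]]

# usage: y_tilde = knn_permute(y, z); then train a divergence estimator on
# {(x_i, y_i, z_i)} labelled 1 versus {(x_i, y_tilde_i, z_i)} labelled 0
```

Because the replacement $y_j$ comes from a point with nearby $z$, the $Y$-$Z$ dependence is (approximately) preserved while the $X$-$Y$ link given $Z$ is broken.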

Now that we have outlined multiple techniques for approximating $p(y|z)$, we next proceed to the problem of estimating KL-divergence.

#### 2.1.2 Divergence Estimation

Recently, Belghazi et al. (2018) proposed a neural network based estimator of mutual information (MINE) by utilizing lower bounds on KL-divergence. Since MI is a special case of KL-divergence, their neural estimator can be extended for divergence estimation as well. The estimator can be trained using back-propagation and was shown to out-perform traditional methods for MI estimation. The core idea of MINE is cradled in a dual representation of KL-divergence. The two main lower bounds used by MINE are stated below.

###### Definition 3.

The Donsker-Varadhan representation expresses KL-divergence as a supremum over functions,

$$D_{KL}(p\,||\,q) = \sup_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim p}[f(x)] - \log\big(\mathbb{E}_{x \sim q}[e^{f(x)}]\big) \qquad (1)$$

where the function class $\mathcal{F}$ includes those functions that lead to finite values of the expectations.

###### Definition 4.

The f-divergence bound gives a lower bound on the KL-divergence:

$$D_{KL}(p\,||\,q) \geq \sup_{f \in \mathcal{F}} \; \mathbb{E}_{x \sim p}[f(x)] - \mathbb{E}_{x \sim q}[e^{f(x)-1}] \qquad (2)$$

MINE uses a neural network to represent the function class $\mathcal{F}$ and uses gradient ascent to maximize the RHS of the above bounds.
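To make the Donsker-Varadhan bound (1) concrete, the sketch below (our own example, not MINE itself) evaluates its RHS by Monte Carlo for two Gaussians $p = \mathcal{N}(0,1)$, $q = \mathcal{N}(1,1)$, using the optimal witness $f^*(x) = \log p(x) - \log q(x) = 0.5 - x$, for which the bound is tight and recovers $D_{KL}(p\,||\,q) = 0.5$.

```python
import numpy as np

rng = np.random.default_rng(0)
xp = rng.normal(0.0, 1.0, size=200_000)   # samples from p = N(0, 1)
xq = rng.normal(1.0, 1.0, size=200_000)   # samples from q = N(1, 1)

def dv_bound(f, xp, xq):
    """Monte-Carlo evaluation of E_p[f(x)] - log E_q[exp(f(x))]."""
    return f(xp).mean() - np.log(np.exp(f(xq)).mean())

tight = dv_bound(lambda x: 0.5 - x, xp, xq)    # optimal f: value ~ 0.5 = KL
loose = dv_bound(lambda x: -0.5 * x, xp, xq)   # sub-optimal f: smaller value
```

Any other choice of $f$ yields a strictly smaller value, which is exactly what MINE exploits by searching over $f$ with a neural network.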

Even though this framework is flexible and straightforward to apply, it presents several practical limitations. The estimation is very sensitive to choices of hyper-parameters (hidden units/layers) and training settings (batch size, learning rate). We found the optimization process to be unstable and to diverge in high dimensions (Section 4, Experimental Results). Our findings resonate with those of Poole et al., who found the networks difficult to tune even in toy problems.

### 2.2 Difference Based CMI Estimation

Another seemingly simple approach to estimating CMI is to express it as a difference of two mutual information terms by invoking the chain rule: $I(X;Y|Z) = I(X;Y,Z) - I(X;Z)$. As stated before, since mutual information is a special case of KL-divergence, viz. $I(X;Y) = D_{KL}\big(p(x,y)\,||\,p(x)p(y)\big)$, this again calls for a stable, scalable, sample-efficient KL-divergence estimator, such as the one we present in the next section.

## 3 Classifier Based MI Estimation

In their seminal work on independence testing, Lopez-Paz and Oquab (2016) introduced the classifier two-sample test to distinguish between samples coming from two unknown distributions $p$ and $q$. The idea was also adopted for conditional independence testing by Sen et al. (2017). The basic principle is to train a binary classifier by labeling samples from $p$ as $1$ and those coming from $q$ as $0$, and to test the null hypothesis $H_0: p = q$. Under the null, the accuracy of the binary classifier will be close to $0.5$; it will be bounded away from $0.5$ under the alternative. The accuracy of the binary classifier can then be carefully used to define $p$-values for the test.

We propose to use the classifier two-sample principle for estimating the likelihood ratio $p(x)/q(x)$. While the existing literature has instances of using the likelihood ratio for MI estimation, the algorithms to estimate the likelihood ratio are quite different from ours. Both Suzuki et al. (2008) and Nguyen et al. (2008) formulate likelihood-ratio estimation as a convex relaxation by leveraging Legendre-Fenchel duality. But the performance of these methods depends on the choice of suitable kernels, and they suffer from the same disadvantages mentioned in the Introduction.

### 3.1 Problem Formulation

Given $n$ i.i.d samples $\{x^p_i\}_{i=1}^{n} \sim p$ and $m$ i.i.d samples $\{x^q_j\}_{j=1}^{m} \sim q$, we want to estimate $D_{KL}(p\,||\,q)$. We label the points drawn from $p$ as $y = 1$ and those drawn from $q$ as $y = 0$. A binary classifier is then trained on this supervised classification task. Let the classifier's prediction for a point $x$ be $\gamma(x) = \Pr(y = 1 \mid x)$ ($\Pr$ denotes probability). Then the point-wise likelihood ratio for data point $x$ is given by $L(x) = \gamma(x)/(1 - \gamma(x))$.

The following Proposition is elementary and was already observed in Belghazi et al. (2018) (proof of their Theorem 4). We restate it here for completeness and quick reference.

###### Proposition 1.

The optimal function in the Donsker-Varadhan representation (1) is the one that computes the point-wise log-likelihood ratio, i.e., $f^*(x) = \log\big(p(x)/q(x)\big)$ (assuming $q(x) > 0$ wherever $p(x) > 0$).

Based on Proposition 1, the next step is to substitute the estimates of the point-wise likelihood ratio into (1) to obtain an estimate of the KL-divergence:

$$\hat{D}_{KL}(p\,||\,q) = \frac{1}{n}\sum_{i=1}^{n} \log L(x^p_i) - \log\Big(\frac{1}{m}\sum_{j=1}^{m} L(x^q_j)\Big) \qquad (3)$$

We obtain an estimate of mutual information from (3) as $\hat{I}(X;Y) = \hat{D}_{KL}\big(p(x,y)\,||\,p(x)p(y)\big)$. This classifier-based estimator for MI (Classifier-MI) has the following theoretical properties under Assumptions (A1)-(A4) (stated in Section 9).
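The estimator in (3) can be sketched end-to-end. This is a simplified stand-in for the paper's method: we use scikit-learn's logistic regression on degree-2 polynomial features (well-specified for Gaussian data) instead of a multi-layer perceptron, we evaluate on the training points rather than a held-out split, and all names are our own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

def classifier_mi(x, y, clip=1e-3, seed=0):
    """Classifier-MI sketch: label joint samples (x_i, y_i) as 1 and permuted
    samples (x_i, y_sigma(i)) as 0, estimate gamma = Pr(label=1 | point),
    form L = gamma / (1 - gamma), and plug into (3)."""
    rng = np.random.default_rng(seed)
    joint = np.column_stack([x, y])                   # samples from p(x, y)
    prod = np.column_stack([x, rng.permutation(y)])   # samples from p(x)p(y)
    feats = PolynomialFeatures(degree=2, include_bias=False)
    data = feats.fit_transform(np.vstack([joint, prod]))
    labels = np.r_[np.ones(len(joint)), np.zeros(len(prod))]
    clf = LogisticRegression(max_iter=1000).fit(data, labels)
    gamma = np.clip(clf.predict_proba(data)[:, 1], clip, 1 - clip)
    L = gamma / (1 - gamma)
    n = len(joint)
    return np.mean(np.log(L[:n])) - np.log(np.mean(L[n:]))

# correlated Gaussians, rho = 0.8: true MI = -0.5 * log(1 - 0.64) ~ 0.51
rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.8 * x + 0.6 * rng.normal(size=5000)
mi_hat = classifier_mi(x, y)
```

Clipping $\gamma$ away from $0$ and $1$ keeps the likelihood ratio finite when the classifier is over-confident.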

###### Theorem 1.

Under Assumptions (A1)-(A4), Classifier-MI is consistent, i.e., given $\epsilon, \delta > 0$, there exists $n \in \mathbb{N}$ such that with probability at least $1 - \delta$ we have

$$|\hat{I}_n(X;Y) - I(X;Y)| \leq \epsilon$$
###### Proof.

Here, we provide a sketch of the proof. The classifier is trained to minimize the binary cross-entropy (BCE) loss on the train set and obtains the minimizer $\hat{\gamma}$. From the generalization bound of the classifier, the loss value of $\hat{\gamma}$ on the test set is close to the loss obtained by the best predictor in the classifier family, which itself is close to the global minimizer $\gamma^*$ of the BCE (as a function of $\gamma$) by the Universal Function Approximation Theorem for neural networks.

The BCE loss is strongly convex in $\gamma$, so closeness in loss translates into closeness of $\hat{\gamma}$ to $\gamma^*$, which in turn links $\hat{I}_n(X;Y)$ to $I(X;Y)$, i.e., $|\hat{I}_n(X;Y) - I(X;Y)| \leq \epsilon$. ∎

While consistency provides a characterization of the estimator in large sample regime, it is not clear what guarantees we obtain for finite samples. The following Theorem shows that even for a small number of samples, the produced MI estimate is a true lower bound on mutual information value with high probability.

###### Theorem 2.

Under Assumptions (A1)-(A4), the finite-sample estimate from Classifier-MI is a lower bound on the true MI value with high probability, i.e., given $n$ test samples, for $\epsilon > 0$ we have

$$\Pr\big(I(X;Y) + \epsilon \geq \hat{I}_n(X;Y)\big) \geq 1 - 2\exp(-Cn)$$

where $C$ is a constant independent of the dimension of the data.

### 3.2 Probability Calibration

The estimation of the likelihood ratio from classifier predictions $\gamma(x)$ hinges on the classifier being well-calibrated. As a rule of thumb, classifiers trained directly on the cross-entropy loss are well-calibrated, while boosted decision trees introduce distortions in the likelihood-ratio estimates. There is an extensive literature devoted to obtaining better-calibrated classifiers that can be used to improve the estimation further (Lakshminarayanan et al. 2017; Niculescu-Mizil and Caruana 2005; Guo et al. 2017). We experimented with Gradient Boosted Decision Trees and a multi-layer perceptron trained on the log-loss in our algorithms. The multi-layer perceptron gave better estimates and so is used in all the experiments. Supplementary Figures show that the neural networks used in our estimators are well-calibrated.

Even though logistic regression is well-calibrated and might seem an attractive candidate for classification in sparse sample regimes, we show that linear classifiers cannot be used to estimate the likelihood ratio by the two-sample approach. For this, we consider the simple setting of estimating the mutual information of two correlated Gaussian random variables as a counter-example.

###### Lemma 1.

A linear classifier with marginal features fails at classifier two-sample MI estimation.

###### Proof.

Consider two correlated Gaussians in two dimensions, $(X, Y) \sim \mathcal{N}\!\left(0, \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix}\right)$, where $\rho$ is the Pearson correlation. The marginals are standard Gaussians $\mathcal{N}(0, 1)$. Suppose we are trying to estimate the mutual information $I(X;Y)$. The classifier decision boundary would seek the set where $p(x,y) = p(x)p(y)$; equating the two densities shows this boundary is a rectangular hyperbola. A linear classifier with marginal features cannot represent such a boundary, so it would return $0.5$ as its prediction for either class (leading to $\hat{I}(X;Y) = 0$), even when $X$ and $Y$ are highly correlated and the mutual information is high. ∎
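This failure is easy to reproduce empirically. In the sketch below (our own construction, not from the paper), logistic regression on the raw marginal features $(x, y)$ cannot separate joint from permuted samples of highly correlated Gaussians, so its accuracy stays near chance; adding the product feature $xy$ restores separability.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 0.9 * x + np.sqrt(1 - 0.81) * rng.normal(size=n)   # rho = 0.9, high MI

joint = np.column_stack([x, y])                  # class 1: p(x, y)
prod = np.column_stack([x, rng.permutation(y)])  # class 0: p(x) p(y)
data = np.vstack([joint, prod])
labels = np.r_[np.ones(n), np.zeros(n)]

# linear classifier on marginal features: near-chance accuracy
acc = LogisticRegression(max_iter=1000).fit(data, labels).score(data, labels)

# adding the quadratic feature x*y captures the hyperbolic boundary
data2 = np.column_stack([data, data[:, 0] * data[:, 1]])
acc2 = LogisticRegression(max_iter=1000).fit(data2, labels).score(data2, labels)
```

Near-chance accuracy means $\gamma(x) \approx 0.5$ everywhere, hence $L(x) \approx 1$ and $\hat{I}(X;Y) \approx 0$, exactly as the Lemma predicts.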

We use the classifier two-sample estimator to first compute the mutual information of two correlated Gaussians, the benchmark setting of Belghazi et al. (2018). This setting also provides us a way to choose reasonable hyper-parameters that are then used throughout the synthetic experiments. We also plot the estimates of f-MINE and KSG to ensure we are able to make them work in simple settings. In this toy setting, all estimators accurately estimate $I(X;Y)$, as shown in Figure 1.

### 3.3 Modular Approach to CMI Estimation

Our classifier based divergence estimator does not encounter an optimization problem involving exponentials. MINE optimizing (1) has biased gradients, while the version based on (2) optimizes a weaker lower bound (Belghazi et al. 2018). On the contrary, our classifier is trained on the cross-entropy loss, which has unbiased gradients, and we then plug the likelihood-ratio estimates into the tighter Donsker-Varadhan bound, thereby achieving the best of both worlds. Equipped with a KL-divergence estimator, we can now couple it with the generators of Section 2.1.1, or use the expression of CMI as a difference of two MIs (which we address from now on as MI-Diff.). Algorithm 1 describes CMI estimation by tying together the generator and classifier blocks. For MI-Diff., the classifier divergence block in Algorithm 1 is used twice: once for estimating $I(X;Y,Z)$ and once for $I(X;Z)$. For mutual information, the samples of $q$ in the classifier block are obtained by permuting the samples of $p$.
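The MI-Diff. route can be sketched end-to-end as follows. As before, this stand-in uses logistic regression on quadratic features rather than the paper's multi-layer perceptron, and every name here is our own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures

def classifier_mi(u, v, clip=1e-3, seed=0):
    """hat-I(U;V): classifier two-sample estimate of D_KL(p(u,v) || p(u)p(v)),
    with product samples obtained by permuting v."""
    rng = np.random.default_rng(seed)
    u, v = np.atleast_2d(u.T).T, np.atleast_2d(v.T).T
    joint = np.hstack([u, v])
    prod = np.hstack([u, rng.permutation(v, axis=0)])
    feats = PolynomialFeatures(degree=2, include_bias=False)
    data = feats.fit_transform(np.vstack([joint, prod]))
    labels = np.r_[np.ones(len(joint)), np.zeros(len(prod))]
    clf = LogisticRegression(max_iter=1000).fit(data, labels)
    g = np.clip(clf.predict_proba(data)[:, 1], clip, 1 - clip)
    L = g / (1 - g)
    n = len(joint)
    return np.mean(np.log(L[:n])) - np.log(np.mean(L[n:]))

def cmi_diff(x, y, z):
    """MI-Diff.: hat-I(X;Y|Z) = hat-I(X; (Y,Z)) - hat-I(X; Z)."""
    yz = np.column_stack([y, z])
    return classifier_mi(x, yz) - classifier_mi(x, z)
```

Note that the two MI estimates share the same estimator and hyper-parameters, so part of their individual bias cancels in the difference.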

When the classifier is coupled with a generator, the generated distribution $g(y|z)$ may deviate from the target distribution $p(y|z)$, introducing a different kind of bias. The following Lemma shows how such a bias can be corrected by subtracting the KL divergence of the sub-tuple $(Y, Z)$ from the divergence of the entire triple $(X, Y, Z)$. We note that such a clean relationship does not hold for general divergence measures; indeed, more sophisticated conditions are required for the total-variation metric (Sen et al. 2018).

###### Lemma 2 (Bias Cancellation).

The estimation error due to an incorrect generated distribution $g(y|z)$ can be accounted for using the following relation:

$$D_{KL}\big(p(x,y,z)\,||\,p(x,z)p(y|z)\big) = D_{KL}\big(p(x,y,z)\,||\,p(x,z)g(y|z)\big) - D_{KL}\big(p(y,z)\,||\,p(z)g(y|z)\big)$$
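The identity follows from a one-line expansion of the log-ratio. A short derivation (ours, assuming all densities are strictly positive where needed):

```latex
\begin{align*}
D_{KL}\big(p(x,y,z)\,\|\,p(x,z)\,g(y|z)\big)
  &= \mathbb{E}_{p(x,y,z)}\Big[\log \frac{p(x,y,z)}{p(x,z)\,p(y|z)}
     + \log \frac{p(y|z)}{g(y|z)}\Big] \\
  &= D_{KL}\big(p(x,y,z)\,\|\,p(x,z)\,p(y|z)\big)
     + \mathbb{E}_{p(y,z)}\Big[\log \frac{p(z)\,p(y|z)}{p(z)\,g(y|z)}\Big] \\
  &= I(X;Y|Z) + D_{KL}\big(p(y,z)\,\|\,p(z)\,g(y|z)\big),
\end{align*}
```

and rearranging gives the stated relation.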


## 4 Experimental Results

In this Section, we compare the performance of various estimators on the CMI estimation task. We used the classifier based divergence estimator (Section 3) and MINE in our experiments. Belghazi et al. (2018) proposed two MINE variants, namely Donsker-Varadhan (DV) MINE and f-MINE. The f-MINE has unbiased gradients, and we found it to have similar performance to DV-MINE, albeit with lower variance. So we used f-MINE in all our experiments.

The “generator”+“Divergence estimator” notation will be used to denote the various estimators. For instance, if we use CVAE for the generation and couple it with f-MINE, we denote the estimator as CVAE+f-MINE. When coupled with the Classifier based Divergence block, it will be denoted as CVAE+Classifier. For MI-Diff. we represent it similarly as MI-Diff.+“Divergence estimator”.

We compare our estimators with the widely used KSG estimator.¹ For f-MINE, we used the code provided to us by the author (Belghazi et al. 2018). The same hyper-parameter setting is used on all our synthetic data-sets for all estimators (including generators and divergence blocks); the Supplementary contains the details of the hyper-parameter values. For KSG, we vary $k$ and report the results for the best $k$ for each data-set.

¹The implementation of the CMI estimator in the Non-parametric Entropy Estimation Toolbox (https://github.com/gregversteeg/NPEET) is used.

### 4.1 Linear Relations

We start with the simple setting where the three random variables $X$, $Y$, $Z$ are related in a linear fashion, and consider two linear models, Model I and Model II. In both, each co-ordinate of $Z$ is drawn i.i.d from a uniform distribution, and the additive noise terms are Gaussian. $Z_1$ denotes the first dimension of $Z$. The constant unit-norm random vector used to mix the co-ordinates of $Z$ is drawn once from a Gaussian and kept fixed for all points during data-set preparation.

As is common in the literature on causal discovery and independence testing (Sen et al. 2017; Doran et al. 2014), the dimensions of $X$ and $Y$ are kept at $1$, while $d_z$ can scale. Our estimators are general enough to accommodate multi-dimensional $X$ and $Y$, where we consider the concatenated vectors $(X, Z)$ and $(Y, Z)$. This has applications in learning interactions between modules in Bayesian networks (Segal et al. 2005) or dependence between groups of variables (Entner and Hoyer 2012; Parviainen and Kaski 2016), such as distinct functional groups of proteins/genes instead of individual entities. Both linear models are representative of problems encountered in the graphical models and independence testing literature. In Model I, the conditioning set can keep growing with independent variables, while the dependence flows only through $Z_1$. In Model II, the variables in the conditioning set combine linearly to produce the dependence. It is also easy to obtain the ground truth CMI value in such models by numerical integration.

For both models, we generate data-sets with a varying number of samples and varying dimension to study their effect on estimator performance. The sample size $n$ is varied keeping $d_z$ fixed, and the dimension $d_z$ is varied keeping the sample size fixed.

Several observations stand out from the experiments: (1) KSG estimates are accurate at very low dimension but degrade drastically with increasing $d_z$, even when the conditioning variables are completely independent and do not influence $X$ and $Y$ (Model-I). (2) Increasing the sample size does not improve KSG estimates once the dimension is even moderate; the dimension issue is more acute than sample scarcity. (3) The estimates from f-MINE deviate more from the ground truth at low sample sizes, and at high dimensions the instability is clearly portrayed when the estimate suddenly goes negative (truncated in the plots to maintain scale). (4) All our classifier based estimators obtain reasonable estimates even at the highest dimensions tested, with MI-Diff.+Classifier performing the best.

### 4.2 Non-Linear Relations

(Figure: non-linear model, with the number of samples increasing with data-index; dz = 10 fixed.)

Here, we study models where the underlying relations between $X$, $Y$ and $Z$ are non-linear. The non-linear bounded functions defining the model are drawn uniformly at random from a fixed family for each data-set. $w$ is a random vector whose entries are drawn from a Gaussian and normalized to have unit norm; once generated, the vector is kept fixed for a particular data-set. As before, $X$ and $Y$ are one-dimensional while $d_z$ can scale. The noise variables are drawn i.i.d from zero-mean Gaussians.

We vary the number of samples for each dimension $d_z$, and the dimension itself is varied as well, giving rise to a grid of data-sets indexed by a data-index (lower data-indices correspond to smaller sample sizes and dimensions).

Obtaining Ground Truth: Since it is not possible to obtain the ground-truth CMI value in such complicated settings from a closed-form expression, we resort to the relation $I(X;Y|Z) = I(X;Y|U)$, where $U$ is the one-dimensional projection of $Z$ that enters the model. The dependence of $(X, Y)$ on $Z$ is completely captured once $U$ is given. But $U$ has dimension $1$, so $I(X;Y|U)$ can be estimated accurately using KSG. We generate a large number of samples separately for each data-set to estimate this quantity and use it as the ground truth.

(Figure: CCIT performance degrades with increasing dz; CCMI retains a high AuROC score even at dz = 100.)

We observed similar behavior (as in Linear models) for our estimators in the Non-linear setting.

(1) KSG continues to produce low estimates, even though in this setup the true CMI values are themselves low. (2) Up to moderate dimensions, we find all our estimators closely tracking the ground truth, while in the highest dimensions some of them lose accuracy. (3) MI-Diff.+Classifier is again the best estimator, giving CMI estimates close to the ground truth even at the highest dimensions tested.

From the above experiments, we found MI-Diff.+Classifier to be the most accurate and stable estimator. We use this combination for our downstream applications and henceforth refer to it as CCMI.

Figure 5: AuROC curves, Flow-Cytometry data-set. CCIT obtains a mean AuROC score of 0.6665, while CCMI out-performs it with a mean of 0.7569.

## 5 Application to Conditional Independence Testing

As a testimony to accurate CMI estimation, we apply CCMI to the problem of Conditional Independence Testing (CIT). Here, we are given samples and must decide whether they came from $p(x,y,z)$ or from $p(x,z)p(y|z)$; that is, the hypothesis test is to distinguish the null $H_0: X \perp Y \mid Z$ from the alternative $H_1: X \not\perp Y \mid Z$.

We seek to design a CIT tester using CMI estimation, based on the fact that $X \perp Y \mid Z \iff I(X;Y|Z) = 0$. A simple approach is to reject the null if the CMI estimate exceeds a threshold and accept it otherwise; the CMI estimate can thus serve as a proxy for the $p$-value. CIT testing based on CMI estimation has been studied by Runge (2018), where the author uses KSG for CMI estimation and a kNN based permutation scheme to generate a $p$-value. The $p$-value is computed as the fraction of permuted data-sets on which the CMI estimate is at least that of the original data-set. The same approach can be adopted for CCMI to obtain a $p$-value. But since we report the AuROC (Area under the Receiver Operating Characteristic curve), the CMI estimates themselves suffice.
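A toy version of this permutation testing scheme is sketched below. This is our own illustration: the test statistic is absolute partial correlation standing in for a CMI estimate, and the shuffle within blocks of $z$-sorted points is a crude one-dimensional stand-in for the kNN based permutation.

```python
import numpy as np

def partial_corr(x, y, z):
    """|corr| of the residuals of x and y after linear regression on z."""
    zz = np.column_stack([z, np.ones(len(z))])
    rx = x - zz @ np.linalg.lstsq(zz, x, rcond=None)[0]
    ry = y - zz @ np.linalg.lstsq(zz, y, rcond=None)[0]
    return abs(np.corrcoef(rx, ry)[0, 1])

def cit_pvalue(x, y, z, n_perm=99, block=10, seed=0):
    """Permutation p-value: shuffle y within blocks of nearby z values
    (approximately preserving the y-z dependence) and count how often the
    statistic on permuted data is at least the observed one."""
    rng = np.random.default_rng(seed)
    t_obs = partial_corr(x, y, z)
    order = np.argsort(z)                     # z assumed one-dimensional here
    count = 0
    for _ in range(n_perm):
        perm = order.copy()
        for s in range(0, len(z), block):
            rng.shuffle(perm[s:s + block])    # shuffle z-neighbours only
        y_perm = np.empty_like(y)
        y_perm[order] = y[perm]
        count += partial_corr(x, y_perm, z) >= t_obs
    return (1 + count) / (1 + n_perm)
```

Swapping in a CMI estimator for `partial_corr` would recover the Runge-style tester described above, at the cost of one estimator fit per permutation.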

### 5.1 Post Non-linear Noise : Synthetic Data

In this experiment, we generate data based on the post non-linear noise model, similar to Sen et al. (2017). As before, only $Z$ scales in dimension while $X$ and $Y$ are one-dimensional. The data is generated using the following model:

$$Z \sim \mathcal{N}(1, I_{d_z}), \quad X = \cos(a_x^T Z + \eta_1), \quad Y = \begin{cases} \cos(b_y^T Z + \eta_2) & \text{if } X \perp Y \mid Z \\ \cos(c X + b_y^T Z + \eta_2) & \text{if } X \not\perp Y \mid Z \end{cases}$$

The entries of the random vectors $a_x$ and $b_y$ (matrices if $d_x, d_y > 1$) are drawn from a uniform distribution, and the vectors are normalized to have unit norm, i.e., $\|a_x\| = \|b_y\| = 1$. This differs from the implementation in Sen et al. (2017), where the constant $c$ is the same in all data-sets. By varying $c$, we obtain a tougher problem in which the true CMI value can be quite low even for a dependent data-set.

$a_x$ and $b_y$ are kept constant when generating the points of a single data-set and are varied across data-sets. We vary $d_z$ and simulate multiple data-sets for each dimension. Our algorithm is compared with the state-of-the-art CIT tester of Sen et al. (2017), known as CCIT. We used the implementation provided by the authors and ran CCIT with bootstraps. For each dimension, an AuROC value is obtained over its data-sets. Figure 4 shows the mean AuROC values over several runs for both testers as $d_z$ varies. While both algorithms perform accurately at low dimensions, the performance of CCIT starts to degrade as $d_z$ grows, eventually approaching random guessing, whereas CCMI retains its superior performance even at $d_z = 100$.

Since the AuROC metric measures the best achievable performance over varying thresholds, it is not immediately clear what precision and recall CCMI obtains when we threshold the CCMI estimate at a fixed value (rejecting or accepting the null based on it). So we plotted the histogram of CMI estimates separately for the CI and non-CI data-sets. Figure 3(b) shows a clear demarcation of CMI estimates between the two data-set categories, and choosing the threshold accordingly gave high precision and recall.

### 5.2 Flow-Cytometry : Real Data

To extend our estimator beyond simulated settings, we use CMI estimation to test for conditional independence in the protein network data used in Sen et al. (2017). The consensus graph in Sachs et al. (2005) is used as the ground truth, and we obtained CI and non-CI relations from this Bayesian network. The basic principle is that a protein is independent of all other proteins in the network given its parents, children, and parents of its children. Moreover, for the non-CI cases, we note that a direct edge between two proteins would never render them conditionally independent, so the conditioning set can be chosen at random from the other proteins. These two settings are used to obtain the CI and non-CI data-sets. The number of samples in each data-set is small, and the dimension of $Z$ varies across data-sets.

For the Flow-Cytometry data, since the number of samples is so small, we train the classifier for fewer epochs to prevent over-fitting, keeping every other hyper-parameter the same. CCMI is compared with CCIT on the real data, and the mean AuROC curves over several runs are plotted in Figure 5. The superior performance of CCMI over CCIT is retained in the sparse data regime.

## 6 Conclusion and Future Directions

In this work, we explored various CMI estimators by drawing on recent advances in generative models and classifiers. We proposed a new divergence estimator based on classifier two-sample estimation, and built several conditional mutual information estimators using this primitive. We demonstrated their efficacy in a variety of practical settings. Future work will aim to approximate the null distribution for CCMI, so that we can compute $p$-values for the conditional independence testing problem efficiently.

## 7 Acknowledgments

This work was supported by NSF awards 1651236 and 1703403 and NIH grant 5R01HG008164.

## References

• Beirlant et al. (1997) Jan Beirlant, Edward J Dudewicz, László Györfi, and Edward C Van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17–39, 1997.
• Belghazi et al. (2018) Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, 2018.
• Doran et al. (2014) G Doran, K Muandet, K Zhang, and B Schölkopf. A permutation-based kernel conditional independence test. In 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014).
• Entner and Hoyer (2012) Doris Entner and Patrik O Hoyer. Estimating a causal order among groups of variables in linear models. In International Conference on Artificial Neural Networks, pages 84–91. Springer, 2012.
• Fleuret (2004) François Fleuret. Fast binary feature selection with conditional mutual information. Journal of Machine learning research, 5(Nov):1531–1555, 2004.
• Frenzel and Pompe (2007) Stefan Frenzel and Bernd Pompe. Partial mutual information for coupling analysis of multivariate time series. Physical review letters, 99(20):204101, 2007.
• Gao et al. (2015) Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pages 277–286, 2015.
• Gao et al. (2016) Weihao Gao, Sewoong Oh, and Pramod Viswanath. Breaking the bandwidth barrier: Geometrical adaptive entropy estimation. In Advances in Neural Information Processing Systems, pages 2460–2468, 2016.
• Gao et al. (2017) Weihao Gao, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Estimating mutual information for discrete-continuous mixtures. In Advances in Neural Information Processing Systems, pages 5988–5999, 2017.
• Gao et al. (2018) Weihao Gao, Sewoong Oh, and Pramod Viswanath. Demystifying fixed k-nearest neighbor information estimators. IEEE Transactions on Information Theory, 64(8):5629–5661, 2018.
• Giorgi et al. (2014) Federico M Giorgi, Gonzalo Lopez, Jung H Woo, Brygida Bisikirska, Andrea Califano, and Mukesh Bansal. Inferring protein modulation from gene expression data using conditional mutual information. PloS one, 9(10):e109569, 2014.
• Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
• Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1321–1330. JMLR.org, 2017.
• Hlinka et al. (2013) Jaroslav Hlinka, David Hartman, Martin Vejmelka, Jakob Runge, Norbert Marwan, Jürgen Kurths, and Milan Paluš. Reliability of inference of directed climate networks using conditional mutual information. Entropy, 15(6):2023–2045, 2013.
• Hornik et al. (1989) Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
• Jiao et al. (2018) Jiantao Jiao, Weihao Gao, and Yanjun Han. The nearest neighbor information estimator is adaptively near minimax rate-optimal. In Advances in neural information processing systems, 2018.
• Kandasamy et al. (2015) Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, et al. Nonparametric von mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, pages 397–405, 2015.
• Kingma and Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
• Kozachenko and Leonenko (1987) LF Kozachenko and Nikolai N Leonenko. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.
• Kraskov et al. (2004) Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical review E, 69(6):066138, 2004.
• Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.
• Lee (2010) Intae Lee. Sample-spacings-based density and entropy estimators for spherically invariant multidimensional data. Neural Computation, 22(8):2208–2227, 2010.
• Leśniewicz (2014) Marek Leśniewicz. Expected entropy as a measure and criterion of randomness of binary sequences. Przegląd Elektrotechniczny, 90(1):42–46, 2014.
• Li et al. (2011) Zhaohui Li, Gaoxiang Ouyang, Duan Li, and Xiaoli Li. Characterization of the causality between spike trains with permutation conditional mutual information. Physical Review E, 84(2):021929, 2011.
• Liang and Wang (2008) Kuo-Ching Liang and Xiaodong Wang. Gene regulatory network reconstruction using conditional mutual information. EURASIP Journal on Bioinformatics and Systems Biology, 2008(1):253894, 2008.
• Loeckx et al. (2010) Dirk Loeckx, Pieter Slagmolen, Frederik Maes, Dirk Vandermeulen, and Paul Suetens. Nonrigid image registration using conditional mutual information. IEEE transactions on medical imaging, 29(1):19–29, 2010.
• Lopez-Paz and Oquab (2016) David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. arXiv preprint arXiv:1610.06545, 2016.
• Miller (2003) Erik G Miller. A new class of entropy estimators for multi-dimensional densities. In Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03). 2003 IEEE International Conference on, volume 3, pages III–297. IEEE, 2003.
• Mirza and Osindero (2014) Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
• Mohri et al. (2018) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
• Nemenman et al. (2002) Ilya Nemenman, Fariel Shafee, and William Bialek. Entropy and inference, revisited. In Advances in neural information processing systems, pages 471–478, 2002.
• Nguyen et al. (2008) XuanLong Nguyen, Martin J Wainwright, and Michael I Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in neural information processing systems, pages 1089–1096, 2008.
• Niculescu-Mizil and Caruana (2005) Alexandru Niculescu-Mizil and Rich Caruana. Obtaining calibrated probabilities from boosting. In UAI, 2005.
• Pál et al. (2010) Dávid Pál, Barnabás Póczos, and Csaba Szepesvári. Estimation of rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Advances in Neural Information Processing Systems, pages 1849–1857, 2010.
• Parviainen and Kaski (2016) Pekka Parviainen and Samuel Kaski. Bayesian networks for variable groups. In Conference on Probabilistic Graphical Models, pages 380–391, 2016.
• (36) Ben Poole, Sherjil Ozair, Aäron van den Oord, Alexander A Alemi, and George Tucker. On variational lower bounds of mutual information.
• Rahimzamani et al. (2018) Arman Rahimzamani, Himanshu Asnani, Pramod Viswanath, and Sreeram Kannan. Estimators for multivariate information measures in general probability spaces. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.
• Runge (2018) Jakob Runge. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018.
• Sachs et al. (2005) Karen Sachs, Omar Perez, Dana Pe’er, Douglas A Lauffenburger, and Garry P Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.
• Segal et al. (2005) Eran Segal, Dana Pe’er, Aviv Regev, Daphne Koller, and Nir Friedman. Learning module networks. Journal of Machine Learning Research, 6(Apr):557–588, 2005.
• Sen et al. (2017) Rajat Sen, Ananda Theertha Suresh, Karthikeyan Shanmugam, Alexandros G Dimakis, and Sanjay Shakkottai. Model-powered conditional independence test. In Advances in Neural Information Processing Systems, pages 2951–2961, 2017.
• Sen et al. (2018) Rajat Sen, Karthikeyan Shanmugam, Himanshu Asnani, Arman Rahimzamani, and Sreeram Kannan. Mimic and classify: A meta-algorithm for conditional independence testing. arXiv preprint arXiv:1806.09708, 2018.
• Singh et al. (2003) Harshinder Singh, Neeraj Misra, Vladimir Hnizdo, Adam Fedorowicz, and Eugene Demchuk. Nearest neighbor estimates of entropy. American journal of mathematical and management sciences, 23(3-4):301–321, 2003.
• Singh and Póczos (2014) Shashank Singh and Barnabás Póczos. Exponential concentration of a density functional estimator. In Advances in Neural Information Processing Systems, pages 3032–3040, 2014.
• Singh and Póczos (2016) Shashank Singh and Barnabás Póczos. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In Advances in Neural Information Processing Systems, pages 1217–1225, 2016.
• Sohn et al. (2015) Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems 28. 2015.
• Sricharan et al. (2012) Kumar Sricharan, Raviv Raich, and Alfred O Hero. Estimation of nonlinear functionals of densities with confidence. IEEE Transactions on Information Theory, 58(7):4135–4159, 2012.
• Sricharan et al. (2013) Kumar Sricharan, Dennis Wei, and Alfred O Hero. Ensemble estimators for multivariate entropy estimation. IEEE transactions on information theory, 59(7):4374–4388, 2013.
• Suzuki et al. (2008) Taiji Suzuki, Masashi Sugiyama, Jun Sese, and Takafumi Kanamori. Approximating mutual information by maximum likelihood density ratio estimation. In New challenges for feature selection in data mining and knowledge discovery, pages 5–20, 2008.
• Vejmelka and Paluš (2008) Martin Vejmelka and Milan Paluš. Inferring the directionality of coupling with conditional mutual information. Physical Review E, 77(2):026214, 2008.

## 8 Supplementary

### 8.1 Hyper-parameters

We provide the experimental settings and hyper-parameters for ease of reproducibility of the results.

### 8.2 Calibration Curve

Figure 6: Calibrated classifiers: we find that our classifiers, trained with L2-regularization and two hidden layers, are well-calibrated. The calibration curve is obtained for MI estimation of correlated Gaussians with dx=10, ρ=0.5.

While Niculescu-Mizil and Caruana (2005) showed that neural networks for binary classification produce well-calibrated outputs, the authors in Guo et al. (2017) found miscalibration in deep networks with batch-normalization and no L2 regularization. In our experiments, the classifier is shallow, consisting of only two hidden layers with a relatively small number of hidden units. There is no batch-normalization or dropout used. Instead, we use L2-regularization, which was shown in Guo et al. (2017) to be favorable for calibration. Figure 6 shows that our classifiers are well-calibrated.
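As an illustration of how a reliability curve like the one in Figure 6 can be computed, here is a minimal sketch (not the paper's code; the helper name is ours) that bins classifier predictions by confidence and compares the mean predicted probability to the empirical fraction of positives in each bin:

```python
def calibration_curve(probs, labels, n_bins=10):
    """Group predictions into confidence bins; for each non-empty bin
    return (mean predicted probability, empirical fraction of positives).
    A well-calibrated classifier gives points near the diagonal."""
    bins = [[] for _ in range(n_bins)]
    for p, l in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, l))
    curve = []
    for b in bins:
        if b:
            mean_pred = sum(p for p, _ in b) / len(b)
            frac_pos = sum(l for _, l in b) / len(b)
            curve.append((mean_pred, frac_pos))
    return curve
```

Plotting `frac_pos` against `mean_pred` yields the calibration curve; perfect calibration traces the diagonal.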

### 8.3 Choosing Optimal Hyper-parameter

The Donsker-Varadhan representation (1) is a lower bound on the true MI (which is the supremum over all functions). So, for any classifier parameter, the plug-in estimate computed on the test samples will be less than or equal to the true value with high probability (Theorem 2). We illustrate this using the estimation of MI for correlated Gaussians in Figure 7: the estimated values lie below the true values of MI. Thus, the optimal hyper-parameter is the one that returns the maximum MI estimate on the test set.

Once we have this block that returns the maximum MI estimate after searching over hyper-parameters, the CMI estimate in CCMI is obtained as the difference of two MI estimates, calling the block twice.
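The selection rule above can be sketched as follows; `train_and_estimate` is a hypothetical callback that trains a classifier with the given hyper-parameter and returns the MI estimate on the held-out test set:

```python
def select_hyperparameter(train_and_estimate, hyperparams):
    """Return the hyper-parameter whose classifier yields the largest
    MI estimate on the test set.  Since each plug-in Donsker-Varadhan
    estimate is (with high probability) a lower bound on the true MI,
    the largest estimate is the tightest one."""
    best_hp, best_mi = None, float("-inf")
    for hp in hyperparams:
        mi = train_and_estimate(hp)  # train with hp, estimate MI on test set
        if mi > best_mi:
            best_hp, best_mi = hp, mi
    return best_hp, best_mi
```

The CMI estimate then calls this block twice and takes the difference of the two returned MI values.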

We also plot the AuROC curves for the two choices of the number of hidden units on the flow-Cytometry data (Figure 8(b)) and the post non-linear noise synthetic data (Figure 8(a)). When the number of samples is high, the estimates are quite robust to the hyper-parameter choice (Figures 7(b), 8(a)). But in the sparse-sample regime, a proper choice of hyper-parameter can improve performance (Figure 8(b)).

Figure 8: Logistic regression fails to classify points from p(x1,x2) (colored red) and those from p(x1)p(x2) (colored blue). (a) Post non-linear noise data-sets

### 8.4 Additional Figures and Tables

• For the Flow-Cytometry data-set, we used number of hidden units = for the classifier and trained for epochs. Table 6 shows the mean AuROC values for the two CIT testers.

• Figure 8 shows the distribution of points from p(x1,x2) and p(x1)p(x2). Here the classifier would return ≈0.5 as the prediction for either class (leading to an MI estimate near 0), even though x1 and x2 are highly correlated and the mutual information is high.

## 9 Theoretical Properties of CCMI

In this Section, we explore some of the theoretical properties of CCMI. Let the samples x ∼ p be labeled as l = 1 and the samples x ∼ q be labeled as l = 0. The positive label probability for a given point x is denoted as γ(x) = Pr(l = 1 | x). When the prediction is from a classifier with parameter θ, it is denoted as γ̂_θ(x). The argument of γ is dropped when it is understood from the context.

The following assumptions are used throughout this Section.

• Assumption (A1) : The underlying data distributions p and q admit densities in a compact subset X ⊂ ℝ^d.

• Assumption (A2) : ∃ β > 0, such that p(x) ≥ β and q(x) ≥ β for all x ∈ X.

• Assumption (A3) : We clip predictions in the algorithm such that τ ≤ γ̂(x) ≤ 1 − τ, with τ > 0.

• Assumption (A4) : The classifier class is parameterized by θ in some compact domain Θ ⊂ ℝ^h. ∃ a constant K < ∞, such that ‖θ‖ ≤ K, and the output of the classifier is L-Lipschitz with respect to the parameters θ.

### Notation and Computation Procedure

• In the case of mutual information estimation I(U;V), x represents the concatenated data point (u,v). To be precise, p(x) = p(u,v) and q(x) = p(u)p(v).

• In the proofs below, we need to compute the Lipschitz constant for various functions. The general procedure for those computations is as follows. For a function ϕ,

 |ϕ(x) − ϕ(y)| ≤ L_ϕ |x − y|

We compute L_ϕ using L_ϕ = sup_x |ϕ′(x)|. The functions encountered in the proofs are continuous, differentiable and have bounded domains.

• The binary cross-entropy loss estimated from n samples is

 BCE_n(γ) = −(1/n) ∑_i ( l_i log γ(x_i) + (1 − l_i) log(1 − γ(x_i)) ) (4)

When computed on the train samples (resp. test samples), it is denoted as BCE_n^ERM(γ) (resp. BCE_n(γ)). The population mean over the joint distribution of data and labels is

 BCE(γ) = −E_{X,L} ( L log γ(X) + (1 − L) log(1 − γ(X)) ) (5)
• The estimate of MI from the test samples for classifier parameter θ is given by

 Î_n^{γ̂_θ} = (1/n) ∑_{i=1}^n log ( γ̂_θ(x_i) / (1 − γ̂_θ(x_i)) ) − log ( (1/n) ∑_{j=1}^n γ̂_θ(x_j) / (1 − γ̂_θ(x_j)) )

where the x_i are test samples from p and the x_j are test samples from q. The population estimate for classifier parameter θ is given by

 I^{γ̂_θ} = E_{x∼p} log ( γ̂_θ(x) / (1 − γ̂_θ(x)) ) − log ( E_{x∼q} γ̂_θ(x) / (1 − γ̂_θ(x)) )
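A minimal sketch of this plug-in computation (our own illustration, not the paper's code), assuming the classifier predictions on the joint (x ∼ p) and product (x ∼ q) test samples are already available; clipping to [τ, 1 − τ] follows Assumption (A3):

```python
import math

def mi_estimate(gamma_joint, gamma_prod, tau=1e-3):
    """Plug-in Donsker-Varadhan MI estimate from classifier outputs.
    gamma_joint: predictions gamma(x) on test points x ~ p (joint);
    gamma_prod:  predictions gamma(x) on test points x ~ q (product).
    Predictions are clipped to [tau, 1 - tau] before forming the
    likelihood ratio gamma / (1 - gamma)."""
    def ratio(g):
        g = min(max(g, tau), 1.0 - tau)
        return g / (1.0 - g)
    # average log-likelihood-ratio over joint samples
    first = sum(math.log(ratio(g)) for g in gamma_joint) / len(gamma_joint)
    # log of the average likelihood ratio over product samples
    second = math.log(sum(ratio(g) for g in gamma_prod) / len(gamma_prod))
    return first - second
```

An uninformative classifier (γ ≡ 0.5) yields an estimate of zero, matching the intuition that no divergence is detected.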
###### Theorem 3 (Theorem 1 restated).

Classifier-MI is consistent, i.e., given ϵ > 0, δ ∈ (0,1), ∃ n ∈ ℕ such that with probability at least 1 − δ, we have

 |Î_n^{γ̂_θ}(U;V) − I(U;V)| ≤ ϵ

### Intuition of Proof

The classifier is trained to minimize the empirical risk on the train set and obtains the minimizer γ̂_θ. From the generalization bound of the classifier, the loss value BCE_n(γ̂_θ) on the test set is close to the loss obtained by the best optimizer in the classifier family (γ̃_θ), which itself is close to the loss from the global optimizer γ* by the Universal Function Approximation Theorem of neural networks.

The loss BCE(γ) is strongly convex in γ. Strong convexity links closeness in loss values to closeness of the minimizers, i.e., BCE(γ̂_θ) → BCE(γ*) implies γ̂_θ → γ*.

###### Lemma 3 (Likelihood-Ratio from Cross-Entropy Loss).

The point-wise minimizer γ* of the binary cross-entropy loss is related to the likelihood ratio as p(x)/q(x) = γ*(x)/(1 − γ*(x)), where γ(x) = Pr(l = 1 | x) and l is the label of point x.

###### Proof.

The binary cross-entropy loss as a function of γ is defined in (5). Now,

 E_{X,L} ( L log γ(X) ) = ∑_{x,l} p(x,l) l log γ(x) = ∑_x p(x | l=1) p(l=1) log γ(x) + 0 = (1/2) ∑_x p(x) log γ(x)

Similarly,

 E_{X,L} ( (1 − L) log(1 − γ(X)) ) = (1/2) ∑_x q(x) log(1 − γ(x))

Using these in the expression for BCE(γ), we obtain

 BCE(γ) = −(1/2) ∑_{x∈X} ( p(x) log γ(x) + q(x) log(1 − γ(x)) )

The point-wise minimizer of BCE(γ) gives γ*(x) = p(x) / (p(x) + q(x)), i.e., γ*(x)/(1 − γ*(x)) = p(x)/q(x). ∎
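The point-wise claim can be checked numerically: the sketch below (our own illustration; the helper name and density values are hypothetical) evaluates the per-point loss −(1/2)(p(x) log γ + q(x) log(1 − γ)) and confirms that γ = p/(p + q) attains the minimum among a few candidate values.

```python
import math

def pointwise_bce(gamma, p, q):
    """Contribution of a single point x to BCE(gamma):
    -(1/2) * (p(x) * log(gamma) + q(x) * log(1 - gamma))."""
    return -0.5 * (p * math.log(gamma) + q * math.log(1.0 - gamma))

# hypothetical density values p(x), q(x) at a fixed point x
p_x, q_x = 0.3, 0.1
gamma_star = p_x / (p_x + q_x)  # claimed point-wise minimizer, 0.75 here
```

Because the per-point loss is convex in γ with a unique interior minimum, any perturbation away from γ* strictly increases it.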

###### Lemma 4 (Function Approximation).

Given ϵ′ > 0, ∃ θ ∈ Θ such that

 BCE(γ̃_θ) ≤ BCE(γ*) + ϵ′/2
###### Proof.

The last layer of the neural network being a sigmoid (followed by clipping to [τ, 1 − τ]) ensures that the outputs are bounded. So, by the Universal Function Approximation Theorem for multi-layer feed-forward neural networks (Hornik et al., 1989), ∃ a parameter θ such that |γ̃_θ(x) − γ*(x)| ≤ ϵ″ for all x, where γ̃_θ is the estimated classifier prediction function with parameter θ. So,

 |BCE(γ̃_θ) − BCE(γ*)| ≤ (1/τ) ϵ″

since log(·) is Lipschitz continuous with constant 1/τ on [τ, 1 − τ]. Choose ϵ″ = τ ϵ′/2 to complete the proof. ∎

###### Lemma 5 (Generalization).

Given ϵ′ > 0, δ ∈ (0,1), ∃ n ∈ ℕ such that with probability at least 1 − δ, we have

 BCE_n(γ̂_θ) ≤ BCE(γ̃_θ) + ϵ′/2
###### Proof.

From Hoeffding’s inequality, for the train samples,

 Pr( |BCE_n^ERM(γ_θ) − BCE(γ_θ)| ≥ μ ) ≤ 2 exp( −2nμ²/M² )

where M = log(1/τ) bounds each summand of the loss, since the predictions are clipped to [τ, 1 − τ].

Similarly, for the test samples,

 Pr( |BCE_n(γ_θ) − BCE(γ_θ)| ≥ μ ) ≤ 2 exp( −2nμ²/M² ) (6)

We want this to hold for all parameters θ ∈ Θ. This is obtained using the covering number of the compact domain Θ: we use small balls of radius r centered at points θ_i so that every θ ∈ Θ is within distance r of some θ_i. The covering number κ(Θ, r) is finite as Θ is compact, and is bounded as

 κ(Θ, r) ≤ (2K√h / r)^h

Using the union bound on these finite hypotheses,

 Pr( max_θ |BCE_n^ERM(γ_θ) − BCE(γ_θ)| ≥ μ ) ≤ 2 κ(Θ, r) exp( −2nμ²/M² ) (7)

Choose r appropriately (Mohri et al., 2018). Solving for the number of samples with the right-hand side set to δ, we obtain the required n.

So for n large enough, with probability at least 1 − δ,

 BCE_n(γ̂_θ) ≤(a) BCE(γ̂_θ) + μ ≤(b) BCE_n^ERM(γ̂_θ) + 2μ ≤(c) BCE_n^ERM(γ̃_θ) + 2μ ≤(d) BCE(γ̃_θ) + 3μ

(a) follows from (6). (b) and (d) follow from (7). (c) is due to the fact that γ̂_θ is the minimizer of the train loss. Choosing μ = ϵ′/6 completes the proof. ∎

###### Lemma 6 (Convergence to minimizer).

Given η > 0, ∃ ϵ > 0 such that whenever BCE(γ_θ) ≤ BCE(γ*) + ϵ, we have

 ‖γ⃗_θ − γ⃗*‖₁ ≤ η

where ‖f‖₁ = ∫_X |f(x)| dx and λ(X) is the Lebesgue measure of the compact set X.

###### Proof.
 BCE(γ) = −(1/2) ∑_{x∈X} ( p(x) log γ(x) + q(x) log(1 − γ(x)) )

is strongly convex as a function of γ under Assumption (A2), since the densities p and q are bounded below by β. Using the Taylor expansion for strongly convex functions, we have

 BCE(