Efficient Estimation of Mutual Information for Strongly Dependent Variables

11/07/2014 ∙ by Shuyang Gao, et al. ∙ USC Information Sciences Institute 0

We demonstrate that a popular class of nonparametric mutual information (MI) estimators based on k-nearest-neighbor graphs requires number of samples that scales exponentially with the true MI. Consequently, accurate estimation of MI between two strongly dependent variables is possible only for prohibitively large sample size. This important yet overlooked shortcoming of the existing estimators is due to their implicit reliance on local uniformity of the underlying joint distribution. We introduce a new estimator that is robust to local non-uniformity, works well with limited data, and is able to capture relationship strengths over many orders of magnitude. We demonstrate the superior performance of the proposed estimator on both synthetic and real-world data.



There are no comments yet.


page 1

page 2

page 3

page 4

Code Repositories


local non-uniformity correction for mutual information estimation

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

As a measure of the dependence between two random variables, mutual information is remarkably general and has several intuitive interpretations 

(Cover and Thomas, 2006)

, which explains its widespread use in statistics, machine learning, and computational neuroscience; see 

(Wang et al., 2009)

for a list of various applications. In a typical scenario, we do not know the underlying joint distribution of the random variables, and instead need to estimate mutual information using i.i.d. samples from that distribution. A naive approach to this problem is to first estimate the underlying probability density, and then calculate mutual information using its definition. Unfortunately, estimating joint densities from a limited number of samples is often infeasible in many practical settings.

A different approach is to estimate mutual information directly from samples in a non-parametric way, without ever describing the entire probability density. The main intuition behind such direct estimators is that evaluating mutual information can in principle be a more tractable problem than estimating the density over the whole state space (Pérez-Cruz, 2008). The most popular class of estimators taking this approach is the

-nearest-neighbor(kNN) based estimators. One example of such estimator is due to Kraskov, Stögbauer, and Grassberger (referred to as the KSG estimator from here on) 

(Kraskov et al., 2004), which has been extended to generalized nearest neighbors graphs (Pál et al., 2010).

Our first contribution is to demonstrate that kNN-based estimators suffer from a critical yet overlooked flaw. Namely, we illustrate that if the true mutual information is , then kNN-based estimators requires exponentially many (in ) samples for accurate estimation. This means that strong relationships are actually more difficult to measure. This counterintuitive property reflects the fact that most work on mutual information estimation has focused on estimators that are good at detecting independence of variables rather than precisely measuring strong dependence. In the age of big data, it is often the case that many strong relationships are present in the data and we are interested in picking out only the strongest ones. MI estimation will perform poorly for this task if their accuracy is low in the regime of strong dependence.

We show that the undesired behavior of previous kNN-based MI estimators can be attributed to the assumption of local uniformity utilized by those estimators, which can be violated for sufficiently strong (almost deterministic) dependencies. As our second major contribution, we suggest a new kNN estimator that relaxes the local uniformity condition by introducing a correction term for local non-uniformity. We demonstrate empirically that for strong relationships, the proposed estimator needs significantly fewer samples for accurately estimating mutual information.

In Sec. 2, we introduce kNN-based non-parametric entropy and mutual information estimators. In Sec. 3, we demonstrate the limitations of kNN-based mutual information estimators. And then we suggest a correction term to overcome these limitations in Sec. 4. In Sec. 5, we show empirically using synthetic and real-world data that our method outperforms existing techniques. In particular, we are able to accurately measure strong relationships using smaller sample sizes. Finally, we conclude with related work and discussion.

2 kNN-based Estimation of Entropic Measures

In this section, we will first introduce the naive kNN estimator for mutual information, which is based on an entropy estimator due to (Singh et al., 2003), and show its theoretical properties. Next, we focus on a popular variant of kNN estimators, called KSG estimator (Kraskov et al., 2004).

2.1 Basic Definitions

Let denote a

-dimensional absolute continuous random variable whose probability density function is defined as

and marginal densities of each are defined as . Shannon Differential Entropy and Mutual Information are defined in the usual way:


We use natural logarithms so that information is measured in nats. For , the generalized mutual information is also called total correlation (Watanabe, 1960) or multi-information (Studenỳ and Vejnarová, 1998).

Given i.i.d. samples, , drawn from , the task is to construct a mutual information estimator based on these samples.

2.2 Naive kNN Estimator

Entropy Estimation

The naive kNN entropy estimator is as follows:




in Eq. 4 is the Euclidean distance from to its th nearest neighbor in

. By introducing a correction term, an asymptotic unbiased estimator is obtained:




where represents the digamma function.

The following theorem shows asymptotic unbiasedness of according to (Singh et al., 2003).

Theorem 1 (kNN entropy estimator, asymptotic unbiasedness, (Singh et al., 2003)).

Assume that is absolutely continuous and is a positive integer, then


i.e., this entropy estimator is asymptotically unbiased.

From Entropy to Mutual Information

To construct a mutual information estimator from an entropy estimator is straightforward by combining entropy estimators using the identity (Cover and Thomas, 2006):


Combining Eqs. 3 and 8, we have,


where denotes the point projected into th dimension in and represents the marginal kNN density estimator projected into th dimension of .

Similar to Eq. 5, we can also construct an asymptotically unbiased mutual information estimator based on Theorem 1:

Corollary 1 (kNN MI estimator, asymptotic unbiasedness).

Assume that is absolute continuous and is a positive integer, then:


2.3 KSG Estimator

The KSG mutual information estimator (Kraskov et al., 2004) is a popular variant of the naive kNN estimator. The general principle of KSG is that for each density estimator in different spaces, we would like to use the similar length-scales for -nearest-neighbor distance as in the joint space so that the bias would be approximately smaller. Although the theoretical properties of this estimator is unknown, it has a relatively good performance in practice, see (Khan et al., 2007) for a comparison of different estimation methods. Unlike naive kNN estimator, the KSG estimator uses the max-norm distance instead of L2-norm. In particular, if is twice the (max-norm) distance to the -th nearest neighbor of , it can be shown that the expectation value (over all ways of drawing the surrounding points) of the log probability mass within the box centered at is given by this expression.

We use to represent the digamma function. If we assume that the density inside the box (with sides of length ) is constant, then the integral becomes trivial and we find that,


Rearranging and taking the mean over leads us to the following entropy estimator:

Note that , defining the size of neighborhood to use in local density estimation, is a free parameter. Using smaller should be more accurate, but larger

reduces the variance of the estimate 

(Khan et al., 2007). While consistent density estimation requires to grow with  (Von Luxburg and Alamgir, 2013), entropy estimates converge almost surely for any fixed  (Wang et al., 2009; Pérez-Cruz, 2008) under some weak conditions.111This assumes the probability density is absolutely continuous but see (Pál et al., 2010) for some technical concerns.

To estimate the mutual information, in the joint space we set , the size of the neighborhood, which determines for each point . Next, we consider the smallest rectilinear hyper-rectangle that contains these points, which has sides of length for each marginal direction . We refer to this as the “max-norm rectangle” (as shown in Fig. 1). Let be the number of points at a distance less than or equal to in the -subspace. For each marginal entropy estimate, we use instead of to set the neighborhood size at each point. Finally, in the joint space using a rectangle instead of a box in Eq. 2.3 leads to a correction term of size (details given in (Kraskov et al., 2004)). Adding the entropy estimators together with these choices yields the following.


Figure 1: Centered at a given sample point, , we show the max-norm rectangle containing

nearest neighbors (a) for points drawn from a uniform distribution,

, and (b) for points drawn from a strongly correlated distribution, .

3 Limitations of kNN-based MI Estimators

In this section, we demonstrate a significant flaw in kNN-based MI estimators which is summarized in the following theorems. We then explain why these estimators fail to accurately estimate mutual information unless the correlations are relatively weak or the number of samples is large.

Theorem 2.

For any d-dimensional absolute continuous probability density function, , for any , for the estimated mutual information to be close to the true mutual information, , requires that the number of samples, , is at least, , where is a constant which scales like .

The proof of Theorem 2 is shown in the Appendix A.

Theorem 3.

For any d-dimensional absolute continuous probability density function, , for any , for the estimated mutual information to be close to the true mutual information, , requires that the number of samples, , is at least, , where .


Note that , where is the -th harmonic number and is the Euler-Mascheroni constant.

The first inequality is obtained from Eq. 2.3 by observing that for any and that is a monotonically increasing function. And the last inequality is obtained by dropping the term and using the well-known upper bound . Requiring that , we obtain , where

The above theorems state that for any fixed dimensionality, the number of samples needed for estimating mutual information increases exponentially with the magnitude of . From the point of view of determining independence, i.e., distinguishing from , this restriction is not particularly troubling. However, for finding strong signals in data it presents a major barrier. Indeed, consider two random variables and , where and . When , the relationship between and becomes nearly functional, and the mutual information diverges as . As a consequence, the number of samples needed for accurately estimating diverges as well. This is depicted in Fig. 2 where we compare the empirical lower bound to our theoretical bound given by Theorem 3. It can be seen that the theoretical bounds are rather conservative, but they have the same exponentially growing rates comparing to the empirical ones.

What is the origin of this undesired behavior? An intuitive and general argument comes from looking at the assumption of local uniformity in kNN-based estimators. In particular, both naive kNN and KSG estimators approximate the probability density in the kNN ball or max-norm rectangle containing the nearest neighbors with uniform density. If there are strong relationships (in the joint space, the density becomes more singular), then we can see in Fig. 1 that the uniform assumption becomes problematic.

Figure 2: A semi-logarithmic plot of (number of required samples to achieve an error at most ) for KSG estimator for different values of . We set , .

4 Improved kNN-based Estimators

In this section, we suggest a class of kNN-based estimators that relaxes the local uniformity assumption.

Local Nonuniformity Correction (LNC)

Our second contribution is to provide a general method for overcoming the limitations above. Considering the ball (in naive kNN estimator) or max-norm hyper-rectangle (in KSG estimator) around the point which contains nearest neighbors, let us denote this region of the space with , whose volume is . Instead of assuming that the density is uniform inside around the point , we assume that there is some subset, with volume on which the density is constant, i.e., This is illustrated with a shaded region in Fig. 1. We now repeat the derivation above using this altered assumption about the local density around each point for . We make no changes to the entropy estimates in the marginal subspaces. Based on this idea, we get a general correction term for kNN-based MI estimators (see Appendix B for the details of derivation):


where can be either or .

If the local density in the -nearest-neighbor region is highly non-uniform, as is the case for strongly related variables like those in Fig. 1, then the proposed correction term will improve the estimate. For instance, if we assume that relationships in data are smooth, functional relationships plus some noise, the correction term will yield significant improvement as demonstrated empirically in Sec. 5. We note that this correction term is not bounded by , but rather by our method of estimating . Next, we will give one concrete implementation of this idea by focusing on the modification of KSG estimator.

Estimating Nonuniformity by Local PCA

With the correction term in Eq. 17, we have transformed the problem into that of finding a local volume on which we believe the density is positive. Regarding KSG estimator, instead of a uniform distribution within the max-norm rectangle in the neighborhood around the point , we look for a small, rotated (hyper)rectangle that covers the neighborhood of . The volume of the rotated rectangle is obtained by doing a localized principle component analysis around all of ’s nearest neighbors, and then multiplying the maximal axis values together in each principle component after the points are transformed to the new coordinate system222Note that we manually set the mean of these points to be when doing PCA, in order to put in the center of the rotated rectangle.. The key advantage of our proposed estimator is as follows: while KSG assumes local uniformity of the density over a region containing nearest neighbors of a particular point, our estimator relies on a much weaker assumption of local linearity over the same region. Note that the local linearity assumption has also been widely adopted in the manifold learning, for example, local linear embedding(LLE) (Roweis and Saul, 2000) and local tangent space alignment(LTSA) (Zhang and Zha, 2002).

Testing for Local Nonuniformity

One problem with this procedure is that we may find that, locally, points may occupy a small sub-volume, i.e., even if the local neighborhood is actually drawn from a uniform distribution(as shown in Fig. 1), the volume of the PCA-aligned rectangle will with high probability be smaller than the volume of the max-norm rectangle, leading to an artificially large non-uniformity correction.

To avoid this artifact, we consider a trade-off between the two possibilities: for a fixed dimension and nearest neighbor parameter , we find a constant , such that if , then we assume local uniformity is violated and use the correction , otherwise the correction is discarded for point . Note that if is sufficiently small, then the correction term will be always discarded, so that our estimator reduces to the KSG estimator. Good choices of are set using arguments described in Appendix C. Furthermore, we believe that as long as the expected value of in large limit, for some properly selected , then the consistency properties of the proposed estimator will be identical to the constancy properties of the KSG estimator. A rigorous proof of this hypothesis is left as a future work.

The full algorithm for our estimator is given in Algorithm 1.

Input: points , parameter (dimension), (nearest neighbor),
Calculate by KSG estimator, using the same nearest neighbor parameter
for each point  do
     Find nearest neighbors of : , ,…,
     Do PCA on these neighbors, calculate the volume corrected rectangle
     Calculate the volume of max-norm rectangle
     if  then
     end if
end for
Calculate : average value of
Algorithm 1 Mutual Information Estimation with Local Nonuniform Correction

5 Experiments

We evaluate the proposed estimator on both synthetically generated and real-world data. For the former, we considered various functional relationships and thoroughly examined the performance of the estimator over a range of noise intensities. For the latter, we applied our estimator to the WHO dataset used previously in (Reshef et al., 2011). Below we report the results.

5.1 Experiments with synthetic data

Functional relationships in two dimensions

In the first set of experiments, we generate samples from various functional relationships of the form that were previously studied in Reshef et al. (2011); Kinney and Atwal (2014). The noise term is distributed uniformly over the interval , where is used to control the noise intensity. We also compare the results to several baseline estimators: KSG (Kraskov et al., 2004), generalized nearest neighbor graph (GNN) (Pál et al., 2010) 333We use the online code http://www.cs.cmu.edu/~bapoczos/codes/REGO_with_kNN.zip for the GNN estimator., minimum spanning trees (MST) (Müller et al., 2012; Yukich and Yukich, 1998), and exponential family with maximum likelihood estimation (EXP) (Nielsen and Nock, 2010) 444We use the Information Theoretical Estimators Toolbox (ITE) (Szabó, 2014) for MST and EXP estimators..

Figure 3: For all the functional relationships above, we used a sample size for each noise level and the nearest neighbor parameter for LNC, KSG and GNN estimators.

Figure 3 demonstrates that the proposed estimator, LNC, consistently outperforms the other estimators. Its superiority is most significant for the low noise regime. In that case, both KSG and GNN estimators are bounded by the sample size while LNC keeps growing. MST tends to overestimate MI for large noise but still stops growing after the noise fall below a certain intensity555Even if or , KSG and GNN estimators still stop growing after the noise fall below a certain intensity.. Surprisingly, EXP is the only other estimator that performs comparably with LNC for the linear relationship (the left most plot in Figure 3). However, it fails dramatically for all the other relationship types. See Appendix D for more functional relationship tests.

Convergence rate

Figure 4 shows the convergence of the two estimators and , at a fixed (small) noise intensity, as we vary the sample size. We test the estimators on both two and five dimensional data for linear and quadratic relationships. We observe that LNC is doing better overall. In particular, for linear relationships in 2D and 5D, as well as quadratic relationships in 2D, the required sample size for LNC is several orders of magnitude less that for . For instance, for the 5D linear relationship does not converge even for the sample size () while converges to the true value with only samples.

Finally, it is worthwhile to remark on the relatively slow convergence of for the 5D quadratic example. This is because , while relaxing the local uniformity assumptions, still assumes local linearity. The stronger the nonlinearity, the more samples are required to find neighborhoods that are locally approximately linear. We note, however, that LNC still converges faster than KSG.

Figure 4: Estimated MI using both KSG and LNC estimators in the number of samples ( and for 2D examples; and for 5D examples)

5.2 Experiments with real-world data

5.2.1 Ranking Relationship Strength

We evaluate the proposed estimator on the WHO dataset which has variables describing various socio-economic, political, and health indicators for different countries 666WHO dataset is publicly available at http://www.exploredata.net/Downloads. We calculate the mutual information between pairs of variables which have at least 150 samples. Next, we rank the pairs based on their estimated mutual information and choose the top 150 pairs with highest mutual information. For these top 150 pairs, We randomly select a fraction of samples for each pair, hide the rest samples and then recalculate the mutual information. We want to see how mutual information-based rank changes by giving different amount of less data, i.e., varying . A good mutual information estimator should give a similar rank using less data as using the full data. We compare our LNC estimator to KSG estimator. Rank similarities are calculated using the standard Spearman’s rank correlation coefficient described in (Spearman, 1904). Fig 5 shows the results. We can see that LNC estimator outperforms KSG estimator, especially when the missing data approaches 90%, Spearman correlation drops to 0.4 for KSG estimator, while our LNC estimator still has a relatively high score of 0.7.

Figure 5: Spearman correlation coefficient between the original MI rank and the rank after hiding some percentage of data by KSG and LNC estimator respectively. The 95% confidence bars are obtained by repeating the experiment for 200 times.

5.2.2 Finding interesting triplets

We also use our estimator to find strong multivariate relationships in the WHO data set. Specifically, we search for synergistic triplets , where one of the variables, say , can be predicted by knowing both and simultaneously, but not by using either variable separately. In other words, we search for triplets such that the pair-wise mutual information between the pairs , and are low, but the multi-information is relatively high. We rank relationships using the following synergy score:777Another measure of synergy is given by the so called “interaction information”: . .

We select the triplets that have synergy score above a certain threshold. Figure 6 shows two synergistic relationships detected by but not by . In these examples, both and estimators yield low mutual information for the pairs , and . However, in contrast to KSG, our estimator yields a relatively high score for multi-information among the three variables.

Figure 6: Two examples of synergistic triplets: and for the first example; and for the second example

For the first relationship, the synergistic behavior can be explained by noting that the ratio of the Total Energy Generation () to Electricity Generation per Person () essentially yields the size of the population, which is highly predictive of the Number of Female Cervical Cancer cases (). While this example might seem somewhat trivial, it illustrates the ability of our method to extract synergistic relationships automatically without any additional assumptions and/or data preprocessing.

In the second example, LNC predicts a strong synergistic interaction between Total Cell Phones (), Number of Female Cervical Cancer cases (), and rate of Tuberculosis deaths (). Since the variable (the number of female cervical cancer cases) grows with the total population, is proportional to the average number of cell phones per person. The last plot indicates that a higher number of cell phones per person are predictive of lower tuberculosis death rate. One possible explanation for this correlation is some common underlying cause (e.g., overall economic development). Another intriguing possibility is that this finding reflects recent efforts to use mobile technology in TB control.888See Stop TB Partnership, http://www.stoptb.org.

6 Related Work

MI estimators

There has been a significant amount of recent work on estimating entropic measures such as divergences and mutual information from samples (see this survey (Walters-Williams and Li, 2009) for an exhaustive list). Khan et al. (2007) compared different MI estimators for varying sample sizes and noise intensity, and reported that for small samples, the KSG estimator was the best choice overall for relatively low noise intensities, while KDE performed better at higher noise intensities. Other approaches include estimators based on Generalized Nearest-Neighbor Graphs (Pál et al., 2010), minimum spanning trees (Müller et al., 2012), maximum likelihood density ratio estimation (Suzuki et al., 2008), and ensemble methods (Sricharan et al., 2013; Moon and Hero, 2014). In particular, the latter approach works by taking a weighted average of simple density plug-in estimates such as kNN or KDE. However, it is upper-bounded by the largest value among its simple density estimates. Therefore, this method would still underestimate the mutual information when it goes larger as discussed before.

It has been recognized that kNN-based entropic estimators underestimate probability density at the sample points that are close to the boundary of support (Liitiäinen et al., 2010). Sricharan et al. (2012) proposed an bipartite plug-in(BPI) estimator for non-linear density functionals that extrapolates the density estimates at interior points that are close to the boundary points in order to compensate the boundary bias. However, this method requires to identify boundary points and interior points which becomes difficult to distinguish as mutual information gets large that almost all the points are close to the boundary. Singh and Poczos (2014)

used a ”mirror image” kernel density estimator to escape the boundary effect, but their estimator relies on the knowledge of the support of the densities, and assumes that the boundary are parallel to the axes.

Mutual Information and Equitability

Reshef et al. (2011) introduced a property they called “suitability” for a measure of correlation. If two variables are related by a functional form with some noise, equitable measures should reflect the magnitude of the noise while being insensitive to the form of the functional relationship. They used this notion to justify a new correlation measure called MIC. Based on comparisons with MI using the KSG estimator, they concluded that MIC is “more equitable” for comparing relationship strengths. While several problems (Simon and Tibshirani, 2014; Gorfine et al., ) and alternatives (Heller et al., 2013; Székely et al., 2009) were pointed out, Kinney and Atwal (KA) were the first to point out that MIC’s apparent superiority to MI was actually due to flaws in estimation (Kinney and Atwal, 2014). A more careful definition of equitability led KA to the conclusion that MI is actually more equitable than MIC. KA suggest that the poor performance of the KSG estimator that led to Reshef et. al.’s mistaken conclusion could be improved by using more samples for estimation. However, here we showed that the number of samples required for KSG is prohibitively large, but that this difficulty can be overcome by using an improved MI estimator.

7 Conclusion

The problem of deciding whether or not two variables are independent is a historically significant endeavor. In that context, research on mutual information estimation has been geared towards distinguishing weak dependence from independence. However, modern data mining presents us with problems requiring a totally different perspective. It is not unusual to have thousands of variables which could have millions of potential relationships. We have insufficient resources to examine each potential relationship so we need an assumption-free way to pick out only the most promising relationships for further study. Many applications have this flavor including the health indicator data considered above as well as gene expression microarray data, human behavior data, economic indicators, and sensor data, to name a few.

How can we select the most interesting relationships to study? Classic correlation measures like the Pearson coefficient bias the results towards linear variables. Mutual information gives a clear and general basis for comparing the strength of otherwise dissimilar variables and relationships. While non-parametric mutual information estimators exist, we showed that strong relationships require exponentially many samples to accurately measure using some of these techniques.

We introduced a non-parametric mutual information estimator that can measure the strength of nonlinear relationships even with small sample sizes. We have incorporated these novel estimators into an open source entropy estimation toolbox 999https://github.com/BiuBiuBiLL/MIE.As the amount and variety of available data grows, general methods for identifying strong relationships will become increasingly necessary. We hope that the developments suggested here will help to address this need.


This research was supported in part by DARPA grant No. W911NF–12–1–0034.


  • Cover and Thomas (2006) T.M. Cover and J.A. Thomas. Elements of information theory. Wiley-Interscience, 2006.
  • (2) Malka Gorfine, Ruth Heller, and Yair Heller. Comment on “detecting novel associations in large data sets”. (available at http://emotion. technion. ac. il/  gorfinm/files/science6. pdf on 11 Nov. 2012).
  • Heller et al. (2013) Ruth Heller, Yair Heller, and Malka Gorfine. A consistent multivariate test of association based on ranks of distances. Biometrika, 100(2):503–510, 2013.
  • Khan et al. (2007) Shiraj Khan, Sharba Bandyopadhyay, Auroop R. Ganguly, Sunil Saigal, David J. Erickson, Vladimir Protopopescu, and George Ostrouchov. Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Phys. Rev. E, 76:026209, Aug 2007. doi: 10.1103/PhysRevE.76.026209. URL http://link.aps.org/doi/10.1103/PhysRevE.76.026209.
  • Kinney and Atwal (2014) J. Kinney and G. Atwal. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 111(9):3354–3359, 2014.
  • Kraskov et al. (2004) A. Kraskov, H. Stögbauer, and P. Grassberger. Estimating mutual information. Phys. Rev. E, 69:066138, 2004. doi: 10.1103/PhysRevE.69.066138. URL http://link.aps.org/doi/10.1103/PhysRevE.69.066138.
  • Liitiäinen et al. (2010) Elia Liitiäinen, Amaury Lendasse, and Francesco Corona.

    A boundary corrected expansion of the moments of nearest neighbor distributions.

    Random Struct. Algorithms, 37(2):223–247, September 2010. ISSN 1042-9832. doi: 10.1002/rsa.v37:2. URL http://dx.doi.org/10.1002/rsa.v37:2.
  • Moon and Hero (2014) K.R. Moon and A.O. Hero. Ensemble estimation of multivariate f-divergence. In Information Theory (ISIT), 2014 IEEE International Symposium on, pages 356–360, June 2014. doi: 10.1109/ISIT.2014.6874854.
  • Müller et al. (2012) Andreas Müller, Sebastian Nowozin, and Christoph Lampert. Information theoretic clustering using minimum spanning trees. Pattern Recognition, pages 205–215, 2012.
  • Nielsen and Nock (2010) F. Nielsen and R. Nock. Entropies and cross-entropies of exponential families. In 17th IEEE International Conference on Image Processing (ICIP), pages 3621–3624. IEEE, 2010.
  • Pál et al. (2010) Dávid Pál, Barnabás Póczos, and Csaba Szepesvári. Estimation of rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Advances in Neural Information Processing Systems 23, pages 1849–1857. Curran Associates, Inc., 2010.
  • Pérez-Cruz (2008) Fernando Pérez-Cruz. Estimation of information theoretic measures for continuous random variables. In Proceedings of NIPS-08, pages 1257–1264, 2008.
  • Reshef et al. (2011) David N Reshef, Yakir A Reshef, Hilary K Finucane, Sharon R Grossman, Gilean McVean, Peter J Turnbaugh, Eric S Lander, Michael Mitzenmacher, and Pardis C Sabeti. Detecting novel associations in large data sets. science, 334(6062):1518–1524, 2011.
  • Roweis and Saul (2000) Sam T Roweis and Lawrence K Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
  • Simon and Tibshirani (2014) Noah Simon and Robert Tibshirani. Comment on” detecting novel associations in large data sets” by reshef et al, science dec 16, 2011. arXiv preprint arXiv:1401.7645, 2014.
  • Singh et al. (2003) Harshinder Singh, Neeraj Misra, Vladimir Hnizdo, Adam Fedorowicz, and Eugene Demchuk. Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences, 23(3-4):301–321, 2003. doi: 10.1080/01966324.2003.10737616. URL http://dx.doi.org/10.1080/01966324.2003.10737616.
  • Singh and Poczos (2014) Shashank Singh and Barnabas Poczos. Generalized exponential concentration inequality for renyi divergence estimation. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 333–341, 2014. URL http://machinelearning.wustl.edu/mlpapers/papers/icml2014c1_singh14.
  • Spearman (1904) C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):pp. 72–101, 1904. ISSN 00029556. URL http://www.jstor.org/stable/1412159.
  • Sricharan et al. (2012) K. Sricharan, R. Raich, and A.O. Hero. Estimation of nonlinear functionals of densities with confidence. Information Theory, IEEE Transactions on, 58(7):4135–4159, July 2012. ISSN 0018-9448. doi: 10.1109/TIT.2012.2195549.
  • Sricharan et al. (2013) K. Sricharan, D. Wei, and A.O. Hero. Ensemble estimators for multivariate entropy estimation. Information Theory, IEEE Transactions on, 59(7):4374–4388, July 2013. ISSN 0018-9448. doi: 10.1109/TIT.2013.2251456.
  • Studenỳ and Vejnarová (1998) Milan Studenỳ and Jirina Vejnarová. The multiinformation function as a tool for measuring stochastic dependence. In Learning in graphical models, pages 261–297. Springer, 1998.
  • Suzuki et al. (2008) Taiji Suzuki, Masashi Sugiyama, Jun Sese, and Takafumi Kanamori. Approximating mutual information by maximum likelihood density ratio estimation. In Yvan Saeys, Huan Liu, Iñaki Inza, Louis Wehenkel, and Yves Van de Peer, editors, FSDM, volume 4 of JMLR Proceedings, pages 5–20. JMLR.org, 2008.
  • Szabó (2014) Zoltán Szabó. Information theoretical estimators toolbox. Journal of Machine Learning Research, 15:283–287, 2014. (https://bitbucket.org/szzoli/ite/).
  • Székely et al. (2009) Gábor J Székely, Maria L Rizzo, et al. Brownian distance covariance. The annals of applied statistics, 3(4):1236–1265, 2009.
  • Von Luxburg and Alamgir (2013) Ulrike Von Luxburg and Morteza Alamgir. Density estimation from unweighted k-nearest neighbor graphs: a roadmap. In Advances in Neural Information Processing Systems, 2013.
  • Walters-Williams and Li (2009) Janett Walters-Williams and Yan Li. Estimation of mutual information: A survey. In Rough Sets and Knowledge Technology, volume 5589 of Lecture Notes in Computer Science, pages 389–396. Springer Berlin Heidelberg, 2009. ISBN 978-3-642-02961-5. doi: 10.1007/978-3-642-02962-2˙49. URL http://dx.doi.org/10.1007/978-3-642-02962-2_49.
  • Wang et al. (2009) Q. Wang, S.R. Kulkarni, and S. Verdú. Divergence estimation for multidimensional densities via k-nearest-neighbor distances. IEEE Trans. Inf. Theor., 55:2392–2405, May 2009. ISSN 0018-9448. doi: 10.1109/TIT.2009.2016060. URL http://dl.acm.org/citation.cfm?id=1669487.1669521.
  • Watanabe (1960) Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of research and development, 4(1):66–82, 1960.
  • Yukich and Yukich (1998) Joseph E Yukich and Joseph Yukich. Probability theory of classical Euclidean optimization problems. Springer Berlin, 1998.
  • Zhang and Zha (2002) Zhenyue Zhang and Hongyuan Zha. Principal manifolds and nonlinear dimension reduction via local tangent space alignment. SIAM Journal of Scientific Computing, 26:313–338, 2002.

Supplementary Material for “Efficient Estimation of Mutual Information for Strongly Dependent Variables”

Appendix A Proof of Theorem 2

Notice that for a fixed sample point , its -nearest-neighbor distance is always equal to or larger than the -nearest-neighbor distance of at the same point projected into a sub-dimension , i.e., for any , we have


Using Eq. A.1, we get the upper bound of as follows:

The last inequality is obtained by noticing that is a monotonous decreasing function.

Also, we have,

The inequality above is obtained by using the bound of gamma function that,

Therefore, reconsidering A, we get the following inequality for :

Requiring that , we obtain,


where is a constant which scales like .

Appendix B Derivation of Eq. 17

The naive kNN or KSG estimator can be written as:


where is the probability mass around the -nearest-neighborhood at and is the probability mass around the -nearest-neighborhood (or -nearest-neighborhood for KSG) at projected into -th dimension. Also, and denote the volume of the kNN ball(or hype-rectangle in KSG) in the joint space and projected subspaces respectively.

Now our local nonuniform correction method replaces the volume in Eq. B.1 with the corrected volume , thus, our estimator is obtained as follows:

Appendix C Empirical Evaluation for

Suppose we have a uniform distribution on the dimensional (hyper)rectangle with volume . We sample points from this uniform distribution. We perform PCA using these points to get a new basis. After rotating into this new basis, we find the volume, , of the smallest rectilinear rectangle containing the points. By chance, we will typically find , even though the distribution is uniform. This will lead to us to (incorrectly) apply a local non-uniformity correction. Instead, we set a threshold and if is above the threshold, we assume that the distribution is locally uniform. Setting involves a trade-off. If it is set too high, we will incorrectly conclude there is local non-uniformity and therefore over-estimate the mutual information. If we set too low, we will lose statistical power for “medium-strength” relationships (though very strong relationships will still lead to values of smaller than ).

In practice, we determine the correct value of

empirically. We look at the probability distribution of

that occurs when the true distribution is uniform. We set conservatively so that when the true distribution is uniform, our criteria rejects this hypothesis with small probability, . Specifically, we do a number of trials, , and set such that where is a relatively small value. In practice, we chose and . The following algorithm describes this procedure:

Input: parameter (dimension), (nearest neighbor), ,
set array to be NULL
     Randomly choose a uniform distribution supported on dimensional (hyper) rectangle, denote its volume to be
     Draw points from this uniform distribution, get the correcting volume after doing PCA
     add the ratio to array
until above procedure repeated times
smallest number in
Algorithm C.1 Estimating for LNC
Figure C.1: as a function of . ranges over for each dimension .

Figure C.1 shows empirical value of for different pairs. We can see that for a fixed dimension , grows as increases, meaning that must be closer to

to accept the null hypothesis of

uniformity. We also find that decreases as the dimension increases, indicating that for a fixed , becomes much smaller than when points are drawn from a uniform distribution in higher dimensions.

Appendix D More Functional Relationship Tests in Two Dimensions

We have tested together twenty-one functional relationships described in  Reshef et al. (2011); Kinney and Atwal (2014), we show six of them in Section 5. The complete results are shown in Figure D.1. Detailed description of the functions can be found in Table S1 of Supporting Information in Kinney and Atwal (2014).

Figure D.1: Mutual Information tests of , , , , estimators. Twenty-one functional relationships with different noise intensities are tested. Noise has the form where varies(as shown in X axis of the plots). For KSG, GNN and LNC estimators, nearest neighbor parameter . We are using data points for each noisy functional relationship.