local non-uniformity correction for mutual information estimation
We demonstrate that a popular class of nonparametric mutual information (MI) estimators based on k-nearest-neighbor graphs requires number of samples that scales exponentially with the true MI. Consequently, accurate estimation of MI between two strongly dependent variables is possible only for prohibitively large sample size. This important yet overlooked shortcoming of the existing estimators is due to their implicit reliance on local uniformity of the underlying joint distribution. We introduce a new estimator that is robust to local non-uniformity, works well with limited data, and is able to capture relationship strengths over many orders of magnitude. We demonstrate the superior performance of the proposed estimator on both synthetic and real-world data.READ FULL TEXT VIEW PDF
Like most nonparametric estimators of information functionals involving
Estimating mutual information from i.i.d. samples drawn from an unknown ...
We present simple and computationally efficient nonparametric estimators...
We study the problem of using i.i.d. samples from an unknown multivariat...
Variational approaches based on neural networks are showing promise for
The robust estimator presented in this paper processes each structure
Estimators of information theoretic measures such as entropy and mutual
local non-uniformity correction for mutual information estimation
As a measure of the dependence between two random variables, mutual information is remarkably general and has several intuitive interpretations(Cover and Thomas, 2006)
, which explains its widespread use in statistics, machine learning, and computational neuroscience; see(Wang et al., 2009)
for a list of various applications. In a typical scenario, we do not know the underlying joint distribution of the random variables, and instead need to estimate mutual information using i.i.d. samples from that distribution. A naive approach to this problem is to first estimate the underlying probability density, and then calculate mutual information using its definition. Unfortunately, estimating joint densities from a limited number of samples is often infeasible in many practical settings.
A different approach is to estimate mutual information directly from samples in a non-parametric way, without ever describing the entire probability density. The main intuition behind such direct estimators is that evaluating mutual information can in principle be a more tractable problem than estimating the density over the whole state space (Pérez-Cruz, 2008). The most popular class of estimators taking this approach is the
-nearest-neighbor(kNN) based estimators. One example of such estimator is due to Kraskov, Stögbauer, and Grassberger (referred to as the KSG estimator from here on)(Kraskov et al., 2004), which has been extended to generalized nearest neighbors graphs (Pál et al., 2010).
Our first contribution is to demonstrate that kNN-based estimators suffer from a critical yet overlooked flaw. Namely, we illustrate that if the true mutual information is , then kNN-based estimators requires exponentially many (in ) samples for accurate estimation. This means that strong relationships are actually more difficult to measure. This counterintuitive property reflects the fact that most work on mutual information estimation has focused on estimators that are good at detecting independence of variables rather than precisely measuring strong dependence. In the age of big data, it is often the case that many strong relationships are present in the data and we are interested in picking out only the strongest ones. MI estimation will perform poorly for this task if their accuracy is low in the regime of strong dependence.
We show that the undesired behavior of previous kNN-based MI estimators can be attributed to the assumption of local uniformity utilized by those estimators, which can be violated for sufficiently strong (almost deterministic) dependencies. As our second major contribution, we suggest a new kNN estimator that relaxes the local uniformity condition by introducing a correction term for local non-uniformity. We demonstrate empirically that for strong relationships, the proposed estimator needs significantly fewer samples for accurately estimating mutual information.
In Sec. 2, we introduce kNN-based non-parametric entropy and mutual information estimators. In Sec. 3, we demonstrate the limitations of kNN-based mutual information estimators. And then we suggest a correction term to overcome these limitations in Sec. 4. In Sec. 5, we show empirically using synthetic and real-world data that our method outperforms existing techniques. In particular, we are able to accurately measure strong relationships using smaller sample sizes. Finally, we conclude with related work and discussion.
In this section, we will first introduce the naive kNN estimator for mutual information, which is based on an entropy estimator due to (Singh et al., 2003), and show its theoretical properties. Next, we focus on a popular variant of kNN estimators, called KSG estimator (Kraskov et al., 2004).
Let denote aand marginal densities of each are defined as . Shannon Differential Entropy and Mutual Information are defined in the usual way:
We use natural logarithms so that information is measured in nats. For , the generalized mutual information is also called total correlation (Watanabe, 1960) or multi-information (Studenỳ and Vejnarová, 1998).
Given i.i.d. samples, , drawn from , the task is to construct a mutual information estimator based on these samples.
The naive kNN entropy estimator is as follows:
in Eq. 4 is the Euclidean distance from to its th nearest neighbor in
. By introducing a correction term, an asymptotic unbiased estimator is obtained:
where represents the digamma function.
The following theorem shows asymptotic unbiasedness of according to (Singh et al., 2003).
Assume that is absolutely continuous and is a positive integer, then
i.e., this entropy estimator is asymptotically unbiased.
To construct a mutual information estimator from an entropy estimator is straightforward by combining entropy estimators using the identity (Cover and Thomas, 2006):
where denotes the point projected into th dimension in and represents the marginal kNN density estimator projected into th dimension of .
Assume that is absolute continuous and is a positive integer, then:
The KSG mutual information estimator (Kraskov et al., 2004) is a popular variant of the naive kNN estimator. The general principle of KSG is that for each density estimator in different spaces, we would like to use the similar length-scales for -nearest-neighbor distance as in the joint space so that the bias would be approximately smaller. Although the theoretical properties of this estimator is unknown, it has a relatively good performance in practice, see (Khan et al., 2007) for a comparison of different estimation methods. Unlike naive kNN estimator, the KSG estimator uses the max-norm distance instead of L2-norm. In particular, if is twice the (max-norm) distance to the -th nearest neighbor of , it can be shown that the expectation value (over all ways of drawing the surrounding points) of the log probability mass within the box centered at is given by this expression.
We use to represent the digamma function. If we assume that the density inside the box (with sides of length ) is constant, then the integral becomes trivial and we find that,
Rearranging and taking the mean over leads us to the following entropy estimator:
Note that , defining the size of neighborhood to use in local density estimation, is a free parameter. Using smaller should be more accurate, but larger
reduces the variance of the estimate(Khan et al., 2007). While consistent density estimation requires to grow with (Von Luxburg and Alamgir, 2013), entropy estimates converge almost surely for any fixed (Wang et al., 2009; Pérez-Cruz, 2008) under some weak conditions.111This assumes the probability density is absolutely continuous but see (Pál et al., 2010) for some technical concerns.
To estimate the mutual information, in the joint space we set , the size of the neighborhood, which determines for each point . Next, we consider the smallest rectilinear hyper-rectangle that contains these points, which has sides of length for each marginal direction . We refer to this as the “max-norm rectangle” (as shown in Fig. 1). Let be the number of points at a distance less than or equal to in the -subspace. For each marginal entropy estimate, we use instead of to set the neighborhood size at each point. Finally, in the joint space using a rectangle instead of a box in Eq. 2.3 leads to a correction term of size (details given in (Kraskov et al., 2004)). Adding the entropy estimators together with these choices yields the following.
nearest neighbors (a) for points drawn from a uniform distribution,, and (b) for points drawn from a strongly correlated distribution, .
In this section, we demonstrate a significant flaw in kNN-based MI estimators which is summarized in the following theorems. We then explain why these estimators fail to accurately estimate mutual information unless the correlations are relatively weak or the number of samples is large.
For any d-dimensional absolute continuous probability density function, , for any , for the estimated mutual information to be close to the true mutual information, , requires that the number of samples, , is at least, , where is a constant which scales like .
For any d-dimensional absolute continuous probability density function, , for any , for the estimated mutual information to be close to the true mutual information, , requires that the number of samples, , is at least, , where .
Note that , where is the -th harmonic number and is the Euler-Mascheroni constant.
The first inequality is obtained from Eq. 2.3 by observing that for any and that is a monotonically increasing function. And the last inequality is obtained by dropping the term and using the well-known upper bound . Requiring that , we obtain , where ∎
The above theorems state that for any fixed dimensionality, the number of samples needed for estimating mutual information increases exponentially with the magnitude of . From the point of view of determining independence, i.e., distinguishing from , this restriction is not particularly troubling. However, for finding strong signals in data it presents a major barrier. Indeed, consider two random variables and , where and . When , the relationship between and becomes nearly functional, and the mutual information diverges as . As a consequence, the number of samples needed for accurately estimating diverges as well. This is depicted in Fig. 2 where we compare the empirical lower bound to our theoretical bound given by Theorem 3. It can be seen that the theoretical bounds are rather conservative, but they have the same exponentially growing rates comparing to the empirical ones.
What is the origin of this undesired behavior? An intuitive and general argument comes from looking at the assumption of local uniformity in kNN-based estimators. In particular, both naive kNN and KSG estimators approximate the probability density in the kNN ball or max-norm rectangle containing the nearest neighbors with uniform density. If there are strong relationships (in the joint space, the density becomes more singular), then we can see in Fig. 1 that the uniform assumption becomes problematic.
In this section, we suggest a class of kNN-based estimators that relaxes the local uniformity assumption.
Our second contribution is to provide a general method for overcoming the limitations above. Considering the ball (in naive kNN estimator) or max-norm hyper-rectangle (in KSG estimator) around the point which contains nearest neighbors, let us denote this region of the space with , whose volume is . Instead of assuming that the density is uniform inside around the point , we assume that there is some subset, with volume on which the density is constant, i.e., This is illustrated with a shaded region in Fig. 1. We now repeat the derivation above using this altered assumption about the local density around each point for . We make no changes to the entropy estimates in the marginal subspaces. Based on this idea, we get a general correction term for kNN-based MI estimators (see Appendix B for the details of derivation):
where can be either or .
If the local density in the -nearest-neighbor region is highly non-uniform, as is the case for strongly related variables like those in Fig. 1, then the proposed correction term will improve the estimate. For instance, if we assume that relationships in data are smooth, functional relationships plus some noise, the correction term will yield significant improvement as demonstrated empirically in Sec. 5. We note that this correction term is not bounded by , but rather by our method of estimating . Next, we will give one concrete implementation of this idea by focusing on the modification of KSG estimator.
With the correction term in Eq. 17, we have transformed the problem into that of finding a local volume on which we believe the density is positive. Regarding KSG estimator, instead of a uniform distribution within the max-norm rectangle in the neighborhood around the point , we look for a small, rotated (hyper)rectangle that covers the neighborhood of . The volume of the rotated rectangle is obtained by doing a localized principle component analysis around all of ’s nearest neighbors, and then multiplying the maximal axis values together in each principle component after the points are transformed to the new coordinate system222Note that we manually set the mean of these points to be when doing PCA, in order to put in the center of the rotated rectangle.. The key advantage of our proposed estimator is as follows: while KSG assumes local uniformity of the density over a region containing nearest neighbors of a particular point, our estimator relies on a much weaker assumption of local linearity over the same region. Note that the local linearity assumption has also been widely adopted in the manifold learning, for example, local linear embedding(LLE) (Roweis and Saul, 2000) and local tangent space alignment(LTSA) (Zhang and Zha, 2002).
One problem with this procedure is that we may find that, locally, points may occupy a small sub-volume, i.e., even if the local neighborhood is actually drawn from a uniform distribution(as shown in Fig. 1), the volume of the PCA-aligned rectangle will with high probability be smaller than the volume of the max-norm rectangle, leading to an artificially large non-uniformity correction.
To avoid this artifact, we consider a trade-off between the two possibilities: for a fixed dimension and nearest neighbor parameter , we find a constant , such that if , then we assume local uniformity is violated and use the correction , otherwise the correction is discarded for point . Note that if is sufficiently small, then the correction term will be always discarded, so that our estimator reduces to the KSG estimator. Good choices of are set using arguments described in Appendix C. Furthermore, we believe that as long as the expected value of in large limit, for some properly selected , then the consistency properties of the proposed estimator will be identical to the constancy properties of the KSG estimator. A rigorous proof of this hypothesis is left as a future work.
The full algorithm for our estimator is given in Algorithm 1.
We evaluate the proposed estimator on both synthetically generated and real-world data. For the former, we considered various functional relationships and thoroughly examined the performance of the estimator over a range of noise intensities. For the latter, we applied our estimator to the WHO dataset used previously in (Reshef et al., 2011). Below we report the results.
In the first set of experiments, we generate samples from various functional relationships of the form that were previously studied in Reshef et al. (2011); Kinney and Atwal (2014). The noise term is distributed uniformly over the interval , where is used to control the noise intensity. We also compare the results to several baseline estimators: KSG (Kraskov et al., 2004), generalized nearest neighbor graph (GNN) (Pál et al., 2010) 333We use the online code http://www.cs.cmu.edu/~bapoczos/codes/REGO_with_kNN.zip for the GNN estimator., minimum spanning trees (MST) (Müller et al., 2012; Yukich and Yukich, 1998), and exponential family with maximum likelihood estimation (EXP) (Nielsen and Nock, 2010) 444We use the Information Theoretical Estimators Toolbox (ITE) (Szabó, 2014) for MST and EXP estimators..
Figure 3 demonstrates that the proposed estimator, LNC, consistently outperforms the other estimators. Its superiority is most significant for the low noise regime. In that case, both KSG and GNN estimators are bounded by the sample size while LNC keeps growing. MST tends to overestimate MI for large noise but still stops growing after the noise fall below a certain intensity555Even if or , KSG and GNN estimators still stop growing after the noise fall below a certain intensity.. Surprisingly, EXP is the only other estimator that performs comparably with LNC for the linear relationship (the left most plot in Figure 3). However, it fails dramatically for all the other relationship types. See Appendix D for more functional relationship tests.
Figure 4 shows the convergence of the two estimators and , at a fixed (small) noise intensity, as we vary the sample size. We test the estimators on both two and five dimensional data for linear and quadratic relationships. We observe that LNC is doing better overall. In particular, for linear relationships in 2D and 5D, as well as quadratic relationships in 2D, the required sample size for LNC is several orders of magnitude less that for . For instance, for the 5D linear relationship does not converge even for the sample size () while converges to the true value with only samples.
Finally, it is worthwhile to remark on the relatively slow convergence of for the 5D quadratic example. This is because , while relaxing the local uniformity assumptions, still assumes local linearity. The stronger the nonlinearity, the more samples are required to find neighborhoods that are locally approximately linear. We note, however, that LNC still converges faster than KSG.
We evaluate the proposed estimator on the WHO dataset which has variables describing various socio-economic, political, and health indicators for different countries 666WHO dataset is publicly available at http://www.exploredata.net/Downloads. We calculate the mutual information between pairs of variables which have at least 150 samples. Next, we rank the pairs based on their estimated mutual information and choose the top 150 pairs with highest mutual information. For these top 150 pairs, We randomly select a fraction of samples for each pair, hide the rest samples and then recalculate the mutual information. We want to see how mutual information-based rank changes by giving different amount of less data, i.e., varying . A good mutual information estimator should give a similar rank using less data as using the full data. We compare our LNC estimator to KSG estimator. Rank similarities are calculated using the standard Spearman’s rank correlation coefficient described in (Spearman, 1904). Fig 5 shows the results. We can see that LNC estimator outperforms KSG estimator, especially when the missing data approaches 90%, Spearman correlation drops to 0.4 for KSG estimator, while our LNC estimator still has a relatively high score of 0.7.
We also use our estimator to find strong multivariate relationships in the WHO data set. Specifically, we search for synergistic triplets , where one of the variables, say , can be predicted by knowing both and simultaneously, but not by using either variable separately. In other words, we search for triplets such that the pair-wise mutual information between the pairs , and are low, but the multi-information is relatively high. We rank relationships using the following synergy score:777Another measure of synergy is given by the so called “interaction information”: . .
We select the triplets that have synergy score above a certain threshold. Figure 6 shows two synergistic relationships detected by but not by . In these examples, both and estimators yield low mutual information for the pairs , and . However, in contrast to KSG, our estimator yields a relatively high score for multi-information among the three variables.
For the first relationship, the synergistic behavior can be explained by noting that the ratio of the Total Energy Generation () to Electricity Generation per Person () essentially yields the size of the population, which is highly predictive of the Number of Female Cervical Cancer cases (). While this example might seem somewhat trivial, it illustrates the ability of our method to extract synergistic relationships automatically without any additional assumptions and/or data preprocessing.
In the second example, LNC predicts a strong synergistic interaction between Total Cell Phones (), Number of Female Cervical Cancer cases (), and rate of Tuberculosis deaths (). Since the variable (the number of female cervical cancer cases) grows with the total population, is proportional to the average number of cell phones per person. The last plot indicates that a higher number of cell phones per person are predictive of lower tuberculosis death rate. One possible explanation for this correlation is some common underlying cause (e.g., overall economic development). Another intriguing possibility is that this finding reflects recent efforts to use mobile technology in TB control.888See Stop TB Partnership, http://www.stoptb.org.
There has been a significant amount of recent work on estimating entropic measures such as divergences and mutual information from samples (see this survey (Walters-Williams and Li, 2009) for an exhaustive list). Khan et al. (2007) compared different MI estimators for varying sample sizes and noise intensity, and reported that for small samples, the KSG estimator was the best choice overall for relatively low noise intensities, while KDE performed better at higher noise intensities. Other approaches include estimators based on Generalized Nearest-Neighbor Graphs (Pál et al., 2010), minimum spanning trees (Müller et al., 2012), maximum likelihood density ratio estimation (Suzuki et al., 2008), and ensemble methods (Sricharan et al., 2013; Moon and Hero, 2014). In particular, the latter approach works by taking a weighted average of simple density plug-in estimates such as kNN or KDE. However, it is upper-bounded by the largest value among its simple density estimates. Therefore, this method would still underestimate the mutual information when it goes larger as discussed before.
It has been recognized that kNN-based entropic estimators underestimate probability density at the sample points that are close to the boundary of support (Liitiäinen et al., 2010). Sricharan et al. (2012) proposed an bipartite plug-in(BPI) estimator for non-linear density functionals that extrapolates the density estimates at interior points that are close to the boundary points in order to compensate the boundary bias. However, this method requires to identify boundary points and interior points which becomes difficult to distinguish as mutual information gets large that almost all the points are close to the boundary. Singh and Poczos (2014)
used a ”mirror image” kernel density estimator to escape the boundary effect, but their estimator relies on the knowledge of the support of the densities, and assumes that the boundary are parallel to the axes.
Reshef et al. (2011) introduced a property they called “suitability” for a measure of correlation. If two variables are related by a functional form with some noise, equitable measures should reflect the magnitude of the noise while being insensitive to the form of the functional relationship. They used this notion to justify a new correlation measure called MIC. Based on comparisons with MI using the KSG estimator, they concluded that MIC is “more equitable” for comparing relationship strengths. While several problems (Simon and Tibshirani, 2014; Gorfine et al., ) and alternatives (Heller et al., 2013; Székely et al., 2009) were pointed out, Kinney and Atwal (KA) were the first to point out that MIC’s apparent superiority to MI was actually due to flaws in estimation (Kinney and Atwal, 2014). A more careful definition of equitability led KA to the conclusion that MI is actually more equitable than MIC. KA suggest that the poor performance of the KSG estimator that led to Reshef et. al.’s mistaken conclusion could be improved by using more samples for estimation. However, here we showed that the number of samples required for KSG is prohibitively large, but that this difficulty can be overcome by using an improved MI estimator.
The problem of deciding whether or not two variables are independent is a historically significant endeavor. In that context, research on mutual information estimation has been geared towards distinguishing weak dependence from independence. However, modern data mining presents us with problems requiring a totally different perspective. It is not unusual to have thousands of variables which could have millions of potential relationships. We have insufficient resources to examine each potential relationship so we need an assumption-free way to pick out only the most promising relationships for further study. Many applications have this flavor including the health indicator data considered above as well as gene expression microarray data, human behavior data, economic indicators, and sensor data, to name a few.
How can we select the most interesting relationships to study? Classic correlation measures like the Pearson coefficient bias the results towards linear variables. Mutual information gives a clear and general basis for comparing the strength of otherwise dissimilar variables and relationships. While non-parametric mutual information estimators exist, we showed that strong relationships require exponentially many samples to accurately measure using some of these techniques.
We introduced a non-parametric mutual information estimator that can measure the strength of nonlinear relationships even with small sample sizes. We have incorporated these novel estimators into an open source entropy estimation toolbox 999https://github.com/BiuBiuBiLL/MIE.As the amount and variety of available data grows, general methods for identifying strong relationships will become increasingly necessary. We hope that the developments suggested here will help to address this need.
This research was supported in part by DARPA grant No. W911NF–12–1–0034.
A boundary corrected expansion of the moments of nearest neighbor distributions.Random Struct. Algorithms, 37(2):223–247, September 2010. ISSN 1042-9832. doi: 10.1002/rsa.v37:2. URL http://dx.doi.org/10.1002/rsa.v37:2.
Notice that for a fixed sample point , its -nearest-neighbor distance is always equal to or larger than the -nearest-neighbor distance of at the same point projected into a sub-dimension , i.e., for any , we have
Using Eq. A.1, we get the upper bound of as follows:
The last inequality is obtained by noticing that is a monotonous decreasing function.
Also, we have,
The inequality above is obtained by using the bound of gamma function that,
Therefore, reconsidering A, we get the following inequality for :
Requiring that , we obtain,
where is a constant which scales like .
The naive kNN or KSG estimator can be written as:
where is the probability mass around the -nearest-neighborhood at and is the probability mass around the -nearest-neighborhood (or -nearest-neighborhood for KSG) at projected into -th dimension. Also, and denote the volume of the kNN ball(or hype-rectangle in KSG) in the joint space and projected subspaces respectively.
Now our local nonuniform correction method replaces the volume in Eq. B.1 with the corrected volume , thus, our estimator is obtained as follows:
Suppose we have a uniform distribution on the dimensional (hyper)rectangle with volume . We sample points from this uniform distribution. We perform PCA using these points to get a new basis. After rotating into this new basis, we find the volume, , of the smallest rectilinear rectangle containing the points. By chance, we will typically find , even though the distribution is uniform. This will lead to us to (incorrectly) apply a local non-uniformity correction. Instead, we set a threshold and if is above the threshold, we assume that the distribution is locally uniform. Setting involves a trade-off. If it is set too high, we will incorrectly conclude there is local non-uniformity and therefore over-estimate the mutual information. If we set too low, we will lose statistical power for “medium-strength” relationships (though very strong relationships will still lead to values of smaller than ).
In practice, we determine the correct value of
empirically. We look at the probability distribution ofthat occurs when the true distribution is uniform. We set conservatively so that when the true distribution is uniform, our criteria rejects this hypothesis with small probability, . Specifically, we do a number of trials, , and set such that where is a relatively small value. In practice, we chose and . The following algorithm describes this procedure:
Figure C.1 shows empirical value of for different pairs. We can see that for a fixed dimension , grows as increases, meaning that must be closer to
to accept the null hypothesis ofuniformity. We also find that decreases as the dimension increases, indicating that for a fixed , becomes much smaller than when points are drawn from a uniform distribution in higher dimensions.
We have tested together twenty-one functional relationships described in Reshef et al. (2011); Kinney and Atwal (2014), we show six of them in Section 5. The complete results are shown in Figure D.1. Detailed description of the functions can be found in Table S1 of Supporting Information in Kinney and Atwal (2014).