# Consistency of the Maximal Information Coefficient Estimator

The Maximal Information Coefficient (MIC) of Reshef et al. (Science, 2011) is a statistic for measuring dependence between variable pairs in large datasets. In this note, we prove that MIC is a consistent estimator of the corresponding population statistic MIC_*. This corrects an error in an argument of Reshef et al. (JMLR, 2016), which we describe.


## 1 Introduction

The Maximal Information Coefficient (MIC) of two-dimensional data points is a statistic introduced by Reshef et al. [reshef2011] for measuring the dependence between pairs of variables. In later work [reshef16-mice], the authors introduced the $\mathrm{MIC}_*$ statistic, which is defined analogously to MIC but for jointly-distributed pairs of random variables. Both statistics are based on the mutual information of the discrete distributions obtained by imposing finite grids over the data (respectively, over the joint distribution). Given a dataset $D_n$ of $n$ points drawn iid from a jointly-distributed pair of random variables $(X,Y)$, the authors of [reshef16-mice] sought to show that the statistic $\mathrm{MIC}(D_n)$ is a consistent estimator of $\mathrm{MIC}_*(X,Y)$ (i.e., that $\mathrm{MIC}(D_n)$ converges in probability to $\mathrm{MIC}_*(X,Y)$ as $n \to \infty$). In this note, we identify and correct an error in an argument of [reshef16-mice] related to proving this consistency. Our new proof modifies the original approach of Reshef et al., but the result is slightly weaker than the originally desired claim (the set of parameters for which consistency holds is smaller). It is left open whether the full consistency claim of [reshef16-mice] can be recovered. After introducing some notation in Section 2, we describe the flaw in the original argument of Reshef et al. in Section 3, and then we provide our new proof of consistency in Section 4.

## 2 Preliminaries

We primarily adopt the notation used by Reshef et al. [reshef16-mice], and we summarize a few key pieces here. For the sake of brevity, readers should refer to the original paper of Reshef et al. for exact definitions of the $\mathrm{MIC}$ and $\mathrm{MIC}_*$ statistics. Let $(X,Y)$ denote a pair of jointly-distributed random variables, and let $D_n$ be a sample of $n$ points drawn iid from $(X,Y)$. If $G$ is a grid partition with $k$ rows and $\ell$ columns, then $(X,Y)|_G$ denotes the discrete distribution induced by $(X,Y)$ on the cells of $G$. We let $M$ denote the population characteristic matrix for $(X,Y)$, and we let $\hat{M}$ denote the sample characteristic matrix for $D_n$. We say that a grid partition $\Gamma$ is an equipartition of $(X,Y)$ if all rows of $(X,Y)|_\Gamma$ have equal probability mass and all columns of $(X,Y)|_\Gamma$ have equal probability mass. The total variation distance between distributions $\Pi$ and $\Psi$ is given by

$$D_{TV}(\Pi,\Psi) = \frac{1}{2}\,\|\Pi - \Psi\|_1.$$

We use $I(X,Y)$ to denote the mutual information of a jointly-distributed pair of random variables $(X,Y)$.
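As a small illustration of the total variation distance defined above, the following sketch computes it for two PMFs over the cells of a grid. The function name `total_variation` and the 2×2 example values are hypothetical and chosen only for illustration.

```python
def total_variation(p, q):
    """Total variation distance between two PMFs given as equal-length lists."""
    if len(p) != len(q):
        raise ValueError("PMFs must be defined over the same cells")
    # Half of the L1 distance between the two probability vectors.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical 2x2 grid, cells flattened row by row.
pi = [0.25, 0.25, 0.25, 0.25]   # population PMF, e.g. (X,Y) restricted to a grid
psi = [0.30, 0.20, 0.25, 0.25]  # empirical PMF, e.g. D_n restricted to the same grid
print(total_variation(pi, psi))
```

In the notation of this section, `pi` plays the role of $\Pi$ and `psi` the role of $\Psi$.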

## 3 Flaw in Original Argument

Here we point out the error in the proof of Lemma 37 from [reshef16-mice, Appendix A], in the paragraph with header “Bounding the $\epsilon_{i,j}$”. To provide context, we consider a jointly-distributed pair of random variables $(X,Y)$ and a dataset $D_n$ of $n$ points drawn iid from $(X,Y)$. Let $\Gamma$ be an equipartition of $(X,Y)$ whose numbers of rows and columns are as specified in [reshef16-mice], and let $C(n)$ denote the total number of cells in $\Gamma$. So $(X,Y)|_\Gamma$ and $D_n|_\Gamma$ are discrete distributions, and we let $\pi_{i,j}$ and $\psi_{i,j}$ denote their PMFs respectively. Note also that because the points of $D_n$ are drawn iid from $(X,Y)$, for any cell $(i,j)$ the quantity $n\psi_{i,j}$ is the sum of $n$ iid Bernoulli random variables with mean $\pi_{i,j}$, and so $\mathbb{E}[n\psi_{i,j}] = n\pi_{i,j}$. The purpose of Lemma 37 is to give a uniform bound on the absolute difference $|I((X,Y)|_G) - I(D_n|_G)|$ for all $k \times \ell$ grids $G$ that holds with high probability. The strategy of the original authors is to obtain this bound by introducing the “common” equipartition grid $\Gamma$. The main consequence of the error in Lemma 37 is that the probabilistic guarantee of the subsequent Lemma 38 does not hold as stated, which prevents the overall consistency argument from going through. We start by pointing out the error in the proof of Lemma 37 before showing an example of its consequences for Lemma 38.

##### Error in Lemma 37

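For every cell $(i,j)$, the authors define $\epsilon_{i,j} = (\psi_{i,j} - \pi_{i,j})/\pi_{i,j}$, and so

$$|\epsilon_{i,j}| = \left|\frac{\psi_{i,j} - \pi_{i,j}}{\pi_{i,j}}\right| \ge \delta \implies |n\psi_{i,j} - n\pi_{i,j}| \ge \delta \cdot n\pi_{i,j}. \tag{1}$$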

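Given that $n\psi_{i,j}$ is the sum of $n$ iid Bernoulli RVs, the authors state the following multiplicative Chernoff bound:

\begin{align}
\Pr[|\epsilon_{i,j}| \ge \delta] = \Pr[|\psi_{i,j} - \pi_{i,j}| \ge \delta \cdot \pi_{i,j}] &= \Pr[|n\psi_{i,j} - n\pi_{i,j}| \ge \delta \cdot n\pi_{i,j}] \tag{2} \\
&\le \exp\left(-\Omega(n\pi_{i,j}\delta^2)\right). \tag{3}
\end{align}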

For reference, we state below the standard two-sided Chernoff bound from Corollary 4.6 in [mmbook], which says that

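$$\Pr[|n\psi_{i,j} - n\pi_{i,j}| \ge \delta \cdot n\pi_{i,j}] \le 2\exp\left(-\frac{n\pi_{i,j} \cdot \delta^2}{3}\right) \tag{4}$$

for any $0 < \delta < 1$. Now, the authors set $\delta = \sqrt{\pi_{i,j}}/C(n)^{0.5+\alpha}$ for some parameter $\alpha$. Applying the bound from (3) with this value of $\delta$, the authors write

$$\Pr\left[|\epsilon_{i,j}| \ge \frac{\sqrt{\pi_{i,j}}}{C(n)^{0.5+\alpha}}\right] \le \exp\left(-\Omega\left(\frac{n}{C(n)^{1+2\alpha}}\right)\right), \tag{5}$$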

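which incorrectly drops a dependence on $\pi_{i,j}$. A correct application of the true Chernoff bound from (4) instead yields

\begin{align}
\Pr\left[|\epsilon_{i,j}| \ge \frac{\sqrt{\pi_{i,j}}}{C(n)^{0.5+\alpha}}\right] &= \Pr\left[|n\psi_{i,j} - n\pi_{i,j}| \ge \frac{\sqrt{\pi_{i,j}}}{C(n)^{0.5+\alpha}} \cdot n\pi_{i,j}\right] \tag{6} \\
&\le 2\exp\left(-\frac{n\pi_{i,j}}{3} \cdot \frac{\pi_{i,j}}{C(n)^{1+2\alpha}}\right) \tag{7} \\
&\le 2\exp\left(-\Omega\left(\frac{n \cdot (\pi_{i,j})^2}{C(n)^{1+2\alpha}}\right)\right). \tag{8}
\end{align}

Compared to (5), the term in (8) resulting from the correct application of the Chernoff bound has a dependence on $(\pi_{i,j})^2$. This means that the bound on the error probability for $\epsilon_{i,j}$ becomes non-negligible as the value of some $\pi_{i,j}$ goes to 0. We note that since the number of grid cells increases with $n$ by the definition of $\Gamma$, the value of any $\pi_{i,j}$ can decrease with $n$. When this occurs, we expect the bound on the probability in (8) to grow undesirably large.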

##### Ramifications on Lemma 38

Using the corrected error probability from (8) for a single cell $(i,j)$ and taking a union bound over all cells $(i,j)$ means that the statement of Lemma 37 now holds with total error probability at most

$$2\sum_{(i,j)} \exp\left(-\Omega\left(\frac{n(\pi_{i,j})^2}{C(n)^{1+2\alpha}}\right)\right). \tag{9}$$

This updated probability seems to prevent the proof (as written) of Lemma 38 in [reshef16-mice] from working in general. For example, consider the special case where $X$ and $Y$ are independent. Then the discrete distribution $(X,Y)|_\Gamma$ has PMF $\pi_{i,j} = 1/C(n)$ for all $(i,j)$, given that $\Gamma$ is an equipartition. The overall error term from Lemma 37 in (9) can then be rewritten as

$$2\sum_{(i,j)} \exp\left(-\Omega\left(\frac{n \cdot (1/C(n))^2}{C(n)^{1+2\alpha}}\right)\right) = 2 \cdot C(n)\exp\left(-\Omega\left(\frac{n}{C(n)^{3+2\alpha}}\right)\right). \tag{10}$$
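To get a feel for the size of the error term in (10), the following sketch evaluates it numerically. It rests on assumptions that are not part of the original argument: the constant hidden by $\Omega(\cdot)$ is taken to be $1/3$ (the constant from the Chernoff bound (4)), and $C(n)$ is taken to grow like $n^{1-\epsilon/2}$ for a hypothetical choice of $\epsilon$ and $\alpha$. The function name is also hypothetical.

```python
import math

def union_bound_independent(n, eps, alpha, c=1.0 / 3.0):
    """Evaluate 2*C(n)*exp(-c*n / C(n)**(3 + 2*alpha)) with C(n) = n**(1 - eps/2).

    The constant c stands in for the one hidden by Omega(.) and is an assumption.
    """
    cells = n ** (1.0 - eps / 2.0)
    return 2.0 * cells * math.exp(-c * n / cells ** (3.0 + 2.0 * alpha))

# Hypothetical parameter choices, for illustration only.
for n in (10**3, 10**6, 10**9):
    print(n, union_bound_independent(n, eps=0.5, alpha=0.1))
```

Under these assumptions the expression exceeds 1 and grows with $n$, so as a bound on a probability it is vacuous.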

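Now in Lemma 38, the authors work with a function $B(n) = O(n^{1-\epsilon})$, for which

$$C(n) \le B(n) \cdot n^{\epsilon/2} = O(n^{1-\epsilon/2}) \tag{11}$$

as written on page 34. But this means

$$C(n)^{3+2\alpha} = O\left(n^{(1-\epsilon/2)\cdot(3+2\alpha)}\right). \tag{12}$$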

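The current strategy in the proof of Lemma 38 relies on bounding the error term in (10) as

$$2 \cdot C(n)\exp\left(-\Omega\left(\frac{n}{C(n)^{3+2\alpha}}\right)\right) \le O(n)\exp\left(-\Omega(n^u)\right) \tag{13}$$

for some $u > 0$. To achieve such a bound requires that $C(n)^{3+2\alpha}$ grow no faster than $n$, and in turn this requires from (12) that

$$(1-\epsilon/2)(3+2\alpha) = \left(\frac{2-\epsilon}{2}\right)(3+2\alpha) \le 1. \tag{14}$$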

Simplifying yields the constraint

$$2\alpha \le \frac{2}{2-\epsilon} - 3 = \frac{3\epsilon-4}{2-\epsilon}, \tag{15}$$

and so we must have

$$\alpha \le \frac{3\epsilon-4}{4-2\epsilon}. \tag{16}$$

Now for $0 < \epsilon < 1$, which corresponds to a value of $B(n)$ that grows with $n$, we can verify that the right-hand side of (16) is always negative. So obtaining the desired error term in (13) constrains $\alpha$ to be negative, which contradicts the sign requirement on $\alpha$ used in earlier parts of the proof of Lemma 38. So in the case where the joint distribution $(X,Y)$ is formed by two independent random variables, using the corrected error bound from Lemma 37 in (10) renders the proof of Lemma 38 incorrect. Thus for general joint distributions $(X,Y)$, we should not expect the current technique in the proof of Lemma 38 to work.
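The sign claim about the right-hand side of (16) can be checked numerically. The sketch below evaluates it on a grid of values strictly between 0 and 1; the function name and the grid resolution are arbitrary choices made for illustration.

```python
def alpha_upper_bound(eps):
    """Right-hand side of (16): (3*eps - 4) / (4 - 2*eps)."""
    return (3.0 * eps - 4.0) / (4.0 - 2.0 * eps)

# eps values 0.001, 0.002, ..., 0.999, all strictly inside (0, 1).
grid = [i / 1000.0 for i in range(1, 1000)]
assert all(alpha_upper_bound(e) < 0 for e in grid)
print(max(alpha_upper_bound(e) for e in grid))
```

The expression is increasing in its argument on this interval and tends to $-1/2$ at the right endpoint, so the largest value on the grid is still below $-1/2$.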

## 4 New Consistency Proof

We now outline an alternative approach to replace Lemmas 35-38 in [reshef16-mice, Appendix A], which are needed to prove the consistency of the MIC estimator in Theorem 6.

### 4.1 Overview of Argument

Our main goal is to prove a statement similar to Lemma 38 of [reshef16-mice], which probabilistically bounds the difference between corresponding entries of and : We want to show that there exists a function that grows with such that, for every joint distribution and , if is a sample of points drawn iid from , then

$$|M_{k,\ell} - \hat{M}_{k,\ell}| = o(1)$$

holds simultaneously for all with probability at least (where the randomness is over the sampling that determines and the asymptotics are defined wrt increasing ). If we obtain Goal 4.1, then the proof of Theorem 6 [reshef16-mice, Appendix A] (which shows the consistency of the MIC estimator and relies on obtaining Goal 4.1) can remain unmodified.

#### 4.1.1 Proof Sketch of Goal 4.1

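Recall that for a fixed pair $(k,\ell)$ (and assuming wlog $k \le \ell$) we have

$$\begin{aligned}
|M_{k,\ell} - \hat{M}_{k,\ell}| &= \left|\max_{G:\,k\times\ell} \frac{I((X,Y)|_G)}{\log_2 k} - \max_{G:\,k\times\ell} \frac{I(D_n|_G)}{\log_2 k}\right| \\
&= \frac{1}{\log_2 k} \cdot \left|\max_{G:\,k\times\ell} I((X,Y)|_G) - \max_{G:\,k\times\ell} I(D_n|_G)\right| \\
&\le \max_{G:\,k\times\ell} \left|I((X,Y)|_G) - I(D_n|_G)\right|.
\end{aligned} \tag{17}$$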
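The last step of (17) uses two facts: $\log_2 k \ge 1$ whenever $k \ge 2$, and the elementary inequality $|\max_G f(G) - \max_G g(G)| \le \max_G |f(G) - g(G)|$. The sketch below checks the second fact on random finite families of values. The finite lists are a stand-in for the collection of grids and are an assumption made purely for illustration, as are the function name and the random inputs.

```python
import random

def max_gap_holds(f_vals, g_vals):
    """Check |max f - max g| <= max |f - g| for two equal-length lists."""
    lhs = abs(max(f_vals) - max(g_vals))
    rhs = max(abs(f - g) for f, g in zip(f_vals, g_vals))
    return lhs <= rhs + 1e-12  # small tolerance for floating point

rng = random.Random(0)  # fixed seed so the run is reproducible
trials = [
    ([rng.random() for _ in range(20)], [rng.random() for _ in range(20)])
    for _ in range(1000)
]
print(all(max_gap_holds(f, g) for f, g in trials))
```

The inequality holds because, if $G_1$ maximizes $f$, then $\max f - \max g \le f(G_1) - g(G_1)$, and symmetrically with the roles of $f$ and $g$ exchanged.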

In other words, expression (17) shows that to bound the difference between and , it is sufficient to bound the maximum difference in mutual information between the discrete distributions and for a grid of size at most . So our strategy for Goal 4.1 is to first obtain such a bound on expression (17) for a fixed that holds with probability at least , where . Then by taking a union bound over all (which is at most pairs) and by choosing the function appropriately, the statement of Goal 4.1 will hold with total probability at least . The purpose of the original Lemmas 35-37 in [reshef16-mice] is to bound this maximum difference in mutual information from (17) for grids, but here we will circumvent Lemma 37 and obtain our goal by adapting the original argument of Lemma 36. The result is a probabilistic bound on the expression (17), but we note that the bound only holds for where . This is a slightly weaker guarantee compared to the original statement of Lemma 38 and Theorem 6, which held for for . We first state the following variant of Lemma 36 from [reshef16-mice, Appendix A], which follows directly from the original proof of the lemma. [(Variant of [reshef16-mice, Lemma 36])]

• Let and be random variables.

• Let be a grid with cells.

• Let be any grid with cells.

• Let (resp. ) be the total probability mass of (resp. ) falling in cells of that are not contained in individual cells of .

• Let be a sub-grid of of cells obtained by replacing every horizontal or vertical line in that is not in with a closest line in .

Then

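\begin{align}
|I(\Pi|_G) - I(\Psi|_G)| \le{} & O(\delta\log_2(\beta/\delta)) \;+ \tag{18} \\
& O(d\log_2(\beta/d)) \;+ \tag{19} \\
& |I(\Pi|_{G'}) - I(\Psi|_{G'})|. \tag{20}
\end{align}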

To apply this lemma, we will suppose and (by slight abuse of notation) , and we consider any grid where for some . We will set to be an equipartition of into rows and columns for any . With these settings, we obtain a probabilistic bound on that holds for every grid simultaneously (where the probability is over the randomness of the sampled points ) by deriving probabilistic bounds on (18), (19), and (20) separately and applying a union bound. Stated formally: Let be a pair of jointly-distributed random variables and let be a dataset of points sampled iid from . For any and any , consider any pair where . Let be an equipartition of into rows, columns, and total cells for any . For any grid , let be a grid of equal size as defined in Lemma 4.1.1. Then the following probabilistic bounds hold simultaneously for every grid :

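1. with probability 1:

   $$\delta \le \frac{2}{n^\epsilon} \implies O(\delta\log_2(\beta/\delta)) = O\left(\frac{\log_2 n}{n^\epsilon}\right) \tag{21}$$
2. with probability at least $1 - p_d$ (the failure probability $p_d$ is bounded in Section 4.2):

   $$d \le \frac{4}{n^\epsilon} \implies O(d\log_2(\beta/d)) = O\left(\frac{\log_2 n}{n^\epsilon}\right) \tag{22}$$
3. with probability at least $1 - p_{G'}$ (the failure probability $p_{G'}$ is bounded in Section 4.3):

   $$|I((X,Y)|_{G'}) - I(D_n|_{G'})| = O\left(\phi\log_2\left(\frac{n^\alpha}{\phi}\right)\right) \tag{23}$$

   where

   $$\phi = O\left(\frac{n^{\alpha+2\epsilon} \cdot \log_2^{0.5} n}{n^{0.5}}\right). \tag{24}$$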

Granting Lemma 4.1.1 as true and applying Lemma 4.1.1, we have the following corollary that results in a bound on our original target expression (17): Let be a pair of jointly-distributed random variables and let be a dataset of points sampled iid from , and let and be defined as in Lemma 4.1.1. For every , there exists some such that for all and for all :

$$|I((X,Y)|_G) - I(D_n|_G)| = O\left(\frac{1}{n^u}\right) \tag{25}$$

holds for every grid simultaneously with probability at least

$$1 - (p_d + p_{G'}) \ge 1 - O(n^{-2.5}). \tag{26}$$

###### Proof.

As in the statement of Lemma 4.1.1, let be an equipartition of into rows and columns. When , any choice of ensures , which means that expressions (21), (22), and (23) from Lemma 4.1.1 are all for some positive constant . So for every , expression (25) of the corollary follows from applying Lemma 4.1.1 with these three bounds. The same setting of and also means that and (since is for any ), from which expression (26) of the corollary follows. ∎

We can now use Corollary 4.1.1 to formally state the following theorem which achieves our original Goal 4.1. Let be a pair of jointly-distributed random variables and let be a dataset of points sampled iid from . For every , there exists a constant such that for all :

$$|M_{k,\ell} - \hat{M}_{k,\ell}| = O\left(\frac{1}{n^u}\right) \tag{27}$$

holds for every pair where simultaneously with probability at least (where the randomness is over the sampling that determines ).

###### Proof.

Recall expression (17), which says that

$$|M_{k,\ell} - \hat{M}_{k,\ell}| \le \max_{G:\,k\times\ell} \left|I((X,Y)|_G) - I(D_n|_G)\right|$$

for a fixed pair . Then by Corollary 4.1.1, for every and every , there exists some such that the right hand side of this expression is with probability at least for every where . Given that there are at most pairs satisfying , it follows from a union bound that for all such pairs simultaneously with probability . ∎

This gives us the desired result of Goal 4.1, and it now remains to prove the three parts of Lemma 4.1.1.

### 4.2 Lemma 4.1.1 Proof: Parts 1 and 2

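Recall that $\Pi = (X,Y)$, $\Psi = D_n$, $\Gamma$ is an equipartition of $(X,Y)$ with $kn^\epsilon$ rows and $\ell n^\epsilon$ columns, and $G$ is a $k \times \ell$ grid where $k\ell = O(n^\alpha)$ for some $\alpha$. We define $\delta$ (resp. $d$) to be the total mass of $\Pi$ (resp. $\Psi$) falling in cells of $\Gamma$ that are not contained in individual cells of $G$. We will prove parts 1 and 2 of Lemma 4.1.1 together, which say that:

1. $\delta \le 2/n^\epsilon$ with probability 1.

2. $d \le 4/n^\epsilon$ with probability at least $1 - p_d$.

Our strategy will be to bound $\delta$ by $2/n^\epsilon$, and then to show $d \le 4/n^\epsilon$ with probability all but $p_d$.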

#### Chernoff Bounds [mmbook, Chapter 4]

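First, we (re)state two standard Chernoff bounds that will be used in this section and the next: Let $X = \sum_{i=1}^{n} X_i$, where each $X_i$ is an iid Bernoulli RV with $\mathbb{E}[X_i] = \mu$.

1. Two-sided tail bound: for any $0 < t < 1$:

   $$\Pr[|X - n\mu| \ge t \cdot n\mu] \le 2 \cdot \exp\left(-\frac{n\mu t^2}{3}\right) \tag{28}$$
2. Upper tail bound: for any $\hat{\mu} \ge \mu$ and $0 < t \le 1$:

   $$\Pr[X \ge (1+t) \cdot n\hat{\mu}] \le \exp\left(-\frac{n\hat{\mu} t^2}{3}\right) \tag{29}$$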
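As a sanity check on the two-sided bound (28), the following sketch compares it with the exact binomial tail probability for a few deviations. The parameter values ($n = 200$, $\mu = 0.3$, and the listed values of $t$) and the function names are hypothetical and chosen only for illustration.

```python
import math

def binom_pmf(n, p, x):
    """Probability that a Binomial(n, p) variable equals x."""
    return math.comb(n, x) * p**x * (1.0 - p) ** (n - x)

def exact_two_sided_tail(n, mu, t):
    """Pr[|X - n*mu| >= t*n*mu] for X ~ Binomial(n, mu), summed exactly."""
    dev = t * n * mu
    return sum(binom_pmf(n, mu, x) for x in range(n + 1) if abs(x - n * mu) >= dev)

def chernoff_two_sided(n, mu, t):
    """Right-hand side of (28): 2 * exp(-n*mu*t^2 / 3), stated for 0 < t < 1."""
    return 2.0 * math.exp(-n * mu * t * t / 3.0)

n, mu = 200, 0.3
for t in (0.1, 0.3, 0.5, 0.9):
    print(t, exact_two_sided_tail(n, mu, t), chernoff_two_sided(n, mu, t))
```

In each row the exact tail should not exceed the Chernoff bound; for small $t$ the bound is loose and can even exceed 1.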

#### Bound on δ

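By definition, $\delta$ is the sum of the mass of $\Pi$ in a subset of the columns and rows of $\Gamma$. Let $\pi_{i,j}$ denote the pmf at cell $(i,j)$ of $(X,Y)|_\Gamma$, let $\pi_{i,*}$ denote the total mass of $(X,Y)|_\Gamma$ in row $i$, and let $\pi_{*,j}$ denote the total mass of $(X,Y)|_\Gamma$ in column $j$. So

$$\pi_{*,j} = \sum_i \pi_{i,j} = \frac{1}{\ell n^\epsilon}, \qquad \pi_{i,*} = \sum_j \pi_{i,j} = \frac{1}{k n^\epsilon}$$

by the definition of $\Gamma$ as an equipartition of $(X,Y)$. Now let $K$ be the column indices of $\Gamma$ containing a column separator of $G$, and let $R$ be the row indices of $\Gamma$ containing a row separator of $G$. Since $G$ is a $k \times \ell$ grid, we must have $|K| \le \ell$ and $|R| \le k$. Then (with probability 1):

$$\delta \le \sum_{j\in K} \pi_{*,j} + \sum_{i\in R} \pi_{i,*} \le \ell \cdot \pi_{*,j} + k \cdot \pi_{i,*} = \frac{\ell}{\ell n^\epsilon} + \frac{k}{k n^\epsilon} = \frac{2}{n^\epsilon}.$$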

#### Bound on d with probability 1−pd

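Again by definition, $d$ is the sum of the mass of $\Psi$ in a subset of the columns and rows of $\Gamma$. We let $\psi_{i,j}$ denote the pmf at cell $(i,j)$ of $D_n|_\Gamma$, and we define $\psi_{i,*}$ and $\psi_{*,j}$ analogously to $\pi_{i,*}$ and $\pi_{*,j}$. We will show that each $\psi_{*,j} \le 2\pi_{*,j}$ (respectively $\psi_{i,*} \le 2\pi_{i,*}$) probabilistically. Observe that $n \cdot \psi_{*,j}$ is a sum of $n$ iid Bernoullis, each with mean $\pi_{*,j}$. So

$$\mathbb{E}[n \cdot \psi_{*,j}] = n \cdot \pi_{*,j} = \frac{n}{\ell n^\epsilon} = \frac{n^{1-\epsilon}}{\ell}. \tag{30}$$

Then by the Chernoff bound (29) with $t = 1$:

$$\Pr\left[n \cdot \psi_{*,j} \ge 2 \cdot (n^{1-\epsilon}/\ell)\right] \le \exp\left(-\frac{n^{1-\epsilon}}{3\ell}\right) \le \exp\left(-\Omega(n^{1-\epsilon-\alpha})\right),$$

where the final inequality is due to $\ell = O(n^\alpha)$, which follows from the assumed bound on $k\ell$. So for each $j$ we have $\psi_{*,j} \le 2\pi_{*,j}$ with probability all but $\exp(-\Omega(n^{1-\epsilon-\alpha}))$. A similar calculation shows that $\psi_{i,*} \le 2\pi_{i,*}$ for each $i$ with probability all but $\exp(-\Omega(n^{1-\epsilon-\alpha}))$. Combining the two inequalities and taking a union bound shows

$$d \le \sum_{j\in K} \psi_{*,j} + \sum_{i\in R} \psi_{i,*} \le 2\left(\sum_{j\in K} \pi_{*,j} + \sum_{i\in R} \pi_{i,*}\right) \le \frac{4}{n^\epsilon}$$

with probability all but $p_d$, since $|K| \le \ell$ and $|R| \le k$ and by using the bound on $\delta$ previously established.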

### 4.3 Lemma 4.1.1 Proof: Part 3

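Recall that given the grid $\Gamma$ (which is an equipartition of $(X,Y)$ into $kn^\epsilon$ rows and $\ell n^\epsilon$ columns) and the $k \times \ell$ grid $G$, the grid $G'$ is a sub-grid of $\Gamma$ obtained by replacing every horizontal or vertical line in $G$ that is not in $\Gamma$ with a closest line in $\Gamma$. To prove an upper bound on the quantity $|I((X,Y)|_{G'}) - I(D_n|_{G'})|$, we will use Proposition 40 from Appendix B of [reshef16-mice], which relates the statistical distance between two discrete distributions to their change in mutual information:

Proposition ([reshef16-mice, Proposition 40, Appendix B]). Let $\Pi$ and $\Psi$ be discrete distributions over $k \times \ell$ grids. If $D_{TV}(\Pi,\Psi) \le \delta$ for any admissible $\delta$ (see [reshef16-mice] for the exact range), then

$$|I(\Pi) - I(\Psi)| \le O\left(\delta\log_2\left(\frac{\min\{k,\ell\}}{\delta}\right)\right).$$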

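Since $G'$ is a sub-grid of $\Gamma$, every cell of $G'$ is a union of cells of $\Gamma$, and so it follows by the triangle inequality that

$$D_{TV}\left((X,Y)|_{G'},\, D_n|_{G'}\right) \le D_{TV}\left((X,Y)|_{\Gamma},\, D_n|_{\Gamma}\right).$$

Thus if we obtain a bound $D_{TV}((X,Y)|_\Gamma, D_n|_\Gamma) \le \phi$, then applying Proposition 4.3 yields

$$|I((X,Y)|_{G'}) - I(D_n|_{G'})| = O\left(\phi\log_2\left(\frac{\min\{k,\ell\}}{\phi}\right)\right) = O\left(\phi\log_2\left(\frac{n^\alpha}{\phi}\right)\right)$$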

by the assumption that . So given , the dataset , and the equipartition of total cells, we will prove the following bound on , which implies Part (3) of Lemma 4.1.1. Let be a pair of jointly-distributed random variables and let be a dataset of points sampled iid from . For any and any , consider any pair where . Let be an equipartition of into rows, columns, and total cells for any . Then

$$D_{TV}\left((X,Y)|_\Gamma,\, D_n|_\Gamma\right) = O\left(\frac{n^{\alpha+2\epsilon} \cdot \log_2^{0.5} n}{n^{0.5}}\right)$$

with probability at least .

###### Proof.

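Given $(X,Y)$, a sample $D_n$ of $n$ points drawn iid from $(X,Y)$, and the grid $\Gamma$, define the discrete distributions

$$\Pi = (X,Y)|_\Gamma \text{ with pmf } \pi_{i,j}, \qquad \Psi = D_n|_\Gamma \text{ with pmf } \psi_{i,j},$$

where $i$ ranges over the $kn^\epsilon$ rows and $j$ over the $\ell n^\epsilon$ columns of $\Gamma$ (note that the use of $\Pi$ and $\Psi$ here differs slightly from the previous subsection). Also, for every cell $(i,j)$ of $\Gamma$, we say that

$$(i,j) \text{ is large if } \pi_{i,j} > \frac{9\log_2 n}{n} \quad\text{and}\quad (i,j) \text{ is small if } \pi_{i,j} \le \frac{9\log_2 n}{n},$$

and let $L$ and $S$ denote the sets of large and small cells, respectively. Now recall that

$$D_{TV}(\Pi,\Psi) \le \|\Pi - \Psi\|_1 = \sum_{(i,j)} |\pi_{i,j} - \psi_{i,j}| = \sum_{(i,j)\in L} |\pi_{i,j} - \psi_{i,j}| + \sum_{(i,j)\in S} |\pi_{i,j} - \psi_{i,j}|. \tag{31}$$

By the triangle inequality, we have that

$$\sum_{(i,j)\in S} |\pi_{i,j} - \psi_{i,j}| \le \sum_{(i,j)\in S} \left(|\pi_{i,j}| + |\psi_{i,j}|\right) = \sum_{(i,j)\in S} |\pi_{i,j}| + \sum_{(i,j)\in S} |\psi_{i,j}|,$$

and substituting back into (31) gives

$$D_{TV}(\Pi,\Psi) \le \sum_{(i,j)\in L} |\pi_{i,j} - \psi_{i,j}| + \sum_{(i,j)\in S} |\pi_{i,j}| + \sum_{(i,j)\in S} |\psi_{i,j}|. \tag{32}$$

So to bound $D_{TV}(\Pi,\Psi)$, we will bound each term of (32) separately.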
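The large/small decomposition in (31) and (32) can be illustrated on a toy example. In the sketch below, the four-cell PMFs, the sample size, and the function names are hypothetical; the threshold $9\log_2 n / n$ is the one used in the definition of large and small cells.

```python
import math

def split_cells(pi, n):
    """Partition cell indices into (large, small) using the threshold 9*log2(n)/n."""
    threshold = 9.0 * math.log2(n) / n
    large = [c for c, mass in enumerate(pi) if mass > threshold]
    small = [c for c, mass in enumerate(pi) if mass <= threshold]
    return large, small

def bound_32(pi, psi, n):
    """Right-hand side of (32) for PMFs given as flat lists over the cells."""
    large, small = split_cells(pi, n)
    return (sum(abs(pi[c] - psi[c]) for c in large)
            + sum(pi[c] for c in small)
            + sum(psi[c] for c in small))

# Hypothetical 4-cell example with n = 1000 samples.
n = 1000
pi = [0.50, 0.40, 0.05, 0.05]   # population PMF over the cells
psi = [0.52, 0.39, 0.06, 0.03]  # empirical PMF over the same cells
l1 = sum(abs(a - b) for a, b in zip(pi, psi))
print(split_cells(pi, n), l1, bound_32(pi, psi, n))
```

For these values the threshold is roughly 0.09, so the first two cells are large and the last two are small, and the right-hand side of (32) dominates the $L_1$ distance as expected.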
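
Bound on $|\pi_{i,j} - \psi_{i,j}|$ for large $(i,j)$: Observe that $\psi_{i,j}$ is the fraction of points of $D_n$ contained in cell $(i,j)$ of $\Gamma$. Each point has probability $\pi_{i,j}$ of falling in cell $(i,j)$, so $n\psi_{i,j}$ is the sum of $n$ iid Bernoullis, each with mean $\pi_{i,j}$. Using the two-sided Chernoff bound from (28), we then have

$$\Pr[|n\psi_{i,j} - n\pi_{i,j}| \ge t n\pi_{i,j}] \le 2 \cdot \exp\left(-\frac{n\pi_{i,j} \cdot t^2}{3}\right) \tag{33}$$

for any $0 < t < 1$. Since $\pi_{i,j} > 9\log_2 n / n$ for each large $(i,j)$, observe that setting $t = 3\sqrt{\log_2 n}/\sqrt{n\pi_{i,j}}$ means that

$$t < \frac{3\sqrt{\log_2 n}}{\sqrt{n}} \cdot \frac{\sqrt{n}}{\sqrt{9\log_2 n}} = 1,$$

and thus the bound in (33) can be applied. (The only lower-bound constraint on $\pi_{i,j}$ for large $(i,j)$ comes from ensuring $t < 1$ so that our Chernoff bound variant can be applied. This constraint also determines the denominator in (34).) Then for each large $(i,j)$, using this setting of $t$ gives

$$\Pr[|n\psi_{i,j} - n\pi_{i,j}| \ge t n\pi_{i,j}] \le 2 \cdot \exp\left(-\frac{n\pi_{i,j} \cdot t^2}{3}\right) \le 2 \cdot \exp(-3\log_2 n) \le \frac{2}{n^3}.$$

So for each large $(i,j)$, with probability at least $1 - 2/n^3$ we have

$$|n\pi_{i,j} - n\psi_{i,j}| \le t n\pi_{i,j} \iff |\pi_{i,j} - \psi_{i,j}| \le t\pi_{i,j},$$

which means by our setting of $t$ that

$$|\pi_{i,j} - \psi_{i,j}| \le \frac{3\log_2^{0.5} n}{\sqrt{\pi_{i,j}\, n}} \cdot \pi_{i,j} = \frac{3\log_2^{0.5} n}{\sqrt{n}} \cdot \sqrt{\pi_{i,j}} \le \frac{3\log_2^{0.5} n}{\sqrt{n}}.$$

Here, the last inequality holds given that $\pi_{i,j} \le 1$ for any large $(i,j)$. Now summing over all large $(i,j)$ and taking a union bound, we have

$$\sum_{(i,j)\in L} |\pi_{i,j} - \psi_{i,j}| \le k\ell n^{2\epsilon} \cdot \frac{3\log_2^{0.5} n}{n^{0.5}} \le O\left(\frac{n^{\alpha+2\epsilon} \cdot \log_2^{0.5} n}{n^{0.5}}\right) \tag{34}$$

with probability at least $1 - 2C(n)/n^3$, since $|L| \le C(n) = k\ell n^{2\epsilon}$ and by the assumed bound $k\ell = O(n^\alpha)$.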
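
Bound on $\sum_{(i,j)\in S} \pi_{i,j}$ for small $(i,j)$: Recall that cell $(i,j)$ is small if $\pi_{i,j} \le 9\log_2 n / n$, and so with probability 1:

$$\sum_{(i,j)\in S} \pi_{i,j} \le \frac{9 \cdot C(n)\log_2 n}{n}.$$

Bound on $\sum_{(i,j)\in S} \psi_{i,j}$ for small $(i,j)$: Observe that $n \cdot \sum_{(i,j)\in S} \psi_{i,j}$ is the total number of points of $D_n$ contained in small cells of $\Gamma$ and is thus the sum of $n$ iid Bernoullis, each with mean $\sum_{(i,j)\in S} \pi_{i,j}$. So in expectation we have

$$\mathbb{E}\left[n \cdot \left(\sum_{(i,j)\in S} \psi_{i,j}\right)\right] = n \cdot \left(\sum_{(i,j)\in S} \pi_{i,j}\right) \le n \cdot \left(\frac{9 \cdot C(n)\log_2 n}{n}\right) = 9 \cdot C(n)\log_2 n,$$

where the inequality is due to the bound on $\sum_{(i,j)\in S} \pi_{i,j}$ for small $(i,j)$ from the previous step. Now using the upper Chernoff bound from (29) and setting $\hat{\mu} = 9 \cdot C(n)\log_2 n / n$ and $t = 1$ gives

$$\Pr\left[n \cdot \left(\sum_{(i,j)\in S} \psi_{i,j}\right) \ge 18 \cdot C(n)\log_2 n\right] \le \exp\left(-3k\ell n^{2\epsilon} \cdot \log_2 n\right) \le \exp\left(-\Omega(n^{2\epsilon})\right)$$

since $C(n) = k\ell n^{2\epsilon} \ge n^{2\epsilon}$. This means that with probability at least $1 - \exp(-\Omega(n^{2\epsilon}))$ we have

$$n \cdot \left(\sum_{(i,j)\in S} \psi_{i,j}\right) < 18 \cdot C(n)\log_2 n$$

and so

$$\sum_{(i,j)\in S} \psi_{i,j} < \frac{18 \cdot C(n)\log_2 n}{n} = O\left(\frac{n^{\alpha+2\epsilon} \cdot \log_2 n}{n}\right).$$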

Final bound on To conclude the proof, using the preceding three individual bounds on the terms from (32) we have that

 DTV(Π,Ψ) ≤∑(i,j