A Normal Test for Independence via Generalized Mutual Information

by Jialin Zhang, et al.

Testing the hypothesis of independence between two random elements on a joint alphabet is a fundamental exercise in statistics. Pearson's chi-squared test is an effective test for such a situation when the contingency table is relatively small. General statistical tools are lacking when the contingency data tables are large or sparse. A test based on generalized mutual information is derived and proposed in this article. The new test has two desirable theoretical properties. First, the test statistic is asymptotically normal under the hypothesis of independence; consequently it does not require the knowledge of the row and column sizes of the contingency table. Second, the test is consistent and therefore it would detect any form of dependence structure in the general alternative space given a sufficiently large sample. In addition, simulation studies show that the proposed test converges faster than Pearson's chi-squared test when the contingency table is large or sparse.





1 Introduction and Summary

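Let $(X, Y)$ be a pair of random elements on a joint alphabet,

$\mathscr{X} \times \mathscr{Y} = \{(x_i, y_j);\ 1 \le i \le K_1,\ 1 \le j \le K_2\}$, with a joint probability distribution,

$\mathbf{p}_{X,Y} = \{p_{i,j}\}$, and the two marginal distributions, $\mathbf{p}_X = \{p_{i,\cdot}\}$ and $\mathbf{p}_Y = \{p_{\cdot,j}\}$, for $X$ and $Y$ respectively. Consider one of the most fundamental problems of statistics: testing the hypothesis of independence between $X$ and $Y$, denoted $H_0: p_{i,j} = p_{i,\cdot}\,p_{\cdot,j}$ for all $(i, j)$, versus $H_a: p_{i,j} \neq p_{i,\cdot}\,p_{\cdot,j}$ for some $(i, j)$. Let an identically and independently distributed (iid) sample of size $n$ be represented by the empirical distribution, that is, $\hat{\mathbf{p}}_{X,Y} = \{\hat{p}_{i,j}\}$ with $\hat{p}_{i,j} = f_{i,j}/n$, where $f_{i,j}$ is the observed frequency of letter $(x_i, y_j)$. Let $\hat{\mathbf{p}}_X = \{\hat{p}_{i,\cdot}\}$ and $\hat{\mathbf{p}}_Y = \{\hat{p}_{\cdot,j}\}$ be the two observed marginal relative frequencies of $X$ and $Y$.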

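In statistical practice, the standard procedure for such a setting is the well-studied Pearson’s chi-squared test, which is based on the fact that, under the hypothesis of independence and as $n \to \infty$,

$$Q = n \sum_{i=1}^{K_1} \sum_{j=1}^{K_2} \frac{\left(\hat{p}_{i,j} - \hat{p}_{i,\cdot}\,\hat{p}_{\cdot,j}\right)^2}{\hat{p}_{i,\cdot}\,\hat{p}_{\cdot,j}} \qquad (1)$$

converges in distribution to a chi-squared random variable with degrees of freedom $(K_1 - 1)(K_2 - 1)$, where $K_1$ and $K_2$ are the numbers of rows and columns of the contingency table and the hatted quantities are the observed joint and marginal relative frequencies. Pearson’s chi-squared test is an effective tool for relatively small 2-way contingency tables. However, it is not without discomforting issues in practice, particularly when it is applied to a large or sparse contingency table.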

One of the issues in practice arises when the numbers of rows and columns, $K_1$ and $K_2$, are unknown. In such a case, it is difficult to fix the reference distribution in (1). A popular adjustment in practice is to replace $K_1$ and $K_2$ with the observed numbers of rows and columns, $\hat{K}_1$ and $\hat{K}_2$. However, such an adjustment lacks theoretical support and the reference distribution used may be quite far away from the asymptotic chi-squared distribution. Another commonly encountered issue in a large contingency table is the occurrence of low-frequency cells. Given the fact that the essence of the asymptotic behavior of Pearson’s chi-squared statistic is the asymptotic normality of the relative frequency $\hat{p}_{i,j}$ in each cell, many low or zero frequency cells in a large contingency table could negatively impact the performance of the test, mostly in the form of a much inflated Type I error probability. This is a long-standing issue considered by many in the existing literature.
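The gap between the theoretical and the observed degrees of freedom can be seen on a small table. The following is a minimal sketch, not the authors' code: the function name `pearson_statistic` and the toy table are hypothetical, and empty rows or columns are simply skipped when summing, which is one common convention.

```python
def pearson_statistic(table):
    """Return (Q, theoretical_df, observed_df) for a 2-way table of counts.

    Hypothetical helper for illustration only.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    q = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:  # cells in empty rows/columns are skipped
                q += (f - expected) ** 2 / expected
    k1, k2 = len(table), len(table[0])
    k1_obs = sum(1 for t in row_tot if t > 0)  # rows actually observed
    k2_obs = sum(1 for t in col_tot if t > 0)  # columns actually observed
    return q, (k1 - 1) * (k2 - 1), (k1_obs - 1) * (k2_obs - 1)


# Toy 3x3 table in which the third row and column were never observed.
table = [[10, 0, 0],
         [0, 10, 0],
         [0, 0, 0]]
q, df_theory, df_obs = pearson_statistic(table)
print(q, df_theory, df_obs)
```

In this toy case the same statistic would be referred to a chi-squared distribution with four degrees of freedom under the theoretical table size and with one degree of freedom under the observed size, so the choice of reference distribution matters.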

A popular adjustment to offset the low or zero frequency cells, when applying a Pearson-type chi-squared test, is to combine cells to increase the cell frequency. A well-known rule of thumb, often thought to be suggested by R.A. Fisher, is to combine cells into new cells such that the combined frequencies are at least five. However, in applying Pearson’s chi-squared test for independence, it is not clear how this adjustment may be done. To assure independence under $H_0$, the adjustment must be made by combining low-frequency rows and low-frequency columns, respectively, due to the invariance of independence under any row permutation and/or column permutation. However, by doing so, it is not guaranteed that all new cells would see enough frequencies. Only by combining the rows and columns further can the new cells then see meaningfully higher frequencies. When this is the case, the re-aggregation becomes somewhat arbitrary. Even if this could be done, the following two points of concern remain.

  1. Aggressive re-aggregation of cells could greatly reduce the number of (observed) degrees of freedom and consequently could shift the reference distribution to one that is far away from the chi-squared distribution with $(K_1 - 1)(K_2 - 1)$ degrees of freedom.

  2. Aggressive re-aggregation of cells could cause local dependence between $X$ and $Y$, manifested in fine structures of the joint distribution, to be inadvertently buried, and hence could deprive Pearson’s chi-squared test of a chance to detect such a dependence.

Consider a simple but amplified illustrative example of concept as follows. Let

Under independence, the joint distribution of $(X, Y)$ is given below.

Summing up all probabilities in , . Let the total mass of be redistributed uniformly only on the diagonal of , augmenting into

with all off-diagonal elements being zeros. Let

and let it be assumed that $(X, Y)$ follows the joint distribution . Clearly, $X$ and $Y$ are not independent. Suppose there is a sample with a sufficiently large $n$, such that all cells with a positive probability see $f_{i,j} \ge 5$. That however does not change the fact that the observed frequencies for cells corresponding to the zero-probability locations in are zeros. In applying the usual re-aggregation of the contingency table, by means of combining rows and columns, one would not be able to end up with all cell frequencies greater than or equal to five, unless all rows of are lumped together and all columns of are lumped together. However by doing this, the underlying joint distribution becomes

under which $X$ and $Y$ are independent. In this example, it is evident that the aggregated data would imply , far away from the . It is also evident that the data aggregation would completely erase the fine dependence structure in and would leave no chance for the dependence to be detected.
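The effect described above can be reproduced on a small table. The sketch below uses a hypothetical 3x3 joint distribution (not the one from the example, whose entries are not reproduced here) whose lower-right 2x2 block carries all of its mass on the diagonal; lumping the rows and the columns of that block yields a table in which the two variables are independent.

```python
def aggregate(joint, row_groups, col_groups):
    """Lump rows and columns of a joint distribution into the given groups."""
    return [[sum(joint[i][j] for i in rg for j in cg) for cg in col_groups]
            for rg in row_groups]


def is_independent(joint, tol=1e-12):
    """Check whether every cell equals the product of its two marginals."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    return all(abs(joint[i][j] - rows[i] * cols[j]) <= tol
               for i in range(len(rows)) for j in range(len(cols)))


# Hypothetical dependent distribution: marginals (0.5, 0.25, 0.25) on both
# sides, with the lower-right 2x2 block concentrated on its diagonal.
dependent = [[0.25, 0.125, 0.125],
             [0.125, 0.125, 0.0],
             [0.125, 0.0, 0.125]]

# Lump rows 1 and 2 together, and columns 1 and 2 together.
lumped = aggregate(dependent, [[0], [1, 2]], [[0], [1, 2]])

print(is_independent(dependent))  # dependence is present in the fine table
print(lumped, is_independent(lumped))  # and is gone after aggregation
```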

As the need to extend Pearson’s chi-squared test to accommodate data in a large or a sparse contingency table increases, many studies have been reported and less stringent rules of thumb have been proposed. The main guideline for the chi-squared test focuses on the (estimated) expected cell counts in the contingency table, following from Cochran (1952), Cochran (1954), Agresti (2003), and Yates et al. (1999). The widely accepted general rule of thumb (referred to below as the Rule) is: (a) at least 80% of the expected counts are five or greater, and (b) all individual expected counts are one or greater. This however is still very stringent. In the example given above, Part (b) alone requires on average , or , or much greater if is required. In fact, in the example above, cannot be satisfied for each and every .
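The Rule is mechanical enough to check in a few lines. The following is a minimal sketch under the assumption that expected counts are estimated as row total times column total divided by the sample size; the function name `satisfies_rule` and both tables are hypothetical.

```python
def satisfies_rule(table):
    """Check the Rule on estimated expected counts of a 2-way table.

    (a) at least 80% of expected counts are >= 5, and
    (b) every expected count is >= 1.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    expected = [r * c / n for r in row_tot for c in col_tot]
    share_at_least_five = sum(e >= 5 for e in expected) / len(expected)
    return share_at_least_five >= 0.8 and min(expected) >= 1


dense = [[20, 30], [30, 20]]                   # every expected count is 25
sparse = [[10, 0, 0], [0, 10, 0], [0, 0, 0]]   # an empty row and column

print(satisfies_rule(dense), satisfies_rule(sparse))
```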

To alleviate the above-mentioned difficulties, a new test of independence in a contingency table is proposed in this article. The proposed test has at least two desirable properties. First, the asymptotic distribution of the test statistic is normal under the independence assumption, and consequently neither the test statistic nor its asymptotic distribution requires the knowledge of the numbers of rows and columns, $K_1$ and $K_2$. Second, the test is consistent and therefore it would detect any form of dependence structure in the general alternative space given a sufficiently large sample. In addition, empirical evidence shows that the proposed test converges faster than Pearson’s chi-squared test when the contingency table is large or sparse.

There are five sections in this article. The main results leading to the proposed test are discussed in Section 2. In Section 3, several simulation studies are presented. A few concluding remarks are given in Section 4. The article ends with an Appendix, where a few proofs are found.

2 Toward a Normal Test for Independence

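Consider Shannon’s entropy, introduced in Shannon (1948), for a random element $Z$ assuming a label in a countable alphabet with probability distribution $\{p_k;\ k \ge 1\}$, $H(Z) = -\sum_{k \ge 1} p_k \ln p_k$, and Shannon’s mutual information of $X$ and $Y$, $MI(X, Y) = H(X) + H(Y) - H(X, Y)$. One of the most important utilities of Shannon’s mutual information is based on the fact that $MI(X, Y) = 0$ if and only if $X$ and $Y$ are independent. The plug-in estimator of $MI(X, Y)$, $\widehat{MI} = \hat{H}(X) + \hat{H}(Y) - \hat{H}(X, Y)$, where $\hat{H}(X) = -\sum_{i} \hat{p}_{i,\cdot} \ln \hat{p}_{i,\cdot}$, $\hat{H}(Y) = -\sum_{j} \hat{p}_{\cdot,j} \ln \hat{p}_{\cdot,j}$, and $\hat{H}(X, Y) = -\sum_{i,j} \hat{p}_{i,j} \ln \hat{p}_{i,j}$, is well-studied, and it is known, under mild conditions, that $\sqrt{n}\,(\widehat{MI} - MI) \xrightarrow{L} N(0, \sigma^2)$, where $\sigma^2$ may be estimated by a consistent estimator. Many details of the said fact may be found in Zhang (2016). However, the mild conditions include $MI > 0$. When $MI = 0$, $\sigma^2$ degenerates, but on the other hand, $2n\,\widehat{MI} \xrightarrow{L} \chi^2\big((K_1 - 1)(K_2 - 1)\big)$, where $K_1$ and $K_2$ are the numbers of rows and columns, and the derivation of this fact may be found in Wilks (1938).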
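As a concrete illustration of the plug-in estimator, the sketch below computes the plug-in mutual information from a table of counts using natural logarithms, together with the scaled quantity that is compared with a chi-squared distribution under independence. The function names and the two toy tables are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability letters contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def plugin_mutual_information(table):
    """Plug-in estimate of MI(X, Y) = H(X) + H(Y) - H(X, Y) from counts."""
    n = sum(sum(row) for row in table)
    joint = [f / n for row in table for f in row]
    rows = [sum(row) / n for row in table]
    cols = [sum(col) / n for col in zip(*table)]
    return entropy(rows) + entropy(cols) - entropy(joint)


balanced = [[25, 25], [25, 25]]   # empirical distribution is a product
diagonal = [[50, 0], [0, 50]]     # X determines Y in the sample

mi_balanced = plugin_mutual_information(balanced)
mi_diagonal = plugin_mutual_information(diagonal)
n = 100
print(mi_balanced, mi_diagonal, 2 * n * mi_diagonal)
```

For the diagonal table the estimate equals the entropy of a fair coin, ln 2, which is the largest value mutual information can take on a 2x2 alphabet.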

Toward proposing the normal test, let the notion of escort distributions be introduced. In the context of thermodynamics, Beck & Schlögl (1995) define an escort distribution as an induced distribution based on an original distribution, $\mathbf{p} = \{p_k;\ k \ge 1\}$, by means of a positive function $g(\cdot)$ on $(0, 1]$. Let $p^*_k = g(p_k) / \sum_{i \ge 1} g(p_i)$ for each $k$. $\mathbf{p}^* = \{p^*_k;\ k \ge 1\}$ is referred to as an escort distribution. The notion of escort distributions has been increasingly adopted in recent years as a means of describing random behaviors of different components in a complex system, each of which scans an underlying distribution via a possibly different function $g$. For a specific function form, $g(p) = p^{\lambda}$ where $\lambda > 0$ is a parameter, the resulting escort distribution, $\mathbf{p}_{\lambda} = \{p_k^{\lambda} / \sum_{i \ge 1} p_i^{\lambda};\ k \ge 1\}$, is known as a power escort distribution.

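Applying the power escort transformation with parameter $\lambda$ to the joint distribution $\{p_{i,j}\}$, the resulting distribution is

$$\mathbf{p}_{\lambda} = \{p_{\lambda,i,j}\}, \qquad p_{\lambda,i,j} = \frac{p_{i,j}^{\lambda}}{\sum_{s,t} p_{s,t}^{\lambda}}. \qquad (2)$$

Let $X_\lambda$ and $Y_\lambda$ be a pair of random elements on the same joint alphabet according to the joint distribution of (2). The following lemma is due to Zhang (2020).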

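Lemma 1.

Given $\lambda > 0$,

  1. the joint distribution of $(X, Y)$ and the joint distribution of $(X_\lambda, Y_\lambda)$ in (2) uniquely determine each other; and

  2. $X$ and $Y$ are independent if and only if $X_\lambda$ and $Y_\lambda$ are independent, that is, $MI(X, Y) = 0$ if and only if $MI(X_\lambda, Y_\lambda) = 0$.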
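Both parts of the lemma can be checked numerically on small examples. The sketch below is illustrative only: the distributions and the choice of the power parameter (2 here) are hypothetical, and a numerical check on examples is of course not a proof.

```python
def power_escort(joint, lam):
    """Power escort of a joint distribution: raise cells to lam, renormalize."""
    total = sum(p ** lam for row in joint for p in row)
    return [[p ** lam / total for p in row] for row in joint]


def is_independent(joint, tol=1e-12):
    """Check whether every cell equals the product of its two marginals."""
    rows = [sum(r) for r in joint]
    cols = [sum(c) for c in zip(*joint)]
    return all(abs(joint[i][j] - rows[i] * cols[j]) <= tol
               for i in range(len(rows)) for j in range(len(cols)))


lam = 2  # hypothetical choice of the power parameter
px, py = [0.5, 0.3, 0.2], [0.6, 0.4]
independent = [[a * b for b in py] for a in px]
dependent = [[0.4, 0.1], [0.1, 0.4]]

escort_indep = power_escort(independent, lam)
escort_dep = power_escort(dependent, lam)

# Part 1: applying the escort with parameter 1/lam recovers the original.
recovered = power_escort(escort_dep, 1 / lam)

# Part 2: independence is preserved, and so is dependence.
print(is_independent(escort_indep), is_independent(escort_dep))
print(recovered)
```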

By Part 2 of Lemma 1, the null hypothesis $H_0$ may then be stated equivalently as $MI(X_\lambda, Y_\lambda) = 0$, that is, $H(X_\lambda) + H(Y_\lambda) - H(X_\lambda, Y_\lambda) = 0$, or letting ,


On the other hand, let it be observed that under $H_0$,


Let it also be noted that

  1. (3) is a necessary and sufficient condition, that is, the equality of (3) holds if and only if $H_0$ holds,

  2. the equalities in (4), (5) and (6) do not necessarily hold in general, but they do hold under the assumption of $H_0$.

Adding and subtracting the left-hand sides of (5) and (6) to and from (3), another restatement of $H_0$ is obtained below.


Writing the terms within the curly brackets in (7) as and , (7) becomes


A natural test for independence would be to statistically check the value of the left-hand side in (3), or that in (7), or that in (8), and assess the statistical evidence against that value being zero. Consider the plug-in estimator of the left-hand side of (8), obtained by replacing each joint probability $p_{i,j}$ with the relative frequency $\hat{p}_{i,j}$ for every pair $(i, j)$, each row marginal $p_{i,\cdot}$ with $\hat{p}_{i,\cdot}$ for every $i$, and each column marginal $p_{\cdot,j}$ with $\hat{p}_{\cdot,j}$ for every $j$, resulting in plug-in estimators of and , denoted by and .

By Wilks (1938), under , which implies that


The following proposition is the keystone of the test to be proposed.

Proposition 1.

Suppose neither of the two underlying marginal distributions, that of $X$ and that of $Y$, is uniform, and $H_0$ holds. Then as $n \to \infty$, where is a positive constant depending on the parameter $\lambda$.

A proof of Proposition 1 is given in the Appendix.

Let denote the normal random variable under the conditions of Proposition 1. By (9) and Lemma 1, , where is the same random variable as in .

Proposition 2.

Under the conditions of Proposition 1,

  1. ,

  2. , and

  3. ,


is the variance given in (12) with all replaced by for and .

At least three tests for are feasible according to Proposition 2.

  1. Test 1: is rejected if or ;

  2. Test 2: is rejected if or ; and

  3. Test 3: is rejected if or

where is a prefixed constant, is the th percentile of the standard normal distribution. The test based on is the proposed test, and it is a consistent test as described in Proposition 3 below.
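The decision step shared by such tests is a two-sided comparison of a standardized statistic with a standard normal percentile. The sketch below shows only that generic step; the argument `z` is a hypothetical placeholder for an already standardized statistic and does not reproduce the statistics of Proposition 2, and the default level of 0.05 is likewise an assumption.

```python
from statistics import NormalDist


def two_sided_normal_test(z, alpha=0.05):
    """Two-sided test of a standardized statistic against N(0, 1).

    Returns (reject, critical_value, p_value).
    """
    std_normal = NormalDist()
    critical = std_normal.inv_cdf(1 - alpha / 2)   # upper alpha/2 point
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return abs(z) > critical, critical, p_value


reject, critical, p_value = two_sided_normal_test(2.5, alpha=0.05)
print(reject, round(critical, 4), round(p_value, 4))
```

Because the reference distribution is standard normal, nothing in this step depends on the number of rows or columns of the table.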

Proposition 3.

Suppose neither of the two underlying marginal distributions, and , is uniform. Then

A proof of Proposition 3 is given in the Appendix.

3 Simulations

The performance of the proposed test is assessed by simulations. Numerous simulation studies are carried out for cases with various forms of underlying distributions. In each case, the proposed test is compared against Pearson’s chi-squared test, with the theoretical and the observed degrees of freedom respectively, with six levels of sample size, $n = 30$, $100$, $500$, $1000$, $1500$, and $2000$. The results summarized in Table 1 are representative of the general trends observed and therefore are presented below.

The sequence of five pairs of and is specifically constructed as follows. The example of the contingency table in Section 1 is one of such cases. In that example, the contingency table has rows and columns. A more general distribution may be described as follows. For a given value , let both of the row and the column marginal be . A pair of and may be constructed as follows.

Under , a joint distribution is constructed as follows.


The joint distribution of (10) is reconstructed, first by summing all entries in the lower-right sub-matrix and then redistributing the sum on the diagonal of the sub-matrix uniformly, resulting in

Thus, versus becomes a pair. Letting the parameter take on values, 0.5, 0.6, 0.7, 0.8, and 0.9, respectively, the dependence structure in becomes weaker and weaker.
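The reconstruction step, summing the lower-right sub-matrix and spreading that sum uniformly on its diagonal, can be sketched as follows. The marginal distribution and the position where the sub-matrix starts are hypothetical stand-ins, since the specific values used in the study are not reproduced here; the table is assumed to be square.

```python
def redistribute_block_on_diagonal(joint, start):
    """Move the mass of the lower-right block onto its diagonal, uniformly.

    The block consists of rows and columns with index >= start.
    """
    k = len(joint)
    block_mass = sum(joint[i][j]
                     for i in range(start, k) for j in range(start, k))
    size = k - start
    out = [row[:] for row in joint]
    for i in range(start, k):
        for j in range(start, k):
            out[i][j] = block_mass / size if i == j else 0.0
    return out


# Hypothetical common marginal for rows and columns.
marginal = [0.4, 0.2, 0.2, 0.2]
null_joint = [[a * b for b in marginal] for a in marginal]  # independence
alt_joint = redistribute_block_on_diagonal(null_joint, start=1)

for row in alt_joint:
    print([round(p, 4) for p in row])
```

The total mass and the cells outside the block are unchanged, so the two distributions differ only in how the block's mass is arranged.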

The results of the simulation studies are reported in Table 1, each based on one hundred thousand replicates of iid sample of indicated size , and .
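A Monte Carlo study of this kind can be sketched in a few lines. The example below is not the simulation reported in Table 1: it uses a hypothetical 2x2 setting, Pearson's test only, a fixed level of 0.01 with the corresponding chi-squared critical value for one degree of freedom (approximately 6.635), and far fewer replicates than the study, purely to show how rejection rates under the null and under an alternative are estimated.

```python
import random

CHI2_DF1_UPPER_1PCT = 6.635  # approximate upper 1% point, chi-squared, 1 df


def pearson_q(table):
    """Pearson's chi-squared statistic; cells with zero expectation skipped."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    q = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            if expected > 0:
                q += (f - expected) ** 2 / expected
    return q


def sample_table(joint, n, rng):
    """Draw an iid sample of size n from `joint` and tabulate the counts."""
    k1, k2 = len(joint), len(joint[0])
    cells = [(i, j) for i in range(k1) for j in range(k2)]
    weights = [joint[i][j] for i, j in cells]
    table = [[0] * k2 for _ in range(k1)]
    for i, j in rng.choices(cells, weights=weights, k=n):
        table[i][j] += 1
    return table


def rejection_rate(joint, n, replicates, seed=0):
    """Share of replicates in which Pearson's test rejects at level 0.01."""
    rng = random.Random(seed)
    rejections = sum(
        pearson_q(sample_table(joint, n, rng)) > CHI2_DF1_UPPER_1PCT
        for _ in range(replicates))
    return rejections / replicates


null_joint = [[0.25, 0.25], [0.25, 0.25]]   # independent
alt_joint = [[0.35, 0.15], [0.15, 0.35]]    # dependent

type_one = rejection_rate(null_joint, n=200, replicates=2000)
power = rejection_rate(alt_joint, n=200, replicates=2000)
print(type_one, power)
```

The first rate estimates the Type I error probability and the second estimates power, which are the two quantities reported for each test and sample size in Table 1.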

Referring to Table 1, the distribution with on the top corresponds to the strongest contrast between and among all five cases in the table. For and , none of the three test statistics converges satisfactorily under at . For , both the proposed test and the Pearson’s chi-squared test with observed degrees of freedom converge satisfactorily under , and provide very good power at . However, the Pearson’s chi-squared test with theoretical degrees of freedom does not converge under .

The distribution with provides a less strong contrast between and . For , both the proposed test and the Pearson’s chi-squared test with observed degrees of freedom converge satisfactorily under , and provide very good power at . However, the Pearson’s chi-squared test with theoretical degrees of freedom has a very inflated Type I error probability. One may also notice that the Pearson’s chi-squared test with observed degrees of freedom perhaps converges a little slower than the proposed test under .

The distribution with provides an even less strong contrast between and . For , only the proposed test converges satisfactorily under , and provides meaningful power at . However, both of the Pearson’s chi-squared tests have inflated Type I error probabilities.

The distribution with provides a weak contrast between and . For , the proposed test converges satisfactorily under , and provides little power at . On the other hand, both of the Pearson’s chi-squared tests have very inflated Type I error probabilities. One may also notice that the Pearson’s chi-squared test with observed degrees of freedom perhaps converges a little slower than the proposed test under .

The distribution with provides a very weak contrast between and . So much so that even for , the proposed test converges very slowly under , and provides essentially no power at . On the other hand, both of the Pearson’s chi-squared tests have a total breakdown under .

The following major points are observed in Table 1, as well as in other simulation studies investigated but not presented here.

  1. In small or dense contingency tables, both Pearson’s chi-squared tests converge quickly under $H_0$ and are generally more powerful than the proposed test.

  2. In larger and sparse contingency tables, both Pearson’s chi-squared tests converge much more slowly under $H_0$ and tend to have a much higher probability of Type I error than what is intended.

All things considered, in practice, if the Rule is satisfied, then Pearson’s chi-squared test is recommended, otherwise the proposed test of this article is recommended, provided that or more practically . In that regard, the proposed test is not meant to replace Pearson’s test in all circumstances, but only when Pearson’s test is judged not appropriate by the current rule-of-thumb criteria.

4 Remarks

The idea of the article may be explained simply by the convergence rate of under . is -convergent, that is,

In other words, under $H_0$, the difference between the two additive parts of approaches zero very fast. However, by inserting a zero in the form of four terms, as in (7), is wedged into two terms, and , both of which approach zero under $H_0$ but at a much slower rate, that is, they are $\sqrt{n}$-convergent. This insertion is almost literally a keystone, splitting into two random variables and eking out asymptotic normality of , of , and hence of under $H_0$. The immediate advantages of the normality of are that the knowledge of the numbers of rows and columns, $K_1$ and $K_2$, is not required when testing $H_0$, and that the test statistic seems to converge faster than Pearson’s chi-squared test under $H_0$. On the other hand, the statistical assessment of $H_0$ is done by two separate random pieces instead of one, as in , and some efficiency may be lost as evidenced by the simulation studies. The observed loss of power may be considered a cost for more generality, that is, the knowledge of $K_1$ and $K_2$ is not required.

It is to be noted that the concept of escort distributions is essential in the arguments leading to the proposed test. Only when , the inserts, the left-hand sides of (5) and (6), would enable a positive variance in (12), and hence the asymptotic normality of .

It is also to be highlighted that the proposed test based on is a consistent test as stated in Proposition 3. This fact supports the utility of the proposed test in the general alternative space. That is to say, provided a sufficiently large sample, any form of dependence structure between $X$ and $Y$ will be detected. The test is proposed here in the form of a two-sided test due to its generality. For specific forms of dependence structures between $X$ and $Y$, some of the tests based on , or , one-sided or two-sided, may have better performance in terms of faster convergence under $H_0$ and higher power under $H_a$ than others. This provides a potentially fruitful direction for further investigation.

5 Appendix

Proof of Proposition 1.

It may be verified that for each pair , , and ,

where is defined in (2), , and ;

for , and ,

and for , and ,


noting specifically the implied enumeration of the indexes of corresponds to the arrangement of as in , that is,


Let the gradient of with respect to for all be denoted by with the index arrangement given in (11).

It follows that , where is the covariance matrix given by

According to the first-order delta method, , as , where


Proof of Proposition 3.

Under , by Lemma 1, where and are as in (8). By the respective consistencies of plug-in estimators of , and , , and hence , which in turn implies that at least one of and is carried above all bounds in probability, which finally implies that , as . ∎

Distribution  Sample size  Proposed test           Pearson, observed df    Pearson, theoretical df
(parameter)   n            Type I error  Power     Type I error  Power     Type I error  Power
0.5           30           0.2002        0.3235    0.0632        0.4377    0.0831        0.0694
0.5           100          0.0561        0.7739    0.0373        0.9951    0.0439        0.9940
0.5           500          0.0145        1.0000    0.0124        1.0000    0.2058        1.0000
0.5           1000         0.0122        1.0000    0.0113        1.0000    0.1437        1.0000
0.5           1500         0.0122        1.0000    0.0109        1.0000    0.10841       1.0000
0.5           2000         0.0113        1.0000    0.0098        1.0000    0.0858        1.0000
0.6           30           0.0540        0.0862    0.0968        0.2811    0.0021        0.0133
0.6           100          0.0071        0.0965    0.0764        0.8760    0.0997        0.8515
0.6           500          0.0087        0.9267    0.0183        1.0000    0.0767        1.0000
0.6           1000         0.0105        0.9997    0.0143        1.0000    0.0412        1.0000
0.6           1500         0.0103        1.0000    0.0123        1.0000    0.0316        1.0000
0.6           2000         0.0102        1.0000    0.0120        1.0000    0.0263        1.0000
0.7           30           0.0040        0.0054    0.1352        0.2118    0.0002        0.0020
0.7           100          0.0003        0.0011    0.1400        0.5563    0.0966        0.4720
0.7           500          0.0081        0.1324    0.0320        1.0000    0.0320        1.0000
0.7           1000         0.0096        0.4032    0.0200        1.0000    0.0200        1.0000
0.7           1500         0.0102        0.6541    0.0168        1.0000    0.0168        1.0000
0.7           2000         0.0104        0.8196    0.0151        1.0000    0.0151        1.0000
0.8           30           0.0021        0.0027    0.1697        0.1941    0.00213       0.0027
0.8           100          0.0000        0.0000    0.2082        0.3369    0.0997        0.1977
0.8           500          0.0086        0.0189    0.0768        0.9426    0.0767        0.9426
0.8           1000         0.0104        0.0389    0.0412        0.9999    0.0412        0.9999
0.8           1500         0.0108        0.0586    0.0316        1.0000    0.0316        1.0000
0.8           2000         0.0109        0.0813    0.0263        1.0000    0.0263        1.0000
0.9           30           0.0831        0.0822    0.2326        0.2337    0.0832        0.0822
0.9           100          0.0000        0.0000    0.2386        0.2519    0.0439        0.0532
0.9           500          0.0213        0.0222    0.2120        0.4037    0.2058        0.3966
0.9           1000         0.0202        0.0227    0.1437        0.6408    0.1437        0.6407
0.9           1500         0.0162        0.0212    0.1084        0.8330    0.1084        0.8330
0.9           2000         0.0158        0.0206    0.0854        0.9385    0.0854        0.9385
Table 1: Simulated Convergence and Power Comparison, ,


  • Agresti, A. (2003), Categorical Data Analysis, John Wiley & Sons.
  • Beck, C. & Schlögl, F. (1995), Thermodynamics of Chaotic Systems.
  • Cochran, W. G. (1952), ‘The χ² test of goodness of fit’, The Annals of Mathematical Statistics pp. 315–345.
  • Cochran, W. G. (1954), ‘Some methods for strengthening the common χ² tests’, Biometrics 10(4), 417–451.
  • Shannon, C. E. (1948), ‘A mathematical theory of communication’, The Bell System Technical Journal 27(3), 379–423.
  • Wilks, S. S. (1938), ‘The large-sample distribution of the likelihood ratio for testing composite hypotheses’, The Annals of Mathematical Statistics 9(1), 60–62.
  • Yates, D., Moore, D. & McCabe, G. (1999), ‘The Practice of Statistics. New York, NY: H’.
  • Zhang, Z. (2016), Statistical Implications of Turing’s Formula, John Wiley & Sons.
  • Zhang, Z. (2020), ‘Generalized mutual information’, Stats 3(2), 158–165.