1 Introduction and Summary
Let $(X, Y)$ be a pair of random elements on a joint alphabet,
$\mathscr{X} \times \mathscr{Y} = \{(x_i, y_j);\ 1 \le i \le K_1,\ 1 \le j \le K_2\}$, with a joint probability distribution,
$p = \{p_{ij}\}$, and the two marginal distributions, $p_X = \{p_{i\cdot}\}$ and $p_Y = \{p_{\cdot j}\}$, for $X$ and $Y$ respectively. Consider one of the most fundamental problems of statistics: testing the hypothesis of independence between $X$ and $Y$, denoted $H_0: p_{ij} = p_{i\cdot} p_{\cdot j}$ for all $(i, j)$, versus $H_a:$ not $H_0$. Let an independently and identically distributed (iid) sample of size $n$ be represented by the empirical distribution, $\hat p = \{\hat p_{ij}\}$, that is, $\hat p_{ij} = n_{ij}/n$, where $n_{ij}$ is the observed frequency of letter $(x_i, y_j)$. Let $\hat p_X = \{\hat p_{i\cdot}\}$ and $\hat p_Y = \{\hat p_{\cdot j}\}$ be the two observed marginal relative frequencies of $X$ and $Y$.

In statistical practice, the standard procedure for such a setting is the well-studied Pearson's chi-squared test, which is based on the fact that, as $n \to \infty$,
(1)  $\chi^2 = \sum_{i=1}^{K_1} \sum_{j=1}^{K_2} \dfrac{\left(n_{ij} - n \hat p_{i\cdot} \hat p_{\cdot j}\right)^2}{n \hat p_{i\cdot} \hat p_{\cdot j}} \xrightarrow{\mathcal{L}} \chi^2_{\nu},$
where $\chi^2_{\nu}$ is a chi-squared random variable with degrees of freedom $\nu = (K_1 - 1)(K_2 - 1)$. Pearson's chi-squared test is an effective tool for relatively small two-way contingency tables. However, it is not without discomforting issues in practice, particularly when it is applied to a large or sparse contingency table.

One of the issues in practice arises when $K_1$ and $K_2$ are unknown. In such a case, it is difficult to fix the reference distribution in (1). A popular adjustment in practice is to replace $K_1$ and $K_2$ with the observed numbers of distinct row and column values, $\hat K_1$ and $\hat K_2$.
However, such an adjustment lacks theoretical support, and the reference distribution used may be quite far from the asymptotic chi-squared distribution. Another commonly encountered issue in a large contingency table is the occurrence of low-frequency cells. Given the fact that the essence of the asymptotic behavior of Pearson's chi-squared statistic is the asymptotic normality of $(n_{ij} - n \hat p_{i\cdot} \hat p_{\cdot j})/\sqrt{n \hat p_{i\cdot} \hat p_{\cdot j}}$ in each cell, many low- or zero-frequency cells in a large contingency table could negatively impact the performance of the test, mostly in the form of a much inflated Type I error probability. This is a longstanding issue considered by many in the existing literature.
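For concreteness, the statistic in (1) and the observed-degrees-of-freedom adjustment can be computed in a few lines. The following is a minimal sketch (in Python, with numpy and scipy assumed available; the function name `pearson_independence` is ours, not from any particular library):

```python
import numpy as np
from scipy.stats import chi2

def pearson_independence(counts, K1=None, K2=None):
    """Pearson's chi-squared test of independence on a two-way table.

    If the true alphabet sizes K1 and K2 are unknown, the observed
    numbers of nonempty rows and columns are used instead (the common
    ad hoc adjustment discussed above).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    row = counts.sum(axis=1) / n               # marginal p_i. estimates
    col = counts.sum(axis=0) / n               # marginal p_.j estimates
    expected = n * np.outer(row, col)          # n * p_i. * p_.j
    mask = expected > 0                        # skip empty rows/columns
    stat = float(((counts - expected)[mask] ** 2 / expected[mask]).sum())
    if K1 is None:
        K1 = int((counts.sum(axis=1) > 0).sum())   # observed row count
    if K2 is None:
        K2 = int((counts.sum(axis=0) > 0).sum())   # observed column count
    df = (K1 - 1) * (K2 - 1)
    return stat, df, float(chi2.sf(stat, df))  # statistic, df, p-value
```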
A popular adjustment to offset the low- or zero-frequency cells, when applying a Pearson-type chi-squared test, is to combine cells so as to increase the cell frequencies. A well-known rule of thumb, often thought to have been suggested by R. A. Fisher, is to combine cells into new cells such that the combined frequencies are at least five. However, in applying Pearson's chi-squared test for independence, it is not clear how this adjustment should be done. To preserve independence under $H_0$, the adjustment must be made by combining low-frequency rows and low-frequency columns, respectively, owing to the invariance of independence under any row permutation and/or column permutation. However, by doing so, it is not guaranteed that all new cells would see sufficiently high frequencies. Only by combining the rows and columns further can the new cells see meaningfully higher frequencies, and when this is the case, the re-aggregation becomes somewhat arbitrary. Even if this could be done, the following two points of concern remain.

Aggressive re-aggregation of cells could greatly reduce the number of (observed) degrees of freedom and consequently could shift the reference distribution to one that is far away from $\chi^2_{\nu}$.

Aggressive re-aggregation of cells could cause local dependence between $X$ and $Y$, manifested in fine structures of the joint distribution, to be inadvertently buried, and hence could deprive Pearson's chi-squared test of a chance to detect such dependence.
Consider a simple but amplified illustrative example of the concept as follows. Let $K_1 = K_2 = K$, and let the common marginal distribution of $X$ and $Y$ be $\{p_k\}$, with $p_1 = 1/2$ and the remaining mass $1/2$ halved geometrically over the other letters, that is, $p_k = (1/2)^k$ for $k = 2, \dots, K-1$ and $p_K = (1/2)^{K-1}$.

Under independence, the joint distribution of $(X, Y)$ is $p_0 = \{p_i p_j\}$. Consider the lower-right $(K-1) \times (K-1)$ block of $p_0$, $B = \{p_i p_j;\ 2 \le i, j \le K\}$. Summing up all probabilities in $B$, $\sum_{2 \le i, j \le K} p_i p_j = (1 - p_1)^2 = 1/4$. Let the total mass of $B$ be redistributed uniformly only on the diagonal of $B$, augmenting $p_0$ into $p_1$, in which each of the $K-1$ diagonal cells of the block carries probability $1/(4(K-1))$ and all off-diagonal elements of the block are zeros.

Let $(X, Y)$ be assumed to follow the joint distribution $p_1$. Clearly, $X$ and $Y$ are not independent. Suppose there is a sample with a sufficiently large $n$, such that all cells with a positive probability see frequencies of five or greater. That however does not change the fact that the observed frequencies for cells corresponding to the zero-probability locations in $p_1$ are zeros. In applying the usual re-aggregation of the contingency table, by means of combining rows and columns, one would not be able to end up with all cell frequencies greater than or equal to five, unless all rows and all columns of the block are lumped together. However, by doing this, the underlying joint distribution becomes

$\begin{pmatrix} 1/4 & 1/4 \\ 1/4 & 1/4 \end{pmatrix},$

under which $X$ and $Y$ are independent. In this example, it is evident that the aggregated data would suggest independence, far away from the underlying truth of $p_1$. It is also evident that the data aggregation would completely erase the fine dependence structure in $p_1$ and would leave no chance for the dependence to be detected.
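The example is easiest to appreciate numerically. The sketch below builds the pair $p_0$ and $p_1$ for the concrete marginal used above (mass $1/2$ on the first letter, geometrically halved elsewhere; `example_pair` and `mutual_information` are hypothetical helper names of ours) and verifies that lumping the block into a single row and column erases the dependence:

```python
import numpy as np

def example_pair(K, lam=0.5):
    """Construct p0 (independent) and p1 (dependent) on a K x K alphabet.

    Marginal (an assumed concrete form): mass `lam` on the first letter,
    the remaining 1 - lam halved geometrically over letters 2..K.
    """
    p = np.empty(K)
    p[0] = lam
    p[1:] = (1 - lam) * 0.5 ** np.arange(1, K)
    p[-1] += (1 - lam) * 0.5 ** (K - 1)   # fold the geometric tail into letter K
    p0 = np.outer(p, p)                   # joint distribution under independence
    p1 = p0.copy()
    block = p1[1:, 1:].sum()              # total mass of the lower-right block
    p1[1:, 1:] = 0.0
    idx = np.arange(1, K)
    p1[idx, idx] = block / (K - 1)        # uniform redistribution on the diagonal
    return p0, p1

def mutual_information(p):
    """Shannon mutual information of a joint distribution given as a matrix."""
    px, py = p.sum(axis=1), p.sum(axis=0)
    prod = np.outer(px, py)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / prod[nz])).sum())

p0, p1 = example_pair(K=8)
print(mutual_information(p0))             # ~0: independent
print(mutual_information(p1))             # > 0: dependent
# lumping rows 2..K together and columns 2..K together erases the dependence:
lump = np.array([[p1[0, 0],        p1[0, 1:].sum()],
                 [p1[1:, 0].sum(), p1[1:, 1:].sum()]])
print(mutual_information(lump))           # ~0 again: the 2 x 2 table is independent
```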
As the need to extend Pearson's chi-squared test to accommodate data in large or sparse contingency tables increases, many studies have been reported and less stringent rules of thumb have been proposed. The main guideline for the chi-squared test focuses on the (estimated) expected cell counts in the contingency table, following from Cochran (1952), Cochran (1954), Agresti (2003), and Yates et al. (1999). The widely accepted general rule of thumb (referred to below as the Rule) is: (a) at least 80% of the expected counts are five or greater, and (b) all individual expected counts are one or greater. This however is still very stringent. In the example given above, Part (b) alone requires on average $n/(4(K-1)) \ge 1$ on the diagonal of the block, that is, $n \ge 4(K-1)$, or much greater if Part (a) is also required. In fact, in the example above, an expected count of one or greater cannot be attained for each and every cell, owing to the zero-probability cells of $p_1$.
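The Rule itself is mechanical to check. A minimal sketch, assuming the expected counts are estimated by $n \hat p_{i\cdot} \hat p_{\cdot j}$ as in Pearson's test (the name `rule_ok` is ours):

```python
import numpy as np

def rule_ok(counts):
    """Check the Rule on estimated expected counts n * p_i. * p_.j:
    (a) at least 80% of expected counts are >= 5, and
    (b) every expected count is >= 1."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    expected = n * np.outer(counts.sum(axis=1) / n, counts.sum(axis=0) / n)
    return bool((expected >= 5).mean() >= 0.8 and (expected >= 1).all())
```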
To alleviate the above-mentioned difficulties, a new test of independence in a contingency table is proposed in this article. The proposed test has at least two desirable properties. First, the asymptotic distribution of the test statistic is normal under the independence assumption; consequently, neither the test statistic nor its asymptotic distribution requires knowledge of $K_1$ and $K_2$. Second, the test is consistent, and therefore it would detect any form of dependence structure in the general alternative space given a sufficiently large sample. In addition, empirical evidence shows that the proposed test converges faster than Pearson's chi-squared test when the contingency table is large or sparse.

There are five sections in this article. The main results leading to the proposed test are discussed in Section 2. In Section 3, several simulation studies are presented. A few concluding remarks are given in Section 4. The article ends with the Appendix, where a few proofs and Table 1 are found.
2 Toward a Normal Test for Independence
Consider Shannon's entropy, introduced in Shannon (1948), for a random element $Z$ assuming a label in a countable alphabet $\mathscr{Z} = \{z_k;\ k \ge 1\}$ with probability distribution $\{p_k\}$, $H(Z) = -\sum_{k \ge 1} p_k \ln p_k$, and Shannon's mutual information of $X$ and $Y$, $MI(X, Y) = H(X) + H(Y) - H(X, Y)$. One of the most important utilities of Shannon's mutual information is based on the fact that $MI(X, Y) = 0$ if and only if $X$ and $Y$ are independent. The plug-in estimator of $MI(X, Y)$, $\widehat{MI}(X, Y) = \hat H(X) + \hat H(Y) - \hat H(X, Y)$, where $\hat H(X) = -\sum_i \hat p_{i\cdot} \ln \hat p_{i\cdot}$, $\hat H(Y) = -\sum_j \hat p_{\cdot j} \ln \hat p_{\cdot j}$, and $\hat H(X, Y) = -\sum_{i,j} \hat p_{ij} \ln \hat p_{ij}$, is well studied, and it is known, under mild conditions, that $\sqrt{n}\,(\widehat{MI} - MI)/\sigma \xrightarrow{\mathcal{L}} N(0, 1)$, where $\sigma^2$ may be estimated by a consistent estimator. Many details of the said fact may be found in Zhang (2016). However, the mild conditions include $MI(X, Y) > 0$. When $MI(X, Y) = 0$, $\sqrt{n}\,(\widehat{MI} - MI)$ degenerates; on the other hand, $2n \widehat{MI}(X, Y) \xrightarrow{\mathcal{L}} \chi^2_{(K_1-1)(K_2-1)}$, and the derivation of this fact may be found in Wilks (1938).
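A sketch of the plug-in quantities just described (the helper names `entropy` and `mi_plugin` are ours):

```python
import numpy as np

def entropy(p):
    """Plug-in Shannon entropy; zero cells contribute zero."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def mi_plugin(counts):
    """Plug-in mutual information MI-hat(X, Y) from a two-way table."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return entropy(p.sum(axis=1)) + entropy(p.sum(axis=0)) - entropy(p.ravel())
```

Under $H_0$, $2n$ times `mi_plugin(counts)` then behaves asymptotically as a $\chi^2_{(K_1-1)(K_2-1)}$ random variable, per Wilks (1938).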
Toward proposing the normal test, let the notion of escort distributions be introduced. In the context of thermodynamics, Beck & Schlögl (1995) define an escort distribution as an induced distribution based on an original distribution, $\{p_k;\ k \ge 1\}$, by means of a positive function $f$ on $(0, 1]$. Let $p_k(f) = f(p_k)/\sum_{j \ge 1} f(p_j)$ for each $k$. $\{p_k(f);\ k \ge 1\}$ is referred to as an escort distribution. The notion of escort distributions has been increasingly adopted in recent years as a means of describing random behaviors of different components in a complex system, each of which scans an underlying distribution via a possibly different function $f$. For the specific functional form $f(p) = p^q$, where $q > 0$ is a parameter, the resulting escort distribution, $\{p_k(q) = p_k^q / \sum_{j \ge 1} p_j^q\}$, is known as a power escort distribution.
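The power escort transformation is a one-liner; the small sketch below (the name `power_escort` is ours) illustrates how $q < 1$ flattens a distribution toward uniform while $q > 1$ sharpens it:

```python
import numpy as np

def power_escort(p, q):
    """Power escort of a distribution: p_k^q / sum_j p_j^q, for q > 0."""
    w = np.asarray(p, dtype=float) ** q
    return w / w.sum()

print(power_escort([0.7, 0.2, 0.1], q=0.5))   # q < 1 flattens toward uniform
print(power_escort([0.7, 0.2, 0.1], q=2.0))   # q > 1 sharpens toward the mode
```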
Applying the power escort transformation to the joint distribution $\{p_{ij}\}$, the resulting distribution is

(2)  $p_{ij}(q) = \dfrac{p_{ij}^q}{\sum_{u \ge 1} \sum_{v \ge 1} p_{uv}^q}.$
Let $X_q$ and $Y_q$ be a pair of random elements on the same joint alphabet $\mathscr{X} \times \mathscr{Y}$ according to the joint distribution of (2). The following lemma is due to Zhang (2020).
Lemma 1.
Given $q > 0$,

1. $\{p_{ij}\}$ and $\{p_{ij}(q)\}$ uniquely determine each other, that is, $\{p_{ij}\} \Leftrightarrow \{p_{ij}(q)\}$; and

2. $X$ and $Y$ are independent if and only if $X_q$ and $Y_q$ are independent, that is, $MI(X, Y) = 0 \Leftrightarrow MI(X_q, Y_q) = 0$.
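Part 2 of Lemma 1 is easy to check numerically in the independent direction: the power escort of an independent joint distribution factorizes into the escorts of its marginals. A small sketch:

```python
import numpy as np

# Numerical check of Lemma 1, part 2, in the independent direction.
rng = np.random.default_rng(0)
q = 0.5
px, py = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(3))
p = np.outer(px, py)                  # an independent joint distribution
pq = p ** q / (p ** q).sum()          # joint power escort, eq. (2)
# The escort factorizes into its own marginals, so X_q and Y_q are
# independent whenever X and Y are:
print(np.allclose(pq, np.outer(pq.sum(axis=1), pq.sum(axis=0))))   # True
```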
By Part 2 of Lemma 1, the null hypothesis, $H_0: MI(X, Y) = 0$, may then be stated equivalently as $MI(X_q, Y_q) = 0$, that is, $H(X_q) + H(Y_q) - H(X_q, Y_q) = 0$, or, letting $p_{i\cdot}(q) = \sum_j p_{ij}(q)$ and $p_{\cdot j}(q) = \sum_i p_{ij}(q)$,

(3)  $-\sum_i p_{i\cdot}(q) \ln p_{i\cdot}(q) - \sum_j p_{\cdot j}(q) \ln p_{\cdot j}(q) + \sum_{i,j} p_{ij}(q) \ln p_{ij}(q) = 0.$
On the other hand, let it be observed that, under $H_0$,

(4)  $p_{ij}(q) = \dfrac{p_{i\cdot}^q\, p_{\cdot j}^q}{\left(\sum_u p_{u\cdot}^q\right)\left(\sum_v p_{\cdot v}^q\right)},$

(5)  $H(X_q) - H(X_q^*) = 0,$

(6)  $H(Y_q) - H(Y_q^*) = 0,$

where $X_q^*$ and $Y_q^*$ denote random elements distributed according to the power escorts of the two marginal distributions, $\{p_{i\cdot}^q / \sum_u p_{u\cdot}^q\}$ and $\{p_{\cdot j}^q / \sum_v p_{\cdot v}^q\}$, respectively.
Let it also be noted that, under $H_0$, (4) implies $H(X_q, Y_q) = H(X_q^*) + H(Y_q^*)$.
Adding and subtracting the left-hand sides of (5) and (6) to and from (3), another restatement of $H_0$ is obtained below.

(7)  $\left\{ H(X_q) + H(Y_q) - H(X_q^*) - H(Y_q^*) \right\} + \left\{ H(X_q^*) + H(Y_q^*) - H(X_q, Y_q) \right\} = 0.$
Writing the terms within the first and the second pairs of curly brackets in (7) as $\eta_1$ and $\eta_2$ respectively, (7) becomes

(8)  $\eta_1 + \eta_2 = 0.$
A natural test for independence would be to statistically check the value of the left-hand side in (3), or that in (7), or that in (8), and assess the statistical evidence against that value being zero. Consider the plug-in estimators of the left-hand side of (8), obtained by replacing $p_{ij}$ with $\hat p_{ij}$ for every pair $(i, j)$, $p_{i\cdot}$ with $\hat p_{i\cdot}$ for every $i$, and $p_{\cdot j}$ with $\hat p_{\cdot j}$ for every $j$, resulting in plug-in estimators of $\eta_1$ and $\eta_2$, denoted by $\hat\eta_1$ and $\hat\eta_2$.
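A minimal sketch of $\hat\eta_1$ and $\hat\eta_2$ under the decomposition (7)-(8), reusing `entropy` from the sketch earlier in this section (the name `eta_hats` is ours):

```python
import numpy as np

def eta_hats(table, q):
    """Plug-in eta_1-hat and eta_2-hat of the decomposition (7)-(8).

    `table` may hold counts or probabilities; it is normalized first.
    Reuses entropy() from the earlier sketch.
    """
    p = np.asarray(table, dtype=float)
    p = p / p.sum()
    pq = p ** q
    pq = pq / pq.sum()                             # joint escort, eq. (2)
    ex = p.sum(axis=1) ** q
    ex = ex / ex.sum()                             # escort of the X marginal
    ey = p.sum(axis=0) ** q
    ey = ey / ey.sum()                             # escort of the Y marginal
    eta1 = (entropy(pq.sum(axis=1)) + entropy(pq.sum(axis=0))
            - entropy(ex) - entropy(ey))
    eta2 = entropy(ex) + entropy(ey) - entropy(pq.ravel())
    return eta1, eta2                              # sum = MI-hat(X_q, Y_q)
```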
By Wilks (1938), $2n \widehat{MI}(X_q, Y_q) \xrightarrow{\mathcal{L}} \chi^2_{(K_1-1)(K_2-1)}$ under $H_0$, which implies that

(9)  $\sqrt{n}\left(\hat\eta_1 + \hat\eta_2\right) = \sqrt{n}\, \widehat{MI}(X_q, Y_q) \xrightarrow{p} 0.$
The following proposition is the keystone of the test to be proposed.
Proposition 1.
Suppose neither of the two underlying marginal distributions, $p_X$ and $p_Y$, is uniform, and suppose $H_0$ holds. Then $\sqrt{n}\, \hat\eta_1 \xrightarrow{\mathcal{L}} N(0, \sigma_q^2)$ as $n \to \infty$, where $\sigma_q^2$ is a positive constant depending on the parameter $q$.
A proof of Proposition 1 is given in the Appendix.
Let $Z \sim N(0, \sigma_q^2)$ denote the normal random variable under the conditions of Proposition 1. By (9) and Proposition 1, $\sqrt{n}\, \hat\eta_2 \xrightarrow{\mathcal{L}} -Z$, where $Z$ is the same random variable as in Proposition 1. These facts are summarized in Proposition 2 below.
Proposition 2.
Under the conditions of Proposition 1, as $n \to \infty$,
$Z_1 = \dfrac{\sqrt{n}\, \hat\eta_1}{\hat\sigma_q} \xrightarrow{\mathcal{L}} N(0, 1)$, $\quad Z_2 = -\dfrac{\sqrt{n}\, \hat\eta_2}{\hat\sigma_q} \xrightarrow{\mathcal{L}} N(0, 1)$, and $\quad Z_3 = \dfrac{\sqrt{n}\,(\hat\eta_1 - \hat\eta_2)}{2 \hat\sigma_q} \xrightarrow{\mathcal{L}} N(0, 1)$,
where $\hat\sigma_q$ is a consistent estimator of $\sigma_q$, for example the plug-in estimator of (12).
At least three tests for $H_0$ are feasible according to Proposition 2.

Test 1: $H_0$ is rejected if $Z_1 > z_{\alpha/2}$ or $Z_1 < -z_{\alpha/2}$;

Test 2: $H_0$ is rejected if $Z_2 > z_{\alpha/2}$ or $Z_2 < -z_{\alpha/2}$; and

Test 3: $H_0$ is rejected if $Z_3 > z_{\alpha/2}$ or $Z_3 < -z_{\alpha/2}$;

where $\alpha \in (0, 1)$ is a prefixed constant and $z_{\alpha/2}$ is the $100(1 - \alpha/2)$th percentile of the standard normal distribution.
The test based on $Z_3$ is the proposed test, and it is a consistent test as described in Proposition 3 below.

Proposition 3.
Suppose neither of the two underlying marginal distributions, $p_X$ and $p_Y$, is uniform. Then, for any joint distribution $\{p_{ij}\}$ in the alternative space, the power of the proposed test converges to one as $n \to \infty$.
A proof of Proposition 3 is given in the Appendix.
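Putting the pieces together, the following sketch implements Test 3 as reconstructed here, estimating $\sigma_q$ by plugging the empirical distribution into (12) with a numerical gradient; it reuses `eta_hats` from the earlier sketch. The choices $q = 2$ and the finite-difference step are ours, not from the article:

```python
import numpy as np
from scipy.stats import norm

def proposed_test(counts, q=2.0, alpha=0.05, h=1e-6):
    """Sketch of Test 3; reuses eta_hats from the earlier sketch.

    sigma_q is estimated by plugging the empirical distribution into
    (12): the gradient of g(v) = eta_1(v) is taken by central finite
    differences, and Sigma = diag(v) - v v^T is the multinomial
    covariance. q = 2 is an arbitrary choice (q != 1; a fractional
    q < 1 would need extra care at zero cells when differentiating).
    """
    counts = np.asarray(counts, dtype=float)
    n, shape = counts.sum(), counts.shape
    v = (counts / n).ravel()                   # empirical cell probabilities
    g = lambda w: eta_hats(w.reshape(shape), q)[0]
    grad = np.array([(g(v + h * e) - g(v - h * e)) / (2 * h)
                     for e in np.eye(v.size)])
    sigma2 = grad @ (np.diag(v) - np.outer(v, v)) @ grad      # eq. (12)
    eta1, eta2 = eta_hats(counts, q)
    z3 = np.sqrt(n) * (eta1 - eta2) / (2 * np.sqrt(sigma2))   # statistic Z_3
    return z3, bool(abs(z3) > norm.ppf(1 - alpha / 2))        # reject H0?
```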
3 Simulations
The performance of the proposed test is assessed by simulations. Numerous simulation studies were carried out for cases with various forms of underlying distributions. In each case, the proposed test is compared against Pearson's chi-squared test, with degrees of freedom $\nu = (K_1 - 1)(K_2 - 1)$ and $\hat\nu = (\hat K_1 - 1)(\hat K_2 - 1)$ respectively, at six levels of sample size, $n = 30$, $100$, $500$, $1000$, $1500$, and $2000$. The results summarized in Table 1 are representative of the general trends observed and therefore are presented below.
The sequence of five pairs of $p_0$ and $p_1$ is specifically constructed as follows. The example of the contingency table in Section 1 is one such case; in that example, the contingency table has $K$ rows and $K$ columns. A more general family of distributions may be described as follows. For a given value $\lambda \in (0, 1)$, let both the row and the column marginal distributions be $\{p_k\}$, with $p_1 = \lambda$ and the remaining mass $1 - \lambda$ spread geometrically over the other $K - 1$ letters, as in the example of Section 1 (where $\lambda = 1/2$).

Under $H_0$, a joint distribution $p_0$ is constructed as follows.

(10)  $p_0 = \{p_i p_j;\ 1 \le i, j \le K\}.$

The joint distribution of (10) is reconstructed, first by summing all entries in the lower-right $(K-1) \times (K-1)$ submatrix, a total of $(1 - \lambda)^2$, and then redistributing the sum uniformly on the diagonal of the submatrix, resulting in $p_1$, whose off-diagonal entries in the submatrix are zeros. Thus, $p_0$ versus $p_1$ becomes a null-alternative pair. Letting the parameter $\lambda$ take on the values 0.5, 0.6, 0.7, 0.8, and 0.9, respectively, the perturbed block carries less and less mass, and the dependence structure in $p_1$ becomes weaker and weaker.
The results of the simulation studies are reported in Table 1 (placed in the Appendix), each entry based on one hundred thousand replicates of an iid sample of the indicated size $n$, drawn from $p_0$ and $p_1$ respectively, at the nominal level $\alpha = 0.01$.
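For readers who wish to reproduce the flavor of a single cell of Table 1, the sketch below estimates the size and power of the proposed test for one $(\lambda, n)$ pair, reusing `example_pair` and `proposed_test` from the earlier sketches. Because the underlying family here is a reconstruction and the replication count is reduced, the numbers will not match Table 1 exactly:

```python
import numpy as np

# One simulation cell: rejection rates of the proposed test under p0
# (size) and p1 (power); example_pair and proposed_test are the sketches
# given earlier. K = 6, n = 500, and 500 replicates keep the run small.
rng = np.random.default_rng(1)
p0, p1 = example_pair(K=6, lam=0.5)
for label, p in (("size ", p0), ("power", p1)):
    rejections = 0
    for _ in range(500):
        counts = rng.multinomial(500, p.ravel()).reshape(p.shape)
        rejections += proposed_test(counts, alpha=0.01)[1]
    print(label, rejections / 500)
```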
Referring to Table 1, the distribution with $\lambda = 0.5$ on the top corresponds to the strongest contrast between $p_0$ and $p_1$ among all five cases in the table. For $n = 30$ and $n = 100$, none of the three test statistics converges satisfactorily under $p_0$ at the nominal level. For $n \ge 500$, both the proposed test and Pearson's chi-squared test with observed degrees of freedom converge satisfactorily under $p_0$, and both provide very good power at $p_1$. However, Pearson's chi-squared test with theoretical degrees of freedom does not converge under $p_0$.
The distribution with $\lambda = 0.6$ provides a less strong contrast between $p_0$ and $p_1$. For $n \ge 500$, both the proposed test and Pearson's chi-squared test with observed degrees of freedom converge satisfactorily under $p_0$, and both provide very good power at $p_1$. However, Pearson's chi-squared test with theoretical degrees of freedom has a very inflated Type I error probability. One may also notice that Pearson's chi-squared test with observed degrees of freedom perhaps converges a little more slowly than the proposed test under $p_0$.
The distribution with $\lambda = 0.7$ provides an even less strong contrast between $p_0$ and $p_1$. For $n \ge 500$, only the proposed test converges satisfactorily under $p_0$, and it provides meaningful power at $p_1$. Both of the Pearson's chi-squared tests have inflated Type I error probabilities.
The distribution with $\lambda = 0.8$ provides a weak contrast between $p_0$ and $p_1$. For $n \ge 500$, the proposed test converges satisfactorily under $p_0$ but provides little power at $p_1$. On the other hand, both of the Pearson's chi-squared tests have very inflated Type I error probabilities. One may also notice that Pearson's chi-squared test with observed degrees of freedom perhaps converges a little more slowly than the proposed test under $p_0$.
The distribution with $\lambda = 0.9$ provides a very weak contrast between $p_0$ and $p_1$, so much so that even for $n = 2000$, the proposed test converges very slowly under $p_0$ and provides essentially no power at $p_1$. On the other hand, both of the Pearson's chi-squared tests suffer a total breakdown under $p_0$.
The following major points are observed in Table 1, as well as in other simulation studies investigated but not presented herein.

In small or dense contingency tables, both Pearson's chi-squared tests converge quickly under $H_0$ and are generally more powerful than the proposed test.

In large and sparse contingency tables, both Pearson's chi-squared tests converge much more slowly under $H_0$ and tend to have a much higher probability of Type I error than intended.
All things considered, in practice, if the Rule is satisfied, then Pearson's chi-squared test is recommended; otherwise the proposed test of this article is recommended, provided that the sample size $n$ is reasonably large. In that regard, the proposed test is not meant to replace Pearson's test in all circumstances, but only when Pearson's test is judged inappropriate by the current rule-of-thumb criteria.
4 Remarks
The idea of the article may be explained simply by the convergence rate of $\widehat{MI}(X_q, Y_q)$ under $H_0$. Under $H_0$, $\widehat{MI}(X_q, Y_q)$ is $n^{-1}$-convergent, that is, $\widehat{MI}(X_q, Y_q) = O_p(1/n)$.

In other words, under $H_0$, the difference between the two additive parts of $\widehat{MI}(X_q, Y_q)$, namely $\hat H(X_q) + \hat H(Y_q)$ and $\hat H(X_q, Y_q)$, approaches zero very fast. However, by inserting a zero in the form of four terms, as in (7), $\widehat{MI}(X_q, Y_q)$ is wedged into two terms, $\hat\eta_1$ and $\hat\eta_2$, both of which approach zero under $H_0$, but at a much slower rate, that is, each is $n^{-1/2}$-convergent. This insertion is almost literally a keystone, splitting $\widehat{MI}(X_q, Y_q)$ into two random variables and eking out asymptotic normality of $\hat\eta_1$, of $\hat\eta_2$, and hence of $(\hat\eta_1 - \hat\eta_2)/2$ under $H_0$. The immediate advantages of the normality are that knowledge of $K_1$ and $K_2$ is not required when testing $H_0$, and that the test statistic seems to converge faster than Pearson's chi-squared statistic under $H_0$. On the other hand, the statistical assessment of $H_0$ is done by two separate random pieces instead of one, as in $2n\widehat{MI}(X_q, Y_q)$, and some efficiency may be lost, as evidenced by the simulation studies. The observed loss of power may be considered a cost for more generality, that is, for not requiring knowledge of $K_1$ and $K_2$.
It is to be noted that the concept of escort distributions is essential in the arguments leading to the proposed test. Only when $q \neq 1$ do the inserts, the left-hand sides of (5) and (6), enable a positive variance in (12), and hence the asymptotic normality of $\hat\eta_1$ and $\hat\eta_2$.
It is also to be highlighted that the proposed test based on $Z_3$ is a consistent test, as stated in Proposition 3. This fact lends utility to the proposed test in the general alternative space. That is to say, provided a sufficiently large sample, any form of dependence structure between $X$ and $Y$ will be detected. The test is proposed herein in the form of a two-sided test due to its generality. For specific forms of dependence structures between $X$ and $Y$, some of the tests based on $Z_1$, $Z_2$, or $Z_3$, one-sided or two-sided, may perform better than others, in terms of faster convergence under $H_0$ and higher power under $H_a$. This provides a potentially fruitful direction for further investigation.
5 Appendix
Proof of Proposition 1.
Consider $\eta_1$ as a function of the joint probabilities, $\eta_1 = g(v)$, where, for $i = 1, \dots, K_1$ and $j = 1, \dots, K_2$, the cell probabilities $p_{ij}$ are arranged into the vector

(11)  $v = (p_{11}, \dots, p_{1K_2}, p_{21}, \dots, p_{2K_2}, \dots, p_{K_1 1}, \dots, p_{K_1 K_2})^{\top},$

noting specifically that the implied enumeration of the indexes of $v$ corresponds to the row-by-row arrangement of $\{p_{ij}\}$. Let the gradient of $g$ with respect to $p_{ij}$, for all $(i, j)$, be denoted by $\nabla g$, with the index arrangement given in (11). It follows that $\sqrt{n}\,(\hat v - v) \xrightarrow{\mathcal{L}} MVN(0, \Sigma)$ (multivariate normal), where $\Sigma$ is the covariance matrix given by $\Sigma = \operatorname{diag}(v) - v v^{\top}$. According to the first-order delta method, $\sqrt{n}\,(g(\hat v) - g(v)) \xrightarrow{\mathcal{L}} N(0, \sigma_q^2)$ as $n \to \infty$, where

(12)  $\sigma_q^2 = (\nabla g)^{\top}\, \Sigma\, (\nabla g).$

Under $H_0$, $g(v) = \eta_1 = 0$ and, provided that neither marginal distribution is uniform and $q \neq 1$, $\sigma_q^2 > 0$; hence $\sqrt{n}\, \hat\eta_1 \xrightarrow{\mathcal{L}} N(0, \sigma_q^2)$.
∎
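The one distributional fact the proof leans on, $\sqrt{n}\,(\hat v - v) \xrightarrow{\mathcal{L}} MVN(0, \operatorname{diag}(v) - v v^{\top})$, is easy to verify empirically; a quick sketch:

```python
import numpy as np

# Empirical check: sqrt(n)(v_hat - v) has covariance diag(v) - v v^T.
rng = np.random.default_rng(2)
v, n = np.array([0.5, 0.3, 0.2]), 2000
draws = rng.multinomial(n, v, size=20000) / n          # 20000 copies of v_hat
emp = np.cov((np.sqrt(n) * (draws - v)).T)             # empirical covariance
print(np.round(emp, 3))
print(np.round(np.diag(v) - np.outer(v, v), 3))        # theoretical Sigma
```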
Proof of Proposition 3.
Table 1: Simulated sizes (rejection rates under $p_0$) and powers (rejection rates under $p_1$), at the nominal 0.01 level, for the proposed test, Pearson's chi-squared test with observed degrees of freedom $\hat\nu$, and Pearson's chi-squared test with theoretical degrees of freedom $\nu$. Each entry is based on one hundred thousand replicates.

Distribution       n      Proposed test      Pearson ($\hat\nu$)   Pearson ($\nu$)
                          size     power     size     power        size     power
$\lambda = 0.5$    30     0.2002   0.3235    0.0632   0.4377       0.0831   0.0694
                   100    0.0561   0.7739    0.0373   0.9951       0.0439   0.9940
                   500    0.0145   1.0000    0.0124   1.0000       0.2058   1.0000
                   1000   0.0122   1.0000    0.0113   1.0000       0.1437   1.0000
                   1500   0.0122   1.0000    0.0109   1.0000       0.1084   1.0000
                   2000   0.0113   1.0000    0.0098   1.0000       0.0858   1.0000
$\lambda = 0.6$    30     0.0540   0.0862    0.0968   0.2811       0.0021   0.0133
                   100    0.0071   0.0965    0.0764   0.8760       0.0997   0.8515
                   500    0.0087   0.9267    0.0183   1.0000       0.0767   1.0000
                   1000   0.0105   0.9997    0.0143   1.0000       0.0412   1.0000
                   1500   0.0103   1.0000    0.0123   1.0000       0.0316   1.0000
                   2000   0.0102   1.0000    0.0120   1.0000       0.0263   1.0000
$\lambda = 0.7$    30     0.0040   0.0054    0.1352   0.2118       0.0002   0.0020
                   100    0.0003   0.0011    0.1400   0.5563       0.0966   0.4720
                   500    0.0081   0.1324    0.0320   1.0000       0.0320   1.0000
                   1000   0.0096   0.4032    0.0200   1.0000       0.0200   1.0000
                   1500   0.0102   0.6541    0.0168   1.0000       0.0168   1.0000
                   2000   0.0104   0.8196    0.0151   1.0000       0.0151   1.0000
$\lambda = 0.8$    30     0.0021   0.0027    0.1697   0.1941       0.0021   0.0027
                   100    0.0000   0.0000    0.2082   0.3369       0.0997   0.1977
                   500    0.0086   0.0189    0.0768   0.9426       0.0767   0.9426
                   1000   0.0104   0.0389    0.0412   0.9999       0.0412   0.9999
                   1500   0.0108   0.0586    0.0316   1.0000       0.0316   1.0000
                   2000   0.0109   0.0813    0.0263   1.0000       0.0263   1.0000
$\lambda = 0.9$    30     0.0831   0.0822    0.2326   0.2337       0.0832   0.0822
                   100    0.0000   0.0000    0.2386   0.2519       0.0439   0.0532
                   500    0.0213   0.0222    0.2120   0.4037       0.2058   0.3966
                   1000   0.0202   0.0227    0.1437   0.6408       0.1437   0.6407
                   1500   0.0162   0.0212    0.1084   0.8330       0.1084   0.8330
                   2000   0.0158   0.0206    0.0854   0.9385       0.0854   0.9385
References
Agresti, A. (2003), Categorical Data Analysis, John Wiley & Sons.
Beck, C. & Schlögl, F. (1995), Thermodynamics of Chaotic Systems, Cambridge University Press.
Cochran, W. G. (1952), 'The $\chi^2$ test of goodness of fit', The Annals of Mathematical Statistics 23, 315–345.
Cochran, W. G. (1954), 'Some methods for strengthening the common $\chi^2$ tests', Biometrics 10(4), 417–451.
Shannon, C. E. (1948), 'A mathematical theory of communication', The Bell System Technical Journal 27(3), 379–423.
Wilks, S. S. (1938), 'The large-sample distribution of the likelihood ratio for testing composite hypotheses', The Annals of Mathematical Statistics 9(1), 60–62.
Yates, D., Moore, D. & McCabe, G. (1999), The Practice of Statistics, New York, NY: W. H. Freeman.
Zhang, Z. (2016), Statistical Implications of Turing's Formula, John Wiley & Sons.
Zhang, Z. (2020), 'Generalized mutual information', Stats 3(2), 158–165.