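In this paper, we discuss the problem of testing the equality of two multinomial distributions when the number of categories is large. Specifically, when we have two vectors $\mathbf{N}_c = (N_{c1}, \ldots, N_{ck})$ for $c = 1, 2$ which follow multinomial distributions $\mathrm{Multinomial}(n_c, \mathbf{P}_c)$, where $\mathbf{P}_c = (p_{c1}, \ldots, p_{ck})$ is a probability vector, our testing scenario is
$$H_0 : \mathbf{P}_1 = \mathbf{P}_2 \quad \text{vs.} \quad H_1 : \mathbf{P}_1 \neq \mathbf{P}_2. \tag{1}$$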
Our particular interest is a high dimensional multinomial with sparsity, in the sense that the number of categories is large while a majority of categories have fairly small counts, such as 0, 1, or 2. Typical examples are the cases where the number of categories is large and the cell probabilities are close to 0. Some existing tests, such as the Pearson chi-square test, rely on a large count in each cell; this may not hold for sparse data, especially when the number of categories is larger than the sample sizes of the two groups. The test that we propose is applicable to this sparse case and also to more general cases, including non-sparse ones, under some regularity conditions presented later.
In fact, testing the hypothesis in (1) is equivalent to testing the equality of the mean vectors of two multinomial distributions with given sample sizes. For testing the equality of two population mean vectors, there are numerous studies; see, for example, Bai and Saranadasa (1996), Chen and Qin (2010), Srivastava (2009), Srivastava et al. (2013) and Park and Ayyala (2013). However, the multinomial distribution does not satisfy the assumptions, such as factor models, used in these references.
On the other hand, Zelterman (1987) discussed goodness of fit tests in sparse contingency tables and also proposed a test for the case where the null probabilities are unknown. Zelterman (1987) derived the mean and variance of the proposed test statistic and proposed a normal approximation to its standardized form. From a theoretical point of view, Zelterman’s test requires some conditions on the cell probabilities and a relationship between the number of cells and the frequency totals in the contingency table.
It is worth noting that the goodness of fit test for one sample has a different context from the two sample problem. In other words, the goodness of fit test compares the probability vector of a single sample with a given null probability vector. There are extensive studies on the goodness of fit testing problem for one sample, such as Morris (1975), Cressie and Read (1984) and Kim et al. (2009), and all of these studies address a problem different from the two sample problem in (1).
In this paper, we propose a new test statistic to test (1) for two samples from multinomial distributions. We derive the asymptotic distribution and power function of the proposed test and present numerical studies. In particular, we emphasize that our asymptotic results hold under more general conditions than those in Zelterman (1987).
This paper is organized as follows. In Section 2, we discuss existing methods that can be applied to testing (1). In Sections 3-4, we present our proposed test statistics, derive their asymptotic null distribution and asymptotic power function, and prove their asymptotic normality. In Section 5, we consider an application of our proposed test based on the asymptotic power function: we define a neighborhood test, which is used in conjunction with our test statistic in Section 7 to analyze the 20 newsgroups dataset. In Section 6, we compare the performance of our test with other existing tests through simulation experiments. Concluding remarks are presented in Section 8.
2 Existing Methods for Comparison of Two Multinomial Distributions
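Suppose we have $\mathbf{N}_c = (N_{c1}, \ldots, N_{ck})$ for $c = 1, 2$, each of which has the multinomial distribution, namely $\mathbf{N}_c \sim \mathrm{Multinomial}(n_c, \mathbf{P}_c)$ where $\mathbf{P}_c = (p_{c1}, \ldots, p_{ck})$ with $\sum_{i=1}^{k} p_{ci} = 1$. One typical method for testing (1) is to use Pearson’s chi-square test, which is reliable when the count in each cell is large enough. Pearson’s statistic is defined as follows:
$$\chi^2 = \sum_{c=1}^{2} \sum_{i=1}^{k} \frac{(N_{ci} - E_{ci})^2}{E_{ci}}, \tag{2}$$
where $E_{ci} = n_c (N_{1i} + N_{2i}) / (n_1 + n_2)$ for $c = 1, 2$ is the expected count and $N_{ci}$ is the observed count for the $i$th vector entry of the $c$th group. As a related work, Anderson et al. (1972) applied a union-intersection method to develop a procedure for testing the homogeneity of two sample multinomial data and showed that their test is eventually equivalent to the Pearson chi-square test. The chi-square approximation to the distribution of (2) may be poor when the cell frequencies are not large enough.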
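As a minimal sketch of the Pearson statistic described above, the following Python function computes it for two count vectors. It assumes the usual pooled estimate of the expected counts (group total times pooled cell count, divided by the grand total); the function name `pearson_two_sample` and the convention of skipping cells that are empty in both samples are illustrative choices, not taken from the references.

```python
def pearson_two_sample(counts1, counts2):
    """Pearson chi-square statistic for homogeneity of two count vectors.

    Returns (statistic, degrees_of_freedom). Cells that are empty in both
    samples are skipped (an illustrative convention, since their expected
    counts are zero and the ratio would be undefined).
    """
    if len(counts1) != len(counts2):
        raise ValueError("both samples need the same number of categories")
    n1, n2 = sum(counts1), sum(counts2)
    if n1 == 0 or n2 == 0:
        raise ValueError("each sample needs at least one observation")
    total = n1 + n2

    statistic = 0.0
    used_cells = 0
    for a, b in zip(counts1, counts2):
        pooled = a + b
        if pooled == 0:
            continue  # no information in this cell
        used_cells += 1
        for observed, n in ((a, n1), (b, n2)):
            expected = n * pooled / total  # pooled expected count
            statistic += (observed - expected) ** 2 / expected
    return statistic, used_cells - 1


if __name__ == "__main__":
    stat, df = pearson_two_sample([10, 20], [20, 10])
    print(stat, df)
```

In sparse data most expected counts are far below the usual rule-of-thumb thresholds, which is exactly the regime where comparing this statistic to a chi-square distribution becomes unreliable.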
Alternatively, Zelterman (1987) proposed a goodness-of-fit statistic for contingency tables which provides improved power over the chi-square test when the latter is biased due to sparseness. Zelterman presented the conditional mean and variance of the proposed statistic, conditioning on the marginal totals, and applied the asymptotic normality of its normalized form, which is effective especially for sparse and high dimensional contingency tables. Zelterman’s test is
where , for , and is the observed value for the entry of the group. Zelterman (1987) presented and . From a theoretical point of view, Zelterman (1987) mentioned that the asymptotic normality of the statistic in (3) can hold when the sample sizes and the number of cells have the same increasing rate and the cell probabilities have the rates between and for some constants . These imply for some constant
which means that the expected counts under the null hypothesis should be bounded away from 0. Our proposed test is motivated by an estimator of the Euclidean distance between the two probability vectors and demonstrates some advantages over the test in Zelterman (1987) in the two sample case. These advantages can be understood through both theory and numerical studies, as we will show.
Additionally, there are many studies on testing the equality of mean vectors under models such as factor models; see, for example, Bai and Saranadasa (1996), Chen and Qin (2010), Park and Ayyala (2013) and Srivastava (2009). As mentioned in the introduction, the multinomial distribution does not satisfy the conditions in these studies. However, each multinomial vector can be regarded as a sum of independent single-trial multinomial vectors, so our problem can be viewed as testing the equality of the mean vectors of these single-trial vectors, which is exactly (1). The tests in Park and Ayyala (2013) and Srivastava (2009) are not well defined in our setting due to zero values in many cells. We will consider the test in Bai and Saranadasa (1996) in our numerical studies, while the test in Chen and Qin (2010) is not practical in our situation due to its computational complexity when the sample sizes are in the thousands.
In the following section, we propose a new test and show its asymptotic normality and the asymptotic power under some conditions. We will also provide numerical studies comparing our proposed test with existing methods as well as a real data example.
3 New Test Statistic for Comparison of Two Multinomial Distributions
In this section, we propose a new test and derive the asymptotic power of the proposed test from the asymptotic normality under some regularity conditions.
3.1 The Proposed Test Statistic
We present a new procedure for testing the hypotheses in (1) when the dimension of the multinomial vector is large. Our main goal is to propose a new test and derive its asymptotic distribution and asymptotic power function. The proposed test is based on an unbiased estimator of the squared Euclidean distance between the two probability vectors:
$$\|\mathbf{P}_1 - \mathbf{P}_2\|_2^2 = \sum_{i=1}^{k} (p_{1i} - p_{2i})^2, \tag{4}$$
where $\mathbf{P}_c = (p_{c1}, \ldots, p_{ck})$ for $c = 1, 2$. Before we construct our test statistic, we note that a multinomial vector $\mathbf{N}_c$ with $n_c$ trials can be reformulated as the conditional distribution of $(X_{c1}, \ldots, X_{ck})$ given the total sum $\sum_{i=1}^{k} X_{ci} = n_c$, where the $X_{ci}$s are independent Poisson variables with mean $n_c p_{ci}$, i.e.,
$$\mathbf{N}_c \stackrel{d}{=} (X_{c1}, \ldots, X_{ck}) \,\Big|\, \sum_{i=1}^{k} X_{ci} = n_c,$$
where $\stackrel{d}{=}$ means the equivalence of two distributions. Morris (1975) provided asymptotic results for the multinomial distribution using Poisson distributions conditioned on the total sum. We first propose our test statistic based on independent Poisson distributions and then provide asymptotic results conditioning on the total sums. For the observed vectors of independent Poisson variables, we define
and, to obtain an unbiased estimator for (4), we introduce
We obtain an unbiased estimator of based on for which is
has an asymptotic normal distribution for multinomial vectors. The Euclidean distance is commonly used for testing the equality of mean vectors of multivariate normal distributions or factor models with some moment conditions; see Bai and Saranadasa (1996) and Chen and Qin (2010). In the context of testing in contingency tables, the idea behind the chi-square test is to assess the goodness of fit for each cell using quantities standardized under the null hypothesis. However, the denominator in (2) is affected by the cell probabilities, which may lead to a very skewed distribution when these probabilities are small. In our context, the sparse multinomial data arise from small probabilities in most of the cells, so a chi-square approximation for each cell may not be desirable. On the other hand, our proposed tests based on the statistic in (6) first aggregate the cell-wise estimates and then normalize the aggregate. This difference will lead to different performance between our proposed test and the test in (3).
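To make the idea of an unbiased distance estimator concrete, here is a minimal Python sketch under the Poisson formulation. It assumes independent Poisson counts with means equal to the sample size times the cell probability; in that case the squared difference of the two cell proportions overestimates the squared difference of the probabilities by the sum of the two variances, and subtracting each count divided by its squared sample size removes this bias. This is a standard bias correction written out as an assumption, and the function names are hypothetical.

```python
def naive_distance(counts1, counts2):
    """Plug-in estimate: squared Euclidean distance between the proportions."""
    n1, n2 = sum(counts1), sum(counts2)
    return sum((a / n1 - b / n2) ** 2 for a, b in zip(counts1, counts2))


def bias_corrected_distance(counts1, counts2):
    """Cell-wise bias-corrected estimate of the squared Euclidean distance.

    Assumes independent Poisson cell counts with mean n_c * p_ci, so that
    Var(count / n_c) = p_ci / n_c is estimated without bias by count / n_c**2.
    """
    if len(counts1) != len(counts2):
        raise ValueError("both samples need the same number of categories")
    n1, n2 = sum(counts1), sum(counts2)
    if n1 == 0 or n2 == 0:
        raise ValueError("each sample needs at least one observation")
    total = 0.0
    for a, b in zip(counts1, counts2):
        # squared difference of proportions minus the two variance estimates
        total += (a / n1 - b / n2) ** 2 - a / n1 ** 2 - b / n2 ** 2
    return total


if __name__ == "__main__":
    sample1, sample2 = [2, 0, 1], [0, 1, 1]
    print(naive_distance(sample1, sample2))
    print(bias_corrected_distance(sample1, sample2))
```

Because the counts in each sample sum to the sample size, the total correction equals the sum of the reciprocals of the two sample sizes, so the corrected estimate can be negative when the two samples are close. The corrections are applied before any normalization, which reflects the "aggregate first, then normalize" ordering described above.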
We first present the following theorem, which plays a major role in deriving the asymptotic distribution of our proposed test and the asymptotic power. We use the following notation: let and for . Let for and . For two vectors and , the dot product is and the component-wise product of and is . We also define and . Let be a sequence satisfying where implies for sequences and . The notation $\xrightarrow{d}$ denotes convergence in distribution.
Theorem 1. Let be independent multinomial random vectors for such that where and for . Suppose the following conditions are satisfied: for ,
Then, we have
and is given by (5).
Proof. The proof of Theorem 1 is provided in Section 4 through a series of lemmas. ∎
The conditions in Theorem 1 will be used throughout this paper. The sample sizes and the dimension do not have an explicit relationship. This is in contrast to Zelterman (1987), which assumes that the sample sizes and the dimension increase at the same rate for the theoretical proof of the asymptotic normality of (3). Our conditions in Theorem 1 do not require a direct relationship between the sample sizes and the dimension; rather, the two are related only through Condition 3 in Theorem 1. For example, when and , then Condition 3 requires which includes the case of in Zelterman (1987). However, Condition 3 covers a wider variety of situations than Zelterman (1987). For example, when and , all four conditions in Theorem 1 are satisfied. The condition allows to increase at the rate of . In other words, our conditions allow a more general relationship between the sample sizes and the dimension, depending on the configuration of the cell probabilities.
In Theorem 1, is known; however, is unknown, so we need an estimate of the quantity defined in (8) that has asymptotically equivalent behavior. Our proposed test is constructed under the null hypothesis $H_0 : \mathbf{P}_1 = \mathbf{P}_2$. For the derivation of , see the proof of Lemma 2 in Section 4. In practice, we need an estimate of based on multinomial data for . We propose an estimator of which is
where and . Lemma 1 states that the proposed estimator of has the property of ratio consistency.
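Lemma 1. Under Conditions 1 and 2 in Theorem 1, the proposed estimator is ratio consistent, i.e., the ratio of the estimator to its target converges to 1 in probability.
Proof. See Appendix. ∎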
Based on the estimators , we define the following two test statistics, namely ;
where $z_{1-\alpha}$ is the $(1-\alpha)$ quantile of a standard normal distribution. In practice, our test requires only the conditions of Theorem 1 to be an asymptotically size-$\alpha$ test for a given $\alpha$. Additionally, it is of interest to investigate the power function of our proposed tests. In particular, the power function from Theorem 1 is meaningful when the signal-to-noise ratio (SNR) is bounded, which is the case where the asymptotic power is non-trivial. Condition 4 in Theorem 1 is equivalent to the condition that the SNR is bounded by some constant.
It is clear that the proposed test is an asymptotically size-$\alpha$ test, since the test statistic is asymptotically standard normal under $H_0$.
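The decision rule implied by asymptotic standard normality can be sketched in a few lines of Python. The sketch assumes a one-sided rejection region (large values of the standardized statistic indicate a departure from the null hypothesis), which is an assumption made here for illustration; the function names are hypothetical, and the standardized statistic is taken as an input rather than computed.

```python
from statistics import NormalDist


def critical_value(alpha=0.05):
    """Upper (1 - alpha) quantile of the standard normal distribution."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - alpha)


def reject_null(standardized_statistic, alpha=0.05):
    """One-sided test: reject when the statistic exceeds the critical value."""
    return standardized_statistic > critical_value(alpha)


def asymptotic_p_value(standardized_statistic):
    """Upper-tail probability under the standard normal approximation."""
    return 1.0 - NormalDist().cdf(standardized_statistic)


if __name__ == "__main__":
    for t in (1.0, 2.0):
        print(t, reject_null(t), asymptotic_p_value(t))
```

Because the normal approximation holds only asymptotically, the nominal level is attained only approximately in finite samples.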
In the following section, we provide the proof of Theorem 1.
4 Asymptotic Normality of the Proposed Tests
In this section we prove Theorem 1. The main difficulty is the dependency imposed by the multinomial distribution. In other words, the summands of the test statistic are not independent, since the cell counts within each sample are dependent under the multinomial distribution. Therefore, it is not straightforward to apply a central limit theorem based on the assumption of independence. Instead, Steck (1957) and Morris (1975) use a conditional central limit theorem for independent Poisson variables, conditioning on the sums of the Poisson variables, to obtain the asymptotic normality of multinomial distributions. More specifically, to avoid the issue of dependency from the multinomial distribution, we use the fact that the multinomial random vector has the same distribution as a vector of independent Poisson random variables, with means proportional to the cell probabilities, conditioned on their sum being equal to the sample size. Before we present our main results, we first define the following notation:
We will show that which is a trivariate normal distribution where is the $3 \times 3$ identity matrix and . The latter case means that, under the condition of (equivalently ) for , the conditional distribution of is the same as that of since . For the asymptotic normality of , we need the uniform equicontinuity required by the conditional central limit theorem stated in Theorem 2.1 in Steck (1957). For this uniform equicontinuity, we need to show that, for bounded values and for some and , the conditional characteristic function of given and satisfies
We will show the uniform equicontinuity of the characteristic function in Lemma 3.
From Theorem 2.1 in Steck (1957), the uniform equicontinuity of characteristic function of implies the conditional asymptotic normality of given , i.e., .
In fact, the uniform equicontinuity of characteristic function becomes
and it is sufficient to show that the last expression converges to 0.
Lemma 2. When and for are independent Poisson random variables with means and , respectively, then
Proof. See the supplementary material. ∎
The following lemma ensures the convergence of the characteristic function of based on independent Poisson distributions, conditioning on and , which corresponds to the multinomial distributions.
Lemma 3. When and are independent multinomial vectors for with and , and and are such that and are nonnegative integers and is bounded (say for some constant , ) as , then under the conditions in Theorem 1, for , we have
Proof. See the supplementary material. ∎
The following lemma shows that is asymptotically normal when s are independent Poisson random variables.
Lemma 4. When and for are independent Poisson random variables with means and , respectively, then
Proof. See the supplementary material. ∎
Proof of Theorem 1: Lemma 4 shows . We also have for from Lyapunov’s condition: from Condition 3 in Theorem 1. Using Lemma 2 and the independence of and , we have the result that , and are uncorrelated with each other. Therefore, using Lemma 2.1 in Morris (1975), we have the trivariate asymptotic normality of , i.e., where is the $3 \times 3$ identity matrix. Lemma 3 shows the uniform equicontinuity of the conditional characteristic function of given and , so we have , in other words