 # Two-Sample Test for Sparse High Dimensional Multinomial Distributions

In this paper we consider testing the equality of the probability vectors of two independent multinomial distributions in high dimensions. The classical chi-square test may have drawbacks in this case since many of the cell counts may be zero or may not be large enough. We propose a new test and show its asymptotic normality and asymptotic power function. Based on the asymptotic power function, we present an application of our result to a neighborhood-type test which has been previously studied, especially for the case of fairly small p-values. To compare the proposed test with existing tests, we provide numerical studies including simulations and real data examples.

## 1 Introduction

In this paper, we discuss the problem of testing two multinomial distributions when the number of categories is large. Specifically, when we have two count vectors $N_1$ and $N_2$ which follow multinomial distributions with probability vectors $P_1$ and $P_2$, our testing scenario is

$$H_0: P_1 = P_2 \quad \text{vs.} \quad H_1: P_1 \neq P_2. \qquad (1)$$

Our particular interest is a high dimensional multinomial with sparsity, in the sense that the number of categories $k$ is large while a majority of categories have fairly small counts, such as 0, 1, or 2. Typical examples are cases where $k$ is large and the cell probabilities are close to 0. Some existing tests such as the Pearson chi-square test rely on a large number of counts in each cell; however, this may not hold under sparse data, especially when $k$ is larger than the sample sizes. The test that we propose is applicable to this sparse case and also to more general cases, including non-sparse ones, under some regularity conditions presented later.

In fact, the hypothesis in (1) is equivalent to testing the equality of the mean vectors of two multinomial distributions with sample sizes $n_1$ and $n_2$. For testing the equality of two population mean vectors, there are numerous studies; for example, see Bai and Saranadasa (1996), Chen and Qin (2010), Srivastava (2009), Srivastava et al. (2013) and Park and Ayyala (2013). However, the multinomial distribution does not satisfy assumptions such as the factor models used in these references.

On the other hand, Zelterman (1987) discussed goodness-of-fit tests in sparse contingency tables and also proposed a test for the case where the null probabilities are unknown. Zelterman (1987) derived the mean and variance of the proposed statistic and proposed a normal approximation for its standardized form. From a theoretical point of view, Zelterman's test requires some conditions on the cell probabilities and some relationship between the number of cells and the frequency totals in the contingency table.

It is worth noting that the goodness-of-fit test from one sample has a different context from the two-sample problem. In other words, the goodness-of-fit test tests whether a single probability vector equals a given null vector. There are extensive studies on the goodness-of-fit testing problem for one sample, such as Morris (1975), Cressie and Read (1984) and Kim et al. (2009), and all these studies differ from the two-sample problem in (1) in the sense that test statistics for goodness of fit utilize the known null probabilities under the null hypothesis.

In this paper, we propose a new test statistic for (1) based on two samples from multinomial distributions. We provide the asymptotic distribution and power function of the proposed test and present numerical studies. In particular, we emphasize that our asymptotic results are more general than those of Zelterman (1987).

This paper is organized as follows. In Section 2 we discuss existing methods that can be applied to our testing problem (1). In Sections 3-4, we propose a new test statistic and show its asymptotic null distribution and asymptotic power function. In Section 5, we consider an application of our proposed test based on the asymptotic power function: we define a neighborhood test, which is used in conjunction with our test statistic in Section 7 to analyze the 20 newsgroups dataset. In Section 6 we show the performance of our test compared to other existing tests through simulation experiments. Concluding remarks are presented in Section 8.

## 2 Existing methods for Comparison of Two Multinomial Distributions

Suppose we have independent random vectors $N_1$ and $N_2$, where $N_c = (N_{c1},\dots,N_{ck})$ has the multinomial distribution with $n_c$ trials and probability vector $P_c = (p_{c1},\dots,p_{ck})$ for $c = 1, 2$. One typical method for testing (1) is Pearson's chi-square test, which is reliable when the expected count in each cell is large enough. Pearson's statistic is defined as follows:

$$\chi^2 = \sum_{c=1}^{2}\ \sum_{i \in \{i:\, \hat N_{ci} > 0\}} \frac{(N_{ci} - \hat N_{ci})^2}{\hat N_{ci}} \qquad (2)$$

where $\hat N_{ci}$ is the expected count and $N_{ci}$ is the observed count for the $i$th entry of the $c$th group, $c = 1, 2$. As a related work, Anderson et al. (1972) applied a union-intersection method to develop a procedure for testing the homogeneity of two-sample multinomial data and showed that their test is eventually equivalent to the Pearson chi-square test. The chi-square approximation to (2) may be poor when the cell frequencies are not large enough.
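As a minimal illustration, statistic (2) can be computed as follows, using the pooled estimate $\hat N_{ci} = n_c (N_{1i} + N_{2i})/(n_1 + n_2)$ for the expected counts (this pooled form, and the function name, are our assumptions for the sketch):

```python
import numpy as np

def pearson_chi2(counts1, counts2):
    """Pearson chi-square statistic (2) for two multinomial samples.

    Expected counts use the pooled cell proportions, and the sum runs
    only over cells whose expected count is positive.
    """
    x1 = np.asarray(counts1, dtype=float)
    x2 = np.asarray(counts2, dtype=float)
    n1, n2 = x1.sum(), x2.sum()
    pooled = (x1 + x2) / (n1 + n2)      # pooled cell proportions under H0
    stat = 0.0
    for nc, counts in ((n1, x1), (n2, x2)):
        expected = nc * pooled          # N_hat_ci = n_c * pooled proportion
        mask = expected > 0             # skip cells with zero expected count
        stat += np.sum((counts[mask] - expected[mask]) ** 2 / expected[mask])
    return float(stat)
```

With many near-empty cells, the per-cell divisions by small expected counts are exactly what makes the chi-square approximation unstable in the sparse regime.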

Alternatively, Zelterman (1987) proposed a goodness-of-fit statistic for contingency tables which provides improved power over the chi-square test when the latter is biased due to sparseness. He presented the conditional mean and variance of the proposed statistic conditioning on the marginal totals and applied the asymptotic normality of its normalized form, which is effective especially for sparse and large dimensional contingency tables. Zelterman's test is

$$Z = \frac{\hat D_Z^2 - E(\hat D_Z^2)}{\sqrt{\mathrm{Var}(\hat D_Z^2)}} \qquad (3)$$

where $\hat D_Z^2$ is Zelterman's goodness-of-fit statistic computed from the observed counts in each cell, and $E(\hat D_Z^2)$ and $\mathrm{Var}(\hat D_Z^2)$ were presented in Zelterman (1987). From a theoretical point of view, Zelterman (1987) mentioned that the asymptotic normality of $Z$ in (3) can hold when the total count and the number of cells have the same increasing rate and the cell probabilities are of order $1/k$ up to constants. These conditions imply that the expected counts under the null hypothesis should be bounded away from 0. Our proposed test is motivated by an unbiased estimator of the Euclidean distance between two probability vectors and demonstrates some advantages over the test in Zelterman (1987) in the two-sample case. This advantage can be understood through both theory and numerical studies, as we will show.

Additionally, there are many studies for testing the equality of mean vectors under some models such as factor models; for example, see Bai and Saranadasa (1996), Chen and Qin (2010), Park and Ayyala (2013) and Srivastava (2009). As mentioned in the introduction, the multinomial distribution does not satisfy the conditions in these studies. However, our problem for two multinomial distributions can be viewed as testing the equality of the mean vectors of $N_1/n_1$ and $N_2/n_2$, which is exactly (1). The tests in Park and Ayyala (2013) and Srivastava (2009) are not well defined in our setting due to the zero values in many cells. We will consider the test in Bai and Saranadasa (1996) in our numerical studies, while the test in Chen and Qin (2010) is not practical in our situation due to its computational complexity when the sample sizes are in the thousands.

In the following section, we propose a new test and show its asymptotic normality and the asymptotic power under some conditions. We will also provide numerical studies comparing our proposed test with existing methods as well as a real data example.

## 3 New Test Statistic for Comparison of Two Multinomial Distributions

In this section, we propose a new test and derive the asymptotic power of the proposed test from the asymptotic normality under some regularity conditions.

### 3.1 The Proposed Test Statistic

We present a new procedure for testing the hypotheses in (1) when the dimension of the multinomial vector is large. Our main goal is to propose a new test and derive its asymptotic distribution and asymptotic power function. The proposed test is based on an unbiased estimator of the squared Euclidean distance between $P_1$ and $P_2$, namely $\|P_1 - P_2\|_2^2 = \sum_{i=1}^k (p_{1i} - p_{2i})^2$. Before we construct our test statistic, we mention that the multinomial vector $N_c$ can be reformulated as the conditional distribution of $X_c = (X_{c1},\dots,X_{ck})$ given the total sum $\sum_{i=1}^k X_{ci} = n_c$, where the $X_{ci}$'s are independent Poisson random variables with means $\lambda_{ci} = n_c p_{ci}$, i.e., $N_c \overset{d}{=} (X_c \mid \sum_{i=1}^k X_{ci} = n_c)$, where $\overset{d}{=}$ means the equivalence of two distributions. Morris (1975) provided asymptotic results for the multinomial distribution using Poisson distributions conditioning on the total sum. We first propose our test statistic based on independent Poisson variables and then provide asymptotic results conditioning on the total sums $n_1$ and $n_2$. For the vectors of independent Poisson variables $X_1$ and $X_2$, we define

$$\|P_1 - P_2\|_2^2 = \left\|\frac{\lambda_1}{n_1} - \frac{\lambda_2}{n_2}\right\|_2^2 = \sum_{i=1}^{k}\left(\frac{\lambda_{1i}}{n_1} - \frac{\lambda_{2i}}{n_2}\right)^2. \qquad (4)$$

and, to obtain an unbiased estimator for (4), we introduce

$$f^*(x_1, x_2) = \left(\frac{x_1}{n_1} - \frac{x_2}{n_2}\right)^2 - \frac{x_1}{n_1^2} - \frac{x_2}{n_2^2}. \qquad (5)$$

We then obtain an unbiased estimator of $\|P_1 - P_2\|_2^2$ based on $f^*(X_{1i}, X_{2i})$ for $i = 1,\dots,k$, which is

$$D \equiv \sum_{i=1}^{k}\left(\left(\frac{X_{1i}}{n_1} - \frac{X_{2i}}{n_2}\right)^2 - \frac{X_{1i}}{n_1^2} - \frac{X_{2i}}{n_2^2}\right) = \sum_{i=1}^{k} f^*(X_{1i}, X_{2i}) \qquad (6)$$

satisfying $E(D) = \|P_1 - P_2\|_2^2$. Theorem 1 and Corollary 2 will show that a normalized form of $D$, with a suitable variance estimator, has an asymptotic normal distribution for multinomial vectors. The Euclidean distance is commonly used for testing the equality of mean vectors of multivariate normal distributions or factor models with some moment conditions; see Bai and Saranadasa (1996) and Chen and Qin (2010). In the context of testing in contingency tables, the idea behind the chi-square statistic is to assess the goodness of fit in each cell using standardized quantities under the null hypothesis. However, the denominator in (2) is affected by the cell probabilities, which may lead to a very skewed distribution when the expected counts are small. In our context, sparse multinomial data arise from small probabilities in most cells, so the chi-square approximation in each cell may not be desirable. On the other hand, our proposed test based on $D$ in (6) first aggregates the cell-wise estimates of $(p_{1i} - p_{2i})^2$ and then normalizes the aggregate. This difference will lead to different performance between our proposed test and the test in (3).
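As a concrete illustration, the kernel $f^*$ in (5) and the statistic $D$ in (6) can be computed directly from the two count vectors; the following sketch (function names are ours) assumes the counts are given as arrays:

```python
import numpy as np

def f_star(x1, x2, n1, n2):
    """Kernel f* in (5); the two subtracted terms remove the bias of the
    squared difference, so E[f*] = (lambda_1/n1 - lambda_2/n2)^2 for
    independent Poisson counts."""
    return (x1 / n1 - x2 / n2) ** 2 - x1 / n1**2 - x2 / n2**2

def d_statistic(counts1, counts2):
    """Statistic D in (6): an unbiased estimator of ||P1 - P2||_2^2."""
    x1 = np.asarray(counts1, dtype=float)
    x2 = np.asarray(counts2, dtype=float)
    n1, n2 = x1.sum(), x2.sum()
    return float(np.sum(f_star(x1, x2, n1, n2)))
```

Note that $D$ can be negative, e.g. for two identical samples, since it is an unbiased estimate of a quantity that equals zero under $H_0$.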

We first present the following theorem, which plays a major role in deriving the asymptotic distribution of our proposed test and the asymptotic power. We use the following notation: let $n = n_1 + n_2$ and $\lambda_{ci} = n_c p_{ci}$ for $c = 1, 2$ and $i = 1,\dots,k$. Let $\xi = P_1 - P_2$ with components $\xi_i = p_{1i} - p_{2i}$. For two vectors $a$ and $b$, the dot product is $a \cdot b = \sum_i a_i b_i$ and the component-wise product is $a \circ b = (a_1 b_1, \dots, a_k b_k)$. For sequences $a_k$ and $b_k$, $a_k = O(b_k)$ means that $a_k/b_k$ is bounded. The notation $\overset{d}{\to}$ denotes convergence in distribution.

###### Theorem 1.

Let $N_1$ and $N_2$ be independent multinomial random vectors such that $N_c$ has $n_c$ trials and probability vector $P_c = (p_{c1},\dots,p_{ck})$ with $\sum_{i=1}^k p_{ci} = 1$ for $c = 1, 2$. Suppose the following conditions are satisfied:

$$\begin{aligned}
&\text{Condition 1:}\quad \min(n_1, n_2) \to \infty, \quad n_1/n \to c \in (0,1),\\
&\text{Condition 2:}\quad \max_i p_{ci}^2 \,/\, \|P_c\|_2^2 \to 0 \ \text{ for } c = 1, 2 \text{ as } k \to \infty,\\
&\text{Condition 3:}\quad n\|P_1 + P_2\|_2^2 \ge \epsilon > 0 \ \text{ for some } \epsilon > 0,\\
&\text{Condition 4:}\quad n^2\|\xi\|_2^4 = O(\|P_1 + P_2\|_2^2).
\end{aligned}$$

Then, we have

$$\frac{\sum_{i=1}^{k} f^*(N_{1i}, N_{2i}) - \|\xi\|_2^2}{\sigma_k} \overset{d}{\to} N(0,1) \qquad (7)$$

where

$$\sigma_k^2 = 2\sum_{i=1}^{k}\left(\frac{p_{1i}}{n_1} + \frac{p_{2i}}{n_2}\right)^2 \qquad (8)$$

and $f^*$ is given by (5).

###### Proof.

The proof of Theorem 1 will be provided in Section 4 with a series of lemmas. ∎

###### Remark 1.

The conditions in Theorem 1 will be used throughout this paper. The sample sizes $n_1, n_2$ and the dimension $k$ do not have an explicit relationship. This is in contrast to (3) in Zelterman (1987), which assumes that the total count and the number of cells have the same increasing rate for the theoretical proof of asymptotic normality. Instead, our conditions in Theorem 1 do not require a direct relationship between $n$ and $k$; rather, $n$ and $k$ are related only through Condition 3. For example, in the uniform case $p_{1i} = p_{2i} = 1/k$, we have $\|P_1 + P_2\|_2^2 = 4/k$, so Condition 3 requires $n/k$ to be bounded below, which includes the case in Zelterman (1987) where $n$ and $k$ increase at the same rate. However, Condition 3 covers a variety of situations beyond Zelterman (1987), allowing $k$ to increase faster than $n$ depending on the configuration of the $p_{ci}$'s. In other words, our conditions include a more general relationship between $n$ and $k$.

In Theorem 1, the normalizing quantity $\sigma_k$ is unknown in practice, so we need an estimator of $\sigma_k^2$ defined in (8) with asymptotically equivalent behavior. Our proposed test is constructed under the null hypothesis $H_0: P_1 = P_2$. For the derivation of $\sigma_k^2$, see the proof of Lemma 2 in Section 4. In practice, we need an estimate of $\sigma_k^2$ based on the multinomial data $N_1$ and $N_2$. We propose the estimator

$$\hat\sigma_k^2 = \sum_{i=1}^{k}\sum_{c=1}^{2}\frac{2}{n_c^2}\left(\hat p_{ci}^2 - \frac{\hat p_{ci}}{n_c}\right) + \frac{4}{n_1 n_2}\sum_{i=1}^{k}\hat p_{1i}\hat p_{2i} \qquad (9)$$

where $\hat p_{ci} = N_{ci}/n_c$ for $c = 1, 2$ and $i = 1,\dots,k$. Lemma 1 states that the proposed estimator of $\sigma_k^2$ has the property of ratio consistency.

###### Lemma 1.

Under Conditions 1 and 2 in Theorem 1, $\hat\sigma_k^2/\sigma_k^2 \to 1$ in probability under $H_0$.

###### Proof.

See Appendix. ∎

Based on the estimator $\hat\sigma_k^2$, we define the following test statistic:

$$T \equiv \frac{\sum_{i=1}^{k} f^*(N_{1i}, N_{2i})}{\hat\sigma_k} \qquad (10)$$

where $f^*$ is defined in (5). From Theorem 1 and Lemma 1, $T$ is asymptotically normal under $H_0$. We state this in the following corollary.

###### Corollary 1.

Under $H_0$, if Conditions 1 and 2 in Theorem 1 are satisfied, then $T \overset{d}{\to} N(0,1)$, where $T$ is defined in (10).

Corollary 1 shows that our proposed test is available for practical use under the fairly mild conditions of Theorem 1. Based on Corollary 1, we reject $H_0$ if

$$T > z_{1-\alpha} \qquad (11)$$

where $z_{1-\alpha}$ is the $(1-\alpha)$ quantile of the standard normal distribution. In practice, our test requires only the conditions of Theorem 1 to have asymptotic size $\alpha$ for a given $\alpha$. Additionally, it is of interest to investigate the power function of our proposed test. In particular, the power function from Theorem 1 is meaningful when the signal-to-noise ratio $\|P_1 - P_2\|_2^2/\sigma_k$ is bounded, which is the case where the asymptotic power is non-trivial in the sense that the power is in $(0, 1)$. Condition 4 in Theorem 1 is equivalent to the condition that this signal-to-noise ratio is bounded by some constant.
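Putting (6), (9) and the rejection rule (11) together, the complete procedure can be sketched as follows (a minimal illustration under our reading of the formulas; the function name is ours):

```python
import numpy as np
from statistics import NormalDist

def sparse_multinomial_test(counts1, counts2, alpha=0.05):
    """Proposed test: T in (10), built from D in (6) and sigma_hat_k^2 in (9);
    reject H0: P1 = P2 at level alpha when T > z_{1-alpha}, as in (11)."""
    x1 = np.asarray(counts1, dtype=float)
    x2 = np.asarray(counts2, dtype=float)
    n1, n2 = x1.sum(), x2.sum()
    # D in (6): unbiased estimator of ||P1 - P2||_2^2
    D = np.sum((x1 / n1 - x2 / n2) ** 2 - x1 / n1**2 - x2 / n2**2)
    # sigma_hat_k^2 in (9), with plug-in proportions p_hat_ci = N_ci / n_c
    p1, p2 = x1 / n1, x2 / n2
    s2 = ((2 / n1**2) * np.sum(p1**2 - p1 / n1)
          + (2 / n2**2) * np.sum(p2**2 - p2 / n2)
          + (4 / (n1 * n2)) * np.sum(p1 * p2))
    T = float(D / np.sqrt(s2))
    return T, T > NormalDist().inv_cdf(1 - alpha)
```

The test is one-sided because $D$ estimates a nonnegative quantity that is zero exactly under $H_0$, so only large positive values of $T$ are evidence against the null.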

###### Corollary 2.

Under the conditions in Theorem 1, we have

$$P(T > z_{1-\alpha}) - \bar\Phi\left(z_{1-\alpha} - \frac{\|P_1 - P_2\|_2^2}{\sigma_k}\right) \to 0 \qquad (12)$$

where $\bar\Phi(x) = P(W > x)$ for a standard normal random variable $W$ and $z_{1-\alpha}$ is the $(1-\alpha)$ quantile of the standard normal distribution.

###### Proof.

From Theorem 1 and Lemma 1, we have (12). ∎

###### Remark 2.

It is clear that under $H_0$ the proposed test is asymptotically size-$\alpha$, since $\|P_1 - P_2\|_2^2 = 0$ under $H_0$ and the right-hand side of (12) reduces to $\bar\Phi(z_{1-\alpha}) = \alpha$.
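The asymptotic power approximation in (12) is straightforward to evaluate numerically; a minimal sketch, with $\sigma_k$ computed from the true probability vectors via (8) (the function name is ours):

```python
from statistics import NormalDist

def asymptotic_power(P1, P2, n1, n2, alpha=0.05):
    """Asymptotic power (12): Phi_bar(z_{1-alpha} - ||P1 - P2||_2^2 / sigma_k),
    with sigma_k^2 = 2 * sum_i (p_1i/n1 + p_2i/n2)^2 as in (8)."""
    nd = NormalDist()
    signal = sum((p - q) ** 2 for p, q in zip(P1, P2))  # ||P1 - P2||_2^2
    sigma_k = (2 * sum((p / n1 + q / n2) ** 2 for p, q in zip(P1, P2))) ** 0.5
    return 1.0 - nd.cdf(nd.inv_cdf(1 - alpha) - signal / sigma_k)
```

Under $H_0$ the signal term vanishes and the expression reduces to $\alpha$, consistent with Remark 2.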

In the following section, we provide the proof of Theorem 1.

## 4 Asymptotic Normality of the proposed tests

In this section we prove Theorem 1. The main difficulty is the dependency imposed by the multinomial distribution: the summands $f^*(N_{1i}, N_{2i})$ are not independent since the $N_{ci}$'s are dependent across $i = 1,\dots,k$ under the multinomial distribution. Therefore, it is not straightforward to apply a central limit theorem that assumes independence. Instead, Steck (1957) and Morris (1975) use conditional central limit theory for independent Poisson variables conditioning on their sums to obtain asymptotic normality for multinomial distributions. More specifically, to avoid the dependency issue, we use the fact that the multinomial random vector $N_c$ has the same distribution as $(X_c \mid \sum_{i=1}^k X_{ci} = n_c)$, where the $X_{ci}$'s are independent Poisson random variables with means $\lambda_{ci} = n_c p_{ci}$. Before we present our main results, we first define the following notations:

$$f_i(x_{1i}, x_{2i}) = \underbrace{f^*(x_{1i}, x_{2i}) - (p_{1i} - p_{2i})^2}_{G_{1i}(x_{1i},\,x_{2i})}\ \underbrace{{}-2(p_{1i} - p_{2i})\left(\frac{x_{1i}}{n_1} - \frac{x_{2i}}{n_2}\right) + 2(p_{1i} - p_{2i})^2}_{G_{2i}(x_{1i},\,x_{2i})} \qquad (14)$$

$$F_k = \frac{\sum_{i=1}^{k} f_i(X_{1i}, X_{2i})}{\sigma_k} = \frac{\sum_{i=1}^{k} G_{1i}(X_{1i}, X_{2i})}{\sigma_k} + \frac{\sum_{i=1}^{k} G_{2i}(X_{1i}, X_{2i})}{\sigma_k} \qquad (15)$$

$$U_{ck} = \frac{1}{\sqrt{n_c}}\sum_{i=1}^{k}(X_{ci} - \lambda_{ci}) \quad \text{for } c = 1, 2. \qquad (16)$$

We will show that $(F_k, U_{1k}, U_{2k})^\top \overset{d}{\to} N(0, I_3)$, a trivariate normal distribution, where $I_3$ is the $3 \times 3$ identity matrix. Under the conditioning event $U_{1k} = U_{2k} = 0$ (equivalently $\sum_{i=1}^k X_{ci} = n_c$ for $c = 1, 2$), the conditional distribution of $F_k$ is the same as that of $\sum_{i=1}^k f_i(N_{1i}, N_{2i})/\sigma_k$. For the conditional asymptotic normality of $F_k$, we need the uniform equicontinuity required by the conditional central limit theorem stated in Theorem 2.1 of Steck (1957). For this uniform equicontinuity, we need to show that, for bounded values $|u_1| \le \delta$ and $|u_2| \le \delta$ for some $\delta > 0$, the conditional characteristic function of $F_k$ given $U_{1k}$ and $U_{2k}$ satisfies

$$\lim_{h \to 0}\,\sup_k\,\sup_{|u_1| \le \delta,\, |u_2| \le \delta}\Big|E\big(e^{itF_k} \,\big|\, U_{1k} = u_1 + h,\, U_{2k} = u_2 + h\big) - E\big(e^{itF_k} \,\big|\, U_{1k} = u_1,\, U_{2k} = u_2\big)\Big| = 0.$$

We will show the uniform equicontinuity of the characteristic function in Lemma 3.

From Theorem 2.1 in Steck (1957), the uniform equicontinuity of the characteristic function of $F_k$ implies the conditional asymptotic normality of $F_k$ given $U_{1k} = U_{2k} = 0$, i.e., $(F_k \mid U_{1k} = U_{2k} = 0) \overset{d}{\to} N(0,1)$.

The following Lemmas 2 and 3 will be used in showing the asymptotic multivariate normality of $(F_k, U_{1k}, U_{2k})$ and the uniform equicontinuity of the characteristic function of $F_k$ conditioning on $U_{1k}$ and $U_{2k}$.

In fact, the quantity in the uniform equicontinuity condition can be bounded as

$$\begin{aligned}
&\lim_{h \to 0}\sup_k\sup_{|u_1| \le \delta,\,|u_2| \le \delta}\Big|E\big(e^{itF_k} \,\big|\, U_{1k} = u_1 + h,\, U_{2k} = u_2 + h\big) - E\big(e^{itF_k} \,\big|\, U_{1k} = u_1,\, U_{2k} = u_2\big)\Big|\\
&\quad\le \lim_{h \to 0}\sup_k\sup_{|u_1| \le \delta,\,|u_2| \le \delta} E\left|\exp\left(\frac{it}{\sigma_k}\sum_{i=1}^{k}\big(f_i(L_{1i} + M_{1i}, L_{2i} + M_{2i}) - f_i(L_{1i}, L_{2i})\big)\right) - 1\right|\\
&\quad\le \lim_{h \to 0}\sup_k\sup_{|u_1| \le \delta,\,|u_2| \le \delta} \frac{|t|}{\sigma_k}\, E\left|\sum_{i=1}^{k}\big(f_i(L_{1i} + M_{1i}, L_{2i} + M_{2i}) - f_i(L_{1i}, L_{2i})\big)\right|\\
&\quad\le \lim_{h \to 0}\sup_k\sup_{|u_1| \le \delta,\,|u_2| \le \delta} \left(\frac{t^2}{\sigma_k^2}\, E\left(\sum_{i=1}^{k}\big(f_i(L_{1i} + M_{1i}, L_{2i} + M_{2i}) - f_i(L_{1i}, L_{2i})\big)\right)^2\right)^{1/2}
\end{aligned}$$

and it is sufficient to show that the last expression converges to 0.

###### Lemma 2.

When $X_{1i}$ and $X_{2i}$ for $i = 1,\dots,k$ are independent Poisson random variables with means $\lambda_{1i}$ and $\lambda_{2i}$, respectively, then

1. .

2. for .

3. .

###### Proof.

See Supplementary material. ∎

The following lemma ensures the uniform equicontinuity of the characteristic function of $F_k$, based on independent Poisson variables, conditioning on $U_{1k}$ and $U_{2k}$, which encode the multinomial constraints.

###### Lemma 3.

Suppose $N_1$ and $N_2$ are independent multinomial vectors with totals $n_1$ and $n_2$, and $L_{ci}$ and $M_{ci}$ are such that $L_{ci}$ and $L_{ci} + M_{ci}$ are nonnegative integers and $\sum_{i=1}^k M_{ci}$ is bounded (say by some constant $C$) as $\min(n_1, n_2) \to \infty$. Under the conditions in Theorem 1, for any fixed $t$, we have

$$\lim_{h \to 0}\sup_k\sup_{|u_1| \le \delta,\,|u_2| \le \delta}\frac{1}{\sigma_k^2}E\left[\left(\sum_{i=1}^{k} f_i(L_{1i} + M_{1i}, L_{2i} + M_{2i}) - f_i(L_{1i}, L_{2i})\right)^2\right] = 0. \qquad (17)$$
###### Proof.

See Supplementary material. ∎

The following lemma shows that $F_k$ has asymptotic normality when the $X_{ci}$'s are independent Poisson random variables.

###### Lemma 4.

When $X_{1i}$ and $X_{2i}$ for $i = 1,\dots,k$ are independent Poisson random variables with means $\lambda_{1i}$ and $\lambda_{2i}$, respectively, then

$$\frac{\sum_{i=1}^{k} f_i(X_{1i}, X_{2i})}{\sigma_k} \overset{d}{\to} N(0,1). \qquad (18)$$
###### Proof.

See the Supplementary material. ∎

Based on these lemmas, we prove Theorem 1. In fact, Theorem 1 is the case where the independent Poisson random variables $X_{ci}$ in Lemma 4 are replaced by the multinomial counts $N_{ci}$.

Proof of Theorem 1: Lemma 4 shows $F_k \overset{d}{\to} N(0,1)$. We also have $U_{ck} \overset{d}{\to} N(0,1)$ for $c = 1, 2$ from Lyapunov's condition, using Condition 3 in Theorem 1. Using Lemma 2 and the independence of $X_1$ and $X_2$, we have that $F_k$, $U_{1k}$ and $U_{2k}$ are uncorrelated with each other. Therefore, using Lemma 2.1 in Morris (1975), we have trivariate asymptotic normality, i.e., $(F_k, U_{1k}, U_{2k})^\top \overset{d}{\to} N(0, I_3)$, where $I_3$ is the $3 \times 3$ identity matrix. Lemma 3 shows the uniform equicontinuity of the conditional characteristic function of $F_k$ given $U_{1k}$ and $U_{2k}$, so we have the conditional asymptotic normality of $F_k$ given $U_{1k} = U_{2k} = 0$; in other words,

$$(F_k \mid U_{1k} = U_{2k} = 0) \overset{d}{=} \frac{\sum_{i=1}^{k} f_i(N_{1i}, N_{2i})}{\sigma_k} \overset{d}{\to} N(0,1). \qquad (19)$$

From (15), conditioning on $U_{1k} = U_{2k} = 0$, we have $F_k = \sum_{i=1}^k G_{1i}(N_{1i}, N_{2i})/\sigma_k + \sum_{i=1}^k G_{2i}(N_{1i}, N_{2i})/\sigma_k$. From (19), we only need to show that $\sum_{i=1}^k G_{2i}(N_{1i}, N_{2i})/\sigma_k \to 0$ in probability to have the asymptotic normality of $\sum_{i=1}^k G_{1i}(N_{1i}, N_{2i})/\sigma_k$. For this, it is enough to show $\mathrm{Var}(\sum_{i=1}^k G_{2i}(N_{1i}, N_{2i}))/\sigma_k^2 \to 0$, since $E(G_{2i}(N_{1i}, N_{2i})) = 0$. Using $\mathrm{Var}(N_{ci}) = n_c p_{ci}(1 - p_{ci})$ and $\mathrm{Cov}(N_{ci}, N_{cj}) = -n_c p_{ci} p_{cj}$ for $i \neq j$, we have

$$\begin{aligned}
\mathrm{Var}\left(\sum_{i=1}^{k} G_{2i}(N_{1i}, N_{2i})\right) &= 4\left[\sum_{i=1}^{k}\xi_i^2\left(\frac{p_{1i}(1 - p_{1i})}{n_1} + \frac{p_{2i}(1 - p_{2i})}{n_2}\right) - \sum_{i \neq j}\xi_i\xi_j\left(\frac{p_{1i}p_{1j}}{n_1} + \frac{p_{2i}p_{2j}}{n_2}\right)\right]\\
&= 4\left[\sum_{i=1}^{k}\xi_i^2\left(\frac{p_{1i}}{n_1} + \frac{p_{2i}}{n_2}\right) - \sum_{c=1}^{2}\frac{1}{n_c}\left(\sum_{i=1}^{k}\xi_i p_{ci}\right)^2\right] \le 4\sum_{i=1}^{k}\xi_i^2\left(\frac{p_{1i}}{n_1} + \frac{p_{2i}}{n_2}\right)
\end{aligned}$$