# A test for k sample Behrens-Fisher problem in high dimensional data

In this paper, the $k$ sample Behrens-Fisher problem is investigated in the high dimensional setting. We propose a new test statistic and show that the proposed test is expected to have more power than some existing tests, especially when sample sizes are unbalanced. We provide a theoretical investigation as well as numerical studies of both the sizes and the powers of the proposed tests and an existing test. Both the theoretical comparison of the asymptotic power functions and the numerical studies show that the proposed test tends to have more power than the existing test in many cases of unbalanced sample sizes.


## 1 Introduction

In many contemporary applications, such as molecular biology, genomics, fMRI, finance and transcriptomics, high and ultrahigh dimensional data are increasingly available. A common feature of high and ultrahigh dimensional data is that the data dimension is larger or much larger than the sample size, the so-called “large $p$, small $n$” phenomenon, where $p$ is the data dimension and $n$ is the sample size. In high dimensional settings, classical methods may be invalid or not applicable at all. Hence, there has been growing interest in developing testing procedures that are better suited to statistical problems in the high dimensional setting. Hypothesis testing is one of the important issues in high dimensional data and has attracted a great deal of attention in recent decades. For two sample testing in high dimension, there have been numerous studies such as Bai and Saranadasa (1996), Srivastava et al. (2008, 2009, 2013), Chen and Qin (2010), Aoshima and Yata (2011), Park and Ayyala (2013), Feng et al. (2015), Zhou and Kong (2015), Ma et al. (2015), Ghosh and Biswas (2016) and Zhao and Xu (2016). For multivariate analysis of variance (MANOVA), see Fujikoshi et al. (2004), Schott (2007), Srivastava et al. (2007), Cai and Xia (2014) and Cao and Xu (2015). More specifically, when there are $k$ groups and $\mathbf{X}_{l1},\dots,\mathbf{X}_{ln_l}$ represent random samples from the $l$th group with unknown mean vector $\boldsymbol{\mu}_l$ and positive definite covariance matrix $\boldsymbol{\Sigma}_l$ for $l=1,\dots,k$, it is of interest to test

$$H_0: \boldsymbol{\mu}_1=\cdots=\boldsymbol{\mu}_k \quad \text{versus} \quad H_1: H_0 \text{ is not true}. \tag{1}$$

In particular, when all covariance matrices are homogeneous, that is $\boldsymbol{\Sigma}_1=\cdots=\boldsymbol{\Sigma}_k$, testing (1) is known as MANOVA. On the other hand, Hu et al. (2015) and Cao (2014) recently proposed the same test statistic to test (1) when the covariance matrices are not necessarily homogeneous. This is also known as the $k$ sample Behrens-Fisher (BF) problem, which does not require $\boldsymbol{\Sigma}_1=\cdots=\boldsymbol{\Sigma}_k$. The homogeneity of covariance matrices is a strong condition in practice. In fact, it is not straightforward to verify the homogeneity of covariance matrices, especially in high dimensional data. Therefore, unless there is strong evidence supporting the homogeneity of covariance matrices, it is natural to allow different covariance matrices in practice.

The main goal of this paper is to propose a new test statistic for the $k$ sample Behrens-Fisher problem. It will be shown that the proposed test behaves differently from an existing test, that of Hu et al. (2015), when sample sizes are unbalanced. We discuss the differences between the proposed test and the test in Hu et al. (2015) through both theoretical and numerical comparisons under a variety of situations. These comparisons indicate that the proposed test has an advantage in power over Hu et al. (2015) in many situations.

The remainder of the paper is organized as follows. Section 2 presents the conditions of the statistical model, defines the notation used throughout the paper and states the assumptions for the theoretical study. In Section 3, we give the new test statistic and investigate its asymptotic behavior under $H_0$ and $H_1$. Theoretical comparisons and numerical studies of the proposed test and Hu’s test are carried out in Section 4. Concluding remarks are presented in Section 5.

## 2 Preliminaries

In this section, we give the notation and the statistical model for the $k$ sample BF problem. Some assumptions are also stated.

### 2.1 Notations

The following notations will be used in subsequent exposition. All vectors are column vectors and the superscript $T$ denotes the transpose. All vectors and matrices are bold-faced. For two sequences of real numbers $\{a_n\}$ and $\{b_n\}$, we write $a_n=O(b_n)$ if there exists a constant $C$ such that $|a_n|\le C|b_n|$ holds for all sufficiently large $n$, and write $a_n=o(b_n)$ if $\lim_{n\to\infty}a_n/b_n=0$. For a random sequence $\{Y_n\}$ and a random variable $Y$, $Y_n\xrightarrow{pr}Y$ and $Y_n\xrightarrow{d}Y$ denote that $Y_n$ converges to $Y$ in probability and in distribution, respectively, as $n\to\infty$.

Let and be the sample mean vector and sample covariance matrix from the th group for . Let be the pooled sample mean vector which is . If we define and where is the integer part of for , then , and , stand for the sample mean vectors and covariances matrices of the first samples and the rest samples, respectively. We also define the pooled sample covariance denoted by

$$\mathbf{E}_1=\frac{1}{n-k}\sum_{l=1}^{k}\sum_{i=1}^{n_l}(\mathbf{X}_{li}-\bar{\mathbf{X}}_l)(\mathbf{X}_{li}-\bar{\mathbf{X}}_l)^{T} \tag{2}$$

and

$$\mathbf{E}_2=\frac{1}{k-1}\sum_{l=1}^{k}n_l(\bar{\mathbf{X}}_l-\bar{\mathbf{X}})(\bar{\mathbf{X}}_l-\bar{\mathbf{X}})^{T}. \tag{3}$$
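As a concrete reading of (2) and (3), the following Python sketch computes $\mathbf{E}_1$ and $\mathbf{E}_2$ from a list of group samples. It is only an illustration: the function name `pooled_matrices`, the use of NumPy, and the choice of the size-weighted average of the group means as the pooled sample mean $\bar{\mathbf{X}}$ are assumptions made here and are not taken from the paper.

```python
import numpy as np

def pooled_matrices(samples):
    """Return (E1, E2) as in (2) and (3).

    samples: list of k arrays, the l-th of shape (n_l, p).
    Assumption: the pooled mean is the size-weighted average of group means.
    """
    k = len(samples)
    sizes = [x.shape[0] for x in samples]
    n = sum(sizes)
    p = samples[0].shape[1]
    means = [x.mean(axis=0) for x in samples]
    pooled_mean = sum(n_l * m for n_l, m in zip(sizes, means)) / n

    E1 = np.zeros((p, p))
    E2 = np.zeros((p, p))
    for x, m, n_l in zip(samples, means, sizes):
        centered = x - m                    # rows X_li - mean of group l
        E1 += centered.T @ centered         # within-group scatter
        dev = m - pooled_mean
        E2 += n_l * np.outer(dev, dev)      # between-group scatter
    return E1 / (n - k), E2 / (k - 1)
```

With this reading, $\mathbf{E}_1$ is the usual pooled covariance, that is, the average of the group sample covariances weighted by $n_l-1$.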

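Finally, we define $\tilde{\boldsymbol{\mu}}=n^{-1}\sum_{l=1}^{k}n_l\boldsymbol{\mu}_l$ and $\bar{\boldsymbol{\mu}}=k^{-1}\sum_{l=1}^{k}\boldsymbol{\mu}_l$ as the weighted mean vector and the average mean vector of the population means $\boldsymbol{\mu}_1,\dots,\boldsymbol{\mu}_k$, respectively.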

### 2.2 Model

We assume that the random samples $\mathbf{X}_{li}$’s are generated from a factor model in multivariate analysis, which is commonly used in many existing studies, for example, Bai and Saranadasa (1996), Chen and Qin (2010) and Hu et al. (2015). More formally, some moment conditions on the distributions of the random samples $\mathbf{X}_{li}$ are imposed as follows: for every $l=1,\dots,k$ and $i=1,\dots,n_l$, we consider

$$\mathbf{X}_{li}=\boldsymbol{\Gamma}_l\mathbf{Z}_{li}+\boldsymbol{\mu}_l \tag{4}$$

where is a matrix for some such that and are variate independent and identically distributed (i.i.d.) random vectors with and . Moreover, we assume and ’s are independent for all ; and , where .

### 2.3 Assumptions

We first state the main conditions which will be used in the proof of asymptotic results of our proposed test. The three conditions, (A1), (A2) and (A3) are as follows:

• for .

• for as and .

• for as .

(A1) implies that all sample sizes have the same increasing rate except constant terms. (A2) is used in a local alternative for the power function of the proposed test and it is actually an extension of (3.3) in Chen and Qin (2010) to the case of multi-groups (). Similarly, (A3) can be seen as an extension of the condition (3.6) in Chen and Qin (2010) to the case of multi-groups.

## 3 Main results

In this section we present a new proposed test statistic and its asymptotic properties under the conditions (A1)–(A3).

### 3.1 The proposed test statistic

Our proposed test is motivated by Schott (2007) and the “leave-one-out” idea of Chen and Qin (2010). Schott (2007) tested the hypothesis (1) under MANOVA based on

$$T_S:=\operatorname{tr}(\mathbf{E}_2)-\operatorname{tr}(\mathbf{E}_1), \tag{5}$$

where $\mathbf{E}_1$ and $\mathbf{E}_2$ are defined in (2) and (3). The asymptotic normality of $T_S$ was derived in Schott (2007), hence a test statistic was formulated by standardizing $T_S$ with an asymptotically ratio-consistent estimator of its standard deviation. The main assumptions in Schott (2007) are as follows:

• The random samples ’s come from normal model for and .

• .

• for or .

With (A4), the asymptotic results in Schott (2007) were derived under MANOVA which is the case of homogeneous covariance matrices under multivariate normality of data. (A5) means that the sample dimension and sample size have the same order and the total number of samples should be larger than the dimension . However, our proposed test and Hu et al. (2015) need some implicit relationship between and through the condition (A3) rather than explicit restriction on and as in (A5). Under , (A6) is a stronger condition than (A3) since (A3) is showing that (A6) implies (A3) through as . Thus, considering all these, it is clear that (A4)-(A6) are stronger conditions than (A1)-(A3).

We modify $T_S$ in (5) by removing the terms $\mathbf{X}_{li}^{T}\mathbf{X}_{li}$, as is also done in Chen and Qin (2010), and obtain a test statistic denoted by $T$ as follows:

$$T:=\sum_{l=1}^{k}\frac{n-n_l}{n(n_l-1)}\sum_{i\neq j}^{n_l}\mathbf{X}_{li}^{T}\mathbf{X}_{lj}-\sum_{l\neq s}^{k}\frac{n_l n_s}{n}\bar{\mathbf{X}}_l^{T}\bar{\mathbf{X}}_s. \tag{6}$$
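A minimal sketch of how $T$ in (6) might be computed is given below. The function names are hypothetical. The identity $\sum_{i\neq j}\mathbf{x}_i^{T}\mathbf{x}_j=\|\sum_i\mathbf{x}_i\|^2-\sum_i\|\mathbf{x}_i\|^2$ is used to avoid a double loop. The helper `chen_qin_two_sample` implements the two sample statistic of Chen and Qin (2010) in its usual form and is included only so that the $k=2$ case can be checked numerically.

```python
import numpy as np

def cross_sum(x):
    """Sum of x_i^T x_j over all ordered pairs i != j of rows of x."""
    total = x.sum(axis=0)
    return float(total @ total - np.sum(x * x))

def statistic_T(samples):
    """Statistic T in (6) for a list of (n_l, p) arrays."""
    sizes = np.array([x.shape[0] for x in samples], dtype=float)
    n = sizes.sum()
    within = sum((n - n_l) / (n * (n_l - 1)) * cross_sum(x)
                 for x, n_l in zip(samples, sizes))
    group_sums = np.array([x.sum(axis=0) for x in samples])  # n_l * mean_l
    total = group_sums.sum(axis=0)
    # sum over l != s of n_l * n_s * mean_l^T mean_s
    between = float(total @ total - np.sum(group_sums * group_sums))
    return within - between / n

def chen_qin_two_sample(x1, x2):
    """Two sample statistic of Chen and Qin (2010), used here as a check."""
    n1, n2 = x1.shape[0], x2.shape[0]
    return (cross_sum(x1) / (n1 * (n1 - 1))
            + cross_sum(x2) / (n2 * (n2 - 1))
            - 2.0 * float(x1.mean(axis=0) @ x2.mean(axis=0)))
```

For two groups, `statistic_T([x1, x2])` should agree with `n1 * n2 / (n1 + n2) * chen_qin_two_sample(x1, x2)`, which is what a direct comparison of the two formulas gives.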

It is worth pointing out that, for the two sample BF problem, the statistic $T$ is the same as that of Chen and Qin (2010) except for a constant factor $n_1 n_2/n$. Elementary derivation shows

$$E(T)=\sum_{l=1}^{k}n_l(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})^{T}(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}}) \tag{7}$$

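where $\tilde{\boldsymbol{\mu}}=n^{-1}\sum_{l=1}^{k}n_l\boldsymbol{\mu}_l$ is the weighted mean vector. In Hu et al. (2015), the test statistic is based on a statistic, say $T_{CH}$, whose expected value is $k\sum_{l=1}^{k}(\boldsymbol{\mu}_l-\bar{\boldsymbol{\mu}})^{T}(\boldsymbol{\mu}_l-\bar{\boldsymbol{\mu}})$, where $\bar{\boldsymbol{\mu}}=k^{-1}\sum_{l=1}^{k}\boldsymbol{\mu}_l$. The deviation of $\boldsymbol{\mu}_l$ from $\tilde{\boldsymbol{\mu}}$ in (7) is weighted by the corresponding sample size $n_l$, which emphasizes the deviations of populations with large sample sizes. On the other hand, $T_{CH}$ in Hu et al. (2015) gives equal weight to the deviations of $\boldsymbol{\mu}_l$ from the overall mean $\bar{\boldsymbol{\mu}}$. This difference leads to different asymptotic powers of the test statistics based on $T$ and $T_{CH}$.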
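To make the difference in weighting concrete, the sketch below evaluates the two expected values for given population means and sample sizes. The function names `signal_T` and `signal_CH` are hypothetical. The second expression, $k\sum_l\|\boldsymbol{\mu}_l-\bar{\boldsymbol{\mu}}\|^2$, is what one obtains by taking the expectation of the statistic of Cao (2014) and Hu et al. (2015) as displayed in Section 4.

```python
import numpy as np

def signal_T(mus, sizes):
    """E(T) in (7): deviations from the size-weighted mean, weighted by n_l."""
    mus = np.asarray(mus, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    weighted_mean = sizes @ mus / sizes.sum()
    dev = mus - weighted_mean
    return float(np.sum(sizes * np.sum(dev ** 2, axis=1)))

def signal_CH(mus):
    """k * sum_l ||mu_l - average mean||^2: equal weight for every group."""
    mus = np.asarray(mus, dtype=float)
    dev = mus - mus.mean(axis=0)
    return float(mus.shape[0] * np.sum(dev ** 2))
```

For example, with means `[[0, 0], [0, 0], [1, 1]]` the second quantity does not depend on the sample sizes, while the first changes with the allocation `sizes`.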

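We now propose a test statistic based on $T$ in (6). It can be shown that the variance of $T$ is

$$\begin{aligned}\operatorname{Var}(T)&=\frac{2}{n^2}\left\{\sum_{l=1}^{k}\frac{n_l(n-n_l)^2}{n_l-1}\operatorname{tr}(\boldsymbol{\Sigma}_l^2)+\sum_{l\neq s}^{k}n_l n_s\operatorname{tr}(\boldsymbol{\Sigma}_l\boldsymbol{\Sigma}_s)\right\}+4\sum_{l=1}^{k}n_l(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})^{T}\boldsymbol{\Sigma}_l(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})\\&=\sigma_T^2+4\sum_{l=1}^{k}n_l(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})^{T}\boldsymbol{\Sigma}_l(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})\end{aligned} \tag{8}$$

where

$$\sigma_T^2:=\frac{2}{n^2}\left\{\sum_{l=1}^{k}\frac{n_l(n-n_l)^2}{n_l-1}\operatorname{tr}(\boldsymbol{\Sigma}_l^2)+\sum_{l\neq s}^{k}n_l n_s\operatorname{tr}(\boldsymbol{\Sigma}_l\boldsymbol{\Sigma}_s)\right\}.$$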

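Note that $\operatorname{Var}(T)=\sigma_T^2$ under $H_0$. From (A1) and (A2), we have

$$(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})^{T}\boldsymbol{\Sigma}_l(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})=o\left\{n^{-1}\operatorname{tr}\Bigl(\sum_{l=1}^{k}\boldsymbol{\Sigma}_l\Bigr)^2\right\} \tag{9}$$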

for and by combining and (9), we obtain

$$\operatorname{Var}(T)=\sigma_T^2\{1+o(1)\}.$$
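For known population covariance matrices, the null variance $\sigma_T^2$ can be evaluated directly from its definition. The following sketch does this with NumPy. The function name `sigma2_T` and its argument layout are illustrative choices.

```python
import numpy as np

def sigma2_T(covs, sizes):
    """Null variance sigma_T^2 of T for covariance matrices covs and sizes n_l."""
    n = float(sum(sizes))
    k = len(covs)
    # sum_l n_l (n - n_l)^2 / (n_l - 1) * tr(Sigma_l^2)
    own = sum(n_l * (n - n_l) ** 2 / (n_l - 1) * np.trace(S @ S)
              for S, n_l in zip(covs, sizes))
    # sum over l != s of n_l n_s tr(Sigma_l Sigma_s)
    cross = sum(sizes[l] * sizes[s] * np.trace(covs[l] @ covs[s])
                for l in range(k) for s in range(k) if l != s)
    return 2.0 / n ** 2 * (own + cross)
```

As a small check, two groups of size 5 with identity covariance in dimension 3 give $\sigma_T^2=6.75$.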

In order to formulate a test procedure, we need an asymptotically ratio-consistent estimator of $\sigma_T^2$. Many different estimators have been proposed in existing studies. We adopt two different estimators, which are stated in the following two lemmas.

The first one is based on Aoshima and Yata (2011) which is given in Lemma 3.1. It should be noted that the requirements for obtaining asymptotically ratio-consistent estimator of in Aoshima and Yata (2011) are different from (A1)-(A3). Our assumption on ’s in (A3) is weaker than those assumptions (A-iv and A-v) in Aoshima and Yata (2011).

###### Lemma 3.1.

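Suppose we have the following estimator of $\sigma_T^2$:

$$\hat{\sigma}_T^2:=\frac{2}{n^2}\left\{\sum_{l=1}^{k}\frac{n_l(n-n_l)^2}{n_l-1}\widehat{\operatorname{tr}}(\boldsymbol{\Sigma}_l^2)+\sum_{l\neq s}^{k}n_l n_s\widehat{\operatorname{tr}}(\boldsymbol{\Sigma}_l\boldsymbol{\Sigma}_s)\right\},$$

then we have the ratio consistency of $\hat{\sigma}_T^2$, i.e.,

$$\hat{\sigma}_T/\sigma_T\xrightarrow{pr}1,$$

where $\widehat{\operatorname{tr}}(\boldsymbol{\Sigma}_l^2)$ and $\widehat{\operatorname{tr}}(\boldsymbol{\Sigma}_l\boldsymbol{\Sigma}_s)$ are asymptotically ratio-consistent estimators of $\operatorname{tr}(\boldsymbol{\Sigma}_l^2)$ and $\operatorname{tr}(\boldsymbol{\Sigma}_l\boldsymbol{\Sigma}_s)$, respectively, for $l,s=1,\dots,k$ with $l\neq s$.

Proof. See Appendix.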

The other estimator of $\sigma_T^2$ is the one used in Bai and Saranadasa (1996) and Hu et al. (2015), which is stated in the following lemma.

###### Lemma 3.2.

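(Hu et al. (2015)) Suppose

$$\tilde{\sigma}_T^2:=\frac{2}{n^2}\left\{\sum_{l=1}^{k}\frac{n_l(n-n_l)^2}{n_l-1}\widetilde{\operatorname{tr}}(\boldsymbol{\Sigma}_l^2)+\sum_{l\neq s}^{k}n_l n_s\widehat{\operatorname{tr}}(\boldsymbol{\Sigma}_l\boldsymbol{\Sigma}_s)\right\},$$

then

$$\tilde{\sigma}_T/\sigma_T\xrightarrow{pr}1,$$

where

$$\widetilde{\operatorname{tr}}(\boldsymbol{\Sigma}_l^2)=\frac{(n_l-1)^2}{(n_l+1)(n_l-2)}\left\{\operatorname{tr}(\mathbf{S}_{ln_l}^2)-\frac{1}{n_l-1}\operatorname{tr}^2(\mathbf{S}_{ln_l})\right\}.$$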
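The estimator $\widetilde{\operatorname{tr}}(\boldsymbol{\Sigma}_l^2)$ is simple to compute from one group's data. The sketch below assumes that $\mathbf{S}_{ln_l}$ is the usual sample covariance matrix with divisor $n_l-1$, which is an interpretation and not something the text states explicitly. The function names are hypothetical, and `tr_sigma2_naive` is included only for comparison.

```python
import numpy as np

def tr_sigma2_tilde(x):
    """Bias-corrected estimator of tr(Sigma^2) from an (n_l, p) sample x.

    Assumption: S is the sample covariance with divisor n_l - 1.
    """
    n_l = x.shape[0]
    S = np.cov(x, rowvar=False)
    factor = (n_l - 1) ** 2 / ((n_l + 1) * (n_l - 2))
    return float(factor * (np.trace(S @ S) - np.trace(S) ** 2 / (n_l - 1)))

def tr_sigma2_naive(x):
    """Plug-in estimator tr(S^2), which tends to overestimate tr(Sigma^2)."""
    S = np.cov(x, rowvar=False)
    return float(np.trace(S @ S))
```

For normal data, standard Wishart moment formulas imply that the corrected estimator is unbiased for $\operatorname{tr}(\boldsymbol{\Sigma}_l^2)$, whereas the plug-in version has an upward bias that grows with the dimension relative to $n_l$.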

On the basis of Lemmas 3.1 and 3.2, we propose two test statistics:

$$\hat{T}_1:=\frac{T}{\hat{\sigma}_T}\quad\text{and}\quad\hat{T}_2:=\frac{T}{\tilde{\sigma}_T}. \tag{10}$$

In the following section, we prove the asymptotic normality of the proposed tests in (10) and their asymptotic power functions.

### 3.2 Asymptotic distributions of the proposed test statistic

The following theorems establish the asymptotic normality of the new test statistics in (10) under $H_0$ and their power function under $H_1$ when the data dimension and the sample size increase to infinity.

###### Theorem 3.1.

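Under (A1), (A3) and $H_0$, as $n,p\to\infty$, $P(\hat{T}\geq\xi_\alpha)\to\alpha$, where $\xi_\alpha$ is the upper $\alpha$ quantile of the standard normal distribution and $\hat{T}$ is either $\hat{T}_1$ or $\hat{T}_2$ in (10).

Proof. See Appendix.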

The following theorem shows the asymptotic power function of the proposed test.

###### Theorem 3.2.

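Under (A1)-(A3), as $n,p\to\infty$, the asymptotic power function of $\hat{T}$ ($\hat{T}_1$ or $\hat{T}_2$) is

$$P(\hat{T}\geq\xi_\alpha)=\Phi\left\{-\xi_\alpha+\frac{\frac{\sqrt{2}}{2}\,n\sum_{l=1}^{k}\lambda_l(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})^{T}(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})}{\sqrt{\sum_{l=1}^{k}(1-\lambda_l)^2\operatorname{tr}(\boldsymbol{\Sigma}_l^2)+\sum_{l\neq s}^{k}\lambda_l\lambda_s\operatorname{tr}(\boldsymbol{\Sigma}_l\boldsymbol{\Sigma}_s)}}\right\}+o(1), \tag{11}$$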

where $\lambda_l$ denotes the sample size proportion $n_l/n$ of the $l$th group and $\Phi$ is the standard normal cumulative distribution function.

Proof See Appendix.
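The leading term of the power function (11) can be evaluated numerically once the means, covariance matrices and sample size proportions are specified. The sketch below is one way to do so, assuming $\lambda_l=n_l/n$ so that $\tilde{\boldsymbol{\mu}}=\sum_l\lambda_l\boldsymbol{\mu}_l$. The function name `asymptotic_power` is hypothetical, and the $o(1)$ term is ignored.

```python
import numpy as np
from statistics import NormalDist

def asymptotic_power(mus, covs, lambdas, n, alpha=0.05):
    """Leading term of (11) for means mus (k x p), covariances covs, proportions lambdas."""
    std_normal = NormalDist()
    xi_alpha = std_normal.inv_cdf(1.0 - alpha)      # upper alpha quantile
    mus = np.asarray(mus, dtype=float)
    lam = np.asarray(lambdas, dtype=float)
    k = len(covs)
    mu_tilde = lam @ mus                            # weighted mean, assuming lambda_l = n_l / n
    signal = n * np.sum(lam * np.sum((mus - mu_tilde) ** 2, axis=1))
    noise = sum((1 - lam[l]) ** 2 * np.trace(covs[l] @ covs[l]) for l in range(k))
    noise += sum(lam[l] * lam[s] * np.trace(covs[l] @ covs[s])
                 for l in range(k) for s in range(k) if l != s)
    # (sqrt(2) / 2) * signal / sqrt(noise) equals signal / sqrt(2 * noise)
    return std_normal.cdf(-xi_alpha + float(signal) / float(np.sqrt(2.0 * noise)))
```

When all mean vectors coincide the signal term vanishes and the function returns $\alpha$, in line with the size statement for the null hypothesis.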

Since MANOVA is a special case of the $k$ sample BF problem, we have the following two corollaries, which are immediate results of Theorems 3.1 and 3.2.

###### Corollary 3.1.

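If $\boldsymbol{\Sigma}_1=\cdots=\boldsymbol{\Sigma}_k$ and assumptions (A1) and (A3) hold, then under the null hypothesis $H_0$, as $n,p\to\infty$, we get $P(\hat{T}\geq\xi_\alpha)\to\alpha$.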

###### Corollary 3.2.

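Suppose $\boldsymbol{\Sigma}_1=\cdots=\boldsymbol{\Sigma}_k$ and assumptions (A1) and (A3) hold. Under the local alternative (A2), as $n,p\to\infty$, we have

$$P(\hat{T}\geq\xi_\alpha)=\Phi\left\{-\xi_\alpha+\frac{n\sum_{l=1}^{k}\lambda_l(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})^{T}(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})}{\sqrt{2(k-1)\operatorname{tr}(\boldsymbol{\Sigma}_1^2)}}\right\}+o(1).$$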
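The step from (11) to Corollary 3.2 rests on the identity $\sum_{l}(1-\lambda_l)^2+\sum_{l\neq s}\lambda_l\lambda_s=k-1$, which holds whenever $\sum_l\lambda_l=1$. The following sketch checks this numerically for a common covariance matrix. The function names are hypothetical.

```python
import numpy as np

def noise_general(lambdas, covs):
    """Term under the square root in (11)."""
    lam = np.asarray(lambdas, dtype=float)
    k = len(covs)
    own = sum((1 - lam[l]) ** 2 * np.trace(covs[l] @ covs[l]) for l in range(k))
    cross = sum(lam[l] * lam[s] * np.trace(covs[l] @ covs[s])
                for l in range(k) for s in range(k) if l != s)
    return float(own + cross)

def noise_homogeneous(k, cov):
    """Simplified term (k - 1) tr(Sigma^2) when all covariances equal cov."""
    return float((k - 1) * np.trace(cov @ cov))
```

Multiplying the simplified term by 2 and taking the square root gives the denominator $\sqrt{2(k-1)\operatorname{tr}(\boldsymbol{\Sigma}_1^2)}$ of Corollary 3.2.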

## 4 Theoretical comparisons and simulations

In this section, we provide theoretical comparisons between the proposed test and an existing test. For the $k$ sample BF problem, Cao (2014) and Hu et al. (2015) construct test statistics via the same statistic

$$T_{CH}=(k-1)\sum_{l=1}^{k}\sum_{i\neq j}^{n_l}\frac{\mathbf{X}_{li}^{T}\mathbf{X}_{lj}}{n_l(n_l-1)}-\sum_{l\neq s}^{k}\bar{\mathbf{X}}_l^{T}\bar{\mathbf{X}}_s,$$

which is an extension of the two sample test in Chen and Qin (2010) to the case of $k$ samples. Depending on the estimator of the variance of $T_{CH}$, different test statistics have been proposed. Cao (2014) used two different estimators of the variance of $T_{CH}$. One is similar to that in Chen and Qin (2010) and the other is the same as that in Lemma 3.1. On the other hand, Hu et al. (2015) used an estimator similar to that in Bai and Saranadasa (1996). Under assumptions similar to (A1)-(A3), Cao (2014) and Hu et al. (2015) obtained the same asymptotic distribution for their test statistics, say $\hat{T}_{CH}$, which standardize $T_{CH}$ by the estimators of its variance considered in Cao (2014) and Hu et al. (2015), as follows:

$$P(\hat{T}_{CH}\geq\xi_\alpha)=\Phi\left\{-\xi_\alpha+\frac{\frac{\sqrt{2}}{2}\,kn\sum_{l=1}^{k}(\boldsymbol{\mu}_l-\bar{\boldsymbol{\mu}})^{T}(\boldsymbol{\mu}_l-\bar{\boldsymbol{\mu}})}{\sqrt{(k-1)^2\sum_{l=1}^{k}\lambda_l^{-2}\operatorname{tr}(\boldsymbol{\Sigma}_l^2)+\sum_{l\neq s}(\lambda_l\lambda_s)^{-1}\operatorname{tr}(\boldsymbol{\Sigma}_l\boldsymbol{\Sigma}_s)}}\right\}+o(1). \tag{12}$$

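Since all tests in Cao (2014) and Hu et al. (2015) have the same asymptotic distribution, we use the test statistic $\hat{T}_H:=T_{CH}/\tilde{\sigma}$ of Hu et al. (2015), where

$$\tilde{\sigma}^2=2(k-1)^2\sum_{l=1}^{k}\frac{\widetilde{\operatorname{tr}}(\boldsymbol{\Sigma}_l^2)}{n_l(n_l-1)}+\sum_{l\neq s}^{k}\frac{2\,\widehat{\operatorname{tr}}(\boldsymbol{\Sigma}_l\boldsymbol{\Sigma}_s)}{n_l n_s}.$$

We provide numerical studies and theoretical comparisons between our proposed test statistic $\hat{T}$ in (10) and $\hat{T}_H$ in the following sections.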

### 4.1 Theoretical comparisons

We first compare the power functions of the proposed test $\hat{T}$ and $\hat{T}_H$ when all sample sizes are the same, where $\hat{T}$ is either $\hat{T}_1$ or $\hat{T}_2$ in (10). The following Corollary 4.1 states that $\hat{T}$ and $\hat{T}_H$ have the same asymptotic power under the balanced model. This can be shown directly from (11) and (12).

###### Corollary 4.1.

The test statistics $\hat{T}$ and $\hat{T}_H$ have the same asymptotic power under the balanced model, in which all groups have equal sample sizes.
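Corollary 4.1 can be checked numerically by comparing the terms added to $-\xi_\alpha$ inside $\Phi$ in (11) and (12). The sketch below implements both terms under the assumption $\lambda_l=n_l/n$. The function names `shift_proposed` and `shift_hu` are hypothetical.

```python
import numpy as np

def _pair_sum(weights, covs):
    """Sum over l != s of weights[l] * weights[s] * tr(Sigma_l Sigma_s)."""
    k = len(covs)
    return sum(weights[l] * weights[s] * np.trace(covs[l] @ covs[s])
               for l in range(k) for s in range(k) if l != s)

def shift_proposed(mus, covs, lambdas, n):
    """Term added to -xi_alpha in (11)."""
    mus = np.asarray(mus, dtype=float)
    lam = np.asarray(lambdas, dtype=float)
    mu_tilde = lam @ mus
    signal = n * np.sum(lam * np.sum((mus - mu_tilde) ** 2, axis=1))
    noise = sum((1 - l_) ** 2 * np.trace(S @ S) for l_, S in zip(lam, covs))
    noise += _pair_sum(lam, covs)
    return float(signal / np.sqrt(2.0 * noise))

def shift_hu(mus, covs, lambdas, n):
    """Term added to -xi_alpha in (12)."""
    mus = np.asarray(mus, dtype=float)
    lam = np.asarray(lambdas, dtype=float)
    k = mus.shape[0]
    signal = k * n * np.sum((mus - mus.mean(axis=0)) ** 2)
    noise = (k - 1) ** 2 * sum(np.trace(S @ S) / l_ ** 2 for l_, S in zip(lam, covs))
    noise += _pair_sum(1.0 / lam, covs)
    return float(signal / np.sqrt(2.0 * noise))
```

With equal proportions $\lambda_l=1/k$ the two functions return the same value even for unequal covariance matrices, while for unbalanced proportions they generally differ.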

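For more general cases such as unbalanced sample sizes, it is not easy to compare the asymptotic power functions of $\hat{T}$ and $\hat{T}_H$. We compare the test statistics under simple and typical situations so that the power functions can be compared analytically. To obtain a rough depiction, we assume $\boldsymbol{\Sigma}_1=\cdots=\boldsymbol{\Sigma}_k$ for the following cases. We define the asymptotic relative efficiency (ARE) of $\hat{T}$ to $\hat{T}_H$, which is the ratio of the two signal-to-noise ratios:

$$\begin{aligned}\operatorname{ARE}(\hat{T},\hat{T}_H)&:=\frac{E(\hat{T})}{\sqrt{\operatorname{Var}(\hat{T})}}\Bigg/\frac{E(\hat{T}_H)}{\sqrt{\operatorname{Var}(\hat{T}_H)}}\\&=\frac{n\sum_{l=1}^{k}\lambda_l(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})^{T}(\boldsymbol{\mu}_l-\tilde{\boldsymbol{\mu}})}{\sqrt{2(k-1)\operatorname{tr}(\boldsymbol{\Sigma}_1^2)}}\Bigg/\frac{kn\sum_{l=1}^{k}(\boldsymbol{\mu}_l-\bar{\boldsymbol{\mu}})^{T}(\boldsymbol{\mu}_l-\bar{\boldsymbol{\mu}})}{\sqrt{2\operatorname{tr}(\boldsymbol{\Sigma}_1^2)}\sqrt{(k-1)^2\sum_{l=1}^{k}\lambda_l^{-2}+\sum_{l\neq s}(\lambda_l\lambda_s)^{-1}}}.\end{aligned} \tag{13}$$

If $\operatorname{ARE}(\hat{T},\hat{T}_H)>1$, the asymptotic power of $\hat{T}$ is larger than that of $\hat{T}_H$ from (11) and (12).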

Based on the ARE (13), we consider the following two representative cases:

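1. $\boldsymbol{\mu}_1=\cdots=\boldsymbol{\mu}_{k-1}\neq\boldsymbol{\mu}_k$. Without loss of generality, we set $\boldsymbol{\mu}_1=\cdots=\boldsymbol{\mu}_{k-1}=\mathbf{0}$. Then, we have

$$\operatorname{ARE}(\hat{T},\hat{T}_H)=\frac{\lambda_k(1-\lambda_k)}{(k-1)\sqrt{k-1}}\sqrt{k(k-2)\sum_{l=1}^{k}\lambda_l^{-2}+\Bigl(\sum_{l=1}^{k}\lambda_l^{-1}\Bigr)^2}.$$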

From this, we see that is larger than 1 if there exists at least one such that is very small, for example is close to 0 for some . This is because the right hand side of is unbounded as is close to for at least one . Since , it indicates that if is close to 1, then most of s for are close to 0 which results in . Furthermore, we can get an another low bound of such as based on mean value and Jensen’s inequality (Mitrinović et al. (1993)). The low bound depends on only and it shows that if , then we have regardless of configurations of all other for . This shows that as the number of groups () increases, the interval is getting wider, so is expected to have more power than as the number of groups () increases.

2. As a second case, we assume all mean vectors have the same direction such that for and some constants . For simplicity, we consider and . For , without loss of generality, we can assume , and with . From and , we have

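$$\operatorname{ARE}(\hat{T},\hat{T}_H)=\frac{\tau^2\lambda_3^{-1}+(\tau-2)^2}{4\sqrt{2}(\tau^2-\tau+1)}\sqrt{9\lambda_3^2+1}.$$

For all $\tau$, the equation $\operatorname{ARE}(\hat{T},\hat{T}_H)=1$ has a fixed solution $\lambda_3=1/3$. In addition to this solution, there are more solutions and we provide approximate solutions by numerical studies as follows: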

• The case of : In this case, there are two solutions of which are and , where . Note that depends on . If , ; otherwise . This implies that has more powers than when . The left panel in Figure 1 shows this case.

• The case of : The solutions of are and with . The right panel in Figure 1 shows that reaches 1 very rapidly when and . We see that is significantly larger than 1 when while is slightly less than 1 when . This shows that has significantly larger powers than in most cases while can have slightly more powers;on the other hand, even when has more powers than , the difference is not that significant. This case is shown in the right panel in Figure 1.

• The case of : There is only one solution of the equation which is . When is less than , which means that has larger powers than those of . Moreover, we see that is increasing as decreases. This case is shown in the left panel in Figure 2.

• The case of : The equation has only one solution . When is more than , which illustrates has larger powers than those of . See the right panel in Figure 2 for this case.

To summarize, we see that the proposed test $\hat{T}$ has the potential to have more power than $\hat{T}_H$ when sample sizes are highly unbalanced, while $\hat{T}$ and $\hat{T}_H$ have the same asymptotic power for balanced sample sizes by Corollary 4.1. We provide numerical studies to demonstrate this point in the following section.
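The ARE in (13) and the two closed forms above can be explored numerically with the sketch below. Several points are assumptions made for illustration: the function names, the reading of the first case as one group mean differing from the others, and the parametrization of the second case as $k=3$, $\lambda_1=\lambda_2=(1-\lambda_3)/2$ with mean vectors proportional to $(0,\tau,1)$. This parametrization was chosen because it reproduces the displayed closed form. It is not stated in that form in the text.

```python
import numpy as np

def are_general(mus, lambdas):
    """ARE in (13) under a common covariance matrix (the trace term cancels)."""
    mus = np.asarray(mus, dtype=float)
    lam = np.asarray(lambdas, dtype=float)
    k = mus.shape[0]
    mu_tilde = lam @ mus
    num = np.sum(lam * np.sum((mus - mu_tilde) ** 2, axis=1)) / np.sqrt(k - 1)
    inv = 1.0 / lam
    pair = inv.sum() ** 2 - np.sum(inv ** 2)    # sum over l != s of 1 / (lam_l lam_s)
    den = k * np.sum((mus - mus.mean(axis=0)) ** 2)
    den /= np.sqrt((k - 1) ** 2 * np.sum(inv ** 2) + pair)
    return float(num / den)

def are_case1(lambdas):
    """Closed form when only the last group mean differs (assumed reading of case 1)."""
    lam = np.asarray(lambdas, dtype=float)
    k = lam.size
    inv = 1.0 / lam
    root = np.sqrt(k * (k - 2) * np.sum(inv ** 2) + inv.sum() ** 2)
    return float(lam[-1] * (1 - lam[-1]) / ((k - 1) * np.sqrt(k - 1)) * root)

def are_case2(tau, lam3):
    """Closed form for k = 3 with the assumed parametrization of case 2."""
    num = (tau ** 2 / lam3 + (tau - 2) ** 2) * np.sqrt(9 * lam3 ** 2 + 1)
    return float(num / (4 * np.sqrt(2) * (tau ** 2 - tau + 1)))
```

For instance, `are_case2(tau, 1/3)` equals 1 for any `tau`, which matches the fixed solution $\lambda_3=1/3$, and scanning `lam3` over a grid in $(0,1)$ gives a quick way to locate the other crossings of 1 for a given `tau`.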

### 4.2 Simulations

As shown in Corollary 4.1, $\hat{T}$ and $\hat{T}_H$ have the same asymptotic power function for balanced sample sizes. Therefore, we conduct simulations only for unbalanced sample sizes to compare $\hat{T}$ ($\hat{T}_1$ and $\hat{T}_2$) with $\hat{T}_H$. We set $k=3$ and generate $\mathbf{X}_{li}$ from the following two models.

• The first model: we consider “Two-dependence” moving average model

$$X_{lij}=\rho_{l1}Z_{lij}+\rho_{l2}Z_{l,i,j+1}+\rho_{l3}Z_{l,i,j+2}+\mu_{lj}$$

for , and , where ’s are i.i.d. random variables distributed with centered and , respectively. ’s and ’s are constants such that . Moreover, ’s were generated independently from with , , , , , , , and , and were kept fixed throughout the simulations. For power studies, population means are fixed as , while the third mean vector consists of components equal to and the others equal to zero where is related to the following standard parameter

 θ=3∑l=1(\boldmath{μ}l−¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯\boldmath{μ})\tiny{T}(\boldmath{μ}l−¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯\boldmath{μ})√43∑l=1λ−2ltr(\boldmath{Σ}2l)+3</