High-dimensional MANOVA via Bootstrapping and its Application to Functional and Sparse Count Data

07/02/2020
by   Zhenhua Lin, et al.

We propose a new approach to the problem of high-dimensional multivariate ANOVA via bootstrapping max statistics that involve the differences of sample mean vectors, constructing simultaneous confidence intervals for the differences of population mean vectors. The proposed method is suited to simultaneously testing the equality of several pairs of mean vectors from potentially more than two populations. By exploiting the variance decay property that is a natural feature in relevant applications, we are able to provide dimension-free and nearly-parametric convergence rates for the Gaussian approximation, the bootstrap approximation, and the size of the test. We demonstrate the proposed approach on ANOVA problems for functional data and sparse count data. The proposed methodology is shown to work well in simulations and several real data applications.


1 Introduction

The problem of detecting significant differences among the means of multivariate populations, known as MANOVA, is of central importance in a myriad of statistical applications. However, classical approaches to MANOVA are only intended to handle low-dimensional settings where the number of covariates is much smaller than the sample size, which is a crucial limitation in modern high-dimensional data analysis. As a result, the challenge of finding new approaches to MANOVA that can succeed in high dimensions has developed into a major line of research. For example, the special case of high-dimensional two-sample testing has been investigated by

Bai and Saranadasa (1996); Chen and Qin (2010); Lopes et al. (2011); Cai et al. (2014); Xu et al. (2016); Zhang and Pan (2016); Zhang et al. (2019) under the condition that the populations have a common covariance matrix, while the procedures designed by Feng and Sun (2015); Feng et al. (2015); Gregory et al. (2015); Städler and Mukherjee (2016); Chang et al. (2017); Xue and Yao (2020) do not require a common covariance. For the more general multiple-sample problem, methods and theory were studied by Fujikoshi et al. (2004); Srivastava and Fujikoshi (2006); Schott (2007); Yamada and Srivastava (2012); Srivastava and Kubokawa (2013); Cai and Xia (2014); Zhang et al. (2017); Bai et al. (2018); Li et al. (2018) when the populations share a common covariance structure, while Zhang and Xu (2009); Yamada and Himeno (2015); Li et al. (2017); Hu et al. (2017); Zhou et al. (2017); Zhang et al. (2018) eliminated the requirement of a common covariance. Among these works, Chang et al. (2017); Zhang et al. (2018); Xue and Yao (2020) take the bootstrap approach based on Chernozhukov et al. (2013) or Chernozhukov et al. (2017), in contrast with the others, which are based on asymptotic formulas.

It is not uncommon that the variances of the variables are on different scales and exhibit a certain decay pattern. As an example, consider a multinomial model and, without loss of generality, assume that the category probabilities are nonincreasing. As the probabilities sum to one, the probability of the jth category is at most 1/j, so the variance of the jth category count must decay at a rate of at least 1/j. Additional examples, also demonstrated in Lopes et al. (2020), include data for which principal component analysis is applicable and generalized Fourier coefficients of functional data. When there is such variance decay, Lopes et al. (2020) shows that the convergence rates of the Gaussian approximation and the bootstrap approximation to the maximum statistic formed from the partially standardized coordinates of the sample mean of independent and identically distributed random vectors are nearly parametric and free of the dimension. Remarkably, this rate remains valid even when the decay is very weak, with an arbitrarily small decay exponent. An intuitive explanation for this faster rate relative to Chernozhukov et al. (2013) is that, thanks to the decay, the maximum is unlikely to be realized among those coordinates whose corresponding variances are relatively small. The partial-standardization exponent, strictly less than 1, is introduced to keep the reciprocals of the small standard deviations from exploding when the coordinates are standardized. In this paper, we further explore the decay for the high-dimensional MANOVA problem.
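For reference, the one-sample max statistic of Lopes et al. (2020) with partial standardization can be written in generic notation as follows; the symbols below are ours, introduced for illustration, since the original displays are not reproduced above.

```latex
% Partially standardized max statistic (generic notation):
% \bar{X} is the sample mean of n i.i.d. p-dimensional vectors with mean \mu,
% \sigma_j is the standard deviation of the j-th coordinate, and \tau is the
% partial-standardization exponent, strictly less than 1.
M(\tau) \;=\; \max_{1 \le j \le p} \frac{\sqrt{n}\,\bigl(\bar{X}_j - \mu_j\bigr)}{\sigma_j^{\tau}} .
```

With full standardization (exponent equal to 1) the variance decay would be erased, whereas an exponent strictly below 1 retains it, which is what drives the dimension-free rates.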

We consider a general setting involving several populations, each with its own mean vector. For any collection of ordered pairs of population indices, the hypothesis testing problem of interest is

(1) the null hypothesis that the two population mean vectors are equal within every pair in the collection, against the alternative that they differ for at least one pair.

Our strategy is to construct simultaneous confidence intervals for the differences of the mean vectors for all pairs in the collection, via bootstrapping a maximum-type statistic taken across all coordinates and all pairs. In addition, we adopt the idea of partial standardization developed in Lopes et al. (2020) to take advantage of the variance decay. This differs from the aforementioned bootstrapping-based methods (e.g., Chang et al., 2017; Zhang et al., 2018; Xue and Yao, 2020), which do not exploit the decay; in addition, the first two consider only one- and/or two-sample problems, and the last one considers only the standard "global null hypothesis" that all population mean vectors are equal.

The proposed method has the following features.

  • There is flexibility in the null hypothesis. In addition to the standard "global null hypothesis" that all population mean vectors are equal, which corresponds to choosing the collection of all pairs, we can also test more specific hypotheses that involve only selected pairs of populations. In general, whenever the collection contains more than one pair, traditional methods often require two or more separate tests to be performed. In turn, this requires extra adjustments for multiple comparisons, which often has a negative impact on power. Indeed, the effect of multiplicity can be severe, because the number of pairs may grow quadratically with the number of populations, as in the case of the global null hypothesis.

  • The proposed method performs the test by constructing simultaneous confidence intervals (SCIs) for the differences of the mean vectors indexed by the chosen pairs. Such SCIs are also valuable in their own right (regardless of the outcome of the hypothesis test), because they provide quantitative information about the separation of the mean vectors.

  • When the null hypothesis is rejected, our approach can immediately identify which pairs of populations have significantly different means without additional tests, in contrast with the aforementioned traditional methods.

  • Like the bootstrap-based procedures of Chang et al. (2017), Zhang et al. (2018) and Xue and Yao (2020), we do not require the ratio of the sample sizes of any pair of populations to converge.

  • In contrast with the test procedures of Chang et al. (2017) and Zhang et al. (2018), for which convergence rates for the size are not established, and the procedure of Xue and Yao (2020), for which the established rate is slower, our approach is shown to enjoy a nearly-parametric rate of convergence; moreover, under a mild condition on the population distributions, the rate is free of the dimension when the decay is exploited.

For demonstration, we apply the proposed procedure to ANOVA for functional data and sparse count data. Functional data are commonly encountered in practical data analysis, and there is a vast literature on their analysis. For a comprehensive treatment of these subjects, we recommend the monographs Ramsay and Silverman (2005); Ferraty and Vieu (2006); Horváth and Kokoszka (2012); Zhang (2013); Hsing and Eubank (2015); Kokoszka and Reimherr (2017) and the review papers Wang et al. (2016); Aneiros et al. (2019). Unlike sparse count data, which are vectors and to which general high-dimensional procedures are therefore applicable, functional data require specialized methods for ANOVA. Various such methods have been proposed in the literature, including the pointwise F-test documented in Ramsay and Silverman (2005, p. 227), the integrated F-test proposed by Shen and Faraway (2004), Zhang (2011) and Zhang (2013), the globalization of the pointwise F-test by Zhang and Liang (2014), the HANOVA method of Fan and Lin (1998), norm-based methods of Faraway (1997) and Zhang and Chen (2007), and the empirical likelihood ratio of Chang and McKeague (2020a). Resampling methods have also been considered, for example by Zhang (2013) and Paparoditis and Sapatinas (2016).

Although our approach makes use of the techniques and some results developed in Lopes et al. (2020), the question of how to adapt these results to the multiple-sample setting is far from obvious. The major obstacle is that, unlike in Lopes et al. (2020), the max statistic (2) considered in our setting is not the maximum of an average of independent vectors. This prevents us from directly applying Bentkus (2003) or its generalized version Bentkus (2005), which requires independence and is one of the key tools in Lopes et al. (2020). Circumventing this difficulty requires a more delicate transformation of the statistic into the maximum of an average of independent random vectors that are further transformed from the data; see the corresponding proposition on Gaussian approximation in the Supplementary Material for details. The intuition behind this transformation is that, although there may be many pairs to be tested (often more pairs than populations), in total there are only as many independent samples as there are populations to be compared. In addition, unlike Lopes et al. (2020), our theory, specifically Theorem 3.4, accommodates the practice that the coordinate-wise variances are often unknown and are estimated from the data. As these quantities appear in a denominator, handling them is rather challenging and requires us to establish a nontrivial bound on the estimation error of the variance estimates, uniformly over all coordinates and groups, under a certain continuity assumption on the distribution of the data; see the corresponding lemma in the Supplementary Material for details.

We structure the rest of the paper as follows. In Section 2 we detail the proposed test procedure. In Section 3 we first establish a general theory of bootstrapping max statistics under a multiple-sample setting, and then apply it to derive the convergence rate of the empirical size and establish the consistency of the proposed test. Application to functional ANOVA is provided in Section 4, and to sparse count data in Section 5.

2 High-dimensional multiple-sample test

Consider several independent groups, each consisting of i.i.d. high-dimensional observations with a group-specific mean vector. Our goal is to use such data to test the hypothesis in (1).

To motivate our approach, consider the two-sample test in the classic setting, which corresponds to the special case of a single pair of groups and a one-dimensional observation. The classic two-sample statistic, the standardized difference of the two sample means, asymptotically follows a standard Gaussian distribution. Based on this statistic, a confidence interval can be built for the difference of the means, and in turn the confidence interval can be used to perform the classic two-sample test. In higher dimensions, one can consider a simultaneous confidence interval for the mean difference across all coordinates, for instance, via the distribution of the coordinate-wise maximum of the standardized differences of the sample means. For the general case of several pairs of groups, it is natural to consider the max statistic that takes the maximum over all pairs as well as all coordinates. Motivated by Lopes et al. (2020), when the variances exhibit a decay pattern, it is beneficial to adopt partial standardization: each coordinate-wise mean difference is divided by its standard deviation raised to a power strictly less than 1, and the resulting max statistic

(2)

is the maximum of these partially standardized differences over all coordinates and all pairs, where the exponent is a parameter that may be tuned to maximize power. As this statistic is the maximum of several random variables that are in turn coordinate-wise maxima of random vectors, it is difficult to derive its distribution. This difficulty, fortunately, can be circumvented efficiently by bootstrapping, as follows.

Let the sample covariance matrix of each group be computed from the data. The bootstrap version of each group's scaled sample-mean fluctuation is taken to be a centered Gaussian vector whose covariance is determined by the group's sample covariance and sample size. These bootstrap vectors are used to construct a bootstrap counterpart of the max statistic, in which the differences of the sample means are replaced by the differences of the bootstrap vectors and the standardization uses the diagonal elements of the sample covariance matrices. The distribution of the bootstrap max statistic can then be obtained by resampling: conditional on the observed data, we repeatedly generate independent Gaussian samples and, for each such sample, compute one observation of the bootstrap statistic. The empirical quantile function of the generated observations serves as an estimate of the quantile function of the max statistic.

Analogously, we define the min statistic as the minimum of the partially standardized differences over all coordinates and all pairs, together with its bootstrap counterpart constructed in the same way, and the quantile function of the min statistic can likewise be estimated by drawing bootstrap samples.

Finally, the two-sided simultaneous confidence intervals (SCIs) in (3) for each coordinate of the mean difference of each pair in the collection are centered at the corresponding coordinate of the difference of the sample means, with endpoints determined by the bootstrap quantiles of the max and min statistics, the partially standardized variance estimates, and the harmonic sample size of the two groups. With the constructed SCIs, we perform the test in (1) by rejecting the null hypothesis at the significance level of interest if some SCI does not contain zero. One-sided SCIs can be constructed, and a one-sided hypothesis test can be conducted, in a similar fashion. For the testing problem (1), it is often desirable to obtain a p-value, which is computable by finding the largest significance level that makes all SCIs in (3) contain zero.
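To make the procedure concrete, the following Python sketch implements one plausible version of the bootstrap SCI test, under a Gaussian (parametric) bootstrap reading of the construction above. The function name, the per-pair standardization, and the quantile conventions are our own illustrative assumptions rather than the paper's exact formulas.

```python
import numpy as np

def manova_sci_test(samples, pairs, tau=0.5, alpha=0.05, n_boot=1000, rng=None):
    """Illustrative sketch of the bootstrap SCI test for the null hypothesis that
    the mean vectors are equal within every pair in `pairs`.

    samples : list of (n_k, p) arrays, one per group
    pairs   : list of (k, l) group-index pairs to compare
    tau     : partial-standardization exponent (strictly less than 1)
    Returns (reject, sci), where sci maps (k, l) -> (lower, upper) arrays.
    """
    rng = np.random.default_rng(rng)
    means = [x.mean(axis=0) for x in samples]
    covs = [np.cov(x, rowvar=False) for x in samples]
    ns = [x.shape[0] for x in samples]
    zero = [np.zeros(len(m)) for m in means]

    def pair_sd(k, l):
        # Coordinate-wise standard deviation of the difference of sample means;
        # one plausible standardization (the paper's exact scaling may differ).
        return np.sqrt(np.diag(covs[k]) / ns[k] + np.diag(covs[l]) / ns[l])

    # Gaussian (parametric) bootstrap for the max and min statistics: draw one
    # Gaussian vector per group and reuse it across all pairs sharing that group.
    max_draws, min_draws = np.empty(n_boot), np.empty(n_boot)
    for b in range(n_boot):
        z = [rng.multivariate_normal(zero[k], covs[k] / ns[k]) for k in range(len(samples))]
        stats = np.concatenate([(z[k] - z[l]) / pair_sd(k, l) ** tau for k, l in pairs])
        max_draws[b], min_draws[b] = stats.max(), stats.min()
    q_max = np.quantile(max_draws, 1 - alpha / 2)   # upper quantile of the bootstrap max
    q_min = np.quantile(min_draws, alpha / 2)       # lower quantile of the bootstrap min

    sci, reject = {}, False
    for k, l in pairs:
        diff = means[k] - means[l]
        scale = pair_sd(k, l) ** tau
        lower, upper = diff - q_max * scale, diff - q_min * scale
        sci[(k, l)] = (lower, upper)
        reject = reject or bool(np.any((lower > 0) | (upper < 0)))
    return reject, sci
```

Rejection occurs when some interval excludes zero; the p-value described above can then be approximated by searching for the smallest significance level at which the test rejects.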

In practice, one needs to determine a value for the partial-standardization parameter. Although it is shown in the next section that any fixed admissible value gives rise to the same asymptotic behavior of the proposed test, a data-driven method may be employed to optimize the empirical power. To this end, we propose to select the value that yields the smallest p-value while keeping the size at the nominal level. For a given value of the parameter, the corresponding p-value can be obtained by the aforementioned method. To estimate the corresponding empirical size, we propose the following resampling approach. First, the data are centered within each group, so that the null hypothesis holds for the centered data. For each group, a new sample of the same size is generated by resampling the original dataset with replacement. Then the proposed test is applied to the new samples. This process is repeated several times, for example 100 times, and the empirical size is estimated by the proportion of the resampled datasets that lead to rejecting the null hypothesis. To cope with the additional computation incurred, one can leverage the two levels of parallelism of our algorithm: each candidate value in a grid can be examined in parallel, and, for a given candidate, all of the subsequent computations are parallel. Therefore, the proposed method is scalable with modern cloud or cluster computing.
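A corresponding sketch of the data-driven choice of the tuning parameter, reusing manova_sci_test from the previous sketch, might look as follows; the candidate grid, the size tolerance, and the p-value grid search are illustrative choices rather than prescriptions from the paper.

```python
import numpy as np

def select_tau(samples, pairs, taus=(0.0, 0.25, 0.5, 0.75),
               alpha=0.05, n_null_reps=100, rng=None):
    """Sketch: among candidate values whose estimated size stays at the nominal
    level, return the one yielding the smallest (approximate) p-value."""
    rng = np.random.default_rng(rng)
    centered = [x - x.mean(axis=0) for x in samples]   # the null holds after centering
    best_tau, best_p = None, np.inf
    for tau in taus:
        # Estimate the empirical size: resample each centered group with replacement
        # and record how often the test rejects.
        rejections = sum(
            manova_sci_test([c[rng.integers(0, len(c), len(c))] for c in centered],
                            pairs, tau=tau, alpha=alpha, rng=rng)[0]
            for _ in range(n_null_reps))
        if rejections / n_null_reps > alpha:           # size appears inflated; skip this tau
            continue
        # Approximate p-value: the smallest significance level at which the test rejects.
        p_val = next((a for a in np.linspace(0.005, 0.5, 100)
                      if manova_sci_test(samples, pairs, tau=tau, alpha=a, rng=rng)[0]),
                     1.0)
        if p_val < best_p:
            best_tau, best_p = tau, p_val
    return best_tau
```

Both loops parallelize naturally: candidate values of the parameter can be examined independently, and, for a fixed candidate, the resampled datasets can be tested independently.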

3 Theory

3.1 Bootstrapping max statistics for multiple samples

We start with some remarks on notation. The identity matrix and standard vector and matrix norms are written in the usual way, and the Orlicz norm of a random variable is defined in the standard manner. For two sequences of non-negative real numbers, we say that one is of at most the order of the other if the former is bounded by a constant multiple of the latter, with the constant not depending on the model parameters, and we say that two sequences are of the same order if each is of at most the order of the other. We allow symbols such as c and C to denote positive absolute constants whose value may change at each occurrence.

The main results developed in the sequel are formulated in terms of a sequence of models indexed by the sample size. In particular, each of the populations may depend on this index, and the dimension is allowed to grow with it.

Assumption 1 (Data-generating model).
  1. For each group, there exist a mean vector and a positive semi-definite matrix such that each observation in the group equals the mean vector plus a linear transform, through a square root of that matrix, of a standardized random vector, and the standardized random vectors within the group are i.i.d.

  2. There is an absolute constant such that each standardized random vector has mean zero and uncorrelated coordinates with unit variance, and the Orlicz norms of its coordinates are uniformly bounded by this constant.

In the above assumption, the mean vectors and covariance matrices are allowed to vary with the sample sizes. In addition, the random vectors from different populations are independent and may have different distributions.

To state the next assumption, for each group we consider the set of indices corresponding to the largest coordinate-wise standard deviations, together with the correlation matrix of the corresponding coordinates. Lastly, a constant fixed with respect to the model index is used to define two integers that specify how many leading coordinates enter the conditions below.

Assumption 2 (Structural assumptions).
  1. The coordinate-wise standard deviations are positive, and there are positive constants, not depending on the sample sizes, the dimension and the number of groups, that govern the rate at which the ordered standard deviations decay within each group.

  2. There exists a constant, not depending on the sample sizes, the dimension and the number of groups, that bounds the off-diagonal entries of the correlation matrices of the leading coordinates away from one. Also, certain matrices derived from these correlation matrices, with suitably modified entries, are positive semi-definite. Moreover, there is a further constant, not depending on the sample sizes, the dimension and the number of groups, that controls the decay of the correlations away from the leading block.

The above two assumptions are the multiple-sample analogues of the assumptions in Lopes et al. (2020), where examples of correlation matrices satisfying the above conditions can also be found. The following assumption imposes constraints on the partial-standardization parameter and the sample sizes.

Assumption 3.

The sample sizes of all groups are of the same order, up to absolute constants, and the partial-standardization parameter is bounded away from one at a suitable rate; in addition, the dimension and the number of groups are allowed to grow with the sample sizes at controlled rates.

In the above assumption, different choices of the partial-standardization parameter can be made for different pairs of indices; for simplicity, here we only consider the case where the parameter is identical for all pairs. In addition, the parameter is allowed to approach one at a slow rate. We emphasize that, although the sample sizes are required to be of the same order, their ratios do not have to converge to fixed limits. In contrast, such a convergence condition is required by the test procedures that are surveyed in Section 1 and are based on the limit distribution of test statistics rather than on bootstrapping. Also, the number of populations and the number of pairs to be tested are allowed to grow with the sample sizes at a rate governed by a fixed constant.

Define the Gaussian counterparts of the partially standardized max and min statistics by replacing the standardized differences of sample means with centered Gaussian vectors having the same covariance structure. The following two theorems, whose proofs can be found in the online Supplementary Material, extend the Gaussian and bootstrap approximation results in Lopes et al. (2020) to a multiple-sample setting; the bounds are stated in the Kolmogorov distance, defined for two generic random variables as the supremum, over all thresholds, of the absolute difference between their distribution functions. As discussed in the introduction, this extension is technically nontrivial.

Theorem 3.1 (Gaussian approximation).

Fix any admissible value of the partial-standardization parameter and suppose that Assumptions 1–3 hold. Then the Kolmogorov distance between the max statistic and its Gaussian counterpart is bounded by a nearly-parametric, dimension-free rate.

Theorem 3.2 (Bootstrap approximation).

Fix any admissible value of the partial-standardization parameter and suppose that Assumptions 1–3 hold. Then there is a constant, not depending on the sample sizes, the dimension and the number of groups, such that the event that the Kolmogorov distance between the max statistic and its bootstrap counterpart, conditional on the observed data, is bounded by a nearly-parametric, dimension-free rate occurs with high probability.

3.2 High-dimensional MANOVA

We first analyze the power of the test procedure in Section 2. It is seen that the power of the test depends on the width of the constructed SCIs. The proof of the following theorem, as well as other theorems in this subsection, can be found in the online Supplementary Material.

Theorem 3.3.

If Assumptions 1–3 hold, then

  1. for any fixed pair in the collection and any coordinate, the width of the corresponding SCI is, with high probability, bounded by a constant multiple of the partially standardized variance estimate divided by the square root of the harmonic sample size, where the constant does not depend on the sample sizes, the dimension and the number of groups, and

  2. for some constant not depending on the sample sizes, the dimension and the number of groups, the probability of rejecting the null hypothesis is bounded below in terms of the magnitude of the largest standardized difference between the mean vectors of the pairs in the collection.

Consequently, if some coordinate of some mean difference in the collection is of a larger order than the corresponding SCI width, then the null hypothesis will be rejected with probability tending to one.

To analyze the size, we observe that when we construct the SCIs, we use the estimated variances instead of the true ones. This requires us to quantify the distance between the distribution of the max statistic and that of its feasible counterpart (4), in which the true variances are replaced by the estimates (5). To this end, consider the cumulative distribution function of each standardized coordinate of the observations. We require the following mild condition on the distribution of the standardized observations.

Assumption 4.

There are constants such that, for every coordinate and every group, the increment of the distribution function of the standardized observation over any small interval is bounded by a power of the length of the interval, uniformly over coordinates and groups, where the power depends only on the constants.

The above condition essentially requires that the distribution functions be collectively Hölder continuous, with a Hölder constant that is fixed but could be arbitrarily small. The assumption is satisfied if each of the distributions has a density function and the densities are collectively bounded (when the number of populations is finite, simply bounded). However, the assumption is much weaker than this, as it can hold even when the distributions do not have density functions or the density functions are unbounded.

Theorem 3.4.

Fix any admissible value of the partial-standardization parameter, and suppose that Assumptions 1–4 hold. Then the Kolmogorov distance between the max statistic and its feasible counterpart (4) converges to zero at a nearly-parametric rate.

With the triangle inequality, the above theorem together with Theorems 3.1 and 3.2 implies that, with high probability, the Kolmogorov distance between the distribution of the feasible max statistic and its bootstrap counterpart is bounded by a nearly-parametric rate, for some constant not depending on the sample sizes and the dimension. This eventually allows us to quantify the convergence rate of the size of the test, as follows. Let the size be the probability that the null hypothesis is rejected when it is true and the significance level is set to a prescribed value. The following theorem, which is a direct consequence of Theorems 3.1–3.4, asserts that the size of the test is asymptotically correctly controlled, with the deviation from the nominal level vanishing at a nearly-parametric rate with high probability.

Theorem 3.5.

Fix any admissible value of the partial-standardization parameter, and suppose that Assumptions 1–4 hold. Then, for some constant not depending on the sample sizes, the dimension and the number of groups, with high probability the size of the test deviates from the nominal level by at most a nearly-parametric rate.

Remark.

In Theorems 3.4 and 3.5, Assumption 4 can be replaced with a condition that instead imposes an upper bound on the growth rate of the dimension relative to the sample sizes.

4 Application to functional ANOVA

To set the stage, let a separable Hilbert space be given, and consider second-order random elements of this space, that is, random elements whose squared norm has finite expectation, each with a mean element. In our context, a random element represents the prototype of the observed functional data in a population. Commonly considered Hilbert spaces in the area of functional data analysis include reproducing kernel Hilbert spaces and the space of square-integrable functions defined on a common domain. The purpose of one-way functional ANOVA is to test the hypothesis

(6) that the mean elements of all populations are equal,

given data sampled i.i.d. from each population.

Let an orthonormal basis of the Hilbert space be fixed. We represent each random element in terms of this basis, with the coefficients being the generalized Fourier coefficients. Then, the hypothesis (6) is equivalent to the statement that the corresponding generalized Fourier coefficients are equal across populations for every basis function. Empirically, we choose a large truncation level and test whether the truncated coefficient vectors of the populations are equal. This is precisely the hypothesis testing problem introduced in Section 2. Below we assess this method by numerical simulations, and compare it with three popular methods in the literature, namely, the L2-norm-based method (L2) (Faraway, 1997; Zhang and Chen, 2007), the F-statistic-based method (F) (Shen and Faraway, 2004; Zhang, 2011) and the globalized pointwise F-test (GPF) (Zhang and Liang, 2014), which are reviewed in the introduction and implemented by Górecki and Smaga (2019).
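As an illustration of the projection step, the following Python sketch computes truncated generalized Fourier coefficients of discretely observed curves; the Fourier basis on [0, 1], the uniform grid, and the truncation level are illustrative choices.

```python
import numpy as np

def fourier_coefficients(curves, grid, p=15):
    """Sketch: project curves observed on a uniform grid over [0, 1] onto the first
    p functions of the standard Fourier basis, giving p-dimensional coefficient vectors.

    curves : (n, m) array of n curves evaluated at the m grid points
    grid   : (m,) array of equally spaced points in [0, 1]
    """
    basis = [np.ones_like(grid)]                       # constant basis function
    k = 1
    while len(basis) < p:                              # sqrt(2)cos(2*pi*k*t), sqrt(2)sin(2*pi*k*t)
        basis.append(np.sqrt(2) * np.cos(2 * np.pi * k * grid))
        if len(basis) < p:
            basis.append(np.sqrt(2) * np.sin(2 * np.pi * k * grid))
        k += 1
    basis = np.stack(basis)                            # (p, m) basis evaluations
    dt = grid[1] - grid[0]                             # uniform-grid quadrature weight
    return np.asarray(curves) @ basis.T * dt           # (n, p) generalized Fourier coefficients
```

The coefficient vectors of the groups can then be passed directly to the high-dimensional test of Section 2 (for instance, to manova_sci_test from the earlier sketch).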

In the simulation study, we consider three groups and four families of mean functions, (M1)–(M4), each parameterized by a deviation parameter such that the three group mean functions are identical when the parameter is zero and differ from each other when it is nonzero. These families are shown in Figure 1. Among them, the families (M1) and (M2) represent a "sparse alternative" in the frequency domain, in the sense that, when the alternative is true, the Fourier coefficients of the mean functions differ most in the first few leading coefficients. In contrast, the family (M3) represents a "dense alternative" in the frequency domain. In addition, when the deviation parameter is nonzero, all of the families (M1)–(M3) are dense in the time domain. In particular, the alternatives in (M2) are uniformly dense in the time domain, in that the differences of the mean functions among groups are nonzero and uniform over the domain. Thus, the families (M1)–(M3) favor integral-based methods such as the L2, F and GPF tests, as these methods integrate certain statistics over the time domain. In contrast, the alternatives in the last family (M4) are sparse in the time domain.

Figure 1: Mean functions of the three groups (solid, dashed and dotted) with a nonzero deviation parameter, for families (M1) (top-left), (M2) (top-right), (M3) (bottom-left) and (M4) (bottom-right).

We sample each observed function as the group mean function plus a centered random process, and consider two different settings for the centered random processes. In the first setting, referred to as "common covariance", the random processes of all groups are Gaussian with a common Matérn covariance function (7), which involves the gamma function and the modified Bessel function of the second kind, and whose smoothness and scale parameters are set to fixed values. In the second scenario, referred to as "different covariance", the groups have different covariance functions, as follows. For the first group, the random process is the Gaussian process with the Matérn covariance function (7). For the second group, the process is a Wiener process with a fixed dispersion, i.e., the Gaussian process whose covariance between two time points is the dispersion times the smaller of the two time points. For the third group, the process is constructed using a random coefficient that follows a uniform distribution; therefore, the samples in the third group are not Gaussian.
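For reference, a Python sketch of the Matérn covariance in (7), using SciPy's gamma and modified Bessel functions; the smoothness, scale and variance values below are placeholders, since the specific settings used in the simulation are not restated here.

```python
import numpy as np
from scipy.special import gamma, kv

def matern_cov(s, t, nu=0.5, rho=0.1, sigma2=1.0):
    """Sketch of the Matern covariance function in (7); nu (smoothness), rho (scale)
    and sigma2 (variance) are illustrative placeholder values."""
    d = np.abs(np.subtract.outer(np.atleast_1d(s), np.atleast_1d(t)))
    scaled = np.sqrt(2.0 * nu) * d / rho
    with np.errstate(invalid="ignore"):                # 0 * inf at d == 0 is handled below
        cov = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * scaled ** nu * kv(nu, scaled)
    cov[d == 0] = sigma2                               # limit of the Matern kernel at lag zero
    return np.squeeze(cov)

# Example: simulate one Gaussian sample path on a grid of 100 points in [0, 1].
# grid = np.linspace(0, 1, 100)
# path = np.random.default_rng(0).multivariate_normal(np.zeros(100), matern_cov(grid, grid))
```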

We set the significance level at 5% and consider the case of balanced sampling with group sizes 50, 50 and 50 and imbalanced sampling with group sizes 30, 50 and 70. To apply our method, we use the aforementioned basis with a fixed truncation level. The partial-standardization parameter is selected by the method described in Section 2. Each simulation setup is replicated 1000 times independently. The results for the size are summarized in Table 1, from which we see that the empirical size of all methods is fairly close to the nominal level. The performance in terms of power is depicted in Figure 2 for the scenario of a common covariance structure. We observe that, when the alternatives are sparse in the frequency domain but not uniformly dense in the time domain, as in the case of (M1), or when the alternatives are sparse in the time domain, as in the case of (M4), the proposed method outperforms the others by a significant margin. In the other two cases, all methods have almost indistinguishable performance. Similar observations are made for the scenario of different covariance functions based on Figure 3, except that the power of GPF is slightly larger than that of the others when the sampling is imbalanced in the case of (M2), where the alternatives are uniformly dense in the time domain. In conclusion, the proposed test is powerful against both dense and sparse alternatives in either the time or the frequency domain, especially in the commonly encountered practical scenario in which the alternative is sparse in the time domain, or sparse in the frequency domain but not uniformly dense in the time domain.

Covariance   Family  Sample sizes  Proposed  L2    F     GPF
common       M1      50,50,50      .047      .053  .046  .049
                     30,50,70      .054      .053  .049  .052
             M2      50,50,50      .052      .064  .058  .062
                     30,50,70      .056      .050  .044  .050
             M3      50,50,50      .050      .052  .046  .048
                     30,50,70      .055      .061  .056  .060
             M4      50,50,50      .047      .056  .055  .054
                     30,50,70      .038      .044  .038  .040
different    M1      50,50,50      .050      .050  .042  .050
                     30,50,70      .057      .041  .039  .058
             M2      50,50,50      .043      .052  .046  .049
                     30,50,70      .051      .042  .038  .047
             M3      50,50,50      .042      .048  .045  .048
                     30,50,70      .052      .048  .042  .052
             M4      50,50,50      .052      .046  .042  .040
                     30,50,70      .052      .041  .037  .048
Table 1: Empirical size of functional ANOVA
Figure 2: Empirical power of the proposed functional ANOVA (solid), L2 (dashed), F (dotted) and GPF (dotdash) with a common covariance function. Top: from left to right, panels show the empirical power for families (M1), (M2), (M3) and (M4) under the balanced sample-size setting. Bottom: from left to right, panels show the empirical power for families (M1), (M2), (M3) and (M4) under the imbalanced sample-size setting. The performance of L2, F and GPF is almost indistinguishable.
Figure 3: Empirical power of the proposed functional ANOVA (solid), L2 (dashed), F (dotted) and GPF (dotdash) with different covariance functions. Top: from left to right, panels show the empirical power for families (M1), (M2), (M3) and (M4) under the balanced sample-size setting. Bottom: from left to right, panels show the empirical power for families (M1), (M2), (M3) and (M4) under the imbalanced sample-size setting. The performance of L2, F and GPF is almost indistinguishable.

Now we apply our method to analyze the data collected by Müller et al. (1997) on female Mediterranean fruit flies (Ceratitis capitata). Four thousand cages of female flies were evenly divided into four cohorts corresponding to different environmental conditions. For each cage, the total number of eggs laid by the flies was recorded each day, and the observed numbers form a trajectory that characterizes the dynamic pattern of egg laying of the female flies in the cage. Our goal is to investigate whether the environmental condition has an impact on the egg-laying pattern by using the recorded trajectories. As most cages have zero eggs laid in the first few days and the last few days, we focus only on the trajectories recorded between day 10 and day 50. After excluding trajectories with missing data, we obtain four samples, one for each cohort. To stabilize the variance, we apply a square-root transformation to the egg counts.

Our proposed method with the Fourier basis rejects the null hypothesis with a very small p-value. In other words, the mean trajectories of the four cohorts are not all identical. Specifically, the method shows that the mean trajectory of the first cohort (represented by the solid line in Figure 4) and that of the fourth cohort (represented by the dotdash line in Figure 4) are each significantly different from the other three. Moreover, we note that the first Fourier coefficient is the daily average of the number of eggs laid. The constructed SCIs suggest that the daily average of the first cohort is different from its counterparts in the other cohorts, and similarly, the daily averages of the second and the fourth cohorts are different. This valuable extra information is obtained without additional hypothesis tests and thus without the correction for multiple comparisons that might lower the power of the test.

Figure 4: Mean trajectories of the number of eggs laid by female fruit flies in four cohorts.

5 Application to sparse count data

Count data, often modeled by multinomial or Poisson distributions, are common in practice. For the multinomial model, the decay in variance is innate due to the requirement that the probabilities of all categories sum to one. For the Poisson distribution, since the variance is equal to the mean, sparseness in the mean induces decay in the variance. Here, sparseness refers to the situation in which either there are only a few nonzero coordinates or the ordered coordinate means decrease rapidly to zero. For instance, in the field of text mining or information retrieval, in which word frequency is an important feature, words in a vocabulary often have drastically different frequencies. In addition, the frequency of words decreases at a rather fast rate from the frequent words to the rare words. For example, for the English language, the ordered word frequencies are found to approximately follow Zipf's law (Zipf, 1949). To assess the application of the proposed method to sparse Poisson data, we conduct the following simulation study.

We consider three groups of high-dimensional count vectors. Each random vector follows a multivariate Poisson distribution (Inouye et al., 2017), constructed by adding one shared Poisson random variable to a vector of independent Poisson random variables with coordinate-specific means. It follows that each coordinate is again Poisson-distributed, with mean equal to the sum of its coordinate-specific rate and the shared rate, and all coordinates are correlated due to the shared random variable. In our study, the shared rate is fixed across groups, and we consider two settings for the coordinate-specific rates. In the first setting, termed the sparse setting, the differences of the coordinate means between groups decay across coordinates when the deviation parameter is nonzero. In the second setting, termed the dense setting, the differences of the coordinate means are equal across coordinates. The setting with a zero deviation parameter is used to evaluate the empirical size, since it corresponds to the case in which the null hypothesis is true, i.e., the mean vectors of all groups are identical. For the dimension, we consider two cases, namely 25 and 100. For the sample sizes, we consider the balanced case with group sizes 50, 50 and 50 and the imbalanced case with group sizes 30, 50 and 70. The partial-standardization parameter is selected by the method described in Section 2. Each simulation is repeated 1000 times independently.
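The common-shock construction described above can be sketched as follows; the rate values in the usage comment are illustrative, since the specific settings of the study are not restated here.

```python
import numpy as np

def multivariate_poisson(n, lam, lam0=0.5, rng=None):
    """Sketch of the common-shock multivariate Poisson model: each coordinate is an
    independent Poisson variable plus one Poisson variable shared by all coordinates,
    so every coordinate is again Poisson and all coordinates are positively correlated.

    lam  : (p,) array of coordinate-specific rates
    lam0 : rate of the shared component (illustrative value)
    """
    rng = np.random.default_rng(rng)
    y = rng.poisson(lam=np.asarray(lam, dtype=float), size=(n, len(lam)))  # independent parts
    y0 = rng.poisson(lam=lam0, size=(n, 1))                                # shared shock
    return y + y0

# Illustrative sparse-alternative setup (not the paper's exact rates): decaying means
# induce variance decay, and only the leading coordinates are shifted between groups.
# base = 1.0 / np.arange(1, 26)
# shift = np.concatenate([0.3 / np.arange(1, 6), np.zeros(20)])
# group1 = multivariate_poisson(50, base)
# group2 = multivariate_poisson(50, base + shift)
```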

For comparison, we implement the methods of Schott (2007) and Zhang et al. (2018) that are reviewed in the introduction. The former is based on the limit distribution of a test statistic composed of between-group and within-group sums of squares, while the latter utilizes an adjusted norm-based test statistic whose distribution is approximated by a multiplier bootstrap. The former is favored in testing problems with a dense alternative, while the latter is claimed to be powerful against different alternative patterns (Zhang et al., 2018). From the empirical sizes shown in Table 2, we observe that, overall, the sizes of the proposed method and Schott (2007) are rather close to the nominal level, while the size of Zhang et al. (2018) seems slightly inflated. The empirical power for one of the two dimensions is shown in Figure 5; the result for the other dimension is rather similar and is thus omitted to save space. In the sparse setting, the proposed method has a power significantly larger than that of Zhang et al. (2018), which in turn is much larger than that of Schott (2007). In the dense setting, which does not favor our method, it still has a power comparable to those of Schott (2007) and Zhang et al. (2018). In addition to testing hypotheses, the proposed method can also simultaneously identify the pairs of groups, as well as the coordinates, that have significantly different means, as demonstrated below with two real datasets.

Setting  Dimension  Sample sizes  Proposed  Schott (2007)  Zhang et al. (2018)
sparse   25         50,50,50      .055      .042           .065
                    30,50,70      .052      .053           .069
         100        50,50,50      .056      .045           .054
                    30,50,70      .056      .055           .065
dense    25         50,50,50      .050      .051           .065
                    30,50,70      .045      .066           .062
         100        50,50,50      .057      .054           .064
                    30,50,70      .051      .049           .067
Table 2: Empirical size of ANOVA on Poisson data
Figure 5: Empirical power of the proposed high-dimensional ANOVA (solid), the method of Zhang et al. (2018) (dashed) and the method of Schott (2007) (dotted). Top: the sparse setting, with the two sample-size configurations in the left and right panels. Bottom: the dense setting, with the two sample-size configurations in the left and right panels.

As the first application, we apply the proposed method to the CLASSIC3 dataset (Dhillon et al., 2003), which is studied in the field of information retrieval and was originally available from ftp://ftp.cs.cornell.edu/pub/smart; it is now publicly available on the Internet, e.g., at https://www.dataminingresearch.com/index.php/2010/09/classic3-classic4-datasets/. The dataset consists of 3891 document abstracts from three different domains, specifically, information retrieval (CISI), aeronautical systems (CRAN) and medical research (MED). Standard text preprocessing is applied to these abstracts, including the removal of high-frequency common words (commonly referred to as stop words, such as "the", "is", "and", etc.), punctuation and Arabic numerals. In addition, we follow the common practice in information retrieval of reducing inflected words to their stem, base or root form by using a stemmer, such as the Krovetz stemmer (Krovetz, 1993) employed in this paper. Each document is then represented by a vector of word counts. Such a vector is sparse in nature, as the number of distinct words appearing in a document is in general far smaller than the size of the vocabulary. Intuitively, vocabularies from different domains are different. Our goal is to examine this intuition and to find the words whose usage differs substantially among the three domains. To this end, we focus on words with at least 50 occurrences in total, to eliminate the randomness caused by rare words; this determines the set of distinct words under consideration. Applying the proposed test to the processed data, we find that the vocabularies used in the three domains differ between every pair of domains, with a very small p-value. In particular, our method simultaneously identifies the words that have significantly different frequencies among the domains, shown in Table 3, where the numbers represent the average frequency of the words within each domain. The results for CISI and CRAN match our intuition about these two domains. For the domain of medical research, the word "normal" is often used to refer to healthy patients or subjects, while the word "increase" is used to describe changes in certain biological metrics, such as protein metabolism.

Word CISI CRAN MED
use 0.715 0.515 0.265
data 0.401 0.239 0.082
pressure 0.011 1.004 0.139
effect 0.060 0.759 0.338
theory 0.167 0.684 0.024
problem 0.301 0.456 0.069
body 0.017 0.607 0.162
increase 0.089 0.271 0.437
normal 0.007 0.112 0.351
group 0.129 0.011 0.304
Table 3: The average frequency of words that are significantly different among all categories

Next, we apply our method to study physical activity using data collected by wearable devices in the National Health and Nutrition Examination Survey (NHANES) 2005–2006. In the survey, each participant aged 6 years or above was asked to wear a physical activity monitor (Actigraph 7164) for seven consecutive days, with bedtime excluded. Also, as the device is not waterproof, participants were advised to remove it during swimming or bathing. The monitor detected and recorded the magnitude of acceleration of the participant's movement. For each minute, the readings were summarized to yield a single integer that signifies the average intensity within that minute, resulting in a sequence of minute-by-minute observations per participant. Demographic data about the participants were also collected. In our study, we focus on two age groups and two marital categories. The two age groups are young adulthood, with age ranging from 18 to 44, and middle-age adulthood, with age ranging from 45 to 65. The two marital groups are "single" (including the widowed, divorced, separated and never-married categories in the original data) and "non-single" (including the married and living-with-partner categories). These groups induce four cohorts: young non-single adults, young single adults, middle-age non-single adults and middle-age single adults. Our goal is to examine whether the physical activity patterns differ among the cohorts by using the NHANES data.

From Figure 6, which presents the activity trajectories of three randomly selected participants from the dataset, we see that participants have different circadian rhythms. To address this issue, we adopt the strategy proposed by Chang and McKeague (2020b), who studied the physical activity of elderly veterans from the perspective of functional data analysis, and transform each activity trajectory into an activity profile: for each intensity level, the profile records the Lebesgue measure of the set of time points at which the trajectory attains at least that intensity level. Equivalently, the profile value at a given intensity level is the total number of minutes at which intensity values at or above that level occur in the trajectory. Therefore, the activity profile can be viewed as suitably normalized count data. As over 95% of physical activity has low to moderate intensity, i.e., intensity values below 1000, we focus on that portion of the intensity spectrum and only consider subjects with both "reliable" and "in calibration" readings. This results in four cohorts, after excluding subjects with missing data.
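A Python sketch of the activity-profile transform described above, under the reading that the profile at intensity level u counts the minutes with recorded intensity at least u; the upper limit of 1000 reflects the low-to-moderate intensity spectrum considered here.

```python
import numpy as np

def activity_profile(trajectory, u_max=1000):
    """Sketch: for each intensity level u = 1, ..., u_max, count the number of minutes
    in which the recorded intensity is at least u. The result is a sparse count vector
    indexed by the intensity spectrum."""
    x = np.asarray(trajectory)
    levels = np.arange(1, u_max + 1)
    return (x[None, :] >= levels[:, None]).sum(axis=1)

# Stacking the profiles of the participants in a cohort gives an (n, u_max) count
# matrix; the four cohort matrices can then be compared with the test of Section 2.
```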

The mean activity profiles and their standard deviations are depicted in the top panels of Figure 7, from which we observe that both the mean and the standard deviation decay at a rather fast rate. In addition, the mean profiles of the young single and middle-age non-single cohorts are almost indistinguishable in the plot, while the mean profile of the middle-age single cohort is visibly different from the others. This visual inspection is aligned with the result obtained by the proposed test, which rejects the null hypothesis with an approximate p-value of 0.0036 and thus suggests that some mean activity profiles are likely to be substantially different. Moreover, the method identifies two pairs of cohorts whose mean activity profiles are different, together with the portions of the intensity spectrum on which the differences are significant, namely, the young single cohort versus the middle-age single cohort on one portion of the spectrum, and the middle-age non-single cohort versus the middle-age single cohort on another. These findings are also visualized in the bottom panels of Figure 7. Furthermore, our method can provide SCIs for the differences of the mean activity profiles among all pairs of cohorts. For instance, in Figure 8 we present the SCIs for the pairs with significant differences in the mean activity profiles, over the portions of the spectrum on which the differences are statistically significant. In summary, compared to the young single and middle-age non-single cohorts, the middle-age single cohort is found to have less activity on average, especially less low-intensity activity, for instance, activity with intensity smaller than 87.