Testing the equality of multivariate means when p>n by combining the Hoteling and Simes tests

12/22/2019 ∙ by Tzviel Frostig, et al. ∙ 0

We propose a method of testing the shift between mean vectors of two multivariate Gaussian random variables in a high-dimensional setting incorporating the possible dependency and allowing p > n. This method is a combination of two well-known tests: the Hotelling test and the Simes test. The tests are integrated by sampling several dimensions at each iteration, testing each using the Hotelling test, and combining their results using the Simes test. We prove that this procedure is valid asymptotically. This procedure can be extended to handle non-equal covariance matrices by plugging in the appropriate extension of the Hotelling test. Using a simulation study, we show that the proposed test is advantageous over state-of-the-art tests in many scenarios and robust to violation of the Gaussian assumption.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

A common problem in statistics is testing the equality of two mean vectors and based on two independent random samples, and . Formally, the hypothesis being tested is


This problem appears in many fields of science, including but not limited to analyzing EEG data [14], fMRI data [33], Neurobiology [34] and more. At the first stage of the analysis the interest lies in detecting a difference and not necessarily identifying where this difference lies. A common approach to test the hypothesis is using the classical Hotelling test [16] which assumes that and

follow the same multivariate normal distribution


1.1 Hotelling test

The Hotelling test statistic is defined as follows:


where and

are the estimated mean vectors,

and is the pooled covariance matrix estimator. Under , the test statistic follows the distribution, the p-value of the test is given by

, the null hypothesis is rejected for large value of the test statistic only.

The test is a natural generalization of the one-dimensional t-test. It is invariant under any non-singular linear transformations and is the uniformly most powerful test among all tests which share this invariance property

[4]. The test cannot be used when , a common situation in modern scientific research, as the estimated covariance matrix is not of full rank, and therefore its inverse is undefined. Moreover, even for , as approaches , the power of the test deteriorates [2]. Several approaches have been suggested for testing 1.1 in the setting. Many of them rely on a different method to estimate , which will guarantee that it would be of full rank.

1.2 Simes test

A very different approach is to rely on the logical fact that . This allows us to test each dimension marginally and reject the global-null hypothesis using a combination test, such as Fisher, Simes, or Stouffer.

Assuming that and follow the multivariate normal distribution, each one of their dimensions also follows the normal distribution and . Under these conditions we can test the marginal hypotheses , by applying the t-test. To complete the procedure, we need to find a test of the global null hypothesis, . Since the tests are dependent, a good candidate is the Simes test [28], which does not rely on the assumption of independence.

The Simes test, is a test of the global null, making use of the p-values, , which in our context, the p-values are acquired by marginally testing each dimension. The p-values are sorted, , and the test statistic is

Under the null hypothesis, if the test statistics are independent then [28]. While the Simes test offers a method to test the global-null, it ignores the covariance structure as each dimension is tested marginally. We suggest combining the Hotelling and Simes tests to construct a new test for the hypothesis of the two sample mean equality in the dependent setting.

The rest of the paper is organized as follows: In Section 2 we propose the test, in Section 3, default test parameters values are suggested. In Section 4 we extend the test by removing the equal covariance assumption and the normality assumption using a permutation test based on the suggested test. In Section 5 we present the results of the simulation study comparing our suggested test with existing alternatives, and finally in Section 6 we apply the suggested test to two real-data examples.

2 The proposed test

In order to reject the null hypothesis presented in Eq. 1.1, it is enough to reject any marginal hypothesis . However, we need not limit ourselves to a single marginal hypothesis as we can test different subset of hypotheses, . Any subset of hypotheses rejected indicates that the global null hypothesis should also be rejected.

Utilizing this idea, define to be the group of all possible subset of dimensions, . At each iteration we sample dimensions, and conduct the Hotelling test of the null hypothesis . Repeat the process times and combine the resulting p-values from the random Hotelling statistics using the Simes test and compute , then test at level . Algorithmically the test procedure is

for  do
end for
Algorithm 1 Suggested procedure

Each Hotelling test under the intersection hypothesis (

) yields a uniformly distributed p-value at step 2. We denoted the test as

, for Simes-Hotelling. One of the reasons to use the Simes test, is that it remains valid under various dependency structures of the test statistics. In particular, the Simes test has been proven to be conservative if the positive regression dependency on subset (PRDS) property between the test statistics holds [3]. This is type of dependency likely to arise when two samples of dimensions overlap.

Property 2.1 ().

For any increasing set , and for each is non-decreasing in .

Unfortunately, even if is PRDS, is not necessarily PRDS, and therefore does not assure the validity of Simes for p-values stemming from two-sided tests. For p-values originating from two-sided Gaussian tests, if is multivariate total positivity of order 2 (MTP), or if it is so after changing the sign of some of the -s, then so is , thereby implying the validity of the Simes test [17], [26].

Property 2.2 ().

is MTP if for all and ,


where is the joint density, and the minimum and maximum are evaluated component-wise.

Even for pairwise dependency structure which is neither MTP nor PRDS [35] gave strong theoretical support for its validity. These are known theoretical results establishing the validity of Simes test for . Still, PRDS is not a necessary condition for the validity of Simes test: the case of t-tests derived from PRDS normal distribution scaled by the same independent estimator

is not PRDS yet the Benjamini-Hochberg (BH) procedure and the Simes test are valid for a significance level less than 0.5. Evidence of the conservatism of the Simes test based on two-sided p-values under general dependency for Gaussian distribution have accumulated since then. This is based on numerous reported simulation studies, see

[32], [19], [24], [21], as well as unreported ones. Even though the global bound of

on the level of Simes test for any joint distribution has been shown to be attainable

[3], a counter example based on Gaussian distributed test statistics has not been found. The distribution for which the bound is attained is far from being Gaussian, not being ellipsoidal or even unimodal. Hence Simes test is widely considered valid for two-sided tests based on Gaussian or t-distributions under any dependency.

We show that the covariance between pairs of Hotelling statistics is asymptotically positive and they are asymptotically distributed as multivariate Gaussian. Since positively correlated multivariate Gaussian hold the PRDS property this ensures the validity of the Simes test. For further details see the supplementary material.

Theorem 2.1.

If , then for .

The theorem is stated using a single sample, , but can easily be extended to accommodate two-samples. We support the validity of the procedure for finite sample sizes in the Simulation section. The proposed test has certain advantages over the classic Hotelling test, as it is applicable when , and it allows to inspect smaller subsets of the dimensions. It also has advantages over the marginal Simes procedure, it combines a number of weak signals potentially improving the power and incorporates information regarding the covariance structure. In some sense the procedure is like that suggested by Thulin [31], where the sampled Hotelling statistics are averaged, and the computation of p-value relies on permutations. The suggested procedure does not rely on permutation for the computations and therefore enjoys a lower computational burden. It remains to determine the values of parameter and , which will be done in the next section.

3 Considerations for choosing and in

We follow the result of [23], and set . Since the proof of the property underlying their recommendation is asymptotic, we will also consider another option in our simulations study.

For a given parameter there are as many as different dimension subsets of size . Even for small and the number of all possible samples is too large to consider each any every possibility. Therefore, we must find a suitable value for the number of samples, , to be taken. Note that if we happen to sample the same set of dimensions twice the Simes test does not lose power even though the number of tests increases because of its step-up nature. We would like to ensure that each dimension will be chosen at least once, as to avoid situations where only null hypothesis dimensions are captured by the sampling, which could lead to a loss of power (see Fig. 1). For computational reason we would like to use smaller . Finding the number of samples needed to cover all the dimensions is known as the Coupon Collector Problem, and for choosing one dimensions at each time the expected number of samples needed to ensure it,


where is the Mascheroni-Euler constant. For selecting more than one dimensions at each sample see [30], the above equation can be viewed as an upper bound for the expected number of draws when selecting more than one dimension. Setting

seems to satisfy the condition of ensuring the selection of all dimensions with high probability.

The power increases by sampling the dimensions rather than conducting a single Simes test over all dimensions. Suppose that the proportion of dimensions with no signal is . If testing is conducted marginally (t-test followed by Simes) there are test statistics distributed according to the null distribution. By sampling each time dimensions and combining them using Hotelling, the proportion of statistics that are from the null distribution is


The gain from the sampling mechanism is not only by allowing the incorporation of data regarding the covariance matrix, but also by reducing the ratio of null to non-null test statistics. The parameter can be viewed as controlling the trade-off between including data regarding the covariance matrix and the averaging of effects. If the procedure would correspond to the marginal testing of components and will include no data regarding the covariance matrix, and for the procedure would correspond to the standard Hotelling test.

Figure 1: Comparison of various choices for and . Data is simulated from multivariate Gaussian the covariance matrix , , and . The normalized Euclidean distance The x-axis is the number of non-null hypotheses , the y-axis is the power, the rows correspond to various values, the columns to the number of samples taken , and the colors to sample sizes,

. The number of iterations is 1000, and the maximum standard deviation is 0.016

The test seems to lose power for all values of when is small, for larger values of it seems to have little effect (see Figure 1, the difference between to is large, yet the the difference between and is almost negligible). While , is the suggested value, it is not consistently the best option. The optimal is dependent both on , and on . When signal is spread across more dimensions, it is better to select a larger value for , but if the signal is centered only on a few dimensions it seems that a smaller value of should be considered. Usually, this type of information is not known, and it seems that is a good compromise, having good power for most scenarios. Further theoretical justification for the value is given in Section ”Choosing sample size in the supplementary material”.

4 Extensions

4.1 Non equal covariance matrices

The testing of Gaussian means when the variances are not equal, also known as the Behrens-Fisher problem, has been of interest from the early age of Statistics. The natural statistic in one dimension is

The common solution is approximating

using the t-distribution and using the Welch method to evaluate the degree of freedom. The corresponding statistic in high dimensions would be:

To find its distribution we use the approximation by [20]

where the degree of freedom is calculated as follows

and (or Y accordingly) and . This suggestion was shown to have the best power out of the invariant methods. It can be easily integrated into our proposed test, using instead of the usual Hotelling statistic for the sampled dimensions. Note, that Result 2.1 still holds as the only thing changed is the number of degrees of freedom. We assess the performance of the procedure and then compare the performance of three other methods in a simulation study (see Section 5.3).

4.2 Extension to non-Gaussian distributions

In order to overcome the reliance of the suggested test on the Gaussianity assumption, we can conduct a permutation test of the same statistic, resulting with a distribution free test. The only assumption needed is that the two samples are interchangeable. Under the global null there is no difference between the two groups, the two distributions are equal, so observations in the two groups are interchangeable. Notice, that the non-parametric extension of the procedure no longer test for equality of the mean vector but the equality of the distributions, .

The permutation test is the following, for each sample obtain the Hotelling statistics, for each such sample shuffle the labels times, and obtain the permutation p-value, for each shuffle and sample. Now use function to combine p-values for each permutation resulting in a single vector of p-values. From there compare the shuffled p-values to the original to obtain the final p-value. Algorithmically the permutation test is the following

for  do
     for  do
     end for
end for
for  do
     for  do
     end for
end for
Algorithm 2 Extension of suggested procedure to permutation tests

After obtaining the p-values for all the subsets, one can either combine them using any valid test (Bonferroni, the brave scientist may still use Simes), or use any function coupled with a permutation test. This method is similar to the method suggested by [31] and shares the disadvantage of being computationally expensive. A small study of additional non-parametric tests and their comparison to the suggested test can be found in the supplementary material.

5 Simulations

A simulation study was conducted in order to evaluate the suggested test in terms of Type I error and power in comparison to other tests. The other tests considered rely on various approaches, regularization based methods

[29] (SD), [8] (CQ), that are similar to Hotelling but the estimate of is replaced by a matrix which can be inverted, and thereby allow for the calculation of the test statistic. Their main drawback is the loss of power when the covariance structure is not negligible [23]. The random projection type tests such as [23] (Lopes) and the non-parametric test suggested by Thulin [31], where the data are projected into a smaller sub-space, thus allowing the calculation of Hotelling-like test statistic. The suggested method falls under this category (using Simes instead of permutation tests). A test based on the supremum of the standardized differences between the observed mean vectors [5] (CLX). Another test is based on the smoothed t-test statistics [11] (GCT). More information regarding the tests can be found in the supplementary material. The simulation section is split into two subsections each dedicated to a different type of tests: subsection 5.2, Parametric tests, and subsection 5.3 covers tests which do not require the equal covariance assumption. The section regarding the performance of the non-parametric tests can be found in the supplementary material 5.2. In this section we present the most interesting results, the full comprehensive tables and figures can be found in the supplementary material. The simulation code can be accessed in github.com/…/Simulations-Simes-Hotelling.

5.1 Simulations parameters

Since the goal is to evaluate the tests over a wide array of scenarios, we varied multiple parameters of the data generation.

  1. The number of observations: .

  2. The number of dimensions, : For the equal and non-equal covariance matrix samples, , for the non-parametric simulation .

  3. The sparsity, , varied between .

We considered 4 different types of covariance matrices.

  1. An auto regression (AR) matrix, that is , , .

  2. The block covariance matrix, where we varied the in block correlation , and the size of each block .

  3. Model 7 presented in [5], where and for , . is a diagonal matrix with sampled from a Uniform distribution .

  4. Equally correlated covariance matrix , , , . .

The above list of covariance matrices represents scenario used in other matrices, as well as covariance structures that resembles those encountered in practice. To simulate the signal, we kept , and vary . is divided into 6 sub-vectors, 1 sub-vector of size (), whose entries are 0. The rest of the sub-vectors have a size of where each sub-vector elements are . To find we determine , and solve

for . The norm

is chosen to set a specific signal to noise ratio, allowing us to compare different scenarios results. The non-zero entries are determined at random. We followed a full factorial design observations

1- dependency structure. All tests are evaluated at level .

5.2 Simulations - parametric tests

The parametric tests that were chosen for comparison are SD 3.1.2, CQ 3.1.3, Lopes 3.1.4, GCT 3.1.5, CLX 3.1.6 and the suggested test (SH) with number of dimensions sampled at each iteration . We kept the number of samples constant at , so we shall simply denote the test as . The implementation of GCT and CLX used can be found in the R package highD2pop [12], and the implementation of SD, CQ in the R package highmean [22]. Across all simulations the number of repetitions is 1,000, such that the conservative deviation of the estimated power and Type I error is . All tests are conducted on the same simulated data. Throughout the simulation to allow us to compare between the rest of the simulations shown.

5.2.1 Type I Error Comparison

Here we describe our main finding with some of the supporting evidence. The complete results of the Type I error are presented in the supplementary material.

20 0.3 0.034 0.051 0.051 0.066 0.077 0.19 0.114 0.054 0.051
0.45 0.034 0.056 0.052 0.048 0.074 0.178 0.113 0.045 0.046
0.6 0.035 0.057 0.052 0.049 0.074 0.19 0.119 0.052 0.056
0.75 0.029 0.047 0.07 0.05 0.077 0.188 0.111 0.049 0.044
0.95 0.019 0.05 0.046 0.03 0.164 0.15 0.124 0.054 0.032
50 0.3 0.046 0.06 0.046 0.049 0.065 0.08 0.101 0.051 0.05
0.45 0.044 0.061 0.047 0.043 0.068 0.1 0.1 0.048 0.052
0.6 0.041 0.053 0.041 0.045 0.064 0.057 0.095 0.044 0.047
0.75 0.034 0.056 0.047 0.043 0.077 0.093 0.103 0.052 0.038
0.95 0.034 0.074 0.043 0.024 0.191 0.137 0.12 0.042 0.028
100 0.3 0.043 0.049 0.046 0.046 0.056 0.072 0.094 0.04 0.04
0.45 0.04 0.05 0.036 0.048 0.056 0.054 0.098 0.045 0.028
0.6 0.04 0.053 0.048 0.06 0.068 0.031 0.087 0.042 0.042
0.75 0.043 0.054 0.045 0.045 0.075 0.095 0.092 0.043 0.031
0.95 0.029 0.067 0.052 0.025 0.178 0.099 0.113 0.036 0.029
Table 1: Parametric tests, type I error rates for the AR covariance matrix scenario for each of the methods. is the correlation parameter. The number of iterations is 1000, and the conservative s.e of the simulation is 0.016.

Across the various scenarios it can be seen that both SH variants keep the required error rate. As the correlation structure is stronger (Table LABEL:tab:Cov3_Alpha and Table 1), the Simes and SD tests are becoming more conservative. CQ test has a slightly elevated error rates () under all the covariance structures. It seems that a GCT, and CLX fail to keep type I error, especially when the covariance between the dimensions decays slowly (Table LABEL:tab:Cov3_Alpha). CLX performance was worst for the block covariance structure, due to the difficulty in estimating the covariance matrix. The covariance matrix was estimated using the adaptive threshold estimator, indeed, it can be expected that for sparse covariance matrices using the CLIME would have yielded better performance. The type I error of the GCT test holds asymptotically as and should have therefore performed better for larger . KNN also does not seem to keep the type I error, this is possibly due to the asymptotic nature of the test. It can be expected to perform better when . Due to these results, GCT, KNN and CLX tests were omitted from the rest of the simulations study.

5.2.2 Power comparison - parametric tests

We compare the power curves of the different tests in different scenarios. The power of the global null tests is affected by the sparsity of the signal, the covariance structure and number of observations. Considering the covariance structure can change the sparsity of the problem: if a signal is sparse in the original dimensions considered, it could be dense when considering it across the principle components. As can be seen in Lemma 1.2, in the supplementary material, Hotelling’s test sums the signal across the principal components, increasing the power of tests that take into consideration the covariance structure in certain situations.

Figure 2: Parametric tests power comparison, block covariance matrix, block size is 100. Rows indicate number of observations (), the columns indicate , the proportion of dimensions where . The number of iterations is 1000, and the conservative s.e of the simulation is 0.016.

As seen in Figure 2 the SH variants and Lopes tests are the only tests that do not lose power as the correlation increases as evident by their increasing power curves across the x-axis in each of the plot. This is an example of the difference between tests which utilize the sum of signal across the principle components versus tests such as SD and CQ that mainly rely on summing the signal across the original dimensions. It is also worth noting that in the sparse case the Simes procedure performs well as expected.

In Figure 3a, we see the power curves of the tests under a milder covariance structure, where the block size of equally correlated dimensions size is 20, rather than 100. One of the downsides of the SH procedure becomes apparent: if the number of observations is small then the number of dimensions sampled at each iteration is also small, which makes it harder for the test to detect a signal that is dense. This can be seen in the downward trend of the SH power curves across the columns of the figure. Notice that the trend decreases as the number of observations increases. This is unlike the rest of the tests such as the CQ, SD and Lopes which power does not change as a result of changes in the number of false null hypotheses. It is important to mention that the Simes test is better than all other tests except for SH in the sparse setting (). Still, the SH variants dominate the other tests except for the instance where there the signal is dense () and the number of observations is small ().

Figure 3: Parametric tests power comparison, (a) block covariance matrix, block size is 20 and (b) AR covariance matrix. The x-axis is the covariance parameter and the y-axis is power. The number of iterations is 1000, and the conservative s.e of the simulation is 0.016.

Figure 3b displays the results of the least favorable scenario for the SH variants, the signal is relatively dense () and the covariance structure is close to independence ( with exponential decay). Indeed, the power of the SH and Lopes tests is lower than the power of SD and CQ. However when the number of observations or increases, so that either more dimensions are sampled or there is more signal across the principle components, the SH tests becomes more competitive, as can be seen in the power curves trend upwards across rows and the x-axis. Overall, it seems that each test has its preferable scenarios: for the Simes test it is a strong and sparse signal across independent dimensions, for the CQ and SD it is when the signal is spread across independent dimensions, and for the Lopes test it is when the signal is dense and spread across strongly dependent dimensions. A way to encompass the comparisons across all scenarios jointly is presented next.

5.2.3 Min-min power comparison (A-la Princeton)

The SH test dominates in some scenarios but even in its least favorable scenarios retains a reasonable amount of power. It performs best when the dimensions are strongly correlated with a sparse signal, but even when the correlation is weak, if the number of observations, , is large enough for to capture most of the signal, it returns decent results. This performance can be quantified by using empirical min-min criterion (A-la Princeton simulation study),


being a scenario, SD, CQ, SH, Lopes, Simes. The score indicates how well a test performs relative to the best among the compared tests under its least favorable condition. is calculated separately for each , the number of observations, which is a known parameter of the problem when testing the global null.

SD CQ Lopes Simes SH SH
20 0.022 0.066 0.063 0.124 0.383 0.371
50 0.009 0.089 0.108 0.072 0.439 0.456
100 0.008 0.090 0.131 0.060 0.551 0.551
Table 2: A-la Princeton simulation study results, finite sample min-min results.

In the equal-correlated covariance structure (4) the correlation between the dimensions is high and the SH test variants have very high power. We therefore removed the equal correlation case from the following simulation. The SH variants dominate the other tests. This is because most of the tests specialize in a certain kind of signal, while the SH test power is not always the highest, it does not plummet in any scenario.

5.3 Simulation - non-equal covariance matrices

In this section we study the SH variant for the non-equal covariance matrices and compare its power to that of the other available tests. The simulation is conducted as in 5.2, but with and . We first assess the type I error for the tests that controlled the type I error in the equal covariance simulation, namely, SD, CQ, Simes (with Welch t-test), SH and SH, and then continue to compare their power.

5.3.1 Type I error comparison

20 0.3 0.001 0.061 0.048 0.035 0.015
0.45 0.006 0.061 0.047 0.021 0.03
0.6 0.005 0.056 0.043 0.028 0.025
0.75 0.015 0.049 0.049 0.037 0.03
0.95 0.018 0.062 0.027 0.023 0.021
50 0.3 0.014 0.06 0.049 0.046 0.044
0.45 0.023 0.061 0.046 0.045 0.046
0.6 0.027 0.06 0.041 0.046 0.04
0.75 0.029 0.062 0.052 0.048 0.024
0.95 0.03 0.071 0.029 0.039 0.023
100 0.3 0.032 0.05 0.045 0.061 0.045
0.45 0.028 0.05 0.052 0.051 0.041
0.6 0.027 0.051 0.058 0.058 0.044
0.75 0.037 0.058 0.048 0.047 0.029
0.95 0.032 0.073 0.03 0.047 0.032
Table 3: Non-equal covariance matrix tests, type I error rates for the AR covariance matrix scenario. is in correlation parameter. The number of iterations is 1000, and the conservative s.e of the simulation is 0.016.
Block Size SD CQ Simes SH SH
20 20 0.5 0.025 0.057 0.048 0.029 0.031
0.65 0.021 0.054 0.029 0.019 0.029
0.8 0.019 0.052 0.025 0.025 0.012
50 0.5 0.02 0.058 0.043 0.015 0.015
0.65 0.017 0.061 0.032 0.019 0.015
0.8 0.01 0.058 0.021 0.022 0.015
100 0.5 0.027 0.07 0.034 0.025 0.023
0.65 0.022 0.073 0.033 0.025 0.023
0.8 0.018 0.07 0.018 0.03 0.024
50 20 0.5 0.027 0.058 0.047 0.044 0.029
0.65 0.026 0.063 0.039 0.055 0.03
0.8 0.024 0.066 0.022 0.035 0.023
50 0.5 0.036 0.077 0.044 0.046 0.038
0.65 0.025 0.076 0.034 0.052 0.036
0.8 0.018 0.074 0.024 0.052 0.022
100 0.5 0.021 0.066 0.036 0.042 0.037
0.65 0.013 0.066 0.028 0.048 0.037
0.8 0.007 0.066 0.022 0.05 0.024
100 20 0.5 0.04 0.067 0.048 0.057 0.037
0.65 0.038 0.07 0.036 0.068 0.036
0.8 0.03 0.07 0.031 0.063 0.033
50 0.5 0.034 0.072 0.054 0.063 0.05
0.65 0.028 0.073 0.036 0.066 0.039
0.8 0.021 0.073 0.024 0.057 0.033
100 0.5 0.027 0.07 0.035 0.066 0.038
0.65 0.013 0.069 0.031 0.074 0.044
0.8 0.007 0.069 0.015 0.061 0.041
Table 4: Non-equal covariance matrix tests type I error rates for the block covariance matrix scenario. is the in block covariance. The number of iterations is 1000, and the conservative s.e of the simulation is 0.016.

The type I error is kept by all tests in all of the scenarios. The type I error rates indicate that the tests are slightly more conservative compared to 5.2.2 for small and . We are left to assess the tests power of in the different scenarios.

5.3.2 Power Comparison

The signal was generated by ensuring that . This results in a smaller signal to noise ratio than in the parametric simulations tests (as the covariance matrix was inflated).

Figure 4: Non-equal covariance tests power comparison, AR covariance matrix (). The number of iterations is 1000, and the conservative s.e of the simulation is 0.016.

As can be expected, the power of the procedure is lower because only or observations are available to estimate each covariance matrix rather than the in the equal covariance case, with the implication on the degrees of freedom of the test statistic distribution. The difference between the compared tests is similar to the differences in the parametric setting, as evident by the power curves in Figure 4 intersecting roughly in the same points as in Figure 3b.

5.3.3 Min - min Power Comparison (A-la Princeton)

Evaluating the min-min performance of the tests, after removing again, the equally correlated covariance scenario (4).

20 0.031 0.078 0.146 0.234 0.235
50 0.009 0.080 0.074 0.404 0.401
100 0.007 0.083 0.035 0.512 0.468
Table 5: A-la Princeton simulation study results for the non-equal covariance tests, finite sample min-min results.

The results are similar to those seen in the equal covariance case. The SH loses power in certain scenarios, but even in the least favorable scenario its power does not deteriorate much unlike the other tests.

6 Real data examples

The proposed combination of Simes and Hotelling tests (SH) was also applied to real data, found in previous papers. We show that it performs as well as other tests, in terms of getting low p-values while enjoying full guarantee of type I error.

6.1 Colon cancer data

Alon et al. [1] compared the expression level of 2000 genes between 22 normal colon tissues and 40 tumor colon tissues. Same example was used in [29] the data is available at (http://genomics-pubs.princeton.edu/oncology/affydata/index.html) and was taken from [1].

Tests SD CQ Lopes Simes SH SH
Table 6: The resulting p-values for the tests of the gene expression levels (the random tests SH and Lopes test were repeated ten times and averaged to lower variance)

All tests except for SD reject the null hypothesis. CQ seems to be the most sensitive but recall that it fails to control the type I error.

6.2 Mitochondrial calcium concentration

The next data set was used in [11] and was taken from [25]. The researcher tracked the Calcium count level every ten second during the hour, after administrating a dose of Cariporide. They repeated the experiment twice, once for intact cells, and once for permeabilized membranes. The pre-processing here was conducted in the same way as in [11]. We used the non-equal covariance matrix variant of the test.

Tests GCT CQ Simes SH SH
Table 7: The resulting p-values for the tests of the Calcium count levels for intact cells (SH was repeated ten times and averaged to lower variance)

Again, all tests reject the null hypothesis.


The suggested test seems to be powerful when there is strong correlation between dimensions. This is because the Mahalanobis distance between and can be bigger than the Euclidean distance in such a scenario.

Example 6.1.

Suppose where , is known. Then . Taking , the expected Mahalanobis distance is , while the expected Euclidean distance .

We can see that dimensions which do not contain any signal influence the Mahalanobis distance between the mean vectors. When sampling dimensions we obtain various estimates to the Mahalanobis distance between parts of the mean vector. This is beneficial since the Simes test is powerful for a single strong effect but retains its power for many moderate effects.

To conclude, we have suggested a two-sample test for detecting a shift between the mean vectors of two multivariate normal distributions, by combining the Simes test and the Hotelling test. While the Hotelling test cannot be used when , and the Simes test does not take dependency into account, combining them resolves both issues. The procedure is different from other tests in two important ways: (i) It allows for the incorporation of the covariance matrix, as seen in the simulation in the scenarios where the dimensions are correlated; (ii) The procedure is relatively computationally efficient, as it does not require conducting permutations to obtain a p-value. The test retains power across widely different scenarios, as can be seen in the min-min simulations. While many tests have a preferred scenario, the suggested test works well in most of them, often achieving the highest or second highest power. The simulation results can be summarized to a rule of thumb of which test to use for various signal and dependencies strength (assuming ), as can be seen in the table below.

Signal Dependency Test to use
Very sparse Weak Simes
Sparse Strong SH
Dense Weak CQ
Dense Strong SH
Table 8: Rule of thumb summary. When to use each test according to expected signal sparsity and covariance between dimensions.

If computational power is not an issue, then it is recommended to use PSH. As was mentioned earlier, the simulation study is not exhaustive, and missing several tests such as Higher Criticism [10] which is suitable to deal with a much larger amount of hypotheses but relies on the assumption of independence, as well as non-parametric tests such as suggested in [13], which is suited for testing a general difference between distributions and not a specific difference between the mean vectors. It could be interesting to check the effect of combining sampled dimensions for different tests, as this methodology can be appropriate for other problems.


  • [1] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine (1999-JUN 8)

    Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays

    Proceedings of the National Academy of Sciences of the United States of America 96 (12), pp. 6745–6750. External Links: Document, ISSN 0027-8424 Cited by: §6.1.
  • [2] Z. Bai and H. Saranadasa (1996-APR) Effect of high dimension: By an example of a two sample problem. Statistica Sinica 6 (2), pp. 311–329. External Links: ISSN 1017-0405 Cited by: §1.1, §3.1.1, §3.1.2.
  • [3] Y. Benjamini and D. Yekutieli (2001-AUG) The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29 (4), pp. 1165–1188. External Links: ISSN 0090-5364 Cited by: Theorem 1.2, §2, §2.
  • [4] J. Bibby, J. Kent, and K. Mardia (1979) Multivariate analysis. Academic Press, London. Cited by: §1.1, Lemma 1.3, Lemma 1.3, §1.
  • [5] T. T. Cai, W. Liu, and Y. Xia (2014-MAR) Two-sample test of high dimensional means under dependence. Journal of the Royal Statistical Society Series B-Statistical Methodology 76 (2), pp. 349–372. External Links: Document, ISSN 1369-7412 Cited by: §3.1.6, §3.1, item 3, §5.
  • [6] T. Cai, W. Liu, and X. Luo (2011-JUN) A Constrained l(1) Minimization Approach to Sparse Precision Matrix Estimation. Journal of the American Statistical Association 106 (494), pp. 594–607. External Links: Document, ISSN 0162-1459 Cited by: §3.1.6.
  • [7] T. Cai and W. Liu (2011-JUN) Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association 106 (494), pp. 672–684. External Links: Document, ISSN 0162-1459 Cited by: §3.1.6.
  • [8] S. X. Chen and Y. Qin (2010-APR)

    A two-sample test for high-dimensional data with applications to gene-set testing

    Annals of Statistics 38 (2), pp. 808–835. External Links: Document, ISSN 0090-5364 Cited by: §3.1.3, §5.
  • [9] A. P. Dempster (1958) A high dimensional two sample significance test. The Annals of Mathematical Statistics, pp. 995–1010. Cited by: §3.1.1, §3.1.2, §3.1.
  • [10] D. Donoho and J. Jin (2004-JUN) Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics 32 (3), pp. 962–994. External Links: Document, ISSN 0090-5364 Cited by: Conclusions.
  • [11] K. B. Gregory, R. J. Carroll, V. Baladandayuthapani, and S. N. Lahiri (2015-JUN) A Two-Sample Test for Equality of Means in High Dimension. Journal of the American Statistical Association 110 (510), pp. 837–849. External Links: Document, ISSN 0162-1459 Cited by: §5, §6.2.
  • [12] K. Gregory (2014) HighD2pop: two-sample tests for equality of means in high dimension. Note: R package version 1.0 External Links: Link Cited by: §5.2.
  • [13] R. Heller and Y. Heller (2016) Multivariate tests of association based on univariate tests. In Advances in Neural Information Processing Systems 29 (NIPS 2016), Lee, DD and Sugiyama, M and Luxburg, UV and Guyon, I and Garnett, R (Ed.), Advances in Neural Information Processing Systems, Vol. 29. Note: 30th Conference on Neural Information Processing Systems (NIPS), Barcelona, SPAIN, 2016 External Links: ISSN 1049-5258 Cited by: Conclusions.
  • [14] C. Hemmelmann, M. Horn, S. Reiterer, B. Schack, T. Susse, and S. Weiss (2004-OCT 15) Multivariate tests for the evaluation of high-dimensional EEG data. Journal of Neuroscience Methods 139 (1), pp. 111–120. External Links: Document, ISSN 0165-0270 Cited by: §1.
  • [15] N. Henze (1988) A multivariate two-sample test based on the number of nearest neighbor type coincidences. The Annals of Statistics, pp. 772–783. Cited by: §5.1.1.
  • [16] H. Hotelling (1931) The generalization of student’s ratios. Annals of Mathematical Statistics 2, pp. 360–378. Cited by: §1.
  • [17] S. Karlin and Y. Rinott (1981)

    Total positivity properties of absolute value multinormal variables with applications to confidence interval estimates and related probabilistic inequalities

    Annals of Statistics 9 (5), pp. 1035–1049. External Links: Document, ISSN 0090-5364 Cited by: §2.
  • [18] A. F. Karr (1993) Probability. Springer. Cited by: Lemma 1.3.
  • [19] H. Keselman, R. Cribbie, and B. Holland (2002-MAY) Controlling the rate of Type I error over a large set of statistical tests. British Journal of Mathematical & Statistical Psychology 55 (1), pp. 27–39. External Links: Document, ISSN 0007-1102 Cited by: §2.
  • [20] K. Krishnamoorthy and J. Yu (2004-JAN 15) Modified Nel and Van der Merwe test for the multivariate Behrens-Fisher problem. Statistics & Probability Letters 66 (2), pp. 161–169. External Links: Document, ISSN 0167-7152 Cited by: §4.1.
  • [21] J. Läuter (2013) Simes’ theorem is generally valid for dependent normally distributed variables. In Invited talk at the international conference on simultaneous inference, Cited by: §2.
  • [22] L. Lin and W. Pan (2016) Highmean: two-sample tests for high-dimensional mean vectors. Note: R package version 3.0 External Links: Link Cited by: §5.2.
  • [23] M. Lopes, L. Jacob, and M. J. Wainwright (2011) A more powerful two-sample test in high dimensions using random projection. In Advances in Neural Information Processing Systems, pp. 1206–1214. Cited by: §3.1.3, §3.1.4, §3, §5.
  • [24] A. Reiner-Benaim (2007-FEB) FDR control by the BH procedure for two-sided correlated tests with implications to gene expression data analysis. Biometrical Journal 49 (1), pp. 107–126. Note: 4th International Conference on Multiple Comparison Procedures (MCP2005), Shanghai, PEOPLES R CHINA, AUG 17-19, 2005 External Links: Document, ISSN 0323-3847 Cited by: §2.
  • [25] M. Ruiz-Meana, D. Garcia-Dorado, P. Pina, J. Inserte, L. Agullo, and J. Soler-Soler (2003-SEP) Cariporide preserves mitochondrial proton gradient and delays ATP depletion in cardiomyocytes during ischemic conditions. American Journal of Physiology-heart and Circulatory Physiology 285 (3), pp. H999–H1006. External Links: Document, ISSN 0363-6135 Cited by: §6.2.
  • [26] S. Sarkar (1998-APR) Some probability inequalities for ordered MTP2 random variables: A proof of the Simes conjecture. Annals of Statistics 26 (2), pp. 494–504. External Links: ISSN 0090-5364 Cited by: §2.
  • [27] Y. Shen and Z. Lin (2015-SEP) An adaptive test for the mean vector in large-p-small-n problems. Computational Statistics & Data Analysis 89, pp. 25–38. External Links: Document, ISSN 0167-9473 Cited by: §5.1.3.
  • [28] R. J. Simes (1986-DEC) An improved Bonferroni procedure for multiple tests of significance. Biometrika 73 (3), pp. 751–754. External Links: Document, ISSN 0006-3444 Cited by: §1.2, §1.2.
  • [29] M. S. Srivastava and M. Du (2008-MAR) A test for the mean vector with fewer observations than the dimension. Journal of Multivariate Analysis 99 (3), pp. 386–402. External Links: Document, ISSN 0047-259X Cited by: §3.1.2, §5, §6.1.
  • [30] W. Stadje (1990) The collector’s problem with group drawings. Advances in Applied Probability, pp. 866–882. Cited by: §3.
  • [31] M. Thulin (2014-JUN) A high-dimensional two-sample test for the mean using random subspaces. Computational Statistics & Data Analysis 74, pp. 26–38. External Links: Document, ISSN 0167-9473 Cited by: §2, §4.2, §5.1.2, §5.
  • [32] V. Williams, L. Jones, and J. Tukey (1999-SPR) Controlling error in multiple comparisons, with examples from state-to-state differences in educational achievement. Journal of Educational and Behavioral Statistics 24 (1), pp. 42–69. External Links: Document, ISSN 1076-9986 Cited by: §2.
  • [33] M. Xiong, J. Zhao, and E. Boerwinkle (2002-MAY) Generalized T-2 test for genome association studies. American Journal of Human Genetics 70 (5), pp. 1257–1268. External Links: Document, ISSN 0002-9297 Cited by: §1.
  • [34] Y. Yang, T. Yamada, K. K. Hill, M. Hemberg, N. C. Reddy, H. Y. Cho, A. N. Guthrie, A. Oldenborg, S. A. Heiney, S. Ohmae, J. F. Medina, T. E. Holy, and A. Bonni (2016-JUL 15) Chromatin remodeling inactivates activity genes and regulates neural coding. SCIENCE 353 (6296), pp. 300–305. External Links: Document, ISSN 0036-8075 Cited by: §1.
  • [35] D. Yekutieli (2008-FEB 1) False discovery rate control for non-positively regression dependent test statistics. Journal of Statistical Planning and Inference 138 (2), pp. 405–415. External Links: Document, ISSN 0378-3758 Cited by: §2.
  • [36] J. Zhang and M. Pan (2016-MAY) A high-dimension two-sample test for the mean using cluster subspaces. Computational Statistics & Data Analysis 97, pp. 87–97. External Links: Document, ISSN 0167-9473 Cited by: §5.1.4.

1 Theorem and Lemmas

Theorem 1.1.

for .

The proof start with the observation that the Hotelling statistic of the subset can be rewritten as (as in [4])






We also require to prove the following Lemmas.

Lemma 1.1.

For , then



w.l.o.g assume . Let where , then


Allowing us to calculate the covariance

Lemma 1.2.

If then


, is a

matrix whose columns are the eigenvectors of


is a diagonal matrix of eigenvalues. Under the null hypothesis

is distributed as , being the eigenvalue with respect to the corresponding eigenvector in the ’th column of . Writing the quadratic form in as a sum,

The covariance between and is

Using the result from Lemma 1.1 and defining , we obtain

Concluding .

Lemma 1.3.

If then



The joint distribution of can be written as


, [4]. Therefore, is dependent solely on , we can define , and.

Since Cochran’s Theorem, [4] and and are Borel measurable functions, according to Theorem 3.1 [18], , concluding the proof of Lemma 1.3.

We can now turn to prove Theorem 1.1, define , and , . We will use Taylor approximation to estimate a at the expansion point , where .