1 Introduction
A common problem in statistics is testing the equality of two mean vectors and based on two independent random samples, and . Formally, the hypothesis being tested is
(1.1) 
This problem appears in many fields of science, including but not limited to analyzing EEG data [14], fMRI data [33], Neurobiology [34] and more. At the first stage of the analysis the interest lies in detecting a difference and not necessarily identifying where this difference lies. A common approach to test the hypothesis is using the classical Hotelling test [16] which assumes that and
follow the same multivariate normal distribution
.1.1 Hotelling test
The Hotelling test statistic is defined as follows:
,
where and
are the estimated mean vectors,
and is the pooled covariance matrix estimator. Under , the test statistic follows the distribution, the pvalue of the test is given by, the null hypothesis is rejected for large value of the test statistic only.
The test is a natural generalization of the onedimensional ttest. It is invariant under any nonsingular linear transformations and is the uniformly most powerful test among all tests which share this invariance property
[4]. The test cannot be used when , a common situation in modern scientific research, as the estimated covariance matrix is not of full rank, and therefore its inverse is undefined. Moreover, even for , as approaches , the power of the test deteriorates [2]. Several approaches have been suggested for testing 1.1 in the setting. Many of them rely on a different method to estimate , which will guarantee that it would be of full rank.1.2 Simes test
A very different approach is to rely on the logical fact that . This allows us to test each dimension marginally and reject the globalnull hypothesis using a combination test, such as Fisher, Simes, or Stouffer.
Assuming that and follow the multivariate normal distribution, each one of their dimensions also follows the normal distribution and . Under these conditions we can test the marginal hypotheses , by applying the ttest. To complete the procedure, we need to find a test of the global null hypothesis, . Since the tests are dependent, a good candidate is the Simes test [28], which does not rely on the assumption of independence.
The Simes test, is a test of the global null, making use of the pvalues, , which in our context, the pvalues are acquired by marginally testing each dimension. The pvalues are sorted, , and the test statistic is
Under the null hypothesis, if the test statistics are independent then [28]. While the Simes test offers a method to test the globalnull, it ignores the covariance structure as each dimension is tested marginally.
We suggest combining the Hotelling and Simes tests to construct a new test for the hypothesis of the two sample mean equality in the dependent setting.
The rest of the paper is organized as follows: In Section 2 we propose the test, in Section 3, default test parameters values are suggested. In Section 4 we extend the test by removing the equal covariance assumption and the normality assumption using a permutation test based on the suggested test. In Section 5 we present the results of the simulation study comparing our suggested test with existing alternatives, and finally in Section 6 we apply the suggested test to two realdata examples.
2 The proposed test
In order to reject the null hypothesis presented in Eq. 1.1, it is enough to reject any marginal hypothesis . However, we need not limit ourselves to a single marginal hypothesis as we can test different subset of hypotheses, . Any subset of hypotheses rejected indicates that the global null hypothesis should also be rejected.
Utilizing this idea, define to be the group of all possible subset of dimensions, . At each iteration we sample dimensions, and conduct the Hotelling test of the null hypothesis . Repeat the process times and combine the resulting pvalues from the random Hotelling statistics using the Simes test and compute , then test at level . Algorithmically the test procedure is
Each Hotelling test under the intersection hypothesis (
) yields a uniformly distributed pvalue at step 2. We denoted the test as
, for SimesHotelling. One of the reasons to use the Simes test, is that it remains valid under various dependency structures of the test statistics. In particular, the Simes test has been proven to be conservative if the positive regression dependency on subset (PRDS) property between the test statistics holds [3]. This is type of dependency likely to arise when two samples of dimensions overlap.Property 2.1 ().
For any increasing set , and for each is nondecreasing in .
Unfortunately, even if is PRDS, is not necessarily PRDS, and therefore does not assure the validity of Simes for pvalues stemming from twosided tests. For pvalues originating from twosided Gaussian tests, if is multivariate total positivity of order 2 (MTP), or if it is so after changing the sign of some of the s, then so is , thereby implying the validity of the Simes test [17], [26].
Property 2.2 ().
is MTP if for all and ,
(2.1) 
where is the joint density, and the minimum and maximum are evaluated componentwise.
Even for pairwise dependency structure which is neither MTP nor PRDS [35] gave strong theoretical support for its validity. These are known theoretical results establishing the validity of Simes test for . Still, PRDS is not a necessary condition for the validity of Simes test: the case of ttests derived from PRDS normal distribution scaled by the same independent estimator
is not PRDS yet the BenjaminiHochberg (BH) procedure and the Simes test are valid for a significance level less than 0.5. Evidence of the conservatism of the Simes test based on twosided pvalues under general dependency for Gaussian distribution have accumulated since then. This is based on numerous reported simulation studies, see
[32], [19], [24], [21], as well as unreported ones. Even though the global bound ofon the level of Simes test for any joint distribution has been shown to be attainable
[3], a counter example based on Gaussian distributed test statistics has not been found. The distribution for which the bound is attained is far from being Gaussian, not being ellipsoidal or even unimodal. Hence Simes test is widely considered valid for twosided tests based on Gaussian or tdistributions under any dependency.We show that the covariance between pairs of Hotelling statistics is asymptotically positive and they are asymptotically distributed as multivariate Gaussian. Since positively correlated multivariate Gaussian hold the PRDS property this ensures the validity of the Simes test. For further details see the supplementary material.
Theorem 2.1.
If , then for .
The theorem is stated using a single sample, , but can easily be extended to accommodate twosamples. We support the validity of the procedure for finite sample sizes in the Simulation section. The proposed test has certain advantages over the classic Hotelling test, as it is applicable when , and it allows to inspect smaller subsets of the dimensions. It also has advantages over the marginal Simes procedure, it combines a number of weak signals potentially improving the power and incorporates information regarding the covariance structure. In some sense the procedure is like that suggested by Thulin [31], where the sampled Hotelling statistics are averaged, and the computation of pvalue relies on permutations. The suggested procedure does not rely on permutation for the computations and therefore enjoys a lower computational burden. It remains to determine the values of parameter and , which will be done in the next section.
3 Considerations for choosing and in
We follow the result of [23], and set . Since the proof of the property underlying their recommendation is asymptotic, we will also consider another option in our simulations study.
For a given parameter there are as many as different dimension subsets of size . Even for small and the number of all possible samples is too large to consider each any every possibility. Therefore, we must find a suitable value for the number of samples, , to be taken. Note that if we happen to sample the same set of dimensions twice the Simes test does not lose power even though the number of tests increases because of its stepup nature. We would like to ensure that each dimension will be chosen at least once, as to avoid situations where only null hypothesis dimensions are captured by the sampling, which could lead to a loss of power (see Fig. 1). For computational reason we would like to use smaller . Finding the number of samples needed to cover all the dimensions is known as the Coupon Collector Problem, and for choosing one dimensions at each time the expected number of samples needed to ensure it,
(3.1) 
where is the MascheroniEuler constant. For selecting more than one dimensions at each sample see [30], the above equation can be viewed as an upper bound for the expected number of draws when selecting more than one dimension. Setting
seems to satisfy the condition of ensuring the selection of all dimensions with high probability.
The power increases by sampling the dimensions rather than conducting a single Simes test over all dimensions. Suppose that the proportion of dimensions with no signal is . If testing is conducted marginally (ttest followed by Simes) there are test statistics distributed according to the null distribution. By sampling each time dimensions and combining them using Hotelling, the proportion of statistics that are from the null distribution is
(3.2) 
The gain from the sampling mechanism is not only by allowing the incorporation of data regarding the covariance matrix, but also by reducing the ratio of null to nonnull test statistics. The parameter can be viewed as controlling the tradeoff between including data regarding the covariance matrix and the averaging of effects. If the procedure would correspond to the marginal testing of components and will include no data regarding the covariance matrix, and for the procedure would correspond to the standard Hotelling test.
The test seems to lose power for all values of when is small, for larger values of it seems to have little effect (see Figure 1, the difference between to is large, yet the the difference between and is almost negligible). While , is the suggested value, it is not consistently the best option. The optimal is dependent both on , and on . When signal is spread across more dimensions, it is better to select a larger value for , but if the signal is centered only on a few dimensions it seems that a smaller value of should be considered. Usually, this type of information is not known, and it seems that is a good compromise, having good power for most scenarios. Further theoretical justification for the value is given in Section ”Choosing sample size in the supplementary material”.
4 Extensions
4.1 Non equal covariance matrices
The testing of Gaussian means when the variances are not equal, also known as the BehrensFisher problem, has been of interest from the early age of Statistics. The natural statistic in one dimension is
The common solution is approximating
using the tdistribution and using the Welch method to evaluate the degree of freedom. The corresponding statistic in high dimensions would be:
To find its distribution we use the approximation by [20]
where the degree of freedom is calculated as follows
and (or Y accordingly) and . This suggestion was shown to have the best power out of the invariant methods. It can be easily integrated into our proposed test, using instead of the usual Hotelling statistic for the sampled dimensions. Note, that Result 2.1 still holds as the only thing changed is the number of degrees of freedom. We assess the performance of the procedure and then compare the performance of three other methods in a simulation study (see Section 5.3).
4.2 Extension to nonGaussian distributions
In order to overcome the reliance of the suggested test on the Gaussianity assumption, we can conduct a permutation test of the same statistic, resulting with a distribution free test. The only assumption needed is that the two samples are interchangeable. Under the global null there is no difference between the two groups, the two distributions are equal, so observations in the two groups are interchangeable. Notice, that the nonparametric extension of the procedure no longer test for equality of the mean vector but the equality of the distributions, .
The permutation test is the following, for each sample obtain the Hotelling statistics, for each such sample shuffle the labels times, and obtain the permutation pvalue, for each shuffle and sample. Now use function to combine pvalues for each permutation resulting in a single vector of pvalues. From there compare the shuffled pvalues to the original to obtain the final pvalue. Algorithmically the permutation test is the following
After obtaining the pvalues for all the subsets, one can either combine them using any valid test (Bonferroni, the brave scientist may still use Simes), or use any function coupled with a permutation test. This method is similar to the method suggested by [31] and shares the disadvantage of being computationally expensive. A small study of additional nonparametric tests and their comparison to the suggested test can be found in the supplementary material.
5 Simulations
A simulation study was conducted in order to evaluate the suggested test in terms of Type I error and power in comparison to other tests. The other tests considered rely on various approaches, regularization based methods
[29] (SD), [8] (CQ), that are similar to Hotelling but the estimate of is replaced by a matrix which can be inverted, and thereby allow for the calculation of the test statistic. Their main drawback is the loss of power when the covariance structure is not negligible [23]. The random projection type tests such as [23] (Lopes) and the nonparametric test suggested by Thulin [31], where the data are projected into a smaller subspace, thus allowing the calculation of Hotellinglike test statistic. The suggested method falls under this category (using Simes instead of permutation tests). A test based on the supremum of the standardized differences between the observed mean vectors [5] (CLX). Another test is based on the smoothed ttest statistics [11] (GCT). More information regarding the tests can be found in the supplementary material. The simulation section is split into two subsections each dedicated to a different type of tests: subsection 5.2, Parametric tests, and subsection 5.3 covers tests which do not require the equal covariance assumption. The section regarding the performance of the nonparametric tests can be found in the supplementary material 5.2. In this section we present the most interesting results, the full comprehensive tables and figures can be found in the supplementary material. The simulation code can be accessed in github.com/…/SimulationsSimesHotelling.5.1 Simulations parameters
Since the goal is to evaluate the tests over a wide array of scenarios, we varied multiple parameters of the data generation.

The number of observations: .

The number of dimensions, : For the equal and nonequal covariance matrix samples, , for the nonparametric simulation .

The sparsity, , varied between .
We considered 4 different types of covariance matrices.

An auto regression (AR) matrix, that is , , .

The block covariance matrix, where we varied the in block correlation , and the size of each block .

Model 7 presented in [5], where and for , . is a diagonal matrix with sampled from a Uniform distribution .

Equally correlated covariance matrix , , , . .
The above list of covariance matrices represents scenario used in other matrices, as well as covariance structures that resembles those encountered in practice. To simulate the signal, we kept , and vary . is divided into 6 subvectors, 1 subvector of size (), whose entries are 0. The rest of the subvectors have a size of where each subvector elements are . To find we determine , and solve
for . The norm
is chosen to set a specific signal to noise ratio, allowing us to compare different scenarios results. The nonzero entries are determined at random. We followed a full factorial design observations
1 dependency structure. All tests are evaluated at level .5.2 Simulations  parametric tests
The parametric tests that were chosen for comparison are SD 3.1.2, CQ 3.1.3, Lopes 3.1.4, GCT 3.1.5, CLX 3.1.6 and the suggested test (SH) with number of dimensions sampled at each iteration . We kept the number of samples constant at , so we shall simply denote the test as . The implementation of GCT and CLX used can be found in the R package highD2pop [12], and the implementation of SD, CQ in the R package highmean [22]. Across all simulations the number of repetitions is 1,000, such that the conservative deviation of the estimated power and Type I error is . All tests are conducted on the same simulated data. Throughout the simulation to allow us to compare between the rest of the simulations shown.
5.2.1 Type I Error Comparison
Here we describe our main finding with some of the supporting evidence. The complete results of the Type I error are presented in the supplementary material.
SD  CQ  Lopes  Simes  GCT  CLX  KNN  SH  SH  

20  0.3  0.034  0.051  0.051  0.066  0.077  0.19  0.114  0.054  0.051 
0.45  0.034  0.056  0.052  0.048  0.074  0.178  0.113  0.045  0.046  
0.6  0.035  0.057  0.052  0.049  0.074  0.19  0.119  0.052  0.056  
0.75  0.029  0.047  0.07  0.05  0.077  0.188  0.111  0.049  0.044  
0.95  0.019  0.05  0.046  0.03  0.164  0.15  0.124  0.054  0.032  
50  0.3  0.046  0.06  0.046  0.049  0.065  0.08  0.101  0.051  0.05 
0.45  0.044  0.061  0.047  0.043  0.068  0.1  0.1  0.048  0.052  
0.6  0.041  0.053  0.041  0.045  0.064  0.057  0.095  0.044  0.047  
0.75  0.034  0.056  0.047  0.043  0.077  0.093  0.103  0.052  0.038  
0.95  0.034  0.074  0.043  0.024  0.191  0.137  0.12  0.042  0.028  
100  0.3  0.043  0.049  0.046  0.046  0.056  0.072  0.094  0.04  0.04 
0.45  0.04  0.05  0.036  0.048  0.056  0.054  0.098  0.045  0.028  
0.6  0.04  0.053  0.048  0.06  0.068  0.031  0.087  0.042  0.042  
0.75  0.043  0.054  0.045  0.045  0.075  0.095  0.092  0.043  0.031  
0.95  0.029  0.067  0.052  0.025  0.178  0.099  0.113  0.036  0.029 
Across the various scenarios it can be seen that both SH variants keep the required error rate. As the correlation structure is stronger (Table LABEL:tab:Cov3_Alpha and Table 1), the Simes and SD tests are becoming more conservative. CQ test has a slightly elevated error rates () under all the covariance structures. It seems that a GCT, and CLX fail to keep type I error, especially when the covariance between the dimensions decays slowly (Table LABEL:tab:Cov3_Alpha). CLX performance was worst for the block covariance structure, due to the difficulty in estimating the covariance matrix. The covariance matrix was estimated using the adaptive threshold estimator, indeed, it can be expected that for sparse covariance matrices using the CLIME would have yielded better performance. The type I error of the GCT test holds asymptotically as and should have therefore performed better for larger . KNN also does not seem to keep the type I error, this is possibly due to the asymptotic nature of the test. It can be expected to perform better when . Due to these results, GCT, KNN and CLX tests were omitted from the rest of the simulations study.
5.2.2 Power comparison  parametric tests
We compare the power curves of the different tests in different scenarios. The power of the global null tests is affected by the sparsity of the signal, the covariance structure and number of observations. Considering the covariance structure can change the sparsity of the problem: if a signal is sparse in the original dimensions considered, it could be dense when considering it across the principle components. As can be seen in Lemma 1.2, in the supplementary material, Hotelling’s test sums the signal across the principal components, increasing the power of tests that take into consideration the covariance structure in certain situations.
As seen in Figure 2 the SH variants and Lopes tests are the only tests that do not lose power as the correlation increases as evident by their increasing power curves across the xaxis in each of the plot. This is an example of the difference between tests which utilize the sum of signal across the principle components versus tests such as SD and CQ that mainly rely on summing the signal across the original dimensions. It is also worth noting that in the sparse case the Simes procedure performs well as expected.
In Figure 3a, we see the power curves of the tests under a milder covariance structure, where the block size of equally correlated dimensions size is 20, rather than 100. One of the downsides of the SH procedure becomes apparent: if the number of observations is small then the number of dimensions sampled at each iteration is also small, which makes it harder for the test to detect a signal that is dense. This can be seen in the downward trend of the SH power curves across the columns of the figure. Notice that the trend decreases as the number of observations increases. This is unlike the rest of the tests such as the CQ, SD and Lopes which power does not change as a result of changes in the number of false null hypotheses. It is important to mention that the Simes test is better than all other tests except for SH in the sparse setting (). Still, the SH variants dominate the other tests except for the instance where there the signal is dense () and the number of observations is small ().
Figure 3b displays the results of the least favorable scenario for the SH variants, the signal is relatively dense () and the covariance structure is close to independence ( with exponential decay). Indeed, the power of the SH and Lopes tests is lower than the power of SD and CQ. However when the number of observations or increases, so that either more dimensions are sampled or there is more signal across the principle components, the SH tests becomes more competitive, as can be seen in the power curves trend upwards across rows and the xaxis. Overall, it seems that each test has its preferable scenarios: for the Simes test it is a strong and sparse signal across independent dimensions, for the CQ and SD it is when the signal is spread across independent dimensions, and for the Lopes test it is when the signal is dense and spread across strongly dependent dimensions. A way to encompass the comparisons across all scenarios jointly is presented next.
5.2.3 Minmin power comparison (Ala Princeton)
The SH test dominates in some scenarios but even in its least favorable scenarios retains a reasonable amount of power. It performs best when the dimensions are strongly correlated with a sparse signal, but even when the correlation is weak, if the number of observations, , is large enough for to capture most of the signal, it returns decent results. This performance can be quantified by using empirical minmin criterion (Ala Princeton simulation study),
(5.1) 
being a scenario, SD, CQ, SH, Lopes, Simes. The score indicates how well a test performs relative to the best among the compared tests under its least favorable condition. is calculated separately for each , the number of observations, which is a known parameter of the problem when testing the global null.
SD  CQ  Lopes  Simes  SH  SH  

20  0.022  0.066  0.063  0.124  0.383  0.371 
50  0.009  0.089  0.108  0.072  0.439  0.456 
100  0.008  0.090  0.131  0.060  0.551  0.551 
In the equalcorrelated covariance structure (4) the correlation between the dimensions is high and the SH test variants have very high power. We therefore removed the equal correlation case from the following simulation. The SH variants dominate the other tests. This is because most of the tests specialize in a certain kind of signal, while the SH test power is not always the highest, it does not plummet in any scenario.
5.3 Simulation  nonequal covariance matrices
In this section we study the SH variant for the nonequal covariance matrices and compare its power to that of the other available tests. The simulation is conducted as in 5.2, but with and . We first assess the type I error for the tests that controlled the type I error in the equal covariance simulation, namely, SD, CQ, Simes (with Welch ttest), SH and SH, and then continue to compare their power.
5.3.1 Type I error comparison
SD  CQ  Simes  SH  SH  

20  0.3  0.001  0.061  0.048  0.035  0.015 
0.45  0.006  0.061  0.047  0.021  0.03  
0.6  0.005  0.056  0.043  0.028  0.025  
0.75  0.015  0.049  0.049  0.037  0.03  
0.95  0.018  0.062  0.027  0.023  0.021  
50  0.3  0.014  0.06  0.049  0.046  0.044 
0.45  0.023  0.061  0.046  0.045  0.046  
0.6  0.027  0.06  0.041  0.046  0.04  
0.75  0.029  0.062  0.052  0.048  0.024  
0.95  0.03  0.071  0.029  0.039  0.023  
100  0.3  0.032  0.05  0.045  0.061  0.045 
0.45  0.028  0.05  0.052  0.051  0.041  
0.6  0.027  0.051  0.058  0.058  0.044  
0.75  0.037  0.058  0.048  0.047  0.029  
0.95  0.032  0.073  0.03  0.047  0.032 
Block Size  SD  CQ  Simes  SH  SH  

20  20  0.5  0.025  0.057  0.048  0.029  0.031 
0.65  0.021  0.054  0.029  0.019  0.029  
0.8  0.019  0.052  0.025  0.025  0.012  
50  0.5  0.02  0.058  0.043  0.015  0.015  
0.65  0.017  0.061  0.032  0.019  0.015  
0.8  0.01  0.058  0.021  0.022  0.015  
100  0.5  0.027  0.07  0.034  0.025  0.023  
0.65  0.022  0.073  0.033  0.025  0.023  
0.8  0.018  0.07  0.018  0.03  0.024  
50  20  0.5  0.027  0.058  0.047  0.044  0.029 
0.65  0.026  0.063  0.039  0.055  0.03  
0.8  0.024  0.066  0.022  0.035  0.023  
50  0.5  0.036  0.077  0.044  0.046  0.038  
0.65  0.025  0.076  0.034  0.052  0.036  
0.8  0.018  0.074  0.024  0.052  0.022  
100  0.5  0.021  0.066  0.036  0.042  0.037  
0.65  0.013  0.066  0.028  0.048  0.037  
0.8  0.007  0.066  0.022  0.05  0.024  
100  20  0.5  0.04  0.067  0.048  0.057  0.037 
0.65  0.038  0.07  0.036  0.068  0.036  
0.8  0.03  0.07  0.031  0.063  0.033  
50  0.5  0.034  0.072  0.054  0.063  0.05  
0.65  0.028  0.073  0.036  0.066  0.039  
0.8  0.021  0.073  0.024  0.057  0.033  
100  0.5  0.027  0.07  0.035  0.066  0.038  
0.65  0.013  0.069  0.031  0.074  0.044  
0.8  0.007  0.069  0.015  0.061  0.041 
The type I error is kept by all tests in all of the scenarios. The type I error rates indicate that the tests are slightly more conservative compared to 5.2.2 for small and . We are left to assess the tests power of in the different scenarios.
5.3.2 Power Comparison
The signal was generated by ensuring that . This results in a smaller signal to noise ratio than in the parametric simulations tests (as the covariance matrix was inflated).
As can be expected, the power of the procedure is lower because only or observations are available to estimate each covariance matrix rather than the in the equal covariance case, with the implication on the degrees of freedom of the test statistic distribution. The difference between the compared tests is similar to the differences in the parametric setting, as evident by the power curves in Figure 4 intersecting roughly in the same points as in Figure 3b.
5.3.3 Min  min Power Comparison (Ala Princeton)
Evaluating the minmin performance of the tests, after removing again, the equally correlated covariance scenario (4).
SD  CQ  Simes  SH  SH  
20  0.031  0.078  0.146  0.234  0.235 
50  0.009  0.080  0.074  0.404  0.401 
100  0.007  0.083  0.035  0.512  0.468 
The results are similar to those seen in the equal covariance case. The SH loses power in certain scenarios, but even in the least favorable scenario its power does not deteriorate much unlike the other tests.
6 Real data examples
The proposed combination of Simes and Hotelling tests (SH) was also applied to real data, found in previous papers. We show that it performs as well as other tests, in terms of getting low pvalues while enjoying full guarantee of type I error.
6.1 Colon cancer data
Alon et al. [1] compared the expression level of 2000 genes between 22 normal colon tissues and 40 tumor colon tissues. Same example was used in [29] the data is available at (http://genomicspubs.princeton.edu/oncology/affydata/index.html) and was taken from [1].
Tests  SD  CQ  Lopes  Simes  SH  SH 

Pvalue 
All tests except for SD reject the null hypothesis. CQ seems to be the most sensitive but recall that it fails to control the type I error.
6.2 Mitochondrial calcium concentration
The next data set was used in [11] and was taken from [25]. The researcher tracked the Calcium count level every ten second during the hour, after administrating a dose of Cariporide. They repeated the experiment twice, once for intact cells, and once for permeabilized membranes. The preprocessing here was conducted in the same way as in [11]. We used the nonequal covariance matrix variant of the test.
Tests  GCT  CQ  Simes  SH  SH 

Pvalue 
Again, all tests reject the null hypothesis.
Conclusions
The suggested test seems to be powerful when there is strong correlation between dimensions. This is because the Mahalanobis distance between and can be bigger than the Euclidean distance in such a scenario.
Example 6.1.
Suppose where , is known. Then . Taking , the expected Mahalanobis distance is , while the expected Euclidean distance .
We can see that dimensions which do not contain any signal influence the Mahalanobis distance between the mean vectors. When sampling dimensions we obtain various estimates to the Mahalanobis distance between parts of the mean vector. This is beneficial since the Simes test is powerful for a single strong effect but retains its power for many moderate effects.
To conclude, we have suggested a twosample test for detecting a shift between the mean vectors of two multivariate normal distributions, by combining the Simes test and the Hotelling test. While the Hotelling test cannot be used when , and the Simes test does not take dependency into account, combining them resolves both issues. The procedure is different from other tests in two important ways: (i) It allows for the incorporation of the covariance matrix, as seen in the simulation in the scenarios where the dimensions are correlated; (ii) The procedure is relatively computationally efficient, as it does not require conducting permutations to obtain a pvalue. The test retains power across widely different scenarios, as can be seen in the minmin simulations. While many tests have a preferred scenario, the suggested test works well in most of them, often achieving the highest or second highest power. The simulation results can be summarized to a rule of thumb of which test to use for various signal and dependencies strength (assuming ), as can be seen in the table below.
Signal  Dependency  Test to use 

Very sparse  Weak  Simes 
Sparse  Strong  SH 
Dense  Weak  CQ 
Dense  Strong  SH 
If computational power is not an issue, then it is recommended to use PSH. As was mentioned earlier, the simulation study is not exhaustive, and missing several tests such as Higher Criticism [10] which is suitable to deal with a much larger amount of hypotheses but relies on the assumption of independence, as well as nonparametric tests such as suggested in [13], which is suited for testing a general difference between distributions and not a specific difference between the mean vectors. It could be interesting to check the effect of combining sampled dimensions for different tests, as this methodology can be appropriate for other problems.
References

[1]
(1999JUN 8)
Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays
. Proceedings of the National Academy of Sciences of the United States of America 96 (12), pp. 6745–6750. External Links: Document, ISSN 00278424 Cited by: §6.1.  [2] (1996APR) Effect of high dimension: By an example of a two sample problem. Statistica Sinica 6 (2), pp. 311–329. External Links: ISSN 10170405 Cited by: §1.1, §3.1.1, §3.1.2.
 [3] (2001AUG) The control of the false discovery rate in multiple testing under dependency. Annals of Statistics 29 (4), pp. 1165–1188. External Links: ISSN 00905364 Cited by: Theorem 1.2, §2, §2.
 [4] (1979) Multivariate analysis. Academic Press, London. Cited by: §1.1, Lemma 1.3, Lemma 1.3, §1.
 [5] (2014MAR) Twosample test of high dimensional means under dependence. Journal of the Royal Statistical Society Series BStatistical Methodology 76 (2), pp. 349–372. External Links: Document, ISSN 13697412 Cited by: §3.1.6, §3.1, item 3, §5.
 [6] (2011JUN) A Constrained l(1) Minimization Approach to Sparse Precision Matrix Estimation. Journal of the American Statistical Association 106 (494), pp. 594–607. External Links: Document, ISSN 01621459 Cited by: §3.1.6.
 [7] (2011JUN) Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association 106 (494), pp. 672–684. External Links: Document, ISSN 01621459 Cited by: §3.1.6.

[8]
(2010APR)
A twosample test for highdimensional data with applications to geneset testing
. Annals of Statistics 38 (2), pp. 808–835. External Links: Document, ISSN 00905364 Cited by: §3.1.3, §5.  [9] (1958) A high dimensional two sample significance test. The Annals of Mathematical Statistics, pp. 995–1010. Cited by: §3.1.1, §3.1.2, §3.1.
 [10] (2004JUN) Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics 32 (3), pp. 962–994. External Links: Document, ISSN 00905364 Cited by: Conclusions.
 [11] (2015JUN) A TwoSample Test for Equality of Means in High Dimension. Journal of the American Statistical Association 110 (510), pp. 837–849. External Links: Document, ISSN 01621459 Cited by: §5, §6.2.
 [12] (2014) HighD2pop: twosample tests for equality of means in high dimension. Note: R package version 1.0 External Links: Link Cited by: §5.2.
 [13] (2016) Multivariate tests of association based on univariate tests. In Advances in Neural Information Processing Systems 29 (NIPS 2016), Lee, DD and Sugiyama, M and Luxburg, UV and Guyon, I and Garnett, R (Ed.), Advances in Neural Information Processing Systems, Vol. 29. Note: 30th Conference on Neural Information Processing Systems (NIPS), Barcelona, SPAIN, 2016 External Links: ISSN 10495258 Cited by: Conclusions.
 [14] (2004OCT 15) Multivariate tests for the evaluation of highdimensional EEG data. Journal of Neuroscience Methods 139 (1), pp. 111–120. External Links: Document, ISSN 01650270 Cited by: §1.
 [15] (1988) A multivariate twosample test based on the number of nearest neighbor type coincidences. The Annals of Statistics, pp. 772–783. Cited by: §5.1.1.
 [16] (1931) The generalization of student’s ratios. Annals of Mathematical Statistics 2, pp. 360–378. Cited by: §1.

[17]
(1981)
Total positivity properties of absolute value multinormal variables with applications to confidence interval estimates and related probabilistic inequalities
. Annals of Statistics 9 (5), pp. 1035–1049. External Links: Document, ISSN 00905364 Cited by: §2.  [18] (1993) Probability. Springer. Cited by: Lemma 1.3.
 [19] (2002MAY) Controlling the rate of Type I error over a large set of statistical tests. British Journal of Mathematical & Statistical Psychology 55 (1), pp. 27–39. External Links: Document, ISSN 00071102 Cited by: §2.
 [20] (2004JAN 15) Modified Nel and Van der Merwe test for the multivariate BehrensFisher problem. Statistics & Probability Letters 66 (2), pp. 161–169. External Links: Document, ISSN 01677152 Cited by: §4.1.
 [21] (2013) Simes’ theorem is generally valid for dependent normally distributed variables. In Invited talk at the international conference on simultaneous inference, Cited by: §2.
 [22] (2016) Highmean: twosample tests for highdimensional mean vectors. Note: R package version 3.0 External Links: Link Cited by: §5.2.
 [23] (2011) A more powerful twosample test in high dimensions using random projection. In Advances in Neural Information Processing Systems, pp. 1206–1214. Cited by: §3.1.3, §3.1.4, §3, §5.
 [24] (2007FEB) FDR control by the BH procedure for twosided correlated tests with implications to gene expression data analysis. Biometrical Journal 49 (1), pp. 107–126. Note: 4th International Conference on Multiple Comparison Procedures (MCP2005), Shanghai, PEOPLES R CHINA, AUG 1719, 2005 External Links: Document, ISSN 03233847 Cited by: §2.
 [25] (2003SEP) Cariporide preserves mitochondrial proton gradient and delays ATP depletion in cardiomyocytes during ischemic conditions. American Journal of Physiologyheart and Circulatory Physiology 285 (3), pp. H999–H1006. External Links: Document, ISSN 03636135 Cited by: §6.2.
 [26] (1998APR) Some probability inequalities for ordered MTP2 random variables: A proof of the Simes conjecture. Annals of Statistics 26 (2), pp. 494–504. External Links: ISSN 00905364 Cited by: §2.
 [27] (2015SEP) An adaptive test for the mean vector in largepsmalln problems. Computational Statistics & Data Analysis 89, pp. 25–38. External Links: Document, ISSN 01679473 Cited by: §5.1.3.
 [28] (1986DEC) An improved Bonferroni procedure for multiple tests of significance. Biometrika 73 (3), pp. 751–754. External Links: Document, ISSN 00063444 Cited by: §1.2, §1.2.
 [29] (2008MAR) A test for the mean vector with fewer observations than the dimension. Journal of Multivariate Analysis 99 (3), pp. 386–402. External Links: Document, ISSN 0047259X Cited by: §3.1.2, §5, §6.1.
 [30] (1990) The collector’s problem with group drawings. Advances in Applied Probability, pp. 866–882. Cited by: §3.
 [31] (2014JUN) A highdimensional twosample test for the mean using random subspaces. Computational Statistics & Data Analysis 74, pp. 26–38. External Links: Document, ISSN 01679473 Cited by: §2, §4.2, §5.1.2, §5.
 [32] (1999SPR) Controlling error in multiple comparisons, with examples from statetostate differences in educational achievement. Journal of Educational and Behavioral Statistics 24 (1), pp. 42–69. External Links: Document, ISSN 10769986 Cited by: §2.
 [33] (2002MAY) Generalized T2 test for genome association studies. American Journal of Human Genetics 70 (5), pp. 1257–1268. External Links: Document, ISSN 00029297 Cited by: §1.
 [34] (2016JUL 15) Chromatin remodeling inactivates activity genes and regulates neural coding. SCIENCE 353 (6296), pp. 300–305. External Links: Document, ISSN 00368075 Cited by: §1.
 [35] (2008FEB 1) False discovery rate control for nonpositively regression dependent test statistics. Journal of Statistical Planning and Inference 138 (2), pp. 405–415. External Links: Document, ISSN 03783758 Cited by: §2.
 [36] (2016MAY) A highdimension twosample test for the mean using cluster subspaces. Computational Statistics & Data Analysis 97, pp. 87–97. External Links: Document, ISSN 01679473 Cited by: §5.1.4.
1 Theorem and Lemmas
Theorem 1.1.
for .
The proof start with the observation that the Hotelling statistic of the subset can be rewritten as (as in [4])
(S1) 
where
(S2) 
and
(S3) 
We also require to prove the following Lemmas.
Lemma 1.1.
For , then
(S4) 
Proof:
w.l.o.g assume . Let where , then
(S5) 
Allowing us to calculate the covariance
(S6) 
Lemma 1.2.
If then
Proof:
, is a
matrix whose columns are the eigenvectors of
.is a diagonal matrix of eigenvalues. Under the null hypothesis
is distributed as , being the eigenvalue with respect to the corresponding eigenvector in the ’th column of . Writing the quadratic form in as a sum,The covariance between and is
Using the result from Lemma 1.1 and defining , we obtain
Concluding .
Lemma 1.3.
If then
.
Proof:
The joint distribution of can be written as
(S7) 
, [4]. Therefore, is dependent solely on , we can define , and.
We can now turn to prove Theorem 1.1, define , and , . We will use Taylor approximation to estimate a at the expansion point , where .
Comments
There are no comments yet.