1 Introduction
There has recently been heightened interest in testing the location of a high-dimensional parameter vector, where the dimension of the parameter vector may exceed the number of observations. Traditionally, multidimensional location problems are tested using a quadratic statistic such as the Hotelling statistic, which is a multivariate generalization of the t statistic. However, such tests may lose power against subregions of the alternative as the dimensionality of the parameter vector increases, even if the inverse covariance matrix is known (see e.g. Fan et al., 2015). Recently, Kock and Preinerstorfer (2019) have further explored this problem, calling for the construction of tests that direct power towards specific subregions.
In addition, the inverse covariance matrix is typically not known in practice and must be estimated. This is problematic in high-dimensional settings, as the standard sample covariance matrix is not invertible. The solution provided in the literature is to impose restrictions, through regularization or other methods, in order to estimate the covariance matrix (see e.g. Bai and Saranadasa, 1996; Chen et al., 2010; Chen et al., 2011). However, if no further information about the covariance matrix is available, then it is not obvious what restrictions should be used, and imposing the wrong restrictions can bias the estimate.
To address these two issues, I propose a novel test statistic that generalizes the Hotelling statistic and is suitable for testing in high-dimensional settings. I derive this statistic by generalizing the Mahalanobis distance to a distance that measures the length only in a direction of interest. The test statistic is the sample analogue of this distance and directs power towards the subregion of interest within the alternative (henceforth subalternative).
For the test statistic to exist, a condition that is much weaker than nonsingularity of the sample covariance matrix typically suffices. The strength of this condition depends directly on the scope of the subalternative. In particular, nonsingularity is only required if the subalternative contains the entire space except the origin. In that case, the statistic coincides with the Hotelling statistic.
I show that the computation of the test statistic reduces to a quadratic minimization problem, regularized by the same constraints that specify the subalternative. If the standard estimate of the sample covariance matrix is used, then this problem reduces to linear regression with a constant response vector. I provide an additional result for the special case of a diagonal estimate of the covariance matrix, for which the computation can reduce to a minimum distance problem, and even to a closed-form test statistic, when considering subalternatives of a class defined by sign or sparsity restrictions.
I demonstrate this methodology by testing against sparse subalternatives, which can be defined by the number of hypothesized violations of the null hypothesis. The setting of sparse subalternatives is also considered by Fan et al. (2015). They provide the following motivating example from financial econometrics concerning multifactor pricing theory. Multifactor pricing models assume that returns of assets can be described by a linear combination of a limited number of factors, so that no individual assets have returns in excess of the market (known as ‘alpha’ in the finance literature). Because such models are well founded in arbitrage pricing theory (Ross, 1976), one would expect that if such a model is false then it would only be violated by a small number of exceptional assets. So, under this alternative hypothesis, the underlying vector of excess returns is sparse. Hence, testing a multifactor pricing model against a small number of exceptional assets can be represented as testing against sparse subalternatives.
For sparse subalternatives, the computation of my test statistic is a special case of best-subset selection in linear regression. Best-subset selection has had a recent surge of interest after Bertsimas et al. (2016) showed that this problem is easier to solve than was commonly assumed, by exploiting gradient descent methods and commercially available mixed-integer optimization solvers (see e.g. Hastie et al., 2017; Mazumder et al., 2017; Hazimeh and Mazumder, 2018; Koning and Bekker, 2018, for more recent work). In the case of a diagonal covariance matrix, the test statistic has a closed-form solution and is closely related to threshold-type statistics (see e.g. Fan, 1996; Zhong et al., 2013).
In addition, I consider Lasso, which is another popular technique to obtain sparse solutions (Tibshirani, 1996). I show that in its constrained form, Lasso generates a subalternative that consists of convex cones that are centered around the axes. The width of these cones depends on the regularization parameter. Therefore, the Lasso subalternative can be interpreted as a nearly sparse subalternative. The corresponding test statistic can be computed using Lasso. Furthermore, the penalized form of Lasso has the interpretation of shrinking the sample mean estimate towards zero.
2 Restricting the Mahalanobis distance
In this section, I show that the Mahalanobis distance is the result of a maximization problem. I present a new distance measure by restricting the argument over which is maximized, to measure distance in a direction. I describe its connection to the Mahalanobis distance and provide some insights into the existence of the new distance measure.
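As a minimal numerical sketch of the first claim, assume that the maximization takes the familiar form of maximizing the inner product lam' mu over the ellipsoid lam' Sigma lam <= 1. This specific form, and the names `mu`, `Sigma`, `mahalanobis_length` and `mahalanobis_by_maximization`, are assumptions made for illustration. For a positive-definite `Sigma`, the maximizer is Sigma^{-1} mu rescaled to the boundary of the ellipsoid, and the maximum equals the Mahalanobis length.

```python
import numpy as np

def mahalanobis_length(mu, Sigma):
    """Closed-form Mahalanobis length sqrt(mu' Sigma^{-1} mu)."""
    return float(np.sqrt(mu @ np.linalg.solve(Sigma, mu)))

def mahalanobis_by_maximization(mu, Sigma):
    """Maximize lam' mu subject to lam' Sigma lam <= 1 (assumed form).

    For positive-definite Sigma the maximizer is proportional to
    Sigma^{-1} mu and lies on the boundary of the ellipsoid.
    """
    direction = np.linalg.solve(Sigma, mu)           # Sigma^{-1} mu
    scale = np.sqrt(direction @ Sigma @ direction)   # length of direction in the Sigma-norm
    if scale == 0:
        # mu is the origin, so every feasible lam gives the value 0
        return 0.0, np.zeros_like(mu)
    lam = direction / scale                          # now lam' Sigma lam = 1
    return float(lam @ mu), lam

# Hypothetical illustrative inputs
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
value, lam = mahalanobis_by_maximization(mu, Sigma)
```

Restricting `lam` to a smaller set than the whole space can only lower this maximum, which is the sense in which a restricted version measures length in a direction of interest.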
I use the following notation. For a set , the set denotes the set in addition to the origin and the set excluding the origin. I use to denote the identity matrix. The subscripted vector has elements equal to if and 0, otherwise. Finally, for a vector with elements , the following standard notation is used
2.1 The distance measure
Let the vector be the parameter of interest and let be the set containing the (possibly degenerate) unit ellipsoid described by some symmetric matrix . Let represent a set of constraints. Without loss of generality, assume that is closed under scalar multiplication, so that it constitutes a cone. That is, if , then , for all . (Generality is not lost, as only the intersection of with is used to define the distance. Hence, the cone can be constructed from an arbitrary subset , where , so that it has the same intersection . This is demonstrated using Lasso in Section 4.2.)
I propose the following restricted distance measure over , which measures the distance in the directions specified by the cone .
Definition 1.
The length of in is equal to
The distance measure can be interpreted as follows. Suppose that is the maximizing vector. Then is a projection of onto , scaled by . The length of this projection is the distance measure. A visual demonstration for both a singular and a positive-definite matrix is given in Figure 1. The origin is included in the maximization to ensure that the distance is nonnegative, and to simplify its computation.
Notice that if is the entire space , then the maximizing argument is equal to , so that coincides with the Mahalanobis distance . In this case, the existence of the statistic requires a positive-definite matrix . If , then the following existence condition for is weaker than , which is required for the inverse to exist (the condition follows from first observing that the case is irrelevant for existence, and then rewriting as for ). The strength of the assumption depends entirely on the scope of .
Condition 1.
.
This condition is closely related to the restricted eigenvalue conditions used to prove oracle results for Lasso (see e.g. Van De Geer et al., 2009). For example, if we choose and , where is an index set with cardinality and is some constant, the condition coincides with the compatibility condition (Van de Geer, 2007).
The Mahalanobis distance is well known for appearing in the exponent of the density of the normal distribution. As the distance from Definition 1 generalizes the Mahalanobis distance, it also generalizes the normal distribution. In particular, replacing the Mahalanobis distance with this distance measure produces a density
where is a proportionality constant. This density is equal to the (multivariate) normal density if . Using different sets , other densities can be produced. For example, if we consider the two-dimensional case with equal to scalar multiples of the canonical basis vectors , then can be viewed as a bell curve with flattened sides. A visual illustration of this example is provided in Figure 2.
The following result shows that the distance measure can be computed using restricted quadratic minimization. This is convenient, as some such problems have been well studied in the optimization literature. A proof of the result can be found in Appendix A. In Section 3.2, where the distance is used as a test statistic, results for sample moments are provided for which the computation simplifies.
Proposition 1.
Let for all . Assume that and are unique optimizers. Then
3 Multidimensional hypothesis testing
In this section, I apply the distance measure from the previous section to hypothesis testing. I start by showing how conventional null and alternative hypotheses can be defined by restricting the distance measure. This reveals that different restrictions can lead to the same null and alternative hypotheses, but may differ in the subregions of the alternative on which they are maximized. I define these subregions of the alternative as the subalternative. I continue by proposing a test statistic that is the sample equivalent of the distance, in order to direct power towards the subalternative. Finally, I provide several results for the computation of the statistic. Interestingly, if the conventional estimate of the sample covariance matrix is used then the computation of the statistic reduces to linear regression with a constant response, restricted by the set that defines the subalternative.
The setting can be described as follows. Let be an matrix that can be decomposed as , where is the vector of interest, is an vector of ones, and is a random matrix with mean zero and covariance matrix .
3.1 Subalternatives
In order to describe the hypotheses in terms of , it will be convenient to reparameterize as , where the columns of are orthonormal eigenvectors of . Let be the diagonal matrix containing the eigenvalues of , so that . This reparametrization is equivalent to using . Without loss of generality, assume that and that is a cone. (Note that the normalization of the eigenvalues and the reparametrization should be taken into account when constructing this cone from an arbitrary subset of .)
With the distance measure defined in the previous section, the null hypothesis and alternative hypothesis can be equivalently written in terms of the length of in :
This reformulation is convenient, as it condenses the vector hypotheses to scalar hypotheses. Additionally, a different null hypothesis and alternative hypothesis can be constructed by simply replacing with some other cone . For example, the cone that imposes the restriction produces a ‘onesided’ test with null hypothesis and alternative hypothesis .
A key observation to make is that different cones can produce the same null and alternative hypothesis. For example, the sets containing the vectors in the direction of the canonical basis, and share and , as the lengths and are positive for the same vectors . However, the lengths and are different for almost all in . It then seems natural to define the directions in which the length is maximized to be the subregions of interest of the alternative (henceforth named subalternatives).
The length is largest if , as a vector not in that is projected onto some vector in is shortened as a result of the projection. Therefore, I define the subalternative corresponding to to be the vectors in , excluding the origin. Notice that the subalternative and alternative coincide if and only if .
Definition 2.
The subalternative corresponding to is .
3.2 Test statistic
In this section, I use the new distance measure to propose a novel test statistic that directs power towards the subalternative. In order to do so, the following notation is used. The vector stands for a generic estimate of , and for a generic estimate of . Let .
I propose the following test statistic that is the sample analogue of the distance :
where is the cone that describes the subalternative.
For the estimate , I use the sample mean . For , I will focus on two estimates. The first is the conventional sample covariance matrix , where is the Gramian matrix. The second is a diagonal estimate, which includes popular choices such as or . For notational clarity, I will write if is the full sample covariance matrix and add the superscript to denote the statistic if is a diagonal estimate. [In section X, I consider  and regularization and show how regularization corresponds to a shrinkage in and regularization to shrinkage in ]
The intuition for the statistic is that it is small under , positive under and large under the subalternative , so that power is directed towards the subalternatives. Notice that if the subalternative and alternative coincide, then is equal to the square root of the Hotelling statistic. A crucial difference with the Hotelling statistic is that the statistic does not generally require to be invertible. Instead, a weaker sample equivalent of Condition 1 suffices for its existence.
3.3 Computing the test statistic
From Proposition 1 it follows that the computation of reduces to a quadratic minimization problem. In this section, I show that the computation can simplify in some cases. In particular, I provide results for the case that and the case that is diagonal. The proofs of these results can be found in Appendix A.
If is chosen as estimate of the covariance matrix and , then Proposition 2 shows that the computation simplifies to linear regression of a constant vector on , regularized by the cones that define the subalternative of interest. This is a curious result, because linear regression with a constant response has no other applications that I am aware of. The reason for this is that in most applications the matrix contains an intercept so that regression of on results in the trivial parameter estimate . The result suggests that the residual sum of squares could be used as a test statistic instead of the distance measure.
Proposition 2.
Let and for all . Let and be unique optimizers. Then
if and , otherwise.
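The link between regression on a constant response and the Hotelling statistic can be checked numerically in the unrestricted case. The sketch below is an illustration under stated assumptions, not the restricted result of Proposition 2 itself: it assumes more observations than parameters, takes the covariance estimate to be X'X/n minus the outer product of the sample mean with itself, and uses the Sherman-Morrison identity, which gives xbar' Sigma_hat^{-1} xbar = (n - RSS) / RSS, where RSS is the residual sum of squares from regressing a vector of ones on X. The function names are hypothetical.

```python
import numpy as np

def quadratic_form_via_constant_regression(X):
    """Regress a vector of ones on X (no intercept) and map the RSS
    to xbar' Sigma_hat^{-1} xbar via (n - RSS) / RSS."""
    n = X.shape[0]
    ones = np.ones(n)
    beta, *_ = np.linalg.lstsq(X, ones, rcond=None)
    resid = ones - X @ beta
    rss = float(resid @ resid)
    return (n - rss) / rss

def quadratic_form_direct(X):
    """Compute xbar' Sigma_hat^{-1} xbar directly, with
    Sigma_hat = X'X / n - xbar xbar' (assumed normalization by n)."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    sigma_hat = X.T @ X / n - np.outer(xbar, xbar)
    return float(xbar @ np.linalg.solve(sigma_hat, xbar))

# Hypothetical simulated data: 50 observations, 3 parameters
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) + 0.3
via_regression = quadratic_form_via_constant_regression(X)
direct = quadratic_form_direct(X)
```

Both routes give the same number up to floating-point error. The regression route never forms the inverse of the covariance estimate, so restricting the coefficients to a cone changes only the regression problem.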
In addition, I consider the case where is a diagonal estimate of the covariance matrix, such as or . In this case, the computation can simplify substantially if a special type of cone is used. In particular, define a scone (signed cone) to be a set of vectors that is closed under multiplication by a positive-definite diagonal matrix. That is, if , then , where . Premultiplication by a positive-definite diagonal matrix is a sign-preserving operation, so that the subalternatives for scones relate directly to hypotheses concerning sign and sparsity restrictions.
For a scone subalternative, Proposition 3 shows that computing requires solving a minimum distance problem. Such problems are typically easier to solve than regression. For example, the next section considers testing against sparse subalternatives, for which this problem has a closed-form solution.
Proposition 3.
Let be a diagonal matrix and let be a scone. Suppose that and are unique optimizers. Then
if and , otherwise.
4 Sparse subalternatives
In this section, I demonstrate the methodology described in the previous sections by considering the special case of sparse subalternatives. For a given level of sparsity , the sparse subalternative is defined by , where is the norm of a vector , which counts the number of nonzero elements in the vector. Notice that is a scone as for all diagonal matrices . The subalternative corresponding to this scone is given by
The interpretation of this subalternative is that (at most) elements of are unequal to zero. The corresponding null hypothesis is and the alternative hypothesis is . Notice that the alternative hypothesis and subalternative coincide if .
The sparse subalternative leads to the statistic , which will be abbreviated as for notational convenience. A visual illustration comparing the statistic to the statistic is provided for the two-dimensional case in Figure 3. In the left panel, it can be observed that the statistic is large near the axes, where is sparse. Conversely, the right panel shows that the statistic is equally large in every direction. The difference becomes increasingly pronounced as increases while the level of sparsity is held fixed.
4.1 Computation
For the case that , Proposition 2 shows that the corresponding test statistic requires solving the nonconvex quadratic minimization problem
(1) 
This minimization problem is a special case of the best subset selection problem (BSS), which was long deemed computationally infeasible for . Recently, Bertsimas et al. (2016) have shown that BSS can be solved for problems of practical size within reasonable time by using gradient descent methods and mixed-integer optimization solvers. In particular, they solve BSS with of order and of order in minutes. If and are very large, or if resampling is used to compute a critical value, then this may still be computationally costly. However, promising work by Hazimeh and Mazumder (2018) shows that problems of order can be approximated in seconds.
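For very small problems, best subset selection with a constant response can be solved by plain enumeration, which makes the structure of the problem concrete. The sketch below is a brute-force illustration only, not the mixed-integer approach of Bertsimas et al. (2016). It assumes that the problem is to minimize the residual sum of squares from regressing a vector of ones on at most k columns of X. The function names are hypothetical.

```python
from itertools import combinations

import numpy as np

def rss_constant_response(X_sub):
    """Residual sum of squares from regressing a vector of ones on X_sub."""
    ones = np.ones(X_sub.shape[0])
    beta, *_ = np.linalg.lstsq(X_sub, ones, rcond=None)
    resid = ones - X_sub @ beta
    return float(resid @ resid)

def best_subset_constant_response(X, k):
    """Enumerate all column subsets of size k and return the one with the
    smallest RSS. Subsets of size exactly k suffice, because adding a
    column can never increase the least-squares RSS."""
    p = X.shape[1]
    best_rss, best_subset = np.inf, None
    for subset in combinations(range(p), k):
        rss = rss_constant_response(X[:, list(subset)])
        if rss < best_rss:
            best_rss, best_subset = rss, subset
    return best_subset, best_rss

# Hypothetical simulated data in which only column 2 has a nonzero mean
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
X[:, 2] += 2.0
subset, rss = best_subset_constant_response(X, k=1)
```

The number of subsets grows combinatorially in the number of columns, so enumeration is only practical for toy examples. This is why the solver-based and approximate methods cited above matter.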
If is assumed to be diagonal, then the computation becomes much simpler. Proposition 3 shows that can be computed by solving
It is straightforward to show that , where contains the largest elements of (see e.g. Proposition 3 of Bertsimas et al. (2016) for a formal proof). This leads to the closed form
where has elements equal to the corresponding element in if its index is in , and 0 otherwise.
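The closed form can be sketched in a few lines. The code assumes that the ranked elements are the squared sample means divided by their variance estimates, and that the statistic is the Euclidean length of the k largest standardized elements. That is the value obtained by maximizing an inner product over k-sparse vectors in a diagonal ellipsoid. It is an assumption about the exact normalization rather than a transcription of the formula above. The names `sparse_diagonal_statistic`, `xbar` and `variances` are hypothetical.

```python
import numpy as np

def sparse_diagonal_statistic(xbar, variances, k):
    """Length of the k largest standardized elements of xbar.

    xbar      : sample mean vector
    variances : diagonal of the (diagonal) covariance estimate
    k         : hypothesized sparsity level
    Returns the statistic and the sorted indices of the selected elements.
    """
    xbar = np.asarray(xbar, dtype=float)
    variances = np.asarray(variances, dtype=float)
    z2 = xbar**2 / variances                # squared standardized means
    support = np.argsort(z2)[-k:]           # indices of the k largest
    return float(np.sqrt(z2[support].sum())), np.sort(support)

# Hypothetical illustrative inputs
stat, support = sparse_diagonal_statistic([3.0, 0.0, -4.0, 1.0], np.ones(4), k=2)
```

With k equal to the full dimension, every element is selected and the value reduces to the square root of the usual diagonal quadratic statistic.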
If , then the test statistic is closely related to threshold-type statistics (see e.g. Fan, 1996; Zhong et al., 2013), including the screening statistic proposed by Fan et al. (2015). They define the screening statistic as
where , for a given threshold value . Here, Fan et al. (2015) choose to grow sufficiently fast so that using 0 as critical value results in a test with asymptotic size 0. Notice that if is instead chosen such that , then is a monotone function of : . Hence, the difference is that a threshold-type statistic specifies the sparsity level of the subalternative of interest implicitly, whereas the statistic specifies it explicitly. While the latter seems more intuitive, there may be applications for which a threshold-type specification of the sparsity level is desirable.
4.2 Lasso
Another popular tool to obtain sparse solutions is Lasso (Tibshirani, 1996), which is linear regression under an norm restriction. Unlike , the Lasso constraint set , where , is neither a scone nor a cone. However, as mentioned in Section 2, it is possible to convert non-cone restrictions to cone restrictions, which I will demonstrate in this section.
Let be the intersection of the Lasso constraint set and the eigenvalue ellipsoid. Then we can construct the cone that consists of all the scalar multiples of and the origin. This leads to the subalternative
Notice that if is the eigenvalue matrix of , then this subalternative coincides with the 1-sparse subalternative when , and the Hotelling statistic if . A visual illustration of the two-dimensional case is given in Figure 4, for . This figure shows that the subalternative corresponding to the Lasso constraint is a set of cones that are centered around the axes, and whose angles are regulated by the parameter . Therefore, the subalternative induced by Lasso can be viewed as a near-sparse subalternative.
Using its penalty form exposes an alternative interpretation of Lasso, in that it shrinks the mean estimate . In particular, using Proposition 2, the penalized form of the Lasso problem is given by
where is the penalty parameter, is the elementwise absolute value, and is the elementwise positive-part operator. This shows that Lasso shrinks each element of the mean estimate by , to produce the estimate . (Similarly, a Ridge or penalty (Hoerl and Kennard, 1970) inflates the diagonal of the estimate of .)
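The shrinkage described here, an elementwise absolute value followed by a positive part, is the soft-thresholding operator. A minimal sketch follows. The argument `lam` stands in for the shrinkage amount in the text, and how it maps onto the penalty parameter is an assumption. The function name is hypothetical.

```python
import numpy as np

def soft_threshold(xbar, lam):
    """Shrink each element of xbar towards zero by lam, setting elements
    whose absolute value is at most lam to zero."""
    xbar = np.asarray(xbar, dtype=float)
    return np.sign(xbar) * np.maximum(np.abs(xbar) - lam, 0.0)

# Hypothetical illustrative inputs
shrunk = soft_threshold([3.0, -1.0, 0.5, -2.5], lam=1.0)
```

Elements of the mean estimate that are small relative to `lam` are set exactly to zero, which is how the penalized form produces sparse estimates.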
4.3 Critical values
In order to construct a test, a critical value is needed, for which the distribution of the statistic under the null must be approximated. For the purposes of this paper, I will use a reflection-based randomization test. Under a symmetry assumption on , this leads to an exact critical value because the sparse subalternatives and Lasso subalternatives share . This section briefly describes the construction of such a critical value (see e.g. Ch. 15 of Lehmann and Romano (2006) for a more general discussion).
Let contain all reflection matrices, so that is the reflection group of order . The assumption that permits the construction of an exact test is that and share the same distribution, for all . Let be selected uniformly from such that it includes the identity matrix. Denote the set containing the reflection-transformed data by . Let be the set containing the test statistic of interest applied to each of the reflected data sets. Notice that under , each of the elements of would ex ante have been equally likely to be observed. So, for a given level of significance , the critical value can be defined as the quantile of . If all elements of are different, then rejecting if is an exact test.
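The construction can be sketched as follows. The code assumes that a reflection flips the sign of entire rows (observations) of the data matrix, that random reflections are drawn independently with the identity always included, and that the critical value is the order statistic at rank M minus the integer part of alpha times M among the M computed statistics, as in standard randomization tests. The function name, the `statistic` callable and the example statistic are hypothetical.

```python
import numpy as np

def reflection_critical_value(X, statistic, alpha=0.05, n_draws=999, seed=0):
    """Critical value from a reflection-based randomization test.

    X         : (n, p) data matrix
    statistic : callable mapping a data matrix to a scalar test statistic
    alpha     : significance level
    n_draws   : number of random reflections, in addition to the identity
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    stats = [statistic(X)]                        # identity reflection
    for _ in range(n_draws):
        signs = rng.choice([-1.0, 1.0], size=n)   # diagonal of a reflection matrix
        stats.append(statistic(signs[:, None] * X))
    stats = np.sort(np.asarray(stats))
    m = stats.size
    rank = m - int(np.floor(alpha * m))           # 1-based rank of the critical value
    return float(stats[rank - 1])

def max_abs_mean(X):
    """Example statistic (illustrative only): largest absolute sample mean."""
    return float(np.abs(X.mean(axis=0)).max())

# Hypothetical simulated data with a clearly nonzero mean
rng = np.random.default_rng(1)
X_alt = 5.0 + rng.normal(size=(20, 3))
critical_value = reflection_critical_value(X_alt, max_abs_mean, alpha=0.05, n_draws=199)
reject = max_abs_mean(X_alt) > critical_value
```

Any of the statistics discussed above could be passed as `statistic`. The cost is one evaluation of the statistic per reflection, which is why a cheap closed form is attractive when resampling.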
References
 Bai and Saranadasa (1996) Bai, Z. and Saranadasa, H., 1996. Effect of high dimension: by an example of a two sample problem. Statistica Sinica, pages 311–329.
 Bertsimas et al. (2016) Bertsimas, D., King, A., Mazumder, R., et al., 2016. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852.
 Bugni et al. (2016) Bugni, F. A., Caner, M., Kock, A. B., and Lahiri, S., 2016. Inference in partially identified models with many moment inequalities using lasso. arXiv preprint arXiv:1604.02309.
 Canay et al. (2017) Canay, I. A., Romano, J. P., and Shaikh, A. M., 2017. Randomization tests under an approximate symmetry assumption. Econometrica, 85(3):1013–1030.
 Chen et al. (2011) Chen, L. S., Paul, D., Prentice, R. L., and Wang, P., 2011. A regularized Hotelling’s T² test for pathway analysis in proteomic studies. Journal of the American Statistical Association, 106(496):1345–1360.

 Chen et al. (2010) Chen, S. X., Qin, Y.-L., et al., 2010. A two-sample test for high-dimensional data with applications to gene-set testing. The Annals of Statistics, 38(2):808–835.
 Chernozhukov et al. (2018) Chernozhukov, V., Chetverikov, D., and Kato, K., 2018. Inference on causal and structural parameters using many moment inequalities. arXiv preprint arXiv:1312.7614.
 Fan (1996) Fan, J., 1996. Test of significance based on wavelet thresholding and Neyman’s truncation. Journal of the American Statistical Association, 91(434):674–688.
 Fan et al. (2015) Fan, J., Liao, Y., and Yao, J., 2015. Power enhancement in high-dimensional cross-sectional tests. Econometrica, 83(4):1497–1541.
 Hastie et al. (2005) Hastie, T., Tibshirani, R., Friedman, J., and Franklin, J., 2005. The elements of statistical learning: data mining, inference and prediction, volume 27. Springer.
 Hastie et al. (2017) Hastie, T., Tibshirani, R., and Tibshirani, R. J., 2017. Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692.
 Hazimeh and Mazumder (2018) Hazimeh, H. and Mazumder, R., 2018. Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. arXiv preprint arXiv:1803.01454.
 Hoerl and Kennard (1970) Hoerl, A. E. and Kennard, R. W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
 Kock and Preinerstorfer (2019) Kock, A. B. and Preinerstorfer, D., 2019. Power in high-dimensional testing problems. Econometrica, 87(3):1055–1069.
 Koning and Bekker (2018) Koning, N. W. and Bekker, P. A., 2018. Sparse unitsum regression. arXiv preprint arXiv:1907.04620.
 Koning and Bekker (2019) Koning, N. W. and Bekker, P. A., 2019. Exact testing of many moment inequalities against multiple violations. arXiv preprint arXiv:1904.12775.
 Lawson and Hanson (1995) Lawson, C. L. and Hanson, R. J., 1995. Solving least squares problems, volume 15. Siam.
 Lehmann and Romano (2006) Lehmann, E. L. and Romano, J. P., 2006. Testing statistical hypotheses. Springer Science & Business Media.
 Mazumder et al. (2017) Mazumder, R., Radchenko, P., and Dedieu, A., 2017. Subset selection with shrinkage: Sparse linear modeling when the snr is low. arXiv preprint arXiv:1708.03288.
 Ross (1976) Ross, S., 1976. The arbitrage theory of capital asset pricing. Journal of Economic Theory, 13(3):341–360.
 Stark and Parker (1995) Stark, P. B. and Parker, R. L., 1995. Bounded-variable least-squares: an algorithm and applications. Computational Statistics, 10:129–129.
 Tibshirani (1996) Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
 Van de Geer (2007) Van de Geer, S. The deterministic lasso. Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich, 2007.
 Van De Geer et al. (2009) Van De Geer, S. A., Bühlmann, P., et al., 2009. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392.
 Zhong et al. (2013) Zhong, P.-S., Chen, S. X., Xu, M., et al., 2013. Tests alternative to higher criticism for high-dimensional means under sparsity and column-wise dependence. The Annals of Statistics, 41(6):2820–2851.
5 Appendix A
In order to present the proofs of Proposition 1, I first provide the following lemma, where the notation is used to mean elementwise scalar multiplication of the set .
Lemma 1.
Let be a cone and . If Condition 1 holds, then
Proof.
Proposition 1.
Let for all . Let and be unique optimizers. Then
if and , otherwise.
Proof.
I will consider two cases: and .
Case
I will prove that by contradiction. Let be arbitrarily given. Then , as is a unique minimizer. So . As , we have that for all . Then , which implies . This contradicts the assumption that is a minimizer. Hence, .
∎
Proposition 2.
Let and for all . Let and be unique optimizers. Then
if and , otherwise.
Proof.
I will consider two cases.
Proposition 3.
Let be a diagonal matrix and let be a scone. Suppose that and are unique optimizers. Then
if and , otherwise.
Proof.
I will consider two cases: and .
Case
Using the substitution and yields
where if , as is diagonal and positive-definite. Define . From Lemma 1 it follows that
Case .
This case is analogous to the case that in the proof of Proposition 1.
∎
6 Appendix B: Connection to existing test statistics
In this section, I discuss two scone restrictions for which the test statistic reduces to test statistics that have recently received attention in the literature.
6.1 Sign restrictions
Scones can also be constructed by imposing nonnegativity restrictions on elements of . This yields the following null and alternative hypothesis:
where and , without loss of generality. This is the well-known moment (in)equalities setting. The special case of this problem where is treated by Koning and Bekker (2019).
For the case , the corresponding test statistic is computed by solving
where has elements . This is a special case of the bounded-variable least-squares (BVLS) minimization problem with a lower bound that is only applied to the first parameters. BVLS is a well-studied minimization problem (see e.g. Stark and Parker, 1995; Lawson and Hanson, 1995) and implementations are available for most major statistical software packages. If is diagonal, the computation has a closed-form solution.
For , the minimization problem is given by
Let , and let contain the positive elements in . It is straightforward to show that if , and , otherwise.
6.2 Mixing sign and sparsity restrictions
Another type of scone restriction imposes both sign and sparsity restrictions on . The special case where and is treated by Chernozhukov et al. (2018), and Bugni et al. (2016) consider the case and .
If is diagonal, the computational results for sign restrictions and sparsity restrictions are easily combined. In particular, let contain the indexes of the at most largest and positive elements in . Then if and , otherwise.
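The combined rule can be sketched by first discarding non-positive standardized means and then keeping the k largest of the remainder. The code assumes that every element is subject to the nonnegativity restriction and that the statistic is the Euclidean length of the selected standardized elements, with the value 0 when no element is positive. Both are simplifying assumptions about the setting above. The function and argument names are hypothetical.

```python
import numpy as np

def signed_sparse_diagonal_statistic(xbar, variances, k):
    """Length of the (at most) k largest positive standardized means.

    Returns 0.0 when no element of xbar is positive.
    """
    xbar = np.asarray(xbar, dtype=float)
    variances = np.asarray(variances, dtype=float)
    z = xbar / np.sqrt(variances)          # standardized means
    positive_part = np.clip(z, 0.0, None)  # sign restriction: drop negatives
    top = np.sort(positive_part**2)[-k:]   # sparsity restriction: keep k largest
    return float(np.sqrt(top.sum()))

# Hypothetical illustrative inputs: the large negative element is ignored
stat = signed_sparse_diagonal_statistic([3.0, -5.0, 4.0, 1.0], np.ones(4), k=2)
```

Setting k to the full dimension recovers a pure sign restriction.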
For , the computation of the test statistic requires solving