There has recently been heightened interest in testing the location of a high-dimensional parameter vector, where the dimension of the parameter vector may exceed the number of observations. Traditionally, multi-dimensional parameter location problems are tested using tests based on a quadratic statistic such as the Hotelling statistic, which is a multivariate generalization of the t statistic. However, such tests may lose power against sub-regions of the alternative as the dimensionality of the parameter vector increases, even if the inverse covariance matrix is known (see e.g. Fan et al., 2015). Recently, Kock and Preinerstorfer (2019) have further explored this problem, calling for the construction of tests that direct power towards specific sub-regions.
In addition, the inverse covariance matrix is typically not known in practice and must be estimated. This is problematic in high-dimensional settings, as the standard sample covariance matrix is not invertible. The solution provided in the literature is to impose restrictions, through regularization or other methods, in order to estimate the covariance matrix (see e.g. Bai and Saranadasa, 1996; Chen et al., 2010; Chen et al., 2011). However, if no further information about the covariance matrix is available, then it is not obvious what restrictions should be used, and imposing the wrong restrictions can bias the estimate.
To address these two issues, I propose a novel test statistic that generalizes the Hotelling statistic and is suitable for testing in high-dimensional settings. I derive this statistic by generalizing the Mahalanobis distance to a distance that measures the length only in a direction of interest. The test statistic is the sample analogue of this distance and directs power towards the sub-region of interest within the alternative (henceforth sub-alternative).
For the test statistic to exist, a condition that is much weaker than non-singularity of the sample covariance matrix typically suffices. The strength of this condition depends directly on the scope of the sub-alternative. In particular, non-singularity is only required if the sub-alternative contains the entire space except the origin. In that case, the statistic coincides with the Hotelling statistic.
I show that the computation of the test statistic reduces to a quadratic minimization problem, regularized by the same constraints that specify the sub-alternative. If the standard estimate of the sample covariance matrix is used, then this problem reduces to linear regression with a constant response vector. I provide an additional result for the special case of a diagonal estimate of the covariance matrix, for which the computation can reduce to solving a minimum distance problem and a closed-form test statistic when considering sub-alternatives of a class defined by sign or sparsity restrictions.
I demonstrate this methodology by testing against sparse sub-alternatives, which can be defined by the number of hypothesized violations of the null hypothesis. The setting of sparse sub-alternatives is also considered by Fan et al. (2015). They provide the following motivating example from financial econometrics concerning multi-factor pricing theory. Multi-factor pricing models assume that returns of assets can be described by a linear combination of a limited number of factors, so that no individual assets have returns in excess of the market (known as ‘alpha’ in the finance literature). Because such models are well founded in arbitrage pricing theory (Ross, 1976), one would expect that if such a model is false then it would only be violated by a small number of exceptional assets. So, under this alternative hypothesis, the underlying vector of excess returns is sparse. Hence, testing a multi-factor pricing model against a small number of exceptional assets can be represented as testing against sparse sub-alternatives.
For sparse sub-alternatives, the computation of my test statistic is a special case of best-subset selection in linear regression. Best-subset selection has had a recent surge of interest after Bertsimas et al. (2016) showed that this problem is easier to solve than was commonly assumed, by exploiting gradient descent methods and commercially available mixed-integer optimization solvers (see e.g. Hastie et al. (2017); Mazumder et al. (2017); Hazimeh and Mazumder (2018); Koning and Bekker (2018) for more recent work). In the case of a diagonal covariance matrix, the test statistic has a closed-form solution and is closely related to threshold-type statistics (see e.g. Fan, 1996; Zhong et al., 2013).
In addition, I consider Lasso, which is another popular technique to obtain sparse solutions (Tibshirani, 1996). I show that in its constrained form, Lasso generates a sub-alternative that consists of convex cones that are centered around the axes. The width of these cones depends on the regularization parameter. Therefore, the Lasso sub-alternative can be interpreted as a nearly-sparse sub-alternative. The corresponding test statistic can be computed using Lasso. Furthermore, the penalized form of Lasso has the interpretation of shrinking the sample mean estimate towards zero.
2 Restricting the Mahalanobis distance
In this section, I show that the Mahalanobis distance is the result of a maximization problem. I present a new distance measure by restricting the argument over which is maximized, to measure distance in a direction. I describe its connection to the Mahalanobis distance and provide some insights into the existence of the new distance measure.
I use the following notation. For a set , the set denotes the set in addition to the origin and the set excluding the origin. I use to denote the identity matrix. The subscripted vector has elements equal to if and 0, otherwise. Finally, for a vector with elements , the following standard notation is used
2.1 The distance measure
Let the -vector be the parameter of interest and let be the set containing the (possibly degenerate) unit ellipsoid described by some symmetric matrix . Let represent a set of constraints. Without loss of generality, assume that is closed under scalar multiplication, so that it constitutes a cone. That is, if , then , for all . (Generality is not lost, as only the intersection of with is used to define the distance. Hence, the cone can be constructed from an arbitrary subset , where , so that it has the same intersection . This is demonstrated using Lasso in Section 4.2.)
I propose the following restricted distance measure over , which measures the distance in the directions specified by the cone .
The length of in is equal to
The distance measure can be interpreted as follows. Suppose that is the maximizing vector. Then is a projection of onto , scaled by . The length of this projection is the distance measure. A visual demonstration for both a singular and positive-definite matrix is given in Figure 1. The origin is included in the maximization to ensure that the distance is non-negative, and to simplify its computation.
Notice that if is the entire space , then the maximizing argument is equal to , so that coincides with the Mahalanobis distance . In this case, the existence of the statistic requires a positive definite matrix . If , then the following existence condition for is weaker than , which is required for the inverse to exist. (The condition follows from first observing that the case is irrelevant for the existence, and then rewriting as for .) The strength of the assumption depends entirely on the scope of .
This condition is closely related to the restricted eigenvalue conditions used to prove oracle results for Lasso (see e.g. Van De Geer et al. 2009). For example, if we choose and , where is an index set with cardinality and is some constant, the condition coincides with the compatibility condition (Van de Geer, 2007).
The Mahalanobis distance is well known for appearing in the exponent of the density of the normal distribution. As the distance from Definition 1 generalizes the Mahalanobis distance, it also generalizes the normal distribution. In particular, replacing the Mahalanobis distance with this distance measure produces a density
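Returning to the unrestricted case discussed above, the claim that the Mahalanobis distance is the value of a maximization problem can be checked numerically. The following is a minimal sketch for the two-dimensional case, assuming the unit ellipsoid is the set of vectors λ with λ'Σλ ≤ 1 and that the objective is the inner product λ'μ (the displayed definition is not reproduced in this text, so this reading is an assumption). The function names are hypothetical.

```python
import math

def mahalanobis_2d(mu, sigma):
    """Return sqrt(mu' Sigma^{-1} mu) for a 2x2 positive-definite Sigma."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    # Sigma^{-1} mu via the explicit 2x2 inverse.
    w0 = (d * mu[0] - b * mu[1]) / det
    w1 = (-c * mu[0] + a * mu[1]) / det
    return math.sqrt(mu[0] * w0 + mu[1] * w1)

def max_on_ellipsoid(mu, sigma, steps=20000):
    """Grid search for max lam' mu subject to lam' Sigma lam <= 1.

    The objective is linear, so the maximum is attained on the boundary;
    the origin is always feasible and gives the value 0.
    """
    (a, b), (c, d) = sigma
    best = 0.0
    for i in range(steps):
        t = 2.0 * math.pi * i / steps
        u0, u1 = math.cos(t), math.sin(t)
        q = a * u0 * u0 + (b + c) * u0 * u1 + d * u1 * u1
        scale = 1.0 / math.sqrt(q)  # rescale u onto the boundary of the ellipsoid
        best = max(best, scale * (u0 * mu[0] + u1 * mu[1]))
    return best

mu = (1.0, -0.5)
sigma = ((2.0, 0.3), (0.3, 1.0))
print(mahalanobis_2d(mu, sigma), max_on_ellipsoid(mu, sigma))
```

Restricting the grid search to directions inside a cone would give a restricted distance in the spirit of the proposal, and that value can never exceed the unrestricted one.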
where is a proportionality constant. This density is equal to the (multivariate) normal distribution if . Using different sets , other densities can be produced. For example, if we consider the two-dimensional case with equal to scalar multiples of the canonical basis vectors , then can be viewed as a bell-curve with flattened sides. A visual illustration of this example is provided in Figure 2.
The following result shows that the distance measure can be computed using restricted quadratic minimization. This is convenient as some such problems have been well studied in the optimization literature. A proof of the result can be found in Appendix A. In Section 3.2, where the distance is used as a test statistic, results for sample moments are provided for which the computation simplifies.
Let for all . Assume that and are unique optimizers. Then
3 Multi-dimensional hypothesis testing
In this section, I apply the distance measure from the previous section to hypothesis testing. I start by showing how conventional null and alternative hypotheses can be defined by restricting the distance measure. This reveals that different restrictions can lead to the same null and alternative hypotheses, but may differ in the sub-regions of the alternative on which they are maximized. I define these sub-regions of the alternative as the sub-alternative. I continue by proposing a test statistic that is the sample equivalent of the distance, in order to direct power towards the sub-alternative. Finally, I provide several results for the computation of the statistic. Interestingly, if the conventional estimate of the sample covariance matrix is used then the computation of the statistic reduces to linear regression with a constant response, restricted by the set that defines the sub-alternative.
The setting can be described as follows. Let be an matrix that can be decomposed as , where is the -vector of interest, is an -vector of ones, and is a random matrix with mean zero and covariance matrix.
In order to describe the hypotheses in terms of , it will be convenient to re-parameterize as , where the columns of are orthonormal eigenvectors of . Let be the diagonal matrix containing the eigenvalues of , so that . This re-parameterization is equivalent to using . Without loss of generality, assume that and that is a cone. (Note that the normalization of the eigenvalues and re-parameterization should be taken into account when constructing this cone from an arbitrary subset of .)
With the distance measure defined in the previous section, the null hypothesis and alternative hypothesis can be equivalently written as the length of in :
This reformulation is convenient, as it condenses the vector hypotheses to scalar hypotheses. Additionally, a different null hypothesis and alternative hypothesis can be constructed by simply replacing with some other cone . For example, the cone that imposes the restriction produces a ‘one-sided’ test with null hypothesis and alternative hypothesis .
A key observation to make is that different cones can produce the same null and alternative hypothesis. For example, the sets containing the vectors in the direction of the canonical basis, and share and , as the lengths and are positive for the same vectors . However, the lengths and are different for almost all in . It then seems natural to define the directions in which the length is maximized to be the sub-regions of interest of the alternative (henceforth named sub-alternatives).
is largest if , as a vector not in that is projected onto some vector in is shortened as a result of the projection. Therefore, I define the sub-alternative corresponding to to be the vectors in , excluding the origin. Notice that the sub-alternative and alternative coincide if and only if .
The sub-alternative corresponding to is .
3.2 Test statistic
In this section, I use the new distance measure to propose a novel test statistic that directs power towards the sub-alternative. In order to do so, the following notation is used. The vector stands for a generic estimate of , and for a generic estimate of . Let .
I propose the following test statistic that is the sample analogue of the distance :
where is the cone that describes the sub-alternative.
For the estimate , I use the sample mean . For , I will focus on two estimates. The first is the conventional sample covariance matrix , where is the Gramian matrix. The second is a diagonal estimate, which includes popular choices such as or . For notational clarity, I will write if is the full sample covariance matrix and add the superscript to denote the statistic if is a diagonal estimate. [In section X, I consider - and -regularization and show how -regularization corresponds to a shrinkage in and -regularization to shrinkage in ]
The intuition for the statistic is that it is small under , positive under and large under the sub-alternative , so that power is directed towards the sub-alternatives. Notice that if the sub-alternative and alternative coincide, then is equal to the square root of the Hotelling statistic. A crucial difference with the Hotelling statistic is that the statistic does not generally require to be invertible. Instead, for its existence a weaker sample equivalent of Condition 1 suffices.
3.3 Computing the test statistic
From Proposition 1 it follows that the computation of reduces to a quadratic minimization problem. In this section, I show that the computation can simplify in some cases. In particular, I provide results for the case that and the case that is diagonal. The proofs of these results can be found in Appendix A.
If is chosen as estimate of the covariance matrix and , then Proposition 2 shows that the computation simplifies to linear regression of a constant vector on , regularized by the cones that define the sub-alternative of interest. This is a curious result, because linear regression with a constant response has no other applications that I am aware of. The reason for this is that in most applications the matrix contains an intercept so that regression of on results in the trivial parameter estimate . The result suggests that the residual sum of squares could be used as a test statistic instead of the distance measure.
Let and for all . Let and be unique optimizers. Then
if and , otherwise.
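The link between the quadratic form and a regression with a constant response can be illustrated in the unrestricted case. The sketch below is a numerical check under stated assumptions, not a reproduction of Proposition 2 (whose displayed expression is not available in this text): it takes the covariance estimate to be the uncentered second-moment matrix X'X/n, uses two columns so that a 2x2 solve suffices, and imposes no cone restriction. Under these assumptions, the quadratic form of the sample mean equals 1 − RSS/n, where RSS is the residual sum of squares from regressing a vector of ones on X. All function names are hypothetical.

```python
def second_moments(X):
    """Sample mean m = X'1/n and uncentered second-moment matrix G = X'X/n."""
    n = len(X)
    m = [sum(r[j] for r in X) / n for j in range(2)]
    G = [[sum(r[i] * r[j] for r in X) / n for j in range(2)] for i in range(2)]
    return m, G

def solve_2x2(A, b):
    """Solve a 2x2 linear system with Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(A[1][1] * b[0] - A[0][1] * b[1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def quadratic_form(X):
    """m' G^{-1} m, the squared (unrestricted) length of the sample mean."""
    m, G = second_moments(X)
    w = solve_2x2(G, m)
    return m[0] * w[0] + m[1] * w[1]

def rss_constant_response(X):
    """RSS from least-squares regression of a vector of ones on X (no intercept)."""
    m, G = second_moments(X)
    beta = solve_2x2(G, m)  # (X'X)^{-1} X'1 equals G^{-1} m
    return sum((1.0 - (r[0] * beta[0] + r[1] * beta[1])) ** 2 for r in X)

X = [[0.5, 1.2], [-0.3, 0.8], [1.1, -0.4], [0.9, 0.7], [0.2, 0.1]]
n = len(X)
print(quadratic_form(X), 1.0 - rss_constant_response(X) / n)
```

If a centered covariance estimate is used instead, the Sherman-Morrison formula shows that the quadratic form becomes a/(1 − a), with a the uncentered value above, so the two versions are monotone transformations of each other.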
In addition, I consider the case where is a diagonal estimate of the covariance matrix, such as or . In this case, the computation can simplify substantially if a special type of cone is used. In particular, define a scone (signed cone) to be a set of vectors that is closed under multiplication by a positive definite diagonal matrix. That is, if , then , where . Pre-multiplication by a diagonal matrix is a sign-preserving operation, so that the sub-alternatives for scones relate directly to hypotheses concerning sign and sparsity restrictions.
For a scone sub-alternative, Proposition 3 shows that computing requires solving a minimum distance problem. Such problems are typically easier to solve than regression. For example, the next section considers testing against sparse sub-alternatives, for which this problem has a closed-form solution.
Let be a diagonal matrix and let be a scone. Suppose that and are unique optimizers. Then
if and , otherwise.
4 Sparse sub-alternatives
In this section, I demonstrate the methodology described in the previous sections by considering the special case of sparse sub-alternatives. For a given level of sparsity , the sparse sub-alternative is defined by , where is the -norm of a vector , which counts the number of non-zero elements in the vector. Notice that is a scone as for all diagonal matrices . The sub-alternative corresponding to this scone is given by
The interpretation of this sub-alternative is that (at most) elements of are unequal to zero. The corresponding null hypothesis is and the alternative hypothesis is . Notice that the alternative hypothesis and sub-alternative coincide if .
The sparse sub-alternative leads to the statistic , which will be abbreviated as for notational convenience. A visual illustration comparing the statistic to the statistic is provided for the two-dimensional case in Figure 3. On the left panel, it can be observed that the statistic is large near the axes, where is sparse. Conversely, the right panel shows that the statistic is equally large in every direction. The difference becomes increasingly pronounced as increases and the level of sparsity is fixed.
For the case that , Proposition 2 shows that the corresponding test statistic requires solving the non-convex quadratic minimization problem
This minimization problem is a special case of the best subset selection problem (BSS), which was long deemed computationally infeasible for . Recently, Bertsimas et al. (2016) have shown that BSS can be solved for problems of practical size within reasonable time by using gradient descent methods and mixed-integer optimization solvers. In particular, they solve BSS with of order and of order in minutes. If and are very large, or if re-sampling is used to compute a critical value, then this may still be costly in terms of computation. However, promising work by Hazimeh and Mazumder (2018) shows that problems of order can be approximated in seconds.
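To make the structure of this problem concrete, the following sketch solves best subset selection with a constant response by brute-force enumeration. This is not the method of Bertsimas et al. (2016) or Hazimeh and Mazumder (2018); it is only feasible for very small dimensions and is meant as an illustration. It assumes that the response is a vector of ones, that there is no intercept, and that every candidate subset of columns is linearly independent. The function names are hypothetical.

```python
from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    k = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(k):
        piv = max(range(c, k), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, k):
            f = M[r][c] / M[c][c]
            for j in range(c, k + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * k
    for c in reversed(range(k)):
        s = sum(M[c][j] * x[j] for j in range(c + 1, k))
        x[c] = (M[c][k] - s) / M[c][c]
    return x

def rss_on_subset(X, subset):
    """RSS from regressing a vector of ones on the columns in `subset`."""
    cols = list(subset)
    A = [[sum(r[i] * r[j] for r in X) for j in cols] for i in cols]
    b = [sum(r[i] for r in X) for i in cols]
    beta = solve(A, b)  # normal equations restricted to the subset
    return sum((1.0 - sum(bj * r[j] for bj, j in zip(beta, cols))) ** 2 for r in X)

def best_subset(X, k):
    """Return (minimal RSS, subset) over all subsets of exactly k columns.

    Adding a column never increases the RSS, so subsets of size exactly k
    suffice for the constraint of at most k non-zero coefficients.
    """
    p = len(X[0])
    return min((rss_on_subset(X, s), s) for s in combinations(range(p), k))

X = [[1.0, 0.2, -0.5], [0.9, -0.1, 0.4], [1.1, 0.3, 0.1],
     [0.8, -0.4, -0.2], [1.2, 0.1, 0.3]]
print(best_subset(X, 1), best_subset(X, 2))
```

The number of subsets grows combinatorially in the dimension, which is why the dedicated solvers cited above are needed for problems of practical size.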
If is assumed to be diagonal, then the computation becomes much simpler. Proposition 3 shows that can be computed by solving
It is straightforward to show that , where contains the largest elements of (see e.g. Proposition 3 of Bertsimas et al. (2016) for a formal proof). This leads to the closed form
where has elements equal to the corresponding element in if its index is in , and 0 otherwise.
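The closed form can be sketched in a few lines. The displayed expression is not reproduced in this text, so the code below rests on an assumption: that the statistic is the square root of the sum of the k largest squared standardized means, with standardization by the diagonal variance estimates, and without any further scaling (for example by the sample size). The function name and its arguments are hypothetical.

```python
import math

def sparse_diag_statistic(mean, var, k):
    """Square root of the sum of the k largest squared standardized means.

    `mean` holds the estimated means, `var` the diagonal variance estimates.
    Returns the statistic and the sorted indexes of the selected elements.
    """
    z2 = [m * m / v for m, v in zip(mean, var)]
    top = sorted(range(len(z2)), key=lambda j: z2[j], reverse=True)[:k]
    return math.sqrt(sum(z2[j] for j in top)), sorted(top)

mean = [3.0, -0.2, 1.0, -4.0]
var = [1.0, 1.0, 4.0, 4.0]
print(sparse_diag_statistic(mean, var, 2))
```

Setting k equal to the dimension selects every element, in which case the value is the usual quadratic form under a diagonal covariance estimate.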
If , then the test statistic is closely related to threshold-type statistics (see e.g. Fan, 1996; Zhong et al., 2013) including the screening statistic proposed by Fan et al. (2015). They define the screening statistic as
where , for a given threshold value . Here, Fan et al. (2015) choose to grow sufficiently fast so that using 0 as critical value results in a test with asymptotic size 0. Notice that if is instead chosen such that , then is a monotone function of : . Hence, the difference is that a threshold-type statistic specifies the sparsity level of the sub-alternative of interest implicitly, whereas the proposed statistic specifies it explicitly. While the latter seems more intuitive, there may be applications for which a threshold-type specification of the sparsity level is desirable.
Another popular tool to obtain sparse solutions is Lasso (Tibshirani, 1996), which is linear regression under an -norm restriction. Unlike , the Lasso constraint set , where , is neither a scone nor even a cone. However, as mentioned in Section 2, it is possible to convert non-cone restrictions to cone restrictions, which I will demonstrate in this section.
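The correspondence between a threshold and a sparsity level can be illustrated as follows. The sketch assumes that the screening set collects the indexes whose absolute standardized mean exceeds the threshold, and that there are no ties among the standardized means; the scaling of the screening statistic itself is left out. Any threshold strictly between the k-th and (k+1)-th largest absolute standardized mean then selects exactly the k largest elements. The function names are hypothetical.

```python
import math

def standardized(mean, var):
    """Absolute standardized means |m_j| / sqrt(v_j)."""
    return [abs(m) / math.sqrt(v) for m, v in zip(mean, var)]

def thresholded_set(mean, var, delta):
    """Indexes whose absolute standardized mean exceeds the threshold delta."""
    return [j for j, z in enumerate(standardized(mean, var)) if z > delta]

def top_k_set(mean, var, k):
    """Sorted indexes of the k largest absolute standardized means."""
    z = standardized(mean, var)
    return sorted(sorted(range(len(z)), key=lambda j: z[j], reverse=True)[:k])

def threshold_for_k(mean, var, k):
    """A threshold midway between the k-th and (k+1)-th largest value (needs k < p)."""
    z = sorted(standardized(mean, var), reverse=True)
    return 0.5 * (z[k - 1] + z[k])

mean = [3.0, -0.2, 1.0, -4.0]
var = [1.0, 1.0, 4.0, 4.0]
delta = threshold_for_k(mean, var, 2)
print(delta, thresholded_set(mean, var, delta), top_k_set(mean, var, 2))
```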
Let be the intersection of the Lasso constraint set and the eigenvalue ellipsoid. Then we can construct the cone that consists of all the scalar multiples of and the origin. This leads to the sub-alternative
Notice that if is the eigenvalue matrix of , then this sub-alternative coincides with the 1-sparse sub-alternative when , and the Hotelling statistic if . A visual illustration of the two-dimensional case is given in Figure 4, for . This figure shows that the sub-alternative corresponding to the Lasso constraint is a set of cones that are centered around the axes, and whose angles are regulated by the parameter . Therefore, the sub-alternative induced by Lasso can be viewed as a near-sparse sub-alternative.
By Proposition 2, the computation of the test statistic reduces to Lasso
so that .
Using its penalty form exposes an alternative interpretation of Lasso, in that it shrinks the mean estimate . In particular, using Proposition 2, the penalized form of the Lasso problem is given by
where is the penalty parameter, is the element-wise absolute value, and is the element-wise positive-part operator. This shows that Lasso shrinks each element of the mean estimate by , to produce the estimate . (Similarly, a Ridge or -penalty (Hoerl and Kennard, 1970) inflates the diagonals of the estimate of .)
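The shrinkage described here, built from the sign, the element-wise absolute value and the positive-part operator, is the familiar soft-thresholding operation. A minimal sketch follows. The exact amount by which each element is shrunk is not recoverable from this text, so it is passed in as the hypothetical argument `shrink` rather than derived from the penalty parameter.

```python
import math

def soft_threshold(mean, shrink):
    """Shrink each element of `mean` towards zero by `shrink`.

    Computes sign(m) * max(|m| - shrink, 0) element-wise, so elements
    whose absolute value is at most `shrink` are set to zero.
    """
    return [math.copysign(max(abs(m) - shrink, 0.0), m) for m in mean]

print(soft_threshold([1.5, -0.3, -2.0, 0.0], 0.5))
```

Elements that are small relative to the shrinkage amount are set exactly to zero, which is the source of the sparsity in the resulting mean estimate.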
4.3 Critical values
In order to construct a test, a critical value is necessary for which the distribution under the null should be approximated. For the purposes of this paper, I will use a reflection-based randomization test. Under a symmetry assumption on , this leads to an exact critical value because the sparse sub-alternatives and Lasso sub-alternatives share . This section briefly describes the construction of such a critical value (see e.g. Ch. 15 of Lehmann and Romano (2006) for a more general discussion).
Let contain all reflection matrices, so that is the reflection group of order . The assumption that permits the construction of an exact test is that and share the same distribution, for all . Let be selected uniformly from such that it includes the identity matrix. Denote the set containing the reflection transformed data by . Let be the set containing the test statistic of interest applied to each of the reflected data sets. Notice that under , each of the elements of would ex-ante have been equally likely to be observed. So, for a given level of significance , the critical value can be defined as the quantile of . If all elements of are different, then rejecting if is an exact test.
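The construction above can be sketched as follows. The code is an illustration under stated assumptions: a reflection is taken to be a random sign flip applied to each row (observation) of the data matrix, the critical value is taken to be the empirical (1 − α) quantile of the statistics over the sampled reflections including the identity, and the test statistic is a simple diagonal sparse statistic that uses second moments about zero as variance estimates. The function names, the number of draws and the seed are hypothetical choices.

```python
import math
import random

def sparse_stat(X, k):
    """Square root of the sum of the k largest squared standardized column means."""
    n, p = len(X), len(X[0])
    mean = [sum(r[j] for r in X) / n for j in range(p)]
    var = [sum(r[j] ** 2 for r in X) / n for j in range(p)]  # second moment about zero
    z2 = sorted((m * m / v for m, v in zip(mean, var)), reverse=True)
    return math.sqrt(sum(z2[:k]))

def reflection_critical_value(X, stat, alpha=0.05, draws=199, seed=0):
    """Empirical (1 - alpha) quantile of `stat` over random row sign flips."""
    rng = random.Random(seed)
    values = [stat(X)]  # the identity reflection is always included
    for _ in range(draws):
        signs = [rng.choice((-1.0, 1.0)) for _ in range(len(X))]
        flipped = [[s * x for x in row] for s, row in zip(signs, X)]
        values.append(stat(flipped))
    values.sort()
    idx = math.ceil((1.0 - alpha) * len(values)) - 1
    return values[idx]

data_rng = random.Random(1)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(20)]
stat = lambda data: sparse_stat(data, 1)
cv = reflection_critical_value(X, stat)
print(stat(X), cv, stat(X) > cv)  # reject the null if the statistic exceeds cv
```

Because the second moments about zero are invariant to sign flips of the rows, only the column means change across reflections in this sketch, which keeps the repeated evaluation of the statistic cheap.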
- Bai and Saranadasa (1996) Bai, Z. and Saranadasa, H., 1996. Effect of high dimension: by an example of a two sample problem. Statistica Sinica, pages 311–329.
- Bertsimas et al. (2016) Bertsimas, D., King, A., Mazumder, R., et al., 2016. Best subset selection via a modern optimization lens. The Annals of Statistics, 44(2):813–852.
- Bugni et al. (2016) Bugni, F. A., Caner, M., Kock, A. B., and Lahiri, S., 2016. Inference in partially identified models with many moment inequalities using lasso. arXiv preprint arXiv:1604.02309.
- Canay et al. (2017) Canay, I. A., Romano, J. P., and Shaikh, A. M., 2017. Randomization tests under an approximate symmetry assumption. Econometrica, 85(3):1013–1030.
- Chen et al. (2011) Chen, L. S., Paul, D., Prentice, R. L., and Wang, P., 2011. A regularized Hotelling’s T² test for pathway analysis in proteomic studies. Journal of the American Statistical Association, 106(496):1345–1360.
- Chen et al. (2010) Chen, S. X., Qin, Y.-L., et al., 2010. A two-sample test for high-dimensional data with applications to gene-set testing. The Annals of Statistics, 38(2):808–835.
- Chernozhukov et al. (2018) Chernozhukov, V., Chetverikov, D., and Kato, K., 2018. Inference on causal and structural parameters using many moment inequalities. arXiv preprint arXiv:1312.7614.
- Fan (1996) Fan, J., 1996. Test of significance based on wavelet thresholding and Neyman’s truncation. Journal of the American Statistical Association, 91(434):674–688.
- Fan et al. (2015) Fan, J., Liao, Y., and Yao, J., 2015. Power enhancement in high-dimensional cross-sectional tests. Econometrica, 83(4):1497–1541.
- Hastie et al. (2005) Hastie, T., Tibshirani, R., Friedman, J., and Franklin, J., 2005. The elements of statistical learning: data mining, inference and prediction, volume 27. Springer.
- Hastie et al. (2017) Hastie, T., Tibshirani, R., and Tibshirani, R. J., 2017. Extended comparisons of best subset selection, forward stepwise selection, and the lasso. arXiv preprint arXiv:1707.08692.
- Hazimeh and Mazumder (2018) Hazimeh, H. and Mazumder, R., 2018. Fast best subset selection: Coordinate descent and local combinatorial optimization algorithms. arXiv preprint arXiv:1803.01454.
- Hoerl and Kennard (1970) Hoerl, A. E. and Kennard, R. W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.
- Kock and Preinerstorfer (2019) Kock, A. B. and Preinerstorfer, D., 2019. Power in high-dimensional testing problems. Econometrica, 87(3):1055–1069.
- Koning and Bekker (2018) Koning, N. W. and Bekker, P. A., 2018. Sparse unit-sum regression. arXiv preprint arXiv:1907.04620.
- Koning and Bekker (2019) Koning, N. W. and Bekker, P. A., 2019. Exact testing of many moment inequalities against multiple violations. arXiv preprint arXiv:1904.12775.
- Lawson and Hanson (1995) Lawson, C. L. and Hanson, R. J., 1995. Solving least squares problems, volume 15. Siam.
- Lehmann and Romano (2006) Lehmann, E. L. and Romano, J. P., 2006. Testing statistical hypotheses. Springer Science & Business Media.
- Mazumder et al. (2017) Mazumder, R., Radchenko, P., and Dedieu, A., 2017. Subset selection with shrinkage: Sparse linear modeling when the snr is low. arXiv preprint arXiv:1708.03288.
- Ross (1976) Ross, S., 1976. The arbitrage theory of capital asset pricing. Journal of Economic Theory, 13(3):341–360.
- Stark and Parker (1995) Stark, P. B. and Parker, R. L., 1995. Bounded-variable least-squares: an algorithm and applications. Computational Statistics, 10:129–129.
- Tibshirani (1996) Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
- Van de Geer (2007) Van de Geer, S. The deterministic lasso. Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich, 2007.
- Van De Geer et al. (2009) Van De Geer, S. A., Bühlmann, P., et al., 2009. On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392.
- Zhong et al. (2013) Zhong, P.-S., Chen, S. X., Xu, M., et al., 2013. Tests alternative to higher criticism for high-dimensional means under sparsity and column-wise dependence. The Annals of Statistics, 41(6):2820–2851.
5 Appendix A
In order to present the proof of Proposition 1, I first provide the following lemma, where the notation is used to mean element-wise scalar multiplication of the set .
Let be a cone and . If Condition 1 holds, then
As Condition 1 holds, a maximizing argument exists. I find
Let for all . Let and be unique optimizers. Then
if and , otherwise.
I will consider two cases: and .
where the second equality follows from Lemma 1.
I will prove that by contradiction. Suppose that be arbitrarily given. Then , as is a unique minimizer. So . As , we have that for all . Then , which implies . This contradicts the assumption that is a minimizer. Hence, . ∎
Let and for all . Let and be unique optimizers. Then
if and , otherwise.
I will consider two cases.
Let be a diagonal matrix and let be a scone. Suppose that and are unique optimizers. Then
if and , otherwise.
I will consider two cases: and .
Using the substitution and yields
where if , as is diagonal and positive-definite. Define . From Lemma 1 it follows that
This case is analogous to the case that in the proof of Proposition 1. ∎
6 Appendix B: Connection to existing test statistics
In this section, I discuss two scone restrictions for which the test statistic reduces to test statistics that have recently received attention in the literature.
6.1 Sign restrictions
Scones can also be constructed by imposing non-negativity restrictions on elements of . This yields the following null and alternative hypothesis:
where and , without loss of generality. This is the well-known moment (in)equalities setting. The special case of this problem where is treated by Koning and Bekker (2019).
For the case , the corresponding test statistic is computed by solving
where has elements . This is a special case of the bounded-variable least-squares minimization (BVLS) problem with a lower bound that is only applied to the first parameters. BVLS is a well-studied minimization problem (see e.g. Stark and Parker 1995; Lawson and Hanson 1995) and implementations are available for most major statistical software packages. If is diagonal, the computation has a closed-form solution.
For , the minimization problem is given by
Let , and let contain the positive elements in . It is straightforward to show that if , and , otherwise.
6.2 Mixing sign and sparsity restrictions
If is diagonal, the computational results for sign restrictions and sparsity restrictions are easily combined. In particular, let contain the indexes of the at most largest and positive elements in . Then if and , otherwise.
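The combined sign and sparsity case can be sketched along the same lines. The displayed expressions are not reproduced in this text, so the code rests on assumptions: every element is subject to the non-negativity restriction, the statistic is the square root of the sum of the squared standardized means over the at most k largest positive elements, and it equals zero when no element is positive. The function name is hypothetical.

```python
import math

def sign_sparse_diag_statistic(mean, var, k):
    """Statistic based on the at most k largest positive standardized means.

    Returns 0.0 when none of the standardized means is positive.
    """
    z = [m / math.sqrt(v) for m, v in zip(mean, var)]
    positive = sorted((zj for zj in z if zj > 0.0), reverse=True)[:k]
    return math.sqrt(sum(zj * zj for zj in positive))

mean = [3.0, -0.2, 1.0, -4.0]
var = [1.0, 1.0, 4.0, 4.0]
print(sign_sparse_diag_statistic(mean, var, 2))
```

Taking k equal to the dimension drops the sparsity restriction and leaves only the sign restriction, so the same function covers the pure sign-restricted case under the assumptions above.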
For , the computation of the test statistic requires solving