Directing Power Towards Sub-Alternatives

07/11/2019 ∙ by Nick Koning, et al. ∙ University of Groningen

This paper proposes a novel test statistic for testing a potentially high-dimensional parameter vector. To derive the statistic, I generalize the Mahalanobis distance to measure length in a direction of interest. The test statistic is the sample analogue of the distance and directs power towards a sub-region within the alternative hypothesis (sub-alternative). I show how the computation of this test statistic can reduce to a linear regression problem with a constant response vector, regularized by the same constraints that specify the sub-alternative. The existence of the statistic is directly tied to the scope of the sub-alternative and reduces to the Hotelling T^2 statistic if the sub-alternative coincides with the alternative. I demonstrate this test statistic by testing against sparse alternatives, where the computation reduces to ℓ_0-regularized regression.


1 Introduction

There has recently been heightened interest in testing the location of a high-dimensional parameter vector, where the dimension $p$ of the parameter vector may exceed the number of observations $n$. Traditionally, multi-dimensional parameter location problems are tested using tests based on a quadratic statistic, such as the Hotelling $T^2$ statistic, which is a multivariate generalization of the $t$ statistic. However, such tests may lose power against sub-regions of the alternative as the dimensionality of the parameter vector increases, even if the inverse covariance matrix is known (see e.g. Fan et al., 2015). Recently, Kock and Preinerstorfer (2019) have further explored this problem, calling for the construction of tests that direct power towards specific sub-regions.

In addition, the inverse covariance matrix is typically not known in practice and must be estimated. This is problematic in high-dimensional settings, as the standard sample covariance matrix is not invertible. The solution provided in the literature is to impose restrictions, through regularization or other methods, in order to estimate the covariance matrix (see e.g. Bai and Saranadasa, 1996; Chen et al., 2010; Chen et al., 2011). However, if no further information about the covariance matrix is available, then it is not obvious which restrictions should be used, and imposing the wrong restrictions can bias the estimate.

To address these two issues, I propose a novel test statistic that generalizes the Hotelling $T^2$ statistic and is suitable for testing in high-dimensional settings. I derive this statistic by generalizing the Mahalanobis distance to a distance that measures length only in a direction of interest. The test statistic is the sample analogue of this distance and directs power towards the sub-region of interest within the alternative (henceforth the sub-alternative).

For the test statistic to exist, a condition that is much weaker than non-singularity of the sample covariance matrix typically suffices. The strength of this condition depends directly on the scope of the sub-alternative. In particular, non-singularity is required only if the sub-alternative contains the entire space except the origin. In that case, the statistic coincides with the Hotelling $T^2$ statistic.

I show that the computation of the test statistic reduces to a quadratic minimization problem, regularized by the same constraints that specify the sub-alternative. If the standard estimate of the sample covariance matrix is used, then this problem reduces to linear regression with a constant response vector. I provide an additional result for the special case of a diagonal estimate of the covariance matrix, for which the computation can reduce to a minimum distance problem, yielding a closed-form test statistic for sub-alternatives defined by sign or sparsity restrictions.

I demonstrate this methodology by testing against sparse sub-alternatives, which can be defined by the number of hypothesized violations of the null hypothesis. The setting of sparse sub-alternatives is also considered by Fan et al. (2015). They provide the following motivating example from financial econometrics concerning multi-factor pricing theory. Multi-factor pricing models assume that the returns of assets can be described by a linear combination of a limited number of factors, so that no individual asset has returns in excess of the market (known as 'alpha' in the finance literature). Because such models are well founded in arbitrage pricing theory (Ross, 1976), one would expect that if such a model is false, it is violated only by a small number of exceptional assets. Under this alternative hypothesis, the underlying vector of excess returns is sparse. Hence, testing a multi-factor pricing model against a small number of exceptional assets can be represented as testing against sparse sub-alternatives.

For sparse sub-alternatives, the computation of my test statistic is a special case of best-subset selection in linear regression. Best-subset selection has seen a recent surge of interest after Bertsimas et al. (2016) showed that this problem is easier to solve than was commonly assumed, by exploiting gradient descent methods and commercially available mixed-integer optimization solvers (see e.g. Hastie et al., 2017; Mazumder et al., 2017; Hazimeh and Mazumder, 2018; Koning and Bekker, 2018, for more recent work). In the case of a diagonal covariance matrix, the test statistic has a closed-form solution and is closely related to threshold-type statistics (see e.g. Fan, 1996; Zhong et al., 2013).

In addition, I consider the Lasso, which is another popular technique to obtain sparse solutions (Tibshirani, 1996). I show that in its constrained form, the Lasso generates a sub-alternative that consists of convex cones centered around the axes, whose width depends on the regularization parameter. The Lasso sub-alternative can therefore be interpreted as a nearly-sparse sub-alternative. The corresponding test statistic can be computed using the Lasso. Furthermore, the penalized form of the Lasso has the interpretation of shrinking the sample mean estimate towards zero.

For this demonstration, I use randomization to construct a critical value (see e.g. Lehmann and Romano, 2006; Canay et al., 2017). While this construction relies on a symmetry assumption on the distribution of the error term, the advantage is that size control is guaranteed even if $p > n$.

2 Restricting the Mahalanobis distance

In this section, I show that the Mahalanobis distance is the result of a maximization problem. I present a new distance measure by restricting the argument over which the maximization takes place, so that distance is measured in a direction of interest. I describe its connection to the Mahalanobis distance and provide some insights into the existence of the new distance measure.

I use the following notation. For a set $A$, the set $A^0$ denotes $A$ in addition to the origin, and $A^{\setminus 0}$ denotes $A$ excluding the origin. I use $I$ to denote the identity matrix. The subscripted vector $\iota_K$ has elements equal to 1 if their index is in the set $K$, and 0 otherwise. Finally, for a vector $v$ with elements $v_i$, the standard notation $\|v\|_q = (\sum_i |v_i|^q)^{1/q}$ is used for the $\ell_q$-norm, with $\|v\|_0$ denoting the number of non-zero elements.

2.1 The distance measure

Let the $p$-vector $\mu$ be the parameter of interest and let $\mathcal{E} = \{s : s'Vs = 1\}$ be the set containing the (possibly defect) unit ellipsoid described by some symmetric matrix $V$. Let $S$ represent a set of constraints. Without loss of generality, assume that $S$ is closed under scalar multiplication, so that it constitutes a cone. That is, if $s \in S$, then $cs \in S$ for all scalars $c$. (Generality is not lost, as only the intersection of $S$ with $\mathcal{E}$ is used to define the distance. Hence, the cone can be constructed from an arbitrary subset of $\mathbb{R}^p$ by taking all scalar multiples of that subset, which yields the same intersection with $\mathcal{E}$. This is demonstrated using the Lasso in Section 4.2.)

I propose the following restricted distance measure, which measures the distance in the directions specified by the cone $S$.

Definition 1.

The length of $\mu$ in $S$ is equal to

$\|\mu\|_{V,S} = \max_{s \in (S \cap \mathcal{E})^0} \mu's.$

The distance measure can be interpreted as follows. Suppose that $s^*$ is the maximizing vector. Then the distance is the length of the (scaled) projection of $\mu$ onto $s^*$. A visual demonstration for both a singular and a positive-definite matrix $V$ is given in Figure 1. The origin is included in the maximization to ensure that the distance is non-negative, and to simplify its computation.

Notice that if $S$ is the entire space $\mathbb{R}^p$, then the maximizing argument is proportional to $V^{-1}\mu$, so that $\|\mu\|_{V,S}$ coincides with the Mahalanobis distance $\sqrt{\mu'V^{-1}\mu}$. In this case, the existence of the statistic requires a positive-definite matrix $V$. If $S$ is a strict subset of $\mathbb{R}^p$, then the following existence condition for $\|\mu\|_{V,S}$ is weaker than positive definiteness of $V$, which is required for the inverse $V^{-1}$ to exist. (The condition follows from first observing that the origin is irrelevant for existence, and then rewriting the maximization over $S \cap \mathcal{E}$ in terms of the directions in $S^{\setminus 0}$.) The strength of the assumption depends entirely on the scope of $S$.
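To make this connection concrete, the following standard Cauchy–Schwarz argument (a verification sketch for positive-definite $V$, not reproduced from the paper) shows that the unrestricted maximization recovers the Mahalanobis distance:

$\max_{s \,:\, s'Vs = 1} \mu's = \max_{s \,:\, s'Vs = 1} (V^{-1/2}\mu)'(V^{1/2}s) \leq \|V^{-1/2}\mu\|_2 \cdot 1 = \sqrt{\mu'V^{-1}\mu},$

with equality attained at $s^* = V^{-1}\mu / \sqrt{\mu'V^{-1}\mu}$, which indeed satisfies $s^{*\prime}Vs^* = 1$.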

Condition 1.

$s'Vs > 0$ for all $s \in S^{\setminus 0}$.

This condition is closely related to the restricted eigenvalue conditions used to prove oracle results for the Lasso (see e.g. Van de Geer et al., 2009). For example, if we choose $S = \{s : \|s_{K^c}\|_1 \leq L\|s_K\|_1\}$, where $K$ is an index set with bounded cardinality and $L$ is some constant, the condition coincides with the compatibility condition (Van de Geer, 2007).

The Mahalanobis distance is well known for appearing in the exponent of the density of the normal distribution. As the distance from Definition 1 generalizes the Mahalanobis distance, it also generalizes the normal distribution. In particular, replacing the Mahalanobis distance with this distance measure produces a density

$f(x) = c \exp\{-\tfrac{1}{2}\|x - \mu\|_{V,S}^2\},$

where $c$ is a proportionality constant. This density is equal to the (multivariate) normal density if $S = \mathbb{R}^p$. Using different sets $S$, other densities can be produced. For example, in the two-dimensional case with $S$ equal to the scalar multiples of the canonical basis vectors, the density can be viewed as a bell curve with flattened sides. A visual illustration of this example is provided in Figure 2.

The following result shows that the distance measure can be computed using restricted quadratic minimization. This is convenient, as many such problems have been well studied in the optimization literature. A proof of the result can be found in Appendix A. In Section 3.2, where the distance is used as a test statistic, results for sample moments are provided for which the computation simplifies.

Proposition 1.

Let $s'Vs > 0$ for all $s \in S^{\setminus 0}$. Assume that $s^*$ and $b^* = \arg\min_{b \in S} \{b'Vb - 2\mu'b\}$ are unique optimizers. Then

$\|\mu\|_{V,S} = \sqrt{\mu'b^*}$

if $\mu'b^* > 0$, and $\|\mu\|_{V,S} = 0$ otherwise.

Figure 1: Visual demonstration of the distance measure for a positive-definite and a singular matrix $V$. In both plots, the diagonal elements of $V$ are 1. For the left plot, the off-diagonals are such that $\mathcal{E}$ is an ellipse, and for the right plot they are such that $\mathcal{E}$ is two parallel lines. The dark part of the lines shows the intersection with $S$. In both plots, the thin line with an open circle at the end represents $\mu$, and the thicker line with the square at the end represents the maximizing vector $s^*$. The thin line with the solid dot is the scaled projection of $\mu$ onto $s^*$. Its length is the distance $\|\mu\|_{V,S}$.
Figure 2: The density $f$ in two dimensions, where $S$ consists of the scalar multiples of the canonical basis vectors, so that the bell curve has flattened sides.

3 Multi-dimensional hypothesis testing

In this section, I apply the distance measure from the previous section to hypothesis testing. I start by showing how conventional null and alternative hypotheses can be defined by restricting the distance measure. This reveals that different restrictions can lead to the same null and alternative hypotheses, but may differ in the sub-regions of the alternative on which the distance is maximized. I define these sub-regions of the alternative as sub-alternatives. I continue by proposing a test statistic that is the sample equivalent of the distance, in order to direct power towards the sub-alternative. Finally, I provide several results for the computation of the statistic. Interestingly, if the conventional estimate of the sample covariance matrix is used, then the computation of the statistic reduces to linear regression with a constant response, restricted by the set that defines the sub-alternative.

The setting can be described as follows. Let $X$ be an $n \times p$ matrix that can be decomposed as $X = \iota\mu' + E$, where $\mu$ is the $p$-vector of interest, $\iota$ is an $n$-vector of ones, and $E$ is a random matrix whose rows have mean zero and covariance matrix $\Sigma$.

3.1 Sub-alternatives

In order to describe the hypotheses in terms of the distance measure, it will be convenient to re-parameterize the model using the eigendecomposition $\Sigma = Q\Lambda Q'$, where the columns of $Q$ are orthonormal eigenvectors of $\Sigma$ and $\Lambda$ is the diagonal matrix containing the eigenvalues of $\Sigma$. This re-parametrization is equivalent to using $V = \Lambda$ in the distance measure. Without loss of generality, assume that the eigenvalues are normalized and that $S$ is a cone. (Note that the normalization of the eigenvalues and the re-parametrization should be taken into account when constructing this cone from an arbitrary subset of $\mathbb{R}^p$.)

With the distance measure defined in the previous section, the null hypothesis $H_0 : \mu = 0$ and alternative hypothesis $H_1 : \mu \neq 0$ can be equivalently written in terms of the length of $\mu$ in $S$:

$H_0 : \|\mu\|_{V,S} = 0, \qquad H_1 : \|\mu\|_{V,S} > 0.$

This reformulation is convenient, as it condenses the vector hypotheses into scalar hypotheses. Additionally, a different null hypothesis and alternative hypothesis can be constructed by simply replacing $S$ with some other cone. For example, the cone $S = \{s : s \geq 0\}$ produces a 'one-sided' test with null hypothesis $H_0 : \mu \leq 0$ and alternative hypothesis $H_1 : \mu \not\leq 0$.

A key observation to make is that different cones can produce the same null and alternative hypotheses. For example, the cone containing the vectors in the directions of the canonical basis and the cone $S = \mathbb{R}^p$ share the same $H_0$ and $H_1$, as the corresponding lengths are positive for the same vectors $\mu$. However, the two lengths differ for almost all $\mu$ in $\mathbb{R}^p$. It then seems natural to define the directions in which the length is largest to be the sub-regions of interest of the alternative (henceforth named sub-alternatives).

The length $\|\mu\|_{V,S}$ is largest if $\mu \in S$, as a vector not in $S$ that is projected onto some vector in $S$ is shortened as a result of the projection. Therefore, I define the sub-alternative corresponding to $S$ to be the vectors in $S$, excluding the origin. Notice that the sub-alternative and the alternative coincide if and only if $S^{\setminus 0} = \mathbb{R}^p \setminus \{0\}$.

Definition 2.

The sub-alternative corresponding to $S$ is $S^{\setminus 0}$.

3.2 Test statistic

In this section, I use the new distance measure to propose a novel test statistic that directs power towards the sub-alternative. The following notation is used. The vector $\hat\mu$ stands for a generic estimate of $\mu$, and $\hat{V}$ for a generic estimate of $V$. Let $\hat{\mathcal{E}} = \{s : s'\hat{V}s = 1\}$ denote the sample analogue of $\mathcal{E}$.

I propose the following test statistic, which is the sample analogue of the distance:

$W_S = \sqrt{n}\, \max_{s \in (S \cap \hat{\mathcal{E}})^0} \hat\mu's,$

where $S$ is the cone that describes the sub-alternative.

For the estimate $\hat\mu$, I use the sample mean $\hat\mu = X'\iota/n$. For $\hat{V}$, I will focus on two estimates. The first is the conventional sample covariance estimate $\hat{V} = G/n$, where $G = X'X$ is the Gramian matrix. The second is a diagonal estimate $D$, which includes popular choices such as the diagonal of $G/n$ or the identity matrix. For notational clarity, I will write $W_S$ if $\hat{V}$ is the full sample covariance estimate, and add a superscript, $W_S^D$, if $\hat{V}$ is a diagonal estimate. In Section 4.2, I consider $\ell_1$- and $\ell_2$-regularization and show how $\ell_1$-regularization corresponds to shrinkage in $\hat\mu$ and $\ell_2$-regularization to shrinkage in $\hat{V}$.

The intuition for the statistic is that it is small under $H_0$, positive under $H_1$ and large under the sub-alternative $S^{\setminus 0}$, so that power is directed towards the sub-alternative. Notice that if the sub-alternative and the alternative coincide, then $W_S$ is equal to the square root of the Hotelling $T^2$ statistic. A crucial difference with the Hotelling statistic is that $W_S$ does not generally require $\hat{V}$ to be invertible. Instead, for its existence a weaker sample equivalent of Condition 1 suffices.

3.3 Computing the test statistic

From Proposition 1 it follows that the computation of $W_S$ reduces to a quadratic minimization problem. In this section, I show that the computation can simplify in some cases. In particular, I provide results for the case that $\hat{V} = G/n$ and the case that $\hat{V}$ is diagonal. The proofs of these results can be found in Appendix A.

If $G/n$ is chosen as the estimate of the covariance matrix and $\hat\mu = X'\iota/n$, then Proposition 2 shows that the computation simplifies to a linear regression of the constant vector $\iota$ on $X$, regularized by the cone that defines the sub-alternative of interest. This is a curious result, because linear regression with a constant response has no other applications that I am aware of. The reason for this is that in most applications the matrix $X$ contains an intercept column, so that regressing $\iota$ on $X$ results in a trivial fit. The result suggests that the residual sum of squares could be used as a test statistic instead of the distance measure.

Proposition 2.

Let $\hat{V} = G/n$ and $\hat\mu = X'\iota/n$, and let $s'\hat{V}s > 0$ for all $s \in S^{\setminus 0}$. Let $s^*$ and $b^* = \arg\min_{b \in S} \|\iota - Xb\|_2^2$ be unique optimizers. Then

$W_S = \sqrt{n\,\hat\mu'b^*}$

if $\hat\mu'b^* > 0$, and $W_S = 0$ otherwise.
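To illustrate Proposition 2 concretely, the following minimal Python sketch computes the statistic for the one-sided cone $S = \{b : b \geq 0\}$, for which the regularized regression is a non-negative least-squares problem. The helper name and the simulated data are illustrative, and the final formula follows the reconstruction of Proposition 2 above rather than verified code from the paper.

```python
# Minimal sketch: W_S for the one-sided cone S = {b : b >= 0}.
# Assumes the statistic W_S = sqrt(n * mu_hat' b*) from Proposition 2.
import numpy as np
from scipy.optimize import nnls

def stat_one_sided(X):
    n = X.shape[0]
    iota = np.ones(n)
    mu_hat = X.T @ iota / n        # sample mean
    b_star, _ = nnls(X, iota)      # argmin_{b >= 0} ||iota - X b||_2^2
    val = mu_hat @ b_star
    return np.sqrt(n * val) if val > 0 else 0.0

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10)) + 0.3    # mean shifted into the cone
print(stat_one_sided(X))
```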

In addition, I consider the case where $\hat{V}$ is a diagonal estimate $D$ of the covariance matrix, such as the diagonal of $G/n$ or the identity matrix. In this case, the computation can simplify substantially if a special type of cone is used. In particular, define a scone (signed cone) to be a set of vectors that is closed under multiplication by a positive-definite diagonal matrix. That is, if $b \in S$, then $\tilde{D}b \in S$ for every positive-definite diagonal matrix $\tilde{D}$. Pre-multiplication by such a matrix is a sign-preserving operation, so that the sub-alternatives for scones relate directly to hypotheses concerning sign and sparsity restrictions.

For a scone sub-alternative, Proposition 3 shows that computing $W_S^D$ requires solving a minimum distance problem. Such problems are typically easier to solve than a regression. For example, the next section considers testing against sparse sub-alternatives, for which this problem has a closed-form solution.

Proposition 3.

Let $\hat{V} = D$ be a positive-definite diagonal matrix and let $S$ be a scone. Suppose that $s^*$ and $a^* = \arg\min_{a \in S} \|a - D^{-1/2}\hat\mu\|_2^2$ are unique optimizers. Then

$W_S^D = \sqrt{n\,\hat\mu'D^{-1/2}a^*}$

if $\hat\mu'D^{-1/2}a^* > 0$, and $W_S^D = 0$ otherwise.

4 Sparse sub-alternatives

In this section, I demonstrate the methodology described in the previous sections by considering the special case of sparse sub-alternatives. For a given level of sparsity $k$, the sparse sub-alternative is defined by the set $S_k = \{b : \|b\|_0 \leq k\}$, where $\|b\|_0$ is the $\ell_0$-norm of a vector $b$, which counts the number of non-zero elements in the vector. Notice that $S_k$ is a scone, as $\|\tilde{D}b\|_0 = \|b\|_0$ for all positive-definite diagonal matrices $\tilde{D}$. The sub-alternative corresponding to this scone is given by

$S_k^{\setminus 0} = \{b : 1 \leq \|b\|_0 \leq k\}.$

The interpretation of this sub-alternative is that at most $k$ elements of $\mu$ are unequal to zero. The corresponding null hypothesis is $H_0 : \mu = 0$ and the alternative hypothesis is $H_1 : \mu \neq 0$. Notice that the alternative hypothesis and the sub-alternative coincide if $k = p$.

The sparse sub-alternative leads to the statistic $W_{S_k}$, which will be abbreviated as $W_k$ for notational convenience. A visual illustration comparing the $W_1$ statistic to the $W_2$ statistic is provided for the two-dimensional case in Figure 3. In the left panel, it can be observed that the $W_1$ statistic is large near the axes, where $\hat\mu$ is sparse. Conversely, the right panel shows that the $W_2$ statistic is equally large in every direction. The difference becomes increasingly pronounced as the dimension $p$ increases while the level of sparsity $k$ is fixed.

Figure 3: Level plot of $W_1$ (left) and $W_2$ (right) over $\hat\mu$ in two dimensions. Notice that the plot on the left is closely related to Figure 2, which plots the flattened bell-curve density.

4.1 Computation

For the case that $\hat{V} = G/n$, Proposition 2 shows that computing the corresponding test statistic requires solving the non-convex quadratic minimization problem

$\min_{b \,:\, \|b\|_0 \leq k} \|\iota - Xb\|_2^2. \qquad (1)$

This minimization problem is a special case of the best subset selection (BSS) problem, which was long deemed computationally infeasible for all but small $p$. Recently, Bertsimas et al. (2016) have shown that BSS can be solved for problems of practical size within reasonable time by using gradient descent methods and mixed-integer optimization solvers. In particular, they solve BSS problems with $n$ in the thousands and $p$ in the hundreds within minutes. If $n$ and $p$ are very large, or if re-sampling is used to compute a critical value, this may still be costly in terms of computation. However, promising work by Hazimeh and Mazumder (2018) shows that problems with $p$ of order $10^6$ can be approximated in seconds.
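To give a sense of the gradient descent methods involved, the following Python sketch approximates problem (1) by iterative hard thresholding (projected gradient descent). This is a simple stand-in in the spirit of the methods discussed by Bertsimas et al. (2016), not their algorithm, and the function name is mine.

```python
# Rough sketch: iterative hard thresholding for min ||y - Xb||^2 s.t. ||b||_0 <= k.
import numpy as np

def iht(X, y, k, n_iter=500):
    n, p = X.shape
    L = np.linalg.norm(X, 2) ** 2      # step size from the gradient's Lipschitz constant
    b = np.zeros(p)
    for _ in range(n_iter):
        b = b - X.T @ (X @ b - y) / L  # gradient step on 0.5 * ||y - X b||^2
        keep = np.argsort(np.abs(b))[-k:]
        mask = np.zeros(p, dtype=bool)
        mask[keep] = True
        b[~mask] = 0.0                 # projection: keep the k largest entries
    return b

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 200))
b_star = iht(X, np.ones(100), k=5)     # constant response, as in problem (1)
```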

If $\hat{V}$ is assumed to be a diagonal matrix $D$, then the computation becomes much simpler. Proposition 3 shows that $W_k^D$ can be computed by solving

$\min_{a \,:\, \|a\|_0 \leq k} \|a - D^{-1/2}\hat\mu\|_2^2.$

It is straightforward to show that the solution is $a^* = (D^{-1/2}\hat\mu)_K$, where $K$ contains the indexes of the $k$ largest elements of $D^{-1/2}\hat\mu$ in absolute value (see e.g. Proposition 3 of Bertsimas et al. (2016) for a formal proof). This leads to the closed form

$W_k^D = \sqrt{n}\, \|(D^{-1/2}\hat\mu)_K\|_2,$

where $(D^{-1/2}\hat\mu)_K$ has elements equal to the corresponding element in $D^{-1/2}\hat\mu$ if its index is in $K$, and 0 otherwise.
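In code, this statistic is a one-liner up to bookkeeping. The sketch below uses the per-column sample variances as one possible diagonal estimate, and assumes the closed form reconstructed above; the function name is illustrative.

```python
# Minimal sketch of the closed-form sparse statistic for a diagonal estimate D.
import numpy as np

def stat_sparse_diag(X, k):
    n = X.shape[0]
    w = X.mean(axis=0) / np.sqrt(X.var(axis=0))  # w = D^{-1/2} mu_hat
    top_k = np.sort(np.abs(w))[-k:]              # k largest elements in absolute value
    return np.sqrt(n) * np.sqrt(np.sum(top_k ** 2))
```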

If $D$ is the diagonal of the sample covariance estimate, then the test statistic is closely related to threshold-type statistics (see e.g. Fan, 1996; Zhong et al., 2013), including the screening statistic proposed by Fan et al. (2015). They define the screening statistic as a sum of squared studentized means over the elements that exceed a given threshold value $\delta$. Here, Fan et al. (2015) choose $\delta$ to grow sufficiently fast so that using 0 as the critical value results in a test with asymptotic size 0. Notice that if $\delta$ is instead chosen such that exactly $k$ studentized means exceed the threshold, then the screening statistic is a monotone function of $W_k^D$: it equals $(W_k^D)^2$. Hence, the difference is that a threshold-type statistic implicitly, and the $W_k^D$ statistic explicitly, specifies the sparsity level of the sub-alternative of interest. While the latter seems more intuitive, there may be applications for which a threshold-type specification of the sparsity level is desirable.
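A small numerical check of this relation, under the reconstruction above (the exact normalization of Fan et al.'s statistic may differ):

```python
# Threshold chosen so that exactly k studentized means exceed it:
# the screening statistic then equals the squared sparse statistic.
import numpy as np

t = np.array([2.1, -0.4, 3.0, 0.2, -1.7])      # studentized means, illustrative
k = 2
delta = np.sort(np.abs(t))[-k]                 # threshold hit by exactly k elements
J = np.sum(t ** 2 * (np.abs(t) >= delta))      # threshold-type statistic
print(J, np.sum(np.sort(np.abs(t))[-k:] ** 2)) # both 13.41 here
```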

4.2 Lasso

Another popular tool to obtain sparse solutions is the Lasso (Tibshirani, 1996), which is linear regression under an $\ell_1$-norm restriction. Unlike $S_k$, the Lasso constraint set $\{b : \|b\|_1 \leq c\}$, where $c > 0$, is neither a scone nor even a cone. However, as mentioned in Section 2, it is possible to convert non-cone restrictions to cone restrictions, which I demonstrate in this section.

Let $\mathcal{L}$ be the intersection of the Lasso constraint set and the eigenvalue ellipsoid $\mathcal{E}$. Then we can construct the cone $S_{\mathcal{L}}$ that consists of all the scalar multiples of $\mathcal{L}$ and the origin. This leads to the sub-alternative $S_{\mathcal{L}}^{\setminus 0}$.

Notice that if $V = \Lambda$ is the eigenvalue matrix of $\Sigma$, then this sub-alternative coincides with the 1-sparse sub-alternative when $c$ is sufficiently small, and the statistic coincides with the Hotelling $T^2$ statistic when $c$ is sufficiently large. A visual illustration of the two-dimensional case is given in Figure 4. This figure shows that the sub-alternative corresponding to the Lasso constraint is a set of convex cones that are centered around the axes, and whose angles are regulated by the parameter $c$. Therefore, the sub-alternative induced by the Lasso can be viewed as a nearly-sparse sub-alternative.

By Proposition 2, the computation of the test statistic reduces to the Lasso problem

$b^* = \arg\min_{b \,:\, \|b\|_1 \leq c} \|\iota - Xb\|_2^2,$

so that $W_{S_{\mathcal{L}}} = \sqrt{n\,\hat\mu'b^*}$ if $\hat\mu'b^* > 0$, and 0 otherwise.

Using its penalty form exposes an alternative interpretation of the Lasso: it shrinks the mean estimate $\hat\mu$ towards zero. In particular, using Proposition 2, the penalized form of the Lasso problem is given by

$\min_{b} \; b'\hat{V}b - 2\left(\mathrm{sign}(\hat\mu) \circ (|\hat\mu| - \lambda\iota)_+\right)'b,$

where $\lambda$ is the penalty parameter, $|\cdot|$ is the element-wise absolute value, and $(\cdot)_+$ is the element-wise positive-part operator. This shows that the Lasso shrinks each element of the mean estimate by $\lambda$, to produce the estimate $\mathrm{sign}(\hat\mu) \circ (|\hat\mu| - \lambda\iota)_+$. (Similarly, a Ridge or $\ell_2$-penalty (Hoerl and Kennard, 1970) inflates the diagonal of the estimate of $V$.)
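The shrinkage interpretation is easy to state in code; the sketch below is just element-wise soft-thresholding, with names of my choosing.

```python
# Soft-thresholding: sign(mu) * max(|mu| - lam, 0), applied element-wise.
import numpy as np

def soft_threshold(mu_hat, lam):
    return np.sign(mu_hat) * np.maximum(np.abs(mu_hat) - lam, 0.0)

print(soft_threshold(np.array([0.8, -0.3, 0.05, -1.2]), lam=0.4))
# [ 0.4 -0.   0.  -0.8]: small elements are shrunk to exactly zero
```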

Figure 4: The construction of a Lasso sub-alternative in two dimensions. The area inside the diamond corresponds to the Lasso constraint $\|b\|_1 \leq c$, the circle is the unit sphere $\mathcal{E}$, the thicker sections of the circle correspond to the intersection $\mathcal{L}$, and the cone corresponding to the intersection is shown as the gray dashed areas.

4.3 Critical values

In order to construct a test, a critical value is necessary, which requires approximating the distribution of the statistic under the null. For the purposes of this paper, I use a reflection-based randomization test. Under a symmetry assumption on the error distribution, this leads to an exact critical value for both the sparse sub-alternatives and the Lasso sub-alternatives, as they share the null hypothesis $\mu = 0$. This section briefly describes the construction of such a critical value (see e.g. Ch. 15 of Lehmann and Romano, 2006, for a more general discussion).

Let $\mathcal{G}$ contain all $n \times n$ diagonal sign-reflection matrices, so that $\mathcal{G}$ is the reflection group of order $2^n$. The assumption that permits the construction of an exact test is that $E$ and $GE$ share the same distribution, for all $G \in \mathcal{G}$. Let $\mathcal{G}_M$ be a set of $M$ elements selected uniformly from $\mathcal{G}$, constructed such that it includes the identity matrix. Denote the set containing the reflection-transformed data by $\{GX : G \in \mathcal{G}_M\}$. Let $\mathcal{W}$ be the set containing the test statistic of interest applied to each of the reflected data sets. Notice that under $H_0$, each of the elements of $\mathcal{W}$ would ex ante have been equally likely to be observed. So, for a given level of significance $\alpha$, the critical value can be defined as the $(1-\alpha)$-quantile of $\mathcal{W}$. If all elements of $\mathcal{W}$ are different, then rejecting if the statistic on the observed data exceeds this critical value yields an exact test.
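The following Python sketch implements this construction for any statistic function; the helper names are mine, and the observed statistic is included among the reference values, playing the role of the identity reflection.

```python
# Minimal sketch of a reflection-based randomization test.
import numpy as np

def randomization_test(X, statistic, alpha=0.05, M=999, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    w_obs = statistic(X)
    # Reference statistics: random sign flips of the rows of X.
    w_ref = [statistic(rng.choice([-1.0, 1.0], size=n)[:, None] * X)
             for _ in range(M)]
    crit = np.quantile(np.append(w_ref, w_obs), 1 - alpha)
    return w_obs > crit, w_obs, crit
```

For example, `randomization_test(X, stat_one_sided)` combines this with the statistic sketched in Section 3.3.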

References

5 Appendix A

In order to present the proofs of Propositions 1, 2 and 3, I first provide the following lemma, where the notation $cA$ is used to mean element-wise scalar multiplication of the set $A$, that is, $cA = \{ca : a \in A\}$.

Lemma 1.

Let $S$ be a cone and $c > 0$. If Condition 1 holds, then

$\max_{s \in (S \cap c\mathcal{E})^0} \mu's = c \max_{s \in (S \cap \mathcal{E})^0} \mu's.$

Proof.

As Condition 1 holds, a maximizing argument exists. I find

$\max_{s \in (S \cap c\mathcal{E})^0} \mu's = \max_{s \in (S \cap \mathcal{E})^0} \mu'(cs) = c \max_{s \in (S \cap \mathcal{E})^0} \mu's,$

using that $s \in S$ implies $cs \in S$, as $S$ is a cone. ∎

Proposition 1.

Let $s'Vs > 0$ for all $s \in S^{\setminus 0}$. Let $s^*$ and $b^* = \arg\min_{b \in S} \{b'Vb - 2\mu'b\}$ be unique optimizers. Then

$\|\mu\|_{V,S} = \sqrt{\mu'b^*}$

if $\mu'b^* > 0$, and $\|\mu\|_{V,S} = 0$ otherwise.

Proof.

I will consider two cases: $\mu'b^* > 0$ and $\mu'b^* \leq 0$.

Case $\mu'b^* > 0$.
I find

where the second equality follows from Lemma 1.

Case $\mu'b^* \leq 0$.
I will prove that $\|\mu\|_{V,S} = 0$ by contradiction. Suppose that there exists $s \in S \cap \mathcal{E}$ with $\mu's > 0$. As $\mu'b^* \leq 0$ and $b^{*\prime}Vb^* \geq 0$, the minimum value satisfies $b^{*\prime}Vb^* - 2\mu'b^* \geq 0$. However, for $b = (\mu's)s \in S$ we find $b'Vb - 2\mu'b = (\mu's)^2 s'Vs - 2(\mu's)^2 = -(\mu's)^2 < 0$. This contradicts the assumption that $b^*$ is a minimizer. Hence, $\mu's \leq 0$ for all $s \in S \cap \mathcal{E}$, so that $\|\mu\|_{V,S} = 0$. ∎

Proposition 2.

Let $\hat{V} = G/n$ and $\hat\mu = X'\iota/n$, and let $s'\hat{V}s > 0$ for all $s \in S^{\setminus 0}$. Let $s^*$ and $b^* = \arg\min_{b \in S} \|\iota - Xb\|_2^2$ be unique optimizers. Then

$W_S = \sqrt{n\,\hat\mu'b^*}$

if $\hat\mu'b^* > 0$, and $W_S = 0$ otherwise.

Proof.

I will consider two cases.

Case $\hat\mu'b^* > 0$.
Writing out the $\ell_2$-norm and using Proposition 1 yields

By Lemma 1

Combining this and applying some straightforward algebra yields

Case $\hat\mu'b^* \leq 0$.
This case is analogous to the corresponding case in the proof of Proposition 1. ∎

Proposition 3.

Let $\hat{V} = D$ be a positive-definite diagonal matrix and let $S$ be a scone. Suppose that $s^*$ and $a^* = \arg\min_{a \in S} \|a - D^{-1/2}\hat\mu\|_2^2$ are unique optimizers. Then

$W_S^D = \sqrt{n\,\hat\mu'D^{-1/2}a^*}$

if $\hat\mu'D^{-1/2}a^* > 0$, and $W_S^D = 0$ otherwise.

Proof.

I will consider two cases: $\hat\mu'D^{-1/2}a^* > 0$ and $\hat\mu'D^{-1/2}a^* \leq 0$.

Case $\hat\mu'D^{-1/2}a^* > 0$.
Using the substitutions $a = D^{1/2}b$ and $w = D^{-1/2}\hat\mu$ yields

where $a \in S$ if and only if $b \in S$, as $D$ is diagonal and positive-definite and $S$ is a scone. Define $a^* = \arg\min_{a \in S} \|a - w\|_2^2$. From Lemma 1 it follows that

Case $\hat\mu'D^{-1/2}a^* \leq 0$.
This case is analogous to the corresponding case in the proof of Proposition 1. ∎

6 Appendix B: Connection to existing test statistics

In this section, I discuss two scone restrictions for which the test statistic reduces to test statistics that have recently received attention in the literature.

6.1 Sign restrictions

Scones can also be constructed by imposing non-negativity restrictions on elements of $b$. This yields the following null and alternative hypotheses:

$H_0 : \mu_j \leq 0 \text{ for } j \leq m \text{ and } \mu_j = 0 \text{ for } j > m,$

against the alternative that at least one of these (in)equalities is violated, where the inequality restrictions are imposed on the first $m$ elements, without loss of generality. This is the well-known moment (in)equalities setting. The special case of this problem where $m = p$ is treated by Koning and Bekker (2019).

For the case $\hat{V} = G/n$, the corresponding test statistic is computed by solving

$\min_{b \,:\, b_j \geq 0,\; j \leq m} \|\iota - Xb\|_2^2.$

This is a special case of the bounded-variable least-squares (BVLS) minimization problem, with a lower bound applied only to the first $m$ parameters. BVLS is a well-studied minimization problem (see e.g. Stark and Parker, 1995; Lawson and Hanson, 1995) and implementations are available for most major statistical software packages. If $\hat{V}$ is diagonal, the computation has a closed-form solution.
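In Python, this partially bounded problem can be handed to scipy's bounded least-squares solver; the sketch below bounds only the first m coefficients from below, with illustrative names.

```python
# Minimal sketch: BVLS-type problem with a lower bound on the first m parameters.
import numpy as np
from scipy.optimize import lsq_linear

def partially_signed_ls(X, m):
    n, p = X.shape
    lb = np.full(p, -np.inf)
    lb[:m] = 0.0                                   # b_j >= 0 for j <= m only
    res = lsq_linear(X, np.ones(n), bounds=(lb, np.full(p, np.inf)))
    return res.x                                   # b* for the test statistic
```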

For a diagonal estimate $D$, the minimization problem is given by

$\min_{a \,:\, a_j \geq 0,\; j \leq m} \|a - w\|_2^2,$

where $w = D^{-1/2}\hat\mu$. Let $K$ contain the indexes of the positive elements among the first $m$ elements of $w$, together with the indexes $j > m$. It is straightforward to show that $W^D = \sqrt{n}\,\|w_K\|_2$ if $\|w_K\|_2 > 0$, and $W^D = 0$ otherwise.

6.2 Mixing sign and sparsity restrictions

Another type of scone restriction imposes both sign and sparsity restrictions on . The special case where and is treated by Chernozhukov et al. (2018), and Bugni et al. (2016) consider the case and .

If $\hat{V}$ is a diagonal matrix $D$, the computational results for sign restrictions and sparsity restrictions are easily combined. In particular, let $K$ contain the indexes of the at most $k$ largest positive elements in $w = D^{-1/2}\hat\mu$. Then $W^D = \sqrt{n}\,\|w_K\|_2$ if $K$ is non-empty, and $W^D = 0$ otherwise.
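As a sketch, the combined closed form (under the reconstruction above, with illustrative names) keeps the at most k largest positive studentized means:

```python
# Combined sign and sparsity restrictions for a diagonal estimate D.
import numpy as np

def stat_sign_sparse_diag(X, k):
    n = X.shape[0]
    w = X.mean(axis=0) / np.sqrt(X.var(axis=0))  # w = D^{-1/2} mu_hat
    kept = np.sort(w)[-k:]                       # k largest elements...
    kept = kept[kept > 0]                        # ...retaining only the positive ones
    return np.sqrt(n) * np.sqrt(np.sum(kept ** 2))
```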

For $\hat{V} = G/n$, the computation of the test statistic requires solving