where (i) is a matrix of regression coefficients; (ii) is a design matrix of rank ; (iii) is a
matrix with i.i.d. entries having zero mean and unit variance; and (iv), a nonnegative definite matrix, is the population covariance matrix of the errors, with a “square-root” of so that . General linear hypotheses involving the linear model (1.1) are of the form
for an arbitrary “constraints matrix” , subject to the requirement that
is estimable. Without loss of generality, is taken to be of rank . Throughout, we assume that and are fixed, even as the observation dimension and the sample size increase to infinity. Henceforth,
is used to denote the effective sample size, which is also the degrees of freedom associated with the sample error covariance matrix.
With various choices of and , the testing formulation incorporates many hypotheses of interest. For example, multivariate analysis of variance (MANOVA) is a special case. When the sample size is substantially larger than the dimension of the observations, this problem is well-studied. Anderson (1958) and Muirhead (2009) are among the standard references. Various classical inferential procedures involve the matrices
so that is the residual covariance of the full model, an estimator of , while is the hypothesis sums of squares and cross products matrix, scaled by . In a one-way MANOVA set-up, and are, respectively, the within-group and between-group sums of squares and products matrices, scaled by . In the rest of the paper, we shall refer to as the sample covariance matrix.
The testing problem (1.2) is well-studied in the classical multivariate analysis literature. Three standard test procedures are the likelihood ratio test (LR), Lawley–Hotelling trace test (LH) and Bartlett–Nanda–Pillai trace (BNP) test. They are called invariant tests, since under Gaussianity the null distributions of the test statistics are invariant with respect to . One common feature is that all test statistics are linear functionals of the spectrum of . Since this matrix is asymmetric, for convenience, a standard transformation is applied, giving the expressions of the invariant tests as follows. Define
is the “hat matrix” of the reduced model under the null hypothesis. Note that the non-zero eigenvalues of are the same as those of . The test statistics for the LR, LH and BNP tests can be expressed as
The symbol denotes the -th largest eigenvalue of a symmetric matrix, with the convention that and indicate the largest and smallest eigenvalues, respectively.
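As a minimal sketch of how the three criteria depend on the spectrum, the Python snippet below evaluates them from a list of eigenvalues. It assumes the usual textbook forms (the sum of log(1 + λ_i) for LR, the sum of λ_i for LH, and the sum of λ_i / (1 + λ_i) for BNP); the function name `invariant_statistics` and the absence of any further scaling constant are illustrative assumptions, not taken from the text above.

```python
import math


def invariant_statistics(eigenvalues):
    """Return (LR, LH, BNP) computed from nonnegative eigenvalues.

    Assumed forms (illustrative, see lead-in):
      LR  = sum log(1 + lam),  LH = sum lam,  BNP = sum lam / (1 + lam).
    """
    if any(lam < 0 for lam in eigenvalues):
        raise ValueError("eigenvalues must be nonnegative")
    lr = sum(math.log1p(lam) for lam in eigenvalues)
    lh = sum(eigenvalues)
    bnp = sum(lam / (1.0 + lam) for lam in eigenvalues)
    return lr, lh, bnp


# Hypothetical eigenvalues, chosen only for illustration.
lr, lh, bnp = invariant_statistics([0.5, 0.2, 0.1])
```

Because x / (1 + x) ≤ log(1 + x) ≤ x for x ≥ 0, the three values are always ordered as BNP ≤ LR ≤ LH, and they agree to first order when all eigenvalues are small.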
In contemporary statistical research and applications, high-dimensional data whose dimension is at least comparable to the sample size are ubiquitous. In this paper, the focus is on the interesting boundary case when the dimension and the sample size are comparable. Primarily due to the inconsistency of conventional estimators of model parameters (such as ), classical test procedures for the hypothesis (1.2), such as the LR, LH and BNP tests, perform poorly in such settings. When the dimension is larger than the degrees of freedom , the invariant tests are not even well-defined because is singular. Even when is strictly less than , but the ratio is close to , these tests are known to have poor power behavior. Asymptotic results when were obtained in Fujikoshi et al. (2004) under Gaussianity of the populations, and more recently in Bai et al. (2017) under more general settings that only require the existence of certain moments.
Pioneering work on modifying the classical solutions in high dimension is in Bai et al. (2013), who corrected the scaling of the LR statistic when but , and are proportional to . The corrected LR statistic was shown to have significantly more power than its classical counterpart. In contrast, in this paper, we focus on the setting where and are fixed even as so that . In the multivariate regression problem, this corresponds to a situation where the response is high-dimensional, while the predictor is finite-dimensional. In the MANOVA problem, this framework corresponds to high-dimensional observations belonging to one of a finite number of populations.
To the best of our knowledge, when , the linear hypothesis testing problem has been studied in depth only for specific submodels of (1.1), primarily for the important case of two-sample tests for equality of population means. For the latter tests, a widely used idea is to construct modified statistics based on replacing with an appropriate substitute. This approach was pioneered in Bai and Saranadasa (1996) and further developed in Chen and Qin (2010). Various extensions to one-way MANOVA (Srivastava and Fujikoshi, 2006; Yamada and Himeno, 2015; Hu et al., 2017) and a general multi-sample Behrens–Fisher problem under heteroscedasticity (Zhou et al., 2017) exist. Other notable works for the two-sample problem include Biswas and Ghosh (2014); Chang et al. (2017); Chen et al. (2014); Guo and Chen (2016); Lopes et al. (2011); Srivastava et al. (2016); Wang et al. (2015). A second approach aims to regularize to address the issue of its near-singularity in high dimensions; see Chen et al. (2011) and Li et al. (2016) for ridge-type penalties in two-sample settings. Finally, another line of attack consists of exploiting sparsity; see Cai et al. (2014); Cai and Xia (2014).
In this paper, we seek to regularize the spectrum of by flexible shrinkage functions. For a symmetric matrix and a function on , define
is the matrix of eigenvectors associated with the ordered eigenvalues of . Now, consider any real-valued function on that is analytic over a specific domain associated with the limiting behavior of the eigenvalues of , as elaborated in Section 2. The proposed statistics are functionals of eigenvalues of the regularized quadratic forms
Specifically, we propose regularized versions of LR, LH and BNP test criteria, namely
These test statistics are designed to capture possible departures from the null hypothesis, when is replaced by , while suitable choices of the regularizer allow for getting around the problem of singularity or near-singularity when is comparable to .
Notice that has the same non-zero eigenvalues as . Thus, the proposed test family is a generalization of the classical statistics based on . Importantly, (and consequently each of the proposed statistics) is rotation-invariant, which means that if a linear transformation is applied to the observations with an arbitrary orthogonal matrix, the statistic remains unchanged. This is a desirable property when not much additional knowledge about and is available. It should be noted that the two-sample mean tests by Bai and Saranadasa (1996) and Li et al. (2016), together with their generalization to MANOVA, are special cases of the proposed family with and , , respectively.
The present work builds on the work by Li et al. (2016). The theoretical analysis also involves an extension of the analytical framework adopted by Pan and Zhou (2011) in their study of the asymptotic behavior of Hotelling’s T² statistic for non-Gaussian observations. However, the current work goes well beyond the existing literature in several aspects. We highlight these as the key contributions of this manuscript: (a) We propose new families of rotation-invariant tests for general linear hypotheses for multivariate regression problems involving high-dimensional response and fixed-dimensional predictor variables that incorporate a flexible regularization scheme to account for the dimensionality of the observations growing proportionally to the sample size. (b) Unlike Li et al. (2016), who assumed sub-Gaussianity, here only the existence of finite fourth moments of the observations is required. (c) Unlike Pan and Zhou (2011), who assumed , is allowed to be fairly arbitrary and is subject only to some standard conditions on the limiting behavior of its spectrum. (d) We carry out a detailed analysis of the power characteristics of the proposed tests. The proposal of a class of local alternatives enables a clear interpretation of the contributions of different parameters in the performance of the test. (e) We develop a data-driven test procedure based on the principle of maximizing asymptotic power under appropriate local alternatives. This principle leads to the definition of a composite test that combines the optimal tests associated with a set of different kinds of local alternatives. The latter formulation is an extension of the data-adaptive test procedure designed by Li et al. (2016) for the two-sample testing problem.
The rest of the paper is organized as follows. Section 2 introduces the asymptotics of the proposed test family both under the null hypothesis and under a class of local alternatives. Using these local alternatives, in Section 3 a data-driven shrinkage selection methodology based on maximizing asymptotic power is developed. In Section 4, an application of the asymptotic theory and the shrinkage selection method is given for the ridge-regularization family. An extension of ridge-regularization to higher orders is also discussed. The results of a simulation study are reported in Section 5. In the Appendix, a proof outline of the main theorem is presented, while technical details and proofs of other theorems are collected in the Supplementary Material, which is available at anson.ucdavis.edu/%7Elihaoran/.
2 Asymptotic theory
After giving necessary preliminaries on Random Matrix Theory (RMT), the asymptotic theory of the proposed tests under the null hypothesis and under various local alternative models is presented in this section. For any symmetric matrix , define the Empirical Spectral Distribution (ESD) of by
In the following, stands for the maximum absolute value of the entries of a matrix. The following assumptions are employed.
C1 (Moment conditions) has i.i.d. entries such that , , ;
C2 (High-dimensional setting) and are fixed, while such that and ;
C3 (Boundedness of spectral norm) is non-negative definite; ;
C4 (Asymptotic stability of ESD) There exists a distribution with compact support in , non-degenerate at zero, such that , as , where denotes the Wasserstein distance between distributions and , defined as
C5 (Asymptotically full rank) is of full rank and converges to a positive definite matrix. Moreover, ;
C6 (Asymptotically estimable) .
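The two objects appearing in the stability condition on the ESD can be illustrated numerically. The sketch below is a hedged illustration, not the paper's definition: it assumes the ESD places mass 1/p on each eigenvalue, and it assumes the Wasserstein distance is the order-1 version, which for two empirical distributions with equally many atoms reduces to the average absolute difference of the sorted atoms. The function names `esd` and `wasserstein1` are hypothetical.

```python
def esd(eigenvalues, x):
    """Empirical spectral distribution at x: fraction of eigenvalues <= x."""
    return sum(1 for lam in eigenvalues if lam <= x) / len(eigenvalues)


def wasserstein1(a, b):
    """Order-1 Wasserstein distance between two equal-size empirical laws.

    On the real line the optimal coupling matches sorted atoms, so the
    distance is the mean absolute difference of the order statistics.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only handles equally many atoms")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


# Hypothetical spectra, chosen only for illustration.
spectrum_a = [1.0, 2.0, 3.0, 4.0]
spectrum_b = [1.1, 2.1, 2.9, 4.1]
distance = wasserstein1(spectrum_a, spectrum_b)
```

A small value of `distance` means the two spectra are close in the sense that the stability condition requires between the ESD of the population covariance matrix and its limit.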
2.1 Preliminaries on random matrix theory
Recall that the Stieltjes transform of any function of bounded variation on is defined by
Minor modifications of a standard RMT result imply that, under Conditions C1–C6, the ESD converges almost surely to a nonrandom distribution at all points of continuity of . This limit is determined in such a way that for any , the Stieltjes transform of is the unique solution in of the equation
Equation (2.1) is often referred to as the Marčenko–Pastur equation. Moreover, pointwise almost surely for , converges to . The convergence holds even when (negative reals) with a smooth extension of to . Readers may refer to Bai and Silverstein (2004) and Paul and Aue (2014) for more details. From now on, for notational simplicity, we shall write as and write as . Note that
in the sense that for symmetric matrices bounded in operator norm, as ,
The resolvent and its deterministic equivalent are used frequently in this paper. For example, they appear as Cauchy kernels in contour integrals in various places.
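To make the preliminaries concrete, the following sketch computes an empirical Stieltjes transform and solves a Marčenko–Pastur-type equation by fixed-point iteration. It is written under explicit assumptions: the Stieltjes transform of an ESD is taken to be the average of 1/(λ_j − z), and the equation is taken in the standard Silverstein form m = ∫ dH(τ) / (τ(1 − γ − γ z m) − z), with H replaced by a finite list of population eigenvalues. The names `empirical_stieltjes` and `mp_stieltjes`, the starting value, and the tolerances are illustrative choices; plain iteration is not guaranteed to converge for every z.

```python
def empirical_stieltjes(eigenvalues, z):
    """Average of 1 / (lambda_j - z) over the given eigenvalues."""
    return sum(1.0 / (lam - z) for lam in eigenvalues) / len(eigenvalues)


def mp_stieltjes(pop_eigenvalues, gamma, z, tol=1e-12, max_iter=10_000):
    """Solve m = mean_tau 1 / (tau * (1 - gamma - gamma*z*m) - z) by iteration.

    pop_eigenvalues: atoms of the (assumed discrete) population spectrum H.
    gamma: limiting ratio of dimension to sample size.
    z: a point off the support, e.g. a negative real number.
    """
    m = complex(-1.0 / z)  # Stieltjes transform of a point mass at zero
    for _ in range(max_iter):
        m_new = sum(
            1.0 / (tau * (1.0 - gamma - gamma * z * m) - z)
            for tau in pop_eigenvalues
        ) / len(pop_eigenvalues)
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    raise RuntimeError("fixed-point iteration did not converge")


# Identity population covariance, gamma = 0.5, evaluated at z = -1.
m = mp_stieltjes([1.0], 0.5, -1.0)
```

With an identity population covariance, the assumed equation reduces to the quadratic γ z m² − (1 − γ − z) m + 1 = 0, which gives a simple check on the numerical solution. Evaluating at a negative real z corresponds to the ridge-type setting mentioned above.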
2.2 Asymptotics under the null hypothesis
To begin with, for , denote by the Gaussian Orthogonal Ensemble (GOE) defined by (1) ; (2) , , ; (3) ’s are jointly independent for . Throughout this paper, is assumed to be analytic in an open interval containing
Let be a closed contour enclosing such that has a complex extension to the interior of . Further, use to denote .
Theorem 2.1. Suppose C1–C6 hold. Under the null hypothesis ,
where denotes weak convergence and and are as follows.
See (2.2) for the definition of . For any two analytic functions and ,
and is written as for simplicity. The kernel is such that
The contour integral is taken counter-clockwise.
Using knowledge of the eigenvalues of the GOE leads to the following statement.
Under the conditions of Theorem 2.1, assume further that . Let
Then, the limiting joint density function of at is given by
Although they are not available in closed form, and do not depend on the choice of used to compute the contour integral. With the resolvent as kernel can be expressed as the integral of on any contour , up to a scaling factor. The quadratic form is then shown to concentrate around , which consequently serves as the integral kernel in . The kernel of is the limit of .
Two sufficient conditions for are
, with for some , and
It would be convenient if and had closed forms in order to avoid computational inefficiencies. Closed forms are available for special cases as shown in the following lemma.
Lemma 2.1. When with , the contour integrals in Theorem 2.1 have closed forms, namely, for , , ,
The results continue to hold when .
Lemma 2.1 indicates that it is possible to have convenient and accurate estimators of the asymptotic mean and variance of under ridge-regularization. The result easily generalizes to the setting when is a linear combination of functions of the form , for any finite collection of ’s. We elaborate on this in Section 4.
To conduct the tests, consistent estimators of and are needed.
Let and be the plug-in estimators of and , with estimated by . For general , , , we can estimate and by replacing and with and . Denote the resulting estimators by and . Then,
where indicates convergence in probability. Again, we write as .
For the special case of , and , using Lemma 2.1, natural estimators in closed forms are
In particular, for ,
The estimators are consistent, for any fixed and . Given the eigenvalues of , the computational complexity of calculating the above estimators is .
Recall the definitions of and from Section 1.
2.3 Asymptotic power under local alternatives
This subsection deals with the behavior of the proposed family of tests under a host of local alternatives. We start with deterministic alternatives, a framework commonly used in the literature to study the asymptotic power of inferential procedures. Next, we consider a Bayesian framework, using a class of priors that characterize the structure of the alternatives. Because the results to follow simultaneously hold for , and , the unifying notation will be used to refer to each of the test statistics.
2.3.1 Deterministic local alternatives
Consider a sequence of such that, on an open subset of containing ,
Observe that and define
Note that exists and is non-singular under C5 and C6. If further for any , is non-negative definite.
Theorem 2.3. Suppose C1–C6 and (2.3) hold, and . Then, as ,
Denote the power functions of at asymptotic level , conditional on , by
The asymptotic behavior of the power functions is described in the following corollary.
Corollary 2.2. Under the assumptions of Theorem 2.3, as ,
where is the standard normal CDF.
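A limit expressed through the standard normal CDF can be evaluated directly. The sketch below assumes the generic form Φ(−ξ_α + shift), where ξ_α is the upper-α standard normal quantile and `shift` is a hypothetical stand-in for the non-centrality term determined by the local alternative; the exact expression for that term is not reproduced here, and the function name `asymptotic_power` is illustrative.

```python
from statistics import NormalDist


def asymptotic_power(shift, alpha=0.05):
    """Evaluate Phi(-xi_alpha + shift) for a one-sided level-alpha test.

    shift: hypothetical non-centrality term (0 under the null hypothesis).
    alpha: nominal asymptotic level of the test.
    """
    std_normal = NormalDist()
    xi_alpha = std_normal.inv_cdf(1.0 - alpha)  # upper-alpha quantile
    return std_normal.cdf(-xi_alpha + shift)


# Under the null (shift = 0) the power equals the level.
power_null = asymptotic_power(0.0)
power_alt = asymptotic_power(2.0)
```

Under this assumed form, the power equals the level when the shift is zero and increases monotonically with the shift.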
Corollary 2.2 indicates the three proposed statistics have identical asymptotic powers under the assumed local alternatives. This is because the first-order Taylor expansions of , and coincide at . However, the respective empirical powers may differ considerably for moderate sample sizes.
The following remark provides a sufficient condition under which (2.3) is satisfied. Denoting the columns of by , it follows that
(a) Let denote the eigen-projection associated with . Suppose that there exists a sequence (in ) of mappings from to , satisfying , , and a mapping continuous on such that, as and for ,
Then, under C4, it follows that (2.3) holds with and
(b) If , then (2.3) is satisfied if , for some constants , . In this case, .
2.3.2 Probabilistic local alternatives
While deterministic local alternatives provide useful information, they are somewhat restrictive for the purpose of a systematic investigation of the power characteristics. Therefore, probabilistic alternatives are considered in the form of a sequence of prior distributions for . This has the added advantage of providing flexibility for incorporating structural information about the regression parameters and the constraints matrices. The proposed formulation of probabilistic alternatives can be seen as an extension of the proposal adopted by Li et al. (2016) in the context of two-sample tests for equality of means. One challenge associated with formulating meaningful alternatives to the hypothesis (1.2), when compared to the two-sample testing problem, is that there are many more plausible ways in which the null hypothesis can be violated. Considering this, we propose a class of alternatives that, on the one hand, can incorporate a multitude of structures of the parameter and, on the other hand, retains analytical tractability in terms of providing interpretable expressions for the local asymptotic power.
Assume the following prior model of with separable covariance
where is a stochastic matrix ( fixed) with independent elements such that , and for some ; is a deterministic matrix and is a fixed matrix. Moreover, let and suppose there is a nonrandom function such that, as , on an open subset of containing ,
Recalling that is the deterministic equivalent of the resolvent , existence of the limit (2.7) also implies that converges pointwise in probability to . Notice also that is the Stieltjes transform of a measure supported on the eigenvalues of .
Model (2.6) leads to a fairly broad covariance design for multi-dimensional random elements, encompassing structures commonly encountered in many application domains, especially in spatio-temporal statistics. We give some representative examples by considering various functional forms of the matrix . Denote by the columns of and by the columns of .
In all that follows takes values in .
Moving average: for constants .
Taking the MANOVA problem to illustrate, suppose that the columns of represent group mean vectors, and suppose is the matrix that determines successive contrasts among them. Then, is the difference between the means of group and group . Parts (a)–(c) of Example 2.1 then correspond to respectively following an independent, a longitudinal and a moving average process. The row-wise covariance structure is assumed to be such that each has a covariance matrix proportional to . The factor provides the scaling for the tests to have non-trivial local power.
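The successive-contrasts construction can be sketched in a few lines. The example assumes that the group means are stored as the columns of a coefficient matrix `B` and that the constraints matrix `C` has j-th column e_j − e_{j+1}, so that each column of the product is a difference between consecutive group means. The sign convention, the helper names `successive_contrasts` and `matmul`, and the numbers are all hypothetical.

```python
def successive_contrasts(k):
    """k x (k-1) matrix whose j-th column is e_j - e_{j+1}."""
    return [
        [1 if i == j else -1 if i == j + 1 else 0 for j in range(k - 1)]
        for i in range(k)
    ]


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [
        [sum(a[i][l] * b[l][j] for l in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Hypothetical example: 2 coordinates, 3 groups; columns are group means.
B = [[1.0, 1.5, 4.0],
     [0.0, 2.0, 2.0]]
C = successive_contrasts(3)
BC = matmul(B, C)  # column j holds (mean of group j) - (mean of group j+1)
```

The null hypothesis of equal group means then corresponds to every column of `BC` being zero.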
for some function continuous on , where is the th eigenvalue of and is the eigen-projection associated with . Then
Equations (2.7) and (2.8) indicate that effectively captures the distribution of the total spectral mass of across the spectral coordinates of , also taking into account the dimensionality effect through the aspect ratio . Later, we shall discuss specific classes of the matrices that lead to analytically tractable expressions for , with the structure of linking the parameter under the alternative through (2.6) to the structure of .
Another important feature of the probabilistic model is that it incorporates both dense and sparse alternatives through different specifications of the innovation variables . We consider two special cases.
Dense alternative: ;
Sparse alternative: , for some , where
is the discrete probability distribution assigning mass to and mass to the points .
Note that the usual notion of sparsity corresponds to the setting where in addition, . More generally, the second specification above formulates a prior model for that is sparse in the coordinate system determined by . In particular, if is a polynomial in (see Section 3.2 for a discussion), can be seen as sparse in the spectral coordinates of .
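The sparse specification can be illustrated with a three-point distribution. The sketch below assumes a particular parametrization that is not reproduced in the text above: mass 1 − p^(−η) at zero and mass p^(−η)/2 at each of the points ±p^(η/2), for some η in (0, 1). Under this assumption the innovations have mean zero and unit variance, while most of them are exactly zero. The names `sparse_prior` and `moment` are hypothetical.

```python
def sparse_prior(p, eta):
    """Atoms and masses of the assumed three-point sparse distribution."""
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie in (0, 1)")
    weight = p ** (-eta)        # total mass away from zero
    atom = p ** (eta / 2.0)     # magnitude of the non-zero atoms
    return [(-atom, weight / 2.0), (0.0, 1.0 - weight), (atom, weight / 2.0)]


def moment(dist, r):
    """r-th moment of a discrete distribution given as (atom, mass) pairs."""
    return sum(mass * atom ** r for atom, mass in dist)


# Hypothetical dimension and sparsity level, for illustration only.
dist = sparse_prior(p=100, eta=0.5)
mean, second, fourth = moment(dist, 1), moment(dist, 2), moment(dist, 4)
```

Under the assumed parametrization the fourth moment equals p^η, so it grows with the dimension even though the variance stays fixed at one; this is the sense in which the non-zero innovations are rare but large.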
Even if the quantity