1 Introduction
Principal components analysis (PCA) was introduced in the early 20th century [Pearson (1901), Hotelling (1933)] and is arguably the most well known and widely used technique for dimension reduction. It is part of the mainstream statistical repertoire and is routinely used in numerous and diverse areas of application. However, contemporary applications often involve much higherdimensional data than envisioned by the early developers of PCA. In such highdimensional situations, where the number of variables is of the same order or much larger than the number of observations , serious difficulties emerge: standard PCA can produce inconsistent estimates of the principal directions of variation and lead to unreliable conclusions [Johnstone and Lu (2009), Paul (2007), Nadler (2008)].
The principal directions of variation correspond to the eigenvectors of the covariance matrix, and in highdimensions consistent estimation of the eigenvectors is generally not possible without additional assumptions about the covariance matrix or its eigenstructure. Much of the recent development in PCA has focused on methodology that applies the concept of sparsity to the estimation of individual eigenvectors [examples include Jolliffe, Trendafilov and Uddin (2003), d’Aspremont et al. (2007), Zou, Hastie and Tibshirani (2006), Shen and Huang (2008), Witten, Tibshirani and Hastie (2009), Journée et al. (2010)]. Theoretical developments on sparsity and PCA include consistency [Johnstone and Lu (2009), Shen, Shen and Marron (2013)], variable selection properties [Amini and Wainwright (2009)], rates of convergence and minimaxity [Vu and Lei (2012a)], but have primarily been limited to results about estimation of the leading eigenvector. Very recently, Birnbaum et al. (2013) established minimax lower bounds for the estimation of individual eigenvectors. However, an open problem that has remained is whether sparse PCA methods can optimally estimate the subspace spanned by the leading eigenvectors, that is, the principal subspace of variation.
The subspace estimation problem is directly connected to dimension reduction and is important when there may be more than one principal component of interest. Indeed, typical applications of PCA use the projection onto the principal subspace to facilitate exploration and inference of important features of the data. In that case, the assumption that there are distinct principal directions of variation is mathematically convenient but unnatural: it avoids the problem of unidentifiability of eigenvectors by imposing an artifactual choice of principal axes. Dimension reduction by PCA should emphasize subspaces rather than eigenvectors.
An important conceptual issue in applying sparsity to principal subspace estimation is that, unlike the case of sparse vectors, it is not obvious how to formally define what is meant by a
sparse principal subspace. In this article, we present two complementary notions of sparsity based on (pseudo) norms: row sparsity and column sparsity. Roughly, a subspace is row sparse if every one of its orthonormal bases consists of sparse vectors. In the case, this intuitively means that a row sparse subspace is generated by a small subset of variables, independent of the choice of basis. A column sparse subspace, on the other hand, is one which has some orthonormal basis consisting of sparse vectors. This means that the choice of basis is crucial; the existence of a sparse basis is an implicit assumption behind the frequent use of rotation techniques by practitioners to help interpret principal components.In this paper, we study sparse principal subspace estimation in highdimensions. We present nonasymptotic minimax lower and upper bounds for estimation of both row sparse and column sparse principal subspaces. Our upper bounds are constructive and apply to a wide class of distributions and covariance matrices. In the row sparse case they are optimal up to constant factors, while in the column sparse case they are nearly optimal. As an illustration, one consequence of our results is that the order of the minimax mean squared estimation error of a row sparse dimensional principal subspace (for ) is
where
is the effective noise variance (a function of the eigenvalues of population covariance matrix) and
is a measure of the sparsity in an sense defined in Section 2. Our analysis allows , , and to change with and . When , the rate has a very intuitive explanation. There are variables active in generating the principal subspace. For each active variable, we must estimate the corresponding coordinates of the basis vectors. Since we do not know in advance which variables are active, we incur an additional cost of for variable selection.To our knowledge, the only other work that has considered sparse principal subspace estimation is that of Ma (2013). He proposed a sparse principal subspace estimator based on iterative thresholding, and derived its rate of convergence under a spiked covariance model (where the covariance matrix is assumed to be a rank perturbation of the identity) similar to that in Birnbaum et al. (2013). He showed that it nearly achieves the optimal rate when estimating a single eigenvector, but was not able to track its dependence on the dimension of the principal subspace.
We obtain the minimax upper bounds by analyzing a sparsity constrained principal subspace estimator and showing that it attains the optimal error (up to constant factors). In comparison to most existing works in the literature, we show that the upper bounds hold without assuming a spiked covariance model. This spiked covariance assumption seems to be necessary for two reasons. The first is that it simplifies analyses and enables the exploitation of special properties of the multivariate Gaussian distribution. The second is that it excludes the possibility of the variables having equal variances. Estimators proposed by
Paul (2007), Johnstone and Lu (2009), and Ma (2013) require an initial estimate based on diagonal thresholding—screening out variables with small sample variances. Such an initial estimate will not work when the variables have equal variances or have been standardized. The spiked covariance model excludes that case and, in particular, does not allow PCA on correlation matrices.A key technical ingredient in our analysis of the subspace estimator is a novel variational form of the Davis–Kahan theorem (see Corollary 4.1) that may be useful in other regularized spectral estimation problems. It allows us to bound the estimation error using some recent advanced results in empirical process theory, without Gaussian or spiked covariance assumptions. The minimax lower bounds follow the standard Fano method framework [e.g., Yu (1997)], but their proofs involve nontrivial constructions of packing sets in the Stiefel manifold. We develop a generic technique that allows us to convert global packing sets without orthogonality constraints into local packing sets in the Stiefel manifold, followed by a careful combinatorial analysis on the cardinality of the resulting matrix class.
The remainder of the paper is organized as follows. In the next section, we introduce the sparse principal subspace estimation problem and formally describe our minimax framework and estimator. In Section 3, we present our main conditions and results, and provide a brief discussion about their consequences and intuition. Section 4 outlines the key ideas and main steps of the proof. Section 5 concludes the paper with discussion of related problems and practical concerns. Appendices A, B contain the details in proving the lower and upper bounds. The major steps in the proofs require some auxiliary lemmas whose proofs we defer to Appendices C, D.
2 Subspace estimation
Let be independent, identically distributed random vectors with mean and covariance matrix . To reduce the dimension of the ’s from down to , PCA looks for mutually uncorrelated, linear combinations of the coordinates of that have maximal variance. Geometrically, this is equivalent to finding a dimensional linear subspace that is closest to the centered random vector in a mean squared sense, and it corresponds to the optimization problem
(1)  
where is the Grassmann manifold of dimensional subspaces of , is the orthogonal projector of , and is the identity matrix. [For background on Grassmann and Stiefel manifolds, see Edelman, Arias and Smith (1999), Chikuse (2003).] There is always at least one for which (2) has a unique solution. That solution can be determined by the spectral decomposition
(2) 
where are the eigenvalues of and , orthonormal, are the associated eigenvectors. If , then the dimensional principal subspace of is
(3) 
and the orthogonal projector of is given by , where is the matrix with columns .
In practice, is unknown, so must be estimated from the data. Standard PCA replaces (2) with an empirical version. This leads to the spectral decomposition of the sample covariance matrix
where is the sample mean, and estimating by the span of the leading eigenvectors of . In highdimensions however, the eigenvectors of can be inconsistent estimators of the eigenvectors of . Additional structural constraints are necessary for consistent estimation of .
2.1 Subspace sparsity
The notion of sparsity is appealing and has been used successfully in the context of estimating vector valued parameters such as the leading eigenvector in PCA. Extending this notion to subspaces requires care because sparsity is inherently a coordinatedependent concept while subspaces are coordinateindependent. For a given dimensional subspace , the set of orthonormal matrices whose columns span is a subset of the Stiefel manifold of orthonormal matrices. We will consider two complementary notions of subspace sparsity defined in terms of those orthonormal matrices: row sparsity and column sparsity.
Define the norm, , of a matrix as the usual norm of the vector of rowwise norms of :
where denotes the th row of . (To be precise, this is actually a pseudonorm when .) Note that is coordinateindependent, because
for any orthogonal matrix
. We define the row sparse subspaces using this norm. Let denotes the span of the columns of . [(Row sparse subspaces)] For and ,The constraints on arise from the fact that the vector of rowwise norms of a orthonormal matrix belongs to a sphere of radius . Roughly speaking, row sparsity asserts that there is a small subset of variables (coordinates of ) that generate the principal subspace. Since is coordinateindependent, every orthonormal basis of a has the same norm.
Another related notion of subspace sparsity is column sparsity, which asserts that there is some orthonormal basis of sparse vectors that spans the principal subspace. Define the norm, , of a matrix as the maximal norm of its columns:
where denotes the th column of . This is not coordinateindependent. We define the column sparse subspaces to be those that have some orthonormal basis with small norm. [(Column sparse subspaces)] For and ,
The column sparse subspaces are the dimensional subspaces that have some orthonormal basis whose vectors are sparse in the usual sense. Unlike row sparsity, the orthonormal bases of a column sparse do not all have the same norm, but if , then there exists some such that and (or for ).
2.2 Parameter space
We assume that there exist i.i.d. random vectors , with and , such that
(4) 
for , where is the Orlicz norm [e.g., van der Vaart and Wellner (1996), Chapter 2] defined for as
This ensures that all onedimensional marginals of have subGaussian tails. We also assume that the eigengap so that the principal subspace is well defined. Intuitively, is harder to estimate when the eigengap is small. This is made precise by the effective noise variance
(5) 
It turns out that this is a key quantity in the estimation of
, and that it is analogous to the noise variance in linear regression. Let
denote the class of distributions on that satisfy (4), , and . Similarly, let
denote the class of distributions that satisfy (4), , and .
2.3 Subspace distance
A notion of distance between subspaces is necessary to measure the performance of a principal subspace estimator. The canonical angles between subspaces generalize the notion of angles between lines and can be used to define subspace distances. There are several equivalent ways to describe canonical angles, but for our purposes it will be easiest to describe them in terms of projection matrices. See Bhatia [(1997), Chapter VII.1] and Stewart and Sun (1990) for additional background on canonical angles. For a subspace and its orthogonal projector , we write to denote the orthogonal projector of and recall that . Let and be dimensional subspaces of with orthogonal projectors and
. Denote the singular values of
by . The canonical angles between and are the numbersfor and the angle operator between and is the matrix
In this paper we will consider the distance between subspaces
where is the Frobenius norm. This distance is indeed a metric on [see Stewart and Sun (1990), e.g.], and can be connected to the familiar Frobenius (squared error) distance between projection matrices by the following fact from matrix perturbation theory.
Proposition 2.1 ([See Stewart and Sun (1990), Theorem I.5.5])
Let and be dimensional subspaces of with orthogonal projectors and . Then:

The singular values of are

The singular values of are
In other words, has at most nonzero singular values and the nonzero singular values of are the nonzero singular values of , each counted twice.
Thus,
(6) 
We will frequently use these identities. For simplicity, we will overload notation and write
for . We also use a similar convention for , where are the orthogonal projectors corresponding to The following proposition, proved in Appendix C, relates the subspace distance to the ordinary Euclidean distance between orthonormal matrices.
Proposition 2.2
If , then
In other words, the distance between two subspaces is equivalent to the minimal distance between their orthonormal bases.
2.4 Sparse subspace estimators
Here we introduce an estimator that achieves the optimal (up to a constant factor) minimax error for row sparse subspace estimation. To estimate a row sparse subspace, it is natural to consider the empirical minimization problem corresponding to (2) with an additional sparsity constraint corresponding to .
We define the row sparse principal subspace estimator to be a solution of the following constrained optimization problem:
(7)  
For our analysis, it is more convenient to work on the Stiefel manifold. Let for matrices of compatible dimension. It is straightforward to show that following optimization problem is equivalent to (2.4):
(8)  
If is a global maximizer of (2.4), then is a solution of (2.4). When , the estimator defined by (2.4) is essentially a generalization to subspaces of the Lassotype sparse PCA estimator proposed by Jolliffe, Trendafilov and Uddin (2003). A similar idea has also been used by Chen, Zou and Cook (2010) in the context of sufficient dimension reduction. The constraint set in (2.4) is clearly nonconvex, however this is unimportant, because the objective function is convex and we know that the maximum of a convex function over a set is unaltered if we replace by its convex hull. Thus, (2.4) is equivalent to a convex maximization problem. Finding a global maximum of convex maximization problems is computationally challenging and efficient algorithms remain to be developed. Nevertheless, in the most popular case , some algorithms have been proposed with promising empirical performance [Shen and Huang (2008), Witten, Tibshirani and Hastie (2009)].
We define the column sparse principal subspace estimator analogously to the row sparse principal subspace estimator, using the column sparse subspaces instead of the row sparse ones. This leads to the following equivalent Grassmann and Stiefel manifold optimization problems:
(9)  
and
(10)  
3 Main results
In this section, we present our main results on the minimax lower and upper bounds on sparse principal subspace estimation over the row sparse and column sparse classes.
3.1 Row sparse lower bound
To highlight the key results with minimal assumptions, we will first consider the simplest case where . Consider the following two conditions. There is a constant such that
and . Condition 3.1 is necessary for the existence of a consistent estimator (see Theorems A.1 and A.2). Without Condition 3.1, the statements of our results would be complicated by multiple cases to deal with the fact that the subspace distance is bounded above by . The lower bounds on and are minor technical conditions that ensure our nonasymptotic bounds are nontrivial. Similarly, the upper bound on is only violated in trivial cases (detailed discussion given below).
Here, as well as in the entire paper, denotes a universal, positive constant, not necessarily the same at each occurrence. This lower bound result reflects two separate aspects of the estimation problem: variable selection and parameter estimation after variable selection. Variable selection refers to finding the variables that generate the principal subspace, while estimation refers to estimating the subspace after selecting the variables. For each variable, we accumulate two types of errors: one proportional to that reflects the coordinates of the variable in the dimensional subspace, and one proportional to that reflects the cost of searching for the active variables. We prove Theorem 3.1 in Appendix A.
The nonasymptotic lower bound for has a more complicated dependence on (, , , , ) because of the interaction between and norms. Therefore, our main lower bound result for will focus on combinations of (, , , , ) that correspond to the highdimensional and sparse regime. (We state more general lower bound results in Appendix A.) Let
(11) 
The interpretation for these two quantities is natural. First, measures the relative sparsity of the problem. Roughly speaking, it ranges between and when the sparsity constraint in (2.4) is active, though the “sparse” regime generally corresponds to . The second quantity, corresponds to the classic mean squared error (MSE) of standard PCA. The problem is lowdimensional if is small compared to . We impose the following condition to preclude this case. There is a constant such that . This condition lower bounds the classic MSE in terms of the sparsity and is mild in highdimensional situations. When , for example, Condition 11 reduces to
We also note that this assumption becomes milder for larger values of and it is related to conditions in other minimax inference problem involving and balls [see Donoho and Johnstone (1994), e.g.].
This result generalizes Theorem 3.1 and reflects the same combination of variable selection and parameter estimation. When Condition 11
does not hold, the problem is outside of the sparse, highdimensional regime. As we show in the proof, there is actually a “phase transition regime” between the highdimensional sparse and the classic dense regimes for which sharp minimax rate remains unknown. A similar phenomenon has been observed in
Birnbaum et al. (2013).3.2 Row sparse upper bound
Our upper bound results are obtained by analyzing the estimators given in Section 2.4. The case where is the clearest, and we begin by stating a weaker, but simpler minimax upper bound for the row sparse class.
Theorem 3.3 ((Row sparse upper bound, ))
Let be any global maximizer of (2.4). If , then
Although (2.4) may not have a unique global optimum, Theorem 3.3 shows that any global optimum will be within a certain radius of the principal subspace . The proof of Theorem 3.3, given in Section 4.2, is relatively simple but still nontrivial. It also serves as a prototype for the much more involved proof of our main upper bound result stated in Theorem 3.4 below. We note that the rate given by Theorem 3.3 is off by a factor that is due to the specific approach taken to control an empirical process in our proof of Theorem 3.3.
To state the main upper bound result with optimal dependence on (, , , , ), we first describe some regularity conditions. Let
The regularity conditions are
(12)  
(13)  
(14) 
and
(15) 
where and are positive constants involved in the empirical process arguments. Equations (12) to (15) require that , the minimax rate of estimation (except the factor involving ), to be small enough, compared to empirical process constants and some polynomials of . Such conditions are mild in the high dimensional, sparse regime, since to some extent, they are qualitatively similar and analogous to Conditions 3.1 to 11 required by the lower bound.
Conditions (12) to (15) are general enough to allow , and () to scale with . For example, consider the case , and let , , , , , , where , and . Note that the ’s can be negative. Then it is straightforward to verify that conditions (12) to (15) hold for large values of whenever and . Condition (12) implies that cannot grow faster than .
Theorem 3.4 ((Row sparse upper bound in probability))
Theorem 3.4 is presented in terms of a probability bound instead of an expectation bound. This stems from technical aspects of our proof that involve bounding the supremum of an empirical process over a set of random diameter. For , the upper bound matches our lower bounds (Theorems 3.1 and 3.2) for the entire tuple (, , , , ) up to a constant if
(16) 
for some constant . To see this, combining this additional condition and Condition 3.1, the term in the lower bound given in Theorem 3.2 is within a constant factor of in the upper bound given in Theorem 3.4. It is straightforward to check that the other terms in lower and upper bounds agree up to constants with obvious correspondence. Moreover, we note that the additional condition (16) is only slightly stronger than the last inequality in Condition 3.1. The proof of Theorem 3.4 is in Appendix B.1.
Using the probability upper bound result and the fact that , one can derive an upper bound in expectation.
Corollary 3.1
Under the same condition as in Theorem 3.4, we have for some constant ,
The expectation upper bound has an additional term that can be further reduced by refining the argument (see Remark 3.2 below). It is not obvious if one can completely avoid such a term. But in many situations it is dominated by the first term. Again, we invoke the scaling considered in Remark 3.2. When , the first term is of order , and the additional term is , which is asymptotically negligible if .
3.3 Column sparse lower bound
By modifying the proofs of Theorems 3.1 and 3.2, we can obtain lower bound results for the column sparse case that are parallel to the row sparse case. For brevity, we present the and cases together. The analog of , the degree of sparsity, for the column sparse case is
(17) 
and the analogs of Conditions 3.1 and 11 are the following. and . There is a constant such that .
For column sparse subspaces, the lower bound is dominated by the variable selection error, because column sparsity is defined in terms of the maximal norms of the vectors in an orthonormal basis and variables must be selected for each of the vectors. So the variable selection error is inflated by a factor of , and hence becomes the dominating term in the total estimation error. We prove Theorem 3.5 in Appendix A.
3.4 Column sparse upper bound
A specific challenge in analyzing the column sparse principal subspace problem (2.4) is to bound the supremum of the empirical process
indexed by all where
Unlike the row sparse matrices, the matrices and are no longer column sparse with the same radius .
By observing that , we can reuse the proof of Theorem 3.4 to derive the following upper bound for the column sparse class.
Corollary 3.2 ((Column sparse upper bound))
3.5 A conjecture for the column sparse case
Note that Theorem 3.5 and Corollary 3.2 only match when . For larger values of , we believe that the lower bound in Theorem 3.5 is optimal and the upper bound can be improved.
[(Minimax error bound for column sparse case)] Under the same conditions as in Corollary 3.2, there exists an estimator such that
with high probability. As a result, the optimal minimax lower and upper bounds for this case shall be
One reason for the conjecture is based on the following intuition. Suppose that (there is enough gap between the leading eigenvalues) one can recover the individual leading eigenvectors with an error rate whose dependence on is the same as in the lower bound [cf. Vu and Lei (2012a), Birnbaum et al. (2013)]. As a result, the estimator shall give the desired upper bound. On the other hand, it remains open to us whether the estimator in (2.4) can achieve this rate for much larger than .
4 Sketch of proofs
For simplicity, we focus on the row sparse case with , assuming also the high dimensional and sparse regime. For more general cases, see Theorems A.1 and A.2 in Appendix A.
4.1 The lower bound
Our proof of the lower bound features a combination of the general framework of the Fano method and a careful combinatorial analysis of packing sets of various classes of sparse matrices. The particular challenge is to construct a rich packing set of the parameter space . We will consider centered dimensional Gaussian distributions with covariance matrix given by
(18) 
where is constructed from the “local Stiefel embedding” as given below. Let and the function be defined in block form as
(19) 
for . We have the following generic method for lower bounding the minimax risk of estimating the principal subspace of a covariance matrix. It is proved in Appendix A as a consequence of Lemmas A.1 to A.3.
Lemma 4.1 ((Fano method with Stiefel embedding))
Note that if , then . Thus Lemma 4.1 with appropriate choices of can yield minimax lower bounds over dimensional Gaussian distributions whose principal subspace is row sparse.
The remainder of the proof consists of two applications of Lemma 4.1 that correspond to the two terms in Theorem 3.1. In the first part, we use a variation of the Gilbert–Varshamov bound (Lemma A.5) to construct a packing set in consisting of sparse vectors. Then we apply Lemma 4.1 with
This yields a minimax lower bound that reflects the variable selection complexity. In the second part, we leverage existing results on the metric entropy of the Grassmann manifold (Lemma A.6) to construct a packing set of . Then we apply Lemma 4.1 with
This yields a minimax lower bound that reflects the complexity of postselection estimation. Putting these two results together, we have for a subset of Gaussian distributions the minimax lower bound:
4.2 The upper bound
The upper bound proof requires a careful analysis of the behavior of the empirical maximizer of the PCA problem under sparsity constraints. The first key ingredient is to provide a lower bound of the curvature of the objective function at its global maxima. Traditional results of this kind, such as Davis–Kahan sin theorem and Weyl’s inequality, are not sufficient for our purpose.
The following lemma, despite its elementary form, has not been seen in the literature (to our knowledge). It gives us the right tool to bound the curvature of the matrix functional at its point of maximum on the Grassmann manifold.
Lemma 4.2 ((Curvature lemma))
Let be a positive semidefinite matrix and suppose that its eigenvalues satisfy for . Let be the dimensional subspace spanned by the eigenvectors of corresponding to its largest eigenvalues, and let denote its orthogonal projector. If is a dimensional subspace of and is its orthogonal projector, then
Lemma 4.2 is proved in Appendix C.2. An immediate corollary is the following alternative to the traditional matrix perturbation approach to bounding subspace distances using the Davis–Kahan theorem and Weyl’s inequality.
Corollary 4.1 ((Variational ))
In addition to the hypotheses of Lemma 4.2, if is a symmetric matrix and satisfies
(20) 
for some function , then
(21) 
The corollary is different from the Davis–Kahan theorem because the orthogonal projector does not have to correspond to a subspace spanned by eigenvectors of . only has to satisfy
This condition is suited ideally for analyzing solutions of regularized and/or constrained maximization problems where and are feasible, but is optimal. In the simplest case, where , combining (21) with the Cauchy–Schwarz inequality and (6) recovers a form of the Davis–Kahan theorem in the Frobenius norm:
In the upper bound proof, let be the true parameter, and be a solution of (2.4). Then we have
Applying Corollary 4.1 with , , , , and , we have
(22) 
Obtaining a sharp upper bound for is nontrivial. First, one needs to control for some class of sparse and symmetric matrices. This requires some results on quadratic form empirical process. Second, in order to obtain better bounds, we need to take advantage of the fact that is probably small. Thus, we need to use a peeling argument to deal with the case where has a random (but probably) small diameter. These details are given in Appendices B.1 and D. Here we present a short proof of Theorem 3.3 to illustrate the idea.
Proof of Theorem 3.3 By (22), we have
and
(23) 
because by (6). Let
Then , , and has at most positive eigenvalues and at most negative eigenvalues (see Proposition 2.1). Therefore, we can write where , , , and the same holds for . Let
Equation (23) implies
The empirical process indexed by is a generalized quadratic form, and a sharp bound of its supremum involves some recent advances in empirical process theory due to Mendelson (2010) and extensions of his results. By Corollary 4.1 of Vu and Lei (2012b), we have
where is a matrix of i.i.d. standard Gaussian variables. To control