## 1 Introduction

Principal component analysis (PCA) is one of the most widely-used techniques for dimensionality reduction in Statistics, Image Processing and many other fields. The aim is to project the data along directions that explain the greatest proportion of the variance in the population. In the simplest setting where we seek a single, univariate projection of our data, we may estimate this optimal direction by computing the leading eigenvector of the sample covariance matrix.

Despite its successes and enormous popularity, it has been well-known for a decade or more that PCA breaks down as soon as the dimensionality of the data is of the same order as the sample size . More precisely, suppose that

are observations from a Gaussian distribution with a spiked covariance matrix

whose leading eigenvector is , and let denote the leading unit-length eigenvector of the sample covariance matrix . Then Johnstone and Lu (2009) and Paul (2007) showed that is a consistent estimator of , i.e. , if and only if satisfies as . It is also worth noting that the principal component may be a linear combination of all elements of the canonical basis in , which can often make it difficult to interpret the estimated projected directions (Jolliffe, Trendafilov and Uddin, 2003).To remedy this situation, and to provide additional interpretability to the principal components in high-dimensional settings, Jolliffe, Trendafilov and Uddin (2003) and Zou, Hastie and Tibshirani (2006) proposed Sparse Principal Component Analysis (SPCA). Here it is assumed that the leading population eigenvectors belong to the -sparse unit ball

for some . In addition to the easier interpretability, a great deal of research effort has shown that such an assumption facilitates improved estimation performance (e.g. Johnstone and Lu, 2009; Paul and Johnstone, 2012; Vu and Lei, 2013; Cai, Ma and Wu, 2013; Ma, 2013; Wang, Berthet and Samworth, 2016a). To give a flavour of these results, let denote the set of all estimators of , i.e. the class of Borel measurable functions from to . Vu and Lei (2013) introduce a class of sub-Gaussian distributions whose first principal component belongs to and show that^{1}^{1}1Here, means .

(1) |

Thus, consistent estimation is possible in this framework provided only that and satisfy . Vu and Lei (2013) show further that this estimation rate is achieved by the natural estimator

(2) |

However, results such as (1) do not complete the story of SPCA. Indeed, computing the estimator defined in (2) turns out to be an NP-hard problem (e.g. Tillmann and Pfetsch, 2014): the naive approach would require searching through all of the symmetric submatrices of , which takes exponential time in . Therefore, in parallel to the theoretical developments described above, numerous alternative algorithms for SPCA have been proposed in recent years. For instance, several papers have introduced techniques based on solving the non-convex optimisation problem in (2) by invoking an -penalty (e.g. Jolliffe, Trendafilov and Uddin, 2003; Zou, Hastie and Tibshirani, 2006; Shen and Huang, 2008; Witten, Tibshirani and Hastie, 2009). Typically, these methods are fast, but lack theoretical performance guarantees. On the other hand, d’Aspremont et al. (2007) propose to solve the optimisation problem in (2) via semidefinite relaxation. This approach and its variants were analysed by Amini and Wainwright (2009), Vu et al. (2013), Wang, Lu and Liu (2014) and Wang, Berthet and Samworth (2016a), and have been proved to achieve the minimax rate of convergence under certain assumptions on the underlying distribution and asymptotic regime, but the algorithm is slow compared to other approaches. In a separate, recent development, it is now understood that, conditional on a Planted Clique hypothesis from theoretical computer science, there is an asymptotic regime in which no randomised polynomial time algorithm can attain the minimax optimal rate (Wang, Berthet and Samworth, 2016a). Various fast, iterative algorithms were introduced by Johnstone and Lu (2009), Paul and Johnstone (2012), and Ma (2013); the last of these was shown to attain the minimax rate under a Gaussian spiked covariance model. We also mention the computationally-efficient combinatorial approaches proposed by Moghaddam, Weiss and Avidan (2006) and d’Aspremont, Bach and El Ghaoui (2008) that aim to find solutions to the optimisation problem in (2) using greedy methods.

A common feature to all of the computationally efficient algorithms mentioned above is that they are iterative, in the sense that, starting from an initial guess , they refine their guess by producing a finite sequence of iterates , with the estimator defined to be the final iterate. A major drawback of such iterative methods is that a bad initialisation may yield a disastrous final estimate. To illustrate this point, we ran a simple simulation in which the underlying distribution is , with

where denotes the matrix with each entry equal to . In this example, , so . Figure 1

shows, for several different SPCA algorithms, different sample sizes and different initialisation methods, the average values of the loss function

(3) |

over 100 repetitions of the experiment. In the upper panels, the initialisation methods used were the default recommendations of the respective authors, namely diagonal thresholding (d’Aspremont, Bach and El Ghaoui, 2008; Ma, 2013), and vanilla PCA (Zou, Hastie and Tibshirani, 2006; Shen and Huang, 2008; Witten, Tibshirani and Hastie, 2009). We note that the consistency of diagonal thresholding relies on a spiked covariance structure, which is violated in this example. In the lower panels, we ran the algorithms with 10 independent initialisations chosen uniformly at random on the unit Euclidean sphere in , and selected the solution from these 10 that maximises . The main observation is that each of the previously proposed algorithms mentioned above produces very poor estimates, with some almost orthogonal to the true principal component! The reason for this is that all of the default initialisation procedures are unsuccessful in finding a good starting point, and this problem is not fixed by using multiple random initialisations; cf. Section 4.3 for further comparisons. For comparison, we also present the corresponding results for Wang, Berthet and Samworth (2016a)’s variant of the semi-definite programming (SDP) algorithm introduced by d’Aspremont et al. (2007). This method is guaranteed to converge from any initialisation, so does not suffer the same drawbacks as mentioned above.

In Section 2 of this paper, we propose a novel algorithm for SPCA that aggregates estimates over carefully-chosen random projections of the data into a lower-dimensional space. In contrast to the other algorithms mentioned above, it is non-iterative and does not depend on a choice of initialisation, so it has no difficulty with the simulation example above. Indeed, from the blue curve in Figure 1, we see that it considerably outperforms even the SDP algorithm, and moreover it was over 50 times faster to compute.

Our algorithm, which we refer to as SPCAvRP, turns out to be attractive for both theoretical and computational reasons. Indeed our theory, developed in Section 3, provides a detailed description of the statistical and computational trade-off involved in the SPCAvRP algorithm. It reveals a subtle interaction between conditions on an effective sample size parameter and the number of projections, under which our estimator attains the minimax optimal rate. When the effective sample size is large, the minimax rate can be attained with a number of projections that grows only slightly faster than linearly in . This turns out not to contradict the computational lower bound of Wang, Berthet and Samworth (2016a), which applies to an intermediate effective sample size regime where the SPCAvRP algorithm would require an exponential number of projections to attain the optimal rate. The computational attractions of the proposed algorithm include the fact that it is highly scalable due to easy parallelisation, and does not even require computation of , since it suffices to extract principal submatrices of , which can be done by computing the sample covariance matrices of the projected data. This may result in a significant computational saving if is very large. Several numerical aspects of the algorithm, including a finite-sample simulation comparison with alternative methods on both simulated and real data, are considered in Section 4. These reveal that our SPCAvRP algorithm has very competitive performance, and enjoys robustness properties that iterative algorithms do not share. The proofs of all of our results are given in Section 5.

Algorithms based on random projections have recently been shown to be highly effective for several different problems in high-dimensional statistical inference. For instance, in the context of high-dimensional classification,

Cannings and Samworth (2017)showed that their random projection ensemble classifier that aggregates over projections that yield small estimates of the test error can result in excellent performance.

Marzetta, Tucci and Simon (2011) employ an ensemble of random projections to construct an estimator of the population covariance matrix and its inverse in the setting where . Fowler (2009) introduced a so-called compressive-projection PCA that reconstructs the sample principal components from many low-dimensional projections of the data. Finally, to decrease the computational burden of classical PCA, Qi and Hughes (2012) and Pourkamali-Anaraki and Hughes (2014) propose estimating by the leading eigenvector of , where are random projections of a particular form.##### Notation.

We conclude this introduction with some notation used throughout the paper. For a vector

, we write for its th component and let denote its Euclidean norm. For a real symmetric matrix , we letdenote its eigenvalues, arranged in decreasing order. In addition, we define the leading eigenvector of

bywhere denotes the smallest element of the in the lexicographic ordering. In the special case where , we drop the argument, and write the eigenvalues and eigenvectors as and , respectively. We also define to be the th entry of , and write for the operator norm of matrix .

For , let . Let

denote the support of the vector . We write for the smallest non-zero component of in absolute value.

For any index subset we write to denote the projection onto the span of , where are the standard Euclidean basis vectors in , so that is a diagonal matrix whose th diagonal entry is . Finally, for , we write to mean that there exists a universal constant such that .

## 2 SPCA via random projections

### 2.1 Single principal component estimation

In this section, we describe our algorithm for estimating a single principal component in detail; more general estimation of multiple principal components and principal subspaces is treated in Section 2.2 below. Let be data points in and let . We think of as independent realisations of a mean-zero random vector , so a practitioner may choose to center each variable so that for each . For , let denote the set of -dimensional, axis-aligned projections. For fixed , consider projections

independently and uniformly distributed on

. We think of these projections as consisting of groups, each of cardinality . For each , letdenote the index of the selected projection within the th group. The idea is that the non-zero entries of form a principal submatrix of that should have a large leading eigenvalue, so the non-zero entries of the corresponding leading eigenvector of should have some overlap with those of . Observe that, if and contained all projections, then the leading eigenvector of would yield the minimax optimal estimator in (2). Of course, it would typically be too computationally expensive to compute all such projections, so instead we only consider randomly chosen ones.

The remaining challenge is to aggregate over the selected projections. To this end, for each coordinate , we compute the average of the absolute values of the th components of the selected eigenvectors . This means that we take account not just of the frequency with which each coordinate is chosen, but also their corresponding magnitudes in the selected eigenvector. Finally, we select the indices corresponding to the largest values of and output our estimate as the leading eigenvector of . Pseudo-code for our SPCAvRP algorithm is given in Algorithm 1.

*do*

*do*

(4) |

(5) |

Besides the intuitive selection of the most important coordinates, the use of axis-aligned projections in SPCAvRP algorithm facilitates faster computation as opposed to the use of general orthogonal projections. Indeed, the multiplication of by an axis-aligned projection from the left (or right) can be recast as the selection of rows (or columns) of corresponding to the indices of the non-zero diagonal entries of . Thus, instead of the typical matrix multiplication complexity, only operations are required. We also remark that, instead of storing , it suffices to store its non-zero indices.

More generally, the computational complexity of Algorithm 1 can be analysed as follows. Generating initial random projections takes operations. Next, we need to compute for all and , which can be done in two different ways. One option is to compute , and then for each projection select the corresponding principal submatrix of , which requires operations. Alternatively, we can avoid computing by computing the sample covariance matrix of the projected data , which has complexity. If , then the second option is preferable.

The rest of Algorithm 1 entails computing an eigendecomposition of each matrix, and computing , , , and , which altogether amounts to operations. Thus, assuming that , the overall computational complexity of the SPCAvRP algorithm is

We also note that, due to the use of random projections, the algorithm is highly parallelisable. In particular, both for-loops of Algorithm 1 can be parallelised, and the selection of good projections can easily be carried out using different (up to ) machines.

Finally, we note that the numbers and of projections, the dimension of those projections and the sparsity of the final estimator, need to be provided as inputs to Algorithm 1. The effect of these parameter choices on the theoretical guarantees of our SPCAvRP algorithm is elucidated in Section 3, while their practical selection is discussed in Section 4.2. Our main conclusion is that the algorithm is robust to the choice of and , so that plays the role of the main tuning parameter (analogously to other sparse PCA algorithms).

### 2.2 Multiple principal component estimation

The estimation of higher-order principal components is typically achieved via a deflation scheme. Having computed estimates of the top principal components, the aim of such a procedure is to modify the observations to remove correlation with these previously-estimated components (e.g. Mackey, 2009). For any matrix , we define the projection onto the orthogonal complement of the column space of by if and otherwise. Then writing , one possibility to implement a deflation scheme is to set for . Note that, in contrast to classical PCA, in sparse PCA the estimated principal components from such a deflation scheme are typically not orthogonal. In Algorithm 2, we therefore propose a modified deflation scheme, which in combination with Algorithm 1 can be used to compute an arbitrary principal components that are orthogonal (as well as sparse), as verified in Lemma 1 below.

###### Lemma 1.

For any , the outputs of Algorithm 2 are mutually orthogonal.

We remark that, in fact, our proposed deflation method can be used in conjunction with any SPCA algorithm.

Although Algorithm 2 can conveniently be used to compute sparse principal components up to order , it requires Algorithm 1 to be executed times. Instead, we can modify Algorithm 1

to estimate directly the leading eigenspace of dimension

at a considerably reduced computational cost. To this end, we propose a generalisation of the SPCAvRP algorithm for eigenspace estimation in Algorithm 3. In this generalisation, projections are selected from a total of random projections, by computingfor each and . Moreover, the aggregation step is modified to account for the sparsity and orthogonality of the components. Observe that for , Algorithm 3 reduces to Algorithm 1. Furthermore, for any , up to the step where is computed, Algorithm 3 has the same complexity as Algorithm 1, with the total complexity of Algorithm 3 amounting to provided that , and .

*do*

*do*

*do*

*do*

*do*

## 3 Theoretical guarantees

In this section, we focus on Algorithm 1 and assume that are independently sampled from a distribution satisfying a Restricted Covariance Concentration (RCC) condition introduced in Wang, Berthet and Samworth (2016a). Recall that, for , we say that a mean zero distribution on satisfies an RCC condition with parameter , and write , if for all , and , we have

(6) |

In particular, if , then ; and if is sub-Gaussian with parameter , in the sense that for all , then (Wang, Berthet and Samworth, 2016a, Proposition 1). In Section 3.1, we first derive theoretical guarantees in the special case where the covariance matrix has a single-spiked structure and its leading eigenvector is homogeneous in all signal coordinates. The result in this special case already provides useful insights on how different parameters affect the performance of estimator proposed in Algorithm 1. We then extend our theory to more general distributions in Section 3.2.

### 3.1 Single-spiked model with homogeneous signal

Any permutation of acts naturally on by

. This action maps any probability measure

on to another probability measure on , where for any Borel set , we define . In this section, we consider a subclass of distributionssuch that any has covariance matrix , for some and and such that for any that stabilises , that is . In particular, includes distributions of the form when and .

In what follows, we use

to denote the distribution function of the hypergeometric distribution

. Recall that this distribution models the number of white balls obtained when drawing balls uniformly and without replacement from an urn containing balls, of which are white.###### Theorem 2.

Let . Let be the output of Algorithm 1 with input , , , and . Assume that , , and that there exists such that

(7) |

and

(8) |

Then with probability at least we have

We note that for , the loss is bounded by a constant multiple of , which, as mentioned in the introduction, is minimax rate optimal when is bounded by a constant. On the other hand, when , an additional loss of order may be incurred.

As increases, conditions (7) and (8) are respectively strengthened and weakened. The flexibility of allows us to trade off these two conditions in the theorem. For example, when , we have

so it suffices to choose for (7) to hold. In this case, we can choose parameters and , depending polynomially on and , so that Algorithm 1 is a polynomial time algorithm that can achieve the minimax rate for and appropriately chosen . However, this does not contradict the computational lower bound established in Wang, Berthet and Samworth (2016a, Theorem 6) because for , condition (8) implies a sample size requirement of order , which belongs to the high effective sample size regime discussed in Wang, Berthet and Samworth (2016a, Section 4.4). On the other hand, for , (8) is satisfied for a much smaller sample size , which includes both the intermediate and high effective sample size regimes of Wang, Berthet and Samworth (2016a) (these are the only regimes where consistent estimation is possible using any algorithm). However, by Hoeffding (1963, Theorems 2 and 4), if , then

which together with (7) entails choosing exponentially large in the problem parameters. Hence Algorithm 1 will not be polynomial time in this case. Therefore, in this single-spiked homogeneous signal setting, Theorem 2

continuously interpolates between the high and intermediate effective sample size regimes, and elucidates the phase transition for our random projection ensemble estimator in a fairly precise way.

### 3.2 General distributions

We consider more general distributions in this section. To begin with, we provide a proposition which controls the risk of estimator defined in Algorithm 1 by the sum of a bias term, based on its support recovery quality, and a variance term, which measures the risk incurred in estimating the leading eigenvector after knowing its support.

###### Proposition 3.

Let with covariance matrix satisfying . Suppose that has support . Let and let , where is a random subset of of cardinality . If and , then

Note that this result holds for any estimator of form where is an index subset of cardinality that depends on the data. In what follows, we bound when is defined as in Algorithm 1, thereby explicitly bounding the risk of estimator computed therein. To achieve this, we show that with high probability, our selection criterion (4) ensures that we aggregate over a certain set of ‘good’ projections, defined for by

Such projections capture at least a given proportion of the signal in the leading eigenvector . Writing for the selected projection from the first group of projections in Algorithm 1, we also define the event

Since we aim to bound , and since signal coordinates may differ in magnitude, we need to consider the probability that each signal coordinate is captured by a selected good projection. To this end, we define

(9) |

Observe that, under the setting of Section 3.1, we have . Moreover, whenever and , we have provided . The theorem below provides conditions under which we can control , and therefore bound the risk of our SPCAvRP estimator. In what follows, we order as .

###### Theorem 4.

Let with covariance matrix satisfying and . Let be the output of Algorithm 1 with input , , , and , satisfying and and . Suppose there exists such that

(10) |

and that is large enough that there exists for which

(11) |

Then

Remark: In the case where is a spiked covariance matrix of the form

(12) |

for some

, , the conditions of Theorem 4 can be weakened. In fact, noting the remarks following Lemma 6 and Lemma 7, and in the theorem may be replaced with and respectively (the naive direct application of Theorem 4 would have set for ).We further remark that conditions (10) and (11) again exhibit a statistical and computational trade-off as discussed after Theorem 2. For close , (10) is satisfied with a mild sample size requirement but (11) would require a choice of exponentially large in the problem parameters. On the other hand, if is sufficiently small and is close to , then (11) can be satisfied with depending polynomially on the problem parameters, at the price of a much larger sample size requirement implied by (10).

## 4 Numerical experiments

In this section we demonstrate the performance of our proposed method in different simulated settings and discuss the practical choice of the input parameters. We also compare our method with several existing sparse principal component estimation algorithms. All examples are computed using the R package ‘SPCAvRP’ (Gataric, Wang and Samworth, 2018).

### 4.1 Dependence of risk on problem parameters

Our first goal is to illustrate that our SPCAvRP algorithm achieves the estimation risk bounds as derived in Section 3. To this end, we apply Algorithm 1 to observations independently and identically sampled from a distribution with a spiked covariance matrix defined as in (12). We define the effective sample size

and in Figure 2, we plot the loss , averaged over 100 repetitions for a range of values of . In addition to the empirical loss, we also plot and an empirical estimate of , which, up to universal scaling constants, are the two terms in the risk bound derived in Proposition 3. We observe that the curves of empirical losses for different values of align well with each other, showing that is indeed an effective sample size that characterises the difficulty of the estimation problem. We also observe that the empirical estimate for exhibits a rapid phase transition in its behaviour as increases. Thus, for moderately large , the loss is essentially controlled by , which is reflected by the linear decay of the loss curve with slope under the log-log scaling in Figure 2. In fact, since the left panel of Figure 2 corresponds to the single-spiked homogeneous signal setting, in this special case we can apply Theorem 2 to bound the risk by in the high effective sample size regime discussed after Theorem 2. However, we note that the loss curves behave very similarly in both panels, indicating that the algorithmic performance is robust to the presence of multiple spikes.

Comments

There are no comments yet.