Principal Component Analysis (PCA) is a fundamental tool for data processing and visualization in machine learning and statistics [hotelling1933analysis, jolliffe2002principal]. PCA captures variable interactions in a high-dimensional dataset by identifying the directions of highest variance: the principal components. Standard iterative methods such as the power method and the faster Lanczos algorithm perform full passes over the data at every iteration and are effective on small and medium problems [golub2012matrix]. Notably, Lanczos requires only
full-pass matrix-vector multiplies by the input matrix, which is optimal with respect to its eigen-gapand is considered an “accelerated rate” compared to power method’s passes.
Modern machine learning applications, however, are prohibitively large for full-pass methods. Instead, practitioners use stochastic methods: algorithms that only ingest a random subset of the available data at every iteration. Some methods are proposed for the so-called offline, or finite-sample setting, where the algorithm is given random access to a finite set of samples, and thus could potentially use a full-pass periodically [shamir2015stochastic]. Others are presented for the truly-stochastic or online setting, where the samples are randomly drawn from a distribution, and full passes are not possible [mitliagkas2013memory, boutsidis2015online, jain2016matching]. Information theoretic bounds [allen2016first] show that, the number of samples necessary to recover the principal component, known as the sample complexity, is at least in the online setting. A number of elegant variants of the power method have been shown to match this lower bound in various regimes [jain2016matching, allen2016first].
However, sample complexity is not a great proxy for run time. Iteration complexity—the number of outer loop iterations required, when the inner loop is embarrassingly parallel—provides an asymptotic performance measure of an algorithm on a highly parallel computer. We would like to match Lanczos’ iterations from the full-pass setting. Unfortunately, the Lanczos algorithm cannot operate in a stochastic setting and none of the simple stochastic power iteration variants achieve this accelerated iteration complexity. Recently, this kind of acceleration has been achieved with carefully tuned numerical methods based on approximate matrix inversion [garber2016faster, allen2016doubly]. These methods are largely theoretical in nature and significantly more complex than stochastic power iteration. This context motivates the question: is it possible to achieve the optimal sample and iteration complexity with a method as simple as power iteration?
In this paper, we propose a class of simple PCA algorithms based on the power method that (1) operate in the stochastic setting, (2) have a sample complexity with an asymptotically optimal dependence on the eigen-gap, and (3) have an iteration complexity with an asymptotically optimal dependence on the eigen-gap (i.e. one that matches the worst-case convergence rate for the Lanczos method). As background for our method, we first note that a simple modification of the power iteration, power iteration with momentum, achieves the accelerated convergence rate . Our proposed algorithms come from the natural idea of designing an efficient, stochastic version of that method.
We first demonstrate that simply adding momentum to a stochastic method like Oja’s does not always result in acceleration. While momentum accelerates the convergence of expected iterates, variance typically dominates so no overall acceleration is observed (cf. Section 3). Using Chebyshev polynomials to derive an exact expression for the variance of the iterates of our algorithm, we identify the precise relationship between sample variance and acceleration. Importantly, we identify the exact break-down point beyond which variance is too much and acceleration is no longer observed.
Based on this analysis, we show that we can design a stochastic version of the momentum power iteration that is guaranteed to work. We propose two versions based on mini-batching and variance reduction. Both of these techniques are used to speed up computation in stochastic optimization and are embarrassingly parallel. This property allows our method to achieve true wall-clock time acceleration even in the online setting, something not possible with state-of-the-art results. Hence, we demonstrate that the more complicated techniques based on approximate matrix inversion are not necessary: simple momentum-based methods are sufficient to accelerate PCA. Because our analysis depends only on the variance of the iterates, it is very general: it enables many non-convex problems, including matrix completion [jain2013low], phase retrieval [candes2015phase] and subspace tracking [balzano2010online], to now be accelerated using a single technique, and suggests that the same might be true for a larger class of non-convex optimization problems.
We study the relationship between variance and acceleration by finding an exact characterization of variance for a general class of power iteration variants with momentum in Section 3.1.
Using this bound, we design an algorithm using mini-batches and obtain the optimal iteration and sample complexity for the online setting in Section 3.2.
We design a second algorithm using variance reduction to obtain the optimal rate for the offline setting in Section 3.3. Notably, operating in the offline setting, we are able to use a batch size that is independent of the target accuracy.
|Setting||Algorithm||Number of Iterations||Batch Size||Reference|
|Mini-batch Power+M||This paper|
|VR Power+M||This paper|
notation, we omit the factors depending on failure probability. jain2016matching and shamir2015stochastic give the best known results for stochastic PCA without and with variance reduction respectively. However, neither of these results achieve the optimal iteration complexity. Furthermore, they are not tight in terms of the variance of the problem (i.e. when is small, the bounds are loose).
2 Power method with momentum
In this section, we describe the basic PCA setup and show that a momentum scheme can be used to accelerate the standard power method. This momentum scheme, and its connection with the Chebyshev polynomial family, serves as the foundation of our stochastic method.
PCA Let be
data points. The goal of PCA is to find the top eigenvector of the symmetric positive semidefinite (PSD) matrix(the sample covariance matrix) when the data points are centered at the origin. We assume that the target matrix
has eigenvalueswith corresponding normalized eigenvectors
. The power method estimates the top eigenvector by repeatedly applying the update step
with an arbitrary initial vector . After steps, the normalized iterate 111The in this paper is norm for vectors and spectral norm for matrices. is an -accurate estimate of top principal component. Here accuracy is measured by the squared sine of the angle between and , which is .
When is close to (the eigengap is small), then the power method will converge very slowly. To address this, we propose a class of algorithms based on the alternative update step
We call the extra term, , the momentum term, and the momentum parameter, in analogy to the heavy ball method [polyak1964some], which uses the same technique to address poorly conditioned problems in convex optimization. For appropriate settings of , this accelerated power method can converge dramatically faster than the traditional power method; this is not surprising, since the same is true for analogous accelerated methods for convex optimization.
Orthogonal polynomials We now connect the dynamics of the update (A) to the behavior of a family of orthogonal polynomials, which allows us to use well-known results about orthogonal polynomials to analyze the algorithm’s convergence. Consider the polynomial sequence , defined as
According to Favard’s theorem [chihara2011introduction], this recurrence forms an orthogonal polynomial family—in fact these are scaled Chebyshev polynomials. If we use the update (A) with appropriate initialization, then our iterates will be given by
We can use this expression, together with known facts about the Chebyshev polynomials, to explicitly bound the convergence rate of the accelerated power method with the following theorem (Analysis and proof in Appendix A).
Given a PSD matrix with eigenvalues , running update (A) with results in estimates with worst-case error
We can derive the following corollary, which gives the iteration complexity to achieve error.
Remark. Minimizing over tells us that is the optimal setting.
When we compare this algorithm to power iteration, we notice that it is converging at an accelerated rate. In fact, as shown in Table 1, this momentum power method scheme (with the optimal assignment of ) even matches the worst-case rate of the Lanczos method.
Extensions In Appendix B.1, we extend this momentum scheme to achieve acceleration in the setting where we want to recover multiple top eigenvectors of , rather than just one. In Appendix B.2 we show that this momentum method is numerically stable, whereas the Lanczos method suffers from numerical instability [trefethen1997numerical, golub2012matrix]. Next, in Appendix B.3
we provide a heuristic for auto-tuning the momentum parameter, which is useful in practice. Finally, in AppendixB.4, we consider a larger orthogonal polynomial family, and we show that given some extra information about the tail spectrum of the matrix, we can obtain even faster convergence by using a 4-term inhomogeneous recurrence.
3 Stochastic PCA
Motivated by the results in the previous section, we study using momentum to accelerate PCA in the stochastic setting. We consider a streaming PCA setting, where we are given a series of i.i.d. samples, , such that
In the sample covariance setting of Section 2, can be obtained by selecting , where is uniformly sampled from the dataset. One of the most popular streaming PCA algorithms is Oja’s algorithm [oja1982simplified], which repeatedly runs the update222Here we consider a constant step size scheme, in which the iterate will converge to a noise ball. The size of the noise ball depends on the variance. . A natural way to try to accelerate Oja’s algorithm is to directly add a momentum term, which leads to
In expectation, this stochastic recurrence behaves like the deterministic three-term recurrence (A), which can achieve acceleration with proper setting of . However, we observe empirically that (2) usually does not give acceleration. In Figure 1(a), we see that while adding momentum does accelerate the convergence to the noise ball, it also increases the size of the noise ball—and decreasing the step size to try to compensate for this roughly cancels out the acceleration from momentum. This same counterintuitive phenomenon has independently been observed in goh2017why for stochastic optimization. The inability of momentum to accelerate Oja’s algorithm is perhaps not surprising because the sampling complexity of Oja’s algorithm is asymptotically optimal in terms of the eigen-gap [allen2016first].
In Section 4, we will characterize this connection between the noise ball size and momentum in more depth by presenting an exact expression for the variance of the iterates. Our analysis shows that when the sample variance is bounded, momentum can yield an accelerated convergence rate. In this section, we will present two methods that can be used to successfully control the variance: mini-batching and variance reduction. A summary of our methods and convergence rates is presented in Table 1.
shows the performance of VR power methods. The epoch lengthwas estimated according to (7) by setting and . Stochastic methods report the average performance over 10 runs.
3.1 Stochastic power method with momentum
In addition to adding momentum to Oja’s algorithm, another natural way to try to accelerate stochastic PCA is to use the deterministic update (A) with random samples rather than the exact matrix . Specifically, we analyze the stochastic recurrence
where is an i.i.d. unbiased random estimate of . More explicitly, we write this as Algorithm 1.
When the variance is zero, the dynamics of this algorithm are the same as the dynamics of update (A), so it converges at the accelerated rate given in Theorem 1. Even if the variance is nonzero, but sufficiently small, we can still prove that Algorithm 1 converges at an accelerated rate.
Suppose we run Algorithm 1 with . Let 333 denotes the Kronecker product.. Suppose that and . For any and , if
then with probability at least , we have
When we compare this to the result of Theorem 1, we can see that as long as the variance is sufficiently small, the number of iterations we need to run in the online setting is the same as in the deterministic setting (up to a constant factor that depends on ). In particular, this is faster than the power method without momentum in the deterministic setting. Of course, in order to get this accelerated rate, we need some way of getting samples that satisfy the variance condition of Theorem 3. Certain low-noise datasets might satisfy this condition, but this is not always the case. In the next two sections, we discuss methods of getting lower-variance samples.
3.2 Controlling variance with mini-batches
In the online PCA setting, a natural way of getting lower-variance samples is to increase the batch size (parameter ) used by Algorithm 1. Using the following bound on the variance,
we can get an upper bound on the mini-batch size we will need in order to satisfy the variance condition in Theorem 3, which leads to the following corollary.
Suppose we run Algorithm 1 with . Assume that and . For any and , if
then with probability at least ,
This means that no matter what the variance of the estimator is, we can still converge at the same rate as the deterministic setting as long as we can compute mini-batches of size quickly. One practical way of doing this is by using many parallel workers: a mini-batch of size can be computed in time by machines working in parallel. If we use a sufficiently large cluster that allows us to do this, this means that Algorithm 1 converges in asymptotically less time than any non-momentum power method that uses the cluster for mini-batching, because we converge faster than even the deterministic non-momentum method.
One drawback of this approach is that the required variance decreases as a function of , so we will need to increase our mini-batch size as the desired error decreases. If we are running in parallel on a cluster of fixed size, this means that we will eventually exhaust the parallel resources of the cluster and be unable to compute the mini-batches in asymptotic time. As a result, we now seek methods to reduce the required batch size, and remove its dependence on .
3.3 Reducing batch size with variance reduction
Another way to generate low-variance samples is the variance reduction technique. This technique can be used if we have access to the target matrix so that we can occasionally compute an exact matrix-vector product with . For example, in the offline setting, we can compute by occasionally doing a complete pass over the data. In PCA, shamir2015stochastic has applied the standard variance reduction technique that was used in stochastic convex optimization [johnson2013accelerating], in which the stochastic term in the update is
where is the (normalized) anchor iterate, for which we know the exact value of . We propose a slightly different variance reduction scheme, where the stochastic term in the update is
It is easy to verify that both (5) and (6) can be computed using only the samples and the exact value of . In the PCA setting, (6) is more appropriate because progress is measured by the angle between and , not the distance as in the convex optimization problem setting: this makes (6) easier to analyze. In addition to being easier to analyze, our proposed update rule (6) produces updates that have generally lower variance because for all unit vectors and , . Using this update step results in the variance-reduced power method with momentum in Algorithm 2.
A number of methods use this kind of SVRG-style variance reduction technique, which converges at a linear rate and is not limited by a noise ball. Our method improves upon that by achieving the accelerated rate throughout, and only using a mini-batch size that is constant with respect to .
Suppose we run Algorithm 2 with and a initial unit vector such that . For any , if
then after epochs, with probability at least , we have , where is a numerical constant.
By comparing to the results of Theorem 1 and Theorem 5, we notice that we still achieve the same convergence rate, in terms of the total number of iterations we need to run, as the deterministic setting. Compared with the non-variance-reduced setting, notice that the mini-batch size we need to use does not depend on the desired error , which allows us to use a fixed mini-batch size throughout the execution of the algorithm. This means that we can use Algorithm 2 together with a parallel mini-batch-computing cluster of fixed size to compute solutions of arbitrary accuracy at a rate faster than any non-momentum power method could achieve. As shown in Table 1, in terms of number of iterations, the momentum methods achieve accelerated linear convergence with proper mini-batching (our results there follow Corollary 4 and Theorem 5, using the optimal momentum .).
Experiment Now we use some simple synthetic experiments (details in Appendix E) to illustrate how the variance affects the momentum methods. Figure 1(b) shows that the stochastic power method maintains the same linear convergence as the deterministic power method before hitting the noise ball. Therefore, the momentum method can accelerate the convergence before hitting the noise ball. Figure 1(c) shows that the variance-reduced power method indeed can achieve an accelerated linear convergence with a much smaller batch size on this same synthetic dataset.
4 Convergence analysis
In this section, we sketch the proofs of Theorems 3 and 5. The main idea is to use Chebyshev polynomials to tightly bound the variance of the iterates. Either with or without variance reduction, the dynamics of the stochastic power method with momentum from (3) can be written as where is a sequence of stochastic matrices in satisfying
The random matrixwill have different forms in Algorithm 1 and Algorithm 2, but will still be i.i.d. and satisfy . In fact this recurrence (8) is general enough to be applied into many other problems, including least-square regression and the randomized Kaczmarz algorithm [strohmer2009randomized], as well as some non-convex matrix problems [de2014global] such as matrix completion [jain2013low], phase retrieval [candes2015phase] and subspace tracking [balzano2010online]. Since
obeys a linear recurrence, its second moment also follows a linear recurrence (in fact all its moments do). Decomposing this recurrence using Chebyshev polynomials, we can get a tight bound on the covariance of, which is shown in Lemma 6. It is worth mentioning that this bound is exact in the scalar case.
Suppose and . The norm of the covariance of the matrix is bounded by
where is the Chebyshev polynomial of the second kind, and denotes the set of vectors in with entries that sum to , i.e.
For the mini-batch power method without variance reduction (Algorithm 1), the goal is to bound , which is equivalent to bounding . We use Lemma 6 to get a variance bound for the denominator of this expression, which is
With this variance bound and Chebyshev’s inequality we get a probabilistic lower bound for . Lemma 6 can also be used to get an upper bound for the numerator, which is
Next we consider the case with variance reduction (Algorithm 2). The analysis contains two steps. The first step is to show a geometric contraction for a single epoch, i.e.
with probability at least , where is a numerical constant. Afterwards, the second step is to get the final accuracy of the solution, which trivially requires epochs. Thus, the analysis boils down to analyzing a single epoch. Notice that in this setting,
Suppose . Let be a unit vector, , and
Then, the norm of the covariance will be bounded by
Comparing to the result in Lemma 6, this lemma shows that the covariance is also controlled by the angle between and which is the anchor point in each epoch. Since the anchor point is approaching , the norm of the covariance is shrinking across epochs—this allows us to prove (11). From here, the proof of Theorem 5 is similar to non-VR case, and the details are in Appendix C.2.
5 Related work
A recent spike in research activity has focused on improving a number of computational and statistical aspects of PCA, including tighter sample complexity analysis [jain2016matching], global convergence [de2014global, allen2016first], memory efficiency [mitliagkas2013memory] and doing online regret analysis [boutsidis2015online]. Some work has also focused on tightening the analysis of power iteration and Krylov methods to provide gap-independent results using polynomial-based analysis techniques [musco2015randomized]. However, that work does not consider the stochastic setting. Some works that study Oja’s algorithm [oja1982simplified] or stochastic power methods in the stochastic setting focus on the analysis of a gap-free convergence rate for the distinct PCA formulation of maximizing explained variance (as opposed to recovering the strongest direction) [shamir2016convergence, allen2016first]. Others provide better dependence on the dimension of the problem [jain2016matching]. garber2016faster, allen2016lazysvd use faster linear system solvers to speed up PCA algorithms such that the convergence rate has the square root dependence on the eigengap in the offline setting. However their methods require solving a series of linear systems, which is not trivially parallelizable. Also none of these results give a convergence analysis that is asymptotically tight in terms of variance, which allows us to show an accelerated linear rate in the stochastic setting. Another line of work has focused on variance control for PCA in the stochastic setting [shamir2015stochastic] to get a different kind of acceleration. Since this is an independent source of improvement, these methods can be further accelerated using our momentum scheme.
The momentum scheme is a common acceleration technique in convex optimization [polyak1964some, nesterov1983method]
, and has been widely adopted as the de-facto optimization method for non-convex objectives in deep learning[sutskever2013importance]. Provably accelerated stochastic methods have previously been found for convex problems [cotter2011better, jain2017accelerating]. However, similar results for non-convex problems remain elusive, despite empirical evidence that momentum results in acceleration for some non-convex problems [sutskever2013importance, kingma2014adam].
The Chebyshev polynomial family is a sequence of orthogonal polynomials [chihara2011introduction] that has been used for analyzing accelerated methods. For example, Chebyshev polynomials have been studied to accelerate the solvers of linear systems [golub1961chebyshev, golub2012matrix] and to accelerate convex optimization [scieur2016regularized]. trefethen1997numerical use Chebyshev polynomials to show that the Lanczos method is quadratically faster than the standard power iteration, which is conventionally considered as the accelerated version of power method with momentum [hardt2014noisy].
This paper introduced a very simple accelerated PCA algorithm that works in the stochastic setting. As a foundation, we presented the power method with momentum, an accelerated scheme in the deterministic setting. We proved that the power method with momentum obtains quadratic acceleration like the convex optimization setting. Then, for the stochastic setting, we introduced and analyzed the stochastic power method with momentum. By leveraging the Chebyshev polynomials, we derived a convergence rate that is asymptotically tight in terms of the variance. Using a tight variance analysis, we demonstrated how the momentum scheme behaves in a stochastic system, which can lead to a better understanding of how momentum interacts with variance in stochastic optimization problems [goh2017why]. Specifically, with mini-batching, the stochastic power method with momentum can achieve accelerated convergence to the noise ball. Alternatively, using variance reduction, accelerated convergence at a linear rate can be achieved with a much smaller batch size.
We thank Aaron Sidford for helpful discussion and feedback on this work.
We gratefully acknowledge the support of the Defense Advanced Research Projects Agency (DARPA) SIMPLEX program under No. N66001-15-C-4043, D3M program under No. FA8750-17-2-0095, the National Science Foundation (NSF) CAREER Award under No. IIS- 1353606, the Office of Naval Research (ONR) under awards No. N000141210041 and No. N000141310129, a Sloan Research Fellowship, the Moore Foundation, an Okawa Research Grant, Toshiba, and Intel. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, NSF, ONR, or the U.S. government.
Appendix A Momentum PCA and Orthogonal Polynomials
In this section, we prove Theorem 1 and give the intuition that the momentum can provide acceleration from both geometric and algebraic perspectives.
First, we restate the update (A) for power iteration with momentum,
and the correponding orthogonal polynomial sequence (P),
According to Lemma 20, we have the expression of ,
a.1 Proof of Theorem 1
Here we prove a slightly general result.
Given a PSD matrix with eigenvalues with normalized eigenvectors , we run the power iteration with momentum update A with a unit vector , the we have
Denote and , then
Let’s bound . Denote as the smallest index such that . Since , then . Now use Lemma 20, we get
First, let’s consider .
Now consider ,
Therefore plug in the bound for and we get the desired result. ∎
a.2 Effect of Momentum
In this section, we explain why acceleration happens from both a geometric and algebraic perspective of the orthogonal polynomial recurrence. First, we show the geometric behavior of the orthogonal polynomial sequence. We see that momentum results in a “calm” region, where the orthogonal polynomial sequence grows very slowly and an “explosive” region, where the polynomials grow exponentially fast. We then show how the momentum controls the size of “calm” region. Second, we consider an algebraically equivalent form of the three-term recurrence in terms of an augmented matrix. We see that power iteration with momentum is equivalent to standard power iteration on an augmented matrix and quantitatively how the momentum leads to a “better-conditioned” problem. From either perspective, we get a better understanding about how our methods work.
Regions of the Polynomial Recurrence Now, we demonstrate the effect of momentum on different eigenvalues. In Figure 2, we show the values of the polynomial recurrence, which characterizes the growth of different eigenvalues for varying .
For power iteration, where , . While the recurrence reduces mass on small eigenvalues quickly, eigenvalues near the largest eigenvalue will decay relatively slowly, yielding slow convergence.
As is increased, a “knee” appears in . For values of smaller than the knee, remains small, which implies that these eigenvalues decay quickly. For values of greater than the knee, grows rapidly, which means that these eigenvalues will remain. By selecting a value that puts that knee close to , our recurrence quickly eliminates mass on all but the largest eigenvector.
Well-Conditioned Augmented Matrix
Consider the recurrence
Notice that this is simply power iteration on an augmented matrix. It is straightforward to see taht the power iteration with momentum is exactly equivalent to standard power iteration on this augmented matrix, i.e. from (A1) and from (A) are the same. As a result, we can take advantage of known power iteration properties when studying our method. In the following proposition, we derive the eigenvalues of the augmented matrix.
Suppose a matrix has eigenvalue- eigenvector pairs , then the augmented matrix
has eigenvalue-eigenvector pairs
In particular, when , the relative eigen-gap of this augmented matrix is . And the standard power iteration on has the convergence rate , which matches the result in Theorem 1.
Now we present the proof of Proposition 9 below.
For any eigenvalue, eigenvector pair of , let be a solution of . Suppose that we define
Thus, is an eigenvector of with corresponding eigenvalue . Doing this for all eigenvectors of will produce a complete eigendecomposition of . ∎
Appendix B Extensions
In this section, we consider several extension based on power method with momentum presented in Section 2. In Section B.1, we will generalize our methods to multiple components case, i.e. finding the top eigenvalues/eigenvectors and show that it is numerically stable in Section B.2. In Section B.3, we provide some simple heuristics to tune the momentum parameter. In Section B.4, we extend our momentum method into an inhomogeous polynomials recurrence and show that it is optimal in expection with respect to the tail distribution of the tail spectrum of the target matrix . All the proofs for this section are in Section B.6.
b.1 Block Update for Multiple Components
In this section, we use a block version of our method to compute multiple principal components. In this case, the initial state is a matrix , rather than a single vector. The orthogonal polynomal sequence (P) natually corresponds to the update scheme
To obtain the convergence result, we use the standard definition from golub2012matrix to measure the distance between spaces.
Given two spaces , the distance between is defined as
where is the orthogonal projection onto . Furthermore, when are matrices, we overload the definition as , where denotes the range space.
The following lemma shows that we can analyze the convergence rate of any update scheme by studying the growth rate of the corresponding orthogonal polynomial.
Given a PSD matrix , its top () eigenvectors , and a matrix such that , for any polynomial , we have
The following theorem gives the rate at which the space spanned by the first columns of approach the space spanned by the top eigenvectors.
b.2 Stable Implementation of Momentum Methods
In this section, we provide a numerically stable implementation of our momentum method for the multi-component case. This implementation can also be applied in the single component case. Consider the update scheme A’. Similar to the unnormalized simultaneous iteration (which essentially is the block version of the power method) ([Lecture 28]trefethen1997numerical), as , all columns of converge to the multiples of the same dominant eigenvectors of due to the round-off errors. A common technique to remedy the situation is orthonormalization, which is used in the standard power method. However we cannot simply orthonormalize each or every iteration because it changes the convergence behavior. Instead, we propose the normalization scheme A” to stabilize our method: