Convergence of Stochastic Gradient Descent for PCA

by   Ohad Shamir, et al.
Weizmann Institute of Science

We consider the problem of principal component analysis (PCA) in a streaming stochastic setting, where our goal is to find a direction of approximate maximal variance, based on a stream of i.i.d. data points in ^d. A simple and computationally cheap algorithm for this is stochastic gradient descent (SGD), which incrementally updates its estimate based on each new data point. However, due to the non-convex nature of the problem, analyzing its performance has been a challenge. In particular, existing guarantees rely on a non-trivial eigengap assumption on the covariance matrix, which is intuitively unnecessary. In this paper, we provide (to the best of our knowledge) the first eigengap-free convergence guarantees for SGD in the context of PCA. This also partially resolves an open problem posed in hardt2014noisy. Moreover, under an eigengap assumption, we show that the same techniques lead to new SGD convergence guarantees with better dependence on the eigengap.


page 1

page 2

page 3

page 4


Averaging Stochastic Gradient Descent on Riemannian Manifolds

We consider the minimization of a function defined on a Riemannian manif...

Variance Reduced Stochastic Gradient Descent with Neighbors

Stochastic Gradient Descent (SGD) is a workhorse in machine learning, ye...

Exponentially convergent stochastic k-PCA without variance reduction

We present Matrix Krasulina, an algorithm for online k-PCA, by generaliz...

Bias-Variance Tradeoff in a Sliding Window Implementation of the Stochastic Gradient Algorithm

This paper provides a framework to analyze stochastic gradient algorithm...

NeCPD: An Online Tensor Decomposition with Optimal Stochastic Gradient Descent

Multi-way data analysis has become an essential tool for capturing under...

Data Dependent Convergence for Distributed Stochastic Optimization

In this dissertation we propose alternative analysis of distributed stoc...

Semi-Cyclic Stochastic Gradient Descent

We consider convex SGD updates with a block-cyclic structure, i.e. where...

1 Introduction

Principal component analysis (PCA) [20, 11] is a fundamental tool in data analysis and visualization, designed to find the subspace of largest variance in a given dataset (a set of points in Euclidean space). We focus on a simple stochastic setting, where the data is assumed to be drawn i.i.d. from an unknown underlying distribution, and our goal is to find a direction of approximately maximal variance. This can be written as the optimization problem


or equivalently, finding an approximate leading eigenvector of the covariance matrix


The conceptually simplest method for this task, given sampled points , is to construct the empirical covariance matrix , and compute its leading eigenvector by an eigendecomposition. Based on concentration of measure arguments, it is not difficult to show that this would result in an -optimal solution to Eq. (1). Unfortunately, the runtime of this method is . In large-scale applications, both and might be huge, and even forming the

covariance matrix, let alone performing an eigendecomposition, can be computationally prohibitive. A standard alternative to exact eigendecomposition is iterative methods, such as power iterations or the Lanczos method, which require performing multiple products of a vector with the empirical covariance matrix. Although this doesn’t require computing and storing the matrix explicitly, it still requires multiple passes over the data, whose number may scale with eigengap parameters of the matrix or the target accuracy

[14, 16]. Recently, new randomized algorithms for this problem were able to significantly reduce the required number of passes, while maintaining the ability to compute high-accuracy solutions [25, 24, 8, 12].

In this work, we consider the efficacy of algorithms which perform a single pass over the data, and in particular, stochastic gradient descent (SGD). For solving Eq. (1), SGD corresponds to initializing at some unit vector , and then at each iteration perform a stochastic gradient step with respect to

(which is an unbiased estimate of

), followed by a projection to the unit sphere:

Here, is a step size parameter. In the context of PCA, this is also known as Oja’s method [18, 19]. The algorithm is highly efficient in terms of memory and runtime per iteration, requiring storage of a single -dimensional vector, and performing only vector-vector and a vector-scalar products in each iteration.

In the world of convex stochastic optimization and learning, SGD has another remarkable property: Despite it being a simple, one-pass algorithm, it is essentially (worst-case) statistically optimal, attaining the same statistical estimation error rate as exact empirical risk minimization [5, 23, 22]. Thus, it is quite natural to ask whether SGD also performs well for the PCA problem in Eq. (1), compared to statistically optimal but computationally heavier methods.

The study of SGD (or variants thereof) for PCA has gained interest in recent years, with some notable examples including [1, 3, 2, 15, 10, 7, 12]. While experimentally SGD appears to perform reasonably well, its theoretical analysis has proven difficult, due to the non-convex nature of the objective function in Eq. (1

). Remarkably, despite this non-convexity, finite-time convergence guarantees have been obtained under an eigengap assumption – namely, that the difference between the largest and 2nd-largest eigenvalues of

are separated by some fixed value . For example, [7] require

iterations to ensure with high probability that one of the iterates is

-optimal. [12] require iterations, provided we begin close enough to an optimal solution.

Nevertheless, one may ask whether the eigengap assumption is indeed necessary, if our goal is simply to find an approximately optimal solution of Eq. (1). Intuitively, if has two equal (or near equal) top eigenvalues, then we may still expect to get a solution which lies close to the subspace of these two top eigenvalues, and approximately minimizes Eq. (1), with the runtime not dependent on any eigengap. Unfortunately, existing results tell us nothing about this regime, and not just for minor technical reasons: These results are based on tracking the geometric convergence of the SGD iterates to a leading eigenvector of the covariance matrix. When there is no eigengap, there is also no single eigenvector to converge to, and such a geometric approach does not seem to work. Getting an eigengap-free analysis has also been posed as an open problem in [10]. We note that while there are quite a few other single-pass, eigengap-free methods for this problem, such as [28, 29, 17, 6, 9, 13], their memory and runtime-per iteration requirements are much higher than SGD, often or worse.

In this work, we study the convergence of SGD for PCA, using a different technique that those employed in previous works, with the following main results:

  • We provide the first (to the best of our knowledge) SGD convergence guarantee which does not pose an eigengap assumption. Roughly speaking, we prove that if the step size is chosen appropriately, then after iterations starting from random initialization, with positive probability, SGD returns an -optimal111Throughout, we use , to hide constants, and , to hide constants and logarithmic factors. solution of Eq. (1), where is a parameter depending on how the algorithm is initialized:

    • If the algorithm is initialized from a warm-start point such that for some leading eigenvector of the covariance matrix, then .

    • Under uniform random initialization on the unit Euclidean sphere, , where is the dimension.

    • Using a more sophisticated initialization (requiring the usage of the first iterations, but no warm-start point), , where is the numerical rank of the covariance matrix. The numerical rank is a relaxation of the standard notion of rank, is always at most and can be considered a constant under some mild assumptions.

  • In the scenario of a positive eigengap , and using a similar proof technique, we prove an SGD convergence guarantee of (where is as above) with positive probability. This guarantee is optimal in terms of dependence on , and in particular, has better dependence on compared to all previous works on SGD-like methods we are aware of ( as opposed to ).

Unfortunately, a drawback of our guarantees is that they only hold with rather low probability: , which can be small if is large. Formally, this can be overcome by repeating the algorithm times, which ensures that with high probability, at least one of the outputs will be close to optimal. However, we suspect that these low probabilities are an artifact of our proof technique, and resolving it is left to future work.

2 Setting

We use bold-faced letters to denote vectors, and capital letters to denote matrices. Given a matrix , we let denote its spectral norm, and its Frobenius norm.

We now present the formal problem setting, in a somewhat more general way than the PCA problem considered earlier. Specifically, we study the problem of solving


where and is a positive semidefinite matrix, given access to a stream of i.i.d. positive semidefinite matrices where (e.g. in the PCA case). Notice that the gradient of Eq. (2) at a point equals , with an unbiased stochastic estimate being . Therefore, applying SGD to Eq. (2) reduces to the following: Initialize at some unit-norm vector , and for , perform , returning . In fact, for the purpose of the analysis, it is sufficient to consider a formally equivalent algorithm, which only performs the projection to the unit sphere at the end:

  • Initialize by picking a unit norm vector

  • For , perform

  • Return

It is easy to verify that the output of this algorithm is mathematically equivalent to the original SGD algorithm, since the stochastic gradient step amounts to multiplying by a matrix independent of , and the projection just amounts to re-scaling. In both cases, we can write the algorithm’s output in closed form as

3 Convergence Without an Eigengap Assumption

Our main result is the following theorem, which analyzes the performance of SGD for solving Eq. (2).

Theorem 1.

Suppose that

  • For some leading eigenvector of , for some (assumed to be for simplicity).

  • For some , both and are at most with probability .

If we run the algorithm above for iterations with (assumed to be ), then with probability at least , the returned satisfies

where are positive numerical constants.

The proof and an outline of its main ideas appears in Subsection 5.1 below. Note that this is a multiplicative guarantee on the suboptimality of Eq. (2), since we normalize by , which is the largest magnitude Eq. (2) can attain. By multiplying both sides by , we can convert this to an additive bound of the form

where is a bound on . Also, note that the choice of in the theorem is not crucial, and similar bounds (with different ) can be shown for other .

The value of in the theorem depends on how the initial point is chosen. One possibility, of course, is if we can initialize the algorithm from a “warm-start” point such that , in which case the bound in the theorem becomes with probability . Such a may be given by some other algorithm, or alternatively, if we are interested in analyzing SGD in the regime where it is close to one of the leading eigenvectors.

Of course, such an assumption is not always relevant, so let us turn to consider the performance without such a “warm-start”. For example, the simplest and most common way to initialize is by picking it uniformly at random from the unit sphere. In that case, for any , with high constant probability222One way to see this is by assuming w.l.o.g. that and noting that the distribution of is the same as where

has a standard Gaussian distribution, hence

, and by using standard concentration tools it can be shown that the numerator is and the denominator is with high probability., so the theorem above applies with :

Corollary 1.

If is chosen uniformly at random from the unit sphere in , then Thm. 1 applies with , and the returned satisfies, with probability at least ,

While providing some convergence guarantee, note that the probability of success is low, scaling down linearly with . One way to formally solve this is to repeat the algorithm times, which ensures that with high probability, at least one output will succeed (and finding it can be done empirically by testing the outputs on a validation set). However, it turns out that by picking in a smarter way, we can get a bound where the factors are substantially improved.

Specifically, we consider the following method, parameterized by number of iterations , which are implemented before the main algorithm above:

  • Sample from a standard Gaussian distribution on

  • Let .

  • For , let

  • Return .

Essentially, instead of initializing from a random point , we initialize from

Since is a mean of random matrices with mean , this amounts to performing a single approximate power iteration. Recently, it was shown that a single exact power iteration can improve the starting point of stochastic methods for PCA [24]. The method above extends this idea to a purely streaming setting, where we only have access to stochastic approximations of .

The improved properties of with this initialization is formalized in the following lemma (where denotes the Frobenius norm of ):

Lemma 1.

The following holds for some numerical constants : For as defined above, if , then with probability at least ,

where is the numerical rank of .

The proof is provided in Subsection 5.2. Combining this with Thm. 1, we immediately get the following corollary:

Corollary 2.

If is initialized as described above, then Thm. 1 applies with , and the returned satisfies, with probability at least ,

The improvement of Corollary 2 compared to Corollary 1 depends on how much smaller is , the numerical rank of , compared to . We argue that in most cases, is much smaller, and often can be thought of as a moderate constant, in which case Corollary 2 provides an error bound with probability , at the cost of additional iterations at the beginning. Specifically:

  • is always in , and in particular, can never be larger than .

  • is always upper bounded by the rank of , and is small even if is only approximately low rank. For example, if the spectrum of has polynomial decay where , then will be a constant independent of . Moreover, to begin with, PCA is usually applied in situations where we hope is close to being low rank.

  • When is of rank (which is the case, for instance, in PCA, where equals the outer product of the -th datapoint ), we have , where we recall that upper bounds the scaled spectral norm of

    . In machine learning application, the data norm is often assumed to be bounded, hence

    is not too large. To see why this holds, note that for rank matrices, the spectral and Frobenius norms coincide, hence

    where we used Jensen’s inequality.

Similar to Corollary 1, we can also convert the bound of Corollary 2 into a high-probability bound, by repeating the algorithm times.

4 Convergence under an Eigengap Assumption

Although our main interest so far has been the convergence of SGD without any eigengap assumptions, we show in this section that our techniques also imply new bounds for PCA with an eigengap assumptions, which in certain aspects are stronger than what was previously known.

Specifically, we consider the same setting as before, but where the ratio , where

are the leading singular values of the covariance matrix

is assumed to be strictly positive and lower bounded by some fixed . Using this assumption and a proof largely similar to that of Thm. 1, we have the following theorem:

Theorem 2.

Under the same conditions as Thm. 1, suppose furthermore that

  • The top two eigenvalues of have a gap

If we run the algorithm above for iterations with (assumed to be ), then with probability at least , the returned satisfies

where are positive numerical constants.

The proof appears in Subsection 5.3. Considering first the technical conditions of the theorem, we note that assuming simply amounts to saying that is sufficiently large so that the bound provided by Thm. 2 is better than the bound provided by Thm. 1, by more than a constant. This is the interesting regime, since otherwise we might as well choose as in Thm. 1 and get a better bound without any eigengap assumptions. Moreover, as in Thm. 1, a similar proof would hold if the step size is replaced by for some constant .

As in Thm. 1, we note that can be as large as under random initialization, but this can be improved to the numerical rank of using an approximate power iteration, or by analyzing the algorithm starting from a warm-start point for which for a leading eigenvector of . Also, note that under an eigengap assumption, if goes to with the number of iterations , it must hold that goes to for a leading eigenvector of , so the analysis with is also relevant for analyzing SGD for sufficiently large , once we’re sufficiently close to the optimum.

Comparing the bound to previous bounds in the literature for SGD-like methods (which all assume an eigengap, e.g. [3, 10, 7, 12]), an interesting difference is that the dependence on the eigengap is only , as opposed to or worse. Intuitively, we are able to improve this dependence since we track the suboptimality directly, as opposed to tracking how converges to a leading eigenvector, say in terms of the Euclidean norm. This has an interesting parallel in the analysis of SGD for -strongly convex functions, where the suboptimality of decays as , although can only be bounded by (compare for instance Lemma 1 in [21] and Theorem 1 in [26]). Quite recently, Jin et al. ([12]) proposed another streaming algorithm which does have only dependence (at least for sufficiently large ), and a high probability convergence rate which is even asymptotically optimal in some cases. However, their formal analysis is from a warm-start point (which implies in our notation), whereas the analysis here applies to any starting point. Moreover, the algorithm in [12] is different and more complex, whereas our focus here is on the simple and practical SGD algorithm. Finally, we remark that although an convergence rate is generally optimal (using any algorithm), we do not know whether the dependence on and in the convergence bound of Thm. 2 for SGD is optimal, or whether it can be improved.

5 Proofs

5.1 Proof of Thm. 1

To simplify things, we will assume that we work in a coordinate system where is diagonal, , where , and is the eigenvalue corresponding to . This is without loss of generality, since the algorithm and the theorem conditions are invariant to the choice of coordinate system. Moreover, since the objective function in the theorem is invariant to , we shall assume that . Under these assumptions, the theorem’s conditions reduce to:

  • , for some

  • is an upper bound on

Let be a parameter to be determined later. The proof works by lower bounding the probability of the objective function (which under the assumption , equals ) being suboptimal by at most . This can be written as

or equivalently,


we need to lower bound .

In analyzing the convergence of stochastic gradient descent, a standard technique to bound such probabilities is via a martingale analysis, showing that after every iteration, the objective function decreases by a certain amount. Unfortunately, due to the non-convexity of the objective function here, the amount of decrease at iteration critically depends on the current iterate , and in the worst case may even be (e.g. if is orthogonal to the leading eigenvector, and there is no noise). Moreover, analyzing the evolution of is difficult, especially without eigengap assumptions, where there isn’t necessarily some fixed direction which converges to. Hence, we are forced to take a more circuitous route.

In a nutshell, the proof is composed of three parts. First, we prove that if and the step size are chosen appropriately, then . If we could also prove a concentration result, namely that is not much larger than its expectation, this would imply that is indeed large. Unfortunately, we do not know how to prove such concentration. However, it turns out that it is possible to prove that is not much smaller than its expected value: More precisely, that with high probability. We then show that given such a high-probability lower bound on , and a bound on its expectation, we can produce an upper bound on which holds with probability , hence leading to the result stated in the theorem.

We begin with a preliminary technical lemma:

Lemma 2.

For any , and integer ,


The result trivially holds for , so we will assume from now. Let

Differentiating and setting to zero, we have

Let denote this critical point, and consider two cases:

  • : In that case, has no critical points in the domain, hence is maximized at one of the domain endpoints, with a value of at most

  • : In that case, we must have , and the value of at is

    The maximal value of is either the value above, or the maximal value of at the domain endpoints, which we already showed to be most . Overall, the maximal value can attain is at most

Combining the two cases, the result follows. ∎

Using this lemma, we now prove that has a large negative expected value. To explain the intuition, note that if we could have used the exact instead of the stochastic approximations in deriving , then we would have

which by the assumptions and is at most

Applying Lemma 2 and picking appropriately, it can be shown that the above is at most .

Unfortunately, this calculation doesn’t apply in practice, since we use the stochastic approximations instead of . However, using more involved calculations, we prove in the lemma below that the expectation is still essentially the same, provided are chosen appropriately.

Lemma 3.

If and for some sufficiently large constant , then it holds that


To simplify notation, define for all the matrices

Note that is deterministic whereas is random and zero-mean. Moreover, and .

By definition of the algorithm, we have the following:

Since are independent and zero-mean, the expectation of each summand in the expression above is non-zero only if for all . Therefore,

We now decompose this sum according to what is the largest value of for which (hence ). The intuition for this, as will be seen shortly, is that Lemma 2 allows us to attain tighter bounds on the summands when is much smaller than . Formally, we can rewrite the expression above as

Since is diagonal and the same for all , and is diagonal as well, we can simplify the above to

Using the fact that the spectral norm is sub-multiplicative, and that for any symmetric matrix , , where denotes the largest eigenvalue of , we can upper bound the above by

Since , and , , this is at most


Recalling that with , that , and that , the first term in Eq. (3) equals

Applying Lemma 2, and recalling that , we can upper bound the above by


As to the second term in Eq. (3), again using the fact that , we can upper bound it by

Applying Lemma 2, and recalling that , this is at most

Upper bounding by , and rewriting the sum in terms of instead of , we get

Since , we have , so the above is at most