Fast Stochastic Algorithms for SVD and PCA: Convergence Properties and Convexity

07/31/2015 ∙ Ohad Shamir, Weizmann Institute of Science

We study the convergence properties of the VR-PCA algorithm introduced in [19] for fast computation of leading singular vectors. We prove several new results, including a formal analysis of a block version of the algorithm, and convergence from random initialization. We also make a few observations of independent interest, such as how pre-initializing with just a single exact power iteration can significantly improve the runtime of stochastic methods, and what are the convexity and non-convexity properties of the underlying optimization problem.

1 Introduction

We consider the problem of recovering the top k left singular vectors of a d × n matrix X = (x_1, …, x_n), where typically k ≪ min{d, n}. This is equivalent to recovering the top k eigenvectors of the d × d matrix XX^⊤, or equivalently, solving the optimization problem

min_{W ∈ ℝ^{d×k} : W^⊤W = I}  −(1/n) ∑_{i=1}^{n} ‖W^⊤ x_i‖² .    (1)

This is one of the most fundamental matrix computation problems, and has numerous uses (such as low-rank matrix approximation and principal component analysis).

For large-scale matrices X, where exact eigendecomposition is infeasible, standard deterministic approaches are based on power iterations or variants thereof (e.g. the Lanczos method) [8]. Alternatively, one can exploit the structure of Eq. (1) and apply stochastic iterative algorithms, where in each iteration we update the current iterate based on one or more randomly-drawn columns of X. Such algorithms have been known for several decades ([14, 17]), and have enjoyed renewed interest in recent years, e.g. [2, 4, 3, 10, 6]. Another stochastic approach is based on random projections, e.g. [9, 20].
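For concreteness, here is a minimal sketch of the classical power iteration in NumPy-style Python; this is an illustration we add for reference (the function name and parameters are ours, not from the paper), not a description of any specific implementation.

    import numpy as np

    def power_iteration(A, num_iters=100, seed=0):
        """Estimate the leading eigenvector of a symmetric d x d matrix A."""
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(A.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(num_iters):
            w = A @ w        # a full pass over the data if A = (1/n) X X^T is applied implicitly
            w /= np.linalg.norm(w)
        return w

Each such iteration costs a full pass over the data when A is applied implicitly as (1/n) X (X^⊤ w), which is exactly the cost that stochastic methods try to avoid.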

Unfortunately, each of these algorithms suffers from a different disadvantage: The deterministic algorithms are accurate (their runtime is logarithmic in the required accuracy ε, under an eigengap condition), but require a full pass over the matrix for each iteration, and in the worst case many such passes would be required (polynomial in the eigengap). On the other hand, each iteration of the stochastic algorithms is cheap, and their number is independent of the size of the matrix, but on the flip side, their noisy stochastic nature means they are not suitable for obtaining a high-accuracy solution (the runtime scales polynomially with 1/ε).

Recently, [19] proposed a new practical algorithm, VR-PCA, for solving Eq. (1), which has a “best-of-both-worlds” property: The algorithm is based on cheap stochastic iterations, yet the algorithm’s runtime is logarithmic in the required accuracy ε. More precisely, for the case k = 1, with columns x_i of bounded norm, and when there is an eigengap of λ between the first and second leading eigenvalues of the covariance matrix (1/n) XX^⊤, the required runtime was shown to be on the order of

d ( n + 1/λ² ) log(1/ε).    (2)

The algorithm is therefore suitable for obtaining high-accuracy solutions (the dependence on ε is logarithmic), but essentially at the cost of only O((1 + 1/(nλ²)) log(1/ε)) passes over the data. The algorithm is based on a recent variance-reduction technique designed to speed up stochastic algorithms for convex optimization problems ([13]), although the optimization problem in Eq. (1) is inherently non-convex. See Section 3 for a more detailed description of this algorithm, and [19] for further discussion as well as empirical results.

The results and analysis in [19] left several issues open. For example, it is not clear if the quadratic dependence on 1/λ in Eq. (2) is necessary, since it is worse than the linear (or better) dependence that can be obtained with the deterministic algorithms mentioned earlier, as well as analogous results that can be obtained with similar techniques for convex optimization problems (where λ is replaced by the strong convexity parameter). Also, the analysis was only shown for the case k = 1, whereas in practice we often want to recover k > 1 singular vectors simultaneously. Although [19] proposed a variant of the algorithm for that case, and studied it empirically, no analysis was provided. Finally, the convergence guarantee assumed that the algorithm is initialized from a point closer to the optimum than what is attained with standard random initialization. Although one can use some other, existing stochastic algorithm to perform this “warm-start”, no end-to-end analysis of the algorithm, starting from random initialization, was provided.

In this paper, we study these and related questions, and make the following contributions:

  • We propose a variant of VR-PCA to handle the k > 1 case, and formally analyze its convergence (Section 3). The extension to k > 1 is non-trivial, and requires tracking the evolution of the subspace spanned by the current solution at each iteration.

  • In Section 4, we study the convergence of VR-PCA starting from a random initialization, and show that with a slightly smarter initialization – essentially, random initialization followed by a single power iteration – the convergence results can be substantially improved. In fact, a similar initialization scheme should assist in the convergence of other stochastic algorithms for this problem, as long as a single power iteration can be performed.

  • In Section 5, we study whether functions similar to Eq. (1) have hidden convexity properties, which would allow applying existing convex optimization tools as-is, and improve the required runtime. For the k = 1 case, we show that this is in fact true: Close enough to the optimum, and on a suitably-designed convex set, such a function is indeed strongly convex. Unfortunately, the required distance from the optimum scales with the eigengap λ, and this precludes a better runtime in most practical regimes. However, it still indicates that a better runtime and dependence on λ should be possible.

2 Some Preliminaries and Notation

We consider a d × n matrix X composed of columns x_1, …, x_n, and let

A = (1/n) X X^⊤ = (1/n) ∑_{i=1}^{n} x_i x_i^⊤ .

Thus, Eq. (1) is equivalent to finding the k leading eigenvectors of A.
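As a quick numerical illustration of this equivalence (a sketch we add for convenience; the variable names are ours), the top eigenvectors of A span the same subspace as the top left singular vectors of X:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, k = 50, 200, 3
    X = rng.standard_normal((d, n))

    A = (X @ X.T) / n                       # A = (1/n) sum_i x_i x_i^T

    # Top-k eigenvectors of A (np.linalg.eigh returns eigenvalues in ascending order).
    _, eigvecs = np.linalg.eigh(A)
    V_eig = eigvecs[:, ::-1][:, :k]

    # Top-k left singular vectors of X.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    V_svd = U[:, :k]

    # The singular values below are the cosines of the principal angles
    # between the two subspaces; all of them should be ~1.
    print(np.linalg.svd(V_eig.T @ V_svd, compute_uv=False))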

We generally use bold-face letters to denote vectors, and capital letters to denote matrices. We let Tr(·) denote the trace of a matrix, ‖·‖_F the Frobenius norm, and ‖·‖ the spectral norm. A symmetric matrix M is positive semidefinite if x^⊤ M x ≥ 0 for all x; M is positive definite if the inequality is strict for all x ≠ 0. Following standard notation, we write M ⪰ 0 to denote that M is positive semidefinite, and M ⪰ M′ if M − M′ ⪰ 0. M ≻ 0 means that M is positive definite.

A twice-differentiable function on a convex subset of ℝ^d is convex if its Hessian is always positive semidefinite. If the Hessian is always ⪰ λI for some λ > 0, we say that the function is λ-strongly convex. If the Hessian is always ⪯ μI for some μ > 0, then the function is μ-smooth.

3 The VR-PCA Algorithm and a Block Version

We begin by recalling the algorithm of [19] for the case k = 1 (Algorithm 1), and then discuss its generalization to k > 1.

1:  Parameters: Step size η, epoch length m
2:  Input: Data matrix X = (x_1, …, x_n); Initial unit vector w̃_0
3:  for s = 1, 2, … do
4:     ũ = (1/n) ∑_{i=1}^{n} x_i (x_i^⊤ w̃_{s−1})
5:     w_0 = w̃_{s−1}
6:     for t = 1, 2, …, m do
7:        Pick i_t ∈ {1, …, n} uniformly at random
8:        w′_t = w_{t−1} + η ( x_{i_t} (x_{i_t}^⊤ w_{t−1} − x_{i_t}^⊤ w̃_{s−1}) + ũ )
9:        w_t = w′_t / ‖w′_t‖
10:     end for
11:     w̃_s = w_m
12:  end for
Algorithm 1 VR-PCA: Vector version (k = 1)

The basic idea of the algorithm is to perform stochastic updates using randomly-sampled columns of the matrix, but interlace them with occasional exact power iterations, and use these to gradually reduce the variance of the stochastic updates. Specifically, the algorithm is split into epochs s = 1, 2, …, where in each epoch we do a single exact power iteration with respect to the matrix A (by computing ũ), and then perform m stochastic updates, which can be re-written as

w_t ∝ (I + ηA) w_{t−1} + η ( x_{i_t} x_{i_t}^⊤ − A ) ( w_{t−1} − w̃_{s−1} ).

The first term is essentially a power iteration (with a finite step size η), whereas the second term is zero-mean, with variance dominated by ‖w_{t−1} − w̃_{s−1}‖². As the algorithm progresses, w_{t−1} and w̃_{s−1} both converge toward the same optimal point, hence ‖w_{t−1} − w̃_{s−1}‖ shrinks, eventually leading to an exponential convergence rate.
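The following NumPy sketch implements the k = 1 updates described above; it is a minimal illustration based on our reconstruction of the pseudocode, with illustrative (not prescribed) parameter choices and function names of our own.

    import numpy as np

    def vr_pca(X, eta, m, T, seed=0):
        """Sketch of Algorithm 1. X is d x n with columns x_1, ..., x_n."""
        d, n = X.shape
        rng = np.random.default_rng(seed)
        w_tilde = rng.standard_normal(d)
        w_tilde /= np.linalg.norm(w_tilde)
        for _ in range(T):                          # epochs
            u_tilde = X @ (X.T @ w_tilde) / n       # exact power iteration: (1/n) sum_i x_i x_i^T w_tilde
            w = w_tilde.copy()
            for _ in range(m):                      # cheap stochastic updates
                i = rng.integers(n)
                x = X[:, i]
                w_prime = w + eta * (x * (x @ w - x @ w_tilde) + u_tilde)
                w = w_prime / np.linalg.norm(w_prime)
            w_tilde = w
        return w_tilde

Taking the epoch length m on the order of n and a small constant step size η is a reasonable starting point in practice.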

To handle the k > 1 case (where more than one eigenvector should be recovered), one simple technique is deflation, where we recover the leading eigenvectors one-by-one, each time using the k = 1 algorithm. However, a disadvantage of this approach is that it requires a positive eigengap between all the top k eigenvalues, otherwise the algorithm is not guaranteed to converge. Thus, an algorithm which simultaneously recovers all k leading eigenvectors is preferable.
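As an aside, the deflation approach just mentioned is easy to sketch on top of any k = 1 solver (again an illustration with our own helper names): after each eigenvector is recovered, its direction is projected out of the data before the next run.

    import numpy as np

    def deflation_topk(X, k, solve_top1, **kwargs):
        """Recover k leading eigenvectors of (1/n) X X^T one-by-one via deflation.
        solve_top1(X, **kwargs) should return a unit-norm leading-eigenvector estimate,
        e.g. the vr_pca sketch above."""
        V = []
        X_defl = X.copy()
        for _ in range(k):
            v = solve_top1(X_defl, **kwargs)
            V.append(v)
            # Project the recovered direction out of every column of the data.
            X_defl = X_defl - np.outer(v, v @ X_defl)
        return np.stack(V, axis=1)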

We will study a block version of Algorithm 1, presented as Algorithm 2. It is mostly a straightforward generalization (similar to how power iterations are generalized to orthogonal iterations), where the d-dimensional vectors w are replaced by d × k matrices W, and normalization is replaced by orthogonalization, W′ ↦ W′(W′^⊤W′)^{−1/2}. (This operation ensures that the result has orthonormal columns; we note that in our analysis, η is chosen sufficiently small so that W′^⊤W′ is always invertible, hence the operation is well-defined.) Indeed, Algorithm 1 is equivalent to Algorithm 2 when k = 1. The main twist in Algorithm 2 is that instead of using W̃_{s−1} as-is, we perform a unitary transformation (via an orthogonal k × k matrix B_{t−1}) which maximally aligns its columns with those of the current iterate W_{t−1}. Note that B_{t−1} is only a k × k matrix, and since k is assumed to be small, this does not introduce significant computational overhead.

  Parameters: Rank k, Step size η, epoch length m
  Input: Data matrix X = (x_1, …, x_n); Initial d × k matrix W̃_0 with orthonormal columns
  for s = 1, 2, … do
     Ũ = (1/n) ∑_{i=1}^{n} x_i (x_i^⊤ W̃_{s−1})
     W_0 = W̃_{s−1}
     for t = 1, 2, …, m do
        B_{t−1} = U V^⊤, where U S V^⊤ is an SVD decomposition of W̃_{s−1}^⊤ W_{t−1}
         Equivalent to B_{t−1} = argmin_{B : B^⊤B = I} ‖W_{t−1} − W̃_{s−1} B‖_F
        Pick i_t ∈ {1, …, n} uniformly at random
        W′_t = W_{t−1} + η ( x_{i_t} ( x_{i_t}^⊤ W_{t−1} − x_{i_t}^⊤ W̃_{s−1} B_{t−1} ) + Ũ B_{t−1} )
        W_t = W′_t (W′_t^⊤ W′_t)^{−1/2}
     end for
     W̃_s = W_m
  end for
Algorithm 2 VR-PCA: Block version

We now turn to provide a formal analysis of Algorithm 2, which directly generalizes the analysis of Algorithm 1 given in [19]:

Theorem 1.

Define the matrix A as A = (1/n) ∑_{i=1}^{n} x_i x_i^⊤, and let V_k denote the d × k matrix whose columns are the eigenvectors of A corresponding to its k largest eigenvalues. Suppose that

  • max_i ‖x_i‖² ≤ r for some r > 0.

  • A has eigenvalues s_1 ≥ s_2 ≥ … ≥ s_d, where s_k − s_{k+1} ≥ λ for some λ > 0.

  • .

Let δ ∈ (0, 1) be fixed. If we run the algorithm with any epoch length parameter m and step size η, such that

(3)

(where designate certain positive numerical constants), and for

epochs, then with probability at least

, it holds that

For any orthogonal W, this quantity lies between 0 and 1, and attains its optimal value exactly when the column spaces of W and V_k coincide (i.e., when W spans the k leading singular vectors). According to the theorem, with an appropriate choice of the step size η, epoch length m, and number of epochs T (specifically, choices for which the first and third conditions in Eq. (3) hold, which can be verified to exist), the algorithm converges with high probability to a high-accuracy approximation of the subspace spanned by V_k. Moreover, the runtime of each epoch of the algorithm is O(dkn + dk²m). Overall, we get the following corollary:

Corollary 1.

Under the conditions of Theorem 1, there exists an algorithm returning an approximation of V_k with accuracy ε and arbitrary constant success probability, in runtime O(dk(n + 1/λ²) log(1/ε)).

This runtime bound is the same as that of [19] for k = 1. (For sparse data, [19] showed that it is possible to further improve the runtime, replacing d by the average column sparsity of X. This is done by maintaining the parameters in an implicit form, but it is not clear how to implement a similar trick in the block version, where k > 1.)

The proof of Theorem 1 appears in Subsection 6.1, and relies on a careful tracking of the evolution of a suitable potential function. An important challenge compared to the k = 1 case is that the matrices W_t and W̃_{s−1} do not necessarily become closer over time, so the variance-reduction intuition discussed earlier no longer applies. However, the column spaces of W_t and W̃_{s−1} do become closer, and this is utilized by introducing the transformation matrix B_{t−1}. We note that although B_{t−1} appears essential for our analysis, it isn’t clear that using it is necessary in practice: In [19], the suggested block algorithm was Algorithm 2 with B_{t−1} fixed to the identity, which seemed to work well in experiments. In any case, using this matrix doesn’t affect the overall runtime beyond constants, since the additional runtime of computing and using it (O(dk² + k³) per iteration) is of the same order as the other computations performed at each iteration.
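To make the alignment step concrete, the orthogonal matrix can be obtained from a small k × k SVD (an orthogonal Procrustes step). The sketch below is our own illustration of this computation, consistent with the reconstruction of Algorithm 2 above, though the exact conventions in [19] may differ.

    import numpy as np

    def align(W_tilde, W):
        """Return an orthogonal k x k matrix B minimizing ||W - W_tilde @ B||_F."""
        # Orthogonal Procrustes: if W_tilde^T W = U S V^T, the minimizer is B = U V^T.
        U, _, Vt = np.linalg.svd(W_tilde.T @ W)
        return U @ Vt

Since the SVD here is of a k × k matrix, its O(k³) cost (plus the O(dk²) cost of multiplying by the resulting matrix) is of the same order as the orthogonalization step itself.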

A limitation of the theorem above is the assumption that the initial point W̃_0 is sufficiently aligned with V_k (the third assumption of the theorem). This is a non-trivial assumption, since if we initialize the algorithm from a random d × k orthogonal matrix, then with overwhelming probability the condition will not be satisfied. However, experimentally the algorithm seems to work well even with random initialization [19]. Moreover, if we are interested in a theoretical guarantee, one simple solution is to warm-start the algorithm with a purely stochastic algorithm for this problem (such as [6, 10, 4]), with runtime guarantees on obtaining such a W̃_0. The idea is that W̃_0 is only required to approximate V_k up to constant accuracy, so purely stochastic algorithms (which are good at obtaining a low-accuracy solution) are quite suitable. In the next section, we further delve into these issues, and show that in our setting such algorithms can in fact be substantially improved.

4 Warm-Start and the Power of a Power Iteration

In this section, we study the runtime required to compute a starting point satisfying the conditions of Theorem 1, starting from a random initialization. Combined with Theorem 1, this gives us an end-to-end analysis of the runtime required to find an ε-accurate solution, starting from a random point. For simplicity, we will only discuss the case k = 1, i.e. where our goal is to compute the single leading eigenvector v_1 of A, although our observations can be generalized to k > 1. In the k = 1 case, Theorem 1 kicks in once we find a vector w̃_0 satisfying ⟨w̃_0, v_1⟩ ≥ 1/√2.

As mentioned previously, one way to get such a w̃_0 is to run a purely stochastic algorithm, which computes the leading eigenvector of a covariance matrix given a stream of i.i.d. samples. We can easily use such an algorithm in our setting, by sampling columns from our matrix X uniformly at random and feeding them to such a stochastic optimization algorithm, which is then guaranteed to approximate the leading eigenvector of A.

To the best of our knowledge, the existing iteration complexity guarantees for such algorithms (assuming a unit norm constraint on the data for simplicity) scale at least linearly with d/λ². (For example, this holds for [6], although the bound only guarantees the existence of some iteration which produces the desired output; the guarantees of [4] and [10] scale differently in our setting.) Since the runtime of each iteration is O(d), we get an overall runtime on the order of d²/λ² or worse.

The dependence on d in the iteration bound stems from the fact that with a random initial unit vector w_0, we have ⟨w_0, v_1⟩² = O(1/d) with high probability. Thus, we begin with a vector almost orthogonal to the leading eigenvector (increasingly so as d grows). In a purely stochastic setting, where only noisy information is available, this necessitates conservative updates at first, and in all the analyses we are aware of, the number of iterations appears to necessarily scale at least linearly with d.
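To see where this scaling comes from (a standard calculation we add for intuition), note that if w_0 is uniformly distributed on the unit sphere in ℝ^d and v_1, …, v_d is any orthonormal basis, then

∑_{i=1}^{d} ⟨w_0, v_i⟩² = ‖w_0‖² = 1,

and by symmetry all d terms have the same expectation, so E[⟨w_0, v_1⟩²] = 1/d.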

However, it turns out that in our setting, with a finite matrix X at hand, we can perform a smarter initialization: Sample w from the standard Gaussian distribution on ℝ^d, perform a single power iteration w.r.t. the covariance matrix A, i.e. compute Aw, and initialize from w̃_0 = Aw/‖Aw‖. For such a procedure, we have the following simple observation:

Lemma 1.

For as above, it holds for any that with probability at least ,

where is the numerical rank of .

The numerical rank (see e.g. [18]), defined as nrank(A) = ‖A‖_F²/‖A‖², is a relaxation of the standard notion of rank: For any matrix A, nrank(A) is at most the rank of A (which in turn is at most the smaller of its dimensions). However, it will be small even if A is merely close to being low-rank. In many if not most machine learning applications, we are interested in matrices which tend to be approximately low-rank, in which case nrank(A) is much smaller than the dimension, or even a constant. Therefore, by a single power iteration, we get an initial point for which ⟨w̃_0, v_1⟩² is on the order of 1/nrank(A), which can be much larger than the Θ(1/d) given by a random initialization, and is never substantially worse.
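A minimal sketch of this initialization, together with the numerical rank it is compared against (the function names are ours; the definition of nrank matches the one given above):

    import numpy as np

    def numerical_rank(A):
        """nrank(A) = ||A||_F^2 / ||A||^2 (squared Frobenius norm over squared spectral norm)."""
        return np.linalg.norm(A, 'fro') ** 2 / np.linalg.norm(A, 2) ** 2

    def smart_init(X, seed=0):
        """Random Gaussian vector followed by a single exact power iteration w.r.t. (1/n) X X^T."""
        d, n = X.shape
        rng = np.random.default_rng(seed)
        g = rng.standard_normal(d)
        w0 = X @ (X.T @ g) / n          # one power iteration, O(dn) time
        return w0 / np.linalg.norm(w0)

The resulting alignment ⟨w̃_0, v_1⟩² then tends to scale with 1/nrank(A) rather than 1/d, in line with Lemma 1.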

Proof of Lemma 1.

Let s_1 ≥ s_2 ≥ … ≥ s_d be the eigenvalues of A, with corresponding eigenvectors v_1, …, v_d. We have

⟨w̃_0, v_1⟩² = ⟨Aw, v_1⟩² / ‖Aw‖² .

Since w is distributed according to a standard Gaussian distribution, which is rotationally symmetric, we can assume without loss of generality that v_1, …, v_d correspond to the standard basis vectors e_1, …, e_d, in which case the above reduces to

s_1² g_1² / ∑_{i=1}^{d} s_i² g_i² ,

where g_1, …, g_d are independent scalar random variables with a standard Gaussian distribution.

First, we note that s_1 equals ‖A‖, the spectral norm of A, whereas √(∑_{i=1}^{d} s_i²) equals ‖A‖_F, the Frobenius norm of A. Therefore, ∑_{i=1}^{d} s_i²/s_1² = nrank(A), and we get overall that

⟨w̃_0, v_1⟩² ≥ g_1² / ( g_1² + nrank(A) · max_{2 ≤ i ≤ d} g_i² ) .    (4)

We consider the random quantity , and independently bound the deviation probability of the numerator and denominator. First, for any we have

(5)

Second, by combining two standard Gaussian concentration results (namely, that if , then , and by the Cirelson-Ibragimov-Sudakov inequality, ), we get that

and therefore

(6)

Combining Eq. (5) and Eq. (6), with a union bound, we get that for any , it holds with probability at least that

To slightly simplify this for readability, we take , and substitute . This implies that with probability at least ,

Plugging back into Eq. (4), the result follows. ∎

This result can be plugged into the existing analyses of purely stochastic PCA/SVD algorithms, and can often improve the dependence on the dimension d in the iteration complexity bounds to a dependence on the numerical rank of A. We again emphasize that this is applicable in a situation where we can actually perform a power iteration, and not in a purely stochastic setting where we only have access to an i.i.d. data stream (nevertheless, it would be interesting to explore whether this idea can be utilized in such a streaming setting as well).

To give a concrete example of this, we provide a convergence analysis of the VR-PCA algorithm (Algorithm 1), starting from an arbitrary initial point, bounding the total number of stochastic iterations required by the algorithm in order to produce a point satisfying the conditions of Theorem 1 (from which point the analysis of Theorem 1 takes over). Combined with Theorem 1, this analysis also justifies that VR-PCA indeed converges starting from a random initialization.

Theorem 2.

Using the notation of Theorem 1 (where λ is the eigengap, v_1 is the leading eigenvector, and A is the covariance matrix), and for any δ ∈ (0, 1), suppose we run Algorithm 1 with some initial unit-norm vector w̃_0 such that

and a step size satisfying

(7)

(for some universal constant ). Then with probability at least , after

stochastic iterations (lines 6–10 in the pseudocode, where c′ is again a universal constant), we get a point w satisfying ⟨w, v_1⟩ ≥ 1/√2. Moreover, if η is chosen on the same order as the upper bound in Eq. (7), then

Note that the analysis does not depend on the choice of the epoch size m, and does not use the special structure of VR-PCA; in fact, the technique we use is applicable to any algorithm which takes stochastic gradient steps to solve this type of problem. (Although there exist previous analyses of such algorithms in the literature, they unfortunately do not quite apply to our algorithm, for various technical reasons.) The proof of the theorem appears in Section 6.2.

Treating the remaining problem parameters as constants, we get that the runtime required by VR-PCA to find a point w̃ such that ⟨w̃, v_1⟩ ≥ 1/√2 scales with d/λ² and inversely with a lower bound on ⟨w̃_0, v_1⟩². As discussed earlier, if w̃_0 is the result of random initialization followed by a power iteration (requiring O(dn) time), and the covariance matrix has small numerical rank, then ⟨w̃_0, v_1⟩² is lower bounded by a constant (up to logarithmic factors), and the runtime is on the order of

d ( n + 1/λ² ),

up to logarithmic factors. By Corollary 1, the runtime required by VR-PCA from that point to get an ε-accurate solution is

d ( n + 1/λ² ) log(1/ε),

so the sum of the two expressions (which is O(d(n + 1/λ²) log(1/ε)) up to log-factors) represents the total runtime required by the algorithm.

Finally, we note that this bound holds under the reasonable assumption that the numerical rank of A is constant. If this assumption doesn’t hold, the numerical rank can be as large as the dimension d, and the resulting bound will have a worse polynomial dependence on d. We suspect that this is due to a looseness in the dependence on the initial alignment ⟨w̃_0, v_1⟩ in Theorem 2, since better dependencies can be obtained, at least for slightly different algorithmic approaches (e.g. [4, 10, 6]). We leave a sharpening of the bound in this respect as an open problem.

5 Convexity and Non-Convexity of the Rayleigh Quotient

As mentioned in the introduction, an intriguing open question is whether the runtime guarantees from the previous sections can be further improved. Although a linear dependence on the data size seems unavoidable, this is not the case for the quadratic dependence on 1/λ. Indeed, when using deterministic methods such as power iterations or the Lanczos method, the dependence on the eigengap in the runtime is only 1/λ, or even 1/√λ [15]. In the world of convex optimization, from which our algorithmic techniques are derived, the analog of λ is the strong convexity parameter of the function, and again, it is possible to get a dependence of 1/λ, or even 1/√λ with accelerated schemes (see e.g. [13, 16, 7] in the context of the variance-reduction technique we use). Is it possible to get such a dependence for our problem as well?

Another question is whether the non-convex problem that we are tackling (Eq. (1)) is really that non-convex. Clearly, it has a nice structure (since we can solve the problem in polynomial time), but perhaps it actually has hidden convexity properties, at least close enough to the optimal points? We note that Eq. (1) can be “trivially” convexified by re-casting it as an equivalent semidefinite program [5]. However, that would require optimizing over d × d matrices, leading to poor runtime and memory requirements. The question here is whether we have any convexity with respect to the original optimization problem over “thin” d × k matrices.

In fact, the two questions of improved runtime and convexity are closely related: If we can show that the optimization problem is convex in some domain containing an optimal point, then we may be able to use fast stochastic algorithms designed for convex optimization problems, inheriting their good guarantees.

To discuss these questions, we will focus on the case k = 1 for simplicity (i.e., our goal is to find a leading eigenvector of the matrix A), and study potential convexity properties of the negative Rayleigh quotient,

F_A(w) = − (w^⊤ A w) / ‖w‖² .

Note that for k = 1, this function coincides with Eq. (1) on the unit Euclidean sphere, with the same optimal points, but has the nice property of being defined on the entire Euclidean space (thus, at least its domain is convex).
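For later reference, we record the gradient and Hessian of this function; this short derivation is added here for completeness and follows by direct differentiation of the definition above:

∇F_A(w) = −(2/‖w‖²) ( A w + F_A(w) w ),

∇²F_A(w) = −(2/‖w‖²) A + (4/‖w‖⁴) ( A w w^⊤ + w w^⊤ A ) + (2 w^⊤ A w / ‖w‖⁴) I − (8 w^⊤ A w / ‖w‖⁶) w w^⊤ .

In particular, at a unit-norm eigenvector v of A with eigenvalue s, the Hessian simplifies to 2(sI − A); at the leading eigenvector this is positive semidefinite, which is consistent with Theorem 3 below, since the eigenvector directions form a measure-zero set.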

At first glance, such functions appear to potentially be convex at some bounded distance from an optimum, as illustrated in the two-dimensional example of Figure 1. Unfortunately, it turns out that the figure is misleading, and in fact the function is not convex almost everywhere:

Figure 1: The function F_A in a two-dimensional example. It is invariant to re-scaling of its argument, and attains a minimum at any re-scaling of the leading eigenvector.

Figure 2: Illustration of the construction of the convex set on which F_A is strongly convex and smooth. v_1 is the leading eigenvector of A, and a minimum of F_A (as well as any re-scaling of v_1); w̃ is a nearby unit vector, and we consider the intersection of a hyperplane orthogonal to w̃ and a Euclidean ball centered at w̃.
Theorem 3.

For the matrix used in Figure 1, the Hessian of F_A fails to be positive semidefinite at all points outside a measure-zero set.

Proof.

The leading eigenvector of is , and . The Hessian of this function at some equals

The determinant of this matrix equals

which is always non-positive, and strictly negative outside a measure-zero set of points. Since the determinant of a positive semidefinite matrix is always non-negative, this implies that the Hessian isn’t positive semidefinite at any such point. ∎
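To illustrate the computation on a concrete instance (the specific matrix below is our illustrative choice, and need not coincide with the one used in Figure 1), take A = diag(1, 1 − λ) for some λ ∈ (0, 1), so that the leading eigenvector is (1, 0) and

F_A(w) = −( w_1² + (1 − λ) w_2² ) / ( w_1² + w_2² ) = −1 + λ w_2² / ‖w‖² .

A direct calculation using the Hessian formula recorded earlier in this section then gives

det ∇²F_A(w) = −4 λ² w_1² w_2² / ‖w‖⁸ ,

which is strictly negative whenever w_1 w_2 ≠ 0, i.e., everywhere outside the two coordinate axes (a measure-zero set), mirroring the argument in the proof above.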

The theorem implies that we indeed cannot use convex optimization tools as-is on the function F_A, even if we’re close to an optimum. However, the non-convexity was shown for F_A as a function over the entire Euclidean space, so the result does not preclude the possibility of having convexity on a more constrained, lower-dimensional set. In fact, this is what we are going to do next: We will show that if we are given some point w̃ close enough to an optimum, then we can explicitly construct a simple convex set, such that

  • The set includes an optimal point of F_A.

  • The function F_A is smooth and strongly convex on that set, with strong convexity parameter on the order of the eigengap λ.

This means that we can potentially use a two-stage approach: First, we use some existing algorithm (such as VR-PCA) to find such a point w̃, and then switch to a convex optimization algorithm designed to handle functions with a finite sum structure such as that of F_A. Since the runtime of such algorithms scales better than that of VR-PCA in terms of the dependence on λ, we can hope for an overall runtime improvement.

Unfortunately, this has a catch: To make it work, we need w̃ to be very close to the optimum – in fact, we require the distance to scale polynomially with the eigengap λ, and we show (in Theorem 5) that such a dependence on the eigengap cannot be avoided (perhaps up to a small polynomial factor). The issue is that the runtime needed to get such a w̃, using the stochastic-based approaches we are aware of, would scale at least quadratically with 1/λ, but getting a dependence better than quadratic was our problem to begin with. For example, the runtime guarantee for using VR-PCA to get such a point (even if we start from a good point as specified in Theorem 1) is on the order of

d ( n + 1/λ² ) log(1/λ),

whereas the best known guarantee on getting an ε-optimal solution for λ-strongly convex and smooth functions with a finite sum structure (see [1]) is on the order of

d ( n + √(n/λ) ) log(1/ε).

Therefore, the total runtime we can hope for would be on the order of

d ( n + 1/λ² ) log(1/λ) + d ( n + √(n/λ) ) log(1/ε).    (8)

In comparison, the runtime guarantee of using just VR-PCA to get an ε-accurate solution is on the order of

d ( n + 1/λ² ) log(1/ε).    (9)

Unfortunately, Eq. (9) is the same as Eq. (8) up to log-factors, and the difference is not significant unless the required accuracy ε is extremely small (exponentially small in the other problem parameters). Therefore, our construction is mostly of theoretical interest. However, it still shows that asymptotically, as ε → 0, it is indeed possible to have runtime scaling better than Eq. (9). This might hint that designing practical algorithms, with better runtime guarantees for our problem, may indeed be possible.

To explain our construction, we need to consider two convex sets: Given a unit vector w̃, define the hyperplane tangent to the unit sphere at w̃,

H_{w̃} = { w ∈ ℝ^d : ⟨w, w̃⟩ = 1 },

as well as a Euclidean ball of radius r centered at w̃:

B_r(w̃) = { w ∈ ℝ^d : ‖w − w̃‖ ≤ r }.

The convex set we use, given such a w̃, is simply the intersection of the two, H_{w̃} ∩ B_r(w̃), where r is a sufficiently small number (see Figure 2).
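Since the ball is centered at a point that lies on the hyperplane, Euclidean projection onto this intersection has a simple closed form: project onto the hyperplane first, then shrink toward w̃ if the result falls outside the ball. The sketch below is our own illustration of this projection (names and conventions are ours); it is the main primitive a projected convex-optimization method would need on this set.

    import numpy as np

    def project(w, w_tilde, r):
        """Project w onto {u : <u, w_tilde> = 1} intersected with {u : ||u - w_tilde|| <= r}.
        Assumes w_tilde is unit-norm, so that w_tilde itself lies on the hyperplane."""
        # Projection onto the affine hyperplane {u : <u, w_tilde> = 1}.
        u = w - (np.dot(w, w_tilde) - 1.0) * w_tilde
        # The ball's center w_tilde lies in the hyperplane, so shrinking toward it stays feasible.
        diff = u - w_tilde
        dist = np.linalg.norm(diff)
        if dist > r:
            u = w_tilde + (r / dist) * diff
        return u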

The following theorem shows that if w̃ is sufficiently close to an optimal point (a leading eigenvector of A), and we choose the radius r appropriately, then the set contains an optimal point, and the function is indeed strongly convex and smooth on that set. For simplicity, we will assume that A is scaled to have a spectral norm of 1, but the result can be easily generalized.

Theorem 4.

For any positive semidefinite A with spectral norm 1, eigengap λ, and leading eigenvector v_1, and any unit vector w̃ sufficiently close to v_1 (at a distance bounded by a polynomial function of λ), the function F_A is smooth and strongly convex on the convex set H_{w̃} ∩ B_r(w̃) (for an appropriate radius r), which contains a global optimum of F_A.

The proof of the theorem appears in Subsection 6.3. Finally, we show below that a polynomial dependence on the eigengap is unavoidable, in the sense that the convexity property is lost if w̃ is significantly further away from v_1.

Theorem 5.

For any sufficiently small λ > 0, there exists a positive semidefinite matrix with spectral norm 1, eigengap λ, and leading eigenvector v_1, as well as a unit vector w̃ whose distance from v_1 is only polynomially (in λ) larger than that required by Theorem 4, such that F_A is not convex in any neighborhood of w̃ on H_{w̃}.

Proof.

Let

for which , and take

where (which ensures ). Consider the ray , and note that it starts from and lies in . The function along that ray (considering it as a function of ) is of the form

The second derivative with respect to equals

where we plugged in the definition of w̃. This is a negative quantity for any t in the relevant range. Hence the function is strictly concave (and in particular not convex) along the ray we have defined, close enough to w̃, and therefore isn’t convex in any neighborhood of w̃ on H_{w̃}. ∎

6 Proofs

6.1 Proof of Theorem 1

Although the proof structure generally mimics the proof of Theorem 1 in [19] for the k = 1 special case, it is more intricate and requires several new technical tools. To streamline the presentation, we begin by proving a series of auxiliary lemmas in Subsection 6.1.1, and then move to the main proof. The main proof is divided into several steps, each constituting one or more lemmas.

Throughout the proof, we use several well-known facts that hold for all matrices of suitable dimensions, relating the trace, the Frobenius norm, and the spectral norm. Moreover, since the trace is a linear operation, E[Tr(M)] = Tr(E[M]) for a random matrix M.

6.1.1 Auxiliary Lemmas

Lemma 2.

For any , it holds that and .

Proof.

It is enough to prove that for any positive semidefinite matrices , it holds that . The lemma follows by taking either (in which case, ), or (in which case, ).

Any positive semidefinite matrix can be written as the product of its symmetric matrix square root with itself. Therefore,

Lemma 3.

If and , then

where

is the identity matrix.

Proof.

We begin by proving the one-dimensional case, where are scalars . The inequality then becomes , which is equivalent to , or upon rearranging, , which trivially holds.

Turning to the general case, we note that by Lemma 2, it is enough to prove that . To prove this, we make a couple of observations. The positive definite matrix

(like any positive definite matrix) has a singular value decomposition which can be written as

, where is an orthogonal matrix, and is a diagonal matrix with positive entries. Its inverse is , and . Therefore,

To show this matrix is positive semidefinite, it is enough to show that each diagonal entry of