There are a variety of problems in statistics and machine learning that require estimating a matrix that is assumed—or desired—to be low-rank. For high-dimensional problems, the low-rank property is useful as a form of regularization, and it can also lead to more interpretable results in scientific settings. Low-rank matrix estimation can be formulated as a nonconvex optimization problem involving a cost function, measuring the fit to the data, along with a rank constraint. Even when the cost function is convex—such as in the ubiquitous case of least-squares fitting—solving a rank-constrained problem can be computationally difficult, with many interesting special cases known to have NP-hard complexity in the worst-case setting. However, statistical settings lead naturally to random ensembles, in which context such complexity concerns have been assuaged to some extent by the use of semidefinite programming (SDP) relaxations. These SDP relaxations are based on replacing the nonconvex rank constraint with a convex constraint based on the trace/nuclear norm. For many statistical ensembles of problems, among them multivariate regression, matrix completion and matrix decomposition, such SDP relaxations have been shown to have near-optimal performance (e.g., see the papers [candes2009exact, recht2010guaranteed, negahban2010noisymc, negahban2009estimation, chen2011LSarxiv, koltchinskii2011matrixCompletion] and references therein). Although in theory, any SDP can be solved to accuracy in polynomial time [NesNem87], the associated computational cost is often too high in practice. Letting denote the dimension of the matrix, it can be as high as using standard interior point methods [Boyd02, NesNem87]; such a scaling is not practical for many real-world applications involving high-dimensional matrices.
More recent work has developed algorithms that are specifically tailored to certain classes of SDPs; however, even such specialized algorithms require at least time, since solving the SDP involves optimizing over the space of matrices.
In practice, researchers often resort instead to heuristic methods that directly optimize over the space of low-rank matrices, using iterative algorithms such as alternating minimization, power iteration, expectation maximization (EM) and projected gradient descent. Letting denote the rank, these factorized optimization problems live in an -dimensional space, as opposed to the space of the original problem. Such heuristic methods are quite effective in practice for some problems, but can also suffer from local optima. These intriguing phenomena motivate a recent and evolving line of work on understanding such iterative methods in the low-rank space. As we discuss in detail below, recent work has studied some of these algorithms in a number of specific settings. A natural question then arises: is there a general theory for understanding when low-rank iterative methods will succeed?
In this paper, we make progress on this general question by focusing on projected gradient descent in the low-rank space. We characterize a general set of conditions that govern the computational and statistical properties of the solutions, and then specialize this general theory to obtain corollaries for a broad range of problems. In more detail, suppose that we write a rank- matrix in its factorized form , where , and consider projected gradient descent methods in the variable . The matrix quadratic form makes the problem inherently nonconvex, and in many cases, the problem is not even locally convex. Nevertheless, our theory shows that given a suitable initialization, projected gradient descent converges geometrically to a statistically useful solution, under conditions that are much more general than convexity. Our results are applicable even when the initial solution is outside any region of local convexity, or when the problem is globally concave. Each iteration of projected gradient descent typically takes time that is linear in , the degrees of freedom of a low-rank matrix, as well as in the input size. Therefore, by directly enforcing low-rankness, our method simultaneously achieves two goals: we not only attain statistical consistency in the high-dimensional regime, but also gain computational advantages over convex relaxation methods that lift the problem to the space of matrices.
For this approach to be relevant, an equally important question is when the above conditions for convergence are satisfied. We verify these conditions for a broad range of statistical and machine learning problems, including matrix sensing, matrix completion in both its standard and one-bit forms, sparse principal component analysis (SPCA), graph clustering, and matrix decomposition or robust PCA. For each of these problems, we show that a suitable initialization can be obtained efficiently using simple methods, and that the projected gradient descent approach has sample complexity and statistical error bounds that are comparable to (and sometimes better than) the best existing results, which are often achieved by convex relaxation methods. Notably, our approach does not require using fresh samples in each iteration—a heuristic known as sample splitting that is often used to simplify analysis—nor does it involve the computation of multiple singular value decompositions (SVDs).
Let us now put our contributions in a broader historical context. The seminal work in [burer2005LRSDP] studies the problem of obtaining low-rank solutions to SDPs using gradient descent on the factor space. Several subsequent papers aim to obtain rigorous guarantees for nonconvex gradient descent focused on specific classes of matrix estimation problems. For instance, the recent papers [zheng2015convergent, tu2015procrustes] study exact recovery in the setting of noiseless matrix sensing (i.e., solving linear matrix equalities with random designs). Focusing on the rank-one setting, De et al. [de2014global, de2015wild] study the noiseless matrix completion problem, and a stochastic version of nonconvex gradient descent; they prove global convergence with constant success probability, assuming independence between the samples used in each iteration. The recent manuscript [sun2014mc] studies several variants of nonconvex gradient descent algorithms, again for noiseless matrix completion. Another line of work [candes2014wirtinger, chen2015quadratic] considers the phase retrieval problem, which can be reformulated as recovering a rank-one () matrix from random quadratic measurements. The regularity conditions imposed in this work bear some similarity with our conditions, but their validation requires a very different analysis. An attractive feature of phase retrieval is that it is known to be locally convex around the global optimum under certain settings [SoltanolkotabiThesis, white2015local_convex].
The work in this paper develops a unified framework for analyzing the behavior of projected gradient descent in application to low-rank estimation problems, covering many of the models described above as well as various others. Our theory applies to matrices of arbitrary rank , and is framed in the statistical setting of noisy observations, allowing for noiseless observations as a special case. When specialized to particular models, our framework yields a variety of corollaries providing guarantees for concrete statistical models that have not been studied in the work above. Notably, our general conditions do not depend on local convexity, and thus can be applied to models such as sparse PCA and clustering in which no form of local convexity holds. (In fact, our results apply even when the loss function is globally concave.) In addition, we impose only a natural gradient smoothness condition that is much less restrictive than the vanishing gradient condition imposed in other work. Thus, one of the main contributions of this paper is to illuminate the weakest known conditions under which nonconvex gradient descent can succeed, and to allow for applications to several problems that lack local convexity and vanishing gradients.
It is also worth noting that other types of algorithms for nonconvex problems have been analyzed, including alternating minimization [jain2013low, hardt2014understanding, hardt2014fast], EM algorithms [balakrishnan2014EM, wang2014HighDimEM], power methods [hardt2014power], various hard-thresholding and singular value projection methods [jain2010svp, netrapalli2014nonconvexRPCA, jain2014hard, bhatia2015rob_regression], gradient descent for nonconvex regression and spectrally sparse recovery problems [loh2013regularized, wang2013sparseNonConvex, cai2015hankel], as well as gradient descent on Grassmannian manifolds [keshavan2009matrixafew, zhang2015grassmannian]. Finally, there is a large body of work on convex-optimization-based approaches to the concrete examples considered in this paper. We compare our statistical guarantees with results of these types after the statement of each of our corollaries.
The -th row and -th column of a matrix are denoted by and , respectively. The spectral norm is the largest singular value of . The nuclear norm is the sum of the singular values of . For parameters and a matrix , the norm of is —that is, the norm of the vector of the norms of the rows. Special cases include the Frobenius norm , the elementwise norm and the elementwise norm . For a convex set , we use to denote the Euclidean projection onto .
We begin by setting up the class of matrix estimators to be studied in this paper, and then providing various concrete examples of specific models to which our general theory applies.
2.1 Matrix estimators in the factorized formulation
Letting denote the space of all symmetric -dimensional matrices, this paper focuses on a class of matrix estimators that take the following general form. For a given sample size , let be a cost function. It is a random function, since it depends (implicitly in our notation) on the observed data, and the function value provides some measure of fit of the matrix to the given data. For a given convex set , we then consider a minimization problem of the form
The goal of solving this optimization problem is to estimate some unknown target matrix . Typically, the target matrix is a (near)-minimizer of the population version of the program—that is, a solution to the same constrained minimization problem with replaced by its expectation . However, our theory does not require that minimizes this quantity, nor that the gradient vanish.
In many cases, the matrix either has low rank, or can be well-approximated by a matrix of low rank. Concretely, if the target matrix has rank , then it can be written in the outer product form for some other matrix with orthogonal columns. This factorized representation motivates us to consider the function , and the factorized formulation
where is some convex set that contains , and for which the set acts as a surrogate for . Note that due to the factorized representation of the low-rank matrix, this factorized program is (in general) nonconvex, and is typically so even if the original program (1) is convex.
Nonetheless, we can apply a projected gradient descent method in order to compute an approximate minimizer. For this particular problem, the projected gradient descent updates take the form
where is a step size parameter, denotes the Euclidean projection onto the set , and the gradient is given by . (This gradient takes the simpler form whenever is symmetric, which is the case in the concrete examples that we treat.) The main goal of this paper is to provide a general set of sufficient conditions under which—up to a statistical tolerance term —the sequence converges to some such that .
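In a minimal form, the update above can be sketched as the following generic template (the loss gradient, projection operator, step size, and all names here are our own illustrative choices, not the paper's notation):

```python
import numpy as np

def projected_gradient_descent(grad_L, project, F0, eta, num_iters):
    """Generic projected gradient descent on a factor matrix F.

    grad_L  -- callable returning the gradient of the factorized loss at F
    project -- Euclidean projection onto the (convex) constraint set
    eta     -- step size
    """
    F = F0.copy()
    for _ in range(num_iters):
        F = project(F - eta * grad_L(F))
    return F

# Toy illustration: recover Mstar = Fstar Fstar^T from the fully observed
# matrix, i.e. L(F) = (1/2)||F F^T - Mstar||_F^2, with no constraint
# (identity projection).
Mstar = np.diag([1.0, 0.0])
grad = lambda F: 2.0 * (F @ F.T - Mstar) @ F
F0 = np.array([[0.8], [0.1]])
F_hat = projected_gradient_descent(grad, lambda F: F, F0, eta=0.05, num_iters=500)
```

Despite the nonconvexity in the factor variable, this toy instance converges from any initialization close enough to the truth, which is exactly the regime the theory addresses.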
A significant challenge in the analysis is the fact that there are many possible factorizations of the form . In order to address this issue, it is convenient to define an equivalence class of valid solutions as follows
For the applications of interest here, the underlying goal is to obtain a good estimate of any matrix in the set . In particular, such an estimate implies a good estimate of itself as well as the column space and singular values of all the members of the class . Accordingly, we define the pseudometric
Note that all matrices have the same singular values, so that we may write the singular values as well as and without any ambiguity. In fact, this invariance property holds more generally for any function of the sorted singular values and column space of (e.g., any unitarily invariant norm).
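A pseudometric of this kind, measuring distance between factor matrices up to rotation, can be computed in closed form via the orthogonal Procrustes problem; a sketch with our own naming, assuming the distance minimizes the Frobenius error over orthogonal rotations of the factors:

```python
import numpy as np

def factor_distance(F, Fstar):
    """Distance between factor matrices modulo rotation:
    min over orthogonal O of ||F - Fstar @ O||_F.

    By the orthogonal Procrustes solution, the minimizing O is U @ Vt,
    where U, Vt come from the SVD of Fstar^T F.
    """
    U, _, Vt = np.linalg.svd(Fstar.T @ F)
    return np.linalg.norm(F - Fstar @ (U @ Vt))
```

Any two factorizations of the same matrix differ by such a rotation, so this distance vanishes on the whole equivalence class.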
2.2 Illustrative examples
Let us now consider a few specific models to illustrate the general set-up from the previous section. We return to demonstrate consequences of our general theory for these (and other) models in Section 4.
2.2.1 Matrix regression
We begin with a simple example, namely one in which we make noisy observations of linear projections of an unknown low-rank matrix . In particular, suppose that we are given i.i.d. observations of the form
and is some i.i.d. sequence of zero-mean noise variables. The paper [negahban2009estimation] provides various examples of such matrix regression problems, depending on the particular choice of the regression matrices .
Without considering computational complexity, a reasonable estimate of would be based on minimizing the least-squares cost
subject to a rank constraint. However, this problem is computationally intractable in general due to the nonconvexity of the rank function. A standard convex relaxation is based on the nuclear norm , corresponding to the sum of the singular values of the matrix. In the symmetric PSD case, it is equivalent to the trace of the matrix. Using the nuclear norm as regularizer leads to the estimator
Suppose that the noise variables are i.i.d. zero-mean with variance, and the regression matrices are also i.i.d., zero-mean and such that for any matrix . Under these conditions, an easy calculation yields that the population cost function is given by . For this particular case, note that is the unique minimizer of the population cost.
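As an illustration of the factorized least-squares cost and its gradient in this model, consider the following simulation sketch (the symmetric Gaussian design, the scaling, and all variable names are our own assumed choices, not prescribed by the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n, sigma = 8, 2, 500, 0.1

# Ground truth M* = F* F*^T and noisy linear observations y_i = <A_i, M*> + eps_i
Fstar = rng.standard_normal((d, r)) / np.sqrt(d)
Mstar = Fstar @ Fstar.T
A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2.0   # symmetrize the design matrices
y = np.einsum('nij,ij->n', A, Mstar) + sigma * rng.standard_normal(n)

def loss(F):
    """Empirical least-squares cost (1/2n) sum_i (<A_i, F F^T> - y_i)^2."""
    resid = np.einsum('nij,ij->n', A, F @ F.T) - y
    return 0.5 * np.mean(resid**2)

def grad(F):
    """Gradient in the factor F; since each A_i is symmetric, the derivative
    of <A_i, F F^T> with respect to F is 2 A_i F."""
    resid = np.einsum('nij,ij->n', A, F @ F.T) - y
    return (2.0 / n) * np.einsum('n,nij->ij', resid, A) @ F
```

Running plain gradient steps from a point near the truth decreases this cost, even though the cost is nonconvex in the factor variable.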
Projected gradient descent:
2.2.2 Rank- PCA with row sparsity
Principal component analysis is a widely used method for dimensionality reduction. For high-dimensional problems in which , it is well-known that classical PCA is inconsistent [JohLu09]. Moreover, minimax lower bounds show that consistent eigen-estimation is impossible in the absence of structure in the eigenvectors. Accordingly, a recent line of work (e.g., [JohLu09, AmiWai09, cai2013adaptive, berthet2013lowerSparsePCA, vu2013minimax, BirJohNadPau12]) has studied different forms of PCA with structured eigenvectors.
Here we consider one such form of structured PCA, namely a rank model with row-wise sparsity. For a given signal-to-noise ratio and an orthonormal matrix , consider a covariance matrix of the form
By construction, the columns of span the top rank- eigenspace of with the corresponding maximal eigenvalues. In the row-sparse version of this model [vu2013minimax], this leading eigenspace is assumed to be supported on coordinates—that is, the matrix has at most non-zero rows. Given i.i.d. samples from the Gaussian distribution, the goal of sparse PCA is to estimate the sparse eigenspace spanned by .
A natural estimator is based on a semidefinite program, referred to as the Fantope relaxation in the paper [vu2013fantope], given by
where is the empirical covariance matrix, and is a radius to be chosen. This is a special case of our general set-up with and
Since , the population cost function is given by
Thus, by construction, for any radius , the matrix is the unique minimizer of the population version of the problem (10), subject to the constraint .
Projected gradient descent:
For a radius to be chosen, we consider a factorized version of the SDP
where we recall that . This norm is the appropriate choice for selecting matrices with sparse rows, as assumed in our initial set-up. We return in Section 4.3 to analyze the projected gradient updates (3) applied to pair in equation (11).
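One way to implement the projection onto such a row-wise norm ball is as follows; here we assume the constraint set bounds the sum of row-wise norms by a radius R, in which case the Euclidean projection reduces to projecting the vector of row norms onto an ℓ1-ball and rescaling each row (the function names and this particular reduction are our own sketch):

```python
import numpy as np

def project_l1(v, R):
    """Euclidean projection of a nonnegative vector onto the l1-ball of radius R
    (sort-and-threshold algorithm)."""
    if v.sum() <= R:
        return v
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - R) / np.arange(1, len(u) + 1) > 0)[0][-1]
    tau = (css[rho] - R) / (rho + 1)
    return np.maximum(v - tau, 0.0)

def project_rowsparse(F, R):
    """Projection onto {F : sum of row-wise l2 norms <= R}: project the row
    norms onto the l1-ball, then rescale each row, keeping its direction."""
    norms = np.linalg.norm(F, axis=1)
    new_norms = project_l1(norms, R)
    scale = np.divide(new_norms, norms, out=np.zeros_like(norms), where=norms > 0)
    return F * scale[:, None]
```

The group-soft-thresholding effect is visible directly: rows with small norm are zeroed out, which is what encourages row sparsity of the iterates.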
As a side comment, this example illustrates that our theory does not depend on local convexity of the function . In this case, even though the original function is convex (in fact, linear) in the matrix , observe that the function from equation (11) is never locally convex in the low-rank matrix ; in fact, since is positive semidefinite, it is a globally concave function.
2.2.3 Low-rank and sparse matrix decomposition
There are various applications in which it is natural to model an unknown matrix as the sum of two matrices, one of which is low-rank and the other of which is sparse. Concretely, suppose that we make observations of the form where is low-rank, the matrix is symmetric and elementwise-sparse, and is a symmetric matrix of noise variables. Many problems can be cast in this form, including robust forms of PCA, factor analysis, and Gaussian graphical model estimation; see the papers [candes2009robustPCA, agarwal2012decomposition, chen2011LSarxiv, chandrasekaran2011siam, Hsu2010RobustDecomposition] and references therein for further details on these and other applications.
Letting denote the column of a matrix , define the set of matrices , where are user-defined radii. Using the nuclear norm and norm as surrogates for rank and sparsity respectively, a popular convex relaxation approach is based on the SDP
This is a special case of our general estimator with , and the constraint set .
In this case, the population function is given by
where the expectation is over the random noise matrix . In general, we are not guaranteed that is the unique minimizer of this objective, but our analysis shows that (under suitable conditions) it is a near-minimizer, and this is adequate for our theory.
Projected gradient descent:
In this paper, we analyze a version of gradient descent that operates on the pair given by
Here is the initialization of the algorithm, and the parameter controls the matrix incoherence. See Sections 4.1 and 4.6 for discussion of matrix incoherence parameters, and their necessity in such problems. The gradient of takes the form
where denotes projection onto the constraint set . This projection is easy to carry out, as it simply involves a soft-thresholding of the columns of the matrix. Likewise, the projection onto the set from equation (12) is easy to carry out. We return to analyze these projected gradient updates in
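A sketch of the column-wise soft-thresholding projection described above, under the assumption that the sparse component is constrained column-by-column by an ℓ1 radius alpha (the radius, the per-column form of the constraint, and all names are our own illustrative choices):

```python
import numpy as np

def project_col_l1(S, alpha):
    """Project each column of S onto the l1-ball of radius alpha.

    For each column whose l1 norm exceeds alpha, the projection is a
    soft-thresholding with a data-dependent threshold tau."""
    out = S.copy()
    for j in range(S.shape[1]):
        col = S[:, j]
        if np.abs(col).sum() > alpha:
            u = np.sort(np.abs(col))[::-1]
            css = np.cumsum(u)
            rho = np.nonzero(u - (css - alpha) / np.arange(1, len(u) + 1) > 0)[0][-1]
            tau = (css[rho] - alpha) / (rho + 1)
            out[:, j] = np.sign(col) * np.maximum(np.abs(col) - tau, 0.0)
    return out
```

Each column is processed independently, so the cost is linear in the number of entries up to a per-column sort.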
3 Main results
In this section, we turn to the set-up and statement of our main results on the convergence properties of projected gradient descent for low-rank factorizations. We begin in Section 3.1 by stating the conditions on the function and that underlie our analysis. In Section 3.2, we state a result (Theorem 1) that guarantees sublinear convergence, whereas Section 3.3 is devoted to a result (Theorem 2) that guarantees faster linear convergence under slightly stronger assumptions. In Section 4 to follow, we derive various corollaries of these theorems for different concrete versions of low-rank estimation.
Given a radius , we define the ball . At a high level, our goal is to provide conditions under which the projected gradient sequence converges to some multiple of the ball , where is a statistical tolerance.
3.1 Conditions on the pair
Recall the definition of the set of equivalent orthogonal factorizations of a given matrix . We begin with a condition on that guarantees that it respects the structure of this set.
-faithfulness of :
For a radius , the constraint set is said to be -faithful if for each matrix , we are guaranteed that
Of course, this condition is implied by the inclusion . The -faithfulness condition is natural for our setting, as our goal is to estimate the eigenstructure of , and the set should therefore represent prior knowledge of this structure and be independent of any specific factorization of .
Local descent condition:
Our next condition provides a guarantee on the cost improvement that can be obtained by taking a gradient step when starting from any matrix that is “sufficiently” far away from the set .
Definition 1 (Local descent condition).
For a given radius , curvature parameter and statistical tolerance , a cost function satisfies a local descent condition with parameters over if for each , there is some such that
In order to gain intuition for this condition, note that by a first-order Taylor series expansion, we have , so that this inner product measures the potential gains afforded by taking a gradient step. Now consider some matrix such that , so that its distance from is larger than the statistical precision. The lower bound (14) with then implies that
which guarantees a quadratic descent condition. Note that the condition (14) actually allows for additional freedom in the choice of so as to accommodate the non-uniqueness of the factorization.
One way to establish a bound of the form (14) is by requiring that be locally strongly convex, and that the gradient approximately vanishes. In particular, suppose is -strongly convex over the set , in the sense that
If we assume that , then some simple algebra yields that the lower bound (14) holds.
However, it is essential to note that our theory covers several examples in which a lower bound (14) of the form holds, even though fails to be locally convex, and/or the gradient does not approximately vanish. (We note that the vanishing gradient condition is needed in all existing work on nonconvex gradient descent [candes2014wirtinger, sun2014mc, zheng2015convergent].) Examples include the problem of sparse PCA, previously introduced in Section 2.2.2; in this case, the function is actually globally concave, but nonetheless our analysis in Section 4.3 shows that a local descent condition of the form (14) holds. Similarly, for the planted clustering model studied in Section 4.4, the same form of global concavity holds. In addition, for the matrix regression problem previously introduced in Section 2.2.1, we prove in Section 4.2 that the condition (14) holds over a set over which is nonconvex. The generality of our condition (14) is essential to accommodate these and other examples.
Local Lipschitz condition:
Our next requirement is a straightforward local Lipschitz property:
Definition 2 (Local Lipschitz).
The loss function is locally Lipschitz in the sense that
for all .
Our last condition is not required to establish convergence of projected gradient descent, but rather to guarantee a faster geometric rate of convergence. It is a condition—complementary to the local descent condition—that upper bounds the behavior of the gradient map .
Definition 3 (Local smoothness).
For some curvature and smoothness parameters and , statistical tolerance and radius , we say that the loss function satisfies a local smoothness condition with parameters over if for each and ,
The above conditions are stated in terms of the loss function for the factor matrix . Alternatively, one may restate these conditions in terms of the loss function on the original space, and we make use of this type of reformulation in parts of our proofs. For instance, see Section 6 for details.
3.2 Sublinear convergence under Lipschitz condition
With our basic conditions in place, we are now ready to state our first main result. It guarantees a sublinear rate of convergence under the -faithfulness, local descent, and local Lipschitz conditions.
More precisely, for some descent and Lipschitz parameters , a statistical tolerance , and a constant , suppose that , the cost function satisfies the local descent and Lipschitz conditions (Definitions 1 and 2) with parameters and , and the constraint set is -faithful and convex. Let be the condition number of . We then have the following guarantee:
Under the previously stated conditions, given any initial point belonging to the set , the projected gradient iterates with step size satisfy the bound
See Section 5.1 for the proof of this claim.
As a minor remark, we note that the assumption entails no loss of generality—if it fails to hold, then the initial solution already satisfies an error bound better than what is guaranteed for subsequent iterates.
Conceptually, Theorem 1 provides a minimal set of conditions for the convergence of projected gradient descent using the nonconvex factorization . The first term on the right hand side of equation (17) corresponds to the optimization error, whereas the second term is the statistical error. The bound (17) shows that the distance between and drops at the rate up to the statistical limit that is determined by the sample size and the signal-to-noise ratio (SNR) of the problem. We see concrete instances of this statistical error in the examples to follow.
3.3 Linear convergence under smoothness condition
Although Theorem 1 does guarantee convergence, the resulting rate is sublinear (), and hence rather slow. In this section, we show that if in addition to the local Lipschitz and descent conditions, the function satisfies the local smoothness conditions in Definition 3, then much faster convergence can be guaranteed.
More precisely, suppose that for some numbers , , , and with , and , the loss function satisfies the local descent, Lipschitz and smoothness conditions in Definitions 1–3 over with parameters , , , and , and that the set is -faithful and convex.
Under the previously stated conditions, there is a constant depending only on such that given an initial matrix in the set , the projected gradient iterates with step size satisfy the bound
See Section 5.2 for the proof of this claim.
The right hand side of the bound (18) again consists of an optimization error term and a statistical error term. The theorem guarantees that the optimization error converges linearly at the geometric rate up to a statistical limit. Note that the theorem requires the initial solution to lie within a ball around with radius , which is slightly smaller than the radius for which the local descent, Lipschitz and smoothness conditions hold. Moreover, the step size and the convergence rate depend on the condition number of as well as the quality of the initialization through . We did not make an attempt to optimize this dependence, but improvement in this direction, including adaptive choices of the step size, is certainly an interesting problem for future work.
4 Concrete results for specific models
In this section, we turn to the consequences of our general theory for specific models that arise in applications. Throughout this section, we focus on geometric convergence guaranteed by Theorem 2 using a constant step size. The main technical challenges are to verify the local descent, local Lipschitz and local smoothness assumptions that are needed to apply this result. Since Theorem 1 depends on weaker assumptions—it does not need the local smoothness property—it should be understood also that our analysis can be used to derive corollaries based on Theorem 1 as well.
In all of the analysis to follow, we adopt the shorthand for the singular values of , and for its condition number.
4.1 Noisy matrix completion
We begin by deriving a corollary for the problem of noisy matrix completion. Since we did not discuss this model in Section 2.2, let us provide some background here. There are a wide variety of matrix completion problems (e.g., [Lau01]), and the variant of interest here arises when the unknown matrix has low rank. More precisely, for an unknown PSD and low-rank matrix , suppose that we are given noisy observations of a subset of its entries. In the so-called Bernoulli model, the random subset of observed entries is chosen uniformly at random—that is, each entry is observed with some probability , independently of all other entries. We can represent these observations by a random symmetric matrix with entries of the form
Here the variables represent a form of measurement noise.
A standard method for matrix completion is based on solving the semidefinite program
where is a radius to be chosen. As noted above, the PSD constraint and nuclear norm bound are equivalent to the trace constraint . In either case, this is a special case of our general estimator (1).
The SDP-based estimator (20) is known to have good performance when the underlying matrix satisfies certain matrix incoherence conditions. These conditions involve its leverage scores, defined in the following way. Here we consider a simplified setting where the eigenvalues of are equal. By performing an eigendecomposition, we can write where is a diagonal matrix of eigenvalues (a constant multiple of the identity in this simplified setting), and take . With this notation, the incoherence parameter of is given by
Since we already enforce low-rankness in the factorized formulation, we can drop the nuclear norm constraint. The generalized projected gradient descent (3) is specified by letting and setting
Note that is convex, and depends on the initial solution . The gradient of is , and the projection is given by the row-wise “clipping” operation
This projection ensures that the iterates of gradient descent (3) remain incoherent.
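A minimal sketch of this row-wise clipping operation, with our own naming (b denotes the row-norm bound implied by the incoherence constraint; any row exceeding it is rescaled onto the bound, and other rows are untouched):

```python
import numpy as np

def clip_rows(F, b):
    """Row-wise clipping: rescale any row of F whose l2 norm exceeds b.

    This is the Euclidean projection onto {F : max_i ||e_i^T F||_2 <= b},
    since the constraint decouples across rows."""
    norms = np.linalg.norm(F, axis=1)
    scale = np.minimum(1.0, b / np.maximum(norms, 1e-12))
    return F * scale[:, None]
```

Unlike a soft-threshold, clipping preserves the direction of every row and changes only those rows that violate the bound.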
Note that (where is the -th standard basis vector and is the column space of ), so the values of and depend only on , and are the same for any in .
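As a concrete (assumed) form, the incoherence parameter standard in the matrix completion literature can be computed from the top-r eigenspace; this sketch uses our own naming and the common normalization mu = (d/r) * max_i ||e_i^T U||_2^2, which need not match the paper's elided definition exactly:

```python
import numpy as np

def incoherence(Mstar, r):
    """Standard incoherence of the rank-r eigenspace of a symmetric matrix:
    mu = (d/r) * max_i ||e_i^T U||_2^2, where the columns of U are the
    top-r eigenvectors of Mstar.  mu = 1 for maximally spread eigenvectors
    and mu = d/r in the worst (spikiest) case."""
    d = Mstar.shape[0]
    w, V = np.linalg.eigh(Mstar)                 # ascending eigenvalues
    U = V[:, np.argsort(np.abs(w))[::-1][:r]]    # top-r by magnitude
    return (d / r) * np.max(np.sum(U**2, axis=1))
```

For instance, the rank-one all-ones matrix is maximally incoherent (mu = 1), while a matrix supported on a single coordinate is maximally coherent (mu = d).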
With this notation in place, we are now ready to apply Theorem 2 to the noisy matrix completion problem. As we show below, if the initial matrix satisfies the bound , then the set is -faithful. Moreover, if the expected sample size satisfies , then with probability at least the loss function satisfies the local descent, Lipschitz and smoothness conditions with parameters
Using this fact, we have the following consequence of Theorem 2, which holds when the sample size satisfies the bound above and is large enough to ensure that .
Under the previously stated conditions, if we are given an initial matrix satisfying the bound , then with probability at least , the gradient iterates with step size satisfy the bound
See Section 6.2 for the proof of this claim.
Even though Corollary 1 is a consequence of our general theory, it leads to results for exact/approximate recovery in the noiseless/noisy setting that are as good as or better than known results. In the noiseless setting (), our sample size requirement and contraction factor are sharper than those in the paper [sun2014mc] by a polynomial factor in the rank . Turning to the noisy setting, suppose the noise matrix has independent sub-Gaussian entries with parameter . A technical result to be proved later (see Lemma 11) guarantees that given a sample size , we have the operator norm bound with probability at least . Together with the bound (22), we conclude that
The scaling is better than the results in past work (e.g., [negahban2010noisymc, koltchinskii2011matrixCompletion, keshavan2010noisy]) on noisy matrix completion by a factor; in fact, it matches the minimax lower bounds established in the papers [negahban2010noisymc, koltchinskii2011matrixCompletion]. Thus, Corollary 1 in fact establishes that the projected gradient descent method yields minimax-optimal estimates.
Suppose the rank- SVD of the matrix is given by . We can take