## 1 Introduction

There are a variety of problems in statistics and machine learning that require estimating a matrix that is assumed—or desired—to be low-rank. For high-dimensional problems, the low-rank property is useful as a form of regularization, and it can also lead to more interpretable results in scientific settings. Low-rank matrix estimation can be formulated as a nonconvex optimization problem involving a cost function, measuring the fit to the data, along with a rank constraint. Even when the cost function is convex—such as in the ubiquitous case of least-squares fitting—solving a rank-constrained problem can be computationally difficult, with many interesting special cases known to have NP-hard complexity in the worst-case setting. However, statistical settings lead naturally to random ensembles, in which context such complexity concerns have been assuaged to some extent by the use of semidefinite programming (SDP) relaxations. These SDP relaxations are based on replacing the nonconvex rank constraint with a convex constraint based on the trace/nuclear norm. For many statistical ensembles of problems, among them multivariate regression, matrix completion and matrix decomposition, such SDP relaxations have been shown to have near-optimal performance (e.g., see the papers

[candes2009exact, recht2010guaranteed, negahban2010noisymc, negahban2009estimation, chen2011LSarxiv, koltchinskii2011matrixCompletion] and references therein). Although in theory any SDP can be solved to a given accuracy in polynomial time [NesNem87], the associated computational cost is often too high in practice: the cost of standard interior point methods [Boyd02, NesNem87] grows as a high-order polynomial in the dimension of the matrix, a scaling that is not practical for many real-world applications involving high-dimensional matrices. More recent work has developed algorithms that are specifically tailored to certain classes of SDPs; however, since solving the SDP involves optimizing over the space of matrices, even such specialized algorithms require time at least proportional to the number of matrix entries. In practice, researchers often resort instead to heuristic methods that directly optimize over the space of low-rank matrices, using iterative algorithms such as alternating minimization, power iteration, expectation maximization (EM) and projected gradient descent. Letting

denote the rank, these factorized optimization problems live in a space of dimension far lower than that of the original problem. Such heuristic methods are quite effective in practice for some problems, but they can also suffer from local optima. These intriguing phenomena motivate a recent and evolving line of work on understanding such iterative methods in the low-rank space. As we discuss in detail below, recent work has studied some of these algorithms in a number of specific settings. A natural question then arises: is there a general theory for understanding when low-rank iterative methods will succeed?

In this paper, we make progress on this general question by focusing on projected gradient descent in the low-rank space. We characterize a general set of conditions that govern the computational and statistical properties of the solutions, and then specialize this general theory to obtain corollaries for a broad range of problems. In more detail, suppose that we write a low-rank matrix in its factorized form, and consider projected gradient descent methods in the factor variable. The matrix quadratic form makes the problem inherently nonconvex, and in many cases, the problem is not even locally convex. Nevertheless, our theory shows that given a suitable initialization, projected gradient descent converges geometrically to a statistically useful solution, under conditions that are much more general than convexity. Our results are applicable even when the initial solution is outside any region of local convexity, or when the problem is globally concave. Each iteration of projected gradient descent typically takes time that is linear in the degrees of freedom of a low-rank matrix, as well as in the input size. Therefore, by directly enforcing low-rankness, our method simultaneously achieves two goals: we not only attain *statistical* consistency in the high-dimensional regime, but also gain *computational* advantages over convex relaxation methods that lift the problem to the space of matrices.
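To make the dimensionality gains of the factorized approach concrete, here is a back-of-the-envelope comparison (the specific numbers are ours, chosen only for illustration):

```python
# For a d x d matrix of rank r, the factorized parameterization M = U U^T
# (with U of size d x r) has d * r parameters, versus d * d parameters for
# the full (lifted) matrix variable used by convex relaxations.
d, r = 10_000, 10
full_params = d * d          # what lifted/convex formulations optimize over
factored_params = d * r      # what factorized gradient methods optimize over
ratio = factored_params / full_params
```

For these illustrative values, the factorized formulation optimizes over one-thousandth as many parameters.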

For this approach to be relevant, an equally important question is when the above conditions for convergence are satisfied. We verify these conditions for a broad range of statistical and machine learning problems, including matrix sensing, matrix completion in both its standard and one-bit forms, sparse principal component analysis (SPCA), graph clustering, and matrix decomposition or robust PCA. For each of these problems, we show that a suitable initialization can be obtained efficiently using simple methods, and that the projected gradient descent approach has sample complexity and statistical error bounds that are comparable to (and sometimes better than) the best existing results, which are often achieved by convex relaxation methods. Notably, our approach does not require using fresh samples in each iteration—a heuristic known as sample splitting that is often used to simplify analysis—nor does it involve the computation of multiple singular value decompositions (SVDs).

Let us now put our contributions in a broader historical context. The seminal work in [burer2005LRSDP] studies the problem of obtaining low-rank solutions to SDPs using gradient descent on the factor space. Several subsequent papers aim to obtain rigorous guarantees for nonconvex gradient descent, focusing on specific classes of matrix estimation problems. For instance, the recent papers [zheng2015convergent, tu2015procrustes] study exact recovery in the setting of noiseless matrix sensing (i.e., solving linear matrix equalities with random designs). Focusing on the rank-one setting, De et al. [de2014global, de2015wild]

study the noiseless matrix completion problem, and a stochastic version of nonconvex gradient descent; they prove global convergence with a constant success probability, assuming independence between the samples used by each iteration. The recent manuscript

[sun2014mc] studies several variants of nonconvex gradient descent algorithms, again for noiseless matrix completion. Another line of work [candes2014wirtinger, chen2015quadratic] considers the phase retrieval problem, which can be reformulated as recovering a rank-one matrix from random quadratic measurements. The regularity conditions imposed in this work bear some similarity with our conditions, but their validation requires a very different analysis. An attractive feature of phase retrieval is that it is known to be locally convex around the global optimum under certain settings [SoltanolkotabiThesis, white2015local_convex].

The work in this paper develops a unified framework for analyzing the
behavior of projected gradient descent in application to low-rank
estimation problems, covering many of the models described above as
well as various others. Our theory applies to matrices of arbitrary
rank , and is framed in the statistical setting of noisy
observations, allowing for noiseless observations as a special case.
When specialized to particular models, our framework yields a variety
of corollaries providing guarantees for concrete statistical models
that have not been studied in the work above. Notably, our general
conditions *do not*

depend on local convexity, and thus can be applied to models such as sparse PCA and clustering in which no form of local convexity holds. (In fact, our results apply even when the loss function is globally concave.) In addition, we impose only a natural gradient smoothness condition that is much less restrictive than the vanishing gradient condition imposed in other work. Thus, one of the main contributions of this paper is to illuminate the weakest known conditions under which nonconvex gradient descent can succeed, which in turn allows for applications to several problems that lack local convexity and vanishing gradients.

It is also worth noting that other types of algorithms for nonconvex problems have also been analyzed, including alternating minimization [jain2013low, hardt2014understanding, hardt2014fast], EM algorithms [balakrishnan2014EM, wang2014HighDimEM], power methods [hardt2014power], various hard-thresholding and singular value projection methods [jain2010svp, netrapalli2014nonconvexRPCA, jain2014hard, bhatia2015rob_regression], gradient descent for nonconvex regression and spectrally sparse recovery problems [loh2013regularized, wang2013sparseNonConvex, cai2015hankel], as well as gradient descent on Grassmannian manifolds [keshavan2009matrixafew, zhang2015grassmannian]. Finally, there is a large body of work on convex-optimization-based approaches to the concrete examples considered in this paper. We compare our statistical guarantees with results of this type after the statement of each of our corollaries.

##### Notation:

The -th row and -th column of a matrix are denoted by and , respectively. The spectral norm is the largest singular value of the matrix, and the nuclear norm is the sum of its singular values. For a pair of norm parameters and a matrix, the corresponding row-wise norm is obtained by first taking the norm of each row, and then taking the norm of the resulting vector of row norms. Special cases include the Frobenius norm and the elementwise norms. For a convex set , we use to denote the Euclidean projection onto it.

## 2 Background

We begin by setting up the class of matrix estimators to be studied in this paper, and then providing various concrete examples of specific models to which our general theory applies.

### 2.1 Matrix estimators in the factorized formulation

Letting denote the space of all symmetric -dimensional matrices, this paper focuses on a class of matrix estimators that take the following general form. For a given sample size , let be a cost function. It is a random function, since it depends (implicitly in our notation) on the observed data, and the function value provides some measure of fit of the matrix to the given data. For a given convex set , we then consider a minimization problem of the form

(1) |

The goal of solving this optimization problem is to estimate some unknown target matrix . Typically, the target matrix is a (near)-minimizer of the population version of the program—that is, a solution to the same constrained minimization problem with replaced by its expectation . However, our theory does not require that minimizes this quantity, nor that the gradient vanish.

In many cases, the matrix either has low rank, or can be well-approximated by a matrix of low rank. Concretely, if the target matrix has rank , then it can be written in the outer product form for some other matrix with orthogonal columns. This factorized representation motivates us to consider the function , and the factorized formulation

(2) |

where is some convex set that contains , and for which the set acts as a surrogate for . Note that due to the factorized representation of the low-rank matrix, this factorized program is (in general) nonconvex, typically even when the original program (1) is convex.

Nonetheless, we can apply a projected gradient descent method in order to compute an approximate minimizer. For this particular problem, the projected gradient descent updates take the form

(3) |

where is a step size parameter, denotes the Euclidean projection onto the set , and the gradient is given by . (This gradient takes a simpler form whenever is symmetric, which is the case in the concrete examples that we treat.) The main goal of this paper is to provide a general set of sufficient conditions under which—up to a statistical tolerance term —the sequence converges to some such that .
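As a rough sketch of the update (3), the iteration amounts to the following loop (the names `grad_L`, `project`, and `eta` are placeholders of our own choosing for the problem-specific gradient, projection, and step size described above):

```python
import numpy as np

def projected_gradient_descent(grad_L, project, U0, eta, num_iters):
    """Generic projected gradient descent in the factor space:
    U_{t+1} = Pi_C(U_t - eta * grad_L(U_t)),
    where Pi_C (`project`) is the Euclidean projection onto the convex set C."""
    U = U0
    for _ in range(num_iters):
        U = project(U - eta * grad_L(U))
    return U
```

For instance, with the identity projection and the gradient of a simple quadratic, the iterates contract geometrically toward the minimizer, mirroring the behavior our theory establishes in far more general settings.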

A significant challenge in the analysis is the fact that there are
*many* possible factorizations of the form . In order to address this issue, it is convenient to define an equivalence class of valid solutions as follows:

(4) |

For the applications of interest here, the underlying goal is to
obtain a good estimate of *any* matrix in the set . In
particular, such an estimate implies a good estimate of
itself as well as the column space and singular values of all the
members of the class . Accordingly, we define the
pseudometric

(5) |

Note that all matrices have the same singular values, so that we may write the singular values as well as and without any ambiguity. In fact, this invariance property holds more generally for any function of the sorted singular values and column space of (e.g., any unitarily invariant norm).
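Since the members of the equivalence class (4) differ from one another by an orthogonal transformation of the columns, the pseudometric (5) can be computed by solving an orthogonal Procrustes problem. The following is a minimal sketch, under our reading that the pseudometric is the Frobenius distance minimized over such orthogonal transformations:

```python
import numpy as np

def procrustes_distance(A, B):
    """min over orthogonal R of ||A - B R||_F, via the orthogonal Procrustes
    solution: if B^T A has SVD U S V^T, the minimizing R is U V^T."""
    U, _, Vt = np.linalg.svd(B.T @ A)
    R = U @ Vt
    return np.linalg.norm(A - B @ R)
```

In particular, two factors that differ only by a rotation of the columns are at distance zero, as required of members of the same equivalence class.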

### 2.2 Illustrative examples

Let us now consider a few specific models to illustrate the general set-up from the previous section. We return to demonstrate consequences of our general theory for these (and other) models in Section 4.

#### 2.2.1 Matrix regression

We begin with a simple example, namely one in which we make noisy observations of linear projections of an unknown low-rank matrix . In particular, suppose that we are given i.i.d. observations of the form

(6) |

and is some i.i.d. sequence of zero-mean noise variables. The paper [negahban2009estimation] provides various examples of such matrix regression problems, depending on the particular choice of the regression matrices .

##### Original estimator:

Without considering computational complexity, a reasonable estimate of would be based on minimizing the least-squares cost

(7) |

subject to a rank constraint. However, this problem is computationally intractable in general due to the nonconvexity of the rank function. A standard convex relaxation is based on the nuclear norm , corresponding to the sum of the singular values of the matrix. In the symmetric PSD case, it is equivalent to the trace of the matrix. Using the nuclear norm as regularizer leads to the estimator

where is a radius to be chosen. This is a special case of our general estimator (1) with being the least-squares cost (7), and the constraint set .

##### Population version:

Suppose that the noise variables are i.i.d. zero-mean with variance , and that the regression matrices are also i.i.d., zero-mean and such that for any matrix . Under these conditions, an easy calculation yields that the population cost function is given by . For this particular case, note that is the unique minimizer of the population cost.

##### Projected gradient descent:
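As one concrete instantiation under this heading (a sketch of ours, not the paper's exact algorithm), consider gradient descent on the factorized least-squares cost; with symmetric regression matrices, a single update can be written as:

```python
import numpy as np

def ls_grad_step(U, X, y, eta):
    """One gradient step for the factorized least-squares cost
    L(U) = (1/(2n)) * sum_i (y_i - <X_i, U U^T>)^2  with symmetric X_i,
    whose gradient is  grad L(U) = -(2/n) * sum_i (y_i - <X_i, U U^T>) X_i U."""
    n = len(y)
    resid = y - np.einsum('ijk,jk->i', X, U @ U.T)   # y_i - <X_i, U U^T>
    G = -(2.0 / n) * np.einsum('i,ijk->jk', resid, X) @ U
    return U - eta * G
```

By construction, a factor that reproduces the observations exactly is a stationary point of this update.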

#### 2.2.2 Rank- PCA with row sparsity

Principal component analysis is a widely used method for dimensionality reduction. For high-dimensional problems in which the dimension is comparable to or exceeds the sample size, it is well-known that classical PCA is inconsistent [JohLu09]. Moreover, minimax lower bounds show that consistent eigen-estimation is impossible in the absence of structure in the eigenvectors. Accordingly, a recent line of work (e.g., [JohLu09, AmiWai09, cai2013adaptive, berthet2013lowerSparsePCA, vu2013minimax, BirJohNadPau12]) has studied different forms of PCA with structured eigenvectors.

Here we consider one such form of structured PCA, namely a low-rank model with row-wise sparsity. For a given signal-to-noise ratio and an orthonormal matrix , consider a covariance matrix of the form

(9) |

By construction, the columns of span the leading eigenspace of the covariance matrix, with the corresponding maximal eigenvalues. In the row-sparse version of this model [vu2013minimax], this leading eigenspace is assumed to be supported on a small subset of coordinates—that is, the matrix has a bounded number of non-zero rows. Given i.i.d. samples from the Gaussian distribution , the goal of sparse PCA is to estimate the sparse eigenspace spanned by .

##### Original estimator:

A natural estimator is based on a semidefinite program, referred to as the Fantope relaxation in the paper [vu2013fantope], given by

(10) |

where is the empirical covariance matrix, and is a radius to be chosen. This is a special case of our general set-up, with the cost function and constraint set defined accordingly.

##### Population version:

Since , the population cost function is given by

Thus, by construction, for any radius , the matrix is the unique minimizer of the population version of the problem (10), subject to the constraint .

##### Projected gradient descent:

For a radius to be chosen, we consider a factorized version of the SDP

(11) |

where we recall that . This norm is the appropriate choice for selecting matrices with sparse rows, as assumed in our initial set-up. We return in Section 4.3 to analyze the projected gradient updates (3) applied to the pair in equation (11).

As a side comment, this example illustrates that our theory does not depend on local convexity of the function . In this case, even though the original function is convex (in fact, linear) in the matrix , observe that the function from equation (11) is never locally convex in the low-rank matrix ; in fact, since is positive semidefinite, it is a globally concave function.
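For intuition on the projection step in the factorized program (11), here is one way to project onto a row-wise norm ball that encourages sparse rows. We assume an l2,1-type constraint (the sum of row l2 norms bounded by a radius) purely for illustration; this reduces to an l1-ball projection of the vector of row norms:

```python
import numpy as np

def project_l1_ball(v, radius):
    """Euclidean projection of a nonnegative vector onto the l1 ball."""
    if v.sum() <= radius:
        return v
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - radius) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def project_row_sparse(U, radius):
    """Projection onto {U : sum_i ||U_i.||_2 <= radius} (an l_{2,1} ball):
    project the vector of row norms onto the l1 ball, then rescale each row."""
    norms = np.linalg.norm(U, axis=1)
    new_norms = project_l1_ball(norms, radius)
    scale = np.divide(new_norms, norms, out=np.zeros_like(norms), where=norms > 0)
    return U * scale[:, None]
```

The projection preserves row directions and shrinks row norms, zeroing out rows with small norm, which is what drives row sparsity in the iterates.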

#### 2.2.3 Low-rank and sparse matrix decomposition

There are various applications in which it is natural to model an unknown matrix as the sum of two matrices, one of which is low-rank and the other of which is sparse. Concretely, suppose that we make observations of the form where is low-rank, the matrix is symmetric and elementwise-sparse, and is a symmetric matrix of noise variables. Many problems can be cast in this form, including robust forms of PCA, factor analysis, and Gaussian graphical model estimation; see the papers [candes2009robustPCA, agarwal2012decomposition, chen2011LSarxiv, chandrasekaran2011siam, Hsu2010RobustDecomposition] and references therein for further details on these and other applications.

##### Original estimator:

Letting denote the column of a matrix , define the set of matrices , where are user-defined radii. Using the nuclear norm and norm as surrogates for rank and sparsity respectively, a popular convex relaxation approach is based on the SDP

This is a special case of our general estimator with , and the constraint set .

##### Population version:

In this case, the population function is given by

where the expectation is over the random noise matrix . In general, we are not guaranteed that is the unique minimizer of this objective, but our analysis shows that (under suitable conditions) it is a near-minimizer, and this is adequate for our theory.

##### Projected gradient descent:

In this paper, we analyze a version of gradient descent that operates on the pair given by

(12) |

Here is the initialization of the algorithm, and the parameter controls the matrix incoherence. See Sections 4.1 and 4.6 for discussion of matrix incoherence parameters, and their necessity in such problems. The gradient of takes the form

where denotes projection onto the constraint set
. This projection is easy to carry out, as it simply involves
a soft-thresholding of the columns of the matrix. Likewise, the
projection onto the set from
equation (12) is easy to carry out. We
return to analyze these projected gradient updates in
Section 4.6.
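The column-wise soft-thresholding mentioned above can be sketched as follows, under our reading that the constraint set bounds the l1 norm of each column of the sparse component:

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding at level t >= 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def project_columns_l1(S, radius):
    """Project each column of S onto the l1 ball of the given radius.
    The exact projection reduces to soft-thresholding the column at a
    data-dependent level theta (computed by the standard sort-based rule)."""
    out = S.copy()
    for j in range(S.shape[1]):
        v = np.abs(S[:, j])
        if v.sum() > radius:
            u = np.sort(v)[::-1]
            css = np.cumsum(u)
            rho = np.nonzero(u - (css - radius) / (np.arange(len(v)) + 1) > 0)[0][-1]
            theta = (css[rho] - radius) / (rho + 1.0)
            out[:, j] = soft_threshold(S[:, j], theta)
    return out
```

Columns already inside the ball are left untouched; the others are soft-thresholded just enough to land on the boundary.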

## 3 Main results

In this section, we turn to the set-up and statement of our main results on the convergence properties of projected gradient descent for low-rank factorizations. We begin in Section 3.1 by stating the conditions on the function and that underlie our analysis. In Section 3.2, we state a result (Theorem 1) that guarantees sublinear convergence, whereas Section 3.3 is devoted to a result (Theorem 2) that guarantees faster linear convergence under slightly stronger assumptions. In Section 4 to follow, we derive various corollaries of these theorems for different concrete versions of low-rank estimation.

Given a radius , we define the ball
. At a high
level, our goal is to provide conditions under which the projected
gradient sequence converges to some
multiple of the ball , where is a
*statistical tolerance*.

### 3.1 Conditions on the pair

Recall the definition of the set of equivalent orthogonal factorizations of a given matrix . We begin with a condition on that guarantees that it respects the structure of this set.

##### -faithfulness of :

For a radius , the constraint set is said to be
*-faithful* if for each matrix , we are guaranteed that

(13) |

Of course, this condition is implied by the inclusion . The -faithfulness condition is natural for our setting, as our goal is to estimate the eigen structure of , and the set should therefore represent prior knowledge of this structure and be independent of a specific factorization of .

##### Local descent condition:

Our next condition provides a guarantee on the cost improvement that can be obtained by taking a gradient step when starting from any matrix that is “sufficiently” far away from the set .

###### Definition 1 (Local descent condition).

For a given radius , curvature parameter
and statistical tolerance , a cost function
satisfies a *local descent condition* with parameters
over if for each , there is some such that

(14) |

In order to gain intuition for this condition, note that by a first-order Taylor series expansion, we have , so that this inner product measures the potential gains afforded by taking a gradient step. Now consider some matrix such that , so that its distance from is larger than the statistical precision. The lower bound (14) with then implies that

which guarantees a quadratic descent condition. Note that the condition (14) actually allows for additional freedom in the choice of so as to accommodate the non-uniqueness of the factorization.

One way in which to establish a bound of the form (14) is by requiring that be locally strongly convex, and that the gradient approximately vanishes. In particular, suppose is -strongly convex over the set , in the sense that

If we assume that , then some simple algebra yields that the lower bound (14) holds.

However, it is essential to note that our theory covers several
examples in which a lower bound (14) of the form
holds, even though fails to be locally convex, and/or the
gradient does not approximately
vanish. (We note that the vanishing gradient condition is needed in all existing work on nonconvex gradient descent [candes2014wirtinger, sun2014mc, zheng2015convergent].) Examples include the problem of sparse PCA, previously
introduced in Section 2.2.2; in this case, the
function is actually globally concave, but nonetheless our
analysis in Section 4.3 shows that a local descent
condition of the form (14) holds. Similarly, for the
planted clustering model studied in Section 4.4, the
same form of global concavity holds. In addition, for the matrix
regression problem previously introduced in
Section 2.2.1, we prove in
Section 4.2 that the condition (14)
holds over a set over which is nonconvex. The generality
of our condition (14) is essential to accommodate
these and other examples.

##### Local Lipschitz condition:

Our next requirement is a straightforward local Lipschitz property:

###### Definition 2 (Local Lipschitz).

The loss function is locally Lipschitz in the sense that

(15) |

for all .

##### Local smoothness:

Our last condition is *not* required to establish convergence of
projected gradient descent, but rather to guarantee a faster
geometric rate of convergence. It is a condition—complementary to
the local descent condition—that upper bounds the behavior of the
gradient map .

###### Definition 3 (Local smoothness).

For some curvature
and smoothness parameters and , statistical tolerance
and radius , we say that the loss function
satisfies a *local smoothness condition* with
parameters over if for
each and ,

(16) |

The above conditions are stated in terms of the loss function for the factor matrix . Alternatively, one may restate these conditions in terms of the loss function on the original space, and we make use of this type of reformulation in parts of our proofs. For instance, see Section 6 for details.

### 3.2 Sublinear convergence under Lipschitz condition

With our basic conditions in place, we are now ready to state our first main result. It guarantees a sublinear rate of convergence under the -faithfulness, local descent, and local Lipschitz conditions.

More precisely, for some descent and Lipschitz parameters , a statistical tolerance , and a constant , suppose that , the cost function satisfies the local descent and Lipschitz conditions (Definitions 1 and 2) with parameters and , and the constraint set is -faithful and convex. Let be the condition number of . We then have the following guarantee:

###### Theorem 1.

Under the previously stated conditions, given any initial point belonging to the set , the projected gradient iterates with step size satisfy the bound

(17) |

See Section 5.1 for the proof of this claim.

As a minor remark, we note that the assumption entails no loss of generality—if it fails to hold, then the initial solution already satisfies an error bound better than what is guaranteed for subsequent iterates.

Conceptually, Theorem 1 provides a minimal set of
conditions for the convergence of projected gradient descent using the
nonconvex factorization . The first
term on the right hand side of equation (17)
corresponds to the *optimization error*, whereas the second
term is the *statistical error.* The
bound (17) shows that the distance between
and drops at the rate
up to the statistical limit that is determined by the
sample size and the signal-to-noise ratio (SNR) of the problem. We
see concrete instances of this statistical error in the examples to
follow.

### 3.3 Linear convergence under smoothness condition

Although Theorem 1 does guarantee convergence, the resulting rate is sublinear (), and hence rather slow. In this section, we show that if in addition to the local Lipschitz and descent conditions, the function satisfies the local smoothness conditions in Definition 3, then much faster convergence can be guaranteed.

More precisely, suppose that for some numbers , , , and with , and , the loss function satisfies the local descent, Lipschitz and smoothness conditions in Definitions 1–3 over with parameters , , , and , and that the set is -faithful and convex.

###### Theorem 2.

Under the previously stated conditions, there is a constant depending only on such that given an initial matrix in the set , the projected gradient iterates with step size satisfy the bound

(18) |

See Section 5.2 for the proof of this claim.

The right hand side of the bound (18) again consists of an optimization error term and a statistical error term. The theorem guarantees that the optimization error converges linearly at the geometric rate up to a statistical limit. Note that the theorem requires the initial solution to lie within a ball around with radius , which is slightly smaller than the radius for which the local descent, Lipschitz and smoothness conditions hold. Moreover, the step size and the convergence rate depend on the condition number of as well as the quality of the initialization through . We did not make an attempt to optimize this dependence, but improvement in this direction, including adaptive choices of the step size, is certainly an interesting problem for future work.
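As a numerical illustration of such geometric convergence (a toy simulation of ours, not from the paper), the following runs plain factorized gradient descent on a noiseless matrix-sensing instance with a least-squares loss; the projection is trivial here, and the step size is chosen conservatively relative to the top eigenvalue of the target:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 20, 2, 600

# Planted rank-r matrix M* = U* U*^T and i.i.d. symmetric Gaussian
# sensing matrices X_i, with noiseless observations y_i = <X_i, M*>.
U_star = rng.standard_normal((d, r))
M_star = U_star @ U_star.T
X = rng.standard_normal((n, d, d))
X = (X + X.transpose(0, 2, 1)) / 2
y = np.einsum('ijk,jk->i', X, M_star)

def grad(U):
    """Gradient of L(U) = (1/(2n)) sum_i (y_i - <X_i, U U^T>)^2."""
    resid = y - np.einsum('ijk,jk->i', X, U @ U.T)
    return -(2.0 / n) * np.einsum('i,ijk->jk', resid, X) @ U

U = U_star + 0.1 * rng.standard_normal((d, r))   # initialization near U*
err0 = np.linalg.norm(U @ U.T - M_star)
eta = 0.5 / (4 * np.linalg.norm(M_star, 2))      # conservative step size
for _ in range(500):
    U = U - eta * grad(U)
err = np.linalg.norm(U @ U.T - M_star)
```

With a good initialization and a noiseless model, the reconstruction error of the matrix iterate shrinks by orders of magnitude, consistent with the linear rate in the bound (18); in a noisy model the error would instead plateau at the statistical limit.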

## 4 Concrete results for specific models

In this section, we turn to the consequences of our general theory for specific models that arise in applications. Throughout this section, we focus on geometric convergence guaranteed by Theorem 2 using a constant step size. The main technical challenges are to verify the local descent, local Lipschitz and local smoothness assumptions that are needed to apply this result. Since Theorem 1 depends on weaker assumptions—it does not need the local smoothness property—it should be understood also that our analysis can be used to derive corollaries based on Theorem 1 as well.

##### Note:

In all of the analysis to follow, we adopt the shorthand for the singular values of , and for its condition number.

### 4.1 Noisy matrix completion

We begin by deriving a corollary for the problem of noisy matrix completion. Since we did not discuss this model in Section 2.2, let us provide some background here. There are a wide variety of matrix completion problems (e.g., [Lau01]), and the variant of interest here arises when the unknown matrix has low rank. More precisely, for an unknown PSD and low-rank matrix , suppose that we are given noisy observations of a subset of its entries. In the so-called Bernoulli model, the random subset of observed entries is chosen uniformly at random—that is, each entry is observed with some probability , independently of all other entries. We can represent these observations by a random symmetric matrix with entries of the form

(19) |

Here the variables represent a form of measurement noise.
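A minimal simulator for this Bernoulli observation model (names and conventions are ours) might look as follows; entries are revealed symmetrically, each with probability p, and corrupted by Gaussian noise:

```python
import numpy as np

def observe(M_star, p, sigma, rng):
    """Bernoulli observation model: each entry of the symmetric matrix M_star
    is revealed with probability p (symmetrically, so the observed matrix is
    symmetric) and corrupted by N(0, sigma^2) noise. Returns the observed
    matrix Y (zero on unobserved entries) and the boolean mask of entries."""
    d = M_star.shape[0]
    mask = np.triu(rng.random((d, d)) < p)
    mask = mask | mask.T
    noise = rng.standard_normal((d, d)) * sigma
    noise = np.triu(noise) + np.triu(noise, 1).T   # symmetric noise matrix
    Y = np.where(mask, M_star + noise, 0.0)
    return Y, mask
```

Setting sigma to zero recovers the noiseless matrix completion setting as a special case.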

A standard method for matrix completion is based on solving the semidefinite program

(20) |

where is a radius to be chosen. As noted above, the PSD constraint and nuclear norm bound are equivalent to the trace constraint . In either case, this is a special case of our general estimator (1).

The SDP-based estimator (20) is known to have good
performance when the underlying matrix satisfies certain
matrix incoherence conditions. These conditions involve its leverage
scores, defined in the following way. Here we consider a simplified
setting where the eigenvalues of are equal. By
performing an eigendecomposition, we can write
where is a diagonal matrix of
eigenvalues (a constant multiple of the identity when they are
constant), and take . With this notation, the
*incoherence* parameter of is
given by

(21) |

Since we already enforce low-rankness in the factorized formulation, we can drop the nuclear norm constraint. The generalized projected gradient descent (3) is specified by letting and setting

Note that is convex, and depends on the initial solution . The gradient of is , and the projection is given by the row-wise “clipping” operation

This projection ensures that the iterates of gradient descent (3) remain incoherent.
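The row-wise "clipping" operation can be implemented in a few lines; it is the exact Euclidean projection onto the set of matrices whose rows have l2 norm bounded by a given level (the bound `b` here stands in for the incoherence-dependent radius in the definition above):

```python
import numpy as np

def clip_rows(U, b):
    """Row-wise clipping: rescale any row of U whose l2 norm exceeds b so
    that it has norm exactly b. This is the Euclidean projection onto
    {U : max_i ||U_i.||_2 <= b}; rows already within the bound are unchanged."""
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    scale = np.minimum(1.0, b / np.maximum(norms, 1e-12))
    return U * scale
```

Because the constraint acts on each row separately, the projection costs only linear time in the size of U, keeping the per-iteration cost low.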

###### Remark 1.

Note that ( is the -th standard basis vector and is the column space of ), so the values of and depend only on and are the same for any in .

With this notation in place, we are now ready to apply Theorem 2 to the noisy matrix completion problem. As we show below, if the initial matrix satisfies the bound , then the set is -faithful. Moreover, if the expected sample size satisfies , then with probability at least the loss function satisfies the local descent, Lipschitz and smoothness conditions with parameters

Using this fact, we have the following consequence of Theorem 2, which holds when the sample size satisfies the bound above and is large enough to ensure that .

###### Corollary 1.

Under the previously stated conditions, if we are given an initial matrix satisfying the bound , then with probability at least , the gradient iterates with step size satisfy the bound

(22) |

See Section 6.2 for the proof of this
claim.

Even though Corollary 1 is a consequence of our general theory, it leads to results for exact/approximate recovery in the noiseless/noisy setting that are as good as or better than known results. In the noiseless setting (), our sample size requirement and contraction factor are sharper than those in the paper [sun2014mc] by a polynomial factor in the rank . Turning to the noisy setting, suppose the noise matrix has independent sub-Gaussian entries with parameter . A technical result to be proved later (see Lemma 11) guarantees that given a sample size , we have the operator norm bound with probability at least . Together with the bound (22), we conclude that

(23) |

The scaling is better than the results in past work (e.g., [negahban2010noisymc, koltchinskii2011matrixCompletion, keshavan2010noisy]) on noisy matrix completion by a factor; in fact, it matches the minimax lower bounds established in the papers [negahban2010noisymc, koltchinskii2011matrixCompletion]. Thus, Corollary 1 in fact establishes that the projected gradient descent method yields minimax-optimal estimates.


##### Initialization:

Suppose the rank- SVD of the matrix is given by . We can take
