No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis

by Rong Ge, et al.
Duke University and UC Berkeley

In this paper we develop a new framework that captures the common landscape underlying non-convex low-rank matrix problems, including matrix sensing, matrix completion and robust PCA. In particular, we show for all of the above problems (including asymmetric cases): 1) all local minima are also globally optimal; 2) no high-order saddle points exist. These results explain why simple algorithms such as stochastic gradient descent converge globally and efficiently optimize these non-convex objective functions in practice. Our framework connects and simplifies the existing analyses of the optimization landscapes for matrix sensing and symmetric matrix completion, and naturally leads to new results for asymmetric matrix completion and robust PCA.





1 Introduction

Non-convex optimization is one of the most powerful tools in machine learning. Many popular approaches, from traditional ones such as matrix factorization (Hotelling, 1933) to modern deep learning (Bengio, 2009), rely on optimizing non-convex functions. In practice, these functions are optimized using simple algorithms such as alternating minimization or gradient descent. Why such simple algorithms work is still a mystery for many important problems.

One way to understand the success of non-convex optimization is to study the optimization landscape of the objective function: where are the possible locations of global optima, local optima and saddle points? Recently, a line of work showed that several natural problems, including tensor decomposition (Ge et al., 2015), dictionary learning (Sun et al., 2015a), matrix sensing (Bhojanapalli et al., 2016; Park et al., 2016) and matrix completion (Ge et al., 2016), have a well-behaved optimization landscape: all local optima are also globally optimal. Combined with recent algorithms (e.g. Ge et al. (2015); Carmon et al. (2016); Agarwal et al. (2016); Jin et al. (2017)) that are guaranteed to find a local minimum of many non-convex functions, such problems can be efficiently solved by basic optimization algorithms such as stochastic gradient descent.

In this paper we focus on optimization problems that look for low rank matrices using partial or corrupted observations. Such problems have been studied extensively (Fazel, 2002; Rennie and Srebro, 2005; Candès and Recht, 2009) and have many applications, e.g. in recommendation systems (Koren, 2009); see the survey by Davenport and Romberg (2016). These optimization problems can be formalized as follows:

$$\min_{\mathbf{M}} \; f(\mathbf{M}) \quad \text{s.t.} \quad \mathrm{rank}(\mathbf{M}) \le r. \tag{1}$$

Here $\mathbf{M}$ is an $m \times n$ matrix and $f$ is a convex function of $\mathbf{M}$. The non-convexity of this problem stems from the low rank constraint. Several interesting problems, such as matrix sensing (Recht et al., 2010), matrix completion (Candès and Recht, 2009) and robust PCA (Candès et al., 2011), can all be framed as optimization problems of this form (see Section 3).

In practice, the Burer and Monteiro (2003) heuristic is often used: replace $\mathbf{M}$ with an explicit low rank representation $\mathbf{M} = \mathbf{U}\mathbf{V}^\top$, where $\mathbf{U} \in \mathbb{R}^{m \times r}$ and $\mathbf{V} \in \mathbb{R}^{n \times r}$. The new optimization problem becomes

$$\min_{\mathbf{U}, \mathbf{V}} \; f(\mathbf{U}\mathbf{V}^\top) + Q(\mathbf{U}, \mathbf{V}). \tag{2}$$

Here $Q(\mathbf{U}, \mathbf{V})$ is an (optional) regularizer. Despite the objective being non-convex, for all the problems mentioned above, simple iterative updates from a random or even arbitrary initial point find the optimal solution in practice. It is then natural to ask: can we characterize the similarities between the optimization landscapes of these problems? We show this is indeed possible:
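As a quick numerical illustration of the Burer-Monteiro heuristic described above (a sketch, not the paper's formal setting: the problem sizes, step size, and fully observed quadratic loss below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 20, 15, 3
Mstar = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # rank-r target
Mstar /= np.linalg.norm(Mstar, 2)      # normalize spectral norm to 1

# Burer-Monteiro: optimize the factors U, V of M = U V^T directly,
# here for the fully observed loss f(M) = (1/2) ||M - M*||_F^2.
U = 0.1 * rng.standard_normal((m, r))
V = 0.1 * rng.standard_normal((n, r))
eta = 0.2
for _ in range(10000):
    R = U @ V.T - Mstar                # residual
    # Simultaneous gradient steps on U and V.
    U, V = U - eta * R @ V, V - eta * R.T @ U
err = np.linalg.norm(U @ V.T - Mstar)
print(err)  # small: plain gradient descent recovers M* from random init
```

In this instance, plain gradient descent from a small random initialization recovers the planted low rank matrix, matching the benign-landscape picture that the rest of the paper makes rigorous.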

Theorem 1 (informal).

The objective functions of matrix sensing, matrix completion and robust PCA have similar optimization landscapes. In particular, for all these problems, 1) all local minima are also globally optimal; 2) any saddle point has at least one strictly negative eigenvalue in its Hessian.

More precise theorem statements appear in Section 3. Note that there were several cases (matrix sensing (Bhojanapalli et al., 2016; Park et al., 2016), symmetric matrix completion (Ge et al., 2016)) where similar results on the optimization landscape were known. However, the techniques in previous works are tailored to the specific problems and are hard to generalize. Our framework captures and simplifies all these previous results, and also gives new results on asymmetric matrix completion and robust PCA.

The key observation in our analysis is that for matrix sensing, matrix completion, and robust PCA (when fixing the sparse estimate), the function $f$ (in Equation (1)) is a quadratic function of the matrix $\mathbf{M}$. Hence the Hessian $\mathcal{H}$ of $f$ with respect to $\mathbf{M}$ is a constant. More importantly, the Hessian in all of the above problems has similar properties (it approximately preserves norms, similar to the RIP property used in matrix sensing (Recht et al., 2010)), which allows their optimization landscapes to be characterized in a unified way. Specifically, our framework gives a principled way of defining a direction of improvement at every point that is not globally optimal.

Another crucial property of our framework is the interaction between the regularizer $Q$ and the Hessian $\mathcal{H}$. Intuitively, the regularizer makes sure the solution is in a nice region (e.g. the set of incoherent matrices for matrix completion), and only within this region does the Hessian have the norm-preserving property. On the other hand, the regularizer should not be so large as to severely distort the landscape. This interaction is crucial for matrix completion, and is also very useful in handling noise and perturbations. In Section 4, we discuss the ideas required to apply this framework to matrix sensing, matrix completion and robust PCA.

Using this framework, we also give a way to reduce asymmetric matrix problems to symmetric PSD problems (where the desired matrix is of the form $\mathbf{U}\mathbf{U}^\top$). See Section 5 for more details.

In addition to showing that there are no spurious local minima, our framework also implies that any saddle point has at least one strictly negative eigenvalue in its Hessian. Formally, we prove that all of the above problems satisfy a robust version of this claim, the strict saddle property (see Definition 2), which is one of the crucial sufficient conditions for admitting efficient optimization algorithms, and thus implies the following corollary (see Section 6 for more details).

Corollary 2 (informal).

For matrix sensing, matrix completion and robust PCA, simple local search algorithms can find the desired low rank matrix $\mathbf{M}^\star$ from an arbitrary starting point in polynomial time with high probability.

For simplicity, we present most results in the noiseless setting, but our results can also be generalized to handle noise. As an example, we show how to do this for matrix sensing in Section C.

1.1 Related Works

The landscape of low rank matrix problems has recently received a lot of attention. Ge et al. (2016) showed that symmetric matrix completion has no spurious local minima. At the same time, Bhojanapalli et al. (2016) proved a similar result for symmetric matrix sensing. Park et al. (2016) extended the matrix sensing result to the asymmetric case. All of these works guarantee global convergence to the correct solution.

There has been a lot of work on the local convergence analysis of various algorithms and problems. For matrix sensing or matrix completion, the works (Keshavan et al., 2010a, b; Hardt and Wootters, 2014; Hardt, 2014; Jain et al., 2013; Chen and Wainwright, 2015; Sun and Luo, 2015; Zhao et al., 2015; Zheng and Lafferty, 2016; Tu et al., 2015) showed that given a good enough initialization, many simple local search algorithms, including gradient descent and alternating least squares, succeed. In particular, several works (e.g. Sun and Luo (2015); Zheng and Lafferty (2016)) accomplished this by showing that a geometric property very similar to strong convexity holds in the neighborhood of the optimal solution. For robust PCA, there are also many analyses of local convergence (Lin et al., 2010; Netrapalli et al., 2014; Yi et al., 2016; Zhang et al., 2017).

Several works also try to unify the analysis of similar problems. Bhojanapalli et al. (2015) gave a framework for the local analysis of these low rank problems. Belkin et al. (2014) showed a framework for learning basis functions which generalizes tensor decompositions; their techniques imply that the optimization landscapes of all such problems are very similar. For problems looking for a symmetric PSD matrix, Li and Tang (2016) showed that for objectives similar to (2) (but in the symmetric setting), restricted smoothness/strong convexity of the function $f$ suffices for local analysis. However, their framework does not address the interaction between the regularizer and the function $f$, and hence cannot be directly applied to problems such as matrix completion or robust PCA.


We will first introduce notation and basic optimality conditions in Section 2. Section 3 then introduces the problems and our results. For simplicity, we present our framework for the symmetric case in Section 4, and briefly discuss how to reduce asymmetric problems to symmetric problems in Section 5. We then show how our geometric results imply fast runtimes for popular local search algorithms in Section 6. For a clean presentation, many proofs are deferred to the appendix.

2 Preliminaries

In this section we introduce notations and basic optimality conditions.

2.1 Notations

We use bold letters for matrices and vectors. For a vector $\mathbf{v}$ we use $\|\mathbf{v}\|$ to denote its $\ell_2$ norm. For a matrix $\mathbf{M}$ we use $\|\mathbf{M}\|$ to denote its spectral norm and $\|\mathbf{M}\|_F$ to denote its Frobenius norm. For vectors we use $\langle \mathbf{u}, \mathbf{v} \rangle$ to denote the inner product, and for matrices we use $\langle \mathbf{A}, \mathbf{B} \rangle = \mathrm{tr}(\mathbf{A}^\top \mathbf{B})$ to denote the trace inner product. We will always use $\mathbf{M}^\star$ to denote the optimal low rank solution. Further, we use $\sigma_1^\star$ to denote its largest singular value, $\sigma_r^\star$ to denote its $r$-th singular value, and $\kappa^\star = \sigma_1^\star / \sigma_r^\star$ to denote its condition number.

We use $\nabla f$ to denote the gradient of $f$ and $\nabla^2 f$ to denote its Hessian. Since a function can often be applied to both $\mathbf{M}$ (as in (1)) and the factors (as in (2)), we use $\nabla_{\mathbf{M}}$ to denote the gradient with respect to $\mathbf{M}$ and $\nabla_{\mathbf{U}}$ to denote the gradient with respect to $\mathbf{U}$; similar notation is used for the Hessian. The Hessian $\mathcal{H} = \nabla^2_{\mathbf{M}} f$ is a crucial object in our framework. It can be interpreted as a linear operator on matrices; this operator can be viewed as an $mn \times mn$ matrix (or an $n^2 \times n^2$ matrix in the symmetric case) that applies to the vectorized versions of matrices. We use the notation $\mathcal{H}(\mathbf{A}, \mathbf{B})$ to denote the quadratic form $\langle \mathbf{A}, \mathcal{H}(\mathbf{B}) \rangle$. Similarly, the Hessian of objective (2) is a linear operator on a pair of matrices $(\mathbf{U}, \mathbf{V})$.

2.2 Optimality Conditions

Local Optimality

Suppose we are optimizing a function $f(\mathbf{x})$ with no constraints on $\mathbf{x}$. In order for a point $\mathbf{x}$ to be a local minimum, it must satisfy the first and second order necessary conditions; that is, we must have $\nabla f(\mathbf{x}) = \mathbf{0}$ and $\nabla^2 f(\mathbf{x}) \succeq 0$.

Definition 1 (Optimality Condition).

Suppose $\mathbf{x}$ is a local minimum of $f(\mathbf{x})$. Then we have

$$\nabla f(\mathbf{x}) = \mathbf{0}, \qquad \nabla^2 f(\mathbf{x}) \succeq 0.$$

Intuitively, if one of these conditions is violated, then it is possible to find a direction that decreases the function value. Ge et al. (2015) characterized the following strict-saddle property, which is a quantitative version of the optimality conditions, and can lead to efficient algorithms to find local minima.

Definition 2.

We say a function $f(\cdot)$ is $(\epsilon, \gamma, \delta)$-strict saddle if for any $\mathbf{x}$, at least one of the following holds:

  1. $\|\nabla f(\mathbf{x})\| \ge \epsilon$.

  2. $\lambda_{\min}(\nabla^2 f(\mathbf{x})) \le -\gamma$.

  3. $\mathbf{x}$ is $\delta$-close to $\mathcal{X}^\star$, the set of local minima.

Intuitively, this definition says that any point $\mathbf{x}$ either violates one of the optimality conditions significantly (the first two cases), or is close to a local minimum. Note that these parameters are often closely related. For a function with the strict-saddle property, it is possible to efficiently find a point near a local minimum.
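The three cases of the strict-saddle property are easy to check numerically at any given point. The following sketch (the toy function and thresholds are illustrative assumptions, not from the paper) classifies points of $f(\mathbf{x}) = \frac{1}{4}(\|\mathbf{x}\|^2 - 1)^2$, which has its minima on the unit sphere and a strict saddle at the origin:

```python
import numpy as np

def classify(x, grad_f, hess_f, eps=1e-3, gamma=1e-3):
    # Check the three strict-saddle cases at the point x, in order.
    if np.linalg.norm(grad_f(x)) >= eps:
        return "large gradient"
    if np.linalg.eigvalsh(hess_f(x))[0] <= -gamma:   # smallest eigenvalue
        return "negative curvature"
    return "near a local minimum"

# f(x) = (1/4)(||x||^2 - 1)^2: gradient and Hessian in closed form.
grad = lambda x: (x @ x - 1.0) * x
hess = lambda x: (x @ x - 1.0) * np.eye(len(x)) + 2.0 * np.outer(x, x)

print(classify(np.zeros(3), grad, hess))            # negative curvature
print(classify(np.array([1.0, 0, 0]), grad, hess))  # near a local minimum
print(classify(np.array([2.0, 0, 0]), grad, hess))  # large gradient
```

At the origin the gradient vanishes but the Hessian is $-\mathbf{I}$, so a second order method (or noisy gradient descent) can escape; this is exactly the behavior the strict-saddle property quantifies.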

Local vs. Global

However, finding a local minimum is of course not sufficient in many cases. In this paper we also prove that all local minima are globally optimal, and that they correspond to the desired solutions.

3 Low Rank Problems and Our Results

In this section we introduce matrix sensing, matrix completion and robust PCA. For each problem we give the results obtained by our framework. The proof ideas are illustrated later in Sections 4 and 5.

3.1 Matrix Sensing

Matrix sensing (Recht et al., 2010) is a generalization of compressed sensing (Candes et al., 2006). In the matrix sensing problem, there is an unknown low rank matrix $\mathbf{M}^\star$. We make linear observations of this matrix: let $\mathbf{A}_1, \dots, \mathbf{A}_m$ be sensing matrices; the algorithm is given the $\mathbf{A}_i$'s and the corresponding observations $b_i = \langle \mathbf{A}_i, \mathbf{M}^\star \rangle$. The goal is to find the unknown matrix $\mathbf{M}^\star$. To do so, we need to solve the following nonconvex optimization problem:

$$\min_{\mathbf{M}} \; \frac{1}{2m} \sum_{i=1}^{m} \left( \langle \mathbf{A}_i, \mathbf{M} \rangle - b_i \right)^2 \quad \text{s.t.} \quad \mathrm{rank}(\mathbf{M}) \le r.$$

We can transform this constrained problem into an unconstrained problem by expressing $\mathbf{M}$ as $\mathbf{U}\mathbf{V}^\top$, where $\mathbf{U}$ and $\mathbf{V}$ have $r$ columns. We also need an additional regularizer (common for all asymmetric problems):

$$\min_{\mathbf{U}, \mathbf{V}} \; \frac{1}{2m} \sum_{i=1}^{m} \left( \langle \mathbf{A}_i, \mathbf{U}\mathbf{V}^\top \rangle - b_i \right)^2 + Q(\mathbf{U}, \mathbf{V}), \quad Q(\mathbf{U}, \mathbf{V}) \propto \left\| \mathbf{U}^\top \mathbf{U} - \mathbf{V}^\top \mathbf{V} \right\|_F^2. \tag{3}$$

The regularizer $\|\mathbf{U}^\top \mathbf{U} - \mathbf{V}^\top \mathbf{V}\|_F^2$ has been widely used in previous works (Zheng and Lafferty, 2016; Park et al., 2016). In Section 5 we show how this regularizer can be viewed as a way to deal with the additional invariances of the asymmetric case, and to reduce the asymmetric case to the symmetric case. A crucial concept in the standard sensing literature is the Restricted Isometry Property (RIP), which is defined as follows:

Definition 3.

A group of sensing matrices $\{\mathbf{A}_1, \dots, \mathbf{A}_m\}$ satisfies the $(r, \delta_r)$-RIP condition if for every matrix $\mathbf{M}$ of rank at most $r$,

$$(1 - \delta_r) \|\mathbf{M}\|_F^2 \le \frac{1}{m} \sum_{i=1}^{m} \langle \mathbf{A}_i, \mathbf{M} \rangle^2 \le (1 + \delta_r) \|\mathbf{M}\|_F^2.$$

Intuitively, RIP says the sensing operator approximately preserves norms for all low rank matrices. When the sensing matrices are chosen i.i.d. with independent Gaussian entries, and the number of observations $m$ is at least a large enough constant times $r$ times the larger dimension (with a $1/\delta_r^2$ dependence), the sensing matrices satisfy the $(r, \delta_r)$-RIP condition (Candes and Plan, 2011). Using our framework we can show:
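The norm-preserving behavior is easy to observe empirically. The sketch below (an illustration of concentration for randomly drawn low rank matrices, not a proof of the uniform RIP condition; sizes are illustrative assumptions) checks that $\frac{1}{m}\sum_i \langle \mathbf{A}_i, \mathbf{M}\rangle^2 \approx \|\mathbf{M}\|_F^2$ for Gaussian sensing matrices:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m = 30, 2, 4000
A = rng.standard_normal((m, n, n))       # m Gaussian sensing matrices

ratios = []
for _ in range(20):
    M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-2 test matrix
    q = np.einsum('kij,ij->k', A, M)     # all inner products <A_k, M> at once
    ratios.append(float(q @ q / m / (M ** 2).sum()))
print(min(ratios), max(ratios))          # both close to 1
```

Each $\langle \mathbf{A}_i, \mathbf{M}\rangle$ is Gaussian with variance $\|\mathbf{M}\|_F^2$, so the empirical average of the squares concentrates around $\|\mathbf{M}\|_F^2$ at rate roughly $\sqrt{2/m}$.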

Theorem 3.

When the measurements satisfy the $(2r, \delta_{2r})$-RIP condition for a sufficiently small constant $\delta_{2r}$, for the matrix sensing objective (3) we have: 1) all local minima satisfy $\mathbf{U}\mathbf{V}^\top = \mathbf{M}^\star$; 2) the function satisfies the strict saddle property.

This in particular says that 1) no spurious local minima exist; 2) whenever we are at a point where the gradient is small and the Hessian does not have a significant negative eigenvalue, the distance to the global optimum (see Definition 6 and Definition 7) is guaranteed to be small. Such a point can be found efficiently (see Section 6).

3.2 Matrix Completion

Matrix completion is a popular technique in recommendation systems and collaborative filtering (Koren, 2009; Rennie and Srebro, 2005). In this problem, we again have an unknown low rank matrix $\mathbf{M}^\star$. We observe each entry of the matrix independently with probability $p$. Let $\Omega$ be the set of observed entries. For any matrix $\mathbf{M}$, we use $\mathbf{P}_\Omega(\mathbf{M})$ to denote the matrix whose entries outside of $\Omega$ are set to 0. That is, $[\mathbf{P}_\Omega(\mathbf{M})]_{i,j} = \mathbf{M}_{i,j}$ if $(i,j) \in \Omega$, and $[\mathbf{P}_\Omega(\mathbf{M})]_{i,j} = 0$ otherwise. We further use $\mathbf{P}_{\bar\Omega}$ to denote the complementary projection $\mathbf{P}_{\bar\Omega}(\mathbf{M}) = \mathbf{M} - \mathbf{P}_\Omega(\mathbf{M})$. Matrix completion can be viewed as a special case of matrix sensing, where the sensing matrices each have only one nonzero entry. However, such matrices do not satisfy the RIP condition.
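The sampling operator $\mathbf{P}_\Omega$ can be sketched in a few lines (the sizes and sampling rate below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 0.5
M = rng.standard_normal((n, n))
mask = rng.random((n, n)) < p        # each entry observed independently w.p. p

def P_Omega(X):
    # Zero out every entry outside the observed set Omega.
    return np.where(mask, X, 0.0)

# P_Omega is a linear projection: applying it twice changes nothing,
# and M splits into its observed and unobserved parts.
assert np.allclose(P_Omega(P_Omega(M)), P_Omega(M))
assert np.allclose(P_Omega(M) + np.where(mask, 0.0, M), M)
```

The complementary projection $\mathbf{P}_{\bar\Omega}$ is simply `X - P_Omega(X)`.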

In order to solve matrix completion, we try to optimize the following:

$$\min_{\mathbf{U}, \mathbf{V}} \; \frac{1}{2p} \left\| \mathbf{P}_\Omega(\mathbf{U}\mathbf{V}^\top - \mathbf{M}^\star) \right\|_F^2.$$

A well-known issue in matrix completion is that when the true matrix is very sparse, we are very likely to observe only zero entries, and have no chance of learning the other entries of $\mathbf{M}^\star$. To avoid this case, previous works have assumed the following incoherence condition:

Definition 4.

A rank-$r$ matrix $\mathbf{M}$ is $\mu$-incoherent if, for the rank-$r$ SVD $\mathbf{U}\mathbf{\Sigma}\mathbf{V}^\top$ of $\mathbf{M}$, every row of $\mathbf{U}$ has norm at most $\sqrt{\mu r / (\text{number of rows})}$, and similarly for $\mathbf{V}$.

We assume the unknown optimal low rank matrix $\mathbf{M}^\star$ is $\mu$-incoherent.
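Incoherence can be computed directly from the SVD. The helper below (a sketch; the function name and test matrices are illustrative) returns the smallest $\mu$ for which the definition holds:

```python
import numpy as np

def incoherence(M, r):
    # Smallest mu such that every row i of the rank-r singular factors satisfies
    # ||U_i|| <= sqrt(mu * r / num_rows), and similarly for V.
    U, _, Vt = np.linalg.svd(M)
    U, V = U[:, :r], Vt[:r, :].T
    mu_u = U.shape[0] / r * (U ** 2).sum(axis=1).max()
    mu_v = V.shape[0] / r * (V ** 2).sum(axis=1).max()
    return max(mu_u, mu_v)

spike = np.zeros((10, 10)); spike[0, 0] = 1.0
flat = np.ones((10, 10))
print(incoherence(spike, 1))   # 10.0: a single-entry matrix is maximally coherent
print(incoherence(flat, 1))    # 1.0: a flat matrix is maximally incoherent
```

The "spike" matrix is exactly the bad case described above: with high probability none of its mass is observed, so completion is hopeless without the incoherence assumption.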

In the non-convex program, we try to make sure the decomposition is also incoherent, by adding a regularizer of the form

$$Q(\mathbf{U}, \mathbf{V}) = \lambda_1 \sum_{i} \left( \|\mathbf{U}_{i,\cdot}\| - \alpha_1 \right)_+^4 + \lambda_2 \sum_{j} \left( \|\mathbf{V}_{j,\cdot}\| - \alpha_2 \right)_+^4.$$

Here $\lambda_1, \lambda_2, \alpha_1, \alpha_2$ are parameters that we choose later, and $(x)_+$ denotes $\max(x, 0)$. Using this regularizer, we can now transform the objective function to the unconstrained form

$$\min_{\mathbf{U}, \mathbf{V}} \; \frac{1}{2p} \left\| \mathbf{P}_\Omega(\mathbf{U}\mathbf{V}^\top - \mathbf{M}^\star) \right\|_F^2 + Q(\mathbf{U}, \mathbf{V}). \tag{4}$$
Using the framework, we can show the following:

Theorem 4.

When the sample rate $p$ is sufficiently large and the regularization parameters are chosen appropriately, the following holds with high probability for objective function (4): 1) all local minima satisfy $\mathbf{U}\mathbf{V}^\top = \mathbf{M}^\star$; 2) the objective is strict saddle for polynomially small $\epsilon$.

3.3 Robust PCA

Robust PCA (Candès et al., 2011) is a generalization of standard Principal Component Analysis. In robust PCA, we are given an observation matrix $\mathbf{M} = \mathbf{M}^\star + \mathbf{S}^\star$, which is a true underlying low rank matrix $\mathbf{M}^\star$ corrupted by a sparse noise matrix $\mathbf{S}^\star$. The goal is, in some sense, to decompose the matrix $\mathbf{M}$ into these two components. There are many models for how many entries can be perturbed and how they are distributed. In this paper we work in the setting where $\mathbf{M}^\star$ is $\mu$-incoherent, and the rows and columns of $\mathbf{S}^\star$ each have at most an $\alpha$-fraction of nonzero entries.

In order to express robust PCA as an optimization problem, we need constraints on both the low rank part and the sparse part $\mathbf{S}$:


There can be several ways to specify the sparsity of $\mathbf{S}$. In this paper we restrict our attention to the following set:

We assume the true sparse matrix $\mathbf{S}^\star$ lies in this set. Note that the $\ell_\infty$-norm requirement on $\mathbf{S}$ is without loss of generality: by incoherence, $\mathbf{M}^\star$ cannot have entries of large absolute value, so any entry of the observation larger than that bound is obviously in the support of $\mathbf{S}^\star$ and can be truncated.
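A simplified sparsification step can be sketched as follows. Note this is only a row-wise heuristic for illustration: the paper's set also constrains columns and uses an exact polynomial-time projection (see Section 4.3); the function name and sizes are assumptions.

```python
import numpy as np

def keep_top_per_row(R, alpha):
    # Keep only the alpha-fraction largest-magnitude entries in each row,
    # zeroing out the rest (row constraint only; no column constraint here).
    k = max(1, int(alpha * R.shape[1]))
    S = np.zeros_like(R)
    for i, row in enumerate(R):
        idx = np.argsort(-np.abs(row))[:k]
        S[i, idx] = row[idx]
    return S

rng = np.random.default_rng(3)
R = rng.standard_normal((6, 10))
S = keep_top_per_row(R, 0.2)
print((S != 0).sum(axis=1))   # at most 2 nonzeros per row (20% of 10 columns)
```

Any sparse estimate produced this way automatically satisfies the per-row density constraint; enforcing the column constraint simultaneously is what requires the max-flow based projection.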

In the objective function, we allow $\mathbf{S}$ to be several times denser than $\mathbf{S}^\star$, by a factor that we choose later. Now the constrained optimization problem can be transformed into the unconstrained problem


Of course, we can also think of this as a joint minimization problem over the low rank factor and $\mathbf{S}$. However, we choose to present it this way in order to allow an extension of the strict-saddle condition. Since the objective is not twice-differentiable with respect to $\mathbf{S}$, it does not admit a Hessian, so we use the following generalized version of the strict saddle property:

Definition 5.

We say a function $f$ is $(\epsilon, \gamma, \delta)$-pseudo strict saddle if for any $\mathbf{x}$, at least one of the following holds:

  1. $\|\nabla f(\mathbf{x})\| \ge \epsilon$.

  2. There exists a twice-differentiable function $g$ with $g(\mathbf{x}) = f(\mathbf{x})$ and $g \ge f$ in a neighborhood of $\mathbf{x}$, such that $\lambda_{\min}(\nabla^2 g(\mathbf{x})) \le -\gamma$.

  3. $\mathbf{x}$ is $\delta$-close to $\mathcal{X}^\star$, the set of local minima.

Note that the upper bound in case 2 can be viewed as analogous to the idea of a subgradient. For functions with non-differentiable points, the subgradient is defined so that it still offers a lower bound for the function. Our case is very similar: although the Hessian is not defined, we can use a smooth function that upper-bounds the current function (an upper bound is required because we are minimizing). In the case of robust PCA the upper bound is obtained by fixing $\mathbf{S}$. Using this formalization we can prove:

Theorem 5.

There is an absolute constant $c$ such that, if the corruption fraction $\alpha$ is small enough, then for objective function Eq. (6) we have: 1) all local minima satisfy $\mathbf{U}\mathbf{V}^\top = \mathbf{M}^\star$; 2) the objective function is pseudo strict saddle for polynomially small $\epsilon$.

4 Framework for Symmetric Positive Definite Problems

In this section we describe our framework in the simpler setting where the desired matrix is positive semidefinite. In particular, suppose the true matrix we are looking for can be written as $\mathbf{M}^\star = \mathbf{U}^\star (\mathbf{U}^\star)^\top$ where $\mathbf{U}^\star \in \mathbb{R}^{n \times r}$. For objective functions that are quadratic over $\mathbf{M}$, we denote the (constant) Hessian as $\mathcal{H}$ and write the objective as

$$f(\mathbf{M}) = \frac{1}{2}\, \mathcal{H}(\mathbf{M} - \mathbf{M}^\star, \mathbf{M} - \mathbf{M}^\star). \tag{7}$$

We call this objective function $f$. Via the Burer-Monteiro factorization, the corresponding unconstrained optimization problem, with regularization, can be written as

$$\min_{\mathbf{U} \in \mathbb{R}^{n \times r}} \; f(\mathbf{U}\mathbf{U}^\top) + Q(\mathbf{U}). \tag{8}$$

In this section, we also write $f(\mathbf{U})$ for the objective as a function of $\mathbf{U}$, abusing the notation $f$ previously defined over $\mathbf{M}$.

Direction of Improvement

The optimality conditions (Definition 1) imply that if the gradient is nonzero, or if we can find a negative direction of the Hessian (that is, a direction $\mathbf{\Delta}$ such that $\langle \mathbf{\Delta}, \nabla^2 f(\mathbf{U})(\mathbf{\Delta}) \rangle < 0$), then the point is not a local minimum. A common technique for characterizing the optimization landscape is therefore to explicitly exhibit this negative direction. We call this the direction of improvement. Different works (Bhojanapalli et al., 2016; Ge et al., 2016) have chosen very different directions of improvement.

In our framework, we show it suffices to choose a single direction of improvement. Intuitively, this direction should bring us closer to the true solution $\mathbf{U}^\star$ from the current point $\mathbf{U}$. Due to rotational symmetry ($\mathbf{U}$ and $\mathbf{U}\mathbf{R}$ behave the same for the objective if $\mathbf{R}$ is a rotation matrix), we need to carefully define the difference between $\mathbf{U}$ and $\mathbf{U}^\star$.

Definition 6.

Given matrices $\mathbf{U}, \mathbf{U}^\star \in \mathbb{R}^{n \times r}$, define their difference as $\mathbf{\Delta} = \mathbf{U} - \mathbf{U}^\star \mathbf{R}$, where the orthonormal matrix $\mathbf{R}$ is chosen as $\arg\min_{\mathbf{Z} : \mathbf{Z}^\top \mathbf{Z} = \mathbf{I}} \|\mathbf{U} - \mathbf{U}^\star \mathbf{Z}\|_F^2$.

Note that this definition tries to "align" $\mathbf{U}$ and $\mathbf{U}^\star$ before taking their difference, and is therefore invariant under rotations. In particular, this definition has the nice property that as long as $\mathbf{U}\mathbf{U}^\top$ is close to $\mathbf{M}^\star$, the difference $\|\mathbf{\Delta}\|_F$ is small (we defer the proof to the appendix):
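The minimizing rotation in Definition 6 is the classical orthogonal Procrustes problem, which has a closed-form solution via the SVD of $(\mathbf{U}^\star)^\top \mathbf{U}$. A small sketch (the sizes are illustrative assumptions):

```python
import numpy as np

def align(U, Ustar):
    # Orthogonal Procrustes: R = argmin_{Z orthonormal} ||U - Ustar @ Z||_F,
    # solved in closed form from the SVD of Ustar^T U.
    A, _, Bt = np.linalg.svd(Ustar.T @ U)
    return A @ Bt

rng = np.random.default_rng(4)
Ustar = rng.standard_normal((12, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))  # a random orthogonal matrix
U = Ustar @ Q               # same M = U U^T as Ustar, but rotated factors
Delta = U - Ustar @ align(U, Ustar)
print(np.linalg.norm(Delta))   # ~0: the difference is invariant to rotation
```

Here the factors differ only by a rotation of the column space, so after alignment the difference vanishes, which is exactly the invariance the definition is designed to provide.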

Lemma 6.

Given matrices $\mathbf{U}, \mathbf{U}^\star \in \mathbb{R}^{n \times r}$, let $\mathbf{M} = \mathbf{U}\mathbf{U}^\top$ and $\mathbf{M}^\star = \mathbf{U}^\star (\mathbf{U}^\star)^\top$, and let $\mathbf{\Delta}$ be defined as in Definition 6. Then we have $\|\mathbf{\Delta}\mathbf{\Delta}^\top\|_F^2 \le 2\,\|\mathbf{M} - \mathbf{M}^\star\|_F^2$, and $\sigma_r^\star\, \|\mathbf{\Delta}\|_F^2 \le \frac{1}{2(\sqrt{2}-1)}\, \|\mathbf{M} - \mathbf{M}^\star\|_F^2$.

Now we can state the main Lemma:

Lemma 7 (Main).

For the objective (8), let $\mathbf{\Delta}$ be defined as in Definition 6, let $\mathbf{M} = \mathbf{U}\mathbf{U}^\top$, and write $G(\mathbf{U}) = f(\mathbf{U}\mathbf{U}^\top) + Q(\mathbf{U})$. Then, for any $\mathbf{U}$, we have

$$\langle \mathbf{\Delta}, \nabla^2 G(\mathbf{U})(\mathbf{\Delta}) \rangle = \mathcal{H}(\mathbf{\Delta}\mathbf{\Delta}^\top, \mathbf{\Delta}\mathbf{\Delta}^\top) - 3\, \mathcal{H}(\mathbf{M} - \mathbf{M}^\star, \mathbf{M} - \mathbf{M}^\star) + 4\, \langle \nabla G(\mathbf{U}), \mathbf{\Delta} \rangle + \left[ \langle \mathbf{\Delta}, \nabla^2 Q(\mathbf{U})(\mathbf{\Delta}) \rangle - 4\, \langle \nabla Q(\mathbf{U}), \mathbf{\Delta} \rangle \right]. \tag{9}$$
To see why this lemma is useful, let us look at the simplest case where $Q = 0$ and $\mathcal{H}$ is the identity. In this case, if the gradient is zero, Eq. (9) gives

$$\langle \mathbf{\Delta}, \nabla^2 f(\mathbf{U})(\mathbf{\Delta}) \rangle = \|\mathbf{\Delta}\mathbf{\Delta}^\top\|_F^2 - 3\, \|\mathbf{M} - \mathbf{M}^\star\|_F^2.$$

By Lemma 6 this is no more than $-\|\mathbf{M} - \mathbf{M}^\star\|_F^2$. Therefore, all stationary points with $\mathbf{U}\mathbf{U}^\top \ne \mathbf{M}^\star$ must be saddle points, and we immediately conclude that all local minima satisfy $\mathbf{U}\mathbf{U}^\top = \mathbf{M}^\star$!

Interaction with Regularizer

For problems such as matrix completion, the Hessian does not preserve the norm for all low rank matrices. In these cases we need an additional regularizer. Conceptually, we need the following steps:

  1. Show that the regularizer ensures that any point with small gradient lies in a nice set (e.g. the set of incoherent matrices).

  2. Show that whenever the point lies in this set, the Hessian operator behaves similarly to the identity on the matrices appearing in Equation (9).

  3. Show that the regularizer does not contribute a large positive term to Equation (9); that is, upper-bound the regularizer-related terms.

Interestingly, these steps are useful beyond handling regularizers. Any deviation from the original model (such as noise, or the optimal matrix not being exactly low rank) can be viewed as an additional "regularizer" function and argued in the same framework; see e.g. Section C.

4.1 Matrix Sensing

Matrix sensing is the ideal setting for this framework. For symmetric matrix sensing, the objective function is

$$f(\mathbf{U}) = \frac{1}{2m} \sum_{i=1}^{m} \left( \langle \mathbf{A}_i, \mathbf{U}\mathbf{U}^\top \rangle - b_i \right)^2. \tag{10}$$

Recall that the matrices $\mathbf{A}_i$ are known sensing matrices, and $b_i$ is the result of the $i$-th observation. The intended solution is the unknown low rank matrix $\mathbf{M}^\star$. For any matrix $\mathbf{M}$, the Hessian operator satisfies

$$\mathcal{H}(\mathbf{M}, \mathbf{M}) = \frac{1}{m} \sum_{i=1}^{m} \langle \mathbf{A}_i, \mathbf{M} \rangle^2.$$

Therefore if the sensing matrices satisfy the RIP property (Definition 3), the Hessian operator is close to the identity for all low rank matrices. In the symmetric case there is no regularizer, so the landscape result for symmetric matrix sensing follows immediately from our main Lemma 7.

Theorem 8.

When the measurements satisfy the $(2r, \delta_{2r})$-RIP condition for a sufficiently small constant $\delta_{2r}$, for the matrix sensing objective (10) we have: 1) all local minima satisfy $\mathbf{U}\mathbf{U}^\top = \mathbf{M}^\star$; 2) the function satisfies the strict saddle property.


Proof. For a point $\mathbf{U}$ with small gradient satisfying $\|\nabla f(\mathbf{U})\| \le \epsilon$, the RIP property combined with Lemma 7 upper-bounds the Hessian quadratic form $\langle \mathbf{\Delta}, \nabla^2 f(\mathbf{U})(\mathbf{\Delta}) \rangle$ by a negative constant times $\|\mathbf{M} - \mathbf{M}^\star\|_F^2$ plus a term of order $\epsilon\, \|\mathbf{\Delta}\|_F$. The second-to-last inequality in this chain is due to Lemma 6 ($\|\mathbf{\Delta}\mathbf{\Delta}^\top\|_F^2 \le 2\|\mathbf{M} - \mathbf{M}^\star\|_F^2$), and the last inequality is due to the small-gradient assumption and the second part of Lemma 6. This means that if $\mathbf{M} = \mathbf{U}\mathbf{U}^\top$ is not close to $\mathbf{M}^\star$, that is, if $\|\mathbf{M} - \mathbf{M}^\star\|_F$ is large relative to $\epsilon$, then the Hessian has a strictly negative direction along $\mathbf{\Delta}$. This proves the strict saddle property. Taking $\epsilon \to 0$, we see that all stationary points with $\mathbf{U}\mathbf{U}^\top \ne \mathbf{M}^\star$ are saddle points. This means all local minima are global minima (satisfying $\mathbf{U}\mathbf{U}^\top = \mathbf{M}^\star$), which finishes the proof. ∎

4.2 Matrix Completion

For matrix completion, we need to ensure the incoherence condition (Definition 4). To do that, we add a regularizer that penalizes the objective when some row of $\mathbf{U}$ is too large. We choose the same regularizer as Ge et al. (2016): $Q(\mathbf{U}) = \lambda \sum_{i} \left( \|\mathbf{U}_{i,\cdot}\| - \alpha \right)_+^4$. The objective is then

$$\min_{\mathbf{U}} \; \frac{1}{2p} \left\| \mathbf{P}_\Omega(\mathbf{U}\mathbf{U}^\top - \mathbf{M}^\star) \right\|_F^2 + Q(\mathbf{U}). \tag{11}$$
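The row-norm regularizer is cheap to evaluate. A minimal sketch (the parameter values and test matrix are illustrative assumptions, not the paper's choices):

```python
import numpy as np

def Q(U, alpha, lam):
    # Fourth-power hinge on row norms: rows with ||U_i|| <= alpha are free;
    # larger (incoherence-violating) rows pay lam * (||U_i|| - alpha)^4.
    row_norms = np.linalg.norm(U, axis=1)
    return lam * (np.maximum(row_norms - alpha, 0.0) ** 4).sum()

U = np.vstack([0.1 * np.ones((4, 2)), [[5.0, 0.0]]])   # one "spiky" row
print(Q(U, alpha=1.0, lam=1.0))   # 256.0: only the spiky row is penalized
```

Because the penalty is identically zero on rows below the threshold, the regularizer does not distort the landscape near incoherent solutions, which is exactly the interaction with the Hessian discussed in Section 4.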


Using our framework, we first need to show that the regularizer ensures all rows of $\mathbf{U}$ are small (step 1).

Lemma 9.

There exists an absolute constant $c$ such that, when the sample rate $p$ is sufficiently large and the regularization parameters are chosen appropriately, the following holds with high probability: any point $\mathbf{U}$ whose gradient is polynomially small has all rows bounded, i.e. $\mathbf{U}$ is incoherent.

This is a slightly stronger version of Lemma 4.7 in Ge et al. (2016). Next we show that under this regularizer, we can still select the direction $\mathbf{\Delta}$, and the first part of Equation (9) is significantly negative when $\|\mathbf{M} - \mathbf{M}^\star\|_F$ is large (step 2):

Lemma 10.

When the sample rate $p$ is sufficiently large and the regularization parameters are chosen appropriately, with high probability the following holds: for all $\mathbf{U}$ with polynomially small gradient, the Hessian part of Equation (9) is at most a negative constant times $\|\mathbf{M} - \mathbf{M}^\star\|_F^2$.

This lemma follows from several standard concentration inequalities, and is made possible by the incoherence bound proved in the previous lemma.

Finally, we show that the additional regularizer-related term in Equation (9) is bounded (step 3).

Lemma 11.

By choosing the regularization parameters $\lambda$ and $\alpha$ appropriately, we have:

Combining these three lemmas, it is easy to see

Theorem 12.

When the sample rate $p$ is sufficiently large and the regularization parameters $\lambda$ and $\alpha$ are chosen appropriately, with high probability, for the matrix completion objective (11) we have: 1) all local minima satisfy $\mathbf{U}\mathbf{U}^\top = \mathbf{M}^\star$; 2) the function is strict saddle for polynomially small $\epsilon$.

Notice that our proof is different from Ge et al. (2016): we focus on the single direction $\mathbf{\Delta}$ for both the first and second order conditions, while they need to select different directions for the Hessian. The framework allows us to obtain a simpler proof, generalize to the asymmetric case, and improve the dependence on the rank.

4.3 Robust PCA

In the robust PCA problem, for any given low rank estimate, the objective function tries to find the optimal sparse perturbation $\mathbf{S}$. In the symmetric PSD case, recall we observe $\mathbf{M} = \mathbf{M}^\star + \mathbf{S}^\star$; we define the set of allowed sparse matrices to be those whose rows and columns each have at most an $\alpha$-fraction of nonzero entries (with the $\ell_\infty$ bound discussed above).

Note that the projection onto this set can be computed in polynomial time (using a max flow algorithm).

We assume $\mathbf{S}^\star$ lies in this set; the objective can be written as


Here the allowed extra density of $\mathbf{S}$ is a slack parameter that we choose later.

Note that now the objective function is not quadratic, so we cannot use the framework directly. However, if we fix $\mathbf{S}$, the objective is a quadratic function of $\mathbf{U}\mathbf{U}^\top$ with Hessian equal to the identity. We can still apply our framework to this function. In this case, since the Hessian is the identity for all matrices, we can skip the first step. The problem becomes a matrix factorization problem:


The difference here is that the target matrix (which is $\mathbf{M} - \mathbf{S}$) is not equal to $\mathbf{M}^\star$ and is in general not low rank. We can use the framework to analyze this problem (and treat the residual as the "regularizer" $Q$).

Lemma 13.

Let $\mathbf{A}$ be a symmetric PSD matrix, and let the matrix factorization objective be

$$g(\mathbf{U}) = \frac{1}{2}\, \|\mathbf{U}\mathbf{U}^\top - \mathbf{A}\|_F^2,$$

where $\mathbf{U} \in \mathbb{R}^{n \times r}$. Then 1) all local minima satisfy $\mathbf{U}\mathbf{U}^\top = \mathbf{A}_r$ (the best rank-$r$ approximation of $\mathbf{A}$); 2) the objective is strict saddle.

To deal with the case where $\mathbf{S}$ is not fixed (but is the minimizer of Eq. (12)), we let $\mathbf{A}_r$ be the best rank-$r$ approximation of $\mathbf{M} - \mathbf{S}$. The next lemma shows that when $\mathbf{U}\mathbf{U}^\top$ is close to $\mathbf{A}_r$ up to rotation, $\mathbf{U}$ is actually already close to $\mathbf{U}^\star$ up to rotation.

Lemma 14.

There is an absolute constant $c$ such that the following holds when the corruption fraction $\alpha$ is small enough. Let $\mathbf{A}_r$ be the best rank-$r$ approximation of $\mathbf{M} - \mathbf{S}$, where $\mathbf{S}$ is the minimizer as in Eq. (12). Assume $\mathbf{U}\mathbf{U}^\top$ is close to $\mathbf{A}_r$. Let $\mathbf{\Delta}$ be defined as in Definition 6; then $\|\mathbf{\Delta}\|_F$ is small for polynomially small $\epsilon$.

The proof of Lemma 14 is inspired by Yi et al. (2016) and uses the properties of the optimally chosen sparse estimate $\mathbf{S}$. Combining these two lemmas we get our main result:

Theorem 15.

There is an absolute constant $c$ such that, if the corruption fraction $\alpha$ is small enough, then for objective function Eq. (12) we have: 1) all local minima satisfy $\mathbf{U}\mathbf{U}^\top = \mathbf{M}^\star$; 2) the objective function is pseudo strict saddle for polynomially small $\epsilon$.

5 Handling Asymmetric Matrices

In this section we show how to reduce problems on asymmetric matrices to problems on symmetric PSD matrices.

Let $\mathbf{U} \in \mathbb{R}^{m_1 \times r}$ and $\mathbf{V} \in \mathbb{R}^{m_2 \times r}$, and consider the objective function:

Note this is a scaled version of the objectives introduced in Section 3 (multiplied by a constant), and scaling does not change the locations of local minima, global minima and saddle points.

We view the problem as trying to find an $(m_1 + m_2) \times r$ matrix $\mathbf{W}$ whose first $m_1$ rows are equal to $\mathbf{U}$, and whose last $m_2$ rows are equal to $\mathbf{V}$.

Definition 7.

Suppose $\mathbf{M}^\star$ is the optimal solution, and its SVD is $\mathbf{X}^\star \mathbf{\Sigma}^\star (\mathbf{Y}^\star)^\top$. Let $\mathbf{U}^\star = \mathbf{X}^\star (\mathbf{\Sigma}^\star)^{1/2}$ and $\mathbf{V}^\star = \mathbf{Y}^\star (\mathbf{\Sigma}^\star)^{1/2}$, and let $(\mathbf{U}, \mathbf{V})$ be the current point. We reduce the problem to the symmetric case using the following notation:

$$\mathbf{W} = \begin{pmatrix} \mathbf{U} \\ \mathbf{V} \end{pmatrix}, \qquad \mathbf{W}^\star = \begin{pmatrix} \mathbf{U}^\star \\ \mathbf{V}^\star \end{pmatrix}. \tag{14}$$

Further, $\mathbf{\Delta}$ is defined to be the difference between $\mathbf{W}$ and $\mathbf{W}^\star$ up to rotation, as in Definition 6.

We will also transform the Hessian operator to operate on $(m_1 + m_2) \times (m_1 + m_2)$ matrices. In particular, we define a Hessian whose quadratic form on such matrices depends only on their off-diagonal blocks, on which it agrees with $\mathcal{H}$. Now, let $\mathbf{N} = \mathbf{W}\mathbf{W}^\top$, and we can rewrite the objective function as


We know $\mathcal{H}$ preserves the norm of low rank matrices $\mathbf{U}\mathbf{V}^\top$. To reduce asymmetric problems to symmetric problems, intuitively, we also hope to approximately preserve the norm of $\mathbf{N}$. However this is impossible: by definition, the transformed Hessian only acts on $\mathbf{U}\mathbf{V}^\top$, which is the off-diagonal block of $\mathbf{N}$. We can expect its quadratic form to be close to the norm of the off-diagonal blocks, but among all factor pairs with the same product $\mathbf{U}\mathbf{V}^\top$, the matrix $\mathbf{N}$ can have very different norms. The easiest example is to rescale $(\mathbf{U}, \mathbf{V})$ to $(t\mathbf{U}, \mathbf{V}/t)$: the product $\mathbf{U}\mathbf{V}^\top$ does not change no matter what $t$ is, but the norm of the diagonal blocks of $\mathbf{N}$ grows with $t$ and can change drastically. The regularizer is there exactly to handle this case: the Hessian of the regularizer is related to the norm of the diagonal blocks, which allows the full Hessian to remain approximately the identity.
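The stacking construction and the role of the balancing regularizer can be verified in a few lines (sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
m1, m2, r = 5, 4, 2
U, V = rng.standard_normal((m1, r)), rng.standard_normal((m2, r))
W = np.vstack([U, V])          # first m1 rows are U, last m2 rows are V
N = W @ W.T

# The off-diagonal block of N is exactly the asymmetric product U V^T ...
assert np.allclose(N[:m1, m1:], U @ V.T)
# ... while rescaling (U, V) -> (2U, V/2) leaves U V^T unchanged but
# unbalances the diagonal blocks; ||U^T U - V^T V||_F^2 detects this.
U2, V2 = 2.0 * U, V / 2.0
assert np.allclose(U2 @ V2.T, U @ V.T)
print(np.linalg.norm(U.T @ U - V.T @ V),
      np.linalg.norm(U2.T @ U2 - V2.T @ V2))
```

The second printed imbalance is much larger than the first, so the regularizer penalizes exactly the degrees of freedom (the diagonal blocks of $\mathbf{N}$) that the loss cannot see.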

Now we can formalize the reduction as the following main Lemma:

Lemma 16.

For the objective (15), let $\mathbf{\Delta}$ be defined as in Definition 7. Then, for any $\mathbf{W}$, an analogue of Equation (9) holds for the combined Hessian of the loss plus the regularizer, where the additional terms come from the regularizer. Further, if $\mathcal{H}$ satisfies the RIP-style norm-preserving property, and $\mathbf{W}$ and $\mathbf{W}^\star$ are defined as in (14), then the combined Hessian approximately preserves the norms of the relevant matrices.

Intuitively, this lemma shows that the same direction of improvement works as before, and that the regularizer is exactly what is required to maintain the norm-preserving property of the Hessian.

Below we prove Theorem 3, which shows that for matrix sensing 1) all local minima satisfy $\mathbf{U}\mathbf{V}^\top = \mathbf{M}^\star$; 2) the strict saddle property is satisfied. Other proofs are deferred to the appendix.

Proof of Theorem 3.

In this case, $f$ is the matrix sensing loss and the regularization is the balancing term $Q(\mathbf{U}, \mathbf{V}) \propto \|\mathbf{U}^\top \mathbf{U} - \mathbf{V}^\top \mathbf{V}\|_F^2$. Since the measurement operator is RIP, by Lemma 16 the combined Hessian also satisfies an RIP-style norm-preserving property.

Similar to the symmetric case, for a point with small gradient, this property upper-bounds the Hessian quadratic form $\langle \mathbf{\Delta}, \nabla^2 G(\mathbf{W})(\mathbf{\Delta}) \rangle$ by a negative constant times $\|\mathbf{N} - \mathbf{N}^\star\|_F^2$ plus a term proportional to the gradient norm, where $\mathbf{N} = \mathbf{W}\mathbf{W}^\top$ and $\mathbf{N}^\star = \mathbf{W}^\star (\mathbf{W}^\star)^\top$. The second-to-last inequality is due to Lemma 6 ($\|\mathbf{\Delta}\mathbf{\Delta}^\top\|_F^2 \le 2\|\mathbf{N} - \mathbf{N}^\star\|_F^2$), and the last inequality is due to the small-gradient assumption and the second part of Lemma 6. This means that if $\mathbf{N}$ is not close to $\mathbf{N}^\star$, the Hessian has a strictly negative direction, which proves the strict saddle property. Taking the gradient threshold to zero, we see that all stationary points with $\mathbf{N} \ne \mathbf{N}^\star$ are saddle points. Hence all local minima satisfy $\mathbf{N} = \mathbf{N}^\star$, which in particular implies $\mathbf{U}\mathbf{V}^\top = \mathbf{M}^\star$ because $\mathbf{U}\mathbf{V}^\top$ is a submatrix of $\mathbf{N}$. ∎

6 Runtime

In this section we give the precise statement of Corollary 2: the runtime of algorithms implied by the geometric properties we prove.

In order to translate the geometric results into runtime guarantees, many algorithms require additional smoothness conditions. We say a function $f$ is $\beta$-smooth if for all $\mathbf{x}, \mathbf{y}$,

$$\|\nabla f(\mathbf{x}) - \nabla f(\mathbf{y})\| \le \beta\, \|\mathbf{x} - \mathbf{y}\|.$$

This is a standard assumption in optimization. In order to avoid saddle points, we say a function $f$ is $\rho$-Hessian Lipschitz if for all $\mathbf{x}, \mathbf{y}$,

$$\|\nabla^2 f(\mathbf{x}) - \nabla^2 f(\mathbf{y})\| \le \rho\, \|\mathbf{x} - \mathbf{y}\|.$$
We call an optimization algorithm saddle-avoiding if it is able to find a point with small gradient and an almost positive semidefinite Hessian.

Definition 8.

A local search algorithm is called saddle-avoiding if, for a function that is $\beta$-smooth and $\rho$-Hessian Lipschitz, given a point such that either