1 Introduction
Nonconvex optimization is one of the most powerful tools in machine learning. Many popular approaches, from traditional ones such as matrix factorization (Hotelling, 1933) to modern deep learning (Bengio, 2009), rely on optimizing nonconvex functions. In practice, these functions are optimized using simple algorithms such as alternating minimization or gradient descent. Why such simple algorithms work is still a mystery for many important problems. One way to understand the success of nonconvex optimization is to study the optimization landscape of the objective function: where are the possible locations of global optima, local optima and saddle points? Recently, a line of works showed that several natural problems, including tensor decomposition (Ge et al., 2015), dictionary learning (Sun et al., 2015a), matrix sensing (Bhojanapalli et al., 2016; Park et al., 2016) and matrix completion (Ge et al., 2016), have a well-behaved optimization landscape: all local optima are also globally optimal. Combined with recent results (e.g. Ge et al. (2015); Carmon et al. (2016); Agarwal et al. (2016); Jin et al. (2017)) that guarantee finding a local minimum for many nonconvex functions, such problems can be efficiently solved by basic optimization algorithms such as stochastic gradient descent. In this paper we focus on optimization problems that look for low rank matrices using partial or corrupted observations. Such problems are studied extensively (Fazel, 2002; Rennie and Srebro, 2005; Candès and Recht, 2009) and have many applications, e.g. in recommendation systems (Koren, 2009); see the survey by Davenport and Romberg (2016). These optimization problems can be formalized as follows:
$$\min_{M \in \mathbb{R}^{n_1\times n_2}} \; f(M), \quad \text{s.t.}\ \operatorname{rank}(M) \le r. \qquad (1)$$
Here $M$ is an $n_1\times n_2$ matrix and $f$ is a convex function of $M$. The nonconvexity of this problem stems from the low rank constraint. Several interesting problems, such as matrix sensing (Recht et al., 2010), matrix completion (Candès and Recht, 2009) and robust PCA (Candès et al., 2011), can all be framed as optimization problems of this form (see Section 3).
In practice, the Burer and Monteiro (2003) heuristic is often used: replace $M$ with an explicit low rank representation $M = UV^\top$, where $U \in \mathbb{R}^{n_1\times r}$ and $V \in \mathbb{R}^{n_2\times r}$. The new optimization problem becomes
$$\min_{U\in\mathbb{R}^{n_1\times r},\, V\in\mathbb{R}^{n_2\times r}} \; f(UV^\top) + Q(U, V). \qquad (2)$$
Here $Q(U, V)$ is an (optional) regularizer. Despite the objective being nonconvex, for all the problems mentioned above, simple iterative updates from a random or even arbitrary initial point find the optimal solution in practice. It is then natural to ask: can we characterize the similarities between the optimization landscapes of these problems? We show this is indeed possible:
Theorem 1 (informal).
The objective functions of matrix sensing, matrix completion and robust PCA have similar optimization landscapes. In particular, for all these problems, 1) all local minima are also globally optimal; 2) any saddle point has at least one strictly negative eigenvalue in its Hessian.
More precise theorem statements appear in Section 3. Note that there were several cases (matrix sensing (Bhojanapalli et al., 2016; Park et al., 2016), symmetric matrix completion (Ge et al., 2016)) where similar results on the optimization landscape were known. However, the techniques in previous works are tailored to the specific problems and hard to generalize. Our framework captures and simplifies all these previous results, and also gives new results on asymmetric matrix completion and robust PCA.
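As a toy illustration of the Burer-Monteiro heuristic discussed above (a sketch with made-up dimensions, step size and iteration count, not an experiment from this paper), plain gradient descent on the factorized objective $\frac{1}{2}\|UV^\top - M^\star\|_F^2$ recovers a well-conditioned low rank matrix from small random initialization:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 20, 3

# Well-conditioned rank-3 ground truth M* = A diag(3,2,1) B^T.
A = np.linalg.qr(rng.standard_normal((n, r)))[0]
B = np.linalg.qr(rng.standard_normal((n, r)))[0]
Mstar = A @ np.diag([3.0, 2.0, 1.0]) @ B.T

# Burer-Monteiro: optimize over the low rank factors U, V directly.
U = 0.3 * rng.standard_normal((n, r))
V = 0.3 * rng.standard_normal((n, r))
eta = 0.05
for _ in range(5000):
    R = U @ V.T - Mstar                          # residual
    U, V = U - eta * (R @ V), V - eta * (R.T @ U)  # gradient step on both factors

err = np.linalg.norm(U @ V.T - Mstar) / np.linalg.norm(Mstar)
print(err)  # small relative error despite the nonconvexity
```

Despite the nonconvex objective, no special initialization is needed here, which is exactly the phenomenon this paper explains.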
The key observation in our analysis is that for matrix sensing, matrix completion, and robust PCA (when fixing the sparse estimate), the function $f$ (in Equation (1)) is a quadratic function over the matrix $M$. Hence the Hessian $\mathcal{H}$ of $f$ with respect to $M$ is a constant. More importantly, the Hessian in all the above problems has similar properties (it approximately preserves norms, similar to the RIP property used in matrix sensing (Recht et al., 2010)), which allows their optimization landscapes to be characterized in a unified way. Specifically, our framework gives a principled way of defining a direction of improvement for all points that are not globally optimal. Another crucial property of our framework is the interaction between the regularizer $Q$ and the Hessian $\mathcal{H}$. Intuitively, the regularizer makes sure the solution is in a nice region (e.g. the set of incoherent matrices for matrix completion), and only within this region does the Hessian have the norm-preserving property. On the other hand, the regularizer should not be so large as to severely distort the landscape. This interaction is crucial for matrix completion, and is also very useful in handling noise and perturbations. In Section 4, we discuss the ideas required to apply this framework to matrix sensing, matrix completion and robust PCA.
Using this framework, we also give a way to reduce asymmetric matrix problems to symmetric PSD problems (where the desired matrix is of the form $UU^\top$). See Section 5 for more details.
In addition to showing no spurious local minima, our framework also implies that any saddle point has at least one strictly negative eigenvalue in its Hessian. Formally, we prove that all the above problems satisfy a robust version of this claim, the strict saddle property (see Definition 2), which is one of the crucial sufficient conditions for admitting efficient optimization algorithms, and thus implies the following corollary (see Section 6 for more details).
Corollary 2 (informal).
For matrix sensing, matrix completion and robust PCA, simple local search algorithms can find the desired low rank matrix
from an arbitrary starting point in polynomial time with high probability.
For simplicity, we present most results in the noiseless setting, but our results can also be generalized to handle noise. As an example, we show how to do this for matrix sensing in Section C.
1.1 Related Works
The landscape of low rank matrix problems has recently received a lot of attention. Ge et al. (2016) showed symmetric matrix completion has no spurious local minima. At the same time, Bhojanapalli et al. (2016) proved a similar result for symmetric matrix sensing. Park et al. (2016) extended the matrix sensing result to the asymmetric case. All of these works guarantee global convergence to the correct solution.
There has been a lot of work on the local convergence analysis of various algorithms and problems. For matrix sensing and matrix completion, several works (Keshavan et al., 2010a, b; Hardt and Wootters, 2014; Hardt, 2014; Jain et al., 2013; Chen and Wainwright, 2015; Sun and Luo, 2015; Zhao et al., 2015; Zheng and Lafferty, 2016; Tu et al., 2015) showed that given a good enough initialization, many simple local search algorithms, including gradient descent and alternating least squares, succeed. In particular, several works (e.g. Sun and Luo (2015); Zheng and Lafferty (2016)) accomplished this by showing that a geometric property very similar to strong convexity holds in the neighborhood of the optimal solution. For robust PCA, there are also many analyses of local convergence (Lin et al., 2010; Netrapalli et al., 2014; Yi et al., 2016; Zhang et al., 2017).
Several works also try to unify the analysis of similar problems. Bhojanapalli et al. (2015) gave a framework for the local analysis of these low rank problems. Belkin et al. (2014) showed a framework for learning basis functions, which generalizes tensor decompositions; their techniques imply the optimization landscapes of all such problems are very similar. For problems looking for a symmetric PSD matrix, Li and Tang (2016) showed that for objectives similar to (2) (but in the symmetric setting), restricted smoothness/strong convexity of the function $f$ suffices for local analysis. However, their framework does not address the interaction between the regularizer and the function $f$, hence cannot be directly applied to problems such as matrix completion or robust PCA.
Organization
We first introduce notation and basic optimality conditions in Section 2. Section 3 then introduces the problems and our results. For simplicity, we present our framework for the symmetric case in Section 4, and briefly discuss how to reduce asymmetric problems to symmetric problems in Section 5. We then show how our geometric results imply fast runtimes of popular local search algorithms in Section 6. For clean presentation, many proofs are deferred to the appendix.
2 Preliminaries
In this section we introduce notation and basic optimality conditions.
2.1 Notations
We use bold letters for matrices and vectors. For a vector $v$ we use $\|v\|$ to denote its $\ell_2$ norm. For a matrix $M$ we use $\|M\|$ to denote its spectral norm and $\|M\|_F$ to denote its Frobenius norm. For vectors we use $\langle u, v\rangle$ to denote the inner product, and for matrices we use $\langle M, N\rangle = \operatorname{tr}(M^\top N)$, where $\operatorname{tr}(\cdot)$ denotes the trace. We will always use $M^\star$ to denote the optimal low rank solution. Further, we use $\sigma_1^\star$ to denote its largest singular value, $\sigma_i^\star$ to denote its $i$th singular value, and $\kappa^\star = \sigma_1^\star/\sigma_r^\star$ to denote its condition number. We use $\nabla f$ to denote the gradient and $\nabla^2 f$ to denote the Hessian. Since the function $f$ can often be applied to both $M$ (as in (1)) and $(U, V)$ (as in (2)), we use $\nabla_M f$ to denote the gradient with respect to $M$ and $\nabla_U f$ (resp. $\nabla_V f$) to denote the gradient with respect to $U$ (resp. $V$). Similar notation is used for the Hessian. The Hessian $\mathcal{H}$ with respect to $M$ is a crucial object in our framework. It can be interpreted as a linear operator on matrices; this operator can be viewed as an $n_1 n_2 \times n_1 n_2$ matrix (or an $n^2 \times n^2$ matrix in the symmetric case) that applies to the vectorized version of matrices. We use the notation $\mathcal{H}(Z, Z)$ to denote the quadratic form of $\mathcal{H}$ at a matrix $Z$. Similarly, the Hessian of objective (2) is a linear operator on a pair of matrices $(U, V)$.
2.2 Optimality Conditions
Local Optimality
Suppose we are optimizing a function $f$ with no constraints on $x$. In order for a point $x$ to be a local minimum, it must satisfy the first and second order necessary conditions. That is, we must have $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$.
Definition 1 (Optimality Condition).
Suppose $x$ is a local minimum of $f$; then we have $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$.
Intuitively, if one of these conditions is violated, then it is possible to find a direction that decreases the function value. Ge et al. (2015) characterized the following strict-saddle property, which is a quantitative version of the optimality conditions and can lead to efficient algorithms for finding local minima.
Definition 2.
We say a function $f$ is $(\epsilon, \gamma, \delta)$-strict saddle if for any point $x$, at least one of the following holds:

1. $\|\nabla f(x)\| \ge \epsilon$;

2. $\lambda_{\min}(\nabla^2 f(x)) \le -\gamma$;

3. $x$ is $\delta$-close to $\mathcal{X}^\star$, the set of local minima.
Intuitively, this definition says that any point $x$ either violates one of the optimality conditions significantly (the first two cases) or is close to a local minimum. Note that the parameters in the first two cases are often closely related. For a function with the strict-saddle property, it is possible to efficiently find a point near a local minimum.
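As a minimal illustration (a toy one-dimensional example of our choosing, not from the paper), consider $f(u) = (u^2 - 1)^2$, the scalar analogue of the factorized objective $\|uu^\top - M^\star\|_F^2$. The stationary point $u = 0$ violates the second-order condition with a strictly negative second derivative, so it is a strict saddle point of the landscape, while $u = \pm 1$ are global minima:

```python
# f(u) = (u^2 - 1)^2 with derivatives worked out by hand.
f = lambda u: (u**2 - 1.0) ** 2
g = lambda u: 4.0 * u * (u**2 - 1.0)   # f'(u)
h = lambda u: 12.0 * u**2 - 4.0        # f''(u)

print(g(0.0), h(0.0))  # 0.0 -4.0 -> stationary, strictly negative curvature
print(g(1.0), h(1.0))  # 0.0 8.0  -> stationary, positive curvature: a minimum
```

Because the curvature at $u = 0$ is strictly negative (not merely zero), gradient-based methods with small perturbations escape it efficiently.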
Local vs. Global
Of course, finding a local minimum is not sufficient in many cases. In this paper we also prove that all local minima are globally optimal, and that they correspond to the desired solutions.
3 Low Rank Problems and Our Results
In this section we introduce matrix sensing, matrix completion and robust PCA. For each problem we give the results obtained by our framework. The proof ideas are illustrated later in Sections 4 and 5.
3.1 Matrix Sensing
Matrix sensing (Recht et al., 2010) is a generalization of compressed sensing (Candes et al., 2006). In the matrix sensing problem, there is an unknown low rank matrix $M^\star$. We make linear observations of this matrix: let $A_1, \dots, A_m$ be $m$ sensing matrices; the algorithm is given the $A_i$'s and the corresponding observations $b_i = \langle A_i, M^\star\rangle$. The goal is to find the unknown matrix $M^\star$. In order to find $M^\star$, we need to solve the following nonconvex optimization problem:
$$\min_{M} \; \frac{1}{2m}\sum_{i=1}^m \left(\langle A_i, M\rangle - b_i\right)^2, \quad \text{s.t.}\ \operatorname{rank}(M) \le r.$$
We can transform this constrained problem into an unconstrained problem by expressing $M$ as $UV^\top$, where $U \in \mathbb{R}^{n_1\times r}$ and $V \in \mathbb{R}^{n_2\times r}$. We also need an additional regularizer (common to all asymmetric problems):
$$\min_{U, V} \; \frac{1}{2m}\sum_{i=1}^m \left(\langle A_i, UV^\top\rangle - b_i\right)^2 + \frac{1}{8}\left\|U^\top U - V^\top V\right\|_F^2. \qquad (3)$$
The regularizer $\|U^\top U - V^\top V\|_F^2$ has been widely used in previous works (Zheng and Lafferty, 2016; Park et al., 2016). In Section 5 we show how this regularizer can be viewed as a way to deal with the additional invariances in the asymmetric case, and to reduce the asymmetric case to the symmetric case. A crucial concept in the standard sensing literature is the Restricted Isometry Property (RIP), defined as follows:
Definition 3.
A set of sensing matrices $\{A_i\}_{i=1}^m$ satisfies the $(r, \delta_r)$-RIP condition if for every matrix $M$ of rank at most $r$,
$$(1-\delta_r)\|M\|_F^2 \le \frac{1}{m}\sum_{i=1}^m \langle A_i, M\rangle^2 \le (1+\delta_r)\|M\|_F^2.$$
Intuitively, RIP says the measurement operator approximately preserves norms for all low rank matrices. When the sensing matrices are chosen to be i.i.d. matrices with independent Gaussian entries, the RIP condition is satisfied with high probability once the number of measurements $m$ is large enough (Candes and Plan, 2011). Using our framework we can show:
Theorem 3.
When the measurements satisfy RIP, for the matrix sensing objective (3) we have: 1) all local minima satisfy $UV^\top = M^\star$; 2) the function is strict saddle.
This in particular says that 1) no spurious local minima exist; 2) whenever at some point the gradient is small and the Hessian does not have a significant negative eigenvalue, the distance to the global optimum (see Definition 6 and Definition 7) is guaranteed to be small. Such a point can be found efficiently (see Section 6).
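The concentration behind the RIP condition (Definition 3) can be checked numerically for Gaussian measurements. This is a sketch with arbitrary dimensions, only meant to show that $\frac{1}{m}\sum_i \langle A_i, M\rangle^2$ concentrates around $\|M\|_F^2$ for a fixed low rank matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n, r, m = 10, 2, 20000
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-2 test matrix
A = rng.standard_normal((m, n, n))                             # i.i.d. Gaussian sensors

# (1/m) sum_i <A_i, M>^2 should concentrate around ||M||_F^2.
ratio = np.mean(np.einsum('kij,ij->k', A, M) ** 2) / np.linalg.norm(M) ** 2
print(ratio)  # close to 1
```

The actual RIP condition is stronger, since it requires the bound uniformly over all low rank matrices; the uniform statement is what Candes and Plan (2011) prove.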
3.2 Matrix Completion
Matrix completion is a popular technique in recommendation systems and collaborative filtering (Koren, 2009; Rennie and Srebro, 2005). In this problem, again we have an unknown low rank matrix $M^\star$. We observe each entry of the matrix independently with probability $p$. Let $\Omega$ be the set of observed entries. For any matrix $M$, we use $P_\Omega(M)$ to denote the matrix whose entries outside of $\Omega$ are set to 0: $[P_\Omega(M)]_{ij} = M_{ij}$ if $(i,j) \in \Omega$, and $[P_\Omega(M)]_{ij} = 0$ otherwise. Matrix completion can be viewed as a special case of matrix sensing, where each sensing matrix has only one nonzero entry. However, such matrices do not satisfy the RIP condition.
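To make the observation model concrete, here is a minimal sketch (dimensions and names are made up) of the operator $P_\Omega$ and the resulting loss over observed entries only:

```python
import numpy as np

rng = np.random.default_rng(5)
n, r, p = 12, 2, 0.5
Mstar = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-2 target
Omega = rng.random((n, n)) < p      # each entry observed independently w.p. p

def P_Omega(M):
    """Zero out the unobserved entries of M."""
    return M * Omega

def completion_loss(U, V):
    """(1/2)||P_Omega(U V^T - M*)||_F^2: the loss only sees observed entries."""
    return 0.5 * np.sum(P_Omega(U @ V.T - Mstar) ** 2)

zero = np.zeros((n, r))
print(np.isclose(completion_loss(zero, zero),
                 0.5 * np.sum(P_Omega(Mstar) ** 2)))  # True
```

Note the loss says nothing about the unobserved entries, which is exactly why incoherence assumptions are needed below.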
In order to solve matrix completion, we try to optimize the following:
$$\min_{M} \; \frac{1}{2p}\left\|P_\Omega(M - M^\star)\right\|_F^2, \quad \text{s.t.}\ \operatorname{rank}(M) \le r.$$
A well-known difficulty in matrix completion is that when the true matrix is very sparse, we are very likely to observe only zero entries, and have no chance to learn the other entries of $M^\star$. To avoid this case, previous works have assumed the following incoherence condition:
Definition 4.
A rank-$r$ matrix $M \in \mathbb{R}^{n_1\times n_2}$ is $\mu$-incoherent if for the rank-$r$ SVD $X\Sigma Y^\top$ of $M$, we have $\|e_i^\top X\| \le \sqrt{\mu r/n_1}$ for all $i$ and $\|e_j^\top Y\| \le \sqrt{\mu r/n_2}$ for all $j$.
We assume the unknown optimal low rank matrix $M^\star$ is $\mu$-incoherent. In the nonconvex program, we try to make sure the decomposition is also incoherent by adding a regularizer of the form
$$Q(U, V) = \lambda_1 \sum_{i=1}^{n_1} \left(\|e_i^\top U\| - \alpha_1\right)_+^4 + \lambda_2 \sum_{j=1}^{n_2} \left(\|e_j^\top V\| - \alpha_2\right)_+^4.$$
Here $\lambda_1, \lambda_2, \alpha_1, \alpha_2$ are parameters that we choose later, and $(x)_+ = \max\{x, 0\}$. Using this regularizer, we can now transform the objective function into the unconstrained form
$$\min_{U, V} \; \frac{1}{2p}\left\|P_\Omega(UV^\top - M^\star)\right\|_F^2 + \frac{1}{8}\left\|U^\top U - V^\top V\right\|_F^2 + Q(U, V). \qquad (4)$$
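A direct implementation of the row-norm penalty described above (a sketch; the threshold and weight below are arbitrary, not the choices of Theorem 4): each row of a factor whose norm exceeds the threshold contributes a fourth-power penalty, while incoherent rows pay nothing:

```python
import numpy as np

def row_reg(U, alpha, lam):
    """lam * sum_i (||e_i^T U|| - alpha)_+^4: penalize rows with norm above alpha."""
    excess = np.maximum(np.linalg.norm(U, axis=1) - alpha, 0.0)
    return lam * np.sum(excess ** 4)

U = np.array([[3.0, 4.0],    # row norm 5: contributes (5 - 1)^4 = 256
              [0.1, 0.1]])   # row norm ~0.14: contributes 0
print(row_reg(U, alpha=1.0, lam=1.0))  # 256.0
```

Because the penalty is zero on rows below the threshold, the regularizer does not distort the landscape near incoherent points, which is the interaction exploited by the framework.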
Using the framework, we can show the following:
Theorem 4.
When the sample rate $p$ is large enough and the parameters $\lambda_1, \lambda_2, \alpha_1, \alpha_2$ are chosen appropriately, with probability at least $1 - 1/\mathrm{poly}(n)$, for Objective Function (4) we have: 1) all local minima satisfy $UV^\top = M^\star$; 2) the objective is strict saddle for polynomially small parameters.
3.3 Robust PCA
Robust PCA (Candès et al., 2011) is a generalization of standard Principal Component Analysis. In robust PCA, we are given an observation matrix $M^\star + S^\star$: a true underlying low rank matrix $M^\star$ corrupted by a sparse noise matrix $S^\star$. The goal is to decompose the observation into these two components. There are many models for how many entries can be perturbed and how they are distributed. In this paper we work in the setting where $M^\star$ is incoherent, and each row/column of $S^\star$ has at most an $\alpha$ fraction of nonzero entries.
In order to express robust PCA as an optimization problem, we need constraints on both $M$ and $S$:
$$\min_{M, S} \; \frac{1}{2}\left\|M + S - (M^\star + S^\star)\right\|_F^2, \quad \text{s.t.}\ \operatorname{rank}(M) \le r,\ S \in \mathcal{S}. \qquad (5)$$
There can be several ways to specify the sparsity of $S$. In this paper we restrict attention to the set $\mathcal{S}_\alpha$ of matrices in which every row and every column has at most an $\alpha$ fraction of nonzero entries, together with an entrywise $\ell_\infty$ bound. We assume the true sparse matrix $S^\star$ is in $\mathcal{S}_\alpha$. Note that the $\ell_\infty$ requirement on $S$ is without loss of generality: by incoherence, $M^\star$ cannot have entries with large absolute value, so any entry of the observation larger than the incoherence bound is obviously in the support of $S^\star$ and can be truncated.
In the objective function, we allow $S$ to be $\gamma$ times denser (in $\mathcal{S}_{\gamma\alpha}$), where $\gamma$ is a parameter we choose later. Now the constrained optimization problem can be transformed into the unconstrained problem
$$\min_{U, V} \; \min_{S \in \mathcal{S}_{\gamma\alpha}} \frac{1}{2}\left\|UV^\top + S - (M^\star + S^\star)\right\|_F^2 + \frac{1}{8}\left\|U^\top U - V^\top V\right\|_F^2. \qquad (6)$$
Of course, we can also think of this as a joint minimization problem over $(U, V, S)$. However, we choose to present it this way in order to allow an extension of the strict-saddle condition. Since the objective (after minimizing over $S$) is not twice-differentiable w.r.t. $(U, V)$, it does not admit a Hessian matrix, so we use the following generalized version of strict saddle.
Definition 5.
We say a function $f$ is pseudo strict saddle if for any point $x$, at least one of the following holds:

1. $\|\nabla f(x)\|$ is large;

2. there is a twice-differentiable function $g$ so that $g(x) = f(x)$; $g \ge f$ in a neighborhood of $x$; and $\lambda_{\min}(\nabla^2 g(x))$ is strictly negative;

3. $x$ is close to $\mathcal{X}^\star$, the set of local minima.
Note that in this definition, the upper bound $g$ in case 2 can be viewed as similar to the idea of a subgradient. For functions with non-differentiable points, the subgradient is defined so that it still offers a lower bound for the function. Our case is very similar: although the Hessian is not defined, we can use a smooth function that upper-bounds the current function (an upper bound is what minimization requires). In the case of robust PCA, the upper bound is obtained by fixing $S$. Using this formalization we can prove
Theorem 5.
There is an absolute constant $c$ such that if the corruption level $\alpha$ is small enough and the slack parameter $\gamma$ is chosen appropriately, then for the objective function Eq. (6): 1) all local minima satisfy $UV^\top = M^\star$; 2) the objective function is pseudo strict saddle for polynomially small parameters.
4 Framework for Symmetric Positive Definite Problems
In this section we describe our framework in the simpler setting where the desired matrix is positive semidefinite. In particular, suppose the true matrix we are looking for can be written as $M^\star = U^\star(U^\star)^\top$, where $U^\star \in \mathbb{R}^{n\times r}$. For objective functions $f$ that are quadratic over $M$, we denote the Hessian by $\mathcal{H}$, and we can write the objective as
$$f(M) = \frac{1}{2}\,\mathcal{H}\!\left(M - M^\star,\, M - M^\star\right). \qquad (7)$$
We call this objective function $f$. Via the Burer-Monteiro factorization, the corresponding unconstrained optimization problem, with regularization, can be written as
$$\min_{U\in\mathbb{R}^{n\times r}} \; \frac{1}{2}\,\mathcal{H}\!\left(UU^\top - M^\star,\, UU^\top - M^\star\right) + Q(U). \qquad (8)$$
In this section we also write $f(U)$ for the objective as a function of the parameter $U$, abusing the notation of $f$ previously defined over $M$.
Direction of Improvement
The optimality conditions (Definition 1) imply that if the gradient is nonzero, or if we can find a negative direction of the Hessian (that is, a direction $\Delta$ so that the quadratic form of the Hessian at $\Delta$ is negative), then the point is not a local minimum. A common technique for characterizing the optimization landscape is therefore to explicitly find this negative direction. We call this the direction of improvement. Different works (Bhojanapalli et al., 2016; Ge et al., 2016) have chosen very different directions of improvement.
In our framework, we show it suffices to choose a single direction as the direction of improvement. Intuitively, this direction should bring us close to the true solution $U^\star$ from the current point $U$. Due to rotational symmetry ($U^\star$ and $U^\star R$ behave the same for the objective if $R$ is a rotation matrix), we need to carefully define the difference between $U$ and $U^\star$.
Definition 6.
Given matrices $U, U^\star \in \mathbb{R}^{n\times r}$, define their difference $\Delta = U - U^\star R$, where the orthonormal matrix $R$ is chosen as $R = \operatorname{argmin}_{Z: Z^\top Z = I} \|U - U^\star Z\|_F^2$.
Note that this definition tries to "align" $U$ and $U^\star$ before taking their difference, and is therefore invariant under rotations. In particular, it has the nice property that as long as $UU^\top$ is close to $U^\star(U^\star)^\top$, the norm $\|\Delta\|_F$ is small (we defer the proof to the appendix):
Lemma 6.
Given matrices $U, U^\star \in \mathbb{R}^{n\times r}$, let $M = UU^\top$ and $M^\star = U^\star(U^\star)^\top$, and let $\Delta$ be defined as in Definition 6. Then we have $\|\Delta\Delta^\top\|_F^2 \le 2\|M - M^\star\|_F^2$, and $\sigma_r(M^\star)\|\Delta\|_F^2 \le \frac{1}{2(\sqrt{2}-1)}\|M - M^\star\|_F^2$.
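The optimal rotation in Definition 6 is an orthogonal Procrustes problem and can be computed in closed form from an SVD. A small numerical check (a sketch with made-up sizes) that the aligned difference vanishes whenever $U$ equals $U^\star$ up to rotation:

```python
import numpy as np

rng = np.random.default_rng(2)
n, r = 8, 3
Ustar = rng.standard_normal((n, r))
Q = np.linalg.qr(rng.standard_normal((r, r)))[0]  # random orthogonal matrix
U = Ustar @ Q                                     # same U U^T, different factor

# R = argmin_R ||U - Ustar R||_F over orthogonal R (orthogonal Procrustes):
# if Ustar^T U = W S Vt is an SVD, the minimizer is R = W @ Vt.
W, _, Vt = np.linalg.svd(Ustar.T @ U)
R = W @ Vt
Delta = U - Ustar @ R
print(np.linalg.norm(Delta))  # ~0: alignment removes the rotation ambiguity
```

This is why $\Delta$ is a sensible direction of improvement: it measures the distance to the solution modulo the rotational symmetry of the factorization.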
Now we can state the main Lemma:
Lemma 7 (Main).
Let $\Delta$ be defined as in Definition 6 and $M = UU^\top$. Then for the objective (8),
$$\nabla^2 f(U)(\Delta, \Delta) = \mathcal{H}(\Delta\Delta^\top, \Delta\Delta^\top) - 3\,\mathcal{H}(M - M^\star, M - M^\star) + 4\langle \nabla f(U), \Delta\rangle + \left[\nabla^2 Q(U)(\Delta, \Delta) - 4\langle \nabla Q(U), \Delta\rangle\right]. \qquad (9)$$
To see why this lemma is useful, let us look at the simplest case, where $Q = 0$ and $\mathcal{H}$ is the identity. In this case, if the gradient is zero, Eq. (9) gives
$$\nabla^2 f(U)(\Delta, \Delta) = \|\Delta\Delta^\top\|_F^2 - 3\|M - M^\star\|_F^2.$$
By Lemma 6 this is no more than $-\|M - M^\star\|_F^2$. Therefore, all stationary points with $M \ne M^\star$ must be saddle points, and we immediately conclude that all local minima satisfy $UU^\top = M^\star$!
Interaction with Regularizer
For problems such as matrix completion, the Hessian does not preserve the norm of all low rank matrices. In these cases we need an additional regularizer. In particular, conceptually we need the following steps:

1. Show that the regularizer ensures $U \in \mathcal{B}$ for any $U$ with small gradient, for some nice set $\mathcal{B}$.

2. Show that whenever $U \in \mathcal{B}$, the Hessian operator behaves similarly to the identity along the relevant directions, so that the first two terms of Eq. (9) are significantly negative when $M$ is far from $M^\star$.

3. Show that the regularizer does not contribute a large positive term to Eq. (9). This means we upper-bound the term $\nabla^2 Q(U)(\Delta, \Delta) - 4\langle \nabla Q(U), \Delta\rangle$.
Interestingly, these steps are not just useful for handling regularizers. Any deviation from the original model (such as noise, or the optimal matrix not being exactly low rank) can be viewed as an additional "regularizer" function and argued within the same framework. See e.g. Section C.
4.1 Matrix Sensing
Matrix sensing is the ideal setting for this framework. For symmetric matrix sensing, the objective function is
$$f(U) = \frac{1}{2m}\sum_{i=1}^m \left(\langle A_i, UU^\top\rangle - b_i\right)^2. \qquad (10)$$
Recall that the matrices $A_i$ are known sensing matrices, and $b_i = \langle A_i, M^\star\rangle$ is the result of the $i$th observation. The intended solution is the unknown low rank matrix $M^\star = U^\star(U^\star)^\top$. For any matrix $Z$, the Hessian operator satisfies
$$\mathcal{H}(Z, Z) = \frac{1}{m}\sum_{i=1}^m \langle A_i, Z\rangle^2.$$
Therefore, if the sensing matrices satisfy the RIP property (Definition 3), the Hessian operator is close to the identity for all low rank matrices. In the symmetric case there is no regularizer, so the landscape result for symmetric matrix sensing follows immediately from our main Lemma 7.
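In the full-observation special case, where the Hessian is exactly the identity, the benign landscape can be observed directly: plain gradient descent on $\frac{1}{2}\|UU^\top - M^\star\|_F^2$ from small random initialization finds a global optimum. A toy sketch (dimensions, step size and iteration budget are arbitrary choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 15, 2
B = rng.standard_normal((n, r))
Mstar = B @ B.T                         # PSD rank-2 ground truth

U = 0.1 * rng.standard_normal((n, r))   # small random initialization
eta = 0.005
for _ in range(4000):
    U -= eta * 2.0 * (U @ U.T - Mstar) @ U  # gradient of (1/2)||UU^T - M*||_F^2

err = np.linalg.norm(U @ U.T - Mstar) / np.linalg.norm(Mstar)
print(err)  # tiny: gradient descent reached a global optimum
```

Random initialization avoids the saddle points (e.g. $U = 0$) almost surely, consistent with the strict saddle property.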
Theorem 8.
When the measurements satisfy RIP, for the matrix sensing objective (10) we have: 1) all local minima satisfy $UU^\top = M^\star$; 2) the function is strict saddle.
Proof.
For a point $U$ with small gradient, satisfying $\|\nabla f(U)\| \le \epsilon$, the RIP property applied to Eq. (9) gives
$$\nabla^2 f(U)(\Delta, \Delta) \le (1+\delta)\|\Delta\Delta^\top\|_F^2 - 3(1-\delta)\|M - M^\star\|_F^2 + 4\epsilon\|\Delta\|_F \le -(1 - 5\delta)\|M - M^\star\|_F^2 + 4\epsilon\|\Delta\|_F.$$
The last inequality is due to Lemma 6, which gives $\|\Delta\Delta^\top\|_F^2 \le 2\|M - M^\star\|_F^2$; by the second part of Lemma 6, $\|\Delta\|_F$ is also controlled by $\|M - M^\star\|_F$. This means that if $M$ is not close to $M^\star$, the Hessian has a direction of significantly negative curvature, which proves the strict saddle property. Taking $\epsilon \to 0$, we see that all stationary points with $M \ne M^\star$ are saddle points. Hence all local minima are global minima (satisfying $UU^\top = M^\star$), which finishes the proof. ∎
4.2 Matrix Completion
For matrix completion, we need to ensure the incoherence condition (Definition 4). In order to do that, we add a regularizer that penalizes the objective when some row of $U$ is too large. We choose the same regularizer as Ge et al. (2016): $Q(U) = \lambda \sum_{i=1}^{n} \left(\|e_i^\top U\| - \alpha\right)_+^4$. The objective is then
$$f(U) = \frac{1}{2p}\left\|P_\Omega(UU^\top - M^\star)\right\|_F^2 + Q(U). \qquad (11)$$
Using our framework, we first need to show that the regularizer ensures all rows of $U$ are small (step 1).
Lemma 9.
There exists an absolute constant $c$ such that when the sample rate $p$ is large enough and $\alpha, \lambda$ are chosen appropriately, with probability at least $1 - 1/\mathrm{poly}(n)$, every point $U$ with sufficiently small gradient has all row norms bounded, i.e., $U$ is incoherent.
This is a slightly stronger version of Lemma 4.7 in Ge et al. (2016). Next we show that under this regularizer, we can still select the direction $\Delta$, and the first part of Equation (9) is significantly negative when $\|M - M^\star\|_F$ is large (step 2):
Lemma 10.
When the sample rate $p$ is large enough, choosing $\alpha$ and $\lambda$ appropriately, with probability at least $1 - 1/\mathrm{poly}(n)$, for all incoherent $U$ with small gradient, the Hessian terms in Eq. (9) behave as if $\mathcal{H}$ were the identity, up to a small error.
This lemma follows from several standard concentration inequalities, and is made possible because of the incoherence bound we proved in the previous lemma.
Finally we show the additional regularizer related term in Equation (9) is bounded (step 3).
Lemma 11.
By choosing $\alpha$ and $\lambda$ appropriately, the regularizer term $\nabla^2 Q(U)(\Delta, \Delta) - 4\langle \nabla Q(U), \Delta\rangle$ in Eq. (9) is suitably upper-bounded.
Combining these three lemmas, it is easy to see
Theorem 12.
When the sample rate $p$ is large enough, choosing $\alpha$ and $\lambda$ appropriately, with probability at least $1 - 1/\mathrm{poly}(n)$, for the matrix completion objective (11) we have: 1) all local minima satisfy $UU^\top = M^\star$; 2) the function is strict saddle for polynomially small parameters.
Notice that our proof is different from that of Ge et al. (2016): we focus on the single direction $\Delta$ for both the first and second order conditions, while they need to select different directions for the Hessian. The framework allows us to give a simpler proof, generalize to the asymmetric case, and improve the dependence on the rank.
4.3 Robust PCA
In the robust PCA problem, for any given low rank estimate, the objective function tries to find the optimal sparse perturbation $S$. In the symmetric PSD case, recall that we observe $M^\star + S^\star$, and we define the set of sparse matrices $\mathcal{S}_\alpha$ to be the set of symmetric matrices in which every row and column has at most an $\alpha$ fraction of nonzero entries. Note that the projection onto the set $\mathcal{S}_\alpha$ can be computed in polynomial time (using a max flow algorithm).
We assume $S^\star \in \mathcal{S}_\alpha$. The objective can be written as
$$f(U) = \min_{S \in \mathcal{S}_{\gamma\alpha}} \frac{1}{2}\left\|UU^\top + S - (M^\star + S^\star)\right\|_F^2. \qquad (12)$$
Here $\gamma$ is a slack parameter that we choose later.
Note that now the objective function is not quadratic, so we cannot use the framework directly. However, if we fix $S$, then the objective is a quadratic function with Hessian equal to the identity, and we can still apply our framework. In this case, since the Hessian is the identity for all matrices, we can skip the first step. The problem becomes a matrix factorization problem:
$$\min_{U} \frac{1}{2}\left\|UU^\top - (M^\star + S^\star - S)\right\|_F^2. \qquad (13)$$
The difference here is that the matrix being factorized (which is $M^\star + S^\star - S$) is not equal to $M^\star$ and is in general not low rank. We can use the framework to analyze this problem (and treat the residual as the "regularizer" $Q$).
Lemma 13.
Let $A$ be a symmetric PSD matrix, and define the matrix factorization objective
$$g(U) = \frac{1}{2}\left\|UU^\top - A\right\|_F^2,$$
where $U \in \mathbb{R}^{n\times r}$. Then 1) all local minima satisfy $UU^\top = A_r$ (the best rank-$r$ approximation of $A$); 2) the objective is strict saddle.
To deal with the case where $S$ is not fixed (but is the minimizer in Eq. (12)), we consider the best rank-$r$ approximation of $M^\star + S^\star - S$. The next lemma shows that when $UU^\top$ is close to this approximation up to rotation, $U$ will actually already be close to $U^\star$ up to rotation.
Lemma 14.
The proof of Lemma 14 is inspired by Yi et al. (2016) and uses properties of the optimally chosen sparse matrix $S$. Combining these two lemmas we get our main result:
Theorem 15.
There is an absolute constant $c$ such that if the corruption level $\alpha$ is small enough and the slack parameter $\gamma$ is chosen appropriately, then for the objective function Eq. (12): 1) all local minima satisfy $UU^\top = M^\star$; 2) the objective function is pseudo strict saddle for polynomially small parameters.
5 Handling Asymmetric Matrices
In this section we show how to reduce problems on asymmetric matrices to problems on symmetric PSD matrices.
Let $U \in \mathbb{R}^{n_1\times r}$ and $V \in \mathbb{R}^{n_2\times r}$, and consider the objective function
$$g(U, V) = 2f(UV^\top) + \frac{1}{2}\left\|U^\top U - V^\top V\right\|_F^2.$$
Note that this is a scaled version of the objectives introduced in Section 3 (multiplied by a constant), and scaling does not change the properties of local minima, global minima and saddle points.
We view the problem as trying to find an $(n_1+n_2)\times r$ matrix $W$ whose first $n_1$ rows are equal to $U$ and whose last $n_2$ rows are equal to $V$.
Definition 7.
Suppose $M^\star$ is the optimal solution, with SVD $X\Sigma Y^\top$. Let $U^\star = X\Sigma^{1/2}$, $V^\star = Y\Sigma^{1/2}$, and let $(U, V)$ be the current point. We reduce the problem to a symmetric case using the following notation:
$$W = \begin{pmatrix} U \\ V \end{pmatrix}, \qquad W^\star = \begin{pmatrix} U^\star \\ V^\star \end{pmatrix}. \qquad (14)$$
Further, $\Delta$ is defined to be the difference between $W$ and $W^\star$ up to rotation, as in Definition 6.
We will also transform the Hessian operator to operate on $(n_1+n_2)\times(n_1+n_2)$ matrices. In particular, we define the transformed Hessian so that its quadratic form on any such matrix equals the original Hessian applied to the corresponding off-diagonal block.
Now, with $W$ as in Definition 7, we can rewrite the objective function as
(15) 
We know $\mathcal{H}$ preserves the norm of low rank matrices. To reduce asymmetric problems to symmetric ones, intuitively, we also hope the transformed operator approximately preserves the norm of $WW^\top$. However, this is impossible: by definition, it only acts on $UV^\top$, which forms the off-diagonal blocks of $WW^\top$. We can expect it to be close to the norm of the off-diagonal blocks, but among all pairs $(U, V)$ with the same product $UV^\top$, the matrix $WW^\top$ can have very different norms. The easiest example is to consider $(U, V)$ and $(tU, V/t)$: the product $UV^\top$ is the same no matter what $t$ is, while the norm of $WW^\top$ can change drastically. The regularizer is exactly there to handle this case: the Hessian of the regularizer is related to the norm of the diagonal blocks, therefore allowing the full Hessian to still be approximately the identity.
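A quick numerical illustration of this invariance (a sketch with arbitrary numbers): rescaling $(U, V)$ to $(tU, V/t)$ leaves the product $UV^\top$ untouched, so the data term cannot control the diagonal blocks of $WW^\top$, while the balancing regularizer $\|U^\top U - V^\top V\|_F^2$ detects the imbalance:

```python
import numpy as np

rng = np.random.default_rng(4)
n, r, t = 6, 2, 10.0
U = rng.standard_normal((n, r))
V = rng.standard_normal((n, r))

W = np.vstack([U, V])                           # stacked (2n) x r factor
assert np.allclose((W @ W.T)[:n, n:], U @ V.T)  # off-diagonal block = U V^T

balance = lambda U, V: np.linalg.norm(U.T @ U - V.T @ V)
prod_gap = np.linalg.norm(U @ V.T - (t * U) @ (V / t).T)
b1, b2 = balance(U, V), balance(t * U, V / t)
print(prod_gap)  # ~0: the product is unchanged by rescaling
print(b1, b2)    # the regularizer grows sharply on the imbalanced pair
```

This is why the balancing regularizer is not an arbitrary technical device: it supplies exactly the missing control over the diagonal blocks.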
Now we can formalize the reduction as the following main Lemma:
Lemma 16.
Intuitively, this lemma shows that the same direction of improvement works as before, and that the regularizer is exactly what is required to maintain the norm-preserving property of the Hessian.
Below we prove Theorem 3, which shows that for matrix sensing 1) all local minima satisfy $UV^\top = M^\star$; 2) the strict saddle property is satisfied. Other proofs are deferred to the appendix.
Proof of Theorem 3.
In this case, the Hessian comes from the sensing matrices, and the regularization is the balancing term $\|U^\top U - V^\top V\|_F^2$. Since the original operator satisfies RIP, by Lemma 16 the transformed operator on the stacked matrices also satisfies an RIP-type norm-preserving property.
Similar to the symmetric case, for a point $W$ with small gradient, satisfying $\|\nabla g(W)\| \le \epsilon$, the RIP-type property gives
$$\nabla^2 g(W)(\Delta, \Delta) \le (1+\delta)\|\Delta\Delta^\top\|_F^2 - 3(1-\delta)\left\|WW^\top - W^\star(W^\star)^\top\right\|_F^2 + 4\epsilon\|\Delta\|_F,$$
which by Lemma 6 is at most $-(1 - 5\delta)\|WW^\top - W^\star(W^\star)^\top\|_F^2 + 4\epsilon\|\Delta\|_F$. This means that if $WW^\top$ is not close to $W^\star(W^\star)^\top$, the Hessian has a direction of significantly negative curvature, which proves the strict saddle property. Taking $\epsilon \to 0$, all stationary points with $WW^\top \ne W^\star(W^\star)^\top$ are saddle points. Hence all local minima satisfy $WW^\top = W^\star(W^\star)^\top$, which in particular implies $UV^\top = M^\star$ because $UV^\top$ is a submatrix of $WW^\top$. ∎
6 Runtime
In this section we give the precise statement of Corollary 2: the runtime of algorithms implied by the geometric properties we prove.
In order to translate the geometric results into runtime guarantees, many algorithms require additional smoothness conditions. We say a function $f$ is $\beta$-smooth if for all $x, y$,
$$\|\nabla f(x) - \nabla f(y)\| \le \beta\|x - y\|.$$
This is a standard assumption in optimization. In order to avoid saddle points, we say a function $f$ is $\rho$-Hessian Lipschitz if for all $x, y$,
$$\left\|\nabla^2 f(x) - \nabla^2 f(y)\right\| \le \rho\|x - y\|.$$
We call an optimization algorithm saddle-avoiding if the algorithm is able to find a point with small gradient and almost positive semidefinite Hessian.
Definition 8.
A local search algorithm is called saddle-avoiding if, for a function that is smooth and has Lipschitz Hessian, given a point such that either