Sharp Restricted Isometry Bounds for the Inexistence of Spurious Local Minima in Nonconvex Matrix Recovery

01/07/2019 ∙ by Richard Y. Zhang, et al.

Nonconvex matrix recovery is known to contain no spurious local minima under a restricted isometry property (RIP) with a sufficiently small RIP constant δ. If δ is too large, however, then counterexamples containing spurious local minima are known to exist. In this paper, we introduce a proof technique that is capable of establishing sharp thresholds on δ to guarantee the inexistence of spurious local minima. Using the technique, we prove that in the case of a rank-1 ground truth, an RIP constant of δ<1/2 is both necessary and sufficient for exact recovery from any arbitrary initial point (such as a random point). We also prove a local recovery result: given an initial point x_0 satisfying f(x_0)<(1-δ)^2f(0), any descent algorithm that converges to second-order optimality guarantees exact recovery.


1 Introduction

The low-rank matrix recovery problem seeks to recover an unknown low-rank ground truth matrix from linear measurements of it. The problem naturally arises in recommendation systems (Rennie and Srebro, 2005) and clustering algorithms (Amit et al., 2007)—often under the names of matrix completion and matrix sensing—and also finds engineering applications in phase retrieval (Candes et al., 2013) and power system state estimation (Zhang et al., 2018b).

In the symmetric, noiseless variant of low-rank matrix recovery, the ground truth ZZ^T is taken to be positive semidefinite (denoted ZZ^T ⪰ 0), and the linear measurements are made without error, as in

b = 𝒜(ZZ^T) ∈ R^m.    (1)

To recover ZZ^T from b, the standard approach in the machine learning community is to factor a candidate M into its low-rank factors M = XX^T with X ∈ R^(n×r), and to solve a nonlinear least-squares problem on X using a local search algorithm (usually stochastic gradient descent):

minimize over X:  f(X) = ‖𝒜(XX^T) − b‖^2.    (2)

The function f is nonconvex, so a "greedy" local search algorithm can become stuck at a spurious local minimum, especially if a random initial point is used. Despite this apparent risk of failure, the nonconvex approach remains both widely popular and highly effective in practice.
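As a concrete illustration of (1) and (2), the following sketch builds a small synthetic instance with a random Gaussian measurement ensemble and runs plain gradient descent on f. The dimensions, the 1/sqrt(m) scaling, the step size, and the use of full gradient descent rather than a stochastic method are illustrative assumptions, not choices prescribed in the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n, r, m = 8, 1, 60                         # small illustrative dimensions

    def random_sym(n, rng):
        G = rng.standard_normal((n, n))
        return (G + G.T) / 2

    # Gaussian measurement ensemble, scaled so that ||A(M)||^2 is roughly ||M||_F^2.
    A_mats = [random_sym(n, rng) / np.sqrt(m) for _ in range(m)]
    Z = rng.standard_normal((n, r))            # ground truth factor
    b = np.array([np.sum(Ai * (Z @ Z.T)) for Ai in A_mats])   # b = A(Z Z^T)

    def f(X):
        """Nonconvex objective f(X) = ||A(X X^T) - b||^2."""
        resid = np.array([np.sum(Ai * (X @ X.T)) for Ai in A_mats]) - b
        return resid @ resid

    def grad_f(X):
        """Gradient 4 * sum_i resid_i * A_i X (valid because each A_i is symmetric)."""
        resid = np.array([np.sum(Ai * (X @ X.T)) for Ai in A_mats]) - b
        S = sum(ri * Ai for ri, Ai in zip(resid, A_mats))
        return 4 * S @ X

    # Plain gradient descent from a random initial point, with a fixed step size.
    X = rng.standard_normal((n, r))
    for _ in range(3000):
        X = X - 2e-3 * grad_f(X)

    print("f(X):", f(X), " ||XX^T - ZZ^T||_F:", np.linalg.norm(X @ X.T - Z @ Z.T))

On a benign instance such as this one, the printed recovery error is typically small, which illustrates the empirical success discussed next.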

Recently, Bhojanapalli et al. (2016b) provided a rigorous theoretical justification for the empirical success of local search on problem (2). Specifically, they showed that the problem contains no spurious local minima under the assumption that 𝒜 satisfies the restricted isometry property (RIP) of Recht et al. (2010) with a sufficiently small constant. The nonconvex problem is then easily solved using local search algorithms because every local minimum is also a global minimum.

Definition 1 (Restricted Isometry Property).

The linear map 𝒜 is said to satisfy δ-RIP if there is a constant δ ∈ [0, 1) such that

(1 − δ)‖M‖_F^2 ≤ ‖𝒜(M)‖^2 ≤ (1 + δ)‖M‖_F^2    (3)

holds for all M satisfying rank(M) ≤ 2r.
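Since certifying δ-RIP exactly is NP-hard in general (a point revisited in Section 5), a quick numerical check can only sample the constraint. The sketch below is a Monte Carlo probe of (3): it draws random symmetric matrices of rank at most 2r and records the worst observed distortion, which is a lower bound on the true RIP constant; the sampling scheme and sample count are illustrative choices.

    import numpy as np

    def rip_lower_bound(A_mats, r, n_samples=5000, rng=None):
        """Monte Carlo estimate of the rank-2r RIP constant of M -> [<A_i, M>]_i.
        Samples random symmetric matrices of rank <= 2r and records the worst
        observed violation of (3).  The result is only a lower bound on the true
        constant; certifying the constant exactly is NP-hard in general."""
        rng = rng or np.random.default_rng(0)
        n = A_mats[0].shape[0]
        worst = 0.0
        for _ in range(n_samples):
            U = rng.standard_normal((n, r))
            V = rng.standard_normal((n, r))
            M = U @ V.T + V @ U.T                      # symmetric, rank <= 2r
            AM = np.array([np.sum(Ai * M) for Ai in A_mats])
            worst = max(worst, abs(AM @ AM / np.sum(M * M) - 1.0))
        return worst

Applied to the Gaussian ensemble from the previous sketch, the returned value only ever underestimates the true constant, so it serves as a sanity check rather than a certificate.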

Theorem 2 (Bhojanapalli et al., 2016b; Ge et al., 2017).

Let 𝒜 satisfy δ-RIP with δ < 1/5. Then, (2) has no spurious local minima: every local minimum X satisfies XX^T = ZZ^T.

Hence, any algorithm that converges to a second-order critical point is guaranteed to recover ZZ^T exactly.

While Theorem 2 says that an RIP constant of δ < 1/5 is sufficient for exact recovery, Zhang et al. (2018a) proved that δ < 1/2 is necessary. Specifically, they gave a counterexample satisfying δ-RIP with δ = 1/2 that causes randomized stochastic gradient descent to fail 12% of the time. A number of previous authors have attempted to close the gap between sufficiency and necessity, including Bhojanapalli et al. (2016b); Ge et al. (2017); Park et al. (2017); Zhang et al. (2018a); Zhu et al. (2018). In this paper, we prove that in the rank-1 case, an RIP constant of δ < 1/2 is both necessary and sufficient for exact recovery.

Once the RIP constant exceeds 1/2, global guarantees are no longer possible. Zhang et al. (2018a) proved that counterexamples exist generically: almost every choice of x and z generates an instance of nonconvex recovery satisfying RIP with x as a spurious local minimum and zz^T as ground truth. In practice, local search may continue to work well, often with a 100% success rate, as if spurious local minima do not exist. However, the inexistence of spurious local minima can no longer be assured.

Instead, we turn our attention to local guarantees, based on good initial guesses that often arise from domain expertise or are even chosen randomly. Given an initial point x_0 satisfying f(x_0) < (1 − δ)^2 f(0), where δ is the RIP constant and the ground truth is rank-1, we prove that a descent algorithm that converges to second-order optimality is guaranteed to recover the ground truth. Examples of such algorithms include randomized full-batch gradient descent (Jin et al., 2017) and trust-region methods (Conn et al., 2000; Nesterov and Polyak, 2006).

2 Main Results

Our main contribution in this paper is a proof technique capable of establishing RIP thresholds that are both necessary and sufficient for exact recovery. The key idea is to disprove the counterfactual. To prove for some δ that "δ-RIP implies no spurious local minima", we instead establish the inexistence of a counterexample that admits a spurious local minimum despite satisfying δ-RIP. In particular, if δ* is the smallest RIP constant associated with a counterexample, then any δ < δ* cannot admit a counterexample (or it would contradict the definition of δ* as the smallest RIP constant). Accordingly, δ* is precisely the sharp threshold needed to yield a necessary and sufficient recovery guarantee.

The main difficulty with the above line of reasoning is the need to optimize over the set of counterexamples. Indeed, verifying RIP for a fixed operator is already NP-hard in general (Tillmann and Pfetsch, 2014), so it is reasonable to expect that optimizing over the set of RIP operators is at least NP-hard. Surprisingly, this is not the case. Consider finding the smallest RIP constant δ(x, z) associated with a counterexample with fixed ground truth zz^T and fixed spurious point x:

δ(x, z) ≜ minimize δ over 𝒜    (4)
subject to 𝒜 satisfies δ-RIP, ∇f(x) = 0, ∇²f(x) ⪰ 0,

where f is the objective in (2) with ground truth zz^T and candidate point x.
In Section 5, we reformulate problem (4) into a convex linear matrix inequality (LMI) optimization, and prove that the reformulation is exact (Theorem 8). Accordingly, we can evaluate δ(x, z) to arbitrary precision in polynomial time by solving an LMI using an interior-point method.

In the rank-1 case, the LMI reformulation is sufficiently simple that it can be relaxed and then solved in closed-form (Theorem 13). This yields a lower-bound on δ(x, z) that we optimize over all spurious choices of x to prove that δ(x, z) ≥ 1/2. Given that δ = 1/2 is attained by the counterexample of Zhang et al. (2018a), we must actually have a sharp threshold of 1/2.

Theorem 3 (Global guarantee).

Let the ground truth zz^T have rank 1, let 𝒜 satisfy δ-RIP, and define f(x) = ‖𝒜(xx^T − zz^T)‖^2.

  • If δ < 1/2, then f has no spurious local minima: every second-order critical point x satisfies xx^T = zz^T.

  • If δ ≥ 1/2, then there exists a counterexample: a linear map 𝒜 satisfying δ-RIP whose f admits a spurious second-order critical point x satisfying ∇f(x) = 0, ∇²f(x) ⪰ 0, and xx^T ≠ zz^T.

We can also optimize δ(x, z) over spurious choices of x within a neighborhood of the ground truth. The resulting guarantee is applicable to much larger RIP constants δ, including those arbitrarily close to one.

Theorem 4 (Local guarantee).

Let , and let satisfy -RIP. If

then has no spurious local minima within an -neighborhood of the solution:

Theorem 4 gives an RIP-based exact recovery guarantee for descent algorithms, such as randomized full-batch gradient descent (Jin et al., 2017) and trust-region methods (Conn et al., 2000; Nesterov and Polyak, 2006), that generate a sequence of iterates x_1, x_2, x_3, … from an initial guess x_0, with each iterate no worse than the one before:

f(x_{k+1}) ≤ f(x_k) for all k ≥ 0.    (5)

Heuristically, it also applies to nondescent algorithms, like stochastic gradient descent and Nesterov's accelerated gradient descent, under the mild assumption that the final iterate x_k is no worse than the initial guess x_0, as in f(x_k) ≤ f(x_0).
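Condition (5) is satisfied automatically by any method whose line search only accepts non-increasing steps. The sketch below shows one standard way to enforce it, backtracking on the step size; the objective and gradient callables, the initial step, and the shrink factor are illustrative assumptions.

    def descent_step(x, f, grad_f, step=1.0, shrink=0.5, max_backtracks=50):
        """One gradient step with backtracking so that the objective never
        increases, i.e. the iterates satisfy f(x_{k+1}) <= f(x_k) as in (5)."""
        g = grad_f(x)
        fx = f(x)
        for _ in range(max_backtracks):
            x_new = x - step * g
            if f(x_new) <= fx:            # accept only non-increasing steps
                return x_new
            step *= shrink                # otherwise shrink the step and retry
        return x                          # fall back to the current iterate

Any of the objectives sketched earlier can be passed in as f and grad_f; the resulting monotone decrease is exactly the property used by Corollary 5 below.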

Corollary 5.

Let the ground truth zz^T have rank 1, and let 𝒜 satisfy δ-RIP. If x_0 satisfies

f(x_0) < (1 − δ)^2 f(0),

where f(0) = ‖𝒜(zz^T)‖^2, then the sublevel set defined by f(x) ≤ f(x_0) contains no spurious local minima: every second-order critical point x in the set satisfies xx^T = zz^T.

When the RIP constant satisfies δ < 1, Corollary 5 guarantees exact recovery from an initial point x_0 satisfying f(x_0) < (1 − δ)^2 f(0). In practice, such an x_0 can often be found using a spectral initializer (Keshavan et al., 2010a; Jain et al., 2013; Netrapalli et al., 2013; Candes et al., 2015; Chen and Candes, 2015). If δ is not too close to one, then even a random point may suffice with a reasonable probability (see the related discussion by Goldstein and Studer (2018)).
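One common form of the spectral initializer, shown below purely as an illustration, takes the top eigenpair of the backprojected measurement matrix Σ_i b_i A_i; the exact scaling and any preprocessing of b vary across the works cited above, so treat the specifics here as assumptions.

    import numpy as np

    def spectral_init(A_mats, b):
        """Rank-1 spectral initializer: top eigenpair of sum_i b_i A_i, with the
        eigenvector scaled by the square root of the eigenvalue.  Published
        variants differ in scaling and in preprocessing/trimming of b."""
        S = sum(bi * Ai for bi, Ai in zip(b, A_mats))
        eigvals, eigvecs = np.linalg.eigh(S)          # S is symmetric
        lam, v = eigvals[-1], eigvecs[:, -1]          # largest eigenpair
        return np.sqrt(max(lam, 0.0)) * v

Whether the resulting point actually satisfies the condition f(x_0) < (1 − δ)^2 f(0) of Corollary 5 can be checked numerically once it is returned.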

In the rank-r case with r > 1, our proof technique continues to work, but the LMI reformulation becomes very difficult to solve in closed-form. Nevertheless, we can evaluate δ(X, Z) numerically using an interior-point method, and then heuristically optimize over the spurious point X and ground truth Z. Doing this in Section 8, we obtain empirical evidence that higher-rank problems have larger RIP thresholds, and so are in a sense "easier" to solve.

3 Related work

3.1 No spurious local minima in matrix completion

Exact recovery guarantees like Theorem 2 have also been established for "harder" choices of 𝒜 that do not satisfy RIP over the entire domain. In particular, the matrix completion problem has sparse measurement matrices A_i, each containing just a single nonzero element. In this case, the RIP-like condition holds only when the measured matrix is both low-rank and sufficiently dense; see the discussion by Candès and Recht (2009). Nevertheless, Ge et al. (2016) proved a similar result to Theorem 2 by adding a regularizing term to the objective.

Our recovery results are developed for the classical form of RIP—a much stronger notion than the RIP-like condition satisfied by matrix completion. Intuitively, if exact recovery cannot be guaranteed under standard RIP, then exact recovery under a weaker notion would seem unlikely. It remains future work to make this argument precise, and to extend our proof technique to these "harder" choices of 𝒜.

3.2 Noisy measurements and nonsymmetric ground truth

Recovery guarantees for the noisy and/or nonsymmetric variants of nonconvex matrix recovery typically require a smaller RIP constant than the symmetric, noiseless case. For example, Bhojanapalli et al. (2016b) proved that the symmetric case with zero-mean Gaussian noise requires a smaller RIP constant to recover a solution whose accuracy is commensurate with the noise variance. Also, Ge et al. (2017) proved that the nonsymmetric, noiseless case requires a smaller RIP constant for exact recovery. By comparison, the symmetric, noiseless case requires only a rank-2r RIP constant of 1/5 for exact recovery.

The main goal of this paper is to develop a proof technique capable of establishing sharp RIP thresholds for exact recovery. As such, we have focused our attention on the symmetric, noiseless case. While our technique can be easily modified to accommodate for the nonsymmetric, noisy case, the sharpness of the technique (via Theorem 8) may be lost. Whether an exact convex reformulation exists for the nonsymmetric, noisy case is an open question, and the subject of important future work.

3.3 Approximate second-order points and strict saddles

Existing "no spurious local minima" results (Bhojanapalli et al., 2016b; Ge et al., 2017) guarantee that satisfying second-order optimality to ε-accuracy will yield a point within a correspondingly small neighborhood of the solution. Such a condition is often known as "strict saddle" (Ge et al., 2015). The associated constants determine the rate at which gradient methods can converge to an ε-accurate solution (Du et al., 2017; Jin et al., 2017).

The proof technique presented in this paper can be extended in a straightforward way to the strict saddle condition. Specifically, we replace the exact first- and second-order optimality conditions with their ε-approximate counterparts in Section 5, and derive a suitable version of Theorem 8. However, the resulting reformulation can no longer be solved in closed form, so it becomes difficult to extend the guarantees in Theorem 3 and Theorem 4. Nevertheless, quantifying its asymptotic behavior may yield valuable insight into the optimization landscape.

3.4 Special initialization schemes

Our local recovery result is reminiscent of classic exact recovery results based on placing an initial point sufficiently close to the global optimum. Most algorithms use the spectral initializer to choose the initial point (Keshavan et al., 2010a, b; Jain et al., 2013; Netrapalli et al., 2013; Candes et al., 2015; Chen and Candes, 2015; Zheng and Lafferty, 2015; Zhao et al., 2015; Bhojanapalli et al., 2016a; Sun and Luo, 2016; Sanghavi et al., 2017; Park et al., 2018), although other initializers have also been proposed (Wang et al., 2018; Chen et al., 2018; Mondelli and Montanari, 2018). Our result differs from prior work in being completely agnostic to the specific application and the initializer. First, it requires only a suboptimality bound to be satisfied by the initial point x_0. Second, its sole parameter is the RIP constant δ, so issues of sample complexity are implicitly resolved in a universal way for different measurement ensembles. On the other hand, the result is not directly applicable to problems that only approximately satisfy RIP, including matrix completion.

3.5 Comparison to convex recovery

Classical theory for the low-rank matrix recovery problem is based on a quadratic lift: replacing the factored term XX^T in (2) by a convex matrix variable M ⪰ 0, and augmenting the objective with a trace penalty to induce a low-rank solution (Candès and Recht, 2009; Recht et al., 2010; Candès and Tao, 2010; Candes and Plan, 2011; Candes et al., 2013). The convex approach also enjoys RIP-based exact recovery guarantees: in the noiseless case, Cai and Zhang (2013) established a sufficient RIP threshold, while the counterexample of Wang and Li (2013) shows a corresponding necessary threshold. While convex recovery may be able to solve problems with larger RIP constants than nonconvex recovery, it is also considerably more expensive. In practice, convex recovery is seldom used for large-scale datasets with n on the order of thousands to millions.
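For comparison, here is a minimal sketch of the lifted convex approach in the noiseless case, written with cvxpy: trace minimization over the PSD cone subject to the measurement constraints, in the spirit of Recht et al. (2010). The solver is left to cvxpy's defaults, and the function name is only illustrative.

    import cvxpy as cp

    def convex_recover(A_mats, b):
        """Lifted convex recovery: minimize trace(M) subject to A(M) = b and
        M PSD.  This is an n x n semidefinite program, which is why the
        approach becomes expensive when n is large."""
        n = A_mats[0].shape[0]
        M = cp.Variable((n, n), PSD=True)
        constraints = [cp.sum(cp.multiply(Ai, M)) == bi for Ai, bi in zip(A_mats, b)]
        cp.Problem(cp.Minimize(cp.trace(M)), constraints).solve()
        return M.value

The single n × n semidefinite variable is what makes this approach costly relative to the n × r factored formulation in (2).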

Recently, several authors have proposed non-lifting convex relaxations, motivated by the desire to avoid squaring the number of variables in the classic quadratic lift. In particular, we mention the PhaseMax method studied by Bahmani and Romberg (2017) and Goldstein and Studer (2018), which avoids the need to square the number of variables when both the measurement matrices and the ground truth are rank-1. These methods also require a good initial guess as an input, and so are in a sense very similar to nonconvex recovery.

4 Preliminaries

4.1 Notation

Lower-case letters are vectors and upper-case letters are matrices. The sets R^(n×r) and S^n are the spaces of n × r real matrices and n × n real symmetric matrices, and ⟨X, Y⟩ = tr(X^T Y) and ‖X‖_F = √⟨X, X⟩ are the Frobenius inner product and norm. We write X ⪰ 0 (resp. X ≻ 0) to mean that X is positive semidefinite (resp. positive definite), and X ⪰ Y to denote X − Y ⪰ 0 (resp. X ≻ Y to denote X − Y ≻ 0).

Throughout the paper, we use x (resp. X) to refer to any candidate point, and z (resp. Z) to refer to a rank-1 (resp. rank-r) factorization of the ground truth. The vector e and matrix J are defined in (11). We also denote the optimal value of the nonconvex problem (15) as δ(x, z), and later show it to be equal to the optimal value of the convex problem (21).

4.2 Basic definitions

The vectorization operator vec(·) stacks the columns of an n × r matrix into a single column vector of length nr. It defines an isometry between the n × r matrices and their nr underlying degrees of freedom:

⟨X, Y⟩ = vec(X)^T vec(Y),  ‖X‖_F = ‖vec(X)‖.

The matricization operator mat(·) is the inverse of vectorization, meaning that X = mat(x) if and only if x = vec(X).

The Kronecker product A ⊗ B between the m × n matrix A and the p × q matrix B is the mp × nq matrix defined to satisfy the Kronecker identity

(A ⊗ B) vec(X) = vec(B X A^T).

The orthogonal basis orth(X) of a given n × r matrix X (with n ≥ r) is a matrix comprising orthonormal columns of length n that span the column space of X. We can compute orth(X) using either a rank-revealing QR factorization (Chan, 1987) or a (thin) singular value decomposition (Golub and Van Loan, 1996, p. 254) in O(n r^2) time and O(n r) memory.
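The following numpy sketch spells these operations out and checks the Kronecker identity numerically. The column-stacking convention corresponds to reshape with order="F", and numpy's kron follows the standard block definition assumed here.

    import numpy as np

    def vec(X):
        """Stack the columns of X into a single column vector."""
        return X.reshape(-1, order="F")

    def mat(x, shape):
        """Inverse of vec: reshape a vector back into a matrix of the given shape."""
        return x.reshape(shape, order="F")

    def orth(X):
        """Orthonormal basis for the column space of X, via a thin SVD."""
        U, s, _ = np.linalg.svd(X, full_matrices=False)
        return U[:, s > 1e-12 * s.max()]

    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 4))
    B = rng.standard_normal((5, 6))
    X = rng.standard_normal((6, 4))
    # Kronecker identity: (A kron B) vec(X) = vec(B X A^T)
    print(np.allclose(np.kron(A, B) @ vec(X), vec(B @ X @ A.T)))   # True

The order="F" convention matters: with C-ordering (row-major) the same identity check fails.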

4.3 Global optimality and local optimality

Given a choice of 𝒜 and the rank-r ground truth ZZ^T, we define the nonconvex objective

f(X) ≜ ‖𝒜(XX^T − ZZ^T)‖^2.    (6)

If the point X attains f(X) = 0, then we call it a global minimum; otherwise, we call it a spurious point. If 𝒜 satisfies δ-RIP, then X is a global minimum if and only if XX^T = ZZ^T (Recht et al., 2010, Theorem 3.2).

The point X is said to be a local minimum if f(X) ≤ f(X′) holds for all X′ within a local neighborhood of X. If X is a local minimum, then it must satisfy the second-order necessary condition for local optimality:

∇f(X) = 0,  ∇²f(X) ⪰ 0.    (7)

Conversely, a point satisfying (7) is called a second-order critical point, and can be either a local minimum or a saddle point. It is worth emphasizing that local search algorithms can only guarantee convergence to a second-order critical point, and not necessarily a local minimum; see Ge et al. (2015); Lee et al. (2016); Jin et al. (2017); Du et al. (2017) for the literature on gradient methods, and Conn et al. (2000); Nesterov and Polyak (2006); Cartis et al. (2012); Boumal et al. (2018) for the literature on trust-region methods.

If a point X satisfies the second-order sufficient condition for local optimality (with a strictly positive definite Hessian):

∇f(X) = 0,  ∇²f(X) ≻ 0,    (8)

then it is guaranteed to be a local minimum. However, it is also possible for X to be a local minimum without satisfying (8). Indeed, certifying X to be a local minimum is NP-hard in the worst case (Murty and Kabadi, 1987). Hence, the finite gap between necessary and sufficient conditions for local optimality reflects the inherent hardness of the problem.

4.4 Explicit expressions for ∇f(X) and ∇²f(X)

Define f as the nonlinear least-squares objective shown in (6). While not immediately obvious, both the gradient ∇f(X) and the Hessian ∇²f(X) are linear with respect to the kernel operator 𝒜^T𝒜. To show this, we define the matrix representation A of the operator 𝒜,

A ≜ [vec(A_1), vec(A_2), …, vec(A_m)]^T,    (9)

which satisfies A vec(M) = 𝒜(M) for all M. Then, some linear algebra reveals

f(X) = e^T (A^T A) e,    (10a)
∇f(X) = 2 J^T (A^T A) e,    (10b)
∇²f(X) = 2 J^T (A^T A) J + 4 [I_r ⊗ mat(A^T A e)],    (10c)

where the vector e and matrix J are defined with respect to X and Z to satisfy

e ≜ vec(XX^T − ZZ^T),    (11a)
J vec(U) ≜ vec(UX^T + XU^T) for all U ∈ R^(n×r).    (11b)

(Note that J is simply the Jacobian of vec(XX^T) with respect to vec(X).) Clearly, f, ∇f, and ∇²f are all linear with respect to A^T A. In turn, A^T A is simply the matrix representation of the kernel operator 𝒜^T𝒜.
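For the rank-1 case, the sketch below assembles the matrix representation A, the vector e, and the Jacobian J, and checks gradient and Hessian expressions of the form (10) against finite differences. The factors of 2 and 4 follow from defining f as in (6) without a 1/2 normalization, which is an assumed convention.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m = 5, 12

    def random_sym(n, rng):
        G = rng.standard_normal((n, n))
        return (G + G.T) / 2

    A_mats = [random_sym(n, rng) for _ in range(m)]
    A = np.vstack([Ai.reshape(-1, order="F") for Ai in A_mats])   # rows are vec(A_i)^T, as in (9)
    H = A.T @ A                                                   # kernel matrix A^T A
    x, z = rng.standard_normal(n), rng.standard_normal(n)

    def f(v):
        e = (np.outer(v, v) - np.outer(z, z)).reshape(-1, order="F")
        return e @ H @ e

    def grad_f(v):
        e = (np.outer(v, v) - np.outer(z, z)).reshape(-1, order="F")
        J = np.kron(v.reshape(-1, 1), np.eye(n)) + np.kron(np.eye(n), v.reshape(-1, 1))
        return 2 * J.T @ H @ e                                    # (10b), rank-1 case

    def hess_f(v):
        e = (np.outer(v, v) - np.outer(z, z)).reshape(-1, order="F")
        J = np.kron(v.reshape(-1, 1), np.eye(n)) + np.kron(np.eye(n), v.reshape(-1, 1))
        return 2 * J.T @ H @ J + 4 * (H @ e).reshape((n, n), order="F")   # (10c), rank-1 case

    # Finite-difference checks of the gradient and Hessian expressions.
    eps = 1e-6
    fd_grad = np.array([(f(x + eps * ei) - f(x - eps * ei)) / (2 * eps) for ei in np.eye(n)])
    fd_hess = np.array([(grad_f(x + eps * ei) - grad_f(x - eps * ei)) / (2 * eps) for ei in np.eye(n)])
    print(np.allclose(grad_f(x), fd_grad, rtol=1e-4, atol=1e-6),
          np.allclose(hess_f(x), fd_hess, rtol=1e-4, atol=1e-6))

The same expressions, with A^T A replaced by a matrix variable H, are what make the optimization problems below linear in H.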

As an immediate consequence noted by Zhang et al. (2018a), both the second-order necessary condition (7) and the second-order sufficient condition (8) for local optimality are linear matrix inequalities (LMIs) with respect to the kernel matrix H = A^T A. In particular, this means that finding an instance of (2) with a fixed Z as the ground truth and X as a spurious local minimum is a convex optimization problem:

find H ⪰ 0    (12)
such that ∇f(X) = 0 and ∇²f(X) ⪰ 0 hold with A^T A replaced by H in (10).

Given a feasible point H ⪰ 0, we compute an A satisfying A^T A = H using Cholesky factorization or an eigendecomposition. Then, matricizing each row of A recovers the matrices A_1, …, A_m implementing a feasible choice of 𝒜.
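A small helper for this recovery step might look as follows; an eigendecomposition is used in place of Cholesky so that rank-deficient H is handled as well, and the function name is only illustrative.

    import numpy as np

    def measurement_matrices_from_kernel(H, n, tol=1e-10):
        """Factor a PSD kernel matrix H (of size n^2 x n^2) as H = A^T A and
        matricize each row of A to recover measurement matrices A_1, ..., A_m."""
        w, V = np.linalg.eigh(H)
        keep = w > tol * max(w.max(), 1.0)              # discard numerically zero modes
        A = np.sqrt(w[keep])[:, None] * V[:, keep].T    # A^T A = H
        return [row.reshape((n, n), order="F") for row in A]

The recovered matrices reproduce the quadratic form of H, in the sense that the sum of ⟨A_i, M⟩⟨A_i, N⟩ over i equals vec(M)^T H vec(N).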

5 Main idea: The inexistence of counterexamples

At the heart of this work is a simple argument by the inexistence of counterexamples. To illustrate the idea, consider making the following claim for a fixed choice of x and z:

"If 𝒜 satisfies δ-RIP, then x is not a spurious local minimum of (2) with ground truth zz^T."    (13)

The claim is refuted by a counterexample: an instance of (2) satisfying δ-RIP with ground truth zz^T and spurious local minimum x. The problem of finding a counterexample is a nonconvex feasibility problem:

find 𝒜    (14)
such that 𝒜 satisfies δ-RIP, ∇f(x) = 0, ∇²f(x) ⪰ 0.

If problem (14) is feasible for the given δ, then any feasible point is a counterexample that refutes the claim (13). However, if problem (14) is infeasible for the given δ, then counterexamples do not exist, so we must accept the claim (13) at face value. In other words, the inexistence of counterexamples is proof for the original claim.

The same argument can be posed in an optimization form. Instead of finding any arbitrary counterexample, we will look for the counterexample with the smallest RIP constant

δ(x, z) ≜ minimize δ over 𝒜    (15)
subject to 𝒜 satisfies δ-RIP, ∇f(x) = 0, ∇²f(x) ⪰ 0.

Suppose that problem (15) attains its minimum at δ(x, z). If δ ≥ δ(x, z), then the minimizer is a counterexample that refutes the claim (13). On the other hand, if δ < δ(x, z), then problem (14) is infeasible, so counterexamples do not exist, so the claim (13) must be true.

Repeating these arguments over all choices of x and z yields the following global recovery guarantee.

Lemma 6 (Sharp global guarantee).

Suppose that problem (15) attains its minimum of δ(x, z). Define the threshold δ* as in

δ* ≜ inf { δ(x, z) : xx^T ≠ zz^T }.    (16)

If 𝒜 satisfies δ-RIP with δ < δ*, then (2) with ground truth zz^T satisfies:

every second-order critical point x satisfies xx^T = zz^T.    (17)

Moreover, if there exist x, z with xx^T ≠ zz^T such that δ(x, z) = δ*, then the threshold δ < δ* is sharp.

Proof.

To prove (17), we simply prove the claim (13) for δ < δ* and every possible choice of x. Indeed, if xx^T = zz^T, then x is not a spurious point (as it is a global minimum), whereas if xx^T ≠ zz^T, then δ < δ* ≤ δ(x, z) proves the inexistence of a counterexample. Sharpness follows because the minimum is attained by a minimizer that refutes the claim (13) at δ = δ* and the associated choice of x and z. ∎

Repeating the same arguments over a local neighborhood of the ground truth yields the following local recovery guarantee.

Lemma 7 (Sharp local guarantee).

Suppose that problem (15) attains its minimum of . Given , define as in

(18)

If satisfies -RIP with , then with ground truth and satisfies:

(19)

Moreover, if there exist such that , then the threshold is sharp.

Our main difficulty with Lemma 6 and Lemma 7 is the evaluation of δ(x, z). Indeed, verifying δ-RIP for a fixed 𝒜 is already NP-hard in general (Tillmann and Pfetsch, 2014), so it is reasonable to expect that solving an optimization problem (15) with a δ-RIP constraint would be at least NP-hard. Instead, Zhang et al. (2018a) suggest replacing the δ-RIP constraint with a convex sufficient condition, obtained by enforcing the RIP inequality (3) over all matrices (and not just rank-2r matrices):

(1 − δ)‖M‖_F^2 ≤ ‖𝒜(M)‖^2 ≤ (1 + δ)‖M‖_F^2 for all M, equivalently (1 − δ) I ⪯ A^T A ⪯ (1 + δ) I.    (20)

The resulting problem is a linear matrix inequality (LMI) optimization over the kernel matrix H (the matrix representation of 𝒜^T𝒜) that yields an upper-bound on δ(x, z):

minimize δ over H ⪰ 0    (21)
subject to (1 − δ) I ⪯ H ⪯ (1 + δ) I, ∇f(x) = 0, ∇²f(x) ⪰ 0,

where ∇f(x) and ∇²f(x) are evaluated with A^T A replaced by H, as in (10).

Surprisingly, the upper-bound is tight—problem (21) is actually an exact reformulation of problem (15).

Theorem 8 (Exact convex reformulation).

Given x and z, the optimal values of problems (15) and (21) coincide, with both problems attaining their minima. Moreover, every minimizer H for the latter problem is related to a minimizer 𝒜 for the former problem via H = A^T A.

Theorem 8 is the key insight that allows us to establish our main results. When the rank is r = 1, the LMI is sufficiently simple that it can be suitably relaxed and solved in closed-form, as we will soon show in Section 7. But even when r > 1, the LMI can still be solved numerically using an interior-point method. This allows us to perform numerical experiments to probe at the true value of δ(x, z) and the sharp threshold, even when analytical arguments are not available.
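As an illustration of this numerical route, the cvxpy sketch below solves an LMI of this kind for the rank-1 case: RIP-style bounds on H over all matrices, together with the first- and second-order conditions at x expressed through the form of (10). The exact constraint set of (21) should be taken from the paper, so this is a simplified rendering, and the returned value should be read as an upper bound on δ(x, z) only.

    import cvxpy as cp
    import numpy as np

    def delta_upper_bound(x, z):
        """LMI in the spirit of (21), rank-1 case: minimize delta over kernel
        matrices H with (1 - delta) I <= H <= (1 + delta) I such that x is a
        second-order critical point of the objective induced by H with ground
        truth z z^T.  (cvxpy constrains the symmetric part in PSD constraints.)"""
        n = x.size
        N = n * n
        e = (np.outer(x, x) - np.outer(z, z)).reshape(-1, order="F")
        J = np.kron(x.reshape(-1, 1), np.eye(n)) + np.kron(np.eye(n), x.reshape(-1, 1))
        H = cp.Variable((N, N), symmetric=True)
        delta = cp.Variable()
        S = cp.reshape(H @ e, (n, n), order="F")
        grad = 2 * (J.T @ H @ e)                      # gradient at x, linear in H
        hess = 2 * (J.T @ H @ J) + 2 * (S + S.T)      # Hessian at x, linear in H
        constraints = [
            H >> (1 - delta) * np.eye(N),             # RIP-style bounds over all matrices
            H << (1 + delta) * np.eye(N),
            grad == 0,                                # first-order condition
            hess >> 0,                                # second-order condition
        ]
        cp.Problem(cp.Minimize(delta), constraints).solve()
        return delta.value

For small n (say 4 or 5) the problem has an n² × n² matrix variable and solves in seconds with cvxpy's default SDP solver; the cost grows quickly with n, which is the practical issue addressed in Section 5.2.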

Section 5.1 below gives a proof of Theorem 8. A key step of the proof is to establish the following equivalence:

(22)

For small values of the rank r, equation (22) also yields an efficient algorithm for evaluating δ(x, z) in linear time. Moreover, the associated minimizer can also be efficiently recovered. These practical aspects are discussed in detail in Section 5.2.

5.1 Proof of Theorem 8

Given x and z, we define e and J to satisfy equation (11) with respect to x and z. Then, problem (21) can be explicitly written as

(23)
subject to

with Lagrangian dual

(24)
subject to

The dual problem admits a strictly feasible point (for sufficiently small , set and where and ) and the primal problem is bounded (the constraints imply ). Hence, Slater’s condition is satisfied, strong duality holds, and the primal problem attains its optimal value at a minimizer.

It turns out that both the minimizer and the minimum are invariant under an orthogonal projection.

Lemma 9 (Orthogonal projection).

Given , let with satisfy

Let be a minimizer for . Then, is a minimizer for , where and

Proof.

Choose an arbitrarily small ε > 0. Strong duality guarantees the existence of a dual feasible point with duality gap ε. This is a certificate that proves the minimizer to be ε-suboptimal. We can mechanically verify that the projected primal point is feasible and that the corresponding dual point is feasible, where

Then, is a certificate that proves to be -suboptimal for , since

Given that ε-suboptimal certificates exist for all ε > 0, the point must actually be optimal. The details for verifying primal and dual feasibility are straightforward but tedious; they are included in Appendix A for completeness. ∎

Recall that we developed an upper-bound on δ(x, z) by replacing δ-RIP with a convex sufficient condition (20). The same idea can also be used to produce a lower-bound. Specifically, we replace the δ-RIP constraint with a convex necessary condition, obtained by enforcing the RIP inequality (3) over a subset of rank-2r matrices (instead of over all rank-2r matrices):

(25)

where is a fixed matrix with . The resulting problem is also convex (we write )

(26)
subject to

with Lagrangian dual

(27)
subject to

It turns out that for the specific choice of , the lower-bound in (26) coincides with the upper-bound in (23).

Lemma 10 (Tightness).

Define . Let be a minimizer for . Then, is a minimizer for problem (26), where and

Proof.

The proof is almost identical to that of Lemma 9. Again, choose an arbitrarily small ε > 0. Let a dual feasible point with duality gap ε be given. Then, the point constructed from it, where

is a certificate that proves ε-suboptimality for problem (26). The details for verifying primal and dual feasibility are included in Appendix B. ∎

Putting the upper- and lower-bounds together then yields a short proof of Theorem 8.

Proof of Theorem 8.

Denote the optimal value of the upper-bound problem (23) and its corresponding minimizer. (The minimizer always exists due to the boundedness of the primal problem and the existence of a strictly feasible point in the dual problem.) Denote also the optimal value of the lower-bound problem (26). Because the convex sufficient condition (20) implies δ-RIP, which in turn implies the convex necessary condition (25), the feasible sets are nested, and the optimal values satisfy lower-bound ≤ δ(x, z) ≤ upper-bound. However, by Lemma 9 and Lemma 10, the lower- and upper-bounds actually coincide, and hence all three values are equal. Finally, the minimizer H of (23) factors into H = A^T A, where A satisfies the sufficient condition (20), and hence 𝒜 also satisfies δ-RIP. ∎

5.2 Efficient evaluation of δ(x, z) and its minimizer

We now turn to the practical problem of evaluating δ(x, z) and the associated minimizer using a numerical algorithm. While the exact reformulation (21) is indeed convex, naïvely solving it using an interior-point method can require substantial time and memory (as it requires solving an order-n² semidefinite program). In our experiments, the largest instances of (21) that we could accommodate using the state-of-the-art solver MOSEK (Andersen and Andersen, 2000) were of modest dimension.

Input. Choices of x and z.

Output. The value δ(x, z) and the corresponding minimizer (if desired).

Algorithm.

  1. Compute and project and .

  2. Solve using an interior-point method to obtain minimizer . Output .

  3. Compute the orthogonal complement

  4. Factor using (dense) Cholesky factorization.

  5. Analytically factor using the formula