# Smoothed analysis of the low-rank approach for smooth semidefinite programs

We consider semidefinite programs (SDPs) of size n with equality constraints. In order to overcome scalability issues, Burer and Monteiro proposed a factorized approach based on optimizing over a matrix Y of size n by k such that X = YY^* is the SDP variable. The advantages of such a formulation are twofold: the dimension of the optimization variable is reduced and positive semidefiniteness is naturally enforced. However, the problem in Y is non-convex. In prior work, it has been shown that, when the constraints on the factorized variable regularly define a smooth manifold, provided k is large enough, for almost all cost matrices, all second-order stationary points (SOSPs) are optimal. Importantly, in practice, one can only compute points which approximately satisfy necessary optimality conditions, leading to the question: are such points also approximately optimal? To this end, and under similar assumptions, we use smoothed analysis to show that approximate SOSPs for a randomly perturbed objective function are approximate global optima, with k scaling like the square root of the number of constraints (up to log factors). We particularize our results to an SDP relaxation of phase retrieval.

## 1 Introduction

We consider semidefinite programs (SDP) over $\mathbb{R}$ or $\mathbb{C}$ of the form:

$$\min_{X\in\mathcal{S}^{n\times n}} \ \langle C, X\rangle \quad \text{subject to} \quad \mathcal{A}(X)=b,\ X\succeq 0, \tag{SDP}$$

with the Frobenius inner product $\langle A,B\rangle = \Re\{\operatorname{Tr}(A^*B)\}$ (where $A^*$ is the conjugate-transpose of $A$), $\mathcal{S}^{n\times n}$ the set of self-adjoint matrices of size $n$ (real symmetric for $\mathbb{K}=\mathbb{R}$, or Hermitian for $\mathbb{K}=\mathbb{C}$), $C\in\mathcal{S}^{n\times n}$ the cost matrix, and $\mathcal{A}:\mathcal{S}^{n\times n}\to\mathbb{R}^m$ a linear operator capturing $m$ equality constraints with right-hand side $b\in\mathbb{R}^m$: $\mathcal{A}(X)_i = \langle A_i, X\rangle$ for each $i$, for given matrices $A_1,\dots,A_m\in\mathcal{S}^{n\times n}$. The optimization variable $X$ is positive semidefinite. We let $\mathcal{C}$ be the feasible set of (SDP):

$$\mathcal{C}=\{X\in\mathcal{S}^{n\times n} : \mathcal{A}(X)=b \text{ and } X\succeq 0\}. \tag{1}$$

Large-scale SDPs have been proposed for machine learning applications including matrix completion [candes2009exact], community detection [abbe2017community] and kernel learning [lanckriet2004learning] for $\mathbb{K}=\mathbb{R}$, and angular synchronization [singer2011angular] and phase retrieval [waldspurger2015phase] for $\mathbb{K}=\mathbb{C}$. Unfortunately, traditional methods to solve (SDP) do not scale (due to memory and computational requirements), hence the need for alternatives.

In order to address such scalability issues, burer2003nonlinear, burer2005local restrict the search to the set of matrices of rank at most $k$ by factorizing $X$ as $X = YY^*$, with $Y\in\mathbb{K}^{n\times k}$. It has been shown that if the search space $\mathcal{C}$ (1) is compact, then (SDP) admits a global optimum of rank at most $r$, where $\frac{r(r+1)}{2}\le m$ for $\mathbb{K}=\mathbb{R}$ and $r^2\le m$ for $\mathbb{K}=\mathbb{C}$ [barvinok1995problems, pataki1998rank]. In other words, restricting $X$ to the space of matrices with rank at most $k$ does not change the optimal value, provided $\frac{k(k+1)}{2}\ge m$ (for $\mathbb{K}=\mathbb{R}$) or $k^2\ge m$ (for $\mathbb{K}=\mathbb{C}$). This factorization leads to a quadratically constrained quadratic program:

$$\min_{Y\in\mathbb{K}^{n\times k}} \ \langle C, YY^*\rangle \quad \text{subject to} \quad \mathcal{A}(YY^*)=b. \tag{P}$$

Although (P) is in general non-convex because its feasible set

$$\mathcal{M}=\mathcal{M}_k=\{Y\in\mathbb{K}^{n\times k} : \mathcal{A}(YY^*)=b\} \tag{2}$$

is non-convex, considering (P) instead of the original SDP presents significant advantages: the number of variables is reduced from $O(n^2)$ to $nk$, and the positive semidefiniteness of the matrix $X = YY^*$ is naturally enforced. Solving (P) using local optimization methods is known as the Burer–Monteiro method and yields good results in practice: kulis2007fast underline the practical success of such low-rank approaches, in particular for maximum variance unfolding and for k-means clustering (see also [carson2017kmeans]); their approach is significantly faster and more scalable. However, the non-convexity of (P) means further analysis is needed to determine whether it can be solved to global optimality reliably.

For $\mathbb{K}=\mathbb{R}$, in the case where $\mathcal{M}$ is a compact, smooth manifold (see Assumption 1 below for a precise condition), it has been shown recently that, up to a zero-measure set of cost matrices, second-order stationary points (SOSPs) of (P) are globally optimal provided $\frac{k(k+1)}{2} > m$ [boumal2016non, boumal2018deterministic]. Algorithms such as the Riemannian trust-regions method (RTR) converge globally to SOSPs, but unfortunately they can only guarantee approximate satisfaction of second-order optimality conditions in a finite number of iterations [boumal2016globalrates].

The aforementioned papers close with a question, crucial in practice: when is it the case that approximate SOSPs, which we now call ASOSPs, are approximately optimal? Building on recent proof techniques by bhojanapalli2018smoothed, we provide some answers here.

### Contributions

This paper formulates approximate global optimality conditions holding for (P) and, consequently, for (SDP). Our results rely on the following core assumption, as set out in [boumal2016non].

###### Assumption 1 (Smooth manifold).

For all values of $k$ up to $n$ such that $\mathcal{M}_k$ is non-empty, the constraints on (P) defined by $\mathcal{A}$ and $b$ satisfy at least one of the following:

1. $\{A_1Y,\dots,A_mY\}$ are linearly independent in $\mathbb{K}^{n\times k}$ for all $Y\in\mathcal{M}_k$; or

2. $\{A_1Y,\dots,A_mY\}$ span a subspace of constant dimension in $\mathbb{K}^{n\times k}$ for all $Y$ in an open neighborhood of $\mathcal{M}_k$ in $\mathbb{K}^{n\times k}$.

In [boumal2018deterministic], it is shown that (a) if the assumption above is verified for $k=n$, then it automatically holds for all values of $k$ such that $\mathcal{M}_k$ is non-empty; and (b) for those values of $k$, the dimension of the subspace spanned by $\{A_1Y,\dots,A_mY\}$ is independent of $Y$: we call it $\operatorname{rank}\mathcal{A}$.

When Assumption 1 holds, we refer to problems of the form (SDP) as smooth SDPs because $\mathcal{M}$ is then a smooth manifold. Examples of smooth SDPs for $\mathbb{K}=\mathbb{R}$ are given in [boumal2018deterministic]. For $\mathbb{K}=\mathbb{C}$, we detail an example in Section 4. Our main theorem is a smoothed analysis result (cf. Theorem 3.1 for a more formal statement). An ASOSP is an approximate SOSP (a precise definition follows).

###### Theorem 1.1 (Informal).

Let Assumption 1 hold and assume $\mathcal{C}$ is compact. Randomly perturb the cost matrix $C$. With high probability, if $k = \tilde{O}(\sqrt{m})$, any ASOSP $Y$ for (P) is an approximate global optimum, and $X = YY^*$ is an approximate global optimum for (SDP) (with the perturbed $C$).

The high probability proviso is with respect to the perturbation only: if the perturbation is “good”, then all ASOSPs are as described in the statement. If $\mathcal{C}$ is compact, then so is $\mathcal{M}$, and known algorithms for optimization on manifolds produce an ASOSP in finite time (with explicit bounds). Theorem 1.1 ensures that, for $k$ large enough and for any cost matrix $C$, with high probability upon a random perturbation of $C$, such algorithms produce an approximate global optimum of (P).

Theorem 1.1 is a corollary of two intermediate arguments, developed in Lemmas 3.1 and 3.2:

1. Probabilistic argument (Lemma 3.1): By perturbing the cost matrix in the objective function of (P) with a Gaussian Wigner matrix, with high probability, any approximate first-order stationary point of the perturbed problem (P) is almost column-rank deficient.

2. Deterministic argument (Lemma 3.2): If an approximate second-order stationary point $Y$ for (P) is also almost column-rank deficient, then it is an approximate global optimum and $X = YY^*$ is an approximate global optimum for (SDP).

The first argument is motivated by smoothed analysis [SmoothedAnalysis] and draws heavily on a recent paper by bhojanapalli2018smoothed. The latter work introduces smoothed analysis to analyze the performance of the Burer–Monteiro factorization, but it analyzes a quadratically penalized version of the SDP: its solutions do not satisfy constraints exactly. This affords more generality, but, for the special class of smooth SDPs, the present work has the advantage of analyzing an exact formulation. The second argument is a quantitative (approximate) extension of well-known exact results [burer2003nonlinear, burer2005local, journee2010low]. Implications of this theorem for a particular SDP are derived in Section 4, with applications to phase retrieval and angular synchronization.

Thus, for smooth SDPs, our results improve upon [bhojanapalli2018smoothed] in that we address exact-feasibility formulations of the SDP. Our results also improve upon [boumal2016non] by providing approximate optimality results for approximate second-order points, with relaxation rank $k$ scaling as $\tilde{O}(\sqrt{m})$, whereas the latter reference establishes optimality only for exact second-order points. Finally, we aim for more generality by covering both real and complex SDPs, and we illustrate the relevance of complex SDPs in Section 4.

### Related work

A number of recent works focus on large-scale SDP solvers. Among the direct approaches (which proceed in the convex domain directly), SPARSESDP introduced a Frank–Wolfe type method for a restricted class of SDPs. Here, the key is that each iteration increases the rank of the solution only by one, so that if only a few iterations are required to reach satisfactory accuracy, then only low dimensional objects need to be manipulated. This line of work was later improved by HybridDSP, FastSpecta and SublinearSDP through hybrid methods. Still, if high-accuracy solutions are desired, a large number of iterations is required, eventually leading to large-rank iterates. In order to overcome this issue, SketchyDecisions recently proposed to combine conditional gradient and sketching techniques in order to maintain a low-rank representation of the iterates.

Among the low-rank approaches, our work is closest to (and indeed largely builds upon) recent results of bhojanapalli2018smoothed. For the real case, they consider a penalized version of problem (SDP) (which we here refer to as (P-SDP)) and its related penalized Burer–Monteiro formulation, here called (P-P). With high probability upon random perturbation of the cost matrix, they show approximate global optimality of ASOSPs for (P-P), assuming $k$ grows with $\sqrt{m}$ and either the SDP is compact or its cost matrix is positive definite. Given that there is a zero-measure set of SDPs where SOSPs may be suboptimal, there can be a small-measure set of SDPs where ASOSPs are not approximately optimal [bhojanapalli2018smoothed]. In this context, the authors resort to smoothed analysis, in the same way that we do here. One drawback of that work is that the final result does not hold for the original SDP, but for a non-equivalent penalized version of it. This is one of the points we improve here, by focusing on smooth SDPs as defined in [boumal2016non].

### Notation

We use $\mathbb{K}$ to refer to $\mathbb{R}$ or $\mathbb{C}$ when results hold for both fields. For matrices $A, B$ of the same size, we use the inner product $\langle A,B\rangle = \Re\{\operatorname{Tr}(A^*B)\}$, which reduces to $\operatorname{Tr}(A^\top B)$ in the real case. The associated Frobenius norm is defined as $\|A\| = \sqrt{\langle A,A\rangle}$. For a linear map $\mathcal{L}$ between matrix spaces, this yields a subordinate operator norm as $\|\mathcal{L}\|_{\mathrm{op}} = \max_{\|A\|=1}\|\mathcal{L}(A)\|$. The set of self-adjoint matrices of size $n$ over $\mathbb{K}$, $\mathcal{S}^{n\times n}$, is the set of symmetric matrices for $\mathbb{K}=\mathbb{R}$ or the set of Hermitian matrices for $\mathbb{K}=\mathbb{C}$. We also write $X\succeq Y$ to denote $X-Y\succeq 0$ for $X,Y\in\mathcal{S}^{n\times n}$. A self-adjoint matrix $X$ is positive semidefinite ($X\succeq 0$) if and only if $x^*Xx\ge 0$ for all $x\in\mathbb{K}^n$. Furthermore, $\mathrm{Id}$ is the identity operator and $I_n$ is the identity matrix of size $n$. The quantity $\operatorname{rank}\mathcal{A}$ is defined after Assumption 1.

## 2 Geometric framework and near-optimality conditions

In this section, we present properties of the smooth geometry of (P) and approximate global optimality conditions for this problem. In covering these preliminaries, we largely parallel developments in [boumal2016non]. As argued in that reference, Assumption 1 implies that the search space $\mathcal{M}$ of (P) is a submanifold of $\mathbb{K}^{n\times k}$ of codimension $\operatorname{rank}\mathcal{A}$. We can associate tangent spaces to a submanifold. Intuitively, the tangent space to the submanifold $\mathcal{M}$ at a point $Y$ is a subspace that best approximates $\mathcal{M}$ around $Y$, when the subspace origin is translated to $Y$. It is obtained by linearizing the equality constraints.

###### Lemma 2.1 (boumal2018deterministic).

Under Assumption 1, the tangent space at $Y$ to $\mathcal{M}$ (2), denoted by $\mathrm{T}_Y\mathcal{M}$, is:

$$\mathrm{T}_Y\mathcal{M} = \{\dot Y\in\mathbb{K}^{n\times k} : \mathcal{A}(\dot YY^* + Y\dot Y^*) = 0\} = \{\dot Y\in\mathbb{K}^{n\times k} : \langle A_iY, \dot Y\rangle = 0 \text{ for } i=1,\dots,m\}. \tag{3}$$

By equipping each tangent space with a restriction of the inner product $\langle\cdot,\cdot\rangle$, we turn $\mathcal{M}$ into a Riemannian submanifold of $\mathbb{K}^{n\times k}$. We also introduce the orthogonal projector $\mathrm{Proj}_Y$ which, given a matrix $Z\in\mathbb{K}^{n\times k}$, projects it to the tangent space $\mathrm{T}_Y\mathcal{M}$:

$$\mathrm{Proj}_YZ := \operatorname*{argmin}_{\dot Y\in\mathrm{T}_Y\mathcal{M}} \|\dot Y - Z\|. \tag{4}$$

This projector will be useful to phrase optimality conditions. It is characterized as follows.

###### Lemma 2.2 (boumal2018deterministic).

Under Assumption 1, the orthogonal projector admits the closed form

$$\mathrm{Proj}_YZ = Z - \mathcal{A}^*\big(G^\dagger\mathcal{A}(ZY^*)\big)Y,$$

where $\mathcal{A}^*$ is the adjoint of $\mathcal{A}$, $G\in\mathbb{R}^{m\times m}$ is a Gram matrix defined by $G_{ij} = \langle A_iY, A_jY\rangle$ (it is a function of $Y$), and $G^\dagger$ denotes the Moore–Penrose pseudo-inverse of $G$ (differentiable in $Y$).
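As a sanity check on this closed form, the sketch below instantiates it on a hypothetical toy instance with constraint matrices $A_i = e_ie_i^\top$ (i.e. $\operatorname{diag}(YY^\top)=\mathbf{1}$), and verifies numerically that the output is tangent and that the projector is idempotent; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 3

# Toy smooth constraints: diag(Y Y^T) = 1, i.e. A_i = e_i e_i^T (illustrative
# instance; the closed form holds for any constraints satisfying Assumption 1).
A = [np.outer(np.eye(n)[i], np.eye(n)[i]) for i in range(n)]

Y = rng.standard_normal((n, k))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)  # feasible point on the manifold

def calA(X):        # A(X)_i = <A_i, X>
    return np.array([np.trace(Ai.T @ X) for Ai in A])

def calA_star(mu):  # adjoint: A*(mu) = sum_i mu_i A_i
    return sum(mu_i * Ai for mu_i, Ai in zip(mu, A))

# Gram matrix G_ij = <A_i Y, A_j Y> and the projector of the lemma.
G = np.array([[np.trace((Ai @ Y).T @ (Aj @ Y)) for Aj in A] for Ai in A])

def proj(Z):
    return Z - calA_star(np.linalg.pinv(G) @ calA(Z @ Y.T)) @ Y

Z = rng.standard_normal((n, k))
P = proj(Z)

# P is tangent: <A_i Y, P> = 0 for all i ...
assert np.allclose([np.trace((Ai @ Y).T @ P) for Ai in A], 0)
# ... and proj is idempotent, as an orthogonal projector must be.
assert np.allclose(proj(P), P)
```

For this diagonal constraint set, $G$ reduces to the identity because the rows of a feasible $Y$ have unit norm, which is the fact exploited in Section 4.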

(See a proof in Appendix A.) To properly state the approximate first- and second-order necessary optimality conditions for (P), we further need the notions of Riemannian gradient and Riemannian Hessian on the manifold $\mathcal{M}$. We recall that (P) minimizes the function $g:\mathcal{M}\to\mathbb{R}$, defined by

$$g(Y) = \langle CY, Y\rangle, \tag{5}$$

on the manifold $\mathcal{M}$. The Riemannian gradient of $g$ at $Y$, $\operatorname{grad} g(Y)$, is the unique tangent vector at $Y$ such that $\langle\operatorname{grad} g(Y), \dot Y\rangle = \langle\nabla g(Y), \dot Y\rangle$ for all tangent $\dot Y$, with $\nabla g(Y) = 2CY$ the Euclidean (classical) gradient of $g$ evaluated at $Y$. Intuitively, $\operatorname{grad} g(Y)$ is the tangent vector at $Y$ that points in the steepest ascent direction for $g$ as seen from the manifold’s perspective. A classical result states that, for Riemannian submanifolds, the Riemannian gradient is given by the projection of the classical gradient to the tangent space [absil2009optimization, eq. (3.37)]:

$$\operatorname{grad} g(Y) = \mathrm{Proj}_Y \nabla g(Y) = \mathrm{Proj}_Y(2CY). \tag{6}$$

This leads us to define the matrix $S$, which plays a key role in guaranteeing approximate global optimality for problem (P), as discussed in Section 3:

$$S = S(Y) = C - \mathcal{A}^*(\mu) = C - \sum_{i=1}^m \mu_i A_i, \tag{7}$$

where $\mu = \mu(Y) = G^\dagger\mathcal{A}(CYY^*)$. We can write the Riemannian gradient of $g$ evaluated at $Y$ as

$$\operatorname{grad} g(Y) = 2SY. \tag{8}$$

The Riemannian gradient enables us to define an approximate first-order necessary optimality condition below. To define the approximate second-order necessary optimality condition, we need to introduce the notion of Riemannian Hessian. The Riemannian Hessian of $g$ at $Y$ is a self-adjoint operator on the tangent space at $Y$ obtained as the projection of the derivative of the Riemannian gradient vector field [absil2009optimization, eq. (5.15)]. boumal2018deterministic give a closed-form expression for the Riemannian Hessian of $g$ at $Y$:

$$\operatorname{Hess} g(Y)[\dot Y] = 2\,\mathrm{Proj}_Y(S\dot Y). \tag{9}$$

We can now formally define the approximate necessary optimality conditions for problem (P).

###### Definition 2.1 ($\varepsilon_g$-FOSP).

$Y\in\mathcal{M}$ is an $\varepsilon_g$–first-order stationary point for (P) if the norm of the Riemannian gradient of $g$ at $Y$ almost vanishes; specifically, $\|\operatorname{grad} g(Y)\| = \|2SY\| \le \varepsilon_g$, where $S$ is defined as in equation (7).

###### Definition 2.2 (($\varepsilon_g,\varepsilon_H$)-SOSP).

$Y\in\mathcal{M}$ is an $(\varepsilon_g,\varepsilon_H)$–second-order stationary point for (P) if it is an $\varepsilon_g$–first-order stationary point and the Riemannian Hessian of $g$ at $Y$ is almost positive semidefinite; specifically,

$$\forall \dot Y\in\mathrm{T}_Y\mathcal{M}, \quad \tfrac12\langle\dot Y, \operatorname{Hess} g(Y)[\dot Y]\rangle = \langle\dot Y, S\dot Y\rangle \ge -\varepsilon_H\|\dot Y\|^2.$$

From these definitions, it is clear that the matrix $S$ encapsulates the approximate optimality conditions of problem (P).
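For a small instance, the two tolerances in Definitions 2.1 and 2.2 can be measured directly. The sketch below uses a hypothetical real instance with constraints $\operatorname{diag}(YY^\top)=\mathbf{1}$, for which $\mu = \operatorname{diag}(CYY^\top)$ and $S = C - \operatorname{Diag}(\mu)$; it computes the Riemannian gradient norm and the smallest eigenvalue of the Hessian quadratic form restricted to the tangent space. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 4

# Illustrative instance with constraints diag(Y Y^T) = 1.
C = rng.standard_normal((n, n)); C = (C + C.T) / 2
Y = rng.standard_normal((n, k))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)

# For these constraints, mu = diag(C Y Y^T) and S = C - Diag(mu), eq. (7).
S = C - np.diag(np.diag(C @ Y @ Y.T))
grad = 2 * S @ Y                     # Riemannian gradient, eq. (8)
eps_g = np.linalg.norm(grad)         # tolerance of Definition 2.1

# Tangent space: {Ydot : diag(Ydot Y^T) = 0}. Build its orthonormal basis from
# the null space of the constraint differential B (rows have disjoint support).
B = np.zeros((n, n * k))
for i in range(n):
    B[i, i * k:(i + 1) * k] = Y[i]
_, _, Vt = np.linalg.svd(B)
N = Vt[n:].T                         # orthonormal basis of T_Y M (row-major vec)

# Quadratic form <Ydot, S Ydot> = vec(Ydot)^T kron(S, I_k) vec(Ydot), restricted.
Q = np.kron(S, np.eye(k))
lam_min = np.linalg.eigvalsh(N.T @ Q @ N).min()

# Y is an (eps_g, eps_H)-SOSP for any eps_H >= max(0, -lam_min).
print(eps_g, lam_min)
```

As a consistency check, the Riemannian gradient $2SY$ is automatically tangent here: $\operatorname{diag}(SYY^\top) = 0$ by construction of $\mu$.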

## 3 Approximate second-order points and smoothed analysis

We state our main results formally in this section. As announced, following [bhojanapalli2018smoothed], we resort to smoothed analysis [SmoothedAnalysis]. To this end, we consider perturbations of the cost matrix of (SDP) by a Gaussian Wigner matrix. Intuitively, smoothed analysis tells us how large the variance of the perturbation should be in order to obtain a new SDP which, with high probability, is sufficiently distant from any pathological case. We start by formally defining the notion of Gaussian Wigner matrix, following [arous2011wigner].

###### Definition 3.1 (Gaussian Wigner matrix).

$W\in\mathcal{S}^{n\times n}$ is a Gaussian Wigner matrix with variance $\sigma_W^2$ if its entries on and above the diagonal are independent, zero-mean Gaussian variables (real or complex depending on $\mathbb{K}$) with variance $\sigma_W^2$.
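Definition 3.1 can be transcribed directly; the sketch below samples a real and a complex instance (the complex normalization is one common choice among several in the literature) and checks empirically the operator-norm bound $3\sigma_W\sqrt{n}$ used later in (19). Sizes and the variance are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
n, sigma = 300, 0.5

# Real Gaussian Wigner matrix: i.i.d. N(0, sigma^2) entries on and above the
# diagonal, symmetrized below.
Gm = rng.normal(scale=sigma, size=(n, n))
W = np.triu(Gm) + np.triu(Gm, 1).T

# Complex Hermitian variant: independent real/imaginary parts off-diagonal
# (each with variance sigma^2/2), real Gaussian diagonal.
H = (rng.normal(scale=sigma / np.sqrt(2), size=(n, n))
     + 1j * rng.normal(scale=sigma / np.sqrt(2), size=(n, n)))
Wc = np.triu(H, 1) + np.triu(H, 1).conj().T + np.diag(rng.normal(scale=sigma, size=n))

# The bound ||W||_op <= 3 sigma sqrt(n) holds with high probability;
# the typical operator norm is about 2 sigma sqrt(n).
for M in (W, Wc):
    assert np.linalg.norm(M, 2) <= 3 * sigma * np.sqrt(n)
```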

Besides Assumption 1, another important assumption for our results is that the search space $\mathcal{C}$ (1) of (SDP) is compact. In that scenario, there exists a finite constant $R$ such that

$$\forall X\in\mathcal{C}, \quad \operatorname{Tr}(X)\le R. \tag{10}$$

Thus, $\|X\|\le R$ for all $X\in\mathcal{C}$. Another consequence of compactness of $\mathcal{C}$ is that the operator $\mathcal{A}^*\circ G^\dagger\circ\mathcal{A}$ is uniformly bounded on $\mathcal{M}$; that is, there exists a finite constant $K$ such that

$$\forall Y\in\mathcal{M}, \quad \|\mathcal{A}^*\circ G^\dagger\circ\mathcal{A}\|_{\mathrm{op}} \le K, \tag{11}$$

where $G = G(Y)$ is a continuous function of $Y$, as in Lemma 2.2. We give explicit expressions for the constants $R$ and $K$ in the case of phase retrieval in Section 4.

We now state the main theorem, whose proof is in Appendix E.

###### Theorem 3.1.

Let Assumption 1 hold for (SDP) with cost matrix $C$ and $m$ constraints. Assume $\mathcal{C}$ (1) is compact, and let $R$ and $K$ be as in (10) and (11). Let $W$ be a Gaussian Wigner matrix with variance $\sigma_W^2$ and let $\delta\in(0,1)$ be any tolerance. Define $\kappa$ as:

$$\kappa = \kappa(R,K,C,n,\sigma_W) = RK\big(\|C\|_{\mathrm{op}} + 3\sigma_W\sqrt{n}\big). \tag{12}$$

There exists a universal constant $c_0$ such that, if the rank $k$ for the low-rank problem (P) satisfies

$$k \ge 3\left[\log(n) + \sqrt{\log(1/\delta)} + \sqrt{m\cdot\log\!\left(1+\frac{6\kappa\sqrt{c_0 n}}{\sigma_W}\right)}\,\right], \tag{13}$$

then, with probability at least $1-\delta-e^{-n/2}$ on the random matrix $W$, any $(\varepsilon_g,\varepsilon_H)$-SOSP $Y$ of (P) with perturbed cost matrix $C+W$ has bounded optimality gap:

$$0 \le g(Y) - f^\star \le \big(\varepsilon_H + \varepsilon_g^2\,\eta\big)R + \frac{\varepsilon_g}{2}\sqrt{R}, \tag{14}$$

with $g$ the cost function of (P) and $f^\star$ the optimal value of (SDP) (both with perturbed cost matrix $C+W$), and

$$\eta = \eta(R,K,C,n,m,\sigma_W) = \frac{c_0\, n\, K(2+KR)^2\,\big(\|C\|_{\mathrm{op}} + 3\sigma_W\sqrt{n}\big)}{9\, m\, \sigma_W^2\, \log\!\left(1+\frac{6\kappa\sqrt{c_0 n}}{\sigma_W}\right)}. \tag{15}$$

This result indicates that, as long as the rank $k$ is on the order of $\sqrt{m}$ (up to logarithmic factors), the optimality gap in the perturbed problem is small if a sufficiently good approximate second-order point is computed. Since (SDP) may admit a unique solution of rank on the order of $\sqrt{m}$ (see for example [laurent1996facial, Thm. 3.1(ii)] for the Max-Cut SDP), we conclude that the scaling of $k$ with respect to $m$ in Theorem 3.1 is essentially optimal.

There is an incentive to pick $\sigma_W$ small, since the optimality gap is phrased in terms of the perturbed problem. As expected though, taking $\sigma_W$ small comes at a price. Specifically, the required rank (13) grows as $\sigma_W$ decreases, so that a smaller $\sigma_W$ may require $k$ to be a larger multiple of $\sqrt{m}$. Furthermore, the optimality gap is bounded in terms of $\eta$, with a dependence in $1/\sigma_W^2$; this may force us to compute more accurate approximate second-order points (smaller $\varepsilon_g$) for a similar guarantee when $\sigma_W$ is smaller: see also Corollary 3.1 below.

As announced, the theorem rests on two arguments which we now present—a probabilistic one, and a deterministic one:

1. Probabilistic argument: In the smoothed analysis framework, we show, for $k$ large enough, that $\varepsilon_g$-FOSPs of (P) have their $k$th singular value near zero, with high probability upon perturbation of $C$. This implies that such points are almost column-rank deficient.

2. Deterministic argument: If $Y$ is an $(\varepsilon_g,\varepsilon_H)$-SOSP of (P) and it is almost column-rank deficient, then the matrix $S$ defined in equation (7) is almost positive semidefinite. From there, we can derive a bound on the optimality gap.

Formal statements for both follow, building on the notation in Theorem 3.1. Proofs are in Appendices C and D, with supporting lemmas in Appendix B.

###### Lemma 3.1.

Let Assumption 1 hold for (SDP). Assume $\mathcal{C}$ (1) is compact. Let $W$ be a Gaussian Wigner matrix with variance $\sigma_W^2$ and let $\delta\in(0,1)$ be any tolerance. There exists a universal constant $c_0$ such that, if the rank $k$ for the low-rank problem (P) is lower bounded as in (13), then, with probability at least $1-\delta-e^{-n/2}$ on the random matrix $W$, we have

$$\|W\|_{\mathrm{op}} \le 3\sigma_W\sqrt{n},$$

and furthermore: any $\varepsilon_g$-FOSP $Y$ of (P) with perturbed cost matrix $C+W$ satisfies

$$\sigma_k(Y) \le \frac{\varepsilon_g\sqrt{c_0 n}}{k\,\sigma_W},$$

where $\sigma_k(Y)$ is the $k$th singular value of the matrix $Y$.

###### Lemma 3.2.

Let Assumption 1 hold for (SDP) with cost matrix $C$. Assume $\mathcal{C}$ is compact. Let $Y$ be an $(\varepsilon_g,\varepsilon_H)$-SOSP of (P) (for any $k$). Then, the smallest eigenvalue of $S = S(Y)$ (7) is bounded below as

$$\lambda_{\min}(S) \ge -\varepsilon_H - \zeta\,\|C\|_{\mathrm{op}}\cdot\sigma_k^2(Y),$$

where $\zeta = \zeta(R,K)$ is an explicit constant depending only on $R$ and $K$ as in (10) and (11), and $\sigma_k(Y)$ is the $k$th singular value of $Y$ (it is zero if $\operatorname{rank}(Y) < k$). This holds deterministically for any cost matrix $C$.

Combining the two lemmas above, the key step in the proof of Theorem 3.1 is to deduce a bound on the optimality gap from a bound on the smallest eigenvalue of $S$: see Appendix E.

We have shown in Theorem 3.1 that a perturbed version of (P) can be approximately solved to global optimality, with high probability on the perturbation. In the corollary below, we further bound the optimality gap at the approximate solution of the perturbed problem with respect to the original, unperturbed problem. The proof is in Appendix F.

###### Corollary 3.1.

Assume $\mathcal{C}$ is compact and let $R$ be as defined in (10). Let $X\in\mathcal{C}$ be an approximate solution for (SDP) with perturbed cost matrix $C+W$, in the sense that the optimality gap of $X$ in the perturbed problem is bounded by $\varepsilon_f$. Let $f^\star$ denote the optimal value of the unperturbed problem (SDP), with cost matrix $C$. Then, the optimality gap for $X$ with respect to the unperturbed problem is bounded as:

$$0 \le \langle C, X\rangle - f^\star \le \varepsilon_f + 2\|W\|_{\mathrm{op}}R.$$

Under the conditions of Theorem 3.1, with the prescribed probability, $\varepsilon_f$ and $\|W\|_{\mathrm{op}}$ can be bounded so that, for an $(\varepsilon_g,\varepsilon_H)$-SOSP $Y$ of the perturbed problem (P), we have:

$$0 \le \langle CY, Y\rangle - f^\star \le \big(\varepsilon_H + \varepsilon_g^2\,\eta\big)R + \frac{\varepsilon_g}{2}\sqrt{R} + 6\sigma_W\sqrt{n}\,R,$$

where $\eta$ is as defined in (15) and $\sigma_W^2$ is the variance of the Wigner perturbation $W$.
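The extra $2\|W\|_{\mathrm{op}}R$ term comes from bounding $|\langle W, X\rangle| \le \|W\|_{\mathrm{op}}\operatorname{Tr}(X) \le \|W\|_{\mathrm{op}}R$, applied once at $X$ and once at the perturbed optimum. A quick numerical check of that key inequality, on hypothetical random PSD data:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50

# Symmetric perturbation W and a random PSD matrix X playing the role of a
# feasible point; here we simply take R = Tr(X).
W = rng.standard_normal((n, n)); W = (W + W.T) / 2
B = rng.standard_normal((n, n)); X = B @ B.T
R = np.trace(X)

# For PSD X, the nuclear norm equals the trace, hence
# |<W, X>| <= ||W||_op * ||X||_* = ||W||_op * Tr(X) <= ||W||_op * R.
lhs = abs(np.trace(W @ X))
rhs = np.linalg.norm(W, 2) * R
assert lhs <= rhs + 1e-9
```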

## 4 Applications

The approximate global optimality results established in the previous section can be applied to deduce guarantees on the quality of ASOSPs of the low-rank factorization for a number of SDPs that appear in machine learning problems. Of particular interest, we focus on the phase retrieval problem. This problem consists in retrieving a signal $x\in\mathbb{C}^d$ from amplitude measurements $b = |Ax|\in\mathbb{R}^n$ (the absolute value of the vector $Ax$ is taken entry-wise). If we can recover the complex phases $u$ of $Ax$, then $x$ can be estimated through linear least-squares. Following this approach, waldspurger2015phase argue that this task can be modeled as the following non-convex problem:

$$\min_{u\in\mathbb{C}^n}\ u^*Cu \quad \text{subject to} \quad |u_i| = 1 \text{ for } i=1,\dots,n, \tag{PR}$$

where $C = \operatorname{Diag}(b)\big(I_n - AA^\dagger\big)\operatorname{Diag}(b)$ and $\operatorname{Diag}$ maps a vector to the corresponding diagonal matrix. The classical relaxation is to rewrite the above in terms of $X = uu^*$ (lifting) without enforcing $\operatorname{rank}(X) = 1$, leading to a complex SDP which waldspurger2015phase call PhaseCut:

$$\min_{X\in\mathcal{H}^{n\times n}}\ \langle C, X\rangle \quad \text{subject to} \quad \operatorname{diag}(X) = \mathbf{1},\ X\succeq 0. \tag{PhaseCut}$$

The same SDP relaxation also applies to a problem called angular synchronization [singer2011angular]. The Burer–Monteiro factorization of (PhaseCut) is an optimization problem over a matrix $Y\in\mathbb{C}^{n\times k}$, as follows:

$$\min_{Y\in\mathbb{C}^{n\times k}}\ \langle CY, Y\rangle \quad \text{subject to} \quad \operatorname{diag}(YY^*) = \mathbf{1}. \tag{PhaseCut-BM}$$

For a feasible $Y$, each row of $Y$ has unit norm: the search space is a Cartesian product of $n$ spheres in $\mathbb{C}^k$, which is a smooth manifold. We can check that Assumption 1 holds for all $k$. Furthermore, the feasible space of the SDP is compact. Therefore, Theorem 3.1 applies.

In this setting, $\operatorname{Tr}(X) = n$ for all feasible $X$, and $\|\mathcal{A}^*\circ G^\dagger\circ\mathcal{A}\|_{\mathrm{op}} \le 1$ for all feasible $Y$ (because $G = I_n$ for all feasible $Y$—see Lemma 2.2—and $\mathcal{A}^*\circ\mathcal{A}$ is an orthogonal projector from Hermitian matrices to diagonal matrices). For this reason, the constants defined in (10) and (11) can be set to $R = n$ and $K = 1$.
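A minimal sketch of the Burer–Monteiro approach for (PhaseCut-BM): plain Riemannian gradient descent with a row-normalization retraction, used here instead of the trust-region method of the experiments for simplicity. The cost matrix and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 60, 8          # k of order sqrt(n) suffices per the theory

# Illustrative Hermitian PSD cost matrix (any Hermitian C fits PhaseCut).
Z = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
C = Z @ Z.conj().T / n

def normalize_rows(Y):
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

cost = lambda Y: np.real(np.vdot(Y, C @ Y))   # <CY, Y> = tr(Y* C Y)

Y = normalize_rows(rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k)))
step = 0.1 / np.linalg.norm(C, 2)
costs = [cost(Y)]
for _ in range(200):
    Grad = 2 * C @ Y                              # Euclidean gradient
    # Project each row's gradient onto the tangent space of its sphere
    # (real inner product, as in Section 2).
    Grad -= np.real(np.sum(np.conj(Y) * Grad, axis=1, keepdims=True)) * Y
    Y = normalize_rows(Y - step * Grad)           # retraction back to the manifold
    costs.append(cost(Y))

assert costs[-1] <= costs[0] + 1e-9               # the objective decreases
```

Rounding $Y$ to a feasible phase vector $u$ (e.g. via a leading singular vector of $Y$, entry-wise normalized) would complete the pipeline; that step is not shown here.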

As a comparison, mei2017solving also provide an optimality gap for ASOSPs of (PhaseCut) without perturbation. Their result is more general in the sense that it holds for all possible values of $k$. However, when $k$ is large, it does not accurately capture the fact that SOSPs are optimal, thus incurring a larger bound on the optimality gap of ASOSPs. In contrast, our bounds do show that, for $k$ large enough, as $\varepsilon_g$ and $\varepsilon_H$ go to zero, the optimality gap goes to zero, with the trade-off that they do so for a perturbed problem (though see Corollary 3.1), with high probability.

### Numerical Experiments

We present the empirical performance of the low-rank approach in the case of (PhaseCut). We compare it with a dedicated interior-point method (IPM) implemented by helmberg1996interior for real SDPs and adapted to phase retrieval as done by waldspurger2015phase. This adaptation involves splitting the real and the imaginary parts of the variables in (PhaseCut) and forming an equivalent real SDP with double the dimension. The Burer–Monteiro approach (BM) is implemented in complex form directly using Manopt, a toolbox for optimization on manifolds [JMLR:v15:boumal14a]. In particular, a Riemannian Trust-Region method (RTR) is used [genrtr]. Theory supports that these methods can return an ASOSP in a finite number of iterations [boumal2016globalrates]. We stress that the SDP is not perturbed in these experiments: the role of the perturbation in the analysis is to understand why the low-rank approach is so successful in practice despite the existence of pathological cases. In practice, we do not expect to encounter pathological cases.

Our numerical experiment setup is as follows. We seek to recover a signal $x\in\mathbb{C}^d$ of dimension $d$ from $n$ measurements encoded in the vector $b\in\mathbb{R}^n$ such that $b = |Ax| + \sigma\varepsilon$, where $A\in\mathbb{C}^{n\times d}$ is the sensing matrix and $\varepsilon$ is standard Gaussian noise. For the numerical experiments, we generate the signals $x$ as complex random vectors with i.i.d. standard Gaussian entries, and we randomly generate the complex sensing matrices $A$ also with i.i.d. standard Gaussian entries. We do so for values of $d$ ranging from 10 to 1000, and always for $n = 10d$ (that is, there are 10 magnitude measurements per unknown complex coefficient, which is an oversampling factor of 5). Lastly, we generate the measurement vectors $b$ as described above and we cap their values from below at a small positive threshold in order to avoid small (or even negative) magnitude measurements.
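The data for one such instance can be generated as follows (a sketch: the noise level and the lower cap are placeholder values, and the PhaseCut cost matrix $C = \operatorname{Diag}(b)(I - AA^\dagger)\operatorname{Diag}(b)$ follows waldspurger2015phase):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 20                 # signal dimension
n = 10 * d             # number of magnitude measurements
sigma_noise = 0.05     # placeholder noise level

# Random complex Gaussian signal and sensing matrix, as in the experiments.
x = (rng.standard_normal(d) + 1j * rng.standard_normal(d)) / np.sqrt(2)
A = (rng.standard_normal((n, d)) + 1j * rng.standard_normal((n, d))) / np.sqrt(2)

# Noisy magnitude measurements, capped below (placeholder threshold).
b = np.abs(A @ x) + sigma_noise * rng.standard_normal(n)
b = np.maximum(b, 1e-6)

# PhaseCut cost matrix: C = Diag(b) (I - A A^+) Diag(b).
P = np.eye(n) - A @ np.linalg.pinv(A)
C = np.diag(b) @ P @ np.diag(b)

# C is Hermitian and PSD, since I - A A^+ is an orthogonal projector.
assert np.allclose(C, C.conj().T)
assert np.linalg.eigvalsh(C).min() >= -1e-8
```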

For $n$ up to 3000, both methods solve the same problem, and indeed produce the same answer up to small discrepancies. The BM approach is more accurate, at least in satisfying the constraints, and, for the largest of these values of $n$, it is also about 40 times faster than the IPM. BM is run with $k = \sqrt{n}$ (rounded up), which is expected to be generically sufficient to include the global optimum of the SDP (as confirmed in practice). For larger values of $n$, the IPM ran into memory issues and we had to abort the process.

## 5 Conclusion

We considered the low-rank (or Burer–Monteiro) approach to solve equality-constrained SDPs. Our key assumptions are that (a) the search space of the SDP is compact, and (b) the search space of its low-rank version is smooth (the actual condition is slightly stronger). Under these assumptions, we proved using smoothed analysis that, provided $k = \tilde{O}(\sqrt{m})$ where $m$ is the number of constraints, if the cost matrix is perturbed randomly, with high probability, approximate second-order stationary points of the perturbed low-rank problem map to approximately optimal solutions of the perturbed SDP. We also related optimality gaps in the perturbed SDP to optimality gaps in the original SDP. Finally, we applied this result to an SDP relaxation of phase retrieval (also applicable to angular synchronization).

## Acknowledgments

NB is partially supported by NSF award DMS-1719558.

## Appendix A Proof of Lemma 2.2

We follow the proof of [boumal2018deterministic, Lemma 2.2] and reproduce it here to be self-contained, and also because the reference treats only the real case; writing the proof here explicitly allows us to verify that, indeed, all steps go through for the complex case as well.

Orthogonal projection is along the normal space, so that $\mathrm{Proj}_YZ$ is in $\mathrm{T}_Y\mathcal{M}$ (3) and $Z - \mathrm{Proj}_YZ$ is in $\mathrm{N}_Y\mathcal{M}$, where the normal space at $Y$ is (using Assumption 1)

$$\mathrm{N}_Y\mathcal{M} = \{Z\in\mathbb{K}^{n\times k} : \langle Z, \dot Y\rangle = 0\ \forall\,\dot Y\in\mathrm{T}_Y\mathcal{M}\} = \operatorname{span}\{A_1Y,\dots,A_mY\}. \tag{16}$$

From the latter we infer there exists $\mu\in\mathbb{R}^m$ such that

$$Z - \mathrm{Proj}_YZ = \sum_{i=1}^m \mu_i A_iY = \mathcal{A}^*(\mu)Y,$$

since the adjoint of $\mathcal{A}$ is $\mathcal{A}^*(\mu) = \sum_{i=1}^m \mu_i A_i$. Multiply on the right by $Y^*$ and apply $\mathcal{A}$ to obtain

$$\mathcal{A}(ZY^*) = \mathcal{A}\big(\mathcal{A}^*(\mu)YY^*\big),$$

where we used $\mathcal{A}\big((\mathrm{Proj}_YZ)Y^*\big) = 0$ since $\mathrm{Proj}_YZ\in\mathrm{T}_Y\mathcal{M}$. The right-hand side expands into $G\mu$, where $G$ is a real, positive semidefinite matrix of size $m$ defined by $G_{ij} = \langle A_iY, A_jY\rangle$. By construction, this system of equations in $\mu$ has at least one solution; we single out $\mu = G^\dagger\mathcal{A}(ZY^*)$, where $G^\dagger$ is the Moore–Penrose pseudo-inverse of $G$. The function $Y\mapsto G(Y)^\dagger$ is continuous and differentiable at $Y$ provided $G$ has constant rank in an open neighborhood of $Y$ [golub1973differentiation, Thm 4.3], which is the case for all $Y\in\mathcal{M}$ under Assumption 1.

## Appendix B Lower-bound for smallest singular values

This appendix provides supporting results necessary for Appendix C, which is devoted to the proof of Lemma 3.1. The statements we need are established for $\mathbb{K}=\mathbb{R}$ in [bhojanapalli2018smoothed, Cor. 5, Lem. 7]. Here we give the corresponding statements for $\mathbb{K}=\mathbb{C}$: the proofs are essentially the same.

We first state a special case of Corollary 1.17 from [nguyen2017random]. Here, $N_I(M)$ denotes the number of eigenvalues of $M$ in the real interval $I$. (Note that the reference covers the real case in its main statement, and addresses the complex case later on as a remark.) For Gaussian Wigner matrices, we follow Definition 3.1. Furthermore, $\mathbb{P}\{E\}$ denotes the probability of event $E$.

###### Corollary B.1.

Let $\bar M$ be a deterministic Hermitian matrix of size $n$, and let $\bar W$ be a Gaussian Wigner matrix with variance 1. Then, for any given $k$, there exists a constant $c > 0$ such that, for any $\varepsilon > 0$, with $I$ being the interval $[-\varepsilon\sqrt{n}, \varepsilon\sqrt{n}]$, the probability $\mathbb{P}\{N_I(\bar M + \bar W) \ge k\}$ is exponentially small in $k^2$; see [nguyen2017random, Cor. 1.17] for the precise bound.

The next lemma follows easily; the original proof for $\mathbb{K}=\mathbb{R}$ is in [bhojanapalli2018smoothed, Lem. 7].

###### Lemma B.1.

Let $M$ be a deterministic Hermitian matrix of size $n$. Let $W$ be a complex Gaussian Wigner matrix of size $n$ with variance $\sigma_W^2$, independent of $M$. There exists an absolute constant $c_0$ such that:

$$\mathbb{P}\left\{\sum_{i=1}^k \sigma_{n-(i-1)}(M+W)^2 < \frac{k^2\sigma_W^2}{c_0 n}\right\} \le \exp\!\left(-\frac{k^2}{8}\log(8\pi) + k\log(n)\right).$$
###### Proof.

In our case, the entries of $W$ have variance $\sigma_W^2$. Thus, set $\bar M = M/\sigma_W$ and $\bar W = W/\sigma_W$, so that $\bar W$ is a Gaussian Wigner matrix with variance 1 and, by scaling of eigenvalues,

$$N_{\sigma_W I}(M+W) = N_I(\bar M + \bar W).$$

From Corollary B.1, with an appropriate choice of $\varepsilon$ and the associated constant $c$, the event $N_I(\bar M + \bar W) < k$ holds with at least the probability announced in the statement. In this event, fewer than $k$ eigenvalues of $\bar M + \bar W$ lie in $I$, so that

$$\sigma_{n-(k-1)}(M+W) \ge \frac{k\,\sigma_W}{2c\sqrt{n}}.$$

In that event,

$$\sum_{i=1}^k \sigma_{n-(i-1)}(M+W)^2 \ge \sigma_{n-(k-1)}(M+W)^2 \ge \frac{k^2\sigma_W^2}{c_0 n}$$

for the absolute constant $c_0 = 4c^2$. ∎

## Appendix C Proof of Lemma 3.1

This section builds on a result from [bhojanapalli2018smoothed], where a similar statement was made under different assumptions. The proof follows closely the developments therein, with appropriate changes. Using (8), $Y$ is an $\varepsilon_g$-FOSP of the perturbed problem if and only if $\|2SY\| \le \varepsilon_g$, with $S = M + W$ and

$$M = C - (\mathcal{A}^*\circ G^\dagger\circ\mathcal{A})\big((C+W)YY^*\big). \tag{17}$$

Let $Y = U\Sigma V^*$ be a thin SVD of $Y$, where $U$ is $n\times k$ with orthonormal columns (assuming without loss of generality $k\le n$, as otherwise $\sigma_k(Y) = 0$ and the claim holds deterministically) and $V$ is unitary of size $k$. Then,

$$\varepsilon_g \ge \|2SY\| = \|2(M+W)Y\| \ge 2\sigma_k(Y)\,\|(M+W)U\| \ge 2\sigma_k(Y)\sqrt{\sum_{i=1}^k \sigma_{n-(i-1)}(M+W)^2}.$$

Thus, we control the smallest singular value of $Y$ in terms of $\varepsilon_g$ and the smallest singular values of $M+W$:

$$\sigma_k(Y) \le \frac{\varepsilon_g}{2\sqrt{\sum_{i=1}^k \sigma_{n-(i-1)}(M+W)^2}}. \tag{18}$$

Given that $M$ is not statistically independent of $W$, we are not able to directly apply Lemma B.1. Indeed, $M$ depends on $Y$ and on $W$, and $Y$ itself is an $\varepsilon_g$-FOSP of the perturbed problem: a feature which depends on $W$. To tackle this issue, we cover the set of possible $M$'s with a net. Lemma B.1 provides a bound for each $\bar M$ in this net. By union bound, we can extend the lemma to all $\bar M$ in the net. By taking a sufficiently dense net, we then infer that $M$ is necessarily close to one of these $\bar M$'s and conclude.

To this end, we first control $\|M - C\|$ using the definitions of $R$ (10) and $K$ (11):

$$\|M - C\| = \big\|\mathcal{A}^*\circ G^\dagger\circ\mathcal{A}\big((C+W)YY^*\big)\big\| \le \|\mathcal{A}^*\circ G^\dagger\circ\mathcal{A}\|_{\mathrm{op}}\,\|C+W\|_{\mathrm{op}}\,\|YY^*\| \le K\big(\|C\|_{\mathrm{op}} + \|W\|_{\mathrm{op}}\big)R.$$

Since $W$ is a Gaussian Wigner matrix with variance $\sigma_W^2$, it is a well-known fact (see for instance Part 1 of Appendix A in [bandeira2016tightness]; the reference proves the statement for complex matrices with diagonal entries equal to zero, but the proof is easily adapted to the definition of Wigner matrices used in this paper, both real and complex) that, with probability at least $1 - e^{-n/2}$,

$$\|W\|_{\mathrm{op}} \le 3\sigma_W\sqrt{n}. \tag{19}$$

Hence, with probability at least $1 - e^{-n/2}$,

$$\|M - C\| \le RK\big(\|C\|_{\mathrm{op}} + 3\sigma_W\sqrt{n}\big) \triangleq \kappa,$$

where we recover $\kappa$ as defined in (12).

As a result, $M$ lies in a ball of center $C$ and radius $\kappa$. Moreover, from (17), we remark that $M$ lives in an affine subspace of dimension $\operatorname{rank}\mathcal{A}$. A unit ball in Frobenius norm in $d$ dimensions admits an $\varepsilon$-net of at most $(1+2/\varepsilon)^d$ points for $\varepsilon\le 1$ (see for instance Lemma 1.18 in [RigolletNotes]); for $\varepsilon > 1$, there is an obvious $\varepsilon$-net of cardinality one, comprising just the origin, so that for any $\varepsilon$ one can find an $\varepsilon$-net of cardinality at most $(1+3/\varepsilon)^d$. Thus, we pick an $\varepsilon$-net on the unit ball with $\varepsilon = \frac{k\sigma_W}{2\kappa\sqrt{c_0 n}}$, of cardinality at most $\big(1+\frac{6\kappa\sqrt{c_0 n}}{k\sigma_W}\big)^{\operatorname{rank}\mathcal{A}}$. Rescaling by a factor $\kappa$ gives a $\frac{k\sigma_W}{2\sqrt{c_0 n}}$-net of a ball of radius $\kappa$ centered at zero. Hence, for any $M$ as in (17) there necessarily exists a point $\bar M$ in the net satisfying:

$$\|\bar M - M\| \le \frac{k\,\sigma_W}{2\sqrt{c_0 n}}. \tag{20}$$

Let $T_k$ be defined by $T_k(A) = \big(\sigma_n(A), \sigma_{n-1}(A), \dots, \sigma_{n-(k-1)}(A)\big)$; that is, $T_k$ extracts the $k$ smallest singular values of its argument, in order ($T_n$ is defined likewise). Then, by using the result from Exercise IV.3.5 in [Bhatia] in the first inequality below (the same result can be obtained by using Theorem IV.2.14 in the reference, via the subadditivity of a suitable symmetric gauge function), we have:

$$\|\bar M - M\| = \|(\bar M + W) - (M + W)\| = \sqrt{\sum_{i=1}^n \sigma_i^2\big((\bar M + W) - (M + W)\big)} \ge \sqrt{\sum_{i=1}^n \big(\sigma_i(\bar M + W) - \sigma_i(M + W)\big)^2} = \|T_n(\bar M + W) - T_n(M + W)\| \ge \|T_k(\bar M + W) - T_k(M + W)\| \ge \|T_k(\bar M + W)\| - \|T_k(M + W)\|,$$

where we used the triangle inequality in the last step. Thus, rearranging, we obtain

$$\sqrt{\sum_{i=1}^k \sigma_{n-(i-1)}(M+W)^2} \ge \sqrt{\sum_{i=1}^k \sigma_{n-(i-1)}(\bar M+W)^2} - \|\bar M - M\|. \tag{21}$$

Taking a union bound for Lemma B.1 over each $\bar M$ in the net, we get that

$$\sqrt{\sum_{i=1}^k \sigma_{n-(i-1)}(\bar M+W)^2} \ge \frac{k\,\sigma_W}{\sqrt{c_0 n}} \tag{22}$$

holds simultaneously for all $\bar M$ in the net, with probability at least

$$1 - \exp\!\left(-\frac{k^2}{8}\log(8\pi) + k\log(n) + \operatorname{rank}\mathcal{A}\cdot\log\!\left(1+\frac{6\kappa\sqrt{c_0 n}}{k\,\sigma_W}\right)\right). \tag{23}$$

Combining (20), (21) and (22), we conclude that

$$\sqrt{\sum_{i=1}^k \sigma_{n-(i-1)}(M+W)^2} \ge \frac{k\,\sigma_W}{2\sqrt{c_0 n}} \tag{24}$$

holds with probability bounded as in (23). Combining with (18), we obtain

$$\sigma_k(Y) \le \frac{\varepsilon_g\sqrt{c_0 n}}{k\,\sigma_W}$$

as desired. It remains to discuss the probability of success, which we do below.

Inside the log in (23), we can safely replace $k$ with 1, as this only hurts the probability. Then, the result holds with probability at least

$$1 - \exp\!\left(-\frac{k^2}{8}\log(8\pi) + k\log(n) + \operatorname{rank}\mathcal{A}\cdot\log\!\left(1+\frac{6\kappa\sqrt{c_0 n}}{\sigma_W}\right)\right).$$

We would like to constrain $k$ such that the exponential term is bounded by $\delta$. In this fashion, taking a union bound with event (19), we will get an overall probability of success of at least $1-\delta-e^{-n/2}$. Equivalently, $k$ must satisfy the quadratic inequality

$$-ak^2 + bk + c \le \log(\delta),$$

with $a = \frac{1}{8}\log(8\pi)$, $b = \log(n)$ and $c = \operatorname{rank}\mathcal{A}\cdot\log\!\big(1+\frac{6\kappa\sqrt{c_0 n}}{\sigma_W}\big)$. This quadratic inequality can be rewritten as:

$$ak^2 - bk - c' \ge 0,$$

with $c' = c + \log(1/\delta)$. This quadratic has two distinct real roots, one positive and one negative:

$$\frac{b \pm \sqrt{b^2 + 4ac'}}{2a}.$$

Since $a$ is positive, we deduce that $k$ needs to be larger than the positive root. The latter obeys the following inequality (using that $\sqrt{x+y} \le \sqrt{x} + \sqrt{y}$ for any $x, y \ge 0$):

$$\frac{b+\sqrt{b^2+4ac'}}{2a} \le \frac{b+b+2\sqrt{ac'}}{2a} = \frac{b+\sqrt{ac'}}{a} = \frac{1}{a}\,b + \frac{1}{\sqrt{a}}\sqrt{c'}.$$

Since both $1/a$ and $1/\sqrt{a}$ are smaller than 3, it is sufficient to require

$$k \ge 3\left(b + \sqrt{c + \log(1/\delta)}\right).$$

Using the same square-root inequality again, we find that it is sufficient to have

$$k \ge 3\left(b + \sqrt{\log(1/\delta)} + \sqrt{c}\right).$$

Plugging in the definitions of $b$ and $c$, we find the sufficient condition:

$$k \ge 3\left[\log(n) + \sqrt{\log(1/\delta)} + \sqrt{\operatorname{rank}\mathcal{A}\cdot\log\!\left(1+\frac{6\kappa\sqrt{c_0 n}}{\sigma_W}\right)}\,\right].$$

Since $\operatorname{rank}\mathcal{A} \le m$, we obtain the desired sufficient bound (13) on $k$.

## Appendix D Proof of Lemma 3.2

The Riemannian gradient and Hessian of the objective function $g$ of (P) are respectively given by equations (8) and (9). Since $Y$ is an $(\varepsilon_g,\varepsilon_H)$-SOSP, it holds for all $\dot Y\in\mathrm{T}_Y\mathcal{M}$ (3) with $\|\dot Y\| = 1$ that:

$$-\varepsilon_H \le \tfrac12\langle\dot Y, \operatorname{Hess} g(Y)[\dot Y]\rangle = \langle\dot Y, S\dot Y\rangle. \tag{25}$$

Our goal is to show that $S$ is almost positive semidefinite. To this end, we first construct specific $\dot Y$'s to exploit the fact that $Y$ is almost column-rank deficient. Let $v$ be a right singular vector of $Y$ such that $\|v\| = 1$ and $\|Yv\| = \sigma_k(Y)$. For any with