1 Introduction
We consider semidefinite programs (SDP) over or of the form:
(SDP)  
subject to  
with the Frobenius inner product ( is the conjugatetranspose of ), the set of selfadjoint matrices of size (real symmetric for , or Hermitian for ), the cost matrix, and a linear operator capturing equality constraints with right hand side : for each , for given matrices . The optimization variable is positive semidefinite. We let be the feasible set of (SDP):
(1) 
Largescale SDPs have been proposed for machine learning applications including matrix completion
[candes2009exact], community detection [abbe2017community] and kernel learning [lanckriet2004learning] for , and in angular synchronization [singer2011angular] and phase retrieval [waldspurger2015phase] for . Unfortunately, traditional methods to solve (SDP) do not scale (due to memory and computational requirements), hence the need for alternatives.In order to address such scalability issues, burer2003nonlinear, burer2005local restrict the search to the set of matrices of rank at most by factorizing as , with . It has been shown that if the search space (1) is compact, then (SDP) admits a global optimum of rank at most , where [barvinok1995problems, pataki1998rank], with for and for . In other words, restricting to the space of matrices with rank at most with does not change the optimal value. This factorization leads to a quadratically constrained quadratic program:
(P)  
subject to 
Although (P) is in general nonconvex because its feasible set
(2) 
is nonconvex, considering (P) instead of the original SDP presents significant advantages: the number of variables is reduced from to , and the positive semidefiniteness of the matrix is naturally enforced. Solving (P) using local optimization methods is known as the Burer–Monteiro method and yields good results in practice: kulis2007fast
underlined the practical success of such lowrank approaches in particular for maximum variance unfolding and for kmeans clustering (see also
[carson2017kmeans]). Their approach is significantly faster and more scalable. However, the nonconvexity of (P) means further analysis is needed to determine whether it can be solved to global optimality reliably.For , in the case where is a compact, smooth manifold (see Assumption 1 below for a precise condition), it has been shown recently that, up to a zeromeasure set of cost matrices, secondorder stationary points (SOSPs) of (P) are globally optimal provided [boumal2016non, boumal2018deterministic]. Algorithms such as the Riemannian trustregions method (RTR) converge globally to SOSPs, but unfortunately they can only guarantee approximate satisfaction of secondorder optimality conditions in a finite number of iterations [boumal2016globalrates].
The aforementioned papers close with a question, crucial in practice: when is it the case that approximate SOSPs, which we now call ASOSPs, are approximately optimal? Building on recent proof techniques by bhojanapalli2018smoothed, we provide some answers here.
Contributions
This paper formulates approximate global optimality conditions holding for (P) and, consequently, for (SDP). Our results rely on the following core assumption as set in [boumal2016non].
Assumption 1 (Smooth manifold).
For all values of up to such that is nonempty, the constraints on (P) defined by and satisfy at least one of the following:

are linearly independent in for all ; or

span a subspace of constant dimension in for all in an open neighborhood of in
In [boumal2018deterministic], it is shown that (a) if the assumption above is verified for , then it automatically holds for all values of such that is nonempty; and (b) for those values of , the dimension of the subspace spanned by is independent of : we call it .
When Assumption 1 holds, we refer to problems of the form (SDP) as smooth SDPs because is then a smooth manifold. Examples of smooth SDPs for are given in [boumal2018deterministic]. For , we detail an example in Section 4. Our main theorem is a smooth analysis result (cf. Theorem 3.1 for a more formal statement). An ASOSP is an approximate SOSP (a precise definition follows.)
Theorem 1.1 (Informal).
Let Assumption 1 hold and assume is compact. Randomly perturb the cost matrix
. With high probability, if
, any ASOSP for (P) is an approximate global optimum, and is an approximate global optimum for (SDP) (with the perturbed .)The high probability proviso is with respect to the perturbation only: if the perturbation is “good”, then all ASOSPs are as described in the statement. If is compact, then so is and known algorithms for optimization on manifolds produce an ASOSP in finite time (with explicit bounds). Theorem 1.1 ensures that, for large enough and for any cost matrix , with high probability upon a random perturbation of , such algorithms produce an approximate global optimum of (P).
The first argument is motivated by smoothed analysis [SmoothedAnalysis] and draws heavily on a recent paper by bhojanapalli2018smoothed. The latter work introduces smoothed analysis to analyze the performance of the Burer–Monteiro factorization, but it analyzes a quadratically penalized version of the SDP: its solutions do not satisfy constraints exactly. This affords more generality, but, for the special class of smooth SDPs, the present work has the advantage of analyzing an exact formulation. The second argument is a smoothed extension of wellknown onoff results [burer2003nonlinear, burer2005local, journee2010low]. Implications of this theorem for a particular SDP are derived in Section 4, with applications to phase retrieval and angular synchronization.
Thus, for smooth SDPs, our results improve upon [bhojanapalli2018smoothed] in that we address exactfeasibility formulations of the SDP. Our results also improve upon [boumal2016non] by providing approximate optimality results for approximate secondorder points with relaxation rank scaling only as , whereas the latter reference establishes such results only for . Finally, we aim for more generality by covering both real and complex SDPs, and we illustrate the relevance of complex SDPs in Section 4.
Related work
A number of recent works focus on largescale SDP solvers. Among the direct approaches (which proceed in the convex domain directly), SPARSESDP introduced a Frank–Wolfe type method for a restricted class of SDPs. Here, the key is that each iteration increases the rank of the solution only by one, so that if only a few iterations are required to reach satisfactory accuracy, then only low dimensional objects need to be manipulated. This line of work was later improved by HybridDSP, FastSpecta and SublinearSDP through hybrid methods. Still, if high accuracy solutions are desired, a large number of iterations will be required, eventually leading to largerank iterates. In order to overcome such issue, SketchyDecisions recently proposed to combine conditional gradient and sketching techniques in order to maintain a low rank representation of the iterates.
Among the lowrank approaches, our work is closest to (and indeed largely builds upon) recent results of bhojanapalli2018smoothed. For the real case, they consider a penalized version of problem (SDP) (which we here refer to as (PSDP)) and its related penalized Burer–Monteiro formulation, here called (PP). With high probability upon random perturbation of the cost matrix, they show approximate global optimality of ASOSPs for (PP), assuming grows with and either the SDP is compact or its cost matrix is positive definite. Given that there is a zeromeasure set of SDPs where SOSPs may be suboptimal, there can be a smallmeasure set of SDPs where ASOSPs are not approximately optimal [bhojanapalli2018smoothed]. In this context, the authors resort to smoothed analysis, in the same way that we do here. One drawback of that work is that the final result does not hold for the original SDP, but for a nonequivalent penalized version of it. This is one of the points we improve here, by focusing on smooth SDPs as defined in [boumal2016non].
Notation
We use to refer to or when results hold for both fields. For matrices , of same size, we use the inner product , which reduces to in the real case. The associated Frobenius norm is defined as . For a linear map between matrix spaces, this yields a subordinate operator norm as . The set of selfadjoint matrices of size over , , is the set of symmetric matrices for or the set of Hermitian matrices for . We also write to denote for . A selfadjoint matrix is positive semidefinite () if and only if for all . Furthermore, is the identity operator and
is the identity matrix of size
. The integer is defined after Assumption 1.2 Geometric framework and nearoptimality conditions
In this section, we present properties of the smooth geometry of (P) and approximate global optimality conditions for this problem. In covering these preliminaries, we largely parallel developments in [boumal2016non]. As argued in that reference, Assumption 1 implies that the search space of (P) is a submanifold in of codimension . We can associate tangent spaces to a submanifold. Intuitively, the tangent space to the submanifold at a point is a subspace that best approximates around , when the subspace origin is translated to . It is obtained by linearizing the equality constraints.
Lemma 2.1 (boumal2018deterministic).
By equipping each tangent space with a restriction of the inner product , we turn into a Riemannian submanifold of . We also introduce the orthogonal projector which, given a matrix , projects it to the tangent space :
(4) 
This projector will be useful to phrase optimality conditions. It is characterized as follows.
Lemma 2.2 (boumal2018deterministic).
Under Assumption 1, the orthogonal projector admits the closed form
where is the adjoint of is a Gram matrix defined by (it is a function of ), and denotes the Moore–Penrose pseudoinverse of (differentiable in ).
(See a proof in Appendix A.) To properly state the approximate first and secondorder necessary optimality conditions for (P), we further need the notions of Riemannian gradient and Riemannian Hessian on the manifold . We recall that (P) minimizes the function , defined by
(5) 
on the manifold . The Riemannian gradient of at ,
, is the unique tangent vector at
such that, for all tangent , with the Euclidean (classical) gradient of evaluated at . Intuitively, is the tangent vector at that points in the steepest ascent direction for as seen from the manifold’s perspective. A classical result states that, for Riemannian submanifolds, the Riemannian gradient is given by the projection of the classical gradient to the tangent space [absil2009optimization, eq. (3.37)]:(6) 
This leads us to define the matrix which plays a key role to guarantee approximate global optimality for problem (P), as discussed in Section 3:
(7) 
where We can write the Riemannian gradient of evaluated at as
(8) 
The Riemannian gradient enables us to define an approximate firstorder necessary optimality condition below. To define the approximate secondorder necessary optimality condition, we need to introduce the notion of Riemannian Hessian. The Riemannian Hessian of at is a selfadjoint operator on the tangent space at obtained as the projection of the derivative of the Riemannian gradient vector field [absil2009optimization, eq. (5.15)]. boumal2018deterministic give a closed form expression for the Riemannian Hessian of at :
(9) 
We can now formally define the approximate necessary optimality conditions for problem (P).
Definition 2.1 (Fosp).
Definition 2.2 (Sosp).
is an ()–secondorder stationary point for (P) if it is an –firstorder stationary point and the Riemannian Hessian of at is almost positive semidefinite, specifically,
From these definitions, it is clear that encapsulates the approximate optimality conditions of problem (P).
3 Approximate secondorder points and smoothed analysis
We state our main results formally in this section. As announced, following [bhojanapalli2018smoothed], we resort to smoothed analysis [SmoothedAnalysis]. To this end, we consider perturbations of the cost matrix of (SDP) by a Gaussian Wigner matrix. Intuitively, smoothed analysis tells us how large the variance of the perturbation should be in order to obtain a new SDP which, with high probability, is sufficiently distant from any pathological case. We start by formally defining the notion of Gaussian Wigner matrix, following [arous2011wigner].
Definition 3.1 (Gaussian Wigner matrix).
The random matrix
in is a Gaussian Wigner matrix with variance if its entries on and above the diagonal are independent, zeromean Gaussian variables (real or complex depending on context) with variance .Besides Assumption 1, another important assumption for our results is that the search space (1) of (SDP) is compact. In that scenario, there exists a finite constant such that
(10) 
Thus, for all , . Another consequence of compactness of is that the operator is uniformly bounded, that is, there exists a finite constant such that
(11) 
where is a continuous function of as in Lemma 2.2. We give explicit expressions for the constants and for the case of phase retrieval in Section 4.
We now state the main theorem, whose proof is in Appendix E.
Theorem 3.1.
Let Assumption 1 hold for (SDP) with cost matrix and constraints. Assume (1) is compact, and let and be as in (10) and (11). Let be a Gaussian Wigner matrix with variance and let be any tolerance. Define as:
(12) 
There exists a universal constant such that, if the rank for the lowrank problem (P) satisfies
(13) 
then, with probability at least on the random matrix , any SOSP of (P) with perturbed cost matrix has bounded optimality gap:
(14) 
with the cost function of (P), the optimal value of (SDP) (both perturbed), and
(15) 
This result indicates that, as long as the rank is on the order of (up to logarithmic factors), the optimality gap in the perturbed problem is small if a sufficiently good approximate secondorder point is computed. Since (SDP) may admit a unique solution of rank as large as (see for example [laurent1996facial, Thm. 3.1(ii)] for the MaxCut SDP), we conclude that the scaling of with respect to in Theorem 3.1 is essentially optimal.
There is an incentive to pick small, since the optimality gap is phrased in terms of the perturbed problem. As expected though, taking small comes at a price. Specifically, the required rank scales with , so that a smaller may require to be a larger multiple of . Furthermore, the optimality gap is bounded in terms of with a dependence in ; this may force us to compute more accurate approximate secondorder points (smaller ) for a similar guarantee when is smaller: see also Corollary 3.1 below.
As announced, the theorem rests on two arguments which we now present—a probabilistic one, and a deterministic one:

Probabilistic argument: In the smoothed analysis framework, we show, for large enough, that FOSPs of (P
) have their smallest singular value near zero, with high probability upon perturbation of
. This implies that such points are almost columnrank deficient.
Formal statements for both follow, building on the notation in Theorem 3.1. Proofs are in Appendices C and D, with supporting lemmas in Appendix B.
Lemma 3.1.
Let Assumption 1 hold for (SDP). Assume (1) is compact. Let be a Gaussian Wigner matrix with variance and let be any tolerance. There exists a universal constant such that, if the rank for the lowrank problem (P) is lower bounded as in (13), then, with probability at least on the random matrix , we have
and furthermore: any FOSP of (P) with perturbed cost matrix satisfies
where is the th singular value of the matrix .
Lemma 3.2.
Combining the two above lemmas, the key step in the proof of Theorem 3.1 is to deduce a bound on the optimality gap from a bound on the smallest eigenvalue of : see Appendix E.
We have shown in Theorem 3.1 that a perturbed version of (P) can be approximately solved to global optimality, with high probability on the perturbation. In the corollary below, we further bound the optimality gap at the approximate solution of the perturbed problem with respect to the original, unperturbed problem. The proof is in Appendix F.
Corollary 3.1.
Assume is compact and let be as defined in (10). Let be an approximate solution for (SDP) with perturbed cost matrix , so that the optimality gap in the perturbed problem is bounded by . Let denote the optimal value of the unperturbed problem (SDP), with cost matrix . Then, the optimality gap for with respect to the unperturbed problem is bounded as:
Under the conditions of Theorem 3.1, with the prescribed probability, and can be bounded so that for an SOSP of the perturbed problem (P) we have:
where is as defined in (15) and is the variance of the Wigner perturbation .
4 Applications
The approximate global optimality results established in the previous section can be applied to deduce guarantees on the quality of ASOSPs of the lowrank factorization for a number of SDPs that appear in machine learning problems. Of particular interest, we focus on the phase retrieval problem. This problem consists in retrieving a signal from amplitude measurements (the absolute value of vector is taken entrywise). If we can recover the complex phases of , then
can be estimated through linear leastsquares. Following this approach,
waldspurger2015phase argue that this task can be modeled as the following nonconvex problem:(PR)  
subject to 
where and maps a vector to the corresponding diagonal matrix. The classical relaxation is to rewrite the above in terms of (lifting) without enforcing , leading to a complex SDP which waldspurger2015phase call PhaseCut:
(PhaseCut)  
subject to  
The same SDP relaxation also applies to a problem called angular synchronization [singer2011angular]. The Burer–Monteiro factorization of is an optimization problem over a matrix as follows:
(PhaseCutBM)  
subject to 
For a feasible , each row has unit norm: the search space is a Cartesian product of spheres in , which is a smooth manifold. We can check that Assumption 1 holds for all . Furthermore, the feasible space of the SDP is compact. Therefore, Theorem 3.1 applies.
In this setting, for all feasible , and for all feasible (because for all feasible —see Lemma 2.2—and is an orthogonal projector from Hermitian matrices to diagonal matrices). For this reason, the constants defined in (10) and (11) can be set to and .
As a comparison, mei2017solving also provide an optimality gap for ASOSPs of (PhaseCut) without perturbation. Their result is more general in the sense that it holds for all possible values of . However, when is large, it does not accurately capture the fact that SOSPs are optimal, thus incurring a larger bound on the optimality gap of ASOSPs. In contrast, our bounds do show that for large enough, as go to zero, the optimality gap goes to zero, with the tradeoff that they do so for a perturbed problem (though see Corollary 3.1), with high probability.
Numerical Experiments
We present the empirical performance of the lowrank approach in the case of (PhaseCut). We compare it with a dedicated interiorpoint method (IPM) implemented by helmberg1996interior for real SDPs and adapted to phase retrieval as done by waldspurger2015phase. This adaptation involves splitting the real and the imaginary parts of the variables in (PhaseCut) and forming an equivalent real SDP with double the dimension. The Burer–Monteiro approach (BM) is implemented in complex form directly using Manopt, a toolbox for optimization on manifolds [JMLR:v15:boumal14a]. In particular, a Riemannian TrustRegion method (RTR) is used [genrtr]. Theory supports that these methods can return an ASOSP in a finite number of iterations [boumal2016globalrates]. We stress that the SDP is not perturbed in these experiments: the role of the perturbation in the analysis is to understand why the lowrank approach is so successful in practice despite the existence of pathological cases. In practice, we do not expect to encounter pathological cases.
Our numerical experiment setup is as follows. We seek to recover a signal of dimension , , from measurements encoded in the vector such that , where is the sensing matrix and is standard Gaussian noise. For the numerical experiments, we generate the vectors as complex random vectors with i.i.d. standard Gaussian entries, and we randomly generate the complex sensing matrices also with i.i.d. standard Gaussian entries. We do so for values of ranging from 10 to 1000, and always for (that is, there are 10 magnitude measurements per unknown complex coefficient, which is an oversampling factor of 5.) Lastly, we generate the measurement vectors as described above and we cap its values from below at in order to avoid small (or even negative) magnitude measurements.
For up to 3000, both methods solve the same problem, and indeed produce the same answer up to small discrepancies. The BM approach is more accurate, at least in satisfying the constraints, and, for , it is also about 40 times faster than IPM. BM is run with (rounded up), which is expected to be generically sufficient to include the global optimum of the SDP (as confirmed in practice). For larger values of , the IPM ran into memory issues and we had to abort the process.
5 Conclusion
We considered the lowrank (or Burer–Monteiro) approach to solve equalityconstrained SDPs. Our key assumptions are that (a) the search space of the SDP is compact, and (b) the search space of its lowrank version is smooth (the actual condition is slightly stronger). Under these assumptions, we proved using smoothed analysis that, provided where is the number of constraints, if the cost matrix is perturbed randomly, with high probability, approximate secondorder stationary points of the perturbed lowrank problem map to approximately optimal solutions of the perturbed SDP. We also related optimality gaps in the perturbed SDP to optimality gaps in the original SDP. Finally, we applied this result to an SDP relaxation of phase retrieval (also applicable to angular synchronization).
Acknowledgments
NB is partially supported by NSF award DMS1719558.
References
Appendix A Proof of Lemma 2.2
We follow the proof of [boumal2018deterministic, Lemma 2.2] and reproduce it here to be selfcontained, and also because the reference treats only the real case; writing the proof here explicitly allows to verify that, indeed, all steps go through for the complex case as well.
Orthogonal projection is along the normal space, so that is in (3) and is in , where the normal space at is (using Assumption 1)
(16) 
From the latter we infer there exists such that
since the adjoint of is . Multiply on the right by and apply to obtain
where we used since . The righthand side expands into
where is a real, positive semidefinite matrix of size defined by . By construction, this system of equations in has at least one solution; we single out , where is the Moore–Penrose pseudoinverse of . The function is continuous and differentiable at provided has constant rank in an open neighborhood of in [golub1973differentiation, Thm 4.3], which is the case for all under Assumption 1.
Appendix B Lowerbound for smallest singular values
This appendix provides supporting results necessary for Appendix C, which is devoted to the proof of Lemma 3.1. The statements we need are established for in [bhojanapalli2018smoothed, Cor. 5, Lem. 7]. Here we give the corresponding statements for : the proofs are essentially the same.
We first state a special case of Corollary 1.17 from [nguyen2017random]. Here, denotes the number of eigenvalues of in the real interval . (Note that the reference covers the real case in its main statement, and addresses the complex case later on as a remark.) For Gaussian Wigner matrices, we follow Definition 3.1. Furthermore, denotes the probability of event .
Corollary B.1.
Let be a deterministic Hermitian matrix of size . Let be a Gaussian Wigner matrix with variance 1. Then, for any given , there exists a constant such that for any and , with being the interval ,
The next lemma follows easily—the original proof for is in [bhojanapalli2018smoothed, Lem. 7].
Lemma B.1.
Let be a deterministic Hermitian matrix of size . Let be a complex Gaussian Wigner matrix of size with variance , independent of . There exists an absolute constant such that:
Proof.
In our case, the entries of have variance . Thus, set and . From Corollary B.1, we get
with probability at least . In this event, . With the choices and , we get that
with probability at least . In that event,
for some absolute constant . ∎
Appendix C Proof of Lemma 3.1
This section builds on a result from [bhojanapalli2018smoothed], where a similar statement was made under different assumptions. The proof follows closely the developments therein, with appropriate changes. Using (8), is an FOSP of the perturbed problem if and only if with and
(17) 
Let be a thin SVD of , where is with orthonormal columns (assuming without loss of generality , as otherwise deterministically) and is orthogonal. Then,
Thus, we control the smallest singular value of in terms of and the smallest singular values of :
(18) 
Given that is not statistically independent of , we are not able to directly apply Lemma B.1. Indeed, depends on and on , and itself is an FOSP of the perturbed problem: a feature which depends on . To tackle this issue, we cover the set of possible s with a net. Lemma B.1 provides a bound for each in this net. By union bound, we can extend the lemma for all . By taking a sufficiently dense net, we then infer that is necessarily close to one of these ’s and conclude.
To this end, we first control using the definitions of (10) and (11):
Since is a Gaussian Wigner matrix with variance , it is a well known fact (see for instance^{1}^{1}1The reference proves the statement for complex matrices with diagonal entries equal to zero. That proof can easily be adapted to the definition of Wigner matrices used in this paper, both real and complex. Part 1 of Appendix A in [bandeira2016tightness]) that, with probability at least ,
(19) 
Hence, with probability at least ,
where we recover as defined in (12).
As a result, lies in a ball of center and radius . Moreover, from (17), we remark that lives in an affine subspace of dimension . A unit ball in Frobenius norm in dimensions admits an net of points (see for instance Lemma 1.18 in [RigolletNotes]).^{2}^{2}2The lemma in the reference shows that for any the cardinality of one such net is bounded by . Furthermore, for , there is an obvious net of cardinality one, comprising just the origin. Hence, for any , it is possible so find an net of cardinality at most . Thus, we pick a net on the unit ball with points. Rescaling by a factor gives a net of a ball of radius centered at zero. Hence, for any as in (17) there necessarily exists a point in the net satisfying:
(20) 
Let be defined by , that is: extracts the smallest singular values of , in order. Then, by using the result from Exercise IV.3.5. in [Bhatia] in the first inequality,^{3}^{3}3The same result can be obtained by using Theorem IV.2.14 in the reference. In this setting, one considers the function ; then, use the subadditive property of , i.e., , and define and . we have:
where we used the triangular inequality in the last inequality. Thus, rearranging we obtain
(21) 
Taking a union bound for Lemma B.1 over each in the net, we get that
(22) 
holds with probability at least
(23) 
Combining (20), (21) and (22), we conclude that
(24) 
holds with probability bounded as in (23). Combining with (18), we obtain
as desired. It remains to discuss the probability of success, which we do below.
Inside the log in (23), we can safely replace with , as this only hurts the probability. Then, the result holds with probability at least
We would like to constrain such that the exponential part is bounded by . In this fashion, taking a union bound with event (19), we will get an overall probability of success of at least . Equivalently, must satisfy the quadratic inequality
with defined by . This quadratic inequality can be rewritten as:
with . This quadratic has two distinct real roots, one positive and one negative:
Since is positive, we deduce that needs to be larger than the positive root. The latter obeys the following inequality:^{4}^{4}4We use that, for any , .
Since both and are smaller than 3, it is sufficient to require
Assuming , we can use the inequality in the footnote again and find that it is sufficient to have
Plugging in the definitions of and , we find the sufficient condition (with ):
Since , we obtain the desired sufficient bound on .
Appendix D Proof of Lemma 3.2
The Riemannian gradient and Hessian of the objective function of (P) are respectively given by equations (8) and (9). Since is an SOSP, it holds for all (3) with that:
(25) 
Our goal is to show that is almost positive semidefinite. To this end, we first construct specific ’s to exploit the fact that is almost rank deficient. Let be a right singular vector of such that and . For any with
Comments
There are no comments yet.