We consider semidefinite programs (SDP) over or of the form:
with the Frobenius inner product ( is the conjugate-transpose of ), the set of self-adjoint matrices of size (real symmetric for , or Hermitian for ), the cost matrix, and a linear operator capturing equality constraints with right hand side : for each , for given matrices . The optimization variable is positive semidefinite. We let be the feasible set of (SDP):
Large-scale SDPs have been proposed for machine learning applications including matrix completion[candes2009exact], community detection [abbe2017community] and kernel learning [lanckriet2004learning] for , and in angular synchronization [singer2011angular] and phase retrieval [waldspurger2015phase] for . Unfortunately, traditional methods to solve (SDP) do not scale (due to memory and computational requirements), hence the need for alternatives.
In order to address such scalability issues, burer2003nonlinear, burer2005local restrict the search to the set of matrices of rank at most by factorizing as , with . It has been shown that if the search space (1) is compact, then (SDP) admits a global optimum of rank at most , where [barvinok1995problems, pataki1998rank], with for and for . In other words, restricting to the space of matrices with rank at most with does not change the optimal value. This factorization leads to a quadratically constrained quadratic program:
Although (P) is in general non-convex because its feasible set
is non-convex, considering (P) instead of the original SDP presents significant advantages: the number of variables is reduced from to , and the positive semidefiniteness of the matrix is naturally enforced. Solving (P) using local optimization methods is known as the Burer–Monteiro method and yields good results in practice: kulis2007fastcarson2017kmeans]). Their approach is significantly faster and more scalable. However, the non-convexity of (P) means further analysis is needed to determine whether it can be solved to global optimality reliably.
For , in the case where is a compact, smooth manifold (see Assumption 1 below for a precise condition), it has been shown recently that, up to a zero-measure set of cost matrices, second-order stationary points (SOSPs) of (P) are globally optimal provided [boumal2016non, boumal2018deterministic]. Algorithms such as the Riemannian trust-regions method (RTR) converge globally to SOSPs, but unfortunately they can only guarantee approximate satisfaction of second-order optimality conditions in a finite number of iterations [boumal2016globalrates].
The aforementioned papers close with a question, crucial in practice: when is it the case that approximate SOSPs, which we now call ASOSPs, are approximately optimal? Building on recent proof techniques by bhojanapalli2018smoothed, we provide some answers here.
Assumption 1 (Smooth manifold).
For all values of up to such that is non-empty, the constraints on (P) defined by and satisfy at least one of the following:
are linearly independent in for all ; or
span a subspace of constant dimension in for all in an open neighborhood of in
In [boumal2018deterministic], it is shown that (a) if the assumption above is verified for , then it automatically holds for all values of such that is non-empty; and (b) for those values of , the dimension of the subspace spanned by is independent of : we call it .
When Assumption 1 holds, we refer to problems of the form (SDP) as smooth SDPs because is then a smooth manifold. Examples of smooth SDPs for are given in [boumal2018deterministic]. For , we detail an example in Section 4. Our main theorem is a smooth analysis result (cf. Theorem 3.1 for a more formal statement). An ASOSP is an approximate SOSP (a precise definition follows.)
Theorem 1.1 (Informal).
The high probability proviso is with respect to the perturbation only: if the perturbation is “good”, then all ASOSPs are as described in the statement. If is compact, then so is and known algorithms for optimization on manifolds produce an ASOSP in finite time (with explicit bounds). Theorem 1.1 ensures that, for large enough and for any cost matrix , with high probability upon a random perturbation of , such algorithms produce an approximate global optimum of (P).
The first argument is motivated by smoothed analysis [SmoothedAnalysis] and draws heavily on a recent paper by bhojanapalli2018smoothed. The latter work introduces smoothed analysis to analyze the performance of the Burer–Monteiro factorization, but it analyzes a quadratically penalized version of the SDP: its solutions do not satisfy constraints exactly. This affords more generality, but, for the special class of smooth SDPs, the present work has the advantage of analyzing an exact formulation. The second argument is a smoothed extension of well-known on-off results [burer2003nonlinear, burer2005local, journee2010low]. Implications of this theorem for a particular SDP are derived in Section 4, with applications to phase retrieval and angular synchronization.
Thus, for smooth SDPs, our results improve upon [bhojanapalli2018smoothed] in that we address exact-feasibility formulations of the SDP. Our results also improve upon [boumal2016non] by providing approximate optimality results for approximate second-order points with relaxation rank scaling only as , whereas the latter reference establishes such results only for . Finally, we aim for more generality by covering both real and complex SDPs, and we illustrate the relevance of complex SDPs in Section 4.
A number of recent works focus on large-scale SDP solvers. Among the direct approaches (which proceed in the convex domain directly), SPARSESDP introduced a Frank–Wolfe type method for a restricted class of SDPs. Here, the key is that each iteration increases the rank of the solution only by one, so that if only a few iterations are required to reach satisfactory accuracy, then only low dimensional objects need to be manipulated. This line of work was later improved by HybridDSP, FastSpecta and SublinearSDP through hybrid methods. Still, if high accuracy solutions are desired, a large number of iterations will be required, eventually leading to large-rank iterates. In order to overcome such issue, SketchyDecisions recently proposed to combine conditional gradient and sketching techniques in order to maintain a low rank representation of the iterates.
Among the low-rank approaches, our work is closest to (and indeed largely builds upon) recent results of bhojanapalli2018smoothed. For the real case, they consider a penalized version of problem (SDP) (which we here refer to as (P-SDP)) and its related penalized Burer–Monteiro formulation, here called (P-P). With high probability upon random perturbation of the cost matrix, they show approximate global optimality of ASOSPs for (P-P), assuming grows with and either the SDP is compact or its cost matrix is positive definite. Given that there is a zero-measure set of SDPs where SOSPs may be suboptimal, there can be a small-measure set of SDPs where ASOSPs are not approximately optimal [bhojanapalli2018smoothed]. In this context, the authors resort to smoothed analysis, in the same way that we do here. One drawback of that work is that the final result does not hold for the original SDP, but for a non-equivalent penalized version of it. This is one of the points we improve here, by focusing on smooth SDPs as defined in [boumal2016non].
We use to refer to or when results hold for both fields. For matrices , of same size, we use the inner product , which reduces to in the real case. The associated Frobenius norm is defined as . For a linear map between matrix spaces, this yields a subordinate operator norm as . The set of self-adjoint matrices of size over , , is the set of symmetric matrices for or the set of Hermitian matrices for . We also write to denote for . A self-adjoint matrix is positive semidefinite () if and only if for all . Furthermore, is the identity operator and
is the identity matrix of size. The integer is defined after Assumption 1.
2 Geometric framework and near-optimality conditions
In this section, we present properties of the smooth geometry of (P) and approximate global optimality conditions for this problem. In covering these preliminaries, we largely parallel developments in [boumal2016non]. As argued in that reference, Assumption 1 implies that the search space of (P) is a submanifold in of codimension . We can associate tangent spaces to a submanifold. Intuitively, the tangent space to the submanifold at a point is a subspace that best approximates around , when the subspace origin is translated to . It is obtained by linearizing the equality constraints.
Lemma 2.1 (boumal2018deterministic).
By equipping each tangent space with a restriction of the inner product , we turn into a Riemannian submanifold of . We also introduce the orthogonal projector which, given a matrix , projects it to the tangent space :
This projector will be useful to phrase optimality conditions. It is characterized as follows.
Lemma 2.2 (boumal2018deterministic).
Under Assumption 1, the orthogonal projector admits the closed form
where is the adjoint of is a Gram matrix defined by (it is a function of ), and denotes the Moore–Penrose pseudo-inverse of (differentiable in ).
(See a proof in Appendix A.) To properly state the approximate first- and second-order necessary optimality conditions for (P), we further need the notions of Riemannian gradient and Riemannian Hessian on the manifold . We recall that (P) minimizes the function , defined by
on the manifold . The Riemannian gradient of at ,
, is the unique tangent vector atsuch that, for all tangent , with the Euclidean (classical) gradient of evaluated at . Intuitively, is the tangent vector at that points in the steepest ascent direction for as seen from the manifold’s perspective. A classical result states that, for Riemannian submanifolds, the Riemannian gradient is given by the projection of the classical gradient to the tangent space [absil2009optimization, eq. (3.37)]:
where We can write the Riemannian gradient of evaluated at as
The Riemannian gradient enables us to define an approximate first-order necessary optimality condition below. To define the approximate second-order necessary optimality condition, we need to introduce the notion of Riemannian Hessian. The Riemannian Hessian of at is a self-adjoint operator on the tangent space at obtained as the projection of the derivative of the Riemannian gradient vector field [absil2009optimization, eq. (5.15)]. boumal2018deterministic give a closed form expression for the Riemannian Hessian of at :
We can now formally define the approximate necessary optimality conditions for problem (P).
Definition 2.1 (-Fosp).
Definition 2.2 (-Sosp).
is an ()–second-order stationary point for (P) if it is an –first-order stationary point and the Riemannian Hessian of at is almost positive semidefinite, specifically,
From these definitions, it is clear that encapsulates the approximate optimality conditions of problem (P).
3 Approximate second-order points and smoothed analysis
We state our main results formally in this section. As announced, following [bhojanapalli2018smoothed], we resort to smoothed analysis [SmoothedAnalysis]. To this end, we consider perturbations of the cost matrix of (SDP) by a Gaussian Wigner matrix. Intuitively, smoothed analysis tells us how large the variance of the perturbation should be in order to obtain a new SDP which, with high probability, is sufficiently distant from any pathological case. We start by formally defining the notion of Gaussian Wigner matrix, following [arous2011wigner].
Definition 3.1 (Gaussian Wigner matrix).
The random matrix
The random matrixin is a Gaussian Wigner matrix with variance if its entries on and above the diagonal are independent, zero-mean Gaussian variables (real or complex depending on context) with variance .
Thus, for all , . Another consequence of compactness of is that the operator is uniformly bounded, that is, there exists a finite constant such that
We now state the main theorem, whose proof is in Appendix E.
Let Assumption 1 hold for (SDP) with cost matrix and constraints. Assume (1) is compact, and let and be as in (10) and (11). Let be a Gaussian Wigner matrix with variance and let be any tolerance. Define as:
There exists a universal constant such that, if the rank for the low-rank problem (P) satisfies
then, with probability at least on the random matrix , any -SOSP of (P) with perturbed cost matrix has bounded optimality gap:
This result indicates that, as long as the rank is on the order of (up to logarithmic factors), the optimality gap in the perturbed problem is small if a sufficiently good approximate second-order point is computed. Since (SDP) may admit a unique solution of rank as large as (see for example [laurent1996facial, Thm. 3.1(ii)] for the Max-Cut SDP), we conclude that the scaling of with respect to in Theorem 3.1 is essentially optimal.
There is an incentive to pick small, since the optimality gap is phrased in terms of the perturbed problem. As expected though, taking small comes at a price. Specifically, the required rank scales with , so that a smaller may require to be a larger multiple of . Furthermore, the optimality gap is bounded in terms of with a dependence in ; this may force us to compute more accurate approximate second-order points (smaller ) for a similar guarantee when is smaller: see also Corollary 3.1 below.
As announced, the theorem rests on two arguments which we now present—a probabilistic one, and a deterministic one:
Let Assumption 1 hold for (SDP). Assume (1) is compact. Let be a Gaussian Wigner matrix with variance and let be any tolerance. There exists a universal constant such that, if the rank for the low-rank problem (P) is lower bounded as in (13), then, with probability at least on the random matrix , we have
and furthermore: any -FOSP of (P) with perturbed cost matrix satisfies
where is the th singular value of the matrix .
We have shown in Theorem 3.1 that a perturbed version of (P) can be approximately solved to global optimality, with high probability on the perturbation. In the corollary below, we further bound the optimality gap at the approximate solution of the perturbed problem with respect to the original, unperturbed problem. The proof is in Appendix F.
Assume is compact and let be as defined in (10). Let be an approximate solution for (SDP) with perturbed cost matrix , so that the optimality gap in the perturbed problem is bounded by . Let denote the optimal value of the unperturbed problem (SDP), with cost matrix . Then, the optimality gap for with respect to the unperturbed problem is bounded as:
where is as defined in (15) and is the variance of the Wigner perturbation .
The approximate global optimality results established in the previous section can be applied to deduce guarantees on the quality of ASOSPs of the low-rank factorization for a number of SDPs that appear in machine learning problems. Of particular interest, we focus on the phase retrieval problem. This problem consists in retrieving a signal from amplitude measurements (the absolute value of vector is taken entry-wise). If we can recover the complex phases of , then
can be estimated through linear least-squares. Following this approach,waldspurger2015phase argue that this task can be modeled as the following non-convex problem:
where and maps a vector to the corresponding diagonal matrix. The classical relaxation is to rewrite the above in terms of (lifting) without enforcing , leading to a complex SDP which waldspurger2015phase call PhaseCut:
The same SDP relaxation also applies to a problem called angular synchronization [singer2011angular]. The Burer–Monteiro factorization of is an optimization problem over a matrix as follows:
For a feasible , each row has unit norm: the search space is a Cartesian product of spheres in , which is a smooth manifold. We can check that Assumption 1 holds for all . Furthermore, the feasible space of the SDP is compact. Therefore, Theorem 3.1 applies.
In this setting, for all feasible , and for all feasible (because for all feasible —see Lemma 2.2—and is an orthogonal projector from Hermitian matrices to diagonal matrices). For this reason, the constants defined in (10) and (11) can be set to and .
As a comparison, mei2017solving also provide an optimality gap for ASOSPs of (PhaseCut) without perturbation. Their result is more general in the sense that it holds for all possible values of . However, when is large, it does not accurately capture the fact that SOSPs are optimal, thus incurring a larger bound on the optimality gap of ASOSPs. In contrast, our bounds do show that for large enough, as go to zero, the optimality gap goes to zero, with the trade-off that they do so for a perturbed problem (though see Corollary 3.1), with high probability.
We present the empirical performance of the low-rank approach in the case of (PhaseCut). We compare it with a dedicated interior-point method (IPM) implemented by helmberg1996interior for real SDPs and adapted to phase retrieval as done by waldspurger2015phase. This adaptation involves splitting the real and the imaginary parts of the variables in (PhaseCut) and forming an equivalent real SDP with double the dimension. The Burer–Monteiro approach (BM) is implemented in complex form directly using Manopt, a toolbox for optimization on manifolds [JMLR:v15:boumal14a]. In particular, a Riemannian Trust-Region method (RTR) is used [genrtr]. Theory supports that these methods can return an ASOSP in a finite number of iterations [boumal2016globalrates]. We stress that the SDP is not perturbed in these experiments: the role of the perturbation in the analysis is to understand why the low-rank approach is so successful in practice despite the existence of pathological cases. In practice, we do not expect to encounter pathological cases.
Our numerical experiment setup is as follows. We seek to recover a signal of dimension , , from measurements encoded in the vector such that , where is the sensing matrix and is standard Gaussian noise. For the numerical experiments, we generate the vectors as complex random vectors with i.i.d. standard Gaussian entries, and we randomly generate the complex sensing matrices also with i.i.d. standard Gaussian entries. We do so for values of ranging from 10 to 1000, and always for (that is, there are 10 magnitude measurements per unknown complex coefficient, which is an oversampling factor of 5.) Lastly, we generate the measurement vectors as described above and we cap its values from below at in order to avoid small (or even negative) magnitude measurements.
For up to 3000, both methods solve the same problem, and indeed produce the same answer up to small discrepancies. The BM approach is more accurate, at least in satisfying the constraints, and, for , it is also about 40 times faster than IPM. BM is run with (rounded up), which is expected to be generically sufficient to include the global optimum of the SDP (as confirmed in practice). For larger values of , the IPM ran into memory issues and we had to abort the process.
We considered the low-rank (or Burer–Monteiro) approach to solve equality-constrained SDPs. Our key assumptions are that (a) the search space of the SDP is compact, and (b) the search space of its low-rank version is smooth (the actual condition is slightly stronger). Under these assumptions, we proved using smoothed analysis that, provided where is the number of constraints, if the cost matrix is perturbed randomly, with high probability, approximate second-order stationary points of the perturbed low-rank problem map to approximately optimal solutions of the perturbed SDP. We also related optimality gaps in the perturbed SDP to optimality gaps in the original SDP. Finally, we applied this result to an SDP relaxation of phase retrieval (also applicable to angular synchronization).
NB is partially supported by NSF award DMS-1719558.
Appendix A Proof of Lemma 2.2
We follow the proof of [boumal2018deterministic, Lemma 2.2] and reproduce it here to be self-contained, and also because the reference treats only the real case; writing the proof here explicitly allows to verify that, indeed, all steps go through for the complex case as well.
From the latter we infer there exists such that
since the adjoint of is . Multiply on the right by and apply to obtain
where we used since . The right-hand side expands into
where is a real, positive semidefinite matrix of size defined by . By construction, this system of equations in has at least one solution; we single out , where is the Moore–Penrose pseudo-inverse of . The function is continuous and differentiable at provided has constant rank in an open neighborhood of in [golub1973differentiation, Thm 4.3], which is the case for all under Assumption 1.
Appendix B Lower-bound for smallest singular values
This appendix provides supporting results necessary for Appendix C, which is devoted to the proof of Lemma 3.1. The statements we need are established for in [bhojanapalli2018smoothed, Cor. 5, Lem. 7]. Here we give the corresponding statements for : the proofs are essentially the same.
We first state a special case of Corollary 1.17 from [nguyen2017random]. Here, denotes the number of eigenvalues of in the real interval . (Note that the reference covers the real case in its main statement, and addresses the complex case later on as a remark.) For Gaussian Wigner matrices, we follow Definition 3.1. Furthermore, denotes the probability of event .
Let be a deterministic Hermitian matrix of size . Let be a Gaussian Wigner matrix with variance 1. Then, for any given , there exists a constant such that for any and , with being the interval ,
The next lemma follows easily—the original proof for is in [bhojanapalli2018smoothed, Lem. 7].
Let be a deterministic Hermitian matrix of size . Let be a complex Gaussian Wigner matrix of size with variance , independent of . There exists an absolute constant such that:
In our case, the entries of have variance . Thus, set and . From Corollary B.1, we get
with probability at least . In this event, . With the choices and , we get that
with probability at least . In that event,
for some absolute constant . ∎
Appendix C Proof of Lemma 3.1
This section builds on a result from [bhojanapalli2018smoothed], where a similar statement was made under different assumptions. The proof follows closely the developments therein, with appropriate changes. Using (8), is an -FOSP of the perturbed problem if and only if with and
Let be a thin SVD of , where is with orthonormal columns (assuming without loss of generality , as otherwise deterministically) and is orthogonal. Then,
Thus, we control the smallest singular value of in terms of and the smallest singular values of :
Given that is not statistically independent of , we are not able to directly apply Lemma B.1. Indeed, depends on and on , and itself is an -FOSP of the perturbed problem: a feature which depends on . To tackle this issue, we cover the set of possible s with a net. Lemma B.1 provides a bound for each in this net. By union bound, we can extend the lemma for all . By taking a sufficiently dense net, we then infer that is necessarily close to one of these ’s and conclude.
Since is a Gaussian Wigner matrix with variance , it is a well known fact (see for instance111The reference proves the statement for complex matrices with diagonal entries equal to zero. That proof can easily be adapted to the definition of Wigner matrices used in this paper, both real and complex. Part 1 of Appendix A in [bandeira2016tightness]) that, with probability at least ,
Hence, with probability at least ,
where we recover as defined in (12).
As a result, lies in a ball of center and radius . Moreover, from (17), we remark that lives in an affine subspace of dimension . A unit ball in Frobenius norm in dimensions admits an -net of points (see for instance Lemma 1.18 in [RigolletNotes]).222The lemma in the reference shows that for any the cardinality of one such -net is bounded by . Furthermore, for , there is an obvious -net of cardinality one, comprising just the origin. Hence, for any , it is possible so find an -net of cardinality at most . Thus, we pick a -net on the unit ball with points. Rescaling by a factor gives a -net of a ball of radius centered at zero. Hence, for any as in (17) there necessarily exists a point in the net satisfying:
Let be defined by , that is: extracts the smallest singular values of , in order. Then, by using the result from Exercise IV.3.5. in [Bhatia] in the first inequality,333The same result can be obtained by using Theorem IV.2.14 in the reference. In this setting, one considers the function ; then, use the subadditive property of , i.e., , and define and . we have:
where we used the triangular inequality in the last inequality. Thus, rearranging we obtain
Taking a union bound for Lemma B.1 over each in the net, we get that
holds with probability at least
as desired. It remains to discuss the probability of success, which we do below.
Inside the log in (23), we can safely replace with , as this only hurts the probability. Then, the result holds with probability at least
We would like to constrain such that the exponential part is bounded by . In this fashion, taking a union bound with event (19), we will get an overall probability of success of at least . Equivalently, must satisfy the quadratic inequality
with defined by . This quadratic inequality can be rewritten as:
with . This quadratic has two distinct real roots, one positive and one negative:
Since is positive, we deduce that needs to be larger than the positive root. The latter obeys the following inequality:444We use that, for any , .
Since both and are smaller than 3, it is sufficient to require
Assuming , we can use the inequality in the footnote again and find that it is sufficient to have
Plugging in the definitions of and , we find the sufficient condition (with ):
Since , we obtain the desired sufficient bound on .
Appendix D Proof of Lemma 3.2
Our goal is to show that is almost positive semidefinite. To this end, we first construct specific ’s to exploit the fact that is almost rank deficient. Let be a right singular vector of
such that and . For any with