Consider a semidefinite program (SDP) in the standard form
$$\text{minimize} \quad \langle C, X \rangle \qquad \text{subject to} \quad \mathcal{A}X = b, \quad X \succeq 0. \tag{P}$$
The primal variable is the symmetric, positive-semidefinite matrix $X \in \mathbb{S}^n_+$. The problem data comprises a symmetric (but possibly indefinite) objective matrix $C \in \mathbb{S}^n$, a linear map $\mathcal{A} : \mathbb{S}^n \to \mathbb{R}^m$ with rank $m$, and a right-hand side $b \in \mathbb{R}^m$.
SDPs form a class of convex optimization problems with remarkable modeling power. But SDPs are challenging to solve because they involve a matrix variable whose dimension $n$ can rise into the millions or billions. For example, when using a matrix completion SDP in a recommender system, $n$ is the number of users and products; when using a phase retrieval SDP to visualize a biological sample, $n$ is the number of pixels in the recovered image. In these applications, most algorithms are prohibitively expensive because their storage costs are quadratic in $n$.
How much memory should be required to solve this problem? Any algorithm must be able to query the problem data and to report a representation of the solution. Informally, we say that an algorithm uses optimal storage if the working storage is no more than a constant multiple of the storage required for these operations. (See Section 1.2 for a formal definition.)
It is not obvious how to develop storage-optimal SDP algorithms. To see why, recall that all weakly-constrained SDPs (those with $m \ll n^2$) admit low-rank solutions [8, 54], which can be expressed compactly in factored form. For these problems, a storage-optimal algorithm cannot even instantiate the matrix variable! One natural idea is to introduce an explicit low-rank factorization $X = FF^\top$ of the primal variable and to minimize the problem over the factor $F$.
Methods built from this idea provably work when the size of the factors is sufficiently large. However, recent work shows that they cannot provably solve all SDPs with optimal storage; see Section 2.
In contrast, this paper develops a new algorithm that provably solves all generic SDPs using optimal storage. More precisely, our algorithm works when the primal and dual SDPs exhibit strong duality, the primal and dual solutions are unique, and strict complementarity holds; these standard conditions hold generically.
Our method begins with the Lagrange dual of the primal SDP Eq. P,
$$\text{maximize} \quad b^\top y \qquad \text{subject to} \quad C - \mathcal{A}^\top y \succeq 0, \tag{D}$$
with dual variable $y \in \mathbb{R}^m$. The vector $b^\top$ is the transpose of $b$, and the linear map $\mathcal{A}^\top$ is the adjoint of $\mathcal{A}$. It is straightforward to compute an approximate solution to the dual SDP Eq. D with optimal storage. The challenge is to recover a primal solution from the approximate dual solution.
To meet this challenge, we develop a new approximate complementarity principle that holds for generic SDPs: Given an approximate dual solution $y$, we prove that there is an approximate primal solution whose range is contained in the eigenspace with small eigenvalues of the dual slack matrix $Z(y) = C - \mathcal{A}^\top y$. This principle suggests an algorithm: we solve the primal SDP by searching over matrices with the appropriate range. This recovery problem is a (much smaller) SDP that can be solved with optimal storage.
The condition Eq. 1 follows, for example, from Slater’s constraint qualification.
Strong duality and feasibility imply that the solution $X_\star$ and the dual slack matrix $Z_\star = C - \mathcal{A}^\top y_\star$ satisfy the complementary slackness condition:
$$X_\star Z_\star = 0.$$
These conditions suffice for the results in Section 5 (and the associated Lemmas 2 and 3), which bound the recovery error of our procedure if the linear map $\mathcal{A}$ restricted to a certain subspace is nonsingular (see (7) and Algorithm 1 for the choice of the subspace).
To ensure this condition, we make the stronger assumption that every solution pair $(X_\star, y_\star)$ satisfies the strict complementarity condition:
$$\operatorname{rank}(X_\star) + \operatorname{rank}(Z_\star) = n.$$
Note that these assumptions ensure that all solutions have the same rank, and therefore that the primal solution is also unique [42, Corollary 2.5]. In particular, the rank $r$ of the solution satisfies the Barvinok–Pataki bound $r(r+1)/2 \le m$.
To summarize, all results in this paper hold under the assumptions of primal attainability, dual uniqueness, strong duality, and strict complementarity. These conditions hold generically, conditioned on primal and dual attainability; i.e., they hold for every SDP satisfying primal and dual attainability outside of a set of measure zero.
1.2 Optimal Storage
First, it is easy to see that $\Theta(rn)$ numbers are sufficient to represent a rank-$r$ solution in factored form. This cost is also necessary because every rank-$r$ matrix is the solution to some SDP from our problem class.
To hide the internal complexity of the optimization problem Eq. P, we will interact with the problem data using data access oracles. Suppose we can perform any of the following operations on arbitrary vectors $u \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$:
$$u \mapsto Cu, \qquad (y, u) \mapsto (\mathcal{A}^\top y)\,u, \qquad u \mapsto \mathcal{A}(uu^\top). \tag{4}$$
These oracles enjoy simple implementations in many concrete applications. The input and output of these operations clearly involve storing $\Theta(n + m)$ numbers.
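As one concrete illustration, the following sketch implements the three oracle operations in Eq. 4 for the maxcut relaxation, where the constraint map extracts the diagonal (so $m = n$) and the objective is built from a graph Laplacian. All names here are illustrative, not from the paper.

```python
import numpy as np

# Illustrative oracle set for the MaxCut SDP: A(X) = diag(X) (so m = n),
# A^T(y) = Diag(y), and C = -L/4 for a graph Laplacian L.
rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
L = np.diag(W.sum(axis=1)) - W           # graph Laplacian

C_times = lambda u: -(L @ u) / 4         # oracle 1: u -> C u
At_times = lambda y, u: y * u            # oracle 2: (y, u) -> (A^T y) u
A_rank1 = lambda u: u * u                # oracle 3: u -> A(u u^T) = diag(u u^T)

u = rng.standard_normal(n)
y = rng.standard_normal(n)
# Sanity checks against the dense definitions; each oracle only ever
# touches O(n + m) numbers, never an n x n matrix.
assert np.allclose(A_rank1(u), np.diag(np.outer(u, u)))
assert np.allclose(At_times(y, u), np.diag(y) @ u)
```

When $C$ and $\mathcal{A}$ are sparse or structured, these oracles also run in time far below $n^2$.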
In summary, any method that uses these data access oracles to solve every SDP from our class must store $\Omega(m + rn)$ numbers. We say a method has optimal storage if its working storage is provably $O(m + rn)$.
For many interesting problems, the number of constraints $m$ is proportional to the dimension $n$. Moreover, the rank of the solution is constant or logarithmic in $n$. In this case, a storage-optimal algorithm has working storage $\tilde{O}(n)$, where the tilde suppresses log-like factors.
Remark 1 (Applications)
The algorithmic framework we propose is most useful when the problem data has an efficient representation and the three operations in Eq. 4 can be implemented with low arithmetic cost. For example, it is often the case that the matrix $C$ and the linear map $\mathcal{A}$ are sparse or structured. This situation occurs in the maxcut relaxation, matrix completion, phase retrieval [22, 66], and community detection, among other applications.
1.3 From Strict Complementarity to Storage Optimality
At an exact dual solution, complementary slackness implies that the range of $X_\star$ lies in the null space of the slack matrix $Z_\star$. Therefore, the slack matrix identifies the range of the primal solution.
Let $r$ be the rank of the primal solution. Construct an orthonormal matrix $V \in \mathbb{R}^{n \times r}$ whose columns span $\operatorname{range}(X_\star)$. The compression of the primal problem Eq. P to this subspace is
$$\text{minimize} \quad \langle V^\top C V, S \rangle \qquad \text{subject to} \quad \mathcal{A}(VSV^\top) = b, \quad S \succeq 0, \tag{5}$$
with variable $S \in \mathbb{S}^r$.
This strategy for solving the primal SDP can be implemented with a storage-optimal algorithm. Indeed, the variable in the dual SDP Eq. D has length $m$, so there is no obstacle to solving the dual with $O(m + n)$ storage using the subgradient-type method described in Section 6. We can compute the subspace using the randomized range finder [33, Alg. 4.1] with storage cost $O(rn)$. Last, we can solve the compressed primal SDP Eq. 5 using working storage $O(m + r^2)$ via the matrix-free method from [25, 52]. The total storage is the optimal $O(m + rn)$. Furthermore, all of these algorithms can be implemented with the data access oracles Eq. 4.
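The compression step above is mechanical; the following minimal sketch (with synthetic data and illustrative names) shows that the compressed problem Eq. 5 evaluates exactly the same objective and constraints as the original problem on matrices of the form $VSV^\top$.

```python
import numpy as np

# Sketch of the compression step: C is n x n, V has orthonormal columns
# spanning the putative solution range, and r << n. Synthetic data only.
rng = np.random.default_rng(1)
n, r, m = 8, 2, 5
C = rng.standard_normal((n, n))
C = (C + C.T) / 2
A_mats = [(lambda M: (M + M.T) / 2)(rng.standard_normal((n, n))) for _ in range(m)]
A = lambda X: np.array([np.tensordot(Ai, X) for Ai in A_mats])  # A(X)_i = <A_i, X>

V, _ = np.linalg.qr(rng.standard_normal((n, r)))                # orthonormal basis
C_V = V.T @ C @ V                                               # compressed cost, r x r
A_V = lambda S: A(V @ S @ V.T)                                  # compressed constraint map

S = rng.standard_normal((r, r))
S = S @ S.T                                                     # any psd r x r matrix
X = V @ S @ V.T                                                 # lifted n x n variable
# The compressed problem sees the same objective value and constraints.
assert np.isclose(np.tensordot(C, X), np.tensordot(C_V, S))
assert np.allclose(A_V(S), A(X))
```

Only the $r \times r$ variable `S` is ever optimized over; the $n \times n$ matrix `X` appears here purely to verify the identity.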
1.4 The Approximate Complementarity Principle
A major challenge remains: one very rarely has access to an exact dual solution! Rather, we usually have an approximate dual solution, obtained via some iterative dual solver.
This observation motivates us to formulate a new approximate complementarity principle. For now, assume that the rank $r$ is known. Given an approximate dual solution $y$, we can construct an orthonormal matrix $V \in \mathbb{R}^{n \times r}$ whose columns are eigenvectors of the slack matrix $Z(y)$ with the $r$ smallest eigenvalues. Roughly speaking, the primal problem Eq. P admits an approximate solution whose range is contained in $\operatorname{range}(V)$. We show the approximate solution is close to the true solution as measured in terms of suboptimality, infeasibility, and distance to the solution set.
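The construction of $V$ is a small eigenvector computation on the slack matrix. The sketch below uses dense synthetic data and `numpy.linalg.eigh` for brevity; at scale one would instead drive a matrix-free Lanczos solver (e.g. scipy's `eigsh` with `which='SA'`) with the matvec $u \mapsto Zu$ built from the oracles in Eq. 4.

```python
import numpy as np

# Sketch: build V from an approximate dual vector y (synthetic data).
rng = np.random.default_rng(2)
n, m, r = 8, 5, 2
C = rng.standard_normal((n, n))
C = (C + C.T) / 2
A_mats = [(lambda M: (M + M.T) / 2)(rng.standard_normal((n, n))) for _ in range(m)]
y = rng.standard_normal(m)

Z = C - sum(yi * Ai for yi, Ai in zip(y, A_mats))   # dual slack Z(y) = C - A^T y
evals, evecs = np.linalg.eigh(Z)                    # eigenvalues in ascending order
V = evecs[:, :r]                                    # eigenvectors of the r smallest

assert V.shape == (n, r)
assert np.allclose(V.T @ V, np.eye(r))              # orthonormal columns
```

The eigenvectors returned by `eigh` are already sorted by ascending eigenvalue, so the first `r` columns span the desired invariant subspace.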
We propose to recover the approximate primal solution by solving the semidefinite least-squares problem
$$\text{minimize} \quad \|\mathcal{A}(VSV^\top) - b\|^2 \qquad \text{subject to} \quad S \succeq 0, \tag{MinFeasSDP}$$
with variable $S \in \mathbb{S}^r$. Given a solution $S_\star$ to Eq. MinFeasSDP, we obtain an (infeasible) approximate solution $\hat{X} = V S_\star V^\top$ to the primal problem.
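Because the variable of Eq. MinFeasSDP is only $r \times r$, even a very simple method suffices to solve it. The sketch below (synthetic data, illustrative names; not the solver used in the paper) runs projected gradient descent, projecting onto the psd cone by clipping negative eigenvalues.

```python
import numpy as np

# Sketch: projected gradient for
#   minimize ||A(V S V^T) - b||^2  over  S psd (r x r).
rng = np.random.default_rng(3)
n, m, r = 8, 5, 2
A_mats = [(lambda M: (M + M.T) / 2)(rng.standard_normal((n, n))) for _ in range(m)]
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
S_true = rng.standard_normal((r, r))
S_true = S_true @ S_true.T
A = lambda X: np.array([np.tensordot(Ai, X) for Ai in A_mats])
b = A(V @ S_true @ V.T)                              # consistent right-hand side

def proj_psd(S):
    """Project a symmetric matrix onto the psd cone."""
    w, Q = np.linalg.eigh((S + S.T) / 2)
    return (Q * np.maximum(w, 0)) @ Q.T

AV = [V.T @ Ai @ V for Ai in A_mats]                 # compressed constraint matrices
S = np.zeros((r, r))
step = 1.0 / sum(np.linalg.norm(M) ** 2 for M in AV) # safe step (below 1/Lipschitz)
for _ in range(2000):
    resid = np.array([np.tensordot(M, S) for M in AV]) - b
    grad = sum(res * M for res, M in zip(resid, AV)) # gradient of 0.5*||A_V(S)-b||^2
    S = proj_psd(S - step * grad)

X_hat = V @ S @ V.T                                  # recovered primal matrix
```

With this step size the objective decreases monotonically, and every iterate `S` is psd by construction.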
In fact, it is essential to relax our attention to infeasible solutions because the feasible set of Eq. P should almost never contain a matrix with range $\operatorname{range}(V)$! This observation was very surprising to us, but it seems evident in retrospect. (It follows, for example, from a dimension-counting argument together with Lemma 8.)
The resulting framework appears in Table 1 [right]. This approach for solving Eq. P leads to storage-optimal algorithms for the same reasons described in Section 1.3. Our first main result ensures that this technique results in a provably good solution to the primal SDP Eq. P.
Theorem 1 (Main theorem, informal)
When the rank $r$ is not known in advance, we obtain a similar guarantee using an estimate for $r$.
1.5 Paper Organization
We situate our contributions relative to other work in Section 2. Section 3 contains an overview of our notation and more detailed problem assumptions. Section 4 uses the approximate complementarity principle to develop practical, robust, and theoretically justified algorithms for solving Eq. P. The algorithms are accompanied by detailed bounds on the quality of the computed solutions as compared with the true solution. Section 5 contains algorithmically computable bounds that can be used to check the solution quality under weaker assumptions than those required in Section 4. These checks are important for building a reliable solver. Next, we turn to algorithmic issues: we explain how to compute an approximate dual solution efficiently in Section 6, which provides the last ingredient for a complete method to solve Eq. P. Section 7 shows numerically that the method is effective in practice.
2 Related Work
Semidefinite programming can be traced to a 1963 paper of Bellman & Fan. Related questions emerged earlier in control theory, starting from Lyapunov's 1890 work on the stability of dynamical systems. There are many classic applications in matrix analysis, dating to the 1930s. Graph theory provides another rich source of examples, beginning from the 1970s. See [14, 65, 63, 11, 16] for more history and problem formulations.
2.1 Interior-Point Methods
The success of these SDP algorithms motivated new applications. In particular, Goemans & Williamson used semidefinite programming to design an approximation algorithm to compute the maximum-weight cut of a graph. Early SDP solvers could only handle graphs with a few hundred vertices [32, Sec. 5], although computational advances quickly led to IPMs that could solve problems with thousands of vertices.
IPMs form a sequence of unconstrained problems whose solutions are feasible for the original SDP, and move towards the solutions of these unconstrained problems using Newton's method. As a result, IPMs converge to high accuracy in very few iterations, but require substantial work per iteration. To solve a standard-form SDP with an $n \times n$ matrix variable and with $m$ equality constraints, a typical IPM requires $O(\sqrt{n}\log(1/\varepsilon))$ iterations to reach a solution with accuracy $\varepsilon$ (in terms of objective value), and $O(mn^3 + m^2n^2 + m^3)$ arithmetic operations per iteration (when no structure is exploited), so $O(\sqrt{n}(mn^3 + m^2n^2 + m^3)\log(1/\varepsilon))$ arithmetic operations in total. Further, a typical IPM requires at least $\Omega(m^2 + n^2)$ memory, not including the storage of the data representation (which takes $O(mn^2)$ memory if no structure is assumed).
As a consequence, these algorithms are not effective for solving large problem instances, unless they enjoy a lot of structure. Hence researchers began to search for methods that could scale to larger problems.
2.2 First-Order Methods
One counterreaction to the expense of IPMs was to develop first-order optimization algorithms for SDPs. This line of work began in the late 1990s, and it accelerated as SDPs emerged in the machine learning and signal processing literature in the 2000s.
Early on, Helmberg & Rendl proposed a spectral bundle method for solving an SDP in dual form, and they showed that it converges to a dual solution when the trace of the primal variable is constant. In contrast to IPMs, the spectral bundle method has low per-iteration complexity. On the other hand, the convergence rate is not known, and there is no convergence guarantee on the primal side, so there is no explicit control on the storage and arithmetic costs.
In machine learning, popular first-order algorithms include the proximal gradient method, accelerated variants, and the alternating direction method of multipliers [30, 31, 15, 53]. These methods provably solve the original convex formulation of Eq. P, but they converge slowly in practice. These algorithms all store the full primal matrix variable, so they are not storage-efficient.
Recently, Friedlander & Macedo have proposed a novel first-order method that is based on gauge duality, rather than Lagrangian duality. This approach converts an SDP into an eigenvalue optimization problem. The authors propose a mechanism for using a dual solution to construct a primal solution. This paper is similar in spirit to our approach, but it lacks an analysis of the accuracy of the primal solution. Moreover, it only applies to problems with a positive-definite objective, i.e., $C \succ 0$.
2.3 Storage-Efficient First-Order Methods
Motivated by problems in signal processing and machine learning, a number of authors have revived the conditional gradient method (CGM) [28, 43, 39]. In particular, Hazan suggested using CGM for semidefinite programming. Clarkson developed a new analysis, and Jaggi showed how this algorithm applies to a wide range of interesting problems.
The appeal of the CGM is that it computes an approximate solution to an SDP as a sum of rank-one updates; each rank-one update is obtained from an approximate eigenvector computation. In particular, after $t$ iterations, the iterate has rank at most $t$. This property has led to the exaggeration that CGM is a “storage-efficient” optimization method. Unfortunately, CGM converges very slowly, so the iterates do not have controlled rank. The literature describes many heuristics for attempting to control the rank of the iterates [55, 71], but these methods all lack guarantees.
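The rank-one update structure is easy to see in code. The sketch below (synthetic data, illustrative names) runs CGM over the trace ball $\{X \succeq 0,\ \operatorname{tr} X = \alpha\}$ for a least-squares objective: the linear minimization step returns $\alpha vv^\top$ for an extreme eigenvector of the gradient, so each iteration adds one rank-one term.

```python
import numpy as np

# Sketch of CGM on { X psd, tr X = alpha }, minimizing 0.5*||A(X) - b||^2.
rng = np.random.default_rng(4)
n, m, alpha = 8, 5, 2.0
A_mats = [(lambda M: (M + M.T) / 2)(rng.standard_normal((n, n))) for _ in range(m)]
A = lambda X: np.array([np.tensordot(Ai, X) for Ai in A_mats])
v_true = rng.standard_normal(n)
v_true /= np.linalg.norm(v_true)
b = A(alpha * np.outer(v_true, v_true))       # data consistent with a rank-one solution

X = np.zeros((n, n))
ranks = []
for t in range(1, 51):
    resid = A(X) - b
    grad = sum(res * Ai for res, Ai in zip(resid, A_mats))  # grad of 0.5*||A(X)-b||^2
    w, Q = np.linalg.eigh(grad)
    v = Q[:, 0]                               # unit eigenvector, smallest eigenvalue
    eta = 2.0 / (t + 1)                       # standard CGM step schedule
    X = (1 - eta) * X + eta * alpha * np.outer(v, v)        # rank-one update
    ranks.append(np.linalg.matrix_rank(X))

# After t steps, X is a convex combination of at most t rank-one terms.
assert all(rk <= t + 1 for t, rk in enumerate(ranks))
```

Storing `X` as the list of pairs $(\eta_t, v_t)$ instead of densely is exactly the "storage-efficient" reading criticized above: the rank, and hence the storage, grows with the (large) iteration count.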
Very recently, some of the authors of this paper have shown how to use CGM to design a storage-optimal algorithm for a class of semidefinite programs by sketching the decision variable. This algorithm does not apply to standard-form SDPs, and it inherits the slow convergence of CGM. Nevertheless, the sketching methodology holds promise as a way to design storage-optimal solvers, particularly together with algorithms that generalize CGM and that do apply to standard-form SDPs [70, 69].
We also mention a subgradient method developed by Renegar that can be used to solve either the primal or the dual SDP. Renegar's method has a computational profile similar to CGM, and it does not have controlled storage costs.
2.4 Factorization Methods
There is also a large class of heuristic SDP algorithms based on matrix factorization. The key idea is to factorize the matrix variable as $X = FF^\top$ and to reformulate the SDP Eq. P as
$$\text{minimize} \quad \langle C, FF^\top \rangle \qquad \text{subject to} \quad \mathcal{A}(FF^\top) = b. \tag{6}$$
We can apply a wide range of nonlinear programming methods to optimize Eq. 6 with respect to the variable $F$. In contrast to the convex methods described above, these techniques only offer incomplete guarantees on storage, arithmetic, and convergence.
The factorization idea originates in the paper of Homer & Peinado. They focused on the maxcut SDP, and the factor was a square matrix, i.e., $F \in \mathbb{R}^{n \times n}$. These choices result in an unconstrained nonconvex optimization problem that can be tackled with a first-order optimization algorithm.
Every SDP with $m$ constraints admits a solution with rank $r$ satisfying $r(r+1)/2 \le m$. (Note, however, that the SDP can have solutions with much lower or higher rank.) Interestingly, it is possible to find a low-rank approximate solution to an SDP, with high probability, by projecting any solution of the SDP onto a random subspace [7, Proposition 6.1]. However, this approach requires having already computed a (possibly full-rank) solution using some other method.
Inspired by the existence of low-rank solutions to SDPs, Burer & Monteiro proposed to solve the optimization problem Eq. 6 where the variable $F \in \mathbb{R}^{n \times p}$ is constrained to be a tall matrix ($p \ll n$). The number $p$ is called the factorization rank. It is clear that every rank-$r$ solution to the SDP Eq. P induces a solution to the factorized problem Eq. 6 when $p \ge r$. Burer & Monteiro applied a limited-memory BFGS algorithm to solve Eq. 6 in an explicit effort to reduce storage costs.
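For concreteness, here is a minimal sketch of the factorized formulation Eq. 6 for the maxcut constraint $\operatorname{diag}(X) = 1$, using a tall factor and a simple projected-gradient heuristic (normalizing each row of $F$ enforces the diagonal constraint exactly). This is an illustration of the Burer–Monteiro idea, not the L-BFGS method the authors used.

```python
import numpy as np

# Burer-Monteiro heuristic for the MaxCut SDP:
#   minimize <C, F F^T>  subject to  diag(F F^T) = 1,
# with a tall factor F (n x p). Synthetic data; step size is illustrative.
rng = np.random.default_rng(5)
n, p = 10, 3
W = rng.random((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0)
C = -(np.diag(W.sum(axis=1)) - W) / 4               # MaxCut objective matrix

F = rng.standard_normal((n, p))
F /= np.linalg.norm(F, axis=1, keepdims=True)       # feasible start: unit rows
for _ in range(200):
    G = 2 * C @ F                                   # gradient of <C, F F^T> in F
    F = F - 0.1 * G
    F /= np.linalg.norm(F, axis=1, keepdims=True)   # project rows back to unit norm

X = F @ F.T                                         # implicit primal variable
assert np.allclose(np.diag(X), 1.0)                 # diag constraint holds exactly
```

The storage is $O(np)$ for the factor, never $O(n^2)$ for `X`; the dense `X` above is formed only to check feasibility.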
In subsequent work, Burer & Monteiro proved that, under technical conditions, the local minima of the nonconvex formulation Eq. 6 are global minima of the SDP Eq. P, provided that the factorization rank satisfies $p(p+1)/2 \ge m$. As a consequence, algorithms based on Eq. 6 often set the factorization rank $p = O(\sqrt{m})$, so the storage costs are $O(n\sqrt{m})$.
Unfortunately, a recent result of Waldspurger & Waters [67, Thm. 2] demonstrates that the formulation Eq. 6 cannot lead to storage-optimal algorithms for generic SDPs. In particular, suppose that the feasible set of Eq. P satisfies a mild technical condition and contains a matrix with rank one. Whenever the factorization rank satisfies $p(p+1)/2 < m$ (up to lower-order terms), there is a set of cost matrices with positive Lebesgue measure for which the factorized problem Eq. 6 has (1) a unique global optimizer with rank one and (2) at least one suboptimal local minimizer, while the original SDP has a unique primal and dual solution that satisfy strict complementarity. In this situation, the variable $F$ in the factorized SDP actually requires $\Omega(n\sqrt{m})$ storage, which is not optimal when $\sqrt{m} \gg r$. In view of this negative result, we omit a detailed review of the literature on the analysis of factorization methods; we refer the reader to the literature for a full discussion.
2.5 Summary and Contribution
3 Basics and Notation
Here we introduce some additional notation, and metrics for evaluating the quality of a solution and the conditioning of an SDP.
We will work with the Frobenius norm $\|\cdot\|_F$, the operator norm $\|\cdot\|_{\mathrm{op}}$, and its dual, the nuclear norm $\|\cdot\|_*$. We reserve the symbols $\|\cdot\|$ and $\langle \cdot, \cdot \rangle$ for the norm induced by the canonical inner product of the underlying real vector space, and for that inner product.
For a matrix $A$, we arrange its singular values in decreasing order:
$$\sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge 0.$$
Define $\sigma_{\max}(A) = \sigma_1(A)$ and $\sigma_{\min}(A)$ to be the smallest singular value. We also write $\sigma_{\min > 0}(A)$ for the smallest nonzero singular value of $A$. For a linear operator $\mathcal{B}$, we define its singular values through any matrix representation with respect to orthonormal bases.
We use analogous notation for the eigenvalues of a symmetric matrix. In particular, the map $\lambda_i(\cdot)$ reports the $i$th largest eigenvalue of its argument.
3.2 Optimal Solutions
Instate the notation and assumptions from Section 1.1. Define the slack operator $Z(\cdot)$ that maps a putative dual solution $y$ to its associated slack matrix $Z(y) = C - \mathcal{A}^\top y$. We omit the dependence on $y$ if it is clear from the context.
Let $r$ be the rank of the primal solution $X_\star$, and denote its range by $\mathcal{U} = \operatorname{range}(X_\star)$. We also fix an orthonormal matrix $U \in \mathbb{R}^{n \times r}$ whose columns span $\mathcal{U}$. Introduce the orthogonal complement $\mathcal{U}^\perp$, and let $U_\perp$ be a fixed orthonormal basis for $\mathcal{U}^\perp$. We have the decomposition $\mathbb{R}^n = \mathcal{U} \oplus \mathcal{U}^\perp$.
For an orthonormal matrix $V \in \mathbb{R}^{n \times r}$, define the compressed cost matrix and constraint map
$$C_V = V^\top C V, \qquad \mathcal{A}_V(S) = \mathcal{A}(VSV^\top).$$
In particular, $\mathcal{A}_V$ is the compression of the constraint map $\mathcal{A}$ onto the range of $V$.
3.3 Conditioning of the SDP
First, we measure the strength of the complementarity condition Eq. 2 using the spectral gaps of the primal solution $X_\star$ and dual slack matrix $Z_\star$: the smallest nonzero eigenvalues $\lambda_r(X_\star)$ and $\lambda_{n-r}(Z_\star)$.
These two numbers capture how far we can perturb the solutions before the complementarity condition fails.
Second, we measure the robustness of the primal solution to perturbations of the problem data using the quantity $\sigma_{\min}(\mathcal{A}_U)$, the minimum singular value of the compressed constraint map.
This term arises because we have to understand the conditioning of the system of linear equations $\mathcal{A}(USU^\top) = b$ in the variable $S$.
[Table 2: suboptimality, infeasibility, and distance to the solution set, for a primal matrix $X$ and a dual vector $y$.]
3.4 Quality of Solutions
We measure the quality of a primal matrix variable and a dual vector in terms of their suboptimality, their infeasibility, and their distance to the true solutions. Table 2 gives formulas for these quantities.
We say that a matrix is an $\varepsilon$-solution of Eq. P if its suboptimality is at most $\varepsilon$ and its infeasibility is at most $\varepsilon$.
The primal suboptimality and infeasibility are both controlled by the distance to the primal solution set:
$$|\langle C, X \rangle - \langle C, X_\star \rangle| \le \|C\|_F \, \|X - X_\star\|_F \qquad\text{and}\qquad \|\mathcal{A}X - b\| \le \sigma_{\max}(\mathcal{A}) \, \|X - X_\star\|_F.$$
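These quality measures are cheap to evaluate. The helper below follows the definitions above; the optimal value `p_star` is taken as given for illustration, and all names are ours rather than the paper's.

```python
import numpy as np

# Helpers matching the definitions above: X is an eps-solution of Eq. P
# when its suboptimality and infeasibility are both at most eps.
def suboptimality(X, C, p_star):
    return float(np.tensordot(C, X)) - p_star       # <C, X> - p*

def infeasibility(X, A, b):
    return float(np.linalg.norm(A(X) - b))          # ||A(X) - b||

def is_eps_solution(X, C, A, b, p_star, eps):
    return suboptimality(X, C, p_star) <= eps and infeasibility(X, A, b) <= eps

# Toy check: for A(X) = diag(X) and b = 1, every feasible X has trace n,
# so X = I is optimal for C = I with optimal value n.
n = 3
ok = is_eps_solution(np.eye(n), np.eye(n), np.diag, np.ones(n), float(n), 1e-8)
assert ok
```

Note that `suboptimality` can be negative for infeasible `X`, which is why the two measures are always reported together.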
We can also control the distance of a dual vector and its slack matrix from their optima using the following quadratic growth lemma.
Lemma 1 (Quadratic Growth)
4 Reduced SDPs and Approximate Complementarity
In this section, we describe two reduced SDP formulations, and we explain when their solutions are nearly optimal for the original SDP Eq. P. We can interpret these results as constructive proofs of the approximate complementarity principle.
4.1 Reduced SDPs
Suppose that we have obtained an approximate dual solution $y$ and its associated dual slack matrix $Z = Z(y)$. Let $r$ be a rank parameter, which we will discuss later. Construct an orthonormal matrix $V \in \mathbb{R}^{n \times r}$ whose range is an $r$-dimensional invariant subspace associated with the $r$ smallest eigenvalues of the dual slack matrix $Z$. Our goal is to compute a matrix with range contained in $\operatorname{range}(V)$ that approximately solves the primal SDP Eq. P.
Our first approach minimizes infeasibility over all psd matrices with range contained in $\operatorname{range}(V)$:
$$\text{minimize} \quad \|\mathcal{A}(VSV^\top) - b\|^2 \qquad \text{subject to} \quad S \succeq 0. \tag{MinFeasSDP}$$
Our second approach minimizes the objective value over all psd matrices with range contained in $\operatorname{range}(V)$, subject to a specified limit $\delta$ on infeasibility:
$$\text{minimize} \quad \langle V^\top C V, S \rangle \qquad \text{subject to} \quad \|\mathcal{A}(VSV^\top) - b\| \le \delta, \quad S \succeq 0, \tag{MinObjSDP}$$
with variable $S \in \mathbb{S}^r$. Given a solution $S_\star$, we can form an approximate solution $\hat{X} = V S_\star V^\top$ for the primal SDP Eq. P.
As we will see, both approaches lead to satisfactory solutions to the original SDP Eq. P under appropriate assumptions. Theorem 2 addresses the performance of Eq. MinFeasSDP, while Theorem 3 addresses the performance of Eq. MinObjSDP. Table 3 summarizes the hypotheses we impose to study each of the two problems, as well as the outcomes of the analysis.
The bounds in this section depend on the problem data and rely on assumptions that are not easy to check. Section 5 shows that there are easily verifiable conditions that allow us to calculate bounds on the quality of the computed solutions.
4.2 Analysis of Eq. MinFeasSDP
Theorem 2 (Analysis of Eq. MinFeasSDP)
This bound shows that the recovery error is small when the dual vector is nearly optimal. Notice this result requires knowledge of the solution rank $r$. The proof of Theorem 2 occupies the rest of this section.
4.2.1 Primal Optimizers and the Reduced Search Space
The first step in the argument is to prove that $X_\star$ is near the search space of the reduced problems. We will use this lemma again, so we state it under minimal conditions.
Complete $V$ to an orthonormal basis $[V, V_\perp]$ for $\mathbb{R}^n$, where the $i$th column of $[V, V_\perp]$ is an eigenvector of $Z$ associated with the $i$th smallest eigenvalue. Rotating into this coordinate system, let us compare $X_\star$ and its projection $VV^\top X_\star VV^\top$ into the space spanned by $V$.
Let $X_{11} = V^\top X_\star V$, $X_{12} = V^\top X_\star V_\perp$, and $X_{22} = V_\perp^\top X_\star V_\perp$. Using the unitary invariance of the Frobenius norm, we have
$$\|X_\star - VX_{11}V^\top\|_F^2 = 2\|X_{12}\|_F^2 + \|X_{22}\|_F^2.$$
A similar identity holds for the nuclear norm. Thus we need only bound the terms $\|X_{12}\|$ and $\|X_{22}\|$. Applying Lemma 9 to the psd matrix $X_\star$, we can control $\|X_{12}\|$ in terms of $\|X_{11}\|$ and $\|X_{22}\|$.
Using strong duality and feasibility of $X_\star$, we rewrite the dual suboptimality as
$$\epsilon := d_\star - b^\top y = \langle C, X_\star \rangle - \langle X_\star, \mathcal{A}^\top y \rangle = \langle X_\star, Z \rangle.$$
Since all the vectors in $V_\perp$ have corresponding eigenvalues at least as large as the threshold $\lambda_{n-r}(Z)$, and as $y$ is dual feasible (so $Z \succeq 0$), we have
$$\epsilon = \langle X_\star, Z \rangle \ge \lambda_{n-r}(Z) \operatorname{tr}(X_{22}) = \lambda_{n-r}(Z) \, \|X_{22}\|_*.$$
This inequality allows us to bound $\|X_{22}\|_*$ as
$$\|X_{22}\|_* \le \frac{\epsilon}{\lambda_{n-r}(Z)}.$$
Basic linear algebra shows that $\|X_{22}\|_F \le \|X_{22}\|_*$. Combining the pieces, we bound the error in the Frobenius norm in terms of $\epsilon / \lambda_{n-r}(Z)$. Similarly, we may bound the error in the nuclear norm by the same reasoning; the nuclear-norm step uses the fact that $X_{12}$ has rank at most $r$.
4.2.2 Relationship between the Solutions of Eq. MinFeasSDP and Eq. P
Lemma 2 shows that any solution of Eq. P is close to its compression onto the range of $V$. Next, we show that the solution $S_\star$ of Eq. MinFeasSDP is also close to this compression. We can invoke strong convexity of the objective of Eq. MinFeasSDP to achieve this goal.
Instate the assumptions from Lemma 2. Assume that $\sigma_{\min}(\mathcal{A}_V) > 0$ and that the threshold $\lambda_{n-r}(Z) > 0$. Then the distance from $\hat{X} = V S_\star V^\top$ to $X_\star$ is controlled,
where $X_\star$ is any solution of the primal SDP Eq. P.
Since we assume that $\sigma_{\min}(\mathcal{A}_V) > 0$, we know the objective of Eq. MinFeasSDP, $S \mapsto \|\mathcal{A}(VSV^\top) - b\|^2$, is strongly convex, and so the solution $S_\star$ is unique. Strong convexity, together with the optimality of $S_\star$, then bounds $\|S - S_\star\|_F^2$ by the difference in objective values at $S$ and $S_\star$, for any psd $S$.
Since $X_\star$ is feasible for Eq. P, we can bound the optimal value of Eq. MinFeasSDP by the infeasibility of the compression $V^\top X_\star V$, which Lemma 2 controls. Combining the pieces, and using the unitary invariance of the Frobenius norm, we conclude that $\hat{X} = V S_\star V^\top$ is close to $X_\star$, by combining our bound on $\|S_\star - V^\top X_\star V\|_F$ with Lemma 2.
4.2.3 Lower Bounds for the Threshold and Minimum Singular Value
Finally, we must confirm that the extra hypotheses of Lemma 3 hold, viz., $\lambda_{n-r}(Z) > 0$ and $\sigma_{\min}(\mathcal{A}_V) > 0$.
Here is the intuition. Strict complementarity forces $\lambda_{n-r}(Z_\star) > 0$. If $Z$ is close to $Z_\star$, then we expect that $\lambda_{n-r}(Z) > 0$ by continuity. When $X_\star$ is unique, Lemma 8 implies that $\sigma_{\min}(\mathcal{A}_U) > 0$. As a consequence, the compressed constraint map at the true range is nonsingular. If $V$ is close to $U$, then we expect that $\sigma_{\min}(\mathcal{A}_V) > 0$ as well. We have the following rigorous statement.
Instate the hypotheses of Theorem 2. Then
We first prove the lower bound on the threshold $\lambda_{n-r}(Z)$. Using quadratic growth (Lemma 1), we can bound the distance from $Z$ to $Z_\star$ in terms of the dual suboptimality $\epsilon$; thus, for sufficiently small $\epsilon$, the slack matrix $Z$ is close to $Z_\star$. Weyl's inequality tells us that $\lambda_{n-r}(Z) \ge \lambda_{n-r}(Z_\star) - \|Z - Z_\star\|_{\mathrm{op}}$. Using Eq. 22, we see that for all sufficiently small $\epsilon$, the threshold $\lambda_{n-r}(Z)$ is bounded away from zero.
Next we prove the lower bound on $\sigma_{\min}(\mathcal{A}_V)$. We have $\sigma_{\min}(\mathcal{A}_U) > 0$ by Lemma 8. It will be convenient to align the columns of $V$ with those of $U$ for our analysis. Consider the solution $R$ of the orthogonal Procrustes problem $\min \{ \|U - VR\|_F : R^\top R = I \}$. Since $VR$ is again orthonormal with the same range as $V$, without loss of generality we suppose that we have already performed the alignment, replacing $V$ by $VR$ in the following.
Let . Then we have
Defining , we bound the term as