A semidefinite program (SDP) is an optimization problem of the form
where denotes the set of real symmetric matrices, stands for the trace of a matrix, and are input data to the problem, and the notation denotes that the matrix is constrained to be positive semidefinite
(psd); i.e., to have nonnegative eigenvalues. Thedual to Problem (1) takes the form
SDPs have numerous fundamental applications throughout applied and computational mathematics. In combinatorics and discrete optimization for instance, some of the most powerful relaxations for prominent problems of the field rely on semidefinite programming. For example, the current best approximation algorithm for the maximum-cut problem , or the only known polynomial-time algorithm for finding maximum stable sets in perfect graphs 
make fundamental use of semidefinite programming. In machine learning, one is often faced with the task of learning a signal with a known structure from data (e.g., a low-rank matrix, a sparse eigenvector, a monotone polynomial, etc.). The design of an appropriate convex optimization problem which induces such structure in a signal often leads to a semidefinite programming problem; see, e.g.,, , [53, Chap. 8]. In polynomial optimization, SDP-based hierarchies based on the notion of sum of squares polynomials (see Section 1.3 below) are among the most promising approaches for obtaining global solutions in the absence of any convexity assumptions [64, 82]. Finally, in control and robotics, various notions of safety, performance, stability, and robustness of dynamical systems can be certified by searching for Lyapunov functions that satisfy appropriate nonnegativity constraints over the state space. When the candidate functions are parameterized as polynomials or other semialgebraic functions, this search can be automated by semidefinite programming via the sum of squares approach; see, again, Section 1.3 below and 
. Overall, semidefinite programming has proven to be one of the most expressive classes of convex optimization problems, subsuming many other important classes such as linear programming, convex quadratic programming, and second-order cone programming.
In addition to its expressiveness, another contributing factor to the popularity of semidefinite programming lies in its strong theoretical and computational properties, which to certain extent parallel that of linear programming . For instance, strong duality between (1) and (2) holds under mild assumptions. Moreover, many algorithms for linear programs, such as the ellipsoid method and interior point methods, can and have been adapted to SDPs. While these algorithms solve SDPs to arbitrary accuracy in polynomial time, in practice, they suffer from one serious impediment: scalability. When the problems considered have large or (as defined in (1)), solving time and memory requirements tend to explode, making SDPs prohibitive to solve. Indeed, most solvers rely on interior point methods as their algorithm of choice for SDPs and interior point methods produce iterates that are (as their name suggests) in the interior of the feasible set. Computing these large dense matrix iterates and storing them lead to the issues described above. Devising methods to make SDPs more scalable is consequently a very important and active area of research.
1.1 The goal and outline of this survey
In this paper, we review recent literature on scalability improvements for semidefinite programming with a particular focus on methods that have proved useful in machine learning, control theory, and robotics. We consider these representative application areas since they are timely and because many problems that arise in them have straightforward semidefinite programming-based relaxations or reformulations, but involve very large problem instances. This paper is a follow-up to a workshop at the 2016 Conference on Decision and Control on the same topic and titled “Solving Large SDPs in Control, Machine Learning, and Robotics”. Eleven lectures were given at this workshop whose content can be found at http://aaa.princeton.edu/largesdps.
The outline of this paper is as follows. In Section 1.3, we provide a brief introduction to sum of squares proofs of nonnegativity of polynomials. This background material is relevant to many of the applications of semidefinite programming we discuss in this paper (e.g., in Section 2.1.1 and Section 5.1). Section 2 presents approaches that enhance scalability by exploiting structure (e.g., symmetry or sparsity) in a problem. Section 3 presents methods for generating low-rank solutions to SDPs in order to reduce computation time and storage requirements. Section 4 reviews more scalable alternatives to interior-point methods for solving SDPs based on augmented Lagrangian techniques and the alternating direction method of multipliers. Section 5 presents approaches that trade off scalability with conservatism of solutions (e.g., by approximating SDPs with linear or second-order cone programs). In Section 6, we present a list of software packages that implement many of the techniques described in this paper. Section 7 concludes the paper and highlights promising directions for future work.
1.2 Related work not covered by this survey
The number of algorithms that have been proposed as alternatives to interior point methods for semidefinite programming is large and rapidly growing. While we have attempted to present some of the major themes, our survey is certainly not exhaustive. Some of the interesting contributions that this paper does not cover due to space limitations include the recent first-order methods developed by Renegar , the multiplicative weights update algorithm of Arora, Hazan, and Kale , the storage-optimal algorithm of Ding et al. , the iterative projection methods of Henrion and Malick , the conditional gradient-based augmented Lagrangian framework of Yurtsever, Fercoq, and Cevher , the regularization methods of Nie and Wang for SDP relaxations of polynomial optimization , the accelerated first-order method of Bertsimas, Freund, and Sun for the sum of squares relaxation of unconstrained polynomial optimization , and the spectral algorithms of Hopkins et al. for certain sum of squares problems arising in machine learning .
We also remark that there has been a relatively large recent literature at the intersection of optimization and machine learning where SDP relaxations of certain nonconvex problems are avoided altogether and instead the nonconvex problem is tackled directly with local descent algorithms. These algorithms can often scale to larger problems compared to the SDP approach (at least when the SDPs are solved by interior point methods). Moreover, under certain technical caveats and statistical assumptions on problem data, it is sometimes possible to prove similar guarantees for these algorithms as those of the SDP relaxation. We do not cover this literature in this paper but point the interested reader to  for a list of references.
1.3 A brief review of sum of squares proofs of nonnegativity
Sum of squares constraints on multivariate polynomials are responsible for a substantial fraction of recent applications of semidefinite programming. For the convenience of the reader, we briefly review the basics of this concept and the context in which it arises. This will be relevant e.g. for a better understanding of Section 2.1.1 and Section 5.1.
Recall that a closed basic semialgebraic set is a subset of the Euclidean space of the form
where are polynomial functions. It is relatively well known by now that many fundamental problems of computational mathematics can be cast as optimization problems where the decision variables are the coefficients of a multivariate polynomial , the objective function is a linear function of these coefficients, and the constraints are twofold: (i) affine constraints in the coefficients of , and (ii) the constraint that
where is a given closed basic semialgebraic set. For example, in a polynomial optimization problem—i.e., the problem of minimizing a polynomial function on —the optimal value is equal to the largest constant for which satisfies (3). See ,,[8, Sect. 1.1] for numerous other applications. Unfortunately, imposing the constraint in (3) (or even asking a decision version of it for a fixed polynomial ) is NP-hard already when is a quartic polynomial and , or when is quadratic and is a polytope.
An idea pioneered to a large extent by Lasserre  and Parrilo  has been to write algebraic sufficient conditions for (3) based on the concept of sum of squares polynomials. We say that a polynomial is a sum of squares (sos) if it can be written as for some polynomials . Observe that if we succeed in finding sos polynomials such that the polynomial identity
holds, then, clearly, the constraint in (3) must be satisfied. Conversely, a celebrated result in algebraic geometry  states that if satisfy the so-called “Archimedean property” (a condition slightly stronger than compactness of the set ), then positivity of on guarantees existence of sos polynomials that satisfy the algebraic identity in (4).
The computational appeal of the sum of squares approach stems from the fact that the search for sos polynomials of a given degree that verify the polynomial identity in (4) can be automated via semidefinite programming. This is true even when some coefficients of the polynomial are left as decision variables. This claim is a straightforward consequence of the following well-known fact (see, e.g., ): A polynomial of degree is a sum of squares if and only if there exists a symmetric matrix which is positive semidefinite and verifies the identity
here denotes the vector of all monomials inof degree less than or equal to . Note that the size of the semidefinite constraint in an SDP that imposes a sum of squares constraint on an -variate polynomial of degree is This is the reason why sum of squares optimization problems—i.e., SDPs arising from sum of squares constraints on polynomials—often run into scalability limitations quickly.
2 Enhancing Scalability by Exploiting Structure
A major direction for improving the scalability of semidefinite programming has been towards the development of methods for exploiting problem-specific structure. Existing literature in this area has focused on two kinds of structure that appears frequently in applications in control theory, robotics, and machine learning: (i) sparsity, and (ii) symmetry.
2.1 Exploiting sparsity
Many problems in control theory give rise to SDPs that exhibit structure in the form of chordal sparsity. We first give a brief introduction to the general notion of chordal sparsity and then highlight applications in control theory that exploit this structure to achieve significant gains in scalability. We begin by introducing a few relevant graph-theoretic notions.
Definition 1 (Cycles and chords).
A cycle of length in a graph is a sequence of distinct vertices such that is an edge and is an edge for . A chord is an edge connecting two nonadjacent vertices in a cycle.
Definition 2 (Chordal graph).
An undirected graph is chordal if every cycle of length has a chord.
Definition 3 (Maximal clique).
A clique of a graph is a subset of vertices such that for any distinct vertices , is an edge in . A clique is maximal if it is not a subset of another clique.
The graph theoretic notions introduced above can be used to capture sparsity patterns of matrices. Let denote the set of edges of an undirected graph and let denote the set of vertices. We define . Let denote the set of symmetric matrices as before, and let denote the set of matrices with a sparsity structure defined by as follows:
The following proposition is the primary result that allows one to exploit chordal sparsity (i.e., sparsity induced by a chordal graph) in order to improve the scalability of a semidefinite program.
Proposition 1 (see ).
Let be a chordal graph with vertices , edges , and a set of maximal cliques . Then is positive semidefinite if and only if there exists a set of matrices for , such that and .
This key result allows one to equivalently express a large semidefinite constraint as a set of smaller semidefinite constraints (and additional equality constraints). This can lead to a significant increase in computational efficiency since the computational cost of solving an SDP is primarily determined by the size of the largest semidefinite constraint. We note that for a chordal graph, the problem of finding maximal cliques is amenable to efficient (i.e., polynomial time) algorithms (see  and references therein).
Proposition 1 (and a related dual result due to Grone et al. ) has been leveraged by several researchers in optimization to improve the efficiency of SDPs with chordal sparsity [47, 61, 46]. These results have also been exploited by researchers in the context of control theory. In , the authors apply these results to the problem of finding Lyapunov functions for sparse linear dynamical systems. In particular, by parameterizing Lyapunov functions with a chordal structure (i.e., a structure that allows one to apply Proposition 1), the resulting SDP finds a Lyapunov function significantly faster (up to a factor of faster in terms of running time for instances where the largest maximal clique size is less than 15).
In [125, 126], the authors extend Proposition 1 to the case of matrices with block-chordal sparsity by demonstrating that such matrices can be decomposed into an equivalent set of smaller block-positive semidefinite matrices (again, with additional equality constraints). This result allows for applications to networked systems since it reasons about the sparsity of interconnections between subsystems. The authors apply these results to the problem of designing structured feedback controllers to stabilize large-scale networked systems.
In the work highlighted above, Proposition 1 is used to decompose a large semidefinite constraint into smaller ones (with additional equality constraints). The resulting SDP can then be solved using standard techniques (e.g., standard interior point methods). One challenge with this approach (as noted in ) is that the additional equality constraints can sometimes nullify the computational benefits of dealing with smaller semidefinite constraints. A set of strategies that seek to overcome this issue involve exploiting chordal sparsity directly in the solver, e.g., directly in an interior point method [47, 28, 9] or in a first-order method [104, 60, 71, 105, 124]. In the context of control theory,  leverages sparse solvers [9, 47, 16] based on this idea to tackle robust stability analysis for large-scale sparsely interconnected systems.
2.1.1 Exploiting sparsity in polynomial optimization problems
The existence of structure in the form of sparsity can also be exploited for polynomial optimization problems (cf. Section 1.3). In , the authors propose the Sparse-BSOS hierarchy (which is based on the Bounded-SOS (BSOS) hierarchy for polynomial optimization problems presented in ). The Sparse-BSOS hierarchy leverages the observation that for many polynomial optimization problems that arise in practice, the polynomials that define the constraints of the problem exhibit sparsity in their coefficients. This observation is used to split the variables in the problem into smaller blocks of variables such that (i) each monomial of the polynomial objective function only consists of one of the blocks, and (ii) each polynomial that defines the constraints also only depends on variables in only one of the blocks. Under the assumption that the sparsity in the objective and constraints satisfy the Running Intersection Property (RIP) , the Sparse-BSOS approach provides a hierarchy of SDPs whose optimal values converge to the global optimum of the original polynomial optimization problem. In addition, for the so-called class of “SOS-convex polynomial optimization problems” (see  for a definition), the hierarchy converges at the very first step. In contrast to the (standard) BSOS hierarchy (which also provides these guarantees), Sparse-BSOS results in semidefinite constraints of smaller block size. Numerical experiments in  demonstrate that exploiting sparsity in the problem can result in the ability to solve large-scale polynomial optimization problems that are beyond the reach of the standard (i.e., non-sparse) BSOS approach.
refers to the problem of simultaneously estimating the location (or trajectory of locations) of a robot and a model of the robot’s environment. Two important versions of the SLAM problem are (i) Landmark SLAM, and (ii) Pose-graph SLAM. The Landmark SLAM problem involves simultaneously estimating the position and orientation of the robot and the location of landmarks in the environment. The Pose-graph SLAM problem is a more restricted version that only requires estimating the position and orientation of the robot at each time-step. Traditionally, these problems have been posed as maximum likelihood estimation (MLE) problems and tackled using general nonlinear optimization methods (which do not come with guarantees on finding globally optimal solutions). In, the authors formulate the planar pose-graph and landmark SLAM problems as polynomial optimization problems and demonstrates that they are SOS-convex. This allows  to apply the Sparse-BSOS hierarchy with the guarantee that the hierarchy converges to the global optimum at the very first step. The approach in  contrasts with most prior work on pose-graph/landmark SLAM, which apply general-purpose nonlinear optimization algorithms to these problems. The numerical results in  compare the Sparse-BSOS approach to SLAM with nonlinear optimization techniques based on the Levenberg-Marquardt algorithm and demonstrate that the latter often get stuck in local minima while the former finds the globally optimal solution. We also note that the approach in  contrasts with the SE-Sync approach  we mention in Section 3.1 since  relaxes the requirement of limited measurement noise imposed by .
2.2 Exploiting symmetry
Beyond sparsity, another general form of structure that arises in applications of semidefinite programming to control theory, robotics, and machine learning is symmetry. Mathematically, symmetry refers to a transformation of the variables in a problem that does not change the underlying problem. As a concrete example, consider a multi-agent robotic system composed of a large number of identical agents. The agents in the system can be interchanged while leaving the overall capabilities of the multi-agent system invariant.
Symmetry reduction [49, 112, 36, 86] is a powerful set of approaches for exploiting symmetries in SDPs. We provide a brief overview of symmetry reduction here and refer the reader to  for a thorough exposition. Let denote the cone of symmetric positive semidefinite matrices. Given an SDP of the form (1), symmetry reduction performs the following two steps:
Find an appropriate affine subspace containing solutions to the problem;
as a linear transformation of a “simpler” cone, where :
Once steps 1 and 2 above have been accomplished, one can solve the following problem instead of the original SDP:
The key advantage here is that Problem (7) is formulated over a lower dimensional space , and can thus be significantly more efficient to solve.
Symmetry reduction techniques have been fruitfully applied in many different applications of semidefinite programming to control theory [56, 33, 11]. As an example,  exploits symmetry in linear dynamical systems to obtain significant reductions in computational effort for the problem of designing controllers (with various control objectives). In particular,  exploits invariances of the state space realizations of linear systems under unitary coordinate transformations to obtain significant computational gains for control synthesis problems. In [11, Chapter 7], symmetry reduction techniques are used to reduce the computational burden of certifying stability of interconnected subsystems. Specifically,  exploits permutation symmetries in a network of dynamical systems (i.e., invariances to exchanging subsystems) in order to reduce the complexity of the resulting SDPs.
2.3 Facial reduction
In certain cases, there may be additional structure beyond sparsity and symmetry that one may leverage in order to reduce the size of SDPs arising in practice. As an example, consider the search for a Lyapunov function that proves stability of the equilibrium of a polynomial dynamical system . In this case, one would like to search for functions that satisfy the following “degeneracy” conditions: . This imposes structure on the resulting optimization problem by restricting the space of possible functions . One powerful set of approaches for leveraging this kind of “degeneracy” structure is referred to as facial reduction [86, 39, 19, 84]. The general procedure behind facial reduction is identical to that of symmetry reduction (Steps 1 and 2 in Section 2.2). The primary difference is that facial reduction techniques identify the subspace in a different way than symmetry reduction techniques. In particular, is found by exploiting a certain kind of geometric degeneracy condition (see  for details).
Facial reduction techniques have been used to improve the scalability of SDPs arising in robotics, control theory, and ML applications [86, 114, 62]. We highlight , which presents methods for automating the process of performing facial (and symmetry) reduction. As an example,  considers the problem of analyzing the stability of a rimless wheel, which is a simple hybrid dynamical walking robot model. The facial reduction techniques presented in  reduce computation times by roughly a factor of for this problem. Similar gains in terms of scalability are also obtained for many other problems.
3 Enhancing Scalability by Producing Low-Rank Solutions
In this section, we focus on methods for producing low-rank feasible solutions to the SDP in (1) at relatively large scales. One may think of the techniques we present here as leveraging a very specific kind of “structure” (i.e., low-rank structure), similar to the approaches considered in Section 2. However, the technical approaches for leveraging low-rank structure are quite different in nature from the approaches presented in Section 2. Moreover, the literature on low-rank methods is quite vast. We thus treat these methods distinctly from the ones in Section 2. At a high level, one utilizes such methods in two different settings, which we describe now.
Case 1: There is a low-rank solution among the optimal solutions to (1). In this setting, we can establish a priori that among the optimal solutions to (1), there must be one of low rank. An important case where this is known to hold is when (1) does not have too many constraints. More specifically (see [14, 83, 66]), if (1) has constraints and an optimal solution, then it also has a rank- optimal solution with
Note that we need for this property to be useful.
In this setting, one may not particularly care about obtaining a low-rank solution to the problem. Rather, searching for a low-rank solution (which is guaranteed to exist) can lead to significant gains in computation time and storage. To see why, recall that a positive semidefinite matrix has rank if and only if it can be written as , where . By searching for a low-rank solution, one can hope to work in the lower-dimensional space corresponding to , rather than the higher dimensional space corresponding to . Doing so would lead to much cheaper algorithms as storage space needs would drop from to and, e.g., the important operation of matrix-vector multiplication would require flops rather than .
Case 2: Good-quality low-rank feasible solutions to (1) are needed. In this setting, one considers SDPs whose optimal solutions may all have high rank. However, the goal is to find low-rank feasible solutions with good objective values. As one is restricted to feasible solutions with low rank, one can again expect to work in a lower-dimensional space than the original space and make gains in terms of computation time and storage as explained previously.
Problems such as these occur in a variety of applications, most notably machine learning and combinatorial optimization. Consider for instance the problem of low-rank matrix completion from machine learning, which is a cornerstone problem in user recommendation systems (see, e.g., the Netflix competition ). In this problem, we have access to a matrix , which is partially complete, i.e., some of its entries are specified but others are unknown. The goal is to fill in the unknown entries. Without assuming some structure on , any completion of would be a valid guess and so the problem is ill-defined. This can be remedied by assuming that should have low rank, which is a natural assumption in the context of user recommender systems. Indeed, if one assumes that the matrix reflects the ratings a user () gives a product (), then completing the matrix would involve infering how a user would have rated a product (s)he has not yet rated from existing ratings, provided by both him/herself and other users. Under the assumption that a consumer base can be segmented into categories (with ), it would then make sense to constrain the matrix to have rank less than or equal , which would reflect this segmentation. The matrix completion problem would then be written as
where is the partially complete matrix corresponding to past data. The indices in the objective function range over known entries of the matrix . Problem (9) can be relaxed to an SDP by replacing the constraint on the rank of by a constraint on its nuclear norm , which is a semidefinite-representable surrogate of the rank. The problem then becomes:
The goal here is to obtain a feasible solution to the problem such that the matrix
has low rank and the objective value is small. This makes matrix completion a typical application of the algorithms we present next. Other machine learning problems which involve relaxing a rank-constrained SDP to an SDP via the use of the nuclear norm include certain formulations of clustering and maximum variance unfolding
, and sparse principal component analysis.
The rest of this section is devoted to two of the most well-known approaches for computing low-rank feasible (and possibly optimal) solutions to the SDP in (1). The first one is the Burer-Monteiro method and its variants (Section 3.1). The second one contains the methods that use the Frank-Wolfe algorithm as their computational backbone (Section 3.2).
3.1 Burer-Monteiro-based methods
The key idea of the Burer-Monteiro algorithm  is to factor the decision variable in (1) as where . Here, can either be chosen such that a solution of rank is guaranteed to exist (e.g., such that (8
) holds) (Case 1 in the previous segment) or such that it corresponds to the rank which we would like our solution to have (Case 2). Typically, the algorithm comes with some theoretical guarantees for the first case and is simply a heuristic in the second. Using this factorization, we rewrite (1) as:
This reformulation has both pros and cons: on the upside, the problem to consider has smaller dimension than (1) and has no conic constraint to contend with; on the downside, the problem has become nonconvex. Hence, local optimization methods may not always recover the global minimum of (1) even if there is one of low rank. Research in this area has consequently sought to find algorithms to solve (11) that (i) provably recover global low-rank solutions to the problem if they exist; (ii) do so reasonably fast in terms of both theoretical and empirical convergence rates; and (iii) work on a large class of inputs and . We review existing algorithms through the lens of these criteria.
3.1.1 Augmented Lagrangian approaches
The original method recommended by Burer and Monteiro  to solve (11) is a method based on an augmented Lagrangian procedure (see Section 4.2 for another appearance of this general approach). The idea is to rewrite (11) as an unconstrained optimization problem by adding terms to the objective that penalize infeasible points. This is done by constructing an augmented Lagrangian function, given here by
If and are fixed appropriately, then minimizing will yield an optimal solution to (11). To obtain an appropriate pair , the following iterative procedure is followed: fix and minimize to obtain , then construct a new from , and repeat; see  for the exact details. Note that this procedure involves the minimization of with respect to : choosing an appropriate algorithm for this subproblem is also an important step in the method. The authors in  suggest using a limited-memory BFGS (LM-BFGS) algorithm  to do so. The LM-BFGS algorithm is a quasi-Newton algorithm for unconstrained optimization problems, where the objective is differentiable. Unlike Newton methods, it does not require computing the Hessian of the objective function, constructing instead an approximate representation of it from past gradients. Given that gradients of can be computed quite quickly under some conditions (see  for more details), this is a reasonably efficient algorithm to use. One can replace the LM-BFGS subroutine with a subgradient approach if (11) has additional constraints that are non-differentiable. Furthermore, the augmented Lagrangian approach has been tailored to directly handle inequality constraints of the type in (1); see . Though there are no theoretical guarantees regarding convergence to global solutions, the authors of  experimentally observe that the algorithm tends to return global minima of (1), doing so in a much smaller amount of time than interior point or bundle methods. The follow-up paper  to  slightly modifies the augmented Lagrangian function and shows convergence of the algorithm to a global minimum of (1) (given that one exists of rank ) if, among other things, each minimization of the modified gives rise to a local minimum, a condition that is hard to test in practice.
3.1.2 Riemannian optimization approaches
Riemannian optimization concerns itself with the optimization of a smooth objective function over a smooth Riemannian manifold. We remind the reader that a Riemannian manifold is a manifold that can be linearized at each point by a tangent space , which is equipped with a Riemannian metric. We say that a Riemannian manifold is smooth if the Riemannian metric varies smoothly as the point varies; see e.g. [22, Appendix A] for a short introduction to such concepts. In the Riemannian setting, one can define notions of both gradients and Hessians for manifolds [22, Appendix A]. Using these notions, one can extend the concepts of first-order and second-order necessary conditions of optimality to optimization over manifolds. These are almost identical to the ones for unconstrained optimization over , except that they involve the analogs of gradient and Hessians for manifolds. Similarly, one can generalize many classical algorithms for nonlinear unconstrained optimization (such as, e.g., trust-region and gradient-descent methods) to the manifold setting. We refer the reader to [1, 2] for more information on the specifics of these algorithms which we will not cover here.
is a smooth Riemannian manifold under certain conditions on (see [24, Assumption 1]), and the objective function is smooth. Hence, (11) is exactly a Riemannian optimization problem. It can be shown that if , the conditions on that make it smooth hold, and is compact, then for almost all , any second-order critical point of (11) is globally optimal to (1) [23, Theorem 2]. (Note that an optimal solution of rank is guaranteed to exist here from the assumptions and that we use the manifold definition of a second-order critical point.) If an algorithm which returns second-order critical points can be exhibited, then, in effect, we will have found a way of solving (11). It turns out that Riemannian trust-region methods return such a point—under some conditions—regardless of initialization . In fact, it is shown in  that an -accurate second-order critical point (i.e., a point with and ) can be obtained in iterations. Other Riemannian methods which come with some theoretical guarantees are the Riemannian steepest-descent algorithm and the Riemannian conjugate gradient algorithm, both of them being known to converge to a critical point of (11) [21, 98]. In practice however, many more algorithms for nonlinear optimization can be and have been generalized to manifold settings (though their convergence properties have not necessarily been studied) and are used to solve problems such as (11) (see, e.g., the list of solvers implemented in Manopt , one of the main toolboxes for optimization over manifolds).
We conclude this section by mentioning a few applications where Riemannian optimization methods have recently been used. These include machine learning problems such as synchronization , phase recovery , dictionary learning , and low-rank matrix completion ; but also robotics problems such as the simultaneous localization and mapping (SLAM) problem (cf. Section 2.1.1). In [32, 31, 97], the authors propose semidefinite programming relaxations for the maximum likelihood estimation (MLE) problem arising from SLAM. These relaxations provide exact solutions to the MLE problem assuming that noise in the measurements made by the robot are below a certain threshold (see  for exact conditions). In , the authors tackle the challenge of scalability by demonstrating that the resulting SDP admits low-rank solutions and using a Burer Monteiro factorization combined with Riemannian optimization techniques (e.g., a Riemannian trust-region approach) to solve the SDP. Thus, assuming that the conditions on the measurement noise are satisfied, one can find globally optimal solutions to the MLE problems arising from Pose-graph SLAM (and other related MLE problems arising from camera motion estimation and sensor network localization problems) via this method.
3.1.3 Coordinate descent approaches
One of the more recent trends for solving problems of the form (11) has been to use coordinate descent (or block-coordinate descent) methods [42, 115]. They have been proposed to solve diagonally-constrained problems, i.e., problems of the type
where is the 2-norm and is the row of . This subset of SDPs cover machine learning applications such as certain formulations of graphical model inference and community detection . Coordinate-descent methods are easy to explain for this case. Let us assume that for for clarity of exposition. Initialization consists in randomly choosing on the sphere. Then, at each iteration, an index is picked and fixed and the goal is to find a feasible vector such that is minimized. We update the objective by replacing the previous by this new and proceed to the next iteration. This algorithm can be slightly modified to consider descent on blocks of coordinates, rather than one single coordinate at each time-step . In that setting, it can be shown that if the indices are picked appropriately, then coordinate block-descent methods converge to an -accurate critical point of (12) in time . Each iteration of coordinate descent methods is much less costly (by an order of approximately) than its Riemannian trust-region counterpart however. As a consequence, numerical results tend to show that as increases, using coordinate descent methods can be preferable .
3.2 Frank-Wolfe based methods
We start this subsection with a quick presentation of the Frank-Wolfe algorithm (also known as the conditional gradient algorithm), a version of which will be used in the approaches we present. This is an algorithm for constrained optimization problems of the type , where is a compact and convex set and is a convex and differentiable function. At iteration , a linear approximation of (obtained using the first-order Taylor approximation of around the current iterate ) is minimized over and a minimizer is obtained. The next iterate is then constructed by taking a convex combination of the previous iterate and the minimizer before repeating the process. Intuitively, the reason why Frank-Wolfe algorithms are popular for obtaining low-rank feasible solutions for SDPs is due to the following fact: if the algorithm is initialized with a rank-1 matrix and if each minimizer is also a rank-1 matrix, then the iterate is of rank at most . This enables us to control the rank of the solution that is given as output. We first present an algorithm developed by Hazan  in Section 3.2.1 and follow up with some extensions of Hazan’s algorithm in Section 3.2.2.
3.2.1 Hazan’s algorithm
Hazan’s algorithm is designed to produce low-rank solutions to SDPs of the type
The additional constraint is required by the algorithm, but can be slightly relaxed to the constraint , where is fixed [48, Chapter 5]. We consider the version with the constraint here. A key subroutine of Hazan’s algorithm is solving the following optimization problem:
where is a convex function. This is of interest in itself as some problems can be cast in the form (14), an example being given in . Let be the optimal value of this problem. It is for this subroutine that a Frank-Wolfe type iterative algorithm is used. It can be described as follows: initalize to be a rank- matrix with trace one. Then, at iteration , compute the eigenvector corresponding to the maximum eigenvalue of written in matrix form. Finally, for a step size , set and iterate. This algorithm returns a -approximate solution to (14) (i.e., a matrix such that and ) with rank at most in iterations.
In order to see how this subroutine can be used to solve (13), first note that one can reduce (13) to a sequence of SDP feasibility problems via the use of bisection. It is enough as a consequence to explain how one can leverage the Frank-Wolfe type algorithm described above to solve an SDP feasibility problem of the type:
This can be done by letting where , being the desired accuracy; see . The algorithm is then guaranteed to return a -approximate solution of rank at most in iterations.
It is interesting to note that this latter algorithm is nearly identical to the multiplicative weights update algorithm for SDPs which also produces low-rank solutions  and has exactly the same guarantees. However it is derived in a completely different fashion.
3.2.2 Other methods based on Frank-Wolfe
A caveat of Hazan’s algorithm is that the rank of the solution returned is linked to its accuracy. One may think that, if the solution is known to be low-rank and the Frank-Wolfe algorithm is initialized with a low-rank matrix, then all iterates of the algorithm produce a low-rank matrix. Unfortunately this is not always the case. In fact, what can be observed in practice is that the Frank-Wolfe algorithm typically returns iterates with increasing rank up to a certain point, where the rank decreases again, with the final solution returned being low-rank; see, e.g., the numerical experiments in . As a consequence, if one requires an accurate solution, one will likely have to deal with high-rank intermediate iterates, with the storage and computational problems that this entails.
In this section, we present two different methods designed to avoid this issue for optimization problems of the type:
where is a convex and differentiable function, denotes the nuclear norm, and is a linear operator that maps to . The relaxation of the low-rank matrix completion problem given in (10) e.g. fits into this format. Similarly to (14), the Frank-Wolfe algorithm can be applied to (15) quite readily. Initialization is done as before by setting to be some rank-1 matrix. At iteration , one computes an update direction by finding a pair
of singular vectors corresponding to the maximum singular value ofwhere is the adjoint operator to given by . We then construct a new iterate and repeat.
The first method, described in , is a modification of the Frank-Wolfe algorithm as described in the previous paragraph. The key idea is that at every iteration, the algorithm chooses between two types of steps. One type is the step described above, which is derived from the singular vectors of the matrix , (a “regular step” in the paper); the other is what is called an “in-face step”. One way to take such a step is to minimize over in the minimal face of the nuclear norm unit ball to which belongs. The matrix obtained from this minimization is then used to produce the next iterate: . We refer the reader to  to see how this can be done efficiently. The advantage of such a step is that the matrix constructed has rank no larger than . If the choice between a regular step and an in-face step is appropriately made, it can still be the case that one has a -approximate solution after iterations, though the rank of the iterate could be much less than depending on the number of “in face” steps used.
The second method given in  describes an alternative way of dealing with possibly high rank of the intermediate iterates produced by the Frank-Wolfe algorithm applied to (15). To get around this problem, the authors devise a “dual” formulation of the Frank-Wolfe algorithm described above where the primal iterates are replaced by “dual” iterates , which are of smaller size. There is a need however to keep track of the iterates in some sense as the ultimate goal is to recover a solution to (15) at the end of the algorithm. To do this, the authors suggest a procedure based on sketching. The procedure is as follows: two random matrices and are generated such that and are of order of the rank of the solution to (15), and two new matrices and are obtained as and . Note that the matrices typically have smaller rank than the iterates that they represent. Any update that is made in the algorithm is then reflected on rather than on the matrix . At the end of the algorithm, one can reconstruct a rank solution to the problem from the updated pair ; see  for details. Note that with this way of proceeding, one need never store or utilize high-rank intermediate iterates.
4 Scalability via Admm and Augmented Lagrangian Methods
An attractive alternative to using interior-point methods for solving SDPs is to use first-order methods. Compared to interior-point methods, first-order methods can scale to significantly larger problem sizes, while trading off the accuracy of the resulting solution (more precisely, first-order methods can require more time to achieve similar accuracy to interior-point methods). First-order methods are thus particularly attractive for applications that demand scalability, but do not require extremely accurate solutions (e.g., some machine learning applications highlighted below). Here, we describe two recent approaches that involve first-order methods: (i) an approach based on the Alternating Direction Method of Multipliers (ADMM) , and (ii) an approach based on the augmented Lagrangian method [122, 119].
4.1 Solving SDPs using ADMM
In , the authors present a first-order method for solving general cone programs, with SDPs as a special case. Here, we first describe the basic idea behind ADMM and then describe how this is applied to SDPs. We refer the reader to  for a thorough exposition to ADMM.
Consider a convex optimization problem of the form:
Here, the convex functions and can be nonsmooth and take on infinite values (e.g., to encode additional constraints). ADMM is a method for solving problems of the form (16) (more generally, ADMM can solve problems where and have an affine relationship). ADMM operates by iteratively updating solutions , , and a dual variable (associated with the constraint ) via the following steps:
where is a parameter. The initializations and are usually taken to be 0, but can be arbitrary. Under mild technical conditions (see ), the steps above converge. In particular, the cost converges to the optimal value of Problem (16), the residual converges to zero, and converges to an optimal dual variable.
In , the general ADMM procedure above is applied to SDPs. A key idea is to apply ADMM to the homogeneous self-dual embedding of the primal SDP (1) and its dual (2). The homogeneous self-dual embedding converts the primal-dual pair of optimization problems into a single convex feasibility problem. This is done by embedding the Karush-Kuhn-Tucker (KKT) optimality conditions associated with the primal and dual SDPs into a single system of equations and semidefinite constraints. If the original primal-dual pair has a feasible solution, it can be recovered from a solution to the embedding. If on the other hand, the original primal-dual pair is not solvable, the embedding allows one to produce a certificate of infeasibility. The homogeneous self-dual embedding is very commonly used in interior-point methods [99, 58]. We refer the reader to  for more details on this technique.
To summarize, the approach presented in  consists of the following three steps:
Form the homogeneous self-dual embedding associated with the primal-dual SDP pair;
Express this embedding in the form (16) required by ADMM;
Apply the ADMM steps to this problem.
In , the authors also present techniques for efficiently implementing Step 3. These include techniques for (i) efficiently performing the projections required to implement the ADMM steps, (ii) scaling/preconditioning the problem data in order to improve convergence, and (iii) choosing stopping criteria. The resulting approach achieves significant speedups as compared to interior-point methods for a number of large-scale conic optimization problems (with some loss in accuracy of the solution). As an example,  considers the problem of robust principal component analysis  in machine learning. This problem corresponds to recovering the principal components of a data matrix even when a fraction of its entries are arbitrarily corrupted by noise. More specifically, suppose one is given a (large) data matrix that is known to be decomposable as
where is a low-rank matrix and is sparse (but nothing further is known, e.g., the locations or number of nonzero entries of ). Interestingly, this problem can be solved using semidefinite programming under relatively mild conditions (see ). For this problem, the ADMM approach presented in  is able to tackle large-scale instances where standard interior-point solvers (SeDuMi  and SDPT3 ) run out of memory. Moreover, the approach provides significant (order of magnitude) speedups on smaller instances, while resulting in only a small loss in accuracy (less than reconstruction error of the low-rank matrix ).
4.2 Augmented Lagrangian methods
Next, we discuss two closely-related approaches for solving SDPs using augmented Lagrangian methods: SDPNAL , and SDPNAL+ . SDPNAL operates on the dual SDP (2) by defining an augmented Lagrangian function as:
where is a penalty parameter, corresponds to the Frobenius norm, and is the metric projection operator onto the set of symmetric psd matrices. More precisely, is the (unique) optimal solution to the following convex problem:
Of key importance is the fact that the augmented Lagrangian function is continuously differentiable (since is continuously differentiable). The augmented Lagrangian function may be viewed as the (usual) Lagrangian function associated with a “regularized” version of the SDP (2) (see [96, 25] for a more thorough exposition).
In order to implement this method, we thus require the ability to solve the “inner” optimization problems in Steps 1 and 2 above. The objective functions in these inner optimization problems are convex and continuously differentiable, but not twice continuously differentiable (see ). The SDPNAL method presented in  works by applying a semismooth Newton method combined with the conjugate gradient method to approximately solve the inner problems. Conditions under which the resulting iterates converge to the optimal solution are established in . In , the authors present SDPNAL+, which builds on the approach presented in . However, for certain kinds of SDPs (degenerate SDPs; see  for a formal definition), SDPNAL can encounter numerical difficulties. In , the authors present SDPNAL+, which tackles this issue by directly working with the primal SDP and using a different technique to solve the inner optimization problems that arise from the augmented Lagrangian method (specifically, a “majorized” semismooth Newton method).
In , the authors compare the performance of SDPNAL+ with two other first order methods on various numerical examples. In particular,  considers SDP relaxations for the fundamental machine learning problem of -means clustering. Specifically, the goal here is to partition a given set of data points into clusters such that the total sum-of-squared Euclidean distances from each data point to its assigned cluster center is minimized. In , the authors present semidefinite relaxations for finding approximate solutions to this NP-hard problem. The numerical experiments presented in  demonstrate that SDPNAL+ is able to solve the resulting SDPs with higher accuracy as compared to the other first-order methods while also resulting in faster running times (e.g., order of magnitude gains in running time on large instances of the clustering problem).
5 Trading Off Scalability With Conservatism
In this section, we discuss approaches that afford gains in scalability, but are potentially conservative. We use the word “conservative” to refer to solutions that may be sub-optimal, but satisfy all the constraints of the problem (in contrast to approaches that may produce approximate solutions to an SDP that violate some constraints). In particular, here we discuss approaches that provide the user with a tuning parameter for trading off
scalability with conservatism. Conceptually, we classify such approaches into two: (i) “special-purpose” approaches that leverage domain knowledge associated with the target application, and (ii) “general-purpose” approaches that apply broadly across application domains.
As an example of an approach that falls under the first category, we highlight recent work on neural network verification
using semidefinite programming. Over the last few years, a large body of work has explored the susceptibility of neural networks to adversarial inputs (or more generally, uncertainty in the inputs)[107, 80, 123, 76]. For example, in the context of image classification, neural networks can be made to switch their classification labels by adding very small perturbations (imperceptible to humans) to the input image. Motivated by the potentially serious consequences of such fragilities in safety-critical applications (e.g., autonomous cars), there has been a recent explosion of work on verifying the robustness of neural networks to perturbations to the input [109, 117, 67, 90, 118] (see  for a more comprehensive literature review). Recently, approaches based on semidefinite programming have been proposed to tackle this challenge [93, 92, 43]
. The challenge of scalability for this application is particularly pronounced since the number of decision variables in the resulting SDPs grows with the number of neurons in the network. In, the authors propose an approach for verifying neural networks by capturing their input-output relationships using quadratic constraints. This allows one to certify
that for a given set of inputs (e.g., a set of images obtained by making small perturbations to a training image), the output is contained within a specified set (e.g., outputs that assign the same label). The authors consider neural networks with different types of activation functions (e.g., ReLU, sigmoid, etc.) and propose different sets of quadratic constraints for each one. Importantly, by including or excluding different kinds of quadratic constraints, the approach allows one totrade off scalability with conservatism. Moreover, by exploiting the modular structure of neural networks, the approach scales to neural networks with roughly five thousand neurons. The scalability afforded by the modular approach, however, comes at the cost of conservatism.
Next, we highlight approaches that fall under the second category mentioned above, i.e., “general-purpose” approaches for trading off scalability with conservatism.
5.1 DSOS and SDSOS optimization
In , the authors introduce “DSOS and SDSOS optimization” as linear programming and second-order cone programming-based alternatives to sum of squares and semidefinite optimization that allow for a trade off between computation time and solution quality. The following definitions are central to their framework.
Definition 4 (dd and sdd matrices).
A symmetric matrix is diagonally dominant (dd) if
for all . A symmetric matrix is scaled diagonally dominant (sdd) if there exists a diagonal matrix , with positive diagonal entries, such that is dd.
Definition 5 (dsos and sdsos polynomials).
A polynomial is said to be diagonally-dominant-sum-of-squares (dsos) if it can be written as , for some scalars and some polynomials that each involve at most two monomials with a coefficient of . We say that a polynomial is scaled-diagonally-dominant-sum-of-squares (sdsos) if it can be written as for some polynomials that involve at most two monomials with an arbitrary coefficient.
The following implications readily follow: See Figure 1(a) for a comparison of the three notions on a parametric family of bivariate quartic polynomials. Similarly, in view of Gershgorin’s circle theorem , the implications are straightforward to establish. In , the authors connect the above definitions via the following statement.
A polynomial of degree is dsos (resp. sdsos) if and only if it admits a representation as , where is the standard monomial vector of degree and is a dd (resp. sdd) matrix.
By combining this theorem with some linear algebraic observations, the authors show in  that optimization of a linear objective function over the intersection of the cone of dsos (resp. sdsos) polynomials of a given degree with an affine subspace can be carried out via linear programming (resp. second-order cone programming). These are two mature classes of convex optimization problems that can be solved significantly faster than semidefinite programs. The linear and second-order cone programs that arise from dsos/sdsos constraints on polynomials are termed “DSOS and SDSOS optimization problems”. They are used to produce feasible, but possibly sub-optimal, solutions to sum of squares optimization problems quickly. In the special case where the polynomials involved have degree two, DSOS and SDSOS optimization problems can be used to produce approximate solutions to semidefinite programs. We also remark that in applications where sum of squares programming is used as a relaxation, i.e. to provide an outer approximation of a (typically intractable) feasible set, then this approach replaces the semidefinite constraint with a membership constraint in the dual of the cone of dsos or sdsos polynomials. See, e.g., [5, Sect. 3.4] for more details.
Impact on applications. A key practical benefit of the DSOS/SDSOS approach is that it can be used as a “plug-in” in any application area of SOS programming. The software package (cf. Section 6) that accompanies reference  facilitates this procedure. In [8, Sect. 4], it is shown via numerical experiments from diverse application areas—polynomial optimization, statistics and machine learning, derivative pricing, and control theory—that with reasonable tradeoffs in accuracy, one can achieve noticeable speedups and handle problems at scales that are currently beyond the reach of traditional sum of squares approaches.111Comparisons are made in  with SDP solvers such as MOSEK, SeDuMi, and SDPNAL+. For the problem of sparse principal component analysis in machine learning for example, the experiments in  show that the SDSOS approach is over 1000 times faster than the standard SDP approach on input instances while sacrificing only about 2 to 3% in optimality; see [8, Sect. 4.5]. As another example, Figure 1(b) illustrates a humanoid robot with 30 state variables and 14 control inputs that SDSOS optimization is able to stabilize on one foot 
. A nonlinear control problem of this scale is beyond the reach of standard solvers for sum of squares programs at the moment. In a different paper, the authors show the potential of DSOS and SDSOS optimization for real-time applications. More specifically, they use these techniques to compute, every few milliseconds, certificates of collision avoidance for a simple model of an unmanned vehicle that navigates through a cluttered environment.
Some guarantees of the DSOS/SDSOS approach. On the theoretical front, DSOS and SDSOS optimization enjoy some of the same guarantees as those enjoyed by SOS optimization. For example, classical theorems in algebraic geometry can be utilized to conclude that any even positive definite homogeneous polynomial is the ratio of two dsos polynomials [8, Sect. 3.2]. From this observation alone, one can design a hierarchy of linear programming relaxations that can solve any copositive program  to arbitrary accuracy [8, Sect. 4.2]. This idea can be extended to achieve the same result for any polynomial optimization problem with a compact feasible set; see [6, Sect. 4.2].
5.2 Adaptive improvements to DSOS and SDSOS optimization
While DSOS and SDSOS techniques result in significant gains in terms of solving times and scalability, they inevitably lead to some loss in solution accuracy when compared to the SOS approach. In this subsection, we briefly outline two possible strategies to mitigate this loss. These strategies solve a sequence of adaptive linear programs (LPs) or second-order cone programs (SOCPs) that inner approximate the feasible set of a sum of squares program in a direction of interest. For brevity of exposition, we explain how the strategies can be applied to approximate the generic semidefinite program given in (1). A treatment tailored to the case of sum of squares programs can be found in the references we provide.
5.2.1 Iterative change of basis
In , the authors build on the notions of diagonal and scaled diagonal dominance to provide a sequence of improving inner approximations to the cone of psd matrices in the direction of the objective function of an SDP at hand. The idea is simple: define a family of cones222One can think of as the set of matrices that are dd after a change of coordinates via the matrix .
parametrized by an matrix . Optimizing over the set is an LP since is fixed, and the defining constraints are linear in the coefficients of the two unknown matrices and . Furthermore, the matrices in are all psd; i.e., .
The proposal in  is to solve a sequence of LPs, indexed by , by replacing the condition by :
The sequence of matrices is defined as follows
where is an optimal solution to the LP in (19) and chol(.) denotes the Cholesky decomposition of a matrix (this can also be replaced with the matrix square root operation).
Note that the first LP in the sequence optimizes over the set of diagonally dominant matrices. By defining as a Cholesky factor of , improvement of the optimal value is guaranteed in each iteration. Indeed, as
, and the identity matrixis diagonally dominant, we see that and hence is feasible for iteration . This implies that the optimal value at iteration is at least as good as the optimal value at the previous iteration; i.e., (in fact, the inequality is strict under mild assumptions; see ).
In an analogous fashion, one can construct a sequence of SOCPs that inner approximate with increasing quality. This time, the authors define a family of cones
parameterized again by an matrix . For any , optimizing over the set is an SOCP and we have . This leads us to the following iterative SOCP sequence:
Assuming existence of an optimal solution at each iteration, we can define the sequence iteratively in the same way as was done in (20). Using similar reasoning, we have . In practice, the sequence of upper bounds approaches faster to the SDP optimal value than the sequence of the LP upper bounds .
Figure 2 shows the improvement (in every direction) obtained just by a single iteration of this approach. The outer set in green in both subfigures is the feasible set of a randomly generated semidefinite program. The sets in black are the diagonally dominant (left) and the scaled diagonally dominant (right) inner approximations. What is shown in dark blue in both cases is the boundary of the improved inner approximation after one iteration. Note that the SOCP in particular fills up almost the entire spectrahedron in a single iteration.
5.2.2 Column generation
In , the authors design another iterative method for inner approximating the set of psd matrices using linear and second order cone programming. Their approach combines DSOS/SDSOS techniques with ideas from the theory of column generation in large-scale linear and integer programming. The high-level idea is to approximate the SDP in (1) by a sequence of LPs (parameterized by ):
where are fixed psd matrices. These matrices are initialized to be the extreme rays of dd matrices which turn out to be all rank one matrices , where the vector has at most two nonzero components, each equal to . Once this initial LP is solved, one adds one (or sometimes several) new psd matrices to problem (22) and resolves. This process then continues. In each step, the new matrices are picked carefully to bring the optimal value of the LP closer to that of the SDP. Usually, the construction of involves solving a “pricing subproblem” (in the terminology of the column generation literature), which adds appropriate cutting planes to the dual of (22); see  for more details.
The SOCP analog of this process is similar. The SDP in (1) is inner approximated by a sequence of SOCPs (parameterized by ):
where are fixed matrices. They are initialized as the set of matrices that have zeros everywhere, except for a 1 in the first column in position and a 1 in the second column in position . This gives exactly the set of sdd matrices [8, Lemma 3.8]. In subsequent steps, one (or sometimes several) appropriately-chosen matrices are added to problem (23). These matrices are obtained by solving pricing subproblems and help bring the optimal value of the SOCP closer to that of the SDP in each iteration.
Figure 3 shows the improvement obtained by five iterations of this process on a randomly generated SDP, where the objective is to maximize a linear function in the northeast direction.
6 Software: Solvers and Modeling Languages
In this section, we list software tools that allow a user to specify and solve SDPs. This list is not meant to be exhaustive; rather, the goal is to provide the interested practitioner with a starting point in terms of software for implementing some of the approaches reviewed in this paper.
Software for specifying SDPs:
YALMIP , CVXPY , PICOS , SPOT , JuMP .
These software packages allow one to easily specify SDPs and provide interfaces to various solvers listed above. YALMIP and SPOT are MATLAB packages, CVXPY and PICOS are Python packages, and JuMP is a Julia package.
Software for exploiting chordal sparsity in SDPs (cf. Section 2.1):
SparseCoLO , CDCS .
SparseCoLO is a MATLAB toolbox for converting SDPs with chordal sparsity to equivalent SDPs with smaller-sized SDP constraints. CDCS is a solver that exploits chordal sparsity and provides a MATLAB interface.
Parsers for SOS programs (cf. Section 1.3):
YALMIP , SPOT , SOSTOOLS , SumofSquares.jl .
These packages allow one to specify SOS programs and convert them to a form that can be solved using SDP solvers. YALMIP, SPOT, and SOSTOOLS are MATLAB packages, while SumofSquares.jl is a Julia package.
Semidefinite programming is an exciting and active area of research with a vast number of applications in fields including machine learning, control theory, and robotics. Historically, the lack of scalability of semidefinite programming has been a major impediment to fulfilling the potential that it brings to these application domains. In this paper, we have reviewed recent approaches for addressing this challenge including (i) techniques that exploit structure such as sparsity and symmetry in a problem, (ii) approaches that produce low-rank approximate solutions to SDPs, (iii) methods for solving SDPs via augmented Lagrangian and ADMM techniques, and (iv) approaches that trade off scalability with conservatism, including techniques for approximating SDPs with LPs or SOCPs. We have also provided a list of software packages associated with these approaches. Our hope is that this paper will serve as an entry-point for both practitioners working in application domains that demand scalability, and researchers seeking to contribute to theory and algorithms for scalable semidefinite programming.
Exciting and significant work remains to be done on building upon the advancements reviewed here. Semidefinite programming is still far from being a mature technology similar to linear or quadratic programming. Potential directions for future work include: (i) better theoretical understanding of the convergence properties of algorithms for low-rank SDPs (cf. Section 3), (ii) identifying structure beyond chordal sparsity, symmetry, and degeneracy (cf. Section 2) that arise in practice and methods for exploiting such structure, (iii) exploring different first-order methods for solving SDPs and understanding their relative merits (cf. Section 4), (iv) understanding the power of LPs and SOCPs for approximating SDPs in an adaptive (as opposed to one-shot) fashion (cf. Section 5), (v) finding ways to combine the different approaches for improving scalability reviewed in this paper, and (vi) further developing mature software that allows practitioners to deploy semidefinite programming technology on future applications including ones that involve solving SDPs in real-time.
The authors are not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.
This work is partially supported by the DARPA Young Faculty Award, the CAREER Award of the NSF, the Google Faculty Award, the Innovation Award of the School of Engineering and Applied Sciences at Princeton University, the Sloan Fellowship, the MURI Award of the AFOSR, the Office of Naval Research [Award Number: N00014-18-1-2873], the NSF CRII award [IIS-1755038], and the Amazon Research Award.
-  (2007) Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics 7 (3), pp. 303–330. Cited by: §3.1.2.
-  (2009) Optimization Algorithms on Matrix Manifolds. Princeton University Press. Cited by: §3.1.2.
-  (1988) Positive semidefinite matrices with a given sparsity pattern. Linear Algebra and its Applications 107, pp. 101–149. Cited by: Proposition 1.
-  (2017) Optimization over structured subsets of positive semidefinite matrices via column generation. Discrete Optimization 24, pp. 129–151. Cited by: Figure 3, §5.2.2.
-  (2017) Sum of squares basis pursuit with linear and second order cone programming. Contemporary Mathematics, pp. 27–53. Note: Available at https://arxiv.org/pdf/1510.01597.pdf Cited by: Figure 2, §5.1, §5.2.1, §5.2.1, §5.2.1.
-  (2019) On the construction of converging hierarchies for polynomial optimization based on certificates of global positivity. Mathematics of Operations Research. Note: Available at https://arxiv.org/pdf/1709.09307.pdf Cited by: §5.1.
-  (2016) Some applications of polynomial optimization in operations research and real-time decision making. Optimization Letters 10 (4), pp. 709–729. Cited by: §5.1.
-  (2019) DSOS and SDSOS optimization: more tractable alternatives to sum of squares and semidefinite optimization. SIAM Journal on Applied Algebraic Geometry 3 (193). Cited by: §1.3, §5.1, §5.1, §5.1, §5.1, §5.1, §5.2.2, 8th item, footnote 1.
-  (2010) Implementation of nonsymmetric interior-point methods for linear optimization over sparse matrix cones. Mathematical Programming Computation 2 (3-4), pp. 167–201. Cited by: §2.1.
-  (2014) Robust stability analysis of sparsely interconnected uncertain systems. IEEE Transactions on Automatic Control 59 (8), pp. 2151–2156. Cited by: §2.1.
-  (2016) Networks of Dissipative Systems: Compositional Certification of Stability, Performance, and Safety. Springer. Cited by: §2.2.
-  (2005) Fast algorithms for approximate semidefinite programming using the multiplicative weights update method. In Proceedings of the 46th IEEE Symposium on Foundations of Computer Science, pp. 339–348. Cited by: §1.2, §3.2.1.
-  (1975) Cones of diagonally dominant matrices. Pacific Journal of Mathematics 57 (1), pp. 15–32. Cited by: §5.2.2.
-  (1995) Problems of distance geometry and convex properties of quadratic maps. Discrete & Computational Geometry 13 (2), pp. 189–202. Cited by: §3.
-  (2007) The Netflix prize. In Proceedings of KDD cup and workshop, Vol. 2007, pp. 35. Cited by: §3.
-  (2006) DSDP5 user guide-software for semidefinite programming.. Technical report Argonne National Lab, Argonne. Cited by: §2.1.
-  (2013) An accelerated first-order method for solving SOS relaxations of unconstrained polynomial optimization problems. Optimization Methods and Software 28 (3), pp. 424–441. Cited by: §1.2.
-  (2013) Semidefinite Optimization and Convex Algebraic Geometry. SIAM Series on Optimization. Cited by: §1.3.
-  (1981) Regularizing the abstract convex program. Journal of Mathematical Analysis and Applications 83 (2), pp. 495–530. Cited by: §2.3.
-  (2014) Manopt, a MATLAB toolbox for optimization on manifolds. Journal of Machine Learning Research 15, pp. 1455–1459. External Links: Cited by: §3.1.2, 4th item.
-  (2018) Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis 39 (1), pp. 1–33. Cited by: §3.1.2.
-  (2011) RTRMC: a Riemannian trust-region method for low-rank matrix completion. In Advances in Neural Information Processing Systems, pp. 406–414. Cited by: §3.1.2, §3.1.2.
-  (2016) The non-convex Burer-Monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pp. 2757–2765. Cited by: §3.1.2.
-  (2018) Deterministic guarantees for Burer-Monteiro factorizations of smooth semidefinite programs. arXiv preprint arXiv:1804.02008. Cited by: §3.1.2.
-  (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine learning 3 (1), pp. 1–122. Cited by: §4.1, §4.1, §4.2.
-  (2003) A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization. Mathematical Programming 95 (2), pp. 329–357. Cited by: §3.1.1, §3.1.
-  (2005) Local minima and convergence in low-rank semidefinite programming. Mathematical Programming 103 (3), pp. 427–444. Cited by: §3.1.1.
-  (2003) Semidefinite programming in the space of partial positive semidefinite matrices. SIAM Journal on Optimization 14 (1), pp. 139–172. Cited by: §2.1.
-  (2011) Robust principal component analysis?. Journal of the ACM (JACM) 58 (3), pp. 11. Cited by: §4.1.
-  (2010) Matrix completion with noise. Proceedings of the IEEE 98 (6), pp. 925–936. Cited by: §3.
-  (2016) Planar pose graph optimization: duality, optimal solutions, and verification. IEEE Transactions on Robotics 32 (3), pp. 545–565. Cited by: §3.1.2.
-  (2015) Lagrangian duality in 3D SLAM: verification techniques and optimal solutions. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 125–132. Cited by: §3.1.2.
-  (2008) Structured semidefinite programs for the control of symmetric systems. Automatica 44 (5), pp. 1411–1417. Cited by: §2.2.
-  (2007) A direct formulation for sparse PCA using semidefinite programming. SIAM Review 49 (3), pp. 434–448. Cited by: §1, §3.
-  (2011) Projection methods for conic feasibility problems: applications to polynomial sum-of-squares decompositions. Optimization Methods & Software 26 (1), pp. 23–46. Cited by: §1.2.
-  (2010) Exploiting special structure in semidefinite programming: a survey of theory and applications. European Journal of Operational Research 201 (1), pp. 1–10. Cited by: §2.2.
-  (2016) CVXPY: a Python-embedded modeling language for convex optimization. Journal of Machine Learning Research 17 (83), pp. 1–5. Cited by: 1st item.
-  (2019) An optimal-storage approach to semidefinite programming using approximate complementarity. arXiv preprint arXiv:1902.03373. Cited by: §1.2.
-  (2017) The many faces of degeneracy in conic optimization. Foundations and Trends® in Optimization 3 (2), pp. 77–170. Cited by: §2.3.
JuMP: a modeling language for mathematical optimization. SIAM Review 59 (2), pp. 295–320. Cited by: 1st item.
-  (2010) Copositive programming–a survey. In Recent Advances in Optimization and its Applications in Engineering, pp. 3–20. Cited by: §5.1.
-  (2018) Convergence rate of block-coordinate maximization Burer-Monteiro method for solving large SDPs. arXiv preprint arXiv:1807.04428. Cited by: §3.1.3.
-  (2019) Safety verification and robustness analysis of neural networks via quadratic constraints and semidefinite programming. arXiv preprint arXiv:1903.01287. Cited by: §5.
-  (1956) An algorithm for quadratic programming. Naval research logistics quarterly 3 (1-2), pp. 95–110. Cited by: §3.2.
-  (2017) An extended Frank–Wolfe method with “in-face” directions, and its application to low-rank matrix completion. SIAM Journal on Optimization 27 (1), pp. 319–346. Cited by: §3.2.2.
-  (2009) User’s manual for SparseCoLO: conversion methods for sparse conic-form linear optimization problems. Research Report B-453, Dept. of Math. and Comp. Sci. Japan, Tech. Rep., pp. 152–8552. Cited by: §2.1, 5th item.
-  (2001) Exploiting sparsity in semidefinite programming via matrix completion I: general framework. SIAM Journal on Optimization 11 (3), pp. 647–674. Cited by: §2.1, §2.1.
-  (2012) Approximation algorithms and semidefinite programming. Springer Science & Business Media. Cited by: §3.2.1.
-  (2004) Symmetry groups, semidefinite programs, and sums of squares. Journal of Pure and Applied Algebra 192 (1-3), pp. 95–128. Cited by: §2.2.
-  (1931) Uber die Abgrenzung der Eigenwerte einer Matrix. Bulletin de l’Académie des Sciences de l’URSS. Classe des sciences mathématiques et na (6), pp. 749–754. Cited by: §5.1.
-  (1995) Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM 42 (6), pp. 1115–1145. Cited by: §1.
-  (1984) Positive definite completions of partial Hermitian matrices. Linear Algebra and its Applications 58, pp. 109–124. Cited by: §2.1.
-  (2018) Optimization over nonnegative and convex polynomials with and without semidefinite programming. Ph.D. Thesis, Princeton University. Cited by: §1.
-  (2019) Engineering and business applications of sum of squares polynomials. Note: Available at https://arxiv.org/pdf/1906.07961.pdf Cited by: §1.3.
-  (2008) Sparse approximate solutions to semidefinite programs. In Latin American Symposium on Theoretical Informatics, pp. 306–316. Cited by: §3.2.1, §3.2.1, §3.2.
-  (2005) Positive polynomials in control. Vol. 312, Springer Science & Business Media. Cited by: §2.2.
Fast spectral algorithms from sum-of-squares proofs: tensor decomposition and planted sparse vectors. In
Proceedings of the 48th Annual ACM Symposium on Theory of Computing, pp. 178–191. Cited by: §1.2.
-  (2018) Introducing the mosek optimization suite. External Links: Cited by: §4.1, 2nd item.
-  (2010) Low-rank optimization on the cone of positive semidefinite matrices. SIAM Journal on Optimization 20 (5), pp. 2327–2351. Cited by: §3.1.2.
-  (2015) A fast distributed algorithm for decomposable semidefinite programs. In Proceedings of the 54th IEEE Conference on Decision and Control, pp. 1742–1749. Cited by: §2.1.
-  (2011) Exploiting sparsity in linear and nonlinear matrix inequalities via positive semidefinite matrix completion. Mathematical programming 129 (1), pp. 33–68. Cited by: §2.1.
-  (2010) Explicit sensor network localization using semidefinite representations and facial reductions. SIAM Journal on Optimization 20 (5), pp. 2679–2708. Cited by: §2.3.
-  (2007) Fast low-rank semidefinite programming for embedding and clustering. In Artificial Intelligence and Statistics, pp. 235–242. Cited by: §3.1.1, §3.
-  (2001) Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization 11 (3), pp. 796–817. Cited by: §1.3, §1.
-  (2017) A bounded degree SOS hierarchy for polynomial optimization. EURO Journal on Computational Optimization 5 (1-2), pp. 87–117. Cited by: §2.1.1.
-  (2016) Low-rank semidefinite programming: theory and applications. Foundations and Trends® in Optimization 2 (1-2), pp. 1–156. Cited by: §3.
-  (2019) Algorithms for verifying deep neural networks. arXiv preprint arXiv:1903.06758. Cited by: §5.
-  (1989) On the limited memory BFGS method for large scale optimization. Mathematical Programming 45 (1-3), pp. 503–528. Cited by: §3.1.1.
-  (2004) YALMIP: a toolbox for modeling and optimization in matlab. In Proceedings of the CACSD Conference, Taipei, Taiwan. Cited by: 1st item, 7th item.
-  (1979) On the Shannon capacity of a graph. IEEE Transactions on Information Theory 25 (1), pp. 1–7. Cited by: §1.
-  (2015) ADMM for sparse semidefinite programming with applications to optimal power flow problem. In Proceedings of the 54th IEEE Conference on Decision and Control, pp. 5932–5939. Cited by: §2.1.
-  (2014) Control and verification of high-dimensional systems via DSOS and SDSOS optimization. In Proceedings of the IEEE Conference on Decision and Control, Cited by: 1(b), §5.1.
-  (2018) Guaranteed globally optimal planar pose graph and landmark SLAM via sparse-bounded sums-of-squares programming. arXiv preprint arXiv:1809.07744. Cited by: §2.1.1.
-  (2014) Chordal sparsity, decomposing SDPs and the Lyapunov equation. In Proceedings of the American Control Conference, pp. 531–537. Cited by: §2.1.
-  (2017) Solving SDPs for synchronization and maxcut problems via the Grothendieck inequality. arXiv preprint arXiv:1703.08729. Cited by: §3.1.2.
-  (2017) Universal adversarial perturbations. In , pp. 1765–1773. Cited by: §5.
-  (2012) Regularization methods for SDP relaxations in large-scale polynomial optimization. SIAM Journal on Optimization 22 (2), pp. 408–428. Cited by: §1.2.
-  (2017-11) SCS: splitting conic solver, version 2.1.0. Note: https://github.com/cvxgrp/scs Cited by: 3rd item.
-  (2016) Conic optimization via operator splitting and homogeneous self-dual embedding. Journal of Optimization Theory and Applications 169 (3), pp. 1042–1068. Cited by: §4.1, §4.1, §4.1, §4.1, §4.
The limitations of deep learning in adversarial settings. In Proceedings of the IEEE European Symposium on Security and Privacy, pp. 372–387. Cited by: §5.
-  (2000-05) Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. Ph.D. Thesis, California Institute of Technology. Cited by: §1.3, §1.
-  (2003) Semidefinite programming relaxations for semialgebraic problems. Mathematical Programming 96 (2, Ser. B), pp. 293–320. Cited by: §1.3, §1.
-  (1998) On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of Operations Research 23 (2), pp. 339–358. Cited by: §3.
-  (2013) Strong duality in conic linear programming: facial reduction and extended duals. In Computational and Analytical Mathematics, pp. 613–634. Cited by: §2.3.
Approximating k-means-type clustering via semidefinite programming. SIAM Journal on Optimization 18 (1), pp. 186–205. Cited by: §4.2.
-  (2017) Reduction methods in semidefinite and conic optimization. Ph.D. Thesis, Massachusetts Institute of Technology. Cited by: §2.2, §2.3, §2.3.
-  (2018) Partial facial reduction: simplified, equivalent SDPs via approximations of the PSD cone. Mathematical Programming 171 (1-2), pp. 1–54. Cited by: 6th item.
-  (2018) The PICOS documentation. External Links: Cited by: 1st item.
-  (2002-05) SOSTOOLS: sum of squares optimization toolbox for MATLAB. Note: Available from http://www.cds.caltech.edu/sostools and http://www.mit.edu/~parrilo/sostools Cited by: 7th item.
-  (2012) Challenging SMT solvers to verify neural networks. AI Communications 25 (2), pp. 117–135. Cited by: §5.
-  (1993) Positive polynomials on compact semi-algebraic sets. Indiana University Mathematics Journal 42 (3), pp. 969–984. Cited by: §1.3.
-  (2019) Verification of non-linear specifications for neural networks. arXiv preprint arXiv:1902.09592. Cited by: §5.
-  (2018) Semidefinite relaxations for certifying robustness to adversarial examples. In Advances in Neural Information Processing Systems, pp. 10877–10887. Cited by: §5.
-  (2010) Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review 52 (3), pp. 471–501. Cited by: §1, §3.
-  (2019) Accelerated first-order methods for hyperbolic programming. Mathematical Programming, pp. 1–35. Cited by: §1.2.
-  (1976) Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research 1 (2), pp. 97–116. Cited by: §4.2.
-  (2019) SE-Sync: a certifiably correct algorithm for synchronization over the special Euclidean group. The International Journal of Robotics Research 38 (2-3), pp. 95–125. Cited by: §2.1.1, §3.1.2.
-  (2015) A new, globally convergent Riemannian conjugate gradient method. Optimization 64 (4), pp. 1011–1031. Cited by: §3.1.2.
-  (2001-10) SeDuMi version 1.05. Note: Latest version available at http://sedumi.ie.lehigh.edu/ Cited by: §4.1, §4.1, 2nd item.
-  Sum of squares programming for julia. External Links: Cited by: 7th item.
-  (2019) Provable nonconvex methods and algorithms (collection of papers). Note: References available at https://sunju.org/research/nonconvex/ Cited by: §1.2.
-  (2016) Complete dictionary recovery over the sphere II: recovery by Riemannian trust-region method. IEEE Transactions on Information Theory 63 (2), pp. 885–914. Cited by: §3.1.2.
-  (2018) A geometric analysis of phase retrieval. Foundations of Computational Mathematics 18 (5), pp. 1131–1198. Cited by: §3.1.2.
-  (2014) Decomposition in conic optimization with partially separable structure. SIAM Journal on Optimization 24 (2), pp. 873–897. Cited by: §2.1.
-  (2015) Decomposition methods for sparse matrix nearness problems. SIAM Journal on Matrix Analysis and Applications 36 (4), pp. 1691–1717. Cited by: §2.1.
-  (2013) Systems polynomial optimization toolbox (spot). External Links: Cited by: 1st item, 7th item, 8th item.
-  (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §5.
-  (2005) Probabilistic robotics. MIT press. Cited by: §2.1.1.
-  (2017) Evaluating robustness of neural networks with mixed integer programming. arXiv preprint arXiv:1711.07356. Cited by: §5.
-  SDPT3 - a MATLAB software package for semidefinite-quadratic-linear programming. Note: Available from http://www.math.cmu.edu/~reha/sdpt3.html Cited by: §4.1, 2nd item.
-  (2016) Pymanopt: a python toolbox for optimization on manifolds using automatic differentiation. The Journal of Machine Learning Research 17 (1), pp. 4755–4759. Cited by: 4th item.
-  (2009) Symmetry in semidefinite programs. Linear Algebra and its Applications 430 (1), pp. 360–369. Cited by: §2.2.
-  (1996-03) Semidefinite programming. SIAM Review 38 (1), pp. 49–95. Cited by: §1, §1.
-  (2016) Reduction of SDPs in H-infinity control of SISO systems and performance limitations analysis. In Proceedings of the 55th IEEE Conference on Decision and Control, pp. 646–651. Cited by: §2.3.
-  (2017) The mixing method: coordinate descent for low-rank semidefinite programming. arXiv preprint arXiv:1706.00476. Cited by: §3.1.3.
-  (2018) Sparse-BSOS: a bounded degree SOS hierarchy for large scale polynomial optimization with sparsity. Mathematical Programming Computation 10 (1), pp. 1–32. Cited by: §2.1.1.
-  (2017) Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851. Cited by: §5.
-  (2018) Verification for machine learning, autonomy, and neural networks survey. arXiv preprint arXiv:1810.01989. Cited by: §5.
-  (2015) SDPNAL+: a majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints. Mathematical Programming Computation 7 (3), pp. 331–366. Cited by: §4.2, §4.2, §4.2, §4, 3rd item.
-  (2019) A conditional gradient-based augmented Lagrangian framework. arXiv preprint arXiv:1901.04013. Cited by: §1.2.
-  (2017) Sketchy decisions: convex low-rank matrix optimization with optimal storage. arXiv preprint arXiv:1702.06838. Cited by: §3.2.2, §3.2.2.
-  (2010) A Newton-CG augmented Lagrangian method for semidefinite programming. SIAM Journal on Optimization 20 (4), pp. 1737–1765. Cited by: §4.2, §4.2, §4.
-  (2016) Improving the robustness of deep neural networks via stability training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4480–4488. Cited by: §5.
-  (2019) Chordal decomposition in operator-splitting methods for sparse semidefinite programs. Mathematical Programming, pp. 1–44. Cited by: §2.1, <