An Optimal-Storage Approach to Semidefinite Programming using Approximate Complementarity

02/09/2019 ∙ by Lijun Ding et al. ∙ Cornell University, EPFL, California Institute of Technology

This paper develops a new storage-optimal algorithm that provably solves generic semidefinite programs (SDPs) in standard form. This method is particularly effective for weakly constrained SDPs. The key idea is to formulate an approximate complementarity principle: Given an approximate solution to the dual SDP, the primal SDP has an approximate solution whose range is contained in the eigenspace with small eigenvalues of the dual slack matrix. For weakly constrained SDPs, this eigenspace has very low dimension, so this observation significantly reduces the search space for the primal solution. This result suggests an algorithmic strategy that can be implemented with minimal storage: (1) Solve the dual SDP approximately; (2) compress the primal SDP to the eigenspace with small eigenvalues of the dual slack matrix; (3) solve the compressed primal SDP. The paper also provides numerical experiments showing that this approach is successful for a range of interesting large-scale SDPs.


1 Introduction

Consider a semidefinite program (SDP) in the standard form

(P)    $\min_{X \in \mathbb{S}^n} \ \langle C, X \rangle \quad \text{subject to} \quad \mathcal{A}X = b, \quad X \succeq 0$

The primal variable is the symmetric, positive-semidefinite matrix $X \in \mathbb{S}^n_+$. The problem data comprises a symmetric (but possibly indefinite) objective matrix $C \in \mathbb{S}^n$, a linear map $\mathcal{A} : \mathbb{S}^n \to \mathbb{R}^m$ with rank $m$, and a right-hand side $b \in \mathbb{R}^m$.

SDPs form a class of convex optimization problems with remarkable modeling power. But SDPs are challenging to solve because they involve a matrix variable whose dimension $n$ can rise into the millions or billions. For example, when using a matrix completion SDP in a recommender system, $n$ is the number of users and products; when using a phase retrieval SDP to visualize a biological sample, $n$ is the number of pixels in the recovered image. In these applications, most algorithms are prohibitively expensive because their storage costs are quadratic in $n$.

How much memory should be required to solve this problem? Any algorithm must be able to query the problem data and to report a representation of the solution. Informally, we say that an algorithm uses optimal storage if the working storage is no more than a constant multiple of the storage required for these operations [72]. (See Section 1.2 for a formal definition.)

It is not obvious how to develop storage-optimal SDP algorithms. To see why, recall that all weakly-constrained SDPs (those whose number $m$ of constraints is small relative to $n^2$) admit low-rank solutions [8, 54], which can be expressed compactly in factored form. For these problems, a storage-optimal algorithm cannot even instantiate the matrix variable! One natural idea is to introduce an explicit low-rank factorization of the primal variable and to optimize over the factors [17].

Methods built from this idea provably work when the size of the factors is sufficiently large [13]. However, recent work [67] shows that they cannot provably solve all SDPs with optimal storage; see Section 2.

In contrast, this paper develops a new algorithm that provably solves all generic SDPs using optimal storage. More precisely, our algorithm works when the primal and dual SDPs exhibit strong duality, the primal and dual solutions are unique, and strict complementarity holds; these standard conditions hold generically [4].

Our method begins with the Lagrange dual of the primal SDP Eq. P,

(D)    $\max_{y \in \mathbb{R}^m} \ b^T y \quad \text{subject to} \quad C - \mathcal{A}^* y \succeq 0$

with dual variable $y \in \mathbb{R}^m$. The vector $b^T$ is the transpose of $b$, and the linear map $\mathcal{A}^* : \mathbb{R}^m \to \mathbb{S}^n$ is the adjoint of $\mathcal{A}$. It is straightforward to compute an approximate solution to the dual SDP Eq. D with optimal storage. The challenge is to recover a primal solution from the approximate dual solution.

To meet this challenge, we develop a new approximate complementarity principle that holds for generic SDPs: Given an approximate dual solution $y$, we prove that there is an approximate primal solution whose range is contained in the eigenspace with small eigenvalues of the dual slack matrix $Z = C - \mathcal{A}^* y$. This principle suggests an algorithm: we solve the primal SDP by searching over matrices with the appropriate range. This recovery problem is a (much smaller) SDP that can be solved with optimal storage.

1.1 Assumptions

First, assume that the primal Eq. P has a solution, say $X_\star$, and the dual Eq. D has a unique solution $y_\star$. We require that strong duality holds:

(1)    $\langle C, X_\star \rangle = b^T y_\star$

The condition Eq. 1 follows, for example, from Slater’s constraint qualification.

Strong duality and feasibility imply that the solution $X_\star$ and the dual slack matrix $Z_\star = C - \mathcal{A}^* y_\star$ satisfy the complementary slackness condition:

(2)    $X_\star Z_\star = 0$

These conditions suffice for the results in Section 5 (and the associated Lemmas 2 and 3), which bound the recovery error of our procedure when the linear map $\mathcal{A}$ restricted to a certain subspace is nonsingular (see Eq. 7 and Algorithm 1 for the choice of the subspace).

To ensure this condition, we make the stronger assumption that every solution pair $(X_\star, y_\star)$ satisfies the strict complementarity condition:

(3)    $\operatorname{rank}(X_\star) + \operatorname{rank}(Z_\star) = n$

Note that these assumptions ensure that all solutions have the same rank, and therefore that the primal solution $X_\star$ is also unique [42, Corollary 2.5]. In particular, the rank $r_\star$ of the solution satisfies the Barvinok–Pataki bound $r_\star(r_\star + 1)/2 \le m$.

To summarize, all results in this paper hold under the assumptions of primal attainability, dual uniqueness, strong duality, and strict complementarity. These conditions hold generically, conditioned on primal and dual attainability; i.e., they hold for every SDP satisfying primal and dual attainability outside of a set of measure zero [4].

1.2 Optimal Storage

Following [72], let us quantify the storage necessary to solve every SDP Eq. P that satisfies our assumptions in Section 1.1 and that admits a solution with rank $r_\star$.

First, it is easy to see that $r_\star n$ numbers are sufficient to represent a rank-$r_\star$ solution in factored form. This cost is also necessary because every rank-$r_\star$ matrix is the solution to some SDP from our problem class.

To hide the internal complexity of the optimization problem Eq. P, we will interact with the problem data using data access oracles. Suppose we can perform any of the following operations on arbitrary vectors $u \in \mathbb{R}^n$ and $z \in \mathbb{R}^m$:

(4)    $u \mapsto Cu, \qquad (u, z) \mapsto (\mathcal{A}^* z)\, u, \qquad u \mapsto \mathcal{A}(u u^T)$

These oracles enjoy simple implementations in many concrete applications. The input and output of these operations clearly involve storing $O(m + n)$ numbers.

In summary, any method that uses these data access oracles to solve every SDP from our class must store $\Omega(m + r_\star n)$ numbers. We say a method has optimal storage if the working storage provably achieves this bound.

For many interesting problems, the number $m$ of constraints is proportional to the dimension $n$. Moreover, the rank of the solution is constant or logarithmic in $n$. In this case, a storage-optimal algorithm has working storage $\tilde{O}(n)$, where the tilde suppresses log-like factors.
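
As an illustrative calculation (with hypothetical numbers, not taken from the paper), consider $n = m = 10^6$ and $r_\star = 10$: the optimal storage $m + r_\star n$ amounts to about $1.1 \times 10^7$ numbers, whereas any method that instantiates the full matrix variable must store $n^2 = 10^{12}$ entries.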

Remark 1 (Applications)

The algorithmic framework we propose is most useful when the problem data has an efficient representation and the three operations in Eq. 4 can be implemented with low arithmetic cost. For example, it is often the case that the matrix $C$ and the linear map $\mathcal{A}$ are sparse or structured. This situation occurs in the maxcut relaxation [32], matrix completion [61], phase retrieval [22, 66], and community detection [47]. See [56] for some other examples.
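
To make the oracle model concrete, the following sketch implements the three operations in Eq. 4 for the maxcut relaxation, assuming (for illustration only) that the cost matrix is a sparse graph Laplacian and the constraint map is $\mathcal{A}X = \operatorname{diag}(X)$; the function and variable names are ours, not the paper's.

```python
import numpy as np
import scipy.sparse as sp

# Data-access oracles (Eq. 4), sketched for the maxcut relaxation:
# cost matrix C = graph Laplacian, constraint map A(X) = diag(X), so A* z = diag(z).
def make_oracles(C):
    cost_mv = lambda u: C @ u                 # u -> C u
    adjoint_mv = lambda z, u: z * u           # (z, u) -> (A* z) u
    rank_one_constraint = lambda u: u * u     # u -> A(u u^T) = diag(u u^T)
    return cost_mv, adjoint_mv, rank_one_constraint

# Tiny usage example on a random sparse graph.
n = 500
W = sp.random(n, n, density=5e-3, format="csr")
W = W + W.T                                    # symmetric edge weights
C = sp.diags(np.ravel(W.sum(axis=1))) - W      # graph Laplacian
cost_mv, adjoint_mv, rank_one_constraint = make_oracles(C)
u, z = np.random.randn(n), np.random.randn(n)
print(cost_mv(u).shape, adjoint_mv(z, u).shape, rank_one_constraint(u).shape)
```

Only matrix-vector products and diagonal extractions are needed, so each oracle call costs time proportional to the number of nonzeros in the problem data.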

1.3 From Strict Complementarity to Storage Optimality

Suppose that we have computed the exact, unique dual solution $y_\star$. Complementary slackness Eq. 2 and strict complementarity Eq. 3 ensure that

$\operatorname{range}(X_\star) = \operatorname{null}(Z_\star), \qquad \text{where } Z_\star = C - \mathcal{A}^* y_\star.$

Indeed, $X_\star Z_\star = 0$ implies $\operatorname{range}(X_\star) \subseteq \operatorname{null}(Z_\star)$, and strict complementarity forces the dimensions of the two subspaces to agree. Therefore, the slack matrix identifies the range of the primal solution.

Let $r_\star$ be the rank of the primal solution. Construct an orthonormal matrix $U_\star \in \mathbb{R}^{n \times r_\star}$ whose columns span $\operatorname{null}(Z_\star)$. The compression of the primal problem Eq. P to this subspace is

(5)    $\min_{S \in \mathbb{S}^{r_\star}} \ \langle U_\star^T C U_\star, S \rangle \quad \text{subject to} \quad \mathcal{A}(U_\star S U_\star^T) = b, \quad S \succeq 0$

The variable $S$ is a low-dimensional matrix when $r_\star$ is small. If $S_\star$ is a solution to Eq. 5, then $X_\star = U_\star S_\star U_\star^T$ is a solution to the original SDP Eq. P.

This strategy for solving the primal SDP can be implemented with a storage-optimal algorithm. Indeed, the variable in the dual SDP Eq. D has length $m$, so there is no obstacle to solving the dual with $O(m)$ working storage using the subgradient-type method described in Section 6. We can compute the subspace $\operatorname{null}(Z_\star)$ using the randomized range finder [33, Alg. 4.1] with storage cost $O(r_\star n)$. Last, we can solve the compressed primal SDP Eq. 5 using additional working storage $O(m + r_\star^2)$ via the matrix-free method from [25, 52]. The total storage is the optimal $O(m + r_\star n)$. Furthermore, all of these algorithms can be implemented with the data access oracles Eq. 4.

Hence — assuming exact solutions to the optimization problems — we have developed a storage-optimal approach to the SDP Eq. P, summarized in Table 1[left].
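
The following sketch illustrates the middle steps of this pipeline (Table 1): forming the slack matrix from a dual vector and extracting the invariant subspace for its smallest eigenvalues, entirely through the oracles of Eq. 4. It is an illustration under stated assumptions: we use a Lanczos eigensolver in place of the randomized range finder [33] cited above, and the helper names are ours.

```python
import numpy as np
from scipy.sparse.linalg import eigsh, LinearOperator

def compress_primal(cost_mv, adjoint_mv, y, n, r):
    # Slack matrix Z = C - A*(y), available only through matrix-vector products.
    slack_mv = lambda u: cost_mv(u) - adjoint_mv(y, u)
    Z = LinearOperator((n, n), matvec=slack_mv, dtype=np.float64)
    # Orthonormal basis for the invariant subspace of the r smallest eigenvalues.
    _, U = eigsh(Z, k=r, which="SA")
    # Compressed r x r cost matrix U^T C U for the reduced problem Eq. 5.
    CU = U.T @ np.column_stack([cost_mv(U[:, j]) for j in range(r)])
    return U, CU
```

With the basis in hand, the reduced SDP Eq. 5 (or, in the approximate setting of Section 1.4, Eq. MinFeasSDP) involves only an $r \times r$ positive-semidefinite variable and can be solved with any small-scale method.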

1.4 The Approximate Complementarity Principle

A major challenge remains: one very rarely has access to an exact dual solution! Rather, we usually have an approximate dual solution, obtained via some iterative dual solver.

This observation motivates us to formulate a new approximate complementarity principle. For now, assume that the solution rank $r_\star$ is known. Given an approximate dual solution $y$, we can construct an orthonormal matrix $V \in \mathbb{R}^{n \times r_\star}$ whose columns are eigenvectors of the slack matrix $Z = C - \mathcal{A}^* y$ with the $r_\star$ smallest eigenvalues. Roughly speaking, the primal problem Eq. P admits an approximate solution whose range is contained in $\operatorname{range}(V)$. We show the approximate solution is close to the true solution as measured in terms of suboptimality, infeasibility, and distance to the solution set.

We propose to recover the approximate primal solution by solving the semidefinite least-squares problem

(MinFeasSDP)    $\min_{S \in \mathbb{S}^{r_\star}} \ \|\mathcal{A}(V S V^T) - b\|^2 \quad \text{subject to} \quad S \succeq 0$

with variable $S \in \mathbb{S}^{r_\star}$. Given a solution $\hat{S}$ to Eq. MinFeasSDP, we obtain an (infeasible) approximate solution $\hat{X} = V \hat{S} V^T$ to the primal problem.

In fact, it is essential to relax our attention to infeasible solutions because the feasible set of Eq. P should almost never contain a matrix with range contained in $\operatorname{range}(V)$! This observation was very surprising to us, but it seems evident in retrospect. (For example, it follows from a dimension-counting argument together with Lemma 8.)

The resulting framework appears in Table 1[right]. This approach for solving Eq. P leads to storage-optimal algorithms for the same reasons described in Section 1.3. Our first main result ensures that this technique results in a provably good solution to the primal SDP Eq. P.

Theorem 1 (Main theorem, informal)

Instate the assumptions of Section 1.1. Suppose we have found a dual vector $y$ with suboptimality $\varepsilon$. Consider the primal reconstruction $\hat{X}$ obtained by solving Eq. MinFeasSDP. Then we may bound the distance from $\hat{X}$ to the primal solution $X_\star$ by a quantity that vanishes as $\varepsilon \to 0$.

The constant in the bound depends on the problem data $C$, $\mathcal{A}$, and $b$.

We state and prove the formal result as Theorem 2. As stated, this guarantee requires knowledge of the rank of the solution; in Section 5, we obtain a similar guarantee using an estimate for $r_\star$.

Step | Exact Primal Recovery | Robust Primal Recovery
1 | Compute the dual solution $y_\star$ | Compute an approximate dual solution $y$
2 | Compute a basis $U_\star$ for $\operatorname{null}(Z_\star)$ | Compute the eigenvectors of $Z = C - \mathcal{A}^* y$ with the $r$ smallest eigenvalues; collect them as columns of the matrix $V$
3 | Solve the reduced SDP Eq. 5 | Solve Eq. MinFeasSDP
Table 1: Exact and Robust Primal Recovery

1.5 Paper Organization

We situate our contributions relative to other work in Section 2. Section 3 contains an overview of our notation and more detailed problem assumptions. Section 4 uses the approximate complementarity principle to develop practical, robust, and theoretically justified algorithms for solving Eq. P. The algorithms are accompanied by detailed bounds on the quality of the computed solutions as compared with the true solution. Section 5 contains algorithmically computable bounds that can be used to check the solution quality under weaker assumptions than those required in Section 4. These checks are important for building a reliable solver. Next, we turn to algorithmic issues: we explain how to compute an approximate dual solution efficiently in Section 6, which provides the last ingredient for a complete method to solve Eq. P. Section 7 shows numerically that the method is effective in practice.

2 Related Work

Semidefinite programming can be traced to a 1963 paper of Bellman & Fan [10]. Related questions emerged earlier in control theory, starting from Lyapunov’s 1890 work on stability of dynamical systems. There are many classic applications in matrix analysis, dating to the 1930s. Graph theory provides another rich source of examples, beginning from the 1970s. See [14, 65, 63, 11, 16] for more history and problem formulations.

2.1 Interior-Point Methods

The first reliable algorithms for semidefinite programming were interior-point methods (IPMs). These techniques were introduced independently by Nesterov & Nemirovski [50, 51] and Alizadeh [2, 3].

The success of these SDP algorithms motivated new applications. In particular, Goemans & Williamson [32] used semidefinite programming to design an approximation algorithm to compute the maximum-weight cut in a graph. Early SDP solvers could only handle graphs with a few hundred vertices [32, Sec. 5] although computational advances quickly led to IPMs that could solve problems with thousands of vertices [12].

IPMs solve a sequence of unconstrained barrier problems whose solutions are feasible for the original SDP, and they track the solutions of these problems using Newton's method. As a result, IPMs converge to high accuracy in very few iterations, but they require substantial work per iteration. To solve a standard-form SDP with an $n \times n$ matrix variable and $m$ equality constraints, a typical IPM requires $O(\sqrt{n}\log(1/\varepsilon))$ iterations to reach a solution with accuracy $\varepsilon$ (in terms of objective value) [49], and $O(mn^3 + m^2 n^2 + m^3)$ arithmetic operations per iteration (when no structure is exploited) [5], hence $O(\sqrt{n}\log(1/\varepsilon)\,(mn^3 + m^2 n^2 + m^3))$ arithmetic operations in total. Further, a typical IPM requires at least $\Omega(m^2 + n^2)$ memory, not including the storage of the data representation (which takes $O(mn^2)$ memory if no structure is assumed) [5].

As a consequence, these algorithms are not effective for solving large problem instances, unless they enjoy a lot of structure. Hence researchers began to search for methods that could scale to larger problems.

2.2 First-Order Methods

One counterreaction to the expense of IPMs was to develop first-order optimization algorithms for SDPs. This line of work began in the late 1990s, and it accelerated as SDPs emerged in the machine learning and signal processing literature in the 2000s.

Early on, Helmberg & Rendl [35] proposed a spectral bundle method for solving an SDP in dual form, and they showed that it converges to a dual solution when the trace of the primal variable is constant over the feasible set. In contrast to IPMs, the spectral bundle method has low per-iteration complexity. On the other hand, the convergence rate is not known, and there is no convergence guarantee on the primal side, so there is no explicit control on the storage and arithmetic costs.

In machine learning, popular first-order algorithms include the proximal gradient method [58], accelerated variants [9] and the alternating direction method of multipliers [30, 31, 15, 53]. These methods provably solve the original convex formulation of Eq. P, but they converge slowly in practice. These algorithms all store the full primal matrix variable, so they are not storage-efficient.

Recently, Friedlander & Macedo [29] have proposed a novel first-order method that is based on gauge duality, rather than Lagrangian duality. This approach converts an SDP into an eigenvalue optimization problem. The authors propose a mechanism for using a dual solution to construct a primal solution. This paper is similar in spirit to our approach, but it lacks an analysis of the accuracy of the primal solution. Moreover, it only applies to problems with a positive-definite objective, i.e., $C \succ 0$.

2.3 Storage-Efficient First-Order Methods

Motivated by problems in signal processing and machine learning, a number of authors have revived the conditional gradient method (CGM) [28, 43, 39]. In particular, Hazan [34] suggested using CGM for semidefinite programming. Clarkson [24] developed a new analysis, and Jaggi [37] showed how this algorithm applies to a wide range of interesting problems.

The appeal of the CGM is that it computes an approximate solution to an SDP as a sum of rank-one updates; each rank-one update is obtained from an approximate eigenvector computation. In particular, after $t$ iterations, the iterate has rank at most $t$. This property has led to the exaggeration that CGM is a "storage-efficient" optimization method. Unfortunately, CGM converges very slowly, so the iterates do not have controlled rank. The literature describes many heuristics for attempting to control the rank of the iterates [55, 71], but these methods all lack guarantees.

Very recently, some of the authors of this paper [72] have shown how to use CGM to design a storage-optimal algorithm for a class of semidefinite programs by sketching the decision variable. This algorithm does not apply to standard-form SDPs, and it inherits the slow convergence of CGM. Nevertheless, the sketching methodology holds promise as a way to design storage optimal solvers, particularly together with algorithms that generalize CGM and that do apply to standard-form SDPs [70, 69].

We also mention a subgradient method developed by Renegar [57] that can be used to solve either the primal or dual SDP. Renegar’s method has a computational profile similar to CGM, and it does not have controlled storage costs.

2.4 Factorization Methods

There is also a large class of heuristic SDP algorithms based on matrix factorization. The key idea is to factorize the matrix variable as $X = FF^T$ and to reformulate the SDP Eq. P as

(6)    $\min_{F} \ \langle C, F F^T \rangle \quad \text{subject to} \quad \mathcal{A}(F F^T) = b$

We can apply a wide range of nonlinear programming methods to optimize Eq. 6 with respect to the variable $F$. In contrast to the convex methods described above, these techniques only offer incomplete guarantees on storage, arithmetic, and convergence.

The factorization idea originates in the paper [36] of Homer & Peinado. They focused on the maxcut SDP, and the factor $F$ was a square matrix, i.e., $F \in \mathbb{R}^{n \times n}$. These choices result in an unconstrained nonconvex optimization problem that can be tackled with a first-order optimization algorithm.

Theoretical work of Barvinok [8] and Pataki [54] demonstrates that the primal SDP Eq. P always admits a solution with rank $r$ whenever $r(r+1)/2 \ge m$. (Note, however, that the SDP can have solutions with much lower or higher rank.) Interestingly, it is possible to find a low-rank approximate solution to an SDP, with high probability, by projecting any solution to the SDP onto a random subspace [7, Proposition 6.1], [60]. However, this approach requires having already computed a (possibly full-rank) solution using some other method.

Inspired by the existence of low-rank solutions to SDPs, Burer & Monteiro [17] proposed to solve the optimization problem Eq. 6 where the variable $F \in \mathbb{R}^{n \times r}$ is constrained to be a tall matrix ($r \ll n$). The number $r$ is called the factorization rank. It is clear that every rank-$r_\star$ solution to the SDP Eq. P induces a solution to the factorized problem Eq. 6 when $r \ge r_\star$. Burer & Monteiro applied a limited-memory BFGS algorithm to solve Eq. 6 in an explicit effort to reduce storage costs.
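
For intuition, here is a minimal sketch of this factorization approach on the maxcut SDP, where the constraint $\operatorname{diag}(FF^T) = \mathbf{1}$ makes the feasible set of Eq. 6 the set of matrices with unit-norm rows. It is an illustration only: it uses projected gradient descent rather than the limited-memory BFGS method of [17], and all names are ours.

```python
import numpy as np

def burer_monteiro_maxcut(C, r, steps=500, step_size=1e-2, seed=0):
    n = C.shape[0]
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((n, r))
    F /= np.linalg.norm(F, axis=1, keepdims=True)    # enforce diag(F F^T) = 1
    for _ in range(steps):
        G = 2 * (C @ F)                              # gradient of <C, F F^T>
        F -= step_size * G
        F /= np.linalg.norm(F, axis=1, keepdims=True)
    return F                                         # candidate solution X ~ F F^T
```

The storage is $O(nr)$ rather than $O(n^2)$, which is exactly the appeal of the factorized formulation; the difficulty, discussed next, is choosing $r$ small without introducing spurious local minima.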

In subsequent work, Burer & Monteiro [18] proved that, under technical conditions, the local minima of the nonconvex formulation Eq. 6 are global minima of the SDP Eq. P, provided that the factorization rank satisfies roughly $r(r+1)/2 \ge m$. As a consequence, algorithms based on Eq. 6 often set the factorization rank $r = O(\sqrt{m})$, so the storage costs are $O(n\sqrt{m})$.

Unfortunately, a recent result of Waldspurger & Waters [67, Thm. 2] demonstrates that the formulation Eq. 6 cannot lead to storage-optimal algorithms for generic SDPs. In particular, suppose that the feasible set of Eq. P satisfies a mild technical condition and contains a matrix with rank one. Whenever the factorization rank $r$ falls below the threshold (on the order of $\sqrt{2m}$) needed for the guarantee of [18], there is a set of cost matrices with positive Lebesgue measure for which the factorized problem Eq. 6 has (1) a unique global optimizer with rank one and (2) at least one suboptimal local minimizer, while the original SDP has a unique primal and dual solution that satisfy strict complementarity. In this situation, a provably correct method based on the factorized SDP must take $r$ on the order of $\sqrt{2m}$, so its variable requires $\Omega(n\sqrt{m})$ storage, which is not optimal when the solution rank is much smaller than $\sqrt{m}$. In view of this negative result, we omit a detailed review of the literature on the analysis of factorization methods. See [67] for a full discussion.

2.5 Summary and Contribution

In short, all extant algorithms for solving the SDP Eq. P either lack the optimal storage guarantee, or they are heuristic, or both. This paper presents a new algorithm that provably solves the SDP Eq. P with optimal storage under assumptions that are standard in the optimization literature.

3 Basics and Notation

Here we introduce some additional notation, and metrics for evaluating the quality of a solution and the conditioning of an SDP.

3.1 Notation

We will work with the Frobenius norm $\|\cdot\|_F$, the operator norm $\|\cdot\|_{\mathrm{op}}$, and its dual, the nuclear norm $\|\cdot\|_*$. We reserve the symbols $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ for the canonical inner product of the underlying real vector space and the norm it induces.

For a matrix $A$, we arrange its singular values in decreasing order: $\sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge 0$. Define $\sigma_{\max}(A) := \sigma_1(A)$, and write $\sigma_{\min}(A)$ for the smallest singular value and $\sigma_{\min > 0}(A)$ for the smallest nonzero singular value of $A$. For a linear operator $\mathcal{B}$ on matrices, we define $\sigma_{\max}(\mathcal{B}) := \max_{\|S\|_F = 1} \|\mathcal{B}(S)\|$ and $\sigma_{\min}(\mathcal{B}) := \min_{\|S\|_F = 1} \|\mathcal{B}(S)\|$.

We use analogous notation for the eigenvalues of a symmetric matrix. In particular, the map $\lambda_i(\cdot)$ reports the $i$th largest eigenvalue of its argument.

3.2 Optimal Solutions

Instate the notation and assumptions from Section 1.1. Define the slack operator $Z(y) := C - \mathcal{A}^* y$ that maps a putative dual solution $y$ to its associated slack matrix. We omit the dependence on $y$ if it is clear from the context.

Let $r_\star$ be the rank of the primal solution $X_\star$, and denote its range by $\mathcal{U}_\star := \operatorname{range}(X_\star)$. We also fix an orthonormal matrix $U_\star \in \mathbb{R}^{n \times r_\star}$ whose columns span $\mathcal{U}_\star$. Introduce the orthogonal complement $\mathcal{U}_\star^\perp$, and let $U_\star^\perp$ be a fixed orthonormal basis for it. We have the decomposition $\mathbb{R}^n = \mathcal{U}_\star \oplus \mathcal{U}_\star^\perp$.

For an orthonormal matrix $V$, define the compressed cost matrix and constraint map

(7)    $C_V := V^T C V \qquad \text{and} \qquad \mathcal{A}_V(S) := \mathcal{A}(V S V^T)$

In particular, $\mathcal{A}_{U_\star}$ is the compression of the constraint map onto the range of the solution.

3.3 Conditioning of the SDP

Our analysis depends on conditioning properties of the pair of primal Eq. P and dual Eq. D SDPs.

First, we measure the strength of the complementarity condition Eq. 2 using the spectral gaps of the primal solution $X_\star$ and the dual slack matrix $Z_\star$, that is, their smallest nonzero eigenvalues $\lambda_{r_\star}(X_\star)$ and $\lambda_{n - r_\star}(Z_\star)$.

These two numbers capture how far we can perturb the solutions before the complementarity condition fails.

Second, we measure the robustness of the primal solution to perturbations of the problem data using the quantity

(8)    $\sigma_{\min}(\mathcal{A}_{U_\star}) = \min_{\|S\|_F = 1} \|\mathcal{A}(U_\star S U_\star^T)\|$

This term arises because we have to understand the conditioning of the system of linear equations $\mathcal{A}_{U_\star}(S) = b$ in the variable $S$.

Quantity | primal matrix $X$ | dual vector $y$
suboptimality | $\langle C, X \rangle - \langle C, X_\star \rangle$ | $b^T y_\star - b^T y$
infeasibility | $\|\mathcal{A}X - b\|$ | $\operatorname{dist}(C - \mathcal{A}^* y,\ \mathbb{S}^n_+)$
distance to solution set | $\operatorname{dist}_F(X, \{X_\star\})$ | $\|y - y_\star\|$
Table 2: Quality of a primal matrix and a dual vector

3.4 Quality of Solutions

We measure the quality of a primal matrix variable and a dual vector in terms of their suboptimality, their infeasibility, and their distance to the true solutions. Table 2 gives formulas for these quantities.

We say that a matrix $X$ is an $(\varepsilon_1, \varepsilon_2)$-solution of Eq. P if its suboptimality is at most $\varepsilon_1$ and its infeasibility is at most $\varepsilon_2$.

The primal suboptimality and infeasibility are both controlled by the distance to the primal solution set:

(9)    $\langle C, X - X_\star \rangle \le \|C\|_F \, \|X - X_\star\|_F \qquad \text{and} \qquad \|\mathcal{A}X - b\| \le \sigma_{\max}(\mathcal{A}) \, \|X - X_\star\|_F$

We can also control the distance of a dual vector and its slack matrix from their optima using the following quadratic growth lemma.

Lemma 1 (Quadratic Growth)

Instate the assumptions from Section 1.1. For any dual feasible $y$ with dual slack matrix $Z = C - \mathcal{A}^* y$ and dual suboptimality $\varepsilon$, we have

(10)

where the linear operator is defined by

The orthonormal matrix $U_\star$ is defined in Section 3.2.

We defer the proof of Lemma 1 to Appendix A. The name quadratic growth arises from a limit of inequality 10: when the suboptimality $\varepsilon$ is small, the second term in the bracket dominates the first, so the distance to the dual optimum is controlled by $\sqrt{\varepsilon}$ [26].

4 Reduced SDPs and Approximate Complementarity

In this section, we describe two reduced SDP formulations, and we explain when their solutions are nearly optimal for the original SDP Eq. P. We can interpret these results as constructive proofs of the approximate complementarity principle.

4.1 Reduced SDPs

Suppose that we have obtained an approximate dual solution $y$ and its associated dual slack matrix $Z = C - \mathcal{A}^* y$. Let $r$ be a rank parameter, which we will discuss later. Construct an orthonormal matrix $V \in \mathbb{R}^{n \times r}$ whose range is an $r$-dimensional invariant subspace associated with the $r$ smallest eigenvalues of the dual slack matrix $Z$. Our goal is to compute a matrix with range contained in $\operatorname{range}(V)$ that approximately solves the primal SDP Eq. P.

Our first approach minimizes infeasibility over all psd matrices with range contained in $\operatorname{range}(V)$:

(MinFeasSDP)    $\min_{S \in \mathbb{S}^{r}} \ \|\mathcal{A}(V S V^T) - b\|^2 \quad \text{subject to} \quad S \succeq 0$

with variable $S \in \mathbb{S}^{r}$. Given a solution $\hat{S}$, we can form an approximate solution $\hat{X} = V \hat{S} V^T$ for the primal SDP Eq. P. This is the same method from Section 1.4.
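
For readers who want to experiment, the following sketch solves Eq. MinFeasSDP with an off-the-shelf modeling tool. It assumes (for illustration) that the constraint map is given explicitly by matrices $A_1, \dots, A_m$ with $(\mathcal{A}X)_i = \langle A_i, X \rangle$, and it uses CVXPY purely for brevity; the paper instead solves this small problem with a storage-optimal matrix-free method.

```python
import numpy as np
import cvxpy as cp

def min_feas_sdp(V, A_mats, b):
    """Recover an r x r PSD matrix S minimizing ||A(V S V^T) - b||^2."""
    r = V.shape[1]
    # Compress each constraint matrix: <A_i, V S V^T> = <V^T A_i V, S>.
    A_comp = [V.T @ A @ V for A in A_mats]
    S = cp.Variable((r, r), PSD=True)
    residual = cp.hstack([cp.trace(A_comp[i] @ S) for i in range(len(b))]) - b
    problem = cp.Problem(cp.Minimize(cp.sum_squares(residual)))
    problem.solve()
    return V @ S.value @ V.T    # the (possibly infeasible) approximate primal
```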

Our second approach minimizes the objective value over all psd matrices with range contained in $\operatorname{range}(V)$, subject to a specified limit $\eta$ on infeasibility:

(MinObjSDP)    $\min_{S \in \mathbb{S}^{r}} \ \langle C, V S V^T \rangle \quad \text{subject to} \quad \|\mathcal{A}(V S V^T) - b\| \le \eta, \quad S \succeq 0$

with variable $S \in \mathbb{S}^{r}$. Given a solution $\bar{S}$, we can form an approximate solution $\bar{X} = V \bar{S} V^T$ for the primal SDP Eq. P.
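
Under the same illustrative assumptions as the previous sketch (explicit constraint matrices, CVXPY in place of a matrix-free solver), Eq. MinObjSDP differs only in the objective and in the explicit infeasibility cap $\eta$:

```python
import numpy as np
import cvxpy as cp

def min_obj_sdp(V, C, A_mats, b, eta):
    """Minimize <C, V S V^T> over PSD S subject to ||A(V S V^T) - b|| <= eta."""
    r = V.shape[1]
    C_comp = V.T @ C @ V
    A_comp = [V.T @ A @ V for A in A_mats]
    S = cp.Variable((r, r), PSD=True)
    residual = cp.hstack([cp.trace(A_comp[i] @ S) for i in range(len(b))]) - b
    problem = cp.Problem(cp.Minimize(cp.trace(C_comp @ S)),
                         [cp.norm(residual, 2) <= eta])
    problem.solve()
    return V @ S.value @ V.T
```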

As we will see, both approaches lead to satisfactory solutions to the original SDP Eq. P under appropriate assumptions. Theorem 2 addresses the performance of Eq. MinFeasSDP, while Theorem 3 addresses the performance of Eq. MinObjSDP. Table 3 summarizes the hypotheses we impose to study each of the two problems, as well as the outcomes of the analysis.

The bounds in this section depend on the problem data and rely on assumptions that are not easy to check. Section 5 shows that there are easily verifiable conditions that allow us to calculate bounds on the quality of the matrices recovered from Eq. MinFeasSDP and Eq. MinObjSDP.

4.2 Analysis of Eq. MinFeasSDP

First, we establish a result that connects the solution of Eq. MinFeasSDP with the solution of the original problem Eq. P.

Theorem 2 (Analysis of Eq. MinFeasSDP)

Instate the hypotheses of Section 1.1. Moreover, assume the solution rank $r_\star$ is known, and set $r = r_\star$. Let $y$ be feasible for the dual SDP Eq. D with suboptimality $\varepsilon$ no larger than a constant that depends only on the problem data. Then the threshold obeys

and we have the bound

(11)

This bound shows that $\hat{X} \to X_\star$ as the dual suboptimality $\varepsilon \to 0$. Notice this result requires knowledge of the solution rank $r_\star$. The proof of Theorem 2 occupies the rest of this section.

4.2.1 Primal Optimizers and the Reduced Search Space

The first step in the argument is to prove that $X_\star$ is near the search space of the reduced problems. We will use this lemma again, so we state it under minimal conditions.

Lemma 2

Suppose Eq. P and Eq. D admit solutions and satisfy strong duality, Eq. 1. Further suppose $y$ is feasible and $\varepsilon$-suboptimal for the dual SDP Eq. D, and construct the orthonormal matrix $V$ as in Section 4.1. Assume that the threshold is positive. For any solution $X_\star$ of the primal SDP Eq. P,

and

Proof

Complete to form a basis for , where and where is the eigenvector of associated with the -th smallest eigenvalue. Rotating into this coordinate system, let’s compare and its projection into the space spanned by ,

Let , and . Using the unitary invariance of , we have

(12)

A similar equality holds for . Thus we need only bound the terms and . Applying Lemma 9 to , we have

(13)

Using strong duality and feasibility of , we rewrite the suboptimality as

(14)

Since all the vectors in have corresponding eigenvalues at least as large as the threshold , and as is feasible, we have

(15)

This inequality allows us to bound as

(16)

where we recall to obtain the second to last equality. Combining Eq. 13, Eq. 16, and , we have

(17)

Basic linear algebra shows

(18)

Combining pieces, we bound the error in the Frobenius norm:

where the first step uses Eq. 12 and the triangle inequality; the second step uses Eq. 16 and Eq. 18; and the third step uses Eq. 17.

Similarly, we may bound the error in the nuclear norm:

by the same reasoning. The final step uses a bound on the rank of the error matrix.

4.2.2 Relationship between the Solutions of Eq. MinFeasSDP and Eq. P

Lemma 2 shows that any solution of Eq. P is close to its compression onto the range of $V$. Next, we show that the recovered matrix $\hat{X} = V \hat{S} V^T$ is also close to $X_\star$. We can invoke strong convexity of the objective of Eq. MinFeasSDP to achieve this goal.

Lemma 3

Instate the assumptions from Lemma 2. Assume additionally that $\sigma_{\min}(\mathcal{A}_V) > 0$ and that the threshold is positive. Then

(19)

where $X_\star$ is any solution of the primal SDP Eq. P.

Proof

Since we assume that $\sigma_{\min}(\mathcal{A}_V) > 0$, we know the objective of Eq. MinFeasSDP is strongly convex (with parameter governed by $\sigma_{\min}(\mathcal{A}_V)$), and so the solution is unique. We then have, for any $S \succeq 0$,

(20)

where the first step uses strong convexity and the second is due to the optimality of $\hat{S}$.

Since , we can bound the objective of Eq. MinFeasSDP by :

(21)

Combining pieces, we know that satisfies

where the first step uses Eq. 20, the second uses Eq. 21, and the third uses Lemma 2. Lifting to the larger space, we see

where we use the unitary invariance of the norm. The inequality uses our bound above together with Lemma 2.

4.2.3 Lower Bounds for the Threshold and Minimum Singular Value

Finally, we must confirm that the extra hypotheses of Lemma 3 hold, viz., the nonsingularity of the compressed constraint map $\mathcal{A}_V$ and the positivity of the threshold.

Here is the intuition. Strict complementarity forces the threshold eigenvalue of $Z_\star$ to be positive. If $Z$ is close to $Z_\star$, then we expect the corresponding eigenvalue of $Z$ to be positive as well, by continuity. When $X_\star$ is unique, Lemma 8 implies that $\sigma_{\min}(\mathcal{A}_{U_\star}) > 0$. If $V$ is close to $U_\star$, then we expect that $\sigma_{\min}(\mathcal{A}_V) > 0$ as well. We have the following rigorous statement.

Lemma 4

Instate the hypotheses of Theorem 2. Then

Proof

We first prove the lower bound on the threshold . Using and quadratic growth (Lemma 1), we have

Thus for sufficiently small , we have . Substituting this bound into previous inequality gives,

(22)

Weyl’s inequality tells us that . Using Eq. 22, we see that for all sufficiently small ,

Next we prove the lower bound on $\sigma_{\min}(\mathcal{A}_V)$. We have $\sigma_{\min}(\mathcal{A}_{U_\star}) > 0$ by Lemma 8. It will be convenient to align the columns of $V$ with those of $U_\star$ for our analysis. Consider the solution to the associated orthogonal Procrustes problem. Since the relevant quantities are invariant under rotation by an orthonormal matrix, without loss of generality we suppose we have already performed the alignment in what follows.

Let . Then we have

(23)

Defining , we bound the term as