 # Bundle Method Sketching for Low Rank Semidefinite Programming

In this paper, we show that the bundle method can be applied to solve semidefinite programming problems with a low rank solution without ever constructing a full matrix. To accomplish this, we use recent results on randomly sketching matrix optimization problems and on the analysis of bundle methods. Under strong duality and strict complementarity of the SDP, we achieve Õ(1/ϵ) convergence rates for both the primal and the dual sequences, and the proposed algorithm outputs an O(√ϵ) approximate solution X̂ (measured by distance) with a low rank representation within at most Õ(1/ϵ) iterations.


## 1 Introduction

We consider solving semidefinite problems of the following form:

 maximize_{X∈Sn}  ⟨−C, X⟩   (P)
 subject to  AX = b,  X ⪰ 0,

where C ∈ Sn, A: Sn → Rd is linear, and b ∈ Rd. Denote the solution set as 𝒳⋆. To accomplish the task of solving (P), we consider the dual problem:

 minimize_{y∈Rd}  ⟨−b, y⟩   (D)
 subject to  A∗y ⪯ C,

whose solution set is denoted as 𝒴⋆. Then for all sufficiently large α, e.g., α larger than the trace of any solution (ding2019optimal, Lemma 6.1), we can reformulate this as

 minimize_{y∈Rd}  F(y) = ⟨−b, y⟩ − α min{λmin(C − A∗y), 0}.   (1)

We propose applying the bundle method to solve this problem, which generates a sequence of dual solutions yt. While the bundle method runs on the dual problem, a primal solution Xt can be constructed through a series of rank one updates corresponding to subgradients of F. However, maintaining such a primal solution greatly increases memory costs. Fortunately, the primal problem enjoys a low rank solution in many applications, e.g., matrix completion srebro2005rank and phase retrieval candes2013phaselift. Also, without specifying any detailed structure of the problem, there always exists a solution to (P) with rank r satisfying r(r+1)/2 ≤ d pataki1998rank.

To utilize the existence of such a low rank solution, we employ the matrix sketching methods introduced in tropp2017practical. The main idea is the following: the sketching method forms a linear sketch of the column and row spaces of the primal decision variable Xt, and then uses the sketched column and row spaces to recover the primal decision variable. The recovered decision variable approximates the original well if Xt is (approximately) low rank. Notably, we need not store the entire decision variable at each iteration, but only the sketch. Hence the memory requirements of the algorithm are substantially reduced.
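To make the sketch-and-recover idea concrete, the following minimal sketch (with made-up dimensions, and the rank truncation of the full procedure omitted for brevity) forms the two linear sketches of an exactly low rank matrix and recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 200, 3
k, l = 2 * r + 1, 4 * r + 3  # sketch sizes of the form used by tropp2017practical

# A rank-r positive semidefinite matrix standing in for the primal solution
G = rng.standard_normal((n, r))
X = G @ G.T

# Linear sketches of the column and row spaces of X
Psi = rng.standard_normal((n, k))
Phi = rng.standard_normal((l, n))
YC, YR = X @ Psi, Phi @ X  # only O(n(k + l)) numbers need to be stored

# Recover: orthonormalize the column sketch, then solve for a small core factor
Q, _ = np.linalg.qr(YC)
B = np.linalg.pinv(Phi @ Q) @ YR
X_hat = Q @ B

print(np.linalg.norm(X - X_hat) / np.linalg.norm(X))  # tiny: X is exactly rank r
```

Since X here is exactly rank r and the sketch sizes exceed r, the recovery is exact up to floating-point error; for approximately low rank matrices the recovery error is proportional to the best rank-r approximation error.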

#### Our Contributions.

Our proposed sketching bundle method produces a sequence of dual solutions yt and a sequence of primal solutions Xt (which are sketched by low rank matrices X̂t). This is done without ever needing to write down the primal solutions Xt, which can be a substantial boon to computational efficiency.

In particular, we consider problems satisfying the following pair of standard assumptions: (i) strong duality holds, meaning that there is a solution pair (X⋆, y⋆) satisfying

 p⋆ := ⟨−C, X⋆⟩ = ⟨−b, y⋆⟩ =: d⋆

and (ii) strict complementarity holds, meaning there is a solution pair (X⋆, y⋆) satisfying

 rank(X⋆)+rank(C−A∗(y⋆))=n.

Under these conditions, we show yt and Xt converge in terms of primal and dual feasibility and optimality. In particular, all three of these quantities fall below ϵ within Õ(1/ϵ²) iterations.

###### Theorem 1.1 (Primal-Dual Convergence).

Suppose the solution sets 𝒳⋆ and 𝒴⋆ are both compact, and that strong duality and a strict complementarity condition hold. For any ϵ > 0, Algorithm 1 with properly chosen parameters produces a solution pair Xt and yt with

 t ≤ Õ(1/ϵ²)

that satisfies

 approximate primal feasibility:  ∥b − AXt∥ ≤ ϵ,  Xt ⪰ 0,
 approximate dual feasibility:  λmin(C − A∗yt) ≥ −ϵ,
 approximate primal-dual optimality:  |⟨b, yt⟩ − ⟨C, Xt⟩| ≤ ϵ.

Moreover, we show that assuming all of the minimizers in 𝒳⋆ are low rank, the sketched primal solutions X̂t converge to the set of minimizers at the following rate.

###### Theorem 1.2 (Sketched Solution Convergence).

Suppose the solution sets 𝒳⋆ and 𝒴⋆ are both compact, strong duality and a strict complementarity condition hold, and all solutions in 𝒳⋆ have rank at most r. For any ϵ > 0, Algorithm 1 with properly chosen parameters produces a sketched primal solution X̂t with

 t ≤ Õ(1/ϵ²)

that satisfies

 E dist_F(X̂t, 𝒳⋆) ≤ ϵ.

## 2 Defining The Sketching Bundle Method

Our proposed proximal bundle method relies on an approximation of F given by a model F̃t, where F̃t is the maximum of the latest linearization of F and an aggregate lower bound F̄t formed as a convex combination of the lower bounds from previous subgradients. Notice that a subgradient of F at zt can be computed as

 vt = MinEigenvector(C − A∗zt) if λmin(C − A∗zt) ≤ 0, and vt = 0 otherwise,   (2)
 gt = −b + αA(vtv∗t) ∈ ∂F(zt).   (3)

Each iteration then computes a proximal step on this piecewise linear function given by

 zt+1 ∈ argmin_{y∈Rd} F̃t(y) + (ρ/2)∥y − yt∥².   (4)

for some ρ > 0. The optimality condition of this subproblem ensures that ρ(yt − zt+1) = s̄t+1 for some subgradient s̄t+1 ∈ ∂F̃t(zt+1). Our aggregate lower bound is updated to match this certifying subgradient: F̄t+1(y) = F̃t(zt+1) + ⟨s̄t+1, y − zt+1⟩. Then the following is an exact solution for the subproblem (4):

 θt ∈ argmax_{θ∈[0,1]} θ(F(zt) + ⟨gt, yt − zt⟩) + (1 − θ)F̄t(yt) − (1/2ρ)∥θgt + (1 − θ)s̄t∥²,   (5)
 zt+1 = yt − (1/ρ)(θtgt + (1 − θt)s̄t),   (6)

where s̄t = ∇F̄t denotes the aggregate subgradient.

If the decrease in value of F from yt to zt+1 is at least a β fraction of the decrease in value of F̃t from yt to zt+1, then the bundle method sets yt+1 = zt+1 (called a descent step). Otherwise the method sets yt+1 = yt (called a null step).
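As a concrete illustration, the subgradient computation (2)-(3) takes only a few lines. Here the linear map A is represented by a list of symmetric matrices A_i with (AX)_i = ⟨A_i, X⟩, so that A∗z = Σ_i z_i A_i; this dense layout is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def F_and_subgrad(z, C, A_list, b, alpha):
    """Evaluate the penalized dual objective (1) and one subgradient (2)-(3).

    Assumed layout (illustrative only): the linear map A acts entrywise as
    (A X)_i = <A_i, X>, so its adjoint is A* z = sum_i z_i A_i.
    """
    S = C - sum(z_i * A_i for z_i, A_i in zip(z, A_list))
    lam, V = np.linalg.eigh(S)      # eigenvalues in ascending order
    lam_min, v = lam[0], V[:, 0]    # minimum eigenpair of C - A*z
    F = -b @ z - alpha * min(lam_min, 0.0)
    if lam_min <= 0:                # case (2): use the minimum eigenvector
        g = -b + alpha * np.array([v @ A_i @ v for A_i in A_list])
    else:                           # otherwise v_t = 0, so g = -b
        g = -b
    return F, g
```

In practice one would compute only the minimum eigenpair (e.g., by a Lanczos method) rather than a full eigendecomposition.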

#### Extracting Primal Solutions Directly.

A solution to the primal problem (P) can be extracted from the sequence of subgradients (as these describe the dual of the dual problem). Begin with an initial primal solution X0. When each iteration updates the model F̄t+1, we could update our primal variable similarly:

 Xt+1 =θtαvtv∗t+(1−θt)Xt. (7)

As stated in Theorem 1.1, these primal solutions converge to optimality and feasibility. Alas, this approach requires us to compute and store the full n × n matrix Xt at each iteration. Assuming every solution is low rank, an ideal method would only require memory on the order of rn. The following section shows matrix sketching can accomplish this.
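For contrast, the dense update (7) is a one-liner but must touch all n² entries of the iterate at every step:

```python
import numpy as np

def dense_primal_update(X, v, theta, alpha):
    """The straightforward primal update (7): X <- theta*alpha*v v* + (1-theta)*X.

    Stores and rewrites the full n x n matrix, costing O(n^2) memory and time
    per iteration, which the sketched variant of the next section avoids.
    """
    return theta * alpha * np.outer(v, v) + (1 - theta) * X
```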

#### Extracting Primal Solutions Via Matrix Sketching.

Here we describe how the matrix sketching method of tropp2017practical can be used to store an approximation of our primal solution Xt. First, we draw two matrices with independent standard normal entries

 Ψ ∈ Rn×k with k = 2r + 1;  Φ ∈ Rl×n with l = 4r + 3.

Here r is chosen by the user. It either represents an estimate of the true rank of the primal solution or the user's computational budget for dealing with large matrices.

We use Ψ and Φ to capture the column space and the row space of Xt:

 YCt=XtΨ∈Rn×k,YRt=ΦXt∈Rl×n. (8)

Hence we initially have YC0 = X0Ψ and YR0 = ΦX0. Notice Algorithm 1 does not observe the matrix Xt directly. Rather, it observes a stream of rank one updates

 Xt+1=θtαvtv∗t+(1−θt)Xt.

In this setting, YC,t+1 and YR,t+1 can be directly computed as

 YC,t+1 = θtαvt(v∗tΨ) + (1 − θt)YCt,   (9)
 YR,t+1 = θtα(Φvt)v∗t + (1 − θt)YRt.   (10)

This observation allows us to form the sketches YCt and YRt from the stream of updates.
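A minimal sketch of the streaming updates (9) and (10), assuming NumPy arrays for the sketches (the function name `sketch_update` is illustrative):

```python
import numpy as np

def sketch_update(YC, YR, v, theta, alpha, Psi, Phi):
    """Apply the rank-one primal update (7) directly to the sketches,
    as in (9)-(10), without ever forming the n x n matrix X.

    Costs O(n(k + l)) per update instead of O(n^2).
    """
    YC_new = theta * alpha * np.outer(v, v @ Psi) + (1 - theta) * YC
    YR_new = theta * alpha * np.outer(Phi @ v, v) + (1 - theta) * YR
    return YC_new, YR_new
```

Running the sketched updates alongside the dense update (7) on the same stream reproduces XtΨ and ΦXt exactly, which is the content of (9)-(10).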

We then reconstruct the primal variable, obtaining the matrix X̂t, by

 YCt=QtRt,Bt=(ΦQt)†YRt,^Xt=Qt[Bt]r, (11)

where YCt = QtRt is the QR factorization of YCt, † denotes the Moore–Penrose pseudoinverse, and [·]r returns the best rank r approximation in Frobenius norm. Specifically, the best rank r approximation of a matrix B is UrΣrV∗r, where Ur and Vr hold the left and right singular vectors corresponding to the r largest singular values of B, and Σr is a diagonal matrix with the r largest singular values of B. In an actual implementation, we may only produce the factors defining X̂t at the end instead of reconstructing X̂t in every iteration.

Hence the expensive primal update (7) can be replaced by the much more efficient operations (9) and (10). Then a low rank approximation of Xt can be computed by (11).
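A compact sketch of the reconstruction (11), assuming NumPy conventions (the function name `reconstruct` and the explicit pseudoinverse are illustrative; a least-squares solve would be preferred in practice):

```python
import numpy as np

def reconstruct(YC, YR, Phi, r):
    """Recover a rank-r approximation from the sketches, following (11)."""
    Q, _ = np.linalg.qr(YC)                 # YC = Q R
    B = np.linalg.pinv(Phi @ Q) @ YR        # core factor B = (Phi Q)^dagger YR
    U, s, Vh = np.linalg.svd(B, full_matrices=False)
    B_r = U[:, :r] * s[:r] @ Vh[:r]         # [B]_r: best rank-r approximation
    return Q @ B_r                          # or return the factors Q, U_r, s_r, Vh_r
```

When the sketched matrix is exactly rank r, this recovers it up to floating-point error; in general the error is controlled by Lemma A.1.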

We remark that the reconstructed matrix X̂t is not necessarily positive semidefinite. However, this suffices for the purpose of finding a matrix close to 𝒳⋆. A more sophisticated procedure is available for producing a positive semidefinite approximation of Xt (tropp2017practical, Section 7.3).

## 3 Numerical Experiments

In this section, we demonstrate that Algorithm 1 equipped with the sketching procedure does solve problem instances from four important problem classes: (1) generalized eigenvalue (boumal2018deterministic, Section 5.1), (2) ℤ2 synchronization bandeira2018random, (3) Max-Cut goemans1995improved, and (4) matrix completion srebro2005rank. For all experiments, we set the parameters α, β, and ρ as in the analysis, where the optimal value is computed via MOSEK mosek2010mosek or prior knowledge of the problem. We present the convergence results in Figure 1 for the following problem instances (results for other problem instances are similar within each problem class). In the descriptions below, random symmetric matrices are drawn with upper triangular part (including the diagonal) consisting of independent standard Gaussians.

1. Generalized eigenvalue (GE): the cost matrix and the constraint matrix are drawn as independent random symmetric matrices as described above.

2. ℤ2 synchronization (Z2): the cost matrix is built from the all ones matrix plus a random symmetric noise matrix, the constraints fix the diagonal entries of X, and the right-hand side is the all ones vector.

3. Max-Cut (MCut): the cost matrix is the Laplacian matrix of the G1 graph Gset, and the constraints and right-hand side are the same as for ℤ2 synchronization.

4. Matrix completion (MComp): a random rank r matrix is generated, and the index set of observed entries is generated so that each index is included with a fixed probability, independently of everything else. The linear constraints fix the corresponding entries of X to the observed entries.

As can be seen from the experiments, within the given iteration budget the dual and primal objectives converge fairly quickly, except for the matrix completion problem. The infeasibility measured by ∥b − AXt∥ and the distance to the solution set are small for most of the problems except GE. In general, we note the convergence is quicker when the problems have rank one optimal solutions (GE and Z2) compared to problems with higher rank optimal solutions (MCut and MComp).

Figure 1: The upper, middle, and lower plots give the different convergence measures (as indicated in the titles) for the dual iterates zt and yt, the primal iterates Xt, and the sketched iterates X̂t, respectively. The colors correspond to different problem instances as indicated in the legend of the upper plots. Due to numerical error in updating Xt and the asymmetry of X̂t, we compute the minimum eigenvalue of their symmetrized versions; λmin((Xt + X∗t)/2) appears to be negative but very small due to numerical error.

## References

•  The University of Florida Sparse Matrix Collection: Gset group.
•  Afonso S Bandeira. Random laplacian matrices and convex relaxations. Foundations of Computational Mathematics, 18(2):345–379, 2018.
•  Nicolas Boumal, Vladislav Voroninski, and Afonso S Bandeira. Deterministic guarantees for burer-monteiro factorizations of smooth semidefinite programs. Communications on Pure and Applied Mathematics, 2018.
•  Emmanuel J Candes, Thomas Strohmer, and Vladislav Voroninski. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Communications on Pure and Applied Mathematics, 66(8):1241–1274, 2013.
•  Lijun Ding, Alp Yurtsever, Volkan Cevher, Joel A Tropp, and Madeleine Udell. An optimal-storage approach to semidefinite programming using approximate complementarity. arXiv preprint arXiv:1902.03373, 2019.
•  Yu Du and Andrzej Ruszczyński. Rate of convergence of the bundle method. J. Optim. Theory Appl., 173(3):908–922, June 2017.
•  Michel X Goemans and David P Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6):1115–1145, 1995.
•  APS Mosek. The MOSEK optimization software. Online at http://www.mosek.com, 54(2-1):5, 2010.
•  Gábor Pataki. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of operations research, 23(2):339–358, 1998.
•  Andrzej Ruszczyński. Nonlinear Optimization, volume 13. Princeton University Press, 2006.
•  Nathan Srebro and Adi Shraibman. Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pages 545–560. Springer, 2005.
•  Jos F Sturm. Error bounds for linear matrix inequalities. SIAM Journal on Optimization, 10(4):1228–1248, 2000.
•  Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Practical sketching algorithms for low-rank matrix approximation. SIAM Journal on Matrix Analysis and Applications, 38(4):1454–1485, 2017.
•  Joel A Tropp, Alp Yurtsever, Madeleine Udell, and Volkan Cevher. Randomized single-view algorithms for low-rank matrix approximation. 2017.

## Appendix A Proofs of Convergence Guarantees

### a.1 Auxiliary lemmas

###### Lemma A.1.

(Sketching Guarantee) [14, Theorem 5.1] Fix a target rank r. Let X be a matrix, and let (YC, YR) be a sketch of X of the form (8). The procedure (11) yields a rank r matrix X̂ with

 E∥X − X̂∥F ≤ 3√2 ∥X − [X]r∥F.

Here ∥·∥F is the Frobenius norm. Similar operator norm bounds hold with high probability.

###### Lemma A.2 (Compact sublevel set).

If a convex lower semicontinuous function has a compact nonempty solution set, then all of its sublevel sets are compact.

###### Proof.

Suppose some closed sublevel set is unbounded. Then there is a unit recession direction d such that x + τd remains in that sublevel set for every point x of the set and every τ ≥ 0. Since the recession cones of all nonempty sublevel sets of a convex lower semicontinuous function coincide, the solution set would then be unbounded as well. This violates the fact that the solution set is bounded, and the proof is completed. ∎

###### Lemma A.3 (Quadratic Growth).

[12, Section 4] If the solution sets 𝒳⋆ and 𝒴⋆ are compact and strict complementarity holds, then for any fixed α, there is some γ > 0 such that all X ⪰ 0 sufficiently close to 𝒳⋆ satisfy

 dist²_F(X, 𝒳⋆) ≤ γ(g(X) − g(X⋆)).

###### Proof.

Using the proof of Lemma A.5, the dual iterates remain bounded. The result in [12, Section 4] requires the relevant primal and dual sets to be compact. Using [10, Theorem 7.21], the optimization problem of minimizing the penalized objective g over X ⪰ 0 has the same solution set as the primal SDP (P) for some sufficiently large penalty parameter. Thus the compactness of the relevant sublevel sets is ensured by Lemma A.2, and the proof is completed. ∎

### a.2 Proof of Theorem 1.1

Let Dy denote the constant bounding the dual iterates and DX⋆ the constant associated with the primal solutions appearing below. Then we set the bundle method's parameters α, β, and ρ accordingly. We recall that the inner product ⟨·,·⟩ is the trace inner product for matrices and the dot product for vectors.

In the following three lemmas, we prove bounds on primal feasibility, dual feasibility, and optimality in terms of the dual objective gap F(yt) − F(y⋆). From this, we can conclude these quantities converge at the claimed rate, since Du and Ruszczyński recently showed that the bundle method's objective gap converges at a rate of Õ(1/t).

###### Lemma A.4 (Primal Feasibility).

At every descent step t, we have approximate primal feasibility

 Xt+1 ⪰ 0,
 ∥b − AXt+1∥ ≤ √(2ϵ(F(yt) − F(y⋆))/(βD²y)).
###### Proof.

Noting that Xt+1 is built out of a convex combination of the rank one matrices αvtv∗t, it is immediate that it is always a positive semidefinite matrix.

The definition of Xt+1 immediately gives the following alternative characterization of F̄t+1,

 ¯Ft+1(y)=−⟨C,Xt+1⟩−⟨b−AXt+1,y⟩.

Since we constructed to correspond to the first-order optimality condition of the subproblem (4), we have

 0=∇¯Ft+1(yt+1)+ρ(zt+1−yt).

Hence ∥b − AXt+1∥ = ρ∥zt+1 − yt∥. The distance traveled during any descent step can be bounded by the objective value gap as

 (ρ/2)∥yt+1 − yt∥² ≤ F(yt) − F̃t(yt+1) ≤ (F(yt) − F(yt+1))/β ≤ (F(yt) − F(y⋆))/β,

where the first inequality uses the fact that zt+1 minimizes the subproblem (4) and the second inequality uses the definition of a descent step. Combining this with our feasibility bound shows

 ∥b − AXt+1∥² ≤ 2ρ(F(yt) − F(y⋆))/β.

Then our choice of ρ completes the proof. ∎

###### Lemma A.5 (Dual Feasibility).

At every descent step t, we have approximate dual feasibility

 λmin(C − A∗yt+1) ≥ −(F(yt) − F(y⋆))/DX⋆.
###### Proof.

Standard strong duality and exact penalization arguments show for any solution X⋆,

 ⟨b, yt+1 − y⋆⟩ = ⟨AX⋆, yt+1⟩ − ⟨C, X⋆⟩
              ≤ ⟨X⋆, A∗yt+1 − C⟩
              ≤ −trace(X⋆) min{λmin(C − A∗yt+1), 0}.

Recalling our assumption on the size of α yields the claimed feasibility bound

 F(yt) − F(y⋆) ≥ F(yt+1) − F(y⋆)
             = ⟨−b, yt+1 − y⋆⟩ − α min{λmin(C − A∗yt+1), 0}
             ≥ −trace(X⋆) min{λmin(C − A∗yt+1), 0}. ∎
###### Lemma A.6 (Primal-Dual Optimality).

At every descent step t, we have approximate primal-dual optimality bounded above by

 ⟨b, yt+1⟩ − ⟨C, Xt+1⟩ ≤ trace(Xt+1)(F(yt) − F(y⋆))/DX⋆ + √(2ϵ(F(yt) − F(y⋆))/β)

and below by

 ⟨b, yt+1⟩ − ⟨C, Xt+1⟩ ≥ −((1 − β)/β)(F(yt) − F(y⋆)) − √(2ϵ(F(yt) − F(y⋆))/β).
###### Proof.

The standard duality analysis shows the primal-dual objective gap equals

 ⟨b, yt+1⟩ − ⟨C, Xt+1⟩ = ⟨AXt+1, yt+1⟩ − ⟨C, Xt+1⟩ + ⟨b − AXt+1, yt+1⟩
                      = ⟨Xt+1, A∗yt+1 − C⟩ + ⟨b − AXt+1, yt+1⟩.

Notice that the second term here is bounded above and below as

 |⟨b − AXt+1, yt+1⟩| ≤ √(2ϵ(F(yt) − F(y⋆))/(βD²y)) ∥yt+1∥ ≤ √(2ϵ(F(yt) − F(y⋆))/β)

by Lemma A.4. Hence we only need to show that the first term also approaches zero (that is, we approach holding complementary slackness).

An upper bound on this inner product follows from Lemma A.5 as

 ⟨Xt+1, A∗yt+1 − C⟩ ≤ −trace(Xt+1) min{λmin(C − A∗yt+1), 0} ≤ trace(Xt+1)(F(yt) − F(y⋆))/DX⋆.

Hence the claimed upper bound follows.

A lower bound on this inner product follows as

 ((1 − β)/β)(F(yt) − F(yt+1)) ≥ F(yt+1) − F̃t(yt+1)
                             = F(yt+1) − F̄t+1(yt+1)
                             = −α min{λmin(C − A∗yt+1), 0} + ⟨C, Xt+1⟩ − ⟨AXt+1, yt+1⟩
                             ≥ ⟨Xt+1, C − A∗yt+1⟩,

where the first inequality follows from the definition of a descent step. Hence

 ⟨b, yt+1⟩ − ⟨C, Xt+1⟩ ≥ −((1 − β)/β)(F(yt) − F(y⋆)) − √(2ϵ(F(yt) − F(y⋆))/β). ∎

### a.3 Proof of Theorem 1.2

Using the triangle inequality, we see that

 E dist_F(X̂t, 𝒳⋆) ≤ E∥X̂t − Xt∥F + dist_F(Xt, 𝒳⋆).   (12)

The first term of (12) is bounded by

 E∥X̂t − Xt∥F ≤ 3√2 ∥Xt − [Xt]r∥F ≤ 3√2 dist_F(Xt, 𝒳⋆),

where the first step is due to Lemma A.1, and the second step is because every X⋆ ∈ 𝒳⋆ has rank less than or equal to r and [Xt]r is the best rank r approximation of Xt in terms of Frobenius norm.

Combining the above inequalities, we see that

 E dist_F(X̂t, 𝒳⋆) ≤ (1 + 3√2) dist_F(Xt, 𝒳⋆).   (13)

Our task now is to estimate the rate of convergence of dist_F(Xt, 𝒳⋆). Denote by g the penalized primal objective whose minimizers coincide with 𝒳⋆. Using Lemma A.3, there is some γ > 0 such that all X sufficiently close to 𝒳⋆ satisfy

 dist²_F(X, 𝒳⋆) ≤ γ(g(X) − g(X⋆)).   (14)

Then the theorem follows from Theorem 1.1, as g(Xt) − g(X⋆) converges at a rate of Õ(1/t). ∎