DeepAI

# The Mixing method: coordinate descent for low-rank semidefinite programming

In this paper, we propose a coordinate descent approach to low-rank structured semidefinite programming. The approach, which we call the Mixing method, is extremely simple to implement, has no free parameters, and typically attains an order of magnitude or better improvement in optimization performance over the current state of the art. We show that for certain problems, the method is strictly decreasing and guaranteed to converge to a critical point. We then apply the algorithm to three separate domains: solving the maximum cut semidefinite relaxation, solving a (novel) maximum satisfiability relaxation, and solving the GloVe word embedding optimization problem. In all settings, we demonstrate improvement over the existing state of the art along various dimensions. In total, this work substantially expands the scope and scale of problems that can be solved using semidefinite programming methods.

• 5 publications
• 17 publications
• 77 publications
12/15/2018

### Low-rank semidefinite programming for the MAX2SAT problem

This paper proposes a new algorithm for solving MAX2SAT problems based o...
08/07/2018

### On the integrality gap of the maximum-cut semidefinite programming relaxation in fixed dimension

We describe a factor-revealing convex optimization problem for the integ...
12/04/2020

### Efficient semidefinite-programming-based inference for binary and multi-class MRFs

Probabilistic inference in pairwise Markov Random Fields (MRFs), i.e. co...
04/20/2014

### Efficient Semidefinite Branch-and-Cut for MAP-MRF Inference

We propose a Branch-and-Cut (B&C) method for solving general MAP-MRF inf...
12/04/2020

### Community detection using fast low-cardinality semidefinite programming

Modularity maximization has been a fundamental tool for understanding th...
04/03/2013

### A Fast Semidefinite Approach to Solving Binary Quadratic Problems

Many computer vision problems can be formulated as binary quadratic prog...
05/31/2016

### Biconvex Relaxation for Semidefinite Programming in Computer Vision

Semidefinite programming is an indispensable tool in computer vision, bu...

## Code Repositories

### mixing

The Mixing method: coordinate descent for low-rank semidefinite programming

## 1 Introduction

This paper considers the solution of large-scale, structured semidefinite programming problems (SDPs). A generic SDP can be written as the optimization problem

 minimizeX⪰0⟨C,X⟩,subject % to⟨Ai,X⟩=bi,i=1…p (1)

where is the optimization variable (a symmetric matrix), and ,

are problem data. Semidefinite programs can encode a huge range of practical problems, including relaxations of many combinatorial optimization tasks

(Boyd and Vandenberghe, 2004), approximate probabilistic inference (Jordan and Wainwright, 2004), metric learning (Yang, 2006), matrix completion (Candes and Recht, 2012), and many others. Unfortunately, generic semidefinite programs involve optimizing a matrix-valued variable , where the number of variables grows quadratically so that it quickly becomes unaffordable for solvers employing exact methods such as primal-dual interior point algorithms.

Fortunately, a property of these problems, which has been recognized for some time now, is that the solution to such problems is often low-rank; specifically, the problem always admits an optimal solution with at most rank (Barvinok, 1995; Pataki, 1998), and many SDPs are set up to have even lower rank solutions in practice. This has motivated the development of non-convex low-rank solvers for these systems: that is, we can attempt to solve the equivalent (but now non-convex) formulation of the problem

 minimizeV∈Rk×n⟨C,VTV⟩,subject to⟨Ai,VTV⟩=bi,i=1…p

with the optimization variable . Here we are explicitly representing by the matrix of rank (typically with ), . Note that because we are representing in this way, we no longer need to explicitly enforce semidefiniteness, as it is implied by the change of variables. In a long series of work dating back several years, it has been shown that, somewhat surprisingly, this change to a non-convex problem does not cause as many difficulties as might be thought: in practice, local solutions to the problem tend to recover the optimal solution (Burer and Monteiro, 2003); assuming sufficient rank , all second order local optima of the problem are also global optima (Boumal et al., 2016); and it even holds that approximated local optima also have good approximation properties (Mei et al., 2017) for convex relaxations of some combinatorial problems. Despite these advances, solving for remains a practical challenge. Traditional methods for handling non-linear equality constraints, such as augmented Lagrangian methods and Riemannian manifold methods, suffer from slow convergence, difficulties in selecting step size, or other deficiencies.

In this paper, we present a low-rank coordinate descent approach to solving SDPs that have the additional specific structure of constraints (only) on the diagonal entries of the matrix (we consider the case of unit diagonal constraints, though we can easily extend to arbitrary positive numbers)

 minimizeX⪰0⟨C,X⟩,subject to% Xii=1,i=1…n. (2)

This is clearly a very special case of the full semidefinite program, but it also captures some fundamental problems, such as the famous semidefinite relaxation of the maximum cut (MAXCUT) combinatorial optimization problem; indeed, the MAXCUT relaxation will be one of the primary applications in this paper. In this setting, if we consider the above low-rank form of the problem, we show that we can derive the coordinate descent updates in a very simple closed form, resulting in an algorithm several times faster than the existing state of the art. We call our approach the Mixing method, since the updates have a natural interpretation in terms of giving each as a mixture of the remaining terms. We will also show, however, that the method can be applied to other problems as well, such as a (novel, to the best of our knowledge) relaxation of the MAXSAT problem, as well as an unconstrained quadratic SDP similar to the GloVe word embedding algorithm (Pennington et al., 2014).

On the theoretical side, we prove several strong properties about the Mixing method. Most notably, despite the fact that the method is solving a non-convex formulation of the original MAXCUT SDP, it nonetheless will converge to the true global optimum of the original SDP problem, provided the rank is sufficient, (which is of course much smaller than optimizing over the entire variables of the original SDP). We prove this by first showing that the method is strictly decreasing, and always converges to a first-order critical point. We then show that all non-optimal critical points (that is, all points which are critical point in the non-convex formulation in , but not optimal in in the original SDP), are unstable, i.e., they are saddle points in and will be unstable under updates of the Mixing method; this means that, in practice, the Mixing method is extremely unlikely to converge to any non-optimal solution. However, to formally prove that the method will converge to the global optimum, we consider a slightly modified “step size" version of the algorithm, for which we can prove formally that the method indeed converges to the global optimum in all cases (although in practice such a step size is not needed). Finally, for both the traditional and step size versions of the algorithm, we show that the Mixing method attains locally linear convergence around the global optimum. The primary tools we use for these proofs require analyzing the spectrum of the Jacobian of the Mixing method at non-optimal critical points, showing that it is always unstable around those points; we combine these with a geometric analysis of critical points due to Boumal et al. (2016) and a convergence proof for coordinate gradient descent due to Lee et al. (2017) to reach our main result (both results require slight specialization for our case, so are included for completeness, but the overall thrust of those supporting points are due to these past papers).

###### Contributions of the current work

In summary, the main contributions of this paper are: 1) We propose a low-rank coordinate descent method, the Mixing method, for the diagonally constrained SDP problem, which is extremely fast and simple to implement. 2) We prove that despite its non-convex formulation, the method is guaranteed to converge to global optimum of the original SDP with local linear convergence, provided that we use a small rank . 3) We evaluate the proposed method on the MAXCUT SDP relaxation, showing that it is 10-100x times faster than the other state-of-the-art solvers and scales to millions of variables. 4) We extend the MAX-2SAT relaxation of Goemans and Williamson (1995) to general MAXSAT problems, showing that the proposed formulation can be solved by the Mixing method in linear time to the number of literals. Further, experiments show that the proposed method has much better approximation ratios than other approximation algorithms, and is even comparable to the best partial MAXSAT solvers in some instances.

## 2 Background and related work

##### Low-rank methods for semidefinite programming

Given an SDP problem with constraints, it was proven by Barvinok (1995); Pataki (1998) that, if the problem is solvable, it admits solutions of rank . That is, we have solutions satisfying , such that . Thus, if we can solve the problem in the space of , we can ignore the semidefinite constraint and also have many fewer variables. The idea of using this low-rank structure during optimization was first proposed by Burer and Monteiro (2003) in their solver SDPLR, in which they solve the low-rank problem with L-BFGS on the extended Lagrangian problem. Since then, many low-rank optimization algorithms have been developed. One of the most notable is the Riemannian trust region method introduced by Absil et al. (2009). They considered the Riemannian manifold of low-rank structures, and extended the non-linear conjugate gradient method to work on the manifold. Later, Boumal and Absil (2015) improved the method by including preconditioned CG; these methods are implemented in the popular Manopt package (Boumal et al., 2014).

Somewhat surprisingly, all the above low-rank methods are observed to converge to a global optimum in practice (Burer and Monteiro, 2003; Absil et al., 2009; Boumal et al., 2014). However, to the best of the authors’ knowledge, there is not yet a general proof on convergence to the globally optimal solution without strong assumptions. Absil et al. (2009, Theorem 7.4.2) proved that their method converges to the critical point under a sufficient decrease condition, and super-linear convergence near an isolated local minimizer when the Riemannian Hessian is positive definite. Nonetheless, the results do not apply to the linear objective in our problem. Boumal et al. (2016, Theorem 2) proved that, for sufficiently large , all second-order optimal solutions are globally optimal for almost all cost matrices . However, this alone does not imply converges to . Sato and Iwai (2015) proved the convergence to a critical point without assumptions for a modified Riemannian conjugate gradient method, and Boumal et al. (2018, Theorem 12 and Proposition 19) proved that the Riemannian trust region method converges to a solution with Hessian larger than in and provide bounds on when . Recently, Bhojanapalli et al. (2018) proved the connection between , the norm of gradients, and for the unconstrained penalty form , but this does not state the relationship between after projecting the unconstrained solution back to the feasible space.

Other existing results on global convergence to the optimal on low-rank problems do not apply to our problem setting. For example, Bhojanapalli et al. (2016) proved that under a certain spectral initialization, the gradient descent method converges to the global optima for unconstrained low-rank SDP. Park et al. (2016) further proved that the method works for norm-constrained low-rank SDP problems when the feasible space of is convex. Lee et al. (2016, 2017); O’Neill and Wright (2017) proved that, under random initialization, first-order methods converge to local minimizers for unconstrained problems. However, O’Neill and Wright (2017) only concerns the unconstrained optimization in Euclidean space, and results in Lee et al. (2016, 2017) do not work in the spherical manifold due to the singularity in the Jacobian (even with step size). Because of these issues, certifying global convergence for low-rank SDP methods typically requires a two-stage algorithm, where one iteratively solves the low-rank problem via some “inner” optimization, then checks for a certificate of (global) optimality via the dual objective of the original SDP and inflates the rank if the solution is not optimal (i.e., the “outer” iteration) (Burer and Monteiro, 2003)

. Even here, a globally optimal solution is not theoretically guaranteed to be achieved with reduced rank in all cases, but at least can be verified after the fact. Computing this dual solution, however, typically requires solving an eigenvalue problem of the size of the original

matrix, so is often avoided entirely in practice in favor of just solving one low-rank problem, and this is the setting we consider here.

##### Approximation algorithms for MAXCUT and MAXSAT

Semidefinite programming has many applications in approximation algorithms for NP-complete problems. In particular, Goemans and Williamson (1995) proposed a classical SDP relaxation on the MAXCUT and MAX-2SAT problems, which has a 0.878 approximation guarantee. Experiments on MAX-2SAT (Gomes et al., 2006)

show that the SDP upper bound and lower bound are much tighter than the classical linear programming relaxation for MAXSAT

(Goemans and Williamson, 1994). While traditionally SDPs are more expensive to solve than linear programming, we will show here that with our approach, SDP relaxations can achieve substantially better results than linear programming in less time.

## 3 The Mixing method

As mentioned above, the goal of the Mixing method is to solve the semidefinite program (2) with a unit diagonal constraint. As discussed, we can replace the constraint with for some ; when we do this, the constraint translates to the constraint , where is the th column of . This leads to the equivalent (non-convex) problem on the spherical manifold

 minimizeV∈Rk×n⟨C,V⊤V⟩subject to∥vi∥=1,i=1…n. (3)

Although the problem is non-convex, it is known (Barvinok, 1995; Pataki, 1998) that when , the optimal solution for can recover the optimal solution for .

We consider solving the problem (3) via a coordinate descent method. The resulting algorithm is extremely simple to implement but, as we will show, it performs substantially better than existing approaches for the semidefinite problems of interest. Specifically, the objective terms that depend on are given by . However, because we can assume that without affecting the solution of the optimization problem. Thus, the problem is equivalent to simply minimizing the inner product (where ), subject to the constraint that ; this problem has a closed form solution, simply given by . Put in terms of the original variable, this is simply the update

 vi:=normalize(−n∑j=1cijvj).

This way, we can initialize on the unit sphere and perform cyclic updates over all the in closed-form. We call this the Mixing method, because for each

our update mixes and normalizes the remaining vectors

according to weight . In the case of sparse (which is common for any large data problem), the time complexity for updating all variables once is , where is the rank of and is the number of nonzeros in . This is significantly cheaper than the interior point method, which typically admits complexities cubic in . However, the details for efficient computation differ depending on the precise nature of the SDP, so we will describe these in more detail in the subsequent application sections. A complete description of the generic algorithm is shown in Algorithm 1.

### 3.1 Convergence properties of the Mixing methods

This section presents the theoretical analysis of the Mixing methods, which constitutes four main properties:

• We prove that the Mixing method is strictly decreasing in objective value and always converges to a first-order critical point over iterations.

• Further, we show that for a rank111The tightness of the rank (actually, rank satisfying ) is proved in Barvinok (2001). , all non-optimal critical points are unstable for the Mixing method. That is, when is non-optimal for the convex problem (2), the critical point will sit on a saddle of the non-convex problem (3) and the Mixing method tends to diverge from the point locally.

• To rigorously prove the global convergence, we show that a variant of the Mixing method with a proper step size converges to a global optimum almost surely under random initialization for almost every cost matrix , without any assumptions.

• Moreover, we prove that both Mixing methods (with or without step size) converge linearly to the global optimum when the solution is close enough, regardless of the rank or the existence of nearby non-optimal critical points.

In total, our method represents the first low-rank semidefinite programming method which will provably converge to a global optimum under constraints, and further in a locally linear rate. As mentioned above, our only assumption (satisfiable by picking a proper step size) is simply that the updates will not lead to a with zero norm before normalization.

###### Assumption

Assume that for all , do not degenerate in the procedure. That is, all norms are always greater than or equal to a constant .

###### Convergence to a critical point.

Our first property shows that the Mixing method strictly decreases, and always converges to a first-order critical point for any . This is a relatively weak statement, but useful in the context of the further proofs.

Under Assumption 3.1, the Mixing method on the SDP problem (2) is strictly decreasing and always converges to a first-order critical point.

The proof is in Appendix A, which mainly involves setting up the Mixing iteration in a matrix form, and showing that the difference in objective value between two iterations is given by a particular positive term based upon this form.

###### Instability of non-optimal critical points.

Next we prove the main result of our approach, that not only is the function decreasing, but that every non-optimal critical point is unstable; that is, although the problem (3) is non-convex, the algorithm tends to diverge (locally) from any solution that is not globally optimal. Further, the local divergence from non-optimal critical points and the global convergence to a critical points hint that the Mixing method can only converges to a global optimal solution.

Pick . Under Assumption 3.1, for almost all , all non-optimal first-order critical points are unstable fixed points for the Mixing method. The full proof is provided in Section 3.2. The main idea is to show that the maximum eigenvalue of the dynamics Jacobian, evaluated at a critical point but when is not optimal, is guaranteed to have a spectral radius (magnitude of the largest eigenvalue) greater than one. We do this by showing that the Jacobian of the Mixing method has the same structure as the Jacobian of a standard Gauss-Seidel update plus an additional projection matrix. By a property from Boumal et al. (2016)

, plus an analysis of the eigenvector of Kronecker products, we can then guarantee that the eigenvalues of the Mixing method Jacobian contains those of the Gauss-Seidel update. We then use an argument based upon

Lee et al. (2017) to show that the Gauss-Seidel Jacobian is similarly unstable around non-optimal critical points, proving our main theorem.

###### Globally optimal convergence of Mixing method with a step size.

Though the above two theorems in practice ensure convergence of the Mixing method to a global optimum, because the method makes discrete updates, there is the theoretical possibility that the method will “jump" directly to a non-optimal critical point (yet again, this has never been observed in practice). For completeness, however, in the next theorem we highlight the fact that a version of the Mixing method that is modified with a step size will always converge to a global optimum.

Consider the Mixing method with a step size . That is,

Take and , where denotes the -norm. Then for almost every , the method converges to a global optimum almost surely under random initialization.222

Any distribution will suffice if it maps zero volume in the spherical manifold to zero probability. For example, both spherical Gaussian distribution and normalized uniform distribution work.

The full proof is provided in Appendix C. The main difference in the proof from the step size free version is that, with a step size, we can prove the diffeomorphism of the Mixing method and thus are able to use the stable manifold theorem. Because the Jacobian of any feasible method on the spherical manifold is singular333Note that on the spherical manifold, the Jacobian of any feasible method is singular because the Jacobian of is singular. Thus, the proof in Lee et al. (2017, Section 5.5) does not carry over to the case of the Mixing method, even with a step size. This past proof required a non-singular Jacobian, and thus different techniques are required in our setting., we need to construct the inverse function explicitly and show the smoothness of such function. We can then use an analysis of the Gauss-Seidel method with a step size to show our result. To the best of our knowledge, this represents the first globally optimal convergence result for the low-rank method applied to (constrained) semidefinite programming, without additional assumptions such as the Cauchy decrease (Boumal et al., 2018, Assumption A3), or solving a sequence of “bounded” log-barrier subproblems exactly (Burer and Monteiro, 2005, Theorem 5.3).

###### Locally linear convergence.

Finally, as a last point of analysis, we show that the convergence of the Mixing methods exhibits linear convergence to the global optimum whenever the solution is close enough, regardless of the rank and the existence of nearby non-optimal critical points, for both versions with or without the step size. This also echoes practical experience, where the Mixing method does exhibit this rate of convergence.

The Mixing methods converge linearly to the global optimum when close enough to the solution, with step size (no assumption) or without step size (under Assumption 3.1).

The full proof is provided in Appendix D. We prove it by exploiting the Lipschitz smoothness (Lipschitz continuous gradient) of the Mixing mappings. The main difficulty here is that the corresponding linear system in the Gauss-Seidel method, , is semidefinite so that the corresponding Jacobian contains eigenvectors of magnitude in the null space of . We overcome the difficulty by proving the result directly in the function value space like Wang and Lin (2014) so that the eigenvectors in can be ignored. This is the first local linear convergence on the spherical manifold without assumption.

Assume there are nonzeros in . With Theorem 3.1 and 3.1, for almost every , the Mixing method with a step size admits an asymptotic complexity of to reach the global optimality gap of almost surely under random initialization. This concludes the theoretical analysis of the Mixing method.

Now we will prove one of our main result: the instability of non-optimal critical points.

### 3.2 Proof of Theorem 3.1: The instability of non-optimal criticals points

Before starting the proofs, we discuss our notations and reformulate the Mixing methods.

##### Notations.

We use upper-case letters for matrix, and lower-case letters for vector and scalar. For a matrix , refers to the -th column of , and denote the vector stacking columns and rows of , respectively. For a vector , denotes the matrix with on the diagonal, the maximum, the minimum, and the minimum nonzero element of , respectively. The symbol denotes the Kronecker product, the Moore-Penrose inverse, the vector of eigenvalues, the spectral radius, the -vector of length ,

the identity matrix of size

, the trace, the dot product, and the generalized L2 norm. Indices are reserved for matrix element, and index for iterations.

##### The Mixing methods.

We denote the cost matrix of the problem

 minimizeV∈Rk×nf(V)≡⟨C,VTV⟩subject to∥vi∥=1,i=1…n,

and w.l.o.g. assume that in all proofs. Matrices and refer to the current and the next iterate, and the global optimum attaining an optimal value in the semidefinite program (2).444By Pataki (1998), the optimality in (2) is always attainable by when . Let

 gi=∑jicijvj (4)

and matrix be the strict lower triangular part of . With these notations, the mapping of the Mixing method and its variant with step size can be written as555The reason to reformulate here is to avoid the “overwrite” of variables in the algorithmic definition. Moving the inverse term to the left-hand side, the reader can recover the original sequential algorithm.

 M(V)T =−(L+Dy)−1LTVT,where yi=∥gi∥,i=1…n,and (5) Mθ(V)T =(θL+Dy)−1(I−θL)TVT,where yi=∥vi−θgi∥,i=1…n. (6)

Note that both and are valid functions of because can be calculated from by the original algorithmic definitions in Section 3. This formulation is similar to the classical analysis of the Gauss-Seidel method for linear equation in Golub and Van Loan (2012), where the difference is that here is not a constant to and thus the evolution is not linear.

#### Proof of technical lemmas.

We start by analyzing the Jacobian of the Mixing method. Using the notation in (5), the Jacobian of the Mixing method is

 J(V)=−(L+Dy)−1⊗IkPLT⊗Ik,

in which is the rejection matrix of . That is,

 P=diag(P1,…,Pn)∈Rnk×nk,where Pi=Ik−^vi^vTi∈Rk×k.

Proof  Denote and the current and the next iterate. Taking total derivatives on the update of the Mixing method (7), we have

 yid^vi=−Pidgi=−Pi(∑jicijdvj),i=1…n.

Moving to the left-hand side. By the implicit function theorem, we have the Jacobian

 J(V)=−⎛⎜ ⎜ ⎜⎝y1Ik0…0c12P2y2Ik…0………0c1nPnc2nPn…ynIk⎞⎟ ⎟ ⎟⎠−1⎛⎜ ⎜ ⎜ ⎜⎝0c12P1…c1nP100……00…c(n−1)nPn−100…0⎞⎟ ⎟ ⎟ ⎟⎠.

The implicit function theorem holds here because the triangular matrix is always inversible. Rewrite with Kronecker product, we have equivalently

 J(V)=−(PL⊗Ik+Dy⊗Ik)−1PLT⊗Ik.

Further, we can push into the inverse so that

 (PL⊗Ik+Dy⊗Ik)−1P=(PL⊗Ik+PDy⊗Ik)†=((L+Dy)⊗Ik)−1P.

Thus, can be reformulated as

 J(V)=−(L+Dy)−1⊗IkPLT⊗Ik. \BlackBox

Note that on critical points, which is also a fixed point of the Mixing method. Now we demonstrate how to analyze the Jacobian. Remember the notation . This way, we have the following convenient property by the Kronecker product. For matrices , we have . By the above property, part of the spectrum of the Jacobian can be analyzed. [Overlapping Eigenvalues] Assume has . Let

 P=diag(P1,…,Pn),where Pi=Ik−vivTi.

Then for any , any eigenvalue of is also an eigenvalue of

 J=A⊗IkPB⊗Ik.

Because , by linear dependency there is a nonzero such that

 rTvi=0 for i=1…n⟹Pir=r for i=1…n.

Let be an eigenvector of with eigenvalue . Let . With Lemma 3.2,

 Jvect(Z) =A⊗IkPvect((Bq)rT)=A⊗Ik(Pir(Bq)Ti)i=1…n =A⊗Ikvect((Bq)rT)=vect((ABq)rT) =λvect(qrT).

Thus, every eigenvalue of is also an eigenvalue of . By the above lemma, the spectral radius of is lower bounded by , which can be again lower bounded as follows. For a positive vector , consider a matrix under the notation in (5)

 JGS=−(L+Dy)−1LT.

Let . When , the spectral radius .666If , we can prove that the spectral radius , in which all eigenvectors with magnitude reside in the null of , see Wang and Lin (2014, Corollary 3.4). However, the result is not used here. The proof is more technical and is given in Appendix B. Further, the assumption in Lemma  is fulfilled by the following property of critical points. (Boumal et al., 2016, Lemma 9) Let . Then, for almost all , all first-order critical points have rank smaller than . The proof is listed in Appendix E for completeness. Next, we characterize the optimality of by proving all non-optimal admits an . For a critical solution , denote , where . Then

 S⪰0⟺V is optimal.

Further, if is optimal, all are positive except when .777Let and . An immediate consequence of the lemma is that, for any feasible , . Further, suppose is also an optimum, then That is, is unique. Consider the dual problem of the SDP problem (2),

 maximizey∈Rn−1Tny,% subject toC+diag(y)⪰0.

If , variable becomes a feasible solution of the above dual problem. Further, since is a critical solution, we have

 VS=0⟹VTVS=0⟹tr(VTVC)=−tr(VTVdiag(y))=−1Tny,

which means that and are primal and dual optimal solutions that close the duality gap. Thus, for critical , implies optimality, and non-optimality implies .

On the other direction, when the solution is optimal, there will be a corresponding dual optimal solution satisfying

 VTV(C+diag(y))=0⟹vTiV(C+diag(y))=0,∀i⟹yi=∥Vci∥,∀i.

And follows from the dual feasibility. By the characterization of SPSD matrix, all submatrix of are SPSD. Thus, . If equality holds, by the same reason all submatrix , . This means .

#### Proof of Theorem 3.1

Consider the non-optimal critical point . If degenerates for any , by Assumption 3.1, the Mixing method would not converge to the solution. Thus, we only need to discuss the critical point with .

We first derive the Jacobian of the Mixing method in Lemma 3.2, which gives

 J=−(L+Dy)−1⊗IkPLT⊗Ik,

where and because on the critical point (also a fixed point). In Lemma , we prove that when , the eigenvalues of contain the eigenvalues of

 JGS=−(L+Dy)−1LT.

The assumption in Lemma  is fulfilled by Lemma 3.2, which guarantees that for almost every , all the first-order critical point must have . Further, Lemma 3.2 and 3.2 show that happens to be the Jacobian of the Gauss-Seidel method on a linear system, which has a spectral radius on the non-optimal first-order critical point . Thus, Lemma  implies , which means that all non-optimal first-order critical points are unstable for the Mixing method.

## 4 Applications

### 4.1 Maximum cut problem

The SDP MAXCUT relaxation is indeed the motivating example of the Mixing method, so we consider it first. In this section, we demonstrate how to apply our method to this problem, which originated from Goemans and Williamson (1995).

##### Problem description.

The maximum cut problem is an NP-hard binary optimization problem, which seeks a partition over a set of vertices , so that the sum of edge weights across the partition is maximized. If we denote the two partitions as , we can formulate the assignment of vertex as the following binary optimization problem

 maximizevi∈{±1},∀i12∑ijcij(1−vivj2).

Goemans and Williamson (1995) proposed that we can approximate the above solution by “lifting” the assignment from to a unit sphere in for sufficiently large as

 maximize∥vi∥=1,∀i12∑ijcij(1−vTivj2).

To recover the binary assignment, we can do a randomized rounding by picking a random vector on the unit sphere, and letting the binary assignment of vertex be . Their analysis shows that the approximation ratio for the NP-hard problem is , which means that the expected objective from the randomized rounding scheme is at least times the optimal binary objective.

##### Algorithm Design.

Because the problem can be solved by the unit diagonal SDP (2), we can apply the Mixing method directly, as presented in Algorithm 1. Further, for a sparse adjacency matrix , the coefficient can be constructed in time proportional to the nonzeros in column of . Thus, the time complexity of running a round of updates for all is , in which is at most .

### 4.2 Maximum satisfiability problem

Using similar ideas as in the previous section, Goemans and Williamson (1995) proposed that we can use SDP to approximate the maximum 2-satisfiability problem. In this section, we propose a formulation that generalizes this idea to the general maximum satisfiability problem, and apply the Mixing method to this problem. The proposed relaxation here is novel, to the best of our knowledge, and (as we will show) achieves substantially better approximation results than existing relaxations.

##### Problem description.

The MAXSAT problem is an extension of the well-known satisfiability problem, where the goal is to find an assignment that maximizes the number of satisfied clauses. Let

be a binary variable and

be the sign of variable in clause . The goal of MAXSAT can then be written as the optimization problem

 maximizev∈{−1,1}nm∑j=1n⋁i=11{sijvi>0}.

Note that most clauses will contain relatively few variables, so the vectors will be sparse. To avoid the need for an additional bias term, we introduce an auxiliary “truth” variable , and define . Then the MAXSAT problem can be approximated as

 maximizev∈{−1,1}nm∑j=11−∥Vsj∥2−(|sj|−1)24|sj|.

Although we will not derive it formally, the reader can verify that for any configuration , this term represents an upper bound on the exact MAXSAT solution.888Actually, the formula matches the approximation of Goemans and Williamson (1995) for MAX-2SAT. Similar to the MAXCUT SDP, we can relax the s to be vectors in with . This leads to the full MAXSAT semidefinite programming relaxation

 minimizeX⪰0⟨C,X⟩,subject % toC=m∑j=1wjsjsTj,Xii=1,i=0…n,

where .

##### Algorithm Design.

Because the matrix here is not sparse ( has non-sparse entries), we need a slightly more involved approach than for MAXCUT, but the algorithm is still extremely simple. Specifically, we maintain for all clauses . Because in each subproblem only one variable is changed, can be maintained in time, where denotes the number of clauses that contain variable . In total, the iteration time complexity is , where is the number of literals in the problem. Also, because applying arbitrary rotations to does not change the objective value of our problem, we can avoid updating . Algorithm 2 shows the complete algorithm. To recover the binary assignment, we apply the following classic rounding scheme: sample a random vector from a unit sphere, then assign binary variable as true if and false otherwise.

## 5 Experimental results

##### Running time comparison for MAXCUT

Figure 1 shows the results of running the Mixing method on several instances of benchmark MAXCUT problems. These range in size from approximately 1000 nodes and 20000 edges to approximately 2 million nodes and 3 million edges. For this application, we are largely concerned with evaluating the runtime of our Mixing method versus other approaches for solving the same semidefinite program. Specifically, we compare to DSDP (Benson and Ye, 2005), a mature interior point solver; SDPLR (Burer and Monteiro, 2003), one of the first approaches to exploit low-rank structures; Pure-RBR (Wen et al., 2009, 2012), a coordinate descent method in the space, which outputs the best rank-1 update at each step; and Manopt (Boumal et al., 2014), a recent toolkit for optimization on Riemannian manifolds, with specialized solvers dedicated to the MAXCUT problem.999We didn’t compare to commercial software like MOSEK, an interior-point solver like DSDP, because it is not open-source and Boumal (2015) already showed that SDPLR is faster than MOSEK on diagonally constrained problems. To be specific, we use DSDP 5.8, SDPLR 1.03-beta, and Manopt 3.0 with their default parameters. For Manopt, we compare to a subroutine "elliptopefactory”, specially designed for diagonal-constrained SDP. For Pure-RBR, we implement the specialized algorithm (Wen et al., 2009, Algorithm 2) for MAXCUT SDP with a sparse graph in C++, which only requires a single pass of the sparse matrix per iteration. We omit the log barrier and initialize the RBR with full-rank . All experiments are run on an Intel Xeon E5-2670 machine with 256 GB memory, and all solvers are run in the single-core mode to ensure fair comparisons. As the results show, in all cases the Mixing method is substantially faster than other approaches: for reaching modest accuracy (defined as times the difference between the initial and optimal value), we are typically 10-100x faster than all competing approaches; only the Manopt algorithm ever surpasses our approach, and this happens only once both methods have achieved very high accuracy. Crucially, on the largest problems, we remain about 10x (or more) faster than Manopt over the entire run, which allows the Mixing method to scale to substantially larger problems.

##### Effectiveness of the Mixing method on approximating MAXSAT problems.

Unlike the previous experiment (where the focus was solely on optimization performance), in this section we highlight the fact that with the Mixing method we are able to obtain MAXSAT results with a high approximation ratio on challenging domains (as the problems are similar, relative optimization performance is similar to that of the MAXCUT evaluations). Specifically, we evaluate examples from the 2016 MAXSAT competition (Argelich et al., 2016)

and compare our result to the best heuristic

complete and partial solvers. Note that the complete solver produces a verified result, while the partial solver outputs a non-verified solution. Out of 525 problems solved in the complete track (every problem solved exactly by some solvers within 30 minutes during the competition), our method achieves an average approximation ratio of 0.978, and usually finds such solutions within seconds or less. Further, in some instances we obtain perfect solution faster than the best partial solvers. Figure 2 shows the progress of the approximate quality versus the running time. Beside the best heuristic solvers in MAXSAT 2016, we also show the approximation ratio over time for the well-known linear programming approximation (Goemans and Williamson, 1994) (solved via the Gurobi solver). Note that each point in the blue and green curves denote the approximation ratio of the output solution at the time, and the starting points of the curves denote the time that the solver output the first solution. In all cases the Mixing method gives better and faster solutions than the LP approximations.

## 6 Conclusion

In this paper we have presented the Mixing method, a low-rank coordinate descent approach for solving diagonally constrained semidefinite programming problems. The algorithm is extremely simple to implement, and involves no free parameters such as learning rates. In theoretical aspects, we have proved that the method converges to a first-order critical point and all non-optimal critical points are unstable under sufficient rank. With a proper step size, the method converges to the global optimum almost surely under random initialization. This is the first convergence result to the global optimum on the spherical manifold without assumption. Further, we have shown that the proposed methods admit local linear convergence in the neighborhood of the optimum regardless of the rank. In experiments, we have demonstrated the method on three different application domains: the MAXCUT SDP, a MAXSAT relaxation, and a word embedding problem (in the appendix). In all cases we show positive results, demonstrating that the method performs much faster than existing approaches from an optimization standpoint (for MAXCUT and word embeddings), and that the resulting solutions have high quality from an application perspective (for MAXSAT). In total, the method substantially raises the bar as to what applications can be feasibly addressed using semidefinite programming, and also advances the state of the art in structured low-rank optimization.

## Acknowledgement

We thank Gary L. Miller for his helpful discussion on the geometry of the unit sphere, and Simon S. Du for pointing out many recent references.

## Appendix A Proof of Theorem 3.1: Convergence to a critical point

Let for the Mixing method defined in (4) and (5). Then

 f(V)−f(^V)=n∑i=1yi∥vi−^vi∥2.

Recall the objective function w.r.t. variable while fixing all other variables is

 ⟨C,VTV⟩=∑i∑jcijvTivj=2vTi(∑jcijvj)+constant.

Note that the term is independent of because . Now consider the inner cyclic iteration of the Mixing method updating to . Because only those with are updated to , the objective value before updating to equals plus constants, Thus, the updates of the Mixing method can be written as

 ^vi=−gi/yi,where yi=∥gi∥ and gi=∑jicijvj,i=1…n. (7)

and the objective difference after updating to is

 2gTi(vi−^vi)=−2∥gi∥^vTi(vi−^vi)=2yi(1−vTi^vi)=yi∥vi−^vi∥2.

The result follows from summing above equation over .

##### Proof of Theorem 3.1

By Assumption 3.1 does not degenerate over iterations, so Lemma A implies that the Mixing method is strictly decreasing and admits a unique limit point for the sequence generated by the algorithm. Denote the corresponding for . Then being a fixed point of (5) implies

 ¯V(C+diag(¯y))=0,

which means the projected gradient of at is zero. Thus, together with the feasibility of , we prove that the Mixing method converges to a critical point.

## Appendix B Proof of Lemma 3.2: Divergence of Gauss-Seidel methods

Because the dynamics of the Gauss-Seidel method (GS) on the system

 minx∈Rnf(x),where f(x)≡xTSx,

has the same Jacobian as , proving is equivalent to proving the “linear divergence” of the Gauss-Seidel method, which cyclically optimizes each coordinate of . Further, since , there is an eigenvector of such that .

Consider the sequence generated by the GS. That is, , where is to the -th power. Let the initial solution of the system be so that . Because the Gauss-Seidel method is greedy in every coordinate updates, it is monotonically deceasing in the function value. Thus, there are only two cases for the sequence of function values: 1) the function value converges below zero; 2) the function value goes to negative infinity.

Denote the before updating the -th coordinate and let and . This way, only the -th coordinate between and is changed and the inner cyclic updates can be flattened as

 xr=zr1→zr2→…→zrn→zrn+1=xr+1. (8)
##### 1) When the function value converges.

The monotonic decreasing property of GS implies that the function difference converges to zero. By the same analysis in Lemma A,101010We can obtain the result by fixing in Lemma A to be a constant and let . The result can also be obtained by examining the coordinate updates of GS, which is already known in the literature. we have

 f(xr)−f(xr+1)=n∑i=1yi∥zri−zri+1∥2.

Thus, the flattened sequence convergences, which implies also converges. Let be the limit of the sequence . Being a limit of the GS sequence means that is a fixed point of GS, which implies and . This contradicts with the monotonic decreasing property of GS and the fact that .

##### 2) When the function value xTrSxr goes to negative infinity.

Because the spectrum of is bounded, we know that also goes to infinity. For simplicity, we focus on the -th iterate and write as . From the GS, is updated to , and we have

 f(zi)−f(zi+1)=yi∥zi−zi+1∥2=yi(zi,i+1yi(∑jcijzj,i))2=|eTiSzi|2/yi, (9)

where is the -th coordinate vector and the first equality is from . Then we have the following claim from Lee et al. (2017, cliam 1).

###### Claim

Assume be in the range of . There exists an index such that for some global constant that only depends on and .

The full proof of the claim is listed after this lemma for completeness. To fulfill the assumption, we can decompose , where is in the range of and is in the null of . Consider the flattened inner cyclic update like (8) but starting from such that111111Note that only is decomposed to in and in . Symbols are the GS iterates generated from and might not be in the range of .

Because map the null space of to itself,121212Consider such that . Then .

Further, because GS is coordinate-wise monotonic decreasing and the function decrease of a coordinate update is smaller than the whole cyclic update, by above equality and (9) we have

 f(xr)−f(xr+1)=f(z⊥\vruledepth0.0ptwidth1px\vruledepth0.0ptwidth1px1)−f(z⊥\vruledepth0.0ptwidth1px\vruledepth0.0ptwidth1pxn+1)≥(eTjSz⊥\vruledepth0.0ptwidth1px\vruledepth0.0ptwidth1pxj)2yj≥yjω2∥z⊥\vruledepth0.0ptwidth1px\vruledepth0.0ptwidth1pxj∥2.

The last inequality is from Claim B. Thus,

 (10)

Further, because