Efficient semidefinite bounds for multi-label discrete graphical models

11/24/2021
by   Valentin Durante, et al.
0

By concisely representing a joint function of many variables as the combination of small functions, discrete graphical models (GMs) provide a powerful framework to analyze stochastic and deterministic systems of interacting variables. One of the main queries on such models is to identify the extremum of this joint function. This is known as the Weighted Constraint Satisfaction Problem (WCSP) on deterministic Cost Function Networks and as Maximum a Posteriori (MAP) inference on stochastic Markov Random Fields. Algorithms for approximate WCSP inference typically rely on local consistency algorithms or belief propagation. These methods are intimately related to linear programming (LP) relaxations and often coupled with reparametrizations defined by the dual solution of the associated LP. Since the seminal work of Goemans and Williamson, it is well understood that convex SDP relaxations can provide superior guarantees to LP. But the inherent computational cost of interior point methods has limited their application. The situation has improved with the introduction of non-convex Burer-Monteiro style methods which are well suited to handle the SDP relaxation of combinatorial problems with binary variables (such as MAXCUT, MaxSAT or MAP/Ising). We compute low rank SDP upper and lower bounds for discrete pairwise graphical models with arbitrary number of values and arbitrary binary cost functions by extending a Burer-Monteiro style method based on row-by-row updates. We consider a traditional dualized constraint approach and a dedicated Block Coordinate Descent approach which avoids introducing large penalty coefficients to the formulation. On increasingly hard and dense WCSP/CFN instances, we observe that the BCD approach can outperform the dualized approach and provide tighter bounds than local consistencies/convergent message passing approaches.

READ FULL TEXT VIEW PDF

Authors

page 1

page 2

page 3

page 4

09/19/2017

Inference in Graphical Models via Semidefinite Programming Hierarchies

Maximum A posteriori Probability (MAP) inference in graphical models amo...
09/25/2017

Exact MAP inference in general higher-order graphical models using linear programming

This paper is concerned with the problem of exact MAP inference in gener...
09/14/2017

On Coordinate Minimization of Convex Piecewise-Affine Functions

A popular class of algorithms to optimize the dual LP relaxation of the ...
12/04/2020

Efficient semidefinite-programming-based inference for binary and multi-class MRFs

Probabilistic inference in pairwise Markov Random Fields (MRFs), i.e. co...
06/13/2017

Lost Relatives of the Gumbel Trick

The Gumbel trick is a method to sample from a discrete probability distr...
02/18/2021

Non-approximate Inference for Collective Graphical Models on Path Graphs via Discrete Difference of Convex Algorithm

The importance of aggregated count data, which is calculated from the da...
09/26/2013

Tighter Linear Program Relaxations for High Order Graphical Models

Graphical models with High Order Potentials (HOPs) have received conside...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction, Definition

Graphical models cooper2020graphical are descriptions of decomposable multivariate functions. A variety of frameworks in Computer Science, Logic, Constraint Satisfaction/Programming, Machine Learning, Statistical Physics, Operations Research and Artificial Intelligence define specific sub-classes of graphical models.

Definition 1.

(Graphical Model) A Graphical Model with co-domain B and a combination operator is defined by :

  • a sequence of n variables .

  • a set of functions . Each function is a function from . S is called the scope of the function and its arity. is the cartesian product of the domain of the variables within the scope.

defines a joint function:

Where is the projection of on : the sequence of values that the variables in takes in .

We consider additive numerical GMs, aka energy-based models that are used to describe infinite-valued or bounded cost functions in Constraint Programming, using Valued Constraint and Cost function Networks schiex1995valued,cooper2010soft or to describe probability distributions as Markov Random Fields or Bayesian Networks in the stochastic case koller2009probabilistic. GMs have many areas of applications, in image processing, computational biology, resource management, configuration, etc.

For our work we are interested in solving optimization queries. A GM is specified by a set of variables , where can take possible values and a set of potential or cost functions involving variables in , the scope of . We note the total number of values. An additive graphical model defines a joint function over as . Deterministic GMs use non negative integer functions and a saturating arithmetic. Stochastic models use real

functions, exponentiation and normalization to define a probability distribution. When all scopes have cardinality less than or equal to 2, the model is pairwise and its graph has vertices in

and edges defined by scopes.

Our optimization query is to find an assignment of the discrete variables that minimizes the joint function.

This problem is known as the Weighted Constraint Satisfaction Problem (WCSP). In the case of a pairwise GM, finding an assignment of the variables that minimizes the joint function can be expressed as an Integer Linear Program (ILP).

Minimize

Here, is a binary variable which is 1 if variable takes value , 0 otherwise. The second line describes the "exactly-one" constraints, a GM variable can take exactly one value within its domain. are 0/1 variables that encode assignments of pairs of variables, for all the binary cost functions. The second and third constraints ensure that those pair variables are assigned in a way which is consistent with the assignments of the variables.

Except for specific polynomial classes that include tree-width bounded models or models defined by submodular functions cooper2020graphical, the problem of finding an assignment of the variables in that optimizes the joint function

is decision NP-complete. While exact Branch and Bound based algorithms exist for the problem, a vast array of polynomial heuristics have been developed for the most challenging instances, that remain out of reach of exact algorithms. Relatively small very hard instances can be easily produced as random problems. Such hard instances also easily pop-up when Machine Learning algorithms are used to estimate GMs from data using regularized loglikelihood approaches, when tuning the regularizing penalty brouard2020pushing.

While the integer linear program is intractable in general, one can derive a Linear Programming (LP) relaxation of the original problem. The decision variables will now take continuous value in the interval . This LP relaxation, known as the local polytope Wer07, leads to a simpler polytope to optimize over, with a tractable number of constraints. If the LP relaxation returns an integer valued solution, this is an optimal solution of the original ILP wainwright2005map. However, even though solving LP is tractable, the number of constraints makes direct LP solvers inefficient on large instances. Efficient heuristics were developed to address this inefficiency.

A first family of heuristics derives from the seminal convergent arc consistency waltz1975understanding (or Boolean Constraint Propagation, BCP) algorithm introduced for Boolean graphical models (Constraint Networks). This algorithm was generalized by Pearl to numerical GMs as Loopy Belief Propagation pearl1988probabilistic, at the cost of losing convergence. Convergence was restored in Soft Arc Consistency algorithms schiex2000arc,cooper2010soft for the WCSP/CFN, culminating in Virtual Arc Consistency (VAC) cooper2008virtual. Independetly to the development of VAC, convergent belief propagation algorithms such as TRW-S kolmogorov2005convergent were introduced, which witness dual bounds by a potentially suboptimal LP dual solution, which can be interpreted as a reparametrizations of the MAP/MRF. Fixpoints of algorithms such as TRW-S share the same properties as those of VAC, but VAC additionally applies BCP on infinite-valued energies. These algorithms offer lower bounds for minimization. They also heuristically produce primal (discrete) heuristic solutions. Taken together, these offer post-hoc guarantees through an optimality gap, making such algorithms useful in time-constrained scenarios.

In many cases, these bounds correspond to high quality solutions and a small optimality gap. On the hardest instances, however, they produce primal solutions of high cost and poor lower bounds, failing to provide reasonable guarantees: efficient tighter bounds are needed. One natural direction to obtain such bounds is to use higher levels of the Sherali-Adams hierarchy SA90, which in practical terms means extending Arc Consistency and BP to a larger set of variables. This idea is captured by -consistency for Boolean GMs journals/cacm/Freuder78,journals/jacm/Freuder82 and CFNs nguyen2017triangle, or GBP yedidia2000generalized and convergent variants sontag2012tightening for MAP/MRF. The associated time and space costs quickly become extreme however, as moving up one step in the hierarchy multiplies time and space complexity by a factor equal to the average domain size. A different approach is to use a different hierarchy, namely to try to exploit SDP bounds which provide superior guarantees goemans1995improved compared to approximate LP bounds.

However, SDP has found limited applications in solving optimization queries in graphical models. The main reason lies in the fact that Goemans and Williamson’s SDP relaxation of MAXCUT, a specific case of MAP/MRF with binary variables, requires variables and memory resulting in an complexity for interior point methods. The algorithm doesn’t scale well and becomes impractical except on small very hard problems. It very quickly trades off too much in speed in favor of strength of inference. For this reason approximate LP methods such as those we describe above are preferred because they achieve a better trade off.

This has changed with the non-convex low-rank approach proposed by burer2003nonlinear. They exploited the result of barvinok1995problems and pataki1998rank, who showed that the rank of solutions of SDP problems is bounded by . Burer and Monteiro use a low-rank factorization of the semidefinite matrix in the SDP relaxation of these combinatorial problems. It has since been observed that, in practice, the rank of solutions is even lower, some software using a constant rank. Then, the number of variables and the memory requirements can be reduced to where is the decomposition rank used. As a side benefit, the decomposition guarantees the semidefiniteness of the matrix . This comes at the cost of addressing a non-convex optimization problem.

Combined with row-by-row updates javanmard2016phase,wang2017mixing, Riemannian gradient descent javanmard2016phase,mei2017solving or Riemannian trust-region methods absil2007trust, the Burer-Monteiro approach provides empirically very efficient methods to solve SDP relaxations of combinatorial optimization problems such as MAXCUT, MAXSAT wang2019low or MAP over Ising or Potts models pabbaraju2020efficient. The row-by-row update methods, with their efficient

updates are especially attractive.

Potts model MRFs, where pairwise functions are weighted identity matrices can still be directly dealt by using simple row-by-row updates pabbaraju2020efficient. In contrast, arbitrary pairwise GMs, with non binary variables and arbitrary functions need to explicitly deal with one additional constraint for each original variable of the considered GM. In the usual one-hot encoding of finite domain variables, a GM variable

with values is encoded as binary variables, each representing whether a value is used or not. The extra constraint specifies that a variable is assigned exactly one value. We add these constraints to the SDP relaxation and dualize them with a penalty of sufficient magnitude to guarantee they will be enforced lasserre2016max. Finally, we may try to get rid of these constraints altogether by shifting from row-by-row updates to Block Coordinate (BC) updates on blocks that include all the variables that represent the binary variables in the low-rank matrix . This approach requires however to have efficient BC updates.

We show that such BC updates can be performed efficiently while preserving convergence. We implement the two approaches described above and compare them with TRW-S and VAC.

2 SDP relaxation for pairwise GMs

Matrices are denoted with a capital boldface font and all vectors are column vectors. For matrix

, represents the entry at the -th row and -th column of . represents its -th row as a column vector. For a vector , denotes its Euclidean norm. The vector defined by the diagonal of is , the Frobenius inner product of matrices is denoted as . For a vector g, denotes the diagonal matrix with diagonal .

A pairwise GM with associated graph defines the joint function . We assume that the functions

are described as tensors, here matrices, vectors and a scalar respectively. For optimization, the constant term

can be ignored. Before building an SDP relaxation, the problem is reduced to a quadratic form by representing every finite domain variable as a vector of Boolean variables. The element of will take value when variable and value otherwise. The value of is and the value of is then . If we denote by the stacked vector , we then have where is the stacked vector of all and is a symmetric block matrix having block when and the zero matrix otherwise. Note that is sparse by block, according to the connectivity of the graph of . More importantly, has zero blocks on its diagonal since no pairwise function in connects a variable to itself.

When optimizing over instead of , one must enforce the fact that only one element of every Boolean vector is set to , i.e., that . This set of constraints can be gathered in a linear constraint where is a Boolean matrix in with and where is an -vector of . Finally, minimizing becomes equivalent to solving the following constrained quadratic program:

(1)

To build an SDP relaxation of the quadratic program above, centered variables are used and relaxed to norm-1 vectors of dimension . The problem above can be reduced to such formulation by using the variables . After this variable change, denoting , , and , the problem becomes:

(2)

Here, the initial exactly-one constraints become , with , . This change of variables creates a constant term in the objective value which can be ignored for optimization purposes. In this formulation, the matrix has the exact same sparsity as .

2.1 Lassere dualization of linear constraints

To get rid of the linear constraints in (2) and get a quadratic program, a usual approach is to dualize the linear constraint with a penalty that must be properly chosen lasserre2016max. This reduces the problem to a pure MAXCUT problem on which the Burer-Monteiro approach can be directly applied using its efficient row-by-row updates wang2017mixing. The relaxed problem then becomes:

(3)

As soon as , the solution of (3) and (1) are the same lasserre2016max. One can note at this point that if any potential function in the GM at hand contains functions of large amplitude, the above property requires large values of .

In order to convert linear terms to quadratic terms (for homogenization), we introduce the extended vector . Then, (3) equivalently asks to minimize where and (symmetric) is:

The usual SDP relaxation of the MAX-CUT problem can be written using the rank 1 matrix variable . Dropping the rank-1 constraint, the relaxation is:

(4)

burer2003nonlinear introduced the idea of using a low-rank factorization of to solve the SDP (4). This factorization was motivated by a proof by barvinok1995problems and pataki1998rank of the following theorem:

Theorem 1.

There exists an optimum of (4) with rank such that .

Every positive semidefinite matrix of rank can be factorized as a product of two rank matrices: . Then, (4) becomes:

(5)

With the row vector of , the number of variables shifts from to and the constraint becomes implicit as is always positive semidefinite. Several approaches exploit this idea. We use the mixing method wang2017mixing. It does efficient cyclic updates of the row vectors , all other row vectors being fixed:

The mixing method is known to recover the optimum of the convex SDP (4) as long as . One issue of the mixing method in the presence of dualized constraints generated by Lassere’s approach is that the row-by-row updates will be sensitive to the magnitude of the penalty . As we observed previously, may be large if the input GM has large terms. The large value of will create large coefficients in and then the value of will be dominated by few terms, possibly slowing down convergence. It is important to notice that if the solution to (4) nullifies the quadratic penalty term induced by the Lasserre dualization, it will implicitly satisfy the so-called gangster constraint zhao1998semidefinite. If we consider the Boolean vector that corresponds to the variable , the gangster constraint ensures that the off-diagonal entries of the matrix are all zeros. Indeed, , since at least one of the two terms is equal to zero. This can tighten the SDP relaxation bound.

2.2 SDP bound using block coordinate descent

We now consider an approach that exploits the specific structure of the matrix in the constrained quadratic formulation (2). Using the extended vector , the rank-1 matrix variable , dropping the rank-1 constraint and using a low rank factorization as above, the quadratic optimization problem (2) can be relaxed into the SDP (2.2):

with . The diagonal constraint can be rewritten . We note . For a given GM variable , represents the position of the first row representing in and . The exactly-one constraint can be rewritten .

Instead of doing row-by-row updates, we exploit the property that the matrix has zero blocks on the diagonal. For a given GM variable , we simultaneously optimize all the rows of that correspond to while keeping all other rows fixed. To simultaneously update all these rows, noting as before, we have to solve:

(6)

Notice that since the corresponding diagonal block matrix of is all zero, the do not depend on the optimized and can be considered as fixed. In the following, we make the assumption that the vectors and are never colinear (see wang2017mixing).

By the second constraint, every solution row vector must lie in the unit spherical manifold. Furthermore we observe that:

Theorem 2.

At the optimum of (6), every optimized vector lies in the dimension 2 subspace generated by and .

The full proof is provided in Appendix A.1. Given that lies on the unit sphere and in this dimension 2 subspace, it is entirely defined by the angle it makes with . With vectors on the sphere, the problem above can be rewritten using trigonometric functions, omitting the norm-1 constraints which are implicitly satisfied. To deal with the remaining exactly-one constraint, we write the Lagrangian dual using multiplier .

Theorem 3.

If we denote as the angle from to , the dual Lagrangian of problem (6) is:

The full proof is provided in Appendix A.2. This is a function on a single dimension, which we can optimize using the Newton method. We introduce a notation for the Newton method: we denote by the set of functions such that:

  • is times continuously differentiable on .

  • Its derivative is Lipschitz continuous on Q with constant L.

Theorem 4.

Rate of convergence of the Newton method nesterov2018lectures. Let us consider the problem

Let the function satisfy the following assumptions:

  • .

  • There exists a local minimum of the function with positive definite Hessian.

  • The initial starting point is close enough to .

Then the Newton’s method converges quadratically.

To reach quadratic convergence with the Newton method, we need to find a starting point which is close enough to . This starting point can be produced by analytically solving the Lagrangian dual of the following relaxation of the BCD problem (6):

(7)

where , and .

Theorem 5.

Let us define and . The solution to (7) is given by with

The full proof is provided in Appendix A.3.

Theorem 6.

The block-coordinate descent with fixed last column is decreasing.

The full proof is provided in Appendix A.4. This last theorem ensures that the block-coordinate descent will converge. The convergence speed using random row-by-row updates is linear in the neighborhood of a local optimum when the rank is sufficiently large wang2017mixing,erdogdu2018convergence.

Computational Complexity.

The update rule for the Newton method is and the first and second derivative of are:

(8)
(9)

Given that all vectors here have size , by pre-computing the dot products with , computing the first derivative requires operations. For the second derivative, we can pre-compute the numerators and evaluating the second derivative is again . Overall, one update of the Newton method will require only operations, which is also what a single round of row-by-row updates over the rows associated with would require. The BCD updates however benefit from the efficient Newton updates. In our experiments, we observe that only few iterations of the Newton method are needed.

2.3 Producing an integer primal solution

To produce a primal integer solution from the optimal solution of the convex relaxation, we use the usual random rounding approach of goemans1995improved. We generate a random vector

on the unit sphere by sampling each coordinate from a Gaussian distribution and then normalizing. We then assign variable

to the value that has the largest scalar product with . Following this, the integer solution obtained is submitted to a simple greedy search where, for every variable, the best improving change of value (if any) is applied. This is done repeatedly on every variable until a local minimum is reached. The same process is used for all convex relaxations considered. For every instance we repeat this scheme with 50 random vectors and the best solution is kept.

3 Experiments

All experiments were run on a server equipped with a Xeon®CPU E5-2680 v3 @ 2.50 GHz and 256GB or RAM running Debian Linux 4.19.98-1. All implementations are sequential and use a single thread of computation.

3.1 Description of solvers

We implemented the row-by-row updates with the dualized exactly-one constraints (LR-LAS) as well as the BCD update method (LR-BCD) in C++ with the Eigen3 eigenweb sparse matrix library. We use rank by default and tried variants with a low rank (down to ). These implementations are avaible at https://github.com/ValDurante/LR-BCD and compared with two traditional competitors: the convergent message passing MAP/MRF algorithm TRW-S, as implemented in Open-GM2 kappes2015comparative, an efficient C++ MIT-licensed library, in release 3.3.7, available at https://github.com/opengm. We used a maximum number of iterations 100000 and a tolerance of with TABLE mode activated. We also compared with the WCSP/CFN algorithm enforcing Virtual Arc Consistency cooper2010soft, as implemented in toulbar2 hurley2016multi, another C++ MIT-licensed library, in release 1.1.1, available at https://github.com/toulbar2/toulbar2. We used a resolution of (using flags -nopre -A -C=1000 to turn off all preprocessing but VAC). A primal solution is obtained by diving with default heuristics and no backtrack, simply by adding flag -bt=0 to the previous command line.

Overall, we therefore have four solvers: LR-LAS and LR-BCD (parameterized by their rank), TRW-S, and VAC. We also considered using the interior point method implemented in CVXPY agrawal2018rewriting, but preliminary tests showed that it ran several orders of magnitude slower than the Mixing Method on small problems with 120 binary variables from the BiqMac library at http://biqmac.uni-klu.ac.at/biqmaclib.html.

3.2 Description of instances

We used hard random instances of MAP/MRF and WCSP/CFN problems. We generated random pairwise problems using the WCSP/CFN generator included in toulbar2 with the bin-{n}-{d}-{t1}-{p2}-{seed} model. This generates a GM with n variables, d values, p2 pairwise functions, each containing a random fraction of potentials equal to 0, the rest being uniformly sampled between 1 and 3. The pairwise functions are sampled uniformly without replacement from the set of all possible edges. We generated different types of random problems, varying along the number of variables (50-100), number of values (3-10) and graph density (sparse-dense), keeping the potential functions in a 50% tightness regime. Dense problems have a complete graph while sparse problems have the number of pairwise functions set to 4 times the number of variables, giving an average degree of 8.

3.3 Results

We present in Table 1 a summary of the performance of all solvers on random instances. Table 1 shows that approximate LP bounds (VAC/TRW-S) are tighter and faster than SDP bounds on sparse problems while SDP becomes tighter than LP on dense problems. For SDP, LR-BCD is several orders of magnitude faster than LR-LAS, with similar bounds on the dense instances. However, on sparse instances, where TRW-S dominates anyway, LR-BCD lower bounds become looser. This can be explained by the fact that the LR-BCD does not enforce the gangster constraint, which is known to tighten SDP relaxations while LR-LAS enforces them.

For upper bounds, none of the four solvers strongly stands out. LR-LAS and LR-BCD provide bounds that are close in general. There are some instances where VAC seems to have better results, especially on sparse instances with large domain sizes while TRW-S is stronger on the dense instances with small domain sizes. However, LR-LAS and LR-BCD SDP-based upper bounds start to dominate on the larger and denser instances.

density sparse dense
n 50 100 50 100
d
lb VAC
TRW
BCD
LAS
ub VAC
TRW
BCD
LAS
gap (%) VAC
TRW
BCD
LAS
cpu (s) VAC
TRW
BCD
LAS
Table 1: Solver performances: lower bound (lb), upper bound (ub), optimality gap (gap) and cpu-time for all four methods (rows) on all types of random instances (columns).

These results suggest that the best result could be obtained using both SDP and propagation based bounds with just two threads of computations. In this scenario, we would use one thread to execute VAC or TRW-S, and another one to enforce LR-BCD. This combination will provide tighter gaps with enhanced upper and lower bounds along a large spectrum of problems at limited extra computational cost: on the hardest dense problems, where LR-BCD brings the strongest improvments, it is almost as fast as VAC or TRW-S. The use of LR-LAS to obtain stronger lower bounds is far more demanding in terms of cpu-time: on the hardest dense problems, where it could bring the most significant improvements, it runs 70 times slower than LR-BCD, as well as VAC/TRW-S.

3.3.1 Constant rank relaxations.

In Table 2, we empirically explore the world of very low rank relaxations: the rank is reduced from its theoretically safe value to reach values where the optimum value loses its status of lower bound. Indeed, it is important to recall that for a rank the mixing method is not guaranteed to converge to an optimum of the SDP relaxation.

Using such low ranks on the rd100-10-dense-0 () instance, we see that the LR-BCD method speeds-up as the rank decreases, the value of the primal and integer solutions remaining stable until very low ranks are used: the quality and speed of the method then worsen as the required number of iterations grows with very tiny ranks. LR-LAS also benefits from rank reduction since the value of the integer solution remains stable even with very low ranks. As upper-bounds seems mostly insensitive to rank reduction, rank reduction makes sense if quick reliable upper bounds are sufficient. To benefit from an optimality gap in this setting of very low ranks, a dual lower bound would need to be computed. The acceleration that BCD updates are clearly visible on this instance: iterations of LR-LAS corresponds to row updates. This is two order of magnitude more than Newton iterations for LR-BCD.

LR-BCD LR-LAS
Rank value ub cpu (s) iterations value ub cpu (s) iterations
BCD Newton
45 1890 4006 2.54 56 12854 2438 4006 212.33 5040
41 1890 4006 2.32 56 12742 2440 4006 192.57 4990
37 1890 4006 2.16 57 13020 2443 4006 181.17 5057
33 1890 4006 1.88 55 12618 2427 4006 159.39 4946
29 1890 4006 1.65 54 12305 2425 4006 146.72 5028
25 1890 4006 1.51 56 12881 2443 3972 129.25 5060
21 1890 4006 1.31 56 12903 2439 4006 113.51 5082
17 1890 3972 1.20 61 14151 2449 4006 96.12 5092
13 1890 4006 1.03 64 15086 2460 3972 79.59 5124
9 1891 3972 0.93 74 17795 2461 4003 62.78 5116
5 1910 4006 1.26 138 38037 2511 4006 45.08 5148
Table 2: Rank results on the rd100-10-dense-0 instance: optimum value (value), integer solution value (ub), cpu-time (cpu) and number of iterations. For LR-BCD, we report the number of matrix (BCD) and Newton updates. For LR-LAS, we report the number of matrix updates.

4 Conclusion

An increasing number of proposals for solving convex relaxations of discrete optimization problems rely on the Burer-Monteiro approach burer2003nonlinear. These approaches are fast and offer both a lower and upper bound which make them highly desirable when efficiency is required. They can also offer tighter gaps than the more usual LP-related local consistencies, especially on the hardest dense problems. However, to the best of our knowledge, all these proposals deal only with the simple case where only diagonal-one constraints exist. This is enough for MaxCUT, MAX2SAT, and MAP/Ising and even MAP/Potts (where potential functions are weighted identity matrices) or even MAXSAT (at the cost of a significantly weakened relaxation for -clauses, ).

But a variety of real world problems expressed as optimization over graphical models, in resource allocation, computational biology or image processing, naturally involve variables with many values and arbitrary pairwise functions. This work shows that the extension of the Burer-Monteiro approach to arbitrary domain sizes and functions is not straightforward: even if it improves over usual interior points methods, as implemented in CVXPY, the direct dualization of the required “exactly-one” constraints with adequate penalization leads to slow convergence, removing most of the practical interest of the Burer-Monteiro approach. Dedicated ways of dealing with these exactly-one constraints are needed. The BCD approach introduced is a clear step forward in this direction. With its fast Newton updates, it offers important speed-ups compared to the dualized variant, at the sole cost of relaxing the gangster constraint. On the hardest instances, where VAC/TRW-S can only provide very large gaps, the BCD approach we described is several orders of magnitude faster than interior points methods and orders of magnitude faster than the Lassere-dualized approach, while offering much tighter optimality gaps than VAC/TRW-S. It becomes the method of choice for hard problems and can be used in parallel with the linear bounds produced by VAC/TRW-S to get the best of both worlds, at modest additional runtime.

The possibility of relying on extremely low rank is another attractive capacity of the Burer-Monteiro method. We observe that the CPU-time that is required to produce an upper bound can be easily further reduced. A dual bound would then be able to produce an optimality gap that could exploited in the context of Branch-and-Bound methods, providing a desirable tight bounding method with an adjustable cpu/tightness compromise.

5 Acknowledgments and Disclosure of Funding

This work has benefited from the AI Interdisciplinary Institute ANITI funding, through the French “Investing for the Future PIA3” program under the Grant agreement n°ANR-19-PI3A-0004. The first author is also supported by the EUR BioEco program. The second and third authors are both supported by the "Décomposition de Modèles Graphiques – DE-MO-GRAPH" program under grant n°ANR-16-CE40-0028.

Appendix A Appendix

a.1 Proof of Theorem 2

Proof.

Let be the value of in an optimum of the problem above and assume that does not lie in the dimension 2 subspace generated by and . Let be the orthonormal basis of this subspace such that has a positive coordinate over . Then, can be written as where lies in the dimensional subspace orthogonal to the subspace generated by and and . Since, when , changing the sign of improves the criteria without changing the status of the “exactly-one” and norm-1 constraints, we must have . Let and . If we now consider a solution where has been replaced by , the “exactly one” constraint is still satisfied because the scalar product of with is unchanged. The norm constraint is still satisfied as . Then, the new solution improves the criteria given that and that . ∎

a.2 Proof of Theorem 3

Proof.

With being norm one and lying in the dimension 2 subspace generated by and , let be the angle from to in this subspace. We can rewrite the problem (6) as:

(10)

The Lagrangian is:

The partial derivatives of are which must all be equal to zero at the optimum. We use the fact that the sum of sinusoids with the same period and different phases is also a sinusoid:

We have:

which implies that

Since , we have:

By trigonometric identities, noting , we have:

Developing and simplifying and plugging the result back above, we get:

Similarly, one can derive:

Plugging this back into the Lagrangian, we get:

So the dual problem is to find that maximizes . The first derivative of is:

and the second derivative:

One can see that whenever and are not colinear which proves is strictly concave and guarantees the unicity of the optimum. Being strictly concave, we just have to find such that . ∎

a.3 Proof of Theorem 5

Let us first introduce a lemma that will be useful for the proof of this theorem.

Lemma 1.

The solution to (7) must lie in the subspace .

Let us consider and . For , we have with a particular solution to the linear equation and a vector such that . For , we take which gives us the desired solution . Since , the squared norm of must also satisfies .
Then, we compute the dot product . The first term is constant for all that lies in the feasible set . We just have to check the value of the second term . Let be the projection of into the hyperspace orthogonal to . We can write the dot product : since . Thus we only worry about the value of the dot product which is minimal in the direction . Since lies in the space generated by and we proved that the solution to (7) must lie in the space generated by and .

Proof.

Let us consider e.g. for some . The first constraint gives us:

By developing the squared norm we have:

Using the Cauchy-Schwartz inequality we know that whenever and are not colinear. Thus using the notation :

with

One can easily see that the objective value is smaller for which gives us the result. ∎

a.4 Proof of Theorem 6

Proof.

Let us show that the objective difference before and after a BCD update of is always positive. For a given column the objective difference before and after the update is:

with

Let us first check the first term. We know that:

Moreover:

Using the Cauchy-Schwartz inequality, we already know that