# Accelerating Message Passing for MAP with Benders Decomposition

We introduce a novel mechanism to tighten the local polytope relaxation for MAP inference in Markov random fields with low state space variables. We consider a surjection of the variables to a set of hyper-variables and apply the local polytope relaxation over these hyper-variables. The state space of each individual hyper-variable is constructed to be enumerable, while the vector product of the state spaces of pairs of hyper-variables is not easily enumerable, making standard message passing inference intractable. To circumvent the difficulty of enumerating this vector product, we introduce a novel Benders decomposition approach. This produces an upper envelope describing the message, constructed from affine functions of the individual variables that compose the hyper-variable receiving the message. The envelope is tight at the minimizers, which are shared by the true message. Benders rows are constructed to be Pareto optimal and are generated using an efficient procedure targeted for binary problems.


## 1 Introduction

Linear Programming (LP) relaxations are powerful tools for finding the most probable (MAP) configuration in Markov random fields (MRF). The most popular LP relaxation, called the local polytope relaxation [13, 21, 11, 26, 19], is both too loose [20] to be of use on many real problems and too computationally demanding to be solved exactly using simplex or interior point methods [25]. This motivates the use of coordinate updates in the Lagrangian dual, which are commonly called "message passing" or fixed point updates. These updates can be applied jointly with tightening the local polytope relaxation in a cutting plane manner [20, 14].

We solve the local polytope relaxation over a graph corresponding to a surjection of the variables to sets called hyper-variables. Hyper-variables are constructed so that iterating through the state space of each hyper-variable is feasible. However, it need not be the case that iterating through the vector product of the state spaces of pairs of hyper-variables is feasible. For example, consider a pair of hyper-variables each corresponding to fifteen binary variables. The state spaces of the individual hyper-variables are of the enumerable size $2^{15}$, while the corresponding vector product is of intractable size $2^{30}$. Therefore traditional message passing approaches cannot be applied, as they rely on enumerating the vector product of the state spaces of hyper-variables.
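The size argument above can be checked with a few lines of arithmetic. This is only an illustrative sketch; the helper name `num_states` is ours and does not come from the paper.

```python
def num_states(num_binary_vars: int) -> int:
    """Number of joint configurations of a group of binary variables."""
    return 2 ** num_binary_vars

# One hyper-variable built from fifteen binary variables.
single = num_states(15)
# Vector product of the state spaces of a pair of such hyper-variables.
pair = single * single

print(single, pair)
```

Enumerating `single` states is cheap, while a table with `pair` entries per edge is what standard message passing would need to touch.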

In this paper we propose a Benders decomposition [4, 9, 7, 22] approach for computing the lower bounds on the true min-marginals needed for message passing. These lower bounds share common minimizers with the true min-marginals and the lower bound is tight at these points. We use these lower bounds in place of the true min-marginals during message passing. Our procedure is guaranteed to converge to a fixed point of the Lagrangian dual problem of the local polytope relaxation.

Our paper should be understood in relation to [23] which introduces nested Benders decomposition [5] to achieve efficient dynamic programming exact inference in high tree width binary MRFs. The core methodological idea is to exploit the fact that the high tree width MRF is composed of low state space variables which is achieved via the nested Benders decomposition. Our work can be understood as extending the approach of [23] to permit message passing inference in arbitrary MRFs.

Our work can be contrasted with its closest competitor [20], which is motivated by protein design problems that have large state spaces for individual variables. In contrast, we are motivated by problems in which the state spaces of individual variables tend to be small or binary, allowing for efficient specialized inference inspired by [23]. We now summarize the approach of [20], which is called max-product linear programming plus cycle pursuit (MPLPCP).

MPLPCP alternates between message passing inference, which is applied initially over the local polytope relaxation, and adding primal constraints/dual variables. When a fixed point is reached, MPLPCP carefully selects an additional primal constraint from a bank of such constraints that is used to circumvent the fixed point and potentially tighten the relaxation. This process iterates until the MAP is provably identified or the bank of such constraints is exhausted at a fixed point. MPLPCP benefits from two key properties: (1) tightening the relaxation does not require restarting inference and (2) the introduction of cutting planes in the primal allows for fixed points in the dual to be bypassed. In [20] constraints are only produced over triples of variables, and the selection of triples to add is a key bottleneck of inference [3].

### 1.1 Outline

We outline our paper as follows. First in Section 2 we formalize the problem of MAP inference and a corresponding message passing inference formulation [11]. Next in Section 3 we apply Benders decomposition to produce lower bounds on the min-marginals used in message passing inference. Then in Section 4 we apply our approach to binary MAP inference problems that occur in multi-person pose estimation [12, 17, 1, 24]. Finally we conclude and discuss extensions in Section 5.

## 2 Linear Program Relaxation for MAP Inference

Consider a directed graph with vertex set $V$ and edge set $E$ having $n$ vertices, where there is a bijection of vertices to variables $x = \{x_1, \dots, x_n\}$. We map a joint configuration of the variables $x$ to a cost using cost terms $\theta_i(x_i)$ for vertex $i \in V$ and $\theta_{ij}(x_i, x_j)$ for edge $(ij) \in E$, respectively. (Footnote 1: Without loss of generality, let us assume that $\theta_{ij}$ is non-positive. For a given problem this is achieved by subtracting the largest term present in each table and adding it to the objective. For binary pairwise problems there is a conversion in which each pairwise table has at most one non-zero element and this element is non-positive; see Appendix C.) We define a function $f$ over the variables as:

$$f(x;\theta) = \sum_{i \in V} \theta_i(x_i) + \sum_{(ij) \in E} \theta_{ij}(x_i, x_j) \tag{1}$$

The MAP problem is then defined as finding an assignment $x$ that minimizes the function $f$:

$$\min_{x} f(x;\theta) \tag{2}$$
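To make the objective concrete, the following sketch evaluates the cost function on a tiny binary chain and finds the minimizer by exhaustive search. The tables `theta_unary` and `theta_pair` are made-up numbers for illustration, and brute force is of course only viable for very small problems.

```python
from itertools import product

def f(x, theta_unary, theta_pair):
    """Cost of a joint configuration: unary terms plus pairwise terms."""
    cost = sum(theta_unary[i][xi] for i, xi in enumerate(x))
    cost += sum(table[x[i]][x[j]] for (i, j), table in theta_pair.items())
    return cost

def brute_force_map(theta_unary, theta_pair):
    """Enumerate every binary configuration and keep the cheapest one."""
    n = len(theta_unary)
    return min(product((0, 1), repeat=n),
               key=lambda x: f(x, theta_unary, theta_pair))

# Hypothetical three-variable chain 0 -> 1 -> 2 with non-positive pairwise terms.
theta_unary = [[0.0, 1.0], [0.5, 0.0], [0.0, 0.2]]
theta_pair = {(0, 1): [[0.0, 0.0], [0.0, -2.0]],
              (1, 2): [[0.0, 0.0], [0.0, -1.0]]}

x_star = brute_force_map(theta_unary, theta_pair)
print(x_star, f(x_star, theta_unary, theta_pair))
```

Here the pairwise rewards outweigh the unary penalties, so the all-ones configuration is selected.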

A standard formulation of inference is the following integer linear program (ILP), which we express using the over-complete representation $\mu_i(x_i) \in \{0,1\}$ and $\mu_{ij}(x_i, x_j) \in \{0,1\}$. Here $\mu_i$ and $\mu_{ij}$ describe a configuration of variable $i$ and pair $(ij)$ respectively.

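$$
\begin{aligned}
\min_{\mu \ge 0} \quad & \sum_{\substack{i \in V \\ x_i}} \theta_i(x_i)\,\mu_i(x_i) + \sum_{\substack{(ij) \in E \\ x_i, x_j}} \theta_{ij}(x_i, x_j)\,\mu_{ij}(x_i, x_j) \\
\text{s.t.} \quad & \sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i) \quad \forall [(ij) \in E, x_i] \\
& \sum_{x_i} \mu_{ij}(x_i, x_j) = \mu_j(x_j) \quad \forall [(ij) \in E, x_j] \\
& \sum_{x_i} \mu_i(x_i) = 1 \quad \forall i \in V \\
& \sum_{x_i, x_j} \mu_{ij}(x_i, x_j) = 1 \quad \forall (ij) \in E \\
& \mu_i(x_i) \in \{0,1\} \quad \forall [i \in V, x_i] \\
& \mu_{ij}(x_i, x_j) \in \{0,1\} \quad \forall [(ij) \in E, x_i, x_j]
\end{aligned} \tag{3}
$$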

By relaxing the integrality constraints in Eq 3 we recover a known relaxation called the local polytope relaxation [20].

### 2.1 Tightening the Local Polytope Relaxation

In this sub-section we consider a tighter relaxation than the local polytope over $x$ by mapping $x$ to a new space and solving the local polytope relaxation in that space. Consider a surjection of the variables $x$ to $m$ sets $Y_1, \dots, Y_m$ indexed by $p$, where $m<n$. Each $Y_p$ is associated with a variable $y_p$ (which we refer to as a hyper-variable) that describes the state of all variables associated with $Y_p$. We use $x^{y_p}_i$ to denote the state of variable $x_i$ associated with $y_p$ for $i \in Y_p$.

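Consider a directed graph with vertex set $\mathcal{V}$ and edge set $\mathcal{E}$, where there is a bijection of vertices to the sets $Y_p$. There is an edge between distinct $p, q$ in $\mathcal{E}$ if there is an edge $(ij) \in E$ such that either $i \in Y_p, j \in Y_q$ or $i \in Y_q, j \in Y_p$. We rewrite Eq 1 over $y$ below, using $\phi$ to aggregate the cost terms $\theta$.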

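$$
\begin{aligned}
f(y;\phi) &= \sum_{p \in \mathcal{V}} \phi_p(y_p) + \sum_{(pq) \in \mathcal{E}} \phi_{pq}(y_p, y_q) \\
\phi_p(y_p) &= \sum_{i \in Y_p} \theta_i(x^{y_p}_i) + \sum_{\substack{i \in Y_p \\ k \in Y_p \\ (ik) \in E}} \theta_{ik}(x^{y_p}_i, x^{y_p}_k) \\
\phi_{pq}(y_p, y_q) &= \sum_{\substack{i \in Y_p \\ j \in Y_q \\ (ij) \in E}} \theta_{ij}(x^{y_p}_i, x^{y_q}_j) + \sum_{\substack{i \in Y_p \\ j \in Y_q \\ (ji) \in E}} \theta_{ji}(x^{y_q}_j, x^{y_p}_i)
\end{aligned} \tag{4}
$$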

We rewrite Eq 2 in terms of $y$ and $\phi$ below.

$$\min_{x} f(x;\theta) = \min_{y} f(y;\phi) \tag{5}$$

The local polytope relaxation over Eq 5 may tighten a loose local polytope relaxation over $x$. As a trivial example, consider that there is only one set $Y_p$, meaning $m = 1$. In that case the relaxation over $y$ is tight by definition. Similarly, if the graph over hyper-variables forms a tree then the relaxation over $y$ is tight, since the local polytope relaxation is known to be tight for tree structured graphs [19].
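The aggregation of cost terms into hyper-variable tables can be sketched as follows for binary variables. This is a minimal illustration under our own assumptions: the function names, the `groups` list encoding the surjection, and the convention of storing one table per pair of hyper-variables with the smaller index first are all hypothetical choices, not the authors' implementation.

```python
from itertools import product

def aggregate(theta_unary, theta_pair, groups):
    """Build phi_p and phi_pq tables from theta; groups[p] lists the variables in Y_p."""
    owner = {i: p for p, g in enumerate(groups) for i in g}
    states = [list(product((0, 1), repeat=len(g))) for g in groups]
    phi_p = [{y: sum(theta_unary[i][y[k]] for k, i in enumerate(g)) for y in states[p]}
             for p, g in enumerate(groups)]
    phi_pq = {}
    for (i, j), table in theta_pair.items():
        p, q = owner[i], owner[j]
        ki, kj = groups[p].index(i), groups[q].index(j)
        if p == q:  # edge inside one hyper-variable: fold into phi_p
            for y in states[p]:
                phi_p[p][y] += table[y[ki]][y[kj]]
            continue
        a, b = min(p, q), max(p, q)  # assumed orientation: one table per pair, a < b
        tab = phi_pq.setdefault(
            (a, b), {(ya, yb): 0.0 for ya in states[a] for yb in states[b]})
        for (ya, yb) in tab:
            yp, yq = (ya, yb) if p == a else (yb, ya)
            tab[(ya, yb)] += table[yp[ki]][yq[kj]]
    return states, phi_p, phi_pq

def f_theta(x, theta_unary, theta_pair):
    return (sum(theta_unary[i][xi] for i, xi in enumerate(x))
            + sum(t[x[i]][x[j]] for (i, j), t in theta_pair.items()))

def f_phi(y, phi_p, phi_pq):
    return (sum(phi_p[p][yp] for p, yp in enumerate(y))
            + sum(t[(y[a], y[b])] for (a, b), t in phi_pq.items()))

# Hypothetical four-variable cycle grouped into two hyper-variables.
theta_unary = [[0.0, 0.3], [0.1, 0.0], [0.0, 0.4], [0.2, 0.0]]
theta_pair = {(0, 1): [[0, 0], [0, -1.0]], (1, 2): [[0, -0.5], [0, 0]],
              (2, 3): [[0, 0], [0, -0.7]], (3, 0): [[-0.6, 0], [0, 0]]}
groups = [[0, 1], [2, 3]]
states, phi_p, phi_pq = aggregate(theta_unary, theta_pair, groups)
best_x = min(f_theta(x, theta_unary, theta_pair) for x in product((0, 1), repeat=4))
best_y = min(f_phi(y, phi_p, phi_pq) for y in product(*states))
print(best_x, best_y)
```

The two minima agree because every cost term of the original problem is counted exactly once in the aggregated tables.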

### 2.2 Message Passing Inference over the Local Polytope Relaxation

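We write the dual form of the local polytope relaxation over $y$ below using real valued dual variables $\lambda$, which comprise the terms $\lambda_p$, $\lambda_{pq}$, $\lambda_{pq \to p}(y_p)$ and $\lambda_{pq \to q}(y_q)$.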

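$$
\begin{aligned}
\max_{\lambda} \quad & \sum_{p \in \mathcal{V}} \lambda_p + \sum_{(pq) \in \mathcal{E}} \lambda_{pq} \\
\text{s.t.} \quad & \phi_p(y_p) + \sum_{\substack{q \in \mathcal{V} \\ (pq) \in \mathcal{E}}} \lambda_{pq \to p}(y_p) + \sum_{\substack{q \in \mathcal{V} \\ (qp) \in \mathcal{E}}} \lambda_{qp \to p}(y_p) \ge \lambda_p \quad \forall [p \in \mathcal{V}, y_p] \\
& \phi_{pq}(y_p, y_q) - \lambda_{pq \to p}(y_p) - \lambda_{pq \to q}(y_q) \ge \lambda_{pq} \quad \forall [(pq) \in \mathcal{E}, y_p, y_q]
\end{aligned} \tag{6}
$$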

In Alg 1 we display the common way of optimizing Eq 6, which iterates over $p \in \mathcal{V}$ and exactly optimizes all $\lambda$ terms in which $p$ is an index, leaving all others fixed [19]. For ease of reading the remainder of this section, and with some abuse of notation, during the inner loop operation over $p$ we flip $(qp)$ to $(pq)$ for any $q$.

We now consider the increase in the objective at a given iterate over $p$. We use the helper term $\nu_{pq}(y_p)$ defined below, which is commonly referred to as a "min-marginal" [19].

$$\nu_{pq}(y_p) = \min_{y_q} \phi_{pq}(y_p, y_q) - \lambda_{pq \to q}(y_q) \tag{7}$$
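A direct evaluation of the min-marginal enumerates every pair of joint states, which is exactly the cost that becomes prohibitive for large hyper-variables. The sketch below shows that enumeration on a toy table; the table `phi_pq` and the message `lam_q` are made-up values used only for illustration.

```python
from itertools import product

def min_marginals(phi_pq, lam_q, states_p, states_q):
    """Exact min-marginal for every y_p, touching all (y_p, y_q) pairs."""
    return {yp: min(phi_pq[(yp, yq)] - lam_q[yq] for yq in states_q)
            for yp in states_p}

states_p = list(product((0, 1), repeat=2))
states_q = list(product((0, 1), repeat=2))
# Hypothetical pairwise table: reward -1 whenever the two joint states agree.
phi_pq = {(yp, yq): (-1.0 if yp == yq else 0.0)
          for yp in states_p for yq in states_q}
# Hypothetical message lambda_{pq->q}(y_q).
lam_q = {yq: 0.25 * sum(yq) for yq in states_q}

nu = min_marginals(phi_pq, lam_q, states_p, states_q)
print(nu)
```

The work is proportional to `len(states_p) * len(states_q)`, which for two groups of fifteen binary variables would be the vector product discussed in the introduction.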

The increase in the objective is described below, using $\lambda^{\downarrow}$ and $\lambda^{\uparrow}$ to denote the $\lambda$ terms before and after the update, respectively.

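$$
\begin{aligned}
& \Big(\lambda^{\uparrow}_p + \sum_{\substack{q \in \mathcal{V} \\ (pq) \in \mathcal{E}}} \lambda^{\uparrow}_{pq}\Big) - \Big(\lambda^{\downarrow}_p + \sum_{\substack{q \in \mathcal{V} \\ (pq) \in \mathcal{E}}} \lambda^{\downarrow}_{pq}\Big) \\
&= \Big(\min_{y_p} \phi_p(y_p) + \sum_{\substack{q \in \mathcal{V} \\ (pq) \in \mathcal{E}}} \nu_{pq}(y_p)\Big) \\
&\quad - \Big(\min_{y_p} \Big[\phi_p(y_p) + \sum_{\substack{q \in \mathcal{V} \\ (pq) \in \mathcal{E}}} \lambda^{\downarrow}_{pq \to p}(y_p)\Big] + \sum_{\substack{q \in \mathcal{V} \\ (pq) \in \mathcal{E}}} \min_{y_p} \Big[\nu_{pq}(y_p) - \lambda^{\downarrow}_{pq \to p}(y_p)\Big]\Big)
\end{aligned} \tag{8}
$$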

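Observe that the iterate over $p$ in Alg 1 tightens the lower bound when there is no setting of $y_p$ that simultaneously minimizes $\phi_p(y_p) + \sum_q \lambda_{pq \to p}(y_p)$ and, for each $q$, $\nu_{pq}(y_p) - \lambda_{pq \to p}(y_p)$, prior to updating the $\lambda$ terms.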
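The increase can be evaluated numerically as the dual objective after the update minus the dual objective before it. The sketch below does this for one hyper-variable with two neighbors; the function name and the toy tables are our own assumptions. It illustrates the observation above: the gain is positive when no single state minimizes all terms at once, and zero when one does.

```python
def dual_increase(phi_p, nus, lam_down):
    """Dual objective after the update at p minus the objective before it.

    nus[k][y] is the min-marginal from the k-th neighbor; lam_down[k][y] is the
    message lambda_{pq->p}(y) from that neighbor before the update.
    """
    states = list(phi_p)
    after = min(phi_p[y] + sum(nu[y] for nu in nus) for y in states)
    before = min(phi_p[y] + sum(lam[y] for lam in lam_down) for y in states)
    before += sum(min(nu[y] - lam[y] for y in states)
                  for nu, lam in zip(nus, lam_down))
    return after - before

phi_p = {0: 0.0, 1: 0.0}
lam_down = [{0: 0.0, 1: 0.0}, {0: 0.0, 1: 0.0}]

# Neighbors disagree on the best state of y_p: the update improves the bound.
gain_disagree = dual_increase(phi_p, [{0: 0.0, 1: 1.0}, {0: 1.0, 1: 0.0}], lam_down)
# Neighbors agree on state 0: the bound cannot be improved at this step.
gain_agree = dual_increase(phi_p, [{0: 0.0, 1: 1.0}, {0: 0.0, 1: 1.0}], lam_down)
print(gain_disagree, gain_agree)
```

The gain is never negative, since the minimum of a sum is at least the sum of the separate minima.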

### 2.3 Producing an Integral Solution Given λ

An anytime approximate solution to Eq 5 is produced by independently selecting the lowest reduced cost solution for each $p$ as follows [20].

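$$y^*_p \leftarrow \arg\min_{y_p} \phi_p(y_p) + \sum_{\substack{q \in \mathcal{V} \\ (pq) \in \mathcal{E}}} \lambda_{pq \to p}(y_p) \tag{9}$$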
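The decoding step is a per-hyper-variable minimization of the reduced cost. A minimal sketch, with a hypothetical function name and made-up tables, is:

```python
def decode(phi_p, incoming):
    """Pick the state of one hyper-variable with the lowest reduced cost.

    incoming is a list of dicts, one per neighbor q, holding lambda_{pq->p}(y_p).
    """
    return min(phi_p, key=lambda y: phi_p[y] + sum(lam[y] for lam in incoming))

phi_p = {(0, 0): 0.0, (0, 1): -0.5, (1, 0): 0.2, (1, 1): -0.4}
incoming = [{(0, 0): 0.0, (0, 1): 0.3, (1, 0): 0.0, (1, 1): -0.2}]
y_star = decode(phi_p, incoming)
print(y_star)
```

Because each hyper-variable is decoded independently, the resulting joint assignment is only an approximate solution unless the relaxation is tight.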

## 3 Benders Decomposition Approach

In this section we study an efficient mechanism to achieve the increase in the dual objective in Eq 8. Observe that to compute $\nu_{pq}(y_p)$ for every possible $y_p$, one computation need be done for each member of the vector product of the state spaces of $y_p$, $y_q$. If we have $|Y_p| = |Y_q| = 15$ binary variables, then we have roughly 30k states for each of $y_p$ and $y_q$, and enumerating the joint space of roughly 30k $\times$ 30k states becomes prohibitively expensive for practical applications. In this section we assume, for notational ease, that all edges including $p$ are of the form $(pq)$, not $(qp)$.

To avoid enumerating the vector product of state spaces for pairs of hyper-variables we employ a Benders decomposition [4] based approach. We compute terms $\nu^-_{pq}(y_p)$ that lower bound $\nu_{pq}(y_p)$ and use $\nu^-$ in place of $\nu$ in Alg 1. We construct $\nu^-$ so as to satisfy the following property.

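$$\min_{y_p} \phi_p(y_p) + \sum_{\substack{q \in \mathcal{V} \\ (pq) \in \mathcal{E}}} \nu^-_{pq}(y_p) = \min_{y_p} \phi_p(y_p) + \sum_{\substack{q \in \mathcal{V} \\ (pq) \in \mathcal{E}}} \nu_{pq}(y_p) \tag{10}$$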

Using Eq 10, observe that the fixed point update using $\nu^-$ instead of $\nu$ improves the dual objective by the same amount as the standard update described in Eq 8. Since fixed point updates over $\nu^-$ increase the dual objective by exactly the same amount in a given iteration as the standard update, optimization is guaranteed to converge to a fixed point of the dual of the local polytope relaxation.

### 3.1 Our Application of Benders Decomposition

We now rigorously define our Benders decomposition formulation for a specific edge $(pq)$ given fixed $\lambda$. We define the set of affine functions that lower bound $\nu_{pq}(y_p)$ as $\dot{Z}_{pq \to p}$, which we index by $z$, each of which is called a "Benders row". We parameterize the $z$'th affine function in $\dot{Z}_{pq \to p}$ with $\omega^z_0$ and $\omega^z_i(x_i)$ for all $i \in Y_p$.

Given any $\dot{Z}_{pq \to p}$ we define a lower bound on $\nu_{pq}(y_p)$, denoted $\nu^-_{pq}(y_p)$, as follows.

$$\nu_{pq}(y_p) \ge \nu^-_{pq}(y_p) = \max_{z \in \dot{Z}_{pq \to p}} \omega^z_0 + \sum_{i \in Y_p} \omega^z_i(x^{y_p}_i) \tag{11}$$
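Evaluating the lower bound for one joint state only requires a pass over the stored rows and the individual variables of the hyper-variable, with no enumeration of the neighboring state space. The sketch below uses a hypothetical storage layout in which each row is a pair `(omega0, omega)` and `omega[i][x_i]` holds the per-variable term.

```python
def lower_bound(rows, y_p):
    """Maximum over Benders rows of the affine function evaluated at y_p."""
    return max(w0 + sum(w[i][xi] for i, xi in enumerate(y_p)) for w0, w in rows)

# Two hypothetical rows over a hyper-variable made of two binary variables.
rows = [(-3.0, [[0.0, 1.0], [0.0, 2.0]]),
        (-1.0, [[-1.0, 0.0], [-2.0, 0.0]])]

print(lower_bound(rows, (0, 0)), lower_bound(rows, (1, 1)))
```

The cost per evaluation is proportional to the number of rows times the number of variables in the hyper-variable.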

To construct $\dot{Z}_{pq \to p}$ for each $q$ s.t. $(pq) \in \mathcal{E}$ so as to satisfy Eq 10, we alternate between the following two steps.

• Select the minimizer $y^*_p$ corresponding to the left hand side of Eq 10.

• Add a new Benders row to $\dot{Z}_{pq \to p}$ for each neighbor $q$ that makes $\nu^-_{pq}(y^*_p) = \nu_{pq}(y^*_p)$. As an alternative we can add a Benders row corresponding to the edge in which the lower bound is loosest, meaning the $q$ that maximizes $\nu_{pq}(y^*_p) - \nu^-_{pq}(y^*_p)$.

Termination occurs when Eq 10 is satisfied.
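The alternation between selecting a minimizer and adding rows can be sketched as a small cutting-plane loop. Everything here is an illustrative assumption: the exact min-marginals are supplied as lookup functions, and rows are produced by a deliberately crude `big_m_row` helper (tight at the selected state, dropping by a large constant for every disagreeing variable) that stands in for the row generation procedures of the following sections.

```python
from itertools import product

def lower_bound(rows, y):
    return max(w0 + sum(w[i][xi] for i, xi in enumerate(y)) for w0, w in rows)

def big_m_row(nu_value, y_star, big_m):
    """Toy row: equals nu_value at y_star, minus big_m per disagreeing variable."""
    omega = [[0.0 if x == xs else -big_m for x in (0, 1)] for xs in y_star]
    return (nu_value, omega)

def tighten(states_p, phi_p, exact_nu, rows, big_m, tol=1e-9):
    """Add rows until every neighbor's bound is tight at the selected minimizer."""
    while True:
        y_star = min(states_p,
                     key=lambda y: phi_p[y] + sum(lower_bound(r, y) for r in rows))
        added = False
        for q, nu_q in enumerate(exact_nu):
            value = nu_q(y_star)  # one exact evaluation, not the whole table
            if value - lower_bound(rows[q], y_star) > tol:
                rows[q].append(big_m_row(value, y_star, big_m))
                added = True
        if not added:
            return y_star

states_p = list(product((0, 1), repeat=2))
phi_p = {(0, 0): 0.0, (0, 1): 0.1, (1, 0): 0.2, (1, 1): -0.1}
# Hypothetical exact min-marginals for two neighbors, all within [-2, 0].
nu_tables = [{(0, 0): -1.0, (0, 1): -2.0, (1, 0): 0.0, (1, 1): -0.5},
             {(0, 0): -1.5, (0, 1): 0.0, (1, 0): -1.0, (1, 1): -2.0}]
exact_nu = [t.__getitem__ for t in nu_tables]
# Start from a constant row that is a valid lower bound everywhere.
rows = [[(-2.0, [[0.0, 0.0], [0.0, 0.0]])] for _ in nu_tables]

y_star = tighten(states_p, phi_p, exact_nu, rows, big_m=2.0)
bound = phi_p[y_star] + sum(lower_bound(r, y_star) for r in rows)
truth = min(phi_p[y] + sum(t[y] for t in nu_tables) for y in states_p)
print(y_star, bound, truth)
```

With such weak rows the loop visits most states before stopping; the point of Pareto optimal rows is that each row is informative at many states, so far fewer rounds are needed.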

#### 3.1.1 Outline of Benders Approach Section

We outline the remainder of this section as follows. First in Section 3.2 we produce a new Benders row corresponding to a given edge $(pq)$ that is tight at a given $y^*_p$. Next in Section 3.3 we consider the use of Benders rows that are designed to speed convergence, called Magnanti-Wong cuts (MWC) or Pareto optimal cuts [16]. Then in Section 3.4 we show how Benders rows produced with different values of $\lambda$ than the current one can be re-used with little extra computational effort. Finally in Section 3.5 we consider our complete algorithm for producing $\dot{Z}$ that satisfy Eq 10.

### 3.2 Producing New Benders Rows

Given nascent set $\dot{Z}_{pq \to p}$ and fixed $y^*_p$, we determine if the current lower bound is tight by computing the following.

$$\min_{y_q} \phi_{pq}(y^*_p, y_q) - \lambda_{pq \to q}(y_q) \tag{12}$$

In this section we reformulate Eq 12 as an ILP, then produce a tight LP relaxation with a dual form that reveals a Benders row that is tight at $y^*_p$. For ease of notation we flip $\theta_{ji}$ for $(ji) \in E$, where $i \in Y_p, j \in Y_q$, to $\theta_{ij}$. Similarly for ease of notation we use zero valued $\theta_{ij}$ for $(ij) \notin E$. We use $[\cdot]$ to denote the binary indicator function.

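$$
\begin{aligned}
\min_{\mu \ge 0} \quad & \sum_{y_q} -\lambda_{pq \to q}(y_q)\,\mu_q(y_q) + \sum_{\substack{i \in Y_p \\ j \in Y_q \\ x_i, x_j}} \theta_{ij}(x_i, x_j)\,\mu_{ij}(x_i, x_j) \\
\text{s.t.} \quad & \sum_{y_q} \mu_q(y_q) = 1 \\
& \sum_{x_j} \mu_{ij}(x_i, x_j) = [x_i = x^{y^*_p}_i] \quad \forall [i \in Y_p, j \in Y_q, x_i] \\
& \sum_{x_i} \mu_{ij}(x_i, x_j) = \sum_{y_q} [x_j = x^{y_q}_j]\,\mu_q(y_q) \quad \forall [i \in Y_p, j \in Y_q, x_j] \\
& \mu_q(y_q) \in \{0,1\} \quad \forall [y_q] \\
& \mu_{ij}(x_i, x_j) \in \{0,1\} \quad \forall [i \in Y_p, j \in Y_q, x_i, x_j]
\end{aligned} \tag{13}
$$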

We assume without loss of generality that $\theta_{ij}$ is non-positive. In Appendix A we prove that, without altering the objective in Eq 13, the integrality constraints can be dropped and the bottom two equality constraints can be relaxed to $\le$ inequalities. The corresponding optimization is below.

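$$
\begin{aligned}
\text{Eq 13} = \min_{\mu \ge 0} \quad & \sum_{y_q} -\lambda_{pq \to q}(y_q)\,\mu_q(y_q) + \sum_{\substack{i \in Y_p \\ j \in Y_q \\ x_i, x_j}} \theta_{ij}(x_i, x_j)\,\mu_{ij}(x_i, x_j) \\
\text{s.t.} \quad & \sum_{y_q} \mu_q(y_q) = 1 \\
& \sum_{x_j} \mu_{ij}(x_i, x_j) \le [x_i = x^{y^*_p}_i] \quad \forall [i \in Y_p, j \in Y_q, x_i] \\
& \sum_{x_i} \mu_{ij}(x_i, x_j) \le \sum_{y_q} [x_j = x^{y_q}_j]\,\mu_q(y_q) \quad \forall [i \in Y_p, j \in Y_q, x_j]
\end{aligned} \tag{14}
$$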

We now consider the dual problem of the LP in Eq 14 using dual variables $\beta^0$, $\beta^1$, $\beta^2$.

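$$
\begin{aligned}
\text{Eq 14} = \max_{\substack{\beta^0 \in \mathbb{R} \\ \beta^1_{ij}(x_i) \ge 0 \\ \beta^2_{ij}(x_j) \ge 0}} \quad & \beta^0 - \sum_{\substack{i \in Y_p \\ j \in Y_q \\ x_i}} \beta^1_{ij}(x_i)\,[x_i = x^{y^*_p}_i] \\
\text{s.t.} \quad & -\lambda_{pq \to q}(y_q) - \beta^0 - \sum_{\substack{i \in Y_p \\ j \in Y_q \\ x_j}} [x_j = x^{y_q}_j]\,\beta^2_{ij}(x_j) \ge 0 \quad \forall y_q \\
& \theta_{ij}(x_i, x_j) + \beta^1_{ij}(x_i) + \beta^2_{ij}(x_j) \ge 0 \quad \forall [i \in Y_p, j \in Y_q, x_i, x_j]
\end{aligned} \tag{15}
$$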

Observe that given fixed $\beta$, the objective in Eq 15 is an affine function of the states $x^{y^*_p}_i$ of the individual variables in $Y_p$. Thus when the dual variables are optimal, Eq 15 represents a new Benders row that we add to $\dot{Z}_{pq \to p}$ and that makes the lower bound in Eq 11 tight at $y^*_p$. Let us denote the new Benders row as $z^*$, which we construct from $\beta$ as follows.

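$$
\begin{aligned}
\omega^{z^*}_0 &= \beta^0 \\
\omega^{z^*}_i(x_i) &= -\sum_{j \in Y_q} \beta^1_{ij}(x_i) \quad \forall [i \in Y_p, x_i]
\end{aligned} \tag{16}
$$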

### 3.3 Magnanti-Wong Cuts

One can directly solve Eq 15 via an off-the-shelf LP solver, which gives a tight lower bound for a given $y^*_p$. However, ideally we want this new Benders row to also give a good (not terribly loose) lower bound for other selections of $y_p$, so that we can use as few computations as possible to satisfy Eq 10.

Approaches for generating Benders rows that produce good lower bounds are called Pareto optimal cuts or Magnanti-Wong cuts [16] (MWC) in the operations research literature. We generate a MWC by adding regularization with tiny positive weight $\epsilon$ to the objective of Eq 15 to prefer smaller values of $\beta^1$, as follows.

$$-\epsilon \sum_{i \in Y_p} \min_{x_i} \sum_{j \in Y_q} \beta^1_{ij}(x_i) \tag{17}$$

#### 3.3.1 Specialization for Binary Problems

In Appendix B we derive an efficient exact procedure for producing MWC for binary problems. We write the findings below. Recall that WLOG at most one entry of $\theta_{ij}$ is non-zero and that entry is non-positive (see Appendix C for details). We denote the indexes corresponding to that entry as $x^{ij}_i$, $x^{ij}_j$, respectively. We write fast computation of $\beta$ below using $\beta^3$ and helper constants $h_{ij}$, $Q_j$.

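$$
\begin{aligned}
& \beta^0 + \epsilon \max_{Q_j(x_j) \ge \beta^3_j(x_j) \ge 0} \; -\sum_{\substack{j \in Y_q \\ x_j}} \beta^3_j(x_j) \\
\text{s.t.} \quad & -\lambda_{pq \to q}(y_q) - \beta^0 + \sum_{\substack{i \in Y_p \\ j \in Y_q}} [x^{ij}_j = x^{y_q}_j]\,\big(\theta_{ij}(x^{ij}_i, x^{ij}_j) + \beta^3_j(x^{ij}_j)\,h_{ij}\big) \ge 0 \quad \forall y_q
\end{aligned} \tag{18}
$$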

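where the terms are defined as follows. (Footnote 2: Examine the mapping for $h_{ij}$ and observe that when the denominator $Q_j(x^{ij}_j)$ is zero the numerator is zero as well; in that case $h_{ij}$ is taken to be zero, not undefined.)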

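$$
\begin{aligned}
\beta^3_j(x_j) &= \sum_{i \in Y_p} \beta^1_{ij}(x^{ij}_i)\,[x_j = x^{ij}_j] \quad \forall [j \in Y_q, x_j] \\
\beta^1_{ij}(x^{ij}_i) &= \beta^3_j(x^{ij}_j)\,h_{ij} \quad \forall [i \in Y_p, j \in Y_q] \\
\beta^2_{ij}(x^{ij}_j) &= -\theta_{ij}(x^{ij}_i, x^{ij}_j) - \beta^1_{ij}(x^{ij}_i) \quad \forall [i \in Y_p, j \in Y_q] \\
h_{ij} &= \frac{-\theta_{ij}(x^{ij}_i, x^{ij}_j)\,[x^{ij}_i \ne x^{y^*_p}_i]}{Q_j(x^{ij}_j)} \quad \forall [i \in Y_p, j \in Y_q] \\
Q_j(x_j) &= \sum_{i \in Y_p} -\theta_{ij}(x^{ij}_i, x^{ij}_j)\,[x^{ij}_i \ne x^{y^*_p}_i]\,[x^{ij}_j = x_j] \quad \forall [j \in Y_q, x_j] \\
\beta^0 &= \min_{y_q} \phi_{pq}(y^*_p, y_q) - \lambda_{pq \to q}(y_q) \\
\beta^1_{ij}(x_i) &= 0 \quad \forall x_i \text{ s.t. } x_i \ne x^{ij}_i \\
\beta^2_{ij}(x_j) &= 0 \quad \forall x_j \text{ s.t. } x_j \ne x^{ij}_j
\end{aligned}
$$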

We often observe in a pre-solve that setting $\beta^3$ to the zero vector satisfies a large portion of the constraints over $y_q$. Since the left hand side of each constraint in Eq 18 does not decrease as $\beta^3$ grows, these constraints are satisfied for all feasible $\beta^3$ and hence we remove them from consideration. We may choose to use a mixture of the L1 and L2 norm over $\beta^3$ in the optimization, parameterized by $\alpha$, as discussed in Appendix B. We write the corresponding objective below.

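$$\beta^0 + \epsilon \max_{Q_j(x_j) \ge \beta^3_j(x_j) \ge 0} \; -\sum_{\substack{j \in Y_q \\ x_j}} \Big(\alpha\,\beta^3_j(x_j) + (1 - \alpha)\,\beta^3_j(x_j)\,\beta^3_j(x_j)\Big) \tag{19}$$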

#### 3.3.2 Reverse Magnanti-Wong Cuts

In this section we propose a cut that avoids solving the LP in Eq 18 or Eq 15. These cuts correspond to the least binding cut that is tight at $y^*_p$ and we refer to them as Reverse Magnanti-Wong Cuts (RMC). The RMC corresponds to setting $\beta^3_j(x_j) = Q_j(x_j)$ for all $j \in Y_q$ and $x_j$, which results in a feasible solution to Eq 18. RMC can be used for non-binary problems by setting $\beta$ as follows (see Appendix D for details):

• .

• .

• .
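For the binary case, an RMC row can be written down without any LP solve. The sketch below rests on one assumption that we state explicitly: the RMC is taken to set the per-column variable of Eq 18 to its upper bound, which by the definitions of Section 3.3.1 gives $\beta^1_{ij}(x^{ij}_i) = -\theta_{ij}(x^{ij}_i, x^{ij}_j)[x^{ij}_i \ne x^{y^*_p}_i]$. The data layout (`theta[(i, j)] = (a, b, v)` for the single non-zero entry) and all function names are hypothetical. The code then checks by brute force that the row lower bounds the exact min-marginal everywhere and is tight at the generating state.

```python
import random
from itertools import product

def phi_pq(theta, yp, yq):
    """Pairwise cost: theta[(i, j)] = (a, b, v) has value v <= 0 at (x_i=a, x_j=b)."""
    return sum(v for (i, j), (a, b, v) in theta.items() if yp[i] == a and yq[j] == b)

def exact_nu(theta, lam_q, yp, states_q):
    return min(phi_pq(theta, yp, yq) - lam_q[yq] for yq in states_q)

def rmc_row(theta, lam_q, y_star, states_q):
    """Assumed RMC: beta1_ij is -v when the hot state a differs from y_star[i]."""
    beta0 = exact_nu(theta, lam_q, y_star, states_q)
    omega = [[0.0, 0.0] for _ in y_star]
    for (i, j), (a, b, v) in theta.items():
        if a != y_star[i]:
            omega[i][a] += v  # omega_i(x_i) = -sum_j beta1_ij(x_i)
    return beta0, omega

def row_value(row, yp):
    beta0, omega = row
    return beta0 + sum(omega[i][xi] for i, xi in enumerate(yp))

random.seed(0)
n_p, n_q = 3, 3
states_p = list(product((0, 1), repeat=n_p))
states_q = list(product((0, 1), repeat=n_q))
theta = {(i, j): (random.randint(0, 1), random.randint(0, 1), -random.random())
         for i in range(n_p) for j in range(n_q)}
lam_q = {yq: random.uniform(-1, 1) for yq in states_q}
y_star = (1, 0, 1)

row = rmc_row(theta, lam_q, y_star, states_q)
gaps = {yp: exact_nu(theta, lam_q, yp, states_q) - row_value(row, yp)
        for yp in states_p}
print(min(gaps.values()), gaps[y_star])
```

Only one exact minimization over the neighboring state space is needed per row, namely the one that produces `beta0`; the per-variable terms are read off the cost tables directly.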

### 3.4 Recycling Benders Rows After Updates to λ

In Alg 1, updates to $\lambda$ terms require re-computing all $\nu^-$ terms from scratch. In this section we save computation time by re-using Benders rows from the previous iterations before generating new Benders rows.

Updates to $\lambda$ may make the Benders rows in $\dot{Z}_{pq \to p}$ generated in previous iterations no longer lower bound $\nu_{pq}$. Instead of constructing $\dot{Z}_{pq \to p}$ from scratch every iteration, we re-use the $\beta^1$, $\beta^2$ terms produced by previous iterations. Specifically, for each Benders row in $\dot{Z}_{pq \to p}$ we leave $\beta^1$, $\beta^2$ unchanged and set the $\beta^0$ term to the maximum feasible value. We write the corresponding update below.

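$$\beta^0 \leftarrow \min_{y_q} -\lambda_{pq \to q}(y_q) - \sum_{\substack{i \in Y_p \\ j \in Y_q \\ x_j}} [x_j = x^{y_q}_j]\,\beta^2_{ij}(x_j) \tag{20}$$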

Solving Eq 20 is accelerated by storing the summation term of Eq 20 for each selection of $y_q$. In practice we observe that re-using Benders rows gives vast speed-ups compared with constructing Benders rows from scratch for each iteration of Alg 1.
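The recycling update reduces to a single minimization over the neighboring states once the summation term has been cached. A minimal sketch follows; the cache `beta2_sum`, the integer state labels, and the numbers are hypothetical.

```python
def recycle_beta0(lam_q, beta2_sum):
    """Largest feasible constant term for a stored row under updated messages.

    beta2_sum[yq] caches the summation term of the row for state yq, computed
    once when the row was created; lam_q[yq] is the current lambda_{pq->q}(yq).
    """
    return min(-lam_q[yq] - beta2_sum[yq] for yq in lam_q)

beta2_sum = {0: 0.5, 1: 1.25, 2: 0.0}   # cached once per row
lam_old = {0: -1.0, 1: -2.0, 2: -0.25}
lam_new = {0: -1.0, 1: -1.0, 2: -0.75}

print(recycle_beta0(lam_old, beta2_sum), recycle_beta0(lam_new, beta2_sum))
```

The per-variable part of the row is untouched, so only the constant term has to be refreshed after each message update.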

#### 3.4.1 Recycling Reverse Magnanti-Wong Cuts

The RMC have the following useful property in addition to being fast to compute. Notably, they remain tight after changes in $\lambda$ at the $y^*_p$ that was used to generate them, given that the corresponding $\beta^0$ term is first updated as described in Eq 20. This is because the $\beta^1$, $\beta^2$ terms do not vary with $\lambda$.

#### 3.4.2 Minor point: Sharing elements between $\dot{Z}_{pq \to p}$ and $\dot{Z}_{pq \to q}$

To provide additional speed we share all elements between $\dot{Z}_{pq \to p}$ and $\dot{Z}_{pq \to q}$. For a given pair of corresponding elements, their $\beta^1$, $\beta^2$ terms are identical; however, the $\beta^0$ terms differ as determined by Eq 20.

### 3.5 Algorithm for computing $\nu^-$

Our algorithm proceeds by iteratively testing if Eq 10 holds and generating Benders rows corresponding to a minimizer of