 # Exact MAP-Inference by Confining Combinatorial Search with LP Relaxation

We consider the MAP-inference problem for graphical models, which is a valued constraint satisfaction problem defined on real numbers with a natural summation operation. We propose a family of relaxations (different from the famous Sherali-Adams hierarchy), which naturally define lower bounds for its optimum. This family always contains a tight relaxation and we give an algorithm able to find it and therefore, solve the initial non-relaxed NP-hard problem. The relaxations we consider decompose the original problem into two non-overlapping parts: an easy LP-tight part and a difficult one. For the latter part a combinatorial solver must be used. As we show in our experiments, in a number of applications the second, difficult part constitutes only a small fraction of the whole problem. This property allows to significantly reduce the computational time of the combinatorial solver and therefore solve problems which were out of reach before.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

This paper focuses on energy minimization or maximum a posteriori (MAP) inference for undirected graphical models. This problem is closely related to weighted and valued constraint satisfaction. In the most common pairwise case it amounts to minimizing a partially separable function

taking real values on a discrete set of finite-valued vectors

111We rigorously define notation in Section “Preliminaries”.

 minx∈XV⎡⎣EG(θ,x):=∑\mathclapu∈Vθu(xu)+∑\mathclapuv∈Eθuv(xu,xv)⎤⎦. (1)

The problem is known to be NP-hard (e. g. li2016complexity li2016complexity), and therefore a number of approximate algorithms were proposed to this end. In contrast, our goal is an efficient method able to solve large-scale, but mostly simple problem instances exactly

. Such instances typically arise in computer vision, machine learning and other areas of artificial intelligence. Although approximate methods often provide reasonable solutions, having an exact solver can be quite critical at the modeling stage, when one has to differentiate between modeling and optimization errors. In this case one usually resorts to either specialized combinatorial solvers (see references in kappes2015comparative kappes2015comparative; hurley2016multi hurley2016multi) or off-the-shelf integer linear program (ILP) solvers like CPLEX

[CPLEX, IBM2014] or Gurobi [Gurobi Optimization2016]. However, neither specialized nor off-the-shelf solvers scale well, as the problem instances get larger. Our method is able to use the fact that a linear program (LP) relaxation of the problem is “almost” tight, i. e. the obtained solution is close to the optimal one. It restricts application of an exact solver to a small fraction of the problem, where the LP relaxation is not tight and yet obtains a provably optimal solution to the whole problem. This allows to solve problems for which no efficient solving technique was available.

Related work   LP relaxations are an important building block for a number of algorithms addressing the MAP-inference problem (1

). It was probably first considered in (shlezinger1976syntactic shlezinger1976syntactic; see werner2007linear werner2007linear for the recent review) both in its primal and dual form. The notion of

reparametrization (known also as equivalent transformations or equivalence preserving transformations) was introduced in the same work as well. Although the bound provided by the LP relaxation is often good, the class of problems, where it is tight, is limited (see kolmogorov2015power kolmogorov2015power). Practically important problems from this class are mainly those having acyclic structure or submodular costs. Therefore a number of works were devoted to cutting plane techniques to tighten the relaxation (e. g. koster1998partial koster1998partial; sontag2007cutting sontag2007cutting; degivry2017clique degivry2017clique). Sometimes the tightening itself may lead to an exact solution, however, in general it is accomplished with branch-and-bound or algorithms. The most prominent representative of the first class are the DAOOPT [Marinescu and Dechter2005, Otten and Dechter2010] and Toulbar2 [Cooper et al.2010] solvers. The latter has recently shown impressive results on a number of benchmarks [Hurley et al.2016]. In contrast, the algorithm so far was mainly used in specific applications (e. g. bergtholdt2010study bergtholdt2010study).

Recently developed LP-relaxation-based partial optimality methods (e. g. shekhovtsov2014maximum shekhovtsov2014maximum; shekhovtsov2015maximum shekhovtsov2015maximum; Swoboda2016 Swoboda2016) can find optimal labels for a significant part of variables without solving the combinatorial problem (1). Afterwards, a combinatorial solver can be applied to the rest of the variables to obtain a complete solution. These methods work well if the pairwise costs play the role of a “smoothing regularizer” by slightly penalizing differences in values in neighboring variables and . However, they struggle as the pairwise costs get more weight and move towards “hard constraints”, when some pairs of variable values are strongly penalized or even forbidden.

The CombiLP method [Savchynskyy et al.2013] is the closest to our work. It iteratively splits the problem into a “simple” and a “difficult” part based on consistency of reparametrized unary and pairwise costs, known as (virtual) arc-consistency (see werner2007linear werner2007linear), and checks for agreement of their solutions. The “simple” part is addressed with a dedicated LP solver, whereas the “difficult” one is solved with an ILP method. Although CombiLP has shown promising results on the OpenGM benchmark [Kappes et al.2015], its usage is beneficial for sparse graphical models only, when .

Contribution   Based on CombiLP, we propose a method, which is not restricted to sparse models. Similar to CombiLP, we split the problem into LP and ILP parts based on local consistency properties. Our new consistency criterion guarantees that the concatenation of the obtained LP and ILP solutions is optimal for the whole problem, given that the criterion is satisfied. When the criterion is not satisfied, we increase the ILP subproblem and correspondingly decrease the LP one, like it is done in CombiLP. There are several crucial differences to the CombiLP approach, however:

• [nosep]

• Our “difficult” ILP subproblem is kept much more compact, which is critical for densely-connected graphs. This leads to substantial computational savings.

• Our optimality criterion is stronger than those of CombiLP: Satisfaction of CombiLP’s criterion for a given splitting implies satisfaction of ours.

Additionally, we treat the problem of an initial reparametrization suitable for the used splitting criterion and propose a method, which allows to use arbitrary dual LP solvers within our algorithm, whereas the CombiLP implementation has a fixed dedicated LP solver. This allowed us to choose a more efficient LP solver and to significantly (up to 18 times) speed up the original CombiLP implementation.

Finally, our criterion and implementation222Code is available at github.com/fgrsnau/combilp. are also able to deal with higher order models, which intrinsically have a higher connectivity. We show efficacy of our method on publicly available benchmarks from computer vision, machine learning and bio-imaging.

## 2 Preliminaries

Graphical Models and MAP-inference   Let be an undirected graph with the set of nodes and the set of edges . The neighborhood of is defined as . Each node is associated with a finite set of labels . For any subset of graph nodes the Cartesian product defines the set of labelings of the subset , when each node from  is assigned a label. This includes also the special cases and denoted as and respectively. We assume that stands for the set of minimal elements. At the same time, when used with “” or “” operators, it returns some element from this set.

Let be the set of indices enumerating all labels and label pairs in neighboring graph nodes. For each node and edge the cost functions , and , assign a cost to a label or label pair respectively. The vector contains all values of the functions and as its coordinates.

ILP formulation and LP relaxation   One way to address the MAP-inference problem (1) is to consider its ILP formulation (see e. g. shlezinger1976syntactic shlezinger1976syntactic; werner2007linear werner2007linear)

 minμ≥0⟨θ,μ⟩ ∑s∈Xuμu(s)=1, u∈V (2) ∑s∈Xuμuv(s,t)=μv(t), uv∈E,t∈Xv μ∈{0,1}I (3)

A natural LP relaxation is obtained by omitting the integrality constraints (3). The resulting LP (2) is known as a local polytope [Werner2007] or simply an LP relaxation of (1). We will call the problem (1) LP-tight, if the optimal values of (1) and its LP relaxation (2) coincide. This also implies that there is an integer solution to the relaxed problem (2). We will say that the LP relaxation has an integer solution in a node if there is such that . Due to constraints of (2) it implies that for .

Linear programs of the form (2) are as difficult as linear programs in general [Prusa and Werner2013] and therefore obtaining exact solutions for large-scale instances may require significant time. However, there are fast specialized solvers (e. g. kolmorogrov2006convergent kolmorogrov2006convergent; cooper2008virtual cooper2008virtual) returning approximate dual solutions of (2).

Partial Optimality Observation   Practical importance of the LP relaxation (2) is based on the fact that often most coordinates of its (approximate) relaxed solution are assigned integer values. The non-integer coordinates can be rounded [Ravikumar, Agarwal, and Wainwright2010] and the resulting labeling can be used as if it was a solution of the non-relaxed problem. A number of problems have been successfully addressed with this type of methods [Kappes et al.2015]. However, apart from special cases (e. g. boros2002pseudo boros2002pseudo; rother2007optimizing rother2007optimizing) there is no guarantee that the integer coordinates keep their values in an optimal solution of the non-relaxed problem.

Even though there is no guarantee that the rounded integer solution is a sensible approximation for the optimal solution, empirical tests have shown that usually many integer coordinates coincide with the ones found in the optimal solution. This is a purely practical observation with little theoretical background. Nevertheless, this observation can be used to address the non-relaxed problem efficiently and it is a basis of our method. An alternative, the partial optimality approach was pursued by e. g. shekhovtsov2015maximum shekhovtsov2015maximum; Swoboda2016 Swoboda2016. We will provide a corresponding empirical comparison later in the paper.

## 3 Idea of the Algorithm

Graph partition   A subgraph of the graph is called induced by the set of its nodes , if , i. e. the set of its edges contains all edges from connecting nodes from .

The subgraphs and are called partition of the graph , if , and and are induced by and respectively. The subgraph as complement to will be denoted as . The other way around, stands for , if is a partition of . Notation  will be used for the set of edges connecting  and : .

In the following, we will show how to partition the problem graph into (i) an easy part with subgraph , which can be solved exactly with approximate LP solvers and (ii) a difficult part with subgraph , which will require an ILP solver.

Lower bound induced by partition   Till the end of this section we will assume , are a partition of a graph . For the sake of notation, when considering different subgraphs of  we will nevertheless use a cost vector corresponding to the master graph , i. e.  will stand for , where .

Additionally, for and , their concatenation will be defined as

 (x′∘x′′)v={x′v,v∈VA,x′′v,v∈VB. (4)

Note that the energy function can be decomposed into subproblems on and and it holds

 EG(θ,x′∘x′′)=EA(θ,x′)+EB(θ,x′′)+∑\mathclapuv∈EABθuv(x′u,x′′v) (5)

and therefore,

 EG(θ,x′∘x′′)≥EA(θ,x′)+EB(θ,x′′)++∑uv∈EABmin(s,t)∈Xuvθuv(s,t) (6)

constitutes a lower bound for the energy function .

###### Proposition 1 (Sufficient optimality condition).

The lower bound specified in (6) is tight if for all it holds that , where , .

It is trivial to show that the labeling is optimal for if the lower bound (6) is tight.

When considering the set of all possible partitions of into and there is always at least one that leads to a tight lower bound (6). It corresponds to a trivial partition, where either the subgraph or is empty. The first case corresponds to solving the whole problem with an ILP method, whereas the second one corresponds to the case when the LP relaxation is tight, i. e. all coordinates of an LP solution are integer.

However, as our experimental evaluation shows, there exist often tight non-trivial partitions, with a large subgraph and a small subgraph .

Conceptual Algorithm   These partitions can be obtained for example by a conceptual Algorithm 1, which assigns all nodes of the graph having an integer solution to and all others to . After solving both subproblems one checks fulfillment of the sufficient optimality condition defined by Proposition 1. Should the condition hold, the problem is solved. Otherwise one increases the subproblem  (and respectively decreases ) by including those nodes , where the condition does not hold for at least one , in terms of Proposition 1.

Relation to CombiLP   Algorithm 1 differs from CombiLP [Savchynskyy et al.2013] in one very important aspect. Namely, the subgraphs used in CombiLP are overlapping, whereas ours are not. This substantially improves performance of the method in cases when the graph has a high connectivity. In later sections of this paper we will give a detailed theoretical and empirical comparison of the methods.

In the following, we will turn the conceptual algorithm into a working one. In order to do so, we will give positive answers to a number of important questions:

• [nosep]

• Why and when is the subproblem on LP-tight? This is critical, since we assume to be close to in its size and therefore it must be solvable by a (polynomial) LP method.

• Can we avoid running an LP solver for in each iteration?

• Can we use (fast specialized) approximate LP solvers on instead of (slow off-the-shelf) exact ones?

• How to encourage conditions of Proposition 1 to be fulfilled for a possibly small ?

Although our construction mostly follows the one given in [Savchynskyy et al.2013], we repeat it here to keep the paper self-contained.

## 4 Theoretical Background

Reparametrization   Decompositions of the energy function into unary and pairwise costs are not unique, which is, there exist other costs such that for all labelings . It is known (see e. g. werner2007linear werner2007linear) and straightforward to check that such equivalent costs can be obtained with an arbitrary vector  as follows:

 θ′u(s)≡θϕu(s) :=θu(s)−∑v∈Nb(u)ϕu,v(s) (7) θ′uv(s,t)≡θϕuv(s,t) :=θuv(s,t)+ϕu,v(s)+ϕv,u(t).

The costs  are called reparametrized and the vector  is known as a reparametrization. Costs related by (7) are also called equivalent. In this sense, all vectors can be split into equivalence classes according to (7). Other established terms for reparametrizations are equivalence preserving transformations [Cooper and Schiex2004] and equivalent transformations [Shlezinger1976].

Dual Problem   By swapping the and operations in (1) one obtains a lower bound to the energy333It can be shown that this bound is in general less tight than (6)., which reads as

 D(θ):=∑u∈Vmins∈Xuθu(s)+∑uv∈Emin(s,t)∈Xuvθuv(s,t). (8)

Although the energy remains the same for all cost vectors from a given equivalence class (), the lower bound is dependent on the reparametrization (). Therefore, a natural maximization problem arises as maximization of the lower bound over all equivalent costs: . It is known (e. g. werner2007linear werner2007linear) that this maximization problem is equivalent to the Lagrangian dual to the LP relaxation (2). In turn, this implies that the minimum of (2) coincides with the maximum of . Therefore, one speaks about optimal reparametrizations as those , where the maximum is attained. Apart from its lower bound property the function is important because (i) function is concave w. r. t.  as a sum of minima of linear functions; (ii) there exist many of scalable and efficient algorithms for its (approximate) maximization, e. g. [Kolmogorov2006, Cooper et al.2008].

Strict Arc-Consistency   From a practical point of view it is important how an optimal reparametrization can be translated into a labeling, i. e. into an (approximate) solution of the energy minimization problem (1). The following definition plays a crucial role for this question in general and for our method in particular:

###### Definition 1 (Strict arc-consistency).

The node is called strictly arc-consistent w. r. t. the costs if there exists a label and labels for all , such that it holds (i)  for all ; and (ii)  for all .

The set of strictly arc-consistent nodes is denoted by  .

If all nodes are strictly arc-consistent w. r. t. the reparametrized costs , then it is straightforward to check that , where

 x∗u=argmins∈Xuθϕu(s). (9)

In turn, this implies that is an optimal reparametrization and is an exact solution of the energy minimization problem (1).

Reconstructing labeling from reparametrization   Although there is no guarantee that the strict arc-consistency property holds for all nodes even with an optimal reparametrization, the rule (9) is still used to obtain an approximate minimizer for (1) with arbitrary, also non-optimal reparametrizations  (although, a number of more sophisticated rules were proposed, they are based on (9) and reduce to it if the strict arc-consistency holds for all nodes, see e. g. ravikumar2010message ravikumar2010message). Moreover, for an optimal reparametrization , when the strict arc-consistency holds for a node , the complementary slackness conditions imply (e. g. werner2007linear werner2007linear) that strict arc-consistency of a node guarantees an integer solution of the LP relaxation in .

From the application point of view, an (approximate) solution (9) is typically considered as good, if most of the nodes satisfy the strict arc-consistency property. At the same time, unless the strict arc-consistency holds for all nodes, there is in general no theoretical guarantee that obtained as (9) coincide with the corresponding coordinate of an optimal solution , even if the node  is strictly arc-consistent.

Algorithm 2 described in the next section, provides such guarantees by solving the non-relaxed minimization problem (1).

## 5 Detailed Algorithm Description

Let us consider Algorithm 2. It differs from Algorithm 1 provided above in several aspects: Instead of solving the relaxed problem (2) in the primal domain, it solves its dual formulation and resorts to the optimally reparametrized costs. Strict arc-consistency is used in place of integrality to form the initial set , which is justified by the fact that strict arc-consistency is sufficient for integrality.

The reparametrization step in line 2 plays a crucial role for the whole method. Due to this step, solving the energy minimization problem on becomes trivial because of its strict arc-consistency. It can be performed by selecting the best label in each node independently, according to (9). Therefore, there is no computational overhead of resolving the problem on in each iteration. Also, as more and more nodes from the initial subgraph move over to the subgraph their strict arc-consistency encourages solution on to coincide with the locally optimal labels. Moreover, instead of an optimal dual solution any, also approximate, non-optimal reparametrization can be used. According to Proposition 1, this does not affect correctness of Algorithm 2. Therefore, approximate solvers can be used in line 1 of the algorithm. However, the better the dual solution is, the larger the set of strictly arc-consistent nodes is and therefore, the lower computational complexity of the ILP phase of the algorithm. Finally, reparametrization of the costs typically speeds up the ILP solver in line 6, as it serves as preprocessing.

## 6 Analysis of the Method

Family of Tight Partitions   The proposition below that if the sufficient optimality criterion (Proposition 1) of Algorithm 2 is fulfilled for a partition , then for any other partition such that the criterion holds as well:

###### Proposition 2.

Let , and , be partitions of s. t. . Let also . Let , be the solutions of the MAP-inference problem (1) on respectively and , analogously for and . If is the unique optimal assignment and it fulfills requirements of Proposition 1, then they are fulfilled for as well.

This property shows that there are potentially many partitions, which results in a tight bound and allows to apply a greedy strategy for growing the subgraph by adding all inconsistent nodes (violating Proposition 1) at once, as it is done in line 11 of Algorithm 2.

Comparison to CombiLP   As mentioned above, the CombiLP-method is very similar to ours, but uses a different optimality criterion. Below we show that our criterion is in a certain sense stronger than theirs. To this end, following [Savchynskyy et al.2013], we introduce the notion of a boundary complement subgraph:

###### Definition 2 (savchynskyy2013global savchynskyy2013global).

Let be an induced subgraph of . A subgraph is called boundary complement to w. r. t.  if it is induced by the set , where is the set of nodes in incident to nodes outside .

The optimality criterion used in CombiLP reads:

###### Theorem 1 (savchynskyy2013global savchynskyy2013global).

Let be a subgraph of and be its boundary complement w. r. t. . Let and be labelings minimizing and respectively and let all nodes be strictly arc-consistent. Then from

 xv=x′v  for all  v∈V∂A (10)

it follows that the labeling with coordinates
, is optimal on .

As can be seen from comparing Proposition 1 and Theorem 1, the main difference between the methods is that we use a partition of the graph , i. e. non-intersecting subgraphs, whereas the subgraphs in CombiLP are boundary complement and therefore intersect.

The following proposition states that the bounds produced by our method are at least as tight as those of CombiLP:

###### Proposition 3.

Let be a partition of a graph and be boundary complement for and . Let also , be optimal labelings on and . If the condition (10) holds for and , i. e.  for all , then Proposition 1 holds for and as well, where is the restriction of to the set . In other words, for the same subgraph fulfillment of Theorem 1 implies fulfillment of Proposition 1.

## 7 Technical Details

Post-Processing of Reparametrization   The maximum of the dual objective is typically non-unique. Since is a concave function, the set of its maxima is convex and therefore it contains either a unique element or a continuum. Unfortunately, not all optimal (or suboptimal ones, corresponding to the same value of ) reparametrizations are equally good for our method. Moreover, different dual algorithms return different reparametrizations and the fastest algorithm may not return an appropriate one.

Therefore, we developed a post-processing algorithm to turn an arbitrary reparametrization into a suitable one without decreasing the value of . This algorithm consists of two steps: (i) several iterations of a message passing (dual block-coordinate ascent) algorithm, which accumulates weights in unary costs and (ii) partial redistribution of unary costs between incident pairwise cost functions. This two-step procedure empirically turns most of the nodes, where the LP relaxation (2) has an integer solution, into strictly arc-consistent ones. The details of both steps are described in the supplement.

Higher Order Extensions   All discussed techniques are easily extended to the higher-order MAP-inference problem

 minx∈XV[EG(θ,x):=∑c∈Cθc(xc)]. (11)

where the cliques in the decomposition of the energy function may contain terms dependent on 3, 4 and more nodes. The bound (6) in the higher-order case reads as

 EG(θ,x′∘x′′)≥EA(θ,x′)++EB(θ,x′′)+∑c∈CABminyc∈Xcθc(xc), (12)

where similar to in the pairwise case. Proposition 1 for the higher-order case turns into:

###### Proposition 4.

Let and . The lower bound (12) is tight if for all and it holds that where .

The proof follows the same reasoning as the proof of Proposition 1 and is omitted here.

## 8 Experimental Evaluation

Algorithms   In this section we compare our proposed algorithm with other related methods. As baselines we use CPLEX 12.6.2 [CPLEX, IBM2014] and ToulBar2 0.9.8.0 [Cooper et al.2010] where the first is the well-known commercial optimizer and the latter is one of the best dedicated branch-and-bound solvers for (1), see comparison in [Hurley et al.2016]. We used comparable parameters and settings like the ones used in [Hurley et al.2016]. They are denoted by cpx or tb2 respectively. The original CombiLP [Savchynskyy et al.2013] implementation is referred as clp-orig. For a fair comparison, we modified it to make it compatible with arbitrary LP and ILP solvers, in particular, by applying the reparametrization post-processing algorithm described above. The modified method referred as clp is up to an order of magnitude faster than the original one clp-orig (see Table 2). For the experiments with clp and dclp we used both CPLEX and ToulBar2 as ILP-solvers. The corresponding variants of clp are denoted as clp-cpx and clp-tb2 respectively and similarly for dclp. Since the ToulBar2 variants (clp-tb2 and dclp-tb2) were superior to the CPLEX variants in all our tests, we will mainly discuss the former here (see supplement for all results). TRW-S [Kolmogorov2006] is used as fast block-coordinate-descend LP-solver everywhere except higher-order models. We used a fast implementation of the solver from the work [Shekhovtsov, Swoboda, and Savchynskyy2015]. Only for higher-order examples we resort to SRMP [Kolmogorov2015] using the minimal or basic LP relaxation (for details see kolmogorov2015new kolmogorov2015new). We set the maximum number of TRW-S/SRMP iterations to . Furthermore we tested the performance of a recent partial optimality technique [Shekhovtsov, Swoboda, and Savchynskyy2015] which is denoted by popt. As this approach does not solve the whole problem, we run ToulBar2 on the reduced model and measure the total running time (popt-tb2). We set the maximal running time for all methods to hour. Figure 2: Primal and dual bound progress for worms. All values are averaged over instances where at least one of the solvers returned the optimal solution (25 instances). To improve clarity only the three best solvers are shown.

Datasets   We verify performance of the algorithms on the following publicly available datasets: worms [Kainmueller et al.2017], color-seg [Lellmann and Schnörr2011], mrf-stereo [Scharstein and Szeliski2002] and OnCallRostering [Stuckey et al.2014], proteinfolding [Yanover, Schueler-Furman, and Weiss2008]. Each of these datasets is included to highlight specific strengths and weaknesses of the competing methods. The worms dataset (30 instances) serves as a prime example for our algorithm due to its relatively densely connected graph structure and a small duality gap. The mrf-stereo (3 instances) and color-seg (19 instances) datasets consist of sparsely connected grid-models and are used to compare performance to the CombiLP method clp. The proteinfolding dataset can be split into easy problems (many nodes, sparsely connected) and hard problems (only around 33-40 nodes, fully connected). In the following, we only consider the hard problems ( instances in total). Last but not least, the dataset OnCallRostering ( instances) is included as an example of higher-order models, which include cliques of order four. Unfortunately, we were unable to convert other instances of this dataset from the benchmark [Hurley et al.2016] because of a memory bottlenecks in the conversion process. Apart from OnCallRostering and worms, all other problem instances were taken from the OpenGM benchmark [Kappes et al.2015].

Results   We compare and analyse performance of our method in the following three settings: (i) targeted dense problems like the worms and protein-folding datasets; (ii) sparse problems (mrf-stereo and color-seg), and (iii) exemplary higher-order problems (OnCallRostering).

(i) dense models: On the worms dataset our method dclp-tb2 clearly outperforms competitors, as Table 2 shows. dclp-tb2 solves instances out of , the next competitor clp-tb2 – only 17. Moreover, our solver is also more than times faster than clp-tb2 in average. This is due to the fact that the resulting ILP subproblem of dclp is much smaller that those of clp, see Figure 1 for visual comparison. The partial optimality method is unable to reduce the problem (see Table 1) because of infinite pairwise costs to disallow assigning the same label to different nodes. Figure 2 shows primal and dual bounds as a function of computational time for this dataset.

Although on the protein-folding dataset dclp-tb2 also outperforms all its competitors, the improvement over clp-tb2 and tb2 is not that pronouncing as for the worms dataset. This is because the final ILP subproblem of dclp covers a significant part of the whole graph (over in average). To satisfy its optimality criterion, dclp performs up to iterations with smaller ILP subproblems. In contrast, clp considers the whole graph as an ILP subproblem right at the very first iteration. Interestingly that even under this circumstances clp-tb2 outperforms tb2 and clp-cpx outperforms cpx (see supplement for details). The latter solves only problem instances out of , whereas clp-cpx is able to cope with . We attribute it to the reparametrization, which is performed by clp prior to passing the problem to cpx or tb2 and plays a role of an efficient presolving.

(ii) sparse models: Sparse (grid-structured) datasets mrfstereo and colorseg with about graph nodes each are very well suitable for both clp and dclp methods and are difficult for cpx and tb2. Both clp and dclp are able to solve all the problems except the largest one (teddy from mrf-stereo dataset with over nodes and labels) in similar time. On color-seg the method clp-tb2 is somewhat faster, whereas dclp-tb2 requires less time on mrf-stereo. This is due to the fact that dclp consistently produces smaller ILP subproblems (see Table 1 for comparison), but clp may require less iterations due to the start with a larger ILP subproblem. Partial optimality popt-tb2 is the winner for the color-seg dataset: Although its ILP subproblems are larger than those of clp and dclp, it runs an ILP solver only once. However, results of popt-tb2 on mrf-stereo are useful only up to a limited extend: They are sufficient to solve only a single, the simplest problem from that dataset (tsukuba). dclp-tb2 and clp-tb2 in contrast solve two problem instances each.

(iii) higher-order models: The dataset OnCallRostering is included mainly to show applicability of our method to higher-order models. Generally, higher-order models pose additional difficulties to solvers because they are intrinsically dense and the size of an ILP formulation of the problem grows exponentially with the problem order, therefore even small problems may not fit into memory of an ILP solver. The dclp method again shows its advantage over clp as similarly as in the case of the worms dataset: Since the problems are intrinsically dense, the ILP subproblem for dclp is smaller, which results in speed-up compared to clp. We also found tb2 and cpx to be quite efficient on this dataset, although they were able to solve only problems out of .

## 9 Conclusions

We presented a new method, suitable to solve efficiently large-scale MAP-inference problems. The prerequisites for efficiency is the “almost” tight LP relaxation, i. e. the non-strict-arc-consistent subset of nodes should constitute only a small portion of the problem. In this case, it isnot the size of the problem which important, but only the size of its non-strict-arc-consistent subproblem. Comparing to previous works, our method is able to further reduce this size, which is especially notable if the underlying graph structure of the model is non-sparse. In the future, we plan to extend the method to a broader class of combinatorial problems.

## Acknowledgement

This work was supported by the DFG grant “Exact Relaxation-Based Inference in Graphical Models” (SA 2640/1-1). We thank the Center for Information Services and High Performance Computing (ZIH) at TU Dresden for generous allocations of computer time.

## References

• [Bergtholdt et al.2010] Bergtholdt, M.; Kappes, J.; Schmidt, S.; and Schnörr, C. 2010. A study of parts-based object class detection using complete graphs. International journal of computer vision 87(1):93–117.
• [Boros and Hammer2002] Boros, E., and Hammer, P. L. 2002. Pseudo-boolean optimization. Discrete applied mathematics 123(1):155–225.
• [Cooper and Schiex2004] Cooper, M., and Schiex, T. 2004. Arc consistency for soft constraints. Artificial Intelligence 154(1-2):199–227.
• [Cooper et al.2008] Cooper, M. C.; de Givry, S.; Sanchez, M.; Schiex, T.; and Zytnicki, M. 2008. Virtual arc consistency for weighted csp. In AAAI, volume 8, 253–258.
• [Cooper et al.2010] Cooper, M. C.; de Givry, S.; Sánchez, M.; Schiex, T.; Zytnicki, M.; and Werner, T. 2010. Soft arc consistency revisited. Artificial Intelligence 174(7):449–478.
• [CPLEX, IBM2014] CPLEX, IBM. 2014. ILOG CPLEX 12.6 Optimization Studio.
• [de Givry and Katsirelos2017] de Givry, S., and Katsirelos, G. 2017. Clique cuts in weighted constraint satisfaction. In International Conference on Principles and Practice of Constraint Programming, 97–113. Springer.
• [Gurobi Optimization2016] Gurobi Optimization, I. 2016. Gurobi optimizer reference manual.
• [Hurley et al.2016] Hurley, B.; O’Sullivan, B.; Allouche, D.; Katsirelos, G.; Schiex, T.; Zytnicki, M.; and De Givry, S. 2016. Multi-language evaluation of exact solvers in graphical model discrete optimization. Constraints 21(3):413–434.
• [Kainmueller et al.2017] Kainmueller, D.; Jug, F.; Rother, C.; and Meyers, G. 2017. Graph matching problems for annotating c. elegans. Accessed: 2017-09-10.
• [Kappes et al.2015] Kappes, J. H.; Andres, B.; Hamprecht, F. A.; Schnörr, C.; Nowozin, S.; Batra, D.; Kim, S.; Kausler, B. X.; Kröger, T.; Lellmann, J.; et al. 2015. A comparative study of modern inference techniques for structured discrete energy minimization problems. International Journal of Computer Vision 115(2):155–184.
• [Kolmogorov, Thapper, and Zivny2015] Kolmogorov, V.; Thapper, J.; and Zivny, S. 2015. The power of linear programming for general-valued csps. SIAM Journal on Computing 44(1):1–36.
• [Kolmogorov2006] Kolmogorov, V. 2006. Convergent tree-reweighted message passing for energy minimization. Pattern Analysis and Machine Intelligence, IEEE Transactions on 28(10):1568–1583.
• [Kolmogorov2015] Kolmogorov, V. 2015. A new look at reweighted message passing. IEEE transactions on pattern analysis and machine intelligence 37(5):919–930.
• [Koster, Van Hoesel, and Kolen1998] Koster, A. M.; Van Hoesel, S. P.; and Kolen, A. W. 1998. The partial constraint satisfaction problem: Facets and lifting theorems. Operations research letters 23(3):89–97.
• [Lellmann and Schnörr2011] Lellmann, J., and Schnörr, C. 2011. Continuous multiclass labeling approaches and algorithms. SIAM Journal on Imaging Sciences 4(4):1049–1096.
• [Li, Shekhovtsov, and Huber2016] Li, M.; Shekhovtsov, A.; and Huber, D. 2016. Complexity of discrete energy minimization problems. In European Conference on Computer Vision, 834–852. Springer.
• [Marinescu and Dechter2005] Marinescu, R., and Dechter, R. 2005. And/or branch-and-bound for graphical models. In IJCAI, 224–229.
• [Otten and Dechter2010] Otten, L., and Dechter, R. 2010. Toward parallel search for optimization in graphical models. In ISAIM.
• [Prusa and Werner2013] Prusa, D., and Werner, T. 2013. Universality of the local marginal polytope. In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

, 1738–1743.
• [Ravikumar, Agarwal, and Wainwright2010] Ravikumar, P.; Agarwal, A.; and Wainwright, M. J. 2010. Message-passing for graph-structured linear programs: Proximal methods and rounding schemes. Journal of Machine Learning Research 11(Mar):1043–1080.
• [Rother et al.2007] Rother, C.; Kolmogorov, V.; Lempitsky, V.; and Szummer, M. 2007. Optimizing binary mrfs via extended roof duality. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on, 1–8. IEEE.
• [Savchynskyy et al.2013] Savchynskyy, B.; Kappes, J. H.; Swoboda, P.; and Schnörr, C. 2013. Global MAP-optimality by shrinking the combinatorial search area with convex relaxation. In Advances in Neural Information Processing Systems, 1950–1958.
• [Scharstein and Szeliski2002] Scharstein, D., and Szeliski, R. 2002. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International journal of computer vision 47(1-3):7–42.
• [Shekhovtsov, Swoboda, and Savchynskyy2015] Shekhovtsov, A.; Swoboda, P.; and Savchynskyy, B. 2015. Maximum persistency via iterative relaxed inference with graphical models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 521–529.
• [Shekhovtsov2014] Shekhovtsov, A. 2014. Maximum persistency in energy minimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1162–1169.
• [Shlezinger1976] Shlezinger, M. 1976. Syntactic analysis of two-dimensional visual signals in the presence of noise. Cybernetics and systems analysis 12(4):612–628.
• [Sontag2007] Sontag, D. A. 2007. Cutting plane algorithms for variational inference in graphical models. Ph.D. Dissertation, Massachusetts Institute of Technology.
• [Stuckey et al.2014] Stuckey, P. J.; Feydy, T.; Schutt, A.; Tack, G.; and Fischer, J. 2014. The minizinc challenge 2008–2013. AI Magazine 35(2):55–60.
• [Swoboda et al.2016] Swoboda, P.; Shekhovtsov, A.; Kappes, J.; Schnörr, C.; and Savchynskyy, B. 2016. Partial Optimality by Pruning for MAP-Inference with General Graphical Models. IEEE Trans. Patt. Anal. Mach. Intell. 38(7):1370–1382.
• [Werner2007] Werner, T. 2007. A linear programming approach to max-sum problem: A review. Pattern Analysis and Machine Intelligence, IEEE Transactions on 29(7):1165–1179.
• [Yanover, Schueler-Furman, and Weiss2008] Yanover, C.; Schueler-Furman, O.; and Weiss, Y. 2008. Minimizing and learning energy functions for side-chain prediction. Journal of Computational Biology 15(7):899–911.

## 10 Supplementary Material for “Exact MAP-Inference by Confining Combinatorial Search with LP Relaxation”

Stefan Haller22footnotemark: 2, Paul Swoboda33footnotemark: 3, Bogdan Savchynskyy22footnotemark: 2

22footnotemark: 2

University of Heidelberg,   33footnotemark: 3 IST Austria

stefan.haller@iwr.uni-heidelberg.de

Proof of Proposition 1

###### Proof.

From it follows that for all . Hence the right-hand-sides of (5) and (6) are equal. ∎

Proof of Proposition 2

###### Lemma 1.

From requirements of Proposition 1 follows that .

###### Proof.

Since and , it holds that . From strict arc-consistency we know that and are determined by (9), hence for all .

From decomposition (5) we known that where the last equality holds due to Proposition 1. From optimality of we know that , hence for all as there is only one unique solution. ∎

Proof of Proposition 2:

###### Proof.

From the requirements of the Proposition we already know that and are the optimal assignment of and respectively. It remains to show that for all .

Applying (5) to results in . Due to if either or . In the first case it follows from , as for the optimal strictly arc-consistent label for it holds that and (Lemma 1). In the second case (Lemma 1) and from fulfillment of Proposition (1) for follows that .

Hence for and all requirements of Proposition (1) are fulfilled for . ∎

Proof of Proposition 3

###### Proof.

The prerequisites already assure that is optimal for , so it remains to show that is optimal for and that for all . The optimality of for follows trivially from and the fact that is optimal for , as for both and the optimal labeling is determined by (9). From Definition 2 we know that . In other words, all edges are covered by subgraph (note that ). Due to , for all it is true that . From the preconditions we know that for all , hence . ∎

Reparametrization Post-Processing

The details of both steps are described below.

(i) In order to obtain a fast post-processing algorithm we modified one of the fastest dual methods, TRW-S of [Kolmogorov2006], which also can be seen as a special case of its higher-order counterpart SRMP [Kolmogorov2015]. For the sake of brevity we refer to the latter method, because of its simpler presentation and because it works also for higher order models (see below). Our whole modification consisted in reassigning the weights defined by expression (14) in [Kolmogorov2015] with the values

 ω+α,β=⎧⎨⎩1|O+β|+max{|I+β|,|Iβ−I+β|}+λif (α,β)∈I+β0,if (α,β)∈Iβ−I+β. (13)

We also performed the same reassignment of the weights  (defined by expression (16) in [Kolmogorov2015]), with in place of . We refer to [Kolmogorov2015] for a detailed description of the notation and the algorithm itself. The only difference between (13) and the original expression (14) from [Kolmogorov2015] is an additional term added to the denominator of the expression in the upper line of (13). The non-zero leads to redistribution of the labeling costs between unary cost functions. Therefore, non-optimal labels get higher costs than those belonging to an optimal labeling. We empirically found the value to work well in practice.

(ii) It is a property of a (modified) SRMP method that locally optimal pairwise costs are always for any graph edge . The mentioned above partial redistribution of unary costs between incident pairwise cost functions was done as for all and .

Labelwise relative ILP size   As the partial optimality techniques work on a labelwise basis, we use a labelwise measure for comparing the size of the final ILP subproblem. For popt we use the formula (35) of [Shekhovtsov, Swoboda, and Savchynskyy2015] to compute the relative number of eliminated labels. Subtracting this value from 100% yields the values in Table 1. The final formula for popt looks like the following

 1−∑v∈V|Xv∖pv(Xv)|∑u∈V(|Xv|−1). (14)

For clp and dclp we evaluate the following expression to compute the value with the same semantic:

 1−∑v∈VA(|Xv|−1)∑u∈VG(|Xv|−1). (15)

As popt is a polynomial time algorithm, it will output a reduced model for all instance of the benchmark. As clp and dclp try to solve the NP-hard MAP-inference problem (1), they do not terminate for all instances. As the maximal ILP subproblem is only defined after Proposition 1 holds, we assume the worst-case and use 100% as ILP subproblem size for unsolved instances.

Complete benchmark table   For lack of space we removed some solvers from the experimental evaluation. The following tables show the results for each instance and each solver separately..