Accelerated Inference in Markov Random Fields via Smooth Riemannian Optimization

10/27/2018 ∙ by Siyi Hu, et al. ∙ MIT

Markov Random Fields (MRFs) are a popular model for several pattern recognition and reconstruction problems in robotics and computer vision. Inference in MRFs is intractable in general, and related work resorts to approximation algorithms. Among those techniques, semidefinite programming (SDP) relaxations have been shown to provide accurate estimates, while scaling poorly with the problem size and being typically slow for practical applications. Our first contribution is to design a dual ascent method to solve standard SDP relaxations that takes advantage of the geometric structure of the problem to speed up computation. This technique, named Dual Ascent Riemannian Staircase (DARS), is able to solve large problem instances in seconds. Since our goal is to enable real-time inference on robotics platforms, we develop a second and faster approach. The backbone of this second approach is a novel SDP relaxation combined with a fast and scalable solver based on smooth Riemannian optimization. We show that this approach, named Fast Unconstrained SEmidefinite Solver (FUSES), can solve large problems in milliseconds. Contrary to local MRF solvers, e.g., loopy belief propagation, our approaches do not require an initial guess. Moreover, we leverage recent results from optimization theory to provide per-instance sub-optimality guarantees. We demonstrate the proposed approaches in multi-class image segmentation problems. Extensive experimental evidence shows that (i) FUSES and DARS produce near-optimal solutions, attaining an objective within 0.2% of the optimum, (ii) FUSES and DARS are remarkably faster than general-purpose SDP solvers, and FUSES is more than two orders of magnitude faster than DARS while attaining similar solution quality, and (iii) FUSES is faster than local search methods while being a global solver, making it a good candidate for real-time robotics applications.

I Introduction

Markov Random Fields (MRFs) are a popular graphical model for reconstruction and recognition problems in computer vision and robotics, including 2D and 3D semantic segmentation, stereo reconstruction, image restoration and denoising, texture synthesis, object detection, and panorama stitching [1, 2, 3]. An MRF can be understood as a factor graph including only unary and binary factors, and where node variables are discrete labels. The discrete nature of the variables makes maximum a posteriori (MAP) inference in MRFs intractable in general, hence several MRF-based applications remain out of reach for real-time robotics. Our motivating application in this paper is real-time semantic segmentation, which is crucial for the robot to understand the surrounding environment and execute high-level tasks. Therefore, we are interested in developing real-time MRF solvers that can support online operation at scale.

The literature on MRFs (reviewed in Section VI) is vast and includes methods based on graph cuts, message passing techniques, greedy methods, and convex relaxations, to mention a few. These approaches are typically approximation techniques, in the sense that they attempt to compute near-optimal MAP estimates efficiently (the problem is NP-hard in general, hence we do not expect to compute exact solutions in polynomial time). Among those, semidefinite programming (SDP) relaxations have been recognized to produce accurate approximations [4]. On the other hand, the computational cost of general-purpose SDP solvers prevented widespread use of this technique beyond problems with a few hundred variables [5] (semantic segmentation typically involves thousands to millions of variables), and SDPs lost popularity in favor of computationally cheaper alternatives including move-making algorithms (based on graph cuts) and message passing. Move-making methods [6] require specific assumptions on the MRF, and their performance typically degrades when these assumptions are not satisfied. Message passing methods [7, 8], on the other hand, may not even converge, even though they are observed to work well in practice.

Fig. 1: Snapshots of the multi-label semantic segmentation computed by the proposed MRF solvers (a) FUSES and (b) DARS on the Cityscapes dataset. FUSES is able to segment an image in 17ms (1000 superpixels).

Contribution. Our first contribution, presented in Section III, is to design a dual-ascent-based method to solve standard SDP relaxations that takes advantage of the geometric structure of the problem to speed up computation. In particular, we show that each dual ascent iteration can be solved using a fast low-rank SDP solver known as the Riemannian Staircase [9]. Upon convergence of the dual ascent iterations this technique attains the same objective as the standard SDP relaxation while being more scalable. This technique, named Dual Ascent Riemannian Staircase (DARS), is able to solve MRF instances with thousands of variables in seconds, while general-purpose SDP solvers (e.g., cvx [10]) are not able to provide an answer in reasonable time (hours) at that scale.

Our second contribution, presented in Section IV, is an even faster SDP relaxation. Despite being remarkably faster than general-purpose SDP solvers, DARS is still too slow for real-world robotics applications; hence, we develop a Fast Unconstrained SEmidefinite Solver (FUSES) that can solve large problems in milliseconds. The backbone of this second approach is a novel SDP relaxation combined with the Riemannian Staircase method [9]. The novel formulation uses a more intuitive binary matrix parametrization (with entries in {0,1}), contrary to related work that parametrizes the problem using a vector with entries in {−1,+1}.

Our third contribution is an extensive experimental evaluation. We test the proposed SDP solvers in semantic image segmentation problems and evaluate the corresponding results in terms of accuracy and runtime. We compare the proposed techniques against several related approaches, including move-making methods (α-expansion [6]) and message passing (Loopy Belief Propagation [8] and Tree-Reweighted Message Passing [7]). The results show that our MRF solvers retain all the advantages of SDP relaxations (accuracy, no need for an initial guess, no assumption on the objective function), while being fast and scalable. More specifically, our results show that (i) FUSES and DARS produce near-optimal solutions, attaining an objective within 0.2% of the optimum, (ii) FUSES and DARS are remarkably faster than general-purpose SDP solvers, and FUSES is more than two orders of magnitude faster than DARS while attaining similar solution quality, (iii) FUSES is faster than local search methods while being a global solver.

Before delving into the contribution of this paper, Section II provides preliminary notions on inference in MRFs, while we postpone the review of related work to Section VI. All proofs are given in the supplemental material [11].

II Preliminaries

This section introduces standard notation for MRFs (Section II-A) and provides necessary background on semidefinite relaxations (Section II-B).

II-A Markov Random Fields: Models and Inference

A Markov Random Field (MRF) is a graphical model in which nodes are associated with discrete labels we want to estimate, and edges (or potentials) represent given probabilistic constraints relating the labels of a subset of nodes. Formally, for each node i in the node set V (where N = |V| is the number of nodes), we need to assign a label x_i ∈ L, where L = {1, …, K} is the set of possible labels. If K = 2 (i.e., nodes are classified into two classes) the corresponding model is called a binary MRF. Here we consider K > 2 possible labels, a setup generally referred to as a multi-label MRF.

Maximum a posteriori (MAP) inference. The MAP estimate is the most likely assignment of labels, i.e., the assignment of the node labels that attains the maximum of the posterior distribution of an MRF, or, equivalently, the minimum of the negative log-posterior. MAP estimation can be formulated as a discrete optimization problem over the labels x = [x_1, …, x_N] [1]:

min_{x ∈ L^N}  Σ_{i ∈ U} φ_i(x_i) + Σ_{(i,j) ∈ E} φ_ij(x_i, x_j)    (P0)

where U is the set of unary potentials (probabilistic constraints involving a single node), E is the set of binary potentials (involving a pair of nodes), and φ_i and φ_ij represent the negative log-distribution for each unary and binary potential, respectively (described below). For instance, in semantic segmentation each node in the MRF corresponds to a pixel (or superpixel) in the image, the unary potentials are dictated by pixel-wise classification from a classifier applied to the image, and the binary potentials enforce smoothness of the resulting segmentation [12]. The binary potentials (often referred to as smoothness priors) are typically enforced between nearby (adjacent) pixels.

MRF Potentials. A typical form for the unary and binary potentials is:

φ_i(x_i) = 0 if x_i = z̄_i, and φ_i(x_i) = ᾱ_i otherwise;    φ_ij(x_i, x_j) = 0 if x_i = x_j, and φ_ij(x_i, x_j) = β̄_ij otherwise,    (1)

where z̄_i is a data-driven noisy measurement of the label of node i (typically from a classifier), and ᾱ_i and β̄_ij are given scalars. Typically, it is assumed ᾱ_i > 0, i.e., choosing a label different from the measured one incurs a cost in (P0). Similarly, for the binary potentials it is typically assumed β̄_ij > 0, i.e., label mismatch (x_i ≠ x_j) incurs a cost of β̄_ij in the objective (P0). In this case the binary potentials are called attractive, while they are referred to as repulsive when β̄_ij < 0 (i.e., the potentials encourage label mismatches) [13].

The MRF resulting from the choice of potentials in eq. (1) is known as the Potts model [14], which was first proposed in statistical mechanics to model interacting spins in ferromagnetic materials. When K = 2 (binary MRFs) the resulting model is known as the Ising model [2, Section 1.4.1].
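To make the cost in (P0) concrete, the sketch below evaluates the Potts energy of a label assignment; the function name, variable names, and toy data are illustrative and not taken from the paper's implementation.

```python
# Illustrative evaluation of the Potts-model cost (P0): unary cost alpha[i]
# when labels[i] disagrees with the measured label z[i], and binary cost
# beta[k] when the labels across the k-th edge disagree.
def potts_energy(labels, z, alpha, edges, beta):
    unary = sum(alpha[i] for i in range(len(labels)) if labels[i] != z[i])
    binary = sum(beta[k] for k, (i, j) in enumerate(edges) if labels[i] != labels[j])
    return unary + binary
```

Under this convention, the MAP estimate is the assignment minimizing `potts_energy` over all label vectors.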

II-B Standard Semidefinite Relaxation

Semidefinite programming (SDP) relaxation has been shown to provide an effective approach to compute a good approximation of the global minimizer of (P0) [5, 15, 16]. In this section we introduce a standard approach to obtain an SDP relaxation, for which we design a fast solver in Section III.

In order to obtain an SDP relaxation, related works rewrite each node variable x_i as a vector z_i ∈ {−1,+1}^K, such that z_i has a single entry equal to +1 (all the others are −1), and if the k-th entry of z_i is +1, then the corresponding node has label k. Moreover, they stack all vectors z_i, i = 1, …, N, in a single NK-vector z. Using this reparametrization, the inference problem (P0) can be written in terms of the vector z as follows (full derivation in Appendix A):

min_z  z^T H z + 2 h^T z   s.t.  z ∘ z = 1_{NK},   (e_i^T ⊗ 1_K^T) z = 2 − K,  i = 1, …, N    (2)

where H and h are a suitable symmetric matrix and a suitable vector collecting the coefficients of the binary terms and the unary terms in (1), respectively; z ∘ z is the entrywise square of z; and, in the second set of constraints, e_i is an N-vector which is all zero except the i-th entry, which is one, 1_K is a K-vector of ones, and ⊗ is the Kronecker product. Intuitively, z ∘ z contains the square of each entry of z, hence z ∘ z = 1_{NK} imposes that every entry of z has norm 1, i.e., it belongs to {−1,+1}; the constraint (e_i^T ⊗ 1_K^T) z = 2 − K writes in compact form 1_K^T z_i = 2 − K, which enforces each node to have a unique label (i.e., a single entry in z_i can be +1, while all the others are −1).

Before relaxing problem (2), it is convenient to homogenize the objective by reparametrizing the problem in terms of an extended vector z̃ = [z^T 1]^T, where an entry equal to 1 is concatenated to z. We can now rewrite (2) in terms of z̃:

min_{z̃}  z̃^T H̃ z̃   s.t.  z̃ ∘ z̃ = 1_{NK+1},   (e_i^T ⊗ 1_K^T) z = 2 − K,  i = 1, …, N    (P1)

where H̃ collects H, h, and a zero diagonal block in a single (NK+1) × (NK+1) symmetric matrix. In (P1), we used the equality z̃^T H̃ z̃ = z^T H z + 2 h^T z, and noted that since the last entry of z̃ equals 1, then z̃ ∘ z̃ = 1_{NK+1}.

So far we have only reparametrized problem (P0), hence (P1) is still a MAP estimator. We can now introduce the SDP relaxation: problem (P1) only includes terms in the form z̃_i z̃_j, hence we can reparametrize it using a matrix Z = z̃ z̃^T. Moreover, we note that the set of matrices Z that can be written as z̃ z̃^T is the set of positive semidefinite (Z ⪰ 0) rank-1 matrices. Rewriting (P1) using Z and dropping the non-convex rank-1 constraint, we obtain:

min_{Z ⪰ 0}  tr(H̃ Z)   s.t.  diag(Z) = 1_{NK+1},   (e_i^T ⊗ 1_K^T) [Z]_{1:NK, NK+1} = 2 − K,  i = 1, …, N    (S1)

which is a (convex) semidefinite program and can be solved globally in polynomial time using interior-point methods [17]. While the SDP relaxation (S1) is known to provide near-optimal approximations of the MAP estimate, interior-point methods are typically slow in practice and cannot solve problems with more than a few hundred nodes in a reasonable time.
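As a quick sanity check of the lifting behind (S1), the snippet below (an illustration, not the paper's code) verifies that Z = z z^T built from a ±1 vector satisfies exactly the properties the relaxation keeps or drops: positive semidefiniteness, unit diagonal, and (the dropped constraint) rank one.

```python
import numpy as np

# For any z with entries in {-1, +1}, Z = z z^T is PSD, rank-1, and has unit
# diagonal; (S1) keeps PSD and unit diagonal and drops the rank-1 constraint.
z = np.array([1.0, -1.0, -1.0, 1.0])
Z = np.outer(z, z)

evals = np.linalg.eigvalsh(Z)
is_psd = bool(evals.min() >= -1e-9)
rank = int(np.sum(evals > 1e-9))
unit_diag = bool(np.allclose(np.diag(Z), 1.0))
```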

III DARS: Dual Ascent Riemannian Staircase

This section presents the first contribution of this paper: a dual ascent approach to efficiently solve large instances of the standard SDP relaxation (S1).

III-A Dual Ascent Approach

The main goal of this section is to design a dual ascent method, where the subproblem to be solved at each iteration has a more favorable geometry, and can be solved quickly using the Riemannian Staircase method introduced in Section III-B. Towards this goal, we rewrite (S1) equivalently as:

min_Z  tr(H̃ Z) + 1(Z ⪰ 0) + 1(diag(Z) = 1_{NK+1})   s.t.  g_i(Z) = 2 − K,  i = 1, …, N    (3)

(where g_i(Z) denotes the i-th label-uniqueness constraint of (S1), written as a linear function of Z)

where the objective function is now f(Z) ≜ tr(H̃ Z) + 1(Z ⪰ 0) + 1(diag(Z) = 1_{NK+1}), and 1(·) is the indicator function, which is zero when the constraint inside the parentheses is satisfied and plus infinity otherwise.

Under constraint qualification (e.g., Slater's condition for convex programs [18, Theorem 3.1]), we can obtain an optimal solution to (3) by computing a saddle point of the Lagrangian function L(Z, λ):

L(Z, λ) ≜ f(Z) + Σ_{i=1}^N λ_i ( g_i(Z) − (2 − K) )    (4)

where λ ∈ R^N is the vector of dual variables and Z is the primal variable.

The basic idea behind dual ascent [19, Section 2.1] is to solve the saddle-point problem (4) by alternating maximization steps with respect to the dual variable λ and minimization steps with respect to the primal variable Z.

Dual Maximization. The maximization over the dual variable is carried out via gradient ascent. In particular, at each iteration t = 1, …, T (T is the maximum number of iterations), the dual ascent method fixes the primal variable Z^(t) and updates the dual variable as:

λ^(t+1) = λ^(t) + γ ∇_λ L(Z^(t), λ^(t))    (5)

where ∇_λ L(Z^(t), λ^(t)) is the gradient of the Lagrangian with respect to the dual variables, evaluated at the latest estimate of the primal-dual variables (Z^(t), λ^(t)), and γ is a suitable stepsize. It is straightforward to compute the gradient with respect to the i-th dual variable: it equals the violation of the i-th dualized constraint. Intuitively, the second summand in (4) penalizes the violation of the i-th label constraint (for all i). Moreover, since the gradient in (5) grows with the amount of violation, the dual update (5) increases the penalty for constraints with large violation.
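The dual update can be sketched as follows; the generic constraint pair (A_i, b_i) below is an illustrative stand-in for the dualized label constraints, and the function name is hypothetical.

```python
import numpy as np

# One dual gradient-ascent step, cf. (5): the gradient entry for the i-th
# multiplier equals the violation of the i-th linear constraint <A_i, Z> = b_i.
def dual_ascent_step(lmbda, Z, A_list, b, step):
    grad = np.array([np.sum(A * Z) - bi for A, bi in zip(A_list, b)])
    return lmbda + step * grad, grad
```

Constraints that are exactly satisfied contribute a zero gradient entry, so their multipliers stay unchanged.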

Primal Minimization. The minimization step fixes the dual variable to the latest estimate λ^(t+1) and minimizes (4) with respect to the primal variable Z:

Z^(t+1) ∈ argmin_Z  L(Z, λ^(t+1))    (6)

where we substituted "min" for "inf" since the objective cannot drift to minus infinity due to the implicit constraints imposed by the indicator functions in f. Recalling the expression of f, defining H̃_λ as the symmetric matrix satisfying tr(H̃_λ Z) = tr(H̃ Z) + Σ_i λ_i g_i(Z), and moving again the indicator functions to the constraints, we write (6) more explicitly as:

min_Z  tr(H̃_λ Z)   s.t.  Z ⪰ 0,   diag(Z) = 1_{NK+1}    (7)

where we dropped the constant terms from the objective since they are irrelevant for the optimization. The minimization step in the dual ascent is again an SDP, but contrarily to the standard SDP (S1), problem (7) can be solved quickly using the Riemannian Staircase, as discussed in the following.

III-B A Riemannian Staircase for the Dual Ascent Iterations

This section provides a fast solver to compute a solution for the SDP (7), that needs to be solved at each iteration of the dual ascent method of Section III-A.

We use the Burer-Monteiro method [20], which replaces the matrix Z in (7) with a rank-r product Y Y^T, with Y ∈ R^{(NK+1)×r}:

min_{Y ∈ R^{(NK+1)×r}}  tr(H̃_λ Y Y^T)   s.t.  ‖y_i‖ = 1,  i = 1, …, NK+1    (8)

(where y_i denotes the i-th row of Y)

Note that the constraint Z ⪰ 0 in (7) becomes redundant after the substitution, since Y Y^T is always positive semidefinite; hence it is dropped.

Following Boumal et al. [9] we note that the constraint set in (8) describes a smooth manifold, and in particular a product of Stiefel manifolds. To make this apparent, we recall that the (transposed) Stiefel manifold is defined as [9]:

St(d, r) ≜ { Y ∈ R^{d×r} : Y Y^T = I_d }    (9)

Then, we observe that the constraint in (8) can be written as ‖y_i‖ = 1, i = 1, …, NK+1 (where y_i is the i-th row of Y), which is equivalent to saying that y_i ∈ St(1, r) for each i. This observation allows concluding that the matrix Y belongs to the product manifold St(1, r)^{NK+1}. Therefore, we can rewrite (8) as an unconstrained optimization on manifold:

min_{Y ∈ St(1, r)^{NK+1}}  tr(H̃_λ Y Y^T)    (R1)

The formulation (R1) is non-convex (the product of Stiefel manifolds describes a non-convex set), but one can find local minima efficiently using iterative methods [9, 21]. While it might seem that little was gained (we started with an intractable problem and we ended up with another non-convex problem), the following remarkable result from Boumal et al. [9] ties back local solutions of (R1) to globally optimal solutions of the SDP (7).

Proposition 1 (Optimality Conditions for (R1), Corollary 8 in [9])

If Y is a (column) rank-deficient second-order critical point of problem (R1), then Y is a global optimizer of (R1), and Z = Y Y^T is a solution of the semidefinite relaxation (7).

The previous proposition ensures that when local solutions (second-order critical points) of (R1) are rank deficient, then they can be mapped back to global solutions of (7), hence providing a way to solve (7) efficiently via (R1).

The catch is that one has to choose the rank r large enough to obtain rank-deficient solutions. Related work [9] therefore proposes the Riemannian staircase method, where one solves (R1) for increasing values of r until a rank-deficient solution is found. Boumal et al. [9] also provide theoretical results ensuring that rank-deficient solutions are found for small r (more details in Section V).
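A minimal NumPy sketch of the low-rank step follows: each row of Y is constrained to unit norm (the St(1, r) factors of (R1)), and we use plain Riemannian gradient descent with a row-normalization retraction instead of the truncated-Newton trust-region method used in the paper. The matrix Q, the stepsize, and the iteration budget are illustrative assumptions.

```python
import numpy as np

# Riemannian gradient descent for min tr(Q Y Y^T) over Y with unit-norm rows
# (a product of spheres). This is a didactic stand-in for the solver in [21].
def low_rank_descent(Q, r, iters=1000, step=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n = Q.shape[0]
    Y = rng.standard_normal((n, r))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)        # start on the manifold
    for _ in range(iters):
        G = 2.0 * Q @ Y                                  # Euclidean gradient
        G -= np.sum(G * Y, axis=1, keepdims=True) * Y    # project to tangent space
        Y = Y - step * G
        Y /= np.linalg.norm(Y, axis=1, keepdims=True)    # retraction
    return Y
```

For a 2×2 coupling matrix Q = [[1, −1], [−1, 1]], the objective 2 − 2 y_1·y_2 is minimized (to zero) when the two unit rows coincide.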

III-C DARS: Summary, Convergence, and Guarantees

We name DARS (Dual Ascent Riemannian Staircase) the approach resulting from the combination of dual ascent and the Riemannian Staircase. DARS starts with an initial guess for the dual variables (we use λ = 0), and then alternates two steps: (i) the primal minimization, where a solution for (7) is obtained using the Riemannian Staircase (R1) (in practice this is solved using iterative methods, such as the truncated Newton method); (ii) the dual maximization, where the dual variables are updated using the gradient ascent update (5).

Rounding. Upon convergence, DARS produces a matrix Z*. When deriving the standard SDP relaxation (S1) we dropped the rank-1 constraint, hence Z* cannot in general be written as z̃ z̃^T. The process of computing a feasible solution for the original problem (P1) is called rounding. A standard approach for rounding consists in computing a rank-1 approximation of Z* (which can be done via singular value decomposition) and rounding the entries of the resulting vector to {−1,+1}. We refer to ẑ as the rounded estimate and we call f̂ the objective value attained by ẑ in (P1).
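The rounding step can be sketched as follows; the helper name is illustrative, and the global sign is fixed using the homogenization entry, which equals +1 in (P1).

```python
import numpy as np

# DARS-style rounding sketch: best rank-1 approximation of the SDP solution Z
# via SVD, then entrywise rounding of the recovered vector to {-1, +1}.
def round_rank1(Z):
    U, s, _ = np.linalg.svd(Z)
    x = np.sqrt(s[0]) * U[:, 0]
    if x[-1] < 0:          # fix the sign using the homogenization entry (= +1)
        x = -x
    return np.where(x >= 0, 1.0, -1.0)
```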

Convergence. While dual ascent is a popular optimization technique, few convergence results are available in the literature. For instance, dual ascent is known to converge when the original objective is strictly convex [22]. Currently, we observe that DARS converges when the stepsize in (5) is sufficiently small. We prove the following per-instance performance guarantees.

Proposition 2 (Guarantees in DARS)

If the dual ascent iterations converge to a value λ* (i.e., the dual iterations reach a solution where the gradient in (5) is zero), then the following properties hold:

  • let Y* be a (column) rank-deficient second-order critical point of problem (R1) with λ = λ*; then the matrix Z* = Y* (Y*)^T is an optimal solution for the standard SDP relaxation (S1);

  • let f_SDP be the (optimal) objective value attained by Z* in the standard SDP relaxation (S1), f_MAP be the optimal objective of (P1), and f̂ the objective attained by the rounded solution ẑ; then it holds that f_SDP ≤ f_MAP ≤ f̂.

The proof of Proposition 2 is given in Appendix B. The first claim in Proposition 2 ensures that when the dual ascent method converges, it produces an optimal solution for the standard SDP relaxation (S1). The second claim states that we can compute an upper bound on how far the DARS solution is from optimality (f̂ − f_MAP ≤ f̂ − f_SDP) using the rounded objective f̂ and the relaxed objective f_SDP.

IV FUSES: Fast Unconstrained SEmidefinite Solver

In this section we propose a more direct way to obtain a semidefinite relaxation and a remarkably faster solver. While DARS is already able to compute an approximate MAP estimate in seconds for large problems, the approach presented in this section requires two orders of magnitude less time to compute a solution of comparable quality. We first present a binary {0,1} (rather than {−1,+1}) matrix formulation (Section IV-A) and derive an SDP relaxation (Section IV-B). We then present a Riemannian staircase approach to solve the resulting SDP in real time (Section IV-C) and discuss performance guarantees (Section IV-D).

IV-A Matrix Formulation

In this section we rewrite the node variables as an N × K binary matrix X, such that if the entry in position (i, k) is equal to 1, then node i has label k, and the entry is zero otherwise. In other words, the i-th row of X is a binary vector that describes the label of node i and has a single entry equal to 1 in the position corresponding to the label assigned to the node. This is a more intuitive parametrization of the problem and indeed leads to a more elegant matrix formulation, given as follows.

Proposition 3 (Binary Matrix Formulation of MAP-MRF)

Let G ∈ R^{N×N} and Ū ∈ R^{N×K} be defined as follows:

[G]_ij = β̄_ij if (i, j) ∈ E, and [G]_ij = 0 otherwise;    [Ū]_i = ᾱ_i e_{z̄_i}^T,  i = 1, …, N    (10)

where x_i denotes the i-th row of X, [G]_ij is the entry of G in row i and column j, ᾱ_i and β̄_ij are the coefficients defining the MRF, cf. eq. (1), and e_{z̄_i} is a vector with a unique nonzero entry equal to 1 in position z̄_i (z̄_i is the measured label for node i). Then the MAP estimator (P0) can be equivalently written as:

min_X  −(1/2) tr(G X X^T) − tr(Ū X^T)   s.t.  X ∈ {0,1}^{N×K},   X 1_K = 1_N    (11)

The equivalence between (P0) and (11) is proven in Appendix C. We note that the constraint X 1_K = 1_N in (11) (contrary to the constraints in (2)) directly imposes that each node has a unique label whenever X is binary.
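The parametrization of Section IV-A can be sketched as follows; `one_hot_matrix` is an illustrative helper, not part of the paper's code.

```python
import numpy as np

# N x K binary matrix whose i-th row is the one-hot indicator of node i's
# label; X @ ones(K) = ones(N) encodes the uniqueness constraint in (11).
def one_hot_matrix(labels, K):
    X = np.zeros((len(labels), K))
    X[np.arange(len(labels)), labels] = 1.0
    return X
```

A useful consequence is that (X X^T)_ij equals 1 exactly when nodes i and j share the same label, which is what lets the pairwise Potts terms be written through X X^T.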

IV-B Novel Semidefinite Relaxation

This section presents a semidefinite relaxation of (11). Towards this goal, we first homogenize the cost by lifting the problem to work on a larger variable:

Z ≜ [X; I_K] [X; I_K]^T = [ X X^T , X ; X^T , I_K ]    (12)

where I_K is the K × K identity matrix. The reparametrization is given as follows.

Proposition 4 (Homogenized Binary Matrix Formulation)

Let us define H̄ ≜ −(1/2) [ G , Ū ; Ū^T , 0 ]. Then the MAP estimator (11) can be rewritten as:

min_Z  tr(H̄ Z)   s.t.  Z as in (12) with X ∈ {0,1}^{N×K},   diag([Z]_{11}) = 1_N,   [Z]_{22} = I_K    (P2)

where [Z]_{11} denotes the top-left N × N block of the matrix Z, cf. (12) (the corresponding constraint rewrites the first constraint in (11)), and [Z]_{22} denotes the bottom-right K × K block of Z, cf. (12).

At this point it is straightforward to derive a semidefinite relaxation, by noting that Z ⪰ 0 and by observing that Z in (12) is a symmetric positive semidefinite matrix of rank K.
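The structure of the lifted variable can be checked numerically; the toy instance below is illustrative.

```python
import numpy as np

# Stacking V = [X; I_K] gives Z = V V^T: a symmetric PSD matrix of rank K
# whose bottom-right K x K block is I_K and whose top-left diagonal is all
# ones (each one-hot row of X has unit norm).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])   # 3 nodes, K = 2 labels
V = np.vstack([X, np.eye(2)])
Z = V @ V.T

evals = np.linalg.eigvalsh(Z)
rank = int(np.sum(evals > 1e-9))
```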

Proposition 5 (Semidefinite Relaxation)

The following SDP is a convex relaxation of the MAP estimator (P2):

min_{Z ⪰ 0}  tr(H̄ Z)   s.t.  diag([Z]_{11}) = 1_N,   [Z]_{22} = I_K    (S2)

where [Z]_{11} and [Z]_{22} are the top-left block and the bottom-right block of the matrix Z, respectively, and we dropped the rank-K constraint for Z.

IV-C Accelerated Inference via the Riemannian Staircase

We now present a fast specialized solver to solve the SDP (S2) in real time and for large problem instances. Similarly to Section III-B, we use the Burer-Monteiro method [20], which replaces the matrix Z in (S2) with a rank-r product Y Y^T:

min_Y  tr(H̄ Y Y^T)   s.t.  diag([Y Y^T]_{11}) = 1_N,   [Y Y^T]_{22} = I_K    (13)

where Y ∈ R^{(N+K)×r} (for a suitable rank r), and where the constraint Z ⪰ 0 in (S2) becomes redundant after the substitution, and is dropped.

Similarly to Section III-B, we note that the constraint set in (13) describes a smooth manifold, and in particular a product of Stiefel manifolds. Specifically, we observe that the first constraint can be written as ‖y_i‖ = 1, i = 1, …, N (where y_i is the i-th row of Y), which is equivalent to saying that y_i ∈ St(1, r). Moreover, denoting with Y_K the block matrix including the last K rows of Y, the second constraint can be written as Y_K Y_K^T = I_K, which is equivalent to saying that Y_K ∈ St(K, r). The two observations above allow concluding that the matrix Y belongs to the product manifold St(1, r)^N × St(K, r). Therefore, we can rewrite (13) as an unconstrained optimization on manifold:

min_{Y ∈ St(1, r)^N × St(K, r)}  tr(H̄ Y Y^T)    (R2)

The formulation (R2) is non-convex but one can find local minima efficiently using iterative methods [9, 21]. We can again adapt the result from Boumal et al. [9] to conclude that rank-deficient local solutions of (R2) can be mapped back to global solutions of the semidefinite relaxation (S2).

Proposition 6 (Optimality Conditions for (R2), Corollary 8 in [9])

If Y is a (column) rank-deficient second-order critical point of problem (R2), then Y is a global optimizer of (R2), and Z = Y Y^T is a solution of the semidefinite relaxation (S2).

Similarly to Section III-B, we can adopt a Riemannian staircase method, where one solves (R2) for increasing values of r until a rank-deficient solution is found. In all tests we performed, a single step of the staircase was sufficient to find a rank-deficient solution.

IV-D FUSES: Summary, Convergence, and Guarantees

We name FUSES (Fast Unconstrained SEmidefinite Solver) the approach presented in this section. Contrarily to DARS, FUSES is extremely simple and only requires solving the rank-restricted problem (R2), which can be solved using iterative methods, such as the Truncated Newton method. Besides its simplicity, FUSES is guaranteed to converge to the solution of the SDP (S2) for increasing values of the rank (Proposition 6).

Rounding. Upon convergence, FUSES produces a matrix Z*. Similarly to DARS, we obtain a rounded solution by computing a rank-K approximation of Z* and rounding the corresponding N × K block to a binary matrix in {0,1}^{N×K} (i.e., we set the largest element in each row to 1 and we zero out all the others). We denote with X̂ the resulting estimate and we call f̂ the objective value attained by X̂ in (11).
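This per-row rounding can be sketched as follows (the helper name is illustrative).

```python
import numpy as np

# FUSES-style rounding sketch: given a relaxed N x K block, set the largest
# entry of each row to 1 and zero out the rest, yielding a feasible one-hot X.
def round_rows(X_relaxed):
    N, K = X_relaxed.shape
    X = np.zeros((N, K))
    X[np.arange(N), np.argmax(X_relaxed, axis=1)] = 1.0
    return X
```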

Since the SDP (S2) is a relaxation of the MAP estimator (P2), it is straightforward to prove the following proposition.

Proposition 7 (Guarantees in FUSES)

Let f_SDP be the optimal objective attained by Z* in (S2), f_MAP be the optimal objective of (P2), and f̂ be the objective attained by the rounded solution X̂; then f_SDP ≤ f_MAP ≤ f̂.

Again, we can use Proposition 7 to compute how far the solution computed by FUSES is from the optimal objective attained by the MAP estimator.

V Experiments

This section evaluates the proposed MRF solvers on semantic segmentation problems, comparing their performance against the state of the art.

V-A FUSES and DARS: Implementation Details

We implemented FUSES and DARS in C++ using Eigen's sparse matrix manipulation and leveraging the optimization suite developed in [21]. Sparse matrix manipulation is crucial for speed and memory reasons, since the involved matrices are very large: for instance, in DARS the matrix Y in (R1) has size (NK+1) × r, where NK is typically in the thousands while r is small. We initialize the rank of the Riemannian Staircase, for both DARS and FUSES, to the smallest rank for which we expect a rank-deficient solution. The Riemannian optimization problems (R1) and (R2) are solved iteratively using the truncated-Newton trust-region method; we refer the reader to [23] for a description of the implementation of such a method. As in [23], we use the Lanczos algorithm to check that (R1) and (R2) converged to rank-deficient second-order critical points, which are optimal according to Proposition 1 and Proposition 6, respectively. If the optimality condition is not met, the algorithm proceeds to the next step of the Riemannian staircase, repeating the optimization with the rank increased by 1. In all experiments, FUSES finds an optimal solution in the first iteration of the staircase, while we observed that the rank in DARS sometimes increases during the staircase. In DARS, we cap the number of dual ascent iterations and terminate when the gradient in (5) has sufficiently small norm. Using a constant stepsize ensured convergence in all tests.

V-B Setup, Compared Techniques, and Performance Metrics

Setup. We evaluate FUSES and DARS using the Cityscapes dataset [24], which contains a large collection of images of urban scenes with pixel-wise semantic annotations. The annotations include 30 semantic classes (e.g., road, sidewalk, person, car, building, vegetation). We first extract superpixels from the images using OpenCV (we obtain around 1000 superpixels per image, unless specified otherwise). Then, the unary terms are obtained using Bonnet [25], which uses a CNN to obtain pixel-wise segmentation (Bonnet only uses 20 classes for classification purposes); the unary potential for each superpixel is set based on the majority of labels for the corresponding set of pixels. Bonnet returns noisy labels for each (super)pixel, and the role of the MRF is to refine the segmentation by encouraging smoothness of nearby labels. In practice, since CNNs are typically inaccurate at the boundary between different objects, we expect the use of superpixels and MRF to improve the segmentation results. The binary potentials are modeled as a decreasing function of the distance between the average color vectors of adjacent superpixels [2, Section 7.2], with parameters (tuned in our tests) controlling the strength and bandwidth of the smoothness prior.
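A common exponential color-similarity weight of this kind is sketched below; the exact functional form and the parameter values used in the paper are not reproduced here, so `beta0` and `sigma` are assumptions.

```python
import numpy as np

# Illustrative smoothness weight between adjacent superpixels i and j:
# large when their average colors agree, decaying with squared color distance.
def smoothness_weight(c_i, c_j, beta0=1.0, sigma=10.0):
    d2 = float(np.sum((np.asarray(c_i, float) - np.asarray(c_j, float)) ** 2))
    return beta0 * np.exp(-d2 / (2.0 * sigma ** 2))
```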

Compared techniques. We compare the proposed techniques against three state-of-the-art methods: α-expansion [6] (label: α-exp), Loopy Belief Propagation [8] (label: LBP), and Tree-Reweighted Message Passing [7] (label: TRW-S). We use the implementation of these methods available in the newly released OpenGM2 library [26].

Performance metrics. We evaluate the results in terms of suboptimality, accuracy, and CPU time. We measure the suboptimality using three metrics: the percentage of optimal labels, the percentage relaxation gap, and the percentage rounding gap. The optimal labels are those that agree with the optimal solution of (P0). The relaxation gap measures the mismatch between the optimal objective of the relaxation and the optimal MAP objective (with respect to (P1) for DARS and (P2) for FUSES), while the rounding gap measures the mismatch between the objective attained by the rounded estimate and the optimal MAP objective. We compute the optimal labels (and the corresponding optimal objective) using a commercial tool for integer programming, CPLEX [27]. The runtime of CPLEX increases exponentially in the problem size, hence we can only use it offline for benchmarking the proposed solvers. We measure the accuracy using the Intersection over Union (IoU) metric [28], and record the CPU time for each compared technique.
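The two gaps can be sketched as percentages of the optimal MAP objective; the normalization convention used here is an assumption, not necessarily the one in the paper.

```python
# Illustrative suboptimality metrics: percentage relaxation and rounding gaps
# relative to the optimal MAP objective f_opt (computed offline with an exact
# integer-programming solver such as CPLEX).
def relaxation_gap(f_relaxed, f_opt):
    return 100.0 * abs(f_opt - f_relaxed) / abs(f_opt)

def rounding_gap(f_rounded, f_opt):
    return 100.0 * abs(f_rounded - f_opt) / abs(f_opt)
```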

Fig. 2: Convergence results for a single test image: objective value over time (in milliseconds) for all the compared techniques. (a) Objective vs. time for (P2); (b) Objective vs. time for (P1); (c)-(d) Objective vs. time for (P0).

V-C Semantic Segmentation Results

Fig. 2 shows a typical execution of the algorithms for a single image in the Cityscapes dataset. Fig. 2(a) shows the convergence of FUSES, reporting the relaxed objective attained by iteratively solving (R2) (FUSES-relaxed), the objective of the corresponding rounded estimate at each iteration (FUSES-rounded), and the optimal cost attained by CPLEX (Exact). The approach converges in a few milliseconds, and the corresponding rounded estimate settles near the optimal objective. Fig. 2(b) shows the convergence of DARS, reporting the relaxed objective attained by (R1) (DARS-relaxed), the objective of the corresponding rounded estimate (DARS-rounded), and the optimal cost from CPLEX (Exact). DARS' relaxed cost does not decrease monotonically. Moreover, its convergence time is around two orders of magnitude slower than FUSES.

Fig. 2(c) shows all the compared techniques, while Fig. 2(d) provides a zoomed-in view restricted to the first 18 ms. We only report the final cost for DARS, whose convergence is much slower than that of all the other methods. From Fig. 2(c)-(d) we note that α-exp, LBP, and TRW-S perform well in segmentation problems. While not providing any optimality guarantee (LBP and TRW-S may not even converge to a local optimum), these techniques return near-optimal solutions in all the tested images. α-exp and LBP have longer convergence tails but typically obtain a smaller objective value than FUSES and DARS. TRW-S also requires more time to terminate but attains a near-optimal objective in a few iterations. FUSES is farther from optimal (see also Tables I-II), but it is the only technique among these that does not require any initial guess. FUSES attains an objective comparable to that of DARS, while being much faster.

Table I provides statistics describing the performance of the compared techniques on the Cityscapes' Lindau dataset over 59 images (we use approximately 1000 superpixels). We show the percentage of optimal labels ("Optimal Labels" column), the relaxation gap ("Relax Gap" column), and the rounding gap ("Round Gap" column). The tables show that FUSES and DARS have comparable suboptimality (typically larger than the other compared techniques). FUSES and DARS produce optimal assignments for most of the nodes in the MRF, and attain a rounded cost within 0.2% of the optimum. The IoU ("Accuracy" column) shows that all the techniques have comparable accuracy. All the compared techniques produce more accurate results than the CNN-based segmentation produced by Bonnet on this dataset. Note that the accuracy depends on the parameters of the MRF (the unary and binary weights) besides depending on the solver. FUSES is the fastest MRF solver and can compute a solution in milliseconds, while not relying on any initial guess. Table II shows that even with 2000 superpixels, the advantages of FUSES remain. Fig. 1 shows qualitative segmentation results obtained using the proposed techniques. We also attempted to use a general-purpose SDP solver, cvx [10], for our evaluation: with only 200 superpixels, cvx requires more than 50 minutes to solve the SDP (S1), while for 1000 superpixels it crashes due to excessive memory usage.

Method    Optimal Labels (%)    Relax Gap (%)    Round Gap (%)    Accuracy (% IoU)    Runtime (ms)
FUSES
DARS
α-exp                           -
LBP                             -
TRW-S                           -
TABLE I: Performance on the Cityscapes’ Lindau dataset (1000 superpixels).
Method    Optimal Labels (%)    Relax Gap (%)    Round Gap (%)    Accuracy (% IoU)    Runtime (ms)
FUSES
DARS
α-exp                           -
LBP                             -
TRW-S                           -
TABLE II: Performance on the Cityscapes’ Lindau dataset (2000 superpixels).

Fig. 3(a) shows the relaxation gap for FUSES and DARS for an increasing number of nodes; we control the number of nodes by controlling how many superpixels each image is divided into. The relaxation gap decreases for an increasing number of nodes, which is a desirable feature since one typically solves large problems (>1000 nodes). The relaxation gap in FUSES is slightly larger: in hindsight, we traded off suboptimality for fast computation. Fig. 3(b) shows the relaxation gap for FUSES and DARS for an increasing number of labels; we artificially reduce the number of labels in Cityscapes for this test. The quality of both relaxations does not degrade significantly for an increasing number of labels.

Fig. 3: Relaxation gap for FUSES and DARS for (a) increasing number of nodes and (b) increasing number of labels. The shaded area describes the 1-sigma standard deviation.

VI Related Work

This section reviews inference techniques (Sections VI-A and VI-B) and applications (Section VI-C) for pairwise MRFs, including work on semantic segmentation. Our presentation is based on [1, 2, 3] but also covers more recent work on MRFs and semantic segmentation.

VI-A Exact Inference in MRFs

Efficient Algorithms. Inference in MRFs is intractable in general. However, particular instances of the problem are solvable in polynomial time. In particular, the Ising model can be solved exactly in polynomial time via graph cut [29, 30]. Note that graph cut algorithms are exact when the binary potentials are “attractive”, i.e., the pairwise terms in (1) encourage nearby nodes to have the same label. MRFs with repulsive potentials are intractable in general [31]. A more general (necessary and sufficient) condition that ensures optimality of graph cut for binary pairwise MRFs is the regularity condition:

E_ij(0, 0) + E_ij(1, 1) ≤ E_ij(0, 1) + E_ij(1, 0)     (14)

for any edge (i, j), where E_ij denotes the binary (pairwise) potential on that edge; see Lemma 3.2 and Theorem 3.1 in [31]. The regularity condition in eq. (14) is a special case of submodularity, and indeed the corresponding potentials are also called submodular [31, 32, 33].
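To make the regularity condition (14) concrete, the following sketch (our own illustration, with hypothetical helper names, not code from the paper) checks it for a 2x2 pairwise cost table:

```python
import numpy as np

def is_regular(E):
    """Check the regularity condition (14) for a binary pairwise
    potential E, given as a 2x2 array where E[a, b] is the cost of
    assigning labels (a, b) to the two endpoints of an edge."""
    return E[0, 0] + E[1, 1] <= E[0, 1] + E[1, 0]

# An attractive (Potts-like) potential is regular ...
potts = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
# ... while a repulsive potential that rewards disagreement is not.
repulsive = np.array([[1.0, 0.0],
                      [0.0, 1.0]])

print(is_regular(potts))      # True
print(is_regular(repulsive))  # False
```

Regularity of every edge potential is exactly the condition under which the binary MRF energy can be minimized globally via a single max-flow/min-cut computation.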

For multi-label pairwise MRFs, exact solutions exist for the case where the binary potentials are convex functions of the labels [34, 35, 36] and for the case where the binary potentials are linear and the unary potentials are convex [37]. We remark that these approaches assume a linear ordering of the labels, where the potentials penalize node labels depending on their label distance; for instance, assigning two adjacent nodes labels that are two apart incurs a larger penalty than assigning labels that are one apart. The Potts model in eq. (1), on the other hand, penalizes any class mismatch in the same way. Assuming a linear ordering is often unrealistic in practice; for instance, in semantic segmentation the classes (e.g., cat, table, car) do not admit a linear order in general. Moreover, convexity is a strong assumption for several MRF applications, such as depth reconstruction, where nonconvex costs have the desirable property of being discontinuity-preserving [31], contrarily to convex ones, which tend to smooth out depth discontinuities. Inference in multi-class MRFs based on the Potts model is NP-hard, see [38].

In the special case where the topology of the MRF is a chain (e.g., when the MRF describes a 1D signal or sequence), or more generally a tree, Dynamic Programming provides an optimal MAP estimate in polynomial time, see [33, 39]. Related work [40, 41] also extends dynamic programming to certain families of graphs with cycles and small cliques.
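As an illustration of the chain case, exact MAP inference reduces to min-sum dynamic programming (Viterbi-style). The sketch below is our own minimal implementation under the assumption of a shared pairwise cost table; the function name and cost layout are illustrative, not from the paper:

```python
import numpy as np

def chain_map(unary, pairwise):
    """Exact MAP on a chain MRF by min-sum dynamic programming.
    unary: (n, L) array of unary costs; pairwise: (L, L) array of
    binary costs shared by all consecutive pairs. Returns the
    optimal labeling as a list of label indices."""
    n, L = unary.shape
    cost = unary[0].copy()              # best cost of prefixes ending in each label
    backptr = np.zeros((n, L), dtype=int)
    for i in range(1, n):
        # cost of extending a prefix ending in label a with new label b
        total = cost[:, None] + pairwise + unary[i][None, :]
        backptr[i] = np.argmin(total, axis=0)
        cost = np.min(total, axis=0)
    # backtrack from the best final label
    labels = [int(np.argmin(cost))]
    for i in range(n - 1, 0, -1):
        labels.append(int(backptr[i, labels[-1]]))
    return labels[::-1]

# Potts chain: 3 nodes, 2 labels, mild smoothness prior.
unary = np.array([[0.0, 2.0], [1.5, 0.0], [0.0, 2.0]])
pairwise = 0.4 * (1 - np.eye(2))        # Potts: cost 0.4 for a label change
print(chain_map(unary, pairwise))       # → [0, 1, 0]
```

Here the smoothness prior is too weak to override the middle node's unary preference, so the exact solution keeps the label change; the total runtime is O(n L^2), polynomial as stated above.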

Global Integer Solvers. The energy minimization problem (P0) is a quadratic integer program and can be easily reformulated as a binary optimization problem [42, 43, 44]. Integer programming is NP-hard in general, but one may still resort to state-of-the-art integer solvers (e.g., CPLEX [27]) for moderate-size instances. For quadratic and linear programs, integer solvers based on cutting plane methods or branch & bound are able to produce solutions for problems with a few hundred variables relatively quickly (i.e., in a few seconds), but become unacceptably slow for larger problems. A Branch-and-Cut approach is proposed in [45]. An evaluation and a broader review of integer programming for MRFs is given in [3].

VI-B Approximate and Local Inference in MRFs

Iterative Local Solvers and Meta-heuristics. Local solvers start from a given initial guess and iteratively try to converge to a local optimum of the cost function. Early work includes the Iterated Conditional Modes (ICM) of Besag [46], which at each iteration greedily changes the label of a single node to obtain the largest decrease in the cost. ICM is known to be very sensitive to the quality of the initial guess [1]. In order to improve convergence, Geman and Geman [47] use Simulated Annealing to perform inference in MRFs. Simulated Annealing requires exponential time to converge in theory and is notoriously slow in practice [48].
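For concreteness, a minimal ICM sketch (our own illustrative implementation, assuming a shared pairwise cost table and an undirected edge list) could read:

```python
import numpy as np

def icm(unary, pairwise, edges, init, max_iters=50):
    """Iterated Conditional Modes: greedy coordinate descent on the
    MRF energy. unary: (n, L) costs; pairwise: (L, L) shared binary
    cost; edges: list of (i, j) pairs; init: initial labeling."""
    n, L = unary.shape
    labels = list(init)
    # adjacency list for quick neighbor lookup
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    for _ in range(max_iters):
        changed = False
        for i in range(n):
            # local cost of each candidate label at node i,
            # with all other labels held fixed
            local = unary[i].copy()
            for j in nbrs[i]:
                local += pairwise[:, labels[j]]
            best = int(np.argmin(local))
            if best != labels[i]:
                labels[i] = best
                changed = True
        if not changed:      # local minimum reached
            break
    return labels

unary = np.array([[0.0, 1.0], [0.8, 0.0], [0.0, 1.0]])
pairwise = 0.5 * (1 - np.eye(2))
print(icm(unary, pairwise, edges=[(0, 1), (1, 2)], init=[1, 1, 1]))  # → [0, 0, 0]
```

Each update is locally optimal given the current neighbors, so the energy is monotonically non-increasing; the result, however, depends on the initial guess, which is exactly the sensitivity noted above.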

Graph Cuts and Move-Making Algorithms. While graph cut methods are able to compute globally optimal solutions in binary pairwise MRFs with submodular potentials (Section VI-A), they are only able to converge to local minima in non-submodular binary MRFs or in multi-class MRFs. For the binary case, related works [49, 32] develop schemes to approximately solve MRFs with non-submodular potentials. Regarding the multi-class case, popular graph cut methods include the swap-move (α-β-swap) and the expansion-move (α-expansion) algorithms, both proposed in [6]. At each inner iteration, these algorithms solve a binary segmentation problem using graph cut, while the outer loop attempts to reconcile the binary results into a coherent multi-class segmentation. Boykov et al. [6] show that the swap-move algorithm is applicable whenever the smoothness potentials are semi-metric (i.e., V(α, β) = V(β, α) ≥ 0, with V(α, β) = 0 if and only if α = β), and the expansion-move algorithm is applicable whenever the smoothness potentials are metric (i.e., they are semi-metric and also satisfy the triangle inequality V(α, β) ≤ V(α, γ) + V(γ, β)); note that both the Potts model and the truncated distance are metrics. These conditions are further generalized in [31]. Under these conditions, Boykov et al. [6] show that these graph cut methods produce “strong” local minima, i.e., local minima where no allowed move is able to further reduce the cost. Moreover, these techniques produce a local solution which is proven to be within a known factor from the global minimum [6]. When these conditions are not satisfied, approximations of the cost function can be used [50, 38]. Komodakis and Tziritas [51] draw connections between move-making algorithms and the dual of linear programming relaxations. Kumar and Koller [52, 53] propose a move-making approach that applies to the semi-metric case and attains the same guarantees as the linear relaxation (see paragraph below) in the metric case. Faster algorithmic variants are proposed by Alahari et al. [54]. Lempitsky et al. [55] provide a low-complexity algorithm (LogCut) that requires an offline learning step. A summary of the MRF formulations that can be solved exactly or within a constant factor from the global minimum via graph cut is given in [31]. When the potentials do not satisfy the conditions for applicability of graph cut methods, approximate versions of these techniques can still be applied [50], but the corresponding performance bounds no longer hold.
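The semi-metric and metric conditions above are easy to check programmatically. The sketch below (illustrative, with hypothetical function names) verifies them for a pairwise cost table, e.g., for the Potts and truncated-distance models:

```python
import numpy as np

def is_semimetric(V):
    """V[a, b] is the smoothness cost for label pair (a, b):
    semi-metric means V is symmetric, V(a, b) = 0 iff a = b,
    and V(a, b) > 0 otherwise."""
    V = np.asarray(V, dtype=float)
    L = V.shape[0]
    off = ~np.eye(L, dtype=bool)
    return bool(np.all(np.diag(V) == 0)
                and np.all(V[off] > 0)
                and np.allclose(V, V.T))

def is_metric(V):
    """Metric = semi-metric + triangle inequality V(a,b) <= V(a,c) + V(c,b)."""
    V = np.asarray(V, dtype=float)
    L = V.shape[0]
    tri = all(V[a, b] <= V[a, c] + V[c, b] + 1e-12
              for a in range(L) for b in range(L) for c in range(L))
    return is_semimetric(V) and tri

potts = 1.0 - np.eye(3)                         # Potts model on 3 labels
# truncated absolute label distance min(|a - b|, 2) on 4 labels
trunc = np.minimum(np.abs(np.subtract.outer(np.arange(4), np.arange(4))), 2.0)
print(is_metric(potts), is_metric(trunc))       # both are metrics
```

A swap-move solver would only require `is_semimetric` to hold, while expansion moves (and their suboptimality bound) additionally require `is_metric`.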

Message-Passing Techniques. Message passing techniques adjust the MAP estimate at each node in the MRF via local information exchange between neighboring nodes. A popular message passing technique is belief propagation [56], which performs exact inference in graphs without loops, but is also applicable to generic graphs [57, 58] (loopy belief propagation, or LBP in short). LBP is not guaranteed to converge in the presence of cycles, but if convergence is attained LBP returns “strong” local minima [8, 59]. Tree-Reweighted Message Passing (TRW-S) [7] is another popular message-passing algorithm, which is also able to estimate a lower bound on the cost that can be used to assess the quality of the solution. Also in this case, the estimate is not guaranteed to converge and may oscillate. Message-passing techniques do not necessarily return integer solutions, hence the resulting estimates need to be rounded, see [3, Section 4.5]. Krähenbühl and Koltun [60] use message passing to perform inference in a mean-field approximation of a fully-connected Conditional Random Field (CRF); CRFs are a special case of MRFs where the binary terms, rather than being smoothness priors, are data driven.
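A minimal min-sum loopy belief propagation sketch (our own illustration; practical implementations add damping, schedules, and convergence checks) is:

```python
import numpy as np

def lbp_map(unary, pairwise, edges, iters=30):
    """Min-sum loopy belief propagation. unary: (n, L) costs;
    pairwise: (L, L) shared binary cost; edges: undirected (i, j)
    pairs. Returns an approximate MAP labeling."""
    n, L = unary.shape
    nbrs = [[] for _ in range(n)]
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    # m[(i, j)][b]: message from node i about node j taking label b
    m = {(i, j): np.zeros(L) for i in range(n) for j in nbrs[i]}
    for _ in range(iters):
        new = {}
        for (i, j) in m:
            # combine unary cost at i with incoming messages, excluding j's
            incoming = unary[i] + sum(m[(k, i)] for k in nbrs[i] if k != j)
            msg = np.min(incoming[:, None] + pairwise, axis=0)
            new[(i, j)] = msg - msg.min()        # normalize for stability
        m = new                                  # synchronous update
    beliefs = unary + np.array([sum(m[(k, i)] for k in nbrs[i])
                                for i in range(n)])
    return [int(np.argmin(b)) for b in beliefs]

# Triangle graph (one cycle): nodes 0 and 2 prefer label 0, node 1 label 1.
unary = np.array([[0.0, 2.0], [0.8, 0.0], [0.0, 2.0]])
pairwise = 0.5 * (1 - np.eye(2))
print(lbp_map(unary, pairwise, edges=[(0, 1), (1, 2), (0, 2)]))  # → [0, 0, 0]
```

On this small cyclic graph the messages happen to reach a fixed point and the smoothness prior overrides node 1's weak unary preference; in general, as noted above, convergence on loopy graphs is not guaranteed.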

Linear Programming (LP) Relaxations. These techniques relax the optimization to work on continuous labels rather than discrete ones. Early relaxation techniques include the LP relaxation of the local polytope [7], which is typically applicable only to small problem instances [3]. Kleinberg and Tardos [61] provide suboptimality guarantees for LP relaxations with metric potentials. Gupta and Tardos [62] extend these results considering a truncated linear metric. Chekuri et al. [63] and Werner [64] further refine the suboptimality bounds. Komodakis and Tziritas [65] consider the case of semi-metric and non-metric potentials and derive primal-dual methods to efficiently solve the resulting LP relaxations. Sontag and Jaakkola [66] propose a cutting-plane algorithm for optimizing over the marginal polytope. Other specialized solvers to attack larger instances have also been proposed, including block-coordinate ascent [67], subgradient methods based on dual decomposition [68, 69, 70], Alternating Directions Dual Decomposition [71], and others [72, 73, 74]. The performance of these techniques is typically sensitive to the choice of the parameters (e.g., stepsize) and can only ensure local convergence [3]. For binary pairwise MRFs, LP relaxation over the local polytope can be solved efficiently by reformulating it as a maximum flow problem, see the roof duality (or QPBO) approach of Rother et al. [75]. LP relaxations typically do not produce an integer solution, therefore the corresponding solutions need to be rounded. Moreover, they are tightly coupled with message-passing algorithms, see [3, Section 4.3]. Kumar et al. [4] provide a comparison between linear, quadratic, and second-order cone programming relaxations, showing that the linear relaxation dominates the others.
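As a concrete instance of the local-polytope relaxation, the LP for a single-edge, two-label MRF can be written down explicitly; on a tree the relaxation is tight, so the LP optimum equals the MAP cost. The variable layout and cost values below are our own illustrative choices (using SciPy's linprog), not the formulation of any specific paper:

```python
import numpy as np
from scipy.optimize import linprog

# Two nodes, two labels, one edge; unary costs and a Potts pairwise cost.
theta0, theta1 = [0.0, 1.0], [0.6, 0.0]
theta01 = [0.0, 0.5, 0.5, 0.0]      # flattened over (a, b): 00, 01, 10, 11

# Variables: [mu0(0), mu0(1), mu1(0), mu1(1),
#             mu01(00), mu01(01), mu01(10), mu01(11)]
c = theta0 + theta1 + theta01
A_eq = [
    [1, 1, 0, 0, 0, 0, 0, 0],       # mu0 sums to 1
    [0, 0, 1, 1, 0, 0, 0, 0],       # mu1 sums to 1
    [-1, 0, 0, 0, 1, 1, 0, 0],      # marginalization: sum_b mu01(0,b) = mu0(0)
    [0, -1, 0, 0, 0, 0, 1, 1],      # sum_b mu01(1,b) = mu0(1)
    [0, 0, -1, 0, 1, 0, 1, 0],      # sum_a mu01(a,0) = mu1(0)
    [0, 0, 0, -1, 0, 1, 0, 1],      # sum_a mu01(a,1) = mu1(1)
]
b_eq = [1, 1, 0, 0, 0, 0]
res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.fun)                      # LP optimum = MAP cost = 0.5, labeling (0, 1)
```

On loopy graphs the same constraints define the local polytope, which strictly contains the marginal polytope; the LP optimum then only lower-bounds the MAP cost and fractional solutions must be rounded, as discussed above.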

Spectral and Semidefinite Relaxations. These techniques typically rephrase inference over an MRF in terms of a binary quadratic optimization problem [16], which can then be relaxed to a convex program (more details in Section II-B). Shi and Malik [76] propose a spectral relaxation for image segmentation; more recently, spectral segmentation is used by Aksoy et al. [77]. Keuchel et al. [5] introduce SDP relaxations for several computer vision applications and use interior-point methods and randomized hyperplane techniques to obtain integer solutions, leveraging the celebrated result of Goemans and Williamson [78], which bounds the suboptimality of the resulting solutions. SDP relaxations are known to provide better solutions than spectral methods [5, 16]. While early approaches also recognized the accuracy of SDP relaxations with respect to commonly used alternatives (e.g., [4]), the computational cost of general-purpose SDP solvers prevented widespread use of this technique beyond problems with a few hundred variables [5]. Keuchel et al. [15] propose an approach to reduce the dimension of the problem via image preprocessing and superpixel segmentation. Concurrently, Torr [79] proposes the use of SDP relaxations for pixel matching problems. Schellewald and Schnörr [80] suggest a similar SDP relaxation for subgraph matching in the context of object recognition. Heiler et al. [81] propose to add constraints to the SDP relaxation to enforce priors (e.g., constrain the number of pixels in a class, or force sets of pixels to belong to the same class). Olsson et al. [16] develop a spectral subgradient method which is shown to reduce the relaxation gap of spectral relaxations. Huang et al. [82] use the Alternating Direction Method of Multipliers to speed up computation, while Wang et al. [83, 84] develop a specialized dual solver. Frostig et al. [85] resort to non-convex optimization to approximate the SDP solution, while Wang et al. [86] consider fully-connected CRFs and propose fast solvers for the case where the pairwise potentials admit a low-rank decomposition. We remark that the approach to derive the SDP relaxation is common to all the papers above and follows the line of Section II-B. Wainwright and Jordan [87] use semidefinite programming to approximately compute the marginal distributions in a graphical model. More generally, semidefinite programming has been a popular way to relax combinatorial integer programming problems [88, 89] and assignment problems [90, 91].
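To illustrate the randomized hyperplane rounding of Goemans and Williamson mentioned above, the sketch below rounds a (hypothetical, precomputed) low-rank factor V of an SDP solution X = V Vᵀ into a binary assignment; it is an illustration of the rounding step only, not of any of the solvers above:

```python
import numpy as np

def hyperplane_rounding(V, W, trials=100, rng=None):
    """Goemans-Williamson-style rounding: given a factor V (n x r)
    of an SDP solution X = V V^T for a binary quadratic problem
    max x^T W x with x in {-1, +1}^n, sample random hyperplanes and
    keep the best sign assignment."""
    rng = np.random.default_rng(rng)
    best_x, best_val = None, -np.inf
    for _ in range(trials):
        g = rng.standard_normal(V.shape[1])    # random hyperplane normal
        x = np.sign(V @ g)                     # side of the hyperplane
        x[x == 0] = 1.0
        val = x @ W @ x
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val

# Toy example: W encourages opposite signs for the two variables; the
# rank-2 factor V below stands in for a relaxed SDP solution.
W = np.array([[0.0, -1.0], [-1.0, 0.0]])
V = np.array([[1.0, 0.0], [-0.9, np.sqrt(1 - 0.81)]])
x, val = hyperplane_rounding(V, W, rng=0)
print(x, val)
```

Because the two rows of V point in nearly opposite directions, almost every sampled hyperplane separates them, and taking the best of many trials makes the rounded objective 2 (opposite signs) essentially certain.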

VI-C Applications

Overview. MRFs have been successfully used in several application domains, including computer vision, computer graphics, machine learning, and robotics. Popular applications include image denoising, inpainting, and super-resolution [36, 38, 30, 57], image segmentation (reviewed below), stereo reconstruction [92, 36, 38, 93, 35, 94, 95, 96, 97], panorama stitching and digital photomontages [98], image/video/texture synthesis [99], multi-camera scene reconstruction [100], voxel occupancy estimation [101], non-rigid point matching and registration [102, 16], and medical imaging [103, 104]. In stereo reconstruction, the labels are the disparities at each pixel and the binary potentials are functions of the absolute color differences at nearby pixels. Birchfield and Tomasi [48] provide a comparison of graph-cut methods for stereo reconstruction, while Tappen and Freeman [105] compare graph cut and LBP; Kolmogorov and Rother [106] evaluate TRW-S, LBP, and graph cut. Szeliski et al. [1] compare several techniques on stereo reconstruction, photomontage, image segmentation, and image denoising benchmarks. The study concludes that the expansion-move algorithm typically outperforms the swap-move algorithm, while ICM performs poorly in practice. In general, the best approach may depend on the application: for instance, the expansion-move algorithm is the best performer for the photomontage benchmark, while expansion move and TRW-S perform the best on the depth reconstruction benchmark. A broader evaluation is presented in [3], which also provides a C++ library, OpenGM2 [26], that implements several inference algorithms.

Semantic Segmentation. Semantic segmentation methods assign a semantic label to each “region” in an RGB image (2D segmentation), RGB-D image, or 3D model (3D segmentation). Depending on the approach, labels can be assigned to single pixels/voxels, superpixels, or keypoints [3]. Since semantic segmentation is typically modeled as an MRF, the literature review in Sections VI-A and VI-B already covers several works on segmentation, and indeed segmentation (together with depth reconstruction) is a typical benchmark for inference in MRFs, see [1, 2, 3, 33, 12] and the references therein. Therefore, the goal of this section is to (i) provide a brief taxonomy of semantic segmentation problems, and (ii) review semantic segmentation techniques that do not directly use MRFs. The corresponding literature is vast, and we refer the reader to the excellent survey of Zhu et al. [12] for a broader review of related work.

Taxonomy. Semantic segmentation is different from clustering, which groups pixels based on similarities without necessarily associating a given semantic label to each group (this is sometimes called non-semantic, unsupervised, or bottom-up segmentation [107, 12]). While semantic segmentation classifies image regions into semantic classes, instance segmentation also attempts to discern multiple objects belonging to the same class. In full analogy with MRFs, segmentation problems can be divided into binary segmentation problems (where only two classes, foreground and background, are segmented) and multi-class segmentation problems, where more than two labels are allowed. We can further divide the literature depending on the type of input data the segmentation operates on, including isolated RGB images (the most common setup in computer vision), stereo images [38], RGB-D images [108, 109], volumetric 3D data (e.g., volumetric X-ray CT images [110], or 3D voxel-based models [111]), or multiple RGB images; the latter setup is typically referred to as co-segmentation [112, 113, 12] (for generic unordered images), or temporal (or video) segmentation [114] (if images are collected over time). Thoma [107] also categorizes segmentation problems into active (where one can influence the data collection mechanism, as it happens in robotics), passive (where the input data is given), and interactive (where a human user provides coarse information to the segmentation algorithm).

Other Approaches. Traditional approaches for semantic segmentation work by extracting and classifying features in the input data, and then enforcing consistency of the classification across pixels (e.g., using MRFs or other models). Common features include pixel color, histograms of oriented gradients, SIFT, or textons, to mention a few [107, 115]. Shotton et al. [116, 117] use textons and Random Decision Forests for semantic segmentation. Yang et al. [118] use Support Vector Machines (SVMs), demonstrating competitive performance in the PASCAL segmentation challenge [119]. A latent SVM model is used by Felzenszwalb et al. [120] to detect objects using deformable part models. Winn and Shotton [121] use a CRF-based algorithm, named the Layout Conditional Random Field (LayoutCRF), to detect and segment objects from their parts; the approach is further generalized by Hoiem et al. [122]. Shotton et al. [123] use textons within a CRF model for object segmentation. Kumar et al. [124] use MRFs to detect and segment objects in an image. Bray et al. [125] concurrently segment and estimate the 3D pose of a human body from multiple views. Higher-order MRF formulations are also used for semantic segmentation, see the work by Kohli and co-authors [126, 127, 128] and the review [3]. Approaches for interactive segmentation include intelligent scissors [129], active contour models [130, 131] (based on dynamic programming), and graph cut methods (GrabCut [132]). While most of the work mentioned so far operates on a discrete set of nodes of a graphical model, related work in multi-class segmentation also includes contributions modeling the problem over a continuous domain; examples of such efforts include the variational method of Lellmann et al. [133] and the anisotropic diffusion method of Kim et al. [113]; see the chapter by Cremers et al. [134] for a recent survey. More recently, deep convolutional neural networks have become a popular solution for semantic segmentation, see the recent review of Garcia-Garcia et al. [28].

VII Conclusion

We propose fast optimization techniques to solve two semidefinite relaxations of maximum a posteriori inference in Markov Random Fields (MRFs). The first technique, named DARS (Dual Ascent Riemannian Staircase), provides a scalable solution for the standard SDP relaxation proposed in the literature. The second technique, named FUSES (Fast Unconstrained SEmidefinite Solver), is based on a novel relaxation. We test the proposed approaches in semantic segmentation problems and compare them against state-of-the-art MRF solvers, including move-making and message-passing methods. Our experiments show that (i) FUSES and DARS produce near-optimal solutions, attaining an objective within 0.2% of the optimum, (ii) our approaches are remarkably faster than general-purpose SDP solvers, while FUSES is more than two orders of magnitude faster than DARS, (iii) FUSES is faster than local search methods while being a global solver.

References

  • [1] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother, “A Comparative Study of Energy Minimization Methods for Markov Random Fields with Smoothness-Based Priors,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 6, pp. 1068–1080, 2008.
  • [2] A. Blake, P. Kohli, and C. Rother, Markov Random Fields for Vision and Image Processing.   The MIT Press, 2011.
  • [3] J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, T. Kröger, J. Lellmann, N. Komodakis, B. Savchynskyy, and C. Rother, “A Comparative Study of Modern Inference Techniques for Structured Discrete Energy Minimization Problems,” Intl. J. of Computer Vision, vol. 115, no. 2, pp. 155–184, 2015.
  • [4] P. M. Kumar, V. Kolmogorov, and P. Torr, “An analysis of convex relaxations for MAP estimation,” in Advances in Neural Information Processing Systems (NIPS), 2008, pp. 1041–1048.
  • [5] J. Keuchel, C. Schnörr, C. Schellewald, and D. Cremers, “Binary partitioning, perceptual grouping, and restoration with semidefinite programming,” IEEE Trans. Pattern Anal. Machine Intell., vol. 25, pp. 1364–1379, 2003.
  • [6] Y. Boykov, O. Veksler, and R. Zabih, “Fast Approximate Energy Minimization via Graph Cuts,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222–1239, 2001.
  • [7] M. Wainwright, T. Jaakkola, and A. Willsky, “MAP Estimation Via Agreement on Trees: Message-Passing and Linear Programming,” IEEE Trans. on Information Theory, vol. 51, no. 11, pp. 3697–3717, 2005.
  • [8] Y. Weiss and W. T. Freeman, “On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs,” IEEE Trans. on Information Theory, vol. 47, no. 2, pp. 736–744, 2001.
  • [9] N. Boumal, V. Voroninski, and A. Bandeira, “The non-convex Burer–Monteiro approach works on smooth semidefinite programs,” in Advances in Neural Information Processing Systems (NIPS), 2016, pp. 2757–2765.
  • [10] M. Grant and S. Boyd, “CVX: Matlab software for disciplined convex programming.” [Online]. Available: http://cvxr.com/cvx
  • [11] S. Hu and L. Carlone, “Accelerated inference in Markov Random Fields via smooth Riemannian optimization,” Tech. Rep., 2018, supplemental material.
  • [12] H. Zhu, F. Meng, J. Cai, and S. Lu, “Beyond pixels: A comprehensive survey from bottom-up to semantic image segmentation and cosegmentation,” Journal of Visual Communication and Image Representation, vol. 34, pp. 12–27, 2016.
  • [13] A. Gallagher, D. Batra, and D. Parikh, “Inference for order reduction in Markov Random Fields,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 1857–1864.
  • [14] R. B. Potts and C. Domb, “Some generalized order-disorder transformations,” Mathematical Proceedings of the Cambridge Philosophical Society, vol. 48, no. 01, p. 106, 1952.
  • [15] J. Keuchel, M. Heiler, and C. Schnörr, “Hierarchical Image Segmentation Based on Semidefinite Programming.” DAGM-Symposium, vol. 3175, no. Chapter 15, pp. 120–128, 2004.
  • [16] C. Olsson, A. Eriksson, and F. Kahl, “Improved spectral relaxation methods for binary quadratic optimization problems,” Computer Vision and Image Understanding, vol. 112, no. 1, pp. 3–13, 2008.
  • [17] S. Boyd and L. Vandenberghe, Convex optimization.   Cambridge University Press, 2004.
  • [18] L. Vandenberghe and S. Boyd, “Semidefinite programming,” SIAM Rev., vol. 38, no. 1, pp. 49–95, 1996.
  • [19] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends, Machine Learning, vol. 3, no. 1, pp. 1–122, 2010.
  • [20] S. Burer and R. Monteiro, “A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization,” Mathematical Programming, vol. 95, no. 2, pp. 329–357, 2003.
  • [21] D. Rosen, L. Carlone, A. Bandeira, and J. Leonard, “SE-Sync: A certifiably correct algorithm for synchronization over the Special Euclidean group,” in Intl. Workshop on the Algorithmic Foundations of Robotics (WAFR), San Francisco, CA, December 2016, extended arXiv preprint: 1611.00128.
  • [22] P. Tseng, “Dual ascent methods for problems with strictly convex costs and linear constraints: A unified approach,” SIAM J. Control Optim., vol. 28, no. 1, pp. 214–242, 1990.
  • [23] D. Rosen and L. Carlone, “Computational enhancements for certifiably correct SLAM,” in IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS), 2017, workshop on “Introspective Methods for Reliable Autonomy”.
  • [24] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The Cityscapes dataset for semantic urban scene understanding,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [25] A. Milioto and C. Stachniss, “Bonnet: An Open-Source Training and Deployment Framework for Semantic Segmentation in Robotics using CNNs,” ArXiv, 2018.
  • [26] B. Andres, T. Beier, and J. Kappes, “OpenGM2,” 2016. [Online]. Available: http://hci.iwr.uni-heidelberg.de/opengm2/
  • [27] IBM, “CPLEX: IBM ILOG CPLEX Optimization Studio.” [Online]. Available: https://www.ibm.com/products/ilog-cplex-optimization-studio
  • [28] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. García-Rodríguez, “A review on deep learning techniques applied to semantic segmentation,” ArXiv Preprint: 1704.06857, 2017.
  • [29] P. L. Ivănescu, “Some Network Flow Problems Solved with Pseudo-Boolean Programming,” Operations Research, vol. 13, no. 3, pp. 388–399, 1965.
  • [30] D. Greig, B. Porteous, and A. Seheult, “Exact Maximum A Posteriori Estimation for Binary Images,” J. Royal Statistical Soc., vol. 51, no. 2, pp. 271–279, 1989.
  • [31] V. Kolmogorov and R. Zabih, “What energy functions can be minimized via graph cuts?” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 147–159, 2004.
  • [32] S. Jegelka and J. Bilmes, “Submodularity beyond submodular energies: Coupling edges in graph cuts,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2011, pp. 1897–1904.
  • [33] P. Felzenszwalb and R. Zabih, “Dynamic programming and graph algorithms in computer vision,” IEEE Trans. Pattern Anal. Machine Intell., vol. 33, no. 4, pp. 721–740, 2011.
  • [34] H. Ishikawa, “Exact optimization for markov random fields with convex priors,” IEEE Trans. Pattern Anal. Machine Intell., vol. 25, no. 10, pp. 1333–1336, 2003.
  • [35] H. Ishikawa and D. Geiger, “Occlusions, discontinuities, and epipolar lines in stereo,” in European Conf. on Computer Vision (ECCV), H. Burkhardt and B. Neumann, Eds.   Berlin, Heidelberg: Springer Berlin Heidelberg, 1998, pp. 232–248.
  • [36] Y. Boykov, O. Veksler, and R. Zabih, “Markov Random Fields with Efficient Approximations,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 1998.
  • [37] D. S. Hochbaum, “An efficient algorithm for image segmentation, Markov random fields and related problems,” Journal of the ACM, vol. 48, no. 4, pp. 686–701, 2001.
  • [38] Y. Boykov, O. Veksler, and R. Zabih, “Fast Approximate Energy Minimization via Graph Cuts,” IEEE Trans. Pattern Anal. Machine Intell., vol. 23, no. 11, pp. 1222–1239, 2001.
  • [39] O. Veksler, “Stereo correspondence by dynamic programming on a tree,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 2, 2005, pp. 384–390.
  • [40] Y. Amit and A. Kong, “Graphical templates for model registration,” IEEE Trans. Pattern Anal. Machine Intell., vol. 18, no. 3, pp. 225–236, 1996.
  • [41] P. Felzenszwalb, “Representation and detection of deformable shapes,” IEEE Trans. Pattern Anal. Machine Intell., vol. 27, no. 2, pp. 208–220, 2005.
  • [42] E. Boros and P. L. Hammer, “Pseudo-boolean optimization,” Discrete Applied Mathematics, vol. 123, no. 1, pp. 155–225, 2002. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0166218X01003419
  • [43] A. Schrijver, Theory of Linear and Integer Programming.   New York, NY, USA: John Wiley & Sons, Inc., 1986.
  • [44] E. Boros, P. L. Hammer, and G. Tavares, “Preprocessing of unconstrained quadratic binary optimization,” Tech. Rep., 2006.
  • [45] P. Wang, C. Shen, A. van den Hengel, and P. H. S. Torr, “Efficient Semidefinite Branch-and-Cut for MAP-MRF Inference,” Intl. J. of Computer Vision, vol. 117, no. 3, pp. 269–289, 2015.
  • [46] J. Besag, “On the statistical analysis of dirty pictures,” J. Royal Statistical Soc., vol. 48, no. 3, pp. 259–302, 1986.
  • [47] S. Geman and D. Geman, “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images,” IEEE Trans. Pattern Anal. Machine Intell., vol. 6, no. 6, pp. 721–741, 1984.
  • [48] S. Birchfield and C. Tomasi, “A pixel dissimilarity measure that is insensitive to image sampling,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 4, pp. 401–406, 1998.
  • [49] V. Kolmogorov and C. Rother, “Minimizing nonsubmodular functions with graph cuts-a review,” IEEE Trans. Pattern Anal. Machine Intell., vol. 29, no. 7, pp. 1274–1279, 2007.
  • [50] V. K. C. Rother, S. Kumar and A. Blake, “Digital Tapestry,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 589–596.
  • [51] N. Komodakis and G. Tziritas, “A new framework for approximate labeling via graph cuts,” in Intl. Conf. on Computer Vision (ICCV), vol. 2, 2005, pp. 1018–1025.
  • [52] M. Kumar and D. Koller, “MAP Estimation of Semi-metric MRFs via Hierarchical Graph Cuts,” in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI), 2009, pp. 313–320.
  • [53] P. Torr and M. Kumar, “Improved moves for truncated convex models,” in Advances in Neural Information Processing Systems (NIPS), 2009, pp. 889–896.
  • [54] K. Alahari, P. Kohli, and P. Torr, “Reduce, reuse, and recycle: Efficiently solving multi-label mrfs,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
  • [55] V. Lempitsky, C. Rother, and A. Blake, “LogCut - Efficient Graph Cut Optimization for Markov Random Fields,” in Intl. Conf. on Computer Vision (ICCV), 2007, pp. 1–8.
  • [56] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference.   Morgan Kaufmann, 1988.
  • [57] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient Belief Propagation for Early Vision,” International Journal of Computer Vision, vol. 70, no. 1, pp. 41–54, 2006.
  • [58] W. Freeman and E. Pasztor, “Learning low-level vision,” Intl. J. of Computer Vision, vol. 40, pp. 25–47, 1999.
  • [59] M. Wainwright, T. Jaakkola, and A. Willsky, “Tree consistency and bounds on the performance of the max-product algorithm and its generalizations,” Statistics and Computing, vol. 14, no. 2, pp. 143–166, 2004.
  • [60] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Advances in Neural Information Processing Systems (NIPS), 2011, pp. 109–117.
  • [61] J. Kleinberg and E. Tardos, “Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov Random Fields,” in Proc. of the 40th Annual Symposium on Foundations of Computer Science, 1999.
  • [62] A. Gupta and É. Tardos, “A constant factor approximation algorithm for a class of classification problems,” in Proceedings of the 32nd Annual ACM Symposium on the Theory of Computing, 2000.
  • [63] C. Chekuri, S. Khanna, J. Naor, and L. Zosin, “Approximation algorithms for the metric labeling problem via a new linear programming formulation,” in Proc. of the Annual ACM-SIAM Symposium on Discrete Algorithms, 2001, pp. 109–118.
  • [64] T. Werner, “A linear programming approach to max-sum problem: A review,” IEEE Trans. Pattern Anal. Machine Intell., vol. 29, no. 7, pp. 1165–1179, 2007.
  • [65] N. Komodakis and G. Tziritas, “Approximate Labeling via Graph Cuts Based on Linear Programming,” IEEE Trans. Pattern Anal. Machine Intell., vol. 29, no. 8, pp. 1436–1453, 2007.
  • [66] D. Sontag and T. Jaakkola, “New outer bounds on the marginal polytope,” in Advances in Neural Information Processing Systems (NIPS), 2008, pp. 1393–1400.
  • [67] V. Kolmogorov, “Convergent Tree-Reweighted Message Passing for Energy Minimization,” IEEE Trans. Pattern Anal. Machine Intell., vol. 28, no. 10, pp. 1568–1583, 2006.
  • [68] N. Komodakis, N. Paragios, and G. Tziritas, “MRF Energy Minimization and Beyond via Dual Decomposition,” pami, vol. 33, no. 3, pp. 531–552, 2011.
  • [69]

    J. Kappes, M. Speth, G. Reinelt, and C. Schn̈rr, “Towards efficient and exact map-inference for large scale discrete computer vision problems via combinatorial optimization,” in

    IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2013.
  • [70] M. Guignard and S. Kim, “Lagrangean decomposition: A model yielding stronger lagrangean bounds,” Mathematical Programming, vol. 39, no. 2, pp. 215–228, 1987.
  • [71] A. Martins, M. Figueiredo, P. Aguiar, N. Smith, and E. Xing, “An augmented Lagrangian approach to constrained MAP inference,” in Intl. Conf. on Machine Learning (ICML), 2011, pp. 169–176.
  • [72] B. Savchynskyy, S. Schmidt, J. Kappes, and C. Schnörr, “Efficient mrf energy minimization via adaptive diminishing smoothing,” in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI), 2012, pp. 746–755.
  • [73] A. Globerson and T. Jaakkola, “Fixing max-product: Convergent message passing algorithms for MAP LP-relaxations,” in Advances in Neural Information Processing Systems (NIPS), 2008, pp. 553–560.
  • [74] D. Sontag, D. Choe, and Y. Li, “Efficiently searching for frustrated cycles in map inference,” in Proceedings of the Twenty-Eighth Conference on Uncertainty in Artificial Intelligence (UAI), 2012.
  • [75] C. Rother, V. Kolmogorov, V. Lempitsky, and M. Szummer, “Optimizing binary mrfs via extended roof duality,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2007.
  • [76] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Machine Intell., vol. 24, no. 5, 2000.
  • [77] Y. Aksoy, T. Oh, S. Paris, M. Pollefeys, and W. Matusik, “Semantic soft segmentation,” SIGGRAPH, vol. 37, no. 4, pp. 72:1–72:13, 2018.
  • [78] M. Goemans and D. Williamson, “Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming,” J. ACM, vol. 42, no. 6, pp. 1115–1145, 1995.
  • [79] P. H. S. Torr, “Solving markov random fields using semi definite programming,” in International Workshop on Artificial Intelligence and Statistics (AISTATS), 2003.
  • [80] C. Schellewald and C. Schnörr, “Probabilistic subgraph matching based on convex relaxation,” in Energy Minimization Methods in Computer Vision and Pattern Recognition.   Springer Berlin Heidelberg, 2005, pp. 171–186.
  • [81] M. Heiler, J. Keuchel, and C. Schnörr, “Semidefinite Clustering for Image Segmentation with A-priori Knowledge,” in Pattern Recognition.   Springer, Berlin, Heidelberg, 2005, pp. 309–317.
  • [82] Q. Huang, Y. Chen, and L. Guibas, “Scalable semidefinite relaxation for maximum a posterior estimation,” in Intl. Conf. on Machine Learning (ICML), 2014, pp. II–64–II–72.
  • [83] W. Peng, S. Chunhua, and A. van den Hengel, “A fast semidefinite approach to solving binary quadratic problems,” IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 1312–1319, 2013.
  • [84] P. Wang, C. Shen, A. V. D. Hengel, and P. Torr, “Large-scale Binary Quadratic Optimization Using Semidefinite Relaxation and Applications,” IEEE Trans. Pattern Anal. Machine Intell., vol. 39, no. 3, pp. 1–18, 2016.
  • [85] R. Frostig, S. Wang, P. Liang, and C. Manning, “Simple MAP inference via low-rank relaxations,” in NIPS, 2014.
  • [86] P. Wang, C. Shen, and A. V. D. Hengel, “Efficient SDP inference for fully-connected CRFs based on low-rank decomposition,” IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 3222–3231, 2015.
  • [87] M. J. Wainwright and M. I. Jordan, “Semidefinite relaxations for approximate inference on graphs with cycles,” in Advances in Neural Information Processing Systems (NIPS), 2003, pp. 369–376.
  • [88] F. Alizadeh, “Interior Point Methods in Semidefinite Programming with Applications to Combinatorial Optimization,” SIAM Journal on Optimization, vol. 5, no. 1, pp. 13–51, 1995.
  • [89] S. Poljak, F. Rendl, and H. Wolkowicz, “A recipe for semidefinite relaxation for (0,1)-quadratic programming,” Journal of Global Optimization, vol. 7, no. 1, pp. 51–73, 1995.
  • [90] Q. Zhao, S. Karisch, F. Rendl, and H. H. Wolkowicz, “Semidefinite programming relaxations for the quadratic assignment problem,” Journal of Combinatorial Optimization, vol. 2, no. 1, pp. 71–109, 1998.
  • [91] K. Anstreicher and N. Brixius, “A new bound for the quadratic assignment problem based on convex quadratic programming,” Mathematical Programming, vol. 89, no. 3, pp. 341–357, 2001.
  • [92] S. Birchfield and C. Tomasi, “Multiway cut for stereo and motion with slanted surfaces,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 1, 1999, pp. 489–495.
  • [93] H. Ishikawa and D. Geiger, “Segmentation by grouping junctions,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 1998, pp. 125–131.
  • [94] V. Kolmogorov and R. Zabih, “Computing visual correspondence with occlusions using graph cuts,” in Intl. Conf. on Computer Vision (ICCV), vol. 2, 2001, pp. 508–515 vol.2.
  • [95] S. Roy, “Stereo without epipolar lines: A maximum-flow formulation,” Intl. J. of Computer Vision, vol. 34, no. 2, pp. 147–161, Aug 1999. [Online]. Available: https://doi.org/10.1023/A:1008192004934
  • [96] S. Roy and I. J. Cox, “A Maximum-Flow Formulation of the N-camera Stereo Correspondence Problem,” in Intl. Conf. on Computer Vision (ICCV), 1998, pp. 492–499.
  • [97] D. Scharstein and R. Szeliski, “A taxonomy and evaluation of dense two-frame stereo correspondence algorithms,” Intl. J. of Computer Vision, vol. 47, no. 1, pp. 7–42, 2002.
  • [98] A. Agarwala, M. Dontcheva, M. Agrawala, S. Drucker, A. Colburn, B. Curless, D. Salesin, and M. Cohen, “Interactive digital photomontage,” ACM Trans. Graph., vol. 23, no. 3, pp. 294–302, 2004.
  • [99] V. Kwatra, A. Schödl, I. Essa, G. Turk, and A. Bobick, “Graphcut textures,” ACM Transactions on Graphics, vol. 22, no. 3, 2003.
  • [100] V. Kolmogorov and R. Zabih, “Multi-camera scene reconstruction via graph cuts,” in European Conf. on Computer Vision (ECCV), 2002.
  • [101] D. Snow, P. Viola, and R. Zabih, “Exact voxel occupancy with graph cuts,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2000, pp. 345–352 vol.1.
  • [102] N. Komodakis and N. Paragios, “Beyond loose LP-relaxations: Optimizing MRFs by repairing cycles,” in European Conf. on Computer Vision (ECCV), 2008.
  • [103] Y. Boykov and M.-P. Jolly, “Interactive Organ Segmentation Using Graph Cuts,” 2000.
  • [104] J. Kim, J. F. III, A. Tsai, C. Wible, A. Willsky, and W. W. III, “Incorporating spatial priors into an information theoretic approach for fmri data analysis,” in Proc. of the Third International Conference on Medical Image Computing and Computer-Assisted Intervention, ser. MICCAI ’00.   London, UK, UK: Springer-Verlag, 2000, pp. 62–71. [Online]. Available: http://dl.acm.org/citation.cfm?id=646923.710388
  • [105] M. Tappen and W. Freeman, “Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters,” in Intl. Conf. on Computer Vision (ICCV), 2003, pp. 900–907.
  • [106] V. Kolmogorov and C. Rother, “Comparison of Energy Minimization Algorithms for Highly Connected Graphs,” 2006.
  • [107] M. Thoma, “A survey of semantic segmentation,” ArXiv Preprint: 1602.06541, 2017.
  • [108] D. Zhuo, T. Sinisa, and L. Longin, “Semantic segmentation of RGB-D images with mutex constraints,” Intl. Conf. on Computer Vision (ICCV), pp. 1733–1741, 2015.
  • [109] S. Gupta, P. Arbeláez, R. Girshick, and J. Malik, “Indoor scene understanding with RGB-D images: Bottom-up segmentation, object detection and semantic segmentation,” Intl. J. of Computer Vision, vol. 112, no. 2, pp. 133–149, 2015.
  • [110] S. Hu, E. Hoffman, and J. Reinhardt, “Automatic lung segmentation for accurate quantization of volumetric X-ray CT images,” IEEE Transactions on Medical Imaging, vol. 20, no. 6, pp. 490–498, 2001.
  • [111] A. Kundu, Y. Li, F. Dellaert, F. Li, and J. Rehg, “Joint semantic segmentation and 3D reconstruction from monocular video,” in European Conf. on Computer Vision (ECCV), ser. Lecture Notes in Computer Science, vol. 8694, 2014, pp. 703–718.
  • [112] C. Rother, T. Minka, A. Blake, and V. Kolmogorov, “Cosegmentation of image pairs by histogram matching - incorporating a global constraint into MRFs,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2006, pp. 993–1000.
  • [113] G. Kim, E. Xing, L. Fei-Fei, and T. Kanade, “Distributed cosegmentation via submodular optimization on anisotropic diffusion,” in Intl. Conf. on Computer Vision (ICCV), 2011, pp. 169–176.
  • [114] A. Chen and J. Corso, “Temporally consistent multi-class video-object segmentation with the video graph-shifts algorithm,” in 2011 IEEE Workshop on Applications of Computer Vision (WACV), 2011, pp. 614–621.
  • [115] S.-C. Zhu, C. Guo, Y. Wang, and Z. Xu, “What are Textons?” Intl. J. of Computer Vision, vol. 62, no. 1/2, pp. 121–143, 2005.
  • [116] J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests for image categorization and segmentation,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
  • [117]

    F. Schroff, A. Criminisi, and A. Zisserman, “Object class segmentation using random forests,” in

    British Machine Vision Conf. (BMVC), 2008.
  • [118] Y. Yang, S. Hallman, D. Ramanan, and C. Fowlkes, “Layered Object Models for Image Segmentation,” IEEE Trans. Pattern Anal. Machine Intell., vol. 34, no. 9, pp. 1731–1743, 2012.
  • [119] M. Everingham, L. V. Gool, C. Williams, J. Winn, and A. Zisserman, “The Pascal Visual Object Classes (VOC) Challenge,” Intl. J. of Computer Vision, vol. 88, no. 2, pp. 303–338, 2009.
  • [120] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE Trans. Pattern Anal. Machine Intell., vol. 32, no. 9, pp. 1627–1645, 2010.
  • [121] J. Winn and J. Shotton, “The layout consistent random field for recognizing and segmenting partially occluded objects,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2006.
  • [122] D. Hoiem, C. Rother, and J. Winn, “3D LayoutCRF for multi-view object class recognition and segmentation,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2007, pp. 1–8.
  • [123] J. Shotton, J. Winn, C. Rother, and A. Criminisi, “Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation,” in European Conf. on Computer Vision (ECCV), 2006.
  • [124] M. Kumar, P. Torr, and A. Zisserman, “OBJ CUT,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 18–25.
  • [125] M. Bray, P. Kohli, and P. Torr, “PoseCut: Simultaneous segmentation and 3d pose estimation of humans using dynamic graph-cuts,” in European Conf. on Computer Vision (ECCV), 2006, pp. 642–655.
  • [126] P. Kohli, L. Ladický, and P. Torr, “Robust higher order potentials for enforcing label consistency,” Intl. J. of Computer Vision, vol. 82, no. 3, pp. 302–324, 2009.
  • [127] P. Kohli, M. Kumar, and P. Torr, “ & beyond: solving energies with higher order cliques,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2007.
  • [128] P. Kohli, L. Ladicky, and P. Torr, “Robust higher order potentials for enforcing label consistency,” in IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2008, pp. 1–8.
  • [129] E. Mortensen and W. Barrett, “Intelligent scissors for image composition,” in SIGGRAPH, 1995, pp. 191–198.
  • [130] M. Kass, A. Witkin, and D. Terzopoulos, “Snakes: Active Contour Models,” Intl. J. of Computer Vision, vol. 1, no. 4, pp. 321–331, 1987.
  • [131] A. Amini, T. Weymouth, and R. Jain, “Using dynamic programming for solving variational problems in vision,” IEEE Trans. Pattern Anal. Machine Intell., vol. 12, no. 9, pp. 855–867, 1990.
  • [132] C. Rother, V. Kolmogorov, and A. Blake, “GrabCut -interactive foreground extraction using iterated graph cuts,” in SIGGRAPH, 2004.
  • [133] J. Lellmann, F. Becker, and C. Schnörr, “Convex optimization for multi-class image labeling with a novel family of total variation based regularizers,” in Intl. Conf. on Computer Vision (ICCV), 2009, pp. 646–653.
  • [134] D. Cremers, T. Pock, K. Kolev, and A. Chambolle, “Convex relaxation techniques for segmentation, stereo and multiview reconstruction,” in Markov Random Fields for Vision and Image Processing.   MIT Press, 2011.

Appendix A: Equivalence between Problems (2) and (P0)

Here we prove that solving Problem (2) is equivalent to solving (P0), in the sense that the solution set of one problem is in one-to-one correspondence with the solution set of the other. Towards this goal, we show that (2) can be obtained simply as a reparametrization of (P0).

We first rewrite each node variable $x_i$ in (P0) as a vector $x_i \in \{-1,+1\}^k$, such that $x_i$ has a single entry equal to $+1$ (all the others are $-1$), and if the $l$-th entry of $x_i$ is $+1$, then the corresponding node has label $l$. Each vector $x_i \in \{-1,+1\}^k$ is a valid label assignment as long as there is a unique entry equal to $+1$, or, equivalently, $\mathbf{1}_k^\top x_i = 2-k$. Using this vector parametrization we rewrite the unary and binary potentials (1) as:

$$E_i(x_i) = \frac{\bar{E}_i}{2}\left(1 - g_i^\top x_i\right), \qquad E_{ij}(x_i, x_j) = \frac{\bar{E}_{ij}}{4}\left(k - x_i^\top x_j\right) \tag{15}$$

where $g_i$ is a vector of all zeros, except the entry in position $l_i$ (the measured class label for node $i$), which is equal to $1$. The reparametrization of the unary potentials in (15) can be seen to be the same as (1) by observing that $g_i^\top x_i = +1$ if $x_i$ encodes label $l_i$, or $-1$ otherwise; similarly, the reparametrization of the binary potentials follows from the fact that $x_i^\top x_j = k$ if $x_i = x_j$, or $k-4$ otherwise.
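The two scalar identities behind this reparametrization can be checked numerically. The following Python sketch (hypothetical helper names, not code from the paper) verifies that $g_i^\top x_i \in \{+1,-1\}$ and $x_i^\top x_j \in \{k, k-4\}$ for one-hot measurement vectors and $\pm 1$ label vectors:

```python
import numpy as np

def label_vector(l, k):
    """Return x in {-1,+1}^k with a single +1 in position l (label l)."""
    x = -np.ones(k)
    x[l] = 1.0
    return x

def unary_indicator(x, g):
    """(1 - g^T x) / 2: equals 0 if x encodes the measured label, 1 otherwise."""
    return (1.0 - g @ x) / 2.0

def binary_indicator(xi, xj, k):
    """(k - xi^T xj) / 4: equals 0 if the two labels agree, 1 otherwise."""
    return (k - xi @ xj) / 4.0
```

Note also that `label_vector(2, 5).sum()` equals $2 - 5 = -3$, matching the feasibility condition $\mathbf{1}_k^\top x_i = 2-k$.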

Using (15), we rewrite Problem (P0) as:

$$\min_{x_1,\dots,x_N} \; -\sum_{i \in \mathcal{V}} \frac{\bar{E}_i}{2}\, g_i^\top x_i \;-\; \sum_{(i,j) \in \mathcal{E}} \frac{\bar{E}_{ij}}{4}\, x_i^\top x_j \quad \text{s.t.} \quad x_i \in \{-1,+1\}^k, \;\; \mathbf{1}_k^\top x_i = 2-k, \;\; i = 1,\dots,N \tag{16}$$

where we dropped the constant terms from (15) (which are irrelevant for the optimization), and where the constraint $\mathbf{1}_k^\top x_i = 2-k$ enforces each vector $x_i \in \{-1,+1\}^k$ to have a single entry equal to $+1$ (i.e., we assign a single label to each node).

In order to obtain Problem (2), we adopt a more compact notation by stacking all vectors $x_i$, with $i = 1,\dots,N$, in a single $Nk$-vector $x \doteq [x_1^\top \;\cdots\; x_N^\top]^\top$, and note that the cost function in (16) is quadratic in the entries of $x$. Therefore, we rewrite problem (16) as:

$$\min_{x \in \{-1,+1\}^{Nk}} \; x^\top \bar{H} x + \bar{h}^\top x \quad \text{s.t.} \quad (e_i \otimes \mathbf{1}_k)^\top x = 2-k, \;\; i = 1,\dots,N \tag{17}$$

where $\bar{H}$ is an $Nk \times Nk$ symmetric block matrix and $\bar{h}$ is an $Nk$-vector; here $e_i$ is an $N$-vector which is all zeros, except the $i$-th entry, which is one, $\mathbf{1}_k$ is a $k$-vector of ones, and $\otimes$ is the Kronecker product. The constraint $(e_i \otimes \mathbf{1}_k)^\top x = 2-k$ simply rewrites the constraint $\mathbf{1}_k^\top x_i = 2-k$ in (16). The reader can also verify by inspection that the following choice of $\bar{H}$ and $\bar{h}$ ensures that the objective in (17) is the same as (16):

$$[\bar{h}]_i = -\frac{\bar{E}_i}{2}\, g_i, \qquad [\bar{H}]_{ij} = [\bar{H}]_{ji} = \begin{cases} -\dfrac{\bar{E}_{ij}}{8}\, I_k & \text{if } (i,j) \in \mathcal{E} \\[4pt] 0_{k \times k} & \text{otherwise} \end{cases} \tag{18}$$

where $x$ stacks the $N$ subvectors $x_i$ of size $k$, $[\bar{h}]_i$ is the $i$-th subvector of $\bar{h}$, $[\bar{H}]_{ij}$ is the block of $\bar{H}$ in block row $i$ and block column $j$, and $I_k$ is the identity matrix of size $k$. Note that each edge $(i,j)$ contributes the two symmetric blocks $[\bar{H}]_{ij}$ and $[\bar{H}]_{ji}$, so that $x^\top \bar{H} x = -\sum_{(i,j) \in \mathcal{E}} \frac{\bar{E}_{ij}}{4}\, x_i^\top x_j$, matching (16).
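As a sanity check on this construction, the numpy sketch below (assumed variable names, not the paper's implementation) assembles $\bar{H}$ and $\bar{h}$ from given unary weights, edge weights, and one-hot measurement vectors, and compares $x^\top \bar{H} x + \bar{h}^\top x$ against the objective of (16) evaluated directly on the label vectors:

```python
import numpy as np

def build_H_h(N, k, edges, E_unary, E_binary, G):
    """Assemble Hbar (Nk x Nk) and hbar (Nk-vector) following (18):
    each edge (i, j) with weight Ebar_ij fills the two symmetric blocks
    with -(Ebar_ij / 8) * I_k; the i-th subvector of hbar is -(Ebar_i / 2) * g_i."""
    H = np.zeros((N * k, N * k))
    h = np.zeros(N * k)
    for i in range(N):
        h[i * k:(i + 1) * k] = -0.5 * E_unary[i] * G[i]
    for (i, j), w in zip(edges, E_binary):
        H[i * k:(i + 1) * k, j * k:(j + 1) * k] = -(w / 8.0) * np.eye(k)
        H[j * k:(j + 1) * k, i * k:(i + 1) * k] = -(w / 8.0) * np.eye(k)
    return H, h

def cost_16(X, edges, E_unary, E_binary, G):
    """Objective of (16), evaluated directly on the label vectors (rows of X)."""
    unary = -sum(0.5 * E_unary[i] * (G[i] @ X[i]) for i in range(len(X)))
    binary = -sum((w / 4.0) * (X[i] @ X[j]) for (i, j), w in zip(edges, E_binary))
    return unary + binary
```

Stacking the rows of `X` into the $Nk$-vector `x = X.reshape(-1)`, the quadratic form `x @ H @ x + h @ x` should agree with `cost_16(...)` for any $\pm 1$ label assignment.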

Now we observe that for a scalar $a$, we can equivalently write $a$ as $\mathrm{tr}(a)$. Moreover, we note that the diagonal of the matrix $xx^\top$ contains the squares of every entry of $x$, so the constraint $x \in \{-1,+1\}^{Nk}$ can be equivalently written as $\mathrm{diag}(xx^\top) = \mathbf{1}_{Nk}$. Combining these two observations, we rewrite problem (16) equivalently as: