1 Introduction
Many computer vision problems can be modelled using Markov Random Fields (MRFs). Maximum a posteriori (MAP) estimation in an MRF assigns labels to the nodes, in order to maximize their joint probability distribution. However, MAP inference is NP-hard for a general graph and several approaches exist to achieve approximate solutions: graph cuts, belief propagation and LP relaxation based methods [17]. In recent years, higher-order MRFs have achieved excellent results in various applications, since they model far-reaching interactions between the nodes. While the topic of inference in pairwise MRFs is well studied, the development of scalable and efficient techniques for higher-order models is still evolving [15], [21], [18], [10], [20], [40]. The LP relaxation approach has led to state-of-the-art algorithms and also provides a theoretical foundation for the topic of MAP inference. An attractive property of this approach is that it readily lends itself to inference in higher-order MRFs [21], [39]. Solving the LP relaxation in the primal is not scalable and the better approach is to solve the dual by exploiting the graph structure of the problem [37], [41]. The various approaches to solve the dual can be categorized into coordinate optimization and gradient based methods. Block coordinate methods converge fast but can get stuck in suboptimal corners when optimizing the non-smooth dual [39], [20]. Since we can recover better integer primal labels when closer to the optimum of the non-smooth dual, this can lead to poor solutions. On the other hand, supergradient methods can theoretically reach the global optimum [22]. However, they have a slow $O(1/\epsilon^2)$ rate of convergence towards an $\epsilon$-accurate solution.
These drawbacks can be addressed by approximating the dual with a smooth version. Smoothing the dual and applying accelerated gradient techniques was introduced by [16] and was further studied by [33], [34]. These algorithms reach an $\epsilon$-accurate solution in $O(1/\epsilon)$ iterations. Block coordinate approaches can also work with a smooth objective [26], [14], [28], which allows these algorithms to avoid suboptimal corners. Some of these methods [14] can still get stuck when the smoothing is reduced as we go closer to the optimum, in order to get more accurate results. Also, the scope for parallelization is more limited in block coordinate optimization methods compared to gradient based methods.
Alternating Direction Method of Multipliers (ADMM) inspired methods [25], [27] work with an augmented Lagrangian, which is also a smooth objective. However, convergence rate analysis for many of these methods has not been addressed and convergence issues are observed in practice, too. For example, AD3 [25] works well for many medium sized problems but can fail to converge in some instances. Also, scaling these methods to large vision problems with higher-order cliques has not been successful so far.
In recent years, Newton-type methods have led to state-of-the-art results in various Machine Learning problems [36], [24], [23]. These methods are able to take a better direction by considering curvature information and have a quadratic convergence rate when sufficiently close to the optimum. One of the challenges while solving any optimization problem is the conditioning of the objective. Intuitively, a problem is ill-conditioned if the change in objective value, due to a perturbation in variable value, varies radically depending on the direction of the perturbation. It has been found that first order methods are faster when the condition number is small or moderate but second-order Newton-type methods perform much better for ill-conditioned problems [11]. In MAP inference, the smoothing has to be reduced as one gets closer to the optimum, leading to a considerably ill-conditioned dual objective. The main disadvantage of Newton-type methods is the need to compute the Hessian, which can be very costly. However, it is worth investigating whether this is indeed the case for the problem at hand. Moreover, quasi-Newton methods, which work with a Hessian approximation, can also lead to state-of-the-art results [36], [5].
Our contributions are summarized as follows: (i) We show that for the smoothing based approach, it is possible to efficiently compute the Hessian and Hessian-vector product for a broad class of problems. (ii) We present a study of how to adapt Newton-type methods for MAP inference in higher-order MRFs. (iii) We demonstrate an efficient way to perform sum-product inference in chains of sparse pattern-based higher-order cliques. This greatly enables the applicability of the smoothing based approach for inference in higher-order MRFs. (iv) We showcase how Newton-type methods can beat first order methods for higher-order datasets. The remainder of this paper is organized as follows: section 2 outlines the concept of MAP inference and the smoothing based approach. The main contributions of the paper are presented in sections 3 and 4, while experimental validation is in section 5, with a concluding discussion.
2 MAP inference by optimizing a smooth dual
Consider a graph $G = (V, C)$, where $V$ is the set of nodes, representing the random variables, and $C$ is the set of cliques, which enforce a certain relationship (e.g., smoothness) among the nodes they contain. Each node takes a label from a discrete set $L$. For example, object detection can be formulated using an MRF, where each node corresponds to a part of the object, cliques represent geometric constraints between the parts and the labels are locations in the image. MAP inference is the task of finding the labeling that maximizes the joint probability distribution of these random variables. It can be formulated as an equivalent energy minimization problem, as follows,
$$\min_{x} \; \sum_{i \in V} \theta_i(x_i) + \sum_{c \in C} \theta_c(x_c) \qquad (1)$$
where $\theta_i(x_i)$ is the potential of node $i$ for label $x_i$ and $\theta_c(x_c)$ is the potential of clique $c$ for labelling $x_c$. Cliques having more than two nodes are called higher-order cliques.
Further, this energy minimization problem can be represented as an Integer Linear Program (ILP) as follows,
$$\min_{\phi} \; \sum_{i \in V} \sum_{x_i} \theta_i(x_i)\,\phi_i(x_i) + \sum_{c \in C} \sum_{x_c} \theta_c(x_c)\,\phi_c(x_c)$$
$$\text{s.t.} \;\; \sum_{x_i} \phi_i(x_i) = 1 \;\; \forall i \in V, \qquad \sum_{x_c : x_c(i) = x_i} \phi_c(x_c) = \phi_i(x_i) \;\; \forall c \in C,\, i \in c,\, x_i, \qquad \phi_i(x_i),\, \phi_c(x_c) \in \{0, 1\} \qquad (2)$$
Here, $\phi_i$ and $\phi_c$ are indicator vectors for a given node $i$ (resp., clique $c$) for a label $x_i$ (resp., labelling $x_c$) and they are the discrete optimization variables. The constraints represent a polytope, called the Marginal polytope. Relaxing the integrality constraint (last line) results in the LP relaxation, which is defined over the Local polytope.
The Lagrangian dual of this LP relaxation is obtained through dual decomposition [22], [38]. The underlying idea is to decompose the graph into tractable subgraphs, in order to derive the Lagrangian. This results in a concave, unconstrained and non-smooth optimization problem. Especially, if one treats each clique as the tractable subgraph, then the approach given in [38] leads to the following dual,
$$\max_{\delta} \; \sum_{c \in C} \min_{x_c} \Big( \theta_c(x_c) - \sum_{i \in c} \delta_{c;i}(x_c(i)) \Big) + \sum_{i \in V} \min_{x_i} \Big( \theta_i(x_i) + \sum_{c : i \in c} \delta_{c;i}(x_i) \Big) \qquad (3)$$
Here the dual variables are $\delta_{c;i}(x_i)$, for each label $x_i$ of each node $i$ within each clique $c$. If all the cliques have $m$ nodes each, with the nodes taking $l$ labels, then the total number of dual variables is $|C|\,m\,l$. We will denote this number as $N$ and $\delta \in \mathbb{R}^N$ is the vector of dual variables. We note that the size of the dual is much smaller than the primal (2); hence, there has been considerable research effort towards solving the MAP problem through the dual.
In MAP inference, there is an interplay between the sizes of the graph and the subgraphs within dual decomposition [21], [42]. For medium sized graphs, decomposing into individual cliques leads to excellent practical convergence. However, for large graphs, only bigger subgraphs like chains of cliques lead to practical convergence [21], [29]. We would like to emphasize that the formulation shown in (3) decomposes the graph according to the cliques but, in general, the subgraphs can be bigger regions.
2.1 Smooth dual
The optimization of the non-smooth dual (3) has slow convergence and can stall at a point away from the optimum. In order to reach closer to the optimum of the non-smooth dual, optimization over smoother versions is carried out. The smooth dual can be obtained by adding to the primal objective (2) entropies corresponding to all the nodes and cliques. The dual of this modified problem can be derived based on duality theory (refer to [28], [16] for further details). The smooth dual looks very similar to the non-smooth one, where each minimum operation in (3) is replaced by the negative soft-max function. The smooth optimization problem takes the following form,
$$\max_{\delta} \; \sum_{c \in C} \min_{x_c}{}^{(\lambda)} \Big( \theta_c(x_c) - \sum_{i \in c} \delta_{c;i}(x_c(i)) \Big) + \sum_{i \in V} \min_{x_i}{}^{(\lambda)} \Big( \theta_i(x_i) + \sum_{c : i \in c} \delta_{c;i}(x_i) \Big) \qquad (4)$$
We will denote the smooth dual function as $f(\delta)$. The negative soft-max function is defined as $\min_i{}^{(\lambda)} z_i = -\frac{1}{\lambda} \log \sum_i \exp(-\lambda z_i)$. As $\lambda$ increases, (4) is a closer approximation of (3) and we get closer to the optimum of the non-smooth dual. However, a smooth approximation becomes increasingly ill-conditioned as it approaches the shape of the non-smooth problem (§7, [31]). Hence, it is costlier to optimize for larger values of $\lambda$. It is better to start with a very smooth version and switch to less smoothness, with warm start from the previous smoother version [34].
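As an illustration of this smoothing, the following sketch (our own, not from the paper; the function name and the max-shift stabilization are illustrative choices) implements the negative soft-max as a smoothed minimum and shows it approaching the true minimum as $\lambda$ grows:

```python
import numpy as np

def soft_min(z, lam):
    """Smoothed minimum: -(1/lam) * log(sum(exp(-lam * z))).
    Shifting by the true minimum keeps the exponentials from underflowing."""
    z = np.asarray(z, dtype=float)
    m = z.min()
    return m - np.log(np.exp(-lam * (z - m)).sum()) / lam

z = [1.0, 1.2, 3.0]
coarse = soft_min(z, lam=1.0)    # loose lower bound on min(z)
fine = soft_min(z, lam=100.0)    # nearly equal to min(z) = 1.0
```

The approximation always lower-bounds the true minimum and tightens as $\lambda$ grows, which is why annealing $\lambda$ upwards trades conditioning for accuracy.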
3 Trust-Region Newton for MAP Inference
The central concept in Newton-type methods is the use of curvature information to compute the search direction. In each iteration, a local quadratic approximation is constructed and a step towards the minimum of this quadratic is taken. For the smooth dual $f(\delta)$, the equations to obtain the Newton direction are shown below. In these equations, $B$ carries the curvature information and can be either the exact Hessian ($\nabla^2 f$) or a modified Hessian or a Hessian approximation (e.g., quasi-Newton).
$$g_k = \nabla f(\delta_k) \qquad (5)$$
$$q_k(d) = f(\delta_k) + g_k^{\top} d + \tfrac{1}{2}\, d^{\top} B_k\, d \qquad (6)$$
$$d_k = \arg\min_{d} \; q_k(d) \qquad (7)$$
$$B_k\, d_k = -g_k \qquad (8)$$
Here, the minimum of the quadratic approximation will be a descent direction for positive definite $B$ and a line search along this direction determines the step size. For unconstrained problems, the Newton direction is found by solving the linear system (8). Since $N$ is large for computer vision problems, we use Conjugate Gradients (CG) to obtain the Newton direction in our work.
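A minimal sketch of this step (ours, with illustrative tolerances): CG solves $Bd = -g$ using only matrix-vector products with $B$, so the Hessian never has to be factorized or even stored densely:

```python
import numpy as np

def newton_direction_cg(hess_vec, g, tol=1e-8, max_iter=250):
    """Solve B d = -g by Conjugate Gradients, given only the
    Hessian-vector product hess_vec(v) = B v (B positive definite)."""
    d = np.zeros_like(g)
    r = -g.copy()               # residual of B d = -g at d = 0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Bp = hess_vec(p)
        alpha = rs / (p @ Bp)
        d = d + alpha * p
        r = r - alpha * Bp
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(g):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

B = np.array([[4.0, 1.0], [1.0, 3.0]])   # small SPD stand-in for the Hessian
g = np.array([1.0, 2.0])
d = newton_direction_cg(lambda v: B @ v, g)
```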
For a Newton-type method, if $B$ is the exact Hessian ($\nabla^2 f$) and if the backtracking line search tests the full step length first, quadratic convergence is achieved when sufficiently close to the optimum [4]. This is desirable when compared to first order methods. As mentioned in section 2.1, smoothing is reduced as the algorithm proceeds and this results in ill-conditioning. Newton-type methods have some robustness against ill-conditioning due to their affine invariance, i.e., for functions $f(x)$ and $h(y) = f(Ay)$, where $A$ is an invertible square matrix, the iterates of Newton-type methods will be related as $x_k = A y_k$. Hence, in theory, these methods are not affected by ill-conditioning and, in finite precision, the range of condition numbers in which Newton-type methods exhibit robust behaviour is greater than that of first order methods. It is worth investigating how Newton-type methods handle MAP inference.
3.1 Hessian related computations
Even though Newton-type methods have advantages, it may be expensive to populate and solve the linear system (8). The computationally heavy steps for the Conjugate Gradient (CG) algorithm are: computing Hessian-vector products and regularly constructing and applying a preconditioning matrix. Generally, if the Hessian is difficult to populate, Complex Step Differentiation (CSD) [1] can give very accurate Hessian-vector products, with each product costing roughly one gradient computation, which is too costly to be used within CG. So, we investigate the possibility of efficiently populating the Hessian and obtaining fast Hessian-vector products.
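For reference, a complex-step Hessian-vector product can be sketched as follows (our example; the test function $f(x)=\sum_i x_i^4$ and all names are illustrative). Each product applies the gradient once at a complex-perturbed point, which is what makes it as costly as a gradient evaluation:

```python
import numpy as np

def grad(x):
    # Gradient of f(x) = sum(x_i^4); valid for complex input too.
    return 4.0 * x ** 3

def hessvec_csd(grad_fn, x, v, h=1e-20):
    """Complex-step Hessian-vector product: H v ~ Im(grad(x + i h v)) / h.
    Accurate to machine precision, but costs one gradient call per product."""
    return np.imag(grad_fn(x + 1j * h * v)) / h

x = np.array([1.0, 2.0])
v = np.array([1.0, -1.0])
hv = hessvec_csd(grad, x, v)   # exact Hessian here is diag(12 * x_i**2)
```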
As mentioned in section 2, for medium sized problems it is sufficient to decompose according to cliques and achieve practical convergence. We will now show that for decompositions according to small subgraphs like individual cliques, the Hessian can be computed very efficiently. In fact, it takes only about twice the gradient computation time to compute both the gradient and the Hessian together.
If we take a closer look at the Hessian, we observe that it can be written as the sum of two components, as shown in figure 1. Both components have a block structure. Consider the following quantities,
$$\mu_c(x_i) = \sum_{x_c : x_c(i) = x_i} p_c(x_c), \quad \text{where} \;\; p_c(x_c) \propto \exp\Big(-\lambda \big( \theta_c(x_c) - \sum_{j \in c} \delta_{c;j}(x_c(j)) \big)\Big) \qquad (9)$$
$$\mu_i(x_i) = p_i(x_i) \propto \exp\Big(-\lambda \big( \theta_i(x_i) + \sum_{c : i \in c} \delta_{c;i}(x_i) \big)\Big) \qquad (10)$$
The elements of component one can be written as $\lambda \big( \mu_c(x_i, x_j) - \mu_c(x_i)\,\mu_c(x_j) \big)$, where the pair marginal $\mu_c(x_i, x_j)$ can be calculated like $\mu_c(x_i)$ by fixing the labels of two nodes. This leads to a block diagonal matrix, with as many unique symmetric blocks as there are cliques. The elements of component two can be written as $\lambda \big( \mu_i(x_i) - \mu_i(x_i)^2 \big)$ and $-\lambda\, \mu_i(x_i)\,\mu_i(x_i')$. It has as many unique symmetric blocks as there are nodes and a given row or column has repeated copies of the same block. Here $c$ denotes cliques, $i, j$ member nodes, $x_i, x_j$ node labels and $x_c$ a clique labelling. Also, it is component two which contains off-diagonal blocks, caused by overlapping cliques.
Hence, the elements of the Hessian can be computed by only iterating once through all cliques and nodes. Iterating over pairs of overlapping cliques is avoided. For the gradient, arrays of marginal values need to be computed by iterating once through all cliques and nodes. For the Hessian, we need to compute (symmetric) blocks of values at each clique and node. Practically, due to cache reuse, this is achieved overall with the overhead time of only one gradient computation. These blocks can be readily used for parallelizing Hessian-vector multiplication and we do it with simple OpenMP code. Thus, efficient Hessian-vector products lead to an efficient CG routine in our case.
For several problems, the Hessian is sparse because a clique overlaps only with a few other cliques. Nevertheless, because of the special structure, there is no need to exploit this sparsity for computing the unique blocks of the Hessian. Also, if there are many overlapping cliques, computing Hessian-vector products is not adversely affected because we will only have more copies of the same block along each row of the component two matrix, corresponding to the shared node. Hence, data movement in computer memory stays limited. Moreover, these data structures can be readily used for constructing preconditioners.
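The block structure described above can be exploited directly in the Hessian-vector product. The sketch below (our own simplification) stores only the unique symmetric blocks plus an index of where each block sits on the diagonal, so repeated copies are never materialized:

```python
import numpy as np

def blockdiag_hessvec(blocks, layout, v):
    """Multiply a block-diagonal matrix by v. `blocks` maps a block id to a
    symmetric block; `layout` lists (block id, slice) pairs, so the same
    block can be reused at several diagonal positions without copying."""
    out = np.zeros_like(v)
    for b_id, sl in layout:
        out[sl] = blocks[b_id] @ v[sl]
    return out

A = np.array([[2.0, 1.0], [1.0, 2.0]])
layout = [(0, slice(0, 2)), (0, slice(2, 4))]   # one unique block, used twice
v = np.array([0.0, 1.0, 2.0, 3.0])
hv = blockdiag_hessvec({0: A}, layout, v)
```

Off-diagonal copies (component two) can be handled the same way by storing separate row and column slices per entry; each loop iteration is independent, which is what the paper's OpenMP parallelization exploits.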
3.2 Damping matrix approach
The smooth dual (4) is only strictly, not strongly, convex, i.e., away from the optimum, the Hessian is only positive semi-definite. In such a situation a trust-region approach is necessary to take meaningful Newton steps. Here, the same quadratic approximation (6) is minimized with the constraint $\|d\| \le \Delta$, where $\Delta$ is a trust radius. While developing our trust-region Newton method for MRF inference (TRN-MRF), we addressed several issues: how to enforce the trust region, how to improve the speed of convergence by coping with the ill-conditioning and how to design a suitable preconditioner. It is the combination of these choices that leads to a usable algorithm and we note that what works for MAP inference may not work for other tasks and vice versa.
The Steihaug method seems like the first method to try for large problems. It has the nice property of shaping the trust region into an ellipsoid, according to the landscape. This is achieved by minimizing (6) with the constraint $\|d\|_M \le \Delta$, where $M$ is a preconditioner and $\|d\|_M = \sqrt{d^{\top} M d}$ [7]. However, in the absence of a good preconditioner, in a given outer Newton iteration, the algorithm quickly reaches the trust radius before computing a good direction, leading to several outer iterations.
The Levenberg-Marquardt algorithm, which is generally used in least squares problems, offers another approach to impose a trust region. The idea that we borrow is the damping matrix, which is a regularizer that addresses the strict convexity and the ill-conditioning. This matrix is added to the Hessian to obtain the modified Hessian in (8). The damping matrix restricts the Newton direction returned by CG to a trust region. The Levenberg-Marquardt algorithm uses the scaled Hessian diagonal as the damping matrix. This choice gave very poor results for MAP inference. Instead, we modify the Hessian as follows: $B = H + \lambda_d I$, where $H$ is the Hessian and $\lambda_d \ge 0$ is the damping parameter. While Steihaug works explicitly with a trust radius $\Delta$, we impose it implicitly through $\lambda_d$. This can be seen through these equations that tie the damping parameter $\lambda_d$ to a trust radius $\Delta$,
$$(H + \lambda_d I)\, d = -g, \qquad \lambda_d\, (\Delta - \|d\|) = 0, \qquad (H + \lambda_d I) \succeq 0 \qquad (11)$$
In a sufficiently positive definite region (close to the optimum), $\lambda_d$ is close to zero and the Newton direction is computed with the true Hessian. If $\lambda_d > 0$ then $\|d\| = \Delta$, i.e., the direction is restricted by the trust region. In ill-conditioned regions, $\lambda_d$ will be large and the enforced trust radius will be small. Thus, after every iteration, $\lambda_d$ is adapted and it can be done as follows,
$$\rho = \frac{f(\delta_k) - f(\delta_k + d_k)}{f(\delta_k) - q_k(d_k)}, \qquad \lambda_d \leftarrow \begin{cases} \sigma_{\uparrow}\, \lambda_d & \text{if } \rho < \rho_{lo} \\ \sigma_{\downarrow}\, \lambda_d & \text{if } \rho > \rho_{hi} \end{cases} \qquad \text{with } \sigma_{\uparrow} > 1 > \sigma_{\downarrow} \qquad (12)$$
Here $\rho$ signifies how well the quadratic of equation (6) approximates the dual ($f$). As the algorithm approaches the optimum, $\lambda_d$ becomes vanishingly small and the algorithm reaches the true optimum without any perturbation.
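A small sketch of this adaptation loop (ours; the thresholds and factors are conventional Levenberg-Marquardt defaults, not necessarily the paper's constants):

```python
def update_damping(lam_d, rho, lo=0.25, hi=0.75, grow=2.0, shrink=0.5):
    """Adapt the damping parameter from rho, the ratio of the actual decrease
    to the decrease predicted by the quadratic model: poor agreement shrinks
    the implicit trust region (larger lam_d), good agreement enlarges it."""
    if rho < lo:
        return lam_d * grow
    if rho > hi:
        return lam_d * shrink
    return lam_d
```

In a full solver the step would additionally be rejected (and recomputed with the new damping) whenever the decrease is insufficient.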
3.3 Forcing sequence for CG truncation
Having established the suitability of the damping matrix based approach, it is still necessary to properly address the ill-conditioning caused by annealing. Trust-region Newton proceeds by taking approximate Newton steps, where in each outer iteration the run of CG iterations is truncated by a suitable criterion. However, as the algorithm approaches optimality, it is critical to solve the linear systems to greater accuracy and get better Newton steps [35]. Otherwise, the algorithm will take too long to converge or will not converge at all. Generally, at an outer iteration $k$, CG can be truncated at iteration $j$ according to the following condition: $\|r_j\| \le \eta_k\, \|\nabla f(\delta_k)\|$. Here, $r_j$ is the residual of equation (8) at iteration $j$ of CG and $\eta_k$ is referred to as the forcing sequence. Through $\eta_k$ we reduce the residual and achieve a more accurate Newton direction. For MAP inference, we found textbook choices of the forcing sequence to lead to Newton iterations not converging, because the residual never becomes low enough to achieve the more accurate directions required for further progress. Hence, for TRN-MRF, a forcing sequence has been designed that decreases both with the outer iteration count and with a term that takes smaller values as the annealing progresses. This naturally imposes a stronger condition on the residual for later iterations and ensures that sufficiently accurate Newton steps are obtained as the smoothing reduces.
3.4 Clique based Preconditioner and Backtracking search
In order to further improve the efficiency of TRN-MRF, we will address one important aspect that influences trust-region Newton methods. This concerns the computational efficiency of the Conjugate Gradient routine. Convergence of CG iterations is a function of the number of distinct eigenvalue clusters of the linear system and a well preconditioned system ($M^{-1}B$) will have fewer such clusters. The matrix $M$ should be as similar to $B$ as possible and should be efficient to construct and invert.
Standard preconditioning approaches like incomplete Cholesky, quasi-Newton and multigrid fail to cope with generic MAP inference problems. A clique based block diagonal preconditioner has, along the diagonal, clique specific blocks of double derivative terms, i.e., $\partial^2 f / \partial \delta_{c;i}(x_i)\, \partial \delta_{c;j}(x_j)$, where $i, j$ are nodes belonging to the clique $c$ and $x_i, x_j$ are their labels. Since its structure is closely related to the MAP inference problem, it performs quite well. The computational cost is that of computing and inverting these blocks individually. This is done at the same time as computing the gradient and the Hessian. Applying this preconditioner corresponds to matrix-vector multiplications involving the block inverses.
For sufficiently large values of $\lambda_d$, the modified Hessian is automatically well-conditioned and CG will converge fast. It is towards the optimum, when $\lambda_d$ reduces to vanishing values, that CG runs for a large number of iterations. We observed considerable improvement in CG performance with this preconditioner and the results shown in this paper are based on it. Also, close to the optimum, there will be CG runs taking the maximum allowed number of iterations, which we have set to 250 for all our experiments.
The last aspect concerns backtracking search, once the Newton direction has been computed by the CG routine. When $\rho$ in equation (12) is less than a small threshold, it means a step of either very little decrease or an increase in the function value. So, we cannot directly take a step along this direction. However, it is desirable to take a step of sufficient decrease in every outer iteration of TRN-MRF and hence, we perform a backtracking search in these cases [30]. The backtracking can be done either along a straight line or along a curved path (a subset of CG iterations). Although backtracking along a curved path gives good results in other problems [24], we observed poor results for MAP inference: the final direction was very close to the steepest descent direction. On the other hand, performing a backtracking line search along the direction computed by CG gave a huge speedup for TRN-MRF. We have implemented a cubic interpolation based search, which is very efficient.
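A plain backtracking variant with the Armijo sufficient-decrease test can be sketched as follows (our code; the paper's cubic-interpolation search picks the next trial step by fitting a cubic to function values and slopes instead of fixed halving):

```python
import numpy as np

def backtracking_line_search(f, x, d, g, alpha=1.0, c1=1e-4, shrink=0.5,
                             max_iter=30):
    """Shrink the step until f(x + a d) <= f(x) + c1 * a * <g, d>
    (Armijo condition); d must be a descent direction (<g, d> < 0)."""
    fx = f(x)
    slope = g @ d
    for _ in range(max_iter):
        if f(x + alpha * d) <= fx + c1 * alpha * slope:
            break
        alpha *= shrink
    return alpha

f = lambda x: 0.5 * (x @ x)          # toy smooth objective
x = np.array([2.0, 0.0])
step = backtracking_line_search(f, x, d=-x, g=x)   # full Newton step accepted
```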
3.5 Annealing schedule and stopping condition
[34] suggests annealing by periodically computing the primal-dual (PD) gap of the smooth problem. For intermediate dual variables, they recover feasible primal variables by solving a small LP (called a transportation problem) for each clique. Since their approach is adapted to pairwise graphs, they achieve results with fewer than a thousand oracle calls (an oracle call is either an iteration or a computation of the PD gap). However, for higher-order MRFs, computing feasible primal variables and the primal objective is costly. Hence, computing the PD gap of the smooth problem every few iterations greatly affects computational efficiency.
We have used a simple but intuitive criterion for judging when to anneal. Since we are working with a concave function, regions with smaller gradient Euclidean norm are guaranteed to be closer to the optimum than regions with much larger values. Hence, if $g$ signifies the gradient after running some iterations with a given $\lambda$, we update $\lambda$ once $\|g\|$ falls below a threshold $\epsilon_g$. If we impose a strong enough threshold $\epsilon_g$, we are guaranteed to achieve sufficient improvement for that particular $\lambda$ and annealing can be performed. The threshold is re-initialized just after $\lambda$ is annealed. Similar to the proof in [34], this annealing approach will work with any optimization algorithm that converges to the global optimum for a fixed value of $\lambda$. All the algorithms tested by us have this guarantee for smooth, concave problems. Also, we ensure $\lambda$ has reached a large enough value in order to obtain accurate results.
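The annealing test can be sketched as follows (our own code; the threshold value, the annealing factor and the re-initialization rule are illustrative stand-ins for the paper's constants):

```python
import numpy as np

def should_anneal(g, eps_g):
    """Increase lambda once the gradient norm for the current lambda has
    fallen below the threshold eps_g, i.e., once we are close enough to
    the optimum of the current smooth problem."""
    return np.linalg.norm(g) <= eps_g

lam, eps_g = 1.0, 1e-2
g = np.array([3e-3, 4e-3])      # gradient after some iterations at this lambda
if should_anneal(g, eps_g):
    lam *= 2.0                   # anneal: less smoothing, better accuracy
    eps_g = 1e-2                 # re-initialize the threshold for the new lambda
```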
In order to exit the least smooth problem, [34] use the non-smooth PD gap. We have observed that TRN-MRF, due to its quadratic convergence rate, can exit based on a classical gradient based condition itself. More precisely, with a gradient norm threshold (§8, [12]), TRN-MRF achieves good exit behaviour. However, first order methods take too long to achieve this gradient based exit condition and many times never do so. Hence, we have implemented a PD gap based approach, so that all algorithms can exit gracefully. The approach in [34] is available only for pairwise graphs in the openGM library and implementing small LP solvers for all the higher-order cliques looks challenging. [28] proposed a method involving only closed form calculations and we have implemented their approach.
Our complete trust-region Newton method, within an annealing framework, is described in Algorithm 1. In our experiments, all of its parameters are kept fixed at the same values throughout.
4 Scalable Smoothing based approach
Even though the smoothing based approach has been scaled to large pairwise graphs [34], higher-order MRFs with large label spaces and/or large graphs are less studied. In this section, we first show an efficient way to compute the smoothing operation with large label spaces for a very useful class of clique potentials. Next, we demonstrate the use of quasi-Newton methods for problems with large graphs.
4.1 Pattern-based Smoothing
In the smooth dual (4), consider the first term, which corresponds to cliques and involves log-sum-exp calculations over all possible labellings for each clique. This scales exponentially ($O(l^k)$) with the clique size $k$, where each node takes $l$ labels. Hence, computing the gradient of this dual is computationally heavy, especially for higher-order cliques. [16] observed that these terms in the gradient correspond to marginal probabilities in a suitably defined graphical model. Hence, they can be computed using the sum-product algorithm [19]. Still, the exponential complexity remains.
Sparse, pattern-based clique potentials are very useful in computer vision [21], [32]. In these potentials, a big majority of the labellings take a constant (usually high) value and a small subset (of size $r$) take other significant values. Many clique potentials fit this description. [21] showed an efficient way to perform max-product computation in chains of such cliques. It is not clear how to extend their work to sum-product. [8] showed a sum-product approach in pairwise MRFs, achieving a complexity governed by the pattern size $r$ rather than by the full label space. They mention that their idea can be used for higher-order cliques with a factor graph representation but do not go into details. We have found that with a factor graph representation, their approach has a higher complexity. Instead of a factor graph, if a clique tree representation (§10, [19]) is used, it is possible to achieve a much better complexity, exponential only in the size $s$ of a subset of the clique nodes rather than in the full clique size $k$. For individual cliques, these subsets will be (nearly) equal sized partitions of the clique. For clique chains, this is the separator set between two overlapping cliques, i.e., the nodes shared by the two cliques. For example, for cliques of size 4 with a subset of size 2, the cost is governed by $l^2$ rather than $l^4$ and one can quickly see the efficiency gain.
In the following, we will consider clique chains. This is for simplicity of notation and these results are applicable to trees. Also, these results can be reduced to the case of individual cliques. While computing the gradient, the quantity $\mu_c(x_i)$ is equal to the marginal probability of node $i$ taking label $x_i$ in clique $c$ of a suitably defined graphical model. Let $\psi_c$ and $\psi_i$ denote the (exponentiated) clique and node potentials, respectively, and $c_1$, $c_2$, $c_3$ be three neighbouring cliques in a chain, with $s_{12}$ and $s_{23}$ being the separator sets between $c_1$ and $c_2$, and $c_2$ and $c_3$, respectively. The sum-product message is passed from $c_2$ to $c_3$ as follows,
$$m_{2 \to 3}(x_{s_{23}}) = \sum_{x_{c_2 \setminus s_{23}}} \psi_{c_2}(x_{c_2}) \Big( \prod_{i \in c_2 \setminus s_{23}} \psi_i(x_i) \Big)\, m_{1 \to 2}(x_{s_{12}}) \qquad (13)$$
The incoming message $m_{1 \to 2}$ was sent by clique $c_1$ to $c_2$. This message is for the nodes in $s_{12}$ and is itself obtained by marginalization (summation) at $c_1$. Let $\bar{\psi}$ denote the constant clique potential and let $P(x_{s_{23}})$ denote the pattern-based clique labellings corresponding to the separator set labelling $x_{s_{23}}$; then equation (13) can be written as,
$$m_{2 \to 3}(x_{s_{23}}) = \sum_{x_{c_2} \in P(x_{s_{23}})} \big( \psi_{c_2}(x_{c_2}) - \bar{\psi} \big)\, \beta(x_{c_2 \setminus s_{23}}) + \bar{\psi} \sum_{x_{c_2 \setminus s_{23}}} \beta(x_{c_2 \setminus s_{23}}) \qquad (14)$$
where $\beta(x_{c_2 \setminus s_{23}}) = \big( \prod_{i \in c_2 \setminus s_{23}} \psi_i(x_i) \big)\, m_{1 \to 2}(x_{s_{12}})$ collects the node potentials and the incoming message; when $s_{12}$ does not overlap $s_{23}$, $\beta$ depends only on $x_{c_2 \setminus s_{23}}$ and the second summation is independent of $x_{s_{23}}$.
In (14), the second summation is computed only once and used for all elements of $x_{s_{23}}$. With a loop of complexity $O(r)$, the set of first summations for each element of $x_{s_{23}}$ can be computed. Then, iterating once through the labellings $x_{s_{23}}$ gives the message in (14). Within this last loop to compute the message, the marginal probabilities of the corresponding nodes are computed. Also, the Hessian requires node pair marginals ($\mu_c(x_i, x_j)$, section 3.1) and they are calculated by parallelized sweeps of the sum-product algorithm working on independent subsets of the cliques.
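The core of this trick is a generic sparse-sum identity: when a weight function equals a constant everywhere except on a small pattern set, a weighted sum over the full label space collapses to one dense sum plus a short loop over the patterns. A sketch (ours, written in the probability domain after exponentiation; all names are illustrative):

```python
import numpy as np

def sparse_weighted_sum(q, patterns, psi_bar):
    """Compute sum_x psi(x) * q(x), where psi(x) = psi_bar for all x except
    the labellings listed in `patterns` (index -> potential value):
    psi_bar * sum(q) is computed once, then corrected per pattern."""
    total = psi_bar * q.sum()
    for idx, psi_x in patterns:
        total += (psi_x - psi_bar) * q[idx]
    return total

q = np.array([0.1, 0.2, 0.3, 0.4])      # incoming message times node factors
patterns = [(1, 5.0), (3, 2.0)]          # sparse, pattern-based potentials
val = sparse_weighted_sum(q, patterns, psi_bar=1.0)
```

A brute-force sum with the dense weight vector [1, 5, 1, 2] gives the same value, but needs a full pass over all labellings instead of one shared dense sum plus an $O(r)$ correction.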
4.2 Quasi-Newton approach
In section 2, we pointed out the trade-off between the sizes of the graph and the subgraphs in dual decomposition. This becomes critical for large problems like stereo, where a decomposition based on individual cliques does not lead to practical convergence [21], [29]. Larger subgraphs are needed and chains of higher-order cliques are suitable for grid graphs. As mentioned in section 4.1, the Hessian for MAP inference requires node pair marginals and it is very costly to calculate node pair marginals for clique chains. Hence, it is not practical to populate the Hessian for problems which require large subgraphs. Since we have promising results with trust-region Newton for medium sized problems with individual clique based decomposition, we explore quasi-Newton methods for large problems. Note that quasi-Newton methods have only a superlinear convergence rate when sufficiently close to the optimum.
Quasi-Newton methods build an approximate Hessian by low rank updates of an initial approximation (usually, a scaled identity matrix) with the vector pairs $s_k = \delta_{k+1} - \delta_k$ and $y_k = g_{k+1} - g_k$. For large problems a limited memory variant is required, based on the most recent $m$ pairs of vectors. One can approximate either the Hessian or the Hessian inverse. For a not strongly convex and/or ill-conditioned problem, the Hessian inverse approximation leads to overly large Newton steps. Downscaling them to a trust region radius does not help, as the direction itself is poor to start with. A trust-region based approach using the Hessian approximation is the better choice. Hence, we take the same approach as Algorithm 1, with a Hessian approximation (based on L-BFGS) in the place of $B$. The Hessian approximation has the form of an identity plus rank-$2m$ matrix. The Conjugate Gradient algorithm converges in $O(m)$ iterations for such a linear system (§11.3.4, [13]). Moreover, matrix-vector products in quasi-Newton can be obtained through a compact representation [6]. Thus the cost of the CG routine is a very small fraction of the cost of computing the gradient in our experiments. Given that the iteration cost of quasi-Newton is comparable to first order methods, it is worthwhile to check whether quasi-Newton converges faster to the optimum.
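The claim about CG on an identity-plus-low-rank system is easy to check numerically (our sketch; sizes and seed are arbitrary): CG terminates in at most as many iterations as the system has distinct eigenvalues, which is $2m + 1$ here.

```python
import numpy as np

def cg_iterations(A, b, tol=1e-8, max_iter=1000):
    """Run plain CG on A x = b and report the iterations needed."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for k in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            return k
        p = r + (rs_new / rs) * p
        rs = rs_new
    return max_iter

rng = np.random.default_rng(0)
n, m = 200, 3
U = rng.standard_normal((n, 2 * m))
A = np.eye(n) + U @ U.T        # identity + rank-2m, like an L-BFGS Hessian
b = rng.standard_normal(n)
iters = cg_iterations(A, b)    # at most 2m + 1 distinct eigenvalues
```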
5 Experiments
We present results based on higher-order MRF models. As our baselines, we used FISTA with backtracking line search [3], Smooth Coordinate Descent (SCD) based on the Star update [28] and AD3 [25]. Note that Star update based SCD consistently performed better than Max-Sum Diffusion based SCD in our experiments.
First, we tested on medium sized problems and compared TRN-MRF, FISTA, SCD and AD3. We formulate the problem of matching two sets of interest points as MAP inference, based on [9]. For $n$ points, each point in the source can be matched to any of the $n$ points in the target, i.e., $n$ labels. The MRF is constructed by generating triangles in the source and each triangle in the source and target is characterized by a tuple of side lengths. For each source tuple, we find the top 30K nearest neighbours among the target tuples. 30K is much smaller than the number of all possible triangles in the target, so this is a sparse, pattern-based clique potential. The higher-order clique potentials are defined through the squared differences between a source tuple and its 30K nearest neighbours in the target. To get state-of-the-art results, additional terms to disallow many-to-one mappings would be needed. For the sake of simplicity we ignore this issue.
We first tested with a synthetic problem, with a set of points on the 2D plane. We added Gaussian noise to these points to create the target image. The unary potentials are defined by assigning a value to each node in both the images and taking the absolute differences. The results are presented in table 1, where two levels of added noise were tested.
Algorithm  TRN  SCD  FISTA  AD3  TRN  SCD  FISTA  AD3 
time (seconds)  948.9  1742.7  4368.1  369.35 (738.7)  2513.8  6884.2  9037.1  No convergence 
Nonsmooth dual  279.69  279.69  279.69  279.69  259.28  258.27  258.7  260.94 
Nonsmooth primal  279.67  279.69  279.66  N/A  261.04  258.36  258.86  N/A 
Integer primal  279.69  279.69  279.69  279.69  247.86  248.24  250.3  249.82 
We next tested with matching points on the House dataset [2]. Points are marked in all the frames and the points in the first frame are matched with the points in later frames (110 being the last frame). The unaries are set to zero. The camera rotates considerably around the object and we show results for three frames in table 2.
Frame 70  Frame 90  Frame 110  

Algorithm  TRN  SCD  FISTA  AD3  TRN  SCD  FISTA  AD3  TRN  SCD  FISTA  AD3 
time (seconds)  2374.5  3702.9  11964.5  1428.7 (2857.4)  4731.6  4206.4  12561.2  2303.05 (4606.1)  4451.8  10498.8  21171.1  No convergence 
Nonsmooth dual  368.59  368.59  368.59  368.59  337.81  337.81  337.81  337.81  333.03  331.51  331.31  335.5 
Nonsmooth primal  368.56  368.57  368.37  N/A  337.77  337.78  337.36  N/A  336.16  331.95  330.65  N/A 
Integer primal  368.59  368.59  368.59  368.59  337.81  337.81  337.81  337.81  315.93  317.69  317.78  314.16 
The main reason for being able to tackle such problems is the efficient computation of log-sum-exp (section 4.1). TRN-MRF, FISTA and SCD all benefit from this. AD3 can also exploit sparse, pattern-based potentials, since max-product computations take place in its inner loop. The same steps shown for sum-product can be used for max-product by replacing the summing operation with the max operation. Since these modifications cannot be made to the AD3 version in openGM, we show within brackets the actual time taken by openGM's AD3. Since our current implementation of sparse sum-product gives a two times speedup, the figure outside the brackets is half that value.
An important observation is that TRN-MRF, FISTA and SCD converge reliably compared to AD3 (an ADMM based approach). Among these three, TRN-MRF is very competitive and is the fastest in many cases. The quadratic convergence rate guarantee of TRN-MRF is evident: in many cases, the stricter gradient-based exit condition is reached at the same time as, or before, the PD-gap based exit condition, whereas the gradient-based exit condition is never reached before the PD-gap condition by any of the first-order methods. We do simple rounding of the primal variables to recover the labels; this rounding leads to energies that can be slightly better or worse. The crux of this paper is getting closer to the global optimum of the non-smooth dual; improved rounding schemes based on local search will surely lead to better final labellings. AD3 has its own rounding scheme and outputs its own results.
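The simple rounding we refer to can be sketched as taking, at each node, the label with the largest relaxed primal value; the variable names here are illustrative:

```python
import numpy as np

def round_primal(node_marginals):
    """node_marginals: (num_nodes, num_labels) relaxed primal variables.
    Simple rounding: pick the argmax label independently at each node."""
    return np.argmax(node_marginals, axis=1)

mu = np.array([[0.7, 0.2, 0.1],
               [0.1, 0.1, 0.8]])
print(round_primal(mu))  # → [0 2]
```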
Next, we present results for stereo with a curvature prior on cliques. The clique energy is truncated, i.e., pattern-based. The unaries are based on absolute difference. We present results for Tsukuba with 16 depth levels. This is a large problem, and pattern-based sum-product was computed on clique chains. We compare quasi-Newton with FISTA and AD3. For large problems, the PD-gap based criterion [28] never leads to convergence for either quasi-Newton or FISTA, so a simultaneous criterion on function-value difference, variable difference and gradient has been used (adapted from §8.2.3.2 of [12]). AD3 showed poor convergence behavior for this large problem.
Algorithm          | Quasi-Newton | FISTA
Iterations         | 357          | 594
Non-smooth dual    | 29105.9      | 29105.5
Integer primal     | 29347        | 29282.5
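A simultaneous stopping test of the kind described above (small change in function value, small step, and small gradient, all holding together) might look like the following sketch; the tolerances are illustrative, not the ones used in our experiments:

```python
import numpy as np

def converged(f_old, f_new, x_old, x_new, grad,
              tol_f=1e-8, tol_x=1e-8, tol_g=1e-6):
    """All three conditions must hold simultaneously: small relative
    change in the objective, small step, and small gradient norm."""
    df = abs(f_new - f_old) <= tol_f * (1.0 + abs(f_new))
    dx = np.linalg.norm(x_new - x_old) <= tol_x * (1.0 + np.linalg.norm(x_new))
    dg = np.linalg.norm(grad) <= tol_g * (1.0 + abs(f_new))
    return df and dx and dg
```

An optimizer would evaluate this check once per iteration and stop as soon as it returns true.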
6 Discussion
We presented Newton-type methods that offer convergence guarantees and very good convergence rates, through appropriate choices of their algorithmic components. Specifically, these Newton-type methods are very suitable for problems in which sum-product computation is efficient. We showed promising results with higher-order MRFs of medium and large sizes. We hope this work spurs further research on exploiting curvature information within optimization algorithms for MAP inference.
7 Acknowledgments
Thanks to the reviewers for their helpful comments.
References
 [1] Complex step differentiation. http://blogs.mathworks.com/cleve/2013/10/14/complexstepdifferentiation/. Accessed: 2016-10-30.
 [2] House dataset. http://vasc.ri.cmu.edu//idb/html/motion/house/index.html. Accessed: 2016-10-30.
 [3] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
 [4] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.
 [5] R. H. Byrd, S. Hansen, J. Nocedal, and Y. Singer. A stochastic quasi-Newton method for large-scale optimization. SIAM Journal on Optimization, 26(2):1008–1031, 2016.
 [6] R. H. Byrd, J. Nocedal, and R. B. Schnabel. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63(1-3):129–156, 1994.
 [7] A. R. Conn, N. I. Gould, and P. L. Toint. Trust Region Methods. SIAM, 2000.
 [8] J. M. Coughlan and H. Shen. An embarrassingly simple speed-up of belief propagation with robust potentials. arXiv preprint arXiv:1010.0012, 2010.
 [9] O. Duchenne, F. Bach, I.-S. Kweon, and J. Ponce. A tensor-based algorithm for high-order graph matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(12):2383–2395, 2011.
 [10] A. Fix, C. Wang, and R. Zabih. A primal-dual algorithm for higher-order multi-label Markov random fields. In Computer Vision and Pattern Recognition (CVPR), pages 1138–1145. IEEE, 2014.
 [11] K. Fountoulakis and J. Gondzio. Performance of first- and second-order methods for big data optimization. arXiv preprint arXiv:1503.03520, 2015.
 [12] P. E. Gill, W. Murray, and M. H. Wright. Practical Optimization. Academic Press, 1981.
 [13] G. H. Golub and C. F. Van Loan. Matrix Computations, volume 3. JHU Press, 2012.
 [14] T. Hazan and A. Shashua. Norm-product belief propagation: Primal-dual message-passing for approximate inference. IEEE Transactions on Information Theory, 56(12):6294–6316, 2010.
 [15] H. Ishikawa. Transformation of general binary MRF minimization to the first-order case. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(6):1234–1249, 2011.
 [16] V. Jojic, S. Gould, and D. Koller. Accelerated dual decomposition for MAP inference. In International Conference on Machine Learning (ICML), pages 503–510, 2010.
 [17] J. Kappes, B. Andres, F. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. Kausler, T. Kröger, J. Lellmann, N. Komodakis, B. Savchynskyy, and C. Rother. A comparative study of modern inference techniques for structured discrete energy minimization problems. International Journal of Computer Vision, 115(2):155–184, 2015.
 [18] P. Kohli, M. P. Kumar, and P. H. Torr. P³ & beyond: Move making algorithms for solving higher order functions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(9):1645–1656, 2009.
 [19] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
 [20] V. Kolmogorov. A new look at reweighted message passing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(5):919–930, 2015.
 [21] N. Komodakis and N. Paragios. Beyond pairwise energies: Efficient optimization for higher-order MRFs. In Computer Vision and Pattern Recognition (CVPR), pages 2985–2992. IEEE, 2009.
 [22] N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond via dual decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):531–552, 2011.
 [23] J. Lee, Y. Sun, and M. Saunders. Proximal Newton-type methods for convex optimization. In Advances in Neural Information Processing Systems (NIPS), pages 836–844, 2012.
 [24] J. Martens. Deep learning via Hessian-free optimization. In International Conference on Machine Learning (ICML), pages 735–742, 2010.
 [25] A. F. Martins, M. A. Figueiredo, P. M. Aguiar, N. A. Smith, and E. P. Xing. AD3: Alternating directions dual decomposition for MAP inference in graphical models. Journal of Machine Learning Research, 16:495–545, 2015.
 [26] T. Meltzer, A. Globerson, and Y. Weiss. Convergent message passing algorithms: a unifying view. In Uncertainty in Artificial Intelligence (UAI), pages 393–401, 2009.
 [27] O. Meshi and A. Globerson. An alternating direction method for dual MAP LP relaxation. In Machine Learning and Knowledge Discovery in Databases, pages 470–483. Springer, 2011.
 [28] O. Meshi, T. Jaakkola, and A. Globerson. Smoothed coordinate descent for MAP inference. In Advanced Structured Prediction, pages 103–131, 2014.
 [29] O. Miksik, V. Vineet, P. Pérez, and P. H. Torr. Distributed non-convex ADMM-inference in large-scale random fields. In British Machine Vision Conference (BMVC), 2014.
 [30] J. Nocedal and Y.-X. Yuan. Combining trust region and line search techniques, volume 14 of Applied Optimization, pages 153–175. Springer, 1998.
 [31] R. T. Rockafellar and R. J.-B. Wets. Variational Analysis, volume 317. Springer Science & Business Media, 2009.
 [32] C. Rother, P. Kohli, W. Feng, and J. Jia. Minimizing sparse higher order energy functions of discrete variables. In Computer Vision and Pattern Recognition (CVPR), pages 1382–1389. IEEE, 2009.
 [33] B. Savchynskyy, S. Schmidt, J. Kappes, and C. Schnörr. A study of Nesterov's scheme for Lagrangian decomposition and MAP labeling. In Computer Vision and Pattern Recognition (CVPR), pages 1817–1823. IEEE, 2011.
 [34] B. Savchynskyy, S. Schmidt, J. Kappes, and C. Schnörr. Efficient MRF energy minimization via adaptive diminishing smoothing. In Uncertainty in Artificial Intelligence (UAI), pages 746–755, 2012.
 [35] T. Schlick and M. Overton. A powerful truncated Newton method for potential energy minimization. Journal of Computational Chemistry, 8(7):1025–1039, 1987.
 [36] M. Schmidt. Graphical Model Structure Learning Using L1-Regularization. PhD thesis, 2010.
 [37] M. Shlezinger. Syntactic analysis of two-dimensional visual signals in the presence of noise. Cybernetics and Systems Analysis, 12(4):612–628, 1976.
 [38] D. Sontag, A. Globerson, and T. Jaakkola. Introduction to dual decomposition for inference. Optimization for Machine Learning, 1:219–254, 2011.
 [39] D. Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Uncertainty in Artificial Intelligence (UAI), pages 503–510, 2008.
 [40] V. Vineet, J. Warrell, and P. H. Torr. Filter-based mean-field inference for random fields with higher-order terms and product label-spaces. International Journal of Computer Vision, 110(3):290–307, 2014.
 [41] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. MAP estimation via agreement on trees: message-passing and linear programming. IEEE Transactions on Information Theory, 51(11):3697–3717, 2005.
 [42] H. Wang and D. Koller. A fast and exact energy minimization algorithm for cycle MRFs. In International Conference on Machine Learning (ICML), 2013.