1 Introduction
MaximumAPosteriori (MAP) inference, that is minimizing an energy function over a discrete set of labelings
is a central tool in computer vision and machine learning. Many solvers have been proposed for various special forms of energy
and labeling space , see [11] for an overview of solvers and applications for the prominent special case of Markov Random Fields (MRF). Solvers can roughly be categorized into three categories: (i) Exact solvers that use search techniques (e.g. branchandbound) and possibly rely on solving lower bounds with LPsolvers to speed up search, (ii)primal heuristics
that find a suboptimal solution with an adhoc search strategy adapted to the specific problem and (iii) Lagrangean decomposition (a.k.a. dual decomposition) based algorithms that decompose the original problem into smaller efficiently optimizable subproblems and exchange Lagrangean multipliers between subproblems until consensus between subproblems is achieved.Except when the energy fulfills special assumptions, exact solvers are usually not applicable, since problem sizes in computer vision are too large. On the other hand, primal heuristics can be fast but solution quality need not be good and no information is given to judge it. Moreover, algorithms from the first two paradigms are usually developed adhoc for specific optimization problems and cannot be easily extended to other ones. Lagrangean decomposition based algorithms are a good middle ground, since they optimize a dual lower bound, hence can output a gap that shows the distance to optimum, yet use techniques that scale to large problem sizes. Generalization to new problems is also usually much easier, since subproblems can be readily combined.
A large number of algorithmic techniques have been proposed for optimizing a Lagrangean decomposition for MRFs, including (i) Message passing [16, 41, 37, 6] (a.k.a. block coordinate ascent, belief propagation), (ii) first order proximal splitting methods [26, 31] and (iii) Smoothing based methods [9, 29], (iv) Nesterov schemes [28, 10], (v) mirror descent [22], (vi) subgradient based algorithms [30, 36, 18]. In the case of MAP inference in MRFs, the study [11] has shown that message passing techniques outperform competing Lagrangean decomposition based methods by a large margin. However, there are two main practical shortcomings of message passing algorithms: (i) they need not converge to the optimum of the relaxation corresponding to the Lagrangean decomposition: while welldesigned algorithms monotonically improve a dual lower bound, they may get stuck in suboptimal fixed points. (ii) So called minmarginals must be computable fast for all the subproblems in the given Lagrangean decomposition. While the above problems do not seem to be an issue for most MAP inference tasks in MRFs, for other problems they are. In such cases, alternative techniques must be used. Subgradient based methods can help here, since they do not possess the above shortcomings: They converge to the optimum of the Lagrangean relaxation and only require finding solutions to the subproblems of the decomposition, which is easier than their minmarginals (as needed for (i)), proximal steps (as needed for (ii)) or smoothed solutions (as needed for (iii) and (iv)).
The simplest subgradient based algorithm is subgradient ascent. However, its convergence is typically slow. Bundle methods, which store a series of subgradients to build a local approximation of the function to be optimized, empirically converge faster.
1.1 Contribution & Organization
We propose a multi plane blockcoordinate version of the FrankWolfe method to find minimizing directions in a proximal bundle framework, see Section 2. Our method exploits the structure of the problem’s Lagrangean decomposition and is inspired by [34]. Applications of our approach to MRFs, discrete tomography and graph matching are presented in Section 3. An experimental evaluation on these problems is given in Section 4 and suggests that our method is superior to comparable established methods.
All proofs are given in the Appendix A. A C++implementation of our FrankWolfe method is available at http://pub.ist.ac.at/~vnk/papers/FWMAP.html. The MRF, discrete tomography and graph matching solvers built on top of our method can be obtained at https://github.com/LPMP/LPMP.
1.2 Related work
To our knowledge, the FrankWolfe method has not yet been used in our general setting, i.e. underlying a proximal bundle solver for a general class of structured energy minimization problems. Hence, we subdivide related work into (i) subgradient/bundle methods for energy minimization, (ii) adhoc approaches that use FrankWolfe for specific tasks and (iii) proximal bundle methods.
Subgradient based solvers have first been proposed by [36] for MAPinference in MRFs and were later popularized by [18]. These works rely on a decomposition of MRFs into trees. New decompositions for certain MRFs were introduced and optimized in [25] for maxflow subproblems and in [32] for perfect matching subproblems. The work [43] used a covering of the graph by a tree and optimized additional equality constraints on duplicated nodes via subgradient ascent. Usage of bundle methods that store a series of subgradients to build a local approximation of the objective function was proposed by [12, 32] for MAP inference in MRFs.
The FrankWolfe algorithm was developed in the 50s [4] and was popularized recently by [8]. In [20] a block coordinate version of FrankWolfe was proposed and applied to training structural SVMs. Further improvements were given in [34, 24] where, among other things, caching of planes was proposed for the FrankWolfe method. Several works have applied FrankWolfe to the MAPMRF inference problem: (1) [33] used FrankWolfe to compute an approximated steepestdescent direction in the local polytope relaxation for MRFs. (2) [23] used FrankWolfe to solve a modified problem obtained by adding a strictly convex quadratic function to the original objective (either primal or dual). In contrast to these works, we use FrankWolfe inside a proximal method. Furthermore, the papers above use a fine decomposition into many small subproblems (corresponding to pairwise terms of the energy function), while we decompose the problem into much larger subproblems. In general, our decomposition results in fewer dual variables to optimize over, and each FrankWolfe update can be expected to give a much larger gain. FrankWolfe was also used in [1] for MAP inference in dense MRFs with Potts interactions and Gaussian weights. As we do, they use FrankWolfe to optimize proximal steps for MAPinference. In constrast to our work, they do not optimize Lagrangean multipliers, but the original variables directly. In other words, they work in the primal, while we work in the dual. We remark that our formulation is applicable to more general integer optimization problems than either [33, 23, 1] and it does not seem straightforward to apply these approaches to our more general setting while only requiring access to MAPoracles of subproblems.
Proximal bundle methods were introduced in [13, 21] to accelerate subgradient descent algorithms. They work by locally building an approximation (bundle) to the function to be optimized and use this bundle to find a descent direction. For stability, a quadratic (proximal) term is added [14]. While not theoretically guaranteed, proximal bundle methods are often faster than subgradient methods.
2 Method
Original problem We consider the problem of minimizing a function of Boolean variables represented as a sum of individual terms:
(1) 
Here term is specified by a subset of variables and a function of
variables. Vector
is the restriction of vector to . The arity of function can be arbitrarily large, however we assume the existence of an efficient minoracle that for a given vector computes together with the cost , where is the effective domain of . It will be convenient to denotewhere subsets are defined as follows:
The assumption means that we can efficiently compute a supergradient of concave function at a given .
Since (1) is in general an NPhard problem, our goal will be to solve a certain convex relaxation of (1), which will turn out to be equivalent to the Basic LP relaxation (BLP) of (1) [17]. This relaxation has been widely studied in the literature, especially for the MAPMRF problem (in which case it is usually called the local polytope relaxation [40, 41]). We emphasize, however, that our methodology is different from most previous works: before applying the BLP relaxation, we represent the objective as a function of Boolean indicator variables. This allows expressing complicated combinatorial constraints such as those in multilabel discrete tomography and graph matching problems (see Section 3).
Lagrangean relaxation For a vector let us denote the first components as and the last one as (so that ). We also denote and . The th component of vector will be denoted as . Problem (1) can now be equivalently written as
(2) 
We form the relaxation of (2) by removing the nonconvex constraint :
(3) 
It can be shown that problem (3) is equivalent to the BLP relaxation of (1), see [swoboda:2018arxiv].
We will not directly optimize this relaxation, but its Lagrangean dual [35]. For each equality constraint we introduce Lagrange multipliers . The collection of these multipliers will be denoted as . The dual will be optimized over the set
(4) 
where we denoted
Proposition 1.
Next, we describe how we maximize function .
Proximal term We have a nonsmooth concave maximization problem in , hence algorithms that require a differentiable objective functions will not work. In proximal bundle methods [14] an additional proximal term is added. This results in the new objective
(6) 
for a center point . The proximal quadratic terms act as a trustregion term in the vicinity of and make the function strongly concave, hence smoothing the dual. A successively refined polyhedral approximation [21] is typically used for solving (6). We develop a proximal method that will alternate between minimizing (6) with the help of a multiplane block coordinate FrankWolfe method and updating the proximal center .
2.1 Maximizing : BCFW algorithm
Objectives similar to (6) (without the summation constraint on the
variables) are used for training structural Support Vector Machines (SSVMs). Following
[20, 34, 24], we use a blockcoordinate FrankWolfe algorithm (BCFW) applied to the dual of (6), more specifically its multiplane version MPBCFW [34]. The dual of (6) is formulated below.Proposition 2.
The dual problem to is
(7) 
Define by for . Then the optimal in (7) is and
(8) 
where denotes the derivative w.r.t. variables .
Next, we review and adapt to our setting BCFW and MPBCFW algorithms for minimizing function over
. We will also describe a practical improvement to the implementation, namely compact representation of planes, and discuss the problem of estimating the duality gap.
BCFW [20] The algorithm maintains feasible vectors . At each step BCFW tries to decrease the objective by updating component for a chosen term (while maintaining feasibility). To achieve this, it first linearizes the objective by using the Taylor expansion around the current point:
(9) 
The optimal solution of the linearized objective is computed by calling the th oracle: . The new vector
is obtained as the best interpolation of
and with all other components fixed to , i.e. . The step size is chosen to minimize the objective. The optimal can be easily computed in closed form (see Lemma 1 in [swoboda:2018arxiv]). One pass of BCFW is summarized in Algorithm 1. To avoid the expensive recomputation of the sum in Prop. 2 needed for computing the gradient and the step size, it is held and updated explicitly in Algorithm 1.MPBCFW [34] In this paper we use the multiplane version of BCFW. This method caches planes returned by minoracles for terms . Let be the set of planes currently stored in memory for the th subproblem. It defines an approximation of term . Note that for any .
MPBCFW uses exact passes (that call the “exact” oracle for in line 3 of Algorithm 1) and approximate passes (that call the “approximate” oracle for ). One MPBCFW iteration consists of one exact pass followed by several approximate passes. The number of approximate passes is determined automatically by monitoring how fast the objective decreases per unit of time. Namely, the method computes the ratio where is the vector at the beginning of the MPBCFW iteration, is the current vector and is the time passed since the beginning of the MPBCFW iteration. If this ratio drops after an approximate pass then the iteration terminates, and a new MPBCFW iteration is started with an exact pass. The number of approximate passes will thus depend on the relative speed of exact and approximate oracles.
Note that the time for an approximate oracle call is proportional to the size of the working set . The method in [34] uses a common strategy for controlling this size: planes that are not used during the last iterations are removed from . We use , which is the default parameter in [34].
Compact representation of planes Recall that planes in the set have the form for some . A naive approach (used, in particular, in previous works [34, 24]) is to store them explicitly as vectors of size . We observe that in some applications a special structure of allows storing and manipulating these planes more efficiently. For example, in MAPMRF inference problems a variable with possible values can be represented by a single integer, rather than indicator variables.
In our implementation we assume that each vector can be represented by an object in a possibly more compact space . To specify term , the user must provide an array that determines set in a natural way, specify the size of objects , and implement the following functions:

A bijection .

Minoracle that for a given computes and returns its compact representation (i.e. ) together with the cost .

A function that computes inner product for a given vector and object .
Note, calling the approximate oracle in line 3 involves calling the function in (F3) times; it typically takes time where is the length of the array for storing .
Remark 1.
The efficient plane storage mechanism gives roughly a speedup of a single MPBCFW pass on the proteinfolding MRF dataset (see Sections 3 and 4). Moreover, it typically results in a slightly better objective value obtained after each MPBCFW pass, since more approximate passes can be done before an exact pass is called (since approximate passes are accelerated by the compact plane storage, their objective decrease per unit of time is higher, hence they are called more often).
2.2 Algorithm’s summary
Recall that our goal is to maximize function over ; this gives a lower bound on the original discrete optimization problem . We now summarize our algorithm for solving , and describe our choices of parameters. To initialize, we set , and for each . Then we start the main algorithm. After every 10 iterations of MPBCFW we update (keeping vectors and sets unchanged), where is the vector with the largest value of objective seen so far. Since evaluating is an expensive operation, we do it for the current vector only after every 5 iterations of MPBCFW.
Remark 2 (Convergence).
If we evaluated the inner iterations in MPBCFW exactly, our method would amount to the proximal point algorithm which is convergent [27]. Even with nonexact evaluation, convergence can be proved when the error in the evaluation of the proximal is shrinking fast enough towards zero. However, we have use a more aggressive scheme that updates the proximal point every 10 iterations and does not bound the inexactness of the proximal step evaluation. Experimentally, we have found that it gives good results w.r.t. the objective of the overall problem (5).
2.3 Estimating duality gap
To get a good stopping criterion, it is desirable to have a bound on the gap between the current and optimal objectives. This could be easily done if we had feasible primal and dual solutions. Unfortunately, in our case vector is not a feasible solution of problem (3), since it does not satisfy equality constraints . ^{1}^{1}1 We say that vector is a feasible (optimal) solution of (3) if there exists a vector so that is a feasible (optimal) solution of (3). Clearly, can be easily computed from feasible , so omit it for brevity. To get a handle on the duality gap, we propose to use the following quantities:
(10) 
Proposition 3.
Note, if we knew that an optimal solution belongs to some bounded region then we could use (11) to obtain a bound on the duality gap. Such region can be obtained for some applications, but we did not pursue this direction.
3 Applications
In this section we give a detailed description of the three applications used in the evaluation: Markov Random Fields (MRFs), discrete tomography and the graph matching problem. The latter two are both extensions of the MAPinference problems for MRFs. Those three problems are reviewed below.
3.1 Markov Random Fields
An MRF consists of a graph and a discrete set of labels for each . The goal in MaximumAPosteriori (MAP) inference is to find labeling that is minimal with respect to the potentials:
(12) 
This problem is NPhard for general graphs , but can be solved efficiently for trees. Hence, we choose a covering by trees, as done in [18]. Additionally, we seek a small number of trees, such that the number of Lagrangean variables stays small and optimization becomes faster.
Arboricity A tree covering of a graph is called a minimal tree cover, if there is no covering consisting of fewer trees. The associated number of trees is called the graph’s arboricity. We compute the graph’s arboricity together with a minimal tree cover efficiently with the method [5]
Boolean encoding To phrase the problem as an instance of (1), we encode labelings via indicator variables for while adding constraints (i.e. assigning infinite cost to configurations that do not satisfy this constraint).
A tree cover and a Boolean encoding are also used for the two problems below; we will not explicitly comment on this anymore.
3.2 Discrete tomography
The discrete tomography problem is to reconstruct an image from a set of linear (tomographic) projections taken at different angles, where image intensity values are restricted to be from a discrete set of intensities. See Figure 2 for an illustration. The problem is illposed, since the number of linear projections is smaller than the number of pixels to reconstruct. Hence, we use a regularizer to penalize nonregular image structures. Formally, we have an MRF , where the node set corresponds to the pixel grid and the edges connect nodes which correspond to neighboring pixels. The label space is for some . Additionally, the labeling must satisfy linear projections with . Usually, no local information is available, hence the unary potentials are zero: for . The problem reads
(13) 
where . A typical choice for the pairwise potentials is the truncated norm . Each row of the projections forms another subproblem. The th row of hence is of the form . Efficient solvers for this problem were recently considered in [19]. We follow their recursive decomposition approach for the solution of the projection subproblems. Details are given below.
Discrete tomography subproblems
We use a simplified version of the recursive decomposition approach of [19] for efficiently solving the summation constraints of the discrete tomography problem. Below we give the corresponding details. Let the th row of be of the form and recall that . Each such tomographic projection will correspond to a single subproblem. Taking care of Lagrangean multipliers , we can rename variables and rewrite the problem as
(14) 
We will follow a recursive decomposition approach. To this end, we introduce helper summation variables .
Variable partitions
We partition the set into and . We recursively partition and analoguously until reaching single variables. This results in a tree with as root.
Constraints on summation variables
Whenever we have partitions , we add the constraint .
Solving (14)
We will propose a dynamic programming approach to solving (14). First, for each value of summation variable we store a value . We compute recursively from leaves to the root. For partitions we compute
(15) 
After having computed , we set . Subsequently, we make a pass from root to leaves to compute the optimal label sum for each variable as follows:
(16) 
Fast dynamic programming
3.3 Graph matching
The graph matching problem consists of finding a MAPsolution in an MRF where the labels for each node come from a common universe . The additional matching constraint requires that two nodes cannot take the same label: . Hence, any feasible solution defines an injective mapping into the set of labels. For an illustration see Figure 2. The problem is
(17) 
We use a minimum cost flow solver for handling the matching constraint, see e.g. [39, 44, 38] for an explanation of the minimum cost flow solver construction.
4 Experiments
Dataset  # I  FWMAP  CB  SA  MP  
MRF  
protein folding  11  3340  528780  12917.44  12970.61  12960.30  13043.67 
Discrete tomography  
synthetic 2 proj.  9  1024  1984  266.12  265.89  239.39  
synthetic 4 proj.  9  1024  1984  337.88  336.33  316.61  
synthetic 6 proj.  9  1024  1984  424.36  417.76  391.09  
sheep logan  3  4096  8064  897.18  847.87  701.93  
sheep logan  3  65536  130560  4580.06  4359.24  370.63  
Graph matching  
6d scene flow  6  48126  11485352  2864.2  2865.61  2867.60  2877.08 
We have chosen a selection of challenging MRF, discrete tomography and graph matching problems where message passing methods struggle or are not applicable. Detailed descriptions of these problems can be found in Appendix A.
Remark 3.
Note that there are a large number of MRF and graph matching problems where message passing is the method of choice due to its greater speed and sufficient solution quality, see [11, 38]. In such cases there is no advantage in using subgradient based solvers, which tend to be slower. However, our chosen evaluation problems contain some of the most challenging MRF and graph matching problems with corresponding Lagrangean decompositions not solved satisfactorily with message passing solvers.
We have excluded primal heuristics that do not solve a relaxation corresponding to our Lagrangean decomposition at all from comparison. First, they do not deliver any lower bounds, hence cannot be directly compared. Second, we see them as complementary solvers, since, when they are applied on the solved dual problem, their solution quality typically improves. We have run every instance for 10 minutes on a Intel i55200U CPU. Perdataset plots showing averaged lower bound over time can be seen in Figure 3. Dataset statistics and final lower bounds averaged over datasets can be seen in Table 1. Detailed results for each instance in each dataset can be found in in Appendix A.
Solvers We focus our evaluation on methods that are able to handle general MAP inference problems in which the access to subproblems is given by minoracles.

[leftmargin=1.5cm,noitemsep]

Our solver as described in Section 2.

The stateoftheart bundle method ConicBundle [7], which does not treat individual subproblems individually, but performs ascent on all Lagrangean multipliers simultaneously.

Subgradient ascent with a Polyak step size rule. This solver was also used in [12] for MRFs.
Additionally, we tested stateoftheart versions of message passing (MP) solvers, when applicable. MP is a popular method for MAPMRF problems, and has recently been applied to the graph matching problem [38].
All solvers optimize the same underlying linear programming relaxation. Additionally, all subgradient based solvers also use the same decomposition. This ensures that we compare the relevant solver methodology, not differences between relaxations or decompositions.
Choice of proximal weight from (6) The performance of our algorithm depends on the choice of the proximal weight parameter from (6). A too large value will make each proximal step take long, while a too small value will mean too many proximal steps until convergence. This behaviour can be seen in Figure 5, where we have plotted lower bound against time for an MRF problem and FWMAP with different choices of proximal weight . We see that there is an optimal value of , with larger and smaller values having inferior performance. However, we can also observe that performance of FWMAP is good for values an order of magnitude larger or smaller, hence FWMAP is not too sensitive on . It is enough to choose roughly the right order of magnitude for this parameter.
We have observed that the more subproblems there are in a problem decomposition (1), the smaller the proximal weight should be. Since more subproblems usually translate to more complex dependencies in the decomposition, a smaller value of will be beneficial, as it makes the resulting more complicated proximal steps better conditioned. A good formula for will hence be decreasing for increasing numbers of subproblems . We have taken three instances out of the 50 we evaluated on and roughly fitted a curve that takes suitable values of proximal weight for these three instances, resulting in
(18) 
Duality gap We have plotted the duality gap quantities and from (10) for the synthetic 6 proj. dataset from discrete tomography in Figure 5.
4.1 Markov Random Fields.
4.2 Discrete tomography.
In [19] a dual decomposition based solver was proposed for the multilabel discrete tomography problem. The decomposition was optimized with ConicBundle [7]. For our decomposition, message passing solvers are unfortunately not applicable. The main problem seems that due to the unary potentials being zero, minmarginals for all subproblems are also zero. Hence any minmarginal based step used in message passing will result in no progress. In other words, the initially zero Lagrangean multipliers are a local fixpoint for message passing algorithms. Therefore, we only compare against SA and CB.
Datasets
We compare on the synthetically generated text images from
[19], denoted by synthetic. These are images of random objects with , and projections directions. We also compare on the classic sheep logan image with resolution and and , and projections.4.3 Graph matching.
As shown in [39, 44], Lagrangean decomposition based solvers are superior to solvers based on primal heuristics also in terms of the quality of obtained primal solutions. In particular, [38] has proposed a message passing algorithm that is typically the method of choice and is on par/outperforms other message passing and subgradient based techniques on most problems. We denote it by MP.
4.4 Discussion
Our solver FWMAP achieved the highest lower bound on each instance. It was substantially better on the hardest and largest problems, e.g. on sheep logan. While message passing solvers were faster (whenever applicable) in the beginning stages of the optimization, our solver FWMAP was fastest among subgradient based ones and eventually achieved a higher lower bound than the message passing one. We also would like to mention that our solver had a much lower memory usage than the competing bundle solver CB. On the larger problem instances, CB would often use all available memory on our 8 GB machine.
References

[1]
T. Ajanthan, A. Desmaison, R. Bunel, M. Salzmann, P.H.S. Torr, and M. P. Kumar.
Efficient linear programming for dense CRFs.
In
IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, 2017.  [2] H. A. Alhaija, A. Sellent, D. Kondermann, and C. Rother. Graphflow  6d large displacement scene flow via graph matching. In GCPR, 2015.
 [3] M. Bussieck, H. Hassler, G. J. Woeginger, and U. T. Zimmermann. Fast algorithms for the maximum convolution problem. Oper. Res. Let., 15(3):133–141, 1994.
 [4] M. Frank and P. Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3:149–154, 1956.
 [5] H. N. Gabow and H. H. Westermann. Forests, frames, and games: Algorithms for matroid sums and applications. Algorithmica, 7(1):465, Jun 1992.
 [6] A. Globerson and T. S. Jaakkola. Fixing maxproduct: Convergent message passing algorithms for MAP LPrelaxations. In Conference on Neural Information Processing Systems (NIPS), 2007.
 [7] C. Helmberg. The conicbundle library for convex optimization v0.3.11., 2011. https://wwwuser.tuchemnitz.de/~helmberg/ConicBundle/.
 [8] M. Jaggi. Revisiting FrankWolfe: Projectionfree sparse convex optimization. In International Conference on Machine Learning (ICML), pages 427–435, 2013.
 [9] J. Johnson, D. M. Malioutov, and A. S. Willsky. Lagrangian relaxation for map estimation in graphical models. In 45th Annual Allerton Conference on Communication, Control and Computing, 2007.
 [10] V. Jojic, S. Gould, and D. Koller. Accelerated dual decomposition for MAP inference. In International Conference on Machine Learning (ICML), 2010.
 [11] J. H. Kappes, B. Andres, F. A. Hamprecht, C. Schnörr, S. Nowozin, D. Batra, S. Kim, B. X. Kausler, T. Kröger, J. Lellmann, N. Komodakis, B. Savchynskyy, and C. Rother. A comparative study of modern inference techniques for structured discrete energy minimization problems. International Journal of Computer Vision, 115(2):155–184, 2015.
 [12] J. H. Kappes, B. Savchynskyy, and C. Schnörr. A bundle approach to efficient MAPinference by Lagrangian relaxation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 1688–1695. IEEE, 2012.
 [13] K. C. Kiwiel. Methods of Descent for Nondifferentiable Optimization. Lecture Notes in Computer Science. SpringerVerlag, 1985.
 [14] K. C. Kiwiel. Proximity control in bundle methods for convex nondifferentiable minimization. Mathematical Programming, 46(1):105–122, Jan 1990.
 [15] V. Kolmogorov. Convergent treereweighted message passing for energy minimization. IEEE Trans. Pattern Anal. Mach. Intell., 28(10):1568–1583, 2006.
 [16] V. Kolmogorov. A new look at reweighted message passing. IEEE Trans. Pattern Anal. Mach. Intell., 37(5):919–930, 2015.
 [17] V. Kolmogorov, J. Thapper, and S. Živný. The power of linear programming for generalvalued csps. SIAM Journal on Computing, 44(1):1–36, 2015.
 [18] N. Komodakis, N. Paragios, and G. Tziritas. MRF energy minimization and beyond via dual decomposition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):531–552, March 2011.
 [19] J. Kuske, P. Swoboda, and S. Petra. A novel convex relaxation for nonbinary discrete tomography. In Scale Space and Variational Methods in Computer Vision  6th International Conference, SSVM 2017, 2017, Proceedings, pages 235–246, 2017.
 [20] S. LacosteJulien, M. Jaggi, M. Schmidt, and P. Pletscher. Blockcoordinate FrankWolfe optimization for structural SVMs. In International Conference on Machine Learning (ICML), 2013.
 [21] C. Lemarechal. Constructing bundle methods for convex optimization. NorthHolland Mathematics Studies, 129:201–240, 1986.
 [22] D. V. N. Luong, P. Parpas, D. Rueckert, and B. Rustem. Solving MRF minimization by mirror descent. In Advances in Visual Computing  8th International Symposium, ISVC 2012, Rethymnon, Crete, Greece, July 1618, 2012, Revised Selected Papers, Part I, pages 587–598, 2012.
 [23] Ofer Meshi, Mehrdad Mahdavi, and Alex Schwing. Smooth and strong: MAP inference with linear convergence. In Conference on Neural Information Processing Systems (NIPS), 2015.
 [24] A. Osokin, J.B. Alayrac, I. Lukasewitz, P. K. Dokania, and S. LacosteJulien. Minding the gaps for block FrankWolfe optimization of structured SVMs. In International Conference on Machine Learning (ICML), 2016.
 [25] A. Osokin and D. P. Vetrov. Submodular relaxation for inference in markov random fields. IEEE Transactions on Pattern Analysis & Machine Intelligence, 37(7):1347–1359, July 2015.
 [26] P. Ravikumar, A. Agarwal, and M. J. Wainwright. Messagepassing for graphstructured linear programs: Proximal methods and rounding schemes. Journal of Machine Learning Research, 11:1043–1080, 2010.
 [27] R Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM journal on control and optimization, 14(5):877–898, 1976.
 [28] B. Savchynskyy, J. Kappes, S. Schmidt, and C. Schnörr. A study of nesterov’s scheme for lagrangian decomposition and map labeling. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1817–1823. IEEE, 2011.

[29]
B. Savchynskyy, S. Schmidt, J. H. Kappes, and C. Schnörr.
Efficient MRF energy minimization via adaptive diminishing
smoothing.
In
Uncertainty in Artificial Intelligence (UAI)
, 2012.  [30] M. I. Schlesinger and V. V. Giginyak. Solution to structural recognition (MAX,+)problems by their equivalent transformations. (2):3–18, 2007.
 [31] S. Schmidt, B. Savchynskyy, J. H. Kappes, and C. Schnörr. Evaluation of a firstorder primaldual algorithm for mrf energy minimization. In International Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition, pages 89–103. Springer, 2011.
 [32] N. N. Schraudolph. Polynomialtime exact inference in NPhard binary MRFs via reweighted perfect matching. In 13 Intl. Conf. Artificial Intelligence and Statistics (AIstats), volume 9 of Workshop and Conference Proceedings, pages 717–724. Journal of Machine Learning Research (JMLR), 2010.
 [33] Alexander Schwing, Tamir Hazan, Marc Pollefeys, and Raquel Urtasun. Globally convergent parallel MAP LP relaxation solver using the FrankWolfe algorithm. In International Conference on Machine Learning (ICML), 2014.
 [34] N. Shah, V. Kolmogorov, and C. H. Lampert. A multiplane blockcoordinate FrankWolfe algorithm for training structural SVMs with a costly maxoracle. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
 [35] David Sontag, Amir Globerson, and Tommi Jaakkola. Introduction to dual decomposition for inference. In Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, editors, Optimization for Machine Learning, pages 219–254. MIT Press, 2012.
 [36] G. Storvik and G. Dahl. Lagrangianbased methods for finding map. IEEE Trans. on Image Processing, 9(3):469–479, march 2000.
 [37] P. Swoboda, J. Kuske, and B. Savchynskyy. A dual ascent framework for Lagrangean decomposition of combinatorial problems. In CVPR, 2017.
 [38] P. Swoboda, C. Rother, H. Abu Alhaija, D. Kainmueller, and B. Savchynskyy. Study of Lagrangean decomposition and dual ascent solvers for graph matching. In CVPR, 2017.
 [39] L. Torresani, V. Kolmogorov, and C. Rother. A dual decomposition approach to feature correspondence. IEEE Trans. Pattern Anal. Mach. Intell., 35(2):259–271, 2013.
 [40] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
 [41] T. Werner. A linear programming approach to maxsum problem: A review. IEEE Trans. Pattern Analysis and Machine Intelligence, 29(7):1165–1179, 2007.
 [42] Chen Yanover, Ora SchuelerFurman, and Yair Weiss. Minimizing and learning energy functions for sidechain prediction. In Annual International Conference on Research in Computational Molecular Biology, pages 381–395. Springer, 2007.
 [43] Julian Yarkony, Charless Fowlkes, and Alexander Ihler. Covering trees and lowerbounds on quadratic assignment. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 887–894. IEEE, 2010.
 [44] Z. Zhang, Q. Shi, J. McAuley, W. Wei, Y. Zhang, and A. van den Hengel. Pairwise matching through maxweight bipartite belief propagation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
Appendix A Proofs
The BLP relaxation [17]
introduces a probability distribution
over for each and a probability distribution over for each . It can be written as follows:(19) 
Let us show that the optimal values of (19) and (3) coincide.
Proof of equivalence of (19) and (3).
Define an extension of function as follows: for a vector set
(20) 
Note, if then (20) does not have a feasible solution, and so . Observe that the constraints in the last line of (19) for are redundant  they follow from the remaining constraints. Also observe that constraints for can be written as if we denote for . Therefore, problem (19) can be equivalently rewritten as follows:
(21) 
It can be seen that the last problem is equivalent to (3). Indeed, we just need to observe that for each and we have
∎
Proof of Proposition 1.
Write , then problem (3) can be written as
(22) 
The Lagrangian w.r.t. the equality constraints is given by
Therefore, the dual function for is
The problem can thus be formulated as , or equivalently as . This coincides with formulation given in Proposition 1.
Since constraint can be expressed as a linear program, the duality between (3) and (5) can be viewed as a special case of linear programming (LP) duality (where the value of function is also written as a resulting of some LP). For LPs it is known that strong duality holds assuming that either the primal or the dual problems have a feasible solution. This holds in our case, since vector is feasible. We can conclude that we have a strong duality between (3) and (5). ∎
Proof of Proposition 2.
First, we derive the dual of :
The function has a closed form expression, since it is a quadratic function subject to linear equalities. Write for . The in the expression defining are
(23) 
The function value is
The gradient is . ∎
Proof of Proposition 3.
Let be the set of vectors satisfying the equality constraints for all . By construction, for any we have
(24a)  
(24b)  
(24c) 
Eq. (24c) gives that for any and , and so from (24b) we get that . Clearly, we have . The following two facts imply part (b) of Proposition 3:

Consider vector . Then if and only if for some . (This can be seen from the definition of in Section 2.3).

Consider vectors and . They are an optimal primaldual pair if and only if , which in turn holds if and only if (since ).
It remains to show inequality (11). Denote , then for any . Denoting and , we get
Summing these inequalities over gives
Recalling that , we obtain the desired claim:
Comments
There are no comments yet.