Linear Programming (LP) relaxations have been used to approximate maximum a posteriori (MAP) inference in Probabilistic Graphical Models (PGMs) (Koller and Friedman, 2009) by enforcing local consistency over edges or clusters. An attractive property of this approach is that it is guaranteed to find the optimal MAP solution when the LP solution is integral. This is particularly significant because Kumar et al (2009) showed that the LP relaxation provides a better approximation than the Quadratic Programming and Second Order Cone Programming relaxations. Despite their success, there remain a variety of large-scale problems that off-the-shelf LP solvers cannot solve (Yanover et al, 2006). Moreover, it has been shown (Yanover et al, 2006; Sontag et al, 2008) that LP relaxations can have a large gap between the dual objective and the decoded primal objective, and fail to find the optimal MAP solution in many real-world problems.
In response to this shortcoming, a number of dual message passing methods have been proposed, including Dual Decomposition (Komodakis et al, 2007; Sontag et al, 2011, 2012) and Generalised Max Product Linear Programming (GMPLP) (Globerson and Jaakkola, 2007). These methods can still be computationally expensive when there are a large number of constraints in the LP relaxations. It is desirable to reduce the number of constraints in order to reduce computational complexity without sacrificing the quality of the solution. However, this is non-trivial, because for a MAP inference problem the dimension of the primal variable can differ between LP relaxations. This also presents a barrier to effectively comparing the quality of two LP relaxations and their corresponding message passing methods. Furthermore, these message passing methods may get stuck at non-optimal solutions due to their non-smooth dual objectives (Schwing et al, 2012; Hazan and Shashua, 2010; Meshi et al, 2012).
Our contributions are: 1) we propose a unified form for MAP LP relaxations, under which existing MAP LP relaxations can be rewritten as constrained optimisation problems with variables of the same dimension and the same objective; 2) we present a new tool, which we call the Marginal Polytope Diagram, to effectively compare different MAP LP relaxations. We show that any MAP LP relaxation in the above unified form has a Marginal Polytope Diagram, and vice versa. We establish propositions to conveniently show the equivalence of seemingly different Marginal Polytope Diagrams; 3) using Marginal Polytope Diagrams, we show how to safely reduce the number of constraints (and consequently the number of messages) without sacrificing the quality of the solution, and propose three new message passing algorithms in the dual; 4) we show how to perform message passing in the dual without computing and storing messages (by updating the beliefs directly); 5) we propose a new cluster pursuit strategy.
2 MAP Inference and LP Relaxations
We consider MAP inference over factor graphs with discrete states. For generality, we will use higher order potentials (where possible) throughout the paper.
2.1 MAP inference
Assume that there are variables , each taking discrete states Vals. Let denote the node set, and let be a collection of subsets of . has an associated group of potentials , where . Given a graph and potentials , we consider the following exponential family distribution (Wainwright and Jordan, 2008):
where , and is known as a normaliser, or partition function. The goal of MAP inference is to find the MAP assignment, , that maximises . That is
Here we slightly generalise the notation of to , and where are subsets of reserved for later use.
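To make the MAP objective concrete, the following minimal sketch enumerates every joint assignment of a tiny factor graph and returns the one with the highest total potential. The model, the dictionary layout and the name `map_brute_force` are hypothetical choices made for this illustration; exhaustive enumeration is only feasible for very small problems, which is what motivates the relaxations discussed next.

```python
from itertools import product

def map_brute_force(num_states, potentials):
    """Return (best_assignment, best_score) by exhaustive enumeration.

    num_states: number of discrete states of each variable.
    potentials: dict mapping a cluster (tuple of variable indices) to a
                function of the states of the variables in that cluster.
    """
    best_x, best_score = None, float("-inf")
    for x in product(*(range(k) for k in num_states)):
        # Sum of all cluster potentials evaluated at the joint assignment x.
        score = sum(theta(*(x[i] for i in c)) for c, theta in potentials.items())
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score

# Hypothetical model: three binary variables, one node and two edge potentials.
potentials = {
    (0,): lambda a: 1.0 if a == 1 else 0.0,
    (0, 1): lambda a, b: 2.0 if a == b else 0.0,
    (1, 2): lambda b, c: 1.5 if b != c else 0.0,
}
x_star, value = map_brute_force([2, 2, 2], potentials)
print(x_star, value)
```

The loop visits a number of assignments that grows exponentially with the number of variables.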
2.2 Linear Programming Relaxations
The MAP inference problem can be written as an equivalent Linear Programming (LP) problem as follows
in which the feasible set, , is known as the marginal polytope (Wainwright and Jordan, 2008), defined as follows
Here the first two groups of constraints specify that is a distribution over , and we refer to the last group of constraints as the global marginalisation constraint, which guarantees that for arbitrary in , all can be obtained by marginalisation from a common distribution over . In general, exponentially many inequality constraints () are required to define a marginal polytope, which makes the LP hard to solve. Thus eq:LPProb is often relaxed with a local marginal polytope to obtain the following LP relaxation
Compared to the marginal polytope, for arbitrary in a local marginal polytope, the may not all be marginal distributions of a common distribution over , but there are far fewer constraints in a local marginal polytope. As a result, the LP relaxation can be solved more efficiently. This is of particular practical significance because state-of-the-art interior point or simplex LP solvers can only handle problems with up to a few hundred thousand variables and constraints, while many real-world datasets demand far more variables and constraints (Yanover et al, 2006; Kumar et al, 2009).
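The gap between the two polytopes can be seen on a small example. The sketch below, which assumes a pairwise binary model on a triangle whose edge potentials reward disagreement, checks that a fractional set of pseudo-marginals satisfies node-edge local consistency and yet attains a higher objective than any integer assignment. The model and all names are assumptions made for illustration only.

```python
from itertools import product

edges = [(0, 1), (1, 2), (0, 2)]

def theta(a, b):
    """Edge potential that rewards disagreement between the two endpoints."""
    return 1.0 if a != b else 0.0

# Best value over integer assignments (the true MAP value).
map_value = max(
    sum(theta(x[i], x[j]) for i, j in edges)
    for x in product(range(2), repeat=3)
)

# Fractional pseudo-marginals: every edge puts mass 0.5 on each disagreeing pair.
mu_node = {i: [0.5, 0.5] for i in range(3)}
mu_edge = {e: {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.0} for e in edges}

def locally_consistent(mu_node, mu_edge, tol=1e-12):
    """Check non-negativity, normalisation and edge-to-node marginalisation."""
    for (i, j), m in mu_edge.items():
        if any(v < -tol for v in m.values()):
            return False
        if abs(sum(m.values()) - 1.0) > tol:
            return False
        for a in range(2):
            if abs(sum(m[(a, b)] for b in range(2)) - mu_node[i][a]) > tol:
                return False
            if abs(sum(m[(b, a)] for b in range(2)) - mu_node[j][a]) > tol:
                return False
    return True

# LP objective at the fractional point.
lp_value = sum(m[(a, b)] * theta(a, b) for m in mu_edge.values() for (a, b) in m)
print(locally_consistent(mu_node, mu_edge), map_value, lp_value)
```

No joint distribution over the three variables has these edge marginals, since three binary variables cannot pairwise disagree, so the point lies in the local marginal polytope but outside the marginal polytope.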
Several message passing-based approximate algorithms (Globerson and Jaakkola, 2007; Sontag et al, 2008, 2012) have been proposed to solve large-scale LP relaxations. Each of them applies coordinate descent to the dual objective of an LP relaxation problem with a particular local marginal polytope. Different local marginal polytopes use different local marginalisation constraints, which lead to different dual problems and hence different message updating schemes.
2.3 Generalised Max Product Linear Programming
Globerson and Jaakkola (2007) showed that LP relaxations can also be solved by message passing, known as Max Product LP (MPLP) when only node and edge potentials are considered, or Generalised MPLP (GMPLP) (see Section 6 of Globerson and Jaakkola (2007)) when potentials over clusters are considered.
In GMPLP, they define , and
Then they consider the following LP relaxation
where the local marginal polytope is defined as
with . To derive a desirable dual formulation, they replace the third group of constraints with the following equivalent constraints
where is known as the copy variable. Let be the dual variable associated with the first group of the new constraints above. Standard Lagrangian duality yields the following dual problem:
Let . They use a coordinate descent method to minimise the dual by picking a particular and updating all as follows:
where . At each iteration the dual objective never increases and, since it is bounded below by the MAP value, convergence of the dual objective is guaranteed. Under certain conditions GMPLP finds the exact solution. Sontag et al (2008) extended this idea by iteratively adding clusters and reported faster convergence empirically.
2.4 Dual Decomposition
Dual Decomposition (Komodakis et al, 2007; Sontag et al, 2011) explicitly splits node potentials (those potentials of order 1) from cluster potentials with order greater than 1, and rewrites the MAP objective eq:map-cluster as
where . By defining , they consider the following LP relaxation:
with a different local marginal polytope defined as
Let be the Lagrangian multipliers corresponding to each for each . One can show that the standard Lagrangian dual is
Subgradient or coordinate descent can be used to minimise the dual objective. Since the Dual Decomposition using coordinate descent is closely related to GMPLP and the unified form which we will present, we give the update rule derived by coordinate descent below,
where is a particular cluster from , and .
Compared to GMPLP, the local marginal polytope in the Dual Decomposition has far fewer constraints. In general for an arbitrary graph , is looser than () .
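The coordinate descent scheme can be sketched as follows. Since the update rule is not reproduced in this text, the code uses the standard MPLP-style block update from the dual decomposition literature, which is an assumption about the intended rule: for a cluster c, each message to a node i in c is set to minus the node's belief excluding c, plus 1/|c| times the max-marginal of the cluster potential combined with those excluded beliefs. All names and the toy model are hypothetical.

```python
from itertools import product

def cluster_states(num_states, c):
    """All joint states of the variables in cluster c."""
    return product(*(range(num_states[i]) for i in c))

def dual_value(num_states, theta_node, theta_cluster, delta):
    """Dual objective: node terms plus cluster terms, each maximised locally."""
    total = 0.0
    for i, k in enumerate(num_states):
        total += max(
            theta_node[i][a] + sum(delta[(c, i)][a] for c in theta_cluster if i in c)
            for a in range(k))
    for c, theta in theta_cluster.items():
        total += max(
            theta[xc] - sum(delta[(c, i)][a] for i, a in zip(c, xc))
            for xc in cluster_states(num_states, c))
    return total

def update_cluster(num_states, theta_node, theta_cluster, delta, c):
    """Block coordinate descent step on all messages leaving cluster c."""
    # Node beliefs with the contribution of cluster c excluded.
    excl = {i: [theta_node[i][a]
                + sum(delta[(d, i)][a] for d in theta_cluster if d != c and i in d)
                for a in range(num_states[i])] for i in c}
    # Max-marginals of the cluster potential plus the excluded beliefs.
    best = {i: [float("-inf")] * num_states[i] for i in c}
    for xc in cluster_states(num_states, c):
        score = theta_cluster[c][xc] + sum(excl[i][a] for i, a in zip(c, xc))
        for i, a in zip(c, xc):
            best[i][a] = max(best[i][a], score)
    for i in c:
        delta[(c, i)] = [best[i][a] / len(c) - excl[i][a]
                         for a in range(num_states[i])]

# Hypothetical model: binary triangle, disagreement rewarded, small node biases.
num_states = [2, 2, 2]
theta_node = [[0.0, 0.3], [0.0, 0.0], [0.2, 0.0]]
edges = [(0, 1), (1, 2), (0, 2)]
theta_cluster = {e: {(a, b): 1.0 if a != b else 0.0
                     for a in range(2) for b in range(2)} for e in edges}
delta = {(c, i): [0.0] * num_states[i] for c in theta_cluster for i in c}

dual_trace = [dual_value(num_states, theta_node, theta_cluster, delta)]
for _ in range(20):
    for c in theta_cluster:
        update_cluster(num_states, theta_node, theta_cluster, delta, c)
        dual_trace.append(dual_value(num_states, theta_node, theta_cluster, delta))

# Brute-force MAP value, used only to check that the dual is an upper bound.
map_value = max(
    sum(theta_node[i][x[i]] for i in range(3))
    + sum(theta_cluster[c][tuple(x[i] for i in c)] for c in theta_cluster)
    for x in product(range(2), repeat=3))
print(dual_trace[0], dual_trace[-1], map_value)
```

On this frustrated triangle the dual bound stays strictly above the MAP value, which is the kind of gap that tighter local marginal polytopes aim to close.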
2.5 Dual Decomposition with cycle inequalities
Recently, Sontag et al (2012) proposed a Dual Decomposition with cycle inequalities, considering the following LP relaxation
with a local marginal polytope ,
They added cycle inequalities to tighten the problem. Reducing the primal feasible set may reduce the maximum primal objective, which reduces the minimum dual objective. They showed that finding the “tightest” cycles, which maximise the decrease in the dual objective, is NP-hard. Thus, instead, they looked for the most “frustrated” cycles, which correspond to the cycles with the smallest LHS of their cycle inequalities. Searching for “frustrated” cycles, adding the cycles’ inequalities and updating the dual is repeated until the algorithm converges.
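For binary variables, one common form of the cycle inequality states that, for a cycle C and any subset F of its edges with |F| odd, the total disagreement mass on the edges in C \ F plus the total agreement mass on the edges in F is at least 1. The sketch below assumes this binary form (the inequality used in the paper is not reproduced in this text) and searches the odd subsets of a single cycle for the smallest left-hand side, i.e. the most "frustrated" choice. All names are hypothetical.

```python
from itertools import combinations

def agree(m):
    """Mass an edge pseudo-marginal puts on equal states."""
    return m[(0, 0)] + m[(1, 1)]

def disagree(m):
    """Mass an edge pseudo-marginal puts on different states."""
    return m[(0, 1)] + m[(1, 0)]

def most_violated(cycle_edges, mu_edge):
    """Return (lhs, F) with the smallest LHS over odd-sized edge subsets F."""
    best = None
    for r in range(1, len(cycle_edges) + 1, 2):  # odd subset sizes only
        for F in combinations(cycle_edges, r):
            lhs = sum(agree(mu_edge[e]) if e in F else disagree(mu_edge[e])
                      for e in cycle_edges)
            if best is None or lhs < best[0]:
                best = (lhs, F)
    return best

def indicator_marginals(x, edges):
    """Edge marginals of the point mass on the integer assignment x."""
    return {(i, j): {(a, b): 1.0 if (x[i], x[j]) == (a, b) else 0.0
                     for a in range(2) for b in range(2)} for i, j in edges}

cycle = [(0, 1), (1, 2), (0, 2)]
# Locally consistent but fractional point: every edge fully "disagrees".
mu_frac = {e: {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.0} for e in cycle}
lhs_frac, F_frac = most_violated(cycle, mu_frac)

# Any integer assignment satisfies every cycle inequality.
lhs_int, _ = most_violated(cycle, indicator_marginals((0, 1, 0), cycle))
print(lhs_frac, F_frac, lhs_int)
```

A left-hand side below 1 identifies a violated inequality, so adding it cuts off the fractional point while keeping all integer assignments feasible.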
3 A Unified View of MAP LP Relaxations
In different LP relaxations, not only the formulation of the objective but also the dimension of the primal variable may vary, which makes comparison difficult. By way of illustration, note that the primal variable in GMPLP is , while in Dual Decomposition the primal variable is . Although can be reformulated to if , the variables (corresponding to intersections) in GMPLP still do not appear in Dual Decomposition. This shows that the dimensions of the primal variables in GMPLP and Dual Decomposition are different.
3.1 A Unified Formulation
When using the local marginal polytope the objective of the LP relaxation depends only on those . We thus reformulate the LP Relaxation into a unified formulation as follows:
where is defined in def:mub. The local marginal polytope, , can be defined in a unified formulation as
Here is what we call an extended cluster set, where each is called an extended cluster. , where each is a subset of , which we refer to as a sub-cluster. The choices of and correspond to existing or even new inference algorithms, which will be shown later, and when specifying and , we require .
The first two groups of constraints in ensure that , is a distribution over Vals(). We refer to the third group of constraints as local marginalisation constraints.
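As a rough illustration of the local marginalisation constraints, the sketch below marginalises the pseudo-marginal of an extended cluster onto each of its sub-clusters and compares the result with the sub-cluster's own pseudo-marginal. The dictionary layout and the names `marginalise` and `satisfies_local_marginalisation` are assumptions made for this example, not notation from the paper.

```python
from itertools import product

def marginalise(mu_c, cluster, sub):
    """Sum the table mu_c (indexed by states of `cluster`) down to `sub`."""
    pos = [cluster.index(i) for i in sub]
    out = {}
    for xc, p in mu_c.items():
        xs = tuple(xc[k] for k in pos)
        out[xs] = out.get(xs, 0.0) + p
    return out

def satisfies_local_marginalisation(mu, sub_clusters, tol=1e-9):
    """Check that every cluster agrees with each of its listed sub-clusters."""
    for c, subs in sub_clusters.items():
        for s in subs:
            marg = marginalise(mu[c], c, s)
            keys = set(marg) | set(mu[s])
            if any(abs(marg.get(k, 0.0) - mu[s].get(k, 0.0)) > tol for k in keys):
                return False
    return True

# Hypothetical example: one extended cluster (0, 1, 2) with two sub-clusters.
c = (0, 1, 2)
mu = {
    c: {x: 1 / 8 for x in product(range(2), repeat=3)},
    (0, 1): {x: 1 / 4 for x in product(range(2), repeat=2)},
    (2,): {(0,): 0.5, (1,): 0.5},
}
sub_clusters = {c: [(0, 1), (2,)]}
consistent = satisfies_local_marginalisation(mu, sub_clusters)

# Perturbing one sub-cluster marginal breaks the constraint.
mu_bad = dict(mu)
mu_bad[(2,)] = {(0,): 0.9, (1,): 0.1}
inconsistent = satisfies_local_marginalisation(mu_bad, sub_clusters)
print(consistent, inconsistent)
```

Choosing which sub-clusters each extended cluster must agree with is exactly what distinguishes one local marginal polytope from another in this formulation.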
The LP formulation in (1) and (2) of Werner (2010) may look similar to ours. However, the work of Werner (2010) is in fact a special case of ours. In their work, an additional restriction for eq:ML must be satisfied (see (4) in Werner (2010)). As a result, their work does not cover the LP relaxations in Sontag et al (2008) and GMPLP, where redundant constraints like are used to derive a message from one cluster to itself (see Figure 1 of Sontag et al (2008)). Our approach, by contrast, is a generalisation of Sontag et al (2008), GMPLP and Werner (2010).
3.2 Reformulating GMPLP and Dual Decomposition
Here we show that both GMPLP and Dual Decomposition can be reformulated by eq:LPR.
Let us start with GMPLP first. Let be and be . GMPLP eq:OBJGMPLP can be reformulated as follows
where is defined as
We can see that eq:MLGrefrom and eq:MLGOriginal only differ in the dimensions of their variables and (see def:mub and eq:mu_g). Since the objectives in eq:GMPLPref and eq:OBJGMPLP do not depend on directly, the solutions of the two optimisation problems eq:GMPLPref and eq:OBJGMPLP are the same on .
For Dual Decomposition, we let , and
Let be and be . Dual Decomposition eq:OBJDD can be reformulated as
where is defined as
Similarly, for Dual Decomposition with cycle inequalities in eq:cycle, we define as follows
Let be and be , we reformulate the problem in eq:cycle as
3.3 Generalised Dual Decomposition
Note that and in Dual Decomposition are looser than . This suggests that for some Dual Decomposition may achieve a lower quality solution or slower convergence (in terms of number of iterations) than GMPLP (see Footnote 1). We show that, using the unified formulation of the LP relaxation in eq:LPR, Dual Decomposition can be derived on arbitrary local marginal polytopes (including those tighter than , and ). We refer to this new type of Dual Decomposition as Generalised Dual Decomposition (GDD), which forms a basic framework for more efficient algorithms to be presented in Section 6.
Footnote 1: This does not contradict the result reported in Sontag et al (2012), where Dual Decomposition with cycle inequalities converges faster in terms of running time than GMPLP. In Sontag et al (2012), on all their datasets as the order of clusters is at most 3. Dual Decomposition with cycle inequalities runs faster because it has a better cluster pursuit strategy. On datasets with higher order potentials, it may have worse performance than GMPLP.
3.3.1 GDD Message Passing
Let be the Lagrangian multipliers (dual variables) corresponding to the local marginalisation constraints for each . Define
and the following variables :
where is the indicator function, which is equal to 1 if the statement is true and 0 otherwise. Defining , we obtain the following dual problem (see the derivation in the supplementary material).
In eq:DualNotArrange, if for some , the variable will always be cancelled out (in dual objectives other than eq:DualNotArrange, such as the one used in GMPLP, may not be cancelled out). As a result, can be set to an arbitrary value. To optimise eq:DualNotArrange, we use coordinate descent. For any fixing all except yields a sub-optimisation problem,
A solution is provided in the proposition below.
then is a solution of eq:subopt.
The derivation of eq:DualNotArrange and eq:subopt, and the proof of Proposition 1, are provided in the supplementary material. The are often referred to as beliefs, and messages (see Globerson and Jaakkola (2007); Sontag et al (2008)). In eqn:MSGUPD, and , are known, and they do not depend on . We summarise the message updating procedure in Algorithm 1. Dual Decomposition can be seen as a special case of GDD with a specific local marginal polytope in eq:MLDref.
The beliefs are computed via eq:bcdef to evaluate the dual objective and decode an integer solution of the original MAP problem. For a obtained via GDD based message passing, we find (so called decoding) via
Here we use ∈ instead of = because there may be multiple maximisers. In fact, if a node is also an extended cluster or sub-cluster (, s.t. ), then we perform more efficient decoding via
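The node-wise decoding step can be illustrated with a minimal sketch, assuming each node has a belief vector over its states and that decoding picks a maximising state per node. The belief values, the tie tolerance and the name `decode` are hypothetical; the function also returns the full set of maximisers to show why the decoded state need not be unique.

```python
def decode(node_beliefs, tol=1e-12):
    """Pick one maximising state per node and report all maximisers.

    node_beliefs: dict mapping a node index to a list of belief values,
                  one per state of that node.
    """
    maximisers = {}
    for i, b in node_beliefs.items():
        top = max(b)
        # Keep every state whose belief ties with the maximum.
        maximisers[i] = [a for a, v in enumerate(b) if top - v <= tol]
    # Break ties by taking the first maximiser (an arbitrary choice).
    assignment = {i: states[0] for i, states in maximisers.items()}
    return assignment, maximisers

# Hypothetical beliefs for three binary nodes; node 1 has a tie.
beliefs = {0: [0.2, 1.3], 1: [0.7, 0.7], 2: [2.0, -1.0]}
assignment, maximisers = decode(beliefs)
print(assignment, maximisers)
```

When a node has several maximisers, independent tie-breaking across nodes may yield an assignment that is not jointly optimal, which is one reason decoding consistency is analysed separately.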
3.3.2 Convergence and Decoding Consistency
In this part we analyse the convergence and decoding consistency of GDD message passing.
GDD essentially iterates over , and updates the messages via eqn:MSGUPD. The dual decrease defined below
plays a role in the analysis of GDD.
Proposition 2 (Dual Decrease)
For any , the dual decrease
The proof is provided in the supplementary material. A natural question is whether GDD is convergent, which is answered by the following proposition.
Proposition 3 (Convergence)
GDD always converges.
According to duality and LP relaxation, we have for arbitrary ,
By Proposition 2 in each single step of coordinate descent, the dual decrease is non-negative. Thus GDD message passing produces a monotonically decreasing sequence of . Since the sequence has a lower bound, the sequence must converge.
Note that Proposition 3 does not guarantee reaches the limit in finite steps in GDD (GMPLP and Dual Decomposition have the same issue). However, in practice we observe that GDD often converges in finite steps. The following proposition in part explains why the decoding in eq:decode is reasonable.
Proposition 4 (Decoding Consistency)
If GDD reaches a fixed point in finite steps, then , there exist , and ,
If GDD reaches a fixed point, , (see eq:DualDecrease ). Otherwise a non-zero dual decrease means GDD would not stop. Thus ,
which completes the proof.
Clearly, the solution of GDD is exact if the gap between the dual objective and the decoded primal objective is zero. Here we show that the other requirements for the exact solution also hold.
Proposition 5
If there exists that maximises , the solution of GDD is exact.
The proof is provided in the supplementary material.
Propositions 4 and 5 generalise the results of Section 1.7 in Sontag et al (2011).
3.4 Belief Propagation Without Messages
GDD involves updating many messages ( and ). These messages are then used to compute the beliefs (see eq:bcdef). Here we show that we can directly update the beliefs without computing and storing messages.
When optimising eq:subopt, and are determined by (see eq:basfunctionlambda in the supplementary material). Thus let be the beliefs determined by , and