## 1 Introduction

Markov random fields are fundamental tools in machine learning with broad application in areas including computer vision, speech recognition and computational biology. Two forms of inference are commonly employed: maximum a posteriori (MAP), where the most likely configuration is returned; and marginal, where the marginal probability distributions for each set of variables with a linking potential function are returned. In general, MAP inference is NP-hard [17] and marginal inference, even for pairwise models, is harder still in #P [21, 2, 4].

An important class of MRFs, those with only unary and pairwise submodular cost functions, admits efficient MAP inference. This was first shown for binary models [8] and applied broadly in computer vision [1], where the graph cuts method is particularly effective [22]. Recent work extended the application of this approach to multi-label submodular energies of up to third order [14, 16]. Yet marginal inference, even for binary pairwise models, is intractable with few known exceptions. Belief propagation is efficient (and exact) for trees, and loopy belief propagation is guaranteed to converge when the topology has one cycle [28].

Applying the same framework to general models, termed loopy belief propagation (LBP), has proved remarkably effective in some situations but fails in others and has no general guarantees on convergence. A key result is that belief propagation (BP) fixed points coincide with stationary points of the Bethe variational problem [27]. Stationary points, however, may not identify the global optimum of the Bethe free energy. It was subsequently shown that all stable BP fixed points are local optima (rather than saddle points) of this problem, but not vice versa [9, 10]. Variational methods demonstrate that minimizing the Bethe free energy should deliver a good approximation to the true marginal distribution, and recently [15] proved that for submodular MRFs, the Bethe optimum is an upper bound on the true free energy and thus yields a desirable lower bound on the partition function.

Marginal inference is a crucial problem in probabilistic systems. A noteworthy example is the Quick Medical Reference (QMR) problem [20], a graphical model involving 600 diseases and 4000 possible findings. Therein, medical diagnostics are performed by computing the posterior marginal probability of each disease given a set of possible findings. The marginal distribution over the presence of a disease must often be precisely estimated in order to determine the course of medical treatment. Thus, we seek the probability that a patient suffers from a condition, rather than the MAP estimate, which could be very different.

Marginal inference also arises during learning or parameter estimation in Markov random fields. For instance, computing the gradients of the log-partition function in a maximum likelihood estimation procedure is equivalent to marginal inference. In learning problems, the intractability of the marginal inference problem requires the exploration of marginal approximation schemes [6]. However, in the general case, both exact marginal inference and approximate marginal inference are NP-hard [2, 4].

### 1.1 Contribution

We derive various properties of the Bethe free energy and apply them to discretized pseudo-marginals to prove a polynomial-time approximation scheme (PTAS) for the global minimum of the Bethe free energy for binary pairwise associative MRFs.

The idea is that if we can find the optimal discretized point on a sufficiently fine mesh, one that covers all possible locations of an optimum point to within a given distance, then we can bound the difference to the optimum in terms of that distance and the greatest directional second derivative of the Bethe free energy. To our knowledge, we present the first rigorous bounds on this second derivative. One reason this is difficult is that derivatives tend to infinity as singleton marginals approach the boundary cases of 0 or 1. Hence we need to prove bounds on the location away from these edges.

We first prove various bounds, including on the location of any stationary point of the Bethe free energy, as well as on the true marginals. In doing this we develop Bethe bound propagation (BBP), which sometimes produces remarkably tight bounds by itself. We then consider the second derivatives with a view to bounding the greatest directional second derivative. Additional analysis allows us to prove that the discretized multi-label problem is submodular on any mesh and hence the discretized optimum can be found efficiently using graph cuts [16].

Various extensions are discussed in the closing section, including applications to non-associative models, to models that are themselves multi-label and to models with higher order terms.

### 1.2 Related work

A variety of heuristics have been proposed for marginal inference problems. Marginal inference in the QMR medical diagnostic problem has been explored with Markov chain Monte Carlo (MCMC) methods [13, 19, 3], variational methods [11], and search methods [5]. Many of these heuristics are restricted to certain classes of graphical model (such as QMR). Here we explore another approach to approximate marginal inference by minimizing the Bethe free energy.

The minimization of the Bethe free energy is often approached using loopy belief propagation. However, there are few guarantees on the rate of convergence of LBP, which prevents it from functioning as a PTAS for Bethe minimization [24]. An important contribution [26] showed that the Bethe free energy of a binary pairwise MRF may be considered as a function only of the singleton marginals; however, this connection was provided without convergence results.

A PTAS was recently proposed [18] for the location of a point whose derivative of the Bethe free energy has magnitude less than $\epsilon$. However, this identifies only an approximately stationary point (which may not even be a local minimum) that could be arbitrarily far from the global optimum. That result applies for a general binary pairwise MRF subject to an edge sparsity restriction that the maximum degree is $O(\log n)$. Here we primarily focus on associative models with the same degree restriction, but our deliverable not only satisfies the property in [18] but importantly is also guaranteed to have Bethe free energy within $\epsilon$ of the optimum.

We note that the PTAS in [18] may provide the global optimum when the fixed point is unique and recent work [25] has enumerated necessary and sufficient conditions for uniqueness. Nevertheless, aside from these restricted settings, there are no prior polynomial-time methods for finding or rigorously approximating the global minimum of the Bethe free energy. Earlier work considered discretizations of pseudo-marginals but presented incomplete results [12]. We go significantly further in deriving additional key results which together admit the PTAS. These include explicit forms and bounds on the second derivatives, on the third derivatives and on the locations of stationary points.

## 2 Preliminaries & Notation

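We focus on a binary pairwise MRF over $n$ variables $X_1, \dots, X_n \in \{0, 1\}$ with topology $(\mathcal{V}, \mathcal{E})$ and generally follow the notation of [26]. We assume¹

$$p(x) = \frac{e^{-E(x)}}{Z}, \qquad E = -\sum_{i \in \mathcal{V}} \theta_i x_i - \sum_{(i,j) \in \mathcal{E}} W_{ij} x_i x_j \tag{1}$$

where the partition function $Z = \sum_x e^{-E(x)}$ is a normalizing constant. Let $\mathcal{F}$ be the Bethe free energy, so $\mathcal{F} = E - S_B$ where $S_B$ is the Bethe approximation to the true entropy, $S_B = \sum_{(i,j) \in \mathcal{E}} S_{ij} + \sum_{i \in \mathcal{V}} (1 - d_i) S_i$. $S_{ij}$ is the entropy of a pseudo-marginal $\mu_{ij}$ on the local polytope, $S_i$ is the entropy of the singleton distribution and $d_i$ is the degree of $i$, that is the number of variables to which $X_i$ is adjacent. We assume the model is connected so all $d_i \geq 1$. For each node $i$ define the sums of positive and negative incident edge weights: $W_i = \sum_{j \in N(i):\, W_{ij} > 0} W_{ij}$ and $V_i = -\sum_{j \in N(i):\, W_{ij} < 0} W_{ij}$, where $N(i)$ indicates the neighbors of node $i$. For a pseudo-marginal distribution $q$, let $q_i = q(X_i = 1)$. Consistency and normalization constraints from the local polytope imply

$$\mu_{ij} = \begin{pmatrix} 1 + \xi_{ij} - q_i - q_j & q_j - \xi_{ij} \\ q_i - \xi_{ij} & \xi_{ij} \end{pmatrix} \tag{2}$$

for some $\xi_{ij} \in [0, \min(q_i, q_j)]$, where $\mu_{ij}(a, b) = q(X_i = a, X_j = b)$ is the pairwise marginal. Let $\alpha_{ij} = e^{W_{ij}} - 1$. The case $\alpha_{ij} = 0 \Leftrightarrow W_{ij} = 0$ may be assumed not to occur, else the edge $(i,j)$ may be deleted. $\alpha_{ij}$ has the same sign as $W_{ij}$: if positive then the edge $(i,j)$ is associative; if negative then the edge is repulsive. The MRF is associative if all edges are associative. As in [26], one can solve for $\xi_{ij}$ explicitly in terms of $q_i$ and $q_j$ by minimizing the free energy, leading to a quadratic equation with real roots

$$\alpha_{ij} \xi_{ij}^2 - \left[1 + \alpha_{ij}(q_i + q_j)\right] \xi_{ij} + (1 + \alpha_{ij})\, q_i q_j = 0 \tag{3}$$

For $\alpha_{ij} > 0$, $\xi_{ij}(q_i, q_j)$ is the lower root, for $\alpha_{ij} < 0$ it is the higher. Notice that when $\alpha_{ij} = 0$ (no edge relationship) this reduces as expected to $\xi_{ij} = q_i q_j$.

¹ The energy can always be thus reparameterized with finite $\theta_i$ and $W_{ij}$ terms provided $p(x) > 0$ for all $x$. There are reasonable distributions where this does not hold, i.e. distributions with $p(x) = 0$ for some $x$, but this can often be handled by assigning such configurations a sufficiently small positive probability $\epsilon$.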

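$S_{ij}(q_i, q_j)$ is the entropy of $\mu_{ij}$. Hence

$$\mathcal{F}(q) = \sum_{(i,j) \in \mathcal{E}} -\left(W_{ij}\, \xi_{ij} + S_{ij}(q_i, q_j)\right) + \sum_{i \in \mathcal{V}} \left(-\theta_i q_i + (d_i - 1)\, S_i(q_i)\right) \tag{4}$$

Collecting the pairwise terms for one edge, define

$$f(q_i, q_j) = -W_{ij}\, \xi_{ij}(q_i, q_j) - S_{ij}(q_i, q_j) \tag{5}$$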
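As a rough numerical illustration of the definitions above, the following sketch evaluates the Bethe free energy of a binary pairwise model as a function of the singleton pseudo-marginals only. It assumes the energy convention $E = -\sum_i \theta_i x_i - \sum_{(i,j)} W_{ij} x_i x_j$, the quadratic $\alpha \xi^2 - [1 + \alpha(q_i + q_j)]\xi + (1 + \alpha) q_i q_j = 0$ with $\alpha = e^{W_{ij}} - 1$, and the standard Bethe entropy; the function names (`xi`, `bethe_free_energy`, `exact_log_partition_and_marginals`) and the example parameters are hypothetical and not taken from the paper.

```python
import itertools
import math


def entropy(probs):
    """Shannon entropy in nats, treating 0 log 0 as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def xi(q_i, q_j, w):
    """Pairwise term xi_ij solving the assumed quadratic
    alpha*xi^2 - (1 + alpha*(q_i + q_j))*xi + (1 + alpha)*q_i*q_j = 0."""
    alpha = math.expm1(w)  # alpha = exp(w) - 1
    if alpha == 0.0:
        return q_i * q_j  # no edge relationship
    b = 1.0 + alpha * (q_i + q_j)
    disc = max(b * b - 4.0 * alpha * (1.0 + alpha) * q_i * q_j, 0.0)
    # This branch is the lower root for alpha > 0 and the higher root for alpha < 0.
    return (b - math.sqrt(disc)) / (2.0 * alpha)


def bethe_free_energy(q, theta, weights):
    """Bethe free energy for singleton pseudo-marginals q.
    weights maps an edge (i, j) to its weight W_ij."""
    n = len(q)
    degree = [0] * n
    total = 0.0
    for (i, j), w in weights.items():
        degree[i] += 1
        degree[j] += 1
        x = xi(q[i], q[j], w)
        mu = [1.0 + x - q[i] - q[j], q[j] - x, q[i] - x, x]
        total += -w * x - entropy(mu)  # pairwise energy minus pairwise entropy
    for i in range(n):
        s_i = entropy([q[i], 1.0 - q[i]])
        total += -theta[i] * q[i] + (degree[i] - 1) * s_i
    return total


def exact_log_partition_and_marginals(theta, weights):
    """Brute-force log Z and singleton marginals; only feasible for tiny models."""
    n = len(theta)
    z = 0.0
    mass = [0.0] * n
    for x in itertools.product((0, 1), repeat=n):
        score = sum(theta[i] * x[i] for i in range(n))
        score += sum(w * x[i] * x[j] for (i, j), w in weights.items())
        weight = math.exp(score)
        z += weight
        for i in range(n):
            mass[i] += weight * x[i]
    return math.log(z), [m / z for m in mass]


# Single-edge model (a tree), where the Bethe approximation is exact.
theta = [0.3, -0.5]
weights = {(0, 1): 1.2}
log_z, marginals = exact_log_partition_and_marginals(theta, weights)
bethe_value = bethe_free_energy(marginals, theta, weights)
print(bethe_value, -log_z)
```

On a tree the Bethe free energy evaluated at the true marginals should agree with $-\log Z$, which gives a simple sanity check; on graphs with cycles the two values will generally differ.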

We are interested in discretized pseudo-marginals where for each we restrict its possible values to a discrete set of points in . Note we may often have . Let .

Recall the sigmoid function

$$\sigma(x) = \frac{1}{1 + \exp(-x)}$$

which will be used for Bethe bounds. We write $A_i$ for the lower bound of $q_i$ and $B_i$ for the lower bound of $1 - q_i$ so $A_i \leq q_i \leq 1 - B_i$. Define .

### 2.1 Submodularity

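In our context, a pairwise multi-label function $f$ on a set of ordered labels is submodular if, for all pairs of labelings $x, y$,

$$f(x \wedge y) + f(x \vee y) \leq f(x) + f(y) \tag{7}$$

where for $x = (x_1, x_2)$ and $y = (y_1, y_2)$, $x \wedge y = (\min(x_1, y_1), \min(x_2, y_2))$ and $x \vee y = (\max(x_1, y_1), \max(x_2, y_2))$. For binary variables this is equivalent to associativity.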
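The lattice condition for a pairwise function on ordered labels can be checked by brute force on small label sets. The sketch below assumes the standard definition $f(x \wedge y) + f(x \vee y) \leq f(x) + f(y)$ with component-wise minimum and maximum; the helper name `is_submodular` and the example functions are illustrative only.

```python
import itertools


def is_submodular(f, labels_1, labels_2, tol=1e-12):
    """Check f(x ^ y) + f(x v y) <= f(x) + f(y) for every pair of labelings,
    where ^ and v are the component-wise min and max."""
    pairs = list(itertools.product(labels_1, labels_2))
    for x, y in itertools.product(pairs, repeat=2):
        meet = (min(x[0], y[0]), min(x[1], y[1]))
        join = (max(x[0], y[0]), max(x[1], y[1]))
        if f(*meet) + f(*join) > f(*x) + f(*y) + tol:
            return False
    return True


# Binary case: the pairwise energy -W * a * b is submodular exactly when W >= 0.
associative = is_submodular(lambda a, b: -1.5 * a * b, [0, 1], [0, 1])
repulsive = is_submodular(lambda a, b: 1.5 * a * b, [0, 1], [0, 1])
print(associative, repulsive)
```

For binary labels the check reduces to $f(0,0) + f(1,1) \leq f(0,1) + f(1,0)$, which matches the associativity condition on the edge weight.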

## 3 Bounds & Bethe bound propagation

We use the technique of flipping variables, i.e. considering $Y_i = 1 - X_i$. Flipping a variable flips the parity of all its incident edges, so associative $\leftrightarrow$ repulsive. Flipping both ends of an edge leaves its parity unchanged.

### 3.1 Flipping all variables

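Consider a new model with variables $Y_i = 1 - X_i$, $i = 1, \dots, n$, and the same edges. Instead of $\theta$s and $W$s, let the new model have parameters $\theta'$ and $W'$. We identify values such that the energies of all states are maintained up to a constant.²

$$E = -\sum_{i} \theta_i (1 - y_i) - \sum_{(i,j)} W_{ij} (1 - y_i)(1 - y_j) = \text{const} - \sum_{i} \theta'_i y_i - \sum_{(i,j)} W'_{ij} y_i y_j$$

Matching coefficients yields

$$W'_{ij} = W_{ij}, \qquad \theta'_i = -\theta_i - \sum_{j \in N(i)} W_{ij} \tag{8}$$

If the original model was associative, so too is the new.

² Any constant difference will be absorbed into the partition function and leave probabilities unchanged.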
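The reparameterization obtained by flipping all variables can be verified numerically on a small model. This is a minimal sketch assuming the energy convention $E = -\sum_i \theta_i x_i - \sum_{(i,j)} W_{ij} x_i x_j$ and the flipped parameters $W'_{ij} = W_{ij}$, $\theta'_i = -\theta_i - \sum_{j \in N(i)} W_{ij}$; the helper names and the three-node example are hypothetical.

```python
import itertools


def energy(x, theta, weights):
    """E(x) = -sum_i theta_i x_i - sum_(i,j) W_ij x_i x_j."""
    unary = sum(t * v for t, v in zip(theta, x))
    pairwise = sum(w * x[i] * x[j] for (i, j), w in weights.items())
    return -unary - pairwise


def flip_all(theta, weights):
    """Parameters of the model over Y_i = 1 - X_i with matching energies."""
    incident = [0.0] * len(theta)
    for (i, j), w in weights.items():
        incident[i] += w
        incident[j] += w
    theta_new = [-t - s for t, s in zip(theta, incident)]
    return theta_new, dict(weights)  # edge weights are unchanged


theta = [0.4, -1.0, 0.2]
weights = {(0, 1): 0.7, (1, 2): -0.3, (0, 2): 1.1}
theta_new, weights_new = flip_all(theta, weights)

# The energy difference between x and its flipped image should be the same constant for all x.
diffs = []
for x in itertools.product((0, 1), repeat=len(theta)):
    y = tuple(1 - v for v in x)
    diffs.append(energy(x, theta, weights) - energy(y, theta_new, weights_new))
print(max(diffs) - min(diffs))
```

Because the differences agree across all states, the two models define the same distribution once the constant is absorbed into the partition function.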

### 3.2 Flipping some variables

Sometimes we flip only a subset of the variables. This can be useful, for example, to make the model locally associative around a variable, which can always be achieved by flipping just those neighbors to which it has a repulsive edge. Let if else for , where . Let edges with exactly ends in for .

As in 3.1, solving for and such that energies are unchanged up to a constant,

(9) |

###### Lemma 1.

Flipping any set of variables changes affected pseudo-marginal matrix entries’ locations but not values. The Bethe free energy is unchanged up to a constant, hence the locations of stationary points are unaffected.

###### Proof.

By construction energies are the same up to a constant. The singleton entropies are symmetric functions of $q_i$ and $1 - q_i$ so are unaffected. The impact on pseudo-marginal matrix entries follows directly from definitions. Thus Bethe entropy is unaffected. ∎

### 3.3 Bounds

We derive several results that are useful in bounding the Bethe free energy as well as the marginals.

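###### Lemma 2.

$\alpha_{ij} \geq 0 \Rightarrow \xi_{ij} \geq q_i q_j$, and $\alpha_{ij} \leq 0 \Rightarrow \xi_{ij} \leq q_i q_j$.

###### Proof.

The quadratic equation (3) for $\xi_{ij}$ may be rewritten $\xi_{ij} - q_i q_j = \alpha_{ij} (q_i - \xi_{ij})(q_j - \xi_{ij})$. Both terms in parentheses on the right are elements of the pseudo-marginal matrix so are constrained to be $\geq 0$. ∎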

This simple result is sufficient to bound the location of stationary points of the Bethe free energy away from the edges of 0 and 1, though we improve the bounds in Lemma 6.

###### Theorem 3.

If all edges incident to $X_i$ are associative then at any stationary point of the Bethe free energy, $\sigma(\theta_i) \leq q_i \leq \sigma(\theta_i + W_i)$. Remark: exactly the same sandwich result holds for the true marginal $p_i$.

###### Proof.

We first prove the left inequality. Consider (6). Using and Lemma 2 we have

To obtain the right inequality, flip all variables as in section 3.1. Using the first inequality, (8) and Lemma 1 yields since . To show the result for the true marginal, let then using (1), . Since all the result follows. ∎
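The sandwich result for the true marginal can be checked by brute force on a small associative model. The sketch below assumes the bound takes the form $\sigma(\theta_i) \leq p_i \leq \sigma(\theta_i + W_i)$, with $W_i$ the sum of the (non-negative) weights incident to node $i$ and the energy convention $E = -\sum_i \theta_i x_i - \sum_{(i,j)} W_{ij} x_i x_j$; the random model, seed and helper names are illustrative assumptions.

```python
import itertools
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def exact_marginals(theta, weights):
    """Brute-force p(X_i = 1) for each node; only feasible for tiny models."""
    n = len(theta)
    z = 0.0
    mass = [0.0] * n
    for x in itertools.product((0, 1), repeat=n):
        score = sum(theta[i] * x[i] for i in range(n))
        score += sum(w * x[i] * x[j] for (i, j), w in weights.items())
        weight = math.exp(score)
        z += weight
        for i in range(n):
            mass[i] += weight * x[i]
    return [m / z for m in mass]


def sandwich_bounds(theta, weights):
    """Assumed bounds (sigmoid(theta_i), sigmoid(theta_i + W_i)) for an associative model."""
    incident = [0.0] * len(theta)
    for (i, j), w in weights.items():
        if w < 0:
            raise ValueError("this sketch only covers associative edges")
        incident[i] += w
        incident[j] += w
    return [(sigmoid(t), sigmoid(t + s)) for t, s in zip(theta, incident)]


rng = random.Random(0)
n = 6
theta = [rng.uniform(-2.0, 2.0) for _ in range(n)]
weights = {
    (i, j): rng.uniform(0.0, 2.0)
    for i in range(n)
    for j in range(i + 1, n)
    if rng.random() < 0.5
}
marginals = exact_marginals(theta, weights)
bounds = sandwich_bounds(theta, weights)
for p, (lo, hi) in zip(marginals, bounds):
    print(f"{lo:.4f} <= {p:.4f} <= {hi:.4f}")
```

The check relies on the fact that each conditional probability $p(X_i = 1 \mid x_{-i}) = \sigma(\theta_i + \sum_j W_{ij} x_j)$ lies between the two bounds when all incident weights are non-negative, and the marginal is an average of these conditionals.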

Using the partial flipping of section 3.2 we obtain a more powerful corollary.

###### Theorem 4.

For general edge types (associative or repulsive), let $A_i = \sigma(\theta_i - V_i)$, $B_i = 1 - \sigma(\theta_i + W_i)$. At any stationary point of the Bethe free energy, $A_i \leq q_i \leq 1 - B_i$. The same sandwich result holds for the true marginal $p_i$.

###### Proof.

The following lemma will be useful.

###### Lemma 5.

For .

###### Proof.

Let . To show the left inequality, consider and , then . For the right inequality observe ∎

###### Lemma 6 (Better lower bound for ).

If , then , equality only possible at an edge, i.e. one or both of .

###### Proof.

###### Lemma 7 (Upper bound for ).

If , then .

Also .

###### Proof.

We prove the first inequality. The second follows by Lemma 5 and those for follow by symmetry. The final inequality follows by combining the earlier ones. Let and substitute into (3)

The function is a convex parabola which at is at .³ From Lemma 2 we know that the left root is at so we may take the derivative there, i.e. at and by convexity use this to establish a lower bound for . That derivative is . ∎

³ This confirms neatly that we must take the left root else (a contradiction).

###### Lemma 8.

Unless $q_i \in \{0, 1\}$ or $q_j \in \{0, 1\}$, all entries of the pseudo-marginal $\mu_{ij}$ are strictly $> 0$, whether $(i,j)$ is associative or repulsive.⁴

⁴ Here we assume $W_{ij}$ is finite, see footnote 1.

### 3.4 Bethe bound propagation (BBP)

We have already derived bounds on stationary points in Theorems 3 and 4. Here we show for variables with only associative edges how we can iteratively improve these bounds, sometimes with striking results. Note that a fully associative model is not required, and as in section 3.2, any model may be selectively flipped to yield local associativity around a particular node.

We first assume all and adopt the approach of Theorem 3, now using the better bound from Lemma 6 to obtain

Hence where

monotonically increasing with and decreasing with . Hence

(11) |

Using Theorem 3 we initialize and .

Using (6), at any stationary point we must have

where . Intuitively, in an associative model, if variable has neighbors which are likely to be (i.e. high ) then this pulls up the probability that will be 1 (i.e. raises ).

Flipping all variables,

where with

It is also possible to write this as

This establishes a message passing type of algorithm for iteratively improving the bounds . Repeat until convergence:

recompute |

###### Lemma 9.

At every iteration, all of monotonically increase.

###### Proof.

All of the dependencies are monotonically increasing on all inputs. The first iteration yields an increase since each . ∎

Since , each is bounded above and we achieve monotonic convergence. Combining this with the main global optimization approach can dramatically reduce the range of values that need be considered, leading to significant time savings. Convergence is rapid even for large, densely connected graphs. Each iteration takes time; a good heuristic is to run for up to 20 iterations, terminating early if all parameters improve by less than a threshold value. This adds negligible time to the global optimization.

This procedure alone can produce impressive results. For example, running on a -node graph with independent random edge probability (hence average degree ), each and drawn randomly from Uniform and then adjusting in order to be unbiased, convergence takes about 11 iterations yielding final average bracket width of after starting with average bracket width of . Greater connectivity, higher edge strengths and smaller individual node potentials make the problem more challenging and may widen the returned final brackets significantly.

### 3.5 BBP for general models

A repulsive edge may always be flipped to associative by flipping variable , which flips its Bethe bounds . Using Theorem 4 we can extend the analysis above to run BBP on any model, see Algorithm 1. Performance in terms of convergence speed and final bracket width is similar for associative and non-associative models.

## 4 Higher derivatives & submodularity

We first derive a novel result for the second derivatives of an edge which will be crucial later for bounding the error of the discretized global optimum and also will allow us to show that the discretized multi-label problem is submodular.

### 4.1 Second derivatives for each edge

###### Theorem 10.

For any edge , for any , writing and from (2),

where with equality only for or . Further and has the sign of .

###### Proof.

We begin with the same approach as [12] but extend the analysis and derive stronger results.

For notational convenience add a third pseudo-dimension restricted to the value . Let be the vector with components , and where . Define , and if or otherwise. Let . Define function used in entropy calculations as .

Consider (5) but instead of solving for explicitly, express as an optimization problem, minimizing free energy subject to local consistency and normalization constraints in order to use techniques from convex optimization. We have where

(12) |

The Lagrangian can be written as

and its derivative is

which yields a minimum at

(13) |

Since the minimization problem in (14) is convex and satisfies the weak Slater’s condition (the constraints are affine), strong duality applies and where the dual is simply

(14) |

Let then .

Hence using (14). Focusing on our goal of obtaining second derivatives, we consider which we shall express in terms of .

Differentiating with respect to ,

Considering (14), hence . Thus . Using its definition and (14), we have

Earlier work [12] stopped here, recognizing that . We more precisely characterize this matrix

(15) |

Recall constraints , , . Note is symmetric.

Applying our result above and using Cramer’s rule,