Belief Propagation, Bethe Approximation and Polynomials

08/08/2017 ∙ by Damian Straszak, et al. ∙ 0

Factor graphs are important models for succinctly representing probability distributions in machine learning, coding theory, and statistical physics. Several computational problems, such as computing marginals and partition functions, arise naturally when working with factor graphs. Belief propagation is a widely deployed iterative method for solving these problems. However, despite its significant empirical success, not much is known about the correctness and efficiency of belief propagation. Bethe approximation is an optimization-based framework for approximating partition functions. While it is known that the stationary points of the Bethe approximation coincide with the fixed points of belief propagation, in general, the relation between the Bethe approximation and the partition function is not well understood. It has been observed that for a few classes of factor graphs, the Bethe approximation always gives a lower bound to the partition function, which distinguishes them from the general case, where neither a lower bound, nor an upper bound holds universally. This has been rigorously proved for permanents and for attractive graphical models. Here we consider bipartite normal factor graphs and show that if the local constraints satisfy a certain analytic property, the Bethe approximation is a lower bound to the partition function. We arrive at this result by viewing factor graphs through the lens of polynomials. In this process, we reformulate the Bethe approximation as a polynomial optimization problem. Our sufficient condition for the lower bound property to hold is inspired by recent developments in the theory of real stable polynomials. We believe that this way of viewing factor graphs and its connection to real stability might lead to a better understanding of belief propagation and factor graphs in general.


1 Introduction

Several important classes of probability distributions studied in statistical physics, coding theory, and machine learning can be succinctly represented as factor graphs [22, 38]. Informally, they provide a way to describe complex, multivariate functions by specifying variables and relations between them in the form of a hypergraph [18]. In this context, of interest are the inference problem of estimating marginal probabilities of certain variables and the problem of estimating the partition function of such a factor graph. In computer vision one applies such inference primitives to learn about objects in a scene being captured by several cameras [12]. They are also essential components of decoding algorithms for Low-Density Parity Check codes [13, 32]. In statistical physics, these problems are equivalent to learning properties of typical configurations of a given mechanical system [22].

Due to the practical relevance and broad applicability of such inference primitives, over several decades numerous approximate and heuristic methods have been developed to compute these quantities. Among them, the most widely deployed is the belief propagation method [13, 25], which is an iterative message passing algorithm (or equivalently a discrete-time dynamical system) for computing marginals and partition functions. It is known that belief propagation provides exact answers when the considered factor graph is a tree [25] and gives decent approximations on locally tree-like graphs [10]. However, a general theory explaining the great empirical success of the belief propagation method is lacking.

Another, seemingly unrelated approach, with its roots in physics, is the Bethe approximation [4, 19, 14]. It is based on computing the optimal value (called the Bethe partition function) of a certain continuous optimization problem and using it as an estimate of the true partition function. There is a fundamental connection between belief propagation and the Bethe approximation: the fixed points of the former are exactly the stationary points of the optimization problem underlying the latter [42]. This provides a good grasp on the belief propagation algorithm, which is otherwise hard to reason about. By establishing bounds on the Bethe partition function, one can deduce facts about the behavior of the belief propagation algorithm and, importantly, learn to some extent where it will converge.

Even though for real-world examples of factor graphs the Bethe partition function seems to provide a decent estimate of the partition function, there are known examples for which the approximation is arbitrarily bad [38, 41]. This is not a surprise, as inference problems related to factor graphs can encode NP-hard problems, and even #P-hard problems (such as counting independent sets in a graph) can be seen as computing certain partition functions. Another difficulty, which rules out several proof techniques for dealing with such relaxations, is that the underlying optimization problem is not convex. For this reason, it is hard to expect a characterization of factor graphs for which the Bethe approximation can be related to the true partition function. Instead, there are efforts to describe viable sufficient conditions under which some relation can be established. For factor graphs representing permanents, it has been proved that the Bethe approximation is a lower bound to the true partition function [36, 16, 17]. A similar phenomenon has been observed and conjectured to hold for log-supermodular factor graphs [31] and a positive resolution was proposed by [27].

We propose a new, alternative view on factor graphs via the lens of polynomials. Specifically, we introduce a natural way of representing local functions as polynomials, so that the Bethe approximation can be restated as a polynomial optimization problem. This allows us to relate properties of the underlying polynomials to the behavior of the Bethe approximation. We state a natural analytic condition under which the Bethe partition function lower-bounds the true partition function. The condition is inspired by recent developments in the theory of real stable polynomials [7, 5, 6] and in particular by recent polynomial approaches to partition functions [3, 30] (see Remark 5.7 for a comparison) based on ideas from [15]. In its simplest form, it requires all the polynomials underlying the factor graph to be real stable. Interestingly, such factor graphs are necessarily repulsive or log-submodular, which complements the lower bounds obtained by [27] – for attractive or log-supermodular models. We believe that this framework based on polynomials might be used to establish similar bounds for different classes of factor graphs and more generally to answer different questions about the Bethe approximation and the belief propagation algorithm.

2 Factor Graphs and Bethe Approximation

2.1 Factor Graphs

We work with probability distributions represented by Normal Factor Graphs (NFGs). In an NFG $G$, there is a set of factors (or nodes) $F$ and a set of variables (or edges) $E$. Every edge connects exactly two factors. The set of edges incident to a factor $a \in F$ is denoted by $\partial a$. The last component of $G$ is a collection of local functions $\{g_a\}_{a \in F}$. Every such function takes as input a binary string of length $|\partial a|$ and outputs a non-negative number, in other words $g_a : \{0,1\}^{\partial a} \to \mathbb{R}_{\geq 0}$. For a given vector $\sigma \in \{0,1\}^E$ and any set of edges $S \subseteq E$ we denote by $\sigma_S$ the sub-vector of $\sigma$ of length $|S|$ indexed by edges in $S$. Edges are to be thought of as variables that can take one of two possible values: $0$ or $1$. Then the set of all possible configurations of $G$ is $\{0,1\}^E$. Consider the probability distribution $p$ on $\{0,1\}^E$ defined by setting

$$p(\sigma) = \frac{1}{Z(G)} \prod_{a \in F} g_a(\sigma_{\partial a}), \qquad \text{where} \quad Z(G) = \sum_{\sigma \in \{0,1\}^E} \prod_{a \in F} g_a(\sigma_{\partial a}). \tag{1}$$

It is always assumed that $Z(G) > 0$, in which case $p$ is a well defined probability distribution over configurations. The focus here is on the problem of estimating $Z(G)$ for a given normal factor graph $G$.
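As a minimal sketch of the quantity being estimated, the brute-force routine below sums the product of local function values over all binary configurations of the edges. The data layout (a list of edge names plus `(incident_edges, g)` pairs) and the toy graph are illustrative assumptions, not notation from the paper. The enumeration is exponential in the number of edges, so it is only useful as a reference on tiny instances.

```python
from itertools import product


def partition_function(edges, factors):
    """Brute-force partition function of a normal factor graph.

    edges:   list of edge names (the binary variables).
    factors: list of (incident_edges, g) pairs, where g maps a tuple of bits,
             ordered like incident_edges, to a non-negative number.
    """
    total = 0.0
    for bits in product((0, 1), repeat=len(edges)):
        sigma = dict(zip(edges, bits))
        weight = 1.0
        for incident, g in factors:
            weight *= g(tuple(sigma[e] for e in incident))
        total += weight
    return total


def exactly_one(z):
    """Local function that accepts configurations with exactly one active edge."""
    return 1.0 if sum(z) == 1 else 0.0


# Hypothetical toy NFG: two factors joined by two parallel edges e1 and e2.
toy_edges = ["e1", "e2"]
toy_factors = [(("e1", "e2"), exactly_one), (("e1", "e2"), exactly_one)]
toy_Z = partition_function(toy_edges, toy_factors)
```

In this toy graph only the configurations with exactly one active edge have non-zero weight, so the sum counts two configurations.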

Note that in a related model of factor graphs, variables are represented by variable nodes, whereas in the model considered here they are represented by edges. However, a simple reduction shows that these two models are equivalent [11]. We choose to work with normal factor graphs to allow a cleaner statement of results.

2.2 Bethe Approximation

The Bethe approximation is a popular heuristic for computing $Z(G)$. It is based on computing a quantity $Z_B(G)$, called the Bethe partition function of $G$, as a solution to a continuous optimization problem defined with respect to $G$. To derive the Bethe approximation, one begins with the following convex program

$$\max_{q} \ \sum_{\sigma \in \{0,1\}^E} q(\sigma) \log \frac{g(\sigma)}{q(\sigma)} \qquad \text{s.t.} \quad q \text{ is a probability distribution over } \{0,1\}^E, \tag{2}$$

where $g(\sigma) = \prod_{a \in F} g_a(\sigma_{\partial a})$. It is not hard to prove that the above program has an optimal solution $q^\star = p$ (with $p$ as in (1)), and the optimal value is $\log Z(G)$. Thus the problem of computing the partition function is reduced to solving the program (2). This reduction, however, does not seem to make the problem any easier, as the number of variables in (2) is exponential. Thus, various heuristics have been proposed to reduce the number of variables in (2) so as to make this approach of estimating $Z(G)$ feasible.
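The claim that the exponential-size program attains the logarithm of the partition function at the normalized distribution can be checked numerically on a tiny example. The sketch below assumes the objective has the usual variational form, the sum over configurations of q times log(g/q), and uses a hypothetical list of four unnormalized weights in place of a real factor graph.

```python
import math
import random


def gibbs_objective(q, g):
    """Sum of q(s) * log(g(s) / q(s)), with the convention 0 * log(.) = 0."""
    value = 0.0
    for qs, gs in zip(q, g):
        if qs > 0:
            if gs == 0:
                return -math.inf
            value += qs * math.log(gs / qs)
    return value


# Hypothetical unnormalized weights of four global configurations.
g = [1.0, 2.0, 0.5, 4.0]
Z = sum(g)

# The normalized distribution attains log Z.
p = [w / Z for w in g]
value_at_p = gibbs_objective(p, g)

# Any other distribution gives a value that is at most log Z.
rng = random.Random(0)
raw = [rng.random() for _ in g]
q = [r / sum(raw) for r in raw]
value_at_q = gibbs_objective(q, g)
```

The gap between `math.log(Z)` and `value_at_q` is the KL-divergence from `q` to `p`, which is why it cannot be negative.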

The Bethe approximation has variables $x_e$ for $e \in E$, which are the marginals of the distribution $q$; more formally, we think of $x_e$ as $\Pr[\sigma_e = 1]$ where $\sigma$ is distributed according to $q$. Similarly one introduces variables representing marginals over factors, i.e., for $a \in F$ we have a vector $b_a$ which is a probability distribution over local configurations $z \in \{0,1\}^{\partial a}$, and its interpretation is that $b_a(z) = \Pr[\sigma_{\partial a} = z]$. To simplify the program (2) the following assumption is made about the form of the distribution $q$:

$$q(\sigma) = \frac{\prod_{a \in F} b_a(\sigma_{\partial a})}{\prod_{e \in E} x_e^{\sigma_e} (1 - x_e)^{1 - \sigma_e}}. \tag{3}$$

The intuition behind such a form of $q$ is that one might (for simplicity) assume independence between factors and calculate the probability of a global configuration as a product of probabilities over local configurations of factors. The term in the denominator can be thought of as a correction term, as every edge is “taken twice into account” in the numerator. Another way of motivating (3) is to observe that when the graph is a tree, the probability function $p$ can be written in this form and, wishfully, one may expect that for other graphs it might serve as a good estimate. Assuming such a special form of $q$, the program (2) reduces to

$$\max_{(x, b)} \ \sum_{a \in F} \sum_{z \in \{0,1\}^{\partial a}} b_a(z) \log \frac{g_a(z)}{b_a(z)} + \sum_{e \in E} H(x_e) \qquad \text{s.t.} \quad (x, b) \in \mathcal{M}(G), \tag{4}$$

where $H$ is the negative binary entropy function (i.e., $H(t) = t \log t + (1 - t) \log (1 - t)$ for $t \in [0,1]$) and $\mathcal{M}(G)$ is the set of all marginal vectors which satisfy local agreement constraints (it is thus called the pseudo-marginal polytope). This means that $x$ and $b$ are as above and they satisfy:

$$\sum_{z \in \{0,1\}^{\partial a} \,:\, z_e = 1} b_a(z) = x_e \qquad \text{for every } a \in F \text{ and } e \in \partial a.$$

The exponential of the optimal value of (4) is called the Bethe partition function and is denoted by $Z_B(G)$. One expects that $Z_B(G)$ is a decent approximation to $Z(G)$, which has been confirmed empirically for various examples of factor graphs.
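The local agreement constraints say that every factor's local distribution must reproduce the shared edge marginals. A minimal sketch of such a check follows; it assumes each local belief is stored as a dictionary from bit tuples to probabilities and each edge marginal is the probability that the edge equals 1. The names `edge_marginals` and `is_locally_consistent` are hypothetical.

```python
def edge_marginals(incident, b):
    """Probability that each incident edge equals 1 under the local belief b.

    incident: tuple of edge names; b: dict mapping bit tuples (ordered like
    incident) to probabilities.
    """
    return {e: sum(prob for z, prob in b.items() if z[i] == 1)
            for i, e in enumerate(incident)}


def is_locally_consistent(x, beliefs, tol=1e-9):
    """Check that every belief is a distribution whose edge marginals match x."""
    for incident, b in beliefs:
        if any(prob < -tol for prob in b.values()):
            return False
        if abs(sum(b.values()) - 1.0) > tol:
            return False
        marg = edge_marginals(incident, b)
        if any(abs(marg[e] - x[e]) > tol for e in incident):
            return False
    return True


# Hypothetical example: two factors sharing the edges e1 and e2.
x = {"e1": 0.5, "e2": 0.5}
consistent = [
    (("e1", "e2"), {(1, 0): 0.5, (0, 1): 0.5}),
    (("e1", "e2"), {(1, 1): 0.5, (0, 0): 0.5}),
]
inconsistent = [
    (("e1", "e2"), {(1, 0): 1.0}),
    (("e1", "e2"), {(1, 0): 0.5, (0, 1): 0.5}),
]
```

Note that the two beliefs in `consistent` disagree about the joint behavior of the edges while agreeing on every single-edge marginal, which is exactly the kind of point a pseudo-marginal polytope admits.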

However, in general, can be an arbitrarily bad approximation to , as for instance it might be positive for some cases where . From a theoretical viewpoint, not much is known about the behavior of Bethe approximation. The main source of difficulty in understanding this relaxation is its non-convexity, which in particular manifests itself in multiple local optima. In the current paper we derive some sufficient conditions under which the Bethe partition function lower-bounds the true partition function.

2.3 Related Work

The notion of free energy that appears as the objective in the Bethe partition function was formulated in [4] in the physics literature. See also [23] and references therein for more historical notes on Bethe approximation. The correspondence between Bethe approximation and the belief propagation algorithm was explicitly derived in [42]. This combined with the work [25] on the belief propagation method implies that Bethe approximation gives exact values of the partition function on tree factor graphs. It is also known that Bethe partition function gives precise estimates in the asymptotic sense on locally tree-like graphs [10].

In the work [9], the loop series expansion of the Bethe partition function was introduced, which is a tool to study the relation between the Bethe partition function and the true partition function. In [9] the loop expansion was used to prove that Bethe approximation gives a good estimate on the number of independent sets on graphs with small maximum degree and large girth.

The problem of computing permanents of nonnegative matrices has also been intensively studied in the context of the Bethe approximation [40, 36, 16, 17]. Recall that the permanent of a matrix $A \in \mathbb{R}^{n \times n}$ is defined to be

$$\mathrm{per}(A) = \sum_{\pi \in S_n} \prod_{i=1}^{n} A_{i, \pi(i)},$$

where $S_n$ is the set of all permutations of $\{1, \dots, n\}$, and the problem of computing it is a canonical example of a #P-hard problem [33], hence no polynomial time exact algorithm is expected to exist. This problem can be formulated in a natural way as evaluating a certain partition function [40, 36] and hence one can investigate the question of how well the Bethe partition function approximates permanents.
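For reference, the permanent can be computed directly from its definition as a sum over permutations. The sketch below is a plain enumeration that takes factorial time, so it is meant only for checking small cases; the function name `permanent` is an illustrative choice.

```python
from itertools import permutations
from math import prod


def permanent(A):
    """Permanent of a square matrix given as a list of rows.

    Sums, over all permutations s, the product of the entries A[i][s[i]].
    """
    n = len(A)
    return sum(prod(A[i][s[i]] for i in range(n))
               for s in permutations(range(n)))


small = permanent([[1, 2], [3, 4]])          # 1*4 + 2*3
ones = permanent([[1] * 3 for _ in range(3)])  # counts the 3! permutations
```

For a 0/1 matrix the permanent counts perfect matchings of the corresponding bipartite graph, which is the combinatorial reading used when the permanent is viewed as a partition function.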

It has been observed [36] that unlike in the general case, for permanents the program (4) is convex. This allows one to analyze optimality via the KKT conditions and to conclude that $Z_B(G) \leq Z(G)$ using a permanental inequality due to [28]. The success of this approach crucially relies on the existence of a convex form of the Bethe approximation, which seems to be an exception rather than a rule among various factor graphs.

The Bethe approximation was also studied in the context of the Ising model [31], and shown to lower-bound the true partition function for the ferromagnetic case under a certain technical assumption. This result was extended by [27] to the class of all log-supermodular (also called attractive) factor graphs. A factor graph is called log-supermodular if every local function $g_a$ is log-supermodular, i.e., for every $u, v \in \{0,1\}^{\partial a}$ we have

$$g_a(u) \, g_a(v) \leq g_a(u \vee v) \, g_a(u \wedge v),$$

where $\vee$ and $\wedge$ denote entry-wise OR and entry-wise AND respectively. The proof is based on the following combinatorial characterization of the Bethe approximation, due to [36, 35]. It says that

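$$Z_B(G) = \limsup_{M \to \infty} \left( \mathbb{E}_{\tilde{G} \in \mathcal{G}_M} \, Z(\tilde{G}) \right)^{1/M},$$

where $\mathcal{G}_M$ is the set of $M$-covers of the factor graph $G$, and the expectation is over a uniformly random choice of $\tilde{G}$ in $\mathcal{G}_M$ (for details we refer to [35]). It follows that in order to prove that $Z_B(G) \leq Z(G)$ for a given factor graph $G$, it is enough to prove that for every $M$-cover $\tilde{G}$ of $G$

$$Z(\tilde{G}) \leq Z(G)^M. \tag{5}$$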

This is the main idea behind the reasoning of [27]; the inequality (5) is then proved using a certain generalization of the four function theorem [1]. In the context of attractive models, several conjectures regarding similar lower bounds were stated in [39], out of which only one (for independent sets on bipartite graphs) has been so far resolved (by the above result of [27]).
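The log-supermodularity condition recalled above is a finite set of inequalities, so for a local function of small degree it can be verified by enumeration. The sketch below checks every pair of binary strings; the two example functions are hypothetical stand-ins for an attractive and a repulsive pairwise interaction.

```python
from itertools import product


def is_log_supermodular(g, m, tol=1e-12):
    """Check g(u) * g(v) <= g(u OR v) * g(u AND v) for all u, v in {0,1}^m."""
    cube = list(product((0, 1), repeat=m))
    for u in cube:
        for v in cube:
            join = tuple(a | b for a, b in zip(u, v))
            meet = tuple(a & b for a, b in zip(u, v))
            if g(u) * g(v) > g(join) * g(meet) + tol:
                return False
    return True


def attractive(z):
    """Hypothetical pairwise function that favors equal bits."""
    return 2.0 if z[0] == z[1] else 1.0


def repulsive(z):
    """Hypothetical pairwise function that favors different bits."""
    return 1.0 if z[0] == z[1] else 2.0
```

For the repulsive function the pair `(1, 0)`, `(0, 1)` violates the inequality, since the product of the two mixed values exceeds the product of the values at all-ones and all-zeros.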

Finally we mention that this paper is inspired by recent developments in the theory of real stable polynomials [7, 5, 6] and the works of [15, 30, 3], where several polynomial based relaxations are considered; for details, we refer the reader to Remark 5.7.

3 Our Contribution

3.1 Polynomial Form of Bethe Approximation

The main conceptual result of this paper is a new approach to prove inequalities between the Bethe partition function and the true partition function. We start by presenting an alternative view on the Bethe approximation, through the lens of polynomials. Towards this, let us first define the polynomial representation of local functions. For any $a \in F$ we define a multivariate polynomial $g_a(y_a)$ over a set of variables $y_a = \{y_{a,e}\}_{e \in \partial a}$ as follows

$$g_a(y_a) = \sum_{z \in \{0,1\}^{\partial a}} g_a(z) \, y_a^{z}, \tag{6}$$

where $y_a^{z}$ is a monomial defined as $\prod_{e \in \partial a} y_{a,e}^{z_e}$ and the coefficient $g_a(z)$ is given by the value of the local function at $z$. We prove the following alternative characterization of the Bethe partition function as a polynomial optimization problem. In the statement below we use the convenient notation that for two vectors $y, x \in \mathbb{R}^n_{\geq 0}$, $y^x = \prod_{i=1}^{n} y_i^{x_i}$.

Theorem 3.1 (Bethe Approximation via Polynomials)

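Let $G$ be a normal factor graph with a set of factors $F$ and a set of variables $E$. For every factor $a \in F$ let $g_a$ be the corresponding $|\partial a|$-variate polynomial. Then the Bethe partition function can be written as

$$Z_B(G) = \max_{x \in [0,1]^E} \ \inf_{y > 0} \ \prod_{e \in E} x_e^{x_e} (1 - x_e)^{1 - x_e} \prod_{a \in F} \frac{g_a(y_a)}{y_a^{x_{\partial a}}}.$$

In the above statement $y$ stands for a vector which collects all variables $y_{a,e}$ for $a \in F$ and $e \in \partial a$. The proof of Theorem 3.1 appears in Section 4. It is established by adapting a dual view on the max-entropy program which defines the Bethe partition function.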
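To make the polynomial representation concrete, the sketch below evaluates the polynomial attached to a local function: each value of the function becomes the coefficient of the monomial whose exponent vector is the corresponding binary string. The helper names and the example function are illustrative assumptions.

```python
from itertools import product
from math import prod


def monomial(y, z):
    """The product of y[i] ** z[i], for a point y and an exponent vector z."""
    return prod(yi ** zi for yi, zi in zip(y, z))


def poly_eval(g, y):
    """Evaluate the multi-affine polynomial with coefficients g(z) at the point y.

    g maps bit tuples of length len(y) to non-negative numbers.
    """
    m = len(y)
    return sum(g(z) * monomial(y, z) for z in product((0, 1), repeat=m))


def exactly_one(z):
    """Hypothetical local function; its polynomial is y1 + y2 in two variables."""
    return 1.0 if sum(z) == 1 else 0.0


value = poly_eval(exactly_one, (3.0, 5.0))
total_weight = poly_eval(exactly_one, (1.0, 1.0))
```

Evaluating at the all-ones point returns the sum of all values of the local function, a convenient sanity check on any implementation.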

3.2 Lower Bound on the Partition Function

Technically, we prove that assuming a certain geometric condition on the factor graph $G$, the Bethe approximation provides a lower bound on the true partition function. This condition captures permanents as a special case (see the example provided in Section A). Below we state a simplified variant of the main technical result in terms of the local polynomials $g_a$. For a more general statement, which is expressed in the language of probability, as well as a proof of the theorem below, we refer to Section 5.

Theorem 3.2 (Lower Bound via Real Stability)

Let $G$ be a bipartite normal factor graph with a set of factors $F$ and a set of variables $E$. Assume that all the polynomials $g_a$ corresponding to local functions (for $a \in F$) are real stable. Then it holds that

$$Z_B(G) \leq Z(G).$$

A few comments are in order. In the statement above we assume that the NFG is bipartite. This might seem to be restrictive, but as it turns out, every NFG can be converted into an equivalent bipartite form with at most a twofold growth in size, hence no real restriction is put on $G$ with this assumption. The key condition we require is real stability of the underlying polynomials.

Real stability is a geometric condition on the location of zeros of a polynomial, which generalizes real-rootedness. We say that a polynomial $f \in \mathbb{R}[y_1, \dots, y_n]$ is real stable if none of its roots $(y_1, \dots, y_n) \in \mathbb{C}^n$ satisfies: $\mathrm{Im}(y_i) > 0$ for every $i = 1, \dots, n$. Real stable polynomials have recently found numerous applications in mathematics [5, 21] and computer science [15, 20, 2, 24, 30, 3] (see also surveys [37, 26, 34]).

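We remark that coefficients of multi-affine real stable polynomials are known to be given by log-submodular set functions (see [37]), which corresponds to the following assumption on local functions $g_a$ for $a \in F$: for every $u, v \in \{0,1\}^{\partial a}$,

$$g_a(u) \, g_a(v) \geq g_a(u \vee v) \, g_a(u \wedge v).$$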

This demonstrates that Theorem 3.2 addresses the opposite case when compared to the result of [27], where an analogous result for log-supermodular functions is proved. These two assumptions turn out to imply significantly different properties of the underlying factor graphs.
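For a factor of degree two the link between real stability and log-submodularity can be made fully explicit. The sketch below relies on a known criterion for multi-affine polynomials with real coefficients (due to Brändén), which in two variables reduces to a single inequality: the polynomial a + b·y1 + c·y2 + d·y1·y2 is real stable exactly when b·c ≥ a·d. This criterion is brought in here as an outside fact and is not taken from the text; the function names are hypothetical.

```python
def is_real_stable_bivariate(a, b, c, d):
    """Real stability test for a + b*y1 + c*y2 + d*y1*y2 with real coefficients.

    Uses the two-variable case of the multi-affine criterion: b*c - a*d >= 0.
    With a = g(0,0), b = g(1,0), c = g(0,1), d = g(1,1) this is the
    log-submodularity inequality g(1,0) * g(0,1) >= g(0,0) * g(1,1).
    """
    return b * c - a * d >= 0


def root_in_y1(a, b, c, d, y2):
    """Solve a + b*y1 + c*y2 + d*y1*y2 = 0 for y1 at a fixed complex y2."""
    return -(a + c * y2) / (b + d * y2)


# y1 + y2 (exactly one active edge): stable, so the root for y2 = i lies
# outside the open upper half-plane.
stable_root = root_in_y1(0.0, 1.0, 1.0, 0.0, 1j)

# 1 + y1*y2 (both edges equal): not stable, and y1 = y2 = i is a root with
# both coordinates in the upper half-plane.
unstable_root = root_in_y1(1.0, 0.0, 0.0, 1.0, 1j)
```

The second example is the equality constraint between two edges, an attractive interaction, which illustrates why the real stable case is complementary to the log-supermodular one.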

One interesting aspect that is worth mentioning here is that, under log-supermodularity, feasible fractional configurations are easy to round to integral configurations. More precisely, given a point $x \in [0,1]^E$ whose objective value in the Bethe approximation is finite (larger than $-\infty$), one can obtain (by just rounding up all entries of $x$) a configuration $\sigma \in \{0,1\}^E$ such that $\prod_{a \in F} g_a(\sigma_{\partial a}) > 0$. Such a procedure might fail to find a feasible configuration when $G$ is log-submodular (i.e., the resulting $\sigma$ has $\prod_{a \in F} g_a(\sigma_{\partial a}) = 0$). In fact, finding a feasible configuration in such models (even assuming real stability of local polynomials) might be a nontrivial task, even NP-complete if no assumptions on the local functions are made. It turns out in particular that, for the case of permanents, the Bethe approximation is implicitly solving a nontrivial combinatorial optimization problem of detecting whether a bipartite graph has a perfect matching.

Remark 3.3 (Upper Bound)

Using the characterization from Theorem 3.1 one can prove that . Indeed, by plugging in the term is equal to and we obtain

(here by we mean an entry-wise inequality). Hence, altogether, under the assumptions of Theorem 3.2 the Bethe partition function provides a approximation to the true partition function.

3.3 Discussion

In this paper we propose a new approach for establishing bounds on the partition function for graphical models based on polynomial techniques. This work is inspired by recent developments in the theory of stable polynomials [15, 5, 3, 30] and is an attempt to expand the scope of applicability of these tools. While our result seems to require real stability (with respect to the upper-half complex plane) of the underlying polynomials to deduce the desired bound, we believe that other forms of stability, such as stability with respect to a disc, or other analytic assumptions on the polynomials might yield other nontrivial bounds.

Finally, we note that real stability also improves the computational properties of the Bethe approximation. Indeed, the fact that the function is concave, for a real stable polynomial , can be used to show efficient computability of certain relaxations, similar to the Bethe partition function in the polynomial form (see [30, 3]). This might eventually lead to designing relaxations which match or even outperform Bethe approximation, while having provably correct and efficient algorithms.

4 Bethe Approximation via Polynomials

In this section we derive an equivalent form of the Bethe partition function – stated in terms of a polynomial optimization problem.

4.1 Local Functions as Polynomials

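Consider an NFG $G$. In this paper we view the local functions $g_a$ (for $a \in F$) as polynomials. More formally, given a function $g_a : \{0,1\}^{\partial a} \to \mathbb{R}_{\geq 0}$, we define the corresponding polynomial representation of $g_a$ as a $|\partial a|$-variate polynomial over variables $\{y_{a,e}\}_{e \in \partial a}$ given by the formula

$$g_a(y_a) = \sum_{z \in \{0,1\}^{\partial a}} g_a(z) \, y_a^{z},$$

where $y_a^{z}$ denotes $\prod_{e \in \partial a} y_{a,e}^{z_e}$ and $g_a(z)$ is the value of the function $g_a$ at $z$. Note that even if two factors $a, b \in F$ share an edge $e$, the variables $y_{a,e}$ of $g_a$ and $y_{b,e}$ of $g_b$ are still distinct.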

4.2 Bethe Approximation as Polynomial Optimization

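Let $G$ be an NFG. Denote by $H(x)$ the negative entropy of $x \in [0,1]^E$, i.e.,

$$H(x) = \sum_{e \in E} \big( x_e \log x_e + (1 - x_e) \log (1 - x_e) \big).$$

We use $\mathrm{KL}(b, g)$ to denote the KL-divergence between two nonnegative vectors $b$ and $g$ (typically probability distributions),

$$\mathrm{KL}(b, g) = \sum_{z} b(z) \log \frac{b(z)}{g(z)}.$$

The Bethe approximation problem can then be rewritten as

$$\log Z_B(G) = \max_{(x, b) \in \mathcal{M}(G)} \ H(x) - \sum_{a \in F} \mathrm{KL}(b_a, g_a),$$

where $\mathcal{M}(G)$ is the pseudo-marginal polytope, as introduced in Section 2. We define the following entropy maximization problem.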

Definition 4.1

Let be any function with and be any vector. We define to be the optimal value of the following optimization problem over vectors

(7)
s.t.

In case when no satisfies the above constraints, we set .

Lemma 4.1

For every normal factor graph , the Bethe approximation can be stated equivalently as

  • Proof:   The objective of the Bethe approximation has separated and variables, however they are implicitly coupled because of the constraint. For a fixed and a factor the constraint on following from is

    This can be equivalently written in the vector form as

    Note that maximizing under this constraint gives us exactly .       

The lemma below explains how the entropy maximization problem of Definition 4.1 relates to polynomial optimization.

Lemma 4.2

Let and be any vector. Define a -variate, multi-linear polynomial to be with . We have

(8)
  • Proof:   A proof follows by applying strong duality to the max-entropy program (7). For details, see [29, 30].       
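The duality behind this lemma can be checked by hand in the smallest case. The sketch below assumes that identity (8) has the usual max-entropy duality form, namely that the maximum over local distributions with prescribed marginals x of the sum of b(z)·log(g(z)/b(z)) equals the infimum over y > 0 of log(g(y)/y^x); this reading is an assumption. For a single variable the only distribution with marginal x is (1 − x, x), and setting the derivative of the dual objective to zero gives the minimizer y = x·g0 / ((1 − x)·g1).

```python
import math


def primal_value(g0, g1, x):
    """Objective at the unique distribution (1 - x, x) on {0, 1}."""
    return (1 - x) * math.log(g0 / (1 - x)) + x * math.log(g1 / x)


def dual_objective(g0, g1, x, y):
    """log(g(y) / y**x) for the univariate polynomial g(y) = g0 + g1 * y."""
    return math.log((g0 + g1 * y) / y ** x)


# Hypothetical coefficients and marginal.
g0, g1, x = 3.0, 2.0, 0.3

# Stationary point of the dual objective, obtained from g1/(g0 + g1*y) = x/y.
y_star = x * g0 / ((1 - x) * g1)

primal = primal_value(g0, g1, x)
dual_at_star = dual_objective(g0, g1, x, y_star)
dual_elsewhere = [dual_objective(g0, g1, x, y) for y in (0.1, 0.5, 1.0, 2.0, 10.0)]
```

The two values agree at `y_star`, and every other choice of `y` gives a larger dual objective, which is the weak-duality direction of the identity.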

Theorem 3.1 is now a simple consequence of the above established results.

  • Proof of Theorem 3.1:   From Lemma 4.1 we have

    Next, by Lemma 4.2 this can be rewritten as

    By taking exponentials on both sides

     

5 Proof of the Lower Bound

To prove Theorem 3.2 we first formulate a more general condition which we call IPC, and prove that under IPC, the inequality $Z_B(G) \leq Z(G)$ holds. Afterwards we conclude the proof by showing that the assumption of Theorem 3.2 implies that IPC is satisfied.

5.1 The IPC

To state IPC we need to introduce some notation related to the bipartite structure of the factor graph . Let the set of factors be partitioned into two sets and such that no edges go between factors within or within , only between these two sets. Next, for any we define

Furthermore we define the normalized variants of and to be , . We refer to as to the distributions induced by (the “left” side of the bipartition) and induced by (the “right” side of the bipartition) respectively.

We are now ready to state a condition on the pair of distributions which will turn out sufficient for the inequality to hold.

Definition 5.1 (Iterated Positive Correlation)

Let be probability distributions over and let be distributed according to and respectively. Define the event to be for all . For any two sequences of positive reals and and for any pair define

Where the expectation is over and , assuming are independent. We say that the pair of distributions satisfies the Iterated Positive Correlation (IPC) property if

for every and for every .

Note that in the definition above we implicitly assume that , as otherwise some conditional expectations above might not be well defined. For the setting which we have in mind, this corresponds to the assumption that .

To gain some intuition about the IPC property it is instructive to examine the special case when . Under the notation we obtain

which can be seen as a form of iterated (as ) positive correlation between subsequent ’s and ’s. In other words, it quantifies, in a certain sense the fact that conditioned on for , it is more likely to see rather than . We are now ready to state the main technical lemma of the paper, which asserts that if a NFG satisfies IPC then .

Lemma 5.1

Let be a bipartite normal factor graph with a set of factors , a set of variables and bipartition . Let and be the distributions over induced by the left side and the right side of the bipartition of respectively. If the pair satisfies the IPC property then

A proof of Lemma 5.1 appears in Section 5.2. To conclude Theorem 3.2 from the above it suffices to argue that the real stability assumption on local polynomials implies IPC. This is the subject of the next lemma.

Lemma 5.2 (Real Stability implies IPC)

Let be a bipartite normal factor graph with a set of factors and a set of variables . Assume that all the polynomials corresponding to local functions (for ) are real stable. Let and be the distributions over induced by the left side and the right side of the bipartition of respectively. Then the pair satisfies the IPC property.

The proof of Lemma 5.2 appears in Section 5.3. We are now ready to deduce Theorem 3.2.

  • Proof of Theorem 3.2:   Lemma 5.1 asserts that the inequality $Z_B(G) \leq Z(G)$ holds under IPC. Further, by Lemma 5.2, the assumption of Theorem 3.2 (that the local polynomials are real stable) implies that IPC holds. Thus Theorem 3.2 follows.       

Remark 5.3

We note that the IPC condition is significantly more general than the real stability assumption in Theorem 3.2: there are examples of factor graphs which do not satisfy real stability, but for which IPC holds. The downside of IPC, however, is that there does not seem to be a simple way to verify it, especially since it is a global condition on the factor graph. On the other hand, the real stability assumption is only local and can be checked easily whenever the degrees of all factors are reasonably small.

5.2 Proof of the Lower Bound under IPC

In this section, the following linear operator on the set of polynomials is used.

Definition 5.2

Let be a real polynomial with being single variables and being tuples of variables. Define

In other words, first applies the differential operator to and then sets ; the result is a polynomial in the variables .

The lemma below explains how the IPC property is related to polynomials.

Lemma 5.4

Let be distributions over . Define the polynomials and . Further, for every let

For any number and for any two sequences of non-negative numbers, the polynomial is of the form

where (up to scaling) for every (as in Definition 5.1 with and ).

  • Proof:   We start by providing explicit formulas for the coefficients of . Note first that all the operators are linear. Hence it is enough to consider only one monomial , for .

    For this reason, the coefficient of in is equal to

    where and . In the language of probability this coefficient is equal to the probability that

    when and are distributed according to and respectively. Thus, when we consider for some , the corresponding coefficients are given by sums of the form

    Again, probabilistically this corresponds to

    and the lemma follows.       

Lemma 5.5 ([3])

Suppose is a bivariate multi-linear polynomial such that for all and , then for every

  • Proof:   Fix any . It is not hard to prove that for , the left hand side of the inequality is actually , hence we can focus on . Note also that we can assume that since if , we can keep increasing until the inequality becomes an equality, this way we might only increase the value of

    but stays the same. From it then follows that

    for some . Now using Lemma 4.2 we obtain

    Where , and . Therefore

    Where . What then remains to prove is that

    However, this follows from the fact that the KL-divergence between two probability distributions is nonnegative, when applied to: and .       

Lemma 5.6

Let be distributions over satisfying the IPC property. Then

  • Proof:   We proceed by induction. Observe first that can be obtained from by applying a sequence of differential operators. More precisely, let

    Note that is a constant polynomial given by

    (9)

    Let us fix , we prove that for every

    (10)

    Where and Note that for we obtain the lemma. We proceed by induction starting from the base case and go backwards with . The base case follows directly (with equality) from (9). Suppose now that and (10) has been proved for all with , we prove it for .       

Let us fix and values such that