Belief propagation was originally formulated by Judea Pearl as a distributed algorithm to perform statistical inference on probability distributions Pearl1982. His primary observation was that computing marginals is, in general, an expensive operation. However, if the probability distribution can be written as a product of smaller factors that only depend on a small subset of the variables, then one could possibly compute the marginals much faster. This “factorization” is captured by a corresponding graphical model. Pearl demonstrated that, when the graphical model is a tree, the belief propagation algorithm is guaranteed to converge to the exact marginals of the input probability distribution. If the algorithm is run on an arbitrary graph that is not a tree, then neither convergence nor correctness is guaranteed.
Max-product, which Pearl dubbed belief revision, is a variant of the belief propagation algorithm where the summations are replaced by maximizations. The goal of the max-product algorithm is to compute the assignment of the variables that maximizes a given objective function. In general, computing such an assignment is an NP-hard problem, but for graphical models possessing a single cycle the algorithm is guaranteed to converge to the maximizing assignment under a few mild assumptions singleweiss. Over arbitrary graphical models, the max-product algorithm may fail to converge malioutov or, worse, may converge to an assignment that is not optimal weisscomp. Despite these difficulties, max-product and its variants have found empirical success in a variety of application areas including statistical physics, combinatorial optimization matchingbayati sanghavi 5290304 ruozzi, computer vision, clustering affinity, error-correcting codes turbo, and the minimization of convex functions convexciamac; however, rigorously characterizing their behavior outside of a few well-structured instances has proved challenging.
In order to resolve the difficulties presented by the standard max-product algorithm, several alternate message passing schemes have been proposed to compute the maximizing assignment over arbitrary graphical models: MPLP MPLP, tree-reweighted max-product (TRMP) waintree, and max-sum diffusion (MSD) MSD. Recently, all of these algorithms were shown to be members of a class of “bound minimizing” algorithms for which, under a suitable update schedule, convergence of the algorithms is guaranteed weissconv.
The TRMP algorithm is the max-product analog of the tree-reweighted belief propagation algorithm (TRBP). TRBP, like belief propagation (BP), is an algorithm designed to compute the marginals of a given probability distribution. The key insight that the TRMP algorithm exploits is the observation that the max-product algorithm is correct on trees. The TRMP algorithm begins by choosing a probability distribution over spanning trees of the factor graph and then rewrites the original distribution as an expectation over spanning trees. With this simple rewriting and subsequent derivation of a new message passing scheme, one can show that, for discrete state spaces, TRMP guarantees correctness upon convergence to a unique estimate waintree. These results were expanded in subsequent works wainkol kolserial, and recently, a serial version of TRMP denoted TRW-S was shown to be provably convergent kolserial.
The MPLP algorithm is derived from a special form of the dual linear programming relaxation of the maximization problem. Over discrete state spaces, the algorithm is guaranteed to converge and is correct upon convergence to a unique estimate. Unlike the TRMP algorithm, the MPLP algorithm does not require a choice of parameters. Because choosing the constants for TRMP does require some care, the MPLP algorithm may seem preferable. However, as we will demonstrate by example, the constants provide some flexibility to overcome bad behavior of the algorithm. For example, there are applications over continuous state spaces for which the choice of constants is critical to convergence and correctness. One such example is the quadratic minimization problem. For this application, there exist positive definite matrices for which the TRMP message passing scheme does not converge to the correct minimizing assignment, regardless of the chosen distribution over spanning trees CISS2010.
We propose a new message passing scheme for solving the maximization problem based on a simple “splitting” heuristic. Our contributions include:
A simple and novel derivation of our message passing scheme for general factor graphs.
A message passing schedule and conditions under which our algorithm converges.
A simple choice of the parameters such that if each of the beliefs has a unique argmax then the output of the algorithm is a local optimum of the objective function.
A simple choice of the parameters such that if each of the beliefs has a unique argmax then the output of the algorithm is a global optimum of the objective function.
Conditions under which the algorithm cannot converge to a unique globally optimal estimate.
Unlike MPLP, TRMP, and MSD, the derivation of this algorithm is surprisingly simple, and the update rules closely mirror the standard max-product message updates. Because of its simplicity, we are able to present the algorithm in its most general form: our algorithm is not restricted to binary state spaces or pairwise factor graphs. More importantly, almost all of the intuition for the standard max-product algorithm can be extended with very little effort to our framework.
Like TRMP, our algorithm requires choosing a set of constants. Indeed, TRMP can be seen as a special case of our algorithm. However, unlike TRMP, any choice of non-zero constants will suffice to produce a valid message passing algorithm. In this way, our message passing scheme is more appropriately thought of as a family of message passing algorithms. We will show that, assuming the messages passed by the algorithm are always finite, there is always a simple choice of constants that will guarantee convergence. Further, if we are able to extract a unique estimate from the converged beliefs then this estimate is guaranteed to be the maximizing assignment.
The outline of this paper is as follows: in Section 2 we review the max-product algorithm and other relevant background material, in Section 3 we derive a new message passing algorithm by splitting factor nodes and prove some basic results, in Section 4 we explore the local and global optimality of the fixed points of our message passing scheme, in Section 5 we provide an alternate message passing schedule under which the algorithm is guaranteed to converge and demonstrate that the algorithm cannot always produce a tight lower bound to the objective function, in Section 6 we show how to strengthen the results of the previous sections for the special case in which the alphabet is binary and the factors are pairwise, and we conclude in Section 7.
Before we proceed to our results, we will briefly review the relevant background material pertaining to message passing algorithms. The focus of this paper will be on solving minimization problems for which we can write the objective function as a sum of functions over fewer variables. These “smaller” functions are called potentials. We note that this is equivalent to the problem of maximizing a product of non-negative potentials, as we can convert the maximum over a product of potentials into a minimum over a sum by taking negative logs. Although the max-product formulation is more popular in the literature, for notational reasons that will become clear in the sequel, we will use the min-sum formulation.
Let $f : \prod_{i=1}^{n} \mathcal{X}_i \to \mathbb{R} \cup \{\infty\}$, where each $\mathcal{X}_i$ is an arbitrary set (e.g., a finite alphabet or a continuous space). Throughout this paper, we will be interested in finding an element $x^* \in \prod_i \mathcal{X}_i$ that minimizes $f$, and as such, we will assume that there is such an element. For an arbitrary function, computing this minimum may be difficult, especially if $n$ is large. The basic observation of the min-sum algorithm is that, even though the original minimization problem may be difficult, if $f$ can be written as a sum of functions depending on only a small subset of the variables, then we may be able to minimize the global function by performing a series of minimizations over (presumably easier) sub-problems. To make this concrete, let $\mathcal{A} \subseteq 2^{\{1, \dots, n\}}$. We say that $f$ factorizes over $\mathcal{A}$ if we can write $f$ as a sum of real valued potential functions $\phi_i$ and $\psi_\alpha$ as follows:
$$f(x_1, \dots, x_n) = \sum_{i=1}^{n} \phi_i(x_i) + \sum_{\alpha \in \mathcal{A}} \psi_\alpha(x_\alpha). \qquad (1)$$
Every factorization of $f$ has a corresponding graphical representation known as a factor graph. The factor graph consists of a node for each variable $x_i$ and a factor node for each of the factors $\psi_\alpha$, with an edge joining the factor node corresponding to $\psi_\alpha$ to the variable node representing $x_i$ if $i \in \alpha$. For a concrete example, see Figure 1. The min-sum algorithm is a message passing algorithm on this factor graph. In the algorithm, there are two types of messages: messages passed from variable nodes to factor nodes and messages passed from factor nodes to variable nodes. On the $t$-th iteration of the algorithm, messages are passed along each edge of the factor graph as follows:
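where $\partial i$ denotes the set of all $\alpha \in \mathcal{A}$ such that $i \in \alpha$ (intuitively, this is the set of neighbors of variable node $x_i$ in the factor graph), $x_\alpha$ is the vector formed from the entries of $x$ by selecting only the indices in $\alpha$, and $\alpha \setminus i$ is abusive notation for the set-theoretic difference $\alpha \setminus \{i\}$.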
Each message update has an arbitrary normalization factor $\kappa$. Because $\kappa$ is not a function of any of the variables, it only affects the value of the minimum and not where the minimum is located. As such, we are free to choose it however we like for each message and each time step. In practice, these constants are used to avoid numerical issues that may arise during execution of the algorithm. We will think of the messages as a vector of functions indexed by the edge over which the message is passed.
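For reference, one common way to write the two updates is sketched below. The symbol names are assumptions chosen for illustration: $m^t_{i \to \alpha}$ and $m^t_{\alpha \to i}$ for the messages at iteration $t$, $\phi_i$ and $\psi_\alpha$ for the potentials, $\partial i$ for the factors containing $i$, and $\kappa$ for the normalization constant.

```latex
\begin{align*}
m^{t}_{i \to \alpha}(x_i) &= \kappa + \phi_i(x_i)
    + \sum_{\beta \in \partial i \setminus \alpha} m^{t-1}_{\beta \to i}(x_i), \\
m^{t}_{\alpha \to i}(x_i) &= \kappa + \min_{x_{\alpha \setminus i}}
    \Big[ \psi_\alpha(x_\alpha)
    + \sum_{k \in \alpha \setminus i} m^{t-1}_{k \to \alpha}(x_k) \Big].
\end{align*}
```

In this form, a variable node forwards its own potential plus everything it has heard from its other neighboring factors, and a factor node minimizes its potential plus the incoming variable messages over all variables other than the recipient.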
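A vector of messages $m$ is finite if for all $\alpha \in \mathcal{A}$, $i \in \alpha$, and $x_i \in \mathcal{X}_i$, $|m_{\alpha \to i}(x_i)| < \infty$ and $|m_{i \to \alpha}(x_i)| < \infty$.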
Any vector of finite messages is a valid choice for the vector of initial messages $m^0$, but the choice of initial messages can greatly affect the behavior of the algorithm. A typical assumption is that the initial messages are chosen such that $m^0_{i \to \alpha} \equiv 0$ and $m^0_{\alpha \to i} \equiv 0$.
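The message passing scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `min_sum`, `phi`, and `psi` are hypothetical, the state spaces are finite, all initial messages are zero, the normalization constant is fixed to zero, and beliefs are formed as a node potential plus the incoming factor-to-variable messages. The example is a three-variable chain, which is a tree.

```python
import itertools

def min_sum(phi, psi, iters):
    """Synchronous min-sum with zero initial messages (assumed setup).

    phi: list of dicts, phi[i][x_i] is the single-variable potential.
    psi: dict mapping a factor scope (tuple of variable indices) to a
         function of the corresponding values.
    Returns one belief dict per variable.
    """
    doms = [sorted(p) for p in phi]
    edges = [(i, a) for a in psi for i in a]
    v2f = {(i, a): dict.fromkeys(doms[i], 0.0) for i, a in edges}
    f2v = {(a, i): dict.fromkeys(doms[i], 0.0) for i, a in edges}
    for _ in range(iters):
        # Variable-to-factor: own potential plus messages from other factors.
        new_v2f = {
            (i, a): {x: phi[i][x]
                        + sum(f2v[(b, i)][x] for b in psi if i in b and b != a)
                     for x in doms[i]}
            for i, a in edges}
        # Factor-to-variable: minimize over every other variable in the scope.
        new_f2v = {}
        for i, a in edges:
            others = [k for k in a if k != i]
            msg = {}
            for x in doms[i]:
                costs = []
                for vals in itertools.product(*(doms[k] for k in others)):
                    assign = dict(zip(others, vals))
                    assign[i] = x
                    costs.append(psi[a](*(assign[k] for k in a))
                                 + sum(v2f[(k, a)][assign[k]] for k in others))
                msg[x] = min(costs)
            new_f2v[(a, i)] = msg
        v2f, f2v = new_v2f, new_f2v
    return [{x: phi[i][x] + sum(f2v[(a, i)][x] for a in psi if i in a)
             for x in doms[i]}
            for i in range(len(phi))]

# Illustrative chain x0 - x1 - x2 with binary variables.
phi = [{0: 0.0, 1: 1.0}, {0: 0.5, 1: 0.0}, {0: 0.0, 1: 0.2}]
psi = {(0, 1): lambda a, b: 0.0 if a == b else 2.0,
       (1, 2): lambda a, b: 0.0 if a != b else 1.5}
beliefs = min_sum(phi, psi, iters=6)
```

Because the example graph is a tree and no normalization is applied, the returned beliefs should agree with the min-marginals obtained by exhaustive enumeration once enough iterations have been run.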
We want to use the messages in order to construct an estimate of the min-marginals of $f$. A min-marginal of $f$ is a function of one variable obtained by minimizing the function $f$ over all of the remaining variables. The min-marginal for the variable $x_i$ would be $f_i(x_i) = \min_{x' : x'_i = x_i} f(x')$, which is a function of $x_i$. Given any vector of messages, $m^t$, we can construct a set of beliefs that are intended to approximate the min-marginals of $f$:
If $b_i = f_i$, then for any $y_i \in \arg\min_{x_i} b_i(x_i)$ there exists a vector $x^*$ such that $x^*_i = y_i$ and $x^*$ minimizes the function $f$. If $|\arg\min_{x_i} b_i(x_i)| = 1$ for all $i$, then we can take $x^* = y$, but, if the objective function has more than one optimal solution, then we may not be able to construct such an $x^*$ so easily. For this reason, one commonly assumes that the objective function has a unique global minimum. Although this assumption is common, we will not adopt this convention in this work. Unfortunately, because our beliefs are not necessarily the true min-marginals, we can only approximate the optimal assignment by computing an estimate of the argmin:
A vector, $b$, of beliefs admits a unique estimate, $x^*$, if $x^*_i \in \arg\min_{x_i} b_i(x_i)$ and the argmin is unique for each $i$.
If the algorithm converges to a collection of beliefs from which we can extract a unique estimate $x^*$, then we hope that the vector $x^*$ is indeed a global minimum of the objective function.
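To illustrate why uniqueness of the argmin matters, the following sketch computes exact min-marginals by enumeration and extracts an estimate coordinate by coordinate. The helper names `min_marginals` and `extract_estimate` and both toy objectives are hypothetical, introduced only for this example.

```python
import itertools

def min_marginals(f, domains):
    """Exact min-marginals by brute force: out[i][v] = min of f over x with x_i = v."""
    out = [dict.fromkeys(d, float("inf")) for d in domains]
    for x in itertools.product(*domains):
        value = f(x)
        for i, xi in enumerate(x):
            out[i][xi] = min(out[i][xi], value)
    return out

def extract_estimate(beliefs):
    """Pick a minimizer of each belief; report whether every argmin was unique."""
    estimate, unique = [], True
    for b in beliefs:
        best = min(b.values())
        argmins = [x for x, v in b.items() if v == best]
        unique = unique and len(argmins) == 1
        estimate.append(argmins[0])
    return tuple(estimate), unique

domains = [(0, 1), (0, 1)]
# Objective with a single global minimum at (0, 0).
f_unique = lambda x: x[0] + 2 * x[1] + 3 * (x[0] != x[1])
# Objective with two global minima, (0, 0) and (1, 1).
f_tied = lambda x: (x[0] - x[1]) ** 2

est_u, uniq_u = extract_estimate(min_marginals(f_unique, domains))
est_t, uniq_t = extract_estimate(min_marginals(f_tied, domains))
```

For `f_tied`, every min-marginal is identically zero, so choosing each coordinate independently could just as well produce `(0, 1)`, which is not a minimizer. This is the situation in which the beliefs do not admit a unique estimate.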
2.1 Computation Trees
An important tool in the analysis of the min-sum algorithm is the notion of a computation tree. Intuitively, the computation tree is an unrolled version of the original graph that captures the evolution of the messages passed by the min-sum algorithm needed to compute the belief at time $t$ at a particular node of the factor graph. Computation trees describe the evolution of the beliefs over time, which, in some cases, can help us prove correctness and/or convergence of the message passing updates.
The depth $t$ computation tree rooted at node $i$ contains all of the length $t$ non-backtracking walks in the factor graph starting at node $i$. For any node $v$ in the factor graph $G$, the computation tree at time $t$ rooted at $v$, denoted by $T_v(t)$, is defined recursively as follows: $T_v(0)$ is just the node $v$, the root of the tree. The tree $T_v(t)$ at time $t > 0$ is generated from $T_v(t-1)$ by adding to each leaf of $T_v(t-1)$ a copy of each of its neighbors in $G$ (and the corresponding edge), except for the neighbor that is already present in $T_v(t-1)$. Each node of $T_v(t)$ is a copy of a node in $G$, and the potentials on the nodes in $T_v(t)$, which operate on a subset of the variables in $T_v(t)$, are copies of the potentials of the corresponding nodes in $G$. The construction of a computation tree for the graph in Figure 1 is pictured in Figure 2. Note that each variable node in $T_v(t)$ represents a distinct copy of some variable $x_j$ in the original graph.
Given any initialization of the messages, $T_v(t)$ captures the information available to node $v$ at time $t$. At time $t = 0$, node $v$ has received only the initial messages from its neighbors, so $T_v(0)$ consists only of $v$. At time $t = 1$, $v$ receives the round one messages from all of its neighbors, so $v$’s neighbors are added to the tree. These round one messages depend only on the initial messages, so the tree terminates at this point. By construction, we have the following lemma:
The belief at node $v$ produced by the min-sum algorithm at time $t$ corresponds to the exact min-marginal at the root of $T_v(t)$ whose boundary messages are given by the initial messages. See, for example, tatjor02 and weisscomp.
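The recursive construction can be sketched as a non-backtracking unrolling of the factor graph. In this illustration, the function names and the example graph (three variables joined in a cycle by three pairwise factors `a`, `b`, `c`) are assumptions, not the graph of Figure 1.

```python
def computation_tree(adj, root, depth):
    """Unroll the graph from `root`: every node receives a copy of each of its
    neighbors except the one it was reached from. Returns (label, children)."""
    def grow(node, parent, remaining):
        if remaining == 0:
            return (node, [])
        return (node, [grow(nb, node, remaining - 1)
                       for nb in adj[node] if nb != parent])
    return grow(root, None, depth)

def count_nodes(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(count_nodes(child) for child in tree[1])

# Factor graph of a 3-cycle: variables x1, x2, x3 and pairwise factors a, b, c.
adj = {
    "x1": ["a", "c"], "x2": ["a", "b"], "x3": ["b", "c"],
    "a": ["x1", "x2"], "b": ["x2", "x3"], "c": ["x3", "x1"],
}
tree = computation_tree(adj, "x1", depth=4)
```

On this cyclic graph the tree keeps growing with the depth, and copies of the same variable appear on different branches (here `x3` is reached along both branches at depth 4), which is why the beliefs on the tree need not match the true min-marginals of the original objective.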
2.2 Fixed Point Properties
Computation trees provide us with a dynamic view of the min-sum algorithm. After a finite number of time steps, we hope that the beliefs on the computation trees stop changing. In practice, when the beliefs change by less than some small amount, we say that the algorithm has converged. If the messages of the min-sum algorithm converge then the converged messages must be fixed points of the message update equations.
Ideally, the converged beliefs would be the true min-marginals of the function . If the beliefs are the exact min-marginals, then the estimate corresponding to our beliefs would indeed be the global minimum. Unfortunately, the algorithm is only known to produce the exact min-marginals on special factor graphs (e.g. when the factor graph is a tree). Instead, we will show that the fixed point beliefs are almost like min-marginals. Like the messages, we will think of the beliefs as a vector of functions indexed by the nodes of the factor graph. Consider the following definitions:
A vector of beliefs, , is admissible for a function if
A vector of beliefs, , is min-consistent if for all and all :
Any vector of beliefs that satisfies these two properties provides a meaningful reparameterization of the original objective function. We can show that any vector of beliefs obtained from a fixed point of the message updates does indeed satisfy these two properties: For any vector of fixed point messages, the corresponding beliefs are admissible and min-consistent. See wainwright, Proposition 2 and lemmas 3 and 1 below.
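In the notation commonly used for min-sum, the two properties are often stated as follows. This is a sketch under assumed notation ($b_i$ and $b_\alpha$ for the variable and factor beliefs, $\kappa$ for an arbitrary constant), together with a short check that beliefs built from finite messages reparameterize $f$.

```latex
\begin{align*}
\text{(admissibility)} \quad
  & f(x) = \kappa + \sum_i b_i(x_i)
    + \sum_{\alpha} \Big[ b_\alpha(x_\alpha) - \sum_{k \in \alpha} b_k(x_k) \Big], \\
\text{(min-consistency)} \quad
  & \min_{x_{\alpha \setminus i}} b_\alpha(x_\alpha) = \kappa + b_i(x_i)
    \quad \text{for all } \alpha \text{ and all } i \in \alpha. \\[4pt]
\intertext{Assuming $b_i = \phi_i + \sum_{\alpha \in \partial i} m_{\alpha \to i}$,
$b_\alpha = \psi_\alpha + \sum_{k \in \alpha} m_{k \to \alpha}$ and
$m_{k \to \alpha} = b_k - m_{\alpha \to k}$ (up to constants):}
  & b_\alpha(x_\alpha) - \sum_{k \in \alpha} b_k(x_k)
    = \psi_\alpha(x_\alpha) - \sum_{k \in \alpha} m_{\alpha \to k}(x_k), \\
  & \sum_i b_i(x_i) + \sum_\alpha \Big[ \psi_\alpha(x_\alpha)
    - \sum_{k \in \alpha} m_{\alpha \to k}(x_k) \Big]
    = \sum_i \phi_i(x_i) + \sum_\alpha \psi_\alpha(x_\alpha) = f(x).
\end{align*}
```

The message terms cancel because each $m_{\alpha \to k}$ appears once with a plus sign inside $b_k$ and once with a minus sign inside the bracket for $\alpha$.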
For any objective function such that for all , , there always exists a fixed point of the min-sum message passing updates (see Theorem 2 of wainwright). Moreover, the min-sum algorithm is guaranteed to converge to the correct solution on factor graphs that are trees. However, convergence and correctness for arbitrary factor graphs have only been demonstrated for a few special cases wainwright singleweiss.
3 A General Splitting Heuristic
In this section, we introduce a family of message passing algorithms parameterized by a vector of reals. The intuition for this family of algorithms is simple: given any factorization of the objective function , we can split any of the factors into several pieces and obtain a new factorization of the objective function . The standard notation masks the fact that each of the potentials may further factorize into smaller pieces. For example, suppose we are given the objective function . There are many different ways that we can factorize :
Each of these represents a factorization of $f$ into a different number of potentials (the parentheses indicate a single potential function). All of these can be captured by the standard min-sum algorithm except for the last. Recall that $\mathcal{A}$ was taken to be a subset of $2^{\{1, \dots, n\}}$. In order to accommodate the factorization given by Equation 10, we will now allow $\mathcal{A}$ to be a multiset over the set $2^{\{1, \dots, n\}}$. We can then construct the factor graph as before with a distinct factor node for each element of the multiset $\mathcal{A}$. We can use the standard min-sum algorithm in an attempt to compute the minimum of $f$ given this new factorization.
We could, of course, rewrite the objective function in many different ways. However, arbitrarily rewriting the objective function could significantly increase the size of the factor graph, and such rewriting may not make the minimization problem any easier. In this paper, we will focus on one special rewriting of the objective function. Suppose $f$ factorizes over $\mathcal{A}$ as in Equation 1. Let $G$ be the corresponding factor graph. Suppose now that we take one potential $\alpha \in \mathcal{A}$ and split it into $k$ potentials $\alpha_1, \dots, \alpha_k$ such that for each $j \in \{1, \dots, k\}$, $\psi_{\alpha_j} = \psi_\alpha / k$. This allows us to rewrite the objective function, $f$, as
This rewriting does not change the objective function, but it does produce a new factor graph (see Figure 3). Now, take some and consider the messages and given by the standard min-sum algorithm:
where denotes the neighbors of in . Notice that there is an automorphism of the graph that maps to . As the messages passed from any node only depend on the messages received at the previous time step, if the initial messages are the same at both of these nodes, then they must produce identical messages at time 1. More formally, if we initialize the messages identically over each split edge, then, at any time step , and for any by symmetry (i.e. there is an automorphism of the graph that maps to ). Because of this, we can rewrite the message from to as:
Notice that Equation 18 can be viewed as a message passing algorithm on the original factor graph. The primary difference then between Equation 18 and the standard min-sum updates is that the message passed from to now depends on the message from to .
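The symmetry argument can be checked numerically. The sketch below runs standard min-sum on a graph in which one factor has been split into `k` identical copies, then runs a collapsed update on the original graph, and compares the messages. The function `run`, the multiplicity vector `c`, the binary alphabet, and the toy potentials are all assumptions made for this illustration; the collapsed update in the docstring is our reading of the derivation above rather than a quotation of Equation 18.

```python
import itertools
import math

VALS = (0, 1)  # binary alphabet, chosen only for illustration

def run(phi, factors, c, iters):
    """Synchronous min-sum with a multiplicity c[a] for each factor a.

    factors: list of (scope, fn); factor a uses the potential fn / c[a].
      m_{i->a} = phi_i + (c[a] - 1) m_{a->i} + sum_{b != a} c[b] m_{b->i}
      m_{a->i} = min over the other variables of
                 [ fn / c[a] + sum_{k in a, k != i} m_{k->a} ]
    With c[a] = 1 for every factor this is standard min-sum.
    """
    scopes = [s for s, _ in factors]
    v2f = {(i, a): dict.fromkeys(VALS, 0.0) for a, s in enumerate(scopes) for i in s}
    f2v = {(a, i): dict.fromkeys(VALS, 0.0) for a, s in enumerate(scopes) for i in s}
    for _ in range(iters):
        new_v2f = {}
        for (i, a) in v2f:
            new_v2f[(i, a)] = {
                x: phi[i][x] + (c[a] - 1) * f2v[(a, i)][x]
                   + sum(c[b] * f2v[(b, i)][x]
                         for b, s in enumerate(scopes) if i in s and b != a)
                for x in VALS}
        new_f2v = {}
        for (a, i) in f2v:
            scope, fn = factors[a]
            others = [k for k in scope if k != i]
            msg = {}
            for x in VALS:
                best = float("inf")
                for combo in itertools.product(VALS, repeat=len(others)):
                    assign = dict(zip(others, combo))
                    assign[i] = x
                    cost = (fn(*(assign[k] for k in scope)) / c[a]
                            + sum(v2f[(k, a)][assign[k]] for k in others))
                    best = min(best, cost)
                msg[x] = best
            new_f2v[(a, i)] = msg
        v2f, f2v = new_v2f, new_f2v
    return v2f, f2v

def same(m1, m2):
    """Compare two messages entrywise, allowing for floating-point rounding."""
    return all(math.isclose(m1[x], m2[x], rel_tol=1e-9, abs_tol=1e-9) for x in VALS)

phi = [{0: 0.0, 1: 0.3}, {0: 0.2, 1: 0.0}, {0: 0.0, 1: 0.1}]
f01 = lambda a, b: 1.5 * (a != b)
f12 = lambda a, b: 0.7 * (a == b)
f02 = lambda a, b: 0.4 * a * b
k = 3
orig = [((0, 1), f01), ((1, 2), f12), ((0, 2), f02)]
# Split graph: k copies of the first factor, each carrying f01 / k.
split = [((0, 1), lambda a, b: f01(a, b) / k)] * k + orig[1:]
v_split, f_split = run(phi, split, [1] * len(split), iters=5)
v_coll, f_coll = run(phi, orig, [k, 1, 1], iters=5)
```

With identical initial messages, every copy of the split factor should send and receive the same messages as the single collapsed factor, so the run on the original graph with `c = [k, 1, 1]` reproduces the run on the larger split graph.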
Analogously, we can also split the variable nodes. Suppose factorizes over as in Equation 1. Let be the corresponding factor graph. Suppose now that we take one variable and split it into variables such that for each , . This produces a new factor graph, . Because are all the same variable, we must add a constraint to ensure that they are indeed the same. Next, we need to modify the potentials to incorporate the constraint and the change of variables. We will construct such that for each with there is a in . Define where is the 0-1 indicator function for the equality constraint. For each with we simply add to with its old potential. For an example of this construction, see Figure 4. This rewriting produces a new objective function
Minimizing is equivalent to minimizing . Again, we will show that we can collapse the min-sum message passing updates over to message passing updates over with modified potentials. Take some containing the new variable which augments the potential and consider the messages and given by the standard min-sum algorithm:
Again, if we initialize the messages identically over each split edge, then, at any time step , and for any by symmetry. Using this, we can rewrite the message from to as:
By symmetry, we only need to perform one message update to compute for each . As a result, we can think of these messages as being passed on the original factor graph .
The combined message updates for each of these splitting operations are presented in Algorithm 1. Observe that if we choose for each and for each , then the message updates described in the algorithm are exactly the min-sum message passing updates described in the preliminaries. Rewriting the message updates in this way seems purely cosmetic, but as we will show in the following sections, the choice of the vector can influence both the convergence and correctness of the algorithm.
We define the beliefs corresponding to the new message updates as follows:
Compare these definitions with Equations 4 and 5. Notice that the bracketed expression in Equation 28 is the definition of . As we will see in Lemma 3, if we define the beliefs in this way, then any vector of finite messages will produce a vector of admissible beliefs. The beliefs are still approximating the min-marginals of , but each variable node has been split times and each factor node has been split times. Applying Definition 2.2 to the new factor graph , a vector of beliefs is admissible for our new message passing algorithm if
Throughout this discussion, we have assumed that the vector contained only positive integers. If we allow to be an arbitrary vector of non-zero reals, then the notion of splitting no longer makes sense. Instead, we will think of the vector as parameterizing a specific factorization of the function . The definitions of the message updates and the beliefs are equally valid for any choice of non-zero real constants. In what follows, we will explore the properties of this new message passing scheme for a vector of non-zero reals.
As before, we want the fixed point beliefs produced by our message passing scheme to behave like min-marginals (i.e. to be min-consistent) and to produce a reparameterization of the objective function. Using the definitions above, we have the following lemmas:
Let factorize over . If is a vector of non-zero reals, then for any vector of finite messages, , the corresponding beliefs are admissible. Let be the vector of messages given in the statement of the lemma and the corresponding vector of beliefs. For any set of messages, we can rewrite the belief as:
where the last line follows by observing that . Notice that this subtraction of the messages only makes sense if the messages are finite valued.
The previous lemma guarantees that any vector of finite messages produces admissible beliefs, but an analogous lemma is not true for min-consistency. We require a stronger assumption about the vector of messages in order to ensure min-consistency:
Let be a fixed point of the message updates in Algorithm 1. The corresponding beliefs, b, are min-consistent. Up to an additive constant, we can write,
Again, for any objective function such that for all , , there always exists a fixed point of the min-sum message passing updates (see Theorem 2 of wainwright). The proof of this statement can be translated almost exactly for our message passing updates, and we will not reproduce it here.
3.1 Computation Trees
The computation trees produced by the synchronous splitting algorithm are different from their predecessors. Again, the computation tree captures the messages that would need to be passed in order to compute . However, the messages that are passed in the new algorithm are multiplied by a non-zero constant. As a result, the potential at a node in the computation tree corresponds to some potential in the original graph multiplied by a constant that depends on all of the nodes above in the computation tree. We summarize the changes as follows:
The message passed from to may now depend on the message from to at the previous time step. As such, we now form the time computation tree from the time computation tree by taking any leaf , which is a copy of node in the factor graph, of the time computation tree, creating a new node for every , and connecting to these new nodes. As a result, the new computation tree rooted at node of depth contains at least all of the non-backtracking walks of length in the factor graph starting from and, at most, all walks of length in the factor graph starting at .
The messages are weighted by the elements of . This changes the potentials at the nodes in the computation tree. For example, suppose the computation tree was rooted at variable node and that depends on the message from to . Because is multiplied by in , every potential along this branch of the computation tree is multiplied by . To make this concrete, we can associate a weight to every edge of the computation tree that corresponds to the constant that multiplies the message passed across that edge. To compute the new potential at a variable node in the computation tree, we now need to multiply the corresponding potential by each of the weights corresponding to the edges that appear along the path from to the root of the computation tree. An analogous process can be used to compute the potentials at each of the factor nodes. The computation tree produced by the splitting algorithm at time for the factor graph in Figure 1 is pictured in Figure 5. Compare this with the computation tree produced by the standard min-sum algorithm in Figure 2.
If we make these adjustments and all of the weights are positive, then the belief, , at node at time is given by the min-marginal at the root of . If some of the weights are negative, then is computed by maximizing over each variable in whose self-potential has a negative weight and minimizing over each variable whose self-potential has a non-negative weight. In this way, the beliefs correspond to marginals at the root of these computation trees.
4 Optimality of Fixed Points
Empirically, the standard min-sum algorithm need not converge and, even if it does, the estimate produced at convergence need not actually minimize the objective function. Up until this point, we have not placed any restriction on the vector except that all of its entries are non-zero. Still, we know from the TRMP case that certain choices of the parameters are better than others: some ensure that the estimate obtained at a fixed point is correct.
From the previous section, we know that the fixed point beliefs produced by Algorithm 1 are admissible and min-consistent. From these fixed point beliefs, we construct a fixed point estimate such that . If the objective function had a unique global minimum and the fixed point beliefs were the true min-marginals, then would indeed be the global minimum. Now, suppose that the are not the true min-marginals. What can we say about the optimality of any vector such that ? What can we say if there is a unique vector with this property? Our primary tool for answering these questions will be the following lemma:
Let be a vector of min-consistent beliefs. If there exists a unique estimate that minimizes for each , then also minimizes and, for any , minimizes . Because the beliefs are min-consistent for any , we have:
From this, we can conclude that there is some that minimizes with . Further, because the minimum is unique for each , must minimize . Now fix a vector and consider
This lemma will be a crucial building block of many of the theorems in this paper, and many variants of this lemma have been proven in the literature (e.g. Lemma 4 in wainwright and Theorem 1 in convexweiss).
Using this lemma and the observation of Lemma 3 that can be written as a sum of the beliefs, we can convert questions about the optimality of the vector into questions about the choice of parameters. We will show how to choose the and such that we will be guaranteed some form of optimality for a collection of admissible and min-consistent beliefs.
4.1 Local Optimality
A function has a local optimum at the point if there is some neighborhood of such that does not increase in that neighborhood. The definition of neighborhood is metric dependent, and in the interest of keeping our results applicable to a wide variety of spaces, we will choose the metric to be the Hamming distance. For any two vectors , the Hamming distance is the number of entries in which the two vectors differ. For the purposes of this paper, we will restrict our definition of local optimality to vectors within Hamming distance one:
$x$ is a local minimum of the objective function, $f$, if for every vector $y$ that has at most one entry different from $x$, $f(x) \le f(y)$.
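This Hamming-distance-one notion of local optimality is easy to test by enumeration over finite alphabets. The sketch below is illustrative only; the function name `is_hamming_local_min` and the toy objective are hypothetical.

```python
def is_hamming_local_min(f, x, domains):
    """Return True if no single-coordinate change of x strictly decreases f."""
    fx = f(x)
    for i, dom in enumerate(domains):
        for v in dom:
            if v != x[i]:
                y = x[:i] + (v,) + x[i + 1:]  # differs from x only in entry i
                if f(y) < fx:
                    return False
    return True

# Toy objective on two binary variables: global minimum at (1, 1),
# with (0, 0) a local minimum that is not global.
table = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 0.0}
f = lambda x: table[x]
domains = [(0, 1), (0, 1)]

local_00 = is_hamming_local_min(f, (0, 0), domains)
local_11 = is_hamming_local_min(f, (1, 1), domains)
local_01 = is_hamming_local_min(f, (0, 1), domains)
```

The point `(0, 0)` passes the check even though `(1, 1)` has a lower objective value, because reaching `(1, 1)` requires changing two entries at once. This shows how local optimality in this sense is weaker than global optimality.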
Our notion does not necessarily coincide with other notions of local optimality from the literature wainwright. If the standard min-sum algorithm converges to a unique estimate, then is locally optimal in the following sense: is a global minimum of the reparameterization when it is restricted to factor-induced subgraphs of the factor graph that contain exactly one cycle wainwright. However, is not necessarily a global optimum of the objective function. Suppose the factor graph consists of only pairwise factors. In this case, the collection of nodes formed by taking some variable node and every node in its two-hop neighborhood must be a tree, . The restriction of the reparameterization to this tree is given by:
contains every part of the reparameterization that depends on the variable , and minimizes . As a result, we observe that if we change only the value of , then we cannot decrease the value of and, consequently, we cannot decrease the objective function. In this case, the local optimality condition in wainwright does imply local optimality in our sense. However, if the factorization is not pairwise, then the two-hop neighborhood of any node is not necessarily cycle free (see Figure 6). Consequently, the notion of optimality from wainwright need not correspond to Definition 4.1 for graphs where the factorization is not pairwise.
We will show that there exist choices of the parameters for which any fixed point estimate extracted from a vector of admissible and min-consistent beliefs that simultaneously minimizes all of the beliefs is guaranteed to be locally optimal with respect to the Hamming distance. In order to prove such a result, we first need to relate the minima of the fixed point beliefs to the minima of the objective function. By Lemma 3, the objective function can be written as a sum of the beliefs. Let be a vector of admissible beliefs for the function . Define . For a fixed , we can lower bound the optimum value of the objective function as follows: