## 1 Introduction

Undirected graphical models, also known as Markov random fields (MRFs), provide a framework for modeling high-dimensional distributions with dependent variables. Ising models are a special class of discrete pairwise graphical models originating in statistical physics. They have numerous applications in computer vision [13], bioinformatics [7], and social networks [3]. Explicitly, the joint distribution of an Ising model is given by

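$$p(x) = \frac{1}{Z}\exp\Big(\beta\Big(\sum_{i<j} J_{ij}\, x_i x_j + \sum_{i} h_i x_i\Big)\Big) \tag{1}$$

where $x_1, \dots, x_n$ are random variables valued in a binary alphabet $\{-1, +1\}$ (also known as "spins"), $J_{ij}$ represents the pairwise interaction between spin $i$ and spin $j$, $h_i$ represents the external field for spin $i$, $\beta$ is the reciprocal temperature, and $Z$ is a normalization constant called the *partition function*.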

Historically, Ising models were proposed to study ferromagnetism. However, computational complexity is the main challenge when performing sampling and inference on Ising models. The literature offers several ways to tackle it. One approach is Markov chain Monte Carlo (MCMC). A well-known example is Gibbs sampling, which is a special case of the Metropolis–Hastings algorithm. Gibbs sampling repeatedly resamples one random variable from its conditional distribution given the current values of all the others. It can be shown that Gibbs sampling generates a reversible Markov chain of samples. Thus, the stationary distribution of the Markov chain is the desired joint distribution over the random variables, and it is approached after the *burn-in period*. However, it is also well known that Gibbs sampling can become trapped on multi-modal distributions. For example, Smith and Roberts [15] and Mengersen [8] show that when the joint distribution is bi-modal, the Gibbs sampling iterations may be trapped in one of the modes, which hinders convergence.
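The single-site resampling step described above can be sketched in a few lines. This is a minimal illustration rather than code from any cited work: it assumes spins in $\{-1,+1\}$, a symmetric coupling matrix `J` with zero diagonal, a field vector `h`, and a joint distribution proportional to $\exp(\beta(\sum_{i<j} J_{ij}x_ix_j + \sum_i h_i x_i))$. The function names `gibbs_step` and `run_gibbs` are hypothetical.

```python
import numpy as np

def gibbs_step(x, J, h, beta, rng):
    """Resample one uniformly chosen spin from its conditional distribution."""
    i = rng.integers(len(x))
    # Effective field on spin i from the other spins and the external field.
    field = J[i] @ x - J[i, i] * x[i] + h[i]
    # P(x_i = +1 | all other spins) under the assumed joint distribution.
    p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * field))
    x[i] = 1 if rng.random() < p_up else -1

def run_gibbs(J, h, beta, n_sweeps, burn_in, seed=0):
    """Run random-scan Gibbs sampling; one sweep is n single-site updates."""
    rng = np.random.default_rng(seed)
    n = len(h)
    x = rng.choice([-1, 1], size=n)
    samples = []
    for t in range(n_sweeps):
        for _ in range(n):
            gibbs_step(x, J, h, beta, rng)
        if t >= burn_in:  # discard the burn-in period
            samples.append(x.copy())
    return np.array(samples)

# Two ferromagnetically coupled spins with no external field.
J = np.array([[0.0, 1.0], [1.0, 0.0]])
h = np.zeros(2)
samples = run_gibbs(J, h, beta=0.5, n_sweeps=20000, burn_in=1000)
print("empirical E[x0 x1]:", np.mean(samples[:, 0] * samples[:, 1]))
print("exact tanh(beta J):", np.tanh(0.5))
```

For this two-spin example the exact correlation is $\tanh(\beta J)$, so the empirical average should land close to it. Raising `beta` makes the two aligned configurations dominant and the chain switches between them less often, which is the trapping behavior discussed above.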

Another popular way to get around the computational complexity is to use *variational methods*, which approximate the joint distribution. These methods usually turn the inference problem with respect to the approximate joint distribution into a non-convex optimization problem, and solve it either by standard optimization methods, e.g., gradient descent, or by specialized algorithms such as belief propagation. However, due to the non-convexity, those methods usually come with no theoretical guarantee that the solution converges to the global optimum.

Belief propagation (BP) is an effective numerical method for solving inference problems on graphical models. It was originally proposed by Pearl [12] for tree-like graphs. Since then it has played a fundamental role in numerous applications, including coding theory [4, 14], constraint satisfaction problems [1], and community detection in the stochastic block model [2]. It is well known that belief propagation is only exact for a model on a graph with locally tree-like structures. A long-standing question is how belief propagation performs, in theory, on loopy graphs.

We now describe the related work and our contributions.

Related work and our contribution

In a classic work, Yedidia et al. [16] establish the connection between belief propagation and the Bethe free energy. They show that there is a one-to-one correspondence between the fixed points of belief propagation and the stationary points of the Bethe free energy. Following their work, it is known that the Bethe free energy at the critical points can be represented in terms of the fixed-point messages of belief propagation [10]. In a recent work, Koehler [6] further studies the properties of the Bethe free energy at the critical points, and shows that for ferromagnetic Ising models, when initialized with all-one messages, belief propagation converges to the fixed point corresponding to the global maximum of the Bethe free energy. However, those theories consider either asymptotically locally tree-like graphs or loopy graphs with simple edges. Real technological, social, and biological networks have numerous short and large loops and other complex motifs, which lead to non-tree-like structures and essentially loopy graphs with hyper-edges. Newman [11, 5] and Miller [9] independently propose a model of random graphs with arbitrary distributions of motifs, and Yoon et al. [17] generalize belief propagation to graphs with motifs.

Our work builds on generalized belief propagation on graphs with motifs [17] and on the convergence of belief propagation for ferromagnetic Ising models on loopy graphs with simple edges [6]. In this paper, we show that for ferromagnetic Ising models on graphs with motifs, with all messages initialized to one, generalized belief propagation converges to the fixed point corresponding to the global maximum of the Bethe free energy.

## 2 Ising Models on Graphs with Motifs

Let us introduce the concept of graphs with motifs. In graphs with motifs, each vertex belongs to a given set of motifs. As shown in Fig. 0(a), different motifs can be attached to a vertex: a simple edge, a triangle, a square, a pentagon, and other non-clique motifs. Graphs with motifs can be viewed as hyper-graphs in which motifs play the role of hyper-edges. The number of motifs of a given type attached to a vertex equals its hyper-degree with respect to that motif type. In this paper, for simplicity, we only consider simple motifs such as simple edges and cliques.

Consider the Ising model with arbitrary order of interactions among vertices in each motif on a hyper-graph. Let denote a cluster of size attached to vertex , where vertices together with form the motif. And let denote the random variable of spin configurations, the Hamiltonian of the model is

(2) |

where the first sum corresponds to the external fields at each vertex, the second sum corresponds to the pairwise interactions on simple edges, the third sum corresponds to the higher order interactions among spins in triangles, the fourth sum corresponds to the higher order interactions among spins in squares, and so on. As discussed in the previous section, most previous works focus on Ising models with pairwise interactions. In this paper, we are interested in Ising models with higher order interactions. For simplicity, we consider Ising models with only external fields and higher order interactions in triangles. Our derivation can be extended to more general cases.

For Ising models with only external fields and higher order interactions in triangles, the Hamiltonian of the model is

(3) |

where is a triangle, which can also be denoted as , , or .

By Boltzmann’s law, the joint distribution is defined by

(4) |

where is the *partition function*.
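For small systems, the Hamiltonian and the Boltzmann distribution can be evaluated by brute-force enumeration, which is useful as a ground truth when checking approximate inference. The sketch below is illustrative only. It assumes spins in $\{-1,+1\}$ and the sign convention $E(S) = -\sum_i H_i S_i - \sum_{\langle ijk\rangle} J_{ijk} S_i S_j S_k$ for the triangle-only model, which may differ from the convention of the original equations. The names `energy` and `exact_distribution` are hypothetical.

```python
import itertools
import numpy as np

def energy(s, H, triangles, J):
    """Energy with external fields and three-spin triangle interactions.

    Assumed convention: E(s) = -sum_i H_i s_i - sum_t J_t s_i s_j s_k.
    """
    e = -float(np.dot(H, s))
    for (i, j, k), J_t in zip(triangles, J):
        e -= J_t * s[i] * s[j] * s[k]
    return e

def exact_distribution(H, triangles, J, beta=1.0):
    """Enumerate all 2^n spin configurations and normalize Boltzmann weights."""
    n = len(H)
    states = list(itertools.product([-1, 1], repeat=n))
    weights = np.array(
        [np.exp(-beta * energy(np.array(s), H, triangles, J)) for s in states]
    )
    Z = weights.sum()  # partition function
    return states, weights / Z, Z

# A single triangle with unit coupling and no external field.
states, probs, Z = exact_distribution(H=[0.0, 0.0, 0.0], triangles=[(0, 1, 2)], J=[1.0])
print("Z =", Z, " 8*cosh(1) =", 8 * np.cosh(1.0))
```

In the single-triangle example the product of the three spins is $+1$ for four configurations and $-1$ for the other four, so $Z = 4e + 4e^{-1} = 8\cosh(1)$. Enumeration costs $O(2^n)$ and is only practical for a handful of spins.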

Throughout this paper, we focus on *ferromagnetic* Ising models, which are defined below.

###### Definition 1.

An Ising model is ferromagnetic if for all triangle motifs and for all .

We introduce an intermediate message from a motif to spin .

(5) |

In the literature, different works have different definitions of messages. is not the message definition we eventually work with in this paper, but it helps to understand the connections between different works. So, abusing the terminology a little bit, we call it ‘intermediate message’.

By the definition of generalized Belief Propagation, the probability that spin is in a state is determined by the normalized product of incoming intermediate messages from motifs attached to spin and the external field factor ,

(6) |

where is a normalization constant. And the belief update rule is given by:

(7) |

where is an energy of the interaction among spins in the triangle , and is a normalization constant.
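To make the message-passing structure concrete, the following sketch runs standard sum-product on a factor graph whose factors are triangles. It is not the exact parameterization of Equations (5) to (7): it assumes spins in $\{-1,+1\}$, triangle factors $\exp(\beta J_t S_i S_j S_k)$, local factors $\exp(\beta H_i S_i)$, and all messages initialized to one. The function names are hypothetical, and the brute-force routine is included only to check the result.

```python
import itertools
import numpy as np

SPINS = (-1, 1)  # index 0 is spin -1, index 1 is spin +1

def triangle_bp(H, triangles, J, beta=1.0, n_iters=20):
    """Sum-product with triangle factors; returns one-node beliefs."""
    n = len(H)
    keys = [(t, i) for t, tri in enumerate(triangles) for i in tri]
    to_spin = {key: np.ones(2) for key in keys}   # triangle t -> spin i
    to_tri = {key: np.ones(2) for key in keys}    # spin i -> triangle t
    local = [np.exp(beta * H[i] * np.array(SPINS)) for i in range(n)]
    for _ in range(n_iters):
        # Spin -> triangle: local factor times messages from the other triangles.
        for (t, i) in keys:
            msg = local[i].copy()
            for (t2, i2) in keys:
                if i2 == i and t2 != t:
                    msg = msg * to_spin[(t2, i2)]
            to_tri[(t, i)] = msg / msg.sum()
        # Triangle -> spin: sum over the two other spins of the triangle.
        for (t, i) in keys:
            j, k = [v for v in triangles[t] if v != i]
            msg = np.zeros(2)
            for a, si in enumerate(SPINS):
                for b, sj in enumerate(SPINS):
                    for c, sk in enumerate(SPINS):
                        w = np.exp(beta * J[t] * si * sj * sk)
                        msg[a] += w * to_tri[(t, j)][b] * to_tri[(t, k)][c]
            to_spin[(t, i)] = msg / msg.sum()
    beliefs = np.array(local)
    for (t, i) in keys:
        beliefs[i] = beliefs[i] * to_spin[(t, i)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)

def exact_marginals(H, triangles, J, beta=1.0):
    """Brute-force one-node marginals for comparison."""
    n = len(H)
    marg = np.zeros((n, 2))
    for s in itertools.product(SPINS, repeat=n):
        logw = sum(H[i] * s[i] for i in range(n))
        logw += sum(J[t] * s[i] * s[j] * s[k] for t, (i, j, k) in enumerate(triangles))
        for i in range(n):
            marg[i, (s[i] + 1) // 2] += np.exp(beta * logw)
    return marg / marg.sum(axis=1, keepdims=True)

# Two triangles sharing vertex 2: a tree-like hyper-graph.
H = [0.1, 0.2, 0.0, 0.3, 0.1]
triangles = [(0, 1, 2), (2, 3, 4)]
J = [0.5, 0.8]
print(np.max(np.abs(triangle_bp(H, triangles, J) - exact_marginals(H, triangles, J))))
```

On this tree-like example the beliefs agree with the exact marginals up to floating-point error. On hyper-graphs with loops the same updates can still be run, but the beliefs are then only approximations.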

Multiplying Equation (7) by and summing over all spin configurations, we obtain an equation for the effective field ,

(8) |

where

(9) |

(10) |

(11) |

For more detailed explanations of Equations (7) to (11), please refer to [17].

Now, define a message from a spin to motif as . More specifically, if the motif is a triangle , the message can be alternatively represented as . From now on, let the reciprocal temperature , we can further simplify Equation (8) as

## 3 Bethe Free Energy of Higher Order Ising Models

In order to get the Bethe free energy of our higher order Ising model (3), we need to go through the Gibbs variational principle as Yedidia et al. [16] did for standard Ising models with pairwise interactions. Let be a joint distribution defined by our model (4). If we have some approximate joint distribution , from Gibbs variational principle, we can write Gibbs free energy as

(13) | ||||

(14) |

where is called the *average energy*, and is the *entropy*.

We would like to derive a Gibbs free energy that is a function of both the one-node beliefs and the three-node beliefs . The beliefs should satisfy the normalization conditions and the marginalization conditions. In other words, lies in the following polytope of locally consistent distributions

(15) | ||||

Because we only consider external fields and higher order interactions with triangles in our model, the one-node and three-node beliefs are actually sufficient to determine the average energy. For our model (3) and for any approximate joint probability such that one-node marginal probabilities are and the three-node marginal probabilities are , the average energy will have the form

(16) |

The average energy computed with the true marginal probabilities and will also have this form, so if one-node and three-node beliefs are exact, the average energy given by Equation (16) will be exact.

For computing the entropy, we usually need an approximation. We can compute the entropy exactly if we can explicitly express the joint distribution in terms of the one-node and three-node beliefs. If our graph were a tree-like hyper-graph with triangle motifs only (see Fig. 0(b) as an example), we could in fact do that. In that case, we can represent the joint probability distribution in the form

(17) |

where is the hyper-degree of node .
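The factorization on a tree-like hyper-graph can be checked numerically. The sketch below assumes that Equation (17) has the standard hyper-tree form, namely the product of the three-node marginals over triangles times the product of the one-node marginals raised to the power $1 - d_i$, where $d_i$ is the hyper-degree. It also assumes spins in $\{-1,+1\}$ and a joint distribution proportional to $\exp(\sum_i H_i S_i + \sum_t J_t S_i S_j S_k)$. The function name `tree_factorization` is hypothetical.

```python
import itertools
from collections import defaultdict
import numpy as np

def tree_factorization(H, triangles, J):
    """Return the exact joint p and its reconstruction q from marginals."""
    n = len(H)
    states = list(itertools.product((-1, 1), repeat=n))
    logw = [
        sum(H[i] * s[i] for i in range(n))
        + sum(J_t * s[i] * s[j] * s[k] for J_t, (i, j, k) in zip(J, triangles))
        for s in states
    ]
    p = np.exp(logw)
    p = p / p.sum()

    # Exact one-node and three-node marginals by direct summation.
    node = [defaultdict(float) for _ in range(n)]
    tri = [defaultdict(float) for _ in triangles]
    for s, ps in zip(states, p):
        for i in range(n):
            node[i][s[i]] += ps
        for t, (i, j, k) in enumerate(triangles):
            tri[t][(s[i], s[j], s[k])] += ps

    # Hyper-degree: number of triangles attached to each vertex.
    degree = [sum(i in t for t in triangles) for i in range(n)]

    q = np.array([
        np.prod([tri[t][(s[i], s[j], s[k])] for t, (i, j, k) in enumerate(triangles)])
        * np.prod([node[i][s[i]] ** (1 - degree[i]) for i in range(n)])
        for s in states
    ])
    return p, q

# Two triangles sharing vertex 2 (hyper-degrees 1, 1, 2, 1, 1).
p, q = tree_factorization([0.1, 0.2, 0.0, 0.3, 0.1], [(0, 1, 2), (2, 3, 4)], [0.5, 0.8])
print("max |p - q| =", np.max(np.abs(p - q)))
```

For this example the reconstruction matches the joint distribution up to rounding, because the shared vertex separates the two triangles. If the triangles were arranged so that the hyper-graph contained a loop, `q` would generally differ from `p` and need not even sum to one.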

Using Equation (17), we get the Bethe approximation to the entropy as

(18) |

Combining Equations (16) and (18), we obtain the Bethe free energy

(19) | ||||

(20) |

Notice that when the hyper-graph is a tree, the Bethe free energy has the correct functional dependence on the beliefs, and solving the optimization problem of maximizing over the polytope of locally consistent distributions (15) gives the true marginals. For loopy hyper-graphs, the Bethe free energy is only an approximation, which is the essence of variational methods.
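The exactness on hyper-trees can be illustrated numerically. The sketch below makes several assumptions that are not taken from the original equations: $\beta = 1$, spins in $\{-1,+1\}$, energy $E(S) = -\sum_i H_i S_i - \sum_t J_t S_i S_j S_k$, and a Bethe free energy written in the "maximize" convention as Bethe entropy minus average energy. Under these assumptions, evaluating it at the true marginals of a tree-like hyper-graph should return $\log Z$. The function names are hypothetical.

```python
import itertools
import numpy as np

def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

def bethe_vs_log_partition(H, triangles, J):
    """Evaluate (Bethe entropy - average energy) at the exact marginals."""
    n = len(H)
    states = np.array(list(itertools.product((-1, 1), repeat=n)))
    neg_energy = states @ np.asarray(H, dtype=float)
    for J_t, (i, j, k) in zip(J, triangles):
        neg_energy = neg_energy + J_t * states[:, i] * states[:, j] * states[:, k]
    Z = np.exp(neg_energy).sum()
    p = np.exp(neg_energy) / Z

    degree = [sum(i in t for t in triangles) for i in range(n)]
    configs = list(itertools.product((-1, 1), repeat=3))
    value = 0.0
    # Triangle terms: entropy of the three-node belief plus J_t * E[S_i S_j S_k].
    for J_t, (i, j, k) in zip(J, triangles):
        b = np.array([
            p[(states[:, i] == a) & (states[:, j] == c) & (states[:, k] == d)].sum()
            for a, c, d in configs
        ])
        spin_product = np.array([a * c * d for a, c, d in configs])
        value += entropy(b) + J_t * float(b @ spin_product)
    # Node terms: entropy over-counting correction plus H_i * E[S_i].
    for i in range(n):
        b = np.array([p[states[:, i] == a].sum() for a in (-1, 1)])
        value += -(degree[i] - 1) * entropy(b) + H[i] * float(b @ np.array([-1, 1]))
    return value, float(np.log(Z))

bethe, log_z = bethe_vs_log_partition(
    [0.1, 0.2, 0.0, 0.3, 0.1], [(0, 1, 2), (2, 3, 4)], [0.5, 0.8]
)
print(bethe, log_z)
```

The two printed numbers agree for this two-triangle hyper-tree. On a loopy arrangement of triangles the same function still runs, but the first value is then only an approximation of $\log Z$.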

We can derive the BP equations from the first-order optimality conditions for the aforementioned optimization problem. In other words, we can verify that *a set of beliefs gives a BP fixed point in any hyper-graph if and only if they are stationary points of the Bethe free energy* for the generalized BP. To see this, we need to add Lagrange multipliers to to form a Lagrangian . Let be a multiplier that enforces the marginalization constraint , and be a multiplier that enforces the normalization of . So, the Lagrangian corresponding to the optimization problem is

(21) |

where we ignore the constraints because, given other constraints, those constraints are always satisfied at a critical point.

The equation gives:

(22) |

Setting , we find that at a critical point of the Lagrangian

(23) | ||||

(24) |

And the equation gives:

(25) |

Setting , we find that at a critical point of the Lagrangian

(26) | ||||

(27) |

Furthermore, by differentiating with respect to , we see that the marginalization constraints are satisfied. Therefore, for any triangle , . Hence,

(28) | ||||

(29) | ||||

(30) |

So

(31) | ||||

(32) |

Define , we have

(33) |

Let

(34) |

Then we see

(35) | ||||

(36) |

which is the BP consistency equation (12) we derived in Section 2.

Up to this point, we have represented the Bethe free energy in terms of beliefs corresponding to BP fixed points. In order to analyze the behavior of the Bethe free energy at BP fixed points, we need to represent the Bethe free energy in terms of the hyper-edge messages , which is called the *dual Bethe free energy* in the literature. First, we have the following lemma.

###### Lemma 1.

The dual Bethe free energy at a critical point can be defined by

(37) |

where

(38) | ||||

(39) |

###### Proof.

Recall the Bethe free energy

(40) | ||||

(41) | ||||

(42) |

By rearranging terms, we have

(43) |

where

(44) | ||||

(45) |

and

(46) |

W.l.o.g., let us look at the term , let , it can be rewritten as

(47) | ||||

(48) |

From Equation (4), we know

(49) |

where is a normalization constant . Substituting it back into Equation (47), we have

(50) | ||||

(51) |

∎

If we use the definition , and define , we have the following corollary:

###### Corollary 1.

The dual Bethe free energy in terms of hyper-edge messages is

(52) |

where

(53) | ||||

(54) |

## 4 Optimization Landscape

Now, we can study the behavior of the Bethe free energy at critical points. The following lemma establishes that is a concave monotone function for some non-negative .

###### Lemma 2.

Suppose that for any . Then is a concave monotone function on the domain .

###### Proof.

Observe that

(55) |

which proves monotonicity, and

(56) |

Note that for any non-negative vector , if we let

(57) |

then we have

(58) | ||||

(59) | ||||

(60) | ||||

(61) | ||||

(62) |

For any edge , let (note ), and

(63) |

Due to the fact , we know as , and . Since is continuous over , if we assume is the largest root for in , we know in . Let , we have

(64) |

for . ∎

We define the sets of *pre-fixpoint* and *post-fixpoint* messages similarly to [6]:

(65) |

From Lemma 2, we know is a convex set, while is typically non-convex and even disconnected. Next, we show the gradient of the dual Bethe free energy is well-behaved on these sets:

###### Lemma 3.

If then and if then

###### Proof.

The lemma will follow if we compute the gradient of the dual Bethe free energy function .

(66) |

Recall is the updated message from spin to motif based on the current messages . If or , then the signs of the gradient of the Bethe free energy are determined by Equation (66) as claimed. ∎

###### Theorem 1.

Suppose that generalized BP is run from initial messages and there is at least one fixed point in . The messages converge to a fixed point of the generalized BP equations such that for any other fixed point , . Furthermore

(67) |

###### Proof.

If there is at least one fixed point in , and the initialization is for all hyper-edges