# Concavity of reweighted Kikuchi approximation

We analyze a reweighted version of the Kikuchi approximation for estimating the log partition function of a product distribution defined over a region graph. We establish sufficient conditions for the concavity of our reweighted objective function in terms of weight assignments in the Kikuchi expansion, and show that a reweighted version of the sum product algorithm applied to the Kikuchi region graph will produce global optima of the Kikuchi approximation whenever the algorithm converges. When the region graph has two layers, corresponding to a Bethe approximation, we show that our sufficient conditions for concavity are also necessary. Finally, we provide an explicit characterization of the polytope of concavity in terms of the cycle structure of the region graph. We conclude with simulations that demonstrate the advantages of the reweighted Kikuchi approach.


## 1 Introduction

Undirected graphical models are a familiar framework in diverse application domains such as computer vision, statistical physics, coding theory, social science, and epidemiology. In certain settings of interest, one is provided with potential functions defined over nodes and (hyper)edges of the graph. A crucial step in probabilistic inference is to compute the log partition function of the distribution based on these potential functions for a given graph structure. However, computing the log partition function either exactly or approximately is NP-hard in general [2, 17]. An active area of research involves finding accurate approximations of the log partition function and characterizing the graph structures for which such approximations may be computed efficiently [29, 22, 7, 19, 25, 18].

When the underlying graph is a tree, the log partition function may be computed exactly via the sum product algorithm in time linear in the number of nodes [15]. However, when the graph contains cycles, a generalized version of the sum product algorithm known as loopy belief propagation may either fail to converge or terminate in local optima of a nonconvex objective function [26, 20, 8, 13].

In this paper, we analyze the Kikuchi approximation method, which is constructed from a variational representation of the log partition function by replacing the entropy with an expression that decomposes with respect to a region graph. Kikuchi approximations were previously introduced in the physics literature [9] and reformalized by Yedidia et al. [28, 29] and others [1, 14] in the language of graphical models. The Bethe approximation, which is a special case of the Kikuchi approximation when the region graph has only two layers, has been studied by various authors [3, 28, 5, 25]. In addition, a reweighted version of the Bethe approximation was proposed by Wainwright et al. [22, 16]. As described in Vontobel [21], computing the global optimum of the Bethe variational problem may in turn be used to approximate the permanent of a nonnegative square matrix.

The particular objective function that we study generalizes the Kikuchi objective appearing in previous literature by assigning arbitrary weights to individual terms in the Kikuchi entropy expansion. We establish necessary and sufficient conditions under which this class of objective functions is concave, so a global optimum may be found efficiently. Our theoretical results synthesize known results on Kikuchi and Bethe approximations, and our main theorem concerning concavity conditions for the reweighted Kikuchi entropy recovers existing results when specialized to the unweighted Kikuchi [14] or reweighted Bethe [22] case. Furthermore, we provide a valuable converse result in the reweighted Bethe case, showing that when our concavity conditions are violated, the entropy function cannot be concave over the whole feasible region. As demonstrated by our experiments, a message-passing algorithm designed to optimize the Kikuchi objective may terminate in local optima for weights outside the concave region. Watanabe and Fukumizu [24, 25] provide a similar converse in the unweighted Bethe case, but our proof is much simpler and our result is more general.

In the reweighted Bethe setting, we also present a useful characterization of the concave region of the Bethe entropy function in terms of the geometry of the graph. Specifically, we show that if the region graph consists of only singleton vertices and pairwise edges, then the region of concavity coincides with the convex hull of incidence vectors of single-cycle forest subgraphs of the original graph. When the region graph contains regions with cardinality greater than two, the latter region may be strictly contained in the former; however, our result provides a useful way to generate weight vectors within the region of concavity. Whereas Wainwright et al. [22] establish the concavity of the reweighted Bethe objective on the spanning forest polytope, that region is contained within the single-cycle forest polytope, and our simulations show that generating weight vectors in the latter polytope may yield closer approximations to the log partition function.

The remainder of the paper is organized as follows: In Section 2, we review background information about the Kikuchi and Bethe approximations. In Section 3, we provide our main results on concavity conditions for the reweighted Kikuchi approximation, including a geometric characterization of the region of concavity in the Bethe case. Section 4 outlines the reweighted sum product algorithm and proves that fixed points correspond to global optima of the Kikuchi approximation. Section 5 presents experiments showing the improved accuracy of the reweighted Kikuchi approximation over the region of concavity. Technical proofs and additional simulations are contained in the Appendix.

## 2 Background and problem setup

In this section, we review basic concepts of the Kikuchi approximation and establish some terminology to be used in the paper.

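Let $G = (V, R)$ denote a region graph defined over the vertex set $V$, where each region $r \in R$ is a subset of $V$. Directed edges correspond to inclusion, so $r \to s$ is an edge of $G$ if $s \subseteq r$. We use the following notation, for $r \in R$:

$$
\begin{aligned}
A(r) &:= \{s \in R : r \subsetneq s\} && \text{(ancestors of } r\text{)} \\
F(r) &:= \{s \in R : r \subseteq s\} && \text{(forebears of } r\text{)} \\
N(r) &:= \{s \in R : r \subseteq s \text{ or } s \subseteq r\} && \text{(neighbors of } r\text{)}.
\end{aligned}
$$

For $R' \subseteq R$, we define $A(R') := \bigcup_{r \in R'} A(r)$, and we define $F(R')$ and $N(R')$ similarly.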

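We consider joint distributions of $x = (x_s)_{s \in V}$ that factorize over the region graph; i.e.,

$$
p(x) = \frac{1}{Z(\alpha)} \prod_{r \in R} \alpha_r(x_r), \tag{1}
$$

for potential functions $\alpha_r > 0$. Here, $Z(\alpha)$ is the normalization factor, or partition function, which is a function of the potential functions $\alpha_r$, and each variable $x_s$ takes values in a finite discrete set $\mathcal{X}$. One special case of the factorization (1) is the pairwise Ising model, defined over a graph $G = (V, E)$, where the distribution is given by

$$
p_\gamma(x) = \exp\Big(\sum_{s \in V} \gamma_s(x_s) + \sum_{(s,t) \in E} \gamma_{st}(x_s, x_t) - A(\gamma)\Big), \tag{2}
$$

and $\mathcal{X} = \{-1, +1\}$. Our goal is to analyze the log partition function

$$
\log Z(\alpha) = \log\Big\{\sum_{x \in \mathcal{X}^{|V|}} \prod_{r \in R} \alpha_r(x_r)\Big\}. \tag{3}
$$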
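For very small models, the sum in (3) can be evaluated by direct enumeration, which is useful as a ground truth when comparing approximations. The following is a minimal sketch for the pairwise Ising model (2), assuming spins in $\{-1, +1\}$ and potentials of the product form $\gamma_s x_s$ and $\gamma_{st} x_s x_t$; the function name `log_partition_ising` and the dictionary-based inputs are illustrative choices, not part of the paper.

```python
import itertools
import math


def log_partition_ising(node_pot, edge_pot):
    """Brute-force log partition function of a binary pairwise Ising model.

    node_pot: dict {s: gamma_s}, interpreted as gamma_s * x_s
    edge_pot: dict {(s, t): gamma_st}, interpreted as gamma_st * x_s * x_t
    Spins are assumed to take values in {-1, +1}.
    The cost is exponential in the number of nodes.
    """
    nodes = sorted(node_pot)
    exponents = []
    for spins in itertools.product((-1, 1), repeat=len(nodes)):
        x = dict(zip(nodes, spins))
        e = sum(g * x[s] for s, g in node_pot.items())
        e += sum(g * x[s] * x[t] for (s, t), g in edge_pot.items())
        exponents.append(e)
    # Log-sum-exp with the maximum factored out for numerical stability.
    m = max(exponents)
    return m + math.log(sum(math.exp(e - m) for e in exponents))
```

For a single edge with coupling $J$ and no node potentials, the enumeration should agree with the closed form $\log(4\cosh J)$.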

### 2.1 Variational representation

It is known from the theory of graphical models [14] that the log partition function (3) may be written in the variational form

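$$
\log Z(\alpha) = \sup_{\{\tau_r(x_r)\} \in \Delta_R} \Big\{\sum_{r \in R} \sum_{x_r} \tau_r(x_r) \log(\alpha_r(x_r)) + H(p_\tau)\Big\}, \tag{4}
$$

where $p_\tau$ is the maximum entropy distribution with marginals $\{\tau_r(x_r)\}$ and

$$
H(p) := -\sum_x p(x) \log p(x)
$$

is the usual entropy. Here, $\Delta_R$ denotes the $R$-marginal polytope; i.e., $\{\tau_r(x_r) : r \in R\} \in \Delta_R$ if and only if there exists a distribution $p(x)$ such that $\tau_r(x_r) = \sum_{x_{\setminus r}} p(x_r, x_{\setminus r})$ for all $r$. For ease of notation, we also write $\tau \equiv \{\tau_r(x_r) : r \in R\}$. Let $\theta$ denote the collection of log potential functions $\{\log(\alpha_r(x_r)) : r \in R\}$. Then equation (4) may be rewritten as

$$
\log Z(\theta) = \sup_{\tau \in \Delta_R} \{\langle \theta, \tau \rangle + H(p_\tau)\}. \tag{5}
$$

Specializing to the Ising model (2), equation (5) gives the variational representation

$$
A(\gamma) = \sup_{\mu \in \mathbb{M}} \{\langle \gamma, \mu \rangle + H(p_\mu)\}, \tag{6}
$$

which appears in Wainwright and Jordan [23]. Here, $\mathbb{M}$ denotes the marginal polytope, corresponding to the collection of mean parameter vectors of the sufficient statistics in the exponential family representation (2), ranging over different values of $\gamma$, and $p_\mu$ is the maximum entropy distribution with mean parameters $\mu$.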

### 2.2 Reweighted Kikuchi approximation

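Although the set $\Delta_R$ appearing in the variational representation (5) is a convex polytope, it may have exponentially many facets [23]. Hence, we replace $\Delta_R$ with the set

$$
\Delta_R^K = \Big\{\tau : \forall t, u \in R \text{ s.t. } t \subseteq u,\ \sum_{x_{u \setminus t}} \tau_u(x_t, x_{u \setminus t}) = \tau_t(x_t) \quad \text{and} \quad \forall u \in R,\ \sum_{x_u} \tau_u(x_u) = 1\Big\}
$$

of locally consistent $R$-pseudomarginals. Note that $\Delta_R \subseteq \Delta_R^K$ and the latter set has only polynomially many facets, making optimization more tractable.

In the case of the pairwise Ising model (2), we let $\mathbb{L}$ denote the polytope $\Delta_R^K$. Then $\mathbb{L}$ is the collection of nonnegative functions $\tau = (\tau_s, \tau_{st})$ satisfying the marginalization constraints

$$
\begin{aligned}
\sum_{x_s} \tau_s(x_s) &= 1, && \forall s \in V, \\
\sum_{x_t} \tau_{st}(x_s, x_t) = \tau_s(x_s) \quad \text{and} \quad \sum_{x_s} \tau_{st}(x_s, x_t) &= \tau_t(x_t), && \forall (s,t) \in E.
\end{aligned}
$$

Recall that $\mathbb{M} \subseteq \mathbb{L}$, with equality achieved if and only if the underlying graph is a tree. In the general case, we have $\Delta_R = \Delta_R^K$ when the Hasse diagram of the region graph admits a minimal representation that is loop-free (cf. Theorem 2 of Pakzad and Anantharam [14]).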

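Given a collection of $R$-pseudomarginals $\tau$, we also replace the entropy term $H(p_\tau)$, which is difficult to compute in general, by the approximation

$$
H(p_\tau) \approx \sum_{r \in R} \rho_r H_r(\tau_r) := H(\tau; \rho), \tag{7}
$$

where $H_r(\tau_r) := -\sum_{x_r} \tau_r(x_r) \log \tau_r(x_r)$ is the entropy computed over region $r$, and $\{\rho_r : r \in R\}$ are weights assigned to the regions. Note that in the pairwise Ising case (2), with $p := p_\gamma$, we have the equality

$$
H(p) = \sum_{s \in V} H_s(p_s) - \sum_{(s,t) \in E} I_{st}(p_{st})
$$

when $G$ is a tree, where $I_{st}(p_{st}) = H_s(p_s) + H_t(p_t) - H_{st}(p_{st})$ denotes the mutual information and $p_s$ and $p_{st}$ denote the node and edge marginals. Hence, the approximation (7) is exact with

$$
\rho_{st} = 1, \ \forall (s,t) \in E, \quad \text{and} \quad \rho_s = 1 - \deg(s), \ \forall s \in V.
$$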
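The exactness of this weight assignment on trees can be checked numerically. The sketch below builds a three-node Markov chain (a tree) with arbitrarily chosen, purely illustrative probabilities, and compares the exact entropy of the joint distribution with the weighted sum of node and edge entropies using $\rho_{st} = 1$ and $\rho_s = 1 - \deg(s)$. The helper names `entropy` and `marginal` are hypothetical.

```python
import itertools
import math


def entropy(dist):
    """Shannon entropy (natural log) of a distribution given as {outcome: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)


def marginal(joint, idx):
    """Marginalize a joint {tuple: prob} onto the coordinates listed in idx."""
    out = {}
    for x, p in joint.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


# Markov chain 0 - 1 - 2 over binary variables (illustrative numbers only).
p0 = {0: 0.3, 1: 0.7}
p1_given_0 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p2_given_1 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.25, 1: 0.75}}
joint = {
    (a, b, c): p0[a] * p1_given_0[a][b] * p2_given_1[b][c]
    for a, b, c in itertools.product((0, 1), repeat=3)
}

edges = [(0, 1), (1, 2)]
degree = {0: 1, 1: 2, 2: 1}

# Weighted entropy with rho_s = 1 - deg(s) on nodes and rho_st = 1 on edges.
bethe = sum((1 - degree[s]) * entropy(marginal(joint, (s,))) for s in degree)
bethe += sum(entropy(marginal(joint, e)) for e in edges)
exact = entropy(joint)
```

On this chain the two quantities coincide up to floating-point error; on a graph with cycles they generally differ.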

Using the approximation (7), we arrive at the following reweighted Kikuchi approximation:

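$$
B(\theta; \rho) := \sup_{\tau \in \Delta_R^K} \underbrace{\{\langle \theta, \tau \rangle + H(\tau; \rho)\}}_{B_{\theta,\rho}(\tau)}. \tag{8}
$$

Note that when $\{\rho_r\}$ are the overcounting numbers $\{c_r\}$, defined recursively by

$$
c_r = 1 - \sum_{s \in A(r)} c_s, \tag{9}
$$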

the expression (8) reduces to the usual (unweighted) Kikuchi approximation considered in Pakzad and Anantharam [14].
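The recursion (9) is straightforward to evaluate once the regions are ordered so that every ancestor is processed before its descendants. The following sketch assumes regions are given as distinct `frozenset` objects and that ancestors are strict supersets; the function name `overcounting_numbers` is an illustrative choice.

```python
def overcounting_numbers(regions):
    """Compute c_r = 1 - sum of c_s over strict supersets s of r, as in (9).

    regions: iterable of distinct collections of vertex labels.
    Returns a dict mapping each region (as a frozenset) to its number c_r.
    """
    regions = [frozenset(r) for r in regions]
    c = {}
    # A strict superset is always larger, so sorting by decreasing size
    # guarantees that all ancestors of r are handled before r itself.
    for r in sorted(regions, key=len, reverse=True):
        c[r] = 1 - sum(c[s] for s in regions if r < s)
    return c
```

For a two-layer region graph built from the nodes and edges of a triangle, this returns 1 for each edge and $1 - \deg(s) = -1$ for each node, matching the Bethe weights discussed above.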

## 3 Main results and consequences

In this section, we analyze the concavity of the Kikuchi variational problem (8). We derive a sufficient condition under which the function $B_{\theta,\rho}(\tau)$ is concave over the set $\Delta_R^K$, so global optima of the reweighted Kikuchi approximation may be found efficiently. In the Bethe case, we also show that the condition is necessary for $B_{\theta,\rho}(\tau)$ to be concave over the entire region $\Delta_R^K$, and we provide a geometric characterization of the polytope of concavity in terms of the edge and cycle structure of the graph.

### 3.1 Sufficient conditions for concavity

We begin by establishing sufficient conditions for the concavity of $B_{\theta,\rho}(\tau)$. Since $\langle \theta, \tau \rangle$ is linear in $\tau$, this is equivalent to establishing conditions under which $H(\tau; \rho)$ is concave. Our main result is the following:

###### Theorem 1.

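If $\rho \in \mathbb{R}^{|R|}$ satisfies

$$
\sum_{s \in F(S)} \rho_s \ge 0, \quad \forall S \subseteq R, \tag{10}
$$

then the Kikuchi entropy $H(\tau; \rho)$ is strictly concave on $\Delta_R^K$.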

The proof of Theorem 1 is contained in Appendix A.1, and makes use of a generalization of Hall’s marriage lemma for weighted graphs (cf. Lemma 1 in Appendix A.2).
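On small region graphs, condition (10) can be tested by enumerating all subsets of regions. The sketch below is a brute-force check under the assumptions that regions are `frozenset` objects, that $F(r)$ consists of the regions containing $r$ (including $r$ itself), and that $F(S)$ is the union of $F(r)$ over $r \in S$. The function name `satisfies_condition_10` and the tolerance are illustrative, and the running time is exponential in the number of regions.

```python
from itertools import combinations


def satisfies_condition_10(rho, tol=1e-12):
    """Check sum_{s in F(S)} rho_s >= 0 for every nonempty subset S of regions.

    rho: dict {frozenset region: weight}.
    """
    regions = list(rho)
    # F(r): all regions that contain r, including r itself.
    forebears = {r: {s for s in regions if r <= s} for r in regions}
    for k in range(1, len(regions) + 1):
        for subset in combinations(regions, k):
            f_of_s = set().union(*(forebears[r] for r in subset))
            if sum(rho[s] for s in f_of_s) < -tol:
                return False
    return True
```

As a usage example, the unweighted Bethe weights on a triangle (edges weighted 1, nodes weighted $-1$) pass the check, while the same construction on the complete graph with four vertices (nodes weighted $-2$) fails it when $S$ is the set of all nodes.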

The condition (10) depends heavily on the structure of the region graph. For the sake of interpretability, we now specialize to the case where the region graph has only two layers, with the first layer corresponding to vertices and the second layer corresponding to hyperedges. In other words, for $r, s \in R$, we have $r \subsetneq s$ only if $|r| = 1$, and $R = V \cup F$, where $F$ is the set of hyperedges and $V$ denotes the set of singleton vertices. This is the Bethe case, and the entropy

$$
H(\tau; \rho) = \sum_{s \in V} \rho_s H_s(\tau_s) + \sum_{\alpha \in F} \rho_\alpha H_\alpha(\tau_\alpha) \tag{11}
$$

is consequently known as the Bethe entropy.

The following result is proved in Appendix A.3:

###### Corollary 1.

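Suppose $\rho_\alpha \ge 0$ for all $\alpha \in F$, and the following condition also holds:

$$
\sum_{s \in U} \rho_s + \sum_{\alpha \in F : \alpha \cap U \neq \emptyset} \rho_\alpha \ge 0, \quad \forall U \subseteq V. \tag{12}
$$

Then the Bethe entropy $H(\tau; \rho)$ is strictly concave over $\Delta_R^K$.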

### 3.2 Necessary conditions for concavity

We now establish a converse to Corollary 1 in the Bethe case, showing that condition (12) is also necessary for the concavity of the Bethe entropy. When $\rho_\alpha = 1$ for $\alpha \in F$ and $\rho_s = 1 - |N(s)|$ for $s \in V$, we recover the result of Watanabe and Fukumizu [25] for the unweighted Bethe case. However, our proof technique is significantly simpler and avoids the complex machinery of graph zeta functions. Our approach proceeds by considering the Bethe entropy on appropriate slices of the domain $\Delta_R^K$ so as to extract condition (12) for each $U \subseteq V$. The full proof is provided in Appendix B.1.

###### Theorem 2.

If the Bethe entropy $H(\tau; \rho)$ is concave over $\Delta_R^K$, then $\rho_\alpha \ge 0$ for all $\alpha \in F$, and condition (12) holds.

Indeed, as demonstrated in the simulations of Section 5, the Bethe objective function $B_{\theta,\rho}(\tau)$ may have multiple local optima if $\rho$ does not satisfy condition (12).

### 3.3 Polytope of concavity

We now characterize the polytope defined by the inequalities (12). We show that in the pairwise Bethe case, the polytope may be expressed geometrically as the convex hull of single-cycle forests formed by the edges of the graph. In the more general (non-pairwise) Bethe case, however, the polytope of concavity may strictly contain the latter set.

Note that the Bethe entropy (11) may be written in the alternative form

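$$
H(\tau; \rho) = \sum_{s \in V} \rho'_s H_s(\tau_s) - \sum_{\alpha \in F} \rho_\alpha \tilde{I}_\alpha(\tau_\alpha), \tag{13}
$$

where $\tilde{I}_\alpha(\tau_\alpha)$ is the KL divergence between the joint distribution $\tau_\alpha$ and the product distribution $\prod_{s \in \alpha} \tau_s$, and the weights $\rho'_s$ are defined appropriately.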
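The weights in (13) can be made explicit with a short calculation. The derivation below is a sketch under the assumption that the pseudomarginals are locally consistent, so that the KL divergence between $\tau_\alpha$ and the product of its node marginals equals $\sum_{s \in \alpha} H_s(\tau_s) - H_\alpha(\tau_\alpha)$; the notation $\alpha \ni s$ for the factors containing $s$ is introduced here for convenience.

```latex
\begin{aligned}
\sum_{s \in V} \rho_s H_s(\tau_s) + \sum_{\alpha \in F} \rho_\alpha H_\alpha(\tau_\alpha)
&= \sum_{s \in V} \rho_s H_s(\tau_s)
   + \sum_{\alpha \in F} \rho_\alpha \Big( \sum_{s \in \alpha} H_s(\tau_s) - \tilde{I}_\alpha(\tau_\alpha) \Big) \\
&= \sum_{s \in V} \Big( \rho_s + \sum_{\alpha \ni s} \rho_\alpha \Big) H_s(\tau_s)
   - \sum_{\alpha \in F} \rho_\alpha \tilde{I}_\alpha(\tau_\alpha),
\end{aligned}
\qquad \text{so} \qquad
\rho'_s = \rho_s + \sum_{\alpha \ni s} \rho_\alpha .
```

In particular, choosing $\rho_s = 1 - \sum_{\alpha \ni s} \rho_\alpha$ gives $\rho'_s = 1$ for every node.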

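We show that the polytope of concavity has a nice geometric characterization when $\rho'_s = 1$ for all $s \in V$, and $\rho_\alpha \in [0, 1]$ for all $\alpha \in F$. Note that this assignment produces the expression for the reweighted Bethe entropy analyzed in Wainwright et al. [22] (when all elements of $F$ have cardinality two). Equation (13) then becomes

$$
H(\tau; \rho) = \sum_{s \in V} \Big(1 - \sum_{\alpha \in N(s)} \rho_\alpha\Big) H_s(\tau_s) + \sum_{\alpha \in F} \rho_\alpha H_\alpha(\tau_\alpha), \tag{14}
$$

and the inequalities (12) defining the polytope of concavity are

$$
\sum_{\alpha \in F : \alpha \cap U \neq \emptyset} (|\alpha \cap U| - 1) \rho_\alpha \le |U|, \quad \forall U \subseteq V. \tag{15}
$$

Consequently, we define

$$
\mathbb{C} := \Big\{\rho \in [0, 1]^{|F|} : \sum_{\alpha \in F : \alpha \cap U \neq \emptyset} (|\alpha \cap U| - 1) \rho_\alpha \le |U|, \ \forall U \subseteq V\Big\}.
$$

By Theorem 2, the set $\mathbb{C}$ is the region of concavity for the Bethe entropy (14) within $[0, 1]^{|F|}$.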
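For small graphs, membership in the set defined by the inequalities (15) can be tested by enumerating every nonempty vertex subset $U$. The following brute-force sketch assumes the weights are supplied as a dictionary keyed by `frozenset` factors; the function name `in_concavity_polytope` and the tolerance are illustrative, and the cost grows as $2^{|V|}$.

```python
from itertools import combinations


def in_concavity_polytope(rho, tol=1e-9):
    """Test the inequalities (15) and the box constraints 0 <= rho <= 1.

    rho: nonempty dict {frozenset factor: weight}.
    """
    vertices = sorted(set().union(*rho))
    if any(w < -tol or w > 1 + tol for w in rho.values()):
        return False
    for k in range(1, len(vertices) + 1):
        for subset in combinations(vertices, k):
            u = set(subset)
            # Only factors that intersect U contribute to the left-hand side.
            lhs = sum((len(a & u) - 1) * w for a, w in rho.items() if a & u)
            if lhs > len(u) + tol:
                return False
    return True
```

As a usage example, uniform edge weights of $2/3$ on the complete graph with four vertices satisfy all inequalities (with equality at $U = V$), whereas uniform weights of $0.7$ violate the inequality for $U = V$.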

We also define the set

$$
\mathbb{F} := \{1_{F'} : F' \subseteq F \text{ and } F' \cup N(F') \text{ is a single-cycle forest in } G\} \subseteq \{0, 1\}^{|F|},
$$

where a single-cycle forest is defined to be a subset of edges of a graph such that each connected component contains at most one cycle. (We disregard the directions of edges in $G$.) The following theorem gives our main result. The proof is contained in Appendix C.1.

###### Theorem 3.

In the Bethe case (i.e., the region graph $G$ has two layers), we have the containment $\operatorname{conv}(\mathbb{F}) \subseteq \mathbb{C}$. If in addition $|\alpha| = 2$ for all $\alpha \in F$, then $\operatorname{conv}(\mathbb{F}) = \mathbb{C}$.

The significance of Theorem 3 is that it provides us with a convenient graph-based method for constructing vectors $\rho \in \mathbb{C}$. From the inequalities (15), it is not even clear how to efficiently verify whether a given $\rho$ lies in $\mathbb{C}$, since it involves testing $2^{|V|}$ inequalities.
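In the pairwise case, checking whether a set of edges forms a single-cycle forest is cheap, which is what makes the graph-based construction convenient. The sketch below uses a union-find structure and tracks, for each connected component, whether it already contains a cycle. It assumes a simple undirected graph given as a list of vertex pairs (no self-loops or repeated edges); the function name `is_single_cycle_forest` is illustrative.

```python
def is_single_cycle_forest(edges):
    """Return True if every connected component has at most one cycle.

    edges: iterable of (s, t) pairs of a simple undirected graph.
    """
    parent = {}
    has_cycle = {}  # component root -> True once the component holds a cycle

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for s, t in edges:
        rs, rt = find(s), find(t)
        if rs == rt:
            # The edge closes a cycle inside an existing component.
            if has_cycle.get(rs, False):
                return False
            has_cycle[rs] = True
        else:
            # The edge merges two components; at most one may carry a cycle.
            if has_cycle.get(rs, False) and has_cycle.get(rt, False):
                return False
            merged = has_cycle.get(rs, False) or has_cycle.get(rt, False)
            parent[rs] = rt
            has_cycle[rt] = merged
    return True
```

Indicator vectors of edge sets that pass this test are vertices of the polytope described in Theorem 3, so convex combinations of such indicators give weight vectors in the region of concavity.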

Comparing Theorem 3 with known results, note that in the pairwise case ($|\alpha| = 2$ for all $\alpha \in F$), Theorem 1 of Wainwright et al. [22] states that the Bethe entropy is concave over $\operatorname{conv}(\mathbb{T})$, where $\mathbb{T}$ is the set of edge indicator vectors for spanning forests of the graph. It is trivial to check that $\mathbb{T} \subseteq \mathbb{F}$, since every spanning forest is also a single-cycle forest. Hence, Theorems 2 and 3 together imply a stronger result than in Wainwright et al. [22], characterizing the precise region of concavity for the Bethe entropy as a superset of the polytope analyzed there. In the unweighted Kikuchi case, it is also known [1, 14] that the Kikuchi entropy is concave for the assignment $\rho = 1_F$ when the region graph is connected and has at most one cycle. Clearly, $1_F \in \mathbb{C}$ in that case, so this result is a consequence of Theorems 2 and 3, as well. However, our theorems show that a much more general statement is true.

It is tempting to posit that $\operatorname{conv}(\mathbb{F}) = \mathbb{C}$ holds more generally in the Bethe case. However, as the following example shows, settings arise where $\operatorname{conv}(\mathbb{F}) \subsetneq \mathbb{C}$. Details are contained in Appendix C.2.

###### Example 1.

Consider a two-layer region graph with vertices and factors , , and . Then .

In fact, Example 1 is a special case of a more general statement, which we state in the following proposition. Here, , and an element is maximal if it is not contained in another element of .

###### Proposition 1.

Suppose (i) is not a single-cycle forest, and (ii) there exists a maximal element such that the induced subgraph is a forest. Then .

The proof of Proposition 1 is contained in Appendix C.3. Note that if $|\alpha| = 2$ for all $\alpha \in F$, then condition (ii) is violated whenever condition (i) holds, so Proposition 1 provides a partial converse to Theorem 3.

## 4 Reweighted sum product algorithm

In this section, we provide an iterative message passing algorithm to optimize the Kikuchi variational problem (8). As in the case of the generalized belief propagation algorithm for the unweighted Kikuchi approximation [28, 29, 11, 14, 12, 27] and the reweighted sum product algorithm for the Bethe approximation [22], our message passing algorithm searches for stationary points of the Lagrangian version of the problem (8). When $\rho$ satisfies condition (10), Theorem 1 implies that the problem (8) is strictly concave, so the unique fixed point of the message passing algorithm globally maximizes the Kikuchi approximation.

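Let $G = (V, R)$ be a region graph defining our Kikuchi approximation. Following Pakzad and Anantharam [14], for $r, s \in R$, we write $r \prec s$ if $r \subsetneq s$ and there does not exist $t \in R$ such that $r \subsetneq t \subsetneq s$. For $r \in R$, we define the parent set of $r$ to be $P(r) = \{s \in R : r \prec s\}$ and the child set of $r$ to be $C(r) = \{s \in R : s \prec r\}$. With this notation, $\tau$ belongs to the set $\Delta_R^K$ if and only if $\sum_{x_{s \setminus r}} \tau_s(x_r, x_{s \setminus r}) = \tau_r(x_r)$ for all $r \in R$, $s \in P(r)$.

The message passing algorithm we propose is as follows: For each $r \in R$ and $s \in P(r)$, let $M_{sr}(x_r)$ denote the message passed from $s$ to $r$ at assignment $x_r$. Starting with an arbitrary positive initialization of the messages, we repeatedly perform the following updates for all $r \in R$, $s \in P(r)$:

$$
M_{sr}(x_r) \leftarrow C \left[ \frac{\sum_{x_{s \setminus r}} \exp\big(\theta_s(x_s)/\rho_s\big) \prod_{v \in P(s)} M_{vs}(x_s)^{\rho_v/\rho_s} \prod_{w \in C(s) \setminus r} M_{sw}(x_w)^{-1}}{\exp\big(\theta_r(x_r)/\rho_r\big) \prod_{u \in P(r) \setminus s} M_{ur}(x_r)^{\rho_u/\rho_r} \prod_{t \in C(r)} M_{rt}(x_t)^{-1}} \right]^{\frac{\rho_r}{\rho_r + \rho_s}}. \tag{16}
$$

Here, $C > 0$ may be chosen to ensure a convenient normalization condition; e.g., $\sum_{x_r} M_{sr}(x_r) = 1$. Upon convergence of the updates (16), we compute the pseudomarginals according to

$$
\tau_r(x_r) \propto \exp\Big(\frac{\theta_r(x_r)}{\rho_r}\Big) \prod_{s \in P(r)} M_{sr}(x_r)^{\rho_s/\rho_r} \prod_{t \in C(r)} M_{rt}(x_t)^{-1}, \tag{17}
$$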

and we obtain the corresponding Kikuchi approximation by computing the objective function (8) with these pseudomarginals. We have the following result, which is proved in Appendix D:

###### Theorem 4.

The pseudomarginals specified by the fixed points of the messages via the updates (16) and (17) correspond to the stationary points of the Lagrangian associated with the Kikuchi approximation problem (8).

As with the standard belief propagation and reweighted sum product algorithms, we have several options for implementing the above message passing algorithm in practice. For example, we may perform the updates (16) using serial or parallel schedules. To improve the convergence of the algorithm, we may damp the updates by taking a convex combination of new and previous messages using an appropriately chosen step size. As noted by Pakzad and Anantharam [14], we may also use a minimal graphical representation of the Hasse diagram to lower the complexity of the algorithm.
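The damping step mentioned above can be sketched as follows. This is only an illustration of the convex-combination idea, assuming each message is stored as a dictionary from assignments to positive values and is renormalized to sum to one after mixing; the function name `damp_message` and the default step size are hypothetical and are not taken from the paper.

```python
def damp_message(old_msg, new_msg, step=0.5):
    """Blend a freshly computed message with the previous one.

    old_msg, new_msg: dicts {assignment: value} over the same assignments.
    step: weight on the new message; step = 1 recovers the undamped update.
    """
    if not 0.0 < step <= 1.0:
        raise ValueError("step must lie in (0, 1]")
    mixed = {x: (1.0 - step) * old_msg[x] + step * new_msg[x] for x in old_msg}
    total = sum(mixed.values())
    # Renormalize so the damped message again sums to one.
    return {x: v / total for x, v in mixed.items()}
```

Smaller step sizes slow the iteration down but tend to suppress oscillations between successive message values.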

Finally, we remark that although our message passing algorithm proceeds in the same spirit as classical belief propagation algorithms by operating on the Lagrangian of the objective function, our algorithm as presented above does not immediately reduce to the generalized belief propagation algorithm for unweighted Kikuchi approximations or the reweighted sum product algorithm for tree-reweighted pairwise Bethe approximations. Previous authors use algebraic relations between the overcounting numbers (9) in the Kikuchi case [28, 29, 11, 14] and the two-layer structure of the Hasse diagram in the Bethe case [22] to obtain a simplified form of the updates. Since the coefficients in our problem lack the same algebraic relations, following the message-passing protocol used in previous work [11, 28] leads to more complicated updates, so we present a slightly different algorithm that still optimizes the general reweighted Kikuchi objective.

## 5 Experiments

In this section, we present empirical results to demonstrate the advantages of the reweighted Kikuchi approximation that support our theoretical results. For simplicity, we focus on the binary pairwise Ising model given in equation (2). Without loss of generality, we may take the potentials to be $\gamma_s(x_s) = \gamma_s x_s$ and $\gamma_{st}(x_s, x_t) = \gamma_{st} x_s x_t$ for some $\gamma = (\gamma_s, \gamma_{st}) \in \mathbb{R}^{|V| + |E|}$. We run our experiments on two types of graphs: (1) $K_n$, the complete graph on $n$ vertices, and (2) $T_n$, the toroidal grid graph on $n$ vertices where every vertex has degree four.

#### Bethe approximation.

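We consider the pairwise Bethe approximation of the log partition function $A(\gamma)$ with weights $\rho_{st} \ge 0$ and $\rho_s = 1 - \sum_{t \in N(s)} \rho_{st}$. Because of the regularity structure of $K_n$ and $T_n$, we take $\rho_{st} = \rho \ge 0$ for all $(s,t) \in E$ and study the behavior of the Bethe approximation as $\rho$ varies. For this particular choice of weight vector $\vec{\rho} = \rho 1_E$, we define

$$
\rho_{\text{tree}} = \max\{\rho \ge 0 : \vec{\rho} \in \operatorname{conv}(\mathbb{T})\}, \quad \text{and} \quad \rho_{\text{cycle}} = \max\{\rho \ge 0 : \vec{\rho} \in \operatorname{conv}(\mathbb{F})\}.
$$

It is easily verified that for $K_n$, we have $\rho_{\text{tree}} = \frac{2}{n}$ and $\rho_{\text{cycle}} = \frac{2}{n-1}$; while for $T_n$, we have $\rho_{\text{tree}} = \frac{n-1}{2n}$ and $\rho_{\text{cycle}} = \frac{1}{2}$.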
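These thresholds follow from an edge count. For a uniform weight $\rho$ on all edges, the binding constraints are those for the full vertex set: a spanning tree has $n - 1$ edges and a connected single-cycle subgraph has $n$ edges, so the thresholds are $(n-1)/|E|$ and $n/|E|$ respectively. The sketch below evaluates these ratios exactly, assuming $|E| = n(n-1)/2$ for the complete graph and $|E| = 2n$ for the degree-four toroidal grid; the function names are illustrative, and the formula is only claimed here for these two symmetric graph families.

```python
from fractions import Fraction


def uniform_thresholds(num_vertices, num_edges):
    """Return (rho_tree, rho_cycle) for uniform edge weights."""
    rho_tree = Fraction(num_vertices - 1, num_edges)
    rho_cycle = Fraction(num_vertices, num_edges)
    return rho_tree, rho_cycle


def complete_graph_thresholds(n):
    # The complete graph on n vertices has n(n-1)/2 edges.
    return uniform_thresholds(n, n * (n - 1) // 2)


def torus_thresholds(n):
    # A 4-regular toroidal grid on n vertices has 4n/2 = 2n edges.
    return uniform_thresholds(n, 2 * n)
```

For instance, with five vertices the complete graph gives thresholds $2/5$ and $1/2$, and a toroidal grid with nine vertices gives $4/9$ and $1/2$.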

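Our results in Section 3 imply that the Bethe objective function $B_{\gamma,\rho}(\tau)$ in equation (8) is concave if and only if $\rho \le \rho_{\text{cycle}}$, and Wainwright et al. [22] show that we have the bound $A(\gamma) \le B(\gamma; \rho)$ for $\rho \le \rho_{\text{tree}}$. Moreover, since the Bethe entropy may be written in terms of the edge mutual information (13), the function $B(\gamma; \rho)$ is decreasing in $\rho$. In our results below, we observe that we may obtain a tighter approximation to $A(\gamma)$ by moving from the upper bound region $\rho \le \rho_{\text{tree}}$ to the concavity region $\rho \le \rho_{\text{cycle}}$. In addition, for $\rho > \rho_{\text{cycle}}$, we observe multiple local optima of $B_{\gamma,\rho}(\tau)$.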

#### Procedure.

We generate a random potential $\gamma$ for the Ising model (2) by sampling each potential $\gamma_s$ and $\gamma_{st}$ independently. We consider two types of models:

$$
\text{Attractive:} \quad \gamma_{st} \sim \text{Uniform}[0, \omega_{st}], \quad \text{and} \quad \text{Mixed:} \quad \gamma_{st} \sim \text{Uniform}[-\omega_{st}, \omega_{st}].
$$

In each case, . We set and . Intuitively, the attractive model encourages variables in adjacent nodes to assume the same value, and it has been shown [18, 19] that the ordinary Bethe approximation ($\rho_{st} = 1$) in an attractive model lower-bounds the log partition function. For , we compute stationary points of $B_{\gamma,\rho}(\tau)$ by running the reweighted sum product algorithm of Wainwright et al. [22]. We use a damping factor of , convergence threshold of for the average change of messages, and at most iterations. We repeat this process with at least 8 random initializations for each value of $\rho$. Figure 1 shows the scatter plots of $\rho$ and the Bethe approximation $B(\gamma; \rho)$. In each plot, the two vertical lines are the boundaries $\rho = \rho_{\text{tree}}$ and $\rho = \rho_{\text{cycle}}$, and the horizontal line is the value of the true log partition function $A(\gamma)$.
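The two coupling regimes can be sampled as in the following sketch. It covers only the edge couplings, whose distributions are stated above; the function name `sample_edge_couplings`, the `omega` argument (standing in for a common value of $\omega_{st}$), and the use of a seeded generator are assumptions made for illustration, and no specific parameter values from the experiments are implied.

```python
import random


def sample_edge_couplings(edges, omega, attractive, seed=None):
    """Sample one coupling per edge for the attractive or mixed model.

    edges: iterable of (s, t) pairs.
    omega: nonnegative scale playing the role of omega_st (hypothetical input).
    attractive: if True, draw from Uniform[0, omega];
                otherwise draw from Uniform[-omega, omega].
    """
    rng = random.Random(seed)
    low = 0.0 if attractive else -omega
    return {edge: rng.uniform(low, omega) for edge in edges}
```

Passing a fixed seed makes a random instance reproducible across runs with different values of $\rho$.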

#### Results.

Figures 1(a)–1(d) show the results of our experiments on small graphs ( and ) for both attractive and mixed models. We see that the Bethe approximation with $\rho = \rho_{\text{cycle}}$ generally provides a better approximation to $A(\gamma)$ than the Bethe approximation computed over $\rho \le \rho_{\text{tree}}$. However, in general we cannot guarantee whether $B(\gamma; \rho)$ will give an upper or lower bound for $A(\gamma)$ when $\rho \ge \rho_{\text{cycle}}$. As noted above, the ordinary Bethe approximation ($\rho = 1$) is a lower bound on $A(\gamma)$ for attractive models.

We also observe from Figures 1(a)–1(d) that shortly after $\rho$ leaves the concavity region $\rho \le \rho_{\text{cycle}}$, multiple local optima emerge for the Bethe objective function. The presence of the point clouds near in Figures 1(a) and 1(c) arises because the sum product algorithm has not converged after iterations. Indeed, the same phenomenon is true for all our results: in the region where multiple local optima begin to appear, it is more difficult for the algorithm to converge. See Figure 2 and the accompanying text in Appendix E for a plot of the points , where is the final average change in the messages at termination of the algorithm. From Figure 2, we see that the values of are significantly higher for the values of $\rho$ near where multiple local optima emerge. We suspect that for these values of $\rho$, the sum product algorithm fails to converge since distinct local optima are close together, so messages oscillate between the optima. For larger values of $\rho$, the local optima become sufficiently separated and the algorithm converges to one of them. However, it is interesting to note that this point cloud phenomenon does not appear for attractive models, despite the presence of distinct local optima.

Simulations for larger graphs are shown in Figures 1(e)–1(h). If we zoom into the region near $\rho_{\text{cycle}}$, we still observe the same behavior that $\rho = \rho_{\text{cycle}}$ generally provides a better Bethe approximation than $\rho = \rho_{\text{tree}}$. Moreover, the point clouds and multiple local optima are more pronounced, and we see from Figures 1(c), 1(g), and 1(h) that new local optima with even worse Bethe values arise for larger values of $\rho$. Finally, we note that the same qualitative behavior also occurs in all the other graphs that we have tried ( for and for ), with multiple random instances of the Ising model $p_\gamma$.

## 6 Discussion

In this paper, we have analyzed the reweighted Kikuchi approximation method for estimating the log partition function of a distribution that factorizes over a region graph. We have characterized necessary and sufficient conditions for the concavity of the variational objective function, generalizing existing results in literature. Our simulations demonstrate the advantages of using the reweighted Kikuchi approximation and show that multiple local optima may appear outside the region of concavity.

An interesting future research direction is to obtain a better understanding of the approximation guarantees of the reweighted Bethe and Kikuchi methods. In the Bethe case with attractive potentials $\gamma$, several recent results [22, 19, 18] establish that the Bethe approximation $B(\gamma; \rho)$ is an upper bound to the log partition function $A(\gamma)$ when $\rho$ lies in the spanning tree polytope, whereas $B(\gamma; \rho) \le A(\gamma)$ when $\rho = 1_F$. By continuity, we must have $B(\gamma; \rho) = A(\gamma)$ for some values of $\rho$, and it would be interesting to characterize such values where the reweighted Bethe approximation is exact.

Another interesting direction is to extend our theoretical results on properties of the reweighted Kikuchi approximation, which currently depend solely on the structure of the region graph and the weights $\rho$, to incorporate the effect of the model potentials $\theta$. For example, several authors [20, 6] present conditions under which loopy belief propagation applied to the unweighted Bethe approximation has a unique fixed point. The conditions for uniqueness of fixed points slightly generalize the conditions for convexity, and they involve both the graph structure and the strength of the potentials. We suspect that similar results would hold for the reweighted Kikuchi approximation.

#### Acknowledgments.

The authors thank Martin Wainwright for introducing the problem to them and providing helpful guidance. The authors also thank Varun Jog for discussions regarding the generalization of Hall’s lemma. The authors thank the anonymous reviewers for feedback that improved the clarity of the paper. PL was partly supported by a Hertz Foundation Fellowship and an NSF Graduate Research Fellowship while at Berkeley.

## References

• [1] S. M. Aji and R. J. McEliece. The generalized distributive law and free energy minimization. In Proceedings of the 39th Allerton Conference, 2001.
• [2] F. Barahona. On the computational complexity of Ising spin glass models. Journal of Physics A: Mathematical and General, 15(10):3241, 1982.
• [3] H. A. Bethe. Statistical theory of superlattices. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 150(871):552–575, 1935.
• [4] P. Hall. On representatives of subsets. Journal of the London Mathematical Society, 10:26–30, 1935.
• [5] T. Heskes. Stable fixed points of loopy belief propagation are minima of the Bethe free energy. In Advances in Neural Information Processing Systems 15, 2002.
• [6] T. Heskes. On the uniqueness of loopy belief propagation fixed points. Neural Computation, 16(11):2379–2413, 2004.
• [7] T. Heskes. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. Journal of Artificial Intelligence Research, 26:153–190, 2006.
• [8] A. T. Ihler, J. W. Fischer III, and A. S. Willsky. Loopy belief propagation: Convergence and effects of message errors. Journal of Machine Learning Research, 6:905–936, December 2005.
• [9] R. Kikuchi. A theory of cooperative phenomena. Phys. Rev., 81:988–1003, March 1951.
• [10] B. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms. Springer, 4th edition, 2007.
• [11] R. J. McEliece and M. Yildirim. Belief propagation on partially ordered sets. In Mathematical Systems Theory in Biology, Communications, Computation, and Finance, pages 275–300, 2002.
• [12] T. Meltzer, A. Globerson, and Y. Weiss. Convergent message passing algorithms: a unifying view. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI ’09, 2009.
• [13] J. M. Mooij and H. J. Kappen. Sufficient conditions for convergence of the sum-product algorithm. IEEE Transactions on Information Theory, 53(12):4422–4437, December 2007.
• [14] P. Pakzad and V. Anantharam. Estimation and marginalization using Kikuchi approximation methods. Neural Computation, 17:1836–1873, 2003.
• [15] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
• [16] T. Roosta, M. J. Wainwright, and S. S. Sastry. Convergence analysis of reweighted sum-product algorithms. IEEE Transactions on Signal Processing, 56(9):4293–4305, 2008.
• [17] D. Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(1–2):273 – 302, 1996.
• [18] N. Ruozzi. The Bethe partition function of log-supermodular graphical models. In Advances in Neural Information Processing Systems 25, 2012.
• [19] E. B. Sudderth, M. J. Wainwright, and A. S. Willsky. Loop series and Bethe variational bounds in attractive graphical models. In Advances in Neural Information Processing Systems 20, 2007.
• [20] S. C. Tatikonda and M. I. Jordan. Loopy belief propagation and Gibbs measures. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, UAI ’02, 2002.
• [21] P. O. Vontobel. The Bethe permanent of a nonnegative matrix. IEEE Transactions on Information Theory, 59(3):1866–1901, 2013.
• [22] M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.
• [23] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, January 2008.
• [24] Y. Watanabe and K. Fukumizu. Graph zeta function in the Bethe free energy and loopy belief propagation. In Advances in Neural Information Processing Systems 22, 2009.
• [25] Y. Watanabe and K. Fukumizu. Loopy belief propagation, Bethe free energy and graph zeta function. arXiv preprint arXiv:1103.0605, 2011.
• [26] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 12(1):1–41, 2000.
• [27] T. Werner. Primal view on belief propagation. In UAI 2010: Proceedings of the Conference of Uncertainty in Artificial Intelligence, pages 651–657, Corvallis, Oregon, July 2010. AUAI Press.
• [28] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Generalized belief propagation. In Advances in Neural Information Processing Systems 13, 2000.
• [29] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory, 51:2282–2312, 2005.

## Appendix A Proofs for Section 3.1

### A.1 Proof of Theorem 1

We use the proof technique of Theorem 1 in Pakzad and Anantharam [14] for the unweighted Bethe entropy, together with Lemma 1 in Appendix A.2, which provides a generalization of Hall’s marriage lemma for weighted bipartite graphs.

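We construct a bipartite graph $G = (V_1 \cup V_2, E)$ according to

$$V_1 := \{r \in R : \rho_r < 0\}, \quad \text{and} \quad V_2 := \{r \in R : \rho_r > 0\},$$

where $(s,t) \in E$ for $s \in V_1$ and $t \in V_2$ when $t \in A(s)$. Let weights $w$ be defined such that $w(s) := -\rho_s$ for $s \in V_1$ and $w(t) := \rho_t$ for $t \in V_2$. We claim that condition (19) of Lemma 1 is satisfied. Indeed, for $U \subseteq V_1$, we have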

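$$w(U) = -\sum_{s \in U} \rho_s \le \sum_{s \in A(U)} \rho_s = \sum_{s \in A(U) : \rho_s > 0} \rho_s + \sum_{s \in A(U) : \rho_s < 0} \rho_s \le \sum_{s \in A(U) : \rho_s > 0} \rho_s = w(N(U)),$$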

where the first inequality is a direct application of the assumption (10). Hence, by Lemma 1, we have a saturating edge labeling $\gamma$.

For each $t \in V_2$, define

$$\rho'_t := \rho_t - \sum_{s \in N(t)} \gamma_{st} \ge 0.$$

We may then write

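$$\begin{aligned}
H(\tau;\rho) &= \sum_{s \in V_1} \rho_s H_s(\tau_s) + \sum_{t \in V_2} \rho_t H_t(\tau_t) \\
&= \sum_{(s,t) \in E} \gamma_{st} \big\{ -H_s(\tau_s) + H_t(\tau_t) \big\} + \sum_{t \in V_2} \rho'_t H_t(\tau_t) \\
&= \sum_{(s,t) \in E} \gamma_{st} \Big\{ \sum_{x_s} \tau_s(x_s) \log \tau_s(x_s) - \sum_{x_t} \tau_t(x_t) \log \tau_t(x_t) \Big\} + \sum_{t \in V_2} \rho'_t H_t(\tau_t),
\end{aligned} \tag{18}$$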

where we have used the fact that , since , to obtain the last equality.

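Note that for each pair $(s,t) \in E$, we have

$$\sum_{x_t} \tau_t(x_t) \log\left(\frac{\tau_s(x_s)}{\tau_t(x_t)}\right) = -D_{KL}(\tau_t \,\|\, \tau_s),$$

which is strictly concave in the pair $(\tau_s, \tau_t)$. Furthermore, each term $\rho'_t H_t(\tau_t)$ is concave in $\tau_t$, since $\rho'_t \ge 0$. It follows by the expansion (18) that $H(\tau;\rho)$ is strictly concave, as wanted.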
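As a numerical sanity check of the identity connecting the last line of (18) to the display above, the following minimal Python sketch compares $\sum_{x_t} \tau_t(x_t) \log(\tau_s(x_s)/\tau_t(x_t))$ with $H_t(\tau_t) - H_s(\tau_s)$ on one small example. The table `tau_t` and its numerical values are hypothetical, and the sketch assumes that $\tau_s$ is obtained from $\tau_t$ by marginalization.

```python
import math


def entropy(p):
    """Shannon entropy (natural log) of a list of probabilities."""
    return -sum(q * math.log(q) for q in p if q > 0)


# Hypothetical pseudomarginal tau_t over x_t = (x_s, x_u), both binary.
tau_t = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

# tau_s is the marginal of tau_t on x_s (assumed local consistency).
tau_s = {a: sum(p for (xs, _), p in tau_t.items() if xs == a) for a in (0, 1)}

# Left-hand side: sum over x_t of tau_t(x_t) * log(tau_s(x_s) / tau_t(x_t)).
lhs = sum(p * math.log(tau_s[xs] / p) for (xs, _), p in tau_t.items() if p > 0)

# Right-hand side: H_t(tau_t) - H_s(tau_s), the bracketed term in (18).
rhs = entropy(list(tau_t.values())) - entropy(list(tau_s.values()))

identity_gap = abs(lhs - rhs)
```

The two quantities agree up to floating-point error only because `tau_s` is the exact marginal of `tau_t`; for an arbitrary pair of tables the identity need not hold.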

### A.2 Generalization of Hall’s marriage lemma

In this section, we prove a generalization of Hall’s marriage lemma, which is useful in proving concavity of the Bethe entropy function .

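Let $G = (V_1 \cup V_2, E)$ be a bipartite graph, where each vertex $s \in V_1 \cup V_2$ is assigned a weight $w(s)$. For a set $U$ of vertices, define

$$w(U) := \sum_{s \in U} w(s).$$

Also define the neighborhood set

$$N(U) := \bigcup_{s \in U} N(s),$$

where $N(s)$ is the usual neighborhood set of a single node.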

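We say that an edge labeling $\gamma = (\gamma_{st})_{(s,t) \in E}$ with nonnegative entries saturates $V_1$ if the following conditions hold:

1. For all $s \in V_1$, we have $\sum_{t \in N(s)} \gamma_{st} = w(s)$.

2. For all $t \in V_2$, we have $\sum_{s \in N(t)} \gamma_{st} \le w(t)$.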
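The two saturation conditions (an equality at every node of $V_1$ and an inequality at every node of $V_2$, as later written in (21) and (22)) can be checked mechanically. The following is a minimal sketch; the function name `saturates`, the dictionary-based representation, and the tolerance `tol` are assumptions made for illustration.

```python
def saturates(gamma, w, V1, V2, tol=1e-9):
    """Return True if the edge labeling gamma saturates V1.

    gamma: dict mapping edges (s, t), with s in V1 and t in V2, to labels.
    w:     dict mapping every vertex to its weight.
    """
    # Labels are required to be nonnegative.
    if any(g < -tol for g in gamma.values()):
        return False
    out_sum = {s: 0.0 for s in V1}  # sum of labels leaving each s in V1
    in_sum = {t: 0.0 for t in V2}   # sum of labels entering each t in V2
    for (s, t), g in gamma.items():
        out_sum[s] += g
        in_sum[t] += g
    cond1 = all(abs(out_sum[s] - w[s]) <= tol for s in V1)  # equality on V1
    cond2 = all(in_sum[t] <= w[t] + tol for t in V2)        # inequality on V2
    return cond1 and cond2


# Hypothetical example weights and labelings.
example_w = {"a": 1.0, "b": 2.0, "x": 2.0, "y": 1.5}
good_gamma = {("a", "x"): 1.0, ("b", "x"): 1.0, ("b", "y"): 1.0}
bad_gamma = {("a", "x"): 1.0, ("b", "x"): 2.0}
```

Here `good_gamma` meets both conditions, while `bad_gamma` routes total label 3 into `x`, which exceeds its weight 2.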

###### Lemma 1.

Suppose

$$w(U) \le w(N(U)), \quad \forall U \subseteq V_1. \tag{19}$$

Then there exists an edge labeling $\gamma$ that saturates $V_1$.
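For small graphs, the hypothesis (19) can be verified by enumerating every nonempty subset of $V_1$. The sketch below does this by brute force, so its cost is exponential in $|V_1|$; the function name `hall_condition_holds` and the input format are assumptions made for illustration.

```python
from itertools import combinations


def hall_condition_holds(w, V1, edges, tol=1e-12):
    """Brute-force check of w(U) <= w(N(U)) for every nonempty U within V1.

    w:     dict mapping every vertex to its weight.
    V1:    list of left vertices.
    edges: list of pairs (s, t) with s in V1 and t on the right side.
    """
    nbrs = {s: {t for (a, t) in edges if a == s} for s in V1}
    for k in range(1, len(V1) + 1):
        for U in combinations(V1, k):
            # N(U) is the union of the single-node neighborhoods N(s), s in U.
            N_U = set().union(*(nbrs[s] for s in U))
            if sum(w[s] for s in U) > sum(w[t] for t in N_U) + tol:
                return False
    return True


# Hypothetical examples: the first satisfies (19), the second violates it.
ok_case = hall_condition_holds(
    {"a": 1, "b": 2, "x": 2, "y": 1}, ["a", "b"],
    [("a", "x"), ("b", "x"), ("b", "y")],
)
bad_case = hall_condition_holds({"a": 2, "x": 1}, ["a"], [("a", "x")])
```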

###### Proof.

We prove the lemma in stages. First, assume $w(s) \in \mathbb{Q}$ for all $s \in V_1 \cup V_2$ and condition (19) holds. With an appropriate rescaling, we may assume that all weights are integers. Call the new weights $w'$. We then construct a graph $G'$ such that each node $s$ is expanded into a set $U_s$ of $w'(s)$ nodes, and edges of $G'$ are constructed by connecting all nodes in $U_s$ to all nodes in $U_t$, for each $(s,t) \in E$. By the usual version of Hall’s marriage lemma [4], there exists a matching of $G'$ that saturates $V_1' := \bigcup_{s \in V_1} U_s$. Indeed, it follows immediately from condition (19) that

$$w'(U) \le w'(N(U)), \quad \forall U \subseteq V_1.$$

Suppose $T' \subseteq V_1'$, and let $T := \{s \in V_1 : U_s \cap T' \neq \emptyset\}$. Then

$$|T'| \le \Big| \bigcup_{s \in T} U_s \Big| = w'(T) \le w'(N(T)) = |N(T')|,$$

so the sufficient condition of Hall’s marriage lemma is met, implying the existence of a matching. The edge labeling is obtained by setting

$$\gamma_{st} = \#\{\text{edges between } U_s \text{ and } U_t \text{ in the matching}\}$$

and rescaling.
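The integer-weight stage of the argument is constructive, and can be sketched in a few lines of Python: expand each node into as many copies as its weight, find a bipartite matching that covers all copies of $V_1$, and count matched edges between the copies of $s$ and $t$. The matching routine below is a standard augmenting-path search, which is a choice made here for illustration and is not prescribed by the proof; the function name and input format are likewise assumptions.

```python
def saturating_labeling(w, V1, edges):
    """Integer-weight case of Lemma 1 via node expansion and bipartite matching.

    w:     dict mapping each vertex to a nonnegative integer weight.
    V1:    list of left vertices.
    edges: list of pairs (s, t) with s in V1 and t on the right side.
    Returns a dict gamma[(s, t)] of matched-edge counts, or None when no
    matching covers every copy of every left vertex.
    """
    adj = {s: [t for (a, t) in edges if a == s] for s in V1}
    # Expand each left vertex s into w[s] copies (s, 0), ..., (s, w[s] - 1).
    left_copies = [(s, i) for s in V1 for i in range(w[s])]
    match_of = {}  # right copy (t, j) -> left copy currently matched to it

    def try_augment(u, seen):
        # Depth-first search for an augmenting path starting at left copy u.
        s, _ = u
        for t in adj[s]:
            for j in range(w[t]):  # each copy of s is adjacent to each copy of t
                v = (t, j)
                if v in seen:
                    continue
                seen.add(v)
                if v not in match_of or try_augment(match_of[v], seen):
                    match_of[v] = u
                    return True
        return False

    for u in left_copies:
        if not try_augment(u, set()):
            return None  # no matching covers all copies of V1

    # gamma_st = number of matched edges between the copies of s and of t.
    gamma = {e: 0 for e in edges}
    for (t, _), (s, _) in match_of.items():
        gamma[(s, t)] += 1
    return gamma


# Hypothetical example: w(a) = 1, w(b) = 2 on the left; w(x) = 2, w(y) = 1 on the right.
example_w = {"a": 1, "b": 2, "x": 2, "y": 1}
example_edges = [("a", "x"), ("b", "x"), ("b", "y")]
example_gamma = saturating_labeling(example_w, ["a", "b"], example_edges)
```

The returned counts are in the rescaled integer units; dividing by the rescaling factor recovers a labeling for the original rational weights.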

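Next, suppose $w(s) \in \mathbb{R}$ for all $s \in V_1 \cup V_2$ and condition (19) holds with strict inequality; i.e.,

$$w(U) < w(N(U)), \quad \forall\, \emptyset \neq U \subseteq V_1. \tag{20}$$

We claim that there exists an edge labeling $\gamma$ that saturates $V_1$. Indeed, let

$$\epsilon := \min_{\emptyset \neq U \subseteq V_1} \{w(N(U)) - w(U)\} > 0.$$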

Define a new weighting $w'$ with only rational values, such that

$$\begin{aligned}
w'(s) &\in \Big[ w(s),\, w(s) + \frac{\epsilon}{2 \cdot \deg(G)} \Big), \quad \forall s \in V_1, \\
w'(t) &\in \Big( w(t) - \frac{\epsilon}{2 \cdot \deg(G)},\, w(t) \Big], \quad \forall t \in V_2,
\end{aligned}$$

where $\deg(G)$ is the number of edges in $G$. It is clear that Hall’s condition (19) still holds for $w'$. Hence, by the result of the last paragraph, there exists an edge labeling $\gamma'$ that saturates $V_1$ with respect to $w'$. Observe that by decreasing the weights of $\gamma'$ slightly, we easily obtain an edge labeling $\gamma$ that saturates $V_1$ with respect to the original weighting $w$.

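Finally, consider the most general case: condition (19) holds and $w(s) \in \mathbb{R}$ for all $s \in V_1 \cup V_2$. Note that the problem of finding an edge labeling that saturates $V_1$ may be rephrased as follows. Let $b_1$ be the vector of weights $(w(s))_{s \in V_1}$. Then for an appropriate choice of the matrix $A_1$, the conditions

$$\sum_{t \in N(s)} \gamma_{st} = w(s), \quad \forall s \in V_1,$$

may be expressed as a system of linear equations,

$$A_1 \gamma = b_1. \tag{21}$$

Similarly, letting $b_2 := (w(t))_{t \in V_2}$, the conditions

$$\sum_{s \in N(t)} \gamma_{st} \le w(t), \quad \forall t \in V_2,$$

may be expressed in the form

$$A_2 \gamma \le b_2, \tag{22}$$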

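where $A_2$ is an appropriately chosen matrix. A saturating edge labeling exists if and only if there exists $\gamma \in \mathbb{R}^{|E|}_{\ge 0}$ that simultaneously satisfies conditions (21) and (22). Now consider a sequence of weight vectors $\{b_1^n\}_{n \ge 1}$, such that $b_1^n \to b_1$ and the convergence is from below and strictly monotone for each component. Let $w^n$ denote the full sequence of weights, given by $b_1^n$ on $V_1$ and $b_2$ on $V_2$. Then

$$w^n(U) < w(U) \le w(N(U)) = w^n(N(U)), \quad \forall\, \emptyset \neq U \subseteq V_1.$$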

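It follows by the result of the previous paragraph that there exists an edge labeling $\gamma^n$ such that

$$A_1 \gamma^n = b_1^n, \quad \text{and} \quad \gamma^n \in D := \{\gamma \in \mathbb{R}^{|E|}_{\ge 0} : A_2 \gamma \le b_2\}.$$

Clearly, $D$ is a closed set; furthermore, it is easy to see that the constraint $A_2 \gamma \le b_2$ implies that each component of $\gamma$ is bounded from above, since $A_2$ contains only nonnegative entries. It follows that the sequence $\{\gamma^n\}_{n \ge 1}$ has a limit point $\gamma^* \in D$. By continuity of the linear map $\gamma \mapsto A_1 \gamma$, we must have

$$A_1 \gamma^* = \lim_{n \to \infty} A_1 \gamma^n = \lim_{n \to \infty} b_1^n = b_1.$$

Hence, $\gamma^*$ is a valid edge labeling that saturates $V_1$.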

### A.3 Proof of Corollary 1

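By Theorem 1, $H(\tau;\rho)$ is strictly concave provided condition (10) holds. Note that

$$\mathcal{F}(\alpha) = \{\alpha\}, \quad \forall \alpha \in F,$$

whereas

$$\mathcal{F}(s) = \{s\} \cup N(s), \quad \forall s \in V.$$

Condition (10) applied to the singleton set $\{\alpha\}$ gives the inequality

$$\rho_\alpha \ge 0, \quad \forall \alpha \in F. \tag{23}$$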

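For a subset $U \subseteq V$, we can write

$$\mathcal{F}(U) = \bigcup_{s \in U} \mathcal{F}(s) = U \cup N(U) = U \cup \{\alpha \in F : \alpha \cap U \neq \emptyset\},$$

so (10) translates into

$$\sum_{s \in U} \rho_s + \sum_{\alpha \in F : \alpha \cap U \neq \emptyset} \rho_\alpha \ge 0, \quad \forall U \subseteq V, \tag{24}$$

which is condition (12). It is easy to see that conditions (23) and (24) together also imply the validity of condition (10) for any other set of regions in $R$.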
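Conditions (23) and (24) are finite systems of linear inequalities in the weights, so for small factor graphs they can be checked by enumeration. The following minimal sketch does so by brute force over all nonempty $U \subseteq V$. The function names and the representation of factors as `frozenset`s are assumptions made for illustration, and the test weights $\rho_\alpha = 1$, $\rho_s = 1 - \deg(s)$ are the standard unweighted Bethe choice, which is assumed here rather than taken from this appendix.

```python
from itertools import combinations


def bethe_conditions_hold(rho_v, rho_f, tol=1e-12):
    """Brute-force check of conditions (23) and (24).

    rho_v: dict mapping each vertex s to its weight rho_s.
    rho_f: dict mapping each factor (a frozenset of vertices) to rho_alpha.
    """
    # Condition (23): every factor weight is nonnegative.
    if any(r < -tol for r in rho_f.values()):
        return False
    vertices = list(rho_v)
    # Condition (24): for each nonempty U, vertex weights in U plus weights
    # of all factors that intersect U must sum to a nonnegative number.
    for k in range(1, len(vertices) + 1):
        for U in combinations(vertices, k):
            U_set = set(U)
            total = sum(rho_v[s] for s in U)
            total += sum(r for alpha, r in rho_f.items() if alpha & U_set)
            if total < -tol:
                return False
    return True


def standard_bethe_weights(vertices, factors):
    """Assumed unweighted Bethe choice: rho_alpha = 1, rho_s = 1 - deg(s)."""
    rho_f = {frozenset(a): 1.0 for a in factors}
    rho_v = {s: 1.0 - sum(1 for a in factors if s in a) for s in vertices}
    return rho_v, rho_f


# Hypothetical pairwise examples: a single 4-cycle and the complete graph K4.
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
k4 = [(i, j) for i in range(4) for j in range(i + 1, 4)]
cycle_ok = bethe_conditions_hold(*standard_bethe_weights(range(4), cycle4))
k4_ok = bethe_conditions_hold(*standard_bethe_weights(range(4), k4))
```

On the 4-cycle the conditions hold, with equality in (24) at $U = V$, whereas on $K_4$ the choice $U = V$ gives $-8 + 6 < 0$ and the check fails.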

## Appendix B Proofs for Section 3.2

### B.1 Proof of Theorem 2

Our result relies on the property that if the Bethe entropy is concave over , then is also concave over any subset . In particular, it is sufficient to assume that is binary, say ; the general multinomial case follows by restricting the distribution of to be supported on only two points.

The first lemma shows that for all . The proof is contained in Appendix B.2.

###### Lemma 2.

If the Bethe entropy is concave over , then for all .

To establish the necessity of condition (12), consider a nonempty subset and the corresponding sub-region graph , where . From the original weights , construct the sub-region weights given by

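$$\rho^U_s = \rho_s, \quad \forall s \in U, \quad \text{and} \quad \rho^U_{\alpha \cap U} = \rho_\alpha, \quad \forall \alpha \cap U \in F_U.$$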

For simplicity, we consider to be a multiset by remembering which factor each comes from; we can equivalently work with as a set by defining the weights