
Gauging Variational Inference

Computing the partition function is the most important statistical inference task arising in applications of Graphical Models (GMs). Since it is computationally intractable, approximate methods are used in practice, of which mean-field (MF) and belief propagation (BP) are arguably the most popular and successful approaches of a variational type. In this paper, we propose two new variational schemes, coined Gauged-MF (G-MF) and Gauged-BP (G-BP), improving MF and BP, respectively. Both provide lower bounds for the partition function by utilizing the so-called gauge transformation, which modifies factors of a GM while keeping the partition function invariant. Moreover, we prove that both G-MF and G-BP are exact for GMs with a single loop of a special structure, even though the bare MF and BP perform badly in this case. Our extensive experiments, on complete GMs of relatively small size and on large GMs (up to 300 variables), confirm that the newly proposed algorithms outperform and generalize MF and BP.


1 Introduction

Graphical Models (GMs) express the factorization of joint multivariate probability distributions in statistics via a graph of relations between variables. The concept of GM has been developed and used successfully in information theory (gallager1962low; kschischang1998iterative), physics (35Bet; 36Pei; 87MPZ; parisi1988statistical; 09MM), artificial intelligence (pearl2014probabilistic), and machine learning (jordan1998learning; freeman2000learning). Of the many inference problems one can formulate using a GM, computing the partition function (normalization), or equivalently computing marginal probability distributions, is the most important and universal inference task of interest. However, this paradigmatic problem is also known to be computationally intractable in general, i.e., it is #P-hard even to approximate (jerrum1993polynomial).

The Markov chain Monte Carlo (MCMC) method (alpaydin2014introduction) is a classical approach to the inference task, but it typically suffers from exponentially slow mixing or large variance. Variational inference instead states the inference task as an optimization; hence, it avoids these issues of MCMC and is often more favorable. The mean-field (MF) (parisi1988statistical) and belief propagation (BP) (pearl1982reverend) algorithms are arguably the most popular of the variational type. They are distributed, fast, and overall very successful in practical applications, even though they are heuristics lacking systematic error control. This has motivated researchers to seek methods with some guarantees, e.g., providing lower bounds (liu2012negative; ermon2012density) and upper bounds (wainwright2005new; liu2011bounding; ermon2012density) for the partition function of a GM.

In another line of research, which this paper extends and contributes to, the so-called re-parametrizations (03WJW), gauge transformations (GT) (06CCa; 06CCb) and holographic transformations (valiant2008holographic; al2011normal) were explored. This class of distinct, but related, transformations consists in modifying a GM by changing factors, associated with elements of the graph, continuously such that the partition function stays invariant (see 08JW; forney2011partition; Misha_notes for discussions of relations between the aforementioned techniques). In this paper, we choose to work with GT as the most general of the three approaches. Once applied to a GM, it transforms the original partition function, defined as a weighted series/sum over states, into a new one, dependent on the choice of gauges. In particular, a fixed point of BP minimizes the so-called Bethe free energy (05YFW), and it can also be understood as an optimal GT (06CCa; 06CCb; chernyak2007loop; mori2015holographic). Moreover, fixing the GT in accordance with BP results in the so-called loop series expression for the partition function (06CCa; 06CCb). In this paper we generalize (06CCa; 06CCb) and explore a more general class of GTs. This allows us to develop a new gauge-optimization approach resulting in 'better' variational inference schemes than those provided by MF, BP and other related methods.

Contribution.

The main contribution of this paper consists in developing two novel variational methods, called Gauged-MF (G-MF) and Gauged-BP (G-BP), providing lower bounds on the partition function of a GM. While MF minimizes the (exact) Gibbs free energy under (reduced) product distributions, G-MF does the same task while introducing an additional GT. Due to the additional degree of freedom in the optimization, G-MF systematically improves the lower bound on the partition function provided by MF. Similarly, G-BP generalizes BP, extending the interpretation of the latter as an optimization of the Bethe free energy over GT (06CCa; 06CCb; chernyak2007loop; mori2015holographic), by imposing additional constraints on the GT forcing all terms in the resulting series for the partition function to remain non-negative. Thus, G-BP results in a provable lower bound for the partition function, while BP does not (except for log-supermodular models (ruozzi2012bethe)).

We prove that both G-MF and G-BP are exact for GMs defined over single cycles, which we call 'alternating cycles/loops', as well as over line graphs. The alternating cycle case is surprising, as it represents the simplest 'counter-example' from weller2014understanding illustrating failures of MF and BP. For general GMs, we also establish that G-MF is better than, or at least as good as, G-BP. However, we also develop novel error correction schemes for G-BP such that its lower bound on the partition function can be improved systematically/sequentially, eventually outperforming G-MF at the expense of increased computational complexity. Such an error correction scheme has been studied for improving BP by considering the loop series consisting of positive and negative terms (chertkov2008belief; ahn2016synthesis). Due to our design of G-BP, the corresponding series consists of only non-negative terms, which makes it much easier to improve the quality of G-BP systematically.

We further found that our newly proposed GT-based optimizations can be restated as smooth and unconstrained ones, thus allowing efficient solutions via algorithms of a gradient descent type or any generic optimization solver such as IPOPT (wachter2006implementation). We experiment with IPOPT on complete GMs of relatively small size and on large GMs (up to 300 variables) of fixed degree, which confirms that the newly proposed algorithms outperform and generalize MF and BP. Finally, note that all statements of the paper are made within the framework of so-called Forney-style GMs (forney2001codes), which is general as it allows interactions beyond pair-wise (i.e., high-order GMs) and includes other/alternative GM formulations, such as the factor graphs of (03WJ). Our results using GT for variational inference provide a refreshing angle on the important inference task, and we believe they should be of broad interest in many applications involving GMs.

2 Preliminaries

2.1 Graphical model

Factor-graph model. Given an (undirected) bipartite factor graph (X, F, E), a joint distribution of (binary) random variables x = [x_v : v ∈ X] is called a factor-graph Graphical Model (GM) if it factorizes as follows:

 p(x) = (1/Z) ∏_{a∈F} f_a(x_{∂a}),

where the f_a are non-negative functions called factor functions, ∂a consists of the variable nodes neighboring factor a, and the normalization constant Z := Σ_x ∏_{a∈F} f_a(x_{∂a}) is called the partition function. A factor-graph GM is called pair-wise if |∂a| ≤ 2 for all a ∈ F, and high-order otherwise. It is known that approximating the partition function is #P-hard even for pair-wise GMs in general (jerrum1993polynomial).
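As a concrete illustration, the partition function of a tiny pairwise factor-graph GM can be computed by brute-force enumeration. The model below is a hypothetical toy example (two random factors over three binary variables), and the helper names `scopes` and `partition_function` are ours:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

# Toy factor-graph GM: variables x0, x1, x2; factor a acts on the subset ∂a.
scopes = [(0, 1), (1, 2)]                      # ∂a for the two factors
factors = [rng.random((2, 2)) for _ in scopes]  # non-negative factor tables

def partition_function(scopes, factors, n_vars):
    """Sum the product of factors over all 2^n_vars states (tractable only for tiny GMs)."""
    Z = 0.0
    for x in product((0, 1), repeat=n_vars):
        w = 1.0
        for scope, f in zip(scopes, factors):
            w *= f[tuple(x[v] for v in scope)]
        Z += w
    return Z

Z = partition_function(scopes, factors, n_vars=3)
print(Z)
```

For this chain-structured toy model, the same sum collapses to a matrix contraction, which makes the exponential cost of the generic enumeration above easy to see.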

Forney-style model. In this paper, we primarily use the Forney-style GM (forney2001codes) instead of the factor-graph GM. Elementary random variables in a Forney-style GM are associated with the edges of an undirected graph (V, E). Then the random vector x = [x_{ab} : {a,b} ∈ E] is realized with the probability distribution

 p(x) = (1/Z) ∏_{a∈V} f_a(x_a), (1)

where x_a is associated with the set of edges neighboring node a, i.e., x_a = [x_{ab} : b ∈ ∂a], and Z := Σ_x ∏_{a∈V} f_a(x_a). As argued in 06CCa; 06CCb, the Forney-style GM constitutes a more universal and compact description for gauge transformations without any restriction of generality, i.e., given any factor-graph GM, one can construct an equivalent Forney-style GM (see the supplementary material).

2.2 Mean-field and belief propagation

In this section, we introduce the two most popular methods for approximating the partition function: the mean-field and Bethe (i.e., belief propagation) approximation methods. Given any (Forney-style) GM defined as in (1) and any distribution q over all variables, the Gibbs free energy is defined as

 F_Gibbs(q) := Σ_{x∈{0,1}^E} q(x) log [ q(x) / ∏_{a∈V} f_a(x_a) ]. (2)

Then the partition function is derived according to log Z = −min_q F_Gibbs(q), where the optimum is achieved at q = p, e.g., see (03WJ). This optimization is over all valid probability distributions on the exponentially large space and is obviously intractable.

In the case of the mean-field (MF) approximation, we minimize the Gibbs free energy over the family of tractable probability distributions factorized into the following product: q(x) = ∏_{{a,b}∈E} q_{ab}(x_{ab}), where each independent q_{ab} is a proper probability distribution, behaving as a (mean-field) proxy to the marginal of q over x_{ab}. By construction, the MF approximation provides a lower bound for log Z. In the case of the Bethe approximation, the so-called Bethe free energy approximates the Gibbs free energy (yedidia2001bethe):

 F_Bethe(b) = Σ_{a∈V} Σ_{x_a∈{0,1}^{∂a}} b_a(x_a) log [ b_a(x_a) / f_a(x_a) ] − Σ_{{a,b}∈E} Σ_{x_{ab}∈{0,1}} b_{ab}(x_{ab}) log b_{ab}(x_{ab}), (3)

where the beliefs b = [b_a; b_{ab}] should satisfy the following 'consistency' constraints:

 0 ≤ b_a, b_{ab} ≤ 1,  Σ_{x_{ab}∈{0,1}} b_{ab}(x_{ab}) = 1,  Σ_{x′_a∖x_{ab}} b_a(x′_a) = b_{ab}(x_{ab})  ∀{a,b}∈E.

Here, x_a∖x_{ab} denotes the vector x_a with x_{ab} fixed, and Z_Bethe := exp(−min_b F_Bethe(b)) is the Bethe estimate for Z. The popular belief propagation (BP) distributed heuristic solves this optimization iteratively (yedidia2001bethe). The Bethe approximation is exact over trees, i.e., Z_Bethe = Z. However, in the case of a general loopy graph, the BP estimate lacks approximation guarantees. It is known, however, that the result of the BP optimization lower bounds the log-partition function, i.e., log Z_Bethe ≤ log Z, if the factors are log-supermodular (ruozzi2012bethe).
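The mean-field lower-bound property above is easy to verify numerically. The following sketch builds a toy Forney-style GM on a triangle (our own illustrative model, not one from the paper) and checks that −F_Gibbs(q) ≤ log Z for a randomly drawn product distribution q:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)

# Forney-style GM on a triangle: one binary variable per edge e0, e1, e2;
# node a carries a strictly positive factor over its two incident edge variables.
scopes = {0: (0, 2), 1: (0, 1), 2: (1, 2)}          # node -> incident edge indices
f = {a: rng.random((2, 2)) + 0.1 for a in scopes}

def weight(x):
    """Unnormalized weight prod_a f_a(x_a) of a full edge configuration x."""
    w = 1.0
    for a, (i, j) in scopes.items():
        w *= f[a][x[i], x[j]]
    return w

Z = sum(weight(x) for x in product((0, 1), repeat=3))

# A random product distribution q(x) = prod_e q_e(x_e); q_e[e] = Prob(x_e = 1).
q_e = rng.random(3)
def q(x):
    return np.prod([q_e[e] if x[e] else 1 - q_e[e] for e in range(3)])

# Gibbs free energy (2) evaluated by brute force.
F_gibbs = sum(q(x) * np.log(q(x) / weight(x)) for x in product((0, 1), repeat=3))
print(-F_gibbs, np.log(Z))   # first value is a lower bound on the second
```

The gap log Z − (−F_Gibbs(q)) equals KL(q ‖ p) ≥ 0, so the inequality holds for any valid q; mean field simply searches for the best q within the product family.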

2.3 Gauge transformation

Gauge transformation (GT) (06CCa; 06CCb) is a family of linear transformations of the factor functions in (1) which leaves the partition function invariant. It is defined with respect to the following set of invertible 2×2 matrices G_{ab} for {a,b} ∈ E, coined gauges:

 G_{ab} = [ G_{ab}(0,0)  G_{ab}(0,1) ;  G_{ab}(1,0)  G_{ab}(1,1) ].

The GM, gauge transformed with respect to G = {G_{ab}, G_{ba} : {a,b} ∈ E}, consists of factors expressed as:

 f_{a,G}(x_a) = Σ_{x′_a∈{0,1}^{∂a}} f_a(x′_a) ∏_{b∈∂a} G_{ab}(x_{ab}, x′_{ab}).

Here one treats x_{ab} and x_{ba} as independent and equivalent for notational convenience, and (G_{ab}, G_{ba}) is a conjugated pair of distinct matrices satisfying the gauge constraint G_{ab}^⊤ G_{ba} = I, where I is the 2×2 identity matrix. Then, one can prove invariance of the partition function under the transformation:

 Z = Σ_{x∈{0,1}^{|E|}} ∏_{a∈V} f_a(x_a) = Σ_{x∈{0,1}^{|E|}} ∏_{a∈V} f_{a,G}(x_a). (4)

Consequently, GT results in the gauge transformed distribution p_G(x) = (1/Z) ∏_{a∈V} f_{a,G}(x_a). Note that some components of p_G can be negative, in which case it is not a valid probability distribution.
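The invariance (4) can be checked numerically. The sketch below (on a toy triangle Forney-style GM of our own choosing) draws a random gauge G_ab per edge, sets its conjugate to G_ba = (G_ab^{-1})^⊤ so that G_ab^⊤ G_ba = I, transforms all factors, and compares the two sums in (4):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)

edges = [(0, 1), (1, 2), (0, 2)]
nbrs = {0: [(0, 1), (0, 2)], 1: [(0, 1), (1, 2)], 2: [(1, 2), (0, 2)]}
f = {a: rng.random((2, 2)) for a in nbrs}   # f_a indexed by its two edges, in nbrs order

# Random gauges: draw G_ab freely, then the conjugate G_ba = (G_ab^{-1})^T.
G = {}
for (a, b) in edges:
    M = rng.random((2, 2)) + np.eye(2)       # generically invertible
    G[(a, (a, b))] = M
    G[(b, (a, b))] = np.linalg.inv(M).T

def gauged_factor(a):
    """f_{a,G}(x_a) = sum_{x'_a} f_a(x'_a) prod_b G_ab(x_ab, x'_ab)."""
    e1, e2 = nbrs[a]
    return np.einsum('ij,xi,yj->xy', f[a], G[(a, e1)], G[(a, e2)])

def Z_of(factors):
    total = 0.0
    for x in product((0, 1), repeat=len(edges)):
        xe = dict(zip(edges, x))
        w = 1.0
        for a, (e1, e2) in nbrs.items():
            w *= factors[a][xe[e1], xe[e2]]
        total += w
    return total

Z = Z_of(f)
Z_gauged = Z_of({a: gauged_factor(a) for a in nbrs})
print(Z, Z_gauged)   # equal up to floating-point error
```

The equality follows because summing a gauged edge variable contracts G_ab^⊤ G_ba = I into a Kronecker delta, which restores the original sum.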

We remark that the Bethe/BP approximation can be interpreted as a specific choice of GT (06CCa; 06CCb). Indeed, any fixed point of BP corresponds to a special set of gauges making an arbitrarily picked configuration/state, e.g., x = (0,0,…), least sensitive to local variation of the gauge. Formally, the following non-convex optimization is known to be equivalent to the Bethe approximation:

 maximize_G Σ_{a∈V} log f_{a,G}(0,0,…)  subject to G_{ab}^⊤ G_{ba} = I, ∀{a,b}∈E, (5)

and the set of BP gauges corresponds to stationary points of (5), with the objective at such points equal to minus the respective Bethe free energy, i.e., Σ_{a∈V} log f_{a,G}(0,0,…) = −F_Bethe.

3 Gauge optimization for approximating partition functions

Now we are ready to describe two novel gauge optimization schemes (different from (5)) providing guaranteed lower-bound approximations for Z. Our first GT scheme, coined Gauged-MF (G-MF), should be viewed as modifying and improving the MF approximation, while our second GT scheme, coined Gauged-BP (G-BP), modifies and improves the Bethe approximation so that it now provides a provable lower bound for Z, while the bare BP does not have such a guarantee. The G-BP scheme also allows further improvement (in terms of output quality) at the expense of making the underlying algorithm/computation more complex.

3.1 Gauged mean-field

We first propose the following optimization inspired by, and also improving, the MF approximation:

 maximize_{q,G} Σ_{a∈V} Σ_{x_a∈{0,1}^{∂a}} q_a(x_a) log f_{a,G}(x_a) − Σ_{{a,b}∈E} Σ_{x_{ab}∈{0,1}} q_{ab}(x_{ab}) log q_{ab}(x_{ab})
 subject to G_{ab}^⊤ G_{ba} = I, ∀{a,b}∈E,
  f_{a,G}(x_a) ≥ 0, ∀a∈V, ∀x_a∈{0,1}^{∂a},
  q(x) = ∏_{{a,b}∈E} q_{ab}(x_{ab}),  q_a(x_a) = ∏_{b∈∂a} q_{ab}(x_{ab}), ∀a∈V. (6)

Recall that the MF approximation optimizes the Gibbs free energy with respect to q given the original GM, i.e., the factors f_a. On the other hand, (6) jointly optimizes it over q and G. Since the partition function of the gauge transformed GM is equal to that of the original GM, (6) also outputs a lower bound on the (original) partition function, and always outperforms MF due to the additional degree of freedom in G. The non-negativity constraints on the factors enforce that the gauge transformed GM defines a valid probability distribution (all components are non-negative).

To solve (6), we propose a strategy alternating between two optimizations, formally stated in Algorithm 1. The alternation is between updating q within Step A, and updating G within Step C. The optimization in Step A is simple, as one can apply any solver of the mean-field approximation. On the other hand, Step C requires a new solver and, at first glance, looks complicated due to its nonlinear constraints. However, the constraints can actually be eliminated. Indeed, one observes that the non-negativity constraint is redundant, because each term in the optimization objective already prevents factors from getting close to zero, thus keeping them positive. Equivalently, once the current G satisfies the non-negativity constraints, the objective, Σ_{a∈V} Σ_{x_a} q_a(x_a) log f_{a,G}(x_a), acts as a log-barrier forcing the constraints to be satisfied at the next step of an iterative optimization procedure. Furthermore, the gauge constraint, G_{ab}^⊤ G_{ba} = I, can also be removed by simply expressing one of the two gauges via the other, e.g., G_{ba} via (G_{ab}^{-1})^⊤. Then, Step C can be resolved by any unconstrained iterative optimization method of a gradient descent type, or by a generic optimization solver such as IPOPT (wachter2006implementation). The additional (intermediate) Step B handles extreme cases in which q_{ab}(x_{ab}) = 0 for some {a,b} ∈ E at the optimum; we resolve the singularity by perturbing the distribution, setting zero probabilities to a sufficiently small value δ > 0. In summary, it is straightforward to check that Algorithm 1 converges to a local optimum of (6), similarly to other solvers developed for the mean-field and Bethe approximations.
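A minimal sketch of Step C follows, with q held fixed at the uniform product distribution and a crude numerical-gradient ascent standing in for the solvers used in the paper; the toy triangle model, seed, step size, and iteration budget are all illustrative assumptions. The gauge constraint is eliminated exactly as described above: each edge carries one free matrix, and its conjugate is set to the inverse transpose.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(3)

edges = [(0, 1), (1, 2), (0, 2)]
nbrs = {0: [(0, 1), (0, 2)], 1: [(0, 1), (1, 2)], 2: [(1, 2), (0, 2)]}
f = {a: rng.random((2, 2)) + 0.1 for a in nbrs}
q_e = {e: np.array([0.5, 0.5]) for e in edges}       # Step A output, held fixed: uniform q

def objective(theta):
    """Objective of (6) for product q and gauges parametrized by one free matrix per edge."""
    try:
        G = {}
        for k, (a, b) in enumerate(edges):
            G[(a, (a, b))] = theta[k]
            G[(b, (a, b))] = np.linalg.inv(theta[k]).T   # enforces G_ab^T G_ba = I
    except np.linalg.LinAlgError:
        return -np.inf
    val = 0.0
    for a, (e1, e2) in nbrs.items():
        fa = np.einsum('ij,xi,yj->xy', f[a], G[(a, e1)], G[(a, e2)])
        if not np.all(fa > 0):
            return -np.inf                               # non-negativity constraint of (6)
        val += np.sum(np.outer(q_e[e1], q_e[e2]) * np.log(fa))   # the log-barrier terms
    return val - sum(np.sum(qe * np.log(qe)) for qe in q_e.values())

theta = np.stack([np.eye(2)] * 3)     # identity gauges: objective equals the MF objective
mf_value = objective(theta)
best = mf_value
for _ in range(100):                  # Step C: numerical-gradient ascent, improving steps only
    grad = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        t = theta.copy()
        t[idx] += 1e-5
        grad[idx] = (objective(t) - best) / 1e-5
    if not np.all(np.isfinite(grad)):
        break
    cand = theta + 0.05 * grad
    if objective(cand) > best:
        theta, best = cand, objective(cand)

def weight(x):
    xe = dict(zip(edges, x))
    return np.prod([f[a][xe[e1], xe[e2]] for a, (e1, e2) in nbrs.items()])

logZ = np.log(sum(weight(x) for x in product((0, 1), repeat=3)))
print(mf_value, best, logZ)
```

By construction the loop only accepts improving steps, so the final objective is at least the mean-field value obtained with identity gauges and, since the gauged factors stay positive, it remains a valid lower bound on log Z.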

We also provide an important class of GMs where Algorithm 1 provably outperforms both the MF and BP (Bethe) approximations. Specifically, we prove that the optimization (6) is exact when the graph is a line (a special case of a tree) and, somewhat surprisingly, when it is a single loop/cycle with an odd number of factors represented by negative definite matrices. In fact, the latter case is the so-called 'alternating cycle' example, introduced in weller2014understanding as the simplest loopy example where the MF and BP approximations perform quite badly. Formally, we state the following theorem, whose proof is given in the supplementary material.

Theorem 1.

For any GM defined on a line graph or an alternating cycle, the optimal objective of (6) is equal to the exact log partition function, i.e., to log Z.

3.2 Gauged belief propagation

We start the discussion of the G-BP scheme by noticing that, according to chertkov2006loop, the G-MF gauge optimization (6) can be reduced to the BP/Bethe gauge optimization (5) by eliminating the non-negativity constraint on each factor and replacing the product distribution q by:

 q(x) = { 1 if x = (0,0,⋯),  0 otherwise. (7)

Motivated by this observation, we propose the following G-BP optimization:

 maximize_G Σ_{a∈V} log f_{a,G}(0,0,⋯)
 subject to G_{ab}^⊤ G_{ba} = I, ∀{a,b}∈E,
  f_{a,G}(x_a) ≥ 0, ∀a∈V, ∀x_a∈{0,1}^{∂a}. (8)

The only difference between (5) and (8) is the addition of the non-negativity constraints on the factors in (8). Hence, (8) outputs a lower bound on the partition function, while (5) can be larger or smaller than Z. It is also easy to verify that (8) (for G-BP) is equivalent to (6) (for G-MF) with q fixed to (7). Hence, we propose the algorithmic procedure for solving (8) formally described in Algorithm 2; it should be viewed as a modification of Algorithm 1 with q replaced by (7) in Step A, and with a properly chosen log-barrier term in Step C. As discussed for Algorithm 1, it is straightforward to verify that Algorithm 2 also converges to a local optimum of (8), and one can replace G_{ba} by (G_{ab}^{-1})^⊤ for each pair of conjugated matrices in order to build a convergent gradient descent implementation of the optimization.

Since fixing q eliminates a degree of freedom in (6), G-BP should perform worse than G-MF, i.e., (8) ≤ (6). However, G-BP is still meaningful for the following reasons. First, Theorem 1 still holds for (8), i.e., the optimal q of (6) is achieved at (7) for any line graph or alternating cycle (see the proof of Theorem 1 in the supplementary material). More importantly, G-BP can be corrected systematically. At a high level, the 'error-correction' strategy consists in correcting the approximation error of (8) sequentially while maintaining the desired lower-bound guarantee. The key idea is to decompose the error of (8) into partition functions of multiple GMs, and then repeatedly lower bound each partition function. Formally, we fix an arbitrary ordering of the edges, e_1, …, e_{|E|}, and define the corresponding GM for each i = 1, …, |E| as follows: p_i(x) = (1/Z_i) ∏_{a∈V} f_{a,G}(x_a) for x ∈ X_i, where Z_i := Σ_{x∈X_i} ∏_{a∈V} f_{a,G}(x_a) and

 X_i := {x : x_{e_i} = 1, x_{e_j} = 0, x_{e_k} ∈ {0,1}, ∀ j, k such that 1 ≤ j < i < k ≤ |E|}.

Namely, we consider the GMs obtained from sequential conditioning of x_{e_1}, …, x_{e_{|E|}} in the gauge transformed GM. Next, recall that (8) maximizes ∏_{a∈V} f_{a,G}(0,0,⋯), i.e., the contribution of the single configuration x = (0,0,⋯). Then, since Z = Σ_x ∏_{a∈V} f_{a,G}(x_a) and the sets {(0,0,⋯)}, X_1, …, X_{|E|} partition the configuration space, the error of (8) can be decomposed as follows:

 Z − ∏_{a∈V} f_{a,G}(0,0,⋯) = Σ_{i=1}^{|E|} Σ_{x∈X_i} ∏_{a∈V} f_{a,G}(x_a) = Σ_{i=1}^{|E|} Z_i. (9)

Now, one can run G-MF, G-BP or any other method (e.g., MF) again to obtain a lower bound Ẑ_i of Z_i for all i, and then output ∏_{a∈V} f_{a,G}(0,0,⋯) + Σ_i Ẑ_i. However, such additional runs of the optimization inevitably increase the overall complexity. Instead, one can also pick a single configuration x^{(i)} from each X_i immediately after solving (8), and output

 ∏_{a∈V} f_{a,G}(0,0,⋯) + Σ_{i=1}^{|E|} ∏_{a∈V} f_{a,G}(x^{(i)}_a),  x^{(i)} = [x_{e_i} = 1, x_{e_j} = 0, ∀ j ≠ i], (10)

as a better lower bound for Z than ∏_{a∈V} f_{a,G}(0,0,⋯). This choice is based on the intuition that configurations only partially different from x = (0,0,⋯) may be significant too, as they share most of their factor values with the zero configuration maximized in (8). In fact, one can choose even more configurations (partially different from x = (0,0,⋯)) by paying more in complexity, which is always better as it brings the approximation closer to the true partition function. In our experiments, we consider the additional configurations x^{(i,i′)}, i.e., output

 ∏_{a∈V} f_{a,G}(0,0,⋯) + Σ_{i=1}^{|E|} Σ_{i′=i}^{|E|} ∏_{a∈V} f_{a,G}(x^{(i,i′)}_a),  x^{(i,i′)} = [x_{e_i} = 1, x_{e_{i′}} = 1, x_{e_j} = 0, ∀ j ≠ i, i′], (11)

as a better lower bound of Z than (10).
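The decomposition (9) is just a partition of the configuration space by the first edge (in the fixed ordering) set to 1, and can be verified by brute force. The triangle model below is our own toy example:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
edges = [(0, 1), (1, 2), (0, 2)]            # fixed edge ordering e_1, e_2, e_3
nbrs = {0: [0, 2], 1: [0, 1], 2: [1, 2]}    # node -> positions of its edges in `edges`
f = {a: rng.random((2, 2)) for a in nbrs}

def weight(x):
    w = 1.0
    for a, (i, j) in nbrs.items():
        w *= f[a][x[i], x[j]]
    return w

Z = sum(weight(x) for x in product((0, 1), repeat=3))
zero_term = weight((0, 0, 0))

# Z_i sums over X_i = {x : x_{e_i}=1, x_{e_j}=0 for j<i, x_{e_k} free for k>i}.
Z_i = [sum(weight(x) for x in product((0, 1), repeat=3)
           if x[i] == 1 and all(x[j] == 0 for j in range(i)))
       for i in range(3)]
print(Z - zero_term, sum(Z_i))   # the two sides of (9)
```

Every nonzero configuration lands in exactly one X_i, which is why the residual error splits into the partition functions Z_i without overlap.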

4 Experimental results

In this section, we report results of our experiments with G-MF and G-BP as defined in Section 3. We also experiment with G-BP boosted by the error-correction schemes accounting for a single term (10) and multiple terms (11), as well as by applying G-BP sequentially again to each residual partition function Z_i. The error decreases, while the evaluation complexity increases, as we move from G-BP-single to G-BP-multiple and then to G-BP-sequential. As mentioned earlier, we use the IPOPT solver (wachter2006implementation) to resolve the proposed gauge optimizations. We generate random GMs with factors dependent on 'interaction strength' parameters β_a (akin to an inverse temperature) as follows:

 f_a(x_a) = exp(−β_a |h_0(x_a) − h_1(x_a)|),

where h_0(x_a) and h_1(x_a) count the numbers of 0 and 1 contributions in x_a, respectively. Intuitively, we expect that as β_a increases, it becomes more difficult to approximate the partition function. See the supplementary material for additional information on how we generate the random models.
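A sketch of this factor construction follows (the exact model-generation procedure in the supplementary material may differ in details; the helper name `factor` is ours):

```python
import numpy as np

def factor(x_a, beta_a):
    """f_a(x_a) = exp(-beta_a * |h0(x_a) - h1(x_a)|), where h0/h1 count 0s/1s in x_a."""
    h1 = int(np.sum(x_a))
    h0 = len(x_a) - h1
    return np.exp(-beta_a * abs(h0 - h1))

print(factor((0, 1), 2.0))   # balanced configuration: weight 1 regardless of beta
print(factor((1, 1), 2.0))   # unbalanced: exponentially suppressed as beta grows
```

Balanced configurations (equal counts of 0s and 1s) always receive weight 1, while unbalanced ones are suppressed more strongly as β_a grows, which is why larger β_a makes the model harder.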

In the first set of experiments, we consider relatively small, complete graphs with two types of factors: random generic (non-log-supermodular) factors and log-supermodular (positive/ferromagnetic) factors. Recall that the bare BP also provides a lower bound in the log-supermodular case (ruozzi2012bethe), thus making the comparison between each proposed algorithm and BP informative. We use the log-partition approximation error, i.e., the gap between log Z and the algorithm output (a lower bound of log Z), to quantify each algorithm's performance. In this first set of experiments, we deal with relatively small graphs, and the explicit computation of Z (and hence of the approximation error) is feasible. The results for the experiments on small graphs are illustrated in Figure 1 and Figure 2 for the non-log-supermodular and log-supermodular cases, respectively. Figure 1 shows that, as expected, G-MF always outperforms MF. Moreover, we observe that G-MF typically provides the tightest lower bound, unless it is outperformed by G-BP-multiple or G-BP-sequential. We remark that BP is not shown in Figure 1, because in this non-log-supermodular case it does not provide a lower bound in general. According to Figure 2, showing the log-supermodular case, both G-MF and G-BP outperform MF, while G-BP-sequential outperforms all other algorithms. Notice that G-BP performs rather similarly to BP in the log-supermodular case, suggesting that the constraints distinguishing (8) from (5) are only mildly violated.

In the second set of experiments, we consider sparser, larger graphs of two types: regular graphs and grid graphs with up to 300 variables. As in the first set of experiments, the same non-log-supermodular/log-supermodular factors are considered. Since computing the exact approximation error is not feasible for the large graphs, we instead measure the ratio of the estimate produced by each proposed algorithm to that of MF; a larger value of the ratio indicates better performance. The results are reported in Figure 3 and Figure 4 for the non-log-supermodular and log-supermodular cases, respectively. In Figure 3, we observe that G-MF and G-BP-sequential outperform MF significantly on the regular graphs; we also observe that even the bare G-BP outperforms MF. In Figure 4, the algorithms associated with G-BP outperform both G-MF and MF. This is because log-supermodular models favor the choice (7) of q for G-BP, i.e., much of the probability mass is concentrated around a single configuration. One observes (again) that the performance of G-BP in this log-supermodular case is almost on par with BP. This implies that G-BP generalizes BP well: the former provides a lower bound of Z for any GM, while the latter does so only for log-supermodular GMs.

5 Conclusion and future research

We explore the freedom in gauge transformations of GMs and develop novel variational inference methods which result in significant improvement of partition function estimation. In this paper, we have focused solely on designing approaches which improve the bare/basic MF and BP via specially optimized gauge transformations. In terms of the path forward, it is of interest to extend this GT framework/approach to other variational methods, e.g., the Kikuchi approximation (kikuchi1951theory) and structured/conditional MF (saul1996exploiting; carbonetto2007conditional). Furthermore, G-BP and G-MF were resolved in our experiments via a generic optimization solver (IPOPT), which was sufficient for the illustrative tests conducted so far; however, we expect that it should be possible to develop more efficient distributed solvers of the BP type. Finally, we plan to work on applications of the newly designed methods and algorithms to a variety of practical inference tasks associated with GMs.

References

• (1) Robert Gallager. Low-density parity-check codes. IRE Transactions on information theory, 8(1):21–28, 1962.
• (2) Frank R. Kschischang and Brendan J. Frey. Iterative decoding of compound codes by probability propagation in graphical models. IEEE Journal on Selected Areas in Communications, 16(2):219–230, 1998.
• (3) Hans A. Bethe. Statistical theory of superlattices. Proceedings of the Royal Society of London A, 150:552, 1935.
• (4) Rudolf E. Peierls. Ising’s model of ferromagnetism. Proceedings of Cambridge Philosophical Society, 32:477–481, 1936.
• (5) Marc Mézard, Giorgio Parisi, and M. A. Virasoro. Spin Glass Theory and Beyond. Singapore: World Scientific, 1987.
• (6) Giorgio Parisi. Statistical field theory, 1988.
• (7) Marc Mezard and Andrea Montanari. Information, Physics, and Computation. Oxford University Press, Inc., New York, NY, USA, 2009.
• (8) Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 2014.
• (9) Michael Irwin Jordan. Learning in graphical models, volume 89. Springer Science & Business Media, 1998.
• (10) William T. Freeman, Egon C. Pasztor, and Owen T. Carmichael. Learning low-level vision. International Journal of Computer Vision, 40(1):25–47, 2000.
• (11) Mark Jerrum and Alistair Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22(5):1087–1116, 1993.
• (12) Ethem Alpaydin. Introduction to machine learning. MIT press, 2014.
• (13) Judea Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive Systems Laboratory, School of Engineering and Applied Science, University of California, Los Angeles, 1982.
• (14) Qiang Liu and Alexander T Ihler. Negative tree reweighted belief propagation. arXiv preprint arXiv:1203.3494, 2012.
• (15) Stefano Ermon, Ashish Sabharwal, Bart Selman, and Carla P Gomes. Density propagation and improved bounds on the partition function. In Advances in Neural Information Processing Systems, pages 2762–2770, 2012.
• (16) Martin J Wainwright, Tommi S Jaakkola, and Alan S Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.
• (17) Qiang Liu and Alexander T Ihler. Bounding the partition function using Hölder's inequality. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 849–856, 2011.
• (18) Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. Tree-based reparametrization framework for approximate estimation on graphs with cycles. IEEE Transactions on Information Theory, 49(5):1120–1146, 2003.
• (19) Michael Chertkov and Vladimir Chernyak. Loop calculus in statistical physics and information science. Physical Review E, 73:065102(R), 2006.
• (20) Michael Chertkov and Vladimir Chernyak. Loop series for discrete statistical models on graphs. Journal of Statistical Mechanics, page P06009, 2006.
• (21) Leslie G Valiant. Holographic algorithms. SIAM Journal on Computing, 37(5):1565–1594, 2008.
• (22) Ali Al-Bashabsheh and Yongyi Mao. Normal factor graphs and holographic transformations. IEEE Transactions on Information Theory, 57(2):752–763, 2011.
• (23) Martin J. Wainwright and Michael E. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1):1–305, 2008.
• (24) G David Forney Jr and Pascal O Vontobel. Partition functions of normal factor graphs. arXiv preprint arXiv:1102.0316, 2011.
• (25) Michael Chertkov. Lecture notes on “statistical inference in structured graphical models: Gauge transformations, belief propagation & beyond", 2016.
• (26) J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. Information Theory, IEEE Transactions on, 51(7):2282–2312, 2005.
• (27) Vladimir Y Chernyak and Michael Chertkov. Loop calculus and belief propagation for q-ary alphabet: Loop tower. In Information Theory, 2007. ISIT 2007. IEEE International Symposium on, pages 316–320. IEEE, 2007.
• (28) Ryuhei Mori. Holographic transformation, belief propagation and loop calculus for generalized probabilistic theories. In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 1099–1103. IEEE, 2015.
• (29) Nicholas Ruozzi. The bethe partition function of log-supermodular graphical models. In Advances in Neural Information Processing Systems, pages 117–125, 2012.
• (30) Adrian Weller, Kui Tang, Tony Jebara, and David Sontag. Understanding the bethe approximation: when and how can it go wrong? In UAI, pages 868–877, 2014.
• (31) Michael Chertkov, Vladimir Y Chernyak, and Razvan Teodorescu. Belief propagation and loop series on planar graphs. Journal of Statistical Mechanics: Theory and Experiment, 2008(05):P05003, 2008.
• (32) Sung-Soo Ahn, Michael Chertkov, and Jinwoo Shin. Synthesis of mcmc and belief propagation. In Advances in Neural Information Processing Systems, pages 1453–1461, 2016.
• (33) Andreas Wächter and Lorenz T Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical programming, 106(1):25–57, 2006.
• (34) G David Forney. Codes on graphs: Normal realizations. IEEE Transactions on Information Theory, 47(2):520–548, 2001.
• (35) Martin Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, Department of Statistics, 2003.
• (36) Jonathan S Yedidia, William T Freeman, and Yair Weiss. Bethe free energy, kikuchi approximations, and belief propagation algorithms. Advances in neural information processing systems, 13, 2001.
• (37) Michael Chertkov and Vladimir Y Chernyak. Loop series for discrete statistical models on graphs. Journal of Statistical Mechanics: Theory and Experiment, 2006(06):P06009, 2006.
• (38) Ryoichi Kikuchi. A theory of cooperative phenomena. Physical review, 81(6):988, 1951.
• (39) Lawrence K Saul and Michael I Jordan. Exploiting tractable substructures in intractable networks. Advances in neural information processing systems, pages 486–492, 1996.
• (40) Peter Carbonetto and Nando de Freitas. Conditional mean field. In Advances in Neural Information Processing Systems, pages 201–208, 2007.

Appendix A Construction of Forney-style model equivalent to factor-graph model

In this section, we describe the construction of a Forney-style GM equivalent to a given factor-graph GM. Consider a factor-graph GM defined on a bipartite graph between variable nodes $\mathcal{X}$ and factor nodes $\mathcal{F}$, with factors $\{f_a\}_{a\in\mathcal{F}}$. Then one introduces the following Forney-style GM, defined over the graph with node set $\mathcal{V}$ and factors $\{f^{\dagger}_a\}$:

$$\mathcal{V} \leftarrow \mathcal{X}\cup\mathcal{F},\qquad f^{\dagger}_a \leftarrow f_a,\;\forall a\in\mathcal{F},\qquad f^{\dagger}_a(x_a) \leftarrow \begin{cases}1 & \text{if } x_a=(1,1,\cdots) \text{ or } (0,0,\cdots),\\ 0 & \text{otherwise},\end{cases}\;\forall a\in\mathcal{X}.$$

One observes that if the factor-graph GM (possibly of high order) is sparse, i.e., if the maximum degree of the underlying graph is small, then the equivalent Forney-style GM is sparse as well. See Figure 5 for an illustration.
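The construction above can be sketched in a few lines of Python. This is our own illustrative encoding, not notation from the paper: the function name, the `(scope, table)` factor representation, and the restriction to binary variables are all assumptions. Each original factor node keeps its factor, and each variable node becomes an equality-constraint factor over its half-edge copies.

```python
import itertools

def forney_from_factor_graph(variables, factors):
    """Build a Forney-style GM equivalent to a factor-graph GM (sketch).

    `variables` is the set X; `factors` maps each factor name a in F to a
    pair (scope, table), where `scope` lists the binary variables of f_a and
    `table` maps assignment tuples to values. Returns the Forney-style
    factors over the node set V = X u F.
    """
    forney = {}
    # Factor nodes keep their original factors: f†_a ← f_a for all a in F.
    for a, (scope, table) in factors.items():
        forney[a] = (scope, dict(table))
    # Each variable node becomes an equality factor over its half-edge
    # copies: it equals 1 iff the copies are all 0s or all 1s.
    for v in variables:
        copies = [a for a, (scope, _) in factors.items() if v in scope]
        table = {xs: 1.0 if len(set(xs)) <= 1 else 0.0
                 for xs in itertools.product([0, 1], repeat=len(copies))}
        forney[v] = (copies, table)
    return forney
```

The equality factors are exactly the $f^{\dagger}_a$, $a\in\mathcal{X}$, above: nonzero only on the all-zeros and all-ones assignments of the copies.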

Appendix B Proof of Theorem 1

To prove Theorem 1, one first shows that the line graph GM can be gauge transformed into a distribution equivalent to that of an alternating cycle GM. It is then sufficient to prove Theorem 1 only for the case of an alternating cycle.

Consider a GM defined on a line graph with nodes $a_1,\cdots,a_n$ and edges $\{(a_i,a_{i+1})\}_{i=1}^{n-1}$. Then the gauge transformed factor can be expressed as:

$$f_{a_i,\mathcal{G}} = G^{\top}_{a_i a_{i-1}}\, f_{a_i}\, G_{a_i a_{i+1}},$$

where we used the fact that every factor is a $2\times 2$ matrix (the variables are binary). Next, we 'flip' the factor $f_{a_1}$, associated with the node $a_1$, so that there exists an odd number of factors with negative determinant among $f_{a_1},\cdots,f_{a_n}$; i.e., the flipping sets

$$G_{a_1 a_2} = G_{a_2 a_1} = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \qquad (12)$$

thus reversing the sign of $\det f_{a_1}$. If $f_{a_1}$ is non-invertible, i.e., $\det f_{a_1}=0$, we instead flip $f_{a_2}$, and so on. If all factors are non-invertible, the resulting distribution is a product distribution and one can easily find the optimal gauge for the corresponding line graph, which completes the proof. Otherwise, we 'join' the endpoints $a_1$ and $a_n$ by introducing a non-invertible factor between them, which results in an alternating cycle with a probability distribution identical to that of the line graph GM.

Our next step is to prove Theorem 1 for an alternating cycle GM. Our high-level logic here is as follows. We first fix the distribution $q$ in (6) according to

$$q(x) = \begin{cases}1 & \text{if } x = (0,0,\cdots),\\ 0 & \text{otherwise},\end{cases}$$

and then show that the GM can be gauge transformed into a distribution whose nonzero probability is concentrated only at $x=(0,0,\cdots)$. The resulting objective of (6) will then become exactly the partition function. To implement this logic, consider an alternating cycle defined on a graph with nodes $a_1,\cdots,a_n$ and edges $\{(a_1,a_2),\cdots,(a_{n-1},a_n),(a_n,a_1)\}$. Observe that the gauge transformed factor product, $\prod_i f_{a_i,\mathcal{G}}$, and the original product, $\prod_i f_{a_i}$, share the pair of eigenvalues $\lambda_1, \lambda_2$ due to the following relationship:

$$\prod_i f_{a_i,\mathcal{G}} = G^{-1}_{a_n a_1} \Big(\prod_i f_{a_i}\Big) G_{a_n a_1}.$$

One finds that $\lambda_1 \lambda_2 < 0$, since there exists an odd number of negative-determinant factors in the cycle. Moreover, $\lambda_1 + \lambda_2 > 0$, because the diagonal sum, $\operatorname{tr}\big(\prod_i f_{a_i}\big) = \lambda_1 + \lambda_2$, is equivalent to the partition function $Z$ of the GM. Thus one can assume, without loss of generality, that $\lambda_1 > 0$ and $\lambda_2 < 0$.
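These two facts are easy to check numerically. The following sketch is our own, not from the paper, and uses a simplified similarity form of the gauge transformation (matrices $G_i$ and $G_i^{-1}$ inserted on every edge, rather than the paper's transposed-pair convention). It verifies for a small binary cycle that the brute-force partition function equals the trace of the ordered matrix product of the factors, and that gauging leaves $Z$ invariant:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # cycle length (illustrative)

# Random positive 2x2 factors f_{a_i} on a cycle of n binary variables.
fs = [rng.uniform(0.1, 1.0, size=(2, 2)) for _ in range(n)]

# Brute-force partition function: sum over all 2^n configurations.
Z = 0.0
for x in np.ndindex(*(2,) * n):
    w = 1.0
    for i in range(n):
        w *= fs[i][x[i], x[(i + 1) % n]]
    Z += w

# For a cycle, Z equals the trace (diagonal sum) of the ordered product.
M = np.linalg.multi_dot(fs)
assert np.isclose(Z, np.trace(M))

# Inserting G_i and G_i^{-1} on every edge leaves the product similar to M,
# hence the eigenvalues, the trace, and Z are all invariant.
Gs = [rng.uniform(-1.0, 1.0, size=(2, 2)) + 2.0 * np.eye(2) for _ in range(n)]
fs_g = [np.linalg.inv(Gs[i - 1]) @ fs[i] @ Gs[i] for i in range(n)]
assert np.isclose(np.trace(np.linalg.multi_dot(fs_g)), Z)
```

The offset `2.0 * np.eye(2)` simply keeps the randomly drawn gauge matrices safely invertible.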

Next, utilizing simple linear algebra, one derives

$$Q_2^{-1} Q_1\, G_{a_n a_1} \prod_i f_{a_i,\mathcal{G}}\, Q_1^{-1} Q_2 = \begin{bmatrix} \lambda_1 + \lambda_2 & \lambda_1 - \lambda_2 \\ 0 & 0 \end{bmatrix},$$

where $Q_1$ and $Q_2$ are matrices whose $i$-th columns are eigenvectors (associated with $\lambda_i$) of the original and the gauge transformed factor products, respectively. Now let

$$G_{a_n a_1} = Q_1^{-1} Q_2, \qquad G_{a_{i-1} a_i} = \big(f_{a_i} G^{\top}_{a_i a_{i+1}}\big)^{-1} \quad \text{for } i = 2,\cdots,n,$$

where indices are taken modulo $n$, i.e., $a_{n+1}=a_1$. Here we assume that there exists at most one non-invertible factor in the GM and that $f_{a_2},\cdots,f_{a_n}$ are invertible, so that each $G_{a_{i-1}a_i}$ is defined properly. Otherwise, the GM can be decomposed into separate line graphs and the proof can be applied recursively. Then the gauge transformed factors become:

$$f_{a_1,\mathcal{G}} = \begin{bmatrix} \lambda_1 + \lambda_2 & \lambda_1 - \lambda_2 \\ 0 & 0 \end{bmatrix}, \qquad f_{a_i,\mathcal{G}} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \quad \forall\, i \neq 1,$$

which corresponds to a GM for which the objective of (6) equals the log partition function. This completes the proof of Theorem 1.
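As a sanity check (our own, with illustrative eigenvalues), one can verify that with $f_{a_1,\mathcal{G}}$ of the above form and identity factors elsewhere, the only configuration carrying nonzero weight is the all-zeros one, and its weight equals $Z = \lambda_1 + \lambda_2$, so the objective of (6) indeed recovers $\log Z$:

```python
import numpy as np

lam1, lam2 = 2.0, -0.5  # illustrative eigenvalues with lam1 > 0 > lam2
n = 4                   # cycle length (illustrative)

f1 = np.array([[lam1 + lam2, lam1 - lam2],
               [0.0,         0.0]])
fs = [f1] + [np.eye(2) for _ in range(n - 1)]

# Enumerate all binary configurations of the cycle and their weights.
weights = {}
for x in np.ndindex(*(2,) * n):
    w = 1.0
    for i in range(n):
        w *= fs[i][x[i], x[(i + 1) % n]]
    weights[x] = w

Z = sum(weights.values())
assert np.isclose(Z, lam1 + lam2)        # partition function preserved
assert np.isclose(weights[(0,) * n], Z)  # all mass at x = (0, ..., 0)
assert all(w == 0.0 for x, w in weights.items() if x != (0,) * n)
```

The identity factors force all variables to agree, and the zero second row of $f_{a_1,\mathcal{G}}$ kills the all-ones configuration, leaving only $x=(0,\cdots,0)$.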

Appendix C Generating GM instances (for experiments)

In this section, we provide more details on the experimental setups reported in Section 4. First, we explain how the two types of factors, non-log-supermodular and log-supermodular, were constructed. In the generic case of non-log-supermodular factors, i.e., the one corresponding to Figure 1 and Figure 3, one generates the factor $f_a$ by first drawing the interaction strength $\beta_a$ at random from the uniform distribution over an interval $[-\beta, \beta]$ for some $\beta > 0$, i.e., $\beta_a \sim \mathrm{Unif}[-\beta,\beta]$. Then, in order to introduce a bias, we add an external variable $y_a$, i.e., a half-edge, as follows:

$$f_a(x_a) = \exp\big(\beta_a\, |h_0(x_a \cup y_a) - h_1(x_a \cup y_a)|\big),$$

where $y_a$ is either $0$ or $1$, with probability $1/2$ each. More specifically, in the experiments reported in Figure 1 one varies the interaction strength $\beta$, while in the experiments reported in Figure 3 one fixes it. Next, in the case of log-supermodular factors, i.e., the setting reported in Figure 2 and Figure 4, one generates the factors by drawing the interaction strength $\beta_a$ from a normal distribution with mean $\mu$ and variance $\sigma^2$, i.e., $\beta_a \sim \mathcal{N}(\mu,\sigma^2)$. Note that there is no bias in these factors, and even though the distribution of the interaction strength is normal, the sampled values are highly likely to be positive and concentrated around the mean $\mu$.
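The two generators can be sketched as follows. This is our own reading of the construction, not the paper's code: we take $h_0$ and $h_1$ to count the zeros and ones in the (possibly extended) assignment, and the parameters `beta`, `mu`, `sigma` are placeholders, since the exact experimental values are not restated here.

```python
import numpy as np

rng = np.random.default_rng(0)

def nonsupermodular_factor(degree, beta, rng=rng):
    """Generic (non-log-supermodular) factor over `degree` binary variables.

    beta_a ~ Unif[-beta, beta]; a hidden external variable y_a (a half-edge,
    0 or 1 with probability 1/2 each) introduces the bias. Here h0/h1 count
    zeros/ones in the extended assignment (our assumption).
    """
    beta_a = rng.uniform(-beta, beta)
    y_a = int(rng.integers(0, 2))
    f = np.empty((2,) * degree)
    for x in np.ndindex(*(2,) * degree):
        xs = x + (y_a,)
        f[x] = np.exp(beta_a * abs(xs.count(0) - xs.count(1)))
    return f

def supermodular_factor(degree, mu, sigma, rng=rng):
    """Log-supermodular factor: beta_a ~ N(mu, sigma^2), no bias term.

    With mu > 0 and small sigma, beta_a is positive with high probability,
    which makes the factor log-supermodular.
    """
    beta_a = rng.normal(mu, sigma)
    f = np.empty((2,) * degree)
    for x in np.ndindex(*(2,) * degree):
        f[x] = np.exp(beta_a * abs(x.count(0) - x.count(1)))
    return f
```

For a pairwise factor with $\beta_a > 0$, one can check $f(0,0)\,f(1,1) \ge f(0,1)\,f(1,0)$, i.e., the log-supermodularity condition, directly from the table.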