1 Introduction
Graphical models (GMs) express the factorization of joint multivariate probability distributions in statistics via a graph of relations between variables. The concept of GM has been developed and/or used successfully in information theory
(gallager1962low; kschischang1998iterative), physics (35Bet; 36Pei; 87MPZ; parisi1988statistical; 09MM), artificial intelligence (pearl2014probabilistic), and machine learning
(jordan1998learning; freeman2000learning). Among the many inference problems one can formulate using a GM, computing the partition function (normalization), or equivalently computing marginal probability distributions, is the most important and universal inference task of interest. However, this paradigmatic problem is also known to be computationally intractable in general, i.e., it is #P-hard even to approximate (jerrum1993polynomial). Markov chain Monte Carlo (MCMC)
(alpaydin2014introduction) is a classical approach to the inference task, but it typically suffers from exponentially slow mixing or large variance. Variational inference is an approach that states the inference task as an optimization problem. Hence, it does not share these issues of MCMC and is often more favorable. The mean-field (MF)
(parisi1988statistical) and belief propagation (BP) (pearl1982reverend) methods are arguably the most popular algorithms of the variational type. They are distributed, fast, and overall very successful in practical applications, even though they are heuristics lacking systematic error control. This has motivated researchers to seek methods with guarantees, e.g., providing lower bounds
(liu2012negative; ermon2012density) and upper bounds (wainwright2005new; liu2011bounding; ermon2012density) for the partition function of a GM. In another line of research, which this paper extends and contributes to, the so-called reparametrizations (03WJW), gauge transformations (GTs) (06CCa; 06CCb) and holographic transformations (valiant2008holographic; al2011normal) were explored. This class of distinct, but related, transformations consists in modifying a GM by continuously changing factors, associated with elements of the graph, such that the partition function stays invariant.¹ (¹See 08JW; forney2011partition; Misha_notes for discussions of relations between the aforementioned techniques.) In this paper, we choose to work with GT as the most general of the three approaches. Once applied to a GM, it transforms the original partition function, defined as a weighted series/sum over states, into a new one dependent on the choice of gauges. In particular, a fixed point of BP minimizes the so-called Bethe free energy (05YFW), and it can also be understood as an optimal GT (06CCa; 06CCb; chernyak2007loop; mori2015holographic). Moreover, fixing the GT in accordance with BP results in the so-called loop series expression for the partition function (06CCa; 06CCb). In this paper we generalize (06CCa; 06CCb) and explore a more general class of GTs. This allows us to develop a new gauge-optimization approach which results in 'better' variational inference schemes than those provided by MF, BP and other related methods.
Contribution.
The main contribution of this paper consists in developing two novel variational methods, called Gauged-MF (G-MF) and Gauged-BP (G-BP), providing lower bounds on the partition function of a GM. While MF minimizes the (exact) Gibbs free energy under (reduced) product distributions, G-MF does the same while introducing an additional GT. Due to the additional degree of freedom in the optimization, G-MF systematically improves the lower bound on the partition function provided by MF. Similarly, G-BP generalizes BP, extending the interpretation of the latter as an optimization of the Bethe free energy over GT
(06CCa; 06CCb; chernyak2007loop; mori2015holographic), by imposing additional constraints on the GT forcing all terms in the resulting series for the partition function to remain non-negative. Thus, G-BP results in a provable lower bound for the partition function, while BP does not (except for log-supermodular models (ruozzi2012bethe)). We prove that both G-MF and G-BP are exact for GMs defined over a single cycle, which we call an 'alternating cycle/loop', as well as over line graphs. The alternating cycle case is surprising, as it represents the simplest 'counter-example' from weller2014understanding, illustrating failures of MF and BP. For general GMs, we also establish that G-MF is better than, or at least as good as, G-BP. However, we also develop novel error-correction schemes for G-BP such that the lower bound of the partition function provided by G-BP can be improved systematically/sequentially, eventually outperforming G-MF, at the expense of increased computational complexity. A similar error-correction scheme has been studied for improving BP by considering the loop series consisting of positive and negative terms (chertkov2008belief; ahn2016synthesis). Due to our design of G-BP, the corresponding series consists of only non-negative terms, which makes it much easier to improve the quality of G-BP systematically.
We further found that our newly proposed GT-based optimizations can be restated as smooth and unconstrained ones, thus allowing efficient solutions via algorithms of gradient-descent type or any generic optimization solver such as IPOPT (wachter2006implementation). We experiment with IPOPT on complete GMs of relatively small size and on large GMs (up to 300 variables) of fixed degree; the experiments confirm that the newly proposed algorithms outperform and generalize MF and BP. Finally, note that all statements of the paper are made within the framework of so-called Forney-style GMs (forney2001codes), which is general as it allows interactions beyond pairwise (i.e., high-order GMs) and includes other/alternative GM formulations, such as the factor graphs of (03WJ). Our results using GT for variational inference provide a refreshing angle on this important inference task, and we believe they should be of broad interest in many applications involving GMs.
2 Preliminaries
2.1 Graphical model
Factor-graph model. Given an (undirected) bipartite factor graph,
a joint distribution of (binary) random variables x = [x_i : i ∈ X]
is called a factor-graph graphical model (GM) if it factorizes as follows:

p(x) = (1/Z) Π_{a∈F} f_a(x_a),    Z = Σ_x Π_{a∈F} f_a(x_a),

where the f_a are non-negative functions called factor functions, x_a consists of the variables neighboring factor a, and the normalization constant Z is called the partition function. A factor-graph GM is called pairwise if each factor depends on at most two variables, and high-order otherwise. It is known that approximating the partition function is #P-hard even for pairwise GMs in general (jerrum1993polynomial).
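As a concrete reference point, the partition function of a tiny factor-graph GM can be computed by brute force; the model below is illustrative (not from the paper):

```python
import itertools
import numpy as np

# Toy factor-graph GM on 3 binary variables with two pairwise factors
# f1(x1, x2) and f2(x2, x3); the tables below are illustrative.
f1 = np.array([[1.0, 0.5], [0.5, 2.0]])  # f1[x1, x2]
f2 = np.array([[1.0, 0.3], [0.3, 1.5]])  # f2[x2, x3]

# Partition function: Z = sum over all joint states of the product of
# factor values. Brute force is tractable only for tiny models -- in
# general the computation is #P-hard, as noted above.
Z = sum(f1[x1, x2] * f2[x2, x3]
        for x1, x2, x3 in itertools.product([0, 1], repeat=3))
# Here Z = 6.45.
```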
Forney-style model. In this paper, we primarily use the Forney-style GM (forney2001codes) instead of the factor-graph GM. Elementary random variables in a Forney-style GM are associated with the edges of an undirected graph G = (V, E). Then the random vector x = [x_e : e ∈ E] is realized with the probability distribution

p(x) = (1/Z) Π_{a∈V} f_a(x_a),    (1)

where x_a = [x_e : e ∋ a] is associated with the set of edges neighboring node a, and Z = Σ_x Π_{a∈V} f_a(x_a). As argued in (06CCa; 06CCb), the Forney-style GM constitutes a more universal and compact description for gauge transformations without any restriction of generality, i.e., given any factor-graph GM, one can construct an equivalent Forney-style GM (see the supplementary material).
2.2 Meanfield and belief propagation
In this section, we introduce the two most popular methods for approximating the partition function: the mean-field and Bethe (i.e., belief propagation) approximation methods. Given any (Forney-style) GM defined as in (1) and any distribution b over all variables, the Gibbs free energy is defined as

F_Gibbs(b) = Σ_x b(x) log ( b(x) / Π_{a∈V} f_a(x_a) ).    (2)

Then the log-partition function is recovered according to log Z = −min_b F_Gibbs(b), where the optimum is achieved at b = p, e.g., see (03WJ). This optimization runs over all valid probability distributions on the exponentially large state space and is obviously intractable.
In the case of the mean-field (MF) approximation, we minimize the Gibbs free energy over a family of tractable probability distributions factorized into the product b(x) = Π_{e∈E} b_e(x_e), where each independent b_e is a proper probability distribution, behaving as a (mean-field) proxy to the marginal of b over x_e. By construction, the MF approximation provides a lower bound for log Z. In the case of the Bethe approximation, the so-called Bethe free energy approximates the Gibbs free energy (yedidia2001bethe):
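The MF lower-bound property can be illustrated numerically: for any product distribution b, the negative Gibbs free energy lower-bounds log Z. The sketch below uses a toy Forney-style triangle model; all factor tables and names are illustrative:

```python
import itertools
import math
import numpy as np

# Tiny Forney-style GM: 3 nodes on a triangle, one binary variable per edge.
f = {0: np.array([[1.0, 0.4], [0.4, 1.8]]),   # f_0(x01, x02)
     1: np.array([[1.2, 0.5], [0.5, 1.0]]),   # f_1(x01, x12)
     2: np.array([[0.9, 0.7], [0.7, 1.1]])}   # f_2(x02, x12)

def weight(x01, x02, x12):
    return f[0][x01, x02] * f[1][x01, x12] * f[2][x02, x12]

logZ = math.log(sum(weight(*x) for x in itertools.product([0, 1], repeat=3)))

def gibbs_free_energy(q):
    """Gibbs free energy of the product distribution with P(x_e = 1) = q[e]."""
    F = 0.0
    for x in itertools.product([0, 1], repeat=3):
        b = 1.0
        for qe, xe in zip(q, x):
            b *= qe if xe else 1.0 - qe
        if b > 0.0:
            F += b * (math.log(b) - math.log(weight(*x)))
    return F

# For ANY product distribution, -F_Gibbs lower-bounds log Z (the MF bound).
assert -gibbs_free_energy([0.5, 0.5, 0.5]) <= logZ + 1e-12
```

Maximizing −F_Gibbs over the three q-parameters recovers the MF estimate for this model.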
F_Bethe(b) = Σ_{a∈V} Σ_{x_a} b_a(x_a) log ( b_a(x_a) / f_a(x_a) ) − Σ_{e∈E} Σ_{x_e} b_e(x_e) log b_e(x_e),    (3)
where the beliefs b = [b_a, b_e : a ∈ V, e ∈ E] should satisfy the following 'consistency' constraints:

Σ_{x_a : x_e fixed} b_a(x_a) = b_e(x_e)    for all a ∈ V and e ∋ a,

where the sum runs over the vector x_a with x_e held fixed, and exp(−min_b F_Bethe(b)) is the Bethe estimate for Z. The popular belief propagation (BP) distributed heuristic solves this optimization iteratively (yedidia2001bethe). The Bethe approximation is exact over trees, i.e., the Bethe estimate equals Z. However, in the case of a general loopy graph, the BP estimate lacks approximation guarantees. It is known, however, that the result of the BP optimization lower bounds the log-partition function, log Z, if the factors are log-supermodular (ruozzi2012bethe).

2.3 Gauge transformation
Gauge transformation (GT) (06CCa; 06CCb) is a family of linear transformations of the factor functions in (1) which leaves the partition function invariant. It is defined with respect to the following set of invertible 2×2 matrices, coined gauges: G = [G_{a,e} : a ∈ V, e ∋ a]. The GM gauge transformed with respect to G consists of factors expressed as:

f_a^{G}(x_a) = Σ_{x'_a} f_a(x'_a) Π_{e ∋ a} G_{a,e}(x_e, x'_e).

Here one treats the independent variables x_a and x'_a equivalently for notational convenience, and (G_{a,e}, G_{b,e}) is a conjugated pair of distinct matrices satisfying the gauge constraint G_{a,e}^⊤ G_{b,e} = I for every edge e = {a, b}, where I is the 2×2 identity matrix. Then, one can prove invariance of the partition function under the transformation:
Z = Σ_x Π_{a∈V} f_a(x_a) = Σ_x Π_{a∈V} f_a^{G}(x_a).    (4)
Consequently, GT results in the gauge-transformed distribution p^{G}(x) = (1/Z) Π_{a∈V} f_a^{G}(x_a). Note that some components of p^{G} can be negative, in which case it is not a valid probability distribution.
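The invariance (4) can be checked numerically on a toy Forney-style GM. The sketch below (model and gauge choices are illustrative, not from the paper) draws a random invertible gauge per edge, builds its conjugate so that the gauge constraint holds, and verifies that Z is unchanged:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy Forney-style GM on a triangle: nodes 0,1,2; one binary variable per edge.
edges = [(0, 1), (0, 2), (1, 2)]
node_edges = {0: [(0, 1), (0, 2)], 1: [(0, 1), (1, 2)], 2: [(0, 2), (1, 2)]}
f = {0: np.array([[1.0, 0.4], [0.4, 1.8]]),   # f_0(x01, x02)
     1: np.array([[1.2, 0.5], [0.5, 1.0]]),   # f_1(x01, x12)
     2: np.array([[0.9, 0.7], [0.7, 1.1]])}   # f_2(x02, x12)

def partition_function(factors):
    Z = 0.0
    for vals in itertools.product([0, 1], repeat=3):
        x = dict(zip(edges, vals))            # edge -> binary value
        w = 1.0
        for v in range(3):
            e1, e2 = node_edges[v]
            w *= factors[v][x[e1], x[e2]]
        Z += w
    return Z

Z0 = partition_function(f)

# For each edge e = {u, v}: draw a random invertible G_{u,e} and set the
# conjugate G_{v,e} = (G_{u,e}^{-1})^T, so that G_{u,e}^T G_{v,e} = I.
G = {}
for (u, v) in edges:
    Gu = rng.normal(size=(2, 2)) + 2.0 * np.eye(2)  # almost surely invertible
    G[(u, v), u] = Gu
    G[(u, v), v] = np.linalg.inv(Gu).T

# Gauge-transform each factor:
# f_a^G(x_a) = sum_{x'} f_a(x') prod_{e in a} G_{a,e}[x_e, x'_e].
fG = {}
for v in range(3):
    e1, e2 = node_edges[v]
    new = np.zeros((2, 2))
    for x1, x2 in itertools.product([0, 1], repeat=2):
        for y1, y2 in itertools.product([0, 1], repeat=2):
            new[x1, x2] += f[v][y1, y2] * G[e1, v][x1, y1] * G[e2, v][x2, y2]
    fG[v] = new

# The partition function is invariant under the gauge transformation.
assert abs(partition_function(fG) - Z0) < 1e-8
```

Summing out each shared edge variable contracts G_{u,e}^T G_{v,e} to the identity, which is exactly why Z is preserved.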
We remark that the Bethe/BP approximation can be interpreted as a specific choice of GT (06CCa; 06CCb). Indeed, any fixed point of BP corresponds to a special set of gauges making an arbitrarily picked configuration/state, e.g., the all-zero configuration x = 0, least sensitive to local variations of the gauge. Formally, the following non-convex optimization is known to be equivalent to the Bethe approximation:
max_G  Σ_{a∈V} log f_a^{G}(0)
subject to   G_{a,e}^⊤ G_{b,e} = I  for all e = {a, b} ∈ E,    (5)

and the set of BP-gauges corresponds to stationary points of (5), with the objective related to the respective Bethe free energy, i.e., Σ_{a∈V} log f_a^{G}(0) = −F_Bethe.
3 Gauge optimization for approximating partition functions
Now we are ready to describe two novel gauge-optimization schemes (different from (5)) providing guaranteed lower-bound approximations for log Z. Our first GT scheme, coined Gauged-MF (G-MF), should be considered as modifying and improving the MF approximation, while our second GT scheme, coined Gauged-BP (G-BP), modifies and improves the Bethe approximation so that it now provides a provable lower bound for log Z, which the bare BP does not guarantee. The G-BP scheme also allows further improvement (in terms of output quality) at the expense of making the underlying algorithm/computation more complex.
3.1 Gauged meanfield
We first propose the following optimization, inspired by, and also improving, the MF approximation:

min_{b, G}  Σ_x b(x) log ( b(x) / Π_{a∈V} f_a^{G}(x_a) )
subject to   f_a^{G}(x_a) ≥ 0  for all a ∈ V and x_a,
             G_{a,e}^⊤ G_{b,e} = I  for all e = {a, b} ∈ E,
             b(x) = Π_{e∈E} b_e(x_e).    (6)
Recall that the MF approximation optimizes the Gibbs free energy with respect to b given the original GM, i.e., the original factors. On the other hand, (6) jointly optimizes it over b and G. Since the partition function of the gauge-transformed GM is equal to that of the original GM, (6) also outputs a lower bound on the (original) partition function, and always outperforms MF due to the additional degree of freedom in G. The non-negativity constraints on each factor enforce that the gauge-transformed GM yields a valid probability distribution (all components are non-negative).
To solve (6), we propose a strategy alternating between two optimizations, formally stated in Algorithm 1. The alternation is between updating b, within Step A, and updating G, within Step C. The optimization in Step A is simple, as one can apply any solver of the mean-field approximation. On the other hand, Step C requires a new solver and, at first glance, looks complicated due to the non-linear constraints. However, the constraints can actually be eliminated. Indeed, one observes that the non-negativity constraint is redundant, because each term in the optimization objective already prevents the factors from approaching zero, thus keeping them positive. Equivalently, once the current G satisfies the non-negativity constraints, the objective acts as a log-barrier forcing the constraints to remain satisfied at the next step within an iterative optimization procedure. Furthermore, the gauge constraint G_{a,e}^⊤ G_{b,e} = I can also be removed by simply expressing one (of the two) gauges via the other, e.g., G_{b,e} = (G_{a,e}^{-1})^⊤. Then, Step C can be solved by any unconstrained iterative optimization method of gradient-descent type or any generic optimization solver such as IPOPT (wachter2006implementation). Next, the additional (intermediate) procedure, Step B, was introduced to handle extreme cases when some component of b vanishes at the optimum. We resolve the singularity by perturbing the distribution, setting zero probabilities to a sufficiently small positive value. In summary, it is straightforward to check that Algorithm 1 converges to a local optimum of (6), similar to other solvers developed for the mean-field and Bethe approximations.
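The constraint elimination in Step C can be sketched as a minimal numeric check; the matrix shapes and names below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

# Eliminating the gauge constraint: parameterize only G_{a,e} per edge and
# recover its conjugate in closed form.
G_a = rng.normal(size=(2, 2)) + 2.0 * np.eye(2)  # free, (almost surely) invertible
G_b = np.linalg.inv(G_a).T                       # conjugate gauge

# The pair satisfies G_a^T G_b = I by construction, so an unconstrained
# optimizer (gradient descent, IPOPT, ...) may move G_a freely.
assert np.allclose(G_a.T @ G_b, np.eye(2))
```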
We also identify an important class of GMs where Algorithm 1 provably outperforms both the MF and BP (Bethe) approximations. Specifically, we prove that the optimization (6) is exact when the graph is a line (a special case of a tree) and, somewhat surprisingly, when it is a single loop/cycle with an odd number of factors represented by negative definite matrices. In fact, the latter case is the so-called 'alternating cycle' example, which was introduced in weller2014understanding as the simplest loopy example where the MF and BP approximations perform quite badly. Formally, we state the following theorem, whose proof is given in the supplementary material.

Theorem 1.
For a GM defined on any line graph or alternating cycle, the optimal objective of (6) is equal to the exact log-partition function, i.e., it equals log Z.
3.2 Gauged belief propagation
We start the discussion of the G-BP scheme by noticing that, according to chertkov2006loop, the G-MF gauge optimization (6) can be reduced to the BP/Bethe gauge optimization (5) by eliminating the non-negativity constraints on the factors and replacing the product distribution b by the distribution concentrated on the all-zero configuration:

b(x) = Π_{e∈E} 1[x_e = 0].    (7)
Motivated by this observation, we propose the following G-BP optimization:

max_G  Σ_{a∈V} log f_a^{G}(0)
subject to   f_a^{G}(x_a) ≥ 0  for all a ∈ V and x_a,
             G_{a,e}^⊤ G_{b,e} = I  for all e = {a, b} ∈ E.    (8)
The only difference between (5) and (8) is the addition of the non-negativity constraints on factors in (8). Hence, (8) outputs a lower bound on the partition function, while (5) can be larger or smaller than Z. It is also easy to verify that (8) (for G-BP) is equivalent to (6) (for G-MF) with b fixed to (7). Hence, we propose an algorithmic procedure for solving (8), formally described in Algorithm 2; it should be viewed as a modification of Algorithm 1 with b replaced by (7) in Step A, and with a properly chosen log-barrier term in Step C. As we discussed for Algorithm 1, it is straightforward to verify that Algorithm 2 also converges to a local optimum of (8), and one can express one gauge of each conjugated pair via the other in order to build a convergent gradient-descent implementation of the optimization.
Since fixing b eliminates a degree of freedom in (6), G-BP should perform worse than G-MF, i.e., the optimal value of (8) is at most that of (6). However, G-BP is still meaningful for the following reasons. First, Theorem 1 still holds for (8), i.e., the optimum of (6) is achieved at (7) for any line graph or alternating cycle (see the proof of Theorem 1 in the supplementary material). More importantly, G-BP can be corrected systematically. At a high level, the "error-correction" strategy consists in reducing the approximation error of (8) sequentially while maintaining the desired lower-bound guarantee. The key idea is to decompose the error of (8) into partition functions of multiple GMs, and then repeatedly lower bound each partition function. Formally, we fix an arbitrary ordering of the edges e_1, e_2, …, e_{|E|} and define, for each j, the GM obtained from the gauge-transformed GM by conditioning on x_{e_1} = ⋯ = x_{e_{j−1}} = 0 and x_{e_j} = 1; its partition function is denoted Z_j. Namely, we consider the GMs resulting from sequential conditioning of x in the gauge-transformed GM. Next, recall that (8) maximizes the weight of the single all-zero configuration x = 0 and outputs Π_{a∈V} f_a^{G}(0). Then, since all terms of the gauge-transformed series are non-negative, the error of (8) can be decomposed as follows:

Z = Π_{a∈V} f_a^{G}(0) + Σ_{j=1}^{|E|} Z_j.    (9)
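A sketch of why the decomposition (9) holds, in the notation above: partition the configurations x ≠ 0 according to the first edge (in the fixed ordering e_1, …, e_{|E|}) at which x is non-zero. Since the variables are binary, "first non-zero edge is e_j" is exactly the conditioning event defining Z_j:

```latex
Z \;=\; \sum_{x} \prod_{a\in V} f^{\mathcal G}_a(x_a)
  \;=\; \prod_{a\in V} f^{\mathcal G}_a(\mathbf{0})
  \;+\; \sum_{j=1}^{|E|}\;
     \sum_{\substack{x:\; x_{e_1}=\cdots=x_{e_{j-1}}=0,\\ x_{e_j}=1}}
     \prod_{a\in V} f^{\mathcal G}_a(x_a)
  \;=\; \prod_{a\in V} f^{\mathcal G}_a(\mathbf{0}) \;+\; \sum_{j=1}^{|E|} Z_j .
```

Each inner sum is non-negative by the constraints of (8), so truncating the series at any point still yields a lower bound.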
Now, one can run G-MF, G-BP or any other method (e.g., MF) again to obtain a lower bound on Z_j for all j, and then output the resulting total sum. However, such additional runs of optimization inevitably increase the overall complexity. Instead, one can also pick just a single configuration from the first conditioned GM, i.e., the choice j = 1, right after solving (8) initially, and output

Π_{a∈V} f_a^{G}(0) + Π_{a∈V} f_a^{G}(x^{(1)}_a),    (10)

where x^{(1)} is the configuration with x_{e_1} = 1 and all other variables equal to 0,
as a better lower bound on Z than the bare output of (8). This choice is based on the intuition that configurations partially different from x = 0 may be significant too, as they share most of their factor values with the all-zero configuration maximized in (8). In fact, one can choose even more configurations (partially different from x = 0) at the cost of additional complexity, which is always better as it brings the approximation closer to the true partition function. In our experiments, we consider the additional single-flip configurations, i.e., output

Π_{a∈V} f_a^{G}(0) + Σ_{j=1}^{|E|} Π_{a∈V} f_a^{G}(x^{(j)}_a),    (11)

where x^{(j)} is the configuration with x_{e_j} = 1 and all other variables equal to 0, as a better lower bound on Z than (10).
4 Experimental results
In this section, we report results of our experiments with G-MF and G-BP, defined in Section 3. We also experiment here with G-BP boosted by the error-correction schemes accounting for a single term (10) and multiple terms (11), as well as by applying G-BP sequentially again to each residual partition function Z_j. The error decreases, while the evaluation complexity increases, as we move from G-BP-single to G-BP-multiple and then to G-BP-sequential. As mentioned earlier, we use the IPOPT solver (wachter2006implementation) to resolve the proposed gauge optimizations. We generate random GMs with factors dependent on an 'interaction strength' parameter (akin to an inverse temperature) as follows:
where the factor values are determined by the numbers of the two distinct variable values contributing to x_a. Intuitively, we expect that as the interaction strength increases, it becomes more difficult to approximate the partition function. See the supplementary material for additional details on how we generate the random models.
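A plausible generator matching the description above can be sketched as follows; the exact functional form used in the paper is given only in the supplementary material, so the uniform log-factor form below is an assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

def random_factor(degree, strength):
    """Random Forney-style factor over `degree` binary edge variables.

    Log-factor values are drawn i.i.d. uniform in [-strength, strength];
    this exact form is an assumption standing in for the paper's generator.
    """
    log_values = rng.uniform(-strength, strength, size=(2,) * degree)
    return np.exp(log_values)

f = random_factor(3, 2.0)   # a degree-3 factor at interaction strength 2
assert f.shape == (2, 2, 2) and np.all(f > 0)
```

Larger `strength` spreads the log-factor values wider, which is the regime where approximating Z becomes harder.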
In the first set of experiments, we consider relatively small, complete graphs with two types of factors: random generic (non-log-supermodular) factors and log-supermodular (positive/ferromagnetic) factors. Recall that the bare BP also provides a lower bound in the log-supermodular case (ruozzi2012bethe), thus making the comparison between each proposed algorithm and BP informative. We use the log-partition approximation error, defined as the gap between log Z and the algorithm output (a lower bound of log Z), to quantify an algorithm's performance. In this first set of experiments, we deal with relatively small graphs, and the explicit computation of log Z (i.e., of the approximation error) is feasible. The results for the experiments over the small graphs are illustrated in Figure 1 and Figure 2 for the non-log-supermodular and log-supermodular cases, respectively. Figure 1 shows that, as expected, G-MF always outperforms MF. Moreover, we observe that G-MF typically provides the tightest lower bound, unless it is outperformed by G-BP-multiple or G-BP-sequential. We remark that BP is not shown in Figure 1 because, in this non-log-supermodular case, it does not provide a lower bound in general. According to Figure 2, showing the log-supermodular case, both G-MF and G-BP outperform MF, while G-BP-sequential outperforms all other algorithms. Notice that G-BP performs rather similarly to BP in the log-supermodular case, thus suggesting that the constraints distinguishing (8) from (5) are only very mildly violated.
In the second set of experiments, we consider sparser, larger graphs of two types: regular and grid graphs with up to 300 variables. As in the first set of experiments, the same non-log-supermodular/log-supermodular factors are considered. Since computing the exact approximation error is not feasible for the large graphs, we instead measure the ratio of the estimate produced by each proposed algorithm to that of MF, where a larger value of the ratio indicates better performance. The results are reported in Figure 3 and Figure 4 for the non-log-supermodular and log-supermodular cases, respectively. In Figure 3, we observe that G-MF and G-BP-sequential outperform MF significantly, most visibly in the larger regular graphs. We also observe that even the bare G-BP outperforms MF. In Figure 4, the algorithms associated with G-BP outperform both G-MF and MF. This is because log-supermodular models favor the choice of b used by G-BP: most of their probability mass is concentrated around the all-zero configuration, matching the choice (7). One observes here (again) that the performance of G-BP in this log-supermodular case is almost on par with BP. This implies that G-BP generalizes BP well: the former provides a lower bound on log Z for any GM, while the latter does so only for log-supermodular GMs.
5 Conclusion and future research
We explore the freedom in gauge transformations of GMs and develop novel variational inference methods which result in a significant improvement of partition function estimation. In this paper, we have focused solely on designing approaches which improve the bare/basic MF and BP via specially optimized gauge transformations. In terms of the path forward, it is of interest to extend this GT framework/approach to other variational methods, e.g., the Kikuchi approximation (kikuchi1951theory) and structured/conditional MF (saul1996exploiting; carbonetto2007conditional). Furthermore, G-BP and G-MF were resolved in our experiments via a generic optimization solver (IPOPT), which was sufficient for the illustrative tests conducted so far; however, we expect that it should be possible to develop more efficient distributed solvers of BP type. Finally, we plan to work on applications of the newly designed methods and algorithms to a variety of practical inference tasks associated with GMs.
References
 (1) Robert Gallager. Low-density parity-check codes. IRE Transactions on Information Theory, 8(1):21–28, 1962.
 (2) Frank R. Kschischang and Brendan J. Frey. Iterative decoding of compound codes by probability propagation in graphical models. IEEE Journal on Selected Areas in Communications, 16(2):219–230, 1998.
 (3) Hans A. Bethe. Statistical theory of superlattices. Proceedings of the Royal Society of London A, 150:552, 1935.
 (4) Rudolf E. Peierls. Ising’s model of ferromagnetism. Proceedings of Cambridge Philosophical Society, 32:477–481, 1936.
 (5) Marc Mézard, Giorgio Parisi, and M. A. Virasoro. Spin Glass Theory and Beyond. Singapore: World Scientific, 1987.
 (6) Giorgio Parisi. Statistical field theory, 1988.
 (7) Marc Mezard and Andrea Montanari. Information, Physics, and Computation. Oxford University Press, Inc., New York, NY, USA, 2009.
 (8) Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Morgan Kaufmann, 2014.
 (9) Michael Irwin Jordan. Learning in graphical models, volume 89. Springer Science & Business Media, 1998.

 (10) William T Freeman, Egon C Pasztor, and Owen T Carmichael. Learning low-level vision. International Journal of Computer Vision, 40(1):25–47, 2000.
 (11) Mark Jerrum and Alistair Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM Journal on Computing, 22(5):1087–1116, 1993.
 (12) Ethem Alpaydin. Introduction to machine learning. MIT press, 2014.
 (13) Judea Pearl. Reverend Bayes on inference engines: A distributed hierarchical approach. Cognitive Systems Laboratory, School of Engineering and Applied Science, University of California, Los Angeles, 1982.
 (14) Qiang Liu and Alexander T Ihler. Negative tree reweighted belief propagation. arXiv preprint arXiv:1203.3494, 2012.
 (15) Stefano Ermon, Ashish Sabharwal, Bart Selman, and Carla P Gomes. Density propagation and improved bounds on the partition function. In Advances in Neural Information Processing Systems, pages 2762–2770, 2012.
 (16) Martin J Wainwright, Tommi S Jaakkola, and Alan S Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51(7):2313–2335, 2005.
 (17) Qiang Liu and Alexander T Ihler. Bounding the partition function using Hölder's inequality. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 849–856, 2011.
 (18) Martin J. Wainwright, Tommi S. Jaakkola, and Alan S. Willsky. Tree-based reparametrization framework for approximate estimation on graphs with cycles. Information Theory, IEEE Transactions on, 49(5):1120–1146, 2003.
 (19) Michael Chertkov and Vladimir Chernyak. Loop calculus in statistical physics and information science. Physical Review E, 73:065102(R), 2006.
 (20) Michael Chertkov and Vladimir Chernyak. Loop series for discrete statistical models on graphs. Journal of Statistical Mechanics, page P06009, 2006.
 (21) Leslie G Valiant. Holographic algorithms. SIAM Journal on Computing, 37(5):1565–1594, 2008.
 (22) Ali Al-Bashabsheh and Yongyi Mao. Normal factor graphs and holographic transformations. IEEE Transactions on Information Theory, 57(2):752–763, 2011.
 (23) Martin J. Wainwright and Michael E. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1):1–305, 2008.
 (24) G David Forney Jr and Pascal O Vontobel. Partition functions of normal factor graphs. arXiv preprint arXiv:1102.0316, 2011.
 (25) Michael Chertkov. Lecture notes on “statistical inference in structured graphical models: Gauge transformations, belief propagation & beyond", 2016.
 (26) J. S. Yedidia, W. T. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. Information Theory, IEEE Transactions on, 51(7):2282–2312, 2005.
 (27) Vladimir Y Chernyak and Michael Chertkov. Loop calculus and belief propagation for q-ary alphabet: Loop tower. In Information Theory, 2007. ISIT 2007. IEEE International Symposium on, pages 316–320. IEEE, 2007.
 (28) Ryuhei Mori. Holographic transformation, belief propagation and loop calculus for generalized probabilistic theories. In Information Theory (ISIT), 2015 IEEE International Symposium on, pages 1099–1103. IEEE, 2015.
 (29) Nicholas Ruozzi. The Bethe partition function of log-supermodular graphical models. In Advances in Neural Information Processing Systems, pages 117–125, 2012.
 (30) Adrian Weller, Kui Tang, Tony Jebara, and David Sontag. Understanding the Bethe approximation: when and how can it go wrong? In UAI, pages 868–877, 2014.
 (31) Michael Chertkov, Vladimir Y Chernyak, and Razvan Teodorescu. Belief propagation and loop series on planar graphs. Journal of Statistical Mechanics: Theory and Experiment, 2008(05):P05003, 2008.
 (32) Sung-Soo Ahn, Michael Chertkov, and Jinwoo Shin. Synthesis of MCMC and belief propagation. In Advances in Neural Information Processing Systems, pages 1453–1461, 2016.
 (33) Andreas Wächter and Lorenz T Biegler. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Mathematical Programming, 106(1):25–57, 2006.
 (34) G David Forney. Codes on graphs: Normal realizations. IEEE Transactions on Information Theory, 47(2):520–548, 2001.
 (35) Martin Wainwright and Michael Jordan. Graphical models, exponential families, and variational inference. Technical Report 649, UC Berkeley, Department of Statistics, 2003.
 (36) Jonathan S Yedidia, William T Freeman, and Yair Weiss. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Advances in Neural Information Processing Systems, 13, 2001.
 (37) Michael Chertkov and Vladimir Y Chernyak. Loop series for discrete statistical models on graphs. Journal of Statistical Mechanics: Theory and Experiment, 2006(06):P06009, 2006.
 (38) Ryoichi Kikuchi. A theory of cooperative phenomena. Physical review, 81(6):988, 1951.
 (39) Lawrence K Saul and Michael I Jordan. Exploiting tractable substructures in intractable networks. Advances in neural information processing systems, pages 486–492, 1996.
 (40) Peter Carbonetto and Nando D Freitas. Conditional mean field. In Advances in neural information processing systems, pages 201–208, 2007.
Appendix A Construction of Forneystyle model equivalent to factorgraph model
In this section, we describe the construction of a Forney-style GM equivalent to a given factor-graph GM. Consider a factor-graph GM defined on a graph with factors f_a. Then one introduces the following Forney-style GM, defined over a graph with factors
One observes that if the factor-graph GM (possibly of high order) is sparse, i.e., the maximum degree of the graph is small, then the equivalent Forney-style GM is sparse as well. See Figure 5 for an illustration.
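The equivalence can be checked numerically on a tiny example. In the sketch below (model and names are illustrative), each variable node of the factor graph becomes a Forney-style node whose factor forces all per-edge copies of that variable to agree, while the original factor nodes keep their functions:

```python
import itertools
import numpy as np

# Factor-graph GM: variables x1, x2, x3; factors f1(x1, x2), f2(x2, x3).
f1 = np.array([[1.0, 0.5], [0.5, 2.0]])
f2 = np.array([[1.0, 0.3], [0.3, 1.5]])
Z_fg = sum(f1[x1, x2] * f2[x2, x3]
           for x1, x2, x3 in itertools.product([0, 1], repeat=3))

# Equivalent Forney-style GM: only x2 has degree 2, so it yields one
# 'equality' node with two edge-variable copies (x2a, x2b); the degree-1
# variables x1, x3 become plain (half-)edge variables.
def forney_weight(x1, x2a, x2b, x3):
    equality = 1.0 if x2a == x2b else 0.0   # factor of the variable node x2
    return f1[x1, x2a] * equality * f2[x2b, x3]

Z_forney = sum(forney_weight(*x) for x in itertools.product([0, 1], repeat=4))
assert abs(Z_forney - Z_fg) < 1e-12   # the two models share the same Z
```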
Appendix B Proof of Theorem 1
To prove Theorem 1, one first shows that the line-graph GM can be gauge transformed into a distribution equivalent to an alternating-cycle GM. It is then sufficient, for proving Theorem 1, to consider only the case of an alternating cycle.
Consider a GM defined on a line graph with nodes 1, …, n and edges (i, i+1) for i = 1, …, n−1. Then the gauge-transformed factor can be expressed as:
where we used the fact that each factor of the line graph has size 2×2. Next, we 'flip' one of the factors such that there exists an odd number of negative definite factors among the resulting factors, i.e., the flipping sets
(12) 
thus reversing the sign of the flipped factor. If that factor is non-invertible, we instead flip the next one, and so on. If all factors are non-invertible, the resulting distribution is a product distribution and one can easily find the optimal solution for the corresponding line graph, which completes the proof. Otherwise, we 'join' the two endpoints of the line by introducing a non-invertible factor, which results in an alternating cycle with a probability distribution identical to that of the line-graph GM.
Our next step is to prove Theorem 1 for an alternating-cycle GM. Our high-level logic here is as follows. We first fix the distribution b of (6) according to (7) and then show that the GM can be gauge transformed into a distribution with all of the non-zero probability concentrated at the all-zero configuration. The resulting objective of (6) then becomes exactly the partition function. To implement this logic, consider an alternating cycle defined on a graph whose edges form a single cycle. Observe that the gauge-transformed factor and the original factor share a pair of eigenvalues due to the following relationship:
One finds that the relevant eigenvalue is negative, since there exists an odd number of negative definite factors in the cycle. Moreover, the diagonal sum (trace) is equivalent to the partition function of the GM. Thus one can make both assumptions without loss of generality.
Next, utilizing simple linear algebra, one derives
where the two matrices have, as their i-th columns, eigenvectors of the gauge-transformed factor and the original factor, respectively. Now choose the gauges accordingly. Here we assume that there exists at most one non-invertible factor in the GM and the remaining factors are invertible, so that the gauge is defined properly. Otherwise, the GM can be decomposed into separate line graphs and the proof can be applied recursively. Then the gauge-transformed factors become:
which corresponds to a GM whose objective in (6) is equal to the log-partition function. This completes the proof of Theorem 1.
Appendix C Generating GM instances (for experiments)
In this section, we provide more details on the experimental setups reported in Section 4. First, we explain how the two types of factors, non-log-supermodular and log-supermodular, were constructed. In the generic case (of non-log-supermodular factors), i.e., corresponding to Figure 1 and Figure 3, one generates a factor by first drawing the interaction strength vector at random from an i.i.d. uniform distribution over an interval parameterized by the interaction strength. Then, in order to introduce a bias, we add an external variable, i.e., a half-edge, whose sign is +1 or −1 with probability 1/2 each. More specifically, in the experiments reported in Figure 1 one varies the interaction strength, while in the experiments reported in Figure 3 one fixes it. Next, in the case of the log-supermodular factors, i.e., the setting reported in Figure 2 and Figure 4,
one generates log-supermodular factors by drawing the interaction strength vector from a normal distribution with a positive mean and a fixed variance. Note that there is no bias in these factors and, even though the distribution of the interaction strength is normal, the drawn values are highly likely to be positive, concentrated around the mean.