1 Introduction
Graphical models (GMs) express the factorization of a joint multivariate probability distribution via a graph of relations between variables. GMs have been used successfully in information theory, physics, artificial intelligence, and machine learning
[1, 2, 3, 4, 5, 6]. Among the many inference problems one can pose on a GM, computing the partition function (normalization), or equivalently marginalizing the joint distribution, is the most general problem of interest. However, this paradigmatic inference problem is known to be computationally intractable in general: formally, it is #P-hard even to approximate
[7, 8]. To address this obstacle, extensive efforts have been made to develop practical approximation methods, among which Markov Chain Monte Carlo (MCMC) [9] based and Belief Propagation (BP) [10] based algorithms are, arguably, the most popular and practically successful ones. MCMC is exact, i.e., it converges to the correct answer, but its convergence/mixing time is, in general, exponential in the system size. On the other hand, message-passing implementations of BP typically demonstrate fast convergence, but in general lack approximation guarantees for GMs containing loops. Motivated by this complementarity of the MCMC and BP approaches, we aim here to synthesize a hybrid approach benefiting from a joint use of MCMC and BP.
At a high level, our proposed scheme uses BP as the first step and then runs MCMC to correct for the approximation error of BP. To design such an “error-correcting” MCMC, we utilize the Loop Calculus approach [11], which, in a nutshell, expresses the BP error as a sum (i.e., series) of weights of the so-called generalized loops (subgraphs of a special structure). There are several challenges one needs to overcome. First of all, to design an efficient Markov Chain (MC) sampler, one needs a scheme which allows efficient transitions between the generalized loops. Second, even if one designs such an MC capable of accessing all the generalized loops, it may mix slowly. Finally, weights of generalized loops can be positive or negative, while an individual MCMC can only generate non-negative contributions.
Since approximating the full loop series (LS) is intractable in general, we first explore whether we can deal with these challenges at least in the case of the truncated LS corresponding to 2-regular loops. In fact, this problem has been analyzed in the case of planar pairwise binary GMs [12, 13], where it was shown that the 2-regular LS is computable exactly in polynomial time through a reduction to a Pfaffian (or determinant) computation [14]. In particular, the partition function of the Ising model without external field (i.e., where only pairwise factors are present) is computable exactly via the 2-regular LS. Furthermore, the authors of [12, 13] show that in the case of general planar pairwise binary GMs, the 2-regular LS provides a highly accurate approximation empirically. Motivated by these results, we address the same question in the general (i.e., non-planar) case of pairwise binary GMs via MCMC. For the choice of MC, we adopt the Worm algorithm [15]. We prove that, with some modifications including rejections, the algorithm allows sampling 2-regular loops (with probabilities proportional to their respective weights) in polynomial time. Then, we design a novel simulated annealing strategy using the sampler to estimate separately the positive and negative parts of the 2-regular LS. Given any desired accuracy, this leads to a polynomial-time approximation scheme for the 2-regular LS under a mild assumption.
We next turn to estimating the full LS. In this part, we ignore the theoretical question of establishing the polynomial mixing time of an MC, and instead focus on designing an empirically efficient MCMC scheme. We design an MC using a cycle basis of the graph [16] to sample generalized loops directly, without rejections. It transits from one generalized loop to another by adding or deleting a random element of the cycle basis. Using the MC sampler, we design a simulated annealing strategy for estimating the full LS, similar to the one used to estimate the 2-regular LS. Notice that even though the prime focus of this paper is on pairwise binary GMs, the proposed MCMC scheme allows straightforward generalization to general non-binary GMs.
In summary, we propose novel MCMC schemes to estimate the LS correction to the BP estimate of the partition function. Since the bare BP already provides a highly non-trivial estimate of the partition function, it is naturally expected, and confirmed in our experimental results, that the proposed algorithm outperforms other standard (not related to BP) MCMC schemes applied to the original GM. We believe that our approach provides a new angle for approximate inference on GMs and is of broader interest to various applications involving GMs.
2 Preliminaries
2.1 Graphical models and belief propagation
Given an undirected graph $G=(V,E)$, a pairwise binary Markov Random Field (MRF) defines the following joint probability distribution on $x=[x_v : v\in V]\in\{0,1\}^{|V|}$:
$$p(x)\;=\;\frac{1}{Z}\prod_{(u,v)\in E}\psi_{uv}(x_u,x_v),$$
where $\psi_{uv}:\{0,1\}^2\to\mathbb{R}_{\geq 0}$ are non-negative functions, called compatibility or factor functions, and the normalization constant $Z$ is called the partition function. Without loss of generality, we assume $G$ is connected. It is known that approximating the partition function is #P-hard in general [8]
. Belief Propagation (BP) is a popular message-passing heuristic for approximating marginal distributions of an MRF. The BP algorithm iterates the following message updates for all edges $(u,v)\in E$:
$$m_{u\to v}(x_v)\;\propto\;\sum_{x_u\in\{0,1\}}\psi_{uv}(x_u,x_v)\prod_{w\in N(u)\setminus\{v\}} m_{w\to u}(x_u),$$
where $N(u)$ denotes the set of neighbors of $u$. In general, BP may fail to converge; in this case one may substitute it with a somewhat more involved algorithm provably convergent to its fixed point [22, 23, 24]. Estimates for the marginal probabilities are expressed via the fixed-point messages as follows: $b_v(x_v)\propto\prod_{u\in N(v)} m_{u\to v}(x_v)$ and $b_{uv}(x_u,x_v)\propto\psi_{uv}(x_u,x_v)\prod_{w\in N(u)\setminus\{v\}} m_{w\to u}(x_u)\prod_{w\in N(v)\setminus\{u\}} m_{w\to v}(x_v)$.
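The message updates can be sketched in a few lines. This is a hedged illustration rather than the paper's implementation; the synchronous schedule, uniform initialization, and stopping threshold are our own choices:

```python
def belief_propagation(edges, factors, n, iters=100, tol=1e-9):
    """Sketch of synchronous BP for a pairwise binary MRF on vertices
    0..n-1.  factors[(i, j)] is a 2x2 table psi[x_i][x_j]; m[(i, j)]
    holds the message from i to j.  Returns estimated marginals."""
    nbrs = {v: set() for v in range(n)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)

    def psi(i, j, xi, xj):
        # look up the factor in whichever orientation it was given
        if (i, j) in factors:
            return factors[(i, j)][xi][xj]
        return factors[(j, i)][xj][xi]

    m = {}
    for i, j in edges:
        m[(i, j)] = [0.5, 0.5]
        m[(j, i)] = [0.5, 0.5]
    for _ in range(iters):
        new = {}
        for (i, j) in m:
            msg = []
            for xj in (0, 1):
                s = 0.0
                for xi in (0, 1):
                    prod = psi(i, j, xi, xj)
                    for k in nbrs[i] - {j}:
                        prod *= m[(k, i)][xi]
                    s += prod
                msg.append(s)
            z = msg[0] + msg[1]
            new[(i, j)] = [msg[0] / z, msg[1] / z]
        delta = max(abs(new[e][0] - m[e][0]) for e in m)
        m = new
        if delta < tol:
            break
    beliefs = {}
    for v in range(n):
        b = [1.0, 1.0]
        for k in nbrs[v]:
            b[0] *= m[(k, v)][0]
            b[1] *= m[(k, v)][1]
        z = b[0] + b[1]
        beliefs[v] = [b[0] / z, b[1] / z]
    return beliefs
```

On a tree, BP is exact; with symmetric factors the marginals come out uniform.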
2.2 Bethe approximation and loop calculus
The BP marginals also yield the following Bethe approximation $Z_B$ for the partition function $Z$:
$$\log Z_B=-\sum_{(u,v)\in E}\sum_{x_u,x_v} b_{uv}(x_u,x_v)\log\frac{b_{uv}(x_u,x_v)}{\psi_{uv}(x_u,x_v)}+\sum_{v\in V}(d_v-1)\sum_{x_v} b_v(x_v)\log b_v(x_v),$$
where $d_v$ denotes the degree of $v$ in $G$.
If the graph $G$ is a tree, the Bethe approximation is exact, i.e., $Z_B=Z$. In general, however, i.e., for graphs with cycles, BP provides an often rather accurate, but still approximate, estimate.
The Loop Series (LS) [11] expresses the ratio $Z/Z_B$ as the following sum/series:
$$\frac{Z}{Z_B}=\sum_{C\in\mathcal{L}} w(C),$$
where each term/weight $w(C)$ is associated with a so-called generalized loop $C$, and $\mathcal{L}$ denotes the set of all generalized loops in the graph $G$ (including the empty subgraph $\emptyset$, with $w(\emptyset)=1$). Here, a subgraph $C$ of $G$ is called a generalized loop if every vertex of $C$ has degree (in the subgraph) no smaller than 2.
Since the number of generalized loops is exponentially large, computing $Z/Z_B$ via the LS is intractable in general. However, the following truncated sum, called the 2-regular loop series, is known to be computable in polynomial time if $G$ is planar [12]:¹

¹Note that the number of 2-regular loops is exponentially large in general.
where the sum runs over the set of all 2-regular generalized loops, i.e., generalized loops in which every vertex has degree exactly 2. One can check that the 2-regular LS coincides with the full LS for the Ising model without external fields. Furthermore, as stated in [12, 13] for the general case, the 2-regular LS provides a good empirical estimate of the full LS.
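The defining degree conditions are easy to check programmatically; the helper names below are ours, and subgraphs are encoded simply as collections of edges:

```python
def subgraph_degrees(edge_subset):
    """Degree of every vertex touched by the edge subset."""
    deg = {}
    for i, j in edge_subset:
        deg[i] = deg.get(i, 0) + 1
        deg[j] = deg.get(j, 0) + 1
    return deg

def is_generalized_loop(edge_subset):
    """Generalized loop: every touched vertex has degree >= 2.
    The empty subgraph qualifies vacuously."""
    return all(d >= 2 for d in subgraph_degrees(edge_subset).values())

def is_two_regular(edge_subset):
    """2-regular generalized loop: every touched vertex has degree exactly 2."""
    return all(d == 2 for d in subgraph_degrees(edge_subset).values())
```

For example, a triangle is 2-regular, while the complete graph on four vertices is a generalized loop (all degrees are 3) but not 2-regular.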
3 Estimating 2-regular loop series via MCMC
In this section, we describe how the 2-regular loop series can be estimated in polynomial time. To this end, we first assume that the maximum degree of the graph is at most 3. This degree-constrained assumption is not really restrictive, since any pairwise binary model can easily be expressed as an equivalent one with maximum degree at most 3; see the supplementary material. The rest of this section consists of two parts. We first propose an algorithm generating a 2-regular loop sample with probability proportional to the absolute value of its weight, i.e., proportional to $|w(C)|$.
Note that the 2-regular loop weight allows the following factorization: for any 2-regular loop $C$,
$$w(C)=\prod_{(u,v)\in C} w_{uv}, \qquad (1)$$
where the edge weight $w_{uv}$ is determined by the BP fixed point.
In the second part, we use the sampler constructed in the first part to design a simulated annealing scheme estimating the 2-regular LS.
3.1 Sampling 2-regular loops
We suggest sampling 2-regular loops with probabilities proportional to $|w(C)|$ through a version of the Worm algorithm proposed by Prokof'ev and Svistunov [15]. It can be viewed as an MC exploring the union of the set of 2-regular loops and the set of all subgraphs of $G$ with exactly two odd-degree vertices. Given the current state $S$, it chooses the next state as follows:

1. If $S$ has no odd-degree vertex, pick a random vertex $v$ (uniformly) from $V$. Otherwise, pick a random odd-degree vertex $v$ (uniformly) from $S$.

2. Choose a random neighbor $u$ of $v$ (uniformly) within $G$, and set $S^{\prime}=S$ initially.

3. Update $S^{\prime}\leftarrow S\oplus\{(u,v)\}$ with an appropriate Metropolis-type acceptance probability.

Here, $\oplus$ denotes the symmetric difference, and the weight of a subgraph is defined by extending (1). In essence, the Worm algorithm either deletes or adds an edge to the current subgraph $S$. From the Worm algorithm, we obtain Algorithm 1, which samples 2-regular loops, simply by rejecting the final state if it is not a 2-regular loop.
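The transition just described can be sketched as follows. We use a plain Metropolis acceptance ratio as a stand-in for the exact acceptance rule, and a uniform weight in the test, so this is an illustration rather than Algorithm 1 itself:

```python
import random

def worm_step(state, nbrs, weight):
    """One step of a Worm-style chain over subgraphs with 0 or 2
    odd-degree vertices.  `state` is a frozenset of edges (i, j) with
    i < j; `nbrs` maps each vertex to its neighbors in G; `weight(S)`
    returns the (absolute) weight of subgraph S."""
    deg = {}
    for i, j in state:
        deg[i] = deg.get(i, 0) + 1
        deg[j] = deg.get(j, 0) + 1
    odd = [v for v, d in deg.items() if d % 2 == 1]
    # pick an odd-degree vertex if one exists, else any vertex of G
    v = random.choice(odd) if odd else random.choice(list(nbrs))
    u = random.choice(sorted(nbrs[v]))
    e = (min(u, v), max(u, v))
    proposal = state ^ {e}   # add or delete the chosen edge
    if random.random() < min(1.0, weight(proposal) / weight(state)):
        return proposal
    return state
```

A quick invariant check: flipping one edge incident to the chosen vertex keeps the number of odd-degree vertices in {0, 2}.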
The following theorem states that Algorithm 1 can generate a desired random sample in polynomial time.
Theorem 1.
The proof of the above theorem is presented in the supplementary material due to the space constraint. In the proof, we first show that the MC induced by the Worm algorithm mixes in polynomial time, and then prove that acceptance of a 2-regular loop, i.e., line 6 of Algorithm 1, occurs with high probability. Notice that the uniform-weight version of the former result, i.e., fast mixing, was recently proven in [18]; for completeness of exposition, we present the proof for the general case of interest to us. The latter proof, i.e., high acceptance, requires bounding the weights and showing that the probability of sampling a 2-regular loop under the Worm algorithm is at least inverse-polynomial in the graph size.
3.2 Simulated annealing for approximating 2-regular loop series
Here we utilize Theorem 1 to describe an algorithm approximating the 2-regular LS in polynomial time. To achieve this goal, we rely on the simulated annealing strategy [19], which requires choosing a monotone cooling schedule interpolating between the target counting problem and a relaxed, easy version of it. Thus, designing an appropriate cooling strategy is the first challenge to address. We will also describe how to deal with the issue that the 2-regular LS is a sum of positive and negative terms, while most simulated annealing strategies in the literature have studied sums of non-negative terms. This second challenge is related to the so-called ‘fermion sign problem’ common in statistical mechanics of quantum systems [25]. Before we describe the proposed algorithm in detail, let us provide an intuitive sketch.
The proposed algorithm consists of two parts: a) estimating the sum of absolute weights via a simulated annealing strategy, and b) estimating the relative contribution of the negative terms by counting samples corresponding to negative terms in the 2-regular loop series. First consider the following parametrized, auxiliary distribution over 2-regular loops:
$$p_\beta(C)\;\propto\;|w(C)|^{\beta}. \qquad (2)$$
Note that one can generate samples approximately distributed according to (2) in polynomial time using Algorithm 1, by raising the edge weights to the power $\beta$. Indeed, it follows that, for consecutive annealing parameters, the ratio of the corresponding normalizing constants is an expectation under $p_\beta$,
where the expectation can be estimated using samples as long as the consecutive parameters are sufficiently close. Then, for any increasing sequence of annealing parameters, we derive a telescoping product of such ratios,
where it is known that the normalization at $\beta=0$, i.e., the total number of 2-regular loops, is exactly $2^{|E|-|V|+1}$ [16]. This allows us to estimate the sum of absolute weights simply by estimating each ratio in the product.
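The telescoping identity behind the annealing estimate can be checked on a toy set of absolute weights, with the expectations computed exactly instead of by sampling; names and inputs here are illustrative:

```python
def telescoping_estimate(abs_weights, betas):
    """Telescoping annealing identity: with p_beta(C) proportional to
    w(C)**beta and Z(beta) = sum_C w(C)**beta, each ratio
    Z(beta')/Z(beta) equals E_{p_beta}[w**(beta' - beta)], so
    Z(beta_T) = N * (product of ratios), where N = Z(0) is the number
    of loops.  Expectations are exact here; the annealing scheme
    replaces them with Monte Carlo averages over sampled loops."""
    est = float(len(abs_weights))            # Z(beta_0 = 0) = N
    for b0, b1 in zip(betas, betas[1:]):
        z0 = sum(w ** b0 for w in abs_weights)
        ratio = sum((w ** b0 / z0) * w ** (b1 - b0) for w in abs_weights)
        est *= ratio
    return est
```

With an exact expectation the product telescopes, recovering the sum of absolute weights at the final temperature.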
Our next step is to estimate the relative contribution of the negative terms. Let the negative 2-regular loops be those with $w(C)<0$.
Then the 2-regular loop series can be expressed via the sum of absolute weights and the probability mass of the negative loops,
where we estimate the latter again using samples generated by Algorithm 1.
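The sign bookkeeping reduces to one identity: the signed sum equals the absolute sum times $(1-2\rho)$, where $\rho$ is the probability of drawing a negative term when sampling proportionally to absolute weights. A sketch with exact (rather than sampled) probabilities:

```python
def signed_sum_from_abs(weights):
    """Signed loop sum recovered from the absolute-weight sum and the
    probability of a negative draw: sum(w) = sum(|w|) * (1 - 2*rho),
    where rho = Pr[w(C) < 0] under p(C) proportional to |w(C)|.
    In the annealing scheme, rho is estimated from samples."""
    z_abs = sum(abs(w) for w in weights)
    rho = sum(abs(w) for w in weights if w < 0) / z_abs
    return z_abs * (1 - 2 * rho)
```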
We provide the formal description of the proposed algorithm and its error bound as follows.
Theorem 2.
The proof of the above theorem is presented in the supplementary material due to the space constraint. We note that the constants entering Theorem 2 were not optimized. Theorem 2 implies that the complexity of Algorithm 2 is polynomial under the assumption that the relevant quantities are at most polynomially small. Both quantities depend on the choice of BP fixed point; however, it is unlikely (barring a degeneracy) that they become large. In particular, this is guaranteed in the case of attractive models [20].
4 Estimating full loop series via MCMC
In this section, we aim at estimating the full loop series. To this end, we design a novel MC sampler for generalized loops, which adds (or removes) a cycle-basis element or a path to (or from) the current generalized loop. Therefore, we naturally start this section by introducing the necessary background on cycle bases. Then we describe the design of the MC sampler for generalized loops. Finally, we describe a simulated annealing scheme similar to the one described in the preceding section, and report its experimental performance compared with other methods.
4.1 Sampling generalized loops with cycle basis
A cycle basis of the graph $G$ is a minimal set of cycles which allows representing every Eulerian subgraph of $G$ (i.e., every subgraph containing no odd-degree vertex) as a symmetric difference of cycles in the set [16]. Let us characterize the combinatorial structure of generalized loops using a cycle basis. To this end, we also consider a set of paths, one between each pair of vertices. Then the following theorem allows decomposing any generalized loop with respect to any selected cycle basis and path set.
Theorem 3.
Consider any cycle basis and path set. Then, for any generalized loop $C$, there exists a decomposition such that $C$ can be expressed as a symmetric difference of elements of the cycle basis and the path set.
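A standard way to obtain a cycle basis, consistent with the symmetric-difference representation above, is to take the fundamental cycles of a spanning tree: any Eulerian subgraph is then the symmetric difference of the fundamental cycles of its non-tree edges. The encoding below is ours:

```python
from collections import defaultdict

def fundamental_cycles(n, edges):
    """Cycle basis from a BFS spanning tree: one cycle per non-tree
    edge, consisting of that edge plus the tree path between its
    endpoints.  Vertices are 0..n-1; the graph is assumed connected."""
    nbrs = defaultdict(list)
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    parent = {0: None}
    order = [0]
    for v in order:                      # BFS; appending while iterating is fine
        for u in nbrs[v]:
            if u not in parent:
                parent[u] = v
                order.append(u)
    tree = {(min(v, p), max(v, p)) for v, p in parent.items() if p is not None}

    def tree_path(a, b):
        """Edges on the unique tree path between a and b."""
        anc, x = {a}, a
        while parent[x] is not None:     # collect ancestors of a
            x = parent[x]
            anc.add(x)
        path, x = set(), b
        while x not in anc:              # climb from b to the meeting point
            path.add((min(x, parent[x]), max(x, parent[x])))
            x = parent[x]
        meet = x
        x = a
        while x != meet:                 # climb from a to the meeting point
            path.add((min(x, parent[x]), max(x, parent[x])))
            x = parent[x]
        return path

    return {e: frozenset({e}) | tree_path(*e)
            for e in ((min(i, j), max(i, j)) for i, j in edges)
            if e not in tree}
```

On two triangles sharing a vertex, XOR-ing the two fundamental cycles reconstructs the full Eulerian subgraph.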
The proof of the above theorem is given in the supplementary material due to the space constraint. Now, given any choice of cycle basis and path set, consider the following transition from the current generalized loop to the next state:

1. Choose, uniformly at random, an element $B$ of the cycle basis or the path set, and set the proposal to the symmetric difference of the current state with $B$.

2. If the proposal is again a generalized loop (or the empty subgraph), accept it with an appropriate Metropolis-type probability; otherwise, keep the current state.
Due to Theorem 3, it is easy to check that the proposed MC is irreducible and aperiodic, i.e., ergodic, and the distribution of its $t$-th state converges, as $t\to\infty$, to the stationary distribution proportional to the absolute loop weights.
One also has freedom in choosing the cycle basis and the paths. To accelerate mixing of the MC, we suggest choosing the minimum-weight cycle basis and the shortest paths with respect to the edge weights defined in (1), which are computable using the algorithm in [16] and the Bellman-Ford algorithm [21], respectively. This encourages transitions between generalized loops with similar weights.
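A single transition of the chain can be sketched as follows; the Metropolis ratio and the uniform weight in the test stand in for the actual loop-series weights:

```python
import random

def loop_chain_step(state, elements, weight):
    """One transition of the cycle-basis chain over generalized loops.
    XOR the state (a frozenset of edges) with a random basis cycle or
    path; if the result is again a generalized loop (or empty), accept
    with a Metropolis ratio.  `weight(S)` stands in for |w(S)|."""
    B = random.choice(elements)
    proposal = state ^ B
    deg = {}
    for i, j in proposal:
        deg[i] = deg.get(i, 0) + 1
        deg[j] = deg.get(j, 0) + 1
    if all(d >= 2 for d in deg.values()):      # generalized loop (or empty)
        if random.random() < min(1.0, weight(proposal) / weight(state)):
            return proposal
    return state
```

With a single triangle as the whole basis and uniform weights, the chain simply toggles between the empty subgraph and the triangle.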
4.2 Simulated annealing for approximating full loop series
Now we are ready to describe a simulated annealing scheme for estimating the full LS. It is similar, in principle, to that of Section 3.2. First, we again introduce a parametrized, auxiliary probability distribution over generalized loops, and for any decreasing sequence of annealing parameters we derive a telescoping product of ratios.
Following procedures similar to those of Section 3.2, one can estimate each ratio using the sampler described in Section 4.1. Moreover, the base term is estimated by sampling the generalized loop with the highest probability. For a large enough initial annealing parameter, the approximation error becomes relatively small, since the highest-probability loop dominates the distribution. In combination, this provides the desired approximation of the full LS. The resulting procedure is stated formally in Algorithm 3.
4.3 Experimental results
In this section, we report experimental results for computing the partition function of the Ising model and the hard-core model. We compare Algorithm 2 of Section 3 (coined MCMC-BP-2reg) and Algorithm 3 of Section 4.2 (coined MCMC-BP-whole) with the bare Bethe approximation (coined BP) and the popular Gibbs sampler (coined MCMC-Gibbs). To make the comparison fair, we use the same annealing scheme for all MCMC schemes, thus making their running times comparable. More specifically, we generate each sample after running a fixed number of iterations of an MC and use a fixed number of samples to compute each estimation at intermediate steps. As the performance measure, we use the log-partition function approximation error, i.e., the absolute difference between the true log-partition function and the logarithm of the respective algorithm's output. We conducted 3 experiments on the
grid graph. In our first experimental setting, we consider the Ising model with varying interaction strength and no external (magnetic) field. To prepare the model of interest, we start from the Ising model with uniform (ferromagnetic/attractive or antiferromagnetic/repulsive) interaction strength and then add ‘glassy’ variability in the interaction strength, modeled via i.i.d. Gaussian random variables with given mean and variance; in other words, each interaction strength in the model is chosen independently as a Gaussian perturbation of the average interaction strength. The second experiment was conducted by adding corrections to the external fields under the same conditions as in the first experiment. In this case we observe that BP often fails to converge, and we use the Concave Convex Procedure (CCCP) [23] for finding BP fixed points. Finally, we experiment with the hard-core model on the grid graph with a varying positive parameter, called the ‘fugacity’ [26]. As seen clearly in Figure 1, BP and MCMC-Gibbs are outperformed by MCMC-BP-2reg or MCMC-BP-whole in most tested regimes of the first experiment with no external field, where the 2-regular loop series (LS) is equal to the full one. Even in the regimes where MCMC-Gibbs outperforms BP, our schemes correct the error of BP and perform at least as well as MCMC-Gibbs. In the experiments, we observe that the advantage of our schemes over BP is more pronounced when the error of BP is large. A theoretical reasoning behind this observation is as follows: if the performance of BP is good, i.e., the loop series is close to 1, the contribution of the empty generalized loop in the LS is significant, and it becomes harder to sample other generalized loops accurately.

5 Conclusion
In this paper, we propose new MCMC schemes for approximate inference in GMs. The main novelty of our approach is in designing BP-aware MCs utilizing non-trivial BP solutions. In experiments, our BP-based MCMC schemes outperform the alternatives. We anticipate that this new technique will be of interest to many applications where GMs are used for statistical reasoning.
Acknowledgements.
The work of MC was carried out under the auspices of the National Nuclear Security Administration of the U.S. Department of Energy at Los Alamos National Laboratory under Contract No. DEAC5206NA25396, and was partially supported by the Advanced Grid Modeling Program in the U.S. Department of Energy Office of Electricity.
References
 [1] J. Pearl, “Probabilistic reasoning in intelligent systems: networks of plausible inference,” Morgan Kaufmann, 2014.
 [2] R. G. Gallager, “Low-density parity-check codes,” Information Theory, IRE Transactions 8(1): 21-28, 1962.
 [3] F. R. Kschischang and B. J. Frey, “Iterative decoding of compound codes by probability propagation in graphical models,” Selected Areas in Communications, IEEE Journal 16(2): 219-230, 1998.
 [4] M. I. Jordan, ed. “Learning in graphical models,” Springer Science & Business Media 89, 1998.
 [5] R.J. Baxter, “Exactly solved models in statistical mechanics,” Courier Corporation, 2007.

 [6] W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning low-level vision,” International Journal of Computer Vision 40(1): 25-47, 2000.
 [7] V. Chandrasekaran, N. Srebro, and P. Harsha, “Complexity of Inference in Graphical Models,” Association for Uncertainty and Artificial Intelligence, 2008.
 [8] M. Jerrum and A. Sinclair, “Polynomial-time approximation algorithms for the Ising model,” SIAM Journal on Computing 22(5): 1087-1116, 1993.
 [9] C. Andrieu, N. de Freitas, A. Doucet, and M. I. Jordan, “An introduction to MCMC for machine learning,” Machine Learning 50(1-2): 5-43, 2003.
 [10] J. Pearl, “Reverend Bayes on inference engines: A distributed hierarchical approach,” Association for the Advancement of Artificial Intelligence, 1982.
 [11] M. Chertkov, and V. Y. Chernyak, “Loop series for discrete statistical models on graphs,” Journal of Statistical Mechanics: Theory and Experiment 2006(6): P06009, 2006.
 [12] M. Chertkov, V. Y. Chernyak, and R. Teodorescu, “Belief propagation and loop series on planar graphs,” Journal of Statistical Mechanics: Theory and Experiment 2008(5): P05003, 2008.
 [13] V. Gomez, H. J. Kappen, and M. Chertkov, “Approximate inference on planar graphs using Loop Calculus and Belief Propagation,” The Journal of Machine Learning Research 11: 1273-1296, 2010.
 [14] P. W. Kasteleyn, “The statistics of dimers on a lattice,” Classic Papers in Combinatorics, Birkhäuser Boston, 281-298, 2009.
 [15] N. Prokof’ev and B. Svistunov, “Worm algorithms for classical statistical models,” Physical Review Letters 87(16): 160601, 2001.
 [16] J. D. Horton, “A polynomial-time algorithm to find the shortest cycle basis of a graph,” SIAM Journal on Computing 16(2): 358-366, 1987.
 [17] H. A. Kramers, and G. H. Wannier, “Statistics of the twodimensional ferromagnet. Part II,” Physical Review 60(3): 263, 1941.
 [18] A. Collevecchio, T. M. Garoni, T. Hyndman, and D. Tokarev, “The worm process for the Ising model is rapidly mixing,” arXiv preprint arXiv:1509.03201, 2015.
 [19] S. Kirkpatrick, “Optimization by simulated annealing: Quantitative studies,” Journal of Statistical Physics 34(5-6): 975-986, 1984.
 [20] N. Ruozzi, “The Bethe partition function of log-supermodular graphical models,” Advances in Neural Information Processing Systems, 2012.
 [21] J. Bang-Jensen and G. Z. Gutin, “Digraphs: theory, algorithms and applications,” Springer Science & Business Media, 2008.
 [22] Y. W. Teh and M. Welling, “Belief optimization for binary networks: a stable alternative to loopy belief propagation,” Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, 493-500, 2001.
 [23] A. L. Yuille, “CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation,” Neural Computation 14(7): 1691-1722, 2002.
 [24] J. Shin, “The complexity of approximating a Bethe equilibrium,” IEEE Transactions on Information Theory 60(7): 3959-3969, 2014.
 [25] https://www.quora.com/Statistical-Mechanics-What-is-the-fermion-sign-problem
 [26] M. Dyer, A. Frieze, and M. Jerrum, “On counting independent sets in sparse graphs,” SIAM Journal on Computing 31(5): 1527-1541, 2002.
 [27] J. Schweinsberg, “An bound for the relaxation time of a Markov chain on cladograms,” Random Structures & Algorithms 20(1): 59-70, 2002.
Appendix A Transformation to an equivalent binary pairwise model with maximum degree at most 3
Appendix B Proof of Theorem 1
First, note that the MC induced by the Worm algorithm converges to the stationary distribution proportional to the subgraph weights.
We first prove its polynomial mixing, i.e., that it produces a sample from a distribution within the desired total variation distance of the stationary one in a polynomial number of iterations.
Lemma 1.
Given any and any , choose
and let denote the resulting distribution of after updating times by the Worm algorithm with initial state . Then, it follows that
namely, the mixing time of the MC is bounded above by .
The proof of the above lemma is given in Section B.1. Collevecchio et al. [18] recently proved that the Worm algorithm mixes in polynomial time when the weights are uniform, i.e., equal. We extend the result to our case of non-uniform weights. The proof is based on the method of canonical paths, which views the state space as a graph and constructs a path, carrying a certain amount of flow, between every pair of states. From Lemma 1, with the parameters
we obtain that the total variation distance between the stationary distribution and the distribution of the updated states in line 4 of Algorithm 1 is small. Next, we prove that the probability of acceptance in line 6 of Algorithm 1 is sufficiently large.
Lemma 2.
The probability of sampling a 2-regular loop from the stationary distribution of the Worm algorithm is bounded below by an inverse polynomial in the graph size.
The proof of the above lemma is given in Section B.2. The proof relies on the fact that the size of the Worm chain's state space is bounded by a polynomial multiple of the number of 2-regular loops.
Now we are ready to complete the proof of Theorem 1. Let the output distribution be the distribution of 2-regular loops obtained in line 6 of Algorithm 1 under the parameters of Theorem 1. We say Algorithm 1 fails if it terminates at line 9. Choose any set of 2-regular loops. Then the total variation distance between the output distribution and the target distribution can be expressed as:
By applying Lemma 1 and Lemma 2, we obtain the following under parameters as in Theorem 1:
In the above, (a) comes from the fact that a sample from line 6 of Algorithm 1 follows the chain's distribution conditioned on acceptance, together with the failure probability of Algorithm 1. For (b), we use the total variation distance between the chain's distribution and the stationary one due to Lemma 1 and the parameters of Theorem 1, i.e.,
For (c), we use an elementary inequality, and (d) follows from Lemma 2. The converse direction can be handled similarly by considering the complementary set. This completes the proof of Theorem 1.
B.1 Proof of Lemma 1
First, let us denote the transition matrix of the MC induced by the Worm algorithm in Section 3.1. Then we define the corresponding transition graph, where each vertex is a state of the MC, and edges connect state pairs with non-zero transition probability, i.e.
Our proof makes use of the following result proved in [27].
Theorem 4 (Schweinsberg 2002 [27]).
Consider an irreducible and lazy MC, with finite state space , transition matrix and transition graph , which is reversible with respect to the distribution . Let be non-empty, and for each pair , specify a path in from to . Let
denote the collection of all such paths, and let be the length of longest path in . For any transition , let
Then
where
To this end, we show that there exists a choice of paths such that
Then we obtain the statement in Lemma 1 immediately.
We begin by specifying the paths, and then proceed to bounding the congestion. To this end, we fix an integer-valued vertex labeling of the graph. The labeling induces a lexicographical total order on the edges, which in turn induces a lexicographical total order on the set of all subgraphs of the graph. In order for a state to transit to another, it suffices that it updates, precisely once, the edges in their symmetric difference. In order to describe such a path, we first prove that there exists an injection from each symmetric difference to a unique partition into disjoint parts, where the first part is either a path or a cycle and the remaining parts are cycles. Observe that, since both states have at most two odd-degree vertices, applying the symmetric difference does not change the parity of the degrees except possibly at those vertices. First consider the case when the symmetric difference has odd-degree vertices. Then there exists a path between two odd-degree vertices within the same component, since the sum of degrees over all vertices of a component is even. Among such paths, we pick the one with the highest order according to the vertex labeling. Now observe that the remainder is Eulerian, and hence can be decomposed into a disjoint set of cycles. We can choose the cycles uniquely by recursively excluding the cycle with the highest order, i.e., we pick the cycle with the highest order first, then the cycle with the highest order among the rest, and so on. In the case when the symmetric difference has no odd-degree vertex, it is Eulerian, and we apply the same logic to obtain the unique decomposition into disjoint cycles.
Now we are ready to describe the canonical path, which updates the edges of the decomposition in order. If the first element is a path, pick its endpoint with the higher label order and update the edges by unwinding them along the path until the other endpoint is met. In the case of cycles, pick the vertex with the highest label order and unwind the edges following a fixed orientation. Note that during the update of a cycle, the number of odd-degree vertices is at most 2, so the intermediate states are still in the state space. As a result, we have constructed a canonical path for each pair of states, where each edge of the path corresponds to an update of a single graph edge.
Next, we bound the corresponding congestion quantity from Theorem 4. First, let denote the set of subgraphs with exactly odd-degree vertices. We define a mapping by the following:
where the mapping combines the two endpoint states with the current intermediate state. Observe that its image agrees with the initial state on the components that have already been processed, and with the final state on those that have not. We prove that the mapping is an injection by reconstructing the endpoint states from its image and the current transition. To this end, observe that the symmetric difference of the endpoint states is uniquely determined. Then we can infer its decomposition by the rules defined previously. Moreover, the updated edge identifies the element of the decomposition currently being processed, so we can infer the processed part. We can then recover one endpoint state by starting from the intermediate state and unwinding the remaining edges that were not yet processed, and recover the other endpoint via the symmetric difference. Therefore the mapping is injective.
Next, we define a metric such that given an edge set ,
We complete the proof by showing that for any , the following inequality holds:
First, (a) holds by the definition of the metric. We prove (b) by the following chain of inequalities: