1 Introduction
Graphical Models (GMs) express the factorization of the joint multivariate probability distribution over subsets of variables via graphical relations among them. They have played an important role in many fields, including computer vision
(Freeman et al., 2000), speech recognition (Bilmes, 2004), social science (Scott, 2017)and deep learning
(Hinton & Salakhutdinov, 2006). Given a GM, computing the partition function (the normalizing constant) is the essence of other statistical inference tasks such as marginalization and sampling. The partition function can be calculated efficiently in treestructured GMs through an iterative (dynamic programming) algorithm eliminating, i.e. summing up, variables sequentially. In principle, the elimination strategy extends to arbitrary loopy graphs, but the computational complexity is exponential in the treewidth, e.g., the junctiontree method (Shafer & Shenoy, 1990). Formally, the computation task is #Phard even to approximate (Jerrum & Sinclair, 1993).Variational approaches are often the most popular practical choice for approximate computation of the partition function. They map the counting problem into an approximate optimization problem stated over a polynomial (in the graph size) number of variables. The optimization is typically solved iteratively via a messagepassing algorithm, e.g., meanfield (Parisi, 1988), belief propagation (Pearl, 1982), treereweighted (Wainwright et al., 2005), or gauges and/or reparametrizations (Ahn et al., 2017, 2018 (accepted to appear)
. Lack of accuracy control and difficulty in forcing convergence in an acceptable number of steps are, unfortunately, typical for hard GM instances. Markov chain Monte Carlo methods (e.g., see
Alpaydin, 2014) are also popular to approximate the partition function, but typically suffer, even more than variational methods, from slow convergence/mixing.Approximate elimination is a sequential method to estimate the partition function. Each step consists of summation over variables followed by (or combined with) approximation of the resulting complex factors. Notable flavors of this method include truncation of the Fourier coefficients
(Xue et al., 2016), approximation by random mixtures of rank tensors (Wrigley et al., 2017), and arguably the most popular, elimination over minibuckets (Dechter & Rish, 2003; Liu & Ihler, 2011). One advantage of the minibucket elimination approach is the ability to control the tradeoff between computational complexity and approximation quality by adjusting an inducedwidth parameter. Note that analogous control in variational methods, such as varying region sizes in generalized belief propagation (Yedidia et al., 2001), typically results in much more complicated optimization formulations to solve. Another important advantage of minibucket elimination is that it is always guaranteed to terminate and, usually, it does so quickly. This is in contrast to iterative messagepassing implementations of variational methods which can be notoriously slow on difficult instances.Contribution. We improve the approximation quality of minibucket methods using tensor network and renormalization group approaches from statistical physics. In this regard, our method extends a series of recent papers exploring multilinear tensor network transformations/contractions (Novikov et al., 2014; Wrigley et al., 2017; Ahn et al., 2017, 2018 (accepted to appear). More generally, tensor network renormalization algorithms (Levin & Nave, 2007; Evenbly & Vidal, 2015) have been proposed in the quantum and statistical physics literature for estimating partition functions. The algorithms consist of coarsegraining the graph/network by contracting subgraphs/networks using a lowrank projection as a subroutine. However, the existing renormalization methods in the physics literature have focused primarily on a restricted class of tensorfactorized models over regular grids/lattices,^{1}^{1}1 The special models are related to what may be called Forneystyle grids/lattices (Forney, 2001) in the GM community. while factorgraph models (Clifford, 1990) over generic graphical structures are needed in most machine learning applications.
For generalizing them to factorgraph models, one would face at two challenges: (a) coarsegraining of the tensor network relies on the periodic structure of grid/lattices and (b) its lowrank projections are only defined on “edge variables” that allows only two adjacent factors. To overcome them, we first replace the coarsegraining step by sequential elimination of the minibucket algorithms, and then use the strategy of “variable splitting” in order to generate auxiliary edge variables. Namely, we combine ideas from tensor network renormalization and the minibucket schemes where one is benefical to the other. We propose two algorithms, which we call MBR and GBR:

Minibucket renormalization
(MBR) consists of sequentially splitting summation over the current (remaining) set of variables into subsets – multiple minibuckets which are then “renormalized”. We show that this process is, in fact, equivalent to applying lowrank projections on the minibuckets to approximate the variableelimination process, thus resulting in better approximation than the original minibucket methods. In particular, we show how to resolve approximate renormalization locally and efficiently through application of truncated singular value decomposition (SVD) over small matrices.

While MBR is based on a sequence of local lowrank approximations applied to the minibuckets, globalbucket renormalization (GBR) extends MBR by approximating minibuckets globally. This is achieved by first applying MBR to minibuckets, then calibrating the choice of low rank projections by minimizing the partition function approximation error with respect to renormalization of the “globalbucket”. Hence, GBR takes additional time to run but may be expected to yield better accuracy.
Both algorithms are easily applicable to arbitrary GMs with interactions (factors) of high orders, hypergraphs and large alphabets. We perform extensive experiments on synthetic (Ising models on complete and grid graphs) and realworld models from the UAI dataset. In our experiments, both MBR and GBR show performance superior to other stateoftheart elimination and variational algorithms.
2 Preliminaries
Graphical model. Consider a hypergraph with vertices and hyperedges . A graphical model (GM) associates a collection of discrete random variables with the following joint probability distribution:
where , , is a set of nonnegative functions called factors, and is the normalizing constant called the partition function that is computationally intractable.
Minibucket elimination. Bucket (or variable) elimination (BE, Dechter, 1999; Koller & Friedman, 2009) is a procedure for computing the partition function exactly based on sequential elimination of variables. Without loss of generality, we assume through out the paper that the elimination order is fixed . BE groups factors by placing each factor in the “bucket” of its earliest argument appearing in the elimination order . Next, BE eliminates the variable by introducing a new factor marginalizing the product of factors in it, i.e.,
(1) 
Here, abbreviates , where indicates the set of variables associated with the bucket . The subscript in represents a similar abbreviation. Finally, the new function is added to another bucket corresponding to its earliest argument in the elimination order. Formal description of BE is given in Algorithm 1.
One can easily check that BE applies a distributive property for computing exactly: groups of factors corresponding to buckets are summed out sequentially, and then the newly generated factor (without the eliminated variable) is added to another bucket. The computational cost of BE is exponential with respect to the number of uneliminated variables in the bucket, i.e., its complexity is Here, is called the induced width of given the elimination order , and the minimum possible induced width across all possible is called the treewidth. Furthermore, we remark that summation of GM over variables defined on the subset of vertices , i.e., , can also be computed via BE in time by summing out variables in elimination order on .
Minibucket elimination (MBE, Dechter & Rish, 2003) achieves lower complexity by approximating each step of BE by splitting the computation of each bucket into several smaller “minibuckets”. Formally, for each variable in the elimination order , the bucket is partitioned into minibuckets such that and for any . Next, MBE generates new factors differently from BE as follows:
(2) 
for all and
(3) 
Other steps are equivalent to that of BE. Observe that MBE replaces the exact marginalization of the bucket in (1) by its upper bound, i.e., , and hence yields an upper bound of . We remark that one could instead obtain a lower bound for by replacing by in (2).
Note that one has to choose minibuckets for MBE carefully as their sizes typically determine complexity and accuracy: smaller minibuckets may be better for speed, but worse in accuracy. Accordingly, MBE has an additional induced width bound parameter as the maximal size of a minibucket, i.e., . The time complexity of MBE is since the maximum number of minibuckets is bounded by .
3 MiniBucket Renormalization
We propose a new scheme, named minibucket renormalization (MBR). Our approach approximates BE by splitting each bucket into several smaller minibuckets to be “renormalized”. Inspired by tensor renormalization groups (TRG, Levin & Nave, 2007, see also references therein) in the physics literature, MBR utilizes lowrank approximations to the minibuckets instead of simply applying (or ) operations as in MBE. Intuitively, MBR is similar to MBE but promises better accuracy.
3.1 Algorithm Description
For each variable in the elimination order , MBR partitions a bucket into distinct minibuckets with maximal size bounded by . Then for , minibucket is “renormalized” through replacing vertex by its replicate and then introducing local factors for error compensation, i.e.,
Here, are local “compensating/renormalizing factors”, chosen to approximate the factor well, where MBE approximates it using (2) and (3). See Figure 1 for illustration of the renormalization process. Specifically, the local compensating/renormalizing factors are chosen by solving the following optimization:
(4) 
where is the factor induced on from the renormalized minibucket :
We show that (4) can be solved efficiently in Section 3.2. After all minibuckets are processed, factors can be summed over the variables and separately, i.e., introduce new factors as follows:
(5)  
for and
(6) 
Resulting factors are then added to its corresponding minibucket and repeat until all buckets are processed, like BE and MBE. Here, one can check that
from (4) and MBR indeed approximates BE. The formal description of MBR is given in Algorithm 2.
3.2 Complexity
The optimization (4) is related to the rank approximation on , which can be solved efficiently via (truncated) singular value decomposition (SVD). Specifically, let be a matrix representing as follows:
(7) 
Then rank1 truncated SVD for solves the following optimization:
where optimization is over
dimensional vectors
and denotes the Frobenious norm. Namely, the solution becomes the most significant (left) singular vector of , associated with the largest singular value.^{2}^{2}2 holds without forcing it. Especially, since is a nonnegative matrix, its most significant singular vector is always nonnegative due to the PerronFrobenius theorem (Perron, 1907). By letting and , one can check that this optimization is equivalent to (4), where in fact, , i.e., they share the same values. Due to the above observations, the complexity of (4) is that denotes the complexity of SVD for matrix . Therefore, the overall complexity becomeswhere in general, but typically much faster in the existing SVD solver.
4 GlobalBucket Renormalization
In the previous section, MBR only considers the local neighborhood for renormalizing minibuckets to approximate a single marginalization process of BE. Here we extend the approach and propose globalbucket renormalization (GBR), which incorporates a global perspective. The new scheme reupdates the choice of compensating local factors obtained in MBR by considering factors that were ignored during the original process. In particular, GBR directly minimizes the error in the partition function from each renormalization, aiming for improved accuracy compared to MBR.
4.1 Intuition and KeyOptimization
Renormalized GMs in MBR. For providing the underlying design principles of GBR, we first track an intermediate estimation of the partition function made during MBR by characterizing the corresponding sequence of renormalized GMs. Specifically, we aim for constructing a sequence of GMs with by breaking each th iteration of MBR into steps of GM renormalizations, is the original GM, and each transition from to corresponds to renormalization of some minibucket to . Then, the intermediate estimation for the original partition function at the th step is partition function of where the last one is the output of MBR.
To this end, we introduce “scope” for each factor appearing in MBR to indicate which parts of GM are renormalized at each step. In particular, the scope consists of a graph and set of factors that are associated to as follows:
Initially, scopes associated with initial factors are defined by themselves, i.e.,
(8) 
for each factor of , and others of with are to be defined iteratively under the MBR process, as we describe in what follows.
Consider the th iteration of MBR, where minibucket is being renormalized into . Then scope for all goes through renormalization by replacing every in the scope by as follows:
(9) 
where . Then, the respective GM is renormalized accordingly for the change of scopes, in addition to compensating factors :
(10)  
where and other union of scope components are defined similarly. Finally, scope for newly generated factors is
(11) 
Furthermore, if , we have
(12) 
This is repeated until the MBR process terminates, as formally described in Algorithm 3. By construction, the output of MBR is equal to the partition function of the last GM , which is computable via BE with induced width smaller than given elimination order
(13) 
See Algorithm 3 for the formal description of this process, and Figure 2 for an example.
Optimizing intermediate approximations. Finally, we provide an explicit optimization formulation for minimizing the change of intermediate partition functions in terms of induced factors. Specifically, for each th renormalization, i.e., from to , we consider change of the following factor induced from globalbucket to variable
in a “skewed” manner as follows:
where are the newly introduced “split variables” that are associated with the same vertex , but allowed to have different values for our purpose. Next, the bucket is renormalized into , leading to the induced factor of defined as follows:
Then change in is directly related with change in partition function since and can be described as follows:
Consequently, GBR chooses to minimize the change in by reupdating , i.e., it solves
(14) 
However, we remark that (14) is intractable since its objective is “global”, and requires summation over all variables except one, i.e., , and this is the key difference from (4) which seeks to minimize the error described by the local minibucket. GBR avoids this issue by substituting by its tractable approximation , which is to be described in the following section.
4.2 Algorithm Description
In this section, we provide a formal description of GBR. First, consider the sequence of GMs from interpreting MBR as GM renormalizations. One can observe that this corresponds to choices of compensating factors made at each renormalization, i.e., where for the associated replicate vertex . GBR modifies this sequence iteratively by replacing intermediate choice of compensation by another choice in reversed order, approximately solving (14) until all compensating factors are updated. Then, GBR outputs partition function for as an approximation of the original partition function .
Now we describe the process of choosing new compensating factors at th iteration of GBR by approximately solving (14). To this end, the th choice of compensating factors are expressed as follows:
(15) 
with and already chosen in the previous iteration of GBR. Next, consider sequence of GMs that were generated similarly with GM renormalization corresponding to MBR, but with compensating factors chosen by (15). Observe that the first renormalizations of GM correspond to those of MBR since the updates are done in reverse order, i.e., for . Next, in (4) is expressed as summation over in , defined as follows:
Since resembles the partition function in a way that it is also a summation over GM with small change in a set of factors, i.e., excluding local factors , we expect to be approximated well by a summation over in :
which can be computed in complexity via appying BE in with elimination order as in (13). Then the following optimization is obtained as an approximation of (14):
(16) 
where corresponds to renormalized factor :
As a result, one can expect choosing compensating factors from (16) to improve over that of (4) as long as MBR provides reasonable approximation for . The optimization is again solvable via rank truncated SVD and the overall complexity of GBR is
where is the complexity for performing SVD on matrix representing function as in (7). While the formal description of GBR is conceptually a bit complicated, one can implement it efficiently. Specifically, during the GBR process, it suffices to keep only the description of renormalized GM with an ongoing set of compensating factors, e.g., (15), in order to update compensating factors iteratively by (16). The formal description of the GBR scheme is provided in Algorithm 4.
5 Experimental Results
In this section, we report experimental results on performance of our algorithms for approximating the partition function . Experiments were conducted for Ising models defined on gridstructured and complete graphs as well as two realworld datasets from the UAI 2014 Inference Competition (Gogate, 2014). We compare our minibucket renormalization (MBR) and globalbucket renormalization (GBR) scheme with other minibucket algorithms, i.e., minibucket elimination (MBE) by Dechter & Rish (2003) and weighted minibucket elimination (WMBE) by Liu & Ihler (2011). Here, WMBE used uniform Hölder weights, with additional fixed point reparameterization updates for improving its approximation, as proposed by Liu & Ihler, 2011
. For all minibucket algorithms, we unified the choice of elimination order for each instance of GM by applying minfill heuristics
(Koller & Friedman, 2009). Further, we also run the popular variational inference algorithms: meanfield (MF) and loopy belief propagation (BP), ^{3}^{3}3MF and BP were implemented to run for iterations at maximum. BP additionally used damping with ratio. For performance measure, we use the logpartition function error, i.e., where is the approximated partition function from a respective algorithm.Ising models. We first consider the most popular binary pairwise GMs, called Ising models (Onsager, 1944): where . In our experiments, we draw