Graphical Models (GMs) express the factorization of the joint multivariate probability distribution over subsets of variables via graphical relations among them. They have played an important role in many fields, including computer vision(Freeman et al., 2000), speech recognition (Bilmes, 2004), social science (Scott, 2017)
and deep learning(Hinton & Salakhutdinov, 2006). Given a GM, computing the partition function (the normalizing constant) is the essence of other statistical inference tasks such as marginalization and sampling. The partition function can be calculated efficiently in tree-structured GMs through an iterative (dynamic programming) algorithm eliminating, i.e. summing up, variables sequentially. In principle, the elimination strategy extends to arbitrary loopy graphs, but the computational complexity is exponential in the tree-width, e.g., the junction-tree method (Shafer & Shenoy, 1990). Formally, the computation task is #P-hard even to approximate (Jerrum & Sinclair, 1993).
Variational approaches are often the most popular practical choice for approximate computation of the partition function. They map the counting problem into an approximate optimization problem stated over a polynomial (in the graph size) number of variables. The optimization is typically solved iteratively via a message-passing algorithm, e.g., mean-field (Parisi, 1988), belief propagation (Pearl, 1982), tree-reweighted (Wainwright et al., 2005), or gauges and/or re-parametrizations (Ahn et al., 2017, 2018 (accepted to appear)
. Lack of accuracy control and difficulty in forcing convergence in an acceptable number of steps are, unfortunately, typical for hard GM instances. Markov chain Monte Carlo methods (e.g., seeAlpaydin, 2014) are also popular to approximate the partition function, but typically suffer, even more than variational methods, from slow convergence/mixing.
Approximate elimination is a sequential method to estimate the partition function. Each step consists of summation over variables followed by (or combined with) approximation of the resulting complex factors. Notable flavors of this method include truncation of the Fourier coefficients(Xue et al., 2016), approximation by random mixtures of rank- tensors (Wrigley et al., 2017), and arguably the most popular, elimination over mini-buckets (Dechter & Rish, 2003; Liu & Ihler, 2011). One advantage of the mini-bucket elimination approach is the ability to control the trade-off between computational complexity and approximation quality by adjusting an induced-width parameter. Note that analogous control in variational methods, such as varying region sizes in generalized belief propagation (Yedidia et al., 2001), typically results in much more complicated optimization formulations to solve. Another important advantage of mini-bucket elimination is that it is always guaranteed to terminate and, usually, it does so quickly. This is in contrast to iterative message-passing implementations of variational methods which can be notoriously slow on difficult instances.
Contribution. We improve the approximation quality of mini-bucket methods using tensor network and renormalization group approaches from statistical physics. In this regard, our method extends a series of recent papers exploring multi-linear tensor network transformations/contractions (Novikov et al., 2014; Wrigley et al., 2017; Ahn et al., 2017, 2018 (accepted to appear). More generally, tensor network renormalization algorithms (Levin & Nave, 2007; Evenbly & Vidal, 2015) have been proposed in the quantum and statistical physics literature for estimating partition functions. The algorithms consist of coarse-graining the graph/network by contracting sub-graphs/networks using a low-rank projection as a subroutine. However, the existing renormalization methods in the physics literature have focused primarily on a restricted class of tensor-factorized models over regular grids/lattices,111 The special models are related to what may be called Forney-style grids/lattices (Forney, 2001) in the GM community. while factor-graph models (Clifford, 1990) over generic graphical structures are needed in most machine learning applications.
For generalizing them to factor-graph models, one would face at two challenges: (a) coarse-graining of the tensor network relies on the periodic structure of grid/lattices and (b) its low-rank projections are only defined on “edge variables” that allows only two adjacent factors. To overcome them, we first replace the coarse-graining step by sequential elimination of the mini-bucket algorithms, and then use the strategy of “variable splitting” in order to generate auxiliary edge variables. Namely, we combine ideas from tensor network renormalization and the mini-bucket schemes where one is benefical to the other. We propose two algorithms, which we call MBR and GBR:
(MBR) consists of sequentially splitting summation over the current (remaining) set of variables into subsets – multiple mini-buckets which are then “renormalized”. We show that this process is, in fact, equivalent to applying low-rank projections on the mini-buckets to approximate the variable-elimination process, thus resulting in better approximation than the original mini-bucket methods. In particular, we show how to resolve approximate renormalization locally and efficiently through application of truncated singular value decomposition (SVD) over small matrices.
While MBR is based on a sequence of local low-rank approximations applied to the mini-buckets, global-bucket renormalization (GBR) extends MBR by approximating mini-buckets globally. This is achieved by first applying MBR to mini-buckets, then calibrating the choice of low rank projections by minimizing the partition function approximation error with respect to renormalization of the “global-bucket”. Hence, GBR takes additional time to run but may be expected to yield better accuracy.
Both algorithms are easily applicable to arbitrary GMs with interactions (factors) of high orders, hyper-graphs and large alphabets. We perform extensive experiments on synthetic (Ising models on complete and grid graphs) and real-world models from the UAI dataset. In our experiments, both MBR and GBR show performance superior to other state-of-the-art elimination and variational algorithms.
Graphical model. Consider a hyper-graph with vertices and hyper-edges . A graphical model (GM) associates a collection of discrete random variables with the following joint probability distribution:
where , , is a set of non-negative functions called factors, and is the normalizing constant called the partition function that is computationally intractable.
Mini-bucket elimination. Bucket (or variable) elimination (BE, Dechter, 1999; Koller & Friedman, 2009) is a procedure for computing the partition function exactly based on sequential elimination of variables. Without loss of generality, we assume through out the paper that the elimination order is fixed . BE groups factors by placing each factor in the “bucket” of its earliest argument appearing in the elimination order . Next, BE eliminates the variable by introducing a new factor marginalizing the product of factors in it, i.e.,
Here, abbreviates , where indicates the set of variables associated with the bucket . The subscript in represents a similar abbreviation. Finally, the new function is added to another bucket corresponding to its earliest argument in the elimination order. Formal description of BE is given in Algorithm 1.
One can easily check that BE applies a distributive property for computing exactly: groups of factors corresponding to buckets are summed out sequentially, and then the newly generated factor (without the eliminated variable) is added to another bucket. The computational cost of BE is exponential with respect to the number of uneliminated variables in the bucket, i.e., its complexity is Here, is called the induced width of given the elimination order , and the minimum possible induced width across all possible is called the tree-width. Furthermore, we remark that summation of GM over variables defined on the subset of vertices , i.e., , can also be computed via BE in time by summing out variables in elimination order on .
Mini-bucket elimination (MBE, Dechter & Rish, 2003) achieves lower complexity by approximating each step of BE by splitting the computation of each bucket into several smaller “mini-buckets”. Formally, for each variable in the elimination order , the bucket is partitioned into mini-buckets such that and for any . Next, MBE generates new factors differently from BE as follows:
for all and
Other steps are equivalent to that of BE. Observe that MBE replaces the exact marginalization of the bucket in (1) by its upper bound, i.e., , and hence yields an upper bound of . We remark that one could instead obtain a lower bound for by replacing by in (2).
Note that one has to choose mini-buckets for MBE carefully as their sizes typically determine complexity and accuracy: smaller mini-buckets may be better for speed, but worse in accuracy. Accordingly, MBE has an additional induced width bound parameter as the maximal size of a mini-bucket, i.e., . The time complexity of MBE is since the maximum number of mini-buckets is bounded by .
3 Mini-Bucket Renormalization
We propose a new scheme, named mini-bucket renormalization (MBR). Our approach approximates BE by splitting each bucket into several smaller mini-buckets to be “renormalized”. Inspired by tensor renormalization groups (TRG, Levin & Nave, 2007, see also references therein) in the physics literature, MBR utilizes low-rank approximations to the mini-buckets instead of simply applying (or ) operations as in MBE. Intuitively, MBR is similar to MBE but promises better accuracy.
3.1 Algorithm Description
For each variable in the elimination order , MBR partitions a bucket into distinct mini-buckets with maximal size bounded by . Then for , mini-bucket is “renormalized” through replacing vertex by its replicate and then introducing local factors for error compensation, i.e.,
Here, are local “compensating/renormalizing factors”, chosen to approximate the factor well, where MBE approximates it using (2) and (3). See Figure 1 for illustration of the renormalization process. Specifically, the local compensating/renormalizing factors are chosen by solving the following optimization:
where is the factor induced on from the renormalized mini-bucket :
Resulting factors are then added to its corresponding mini-bucket and repeat until all buckets are processed, like BE and MBE. Here, one can check that
The optimization (4) is related to the rank- approximation on , which can be solved efficiently via (truncated) singular value decomposition (SVD). Specifically, let be a matrix representing as follows:
Then rank-1 truncated SVD for solves the following optimization:
where optimization is over
-dimensional vectorsand denotes the Frobenious norm. Namely, the solution becomes the most significant (left) singular vector of , associated with the largest singular value.222 holds without forcing it. Especially, since is a non-negative matrix, its most significant singular vector is always non-negative due to the Perron-Frobenius theorem (Perron, 1907). By letting and , one can check that this optimization is equivalent to (4), where in fact, , i.e., they share the same values. Due to the above observations, the complexity of (4) is that denotes the complexity of SVD for matrix . Therefore, the overall complexity becomes
where in general, but typically much faster in the existing SVD solver.
4 Global-Bucket Renormalization
In the previous section, MBR only considers the local neighborhood for renormalizing mini-buckets to approximate a single marginalization process of BE. Here we extend the approach and propose global-bucket renormalization (GBR), which incorporates a global perspective. The new scheme re-updates the choice of compensating local factors obtained in MBR by considering factors that were ignored during the original process. In particular, GBR directly minimizes the error in the partition function from each renormalization, aiming for improved accuracy compared to MBR.
4.1 Intuition and Key-Optimization
Renormalized GMs in MBR. For providing the underlying design principles of GBR, we first track an intermediate estimation of the partition function made during MBR by characterizing the corresponding sequence of renormalized GMs. Specifically, we aim for constructing a sequence of GMs with by breaking each -th iteration of MBR into steps of GM renormalizations, is the original GM, and each transition from to corresponds to renormalization of some mini-bucket to . Then, the intermediate estimation for the original partition function at the -th step is partition function of where the last one is the output of MBR.
To this end, we introduce “scope” for each factor appearing in MBR to indicate which parts of GM are renormalized at each step. In particular, the scope consists of a graph and set of factors that are associated to as follows:
Initially, scopes associated with initial factors are defined by themselves, i.e.,
for each factor of , and others of with are to be defined iteratively under the MBR process, as we describe in what follows.
Consider the -th iteration of MBR, where mini-bucket is being renormalized into . Then scope for all goes through renormalization by replacing every in the scope by as follows:
where . Then, the respective GM is renormalized accordingly for the change of scopes, in addition to compensating factors :
where and other union of scope components are defined similarly. Finally, scope for newly generated factors is
Furthermore, if , we have
This is repeated until the MBR process terminates, as formally described in Algorithm 3. By construction, the output of MBR is equal to the partition function of the last GM , which is computable via BE with induced width smaller than given elimination order
Optimizing intermediate approximations. Finally, we provide an explicit optimization formulation for minimizing the change of intermediate partition functions in terms of induced factors. Specifically, for each -th renormalization, i.e., from to , we consider change of the following factor induced from global-bucket to variable
in a “skewed” manner as follows:
where are the newly introduced “split variables” that are associated with the same vertex , but allowed to have different values for our purpose. Next, the bucket is renormalized into , leading to the induced factor of defined as follows:
Then change in is directly related with change in partition function since and can be described as follows:
Consequently, GBR chooses to minimize the change in by re-updating , i.e., it solves
However, we remark that (14) is intractable since its objective is “global”, and requires summation over all variables except one, i.e., , and this is the key difference from (4) which seeks to minimize the error described by the local mini-bucket. GBR avoids this issue by substituting by its tractable approximation , which is to be described in the following section.
4.2 Algorithm Description
In this section, we provide a formal description of GBR. First, consider the sequence of GMs from interpreting MBR as GM renormalizations. One can observe that this corresponds to choices of compensating factors made at each renormalization, i.e., where for the associated replicate vertex . GBR modifies this sequence iteratively by replacing intermediate choice of compensation by another choice in reversed order, approximately solving (14) until all compensating factors are updated. Then, GBR outputs partition function for as an approximation of the original partition function .
Now we describe the process of choosing new compensating factors at -th iteration of GBR by approximately solving (14). To this end, the -th choice of compensating factors are expressed as follows:
with and already chosen in the previous iteration of GBR. Next, consider sequence of GMs that were generated similarly with GM renormalization corresponding to MBR, but with compensating factors chosen by (15). Observe that the first renormalizations of GM correspond to those of MBR since the updates are done in reverse order, i.e., for . Next, in (4) is expressed as summation over in , defined as follows:
Since resembles the partition function in a way that it is also a summation over GM with small change in a set of factors, i.e., excluding local factors , we expect to be approximated well by a summation over in :
where corresponds to renormalized factor :
As a result, one can expect choosing compensating factors from (16) to improve over that of (4) as long as MBR provides reasonable approximation for . The optimization is again solvable via rank- truncated SVD and the overall complexity of GBR is
where is the complexity for performing SVD on matrix representing function as in (7). While the formal description of GBR is conceptually a bit complicated, one can implement it efficiently. Specifically, during the GBR process, it suffices to keep only the description of renormalized GM with an ongoing set of compensating factors, e.g., (15), in order to update compensating factors iteratively by (16). The formal description of the GBR scheme is provided in Algorithm 4.
5 Experimental Results
In this section, we report experimental results on performance of our algorithms for approximating the partition function . Experiments were conducted for Ising models defined on grid-structured and complete graphs as well as two real-world datasets from the UAI 2014 Inference Competition (Gogate, 2014). We compare our mini-bucket renormalization (MBR) and global-bucket renormalization (GBR) scheme with other mini-bucket algorithms, i.e., mini-bucket elimination (MBE) by Dechter & Rish (2003) and weighted mini-bucket elimination (WMBE) by Liu & Ihler (2011). Here, WMBE used uniform Hölder weights, with additional fixed point reparameterization updates for improving its approximation, as proposed by Liu & Ihler, 2011
. For all mini-bucket algorithms, we unified the choice of elimination order for each instance of GM by applying min-fill heuristics(Koller & Friedman, 2009). Further, we also run the popular variational inference algorithms: mean-field (MF) and loopy belief propagation (BP), 333MF and BP were implemented to run for iterations at maximum. BP additionally used damping with ratio. For performance measure, we use the log-partition function error, i.e., where is the approximated partition function from a respective algorithm.
Ising models. We first consider the most popular binary pairwise GMs, called Ising models (Onsager, 1944): where . In our experiments, we draw