1 Introduction
Inference in general discrete graphical models is hard. Besides variational methods, the main approach to inference in such models is to run a Markov chain that in the limit draws samples from the true posterior distribution.
One such Markov chain is given by the so-called Gibbs sampler, which was first introduced by Geman and Geman [4]. In each step, one random variable is resampled given all the others, using the conditional probability distributions in the graphical model. Under mild hypotheses, the Gibbs sampler produces an ergodic Markov chain that converges to the target distribution. The main appeal of the Gibbs sampler lies in its simplicity and ease of implementation. Unfortunately, for highly coupled random variables, mixing of this Markov chain can be prohibitively slow. Moreover, in order to achieve ergodicity, we have to sample one variable after another, yielding an inherently sequential algorithm. However, with the advent of affordable parallel computing hardware in the form of GPUs, it is desirable to have a parallel sampling algorithm. Early attempts tried running all the update steps in parallel
[4]; however, this update schedule usually does not converge to the target distribution [8]. Another common approach to parallel Gibbs sampling is to compute a graph coloring of the underlying graph and then perform Gibbs sampling blockwise [5]. However, it is not always straightforward to find an appropriate graph coloring, and it is hard to maintain such a coloring in a dynamic setup, i.e. when factors are added and removed on a continuous basis.
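For concreteness, both update schedules can be sketched for a small Ising grid (a minimal illustration, assuming ±1 spins with a single coupling strength J and unary field h; this notation is ours, not the parametrization used later in the paper):

```python
import numpy as np

def neighbour_sums(x):
    """Sum of the four grid neighbours of every site (zero padding)."""
    s = np.zeros(x.shape)
    s[1:, :] += x[:-1, :]
    s[:-1, :] += x[1:, :]
    s[:, 1:] += x[:, :-1]
    s[:, :-1] += x[:, 1:]
    return s

def sequential_sweep(x, J, h, rng):
    """Plain Gibbs sampling: resample one +/-1 spin at a time."""
    n, m = x.shape
    for i in range(n):
        for j in range(m):
            s = 0.0
            if i > 0:     s += x[i - 1, j]
            if i < n - 1: s += x[i + 1, j]
            if j > 0:     s += x[i, j - 1]
            if j < m - 1: s += x[i, j + 1]
            p = 1.0 / (1.0 + np.exp(-2.0 * (J * s + h)))  # P(x_ij = +1 | rest)
            x[i, j] = 1 if rng.random() < p else -1
    return x

def chromatic_sweep(x, J, h, rng):
    """Coloring-based parallel Gibbs via a checkerboard 2-coloring.

    Same-color sites share no edge, so each half-sweep resamples an
    entire color class at once."""
    n, m = x.shape
    color = np.add.outer(np.arange(n), np.arange(m)) % 2
    for c in (0, 1):
        p = 1.0 / (1.0 + np.exp(-2.0 * (J * neighbour_sums(x) + h)))
        proposal = np.where(rng.random(x.shape) < p, 1, -1)
        x[color == c] = proposal[color == c]
    return x
```

On the checkerboard coloring, each half-sweep touches a whole color class at once, which is what a GPU implementation would parallelize; the sequential sweep is the baseline it replaces.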
In this paper, we present a simple method for parallelizing a Gibbs sampler that does not require a graph coloring. This is particularly useful in situations where a graph coloring is hard to obtain or the graph topology changes frequently, which would otherwise require maintaining or recomputing the coloring.
2 Related Work
An early attempt at parallelization for Ising models was described by Swendsen and Wang [12]. Their method was generalized to arbitrary probabilistic graphical models in [1]. However, while the Swendsen-Wang algorithm mixes fast for the Ising model with no unary potentials, this need not be the case for general probabilistic graphical models [6, 7]. Higdon [6] presents a method for performing partial Swendsen-Wang updates. However, sampling using Higdon's method requires sampling from a coarser graphical model as a subproblem, which Higdon tackles using conventional sampling methods. Our dualization strategy allows us to circumvent this step, so that only standard cluster-wise sampling as in [12] is required.
[5] describes two ways of parallelizing Gibbs sampling in discrete Markov random fields. The first method relies on computing graph colorings, the second on decomposing the graph into blocks (called splashes) consisting of subgraphs with limited tree width. Both methods are complementary in that graph colorings work well for loosely coupled graphical models, whereas splash sampling works well in the strongly coupled case. However, computing a minimal graph coloring is an NP-hard problem [3], and the number of colors necessary depends on the graph. Moreover, it is hard to maintain a graph coloring in a dynamic setting in which the graph topology is no longer constant. Our approach does not suffer from these issues and requires almost no preprocessing. Moreover, our approach can be combined with splash sampling. Whereas the approach in [5] requires the splashes to be induced subgraphs of the graphical model, our approach allows arbitrary subgraphs of the graphical model to be selected as splashes, making it possible to use splashes containing many variables.
Schmidt et al. show in [10] how Gaussian scale mixtures (GSMs) can be used for efficiently sampling from fields of experts [9]. Our approach is very similar to theirs, but deals with discrete graphical models. Moreover, whereas Schmidt et al. started with a model in a primal-dual formulation and trained it on data, we focus on decomposing existing graphical models and mainly use duality for inference. In fact, both techniques can be subsumed in the framework of exponential family harmoniums [13], which makes it possible to deal with models consisting of both discrete and continuous random variables.
Martens et al. [7] show how to sample from a discrete graphical model using auxiliary variables. Their approach is similar to ours, but relies on computing the (sparse) Cholesky decomposition of a large matrix beforehand. Our approach does not have this issue.
A variant of our algorithm that computes expectations instead of performing sampling corresponds to the mean-field algorithm for junction tree approximations in [14]. We can show that our algorithm minimizes an upper bound to the true mean-field objective. However, whereas the mean-field algorithm in [14] has to recalibrate the tree for each update of a potential, our algorithm updates all the potentials at once with only one run of the junction tree algorithm.
Schwing et al. [11] designed a system that performs belief propagation in a distributed way. Our approach has a similar goal but follows a different strategy: while Schwing et al. achieve parallelism through a convex formulation, we augment the probabilistic model with additional random variables.
3 Probabilistic duality
We first define the notion of duality of random variables. Note that this definition is very similar to the Lagrange functional of a convex optimization problem in convex analysis.
Definition 1.
Let and denote random variables. We call functions and
to a common vector space with some bilinear form link functions. We say
and are dual to each other via if the joint distributions can be written as
with positive functions and .
In the language of [13], a dual pair of random variables is simply an exponential family harmonium. For a pair of link functions and and some real-valued functions, we now define the concept of a transform.
Definition 2.
The transforms of , are defined as
Note the resemblance to the notion of convex conjugacy. The following simple lemma is central to the rest of the theory:
Lemma 1.
Let and be two dually related random variables with joint probability density as above. Then
Formally, this is similar to the notion of duality in convex optimization: we call the problem of sampling from the primal sampling problem and the corresponding problem of sampling from the dual sampling problem. The primal and dual problems are linked to each other via the conditional densities and . Note that both and are in the exponential family.
Another view is that we represent as a mixture of probability distributions from a specified exponential family determined by and . The density of the mixture parameters is then given by .
To obtain a dual formulation of a sampling problem, we have to decompose as
with some functions and . is then the marginal of
We show how this can be done for the discrete case in Section 4.1.
Further evidence can be incorporated by replacing and with
The lemma already suggests a possible strategy for sampling from using a simple Gibbs sampler: first sample from , then from , and so on. As both and are generally high-dimensional, sampling from and from could potentially cause problems. However, we show that and both factorize in Markov random fields for appropriate choices of and , which yields algorithms that are very easy to parallelize.
4 Duality in MRFs
We now take a closer look at sampling in Markov random fields. The Hammersley-Clifford theorem states that the joint probability density of nodes in the MRF can be decomposed as
with some probability measures and normalization constant . In general, the only depend on a small subset of the components of (e.g. the cliques in the MRF). Assume that for every we have a random variable , so that and are dual to each other via with respect to . We can then write
Theorem 1.
Let and , and .
Then is the marginal of
Thus if , and are dual to each other. The marginal distribution of is given by
Proof.
We have
∎
Corollary 1.
and are given by
In particular, factorizes, and if and every component of depends on only one component of , factorizes as well. This holds, for example, for the standard choice .
4.1 Binary pairwise MRFs
We now turn to the special case of a pairwise binary MRF. It turns out that finding a dual representation is equivalent to finding an appropriate factorization of the probability table.
Theorem 2.
Let be proportional to the probability table of two binary random variables and . Assume we are given a factorization with , where both and have strictly positive entries. Let
Then
with
Proof.
This follows from
after a simple calculation. ∎
We will now show how to find such a factorization:
Lemma 2.
If is symmetric and , can be factored in the form , where
Proof.
is well defined because and positive because . Now let
We have
Due to trigonometric considerations
This shows as required. ∎
Remark 1.
For , we have
Lemma 3.
For any ,
is symmetric.
Lemma 4.
If , then
has positive determinant.
In summary, we have found a strictly positive factorization of any strictly positive matrix. Together with Theorem 1, this yields a dual representation for every binary pairwise MRF.
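As a numerical sanity check of this claim (independent of the explicit construction in Lemma 2), one can use the closed form for the principal square root of a 2x2 matrix, which yields a factorization B = A·A with strictly positive entries whenever B is symmetric with positive entries and positive determinant:

```python
import numpy as np

def positive_sqrt_factorization(B):
    """Strictly positive factorization B = A @ A for a symmetric 2x2
    matrix B with positive entries and positive determinant.

    Uses the closed form for the principal square root of a 2x2 matrix:
        A = (B + sqrt(det B) * I) / sqrt(tr B + 2 * sqrt(det B)).
    All entries of A are positive because B's entries and det B are.
    """
    d = np.sqrt(np.linalg.det(B))
    t = np.sqrt(np.trace(B) + 2.0 * d)
    return (B + d * np.eye(2)) / t
```

The identity follows from the Cayley-Hamilton theorem: (B + √det·I)² = B² + 2√det·B + det·I = (tr B + 2√det)·B.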
4.2 General discrete MRFs
When the variables are allowed to have multiple states, we can convert any discrete pairwise MRF into a binary MRF using a one-hot encoding and additional hard constraints that ensure that exactly one binary variable belonging to a random variable in the original MRF has value
. All inference algorithms in this paper therefore generalize to this situation. Dualizing a factor in this way introduces auxiliary binary random variables to the model. Note, however, that no new random variables need to be introduced for entries in the factor. For example, for a Potts factor of order , only auxiliary binary random variables have to be introduced per factor.
Arbitrary discrete MRFs with higher-order factors work as well, as long as we can find an appropriate positive tensor factorization of the probability table. Moreover, it is also possible to perform inference approximately by fitting a mixture of Bernoulli / mixture of Dirichlet distributions to the factors using expectation maximization.
4.3 Swendsen-Wang and local constraints
As it turns out, the Swendsen-Wang algorithm can be seen as a degenerate special case of our formalism for a particular choice of . Moreover, more general local constraint models can potentially be derived from this formalism.
More explicitly, for the Ising model let
where denotes the set of edges and
The Ising factor of the form
can then be decomposed as
where
The primal-dual sampling algorithm then proceeds as follows:
which are just the update rules for the Swendsen-Wang algorithm.
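A hedged sketch of such a Swendsen-Wang sweep for the zero-field Ising grid (the union-find bookkeeping and the ±1 spin convention are implementation choices, not prescribed by the derivation above):

```python
import numpy as np

def swendsen_wang_sweep(x, J, rng):
    """One Swendsen-Wang update for a zero-field Ising grid with +/-1 spins.

    Dual step: open a bond between equal neighbouring spins with
    probability 1 - exp(-2J).  Primal step: assign every resulting
    cluster a fresh uniform +/-1 spin.
    """
    n, m = x.shape
    parent = list(range(n * m))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    p_bond = 1.0 - np.exp(-2.0 * J)
    for i in range(n):
        for j in range(m):
            if i + 1 < n and x[i, j] == x[i + 1, j] and rng.random() < p_bond:
                parent[find(i * m + j)] = find((i + 1) * m + j)
            if j + 1 < m and x[i, j] == x[i, j + 1] and rng.random() < p_bond:
                parent[find(i * m + j)] = find(i * m + j + 1)

    spin = {}  # one independent uniform spin per cluster root
    for i in range(n):
        for j in range(m):
            r = find(i * m + j)
            if r not in spin:
                spin[r] = 1 if rng.random() < 0.5 else -1
            x[i, j] = spin[r]
    return x
```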
The partial Swendsen-Wang method by Higdon [6] can be regarded as a decomposition of the form
This leads to the method described by Higdon, where we are left with sampling from a coarser Ising model. Applying a factorization as in Section 4.1 to the first term enables us to circumvent this step, so that all clusters can be sampled independently (the latent variables then have different states).
Similarly, the generalization of the Swendsen-Wang algorithm in [1] can be regarded as a multiplicative decomposition of the form
where we use the operator to indicate componentwise multiplication. Applying the decomposition above to the first factor yields (a variant of) the method in [1]. We can use our method to further decompose , allowing all clusters to be updated in parallel.
5 Inference
5.1 Sampling
Having a primal-dual representation of the form
we can sample from (and thereby from ) by blockwise Gibbs sampling, i.e.
For discrete pairwise MRFs with the dual representations described above, both distributions factorize, so that sampling can be done in parallel, e.g. on the GPU. Effectively, we have converted our model to a restricted Boltzmann machine.
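Since the dualized model is effectively a restricted Boltzmann machine, the blockwise sampler can be sketched as follows (a minimal illustration with a generic weight matrix W and biases b, c; these stand in for whatever parametrization the dualization produces):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def block_gibbs_step(v, W, b, c, rng):
    """One blockwise Gibbs step in a binary RBM with visibles v, hiddens h.

    All hidden units are conditionally independent given the visibles and
    vice versa, so each half-step samples an entire layer in parallel.
    """
    h = (rng.random(c.shape) < sigmoid(v @ W + c)).astype(int)
    v = (rng.random(b.shape) < sigmoid(h @ W.T + b)).astype(int)
    return v, h
```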
5.2 Estimation of the log-partition function
The logarithm of the normalization constant of an unnormalized probability distribution, the so-called log-partition function
, is important for model selection and related tasks. In this section, we provide a simple estimator for this quantity that can be used for any dual pair of random variables.
The following defines an unbiased estimator for the partition function
:
where and are the unnormalized probability distributions. Indeed, we have
In terms of and , this can be written as
Note that
where is the mutual information between and . This is a measure of the uncertainty of
, as it can be interpreted as a generalized variance for the convex function
. This also implies that the expectation of
yields a lower bound to the log-partition function. In practice, has too much variance to be useful. Therefore, we estimate the expectation of , which yields a lower bound to the log-partition function.
Example 1.
For the Swendsen-Wang duality for the Ising model, we have , and therefore
where is the number of clusters defined by . Therefore
where is the unnormalized distribution of the Ising model.
A natural question to ask is how this estimator relates to the estimate obtained by running naive mean-field on the primal distribution only. The following negative result shows that in most cases the estimate obtained by mean-field approximations is preferable:
Lemma 5.
We have
(1) 
where we define .
Proof.
The equality in (1) can be obtained by a straightforward calculation. The inequality is a simple application of the fact that the expectation over is always at least the minimum over . ∎
Note, however, that it is not always straightforward to find that minimizes . This is, for example, the case for the Swendsen-Wang representation from Example 1.
5.3 MAP and mean-field inference
The concept of probabilistic duality that we introduced in Section 3 is also useful for deriving parallel MAP and mean-field inference algorithms.
By applying EM to , we can also compute local MAP assignments to in parallel. The updates read
Similarly, we can compute mean-field assignments to using the updates
where the expectations are taken over the distributions
Note that these updates have the advantage over ICM and standard naive mean field that they can again run in parallel and still have convergence guarantees.
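In the RBM view of the dualized model, these parallel mean-field updates amount to alternating fully parallel sigmoid updates of the two layers' mean parameters (again a sketch with generic parameters W, b, c rather than the paper's exact potentials):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def parallel_mean_field(W, b, c, n_iter=100):
    """Naive mean-field iteration for a binary RBM.

    Each half-step updates all means of one layer at once given the
    current means of the other layer."""
    mu_v = np.full(b.shape, 0.5)
    mu_h = np.full(c.shape, 0.5)
    for _ in range(n_iter):
        mu_h = sigmoid(mu_v @ W + c)    # all hidden means in parallel
        mu_v = sigmoid(mu_h @ W.T + b)  # all visible means in parallel
    return mu_v, mu_h
```

With W = 0 the layers decouple and the fixed point is exact, which makes a convenient correctness check.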
Using this algorithm, we can show that we minimize an upper bound to the true mean-field objective:
Lemma 6.
We have
(2) 
Proof.
We have
∎
Lemma 6 implies that traditional mean-field updates are still preferable to the ones from our method. Indeed, in practice we found that our method can lead to poor approximations in the presence of many factors. However, it is still possible to first run our fast parallel algorithm and then fine-tune the result using traditional mean-field updates.
5.4 Blocking
Gibbs sampling as described in Section 5.1 can still be prohibitively slow in the presence of strongly correlated random variables. Similarly, EM and mean-field updates tend to get stuck in local optima in this situation. This problem also occurs for standard Gibbs sampling, ICM and naive mean-field updates. A common way out is to introduce blocking, e.g. as in [5] for Gibbs sampling.
Unfortunately, for traditional algorithms, blocking is only possible with respect to induced subgraphs. As it turns out, our primal-dual decomposition allows blocking with respect to arbitrary subgraphs. This is illustrated in Figure 1. The idea is to decompose the dual variables into two subsets and , so that is tractable. This is the case when is tractable, because then
Note that is tractable, if the graph obtained by removing all the factors belonging to has low treewidth.
For blocked Gibbs sampling, we sample in each step and then . As a variation of this process, we can vary the decomposition of into and in each step.
When we perform max-product belief propagation for and take expectations for , we obtain a new inference algorithm for MAP inference. Note that in each step, we maximize over all variables at once. Similarly, we obtain a probabilistic inference algorithm by replacing the max-product belief propagation step for with sum-product belief propagation.
Note also that both the EM and the mean-field algorithm are guaranteed to increase the objective function in each step.
In this framework, the standard sequential Gibbs sampler can also be interpreted as a blocked Gibbs sampler, where blocking is performed with respect to one primal and all neighboring dual variables. Unfortunately, as blocking generally improves mixing of a Gibbs chain [5], this implies that the standard sequential Gibbs sampler has better mixing properties than the parallel primal-dual sampling algorithm. Still, the primal-dual formulation allows for more flexible blocking schemes, potentially making it possible to improve on the mixing properties of the standard sequential Gibbs sampler in some situations.
6 Experimental Results
We tested our method on synthetic graphical models. The first model consists of an Ising grid with coupling strengths ranging from to . Even though the Ising grid is two-colorable, so that it is trivial to implement a parallel Gibbs sampler in this setting, this is no longer possible when the graph topology is dynamic, i.e. we remove and add factors from time to time. Maintaining a coloring in this setting is itself a hard problem. The second model is given by a random graph with variables and factors, where
. Both the unary and pairwise log-potentials were sampled from a normal distribution with mean
and a standard deviation of
. The last model consists of a fully connected Ising model of variables with coupling strengths ranging from to . Note that for such models, there is an algorithm that computes the partition function and the marginals in polynomial time [2]. However, this is no longer the case when the potentials have varying coupling strengths. For all these models, we compute the potential scale reduction factor (PSRF) for both a sequential Gibbs sampler and our primal-dual sampler by running Markov chains in parallel. From the PSRF, we compute an estimate of the mixing time of the Markov chain by taking the first index after which the PSRF remains below some specified threshold.
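The PSRF referred to here is the Gelman-Rubin statistic; for scalar summaries of the chains it can be computed as follows (standard textbook formula, not tied to our exact implementation):

```python
import numpy as np

def psrf(chains):
    """Potential scale reduction factor (Gelman-Rubin R-hat) for m
    parallel chains of n scalar samples each, given as an (m, n) array.
    Values close to 1 indicate that the chains have mixed."""
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)  # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()    # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n        # pooled variance estimate
    return np.sqrt(var_hat / W)
```

The mixing-time estimate is then the first sweep index after which this value stays below the chosen threshold.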
The result for the Ising grid is shown in Figure 1(a). For both the primal-dual sampler and the sequential Gibbs sampler, we plot the number of sweeps over the whole grid needed to achieve a PSRF below . As expected, both the sequential Gibbs sampler and the primal-dual sampler mix more slowly as we increase the coupling strength. Moreover, even though the primal-dual sampler mixes more slowly than the sequential Gibbs sampler, in our experiments the ratio of the mixing times was between and for all coupling strengths. Therefore, even though a Gibbs sampler based on a two-coloring is preferable in the static setting, our primal-dual sampler becomes a viable alternative in the dynamic setting.
Similar results were obtained for the random graphs. As expected, mixing of the primal-dual sampler became worse as the number of factors per vertex increased. While our primal-dual sampler can be an interesting alternative when the factor-to-vertex ratio is low (e.g. ), we do not recommend our method for models with many more factors than variables unless these factors are very weak.
The result for the fully connected Ising model is shown in Figure 1(b). As there is no coloring available for a fully connected graphical model, we compare the number of full sweeps of our primal-dual sampler against the number of single-site updates of the sequential Gibbs sampler. We see that our method leads to improved mixing in this setting.
7 Conclusion
We have introduced a new concept of duality for random variables and showed its usefulness for inference in probabilistic graphical models. In particular, we demonstrated how to obtain a highly parallelizable Gibbs sampler. Even though this parallel Gibbs sampler has inferior mixing properties compared to the sequential Gibbs sampler, we believe that it can still be very useful in settings where a good graph coloring is hard to obtain or the graph topology changes frequently. Possible extensions of our approach include good algorithms for selecting appropriate subgraphs for blocking. Moreover, as primal-dual representations are not unique, we believe that further progress can be made by deriving new decompositions. Another line of research is to generalize our ideas to higher-order factors, both exactly and approximately. We believe that this is possible and would allow the methods in this paper to be applied to arbitrary discrete graphical models.
Acknowledgements
This work was supported by Microsoft Research through its PhD Scholarship Programme.
References

[1] Adrian Barbu and Song-Chun Zhu. Generalizing Swendsen-Wang to sampling arbitrary posterior probabilities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1239–1253, 2005.
[2] Boris Flach. A class of random fields on complete graphs with tractable partition function. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2304–2306, 2013.
[3] Michael R. Garey, David S. Johnson, and Larry Stockmeyer. Some simplified NP-complete problems. In Proceedings of the Sixth Annual ACM Symposium on Theory of Computing, pages 47–63. ACM, 1974.
[4] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.
[5] Joseph Gonzalez, Yucheng Low, Arthur Gretton, and Carlos Guestrin. Parallel Gibbs sampling: From colored fields to thin junction trees. In AISTATS, volume 15, pages 324–332, 2011.
[6] David M. Higdon. Auxiliary variable methods for Markov chain Monte Carlo with applications. Journal of the American Statistical Association, 93(442):585–595, 1998.
[7] James Martens and Ilya Sutskever. Parallelizable sampling of Markov random fields. In AISTATS, pages 517–524, 2010.
[8] David Newman, Padhraic Smyth, Max Welling, and Arthur U. Asuncion. Distributed inference for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 1081–1088, 2007.
[9] Stefan Roth and Michael J. Black. Fields of experts. International Journal of Computer Vision, 82(2):205–229, 2009.
[10] Uwe Schmidt, Qi Gao, and Stefan Roth. A generative perspective on MRFs in low-level vision. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 1751–1758. IEEE, 2010.
[11] Alexander Schwing, Tamir Hazan, Marc Pollefeys, and Raquel Urtasun. Distributed message passing for large scale graphical models. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1833–1840. IEEE, 2011.
[12] Robert H. Swendsen and Jian-Sheng Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58(2):86, 1987.
[13] Max Welling, Michal Rosen-Zvi, and Geoffrey E. Hinton. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems, pages 1481–1488, 2004.
[14] Wim Wiegerinck. Variational approximations between mean field theory and the junction tree algorithm. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pages 626–633. Morgan Kaufmann Publishers Inc., 2000.