# A belief propagation algorithm based on domain decomposition

This note provides a detailed description and derivation of the domain decomposition algorithm that appears in previous works by the author. Given a large re-estimation problem, domain decomposition provides an iterative method for assembling Boltzmann distributions associated to small subproblems into an approximation of the Bayesian posterior of the whole problem. The algorithm is amenable to using Boltzmann sampling to approximate these Boltzmann distributions. In previous work, we have shown the capability of heuristic versions of this algorithm to solve LDPC decoding and circuit fault diagnosis problems too large to fit on quantum annealing hardware used for sampling. Here, we rigorously prove soundness of the method.

## Authors

• 4 publications
• ### Assessment of image generation by quantum annealer

Quantum annealing was originally proposed as an approach for solving com...
03/15/2021 ∙ by Takehito Sato, et al. ∙ 0

• ### Benchmarking Quantum Hardware for Training of Fully Visible Boltzmann Machines

Quantum annealing (QA) is a hardware-based heuristic optimization and sa...
11/14/2016 ∙ by Dmytro Korenkevych, et al. ∙ 0

• ### Bayesian machine learning for Boltzmann machine in quantum-enhanced feature spaces

Bayesian learning is ubiquitous for implementing classification and regr...
12/20/2019 ∙ by Yusen Wu, et al. ∙ 0

• ### Optimal Decomposition of Belief Networks

In this paper, optimum decomposition of belief networks is discussed. So...
03/27/2013 ∙ by Wilson X. Wen, et al. ∙ 0

• ### Restricted Boltzmann Machine Assignment Algorithm: Application to solve many-to-one matching problems on weighted bipartite graph

In this work an iterative algorithm based on unsupervised learning is pr...
04/30/2019 ∙ by Francesco Curia, et al. ∙ 0

• ### Leveraging Adiabatic Quantum Computation for Election Forecasting

Accurate, reliable sampling from fully-connected graphs with arbitrary c...
01/30/2018 ∙ by Maxwell Henderson, et al. ∙ 0

• ### Annealing Gaussian into ReLU: a New Sampling Strategy for Leaky-ReLU RBM

Restricted Boltzmann Machine (RBM) is a bipartite graphical model that i...
11/11/2016 ∙ by Chun-Liang Li, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

In our previous work bian2014discrete ; bian2016mapping , we explored methods to decompose constrained optimization problems into subproblems that could be solved on quantum annealing hardware, where the individual solutions can then be reassembled into a solution of the whole problem. We have used the term domain decomposition to represent any of a number of heuristic belief propagation-like algorithms to do this. In bian2014discrete

, we created a variant of the min-sum algorithm to solve LDPC decoding problems with more variables than qubits available on the available hardware. In

bian2016mapping , we created a sum-product style algorithm that solved fault diagnosis problems too large to fit on the hardware directly.

While ultimately successful in solving these problems, the versions of the domain decomposition method used in those works was heuristic: there was no guarantee that the object formed by reassembling the solutions on each subproblem was in any way related to the original problem. The goal of this note is provide a mathematical proof of the soundness of a specific domain decomposition algorithm (Algorithm 1) formed by balancing “free energy flow” between subproblems. In §2 we review the well known relationship between the Boltzmann distribution, Helmholtz free energy, and belief propagation algorithm. In §3 we develop the notion of a regional approximate free energy analogous to the Bethe free energy. In §4 we examine the free energy of a single region and show that when the free energy between regions is balanced, a critical point for the regional approximate free energy is obtained (Theorem 3). We then given an explicit statement for the domain decomposition algorithm (Algorithm 1) and prove that a stationary point for this algorithm is such a critical point (Theorem 5).

Like belief propagation there is no guarantee that the domain decomposition algorithm converges. In the case of belief propagation failure to converge is often attributed to a lack of convexity of the Bethe approximate free energy stemming from loops in the factor graph. Several sufficient conditions for convergence have been based on this observation tatikonda2002loopy ; heskes2003stable ; heskes2004uniqueness ; mooij2007sufficient ; watanabe2009graph ; watanabe2011uniqueness , and other forms of generalized belief propagation algorithms have been developed from convex bounds or treating convex and concave regions separately heskes2002approximate ; teh2002unified ; yuille2002cccp ; heskes2003approximate ; globerson2007approximate . While the algorithm we propose here is a significant departure from these methods, it is possible that one or more of these techniques could apply. We leave such explorations for future work.

Algorithm 1 relies on obtaining marginals of Boltzmann distributions along certain “boundary” variables. While the intent was to empirically estimate these using a dedicated device that produces Boltzmann samples, these marginals could also be computed by any number of other means such as simulated annealing or even belief propagation. It would be interesting to see if domain decomposition using—say—simulated annealing to compute these marginals could outperform belief propagation by carefully designing regions that internalize many of the loops in the factor graph, and so reduce the nonconvexity of the free energy approximation.

## Ii Free energy and belief propagation

Belief propagation pearl2014probabilistic

is an iterative method to approximate posterior marginal distributions of a constrained optimization problem given observations. The most common version of belief propagation, the sum-product algorithm, can derived in a straightforward manner by carefully tracking conditional probabilities and making suitable independence assumptions; for a concise presentation see

kschischang2001factor . Such algorithms, including various variants of belief propagation, can be viewed as a type of generalized distributive law on semirings aji2000generalized .

Our starting point is the variational approach to Bayesian re-estimation, yedidia2000generalized ; aji2001generalized ; yedidia2001bethe ; pakzad2002belief ; yedidia2003understanding ; yedidia2005constructing : the Bayesian posterior distribution is a critical point of a Helmholtz free energy constructed from the given problem. Specifically, given a space of configurations, let us write our target probability as a Gibbs state based on the product of problem dependent factors

 p0(x)=1Zm∏α=1fα(x)=1Z(T)e−∑mα=1Eα(x)/kT, (1)

where each factor term contributes an energy to the total energy

. The Helmholtz free energy of an ensemble with probability distribution

is

 A(p)=U(p)−TH(p)=∑x∈Ωp(x)E(x)+kT∑x∈Ωp(x)logp(x).

From (1) above one can write

 E(x)=m∑α=1Eα(x)=−kT(logp0(x)+logZ(T)).

Inserting this into the above equation,

 A(p) = = −kTlogZ(T)+kT∑xp(x)log(p(x)p0(x)) = −kTlogZ(T)+kTDKL(p∥p0).

, with equality if and only , and therefore the posterior is the global minimum of this Helmholtz free energy.

Decomposing the Helmholtz free energy as

 A(p)=U(p)−TH(p)=∑x∈Ωm∑α=1p(x)Eα(x)+kT∑x∈Ωp(x)logp(x),

one sees each factor contributes an energy term to the internal energy. Let us assume that the energies are “local.” We do not want to be too formal about this; simply, we assume the configurations in our space all have the form and that each does not depend on all the variables , but rather only a few of them. Let us introduce the notation

• to mean (through ) depends nontrivially on ,

• for the support of , and

• for the domain of the vector

.

The marginal distribution with respect to one of these supports is (the notation refers to the “belief” and hence “belief propagation”). The internal energy then has the form

 ∑x∈Ωp(x)Eα(x)=∑xα∈Ωαbα(xα)Eα(xα).

That is, the internal energy is linear in the sense that

 U(p)=∑x∈Ωm∑α=1p(x)Eα(x)=m∑α=1∑xα∈Ωαbα(xα)Eα(xα)=m∑α=1Uα(bα).

At this point it is natural to disassociate the beliefs with the marginals of and pose a local free energy associated to each factor (for which we will reuse the name for its argument)

 Aα(bα)=Uα(bα)−TH(bα)=∑xαbα(xα)Eα(xα)+kT∑xαbα(xα)logbα(xα).

The main problem lies in “localizing” the entropy term. The sum of the factors’ local free energies does not recover the Helmholtz free energy partially because of the nonlinearity inherent in the entropy, but mostly because one has grossly over-counted entropy contributions from factors sharing common variables. To illustrate this by a simple example, suppose and is uniform; form the marginals

 b1(x2,x3)=∑x1∈Ω1p(x1,x2,x3) and b2(x1,x3)=∑x2∈Ω2p(x1,x2,x3),

which are also uniform. Then

 H(p)=klog|Ω|=k(log|Ω1|+log|Ω2|+log|Ω3|).

and

 H(b1)=k(log|Ω2|+log|Ω3|) and H(b2)=k(log|Ω1|+log|Ω3|).

Therefore over-counts the entropy by simply because in the support of each belief.

The Bethe approximation overcomes this failure by correcting the entropy count at each variable. For each variable let us write , the number of factors involving this variable. Removing extra entropy through over counting gives

 ABethe = m∑α=1[∑xα∈Ωαbα(xα)Eα(xα)+kT∑xα∈Ωαbα(xα)logbα(xα)] (2) + kTn∑j=1⎛⎝(1−Cj)⋅∑xj∈Ωjbj(xj)logbj(xj)⎞⎠.

There is some freedom to use different weights to correct the entropy contributions, which leads to variants of the sum-product algorithm wiegerinck2003fractional ; weller2015bethe . We will only consider the Bethe approximation as given above.

Here, the “ensemble” of the Bethe approximate free energy is a collection of beliefs , one for each factor and variable. The target posterior distribution minimizes the Helmholtz free energy, so it is not unreasonable to attempt to approximate this posterior with the minimum of the Bethe approximation. However, the minimum of the Helmholtz free energy is taken over global probability distributions, while the Bethe approximation is a function of disjoint beliefs. To rectify this, one adds consistency constraints on the beliefs so to make them marginals of a single distribution:

• and , and

• whenever we require .

Enforcing these conditions with Lagrange multipliers produces a constrained Bethe approximate free energy:

 ~ABethe = m∑α=1[∑xα∈Ωαbα(xα)Eα(xα)+kT∑xα∈Ωαbα(xα)logbα(xα)] + kTn∑j=1⎛⎝(1−Cj)⋅∑xj∈Ωjbj(xj)logbj(xj)⎞⎠ + m∑α=1λα(∑xα∈Ωαbα(xα)−1)+n∑j=1λj⎛⎝∑xj∈Ωjbj(xj)−1⎞⎠ + m∑α=1∑j≺α∑xj∈Ωjλjα(xj)⎛⎜⎝∑xα∖xjbα(xα)−bj(xj)⎞⎟⎠.

We find relations at interior critical points of by setting various derivatives to zero. The derivatives with respect to the multipliers simply recover the constraints when set to zero. The two nontrivial types of derivatives are

 ∂~ABethe∂bα(xα) = Eα(xα)+kT(logbα(xα)+1)+λα+∑j≺αλjα(xj), and ∂~ABethe∂bj(xj) = kT(1−Cj)(logbj(xj)+1)+λj−∑α≻jλjα(xj).

Setting the first of these to zero produces the equation

 bα(xα)=e−(1+λα/kT)e−Eα(xα)/kT∏j≺αe−λjα(xj)/kT.

In particular, at a critical point the multiplier can be selected, and is completely determined by, the normalization constraint . So, we are free to work with unnormalized, for which we have

 bα(xα)∝fα(xα)⋅∏j≺αe−λjα(xj)/kT (3)

at any interior critical point. Similarly, the second type of derivative above produces the relation

 bj(xj)∝∏α≻je−λjα(xj)/kT(Cj−1) (4)

at an interior critical point, where now is selected, and determined by, the normalization of .

The remaining constraints are the consistency of the marginals. Computing the marginal using (3) one finds the additional relations on each :

 bj(xj)∝e−λjα(xj)/kT⋅∑xα∖xjfα(xα)∏r≺α∖je−λrα(xj)/kT.

We define the function

 Mα→j(xj)∝∑xα∖xjfα(xα)∏r≺α∖je−λrα(xj)/kT,

with the condition . One can simplify the above relation to

 bj(xj)∝e−λjα(xj)/kT⋅Mα→j(xj). (5)

Note the trivial equation

 ∏β≻j:β≠αbj(xj)=bj(xj)Cj−1,

which is just a restatement of the definition of . Now, we evaluate the left side of this equation using (5) and the right side using (4). This results in the relation

 ∏β≻j:β≠αe−λjβ(xj)/kT⋅Mβ→j(xj)∝∏β≻je−λjβ(xj)/kT.

Canceling common terms from both side leaves

 e−λjα(xj)/kT∝∏β≻j:β≠αMβ→j(xj). (6)

Now we define the function

 Mj→α(xj)∝e−λjα(xj)/kT,

where as before we require to be a probability distribution.

###### Proposition 1.

Suppose that for each variable and factor with one has two probability distributions, and , which jointly satisfy

 Mα→j(xj) ∝ ∑xα∖xjfα(xα)∏k≺α∖jMk→α(xk), and Mj→α(xj) ∝ ∏β≻j:β≠αMβ→j(xj).

Then the probability distributions and satisfying

 bα(xα) ∝ fα(xα)⋅∏j≺αMj→α(xj), and bj(xj) ∝ ∏α≻jMα→j(xj),

are critical points of the Bethe approximate free energy with

 ∑xα∖xjbα(xα)=bj(xj), % whenever j≺α.

This result hands us the sum-product algorithm. We initialize distributions and in a reasonable way (which will be problem dependent) and iteratively redefine these as in the proposition:

 M(t+1)α→j(xj) ∝ ∑xα∖xjfα(xα)∏k≺α∖jM(t)k→α(xk), and M(t+1)j→α(xj) ∝ ∏β≻j:β≠αM(t+1)β→j(xj).

The proposition shows that if this converges to a stationary point, that point is a critical point of the Bethe approximate free energy and so may be taken as an approximation of the Bayesian posterior.

In practice, one has a class of factors that depend only a single variable

; these typically arise as prior probabilities in a re-estimation problem. Let us write

in this case. Since is vacuous, we have

 M(t){j}→j(xj)∝f{j}(xj)

for all and hence there is no need to compute these. One can then simplify

 M(t+1)j→α∝f{j}(xj)⋅∏β≻j:β≠α,{j}M(t+1)β→j(xj).

Since for , one can also eliminate all factors arising from priors and use this last rule as the update rule at such nodes.

In most problems, one is primarily interested in the marginals

, as these are the posterior probabilities one wishes to compute by re-estimation. From (

4) these satisfy

 bj(xj)∝∏α≻je−λjα(xj)/kT(Cj−1)∝e−(∑α≻j(λjα(xj)/(Cj−1)))/kT.

That is, is the Gibbs state associated to the energy spectrum given by normalized Lagrange multipliers . In particular, the posterior marginal is itself the minimum of the Helmholtz free energy

 Aj(bj)=∑α≻j∑xj∈Ωjbj(xj)λjα(xj)/(Cj−1)+kT∑xj∈Ωjbj(xj)logbj(xj).

## Iii Regional approximate free energies

The presentation of the Bethe approximate free energy of the previous section was a purposefully unusual. Let us return to this construction, but rather than focus on the energies of single factors let us construct “regions” containing multiple factors. We introduce the notation for a region of factors . Without loss of generality, we will assume that to every variable there is a factor . As noted at the end of the previous section, we can treat priors differently and so not include them in any of the regions. To be precise, all the factors that are not priors are partitioned into regions and the prior factors are treated individually. We hasten to indicate that unlike the usual approach to regional belief propagation algorithms yedidia2000generalized ; aji2001generalized ; pakzad2002belief , we do not use a hierarchy of regions and the Kukuchi free energy approximation. Here we simply organize our factors in a different way and use the Bethe approximation above.

We extend the notation to , meaning the collection of variables for which depends nontrivially; we also write or when is one of these variables. Each region has internal energy

 UR(bR)=∑xRbR(xR)ER(xR),

where we have defined . Similarly, if is the energy associated to the prior , the internal energy at the variable is

 Uj(bj)=∑xjbj(xj)Ej(xj).

When and are obtained by marginalizing a global probability , the internal energy of the whole system is

 U(p)=∑RUR(bR)+∑jUj(bj).

Exactly at in the Bethe method, define the weight

 Cj=#{R:R≻j},

and the constrained regional approximate free energy:

 ~Aregional = ∑R(∑xRbR(xR)ER(xR)+kT∑xRbR(xR)logbR(xR)) + ∑j(∑xjbj(xj)Ej(xj)−kT(Cj−1)∑xjbj(xj)logbj(xj)) + ∑R∑j≺R∑xjλjR(xj)⎛⎜⎝∑xR∖xjbR(xR)−bj(xj)⎞⎟⎠.

By a similar argument, setting the derivative with respect to to zero yields

 bR(xR)∝fR(xR)⋅∏j≺Re−λjR(xj)/kT. (7)

One finds the analysis of the derivative with respect to splits into two cases. First if (when the variable is internal to one region) one obtains

 ∂~Aregional∂bj(xj)=Ej(xj)+λj−λjR(xj),

where is the region that contains . By setting this to zero and exponentiating, we find

 e−λjR(xj)/kT∝fj(xj).

However for variables contained in multiple regions (which we call boundary variables) one finds

 ∂~Aregional∂bj(xj)=Ej(xj)−(Cj−1)kT(logbj(xj)+1)+λj−λjR(xj).

Setting this derivative to zero produces the relation

 bj(xj)∝(fj(xj))−1/(Cj−1)⋅∏R≻je−λjR(xj)/kT(Cj−1). (8)

The variables forming the support of a region divide into boundary variables (denoted ) and the remainder which we call interior (denoted ). Computing the marginal of the belief in (7) at a boundary variable yields the relation

 bj(xj)∝e−λjR(xj)/kT∑xR∖xjfR(xR)∏r≺R∖je−λrR(xr)/kT.

Note that in this product, runs over all variables. For variables in we know . Based on this we define the augmented regional factor as the usual factor for that region times the priors for all its internal variables:

 ~fR(xR)=fR(xR)⋅∏r≺R∘fr(xR).

As ,

 bj(xj)∝e−λjR(xj)/kT∑xR∖xj~fR(xR)∏r≺∂R∖je−λrR(xr)/kT.

 MR→j(xj)=∑xR∖xj~fR(xR)∏r≺∂R∖je−λrR(xr)/kT

so that

 bj(xj)∝e−λjR(xj)/kTMR→j(xj). (9)

Except for the fact that one deals only with messages to and from boundary variables, the arguments of the previous section apply. For a fixed region one has

 ∏S≻j:S≠Rbj(xj)=bj(xj)Cj−1.

Evaluating the left and right sides with equations (9) and (8) respectively gives the relation

 ∏S≻j:S≠Re−λjS(xj)/kTMS→j(xj)∝fj(xj)−1⋅∏S≻je−λjS(xj)/kT.

Canceling common terms produces the analogous result:

 Mj→R(xj)∝e−λjR(xj)/kT∝fj(xj)⋅∏S≻j:S≠RMR→j(xj).
###### Proposition 2.

Suppose that for each region and variable with one has two probability distributions, and , which jointly satisfy

 MR→j(xj) ∝ ∑xR∖xj~fR(xR)∏r≺∂R∖jMr→R(xr), and Mj→R(xj) ∝ fj(xj)⋅∏S≻j:S≠RMS→j(xj).

Then the probability distributions and satisfying

 bR(xR) ∝ ~fR(xR)⋅∏j≺∂RMj→R(xj), and bj(xj) ∝ fj(xj)⋅∏R≻jMR→j(xj),

are critical points of the regional approximate free energy with

 ∑xR∖xjbR(xR)=bj(xj), whenever j≺∂R.

Note that this proposition yields a generalized belief propagation algorithm. However this algorithm becomes impractical as the size of the regions increases. With large regions, there is potentially far fewer variables for which we need to compute messages. However the marginalization over becomes intractable.

## Iv Modified free energy of a single region

In the previous section we developed a regional belief propagation algorithm that rapidly becomes inefficient as the size of regions get large. However, if we assume we have access to a device that samples from the Boltzmann distribution, can this capability aid us in computing—or at least in approximating—these marginals?

Given a regional decomposition as in the last section, the Helmholtz free energy of the factors of a single region is given by

 AR(bR)=∑xRbR(xR)(ER(xR)+∑j≺REj(xj))+kT∑xRbR(xR)logbR(xR).

Unless is isolated from the rest of the system one would not expect that minimizing would produce a result related to our desired posterior distribution. Nonetheless, we argue as follows. At the minimum of the free energy, the outgoing message represents the flow out of variable into region . The (change in) free energy associated to this is

 ASj(bj)=∑xjbj(xj)λjS(xj)+kTbj(xj)logbj(xj).

At each variable in we compensate for this expected flow. This produces the modified free energy of the region , defined as

 AR(bR)−∑j∈∂R⎛⎝∑S≻j:S≠RASj(bj)⎞⎠.

As with the Bethe approximation, we add appropriate marginalization constraints, to obtain the constrained modified free energy

 ~AR = ∑xRbR(xR)~ER(xR)+kT∑xRbR(xR)logbR(xR) + ∑j≺∂R∑xjbj(xj)VRj(xj)−kT(Cj−1)∑xjbj(xj)logbj(xj) + λR(∑xRbR(xR)−1)+∑j≺∂RλRj(∑xjbj(xj)−1) + ∑j≺∂R∑xjλjR(xj)⎛⎜⎝∑xR∖xjbR(xR)−bj(xj)⎞⎟⎠,

where

 ~ER(xR)=ER(xR)+∑j≺R∖∂REj(xj)

and for ,

 VRj(xj)=Ej(xj)−∑S≻j:S≠RλjS(xj).

Note the the definition of involves the Lagrange multipliers of the constrained modified free energy of adjacent regions.

Adding these local “corrective” potentials could be viewed as a form of hybrid mean field approach riegler2013merging , however the values of the potential are unknown. Nonetheless, like the cavity method mooij2007loop , if the corrective potentials are set as indicated above then the minima of each regional free energy do minimize the global approximate free energy.

###### Theorem 3.

A joint interior critical point of each modified regional free energy defines a critical point of the regional approximate free energy.

###### Proof.

Just as above we compute an interior critical point of each region’s constrained modified free energy, yielding

 ~ER(xR)+kT(logbR(xR)+1)+λR+∑j≺∂RλjR(xj) = 0, VRj(xj)−(Cj−1)kT(logbj(xj)+1)+λRj−λjR(xj) = 0.

Now using these become

 bR(xR) ∝ ~fR(xR)⋅∏j≺∂Re−λjR(xj)/kT bj(xj) ∝ f(xj)1/(1−Cj)⋅∏S≻je−λjS(xr)/kT(Cj−1).

These are precisely the relations of a critical point of the regional approximate free energy. ∎

Finally we can tackle the question of how to utilize a modest sized device or method that can produce Boltzmann samples. Suppose that: given an arbitrary selection of corrective potentials , we have a black box that produces an interior minimum of each region’s modified regional free energy. Then: we instantiate Algorithm 1 below. Note that in this algorithm it is only the marginals of the Boltzmann distribution that are required. With the ability to do Boltzmann sampling from physical hardware (or by other means), we approximate these marginals in each region by drawing a sufficiently large number of samples and estimating the marginal probabilities empirically. Specifically, we decompose our large re-estimation problem into regions small enough for our Boltzmann sampler to handle, and use the sampler on each region to approximate the marginals of Algorithm 1. By Theorem 5 below, if the algorithm converges then it recovers the approximate posterior.

The numerical results presented in bian2014discrete ; bian2016mapping indicate that these approximations can be sufficient to solve problems too large to be sampled directly. In bian2014discrete we used a variant of Algorithm 1 based on the min-sum method to solve 1000 variable LDPC decoding problems on a 504 qubit D-Wave quantum annealer for which belief propagation failed to converge. In bian2016mapping we implemented a slight variant of Algorithm 1 to solve hardware fault diagnosis problems requiring up to seven regions to completely embed on a D-Wave 2X architecture.

###### Lemma 4.

If Algorithm 1 converges then the marginal at any boundary variable is independent of the region used to compute it.

###### Proof.

For any boundary variable and region , if the algorithm has converged then steps 5 and 6 of the algorithm show

 ∑xR∖xjbR(xR)∝Fj→R(xj)⋅FR→j(xj)∝fj(xj)⋅∏S≻jFS→j(xj),

which is independent of . ∎

###### Theorem 5.

If Algorithm 1 converges then the collection of minima of each region’s modified regional free energy from step 4, and their marginals , produces a critical point of the approximate regional free energy of the whole system.

###### Proof.

For an arbitrary selection of corrective potentials , the critical point of the constrained modified free energy satisfies

 bR(xR) ∝ ~fR(xR)⋅∏j≺∂Re−λjR(xj)/kT b(R)j(xj) ∝ eVRj(xj)/kT(Cj−1)⋅e−λjR(xj)/kT(Cj−1).

Therefore at a critical point,

 bR(x