Faster Algorithms for Max-Product Message-Passing

10/17/2009 · Julian J. McAuley, et al. · CSIRO

Maximum A Posteriori inference in graphical models is often solved via message-passing algorithms, such as the junction-tree algorithm, or loopy belief-propagation. The exact solution to this problem is well known to be exponential in the size of the model's maximal cliques after it is triangulated, while approximate inference is typically exponential in the size of the model's factors. In this paper, we take advantage of the fact that many models have maximal cliques that are larger than their constituent factors, and also of the fact that many factors consist entirely of latent variables (i.e., they do not depend on an observation). This is a common case in a wide variety of applications, including grids, trees, and ring-structured models. In such cases, we are able to decrease the exponent of complexity for message-passing by 0.5 for both exact and approximate inference.



1 Introduction

It is well-known that exact inference in tree-structured graphical models can be accomplished efficiently by message-passing operations following a simple protocol making use of the distributive law (Aji and McEliece, 2000; Kschischang et al., 2001). It is also well-known that exact inference in arbitrary graphical models can be solved by the junction-tree algorithm; its efficiency is determined by the size of the maximal cliques after triangulation, a quantity related to the treewidth of the graph.

Figure 1 illustrates an attempt to apply the junction-tree algorithm to some graphical models containing cycles. If the graphs are not chordal ((a) and (b)), they need to be triangulated, or made chordal (red edges in (c) and (d)). Their clique-graphs are then guaranteed to be junction-trees, and the distributive law can be applied with the same protocol used for trees; see Aji and McEliece (2000) for a beautiful tutorial on exact inference in arbitrary graphs. Although the models in this example contain only pairwise factors, triangulation has increased the size of their maximal cliques, making exact inference substantially more expensive. Hence approximate solutions in the original graph (such as loopy belief-propagation, or inference in a loopy factor-graph) are often preferred over an exact solution via the junction-tree algorithm.

Even when the model’s factors are the same size as its maximal cliques, neither exact nor approximate inference algorithms take advantage of the fact that many factors consist only of latent variables. In many models, those factors that are conditioned upon the observation contain fewer latent variables than the purely latent cliques. Examples are shown in Figure 2. This encompasses a wide variety of models, including grid-structured models for optical flow and stereo disparity as well as chain and tree-structured models for text or speech.

Figure 1: The models at left ((a) and (b)) can be triangulated ((c) and (d)) so that the junction-tree algorithm can be applied. Despite the fact that the new models have larger maximal cliques, the corresponding potentials are still factored over pairs of nodes only. Our algorithms exploit this fact.
Figure 2: Some graphical models to which our results apply: cliques containing observations have fewer latent variables than purely latent cliques. White nodes correspond to the observation, gray nodes to the labeling. In other words, cliques containing a white node encode the data likelihood, whereas cliques containing only gray nodes encode priors.

In this paper, we exploit the fact that the maximal cliques (after triangulation) often have potentials that factor over subcliques, as illustrated in Figure 1. We will show that whenever this is the case, the expected computational complexity of exact inference can be improved (both the asymptotic upper-bound and the actual runtime).

Additionally, we will show that this result can be applied so long as those cliques that are conditioned upon an observation contain fewer latent variables than those cliques consisting of purely latent variables; the ‘purely latent’ cliques can be pre-processed offline, allowing us to achieve the same benefits as described in the previous paragraph.

We show that these properties reveal themselves in a wide variety of real applications. Both of our improvements enlarge the class of problems for which inference via max-product belief-propagation is tractable.

A core operation encountered in the junction-tree algorithm is that of finding the index that chooses the largest product amongst two lists of length $N$:

$$\hat{i} = \operatorname*{argmax}_{i \in \{1, \ldots, N\}} \mathbf{v}^a[i] \cdot \mathbf{v}^b[i]. \qquad (1)$$

Our results stem from the realization that while (eq. 1) appears to be a linear-time operation, it can be decreased to $O(\sqrt{N})$ (in the expected case) if we know the permutations that sort $\mathbf{v}^a$ and $\mathbf{v}^b$.
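As a concrete baseline (our own illustration, not from the paper), (eq. 1) is solved naively by a linear scan over all $N$ indices:

```python
import numpy as np

# Naive Theta(N) solution of (eq. 1): evaluate the product at every index.
va, vb = np.random.rand(1000), np.random.rand(1000)
i_hat = int(np.argmax(va * vb))
```

Algorithm 2 (Section 3) recovers the same index in expected $O(\sqrt{N})$ evaluations once the permutations sorting $\mathbf{v}^a$ and $\mathbf{v}^b$ are available.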

A preliminary version of this work appeared in McAuley and Caetano (2010).

1.1 Summary of Results

A selection of the results to be presented in the remainder of this paper can be summarized as follows:

  • We are able to lower the asymptotic expected running time of max-product belief-propagation for any graphical model whose cliques factorize into lower-order terms.

  • The results obtained are exactly those that would be obtained by the traditional version of the algorithm, i.e., no approximations are used.

  • Our algorithm also applies whenever cliques containing an observed variable contain fewer latent variables than purely latent cliques, as in Figure 2 (meaning that certain computations can be taken offline).

  • For any cliques composed of pairwise factors, we obtain an expected speed-up of at least $\Omega(\sqrt{N})$ (assuming $N$ states per node; $\Omega$ denotes an asymptotic lower-bound).

  • For example, in models with third-order cliques containing pairwise terms, message-passing is reduced from $\Theta(N^3)$ to $O(N^{2.5})$, as in Figure 1(d). For models containing pairwise (but purely latent) cliques, message-passing is reduced from $\Theta(N^2)$ to $O(N^{1.5})$, as in Figure 2.

  • For cliques composed of $K$-ary factors, the expected speed-up generalizes to at least $\Omega(\sqrt[K]{N})$, though it is never asymptotically slower than the original solution.

  • The expected-case improvement is derived under the assumption that the order-statistics of different factors are independent.

  • If the different factors have ‘similar’ order statistics, the performance will be better than the expected case.

  • If the different factors have ‘opposite’ order statistics, the performance will be worse than the expected case, but is never asymptotically more expensive than the traditional version of the algorithm.

Our results do not apply to every semiring $(\oplus, \otimes)$, but only to those whose 'addition' operation defines an order (for example, $\max$ or $\min$); we also assume that under this ordering, our 'multiplication' operator $\otimes$ satisfies

$$x \geq y \;\Rightarrow\; x \otimes z \geq y \otimes z. \qquad (2)$$

Thus our results certainly apply to the max-sum and min-sum semirings (as well as max-product and min-product, assuming non-negative potentials), but not to sum-product (for example). Consequently, our approach is useful for computing MAP-states, but cannot be used to compute marginal distributions. We also assume that the domain of each node is discrete.

We shall initially present our algorithm as it applies to models of the type shown in Figure 1. The more general (and arguably more useful) application of our algorithm to those models in Figure 2 shall be deferred until Section 4, where it can be seen as a straightforward generalization of our initial results.

1.2 Related Work

There has been previous work on speeding up message-passing algorithms by exploiting some type of structure in certain graphical models. For example, Kersting et al. (2009) study the case where different cliques share the same potential function. In Felzenszwalb and Huttenlocher (2006), fast message-passing algorithms are provided for cases in which the potential of a 2-clique depends only on the difference of the latent variables (which is common in some computer vision applications); they also show how the algorithm can be made faster if the graphical model is a bipartite graph. In Kumar and Torr (2006), the authors provide faster algorithms for the case in which the potentials are truncated, whereas in Petersen et al. (2008) the authors offer speed-ups for models that are specifically grid-like.

The latter work is perhaps the most similar in spirit to ours, as it exploits the fact that certain factors can be sorted in order to reduce the search space of a certain maximization problem. In practice, this leads to linear speed-ups over the standard $O(N^2)$ cost of passing a single pairwise message.

Another closely related paper is that of Park and Darwiche (2003). This work can be seen to complement ours in the sense that it exploits essentially the same type of factorization that we study, though it applies to sum-product versions of the algorithm, rather than the max-product version that we shall study. Kjærulff (1998) also exploits factorization within cliques of junction-trees, albeit a different type of factorization than that studied here.

In Section 3, we shall see that our algorithm is closely related to a well-studied problem known as ‘funny matrix multiplication’ (Kerr, 1970). The worst-case complexity of this problem has been studied in relation to the all-pairs shortest path problem (Alon et al., 1997; Karger et al., 1993).

2 Background

The notation we shall use is briefly defined in Table 1. We shall assume throughout that the max-product semiring is being used, though our analysis is almost identical for any suitable choice.

Example : description
$X$, $Y$, $Z$ : capital letters refer to sets of nodes (or similarly, cliques);
$X \setminus Y$ : standard set operators are used ($\setminus$ denotes set difference);
$\mathrm{dom}(X)$ : the domain of a set; this is just the Cartesian product of the domains of each element in the set;
$\mathbf{V}$ : bold capital letters refer to arrays;
$\mathbf{v}$ : bold lower-case letters refer to vectors;
$\mathbf{v}[i]$ : vectors are indexed using square brackets;
$\mathbf{V}[i]$ : similarly, square brackets are used to index a row of a 2-d array, or a row of an $n$-dimensional array;
$\mathbf{v}^a$, $\mathbf{V}^A$ : superscripts are just labels, i.e., $\mathbf{V}^A$ is an array, $\mathbf{v}^a$ is a vector;
$\mathbf{v}_q$ : constant subscripts are also labels, i.e., if $q$ is a constant, then $\mathbf{v}_q$ is a constant vector;
$\mathbf{v}_{x_i}$ : variable subscripts define variables; the subscript defines the domain of the variable;
$\mathbf{x}_A$ : if $\mathbf{x}$ is a constant vector, then $\mathbf{x}_A$ is the restriction of that vector to those indices corresponding to variables in $A$ (assuming that $A$ is an ordered set);
$\Phi_X(x_X)$ : a function over the variables in a set $X$; the argument will be suppressed if clear, given that 'functions' are essentially arrays for our purposes;
$\Phi_{i,j}(x_i, x_j)$ : a function over a pair of variables $(x_i, x_j)$;
$\Phi_{i,j}(q, x_j)$ : if one argument to a function is constant (here $q$), then it becomes a function over fewer variables (in this case, only $x_j$ is free).
Table 1: Notation

MAP-inference in a graphical model $\mathcal{G}$ consists of solving an optimization problem of the form

$$\hat{x} = \operatorname*{argmax}_{x} \prod_{X \in \mathcal{C}} \Phi_X(x_X), \qquad (3)$$

where $\mathcal{C}$ is the set of maximal cliques in $\mathcal{G}$. This problem is often solved via message-passing algorithms such as the junction-tree algorithm, loopy belief-propagation, or inference in a factor graph (Aji and McEliece, 2000; Weiss, 2000; Kschischang et al., 2001).

Two of the fundamental steps encountered in message-passing algorithms are defined below. Firstly, the message from a clique $X$ to an intersecting clique $Y$ is defined by

$$m_{X \to Y}(x_{X \cap Y}) = \max_{x_{X \setminus Y}} \Phi_X(x_X) \prod_{Z \in \Gamma(X) \setminus \{Y\}} m_{Z \to X}(x_{Z \cap X}) \qquad (4)$$

(where $\Gamma(X)$ returns the neighbors of the clique $X$). If such messages are computed after $X$ has received messages from all of its neighbors except $Y$, then this defines precisely the update scheme used by the junction-tree algorithm. The same update scheme is used for loopy belief-propagation, though it is done iteratively in a randomized fashion.

Secondly, after all messages have been passed, the MAP-states for a subset of nodes $M$ (assumed to belong to a clique $X$) are computed using

$$m_M(x_M) = \max_{x_{X \setminus M}} \Phi_X(x_X) \prod_{Z \in \Gamma(X)} m_{Z \to X}(x_{Z \cap X}). \qquad (5)$$

Often, the clique-potential $\Phi_X(x_X)$ shall be decomposable into several smaller factors, i.e.,

$$\Phi_X(x_X) = \prod_{F \in \mathcal{F}_X} \Phi_F(x_F), \qquad (6)$$

where each $F$ is a subset of the clique $X$.

Some simple motivating examples are shown in Figure 3: a model for pose estimation from Sigal and Black (2006), a 'skip-chain CRF' from Galley (2006), and a model for shape matching from Coughlan and Ferreira (2002). In each case, the triangulated model has third-order cliques, but the potentials are only pairwise. Other examples have already been shown in Figure 1; analogous cases are ubiquitous in many real applications.

Figure 3: (a) A model for pose reconstruction from Sigal and Black (2006); (b) A ‘skip-chain CRF’ from Galley (2006); (c) A model for deformable matching from Coughlan and Ferreira (2002). Although the (triangulated) models have cliques of size three, their potentials factorize into pairwise terms.

The optimizations we suggest shall apply to general problems of the form

$$m_M(x_M) = \max_{x_{X \setminus M}} \prod_{F \in \mathcal{F}} \Phi_F(x_F), \qquad (7)$$

which subsumes both (eq. 4) and (eq. 5), where we simply treat the messages as factors of the model. Algorithm 1 gives the traditional solution to this problem, which does not exploit the factorization of $\Phi_X$. This algorithm runs in $\Theta(N^{|X|})$, where $N$ is the number of states per node, and $|X|$ is the size of the clique (we assume that for a given $x$, computing $\prod_{F \in \mathcal{F}} \Phi_F(x_F)$ takes constant time, as our optimizations shall not modify this cost).

0:  a clique $X$ whose max-marginal $m_M(x_M)$ (where $M \subset X$) we wish to compute; assume that each node in $X$ has domain $\{1, \ldots, N\}$
1:  for $m \in \mathrm{dom}(M)$ {i.e., $m \in \{1, \ldots, N\}^{|M|}$} do
2:     $max \leftarrow -\infty$
3:     for $y \in \mathrm{dom}(X \setminus M)$ do
4:        if $\Phi_X(m; y) > max$ then
5:           $max \leftarrow \Phi_X(m; y)$
6:        end if
7:     end for {this loop takes $\Theta(N^{|X \setminus M|})$}
8:     $m_M(m) \leftarrow max$
9:  end for {this loop takes $\Theta(N^{|X|})$}
10:  Return:  $m_M$
Algorithm 1 Brute-force computation of max-marginals
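For concreteness, a minimal NumPy sketch of this brute-force computation (our own illustration, with the clique potential stored as a dense array and one axis per node):

```python
import numpy as np

def max_marginal_brute_force(phi, keep_axes):
    """Brute-force max-marginal of a clique potential (Algorithm 1).

    phi:       dense array of shape (N, ..., N), one axis per node in X.
    keep_axes: axes corresponding to the nodes M that we keep.
    Runs in Theta(N^{|X|}), touching every entry of phi once.
    """
    drop_axes = tuple(ax for ax in range(phi.ndim) if ax not in keep_axes)
    return phi.max(axis=drop_axes)

# Example: a 3-clique with N = 4 states, marginalized onto (x_i, x_j).
phi = np.random.rand(4, 4, 4)
m_ij = max_marginal_brute_force(phi, keep_axes=(0, 1))  # shape (4, 4)
```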

3 Optimizing Algorithm 1

In order to specify a more efficient version of Algorithm 1, we begin by considering the simplest possible nontrivial factorization: a clique of size three containing pairwise factors. In such a case, our aim is to compute

$$m_{i,j}(x_i, x_j) = \max_{x_k} \Phi(x_i, x_j, x_k), \qquad (8)$$

which we have assumed takes the form

$$\Phi(x_i, x_j, x_k) = \Phi_{i,j}(x_i, x_j)\, \Phi_{i,k}(x_i, x_k)\, \Phi_{j,k}(x_j, x_k). \qquad (9)$$

For a particular value of $(x_i, x_j)$, we must solve

$$\max_{x_k} \Phi_{i,k}(x_i, x_k)\, \Phi_{j,k}(x_j, x_k), \qquad (10)$$

which we note is in precisely the form shown in (eq. 1).

There is a close resemblance between (eq. 10) and the problem of multiplying two matrices: if the 'max' in (eq. 10) is replaced by summation, we essentially recover traditional matrix multiplication. While traditional matrix multiplication is well known to have a sub-cubic worst-case solution (see Strassen, 1969), the version in (eq. 10) (often referred to as 'funny matrix multiplication', or simply 'max-product matrix multiplication') is known to be cubic in the worst case, assuming that only multiplication and comparison operations are used (Kerr, 1970). The complexity of solving (eq. 10) can also be shown to be equivalent to the all-pairs shortest path problem, which is studied in Alon et al. (1997). Not surprisingly, we shall not improve the worst-case complexity, but shall instead give far better expected-case performance than existing solutions. Just as Strassen's algorithm can be used to solve (eq. 10) when maximization is replaced by summation, there has been work studying the problem of sum-product inference in graphical models, subject to the same type of factorization we discuss (Park and Darwiche, 2003).

As we have previously suggested, it will be possible to solve (eq. 10) efficiently if $\Phi_{i,k}(x_i, \cdot)$ and $\Phi_{j,k}(x_j, \cdot)$ are already sorted. We note that $\Phi_{i,k}(x_i, \cdot)$ will be reused for every value of $x_j$, and likewise $\Phi_{j,k}(x_j, \cdot)$ will be reused for every value of $x_i$. Sorting every row of $\Phi_{i,k}$ and $\Phi_{j,k}$ can be done in $O(N^2 \log N)$ (for $N$ rows of length $N$).

The following elementary lemma is the key observation required in order to solve (eq. 10) efficiently:

Lemma 1.

If the $p^{\text{th}}$-largest element of $\mathbf{v}^a$ has the same index as the $q^{\text{th}}$-largest element of $\mathbf{v}^b$, then we only need to search through the $p$ largest values of $\mathbf{v}^a$, and the $q$ largest values of $\mathbf{v}^b$; any corresponding pair of smaller values could not possibly be the largest solution.

This observation is used to construct Algorithm 2. Here we iterate through the indices of both lists, starting from the largest values of $\mathbf{v}^a$ and $\mathbf{v}^b$, stopping once both indices are 'behind' the maximum value found so far (which we then know is the maximum). This algorithm is demonstrated pictorially in Figure 4.

0:  two vectors $\mathbf{v}^a$ and $\mathbf{v}^b$, and permutation functions $p^a$ and $p^b$ that sort them in decreasing order (so that $\mathbf{v}^a[p^a[1]]$ is the largest element in $\mathbf{v}^a$)
1:  Initialize: $start \leftarrow 1$, $end_a \leftarrow (p^a)^{-1}[p^b[1]]$, $end_b \leftarrow (p^b)^{-1}[p^a[1]]$ {if $end_a = 1$, then the largest element in $\mathbf{v}^a$ has the same index as the largest element in $\mathbf{v}^b$}
2:  $best \leftarrow p^a[1]$, $max \leftarrow \mathbf{v}^a[best] \cdot \mathbf{v}^b[best]$
3:  if $\mathbf{v}^a[p^b[1]] \cdot \mathbf{v}^b[p^b[1]] > max$ then
4:     $best \leftarrow p^b[1]$, $max \leftarrow \mathbf{v}^a[best] \cdot \mathbf{v}^b[best]$
5:  end if
6:  while $start < end_a$ {in practice, we could also stop if $start \geq end_b$, but the version given here is the one used for analysis in Appendix A} do
7:     $start \leftarrow start + 1$
8:     if $\mathbf{v}^a[p^a[start]] \cdot \mathbf{v}^b[p^a[start]] > max$ then
9:        $best \leftarrow p^a[start]$
10:        $max \leftarrow \mathbf{v}^a[best] \cdot \mathbf{v}^b[best]$
11:     end if
12:     if $(p^b)^{-1}[p^a[start]] < end_b$ then
13:        $end_b \leftarrow (p^b)^{-1}[p^a[start]]$
14:     end if
15:     {repeat Lines 8–14, interchanging $a$ and $b$}
16:  end while {this takes expected time $O(\sqrt{N})$}
17:  Return:  $best$
Algorithm 2 Find $\hat{i}$ such that $\mathbf{v}^a[\hat{i}] \cdot \mathbf{v}^b[\hat{i}]$ is maximized

Figure 4: Algorithm 2, explained pictorially. The arrows begin at the largest values of $\mathbf{v}^a$ and $\mathbf{v}^b$; the red line connects $end_a$ and $end_b$, behind which we need not search; a dashed arrow is used when a new maximum is found. Note that in the event that $\mathbf{v}^a$ and $\mathbf{v}^b$ contain repeated elements, they can be sorted arbitrarily.
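The following Python sketch captures the spirit of Algorithm 2 (our own naming and structure; it uses an equivalent 'threshold' stopping rule rather than the end-pointer bookkeeping above, and assumes non-negative inputs as discussed in Section 1.1):

```python
import numpy as np

def argmax_product(va, vb, pa=None, pb=None):
    """Find argmax_i va[i] * vb[i] given the orders sorting va and vb.

    va, vb: non-negative 1-d arrays of equal length N.
    pa, pb: indices sorting va (resp. vb) in decreasing order; these
    should be precomputed whenever the vectors are reused across calls.
    Expected O(sqrt(N)) evaluations when the two orders are independent.
    """
    if pa is None:
        pa = np.argsort(-va)
    if pb is None:
        pb = np.argsort(-vb)
    best, best_val = pa[0], va[pa[0]] * vb[pa[0]]
    for k in range(len(va)):
        # No unvisited index can beat the product of the k-th largest
        # entries of va and vb (this is where non-negativity is needed).
        if best_val >= va[pa[k]] * vb[pb[k]]:
            break
        for i in (pa[k], pb[k]):
            if va[i] * vb[i] > best_val:
                best, best_val = i, va[i] * vb[i]
    return best
```

The sorts themselves cost $O(N \log N)$, but in the message-passing setting each row is sorted once and the permutations are reused across all $N$ calls, so the sorting cost is amortized away.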

A prescription of how Algorithm 2 can be used to solve (eq. 8) is given in Algorithm 3. Determining precisely the running time of Algorithm 2 (and therefore Algorithm 3) is not trivial, and will be explored in depth in Appendix A. We note that if the expected-case running time of Algorithm 2 is $O(f(N))$, then the time taken to solve Algorithm 3 shall be $O(N^2 f(N))$. At this stage we shall state an upper-bound on the true complexity in the following theorem:

Theorem 1.

The expected running time of Algorithm 2 is $O(\sqrt{N})$, yielding a speed-up of at least $\Omega(\sqrt{N})$ in cliques containing pairwise factors.

0:  a potential $\Phi(x_i, x_j, x_k) = \Phi_{i,j}(x_i, x_j)\, \Phi_{i,k}(x_i, x_k)\, \Phi_{j,k}(x_j, x_k)$ whose max-marginal $m_{i,j}(x_i, x_j)$ we wish to compute; assume that each node has domain $\{1, \ldots, N\}$
1:  for $n \in \{1, \ldots, N\}$ do
2:     compute $\mathbf{P}^{a}[n]$ by sorting $\Phi_{i,k}(n, \cdot)$ {takes $O(N \log N)$}
3:     compute $\mathbf{P}^{b}[n]$ by sorting $\Phi_{j,k}(n, \cdot)$ {$\mathbf{P}^{a}$ and $\mathbf{P}^{b}$ are arrays, each row of which is a permutation; $\Phi_{i,k}(n, \cdot)$ and $\Phi_{j,k}(n, \cdot)$ are functions over $x_k$, since the first argument is constant in this expression}
4:  end for {this loop takes $O(N^2 \log N)$}
5:  for $(x_i, x_j) \in \{1, \ldots, N\}^2$ do
6:     $\mathbf{v}^a \leftarrow \Phi_{i,k}(x_i, \cdot)$
7:     $\mathbf{v}^b \leftarrow \Phi_{j,k}(x_j, \cdot)$
8:     $\hat{x}_k \leftarrow \text{Algorithm 2}(\mathbf{v}^a, \mathbf{v}^b, \mathbf{P}^{a}[x_i], \mathbf{P}^{b}[x_j])$ {takes expected time $O(\sqrt{N})$}
9:     $m_{i,j}(x_i, x_j) \leftarrow \Phi_{i,j}(x_i, x_j)\, \Phi_{i,k}(x_i, \hat{x}_k)\, \Phi_{j,k}(x_j, \hat{x}_k)$
10:  end for {this loop takes expected time $O(N^2 \sqrt{N})$} {the total running time is $O(N^2 \log N + N^2 \sqrt{N})$, which is dominated by $O(N^{2.5})$}
11:  Return:  $m_{i,j}$
Algorithm 3 Use Algorithm 2 to compute the max-marginal of a 3-clique containing pairwise factors
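A hedged NumPy sketch of Algorithm 3 (reusing the argmax_product function from the sketch above; names are ours):

```python
import numpy as np

def max_marginal_3clique(phi_ij, phi_ik, phi_jk):
    """Max-marginal m(x_i, x_j) of a 3-clique with pairwise factors.

    phi_ij, phi_ik, phi_jk: non-negative (N, N) arrays.
    Expected O(N^2 sqrt(N)) after an O(N^2 log N) sorting step, versus
    Theta(N^3) for the brute-force loop of Algorithm 1.
    """
    N = phi_ij.shape[0]
    # Sort each row once; each permutation is then reused N times.
    p_ik = np.argsort(-phi_ik, axis=1)
    p_jk = np.argsort(-phi_jk, axis=1)
    m = np.empty((N, N))
    for xi in range(N):
        for xj in range(N):
            xk = argmax_product(phi_ik[xi], phi_jk[xj], p_ik[xi], p_jk[xj])
            m[xi, xj] = phi_ij[xi, xj] * phi_ik[xi, xk] * phi_jk[xj, xk]
    return m
```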

3.1 An Extension to Higher-Order Cliques with Three Factors

The simplest extension that we can make to Algorithms 2 and 3 is to note that they can be applied even when there are several overlapping terms in the factors. For instance, Algorithm 3 can be adapted to solve

$$m_{i,j}(x_i, x_j) = \max_{x_k, x_l} \Phi_{i,j}(x_i, x_j)\, \Phi_{i,k,l}(x_i, x_k, x_l)\, \Phi_{j,k,l}(x_j, x_k, x_l) \qquad (11)$$

and similar variants containing three factors. Here both $x_k$ and $x_l$ are shared by $\Phi_{i,k,l}$ and $\Phi_{j,k,l}$. We can follow precisely the reasoning of the previous section, except that when we sort $\Phi_{i,k,l}(x_i, \cdot, \cdot)$ (similarly $\Phi_{j,k,l}(x_j, \cdot, \cdot)$) for a fixed value of $x_i$ (resp. $x_j$), we are now sorting an array rather than a vector (Algorithm 3, Lines 2 and 3); in this case, the permutation functions passed to Algorithm 2 simply return pairs of indices. This is illustrated in Figure 5. Effectively, in this example we are sorting the variable $(x_k, x_l)$, which has a state space of size $N^2$.

Figure 5: The reasoning applied in Algorithm 2 applies even when the elements of and are multidimensional indices.

As the number of shared terms increases, so does the improvement to the running time. While (eq. 11) would take $\Theta(N^4)$ to solve using Algorithm 1, it takes only $O(N^3)$ (in expectation) to solve using Algorithm 3 (more precisely, if Algorithm 2 takes expected time $O(f(N))$ on lists of length $N$, then (eq. 11) takes $O(N^2 f(N^2))$, which we have mentioned is $O(N^3)$). In general, if we have $S$ shared terms, then the running time is $O(N^2 \sqrt{N^S})$, yielding a speed-up of $\Omega(\sqrt{N^S})$ over the naïve $\Theta(N^{2+S})$ solution of Algorithm 1.
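In NumPy terms, 'sorting an array rather than a vector' is just a sort over the flattened trailing axes; a small sketch of the bookkeeping (our own illustration, again reusing argmax_product):

```python
import numpy as np

# phi_ikl has shape (N, N, N): axis 0 is x_i, axes (1, 2) hold the shared
# pair (x_k, x_l). Flattening the trailing axes turns each slice into a
# vector of length N^2, so argmax_product applies unchanged.
N = 5
phi_ikl = np.random.rand(N, N, N)
phi_jkl = np.random.rand(N, N, N)
flat_a = phi_ikl.reshape(N, N * N)
flat_b = phi_jkl.reshape(N, N * N)
p_a = np.argsort(-flat_a, axis=1)  # one length-N^2 permutation per x_i
p_b = np.argsort(-flat_b, axis=1)

xi, xj = 0, 3
kl = argmax_product(flat_a[xi], flat_b[xj], p_a[xi], p_b[xj])
xk, xl = np.unravel_index(kl, (N, N))  # recover the pair of indices
```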

3.2 An Extension to Higher-Order Cliques with Decompositions Into Three Groups

By similar reasoning, we can apply our algorithm to cases where there are more than three factors, in which the factors can be separated into three groups. For example, consider the clique in Figure 6(a), which we shall call $X$ (the entire graph is a clique, but for clarity we only draw an edge when the corresponding nodes belong to a common factor). Each of the factors in this graph has been labeled using either differently colored edges (for factors of size larger than two) or dotted edges (for factors of size two), and the max-marginal we wish to compute has been labeled using colored nodes. We assume that it is possible to split this graph into three groups such that every factor is contained within a single group, along with the max-marginal we wish to compute (Figure 6, (b)). If such a decomposition is not possible, we will have to resort to further extensions to be described in Section 3.3.

(a) We begin with a set of factors (indicated using colored lines), which are assumed to belong to some clique in our model; we wish to compute the max-marginal with respect to one of these factors (indicated using colored nodes); (b) The factors are split into three groups, such that every factor is entirely contained within one of them (Algorithm 4, line 1).

(c) Any nodes contained in only one of the groups are marginalized (Algorithm 4, lines 2, 3, and 4); the problem is now very similar to that described in Algorithm 3, except that nodes have been replaced by groups; note that this essentially introduces maximal factors in $X_2$ and $X_3$; (d) For every value of the variables shared with the other groups, the marginalized potential $\Phi'_2$ is sorted (Algorithm 4, lines 5–7); (e) Likewise, $\Phi'_3$ is sorted (Algorithm 4, lines 8–10).

(f) For every value of the shared variables in $X_1$, we choose the best value of the remaining variables by Algorithm 2 (Algorithm 4, lines 11–16); (g) The result is marginalized with respect to $M$ (Algorithm 4, line 17).

Figure 6: Algorithm 4, explained pictorially. In this case, the most computationally intensive step is the marginalization in step (c); however, the algorithm can actually be applied recursively to that group (see Section 3.2.1), resulting in an overall running time well below that of the naïve solution of Algorithm 1.
0:  potentials $\Phi_1(x_{X_1})$, $\Phi_2(x_{X_2})$, $\Phi_3(x_{X_3})$; each of the factors should be contained in exactly one of these terms, and we assume that $M \subseteq X_1$ (see Figure 6)
1:  Define: $Z_1 := X_1 \cap (X_2 \cup X_3)$; $Z_2 := X_2 \cap (X_1 \cup X_3)$; $Z_3 := X_3 \cap (X_1 \cup X_2)$ {$Z_i$ contains the variables in $X_i$ that are shared with at least one other group; alternately, the variables in $X_i \setminus Z_i$ appear only in $X_i$}
2:  compute $\Phi'_1(x_{Z_1 \cup M}) := \max_{x_{X_1 \setminus (Z_1 \cup M)}} \Phi_1(x_{X_1})$ {we are marginalizing over those variables in $X_1$ that do not appear in any of the other groups (or in $M$); this takes $O(N^{|X_1|})$ if done by brute force (Algorithm 1), but may also be done by a recursive call to Algorithm 4}
3:  compute $\Phi'_2(x_{Z_2}) := \max_{x_{X_2 \setminus Z_2}} \Phi_2(x_{X_2})$
4:  compute $\Phi'_3(x_{Z_3}) := \max_{x_{X_3 \setminus Z_3}} \Phi_3(x_{X_3})$
5:  for $x_{Z_2 \cap Z_1} \in \mathrm{dom}(Z_2 \cap Z_1)$ do
6:     compute $\mathbf{P}^2[x_{Z_2 \cap Z_1}]$ by sorting $\Phi'_2(x_{Z_2 \cap Z_1}, \cdot)$ {$\Phi'_2$ is free over $x_{Z_2 \setminus Z_1}$, and is treated as a vector by 'flattening' it; $\mathbf{P}^2$ contains the multidimensional indices that sort it}
7:  end for {this loop takes $O(N^{|Z_2|} \log N)$}
8:  for $x_{Z_3 \cap Z_1} \in \mathrm{dom}(Z_3 \cap Z_1)$ do
9:     compute $\mathbf{P}^3[x_{Z_3 \cap Z_1}]$ by sorting $\Phi'_3(x_{Z_3 \cap Z_1}, \cdot)$
10:  end for {this loop takes $O(N^{|Z_3|} \log N)$}
11:  for $x_{Z_1 \cup M} \in \mathrm{dom}(Z_1 \cup M)$ do
12:     $\mathbf{v}^a \leftarrow \Phi'_2((x_{Z_1})_{Z_2 \cap Z_1}, \cdot)$ {$(x_{Z_1})_{Z_2 \cap Z_1}$ is the 'restriction' of $x_{Z_1}$ to those indices in $Z_2 \cap Z_1$; hence $x_{Z_2 \setminus Z_1}$ is free in $\mathbf{v}^a$, while $(x_{Z_1})_{Z_2 \cap Z_1}$ is fixed}
13:     $\mathbf{v}^b \leftarrow \Phi'_3((x_{Z_1})_{Z_3 \cap Z_1}, \cdot)$
14:     $\hat{x} \leftarrow \text{Algorithm 2}(\mathbf{v}^a, \mathbf{v}^b, \mathbf{P}^2[(x_{Z_1})_{Z_2 \cap Z_1}], \mathbf{P}^3[(x_{Z_1})_{Z_3 \cap Z_1}])$ {takes expected time $O(\sqrt{N^{|(Z_2 \cup Z_3) \setminus Z_1|}})$}
15:     $m(x_{Z_1 \cup M}) \leftarrow \Phi'_1(x_{Z_1 \cup M})\, \mathbf{v}^a[\hat{x}]\, \mathbf{v}^b[\hat{x}]$
16:  end for
17:  $m_M(x_M) \leftarrow \max_{x_{Z_1 \setminus M}} m(x_{Z_1 \cup M})$ {i.e., we are using Algorithm 1 to marginalize with respect to $Z_1 \setminus M$; this takes $O(N^{|Z_1 \cup M|})$}
Algorithm 4 Compute the max-marginal of $\Phi_1 \cdot \Phi_2 \cdot \Phi_3$ with respect to $M$, where the clique $X$ is split into three groups

Ideally, we would like these groups to be of roughly equal size, though in the worst case one group may contain almost all of $X$. We call these groups $X_1$, $X_2$, $X_3$, where $X_1$ is the group containing the max-marginal that we wish to compute. In order to simplify the analysis of this algorithm, we shall express the running time in terms of the size of the largest group, $\max_i |X_i|$, and the largest difference, $\max_i |X_i \setminus Z_i|$. The max-marginal can be computed using Algorithm 4.

The running times shown in Algorithm 4 are loose upper-bounds, given for the sake of expressing the running time in simple terms. More precise running times are given in Table 2; any of the terms shown in Table 2 may be dominant. Some example graphs, and their resulting running times are shown in Figure 7.

Description : lines : time
Marginalization of $\Phi_1$, without recursion : 2 : $O(N^{|X_1|})$
Marginalization of $\Phi_2$ : 3 : $O(N^{|X_2|})$
Marginalization of $\Phi_3$ : 4 : $O(N^{|X_3|})$
Sorting $\Phi'_2$ : 5–7 : $O(N^{|Z_2|} \log N)$
Sorting $\Phi'_3$ : 8–10 : $O(N^{|Z_3|} \log N)$
Running Algorithm 2 on the sorted values : 11–16 : expected $O(N^{|Z_1 \cup M|} \sqrt{N^{|(Z_2 \cup Z_3) \setminus Z_1|}})$
Table 2: Detailed running time analysis of Algorithm 4; any of these terms may be asymptotically dominant
Figure 7: Some example graphs whose max-marginals are to be computed with respect to the colored nodes, using the three regions shown; for each graph, the figure tabulates the running time of Algorithm 1, the running time of Algorithm 4, and the resulting speed-up. Factors are indicated using differently colored edges, while dotted edges always indicate pairwise factors. (a) is the region $X_1$ from Figure 6 (recursion is applied again to achieve this result); (b) is the graph used to motivate Algorithm 3; (c) shows a query in a graph with regular structure; (d) shows a complete graph with six nodes and pairwise terms; (e) generalizes this to a complete graph (clique) with $M$ nodes.

3.2.1 Applying Algorithm 4 Recursively

The marginalization steps of Algorithm 4 (Lines 2, 3, and 4) may further decompose into smaller groups, in which case Algorithm 4 can be applied recursively. For instance, the graph in Figure 7(a) represents the marginalization step that is to be performed in Figure 6(c) (Algorithm 4, Line 4). Since this marginalization step is the asymptotically dominant step in the algorithm, applying Algorithm 4 recursively lowers the asymptotic complexity.

Another straightforward example of applying recursion in Algorithm 4 is shown in Figure 8, in which a ring-structured model is marginalized with respect to two of its nodes. Doing so takes expected time $O(MN^{2.5})$ for a ring with $M$ nodes; in contrast, solving the same problem using the junction-tree algorithm (by triangulating the graph) would take $O(MN^3)$. Loopy belief-propagation takes $O(MN^2)$ per iteration, meaning that our algorithm will be faster if the number of iterations is $\Omega(\sqrt{N})$. Naturally, Algorithm 3 could be applied directly to the triangulated graph, which would again take expected time $O(MN^{2.5})$.

Figure 8: In the above example, lines 2–4 of Algorithm 4 are applied recursively (using Algorithm 3), achieving a total expected running time of $O(MN^{2.5})$ for a loop with $M$ nodes (our algorithm achieves the same running time in the triangulated graph).

3.3 A General Extension to Higher-Order Cliques

Naturally, there are cases for which a decomposition into three terms is not possible, such as

$$\Phi(x_i, x_j, x_k, x_l) = \Phi_{i,j,k}(x_i, x_j, x_k)\, \Phi_{i,j,l}(x_i, x_j, x_l)\, \Phi_{i,k,l}(x_i, x_k, x_l)\, \Phi_{j,k,l}(x_j, x_k, x_l) \qquad (12)$$

(i.e., a clique of size four with third-order factors). However, if the model contains factors of size $K$, it must always be possible to split it into $K+1$ groups (e.g., four in the case of (eq. 12)).

Our optimizations can easily be applied in these cases simply by adapting Algorithm 2 to solve problems of the form

$$\hat{i} = \operatorname*{argmax}_{i \in \{1, \ldots, N\}} \prod_{m=1}^{M} \mathbf{v}^m[i]. \qquad (13)$$

Pseudocode for this extension is presented in Algorithm 5. Note carefully the use of the variable $\mathbf{read}$: we are storing which indices have been read to avoid re-reading them; this guarantees that our algorithm is never asymptotically worse than the naïve solution. Figure 9 demonstrates how such an algorithm behaves in practice. Again, we shall discuss the running time of this extension in Appendix A. For the moment, we state the following theorem:

Theorem 2.

Algorithm 5 generalizes Algorithm 2 to $M$ lists, with an expected running time of $O(M N^{(M-1)/M})$, yielding a speed-up of at least $\Omega(\sqrt[K]{N})$ in cliques containing $K$-ary factors. It is never worse than the naïve solution, meaning that it takes $O(MN)$ in the worst case.

0:  $M$ vectors $\mathbf{v}^1 \ldots \mathbf{v}^M$; permutation functions $p^1 \ldots p^M$ that sort them in decreasing order; a vector $\mathbf{read}$ indicating which indices have been read, and a unique value $token$ {$\mathbf{read}$ is essentially a boolean array indicating which indices have been read; since creating this array is an $O(N)$ operation, we create it externally, and reuse it across calls; setting $\mathbf{read}[i] \leftarrow token$ indicates that a particular index has been read; we use a different value of $token$ for each call to this function so that $\mathbf{read}$ can be reused without having to be reinitialized}
1:  Initialize: $best \leftarrow p^1[1]$, $max \leftarrow \prod_{a=1}^{M} \mathbf{v}^a[p^1[1]]$, $m \leftarrow 0$
2:  for $a \in \{1, \ldots, M\}$ do
3:     $d_a \leftarrow 1$ {the current position in each sorted list}
4:  end for
5:  $bound \leftarrow \prod_{a=1}^{M} \mathbf{v}^a[p^a[1]]$ {no index can achieve a product larger than $bound$}
6:  while $max < bound$ do
7:     $m \leftarrow (m \bmod M) + 1$; $i \leftarrow p^m[d_m]$; $d_m \leftarrow \min(d_m + 1, N)$ {consider each of the $M$ lists in turn}
8:     $bound \leftarrow \prod_{a=1}^{M} \mathbf{v}^a[p^a[d_a]]$ {tighten the bound as the lists are consumed}
9:     if $\mathbf{read}[i] = token$ then
10:        continue {this index has already been evaluated via another list}
11:     end if
12:     $\mathbf{read}[i] \leftarrow token$
13:     if $\prod_{a=1}^{M} \mathbf{v}^a[i] > max$ then
14:        $max \leftarrow \prod_{a=1}^{M} \mathbf{v}^a[i]$
15:        $best \leftarrow i$
16:     end if
17:  end while {see Appendix A for running times}
18:  Return:  $best$
Algorithm 5 Find $\hat{i}$ such that $\prod_{m=1}^{M} \mathbf{v}^m[\hat{i}]$ is maximized

Figure 9: Algorithm 2 can easily be extended to cases including more than two sequences.
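A hedged Python sketch of this $M$-list generalization (our own naming; it uses the same threshold stopping rule as the two-list sketch above, plus a 'read' array so that no index is evaluated twice):

```python
import numpy as np

def argmax_product_m(vs):
    """Find argmax_i of prod_m vs[m][i] for M non-negative lists.

    vs: list of M non-negative 1-d arrays of equal length N.
    Expected sub-linear number of evaluations when the lists' orders
    are independent; each index is evaluated at most once, so the
    worst case stays O(M * N), matching the naive solution.
    """
    M, N = len(vs), len(vs[0])
    perms = [np.argsort(-v) for v in vs]  # decreasing orders (precomputable)
    read = np.zeros(N, dtype=bool)
    best, best_val = -1, -np.inf
    for depth in range(N):
        # No unread index can beat the product of the current list heads.
        bound = np.prod([vs[m][perms[m][depth]] for m in range(M)])
        if best_val >= bound:
            break
        for m in range(M):
            i = perms[m][depth]
            if read[i]:
                continue  # already evaluated via another list
            read[i] = True
            val = np.prod([v[i] for v in vs])
            if val > best_val:
                best, best_val = i, val
    return best
```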

Using Algorithm 5, we can similarly extend Algorithm 4 to allow for any number of groups (pseudocode is not shown; all statements about the groups $X_2$ and $X_3$ simply become statements about groups $X_2 \ldots X_{K+1}$, and calls to Algorithm 2 become calls to Algorithm 5). The one remaining case that has not been considered is when the sequences are functions of different (but overlapping) variables; naïvely, we can create a new variable whose domain is the product space of all of the overlapping terms, and still achieve the performance improvement guaranteed by Theorem 2; in some cases, better results can again be obtained by applying recursion, as in Figure 7.

As a final comment we note that we have not provided an algorithm for choosing how to split the variables of a model into groups. We note that even if we split the groups in a naïve way, we are guaranteed to get at least the performance improvement of Theorem 2, though more 'intelligent' splits may further improve the performance. Furthermore, in all of the applications we have studied, the number of factors is sufficiently small that it is inexpensive to consider all possible splits by brute force.

4 Exploiting ‘Data Independence’ in Latent Factors

While (eq. 3) gave the general form of MAP-inference in a graphical model, it will often be more convenient to express our objective function as being conditioned upon some observation $\mathbf{y}$. Thus inference consists of solving an optimization problem of the form

$$\hat{x}(\mathbf{y}) = \operatorname*{argmax}_{x} \prod_{X \in \mathcal{C}} \Phi_X(x_X; \mathbf{y}). \qquad (14)$$

When our objective function is written in this way, further factorization is often possible, yielding an expression of the form

$$\Phi_X(x_X; \mathbf{y}) = \prod_{F \in \mathcal{F}_X} \Phi_F(x_F; \mathbf{y}) \prod_{Q \in \mathcal{Q}_X} \Phi_Q(x_Q), \qquad (15)$$

where each $F$ and $Q$ is a subset of some $X$. We shall say that those factors $\Phi_Q$ that do not depend on the observation are 'data independent'.

By far the most common instance of this type of model has 'data dependent' factors consisting of a single latent variable, each conditioned upon a single observation, and 'data independent' factors consisting of a pair of latent variables. This was precisely the class of models depicted at the beginning of our paper in Figure 2, whose objective function takes the form

$$\hat{x}(\mathbf{y}) = \operatorname*{argmax}_{x} \prod_{i \in \mathcal{N}} \Phi_i(x_i; y_i) \prod_{(i,j) \in \mathcal{E}} \Phi_{i,j}(x_i, x_j) \qquad (16)$$

(where $\mathcal{N}$ and $\mathcal{E}$ are the set of nodes and edges in our graphical model). As in Section 3, we shall concern ourselves with this version of the model, and explain only briefly how our method can be applied with larger factors, as in Section 3.2.

Note that in (eq. 16) we are no longer concerned solely with exact inference via the junction-tree algorithm. In many models, such as grids and rings, (eq. 16) shall be solved approximately by means of either loopy belief-propagation, or inference in a factor graph.

Given the decomposition of (eq. 16), message-passing now takes the form

$$m_{i \to j}(x_j) = \max_{x_i} \Phi_{i,j}(x_i, x_j)\, \Phi_i(x_i; y_i) \prod_{k \in \Gamma(i) \setminus \{j\}} m_{k \to i}(x_i) \qquad (17)$$

(where $\Gamma(i)$ returns the neighbors of the node $i$). Just as we made the comparison between (eq. 10) and matrix multiplication, we can see (eq. 17) as being related to the multiplication of a matrix ($\Phi_{i,j}$) with a vector ($\Phi_i \cdot \prod_k m_{k \to i}$), again with summation replaced by maximization. Given the results we have already shown, it is trivial to solve (eq. 17) in expected time $O(N\sqrt{N})$ if we know the permutations that sort the vector, and the rows of the matrix. The algorithm for doing so is shown in Algorithm 6. The difficulty we face in this instance is that sorting the rows of $\Phi_{i,j}$ takes $O(N^2 \log N)$, i.e., longer than Algorithm 6 itself.

This problem is circumvented due to the following simple observation: since $\Phi_{i,j}$ consists only of latent variables (and does not depend upon the observation), this 'sorting' step can take place offline, i.e., before the 'data' has been observed.

Two further observations mean that even this offline cost can often be avoided. Firstly, many models have a 'homogeneous' prior, i.e., the same prior $\Phi_{i,j}$ is shared amongst every edge (or clique) of the model. In such cases, only a single 'copy' of the prior needs to be sorted, meaning that in any model containing $\Omega(\log N)$ edges, speed improvements can be gained over the naïve implementation. Secondly, where an iterative algorithm (such as loopy belief-propagation) is to be used, the sorting step need only take place prior to the first iteration; if $\Omega(\log N)$ iterations of belief propagation are to be performed (or indeed, if the number of edges multiplied by the number of iterations is $\Omega(\log N)$), we shall again gain speed improvements even when the sorting step is done online.

In fact, the second of these conditions obviates the need for data independence altogether. In other words, in any pairwise model in which $\Omega(\log N)$ iterations of belief propagation are to be performed, the pairwise terms need to be sorted only during the first iteration. Thus these improvements apply to those models in Figure 1, so long as the number of iterations is $\Omega(\log N)$.

0:  a potential $\Phi_{i,j}(x_i, x_j)\, \Phi_i(x_i; y_i) \prod_{k \in \Gamma(i) \setminus \{j\}} m_{k \to i}(x_i)$ whose max-marginal $m_{i \to j}(x_j)$ we wish to compute, and a set of permutation functions $\{p^n\}$ such that $p^n$ sorts $\Phi_{i,j}(\cdot, n)$ (in decreasing order).
1:  compute the permutation function $p$ by sorting $\mathbf{v} = \Phi_i(\cdot; y_i) \prod_{k} m_{k \to i}(\cdot)$ {takes $O(N \log N)$}
2:  for $x_j \in \{1, \ldots, N\}$ do
3:     $\mathbf{v}' \leftarrow \Phi_{i,j}(\cdot, x_j)$
4:     $\hat{x}_i \leftarrow \text{Algorithm 2}(\mathbf{v}, \mathbf{v}', p, p^{x_j})$ {takes expected time $O(\sqrt{N})$}
5:     $m_{i \to j}(x_j) \leftarrow \mathbf{v}[\hat{x}_i] \cdot \mathbf{v}'[\hat{x}_i]$
6:  end for {expected-case $O(N\sqrt{N})$}
7:  Return:  $m_{i \to j}$
Algorithm 6 Solve (eq. 17) using Algorithm 2
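A hedged NumPy sketch of Algorithm 6 (names ours; it reuses argmax_product from Section 3, with the slices of the pairwise potential sorted ahead of time since they do not depend on the observation):

```python
import numpy as np

def pass_message(phi_ij, col_order, unary):
    """Compute m(x_j) = max_{x_i} unary[x_i] * phi_ij[x_i, x_j].

    phi_ij:    non-negative (N, N) data-independent pairwise potential.
    col_order: argsort of each column of phi_ij, precomputed offline.
    unary:     product of the data term and incoming messages at node i.
    Expected O(N * sqrt(N)) per message instead of O(N^2).
    """
    N = phi_ij.shape[0]
    p = np.argsort(-unary)  # O(N log N), done once per message
    msg = np.empty(N)
    for xj in range(N):
        xi = argmax_product(unary, phi_ij[:, xj], p, col_order[:, xj])
        msg[xj] = unary[xi] * phi_ij[xi, xj]
    return msg

# Offline step (data-independent): sort each column of the prior once.
# With a homogeneous prior, this single argsort is shared by every edge.
N = 50
phi_ij = np.random.rand(N, N)
col_order = np.argsort(-phi_ij, axis=0)
```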

4.1 Extension to Higher-Order Cliques

Just as in Section 3.2, we can extend Algorithm 6 to factors of any size, so long as the purely latent cliques contain more latent variables than those cliques that depend upon the observation. The analysis for this type of model is almost exactly the same as that presented in Section 3.2, except that any terms consisting of purely latent variables are processed offline.

As we mentioned in Section 3.2, if a model contains (non-maximal) factors of size $K$, we will gain a speed-up of $\Omega(\sqrt[K]{N})$. If in addition there is a factor (either maximal or non-maximal) consisting of purely latent variables, we can still obtain a speed-up of $\Omega(\sqrt[K+1]{N})$, since this factor merely contributes an additional term to (eq. 13). Thus when our 'data-dependent' terms contain only a single latent variable (i.e., $K = 1$), we gain a speed-up of $\Omega(\sqrt{N})$, as in Algorithm 6.

5 Performance Improvements in Existing Applications

Our results are immediately compatible with several applications that rely on inference in graphical models. As we have mentioned, our results apply to any model whose cliques decompose into lower-order terms.

Often, potentials are defined only on nodes and edges of a model. A $K^{\text{th}}$-order Markov model has a tree-width of $K$, despite often containing only pairwise relationships. Similarly, 'skip-chain CRFs' (Sutton and McCallum, 2006; Galley, 2006) and junction-trees used in SLAM applications (Paskin, 2003) often contain only pairwise terms, and may have low tree-width under reasonable conditions. In each case, if the tree-width is $K$, Algorithm 4 takes expected time $O(MN^K\sqrt{N})$ (for a model with $M$ nodes and $N$ states per node), yielding a speed-up of $\Omega(\sqrt{N})$.

Models for shape matching and pose reconstruction often exhibit similar properties (Tresadern et al., 2009; Donner et al., 2007; Sigal and Black, 2006). In each case, third-order cliques factorize into second-order terms; hence we can apply Algorithm 3 to achieve a speed-up of $\Omega(\sqrt{N})$.

Another similar model for shape matching is that of Felzenszwalb (2005); this model again contains third-order cliques, though it includes a 'geometric' term constraining all three variables. Here, the third-order term is independent of the input data, meaning that each of its rows can be sorted offline, as described in Section 4. In this case, those factors that depend upon the observation are pairwise, meaning that we achieve a speed-up of $\Omega(\sqrt[3]{N})$. Further applications of this type shall be explored in Section 6.2.

In Coughlan and Ferreira (2002), deformable shape-matching is solved approximately using loopy belief-propagation. Their model has only second-order cliques, meaning that inference takes $O(N^2)$ per iteration. Although we cannot improve upon this result, we note that we can typically do exact inference in a single iteration in expected time $O(N^{2.5})$; thus our method has the same running time as $\sqrt{N}$ iterations of the original version. This result applies to all second-order models containing a single loop (Weiss, 2000).

In McAuley et al. (2008), a model is presented for graph-matching using loopy belief-propagation; the maximal cliques for $D$-dimensional matching have size $D+1$, meaning that inference takes $O(N^{D+1})$ per iteration (it is shown to converge to the correct solution); we improve this to an expected $O(N^{D}\sqrt{N})$.

Interval graphs can be used to model resource allocation problems (Fulkerson and Gross, 1965); each node encodes a request, and overlapping requests form edges. Maximal cliques grow with the number of overlapping requests, though the constraints are only pairwise, meaning that we again achieve an $\Omega(\sqrt{N})$ improvement.

Belief-propagation can be used to solve LP-relaxations in pairwise graphical models. In Sontag et al. (2008), LP-relaxations are computed for pairwise models by constructing several third-order 'clusters', which compute pairwise messages for each of their edges. Again, an $\Omega(\sqrt{N})$ improvement is achieved.

Finally, in Section 6.2 we shall explore a variety of applications in which we have pairwise models of the form shown in (eq. 16). In all of these cases, we see an (expected) reduction of an $O(N^2)$ message-passing algorithm to $O(N^{1.5})$.

Table 3 summarizes these results. Reported running times reflect the expected case. Note that we are assuming that max-product belief-propagation is being used in a discrete model; some of the referenced articles may use different variants of the algorithm (e.g. Gaussian models, or approximate inference schemes). We believe that our improvements may revive the exact, discrete version as a tractable option in these cases.

Reference : description : running time : our method
McAuley et al. (2008) : $D$-d graph-matching : $O(N^{D+1})$ (iter.) : $O(N^{D+0.5})$ (iter.)
Sutton and McCallum (2006) : width-$K$ skip-chain : $O(MN^{K+1})$ : $O(MN^{K+0.5})$
Galley (2006) : width-3 skip-chain : $O(MN^4)$ : $O(MN^{3.5})$
Paskin (2003) (discrete case) : SLAM, width $K$ : $O(MN^{K+1})$ : $O(MN^{K+0.5})$
Tresadern et al. (2009) : deformable matching : $O(MN^3)$ : $O(MN^{2.5})$
Coughlan and Ferreira (2002) : deformable matching : $O(MN^2)$ (iter.) : $O(MN^{2.5})$ (exact)
Sigal and Black (2006) : pose reconstruction : $O(MN^3)$ : $O(MN^{2.5})$