1 Introduction
It is well known that exact inference in tree-structured graphical models can be accomplished efficiently by message-passing operations following a simple protocol making use of the distributive law (Aji and McEliece, 2000; Kschischang et al., 2001). It is also well known that exact inference in arbitrary graphical models can be solved by the junction-tree algorithm; its efficiency is determined by the size of the maximal cliques after triangulation, a quantity related to the treewidth of the graph.
Figure 1 illustrates an attempt to apply the junction-tree algorithm to some graphical models containing cycles. If the graphs are not chordal ((a) and (b)), they need to be triangulated, or made chordal (red edges in (c) and (d)). Their clique-graphs are then guaranteed to be junction-trees, and the distributive law can be applied with the same protocol used for trees; see Aji and McEliece (2000) for a beautiful tutorial on exact inference in arbitrary graphs. Although the models in this example contain only pairwise factors, triangulation has increased the size of their maximal cliques, making exact inference substantially more expensive. Hence approximate solutions in the original graph (such as loopy belief-propagation, or inference in a loopy factor-graph) are often preferred over an exact solution via the junction-tree algorithm.
Even when the model’s factors are the same size as its maximal cliques, neither exact nor approximate inference algorithms take advantage of the fact that many factors consist only of latent variables. In many models, those factors that are conditioned upon the observation contain fewer latent variables than the purely latent cliques. Examples are shown in Figure 2. This encompasses a wide variety of models, including grid-structured models for optical flow and stereo disparity as well as chain- and tree-structured models for text or speech.
[Figure 1: graphical models (a)–(d); (a) and (b) are not chordal, while (c) and (d) show their triangulated versions, with the added edges drawn in red.]
[Figure 2: models (a)–(d) in which the factors conditioned upon the observation contain fewer latent variables than the purely latent cliques.]
In this paper, we exploit the fact that the maximal cliques (after triangulation) often have potentials that factor over subcliques, as illustrated in Figure 1. We will show that whenever this is the case, the expected computational complexity of exact inference can be improved (both the asymptotic upper bound and the actual runtime).
Additionally, we will show that this result can be applied so long as those cliques that are conditioned upon an observation contain fewer latent variables than those cliques consisting of purely latent variables; the ‘purely latent’ cliques can be preprocessed offline, allowing us to achieve the same benefits as described in the previous paragraph.
We show that these properties reveal themselves in a wide variety of real applications. Both of our improvements shall increase the class of problems for which inference via max-product belief-propagation is tractable.
A core operation encountered in the junction-tree algorithm is that of finding the index that chooses the largest product amongst two lists $\mathbf{v}_a$ and $\mathbf{v}_b$, each of length $N$:
$$\hat{i} = \operatorname*{argmax}_{i\in\{1\ldots N\}} \mathbf{v}_a[i]\cdot \mathbf{v}_b[i]. \qquad (1)$$
Our results stem from the realization that while (eq. 1) appears to be a linear time operation, it can be decreased to $O(\sqrt{N})$ (in the expected case) if we know the permutations that sort $\mathbf{v}_a$ and $\mathbf{v}_b$.
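As a concrete point of reference, the operation in (eq. 1) can be computed by a naive linear scan; this sketch (function name is ours, not from the paper) is the baseline that the later algorithms improve upon:

```python
# Naive solution of (eq. 1): argmax_i v_a[i] * v_b[i], in O(N) time.
def argmax_product(v_a, v_b):
    assert len(v_a) == len(v_b)
    return max(range(len(v_a)), key=lambda i: v_a[i] * v_b[i])
```

For example, with `v_a = [0.5, 0.9, 0.1, 0.7]` and `v_b = [0.6, 0.2, 0.8, 0.9]`, the elementwise products are 0.30, 0.18, 0.08, 0.63, so the returned index is 3.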
A preliminary version of this work appeared in McAuley and Caetano (2010).
1.1 Summary of Results
A selection of the results to be presented in the remainder of this paper can be summarized as follows:

We are able to lower the asymptotic expected running time of max-product belief-propagation for any graphical model whose cliques factorize into lower-order terms.

The results obtained are exactly those that would be obtained by the traditional version of the algorithm, i.e., no approximations are used.

Our algorithm also applies whenever cliques containing an observed variable contain fewer latent variables than purely latent cliques, as in Figure 2 (meaning that certain computations can be performed offline).

For any cliques composed of pairwise factors, we obtain an expected speedup of at least $\Omega(\sqrt{N})$ (assuming $N$ states per node; $\Omega$ denotes an asymptotic lower bound).

For cliques composed of $K$-ary factors, the expected speedup generalizes accordingly, though our method is never asymptotically slower than the original solution.

The expected-case improvement is derived under the assumption that the order statistics of different factors are independent.

If the different factors have ‘similar’ order statistics, the performance will be better than the expected case.

If the different factors have ‘opposite’ order statistics, the performance will be worse than the expected case, but is never asymptotically more expensive than the traditional version of the algorithm.
Our results do not apply to every semiring, but only to those whose ‘addition’ operation defines an order (for example, $\max$ or $\min$); we also assume that under this ordering, our ‘multiplication’ operator $\otimes$ satisfies
$$a \geq b \;\Longrightarrow\; a \otimes c \,\geq\, b \otimes c. \qquad (2)$$
Thus our results certainly apply to the max-sum and min-sum semirings (as well as max-product and min-product, assuming non-negative potentials), but not to sum-product (for example). Consequently, our approach is useful for computing MAP-states, but cannot be used to compute marginal distributions. We also assume that the domain of each node is discrete.
We shall initially present our algorithm as it applies to models of the type shown in Figure 1. The more general (and arguably more useful) application of our algorithm to those models in Figure 2 shall be deferred until Section 4, where it can be seen as a straightforward generalization of our initial results.
1.2 Related Work
There has been previous work on speeding up message-passing algorithms by exploiting some type of structure in certain graphical models. For example, Kersting et al. (2009) study the case where different cliques share the same potential function. In Felzenszwalb and Huttenlocher (2006), fast message-passing algorithms are provided for cases in which the potential of a 2-clique depends only on the difference of the latent variables (which is common in some computer vision applications); they also show how the algorithm can be made faster if the graphical model is a bipartite graph. In Kumar and Torr (2006), the authors provide faster algorithms for the case in which the potentials are truncated, whereas in Petersen et al. (2008) the authors offer speedups for models that are specifically grid-like. The latter work is perhaps the most similar in spirit to ours, as it exploits the fact that certain factors can be sorted in order to reduce the search space of a certain maximization problem. In practice, this leads to linear speedups over the naïve algorithm.
Another closely related paper is that of Park and Darwiche (2003). This work can be seen to complement ours in the sense that it exploits essentially the same type of factorization that we study, though it applies to sum-product versions of the algorithm, rather than the max-product version that we shall study. Kjærulff (1998) also exploits factorization within cliques of junction-trees, albeit a different type of factorization than that studied here.
2 Background
The notation we shall use is briefly defined in Table 1. We shall assume throughout that the max-product semiring is being used, though our analysis is almost identical for any suitable choice.
Table 1 summarizes our notational conventions:

- capital letters refer to sets of nodes (or similarly, cliques);
- standard set operators are used ($\setminus$ denotes set difference);
- $\mathrm{dom}(A)$ denotes the domain of a set $A$, which is just the Cartesian product of the domains of each element in the set;
- bold capital letters refer to arrays; bold lower-case letters refer to vectors;
- vectors are indexed using square brackets; similarly, square brackets are used to index a row of a 2-d array, or a row of an $n$-dimensional array;
- superscripts are just labels (an array with a superscript is still an array, a vector with a superscript is still a vector);
- constant subscripts are also labels (a vector with a constant subscript is a constant vector);
- variable subscripts define variables; the subscript defines the domain of the variable;
- if $\mathbf{x}$ is a constant vector, then $\mathbf{x}_A$ is the restriction of that vector to those indices corresponding to variables in $A$ (assuming that $A$ is an ordered set);
- $\Phi_A(x_A)$ denotes a function over the variables in a set $A$; the argument will be suppressed if clear, given that ‘functions’ are essentially arrays for our purposes;
- $\Phi_{a,b}(a,b)$ denotes a function over a pair of variables;
- if one argument to a function is constant, then it becomes a function over fewer variables (e.g., in $\Phi_{a,b}(a_0,b)$ only $b$ is free).
MAP-inference in a graphical model $\mathcal{G}$ consists of solving an optimization problem of the form
$$\hat{x} = \operatorname*{argmax}_{x} \prod_{C\in\mathcal{C}} \Phi_C(x_C), \qquad (3)$$
where $\mathcal{C}$ is the set of maximal cliques in $\mathcal{G}$. This problem is often solved via message-passing algorithms such as the junction-tree algorithm, loopy belief-propagation, or inference in a factor graph (Aji and McEliece, 2000; Weiss, 2000; Kschischang et al., 2001).
Two of the fundamental steps encountered in message-passing algorithms are defined below. Firstly, the message from a clique $A$ to an intersecting clique $B$ is defined by
$$m_{A\to B}(x_{A\cap B}) = \max_{x_{A\setminus B}} \Phi_A(x_A) \prod_{C\in\Gamma(A)\setminus\{B\}} m_{C\to A}(x_{C\cap A}) \qquad (4)$$
(where $\Gamma(A)$ returns the neighbors of the clique $A$). If such messages are computed after $A$ has received messages from all of its neighbors except $B$, then this defines precisely the update scheme used by the junction-tree algorithm. The same update scheme is used for loopy belief-propagation, though it is done iteratively in a randomized fashion.
Secondly, after all messages have been passed, the MAP-states for a subset of nodes $A$ (assumed to belong to a clique $C$) are computed using
$$\hat{x}_A = \operatorname*{argmax}_{x_A} \max_{x_{C\setminus A}} \Phi_C(x_C) \prod_{D\in\Gamma(C)} m_{D\to C}(x_{D\cap C}). \qquad (5)$$
Often, the clique-potential $\Phi_C(x_C)$ shall be decomposable into several smaller factors, i.e.,
$$\Phi_C(x_C) = \prod_{F\subseteq C} \Phi_F(x_F). \qquad (6)$$
Some simple motivating examples are shown in Figure 3: a model for pose estimation from Sigal and Black (2006), a ‘skip-chain CRF’ from Galley (2006), and a model for shape matching from Coughlan and Ferreira (2002). In each case, the triangulated model has third-order cliques, but the potentials are only pairwise. Other examples have already been shown in Figure 1; analogous cases are ubiquitous in many real applications.




[Figure 3: (a) pose estimation (Sigal and Black, 2006); (b) a skip-chain CRF (Galley, 2006); (c) shape matching (Coughlan and Ferreira, 2002).]
The optimizations we suggest shall apply to general problems of the form
$$m_A(x_A) = \max_{x_{C\setminus A}} \prod_{F\subseteq C} \Phi_F(x_F), \qquad (7)$$
which subsumes both (eq. 4) and (eq. 5), where we simply treat the messages as factors of the model. Algorithm 1 gives the traditional solution to this problem, which does not exploit the factorization of $\Phi_C$. This algorithm runs in $O(N^K)$, where $N$ is the number of states per node and $K$ is the size of the clique (we assume that for a given $x$, computing $\Phi_F(x_F)$ takes constant time, as our optimizations shall not modify this cost).
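As a reference point, the brute-force computation of (eq. 7) for a three-clique with pairwise factors might be sketched as follows (function and variable names are ours, not from the paper):

```python
import itertools

# Brute-force max-marginal m(a) = max_{b,c} f_ab[a][b]*f_bc[b][c]*f_ac[a][c],
# i.e., the traditional Algorithm-1-style solution specialized to a clique
# of size K = 3 with pairwise factors and N states per node: O(N^3) overall.
def max_marginal_naive(f_ab, f_bc, f_ac, N):
    m = []
    for a in range(N):
        m.append(max(f_ab[a][b] * f_bc[b][c] * f_ac[a][c]
                     for b, c in itertools.product(range(N), repeat=2)))
    return m
```

Note that the factorization of the potential is not exploited here at all; the sections below show how the same quantity can be computed faster in expectation.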
3 Optimizing Algorithm 1
In order to specify a more efficient version of Algorithm 1, we begin by considering the simplest possible nontrivial factorization: a clique of size three containing pairwise factors. In such a case, our aim is to compute
$$m_\Phi(a) = \max_{b,c} \Phi(a,b,c), \qquad (8)$$
which we have assumed takes the form
$$\Phi(a,b,c) = \Phi_{a,b}(a,b)\,\Phi_{b,c}(b,c)\,\Phi_{a,c}(a,c). \qquad (9)$$
For a particular value of $(a,b)$, we must solve
$$\max_{c}\, \Phi_{a,c}(a,c)\,\Phi_{b,c}(b,c), \qquad (10)$$
which we note is in precisely the form shown in (eq. 1).
There is a close resemblance between (eq. 10) and the problem of multiplying two matrices: if the ‘max’ in (eq. 10) is replaced by summation, we essentially recover traditional matrix multiplication. While traditional matrix multiplication is well known to have a subcubic worst-case solution (see Strassen, 1969), the version in (eq. 10) (often referred to as ‘funny matrix multiplication’, or simply ‘max-product matrix multiplication’) is known to be cubic in the worst case, assuming that only multiplication and comparison operations are used (Kerr, 1970). The complexity of solving (eq. 10) can also be shown to be equivalent to that of the all-pairs shortest-path problem, which is studied in Alon et al. (1997). Not surprisingly, we shall not improve the worst-case complexity, but shall instead give far better expected-case performance than existing solutions. Just as Strassen’s algorithm can be used to solve (eq. 10) when maximization is replaced by summation, there has been work studying the problem of sum-product inference in graphical models, subject to the same type of factorization we discuss (Park and Darwiche, 2003).
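For concreteness, ‘funny’ matrix multiplication can be sketched in its plain cubic form as follows (names are ours; (eq. 10) corresponds to this with the second matrix transposed):

```python
# Max-product matrix multiplication: C[i][j] = max_k A[i][k] * B[k][j].
# Each entry requires an O(N) maximization, giving O(N^3) overall.
def funny_matmul(A, B):
    N = len(A)
    return [[max(A[i][k] * B[k][j] for k in range(N))
             for j in range(N)] for i in range(N)]
```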
As we have previously suggested, it will be possible to solve (eq. 10) efficiently if $\Phi_{a,c}$ and $\Phi_{b,c}$ are already sorted. We note that the row $\Phi_{a,c}(a,\cdot)$ will be reused for every value of $b$, and likewise that $\Phi_{b,c}(b,\cdot)$ will be reused for every value of $a$. Sorting every row of $\Phi_{a,c}$ and $\Phi_{b,c}$ can be done in $O(N^2\log N)$ (for $N$ rows of length $N$).
The following elementary lemma is the key observation required in order to solve (eq. 10) efficiently:
Lemma 1.
If the $k$th largest element of $\mathbf{v}_a$ has the same index as the $q$th largest element of $\mathbf{v}_b$, then we only need to search through the $k$ largest values of $\mathbf{v}_a$, and the $q$ largest values of $\mathbf{v}_b$; any corresponding pair of smaller values could not possibly be the largest solution.
This observation is used to construct Algorithm 2. Here we iterate through the indices starting from the largest values of and , stopping once both indices are ‘behind’ the maximum value found so far (which we then know is the maximum). This algorithm is demonstrated pictorially in Figure 4.
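The search described above might be sketched as follows (our own rendering of the idea behind Algorithm 2, assuming the permutations `p_a` and `p_b` that sort the two lists in decreasing order are supplied):

```python
# Expected O(sqrt(N)) search for argmax_i v_a[i] * v_b[i], given the
# permutations p_a, p_b listing indices of v_a, v_b in decreasing order.
def argmax_product_sorted(v_a, v_b, p_a, p_b):
    best_i = p_a[0]
    best = v_a[best_i] * v_b[best_i]
    for k in range(len(v_a)):
        i, j = p_a[k], p_b[k]  # current 'heads' of the two sorted orders
        for cand in (i, j):
            prod = v_a[cand] * v_b[cand]
            if prod > best:
                best, best_i = prod, cand
        # Any index unseen in both lists has v_a <= v_a[i] and v_b <= v_b[j],
        # so once the best product dominates v_a[i]*v_b[j] we may stop.
        if best >= v_a[i] * v_b[j]:
            break
    return best_i
```

Correctness relies only on the bound that any index not yet seen in either list has product at most `v_a[i] * v_b[j]`, so the returned index always matches the naive linear scan.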
A prescription of how Algorithm 2 can be used to solve (eq. 8) is given in Algorithm 3. Determining precisely the running time of Algorithm 2 (and therefore Algorithm 3) is not trivial, and will be explored in depth in Appendix A. We note that if the expected-case running time of Algorithm 2 is $O(f(N))$, then the time taken to solve Algorithm 3 shall be $O(N^2 f(N))$. At this stage we shall state an upper bound on the true complexity in the following theorem:
Theorem 1.
The expected running time of Algorithm 2 is $O(\sqrt{N})$, yielding a speedup of at least $\Omega(\sqrt{N})$ in cliques containing pairwise factors.
3.1 An Extension to Higher-Order Cliques with Three Factors
The simplest extension that we can make to Algorithms 2 and 3 is to note that they can be applied even when there are several overlapping terms in the factors. For instance, Algorithm 3 can be adapted to solve
$$m_\Phi(c,d) = \max_{a,b}\, \Phi_{a,b,c}(a,b,c)\,\Phi_{a,b,d}(a,b,d)\,\Phi_{c,d}(c,d), \qquad (11)$$
and similar variants containing three factors. Here both $a$ and $b$ are shared by $\Phi_{a,b,c}$ and $\Phi_{a,b,d}$. We can follow precisely the reasoning of the previous section, except that when we sort $\Phi_{a,b,c}$ (similarly $\Phi_{a,b,d}$) for a fixed value of $c$ (similarly $d$), we are now sorting an array rather than a vector (Algorithm 3, Lines 2 and 3); in this case, the permutation functions in Algorithm 2 simply return pairs of indices. This is illustrated in Figure 5. Effectively, in this example we are sorting the variable $(a,b)$, which has a state space of size $N^2$.
As the number of shared terms increases, so does the improvement to the running time. While (eq. 11) would take $O(N^4)$ to solve using Algorithm 1, it takes only $O(N^3)$ to solve using Algorithm 3 (more precisely, if Algorithm 2 takes $O(f(N))$ on lists of length $N$, then (eq. 11) takes $O(N^2 f(N^2))$, which we have mentioned is $O(N^2\sqrt{N^2}) = O(N^3)$). In general, if we have $S$ shared terms, we obtain a speedup of $\Omega(\sqrt{N^S})$ over the naïve solution of Algorithm 1.
3.2 An Extension to Higher-Order Cliques with Decompositions Into Three Groups
By similar reasoning, we can apply our algorithm to cases where there are more than three factors, in which the factors can be separated into three groups. For example, consider the clique in Figure 6(a) (the entire graph is a clique, but for clarity we only draw an edge when the corresponding nodes belong to a common factor). Each of the factors in this graph has been labeled using either differently colored edges (for factors of size larger than two) or dotted edges (for factors of size two), and the max-marginal we wish to compute has been labeled using colored nodes. We assume that it is possible to split this graph into three groups such that every factor is contained within a single group, along with the max-marginal we wish to compute (Figure 6, (b)). If such a decomposition is not possible, we will have to resort to further extensions to be described in Section 3.3.



[Figure 6: (a) We begin with a set of factors (indicated using colored lines), which are assumed to belong to some clique in our model; we wish to compute the max-marginal with respect to one of these factors (indicated using colored nodes). (b) The factors are split into three groups, such that every factor is entirely contained within one of them (Algorithm 4, line 1). (c) Any nodes contained in only one of the groups are marginalized (Algorithm 4, lines 2, 3, and 4); the problem is now very similar to that described in Algorithm 3, except that nodes have been replaced by groups; note that this essentially introduces maximal factors in the second and third groups. (d) For each value of the shared variables, the second group’s factor is sorted (Algorithm 4, lines 5–7). (e) Likewise, the third group’s factor is sorted (Algorithm 4, lines 8–10). (f) For every value in the first group, we choose the best value of the remaining variables by Algorithm 2 (Algorithm 4, lines 11–16). (g) The result is marginalized with respect to the variables not appearing in the desired max-marginal (Algorithm 4, line 17).]
Ideally, we would like these groups to be of equal size, though in the worst case they will have size no larger than $K-2$ (for a clique of size $K$). We call these groups $X$, $Y$, and $Z$, where $X$ is the group containing the max-marginal that we wish to compute. In order to simplify the analysis of this algorithm, we shall express the running time in terms of the size of the largest group and the largest difference between the clique and a group. The max-marginal can be computed using Algorithm 4.
The running times shown in Algorithm 4 are loose upper bounds, given for the sake of expressing the running time in simple terms. More precise running times are given in Table 2; any of the terms shown in Table 2 may be dominant. Some example graphs and their resulting running times are shown in Figure 7.
Description  lines
Marginalization of $X$, without recursion  2
Marginalization of $Y$  3
Marginalization of $Z$  4
Sorting $Y$  5–7
Sorting $Z$  8–10
Running Algorithm 2 on the sorted values  11–16
[Figure 7: example graphs (a)–(e), including a complete graph with pairwise terms, together with the running times of Algorithm 1 and Algorithm 4 and the resulting speedups.]
3.2.1 Applying Algorithm 4 Recursively
The marginalization steps of Algorithm 4 (Lines 2, 3, and 4) may further decompose into smaller groups, in which case Algorithm 4 can be applied recursively. For instance, the graph in Figure 7(a) represents the marginalization step that is to be performed in Figure 6(c) (Algorithm 4, Line 4). Since this marginalization step is the asymptotically dominant step in the algorithm, applying Algorithm 4 recursively lowers the asymptotic complexity.
Another straightforward example of applying recursion in Algorithm 4 is shown in Figure 8, in which a ring-structured model is marginalized with respect to two of its nodes. For a ring of $M$ nodes with $N$ states each, doing so takes $O(MN^2\sqrt{N})$; in contrast, solving the same problem using the junction-tree algorithm (by triangulating the graph) would take $O(MN^3)$. Loopy belief-propagation takes $O(MN^2)$ per iteration, meaning that our algorithm will be faster if the number of iterations is $\Omega(\sqrt{N})$. Naturally, Algorithm 3 could be applied directly to the triangulated graph, which would again take $O(MN^2\sqrt{N})$.
3.3 A General Extension to Higher-Order Cliques
Naturally, there are cases for which a decomposition into three terms is not possible, such as
$$\Phi(a,b,c,d) = \Phi_{a,b,c}(a,b,c)\,\Phi_{a,b,d}(a,b,d)\,\Phi_{a,c,d}(a,c,d)\,\Phi_{b,c,d}(b,c,d) \qquad (12)$$
(i.e., a clique of size four with third-order factors). However, if the model contains factors of size $K-1$ (in a clique of size $K$), it must always be possible to split it into $K$ groups (e.g., four in the case of (eq. 12)).
Our optimizations can easily be applied in these cases simply by adapting Algorithm 2 to solve problems of the form
$$\hat{i} = \operatorname*{argmax}_{i\in\{1\ldots N\}}\; \prod_{m=1}^{M} \mathbf{v}_m[i]. \qquad (13)$$
Pseudocode for this extension is presented in Algorithm 5. Note carefully the use of the ‘read’ flags: we store which indices have already been read in order to avoid re-reading them; this guarantees that our algorithm is never asymptotically worse than the naïve solution. Figure 9 demonstrates how such an algorithm behaves in practice. Again, we shall discuss the running time of this extension in Appendix A. For the moment, we state the following theorem:
Theorem 2.
Using Algorithm 5, we can similarly extend Algorithm 4 to allow for any number of groups (pseudocode is not shown; all statements about the groups $Y$ and $Z$ simply become statements about $M$ groups, and calls to Algorithm 2 become calls to Algorithm 5). The one remaining case that has not been considered is when the sequences are functions of different (but overlapping) variables; naïvely, we can create a new variable whose domain is the product space of all of the overlapping terms, and still achieve the performance improvement guaranteed by Theorem 2; in some cases, better results can again be obtained by applying recursion, as in Figure 7.
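A sketch of this multi-list generalization (our own rendering, not the paper’s pseudocode; the set named `read` plays the role of the flags described above, and the stopping bound is the product of the current heads of the sorted orders):

```python
import math

# Generalization of the two-list search to M lists: find
# argmax_i prod_m vs[m][i], given perms[m] listing the indices of
# vs[m] in decreasing order.  Entries are assumed non-negative, as
# in max-product.  The 'read' set ensures each index is examined at
# most once, so the scan never exceeds the naive O(M*N) cost.
def argmax_product_multi(vs, perms):
    N = len(vs[0])
    value = lambda i: math.prod(v[i] for v in vs)
    read = set()
    best_i = perms[0][0]
    best = value(best_i)
    for k in range(N):
        bound = 1.0
        for v, p in zip(vs, perms):
            i = p[k]
            bound *= v[i]
            if i not in read:
                read.add(i)
                if value(i) > best:
                    best, best_i = value(i), i
        # indices unseen in every list are bounded above by 'bound'
        if best >= bound:
            break
    return best_i
```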
As a final comment, we note that we have not provided an algorithm for choosing how to split the variables of a model into groups. We note that even if we split the groups in a naïve way, we are guaranteed to get at least the performance improvement guaranteed by Theorem 2, though more ‘intelligent’ splits may further improve the performance. Furthermore, in all of the applications we have studied, the number of factors is sufficiently small that it is inexpensive to consider all possible splits by brute-force.
4 Exploiting ‘Data Independence’ in Latent Factors
While (eq. 3) gave the general form of MAP-inference in a graphical model, it will often be more convenient to express our objective function as being conditioned upon some observation, $z$. Thus inference consists of solving an optimization problem of the form
$$\hat{x}(z) = \operatorname*{argmax}_{x} \prod_{C\in\mathcal{C}} \Phi_C(x_C \mid z). \qquad (14)$$
When our objective function is written in this way, further factorization is often possible, yielding an expression of the form
$$\hat{x}(z) = \operatorname*{argmax}_{x} \prod_{C\in\mathcal{C}} \prod_{F\subseteq C} \Phi_F(x_F \mid z), \qquad (15)$$
where each $F$ is a subset of some $C\in\mathcal{C}$. We shall say that those factors that do not depend on the observation are ‘data independent’.
By far the most common instance of this type of model has ‘data dependent’ factors consisting of a single latent variable conditioned upon a single observation, and ‘data independent’ factors consisting of a pair of latent variables. This was precisely the class of models depicted at the beginning of our paper in Figure 2, whose objective function takes the form
$$\hat{x}(z) = \operatorname*{argmax}_{x} \prod_{i\in\mathcal{N}} \Phi_i(x_i \mid z_i) \prod_{(i,j)\in\mathcal{E}} \Phi_{i,j}(x_i, x_j) \qquad (16)$$
(where $\mathcal{N}$ and $\mathcal{E}$ are the sets of nodes and edges in our graphical model). As in Section 3, we shall concern ourselves with this version of the model, and explain only briefly how it can be applied with larger factors, as in Section 3.2.
Note that in (eq. 16) we are no longer concerned solely with exact inference via the junction-tree algorithm. In many models, such as grids and rings, (eq. 16) shall be solved approximately by means of either loopy belief-propagation, or inference in a factor graph.
Given the decomposition of (eq. 16), message-passing now takes the form
$$m_{i\to j}(x_j) = \max_{x_i}\, \Phi_i(x_i \mid z_i)\,\Phi_{i,j}(x_i,x_j) \prod_{k\in\Gamma(i)\setminus\{j\}} m_{k\to i}(x_i) \qquad (17)$$
(where $\Gamma(i)$ denotes the neighbors of node $i$). Just as we made the comparison between (eq. 10) and matrix multiplication, we can see (eq. 17) as being related to the multiplication of a matrix ($\Phi_{i,j}$) with a vector (the unary term, after absorbing the incoming messages), again with summation replaced by maximization. Given the results we have already shown, it is trivial to solve (eq. 17) in $O(N\sqrt{N})$ if we know the permutations that sort the unary term and the rows of $\Phi_{i,j}$. The algorithm for doing so is shown in Algorithm 6. The difficulty we face in this instance is that sorting the rows of $\Phi_{i,j}$ takes $O(N^2\log N)$, i.e., longer than Algorithm 6 itself.
This problem is circumvented due to the following simple observation: since $\Phi_{i,j}$ consists only of latent variables (and does not depend upon the observation), this ‘sorting’ step can take place offline, i.e., before the ‘data’ has been observed.
Two further observations mean that even this offline cost can often be avoided. Firstly, many models have a ‘homogeneous’ prior, i.e., the same prior is shared amongst every edge (or clique) of the model. In such cases, only a single ‘copy’ of the prior needs to be sorted, meaning that in any model containing $\Omega(\log N)$ edges, speed improvements can be gained over the naive implementation. Secondly, where an iterative algorithm (such as loopy belief-propagation) is to be used, the sorting step need only take place prior to the first iteration; if $\Omega(\log N)$ iterations of belief propagation are to be performed (or indeed, if the number of edges multiplied by the number of iterations is $\Omega(\log N)$), we shall again gain speed improvements even when the sorting step is done online.
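Under these assumptions, the offline/online split might look as follows (a hedged sketch in the spirit of Algorithm 6; all names are ours, and the two-list search is the same one used throughout):

```python
# Offline: for each value b, sort the indices a by decreasing prior[a][b].
# For a homogeneous prior, this is done once and shared by every edge.
def sort_prior_offline(prior):
    N = len(prior)
    return [sorted(range(N), key=lambda a: -prior[a][b]) for b in range(N)]

# Max of an elementwise product of two lists, given their sorting
# permutations (decreasing order); expected O(sqrt(N)) per call.
def max_product(v_a, v_b, p_a, p_b):
    best = v_a[p_a[0]] * v_b[p_a[0]]
    for k in range(len(v_a)):
        i, j = p_a[k], p_b[k]
        best = max(best, v_a[i] * v_b[i], v_a[j] * v_b[j])
        if best >= v_a[i] * v_b[j]:  # unseen pairs cannot beat this bound
            break
    return best

# Online: compute the message m(b) = max_a psi[a] * prior[a][b] in
# expected O(N*sqrt(N)) rather than O(N^2).
def message(psi, prior, cols_sorted):
    N = len(psi)
    p_psi = sorted(range(N), key=lambda a: -psi[a])
    return [max_product(psi, [prior[a][b] for a in range(N)],
                        p_psi, cols_sorted[b]) for b in range(N)]
```

Note that extracting each column of the prior is itself $O(N)$ here; a real implementation would store the transposed prior so that only the expected $O(\sqrt{N})$ entries per column are ever touched.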
In fact, the second of these conditions obviates the need for data independence altogether. In other words, in any pairwise model in which $\Omega(\log N)$ iterations of belief propagation are to be performed, the pairwise terms need to be sorted only during the first iteration. Thus these improvements apply to those models in Figure 1, so long as the number of iterations is $\Omega(\log N)$.
4.1 Extension to Higher-Order Cliques
Just as in Section 3.2, we can extend Algorithm 6 to factors of any size, so long as the purely latent cliques contain more latent variables than those cliques that depend upon the observation. The analysis for this type of model is almost exactly the same as that presented in Section 3.2, except that any terms consisting of purely latent variables are processed offline.
As we mentioned in Section 3.2, a model whose cliques contain non-maximal factors yields a speedup determined by the sizes of those factors. If in addition there is a factor (either maximal or non-maximal) consisting of purely latent variables, we can still obtain the same speedup, since this factor merely contributes an additional term to (eq. 13). Thus when our ‘data-dependent’ terms contain only a single latent variable, we gain a speedup of $\Omega(\sqrt{N})$, as in Algorithm 6.
5 Performance Improvements in Existing Applications
Our results are immediately compatible with several applications that rely on inference in graphical models. As we have mentioned, our results apply to any model whose cliques decompose into lower-order terms.
Often, potentials are defined only on the nodes and edges of a model. A $K$th-order Markov model has a treewidth of $K$, despite often containing only pairwise relationships. Similarly, ‘skip-chain CRFs’ (Sutton and McCallum, 2006; Galley, 2006) and junction-trees used in SLAM applications (Paskin, 2003) often contain only pairwise terms, and may have low treewidth under reasonable conditions. In each case, if the treewidth is $K$, Algorithm 4 yields an expected speedup over the naïve solution that grows with $N$ (for a model with $N$ states per node).
Models for shape matching and pose reconstruction often exhibit similar properties (Tresadern et al., 2009; Donner et al., 2007; Sigal and Black, 2006). In each case, third-order cliques factorize into second-order terms; hence we can apply Algorithm 3 to achieve a speedup of $\Omega(\sqrt{N})$.
Another similar model for shape matching is that of Felzenszwalb (2005); this model again contains third-order cliques, though it includes a ‘geometric’ term constraining all three variables. Here, the third-order term is independent of the input data, meaning that each of its rows can be sorted offline, as described in Section 4. In this case, those factors that depend upon the observation are pairwise, meaning that we again achieve a speedup of $\Omega(\sqrt{N})$. Further applications of this type shall be explored in Section 6.2.
In Coughlan and Ferreira (2002), deformable shape-matching is solved approximately using loopy belief-propagation. Their model has only second-order cliques, meaning that inference takes $O(MN^2)$ per iteration (for $M$ nodes with $N$ states each). Although we cannot improve upon this result, we note that we can typically do exact inference in a single iteration in $O(MN^2\sqrt{N})$; thus our method has the same running time as $\sqrt{N}$ iterations of the original version. This result applies to all second-order models containing a single loop (Weiss, 2000).
In McAuley et al. (2008), a model is presented for graph-matching using loopy belief-propagation; the maximal cliques for $D$-dimensional matching have size $D+1$, meaning that inference takes $O(N^{D+1})$ per iteration (it is shown to converge to the correct solution); we improve this to $O(N^D\sqrt{N})$.
Interval graphs can be used to model resource allocation problems (Fulkerson and Gross, 1965); each node encodes a request, and overlapping requests form edges. Maximal cliques grow with the number of overlapping requests, though the constraints are only pairwise, meaning that we again achieve an improvement.
Belief-propagation can be used to solve LP-relaxations in pairwise graphical models. In Sontag et al. (2008), LP-relaxations are computed for pairwise models by constructing several third-order ‘clusters’, which compute pairwise messages for each of their edges. Again, an improvement is achieved.
Finally, in Section 6.2 we shall explore a variety of applications in which we have pairwise models of the form shown in (eq. 16). In all of these cases, we see an (expected) reduction of an $O(N^2)$ message-passing algorithm to $O(N\sqrt{N})$.
Table 3 summarizes these results. Reported running times reflect the expected case. Note that we are assuming that max-product belief-propagation is being used in a discrete model; some of the referenced articles may use different variants of the algorithm (e.g., Gaussian models, or approximate inference schemes). We believe that our improvements may revive the exact, discrete version as a tractable option in these cases.
Reference  description
McAuley et al. (2008)  $D$-d graph-matching (per iteration)
Sutton and McCallum (2006)  Width-$K$ skip-chain
Galley (2006)  Width-3 skip-chain
Paskin (2003) (discrete case)  SLAM, width $K$
Tresadern et al. (2009)  Deformable matching
Coughlan and Ferreira (2002)  Deformable matching (per iteration)
Sigal and Black (2006)  Pose reconstruction