1 Introduction
Most real complex systems are made up or modeled by elementary objects that locally interact with each other. Graphical models (Bishop, 2006; Koller and Friedman, 2009; Murphy, 2012) are formed by variables linked to each other by stochastic relationships. They enable to model dependencies in possibly highdimensional heterogeneous data and to capture uncertainty. Graphical models have been applied in a wide range of areas like image analysis, speech recognition, bioinformatics, ecology to name a few.
In real applications a large number of random variables with a complex dependency structure are involved. As a consequence, inference tasks such as the calculation of a normalization constant, a marginal distribution or the mode of the joint distribution are challenging. Three main approaches exist to evaluate such quantities for a given distribution
defining a graphical model: compute them in an exact manner; use a stochastic algorithm to sample from the distributionto get (unbiased) estimates;
derive an approximation of for which the exact calculation is possible. Even if appealing, exact computation on often leads to very time and memory consuming procedures, since the number of elements to store or elementary operations to perform increase exponentially withthe number of random variables. The second approach is probably the most widely used by statisticians and modelers. Stochastic algorithms such as MonteCarlo Markov Chains
(MCMC, Robert and Casella, 2004), Gibbs sampling (Casella and George, 1992) and particle filtering (Gordon and Smith, 1993) have become standard tools in many fields of application using statistical models. The last approach includes variational approximation techniques (Wainwright and Jordan, 2008), which are starting to become common practice in computational statistics. In essence, approaches of type provide an approximate answer to an exact problem whereas approaches of type provide an exact answer to an approximate problem.In this paper we focus on approaches of type and , and we will review techniques for exact or approximate inference in graphical models borrowed from both optimization and computer science. They are computationally efficient, yet not standard in the statistician toolkit. Our purpose is to show that the characterization of the structure of the graph associated to a graphical model (precise definitions are given in Section 2) enables both to determine if the exact calculation of the quantities of interest (marginal distribution, normalization constant, mode) can be implemented efficiently and to derive a class of operational algorithms. When the answer is no, the same analysis enables to design algorithms to compute an approximation of the desired quantities for which an acceptable complexity can be obtained.
The central algorithmic tool is the variable elimination concept (Bertelé and Brioshi, 1972). In Section 3 we adopt a unified algebraic presentation of the different inference tasks (marginalization, normalizing constant or mode evaluation) to emphasize that all of them can be solved as particular cases of variable elimination. This implies that if variable elimination is efficient for one task it will also be efficient for the other ones. The key ingredient to design efficient algorithms based on variable elimination is the clever use of distributivity between algebraic operators. For instance distributivity of the product () over the sum () enables to write and evaluating the lefthand side of this equality requires two multiplications and one addition while evaluating the righthand side requires one multiplication and one addition. Similarly since
it is more efficient to compute the righthand side from an algorithmic point of view. Distributivity enables to minimize the number of operations. Associativity and commutativity are also required and the algebra behind is the semiring category (from which some notations will be borrowed). Inference algorithms using the distributivity property have been known and published in the Artificial Intelligence and Machine Learning literature under different names, such as sumprod, or maxsum
(Pearl, 1988; Bishop, 2006) and are examples of variable elimination.Variable elimination relies on the choice of an order of elimination of the variables (either by marginalization or by maximization). This corresponds to the ordering calculations are performed when applying distributivity. The topology of the graph provides key information to organize the calculations for an optimal use of distributivity, i.e. to minimize the number of elementary operations to perform. For example, when the graph is a tree, the most efficient elimination order corresponds to eliminating recursively the vertices of degree one, starting from the leaves towards the root. For an arbitrary graph, the notion of an optimal elimination order for inference in a graphical model is closely linked to the notion of treewidth of the associated graph . We will see in Section 3 the reason why inference algorithms based on variable elimination with the best elimination order are of complexity linear in but exponential in the treewidth. Therefore treewidth is the key characterization of to determine if exact inference is possible in practice or not.
The concept of treewidth has been proposed in parallel in computer science (Bodlaender, 1994) and in discrete mathematics and graph minor theory (see Robertson and Seymour, 1986; Lovász, 2005). Discrete mathematics existence theorems (Robertson and Seymour, 1986) establish that there exists an algorithm for computing the treewidth of any graph with complexity polynomial in (but exponential in the treewidth), and even the degree of the polynomial is given. However this result does not tell how to derive and implement the algorithm, apart from some specific cases (as trees, chordal graphs, and seriesparallel graphs, see Duffin (1965). So we will also present in Section 4 several stateoftheart algorithms for approximate evaluation of the treewidth and illustrate their behavior on benchmarks borrowed from optimization competitions.
Variable elimination has also lead to message passing algorithms (Pearl, 1988) which are now common tools in computer science or machine learning. More recently these algorithms have been reinterpreted as reparametrization tools (Koller and Friedman, 2009). We will explain in Section 5 how reparametrization can be used as a preprocessing tool to transform the original graphical model into an equivalent one for which inference may been simpler. Message passing is not the only way to perform reparametrization and we will discuss alternative efficient algorithms that have been proposed in the context of constraint satisfaction problems (CSP, see (Rossi et al., 2006)) and that have not yet been exploited in the context of graphical models.
As emphasized above, efficient exact inference algorithms can only be designed for graphical models with limited treewidth (much less than the number of vertices), which is a far from being the general case. But the principle of variable elimination and message passing for a tree can still be applied to any graph leading then to heuristic inference algorithms. The most famous heuristics is the Loopy Belief Propagation algorithm
(Kschischang et al., 2001) We recall in Section 6 the result that establishes LBP as a variational approximation method. Variational methods rely on the choice of a distribution which renders inference easier, to approximate the original complex graphical model . The approximate distribution is chosen within a class of models for which efficient inference algorithms exist, that is models with small treewidth (0, 1 or 2 in practice). We review some of the standard choices and we illustrate on the problem of parameter estimation in coupled Hidden Markov Model (Ghahramani and Jordan, 1997) how variational methods have been applied in practice with different approximate distributions, each of them corresponding to a different underlying treewidth (Section 7).2 Graphical Models
2.1 Models definition
Consider a a stochastic system defined by a set of random variables . Each variable takes values in . Then, a realization of is a set , with . The set of all possible realizations is called the state space, and is denoted . If is a subset of , , and are respectively the subset of random variables , the set of realizations and the state space of . If
is the joint probability distribution of
on , we denoteNote that we focus here on discrete variables (we will discuss inference in the case of continuous variables on examples in Section 8). A joint distribution on is said to be a probabilistic graphical model (Lauritzen, 1996; Bishop, 2006; Koller and Friedman, 2009) indexed on a set of parts of if there exists a set of maps from to , called potential functions, indexed by such that can be expressed in the following factorized form:
(1) 
where is the normalizing constant, also called partition function. The elements are the scopes of the potential functions and is the arity of the potential function . The set of scopes of all the potential functions involving variable is denoted .
One desirable property of graphical models is that of Markov local independence: if can be expressed as (1) then a variable is (stochastically) independent of all others in conditionally to the set of variables . This set is called the Markov blanket of , or its neighborhood. We will denote it . These conditional independences can be represented graphically, by a graph with one vertex per variable in . The question of encoding the independence properties associated with a given distribution into a graph structure is well described in (Koller and Friedman, 2009), and we will not discuss it here. We will consider the classical graph associated to a decomposition of the form (1) where an edge is drawn between two vertices and if there exists such that and are in . Such a representation of a graphical model is actually not as rich as the representation (1). For instance, if , the two cases and are represented by the same graph , namely a clique of size 3. The factor graph representation goes beyond this limit: this graphical representation is a bipartite graph with one vertex per potential function and one vertex per variable. edges are only between functions and variables. An edge is present between a function vertex (also called factor vertex) and a variable vertex if the variable is in the scope of the potential function. Figure 1 displays examples of the two graphical representations.
There exists several families of probabilistic graphical models (Koller and Friedman, 2009; Murphy, 2012). They can be grouped into directed and undirected ones. The most classical directed framework is that of Bayesian network (Pearl, 1988; Jensen and Nielsen, 2007)
. In a Bayesian network, an edge is directed from a parent vertex to a child vertex and potential functions are conditional probabilities of a variable given its parents in the graph (see Figure
1 (a)). In such models, trivially . Undirected probabilistic graphical models (see Figure 1 (c)) are equivalent to Markov Random Fields (Li, 2001) as soon as the potential functions are in . In a Markov random field (MRF), a potential function is not necessarily a probability distribution: is not required to be normalized.Deterministic Graphical models.
Although the terminology of ’Graphical Models’ is often used to refer to stochastic graphical models, the idea of describing a joint distribution on a set of variables through local functions has also been used in Artificial Intelligence to concisely describe Boolean functions or cost functions, with no normalization constraint. In a graphical model with only Boolean (0/1) potential functions, each potential function describes a constraint between variables. If the potential function takes value 1, the corresponding realization is said to satisfy the constraint. The graphical model is known as a ’Constraint Network’. It describes a joint Boolean function on all variables that takes value 1 if and only if all constraints are satisfied. The problem of finding a realization that satisfies all the constraints, called a solution of the network, is the ’Constraint Satisfaction Problem’ (CSP) (Rossi et al., 2006)
. This framework is used to model and solve combinatorial optimization problems and there is a variety of software tools to solve it. When variables are Boolean too and when the Boolean functions are described as disjunctions of variables or of their negation, the CSP reduces to the ’Boolean Satisfiability’ problem (or SAT), the seminal NPcomplete problem
(Cook, 1971).CSP have been extended to describe joint cost functions, decomposed as a sum of local cost functions in the ‘Weighted Constraint Network’ (Rossi et al., 2006) or ‘Cost Function Network’. In this case, potential functions take finite or infinite integer or rational values: infinity enables to express hard constraints while finite values encode costs for unsatisfied soft constraints. The problem of finding a realization of minimum cost is the ’Weighted Constraint Satisfaction Problem’ (WCSP), which is also NPhard. It is easy to observe that any stochastic graphical model can be translated in a weighted constraint network using a simple transformation. With this equivalence, it becomes possible to use exact WCSP resolution algorithms that have been developed in this field for mode evaluation in stochastic graphical model.
2.2 Inference tasks in probabilistic graphical models
Computations on probabilities and potentials rely on two fundamental types of operations. Firstly, multiplication (or addition in the domain) is used to combine potentials to define a joint potential distribution. Secondly, sum or / can be used to eliminate variables and compute marginals or modes of the joint distribution on subsets of variables. The precise identity of these two basic operations is not crucial for the inference algorithms considered in this paper. We denote as the combination operator and as the elimination operator. The algorithms just require that defines a commutative semiring. Specifically, the semiring algebra offers distributivity: . This corresponds to distributivity of product over sum since or distributivity of over sum since , or again distributivity of over product since . These two abstract operators can be defined to be applied to potential functions, as follows:
 Combine operator:

the combination of two potential functions and is a new function , from to defined as .
 Elimination operator:

the elimination of variable from a potential function is a new function from to defined as . For , represents .
We can now describe classical counting and optimization tasks in graphical models in terms of these two operators. For simplicity, we denote by , where a sequence of eliminations for all , the result being insensitive to the order in a commutative semiring. Similarly, represents the successive combination of all potential functions such that .
Counting tasks.
Under this name we group all tasks that involve summing over the state space of a subset of variables in . This includes the computation of the partition function or of any marginal distribution, as well as entropy evaluation. For and , the marginal distribution of associated to the joint distribution is defined as:
(2) 
The function then satisfies:
where combines functions using and eliminates variables using .
Marginal evaluation is also interesting in the case where some variables are observed. If the values of some variables () have been observed, we can compute the marginal conditional distribution by restricting the domains of variables to the observed value.
The entropy of a probabilistic graphical model is defined as
(3) 
where denotes the mathematical expectation. In the case of a graphical model, by linearity of the expectation, the entropy is equal to
This expression is an alternation of use of and operators (for evaluation, for each and ).
Optimization task
An optimization task in a graphical model corresponds to the evaluation of the most probable state
of the random vector
, defined as(4) 
The maximum itself is with set to and to . The computation of the mode does not require the computation of the normalizing constant , however computing the mode probability does.
Therefore counting and optimization tasks can be interpreted as two instantiations of the same computational task expressed in terms of combination and elimination operators, namely where . When the combination operator and the elimination operator are respectively set to and , this computational problem is known as a sumproduct problem in the Artificial Intelligence literature (Pearl, 1988),(Bishop, 2006, chapter 8). When is set to and to the sum operator it is a maxsum problem (Bishop, 2006, chapter 8).
We will see in Section 3 that there exists an exact algorithm solving this general task that exploits the distributivity of the combination and elimination operators to perform operations in a smart order. From this generic algorithm, known as variable elimination (Bertelé and Brioshi, 1972) or bucket elimination (Dechter, 1999), one can deduce exact algorithms to solve counting and optimization tasks in a graphical model, by instantiating the operators and .
Deterministic Graphical models
: the Constraint Satisfaction Problem is a  problem as it can can be defined using (logical ’or’) as the elimination operator and (logical ’and’) as the combination operator over Booleans. The weighted CSP is a  as it uses as the elimination operator and (or bounded variants of ) as the combination operator. Several other variants exist (Rossi et al., 2006), including generic algebraic variants (Schiex et al., 1995; Bistarelli et al., 1997; Cooper, 2004; Pralet et al., 2007; Kohlas, 2003).
3 Variable elimination for exact inference
We describe now the principle of variable elimination. We first recall the Viterbi algorithm for Hidden Markov Chains, a classical example of variable elimination for optimization. Then we formally describe the variable elimination procedure in the general graphical model framework. The key element is the choice of an ordering for the sequential elimination of the variables. It is closely linked to the notion of treewidth of the graphical representation of the graphical model. As it will be shown, the complexity of the variable elimination is fully characterized by this notion. Conversely, the treewidth can be bounded from above from a given variable elimination scheme.
3.1 An example: hidden Markov chain models
As an introduction to exact inference on graphical models by variable elimination, we consider a well studied stochastic process: the discrete Hidden Markov Chain model (HMC).
A HMC (Figure 2) is defined by two sequences of random variables and of the same length, . A realization of the variables is observed, while the states of variables are unknown. In the HMC model the assumption is made that is independent of and given the hidden variable . These independences are modeled by pairwise potential functions . Furthermore, hidden variable is independent of and given the hidden variable . These independences are modeled by pairwise potential functions . Then the model is fully defined by specifying an additional potential function to model the initial distribution. In the classical HMC formulation (Rabiner, 1989)
, these potential functions are normalized conditional probability distributions i.e.,
, and . As a consequence, the normalizing constant is equal to , as in any Bayesian network.A classical inference task for HMC is to identify the most likely value of the variables given a realization of the variables . The problem is to compute or equivalently the argument of:
(5) 
The number of possible realizations of is exponential in . Nevertheless this optimization problem can be solved in a number of operations linear in using the wellknown Viterbi algorithm (Rabiner, 1989). This algorithm, based on dynamic programming, performs successive eliminations (by maximization) of all hidden variables, starting with , then , and finishing by , to compute the most likely sequence of hidden variables. By using distributivity between the and the product operators, the elimination of variable can be done by rewriting equation (5) as:
The new potential function created by maximizing on depends only on variable , so that variables and potential functions involving them have been removed from the optimization problem. This is a simple application of the general variable elimination algorithm that we describe in the next section.
3.2 General principle of variable elimination
In Section 2, we have seen that counting and optimization tasks can be formalized by the same generic algebraic formulation
(6) 
where .
The trick behind variable elimination (Bertelé and Brioshi, 1972) relies on a clever use of the distributivity property. Indeed, evaluating as requires fewer operations. Since distributivity applies both for counting and optimizing tasks, variable elimination can be applied to both tasks. It also means that if variable elimination is efficient for one task it will also be efficient for the other one. As in the HMC example, the principle of the variable elimination algorithm for counting or optimizing consists in eliminating variables one by one in expression (6).
Elimination of the first variable, say , is performed by merging all potential functions involving and applying operator to these potential functions. Using commutativity and associativity equation (6) can be rewritten as follows:
Then using distributivity we obtain
This shows that the elimination of results in a new graphical model where the variable and the potential functions have been removed and replaced by a new potential which does not involve , but its neighboring vertices. The graph associated to the new graphical model is similar to the graph of the original model except that vertex has been removed and that the neighbors of are now connected together in a clique. The new edges between the neighbors of are called fillin edges. For instance, when eliminating variable in the graph of Figure 3 (left), potential functions and are replaced by . The new graph is shown in Figure 3, right part.
When the first elimination step is applied with and , the probability distribution defined by this new graphical model is the marginal distribution of the original distribution . The complete elimination can be obtained by successively eliminating all variables in . The result is a graphical model over which is the marginal distribution . When , the result is a model with a single constant potential function with value .
If instead is , and (or with a log transformation of the potential functions) and
, the last potential function obtained after elimination of the last variable is equal to the maximum of the non normalized distribution. So evaluating
or the maximal probability of a graphical model can be both obtained with the same variable elimination algorithm, just changing the meaning of and .However, if one is interested in the mode itself, an additional simple computation is required. The mode is obtained by induction: if is the mode of the graphical model obtained after the elimination of the first variable, , then the mode of can be defined as where is a value in that maximizes . This maximization is straightforward to derive because can take only values. itself is obtained by completing the mode of the graphical model obtained after elimination the second variable, and so on. We stress here that the procedure requires to keep the intermediary potential functions generated during the successive eliminations.
When eliminating a variable , the task that can be computationally expensive is the computation of the intermediate . It requires to compute the product of several potential functions for all elements of , the state space of . The time and space complexity of the operation are entirely determined by the cardinality of the set of indices . If is the maximum domain size of a variable, the time complexity (i.e. number of elementary operations performed) is in and space complexity (i.e. memory space needed) is in . Complexity is therefore exponential in , the number of neighbors of the eliminated variable in the current graphical model. The total complexity of the variable elimination is then exponential in the maximum cardinality over all successive eliminations (but linear in ). Because the graphical model changes at elimination each step, this number usually depends on the order in which variables are eliminated.
As a consequence, the prerequisite to apply variable elimination is to decide for an ordering of elimination of the variables. As illustrated in Figure 4 two different orders can lead to two different subsets. The key message is that the choice of the order is crucial/ it dictates the efficiency of the variable elimination procedure. We will now illustrate and formalize this intuition.
3.3 When is variable elimination efficient ?
We can understand why the Viterbi algorithm is an efficient algorithm for mode evaluation in a HMC. The graph associated to a HMC is combshaped: the hidden variables form a line and each observed variable is a leaf in the comb (see Figure 2). So it is possible to design an elimination order where the current variable to eliminate has a unique neighbor in the graphical representation of the current model: for instance (the first eliminated variable is the largest according to this ordering). Following this elimination order, when eliminating a variable using , the resulting graphical model has one fewer vertex than the previous one and no fillin edge. Indeed, the new potential function is a function of a single variable since .
More generally, variable elimination is very efficient, i.e. leads to intermediate sets of small cardinality, on graphical models whose graph representation is a tree, again because it is always possible to design an elimination order where the current variable to eliminate has only one neighbor in the graphical representation of the current model.
Another situation where variable elimination can be efficient is when the graph associated to the graphical model is chordal (any cycle of length 4 or more has a chord i.e., an edge connecting two non adjacent vertices in the cycle), the size of the largest clique being low. The reason is the following. In Figure 2, new edges are created between neighbors of the eliminated vertex. If this neighborhood is a clique, no new edge is added. A vertex whose neighborhood is a clique is called a simplicial vertex. Chordal graphs have the property that there exists an elimination order of the vertices such that every vertex during elimination process is simplicial. Then, there exists an elimination order such that no fillin edges are created. Thus, the largest size of is no more than the size of a clique, and is equal to or less than the size of the largest clique in the graph. Let us note that a tree is a chordal graph in which all edges and only edges are cliques. Hence, for a tree, simplicial vertices are vertices of degree one. Then, elimination of degree one vertices on a tree is an example of simplicial elimination on a chordal graph.
For arbitrary graphs, if the maximal scope size of the intermediate functions created during variable elimination is too large, then memory and time required for the storage and computation quickly exceed computer capacities. Depending on the chosen elimination order, this maximal scope can be reasonable from a computational point of view, or too large. So again, the choice of the elimination order is crucial.
3.4 The treewidth to characterize variable elimination complexity
The lowest complexity achievable when performing variable elimination is characterized by a parameter called the treewidth of the graph associated to the original graphical model. This concept has been repeatedly discovered and redefined. The treewidth of a graph is sometimes called its induced width (Dechter and Pearl, 1988), its minimum front size (Liu, 1992), its tree number (Arnborg, 1985), its dimension (Bertelé and Brioshi, 1972) and is also equal to the minmax clique number of minus one (Arnborg, 1985). The treewidth is also a key notion in the theory of graph minors (see Robertson and Seymour, 1986; Lovász, 2005).
We insist here on two definitions. The first one from (Bodlaender, 1994) relies on the notion of induced graph and the link between fillin edges and the intermediate sets created during variable elimination is straightforward. The second (Robertson and Seymour, 1986; Bodlaender, 1994) is the commonly used characterization of the treewidth using socalled tree decompositions, also known as junction trees which are key tools to derive variable elimination algorithms. It underlies the blockbyblock elimination procedure described in Section 3.5.
Definition 1 (induced graph)
Let be a graph defined by a set of vertices indexed on and a set of edges. Given an ordering of the vertices of , the induced graph is obtained as follows. and have same vertices. Then to each edge in corresponds an oriented edge in going from the first of the two nodes according to toward the second. Then each vertex of is considered one after the other following the order defined by . When vertex is treated, an oriented edge is created between all pairs of neighbors of in that follows according to . Again the edge is going from the first of the two nodes according to toward the second.
The induced graph is also called the fill graph of and the process of computing it is sometimes referred to as “playing the elimination game” on , as it just simulates elimination on using the variable ordering . This graph is chordal (Vandenberghe and Andersen, 2014). It is known that every chordal graph has at least one vertex ordering such that , called a perfect elimination ordering (Fulkerson and Gross, 1965).
The second notion that enables to define the treewidth is the notion of tree decomposition. Intuitively, a tree decomposition of a graph organizes the vertices of in clusters of vertices which are linked by edges such that the graph obtained is a tree. Specific constraints on the way vertices of are associated to clusters in the decomposition tree are demanded. These contraints ensure properties to tree decomposition usefull for building variable elimination algorithms.
Definition 2 (tree decomposition)
Given a graph , a tree decomposition is a tree , where is a family of subsets of (called clusters), and is a set of edges between the subsets , satisfying the following properties:

The union of all clusters equals (each vertex is associated with at least one vertex of ).

For every edge in , there is at least one cluster that contains both and .

If clusters and both contain a vertex of , then all clusters of in the (unique) path between and contain as well: clusters containing vertex form a connected subset of . This is known as the running intersection property).
The concept of tree decomposition is illustrated in Figure 5.
Definition 3 (treewidth)
The two following definitions of the treewidth derived respectively from the notion of induce graph and from that of tree decomposition are equivalent (but this is not trivial to establish):

The treewidth of a graph for the ordering is the maximum number of outgoing edges of a vertex in the induced graph . The treewidth of a graph is the minimum treewidth over all possible orderings .

The width of a tree decomposition is the size of the largest . and the treewidth of a graph is the minimum width among all its tree decompositions.
is exactly the cardinality of the largest set created during variable elimination with elimination order . For example, in Figure 4, the middle and right graphs are the two induced graphs for two different orderings. is equal to 2 with the first ordering and to 3 with the second. It is easy to see that in this example . The treewidth of the graph of the HMC model, and of any tree is equal to 1.
It has been established that finding a minimum treewidth ordering for a graph , finding a minimum treewidth tree decomposition or computing the treewidth of a graph are of equivalent complexity. For an arbitrary graph, computing the treewidth is not an easy task. Section 4 is dedicated to this question, from a theoretical and a practical point of view.
The treewidth is therefore a key indicator to answer the driving subject of this review: will variable elimination be efficient for a given graphical model? For instance, the principle of variable elimination have been applied to the exact computation of the normalizing constant of a Markov random field on a small by lattice in (Reeves and Pettitt, 2004). For this regular graph, it is known that the treewidth is equal to . So exact computation through variable elimination is only possible for lattices with a small value for . It is however well beyond computer capacities for real challenging problems in image analysis. In this case variable elimination can be used to define heuristic computational solutions, such as the algorithm of (Friel et al., 2009) which relies on exact computations on small sublattices of the original lattice.
3.5 Tree decomposition and block by block elimination
Given a graphical model and a tree decomposition of its graph, a possible alternative to solve counting or optimization tasks is to eliminate variables in successive blocks instead of one after the other. To do so, the block by block elimination procedure (Bertelé and Brioshi, 1972) relies on the tree decomposition characterization of the treewidth. The underlying idea is to apply the variable elimination procedure on the tree decomposition, eliminating one cluster of the tree at each step. First a root cluster is chosen and used to define an order of elimination of the clusters, by progressing from the leaves toward the roots, such that every eliminated cluster corresponds to a leaf of the current intermediate tree. Then each potential function is assigned to the cluster in such that which is the closest to the root. Sucha cluster always exists from the properties of a tree decomposition and the fact that a potential function is associated to a clique in ). The procedure starts with the elimination of any leaf cluster of , with parent in . Let us note . Here again, commutativity and distributivity are used to rewrite expression (6) (with ) as follows:
Note that only variables with indices in are eliminated, even if it is common to say that the cluster has been eliminated. For instance, for the graph of Figure 5, if the first eliminated cluster is , the new potential function is , it depends only on variables and . Cluster elimination continues until no cluster is left. The interest of this procedure is that the intermediate potential function created after each cluster elimination may have a scope much smaller than the treewidth, leading to better space complexity (Bertelé and Brioshi, 1972, chapter 4). However, the time complexity is increased.
In summmary, the lowest complexity achievable when performing variable elimination is for elimination orders whose cardinalities of the intermediate sets are lower of equal to the treewidth of . This treewidth can be determined by considering clusters sizes in tree decompositions of . Furthermore, a tree decomposition can be used to build an elimination order and vice versa. Indeed, an elimination order can be defined by using a cluster elimination order based on and choosing an arbitrary order to eliminate variables with indices in the subsets . Conversely, it is easy to build a tree decomposition from a given vertex ordering . Since the induced graph is chordal, its maximum cliques can be identified in polynomial time. Each such clique defines a cluster of the tree decomposition. Edges of can be identified as the edges of any maximum spanning tree in the graph with vertices and edges weighed by .
Deterministic Graphical Models
: to our knowledge, the notion of treewidth and its properties have been first identified in combinatorial optimization in (Bertelé and Brioshi, 1972) where it was called “dimension”, a graph parameter which has been shown equivalent to the treewidth (Bodlaender, 1998). Variable elimination itself is related to FourierMotzkin elimination (Fourier, 1827), a variable elimination algorithm that benefits from the linearity of the handled formulas. Variable elimination has been repeatedly rediscovered, as nonserial dynamic programming (Bertelé and Brioshi, 1972), in the DavidPutnam procedure for boolean satisfiability problems (SAT, Davis and Putnam, 1960), as Bucket elimination for the CSP and WCSP (Dechter, 1999), in the Viterbi and ForwardBackward algorithms for HMM (Rabiner, 1989) and many more.
Theree exists other situations where the choice of an elimination order has a deep impact on the complexity of the computations as in Gauss elimination scheme for a system of linear equations, or Choleski factorization of very large sparse matrices, and where equivalence between elimination and decomposition have been used (see Bodlaender et al., 1995).
4 Treewidth computation and approximation
As already mentioned, the complexity of the counting and the optimization tasks on graphical models is heavily linked to the treewidth of the underlying graph . If one could guess the optimal vertex ordering, , leading to , then, one would be able to achieve the “optimal complexity” for solving exactly these tasks (we recall that is the maximal domain size of a variable in the graphical model). However, the problem is that one cannot easily evaluate the treewidth of a given graph. The treewidth computation problem is known to be NPhard (Arnborg et al., 1987).
In the following subsections we provide a short presentation of the stateoftheart theoretical and experimental results concerning the exact computation of the treewidth of a graph, and the computation of suboptimal vertex orderings providing approximations of the treewidth in the form of an upper bound.
4.1 Exact solution algorithms
Several exponential time exact algorithms have been proposed to compute the treewidth. These algorithms compute the treewidth in time exponential in . The algorithm with the best complexity bound has been proposed by (Fomin and Villanger, 2012). They provide an exact algorithm for computing the treewidth, which run in time (using polynomial space), or in time , using exponential (memory) space.
Since the treewidth of a network can be quite small (compared to ) in practice, there has been a great deal of interest in finding exact algorithms with time complexity exponential in and potentially only polynomial in . Some of these algorithms even have complexity linear in (Bodlaender, 1996; Perkovic and Reed, 2000). In Bodlaender (1996), an algorithm is proposed to compute the treewidth (it also provides an associated tree decomposition) of in time . If this algorithm is used to compute the treewidth of graphs in a family of graphs whose treewidth is uniformly bounded, then computing the treewidth would become of time complexity linear in (however, even for a small bound on the treewidth, the constant can be huge). Moreover, in the general case, there is no way to bound the treewidth a priori.
4.2 Approximation of the treewidth with guarantee
Now, recall that even though crucial, finding a “good” tree decomposition of the graph is only one element in the computation of quantities of interest in graphical models. If one has to spend more time on finding an optimal vertex ordering than on computing probabilities on the underlying graphical model using an easytocompute suboptimal ordering, the utility of exact treewidth computation becomes limited. Therefore, an alternative line of search is to look for algorithms computing a vertex ordering leading to a suboptimal width, , but more efficient in terms of computational time. When defining such approximation algorithms, one is particularly interested in polynomial time (in ) algorithms, finding a vertex ordering that approaches the optimal ordering within a constant multiplicative factor : .
However, the existence of such constantfactor approximation algorithms is not guaranteed for all NPhard problems. Some NPhard problems are even known not to admit polynomial time approximation algorithms. As far as treewidth approximation is concerned, we are in the interesting case where it is not yet known whether or not a polynomial time approximation algorithm does exist (Austrin et al., 2012).
Finally, there have been a variety of proposed algorithms, trading off approximation quality and running time complexity (Robertson and Seymour, 1986; Lagergreen, 1996; Amir, 2010; Bodlaender et al., 2013). Table 1 (extracted from Bodlaender et al., 2013) summarizes the results in terms of approximation guarantee and time complexity for these algorithms.
Algorithm  Approximation  Time complexity  

guarantee  
Robertson and Seymour (1986)  
Lagergreen (1996)  
Amir (2010)  
Bodlaender et al. (2013) 
The theoretical results about the complexity and approximability of treewidth computation are interesting by the insight they give about the difficulties of finding good, if not optimal, vertex ordering. They are also interesting in that they offer worstcase guarantees, i.e., the approximation quality is guaranteed to be at least that promised by the algorithm. Furthermore, the increase in computation time is also upperbounded, allowing to get some guarantees that the approximation can be obtained in “reasonable” time.
However, the main drawback of these worstcase based approaches, is that they can be dominated, empirically, by heuristic approaches, on most instances. Indeed, several algorithms, working well in practice, even though without worstcase complexity/quality bounds have been proposed. We describe some of these approaches in the following section.
4.3 Treewidth in practice
A broad class of heuristic approaches is that of greedy algorithms (Bodlaender and Koster, 2010). They use the same iterative approach as the variable elimination algorithm (Section 3) except that they manipulate the graph structure only and do not perform any actual combination/elimination computation. Starting from an empty vertex ordering and an initial graph , they repeatedly select the next vertex to add in the ordering by locally optimizing one of the following criteria:

select a vertex with minimum degree in the current graph ;

select a vertex with minimum number of fillin edges in the current graph.
After each vertex selection, the current graph is modified by removing the selected vertex and making a clique on its neighbors. The new edges added by this clique are fillin edges. A vertex with no fillin edges is called a simplicial vertex. Fast implementations of minimum degree algorithms have been developed, see e.g., AMD (Amestoy et al., 1996) with time complexity in (Heggernes et al., 2001) for an input graph with vertices and edges. The minimum fillin heuristic tends to be slower to compute but yields slightly better treewidth approximations in practice. Moreover, it will find a perfect elimination ordering (ı.e., adding no fillin edges) if it exists, thus recognizing chordal graphs and it returns the optimal treewidth in this particular case ((this can be easily established from results in Bodlaender et al., 2005).
Notice that there exists linear time algorithms to detect chordal graphs as the Maximum Cardinality Search (MCS) greedy algorithm (Tarjan and Yannakakis, 1984) but the treewidth approximation they return is usually worse than the previous heuristic approaches.
A simple way to improve the treewidth bound found by these greedy algorithms is to break ties for the selected criterion using a second criterion, such as minimum fillin first and then maximum degree, or to break ties at random and to iterate on the resulting randomized algorithms as done in Kask et al. (2011).
We compared the mean treewidth bound found by these four approaches (minimum degree, minimum fillin, MCS and randomized iterative minimum fillin) on a set of five CSP and MRF benchmarks used as combinatorial optimization problems in various solver competitions. ParityLearning is an optimization variant of the minimal disagreement parity CSP problem originally contributed to the DIMACS benchmark and used in the Minizinc challenge (Optimization Research Group, 2012). Linkage is a genetic linkage analysis benchmark (Elidan and Globerson, 2010)
. GeomSurf and SceneDecomp are respectively geometric surface labeling and scene decomposition problems in computer vision
(Andres et al., 2013). The number of instances per problem as well as their mean characteristics are given in Table 2. Results are reported in Figure 6 (Left).The randomized iterative minimum fillin algorithm used a maximum of iterations or seconds (respectively iterations and seconds for ParityLearning and Linkage), compared to a maximum of second used by the noniterative approaches. The minimum fillin algorithm (using maximum degree for ties breaking) performed better than the other greedy approaches, being slightly improved by its randomized iterative version.Problem  Nb  Mean nb  Mean nb 

Type/Name  of instances  of vertices  of potential functions 
CSP/ParityLearning  7  659  1246 
MRF/Linkage  22  917  1560 
MRF/GeomSurf3  300  505  2140 
MRF/GeomSurf7  300  505  2140 
MRF/SceneDecomp  715  183  672 
On the same benchmark, we also compared three exact methods for the task of mode evaluation that exploit either minimum fillin ordering or its randomized iterative version: variable elimination (ELIM), BTD (de Givry et al., 2006) and AND/OR search (Marinescu and Dechter, 2006). Elim and BTD exploit the minimum fillin ordering while AND/OR search used its randomized iterative version. In addition, BTD and AND/OR Search exploit a tree decomposition during a Depth First Branch and Bound method in order to get a good tradeoff between memory space and search effort. As variable elimination, they have a worstcase time complexity exponential in the treewidth. All methods were allocated a maximum of 1 hour and 4 GB of RAM on an AMD Operon 6176 at 2.3 GHz. The results, as reported in Figure 6 (Right) shown that BTD was able to solve more problems than the two other methods for a fixed CPU time. However, on a given problem, the best method heavily depends on the problem category. On ParityLearning, ELIM was the fastest method, but it ran out of memory on of the total set of instances, while BTD (resp. AND/OR search) used less than GB (resp. 4GB). The randomized iterative minimum fillin heuristic used by AND/OR search in preprocessing consumed a fixed amount of time ( seconds, included in the CPU time measurements) larger than the cost of a simple minimum fillin heuristic. BDT was faster than AND/OR search to solve most of the instances except on two problem categories (ParityLearning and Linkage).
To perform this comparison, we ran the followinf implementation of each method. The version of ELIM was the one implemented in the combinatorial optimization solver toolbar 2.3 (options i T3, available at mulcyber.toulouse.inra.fr/projects/toolbar). The version of BTD was the one implemented in the combinatorial optimization solver toulbar2 0.9.7 (options B=1 O=3 nopre. Toulbar2 is available at mulcyber.toulouse.inra.fr/projects/toulbar2. This software won the UAI 2010 (Elidan and Globerson, 2010) and 2014 (Gogate, 2014) Inference Competitions on the MAP task. AND/OR search was the version implemented in the opensource version 1.1.2 of daoopt (Otten et al., 2012) (options y i 35 slsX=20 slsT=10 lds 1 m 4000 t 30000 orderTime=180 for benchmarks from computer vision and y i 25 slsX=10 slsT=6 lds 1 m 4000 t 10000 orderTime=60 for the other benchmarks) which won the Probabilistic Inference Challenge 2011 (Elidan and Globerson, 2011), albeit with a different closedsource version (Otten et al., 2012).
5 From Variable Elimination to Message Passing
Message passing algorithms make use of messages, which can be described as potential functions which are external to the definition of graphical models. On treestructured graphical models message passing algorithms extend the variable elimination algorithm by efficiently computing every marginals (or modes) simultaneously, when variable elimination only computes one. On general graphical models, message passing algorithms can still be applied but either provide approximate results efficiently or have an exponential running cost.
We present how it may be conceptually interesting to view these algorithms as performing a reparametrization of the original graphical model i.e., modifications of the potentials, instead of producing external messages, which are not easy to interpret by themselves.
5.1 Message passing and belief propagation
Message passing algorithms over trees can be described as an extension of variable elimination, where the marginals of all variables are computed in a double pass of the algorithm (instead of one variable in classical variable elimination). Instead of eliminating a leaf and the potential functions involving , we just mark the leaf as “processed“ and consider that the new potential is a “message” sent from to (the parents of in the tree), denoted as . This message is a potential function over only. We can iterate this process, always applying it to a leaf in the subgraph defined by unmarked variables, handling already computed messages as unary potentials.
When only one variable remains unmarked (defining the root of the tree), the combination of all the functions on this variable (messages and possibly original potential function) will be equal to the marginal unnormalized distribution on this variable. This results directly from the fact that the operations are equivalent to variable elimination. The root of the tree defines a directed tree where the root is at the top, descendants are below and messages are flowing upwards, to the root.
To compute the marginal of another variable, one can redirect the tree using this new root. Then some subtrees will remain unchanged (in terms of direction from the root of the subtree to the leaves) in this new tree and the messages in these subtrees do not need to be recomputed.
It turns out that in a tree, one can organize all these computations cleverly so that only two messages are computed for each edge, one for each possible direction of the edge.
Formally, in the Sumproduct algorithm over a tree , messages are defined for each edge (there are such messages, one for each edge direction) in a leavestoroottoleaves order. Messages are functions of , which are computed iteratively, by the following algorithm:

First, messages leaving the leaves of the tree are initialized: , where is a leaf of the tree,
Mark all leaves as processed.

Then, messages are sent upward through all edges. Message updates are performed iteratively, from marked nodes to their only unmarked neighbor through edge . Message updates take the following form:
(7) where .
Mark node as processed. See Figure 7 for an illustration. 
It remains to send the latter messages downward (from root to leaves). This second phase of message updates takes the following form:

Unmark root node.

While there remains a marked node, send update (7) from an unmarked node to one of its marked neighbors, unmark the corresponding neighbor.


After the two above steps, messages have been transmitted through all edges in both directions. Finally, marginal distributions over variables and pairs of variables (linked by an edge) are computed as follows:
and are suitable normalizing constants.
When the graph of the original graphical model is not a tree, the twopass message passing algorithm can no more be applied. Still, for general graphical models, this message passing approach can be generalized in two different ways.

One can compute a tree decomposition, as previously shown. Message passing can then be applied on the resulting cluster tree, handling each cluster as a crossproduct of variables following a blockbyblock approach. This yields an exact algorithm, for which computations can be expensive (exponential in the treewidth) and space intensive (exponential in the separator size). A typical example of such algorithm is the algebraic exact message passing algorithm of Shafer and Shenoy (1988); Shenoy and Shafer (1990).

Alternatively, the Loopy Belief Propagation algorithm (Frey and MacKay, 1998) is another extension of Message Passing in which messages updates are repeated, in arbitrary order through all edges (possibly many times through each edge), until a termination condition is met. The algorithm returns approximations of the marginal probabilities (over variables and pairs of variables). The quality of the approximation and the convergence to steady state messages are not guaranteed (hence, the importance of the termination condition). However, it has been observed that LBP often provides good estimates of the marginals, in practice. A deeper analysis of messagepassing algorithms will be provided in Section 6.
We have described above the Sumproduct algorithm. Maxproduct, Maxsum algorithms can be equivalently defined, for exact computation or approximation of the marginal of a joint distribution or its logarithm. In algebraic language, updates like defined in formula (7) take the general form:
As for sumproduct, the resulting algorithm computes exact marginals on a treestructured graphical model from which the mode of the distribution can be computed while on general graphical models, it provides only approximations.
5.2 Message Passing and Reparametrization
It is possible to use message passing on trees as a reparametrization technique. Instead of computing external messages, message passing can reformulate the original treestructured graphical model in a new equivalent treestructured graphical model. By “equivalent” we mean that the resulting tree defines exactly the same joint distribution as the original graphical model. In the reparameterized problem, information of interest (marginals) can be directly read in the potential functions (Koller and Friedman, 2009).
The idea behind reparametrization is conceptually very simple: when a message is computed, instead of keeping it as a message, it is possible to multiply any potential function involving by , using . To preserve the joint distribution defined by the graphical model, we need to divide another potential function involving by the same message using the inverse of .^{1}^{1}1Zeros in potential can be dealt with by a proper extension of the algebraic operations, including an inverse for zero. If the algebraic structure equipped with is not a group but only a semigroup or monoid, suitable pseudo inverses can often be defined. See (Cooper and Schiex, 2004; Gondran and Minoux, 2008).
One possibility is to incorporate the messages in the binary potentials: we replace by while is divided by and is divided by . In this case, each pairwise potential can be shown to be equal to the marginal of the joint potential on .
The resulting treestructured MRF is said to be calibrated to emphasize the fact that all pairs of binary potentials sharing a common variable agree on their marginals:
The main advantage of a calibrated reparametrization is that it can be used instead for the original model for any further processing. This is useful in the context of incremental updates, where new evidence is introduced incrementally and each recalibration is simpler than a new calibration (Koller and Friedman, 2009).
Message passing based reparameterizations can be generalized to cyclic graphs. If an exact approach using tree decompositions is followed, messages may have a size exponential in the intersection of pairs of clusters and the reparametrization will create new potentials of this size. If these messages are multiplied inside the clusters, each resulting cluster will be the marginal of the joint distribution on the cluster variables. The treedecomposition is calibrated and any two intersecting clusters agree on their marginals. This is exploited in the LauritzenSpiegelhalter and Jensen sumproductdivide algorithms (Lauritzen and Spiegelhalter, 1988; Jensen et al., 1990). In this context, besides incremental updates, a calibrated tree decomposition allows also to locally compute exact marginals for any set of variables in the same cluster.
If a local “loopy” approach is used instead, reparameterizations do not change scopes but provide a reparameterized model where estimates of the marginals can be directly read. For MAP, such reparameterizations can follow clever update rules to provide convergent reparameterizations maximizing a well defined criterion. Typical examples of this schema are the sequential version of the tree reweighted algorithm (TRWS, Kolmogorov, 2006)
, or the Max Product Linear Programming algorithm
(MPLP, Globerson and Jaakkola, 2008) which try to optimize a bound on the nonnormalized probability of the mode. A seminal reference, published in Russian is (Schlesinger, 1976). These algorithms can be exact on graphical models with loops, provided the potential functions are all submodular (often described as the discrete version of convexity).Deterministic graphical models
: message passing algorithms have also been used in deterministic graphical models where they are known as “local consistency” enforcing or constraint propagation algorithms. A local consistency property defines the targeted calibration property and the enforcing algorithm allows to transform the original network into an equivalent network (defining the same joint function) that satisfies the desired calibration/local consistency property. Similar to LBP, Arc Consistency Waltz (1972); Rossi et al. (2006) is the most usual form of local consistency and is related to Unit Propagation in SAT Biere et al. (2009). Arc consistency is exact on trees and is usually incrementally maintained during an exact tree search, using reparametrization. Because of the idempotency of logical operators, local consistencies always converge to a unique fixpoint.
Local consistency properties and algorithms for the Weighted CSP are very closely related to message passing for MAP. They however are always convergent, thanks to suitable calibration properties (Schiex, 2000; Cooper and Schiex, 2004; Cooper et al., 2010) and may also solve tree structured or fully submodular problems.
6 Heuristics and approximations for inference
We mainly discussed methods for exact inference in graphical models. They are useful if an order for variable elimination with small treewidth is available. In real life applications, interaction network are seldom treeshaped, and their treewidth can be large (e.g. a grid of pixel in image analysis) and exact methods cannot be applied anymore. However, they are starting points to derive heuristic methods for inference that can be applied to any graphical model. By heuristic method, we mean an algorithm that is (a priori) not derived from the optimization of a particular criterion, as opposed to what we will call approximation methods. Nevertheless, we shall alleviate this distinction and show that good performing heursitics can sometimes be interpreted as approximate methods. For the marginalization task, the most widespread heuristics derived from variable elimination and message passing principles is the Loopy Belief Propagation algorithm (LBP, Kschischang et al., 2001) described in Section 5.1, and numerous extensions (e.g. Generalized BP, Yedidia et al., 2005) have been proposed since then. In the last decade, a better understanding of these heuristics has been reached and they can now be reinterpreted as particular instances of variational approximation methods Wainwright and Jordan (2008). A variational approximation of a distribution is defined as the best approximation of in a class of tractable distributions (for inference), according to the KüllbackLeibler divergence. Depending of the application (e.g. discrete or continuous variables), several choices for have been considered. We are apparently far from variable elimination principles and treewidth issues. However, as we just emphasized, LBP can be cast in the variational framework. The treewidth of the chosen variational distribution depends on the nature of the variables: in the case of discrete variables the treewidth is low: the class is in the majority of cases that of independent variables (mean field approximation), with associated treewidth of 0, and some works consider a class with associated treewidth of 1; in the case of continuous variables, the treewidth of the variational distribution is the same as in the original model:
is in general a class of multivariate Gaussian distributions, for which numerous inference tools are available.
We will illustrate these remarks in Section 7. Before that in the remainder of this section, we recall the two key component for a variational approximation method: the KüllbackLeibler divergence and the choice of a class of tractable distributions. We also explain how LBP can be interpreted as a variational approximation method.
6.1 Variational approximations
The KüllbackLeibler divergence measures the dissimilarity between two probability distributions and .