Lifted Tree-Reweighted Variational Inference

06/17/2014 ∙ by Hung Hai Bui, et al. ∙ JVN Institute NYU college 0

We analyze variational inference for highly symmetric graphical models such as those arising from first-order probabilistic models. We first show that for these graphical models, the tree-reweighted variational objective lends itself to a compact lifted formulation which can be solved much more efficiently than the standard TRW formulation for the ground graphical model. Compared to earlier work on lifted belief propagation, our formulation leads to a convex optimization problem for lifted marginal inference and provides an upper bound on the partition function. We provide two approaches for improving the lifted TRW upper bound. The first is a method for efficiently computing maximum spanning trees in highly symmetric graphs, which can be used to optimize the TRW edge appearance probabilities. The second is a method for tightening the relaxation of the marginal polytope using lifted cycle inequalities and novel exchangeable cluster consistency constraints.

1 Introduction

Lifted probabilistic inference focuses on exploiting symmetries in probabilistic models for efficient inference [5, 2, 3, 10, 17, 18, 21]. Work in this area has demonstrated the possibility to perform very efficient inference in highly-connected, large tree-width, but symmetric models, such as those arising in the context of relational (first-order) probabilistic models and exponential family random graphs [19]. These models also arise frequently in probabilistic programming languages, an area of increasing importance as demonstrated by DARPA’s PPAML program (Probabilistic Programming for Advancing Machine Learning).

Even though lifted inference can sometimes offer order-of-magnitude improvement in performance, approximation is still necessary. A topic of particular interest is the interplay between lifted inference and variational approximate inference. Lifted loopy belief propagation (LBP) [13, 21] was one of the first attempts at exploiting symmetry to speed up loopy belief propagation; subsequently, counting belief propagation (CBP) [16] provided additional insights into the nature of symmetry in BP. Nevertheless, these works were largely procedural and specific to the choice of message-passing algorithm (in this case, loopy BP). More recently, Bui et al. [3] proposed a general framework for lifting a broad class of convex variational techniques by formalizing the notion of symmetry (defined via automorphism groups) of graphical models and the corresponding variational optimization problems themselves, independent of any specific methods or solvers.

Our goal in this paper is to extend the lifted variational framework in [3] to address the important case of approximate marginal inference. In particular, we show how to lift the tree-reweighted (TRW) convex formulation of marginal inference [28]. As far as we know, our work presents the first lifted convex variational marginal inference, with the following benefits over previous work: (1) a lifted convex upper bound of the log-partition function, (2) a new tightening of the relaxation of the lifted marginal polytope exploiting exchangeability, and (3) a convergent inference algorithm. We note that convex upper bounds of the log-partition function immediately lead to concave lower bounds of the log-likelihood which can serve as useful surrogate loss functions in learning and parameter estimation [29, 13].

To achieve the above goal, we first analyze the symmetry of the TRW log-partition and entropy bounds. Since TRW bounds depend on the choice of the edge appearance probabilities $\rho$, we prove that the quality of the TRW bound is not affected if one only works with suitably symmetric $\rho$. Working with symmetric $\rho$ gives rise to an explicit lifted formulation of the TRW optimization problem that is equivalent but much more compact. This convex objective function can be convergently optimized via a Frank-Wolfe (conditional gradient) method where each Frank-Wolfe iteration solves a lifted MAP inference problem. We then discuss the optimization of the edge-appearance vector $\rho$, effectively yielding a lifted algorithm for computing maximum spanning trees in symmetric graphs.

As in Bui et al.’s framework, our work can benefit from any tightening of the local polytope such as the use of cycle inequalities [1, 23]. In fact, each method for relaxing the marginal polytope immediately yields a variant of our algorithm. Notably, in the case of exchangeable random variables, radically sharper tightening (sometimes even exact characterization of the lifted marginal polytope) can be obtained via a set of simple and elegant linear constraints which we call exchangeable polytope constraints. We provide extensive simulation studies comparing the behaviors of different variants of our algorithm with exact inference (when available) and lifted LBP, demonstrating the advantages of our approach. The supplementary material [4] provides additional proofs and algorithmic details.

2 Background

We begin by reviewing variational inference and the tree-reweighted (TRW) approximation. We focus on inference in Markov random fields, which are distributions in the exponential family given by $\Pr(x;\theta) = \exp\{\langle \theta, \Phi(x) \rangle - A(\theta)\}$, where $A(\theta)$ is called the log-partition function and serves to normalize the distribution. We assume that the random variables $x \in \mathcal{X}^n$ are discrete-valued, and that the features $(\Phi_i)$, $i \in I$, factor according to the graphical model structure $G$; $\Phi$ can be non-pairwise and is assumed to be overcomplete. This paper focuses on the inference tasks of estimating the marginal probabilities $p(x_i)$ and approximating the log-partition function. Throughout the paper, the domain $\mathcal{X}$ is the binary domain $\{0,1\}$; however, except for the construction of exchangeable polytope constraints in Section 6, this restriction is not essential.

Variational inference approaches view the log-partition function as a convex optimization problem over the marginal polytope, $A(\theta) = \sup_{\mu \in \mathcal{M}(G)} \langle \theta, \mu \rangle - A^*(\mu)$, and seek tractable approximations of the negative entropy $A^*$ and the marginal polytope $\mathcal{M}$ [27]. Formally, $-A^*(\mu)$ is the entropy of the maximum entropy distribution with moments $\mu$. Observe that $-A^*(\mu)$ is upper bounded by the entropy of the maximum entropy distribution consistent with any subset of the expected sufficient statistics $\mu$. To arrive at the TRW approximation [26], one uses a subset given by the pairwise moments of a spanning tree. (Footnote 1: If the original model contains non-pairwise potentials, they can be represented as cliques in the graphical model, and the bound based on spanning trees still holds.) Hence for any distribution $\rho$ over spanning trees, an upper bound on $-A^*$ is obtained by taking a convex combination of tree entropies, $-B^*(\tau, \rho) = \sum_{s \in V(G)} H(\tau_s) - \sum_{e \in E(G)} \rho_e I(\tau_e)$, where $H$ denotes the node entropy and $I$ the edge mutual information. Since $\rho$ is a distribution over spanning trees, it must belong to the spanning tree polytope $\mathbb{T}(G)$, with $\rho_e$ denoting the edge appearance probability of $e$. Combined with a relaxation $\mathrm{OUTER}(G) \supseteq \mathcal{M}(G)$ of the marginal polytope, an upper bound $B$ of the log-partition function is obtained:

$$A(\theta) \le B(\theta, \rho) = \sup_{\tau \in \mathrm{OUTER}(G)} \langle \theta, \tau \rangle - B^*(\tau, \rho) \qquad (1)$$

We note that $B^*$ is linear w.r.t. $\rho$, and for $\rho \in \mathbb{T}(G)$, $B^*$ is convex w.r.t. $\tau$. On the other hand, $B$ is convex w.r.t. $\rho$ and $\theta$.
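As a concrete illustration of the entropy term in (1), the following Python sketch evaluates the TRW entropy bound (node entropies minus edge mutual informations weighted by edge appearance probabilities) for binary pairwise pseudo-marginals. The function names and the dictionary-based data layout are our own assumptions for illustration; they do not come from the paper.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability table given as a flat list."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def mutual_information(joint):
    """Mutual information of a 2x2 joint table [[p00, p01], [p10, p11]]."""
    row = [sum(r) for r in joint]
    col = [joint[0][j] + joint[1][j] for j in range(2)]
    flat = [joint[i][j] for i in range(2) for j in range(2)]
    return entropy(row) + entropy(col) - entropy(flat)

def trw_entropy(node_marg, edge_marg, rho):
    """TRW entropy bound: sum_s H(tau_s) - sum_e rho_e * I(tau_e).

    node_marg: {node: [p0, p1]}, edge_marg: {(u, v): 2x2 table},
    rho: {(u, v): edge appearance probability} (hypothetical layout).
    """
    node_term = sum(entropy(m) for m in node_marg.values())
    edge_term = sum(rho[e] * mutual_information(t) for e, t in edge_marg.items())
    return node_term - edge_term
```

On a single edge with $\rho_e = 1$ the bound reduces to the exact entropy of the pairwise distribution, which is a quick sanity check for any implementation.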

The optimal solution $\tau^*$ of the optimization problem (1) can be used as an approximation to the mean parameter $\mu$. Typically, the local polytope LOCAL given by pairwise consistency constraints is used as the relaxation OUTER; in this paper, we also consider tightening of the local polytope.

Since (1) holds with any edge appearance $\rho$ in the spanning tree polytope $\mathbb{T}(G)$, the TRW bound can be further improved by optimizing $\rho$:

$$\inf_{\rho \in \mathbb{T}(G)} B(\theta, \rho) \qquad (2)$$

The resulting $\rho^*$ is then plugged into (1) to find the marginal approximation. In practice, one might choose to work with some fixed choice of $\rho$, for example the uniform distribution over all spanning trees. [14] proposed using the most uniform edge-weight, which can be found via conditional gradient where each direction-finding step solves a maximum spanning tree problem.
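For the uniform distribution over spanning trees, edge appearance probabilities can be obtained from spanning-tree counts via the matrix-tree theorem: the probability that an edge appears equals one minus the fraction of spanning trees that avoid it. The sketch below is a small exact-arithmetic illustration for simple unweighted graphs; the function names are ours, and the approach (one determinant per edge) is chosen for clarity rather than efficiency.

```python
from fractions import Fraction

def count_spanning_trees(n, edges):
    """Matrix-tree theorem: determinant of the Laplacian with node 0 removed."""
    lap = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1
        lap[v][v] += 1
        lap[u][v] -= 1
        lap[v][u] -= 1
    m = [row[1:] for row in lap[1:]]
    size = n - 1
    det = Fraction(1)
    for c in range(size):
        # Find a non-zero pivot; if none exists the determinant is zero.
        pivot = next((r for r in range(c, size) if m[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det
        det *= m[c][c]
        for r in range(c + 1, size):
            factor = m[r][c] / m[c][c]
            for k in range(c, size):
                m[r][k] -= factor * m[c][k]
    return det

def uniform_edge_appearance(n, edges):
    """rho_e = 1 - (#spanning trees avoiding e) / (#spanning trees)."""
    total = count_spanning_trees(n, edges)
    rho = {}
    for i, e in enumerate(edges):
        remaining = edges[:i] + edges[i + 1:]
        rho[e] = 1 - count_spanning_trees(n, remaining) / total
    return rho
```

On an edge-transitive graph such as a complete graph, every edge receives the same value $(|V|-1)/|E|$, which already lies in the symmetrized polytope discussed later.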

Several algorithms have been proposed for optimizing the TRW objective (1) given fixed edge appearance probabilities. [27] derived the tree-reweighted belief propagation algorithm from the fixed point conditions. [8] show how to solve the dual of the TRW objective, which is a geometric program. Although this algorithm has the advantage of guaranteed convergence, it is non-trivial to generalize this approach to use tighter relaxations of the marginal polytope, which we show is essential for lifted inference. [14] use an explicit set of spanning trees and then use dual decomposition to solve the optimization problem. However, as we show in the next section, to maintain symmetry it is essential that one not work directly with spanning trees but rather use symmetric edge appearance probabilities. [23] optimize TRW over the local and cycle polytopes using a Frank-Wolfe (conditional gradient) method, where each iteration requires solving a linear program. We follow this latter approach in our paper.

To optimize the edge appearance in (2), [26] proposed using conditional gradient. They observed that $\partial B(\theta, \rho) / \partial \rho_e = -I(\tau^*_e)$ where $\tau^*$ is the solution of (1). The direction-finding step in conditional gradient reduces to solving $\sup_{\rho \in \mathbb{T}(G)} \langle \rho, I(\tau^*) \rangle$, again equivalent to finding the maximum spanning tree with edge mutual information $I(\tau^*_e)$ as weights. We discuss the corresponding lifted problem in Section 5.
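A minimal sketch of one such conditional gradient update on the edge appearance vector is given below, assuming the edge mutual informations are already available as a dictionary. The direction-finding step is a plain (ground) Kruskal maximum spanning tree; the names `max_spanning_tree` and `rho_step` and the fixed step size are illustrative assumptions, not the authors' implementation.

```python
def max_spanning_tree(n, weights):
    """Kruskal's algorithm on nodes 0..n-1; weights maps (u, v) -> weight."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    # Visit edges in order of decreasing weight.
    for (u, v), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree

def rho_step(rho, mutual_info, n, step):
    """Move rho toward the indicator vector of the maximum spanning tree."""
    tree = set(max_spanning_tree(n, mutual_info))
    return {e: (1.0 - step) * r + step * (1.0 if e in tree else 0.0)
            for e, r in rho.items()}
```

Because both the current point and the tree indicator lie in the spanning tree polytope, the convex combination stays feasible, so the entries of the updated vector still sum to $|V| - 1$.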

3 Lifted Variational Framework

We build on the key element of the lifted variational framework introduced in [3]. The automorphism group of a graphical model, or more generally, an exponential family $\mathcal{F}$, is defined as the group $\mathbb{A}[\mathcal{F}]$ of permutation pairs $(\pi, \gamma)$ where $\pi$ permutes the set of variables and $\gamma$ permutes the set of features in such a way that they preserve the feature function: $\Phi^{\gamma^{-1}}(x^{\pi}) = \Phi(x)$. Note that this construction of $\mathbb{A}[\mathcal{F}]$ is entirely based on the structure of the model and does not depend on the particular choice of the model parameters; nevertheless the group stabilizes (preserves) the key characteristics of the exponential family such as the marginal polytope $\mathcal{M}$, the log-partition $A$ and entropy $-A^*$. (Footnote 2: Formally, a group $\mathbb{G}$ stabilizes $x$ if $x^g = x$ for all $g \in \mathbb{G}$.)

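As shown in [3], the automorphism group is particularly useful for exploiting symmetries when parameters are tied. For a given parameter-tying partition $\Delta$ such that $\theta_i = \theta_j$ for $i, j$ in the same cell of $\Delta$, the group $\mathbb{A}[\mathcal{F}]$ gives rise to a subgroup called the lifting group $\mathbb{A}_\Delta$ that stabilizes the tied-parameter vector $\theta$ as well as the exponential family. (Footnote 3: If $\Delta = \{\Delta_1, \dots, \Delta_K\}$ is a partition of $S$, then each subset $\Delta_k \subseteq S$ is called a cell.) The orbit partition of the lifting group can be used to formulate equivalent but more compact variational problems. More specifically, let $\varphi = \varphi(\Delta)$ be the orbit partition induced by the lifting group on the feature index set $I$, let $\mathbb{R}^{I}_{[\varphi]}$ denote the symmetrized subspace $\{r \in \mathbb{R}^{I} : r_i = r_j \text{ for all } i, j \text{ in the same cell of } \varphi\}$ and define the lifted marginal polytope $\mathcal{M}_{[\varphi]}$ as $\mathcal{M} \cap \mathbb{R}^{I}_{[\varphi]}$; then (see Theorem 4 of [3])

$$\sup_{\mu \in \mathcal{M}} \langle \theta, \mu \rangle - A^*(\mu) = \sup_{\mu \in \mathcal{M}_{[\varphi]}} \langle \theta, \mu \rangle - A^*(\mu) \qquad (3)$$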

In practice, we need to work with convex variational approximations of the LHS of (3) where $\mathcal{M}$ is relaxed to an outer bound $\mathrm{OUTER}(G)$ and $A^*$ is approximated by a convex function $B^*$. We now state a similar result for lifting general convex approximations.

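Theorem 1.

If $B^*(\tau)$ is convex and stabilized by the lifting group $\mathbb{A}_\Delta$, i.e., for all $(\pi, \gamma) \in \mathbb{A}_\Delta$, $B^*(\tau^{\gamma}) = B^*(\tau)$, then $\varphi$ is the lifting partition for the approximate variational problem

$$\sup_{\tau \in \mathrm{OUTER}(G)} \langle \theta, \tau \rangle - B^*(\tau) = \sup_{\tau \in \mathrm{OUTER}_{[\varphi]}} \langle \theta, \tau \rangle - B^*(\tau) \qquad (4)$$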

The importance of Theorem 1 is that it shows that it is equivalent to optimize over the subset $\mathrm{OUTER}_{[\varphi]}$ of OUTER where pseudo-marginals in the same orbit are restricted to take the same value. As we will show in Section 4.2, this will allow us to combine many of the terms in the objective, which is where the computational gains derive from. A sketch of its proof is as follows. Consider a single pseudo-marginal vector $\tau$. Since the objective value is the same for every $\tau^{\gamma}$ with $(\pi, \gamma) \in \mathbb{A}_\Delta$ and since the objective is concave, the average of these, $\hat{\tau} = \frac{1}{|\mathbb{A}_\Delta|} \sum_{(\pi, \gamma) \in \mathbb{A}_\Delta} \tau^{\gamma}$, must have at least as good an objective value. Furthermore, note that this averaged vector lives in the symmetrized subspace. Thus, it suffices to optimize over $\mathrm{OUTER}_{[\varphi]}$.
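The averaging step in this proof sketch has a simple coordinate-wise form: averaging a vector over a permutation group replaces each coordinate by the mean over its orbit. The following Python sketch assumes the orbit partition is already given as a list of index lists (computing it from the model is outside the scope of this snippet), and the name `symmetrize` is ours.

```python
def symmetrize(vec, partition):
    """Replace each coordinate by the mean over its cell of the orbit partition."""
    out = list(vec)
    for cell in partition:
        avg = sum(vec[i] for i in cell) / len(cell)
        for i in cell:
            out[i] = avg
    return out

def inner(a, b):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))
```

When the parameter vector is constant on every cell, the linear part of the objective is unchanged by this averaging, so only the concave entropy term can change, and by concavity it cannot decrease.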

4 Lifted Tree-Reweighted Problem

4.1 Symmetry of TRW Bounds

We now show that Theorem 1 can be used to lift the TRW optimization problem (1). Note that the applicability of Theorem 1 is not immediately obvious since $B^*$ depends on the distribution over trees implicit in $\rho$. In establishing that the condition in Theorem 1 holds, we need to be careful so that the choice of the distribution over trees does not destroy the symmetry of the problem.

The result below ensures that with no loss in optimality, $\rho$ can be assumed to be suitably symmetric. More specifically, let $\varphi^E = \varphi^E(\Delta)$ be the set of $G$’s edge orbits induced by the action of the lifting group $\mathbb{A}_\Delta$; the edge-weights $\rho_e$ for every $e$ in the same edge orbit can be constrained to be the same, i.e. $\rho$ can be restricted to $\mathbb{T}_{[\varphi^E]}$.

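Theorem 2.

For any $\rho \in \mathbb{T}(G)$, there exists a symmetrized $\hat{\rho} \in \mathbb{T}_{[\varphi^E]}$ that yields at least as good an upper bound, i.e.

$$B(\theta, \hat{\rho}) \le B(\theta, \rho) \quad \text{for all tied-parameter } \theta.$$

As a consequence, in optimizing the edge appearance, $\rho$ can be restricted to the symmetrized spanning tree polytope $\mathbb{T}_{[\varphi^E]}$:

$$\inf_{\rho \in \mathbb{T}(G)} B(\theta, \rho) = \inf_{\rho \in \mathbb{T}_{[\varphi^E]}} B(\theta, \rho).$$

Proof.

Let $\rho$ be the argmin of the LHS, and define $\hat{\rho} = \frac{1}{|\mathbb{A}_\Delta|} \sum_{(\pi, \gamma) \in \mathbb{A}_\Delta} \rho^{\pi}$ so that $\hat{\rho} \in \mathbb{T}_{[\varphi^E]}$. For all $(\pi, \gamma) \in \mathbb{A}_\Delta$ and for all tied-parameter $\theta$, $\theta^{\gamma} = \theta$, so $B(\theta, \rho^{\pi}) = B(\theta^{\gamma}, \rho^{\pi})$. By Theorem 1 of [3], $\pi$ must be an automorphism of the graph $G$. By Lemma 7 (see supplementary material), $B(\theta^{\gamma}, \rho^{\pi}) = B(\theta, \rho)$. Thus $B(\theta, \rho^{\pi}) = B(\theta, \rho)$. Since $B$ is convex w.r.t. $\rho$, by Jensen’s inequality we have that

$$B(\theta, \hat{\rho}) \le \frac{1}{|\mathbb{A}_\Delta|} \sum_{(\pi, \gamma) \in \mathbb{A}_\Delta} B(\theta, \rho^{\pi}) = B(\theta, \rho).$$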

Using a symmetric choice of $\rho$, the TRW bound $B^*(\tau, \rho)$ then satisfies the condition of Theorem 1, enabling the applicability of the general lifted variational inference framework.

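Theorem 3.

For a fixed $\rho \in \mathbb{T}_{[\varphi^E]}$, $\varphi$ is the lifting partition for the TRW variational problem

$$\sup_{\tau \in \mathrm{OUTER}(G)} \langle \theta, \tau \rangle - B^*(\tau, \rho) = \sup_{\tau \in \mathrm{OUTER}_{[\varphi]}} \langle \theta, \tau \rangle - B^*(\tau, \rho) \qquad (5)$$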

4.2 Lifted TRW Problems

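We give the explicit lifted formulation of the TRW optimization problem (5). As in [3], we restrict $\tau$ to $\mathrm{OUTER}_{[\varphi]}$ by introducing the lifted variables $\bar{\tau}_j$ for each cell $\varphi_j$, and for all $i \in \varphi_j$, enforcing that $\tau_i = \bar{\tau}_j$. Effectively, we substitute every occurrence of $\tau_i$, $i \in \varphi_j$, by $\bar{\tau}_j$; in vector form, $\tau$ is substituted by $D\bar{\tau}$ where $D$ is the characteristic matrix of the partition $\varphi$: $D_{ij} = 1$ if $i \in \varphi_j$ and $0$ otherwise. This results in the lifted form of the TRW problem

$$\sup_{D\bar{\tau} \in \mathrm{OUTER}} \langle \bar{\theta}, \bar{\tau} \rangle - \bar{B}^*(\bar{\tau}, \bar{\rho}) \qquad (6)$$

where $\bar{\theta} = D^{\top}\theta$; $\bar{B}^*$ is obtained from $B^*$ via the above substitution; and $\bar{\rho}$ is the edge appearance per edge-orbit: for every edge orbit $\mathbf{e}$, and for every edge $e \in \mathbf{e}$, $\rho_e = \bar{\rho}_{\mathbf{e}}$. Using an alternative but equivalent form $-B^*(\tau, \rho) = \sum_{v \in V} \big(1 - \sum_{e \ni v} \rho_e\big) H(\tau_v) + \sum_{e \in E} \rho_e H(\tau_e)$, we obtain the following explicit form for $\bar{B}^*$

$$-\bar{B}^*(\bar{\tau}, \bar{\rho}) = \sum_{\mathbf{v}} \Big(|\mathbf{v}| - \sum_{\mathbf{e}} |\mathbf{e}| \, d(\mathbf{v}, \mathbf{e}) \, \bar{\rho}_{\mathbf{e}}\Big) H(\bar{\tau}_{\mathbf{v}}) + \sum_{\mathbf{e}} |\mathbf{e}| \, \bar{\rho}_{\mathbf{e}} \, H(\bar{\tau}_{\mathbf{e}}) \qquad (7)$$

where the sums range over the node orbits $\mathbf{v}$ and edge orbits $\mathbf{e}$, and $d(\mathbf{v}, \mathbf{e}) \in \{0, 1, 2\}$ counts the endpoints of an edge in $\mathbf{e}$ that lie in $\mathbf{v}$.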

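Intuitively, the above can be viewed as a combination of node and edge entropies defined on nodes and edges of the lifted graph $\bar{G}$. Nodes of $\bar{G}$ are the node orbits of $G$ while edges are the edge-orbits of $G$. $\bar{G}$ is not a simple graph: it can have self-loops or multi-edges between the same node pair (see Fig. 1). We encode the incidence on this graph as follows: $d(\mathbf{v}, \mathbf{e}) = 0$ if $\mathbf{e}$ is not incident to $\mathbf{v}$, $d(\mathbf{v}, \mathbf{e}) = 1$ if $\mathbf{e}$ is incident to $\mathbf{v}$ and $\mathbf{e}$ is not a loop, $d(\mathbf{v}, \mathbf{e}) = 2$ if $\mathbf{e}$ is a loop incident to $\mathbf{v}$. The entropy at the node orbit $\mathbf{v}$ is defined as

$$H(\bar{\tau}_{\mathbf{v}}) = -\sum_{t \in \mathcal{X}} \bar{\tau}_{\mathbf{v}:t} \ln \bar{\tau}_{\mathbf{v}:t}$$

and the entropy at the edge orbit $\mathbf{e}$ is

$$H(\bar{\tau}_{\mathbf{e}}) = -\sum_{t, h \in \mathcal{X}} \bar{\tau}_{\{e:(t,h)\}} \ln \bar{\tau}_{\{e:(t,h)\}}$$

where $e$ is a representative (any element) of $\mathbf{e}$, $(t, h)$ is an assignment of the ground edge $e$, and $\{e:(t,h)\}$ is the assignment orbit. As in [3], we write $\{e:(t,t)\}$ as $\mathbf{e}:tt$, and for $t \neq h$, $\{e:(t,h)\}$ as $\mathbf{a}:th$ where $\mathbf{a}$ is the arc-orbit containing the corresponding orientation of $e$.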

When OUTER is the local or cycle polytope, the constraints $D\bar{\tau} \in \mathrm{OUTER}$ yield the lifted local (or cycle) polytope respectively. For these constraints, we use the same form given in [3]. In Section 6, we describe a set of additional constraints for further tightening when some cluster of nodes is exchangeable.
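To make the orbit-level entropy concrete, the sketch below evaluates the node-and-edge-entropy form of the TRW entropy once on a ground graph and once using only orbit sizes, per-orbit edge appearance probabilities and the incidence degrees 0, 1 or 2 described above. The data layout and function names are assumptions made for this illustration, and edge pseudo-marginals are stored as flat lists of four probabilities.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a flat probability list."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def ground_trw_entropy(nodes, edges, node_marg, edge_marg, rho):
    """sum_v (1 - sum_{e incident to v} rho_e) H(tau_v) + sum_e rho_e H(tau_e)."""
    total = 0.0
    for v in nodes:
        incident = sum(rho[e] for e in edges if v in e)
        total += (1.0 - incident) * entropy(node_marg[v])
    for e in edges:
        total += rho[e] * entropy(edge_marg[e])
    return total

def lifted_trw_entropy(node_orbits, edge_orbits, incidence):
    """Orbit-level evaluation of the same quantity.

    node_orbits: {name: (size, marginal)}
    edge_orbits: {name: (size, rho_bar, joint)}
    incidence:   {(node_orbit, edge_orbit): 0, 1 or 2}
    """
    total = 0.0
    for v, (v_size, v_marg) in node_orbits.items():
        weight = v_size - sum(
            e_size * incidence.get((v, e), 0) * rho_bar
            for e, (e_size, rho_bar, _) in edge_orbits.items())
        total += weight * entropy(v_marg)
    for e_size, rho_bar, joint in edge_orbits.values():
        total += e_size * rho_bar * entropy(joint)
    return total
```

For a fully symmetric complete graph there is a single node orbit and a single edge orbit that forms a self-loop in the lifted graph (incidence degree 2), so the lifted evaluation touches two entropy terms regardless of the number of ground nodes.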

Figure 1: Left: ground graphical model. Same colored nodes and edges have the same parameters. Right: lifted graph showing 2 node orbits (b and r), and 3 edge orbits. Numbers on the lifted graph represent the incidence degree between an edge orbit and a node orbit.

Example. Consider the MRF shown in Fig. 1

(left) with 10 binary variables that we denote

(for the blue nodes) and (for the red nodes). The node and edge coloring denotes shared parameters. Let and be the single-node potentials used for the blue and red nodes, respectively. Let be the edge potential used for the red edges connecting the blue and red nodes, for the edge potential used for the blue edges , and for the edge potential used for the black edges .

There are two node orbits: and There are three edge orbits: for the red edges, for the blue edges , and for the black edges. The size of the node and edge orbits are all 5 (e.g., ), and , . Suppose that corresponds to a uniform distribution over spanning trees, which satisfies the symmetry needed by Theorem 2. We then have and . Putting all of this together, the lifted TRW entropy is given by . We illustrate the expansion of the entropy of the red edge orbit . This edge orbit has 2 corresponding arc-orbits: and . Thus, .

Finally, the linear term in the objective is given by where, as an example,

4.3 Optimization using Frank-Wolfe

What remains is to describe how to optimize Eq. 6. Our lifted tree-reweighted algorithm is based on Frank-Wolfe, also known as the conditional gradient method [7, 11]. First, we initialize with a pseudo-marginal vector corresponding to the uniform distribution, which is guaranteed to be in the lifted marginal polytope. Next, we solve the linear program whose objective is given by the gradient of the objective Eq. 6 evaluated at the current point, and whose constraint set is OUTER. When using the lifted cycle relaxation, we solve this linear program using a cutting-plane algorithm [3, 23]. We then perform a line search to find the optimal step size using a golden section search (a type of binary search that finds the maxima of a unimodal function), and finally repeat using the new pseudo-marginal vector. We warm start each linear program using the optimal basis found in the previous run, which makes the LP solves extremely fast after the first couple of iterations. Although we use a generic LP solver in our experiments, it is also possible to use dual decomposition to derive efficient algorithms specialized to graphical models [24].
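The loop described above (linear oracle, golden section line search, update) can be sketched generically as follows. This is not the lifted TRW solver: as a stand-in we maximize a simple concave quadratic over the probability simplex, where the linear program reduces to picking the coordinate with the largest gradient. All function names, tolerances and the toy objective are assumptions for illustration.

```python
import math

def golden_section_max(f, lo=0.0, hi=1.0, tol=1e-10):
    """Locate the maximizer of a unimodal function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc > fd:
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2.0

def frank_wolfe_max(f, grad, lp_oracle, x0, max_iter=200, gap_tol=1e-9):
    """Maximize a concave f over a polytope accessed through lp_oracle."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        s = lp_oracle(g)  # vertex maximizing <g, s>
        gap = sum(gi * (si - xi) for gi, si, xi in zip(g, s, x))
        if gap < gap_tol:  # duality gap bounds the suboptimality
            break
        step = golden_section_max(
            lambda a: f([xi + a * (si - xi) for xi, si in zip(x, s)]))
        x = [xi + step * (si - xi) for xi, si in zip(x, s)]
    return x

def simplex_oracle(g):
    """LP over the probability simplex: pick the best coordinate."""
    best = max(range(len(g)), key=lambda i: g[i])
    return [1.0 if i == best else 0.0 for i in range(len(g))]

# Toy concave objective: negative squared distance to a target in the simplex.
TARGET = [0.2, 0.3, 0.5]

def toy_objective(x):
    return -sum((xi - ci) ** 2 for xi, ci in zip(x, TARGET))

def toy_gradient(x):
    return [-2.0 * (xi - ci) for xi, ci in zip(x, TARGET)]
```

In the lifted setting the oracle would instead be an LP (or cutting-plane) solve over the lifted local or cycle polytope, and the duality gap computed inside the loop is the same quantity used as a stopping criterion in the experiments.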

5 Lifted Maximum Spanning Tree

Optimizing the TRW edge appearance probability requires finding the maximum spanning tree (MST) in the ground graphical model. For lifted TRW, we need to perform MST while using only information from the node and edge orbits, without referring to the ground graph. In this section, we present a lifted MST algorithm for symmetric graphs which works at the orbit level.

Suppose that we are given a weighted graph $(G, w)$, its automorphism group $\mathbb{A}$ and its node and edge orbits. We aim to derive an algorithm analogous to Kruskal’s algorithm, but with complexity depending only on the number of node/edge orbits of $G$. However, if the algorithm has to return an actual spanning tree of $G$ then clearly its complexity cannot be less than $O(|V|)$. Instead, we consider an equivalent problem: solving a linear program on the spanning-tree polytope

$$\sup_{\rho \in \mathbb{T}(G)} \langle \rho, w \rangle \qquad (8)$$

The same mechanism for lifting convex optimization problems (Lemma 1 in [3]) applies to this problem. Let $\varphi^E$ be the edge orbit partition; then an equivalent lifted problem is

$$\sup_{\rho \in \mathbb{T}_{[\varphi^E]}} \langle \rho, w \rangle \qquad (9)$$

Since $\rho_e$ is constrained to be the same for edges in the same orbit, it is now possible to solve (9) with complexity depending only on the number of orbits. Any solution $\rho$ of the LP (8) can be turned into a solution $\bar{\rho}$ of (9) by letting $\bar{\rho}_{\mathbf{e}} = \frac{1}{|\mathbf{e}|} \sum_{e \in \mathbf{e}} \rho_e$.

5.1 Lifted Kruskal’s Algorithm

Kruskal’s algorithm first sorts the edges in order of decreasing weight. Then, starting from an empty graph, at each step it greedily attempts to add the next edge while maintaining the property that the used edges form a forest (containing no cycle). The forest obtained at the end of this algorithm is a maximum-weight spanning tree.

Imagine how Kruskal’s algorithm would operate on a weighted graph $G$ with non-trivial automorphisms. Let $\mathbf{e}_1, \dots, \mathbf{e}_k$ be the list of edge-orbits sorted in the order of decreasing weight (the weights on all edges in the same orbit by definition must be the same). The main question therefore is how many edges in each edge-orbit $\mathbf{e}_i$ will be added to the spanning tree by Kruskal’s algorithm. Let $G_i$ be the subgraph of $G$ formed by the set of all the edges and nodes in $\mathbf{e}_1, \dots, \mathbf{e}_i$ (with $G_0$ the empty graph). Let $V(\cdot)$ and $C(\cdot)$ denote the set of nodes and set of connected components of a graph, respectively. Then (see the supplementary material for proof)

Lemma 4.

The number of edges in $\mathbf{e}_i$ appearing in the MST found by Kruskal’s algorithm is $\delta_V^{(i)} - \delta_C^{(i)}$ where $\delta_V^{(i)} = |V(G_i)| - |V(G_{i-1})|$ and $\delta_C^{(i)} = |C(G_i)| - |C(G_{i-1})|$. Thus a solution for the linear program (9) is $\bar{\rho}_{\mathbf{e}_i} = \big(\delta_V^{(i)} - \delta_C^{(i)}\big) / |\mathbf{e}_i|$.
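The counting identity in Lemma 4 can be checked on small ground graphs: the number of edges a spanning forest of $G_i$ contains is $|V(G_i)| - |C(G_i)|$, so the contribution of each orbit is the increase of this quantity. The Python sketch below works at the ground level with a union-find structure, purely as a sanity check of the formula; it is not the orbit-level algorithm from the supplementary material, and the function names are ours.

```python
from fractions import Fraction

def forest_size(edges):
    """|V| - |C| of the graph formed by the given edges."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = 0
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            merges += 1  # each successful union adds one forest edge
    return merges

def lifted_mst_solution(sorted_orbits):
    """Per-orbit edge appearance from the increase of |V(G_i)| - |C(G_i)|.

    sorted_orbits: list of edge lists, in order of decreasing weight.
    """
    rho_bar, prefix, previous = [], [], 0
    for orbit in sorted_orbits:
        prefix.extend(orbit)
        current = forest_size(prefix)
        rho_bar.append(Fraction(current - previous, len(orbit)))
        previous = current
    return rho_bar
```

For example, on the complete graph with four nodes where the four cycle edges form the heavier orbit and the two diagonals the lighter one, three of the four cycle edges enter the tree and no diagonal does.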

5.2 Lifted Counting of the Number of Connected Components

We note that counting the number of nodes can be done simply by adding the size of each node orbit. The remaining difficulty is how to count the number of connected components of a given graph $G$ using only information at the orbit level. (Footnote 4: Since we are only interested in connectivity in this subsection, the weights of $G$ play no role. Thus, orbits in this subsection can also be generated by the automorphism group of the unweighted version of $G$.) Let $\bar{G}$ be the lifted graph of $G$. Then (see supplementary material for proof)

Lemma 5.

If $\bar{G}$ is connected then all connected components of $G$ are isomorphic. Thus if furthermore $G'$ is a connected component of $G$ then $|C(G)| = |V(G)| / |V(G')|$.

To find just one connected component, we can choose an arbitrary node $v$ and compute $\bar{G}[v]$, the lifted graph fixing $v$ (see Section 8.1 in [3]), then search for the connected component in $\bar{G}[v]$ that contains $\{v\}$. Finally, if $\bar{G}$ is not connected, we simply apply Lemma 5 for each connected component of $\bar{G}$.
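The counting rule of Lemma 5 is easy to illustrate on a ground graph whose components are all isomorphic: find one component and divide. The sketch below uses a plain breadth-first search on an adjacency dictionary rather than the lifted graph fixing a node, so it only demonstrates the identity, under the assumption that the components are indeed isomorphic; the function names are ours.

```python
from collections import deque

def component_size(adj, start):
    """Number of nodes reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen)

def count_components_symmetric(adj):
    """|C(G)| = |V(G)| / |V(G')|, valid when all components are isomorphic."""
    start = next(iter(adj))
    return len(adj) // component_size(adj, start)
```

Only a single search is needed, whatever the number of components, which is what makes the rule attractive when the search itself can be carried out at the orbit level.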

The final lifted Kruskal’s algorithm combines Lemmas 4 and 5 while keeping track of the set of connected components of $G_i$ incrementally. The full algorithm is given in the supplementary material.

6 Tightening via Exchangeable Polytope Constraints

One type of symmetry often found in first-order probabilistic models is large sets of exchangeable random variables. In certain cases, exact inference with exchangeable variables is possible via lifted counting elimination and its generalization [17, 2]. The drawback of these exact methods is that they do not apply to many models (e.g., those with transitive clauses). Lifted variational inference methods do not have this drawback; however, local and cycle relaxations can be shown to be loose in the exchangeable setting, a potentially serious limitation compared to earlier work.

To remedy this situation, we now show how to take advantage of highly symmetric subsets of variables to tighten the relaxation of the lifted marginal polytope.

We call a set of random variables $\chi$ an exchangeable cluster iff $\chi$ can be arbitrarily permuted while preserving the probability model. Mathematically, the lifting group $\mathbb{A}_\Delta$ acts on $\chi$ and the image of the action is isomorphic to $S(\chi)$, the symmetric group on $\chi$. The distribution of the random variables in $\chi$ is also exchangeable in the usual sense.

Our method for tightening the relaxation of the marginal polytope is based on lift-and-project, wherein we introduce auxiliary variables specifying the joint distribution of a large cluster of variables, and then enforce consistency between the cluster distribution and the pseudo-marginal vector [20, 24, 27]. In the ground model, one typically works with small clusters (e.g., triplets) because the number of variables grows exponentially with cluster size. The key (and nice) difference in the lifted case is that we can make use of very large clusters of highly symmetric variables: while the grounded relaxation would clearly blow up, the corresponding lifted relaxation can still remain compact.

Specifically, for an exchangeable cluster $\chi$ of arbitrary size, one can add cluster consistency constraints for the entire cluster and still maintain tractability. To keep the exposition simple, we assume that the variables are binary. Let $C$ denote a $\chi$-configuration, i.e., a function $C : \chi \to \{0, 1\}$. The set $\{\tau^{\chi}_{C}\}$, ranging over all configurations $C$, is the collection of $\chi$-cluster auxiliary variables. Since $\chi$ is exchangeable, all nodes in $\chi$ belong to the same node orbit; we call this node orbit $\mathbf{v}(\chi)$. Similarly, $\mathbf{e}(\chi)$ and $\mathbf{a}(\chi)$ denote the single edge and arc orbit that contains all edges and arcs in $\chi$ respectively. Let $u_1, u_2$ be two distinct nodes in $\chi$. To enforce consistency between the cluster $\chi$ and the edge $\{u_1, u_2\}$ in the ground model, we introduce the constraints

$$\sum_{C :\, C(u_1) = s_1,\, C(u_2) = s_2} \tau^{\chi}_{C} = \tau_{u_1:s_1,\, u_2:s_2} \quad \forall s_1, s_2 \in \{0, 1\} \qquad (10)$$

These constraints correspond to using intersection sets of size two, which can be shown to be the exact characterization of the marginal polytope involving variables in $\chi$ if the graphical model only has pairwise potentials. If higher-order potentials are present, a tighter relaxation could be obtained by using larger intersection sets together with the techniques described below.

The constraints in (10) can be methodically lifted by replacing occurrences of ground variables with lifted variables at the orbit level. First observe that in place of the grounded variables $\tau_{u_1:s_1,\, u_2:s_2}$, the lifted local relaxation has three corresponding lifted variables, $\bar{\tau}_{\mathbf{e}(\chi):00}$, $\bar{\tau}_{\mathbf{e}(\chi):11}$ and $\bar{\tau}_{\mathbf{a}(\chi):01}$. Second, we consider the orbits of the set of configurations $C$. Since $\chi$ is exchangeable, there can be only $n + 1$ $\chi$-configuration orbits, where $n = |\chi|$; each orbit contains all configurations with precisely $k$ 1’s where $k = 0, \dots, n$. Thus, instead of the $2^n$ ground auxiliary variables, we only need $n + 1$ lifted cluster variables. Further manipulation leads to the following set of constraints, which we call lifted exchangeable polytope constraints.

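Theorem 6.

Let $\chi$ be an exchangeable cluster of size $n$; $\mathbf{e}(\chi)$ and $\mathbf{a}(\chi)$ be the single edge and arc orbit of the graphical model that contains all edges and arcs in $\chi$ respectively; $\bar{\tau}$ be the lifted marginals. Then there exist $c_k \ge 0$, $k = 0, \dots, n$, such that

$$\bar{\tau}_{\mathbf{e}(\chi):00} = \sum_{k=0}^{n-2} \frac{(n-k)(n-k-1)}{n(n-1)} c_k, \quad \bar{\tau}_{\mathbf{a}(\chi):01} = \sum_{k=1}^{n-1} \frac{k(n-k)}{n(n-1)} c_k, \quad \bar{\tau}_{\mathbf{e}(\chi):11} = \sum_{k=2}^{n} \frac{k(k-1)}{n(n-1)} c_k$$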

Proof.

See the supplementary material. ∎

In contrast to the lifted local and cycle relaxations, the number of variables and constraints in the lifted exchangeable relaxation depends linearly on the domain size of the first-order model. From the lifted local constraints given by [3], $\bar{\tau}_{\mathbf{e}(\chi):00} + \bar{\tau}_{\mathbf{e}(\chi):11} + 2\bar{\tau}_{\mathbf{a}(\chi):01} = 1$. Substituting in the expressions involving $c_k$, we get $\sum_{k=0}^{n} c_k = 1$. Intuitively, $c_k$ represents the approximation of the marginal probability of having precisely $k$ ones in $\chi$.
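The interpretation of the count variables can be checked numerically for a genuinely exchangeable distribution: if the number of ones among $n$ binary variables has distribution $c_0, \dots, c_n$ and all configurations with the same count are equally likely, the pairwise marginals of any two variables follow from elementary counting. The Python sketch below compares the closed-form expressions with brute-force enumeration; the function names and the example count distribution are hypothetical.

```python
from itertools import product
from math import comb

def pair_marginals_from_counts(c):
    """Pairwise marginals (p00, p01, p11) of two variables given P(#ones = k) = c[k]."""
    n = len(c) - 1
    z = n * (n - 1)
    p00 = sum(c[k] * (n - k) * (n - k - 1) for k in range(n + 1)) / z
    p01 = sum(c[k] * k * (n - k) for k in range(n + 1)) / z
    p11 = sum(c[k] * k * (k - 1) for k in range(n + 1)) / z
    return p00, p01, p11

def pair_marginals_brute_force(c):
    """Same marginals by enumerating all 2^n configurations."""
    n = len(c) - 1
    table = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 0.0}
    for x in product((0, 1), repeat=n):
        k = sum(x)
        # Each configuration with k ones gets an equal share of c[k].
        table[(x[0], x[1])] += c[k] / comb(n, k)
    return table[(0, 0)], table[(0, 1)], table[(1, 1)]
```

The three values satisfy $p_{00} + 2 p_{01} + p_{11} = 1$ exactly when the counts sum to one, which mirrors the normalization obtained from the lifted local constraints.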

As proved by [2], groundings of unary predicates in Markov Logic Networks (MLNs) give rise to exchangeable clusters. Thus, for MLNs, the above theorem immediately suggests a tightening of the relaxation: for every unary predicate of an MLN, add a new set of constraints as above to the existing lifted local (or cycle) optimization problem. Although it is not the focus of our paper, we note that this should also improve the lifted MAP inference results of [3]. For example, in the case of a symmetric complete graphical model, lifted MAP inference using the linear program given by these new constraints would find the exact number of ones $k$ that maximizes the probability of a configuration, hence recover the same solution as counting elimination. Marginal inference may still be inexact due to the tree-reweighted entropy approximation. We re-emphasize that the complexity of variational inference with lifted exchangeable constraints is guaranteed to be polynomial in the domain size, unlike exact methods based on lifted counting elimination and variable elimination.

7 Experiments

In this section, we provide an empirical evaluation of our lifted tree-reweighted (LTRW) algorithm. As a baseline we use a dampened version of the lifted belief propagation (LBP-Dampening) algorithm from [21]. Our lifted algorithm has all of the same advantages of the tree-reweighted approach over belief propagation, which we will illustrate in the results: (1) a convex objective that can be convergently solved to optimality, (2) upper bounds on the partition function, and (3) the ability to easily improve the approximation by tightening the relaxation. Our evaluation includes four variants of the LTRW algorithm corresponding to using different outer bounds: lifted local polytope (LTRW-L), lifted cycle polytope (LTRW-C), lifted local polytope with exchangeable polytope constraints (LTRW-LE), and lifted cycle polytope with exchangeable constraints (LTRW-CE). The conditional gradient optimization of the lifted TRW objective terminates when the duality gap falls below a fixed tolerance or when a maximum number of iterations is reached. To solve the LP problem during conditional gradient, we use Gurobi (http://www.gurobi.com/).

We evaluate all the algorithms using several first-order probabilistic models. We assume that no evidence has been observed, which results in a large amount of symmetry. Even without evidence, performing marginal inference in first-order probabilistic models can be very useful for maximum likelihood learning [13]. Furthermore, the fact that our lifted tree-reweighted variational approximation provides an upper bound on the partition function enables us to maximize a lower bound on the likelihood [29], which we demonstrate in Sec. 7.5. To find the lifted orbit partition, we use the renaming group as in [3], which exploits the symmetry of the unobserved constants in the model.

Rather than optimize over the spanning tree polytope, which is computationally intensive, most TRW implementations use a single fixed choice of edge appearance probabilities, e.g. an (un)weighted distribution obtained using the matrix-tree theorem. In these experiments, we initialize the lifted edge appearance probabilities to be the most uniform per-orbit edge-appearance probabilities by solving the corresponding optimization problem using conditional gradient. Each direction-finding step of this conditional gradient solves a lifted MST problem using our lifted Kruskal’s algorithm, with edge weights derived from the current solution. After this initialization, we fix the lifted edge appearance probabilities and do not attempt to optimize them further.

7.1 Test models

Figure 2: An example of the ground graphical model for the Clique-Cycle model (domain size = 3).

Fig. 3 describes the four test models in MLN syntax. We focus on the repulsive case, since for attractive models, all TRW variants and lifted LBP give similar results. One weight parameter is varied during the experiments. In all models except Clique-Cycle, this parameter acts like the “local field” potential in an Ising model; a negative (or positive) value means the corresponding variable tends to be in the 0 (or 1) state. Complete-Graph is equivalent to an Ising model on the complete graph of size $n$ (the domain size) with homogeneous parameters. Exact marginals and the log-partition function can be computed in closed form using lifted counting elimination. The weight of the interaction clause is chosen to be repulsive. Friends-Smokers (negated) is a variant of the Friends-Smokers model [21] where the weight of the final clause is set to -1.1 (repulsive). We use the method in [2] to compute the exact marginal for the Cancer predicate and the exact value of the log-partition function. Lovers-Smokers is the same MLN used in [3] with a full transitive clause and where we vary the prior of the Loves predicate. Clique-Cycle is a model with 3 cliques and 3 bipartite graphs in between. Its corresponding ground graphical model is shown in Fig. 2.

Figure 3: Test models (Complete-Graph, Friends-Smokers (Negated), Lovers-Smokers, Clique-Cycle).

7.2 Accuracy of Marginals

Fig. 4 shows the marginals computed by all the algorithms as well as exact marginals on the Complete-Graph and Friends-Smokers models. We do not know how to efficiently perform exact inference in the remaining two models, and thus do not measure accuracy for them. The result on complete graphs illustrates the clear benefit of tightening the relaxation: LTRW-Local and LBP are inaccurate for moderate , whereas cycle constraints and, especially, exchangeable constraints drastically improve accuracy. As discussed earlier, for the case of symmetric complete graphical models, the exchangeable constraints suffice to exactly characterize the marginal polytope. As a result, the approximate marginals computed by LTRW-LE and LTRW-CE are almost the same as the exact marginals; the very small difference is due to the entropy approximation. On the Friends-Smokers (negated) model, all LTRW variants give accurate marginals while lifted LBP even with very strong dampening ( weight given to previous iterations’ messages) fails to converge for . We observed that LTRW-LE gives the best trade-off between accuracy and running time for this model. Note that we do not compare to ground versions of the lifted TRW algorithms because, by Theorem 3, the marginals and log-partition function are the same for both.

Figure 4: Left: marginal accuracy for complete graph model. Right: marginal accuracy for in Friends-Smokers (neg). Lifted TRW variants using different outer bounds: L=local, C=cycle, LE=local+exchangeable, CE=cycle+exchangeable (best viewed in color).

7.3 Quality of Log-Partition Upper bounds

Fig. 5 plots the values of the upper bounds obtained by the LTRW algorithms on the four test models. The results clearly show the benefits of adding each type of constraint to the LTRW, with the best upper bound obtained by tightening the lifted local polytope with both lifted cycle and exchangeable constraints. For the Complete-Graph and Friends-Smokers models, the log-partition approximation using exchangeable polytope constraints is very close to exact. In addition, we illustrate lifted LBP’s approximation of the log-partition function on the Complete-Graph model (note that it is non-convex and not an upper bound).

Figure 5: Approximations of the log-partition function on the four test models from Fig. 3 (best viewed in color).

7.4 Running time

As shown in Table 1, lifted variants of TRW are orders of magnitude faster than the ground version. Interestingly, lifted TRW with local constraints is observed to be faster as the domain size increases; this is probably due to the fact that as the domain size increases, the distribution becomes more peaked, so marginal inference becomes more similar to MAP inference. Lifted TRW with local and exchangeable constraints requires a smaller number of conditional gradient iterations, and thus is faster; however, note that its running time slowly increases since the exchangeable constraint set grows linearly with domain size.

LBP’s lack of convergence makes it difficult to have a meaningful timing comparison with LBP. For example, LBP did not converge for about half of the values of in the Lovers-Smokers model, even after using very strong damping. We did observe that when LBP converges, it is much faster than LTRW. We hypothesize that this is due to the message-passing nature of LBP, which is based on a fixed-point update, whereas our algorithm is based on Frank-Wolfe.

Domain size       10       20        30     100     200
TRW-L         138370   609502   1525140       -       -
LTRW-L          3255     3581      3438    1626    1416
LTRW-LE          681      703       721    1033    1307
Table 1: Ground vs lifted TRW runtime on Complete-Graph (milliseconds)
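The "orders of magnitude" claim can be checked directly from the table. The short sketch below copies the Table 1 runtimes and computes the ratio of ground to lifted runtime for the domain sizes where a ground measurement exists; the variable and function names are illustrative.

```python
# Runtimes in milliseconds, copied from Table 1 (None marks entries shown as "-").
runtime_ms = {
    "TRW-L":   {10: 138370, 20: 609502, 30: 1525140, 100: None, 200: None},
    "LTRW-L":  {10: 3255, 20: 3581, 30: 3438, 100: 1626, 200: 1416},
    "LTRW-LE": {10: 681, 20: 703, 30: 721, 100: 1033, 200: 1307},
}


def speedup_over_ground(method):
    """Ratio of ground TRW-L runtime to a lifted method's runtime, per domain size."""
    ground = runtime_ms["TRW-L"]
    return {
        n: ground[n] / t
        for n, t in runtime_ms[method].items()
        if ground[n] is not None  # skip sizes where the ground run is missing
    }


for method in ("LTRW-L", "LTRW-LE"):
    ratios = speedup_over_ground(method)
    print(method, {n: round(r, 1) for n, r in ratios.items()})
```

On these numbers the ratios range from roughly 40x (LTRW-L, domain size 10) to over 2000x (LTRW-LE, domain size 30), and they grow with the domain size because the ground runtime grows while the lifted runtimes stay nearly flat.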

7.5 Application to Learning

We now describe an application of our algorithm to the task of learning relational Markov networks for inferring protein-protein interactions from noisy, high-throughput, experimental assays [12]. This is equivalent to learning the parameters of an exponential family random graph model [19] where the edges in the random graph represent the protein-protein interactions. Despite fully observed data, maximum likelihood learning is challenging because of the intractability of computing the log-partition function and its gradient. In particular, this relational Markov network has over 330K random variables (one for each possible interaction among 813 proteins) and ternary potentials. However, Jaimovich et al. [13] observed that the partition function in relational Markov networks is highly symmetric, and used lifted LBP to efficiently perform approximate learning in running time that is independent of the domain size. They used their lifted inference algorithm to visualize the (approximate) likelihood landscape for different values of the parameters, which, among other uses, characterizes the robustness of the model to parameter changes.
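The variable count quoted above can be checked with a one-line computation, assuming that the figure 813 refers to the number of proteins and that the model has one binary interaction variable per unordered pair of proteins (an assumption consistent with the "over 330K" figure).

```python
import math

num_proteins = 813  # figure quoted in the text
# One interaction variable per unordered pair of proteins (assumed).
num_pair_variables = math.comb(num_proteins, 2)
print(num_pair_variables)
```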

We use precisely the same procedure as [13], replacing lifted BP with our new lifted TRW algorithms. The model has three parameters:

, used in the single-node potential to specify the prior probability of a protein-protein interaction;

, part of the ternary potentials, which encourages cliques of three interacting proteins; and

, also part of the ternary potentials, which encourages chain-like structures where proteins interact, interact, but and do not (see supplementary material for the full model specification as an MLN).

We follow their two-step estimation procedure, first estimating in the absence of the other parameters (the maximum likelihood, BP, and TRW estimates of this parameter coincide, and estimation can be performed in closed form: ). Next, for each setting of and we estimate the log-partition function using lifted TRW with the cycle+exchangeable constraints and with the local constraints only. Since the TRW objective is an upper bound on the log-partition function, these estimates provide lower bounds on the likelihood.
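The step from an upper bound on the log-partition function to a lower bound on the likelihood can be written out explicitly. The notation below is assumed for illustration and is not taken from the original: $\theta$ denotes the parameters, $\Phi(x)$ the sufficient statistics of the fully observed data $x$, $A(\theta)$ the exact log-partition function, and $B(\theta)$ the TRW upper bound.

```latex
\ell(\theta; x)
  \;=\; \langle \theta, \Phi(x) \rangle - A(\theta)
  \;\ge\; \langle \theta, \Phi(x) \rangle - B(\theta),
  \qquad \text{since } A(\theta) \le B(\theta).
```

A tighter bound $B$ therefore gives a tighter lower bound on the log-likelihood at every parameter setting.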

Our results are shown in Fig. 6, and should be compared to Fig. 7 of [13]. The overall shapes of the likelihood landscapes are similar. However, the lifted LBP estimates of the likelihood have several local optima, which cause gradient-based learning with lifted LBP to reach different solutions depending on the initial setting of the parameters. In contrast, since TRW is convex, any gradient-based procedure would reach the global optimum, and thus learning is much easier. Interestingly, we see that our estimates of the likelihood have a significantly smaller range over these parameter settings than that estimated by lifted LBP. Moreover, the high-likelihood parameter settings extend to larger values of . For all algorithms there is a sudden decrease in the likelihood at (not shown in the figure).

Figure 6: Log-likelihood lower bound obtained using lifted TRW with the cycle and exchangeable constraints (CE) for the same protein-protein interaction data used in [13] (left) (cf. Fig. 7 in [13]). Improvement in the lower bound after tightening the local constraints (L) with CE (right).

8 Discussion and Conclusion

Lifting partitions used by lifted and counting BP [21, 16] can be coarser than orbit partitions. In graph-theoretic terms, these partitions are called equitable partitions. If each equitable partition cell is thought of as a distinct node color, then among nodes with the same color, their neighbors must have the same color histogram. It is known that orbit partitions are always equitable; however, the converse is not always true [9].
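The color-histogram condition suggests a simple way to compute the coarsest equitable partition of a plain undirected graph, commonly known as color refinement. The sketch below is an illustration on uncolored graphs only (the graphical models in this paper additionally carry node and edge parameters, which would enter as initial colors); the function and variable names are hypothetical.

```python
def coarsest_equitable_partition(adjacency):
    """Color refinement on an undirected graph {node: set(neighbors)}.

    Returns a dict node -> color id describing the coarsest equitable partition.
    """
    color = {v: 0 for v in adjacency}  # start with a single cell
    while True:
        # Signature = own color plus the sorted multiset of neighbor colors.
        signature = {
            v: (color[v], tuple(sorted(color[u] for u in adjacency[v])))
            for v in adjacency
        }
        # Relabel distinct signatures with consecutive integers.
        ids = {sig: i for i, sig in enumerate(sorted(set(signature.values())))}
        new_color = {v: ids[signature[v]] for v in adjacency}
        if len(set(new_color.values())) == len(set(color.values())):
            return new_color  # no cell was split, so the partition is equitable
        color = new_color


def undirected(edges):
    """Build an adjacency dict from a list of undirected edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


# Path 0-1-2-3: the endpoints and the interior nodes form the two cells.
path = undirected([(0, 1), (1, 2), (2, 3)])

# A 6-cycle plus a disjoint triangle: every node has degree 2, so refinement
# keeps one cell, although no automorphism maps a 6-cycle node to a triangle
# node. The orbit partition has two cells, so it is strictly finer here.
hexagon = [(i, (i + 1) % 6) for i in range(6)]
triangle = [(6, 7), (7, 8), (8, 6)]
mixed = undirected(hexagon + triangle)
```

The second example is a small instance of the gap mentioned above: the equitable partition is coarser than the orbit partition.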

Since equitable partitions can be computed more efficiently and potentially lead to more compact lifted problems, the following question naturally arises: can we use equitable partitions in lifting the TRW problem? Unfortunately, a complete answer is non-trivial. We point out here a theoretical barrier due to the interplay between the spanning tree polytope and the equitable partition of a graph.

Let be the coarsest equitable partition of edges of . We give an example graph in the supplementary material (see Example 9) where the symmetrized spanning tree polytope corresponding to the equitable partition , is an empty set. When is empty, the consequence is that if we want to be within so that is guaranteed to be a convex upper bound of the log-partition function, we cannot restrict to be consistent with the equitable partition. In lifted and counting BP, so it is clearly consistent with the equitable partition; however, one loses convexity and the upper-bound guarantee as a result. This suggests that there might be a trade-off between the compactness of the lifting partition and the quality of the entropy approximation, a topic deserving the attention of future work.

In summary, we presented a formalization of lifted marginal inference as a convex optimization problem and showed that it can be efficiently solved using a Frank-Wolfe algorithm. Compared to previous lifted variational inference algorithms, in particular lifted belief propagation, our approach comes with convergence guarantees, upper bounds on the partition function, and the ability to improve the approximation (e.g. by introducing additional constraints) at the cost of small additional running time.

A limitation of our lifting method is that as the amount of soft evidence (the number of distinct individual objects) approaches the domain size, the behavior of lifted inference approaches ground inference. The wide difference in running time between ground and lifted inference suggests that significant efficiency can be gained by solving an approximation of the original problem that is more symmetric [25, 15, 22, 6]. One of the most interesting open questions raised by our work is how to use the variational formulation to perform approximate lifting. Since our lifted TRW algorithm provides an upper bound on the partition function, it is possible that one could use the upper bound to guide the choice of approximation when deciding how to re-introduce symmetry into an inference task.

Acknowledgements: Work by DS supported by DARPA PPAML program under AFRL contract no. FA8750-14-C-0005.

References

  • Barahona and Mahjoub [1986] F. Barahona and A. R. Mahjoub. On the cut polytope. Mathematical Programming, 36:157–173, 1986.
  • Bui et al. [2012] Hung Hai Bui, Tuyen N. Huynh, and Rodrigo de Salvo Braz. Lifted inference with distinct soft evidence on every object. In AAAI-2012, 2012.
  • Bui et al. [2013] Hung Hai Bui, Tuyen N. Huynh, and Sebastian Riedel. Automorphism groups of graphical models and lifted variational inference. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI-2013. AUAI Press, 2013.
  • Bui et al. [2014] Hung Hai Bui, Tuyen N. Huynh, and David Sontag. Lifted tree-reweighted variational inference. arXiv pre-print, 2014. http://arxiv.org/abs/1406.4200.
  • de Salvo Braz et al. [2005] R. de Salvo Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI ’05), pages 1319–1325, 2005.
  • de Salvo Braz et al. [2009] Rodrigo de Salvo Braz, Sriraam Natarajan, Hung Bui, Jude Shavlik, and Stuart Russell. Anytime lifted belief propagation. In 6th International Workshop on Statistical Relational Learning (SRL 2009), 2009.
  • Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956. ISSN 1931-9193.
  • Globerson and Jaakkola [2007] A. Globerson and T. Jaakkola. Convergent Propagation Algorithms via Oriented Trees. In Uncertainty in Artificial Intelligence, 2007.
  • Godsil and Royle [2001] Chris Godsil and Gordon Royle. Algebraic Graph Theory. Springer, 2001.
  • Gogate and Domingos [2011] Vibhav Gogate and Pedro Domingos. Probabilistic theorem proving. In Proceedings of the Twenty-Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 256–265, 2011.
  • Jaggi [2013] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th ICML, volume 28, pages 427–435. JMLR Workshop and Conference Proceedings, 2013.
  • Jaimovich et al. [2006] Ariel Jaimovich, Gal Elidan, Hanah Margalit, and Nir Friedman. Towards an integrated protein-protein interaction network: a relational markov network approach. Journal of Computational Biology, 13(2):145–164, 2006.
  • Jaimovich et al. [2007] Ariel Jaimovich, Ofer Meshi, and Nir Friedman. Template based inference in symmetric relational markov random fields. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, July 19-22, 2007, pages 191–199. AUAI Press, 2007.
  • Jancsary and Matz [2011] Jeremy Jancsary and Gerald Matz. Convergent decomposition solvers for tree-reweighted free energies. Journal of Machine Learning Research - Proceedings Track, 15:388–398, 2011.
  • Kersting et al. [2010] K. Kersting, Y. El Massaoudi, B. Ahmadi, and F. Hadiji. Informed lifting for message-passing. In M. Fox and D. Poole, editors, Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), Atlanta, USA, July 11–15, 2010. AAAI Press.
  • Kersting et al. [2009] Kristian Kersting, Babak Ahmadi, and Sriraam Natarajan. Counting belief propagation. In Proceedings of the 25th Annual Conference on Uncertainty in AI (UAI ’09), 2009.
  • Milch et al. [2008] B. Milch, L. S. Zettlemoyer, K. Kersting, M. Haimes, and L. P. Kaelbling. Lifted Probabilistic Inference with Counting Formulas. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI ’08), pages 1062–1068, 2008.
  • Niepert [2012] Mathias Niepert. Markov chains on orbits of permutation groups. In UAI-2012, 2012.
  • Robins et al. [2007] Garry Robins, Pip Pattison, Yuval Kalish, and Dean Lusher. An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2):173–191, 2007.
  • Sherali and Adams [1990] H. D. Sherali and W. P. Adams. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM Journal on Discrete Mathematics, 3(3):411–430, 1990. doi: 10.1137/0403036. URL http://link.aip.org/link/?SJD/3/411/1.
  • Singla and Domingos [2008] Parag Singla and Pedro Domingos. Lifted first-order belief propagation. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI ’08), pages 1094–1099, 2008.
  • Singla et al. [2010] Parag Singla, Aniruddh Nath, and Pedro Domingos. Approximate lifted belief propagation. In Workshop on Statistical Relational Artificial Intelligence (StaR-AI 2010), 2010.
  • Sontag and Jaakkola [2008] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In Advances in Neural Information Processing Systems 21. MIT Press, 2008.
  • Sontag et al. [2008] David Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Proceedings of the 24th Annual Conference on Uncertainty in AI (UAI ’08), 2008.
  • Van den Broeck and Darwiche [2013] Guy Van den Broeck and Adnan Darwiche. On the complexity and approximation of binary evidence in lifted inference. In Advances in Neural Information Processing Systems, pages 2868–2876, 2013.
  • Wainwright et al. [2005] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51:2313–2335, 2005.
  • Wainwright and Jordan [2008] Martin Wainwright and Michael Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers, 2008.
  • Wainwright et al. [2003] Martin J. Wainwright, Tommi Jaakkola, and Alan S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory, 49:1120–1146, 2003.
  • Yanover et al. [2008] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Journal of Computational Biology, 15(7):899–911, 2008.

Supplementary Materials for “Lifted Tree-Reweighted Variational Inference”

We present (1) proofs not given in the main paper, (2) full pseudo-code for the lifted Kruskal’s algorithm for finding a maximum spanning tree in symmetric graphs, and (3) additional details of the protein-protein interaction model.

Proof of Theorem 1.

Proof.

The lifting group stabilizes both the objective function and the constraints of the convex optimization problem in the LHS of Eq. (4). The equality is then established using Lemma 1 in [3]. ∎

We state and prove a lemma about the symmetry of the bounds and that will be used in subsequent proofs.

Lemma 7.

Let be an automorphism of the graphical model , then and .

Proof.

The intuition is that since the entropy bound is defined on the graph structure of the graphical model , it inherits the symmetry of . This can be verified by viewing the graph automorphism as a bijection from nodes to nodes and edges to edges, and so simply rearranges the summation inside .

We now show that the same symmetry applies to the log-partition upper bound . Let be an outer bound of the marginal polytope such that . Note that acts on and stabilizes OUTER, i.e., . Thus

Proof of Theorem 3.

Proof.

We will show that the condition of Theorem 1 holds. Let us fix a . Then for all Thus . On the other hand, by Lemma 7, . Thus . Note that in the case of the overcomplete representation, the action of the group is the permuting action of ; thus, the TRW bound (for fixed ) is stabilized by the lifting group . ∎

Conditional Gradient (Frank-Wolfe) Algorithm for Lifted TRW

The pseudo-code is given in Algorithm 1. Step 2 essentially solves a lifted MAP problem, for which we used the same algorithms presented in [3], with Gurobi as the main linear programming engine. Step 3 solves a 1-D constrained convex problem via golden section search to find the optimal step size.

1:; uniform
2:Direction finding via lifted MAP
3:Step size finding via golden section search
4:Update
5:
6:if not converged go to 2
Algorithm 1 Conditional gradient for optimizing lifted TRW problem
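The structure of Algorithm 1 can be illustrated with a generic conditional gradient loop. The following is a minimal sketch under stated assumptions, not the lifted TRW solver itself: it minimizes a convex function (equivalent to maximizing a concave one up to a sign change), the feasible set is the probability simplex as a stand-in for the lifted polytope, and the direction-finding step is a trivial vertex search rather than a lifted MAP problem solved with an LP engine. All names (`frank_wolfe`, `simplex_lmo`, `target`) are hypothetical.

```python
import math


def golden_section_min(phi, lo=0.0, hi=1.0, tol=1e-8):
    """Minimize a unimodal 1-D function phi on [lo, hi] by golden section search."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0  # reciprocal of the golden ratio
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = phi(c), phi(d)
    while b - a > tol:
        if fc < fd:  # minimum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = phi(c)
        else:  # minimum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = phi(d)
    return (a + b) / 2.0


def frank_wolfe(f, grad, lmo, x0, max_iter=200, gap_tol=1e-6):
    """Conditional gradient: minimize convex f over a compact convex set that
    is accessed only through a linear minimization oracle `lmo`."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad(x)
        s = lmo(g)  # direction finding: argmin over the feasible set of <g, s>
        gap = sum(gi * (xi - si) for gi, xi, si in zip(g, x, s))  # duality gap
        if gap <= gap_tol:
            break
        # Step size by 1-D line search along the segment from x to s.
        step = golden_section_min(
            lambda t: f([xi + t * (si - xi) for xi, si in zip(x, s)])
        )
        x = [xi + step * (si - xi) for xi, si in zip(x, s)]
    return x


def simplex_lmo(g):
    """Linear minimization over the probability simplex: pick the best vertex."""
    i = min(range(len(g)), key=lambda j: g[j])
    return [1.0 if j == i else 0.0 for j in range(len(g))]


target = [0.2, 0.3, 0.5]  # illustrative optimum inside the simplex
f = lambda x: sum((xi - ti) ** 2 for xi, ti in zip(x, target))
grad = lambda x: [2.0 * (xi - ti) for xi, ti in zip(x, target)]
x_star = frank_wolfe(f, grad, simplex_lmo, [1 / 3] * 3, max_iter=5000)
```

Because the method only ever calls the linear oracle, every iterate is a convex combination of feasible points and stays feasible without any projection step; the duality gap computed in the loop gives a natural convergence test.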

Proof of Lemma 4.

Proof.

After considering all edges in , Kruskal’s algorithm must form a spanning forest of (the forest is spanning since if there remains an edge that can be used without forming a cycle, Kruskal’s algorithm must have used it already). Since the forest is spanning, the number of edges used by Kruskal’s algorithm at this point is precisely . Similarly, just after considering all edges in , the number of edges used is . Therefore, the number of -edges used must be

which is the difference between the number of new nodes (which must be non-negative) and the number of new connected components (which could be negative) induced by considering edges in . Any MST solution can be turned into a solution of (9) by letting . Thus, we obtain a solution . ∎
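The counting argument in this proof can be checked numerically on a small ground graph. The sketch below is an illustration with hypothetical names, and it uses distinct weight values as a stand-in for the edge classes considered in the proof. It runs Kruskal's algorithm for a maximum spanning forest, counts how many edges of each weight are used, and compares the counts with the prediction obtained purely from spanning-forest sizes (number of nodes minus number of connected components) of the successive subgraphs.

```python
from collections import Counter


class DisjointSet:
    """Union-find with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, v):
        self.parent.setdefault(v, v)
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]
            v = self.parent[v]
        return v

    def union(self, u, v):
        """Merge the sets of u and v; return True if they were distinct."""
        ru, rv = self.find(u), self.find(v)
        if ru == rv:
            return False
        self.parent[ru] = rv
        return True


def kruskal_edges_per_weight(weighted_edges):
    """Maximum spanning forest by Kruskal; returns Counter weight -> edges used."""
    dsu = DisjointSet()
    used = Counter()
    for u, v, w in sorted(weighted_edges, key=lambda e: -e[2]):
        if dsu.union(u, v):  # the edge joins two components, so it is used
            used[w] += 1
    return used


def forest_size(edges):
    """Number of nodes minus number of components of the subgraph on `edges`."""
    dsu = DisjointSet()
    nodes = set()
    for u, v, _ in edges:
        nodes.update((u, v))
        dsu.union(u, v)
    return len(nodes) - len({dsu.find(v) for v in nodes})


def predicted_edges_per_weight(weighted_edges):
    """Counts predicted from spanning-forest sizes of the successive subgraphs."""
    weights = sorted({w for _, _, w in weighted_edges}, reverse=True)
    predicted, previous = Counter(), 0
    for w in weights:
        size = forest_size([e for e in weighted_edges if e[2] >= w])
        predicted[w] = size - previous
        previous = size
    return predicted


# A triangle of weight-3 edges plus two weight-1 edges attaching node "d".
edges = [("a", "b", 3), ("b", "c", 3), ("a", "c", 3), ("c", "d", 1), ("a", "d", 1)]
used = kruskal_edges_per_weight(edges)
predicted = predicted_edges_per_weight(edges)
```

In this example the heavy triangle contributes two tree edges and the light edges contribute one, and both functions agree on these counts without the prediction ever inspecting which individual edges Kruskal's algorithm selected.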

Proof of Lemma 5.

We first need an intermediate result.

Lemma 8.

If and are two distinct node orbits, and and are reachable in the lifted graph , then for any , there is some such that is reachable from .

Proof.

Induction on the length of the - path. Base case: if and are adjacent, there exists an edge orbit incident to both and . Therefore, there exists a ground edge in such that and . The automorphism mapping will map for some node . Clearly is an edge, and . Inductive step: assume the statement is true for all pairs of orbits with path length . Suppose is a path of length . Take the orbit right in front of in the path, so that is a path of length , and and are adjacent. By the inductive assumption, there exists such that is connected to . Applying the same argument as in the base case, there exists such that is an edge. Thus is connected to . ∎

We now return to the main proof of Lemma 5.

Proof.

If the ground graph has only one component then this is trivially true. Let and be two distinct connected components of , let be a node in and be the orbit containing . Let be any node in . Since the lifted graph is connected, all orbits are reachable from one another in . By the above lemma, there must be some node reachable from , hence (if then we just take ). This establishes that the node orbit intersects with both and . Note that since otherwise . Let be the automorphism that takes to .

We now show that . Since maps edges to edges and non-edges to non-edges, it is sufficient to show that . Let be a node of and . Since is connected, there exists a path from to . But must map this path to a path from to , hence