1 Introduction
Lifted probabilistic inference focuses on exploiting symmetries in probabilistic models for efficient inference [5, 2, 3, 10, 17, 18, 21]. Work in this area has demonstrated that very efficient inference is possible in highly connected, large-treewidth, but symmetric models, such as those arising in the context of relational (first-order) probabilistic models and exponential family random graphs [19]. These models also arise frequently in probabilistic programming languages, an area of increasing importance as demonstrated by DARPA's PPAML program (Probabilistic Programming for Advancing Machine Learning).
Even though lifted inference can sometimes offer order-of-magnitude improvements in performance, approximation is still necessary. A topic of particular interest is the interplay between lifted inference and variational approximate inference. Lifted loopy belief propagation (LBP) [13, 21] was one of the first attempts at exploiting symmetry to speed up loopy belief propagation; subsequently, counting belief propagation (CBP) [16] provided additional insights into the nature of symmetry in BP. Nevertheless, these works were largely procedural and specific to the choice of message-passing algorithm (in this case, loopy BP). More recently, Bui et al. [3] proposed a general framework for lifting a broad class of convex variational techniques by formalizing the notion of symmetry (defined via automorphism groups) of graphical models and the corresponding variational optimization problems themselves, independent of any specific methods or solvers.
Our goal in this paper is to extend the lifted variational framework in [3] to address the important case of approximate marginal inference. In particular, we show how to lift the tree-reweighted (TRW) convex formulation of marginal inference [28]. As far as we know, our work presents the first lifted convex variational marginal inference, with the following benefits over previous work: (1) a lifted convex upper bound of the log-partition function, (2) a new tightening of the relaxation of the lifted marginal polytope exploiting exchangeability, and (3) a convergent inference algorithm. We note that convex upper bounds of the log-partition function immediately lead to concave lower bounds of the log-likelihood, which can serve as useful surrogate loss functions in learning and parameter estimation
[29, 13]. To achieve the above goal, we first analyze the symmetry of the TRW log-partition and entropy bounds. Since TRW bounds depend on the choice of the edge appearance probabilities $\rho$, we prove that the quality of the TRW bound is not affected if one works only with suitably symmetric $\rho$. Working with symmetric $\rho$ gives rise to an explicit lifted formulation of the TRW optimization problem that is equivalent but much more compact. This convex objective function can be convergently optimized via a Frank-Wolfe (conditional gradient) method, where each Frank-Wolfe iteration solves a lifted MAP inference problem. We then discuss the optimization of the edge-appearance vector $\rho$, effectively yielding a lifted algorithm for computing maximum spanning trees in symmetric graphs. As in Bui et al.'s framework, our work can benefit from any tightening of the local polytope, such as the use of cycle inequalities [1, 23]. In fact, each method for relaxing the marginal polytope immediately yields a variant of our algorithm. Notably, in the case of exchangeable random variables, radically sharper tightening (sometimes even an exact characterization of the lifted marginal polytope) can be obtained via a set of simple and elegant linear constraints which we call exchangeable polytope constraints. We provide extensive simulation studies comparing the behavior of different variants of our algorithm with exact inference (when available) and lifted LBP, demonstrating the advantages of our approach. The supplementary material [4] provides additional proof and algorithm details.

2 Background
We begin by reviewing variational inference and the tree-reweighted (TRW) approximation. We focus on inference in Markov random fields, which are distributions in the exponential family given by $p(x; \theta) = \exp(\langle \theta, \phi(x) \rangle - A(\theta))$, where $A(\theta)$ is called the log-partition function and serves to normalize the distribution. We assume that the random variables are discrete-valued, and that the features $\phi$ factor according to the graphical model structure $G$; $\phi$ can be non-pairwise and is assumed to be overcomplete. This paper focuses on the inference tasks of estimating the marginal probabilities and approximating the log-partition function. Throughout the paper, the domain is the binary domain $\{0, 1\}$; however, except for the construction of exchangeable polytope constraints in Section 6, this restriction is not essential.
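To make these definitions concrete, here is a minimal sketch (illustrative names and data layout, not from the paper) that computes $A(\theta)$ for a tiny binary pairwise MRF by brute-force enumeration; this is only feasible for very small models, but is useful for sanity-checking approximate inference code:

```python
import itertools
import math

def log_partition(theta_node, theta_edge, edges, n):
    """Brute-force log-partition function A(theta) of a binary pairwise MRF.

    theta_node[v][a]       : node parameter for X_v = a
    theta_edge[(u,v)][a][b]: edge parameter for X_u = a, X_v = b
    """
    total = 0.0
    for x in itertools.product((0, 1), repeat=n):
        score = sum(theta_node[v][x[v]] for v in range(n))
        score += sum(theta_edge[e][x[e[0]]][x[e[1]]] for e in edges)
        total += math.exp(score)  # sum exp(<theta, phi(x)>) over all 2^n states
    return math.log(total)
```

For an independent model with all parameters zero, the partition function is $2^n$, so $A(\theta) = n \log 2$.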
Variational inference approaches view the log-partition function as a convex optimization problem over the marginal polytope and seek tractable approximations of the negative entropy and the marginal polytope [27]. Formally, $A(\theta) = \max_{\mu \in \mathcal{M}} \langle \theta, \mu \rangle + H(\mu)$, where $\mathcal{M}$ is the marginal polytope and $H(\mu)$ is the entropy of the maximum-entropy distribution with moments $\mu$. Observe that $H(\mu)$ is upper bounded by the entropy of the maximum-entropy distribution consistent with any subset of the expected sufficient statistics $\mu$. To arrive at the TRW approximation [26], one uses a subset given by the pairwise moments of a spanning tree (if the original model contains non-pairwise potentials, they can be represented as cliques in the graphical model, and the bound based on spanning trees still holds). Hence for any distribution over spanning trees, an upper bound on $H(\mu)$ is obtained by taking a convex combination of tree entropies, which takes the form $\mathcal{H}(\mu, \rho) = \sum_v H_v(\mu) - \sum_{uv} \rho_{uv} I_{uv}(\mu)$, where $H_v$ is a node entropy and $I_{uv}$ an edge mutual information. Since the underlying object is a distribution over spanning trees, the vector $\rho$ of edge appearance probabilities $\rho_{uv}$ must belong to the spanning tree polytope $\mathbb{T}(G)$. Combined with a relaxation OUTER of the marginal polytope $\mathcal{M}$, an upper bound of the log-partition function is obtained:

$A(\theta) \le \max_{\mu \in \mathrm{OUTER}} \langle \theta, \mu \rangle + \mathcal{H}(\mu, \rho) \qquad (1)$

We note that $\mathcal{H}(\mu, \rho)$ is linear w.r.t. $\rho$, and for $\rho \in \mathbb{T}(G)$, it is concave w.r.t. $\mu$. On the other hand, the optimal value of (1) is convex w.r.t. $\theta$ and $\rho$.
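The entropy decomposition above is straightforward to evaluate for given pseudomarginals. The following sketch (with data structures of our own choosing: per-node marginal vectors and per-edge 2x2 joint tables) computes $\mathcal{H}(\mu, \rho) = \sum_v H_v - \sum_{uv} \rho_{uv} I_{uv}$:

```python
import math

def _entropy(p):
    """Shannon entropy of a discrete distribution given as a flat list."""
    return -sum(q * math.log(q) for q in p if q > 0)

def trw_entropy(node_marg, edge_marg, rho):
    """TRW entropy bound: sum_v H_v(mu) - sum_uv rho_uv * I_uv(mu)."""
    h = sum(_entropy(mu) for mu in node_marg.values())
    for (u, v), tau in edge_marg.items():
        pu = [sum(row) for row in tau]           # marginal of u from the joint
        pv = [sum(col) for col in zip(*tau)]     # marginal of v from the joint
        h_uv = _entropy([p for row in tau for p in row])
        mi = _entropy(pu) + _entropy(pv) - h_uv  # mutual information I_uv
        h -= rho[(u, v)] * mi
    return h
```

For uniform, independent pseudomarginals the mutual information vanishes, so the bound equals the sum of node entropies.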
The optimal solution $\mu^*$ of the optimization problem (1) can be used as an approximation to the mean parameter $\mu$. Typically, the local polytope LOCAL, given by pairwise consistency constraints, is used as the relaxation OUTER; in this paper, we also consider tightenings of the local polytope.
Since (1) holds for any edge appearance vector $\rho$ in the spanning tree polytope $\mathbb{T}(G)$, the TRW bound can be further improved by optimizing

$\min_{\rho \in \mathbb{T}(G)} \; \max_{\mu \in \mathrm{OUTER}} \langle \theta, \mu \rangle + \mathcal{H}(\mu, \rho) \qquad (2)$

The resulting $\rho$ is then plugged into (1) to find the marginal approximation. In practice, one might choose to work with some fixed choice of $\rho$, for example the one induced by the uniform distribution over all spanning trees.
[14] proposed using the most uniform edge weights, which can be found via conditional gradient, where each direction-finding step solves a maximum spanning tree problem. Several algorithms have been proposed for optimizing the TRW objective (1) given fixed edge appearance probabilities. [27] derived the tree-reweighted belief propagation algorithm from the fixed-point conditions. [8] showed how to solve the dual of the TRW objective, which is a geometric program. Although this algorithm has the advantage of guaranteed convergence, it is non-trivial to generalize this approach to use tighter relaxations of the marginal polytope, which we show is essential for lifted inference. [14] use an explicit set of spanning trees and then use dual decomposition to solve the optimization problem. However, as we show in the next section, to maintain symmetry it is essential that one not work directly with spanning trees but rather use symmetric edge appearance probabilities. [23] optimize TRW over the local and cycle polytopes using a Frank-Wolfe (conditional gradient) method, where each iteration requires solving a linear program. We follow this latter approach in our paper.
To optimize the edge appearance vector $\rho$ in (2), [26] proposed using conditional gradient. They observed that the derivative of the optimal value of (1) with respect to $\rho_{uv}$ is $-I_{uv}(\mu^*)$, where $\mu^*$ is the solution of (1). The direction-finding step in conditional gradient then reduces to solving $\max_{c \in \mathbb{T}(G)} \sum_{uv} I_{uv}(\mu^*)\, c_{uv}$, again equivalent to finding the maximum spanning tree with edge mutual information as weights. We discuss the corresponding lifted problem in Section 5.
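The direction-finding step is thus an ordinary maximum spanning tree computation with the current mutual informations as edge weights. A minimal Kruskal sketch (our own helper with a union-find over ground nodes, not the paper's implementation) is:

```python
def max_spanning_tree(n, weighted_edges):
    """Kruskal's algorithm for a maximum-weight spanning forest.

    weighted_edges: list of (weight, u, v) triples on nodes 0..n-1.
    With I_uv(mu*) as weights, this is the direction-finding step
    for optimizing the edge appearance probabilities.
    """
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    tree = []
    for w, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:            # adding (u, v) creates no cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree
```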
3 Lifted Variational Framework
We build on the key elements of the lifted variational framework introduced in [3]. The automorphism group $\mathcal{A}$ of a graphical model, or more generally of an exponential family, is defined as the group of permutation pairs $(\pi, \gamma)$, where $\pi$ permutes the set of variables and $\gamma$ permutes the set of features in such a way that the pair preserves the feature function $\phi$. Note that this construction of $\mathcal{A}$ is entirely based on the structure of the model and does not depend on the particular choice of the model parameters; nevertheless, the group stabilizes (formally, a group $\mathcal{G}$ stabilizes an object $S$ if $g \cdot S = S$ for all $g \in \mathcal{G}$) the key characteristics of the exponential family such as the marginal polytope $\mathcal{M}$, the log-partition function $A$, and the entropy $H$.
As shown in [3], the automorphism group is particularly useful for exploiting symmetries when parameters are tied. For a given parameter-tying partition $\Delta$ such that $\theta_i = \theta_j$ for $i, j$ in the same cell of $\Delta$ (if $\Delta$ is a partition of a set, then each of its subsets is called a cell), the group $\mathcal{A}$ gives rise to a subgroup $\mathcal{G}_\Delta$, called the lifting group, that stabilizes the tied-parameter vector as well as the exponential family. The orbit partition of the lifting group can be used to formulate equivalent but more compact variational problems. More specifically, let $\bar{\Delta}$ be the orbit partition induced by the lifting group on the feature index set, and let the lifted marginal polytope $\overline{\mathcal{M}}$ be the intersection of $\mathcal{M}$ with the corresponding symmetrized subspace; then (see Theorem 4 of [3])

$A(\theta) = \max_{\mu \in \overline{\mathcal{M}}} \langle \theta, \mu \rangle + H(\mu) \qquad (3)$
In practice, we need to work with convex variational approximations of the LHS of (3), where $\mathcal{M}$ is relaxed to an outer bound OUTER and $H$ is approximated by a concave function $\tilde{H}$. We now state a similar result for lifting general convex approximations.
Theorem 1.
If $\tilde{H}$ is concave and stabilized by the lifting group $\mathcal{G}_\Delta$, i.e., $\tilde{H}(\mu^g) = \tilde{H}(\mu)$ for all $g \in \mathcal{G}_\Delta$, and OUTER is convex and stabilized by $\mathcal{G}_\Delta$, then $\bar{\Delta}$ is the lifting partition for the approximate variational problem

$\max_{\mu \in \mathrm{OUTER}} \langle \theta, \mu \rangle + \tilde{H}(\mu) \qquad (4)$
The importance of Theorem 1 is that it shows that it is equivalent to optimize over the subset of OUTER where pseudomarginals in the same orbit are restricted to take the same value. As we will show in Section 4.2, this will allow us to combine many of the terms in the objective, which is where the computational gains derive from. A sketch of its proof is as follows. Consider a single pseudomarginal vector $\mu$. Since the objective value is the same for every $\mu^g$, $g \in \mathcal{G}_\Delta$, and since the objective is concave, the average of these, $\bar{\mu} = \frac{1}{|\mathcal{G}_\Delta|} \sum_{g} \mu^g$, must have at least as good an objective value. Furthermore, note that this averaged vector lives in the symmetrized subspace. Thus, it suffices to optimize over the symmetrized subspace.
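The averaging step in this proof sketch is simple to realize explicitly. The helper below (our own illustration) averages a vector over a permutation group given as an explicit list of permutations; the result is constant on the group's orbits, i.e., it lies in the symmetrized subspace:

```python
def symmetrize(mu, group):
    """Group average: bar_mu = (1/|G|) * sum_g mu^g.

    group: list of permutations; pi[i] is the index coordinate i is sent to.
    The output is constant on each orbit of the group.
    """
    n = len(mu)
    avg = [0.0] * n
    for pi in group:
        for i in range(n):
            avg[pi[i]] += mu[i] / len(group)  # accumulate permuted copies
    return avg
```

For example, averaging [1, 2, 3] over all six permutations of three elements (a single orbit) yields the constant vector [2, 2, 2].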
4 Lifted Tree-Reweighted Problem
4.1 Symmetry of TRW Bounds
We now show that Theorem 1 can be used to lift the TRW optimization problem (1). Note that the applicability of Theorem 1 is not immediately obvious, since $\mathcal{H}(\mu, \rho)$ depends on the distribution over trees implicit in $\rho$. In establishing that the condition in Theorem 1 holds, we need to be careful that the choice of the distribution over trees does not destroy the symmetry of the problem.
The result below ensures that, with no loss in optimality, $\rho$ can be assumed to be suitably symmetric. More specifically, let $\bar{E}$ be the set of $G$'s edge orbits induced by the action of the lifting group; the edge weights $\rho_{uv}$ for every $uv$ in the same edge orbit can be constrained to be the same, i.e., $\rho$ can be restricted to be constant on each edge orbit.
Theorem 2.
For any $\rho \in \mathbb{T}(G)$, there exists a symmetrized $\bar{\rho}$, constant on each edge orbit, that yields at least as good an upper bound, i.e., the value of (1) with $\bar{\rho}$ is no larger than its value with $\rho$.
As a consequence, in optimizing the edge appearance, $\rho$ can be restricted to the symmetrized spanning tree polytope $\overline{\mathbb{T}}(G)$, the subset of $\mathbb{T}(G)$ whose members are constant on each edge orbit.
Proof.
Since the value of (1) is convex in $\rho$ and invariant under the lifting group, the group average $\bar{\rho} = \frac{1}{|\mathcal{G}_\Delta|} \sum_g \rho^g$ yields a bound at least as good by Jensen's inequality; moreover, $\bar{\rho}$ remains in the spanning tree polytope by convexity of the polytope and is constant on edge orbits. ∎

Using a symmetric choice of $\rho$, the TRW bound then satisfies the condition of Theorem 1, enabling the applicability of the general lifted variational inference framework.
Theorem 3.
For a fixed $\rho \in \overline{\mathbb{T}}(G)$, $\bar{\Delta}$ is the lifting partition for the TRW variational problem

$\max_{\mu \in \mathrm{OUTER}} \langle \theta, \mu \rangle + \mathcal{H}(\mu, \rho) \qquad (5)$
4.2 Lifted TRW Problems
We give the explicit lifted formulation of the TRW optimization problem (5). As in [3], we restrict $\mu$ to the symmetrized subspace by introducing a lifted variable $\bar{\mu}_C$ for each cell $C \in \bar{\Delta}$, and for all $i \in C$, enforcing that $\mu_i = \bar{\mu}_C$. Effectively, we substitute every occurrence of $\mu_i$, $i \in C$, by $\bar{\mu}_C$; in vector form, $\mu$ is substituted by $M\bar{\mu}$, where $M$ is the characteristic matrix of the partition $\bar{\Delta}$: $M_{iC} = 1$ if $i \in C$ and $M_{iC} = 0$ otherwise. This results in the lifted form of the TRW problem

$\max_{\bar{\mu} \,:\, M\bar{\mu} \in \mathrm{OUTER}} \langle \bar{\theta}, \bar{\mu} \rangle + \overline{\mathcal{H}}(\bar{\mu}, \bar{\rho}) \qquad (6)$

where $\bar{\theta} = M^\top \theta$; $\overline{\mathcal{H}}$ is obtained from $\mathcal{H}$ via the above substitution; and $\bar{\rho}$ is the edge appearance per edge orbit: for every edge orbit $\bar{e}$ and every edge $uv \in \bar{e}$, $\bar{\rho}_{\bar{e}} = \rho_{uv}$. Using the alternative but equivalent form $\mathcal{H}(\mu, \rho) = \sum_v H_v(\mu) - \sum_{uv} \rho_{uv} I_{uv}(\mu)$, we obtain the following explicit form for $\overline{\mathcal{H}}$:

$\overline{\mathcal{H}}(\bar{\mu}, \bar{\rho}) = \sum_{\bar{v} \in \bar{V}} |\bar{v}|\, H_{\bar{v}}(\bar{\mu}) - \sum_{\bar{e} \in \bar{E}} |\bar{e}|\, \bar{\rho}_{\bar{e}}\, I_{\bar{e}}(\bar{\mu}) \qquad (7)$
Intuitively, the above can be viewed as a combination of node and edge entropies defined on the nodes and edges of the lifted graph $\bar{G}$. Nodes of $\bar{G}$ are the node orbits of $G$, while edges are the edge orbits of $G$. $\bar{G}$ is not a simple graph: it can have self-loops or multi-edges between the same node pair (see Fig. 1). We encode the incidence on this graph as follows: $d(\bar{v}, \bar{e}) = 0$ if $\bar{e}$ is not incident to $\bar{v}$, $d(\bar{v}, \bar{e}) = 1$ if $\bar{e}$ is incident to $\bar{v}$ and is not a loop, and $d(\bar{v}, \bar{e}) = 2$ if $\bar{e}$ is a loop incident to $\bar{v}$. The entropy at the node orbit $\bar{v}$ is defined as
$H_{\bar{v}}(\bar{\mu}) = -\sum_{x_v} \bar{\mu}(\bar{v}, x_v) \log \bar{\mu}(\bar{v}, x_v)$,
and the entropy at the edge orbit $\bar{e}$ is
$H_{\bar{e}}(\bar{\mu}) = -\sum_{(x_u, x_v)} \bar{\mu}(\bar{e}, \overline{(x_u, x_v)}) \log \bar{\mu}(\bar{e}, \overline{(x_u, x_v)})$,
where $uv$ is a representative (any element) of $\bar{e}$, $(x_u, x_v)$ is an assignment of the ground edge $uv$, and $\overline{(x_u, x_v)}$ is the assignment orbit; with this encoding, $I_{\bar{e}}(\bar{\mu}) = \sum_{\bar{v}} d(\bar{v}, \bar{e}) H_{\bar{v}}(\bar{\mu}) - H_{\bar{e}}(\bar{\mu})$. As in [3], we write the lifted node pseudomarginal as $\bar{\mu}(\bar{v}, x_v)$ and, for $uv \in \bar{e}$, the lifted edge pseudomarginal as $\bar{\mu}(\bar{a}, x_u, x_v)$, where $\bar{a}$ is the arc orbit of $(u, v)$.
When OUTER is the local or cycle polytope, the constraints $M\bar{\mu} \in \mathrm{OUTER}$ yield the lifted local (or cycle) polytope, respectively. For these constraints, we use the same form given in [3]. In Section 6, we describe a set of additional constraints for further tightening when some clusters of nodes are exchangeable.
Example. Consider the MRF shown in Fig. 1 (left) with 10 binary variables that we denote $X_1, \dots, X_5$ (for the blue nodes) and $Y_1, \dots, Y_5$ (for the red nodes). The node and edge coloring denotes shared parameters. Let $\theta_X$ and $\theta_Y$ be the single-node potentials used for the blue and red nodes, respectively. Let $\theta_r$ be the edge potential used for the red edges connecting the blue and red nodes, $\theta_b$ the edge potential used for the blue edges, and $\theta_k$ the edge potential used for the black edges. There are two node orbits: $\bar{X} = \{X_1, \dots, X_5\}$ and $\bar{Y} = \{Y_1, \dots, Y_5\}$. There are three edge orbits: $\bar{e}_r$ for the red edges, $\bar{e}_b$ for the blue edges, and $\bar{e}_k$ for the black edges. The sizes of the node and edge orbits are all 5 (e.g., $|\bar{X}| = |\bar{e}_r| = 5$). Suppose that $\rho$ corresponds to a uniform distribution over spanning trees, which satisfies the symmetry needed by Theorem 2; $\bar{\rho}$ then has one component per edge orbit. Putting all of this together, the lifted TRW entropy is given by $\overline{\mathcal{H}} = 5 H_{\bar{X}} + 5 H_{\bar{Y}} - 5 \bar{\rho}_{\bar{e}_r} I_{\bar{e}_r} - 5 \bar{\rho}_{\bar{e}_b} I_{\bar{e}_b} - 5 \bar{\rho}_{\bar{e}_k} I_{\bar{e}_k}$. We illustrate the expansion of the entropy of the red edge orbit $\bar{e}_r$. This edge orbit has 2 corresponding arc orbits, one with the blue endpoint first and one with the red endpoint first. Thus, $I_{\bar{e}_r} = H_{\bar{X}} + H_{\bar{Y}} - H_{\bar{e}_r}$.
Finally, the linear term in the objective is given by $\langle \bar{\theta}, \bar{\mu} \rangle$, where each orbit's parameter is scaled by the orbit size; for example, the coefficient associated with the blue node orbit is $5\theta_X$.
4.3 Optimization using Frank-Wolfe
What remains is to describe how to optimize Eq. 6. Our lifted tree-reweighted algorithm is based on Frank-Wolfe, also known as the conditional gradient method [7, 11]. First, we initialize with a pseudomarginal vector corresponding to the uniform distribution, which is guaranteed to be in the lifted marginal polytope. Next, we solve the linear program whose objective is given by the gradient of the objective in Eq. 6 evaluated at the current point, and whose constraint set is OUTER. When using the lifted cycle relaxation, we solve this linear program using a cutting-plane algorithm [3, 23]. We then perform a line search to find the optimal step size using golden section search (a bracketing method that finds the extremum of a unimodal function), and finally repeat using the new pseudomarginal vector. We warm-start each linear program using the optimal basis found in the previous run, which makes the LP solves extremely fast after the first couple of iterations. Although we use a generic LP solver in our experiments, it is also possible to use dual decomposition to derive efficient algorithms specialized to graphical models [24].
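The loop just described can be written generically. The following sketch is our own simplification (dense vectors, a user-supplied LP oracle, and a duality-gap stopping rule; the paper's actual solver uses Gurobi with warm starts) of Frank-Wolfe with golden section line search:

```python
import math

PHI = (math.sqrt(5) - 1) / 2  # golden ratio conjugate, ~0.618

def golden_section(h, lo=0.0, hi=1.0, tol=1e-8):
    """Minimize a unimodal function h on [lo, hi] by golden section search."""
    a, b = hi - PHI * (hi - lo), lo + PHI * (hi - lo)
    while hi - lo > tol:
        if h(a) < h(b):
            hi, b = b, a
            a = hi - PHI * (hi - lo)
        else:
            lo, a = a, b
            b = lo + PHI * (hi - lo)
    return 0.5 * (lo + hi)

def frank_wolfe(f, grad, lp_oracle, x, iters=200, gap_tol=1e-8):
    """Minimize convex f over a polytope accessed through an LP oracle.

    lp_oracle(g) returns argmin_{s in polytope} <g, s>  (one LP solve).
    """
    for _ in range(iters):
        g = grad(x)
        s = lp_oracle(g)                                 # LP direction step
        d = [si - xi for si, xi in zip(s, x)]
        gap = -sum(gi * di for gi, di in zip(g, d))      # duality gap
        if gap < gap_tol:
            break
        t = golden_section(lambda t: f([xi + t * di for xi, di in zip(x, d)]))
        x = [xi + t * di for xi, di in zip(x, d)]
    return x
```

As a usage example, minimizing a separable quadratic over the unit box (whose LP oracle just picks a vertex coordinate-wise) recovers an interior optimum.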
5 Lifted Maximum Spanning Tree
Optimizing the TRW edge appearance probabilities requires finding the maximum spanning tree (MST) in the ground graphical model. For lifted TRW, we need to perform MST while using only information from the node and edge orbits, without referring to the ground graph. In this section, we present a lifted MST algorithm for symmetric graphs which works at the orbit level.
Suppose that we are given a weighted graph $G$, its automorphism group, and its node and edge orbits. We aim to derive an algorithm analogous to Kruskal's algorithm, but with complexity depending only on the number of node and edge orbits of $G$. However, if the algorithm has to return an actual spanning tree of $G$, then clearly its complexity cannot be less than the size of the ground graph. Instead, we consider an equivalent problem: solving a linear program on the spanning tree polytope,

$\max_{c \in \mathbb{T}(G)} \langle w, c \rangle \qquad (8)$
The same mechanism for lifting convex optimization problems (Lemma 1 in [3]) applies to this problem. Let $\bar{E}$ be the edge orbit partition; then an equivalent lifted problem is

$\max_{\bar{c} \,:\, M\bar{c} \in \mathbb{T}(G)} \langle \bar{w}, \bar{c} \rangle \qquad (9)$

Since $M\bar{c}$ is constrained to be the same for edges in the same orbit, it is now possible to solve (9) with complexity depending only on the number of orbits. Any solution $c$ of the LP (8) can be turned into a solution of (9) by letting $\bar{c}_{\bar{e}}$ be the average of $c_{uv}$ over $uv \in \bar{e}$.
5.1 Lifted Kruskal’s Algorithm
Kruskal's algorithm first sorts the edges in order of decreasing weight. Then, starting from an empty graph, at each step it greedily attempts to add the next edge while maintaining the property that the used edges form a forest (containing no cycle). The forest obtained at the end of this algorithm is a maximum-weight spanning tree.
Imagine how Kruskal's algorithm would operate on a weighted graph with nontrivial automorphisms. Let $\bar{e}_1, \dots, \bar{e}_m$ be the list of edge orbits sorted in order of decreasing weight (the weights of all edges in the same orbit by definition must be the same). The main question therefore is how many edges in each edge orbit will be added to the spanning tree by Kruskal's algorithm. Let $G_i$ be the subgraph of $G$ formed by the set of all the edges and nodes in $\bar{e}_1 \cup \dots \cup \bar{e}_i$. Let $V(\cdot)$ and $CC(\cdot)$ denote the set of nodes and the set of connected components of a graph, respectively. Then (see the supplementary material for proof)
Lemma 4.
The number of edges in $\bar{e}_i$ appearing in the MST found by Kruskal's algorithm is $n_i = r_i - r_{i-1}$, where $r_i = |V(G_i)| - |CC(G_i)|$ and $r_0 = 0$. Thus a solution for the linear program (9) is $\bar{c}_{\bar{e}_i} = n_i / |\bar{e}_i|$.
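Lemma 4 can be checked at the ground level by running union-find over the orbits in sorted order; the per-orbit counts are rank increments and do not depend on the edge order within an orbit. The sketch below is our own helper for this check (a truly lifted implementation would instead compute $|V(G_i)|$ and $|CC(G_i)|$ at the orbit level, as in Section 5.2):

```python
def mst_edges_per_orbit(n, sorted_orbits):
    """Number of MST edges Kruskal takes from each edge orbit (Lemma 4).

    sorted_orbits: edge orbits in decreasing weight order; each orbit is a
    list of ground edges (u, v).  The i-th count equals r(G_i) - r(G_{i-1}),
    with r(G) = |V(G)| - |CC(G)| the rank of the subgraph formed by the
    first i orbits.
    """
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    counts = []
    for orbit in sorted_orbits:
        taken = 0
        for u, v in orbit:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                taken += 1      # edge joins two components: rank grows by 1
        counts.append(taken)
    return counts
```

On a 4-cycle (one heavier orbit of 4 edges) plus its two diagonals (a lighter orbit), Kruskal takes 3 cycle edges and no diagonals, so the lifted solution is $\bar{c} = (3/4, 0)$.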
5.2 Lifted Counting of the Number of Connected Components
We note that counting the number of nodes can be done simply by adding the sizes of the node orbits. The remaining difficulty is how to count the number of connected components of a given graph using only information at the orbit level. (Since we are only interested in connectivity in this subsection, the weights of $G$ play no role; thus, orbits in this subsection can also be generated by the automorphism group of the unweighted version of $G$.) Let $\bar{G}$ be the lifted graph of $G$. Then (see the supplementary material for proof)
Lemma 5.
If $\bar{G}$ is connected, then all connected components of $G$ are isomorphic. Thus, if furthermore $H$ is a connected component of $G$, then $|CC(G)| = |V(G)| / |V(H)|$.
6 Tightening via Exchangeable Polytope Constraints
One type of symmetry often found in first-order probabilistic models is large sets of exchangeable random variables. In certain cases, exact inference with exchangeable variables is possible via lifted counting elimination and its generalization [17, 2]. The drawback of these exact methods is that they do not apply to many models (e.g., those with transitive clauses). Lifted variational inference methods do not have this drawback; however, the local and cycle relaxations can be shown to be loose in the exchangeable setting, a potentially serious limitation compared to earlier work.
To remedy this situation, we now show how to take advantage of highly symmetric subsets of variables to tighten the relaxation of the lifted marginal polytope.
We call a set of random variables $\mathcal{X}$ an exchangeable cluster iff the variables in $\mathcal{X}$ can be arbitrarily permuted while preserving the probability model. Mathematically, the lifting group acts on $\mathcal{X}$ and the image of the action is isomorphic to $\mathbb{S}_{|\mathcal{X}|}$, the symmetric group on $|\mathcal{X}|$ elements. The distribution of the random variables in $\mathcal{X}$ is also exchangeable in the usual sense.
Our method for tightening the relaxation of the marginal polytope is based on lift-and-project, wherein we introduce auxiliary variables specifying the joint distribution of a large cluster of variables, and then enforce consistency between the cluster distribution and the pseudomarginal vector [20, 24, 27]. In the ground model, one typically works with small clusters (e.g., triplets) because the number of variables grows exponentially with cluster size. The key (and nice) difference in the lifted case is that we can make use of very large clusters of highly symmetric variables: while the grounded relaxation would clearly blow up, the corresponding lifted relaxation can still remain compact.

Specifically, for an exchangeable cluster $\mathcal{X}$ of arbitrary size, one can add cluster consistency constraints for the entire cluster and still maintain tractability. To keep the exposition simple, we assume that the variables are binary. Let $x$ denote a configuration, i.e., a function $x : \mathcal{X} \to \{0, 1\}$. The set $\{\tau_x\}$ is the collection of cluster auxiliary variables. Since $\mathcal{X}$ is exchangeable, all nodes in $\mathcal{X}$ belong to the same node orbit; we call this node orbit $\bar{v}$. Similarly, $\bar{e}$ and $\bar{a}$ denote the single edge and arc orbit that contain all edges and arcs in $\mathcal{X}$, respectively. Let $u, v$ be two distinct nodes in $\mathcal{X}$. To enforce consistency between the cluster and the edge $uv$ in the ground model, we introduce the constraints

$\sum_{x \,:\, x_u = a,\, x_v = b} \tau_x = \mu_{uv}(a, b) \;\; \text{for all } a, b \in \{0, 1\}, \qquad \tau \ge 0, \;\; \sum_x \tau_x = 1 \qquad (10)$
These constraints correspond to using intersection sets of size two, which can be shown to yield the exact characterization of the marginal polytope involving the variables in $\mathcal{X}$ if the graphical model only has pairwise potentials. If higher-order potentials are present, a tighter relaxation could be obtained by using larger intersection sets together with the techniques described below.
The constraints in (10) can be methodically lifted by replacing occurrences of ground variables with lifted variables at the orbit level. First, observe that in place of the grounded variables $\mu_{uv}(a, b)$, the lifted local relaxation has three corresponding lifted variables, $\bar{\mu}(\bar{e}, \overline{00})$, $\bar{\mu}(\bar{a}, 0, 1)$, and $\bar{\mu}(\bar{e}, \overline{11})$ (by exchangeability, the assignments $01$ and $10$ belong to the same orbit). Second, we consider the orbits of the set of configurations $x$. Since $\mathcal{X}$ is exchangeable, there can be only $n + 1$ configuration orbits; each orbit contains all configurations with precisely $k$ ones, where $0 \le k \le n$. Thus, instead of the $2^n$ ground auxiliary variables, we only need $n + 1$ lifted cluster variables. Further manipulation leads to the following set of constraints, which we call lifted exchangeable polytope constraints.
Theorem 6.
Let $\mathcal{X}$ be an exchangeable cluster of size $n$; let $\bar{e}$ and $\bar{a}$ be the single edge and arc orbit of the graphical model that contain all edges and arcs in $\mathcal{X}$, respectively; and let $\bar{\mu}$ be the lifted marginals. Then there exist $\nu_0, \dots, \nu_n \ge 0$ with $\sum_{k=0}^n \nu_k = 1$ such that

$\bar{\mu}(\bar{v}, 1) = \sum_{k=0}^n \frac{k}{n} \nu_k, \qquad \bar{\mu}(\bar{e}, \overline{11}) = \sum_{k=0}^n \frac{k(k-1)}{n(n-1)} \nu_k, \qquad \bar{\mu}(\bar{a}, 0, 1) = \sum_{k=0}^n \frac{k(n-k)}{n(n-1)} \nu_k$
Proof.
See the supplementary material. ∎
In contrast to the lifted local and cycle relaxations, the number of variables and constraints in the lifted exchangeable relaxation depends linearly on the domain size of the first-order model. From the lifted local constraints given by [3], the remaining entry $\bar{\mu}(\bar{e}, \overline{00})$ is determined by normalization. Intuitively, $\nu_k$ represents the approximation of the marginal probability of having precisely $k$ ones in $\mathcal{X}$.
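The relations in Theorem 6 are the standard moments of an exchangeable count distribution and are easy to sanity-check. The sketch below (our own illustration, with the $\nu_k$ stored as a list) recovers the lifted node and pairwise marginals implied by the cluster variables:

```python
from math import comb

def lifted_marginals_from_counts(nu):
    """Node and pairwise marginals implied by nu[k] = P(exactly k ones).

    By exchangeability over n variables:
      P(X_i = 1)          = sum_k nu[k] * k / n
      P(X_i = 1, X_j = 1) = sum_k nu[k] * k(k-1) / (n(n-1))
    """
    n = len(nu) - 1
    p1 = sum(nu[k] * k / n for k in range(n + 1))
    p11 = sum(nu[k] * k * (k - 1) / (n * (n - 1)) for k in range(n + 1))
    return p1, p11
```

For $n$ i.i.d. fair coins, $\nu_k = \binom{n}{k} / 2^n$, which yields $P(X_i = 1) = 1/2$ and $P(X_i = 1, X_j = 1) = 1/4$, as expected.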
As proved by [2], groundings of unary predicates in Markov Logic Networks (MLNs) give rise to exchangeable clusters. Thus, for MLNs, the above theorem immediately suggests a tightening of the relaxation: for every unary predicate of an MLN, add a new set of constraints as above to the existing lifted local (or cycle) optimization problem. Although it is not the focus of our paper, we note that this should also improve the lifted MAP inference results of [3]. For example, in the case of a symmetric complete graphical model, lifted MAP inference using the linear program given by these new constraints would find the exact maximizing count, hence recovering the same solution as counting elimination. Marginal inference may still be inexact due to the tree-reweighted entropy approximation. We re-emphasize that the complexity of variational inference with lifted exchangeable constraints is guaranteed to be polynomial in the domain size, unlike exact methods based on lifted counting elimination and variable elimination.
7 Experiments
In this section, we provide an empirical evaluation of our lifted tree-reweighted (LTRW) algorithm. As a baseline, we use a dampened version of the lifted belief propagation (LBP-Dampening) algorithm from [21]. Our lifted algorithm has all of the same advantages of the tree-reweighted approach over belief propagation, which we illustrate in the results: (1) a convex objective that can be convergently solved to optimality, (2) upper bounds on the partition function, and (3) the ability to easily improve the approximation by tightening the relaxation. Our evaluation includes four variants of the LTRW algorithm corresponding to different outer bounds: lifted local polytope (LTRW-L), lifted cycle polytope (LTRW-C), lifted local polytope with exchangeable polytope constraints (LTRW-LE), and lifted cycle polytope with exchangeable constraints (LTRW-CE). The conditional gradient optimization of the lifted TRW objective terminates when the duality gap drops below a fixed tolerance or when a maximum number of iterations is reached. To solve the LP problems during conditional gradient, we use Gurobi (http://www.gurobi.com/).
We evaluate all the algorithms using several first-order probabilistic models. We assume that no evidence has been observed, which results in a large amount of symmetry. Even without evidence, performing marginal inference in first-order probabilistic models can be very useful for maximum likelihood learning [13]. Furthermore, the fact that our lifted tree-reweighted variational approximation provides an upper bound on the partition function enables us to maximize a lower bound on the likelihood [29], which we demonstrate in Sec. 7.5. To find the lifted orbit partition, we use the renaming group as in [3], which exploits the symmetry of the unobserved constants in the model.
Rather than optimizing over the spanning tree polytope, which is computationally intensive, most TRW implementations use a single fixed choice of edge appearance probabilities, e.g., one obtained from a distribution over spanning trees using the matrix-tree theorem. In these experiments, we initialize the lifted edge appearance probabilities to be the most uniform per-orbit edge-appearance probabilities by solving the corresponding optimization problem using conditional gradient. Each direction-finding step of this conditional gradient solves a lifted MST problem using our lifted Kruskal's algorithm, with weights determined by the current solution. After this initialization, we fix the lifted edge appearance probabilities and do not attempt to optimize them further.
7.1 Test models
Fig. 3 describes the four test models in MLN syntax. We focus on the repulsive case, since for attractive models all TRW variants and lifted LBP give similar results. The parameter $w$ denotes the weight that will be varied during the experiments. In all models except CliqueCycle, $w$ acts like the "local field" potential in an Ising model; a negative (or positive) value of $w$ means the corresponding variable tends to be in the 0 (or 1) state. CompleteGraph is equivalent to an Ising model on the complete graph whose size equals the domain size, with homogeneous parameters. Exact marginals and the log-partition function can be computed in closed form using lifted counting elimination. The weight of the interaction clause is set to a negative value (repulsive). FriendsSmokers (negated) is a variant of the FriendsSmokers model [21] where the weight of the final clause is set to 1.1 (repulsive). We use the method in [2] to compute the exact marginal for the Cancer predicate and the exact value of the log-partition function. LoversSmokers is the same MLN used in [3], with a full transitive clause, where we vary the prior of the Loves predicate. CliqueCycle is a model with 3 cliques and 3 bipartite graphs in between. Its corresponding ground graphical model is shown in Fig. 2.
7.2 Accuracy of Marginals
Fig. 4 shows the marginals computed by all the algorithms, as well as the exact marginals, on the CompleteGraph and FriendsSmokers models. We do not know how to efficiently perform exact inference in the remaining two models, and thus do not measure accuracy for them. The result on complete graphs illustrates the clear benefit of tightening the relaxation: LTRW-L and LBP are inaccurate for moderate $w$, whereas cycle constraints and, especially, exchangeable constraints drastically improve accuracy. As discussed earlier, for the case of symmetric complete graphical models, the exchangeable constraints suffice to exactly characterize the marginal polytope. As a result, the approximate marginals computed by LTRW-LE and LTRW-CE are almost the same as the exact marginals; the very small difference is due to the entropy approximation. On the FriendsSmokers (negated) model, all LTRW variants give accurate marginals, while lifted LBP fails to converge for a range of $w$ even with very strong dampening (most of the weight given to previous iterations' messages). We observed that LTRW-LE gives the best tradeoff between accuracy and running time for this model. Note that we do not compare to ground versions of the lifted TRW algorithms because, by Theorem 3, the marginals and log-partition function are the same for both.
7.3 Quality of LogPartition Upper bounds
Fig. 5 plots the values of the upper bounds obtained by the LTRW algorithms on the four test models. The results clearly show the benefit of adding each type of constraint, with the best upper bound obtained by tightening the lifted local polytope with both lifted cycle and exchangeable constraints. For the CompleteGraph and FriendsSmokers models, the log-partition approximation using exchangeable polytope constraints is very close to exact. In addition, we illustrate lifted LBP's approximation of the log-partition function on CompleteGraph (note that it is non-convex and not an upper bound).
7.4 Running time
As shown in Table 1, the lifted variants of TRW are orders of magnitude faster than the ground version. Interestingly, lifted TRW with local constraints is observed to be faster as the domain size increases; this is probably due to the fact that as the domain size increases, the distribution becomes more peaked, so marginal inference becomes more similar to MAP inference. Lifted TRW with local and exchangeable constraints requires a smaller number of conditional gradient iterations, and is thus faster; note, however, that its running time slowly increases, since the exchangeable constraint set grows linearly with the domain size.
LBP's lack of convergence makes it difficult to have a meaningful timing comparison with LBP. For example, LBP did not converge for about half of the values of $w$ in the LoversSmokers model, even after using very strong dampening. We did observe that when LBP converges, it is much faster than LTRW. We hypothesize that this is due to the message-passing nature of LBP, which is based on a fixed-point update, whereas our algorithm is based on Frank-Wolfe.
Domain size | 10     | 20     | 30      | 100  | 200
TRW-L       | 138370 | 609502 | 1525140 | -    | -
LTRW-L      | 3255   | 3581   | 3438    | 1626 | 1416
LTRW-LE     | 681    | 703    | 721     | 1033 | 1307
7.5 Application to Learning
We now describe an application of our algorithm to the task of learning relational Markov networks for inferring protein-protein interactions from noisy, high-throughput experimental assays [12]. This is equivalent to learning the parameters of an exponential family random graph model [19] where the edges in the random graph represent the protein-protein interactions. Despite fully observed data, maximum likelihood learning is challenging because of the intractability of computing the log-partition function and its gradient. In particular, this relational Markov network has over 330K random variables (one for each possible interaction among 813 proteins) and ternary potentials. However, Jaimovich et al. [13] observed that the partition function in relational Markov networks is highly symmetric, and used lifted LBP to efficiently perform approximate learning in running time that is independent of the domain size. They used their lifted inference algorithm to visualize the (approximate) likelihood landscape for different values of the parameters, which among other uses characterizes the robustness of the model to parameter changes.
We use precisely the same procedure as [13], substituting lifted BP with our new lifted TRW algorithms. The model has three parameters: one used in the single-node potential to specify the prior probability of a protein-protein interaction; one, part of the tertiary potentials, which encourages cliques of three interacting proteins; and one, also part of the tertiary potentials, which encourages chain-like structures in which one protein interacts with two others that do not interact with each other (see the supplementary material for the full model specification as an MLN). We follow their two-step estimation procedure, first estimating the single-node parameter in the absence of the other parameters (the maximum likelihood, BP, and TRW estimates of this parameter coincide, and the estimation can be performed in closed form). Next, for each setting of the two tertiary parameters, we estimate the log-partition function using lifted TRW with the cycle + exchangeable constraints vs. local constraints only. Since TRW gives an upper bound on the log-partition function, these provide lower bounds on the likelihood. Our results are shown in Fig. 6, and should be compared to Fig. 7 of [13]. The overall shapes of the likelihood landscapes are similar. However, the lifted LBP estimates of the likelihood have several local optima, which cause gradient-based learning with lifted LBP to reach different solutions depending on the initial setting of the parameters. In contrast, since TRW is convex, any gradient-based procedure will reach the global optimum, and thus learning is much easier. Interestingly, we see that our estimates of the likelihood have a significantly smaller range over these parameter settings than those estimated by lifted LBP. Moreover, the high-likelihood parameter settings extend to larger parameter values. For all algorithms there is a sudden decrease in the likelihood at a certain parameter value (not shown in the figure).
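The lower-bound relationship used above can be written out explicitly (notation ours; \(\hat{\mu}\) denotes the empirical moments of the fully observed data):

\[
\ell(\theta) \;=\; \langle \theta, \hat{\mu} \rangle - A(\theta)
\;\ge\; \langle \theta, \hat{\mu} \rangle - A_{\mathrm{TRW}}(\theta),
\]

since \(A_{\mathrm{TRW}}(\theta) \ge A(\theta)\). Maximizing the right-hand side is therefore surrogate-likelihood learning with a concave objective, which is what makes the TRW-based landscapes free of spurious local optima.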
8 Discussion and Conclusion
Lifting partitions used by lifted and counting BP [21, 16] can be coarser than orbit partitions. In graph-theoretic terms, these partitions are called equitable partitions. If each cell of an equitable partition is thought of as a distinct node color, then among nodes with the same color, the neighbors must have the same color histogram. It is known that orbit partitions are always equitable; however, the converse is not true in general [9].
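To make the equitable-partition condition concrete, the coarsest equitable partition can be computed by color refinement (naive one-dimensional Weisfeiler-Leman). This is a minimal sketch under our own naming, not the implementation used inside lifted or counting BP:

```python
def color_refinement(adj):
    """Coarsest equitable partition of a graph's nodes.

    adj: dict mapping each node to a list of its neighbors.
    Repeatedly refine node colors by the multiset of neighbor
    colors until the number of colors stops growing.
    """
    color = {v: 0 for v in adj}
    while True:
        # Signature = own color plus sorted multiset of neighbor colors.
        sig = {v: (color[v], tuple(sorted(color[u] for u in adj[v])))
               for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        new = {v: palette[sig[v]] for v in adj}
        if len(set(new.values())) == len(set(color.values())):
            return new  # stable: partition is equitable
        color = new
```

On a path a-b-c-d, refinement stabilizes with the two endpoints in one cell and the two middle nodes in another; within each cell, every node sees the same histogram of neighbor colors, which is exactly the equitable property described above.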
Since equitable partitions can be computed more efficiently and potentially lead to more compact lifted problems, the following question naturally arises: can we use equitable partitions in lifting the TRW problem? Unfortunately, a complete answer is nontrivial. We point out here a theoretical barrier due to the interplay between the spanning tree polytope and the equitable partition of a graph.
Let be the coarsest equitable partition of the edges of . We give an example graph in the supplementary material (see Example 9) where the symmetrized spanning tree polytope corresponding to this equitable partition is empty. When it is empty, the consequence is that if we want the edge appearance probabilities to lie within the spanning tree polytope, so that the objective is guaranteed to be a convex upper bound of the log-partition function, we cannot restrict them to be consistent with the equitable partition. In lifted and counting BP, the edge appearance probabilities are trivially consistent with the equitable partition; however, one loses convexity and the upper bound guarantee as a result. This suggests that there might be a trade-off between the compactness of the lifting partition and the quality of the entropy approximation, a topic deserving the attention of future work.
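For reference, the role of the spanning tree polytope comes from the standard TRW bound of [28] (notation follows [28]: \(\mathrm{LOCAL}(G)\) is the local consistency polytope and \(\mathbb{T}(G)\) the spanning tree polytope):

\[
A(\theta)\;\le\;\max_{\mu \in \mathrm{LOCAL}(G)}\;\Big\{\langle \theta,\mu\rangle+\sum_{s\in V}H(\mu_s)-\sum_{(s,t)\in E}\rho_{st}\,I_{st}(\mu_{st})\Big\},\qquad \rho\in\mathbb{T}(G),
\]

where \(H(\mu_s)\) is a single-node entropy and \(I_{st}\) a pairwise mutual information. The upper bound is only guaranteed when the edge appearance probabilities \(\rho\) lie in \(\mathbb{T}(G)\), and it is precisely this membership that a \(\rho\) forced to respect the equitable partition may fail to achieve.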
In summary, we presented a formalization of lifted marginal inference as a convex optimization problem and showed that it can be efficiently solved using a Frank-Wolfe algorithm. Compared to previous lifted variational inference algorithms, in particular lifted belief propagation, our approach comes with convergence guarantees, upper bounds on the partition function, and the ability to improve the approximation (e.g., by introducing additional constraints) at the cost of a small increase in running time.
A limitation of our lifting method is that as the amount of soft evidence (the number of distinct individual objects) approaches the domain size, the behavior of lifted inference approaches that of ground inference. The wide difference in running time between ground and lifted inference suggests that significant efficiency can be gained by solving an approximation of the original problem that is more symmetric [25, 15, 22, 6]. One of the most interesting open questions raised by our work is how to use the variational formulation to perform approximate lifting. Since our lifted TRW algorithm provides an upper bound on the partition function, it is possible that one could use the upper bound to guide the choice of approximation when deciding how to reintroduce symmetry into an inference task.
Acknowledgements: Work by DS was supported by the DARPA PPAML program under AFRL contract no. FA8750-14-C-0005.
References
 Barahona and Mahjoub [1986] F. Barahona and A. R. Mahjoub. On the cut polytope. Mathematical Programming, 36:157–173, 1986.
 Bui et al. [2012] Hung Hai Bui, Tuyen N. Huynh, and Rodrigo de Salvo Braz. Lifted inference with distinct soft evidence on every object. In AAAI-2012, 2012.
 Bui et al. [2013] Hung Hai Bui, Tuyen N. Huynh, and Sebastian Riedel. Automorphism groups of graphical models and lifted variational inference. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, UAI-2013. AUAI Press, 2013.
 Bui et al. [2014] Hung Hai Bui, Tuyen N. Huynh, and David Sontag. Lifted tree-reweighted variational inference. arXiv preprint, 2014. http://arxiv.org/abs/1406.4200.
 de Salvo Braz et al. [2005] R. de Salvo Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. In Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI '05), pages 1319–1325, 2005.
 de Salvo Braz et al. [2009] Rodrigo de Salvo Braz, Sriraam Natarajan, Hung Bui, Jude Shavlik, and Stuart Russell. Anytime lifted belief propagation. In 6th International Workshop on Statistical Relational Learning (SRL 2009), 2009.
 Frank and Wolfe [1956] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956. ISSN 1931-9193.
 Globerson and Jaakkola [2007] A. Globerson and T. Jaakkola. Convergent Propagation Algorithms via Oriented Trees. In Uncertainty in Artificial Intelligence, 2007.
 Godsil and Royle [2001] Chris Godsil and Gordon Royle. Algebraic Graph Theory. Springer, 2001.
 Gogate and Domingos [2011] Vibhav Gogate and Pedro Domingos. Probabilistic theorem proving. In Proceedings of the Twenty-Seventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-11), pages 256–265, 2011.
 Jaggi [2013] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In Proceedings of the 30th ICML, volume 28, pages 427–435. JMLR Workshop and Conference Proceedings, 2013.
 Jaimovich et al. [2006] Ariel Jaimovich, Gal Elidan, Hanah Margalit, and Nir Friedman. Towards an integrated protein-protein interaction network: a relational Markov network approach. Journal of Computational Biology, 13(2):145–164, 2006.
 Jaimovich et al. [2007] Ariel Jaimovich, Ofer Meshi, and Nir Friedman. Template based inference in symmetric relational Markov random fields. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, Vancouver, BC, Canada, July 19-22, 2007, pages 191–199. AUAI Press, 2007.
 Jancsary and Matz [2011] Jeremy Jancsary and Gerald Matz. Convergent decomposition solvers for tree-reweighted free energies. Journal of Machine Learning Research - Proceedings Track, 15:388–398, 2011.
 Kersting et al. [2010] K. Kersting, Y. El Massaoudi, B. Ahmadi, and F. Hadiji. Informed lifting for message-passing. In M. Fox and D. Poole, editors, Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), Atlanta, USA, July 11-15, 2010. AAAI Press.
 Kersting et al. [2009] Kristian Kersting, Babak Ahmadi, and Sriraam Natarajan. Counting belief propagation. In Proceedings of the 25th Annual Conference on Uncertainty in AI (UAI '09), 2009.
 Milch et al. [2008] B. Milch, L. S. Zettlemoyer, K. Kersting, M. Haimes, and L. P. Kaelbling. Lifted Probabilistic Inference with Counting Formulas. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI '08), pages 1062–1068, 2008.
 Niepert [2012] Mathias Niepert. Markov chains on orbits of permutation groups. In UAI-2012, 2012.
 Robins et al. [2007] Garry Robins, Pip Pattison, Yuval Kalish, and Dean Lusher. An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2):173–191, 2007.
 Sherali and Adams [1990] H. D. Sherali and W. P. Adams. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM Journal on Discrete Mathematics, 3(3):411–430, 1990. doi: 10.1137/0403036. URL http://link.aip.org/link/?SJD/3/411/1.
 Singla and Domingos [2008] Parag Singla and Pedro Domingos. Lifted first-order belief propagation. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI '08), pages 1094–1099, 2008.
 Singla et al. [2010] Parag Singla, Aniruddh Nath, and Pedro Domingos. Approximate lifted belief propagation. In Workshop on Statistical Relational Artificial Intelligence (StaRAI 2010), 2010.
 Sontag and Jaakkola [2008] D. Sontag and T. Jaakkola. New outer bounds on the marginal polytope. In Advances in Neural Information Processing Systems 21. MIT Press, 2008.
 Sontag et al. [2008] David Sontag, T. Meltzer, A. Globerson, T. Jaakkola, and Y. Weiss. Tightening LP relaxations for MAP using message passing. In Proceedings of the 24th Annual Conference on Uncertainty in AI (UAI '08), 2008.
 Van den Broeck and Darwiche [2013] Guy Van den Broeck and Adnan Darwiche. On the complexity and approximation of binary evidence in lifted inference. In Advances in Neural Information Processing Systems, pages 2868–2876, 2013.
 Wainwright et al. [2005] M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory, 51:2313–2335, 2005.
 Wainwright and Jordan [2008] Martin Wainwright and Michael Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers, 2008.
 Wainwright et al. [2003] Martin J. Wainwright, Tommi Jaakkola, and Alan S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory, 49:1120–1146, 2003.
 Yanover et al. [2008] C. Yanover, O. Schueler-Furman, and Y. Weiss. Minimizing and learning energy functions for side-chain prediction. Journal of Computational Biology, 15(7):899–911, 2008.
Supplementary Materials for “Lifted Tree-Reweighted Variational Inference”
We present (1) proofs not given in the main paper, (2) full pseudocode for the lifted Kruskal's algorithm for finding a maximum spanning tree in symmetric graphs, and (3) additional details of the protein-protein interaction model.
Proof of Theorem 1.
Proof.
We state and prove a lemma about the symmetry of the bounds that will be used in subsequent proofs.
Lemma 7.
Let be an automorphism of the graphical model . Then and .
Proof.
The intuition is that since the entropy bound is defined on the graph structure of the graphical model , it inherits the symmetry of . This can be verified by viewing the graph automorphism as a bijection from nodes to nodes and edges to edges, so that it simply rearranges the summation inside the entropy bound.
We now show that the same symmetry applies to the log-partition upper bound . Let OUTER be an outer bound of the marginal polytope such that . Note that the automorphism acts on and stabilizes OUTER, i.e., . Thus
∎
Proof of Theorem 3.
Proof.
We will show that the condition of Theorem 1 holds. Let us fix a . Then for all . Thus . On the other hand, by Lemma 7, . Thus . Note that in the case of the overcomplete representation, the action of the group is the permuting action of ; thus, the TRW bound (for fixed ) is stabilized by the lifting group . ∎
Conditional Gradient (Frank-Wolfe) Algorithm for Lifted TRW
The pseudocode is given in Algorithm 1. Step 2 essentially solves a lifted MAP problem, for which we use the same algorithms presented in [3] with Gurobi as the main linear programming engine. Step 3 solves a one-dimensional constrained convex problem via golden-section search to find the optimal step size.
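As an illustration of Steps 2-3, here is a minimal conditional-gradient sketch with a golden-section line search, run on a toy quadratic over the probability simplex. The linear minimization oracle stands in for the lifted MAP step, and all names are ours; this is a sketch, not the actual Algorithm 1:

```python
import numpy as np

def golden_section(phi, a=0.0, b=1.0, tol=1e-8):
    """Minimize a unimodal 1-D function phi on [a, b]."""
    gr = (5 ** 0.5 - 1) / 2
    c, d = b - gr * (b - a), a + gr * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):
            b, d = d, c
            c = b - gr * (b - a)
        else:
            a, c = c, d
            d = a + gr * (b - a)
    return (a + b) / 2

def frank_wolfe(f, grad, lmo, x0, iters=300):
    """Conditional gradient: each step calls a linear minimization
    oracle (lmo) over the feasible polytope, then line-searches
    along the segment toward the returned vertex."""
    x = x0.copy()
    for _ in range(iters):
        s = lmo(grad(x))                      # vertex minimizing <g, s>
        step = golden_section(lambda t: f(x + t * (s - x)))
        x = x + step * (s - x)
    return x

# Toy problem: project c onto the simplex, i.e., minimize ||x - c||^2.
c = np.array([0.2, 0.3, 0.5])
f = lambda x: float(np.sum((x - c) ** 2))
grad = lambda x: 2 * (x - c)
lmo = lambda g: np.eye(len(g))[np.argmin(g)]  # best simplex vertex
x = frank_wolfe(f, grad, lmo, np.array([1.0, 0.0, 0.0]))
```

Since every iterate is a convex combination of polytope vertices, feasibility is maintained automatically, which is what makes the method attractive when only an LP (here, lifted MAP) oracle is available.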
Proof of Lemma 4.
Proof.
After considering all edges in , Kruskal's algorithm must have formed a spanning forest of (the forest is spanning since, if there remained an edge that could be used without forming a cycle, Kruskal's algorithm would already have used it). Since the forest is spanning, the number of edges used by Kruskal's algorithm at this point is precisely . Similarly, just after considering all edges in , the number of edges used is . Therefore, the number of edges used must be
which is the difference between the number of new nodes (which must be nonnegative) and the number of new connected components (which could be negative) induced by considering edges in . Any MST solution can be turned into a solution of (9) by letting . Thus, we obtain a solution . ∎
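For reference, the ground (unlifted) greedy scan that the lemma reasons about can be sketched with a union-find structure. This is a plain maximum-weight spanning forest, not the lifted variant given in the supplementary pseudocode:

```python
def max_spanning_forest(n, edges):
    """Kruskal's greedy scan for a maximum-weight spanning forest.

    edges: list of (weight, u, v) with nodes 0..n-1. Scanning by
    decreasing weight, an edge is kept iff it joins two different
    components, i.e., iff it does not close a cycle.
    """
    parent = list(range(n))

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    kept = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            kept.append((w, u, v))
    return kept
```

After the scan, the number of kept edges equals the number of nodes minus the number of connected components of the result, which is the counting identity the proof above exploits.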
Proof of Lemma 5.
We first need an intermediate result.
Lemma 8.
If and are two distinct node orbits that are reachable from one another in the lifted graph , then for any , there is some such that is reachable from .
Proof.
Induction on the length of the path. Base case: if and are adjacent, there exists an edge orbit incident to both and . Therefore, there exists a ground edge in such that and . The automorphism mapping will map for some node . Clearly is an edge, and . Main case: assume the statement is true for all pairs of orbits with path length . Suppose is a path of length . Take the orbit immediately before in the path, so that is a path of length and and are adjacent. By the inductive assumption, there exists such that is connected to . Applying the same argument as in the base case, there exists such that is an edge. Thus is connected to . ∎
We now return to the main proof of Lemma 5.
Proof.
If the ground graph has only one connected component then this is trivially true. Let and be two distinct connected components of , let be a node in and let be the orbit containing it. Let be any node in . Since the lifted graph is connected, all orbits are reachable from one another in . By the above lemma, there must be some node reachable from , hence (if then we just take ). This establishes that the node orbit intersects both and . Note that since otherwise . Let be the automorphism that takes to .
We now show that . Since maps edges to edges and non-edges to non-edges, it is sufficient to show that . Let be a node of and . Since is connected, there exists a path from to . But must map this path to a path from to , hence