1 Introduction
Probabilistic graphical models such as Bayesian networks, Markov random fields, and conditional random fields are powerful tools for modeling complex dependencies over a large number of random variables (Koller and Friedman, 2009; Wainwright et al., 2008). Graphs are used to represent joint probability distributions, where nodes denote random variables and edges represent dependency relationships between nodes. Given a graphical model, a fundamental problem is to calculate the marginal distributions of variables of interest. This problem is closely related to computing the partition function, or normalization constant, of the graphical model, which is known to be intractable and #P-complete. As a result, developing efficient approximate inference algorithms is a pressing need. The most popular algorithms include deterministic variational inference and Markov chain Monte Carlo sampling.
However, many challenging practical problems involve very large graphs on which it is computationally expensive to run existing variational inference or Monte Carlo sampling algorithms. This happens, for example, when we use Markov random fields to represent the social network of Facebook, or use a Bayesian network to model a knowledge graph derived from all of Wikipedia; in both cases the sizes of the graphical models can be prohibitively large (e.g., millions or billions of variables). It is thus infeasible to run traditional approximate inference such as message passing or Monte Carlo on these models, because such methods need to traverse the entire model to answer a query. Despite the daunting sizes of large graphical models, in most real-world applications users only want to make inferences about a small set of query variables. The distribution of a query variable often depends only on a small number of nearby variables in the graph. As a result, complete inference over the entire graph is unnecessary, and practical methods should perform inference only with the most relevant variables in local graph regions close to the query variables, while ignoring variables that are weakly correlated and/or distantly located on the graph.
In this work, we develop a new localized inference method for very large graphical models. Our approach leverages Dobrushin's comparison theorem, which provides explicit bounds based on the correlation decay property of the graph, in order to restrict inference to a smaller local region that is sufficient for inferring the marginal distribution of the query variable. Dobrushin's comparison theorem allows us to explicitly bound the truncation error, which guides the selection of the localized region from the original large graph. Extensive experiments demonstrate both the effectiveness of our theoretical bounds and the accuracy of our inference algorithm on a variety of datasets.
Related Work
Approximate inference algorithms for graphical models have been extensively studied in the past decades (see, e.g., Koller and Friedman (2009), Wainwright et al. (2008), and Dechter (2013) for an overview). Query-specific inference methods have recently been introduced for large graphical models, including Chechetka and Guestrin (2010), who proposed a focused belief propagation for query-specific inference, and Wick and McCallum (2011) and Shi et al. (2015), who study query-aware sampling algorithms. Compared with these methods, our work is theoretically motivated by Dobrushin's comparison theorem, which enables us to construct the localized region in a principled and practically efficient manner.
2 Background on Graphical Models
Graphical models provide a flexible framework for representing relationships between random variables (Heinemann and Globerson, 2014). On a graph $G = (V, E)$, we use $x = \{x_i : i \in V\}$ to denote a finite collection of random variables and $x = (x_1, \ldots, x_n)$ to refer to an assignment. Suppose $E$ is a set of edges and $\{\psi_{ij}, \psi_i\}$ is a set of potential functions, with $\psi_{ij}$ for each edge $(i, j) \in E$ and $\psi_i$ for each node $i \in V$. We use $p(x)$ to represent the joint distribution of the graphical model as follows:

p(x) = \frac{1}{Z} \prod_{(i,j) \in E} \psi_{ij}(x_i, x_j) \prod_{i \in V} \psi_i(x_i),

where $Z$ is the normalization constant (also called the partition function). In this work, we will focus on the Ising model, an extensively studied graphical model. The Ising model is a pairwise model with binary variables $x_i \in \{-1, +1\}$, pairwise parameters $J_{ij}$, and singleton parameters $h_i$, so that the distribution of an Ising model is

p(x) = \frac{1}{Z} \exp\Big( \sum_{(i,j) \in E} J_{ij} x_i x_j + \sum_{i \in V} h_i x_i \Big).   (1)
Given a graphical model, marginal inference involves calculating the normalization constant or the marginal probabilities of small subsets of variables. These problems require summation over an exponential number of configurations and are typically #P-hard in the worst case for loopy graphical models. However, practical problems can often be easier than the theoretical worst case, and it is still possible to obtain efficient approximations by leveraging the special structure of a given model. In this work, we focus on query-specific inference, where the goal is to calculate the marginal distribution $p(x_q)$ of a given variable $x_q$. For this task, it is possible to obtain good approximations based on a local region around $x_q$, thus significantly accelerating inference in very large graphical models.
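To make the setting concrete, the Ising model of Eq. (1) and exact marginal inference by brute-force enumeration can be sketched in a few lines of Python. The chain model and its parameter values below are toy choices for illustration, not taken from this paper; the exponential cost of the enumeration is exactly what localized inference aims to avoid.

```python
import itertools
import math

def ising_logp_unnorm(x, h, J):
    """Unnormalized log-probability of an assignment x in {-1, +1}^n,
    following Eq. (1): singleton terms h[i]*x[i] plus pairwise terms
    J[(i, j)]*x[i]*x[j]."""
    s = sum(h[i] * x[i] for i in h)
    s += sum(w * x[i] * x[j] for (i, j), w in J.items())
    return s

def exact_marginal(q, h, J):
    """P(x_q = +1), computed by enumerating all 2^n configurations;
    the cost is exponential in the number of variables."""
    nodes = sorted(h)
    Z = Zq = 0.0
    for assign in itertools.product([-1, 1], repeat=len(nodes)):
        x = dict(zip(nodes, assign))
        w = math.exp(ising_logp_unnorm(x, h, J))
        Z += w
        if x[q] == 1:
            Zq += w
    return Zq / Z

# toy chain 0 - 1 - 2 - 3 with a positive field on node 0
h = {0: 0.5, 1: 0.0, 2: 0.0, 3: -0.5}
J = {(0, 1): 0.3, (1, 2): 0.3, (2, 3): 0.3}
p = exact_marginal(0, h, J)  # query the marginal of node 0
```

By the sign symmetry of this chain, $P(x_3 = +1) = 1 - P(x_0 = +1)$, which is a useful sanity check on any implementation.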
3 Localized Inference and Correlation Decay
Given a large graphical model, it is usually not feasible to compute the exact marginal of a specific variable, due to the exponential time complexity. Furthermore, it is not even practical to run variational approximation algorithms, such as mean field and belief propagation, when the graph is very large. This is because these traditional methods need to traverse the entire graph multiple times before convergence, and are thus prohibitively slow for very large models such as those built on social networks or knowledge bases.
On the other hand, it is relatively cheap to calculate exact or approximate marginals in small or medium-sized graphical models. In many applications, users are only interested in certain queries of node marginals. Because the queries of interest often have strong associations with only a small number of nearby variables in the graph, complete inference over the full graph is not necessary. This can be formally captured by the phenomenon of correlation decay: when the graph is large and sparse, the influence of a random variable on the distribution of another random variable decreases quickly as the length of the shortest path between the corresponding nodes increases. The correlation decay property has been widely studied in statistical mechanics and graphical models (Rebeschini et al., 2015).
Formally, assuming that the edge potentials are well bounded, we may expect that variables $x_i$ and $x_j$ are strongly correlated when the distance $d(i, j)$ between nodes $i$ and $j$ on graph $G$ is small (e.g., when they are adjacent), while $x_i$ and $x_j$ may have a rather weak correlation, or be nearly independent, when nodes $i$ and $j$ are far away from each other on $G$. This property holds broadly in real-world graphical models, such as those built upon social networks in which an individual is mostly influenced by his/her friends. Often, the correlation decays exponentially in the distance $d(i, j)$.
If a graphical model satisfies the correlation decay property, it may be possible to use only local information in the graph to perform marginal inference, since distant variables have little correlation with the query variable. This intuition allows us to use the information from the most relevant variables in the local region close to the queried variable to efficiently approximate its marginal distribution. Assume that $p$ is a large graphical model and that we want to calculate the marginal distribution $p(x_q)$ of a variable $x_q$. Localized inference constructs a much smaller model $p_A$, defined on a small subgraph $A$ that includes $q$, such that $p_A(x_q) \approx p(x_q)$. The challenge, however, is how to construct a good localized model and bound its approximation error. We address this problem via Dobrushin's comparison theorem (Föllmer, 1982), and propose an efficient algorithm that finds the local graph region for a given query node and provides an error bound between its approximate and true marginals. To get started, we first introduce Dobrushin's comparison theorem, which compares two Gibbs measures.
Theorem 1
(Föllmer, 1982) (Dobrushin's comparison theorem) Let $\mu$ be a Gibbs measure on a finite product space $S = \prod_{i \in I} S_i$, where $I$ is an index set. For $i, j \in I$, we define

C_{ij} = \max \big\{ \| \mu_i(\cdot \mid x) - \mu_i(\cdot \mid \bar{x}) \| : x_k = \bar{x}_k \text{ for all } k \neq j \big\},

where $\mu_i(\cdot \mid x)$ is the conditional distribution of the $i$-th coordinate with respect to the field generated by the coordinates with index $k \neq i$, and $\|\cdot\|$ is the total variation distance. We compute

c = \max_{i \in I} \sum_{j \in I} C_{ij},   (2)

and assume $c < 1$. Let $C = (C_{ij})$ and $D = \sum_{n \geq 0} C^n$; then for any probability measure $\nu$ on the same space and any function $f$, we have

| \mu(f) - \nu(f) | \leq \sum_{i, j \in I} \delta_i(f) \, D_{ij} \, b_j,

where $b_j$ is the singleton perturbation coefficient of node $j$:

b_j = \max_x \| \mu_j(\cdot \mid x) - \nu_j(\cdot \mid x) \|,   (3)

and $\delta_i(f)$ is the oscillation of $f$ in the $i$-th coordinate, that is,

\delta_i(f) = \max \big\{ | f(x) - f(\bar{x}) | : x_k = \bar{x}_k \text{ for all } k \neq i \big\}.
In Theorem 1, $\mu_i(\cdot \mid x)$ is the conditional probability of variable $x_i$ given that its adjacent variables take the same assignments as the corresponding entries of $x$. By the Markov property, calculating $\mu_i(\cdot \mid x)$ only requires information from the local star-shaped graph around node $i$, as shown in Figure 2. It is worth noting that a tighter bound can be obtained with an alternative definition of the perturbation coefficient $b_j$; here we use the definition in (3) for its lower computational complexity. The matrix $C$ is known as Dobrushin's interaction matrix, and the inequality $c < 1$ is the Dobrushin condition. When this condition holds, the theorem gives a bound between the two measures, which reflects correlation decay.
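The entries of Dobrushin's interaction matrix can be computed by brute force for an Ising model, directly from the definition: the $(i, j)$ entry is the largest total-variation change in the conditional distribution of $x_i$ when only the neighbor $x_j$ flips. The sketch below (toy code, not from the paper) exploits the Markov property noted above, so each entry only requires enumerating the star-shaped neighborhood of node $i$.

```python
import itertools
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def cond_p_plus(i, x, h, J, nbrs):
    """P(x_i = +1 | neighbors) for the Ising model of Eq. (1)."""
    field = h[i] + sum(J[tuple(sorted((i, k)))] * x[k] for k in nbrs[i])
    return sigmoid(2.0 * field)

def dobrushin_C(h, J):
    """Interaction matrix: C[i][j] is the max, over assignments to the
    other neighbors of i, of the TV distance between the conditionals
    at i when coordinate j flips."""
    nodes = sorted(h)
    nbrs = {i: set() for i in nodes}
    for (i, j) in J:
        nbrs[i].add(j)
        nbrs[j].add(i)
    C = {i: {j: 0.0 for j in nodes} for i in nodes}
    for i in nodes:
        for j in nbrs[i]:  # C[i][j] = 0 unless j is a neighbor of i
            others = sorted(nbrs[i] - {j})
            for assign in itertools.product([-1, 1], repeat=len(others)):
                x = dict(zip(others, assign))
                x[j] = 1
                p_plus = cond_p_plus(i, x, h, J, nbrs)
                x[j] = -1
                p_minus = cond_p_plus(i, x, h, J, nbrs)
                # TV distance between two binary distributions
                C[i][j] = max(C[i][j], abs(p_plus - p_minus))
    return C
```

The Dobrushin condition is then `max(sum(row.values()) for row in C.values()) < 1`. For an edge with weight $J_{ij}$, this construction is bounded by the familiar estimate $C_{ij} \leq \tanh(|J_{ij}|)$.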
In the following, we apply Theorem 1 to undirected graphical models to derive an approximation bound on marginal distributions. We denote by $I$ the index set of the variables and assume that we want to query the marginal distribution of variable $x_q$, $q \in I$. In order to apply Theorem 1, we set $f$ to be the indicator function of the event $x_q = s$, that is, $f(x) = \mathbb{1}\{x_q = s\}$. Then $|\mu(f) - \nu(f)|$ becomes the absolute marginal difference $|\mu(x_q = s) - \nu(x_q = s)|$ between the two measures $\mu$ and $\nu$. In addition, the oscillation of $f$ reduces to $\delta_q(f) = 1$ and $\delta_i(f) = 0$ for $i \neq q$. With these simplifications, we obtain a bound on the maximum difference between the marginals of the queried node under the two measures:
Corollary 1
Following the assumptions in Theorem 1 and the setup above, we have

\max_{x_q} | \mu(x_q) - \nu(x_q) | \leq \sum_{j \in I} D_{qj} \, b_j.   (4)
Note that the roles of $\mu$ and $\nu$ in (4) are not symmetric, because the Dobrushin coefficient $c$ is defined solely in terms of $\mu$ (and is independent of $\nu$). As a result, there are two ways to use bound (4) for localized inference, depending on whether we treat $\mu$ as the original model that we want to query or as the localized model that we use for approximation. We exploit both possibilities in the following sections. In Section 4, we take $\mu$ as the global model and $\nu$ as the localized model, and derive a simple upper bound that relates the approximation error to the distance between the query node and the boundary of the local region on the graph. In Section 5, we take $\nu$ as the global model and $\mu$ as the localized model; we derive another upper bound that involves only the localized region, and leverage it to propose a greedy expansion algorithm that constructs the localized model with guaranteed approximation quality.
4 Distance-based Upper Bound
In this section, we assume that $\mu$ in Theorem 1 is defined by the original graphical model that we want to query, and $\nu$ is a simpler and more tractable distribution that we use to approximate the marginal of $x_q$ under $\mu$.
For notational simplicity, we partition the node set $I$ into two disjoint sets $A$ and $B$, where $A$ indexes the local subgraph that contains the query node $q$, and $B$ indexes the rest of the graph. We use $\partial A$ and $A^{\circ}$ to denote the sets of indices of nodes on the boundary and in the interior of $A$, respectively, so that $A = A^{\circ} \cup \partial A$, $A^{\circ} \cap \partial A = \emptyset$, and $A \cup B = I$. Similarly, $B = B^{\circ} \cup \partial B$ with $B^{\circ} \cap \partial B = \emptyset$. In addition, we use $x_A$ to denote the variables in $A$ and $x_B$ to denote the variables in $B$. We first apply the following lemma to obtain our first result relating the approximation error of marginals to the radius of the local subgraph $A$.
Lemma 1
(Rebeschini and van Handel, 2014) Assume $I$ is a finite set, let $d(\cdot, \cdot)$ be a pseudometric on $I$, and let $C$ be a nonnegative matrix indexed by $I$. Suppose that

\max_{i \in I} \sum_{j \in I} e^{d(i,j)} C_{ij} \leq c' < 1.

Then the matrix $D = \sum_{n \geq 0} C^n$ satisfies

\max_{i \in I} \sum_{j \in I} e^{d(i,j)} D_{ij} \leq \frac{1}{1 - c'}.

In particular, this implies that

\sum_{j \in J} D_{ij} \leq \frac{e^{-d(i, J)}}{1 - c'}

for every set $J \subseteq I$, where $d(i, J) = \min_{j \in J} d(i, j)$.
This lemma indicates that if $C_{ij}$ decays exponentially with the distance between $i$ and $j$, then $D_{ij}$, which is used in Theorem 1 and Corollary 1, also decays exponentially with the distance between $i$ and $j$. The condition of this correlation decay lemma is usually mild in practice. When we choose $d(i, j)$ to be a suitably scaled graph distance, which is naturally a pseudometric, and use Dobrushin's interaction matrix as $C$, the conditions of the lemma hold once the Dobrushin condition $c < 1$ is satisfied: the matrix $C$ in Theorem 1 is by definition nonnegative, hence the weighted matrix with entries $e^{d(i,j)} C_{ij}$ is also nonnegative, and its row sums can be made smaller than one by choosing the scaling factor appropriately. Applying Lemma 1, we obtain the following result.
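The role of $D = \sum_{n \geq 0} C^n$ and its decay can be checked numerically. The sketch below (illustrative values, plain Python) sums the Neumann series for a chain-structured nonnegative matrix whose row sums are below one; the entries $D_{0k}$ then fall off geometrically with the chain distance $k$, which is the behavior Lemma 1 formalizes.

```python
def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def neumann_series(C, terms=200):
    """D = sum_{n>=0} C^n; the series converges when the Dobrushin
    condition (row sums of C below one) holds."""
    n = len(C)
    D = identity(n)
    P = identity(n)
    for _ in range(terms):
        P = matmul(P, C)
        for i in range(n):
            for j in range(n):
                D[i][j] += P[i][j]
    return D

# chain-structured interaction matrix: every row sum is at most 0.6 < 1
n = 6
C = [[0.0] * n for _ in range(n)]
for i in range(n - 1):
    C[i][i + 1] = C[i + 1][i] = 0.3
D = neumann_series(C)
# D[0][k] shrinks as the chain distance k from node 0 grows
```

Equivalently, $D = (I - C)^{-1}$, which is how $D$ would be computed in practice rather than by truncating the series.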
Theorem 2
Suppose $\mu$ is the probability measure of the graphical model for which we want to query the marginal distribution of node $q$. Let $\nu$ be another probability measure on the same space whose parameters on the edges of subgraph $A$ and on the nodes in $A$ are the same as those of $\mu$. Assume the Dobrushin condition holds for $\mu$ ($c < 1$). Let $r$ denote the distance between node $q$ and the node set $\partial A \cup B$ on the Markov graph of $\mu$. If we assume

r \geq \frac{2}{\log(1/c)} \log \frac{1}{(1 - \sqrt{c}) \, \epsilon},   (5)

then we have

\max_{x_q} | \mu(x_q) - \nu(x_q) | \leq \epsilon.
This theorem characterizes the error incurred when approximating the global model $\mu$ with another model $\nu$ that matches it locally in region $A$. The result shows that in order to ensure an $\epsilon$ bound at the query node $q$, the distance from the query node to the boundary should be at least linear in $\log(1/\epsilon)$. In other words, the error decreases exponentially with the distance $r$. The proof of Theorem 2 can be found in the appendix.
As a result, given $\epsilon$ and $c$, we can obtain the minimum required radius by optimizing the scaling factor in the proof. Theorem 2 gives a simple but general way to choose the local subgraph of variable $x_q$, as we only need to check the Dobrushin condition and compute $c$ on the whole graphical model.
5 Localized Bound and Greedy Expansion
The bound in Theorem 2 requires computing the coefficient $c$ defined in (2) for the given graphical model. However, since $c$ is a maximum over the entire graph, it can be very expensive to compute when the graph is large. In this section, we explore another way of using the bound in Corollary 1, by setting $\nu$ to be the distribution of the original graphical model and $\mu$ to be the localized model. In this way, we derive a novel approximation approach that greedily constructs a local graph around the query variable $x_q$, with guaranteed upper bounds on the approximation error between the marginal distributions of $\mu$ and $\nu$.
To start with, we note that $\nu$ can be decomposed as

\nu(x) \propto \psi_A(x_A) \, \psi_B(x_B) \, \psi_{AB}(x_{\partial A}, x_{\partial B}),

where $\psi_A$ is the exponential of the potential functions defined within $A$, $\psi_B$ is the exponential of the potential functions defined within $B$, and $\psi_{AB}$ is the exponential of the potential functions defined on the edges between $\partial A$ and $\partial B$.

We want to approximate $\nu$ with a simpler model in which the nodes in $A$ and $B$ are disconnected, so that inference over $x_A$ can be performed locally within $A$, irrespective of the nodes in $B$. Formally, we approximate $\nu$ by

\mu(x) \propto \psi_A(x_A) \, g_A(x_{\partial A}) \, \psi_B(x_B) \, g_B(x_{\partial B}),

which replaces the factor $\psi_{AB}(x_{\partial A}, x_{\partial B})$ with a product of approximations $g_A(x_{\partial A})$ and $g_B(x_{\partial B})$. The marginal distributions of $x_A$ and $x_B$ are therefore decoupled in $\mu$, that is,

\mu(x_A) \propto \psi_A(x_A) \, g_A(x_{\partial A}), \qquad \mu(x_B) \propto \psi_B(x_B) \, g_B(x_{\partial B}).

This decomposition allows us to calculate the approximate marginal $\mu(x_q)$ efficiently within subgraph $A$. The challenges here are 1) how to construct the factors $g_A$ and $g_B$ in $\mu$ so that it closely approximates $\nu$, 2) how to decide the subgraph region $A$, and 3) how to bound the approximation error. We consider two methods for constructing $g_A$ and $g_B$ in this work:
1. [Dropping out] Simply remove the factor $\psi_{AB}$ from $\nu$. To do so, we set

g_A(x_{\partial A}) \equiv 1, \qquad g_B(x_{\partial B}) \equiv 1.   (6)

This corresponds to directly removing all the edges between $\partial A$ and $\partial B$, and is referred to as the “dropping out” method in our experiments.
2. [Mean field] Find $g_A$ and $g_B$ that closely approximate $\psi_{AB}$ by a mean field approximation, that is, solve the optimization problem

\min_{g_A, g_B} \; \mathrm{KL}\big( g_A(x_{\partial A}) \, g_B(x_{\partial B}) \;\|\; \psi_{AB}(x_{\partial A}, x_{\partial B}) \big),   (7)

where $\mathrm{KL}$ refers to the KL divergence between the corresponding normalized distributions. To apply the mean field approximation and reduce complexity, we further assume that the nodes are independent within $g_A$ and within $g_B$. By using the optimized approximation $g_A$, we can compensate for the error in the marginal of $x_A$ that is introduced by simply removing the edges between $\partial A$ and $\partial B$, as in the method above.

Note that the potentials $\psi_B$ and $g_B$ in $\mu$ do not influence the calculation of $\mu(x_q)$ for $q \in A$. For simplicity, we therefore remove all the edges in $B$ as well; this does not change the marginal of node $q$.
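To illustrate the dropping out construction of Eq. (6) end to end, the sketch below builds a toy chain Ising model, removes the single edge crossing the boundary of a local region $A$ around the query node, and compares the exact marginal of the query in the truncated model against the full model. All parameter values are illustrative, and brute-force enumeration stands in for whatever inference routine would be used in practice.

```python
import itertools
import math

def exact_marginal(q, h, J):
    """P(x_q = +1) by brute-force enumeration over {-1, +1}^n."""
    nodes = sorted(h)
    Z = Zq = 0.0
    for assign in itertools.product([-1, 1], repeat=len(nodes)):
        x = dict(zip(nodes, assign))
        s = sum(h[i] * x[i] for i in h)
        s += sum(w * x[i] * x[j] for (i, j), w in J.items())
        w_cfg = math.exp(s)
        Z += w_cfg
        if x[q] == 1:
            Zq += w_cfg
    return Zq / Z

# chain 0 - 1 - 2 - 3 - 4 - 5; query node 0, local region A = {0, 1, 2}
h = {0: 0.2, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.3}
J = {(i, i + 1): 0.3 for i in range(5)}
p_full = exact_marginal(0, h, J)

# "dropping out": keep only nodes in A and edges inside A,
# which removes the boundary edge (2, 3)
A = {0, 1, 2}
h_loc = {i: h[i] for i in A}
J_loc = {e: w for e, w in J.items() if set(e) <= A}
p_local = exact_marginal(0, h_loc, J_loc)

truncation_error = abs(p_full - p_local)
```

On this chain the influence of the removed edge is attenuated across two further edges before reaching the query, so the truncation error is small, matching the correlation decay intuition.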
By applying Corollary 1, we can now obtain an error bound which, remarkably, involves only the local region $A$.
Corollary 2
Assume $c < 1$ holds for the localized model $\mu$ and that the conditions in Theorem 1 hold. Then we have

\max_{x_q} | \mu(x_q) - \nu(x_q) | \leq \sum_{j \in \partial A} D_{qj} \, b_j,   (8)

where $b_j$ is defined in Eq. (3), $D$ is defined by $D = \sum_{n \geq 0} C^n$, and $C$ is defined in Theorem 1.
Note that the upper bound in (8) involves only the local region $A$ and hence can be computed efficiently using mean field or belief propagation within the subgraph on $A$. The proof of Corollary 2, along with the details of how to calculate $b_j$ and $D$ for Ising models in practice, can be found in the appendix.
Using the bound in (8), we propose a greedy algorithm that expands the local graph incrementally, starting from the query node $q$. At each iteration, we add the neighboring node that yields the tightest bound, and repeat this process until the bound is tight enough or a maximum graph size is reached. This procedure is summarized in Algorithm 1. After the expansion phase is complete, we can apply exact inference on the local region $A$ to calculate the marginal of the query if $A$ is small, or approximate inference methods if $A$ is of medium size. The actual size of $A$ can vary across graphical models; it is mainly determined by the correlation decay property near the query variable, or equivalently the tightness of the upper bound in Eq. (8).
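A high-level sketch of this greedy expansion loop is below; it is our reading of Algorithm 1, with a pluggable `bound_fn` standing in for the localized bound of Eq. (8), whose actual computation requires the machinery of Corollary 2.

```python
def greedy_expand(query, nbrs, bound_fn, eps, max_size):
    """Grow a local region around `query`: at each step add the frontier
    node whose inclusion yields the tightest bound, stopping when the
    bound drops to eps or the region reaches max_size.

    nbrs: adjacency map {node: iterable of neighbors}.
    bound_fn: maps a candidate region (a set of nodes) to an error bound.
    """
    region = {query}
    frontier = set(nbrs[query])
    best_bound = bound_fn(region)
    while frontier and len(region) < max_size and best_bound > eps:
        # candidate whose addition gives the smallest bound value
        cand = min(frontier, key=lambda v: bound_fn(region | {v}))
        region.add(cand)
        frontier.discard(cand)
        frontier |= set(nbrs[cand]) - region
        best_bound = bound_fn(region)
    return region
```

In the real algorithm, `bound_fn` would evaluate the right-hand side of Eq. (8) on the candidate region; any monotone surrogate score can be plugged in for experimentation.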
Computational Complexity
Here we consider the computational complexity of expanding the local subgraph and of the localized inference itself. Throughout, we assume that the maximum degree of the graph is $d_{\max}$, and we define the radius of the subgraph to be the maximum distance between the query node and any node in the subgraph.
First, given an accuracy threshold $\epsilon$, Theorem 2 shows that we only need a subgraph whose radius $r$ is of order

r = O\big( \log(1/\epsilon) / \log(1/c) \big),

where we recall that $c$ is the Dobrushin coefficient defined in (2). In particular, for a fixed $c < 1$, we just need $r = O(\log(1/\epsilon))$. It is worth noting that $r$ decreases when $c$ becomes small and/or the accuracy threshold $\epsilon$ becomes large. Since the size of the subgraph with radius $r$ is at most $O(d_{\max}^r)$, it can be much smaller than the whole graph. As a result, inference over the subgraph is much more efficient.
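The arithmetic behind this radius estimate is a one-liner. The sketch below assumes only that the truncation error decays geometrically, error(r) ≤ const · c^r with c < 1; the constant is a placeholder, not the exact one appearing in Theorem 2.

```python
import math

def required_radius(c, eps, const=1.0):
    """Smallest integer radius r with const * c**r <= eps, for 0 < c < 1.

    Solving const * c**r <= eps for r gives
    r >= log(const / eps) / log(1 / c).
    """
    assert 0.0 < c < 1.0, "the Dobrushin condition requires c < 1"
    return max(0, math.ceil(math.log(const / eps) / math.log(1.0 / c)))
```

For example, with $c = 0.5$ and $\epsilon = 0.01$, a radius of 7 already suffices, since $0.5^7 \approx 0.0078$.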
Next, we discuss the computational complexity of each expansion step (Algorithm 1, lines 6–10). We need to loop over the nodes on the boundary of the current subgraph. Inside the loop, we need to calculate the vector $b$ and the matrix $C$. The calculation of each element of $b$ requires enumerating assignments in the neighborhood of the corresponding node, which is bounded because it does not depend on the size of the whole graph. When updating the matrix $C$, we only need to update the elements related to the newly added node; the number of such elements is at most $O(d_{\max})$, and the calculation of each element again does not depend on the size of the whole graph. The matrix $D$ can be updated incrementally from $C$ using the results of previous iterations; the complexity is no more than that of inverting the matrix $I - C$ on the subgraph. If we use the mean field approximation in the greedy expansion, the computation is also cheap because the domains of $g_A$ and $g_B$ are small.

6 Experiments
We test our algorithm on both simulated and realworld datasets. The results indicate that our method provides an efficient localized inference technique.
6.1 2D Ising Grid
In this section, we perform experiments on 2D-grid Ising models, regarding the localized probability as $\mu$ and the true probability as $\nu$. The graph is a square lattice and the query node is fixed at a given coordinate. The parameters of the Ising model are generated by drawing the singleton potentials $h_i$ uniformly at random for all nodes $i$ and the pairwise potentials $J_{ij}$ uniformly at random for all edges $(i, j)$. The scales of the singleton and pairwise draws control the locality and the hardness of the Ising model.
Checking Dobrushin’s condition We start by numerically checking the Dobrushin condition $c < 1$. In Figure 3, we show the values of $c$ for Ising models generated with different singleton and pairwise potential scales, using a heatmap. We can see that $c$ is smaller than one in most regions, but larger than one when the pairwise scale is very large and the singleton scale is very small, in which case the nodes are strongly coupled (no correlation decay) and there is little useful local information. The hope, however, is that real problems tend to be easier because a large amount of local evidence is available.
Comparing Different Expansion Algorithms In this part, we compare the true approximation error to the bound given by our algorithm as we expand the local subgraph. The true error is evaluated using a brute-force algorithm. When removing the bipartite factor between the local region and the rest of the graph, we try both simply dropping the edges and the mean field approximation. In all the experiments, we use the UGM Matlab package (http://www.cs.ubc.ca/~schmidtm/Software/UGM.html) for the mean field approximation.
To better assess the error, we also add two baselines. The first baseline expands the local subgraph at each step by randomly selecting a node on the boundary $\partial A$. The second baseline expands the local subgraph greedily by choosing the node in $\partial A$ that has the maximum norm over the edge set between that node and the subgraph $A$: for the Ising model in (1) with weight $J_{uv}$ on each edge, the node added in each expansion is the one maximizing $\sum_{u \in A, (u,v) \in E} |J_{uv}|$. The intuition is that when the magnitudes of the edge weights are large, the node may be more strongly related to the nodes in the subgraph.
In Figure 4, we compare our greedy expansion method, stated in Algorithm 1, with the baselines above for constructing the local graph incrementally. For this experiment, we fix the singleton and pairwise potential scales and average over 100 random trials. We stop expanding the graph when the local subgraph contains 16 nodes, and report the mean of the true errors and bounds over the 100 trials for each subgraph size.
From Figure 4, we can see that, when combined with the dropping out method for constructing $\mu$, our greedy expansion method significantly outperforms the two baselines. We also find that the mean field method for constructing $\mu$ gives about the same true error as the dropping out method, but provides a tighter upper bound. It is interesting to note that the true errors of the two baseline expansion methods are sometimes even worse than the upper bounds of our greedy expansion, indicating the strong advantage of our method.
We further investigate how the parameters of the Ising model influence the results of the algorithms and the tightness of the bound. For this purpose, we fix the pairwise potential scale and vary the singleton potential scale over a range of values in Figure 5. For each setting, we run 100 simulations and report the mean error and bound. From Figure 5, we find that the bound is again relatively tight, especially when the singleton scale is large. Both the bounds and the true errors decrease as the singleton scale increases, because the correlation decay is stronger and the inference task is easier when there is strong local evidence on the singleton potentials.
6.2 Cora data set
We perform experimental evaluations on the Cora data set (https://people.cs.umass.edu/~mccallum/data.html). Cora consists of a large collection of machine learning papers with citation relations between them, in which each paper is labeled as one of seven classes. For our experiment, we binarize the labels by taking “Neural Networks” as label $1$ and the remaining classes as label $-1$. We process the data by removing the hubs in the graph, truncating the graph to have a maximum degree of 15: we randomly delete edges of nodes whose degree is larger than 15 until the whole graph is degree-bounded by 15. We then experiment on the largest connected subgraph, which consists of 2389 nodes and 4325 edges.

In order to construct an Ising model based on Cora, we randomly draw the edge potentials $J_{ij}$ for each edge of the citation graph, and set the singleton potentials to $h_i = \sigma$ for nodes with true label $1$ and $h_i = -\sigma$ for nodes with true label $-1$. Here $\sigma$ is a parameter; as $\sigma$ increases from 0 to 10, the singleton potentials increase, so that the marginal is more dominated by the status of the query node itself and the query becomes easier.
Comparing local inference with global inference In this part, we compare the performance of inference on the local graph with inference on the global graph. Since the global graph is too large for exact inference, we can only use approximate inference; we use mean field on the whole graph as the global-inference baseline. For the local graph, we expand the graph greedily as in Algorithm 1, stopping when the bound falls below a chosen threshold or the subgraph reaches 16 nodes.
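For reference, the naive mean field baseline on an Ising model amounts to the classical fixed-point iteration m_i ← tanh(h_i + Σ_j J_ij m_j) over node means. The sketch below is a generic illustration with toy parameters (the experiments themselves use the UGM package), omitting damping and convergence checks for brevity.

```python
import math

def mean_field_ising(h, J, iters=200):
    """Naive mean field for the Ising model of Eq. (1): returns the
    approximate means m[i] ~= E[x_i]; P(x_i = +1) ~= (1 + m[i]) / 2."""
    nbrs = {i: [] for i in h}
    for (i, j), w in J.items():
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    m = {i: 0.0 for i in h}
    for _ in range(iters):
        for i in h:  # in-place coordinate updates
            m[i] = math.tanh(h[i] + sum(w * m[j] for j, w in nbrs[i]))
    return m
```

A node's predicted label is then $1$ if $(1 + m_i)/2 > 0.5$ and $-1$ otherwise, matching the 0.5 thresholding of marginals used in the evaluation.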
For each setting, we query the same 500 nodes, randomly selected from the 2389 nodes, and evaluate their marginal distributions. Both global and local inference produce a marginal for each query node: if the marginal is larger than 0.5, the inference algorithm assigns label $1$; if it is less than 0.5, it assigns label $-1$.
In Figure 6, we report the accuracy of the labels given by global and local inference evaluated against the true labels, as well as the accuracy of local inference evaluated against the labels produced by global inference. We find that as the singleton potential scale increases, the accuracies of both global and local inference with respect to the true labels increase significantly. In addition, local inference gives results similar to global inference (the green curve is high), and this agreement also increases with the singleton scale. We also report in Figure 7 the precision, recall, and F-measure of local inference against global inference, treating label $1$ as positive. Both figures show that when the edge potentials are fixed and the singleton potentials increase, so that the correlation decay is stronger, our local inference method achieves better results.
7 Conclusion
In this paper, we address query-specific marginal inference in large-scale graphical models using a new localized inference algorithm. We leverage Dobrushin's comparison theorem to derive two error bounds for localized inference: a simple bound based on graph distance, and a localized bound from which we derive an efficient greedy expansion algorithm for constructing local regions. Our experiments show that the bounds are practically useful and that the algorithm works efficiently on a variety of graphical models. Future directions include theoretical investigation of tighter bounds and the development of more efficient greedy expansion algorithms.
References
Chechetka, A. and Guestrin, C. (2010). Focused belief propagation for query-specific inference. In AISTATS, pages 89–96.

Dechter, R. (2013). Reasoning with probabilistic and deterministic graphical models: Exact algorithms. Synthesis Lectures on Artificial Intelligence and Machine Learning, 7(3):1–191.

Föllmer, H. (1982). A covariance estimate for Gibbs measures. Journal of Functional Analysis, 46(3):387–395.

Heinemann, U. and Globerson, A. (2014). Inferning with high girth graphical models. In ICML, pages 1260–1268.

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.

Rebeschini, P. and van Handel, R. (2014). Comparison theorems for Gibbs measures. Journal of Statistical Physics, 157(2):234–281.

Rebeschini, P., van Handel, R., et al. (2015). Can local particle filters beat the curse of dimensionality? The Annals of Applied Probability, 25(5):2809–2866.

Shi, T., Steinhardt, J., and Liang, P. (2015). Learning where to sample in structured prediction. In AISTATS.

Wainwright, M. J., Jordan, M. I., et al. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.

Wick, M. L. and McCallum, A. (2011). Query-aware MCMC. In Advances in Neural Information Processing Systems, pages 2564–2572.
Appendix
Proof of Theorem 2
Proof 1
Let $d(i, j) = \lambda \cdot \mathrm{dist}(i, j)$, where $\mathrm{dist}(i, j)$ represents the graph distance between nodes $i$ and $j$, and choose $\lambda$ such that $e^{\lambda} = c^{-1/2}$. Since $C_{ij} \neq 0$ only when $i$ and $j$ are adjacent, we have

\max_{i} \sum_{j} e^{d(i,j)} C_{ij} \leq e^{\lambda} c = \sqrt{c} < 1.

Applying Lemma 1, we have

\sum_{j \in \partial A \cup B} D_{qj} \leq \frac{e^{-\lambda r}}{1 - \sqrt{c}} = \frac{c^{r/2}}{1 - \sqrt{c}},

where $r$ is the distance between $q$ and $\partial A \cup B$. Since $\nu$ agrees with $\mu$ on all factors within $A$, the perturbation coefficients satisfy $b_j = 0$ for $j \in A^{\circ}$ and $b_j \leq 1$ otherwise, so Corollary 1 gives $\max_{x_q} |\mu(x_q) - \nu(x_q)| \leq \sum_{j \in \partial A \cup B} D_{qj}$. Substituting the inequality on $r$ from condition (5) into the right-hand side yields the result.
Proof of Corollary 2
Proof 2
Note that the Dobrushin interaction matrix $C$ of $\mu$ is block-diagonal: since there are no edges between $A$ and $B$ in $\mu$, the corresponding off-diagonal blocks are zero. If the Dobrushin condition holds, $D = \sum_{n \geq 0} C^n$ is also block-diagonal and can be calculated easily from the diagonal blocks of $C$. To see this, writing $C_A$ and $C_B$ for the blocks of $C$ on $A$ and $B$, we have

D = \sum_{n \geq 0} \begin{pmatrix} C_A & 0 \\ 0 & C_B \end{pmatrix}^{n} = \begin{pmatrix} (I - C_A)^{-1} & 0 \\ 0 & (I - C_B)^{-1} \end{pmatrix}.