Efficient Localized Inference for Large Graphical Models

10/28/2017
by   Jinglin Chen, et al.

We propose a new localized inference algorithm for answering marginalization queries in large graphical models with the correlation decay property. Given a query variable and a large graphical model, we define a much smaller model in a local region around the query variable in the target model, so that the marginal distribution of the query variable can be accurately approximated. We introduce two approximation error bounds based on Dobrushin's comparison theorem and apply our bounds to derive a greedy expansion algorithm that efficiently guides the selection of neighbor nodes for localized inference. We verify our theoretical bounds on various datasets and demonstrate that our localized inference algorithm can provide fast and accurate approximation for large graphical models.


1 Introduction

Probabilistic graphical models such as Bayesian networks, Markov random fields, and conditional random fields are powerful tools for modeling complex dependencies over a large number of random variables (Koller and Friedman, 2009; Wainwright et al., 2008). Graphs are used to represent joint probability distributions, where nodes denote random variables and edges represent dependency relationships between nodes. Given a graphical model, a fundamental problem is to calculate the marginal distributions of variables of interest. This problem is closely related to computing the partition function, i.e., the normalization constant of the graphical model, which is known to be intractable and #P-complete. As a result, developing efficient approximate inference algorithms is a pressing need. The most popular algorithms include deterministic variational inference and Markov chain Monte Carlo sampling.

However, many challenging practical problems involve very large graphs on which it is computationally expensive to run existing variational inference or Monte Carlo sampling algorithms. This happens, for example, when we use a Markov random field to represent the social network of Facebook, or a Bayesian network to model a knowledge graph derived from the entire Wikipedia; in both cases the sizes of the graphical models can be prohibitively large (e.g., millions or billions of variables). It is thus infeasible to perform traditional approximate inference such as message passing or Monte Carlo on these models, because such methods need to traverse the entire model to answer a query. Despite the daunting sizes of large graphical models, in most real-world applications users only want to make inference on a small set of query variables of interest. The distribution of a query variable often depends strongly on only a small number of nearby variables in the graph. As a result, complete inference over the entire graph is not necessary, and practical methods should perform inference only with the most relevant variables in local graph regions close to the query variables, while ignoring variables that are weakly correlated and/or distantly located on the graph.

In this work, we develop a new localized inference method for very large graphical models. Our approach leverages Dobrushin's comparison theorem, which provides explicit bounds based on the correlation decay property of the graph, to restrict inference to a smaller local region that is sufficient for inferring the marginal distribution of the query variable. The theorem allows us to explicitly bound the truncation error, which in turn guides the selection of the localized region from the original large graph. Extensive experiments demonstrate both the usefulness of our theoretical bounds and the accuracy of our inference algorithm on a variety of datasets.

Related Work

Approximate inference algorithms for graphical models have been extensively studied in the past decades (see, for example, Koller and Friedman (2009), Wainwright et al. (2008), and Dechter (2013) for an overview). Query-specific inference for large graphical models has also been studied: Chechetka and Guestrin (2010) proposed a focused belief propagation for query-specific inference, and Wick and McCallum (2011) and Shi et al. (2015) studied query-aware sampling algorithms. Compared with these methods, our work is theoretically motivated by Dobrushin's comparison theorem, which enables us to construct the localized region in a principled and practically efficient manner.

2 Background on Graphical Models

Graphical models provide a flexible framework for representing relationships between random variables (Heinemann and Globerson, 2014). For a graph $G = (V, E)$, we use $X = \{X_i \colon i \in V\}$ to denote a finite collection of random variables and $x$ to refer to an assignment. Suppose $E$ is a set of edges and $\{\psi_{ij}, \psi_i\}$ is a set of potential functions, with $\psi_{ij}$ for each edge $(i, j) \in E$ and $\psi_i$ for each node $i \in V$. We use $p(x)$ to represent the joint distribution of the graphical model as follows,

$$p(x) = \frac{1}{Z} \exp\Big( \sum_{(i,j) \in E} \psi_{ij}(x_i, x_j) + \sum_{i \in V} \psi_i(x_i) \Big),$$

where $Z$ is the normalization constant (also called the partition function). In this work, we focus on the Ising model, an extensively studied graphical model. The Ising model is a pairwise model with binary variables $x_i \in \{-1, +1\}$. The pairwise and singleton potentials are defined as $\psi_{ij}(x_i, x_j) = J_{ij} x_i x_j$ and $\psi_i(x_i) = h_i x_i$, so the distribution of an Ising model is

$$p(x) = \frac{1}{Z} \exp\Big( \sum_{(i,j) \in E} J_{ij} x_i x_j + \sum_{i \in V} h_i x_i \Big). \tag{1}$$

Given a graphical model, marginal inference involves calculating the normalization constant or the marginal probabilities of small subsets of variables. These problems require summation over an exponential number of configurations and are typically #P-hard in the worst case for loopy graphical models. However, practical problems are often easier than the theoretical worst case, and it is still possible to obtain efficient approximations by leveraging the special structure of a given model. In this work, we focus on query-specific inference, where the goal is to calculate the marginal distribution $p(x_v)$ of a given individual variable $x_v$. For this task, it is possible to make good approximations based on a local region around $x_v$, which significantly accelerates inference in very large graphical models.
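To make the setup concrete, the following minimal Python sketch builds the Ising distribution of Eq (1) on a toy model small enough for brute-force enumeration and computes an exact single-node marginal. The parameter values and function names here are illustrative, not taken from the paper.

```python
import itertools
import math

# Toy Ising model: J[(i, j)] are edge couplings, h[i] are singleton fields.
J = {(0, 1): 0.5, (1, 2): -0.3, (0, 2): 0.2}
h = {0: 0.1, 1: -0.2, 2: 0.4}
nodes = sorted(h)

def unnormalized_prob(x):
    """Exponential of the Ising energy in Eq (1), before dividing by Z."""
    energy = sum(Jij * x[i] * x[j] for (i, j), Jij in J.items())
    energy += sum(hi * x[i] for i, hi in h.items())
    return math.exp(energy)

def exact_marginal(v):
    """P(x_v = +1) by brute-force summation over all 2^n configurations."""
    Z = 0.0
    pv = 0.0
    for assignment in itertools.product([-1, +1], repeat=len(nodes)):
        x = dict(zip(nodes, assignment))
        w = unnormalized_prob(x)
        Z += w
        if x[v] == +1:
            pv += w
    return pv / Z

print(exact_marginal(0))  # exact marginal of the query node
```

This brute-force computation is exactly what becomes infeasible at scale and what localized inference is designed to avoid.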

3 Localized Inference and Correlation Decay

Given a large graphical model, it is usually infeasible to compute the exact marginal of a specific variable due to the exponential time complexity. Furthermore, it is often not even practical to run variational approximation algorithms, such as mean field and belief propagation, when the graph is very large. This is because these traditional methods need to traverse the entire graph multiple times before convergence, and are thus prohibitively slow for very large models such as those built on social networks or knowledge bases.

On the other hand, it is relatively cheap to calculate exact or approximate marginals in small or medium-sized graphical models. In many applications, users are only interested in certain queries of node marginals. Because the queried variables often have strong associations with only a small number of nearby variables in the graph, complete inference over the full graph is not necessary. This can be formally captured by the phenomenon of correlation decay: when the graph is large and sparse, the influence of a random variable on the distribution of another random variable decreases quickly as the length of the shortest path between the corresponding nodes increases. The correlation decay property has been widely studied in statistical mechanics and graphical models (Rebeschini et al., 2015).

Formally, assuming that the edge potentials are well bounded, we expect that variables $x_i$ and $x_j$ are strongly correlated when the distance between nodes $i$ and $j$ on graph $G$ is small (e.g., when they are adjacent), while $x_i$ and $x_j$ may be only weakly correlated, or nearly independent, when nodes $i$ and $j$ are far away from each other on $G$. Such a property holds broadly in real-world graphical models, such as those built on social networks, in which an individual is mostly influenced by his/her friends. Often, the correlation decays exponentially with the distance between $i$ and $j$.

If a graphical model satisfies the correlation decay property, it is possible to perform marginal inference using only local information in the graph, since distant variables have little correlation with the query variable. This intuition allows us to use the most relevant variables in the local region close to the query variable to efficiently approximate its marginal distribution. Assume we are given a large graphical model $p$ and we want to calculate the marginal distribution of variable $x_v$. Localized inference constructs a much smaller model on a small subgraph that includes $v$, whose marginal of $x_v$ closely approximates $p(x_v)$. The challenge, however, is how to construct a good localized model and bound its approximation error. We address this problem via Dobrushin's comparison theorem (Föllmer, 1982), propose an efficient algorithm that finds the local graph region for a given query node, and provide an error bound between the approximate and true marginals. To get started, we first introduce Dobrushin's comparison theorem, which compares two Gibbs measures.

Figure 1: The key goal of this work is to approximate queries in large scale graphical models using smaller models on local regions.
Theorem 1

(Dobrushin's comparison theorem; Föllmer, 1982) Let $p$ be a Gibbs measure on a finite product space $S = \prod_{i \in I} S_i$, where $I$ is an index set. For $i, j \in I$, we define

$$C_{ij} = \sup_{x, y \in S \,:\, x_k = y_k \;\forall k \neq j} \big\| p_i(\cdot \mid x) - p_i(\cdot \mid y) \big\|_{\mathrm{TV}},$$

where $p_i(\cdot \mid x)$ is the conditional distribution of the $i$-th coordinate with respect to the $\sigma$-field generated by the coordinates with index $j \neq i$, and $\|\cdot\|_{\mathrm{TV}}$ is the total variation distance. We compute

$$c = \max_{i \in I} \sum_{j \in I} C_{ij} \tag{2}$$

and assume $c < 1$. Let $D = \sum_{n \geq 0} C^n = (\mathrm{Id} - C)^{-1}$. Then for any probability measure $q$ on the same space and any function $f$, we have

$$\Big| \int f \, dp - \int f \, dq \Big| \;\leq\; \sum_{i, j \in I} \delta_i(f)\, D_{ij}\, b_j,$$

where $b_j$ is the singleton perturbation coefficient of node $j$:

$$b_j = \sup_{x \in S} \big\| p_j(\cdot \mid x) - q_j(\cdot \mid x) \big\|_{\mathrm{TV}}, \tag{3}$$

and $\delta_i(f)$ is the oscillation of $f$ in the $i$-th coordinate, that is,

$$\delta_i(f) = \sup_{x, y \in S \,:\, x_k = y_k \;\forall k \neq i} \big| f(x) - f(y) \big|.$$

In Theorem 1, $p_i(\cdot \mid x)$ is the distribution of variable $x_i$ conditioned on its adjacent variables, whose assignments are given by the corresponding entries of $x$. According to the Markov property, calculating $p_i(\cdot \mid x)$ only requires information from the local star-shaped graph around node $i$, as shown in Figure 2. It is worth noting that a tighter bound can be obtained with a refined definition of $b_j$; here we use the definition in (3) for its lower computational complexity. The matrix $C$ is known as Dobrushin's interaction matrix, and the inequality $c < 1$ is the Dobrushin condition. When this condition holds, the theorem gives a bound between the two measures, which is a consequence of correlation decay; a small numerical illustration is given after Figure 2.

Figure 2: The star-shaped graph centered at node $i$ consists of the nodes within distance 1 of node $i$ and the edges drawn as solid arrows; the dashed edges do not belong to the star-shaped graph.
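As an illustration, the following sketch computes the interaction matrix $C$ for a small Ising model under the parameterization of Eq (1), checks the Dobrushin condition, and forms $D = (\mathrm{Id} - C)^{-1}$. The parameter values and function names are ours; the total variation distance between two binary conditionals is simply the absolute difference of their probabilities of $+1$.

```python
import itertools
import numpy as np

# Small Ising model (illustrative parameters).
J = {(0, 1): 0.4, (1, 2): -0.3, (2, 3): 0.5, (0, 3): 0.2}
h = {0: 0.1, 1: -0.2, 2: 0.0, 3: 0.3}
nodes = sorted(h)
n = len(nodes)

# Symmetric coupling lookup and neighbor sets.
coup = {}
nbrs = {i: set() for i in nodes}
for (i, j), w in J.items():
    coup[(i, j)] = coup[(j, i)] = w
    nbrs[i].add(j)
    nbrs[j].add(i)

def cond_prob_plus(i, x):
    """P(x_i = +1 | x_{N(i)}) under the Ising model of Eq (1)."""
    field = h[i] + sum(coup[(i, k)] * x[k] for k in nbrs[i])
    return 1.0 / (1.0 + np.exp(-2.0 * field))

def dobrushin_matrix():
    """C[i, j]: worst-case TV change of p_i(.|x) when only coordinate j flips."""
    C = np.zeros((n, n))
    for i in nodes:
        for j in nbrs[i]:  # C[i, j] = 0 if j is not a neighbor of i
            others = sorted(nbrs[i] - {j})
            worst = 0.0
            for vals in itertools.product([-1, +1], repeat=len(others)):
                x = dict(zip(others, vals))
                x[j] = +1
                p_plus = cond_prob_plus(i, x)
                x[j] = -1
                p_minus = cond_prob_plus(i, x)
                worst = max(worst, abs(p_plus - p_minus))  # TV for binary vars
            C[i, j] = worst
    return C

C = dobrushin_matrix()
c = C.sum(axis=1).max()                      # Dobrushin coefficient, Eq (2)
print("Dobrushin condition satisfied:", c < 1)
D = np.linalg.inv(np.eye(n) - C)             # D = sum_n C^n when c < 1
```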

In the following, we apply Theorem 1 to undirected graphical models to derive an approximation bound on marginal distributions. We denote by $I$ the index set of the variables and assume that we want to query the marginal distribution of variable $x_v$. In order to apply Theorem 1, we set $f$ to be the indicator function of the event $\{x_v = 1\}$, that is, $f(x) = \mathbb{I}[x_v = 1]$. Then $|\int f \, dp - \int f \, dq|$ becomes the absolute difference between the marginals of $x_v$ under the two measures $p$ and $q$. In addition, the oscillation of $f$ reduces to $\delta_v(f) = 1$ and $\delta_j(f) = 0$ for $j \neq v$. With these simplifications, we obtain a bound on the difference between the marginals of the query node under the two measures:

Corollary 1

Following the assumptions in Theorem 1 and the above text, we have

$$\big| p(x_v = 1) - q(x_v = 1) \big| \;\leq\; \sum_{j \in I} D_{vj}\, b_j. \tag{4}$$

Note that the roles of $p$ and $q$ in (4) are not symmetric, because the Dobrushin matrices $C$ and $D$ are defined solely in terms of $p$ (and are independent of $q$). As a result, there are two ways to use bound (4) for localized inference, depending on whether we treat $p$ or $q$ as the original model that we want to query, with the other playing the role of the localized model used for approximation. We exploit both possibilities in the following sections. In Section 4, we take $p$ to be the global model (or measure) and $q$ to be the localized model, and derive a simple upper bound that relates the approximation error to the distance between the query node and the boundary of the local region on the graph. In Section 5, we take $q$ to be the global model and $p$ to be the localized model, derive another upper bound that involves only the localized region, and leverage it to propose a greedy expansion algorithm that constructs the localized model with guaranteed approximation error.

4 Distance-based Upper Bound

In this section, we assume that $p$ in Theorem 1 is defined by the original graphical model that we want to query, and that $q$ is a simpler and more tractable distribution that we use to approximate the marginal of $x_v$ under $p$.

For notational simplicity, we partition the node set $I$ into two disjoint sets $\alpha$ and $\beta$, where $\alpha$ indexes the local subgraph that contains the query node $v$ and $\beta$ indexes the rest of the graph. We use $\partial\alpha$ and $\alpha^{\circ}$ to denote the nodes on the boundary and in the interior of $\alpha$, so that $\alpha = \partial\alpha \cup \alpha^{\circ}$ and $\partial\alpha \cap \alpha^{\circ} = \emptyset$; the sets $\partial\beta$ and $\beta^{\circ}$ are defined analogously for $\beta$. In addition, we use $x_\alpha$ to denote the variables indexed by $\alpha$ and $x_\beta$ to denote the variables indexed by $\beta$. We first use the following lemma to obtain our first result, which relates the approximation error of the marginal to the radius of the local subgraph $\alpha$.

Lemma 1

(Rebeschini and van Handel, 2014) Assume $I$ is a finite set, let $d$ be a pseudo-metric on $I$, and let $C$ be a non-negative matrix indexed by $I$. Suppose that

$$\max_{i \in I} \sum_{j \in I} e^{d(i,j)}\, C_{ij} \;\leq\; \bar{c} \;<\; 1.$$

Then the matrix $D = \sum_{n \geq 0} C^n$ satisfies

$$\max_{i \in I} \sum_{j \in I} e^{d(i,j)}\, D_{ij} \;\leq\; \frac{1}{1 - \bar{c}}.$$

In particular, this implies that

$$\sum_{j \in J} D_{ij} \;\leq\; \frac{e^{-d(i, J)}}{1 - \bar{c}}$$

for every set $J \subseteq I$, where $d(i, J) = \min_{j \in J} d(i, j)$.

This lemma indicates that if $C_{ij}$ decays exponentially with the distance between $i$ and $j$, then $\sum_{j \in J} D_{ij}$, which is used in Theorem 1 and Corollary 1, also decays exponentially with the distance between $i$ and the set $J$. The condition of this correlation decay lemma is usually mild in practice. When we choose $d(i,j) = \lambda \, \mathrm{dist}(i,j)$ for a sufficiently small $\lambda > 0$, where $\mathrm{dist}(\cdot,\cdot)$ is the graph distance (naturally a pseudo-metric), and use Dobrushin's interaction matrix as $C$, the conditions of the lemma hold once the Dobrushin condition is satisfied, because the matrix $C$ in Theorem 1 is by definition non-negative (hence $D$ is also non-negative) and its row sums are bounded by $c < 1$. Applying Lemma 1, we can obtain the following result.

Theorem 2

Suppose $p$ is the probability measure of the graphical model for which we want to query the marginal distribution of node $v$. Let $q$ be another probability measure on the same space whose parameters on the edges inside subgraph $\alpha$ and on the nodes in $\alpha$ are the same as those of $p$. Assume the Dobrushin condition holds for $p$ (i.e., $c < 1$ in (2)). Let $\mathrm{dist}(v, \partial\alpha)$ denote the graph distance between node $v$ and the boundary $\partial\alpha$ on the Markov graph of $p$. If, for some $\lambda > 0$ with $c\, e^{\lambda} < 1$ and some $\epsilon > 0$, we assume

$$\mathrm{dist}(v, \partial\alpha) \;\geq\; \frac{1}{\lambda}\, \log \frac{1}{\epsilon\, (1 - c\, e^{\lambda})}, \tag{5}$$

then we have

$$\big| p(x_v = 1) - q(x_v = 1) \big| \;\leq\; \epsilon.$$

This theorem characterizes the error incurred when approximating the global model $p$ with another model $q$ that matches $p$ locally in the region $\alpha$. The result shows that, in order to ensure an error of at most $\epsilon$ on the query node $v$, the distance from the query node to the boundary should be at least linear in $\log(1/\epsilon)$; in other words, the error decreases exponentially with $\mathrm{dist}(v, \partial\alpha)$. The proof of Theorem 2 can be found in the appendix.

As a result, given $c$ and $\epsilon$, we can obtain the smallest admissible lower bound on $\mathrm{dist}(v, \partial\alpha)$ by optimizing over $\lambda$. Theorem 2 gives a simple but general way to bound the size of the local subgraph of variable $x_v$, as we only need to check the Dobrushin condition and compute $c$ on the whole true graphical model.
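As a rough illustration of this trade-off, under the reconstructed form of the bound in (5) (error at most $e^{-\lambda r} / (1 - c\,e^{\lambda})$ for radius $r$), one can numerically find the smallest radius that guarantees a target accuracy. The function name and the grid search over $\lambda$ below are our own illustrative choices.

```python
import numpy as np

def required_radius(c, eps, num_lambdas=1000):
    """Smallest radius r such that exp(-lam * r) / (1 - c * exp(lam)) <= eps
    for some admissible decay rate lam (with c * exp(lam) < 1), per Eq (5)."""
    assert 0 < c < 1
    best = np.inf
    # Admissible rates satisfy lam < log(1/c); search on a grid.
    for lam in np.linspace(1e-6, np.log(1.0 / c), num_lambdas, endpoint=False):
        slack = 1.0 - c * np.exp(lam)
        r = np.log(1.0 / (eps * slack)) / lam
        best = min(best, r)
    return int(np.ceil(best))

print(required_radius(c=0.3, eps=1e-3))  # radius grows like log(1/eps)
```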

5 Localized Bound and Greedy Expansion

The bound in Theorem 2 requires computing the Dobrushin coefficient $c$ defined in (2) for the given graphical model. However, since $c$ is a maximum over the entire graph, it can be very expensive to compute when the graph is large. In this section, we explore another way of using the bound in Corollary 1, by setting $q$ to be the distribution of the original graphical model and $p$ to be the localized model. In this way, we derive a novel approximation approach that greedily constructs a local graph around the query variable $x_v$, with guaranteed upper bounds on the approximation error between the marginal distributions of $p$ and $q$.

To start with, we note that $q(x)$ can be decomposed as

$$q(x) \;\propto\; \Psi_{\alpha}(x_\alpha)\, \Psi_{\beta}(x_\beta)\, \Psi_{\alpha\beta}(x_\alpha, x_\beta),$$

where $\Psi_{\alpha}$ is the exponential of a potential function of $x_\alpha$, $\Psi_{\beta}$ is the exponential of a potential function of $x_\beta$, and $\Psi_{\alpha\beta}$ is the exponential of a potential function defined jointly on $x_\alpha$ and $x_\beta$ (the cross terms).

We want to approximate $q$ with a simpler model $p$ in which the nodes in $\alpha$ and $\beta$ are disconnected, so that inference over $x_\alpha$ can be performed locally within $\alpha$, independently of the nodes in $\beta$. Formally, we approximate $q$ by

$$p(x) \;\propto\; \Psi_{\alpha}(x_\alpha)\, \tilde\Psi_{\alpha}(x_\alpha)\, \Psi_{\beta}(x_\beta)\, \tilde\Psi_{\beta}(x_\beta),$$

which replaces the factor $\Psi_{\alpha\beta}(x_\alpha, x_\beta)$ with a product of approximations $\tilde\Psi_{\alpha}(x_\alpha)$ and $\tilde\Psi_{\beta}(x_\beta)$. The marginal distributions of $x_\alpha$ and $x_\beta$ therefore decouple in $p$, that is,

$$p(x_\alpha) \;\propto\; \Psi_{\alpha}(x_\alpha)\, \tilde\Psi_{\alpha}(x_\alpha), \qquad p(x_\beta) \;\propto\; \Psi_{\beta}(x_\beta)\, \tilde\Psi_{\beta}(x_\beta).$$

This decomposition allows us to calculate the marginal $p(x_v)$ efficiently within the subgraph $\alpha$. The challenges are 1) how to construct the factors $\tilde\Psi_{\alpha}$ and $\tilde\Psi_{\beta}$ in $p$ so that they closely approximate $\Psi_{\alpha\beta}$, 2) how to decide the subgraph region $\alpha$, and 3) how to bound the approximation error. We consider two methods for constructing $\tilde\Psi_{\alpha}$ and $\tilde\Psi_{\beta}$ in this work:

1. [Dropping out] Simply remove the cross factor $\Psi_{\alpha\beta}$ from $q$. To do so, we set

$$\tilde\Psi_{\alpha}(x_\alpha) = 1, \qquad \tilde\Psi_{\beta}(x_\beta) = 1. \tag{6}$$

This corresponds to directly removing all the edges between $\alpha$ and $\beta$, and is referred to as the "dropping out" method in our experiments (see the sketch after this list).

2. [Mean field] Find $\tilde\Psi_{\alpha}$ and $\tilde\Psi_{\beta}$ that closely approximate $\Psi_{\alpha\beta}$ by performing a mean field approximation, that is, by solving the following optimization problem:

$$\min_{\tilde\Psi_{\alpha},\, \tilde\Psi_{\beta}} \;\; \mathrm{KL}\Big( \tilde\Psi_{\alpha}(x_\alpha)\, \tilde\Psi_{\beta}(x_\beta) \;\Big\|\; \Psi_{\alpha\beta}(x_\alpha, x_\beta) \Big), \tag{7}$$

where $\mathrm{KL}(\cdot \,\|\, \cdot)$ refers to the KL divergence between the correspondingly normalized distributions. To apply the mean field approximation and reduce complexity, we further assume that the nodes are independent within $\tilde\Psi_{\alpha}$ and $\tilde\Psi_{\beta}$. By using the optimized approximation $\tilde\Psi_{\alpha}$, we can compensate for the error in the marginal of $x_\alpha$ that is introduced by simply removing the edges between $\alpha$ and $\beta$, as mentioned above.

Note that the potentials $\Psi_{\beta}$ and $\tilde\Psi_{\beta}$ in $p$ do not influence the calculation of $p(x_\alpha)$, and hence do not affect the marginal of the query node. For simplicity, we also remove all the edges within $\beta$; this does not change the marginal of node $v$.
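For concreteness, here is a minimal sketch of the "dropping out" construction (Eq (6)) for an Ising model; the dictionary-based representation and function names are ours, and the mean field variant (Eq (7)) is omitted for brevity.

```python
def drop_out_localize(J, h, alpha):
    """Localized Ising model per Eq (6): keep only edges with both endpoints
    in alpha and the singleton fields of nodes in alpha; all cross edges
    (and, for simplicity, all edges inside beta) are removed."""
    alpha = set(alpha)
    J_local = {(i, j): w for (i, j), w in J.items() if i in alpha and j in alpha}
    h_local = {i: v for i, v in h.items() if i in alpha}
    return J_local, h_local

# Example: a 4-node chain 0-1-2-3, localized to alpha = {0, 1}.
J = {(0, 1): 0.4, (1, 2): -0.3, (2, 3): 0.5}
h = {0: 0.1, 1: -0.2, 2: 0.0, 3: 0.3}
print(drop_out_localize(J, h, alpha={0, 1}))
```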

By applying Corollary 1, we can now obtain an error bound which, remarkably, involves only the local region $\alpha$.

Corollary 2

Assume $p$ is the localized model constructed above and that the conditions in Theorem 1 hold for $p$; then we have

$$\big| p(x_v = 1) - q(x_v = 1) \big| \;\leq\; \sum_{j \in \alpha} D_{vj}\, b_j, \tag{8}$$

where $b_j$ is defined in Eq (3), $D$ is defined by $D = \sum_{n \geq 0} C^n$, and $C$ is the interaction matrix of $p$ defined in Theorem 1.

Note that the upper bound in (8) involves only the local region $\alpha$ and hence can be computed efficiently, using mean field or belief propagation within the subgraph on $\alpha$. The proof of Corollary 2 and the details on how to calculate $D$ and $b_j$ for Ising models in practice can be found in the appendix.

Using the bound in (8), we propose a greedy algorithm that expands the local graph $\alpha$ incrementally, starting from the query node $v$. At each iteration, we add the neighboring node that yields the tightest bound and repeat this process until the bound is sufficiently tight or a maximum graph size is reached. This process is summarized in Algorithm 1. After the expansion phase is completed, we can apply exact inference on the local region $\alpha$ to calculate the marginal of the query node if the size of $\alpha$ is small, or approximate inference methods if the size of $\alpha$ is moderate. The actual size of $\alpha$ can vary across graphical models; it is mainly determined by the strength of correlation decay near the query variable and the tightness of the upper bound in Eq (8).

1:  given: a graphical model $q$ and a query node $v$; goal: approximate the marginal probability $q(x_v)$
2:  input: maximum number of nodes $N$ in the local subgraph and the improvement threshold $\epsilon_0$
3:  initialize the local subgraph $\alpha \leftarrow \{v\}$ and the current bound $B \leftarrow \infty$
4:  while $|\alpha| \leq N$ ($|\alpha|$ denotes the number of nodes in $\alpha$) do
5:     set $\beta \leftarrow I \setminus \alpha$ and let $\partial\beta$ be the nodes in $\beta$ that connect with $\alpha$ in $G$.
6:     for each node $j \in \partial\beta$ do
7:        add node $j$ to $\alpha$ to obtain a candidate local subgraph $\alpha_j$.
8:        construct the local model $p_j$ by setting $\tilde\Psi_{\alpha} = 1$ (dropping out, Eq (6)) or by estimating it with mean field as in Eq (7).
9:        calculate the bound $B_j$ in (8), where $\alpha$ refers to $\alpha_j$
10:     end for
11:     if the best candidate improves the current bound, i.e., $B - \min_j B_j > \epsilon_0$, then
12:        update $\alpha \leftarrow \alpha_{j^\ast}$ with $j^\ast = \arg\min_j B_j$
13:        update $B \leftarrow B_{j^\ast}$
14:     end if
15:  end while
Algorithm 1 Greedy expansion algorithm for localized inference
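The following is a self-contained Python sketch of the whole pipeline: the dropping-out construction, the localized bound (8) as reconstructed above, and the greedy loop of Algorithm 1. It is illustrative only; the data structures and function names are ours, the stopping rule is simplified (improvement threshold set to zero), and the bound computation follows our reconstruction rather than the authors' implementation.

```python
import itertools
import numpy as np

def neighbors(J):
    nb = {}
    for (i, j) in J:
        nb.setdefault(i, set()).add(j)
        nb.setdefault(j, set()).add(i)
    return nb

def symmetrize(J):
    return {**{(i, j): w for (i, j), w in J.items()},
            **{(j, i): w for (i, j), w in J.items()}}

def cond_plus(i, x, J_sym, h, nbrs):
    """P(x_i = +1 | x_{N(i)}) for an Ising model with symmetric couplings."""
    field = h.get(i, 0.0) + sum(J_sym[(i, k)] * x[k] for k in nbrs.get(i, ()))
    return 1.0 / (1.0 + np.exp(-2.0 * field))

def localized_bound(J, h, alpha, v):
    """Upper bound (8) for the 'dropping out' localized model on alpha."""
    alpha = sorted(alpha)
    idx = {node: k for k, node in enumerate(alpha)}
    J_sym = symmetrize(J)
    nb_full = neighbors(J)
    nb_loc = {i: {k for k in nb_full.get(i, ()) if k in idx} for i in alpha}
    Jp = {(i, k): J_sym[(i, k)] for i in alpha for k in nb_loc[i]}

    # Interaction matrix C of the localized model p (Theorem 1).
    n = len(alpha)
    C = np.zeros((n, n))
    for i in alpha:
        for j in nb_loc[i]:
            others = sorted(nb_loc[i] - {j})
            worst = 0.0
            for vals in itertools.product([-1, 1], repeat=len(others)):
                x = dict(zip(others, vals))
                x[j] = 1
                a = cond_plus(i, x, Jp, h, nb_loc)
                x[j] = -1
                b = cond_plus(i, x, Jp, h, nb_loc)
                worst = max(worst, abs(a - b))
            C[idx[i], idx[j]] = worst
    if C.sum(axis=1).max() >= 1.0:          # Dobrushin condition violated
        return np.inf
    D = np.linalg.inv(np.eye(n) - C)

    # Perturbation coefficients b_j: compare conditionals of p and q.
    bvec = np.zeros(n)
    for j in alpha:
        full_nb = sorted(nb_full.get(j, ()))
        worst = 0.0
        for vals in itertools.product([-1, 1], repeat=len(full_nb)):
            x = dict(zip(full_nb, vals))
            p_loc = cond_plus(j, x, Jp, h, nb_loc)       # localized model p
            q_org = cond_plus(j, x, J_sym, h, nb_full)   # original model q
            worst = max(worst, abs(p_loc - q_org))
        bvec[idx[j]] = worst
    return float(D[idx[v]] @ bvec)

def greedy_expand(J, h, v, max_nodes=10):
    """Greedy expansion (Algorithm 1) with the dropping-out construction."""
    nb_full = neighbors(J)
    alpha, best = {v}, np.inf
    while len(alpha) < max_nodes:
        frontier = {k for i in alpha for k in nb_full.get(i, ()) if k not in alpha}
        if not frontier:
            break
        scores = {j: localized_bound(J, h, alpha | {j}, v) for j in frontier}
        j_star = min(scores, key=scores.get)
        if scores[j_star] >= best:          # no improvement: stop expanding
            break
        alpha.add(j_star)
        best = scores[j_star]
    return alpha, best

# Example on a small chain: query node 0.
J = {(0, 1): 0.4, (1, 2): -0.3, (2, 3): 0.5, (3, 4): 0.2}
h = {i: 0.1 * (-1) ** i for i in range(5)}
print(greedy_expand(J, h, v=0, max_nodes=4))
```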

Computational Complexity

Here we consider the computational complexity of expanding the local subgraph and the complexity of localized inference. We always suppose that the maximum degree of the graph is $d_{\max}$, and we define the radius of the subgraph to be the maximum distance between the query node and any node in the subgraph.

First, given a threshold $\epsilon$, Theorem 2 tells us that we only need a subgraph with radius

$$ r \;\geq\; \frac{1}{\lambda}\, \log \frac{1}{\epsilon\, (1 - c\, e^{\lambda})}, $$

where we recall that $c$ is the Dobrushin coefficient defined in (2). In particular, taking a fixed admissible $\lambda$ (e.g., $\lambda = \tfrac{1}{2}\log(1/c)$) shows that we only need $r = O(\log(1/\epsilon))$. It is worth noting that $r$ decreases when $c$ becomes smaller and/or the accuracy threshold $\epsilon$ becomes larger. Since the size of a subgraph with radius $r$ is at most $O(d_{\max}^{\,r})$, it can be much smaller than the whole graph. As a result, inference over the subgraph is much more efficient.

Next, we discuss the computational complexity of each expansion step (Algorithm 1, lines 6-10). We need to loop over the nodes in $\partial\beta$. Inside the loop, we need to calculate the vector $b$ and the matrix $C$. Calculating each element of $b$ requires enumerating the assignments in the neighborhood of the corresponding node, which is bounded because it does not depend on the size of the whole graph. In the calculation of the matrix $C$, we only need to update the elements related to the new node; the number of such elements is bounded in terms of the maximum degree $d_{\max}$, and the calculation of each element again does not depend on the size of the whole graph. The matrix $D$ can be derived from $C$ incrementally using historical information, and its complexity is no more than that of inverting the whole matrix $\mathrm{Id} - C$. If we use the mean field approximation in the greedy expansion, the computation is also cheap because the factors being approximated involve only a small number of boundary nodes.

6 Experiments

We test our algorithm on both simulated and real-world datasets. The results indicate that our method provides an efficient localized inference technique.

6.1 2D Ising Grid

In this section, we perform experiments on 2D-grid Ising models, regarding the localized probability as $p$ and the true probability as $q$. The graph is a lattice, and the query node is located at a fixed coordinate. The parameters of the Ising model are generated by drawing each singleton parameter $h_i$ uniformly at random with magnitude controlled by $\sigma_h$, and each edge parameter $J_{ij}$ uniformly at random with magnitude controlled by $\sigma_J$. Here $\sigma_h$ and $\sigma_J$ control the locality and hardness of the Ising model.
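A sketch of this setup follows; the exact sampling intervals and lattice size used in the experiments are not reproduced here, so the values below are illustrative.

```python
import numpy as np

def random_grid_ising(side, sigma_h, sigma_J, seed=0):
    """2D-grid Ising model with random fields and couplings.
    sigma_h / sigma_J control the strength of singleton and pairwise terms."""
    rng = np.random.default_rng(seed)
    nodes = [(r, c) for r in range(side) for c in range(side)]
    h = {v: rng.uniform(-sigma_h, sigma_h) for v in nodes}
    J = {}
    for r in range(side):
        for c in range(side):
            if r + 1 < side:                       # vertical edge
                J[((r, c), (r + 1, c))] = rng.uniform(-sigma_J, sigma_J)
            if c + 1 < side:                       # horizontal edge
                J[((r, c), (r, c + 1))] = rng.uniform(-sigma_J, sigma_J)
    return h, J

h, J = random_grid_ising(side=10, sigma_h=0.5, sigma_J=0.3)
```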

Checking Dobrushin's condition We start by numerically checking the Dobrushin condition $c < 1$. In Figure 3, we show the values of $c$ for Ising models generated with different values of $\sigma_h$ and $\sigma_J$, using a heatmap. We can see that $c$ is smaller than one in most regions, but is larger than one when $\sigma_J$ is very large and $\sigma_h$ is very small, in which case the nodes are strongly coupled together (no correlation decay) and there is little local information. The hope, however, is that real problems tend to be easier because a large amount of local evidence is available.

Figure 3: Verifying Dobrushin's condition. The color denotes the value of $c$ in (2) for models generated with different values of $\sigma_h$ and $\sigma_J$.

Comparing Different Expansion Algorithms In this part, we compare the true approximation error with the bound given by our algorithm as we expand the local subgraph. The true error is evaluated using the brute-force algorithm. When removing the bipartite part of the graph between $\alpha$ and $\beta$, we try both simply dropping the edges and the mean field approximation. In all experiments, we use the UGM Matlab package (http://www.cs.ubc.ca/~schmidtm/Software/UGM.html) for the mean field approximation.

In order to better compare the errors, we also add two baselines. The first baseline expands the local subgraph at each step by randomly selecting a node on the boundary $\partial\beta$. The second baseline expands the local subgraph greedily by choosing the node in $\partial\beta$ that has the maximum norm over the edge set between that node and the subgraph $\alpha$. Formally, for the Ising model in (1) with weight $J_{ij}$ on each edge, the node added in each expansion step is
$$ j^{\ast} \;=\; \arg\max_{j \in \partial\beta} \big\| (J_{ij})_{\,i \in \alpha:\,(i,j)\in E} \big\|, $$
i.e., the node whose vector of edge weights into $\alpha$ has the largest norm. The intuition is that when the magnitude of the edge weights is large, the node may be more related to the nodes in the subgraph.

In Figure 4, we compare our greedy expansion method (Algorithm 1) with the baselines described above for constructing the local graph incrementally. For this experiment, we fix $\sigma_h$ and $\sigma_J$ and average over 100 random trials. We stop expanding the graph when the local subgraph contains 16 nodes. We report the mean values of the true errors and of the bounds over the 100 trials for different numbers of nodes in the subgraph.

From Figure 4, we can see that, when combined with the dropping-out method for constructing $\tilde\Psi$, our greedy expansion method significantly outperforms the two baselines. We also find that the mean field method for constructing $\tilde\Psi$ gives about the same true error as the dropping-out method, but provides a tighter upper bound. It is interesting to note that the true errors of the two baseline expansion methods are sometimes even worse than the upper bounds of our greedy expansion, indicating the strong advantage of our method.

Figure 4: The true errors and our upper bounds for the marginal approximation when we expand the local subgraphs to different sizes.

We further investigate how the parameters of the Ising model influence the results of the algorithms and the tightness of the bound. For this purpose, we fix $\sigma_J$ and vary $\sigma_h$ over a range of values in Figure 5. For each setting, we run 100 simulations and report the mean error and bound. From Figure 5, we find that the bound is again relatively tight, especially when the value of $\sigma_h$ is large. Both the bounds and the true errors decrease as $\sigma_h$ increases, because the correlation decay is stronger and the inference task is easier with strong local evidence on the singleton potentials (large $\sigma_h$).

Figure 5: Our bounds and the true errors for different values of $\sigma_h$.

6.2 Cora data set

We perform experimental evaluations on the Cora data set (https://people.cs.umass.edu/~mccallum/data.html). Cora consists of a large collection of machine learning papers with citation relations between the papers, in which each paper is labeled as one of seven classes. For our experiment, we binarize the labels by taking "Neural Networks" as label $+1$ and the remaining classes as label $-1$. We preprocess the data by removing the hubs in the graph and truncating the graph to have a maximum degree of 15; this is done by randomly deleting edges of nodes whose degree is larger than 15 until the whole graph has maximum degree 15. We then experiment on the largest connected subgraph, which consists of 2,389 nodes and 4,325 edges.

In order to construct an Ising model based on Cora, we randomly draw an edge potential for each edge of the citation graph, and draw the singleton potential of each node according to its true label, with the strength controlled by a parameter $\gamma$ that we vary between 0 and 10: nodes with true label $+1$ receive a positive potential and nodes with true label $-1$ a negative one. As $\gamma$ increases from 0 to 10, the node potentials become stronger, so the marginal is more dominated by the state of the query node itself and querying becomes easier.
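A sketch of one such construction follows; the specific sampling distribution for the edge couplings and the exact way $\gamma$ enters the singleton potentials are our illustrative assumptions, since they are not fully reproduced here.

```python
import numpy as np

def label_ising(edges, labels, gamma, sigma_J=0.5, seed=0):
    """Ising model on a citation graph: random edge couplings plus
    label-dependent singleton fields of strength gamma (illustrative)."""
    rng = np.random.default_rng(seed)
    J = {(i, j): rng.uniform(-sigma_J, sigma_J) for (i, j) in edges}
    h = {i: gamma * y for i, y in labels.items()}   # y in {-1, +1}
    return J, h

# Toy example: 4 papers, 3 citation edges, binarized labels.
edges = [(0, 1), (1, 2), (2, 3)]
labels = {0: +1, 1: +1, 2: -1, 3: -1}
J, h = label_ising(edges, labels, gamma=2.0)
```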

Comparing local inference with global inference In this part, we compare the performance of inference on the local graph with inference on the global graph. Since the global graph is too large for exact inference, we can only use an approximate inference algorithm; here, we use mean field on the whole graph as the global-inference baseline. For the local graph, we expand the graph greedily as in Algorithm 1, using a fixed improvement threshold and stopping once the subgraph has 16 nodes.

Figure 6: The accuracy of different algorithms as $\gamma$ changes. Red and blue: the accuracy of the labels given by the global and the local inference, respectively, evaluated w.r.t. the true labels. Green: the accuracy of the local inference evaluated w.r.t. the labels provided by the global inference.

For each value of $\gamma$, we query the same 500 nodes, randomly selected out of the 2,389 nodes, and evaluate their marginal distributions. Both global and local inference produce a marginal for each query node. If the marginal of label $+1$ is larger than 0.5, the inference algorithm outputs label $+1$; if it is less than 0.5, it outputs label $-1$.

In Figure 6, we report the accuracy of the labels given by the global and local inference evaluated w.r.t. the true labels, as well as the accuracy of the local inference evaluated w.r.t. the labels provided by global inference. We find that as $\gamma$ increases, the accuracies of both global and local inference w.r.t. the true labels increase significantly. In addition, the local inference gives results similar to the global inference (the green curve is high), and this agreement also increases with $\gamma$. We also report in Figure 7 the precision, recall, and F-measure when comparing local inference with global inference, treating label $+1$ as positive. Both figures show that when the edge strength is fixed and $\gamma$ increases, so that the correlation decay is stronger, our local inference method achieves better results.
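For reference, the agreement metrics in Figure 7 can be computed as in the following minimal sketch (variable names are ours), treating $+1$ as the positive class:

```python
def agreement_metrics(local_labels, global_labels):
    """Precision, recall, and F-measure of local vs. global predictions,
    treating +1 as the positive class."""
    pairs = list(zip(local_labels, global_labels))
    tp = sum(1 for l, g in pairs if l == +1 and g == +1)
    fp = sum(1 for l, g in pairs if l == +1 and g == -1)
    fn = sum(1 for l, g in pairs if l == -1 and g == +1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(agreement_metrics([+1, -1, +1, +1], [+1, -1, -1, +1]))
```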

Figure 7: Precision, recall, and F-measure of the labels given by local inference vs. global inference as the value of $\gamma$ changes.

7 Conclusion

In this paper, we have addressed query-specific marginal inference in large-scale graphical models with a new localized inference algorithm. We leverage Dobrushin's comparison theorem to derive two error bounds for localized inference: a simple bound based on graph distance, and a localized bound from which we derive an efficient greedy expansion algorithm for constructing local regions. Our experiments show that the bounds are practically useful and that the algorithm works efficiently on various graphical models. Future directions include theoretical investigation of tighter bounds and the development of more efficient greedy expansion algorithms.

References

  • Chechetka and Guestrin (2010) Chechetka, A. and Guestrin, C. (2010). Focused belief propagation for query-specific inference. In AISTATS, pages 89–96.
  • Dechter (2013) Dechter, R. (2013). Reasoning with probabilistic and deterministic graphical models: Exact algorithms. Synthesis Lectures on Artificial Intelligence and Machine Learning, 7(3):1–191.
  • Föllmer (1982) Föllmer, H. (1982). A covariance estimate for Gibbs measures. Journal of Functional Analysis, 46(3):387–395.
  • Heinemann and Globerson (2014) Heinemann, U. and Globerson, A. (2014). Inferning with high girth graphical models. In ICML, pages 1260–1268.
  • Koller and Friedman (2009) Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT press.
  • Rebeschini and van Handel (2014) Rebeschini, P. and van Handel, R. (2014). Comparison theorems for Gibbs measures. Journal of Statistical Physics, 157(2):234–281.
  • Rebeschini et al. (2015) Rebeschini, P., Van Handel, R., et al. (2015). Can local particle filters beat the curse of dimensionality? The Annals of Applied Probability, 25(5):2809–2866.
  • Shi et al. (2015) Shi, T., Steinhardt, J., and Liang, P. (2015). Learning where to sample in structured prediction. In AISTATS.
  • Wainwright et al. (2008) Wainwright, M. J., Jordan, M. I., et al. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305.
  • Wick and McCallum (2011) Wick, M. L. and McCallum, A. (2011). Query-aware MCMC. In Advances in Neural Information Processing Systems, pages 2564–2572.

Appendix

Proof of Theorem 2

Proof 1

Let $d(i,j) = \lambda\, \mathrm{dist}(i,j)$, where $\mathrm{dist}(i,j)$ denotes the graph distance between nodes $i$ and $j$ and $\lambda > 0$ satisfies $c\, e^{\lambda} < 1$. Then

$$\max_{i \in I} \sum_{j \in I} e^{d(i,j)}\, C_{ij} \;\leq\; c\, e^{\lambda} \;<\; 1,$$

since $C_{ij} = 0$ whenever $\mathrm{dist}(i,j) > 1$. Applying Lemma 1 together with Corollary 1, and using that $b_j = 0$ for $j \in \alpha^{\circ}$ and $b_j \leq 1$ otherwise, we have

$$\big| p(x_v = 1) - q(x_v = 1) \big| \;\leq\; \sum_{j \in \partial\alpha \cup \beta} D_{vj} \;\leq\; \frac{e^{-\lambda\, \mathrm{dist}(v,\, \partial\alpha)}}{1 - c\, e^{\lambda}}.$$

Substituting the inequality on $\mathrm{dist}(v, \partial\alpha)$ from condition (5) into the right-hand side yields the result.

Proof of Corollary 2

Proof 2

Note that Dobrushin's interaction matrix $C$ of the localized model $p$ is block diagonal: since there are no edges between $\alpha$ and $\beta$ in $p$, the corresponding off-diagonal blocks equal zero. If the Dobrushin condition holds, $D = \sum_{n \geq 0} C^n$ is also block diagonal and can be calculated easily from the block of $C$ restricted to $\alpha$. To see this, we have