Adapting the Stochastic Block Model to Edge-Weighted Networks

05/24/2013 · Christopher Aicher et al. · University of Colorado Boulder

We generalize the stochastic block model to the important case in which edges are annotated with weights drawn from an exponential family distribution. This generalization introduces several technical difficulties for model estimation, which we solve using a Bayesian approach. We introduce a variational algorithm that efficiently approximates the model's posterior distribution for dense graphs. In specific numerical experiments on edge-weighted networks, this weighted stochastic block model outperforms the common approach of first applying a single threshold to all weights and then applying the classic stochastic block model, which can obscure latent block structure in networks. This model will enable the recovery of latent structure in a broader range of network data than was previously possible.


1 Introduction

In social and biological networks, vertices often play distinct functional roles in the large-scale structure of the graph. The automatic detection of these latent roles, by identifying the induced “community” or block structures from connectivity data alone, is a fundamental problem in network analysis and many approaches have been proposed Fortunato (2010); Porter et al. (2009). The stochastic block model (SBM) is a popular generative model that solves this problem in an unsupervised fashion Holland et al. (1983); Wang & Wong (1987).

In its classic form, the SBM is a probabilistic model of pairwise interactions among vertices. Each vertex belongs to one of $K$ latent groups, and each undirected edge exists or does not with a probability that depends only on the block memberships of the connecting vertices. The model is thus defined by a vector $z$ containing the block assignment of each vertex and a matrix $\theta$, where $\theta_{z_i z_j}$ gives the probability that a vertex of block $z_i$ connects to a vertex of block $z_j$.

This model can capture a wide variety of large-scale organizational patterns of network connectivity, depending on the choices of $z$ and $\theta$. If $\theta$'s diagonal elements are greater than its off-diagonal elements, the block structure is assortative, with communities exhibiting greater edge densities within than between them, as is often found in social networks. Other choices of $\theta$ can generate hierarchical, multi-partite, or core-periphery patterns, among others. This flexibility, and the principled probabilistic statements it produces, has made the SBM a popular tool for unsupervised network analysis, in which we seek to infer the latent block labels from the observed graph structure alone.
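As a concrete illustration of this generative process (our own sketch, not code from the paper; the function and parameter names are ours), the following samples an undirected graph from a classic SBM given $z$ and $\theta$:

```python
import numpy as np

def sample_sbm(z, theta, rng=None):
    """Sample an undirected, unweighted graph from a classic SBM.

    z     : length-n array of block labels in {0, ..., K-1}
    theta : K x K matrix; theta[a, b] is the probability of an edge
            between a vertex in block a and a vertex in block b.
    """
    rng = np.random.default_rng(rng)
    n = len(z)
    A = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):  # each undirected pair once
            if rng.random() < theta[z[i], z[j]]:
                A[i, j] = A[j, i] = 1
    return A
```

With an assortative $\theta$ (diagonal entries larger than off-diagonal ones), sampled graphs exhibit denser within-block than between-block connectivity, as described above.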

There is broad interest in machine learning, physics, and computational social science in developing and applying generalizations of the classic SBM. Generalizations have been made to allow degree heterogeneity within blocks Karrer & Newman (2011), probabilistic or mixed block membership Airoldi et al. (2008); Ball et al. (2011), an infinite number of blocks Kemp et al. (2006), and hierarchical (nested) relationships among blocks Clauset et al. (2008).

Several efficient techniques exist for estimating latent block structures from data. Of particular relevance to our weighted generalization of the SBM are the variational algorithms, both Bayesian and frequentist. Scalability is typically achieved by constraining the parameter space or by using modern optimization techniques. Examples include variational expectation-maximization (EM) for the classic SBM Daudin et al. (2008); Park et al. (2010), variational Bayes EM for a restricted, two-parameter matrix $\theta$ Hofman & Wiggins (2008), nested variational EM for the classic mixed-membership SBM Airoldi et al. (2008), and stochastic variational inference for the assortative mixed-membership SBM Gopalan et al. (2012).

In most of these efforts, the SBM is restricted to binary or Bernoulli networks, in which edges are unweighted. The one exception has been block models with Poisson-distributed edge weights Mariadassou et al. (2010); Karrer & Newman (2011); Ball et al. (2011), which can be fitted to multigraphs. In practice, however, most binary networks are produced by applying a threshold to a weighted relationship Thomas & Blitzstein (2011), a practice that clearly destroys potentially valuable information. To apply the SBM to weighted data without thresholding, we introduce a generalization of the SBM to the important case in which edges are annotated with weights drawn from an exponential family distribution.

This weighted stochastic block model (WSBM) includes as special cases most standard distributional forms, and thus allows us to use weighted relations directly in recovering latent block structure, preventing the information loss caused by thresholding. Handling these general weight distributions presents several technical difficulties for model estimation, which we solve using a Bayesian approach. We first give the WSBM’s form and derive a variational Bayes algorithm for fitting to dense graphs. We then present synthetic examples that illustrate the type of behavior the WSBM captures that is overlooked by thresholding. We close with a brief discussion of extensions of the model.

2 Weighted Stochastic Block Models

The weighted stochastic block model is a generative model for weighted pairwise interactions among vertices, composed of an exponential family distribution $\mathcal{X}$ and a block structure $R$. The block structure defines a set of vertex labels, denoted $z$, where $z_i \in \{1, \dots, K\}$, and partitions the edges into disjoint bundles, one for each pair of blocks: edge $(i,j)$ belongs to bundle $r = R(z_i, z_j)$. Edge weights in bundle $r$ are modeled by a distribution in $\mathcal{X}$, parameterized by $\theta_r$. That is, each bundle has its own set of distribution parameters.

The choice of $\mathcal{X}$ determines the large-scale structure of the network, just as $z$ and $\theta$ do for the classic SBM. When $\mathcal{X}$ is a Bernoulli distribution, we recover the classic case. Although constraining $\mathcal{X}$, or the variation of its parameters across edge bundles, can be used to create specific types of large-scale structure, here we focus on the general case of blocks with independent parameters. In principle, the form of $R$ could be learned directly from data, but we do not explore this topic here.

We denote a WSBM with edge-distribution family $\mathcal{X}$ and block structure $R$ by $M(\mathcal{X}, R)$; its parameters are the vertex labels $z$ and the matrix of edge-bundle parameters $\theta$. The likelihood of observing a graph $A$, given distribution $\mathcal{X}$, is then

$$\Pr(A \mid z, \theta) = \prod_{i,j} \mathcal{X}\!\left(A_{ij} \mid \theta_{R(z_i, z_j)}\right).$$

Restricting $\mathcal{X}$ to exponential family distributions makes the mathematics tractable while covering a broad range of models of edge weights, including many common distributions produced by classic stochastic processes. A distribution belongs to an exponential family if it can be written as

$$\mathcal{X}(x \mid \theta) = h(x) \exp\!\left(T(x) \cdot \eta(\theta)\right), \qquad x \in \mathcal{S},$$

where $h$, $T$, $\eta$ are fixed mappings, $\theta$ is the distribution's parameter, and $\mathcal{S}$ is the distribution's support. Under these assumptions, the log-likelihood becomes

$$\log \Pr(A \mid z, \theta) = \sum_{i,j} \log h(A_{ij}) + \sum_r T_r \cdot \eta(\theta_r),$$

where $T_r = \sum_{(i,j)\,:\,R(z_i, z_j) = r} T(A_{ij})$ is the sufficient statistic for the weights in edge bundle $r$.
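To make this bookkeeping concrete, here is a minimal sketch (our own code and naming, not the authors') that accumulates the per-bundle sufficient statistics $T_r = (\sum x, \sum x^2, \text{count})$ appropriate for Normal edge weights:

```python
import numpy as np

def bundle_stats_normal(A, z, K):
    """Sufficient statistics T_r = (sum x, sum x^2, count) for each
    edge bundle r = (a, b), assuming Normally distributed edge weights.

    A : n x n weight matrix (dense graph); z : block labels in {0..K-1}.
    """
    T = np.zeros((K, K, 3))
    n = A.shape[0]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip self-pairs
            a, b = z[i], z[j]
            x = A[i, j]
            T[a, b] += (x, x * x, 1.0)
    return T
```

The update for any other exponential family member only swaps in that family's $T(x)$; the bundle-wise accumulation is identical.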

For some choices of $\mathcal{X}$, the likelihood function contains degeneracies that prevent the direct estimation of the parameters $z$ and $\theta$. For instance, when weights are real-valued and $\mathcal{X}$ is a Normal distribution, an edge bundle with all-equal weights has zero variance, which creates a degeneracy in the likelihood calculation. Another technical problem is that non-edges in a sparse graph (a zero in the adjacency matrix) may represent a pair of non-interacting vertices, an interaction with zero weight, or an interaction we have not yet observed. The classic SBM does not exhibit these problems because edge weights are Bernoulli random variables, whose sufficient statistics are always well defined. To regularize the degeneracy problem, we take a Bayesian approach and assign an appropriate prior distribution $\pi(z, \theta)$ to our parameters. The posterior distribution then exhibits no degeneracies, and estimation can proceed smoothly.

Estimating the posterior distribution $\pi(z, \theta \mid A)$ given the observed edge weights $A$ and the prior $\pi$ is generally difficult, and so we approximate it by a factorizable distribution $q(z, \theta)$. How we estimate $q$ also depends on whether the graph is dense or sparse, and on our interpretation of non-edges. Here, we present the solution for dense graphs. In a separate paper, we will present a belief propagation algorithm for sparse graphs that correctly handles non-edges.

3 Variational Bayes

For a dense graph, we construct a variational Bayes (VB) expectation-maximization algorithm to estimate $\pi(z, \theta \mid A)$. We approximate the posterior distribution by a product of marginals, $q(z, \theta) = \prod_i q_i(z_i) \prod_r q_r(\theta_r)$.

We then select $q$ by minimizing the Kullback-Leibler (KL) divergence between our approximation $q$ and the posterior $\pi(z, \theta \mid A)$. It can be shown that

$$\log \Pr(A) = \mathcal{G}(q) + D_{\mathrm{KL}}\!\left(q \,\|\, \pi(z, \theta \mid A)\right),$$

where $\mathcal{G}(q)$ is a functional lower bound on the constant $\log \Pr(A)$, calculated as

$$\mathcal{G}(q) = \mathbb{E}_q\!\left[\log \Pr(A \mid z, \theta)\right] - D_{\mathrm{KL}}\!\left(q \,\|\, \pi\right).$$

The first term is the expected log-likelihood under the approximation $q$, and the second term is the KL divergence of the approximation $q$ from the prior $\pi$. As $\log \Pr(A)$ is constant, minimizing the KL divergence is equivalent to maximizing $\mathcal{G}$.

To maximize $\mathcal{G}$, we maximize the expected log-likelihood of the data while weakly constraining the approximation $q$ to be close to the prior $\pi$. This regularizer prevents overfitting and eliminates the aforementioned likelihood degeneracies. In practice, the first term overwhelms the second given sufficient data.

Conjugate priors.

For mathematical convenience, we restrict the prior $\pi$ to a product of parameterized conjugate distributions. The conjugate prior for the parameter $\theta$ of an exponential family distribution has the form

$$\pi(\theta \mid \tau) = Z(\tau)^{-1} \exp\!\left(\tau \cdot \eta(\theta)\right),$$

where $\tau$ parameterizes the prior and $Z(\tau)$ is a normalizing constant. When we update the prior based on the observed weights in a given edge bundle $r$, the posterior's parameter becomes $\tau_r = \tau_0 + T_r$, and $\tau_0$ can be viewed as a set of pseudo-observations. This prevents the posterior from becoming degenerate, since every edge bundle, no matter how small or uniform, produces a well-defined parameter estimate.
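For intuition, here is the pseudo-observation update $\tau_r = \tau_0 + T_r$ in the most familiar conjugate pair, Beta–Bernoulli, which corresponds to the unweighted special case (a sketch with our own naming):

```python
def beta_bernoulli_update(tau0, edges_present, pairs_total):
    """Conjugate update tau_r = tau0 + T_r for a Bernoulli edge bundle
    with a Beta(alpha0, beta0) prior. tau0 acts as pseudo-observations,
    so even an empty or perfectly uniform bundle yields a proper posterior.
    """
    alpha0, beta0 = tau0
    return (alpha0 + edges_present, beta0 + pairs_total - edges_present)
```

For example, an empty bundle leaves the prior unchanged, while a bundle with 3 edges among 10 pairs turns a Beta(1, 1) prior into a Beta(4, 8) posterior.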

The conjugate prior for a vertex label $z_i$ is a categorical distribution with parameter $\mu_i$, where $\mu_i(z)$ is the probability that vertex $i$ belongs to group $z$. We fit each $\mu_i$ directly, with a flat prior $\mu_0(z) = 1/K$.

The form of our prior is thus

$$\pi(z, \theta) = \prod_r \pi(\theta_r \mid \tau_0) \prod_i \mu_0(z_i),$$

where $\tau_0$, $\mu_0$ are the parameters of the priors on $\theta$ and $z$. With conjugate priors, our approximation $q$ takes the same functional form,

$$q(z, \theta) = \prod_r \pi(\theta_r \mid \tau_r) \prod_i \mu_i(z_i),$$

and maximizing $\mathcal{G}$ is equivalent to maximizing over $q$'s parameters $\tau_r$, $\mu_i$.

Optimizing $\mathcal{G}$.

These choices of $\pi$ and $q$ yield

$$\mathcal{G} = \sum_{i,j} \log h(A_{ij}) + \sum_r \left(\langle T \rangle_r + \tau_0 - \tau_r\right) \cdot \langle \eta \rangle_r + \sum_r \log \frac{Z(\tau_r)}{Z(\tau_0)} + \sum_i \sum_{z_i} \mu_i(z_i) \log \frac{\mu_0(z_i)}{\mu_i(z_i)},$$

where $\langle T \rangle_r$, $\langle \eta \rangle_r$ are the expectations of $T$, $\eta$ under the approximation $q$; for exponential families they are

$$\langle T \rangle_r := \sum_{i,j} \sum_{(z_i, z_j)\,:\,R(z_i, z_j) = r} \mu_i(z_i)\, \mu_j(z_j)\, T(A_{ij}), \qquad \langle \eta \rangle_r := \frac{\partial \log Z(\tau_r)}{\partial \tau_r}.$$

To optimize $\mathcal{G}$, we take derivatives with respect to $q$'s parameters $\tau_r$, $\mu_i$ and set them to zero. We iteratively solve for the maximum by updating each $\tau_r$ and $\mu_i$ independently.

For $\tau_r$, this yields

$$\frac{\partial \mathcal{G}}{\partial \tau_r} = \left(\langle T \rangle_r + \tau_0 - \tau_r\right) \cdot \frac{\partial \langle \eta \rangle_r}{\partial \tau_r} - \langle \eta \rangle_r + \frac{\partial \log Z(\tau_r)}{\partial \tau_r} \;\propto\; \langle T \rangle_r + \tau_0 - \tau_r,$$

and the update equation for each edge-bundle parameter is $\tau_r = \tau_0 + \langle T \rangle_r$.

For $\mu_i$, we use Lagrange multipliers to enforce the normalization $\sum_z \mu_i(z) = 1$. Setting the derivative of $\mathcal{G}$ with respect to $\mu_i(z)$ equal to the multiplier and solving for $\mu_i$ produces the update equation

$$\mu_i(z) \propto \mu_0(z) \exp\!\left( \sum_j \sum_{z'} \mu_j(z')\, T(A_{ij}) \cdot \langle \eta \rangle_{R(z, z')} \right),$$

where each $\mu_i$ is normalized to a probability distribution. Because the update for $\mu_i$ depends on the other $\mu_j$, we iteratively update each $\mu_i$ from some initial guess until convergence to within some tolerance.

  Input: data $A$, model $M(\mathcal{X}, R)$
  Initialize $\mu$
  repeat
     for all edge bundles $r$ do
        Set $\langle T \rangle_r = \sum_{i,j} \sum_{(z_i, z_j) : R(z_i, z_j) = r} \mu_i(z_i)\, \mu_j(z_j)\, T(A_{ij})$
        Set $\tau_r = \tau_0 + \langle T \rangle_r$
        Set $\langle \eta \rangle_r = \partial \log Z(\tau_r) / \partial \tau_r$
     end for
     repeat
        for all vertices $i$ do
           $\mu_i(z) \propto \mu_0(z) \exp\!\left( \sum_j \sum_{z'} \mu_j(z')\, T(A_{ij}) \cdot \langle \eta \rangle_{R(z, z')} \right)$
           Normalize $\mu_i$
        end for
     until $\mu$ converges
  until $\tau$, $\mu$ converge
  return $\tau$, $\mu$
Algorithm 1 VB for dense networks

Algorithm 1 gives pseudocode for the full variational Bayes algorithm, which alternates between updating the edge-bundle parameters $\tau$ and the vertex-label parameters $\mu$ using the update equations derived above. Because every pairwise interaction contributes to the estimation of some parameter, the algorithm takes $O(K^2 n^2)$ time, assuming fast convergence of $\tau$ and $\mu$. Like all VB approaches, only convergence to a local optimum of $\mathcal{G}$ is guaranteed. In practice, multiple trials from a variety of initial conditions are used, and the best overall model is selected.
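To illustrate the alternating structure of Algorithm 1, here is a heavily simplified sketch for the Bernoulli (classic SBM) special case. This is our own illustrative code, not the authors' implementation; in particular, we plug in the log of the posterior-mean edge probability in place of the exact expectation $\langle \eta \rangle_r$ (which would use the digamma function).

```python
import numpy as np

def vb_sbm_bernoulli(A, K, tau0=(1.0, 1.0), iters=30, rng=None):
    """Sketch of Algorithm 1 for the Bernoulli (classic SBM) special case.

    Alternates a conjugate Beta update for every edge bundle with a
    label update for every vertex. Simplification: the log of the
    posterior-mean edge probability stands in for <log theta_r>.
    """
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    mu = rng.dirichlet(np.ones(K), size=n)         # mu[i, z] = q_i(z_i = z)
    off = (~np.eye(n, dtype=bool)).astype(float)   # exclude self-pairs
    Am = A * off
    for _ in range(iters):
        # Edge-bundle step: tau_r = tau0 + <T>_r for each bundle r = (a, b).
        E = mu.T @ Am @ mu                         # expected edge count
        N = mu.T @ off @ mu                        # expected pair count
        alpha, beta = tau0[0] + E, tau0[1] + (N - E)
        log_p = np.log(alpha / (alpha + beta))     # ~ <log theta_r>
        log_q = np.log(beta / (alpha + beta))      # ~ <log(1 - theta_r)>
        # Vertex-label step:
        # log mu_i(z) ~ sum_j sum_z' mu_j(z') [A_ij log_p + (1 - A_ij) log_q]
        S = Am @ mu @ log_p.T + (off - Am) @ mu @ log_q.T
        mu = np.exp(S - S.max(axis=1, keepdims=True))
        mu /= mu.sum(axis=1, keepdims=True)
    return mu, (alpha, beta)
```

On a graph with clear assortative blocks, the rows of `mu` typically concentrate on a consistent labeling (up to permutation of the block indices).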

4 Model Selection

An important intermediate step toward applying the WSBM to a graph is the selection of a class of distributions $\mathcal{X}$ or the number of blocks $K$. Any of a number of principled approaches could be employed, including maximum likelihood, possibly with cross-validation Airoldi et al. (2008), Bayes factors Hofman & Wiggins (2008), approximations thereof Mariadassou et al. (2010); Daudin et al. (2008), or minimum description length Peixoto (2013).

In our experiments below, we use Bayes factors, which assume a uniform prior over models and are equivalent to selecting the model $M$ with the largest model likelihood $\Pr(A \mid M)$, where we approximate $\log \Pr(A \mid M)$ with the lower bound $\mathcal{G}$.

Although Bayes factors assign a uniform prior over a set of nested models, they have a built-in penalty for complex models. Recall that $\mathcal{G}$ is penalized for large divergence of $q$ from the prior; since the vertex-label prior is uniform over all $K$ groups, there is a penalty whenever an increase in $K$ does not sufficiently reduce the entropy of $q$ or correspondingly increase the expected log-likelihood.

5 Experimental results

Figure 1: Results of fitting the WSBM (blue) and other methods to our dense synthetic data. (a) An example of a dense synthetic network. (b) VI versus the number of blocks $K$ fit to the data, with the edge-weight variance and $n$ fixed. (c) VI versus the variance of the edge weights, with $K$ and $n$ fixed. (d) VI versus the size of the network $n$, with the variance and $K$ fixed. Points in (b,c,d) are averaged over 30 generated datasets; the SBM with thresholding and k-means are averaged over 100 trials for each dataset over different thresholds.

We compare the WSBM against several alternative methods for recovering latent block structure. Our goal is to demonstrate that applying a single threshold to all edge weights before fitting the classic SBM may miss important structure, and that the WSBM can be used to explicitly evaluate the accuracy of inferring latent blocks via thresholding. We also include k-means clustering and hierarchical clustering to show that the weighted behavior the WSBM captures differs from what these methods recover.

To demonstrate how the WSBM can find structure that other methods may miss, we use synthetically generated dense graphs with $n$ vertices divided into four heterogeneous blocks; the weights of each edge bundle are Normally distributed with bundle-specific parameters (see Fig. 1(a)). This four-block model is a weighted variation of Newman's four-group test for unweighted graphs Newman & Girvan (2004). We then vary three model parameters—graph size $n$, variance of the edge-weight distributions, and number of blocks $K$ we fit to the data—and measure the accuracy of the inferred block structure. Varying the graph size probes consistency, varying the variance probes performance in high-noise settings, and varying the number of blocks probes robustness.
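The synthetic networks described above can be generated along these lines (our own sketch; the sizes, means, and noise level are illustrative parameters, not the paper's exact settings):

```python
import numpy as np

def sample_weighted_blocks(sizes, means, sigma, rng=None):
    """Dense synthetic network: every pair (i, j) receives a Normal weight
    whose mean depends only on the two blocks (in the spirit of the
    four-group test, with weights instead of edges).

    sizes : vertices per block; means : K x K matrix of bundle means;
    sigma : common standard deviation (the noise level varied in Fig. 1c).
    """
    rng = np.random.default_rng(rng)
    z = np.repeat(np.arange(len(sizes)), sizes)
    W = rng.normal(means[np.ix_(z, z)], sigma)  # n x n draws, bundle means
    W = np.triu(W, 1)
    W = W + W.T                                 # symmetrize, zero diagonal
    return W, z
```

Increasing `sigma` blurs the bundles together, which is exactly the high-noise regime probed in the variance experiment.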

We characterize the accuracy of the recovered block structures using the variation of information (VI) Meilă (2007), a standard metric for such tasks. The VI is a mathematically principled, information-theoretic metric for the distance between the inferred and the true assignment of vertex labels. Let $z^*$ denote the true block structure and $\hat{z}$ be our estimate. Then $\mathrm{VI}(z^*, \hat{z}) = H(z^* \mid \hat{z}) + H(\hat{z} \mid z^*)$, with $H(\cdot \mid \cdot)$ being the conditional entropy. When $\hat{z} = z^*$ and we recover the true structure exactly, $\mathrm{VI} = 0$. One nice property of VI is that it increases only modestly when $\hat{z}$ differs from $z^*$ mainly by splitting or merging blocks.

Under all test settings, the WSBM outperforms the alternatives (Fig. 1b–d). As edge-weight variance increases, all methods degrade, but the WSBM fails most gracefully. As the graph size $n$ increases, all methods perform better, with the WSBM performing best by far. And, when varying the number of blocks $K$ we infer, all methods perform better near the true number of blocks, but only the WSBM correctly recovers the latent structure there, and that value of $K$ is the one selected by model selection using Bayes factors. Additionally, the WSBM fails gracefully when $K$ exceeds the true number of blocks.

Thresholding with the SBM performs poorly in all tests because choosing a universal weight threshold destroys information about the latent block structure. Thresholding converts the original weights in each bundle into Bernoulli random variables, with parameter equal to the probability of exceeding the threshold. This effect is substantial whenever distinct blocks exhibit similar weight distributions. If two blocks' distributions are similar (Fig. 2), the SBM with thresholding typically finds only one block, because the probabilities of exceeding the threshold are too similar: thresholding confuses latent differences with Bernoulli sampling noise, and the SBM merges blocks that are in fact distinct. With well-separated weight distributions and an optimal threshold, the SBM may find the correct structure. However, selecting the 'optimal' threshold is a challenging problem in itself, and because a threshold impacts different edge bundles differently, a single 'optimal' threshold may not, in fact, exist.

Figure 2: Probability density functions (pdfs) of two pairs of Normal distributions. In both panels, the distributions are centered at the same two means but differ in variance, and hence in their probabilities of exceeding the threshold. (a) With small variance, the two exceedance probabilities are well separated. (b) With large variance, the two exceedance probabilities are nearly equal.

As a result, when $K$ exceeds the true number of blocks, the SBM with thresholding tends to under-fit the data, leading to very poor results. In contrast, the WSBM, having no threshold, utilizes the complete weight information and performs well even when given more flexibility than the underlying data require.

The performance of k-means and hierarchical clustering is particularly poor as edge-weight variance increases, when the signal-to-noise ratio is low. These methods over-fit the data less than the classic SBM when given extra blocks, but they still perform more poorly than the WSBM. The reason for this difference is our particular choice of example: the k-means algorithm relies on principal component analysis, which suffers in high-variance settings, while hierarchical clustering focuses only on intra-block behavior (the blocks on the diagonal) and misses inter-block behavior.

6 Discussion

The weighted stochastic block model we introduce here generalizes the classic stochastic block model to the important case of edges with weights drawn from an exponential family distribution. This generalization presented several technical challenges, which we solved using a Bayesian approach to develop a variational Bayes algorithm for dense graphs. The model accurately recovers latent block structure under a wide variety of conditions, and performs substantially better than simple alternatives. These results demonstrate that applying a threshold to edge weights before applying the unweighted SBM is generally unreliable.

The WSBM can be naturally generalized in several potentially useful ways. For sparse graphs, we have developed a scalable belief-propagation algorithm, to be presented in future work. The model could also be extended to mixed membership Airoldi et al. (2008) or, in the sparse case, to allow degree heterogeneity Karrer & Newman (2011). Stochastic variational inference has shown promising results for scaling the mixed-membership SBM, and this technique could also be adapted to the WSBM Gopalan et al. (2012). Finally, an interesting question is the extent to which utilizing weight information modifies the phase transition in the detectability of latent block structure, which is known to exist in the classic SBM Decelle et al. (2011).

Acknowledgements

We thank D. Larremore for helpful conversations. We acknowledge financial support from Grant #FA9550-12-1-0432 from the U.S. Air Force Office of Scientific Research (AFOSR) and the Defense Advanced Research Projects Agency (DARPA).

References

  • Airoldi et al. (2008) Airoldi, E.M., Blei, D.M., Fienberg, S.E., and Xing, E.P. Mixed membership stochastic blockmodels. J. Mach. Learn. Res., 9:1981–2014, 2008.
  • Ball et al. (2011) Ball, B., Karrer, B., and Newman, M.E.J. Efficient and principled method for detecting communities in networks. Phys. Rev. E, 84:036103, 2011.
  • Clauset et al. (2008) Clauset, A., Moore, C., and Newman, M. E. J. Hierarchical structure and the prediction of missing links in networks. Nature, 453:98–101, 2008.
  • Daudin et al. (2008) Daudin, J.-J., Picard, F., and Robin, S. A mixture model for random graphs. Statistics and Computing, 18:173–183, 2008.
  • Decelle et al. (2011) Decelle, A., Krzakala, F., Moore, C., and Zdeborová, L. Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett., 107(6):065701, 2011.
  • Fortunato (2010) Fortunato, S. Community detection in graphs. Physics Reports, 486:75–174, 2010.
  • Gopalan et al. (2012) Gopalan, P., Mimno, D., Gerrish, S., Freedman, M., and Blei, D. Scalable inference of overlapping communities. In Adv. in Neural Info. Proc. Sys. 25, pp. 2258–2266, 2012.
  • Hofman & Wiggins (2008) Hofman, J.M. and Wiggins, C.H. Bayesian approach to network modularity. Phys. Rev. Lett., 100(25):258701, 2008.
  • Holland et al. (1983) Holland, P.W., Laskey, K.B., and Leinhardt, S. Stochastic blockmodels: First steps. Social Networks, 5:109–137, 1983.
  • Karrer & Newman (2011) Karrer, B. and Newman, M.E.J. Stochastic blockmodels and community structure in networks. Phys. Rev. E, 83(1):016107, 2011.
  • Kemp et al. (2006) Kemp, C., Tenenbaum, J.B., Griffiths, T.L., Yamada, T., and Ueda, N. Learning systems of concepts with an infinite relational model. In Proc. Nat. Conf. on Artificial Intelligence, volume 21, pp. 381, 2006.
  • Mariadassou et al. (2010) Mariadassou, M., Robin, S., and Vacher, C. Uncovering latent structure in valued graphs: A variational approach. Ann. Appl. Stat., 4:715–742, 2010.
  • Meilă (2007) Meilă, M. Comparing clusterings: an information based distance. J. Multivariate Analysis, 98(5):873–895, May 2007.
  • Newman & Girvan (2004) Newman, M. E. J. and Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E, 69:026113, February 2004.
  • Park et al. (2010) Park, Y., Moore, C., and Bader, J.S. Dynamic networks from hierarchical bayesian graph clustering. PLoS ONE, 5(1):e8118, 2010.
  • Peixoto (2013) Peixoto, T.P. Parsimonious module inference in large networks. Phys. Rev. Lett., 110:148701, 2013.
  • Porter et al. (2009) Porter, M. A., Onnela, J., and Mucha, P.J. Communities in networks. Notices of the AMS, 56(9):1082–1097, 2009.
  • Thomas & Blitzstein (2011) Thomas, A.C. and Blitzstein, J.K. Valued ties tell fewer lies: Why not to dichotomize network edges with thresholds. arXiv:1101.0788, 2011.
  • Wang & Wong (1987) Wang, Y.J. and Wong, G.Y. Stochastic blockmodels for directed graphs. J. Am. Stat. Assoc., 82:8–19, 1987.