The proliferation of models for networks raises challenging problems of model selection: the data are sparse and globally dependent, and models are typically high-dimensional and have large numbers of latent variables. Together, these issues mean that the usual model-selection criteria do not work properly for networks. We illustrate these challenges, and show one way to resolve them, by considering the key network-analysis problem of dividing a graph into communities or blocks of nodes with homogeneous patterns of links to the rest of the network. The standard tool for doing this is the stochastic block model, under which the probability of a link between two nodes is a function solely of the blocks to which they belong. This imposes a homogeneous degree distribution within each block; this can be unrealistic, so degree-corrected block models add a parameter for each node, modulating its over-all degree. The choice between ordinary and degree-corrected block models matters because they make very different inferences about communities. We present the first principled and tractable approach to model selection between standard and degree-corrected block models, based on new large-graph asymptotics for the distribution of log-likelihood ratios under the stochastic block model, finding substantial departures from classical results for sparse graphs. We also develop linear-time approximations for log-likelihoods under both the stochastic block model and the degree-corrected model, using belief propagation. Applications to simulated and real networks show excellent agreement with our approximations. Our results thus both solve the practical problem of deciding on degree correction, and point to a general approach to model selection in network analysis.
In many networks, nodes divide naturally into modules or communities, where nodes in the same group connect to the rest of the network in similar ways. Discovering such communities is an important part of modeling networks (Porter et al., 2009), as community structure offers clues to the processes which generated the graph, on scales ranging from face-to-face social interaction (Zachary, 1977) through social-media communications (Adamic & Glance, 2005) to the organization of food webs (Allesina & Pascual, 2009; Moore et al., 2011).
The stochastic block model (Fienberg & Wasserman, 1981a; Holland et al., 1983; Snijders & Nowicki, 1997; Airoldi et al., 2008; Bickel & Chen, 2009) has, deservedly, become one of the most popular generative models for community detection. It splits nodes into communities or blocks, within which all nodes are stochastically equivalent (Wasserman & Anderson, 1987). That is, the probability of an edge between any two nodes depends only on which blocks they belong to, and all edges are independent given the nodes’ block memberships. Block models are highly flexible, representing assortative, disassortative and satellite community structures, as well as combinations thereof, in a single generative framework (Newman, 2002, 2003; Bickel & Chen, 2009). Their asymptotic properties, including phase transitions in the detectability of communities, can be determined exactly using tools from statistical physics (Decelle et al., 2011b, a) and random graph theory (Mossel et al., 2012).
Despite this flexibility, stochastic block models impose real restrictions on networks; notably, the degree distribution within each block is asymptotically Poisson for large graphs. This makes the stochastic block model implausible for many networks, where the degrees within each community are highly inhomogeneous. Fitting stochastic block models to such networks tends to split the high- and low-degree nodes in the same community into distinct blocks; for instance, dividing both liberal and conservative political blogs into high-degree “leaders” and low-degree “followers” (Adamic & Glance, 2005; Karrer & Newman, 2011). To avoid this pathology, and allow degree inhomogeneity within blocks, there is a long history of generative models where the probability of an edge depends on node attributes as well as their group memberships (e.g., Mørup & Hansen 2009; Reichardt et al. 2011). Here we use the variant due to Karrer & Newman (2011), called the degree-corrected block model. (From a different perspective, the famous model of Holland & Leinhardt (1981), and the Chung & Lu (2002) model, allow each node to have its own expected degree, but otherwise treat nodes as homogeneous (Rinaldo et al., 2011). The degree-corrected block model extends these models to allow for systematic variation in linking patterns.)
We often lack the domain knowledge to choose between the ordinary and the degree-corrected block model, and so face a model selection problem. The standard methods of model selection are largely based on likelihood ratios (possibly penalized), and we follow that approach here. Since both the ordinary and degree-corrected block models have many latent variables, calculating likelihood ratios is itself non-trivial; the likelihood must be summed over all partitions of nodes into blocks, so (in statistical physics terms) the log-likelihood is a free energy. We approximate the log-likelihood using belief propagation and the Bethe free energy, giving a highly scalable algorithm that can deal with large sparse networks in nearly linear time. However, even with the likelihoods in hand, it turns out that the usual theory for likelihood ratios is invalid in our setting, because of a combination of the sparsity of the data and the high-dimensional nature of the degree-corrected model. We derive the correct asymptotics, under regularity assumptions, recovering the classic results in the limit of large, dense graphs, but finding that substantial corrections are needed for sparse graphs, corrections that grow with graph size. Simulations confirm the validity of our theory, and we apply our method to both real and synthetic networks.
Let us set the problem on an observed graph with n nodes and m edges; we assume edges are undirected, though the directed case is only notationally more cumbersome. The graph is represented by its symmetric adjacency matrix A. We want to split the nodes into k communities, taking k to be given a priori. (We will address estimating k elsewhere.)
Traditionally, stochastic block models are applied to simple graphs, where each entry A_ij of the adjacency matrix follows a Bernoulli distribution. Following, e.g., Karrer & Newman (2011), we use a multigraph version of the block model, where the A_ij are independent and Poisson-distributed. (For simplicity, we ignore self-loops.) In the sparse network regime we are most interested in, this Poisson model differs only negligibly from the original Bernoulli model (Perry & Wolfe, 2012), but the former is easier to analyze.
In all stochastic block models, each node i has a latent variable t_i indicating which of the k blocks it belongs to. The block assignment is then t = {t_i}. The t_i are independent draws from a multinomial distribution parameterized by q, so q_r is the prior probability that a node is in block r. Thus Σ_r q_r = 1. After it assigns nodes to blocks, a block model generates the number of edges between nodes i and j by making an independent Poisson draw for each pair. In the ordinary stochastic block model, the means of these Poisson draws are specified by the k × k block affinity matrix ω, so

A_ij ~ Poisson(ω_{t_i t_j}).
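The generative process above is easy to simulate. The following is a minimal sketch (the function name `sample_sbm` and the specific parameter values are our own, not from the paper), assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sbm(n, q, omega, rng):
    """Sample a Poisson (multigraph) stochastic block model.

    q     : prior block probabilities, shape (k,)
    omega : block affinity matrix, shape (k, k); A_ij ~ Poisson(omega[t_i, t_j])
    Returns (A, t): symmetric adjacency matrix (self-loops ignored) and blocks.
    """
    t = rng.choice(len(q), size=n, p=q)
    A = np.zeros((n, n), dtype=int)
    iu = np.triu_indices(n, k=1)              # independent draw for each pair i < j
    A[iu] = rng.poisson(omega[t[iu[0]], t[iu[1]]])
    return A + A.T, t
```

In the sparse regime, the entries of ω scale as O(1/n), so that expected degrees stay bounded as n grows.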
The complete-data likelihood (involving t as well as A) is

P(A, t | q, ω) = ( ∏_r q_r^{n_r} ) ( ∏_{r,s} ω_{rs}^{m_{rs}/2} e^{−n_r n_s ω_{rs}/2} ) ∏_{i<j} 1/A_{ij}! .   (1)

Here n_r is the number of nodes in block r, and m_{rs} the number of edges connecting block r to block s, or twice that number if r = s. The last product is constant in the parameters, and equal to 1 for simple graphs, so we discard it below. The log-likelihood is then

log P(A, t | q, ω) = Σ_r n_r log q_r + (1/2) Σ_{r,s} ( m_{rs} log ω_{rs} − n_r n_s ω_{rs} ) .   (2)

Maximizing (2) over q and ω gives

q̂_r = n_r / n ,   ω̂_{rs} = m_{rs} / (n_r n_s) .   (3)
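These estimates are straightforward to compute given a block assignment. A minimal sketch (function names are ours; the penalty term neglects the small diagonal correction from excluding self-pairs, as in the sparse regime):

```python
import numpy as np

def sbm_mle(A, t, k):
    """Maximum-likelihood estimates for the Poisson SBM given blocks t:
    q_r = n_r / n and omega_rs = m_rs / (n_r n_s), where m_rs counts edges
    between blocks r and s (twice, within a block)."""
    n = len(t)
    n_r = np.bincount(t, minlength=k)
    one_hot = np.eye(k)[t]                # n x k block indicator matrix
    m = one_hot.T @ A @ one_hot           # m_rs; diagonal is twice the internal edges
    return n_r / n, m / np.outer(n_r, n_r)

def sbm_loglik(A, t, q, omega):
    """Complete-data log-likelihood, dropping the constant log A_ij! terms."""
    n_r = np.bincount(t, minlength=len(q))
    one_hot = np.eye(len(q))[t]
    m = one_hot.T @ A @ one_hot
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(m > 0, m * np.log(omega), 0.0)
    return n_r @ np.log(q) + 0.5 * (term.sum() - (np.outer(n_r, n_r) * omega).sum())
```

By construction, the estimates returned by `sbm_mle` maximize `sbm_loglik` over the parameters for the given assignment.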
Of course, the block assignments are not observed, but rather are what we want to infer. We could try to find them by maximizing (2) over q, ω and t jointly; in terms borrowed from statistical physics, this amounts to finding the ground state that minimizes the energy E(t) = −log P(A, t | q, ω). When this can be found, it recovers the correct t exactly if the graph is dense enough (Bickel & Chen, 2009). But if we wish to infer the parameters q and ω, or to perform model selection, we are interested in the total likelihood of generating the graph at hand. This is

P(A | q, ω) = Σ_t P(A, t | q, ω),

summing over all k^n possible block assignments. Again following the physics lexicon, this is the partition function of the Gibbs distribution of t, and its logarithm is (minus) the free energy.
As is usual with latent variable models, we can infer q and ω using an EM algorithm (Dempster et al., 1977), where the E step approximates the average over t with respect to the Gibbs distribution, and the M step estimates q and ω in order to maximize that average (Neal & Hinton, 1998). One approach to the E step would use a Markov chain Monte Carlo algorithm to sample t from the Gibbs distribution. However, as we review below, in order to determine q and ω it suffices to estimate the marginal distribution of each t_i, and the joint distribution of (t_i, t_j) for each pair of nodes (Beal & Ghahramani, 2006). As we show in §3, belief propagation efficiently approximates both the log-likelihood and these marginals, and for typical networks it converges very rapidly. Other methods of approximating the E step are certainly possible, and could be used with our model-selection analysis.
As discussed above, in the stochastic block model, all nodes in the same block have the same degree distribution. Moreover, their degrees are sums of independent Poisson variables, so this distribution is Poisson. As a consequence, the stochastic block model resists putting nodes with very different degrees in the same block. This leads to problems with networks where the degree distributions within blocks are highly skewed.
The degree-corrected model allows for heterogeneity of degree within blocks. Nodes are assigned to blocks as before, but each node i also gets an additional parameter θ_i, which scales the expected number of edges connecting it to other nodes. Thus

A_ij ~ Poisson(θ_i θ_j ω_{t_i t_j}).

Varying the θ_i gives any desired expected degree sequence. Setting θ_i = 1 for all i recovers the ordinary stochastic block model.
The likelihood stays the same if we increase θ_i by some factor c for all nodes in block r, provided we also decrease ω_{rs} for all s ≠ r by the same factor, and decrease ω_{rr} by c². Thus identification demands a constraint, and here we use the one that forces the θ_i to sum to the total number of nodes within each block,

Σ_i θ_i δ_{t_i r} = n_r   for all r.   (4)
The complete-data likelihood of the degree-corrected model is then

P(A, t | q, ω, θ) = ∏_i q_{t_i} ∏_{i<j} (θ_i θ_j ω_{t_i t_j})^{A_ij} e^{−θ_i θ_j ω_{t_i t_j}} / A_{ij}! ,   (5)

where, using the constraint (4), the exponential factors combine exactly as in (1); n_r and m_{rs} are as in (1), and d_i denotes the degree of node i. Again dropping constants, the log-likelihood is

log P(A, t | q, ω, θ) = Σ_i d_i log θ_i + Σ_r n_r log q_r + (1/2) Σ_{r,s} ( m_{rs} log ω_{rs} − n_r n_s ω_{rs} ) .   (6)

Maximizing (6) yields q̂ and ω̂ as in (3), along with

θ̂_i = d_i / d̄_{t_i} ,   (7)

where d̄_r = (1/n_r) Σ_i d_i δ_{t_i r} is the average degree of the nodes in block r.
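The degree-correction estimates can be computed in one pass over the degree sequence. A minimal sketch (the function name is ours), which also verifies the identifiability constraint:

```python
import numpy as np

def dc_mle_theta(A, t, k):
    """Degree-correction estimates theta_i = d_i / dbar_{t_i}, where dbar_r
    is the average degree in block r. By construction these sum to n_r
    within each block, satisfying the identifiability constraint."""
    d = A.sum(axis=1)
    n_r = np.bincount(t, minlength=k)
    dbar = np.bincount(t, weights=d, minlength=k) / n_r
    return d / dbar[t]
```

Note that a node's estimated θ_i depends only on its own degree and its block's mean degree, which is what makes the high-dimensional analysis below tractable.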
However, as with the ordinary stochastic block model, we will estimate q, ω and θ not just for a ground state t, but using belief propagation to find the marginal distributions of the t_i.
We referred above to the use of belief propagation for computing log-likelihoods and marginal distributions of block assignments; for our purposes, belief propagation is essentially a way of doing the expectation step of the expectation-maximization algorithm. Here we describe how belief propagation works for the degree-corrected block model, extending the treatment of the ordinary stochastic block model in Decelle et al. (2011b, a).
The key idea (Yedidia et al., 2003) is that each node i sends a “message” to every other node j, indicating the marginal distribution of t_i if j were absent. We write μ^{i→j}_r for the probability that i would be of type r in the absence of j. Then μ^{i→j} gets updated in light of the messages i gets from the other nodes as follows. Let

P(A_{ij} | r, s) = (θ_i θ_j ω_{rs})^{A_{ij}} e^{−θ_i θ_j ω_{rs}} / A_{ij}!

be the probability that A_{ij} takes its observed value if t_i = r and t_j = s. Then

μ^{i→j}_r = (q_r / Z^{i→j}) ∏_{k ≠ i,j} Σ_s μ^{k→i}_s P(A_{ik} | r, s) ,

where Z^{i→j} ensures that Σ_r μ^{i→j}_r = 1. Here, as usual in belief propagation, we treat the block assignments of the other nodes as independent, conditioned on t_i.
Each node sends messages to every other node, not just to its neighbors, since non-edges (where A_{ij} = 0) are also informative about t_i and t_j. Thus we have a Markov random field on a weighted complete graph, as opposed to just on the network itself. However, keeping track of O(n²) messages is cumbersome. For sparse networks, we can restore scalability by noticing that, up to O(1/n) terms, each node sends the same message to all of its non-neighbors. That is, for any j such that A_{ij} = 0, we have μ^{i→j}_r = μ^i_r, where μ^i_r is the belief-propagation estimate of the marginal distribution of t_i. This simplification reduces the number of messages to O(m), where m is the number of edges. We can then write

μ^{i→j}_r = (q_r / Z^{i→j}) e^{−θ_i h_r} ∏_{k : A_{ik} > 0, k ≠ j} Σ_s μ^{k→i}_s (θ_i θ_k ω_{rs})^{A_{ik}} ,

where h_r = Σ_k θ_k Σ_s μ^k_s ω_{rs} is an external field collecting the effect of the non-edges.
Since the field factor e^{−θ_i h_r} depends on i only through θ_i, and hence only through the degree d_i, we can compute it once for each distinct degree in the network, and then update the messages along each edge in O(k²) time. Thus, for fixed k, the total time needed to update all the messages is O(n + m + n_d), where n_d is the number of distinct degrees. For many families of networks the number of updates necessary to reach a fixed point is only a constant or O(log n), making the entire algorithm quite scalable (see Decelle et al. 2011a, b for details).
The belief-propagation estimate of the joint distribution of (t_i, t_j) is

μ^{ij}_{rs} = (1 / Z^{ij}) μ^{i→j}_r μ^{j→i}_s P(A_{ij} | r, s) ,

normalized so that Σ_{r,s} μ^{ij}_{rs} = 1. The maximization step of the expectation-maximization algorithm sets q̂, ω̂ and θ̂ as in (3) and (7), with the counts replaced by their expectations under the belief-propagation estimates: n_r = Σ_i μ^i_r, m_{rs} = Σ_{i,j} A_{ij} μ^{ij}_{rs}, and θ̂_i = d_i / d̄_{t_i}, where d̄_r = ( Σ_i d_i μ^i_r ) / ( Σ_i μ^i_r ) is the average degree of block r with respect to the belief-propagation estimates.
Finally, belief propagation also lets us approximate the total log-likelihood, summed over t but holding the observed graph fixed. The Bethe free energy is the following approximation to the log-likelihood (Yedidia et al., 2005):

log P(A | q, ω, θ) ≈ Σ_i log Z^i − Σ_{(i,j)} log Z^{ij} ,

where Z^i and Z^{ij} are the normalization constants of the one-node and two-node belief-propagation estimates, the sum over pairs runs over the edges of the network, and the contribution of the non-edges is absorbed into the external field.
An alternative to belief propagation would be the use of Markov chain Monte Carlo maximum likelihood, which is often advocated for network modeling (Hunter & Handcock, 2006). However, the computational complexity of Monte Carlo maximum likelihood is typically much worse than that of belief propagation; it does not seem to be practical for graphs beyond a few hundred nodes. We reiterate that while we use belief propagation in our numerical work, our results on model selection in the next section are quite indifferent as to how the likelihood is computed.
When the degree distribution is relatively homogeneous within each block (e.g., Fienberg & Wasserman 1981a; Holland et al. 1983), the ordinary stochastic block model is better than the degree-corrected model, since the extra parameters simply lead to over-fitting. On the other hand, when degree distributions within blocks are highly heterogeneous, the degree-corrected model is better. However, without prior knowledge about the communities, and thus the block degree distributions, we need to use the data to pick a model, i.e., to do model selection.
It is natural to approach the problem as one of hypothesis testing. (We discuss other approaches to model selection in the conclusion.) Since the ordinary stochastic block model is nested within the degree-corrected model, any given graph is at least as likely under the latter as under the former. Moreover, if the ordinary block model really is superior, the degree-corrected model should converge to it, at least in the limit of large networks. Our null model is then the stochastic block model, and the larger, nesting alternative is the degree-corrected model. The appropriate test statistic is the log-likelihood ratio,

Λ = log [ P(A | q̂, ω̂, θ̂) / P(A | q̂, ω̂) ] .   (12)

As usual, we reject the null model in favor of the more elaborate alternative when Λ exceeds some threshold. This threshold, in turn, is fixed by our desired error rate, and by the distribution of Λ when A is generated from the null model. When n is small, the null-model distribution of Λ can be found through parametric bootstrapping (Davison & Hinkley, 1997, §4.2.3): fitting the stochastic block model, generating new graphs from it, and evaluating Λ on each. When n is large, however, it is helpful to replace bootstrapping with analytic calculations.
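A full bootstrap refits both models to every resampled graph. The following is a deliberately simplified sketch of the procedure (the function name is ours): it exploits the ground-state form of Λ derived in the next section, under which the null distribution of Λ depends only on the block sizes n_r and expected degrees λ_r of the fitted stochastic block model, with degrees approximately independent Poisson:

```python
import numpy as np

def bootstrap_lambda(n_r, lam_r, B=500, rng=None):
    """Parametric-bootstrap sketch of the null distribution of Lambda.
    Under the null and the concentration assumption of the next section,
    Lambda = sum_i d_i log(d_i / dbar_r), with block-r degrees approximately
    independent Poisson(lam_r); terms with d_i = 0 contribute 0."""
    rng = rng or np.random.default_rng(0)
    out = np.empty(B)
    for b in range(B):
        total = 0.0
        for nr, lr in zip(n_r, lam_r):
            d = rng.poisson(lr, size=nr).astype(float)
            dbar = d.mean()
            total += np.where(d > 0, d * np.log(np.maximum(d, 1.0) / dbar), 0.0).sum()
        out[b] = total
    return out
```

The resulting sample of Λ values gives the rejection threshold for any desired error rate; by Jensen's inequality every draw is non-negative.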
Classically (Schervish, 1995, Theorem 7.125, p. 459), the large-n null distribution of 2Λ approaches χ²_ℓ, where ℓ is the number of constraints that must be imposed on the alternative to recover the null. In this case we have ℓ = n − k, as we must set all n of the θ_i to 1, while our identifiability convention (4) already imposed k constraints. This χ² distribution rests on the assumption that the log-likelihood of both models is well-approximated by a quadratic function in the vicinity of its maximum, so that the parameter estimates have Gaussian distributions around the true model (Geyer, 2005). The most common grounds for this assumption are central limit theorems for the data, together with a smooth functional dependence of each parameter estimate on a growing number of samples, i.e., being in a “large data limit”. This assumption fails in the present case. The degree-corrected model has n node-specific parameters. Dense graphs have an effective sample size of O(n²), so even with a growing parameter space the degree-corrected model can pass to the large data limit. But in sparse networks, the effective sample size is only O(n), and so we never get the usual χ² asymptotics no matter how large n grows.
Nevertheless, with some work we are able to compute the mean and variance of Λ's null distribution. While we recover the classical χ² distribution in the limit of dense graphs, there are important corrections when the average degree of the graph is small, even as n → ∞. As we shall see, this has drastic consequences for the appropriate threshold in likelihood ratio tests.
To characterize the null distribution of Λ, we assume that the posterior distributions of t under the two models concentrate on the same block assignment. This is a major assumption, but it is borne out by our simulations (Fig. 1 and Fig. 3), and by the fact that under some conditions (Bickel & Chen, 2009) the stochastic block model recovers the underlying block assignment exactly. Under this assumption, while the free energy differs from the ground state energy by an entropy term, the free energy difference between the two models has the same distribution as the ground state energy difference. The maximum-likelihood estimates for the two models are then (3) and (7) respectively. Substituting these into (12), most of the terms cancel, giving Λ the form of a Kullback-Leibler divergence,

Λ = Σ_i d_i log ( d_i / d̄_{t_i} ) ,

where we applied (7). Here d̄_r is the empirical mean degree of block r, not the expected degree of the stochastic block model.
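This ground-state expression reduces Λ to a function of the degree sequence and block labels alone. A minimal sketch (the function name is ours; terms with d_i = 0 contribute 0, since the corresponding θ̂_i log θ̂_i terms vanish):

```python
import numpy as np

def loglik_ratio(d, t, k):
    """Ground-state log-likelihood ratio
    Lambda = sum_i d_i log(d_i / dbar_{t_i}),
    with dbar_r the empirical mean degree of block r."""
    d = np.asarray(d, dtype=float)
    t = np.asarray(t)
    n_r = np.bincount(t, minlength=k)
    dbar = np.bincount(t, weights=d, minlength=k) / n_r
    return np.where(d > 0, d * np.log(np.maximum(d, 1.0) / dbar[t]), 0.0).sum()
```

Because x log x is convex, Jensen's inequality shows Λ ≥ 0, consistent with the nesting of the two models.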
Given this expression for Λ, the distribution of Λ follows from the distributions of the nodes' degrees; under the null model, all the d_i in block r are independent and approximately Poisson with mean λ_r, the expected degree in block r. (This approximation is sound in the limit n → ∞, since the correlations between node degrees are O(1/n).) Using this, we can compute the expectation and variance of Λ analytically (see Appendix), showing that Λ departs from χ² asymptotics, as well as revealing the limits where those results apply. Specifically,

E(Λ) = Σ_r n_r μ(λ_r) ,   (14)

where, if d is Poisson with mean λ,

μ(λ) = E[d log d] − λ log λ .   (15)

For dense graphs, where λ → ∞, both μ and the variance function v defined below approach 1/2, and (14) gives E(Λ) = n/2, just as in the standard χ²_{n−k} analysis. However, when λ is small, μ differs noticeably from 1/2.
The variance of Λ is somewhat more complicated. The limiting variance per node is

lim_{n→∞} (1/n) Var(Λ) = Σ_r (n_r/n) v(λ_r) ,   (16)

where, again taking d to be Poisson with mean λ,

v(λ) = Var[d log d] − 2 (log λ + 1) Cov(d, d log d) + (log λ + 1)² λ .   (17)

Since the variance of χ²_ℓ is 2ℓ, χ² asymptotics would predict v = 1/2. Indeed v approaches 1/2 as λ → ∞, but like μ it differs substantially from 1/2 for small λ. Figure 3 plots μ and v as functions of λ.
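Both functions are easy to evaluate by direct summation over the Poisson distribution. A numerical sketch (function name ours; we assume μ(λ) from (15) and the covariance form of v(λ) from (17), computing the pmf in log space to avoid overflow):

```python
import numpy as np

def mu_v(lam, dmax=2000):
    """Numerically evaluate mu(lam) and v(lam) for d ~ Poisson(lam).
    Both approach the chi^2 value of 1/2 as lam grows, but exceed it
    for small lam."""
    d = np.arange(dmax)
    logfact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, dmax)))))
    p = np.exp(d * np.log(lam) - lam - logfact)        # Poisson pmf
    f = np.zeros(dmax)
    f[1:] = d[1:] * np.log(d[1:])                      # d log d, 0 at d = 0
    Ef = p @ f
    mu = Ef - lam * np.log(lam)
    nu = p @ (f - Ef) ** 2                             # Var[d log d]
    kappa = p @ ((d - lam) * (f - Ef))                 # Cov(d, d log d)
    v = nu - 2 * (np.log(lam) + 1) * kappa + (np.log(lam) + 1) ** 2 * lam
    return mu, v
```

For small λ both quantities sit well above 1/2, which is exactly the departure from χ² asymptotics described in the text.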
Figure 3 shows that, for networks simulated from the stochastic block model, the mean and variance of Λ are very well fit by our formulas. We have not attempted to compute higher moments of Λ. However, if we assume that the per-node terms in Λ are independent, then the simplest form of the central limit theorem applies, and Λ will approach a Gaussian distribution as n → ∞. Quantile plots from the same simulations (Fig. 3(c)) show that a Gaussian with mean and variance from (14) and (16) is indeed a good fit. Moreover, the free energy difference and the ground state energy difference have similar distributions, as implied by our assumption that both Gibbs distributions are concentrated around the ground state. Interestingly, in Fig. 3(c), the degree is low enough that this concentration must be imperfect, but our theory still holds remarkably well. For ease of illustration, we assume in these simulations that q_r and λ_r are the same for all blocks r.
Figure 2: the mean and variance of Λ under the null model, comparing 95% bootstrap confidence intervals to the asymptotic formulas (respectively μ from (15) and v from (17)). Figure 2(c) compares the bootstrap distribution of Λ to a Gaussian with the theoretical mean and variance. (Observe that the free energy difference and the ground state energy difference have similar distributions.)
Fundamentally, Λ does not follow the usual χ² distribution because the parameters θ are in a high-dimensional regime. For each θ_i, we really have only one relevant observation, the node degree d_i. If λ is large, then the Poisson distribution of d_i is well-approximated by a Gaussian, as is the sampling distribution of θ_i's maximum likelihood estimate, so that the usual analysis applies. In a sparse graph, however, all the Poisson distributions have small expected values and are highly non-Gaussian, as are the maximum likelihood estimates (Zhu et al., 2012). Said differently, the degree-corrected model has n more parameters than the null model. In the dense-graph case, there are O(n²) observations, at least O(n) of which are informative about each of these extra parameters. For sparse graphs, however, there are really only O(n) observations, and only O(1) of them are informative about each θ_i, so the ordinary large-sample asymptotics cannot apply to them. As we have seen, the expected increase in likelihood from adding the θ parameters is larger than χ² theory predicts, as are the fluctuations in this increase in likelihood.
This reasoning elaborates on a point made long ago by Fienberg & Wasserman (1981b) regarding hypothesis testing in the model of Holland & Leinhardt (1981), where each node has two node-specific parameters (for in- and out-degree); our calculations of μ and v above, and especially of how and why they differ from 1/2, go some way towards meeting Fienberg and Wasserman's call for appropriate asymptotics for large-but-sparse graphs.
Ignoring these phenomena and using a χ² test inflates the type I error rate α, eventually rejecting the stochastic block model for almost all graphs which it generates. Indeed, since the distribution of Λ is tightly peaked around its mean, this inflation of α gets worse as n gets bigger: the true mean of Λ exceeds the χ² prediction by an amount growing linearly in n, while the fluctuations grow only as √n, so a test with small nominal α has true α approaching 1 once n is large enough, and the sparser the graph, the smaller the n at which this happens (Fig. 4). In essence, the χ² test underestimates the amount of degree inhomogeneity we would get simply from noise, incorrectly concluding that the inhomogeneity must come from underlying properties of the nodes.
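This inflation is easy to quantify under the Gaussian approximation. The following sketch (function name and parameter values ours) computes the actual rejection probability of a nominal-5% χ² test when Λ is really Gaussian with mean nμ and variance nv, approximating χ² with roughly n degrees of freedom as Normal(n, 2n):

```python
from math import erfc, sqrt

def true_alpha(n, mu, v, z=1.6449):
    """Actual type I error of the nominal-5% chi-squared test when
    Lambda ~ Normal(n*mu, n*v). The chi^2 test rejects when
    2*Lambda > n + z*sqrt(2n), with z the upper 5% normal quantile."""
    thresh = 0.5 * (n + z * sqrt(2 * n))
    # Gaussian upper tail probability Q(x) = erfc(x / sqrt(2)) / 2
    return 0.5 * erfc((thresh - n * mu) / sqrt(2 * n * v))
```

When μ = v = 1/2 the nominal level is recovered, but for any μ > 1/2 the true α climbs towards 1 as n grows, mirroring Fig. 4.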
We have derived the theoretical null distribution of Λ, and backed up our calculations with simulations. We now apply our theory to two examples, considering networks studied in Karrer & Newman (2011).
The first is a social network consisting of members of a karate club, where undirected edges represent friendships (Zachary, 1977). The network is made up of two assortative blocks, each with one high-degree hub (respectively the instructor and the club president) and many low-degree peripheral nodes. Karrer & Newman (2011) compared the performance of the ordinary and the degree-corrected block models on this network, and heavily favored degree correction, because the latter leads to a division into communities agreeing with ethnographic observations.
While a classic data set for network modeling, the karate club has both low degree and very small n. If we nonetheless use parametric bootstrapping to find the null distribution of Λ, we see that it fits a Gaussian with our predicted mean and variance reasonably well (Fig. 5(a)). The observed Λ has an unremarkable p-value according to both the bootstrap and our Gaussian asymptotics. Thus a prudent analyst would think twice before embracing the n additional degree-correction parameters. Indeed, using active learning, Moore et al. (2011) found that the stochastic block model labels most of the nodes correctly if the instructor and president are forced into different blocks. This implies that the degree inhomogeneity is mild, and that only a handful of nodes are responsible for the better performance of the degree-corrected model.
Note that if we apply standard χ² testing to the karate club, we obtain a substantially lower p-value. As in Fig. 4, χ² testing underestimates the extent to which an inhomogeneous degree distribution can result simply from noise, causing it to reject the null model more confidently than it should.
Hypothesis testing with real networks; both panels show the complementary cumulative distribution function of the log-likelihood ratio Λ for testing for degree correction. (a): Zachary's karate club (Zachary, 1977). The distribution found by parametric bootstrapping (shaded) fits reasonably well to a Gaussian (curve) with our theoretical mean and variance. The observed Λ (marked with the red line) has similar, unremarkable p-values according to the bootstrap and theoretical distributions, whereas the χ² test (dashed) assigns it a much smaller p-value. (b): A network of political blogs (Adamic & Glance, 2005). The bootstrap distribution (shaded) is very well fit by our theoretical Gaussian (curve) as well as by the χ² test (dashed). The actual log-likelihood ratio is so far in the tail that its p-value is effectively zero (see inset). Thus for the blog network, we can decisively reject the ordinary block model in favor of the degree-corrected model, while for the karate club, the evidence is less clear.
The second example is a network of political blogs in the US assembled by Adamic & Glance (2005). As in Karrer & Newman (2011), we focus on the giant component. The blogs have known political leanings, and were labeled as either liberal or conservative. The network is politically assortative, with highly right-skewed degree distributions within each block, so degree correction greatly assists in recovering political divisions, as observed by Karrer & Newman (2011). This time around, our hypothesis testing procedure completely agrees with their choice of model. As shown in Fig. 5(b), the bootstrap distribution of Λ is very well fit by a Gaussian with our theoretical prediction of the mean and variance. The observed log-likelihood ratio is many standard deviations above the mean; it is essentially impossible to produce such extreme results through mere fluctuations under the null model. Thus, for this network, introducing extra parameters to capture the degree heterogeneity is fully justified.
The blog network shows several advantages of our theoretical approach over just using bootstrapping. As with many other real networks, n is large enough that bootstrapping is quite slow, but for the same reason the Gaussian approximation for Λ is fairly tight.
Deciding between ordinary and degree-corrected stochastic block models for sparse graphs presents a difficult hypothesis testing problem. The distribution of the log-likelihood ratio Λ does not follow the classic χ² theory, because the nuisance parameters θ, only present in the alternative, are in a high-dimensional regime. We have nonetheless derived Λ's mean and variance in the limit of large, sparse graphs, where node degrees become independent and Poisson. Simulations confirm the accuracy of our theory for moderate n, and we applied it to two real networks.
Beyond hypothesis testing, two standard approaches to model selection are information criteria and cross-validation. While we have not directly dealt with the former, the derivations of such popular criteria as AIC or DIC use exactly the same asymptotics as the χ² test (Claeskens & Hjort, 2008, ch. 2); these tools will break down for the same reasons the χ² theory fails. As for cross-validation, standard practice in machine learning suggests using multi-fold cross-validation, but the global dependence of network data means there is (as yet) no good way to split a graph into training and testing sets. Predicting missing links or tagging false positives are popular forms of leave-one-out cross-validation in the network literature (Clauset et al., 2008; Guimera & Sales-Pardo, 2009), but leave-one-out does not converge on the true model even for independent and identically-distributed data (Claeskens & Hjort, 2008, §2.9). Thus, while our results apply directly only to the specific problem of testing the need for degree correction, they open the way to more general approaches to model selection and hypothesis testing in a wide range of network problems.
For simplicity we focus on a single block with expected degree λ. Independence between blocks then recovers the expressions (14) and (16), where the mean and variance of Λ are weighted sums over blocks. We have

Λ = Σ_i d_i log d_i − n d̄ log d̄ ,   (18)

where d̄ = (1/n) Σ_i d_i is the empirical mean degree. We wish to compute the mean and variance of Λ if the data is generated by the null model.
If d is Poisson with mean λ, let η(λ) denote the difference between the expectation of d log d and its most probable value λ log λ:

η(λ) = E[d log d] − λ log λ .   (19)
Assume that the d_i are independent and Poisson with mean λ; this is reasonable in a large sparse graph, since the correlations between degrees of different nodes are O(1/n). Then E[Σ_i d_i log d_i] = n(λ log λ + η(λ)), while n d̄ is Poisson with mean nλ, so that E[n d̄ log d̄] = nλ log λ + η(nλ), and (18) gives

E(Λ) = n η(λ) − η(nλ) .
To grasp what this implies, begin by observing that η(nλ) converges to 1/2 when nλ is large. Thus in the limit of large n, E(Λ) = n η(λ) − 1/2. When λ is also large, this gives E(Λ) ≈ n/2, just as χ² theory suggests. However, as Fig. 2 shows, η deviates noticeably from 1/2 for finite λ. We can obtain the leading corrections as a power series in 1/λ by approximating (19) with the Taylor series of d log d around d = λ, giving

η(λ) = 1/2 + 1/(12λ) + O(λ⁻²) .
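The series can be checked numerically. The following sketch (function name ours) evaluates η(λ) by direct summation over the Poisson distribution, with the pmf computed in log space to avoid overflow:

```python
import numpy as np

def eta(lam, dmax=4000):
    """eta(lam) = E[d log d] - lam log lam for d ~ Poisson(lam),
    by direct summation of the pmf. A numerical check on the series
    eta(lam) = 1/2 + 1/(12 lam) + O(1/lam^2)."""
    d = np.arange(dmax)
    logfact = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, dmax)))))
    p = np.exp(d * np.log(lam) - lam - logfact)   # Poisson pmf
    f = np.zeros(dmax)
    f[1:] = d[1:] * np.log(d[1:])                 # d log d, 0 at d = 0
    return p @ f - lam * np.log(lam)
```

The agreement with the two-term series is already close for moderate λ, while for small λ the excess over 1/2 is substantial.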
Computing the variance is harder. It will be convenient to define several functions. If d is Poisson with mean λ, let ν(λ) denote the variance of d log d:

ν(λ) = E[(d log d)²] − (E[d log d])² .

We will also use the covariance

κ(λ) = Cov(d, d log d) = E[d² log d] − λ E[d log d] .

Finally, expanding log d̄ around λ, the fluctuations of the subtracted term n d̄ log d̄ in (18) are, to leading order, (log λ + 1) Σ_i (d_i − λ).

Again assuming that the d_i are independent, we have the following terms and cross-terms for the variance of (18): the direct term n ν(λ) from the sum Σ_i d_i log d_i; the cross-term −2n (log λ + 1) κ(λ); and the contribution n (log λ + 1)² λ of the subtracted term itself. Putting this all together, we have

(1/n) Var(Λ) → v(λ) = ν(λ) − 2 (log λ + 1) κ(λ) + (log λ + 1)² λ .   (23)
Also, when λ and n are large, Taylor-expanding d log d around d = λ lets us simplify (23), giving

v(λ) = 1/2 + O(λ⁻¹) .

In particular, taking λ → ∞ gives v = 1/2, recovering the χ² prediction.
Fienberg, S. E. & Wasserman, S. (1981b). An exponential family of probability distributions for directed graphs: Comment. Journal of the American Statistical Association, 76, 54–57.
Yedidia, J. S., Freeman, W. T. & Weiss, Y. (2003). Understanding belief propagation and its generalizations. In Exploring Artificial Intelligence in the New Millennium [IJCAI 2001], G. Lakemeyer & B. Nebel, eds. San Francisco: Morgan Kaufmann.