# Stochastic blockmodels with growing number of classes

We present asymptotic and finite-sample results on the use of stochastic blockmodels for the analysis of network data. We show that the fraction of misclassified network nodes converges in probability to zero under maximum likelihood fitting when the number of classes is allowed to grow as the root of the network size and the average network degree grows at least poly-logarithmically in this size. We also establish finite-sample confidence bounds on maximum-likelihood blockmodel parameter estimates from data comprising independent Bernoulli random variates; these results hold uniformly over class assignment. We provide simulations verifying the conditions sufficient for our results, and conclude by fitting a logit parameterization of a stochastic blockmodel with covariates to a network data example comprising a collection of Facebook profiles, resulting in block estimates that reveal residual structure.

## Authors

• 1 publication
• 7 publications
• 18 publications
• ### Finite-sample risk bounds for maximum likelihood estimation with arbitrary penalties

The MDL two-part coding index of resolvability provides a finite-sampl...
12/29/2017 ∙ by W. D. Brinda, et al. ∙ 0

• ### Consistency of the maximum likelihood and variational estimators in a dynamic stochastic block model

We consider a dynamic version of the stochastic block model, in which th...
03/11/2019 ∙ by Léa Longepierre, et al. ∙ 0

• ### Maximum Likelihood Estimation of Stochastic Frontier Models with Endogeneity

We provide a closed-form maximum likelihood estimation of stochastic fro...
04/26/2020 ∙ by Samuele Centorrino, et al. ∙ 0

• ### Efficient closed-form estimation of large spatial autoregressions

Newton-step approximations to pseudo maximum likelihood estimates of spa...
08/27/2020 ∙ by Abhimanyu Gupta, et al. ∙ 0

• ### Consistency of Maximum Likelihood for Continuous-Space Network Models

Network analysis needs tools to infer distributions over graphs of arbit...
11/06/2017 ∙ by Cosma Rohilla Shalizi, et al. ∙ 0

• ### Analysis of Networks via the Sparse β-Model

Data in the form of networks are increasingly available in a variety of ...
08/08/2019 ∙ by Mingli Chen, et al. ∙ 0

• ### Finite-Sample Analysis of Image Registration

We study the problem of image registration in the finite-resolution regi...
01/12/2020 ∙ by Ravi Kiran Raman, et al. ∙ 0

## Code Repositories

### StochasticBlockmodel

Exploring inference in variants of a stochastic blockmodel for (directed) network data

### StochasticBlockmodel

Exploring inference in variants of a stochastic blockmodel for (directed) network data

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The global structure of social, biological, and information networks is sometimes envisioned as the aggregate of many local interactions whose effects propagate in ways that are not yet well understood. There is increasing opportunity to collect data on an appropriate scale for such systems, but their analysis remains challenging (Goldenberg et al., 2009). Here we analyze a statistical model for network data known as the (single-membership) stochastic blockmodel. Its salient feature is that it partitions the nodes of a network into distinct classes whose members all interact similarly with the network. Blockmodels were first associated with the deterministic concept of structural equivalence in social network analysis (Lorrain & White, 1971), where two nodes were considered interchangeable if their connections were equivalent in a formal sense. This concept was adapted to stochastic settings and gave rise to the stochastic blockmodel in work by Holland et al. (1983) and Fienberg et al. (1985). The model and extensions thereof have since been applied in a variety of disciplines (Wang & Wong, 1987; Nowicki & Snijders, 2001; Girvan & Newman, 2002; Airoldi et al., 2005; Doreian et al., 2005; Newman, 2006; Handcock et al., 2007; Hoff, 2008; Airoldi et al., 2008; Copic et al., 2009; Mariadassou et al., 2010; Karrer & Newman, 2011).

In this work we provide a finite-sample confidence bound that can be used when estimating network structure from data modeled by independent Bernoulli random variates, and also show that under maximum likelihood fitting of a correctly specified -class blockmodel, the fraction of misclassified network nodes converges in probability to zero even when the number of classes grows with . As noted by Rohe et al. (2011), this is advantageous if we expect class sizes to remain relatively constant even as increases. Related results for fixed have been shown by Snijders & Nowicki (1997) for networks with linearly increasing degree, and in a stronger sense for sparse graphs with poly-logarithmically increasing degree by Bickel & Chen (2009).

Our results can be related to those of Rohe et al. (2011), who use spectral methods to bound the number of misclassified nodes in the stochastic blockmodel with increasing , although with the more restrictive requirement of nearly linearly increasing degree. As noted by those authors, this assumption may not hold in many practical settings. Our manner of proof requires only poly-logarithmically increasing degree, and is more closely related to the fixed- proof of Bickel & Chen (2009)

, although we note that spectral clustering as suggested by

Rohe et al. (2011) provides a computationally appealing alternative to maximum likelihood fitting in practice.

As discussed by Bickel & Chen (2009), one may assume exchangeability in lieu of a generative

-class blockmodel: An analogue to de Finetti’s theorem for exchangeable sequences states that the probability distribution of an infinite exchangeable random graph is expressible as a mixture of distributions whose components can be approximated by blockmodels

(Kallenberg, 2005; Bickel & Chen, 2009). An observed network can then be viewed as a sample drawn from this infinite conceptual population, and so in this case the fitted blockmodel describes one mixture component thereof.

## 2 Statement of results

### 2.1 Problem formulation and definitions

We consider likelihood-based inference for independent Bernoulli data , both when no structure linking the success probabilities is assumed, as well as the special case when a stochastic blockmodel of known order is assumed to apply. To this end, let denote the symmetric adjacency matrix of a simple, undirected graph on nodes whose entries for are assumed independent random variates, and whose main diagonal is fixed to zero. The average degree of this graph is , where is its expected number of edges. Under a -class stochastic blockmodel, these edge probabilities are further restricted to satisfy

 Pij=θzizj(i=1,…,N;j=i+1,…,N) (1)

for some symmetric matrix

and membership vector

. Thus the probability of an edge between two nodes is assumed to depend only on the class of each node.

Let denote the log-likelihood of observing data matrix under a -class blockmodel with parameters , and its expectation:

 L(A;z,θ) =∑i

For fixed class assignment , let denote the number of nodes assigned to class , and let denote the maximum number of possible edges between classes and ; i.e., if and . Further, let and be symmetric matrices in , with

 ^θ(z)ab =1nab∑i

defined whenever . Observe that comprises sample proportion estimators as a function of , whereas is its expectation under the independent model. Taken over all class assignments , the sets comprise a sufficient statistic for the family of -class stochastic blockmodels, and for each , maximizes . Analogously, the sets are functions of the model parameters , and maximize . We write and when the choice of is understood, and and to abbreviate and respectively.

Finally, observe that when a blockmodel with parameters is in force, then in accordance with (1), and consequently is maximized by the true parameter values :

 ¯LP(¯z,¯θ)−¯LP(z,θ)=∑i

where

denotes the Kullback–Leibler divergence of a

distribution from a one.

### 2.2 Fitting a K-class stochastic blockmodel to independent Bernoulli trials

Fitting a -class stochastic blockmodel to independent trials yields estimates of averages of subsets of the parameter set , with each class assignment inducing a partition of that set. We begin with a basic lemma that expresses the difference in terms of and , and follows directly from their respective maximizing properties.

###### Lemma 1

Let comprise independent trials. Then the difference can be expressed for as

 L(A;z)−¯LP(z)=∑a≤bnabD(^θab∣∣¯θab)+X−E(X).

We first bound the former quantity in this expression, which provides a measure of the distance between and its estimand under the setting of Lemma 1. The bound is used in subsequent asymptotic results, and also yields a kind of confidence measure on in the finite-sample regime.

###### Theorem 1

Suppose that a -class stochastic blockmodel is fitted to data comprising independent trials, where, for any class assignment , estimate maximizes the blockmodel log-likelihood . Then with probability at least ,

 maxz{∑a≤bnabD(^θab∣∣¯θab)}

Theorem 1 is proved in the Appendix via the method of types: for fixed , the probability of any realization of is first bounded by . A counting argument then yields a deviation result in terms of , and finally a union bound is applied so that the result holds uniformly over all possible choices of assignment vector .

Our second result is asymptotic, and combines Theorem 1

with a Bernstein inequality for bounded random variables, applied to the latter terms

in Lemma 1. To ensure boundedness we assume minimal restrictions on each ; this Bernstein inequality, coupled with a union bound to ensure that the result holds uniformly over all , dictates growth restrictions on and .

###### Theorem 2

Assume the setting of Theorem 1, whereby a -class blockmodel is fitted to independent random variates , and further assume that for all and . Then if and for some ,

 maxz|L(A;z)−¯LP(z)|=oP(M).

Thus whenever each is bounded away from 0 and 1 in the manner above, the maximized log-likelihood function is asymptotically well behaved in network size as long as the network’s average degree grows faster than and the number of classes fitted to it grows no faster than .

### 2.3 Fitting a correctly specified K-class stochastic blockmodel

The above results apply to the general case of independent Bernoulli data , with no additional structure assumed amongst the set of success probabilities ; if we further assume the data to be generated by a -class stochastic blockmodel whose parameters are subject to suitable identifiability conditions, it is possible to characterize the behavior of the class assignment estimator under maximum likelihood fitting of a correctly specified -class blockmodel.

###### Theorem 3

If the conclusion of Theorem 2 holds, and data are generated according to a -class blockmodel with membership vector , then

 ¯LP(¯z)−¯LP(^z)=oP(M), (3)

with respect to the maximum-likelihood -class blockmodel class assignment estimator .

Let be the number of incorrect class assignments under , counted for every node whose true class under is not in the majority within its estimated class under . If furthermore the following identifiability conditions hold with respect to the model sequence:

(i) for all blockmodel classes , class size grows as ;

(ii) the following holds over all distinct class pairs and all classes :

 min(a,b)maxc{D(¯θac∣∣¯θac+¯θbc2)+D(¯θbc∣∣¯θac+¯θbc2)}=Ω(MKN2),

then it follows from (3) that

Thus the conclusion of Theorem 3 is that under suitable conditions the fraction of misclassified nodes goes to zero in , yielding a convergence result for stochastic blockmodels with growing number of classes. Condition (i) stipulates that all class sizes grow a rate that is eventually bounded below by a single constant times , while condition (ii) ensures that any two rows of differ in at least one entry by an amount that is eventually bounded by a single constant times . Observe that if eventually and so that conditions on and sufficient for Theorem 2 are met, then since , it follows that goes to zero in .

## 3 Numerical results

We now present results of a small simulation study undertaken to investigate the assumptions and conditions of Theorems 13 above, in which -class blockmodels were fitted to various networks generated at random from models corresponding to each of the three theorems. Because exact maximization in of the blockmodel log-likelihood is computationally intractable even for moderate , we instead employed Gibbs sampling to explore the function and recorded the best value of visited by the sampler. As the results of Theorems 1 and 2 hold uniformly in , however, we expect and to be close to their empirical estimates whenever is sufficiently large, regardless of the approach employed to select . This fact also suggests that a single-class (Erdös-Rényi) blockmodel may come closest to achieving equality in Theorems 1 and 2, as many class assignments are equally likely a priori to have high likelihood. By similar reasoning, a weakly identifiable model should come closest to achieving the error bound in Theorem 3, such as one with nearly identical within- and between-class edge probabilities. We describe each of these cases empirically in the remainder of this section.

First, the tightness of the confidence bound of (2) from Theorem 1 was investigated by fitting -class blockmodels to Erdös-Rényi networks comprising independent trials, with nodes and 0075 chosen to match the data analysis example in the sequel, and . For each , the error terms and were recorded for each of 100 trials and compared to the respective 95% confidence bounds (005) derived from Theorem 1

. The bounds overestimated the respective errors by a factor of 3 to 7 on average, with small standard deviation. In this worst-case scenario the bound is loose, but not unusable; the errors never exceeded the 95% confidence bounds in any of the trials.

To test whether the assumptions of Theorem 2 are necessary as well as sufficient to obtain convergence of to , blockmodels were next fitted to Erdös-Rényi networks of increasing size, for in the range 50–1050. The corresponding normalized log-likelihood error for different rates of growth in the expected number of edges and the number of fitted classes is shown in Fig. 1. Observe from the leftmost panel that when and , as prescribed by the theorem, this error decreases in . If the edge density is reduced to , we observe in the center panel convergence when and divergence when . This suggests that the error as a function of follows Theorem 2 closely, but that the network can be somewhat more sparse than it requires.

To test the conditions of Theorem 3, blockmodels with parameters and increasing class size were used to generate data, and corresponding node misclassification error rates were recorded as a function of correctly specified -class blockmodel fitting. Model parameter was chosen to yield equally-sized blocks, so as to meet identifiability condition (i) of Theorem 3. Parameter was chosen to yield within-class and between-class success probabilities with the property that for any class pair , the condition was satisfied, with ; identifiability condition (ii) was thus met only in the case. The rightmost panel of Fig. 1 shows the fraction of misclassified nodes when and , corresponding to the setting in which convergence of to was observed above; this fraction is seen to decay when or , but to increase when . This behavior conforms with Theorem 3 and suggests that its identifiability conditions are close to being necessary as well as sufficient.

## 4 Network data example

### 4.1 Facebook social network dataset

To illustrate the use of our results in the fitting of -class stochastic blockmodels to network data, we employed a publicly available social network dataset containing undergraduate Facebook profiles from the California Institute of Technology (people.maths.ox.ac.uk/porterm/data/facebook5.zip). These profiles indicate whenever a pair of students have identified one another as friends, yielding a network of edges and accompanying covariate information including gender, class year, and hall of residence.

Traud et al. (2011) applied community detection algorithms to this network, and compared their output to partitions based on categorical covariates such as those identified above. They concludes that a grouping of students by residence hall was most similar to the best algorithmic grouping obtained, and thus that shared residence hall membership was the best predictor for the formation of community structure. This structure is reflected in the leftmost panel of Fig. 2, which shows the network adjacency structure under an ordering of students by residence hall.

### 4.2 Logit blockmodel parameterization and fitting procedure

Here we build on the results of Traud et al. (2011) by taking covariate information explicitly into account when fitting the Facebook dataset described above. Specifically, by assuming only that links are independent Bernoulli variates and then employing confidence bounds to assess fitted blocks by way of parameter , we examine these data for residual community structure beyond that well explained by the covariates themselves.

Since the results of Theorems 1 and 2 hold uniformly over all choices of blockmodel membership vector , we may select in any manner, including those that depend on covariates. For this example, we determined an approximate maximum likelihood estimate

under a logit blockmodel that allows the direct incorporation of covariates. The model is parameterized such that the log-odds ratio of an edge occurrence between nodes

and is given by

 logPij1−Pij=~θzizj+x(i,j)Tβ(i=1,…,N;j=i+1,…,N), (4)

where a vector of covariates indicating shared group membership, and model parameters are estimated from the data. Four categorical covariates were used: the three indicated above, plus an eight-category covariate indicating the range of the observed degree of each node; see Karrer & Newman (2011) for related discussion on this point. Matrix is analogous to blockmodel parameter , vector specifies the blockmodel class assignment, and vector was implemented here with sum-to-zero identifiability constraints.

Because exact maximization of the log-likelihood function corresponding to (4

) is computationally intractable, we instead employed an approach that alternated between Markov chain Monte Carlo exploration of

while holding constant, and optimization of and while holding constant. We tested different initialization methods and observed that highest likelihoods were consistently produced by first fitting class assignment vector . This fitting procedure provides a means of estimating averages over subsets of the set , under the assumption that the network data comprise independent trials.

### 4.3 Data analysis

We fitted the logit blockmodel of (4) for values of ranging from to using the stochastic maximization procedure described in the preceding paragraph, and gauged model order by the Bayesian information criterion and out-of-sample prediction using five-fold cross validation, shown respectively in the center and rightmost panels of Fig. 2. These plots suggest a relatively low model order, beginning around . The corresponding 95% confidence bounds on the divergence of from provided by Theorem 1 also yield small values for in the range 4–7: for example, when , the normalized sum of Kullback–Leibler divergences is bounded by 00067. Corresponding normalized root-mean-square error bounds over this range of are approximately one order of magnitude larger.

We then examined approximate maximum likelihood estimates of for in the range 4–7, as shown in the top two rows of Fig. 3; larger values of also reveal block structure, but exhibit correspondingly larger confidence bound evaluations. The permuted adjacency structures under each estimated class assignment are shown in the top row, along with the corresponding values of below in the second row. The structure of over this range of suggests that after covariates are taken into account, it is possible to identify a subset of students who divide naturally into two residual “meta-groups” that interact less frequently with one another in comparison to the remaining subjects in the dataset; the precision of the corresponding estimates can be quantified by Theorem 1, as in the caption of Fig. 3.

As increases, these groups become more tightly concentrated, as extra blocks absorb students whose connections are more evenly distributed. While the exact membership of each group varied over , in part due to stochasticity in the fitting algorithm employed, we observed 199 students whose meta-group membership remained constant. The bottom row of Fig. 3 shows the 8 residence halls identified for these sets of students, with the ninth category indicating unreported; observe that the effect of residence hall is still visible in that the left-hand grouping has more students in halls 4–7, while the right-hand grouping has more students in halls 1, 2, and 8.

## Acknowledgement

Work supported in part by the National Science Foundation, National Institute of Health, Army Research Office and the Office of Naval Research, U.S.A. Additional funding provided by the Harvard Medical School’s Milton Fund.

## Appendix

### Proofs of Theorems 1 and 2

###### Proof (of Theorem 1)

To begin, observe that for any fixed class assignment , every is a sum of independent Bernoulli random variables, with corresponding mean . A Chernoff bound (Dubhashi & Panconesi, 2009) shows

 pr(^θab ≥¯θab+t)≤e−nabD(¯θab+t∣∣¯θab),0

Since these bounds also hold respectively for , we may bound the probability of any given realization of in terms of the Kullback–Leibler divergence of from :

 pr(^θab=ϑ)≤e−nabD(ϑ∣∣¯θab).

By independence of the , this implies a corresponding bound on the probability of any :

 pr(^θ)≤exp{−∑a≤bnabD(^θab∣∣¯θab)}. (5)

Now, let denote the range of for fixed , and observe that since each of the lower-diagonal entries of can independently take on distinct values, we have that . Subject to the constraint that , we see that this quantity is maximized when for all , and hence

 |ˆΘ|≤[(N2)/(K+12)+1](K+12)<(N2/K2+1)K2+K2<(N/K+1)K2+K. (6)

Now consider the event that is at least as large as some ; the probability of this event is given by for

 ˆΘϵ={^θ∈ˆΘ:∑a≤bnabD(^θab∣∣¯θab)≥ϵ}. (7)

Since for all , we have from (5) and (7) that

 pr(ˆΘϵ)=∑^θ∈ˆΘϵpr(^θ)≤∑^θ∈ˆΘϵe−∑a≤bnabD(^θab∣∣¯θab)≤∑^θ∈ˆΘϵe−ϵ=|ˆΘϵ|e−ϵ,

and since , we may use (6) to obtain, for fixed class assignment ,

 pr{∑a≤bnabD(^θ∣∣¯θ)≥ϵ}<(N/K+1)K2+Ke−ϵ. (8)

Appealing to a union bound over all possible class assignments and setting then yields the claimed result.

###### Proof (of Theorem 2)

By Lemma 1, the difference can be expressed for any fixed class assignment as , where the first term satisfies the deviation bound of (8), and comprises a weighted sum of independent random variables.

To bound the quantity , observe that since by assumption , the same is true for each corresponding average . As a result, the random variables comprising are each bounded in magnitude by . This allows us to apply a Bernstein inequality for sums of bounded independent random variables due to Chung & Lu (2006, Theorems 2.8 and 2.9, p. 27), which states that for any ,

 pr{|X−E(X)|≥ϵ}≤2exp{−ϵ22∑i

Finally, observe that since the event implies either the event or the event , we have for fixed assignment that

 pr{|L(A;z)−¯LP(z) ≥2ϵM}≤pr[{∑a≤bnabD(^θab∣∣¯θab)≥ϵM}∪{|X−E(X)|≥ϵM}].

Summing the right-hand sides of (8) and (9), and then over all possible assignments, yields

 pr{maxz|L(A;z)−¯LP(z)|≥2ϵM}≤exp{KlogN+(K2+K)log(N/K+1)−ϵM}+2exp{KlogN−ϵ2M8log2N+(4/3)ϵlogN},

where we have used the fact that in (9). It follows directly that if and , then for every fixed as claimed.

### Proof of Theorem 3

###### Proof (of Theorem 3)

To begin, note that Theorem 2 holds uniformly in , and thus implies that

 |¯LP(¯z)−L(A;¯z)|+|¯LP(^z)−L(A;^z)|=oP(M).

Since is the maximum-likelihood estimate of class assignment , we know that , implying that for some . Thus, by the triangle inequality,

 |¯LP(¯z)−¯LP(^z)+δ|≤|¯LP(¯z)−L(A;¯z)|+|¯LP(^z)−(L(A;¯z)+δ)|=oP(M),

and since under any blockmodel with parameter , we have .

Under conditions (i) and (ii) of Theorem 3, we will now show that also

 ¯LP(¯z)−¯LP(^z)=Ne(^z)NΩ(M), (10)

holds for every realization of , thus implying that and proving the theorem.

To show (10), first observe that any blockmodel class assignment vector induces a corresponding partition of the set according to . Formally, partitions into subsets via the mapping

 ζij:(i=1,…,N;j=i+1,…,N)→(l=1,…,L).

This partition is separable in the sense that there exists a bijection between and the upper triangular portion of blockmodel parameter , such that we write for membership vector . More generally, for any partition of , we may define as the arithmetic average over all in the subset indexed by . Thus we may also define

 ¯L∗P(Π)=∑i

so that and coincide on partitions corresponding to admissible blockmodel assignments .

The establishment of (10) proceeds in three steps: first, we construct and analyze a refinement of the partition induced by any blockmodel assignment vector in terms of its error ; then, we show that refinements increase ; finally, we apply these results to the maximum-likelihood estimate .

###### Lemma 2

Consider a -class stochastic blockmodel with membership vector , and let denote the partition of its associated induced by any . For every , there exists a partition that refines and with the property that, if conditions (i) and (ii) of Theorem 3 hold,

 ¯LP(¯z)−¯L∗P(Π∗)=Ne(^z)NΩ(M), (11)

where counts the number of nodes whose true class assignments under are not in the majority within their respective class assignments under .

###### Lemma 3

Let be a refinement of any partition of the set ; then .

Since Lemma 2 applies to any admissible blockmodel assignment vector , it also applies to the maximum-likelihood estimate for any realization of the data; each in turn induces a partition of blockmodel edge probabilities , and (11) holds with respect to its refinement . By Lemma 3, . Finally, observe that by the definition of , and so , thereby establishing (10).

###### Proof (of Lemma 2)

The construction of will take several steps. For a given membership class under , partition the corresponding set of nodes into subclasses according to the true class assignment of each node. Then remove one node from each of the two largest subclasses so obtained, and group them together as a pair; continue this pairing process until no more than one nonempty subclass remains, then terminate. Observe that if we denote pairs by their node indices as , then by construction but .

Repeat the above procedure for each class under , and let denote the total number of pairs thus formed. For each of the pairs , find all other distinct indices for which the following holds:

 D(Pik∣∣Pik+Pjk2)+D(Pjk∣∣Pik+Pjk2)≥CMKN2, (12)

where is the constant from condition (ii) of Theorem 3, and indices and in (12) are to be interpreted respectively as whenever , and whenever . Let denote the total number of distinct triples that can be formed in this manner.

We are now ready to construct the partition of the probabilities as follows: For each of the triples , remove (or if ) and (or ) from their previous subset assignment under , and place them both in a new, distinct two-element subset. We observe the following:

(i) The partition is a refinement of the partition induced by : Since nodes and have the same class label under in that , it follows that for any , and are in the same subset under .

(ii) Since for each class at most one nonempty subclass remains after the pairing process, the number of pairs is at least half the number of misclassifications in that class. Therefore we conclude .

(iii) Condition (ii) of Theorem 3 implies that for every pair of classes , there exists at least one class for which (12) holds eventually. Thus eventually, for any of the pairs , we obtain a number of triples at least as large as the cardinality of class . Condition (i) in turn implies that the cardinality of the smallest class grows as , and thus we may write .

We can now express the difference as a sum of nonnegative divergences , where is the assignment mapping associated to , and use (12) to lower-bound this difference:

 ¯LP(¯z)−¯L∗P(Π∗)=∑i

###### Proof (of Lemma 3)

Let be a refinement of any partition of the set , and given indexing , let