# Edge Label Inference in Generalized Stochastic Block Models: from Spectral Theory to Impossibility Results

The classical setting of community detection consists of networks exhibiting a clustered structure. To more accurately model real systems we consider a class of networks (i) whose edges may carry labels and (ii) which may lack a clustered structure. Specifically we assume that nodes possess latent attributes drawn from a general compact space and edges between two nodes are randomly generated and labeled according to some unknown distribution as a function of their latent attributes. Our goal is then to infer the edge label distributions from a partially observed network. We propose a computationally efficient spectral algorithm and show it allows for asymptotically correct inference when the average node degree could be as low as logarithmic in the total number of nodes. Conversely, if the average node degree is below a specific constant threshold, we show that no algorithm can achieve better inference than guessing without using the observations. As a byproduct of our analysis, we show that our model provides a general procedure to construct random graph models with a spectrum asymptotic to a pre-specified eigenvalue distribution such as a power-law distribution.


## 1 Introduction

Detecting communities in networks has received a large amount of attention and has found numerous applications across various disciplines including physics, sociology, biology, statistics, and computer science (see the exposition [13] and the references therein). Most previous work assumes networks can be divided into groups of nodes with dense connections internally and sparser connections between groups, and considers random graph models with some underlying cluster structure such as the stochastic blockmodel (SBM), a.k.a. the planted partition model. In its simplest form, nodes are partitioned into clusters, and any two nodes are connected by an edge independently at random with probability $p$ if they are in the same cluster and with probability $q$ otherwise. The problem of cluster recovery under the SBM has been extensively studied and many efficient algorithms with provable performance guarantees have been developed (see e.g., [8] and the references therein).

Real networks, however, may not display a clustered structure; the goal of community detection should then be redefined. As observed in [15], interactions in many real networks can be of various types and prediction of unknown interaction types may have practical merit such as prediction of missing ratings in recommender systems. Therefore an intriguing question arises: Can we accurately predict the unknown interaction types in the absence of a clustered structure? To answer it, we generalize the SBM by relaxing the cluster assumption and allowing edges to carry labels. In particular, each node has a latent attribute coming from a general compact space and for any two nodes, an edge is first drawn and then labeled according to some unknown distribution as a function of their latent attributes. Given a partial observation of the labeled graph generated as above, we aim to infer the edge label distributions, which is relevant in many scenarios such as:

• Collaborative filtering: A recommender system can be represented as a labeled bipartite graph where if a user rates a movie, then there is a labeled edge between them with the label being the rating. One would like to predict the missing ratings based on the observation of a few ratings.

• Link type prediction: A social network can be viewed as a labeled graph where if a person knows another person, then there is a labeled edge between them with the label being their relationship type (either friend or colleague). One would like to predict the unknown link types based on the few known link types.

• Prediction of gene expression levels: A DNA microarray can be viewed as a labeled bipartite graph where if a gene is expressed in a sample, then there is a labeled edge between them with the label being the expression level. One would like to predict the unobserved expression levels based on the few observed expression levels.

### 1.1 Problem formulation

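The generalized stochastic blockmodel (GSBM) is formally defined by seven parameters $(n, \mathcal{X}, P, B, \mathcal{L}, \mu, \omega)$, where $n$ is a positive integer; $\mathcal{X}$ is a compact space endowed with the probability measure $P$; $B: \mathcal{X} \times \mathcal{X} \to [0,1]$ is a function symmetric in its two arguments; $\mathcal{L}$ is a finite set with $\mathcal{P}(\mathcal{L})$ denoting the set of probability measures on it; $\mu: \mathcal{X} \times \mathcal{X} \to \mathcal{P}(\mathcal{L})$ is a measure-valued function symmetric in its two arguments; $\omega$ is a positive real number.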

###### Definition 1.

Suppose that there are $n$ nodes indexed by $i \in \{1, \ldots, n\}$. Each node $i$ has an attribute $\sigma_i$ drawn in an i.i.d. manner from the distribution $P$ on $\mathcal{X}$. A random labeled graph $G$ is generated based on $\sigma$: For each pair of nodes $i, j$, independently of all others, we draw an edge between them with probability $B_{\sigma_i,\sigma_j}$; then for each edge $(i,j)$, independently of all others, we label it by $\ell \in \mathcal{L}$ with probability $\mu_{\sigma_i,\sigma_j}(\ell)$; finally each labeled edge is retained with probability $\omega/n$ and erased otherwise.
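The generative procedure of Definition 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `sample_gsbm`, the encoding of labels as integer indices with `-1` for "no observed edge", and the retention probability $\omega/n$ (inferred from the $\omega/n$ factors that appear in the proofs) are all assumptions made for this example.

```python
import numpy as np

def sample_gsbm(sigma, B, mu, omega, rng):
    """Sample an observed labeled graph in the spirit of Definition 1.

    sigma : sequence of n latent attributes (assumed already drawn i.i.d. from P)
    B     : callable, B(x, y) -> edge probability in [0, 1]
    mu    : callable, mu(x, y) -> probability vector over the label set
    omega : signal strength; each edge is assumed retained with prob. omega / n
    Returns an n x n symmetric integer matrix L with L[i, j] = label index of
    the observed edge (i, j), or -1 if no edge is observed.
    """
    n = len(sigma)
    keep = min(1.0, omega / n)
    L = -np.ones((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            x, y = sigma[i], sigma[j]
            # Drawing an edge w.p. B and then retaining it w.p. omega/n is the
            # same as a single Bernoulli(B * omega / n) trial.
            if rng.random() < B(x, y) * keep:
                p = np.asarray(mu(x, y), dtype=float)
                L[i, j] = L[j, i] = rng.choice(len(p), p=p)
    return L
```

For instance, with attributes uniform on $[0,1]$ one could call `sample_gsbm(rng.random(n), B, mu, omega, rng)` with a hypothetical kernel such as `B = lambda x, y: 0.5 + 0.25 * np.cos(2 * np.pi * (x - y))`; the adjacency matrix is then `L >= 0`.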

Given a random labeled graph $G$ generated as above, our goal is to infer the edge label distribution $\mu_{\sigma_i,\sigma_j}$ for any pair of nodes $i$ and $j$. To ensure the inference is feasible, we shall make the following identifiability assumption: Let $\nu_{x,y}(\ell) := B_{x,y}\,\mu_{x,y}(\ell)$ and

$$
\forall x \neq x' \in \mathcal{X}, \quad \sum_{\ell \in \mathcal{L}} \int_{\mathcal{X}} \bigl|\nu_{x,y}(\ell) - \nu_{x',y}(\ell)\bigr| \, P(dy) > 0; \tag{1}
$$

otherwise $x, x'$ are statistically indistinguishable and can be combined as a single element in $\mathcal{X}$. We emphasize that the model parameters $(\mathcal{X}, P, B, \mathcal{L}, \mu)$ are all fixed and do not scale with $n$, while $\omega$ could scale with $n$. Notice that $\omega n$ characterizes the total number of observed edge labels and thus can be seen as a measure of “signal strength”.
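When the attribute space is finite, the identifiability condition (1) reduces to a finite sum that is easy to check numerically. The sketch below assumes a finite space with $r$ attributes and takes $\nu_{x,y}(\ell)$ to be the product $B_{x,y}\,\mu_{x,y}(\ell)$; the function name and array layout are hypothetical choices for this illustration.

```python
import numpy as np

def identifiability_gaps(P, B, mu):
    """Left-hand side of condition (1) for every pair (x, x') of a finite space.

    P  : (r,)       attribute probabilities
    B  : (r, r)     edge probabilities
    mu : (r, r, m)  label distributions, mu[x, y] sums to 1 over m labels
    Returns gap with gap[x, x'] = sum_l sum_y P[y] * |nu[x,y,l] - nu[x',y,l]|.
    """
    nu = B[:, :, None] * mu                                # nu[x, y, l]
    diff = np.abs(nu[:, None, :, :] - nu[None, :, :, :])   # indexed (x, x', y, l)
    return np.einsum('y,abyl->ab', P, diff)

# Two attributes, two labels: within/between probabilities 0.8 / 0.2.
P = np.array([0.5, 0.5])
B = np.array([[0.8, 0.2], [0.2, 0.8]])
mu = np.empty((2, 2, 2))
mu[0, 0] = mu[1, 1] = [0.7, 0.3]
mu[0, 1] = mu[1, 0] = [0.4, 0.6]
gap = identifiability_gaps(P, B, mu)
```

Condition (1) holds exactly when every off-diagonal entry of `gap` is strictly positive; a zero off-diagonal entry flags two attributes that could be merged.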

### 1.2 Main results

We show that it is possible to make meaningful inference of edge label distributions without knowledge of any model parameters in the relatively “sparse” graph regime with $\omega = \Omega(\log n)$. In particular, we propose a computationally efficient spectral algorithm with a random weighting strategy. The random weighting strategy assigns a random weight to each label and constructs a weighted adjacency matrix of the labeled graph. The spectral algorithm embeds the nodes into a finite, low-dimensional Euclidean space based on the leading eigenvectors of the weighted adjacency matrix and uses the empirical frequency of labels on the local neighborhood in the Euclidean space to estimate the underlying true label distribution.

In the very “sparse” graph regime with $\omega = O(1)$, a positive fraction of the nodes are isolated and have no neighbors, and the observed labeled graph provides no useful information about the edge label distribution between two isolated nodes; it is therefore impossible to make meaningful inference for at least a positive fraction of node pairs. Moreover, we show that it is impossible to make meaningful inference for any randomly chosen pair of nodes when $\omega$ is below a specific non-trivial threshold.

As a byproduct of our analysis, we show how the GSBM can generate random graph models with a spectrum asymptotic to a pre-specified eigenvalue distribution, such as a power law, by appropriately choosing model parameters based on some Fourier analysis.

### 1.3 Related work

Below we point out some connections of our model and results to prior work. More detailed comparisons are provided after we present the main theorems.

#### The SBM and spectral methods

If the node attribute space is a finite set and no edge label is available, then the GSBM reduces to the classical SBM with a finite number of blocks. The spectral method and its variants are widely used to recover the underlying clusters under the SBM, see, e.g., [21, 9, 26, 25, 7]. However, the previous analysis relies on the low-rank structure of the edge probability matrix. In contrast, the edge probability matrix under the GSBM is not low-rank, and our analysis is based on establishing a correspondence between the spectrum of a compact operator and the spectrum of a weighted adjacency matrix (see Proposition 1). A similar connection appeared earlier in the context of data clustering considered in [27], where a graph is constructed based on observed attributes of nodes and clustering based on the graph Laplacian is analyzed. In contrast, our setup does not assume the observation of node attributes. Also, in our case the observed graphs could be very sparse, while the graphs considered in [27] are dense.

#### Latent space model

If the node attribute space is a finite-dimensional Euclidean space and no edge label is present, then the GSBM reduces to the latent space model proposed in [16, 14]. If we further assume the node attribute space is the probability simplex endowed with a Dirichlet distribution, and $B$ is a bilinear function, then the GSBM reduces to the mixed membership SBM proposed in [1], which is a popular model for studying the overlapping community detection problem.

#### Exchangeable random graphs

If we ignore the edge labels, the GSBM fits exactly into the framework of “exchangeable random graphs” and the edge probability function $B$ is known as the “graphon” (see e.g., [2] and the references therein). It is pointed out in [4] that some known functions can be used to approximate the graphon, but no analysis is presented. Our spectral algorithm approximates the graphon using the eigenfunctions. The exchangeable random graph models with constant average node degrees have been studied in [5], but the focus there is on the phase transition for the emergence of the giant connected component.

#### Phase transition if ω=O(1)

There is an emerging line of work [11, 23, 24, 20, 15, 19] that tries to identify the sharp phase transition threshold for positively correlated clustering in the regime with a bounded average node degree. All previous rigorous results focus on the case of two communities, while [11] gives detailed predictions about the phase transition thresholds in the more general case with multiple communities. Here, with multiple communities, we identify a threshold below which positively correlated clustering is impossible. However, our threshold is not sharp.

### 1.4 Notation

For two discrete probability distributions $\mu$ and $\nu$ on $\mathcal{L}$, let $\|\mu - \nu\|_{TV}$ denote the total variation distance. Throughout the paper, we say an event occurs “a.a.s.” or “asymptotically almost surely” when it occurs with a probability tending to one as $n \to \infty$. We use the standard big O notation. For instance, for two sequences $\{a_n\}, \{b_n\}$, $a_n \sim b_n$ means $\lim_{n\to\infty} a_n/b_n = 1$.

## 2 Spectral reconstruction if ω=Ω(logn)

Let $A$ denote the adjacency matrix of $G$ and $L_{ij}$ denote the label of edge $(i,j)$ in $G$. Our goal reduces to inferring $\mu_{\sigma_i,\sigma_j}$ based on $A$ and $L$. In this section, we study a polynomial-time algorithm based on the spectrum of a suitably weighted adjacency matrix. The detailed description is given in Algorithm 1 with four steps.

Step 1 defines the weighted adjacency matrix using a random weighting function $W$ of edge labels. Step 2 extracts the top $r$ eigenvalues and eigenvectors of the weighted adjacency matrix for a given integer $r$. Step 3 embeds nodes in $\mathbb{R}^r$ based on the spectrum of the weighted adjacency matrix. Step 4 constructs an estimator of $\mu_{\sigma_i,\sigma_j}$ using the empirical label distribution on the edges between node $j$ and nodes in the local neighborhood of node $i$. Note that the random weight function $W$ chosen in Step 1 is the key to exploiting the labeling information encoded in the edge labels. If $\mu$ were known, a better deterministic weight function could be chosen to allow for sharper estimation, e.g. ([19]). However, no a priori deterministic weight function could ensure consistent estimation irrespective of $\mu$. The function $h_\epsilon$ used in Step 4 is a continuous approximation of the indicator function $\mathbb{I}\{x \le \epsilon\}$ such that $h_\epsilon(x) = 1$ if $x \le \epsilon$ and $h_\epsilon(x) = 0$ if $x \ge 2\epsilon$.
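The four steps can be sketched as follows. This is an illustrative reconstruction rather than the authors' Algorithm 1: the normalized embedding and the ratio estimators are read off from the expressions used in the proof of Theorem 1, while the piecewise-linear shape of `h`, the ranking of eigenvalues by magnitude, the uniform random weights, and the integer label encoding (`-1` for "no observed edge") are assumptions made for this example.

```python
import numpy as np

def h(t, eps):
    """Continuous surrogate for the indicator 1{t <= eps}: equals 1 on [0, eps],
    0 on [2*eps, inf), and is linear in between (an assumed shape)."""
    return np.clip(2.0 - t / eps, 0.0, 1.0)

def spectral_label_estimate(L, num_labels, r, eps, rng):
    """L[i, j] is the observed label index of edge (i, j), or -1 if no edge."""
    n = L.shape[0]
    A = (L >= 0).astype(float)

    # Step 1: one random weight per label, then the weighted adjacency matrix.
    W = rng.random(num_labels)
    A_w = np.where(L >= 0, W[np.clip(L, 0, None)], 0.0)

    # Step 2: top-r eigenpairs, ranked by magnitude (an assumption).
    vals, vecs = np.linalg.eigh(A_w)
    top = np.argsort(-np.abs(vals))[:r]
    lam, V = vals[top], vecs[:, top]

    # Step 3: normalized embedding z_m[k] = sqrt(n) * lam_k / lam_1 * v_k(m).
    Z = np.sqrt(n) * V * (lam / lam[0])

    # Step 4: kernel weights H[i, i'] = h(||z_i' - z_i||) and ratio estimators.
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    H = h(dist, eps)
    den = H @ A                                   # sum_i' H[i, i'] * A[i', j]
    B_hat = den / H.sum(axis=1, keepdims=True)    # estimate of observed edge prob.
    mu_hat = np.zeros((num_labels, n, n))
    for l in range(num_labels):
        num = H @ (L == l).astype(float)          # sum_i' H[i, i'] * 1{L[i', j] = l}
        mu_hat[l] = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return mu_hat, B_hat
```

The dense pairwise-distance array keeps the sketch short but costs $O(n^2 r)$ memory; for large graphs a neighbor search over the embedded points would be the natural replacement.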

Our performance guarantee of Spectral Algorithm 1 is stated in terms of the spectrum of the integral operator $T$ defined as

$$
Tf(x) := \int_{\mathcal{X}} K(x,y) f(y) \, P(dy), \tag{5}
$$

where the symmetric kernel $K$ is defined by

$$
K(x,y) := \sum_{\ell} W(\ell)\, \nu_{x,y}(\ell) \in [0, |\mathcal{L}|]. \tag{6}
$$

Since $K$ is bounded, the operator $T$, acting on the function space $L^2(P)$, is compact and therefore admits a discrete spectrum with finite multiplicity of all of its non-zero eigenvalues (see e.g. [17] and [27]). Moreover, any of its eigenfunctions is continuous on $\mathcal{X}$. Denote the eigenvalues of operator $T$ sorted in decreasing order by $\lambda_1, \lambda_2, \ldots$ and its corresponding eigenfunctions with unit norm by $\phi_1, \phi_2, \ldots$. Define

$$
d^2(x,x') := \int_{\mathcal{X}} |K(x,y) - K(x',y)|^2 \, P(dy). \tag{7}
$$

It is easy to check that with probability 1 with respect to the random choices of $W$, by the identifiability condition (1), $d(x,x') > 0$ for all $x \neq x'$. By the Minkowski inequality, $d$ satisfies the triangle inequality. Therefore, $d$ is a distance on $\mathcal{X}$. By the definition of $\lambda_k$ and $\phi_k$, we have (the following series converges in $L^2(P \times P)$, see Chapter V.4 in [17]):

$$
K(x,y) = \sum_{k=1}^{\infty} \lambda_k \phi_k(x) \phi_k(y), \tag{8}
$$

and thus $d^2(x,x') = \sum_{k=1}^{\infty} \lambda_k^2 \bigl(\phi_k(x) - \phi_k(x')\bigr)^2$.
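The way the expansion (8) turns the distance (7) into a weighted sum over eigenfunctions can be checked directly. The short derivation below is a sketch that assumes the eigenfunctions $\phi_k$ are orthonormal in $L^2(P)$ and that the series (8) may be integrated term by term.

```latex
\begin{aligned}
d^2(x,x')
  &= \int_{\mathcal{X}} \Bigl|\sum_{k\ge 1}\lambda_k\bigl(\phi_k(x)-\phi_k(x')\bigr)\phi_k(y)\Bigr|^2 P(dy) \\
  &= \sum_{k,k'\ge 1}\lambda_k\lambda_{k'}
     \bigl(\phi_k(x)-\phi_k(x')\bigr)\bigl(\phi_{k'}(x)-\phi_{k'}(x')\bigr)
     \int_{\mathcal{X}}\phi_k(y)\phi_{k'}(y)\,P(dy) \\
  &= \sum_{k\ge 1}\lambda_k^2\bigl(\phi_k(x)-\phi_k(x')\bigr)^2 ,
\end{aligned}
```

where the last step uses $\int_{\mathcal{X}} \phi_k \phi_{k'} \, dP = 1$ if $k = k'$ and $0$ otherwise.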

To derive the performance guarantee of Spectral Algorithm 1, we make the following continuity assumption on . Similar continuity assumptions appeared before in the literature on the latent space model and the exchangeable random graph model (see e.g., [6, Section 4.4] and [2, Section 2.1]).

###### Assumption 1.

For every , is continuous on , hence by compactness of uniformly continuous. Let denote a modulus of continuity of all functions and . That is to say, for all ,

$$
|B_{x,y} - B_{x',y'}| \le \psi\bigl(d(x,x') + d(y,y')\bigr)
$$

and similarly for .

Let $\epsilon_r := \sum_{k > r} \lambda_k^2$ for a fixed integer $r$, characterizing the tail of the spectrum of the operator $T$. The following theorem gives an upper bound of the estimation error of $\hat{\mu}_{ij}$ for most pairs $(i,j)$ in terms of $\epsilon_r$ and $\psi$.

###### Theorem 1.

Suppose Assumption 1 holds. Assume that for some universal positive constant and chosen in Spectral Algorithm 1 satisfies . Then a.a.s. the estimators and given in Spectral Algorithm 1 satisfy

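$$
\begin{aligned}
B_{\sigma_i,\sigma_j}\,\bigl|\hat{\mu}_{ij}(\ell) - \mu_{\sigma_i,\sigma_j}(\ell)\bigr|
  &\le 2\psi(2|\lambda_1|\epsilon) + \frac{1}{|\lambda_1|^2\epsilon^2}\,\frac{\sqrt{\epsilon_r}}{\int_{\mathcal{X}} h_{|\lambda_1|\epsilon}(d(\sigma_i,x))\,P(dx)} := \eta, \quad \forall \ell \in \mathcal{L}, \\
\Bigl|\hat{B}_{ij} - \frac{\omega}{n} B_{\sigma_i,\sigma_j}\Bigr|
  &\le \frac{\omega}{n}\,\eta,
\end{aligned}
\tag{9}
$$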

for a fraction of at least of all possible pairs of nodes.

Note that if $\epsilon_r$ goes to $0$, the second term in $\eta$ given by (9) vanishes, and $\eta$ simplifies to $2\psi(2|\lambda_1|\epsilon)$, which goes to $0$ if $\epsilon$ further goes to $0$. In the case where $B$ is strictly positive, Theorem 1 implies that the estimation error of the edge label distribution goes to $0$ as successively $\epsilon_r$ and $\epsilon$ converge to $0$. Note that $\epsilon$ is a free parameter chosen in Spectral Algorithm 1 and $\eta$ can be made arbitrarily small if $\epsilon_r$ is sufficiently small. The parameter $\epsilon_r$ measures how well the compact space $\mathcal{X}$ endowed with measure $P$ can be approximated by $r$ discrete points, or equivalently how well our general model can be approximated by the labeled stochastic block model with $r$ blocks. The smaller $\epsilon_r$ is, the more structured, or the more “low-dimensional”, our general model is. In this sense, Theorem 1 establishes an interesting connection between the estimation error and the structure present in our general model.

A key part of the proof of Theorem 1 is to show that for any fixed $k$, the normalized $k$-th largest eigenvalue of the weighted adjacency matrix asymptotically converges to $\lambda_k$, where $\lambda_k$ is the $k$-th eigenvalue of the integral operator $T$, and this is precisely why our spectral embedding given by (2) is defined in a normalized fashion. The following simple example illustrates how we can derive closed-form expressions for the spectrum of the integral operator $T$.

###### Example.

Take and as the Lebesgue measure. Assume unlabeled edges. Let where is an even (i.e. ), 1-periodic function. Denote its Fourier series expansion by . For instance, if for , then and for . If for , then and for .

For the above example, where denotes convolution. Fourier series analysis entails that must coincide with Fourier coefficient or for ( appearing twice in the spectrum of ). This example thus gives a general recipe for constructing random graph models with spectrum asymptotic to a pre-specified eigenvalue profile. For on , we find in particular that and , which is a power-law spectrum with the decaying exponent being . For on , and , which is a power-law spectrum with the decaying exponent being .
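The correspondence between the spectrum of a convolution operator and the Fourier coefficients of its kernel can be checked numerically. The sketch below discretizes the operator $f \mapsto \int_0^1 g(x-y) f(y)\,dy$ on a uniform grid; the particular function `g` is a hypothetical choice made for this illustration and is not one of the functions used in the example above.

```python
import numpy as np

def kernel_spectrum(g, n, top=5):
    """Leading eigenvalues (by magnitude) of the n x n discretization of
    Tf(x) = int_0^1 g(x - y) f(y) dy, for an even, 1-periodic function g."""
    grid = np.arange(n) / n
    M = g(grid[:, None] - grid[None, :]) / n   # symmetric because g is even
    vals = np.linalg.eigvalsh(M)
    return vals[np.argsort(-np.abs(vals))][:top]

# Hypothetical kernel: g(x) = 0.5 + 0.25 cos(2 pi x), i.e. Fourier coefficients
# g_0 = 0.5 and g_1 = g_{-1} = 0.125, all others zero.
g = lambda x: 0.5 + 0.25 * np.cos(2 * np.pi * x)
spectrum = kernel_spectrum(g, 200, top=3)
```

For this kernel the three leading eigenvalues are $g_0 = 0.5$ followed by $g_1 = 0.125$ with multiplicity two (one eigenfunction proportional to $\cos 2\pi x$ and one to $\sin 2\pi x$), in line with the statement that each nonzero-frequency coefficient appears twice.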

#### Comparisons to previous work

Theorem 1 provides the first theoretical result on inferring edge label distributions to our knowledge. For estimating edge probabilities, Theorem 1 implies or improves the best known results in several special cases.

For the SBM with a finite number $r$ of blocks, $\epsilon_r$ is zero. By choosing $\epsilon$ sufficiently small in Theorem 1, we see that our spectral method is asymptotically correct if $\omega = \Omega(\log n)$, which matches the best known bounds (see e.g., [8] and the references therein). For the mixed membership SBM with a finite number $r$ of blocks, the best known performance guarantee given by [3] needs $\omega$ to be above the order of several $\log n$ factors, while Theorem 1 only needs $\omega$ to be of the order of $\log n$. However, Theorem 1 requires the additional spectral gap assumption and needs $\epsilon_r$ to vanish. Also, notice that Theorem 1 only applies to the setting where the edge probability $p$ within the community exceeds the edge probability $q$ across two different communities by a constant factor, while the best known results in [8, 3] apply to the general setting with any $p, q$.

For the latent space model, [6]

proposed a universal singular value thresholding approach and showed that the edge probabilities can be consistently estimated if

with some Lipschitz condition on similar to Assumption 1, where is the dimension of the node attribute space. Our results in Theorem 1 do not depend on the dimension of the node attribute space and only need to be on the order of .

For the exchangeable random graph models, a singular value thresholding approach is shown in [6] to estimate the graphon consistently. More recently, [2] shows that the graphon can be consistently estimated using the empirical frequency of edges in local neighborhoods, which are constructed by thresholding based on the pairwise distances between different rows of the adjacency matrix. All these previous works assume the edge probabilities are constants. In contrast, Theorem 1 applies to much sparser graphs whose edge probabilities could be as low as the order of $\log n / n$.

## 3 Impossibility if ω=O(1)

We have seen in the last section that Spectral Algorithm 1 achieves asymptotically correct inference of edge label distributions so long as $\omega = \Omega(\log n)$ and $\epsilon_r \to 0$. In this section, we focus on the sparse regime where $\omega$ is a constant, i.e., the average node degree is bounded and the number of observed edge labels is only linear in $n$. We identify a non-trivial threshold under which it is fundamentally impossible to infer the edge label distributions with an accuracy better than guessing without using the observations.

To derive the impossibility result, let us consider a simple scenario where the compact space $\mathcal{X} = \{1, \ldots, r\}$ is endowed with a uniform measure $P$, $B_{x,y} = a$ if $x = y$ and $B_{x,y} = b$ if $x \neq y$ for two positive constants $a, b$, and $\mu_{x,y} = \mu$ if $x = y$ and $\mu_{x,y} = \nu$ if $x \neq y$ for two different discrete probability measures $\mu, \nu$ on $\mathcal{L}$. Since $\omega$ is a constant, the observed labeled graph is sparse and has a bounded average degree. Similar to the Erdős-Rényi random graph, a positive fraction of the nodes are isolated and have no neighbors. To infer the edge label distribution between two isolated nodes, the observed labeled graph does not provide any useful information, and thus it is impossible to achieve asymptotically correct inference of the edge label distribution for two isolated nodes. Hence we resort to a less ambitious objective.

###### Objective 1.

Given any two randomly chosen nodes $i$ and $j$, we would like to correctly determine whether the label distribution is $\mu$ or $\nu$ with probability strictly larger than $1 - 1/r$, which is achievable by always guessing $\nu$ and is the best one can achieve if no graph information is available.

Note that if Objective 1 is not achievable, then the expected estimation error is at least . One might think that we can always achieve Objective 1 as long as such that the graph contains a giant connected component, because the labeled graph then could provide extra information. It turns out that this is not the case. Define

$$
\tau = \frac{1}{r(a+b)} \sum_{\ell \in \mathcal{L}} \bigl|a\mu(\ell) - b\nu(\ell)\bigr|. \tag{10}
$$

Let and . Then by definition of , we have . Note that when , the average node degree is larger than one, and thus similar to Erdős-Rényi random graph, contains a giant connected component. The following theorem shows that Objective 1 is fundamentally impossible if where is strictly above the threshold for the emergence of the giant connected component.

###### Theorem 2.

If , then for any two randomly chosen nodes and ,

$$
\forall x, y \in \{1, \ldots, r\}, \quad \mathbb{P}(\sigma_\rho = x \mid G, \sigma_v = y) \sim \frac{1}{r} \quad \text{a.a.s.}
$$

The above theorem implies that it is impossible to correctly determine whether two randomly chosen nodes have the same attribute or not with probability larger than and thus Objective 1 is fundamentally impossible. In case , it also implies that we cannot correctly determine whether the edge probability between nodes and is or with probability strictly larger than . This indicates the need for a minimum number of observations in order to exploit the information encoded in the labeled graph.

#### Comparisons to previous work

To our knowledge, Theorem 2 provides the first impossibility result on inferring edge label distributions and node attributes in the case with multiple communities. Previous work focuses on the case with two communities. If $r = 2$ and no edge label is available, it is conjectured in [11] and later proved in [23, 24, 20] that the positively correlated clustering is feasible if and only if , or equivalently, . If the edge label is available, it is conjectured in [15] that the positively correlated clustering is feasible if and only if with

$$
\tau' = \frac{1}{2(a+b)} \sum_{\ell \in \mathcal{L}} \frac{\bigl(a\mu(\ell) - b\nu(\ell)\bigr)^2}{a\mu(\ell) + b\nu(\ell)} \le \tau.
$$

It is proved in [19] that the positively correlated clustering is infeasible if . Compared to previous work, the threshold given by Theorem 2 is not sharp in the special case with two communities.

## 4 Numerical experiments

In this section, we explore the empirical performance of our Spectral Algorithm 1 based on Example 2. In particular, suppose nodes are uniformly distributed over the space . Let where is even, -periodic and defined by for . Assume unlabeled edges first.

We simulate the spectral embedding given by Step 3 of Algorithm 1 for a fixed observation probability . Pick in Algorithm 1. Note that the eigenvector $v_1$ corresponding to the largest eigenvalue is nearly parallel to the all-one vector and thus does not convey any useful information. Therefore, our spectral embedding of the $n$ nodes is based on $v_2$ and $v_3$. In particular, let . As we derived in Section 2, the second and third largest eigenvalues of operator $T$ are given by , and the corresponding eigenfunctions are given by and . Proposition 1 shows that asymptotically converges to . We plot and in a two-dimensional plane as shown in Fig. 1 and Fig. 1, respectively. For better illustration, we divide all nodes into ten groups with different colors, where the -th group consists of nodes with attributes given by . As we can see, is close to for most nodes , which coincides with our theoretical finding.

Then we simulate Spectral Algorithm 1 on estimating the observed edge probability between any node pair by picking and setting . We measure the estimation error by the normalized mean square error given by , where is the true edge probability defined by ; is our estimator defined in (4); is the empirical average edge probability defined by Our simulation result is depicted in Fig. 2.

Next we consider labeled edges with two possible labels or and . We simulate Spectral Algorithm 1 for estimating between any node pair by choosing the weight function as . We again measure the estimation error by the normalized mean square error given by , where is the true label distribution defined by ; is our estimator defined in (3); is the empirical label distribution defined by Our simulation result is depicted in Fig. 2. As we can see from Fig. 2, when is larger than , our spectral algorithm performs better than the estimator based on the empirical average.

## 5 Proofs

### 5.1 Proof of Theorem 1

Our proof is divided into three parts. We first establish the asymptotic correspondence between the spectrum of the weighted adjacency matrix and the spectrum of the operator using Proposition 1. Then, we prove that the estimator of edge label distribution converges to a limit. Finally, we upper bound the total variation distance between the limit and the true label distribution using Proposition 2.

###### Proposition 1.

Assume that for some universal positive constant and chosen in Spectral Algorithm 1 satisfies . Then for , almost surely . Moreover, for , almost surely there exist choices of orthonormal eigenfunctions of operator associated with such that .

By Proposition 1, we get the existence of eigenfunctions of associated with such that a.a.s., by letting

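$$
f_m := \Bigl(\frac{\lambda_1}{\lambda_1}\phi_1(\sigma_m), \ldots, \frac{\lambda_r}{\lambda_1}\phi_r(\sigma_m)\Bigr),
$$

we have

$$
\sum_{m=1}^{n} \|z_m - f_m\|_2^2 = \sum_{m=1}^{n} \sum_{k=1}^{r} \Bigl(\sqrt{n}\,\frac{\lambda_k^{(n)}}{\lambda_1^{(n)}} v_k(m) - \frac{\lambda_k}{\lambda_1}\phi_k(\sigma_m)\Bigr)^2 = o(n).
$$

By Markov’s inequality,

$$
\frac{1}{n}\Bigl|\bigl\{m \in \{1, \ldots, n\} : \|z_m - f_m\|_2 \ge \delta_n\bigr\}\Bigr| \le \frac{\sum_{m=1}^{n} \|z_m - f_m\|_2^2}{n\,\delta_n^2} = \frac{1}{\delta_n^2}\, o(1).
$$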

Note that $\delta_n$ can be chosen to decay to zero with $n$ sufficiently slowly so that the right-hand side of the above is $o(1)$. We call nodes $m$ satisfying $\|z_m - f_m\|_2 \ge \delta_n$ “bad nodes”. Let $I$ denote the set of “bad nodes”. It follows from the last display that $|I| = o(n)$. Let $J$ denote the set of nodes with at least a $\gamma_n$ fraction of edges directed towards “bad nodes”, i.e.,

$$
J = \bigl\{ j : |\{ i \in I : A_{ij} = 1 \}| \ge \gamma_n\, |\{ i : A_{ij} = 1 \}| \bigr\}.
$$
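The set $J$ is straightforward to compute from an adjacency matrix and a set of "bad" nodes. The helper below is a hypothetical illustration of that definition only (the function name and the boolean-mask encoding of the bad set are assumptions); it plays no role in the proof itself.

```python
import numpy as np

def nodes_with_many_bad_neighbors(A, bad, gamma):
    """Indices j whose number of bad neighbors is at least gamma times their degree.

    A     : (n, n) symmetric 0/1 adjacency matrix
    bad   : (n,) boolean mask marking the bad nodes (the set I)
    gamma : fraction threshold in [0, 1]
    """
    A = np.asarray(A)
    bad = np.asarray(bad, dtype=bool)
    deg = A.sum(axis=0)            # |{i : A_ij = 1}|
    bad_deg = A[bad].sum(axis=0)   # |{i in I : A_ij = 1}|
    return np.flatnonzero(bad_deg >= gamma * deg)

# Path graph 0 - 1 - 2 - 3 with node 0 marked as bad.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
bad = np.array([True, False, False, False])
J = nodes_with_many_bad_neighbors(A, bad, gamma=0.4)
```

Note that, read literally, the definition places every isolated node in $J$, since both sides of the inequality are then zero.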

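Note that the average node degree in $G$ is $\Theta(\omega)$. Since $\omega = \Omega(\log n)$ by assumption, it follows from the Chernoff bound that the observed node degree is $\Theta(\omega)$ with high probability. Therefore, we can choose $\gamma_n$ decaying to zero while still having $|J| = o(n)$, i.e., all but a vanishing fraction of nodes have at most a $\gamma_n$ fraction of edges directed towards “bad nodes”. We have thus performed an embedding of nodes in $\mathbb{R}^r$ such that for $m, m' \notin I$,

$$
\|z_m - z_{m'}\|_2 = \frac{1}{|\lambda_1|}\, d_r(\sigma_m, \sigma_{m'}) + O(\delta_n), \tag{11}
$$

where the pseudo-distance $d_r$ is defined by

$$
d_r^2(x,x') := \sum_{k=1}^{r} \lambda_k^2 \bigl(\phi_k(x) - \phi_k(x')\bigr)^2.
$$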

The remainder of the proof exploits this embedding and the fact that pseudo-distance and distance are apart by at most in some suitable sense. For a randomly selected pair of nodes , one has a.a.s.  and . Therefore, node has at most fraction of edges directed towards “bad nodes”. Hence, by (11),

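$$
\sum_{i'} h_\epsilon(\|z_{i'} - z_i\|_2)\, \mathbb{I}\{L_{i'j} = \ell\} = \sum_{i'} \mathbb{I}\{L_{i'j} = \ell\}\, h_{|\lambda_1|\epsilon}\bigl(d_r(\sigma_i, \sigma_{i'}) + O(\delta_n)\bigr) + O(\omega\gamma_n), \tag{12}
$$

and

$$
\sum_{i'} h_\epsilon(\|z_{i'} - z_i\|_2)\, A_{i'j} = \sum_{i'} h_{|\lambda_1|\epsilon}\bigl(d_r(\sigma_i, \sigma_{i'}) + O(\delta_n)\bigr)\, A_{i'j} + O(\omega\gamma_n). \tag{13}
$$

The first term in the R.H.S. of (12) is a sum of i.i.d. bounded random variables with mean given by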

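$$
\frac{\omega}{n} \int_{\mathcal{X}} h_{|\lambda_1|\epsilon}\bigl(d_r(\sigma_i, x) + O(\delta_n)\bigr)\, \nu_{x,\sigma_j}(\ell)\, P(dx).
$$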

Since by assumption, it follows from the Bernstein inequality that a.a.s.

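$$
\sum_{i'} h_\epsilon(\|z_{i'} - z_i\|_2)\, \mathbb{I}\{L_{i'j} = \ell\} = (1 + o(1))\, \omega \int_{\mathcal{X}} h_{|\lambda_1|\epsilon}\bigl(d_r(\sigma_i, x) + O(\delta_n)\bigr)\, \nu_{x,\sigma_j}(\ell)\, P(dx) + O(\omega\gamma_n). \tag{14}
$$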

The first term in the R.H.S. of (13) is a sum of i.i.d. bounded random variables with mean given by

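$$
\frac{\omega}{n} \int_{\mathcal{X}} h_{|\lambda_1|\epsilon}\bigl(d_r(\sigma_i, x) + O(\delta_n)\bigr)\, B_{x,\sigma_j}\, P(dx).
$$

It again follows from the Bernstein inequality that a.a.s.

$$
\sum_{i'} h_\epsilon(\|z_{i'} - z_i\|_2)\, A_{i'j} = (1 + o(1))\, \omega \int_{\mathcal{X}} h_{|\lambda_1|\epsilon}\bigl(d_r(\sigma_i, x) + O(\delta_n)\bigr)\, B_{x,\sigma_j}\, P(dx) + O(\omega\gamma_n). \tag{15}
$$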

Note that is a continuous function in . Therefore,

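$$
\lim_{n \to \infty} h_{|\lambda_1|\epsilon}\bigl(d_r(\sigma_i, x) + O(\delta_n)\bigr) = h_{|\lambda_1|\epsilon}\bigl(d_r(\sigma_i, x)\bigr).
$$

By the dominated convergence theorem, it follows from (3), (14), (15) that a.a.s.

$$
\hat{\mu}_{i,j}(\ell) \sim \frac{\int_{\mathcal{X}} h_{|\lambda_1|\epsilon}(d_r(\sigma_i, x))\, \nu_{x,\sigma_j}(\ell)\, P(dx)}{\int_{\mathcal{X}} h_{|\lambda_1|\epsilon}(d_r(\sigma_i, x))\, B_{x,\sigma_j}\, P(dx)} := \mu^*_{i,j}(\ell). \tag{16}
$$

Similarly, we have a.a.s.

$$
\hat{B}_{i,j} \sim \frac{\omega}{n}\, \frac{\int_{\mathcal{X}} h_{|\lambda_1|\epsilon}(d_r(\sigma_i, x))\, B_{x,\sigma_j}\, P(dx)}{\int_{\mathcal{X}} h_{|\lambda_1|\epsilon}(d_r(\sigma_i, x))\, P(dx)} := B^*_{i,j}. \tag{17}
$$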

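The following proposition upper bounds the difference between the limit $\mu^*_{i,j}(\ell)$ (resp. $B^*_{i,j}$) and $\mu_{\sigma_i,\sigma_j}(\ell)$ (resp. $\frac{\omega}{n} B_{\sigma_i,\sigma_j}$).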

###### Proposition 2.

Suppose Assumption 1 holds. Then there exists a fraction of at least of all possible pairs of nodes such that

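$$
\begin{aligned}
B_{\sigma_i,\sigma_j}\,\bigl|\mu^*_{i,j}(\ell) - \mu_{\sigma_i,\sigma_j}(\ell)\bigr|
  &\le 2\psi(2|\lambda_1|\epsilon) + \frac{1}{|\lambda_1|^2\epsilon^2}\,\frac{\sqrt{\epsilon_r}}{\int_{\mathcal{X}} h_{|\lambda_1|\epsilon}(d(\sigma_i,x))\,P(dx)} := \eta, \quad \forall \ell \in \mathcal{L}, \\
\Bigl|B^*_{i,j} - \frac{\omega}{n} B_{\sigma_i,\sigma_j}\Bigr|
  &\le \frac{\omega}{n}\,\eta.
\end{aligned}
\tag{18}
$$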

Applying Proposition 2, our theorem then follows.

### 5.2 Proof of Theorem 2

The proof of Theorem 2 relies on a coupling between the local neighborhood of $\rho$ and a simple labeled Galton-Watson tree. It is well known that the local neighborhood of a node in a sparse graph is “tree-like”. In the case with $r = 2$, the coupling result is first studied in [23] and generalized to the labeled tree in [19]. In this paper, we extend the coupling result to any finite $r \ge 2$.

Let $d = \omega(a + (r-1)b)/r$ and consider a labeled Galton-Watson tree $T$ with Poisson offspring distribution with mean $d$. The attribute of the root $\rho$ is chosen uniformly at random from $\{1, \ldots, r\}$. For each child node, independently of everything else, it has the same attribute as its parent with probability $a/(a + (r-1)b)$ and each of the $r-1$ different attributes with probability $b/(a + (r-1)b)$. Every edge between the child and its parent is independently labeled with distribution $\mu$ if they have the same attribute and with distribution $\nu$ otherwise.

The labeled Galton-Watson tree $T$ can also be equivalently described as follows. Each edge is independently labeled $\ell$ at random with probability $\frac{a\mu(\ell) + (r-1)b\nu(\ell)}{a + (r-1)b}$. The attribute of the root $\rho$ is first chosen uniformly at random from $\{1, \ldots, r\}$. Then, for each child node, independently of everything else, it has the same attribute as its parent with probability $1 - (r-1)\epsilon(\ell)$ and each of the $r-1$ different attributes with probability $\epsilon(\ell)$, where

$$
\epsilon(\ell) = \frac{b\nu(\ell)}{a\mu(\ell) + (r-1)\,b\nu(\ell)}. \tag{19}
$$
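The equivalence of the two descriptions of the labeled tree is an instance of Bayes' rule, which can be verified numerically. The sketch below assumes that a child shares its parent's attribute with probability $a/(a+(r-1)b)$ (a value implied by the block structure, stated here as an assumption) and checks that labeling first and then drawing the attribute via (19) gives the same joint law as drawing the attribute first and then the label. The function name `tree_channel` is hypothetical.

```python
import numpy as np

def tree_channel(a, b, r, mu, nu):
    """Quantities of the 'label first, attribute second' description of the tree.

    Returns (same, label_marginal, eps) where
      same           = P(child has the parent's attribute)          (assumed form)
      label_marginal = P(edge label = l)
      eps[l]         = P(child has one given different attribute | label l), eq. (19)
    """
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    total = a + (r - 1) * b
    same = a / total
    label_marginal = (a * mu + (r - 1) * b * nu) / total
    eps = b * nu / (a * mu + (r - 1) * b * nu)
    return same, label_marginal, eps

a, b, r = 3.0, 1.0, 3
mu, nu = np.array([0.7, 0.3]), np.array([0.4, 0.6])
same, label_marginal, eps = tree_channel(a, b, r, mu, nu)

# Joint law from the first description: attribute first, label second.
joint_same = same * mu                      # P(same attribute, label l)
joint_diff = (b / (a + (r - 1) * b)) * nu   # P(one given different attribute, label l)
```

Here `label_marginal * (1 - (r - 1) * eps)` reproduces `joint_same` and `label_marginal * eps` reproduces `joint_diff`, which is exactly the claimed equivalence.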

Recall that $G_R$ denotes the neighborhood of $\rho$ in $G$ within distance $R$ and $\partial G_R$ denotes the nodes at the boundary of $G_R$. Let $T_R$ denote the tree $T$ up to depth $R$ and $\partial T_R$ denote the set of leaf nodes of $T_R$. The following lemma, similar to the coupling lemmas in [23] and [19], shows that $G_R$ can be coupled with the labeled Galton-Watson tree $T_R$.

###### Lemma 1.

Let for some small enough constant , then there exists a coupling such that a.a.s.  where denote the node attributes on the subgraph .

For the labeled Galton-Watson tree, we show that if , then the attributes of leaf nodes are asymptotically independent with the attribute of root.

###### Lemma 2.

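Consider a labeled Galton-Watson tree $T$ with . Then as $R \to \infty$,

$$
\forall x \in \{1, \ldots, r\}, \quad \mathbb{P}(\sigma_\rho = x \mid T, \sigma_{\partial T_R}) \to \frac{1}{r} \quad \text{a.a.s.}
$$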

By exploiting Lemma 1 and Lemma 2, we give our proof of Theorem 2. By symmetry, for and . Therefore, we only need to show that for any and it further reduces to showing that

$$
\mathbb{P}[\sigma_\rho = y \mid G, \sigma_v = y, \sigma_{\partial G_R}] \sim 1/r. \tag{20}
$$

Let be as in Lemma 1 such that and thus a.a.s.. Lemma 4.7 in [23] shows that is asymptotically independent with conditional on . Hence, Also, note that Lemma 1 implies that and by Lemma 2, Therefore, equation (20) holds.

## Acknowledgment

M.L. acknowledges the support of the French Agence Nationale de la Recherche (ANR) under reference ANR-11-JS02-005-01 (GAP project). J. X. acknowledges the support of NSF ECCS 10-28464.

## References

• [1] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981–2014, 2008.
• [2] E. M. Airoldi, T. B. Costa, and S. H. Chan. Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In Advances in Neural Information Processing Systems 26, pages 692–700, 2013.
• [3] A. Anandkumar, R. Ge, D. Hsu, and S. Kakade. A tensor spectral approach to learning mixed membership community models. In COLT, pages 867–881, 2013.
• [4] P. J. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences, 2009.
• [5] B. Bollobás, S. Janson, and O. Riordan. The phase transition in inhomogeneous random graphs. Random Struct. Algorithms, 31(1):3–122, Aug. 2007.
• [6] S. Chatterjee. Matrix estimation by universal singular value thresholding. arXiv:1212.1247, 2012.
• [7] K. Chaudhuri, F. C. Graham, and A. Tsiatas. Spectral clustering of graphs with general degrees in the extended planted partition model. Journal of Machine Learning Research, 23:35.1–35.23, 2012.
• [8] Y. Chen and J. Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. arXiv:1402.1267, 2014.
• [9] A. Coja-Oghlan. Graph partitioning via adaptive spectral techniques. Comb. Probab. Comput., 19(2):227–284, 2010.
• [10] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):pp. 1–46, 1970.
• [11] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84:066106, 2011.
• [12] U. Feige and E. Ofek. Spectral techniques applied to sparse random graphs. Random Struct. Algorithms, 27(2):251–275, Sept. 2005.
• [13] S. Fortunato. Community detection in graphs. arXiv:0906.0612, 2010.
• [14] M. S. Handcock, A. E. Raftery, and J. M. Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A (Statistics in Society), 170(2):301–354, 2007.
• [15] S. Heimlicher, M. Lelarge, and L. Massoulié. Community detection in the labelled stochastic block model. arXiv:1209.2910, 2012.
• [16] P. D. Hoff, A. E. Raftery, and M. S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97:1090+, December 2002.
• [17] T. Kato. Perturbation Theory for Linear Operators. Springer, Berlin, 1966.
• [18] V. I. Koltchinskii. Asymptotics of spectral projections of some random matrices approximating integral operators. Progress in Probability, 1998.
• [19] M. Lelarge, L. Massoulié, and J. Xu. Reconstruction in the labeled stochastic block model. In Information Theory Workshop, Sept. 2013.
• [20] L. Massoulié. Community detection thresholds and the weak Ramanujan property. In STOC 2014: 46th Annual Symposium on the Theory of Computing, pages 1–10, United States, 2014.
• [21] F. McSherry. Spectral partitioning of random graphs. In 42nd IEEE Symposium on Foundations of Computer Science, pages 529 – 537, Oct. 2001.
• [22] E. Mossel. Survey - information flows on trees. DIMACS series in discrete mathematics and theoretical computer science, pages 155–170, 2004.
• [23] E. Mossel, J. Neeman, and A. Sly. Stochastic block models and reconstruction. arXiv:1202.1499, 2012.
• [24] E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. arXiv:1311.4115, 2013.
• [25] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39(4):1878–1915, 2011.
• [26] D.-C. Tomozei and L. Massoulié. Distributed user profiling via spectral methods. SIGMETRICS Perform. Eval. Rev., 38(1):383–384, June 2010.
• [27] U. von Luxburg, O. Bousquet, and M. Belkin. On the convergence of spectral clustering on random samples: the normalized case. NIPS, 2005.

## Appendix A Proof of Proposition 1

We first introduce notations used in the proof. Several norms on matrices will be used. The spectral norm of a matrix is denoted by and equals the largest singular value. The Frobenius norm of a matrix is denoted by and equals the square root of the sum of squared singular values. It follows that if is of rank . For vectors, the only norm that will be used is the usual norm, denoted as . Introduce a matrix defined by Recall that is a fixed positive integer in Spectral Algorithm 1. Denote