Perfect Clustering for Stochastic Blockmodel Graphs via Adjacency Spectral Embedding

10/02/2013 · by Vince Lyzinski et al.

Vertex clustering in a stochastic blockmodel graph has wide applicability and has been the subject of extensive research. In this paper, we provide a short proof that the adjacency spectral embedding can be used to obtain perfect clustering for the stochastic blockmodel and the degree-corrected stochastic blockmodel. We also show an analogous result for the more general random dot product graph model.


1 Introduction

In many problems arising in the natural sciences, technology, business and politics, it is crucial to understand the specific connections among the objects under study: for example, the interactions between members of a political party; the firing of synapses in a neuronal network; or citation patterns in reference literature. Mathematically, these objects and their connections are modeled as graphs, and a common goal is to find clusters of similar vertices within a graph.

Both model-based and heuristic-based techniques have been proposed for clustering the vertices in a graph [newman2006modularity, Bickel2009, Choi2010, Snijders1997Estimation]. In this paper we focus on probabilistic performance guarantees for spectral-based techniques, which have elements of both model- and heuristic-based methods [rohe2011spectral, STFP-2011]. We study the consistency of mean squared error clustering via the adjacency spectral embedding for three nested classes of models, each an example of a latent position model [Hoff2002]:

  • the stochastic blockmodel where vertices in the same cluster are stochastically equivalent [Holland1983],

  • the degree-corrected stochastic blockmodel where stochastic equivalence holds up to a scaling factor [karrer2011stochastic],

  • and the random dot product graph where a natural vertex clustering may not exist [young2007random].

The generality of our main result allows for the extension of our asymptotically error-free results from the rather restrictive stochastic blockmodel to more general settings.

Numerous spectral clustering procedures have been proposed and analyzed under various random graph models [chaudhuri12:_spect, rinaldo_2013, qin2013dcsbm, rohe2011spectral, STFP-2011]. For example, Laplacian spectral embedding [rohe2011spectral] and adjacency spectral embedding [STFP-2011] have been shown to yield consistent clustering for the stochastic blockmodel. These results have relied on bounding the Frobenius norm of the difference between the embedded vertices and the associated eigenvectors of the population Laplacian (in Laplacian spectral embedding) or of the edge probability matrix (in adjacency spectral embedding).

Relying on global Frobenius norm bounds to demonstrate consistent clustering is suboptimal, however, because in general one cannot rule out that a diminishing but positive proportion of the embedded points contributes disproportionately to the global error. When this occurs, these “outliers” are very likely to be misclustered, and hence the best existing Frobenius norm bounds can only limit the number of misclustered vertices rather than rule out errors altogether (see [rohe2011spectral, Theorem 3.1] and [STFP-2011, Theorem 1]).

In contrast, our main technical result gives a bound (in probability) on the maximum error between individual embedded vertices and the associated eigenvectors of the edge probability matrix (see Lemma 2.5). This lemma is proved for general random dot product graphs and provides the necessary tools to improve the bounds on the error rate of mean squared error clustering in adjacency spectral embedding. The first main clustering result of this paper gives a bound on the probability that a mean square error clustering of the adjacency spectral embedding will be error-free, i.e. zero vertices will be misclustered (see Theorem 2.6).

Due to the generality of our main lemma, we are able to prove an analogous asymptotically error-free clustering result for the degree-corrected stochastic blockmodel (see Theorem 4.3). Note again that the best existing results for spectral methods in the degree-corrected model only bound the number of misclustered vertices [qin2013dcsbm, Theorem 4.4]. Finally, we prove a very general result that spectral clustering of random dot product graphs is strongly universally consistent in the sense of [pollard81:_stron_k] (see Theorem 5.2). These extensions underscore the wide utility of our approach, and we believe our main lemma to be of independent interest.

We note that the authors of [bickel2011method], among others, have shown that likelihood-based techniques can be employed to achieve asymptotically error-free clustering in the stochastic blockmodel. However, likelihood-based approaches are computationally intractable for very large graphs, in contrast to the spectral clustering approach considered here.

2 Setting and main theorem

In the first part of this section, we will define the random dot product graph and our main tool, the adjacency spectral embedding. Next, we define the stochastic blockmodel and clustering procedure, and finally, we will state our main theorem and the supporting lemmas.

2.1 Random dot product graphs and the adjacency spectral embedding

The random dot product graph model is a convenient theoretical tool, and the spectral properties of its adjacency matrix are well understood. While the stochastic blockmodel relies on an inherently non-geometric construction—indeed, each block is associated with a categorical label, and these labels determine the adjacency probabilities—the random dot product graph relies on a geometric construction in which each block is associated with a point in Euclidean space, i.e. a vector. The dot products of these vectors then determine the adjacency probabilities in the graph.

Definition 2.1 (Random Dot Product Graph (RDPG)).

A random adjacency matrix $A \in \{0,1\}^{n \times n}$ for latent positions $X = [X_1, \dots, X_n]^T \in \mathbb{R}^{n \times d}$, where $X_i^T X_j \in [0,1]$ for all $i, j$, is said to be an instance of a random dot product graph (RDPG) if

$$\mathbb{P}[A \mid X] = \prod_{i < j} (X_i^T X_j)^{A_{ij}} (1 - X_i^T X_j)^{1 - A_{ij}}.$$

Remark.

In general we will denote the rows of an $n \times d$ matrix $M$ by $M_1, \dots, M_n$. With this notation, the entries of the edge probability matrix $P = XX^T$ in the above definition are the Bernoulli parameters $P_{ij} = X_i^T X_j$ for the edge probabilities.
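For illustration, a graph from this model can be sampled directly from the latent positions. The following Python sketch is ours (the function name sample_rdpg is illustrative, not from the paper) and assumes the rows of X have pairwise dot products in [0, 1].

```python
import numpy as np

def sample_rdpg(X, rng=None):
    """Sample a symmetric, hollow adjacency matrix whose entries above the
    diagonal are independent Bernoulli(X_i . X_j) draws, as in Definition 2.1.
    Assumes the rows of X have pairwise dot products in [0, 1]."""
    rng = np.random.default_rng(rng)
    P = X @ X.T                                     # edge probability matrix P = X X^T
    n = P.shape[0]
    upper = np.triu(rng.random((n, n)) < P, k=1)    # Bernoulli draws, strict upper triangle
    A = (upper + upper.T).astype(float)             # symmetrize; diagonal stays zero
    return A
```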

Note that, as defined above, the rank of $P$ is $d$. Let $P = U S U^T$ be the spectral decomposition of $P$ restricted to its non-zero eigenvalues, where $U \in \mathbb{R}^{n \times d}$ has orthonormal columns and $S \in \mathbb{R}^{d \times d}$ is diagonal with diagonal entries $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d > 0$. Importantly, we shall assume throughout this paper that the non-zero eigenvalues of $P$ are distinct, i.e., the inequalities above are strict.

It follows that there exists an orthonormal matrix $W$ such that $X = U S^{1/2} W$. We thus suppose that $X = U S^{1/2}$; the assumption does not lead to any loss of generality because the distribution of $A$ is invariant under orthogonal transformations of the latent positions and the clustering method considered in this paper is invariant under orthogonal transformations. This relationship between the spectral decomposition of $P$ and the latent positions of the RDPG model motivates our main tool: the adjacency spectral embedding.

Definition 2.2 (Adjacency Spectral Embedding (ASE)).

Let $\hat U \in \mathbb{R}^{n \times d}$ have orthonormal columns given by the eigenvectors of $A$ corresponding to the $d$ largest eigenvalues of $A$ according to the algebraic ordering. Let $\hat S \in \mathbb{R}^{d \times d}$ be diagonal with diagonal entries given by these eigenvalues in descending order. We define the $d$-dimensional adjacency spectral embedding of $A$ via $\hat X = \hat U \hat S^{1/2}$.

We shall assume, for ease of exposition, that the diagonal entries of $\hat S$ are positive. As will be seen later, e.g., in Lemma 3.2, this assumption is justified in the context of random dot product graphs due to the concentration of the eigenvalues of $A$ around those of $P$. Recall that the rows of $\hat X$ will be denoted by $\hat X_1, \dots, \hat X_n$.
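In code, the embedding of Definition 2.2 can be computed with a few lines of NumPy. The sketch below is ours and simply scales the eigenvectors of $A$ associated with its $d$ largest eigenvalues by the square roots of those eigenvalues, assuming they are positive.

```python
import numpy as np

def adjacency_spectral_embedding(A, d):
    """d-dimensional adjacency spectral embedding: scale the eigenvectors of A
    associated with its d largest (algebraic) eigenvalues by the square roots
    of those eigenvalues.  Assumes these eigenvalues are positive."""
    eigvals, eigvecs = np.linalg.eigh(A)             # ascending algebraic order
    idx = np.argsort(eigvals)[::-1][:d]              # indices of the d largest eigenvalues
    return eigvecs[:, idx] * np.sqrt(eigvals[idx])   # rows are the embedded vertices
```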

2.2 Clustering

We begin by considering the task of clustering in the $K$-block stochastic blockmodel. This model is typically parameterized by a matrix of adjacency probabilities between each pair of blocks, along with the block membership of each vertex. Here we present an alternative definition in terms of the RDPG model.

Definition 2.3 ((Positive Semidefinite) $K$-block Stochastic Blockmodel (SBM)).

We say an RDPG is an SBM with $K$ blocks if the number of distinct rows in $X$ is $K$. In this case, we define the block membership function $\tau$ to be a function such that $\tau(i) = \tau(j)$ if and only if $X_i = X_j$. For each block $k$, let $n_k$ be the number of vertices $i$ such that $\tau(i) = k$, i.e., the number of vertices in block $k$.

To ease notation, we will always use $K$ to denote the number of blocks in an SBM, and we will refer to a $K$-block SBM simply as an SBM when appropriate.

Remark.

Note that a general $K$-block SBM can only be represented in this way if the matrix of block probabilities is positive semidefinite.

Next, we introduce mean square error clustering, which is the clustering sought by $K$-means clustering.

Definition 2.4 (Mean Square Error (MSE) Clustering).

The MSE clustering of the rows of $\hat X$ into $K$ blocks returns the $K$ centroids that minimize the sum of squared Euclidean distances from each row of $\hat X$ to its nearest centroid; these minimizers are the optimal cluster centroids for the MSE clustering. We also define the cluster membership function $\hat\tau$, which assigns each row of $\hat X$ to its nearest optimal centroid, so that $\hat\tau(i)$ records the cluster of the $i$th row of $\hat X$.

In our results, we consider the MSE clustering of the rows of $\hat X$ in two contexts, the SBM and RDPG models defined above. We will also consider a variation of the SBM, the degree-corrected SBM, in which we perform MSE clustering on the rows of a projected version of $\hat X$.
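In practice the MSE clustering is computed with a $K$-means routine. The sketch below uses scikit-learn's KMeans as one such implementation choice; this choice is ours, not prescribed by the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def mse_cluster(X_hat, K, seed=0):
    """Mean square error clustering of the embedded vertices: run K-means on
    the rows of the adjacency spectral embedding and return the estimated
    block membership of each vertex together with the cluster centroids."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X_hat)
    return km.labels_, km.cluster_centers_
```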

2.3 Main theorems

Before stating our main results, we indicate our notation for matrix norms and define constants to be used throughout the remaining text. For a matrix $M$, we let $\|M\|_{2\to\infty} = \max_i \|M_i\|_2$, i.e., the maximum of the Euclidean norms of the rows. For a square matrix $M$, $\|M\|$ denotes the spectral norm. The Frobenius norm of a matrix is denoted by $\|\cdot\|_F$.

We define:

  • $\delta$ is the maximum of the row sums of $P$;

  • $d$ is the rank of $P$;

  • $\gamma$ is the minimum gap among the distinct eigenvalues of $P$;

  • $K$ is the number of blocks in the SBM.
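For concreteness, these quantities can be computed directly from the edge probability matrix. The sketch below is ours, and the descriptive names simply mirror the list above.

```python
import numpy as np

def model_constants(P, tol=1e-8):
    """Compute, from the edge probability matrix P, the quantities listed
    above: the maximum expected degree (maximum row sum), the rank of P, and
    the minimum gap among the distinct eigenvalues of P."""
    max_expected_degree = P.sum(axis=1).max()
    rank = np.linalg.matrix_rank(P, tol=tol)
    eigvals = np.linalg.eigvalsh(P)
    distinct = np.unique(np.round(eigvals, 8))   # collapse numerically equal eigenvalues
    min_gap = np.diff(distinct).min() if distinct.size > 1 else np.inf
    return max_expected_degree, rank, min_gap
```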

Note that $\gamma$ is not necessarily equal to the magnitude of the smallest non-zero eigenvalue of $P$, as the gaps between consecutive non-zero eigenvalues could be smaller. We now state a technical but highly useful lemma in which we bound the maximum difference between the rows of $\hat X$ and the rows of an orthogonal transformation of $X$.

Lemma 2.5.

Suppose is given such that . Then, with probability at least , one has

(1)
Remark.

The parameter $\delta$ is the maximum expected degree of the random graph. In many models, $\delta$ will be of the same order as the density of the graph. If the density is very small, the bound in Eq. (1) will be large. Frequently the relevant quantities are of the same order, so that the bound in Eq. (1) decays polynomially for fixed model parameters. See Example 1 for a simple illustration of how this bound can be applied. Finally, the condition assumed in Lemma 2.5 is exactly what allows us to use the concentration results of [oliveira2009concentration, tropp2011freedman].

Lemma 2.5 gives far greater control of the errors than previous results derived for the Frobenius norm; indeed, Frobenius norm bounds do not allow fine control of the errors in the individual rows of $\hat X$, and therefore can only bound the number of misclustered vertices through the global error. Lemma 2.5, on the other hand, provides exactly this row-wise control and, as such, vastly improves the bounds on the error rate of MSE clustering of $\hat X$.
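To make the distinction concrete, the following sketch (ours, not from the paper) computes both error measures for an embedding, using an orthogonal Procrustes alignment to account for the fact that the latent positions are only identified up to an orthogonal transformation.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def embedding_errors(X_hat, X):
    """Contrast the global Frobenius error with the worst-case row error (the
    2-to-infinity norm) after aligning X_hat to X by an orthogonal matrix."""
    W, _ = orthogonal_procrustes(X_hat, X)                # best orthogonal alignment of X_hat to X
    diff = X_hat @ W - X
    frobenius_error = np.linalg.norm(diff, "fro")
    row_wise_error = np.linalg.norm(diff, axis=1).max()   # maximum Euclidean row norm
    return frobenius_error, row_wise_error
```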

Our main theorem is the following result on the probability that mean square error clustering on the rows of is error-free.

Theorem 2.6 (SBM).

Let $A$ be an SBM with $K$ blocks and block membership function $\tau$. Assume that

  (A0) the non-zero eigenvalues of $P$ are distinct.

Denote the bound in Lemma 2.5 by $r$. Let $\hat\tau$ be the optimal MSE clustering of the rows of $\hat X$ into $K$ clusters. Let $S_K$ denote the symmetric group on $K$ symbols, and let $\pi \in S_K$ denote a permutation of the blocks. Finally, let $n_{\min}$ be the smallest block size. If

  (A1) for all $i, j$ with $\tau(i) \neq \tau(j)$, the rows $X_i$ and $X_j$ are separated by an amount large enough relative to the ratio $n / n_{\min}$ and to the bound $r$, and

  (A2) the eigenvalue gap $\gamma$ is large enough for Lemma 2.5 to apply,

then, with probability at least that guaranteed in Lemma 2.5, the clustering is perfect: $\hat\tau = \pi \circ \tau$ for some permutation $\pi \in S_K$.

We remark that assumptions (A1) and (A2) are quite natural: (A1) requires that the distinct rows of $X$ have some minimum separation that is large enough compared to the ratio of the number of vertices to the smallest block size and compared to the bound in Lemma 2.5. Assumption (A2) on $\gamma$ ensures a large enough gap in the eigenvalues to use Lemma 2.5. We note that Lemma 2.5 is applicable in the sparse setting, i.e., the setting wherein the average degrees of the vertices grow sublinearly in the number of vertices, but that we still need sufficient separation between the distinct rows of $X$. For a simple illustration of how this theorem can be applied to a concrete model, see Example 1. Finally, we admit that assumption (A0) is less natural, but it is a helpful technical restriction and excludes only a small range of parameters.

While Theorem 2.6 is proven in the SBM setting, we note that our final theorem, Theorem 5.2, is an analogous clustering result in which we prove strong universal consistency of MSE clustering for more general random dot product graphs.

Finally, we observe that Theorem 2.6 has both finite-sample and asymptotic implications. In particular, under these model assumptions, for any finite $n$, the theorem gives a lower bound on the probability of perfect clustering. We do not assert—and indeed it is easy to refute—that in the finite sample case, perfect clustering occurs with probability one. Nevertheless, for an appropriate choice of constants, the probability of perfect clustering approaches one as $n$ tends to infinity.

Figure 1: Mean, with standard error bars, of the row-wise embedding error for each value of $n$ for the model in Example 1. The decay in the error is very close to the predicted rate.
Example 1 (Dense SBM).

Here we consider a simple concrete example where we can apply Lemma 2.5 and Theorem 2.6. Let and and let for and for . Hence, this is a two-block model and the matrix of edge probabilities is given by

The constants in our theorem are , . The distinct eigenvalues of are , and 0 so the smallest gap is ; hence . Lemma 2.5 can be applied as long as , which will clearly hold for sufficiently large for any fixed . This also establishes assumptions (A0) and (A1) of Theorem 2.6.

The implication of Lemma 2.5 is that

While this bound is loose for small to moderate $n$, the asymptotic implications are clear. Empirically, Figure 1 shows the average error for this model as a function of $n$, and we see that the error becomes small much sooner and decays at a rate very close to the prediction.

For Theorem 2.6 we can compute that if then for all and hence since and , the assumption (A1) will hold for sufficiently large. Hence, for large enough , there will be a very high probability that the mean square error clustering will provide perfect performance.
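The asymptotics of this example are easy to explore by simulation. The sketch below is ours; since the numerical latent positions of Example 1 are not reproduced here, the vectors x1 and x2 are hypothetical placeholders chosen only so that all dot products lie in [0, 1].

```python
import numpy as np
from sklearn.cluster import KMeans

def dense_sbm_experiment(n=2000, seed=0):
    """Simulate a dense two-block SBM written as an RDPG, embed it, cluster
    with K-means, and count the misclustered vertices (up to label swap)."""
    rng = np.random.default_rng(seed)
    x1, x2 = np.array([0.8, 0.2]), np.array([0.2, 0.8])    # placeholder latent positions
    X = np.vstack([np.tile(x1, (n // 2, 1)), np.tile(x2, (n - n // 2, 1))])
    tau = np.repeat([0, 1], [n // 2, n - n // 2])           # true block memberships
    P = X @ X.T
    upper = np.triu(rng.random((n, n)) < P, k=1)
    A = (upper + upper.T).astype(float)                     # symmetric, hollow adjacency
    vals, vecs = np.linalg.eigh(A)
    idx = np.argsort(vals)[::-1][:2]
    X_hat = vecs[:, idx] * np.sqrt(vals[idx])               # two-dimensional ASE
    labels = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X_hat)
    return int(min(np.sum(labels != tau), np.sum(labels == tau)))  # errors up to label swap

print(dense_sbm_experiment())   # expected to be 0 or very small for dense, well-separated blocks
```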

Example 2 (Sparse SBM).

In this example, we will illustrate some asymptotic implications of the assumptions of Theorem 2.6 in a generalization of Example 1. The SBM of the previous example is a specific instance of an SBM with edge probabilities given by

(2)

where for , and otherwise. In order for to be positive semidefinite we need , and the subsequent RDPG representation is

(3)

We will investigate for what values of , and the assumptions of our theorem are satisfied, noting first that (A0) is automatically satisfied. For the key constants, we can work out that the distinct eigenvalues of are , and . This gives . In addition, , so that assumption (A2) reads

We restrict ourselves to the sparse domain, and assume that , and Assumption (A2) then becomes

Here, we have , and assumption (A1) in this regime is equivalent to

Highlighting a few special cases, we consider

  • and . In this case, our assumptions do not hold. Indeed, in [mossel:ptrf] it is established that if , then clustering is impossible, and the same authors recently extended this result in [mossel2014consistency] to show that consistent estimation is impossible for any choice of the remaining parameters in this regime.

  • . In order to satisfy assumptions (A1) and (A2), it suffices that and (A1) does not hold if .

  • In order to satisfy assumptions (A1) and (A2), it suffices that and Note that (A2) does not hold if and (A1) does not hold if

  • In order to satisfy assumptions (A1) and (A2), it suffices that and Note that (A2) does not hold if , and (A1) does not hold if

3 Proof of Theorem 2.6

Before we prove Theorem 2.6, we first collect a sequence of useful bounds from [tang2012universally, tropp2011freedman, STFP-2011]. We then prove two key lemmas.

Proposition 3.1.

Suppose with and let and be as defined in Lemma 2.5. For any , if , then the following occur with probability at least

In addition, as is a non-negative matrix, . Thus, if , then provided that the above events occur, .

The next two lemmas from [athreya2013limit] are essential to our argument.

Lemma 3.2 ([athreya2013limit]).

In the setting of Proposition 3.1, if the events in Proposition 3.1 occur, then

Lemma 3.3 ([athreya2013limit]).

In the setting of Proposition 3.1, if the events in Proposition 3.1 occur, then

We then have the following bound

Lemma 3.4.

In the setting of Proposition 3.1, if the events in Proposition 3.1 occur, then

Proof.

Let . Denoting by the quantity , we have

where and are given by

Note that by Proposition 3.1 and our assumption that ,

Combining the previous displayed equation and Lemma 3.3 yields

Similarly, we have that and combining this with Proposition 3.1, we bound by

from which the desired bound follows. ∎

We now use Lemma 3.2, Lemma 3.4 and Hoeffding’s inequality to prove Lemma 2.5. We note that, for any conformable matrices $M$ and $N$, $\|MN\|_{2\to\infty} \le \|M\|_{2\to\infty}\,\|N\|$.

Proof of Lemma 2.5.

Since we can add and subtract the matrix and to rewrite as

Lemma 3.4 bounds the first term in terms of the Frobenius norm, which is an upper bound for the $2\to\infty$ norm. For the second term, we have

as both and are diagonal matrices. Applying Lemma 3.2 thus yields

and hence

We now bound the third term. Let denote the th entry, and the th row, of the matrix . Observe that

Next, expanding the relevant entries, we see that each is a sum of independent, mean zero random variables whose coefficients come from the columns of the eigenvector matrix; since those columns are orthonormal, the squared coefficients sum to one. Therefore, Hoeffding’s inequality implies

Since there are entries , a simple union bound ensures that

and consequently that

The third term can therefore be bounded as

with probability at least . Combining the bounds for the above three terms yields Lemma 2.5. ∎

Proof of Theorem 2.6.

Let We assume that the event in Lemma 2.5 occurs and show that this implies the result. Since has distinct rows, it follows that

Let be -balls with radii around the distinct rows of . By the assumptions in Theorem 2.6, these balls are disjoint. Suppose there exists such that does not contain any rows of . Then , as for each , no row of is within of the (at least ) rows of in . This implies that

a contradiction. Therefore, . Hence, by the pigeonhole principle, each ball contains precisely one distinct row of .

If , then both and are elements of , and since there is exactly one distinct row of in , . Conversely, if , then and are in disjoint balls and for some , implying that . Thus, if and only if , proving the theorem. ∎

4 Degree-corrected SBM

In this section we extend our results to the degree-corrected SBM [karrer2011stochastic].

Definition 4.1 (Degree Corrected Stochastic Blockmodel (DCSBM)).

We say an RDPG is a DCSBM with blocks if there exist unit vectors such that for each , there exists and such that .

Remark.

This model is inherently more flexible than the standard SBM because it allows for vertices within each block/community to have different expected degrees. This flexibility has made it a popular choice for modeling network data [karrer2011stochastic].

For this model, we introduce the row-normalized latent positions, obtained by dividing each row of $X$ by its Euclidean norm so that each row has unit norm. As demonstrated in [qin2013dcsbm], a key to spectrally clustering DCSBM graphs is to project the spectral embedding onto the unit sphere, yielding an estimate of these normalized latent positions rather than an estimate of the latent positions themselves. As such, we also normalize the rows of $\hat X$, equivalently pre-multiplying $\hat X$ by the inverse of the diagonal matrix of its row norms, so that the result is the projection of $\hat X$ onto the unit sphere in $\mathbb{R}^d$. See Figure 2 for a simple example of this projection step.

Figure 2: Example of the projected embedding and the projected latent positions (the red/green and blue/purple points) for a 2-block DCSBM with 100 vertices in each block; the degree-correction factors are i.i.d. Uniform(0.2, 0.5).
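In code, the projection step might look as follows; the sketch is ours, with row normalization followed by $K$-means as one concrete instantiation of the procedure described above.

```python
import numpy as np
from sklearn.cluster import KMeans

def project_and_cluster(X_hat, K, seed=0):
    """Project each row of the adjacency spectral embedding onto the unit
    sphere (divide it by its Euclidean norm) and then apply K-means to the
    projected points, mirroring the projection step described above."""
    norms = np.linalg.norm(X_hat, axis=1, keepdims=True)
    X_proj = X_hat / np.clip(norms, 1e-12, None)           # guard against zero rows
    labels = KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(X_proj)
    return labels, X_proj
```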

Our next lemma is the analogue of Lemma 2.5 in the DCSBM setting, allowing us to tightly control the errors in the individual rows of .

Lemma 4.2.

In the setting of Lemma 2.5, let the matrices be the projections of and , respectively, onto . Let . If , then

Proof.

We have and . Straightforward calculations then yield

as desired. ∎

As in Theorem 2.6, this allows us to bound the probability of error-free MSE clustering.

Theorem 4.3 (Degree-corrected SBM).

Suppose and is a DCSBM with blocks and block membership function and suppose . Let be the unit vectors for the DCSBM and let denote the smallest scaling factor. Let be as in Theorem 2.6. Suppose is such that for all , . Let be the optimal MSE clustering of the rows of , the projection of onto , into clusters. Finally, let be the smallest block size. If

then with probability at least ,

The proof of this theorem follows mutatis mutandis from the proof of Theorem 2.6.

5 Strong universal consistency

We next show how our methodology can be used to prove strong universal consistency of -means clustering (as considered in [pollard81:_stron_k]) in the general RDPG setting. Specifically, suppose that is a sample of independent observations from some common compactly supported distribution on . Denote by the empirical distribution of the , and let be a set containing or fewer points. Suppose that is a continuous, nondecreasing function with . Now define and by

The problem of $K$-means mean square error clustering can then be viewed as the minimization of this objective over all sets containing $K$ or fewer elements. The strong consistency of $K$-means clustering then corresponds to the following statement.
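For concreteness, the empirical objective can be written in a few lines. The sketch below is ours and fixes the particular choice $\varphi(t) = t^2$, whereas the theory allows more general distortion functions.

```python
import numpy as np

def empirical_kmeans_objective(points, centers):
    """Empirical K-means objective: the average, over the sample, of the
    squared Euclidean distance from each point to its nearest center.  This
    corresponds to the choice phi(t) = t**2 for the distortion function."""
    sq_dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return sq_dists.min(axis=1).mean()
```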

Theorem 5.1 ([pollard81:_stron_k]).

Suppose that for each , there is a unique set for which

For any given , denote by a minimizer of over all sets containing or fewer elements. Then almost surely and almost surely.

We now state the counterpart to Theorem 5.1 for the RDPG setting.

Theorem 5.2 (RDPG).

Let the latent positions be independent samples from some common compactly supported distribution, and let the empirical distribution of the embedded points play the role of the empirical measure above. Consider a minimizer of the corresponding objective over all sets containing $K$ or fewer elements. Then, provided that the conditions in Theorem 5.1 hold, this minimizer converges almost surely to the population minimizer, and the associated objective value converges almost surely as well.

Proof.

We can suppose, without loss of generality, that is a distribution on a totally bounded set, say . Let denote the family of functions of the form where ranges over all subsets of containing or fewer points. The theorem is equivalent to showing

By Theorem 5.1, we know that

and so the theorem holds provided that

Let denote the summand in the above display. We then have the following bound

and hence

We thus have the bound

Now, by Lemma 2.5, converges to almost surely. Since is continuous on a compact set, it is uniformly continuous. Thus

as desired. ∎

6 Discussion

Lemma 2.5 provides a bound on the $2\to\infty$ norm of the difference between $\hat X$ and an orthogonal transformation of $X$. The ability to control the errors of individual rows of $\hat X$ allows us to prove asymptotically almost surely perfect clustering in the SBM and DCSBM, a substantive improvement over the best existing spectral clustering results.

Our approach can be easily modified to prove several extensions of Theorem 2.6. For example, we can consider the special case in which the model constants are fixed in $n$, whereupon the conditions of Theorem 2.6 are all satisfied for $n$ sufficiently large. In this setting, we can further suppose that the latent positions are drawn from a mixture of point masses; in other words, this is a stochastic blockmodel with independent, identically distributed block memberships (rather than fixed memberships) across vertices. Proving that the number of errors converges almost surely to zero is then an easy application of Theorem 2.6. Furthermore, our methods can be extended to alternate clustering procedures, such as Gaussian mixture modeling (see [suwan14:_empbayes]) or hierarchical clustering.

Indeed, one can construct many examples where perfect performance is achieved asymptotically (see Examples 1 and 2). We will not detail all regimes explicitly, but rather note that this theory can be easily applied to handle a growing number of blocks, possibly impacting several of the constants above, as well as moderately sparse regimes, impacting the maximum expected degree.

We believe Lemma 2.5 to be of independent interest apart from clustering. Indeed, Lemma 2.5 is a key result in proving consistency of a divide-and-conquer seeded graph matching procedure [lyzinski:_seeded]. The lemma also leads to an easy proof of the strong consistency of -nearest-neighbors for vertex classification, thereby extending the results of [sussman2012universally]. The lemma is a key component of the construction of a consistent two-sample graph hypothesis test [tang14:_two]. Additionally, we are exploring the implications of the lemma on parameter estimation for more general latent position random graphs.

The DCSBM is inherently more general than the SBM, and has key properties useful in modeling group structures in graphs. In [qin2013dcsbm], the authors provide complementary results for spectral analysis of the DCSBM without requiring lower bounds on the degrees; however, in turn, they obtain less-than-perfect clustering. Our results are the first to show that, depending on model parameters, the probability of perfect clustering tends to one as the number of vertices tends to infinity. The keys to the easy extension of these results to more general models are Lemmas 2.5 and 3.4, stated here in the RDPG setting.

For a general RDPG, there may not be a “natural” community structure. Nonetheless, the strong universal consistency result of Theorem 5.2 ensures that clustering the embedded graph will be asymptotically equivalent to clustering the true latent positions. Finding the $K$ centers of the estimated latent positions provides one way to approximate the distribution of the latent positions as a mixture of point masses corresponding to an SBM, where the distinct latent positions are given by those centers. Approximating a more general graph distribution as a stochastic blockmodel has been studied by [wolfe13:_nonpar] and [choi2014co], and here we have detailed one spectral solution to this problem. If $K$ is chosen appropriately, these approximations yield suitably parsimonious distributions that can be used for understanding large complex graphs, without requiring the estimation of a correspondingly complex distribution.
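As an illustration of this approximation, one might proceed as in the following sketch (ours, not from the paper): cluster the estimated latent positions, take the $K$ cluster centers as the distinct block positions, and form the implied block probability matrix from their dot products.

```python
import numpy as np
from sklearn.cluster import KMeans

def sbm_approximation(X_hat, K, seed=0):
    """Approximate a general random dot product graph by a K-block SBM: use
    the K-means centers of the estimated latent positions as the distinct
    block positions and form the implied block probability matrix from their
    dot products."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X_hat)
    centers = km.cluster_centers_                        # estimated block latent positions
    B_hat = np.clip(centers @ centers.T, 0.0, 1.0)       # K x K block probability matrix
    return km.labels_, centers, B_hat
```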

Acknowledgments

This work is partially supported by a National Security Science and Engineering Faculty Fellowship (NSSEFF), Johns Hopkins University Human Language Technology Center of Excellence, and the XDATA program of the Defense Advanced Research Projects Agency. Lastly, we would like to thank the anonymous referees for their helpful comments and suggestions and for suggesting Example 2.

References