Role of normalization in spectral clustering for stochastic blockmodels

Spectral clustering is a technique that clusters elements using the top few eigenvectors of their (possibly normalized) similarity matrix. The quality of spectral clustering is closely tied to the convergence properties of these principal eigenvectors. This rate of convergence has been shown to be identical for both the normalized and unnormalized variants in recent random matrix theory literature. However, normalization for spectral clustering is commonly believed to be beneficial [Stat. Comput. 17 (2007) 395-416]. Indeed, our experiments show that normalization improves prediction accuracy. In this paper, for the popular stochastic blockmodel, we theoretically show that normalization shrinks the spread of points in a class by a constant fraction under a broad parameter regime. As a byproduct of our work, we also obtain sharp deviation bounds of empirical principal eigenvalues of graphs generated from a stochastic blockmodel.


1 Introduction

Networks appear in many real-world problems. Any dataset of co-occurrences or relationships between pairs of entities can be represented as a network. For example, the Netflix data can be thought of as a giant bipartite network between customers and movies, where edges are formed via ratings. Facebook is a network of friends, where edges represent who knows whom. Weblogs link to other blogs and give rise to blog networks. Networks can also be implicit; for example, in machine learning they are often built by computing pairwise similarities between entities.

Many problems in machine learning and statistics center on community detection. Viral marketing relies on understanding how information propagates through friendship networks, and community detection is key to this. Link farms on the World Wide Web are malicious, tightly connected clusters of webpages that exploit web-search algorithms to inflate their rank. These need to be identified and removed so that search results remain authentic and do not mislead users.

Spectral clustering [7, 9] is a widely used network clustering algorithm. The main idea is to first represent the ith entity by a vector obtained by concatenating the ith elements of the top few eigenvectors of a graph, and then cluster this lower-dimensional representation. We will refer to this as the spectral representation. Owing to its computational ease and competitive performance, applications of spectral clustering range widely, from parallel computing [12], computer-aided design (CAD) [11] and parallel sparse matrix factorization [19] to image segmentation [23], general clustering problems in machine learning [17] and, most recently, fitting and classification using network blockmodels [21, 24].

A stochastic blockmodel is a widely used generative model for networks with labeled nodes [13, 21]. It assigns nodes to a fixed number of classes and forces all nodes in the same class to be stochastically equivalent. For example, in a two-class stochastic blockmodel, any pair of nodes belonging to different classes link with one probability (a deterministic quantity, possibly dependent on the size of the graph), whereas any pair belonging to class one (class two) link with a second (third) probability.

Recently the consistency properties of spectral clustering in the context of stochastic blockmodels have attracted a significant amount of attention. Rohe, Chatterjee and Yu [21] showed that, under general conditions, for a sequence of normalized graphs with growing size generated from a stochastic blockmodel, spectral clustering yields the correct clustering in the limit. In a subsequent paper, Sussman et al. [24] showed that an analogous statement holds for an unnormalized sequence of graphs. For finite graphs, the above results can also be obtained by direct application of results from [18].

This prior theoretical work does not distinguish between normalized and unnormalized spectral clustering, and hence cannot be used to support the common practice of normalizing matrices for spectral clustering. In this paper, we present both theoretical arguments and empirical results to give a quantitative argument showing that normalization improves the quality of clustering. While existing work [21, 24] bounds the classification accuracy, we do not take this route, since upper bounds cannot be used to compare two methods. Instead, we focus on the variance within a class under the spectral representation using the top eigenvectors. In this representation, by virtue of stochastic equivalence, points are identically distributed around their respective class centers. Hence the empirical variance can be computed using the average squared distance of points from their class center.

In this setting, the distance between the class centers can be thought of as bias; we show that this distance approaches the same deterministic quantity with or without normalization. Surprisingly, we also prove that normalization reduces the variance of points in a class by a constant fraction for a large parameter regime. So normalization does not change the bias, but shrinks the variance asymptotically. However, our results also indicate that the variance of points in a class increases as the graph gets sparser; hence methods which reduce the within-class variance are desired.

A simple consequence of our result is that in the completely associative case (no cross-class edges) as well as the completely dissociative case (no within-class edges), the within-class variance of the spectral embedding is asymptotically four times smaller when the matrix is normalized. While the completely associative case is on its own uninteresting, we build the proof of the general case using similar ideas and techniques.

Our results indicate that normalization has a clear edge when the parameters are close to the completely associative or completely dissociative settings. These seemingly easy-to-cluster regimes can be relatively difficult in sparse networks. Of course, as the network grows, both methods have enough data to distinguish between the clusters and behave similarly. But for small, sparse graphs this is indeed an important regime.

Sussman et al. [24] present a parameter setting where normalization is shown empirically to hurt classification accuracy. We show that this is only a partial picture: in fact, there is a large parameter regime where spectral clustering with normalized matrices yields tighter, and hence better, clusters.

Using quantifiable link-prediction experiments on real-world graphs and classification tasks on labeled simulated graphs, we show that normalization leads to better classification accuracy in the regime dictated by our theory, and yields higher link-prediction accuracy on sparse real-world graphs.

We conclude the introduction with a word of caution. Our asymptotic theory is valid in the degree regime where networks are connected with probability approaching one. However, finite sparse networks can have disconnected or weakly connected small components, in the presence of which the normalized method returns uninformative principal eigenvectors with support on the small components. This makes classification worse than with the unnormalized method, whose principal eigenvectors remain informative in spite of having high variance. Hence, our asymptotic results should be used only as guidance for finite graphs, not as a hard rule. We deal with this problem by removing low-degree nodes and performing experiments on the giant component of the remaining network.

2 Preliminaries and import of previous work

In this paper we will only work with two-class blockmodels. Given a binary class membership matrix, the edges of the network are simply outcomes of independent Bernoulli coin flips. The stochastic blockmodel ensures stochastic equivalence of nodes within the same block; that is, all nodes within the same block have identical probabilities of linking with other nodes in the graph.

Thus the conditional expectation of the adjacency matrix can be described by three probabilities: the probabilities of connecting within the first and the second class, respectively, and the probability of connecting across the two classes. All statements in this paper are conditioned on the class memberships.

Definition (A stochastic blockmodel). Let the class membership matrix be fixed and unknown, with every row containing exactly one 1 and with the numbers of ones in the first and second columns equal to the two class sizes. A stochastic blockmodel with these parameters generates symmetric graphs whose adjacency matrix has a zero diagonal and whose upper-triangular entries are independent Bernoulli variables, with success probability equal to the within-class probability of the first class for pairs inside the first class, the within-class probability of the second class for pairs inside the second class, and the across-class probability for pairs straddling the two classes.

For ease of exposition we will assume that the rows and columns of the adjacency matrix are permuted so that all elements of the same class are grouped together. The conditional expectation of the adjacency matrix is then a blockwise constant matrix with zero diagonal by construction.
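To make the generative model concrete, here is a minimal sampler written for illustration (it is not the authors' code; the parameter names n1, n2, p1, p2 and q are ours), which draws a symmetric, zero-diagonal adjacency matrix from a two-class blockmodel.

    import numpy as np

    def sample_blockmodel(n1, n2, p1, p2, q, seed=None):
        """Sample a symmetric, zero-diagonal adjacency matrix from a two-class
        stochastic blockmodel: nodes 0..n1-1 form class one, the rest class two;
        p1, p2 are the within-class link probabilities and q the across-class one."""
        rng = np.random.default_rng(seed)
        n = n1 + n2
        z = np.repeat([0, 1], [n1, n2])               # class labels
        B = np.array([[p1, q], [q, p2]])              # block probability matrix
        P = B[np.ix_(z, z)]                           # blockwise constant link probabilities
        upper = np.triu(rng.random((n, n)) < P, k=1)  # independent Bernoulli upper triangle
        A = (upper | upper.T).astype(float)           # symmetrize; diagonal stays zero
        return A, z

For example, sample_blockmodel(100, 100, 0.1, 0.1, 0.02) draws a sparse, associative two-class graph.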

We use a parametrization similar to that in [3] to allow for edge probabilities that decay as the graph grows. Formally, the three block probabilities are proportional to a common rate that may tend to zero with the number of nodes, forcing all edge probabilities to decay at the same rate. Thus, in orders of magnitude, any of the three probabilities can be replaced by this common rate; for example, the expected degree of nodes in either class is of the order of the number of nodes times this rate. We use a generic symbol to denote a positive constant. All expectations are conditioned on the class memberships; for notational convenience we suppress this conditioning in the notation.

First we consider the eigenvalues and eigenvectors of the conditional expectation matrix without the constraint of zero diagonals. This blockwise constant matrix has at most two nonzero eigenvalues, whose magnitudes grow linearly with the number of nodes, while all remaining eigenvalues are zero. Since zeroing out the diagonal perturbs the matrix, in operator norm, by no more than the largest within-class probability, Weyl's inequality shows that the principal eigenvalues of the population matrix grow at the same linear rate, whereas all other eigenvalues are at most of the order of the edge probabilities.
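This rank-two structure, and the fact that zeroing the diagonal moves each eigenvalue by at most the largest diagonal entry, can be checked numerically. The sketch below is our own illustration with arbitrary parameter values, not code from the paper.

    import numpy as np

    n1, n2, p1, p2, q = 200, 300, 0.3, 0.2, 0.05   # hypothetical block parameters
    z = np.repeat([0, 1], [n1, n2])                # class labels
    B = np.array([[p1, q], [q, p2]])
    M = B[np.ix_(z, z)]                            # blockwise constant, diagonal not zeroed
    P = M - np.diag(np.diag(M))                    # population matrix with zero diagonal

    eig_M = np.sort(np.abs(np.linalg.eigvalsh(M)))[::-1]
    eig_P = np.sort(np.abs(np.linalg.eigvalsh(P)))[::-1]
    print(eig_M[:3])  # two eigenvalues of order n; the rest are numerically zero
    print(eig_P[:3])  # each shifted by at most max(p1, p2), as Weyl's inequality predicts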

We order eigenvalues in decreasing order of absolute value and denote the eigenvectors and eigenvalues of the population matrix, together with their empirical counterparts computed from the adjacency matrix. The first two population eigenvectors are piecewise constant over the classes.

Now we define the normalized counterparts of the above quantities: the adjacency matrix and the population matrix are each pre- and post-multiplied by the inverse square root of the diagonal matrix of degrees and of expected degrees, respectively. We again order eigenvalues by absolute value and consider the first two eigenvectors and eigenvalues of the normalized population matrix; as in the unnormalized case, these two population eigenvectors are piecewise constant. Their empirical counterparts are the leading eigenvectors and eigenvalues of the normalized adjacency matrix. One interesting fact is that the entries of the leading empirical eigenvector of the normalized adjacency matrix are proportional to the square roots of the node degrees; however, no such explicit form is available for the second eigenvector.

Among the many variants of spectral clustering, we consider the algorithm used in [21]. The idea is to form a matrix whose columns are the top eigenvectors of the (possibly normalized) adjacency matrix and to apply the k-means algorithm to its rows. The k-means algorithm searches over different clusterings and returns a local optimum of an objective function that minimizes the squared Euclidean distance of points from their respective cluster centers. The resulting clusters are taken as estimates of the blocks.
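A minimal sketch of this pipeline, written by us for illustration rather than being the authors' implementation, assuming the symmetric degree normalization described above and using numpy for the eigendecomposition and scikit-learn's KMeans for the clustering step:

    import numpy as np
    from sklearn.cluster import KMeans

    def spectral_clusters(A, k=2, normalize=True):
        """Run k-means on the rows of the n x k matrix whose columns are the
        top-k eigenvectors (largest in absolute eigenvalue) of A or of its
        degree-normalized version."""
        M = A.astype(float)
        if normalize:
            d = M.sum(axis=1)
            d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard against isolated nodes
            M = d_inv_sqrt[:, None] * M * d_inv_sqrt[None, :]
        vals, vecs = np.linalg.eigh(M)
        top = np.argsort(np.abs(vals))[::-1][:k]
        X = vecs[:, top]                    # spectral representation of the nodes
        return KMeans(n_clusters=k, n_init=10).fit_predict(X)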

Probabilistic bounds on the misclassification error of spectral clustering under the stochastic blockmodel have been obtained in previous work [21, 24]. However, upper bounds cannot be used for comparing two algorithms. Instead, we define a simple clustering quality metric, computable in terms of an appropriately defined deviation of the empirical eigenvectors from their population counterparts, and we show that it is improved by normalization.

2.1 Quality metrics

The quality metrics are defined as follows: the algorithm passes the empirical eigenvectors to an oracle who knows the cluster memberships. The oracle computes the cluster centers for us, namely the average spectral representation of the points in each class. The quality metric for a class is the mean squared distance of the points in that class from its center; from now on we distinguish the distances obtained from the unnormalized and the normalized method.

To be concrete, for the unnormalized method the metric for the first class is the average, over points in that class, of the squared Euclidean distance of their spectral representation from the class center; the metric for the second class is defined in the same way, and the normalized analogues are defined from the normalized eigenvectors. We use generic notation when we refer to the corresponding quantities without particular reference to a specific method.
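For concreteness, here is a small sketch (our own, with hypothetical variable names) of how these oracle quantities can be computed from a spectral representation X and the true labels; it returns the two within-class mean squared distances and the squared distance between the class centers.

    import numpy as np

    def class_spread(X, labels):
        """X: n x k spectral representation; labels: true class labels (0/1)."""
        c0 = X[labels == 0].mean(axis=0)            # oracle center of class one
        c1 = X[labels == 1].mean(axis=0)            # oracle center of class two
        d0 = ((X[labels == 0] - c0) ** 2).sum(axis=1).mean()
        d1 = ((X[labels == 1] - c1) ** 2).sum(axis=1).mean()
        between = ((c0 - c1) ** 2).sum()            # squared distance between centers
        return d0, d1, between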

Although this metric looks like a simple average of squared distances, it carries useful information about the quality of the clustering. For definiteness, take the unnormalized case and examine the points in the first class. By stochastic equivalence, their spectral representations are identically distributed (albeit dependent) random variables, and the metric is essentially the trace of their sample covariance matrix; hence it measures the variance of these random variables.

Ideally, for a good clustering algorithm, with or without normalization, the within-class spread should be small relative to the distance between the class centers; we will show that this ratio converges to zero at the same rate with or without normalization, consistent with previous work [18, 21, 24]. Furthermore, we will show that the distance between the class centers approaches the same limit for the two methods; that is, the two methods do not differ in this respect.

Interestingly, our results also imply that the within-class variance increases as the graphs become sparser, that is, as the edge probabilities decrease. Hence, a method that reduces the variance of points in a class by a constant fraction is preferable for sparse graphs. Indeed, we show that the ratio of the normalized to the unnormalized within-class variance converges to a constant smaller than one for a broad range of parameter settings. In the simple disconnected case with no cross-class edges, this constant is one quarter.

Another advantage of this metric is that it can be conveniently expressed in terms of an appropriately defined deviation of the empirical eigenvectors from their population counterparts. For any pair consisting of a population eigenvector and the corresponding empirical eigenvector, we consider the orthogonal decomposition of the empirical eigenvector into its projection onto the population eigenvector plus a residual orthogonal to it. The norm of the residual measures the deviation of the empirical eigenvector from its population counterpart; the deviation in the normalized case is measured in the same way.
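Written out in generic notation (the symbols below are chosen here for illustration and are not necessarily those of the paper), the decomposition of a unit-norm empirical eigenvector $\hat v$ against its unit-norm population counterpart $v$ is

    \hat v \;=\; \langle \hat v, v \rangle\, v \;+\; r,
    \qquad r \perp v,
    \qquad \|r\|^2 \;=\; 1 - \langle \hat v, v \rangle^2,

so that the residual norm $\|r\|$ quantifies how far $\hat v$ deviates from the direction of $v$.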

Since we are interested in two-class blockmodels, we will mostly work with the residuals of the first two empirical eigenvectors from their population counterparts, for both the unnormalized and the normalized method. We also use the class-wise average of a vector, that is, the average of its entries restricted to one class. A key fact is that the projection components, which lie along the piecewise-constant population eigenvectors, are themselves piecewise constant, which yields the following expressions:

(1)
(2)

We will use the above expressions to compute the within-class spread for both the unnormalized and the normalized method; even though they are stated for the unnormalized eigenvectors, we abuse notation slightly and apply them with the normalized eigenvectors and their population counterparts. For a wide regime of parameters, we prove that the normalized spread is asymptotically a constant factor smaller, and hence better, than the unnormalized one. First, using results from [10], we prove this in the completely associative case, where the result follows from existing results on Erdős–Rényi graphs [10] and a simple application of Taylor's theorem. In order to generalize the result to a nonzero cross-class probability, we need new convergence results for the empirical eigenvalues and eigenvectors of graphs generated from a stochastic blockmodel. All results rely on the following assumption on the edge probabilities:

We assume that the expected degree grows faster than the logarithm of the number of nodes as the graph grows.

This assumption ensures, with high probability, that the sequence of growing graphs is not too sparse. The expected degree then grows faster than the logarithm of the number of nodes, which is the most commonly used regime in which norm convergence of such matrices can be shown [6, 18, 5]. Note that this growth rate is also the sharp threshold for connectivity of Erdős–Rényi graphs [4]. We will now formally define the sparsity regime in which we derive our results.

Definition (A semi-sparse stochastic blockmodel). Consider a stochastic blockmodel as in the definition above, with the three block probabilities deterministic and proportional to a common rate. If this rate satisfies Assumption 2.1, we call the model a semi-sparse stochastic blockmodel.

The paper is organized as follows: we present the main results in Section 3. The proof of the simple case is in Section 4, whereas the expressions for the general case are derived in Sections 5 and 6. Experiments on simulated and real data appear in Section 7. The proofs of some accompanying lemmas and ancillary results are omitted from the main manuscript for ease of exposition and are deferred to the Supplement [22].

2.2 Import of previous work

By virtue of the stochastic equivalence of points belonging to the same class, the population eigenvectors map the data to one distinct point per class. This is why the consistency of spectral clustering is closely tied to consistency properties of the empirical eigenvalues and eigenvectors. We will show that current theoretical work on eigenvector consistency does not distinguish between the use of the normalized and the unnormalized matrix.

One of the earlier results on the consistency of spectral clustering can be found in [26], where weighted graphs generated from a geometric generative model are considered. While this is important work, it does not apply to our random network model.

For any symmetric adjacency matrix with independent entries, one can use random matrix theory results of Oliveira [18] to show that the empirical eigenvectors of a semi-sparse stochastic blockmodel converge to their population counterparts at the same rate with or without normalization. With the failure probability and the expected degree denoted as in [18], the result is the following:

Theorem 2.1 ((Theorem 3.1 of [18]))

For any constant , there exists another constant , independent of or , such that the following holds. Let , . If , then for all ,

Moreover, if , then for the same range of ,

Let the orthogonal projector onto the space spanned by the eigenvectors corresponding to eigenvalues in a given interval be defined for both the population and the empirical matrix. A simple consequence of Theorem 2.1 is that, for suitably separated population eigenvalues, the operator norm of the difference of these projectors also converges to zero.

Corollary 2.1 ((Corollary 3.2 of [18]))

Given some , let be the set of all pairs such that , and has no eigenvalues in . Then for ,

(3)

Similarly define . Then for ,

(4)

In particular the right-hand sides of the above equations hold with probability for any .

A straightforward application of this corollary yields that spectral clustering for a stochastic blockmodel leads to convergence of the empirical eigenvectors to their population counterparts, with or without normalization. Further analysis shows that the fraction of misclassified nodes goes to zero at the same rate for the two methods. We defer the proof to Section B of the Supplement [22].

Corollary 2.2

Let be generated from a semi-sparse stochastic blockmodel (Definition 2.1) with and . Then, for , . Furthermore the fraction of misclassified nodes can be bounded by for both methods.

Spectral clustering with the normalized matrix derived from a stochastic blockmodel with a growing number of blocks has been shown to be asymptotically consistent [21]; further, the fraction of mis-clustered nodes converges to zero under general conditions. These results were extended to show that spectral clustering on the unnormalized matrix enjoys similar asymptotic properties [24]. Sussman et al. [24] also give an example of a parameter setting for a stochastic blockmodel where spectral clustering using the unnormalized matrix outperforms the normalized one. We, however, demonstrate using theory and experiments that this is only a partial picture, and that there is a large regime of parameters where normalization indeed improves performance.

For ease of exposition, we list the different variables and their orders of magnitude in Table 1. For deterministic quantities, the asymptotic-order notation means that their ratio converges to some constant as the number of nodes grows; for random variables, the analogous notation means that the ratio converges in probability to a constant. For the scope of this paper, the norm of a vector is the Euclidean norm, unless otherwise specified.

Table 1: Table of notation.
- the common edge-probability rate
- the identity matrix
- the number of nodes in the network
- the binary matrix of class memberships
- the two groups (classes) of nodes
- the diagonal matrix of degrees
- the diagonal matrix of expected degrees, conditioned on the class memberships
- the adjacency matrix
- the average of a vector restricted to one class
- the eigenvalues of the population matrix and of the adjacency matrix, ordered by magnitude, and the corresponding eigenvectors
- the eigenvalues and eigenvectors of the normalized population matrix and of the normalized adjacency matrix, ordered by magnitude
- the matrix with the top two empirical eigenvectors along its columns, its population variant, and the variant built from the eigenvectors of the normalized matrix

3 Main results

For the general case we derive the following asymptotic expressions for the within-class spread and for the distance between class centers. We recall that the former measures the variance of points in a class under the spectral representation, whereas the latter measures the separation of the class centers, which can be thought of as bias. We will show that normalizing asymptotically reduces the variance without affecting the bias. The proofs can be found in Sections 5 and 6. Let the adjacency matrix be generated from a semi-sparse stochastic blockmodel; with the quantities defined as in Table 1, we have:

(5)
(6)

Let the adjacency matrix be generated from a semi-sparse stochastic blockmodel. With the quantities defined as in Table 1, we also have:

(8)

Before explaining the above theorems, we present a special case for clarity. (A special case.) In the completely associative case, where the cross-class probability is zero, the expressions above simplify and immediately show that normalization shrinks the within-class variance of the spectral embedding by a factor of four. Now consider the completely dissociative case, where both within-class probabilities are zero. Substituting the corresponding values into the distance formulas again shows that normalization shrinks the within-class variance by a factor of four.
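This factor-of-four shrinkage is easy to probe empirically. The sketch below is our own rough check with arbitrary parameter choices (two disconnected Erdős–Rényi blocks of 500 nodes each with edge probability 0.1); it compares the within-class spread of the top-two spectral embedding with and without normalization.

    import numpy as np

    def embed(A, normalize):
        """Top-two spectral representation of A (optionally degree-normalized)."""
        M = A.copy()
        if normalize:
            d_is = 1.0 / np.sqrt(np.maximum(M.sum(axis=1), 1e-12))
            M = d_is[:, None] * M * d_is[None, :]
        vals, vecs = np.linalg.eigh(M)
        return vecs[:, np.argsort(np.abs(vals))[::-1][:2]]

    rng = np.random.default_rng(0)
    n1 = n2 = 500
    p = 0.1                                   # within-class probability; cross-class probability is zero
    A = np.zeros((n1 + n2, n1 + n2))
    for lo, hi in [(0, n1), (n1, n1 + n2)]:   # two disconnected Erdos-Renyi blocks
        blk = np.triu(rng.random((hi - lo, hi - lo)) < p, k=1)
        A[lo:hi, lo:hi] = blk | blk.T

    def spread(X):                            # mean squared distance from the class centers
        c1, c2 = X[:n1].mean(0), X[n1:].mean(0)
        return 0.5 * (((X[:n1] - c1) ** 2).sum(1).mean()
                      + ((X[n1:] - c2) ** 2).sum(1).mean())

    print(spread(embed(A, True)) / spread(embed(A, False)))  # roughly 1/4 for large graphs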

We call the completely associative case the zero-communication case; the graph can then be thought of as two disconnected Erdős–Rényi graphs. Under Assumption 2.1 each of the smaller graphs will be connected with probability tending to one. von Luxburg [25] already established that spectral clustering achieves perfect classification in this scenario. We present this simple setting because the ideas and proof techniques used for this case carry over to the general case with a nonzero cross-class probability. In particular, our results indicate that in the general case, for parameter regimes close to the completely associative or completely dissociative models, the normalized method has a clear edge.

Corollary 3.1

Let be the adjacency matrix generated from a semi-sparse stochastic blockmodel (see Definition 2.1) where and . We have

(9)
(10)
(11)
(12)

The same holds for normalized and unnormalized versions of and .

While the within-class spreads obtained from the unnormalized and the normalized method both approach zero in probability, the distance between the class centers does not, for either method. In our regime this translates to perfect classification asymptotically, which is not unexpected, because existing literature has established that spectral clustering with both the unnormalized and the normalized matrix is consistent in the semi-sparse regime. Also, the ratio of the center distances approaches one; thus, if the limiting ratio of the within-class spreads is smaller (larger) than one, there is some indication that the normalized (unnormalized) method is to be preferred.

For simplicity, we consider a stochastic blockmodel with two equal-sized classes and equal within-class probabilities. We will now show that for this simple model, in the semi-sparse regime, our quality metric indicates that normalization always improves performance. In the dense regime, that is, when the degree grows linearly with the number of nodes, there are parts of the parameter space where our quality metric prefers the unnormalized method. However, the network is then so dense that the two class centers are well separated, leading to equally good performance of both methods.

For this simple model, the limiting ratio has a concise form in the semi-sparse case, which is presented in Corollary 3.2, and plotted in Figure 1(A). Figure 1(B) shows the contour plot of the limiting ratio in the dense case. We also highlight the parameter regime where the ratio is close to or larger than one. Finally Figure 1(C) focuses on this highlighted area.

Figure 1: Simple blockmodel with two equal-sized classes and equal within-class probabilities. (A) Limiting ratio of the normalized to the unnormalized within-class spread in the semi-sparse case. (B) Contour plot of the ratio in the dense case; the rectangular area marks parameter settings leading to a ratio larger than one. (C) Surface plot of the ratio in the regime where it is close to or larger than one, with the within-class and cross-class probabilities varying along the two horizontal axes; for reference, the plane at height one is also plotted.
Corollary 3.2

Let the adjacency matrix be generated from the simple blockmodel above, with equal class sizes and equal within-class probabilities. In the semi-sparse regime we have the following limit, which is always smaller than one:

(13)

On the other hand, when the edge probabilities are constant with respect to the number of nodes, the above ratio is smaller than one except for certain parameter combinations. The universal upper bound is 1.31.

Here we summarize the results in the above corollary.

In the semi-sparse regime [Figure 1(A)], the limiting ratio is always less than one, thus favoring the normalized method.

In the dense case [Figure 1(B) and (C)], where the edge probabilities are constant with respect to the number of nodes, this ratio can be larger than one in part of the parameter space, with an upper bound of 1.31; the upper bound is approached for large pairs of block probabilities. In this dense regime, both methods perform equally well on any reasonably sized network. Using simulations on small networks (twenty nodes), we found that the two methods perform comparably in terms of misclassification error.

Because of the inherent symmetry of the simple model, the limiting ratio in the semi-sparse regime takes the same value at the completely associative and the completely dissociative extremes. This again shows that normalization provides a clear edge close to the completely associative (no cross-class edges) and completely dissociative (no within-class edges) cases. We point out that in the simulated experiment of Sussman et al. [24] where the unnormalized method performs better than the normalized one, the analytic ratio is indeed larger than one, and the graph is very close to an Erdős–Rényi graph.

3.1 A shortcoming of asymptotic analysis

While Corollary 3.2 suggests that normalization always reduces the within-class variance in the semi-sparse degree regime, there is one caveat to this asymptotic result. In the semi-sparse regime, the network is connected with high probability as the graph grows. However, finite sparse networks may have disconnected components consisting of a few nodes. In such scenarios, the normalized method by construction assigns eigenvalue one to eigenvectors supported on each of the connected components. As a result, the leading eigenvectors are uninformative, leading to poor performance. The unnormalized method does not suffer from this problem and has the informative eigenvectors as its leading eigenvectors, albeit with high variance. We get around this problem by removing small-degree nodes and then working on the largest connected component. We also point out this problem in the discussion section.
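A minimal sketch of this preprocessing step, as we implement it for illustration (the degree threshold is an arbitrary choice, not one prescribed by the paper): drop low-degree nodes, then keep the largest connected component of what remains.

    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components

    def giant_component(A, min_degree=2):
        """A: symmetric 0/1 numpy adjacency matrix.  Returns the adjacency
        matrix of the largest connected component after removing nodes of
        degree < min_degree, together with the indices of the kept nodes."""
        keep = np.where(A.sum(axis=1) >= min_degree)[0]
        A_sub = A[np.ix_(keep, keep)]
        n_comp, labels = connected_components(csr_matrix(A_sub), directed=False)
        giant = np.where(labels == np.bincount(labels).argmax())[0]
        return A_sub[np.ix_(giant, giant)], keep[giant]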

3.2 Accuracy of the analytic ratio

Finally, we also use simulations to check how accurate the analytic ratio is. We vary the block-probability parameters over a grid and compare the simulated within-class spreads with their analytic counterparts, noting that the ratio increases for large pairs of probabilities. The mean, median and maximum absolute relative errors of the simulated quantities from their analytic counterparts are 0.02, 0.02 and 0.1 for one of the two spreads and 0.001, 0.001 and 0.03 for the other. In both cases the maximum occurs for the parameter combination with the sparsest graphs, which is the most unstable. Since all our expressions are asymptotic, these errors are acceptable for this experiment.

4 The zero communication case

We will now present our result for two-class blockmodels (see Definition 2.1) with no cross-class edges. We will make heavy use of an orthogonal decomposition of the population eigenvectors.

Since there are no cross-class edges, the graph can be thought of as two disconnected Erdős–Rényi graphs, one on each class (with the two adjacency matrices denoted accordingly). Without loss of generality we order the classes so that the first block carries the larger principal eigenvalue, and we assume that the rows and columns of the adjacency matrix are permuted so that the nodes of the first class come first. (We will not use this in our proofs; it only helps the exposition.)

Füredi and Komlós [10] show that the principal eigenvalue of each block concentrates around its expectation. Hence, for large graphs, the second largest eigenvalue of the full adjacency matrix comes from the second block, and the corresponding empirical eigenvector has zeros along the first class, just like the second population eigenvector; analogously, the first empirical eigenvector has zeros along the second class.

Further, some algebra expresses the within-class spreads in terms of these eigenvectors. Computing them requires the norm and the class-wise average of the residual of each empirical eigenvector restricted to the corresponding class; see equation (1). In the zero-communication case this reduces to examining these quantities for the principal eigenvectors of two Erdős–Rényi graphs.

Let us consider a single Erdős–Rényi graph. Since self-loops are prohibited, the conditional expectation matrix is the constant matrix with the diagonal zeroed out; its largest eigenvalue equals the edge probability times one less than the number of nodes, while all remaining eigenvalues equal the negative of the edge probability. We denote the degree of a node and the average degree in the usual way.

Let the principal eigenvalue and eigenvector pair of this conditional expectation matrix and of its normalized counterpart be given, with their empirical counterparts defined analogously. In this simple case the two population eigenvectors coincide. We require that all eigenvectors have unit length; in particular, the population principal eigenvector is the constant vector whose entries all equal the reciprocal of the square root of the number of nodes. We will prove a bound on the residual of the empirical principal eigenvector, which will help us prove Corollary 3.1.
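These statements about the principal eigenpair of a single Erdős–Rényi graph are easy to eyeball numerically; the sketch below is our own check with arbitrary parameters.

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 2000, 0.05
    upper = np.triu(rng.random((n, n)) < p, k=1)
    A = (upper | upper.T).astype(float)           # Erdos-Renyi graph, no self-loops

    vals, vecs = np.linalg.eigh(A)
    lam1, v1 = vals[-1], vecs[:, -1]              # principal eigenpair
    v1 = v1 * np.sign(v1.sum())                   # fix the sign convention

    print(lam1, n * p)                            # leading eigenvalue is close to the expected degree
    ones = np.ones(n) / np.sqrt(n)                # constant unit vector (population eigenvector)
    print(np.linalg.norm(v1 - ones))              # small residual, shrinking as the graph grows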

Before proceeding with the result, for ease of exposition we recall the orders of magnitude of some random variables used in the proof. The centered total degree is simply twice a sum of centered Bernoulli variables; the latter order-of-magnitude statement can be obtained by showing that the expectation is of the claimed order while the standard deviation is of a smaller order. A detailed proof can be found in [10].

Lemma

Write the first population eigenvector of an Erdős–Rényi graph adjacency matrix as . If satisfies Assumption 2.1, we have

Before delving into the proof, we state the main result from [10]: for an Erdős–Rényi graph, the principal empirical eigenvalue concentrates around the expected degree, so it is of the same order. As the explicit form of the principal empirical eigenvector is not known, the following step is used to compute the norm of its residual:

(14)

The proof is straightforward. First we see that

Using , , thus proving equation (14). Now equation (14) and standard norm-inequalities yield , where is the th largest eigenvalue of .

Now, using results from [8] we have , and hence . Interestingly, note that , and hence . Hence . Combining this with the former upper bound, we have