A Spectral Algorithm with Additive Clustering for the Recovery of Overlapping Communities in Networks

06/12/2015 · by Emilie Kaufmann et al.

This paper presents a novel spectral algorithm with additive clustering designed to identify overlapping communities in networks. The algorithm is based on geometric properties of the spectrum of the expected adjacency matrix in a random graph model that we call stochastic blockmodel with overlap (SBMO). An adaptive version of the algorithm, that does not require the knowledge of the number of hidden communities, is proved to be consistent under the SBMO when the degrees in the graph are (slightly more than) logarithmic. The algorithm is shown to perform well on simulated data and on real-world graphs with known overlapping communities.


1 Introduction

Many datasets (e.g., social networks, gene regulation networks) take the form of graphs whose structure depends on some underlying communities. The commonly accepted definition of a community is that nodes tend to be more densely connected within a community than with the rest of the graph. Communities are often hidden in practice, and recovering the community structure directly from the graph is a key step in the analysis of these datasets. Spectral algorithms are popular methods for detecting communities [Von Luxburg, 2007] and consist of two phases. First, a spectral embedding is built, where the nodes of the graph are projected onto some low-dimensional space generated by well-chosen eigenvectors of some matrix related to the graph (e.g., the adjacency matrix or a Laplacian matrix). Then, a clustering algorithm (e.g., $k$-means or $k$-median) is applied to the embedded vectors to obtain a partition of the nodes into communities.
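To make this two-phase pipeline concrete, here is a minimal NumPy sketch of non-overlapping spectral clustering (embedding with the leading eigenvectors of the adjacency matrix, then plain $k$-means). It only illustrates the generic method recalled above, not the algorithm proposed in this paper; the function name and the Lloyd-style $k$-means loop are our own choices.

```python
import numpy as np

def spectral_partition(adjacency: np.ndarray, n_communities: int,
                       n_iter: int = 50, seed: int = 0) -> np.ndarray:
    """Two-phase spectral clustering: spectral embedding, then plain k-means (Lloyd iterations)."""
    rng = np.random.default_rng(seed)
    # Phase 1: embed each node with the K eigenvectors of largest |eigenvalue|.
    eigvals, eigvecs = np.linalg.eigh(adjacency)
    embedding = eigvecs[:, np.argsort(-np.abs(eigvals))[:n_communities]]   # shape (n, K)
    # Phase 2: k-means on the embedded vectors.
    centers = embedding[rng.choice(len(embedding), n_communities, replace=False)]
    labels = np.zeros(len(embedding), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(((embedding[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(n_communities):
            if np.any(labels == k):
                centers[k] = embedding[labels == k].mean(axis=0)
    return labels   # one community label per node (a partition, no overlaps)
```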

It turns out that the structure of many real datasets is better explained by overlapping communities. This is particularly true in social networks, in which the neighborhood of any given node is made of several social circles that naturally overlap [McAuley and Leskovec, 2012]. Similarly, in co-authorship networks, authors often belong to several scientific communities, and in protein-protein interaction networks, a given protein may belong to several protein complexes [Palla et al., 2005]. In this case, the communities do not form a partition of the graph and new algorithms need to be designed. This paper presents a novel spectral algorithm, called spectral algorithm with additive clustering (SAAC). The algorithm combines a spectral embedding based on the adjacency matrix of the graph with an additive clustering phase designed to find overlapping communities. The proposed algorithm does not require the knowledge of the number of communities present in the network, and can thus be qualified as adaptive.

SAAC belongs to the family of model-based community detection methods, which are motivated by a random graph model depending on some underlying set of communities. In the non-overlapping case, spectral methods have been shown to perform well under the stochastic block model (SBM), introduced by Holland and Leinhardt [Holland and Leinhardt, 1983]. Our algorithm is inspired by the simplest possible extension of the SBM to overlapping communities, which we refer to as the stochastic blockmodel with overlaps (SBMO). In the SBMO, each node is associated with a binary membership vector, indicating all the communities to which the node belongs. We show that exploiting an additive structure in the SBMO leads to an efficient method for the identification of overlapping communities. To support this claim, we provide consistency guarantees when the graph is drawn under the SBMO, and we show that SAAC exhibits state-of-the-art performance on real datasets for which ground-truth communities are known.

The paper is structured as follows. In Section 2, we cast the problem of detecting overlapping communities into that of estimating a membership matrix in the SBMO model, introduced therein. In Section 3, we compare the SBMO with alternative random graph models proposed in the literature, and review the algorithms inspired by these models. In Section 4, we exhibit some properties of the spectrum of the adjacency matrix under the SBMO, which motivate the new SAAC algorithm, introduced in Section 5, where we also formulate theoretical guarantees for an adaptive version of the algorithm. Finally, Section 6 illustrates the performance of SAAC on both real and simulated data.

Notation.

We denote by $\|x\|$ the Euclidean norm of a vector $x$. For any matrix $M$, we let $M_{i\cdot}$ denote its $i$-th row and $M_{\cdot j}$ its $j$-th column. For any set $S$, $|S|$ denotes its cardinality and $\mathbf{1}_S$ is a row vector such that $(\mathbf{1}_S)_i = 1$ if $i \in S$ and $0$ otherwise. The Frobenius norm of a matrix $M$ is $\|M\|_F = \big(\sum_{i,j} M_{ij}^2\big)^{1/2}$. The spectral norm of a symmetric matrix $M$ with eigenvalues $\lambda_1,\dots,\lambda_n$ is $\|M\| = \max_i |\lambda_i|$. For a permutation $\sigma$, we let $P_\sigma$ be the associated permutation matrix, defined by $(P_\sigma)_{kl} = 1$ if $l = \sigma(k)$ and $0$ otherwise.

2 The stochastic blockmodel with overlaps (SBMO)

2.1 The model

For any symmetric matrix $A \in [0,1]^{n\times n}$, let $\hat{A}$ be a random symmetric binary matrix whose entries $\hat{A}_{ij}$, for $i < j$, are independent Bernoulli random variables with respective parameters $A_{ij}$. Then $\hat{A}$ is the adjacency matrix of an undirected random graph with expected adjacency matrix $A$. Throughout the paper, we restrict the hat notation to variables that depend on this random graph. For example, the empirical degree of node $i$ observed on the random graph and the expected degree of node $i$ are respectively denoted by $\hat{d}_i = \sum_j \hat{A}_{ij}$ and $d_i = \sum_j A_{ij}$. Similarly, the hat notation is used below for the spectral quantities computed from $\hat{A}$ rather than from $A$.

The stochastic block model (SBM) with $n$ nodes and $K$ communities depends on some mapping $k : \{1,\dots,n\} \to \{1,\dots,K\}$ that associates nodes to communities, and on some symmetric community connectivity matrix $B \in [0,1]^{K\times K}$. In this model, two nodes $i$ and $j$ are connected with probability $B_{k(i)k(j)}$. Introducing a membership matrix $Z \in \{0,1\}^{n\times K}$ such that $Z_{ik} = 1$ if and only if $k(i) = k$, the expected adjacency matrix can be written $A = Z B Z^T$.

The stochastic blockmodel with overlap (SBMO) is a slight extension of this model, in which $Z$ is only assumed to belong to $\{0,1\}^{n\times K}$ with $Z_{i\cdot} \neq 0$ for all $i$, the expected adjacency matrix still being $A = Z B Z^T$. Compared to the SBM, the rows of the membership matrix are no longer constrained to have only one non-zero entry. Since these rows give the communities of the respective nodes of the graph, this means that each node can now belong to several communities.
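As an illustration, here is a minimal sketch of how a graph could be drawn from the SBMO as just defined. The variable names and the example matrices are our own; the clipping of $Z B Z^T$ to $[0,1]$ is a safety step for the illustration (the model itself assumes these entries are valid probabilities).

```python
import numpy as np

def sample_sbmo(Z: np.ndarray, B: np.ndarray, rng=None) -> np.ndarray:
    """Draw a symmetric binary adjacency matrix whose expectation is Z B Z^T (no self-loops)."""
    rng = np.random.default_rng() if rng is None else rng
    expected = np.clip(Z @ B @ Z.T, 0.0, 1.0)                     # model assumes entries in [0, 1]
    upper = np.triu(rng.random(expected.shape) < expected, k=1)   # independent Bernoulli draws, i < j
    return (upper + upper.T).astype(int)

# Example: 3 overlapping communities; nodes 1, 3 and 5 each belong to two communities.
Z = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 1]])
B = 0.4 * np.eye(3) + 0.05
A_hat = sample_sbmo(Z, B)
```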

2.2 Performance metrics

Given some adjacency matrix $\hat{A}$ drawn under the SBMO, our goal is to recover the underlying communities, that is, to build an estimate $\hat{Z}$ of the membership matrix $Z$, up to some permutation of its columns (corresponding to a permutation of the community labels). We denote by $\hat{K}$ the estimate of the number of communities ($K$ is in general unknown), so that $\hat{Z} \in \{0,1\}^{n\times\hat{K}}$.

We introduce two performance metrics for this problem. The first is related to the number of nodes that are “well classified”, in the sense that there is no error in the estimate of their membership vector. The objective is to minimize the number of misclassified nodes of an estimate $\hat{Z}$ of $Z$, equal to $n$ if $\hat{K} \neq K$ and to $\min_{\sigma} |\{i : (\hat{Z}P_\sigma)_{i\cdot} \neq Z_{i\cdot}\}|$ otherwise, the minimum being over permutations $\sigma$ of the community labels. The second performance metric is the fraction of wrong predictions in the membership matrix (again, up to a permutation of the community labels). We define the estimation error of $\hat{Z}$ as $1$ if $\hat{K} \neq K$ and otherwise as $\frac{1}{nK}\min_{\sigma} \|\hat{Z}P_\sigma - Z\|_F^2$, the fraction of erroneous entries.
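Both metrics can be computed by enumerating column permutations when the number of communities is small. The sketch below assumes $\hat{K} = K$ (so that the shapes match) and follows the conventions stated above; normalizing the estimation error by the total number of entries is our reading of the definition.

```python
import numpy as np
from itertools import permutations

def membership_errors(Z_hat: np.ndarray, Z: np.ndarray):
    """Misclassified-node count and entrywise error, both minimized over column permutations."""
    assert Z_hat.shape == Z.shape, "sketch assumes the number of communities was estimated correctly"
    n, K = Z.shape
    best_nodes, best_entries = n, n * K
    for perm in permutations(range(K)):
        Zp = Z_hat[:, list(perm)]
        best_nodes = min(best_nodes, int(np.sum(np.any(Zp != Z, axis=1))))
        best_entries = min(best_entries, int(np.sum(Zp != Z)))
    return best_nodes, best_entries / (n * K)   # (misclassified nodes, estimation error)
```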

2.3 Identifiability

The communities of a SBMO can only be recovered if the model is identifiable, in the sense that the equality $Z B Z^T = Z' B' Z'^T$, for some integer $K'$, membership matrix $Z' \in \{0,1\}^{n\times K'}$ and symmetric matrix $B' \in [0,1]^{K'\times K'}$, implies $K' = K$ and $Z' = Z P_\sigma$ for some permutation $\sigma$ (and thus $B' = P_\sigma^T B P_\sigma$): two SBMO with the same expected adjacency matrices have the same communities, up to a permutation of the community labels. In this section, we derive sufficient conditions for identifiability.

Consider the following SBMO with nodes and 3 overlapping communities:

(1)

where and (resp. ) is a vector of length with all coordinates equal to (resp. ). This SBMO is not identifiable since with

Observe that this is a SBM with 3 non-overlapping communities.

In view of the above example, some additional assumptions are required to ensure identifiability. A first approach is to restrict the analysis to the SBM. The following result is proved in Appendix A.

The SBMO is identifiable under the following assumptions:

  • (SBM1)  for all $k \neq l$, the rows $B_{k\cdot}$ and $B_{l\cdot}$ are different;

  • (SBM2)  for all $i$, $\sum_{k} Z_{ik} = 1$.

Assumption (SBM1) is the usual condition for identifiability of a SBM; the absence of overlap is enforced by assumption (SBM2). Note that the SBM of Example 2.3 clearly satisfies both assumptions and thus is identifiable: it is the only SBM with this expected adjacency matrix. One may wonder whether the SBMO is identifiable if we impose an overlap, that is, the existence of some node belonging to at least two communities. The answer is negative, as shown by the following example.

Example 2.3 (continued).

Without loss of generality, we assume that . Consider the following SBMO with nodes and 4 overlapping communities:

We have .

Thus some additional assumptions are required to make the SBMO identifiable. It is in fact sufficient that the community connectivity matrix $B$ is invertible and that each community contains at least one pure node (that is, a node belonging to this community only). The following result is proved in Appendix A.

The SBMO is identifiable under the following assumptions:

  • (SBMO1)   $B$ is invertible;

  • (SBMO2)  for each community $k$, there exists a node $i$ such that $Z_{ik} = 1$ and $Z_{il} = 0$ for all $l \neq k$.

Observe that the two SBMO of Example 2.3 with genuinely overlapping communities violate (SBMO2); only the SBM is identifiable. In particular, if we generate a SBMO with 3 overlapping communities based on these matrices, our algorithm will return at best 3 non-overlapping communities, corresponding to the identifiable SBM. To recover the model (1), some additional information is required on the community structure. For instance, one may impose the number of communities and require that each node belongs to exactly two communities. Note that this last condition alone is not sufficient, in view of the third model of Example 2.3.

Our choice of (SBMO1)-(SBMO2) is motivated by applications to social networks: homophily will make the matrix $B$ diagonally dominant, hence invertible. In the rest of the paper, we assume that the identifiability conditions (SBMO1) and (SBMO2) are satisfied.

2.4 Subcommunity detection

Any SBMO with $K$ overlapping communities may be viewed as a SBM with up to $2^K - 1$ non-overlapping communities, corresponding to groups of nodes sharing exactly the same communities in the SBMO and that we refer to as subcommunities.

Let $K'$ be the number of subcommunities in the SBMO, that is, the number of distinct rows of the membership matrix $Z$.

The corresponding SBM has communities indexed by the distinct membership vectors $z \in \{0,1\}^K$, with community connectivity matrix given by $B'_{z z'} = z B z'^T$. The SBM of Example 2.3 can be derived from the first SBMO in this way, for instance. More interestingly, it is easy to check that if the initial SBMO satisfies (SBMO1)-(SBMO2), then the corresponding SBM satisfies (SBM1)-(SBM2).
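A sketch of this correspondence: grouping nodes by identical membership vectors yields the subcommunities, and the induced connectivity between two subcommunity patterns $z$ and $z'$ is $z B z'^T$ (the names below are illustrative).

```python
import numpy as np

def subcommunities(Z: np.ndarray):
    """Group nodes by identical membership vectors: each distinct row of Z is one subcommunity."""
    patterns, labels = np.unique(Z, axis=0, return_inverse=True)
    return patterns, labels   # patterns: K' x K binary matrix, labels: subcommunity index of each node

# The connectivity between the subcommunities with patterns z and z' is z @ B @ z'.
```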

Figure 1: Three overlapping communities of a SBMO (left) and the subcommunities of the associated SBM (right).

This suggests that community detection in the SBMO reduces to community detection in the corresponding SBM, for which many efficient algorithms are known. However, the notion of performance for a SBM is different from that for the underlying SBMO: the knowledge of the subcommunities is not sufficient to recover the initial overlapping communities, that is, to obtain an estimate $\hat{Z}$ with small estimation error. It is indeed necessary to map these subcommunities to binary membership vectors in $\{0,1\}^K$, which is not an easy task: first, the number of communities $K$ is unknown; second, even when $K$ is known, the number of such mappings grows exponentially, so that a simple approach by enumeration is not feasible in general. Moreover, the performance of clustering algorithms degrades rapidly with the number of communities, so that it is preferable to work directly on the overlapping communities rather than on the subcommunities, with $K'$ possibly as large as $2^K - 1$.

Our algorithm directly detects the overlapping communities using the specific geometry of the eigenvectors of the expected adjacency matrix $A$. We provide conditions under which these geometric properties approximately hold for the observed adjacency matrix $\hat{A}$, which guarantees the consistency of our algorithm: the communities are recovered with probability tending to 1 in the limit of a large number of nodes $n$.

2.5 Scaling

To study the performance of our algorithm when the number of nodes $n$ grows, we introduce a degree parameter $\alpha_n \in (0,1]$ so that the expected adjacency matrix of a graph with $n$ nodes is in fact given by $A = \alpha_n Z B Z^T$, with $B$ independent of $n$. Although $A$ depends on $n$, we do not make this dependence explicit in the notation. Observe that the expected degree of each node grows like $n\alpha_n$, since $d_i = (A\mathbf{1}_n)_i = \alpha_n (Z B Z^T \mathbf{1}_n)_i$, where $\mathbf{1}_n$ is the vector of ones of dimension $n$.

We assume that the set of subcommunities does not depend on $n$ and that, for each subcommunity, the proportion of nodes it contains converges to a positive constant (independent of $n$):

(2)

This implies the existence of a matrix $O$ with positive diagonal entries such that

(3)   $\frac{1}{n} Z^T Z \;\longrightarrow\; O \quad \text{as } n \to \infty.$

One has $O_{kl} = 0$ for any pair of communities $k \neq l$ that do not overlap. In the sequel, we assume that the graph is sparse in the sense that $\alpha_n \to 0$. Observe also that $O_{kk}$ is the (limit) proportion of nodes that belong to community $k$, while $O_{kl}$ is the (limit) proportion of nodes that belong to both communities $k$ and $l$, for $k \neq l$. Hence we refer to $O$ as the overlap matrix.

In the following, we will slightly abuse notation by writing $O = \frac{1}{n} Z^T Z$ and treating the corresponding proportions as constants, although these equalities in fact hold only in the limit.
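The (empirical) overlap matrix is simply the matrix of normalized co-membership counts described above; a one-line sketch:

```python
import numpy as np

def overlap_matrix(Z: np.ndarray) -> np.ndarray:
    """Entry (k, l): fraction of nodes in both communities k and l; diagonal: community sizes / n."""
    return (Z.T @ Z) / Z.shape[0]
```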

3 Related work

Models.

Several random graph models have been proposed in the literature to model networks with overlapping communities. In these models, each node $i$ is characterized by some community membership vector $\pi_i$ that is not always a binary vector, as it is in the SBMO. In the Mixed-Membership Stochastic Blockmodel (MMSB) [Airoldi et al., 2008], introduced as the first model with overlaps, membership vectors are probability vectors drawn from a Dirichlet distribution. In this model, conditionally on $\pi_i$ and $\pi_j$, the probability that nodes $i$ and $j$ are connected is $\pi_i^T B \pi_j$ for some community connectivity matrix $B$, just like in the SBMO. However, the fact that $\pi_i$ and $\pi_j$ are probability vectors makes the model less interpretable. In particular, the probability that two nodes are connected does not necessarily increase with the number of communities that they have in common, as pointed out by Yang and Leskovec [Yang and Leskovec, 2012], which contradicts a tendency empirically observed in social networks.

A first model that relies on binary membership vectors is the Overlapping Stochastic Block Model (OSBM) [Latouche et al., 2011], in which two nodes $i$ and $j$ are connected with probability $\sigma(Z_{i\cdot} W Z_{j\cdot}^T + Z_{i\cdot} u + Z_{j\cdot} v + w)$, where $W$ is a $K\times K$ matrix, $u$ and $v$ are vectors, $w$ is a bias term, and $\sigma$ is the sigmoid function. Now the probability of connection of two nodes increases with the number of communities shared, but the particular form of the probability of connection makes the model hard to analyze. Given a community connectivity matrix $B$, another natural way to build a random graph model based on binary membership vectors is to assume that two nodes $i$ and $j$ are connected if any pair of communities to which these nodes respectively belong can explain the connection. In other words, $i$ and $j$ are connected with probability $1 - \prod_{k,l}(1 - B_{kl})^{Z_{ik}Z_{jl}}$. Denoting by $\tilde{B}$ the matrix with entries $\tilde{B}_{kl} = -\log(1 - B_{kl})$, this probability can be written $1 - \exp(-Z_{i\cdot}\tilde{B}Z_{j\cdot}^T) \approx Z_{i\cdot}\tilde{B}Z_{j\cdot}^T$, where the approximation is valid for sparse networks. In this case, the model is very close to the SBMO, with connectivity matrix $\tilde{B}$. The Community-Affiliation Graph Model (AGM) [Yang and Leskovec, 2012] is a particular case of this model in which $B$ is diagonal. The SBMO with a diagonal connectivity matrix can be viewed as a particular instance of an Additive Clustering model [Shepard and Arabie, 1979] and is also related to the ‘colored edges’ model [Ball et al., 2011], in which the number of edges between nodes $i$ and $j$ is drawn from a Poisson distribution with mean $\theta_i^T\theta_j$, where $\theta_i$ is the (non-binary) membership vector of node $i$. Letting $\theta_{ik} = \sqrt{B_{kk}}\, Z_{ik}$ and approximating the Poisson distribution by a Bernoulli distribution, we recover the SBMO.
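A small numeric check of the sparse-network approximation mentioned above, with illustrative values of $B$ and of the membership vectors:

```python
import numpy as np

# Exact "at least one pair of communities explains the edge" probability vs. its linear approximation.
B = np.array([[0.05, 0.01], [0.01, 0.04]])
B_tilde = -np.log(1.0 - B)                        # transformed connectivity matrix
z_i, z_j = np.array([1, 1]), np.array([0, 1])     # illustrative binary membership vectors
exact = 1.0 - np.prod((1.0 - B) ** np.outer(z_i, z_j))
approx = z_i @ B_tilde @ z_j
print(exact, approx)                              # the two values are close when B has small entries
```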

The Overlapping Continuous Community Assignment Model (OCCAM), proposed by Zhang et al. [Zhang et al., 2014], relies on overlapping communities but also on individual degree parameters, which generalizes the degree-corrected stochastic blockmodel [Karrer and Newman, 2011]. In the OCCAM, a degree parameter $\theta_i$ is associated with each node $i$. Letting $\Theta = \mathrm{diag}(\theta_1,\dots,\theta_n)$, the expected adjacency matrix is $\Theta Z B Z^T \Theta$, with a membership matrix $Z$ with non-negative entries. Identifiability of the model is proved assuming that $B$ is positive definite, each row of $Z$ has unit Euclidean norm, and the degree parameters satisfy a normalization constraint. The SBMO can be viewed as a particular instance of the OCCAM, for which we provide new identifiability conditions that allow for binary membership vectors.

Algorithms.

Several algorithmic methods have been proposed to identify overlapping community structure in networks [Xie et al., 2013]. Among the model-based methods, which rely on the assumption that the observed network is drawn under a random graph model, some are approximations of the maximum likelihood or maximum a posteriori estimate of the membership vectors under one of the random graph models discussed above. For example, under the MMSB or the OSBM, the membership vectors are assumed to be drawn from a prior distribution, and variational EM algorithms are proposed to approximate the posterior distributions [Airoldi et al., 2008, Latouche et al., 2011]. However, there is no proof of consistency of the proposed algorithms. In the MMSB, a different approach based on tensor power iteration is proposed in [Anandkumar et al., 2014] to compute an estimator derived from the method of moments, for which the first consistency results are provided.

The first occurrence of a spectral algorithm to find overlapping communities goes back to [Zhang et al., 2007]. The proposed method is an adaptation of spectral clustering with the normalized Laplacian (see, e.g., [Newman, 2013]) with a fuzzy clustering algorithm in place of $k$-means, and its justification is rather heuristic. Another spectral algorithm has been proposed by [Zhang et al., 2014] as an estimation procedure for the (non-binary) membership matrix under the OCCAM. The spectral embedding is a row-normalized version of $\hat{U}\hat{\Lambda}^{1/2}$, with $\hat{\Lambda}$ the diagonal matrix containing the leading eigenvalues of $\hat{A}$ and $\hat{U}$ the matrix of associated eigenvectors. The centroids obtained by a $K$-median clustering algorithm are then used to estimate the membership matrix. This algorithm is proved to be consistent under the OCCAM when, moreover, degree parameters and membership vectors are drawn according to some distributions. Similar assumptions have appeared before in the proofs of consistency of some community detection algorithms in the SBM or DC-SBM [Zhao et al., 2012]. Our consistency results are established for fixed parameters of the model.

4 Spectral analysis of the adjacency matrix in the SBMO

In this section, we describe the spectral structure of the adjacency matrix in the SBMO.

4.1 Expected adjacency matrix

Let $\mathcal{Z}_{n,K}$ be the set of membership matrices that contain at least one pure node per community:

$\mathcal{Z}_{n,K} = \{ Z \in \{0,1\}^{n\times K} : \forall i,\ Z_{i\cdot} \neq 0 \ \text{ and } \ \forall k,\ \exists i,\ Z_{i\cdot} = \mathbf{1}_{\{k\}} \}.$

From the identifiability conditions (SBMO1) and (SBMO2), $A$ is of rank $K$ (refer to the proof of Theorem 2.3) and $Z$ belongs to $\mathcal{Z}_{n,K}$. Let $U$ be an $n\times K$ matrix whose columns are normalized orthogonal eigenvectors associated with the non-zero eigenvalues of $A$. The structure of $U$ is described in the following proposition. Its first statement follows from the fact that these eigenvectors form a basis of the column space of $A$, which is also the column space of $Z$. Its second statement is established in the proof of Theorem 2.3.

  1. There exists a matrix $X \in \mathbb{R}^{K\times K}$ such that $U = ZX$.

  2. If $U = Z'X'$ for some $Z' \in \mathcal{Z}_{n,K}$ and $X' \in \mathbb{R}^{K\times K}$, then there exists a permutation matrix $P$ such that $Z' = ZP$.

This decomposition reveals in particular an additive structure in $U$: each row is the sum of the rows corresponding to pure nodes associated with the communities to which the node belongs. Fixing for each $k$ a pure node $i_k$ in community $k$, one has indeed

(4)   $U_{i\cdot} = \sum_{k=1}^{K} Z_{ik}\, U_{i_k\cdot} \quad \text{for all } i.$
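The additive structure (4) is easy to verify numerically on a small example; the matrices below are illustrative, and the first three nodes are the chosen pure nodes.

```python
import numpy as np

# Small numerical check of (4): with A = Z B Z^T, the leading eigenvectors satisfy U = Z X,
# where X stacks the rows of U corresponding to one pure node per community (nodes 0, 1, 2 here).
Z = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
B = np.array([[0.25, 0.05, 0.05], [0.05, 0.20, 0.05], [0.05, 0.05, 0.15]])
A = Z @ B @ Z.T
eigvals, eigvecs = np.linalg.eigh(A)
U = eigvecs[:, np.argsort(-np.abs(eigvals))[:3]]   # eigenvectors of the 3 non-zero eigenvalues
X = U[[0, 1, 2], :]                                # rows of the pure nodes
print(np.allclose(U, Z @ X))                       # True: every row of U is a sum of pure-node rows
```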

Proposition 4.1, proved in Appendix A, relates the eigenvectors of $A$ to those of a $K\times K$ matrix featuring the overlap matrix $O$ introduced in Section 2.5. Note that for any $x \in \mathbb{R}^K$, $x^T Z^T Z x = \|Zx\|^2$, so that $Z^TZ$ has the same rank as $Z$, equal to $K$. Hence $Z^TZ$ is invertible and positive definite, and the matrix $(Z^TZ)^{1/2}$ (resp. its inverse) is well defined.

Let $\lambda \neq 0$ and $x \in \mathbb{R}^K$. The following statements are equivalent:

  1. $Z(Z^TZ)^{-1/2}x$ is an eigenvector of $A$ associated with the eigenvalue $\lambda$;

  2. $x$ is an eigenvector of $\alpha_n (Z^TZ)^{1/2} B (Z^TZ)^{1/2}$ associated with the eigenvalue $\lambda$.

In particular, since $\frac{1}{n}Z^TZ \to O$, the non-zero eigenvalues of $A$ are of the same order as $n\alpha_n$.

4.2 Observed adjacency matrix

In practice, we observe the adjacency matrix $\hat{A}$, which is a noisy version of $A$. Our hope is that the leading eigenvectors of $\hat{A}$ are not too far from the leading eigenvectors of $A$, so that, in view of Proposition 4.1, the solution of the following optimization problem provides a good estimate of $Z$:

$\min_{Z' \in \mathcal{Z}_{n,K},\, X \in \mathbb{R}^{K\times K}} \ \|\hat{U} - Z'X\|_F^2,$

where $\hat{U}$ is the matrix of the $K$ normalized eigenvectors of $\hat{A}$ associated with the $K$ largest eigenvalues (in absolute value).

This hope is supported by the following result on the perturbation of the largest eigenvectors of the adjacency matrix of any random graph, proved in Appendix D. In practice, the number of communities $K$ is unknown, and this result also provides an adaptive procedure to select the eigenvectors to use in the spectral embedding. We denote by $\lambda_{\min}$ the smallest absolute value of a non-zero eigenvalue of $A$.

Let $\delta \in (0,1)$. Let $\hat{U}$ be a matrix formed by orthogonal eigenvectors of $\hat{A}$ associated with eigenvalues whose absolute values exceed a suitable threshold, and let $\hat{K}$ be the number of such eigenvectors. Let $U$ be the matrix of the $K$ leading eigenvectors of $A$. If the expected degrees are large enough compared to $\log(n/\delta)$, then, with probability larger than $1-\delta$, $\hat{K} = K$ and there exists a rotation matrix $R$ such that $\|\hat{U} - UR\|_F$ is controlled by a bound vanishing with $n$.

Under the SBMO, Proposition 4.1 shows that $\lambda_{\min}$ is of order $n\alpha_n$. Since $\hat{A}$ only approximates $A$, we need Lemma 4.2 to prove that $\hat{U}$ is a good estimate of $U$ (up to rotation). We give in the next section sufficient conditions on the degree parameter to obtain asymptotically exact recovery of the communities.

5 The SAAC algorithm

The spectral structure of the adjacency matrix suggests that the estimate $\hat{Z}$ defined below is a good estimate of the membership matrix in the SBMO:

(5)   $(\hat{Z}, \hat{X}) \in \arg\min_{Z' \in \mathcal{Z}_{n,K},\, X \in \mathbb{R}^{K\times K}} \ \|\hat{U} - Z'X\|_F^2,$

where $\hat{U}$ is the matrix of the $K$ normalized leading eigenvectors of $\hat{A}$. In practice, solving (5) is very hard, and the algorithm introduced in Section 5.1 solves a relaxation of (5) in which $Z'$ is only constrained to have binary entries and non-zero rows, which is amenable to alternate minimization. In Section 5.2, we prove that an adaptive version of the estimate given by (5) is consistent.

5.1 Description of the algorithm

The spectral algorithm with additive clustering (SAAC) consists in first computing a matrix $\hat{U}$ whose columns are normalized eigenvectors of $\hat{A}$ associated with the largest eigenvalues (in absolute value), and then computing the solution of the following optimization problem:

$(\hat{Z}, \hat{X}) \in \arg\min_{Z' \in \{0,1\}^{n\times\hat{K}},\, X \in \mathbb{R}^{\hat{K}\times\hat{K}}} \ \|\hat{U} - Z'X\|_F^2 .$

This problem is reminiscent of the (NP-hard) $k$-means problem, in which the same objective function is minimized under the additional constraint that $\sum_k Z'_{ik} = 1$ for all $i$. The name of the algorithm highlights the fact that, rather than finding a clustering of the rows of $\hat{U}$, the goal is to find a matrix $\hat{X}$, whose rows play the role of embedded pure nodes, that reveals the underlying additive structure of $\hat{U}$: for all $i$, $\hat{U}_{i\cdot}$ is not too far from $\hat{Z}_{i\cdot}\hat{X} = \sum_k \hat{Z}_{ik}\hat{X}_{k\cdot}$, in view of (4).

In practice, just like $k$-means, we propose to solve this problem by an alternate minimization over $Z'$ and $X$. The proposed implementation of the adaptive version of the algorithm, inspired by Theorem 5.2, is presented as Algorithm 1. An upper bound $s$ on the maximum overlap (the maximum number of communities a single node may belong to) is provided to limit the combinatorial complexity of the algorithm. If $K$ is known, the selection phase can be removed, and one can directly use the matrix of the $K$ leading eigenvectors. While heuristics do exist for selecting the number of clusters in spectral clustering (e.g., [Von Luxburg, 2007, Zelnik-Manor and Perona, 2004]), this thresholding procedure is supported by theory for networks drawn under the SBMO. It is reminiscent of the USVT algorithm of [Chatterjee, 2015], which can be used to estimate the expected adjacency matrix in a SBM.

0:  Parameters: eigenvalue threshold, stopping criterion, upper bound $s$ on the maximum overlap.
0:  Input: $\hat{A}$, the adjacency matrix of the observed graph.
1:   Selection of the eigenvectors
2:  Form a matrix $\hat{U}$ whose columns are normalized eigenvectors of $\hat{A}$ associated with eigenvalues exceeding the threshold in absolute value; let $\hat{K}$ be their number.
3:   Initialization
4:  $t \leftarrow 0$
5:  $X$ initialized with $\hat{K}$-means++ applied to the rows of $\hat{U}$, the first centroid being chosen at random among nodes with degree smaller than the median degree.
6:  Compute the corresponding membership matrix $Z$ (as in step 10).
7:   Alternating minimization
8:  while (the objective $\|\hat{U} - ZX\|_F^2$ decreases and the stopping criterion is not met) do
9:     $t \leftarrow t + 1$
10:     Update membership vectors: for each node $i$, $Z_{i\cdot} \in \arg\min_{z \in \{0,1\}^{\hat{K}},\, 1 \leq \|z\|_1 \leq s} \|\hat{U}_{i\cdot} - zX\|^2$.
11:     Update centroids: $X \leftarrow (Z^TZ)^{-1}Z^T\hat{U}$.
12:  end while
Algorithm 1 Adaptive SAAC
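Below is a compact sketch of the alternating minimization at the core of SAAC (steps 8-12 of Algorithm 1), assuming that $\hat{K}$ and the eigenvector matrix $\hat{U}$ have already been computed. The initialization and the stopping rule are simplified compared to Algorithm 1, and all function and variable names are our own.

```python
import numpy as np
from itertools import product

def saac_alternate_min(U_hat: np.ndarray, K: int, max_overlap: int = 2,
                       n_iter: int = 30, seed: int = 0):
    """Alternating minimization of ||U_hat - Z X||_F^2 over binary memberships Z and centroids X."""
    rng = np.random.default_rng(seed)
    n = U_hat.shape[0]
    # Candidate membership vectors: binary, non-zero, with at most `max_overlap` ones.
    candidates = np.array([z for z in product([0, 1], repeat=K) if 1 <= sum(z) <= max_overlap])
    # Simplified initialization: K random rows of U_hat as centroids
    # (Algorithm 1 instead uses k-means++ seeded with a low-degree node).
    X = U_hat[rng.choice(n, K, replace=False)].astype(float)
    Z = np.ones((n, K), dtype=int)
    for _ in range(n_iter):
        # Membership update: pick, for each node, the candidate vector closest in the embedding.
        dists = ((U_hat[:, None, :] - (candidates @ X)[None]) ** 2).sum(-1)   # shape (n, n_candidates)
        Z = candidates[np.argmin(dists, axis=1)]
        # Centroid update: closed-form least squares, valid when Z^T Z is invertible.
        G = Z.T @ Z
        if np.linalg.matrix_rank(G) < K:
            X = U_hat[rng.choice(n, K, replace=False)].astype(float)          # re-initialize
            continue
        X = np.linalg.solve(G, Z.T @ U_hat)
    return Z, X
```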

Alternate minimization is guaranteed to converge, in a finite number of steps, towards a local minimum of the objective. However, the convergence is very sensitive to initialization. We use a $k$-means++ initialization (see [Arthur and Vassilvitskii, 2007]), which is a randomized procedure that picks as initial centroids rows of $\hat{U}$ that should be far from each other. For the first centroid, we choose at random a row of $\hat{U}$ corresponding to a node whose degree is smaller than the median degree in the network. We do so because, in the SBMO model, pure nodes tend to have smaller degrees, and we expect the algorithm to work well if the initial centroids are chosen not too far from rows of $\hat{U}$ corresponding to pure nodes.

Given $Z$, as long as the matrix $Z^TZ$ is invertible, there is a closed-form solution to the minimization of the objective in $X$, namely $X = (Z^TZ)^{-1}Z^T\hat{U}$. The fact that $Z^TZ$ is not invertible implies in particular that $Z$ does not contain a pure node for each community. If this happens, we re-initialize the centroids, using again the $k$-means++ procedure.

5.2 Consistency of an adaptive estimator

We give in Theorem 5.2 theoretical properties of a slight variant of the estimate in (5), the solution of the optimization problem defined therein over the set of membership matrices for which the proportion of pure nodes in each community is larger than some $\epsilon > 0$:

$\mathcal{Z}_{n,\hat{K}}(\epsilon) = \{ Z \in \{0,1\}^{n\times\hat{K}} : \forall i,\ Z_{i\cdot} \neq 0 \ \text{ and } \ \forall k,\ |\{i : Z_{i\cdot} = \mathbf{1}_{\{k\}}\}| \geq \epsilon n \}.$

Recall the notation introduced in (2) and (3). We assume that $\epsilon$ is smaller than the smallest (limit) proportion of pure nodes over the communities.

The estimator analyzed is adaptive, for it relies on an estimate $\hat{K}$ of the number of communities and on the parameter $\epsilon$. We establish its consistency for any fixed matrices $Z$ and $B$ satisfying (SBMO1) and (SBMO2). It is to be noted that while the consistency result for the OCCAM algorithm [Zhang et al., 2014] applies to moderately dense graphs, our result handles relatively sparse graphs, in which the expected degrees grow only slightly faster than $\log n$. Our result involves constants defined below, which are related to the overlap matrix $O$ and to the matrix introduced in Proposition 4.1.

The core matrix is a symmetric $K\times K$ matrix defined from the overlap matrix $O$ and the connectivity matrix $B$, and we let $\kappa$ be the associated constant appearing in our bounds: it lower bounds the norm of certain linear combinations of the rows of the matrix $X$ such that $U = ZX$. Note that $\kappa$ is positive, as seen by the following argument: if $\kappa$ were zero, there would exist a linear combination of the rows of $X$ equal to zero; this is impossible because the matrix $X$ is invertible.

Let $\delta \in (0,1)$. Let $\hat{U}$ be a matrix whose columns are orthogonal eigenvectors of $\hat{A}$ associated with eigenvalues satisfying the selection threshold, and let $\hat{K}$ be the number of such eigenvectors. Let $\hat{Z}$ be a solution of the corresponding optimization problem over $\mathcal{Z}_{n,\hat{K}}(\epsilon)$. Assume that $\epsilon$ is small enough and that (SBMO1)-(SBMO2) hold. There exists some constant $c$ such that, if the degree parameter is large enough with respect to $c$ and $\log(n/\delta)$, then, for $n$ large enough, with probability larger than $1-\delta$, $\hat{K} = K$ and the estimation error of $\hat{Z}$ is upper bounded by a term vanishing with $n$.

In particular, assuming that the expected degrees grow slightly faster than logarithmically when $n$ goes to infinity, it can be shown (using the Borel-Cantelli lemma) that the estimation procedure described in Theorem 5.2, with a suitable choice of the parameter $\delta$, is consistent, in the sense that its estimation error tends to zero almost surely.

Theoretical guarantees for other estimates.

Theorem 5.2 leads to an upper bound on the estimation error of a solution to the constrained problem. In some cases, it is also possible to prove directly that the solution of the relaxed problem leads to a consistent estimate of $Z$. This is the case, for instance, in an identifiable SBMO with two overlapping communities, or with three communities with pairwise overlaps.

If $K$ is known, tighter results can be obtained for non-adaptive procedures in which $\hat{K}$ is replaced by $K$. These results are stated in Appendix C, where two non-adaptive estimation procedures are shown to be consistent under a looser condition on the degree parameter, involving a constant stated therein.

5.3 Proof of Theorem 5.2

Let $U$ be a matrix whose columns are independent normalized eigenvectors of $A$ associated with the non-zero eigenvalues. The proof strongly relies on the following decomposition of $U$, which is a consequence of Proposition 4.1.

There exists an invertible matrix $X \in \mathbb{R}^{K\times K}$ such that $U = ZX$.

We state below a crucial result characterizing the sensitivity to noise of the decomposition of Lemma 5.3, in terms of the quantity $\kappa$ introduced in Definition 5.2. The proof of this key result is given in Appendix B: it builds on the fact that $\kappa$ provides a lower bound on the norm of some particular linear combinations of the rows of $X$.

(Robustness to noise) Let $Z' \in \{0,1\}^{n\times K}$, $X' \in \mathbb{R}^{K\times K}$ and $S \subset \{1,\dots,n\}$. Assume that

  1. the rows of $Z'X'$ indexed by $S$ are close enough to the corresponding rows of $ZX$;

  2. there exists, within $S$, a pure node of each community, relatively to both $Z$ and $Z'$.

Then there exists a permutation matrix $P$ such that, for all $i \in S$, $Z'_{i\cdot} = (ZP)_{i\cdot}$.

Let $\hat{U}$ be the matrix defined in Theorem 5.2. We first note that Lemma 4.2 can be rephrased in terms of the degree parameter: indeed, from Proposition 4.1, the smallest non-zero eigenvalue of $A$ is of order $n\alpha_n$, up to constants related to Definition 5.2. From Lemma 4.2, if the degree parameter is large enough, then, with probability larger than $1-\delta$, $\hat{K} = K$ and there exists a rotation $R$ such that $\|\hat{U} - UR\|_F$ satisfies the bound

(6)

In the sequel, we assume that $\hat{K} = K$ and that this inequality holds with a rotation $R$.

The estimate , is then defined by

Introducing , we first show that is a good estimate of provided that is:

(7)

This inequality can be obtained in the following way. Let be defined in Lemma 5.3. As (for ), by definition of and ,

Then, one has

We now introduce the set of nodes

and show that assumptions 1 and 2 in Lemma 5.3 are satisfied for this set and the estimated pair, if

(8)

Assumption 1 is satisfied by definition of this set. We now show that, as required by assumption 2, the set contains one pure node in each community, relatively to both membership matrices.

First, using notably (7), the cardinality of is upper bounded as

Thus, if (8) holds,