I Introduction
Community detection aims to assign community labels to nodes in a graph such that the nodes in the same community share higher similarity (better connectivity) than the nodes in different communities [1]. It is essentially an unsupervised learning problem, since one is only provided with the graph connectivity information. Despite its unsupervised nature, recent research has identified the informational and algorithmic limits of community detection under certain generative community models (GCMs), especially for spectral graph clustering (SGC) algorithms, such as those using eigenvectors of the graph Laplacian matrices [2] or the modularity matrix [3]. However, these analyses assume that the GCM matches the given graph well, which may not hold in practice; a mismatch between the given graph and the underlying GCM often yields poor community detection results. On the other hand, optimizing a designed objective function for community detection, such as normalized cut [4] or modularity [5], imposes no model assumption but is sensitive in community detection [6, 7].
Motivated by the advantages of the theoretical and objective principles, we propose SGCGEN, a novel unified community detection framework that possesses the following features:
The power of community detectability. Under GCMs, the theoretical analysis of community detectability allows us to assess the quality of communities by converting the theoretical guarantees to a loss function that quantifies the error in community detection.
The constraint to model mismatch. By imposing an error metric on the level of inconsistency between a given graph and a GCM, one can confine the detection error due to model mismatch and hence improve community detection.
In particular, due to the extraordinary performance of SGC based on the normalized graph Laplacian matrix, a number of variants of SGC methods have been proposed to improve clustering performance in terms of scalability, robustness, and applicability. To provide a thorough analysis, in this paper we focus on the standard formulation of SGC based on the normalized graph Laplacian matrix introduced by the seminal works [8, 9, 2] (see Sec. III-A). The main goal of this paper is to demonstrate the effectiveness of SGCGEN, which combines standard SGC with GCMs [10] in a unified framework. Originating from the standard SGC formulation presented in Sec. III-A, SGCGEN can easily be generalized to many state-of-the-art SGC methods [11, 12, 13, 14]. By revisiting the standard formulation of SGC with GCMs, we establish a novel condition on correct community detection using SGC via the normalized graph Laplacian matrix under a GCM called the stochastic block model (SBM) [10]. We then convert this condition to a data-driven community detection loss function and apply it to SGCGEN to develop effective and computationally efficient community detection methods.
We highlight our contributions as follows:
We propose SGCGEN, a unified community detection framework combining the principles of theoretical detectability and well-designed objective functions for improvement.
We establish a condition on the correctness of community detection using SGC under an SBM, which leads to a novel data-driven community loss function for SGCGEN. Moreover, since the loss function enables community quality assessment, the proposed SGCGEN resembles the formulation of a supervised learning problem consisting of a loss function and a regularization function.
We present an algorithm for SGCGEN and conduct a computational complexity analysis showing that SGCGEN can be implemented as efficiently as other baseline methods.
We compare the performance of community detection on 18 real-life graph datasets and use 7 representative clustering metrics to rank each method. The experimental results show that joint consideration of theoretical detectability and model mismatch using SGCGEN can substantially improve community detection when compared to 6 baseline community detection methods of similar objective functions.
II General Framework
II-A Notations
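Throughout this paper bold uppercase letters (e.g., $\mathbf{X}$ or $\mathbf{X}_k$) denote matrices and $[\mathbf{X}]_{ij}$ denotes the entry in the $i$th row and the $j$th column of $\mathbf{X}$, bold lowercase letters (e.g., $\mathbf{x}$ or $\mathbf{x}_k$) denote column vectors, the term $(\cdot)^T$ denotes matrix or vector transpose, italic letters (e.g., $x$, $x_k$ or $X$) denote scalars, and calligraphic uppercase letters (e.g., $\mathcal{X}$ or $\mathcal{X}_k$) denote sets. The term $G = (\mathcal{V}, \mathcal{E})$ denotes a graph characterized by a node set $\mathcal{V}$ and an edge set $\mathcal{E}$. The numbers of nodes and edges in $G$ are denoted by $n$ and $m$, respectively. The convergence of a real rectangular matrix $\mathbf{X}$ is with respect to the spectral norm, which is defined as $\|\mathbf{X}\|_2 = \max_{\mathbf{z}:\, \|\mathbf{z}\|_2 = 1} \|\mathbf{X}\mathbf{z}\|_2$, where $\|\mathbf{z}\|_2$ denotes the Euclidean norm of a vector $\mathbf{z}$. Based on the definition, $\|\mathbf{X}\|_2$ is equivalent to the largest singular value of $\mathbf{X}$. A matrix $\mathbf{X}$ is said to converge to another matrix $\mathbf{M}$ of the same dimension if $\|\mathbf{X} - \mathbf{M}\|_2$ approaches zero. For the convenience of notation, we write $\mathbf{X} \to \mathbf{M}$ if $\|\mathbf{X} - \mathbf{M}\|_2 \to 0$ as the matrix dimensions grow to infinity.
II-B Preliminaries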
Throughout this paper, we consider the problem of non-overlapping community detection in a simple connected graph that is undirected, unweighted and contains no self-loops. Given a graph and the number of communities , non-overlapping community detection aims to assign each node a community label and divide the nodes into communities such that the nodes in the same community are better connected than nodes in different communities.
Spectral graph clustering (SGC).
SGC is a widely used technique for community detection. It transforms a graph into a vector space representation via spectral decomposition of a matrix associated with the graph. Specifically, each node in the graph is represented by a low-dimensional vector using a common subset of eigenvectors of the matrix. Based on the vector space representation, K-means clustering is applied to obtain communities . One typical example of SGC uses the normalized graph Laplacian matrix [2], whose smallest eigenvectors are used for community detection [4].
Generative community model (GCM).
A GCM generates a graph that embeds community structures [15]. A GCM can be either parametric or nonparametric. For example, the stochastic block model (SBM) [10] is a parametric GCM that specifies a set of within-community and between-community edge connection probability parameters. The graphon model [16, 17] is a nonparametric GCM that generates a graph based on latent representations. Different GCMs are discussed in the survey paper [15].
SGC under GCMs.
For graphs generated by certain GCMs, recent research findings suggest that the performance of community detection using SGC can be separated into two regimes [18]: a detectable regime where the detected communities are consistent with the ground-truth communities, and an undetectable regime where the detected communities and the ground-truth communities are inconsistent. Moreover, the critical space that separates these two regimes can be specified. Consequently, the problem of evaluating the quality of detected communities can be converted to the problem of estimating to which regime the given graph belongs. More details are given in the related work section (Sec. VI).
II-C Problem Formulation of SGCGEN
Consider community detection in a graph with an unknown number of communities. For each possible number of communities , we can provide quantitative measures on community detectability and model mismatch for SGC under GCMs. Specifically, given a GCM of communities and a set of communities detected by an SGC method , the corresponding community detection loss function and model mismatch metric are as follows.
Community detection loss function. For any , and , let denote a nonnegative loss function that reflects the level of incorrect community detection using under . Higher loss suggests the detected communities are less reliable.
Model mismatch metric. Let be a real-valued function quantifying the difference between the detected communities using and the underlying GCM . A larger value of suggests the detected communities are less consistent with the assumption of .
SGCGEN. Inspired by the formulation of supervised learning problems, community detection, albeit an unsupervised learning problem, can be formulated in a similar fashion by specifying a community detection loss function and a model mismatch metric . Given a maximum number of communities , an SGC method and a GCM , the proposed community detection framework, called SGCGEN, solves the following minimization problem:
(1) 
where denotes the set of candidate community detection results with different numbers of communities obtained by . Using terminology from supervised learning theory, is analogous to the loss function, resembles the regularization function, and is the regularization parameter.
Many existing community detection methods can fit into the framework of SGCGEN in (1). For example, objective-function-based algorithms specify a particular energy function for quality assessment and set [19]. Greedy algorithms specify a model mismatch metric and set and . For example, the Louvain method [20] selects to be the negative modularity, where modularity is a measure of relative difference between the detected communities and the corresponding configuration model [21].
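The selection step in (1) can be sketched in a few lines. The sketch below assumes that the objective takes the additive form "loss plus regularization parameter times mismatch", which is what the supervised learning analogy above suggests; the function name `select_partition`, its arguments, and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, Hashable, List, Tuple

Partition = Dict[Hashable, int]  # node -> community label


def select_partition(
    candidates: List[Partition],
    loss: Callable[[Partition], float],
    mismatch: Callable[[Partition], float],
    alpha: float,
) -> Tuple[Partition, float]:
    """Return the candidate minimizing loss + alpha * mismatch.

    The additive objective is an assumption about the form of (1).
    """
    if not candidates:
        raise ValueError("need at least one candidate partition")
    best, best_score = candidates[0], float("inf")
    for part in candidates:
        score = loss(part) + alpha * mismatch(part)
        if score < best_score:
            best, best_score = part, score
    return best, best_score


# Toy usage with made-up scores (illustration only).
two = {0: 0, 1: 0, 2: 1, 3: 1}
three = {0: 0, 1: 1, 2: 2, 3: 2}


def toy_loss(part: Partition) -> float:
    return 0.4 if len(set(part.values())) == 2 else 0.3


def toy_mismatch(part: Partition) -> float:
    return 0.1 if len(set(part.values())) == 2 else 0.9


chosen, score = select_partition([two, three], toy_loss, toy_mismatch, alpha=0.5)
loss_only, _ = select_partition([two, three], toy_loss, toy_mismatch, alpha=0.0)
```

With a zero regularization parameter the mismatch term is ignored and the choice is driven by the loss alone, which mirrors the objective-function-based special case described above.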
III Theoretical Foundation of SGCGEN: Normalized Graph Laplacian Matrix and Stochastic Block Model
In this section we study the community detectability of SGC using the normalized graph Laplacian matrix under a stochastic block model (SBM). We establish a sufficient and necessary condition such that SGC is guaranteed to yield reliable community detection results for graphs generated by an SBM. The established condition will be used in Sec. IV to devise a novel data-driven community detection loss function for the proposed SGCGEN framework in (1). For demonstration, we also provide a case study of the established condition under a simplified SBM. The proofs of the established theoretical results are given in the supplementary material (footnote 1: the supplementary material can be downloaded from www.pinyuchen.com).
III-A Normalized Graph Laplacian Matrix and Stochastic Block Model (SBM)
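SGC using normalized graph Laplacian matrix. Let $\mathbf{A}$ denote the adjacency matrix of $G$ and let $\mathbf{D} = \mathrm{diag}(\mathbf{A}\mathbf{1}_n)$ be the corresponding diagonal degree matrix. The unnormalized graph Laplacian matrix is defined as $\mathbf{L} = \mathbf{D} - \mathbf{A}$. The normalized graph Laplacian matrix is defined as $\mathbf{L}_N = \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2}$. We denote the $k$th smallest eigenpair of $\mathbf{L}_N$ by $(\lambda_k, \mathbf{v}_k)$, where $\mathbf{v}_k$ is the eigenvector associated with the eigenvalue $\lambda_k$, and $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$. It is also known that $0 \le \lambda_k \le 2$ [2]. The standard SGC algorithm using $\mathbf{L}_N$ [9] is summarized in Algorithm 1.
Let $\mathbf{V}_K = [\mathbf{v}_1, \ldots, \mathbf{v}_K]$ be the matrix of the $K$ smallest eigenvectors of $\mathbf{L}_N$. The matrix $\mathbf{V}_K$ is the solution of the minimization problem
$$\min_{\mathbf{X} \in \mathbb{R}^{n \times K}} \operatorname{trace}(\mathbf{X}^T \mathbf{L}_N \mathbf{X}) \quad \text{subject to} \quad \mathbf{X}^T \mathbf{X} = \mathbf{I}_K, \qquad (2)$$
where $\mathbf{I}_K$ is the $K \times K$ identity matrix, and the constraint imposes orthogonality and unit norm for the columns in $\mathbf{X}$. If $G$ is a connected graph, then by the definition of $\mathbf{L}_N$, we have $\lambda_1 = 0$ and $\mathbf{v}_1 = \mathbf{D}^{1/2}\mathbf{1}_n / \|\mathbf{D}^{1/2}\mathbf{1}_n\|_2$. Let $\mathbf{Y}$ be the matrix after removing the first column from $\mathbf{V}_K$. Then (2) can be reformulated as
$$\min_{\mathbf{X} \in \mathbb{R}^{n \times (K-1)}} \operatorname{trace}(\mathbf{X}^T \mathbf{L}_N \mathbf{X}) \quad \text{subject to} \quad \mathbf{X}^T \mathbf{X} = \mathbf{I}_{K-1}, \; \mathbf{X}^T \mathbf{D}^{1/2} \mathbf{1}_n = \mathbf{0}_{K-1}, \qquad (3)$$
where $\mathbf{1}_n$ ($\mathbf{0}_n$) is the $n \times 1$ vector of 1's (0's) and $\mathbf{Y}$ is the solution to (3). The minimization problem in (3) is a standard formulation of SGC based on the normalized graph Laplacian matrix [8, 9, 2], which is also a fundamental element of many state-of-the-art SGC methods [11, 12, 13, 14], and it will be the foundation of the theoretical results presented in Sec. III-B.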
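A minimal NumPy sketch of the spectral embedding behind (2) and (3) follows. It assumes the usual symmetric normalization (degree matrix to the power of minus one half on both sides of the unnormalized Laplacian) and row normalization of the embedding; Algorithm 1 is not reproduced in this section, so the function names and the omission of the final K-means step are choices made for this sketch only.

```python
import numpy as np


def normalized_laplacian(adj: np.ndarray) -> np.ndarray:
    """Symmetric normalized Laplacian D^{-1/2} (D - A) D^{-1/2} (assumed form)."""
    deg = adj.sum(axis=1)
    if np.any(deg <= 0):
        raise ValueError("every node needs a positive degree")
    inv_sqrt = 1.0 / np.sqrt(deg)
    lap = np.diag(deg) - adj
    return inv_sqrt[:, None] * lap * inv_sqrt[None, :]


def spectral_embedding(adj: np.ndarray, k: int) -> np.ndarray:
    """Row-normalized eigenvectors of the 2nd to k-th smallest eigenvalues."""
    _, eigvecs = np.linalg.eigh(normalized_laplacian(adj))  # ascending eigenvalues
    emb = eigvecs[:, 1:k]  # drop the trivial first eigenvector
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.where(norms > 0, norms, 1.0)


# Toy graph: two triangles (nodes 0-2 and 3-5) joined by the edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = np.zeros((6, 6))
for u, v in edges:
    adj[u, v] = adj[v, u] = 1.0
emb = spectral_embedding(adj, k=2)
```

On this toy graph the single embedding coordinate takes one sign on the first triangle and the opposite sign on the second, so any clustering of the rows (K-means in the standard formulation) separates the two triangles.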
Stochastic block model (SBM). SBM [10] is a fundamental GCM, and it has been the root of many other GCMs such as the degree-corrected SBM [22] and the random interconnection model [23]. SBM is a parametric GCM that assumes a common edge connection probability for within-community and between-community edges. A graph of communities can be generated by an SBM as follows. The SBM first divides the nodes into groups, where each group has nodes such that . For each unordered node pair , , an edge between and is created with probability , where denote the community labels of and . Therefore, the SBM is parameterized by the number of communities and the edge connection probability matrix , where and is symmetric. We denote the SBM with parameters and by SBM(,).
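The generative procedure described above can be sketched with the standard library alone. The function name `sample_sbm`, the list-based arguments, and the seed handling are assumptions of this sketch; the sampling rule itself (one independent coin flip per unordered node pair, with the probability chosen by the two community labels) follows the description in the text.

```python
import random
from typing import List, Sequence, Set, Tuple


def sample_sbm(
    sizes: Sequence[int],
    probs: Sequence[Sequence[float]],
    seed: int = 0,
) -> Tuple[List[int], Set[Tuple[int, int]]]:
    """Sample a simple undirected graph from an SBM.

    sizes[k] is the number of nodes in community k and probs[a][b] is the
    edge probability between communities a and b (must be symmetric).
    Returns (labels, edges) where edges holds pairs (i, j) with i < j.
    """
    k = len(sizes)
    if len(probs) != k or any(len(row) != k for row in probs):
        raise ValueError("probs must be a K x K matrix")
    if any(probs[a][b] != probs[b][a] for a in range(k) for b in range(k)):
        raise ValueError("probs must be symmetric")
    rng = random.Random(seed)
    labels = [g for g, size in enumerate(sizes) for _ in range(size)]
    n = len(labels)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):  # unordered pairs only, no self-loops
            if rng.random() < probs[labels[i]][labels[j]]:
                edges.add((i, j))
    return labels, edges


# Degenerate example: certain edges inside communities, none across them.
labels, edges = sample_sbm([3, 2], [[1.0, 0.0], [0.0, 1.0]])
```

The pairwise loop costs quadratic time in the number of nodes, which is fine for small illustrative graphs but not for the dataset sizes used later in the paper.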
III-B Theoretical Guarantees on Community Detectability
Here we analyze the performance of community detection on graphs generated by SBM(,) using . In particular, we establish a sufficient and necessary condition on correct community detection, where correct community detection means the detected communities using match the oracle communities generated by SBM(,), up to some permutation in community labels. The condition of community detectability leads to a novel community detection loss function as will be discussed in Sec. IV. Let , , and let denote the limit value of as . The following lemma serves as a cornerstone that connects the dots between and SBM(,).
Lemma 1.
(matrix concentration under SBM(,))
Let denote the adjacency matrix of edges between communities and of a graph generated by SBM(,), . The following holds almost surely as , and :
Proof.
The matrix concentration result in Lemma 1 shows that the scaled adjacency matrix converges asymptotically to a constant matrix of finite spectral norm , which is associated with the relative community size and the edge connection probability under SBM(,). The condition guarantees that all community sizes grow at a comparable rate. Note that Lemma 1 presumes each entry in is a constant. In the case of sparse graphs where or for some positive constants , a similar matrix concentration result holds with high probability under mild conditions via degree regularization techniques [26, 27].
Since Algorithm 1 is invariant to the permutation of node indices, for the purpose of analysis we treat the adjacency matrix as a matrix of blocks . Using Lemma 1, we establish a sufficient and necessary condition on correct community detection using for graphs generated by SBM(,).
Theorem 1.
(community detectability using under SBM(,))
For any graph generated by SBM(,), let and , where is the th smallest eigenpair of . The following holds almost surely as , and :
The communities in can be correctly detected
using if and only if .
Proof.
We provide a sketch of the proof below. The complete proof is given in Appendix B of the supplementary material.
Step 1. Specify the optimality condition of using (3).
Step 2. Show the distribution of the rows in can be separated into two regimes, detectable or undetectable, using Lemma 1.
Step 3. If is in the undetectable regime, show the distribution of the rows in is inconsistent with the community structure.
Step 4. If is in the detectable regime, show the distribution of the rows in is consistent with the community structure.
Step 5. Show is in the detectable regime iff .
∎
Note that Theorem 1 provides a novel data-driven criterion for evaluating the quality of communities without the knowledge of the parameters in SBM(). In other words, for any graph generated by SBM(), for evaluating community detectability it suffices to compute the smallest nonzero eigenvalues of and inspect the condition , which will be further explored in Sec. IV. In addition, Theorem 1 also implies the feasibility of community detection using Algorithm 1, since and row normalization does not alter the sign of each entry in .
III-C Case Study: SBM()
To investigate the implication of the sufficient and necessary condition for correct community detection in Theorem 1, we study SBM(), the case of SBM with two communities, and justify the condition via numerical experiments. Under SBM(), we allow the size of the two communities, and , to be arbitrary as long as their limit values . We also simplify the notation of the edge connection matrix by defining , , and . The following corollary specifies the condition of community detectability in terms of , and .
Corollary 1.
(community detectability using under SBM()) For any graph generated by SBM(,), let denote the second smallest eigenvector of , where , , is the communityindexed block vector of . The following holds almost surely as and :
The two communities in can be correctly detected
using if and only if .
Furthermore, and for some if and only if .
Proof.
The results follow from the condition in Theorem 1 under SBM(,). The proof is given in Appendix C of the supplementary material. ∎
The established detectability condition in Corollary 1 is universal in the sense that it does not depend on the ratio of the community sizes as long as its limit value . It is worth mentioning that the condition for correct community detection is also consistent with the condition using methods other than , such as the spectral modularity matrix [28], the spectrum of modular matrix [29], and the inference-based method [30]. In addition, when , the results that and for some imply the nodes in the same community have identical yet community-wise distinct representation, as and are nonzero constant vectors with opposite signs. This guarantees that K-means clustering on leads to correct community detection when . In particular, when and , the parameters and reflect the expected number of within-community and between-community edges, respectively. The condition in Corollary 1 then reduces to , which means the two communities can be correctly detected when there are more within-community edges than between-community edges.
Fig. 1 displays two numerical examples of different community sizes to validate the detectability condition. It can be observed that in both cases when , correct community detection can be achieved and . On the other hand, when , correct community detection is impossible and is close to . Consequently, inspecting the data-driven parameter indeed reveals community detectability, which validates Theorem 1 and Corollary 1.
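In the spirit of the numerical validation above, the following NumPy sketch samples a two-community SBM with unequal community sizes and partitions the nodes by the sign of the second smallest eigenvector of the normalized graph Laplacian. It does not reproduce Fig. 1: the community sizes, the edge probabilities, the seed, and all function names are illustrative assumptions, and the sign split stands in for K-means with two clusters.

```python
import numpy as np


def two_block_sbm(n1, n2, p_in, p_out, rng):
    """Adjacency matrix and labels of a two-community SBM sample."""
    labels = np.array([0] * n1 + [1] * n2)
    same = labels[:, None] == labels[None, :]
    probs = np.where(same, p_in, p_out)
    upper = np.triu(rng.random((n1 + n2, n1 + n2)) < probs, k=1)
    adj = (upper | upper.T).astype(float)  # symmetric, zero diagonal
    return adj, labels


def fiedler_partition(adj):
    """Sign split of the 2nd smallest eigenvector of I - D^{-1/2} A D^{-1/2}."""
    deg = adj.sum(axis=1)
    if np.any(deg == 0):
        raise ValueError("isolated node: normalization undefined")
    inv_sqrt = 1.0 / np.sqrt(deg)
    lap_n = np.eye(len(deg)) - inv_sqrt[:, None] * adj * inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap_n)  # ascending eigenvalues
    return (eigvecs[:, 1] > 0).astype(int), eigvals[1]


def accuracy_up_to_swap(pred, truth):
    """Fraction of correctly labeled nodes, allowing the two labels to swap."""
    agree = float(np.mean(pred == truth))
    return max(agree, 1.0 - agree)


rng = np.random.default_rng(0)
adj, labels = two_block_sbm(60, 40, p_in=0.6, p_out=0.1, rng=rng)
pred, lam2 = fiedler_partition(adj)
acc = accuracy_up_to_swap(pred, labels)

# Same sizes but no community structure (equal probabilities everywhere).
adj_flat, labels_flat = two_block_sbm(60, 40, p_in=0.3, p_out=0.3, rng=rng)
pred_flat, lam2_flat = fiedler_partition(adj_flat)
acc_flat = accuracy_up_to_swap(pred_flat, labels_flat)
```

Comparing `acc` with `acc_flat`, and `lam2` with `lam2_flat`, gives a quick feel for how the second eigenvalue and the recovery accuracy move together once the within-community and between-community probabilities approach each other.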
IV Community Detection Algorithms using SGCGEN
IV-A SGCGEN Meta Algorithm
The proposed SGCGEN framework in (1) applies to any SGC method and any GCM. It is a meta algorithm that enables community detection by specifying the corresponding community detection loss function and the model mismatch metric , in addition to the regularization parameter and the maximum number of communities . Algorithm 2 below summarizes SGCGEN.
Table I

| Method | Algorithm | GCM | f | R | Computational complexity |
|---|---|---|---|---|---|
| SGCEIG (regSGCEIG) | () | SBM() | (4) | | |
| SGCMOD (regSGCMOD) | () | SBM() | (4) | | |
| SGCAIC (regSGCAIC) | () | SBM() | (4) | | |
| SGCBIC (regSGCBIC) | () | SBM() | (4) | | |
| SBMAIC [31] | Bayesian inference | SBM() | 0 | | |
| SBMBIC [31] | Bayesian inference | SBM() | 0 | | |
| DCSBMAIC [31] | Bayesian inference | DCSBM | 0 | | |
| DCSBMBIC [31] | Bayesian inference | DCSBM | 0 | | |
| SelfTuning [19] | | None | defined in [19] | 0 | |
| Louvain [20] | Node merging | None | 0 | | |
IV-B SGCGEN via and SBM()
Based on the theoretical analysis established in Sec. III, here we specify 2 SGC methods, the corresponding community detection loss function, and 4 model mismatch metrics for SGCGEN. This yields 8 community detection methods derived from Algorithm 2. In particular, for these methods we select the GCM to be SBM(). These SGCGEN-empowered community detection methods are summarized in Table I. The details are described as follows.
Two SGC methods.
The first method is SGC using the normalized graph Laplacian matrix as described in Algorithm 1. To obtain the set in step 1 of Algorithm 2, one computes the smallest eigenvectors of and uses Algorithm 1 to obtain the candidate communities of different in .
The second method is regularized SGC using the normalized graph Laplacian matrix . It is similar to except that one replaces the matrix in step 1 of Algorithm 1 with , where is the average degree of the graph . The regularization leads to better clustering than in sparse graphs, as suggested in [32, 33].
Community detection loss function.
Since and are SGC methods via , using Theorem 1, we set the community detection loss function to be
(4) 
where and is the th smallest eigenvalue of . It is similar to the exponential loss function used in supervised learning problems. The denominator serves the purpose of comparing different community detection results in .
When , the function is confined in the interval , and it favors the community detection results of small partial eigenvalue sum , which is a measure of multiway cut in [2].
When , is greater than 1 and has exponential growth as decreases, which implies that imposes large loss on incorrect community detection results based on Theorem 1. Note that is a data-driven function since it only requires the knowledge of .
Four model mismatch metrics.
spectral radius of the modular matrix with respect to SBM(). Define the modular matrix with respect to SBM() as
if and if ,
for all , where denotes the community label of node . The parameter is the maximum likelihood estimator of in given the detected communities , which is defined as
if and
if ,
for all ,
where denotes the number of nodes in and
denotes the number of edges between communities and .
is defined as the spectral radius of , which is the largest eigenvalue of in absolute value. It relates to the first-order eigenvalue approximation of the signed triangle counts [34], which is an effective statistic for testing latent structure in random graphs.
negative modularity. Given communities in , modularity is a measure of difference between and a random graph of the same degree sequence [5]. The modularity is defined as , where if and if , for all , and . By defining , the model mismatch metric is small when the communities are distinct from the corresponding randomized graphs.
AIC under SBM(). The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. is defined as the AIC given communities under SBM(), which is , where denotes the log-likelihood of under SBM(). The closed-form expression of is given in [22].
BIC under SBM(). The Bayesian information criterion (BIC) is another relative measure of data fitness to statistical models. is defined as the BIC of communities under SBM(), which is .
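To make the two information criteria above concrete, the sketch below evaluates a partition under a Bernoulli SBM with maximum likelihood block probabilities. Several choices here are assumptions of the sketch rather than the paper's definitions: the Bernoulli likelihood (the closed form taken from [22] may differ, for example by using a Poisson model), the parameter count of K(K+1)/2 block probabilities, and the use of the number of node pairs as the sample size in the BIC. The function names are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Tuple


def sbm_log_likelihood(labels: List[int], edges: Iterable[Tuple[int, int]]) -> float:
    """Maximized Bernoulli log-likelihood of a simple graph given a partition.

    labels[i] is the community of node i; edges are unique pairs without
    self-loops. Block probabilities are set to their maximum likelihood
    estimates (observed edges divided by possible node pairs).
    """
    sizes: Dict[int, int] = {}
    for g in labels:
        sizes[g] = sizes.get(g, 0) + 1
    edge_counts: Dict[Tuple[int, int], int] = {}
    for u, v in edges:
        a, b = sorted((labels[u], labels[v]))
        edge_counts[(a, b)] = edge_counts.get((a, b), 0) + 1
    groups = sorted(sizes)
    ll = 0.0
    for idx, a in enumerate(groups):
        for b in groups[idx:]:
            pairs = sizes[a] * (sizes[a] - 1) // 2 if a == b else sizes[a] * sizes[b]
            m_ab = edge_counts.get((a, b), 0)
            if pairs == 0 or m_ab == 0 or m_ab == pairs:
                continue  # estimated probability is 0 or 1: contributes zero
            p = m_ab / pairs
            ll += m_ab * math.log(p) + (pairs - m_ab) * math.log(1.0 - p)
    return ll


def aic_bic(labels: List[int], edges: List[Tuple[int, int]]) -> Tuple[float, float]:
    """AIC and BIC under the assumptions stated in the lead-in."""
    k = len(set(labels))
    n = len(labels)
    n_params = k * (k + 1) // 2          # assumed number of free parameters
    n_obs = n * (n - 1) / 2              # assumed sample size: node pairs
    ll = sbm_log_likelihood(labels, edges)
    return 2 * n_params - 2 * ll, n_params * math.log(n_obs) - 2 * ll


# Toy example: two 2-node communities with one edge across them.
toy_labels = [0, 0, 1, 1]
toy_edges = [(0, 1), (2, 3), (0, 2)]
toy_ll = sbm_log_likelihood(toy_labels, toy_edges)
toy_aic, toy_bic = aic_bic(toy_labels, toy_edges)
```

Both criteria add a complexity penalty to the negative log-likelihood, so a partition with more communities must fit the graph noticeably better to obtain a lower value.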
IV-C Computational Complexity Analysis
Here we analyze the computational complexity of the 8 SGCGEN community detection methods listed in Table I. There are three main factors contributing to the computational complexity: (i) computation of the smallest eigenvectors of , (ii) K-means clustering, and (iii) computation of the community detection loss function and the model mismatch metric. The overall computational complexity of each method is summarized in Table I.
For (i), computing the smallest eigenvectors of requires operations using power iteration techniques [35, 36, 37, 38, 39], where is the number of nonzero entries in . For (ii), given any , K-means clustering on the rows of the smallest eigenvectors of requires operations [40]. As a result, obtaining the set of candidate community detection results by varying from to requires operations in total. For (iii), the complexity of computing the function and the loss function in (4) is negligible since they can be obtained in the process of (i). The computation of for a given requires operations for computing and operations for computing the spectral radius of using power iteration techniques, where is the number of nonzero entries in . Therefore, the overall computational complexity of in SGCGEN is , where is the maximum number of nonzero entries in ranging from to . The computation of for a given is , the same complexity as for computing modularity [5]. The overall computational complexity of in SGCGEN is . For a given , the computation of and requires operations to compute the closed-form log-likelihood function . The overall computational complexity of and in SGCGEN is . The computational complexity of and has the same order since the regularization step in simply adds entries to the degree matrix . Similarly, the data storage of these methods requires space.
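The cost argument for power iteration can be illustrated with a small standard-library sketch: each iteration touches every nonzero entry once, so its cost is linear in the number of nonzero entries. The sparse row-dictionary layout, the fixed start vector, the stopping rule, and the function name are assumptions of this sketch, and it is not the eigen-solver of [35, 36, 37, 38, 39]. The norm-based estimate is used because it also behaves well when a symmetric matrix has eigenvalues of equal magnitude and opposite sign.

```python
import math
from typing import Dict, List


def spectral_radius(rows: List[Dict[int, float]], iters: int = 500, tol: float = 1e-12) -> float:
    """Estimate the spectral radius of a symmetric sparse matrix.

    rows[i] maps a column index j to the nonzero entry in row i.
    Each iteration costs O(number of nonzero entries).
    Caveat: the fixed start vector must not be orthogonal to the
    dominant eigenspace, otherwise the estimate can be too small.
    """
    n = len(rows)
    x = [1.0 + 0.5 * (i % 3) for i in range(n)]  # arbitrary nonzero start
    norm = math.sqrt(sum(v * v for v in x))
    x = [v / norm for v in x]
    estimate = 0.0
    for _ in range(iters):
        y = [sum(val * x[j] for j, val in row.items()) for row in rows]
        new_estimate = math.sqrt(sum(v * v for v in y))  # ||B x|| with ||x|| = 1
        if new_estimate == 0.0:
            return 0.0
        x = [v / new_estimate for v in y]
        if abs(new_estimate - estimate) <= tol * max(1.0, new_estimate):
            return new_estimate
        estimate = new_estimate
    return estimate


# Adjacency matrix of a 3-node path: eigenvalues are -sqrt(2), 0, sqrt(2).
path = [{1: 1.0}, {0: 1.0, 2: 1.0}, {1: 1.0}]
rho_path = spectral_radius(path)
```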
Table II

| Dataset | Description | Node | Edge | # of nodes | # of edges | Community labels | Source |
|---|---|---|---|---|---|---|---|
| BlogCatalog | online social network | user | friendship | 10312 | 333983 | 39 social groups | http://socialcomputing.asu.edu/datasets/BlogCatalog3 |
| Youtube | online social network | user | friendship | 22180 | 96092 | 47 social groups | http://socialcomputing.asu.edu/datasets/YouTube2 |
| PoliticalBlog | online social network | user | blog reference | 1222 | 16714 | 2 political parties | http://konect.unikoblenz.de/networks/morenoblogs |
| Cora | publication network | paper | citation | 2485 | 5069 | 7 research topics | http://www.cs.umd.edu/ sen/lbcproj/data/cora.tgz |
| Citeseer | publication network | paper | citation | 2110 | 3694 | 6 research topics | http://www.cs.umd.edu/ sen/lbcproj/data/citeseer.tgz |
| Pubmed | publication network | paper | citation | 19717 | 44324 | 3 research topics | http://www.cs.umd.edu/projects/linqs/projects/lbc/PubmedDiabetes.tgz |
| PrettyGoodPrivacy | communication network | router | connection | 10680 | 24316 | NA | http://konect.unikoblenz.de/networks/arenaspgp |
| ASNewman | communication network | router | connection | 22963 | 48436 | NA | http://wwwpersonal.umich.edu/ mejn/netdata/ |
| ASSNAP | communication network | router | connection | 6474 | 12572 | NA | http://snap.stanford.edu/data/as.html |
| Facebook | online social network | user | friendship | 4039 | 88234 | NA | http://snap.stanford.edu/data/egonetsFacebook.html |
| EmailArenas | email network | user | communication | 1133 | 5451 | NA | http://konect.unikoblenz.de/networks/arenasemail |
| EmailEnron | email network | user | communication | 33696 | 180811 | NA | http://snap.stanford.edu/data/emailEnron.html |
| MinnesotaRoad | physical network | intersection | road | 2640 | 3302 | NA | http://www.cise.ufl.edu/research/sparse/matrices/Gleich/minnesota.html |
| PowerGrid | physical network | power station | power line | 4941 | 6594 | NA | http://konect.unikoblenz.de/networks/opsahlpowergrid |
| Reactome | biological network | protein | interaction | 5973 | 146385 | NA | http://konect.unikoblenz.de/networks/reactome |
| CAAstroPh | collaboration network | researcher | coauthorship | 17903 | 197000 | NA | http://snap.stanford.edu/data/caAstroPh.html |
| CAHepPh | collaboration network | researcher | coauthorship | 21363 | 91314 | NA | http://snap.stanford.edu/data/caHepPh.html |
| CACondMat | collaboration network | researcher | coauthorship | 11204 | 117634 | NA | http://snap.stanford.edu/data/caCondMat.html |
In summary, the overall computational complexity of SGCGEN-enabled methods is linear in the number of nodes and edges ( and ) and depends on . In practice is a constant such that and . Based on the computational analysis, the community detection methods based on SGCGEN have the same order of complexity in and when compared with the baseline methods of similar objective functions described in Sec. V-B, which suggests that utilizing SGCGEN for community detection is computationally as efficient as these baseline methods.
V Performance Evaluation
V-A Dataset Description and Evaluation Metrics
Dataset Description. To compare the performance of community detection, we collected 18 real-life graph datasets from various domains, including online social, physical, biological, communication, collaboration, email, and publication networks. For each dataset, we extracted the largest connected component as the input graph for community detection. All input graphs are made undirected, unweighted and unlabeled. Among these datasets, 6 datasets are provided with additional community labels. If a node in the graph is provided with more than one community label, the most common label among its neighboring nodes is assigned to the node. The statistics of the collected graphs are summarized in Table II.
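The preprocessing steps described above can be sketched as follows. This is a plain-Python illustration, not the authors' pipeline: the function names are hypothetical, ties between labels are broken alphabetically, and the choice is restricted to the node's own candidate labels, which is one possible reading of the label assignment rule.

```python
from collections import Counter, deque
from typing import Dict, Hashable, Iterable, Set, Tuple

Adjacency = Dict[Hashable, Set[Hashable]]


def largest_connected_component(edges: Iterable[Tuple[Hashable, Hashable]]) -> Adjacency:
    """Build a simple undirected graph and keep its largest connected component.

    Edge direction, duplicate edges, and self-loops are discarded.
    """
    adj: Adjacency = {}
    for u, v in edges:
        if u == v:
            continue  # drop self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen: Set[Hashable] = set()
    best: Set[Hashable] = set()
    for start in adj:
        if start in seen:
            continue
        comp = {start}
        queue = deque([start])
        while queue:  # breadth-first search from an unvisited node
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in comp:
                    comp.add(nb)
                    queue.append(nb)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return {u: adj[u] for u in best}


def resolve_multilabel(node: Hashable, labels: Dict[Hashable, Set[str]], adj: Adjacency) -> str:
    """Pick, among the node's labels, the one most common among its neighbors."""
    counts = Counter(lab for nb in adj[node] for lab in labels.get(nb, ()))
    return max(sorted(labels[node]), key=lambda lab: counts[lab])


# Toy input with a reversed duplicate edge, a self-loop, and a small second component.
raw_edges = [(1, 2), (2, 1), (2, 3), (3, 3), (4, 5)]
lcc = largest_connected_component(raw_edges)
node_labels = {1: {"a"}, 2: {"a", "b"}, 3: {"a"}}
label_of_2 = resolve_multilabel(2, node_labels, lcc)
```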
Evaluation metrics. We use 7 representative external and internal clustering metrics to evaluate the performance of different community detection methods. External clustering metrics can be computed when the community labels are given. Internal clustering metrics evaluate the quality of communities in terms of connectivity, which can be computed without community labels.
External clustering metrics:
Normalized mutual information (NMI) [40].
Rand index (RI) [40].
F-measure (FM) [40].
These external clustering metrics are properly scaled between 0 and 1, and a larger value means better clustering performance.
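Two of the external metrics listed above can be computed directly from a pair of label vectors. The sketch below uses the pair-counting definition of the Rand index and normalizes the mutual information by the arithmetic mean of the two entropies; other normalizations of NMI exist, so that choice, like the function names, is an assumption of this sketch.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(pred: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """Fraction of node pairs on which the two labelings agree."""
    n = len(pred)
    if n != len(truth) or n < 2:
        raise ValueError("need two labelings of the same length >= 2")
    agree = sum(
        (pred[i] == pred[j]) == (truth[i] == truth[j])
        for i, j in combinations(range(n), 2)
    )
    return agree / (n * (n - 1) / 2)


def entropy(labels: Sequence[Hashable]) -> float:
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def nmi(pred: Sequence[Hashable], truth: Sequence[Hashable]) -> float:
    """Mutual information divided by the mean of the two label entropies."""
    n = len(pred)
    joint = Counter(zip(pred, truth))
    cp, ct = Counter(pred), Counter(truth)
    mi = sum(
        c / n * math.log(c * n / (cp[a] * ct[b])) for (a, b), c in joint.items()
    )
    denom = (entropy(pred) + entropy(truth)) / 2
    return mi / denom if denom > 0 else 1.0  # both labelings trivial


# Same partition with swapped label names, and an unrelated partition.
ri_same = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
nmi_same = nmi([0, 0, 1, 1], [1, 1, 0, 0])
ri_cross = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
nmi_cross = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

Both metrics are invariant to renaming the community labels, which is why the first pair of labelings scores as a perfect match.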
Internal clustering metrics:
Conductance (COND) [8]: the averaged COND over all communities. A lower value means better performance.
Normalized cut (NC) [8]: the averaged NC over all communities. A lower value means better performance.
Average out-degree fraction (avgODF) [41]: the averaged avgODF over all communities. A lower value means better performance.
Modularity (MOD) [5]: MOD is defined in the model mismatch metric in Sec. IV-B. A larger value means better performance.
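As an illustration of two of the internal metrics, the sketch below computes the average conductance and the modularity of a partition from an edge list. It assumes the common textbook conventions (cut size divided by the smaller of the two volumes for conductance, and the sum over communities of the within-community edge fraction minus the squared volume fraction for modularity); the exact conventions of [8] and [5] are not restated in this section, so these formulas and the function names should be read as assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[Hashable, Hashable]


def modularity(edges: List[Edge], labels: Dict[Hashable, int]) -> float:
    """Sum over communities of (within-edge fraction - squared volume fraction)."""
    m = len(edges)
    within: Counter = Counter()
    vol: Counter = Counter()
    for u, v in edges:
        vol[labels[u]] += 1
        vol[labels[v]] += 1
        if labels[u] == labels[v]:
            within[labels[u]] += 1
    return sum(within[c] / m - (vol[c] / (2 * m)) ** 2 for c in vol)


def average_conductance(edges: List[Edge], labels: Dict[Hashable, int]) -> float:
    """Mean over communities of cut size / min(volume, complement volume)."""
    cut: Counter = Counter()
    vol: Counter = Counter()
    for u, v in edges:
        vol[labels[u]] += 1
        vol[labels[v]] += 1
        if labels[u] != labels[v]:
            cut[labels[u]] += 1
            cut[labels[v]] += 1
    total = 2 * len(edges)
    scores = []
    for c in vol:
        denom = min(vol[c], total - vol[c])
        scores.append(cut[c] / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Two triangles (nodes 0-2 and 3-5) joined by the edge (2, 3).
tri_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
tri_labels = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
q = modularity(tri_edges, tri_labels)
cond = average_conductance(tri_edges, tri_labels)
```

The edge list is expected to contain each undirected edge once and no self-loops, matching the simple graphs considered in the paper.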
Average rank score. To combine multiple clustering metrics for performance evaluation of different community detection methods, we adopt the methodology proposed in [6, 7] and use the average rank score of all clustering metrics as the performance metric. For each dataset, we rank each community detection method for every clustering metric via standard competition rankings and obtain an average rank score of all clustering metrics. Therefore, lower average rank score means better community detection.
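The average rank score can be sketched as below. Standard competition ranking gives tied methods the same (best) rank and skips the following ranks, so a method's rank is one plus the number of strictly better methods. The data layout, the function names, and the toy metric values are made up for illustration.

```python
from typing import Dict


def competition_ranks(scores: Dict[str, float], higher_is_better: bool) -> Dict[str, int]:
    """Standard competition ranking ("1224"): rank = 1 + number of strictly better methods."""
    ranks = {}
    for name, value in scores.items():
        if higher_is_better:
            better = sum(1 for other in scores.values() if other > value)
        else:
            better = sum(1 for other in scores.values() if other < value)
        ranks[name] = 1 + better
    return ranks


def average_rank_scores(
    table: Dict[str, Dict[str, float]],
    higher_is_better: Dict[str, bool],
) -> Dict[str, float]:
    """table[metric][method] -> value; returns method -> mean rank over metrics."""
    methods = list(next(iter(table.values())))
    totals = {method: 0.0 for method in methods}
    for metric, scores in table.items():
        ranks = competition_ranks(scores, higher_is_better[metric])
        for method in methods:
            totals[method] += ranks[method]
    return {method: totals[method] / len(table) for method in methods}


# Made-up values for three hypothetical methods and two metrics.
toy_table = {
    "NMI": {"A": 0.9, "B": 0.9, "C": 0.5},    # larger is better
    "COND": {"A": 0.2, "B": 0.1, "C": 0.3},   # lower is better
}
toy_direction = {"NMI": True, "COND": False}
avg_rank = average_rank_scores(toy_table, toy_direction)
```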
V-B Baseline Comparative Methods
As summarized in Table I, we compare the performance of SGCGEN methods with 6 baseline community detection methods of similar loss functions and model mismatch metrics:
SBMAIC: Given the number of communities, SBMAIC uses Bayesian inference techniques to evaluate the posterior distribution of community assignments given the graph under the SBM. We used the state-of-the-art package WSBM (http://tuvalu.santafe.edu/ aaronc/wsbm/) to obtain the most probable communities [31] and use the AIC to determine the final communities ranging from to .
SBMBIC: SBMBIC is the same as SBMAIC except that one uses the BIC to determine the final community detection results.
DCSBMAIC: DCSBMAIC is the same as SBMAIC except that one uses the degree-corrected SBM (DCSBM) [22] for inference.
DCSBMBIC: DCSBMBIC is the same as SBMBIC except that one uses the degree-corrected SBM (DCSBM) [22] for inference.
SelfTuning (http://www.vision.caltech.edu/lihi/Demos/SelfTuningClustering.html): SelfTuning is an SGC algorithm that uses an energy function based on for basis rotation and finds the best community detection results among to communities [19].
Louvain (https://perso.uclouvain.be/vincent.blondel/research/louvain.html): The Louvain method is a greedy modularity maximization approach for community detection based on node merging [20].
V-C The Effect of Regularization Parameter
Here we investigate the effect of the regularization parameter in (1) on the performance of the eight SGCGEN community detection methods listed in Table I. We set and use the six datasets with additional community labels in Table II to select from the set . For illustration, Fig. 2 displays the stacked average rank plot of these SGCGEN methods separately ranked by different in Youtube and Citeseer datasets. The colors represent different methods and the width of each colored block represents average rank score based on the selected values of . It is observed that for each method, setting large (i.e., underestimating the loss function) or neglecting the model mismatch metric (i.e., setting ) leads to the worst performance, which justifies the motivation of SGCGEN. In addition, sweeping within does not induce drastic changes in the average rank score, which demonstrates the robustness of SGCGEN. Based on the average rank score of these datasets, for the following experiments we assign () to the (regularized) SGCGEN methods.
V-D Comparison to Baseline Methods
Here we compare the 8 SGCGEN methods to the 6 baseline methods in Sec. V-B. For the Bayesian inference baseline methods we set , since we observe that larger does not improve their performance but significantly increases the computation time. For the SGCGEN methods and SelfTuning we set . For Louvain one does not need to specify . All experiments are implemented in Matlab R2016 and run on a 16-core cluster with 128 GB RAM.
Table III displays the mean and standard deviation of the average rank scores over all 18 graph datasets for each community detection method. Among these 14 methods, SGCMOD and SGCEIG have the best and second-best mean average rank scores over all datasets, which suggests that joint consideration of theoretical detectability and modular structure using the proposed SGCGEN framework improves community detection. The results also suggest that the degree regularization technique does not necessarily guarantee better performance. For the baseline methods, it can be observed that Bayesian inference based approaches lead to poor performance, which can be explained by the fact that the graph datasets may not comply with the assumptions of the underlying generative community models. Louvain also yields poor performance since it is a greedy algorithm that only aims to maximize a single clustering metric (i.e., modularity). SelfTuning performs better than some SGCGEN methods but it does not outperform SGCMOD and SGCEIG, which can be explained by the fact that the energy function used in SelfTuning does not exploit the discriminative power of community detectability. Since in Sec. IVC SGCGEN is shown to be computationally as efficient as these baseline methods, we conclude that community detection via SGCGEN yields superior performance without incurring additional computational costs.

Method       Average rank of all datasets
             mean       standard deviation
SGCEIG       4.6290     2.0178
SGCMOD       4.3433     1.3484
SGCAIC       5.5476     2.1481
SGCBIC       5.1468     1.6762
regSGCEIG    6.0417     1.3403
regSGCMOD    5.3313     1.3592
regSGCAIC    6.1409     1.8277
regSGCBIC    6.0298     1.8559
SBMAIC       10.9385    1.2383
SBMBIC       10.9385    1.2383
DCSBMAIC     11.5675    1.3735
DCSBMBIC     11.5675    1.3735
SelfTuning   4.9821     1.1923
Louvain      6.1250     2.2236
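The summary statistics in Table III can be sketched as follows. This is a hedged illustration only: it assumes that on each dataset every method is ranked per clustering metric (1 = best, ties sharing the mean rank), that the average rank score is the mean of those ranks, and that the reported spread is a population standard deviation. The function names and the input layout are hypothetical.

```python
from statistics import mean, pstdev


def ranks(scores, higher_is_better=True):
    """Rank methods for one metric (1 = best); tied methods share the mean rank."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    result = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # mean of positions i+1 .. j+1
        for name in ordered[i:j + 1]:
            result[name] = shared
        i = j + 1
    return result


def average_rank(metric_scores):
    """Average each method's rank over all metrics of one dataset.

    metric_scores : list of (scores, higher_is_better) pairs,
                    where scores maps method name -> metric value
    """
    per_metric = [ranks(s, hib) for s, hib in metric_scores]
    return {name: mean(r[name] for r in per_metric) for name in per_metric[0]}


def summarize(per_dataset):
    """Mean and population standard deviation of average ranks across datasets."""
    methods = list(per_dataset[0])
    return {
        name: (mean(d[name] for d in per_dataset),
               pstdev([d[name] for d in per_dataset]))
        for name in methods
    }
```

Calling `summarize([average_rank(d) for d in datasets])` on a list of per-dataset metric tables would then yield one (mean, standard deviation) pair per method, in the same layout as the table.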
VE Comparison in graph domains and types
For further analysis, we categorize the 18 graph datasets in Table II into 7 domains based on their descriptions. Fig. 3 displays the mean average rank score of each domain for 10 selected methods. It can be observed that no single community detection method outperforms the others in all domains. For example, regSGCMOD has superior performance in online social, publication and biological networks but has poor performance in email and communication networks. SBMBIC has the best performance in communication networks but not in other domains. SGCMOD has the best average performance over all datasets but it does not outperform the others in every domain. The results suggest that considering the graph domain is essential for improving community detection.
We also separate the 18 datasets into two types: with community labels or without community labels. The corresponding average rank score is shown in Fig. 4. For the datasets with community labels, regSGCMOD, regSGCAIC and regSGCBIC are outstanding, whereas for the datasets without community labels SGCEIG and SGCMOD prevail. Since community labels provide additional external clustering metrics, the results suggest that the external and internal clustering metrics have different evaluation criteria.
VI Related Work
Community detection and graph clustering have been an active research field in the past two decades. We refer readers to [1, 42] for an overview of community detection methods. In recent years, there has been a major breakthrough in analyzing both the informational and algorithmic limits of community detection under certain generative community models (GCMs). In this section we summarize the recent research findings in community detectability.
Many informational and algorithmic limits of community detection have been analyzed under the stochastic block model (SBM) [10]. Abbe et al. analyzed the informational limit by specifying the detectable and undetectable regimes for community detection via the parameters of SBM [43]. They also proposed a belief propagation algorithm that is proved to achieve the informational limit [44]. Hajek et al. proposed a semidefinite programming algorithm that achieves the informational limit [45, 46]. Inference approaches based on statistical physics have been studied in [47, 48]. Spectral graph clustering algorithms, including those based on the modularity matrix, the graph Laplacian matrix, the adjacency matrix, and the modular matrix, have been studied in [49, 50, 32, 29, 51, 28, 52, 53, 26, 27] and applied to various applications in graph mining [54, 55, 56, 57, 58, 59, 60, 61] and machine learning [62, 63, 64, 65].
Beyond the SBM, Zhao et al. proved the consistency of community detection [30] under the degree-corrected SBM [22]. Under the same model, Qin and Rohe studied regularized spectral clustering [66], and Gao et al. derived a minimax risk [67]. Chen and Hero proved the algorithmic limit of spectral clustering [68, 23, 69] under the random interconnection model [23]. Although studying the limits of community detection methods under GCMs provides novel insights into evaluating community detectability, these approaches assume the graphs are consistent with the underlying GCMs and therefore neglect the error induced by model mismatch, which motivates the SGCGEN framework proposed in this paper.
VII Conclusion and Future Work
In this paper we propose SGCGEN, a new community detection framework that jointly exploits the discriminative power of community detectability under generative community models and confines the corresponding model mismatch. A novel condition on correct community detection is established for SGCGEN, leading to effective and computationally efficient community detection methods.
Performance evaluation on 18 datasets and 7 clustering metrics shows that joint consideration of community detectability and modular structure via SGCGEN outperforms 6 baseline approaches in terms of the average rank score. We also investigated the effect of graph domains and graph types on community detection.
The performance analysis established in this paper focuses on the standard formulation of SGC, in which many advanced methods are rooted. Our future work involves developing a scalable implementation of SGCGEN to efficiently handle large-scale graphs and extending SGCGEN to advanced community detection methods and models.
References
 [1] S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3–5, pp. 75–174, 2010.
 [2] U. Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, Dec. 2007.
 [3] M. E. J. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Phys. Rev. E, vol. 74, p. 036104, Sep 2006.
 [4] S. White and P. Smyth, “A spectral clustering approach to finding communities in graph.” in SIAM International Conference on Data Mining (SDM), vol. 5, 2005, pp. 76–84.
 [5] M. E. J. Newman, “Fast algorithm for detecting community structure in networks,” Phys. Rev. E, vol. 69, p. 066133, Jun 2004.
 [6] J. Leskovec, K. J. Lang, and M. Mahoney, “Empirical comparison of algorithms for network community detection,” in ACM International Conference on World Wide Web (WWW), 2010, pp. 631–640.
 [7] J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.
 [8] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
 [9] A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in neural information processing systems (NIPS), 2002, pp. 849–856.
 [10] P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
 [11] J. Liu, C. Wang, M. Danilevsky, and J. Han, “Large-scale spectral clustering on graphs,” in International Joint Conference on Artificial Intelligence. AAAI Press, 2013, pp. 1486–1492.
 [12] F. Nie, X. Wang, and H. Huang, “Clustering and projected clustering with adaptive neighbors,” in ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2014, pp. 977–986.
 [13] Y. Li, J. Huang, and W. Liu, “Scalable sequential spectral clustering,” in AAAI, 2016, pp. 1809–1815.
 [14] F. Nie, X. Wang, M. I. Jordan, and H. Huang, “The constrained laplacian rank algorithm for graph-based clustering,” in AAAI, 2016, pp. 1969–1976.
 [15] A. Goldenberg, A. X. Zheng, S. E. Fienberg, and E. M. Airoldi, “A survey of statistical network models,” Foundations and Trends® in Machine Learning, vol. 2, no. 2, pp. 129–233, 2010.
 [16] P. Diaconis and S. Janson, “Graph limits and exchangeable random graphs,” arXiv preprint arXiv:0712.2749, 2007.
 [17] Y. Zhang, E. Levina, and J. Zhu, “Estimating network edge probabilities by neighborhood smoothing,” arXiv preprint arXiv:1509.08588, 2015.
 [18] E. Abbe, A. S. Bandeira, and G. Hall, “Exact recovery in the stochastic block model,” IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 471–487, 2016.
 [19] L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” in Advances in neural information processing systems (NIPS), 2004, pp. 1601–1608.
 [20] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, no. 10, 2008.
 [21] M. E. J. Newman, “Modularity and community structure in networks,” Proc. National Academy of Sciences, vol. 103, no. 23, pp. 8577–8582, 2006.
 [22] B. Karrer and M. E. J. Newman, “Stochastic blockmodels and community structure in networks,” Phys. Rev. E, vol. 83, p. 016107, Jan 2011.
 [23] P.-Y. Chen and A. O. Hero, “Phase transitions and a model order selection criterion for spectral graph clustering,” arXiv preprint arXiv:1604.03159, 2016.
 [24] R. Latala, “Some estimates of norms of random matrices,” Proc. Am. Math. Soc., vol. 133, no. 5, pp. 1273–1282, 2005.
 [25] M. Talagrand, “Concentration of measure and isoperimetric inequalities in product spaces,” Publications Mathématiques de l’Institut des Hautes Études Scientifiques, vol. 81, no. 1, pp. 73–205, 1995.
 [26] C. M. Le and R. Vershynin, “Concentration and regularization of random graphs,” arXiv preprint arXiv:1506.00669, 2015.
 [27] A. Joseph, B. Yu et al., “Impact of regularization on spectral clustering,” The Annals of Statistics, vol. 44, no. 4, pp. 1765–1791, 2016.
 [28] P.-Y. Chen and A. O. Hero, “Universal phase transition in community detectability under a stochastic block model,” Phys. Rev. E, vol. 91, p. 032804, Mar 2015.
 [29] T. P. Peixoto, “Eigenvalue spectra of modular networks,” Phys. Rev. Lett., vol. 111, p. 098701, Aug 2013.
 [30] Y. Zhao, E. Levina, and J. Zhu, “Consistency of community detection in networks under degree-corrected stochastic block models,” The Annals of Statistics, vol. 40, no. 4, pp. 2266–2292, Aug 2012.
 [31] C. Aicher, A. Z. Jacobs, and A. Clauset, “Learning latent block structure in weighted networks,” Journal of Complex Networks, p. cnu026, 2014.
 [32] K. Chaudhuri, F. C. Graham, and A. Tsiatas, “Spectral clustering of graphs with general degrees in the extended planted partition model,” in COLT, vol. 23, 2012, pp. 35–1.
 [33] A. A. Amini, A. Chen, P. J. Bickel, E. Levina et al., “Pseudolikelihood methods for community detection in large sparse networks,” The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.
 [34] S. Bubeck, J. Ding, R. Eldan, and M. Z. Rácz, “Testing for high-dimensional geometry in random graphs,” Random Structures & Algorithms, 2016.
 [35] O. E. Livne and A. Brandt, “Lean algebraic multigrid (lamg): Fast graph Laplacian linear solver,” SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. B499–B522, 2012.
 [36] P.-Y. Chen, B. Zhang, M. A. Hasan, and A. O. Hero, “Incremental method for spectral clustering of increasing orders,” in ACM International Conference on Knowledge Discovery and Data Mining (KDD) Workshop on Mining and Learning with Graphs, 2016, arXiv preprint arXiv:1512.07349.
 [37] L. Wu and A. Stathopoulos, “A preconditioned hybrid svd method for accurately computing singular triplets of large matrices,” SIAM Journal on Scientific Computing, vol. 37, no. 5, pp. S365–S388, 2015.
 [38] L. Wu, J. Laeuchli, V. Kalantzis, A. Stathopoulos, and E. Gallopoulos, “Estimating the trace of the matrix inverse by interpolating from the diagonal of an approximate inverse,” Journal of Computational Physics, vol. 326, pp. 828–844, 2016.
 [39] L. Wu, E. Romero, and A. Stathopoulos, “Primme_SVDS: A high-performance preconditioned svd solver for accurate large-scale computations,” arXiv preprint arXiv:1607.01404, 2016.
 [40] M. J. Zaki and W. Meira Jr, Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press, 2014.
 [41] G. W. Flake, S. Lawrence, and C. L. Giles, “Efficient identification of web communities,” in ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2000, pp. 150–160.
 [42] S. Fortunato and D. Hric, “Community detection in networks: A user guide,” Physics Reports, vol. 659, pp. 1–44, 2016.
 [43] E. Abbe and C. Sandon, “Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms,” arXiv preprint arXiv:1503.00609, 2015.
 [44] ——, “Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic bp, and the information-computation gap,” Advances in Neural Information Processing Systems (NIPS), 2016.
 [45] B. Hajek, Y. Wu, and J. Xu, “Achieving exact cluster recovery threshold via semidefinite programming,” IEEE Trans. Inf. Theory, vol. 62, no. 5, pp. 2788–2797, 2016.
 [46] ——, “Achieving exact cluster recovery threshold via semidefinite programming: Extensions,” IEEE Trans. Inf. Theory, vol. 62, no. 10, pp. 5918–5937, 2016.
 [47] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, “Inference and phase transitions in the detection of modules in sparse networks,” Phys. Rev. Lett., vol. 107, p. 065701, Aug 2011.
 [48] F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborova, and P. Zhang, “Spectral redemption in clustering sparse networks,” Proc. National Academy of Sciences, vol. 110, pp. 20 935–20 940, 2013.
 [49] K. Rohe, S. Chatterjee, and B. Yu, “Spectral clustering and the high-dimensional stochastic blockmodel,” The Annals of Statistics, pp. 1878–1915, 2011.
 [50] R. R. Nadakuditi and M. E. J. Newman, “Graph spectra and the detectability of community structure in networks,” Phys. Rev. Lett., vol. 108, p. 188701, May 2012.
 [51] F. Radicchi, “Detectability of communities in heterogeneous networks,” Phys. Rev. E, vol. 88, p. 010801, Jul 2013.
 [52] A. Saade, F. Krzakala, and L. Zdeborová, “Spectral clustering of graphs with the bethe hessian,” in Advances in neural information processing systems (NIPS), 2014, pp. 406–414.
 [53] J. Lei and A. Rinaldo, “Consistency of spectral clustering in stochastic block models,” Ann. Statist., vol. 43, no. 1, pp. 215–237, 02 2015.
 [54] P.-Y. Chen and A. Hero, “Deep community detection,” IEEE Trans. Signal Process., vol. 63, no. 21, pp. 5706–5719, Nov. 2015.
 [55] B. Zhang and M. A. Hasan, “Name disambiguation in anonymized graphs using network embedding,” in CIKM, 2017.
 [56] P.-Y. Chen, S. Choudhury, and A. O. Hero, “Multi-centrality graph spectral decompositions and their application to cyber intrusion detection,” in IEEE ICASSP, 2016, pp. 4553–4557.
 [57] B. Zhang, T. K. Saha, and M. Al Hasan, “Name disambiguation from link data in a collaboration graph,” in IEEE/ACM ASONAM, 2014, pp. 81–84.
 [58] T. K. Saha, B. Zhang, and M. Al Hasan, “Name disambiguation from link data in a collaboration graph using temporal and topological features,” Social Network Analysis and Mining, vol. 5, no. 1, p. 11, 2015.
 [59] P.-Y. Chen and A. O. Hero, “Assessing and safeguarding network resilience to nodal attacks,” IEEE Commun. Mag., vol. 52, no. 11, pp. 138–143, Nov. 2014.
 [60] M. Dundar, Q. Kou, B. Zhang, Y. He, and B. Rajwa, “Simplicity of k-means versus deepness of deep learning: A case of unsupervised feature learning with limited data,” in IEEE ICMLA, 2015, pp. 883–888.
 [61] P.-Y. Chen and A. O. Hero, “Local Fiedler vector centrality for detection of deep and overlapping communities in networks,” in IEEE ICASSP, 2014, pp. 1120–1124.
 [62] X. Peng, R. S. Feris, X. Wang, and D. N. Metaxas, “A recurrent encoder-decoder network for sequential face alignment,” in ECCV, 2016, pp. 38–56.
 [63] P.-Y. Chen and S. Liu, “Bias-variance tradeoff of graph laplacian regularizer,” IEEE Signal Processing Letters, vol. 24, no. 8, pp. 1118–1122, Aug 2017.
 [64] X. Peng, J. Huang, Q. Hu, S. Zhang, A. Elgammal, and D. Metaxas, “From circle to 3-sphere: Head pose estimation by instance parameterization,” Computer Vision and Image Understanding, vol. 136, pp. 92–102, 2015.
 [65] S. Liu, P.-Y. Chen, and A. O. Hero, “Accelerated distributed dual averaging over evolving networks of growing connectivity,” arXiv preprint arXiv:1704.05193, 2017.
 [66] T. Qin and K. Rohe, “Regularized spectral clustering under the degree-corrected stochastic blockmodel,” in Advances in neural information processing systems (NIPS), 2013, pp. 3120–3128.
 [67] C. Gao, Z. Ma, A. Y. Zhang, and H. H. Zhou, “Community detection in degree-corrected block models,” arXiv preprint arXiv:1607.06993, 2016.
 [68] P.-Y. Chen and A. O. Hero, “Phase transitions in spectral community detection,” IEEE Trans. Signal Process., vol. 63, no. 16, pp. 4339–4347, Aug 2015.
 [69] ——, “Multilayer spectral graph clustering via convex layer aggregation: Theory and algorithms,” IEEE Trans. Signal Inf. Process. Netw., 2017.
 [70] R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1990.
 [71] S. Resnick, A Probability Path. Birkhäuser Boston, 2013.
 [72] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
Supplementary Material
Appendix A Proof of Lemma 3.1
We separate the proof into two cases: (I) , and (II) . For case (I), notice that under SBM() each entry in is an independent and identically distributed Bernoulli random variable with success probability . Let , where . As a result, each entry in is either with probability or with probability . Latala’s theorem [24] states that for any random matrix with statistically independent and zero-mean entries, there exists a positive constant such that
(S1)
where is the largest singular value of . It is clear that each entry in is independent and has zero mean. By replacing with in Latala’s theorem, since , we have , , and . Therefore, as .
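For reference, a commonly quoted form of Latala's bound [24] is sketched below. The symbols $\mathbf{M}$, $M_{ij}$, $c$ and $\sigma_1$ are notation assumed here for illustration and may differ from the notation used in (S1).

```latex
\mathbb{E}\left[\sigma_1(\mathbf{M})\right]
\le c \left(
  \max_i \sqrt{\sum_j \mathbb{E}\left[M_{ij}^2\right]}
  + \max_j \sqrt{\sum_i \mathbb{E}\left[M_{ij}^2\right]}
  + \sqrt[4]{\sum_{i,j} \mathbb{E}\left[M_{ij}^4\right]}
\right)
% For an entry equal to 1-p with probability p and -p with probability 1-p:
%   E[M_ij^2] = p(1-p),   E[M_ij^4] = p(1-p)\left[(1-p)^3 + p^3\right].
```

With these moments, each of the three terms grows no faster than the square root of the larger matrix dimension, which is why a normalization that grows linearly in the dimension drives the expected largest singular value to zero.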
We then use Talagrand’s concentration inequality, which is stated as follows. Let be a convex and 1-Lipschitz function. Let be a random vector and assume that every element of satisfies for all , with probability one. Then there exist positive constants and such that ,
(S2) 
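As a reminder, one standard form of Talagrand's inequality [25] for a convex, 1-Lipschitz function of a random vector with independent entries bounded by $K$ is sketched below. The symbols $f$, $\mathbf{x}$, $K$, $\epsilon$, $c_2$ and $c_3$ are assumed notation and may not match the symbols of (S2).

```latex
\Pr\left( \left| f(\mathbf{x}) - \mathbb{E}\left[ f(\mathbf{x}) \right] \right| > \epsilon \right)
\le c_2 \exp\left( - \frac{c_3\, \epsilon^2}{K^2} \right),
\qquad \forall\, \epsilon > 0
```

The largest singular value is a natural candidate for $f$ because it is a norm, hence convex, and it is 1-Lipschitz with respect to the Frobenius norm of the matrix.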
Since [70], it is easy to check that is a convex and 1-Lipschitz function. Therefore, applying Talagrand’s inequality and substituting with the facts that and , we have
(S3) 
Note that, since for any positive integer , we have . Hence, by the Borel-Cantelli lemma [71], when , where denotes almost sure convergence. Using the result from standard matrix perturbation theory [70] yields for all , where denotes the th largest singular value. By the fact that , we have as ,
(S4)  
(S5) 
which implies
(S6) 
as . Finally, for every , ,
(S7) 
as and , which completes the proof of case (I).
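The concentration behind case (I) can be illustrated numerically. The sketch below is an assumption-laden illustration, not the proof's exact statement: it draws a square matrix of centered Bernoulli entries, estimates its largest singular value by power iteration, and divides by the dimension n. The normalization by n, the choice p = 0.3, and all function names are hypothetical.

```python
import math
import random


def centered_bernoulli(n, p, rng):
    """n x n matrix with i.i.d. entries equal to 1-p w.p. p and -p w.p. 1-p."""
    return [[(1.0 - p) if rng.random() < p else -p for _ in range(n)]
            for _ in range(n)]


def top_singular_value(mat, iters=30):
    """Approximate the largest singular value by power iteration on mat^T mat.

    The returned value never exceeds the true largest singular value.
    """
    n_rows, n_cols = len(mat), len(mat[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # unit-norm starting vector
    sigma = 0.0
    for _ in range(iters):
        u = [sum(r[j] * v[j] for j in range(n_cols)) for r in mat]  # u = M v
        w = [sum(mat[i][j] * u[i] for i in range(n_rows))
             for j in range(n_cols)]                                # w = M^T u
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            return 0.0
        sigma = math.sqrt(norm_w)  # ||M^T M v|| <= sigma_1^2 for unit v
        v = [x / norm_w for x in w]
    return sigma


def normalized_norm(n, p=0.3, seed=0):
    """Estimated largest singular value of the centered matrix, divided by n."""
    rng = random.Random(seed)
    return top_singular_value(centered_bernoulli(n, p, rng)) / n
```

Comparing `normalized_norm(40)` with `normalized_norm(160)` shows the normalized value shrinking as n grows, consistent with a largest singular value that scales like the square root of n.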
For case (II), since accounts for the adjacency matrix of edges within community , it is a symmetric matrix with zeros on its main diagonal. Let denote the matrix that has the same entries as in the upper diagonals and has zero entries in the lower diagonals. Then . Applying Latala’s theorem and Talagrand’s concentration inequality to , we have
(S8) 
as due to the fact that