Community detection aims to assign community labels to nodes in a graph such that nodes in the same community are more similar (better connected) than nodes in different communities. It is essentially an unsupervised learning problem, since one is provided only with the graph connectivity. Despite its unsupervised nature, recent research has identified the informational and algorithmic limits of community detection under certain generative community models (GCMs), especially for spectral graph clustering (SGC) algorithms, such as those that use the eigenvectors of the graph Laplacian matrices or the modularity matrix for community detection. However, these analyses assume that the GCM matches the given graph well, which may not hold in practice, and a mismatch between the given graph and the underlying GCM often yields poor community detection results. On the other hand, optimizing a designed objective function for community detection, such as normalized cut or modularity, imposes no model assumption, but the detection results can be sensitive to the choice of objective [6, 7].
Motivated by the advantages of the theoretical and objective principles, we propose SGC-GEN, a novel unified community detection framework that possesses the following features:
The power of community detectability. Under GCMs, the theoretical analysis of community detectability allows us to assess the quality of communities by converting the theoretical guarantees to a loss function that quantifies the error in community detection.
The constraint to model mismatch. By imposing an error metric on the level of inconsistency between a given graph and a GCM, one can confine the detection error due to model mismatch and hence improve community detection.
In particular, owing to the strong performance of SGC based on the normalized graph Laplacian matrix, a number of SGC variants have been proposed to improve clustering performance in terms of scalability, robustness, and applicability. To provide a thorough analysis, in this paper we focus on the standard formulation of SGC based on the normalized graph Laplacian matrix introduced by the seminal works [8, 9, 2] (see Sec. III-A). The main goal of this paper is to demonstrate the effectiveness of SGC-GEN, which combines standard SGC with GCMs in a unified framework. Because it originates from the standard SGC formulation presented in Sec. III-A, SGC-GEN can easily be generalized to many state-of-the-art SGC methods [11, 12, 13, 14]. By revisiting the standard formulation of SGC with GCMs, we establish a novel condition on correct community detection using SGC via the normalized graph Laplacian matrix under a GCM called the stochastic block model (SBM). We then convert this condition into a data-driven community detection loss function and apply it to SGC-GEN to develop effective and computationally efficient community detection methods.
We highlight our contributions as follows:
We propose SGC-GEN, a unified community detection framework combining the principles of theoretical detectability and well-designed objective functions for improvement.
We establish a condition on the correctness of community detection using SGC under an SBM, which leads to a novel data-driven community loss function for SGC-GEN. Moreover, since the loss function enables community quality assessment, the proposed SGC-GEN resembles the formulation of a supervised learning problem consisting of a loss function and a regularization function.
We present an algorithm for SGC-GEN and conduct a rigorous computational analysis showing that SGC-GEN can be implemented as efficiently as the baseline methods.
We compare the performance of community detection on 18 real-life graph datasets and use 7 representative clustering metrics to rank each method. The experimental results show that joint consideration of theoretical detectability and model mismatch using SGC-GEN can substantially improve community detection when compared to 6 baseline community detection methods of similar objective functions.
II General Framework
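Throughout this paper, bold uppercase letters (e.g., $\mathbf{X}$ or $\mathbf{X}_k$) denote matrices and $[\mathbf{X}]_{ij}$ denotes the entry in the $i$-th row and the $j$-th column of $\mathbf{X}$, bold lowercase letters (e.g., $\mathbf{x}$ or $\mathbf{x}_k$) denote column vectors, the term $(\cdot)^T$ denotes matrix or vector transpose, italic letters (e.g., $x$, $x_k$ or $X$) denote scalars, and calligraphic uppercase letters (e.g., $\mathcal{X}$ or $\mathcal{X}_k$) denote sets. The term $G = (\mathcal{V}, \mathcal{E})$ denotes a graph characterized by a node set $\mathcal{V}$ and an edge set $\mathcal{E}$. The numbers of nodes and edges in $G$ are denoted by $n$ and $m$, respectively. The convergence of a real rectangular matrix $\mathbf{X}$ is with respect to the spectral norm, which is defined as $\|\mathbf{X}\|_2 = \max_{\mathbf{z}: \|\mathbf{z}\|_2 = 1} \|\mathbf{X}\mathbf{z}\|_2$, where $\|\mathbf{z}\|_2$ denotes the Euclidean norm of a vector $\mathbf{z}$. Based on the definition, $\|\mathbf{X}\|_2$ is equivalent to the largest singular value of $\mathbf{X}$. A matrix $\mathbf{X}$ is said to converge to another matrix $\mathbf{M}$ of the same dimension if $\|\mathbf{X} - \mathbf{M}\|_2$ approaches zero. For the convenience of notation, we write $\mathbf{X} \to \mathbf{M}$ if $\|\mathbf{X} - \mathbf{M}\|_2 \to 0$ as the matrix dimensions grow to infinity.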
Throughout this paper, we consider the problem of non-overlapping community detection in a simple connected graph that is undirected, unweighted and contains no self-loops. Given a graph and the number of communities , non-overlapping community detection aims to assign each node a community label and divide the nodes into communities such that the nodes in the same community are better connected than nodes in different communities.
Spectral graph clustering (SGC).
SGC is a widely used technique for community detection. It transforms a graph into a vector space representation via spectral decomposition of a matrix associated with the graph. Specifically, each node in the graph is represented by a low-dimensional vector using a common subset of eigenvectors of the matrix. Based on the vector space representation, K-means clustering is applied to obtain the communities. One typical example of SGC uses the normalized graph Laplacian matrix, whose smallest eigenvectors are used for community detection.
The stochastic block model (SBM) is a parametric GCM that specifies a set of within-community and between-community edge connection probability parameters. The graphon model [16, 17] is a nonparametric GCM that generates a graph based on latent representations. Other GCMs are discussed in the survey literature.
SGC under GCMs. For graphs generated by certain GCMs, recent research findings suggest that the performance of community detection using SGC can be separated into two regimes: a detectable regime, where the detected communities are consistent with the ground-truth communities, and an undetectable regime, where the detected communities and the ground-truth communities are inconsistent. Moreover, the critical boundary that separates these two regimes can be specified. Consequently, the problem of evaluating the quality of detected communities can be converted to the problem of estimating to which regime the given graph belongs. More details are given in the related work section (Sec. VI).
II-C Problem Formulation of SGC-GEN
Consider community detection in a graph with an unknown number of communities. For each possible number of communities , we can provide quantitative measures on community detectability and model mismatch for SGC under GCMs. Specifically, given a GCM of communities and a set of communities detected by a SGC method , the corresponding community detection loss function and model mismatch metric are as follows.
Community detection loss function. For any , and , let denote a nonnegative loss function that reflects the level of incorrect community detection using under . Higher loss suggests the detected communities are less reliable.
Model mismatch metric. Let be a real-valued function quantifying the difference between the detected communities using and the underlying GCM . Larger value of suggests the detected communities are less consistent with the assumption of .
SGC-GEN. Inspired by the formation of supervised learning problems, community detection, albeit an unsupervised learning problem, can be formulated in a similar fashion by specifying a community detection loss function and a model mismatch metric . Given a maximum number of communities , a SGC method and a GCM , the proposed community detection framework, called SGC-GEN, solves the following minimization problem
where denotes the set of candidate community detection results with different numbers of communities obtained by . Using terminology from supervised learning theory, is analogous to the loss function, resembles the regularization function, and is the regularization parameter.
Many existing community detection methods can fit into the framework of SGC-GEN in (1). For example, objective-function-based algorithms specify a particular energy function for quality assessment and set . Greedy algorithms specify a model mismatch metric and set and . For example, the Louvain method  selects to be the negative modularity, where modularity is a measure of relative difference between the detected communities and the corresponding configuration model .
III Theoretical Foundation of SGC-GEN: Normalized Graph Laplacian Matrix and Stochastic Block Model
In this section we study the community detectability of SGC using the normalized graph Laplacian matrix under a stochastic block model (SBM). We establish a sufficient and necessary condition such that SGC is guaranteed to yield reliable community detection results for graphs generated by an SBM. The established condition will be used in Sec. IV to devise a novel data-driven community detection loss function for the proposed SGC-GEN framework in (1). For demonstration, we also provide a case study of the established condition under a simplified SBM. The proofs of the established results are given in the supplementary material, which can be downloaded from www.pinyuchen.com.
III-A Normalized Graph Laplacian Matrix and Stochastic Block Model (SBM)
SGC using normalized graph Laplacian matrix. Let denote the adjacency matrix of and let be the corresponding diagonal degree matrix. The unnormalized graph Laplacian matrix is defined as . The normalized graph Laplacian matrix is defined as . We denote the -th smallest eigenpair of by , where is the eigenvector associated with the eigenvalue , and . It is also known that . The standard SGC algorithm using is summarized in Algorithm 1.
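As an illustration of the standard SGC pipeline described above, the following Python sketch may be helpful. It assumes the usual normalized Laplacian $\mathbf{I} - \mathbf{D}^{-1/2}\mathbf{A}\mathbf{D}^{-1/2}$, row normalization of the eigenvector matrix, and a plain Lloyd's K-means with a deterministic farthest-point initialization. The function names, the initialization, and the toy graph are illustrative assumptions and are not taken from Algorithm 1.

```python
import numpy as np


def normalized_laplacian(A):
    """Return I - D^{-1/2} A D^{-1/2}; assumes every node has degree > 0."""
    d = A.sum(axis=1)
    s = 1.0 / np.sqrt(d)
    return np.eye(len(A)) - (s[:, None] * A) * s[None, :]


def kmeans(X, K, n_iter=100):
    """Plain Lloyd's K-means with farthest-point initialization (an assumption)."""
    centers = [X[0]]
    for _ in range(1, K):
        dist = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq_dist, axis=1)
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_graph_clustering(A, K):
    """Cluster nodes using the K smallest eigenvectors of the normalized Laplacian."""
    eigvals, eigvecs = np.linalg.eigh(normalized_laplacian(A))  # ascending eigenvalues
    Y = eigvecs[:, :K]
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)            # row normalization
    return kmeans(Y, K), eigvals[:K]


# Toy graph (hypothetical): two triangles joined by a single edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
labels, smallest_eigvals = spectral_graph_clustering(A, K=2)
```

On this toy graph the two triangles end up in different clusters, and the smallest eigenvalue is numerically zero because the graph is connected.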
Let be the matrix of eigenvectors . The matrix is the solution of the minimization problem
where is the identity matrix, and the constraint imposes orthogonality and unit norm for the columns in . If is a connected graph, then by the definition of , we have . Let be the matrix after removing the first column from . Then (2) can be reformulated as
where is the vector of 1’s (0’s) and is the solution to (3). The minimization problem in (3) is a standard formulation of SGC based on the normalized graph Laplacian matrix [8, 9, 2], which is also a fundamental element of many state-of-the-art SGC methods [11, 12, 13, 14], and it will be the foundation of the theoretical results presented in Sec. III-B.
Stochastic block model (SBM). SBM  is a fundamental GCM, and it has been the root of many other GCMs such as the degree-corrected SBM  and the random interconnection model . SBM is a parametric GCM that assumes common edge connection probability for within-community and between-community edges. A graph of communities can be generated by a SBM as follows. The SBM first divides the nodes into groups, where each group has nodes such that . For each unordered node pair , , an edge between and is connected with probability , where denote the community labels of and . Therefore, the SBM is parameterized by the number of communities and the edge connection probability matrix , where and is symmetric. We denote the SBM with parameters and by SBM(,).
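The generative procedure above can be sketched in a few lines of Python. This is a minimal illustration, assuming a dense NumPy representation; the function name `sample_sbm`, the community sizes, and the probability values are hypothetical choices for demonstration only.

```python
import numpy as np


def sample_sbm(sizes, P, seed=0):
    """Sample a symmetric 0/1 adjacency matrix without self-loops from an SBM."""
    rng = np.random.default_rng(seed)
    P = np.asarray(P, dtype=float)
    labels = np.repeat(np.arange(len(sizes)), sizes)   # community label of each node
    prob = P[labels][:, labels]                        # n x n edge probabilities
    n = len(labels)
    upper = np.triu(rng.random((n, n)) < prob, k=1)    # one coin flip per unordered pair
    A = (upper | upper.T).astype(float)                # symmetrize
    return A, labels


# Hypothetical parameters: two communities of 30 and 20 nodes.
A, labels = sample_sbm(sizes=[30, 20], P=[[0.5, 0.05], [0.05, 0.4]])
```

Each unordered node pair is sampled exactly once from the upper triangle, so the resulting graph is undirected and has no self-loops, matching the graph class considered in this paper. Note that a sampled graph is not guaranteed to be connected.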
III-B Theoretical Guarantees on Community Detectability
Here we analyze the performance of community detection on graphs generated by SBM(,) using . In particular, we establish a sufficient and necessary condition on correct community detection, where correct community detection means the detected communities using match the oracle communities generated by SBM(,), up to some permutation in community labels. The condition of community detectability leads to a novel community detection loss function as will be discussed in Sec. IV. Let , , and let denote the limit value of as . The following lemma serves as a cornerstone that connects the dots between and SBM(,).
Lemma 1 (matrix concentration under SBM(,))
Let denote the adjacency matrix of edges between communities and of a graph generated by SBM(,), . The following holds almost surely as , and :
The matrix concentration result in Lemma 1 shows that the scaled adjacency matrix converges asymptotically to a constant matrix of finite spectral norm , which associates with the relative community size and the edge connection probability under SBM(,). The condition guarantees that all community sizes grow at a comparable rate. Note that Lemma 1 presumes each entry in is a constant. In case of sparse graphs where or for some positive constants , similar matrix concentration result holds with high probability under mild conditions via degree regularization techniques [26, 27].
Since Algorithm 1 is invariant to the permutation of node indices, for the purpose of analysis we treat the adjacency matrix as a matrix of blocks . Using Lemma 1, we establish a sufficient and necessary condition on correct community detection using for graphs generated by SBM(,).
Theorem 1 (community detectability using under SBM(,))
For any graph generated by SBM(,), let and , where is the -th smallest eigenpair of . The following holds almost surely as , and :
The communities in can be correctly detected using if and only if .
We provide a sketch of the proof below. The complete proof is given in Appendix B of the supplementary material.
Step 1. Specify the optimality condition of using (3).
Step 2. Show the distribution of the rows in can be separated into two regimes, detectable or undetectable, using Lemma 1.
Step 3. If is in the undetectable regime, show the distribution of the rows in is inconsistent with the community structure.
Step 4. If is in the detectable regime, show the distribution of the rows in is consistent with the community structure.
Step 5. Show is in the detectable regime iff . ∎
Note that Theorem 1 provides a novel data-driven criterion for evaluating the quality of communities without the knowledge of the parameters in SBM(). In other words, for any graph generated by SBM(), for evaluating community detectability it suffices to compute the smallest nonzero eigenvalues of and inspect the condition , which will be further explored in Sec. IV. In addition, Theorem 1 also implies the feasibility of community detection using Algorithm 1, since and row normalization does not alter the sign of each entry in .
III-C Case Study: SBM()
To investigate the implication of the sufficient and necessary condition for correct community detection in Theorem 1, we study SBM(), the case of SBM with two communities, and justify the condition via numerical experiments. Under SBM(), we allow the size of the two communities, and , to be arbitrary as long as their limit values . We also simplify the notation of the edge connection matrix by defining , , and . The following corollary specifies the condition of community detectability in terms of , and .
Corollary 1 (community detectability using under SBM()) For any graph generated by SBM(,), let denote the second smallest eigenvector of , where , , is the community-indexed block vector of . The following holds almost surely as and :
The two communities in can be correctly detected using if and only if .
Furthermore, and for some if and only if .
The results follow from the condition in Theorem 1 under SBM(,). The proof is given in Appendix C of the supplementary material. ∎
The established detectability condition in Corollary 1 is universal in the sense that it does not depend on the ratio of the community sizes as long as its limit value . It is worth mentioning that the condition for correct community detection is also consistent with the condition using methods other than , such as the spectral modularity matrix , the spectrum of modular matrix , and the inference-based method . In addition, when , the results that and for some imply the nodes in the same community have identical yet community-wise distinct representation, as and are nonzero constant vectors with opposite signs. This guarantees that K-means clustering on leads to correct community detection when . In particular, when and , the parameters and reflect the expected number of within-community and between-community edges, respectively. The condition in Corollary 1 then reduces to , which means the two communities can be correctly detected when there are more within-community edges than between-community edges.
Fig. 1 displays two numerical examples of different community sizes to validate the detectability condition. It can be observed that in both cases when , correct community detection can be achieved and . On the other hand, when , correct community detection is impossible and is close to . Consequently, inspecting the data-driven parameter indeed reveals community detectability, which validates Theorem 1 and Corollary 1.
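A numerical check in the spirit of this case study can be sketched as follows. It is not the experiment behind Fig. 1: the community sizes, the edge probabilities, the random seed, and the sign-based assignment rule are all assumptions made for illustration, and for simplicity both communities share the same within-community probability.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 40, 30            # community sizes (hypothetical)
p_in, p_out = 0.6, 0.05    # within / between edge probabilities (hypothetical)

# Sample a two-community SBM graph.
labels = np.repeat([0, 1], [n1, n2])
prob = np.where(labels[:, None] == labels[None, :], p_in, p_out)
upper = np.triu(rng.random((n1 + n2, n1 + n2)) < prob, k=1)
A = (upper | upper.T).astype(float)

# Normalized graph Laplacian and its second smallest eigenpair.
d = A.sum(axis=1)
s = 1.0 / np.sqrt(d)
L_N = np.eye(n1 + n2) - (s[:, None] * A) * s[None, :]
eigvals, eigvecs = np.linalg.eigh(L_N)       # ascending eigenvalues
v2 = eigvecs[:, 1]

# Assign communities by the sign of the second smallest eigenvector.
pred = (v2 > 0).astype(int)
accuracy = max(np.mean(pred == labels), np.mean(pred != labels))  # up to label swap
```

With within-community edges far more likely than between-community edges, the two blocks of the second smallest eigenvector are expected to take opposite signs, so the sign pattern alone should recover the planted communities almost perfectly.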
IV Community Detection Algorithms using SGC-GEN
IV-A SGC-GEN Meta Algorithm
The proposed SGC-GEN framework in (1) applies to any SGC method and any GCM. It is a meta algorithm that performs community detection once the corresponding community detection loss function and the model mismatch metric are specified, in addition to the regularization parameter and the maximum number of communities . Algorithm 2 summarizes SGC-GEN.
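The selection step of such a meta algorithm can be sketched as follows. This is only an illustration under a stated assumption: the objective is taken to be the loss plus the regularization parameter times the model mismatch metric, which is the usual loss-plus-regularizer form suggested by the supervised learning analogy. The function name `sgc_gen_select` and all toy values are hypothetical.

```python
def sgc_gen_select(candidates, loss_fn, mismatch_fn, lam):
    """Return (k, communities, score) minimizing loss + lam * mismatch.

    `candidates` maps each candidate number of communities k to the
    community labels produced by the chosen SGC method for that k.
    The additive form of the score is an assumption of this sketch.
    """
    best = None
    for k, communities in sorted(candidates.items()):
        score = loss_fn(communities) + lam * mismatch_fn(communities)
        if best is None or score < best[2]:
            best = (k, communities, score)
    return best


# Toy usage with made-up loss and mismatch values, indexed by number of communities.
toy_loss = {2: 0.9, 3: 0.2, 4: 0.3}
toy_mismatch = {2: 0.1, 3: 0.5, 4: 0.1}
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
    4: [0, 0, 1, 2, 3, 3],
}


def n_comm(labels):
    return len(set(labels))


best_k, best_labels, best_score = sgc_gen_select(
    candidates,
    loss_fn=lambda c: toy_loss[n_comm(c)],
    mismatch_fn=lambda c: toy_mismatch[n_comm(c)],
    lam=1.0,
)
```

In this toy example the candidate with four communities has the smallest combined score when the regularization parameter is 1, whereas setting the parameter to 0 would select three communities because only the loss is considered.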
|SBM-AIC ||Bayesian inference||SBM()||0|
|SBM-BIC ||Bayesian inference||SBM()||0|
|DCSBM-AIC ||Bayesian inference||DCSBM||0|
|DCSBM-BIC ||Bayesian inference||DCSBM||0|
|Self-Tuning ||None||defined in ||0|
|Louvain ||Node merging||None||0|
IV-B SGC-GEN via and SBM()
Based on the theoretical analysis established in Sec. III, here we specify 2 SGC methods, the corresponding community detection loss function, and 4 model mismatch metrics for SGC-GEN. This yields 8 community detection methods originated from Algorithm 2. In particular, for these methods we select the GCM to be SBM(). These SGC-GEN-empowered community detection methods are summarized in Table I. The details are described as follows.
Two SGC methods.
The first method is SGC using the normalized graph Laplacian matrix as described in Algorithm 1. To obtain the set in step 1 of Algorithm 2, one computes the smallest eigenvectors of and uses Algorithm 1 to obtain the candidate communities of different in .
The second method is regularized SGC using the normalized graph Laplacian matrix . It is similar to except that one replaces the matrix in step 1 of Algorithm 1 with , where is the average degree of the graph . The regularization leads to better clustering than in sparse graphs, as suggested in [32, 33].
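One common form of degree regularization can be sketched as follows. The sketch assumes that the regularizer is added to every node degree before normalization, i.e. the degree matrix is replaced by the degree matrix plus the average degree times the identity; this matches the description of adding entries to the degree matrix but is not guaranteed to be the exact matrix used in the paper. The function name and the toy path graph are hypothetical.

```python
import numpy as np


def regularized_normalized_laplacian(A, tau=None):
    """Return I - D_tau^{-1/2} A D_tau^{-1/2} with D_tau = D + tau * I (assumed form)."""
    d = A.sum(axis=1)
    if tau is None:
        tau = d.mean()                      # average degree as the regularizer
    s = 1.0 / np.sqrt(d + tau)
    return np.eye(len(A)) - (s[:, None] * A) * s[None, :]


# Toy path graph on three nodes: degrees (1, 2, 1), average degree 4/3.
A_path = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 1.0],
                   [0.0, 1.0, 0.0]])
L_reg = regularized_normalized_laplacian(A_path)
L_std = regularized_normalized_laplacian(A_path, tau=0.0)   # no regularization
```

Adding a positive constant to every degree damps the influence of low-degree nodes on the normalization, which is the usual motivation for regularization in sparse graphs.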
Community detection loss function.
Since and are SGC methods via , using Theorem 1, we set the community detection loss function to be
where and is the -th smallest eigenvalue of . It is similar to the exponential loss function used in supervised learning problems. The denominator serves the purpose of comparing different community detection results in .
When , the function is confined in the interval , and it favors the community detection results of small partial eigenvalue sum , which is a measure of multiway cut in .
When , is greater than 1 and has exponential growth as decreases, which implies that imposes large loss on incorrect community detection results based on Theorem 1. Note that is a data-driven function since it only requires the knowledge of .
Four model mismatch metrics.
spectral radius of the modular matrix with respect to SBM(). Define the modular matrix with respect to SBM() as if and if , for all , where denotes the community label of node . The parameter is the maximum likelihood estimator of in given the detected communities , which is defined as if and if , for all , where denotes the number of nodes in and denotes the number of edges between communities and . is defined as the spectral radius of , which is the largest eigenvalue of in absolute value. It relates to the first-order eigenvalue approximation of the signed triangle counts , which is an effective statistic for testing latent structure in random graphs.
negative modularity. Given communities in , modularity is a measure of difference between and a random graph of the same degree sequence . The modularity is defined as , where if and if , for all , and . By defining , the model mismatch metric is small when the communities are distinct from the corresponding randomized graphs.
AIC under SBM(). The Akaike information criterion (AIC) is a measure of the relative quality of statistical models for a given set of data. is defined as the AIC given communities under SBM(), which is , where denotes the log-likelihood of under SBM(). The closed-form expression of is given in .
BIC under SBM(). The Bayesian information criterion (BIC) is another relative measure of data fitness to statistical models. is defined as the BIC of communities under SBM(), which is .
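The four model mismatch metrics above can be sketched in Python as follows. Several details are assumptions of this sketch rather than statements of the paper: the maximum likelihood estimate is taken to be the number of edges divided by the number of node pairs in each block, the modular matrix is taken to have zero diagonal, the log-likelihood is the Bernoulli SBM likelihood, the number of free parameters is K(K+1)/2, and the BIC sample size is the number of node pairs. The function names are hypothetical.

```python
import numpy as np


def block_counts(A, labels, K):
    """Edge counts and node-pair counts for every community pair (k <= l)."""
    sizes = np.bincount(labels, minlength=K).astype(float)
    Z = np.eye(K)[labels]                      # n x K one-hot memberships
    M = Z.T @ A @ Z                            # diagonal counts each within edge twice
    edges = np.triu(M, k=1) + np.diag(np.diag(M) / 2.0)
    pairs = np.triu(np.outer(sizes, sizes), k=1) + np.diag(sizes * (sizes - 1) / 2.0)
    return edges, pairs


def sbm_mle(A, labels, K):
    """Symmetric K x K matrix of estimated edge probabilities (edges / pairs)."""
    edges, pairs = block_counts(A, labels, K)
    upper = np.divide(edges, pairs, out=np.zeros_like(edges), where=pairs > 0)
    return upper + np.triu(upper, k=1).T


def g_eig(A, labels, K):
    """Spectral radius of the modular matrix A - P_hat[g_i, g_j] with zero diagonal."""
    B = A - sbm_mle(A, labels, K)[labels][:, labels]
    np.fill_diagonal(B, 0.0)
    return np.max(np.abs(np.linalg.eigvalsh(B)))


def g_mod(A, labels):
    """Negative Newman modularity."""
    d = A.sum(axis=1)
    same = labels[:, None] == labels[None, :]
    return -((A - np.outer(d, d) / d.sum()) * same).sum() / d.sum()


def g_aic_bic(A, labels, K):
    """AIC and BIC of the fitted Bernoulli SBM (parameter count is an assumption)."""
    edges, pairs = block_counts(A, labels, K)
    mask = (edges > 0) & (edges < pairs)       # remaining blocks contribute zero
    p = edges[mask] / pairs[mask]
    loglik = np.sum(edges[mask] * np.log(p) + (pairs[mask] - edges[mask]) * np.log(1 - p))
    n, n_params = len(labels), K * (K + 1) / 2
    return 2 * n_params - 2 * loglik, n_params * np.log(n * (n - 1) / 2) - 2 * loglik


# Toy graph (hypothetical): two triangles joined by a single edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
good = np.array([0, 0, 0, 1, 1, 1])
bad = np.array([0, 1, 0, 1, 0, 1])
mod_good = g_mod(A, good)
aic_good, bic_good = g_aic_bic(A, good, 2)
eig_good, eig_bad = g_eig(A, good, 2), g_eig(A, bad, 2)
```

On the toy graph, the partition that follows the two triangles yields a smaller spectral radius than the interleaved partition, illustrating that a smaller value indicates communities that are more consistent with the fitted model.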
IV-C Computational Complexity Analysis
Here we analyze the computational complexity of the 8 SGC-GEN community detection methods listed in Table I. There are three main factors contributing to the computational complexity: (i) computation of the smallest eigenvectors of , (ii) K-means clustering, and (iii) computation of the community detection loss function and the model mismatch metric. The overall computational complexity of each method is summarized in Table I.
For (i), computing the smallest eigenvectors of requires operations using power iteration techniques [35, 36, 37, 38, 39], where is the number of nonzero entries in . For (ii), given any , K-means clustering on the rows of the smallest eigenvectors of requires operations . As a result, obtaining the set of candidate community detection results by varying from to requires operations in total. For (iii), the complexity of computing the function and the loss function in (4) is negligible since they can be obtained in the process of (i). The computation of for a given requires operations for computing and operations for computing the spectral radius of using power iteration techniques, where is the number of nonzero entries in . Therefore, the overall computational complexity of in SGC-GEN is , where is the maximum number of nonzero entries in ranging from to . The computation of for a given is , the same complexity as for computing modularity . The overall computational complexity of in SGC-GEN is . For a given , the computation of and requires operations to compute the closed-form log-likelihood function . The overall computational complexity of and in SGC-GEN is . The computational complexity of and has the same order since the regularization step in simply adds entries to the degree matrix . Similarly, the data storage of these methods requires space.
|Dataset||Description||Node||Edge||# of nodes||# of edges||Community labels|
|BlogCatalog (http://socialcomputing.asu.edu/datasets/BlogCatalog3)||online social network||user||friendship||10312||333983||39 social groups|
|Youtube (http://socialcomputing.asu.edu/datasets/YouTube2)||online social network||user||friendship||22180||96092||47 social groups|
|PoliticalBlog (http://konect.uni-koblenz.de/networks/moreno-blogs)||online social network||user||blog reference||1222||16714||2 political parties|
|Cora (http://www.cs.umd.edu/~sen/lbc-proj/data/cora.tgz)||publication network||paper||citation||2485||5069||7 research topics|
|Citeseer (http://www.cs.umd.edu/~sen/lbc-proj/data/citeseer.tgz)||publication network||paper||citation||2110||3694||6 research topics|
|Pubmed (http://www.cs.umd.edu/projects/linqs/projects/lbc/Pubmed-Diabetes.tgz)||publication network||paper||citation||19717||44324||3 research topics|
|AS-Newman (http://www-personal.umich.edu/~mejn/netdata/)||communication network||router||connection||22963||48436||NA|
|Facebook (http://snap.stanford.edu/data/egonets-Facebook.html)||online social network||user||friendship||4039||88234||NA|
|PowerGrid (http://konect.uni-koblenz.de/networks/opsahl-powergrid)||physical network||power station||power line||4941||6594||NA|
In summary, the overall computational complexity of SGC-GEN-enabled methods is linear in the number of nodes and edges ( and ) and depends on . In practice is a constant such that and . Based on the computational analysis, the community detection methods based on SGC-GEN have the same order of complexity in and when compared with the baseline methods of similar objective functions described in Sec. V-B, which suggests that utilizing SGC-GEN for community detection is computationally as efficient as these baseline methods.
V Performance Evaluation
V-A Dataset Description and Evaluation Metrics
Dataset Description. To compare the performance of community detection, we collected 18 real-life graph datasets from various domains, including online social, physical, biological, communication, collaboration, email, and publication networks. For each dataset, we extracted the largest connected component as the input graph for community detection. All input graphs are made undirected, unweighted and unlabeled. Among these datasets, 6 datasets are provided with additional community labels. If a node in the graph is provided with more than one community label, the most common label among its neighboring nodes is assigned to the node. The statistics of the collected graphs are summarized in Table II.
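The preprocessing described above (symmetrizing the edges, discarding weights and self-loops, and keeping the largest connected component) can be sketched with a breadth-first search. The function name and the toy edge list are hypothetical, and node identifiers are assumed to be mutually comparable so that each undirected edge can be stored once.

```python
from collections import deque


def largest_connected_component(edges):
    """Return the undirected, unweighted edge set of the largest component."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue                          # drop self-loops
        adj.setdefault(u, set()).add(v)       # ignore direction and multiplicity
        adj.setdefault(v, set()).add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:                          # breadth-first search from `start`
            u = queue.popleft()
            for w in adj[u] - component:
                component.add(w)
                queue.append(w)
        seen |= component
        if len(component) > len(best):
            best = component
    return {(min(u, v), max(u, v)) for u in best for v in adj[u]}


# Toy directed edge list with a duplicate, a self-loop, and a small second component.
lcc_edges = largest_connected_component([(1, 2), (2, 1), (2, 3), (3, 3), (4, 5)])
```

Here the component on nodes 1, 2 and 3 is kept, and the reciprocal edge and the self-loop are collapsed away.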
Evaluation metrics. We use 7 representative external and internal clustering metrics to evaluate the performance of different community detection methods. External clustering metrics can be computed when the community labels are given. Internal clustering metrics evaluate the quality of communities in terms of connectivity, which can be computed without community labels.
external clustering metrics:
Normalized mutual information (NMI).
Rand index (RI).
F-measure (FM).
These external clustering metrics are properly scaled between 0 and 1, and larger value means better clustering performance.
internal clustering metrics:
Conductance (COND): the averaged COND over all communities. Lower value means better performance.
Normalized cut (NC): the averaged NC over all communities. Lower value means better performance.
Average out-degree fraction (avg-ODF): the averaged avg-ODF over all communities. Lower value means better performance.
Modularity (MOD): MOD is defined in the model mismatch metric in Sec. IV-B. Larger value means better performance.
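The first three internal metrics can be sketched as follows. The exact definitions used in the experiments are not spelled out in this section, so the sketch assumes common textbook forms: conductance as the cut divided by the volume of the community, normalized cut as the sum of the cut over the volume of the community and the cut over the volume of its complement, and avg-ODF as the mean fraction of each node's edges that leave its community. It also assumes at least two communities and no isolated nodes.

```python
import numpy as np


def internal_metrics(A, labels):
    """Average COND, NC and avg-ODF over communities (assumed textbook definitions)."""
    A = np.asarray(A, dtype=float)
    labels = np.asarray(labels)
    d = A.sum(axis=1)
    total_vol = d.sum()
    cond, ncut, odf = [], [], []
    for c in np.unique(labels):
        inside = labels == c
        out_deg = A[inside][:, ~inside].sum(axis=1)   # edges leaving the community, per node
        cut, vol = out_deg.sum(), d[inside].sum()
        cond.append(cut / vol)
        ncut.append(cut / vol + cut / (total_vol - vol))
        odf.append(np.mean(out_deg / d[inside]))
    return {"COND": np.mean(cond), "NC": np.mean(ncut), "avg-ODF": np.mean(odf)}


# Toy graph (hypothetical): two triangles joined by a single edge.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
metrics = internal_metrics(A, [0, 0, 0, 1, 1, 1])
```

For this toy partition each community has one outgoing edge and volume 7, which gives an average conductance of 1/7, an average normalized cut of 2/7, and an avg-ODF of 1/9.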
Average rank score. To combine multiple clustering metrics for performance evaluation of different community detection methods, we adopt the methodology proposed in [6, 7] and use the average rank score of all clustering metrics as the performance metric. For each dataset, we rank each community detection method for every clustering metric via standard competition rankings and obtain an average rank score of all clustering metrics. Therefore, lower average rank score means better community detection.
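The average rank score can be sketched as follows, assuming standard competition ranking in which tied methods share the best rank and the next rank is skipped. The function names, method names, and metric values in the toy example are made up for illustration.

```python
def competition_ranks(scores, higher_is_better):
    """Standard competition ranking: rank = 1 + number of strictly better methods."""
    return {
        name: 1 + sum(
            (other > s) if higher_is_better else (other < s)
            for other in scores.values()
        )
        for name, s in scores.items()
    }


def average_rank_scores(metric_table, higher_is_better):
    """metric_table[metric][method] -> value; returns method -> mean rank over metrics."""
    methods = list(next(iter(metric_table.values())).keys())
    ranks = {
        metric: competition_ranks(scores, higher_is_better[metric])
        for metric, scores in metric_table.items()
    }
    return {m: sum(r[m] for r in ranks.values()) / len(ranks) for m in methods}


# Toy values (hypothetical): NMI is better when larger, COND when smaller.
toy_table = {
    "NMI": {"A": 0.8, "B": 0.8, "C": 0.5},
    "COND": {"A": 0.3, "B": 0.1, "C": 0.2},
}
avg_rank = average_rank_scores(toy_table, {"NMI": True, "COND": False})
```

In the toy example methods A and B tie for rank 1 on NMI, so C receives rank 3 on that metric, and the resulting average ranks are 2.0 for A, 1.0 for B, and 2.5 for C.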
V-B Baseline Comparative Methods
As summarized in Table I, we compare the performance of SGC-GEN methods with 6 baseline community detection methods of similar loss functions and model mismatch metrics:
SBM-AIC: Given the number of communities, SBM-AIC uses Bayesian inference techniques to evaluate the posterior distribution of community assignments given the graph under the SBM. We used the state-of-the-art package WSBM (http://tuvalu.santafe.edu/~aaronc/wsbm/) to obtain the most probable communities and use the AIC to determine the final communities ranging from to .
SBM-BIC: SBM-BIC is the same as SBM-AIC except that one uses the BIC to determine the final community detection results.
DCSBM-AIC: DCSBM-AIC is the same as SBM-AIC except that one uses the degree-corrected SBM (DCSBM)  for inference.
DCSBM-BIC: DCSBM-BIC is the same as SBM-BIC except that one uses the degree-corrected SBM (DCSBM)  for inference.
Self-Tuning (http://www.vision.caltech.edu/lihi/Demos/SelfTuningClustering.html): Self-Tuning is an SGC algorithm that uses an energy function based on for basis rotation and finds the best community detection results among to communities.
Louvain (https://perso.uclouvain.be/vincent.blondel/research/louvain.html): The Louvain method is a greedy modularity maximization approach for community detection based on node merging.
V-C The Effect of Regularization Parameter
Here we investigate the effect of the regularization parameter in (1) on the performance of the eight SGC-GEN community detection methods listed in Table I. We set and use the six datasets with additional community labels in Table II to select from the set . For illustration, Fig. 2 displays the stacked average rank plot of these SGC-GEN methods separately ranked by different in Youtube and Citeseer datasets. The colors represent different methods and the width of each colored block represents average rank score based on the selected values of . It is observed that for each method, setting large (i.e., underestimating the loss function) or neglecting the model mismatch metric (i.e., setting ) leads to the worst performance, which justifies the motivation of SGC-GEN. In addition, sweeping within does not induce drastic changes in the average rank score, which demonstrates the robustness of SGC-GEN. Based on the average rank score of these datasets, for the following experiments we assign () to the (regularized) SGC-GEN methods.
V-D Comparison to Baseline Methods
Here we compare the 8 SGC-GEN methods to the 6 baseline methods in Sec. V-B. For the Bayesian inference baseline methods we set , since we observe that larger does not improve their performance but significantly increases the computation time. For the SGC-GEN methods and Self-Tuning we set . For Louvain one does not need to specify . All experiments are implemented in MATLAB R2016 on a 16-core cluster with 128 GB RAM.
The table below displays the mean and standard deviation of average rank scores over all 18 graph datasets for each community detection method. Among these 14 methods, SGC-MOD and SGC-EIG have the best and second best mean average rank score over all datasets, which suggests that joint consideration of theoretical detectability and modular structure using the proposed SGC-GEN framework improves community detection. The results also suggest that the degree regularization technique does not necessarily guarantee better performance. For the baseline methods, it can be observed that Bayesian inference based approaches lead to poor performance, which can be explained by the fact that the graph datasets may not comply with the assumptions of the underlying generative community models. Louvain also yields poor performance since it is a greedy algorithm that aims to maximize only one clustering metric (i.e., modularity). Self-Tuning performs better than some SGC-GEN methods but does not outperform SGC-MOD and SGC-EIG, which can be explained by the fact that the energy function used in Self-Tuning does not exploit the discriminative power of community detectability. Since SGC-GEN is shown in Sec. IV-C to be computationally as efficient as these baseline methods, we conclude that community detection via SGC-GEN yields superior performance without incurring additional computational costs.
|Method||Average rank of all datasets|
V-E Comparison in graph domains and types
For further analysis, we categorize the 18 graph datasets in Table II into 7 domains based on their descriptions. Fig. 3 displays the mean average rank score of each domain for 10 selected methods. It can be observed that no single community detection method outperforms others in all domains. For example, regSGC-MOD has superior performance in online social, publication and biological networks but has poor performance in email and communication networks. SBM-BIC has the best performance in communication networks but not in other domains. SGC-MOD has the best averaged performance over all datasets but it does not prevail others in every domain. The results suggest that considering the graph domain is essential for improving community detection.
We also separate the 18 datasets into two types: with community labels and without community labels. The corresponding average rank scores are shown in Fig. 4. For the datasets with community labels, regSGC-MOD, regSGC-AIC and regSGC-BIC perform best, whereas for the datasets without community labels SGC-EIG and SGC-MOD perform best. Since community labels provide additional external clustering metrics, the results suggest that the external and internal clustering metrics reflect different evaluation criteria.
VI Related Work
Community detection and graph clustering have been an active research field in the past two decades. We refer readers to [1, 42] for an overview of community detection methods. In recent years, there has been a major breakthrough in analyzing both the informational and algorithmic limits of community detection under certain generative community models (GCMs). In this section we summarize the recent research findings in community detectability.
Many informational and algorithmic limits of community detection have been analyzed under the stochastic block model (SBM). Abbe et al. analyzed the informational limit by specifying the detectable and undetectable regimes for community detection via the parameters of the SBM. They also proposed a belief propagation algorithm that is proved to achieve the informational limit. Hajek et al. proposed a semidefinite programming algorithm that achieves the informational limit [45, 46]. Inference approaches based on statistical physics have been studied in [47, 48]. Spectral graph clustering algorithms based on the modularity matrix, the graph Laplacian matrix, the adjacency matrix, and the modular matrix have been studied in [49, 50, 32, 29, 51, 28, 52, 53, 26, 27] and applied to various applications in graph mining [54, 55, 56, 57, 58, 59, 60, 61] and machine learning [62, 63, 64, 65]. Under the same model, Qin and Rohe studied regularized spectral clustering, and Gao et al. derived a minimax risk. Chen and Hero proved the algorithmic limit of spectral clustering [68, 23, 69] under the random interconnection model. Although studying the limits of community detection methods under GCMs provides novel insights on evaluating community detectability, these approaches assume the graphs are consistent with the underlying GCMs and therefore neglect the error induced by model mismatch, which motivates the SGC-GEN framework proposed in this paper.
VII Conclusion and Future Work
In this paper we propose SGC-GEN, a new community detection framework that jointly exploits the discriminative power of community detectability under generative community models and confines the corresponding model mismatch. A novel condition on correct community detection is established for SGC-GEN, leading to effective and computationally efficient community detection methods.
Performance evaluation on 18 datasets and 7 clustering metrics shows that joint consideration of community detectability and modular structure via SGC-GEN outperforms 6 baseline approaches in terms of the average rank score. We also investigated the effect of graph domains and graph types on community detection.
The performance analysis established in this paper focuses on the standard formulation of SGC, which underlies many advanced methods. Our future work involves developing a scalable implementation of SGC-GEN to efficiently handle large-scale graphs and extending SGC-GEN to advanced community detection methods and models.
-  S. Fortunato, “Community detection in graphs,” Physics Reports, vol. 486, no. 3-5, pp. 75–174, 2010.
-  U. Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, Dec. 2007.
-  M. E. J. Newman, “Finding community structure in networks using the eigenvectors of matrices,” Phys. Rev. E, vol. 74, p. 036104, Sep 2006.
-  S. White and P. Smyth, “A spectral clustering approach to finding communities in graphs,” in SIAM International Conference on Data Mining (SDM), vol. 5, 2005, pp. 76–84.
-  M. E. J. Newman, “Fast algorithm for detecting community structure in networks,” Phys. Rev. E, vol. 69, p. 066133, Jun 2004.
-  J. Leskovec, K. J. Lang, and M. Mahoney, “Empirical comparison of algorithms for network community detection,” in ACM International Conference on World Wide Web (WWW), 2010, pp. 631–640.
-  J. Yang and J. Leskovec, “Defining and evaluating network communities based on ground-truth,” Knowledge and Information Systems, vol. 42, no. 1, pp. 181–213, 2015.
-  J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, 2000.
-  A. Y. Ng, M. I. Jordan, and Y. Weiss, “On spectral clustering: Analysis and an algorithm,” in Advances in neural information processing systems (NIPS), 2002, pp. 849–856.
-  P. W. Holland, K. B. Laskey, and S. Leinhardt, “Stochastic blockmodels: First steps,” Social Networks, vol. 5, no. 2, pp. 109–137, 1983.
-  J. Liu, C. Wang, M. Danilevsky, and J. Han, “Large-scale spectral clustering on graphs,” in International Joint Conference on Artificial Intelligence. AAAI Press, 2013, pp. 1486–1492.
-  F. Nie, X. Wang, and H. Huang, “Clustering and projected clustering with adaptive neighbors,” in ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2014, pp. 977–986.
-  Y. Li, J. Huang, and W. Liu, “Scalable sequential spectral clustering.” in AAAI, 2016, pp. 1809–1815.
-  F. Nie, X. Wang, M. I. Jordan, and H. Huang, “The constrained laplacian rank algorithm for graph-based clustering.” in AAAI, 2016, pp. 1969–1976.
-  A. Goldenberg, A. X. Zheng, S. E. Fienberg, and E. M. Airoldi, “A survey of statistical network models,” Foundations and Trends® in Machine Learning, vol. 2, no. 2, pp. 129–233, 2010.
-  P. Diaconis and S. Janson, “Graph limits and exchangeable random graphs,” arXiv preprint arXiv:0712.2749, 2007.
-  Y. Zhang, E. Levina, and J. Zhu, “Estimating network edge probabilities by neighborhood smoothing,” arXiv preprint arXiv:1509.08588, 2015.
-  E. Abbe, A. S. Bandeira, and G. Hall, “Exact recovery in the stochastic block model,” IEEE Trans. Inf. Theory, vol. 62, no. 1, pp. 471–487, 2016.
-  L. Zelnik-Manor and P. Perona, “Self-tuning spectral clustering,” in Advances in neural information processing systems (NIPS), 2004, pp. 1601–1608.
-  V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, no. 10, 2008.
-  M. E. J. Newman, “Modularity and community structure in networks,” Proc. National Academy of Sciences, vol. 103, no. 23, pp. 8577–8582, 2006.
-  B. Karrer and M. E. J. Newman, “Stochastic blockmodels and community structure in networks,” Phys. Rev. E, vol. 83, p. 016107, Jan 2011.
-  P.-Y. Chen and A. O. Hero, “Phase transitions and a model order selection criterion for spectral graph clustering,” arXiv preprint arXiv:1604.03159, 2016.
-  R. Latala, “Some estimates of norms of random matrices.” Proc. Am. Math. Soc., vol. 133, no. 5, pp. 1273–1282, 2005.
-  M. Talagrand, “Concentration of measure and isoperimetric inequalities in product spaces,” Publications Mathématiques de l’Institut des Hautes Études Scientifiques, vol. 81, no. 1, pp. 73–205, 1995.
-  C. M. Le and R. Vershynin, “Concentration and regularization of random graphs,” arXiv preprint arXiv:1506.00669, 2015.
-  A. Joseph, B. Yu et al., “Impact of regularization on spectral clustering,” The Annals of Statistics, vol. 44, no. 4, pp. 1765–1791, 2016.
-  P.-Y. Chen and A. O. Hero, “Universal phase transition in community detectability under a stochastic block model,” Phys. Rev. E, vol. 91, p. 032804, Mar 2015.
-  T. P. Peixoto, “Eigenvalue spectra of modular networks,” Phys. Rev. Lett., vol. 111, p. 098701, Aug 2013.
-  Y. Zhao, E. Levina, and J. Zhu, “Consistency of community detection in networks under degree-corrected stochastic block models,” The Annals of Statistics, vol. 40, no. 4, pp. 2266–2292, 08 2012.
-  C. Aicher, A. Z. Jacobs, and A. Clauset, “Learning latent block structure in weighted networks,” Journal of Complex Networks, p. cnu026, 2014.
-  K. Chaudhuri, F. C. Graham, and A. Tsiatas, “Spectral clustering of graphs with general degrees in the extended planted partition model,” in COLT, vol. 23, 2012, pp. 35–1.
-  A. A. Amini, A. Chen, P. J. Bickel, E. Levina et al., “Pseudo-likelihood methods for community detection in large sparse networks,” The Annals of Statistics, vol. 41, no. 4, pp. 2097–2122, 2013.
-  S. Bubeck, J. Ding, R. Eldan, and M. Z. Rácz, “Testing for high-dimensional geometry in random graphs,” Random Structures & Algorithms, 2016.
-  O. E. Livne and A. Brandt, “Lean algebraic multigrid (lamg): Fast graph Laplacian linear solver,” SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. B499–B522, 2012.
-  P.-Y. Chen, B. Zhang, M. A. Hasan, and A. O. Hero, “Incremental method for spectral clustering of increasing orders,” in ACM International Conference on Knowledge Discovery and Data Mining (KDD) Workshop on Mining and Learning with Graphs, 2016, arXiv preprint arXiv:1512.07349.
-  L. Wu and A. Stathopoulos, “A preconditioned hybrid svd method for accurately computing singular triplets of large matrices,” SIAM Journal on Scientific Computing, vol. 37, no. 5, pp. S365–S388, 2015.
-  L. Wu, J. Laeuchli, V. Kalantzis, A. Stathopoulos, and E. Gallopoulos, “Estimating the trace of the matrix inverse by interpolating from the diagonal of an approximate inverse,” Journal of Computational Physics, vol. 326, pp. 828–844, 2016.
-  L. Wu, E. Romero, and A. Stathopoulos, “Primme_SVDS: A high-performance preconditioned svd solver for accurate large-scale computations,” arXiv preprint arXiv:1607.01404, 2016.
-  M. J. Zaki and W. Meira Jr, Data mining and analysis: fundamental concepts and algorithms. Cambridge University Press, 2014.
-  G. W. Flake, S. Lawrence, and C. L. Giles, “Efficient identification of web communities,” in ACM International Conference on Knowledge Discovery and Data Mining (KDD), 2000, pp. 150–160.
-  S. Fortunato and D. Hric, “Community detection in networks: A user guide,” Physics Reports, vol. 659, pp. 1–44, 2016.
-  E. Abbe and C. Sandon, “Community detection in general stochastic block models: fundamental limits and efficient recovery algorithms,” arXiv preprint arXiv:1503.00609, 2015.
-  ——, “Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic bp, and the information-computation gap,” Advances in Neural Information Processing Systems (NIPS), 2016.
-  B. Hajek, Y. Wu, and J. Xu, “Achieving exact cluster recovery threshold via semidefinite programming,” IEEE Trans. Inf. Theory, vol. 62, no. 5, pp. 2788–2797, 2016.
-  ——, “Achieving exact cluster recovery threshold via semidefinite programming: Extensions,” IEEE Trans. Inf. Theory, vol. 62, no. 10, pp. 5918–5937, 2016.
-  A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, “Inference and phase transitions in the detection of modules in sparse networks,” Phys. Rev. Lett., vol. 107, p. 065701, Aug 2011.
-  F. Krzakala, C. Moore, E. Mossel, J. Neeman, A. Sly, L. Zdeborová, and P. Zhang, “Spectral redemption in clustering sparse networks,” Proc. National Academy of Sciences, vol. 110, pp. 20935–20940, 2013.
-  K. Rohe, S. Chatterjee, and B. Yu, “Spectral clustering and the high-dimensional stochastic blockmodel,” The Annals of Statistics, pp. 1878–1915, 2011.
-  R. R. Nadakuditi and M. E. J. Newman, “Graph spectra and the detectability of community structure in networks,” Phys. Rev. Lett., vol. 108, p. 188701, May 2012.
-  F. Radicchi, “Detectability of communities in heterogeneous networks,” Phys. Rev. E, vol. 88, p. 010801, Jul 2013.
-  A. Saade, F. Krzakala, and L. Zdeborová, “Spectral clustering of graphs with the bethe hessian,” in Advances in neural information processing systems (NIPS), 2014, pp. 406–414.
-  J. Lei and A. Rinaldo, “Consistency of spectral clustering in stochastic block models,” Ann. Statist., vol. 43, no. 1, pp. 215–237, 02 2015.
-  P.-Y. Chen and A. Hero, “Deep community detection,” IEEE Trans. Signal Process., vol. 63, no. 21, pp. 5706–5719, Nov. 2015.
-  B. Zhang and M. A. Hasan, “Name disambiguation in anonymized graphs using network embedding,” in CIKM, 2017.
-  P.-Y. Chen, S. Choudhury, and A. O. Hero, “Multi-centrality graph spectral decompositions and their application to cyber intrusion detection,” in IEEE ICASSP, 2016, pp. 4553–4557.
-  B. Zhang, T. K. Saha, and M. Al Hasan, “Name disambiguation from link data in a collaboration graph,” in IEEE/ACM ASONAM, 2014, pp. 81–84.
-  T. K. Saha, B. Zhang, and M. Al Hasan, “Name disambiguation from link data in a collaboration graph using temporal and topological features,” Social Network Analysis and Mining, vol. 5, no. 1, p. 11, 2015.
-  P.-Y. Chen and A. O. Hero, “Assessing and safeguarding network resilience to nodal attacks,” IEEE Commun. Mag., vol. 52, no. 11, pp. 138–143, Nov. 2014.
-  M. Dundar, Q. Kou, B. Zhang, Y. He, and B. Rajwa, “Simplicity of kmeans versus deepness of deep learning: A case of unsupervised feature learning with limited data,” in IEEE ICMLA, 2015, pp. 883–888.
-  P.-Y. Chen and A. O. Hero, “Local Fiedler vector centrality for detection of deep and overlapping communities in networks,” in IEEE ICASSP, 2014, pp. 1120–1124.
-  X. Peng, R. S. Feris, X. Wang, and D. N. Metaxas, “A recurrent encoder-decoder network for sequential face alignment,” in ECCV, 2016, pp. 38–56.
-  P.-Y. Chen and S. Liu, “Bias-variance tradeoff of graph laplacian regularizer,” IEEE Signal Processing Letters, vol. 24, no. 8, pp. 1118–1122, Aug 2017.
-  X. Peng, J. Huang, Q. Hu, S. Zhang, A. Elgammal, and D. Metaxas, “From circle to 3-sphere: Head pose estimation by instance parameterization,” Computer Vision and Image Understanding, vol. 136, pp. 92–102, 2015.
-  S. Liu, P.-Y. Chen, and A. O. Hero, “Accelerated distributed dual averaging over evolving networks of growing connectivity,” arXiv preprint arXiv:1704.05193, 2017.
-  T. Qin and K. Rohe, “Regularized spectral clustering under the degree-corrected stochastic blockmodel,” in Advances in neural information processing systems (NIPS), 2013, pp. 3120–3128.
-  C. Gao, Z. Ma, A. Y. Zhang, and H. H. Zhou, “Community detection in degree-corrected block models,” arXiv preprint arXiv:1607.06993, 2016.
-  P.-Y. Chen and A. O. Hero, “Phase transitions in spectral community detection,” IEEE Trans. Signal Process., vol. 63, no. 16, pp. 4339–4347, Aug 2015.
-  ——, “Multilayer spectral graph clustering via convex layer aggregation: Theory and algorithms,” IEEE Trans. Signal Inf. Process. Netw., 2017.
-  R. A. Horn and C. R. Johnson, Matrix Analysis. Cambridge University Press, 1990.
-  S. Resnick, A Probability Path. Birkhäuser Boston, 2013.
-  S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.
Appendix A Proof of Lemma 3.1
We separate the proof into two cases: (I) , and (II) . For case (I), notice that under SBM() each entry in is an independent and identically distributed Bernoulli random variable with success probability. Let , where . As a result, each entry in is either with probability or with probability . Latala’s theorem states that for any random matrix with statistically independent and zero mean entries, there exists a positive constant such that
where is the largest singular value of . It is clear that each entry in is independent and has zero mean. By replacing with in Latala’s theorem, since , we have , , and . Therefore, as .
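The vanishing behavior claimed in this step can be checked numerically. The following minimal sketch draws a matrix of independent Bernoulli entries, centers it by its success probability, and reports its largest singular value after dividing by the square root of the product of the block sizes. The scaling, the function name `normalized_top_singular_value`, and the parameter values are assumptions chosen for illustration and may differ from the normalization used in the lemma.

```python
import numpy as np


def normalized_top_singular_value(n1, n2, p, rng):
    """Largest singular value of a centered Bernoulli(p) matrix,
    divided by sqrt(n1 * n2) (an assumed normalization)."""
    a = rng.binomial(1, p, size=(n1, n2)).astype(float)
    delta = a - p  # zero-mean entries: 1 - p w.p. p, and -p w.p. 1 - p
    # For a 2-D array, ord=2 returns the largest singular value.
    return np.linalg.norm(delta, 2) / np.sqrt(n1 * n2)


rng = np.random.default_rng(0)
sizes = (50, 200, 800)
values = [normalized_top_singular_value(n, n, 0.3, rng) for n in sizes]
for n, v in zip(sizes, values):
    print(f"n = {n:4d}: normalized largest singular value = {v:.4f}")
```

Under these assumptions the printed values shrink roughly like the inverse square root of the block size, which is consistent with the bound obtained from Latala’s theorem.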
We then use Talagrand’s concentration inequality, which is stated as follows. Let be a convex and 1-Lipschitz function. Let be a random vector and assume that every element of satisfies for all , with probability one. Then there exist positive constants and such that ,
Since , it is easy to check that is a convex and 1-Lipschitz function. Therefore, applying Talagrand’s inequality and substituting with the facts that and , we have
Note that, since for any positive integer , we have . Hence, by the Borel-Cantelli lemma , when , where denotes almost sure convergence. Using the result from standard matrix perturbation theory  yields for all , where denotes the -th largest singular value. By the fact that , we have as ,
as . Finally, for every , ,
as and , which completes the proof of case (I).
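For reference, the two results invoked in case (I) can be written in their standard textbook forms as follows. The symbols $\mathbf{M}$, $g$, $\mathbf{x}$, $k$, $K$, $\epsilon$, and the constants $c_1$, $c_2$, $c_3$ are notation assumed here for illustration and may not match the notation used elsewhere in this paper. Note that the standard form of Talagrand’s inequality additionally requires the entries of $\mathbf{x}$ to be independent.

```latex
% Latala's theorem: M has independent, zero-mean entries M_{ij}.
\mathbb{E}\left[\sigma_1(\mathbf{M})\right]
  \le c_1 \left(
      \max_i \sqrt{\sum_j \mathbb{E}\left[M_{ij}^2\right]}
    + \max_j \sqrt{\sum_i \mathbb{E}\left[M_{ij}^2\right]}
    + \sqrt[4]{\sum_{i,j} \mathbb{E}\left[M_{ij}^4\right]}
  \right)

% Talagrand's concentration inequality: g : R^k -> R is convex and
% 1-Lipschitz, and x has independent entries with |x_i| <= K almost surely.
\Pr\left( \left| g(\mathbf{x}) - \mathbb{E}\left[g(\mathbf{x})\right] \right|
  \ge \epsilon \right)
  \le c_2 \exp\left( - \frac{c_3\, \epsilon^2}{K^2} \right),
  \qquad \forall\, \epsilon > 0
```

The first bound controls the expected largest singular value, and the second shows that the largest singular value, being a convex and 1-Lipschitz function of the matrix entries, concentrates around that expectation.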
For case (II), since accounts for the adjacency matrix of edges within community , it is a symmetric matrix with zeros on its main diagonal. Let denote the matrix that has the same entries as in the upper diagonals and has zero entries in the lower diagonals. Then . Applying Latala’s theorem and Talagrand’s concentration inequality to , we have
as due to the fact that