1 Introduction
Our work is motivated by one question: how can we efficiently extract fine-grained clusters from a massive graph? Modularity clustering [Newman2004] is a fundamental graph analysis tool for understanding complicated graphs [Takahashi et al.2017, Sato et al.2018]. It detects a set of clusters that maximizes a clustering metric, modularity [Newman2004]. Since greedy modularity maximization achieves good clustering results, modularity clustering is employed in various AI-based applications [Louni and Subbalakshmi2018, Shiokawa et al.2018].
Although modularity clustering is useful in many applications, it has two serious weaknesses. First, it fails to reproduce ground-truth clusters in massive graphs due to the resolution limit problem [Fortunato and Barthélemy2007]. Fortunato et al. theoretically proved that the modularity becomes larger until each cluster contains √(m/2) edges, where m is the number of edges included in a given graph. That is, modularity maximization prefers to find coarse-grained clusters regardless of the ground-truth cluster sizes. Second, modularity clustering requires a large computation time to identify clusters since it exhaustively scans all nodes and edges included in a graph. In the mid-2000s, modularity clustering was applied to social networks with at most thousands of edges [Palla et al.2005]. By contrast, recent applications must handle massive graphs with millions or even billions of edges [Louni and Subbalakshmi2018]. As a result, current modularity clustering methods [Blondel et al.2008] can consume dozens of hours to obtain clusters from massive graphs.
1.1 Existing Approaches and Challenges
Many studies have strived to overcome these weaknesses. One major approach is to avoid the resolution limit effects by modifying the modularity clustering metric. Locality-aware modularity metrics [Muff et al.2005, Li et al.2008, Sankar et al.2015] are the most successful ones. Modularity is not scale-invariant since it implicitly assumes that each node interacts with all other nodes. However, it is more reasonable to assume that each node interacts only with its neighbors. By modifying the modularity so that it refers only to neighboring clusters, these metrics successfully moderate the resolution limit effects for small graphs. Costa, however, recently pointed out that the graph size still affects such metrics [Costa2014]. That is, the metrics do not fully reproduce ground-truth clusters if graphs are large.
Instead of locality-aware metrics, Duan et al. proposed a sophisticated method, correlation-aware modularity clustering (CorMod) [Duan et al.2014]. They found that modularity shares the same idea with the correlation measure leverage [PiatetskyShapiro1991], and they removed the biases that cause the resolution limit problem from modularity through a correlation analysis on leverage. Based on this analysis, Duan et al. defined a different type of modularity named the likelihood-ratio modularity (LRM). Unlike other modularity metrics, CorMod can avoid producing coarse-grained clusters in massive graphs by maximizing LRM.
Although CorMod can effectively reproduce the ground-truth clusters, it still suffers from large computational costs on massive graphs. Similar to the original modularity clustering, CorMod iteratively computes all nodes and edges to find clusters that maximize LRM. This incurs O(nm) time, where n and m are the numbers of nodes and edges, respectively. Several efficient methods have been proposed for modularity maximization, such as the node aggregation and pruning approaches used in Louvain [Blondel et al.2008] and IncMod [Shiokawa et al.2013], but they cannot be directly applied to CorMod since they provide no guarantee for the clustering quality under LRM maximization. This is because these approaches focus only on modularity maximization, whereas CorMod employs the more complicated LRM function. Thus, it is a challenging task to improve the computational efficiency of CorMod for fine-grained modularity clustering that approximates the ground-truth clusters.
1.2 Our Approaches and Contributions
We focus on the problem of speeding up CorMod to efficiently extract clusters that well approximate the ground-truth sizes from massive graphs. We present a novel correlation-aware algorithm, gScarf, which is designed to handle billion-edge graphs without sacrificing the clustering quality of CorMod. The basic idea underlying gScarf is to dynamically remove unnecessary computations for nodes and edges from the clustering procedure. To determine which computations to exclude, gScarf focuses on the deterministic property of LRM: it is uniquely determined using only the structural properties of clusters, such as degrees and cluster sizes. That is, LRM does not need to be computed repeatedly for clusters with the same structural properties.
Based on the deterministic property, gScarf employs the following techniques to improve efficiency: (1) gScarf theoretically derives an incremental form of LRM, namely LRM-gain; (2) it introduces LRM-gain caching in order to skip unnecessary LRM-gain computations based on the deterministic property; and (3) it employs incremental subgraph folding to further improve the clustering speed. As a result, our algorithm has the following attractive characteristics:

High accuracy: gScarf does not sacrifice the clustering quality compared to CorMod, even though gScarf skips computations for nodes and edges (Section 4.2). We theoretically confirmed that our approach does not miss any chance to increase LRM during the clustering procedure.

Easy to deploy: Our approach does not require any user-specified parameters (Algorithm 1). Therefore, gScarf provides users with a simple solution for applications using modularity-based algorithms.
To the best of our knowledge, gScarf is the first solution that achieves both high efficiency and fine-grained clustering results on billion-edge graphs. gScarf outperforms the state-of-the-art methods by up to three orders of magnitude in terms of clustering time. For instance, gScarf returns clusters within five minutes for a Twitter graph with 1.46 billion edges. Although modularity clustering effectively enhances application quality, it has been difficult to apply to massive graphs. gScarf, which is well suited to massive graphs, should therefore improve the quality of a wider range of AI-based applications.
2 Preliminary
Here, we formally define the notations and introduce the background. Let G = (V, E, W) be an undirected graph, where V, E, and W are sets of nodes, edges, and edge weights, respectively. The weight of each edge is initially set to 1. Graph clustering divides G into disjoint clusters C = {C_1, C_2, ..., C_k}, in which V = ∪_i C_i and C_i ∩ C_j = ∅ for any i ≠ j. We assume graphs are undirected only to simplify the presentation. Other types of graphs, such as directed graphs, can be handled with only slight modifications; we detail how to apply our proposed approaches to directed graphs in Appendix A.
2.1 Modularity
The modularity-based algorithms detect a set of clusters C from G such that C maximizes a clustering metric called modularity [Newman2004]. Modularity measures the difference of the graph structure from an expected random graph. Here, we denote by m_i the number of edges within C_i, by D_i the total degree of all nodes in C_i, and by m = |E| the number of edges in G. Given a set of clusters C, the modularity of C is given by the following function Q:
(1)  Q(C) = Σ_{C_i ∈ C} { m_i/m - (D_i/2m)^2 }
Note that, in Equation (1), (D_i/2m)^2 indicates the expected fraction of edges in C_i, which is obtained by assuming that G is a random graph. Thus, Q(C) increases when each cluster has a larger fraction of intra-cluster edges than that of the random graph.
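To make Equation (1) concrete, the sketch below computes Q for a toy graph of two triangles joined by a bridge edge; the function and variable names are ours, not the paper's.

```python
from collections import defaultdict

def modularity(edges, cluster_of):
    """Q(C) = sum over clusters of (m_i/m - (D_i/2m)^2), unweighted graph."""
    m = len(edges)
    m_in = defaultdict(int)   # m_i: edges whose endpoints share a cluster
    deg = defaultdict(int)    # D_i: total degree of each cluster
    for u, v in edges:
        deg[cluster_of[u]] += 1
        deg[cluster_of[v]] += 1
        if cluster_of[u] == cluster_of[v]:
            m_in[cluster_of[u]] += 1
    return sum(m_in[c] / m - (deg[c] / (2 * m)) ** 2 for c in deg)

# Two triangles joined by one bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_clusters = {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b', 5: 'b'}
one_cluster = {v: 'a' for v in range(6)}
```

Splitting along the bridge scores Q = 2(3/7 - 1/4) > 0, while the single super-cluster scores exactly 0, matching the intuition stated above.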
Fortunato and Barthélemy theoretically concluded that modularity suffers from the resolution limit problem [Fortunato and Barthélemy2007]. They proved that Q(C) increases until each cluster includes √(m/2) edges. That is, if a given graph is too large, modularity maximization produces many super-clusters. Thus, modularity maximization fails to reproduce the ground-truth clusters in massive graphs.
2.2 Likelihood-ratio Modularity
To overcome this problem, Duan et al. recently proposed CorMod [Duan et al.2014]. CorMod employs the following modularity-based metric, the likelihood-ratio modularity (LRM), which is guaranteed to avoid the resolution limit:
(2)  LRM(C) = (1/|C|) Σ_{C_i ∈ C} p(C_i)/p̂(C_i)
where p(C_i) = P(m_i; m, m_i/m), p̂(C_i) = P(m_i; m, (D_i/2m)^2), and
P(s; t, q) = (t choose s) q^s (1-q)^{t-s}, which denotes the binomial probability mass function.
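Assuming the per-cluster ratio form of Equation (2) described here (our reading of [Duan et al.2014]), LRM can be evaluated directly from the cluster statistics (m_i, D_i); the names below are illustrative.

```python
import math

def binom_pmf(s, t, q):
    """P(s; t, q): binomial probability mass function."""
    return math.comb(t, s) * (q ** s) * ((1 - q) ** (t - s))

def lrm(clusters, m):
    """clusters: list of (m_i, D_i) pairs; returns (1/|C|) * sum of p(C_i)/p_hat(C_i)."""
    total = 0.0
    for m_i, D_i in clusters:
        p_obs = binom_pmf(m_i, m, m_i / m)               # likelihood under the observed rate
        p_exp = binom_pmf(m_i, m, (D_i / (2 * m)) ** 2)  # likelihood under the random-graph rate
        total += p_obs / p_exp
    return total / len(clusters)

# Two triangles joined by one bridge edge (m = 7): the fine-grained
# partition scores higher than the single super-cluster.
fine = lrm([(3, 7), (3, 7)], 7)
coarse = lrm([(7, 14)], 7)
```

For the super-cluster, the observed and expected rates coincide, so its ratio collapses to 1, illustrating how LRM penalizes coarse partitions.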
p(C_i) shows the probability of obtaining m_i from G, whereas p̂(C_i) is the expected probability that m_i is taken from a random graph. As shown in Equation (2), LRM is designed to balance the size of |C| and the ratio p(C_i)/p̂(C_i) of each cluster. If |C| is small, 1/|C| becomes large while the ratio p(C_i)/p̂(C_i) becomes small. On the other hand, if |C| is large, 1/|C| becomes small while the ratio becomes large. Thus, unlike modularity, LRM successfully avoids producing super-clusters regardless of the graph size.

3 Proposed Method: gScarf
We present gScarf, which efficiently detects clusters from massive graphs based on LRM maximization. We first overview the ideas underlying gScarf and then give a full description.
3.1 Ideas
Our goal is to efficiently find clusters without sacrificing clustering quality compared with CorMod. CorMod, the state-of-the-art approach, iteratively computes LRM for all pairs of clusters until LRM no longer increases. By contrast, gScarf skips unnecessary LRM computations by employing the following approaches. First, we theoretically derive an incremental form of LRM named LRM-gain, a deterministic criterion that measures the rise of LRM obtained after merging a pair of clusters. Second, we introduce LRM-gain caching to remove duplicate LRM computations that are repeatedly invoked. Finally, we employ incremental subgraph folding to prune unnecessary computations for nodes and edges. Instead of exhaustively computing the entire graph, gScarf thus computes only the essential nodes and edges to find clusters.
Our ideas have two main advantages. (1) gScarf finds all clusters within a quite small running time on real-world graphs. Our ideas successfully exploit the power-law degree distribution, a well-known property of real-world graphs [Faloutsos et al.1999]: gScarf increases its performance when many nodes have similar degrees, and this property therefore leads gScarf to compute efficiently. (2) gScarf produces almost the same clusters as those of CorMod [Duan et al.2014]. We theoretically demonstrate that gScarf does not miss chances to improve LRM, although it dynamically prunes nodes and edges from a graph. Thus, gScarf does not sacrifice clustering quality compared to CorMod.
3.2 Incremental LRM Computation
We define LRM-gain, which measures the rise of the LRM score obtained after merging two clusters.
Definition 1 (LRM-gain ΔLRM_{i,j}).
Let ΔLRM_{i,j} be the gain of LRM obtained by merging two clusters C_i and C_j; ΔLRM_{i,j} is given as follows:
(3)  ΔLRM_{i,j} = Δρ_{i,j} · ΔQ_{i,j}
where Δρ_{i,j} and ΔQ_{i,j} are the gains of the probability ratio and the modularity, respectively. Let C_{i∪j} be the cluster obtained by merging C_i and C_j; their definitions are given as follows:
(4)  Δρ_{i,j} = (p(C_{i∪j})/p̂(C_{i∪j})) · (p̂(C_i) p̂(C_j)) / (p(C_i) p(C_j))
(5)  ΔQ_{i,j} = m_{i,j}/m - D_i D_j/2m^2
where m_{i,j} is the number of edges between C_i and C_j if i ≠ j; otherwise, it is m_i.
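The modularity part of the gain has the standard closed form ΔQ_{i,j} = m_{i,j}/m - D_iD_j/(2m^2) [Blondel et al.2008]; the following sketch (our own toy numbers and names) checks numerically that this equals the change in Equation (1) when two clusters are merged.

```python
def q_term(m_i, D_i, m):
    # one cluster's contribution to Equation (1)
    return m_i / m - (D_i / (2 * m)) ** 2

def delta_q(m_ij, D_i, D_j, m):
    # closed-form modularity-gain of merging clusters i and j
    return m_ij / m - D_i * D_j / (2 * m * m)

# Merging (m_i, D_i) and (m_j, D_j) with m_ij edges between them yields
# a cluster with statistics (m_i + m_j + m_ij, D_i + D_j).
m = 100
m_i, D_i, m_j, D_j, m_ij = 10, 30, 5, 20, 4
before = q_term(m_i, D_i, m) + q_term(m_j, D_j, m)
after = q_term(m_i + m_j + m_ij, D_i + D_j, m)
```

The identity holds because the cross term of ((D_i + D_j)/2m)^2 contributes exactly 2·D_iD_j/(4m^2).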
Note that Δρ_{i,j} represents a gain of the probability-ratio [Brin et al.1997], which is a well-known correlation measure. Additionally, ΔQ_{i,j} is the modularity-gain [Clauset et al.2004] that measures the increase of the modularity after merging C_i and C_j into the same cluster C_{i∪j}. From Definition 1, we have the following property:
Lemma 1.
Let C_{i∪j} and C_{i∪k} be the clusters obtained by merging C_j into C_i and C_k into C_i, respectively. We always have LRM(C_{i∪j}) > LRM(C_{i∪k}) iff ΔLRM_{i,j} > ΔLRM_{i,k}.
Proof.
We first transform Equation (2) by applying the Poisson limit theorem [Papoulis and Pillai2002] to the binomial probability mass functions p(C_i) and p̂(C_i). Since m_i/m and q̂_i = (D_i/2m)^2 are significantly small, we can transform LRM(C) as follows:
(6)  LRM(C) ≈ (1/|C|) Σ_{C_i ∈ C} e^{m·q̂_i - m_i} (m_i/(m·q̂_i))^{m_i}
By letting t(C_i) = e^{m·q̂_i - m_i} (m_i/(m·q̂_i))^{m_i}, we have
(7)  LRM(C) ≈ (1/|C|) Σ_{C_i ∈ C} t(C_i)
Clearly, t(C_i) > 0 for every cluster C_i.
We then prove that LRM(C_{i∪j}) > LRM(C_{i∪k}) implies ΔLRM_{i,j} > ΔLRM_{i,k}. Since C_{i∪j} and C_{i∪k} contain exactly the same clusters except for those involved in the merges, we clearly have |C_{i∪j}| = |C_{i∪k}|. Thus, LRM(C_{i∪j}) > LRM(C_{i∪k}) holds iff t(C_{i∪j}) + t(C_k) > t(C_{i∪k}) + t(C_j), i.e., iff t(C_{i∪j}) - t(C_i) - t(C_j) > t(C_{i∪k}) - t(C_i) - t(C_k). By applying Definition 1 to the above inequality, we obtain ΔLRM_{i,j} > ΔLRM_{i,k}.
We omit the proof of the converse direction; however, it can be proved in a similar fashion to the above. ∎
Lemma 1 implies that the maximization of LRM and that of LRM-gain are equivalent. Consequently, gScarf can find clusters that increase LRM by using LRM-gain in a local maximization manner. We also identify two additional properties that play essential roles in the LRM-gain caching technique presented in Section 3.3.
Lemma 2.
ΔLRM_{i,j} is uniquely determined by D_i, D_j, m_i, m_j, and m_{i,j}, where m_{i,j} is the number of edges between C_i and C_j.
Proof.
Since p(C_i) = P(m_i; m, m_i/m) and p̂(C_i) = P(m_i; m, (D_i/2m)^2), we can obtain Δρ_{i,j} from D_i, D_j, m_i, m_j, and m_{i,j}. Furthermore, since ΔQ_{i,j} = m_{i,j}/m - D_iD_j/2m^2 [Blondel et al.2008], ΔQ_{i,j} is clearly determined by m_{i,j}, D_i, and D_j. Hence, Lemma 2 holds by Definition 1. ∎
Lemma 3.
For each pair of clusters C_i and C_j, it requires O(1) time to compute its LRM-gain ΔLRM_{i,j}.
3.3 LRM-gain Caching
We introduce LRM-gain caching to reduce the number of computed nodes and edges. As shown in Lemma 2, LRM-gain ΔLRM_{i,j} is uniquely determined by the tuple of structural properties τ_{i,j} = (D_i, D_j, m_i, m_j, m_{i,j}). That is, if τ_{i,j} equals τ_{k,l}, then ΔLRM_{i,j} = ΔLRM_{k,l}. Hence, once ΔLRM_{i,j} and its corresponding structural property τ_{i,j} are obtained, ΔLRM_{i,j} can be reused to compute other pairs of clusters whose structural properties are equivalent to τ_{i,j}.
To leverage the above deterministic property, gScarf employs LRMgain caching given as follows:
Definition 2 (LRM-gain caching).
Let H be a hash table that stores ΔLRM_{i,j} with its corresponding structural property τ_{i,j} as the key. The LRM-gain caching for a structural property τ_{k,l} is defined as follows:
(8)  ΔLRM_{k,l} = H(τ_{i,j})  if τ_{k,l} ≡ τ_{i,j}
where τ_{k,l} ≡ τ_{i,j} denotes that τ_{k,l} is equivalent to τ_{i,j}.
gScarf skips the computation of ΔLRM_{k,l} if τ_{k,l} is in H; otherwise, gScarf computes ΔLRM_{k,l} by Definition 1 and stores it in H.
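In code, the caching of Definition 2 is essentially a hash table keyed by the structural-property tuple. The sketch below uses a placeholder gain function (the real computation follows Definition 1) purely to show the skip behavior; all names are ours.

```python
cache = {}                    # H: structural property tuple -> LRM-gain
evaluations = {'count': 0}    # counts actual (non-cached) gain computations

def lrm_gain(D_i, D_j, m_i, m_j, m_ij):
    # placeholder for the Definition 1 computation; called only on cache misses
    evaluations['count'] += 1
    return m_ij - D_i * D_j / 1000.0

def cached_lrm_gain(D_i, D_j, m_i, m_j, m_ij):
    key = (D_i, D_j, m_i, m_j, m_ij)   # the tuple of Lemma 2
    if key not in cache:
        cache[key] = lrm_gain(*key)
    return cache[key]

# Two different cluster pairs with identical structural properties
# trigger only a single gain evaluation.
g1 = cached_lrm_gain(5, 7, 0, 0, 1)
g2 = cached_lrm_gain(5, 7, 0, 0, 1)
```

Because singleton clusters collapse to the tuple (D_i, D_j, 0, 0, 1), graphs where many nodes share degrees hit the cache very often, which is the power-law argument made below.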
To discuss the theoretical aspect of Definition 2, we introduce a well-known property called the power-law degree distribution [Faloutsos et al.1999]. Under this property, the frequency of nodes with k neighbors is proportional to k^{-λ}, where the exponent λ is a positive constant that represents the skewness of the degree distribution.
Given the power-law degree distribution, LRM-gain caching has the following property according to Lemma 3:
Lemma 4.
LRM-gain caching requires O(d^λ) time during the clustering procedure, where d is the average degree.
Proof.
To simplify, we assume that each cluster is a singleton, i.e., |C_i| = 1 for any C_i ∈ C. In this case, τ_{i,j} ≡ τ_{k,l} iff D_i = D_k and D_j = D_l, because we always have m_i = m_j = 0 and m_{i,j} = 1 for singleton clusters. Since the frequency of nodes with k neighbors is proportional to k^{-λ}, the expected number of distinct structural properties appearing in the graph is O(d^λ). Thus, to cover the whole graph, gScarf needs to compute LRM-gain scores only O(d^λ) times during the clustering. From Lemma 3, each LRM-gain computation needs O(1) time; LRM-gain caching, therefore, requires O(d^λ) time. ∎
3.4 Incremental Subgraph Folding
We introduce incremental subgraph folding, which removes unnecessary computations. Real-world graphs generally have a high clustering coefficient [Watts and Strogatz1998]; that is, they contain many triangles. However, a high clustering coefficient entails unnecessary computations in CorMod since the triangles create numerous duplicate edges between clusters. To avoid such computations, gScarf incrementally folds nodes placed in the same cluster into an equivalent node with weighted edges by extending the theoretical aspects of the incremental aggregation presented in [Shiokawa et al.2013].
First, we define the subgraph folding as follows:
Definition 3 (Subgraph Folding).
Given G = (V, E, W) and u, v ∈ V, where (u, v) ∈ E and u and v are in the same cluster, let φ be a function that maps every node in V \ {u, v} to itself; otherwise, φ maps it to a new node w_{uv}. The subgraph folding of nodes u and v results in a new graph G' = (V', E', W'), where V' = φ(V) and E' = {(φ(a), φ(b)) | (a, b) ∈ E}, with updated weight values such that w'(x, y) = Σ_{(a,b) ∈ E, φ(a)=x, φ(b)=y} w(a, b).
Definition 3 transforms the two nodes u and v into an equivalent single node w_{uv} with weighted edges. It replaces the multiple edges between u and v with a self-loop edge whose weight value represents the number of edges between u and v. Similarly, letting x be a node adjacent to both u and v, it merges the edges (u, x) and (v, x) into a single weighted edge (w_{uv}, x). Thus, the subgraph folding reduces the number of nodes/edges in G.
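A minimal sketch of Definition 3 on a weighted adjacency map (our own representation: `adj[x]` maps each neighbor of x to an edge weight, with a self-loop stored once under the node's own key):

```python
def fold(adj, u, v, new):
    """Fold nodes u and v into a single node `new` with weighted edges."""
    nu, nv = adj.pop(u), adj.pop(v)
    # former u-v edges plus any existing self-loops become a self-loop on `new`
    loop = nu.pop(u, 0) + nv.pop(v, 0) + nu.pop(v, 0)
    nv.pop(u, None)          # drop the mirrored u-v entry
    merged = {}
    for nbrs in (nu, nv):
        for x, w in nbrs.items():
            merged[x] = merged.get(x, 0) + w   # duplicate edges collapse into one weight
    for x, w in merged.items():
        adj[x].pop(u, None)
        adj[x].pop(v, None)
        adj[x][new] = w      # rewire each remaining neighbor to `new`
    if loop:
        merged[new] = loop
    adj[new] = merged

# Folding two corners of a triangle: the two edges to the third corner
# merge into one edge of weight 2, and the u-v edge becomes a self-loop.
adj = {0: {1: 1, 2: 1}, 1: {0: 1, 2: 1}, 2: {0: 1, 1: 1}}
fold(adj, 0, 1, 'w')
```

This illustrates why triangles are the source of duplicate work: both endpoints of the folded pair touch the same third node, and folding collapses those duplicates into a single weighted edge.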
Lemma 5.
Let nodes u and v be in the same cluster, i.e., u, v ∈ C_i. If gScarf folds the nodes u and v into a new node w_{uv} by Definition 3, then the LRM taken from the node w_{uv} is equivalent to that of the cluster composed of the nodes u and v.
Proof. This directly follows from Definition 3: folding u and v preserves both m_i and D_i for every cluster, so each term of Equation (2) is unchanged. ∎
From Lemma 5, the subgraph folding does not fail to capture an LRM-gain that shows a positive score. That is, gScarf reduces the number of duplicated edges without sacrificing the clustering quality of CorMod.
Based on Definition 3, gScarf performs subgraph folding in an incremental manner during the clustering procedure. gScarf greedily searches clusters using the LRM-gain shown in Definition 1. Once a cluster C_i is chosen, gScarf computes the LRM-gain ΔLRM_{i,j} for all neighboring clusters C_j ∈ N(C_i), where N(C_i) denotes the set of clusters neighboring C_i. After that, gScarf folds the pair of clusters that yields the largest positive LRM-gain based on Definition 3.
Our incremental subgraph folding has the following theoretical property in terms of the computational costs.
Lemma 6.
A subgraph folding entails O(d) time, where d is the average degree.
Proof.
By Definition 3, the subgraph folding requires updating the weight of each edge incident to the folded nodes by traversing all of their neighboring clusters. Thus, a subgraph folding for C_i and C_j entails time complexity O(d). ∎
3.5 Algorithm
Algorithm 1 gives a full description of gScarf. First, gScarf initializes each node as a singleton cluster and stores all clusters into a target node set T (lines 1–3). Then, gScarf starts the clustering phase to find a set of clusters that maximizes LRM in a local maximization manner (lines 4–15). Once gScarf selects a cluster C_i from T (line 5), it explores the neighboring cluster C_j that yields the largest positive LRM-gain (lines 6–10). To improve the clustering efficiency, gScarf uses the LRM-gain caching of Definition 2; gScarf skips unnecessary computations based on the deterministic property shown in Lemma 2 (lines 8–10). If gScarf finds the structural property τ_{i,j} in H, then it reuses the cached value instead of recomputing ΔLRM_{i,j}; otherwise, it computes ΔLRM_{i,j} by Definition 1 (line 9). After finding C_j, gScarf performs the incremental subgraph folding for C_i and C_j (lines 11–14). If ΔLRM_{i,j} > 0, gScarf contracts C_i and C_j into a single cluster by Definition 3. gScarf terminates when T = ∅ (line 4). Finally, it outputs the set of clusters in G (line 15).
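The control flow described above can be sketched as follows. This is our toy re-implementation, not the paper's code: clusters are tracked by their statistics only, the caller supplies the gain function (here a modularity-gain stand-in, since the full LRM-gain needs Definition 1), and folding updates the inter-cluster edge weights in place.

```python
def greedy_cluster(n, edges, gain):
    """Greedy merge loop in the spirit of Algorithm 1: start from singleton
    clusters, then repeatedly fold the neighbor pair with the largest
    positive gain; gain() takes the structural tuple of Lemma 2."""
    D = [0] * n                        # D_i: total degree of cluster i
    m_in = [0] * n                     # m_i: intra-cluster edges
    nbr = [dict() for _ in range(n)]   # m_ij: inter-cluster edge weights
    for u, v in edges:
        D[u] += 1; D[v] += 1
        nbr[u][v] = nbr[u].get(v, 0) + 1
        nbr[v][u] = nbr[v].get(u, 0) + 1
    alive, cache, work = set(range(n)), {}, list(range(n))
    while work:
        i = work.pop()
        if i not in alive:
            continue
        best_gain, best_j = 0.0, None
        for j, m_ij in nbr[i].items():
            key = (D[i], D[j], m_in[i], m_in[j], m_ij)
            if key not in cache:               # LRM-gain caching (Definition 2)
                cache[key] = gain(*key)
            if cache[key] > best_gain:
                best_gain, best_j = cache[key], j
        if best_j is None:
            continue
        j = best_j                             # fold j into i (Definition 3)
        m_in[i] += m_in[j] + nbr[i].pop(j)
        nbr[j].pop(i, None)
        D[i] += D[j]
        for k, w in nbr[j].items():
            nbr[k].pop(j, None)
            nbr[i][k] = nbr[i].get(k, 0) + w
            nbr[k][i] = nbr[i][k]
        alive.discard(j)
        work.append(i)                         # revisit i until no positive gain
    return alive, m_in, D

# Two triangles joined by a bridge (m = 7), with a modularity-gain stand-in:
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
clusters, m_in, D = greedy_cluster(
    6, edges, lambda Di, Dj, mi, mj, mij: mij / 7 - Di * Dj / 98)
```

On this toy graph the loop recovers the two triangles, each with three internal edges and total degree seven.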
The computational cost of gScarf is analyzed as follows:
Theorem 1.
gScarf incurs O(m + d^λ) time to obtain a clustering result from a graph G, where m is the number of edges in G, d is the average degree, and λ is a positive constant that controls the skewness of the degree distribution.
Proof.
Algorithm 1 requires at most n iterations for the clustering phase. In each iteration, gScarf invokes at most one subgraph folding (lines 18–21), which entails O(d) costs from Lemma 6. Thus, it takes O(nd) = O(m) time to perform the clustering phase. Additionally, gScarf employs the LRM-gain caching during the iterations. As we proved in Lemma 4, it incurs O(d^λ) time. Therefore, gScarf requires O(m + d^λ) time to obtain a clustering result. ∎
The computational costs of gScarf are dramatically smaller than those of CorMod, which requires O(nm) time. In practice, real-world graphs show d^λ ≪ m since they generally have small values of d and λ due to the power-law degree distribution [Faloutsos et al.1999]. Specifically, as shown in Table 1, the real-world datasets examined in the next section have d ≤ 38.1 and λ ≤ 2.29. Thus, gScarf has a nearly linear time complexity against the graph size (i.e., O(m)) on real-world graphs with the power-law degree distribution. Furthermore, our incremental subgraph folding effectively exploits the high clustering coefficient [Watts and Strogatz1998, Shiokawa et al.2015] to further reduce the computational costs. Thus, gScarf has an even smaller cost than the one shown in Theorem 1.
Table 1: Statistics of the real-world datasets.

Name | |V| | |E| | d | λ | Ground-truth | Source
YT | 1.13 M | 2.98 M | 2.63 | 1.93 | ✓ | com-Youtube (SNAP)
WK | 1.01 M | 25.6 M | 25.1 | 2.02 | N/A | itwiki-2013 (LAW)
LJ | 3.99 M | 34.6 M | 8.67 | 2.29 | ✓ | com-LiveJournal (SNAP)
OK | 3.07 M | 117 M | 38.1 | 1.89 | ✓ | com-Orkut (SNAP)
WB | 118 M | 1.01 B | 8.63 | 2.14 | N/A | webbase-2001 (LAW)
TW | 41.6 M | 1.46 B | 35.2 | 2.27 | N/A | twitter-2010 (LAW)
4 Experimental Evaluation
In this section, we experimentally evaluate the efficiency and the clustering accuracy of gScarf.
Experimental Setup: We compared gScarf with the following state-of-the-art graph clustering algorithms as baselines:

CorMod: The original correlation-aware modularity clustering [Duan et al.2014]. CorMod greedily maximizes the same criterion (LRM) as gScarf.

Louvain: The most popular modularity clustering method for large graphs [Blondel et al.2008]. This method greedily maximizes the modularity in Equation (1).

IncMod: The fastest modularity maximization method proposed by [Shiokawa et al.2013]. It improves the efficiency of Louvain via incremental aggregation and pruning techniques.

pSCAN: The density-based graph clustering method recently proposed by [Chang et al.2017]. It extracts clusters by measuring the density of node connections based on thresholds ε and μ. We set ε and μ to the same settings as used in [Chang et al.2017].

CEIL: A scalable resolution-limit-free method based on a new clustering metric that quantifies internal and external densities of the cluster [Sankar et al.2015]. It detects clusters by greedily maximizing the metric.

MaxPerm: A resolution-limit-free method based on the localization approach [Chakraborty et al.2014]. This method quantifies the permanence of a node within a cluster, and it greedily explores clusters so that the permanence increases.

TECTONIC: The motif-aware clustering method proposed by [Tsourakakis et al.2017]. This method extracts connected components as clusters after removing edges whose number of triangles is smaller than a threshold. We used the threshold value recommended in [Tsourakakis et al.2017].
All experiments were conducted on a Linux server with a CPU (Intel(R) Xeon(R) E5-2690, 2.60 GHz) and 128 GB RAM. All algorithms were implemented in C/C++ as single-threaded programs, each using a single CPU core with the entire graph held in main memory.
Datasets: We used six real-world graphs published by SNAP [Leskovec and Krevl2014] and LAW [Boldi et al.2011]. Table 1 shows their statistics. The symbols d and λ denote the average degree and the skewness of the power-law degree distribution, respectively. As shown in Table 1, only YT, LJ, and OK have ground-truth clusters. In our experiments, we also used synthetic graphs with ground-truth clusters generated by the LFR benchmark [Lancichinetti et al.2009], which is the de facto standard model for generating graphs. The settings are described in detail in Section 4.2.
Table 2: Average cluster sizes of the ground truth and of each algorithm.

Name | Ground-truth | gScarf | CorMod | Louvain | IncMod | pSCAN | CEIL | MaxPerm | TECTONIC
YT | 13.5 | 13.3 | 13.3 | 66.1 | 50.4 | 24.3 | 5.6 | 11.4 | 8.2
LJ | 40.6 | 44.2 | 45.1 | 111.4 | 104.5 | 81.9 | 11.3 | 33.3 | 10.7
OK | 215.7 | 194.9 | 194.1 | 16551.9 | 15676.7 | 403.7 | 35.7 | 121.9 | 9.61
4.1 Efficiency
We evaluated the clustering time of each algorithm on the real-world graphs (Figure 1). Note that we omitted the results of CEIL, MaxPerm, and TECTONIC for large graphs since they did not finish within 24 hours. Overall, gScarf outperforms the other algorithms in terms of running time. On average, gScarf is 273.7 times faster than the other methods. Specifically, gScarf is up to 1,100 times faster than CorMod, although they optimize the same clustering criterion. This is because gScarf exploits LRM-gain caching and thus does not compute all nodes and edges. In addition, it further reduces redundant computations incurred by triangles included in a graph by incrementally folding subgraphs. As a result, as we proved in Theorem 1, gScarf shows a nearly linear running time against the graph size. That is why our approach is superior to the other methods in terms of running time.
To verify how effectively gScarf reduces the number of computations, Figure 3 plots the computed edges of each algorithm for YT, where a black (white) dot indicates that the (i, j)-th element is (is not) computed. gScarf decreases the computational costs by 83.9–98.3% compared to the other methods. The results indicate that LRM-gain caching and incremental subgraph folding successfully reduce unnecessary computations. This is due to the power-law degree distribution, which is a natural property of real-world graphs including YT. If a graph has a strong skew in its degree distribution, numerous cluster pairs share equivalent structural properties since many nodes have similar degrees. Hence, gScarf can remove a large amount of computations.
We then evaluated the effectiveness of our key approaches. We compared the runtimes of gScarf with its variants that exclude either LRM-gain caching or incremental subgraph folding. Figure 3 shows the runtimes of each algorithm on the real-world graphs, where W/O-Caching and W/O-Folding indicate gScarf without LRM-gain caching and without incremental subgraph folding, respectively. gScarf is 21.6 and 10.4 times faster than W/O-Caching and W/O-Folding on average, respectively. These results indicate that LRM-gain caching contributes more to the efficiency improvements even though it requires O(d^λ) time. As discussed in Section 3.5, real-world graphs generally have small values of d and λ due to the power-law degree distribution. For instance, as shown in Table 1, the datasets that we examined have d ≤ 38.1 and λ ≤ 2.29. Consequently, the cost of LRM-gain caching is much smaller than that of incremental subgraph folding on real-world graphs. Hence, gScarf can efficiently reduce its running time.
4.2 Accuracy of Clustering Results
One major advantage of gScarf is that it does not sacrifice clustering quality, although it dynamically skips unnecessary computations. To verify the accuracy of gScarf, we evaluated the clustering quality against ground-truth clusters on both real-world and synthetic graphs. We used normalized mutual information (NMI) [Cilibrasi and Vitányi2005] to measure the agreement with the ground-truth clusters.
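NMI can be computed from the joint label counts; several normalizations exist, and the sketch below (our naming) uses the geometric mean of the entropies, which is a common choice; the paper does not state which variant it uses.

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information of two flat clusterings (geometric-mean norm)."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum((c / n) * math.log((c / n) / ((ca[a] / n) * (cb[b] / n)))
             for (a, b), c in joint.items())
    ha = -sum((c / n) * math.log(c / n) for c in ca.values())
    hb = -sum((c / n) * math.log(c / n) for c in cb.values())
    if ha == 0.0 or hb == 0.0:        # degenerate single-cluster labeling
        return 1.0 if ha == hb else 0.0
    return mi / math.sqrt(ha * hb)
```

NMI is 1 for clusterings that are identical up to relabeling and 0 for statistically independent ones, which is why it is suitable for comparing against ground-truth clusters.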
Real-world graphs
Figure 4 (a) shows the NMI scores of each algorithm on YT, LJ, and OK. We used the top-5000-community datasets published by [Leskovec and Krevl2014] as the ground-truth clusters for all graphs. In these datasets, several nodes belong to multiple ground-truth clusters. We thus assigned each such node to the cluster in which most of its neighbor nodes are located.
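Our reading of this assignment rule as code; the tie-breaking and the fallback for nodes with no assigned neighbors are our own assumptions, since the paper does not specify them.

```python
from collections import Counter

def resolve_overlap(node, memberships, neighbors, assigned):
    """Among the ground-truth clusters of `node`, pick the one containing
    the most neighbors that are already assigned to it."""
    votes = Counter(assigned[u] for u in neighbors[node]
                    if assigned.get(u) in memberships[node])
    if votes:
        return votes.most_common(1)[0][0]
    return memberships[node][0]   # fallback: keep the first listed cluster

# node 1 belongs to clusters 'a' and 'b'; two of its neighbors sit in 'b'
choice = resolve_overlap(1, {1: ['a', 'b']}, {1: [2, 3, 4]},
                         {2: 'a', 3: 'b', 4: 'b'})
```

This turns the overlapping ground truth into a flat partition so that NMI, which assumes disjoint clusters, can be applied.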
As shown in Figure 4, gScarf shows higher NMI scores than the other methods except for CorMod. In addition, gScarf has almost the same or slightly higher NMI scores than the original method, CorMod. This is because both employ LRM maximization to avoid the resolution limit problem. Furthermore, gScarf is theoretically designed not to miss any chance to increase LRM by Lemma 1 and Lemma 5. That is, the greedy LRM-gain maximization in gScarf is equivalent to that in CorMod. Hence, gScarf achieves comparable or better NMI scores even though it dynamically prunes unnecessary nodes/edges by LRM-gain caching and incremental subgraph folding.
To analyze the cluster resolution, we compared the average sizes of clusters extracted by each algorithm. As shown in Table 2, gScarf and CorMod effectively reproduce the cluster sizes of the ground truth in large real-world graphs. This is because LRM maximization is designed to avoid the resolution limit problem [Duan et al.2014], and LRM effectively extracts cluster sizes that approximate the ground-truth clusters. On the other hand, the other algorithms extract coarse-grained or significantly smaller clusters. In particular, the modularity maximization methods (i.e., Louvain and IncMod) output larger clusters than the ground truth since they grow clusters until the clusters contain √(m/2) edges.
Synthetic graphs
To verify the effects of graph structures and graph sizes, we evaluated the NMI scores on LFR benchmark graphs. In Figure 4 (b), we generated graphs with average degree 20 by varying the mixing parameter μ from 0.1 to 0.9. μ controls the fraction of a node's neighbors that belong to other clusters; as μ increases, it becomes more difficult to reveal the intrinsic clusters. In contrast, in Figure 4 (c), we used graphs generated by varying the number of nodes, where the average degree and μ are set to 20 and 0.5, respectively.
Figure 4 (b) shows that gScarf and CorMod achieve high accuracy even if the parameter μ increases, whereas most of the other methods show degraded NMI scores for large μ. In addition, as shown in Figure 4 (c), gScarf also shows higher NMI scores than the others for large graphs since LRM effectively avoids outputting super-clusters regardless of the input graph size. Hence, gScarf efficiently detects clusters that effectively approximate the ground-truth clusters in massive graphs. From these results, we confirmed that gScarf does not sacrifice the clustering quality of CorMod.
5 Conclusion
We proposed gScarf, an efficient algorithm that produces fine-grained modularity-based clusters from massive graphs. gScarf avoids unnecessary computations through LRM-gain caching and incremental subgraph folding. Experiments showed that gScarf offers improved efficiency on massive graphs without sacrificing clustering quality compared to existing approaches. By providing efficient approaches suited to massive graphs, gScarf will enhance the effectiveness of future applications based on graph analysis.
Acknowledgements
This work was partially supported by JST ACT-I. We thank Prof. Ken-ichi Kawarabayashi (National Institute of Informatics, Japan) for his help and useful discussions.
A How to Handle Directed Graphs
We here detail the gScarf modifications for directed graphs. As discussed in [Nicosia et al.2009], we can handle directed graphs in modularity maximization by replacing (D_i/2m)^2 with D_i^in · D_i^out / m^2 in Q(C), where D_i^in and D_i^out are the total in-degree and out-degree of the nodes in C_i, respectively. Thus, our LRM-gain caching can handle directed graphs by replacing D_i and D_j of τ_{i,j} in Definition 2 with (D_i^in, D_i^out) and (D_j^in, D_j^out), respectively. Similarly, we need to modify Definition 3 so that the folding technique takes account of the directions of edges. That is, our incremental subgraph folding can handle directed graphs by aggregating incoming and outgoing edges separately.
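For reference, the directed modularity of [Nicosia et al.2009] that the modification above follows can be written, in the notation of Section 2, as (our transcription of the Leicht–Newman form):

```latex
Q_{d}(\mathcal{C}) \;=\; \sum_{C_i \in \mathcal{C}}
  \left[\, \frac{m_i}{m} \;-\; \frac{D_i^{\mathrm{in}} \, D_i^{\mathrm{out}}}{m^{2}} \,\right]
```

where m counts directed edges, m_i counts the edges inside C_i, and D_i^in, D_i^out are the total in- and out-degrees of C_i. The null-model term is no longer squared because an edge's source and target are drawn from two different degree distributions.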
References
 [Blondel et al.2008] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast Unfolding of Communities in Large Networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
 [Boldi et al.2011] Paolo Boldi, Marco Rosa, Massimo Santini, and Sebastiano Vigna. Layered Label Propagation: A Multi-Resolution Coordinate-Free Ordering for Compressing Social Networks. In Proc. WWW 2011, pages 587–596, 2011.
 [Brin et al.1997] Sergey Brin, Rajeev Motwani, Jeffrey D Ullman, and Shalom Tsur. Dynamic Itemset Counting and Implication Rules for Market Basket Data. ACM SIGMOD Record, 26(2):255–264, 1997.
 [Chakraborty et al.2014] Tanmoy Chakraborty, Sriram Srinivasan, Niloy Ganguly, Animesh Mukherjee, and Sanjukta Bhowmick. On the Permanence of Vertices in Network Communities. In Proc. KDD 2014, pages 1396–1405, 2014.
 [Chang et al.2017] Lijun Chang, Wei Li, Lu Qin, Wenjie Zhang, and Shiyu Yang. pSCAN: Fast and Exact Structural Graph Clustering. IEEE Transactions on Knowledge and Data Engineering, 29(2):387–401, 2017.
 [Cilibrasi and Vitányi2005] Rudi Cilibrasi and Paul MB Vitányi. Clustering by Compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005.
 [Clauset et al.2004] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding Community Structure in Very Large Networks. Physical Review E, 70:066111, 2004.
 [Costa2014] Alberto Costa. Comment on ”Quantitative Function for Community Detection”. CoRR, abs/1409.4063, 2014.
 [Duan et al.2014] Lian Duan, William Nick Street, Yanchi Liu, and Haibing Lu. Community Detection in Graphs Through Correlation. In Proc. KDD 2014, pages 1376–1385, 2014.
 [Faloutsos et al.1999] Michalis Faloutsos, Petros Faloutsos, and Christos Faloutsos. On Power-law Relationships of the Internet Topology. In Proc. SIGCOMM 1999, pages 251–262, 1999.
 [Fortunato and Barthélemy2007] Santo Fortunato and M. Barthélemy. Resolution Limit in Community Detection. Proceedings of the National Academy of Sciences of the United States of America, 104(1):36–41, January 2007.
 [Lancichinetti et al.2009] Andrea Lancichinetti, Santo Fortunato, and János Kertész. Detecting the Overlapping and Hierarchical Community Structure in Complex Networks. New Journal of Physics, 11(3):033015, 2009.
 [Leskovec and Krevl2014] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data, June 2014.
 [Li et al.2008] Zhenping Li, Shihua Zhang, Rui-Sheng Wang, Xiang-Sun Zhang, and Luonan Chen. Quantitative Function for Community Detection. Physical Review E, 77:036109, 2008.
 [Louni and Subbalakshmi2018] Alireza Louni and K. P. Subbalakshmi. Who Spread That Rumor: Finding the Source of Information in Large Online Social Networks with Probabilistically Varying Internode Relationship Strengths. IEEE Transactions on Computational Social Systems, 5(2):335–343, June 2018.
 [Muff et al.2005] Stefanie Muff, Francesco Rao, and Amedeo Caflisch. Local Modularity Measure for Network Clusterizations. Physical Review E, 72:056107, 2005.
 [Newman2004] M. E. J. Newman. Fast Algorithm for Detecting Community Structure in Networks. Physical Review E, 69:066133, 2004.
 [Nicosia et al.2009] Vincenzo Nicosia, Giuseppe Mangioni, Vincenza Carchiolo, and Michele Malgeri. Extending the Definition of Modularity to Directed Graphs with Overlapping Communities. Journal of Statistical Mechanics, 2009(3):P03024, 2009.
 [Palla et al.2005] Gergely Palla, Imre Derényi, Illés Farkas, and Tamás Vicsek. Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society. Nature, 435(7043):814–818, June 2005.

 [Papoulis and Pillai2002] Athanasios Papoulis and S. Unnikrishna Pillai. Probability, Random Variables, and Stochastic Processes. McGraw-Hill Higher Education, 4th edition, 2002.
 [PiatetskyShapiro1991] Gregory Piatetsky-Shapiro. Discovery, Analysis, and Presentation of Strong Rules. Knowledge Discovery in Databases, pages 229–248, 1991.
 [Sankar et al.2015] M. Vishnu Sankar, Balaraman Ravindran, and S Shivashankar. CEIL: A Scalable, Resolution Limit Free Approach for Detecting Communities in Large Networks. In Proc. IJCAI 2015, pages 2097–2103, 2015.
 [Sato et al.2018] Tomoki Sato, Hiroaki Shiokawa, Yuto Yamaguchi, and Hiroyuki Kitagawa. FORank: Fast ObjectRank for Large Heterogeneous Graphs. In Companion Proceedings of The Web Conference 2018, pages 103–104, 2018.
 [Shiokawa et al.2013] Hiroaki Shiokawa, Yasuhiro Fujiwara, and Makoto Onizuka. Fast Algorithm for Modularitybased Graph Clustering. In Proc. AAAI 2013, pages 1170–1176, 2013.

[Shiokawa et al.2015]
Hiroaki Shiokawa, Yasuhiro Fujiwara, and Makoto Onizuka.
SCAN++: Efficient Algorithm for Finding Clusters, Hubs and Outliers on Largescale Graphs.
Proceedings of the Very Large Data Bases Endowment (PVLDB), 8(11):1178–1189, July 2015.  [Shiokawa et al.2018] Hiroaki Shiokawa, Tomokatsu Takahashi, and Hiroyuki Kitagawa. ScaleSCAN: Scalable Densitybased Graph Clustering. In Proc. DEXA 2018, pages 18–34, 2018.
 [Takahashi et al.2017] Tomokatsu Takahashi, Hiroaki Shiokawa, and Hiroyuki Kitagawa. SCANXP: Parallel Structural Graph Clustering Algorithm on Intel Xeon Phi Coprocessors. In Proc. ACM SIGMOD Workshop on Network Data Analytics, pages 6:1–6:7, 2017.
 [Tsourakakis et al.2017] Charalampos E. Tsourakakis, Jakub Pachocki, and Michael Mitzenmacher. Scalable Motifaware Graph Clustering. In Proc. WWW 2017, pages 1451–1460, 2017.
 [Watts and Strogatz1998] Duncan J. Watts and Steven H. Strogatz. Collective Dynamics of 'Small-World' Networks. Nature, 393(6684):440–442, 1998.