An important problem in the study of complex graphs is that of characterizing and detecting community structure within them [1, 2, 3]. Processes occurring on networks often depend on the network topology and, in particular, on the community structure. Identifying the community structure is therefore essential for understanding and modeling complex systems. Various definitions of community exist, with the community structure depending on the definition, and no definition is guaranteed to be the best for all applications. Often, however, communities are thought of as groups of nodes that are more densely connected to each other than they are to nodes in other groups. One of the most widely used metrics for quantifying community structure based on this idea is modularity [8, 9, 10]. For a given partition $C$ of the nodes of a network, modularity $Q$ is defined as the fraction of links within communities minus the expected fraction in a corresponding random network that serves as a null model,
$$Q = \sum_{c \in C} \left[ \frac{m_c}{M} - \left( \frac{2 m_c + e_c}{2M} \right)^2 \right], \tag{1}$$
where $m_c$ is the number of links in community $c$, $e_c$ is the number of external links of $c$, and $M$ is the total number of links in the network. The partition that maximizes $Q$ is considered to be the one that corresponds to the community structure. Community structures based on maximizing $Q$ have been found in a wide variety of networks, such as communication, infrastructural, biological, and social networks [8, 10, 11].
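The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming Eq. 1 has the form $Q = \sum_c [m_c/M - ((2m_c + e_c)/(2M))^2]$ for a simple undirected, unweighted network; the function name `modularity` and the toy graph of two triangles joined by one link are hypothetical choices made for this example.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Modularity of a partition of a simple undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each link listed once.
    community_of: dict mapping every node to a community label.
    Assumes the form Q = sum_c [m_c/M - ((2 m_c + e_c)/(2M))^2].
    """
    edges = list(edges)
    total_links = len(edges)
    internal = defaultdict(int)  # m_c: links with both ends inside c
    external = defaultdict(int)  # e_c: links with exactly one end inside c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu == cv:
            internal[cu] += 1
        else:
            external[cu] += 1
            external[cv] += 1
    q = 0.0
    for c in set(community_of.values()):
        m_c, e_c = internal[c], external[c]
        q += m_c / total_links - ((2 * m_c + e_c) / (2 * total_links)) ** 2
    return q


# Toy example (hypothetical): two triangles joined by a single link.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "all" for node in range(6)}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

In this toy case the two-triangle partition scores higher than the single-community partition, which always has zero modularity under this form of the metric.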
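Despite its popularity, modularity has drawbacks. Perhaps the most notable is that, by maximizing modularity, one may fail to detect communities that contain fewer links than a threshold that scales as the square root of the total number of links in the network.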
This is known as the resolution limit (RL) problem. A number of approaches have been taken toward solving this problem [13, 14, 15, 16]. One approach has been to modulate the relative weights of the two terms in Eq. 1. This approach does allow smaller communities to be detected, but at the cost of no longer being able to detect large communities. Another approach has been to use a different null model for the second term in Eq. 1. Doing so, however, may affect the character of the community structure that is detected [17, 19]. Perhaps the most promising approach is to use a new metric called modularity density to quantify community structure. This measure was recently introduced by Chen et al. to address multiple issues with modularity, particularly the resolution limit problem. Modularity density $Q_{ds}$ is defined as
$$Q_{ds} = \sum_{c \in C} \left[ \frac{m_c}{M} d_c - \left( \frac{2 m_c + e_c}{2M} d_c \right)^2 - \sum_{c' \neq c} \frac{m_{c,c'}}{2M} d_{c,c'} \right], \qquad d_c = \frac{2 m_c}{n_c (n_c - 1)}, \qquad d_{c,c'} = \frac{m_{c,c'}}{n_c n_{c'}}, \tag{2}$$
where $m_{c,c'}$ is the number of links between communities $c$ and $c'$, $n_c$ is the number of nodes in $c$, $d_c$ is the density of links inside $c$, $d_{c,c'}$ is the density of links between $c$ and $c'$, and the other quantities are the same as in Eq. 1: $m_c$ is the number of links inside $c$, $e_c$ is the number of external links of $c$, and $M$ is the total number of links. Again, it is the partition that maximizes $Q_{ds}$ that corresponds to the community structure.
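To make the three terms concrete, the following sketch evaluates modularity density for a partition given as an edge list. It assumes Eq. 2 has the form introduced by Chen et al., with internal density $2m_c/[n_c(n_c-1)]$ and pairwise density $m_{c,c'}/(n_c n_{c'})$. The function name, the `single_node_density` parameter (the internal density is undefined for one node, so a value has to be supplied), and the toy graph are assumptions made for illustration.

```python
from collections import defaultdict


def modularity_density(edges, community_of, single_node_density=0.0):
    """Modularity density with split penalty, for a simple undirected graph.

    single_node_density is an assumed convention for one-node communities,
    whose internal link density is otherwise undefined.
    """
    edges = list(edges)
    total_links = len(edges)
    size = defaultdict(int)
    for label in community_of.values():
        size[label] += 1
    internal = defaultdict(int)   # links inside each community
    between = defaultdict(int)    # links between each unordered pair
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu == cv:
            internal[cu] += 1
        else:
            between[frozenset((cu, cv))] += 1
    external = defaultdict(int)
    split_penalty = defaultdict(float)
    for pair, links in between.items():
        c1, c2 = tuple(pair)
        pair_density = links / (size[c1] * size[c2])
        for c in (c1, c2):  # the pair contributes once to each community's sum
            external[c] += links
            split_penalty[c] += links / (2 * total_links) * pair_density
    q_ds = 0.0
    for c, n_c in size.items():
        m_c = internal[c]
        d_c = 2 * m_c / (n_c * (n_c - 1)) if n_c > 1 else single_node_density
        q_ds += (m_c / total_links) * d_c
        q_ds -= ((2 * m_c + external[c]) / (2 * total_links) * d_c) ** 2
        q_ds -= split_penalty[c]
    return q_ds


# Toy example (hypothetical): two triangles joined by a single link.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "all" for node in range(6)}
qds_split = modularity_density(edges, split)
qds_merged = modularity_density(edges, merged)
```

Unlike modularity, the single-community partition here receives a positive score, because its internal link density is below one.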
As can be seen from Eqs. 1 and 2, there are two main differences between modularity density and modularity. The most significant is that modularity density multiplies each term in its definition by a coefficient related to link density, which is what gives the metric its name. If only the first two terms are considered, a structure consisting of many small communities is often resolved, even in situations where modularity fails to resolve it. Thus, the RL problem is at least partially mitigated by modularity density, but perhaps by creating communities that are too small. The third term in the definition of modularity density, which is its second main difference from modularity, was introduced to help alleviate this tendency to find communities that are too small. This term is referred to as the Split Penalty (SP).
In this paper we show that, although using modularity density does alleviate the RL problem in many cases, it does not completely eliminate it: an RL problem remains. We show this by identifying limitations of applying modularity density to certain example cases. We also show that the SP term can have undesired consequences. To address these problems, we propose a new metric to quantify community structure, a variant of modularity density that we refer to as excess modularity density. We show that using excess modularity density further mitigates the RL problem, resolving communities in cases where using either modularity or modularity density fails to do so. In addition, excess modularity density has no SP term.
The rest of the paper is organized as follows. In the next section we use modularity density to find the community structure of Zachary's Karate Club network and discuss potential problems that arise due to the SP term. In Sec. 3 we use simple network structures to demonstrate that modularity density also suffers from RL problems. For some specific examples, we identify the conditions under which modularity density becomes unreliable. In Sec. 4, we propose a modified metric, excess modularity density, in an attempt to fix the issues with modularity density. We test this new metric on a number of networks and observe that it does address those issues. We also discuss the fundamental limitations that even excess modularity density has with respect to the RL problem. In the final section, we conclude by summarizing our results, arguing for the superiority of excess modularity density, and discussing possible future research directions to further extend the idea and applications of modularity density measures.
2 Modularity density applied to the Karate Club network
Modularity density has been shown to substantially reduce the two problems of modularity mentioned in the introduction while maintaining the general character of the communities that modularity finds in practical and synthetic benchmark networks. When it was used to analyze the community structure of the well-known Zachary's Karate Club network, a partition with a modularity density of 0.231 was first reported. This partition agrees reasonably well with the one thought to maximize modularity. These two partitions are shown in Fig. 1(a) and (b). Recently, another partition with a higher value of 0.235 was reported. It is shown in Fig. 1(c). However, we find a partition with an even higher, and potentially the true maximal, value of 0.243. It is shown in Fig. 1(d).
Finding the network partition that maximizes modularity and, presumably, modularity density is an NP-hard computational problem. For modularity, numerous algorithms have been developed to find good approximate solutions in polynomial time [22, 23, 24, 25, 26, 27, 28]. In this paper, we use variants of an efficient algorithm, recently introduced for finding partitions of maximal modularity, to find partitions that maximize modularity density metrics. This algorithm uses both partitioning and agglomeration, combined with multiple types of Kernighan-Lin-type refinements, to achieve high-quality partitions. A similar algorithm was, in principle, used to find the partition in Fig. 1(c). Our implementation, however, finds the partition with higher modularity density shown in Fig. 1(d).
Unfortunately, the new partition we find reveals an unexpected problem with modularity density. Notice that nodes 10, 12 and 29, which have no direct links between each other, are grouped in the same community. Intuitively, such a grouping should not occur. Notice furthermore that nodes 10, 12 and 29 are special: node 12 is the only node with degree 1 in the network; node 10 has two links, which connect to two different communities; and node 29 has three links, each of which connects to a different community. This suggests that letting each of these three nodes form a separate single-node community would be an acceptable and better result than putting them together.
To understand the nature of the problem, consider a partition of a network consisting of a set of nodes that are isolated from each other, i.e. there are no links between any pair of them, although they may be connected to other nodes in the network, and a set of other nodes that are separated into communities of given sizes. For each of these communities and each isolated node, count the number of links between them. The contributions to the value of the SP term, the third term in Eq. 2, resulting from these links can then be calculated. Consider two extreme cases: (1) separating all the isolated nodes into single-node communities and (2) merging them into one community. By the RMS-AM inequality, the contribution to the SP term in case (1) is at least as large as that in case (2). Thus, the SP term prefers to merge the isolated nodes into one community.
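The comparison of the two cases can be written out explicitly. The following is a sketch under stated assumptions rather than a reproduction of the original equations: it assumes the SP term of Eq. 2 has the form $\sum_{c' \neq c} m_{c,c'}^2 / (2M\, n_c n_{c'})$ for each community $c$, and it introduces its own notation, namely $r$ mutually unlinked nodes $v_1, \dots, v_r$, other communities $c_1, \dots, c_p$ of sizes $s_j$, and $e_{ij}$ links between node $v_i$ and community $c_j$.

```latex
% Each unordered pair of communities appears twice in the double sum of Eq. 2.
% Case (1): every isolated node is its own community (size 1).
\mathrm{SP}^{(1)} = \frac{1}{M} \sum_{j=1}^{p} \frac{1}{s_j} \sum_{i=1}^{r} e_{ij}^{2}
% Case (2): the r isolated nodes form one community (size r).
\mathrm{SP}^{(2)} = \frac{1}{M} \sum_{j=1}^{p} \frac{1}{r\, s_j} \Bigl( \sum_{i=1}^{r} e_{ij} \Bigr)^{2}
% RMS-AM inequality applied to e_{1j}, ..., e_{rj} for each j:
\sqrt{\frac{1}{r} \sum_{i=1}^{r} e_{ij}^{2}} \;\ge\; \frac{1}{r} \sum_{i=1}^{r} e_{ij}
\quad\Longrightarrow\quad
\sum_{i=1}^{r} e_{ij}^{2} \;\ge\; \frac{1}{r} \Bigl( \sum_{i=1}^{r} e_{ij} \Bigr)^{2}
\quad\Longrightarrow\quad
\mathrm{SP}^{(1)} \ge \mathrm{SP}^{(2)}
```

Because the SP term is subtracted in Eq. 2, the smaller penalty of case (2) is the one that is favored.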
As to the other two terms in modularity density, only the communities involving the isolated nodes can make a difference in their value, since the contribution from the other communities is the same in both cases. Note that the link density of a community consisting of a single node is not well-defined. Thus, in case (1), when the isolated nodes form separate communities, modularity density is also not well-defined. To fix this problem one can simply define the value of the link density for a single-node community. Since it is a density, it is reasonable to expect it to lie between 0 and 1. Whatever value is chosen, the contribution to the first term of modularity density from communities of the isolated nodes, whether or not they are merged into one community, is always zero because these communities contain no internal links. For the second term, if the nodes are merged, case (2), the contribution from the community of isolated nodes is 0 because its internal link density is 0. In case (1), though, the contributions depend on the value chosen for the single-node density. If it is 0, then the contributions are also 0. However, if it is positive, then the second term favors case (2), merging the isolated nodes. So, perhaps the single-node density should be defined to be 0, but even then isolated nodes will tend to be grouped together because of the SP term. Thus, although the SP term may in some situations solve the problem that modularity has of favoring small communities, it also introduces the problem of grouping unlinked nodes into the same community.
3 Resolution limits of modularity density
Modularity density, by introducing density coefficients, does solve a well-known RL problem of modularity described in earlier work. As shown in Fig. 2, modularity fails to resolve pairs of cliques, i.e. fully connected sets of nodes, in certain configurations. In the two cases shown, the cliques are connected only by a single link and, thus, can be expected to form independent communities. If modularity density is used instead, then the cliques are resolved in both cases. Thus, using modularity density does significantly address this RL problem. In these two examples, though, the cliques in each pair have the same size. If instead they have unequal sizes, then the results are more complicated.
Consider the particularly simple example of two cliques of different sizes, with just one link connecting them, as shown in the inset of Fig. 3(a), and define the relative size of the cliques as the ratio of their sizes. We compare the modularity density obtained when the two cliques are assigned to two separate communities with that obtained when the two cliques are merged into a single community. Figure 3(a) shows results for a fixed size of one clique and a varying size of the other, which reveal how the relative sizes of the two cliques affect the ability of modularity density to resolve them. When the relative size is below a critical value, the separated partition has a lower modularity density than the merged one, and modularity density does not resolve the two cliques. Figure 3(b), similarly, shows the clique sizes for which modularity density does not resolve the cliques when they are connected by a single link.
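The two-clique comparison can be reproduced exactly from link counts alone. The sketch below assumes Eq. 2 has the Chen et al. form (internal density $2m_c/[n_c(n_c-1)]$, pairwise density $m_{c,c'}/(n_c n_{c'})$); the function name and the clique sizes 20 and 5 are hypothetical choices for illustration, not values taken from Fig. 3.

```python
def two_clique_modularity_density(n1, n2):
    """Return (separate, merged) modularity density for two cliques of
    sizes n1 and n2 (each at least 2) joined by a single link."""
    m1 = n1 * (n1 - 1) // 2          # links inside the first clique
    m2 = n2 * (n2 - 1) // 2          # links inside the second clique
    total_links = m1 + m2 + 1        # plus the single connecting link
    # Split penalty charged to each clique for the one connecting link.
    bridge_density = 1 / (n1 * n2)
    split_penalty = 1 / (2 * total_links) * bridge_density
    separate = 0.0
    for m in (m1, m2):
        # A clique has internal density 1 and exactly one external link here.
        separate += m / total_links
        separate -= ((2 * m + 1) / (2 * total_links)) ** 2
        separate -= split_penalty
    # Merged: one community holding every node and every link.
    n = n1 + n2
    d = 2 * total_links / (n * (n - 1))
    merged = d - d ** 2
    return separate, merged


sep_equal, mer_equal = two_clique_modularity_density(20, 20)
sep_unequal, mer_unequal = two_clique_modularity_density(20, 5)
```

With these sizes, the equal pair is resolved (the separated partition scores higher), while the 20-node and 5-node pair is not.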
The critical value of the relative size, below which modularity density fails to resolve a pair of cliques, can also be estimated by assuming that the cliques are large and that the number of links connecting them is small. In this case, the quantities entering Eq. 2 simplify for each clique and the SP term can be ignored. After simplification, one obtains an equation for the critical relative size. The equation has two real roots, which indicate that modularity density fails to resolve the cliques, i.e. the merged partition scores higher than the separated one, if the relative size lies below the smaller root or above the larger one. This result is independent of system size (for large cliques) and agrees well with the results shown in Figs. 3(a) and (b).
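One way to carry out such an estimate is sketched below. It is an independent reconstruction under stated assumptions, not the paper's own equation: the relative size $\alpha$ is taken to be the ratio of the two clique sizes, each clique of size $n$ is assumed to hold about $n^2/2$ links, the connecting link and the SP term are neglected, and the merged density is approximated by $(1+\alpha^2)/(1+\alpha)^2$. Under these assumptions the separated partition scores $2\alpha^2/(1+\alpha^2)^2$, the merged one scores $2\alpha(1+\alpha^2)/(1+\alpha)^4$, and merging wins when $(1+\alpha^2)^3 > \alpha(1+\alpha)^4$.

```python
def merge_minus_split(alpha):
    """Positive when the merged partition beats the separated one in the
    assumed large-clique limit; alpha is the ratio of the clique sizes."""
    return (1 + alpha ** 2) ** 3 - alpha * (1 + alpha) ** 4


def bisect_root(f, lo, hi, tol=1e-12):
    """Plain bisection; assumes f changes sign exactly once on [lo, hi]."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# The condition is symmetric under alpha -> 1/alpha, so the second root
# is the reciprocal of the first.
alpha_low = bisect_root(merge_minus_split, 0.1, 1.0)
alpha_high = 1.0 / alpha_low
```

Under these assumptions the lower root comes out close to 0.40, consistent with the statement in this section that the cliques are not resolved when the small clique is smaller than about 0.4 of the large clique size.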
More generally, if the two cliques are part of an unspecified larger network, then the ability of modularity density to resolve them can also be evaluated as a function of their relative size and of the fraction of the network's links that are contained within the two cliques. In this case, the black region of Fig. 3(c) indicates where the merged partition has the higher modularity density and the cliques are not resolved. Hence, modularity density fails to resolve two cliques when their sizes are not balanced, specifically when the small clique is smaller than about 0.4 of the large clique size, and the links contained within the two cliques account for more than about half of the total number of links in the network. Considering again the two cases shown in Fig. 2, Fig. 3(c) indicates that modularity density is able to resolve the cliques in each pair because they have equal sizes, but that it potentially would not if their sizes were sufficiently different. Cliques are, of course, an extreme form of dense community, and we have discussed only the case of networks that consist of cliques connected by single links. Our conclusions about the RL of modularity density nevertheless remain approximately true if the communities are dense but not cliques, and if they are connected by a sufficiently small number of links or by no links at all.
Thus, although modularity density does significantly mitigate the resolution limit problem of modularity in many situations, it does not solve the problem completely. In the situations discussed above, modularity density still suffers from a resolution problem. Nevertheless, modularity density is still a good metric if the network is within its domain of applicability, such as the light gray regions in Figs. 3(b) and (c).
4 An improved metric: Excess modularity density
Because of the inclusion of the link density in the definition of modularity density, Eq. 2, when all nodes are grouped together in one community the second term is, except in trivial cases, smaller than the first. As a result, modularity density is positive even in the absence of any modular structure in the network, whereas the value of modularity in this case is zero. To fix this issue, we propose a modified density metric for the quality of community structure. Our metric replaces the community link density $d_c$ in the definition of modularity density by a rescaled link density
$$\hat{d}_c = d_c - \frac{2M}{N(N-1)},$$
where $M$ and $N$ are the total number of links and nodes of the network, respectively. The rescaled density $\hat{d}_c$ measures the excess link density inside a community by subtracting the global link density from $d_c$. This is intuitively attractive because it measures the link density in a community above and beyond the global average. It also eliminates the problem of measuring a positive modularity density even in the absence of any modular structure. We also exclude the SP term to avoid the problems caused by it that were discussed in Sec. 2. We denote this new metric by $Q_x$ and refer to it as excess modularity density,
$$Q_x = \sum_{c \in C} \left[ \frac{m_c}{M} \hat{d}_c - \left( \frac{2 m_c + e_c}{2M} \hat{d}_c \right)^2 \right],$$
with $m_c$, $e_c$ and $d_c$ denoting, as in Eq. 2, the number of internal links, the number of external links and the internal link density of community $c$.
The partition that maximizes $Q_x$ corresponds to the community structure. An added advantage of excluding the SP term is that it makes finding the maximal partition easier computationally.
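As a rough illustration of the metric described in this section, the sketch below evaluates excess modularity density from an edge list. It rests on three assumptions: the global link density is $2M/[N(N-1)]$, the metric keeps only the first two terms of Eq. 2 with the community density replaced by its excess over the global density, and a single-node community is assigned link density 0 by default. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict


def excess_modularity_density(edges, community_of, single_node_density=0.0):
    """Excess modularity density of a partition of a simple undirected graph.

    Assumed form: sum over communities of
        (m_c / M) * x_c - ((2 m_c + e_c) / (2M) * x_c) ** 2,
    where x_c = d_c - 2M / (N (N - 1)) and there is no split penalty.
    """
    edges = list(edges)
    total_links = len(edges)
    total_nodes = len(community_of)
    global_density = 2 * total_links / (total_nodes * (total_nodes - 1))
    size = defaultdict(int)
    for label in community_of.values():
        size[label] += 1
    internal = defaultdict(int)
    external = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu == cv:
            internal[cu] += 1
        else:
            external[cu] += 1
            external[cv] += 1
    q_x = 0.0
    for c, n_c in size.items():
        m_c = internal[c]
        d_c = 2 * m_c / (n_c * (n_c - 1)) if n_c > 1 else single_node_density
        excess = d_c - global_density
        q_x += (m_c / total_links) * excess
        q_x -= ((2 * m_c + external[c]) / (2 * total_links) * excess) ** 2
    return q_x


# Toy example (hypothetical): two triangles joined by a single link.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "all" for node in range(6)}
qx_split = excess_modularity_density(edges, split)
qx_merged = excess_modularity_density(edges, merged)
```

With every node in one community the excess density vanishes, so the score is exactly zero, mirroring the behavior of modularity in the absence of modular structure.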
To fully define excess modularity density, a value of the link density for single-node communities must also be defined. Given the considerations of Sec. 2, an appropriate value should not cause disconnected nodes to be grouped together. Consider a set of isolated nodes that have no common links, but may be connected to the rest of the network. The values of excess modularity density when the isolated nodes are merged and when each node is assigned to a separate community, assuming the same community structure in the rest of the network in both cases, can be written in terms of the degrees of the isolated nodes, the density of links in the total network, and the contribution from the remaining portion of the network. Since the square of the sum of the degrees is at least the sum of their squares, setting the single-node link density to 0 guarantees that the partition with separate nodes scores at least as high as the merged one, which is not necessarily true for other choices. Hence, we define the link density of a single-node community to be 0.
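The argument can be made explicit as follows. This is a sketch under assumptions, with notation introduced here: $r$ mutually unlinked nodes with degrees $k_1, \dots, k_r$, global link density $\rho$, a single-node link density $d_1$, $Q_{\mathrm{rest}}$ for the contribution of all other communities, and the excess form of the metric with no SP term. The unlinked nodes have no internal links, so only the squared term contributes.

```latex
% Merged: one community with no internal links, density 0, excess density -rho.
Q_x^{\mathrm{merged}} = Q_{\mathrm{rest}} - \Bigl( \frac{\rho}{2M} \sum_{i=1}^{r} k_i \Bigr)^{2}
% Separate: r single-node communities, each with excess density d_1 - rho.
Q_x^{\mathrm{separate}} = Q_{\mathrm{rest}} - \sum_{i=1}^{r} \Bigl( \frac{(d_1 - \rho)\, k_i}{2M} \Bigr)^{2}
% With the assumed choice d_1 = 0:
Q_x^{\mathrm{separate}} - Q_x^{\mathrm{merged}}
  = \frac{\rho^{2}}{4M^{2}} \Bigl[ \Bigl( \sum_{i=1}^{r} k_i \Bigr)^{2} - \sum_{i=1}^{r} k_i^{2} \Bigr] \;\ge\; 0
```

The bracket is non-negative because the degrees are non-negative, so under these assumptions separating unlinked nodes is never worse than merging them.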
To demonstrate the efficacy of using excess modularity density, consider again the problem of two cliques of different sizes, which modularity density fails to resolve if the sizes are too different, Fig. 3(a). Figure 4 shows the analogous results using excess modularity density. As the separated partition scores higher than the merged one for all values of the relative size, there is no resolution limit problem using excess modularity density in this case. In fact, there is no resolution limit problem using excess modularity density in any range of the more general two-clique problem shown in Figs. 3(b) and (c), even if the sizes of the cliques are extremely different.
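The two-clique check can be repeated for the excess form of the metric. The sketch below is illustrative only: it assumes the global link density is $2M/[N(N-1)]$ and that the metric consists of the first two terms of Eq. 2 with the excess density in place of the community density. The function name and the choice of a 20-node large clique are hypothetical.

```python
def two_clique_excess_density(n1, n2):
    """Return (separate, merged) excess modularity density for two cliques
    of sizes n1 and n2 (each at least 2) joined by a single link."""
    m1 = n1 * (n1 - 1) // 2
    m2 = n2 * (n2 - 1) // 2
    total_links = m1 + m2 + 1
    total_nodes = n1 + n2
    rho = 2 * total_links / (total_nodes * (total_nodes - 1))
    separate = 0.0
    for m in (m1, m2):
        excess = 1.0 - rho  # a clique has internal density 1
        separate += m / total_links * excess
        separate -= ((2 * m + 1) / (2 * total_links) * excess) ** 2
    # Merged: the single community has exactly the global density.
    merged_excess = 2 * total_links / (total_nodes * (total_nodes - 1)) - rho
    merged = merged_excess - merged_excess ** 2
    return separate, merged


large = 20
results = {small: two_clique_excess_density(large, small)
           for small in range(2, large + 1)}
always_resolved = all(sep > mer for sep, mer in results.values())
```

For every small-clique size from 2 up to 20 in this scan, the separated partition scores above the merged one, whose score is zero.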
Next, we consider a more general case of three cliques of different sizes, each pair of which is connected by a single link, and characterize them by their relative sizes. Figure 5 shows the five possible ways to partition the network without dividing the cliques. Figure 6 shows the value of modularity density and excess modularity density for example sets of relative clique sizes for each of these five possible partitions. Figures 6(a)-(c) show the value of modularity density, while Figs. 6(d)-(f) show the value of excess modularity density. Note that the modularity density of the partition in which all three cliques are separated is not always the highest among the various partitions, but that its excess modularity density is always the highest of any partition. Thus, modularity density can fail to resolve the three cliques, but excess modularity density is always able to do so in the cases considered.
Although we have only shown the limitations of modularity density in two-clique and three-clique examples here, we believe that similar issues would be encountered when analyzing more complex network structures. Excess modularity density can help eliminate these problems to a great extent, but it may also have limitations, and we have not shown that it is guaranteed to work in the most general case. In fact, in some extreme cases we expect that it will fail to resolve communities. Consider, for example, the two-clique network of Fig. 3 embedded in a larger network. By adding more communities to the network that are loosely connected to the two cliques, the global link density can be made to systematically approach zero. In this limit, the rescaled link density approaches the ordinary link density and, consequently, excess modularity density approaches modularity density without the SP term. But even in this extreme example, excess modularity density is at least as good a metric as modularity density. Many networks of interest are not as extreme, and so excess modularity density can be used as an improved metric that reduces the resolution limit problems associated with modularity and modularity density.
We have also used a variant of the algorithm described in Sec. 2 to optimize excess modularity density in Zachary's Karate Club network and the American college football network to detect their underlying community structure. For the Karate Club network, Fig. 7(a), many small communities are found, including four single-node communities. However, each of these communities is contained within one of the two ground-truth communities that correspond to the historical split of the Karate Club. Thus, the community structure from maximizing excess modularity density respects the ground-truth division of the network. In contrast, maximizing modularity density gives rise to a community of nodes with no internal connections that does not respect the ground truth [see Fig. 1(d)].
Our result for the American college football network is shown in Fig. 7(b). This is a network of games played between Division IA colleges during the regular season of Fall 2000. Community detection is usually expected to find ground-truth communities of colleges belonging to the same conference. As the figure shows, we find fifteen communities by optimizing excess modularity density. This contrasts with partitions consisting of twelve communities when modularity density is optimized and ten communities when modularity is optimized. (See Supplementary Material Fig. S1.) The partition that maximizes excess modularity density detects most of the ground-truth communities, but with some notable exceptions. The partitions that maximize modularity and modularity density also show deviations from the conference membership ground truth. Of course, the conference structure is not necessarily the true natural structure of the network. Some nodes, such as Texas Christian, share more links with a different conference than with the conference they are part of and are not classified correctly by the conference structure.
The partition maximizing modularity is farthest from the ground truth and has larger communities. (The value of modularity for this partition is the same as the highest reported value.) The partition maximizing modularity density is very similar. (This same partition was reported in [31, 32].) The only differences are that the schools outside of the Southeastern Conference and the Mountain West Conference that were grouped with them now form independent communities, and that Central Florida switches from grouping with the Mid-American Conference to merging with the group that split from the Mountain West Conference. Each of these differences makes intuitive sense. The partition maximizing excess modularity density is similar to the one that maximizes modularity density, except that the independent schools Notre Dame and Navy now split from the Big East Conference and form their own community, and Connecticut splits from the Mid-American Conference to form its own community. The only other change, which is very interesting, is that the Mid-American Conference [bottom right corner in Fig. 7(b)] now splits into two communities, its eastern and western groups. Again, each of these differences makes intuitive sense. Thus, even though excess modularity density finds smaller communities than modularity or modularity density, these additional communities are meaningful. Furthermore, the partition that maximizes excess modularity density in this network, as in the Karate Club network, is mostly consistent with those that maximize modularity and modularity density, since the main differences simply subdivide communities found by maximizing the other metrics. We consider the ability of excess modularity density to detect small groups a positive feature, not a drawback.
In this paper we have discussed community detection by maximizing modularity density measures. Modularity density was originally introduced to address problems with modularity, most notably the resolution limit of modularity. We found that, while the use of modularity density does significantly mitigate the resolution limit problem, it does not eliminate it completely. In particular, we found that, when using modularity density, loosely connected dense communities, especially cliques, cannot be resolved if they have very different sizes and constitute a large portion of the network. We also found that the split penalty term in modularity density can cause sets of nodes that have no common links to be grouped together as a community.
To address these problems, we introduced a modified density metric called excess modularity density. We motivated the definition of the modified metric on intuitive grounds and applied it to both stylized and real-world example networks. We demonstrated that it effectively eliminates the problems associated with using both modularity and modularity density for a broad class of networks, thereby expanding the range of applicability of community detection methods. In the limit of a sparse network, however, excess modularity density and modularity density become equivalent, and the resolution issues will then also exist for excess modularity density. Thus, despite our advances, a complete, general solution to the resolution limit problem remains elusive. Nevertheless, using excess modularity density instead of modularity density can substantially improve the quality of community detection, and we therefore propose it as a superior measure.
Excess modularity density has been defined in this paper only for simple unweighted networks. Many complex networks are more complicated, having, for example, weighted links and/or a bipartite structure. Definitions of modularity and modularity density have been extended to incorporate such networks by utilizing an appropriate null model [20, 36, 37]. Similar extensions can be made to excess modularity density. To use these metrics, algorithms to find the partition that maximizes them must also be developed. Such algorithms have been developed and utilized for modularity [38, 39, 40] and modularity density. Developing such algorithms for excess modularity density would be both interesting and important. The expected structure in the absence of any communities, i.e. the null model, plays a crucial role in determining the community structure of a given network. Usually a metric relies on a randomized network with soft constraints for this purpose. A thorough analysis of the effect of imposing hard constraints [41, 42, 43, 44, 45] on these null models would be another interesting topic to explore.
This work was supported by the NSF through grants DMR-1507371 and IOS-1546858. Some of the computations in this work were done on the uHPC cluster at the University of Houston, acquired through NSF Award Number 1531814. We thank Boleslaw K. Szymanski, Charo I. Del Genio, and Weibin Zhang for fruitful discussions.
- [1] Newman M E J 2004 Detecting community structure in networks Eur. Phys. J. B, 38(2):321-330
- [2] Danon L, Diaz-Guilera A, Duch J and Arenas A 2005 Comparing community structure identification J. Stat. Mech., P09008
- [3] Fortunato S 2010 Community detection in graphs Physics Reports, 486(3):75-174
- [4] Singh P, Sreenivasan S, Szymanski B K and Korniss G 2013 Threshold-limited spreading in social networks with multiple initiators Scientific Reports, 3:2330
- [5] Mentzen W I and Wurtele E S 2008 Regulon organization of Arabidopsis BMC Plant Biology, 8(1):99
- [6] Schaub M T, Delvenne J, Rosvall M and Lambiotte R 2017 The many facets of community detection in complex networks Applied Network Science, 2(1):4
- [7] Peel L, Larremore D B and Clauset A 2017 The ground truth about metadata and community detection in networks Science Advances, 3(5):e1602548
- [8] Girvan M and Newman M E J 2002 Community structure in social and biological networks Proc. Natl. Acad. Sci., 99:7821-7826
- [9] Newman M E J 2003 The Structure and Function of Complex Networks SIAM Rev., 45(2):167-256
- [10] Newman M E J and Girvan M 2004 Finding and evaluating community structure in networks Phys. Rev. E, 69:026113
- [11] Newman M E J 2006 Finding community structure in networks using the eigenvectors of matrices Phys. Rev. E, 74(3):036104
- [12] Fortunato S and Barthelemy M 2007 Resolution limit in community detection Proc. Natl. Acad. Sci., 104(1):36-41
- [13] Ronhovde P and Nussinov Z 2010 Local resolution-limit-free Potts model for community detection Phys. Rev. E, 81:046114
- [14] Arenas A, Fernández A and Gomez S 2008 Analysis of the structure of complex networks at different resolution levels New J. of Physics, 10(5):053039
- [15] Granell C, Gomez S and Arenas A 2012 Hierarchical multiresolution method to overcome the resolution limit in complex networks International Journal of Bifurcation and Chaos, 22(07):1250171
- [16] Aldecoa R and Marin I 2011 Deciphering Network Community Structure by Surprise PLoS ONE, 6(9):e24195
- [17] Lancichinetti A and Fortunato S 2011 Limits of modularity maximization in community detection Phys. Rev. E, 84(6):066122
- [18] Traag V A, Van Dooren P and Nesterov Y 2011 Narrow scope for resolution-limit-free community detection Phys. Rev. E, 84:016114
- [19] Fortunato S and Hric D 2016 Community detection in networks: A user guide Physics Reports, 659:1-44
- [20] Chen M, Nguyen T and Szymanski B K 2013 A New Metric for Quality of Network Community Structure ASE Hum. J., 2(4):226-240
- [21] Brandes U, Delling D, Gaertler M, Gorke R, Hoefer M, Nikoloski Z and Wagner D 2008 On modularity clustering IEEE Transactions on Knowledge and Data Engineering, 20(2):172-188
- [22] Newman M E J 2004 Fast algorithm for detecting community structure in networks Phys. Rev. E, 69(6):066133
- [23] Newman M E J 2006 Modularity and community structure in networks Proc. Natl. Acad. Sci., 103(23):8577-8582
- [24] Blondel V D, Guillaume J L, Lambiotte R and Lefebvre E 2008 Fast unfolding of communities in large networks J. Stat. Mech., P10008
- [25] Guimera R, Sales-Pardo M and Amaral L A N 2004 Modularity from fluctuations in random graphs and complex networks Phys. Rev. E, 70(2):025101
- [26] Medus A, Acuña G and Dorso C O 2005 Detection of community structures in networks via global optimization Physica A: Statistical Mechanics and its Applications, 358(2):593-604
- [27] Duch J and Arenas A 2005 Community detection in complex networks using extremal optimization Phys. Rev. E, 72(2):027104
- [28] Sun Y, Danila B, Josic K and Bassler K E 2009 Improved community structure detection using a modified fine-tuning strategy Europhysics Letters, 86(2):28004
- [29] Treviño S III, Nyberg A, del Genio C I and Bassler K E 2015 Fast and accurate determination of modularity and its effect size J. Stat. Mech., P02003
- [30] Zachary W W 1977 An information flow model for conflict and fission in small groups Journal of Anthropological Research, 33:452-473
- [31] Chen M, Kuzmin K and Szymanski B K 2014 Community Detection via Maximization of Modularity and Its Variants IEEE Transactions on Computational Social Systems, 1(1):46-65
- [32] Botta F and del Genio C I 2016 Finding network communities using modularity density J. Stat. Mech., 123402
- [33] Kernighan B W and Lin S 1970 An efficient heuristic procedure for partitioning graphs The Bell System Technical Journal, 49(2):291-307
- [34] Gwanyama P W 2004 The HM-GM-AM-QM Inequalities College Mathematics Journal, 47-50
- [35] Cafieri S, Hansen P and Liberti L 2011 Locally optimal heuristic for modularity maximization of networks Phys. Rev. E, 83(5):056105
- [36] Newman M E J 2004 Analysis of weighted networks Phys. Rev. E, 70(5):056131
- [37] Barber M J 2007 Modularity and community detection in bipartite networks Phys. Rev. E, 76(6):066102
- [38] Chauhan R, Ravi J, Datta P, Chen T, Schnappinger D, Bassler K E, Balazsi G and Gennaro M L 2016 Reconstruction and topological characterization of the sigma factor regulatory network of Mycobacterium tuberculosis Nature Communications, 7:ncomms11062
- [39] Treviño S III, Sun Y, Cooper T F and Bassler K E 2012 Robust detection of hierarchical communities from Escherichia coli gene expression data PLoS Computational Biology, 8(2):e1002391
- [40] Bhavnani S K, Bellala G, Victor S, Bassler K E and Visweswaran S 2012 complementary bipartite visual analytical representations in the analysis of SNPs: a case study in ancestral informative markers J. Am. Med. Inform. Assoc., 19(e1):e5-e12
- [41] Bassler K E, Del Genio C I, Erdos P L, Miklos I and Toroczkai Z 2015 Exact sampling of graphs with prescribed degree correlations New Journal of Physics, 17(8):083052
- [42] Orsini C, Dankulov M M, Colomer-de-Simon P, Jamakovic A, Mahadevan P, Vahdat A, Bassler K E et al. 2015 Quantifying randomness in real networks Nature Communications, 6:8627
- [43] Coolen A C C, De Martino A and Annibale A 2009 Constrained Markovian dynamics of random graphs Journal of Statistical Physics, 136(6):1035-1067
- [44] Del Genio C I, Kim H, Toroczkai Z and Bassler K E 2010 Efficient and exact sampling of simple graphs with given arbitrary degree sequence PLoS ONE, 5(4):e10012
- [45] Kim H, Del Genio C I, Bassler K E and Toroczkai Z 2012 Constructing and sampling directed graphs with given degree sequences New Journal of Physics, 14(2):023012