Ensemble approaches for improving community detection methods

09/01/2013
by   Johan Dahlin, et al.
FOI

Statistical estimates can often be improved by fusion of data from several different sources. One example is so-called ensemble methods, which have been successfully applied in areas such as machine learning for classification and clustering. In this paper, we present an ensemble method to improve community detection by aggregating the information found in an ensemble of community structures. This ensemble can be found by re-sampling methods, multiple runs of a stochastic community detection method, or by several different community detection algorithms applied to the same network. The proposed method is evaluated using random networks with community structures and compared with two commonly used community detection methods. When applied to a stochastic community detection algorithm, the proposed method performs well with low computational complexity, thus offering both a new approach to community detection and an additional community detection method.

I Introduction

Networks are ubiquitous in nature and provide versatile models for many-body systems with non-regular interactions. For these reasons, they have become an important topic of current research. Network science has provided novel application areas for methods from statistical physics, and has in turn developed new methods that can be used to study physical systems. The study of networks is often concerned with quantifying different microscopic aspects of the structure, such as centrality measures, degree distributions, information flow in networks, and robustness. Concepts and methods from network analysis have been applied to a large range of different types of networks. Some of the most important applications include the analysis of energy grids, epidemiology, metabolic networks, protein-protein interaction networks, and social networks Fortunato (2010). Networks often arise as a consequence of strongly interacting agents or many-body systems with weak interactions but where the complex network structure leads to emergent behavior that has interesting physical properties.

Other important topics in network research include higher-level structures called communities. Communities are commonly defined as groups of nodes more densely connected with each other than with nodes outside the group. These communities, or partitions, are commonly found in e.g. our everyday social networks as colleagues, high-school friends, neighbors, family, etc. Much effort has been devoted to defining community structures and finding algorithms to detect partitions with low computational complexity and high accuracy. Despite this, no accepted general definition of what a community structure is has been proposed. In order to clarify the proper (if any) definition of community structures, there is a need for new approaches. In this paper, we present a new approach to the community detection problem by considering an ensemble of community clustering methods working on the same network. The results of the different community detection algorithms are then fused into a, hopefully more accurate, representation of the community structure of the network.

It is our hope that the work presented here will contribute both to more efficient algorithms for community detection and to a conceptual discussion about what a community structure is. By considering an ensemble of clustering methods, it is possible to consider different definitions of community structure. More effective algorithms can be found by merging (aggregating) many runs of fast stochastic algorithms as well as several runs of the same algorithm using different settings. In addition the latter method can be used to analyze the community structure of the network at many different scales, providing insight into the relations between community structure at different levels. The merging method is also applicable in aggregation of communities generated by bootstrap replicates of the network data, which is necessary in cases where there is missing or incomplete information available about the network.

Previous and related work includes work in the areas of community detection and in ensemble methods developed for data clustering and classification. In recent years a large number of methods for detecting community structures have been developed, drawing on knowledge from many different fields, e.g. statistical mechanics, discrete mathematics, computer science, statistics, and sociology. These methods have also been improved to handle weighted, directed, and multi graphs. A thorough review of the current state of community detection is given in Ref. Fortunato (2010); we provide some background in Section II.

Ensemble clustering is a technique used in e.g. bioinformatics applications and is useful in merging several clustering results into one. To our knowledge, no work has been devoted to applying these methods in community detection problems. However other methods have been used to merge several community structures, e.g. voting in Refs. Raghavan et al. (2007); Rosvall and Bergstrom (2010). As data clustering and community detection are quite similar, it should be possible to merge communities in the same manner as ensembles of data with good results. Ensemble clustering methods were first introduced in Ref. Strehl and Ghosh (2003) and further developed by e.g. Refs. Monti et al. (2003); Fern and Brodley (2004).

This paper continues with a presentation of the background in community detection (Section II) and ensemble methods (Section III) previously used in classification and clustering, where we discuss some common community detection methods and the performance of ensemble methods. The ensemble-based community clustering method is introduced in Section III.3, where its computational complexity is discussed and suggestions for how to estimate the certainty of the obtained solution are given. Finally, Section IV offers some simulation experiments which compare the proposed method to two well-known community detection algorithms: the greedy modularity maximizing agglomerative algorithm Clauset et al. (2004), and a q-Potts based spin glass model Reichardt and Bornholdt (2006).

The paper concludes with a summary and some remarks concerning implications and future work.

II Community detection

Networks consist of nodes, representing e.g. individuals, computers, or proteins, that are connected by edges representing e.g. friendships, network connections, or other types of interactions. Formally, networks are defined using graph structures, $G = (V, E)$, where $V$ denotes the set of nodes and $E$ the set of edges. We denote the number of nodes in a graph by $n$ and the number of edges by $m$.

Networks often contain some form of community structure, i.e. groups of nodes that are more densely connected to each other than to nodes outside of the group. In essence, this resembles the problem in data clustering, where similar data points are grouped together into clusters. In the same manner, nodes inside communities are often thought of as sharing some common feature. The interpretation of this feature naturally depends on the nature of the network data, e.g. communities in social networks are often thought of as constituting some social group sharing family ties, employer, a specific interest, etc. Detection of communities is therefore an important tool in sociology and other related areas, but is also used in fields including ecology and biology, where food webs, protein-protein interaction networks, metabolic networks, and natural resource exploitation networks are of interest Bodin and Prell (2011).

Community detection is a widely studied subject and much work has been devoted to developing faster and more accurate automatic methods for detection and verification of communities in complex networks. This section serves only as a short review of the field and some of the proposed methods for identifying communities in networks. For a comprehensive review of the field as a whole, we refer the reader to Refs. Fortunato (2010) and Porter et al. (2009).

II.1 Existence and uniqueness

There is no formal, generally accepted definition of what a community is, despite large efforts in the study of community detection and complex networks. In this paper, we adopt the practical viewpoint of Definition 1 and use the definition due to Ref. Radicchi et al. (2004) for what a community structure is.

Definition 1 (Community)

A community (in qualitative terms) is a subset of nodes within a network such that connections between nodes in the subset are denser than connections with the rest of the network.

The main problems with this definition are open questions such as how large a subset must be (can a community consist of only a few nodes?) and what exactly denser means in terms of the number of edges inside the community versus between communities. We return to the latter question below, in connection with the discussion of algorithms for generating synthetic (random) networks with community structures. Another issue with the definition is that in real networks, there are often edges of different kinds. For example, a social network contains edges that denote friendship, which are separate from those that represent colleagues. When looking for work-related communities, only work-related edge types should be considered.

Figure 1: Four communities found using the q-Potts spin glass method (see below) in the famous karate network from Ref. Zachary (1977), indicating the friendships in a karate club at a U.S. university during the 1970s.

The lack of a general definition raises problems related to uniqueness and existence of communities in networks. A network may contain many different community structures depending on the scale considered, from just one community containing all nodes to communities containing only one node each. This is known from previous work as the resolution limit, when discussing modularity (see next subsection) as a quality measure for community structures. Problems with existence include the question of whether the communities are a result of the observed data or of the underlying process generating this data. By this we mean that observations seldom identify the entire network, and it is therefore difficult to verify whether the resulting communities exist or are only an artifact of some missing or erroneous data.

Assume that we have studied a dense social network and have identified some of the ties between various individuals. As previously discussed, it is often difficult to identify all ties in the network and therefore we may only find a subset of all the edges in the real network. Applying standard community detection methods to this network will probably return some community structure with a certain number of communities. But as only a subset of the edges has been used, the real unobserved network may contain only one community. Therefore, the existence of the identified communities is in question: do they really exist or not?

The previous example illustrates the well-known problem of the robustness of communities. This has been studied by e.g. using bootstrap methods to generate subsets of edges and studying how the community structure appears as a mean over a large number of bootstrapped networks. Later, we will see how the methods proposed in this paper offer another solution to the problem of estimating the robustness of a community structure.

Additional problems with the definition of community structure arise because of the multi-modal nature of most networks of interest. If we, for instance, are interested in clustering a social network of individuals and their relations, we must distinguish between many different kinds of interpersonal relations, such as friendship, colleague, co-author, and citation relations.

II.2 Quality of community structures

The quality of an obtained community structure is usually assessed by a measure called modularity. This measure compares the network structure with the structure of a null model in which edges are randomly redistributed keeping the degree of all nodes fixed. The concept of modularity was introduced in Ref. Newman (2006) and is usually denoted $Q(\mathbf{c})$, where $\mathbf{c}$ is the vector of communities to which each node belongs. The measure is calculated using the expression

$$Q(\mathbf{c}) = \frac{1}{2m} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j), \tag{1}$$

where $c_i$ is the community of which node $i$ is a member, $k_i$ is the degree of node $i$, $A$ is the adjacency matrix (its elements, $A_{ij}$, are equal to $1$ if an edge exists between nodes $i$ and $j$, and $0$ otherwise), $m$ is the number of edges, $n$ is the number of nodes, and $\delta$ is the Kronecker delta function (which takes the value $1$ if its arguments are equal and $0$ otherwise).

Modularity is often used to compare different community structures in the same or similar networks. Due to the null model used, modularity cannot be used to compare the community structures of different networks, as its maximum value is determined by the network structure. A higher modularity is often taken as an indication of a better community structure, as the structure then differs more from the random null model. As such, modularity can only be used as a comparative measure and has drawbacks including difficulties in interpretation, the fact that random networks usually have higher maximum modularity than real-world networks, and the previously discussed resolution limit.
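As an illustration of how the modularity measure can be evaluated in practice, the following is a minimal Python sketch. It assumes an undirected, unweighted graph without self-loops, given as a list of edges, and a dictionary mapping each node to its community; the function name `modularity` and this input format are illustrative choices, not part of the original text. The double sum over node pairs is rewritten as an equivalent sum over communities, using the number of internal edges and the total degree of each community.

```python
from collections import defaultdict


def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once, no self-loops.
    community: dict mapping every node to a community label.
    """
    edges = list(edges)
    m = len(edges)
    internal = defaultdict(int)    # number of edges inside each community
    degree_sum = defaultdict(int)  # sum of node degrees in each community
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    # Q = sum over communities of (internal edge fraction - expected fraction)
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)
```

Placing all nodes in a single community gives a modularity of zero under this expression, which makes it a convenient baseline when comparing candidate partitions of the same network.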

It can be shown that modularity optimization, i.e. finding the optimal community structure, is an NP-complete problem Brandes et al. (2006). Thus the problem of community detection is time consuming for large networks and good approximations are necessary for detecting communities with reasonable computational effort. In the following section, three different methods are introduced for detecting communities in networks. These methods are examples of heuristic and stochastic methods for relaxing the NP-complete problem. It is also possible to show that the modularity has many local maxima Good et al. (2010), making the identification of a global maximum very difficult.

II.3 Algorithmic community detection

As previously discussed, much effort has been devoted to the problem of automatically identifying communities in networks. We refer to these methods as algorithmic community detection methods, in contrast to earlier manual methods pioneered by Refs. Weiss and Jacobson (1955); Rice (1927). A large number of methods have been proposed based on concepts from e.g. the fields of computer science, discrete mathematics, statistical physics, and statistics. In this paper, we consider three different methods: the q-Potts based spin glass algorithm (SP) introduced in Ref. Reichardt and Bornholdt (2006), the greedy agglomerative method (GA) proposed in Ref. Clauset et al. (2004), and the fast stochastic method of propagation of labels (LP) presented in Ref. Raghavan et al. (2007).

The SP-algorithm is based on a q-Potts spin glass and communities are detected by minimizing the energy of the following Hamiltonian

$$\mathcal{H}(\boldsymbol{\sigma}) = -J \sum_{i<j} A_{ij}\, \delta(\sigma_i, \sigma_j) + \gamma \sum_{s=1}^{q} \frac{n_s (n_s - 1)}{2}, \tag{2}$$

where $\sigma_i$ is the spin state (community) of node $i$, $J$ and $\gamma$ are coupling parameters, $\delta$ is the Kronecker delta function, and $n_s$ is the number of spins in state $s$. The size of the detected communities is determined by the ratio between the two coupling parameters (in the following simulation experiments this ratio is set to unity, so that only communities above a minimum size, which depends on the number of edges, can be detected; Reichardt and Bornholdt (2006)). This is because the first term, with coupling factor $J$, favors many edges inside communities and few between communities, while the second term, scaled by $\gamma$, favors a uniform distribution of nodes over communities. Reichardt and Bornholdt (2006)

The configuration of spins (communities) that minimizes the Hamiltonian in (2) is found using simulated annealing Kirkpatrick et al. (1983). The system is initialized at a starting temperature and cooled by a constant cooling factor until a final temperature is reached. As simulated annealing is computationally intensive, this algorithm has a high complexity, even on a sparse network. The advantage of this method is that it is known to often find good approximations of the global minimum of the Hamiltonian and therefore also good approximations of the community structure.

The GA-method greedily merges pairs of nodes/clusters using agglomerative hierarchical clustering. The order in which nodes are merged is governed by the modularity measure, which is calculated for all possible merges; the merge that is carried out is the one that yields the highest increase. This greedy method is quite computationally intensive, as many possible merges must be evaluated, and it is not certain that the optimal solution yielding the maximum modularity is found. The complexity of this algorithm is estimated to be $\mathcal{O}(n \log^2 n)$ on sparse networks, which is quite low in comparison with the SP-algorithm. Clauset et al. (2004)

Figure 2: A simple situation in the label propagation algorithm. A node is voted to change its label to B instead of A, as this label is in the majority of the neighboring nodes. cm is a function returning the most common label breaking ties randomly.

The LP-algorithm is an example of a stochastic method for detecting communities in networks. The method uses a label for each node to decide to which community the node belongs. LP is an iterative algorithm that is initialized by assigning each node a unique label. The iterative step begins by selecting a node at random and then assigning it a new label by voting (breaking ties randomly) among the labels of the neighboring nodes. The iterative procedure is repeated until no node changes label and an equilibrium is thereby obtained. As communities are groups of nodes more densely connected with each other than with other communities, the labels should propagate and spread within each community. Raghavan et al. (2007)

The stochastic part of this algorithm is two-fold: firstly the order in which nodes are selected and secondly the random breaking of ties. These two factors are the reason that the algorithm produces random outcomes. The advantage of this method is that it is very fast, with a complexity of $\mathcal{O}(m)$, where $m$ is the number of edges, which is $\mathcal{O}(n)$ in a sparse network. Several runs of the LP-algorithm are often combined to counter the stochastic nature of the method. This combination is the essential topic of this paper and it is further discussed in the next section. Raghavan et al. (2007)
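The label propagation procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the reference implementation of Raghavan et al. (2007): the graph is assumed to be given as a dictionary `adj` mapping each node to a list of its neighbors, nodes are visited in a random order within each sweep, and a node only changes its label when its current label is not among the most common labels of its neighbors (an assumed stopping rule that keeps tied nodes from flipping forever). The function name and the `max_sweeps` safeguard are illustrative.

```python
import random
from collections import Counter


def label_propagation(adj, rng, max_sweeps=100):
    """Asynchronous label propagation on an undirected graph.

    adj: dict mapping node -> list of neighboring nodes.
    rng: a random.Random instance (source of node order and tie breaking).
    Returns a dict mapping node -> community label.
    """
    labels = {v: v for v in adj}          # every node starts with a unique label
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)                # random visiting order
        changed = False
        for v in nodes:
            if not adj[v]:
                continue                  # isolated nodes keep their own label
            counts = Counter(labels[u] for u in adj[v])
            top = max(counts.values())
            best = [lab for lab, c in counts.items() if c == top]
            if labels[v] not in best:
                labels[v] = rng.choice(best)   # vote, ties broken at random
                changed = True
        if not changed:                   # equilibrium: no node changed label
            break
    return labels
```

Calling the function repeatedly with differently seeded `random.Random` instances produces the kind of ensemble of candidate community structures that the fusion method in this paper takes as input.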

III Ensemble methods

Community detection is a form of clustering of network data, in which nodes are similar by sharing many common neighbors. Clustering in turn is a form of classification, which is extensively used in machine learning and other related areas. In this section, we discuss the important concept of boosting, used in classification to combine several weaker classifiers into a better classifier using (weighted) voting schemes. Another important concept in classification is bagging, in which a large number of bootstrap replicates are aggregated to form a robust average. This method has been successfully applied using network data in Ref. Rosvall and Bergstrom (2010) and it is therefore likely that boosting will also be applicable to network data.

Boosting has previously been used on clustering methods in e.g. bioinformatics to improve the result of clustering analysis. As clustering is similar to community detection, it is fruitful to discuss the ensemble clustering methods previously used in data clustering and generalize these to network data. This is the aim of this section, which also contains a proposed method for combining several runs of a stochastic community detection algorithm, such as the LP method previously discussed, or the structures found by different community detection methods and by bootstrap re-sampled networks.

III.1 Boosting classifiers

The idea behind boosting classifiers is to arrange a large number of simple (weak) classifiers into an ensemble (committee), which by a wisdom-of-crowds effect creates a better classifier. In many cases, the resulting classifier performs much better than a single, more complex classifier. This makes boosting a powerful, yet simple method to greatly improve the classification accuracy.

A method related to boosting is ensemble learning, in which a number of different weak classifiers are combined into an ensemble without any re-sampling or re-weighting. This family of methods suits our setting better and is the basis for the following discussion on ensemble clustering. It is however important to put some effort into trying to explain why a group of simpler classifiers may perform much better than a single more advanced classifier. Much effort has been devoted to answering this question and some answers have been found for independent classifiers.

The ensemble method is discussed in Ref. Hansen and Salamon (1990) for use with neural networks, which are trained using some data set. It is possible to show that the training problem is an optimization problem with many local minima (as in the case of modularity maximization). Therefore the weightings found can differ greatly between solutions with almost the same error rate. By combining many of these weightings, the authors could show large improvements in overall accuracy. Assume that each classifier has some probability of classification error, $p$. The probability of finding exactly $k$ classification errors in $N$ classifiers is then given by

$$\binom{N}{k} p^{k} (1 - p)^{N - k}, \tag{3}$$

and by applying a simple majority voting rule, the corresponding probability of misclassification in an ensemble with $N$ classifiers is

$$\sum_{k > N/2}^{N} \binom{N}{k} p^{k} (1 - p)^{N - k}. \tag{4}$$

It is further stated in Ref. Hansen and Salamon (1990) that it is possible to prove by induction that this probability decreases with increasing $N$ when $p < 1/2$. This means that when each classifier is better than a random classification and independent of the other classifiers, an arbitrarily low error rate can be obtained by varying the number of classifiers used in the ensemble. The assumption of independence is seldom valid in practical applications, but the method still works when each classifier performs better than chance.
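To make the effect of majority voting concrete, the following small Python sketch evaluates the probability that a majority of independent classifiers is wrong, i.e. the binomial tail described above. The function name `majority_error` is illustrative, and the sketch assumes that a tie (possible only for an even number of classifiers) is not counted as an error.

```python
from math import comb


def majority_error(p, n_classifiers):
    """Probability that more than half of n independent classifiers err.

    p: error probability of each individual classifier.
    n_classifiers: number of classifiers in the ensemble.
    """
    first = n_classifiers // 2 + 1      # smallest k with k > N / 2
    return sum(comb(n_classifiers, k) * p ** k * (1 - p) ** (n_classifiers - k)
               for k in range(first, n_classifiers + 1))
```

Evaluating the function for an individual error probability below one half and an increasing odd number of classifiers shows the ensemble error shrinking, while for an error probability of exactly one half the ensemble is no better than a single classifier.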

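The error rate of an ensemble is further discussed and calculated for dependent classifiers in Ref. Sollich and Krogh (1996), where the ensemble generalization error, $\epsilon$, is expressed as

$$\epsilon = \sum_{k} w_k \left\langle \left( f_k(i) - y_i \right)^2 \right\rangle_i - \sum_{k} w_k \left\langle \left( f_k(i) - \bar{f}(i) \right)^2 \right\rangle_i, \tag{5}$$

where $y_i$ is the label of observation $i$, $f_k(i)$ is the label given by classifier $k$, $\langle \cdot \rangle_i$ denotes the average over observations, and $\bar{f}(i)$ is the weighted ensemble average,

$$\bar{f}(i) = \sum_{k} w_k f_k(i), \tag{6}$$

for some weights $w_k$ (non-negative and summing to one) for classifiers $k$.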

The first term in Eq. (5) is the (weighted) average of the generalization errors of the individual predictors and the second is the weighted average of ambiguities. The latter contains all correlations between the different classifiers. Finally, the relation shows that the more the predictors differ, the better the performance of the ensemble. This explains why an ensemble of classifiers can perform better than a single more advanced classifier, as the error rate can be decreased by increasing the number of classifiers included in the ensemble. Sollich and Krogh (1996)
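The relation between the ensemble error, the weighted average error, and the weighted average ambiguity can be checked numerically. The sketch below assumes squared error, real-valued predictions stored as one list per classifier, and weights that sum to one; the function name and data layout are illustrative and not taken from Ref. Sollich and Krogh (1996).

```python
def ambiguity_decomposition(preds, y, w):
    """Return (ensemble error, weighted mean error, weighted mean ambiguity).

    preds: list of prediction lists, one per classifier, each of len(y).
    y: list of true labels (numeric).
    w: list of classifier weights, assumed non-negative and summing to one.
    """
    n = len(y)
    # weighted ensemble average prediction for every observation
    fbar = [sum(wk * fk[i] for wk, fk in zip(w, preds)) for i in range(n)]
    ens_err = sum((fbar[i] - y[i]) ** 2 for i in range(n)) / n
    avg_err = sum(wk * sum((fk[i] - y[i]) ** 2 for i in range(n)) / n
                  for wk, fk in zip(w, preds))
    ambiguity = sum(wk * sum((fk[i] - fbar[i]) ** 2 for i in range(n)) / n
                    for wk, fk in zip(w, preds))
    return ens_err, avg_err, ambiguity
```

Since the ambiguity term is never negative, the ensemble error returned by the function is never larger than the weighted average of the individual errors, and the gap grows as the classifiers disagree more.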

III.2 Ensemble clustering

As classification is related to clustering, it is reasonable that these ensemble methods are useful in clustering as well. In ensemble clustering, the problem is often to combine an ensemble of clusterings generated by e.g. some re-sampling method (bootstrap) Dudoit and Fridlyand (2003). The combination should return the average or aggregated properties of the clusterings found in the ensemble. A method for finding ensemble clusterings, called Instance-based Ensemble Clustering (IBEC), is proposed in Ref. Strehl and Ghosh (2003). Other important examples of ensemble clustering methods are found in Ref. Fern and Brodley (2004), but are not used in this paper.

Definition 2 (IBEC)

Given an ensemble of clusterings, $\{C^{(1)}, \ldots, C^{(R)}\}$, IBEC constructs a fully connected (complete) graph, $G_F = (V, F)$, where $V$ is a set of nodes and $F$ is a frequency matrix with $F_{ij}$ as the frequency of instances in which nodes $i$ and $j$ are placed in the same cluster.

The IBEC method aggregates the clusterings by constructing a graph where each node represents a data point and each edge indicates that the two connected nodes have been clustered together. The frequency with which the two nodes have been clustered together acts as a weight or similarity for the resulting edge, see Definition 2 for details. The nodes are finally partitioned into clusters using agglomerative hierarchical clustering with some linkage rule, or by a graph partitioning method such as the Kernighan-Lin algorithm Newman (2010).
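The frequency matrix at the heart of IBEC can be sketched as follows. This minimal Python example assumes that each clustering in the ensemble is given as a dictionary mapping data points to cluster labels and that all clusterings cover the same points; the function name `coassociation` and the choice to store only the upper triangle in a dictionary are illustrative.

```python
def coassociation(partitions):
    """Pairwise frequency with which two points share a cluster.

    partitions: non-empty list of dicts, each mapping point -> cluster label.
    Returns a dict mapping (u, v) with u < v to a frequency in [0, 1].
    """
    points = sorted(partitions[0])
    n_runs = len(partitions)
    freq = {}
    for a, u in enumerate(points):
        for v in points[a + 1:]:
            together = sum(1 for p in partitions if p[u] == p[v])
            freq[(u, v)] = together / n_runs
    return freq
```

Note that only the agreement of labels within each clustering is used, so the clusterings do not need to share a common labelling.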

III.3 Node-based Fusion of Communities

In this paper, we propose a generalization of IBEC for network data and for fusing different community structures (subgraphs) into a final representation. This final community structure should indicate the most probable structure, as it is the aggregated information from many candidates. Node-based Fusion of Communities (NFC) is similar to the previously discussed IBEC but uses a special linkage rule to account for the special nature of network data, i.e. nodes cannot be placed in the same community if they are not connected by a sufficiently short path.

The NFC-method is outlined in Figure 3. Firstly, a complete graph, $G_F = (V, F)$, is constructed using the data from the candidate communities, which are the output of some community detection algorithm(s). The set of nodes, $V$, is the original set of nodes in the network, and the edges now indicate that two nodes have been found in the same community. The element $F_{ij}$ in row $i$ and column $j$ of the matrix $F$ is the frequency of the event that nodes $i$ and $j$ have been found in the same candidate community.

This new graph is clustered using agglomerative hierarchical clustering with a special linkage rule. This is necessary because, after each merge, the frequency with which the other nodes have been placed in the same community as the meta-cluster (i.e. a cluster of merged nodes) must be recalculated.

Figure 3: Node-based Fusion using agglomerative hierarchical clustering.

The frequency between the merged nodes (cluster), $A$, and another node or cluster, $B$, denoted $F_{AB}$, is found by

$$F_{AB} = \sum_{r=1}^{R} \prod_{i, j \in A \cup B} \delta(M_{ir}, M_{jr}), \tag{7}$$

where $M$ is the membership matrix, in which $M_{ir}$ is the community of which node $i$ is a member in candidate network $r$, and $R$ is the number of candidate networks. That is, $F_{AB}$ is the number of occurrences where all nodes (in both clusters) are in the same candidate cluster. Using this linkage rule incurs some information loss, as information about individual nodes is lost in the meta-cluster.

The result of the hierarchical clustering algorithm is a dendrogram and a list of merges. The clustering corresponding to the maximum modularity is taken as the communities found in the merged candidate networks. The complexity of NFC is therefore determined by that of the hierarchical clustering algorithm.
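The fusion procedure described in this subsection can be sketched in Python as below. This is an illustrative reading of the description rather than the authors' implementation: each cluster carries a signature with one entry per candidate, holding the shared community label when all of its nodes agree in that candidate and a `MIXED` marker otherwise, so that the linkage between two clusters is the number of candidates in which all their nodes share one community. The names `nfc`, `MIXED`, and `modularity`, the exhaustive pair search, and the choice to stop merging once no pair of clusters has ever been placed together are all assumptions of this sketch.

```python
from collections import defaultdict
from itertools import combinations

MIXED = object()  # marks a candidate in which a cluster's nodes disagree


def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal, degree_sum = defaultdict(int), defaultdict(int)
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


def nfc(nodes, edges, candidates):
    """Fuse candidate partitions (list of dicts node -> label) into one.

    Returns (best partition, its modularity, list of performed merges).
    """
    clusters = {i: [v] for i, v in enumerate(nodes)}
    sig = {i: tuple(c[v] for c in candidates) for i, v in enumerate(nodes)}

    def freq(a, b):
        # candidates in which every node of both clusters shares a community
        return sum(1 for x, y in zip(sig[a], sig[b])
                   if x is not MIXED and x == y)

    def partition():
        return {v: cid for cid, members in clusters.items() for v in members}

    best = partition()
    best_q = modularity(edges, best)
    merges = []
    while len(clusters) > 1:
        a, b = max(combinations(sorted(clusters), 2),
                   key=lambda pair: freq(*pair))
        if freq(a, b) == 0:
            break  # assumption: never merge clusters that were never together
        merges.append((tuple(clusters[a]), tuple(clusters[b])))
        sig[a] = tuple(x if (x is not MIXED and x == y) else MIXED
                       for x, y in zip(sig[a], sig[b]))
        del sig[b]
        clusters[a].extend(clusters.pop(b))
        q = modularity(edges, partition())
        if q > best_q:  # keep the level of the dendrogram with highest Q
            best_q, best = q, partition()
    return best, best_q, merges
```

The list of merges returned by the function plays the role of the dendrogram, and the position at which a node first appears in it indicates how early the node was merged into its community.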

III.4 Estimating certainty in community structures

By using the output from the NFC-method, it is possible to estimate the certainty of the hypothesis that a node belongs to a certain community. This is especially important for nodes lying in the borderland between communities, for which it may be equally likely that the node belongs to a neighboring community. It is also important for structures similar to chains and trees in the network. Nodes in such structures are naturally quite sensitive to uncertain edges because they have few neighbors. Nodes having an uncertain community membership can be found using the candidate communities. If a node is found quite often in two different communities, the confidence that it has been classified correctly is low, as it is very sensitive to the network structure.

For the node-based method, the nodes are merged in a hierarchical manner: the first two nodes are the most similar, and for each merge the nodes get more dissimilar. If a node was merged early into a community, it is less likely that it would belong to another community. Therefore a qualitative measure of the certainty that a node belongs to a community is found as a decreasing function of the number of merges needed before the node is added to the community. A larger value of this score indicates an early merge and therefore a more certain merge.

IV Simulation experiments

This section contains details regarding the simulation experiments conducted using the proposed method for merging community structures, the NFC method. We propose to use this method in combination with the stochastic LP-algorithm and call this combination Label Propagation-Node-based Fusion of Communities (LP-NFC). This section discusses some preliminaries, including the generation of synthetic networks and performance measures. Comparative studies of the proposed LP-NFC algorithm with the SP and GA-algorithms are presented and the community detection methods are evaluated by performance and computational complexity.

IV.1 Synthetic networks with community structure

The proposed method is demonstrated using synthetic networks, i.e. random networks with a community structure. The synthetic network model used in this paper is adopted from Refs. Lancichinetti and Fortunato (2009); Lancichinetti et al. (2008). The authors have constructed algorithms to generate artificial networks with community structures, which have become a standard benchmark for community detection using synthetic networks. The networks are generated using the input parameters shown in Table 1, together with the values used in the following experiments. These parameters allow for the generation of families of networks with desired properties.

Variable             Value   Description
$n$                  -       number of nodes in the network
$\langle k \rangle$  15      mean degree of each node
$k_{\max}$           50      maximum degree
$\mu$                -       mixing parameter
$c_{\min}$           20      minimum size of a community
$c_{\max}$           50      maximum size of a community
$\tau_2$             1       exponent of community size distribution (typically $1 \le \tau_2 \le 2$ in real-world networks)
$\tau_1$             2       exponent of degree distribution (typically $2 \le \tau_1 \le 3$ in real-world networks)

Table 1: The parameters used for generating synthetic networks in the simulation studies using the algorithm from Ref. Lancichinetti and Fortunato (2009). These parameters create networks similar to Newman-Girvan benchmarks. The mixing parameter, $\mu$, and the number of nodes, $n$, are varied during the experiments. Newman and Girvan (2004)

The mixing parameter, $\mu$, is the fraction of edges between the different communities and $1 - \mu$ is the fraction of intra-community edges. A small mixing parameter corresponds to well-separated communities; the extreme case is $\mu = 0$, when only disjoint communities exist. As $\mu$ increases, the communities become more difficult to detect until, for sufficiently large $\mu$, no communities exist in the network according to the adopted definition of a community in Definition 1.
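For a given network and partition, the mixing parameter as described here can be measured directly. The following Python sketch assumes an undirected graph given as a list of edges and a dictionary mapping nodes to communities, and computes the global fraction of edges that run between communities; the function name is illustrative.

```python
def mixing_parameter(edges, community):
    """Fraction of edges whose end points lie in different communities.

    edges: non-empty list of (u, v) pairs of an undirected graph.
    community: dict mapping every node to a community label.
    """
    edges = list(edges)
    between = sum(1 for u, v in edges if community[u] != community[v])
    return between / len(edges)
```

Such a check can be useful for verifying that a generated synthetic network actually has a mixing close to the value requested from the generator.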

The algorithm to generate the synthetic networks consists of five different steps. A simplified version (see Ref. Lancichinetti and Fortunato (2009) for the full version) is as follows:

  1. generate the degree of each node by sampling from a power-law distribution whose exponent is the degree-distribution exponent in Table 1, such that the mean and maximum degree are satisfied,

  2. generate the size of each community from a power-law distribution whose exponent is the community-size exponent in Table 1, such that all nodes are members of a community and the community sizes are consistent with the minimum and maximum community size,

  3. using the configuration model, assign edges between all nodes such that the degrees of all nodes are satisfied,

  4. randomly distribute the nodes to the communities in the network,

  5. rewire the edges between nodes until the mixing parameter is satisfied.
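The two sampling steps of the generation procedure (node degrees and community sizes) can be sketched as follows. This is a simplified illustration and not the benchmark implementation from the cited reference: it draws integers from a truncated discrete power law and uses plain rejection sampling to make the community sizes sum to the number of nodes. The function names, the minimum degree of 5, the network size of 200 nodes and the seed are assumptions made only for this example; the remaining values are taken from Table 1.

```python
import random


def sample_power_law(exponent, low, high, rng):
    """Draw one integer k in [low, high] with probability proportional to k**(-exponent)."""
    values = list(range(low, high + 1))
    weights = [k ** (-exponent) for k in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_community_sizes(n_nodes, exponent, min_size, max_size, rng, max_attempts=1000):
    """Draw community sizes in [min_size, max_size] that sum exactly to n_nodes.

    Assumption: simple rejection sampling; an attempt is discarded when the
    nodes left over are too few to form one more community.
    """
    for _ in range(max_attempts):
        sizes, remaining = [], n_nodes
        while remaining >= min_size:
            size = sample_power_law(exponent, min_size, min(max_size, remaining), rng)
            sizes.append(size)
            remaining -= size
        if remaining == 0:
            return sizes
    raise RuntimeError("no valid set of community sizes found")


rng = random.Random(1)  # fixed seed (assumption) for reproducibility
n_nodes = 200           # hypothetical network size

# Degrees: exponent 2 and maximum degree 50 as in Table 1; minimum degree 5 is assumed.
degrees = [sample_power_law(2, 5, 50, rng) for _ in range(n_nodes)]

# Community sizes: exponent 1, sizes between 20 and 50 as in Table 1.
sizes = sample_community_sizes(n_nodes, 1, 20, 50, rng)

print(sum(sizes) == n_nodes, min(sizes) >= 20, max(sizes) <= 50)
```

The sketch does not tune the minimum degree to reach a prescribed mean degree, and it omits the configuration model and the rewiring steps.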

The drawback of this algorithm is the lack of the triangles observed in real-world social networks, which results in a sparser network than empirically found networks. The advantage is that synthetic networks enable the study of how the mixing parameter is correlated with the effectiveness in finding communities in uncertain networks. The algorithm has linear complexity and can therefore be used to simulate large networks with community structures that are consistent with real-world social networks. Lancichinetti and Fortunato (2009)

iv.2 Comparing community structures

Supervised measures are used to evaluate the performance of the community detection methods used in the simulation experiments. This is possible because the correct community structure is returned by the synthetic networks algorithm, so external information is available for supervised methods. Traditionally, supervised methods have included precision and recall, which are commonly used in classification and clustering to evaluate methods and algorithms. The drawback of these traditional methods is that it is difficult to find the matching pair of labels from the obtained solution and the externally provided labels. For example, a single community detected by the community detection algorithms may correspond to two different labels in the external information. The problem is then to select which external label to match with this obtained community. Previously, the corresponding label has been taken as the set with the largest overlap, which yields only approximate performance measures.

In this paper, the Normalized Mutual Information (NMI) is instead used to measure the performance of the different community detection algorithms. An additional method using correlation is proposed, as it is a simpler measure and, as discussed later, yields results similar to the former measure.

iv.2.1 Normalized Mutual Information

The NMI-measure originates in information theory and can be interpreted as how much is known about the external labeling given the obtained solution and vice versa. We follow Ref. Lancichinetti and Fortunato (2009) to define the NMI-measure.

Let denote the NMI-measure where is the obtained community structure and is the external labeling. Assume that where is the label of node and the same for with the obtained community membership of node . Further assume that and are the realizations of some random variables, and , with some (joint) probability distributions as

(8)
(9)
(10)

where is the number of elements in which equals the label with the corresponding for , and is the total number of nodes, . The mutual information, , is defined by

(11)

where the sums are taken over all assumed values of and , and is the logarithm (with base ). The NMI-measure, , between the obtained community structure, , and the externally given labels, , is

(12)

which equals zero if the community structures are independent and unity if they are equivalent. The entropy, , of a random variable is defined as

(13)
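The quantities in this subsection can be illustrated with a short sketch. It assumes the standard forms of these definitions: probabilities estimated as relative label frequencies, the mutual information as the sum of P(x,y) log[P(x,y) / (P(x)P(y))], the entropy as the negative sum of P(x) log P(x), and the normalization 2I / (H(X) + H(Y)). The function names are hypothetical, and the natural logarithm is used because the ratio does not depend on the base.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Entropy of a labeling, with P(x) estimated as n_x / n."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(x, y):
    """Mutual information of two labelings of the same nodes."""
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    # P(x,y) / (P(x) P(y)) simplifies to n_xy * n / (n_x * n_y).
    return sum(
        (c / n) * log(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


def nmi(x, y):
    """Normalized mutual information, assuming the form 2 I / (H(X) + H(Y))."""
    if len(x) != len(y):
        raise ValueError("both labelings must cover the same nodes")
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        # Assumption: two single-community labelings are treated as identical.
        return 1.0
    return 2 * mutual_information(x, y) / (h_x + h_y)


external = [0, 0, 1, 1]
print(nmi(external, [1, 1, 0, 0]))  # same structure, different label names
print(nmi(external, [0, 1, 0, 1]))  # statistically independent structure
```

The first call illustrates that the measure is unaffected by how the communities are named, so no matching of labels is required.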

iv.2.2 Correlation

The correlation is used to calculate a measure of how the rows in the matrices tend to be similar Jain and Dubes (1988). This type of measure has previously been used in comparing clustering and classification methods. Let denote the neighborhood matrix where if nodes and are found in the same cluster and otherwise. The mean correlation , between the two matrices, and , is found as the mean of the Pearson correlations, , for each row

(14)

where denotes the variance, using the covariance between each element in each row and the expected value (mean) of that specific row

(15)

where denotes the expected value. Let be the ideal neighborhood matrix, , with when (equal external labels) but zero otherwise and . Letting denote the neighborhood matrix from the obtained solution then results in another supervised measure for comparing the performance of community detection methods. As with the NMI, this measure equals unity when a perfect match is found and zero when no matches are found.
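The correlation measure can be sketched in a few lines. The example assumes that the neighborhood matrix has a one wherever two nodes share a community (including the diagonal) and that the population variance and covariance are used. The function names are hypothetical, and returning zero for a row without variance is an assumption of this sketch, since the Pearson correlation is undefined in that case.

```python
from math import sqrt


def neighborhood_matrix(labels):
    """Entry (i, j) is 1 if nodes i and j share a community, otherwise 0."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


def pearson(u, v):
    """Pearson correlation of two equally long rows."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / n
    var_u = sum((a - mean_u) ** 2 for a in u) / n
    var_v = sum((b - mean_v) ** 2 for b in v) / n
    if var_u == 0 or var_v == 0:
        # Assumption: a constant row (all nodes in one community) counts as zero.
        return 0.0
    return cov / sqrt(var_u * var_v)


def mean_row_correlation(obtained, external):
    """Mean of the row-wise Pearson correlations of the two neighborhood matrices."""
    a = neighborhood_matrix(obtained)
    b = neighborhood_matrix(external)
    return sum(pearson(row_a, row_b) for row_a, row_b in zip(a, b)) / len(a)


external = [0, 0, 1, 1]
print(mean_row_correlation([1, 1, 0, 0], external))  # same structure, renamed labels
print(mean_row_correlation([0, 1, 0, 1], external))  # unrelated structure
```

Because the neighborhood matrix only records whether two nodes share a community, the label names themselves never enter the calculation.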

Both the NMI and correlation measures have the advantage over methods like precision and recall that no identification or linkage of labels is needed. This removes the need for approximate methods, such as largest overlap, to identify which external label best matches the labels obtained from the community detection method.

iv.3 Convergence properties

A first important question is how many runs of the LP algorithm need to be merged by the NFC method to obtain stable solutions. This question is answered in this section, before any performance comparisons are made.

In Figure 4, we present the NMI measures for several different runs of the proposed LP-NFC method. In each run, the number of nodes is varied between and nodes, the mixing parameter is varied between and , and finally the number of merged runs is varied between and .

As the number of samplings increases, some of the curves shift rightwards, which indicates better performance in finding the correct structures in networks with more diffuse community structures. Recall that a higher mixing parameter indicates a more diffuse community structure, which is more difficult to detect. The largest movement is found in the curve corresponding to nodes. The conclusion is that more samplings are needed in networks with more nodes than in networks with fewer nodes, as the NMI for the smaller networks is more or less constant with respect to the number of samplings.

This corresponds to what is known from standard Monte Carlo methods: the statistical error can be decreased by increasing the number of samples. This is only possible up to a certain level, after which the systematic errors dominate the statistical error. We conclude that samplings are a good choice, given this result as well as the required computational time.

Figure 4: The NMI for the LP-NFC method with three different numbers of samplings used on synthetic networks with community structures. The graphs are found as an average of runs on networks with nodes and mixing parameter, .

iv.4 Comparisons with GA and SP

We continue by comparing the proposed method, which merges (aggregates) a number of runs of the LP-algorithm, with the commonly used SP and GA-algorithms. The algorithms have been evaluated using the previously discussed synthetic networks with varying number of nodes and mixing parameter. The obtained community structure is compared with the labeling output by the synthetic network algorithm using the previously discussed NMI and correlation measures.

The most important feature in the following figures is the shift from high values of NMI and correlation to lower values. The critical value of the mixing parameter clearly depends on the size of the network and differs between the compared algorithms.

In Figure 5, the methods are compared using the mean NMI from runs at each value of the number of nodes and mixing parameter. The profiles of the LP-NFC and GA-algorithms are quite similar in appearance compared with the SP-algorithm. The latter is previously known to perform worse on synthetic networks than on real-world networks. This is clearly visible in that its NMI values quickly fall off in comparison with the other two algorithms, which seem to have more or less constant NMI until the critical mixing parameter.

Figure 5: The NMI for three different community detection methods used on synthetic networks with community structures. The graphs are found as an average of runs on networks with nodes and mixing parameter, .

Another feature worth noting is the tail behavior for the LP-NFC-algorithm at high mixing parameters. The NMI value for the other two algorithms quickly drops to zero after the shift in performance, but the LP-NFC algorithm continues to have non-zero NMI. This is particularly visible for the smaller networks with , , and nodes and is probably some artifact from the stochastic nature of the LP-algorithm.

Comparing the LP-NFC and GA-algorithms, we conclude that the performance of these two methods is similar and that both are superior to the SP-algorithm. We continue with another comparison, using the correlation and the computational complexity, to find the preferred method.

Figure 6: The correlation for three different community detection methods used on synthetic networks with community structures. The graphs are found as an average of runs on networks with nodes and mixing parameter, .

In Figure 6, the methods are compared using the mean correlation from runs at each possible number of nodes and mixing parameter. This figure is quite similar to the previous one, but with some differences for the LP-NFC algorithm, where the mean correlation drops to zero (as for the other two algorithms). The artifacts at high mixing parameters have therefore vanished and seem to be related to the use of the NMI-measure.

Most of the analysis from the NMI-measure remains valid with the correlation measure as well. It is perhaps even more apparent that the GA-algorithm is less sensitive to the number of nodes than the LP-NFC-algorithm. This is seen in the densely packed correlation curves of the GA-algorithm, which are not visible for the LP-NFC-algorithm.

The SP-algorithm continues to under-perform in comparison with the two other methods. As previously discussed, this is probably the result of using the synthetic networks, as the method performs well for real-world networks. These two types of networks differ in some important aspects, for example the number of triangles, which could explain the poor performance of the SP-algorithm.

Figure 7: The NMI of the modularity-weighted LP-NFC method. The graph is found as an average of runs on networks with nodes and mixing parameter, .

Concluding this analysis, we suggest an improvement to the LP-NFC-algorithm by weighting each frequency by the normalized modularity. For each candidate community structure, the modularity is calculated and normalized by the maximal value of the modularity. This generates a set of weights, , for each candidate community structure and the elements in the frequency matrix are weighted by . This gives more weight to nodes that have been placed in the same community in structures with higher modularity than in situations with lower modularity. As the modularity indicates the quality of the community structure found, this could generate better results.

This modification has been evaluated in the same manner as the previous version and the results from evaluation by the NMI and correlation measures are shown in Figure 7.
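The modularity weighting described above can be sketched as follows. The example assumes that the frequency matrix counts, for every pair of nodes, how often the two nodes fall in the same community across the candidate structures, and that each candidate contributes its modularity divided by the largest modularity in the ensemble. The function name and the modularity values are hypothetical; in practice the modularities would be computed from the network for each candidate structure.

```python
def weighted_frequency_matrix(partitions, modularities):
    """Co-membership frequencies, each candidate weighted by Q / max(Q)."""
    if not partitions or len(partitions) != len(modularities):
        raise ValueError("need one modularity value per candidate structure")
    q_max = max(modularities)
    if q_max <= 0:
        # Assumption: the normalization only makes sense for a positive maximum.
        raise ValueError("at least one candidate needs positive modularity")
    weights = [q / q_max for q in modularities]

    n = len(partitions[0])
    freq = [[0.0] * n for _ in range(n)]
    for labels, w in zip(partitions, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    freq[i][j] += w
    return freq, weights


# Two candidate structures on four nodes with hypothetical modularities.
partitions = [[0, 0, 1, 1], [0, 1, 1, 1]]
modularities = [0.4, 0.2]

freq, weights = weighted_frequency_matrix(partitions, modularities)
print(weights)     # [1.0, 0.5]
print(freq[0][1])  # 1.0, together only in the higher-modularity structure
print(freq[2][3])  # 1.5, together in both structures
print(freq[1][2])  # 0.5, together only in the lower-modularity structure
```

Setting all weights to one recovers plain co-membership counts, so the weighting only changes how strongly each candidate structure contributes.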

iv.5 Time complexity

Important aspects of community detection algorithms are performance and computational complexity. A good algorithm should have a low computational complexity, so that it scales to larger networks. Performance and accuracy are also desirable properties of community detection methods. LP has low complexity but also low performance; it is therefore interesting to determine the complexity of LP-NFC, especially as the LP-NFC method has performed well in comparison with the SP and GA-algorithms.

It is previously known that the GA-algorithm has a rather low computational complexity, , for most networks. This is especially true with the modifications described in Ref. Wakita and Tsurumi (2007). The simulation experiments in this paper are done in the software R using the implementations offered by the igraph-package, which are not optimized for low computational complexity. The following comparison is therefore just preliminary and a better implementation of LP-NFC is needed for making better comparisons.

The LP-NFC-algorithm is based on two different steps: the first is runs of the LP-algorithm, which are then merged using agglomerative hierarchical clustering. Using the linkage rule considered in this paper, the complexity of the latter algorithm is or . The total complexity of the method is then, , where is the number of edges with and is some suitable number of merged runs e.g. . This gives a theoretical computational complexity of approximately .

The running time of the LP-NFC and GA-algorithms is shown in Figure 8 for different numbers of nodes, , and mixing parameters, . The LP-NFC-algorithm has a rather high complexity in its current implementation, as previously discussed. It is approximately , which is higher than the theoretical value. The GA-algorithm has about linear computational complexity, as previously discussed by Ref. Wakita and Tsurumi (2007). Other interesting aspects are that the LP-NFC-algorithm is a lot faster for smaller networks (it has a smaller constant term than the GA-algorithm) and the impact of the mixing parameter. In the GA-algorithm, the mixing parameter has a rather high influence on the running time of the community detection method. This effect is not visible for the method proposed in this paper.

Figure 8: The running time of the LP-NFC and GA-algorithms. The graph is found as an average of runs on networks with nodes and mixing parameter, . The dotted lines are reference curves for for the LP-NFC algorithm and for the GA-algorithm.

The SP-algorithm is neglected in this comparison because of implementation differences that make this algorithm a lot faster than the other two. This makes the comparison difficult, but the theoretical complexity of the SP-algorithm is approximately for sparse networks where .

We conclude that the computational complexity of the current implementation of LP-NFC can be improved, but the theoretical limit is still higher than for the GA-algorithm. The advantage of the LP-NFC-algorithm is that the running time is not an increasing function of the mixing parameter and that it is faster for networks smaller than nodes.

V Concluding remarks

In this paper, we have presented a method for combining community structures detected in networks, named Node-based Fusion of Communities (NFC). This method has applications including combining several different community detection methods, enhancing the performance of stochastic methods, and merging communities detected at different scales. The method has been used in combination with the Label Propagation (LP) algorithm and evaluated using simulation studies with synthetic networks.

Acknowledgements.
JD would especially like to thank Sang Hoon Lee, Jari Saramäki, Petter Holme, and Martin Rosvall for helpful discussions, comments, and suggestions during the work underlying this paper. This paper is part of the project Tools for information management and analysis, which is funded by the R&D programme of the Swedish Armed Forces.

References

  • Fortunato (2010) S. Fortunato, Physics Reports, 486, 75 (2010), ISSN 03701573.
  • Raghavan et al. (2007) U. N. Raghavan, R. Albert,  and S. Kumara, Physical Review E, 76, 036106+ (2007).
  • Rosvall and Bergstrom (2010) M. Rosvall and C. T. Bergstrom, PLoS ONE, 5, e8694+ (2010), ISSN 1932-6203.
  • Strehl and Ghosh (2003) A. Strehl and J. Ghosh, Journal of Machine Learning Research, 3, 583 (2003), ISSN 1532-4435.
  • Monti et al. (2003) S. Monti, P. Tamayo, J. Mesirov,  and T. Golub, Machine Learning, 52, 91 (2003), ISSN 08856125.
  • Fern and Brodley (2004) X. Z. Fern and C. E. Brodley, in ICML ’04: Proceedings of the twenty-first international conference on Machine learning (ACM, New York, NY, USA, 2004) pp. 36+, ISBN 1-58113-828-5.
  • Clauset et al. (2004) A. Clauset, M. E. J. Newman,  and C. Moore, Physical Review E, 70, 066111+ (2004).
  • Reichardt and Bornholdt (2006) J. Reichardt and S. Bornholdt, Phys Rev E Stat Nonlin Soft Matter Phys, 74 (2006), ISSN 1539-3755, doi:10.1103/PhysRevE.74.016110.
  • Bodin and Prell (2011) Ö. Bodin and C. Prell, eds., Social networks and natural resource management : uncovering the social fabric of environmental governance (Cambridge University Press, 2011).
  • Porter et al. (2009) M. A. Porter, J.-P. Onnela, and P. J. Mucha (2009), arXiv:0902.3788.
  • Radicchi et al. (2004) F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, and D. Parisi (2004), arXiv:cond-mat/0309488.
  • Zachary (1977) W. W. Zachary, Journal of Anthropological Research, 33, 452 (1977).
  • Newman (2006) M. E. J. Newman, Proceedings of the National Academy of Sciences, 103, 8577 (2006), ISSN 0027-8424, arXiv:physics/0602124 .
  • (14) The elements of the adjacency matrix, , are equal to if an edge exist between nodes and , and otherwise.
  • (15) The Kronecker delta function, , takes the value if the elements are equal and otherwise.
  • Brandes et al. (2006) U. Brandes, D. Delling, M. Gaertler, R. Goerke, M. Hoefer, Z. Nikoloski, and D. Wagner (2006), arXiv:physics/0608255.
  • Good et al. (2010) B. H. Good, Y. A. de Montjoye,  and A. Clauset, Physical Review E, 81, 046106+ (2010), arXiv:0910.0165 .
  • Weiss and Jacobson (1955) R. S. Weiss and E. Jacobson, American Sociological Review, 20 (1955).
  • Rice (1927) S. A. Rice, The American Political Science Review, 21 (1927).
  • (20) In the following simulation experiments, this ratio is set as unity and therefore only communities with sizes larger than where is the number of edges can be detected. Reichardt and Bornholdt (2006).
  • Kirkpatrick et al. (1983) S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Science, 220, 671 (1983).
  • Hansen and Salamon (1990) L. K. Hansen and P. Salamon, IEEE Transactions on Pattern Analysis and Machine Intelligence, 12, 993 (1990), ISSN 0162-8828.
  • Sollich and Krogh (1996) P. Sollich and A. Krogh, Advances in Neural Information Processing Systems, 8, 190 (1996).
  • Dudoit and Fridlyand (2003) S. Dudoit and J. Fridlyand, Bioinformatics (Oxford, England), 19, 1090 (2003), ISSN 1367-4803.
  • Newman (2010) M. Newman, Networks: An Introduction, 1st ed. (Oxford University Press, USA, 2010) ISBN 0199206651.
  • Lancichinetti and Fortunato (2009) A. Lancichinetti and S. Fortunato, Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 80, 016118+ (2009a).
  • Lancichinetti et al. (2008) A. Lancichinetti, S. Fortunato,  and F. Radicchi, Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 78 (2008).
  • Newman and Girvan (2004) M. E. J. Newman and M. Girvan, Physical Review E, 69, 026113+ (2004).
  • Lancichinetti and Fortunato (2009) A. Lancichinetti and S. Fortunato, Physical Review E, 80, 056117+ (2009b).
  • Jain and Dubes (1988) A. K. Jain and R. C. Dubes, Algorithms for clustering data (Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1988) ISBN 0-13-022278-X.
  • Wakita and Tsurumi (2007) K. Wakita and T. Tsurumi, “Finding Community Structure in Mega-scale Social Networks,”  (2007), arXiv:cs/0702048 .