GANC: Greedy Agglomerative Normalized Cut

05/05/2011 ∙ by Seyed Salim Tabatabaei, et al. ∙ McGill University

This paper describes a graph clustering algorithm that aims to minimize the normalized cut criterion and has a model order selection procedure. The performance of the proposed algorithm is comparable to spectral approaches in terms of minimizing normalized cut. However, unlike spectral approaches, the proposed algorithm scales to graphs with millions of nodes and edges. The algorithm consists of three components that are processed sequentially: a greedy agglomerative hierarchical clustering procedure, model order selection, and a local refinement. For a graph of n nodes and O(n) edges, the computational complexity of the algorithm is O(n log^2 n), a major improvement over the O(n^3) complexity of spectral methods. Experiments are performed on real and synthetic networks to demonstrate the scalability of the proposed approach, the effectiveness of the model order selection procedure, and the performance of the proposed algorithm in terms of minimizing the normalized cut metric.

1 Introduction

Clustering or partitioning of nodes in networks or graphs is an important task that has many applications in a diverse range of fields. It has been used for many years to study social networks (wasserman94) and continues to be employed in the field of sociology to explore social interactions. More recently it has been employed in the study of biochemical networks (guimera05; loewenstein08), biological neural networks (chen08; crum08), and transport and communication networks.

There are three components to the clustering task: (i) choosing the number of clusters; (ii) selecting a criterion that measures the merit of each candidate cluster allocation; and (iii) identifying an algorithm that searches for the optimal clustering. Some performance criteria can be used both to select the number of clusters and choose the clustering, e.g., modularity (girvan02, ) or information-theoretic criteria based on the minimum description length (rosvall08, ).

There is no universally-accepted performance criterion and indeed the most appropriate criterion can vary depending on the application domain and the goal of the clustering. Modularity, for example, focuses on network structure and places primary value on direct connections between nodes. On the other hand, clustering algorithms based on Markov random walks (rosvall08, ; lafon06, ) also value indirect connections and network flow. In this paper we select the normalized cut criterion (shi00, ), which simultaneously encourages intra-cluster similarity while penalizing inter-cluster similarities. Methods based on this criterion have been employed successfully in a wide range of applications (crum08, ; lafon06, ; shi00, ; fowlkes04, ; dhillon07, ; yu10, ). The normalized cut criterion is related to the conductance of the underlying graph (sinclair89, ). Furthermore an implicit duality between normalized cut and normalized association exists; the former encourages clusters that are less connected to each other and the latter encourages clusters whose nodes are well-connected.

When partitioning a graph into clusters, the cut associated with a given cluster is the sum of the weights of the edges between nodes in that cluster and nodes in other clusters. Intuitively, minimizing the maximum cut (or minimizing the average cut) identifies clusters that have the weakest ties between them. The problem is that the minimum cut will often be achieved by identifying many very small clusters, which provides little insight into the underlying structure of the graph. The normalized cut metric was introduced by Shi and Malik in (shi00) to address this shortcoming. It normalizes each cut by dividing it by the total weight of the associated cluster. This has the effect of penalizing very small clusters because they generally have low total weight.

Minimizing normalized cut is an NP-complete problem (shi00). Shi and Malik illustrated that the minimization could be relaxed to form a generalized eigenvalue system, whose (discretized) solution corresponds to the minimum normalized cut. This has led to the development of spectral clustering, which involves calculating eigenvectors of the identified system. In order to identify a partitioning of the n nodes into k clusters, some techniques perform recursive bipartitioning (shi00), and thus require the repeated identification of two eigenvectors. Other approaches strive to identify the k clusters directly by calculating the eigenvectors associated with the smallest eigenvalues of the underlying graph Laplacian. One of the concerns about these methods is computational complexity; even when employing fast eigendecomposition methods such as the Lanczos iterative technique, complexity grows rapidly with n.
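For reference, the relaxed bipartitioning problem of Shi and Malik takes the form of a generalized eigenvalue system involving the weight matrix W and the diagonal degree matrix D,

(D - W)\, y = \lambda D\, y,

where the eigenvector y associated with the second-smallest generalized eigenvalue provides the real-valued cluster indicator that is subsequently discretized.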

In this paper, we propose an agglomerative clustering algorithm that strives to minimize the normalized cut (or equivalently, maximize the normalized association). It is a fast, scalable algorithm, with almost linear complexity in the number of nodes for relatively sparse graphs. Performance evaluation using a range of benchmark and observed graphs indicates that the algorithm identifies partitionings whose average normalized association metrics are as large as those of the partitions identified by spectral clustering techniques. We also propose a method for identifying important partitioning scales, which can be used to automatically select the number of clusters. To the best of our knowledge, this is the first scalable model order selection method applicable to normalized association maximization. Through our experiments, we show the effectiveness of our model order selection criterion for both synthetic and real networks.

The rest of the paper is organized as follows. In Section 2 the major approaches for minimization of normalized cut and model order selection are reviewed. The problem formulation of normalized association is re-stated in Section 3. Our proposed algorithm is detailed in Section 4. The proposed algorithm is then compared to the state-of-the-art clustering algorithms in Section 5. The concluding remarks are provided in Section 6.

2 Related Work

The identification of clusters in graphs and networks has received significant attention. In our review we focus on a representative set of algorithms that minimize the normalized cut with or without spectral decomposition. We also review the existing methods for selecting the number of clusters.

2.1 Optimization of Normalized Cut

Spectral partitioning of graphs was first proposed by Donath and Hoffman in the 1970's (donath72). Interest in the techniques was renewed in the 1990's when Pothen et al. described an algorithm for bi-partitioning using the Fiedler vector (pothen90). Hendrickson et al. and Karypis et al. contributed multilevel algorithms for more efficient spectral partitioning (Hendrickson95; Karypis99).

The normalized cut metric was introduced by Shi and Malik in (shi00). They demonstrated how the bipartitioning task, with the objective of minimizing the normalized cut, could be relaxed to construct a generalized eigenvalue problem, and was thus related to spectral partitioning. The eigenvector corresponding to the second-smallest eigenvalue of the graph Laplacian identifies the bipartitioning (the real values in the eigenvector must be mapped to two discrete values for partitioning). This established the connection between minimization of normalized cut and spectral partitioning. Shi and Malik proposed a recursive bipartitioning scheme in order to partition a graph into k clusters. In (meila01), Meila and Shi proposed an algorithm that calculates k eigenvectors (thereby associating a real-valued vector with each node in the graph) and then uses a clustering algorithm, such as k-means, to perform the partitioning in the resulting k-dimensional embedding. Ng et al. observed in (ng02) that the algorithm in (meila01) is susceptible to failure when there is substantial variation in the degree of connectivity between clusters. They proposed an alternative algorithm that uses a different normalization in both the eigenvalue problem and the construction of feature vectors prior to the application of k-means.

All of these algorithms involve a computationally expensive eigendecomposition. To address the computational difficulties for large graphs, Fowlkes et al. described a procedure that uses the Nyström method to reduce the complexity of the eigenvalue problem (fowlkes04, ). This does significantly reduce the computational overhead, but it is not enough to make the eigendecomposition algorithms scalable to large graphs. Yan et al. recently described an algorithm for fast approximate spectral clustering (yan09, ), but the focus is not on clustering for graphs (rather it addresses real-valued feature vectors).

Dhillon et al. introduced a much faster algorithm for minimizing normalized cut in (dhillon07); the graph is first greedily coarsened, then the coarsened graph is partitioned using a region growing procedure (Karypis99), and finally weighted kernel k-means clustering is applied to each partition to refine the clustering.

Other methods also exist that strive to optimize related cost functions; for example Sharon et al. proposed a scalable hierarchical clustering using ratio association (the normalization term is the number of nodes in clusters) (sharon00, ). Other examples include the work by Ding et al. (ding01, ), Sharon et al. (sharon01, ), Akselrod-Ballin et al. (akselrodballin06, ), and Corso et al. (corso08, ) that use the sum of internal weights of clusters as the normalization term.

2.2 Model Order Selection

The problem of selecting the number of clusters is almost as old as clustering itself. In the following discussion, k* denotes the number of clusters that a given approach identifies as the true number of clusters.

Numerous authors have used some notion of quality of clustering, usually based on some definition of the inter- and intra-cluster distances, to identify the number of clusters. Let Q(k) denote this notion for a given partitioning into k clusters. Q(k) is then examined for various values of k. Partitionings for different values of k can be obtained from a hierarchical clustering, or alternatively, by several flat partitionings.

If Q(k) is not a monotonic function of k, the approach is usually to take k* as the value of k that maximizes or minimizes Q (depending on the definition of Q). The majority of the methods discussed in (milligan85) and (halkidi01) are of this type; Calinski and Harabasz (calinski74), the C-index (hubert76), Baker and Hubert (baker75), and the cubic clustering criterion (sarle83) are among the most effective ones (milligan85). Tibshirani et al. (tibshirani01estimating) define Q(k) as the gap, that is, the difference between the average pairwise distance of the data points of the clustering at level k and the expected value of the same measure under some reference model. This is similar to modularity (girvan02) (discussed later) in the sense that it compares the clustering results to a reference model. None of these methods are directly applicable to graph clustering algorithms; the calculation of the defined metrics requires pairwise distances, which are not immediately available from a graph representation. Possible distance metrics include the shortest path (edachery99) or the diffusion distance (lafon06); however the shortest path is very sensitive to noise and the calculation of the diffusion distance requires eigendecomposition.

In the case where Q(k) is a monotonic function of k, the only extremal values of Q correspond to trivial values of k. Hence, the values of Q corresponding to two or more choices of k are examined together to quantify the significance of a given level of a hierarchical clustering (gnanadesikan77; krzanowski88; pedersen06). In order to assess the clustering at level k of the hierarchy, Gnanadesikan et al. (gnanadesikan77), Arnold (arnold79), Krzanowski and Lai (krzanowski88), and Pederson and Kulkarni (pedersen06) each base the assessment on the values of Q at level k and at neighbouring levels of the hierarchy, in most cases through ratios of successive values or of successive differences. The latter two approaches are the most similar to what we propose in Section 4 in the sense that they use the preceding and succeeding levels of the hierarchy to obtain the significance of the clustering at level k. However the approaches in (krzanowski88; pedersen06) are potentially susceptible to noise due to the division; small perturbations in the weights could lead to dramatic changes in the selected model order.

Other approaches to selecting the number of clusters are based on the adoption of (semi-)parametric models for the structure of the graph. This allows the application of model selection techniques based on concepts such as the Bayesian Information Criterion (BIC) (schwarz1978) and the Akaike Information Criterion (akaike74). Despite the strong theoretical support for these methods, the adoption of a parametric model for the graph structure is undesirable. The models are often overly restrictive and do not adequately capture the properties of many real-life networks. An example is the requirement in (wolfe70; pelleg00) that the input data are normally distributed (after projection of the graph into a real space). The more general methods based on mixture models do not scale well to very large graphs; even the recent approaches have only been applied to graphs with a few thousand nodes (mariadassou10; latoucherep08; latouche10).

Some heuristics are based on the sizes of the clusters that are merged at different levels of the clustering hierarchy (gdalyahu99; harel01). The authors of (harel01) suggest that when two clusters with large numbers of nodes are merged, a significant amount of detail is lost; hence such an instance is potentially where a hierarchical clustering algorithm should stop. The authors of (gdalyahu99) propose a similar approach. The main drawback of these approaches is that only the granularity of the clusters is taken into account; the number and the weights of the edges are simply ignored.

A well-known and effective method of selecting the number of clusters is to examine the eigenvalues of the Laplacian of the graph that is to be clustered (ng02; azran06; vonLuxburg07). For a graph consisting of isolated connected components, the multiplicity of the eigenvalue zero is equal to the number of clusters. Any other graph with well-separated clusters can be considered as a perturbation of this ideal case. Matrix perturbation theory states that the stability of the eigenvectors of a matrix is proportional to the eigengap (the difference between two successive eigenvalues). Von Luxburg (vonLuxburg07) suggests choosing the number of clusters k so that the eigengap between the k-th and (k+1)-th smallest eigenvalues is large. A large eigengap at the k-th eigenvalue is the case in which spectral algorithms using the Laplacian perform most successfully (ng02). A more robust criterion is proposed by Zelnik-Manor and Perona (ZelnikManor04) that uses the eigenvectors instead. Despite the solid theoretical support behind these approaches, the requirement of eigendecomposition makes them impractical when large graphs are being clustered.
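As a concrete (and deliberately simple) illustration of the eigengap heuristic, the sketch below assumes the Laplacian eigenvalues have already been computed by some external routine and merely locates the largest gap; obtaining the eigenvalues is exactly the expensive step that makes the heuristic impractical for large graphs.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Given Laplacian eigenvalues sorted in increasing order (lambda[0] = lambda_1),
// return the k maximizing the eigengap lambda_{k+1} - lambda_k.
std::size_t eigengap_model_order(const std::vector<double>& lambda) {
    std::size_t best_k = 1;
    double best_gap = -1.0;
    for (std::size_t k = 1; k < lambda.size(); ++k) {
        double gap = lambda[k] - lambda[k - 1]; // lambda_{k+1} - lambda_k
        if (gap > best_gap) { best_gap = gap; best_k = k; }
    }
    return best_k;
}

int main() {
    // Three well-separated clusters: three near-zero eigenvalues, then a jump.
    std::vector<double> lambda = {0.0, 0.01, 0.02, 0.9, 1.1, 1.3};
    std::cout << "estimated k = " << eigengap_model_order(lambda) << "\n"; // prints 3
    return 0;
}
```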

There have been attempts to identify the number of clusters using stability analysis. “Stability” has been defined differently by different authors, but they all based on the same intuition; with respect to some algorithm, a stable clustering is one that behaves relatively consistently in the presence of some controlled perturbation. The perturbation could be in terms of noise (ben-hur02, ), sampling subsets from the input (lange04, ; ben06, )

, or random projection of a high dimensional data into a lower dimensional space

(smolkin03, ). These approaches require several runs of clustering for every value of which makes them computationally expensive. Furthermore, Ben-David et al. (ben06, ) warn against using stability analysis in this context, and they suggest that this family of model order selection techniques is not suitable for selecting the number of clusters in general. The intuition behind their work is that when an underlying objective function, , has several local optima of relatively similar values, a clustering algorithm might be trapped by any of them, even when is the true number of clusters. The difference in clustering solutions is interpreted as instability and the model order rejected, but the behavior is caused by the imperfect nature of the clustering algorithm.

Two more recently proposed methods to select the number of clusters are based on optimization of the quality metrics of modularity (girvan02, ) and description length (rosvall08, ). Both methods simultaneously address both clustering and the selection of the number of clusters; they strive to optimize the objective function over all possible partitionings. These methods are discussed in more detail in Section 5.

3 Problem Formulation

Let G = (V, E) be a weighted graph having n nodes and M edges. (In this work, we assume we are given the graph on which we wish to perform clustering. We do not address the problem of forming a graph from data, which arises when one applies graph clustering methods to general data sets; see, e.g., (vonLuxburg07; daitch09; maier09).) We assume that the edge weights w_{ij} are non-negative and symmetric, that w_{ij} = 0 if (i, j) is not an edge, and that for (i, j) in E, the weight w_{ij} is indicative of the similarity between nodes i and j; that is, the larger the weight, the more similar the nodes. We also allow for self-weights, w_{ii} >= 0.

For a fixed number of clusters, k, we measure the quality of the partition via the normalized cut metric (shi00), defined as follows. For a node i, denote the degree of i by d(i) = \sum_j w_{ij}, and for a subset of nodes A, let d(A) = \sum_{i \in A} d(i) denote the cumulative degree of the subset. Similarly, for two disjoint subsets of nodes, A and B, let W(A, B) = \sum_{i \in A} \sum_{j \in B} w_{ij} denote the sum of the weights of edges with one end in A and the other end in B. Let \Gamma_k = \{A_1, \ldots, A_k\} denote the partition of the nodes into k clusters, where A_c is the subset of nodes affiliated with cluster c. The normalized cut metric is defined as

\mathrm{NCut}(\Gamma_k) = \sum_{c=1}^{k} \frac{W(A_c, V \setminus A_c)}{d(A_c)}.   (1)

Minimizing the normalized cut can be interpreted as minimizing the similarity of nodes in different clusters, relative to the degree of each cluster. Alternatively, maximizing the intra-cluster similarity can be achieved by maximizing the normalized association, defined as

\mathrm{NAssoc}(\Gamma_k) = \sum_{c=1}^{k} \frac{W(A_c, A_c)}{d(A_c)}.   (2)

Moreover, maximizing normalized association is equivalent to minimizing normalized cut since W(A_c, A_c) + W(A_c, V \setminus A_c) = d(A_c), and hence \mathrm{NAssoc}(\Gamma_k) = k - \mathrm{NCut}(\Gamma_k). For the sake of readability, we adopt the maximization of normalized association as our goal in developing a clustering procedure.
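Spelling out the equivalence using the definitions above:

\mathrm{NAssoc}(\Gamma_k) = \sum_{c=1}^{k} \frac{W(A_c, A_c)}{d(A_c)} = \sum_{c=1}^{k} \frac{d(A_c) - W(A_c, V \setminus A_c)}{d(A_c)} = k - \mathrm{NCut}(\Gamma_k),

so, for a fixed k, any partition that maximizes one criterion minimizes the other.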

4 GANC: Greedy Agglomerative Normalized Cut

In this section, we describe a greedy algorithm for building an agglomerative clustering on a graph. Although we do not make guarantees about its accuracy, the algorithm is fast on sparse graphs and yields excellent performance on a variety of examples, as illustrated in Section 5. Moreover, GANC has a model order selection criterion and does not have to be provided with the number of clusters a priori. The algorithm consists of three steps: agglomerative clustering, model order selection, and refinement.

4.1 Agglomerative Clustering

We propose greedy maximization of normalized association via an agglomerative hierarchical clustering. In the following discussion, note that the number of clusters at stage \ell of the hierarchy is n - \ell. At stage \ell of the hierarchy, starting from the partition \Gamma_{n-\ell+1}, two clusters are merged to form a new partition, \Gamma_{n-\ell}. The two clusters to be merged are chosen greedily so that the normalized association at stage \ell is maximized. (We note that a similar greedy merging algorithm is alluded to by Shi and Malik in (shi00); however they also suggest first projecting each node into a k-dimensional real space using the first k eigenvectors of the graph Laplacian, and running an algorithm such as k-means to obtain an initial clustering. Our approach requires no eigendecomposition and performs no such initialization. We also note that, although Shi and Malik mention having experimented with greedy merging, results for this approach have not been reported in the literature.)

Initially, every node is mapped to its own (unique) cluster, so that \Gamma_n consists of n singleton clusters. The degree of every node i, d(i), is calculated. Furthermore, for every edge (i, j), the improvement in normalized association obtained by contracting that edge is stored in \Delta_{ij}, so that a matrix \Delta of the improvements is constructed. Here we assume that initially there are no self-loops, but if there are, the values of the improvements are computed by (4), presented shortly. Throughout the agglomerative clustering procedure, the weights, the degrees, and \Delta are updated at each stage as described below.

At each iteration, the pair of clusters (a, b) with the largest entry \Delta_{ab} is selected for merging to create a larger cluster, c' = a \cup b. The degree is computed for the newly constructed cluster, d(c') = d(a) + d(b). The weight and the improvement matrices are updated by removing the rows and columns corresponding to a and b, and inserting a new row and column corresponding to c' (rows and columns are not removed or added in our implementation, but this is the practical effect). The weight matrix is updated as follows:

w_{c'c} = w_{ac} + w_{bc},   (3)

and the improvement matrix update is:

\Delta_{c'c} = \frac{w_{c'c'} + w_{cc} + 2\,w_{c'c}}{d(c') + d(c)} - \frac{w_{c'c'}}{d(c')} - \frac{w_{cc}}{d(c)},   (4)

for all the clusters, c, adjacent to either a or b. The self-weights are also calculated by:

w_{c'c'} = w_{aa} + w_{bb} + 2\,w_{ab}.   (5)

For all pairs of clusters not adjacent to a or b, the weights and improvements remain unchanged. The above sequence of steps is repeated n - 1 times to form the clustering hierarchy.
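To make the quantity in (4) concrete, the following sketch evaluates the change in normalized association produced by a single candidate merge from the cluster-level summaries maintained above (self-weight, degree, and the inter-cluster weight). The flat data layout is ours; it is an illustration of the quantity only, not of the heap-based implementation described in Section 4.4.

```cpp
#include <iostream>

// Cluster-level summaries maintained during agglomeration.
struct Cluster {
    double self_weight; // w_cc = W(A_c, A_c): intra-cluster weight, each edge counted twice
    double degree;      // d(A_c): sum of the degrees of the nodes in the cluster
};

// Change in NAssoc if clusters a and b are merged, where w_ab is the
// total weight of the edges between them (cf. Equation (4)).
double merge_improvement(const Cluster& a, const Cluster& b, double w_ab) {
    double merged = (a.self_weight + b.self_weight + 2.0 * w_ab) / (a.degree + b.degree);
    return merged - a.self_weight / a.degree - b.self_weight / b.degree;
}

int main() {
    // Two unit-weight triangles joined by a single edge:
    // each triangle has self-weight 6 (3 edges counted twice) and degree 7.
    Cluster a{6.0, 7.0}, b{6.0, 7.0};
    std::cout << merge_improvement(a, b, 1.0) << "\n"; // 14/14 - 6/7 - 6/7 = -5/7
    return 0;
}
```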

4.2 Model Order Selection

Many clustering algorithms require the number of clusters to be provided a priori. However, such information is often not available in practical situations, making the decision about the number of clusters an issue in itself.

It is worth noting that the number of clusters that maximizes NAssoc does not necessarily correspond to a meaningful number of clusters. (To see this, consider an unweighted graph that consists of two isolated chains, each having 4 nodes. The true number of clusters is trivially 2, resulting in NAssoc = 2; however if one groups the adjacent nodes to obtain 4 groups of node pairs, the resulting value of normalized association would be 4 x 2/3 = 8/3.) Here we propose a simple but effective approach to model order selection. Let \Gamma^*_k denote the partition that maximizes normalized association over all partitions of V into k clusters. To carry out model order selection, we examine the curvature of \mathrm{NAssoc}(\Gamma^*_k) as a function of k. (We call this metric the curvature because of its similarity to the central approximation of the second-order derivative, which is defined for a continuous function f as f''(x) \approx (f(x+h) - 2f(x) + f(x-h))/h^2; substituting h = 1 gives the negative of our curvature equation, the negation being chosen so that peaks are positive.) We define the curvature as

C(k) = \left[\mathrm{NAssoc}(\Gamma^*_k) - \mathrm{NAssoc}(\Gamma^*_{k+1})\right] + \left[\mathrm{NAssoc}(\Gamma^*_k) - \mathrm{NAssoc}(\Gamma^*_{k-1})\right].   (6)

The first term of the above addition is the change in normalized association obtained by moving from the best partition with k+1 clusters to the best partition with k clusters, and the second term is the change obtained by moving from the best partition with k-1 clusters to the best partition with k clusters.

The function C(k) captures the notion that a particular number of clusters, k, identifies meaningful structural similarities embodied in the graph if the best partition into k clusters provides a normalized association which is significantly larger than that of the best partition with one fewer cluster, while little additional normalized association can be gained by increasing the number of clusters to k+1.

In practice we do not have access to the optimal partitions, so we cannot evaluate the exact value of the curvature function. Instead, we approximate the curvature by using the normalized association values for the partitions returned by the agglomerative step of our algorithm.

Note that the model order selection step of the algorithm could be used as a model order selection step for any other algorithm that maximizes the normalized association and generates a clustering hierarchy. Furthermore, this step of the algorithm can be considered optional if there is prior knowledge about the true number of clusters.
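A minimal sketch of this step, assuming the agglomerative procedure has already produced an (approximate) NAssoc value for each number of clusters; the container layout and indexing below are ours.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// nassoc[k] holds the (approximate) normalized association of the partition
// with k clusters, for k = 1 .. n (nassoc[0] is unused).
// Fills curv[k] with the curvature of Equation (6) for k = 2 .. n-1 and
// returns the k at which the curvature peaks.
std::size_t curvature_peak(const std::vector<double>& nassoc, std::vector<double>& curv) {
    if (nassoc.size() < 4) return 2;            // need k-1, k, and k+1 to exist
    const std::size_t n = nassoc.size() - 1;
    curv.assign(n + 1, 0.0);
    std::size_t best_k = 2;
    for (std::size_t k = 2; k + 1 <= n; ++k) {
        // Negative central second difference of NAssoc with respect to k.
        curv[k] = 2.0 * nassoc[k] - nassoc[k + 1] - nassoc[k - 1];
        if (curv[k] > curv[best_k]) best_k = k;
    }
    return best_k;
}

int main() {
    // Toy profile with a clear knee at k = 3.
    std::vector<double> nassoc = {0.0, 1.0, 1.9, 2.8, 2.85, 2.9};
    std::vector<double> curv;
    std::cout << "selected k = " << curvature_peak(nassoc, curv) << "\n"; // prints 3
    return 0;
}
```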

4.3 Refinement

Greedy algorithms can get trapped by local optima. This is also the case with the agglomerative step of our algorithm, especially when the clusters are not clearly separated. After selecting the number of clusters, either using the model order selection rule described previously or prior knowledge, a refinement step is invoked in order to improve the initial clustering results. Nodes are moved between clusters to further improve the value of normalized association. A similar approach is taken in (blondel08), but groups of nodes, rather than individual nodes, are moved from one cluster to another. If we tried all possible moves, we would end up performing an exhaustive search over all ways of dividing the n nodes into k clusters. This defeats the purpose of developing the fast agglomerative clustering step. Instead, we look only at the boundary nodes, i.e., the nodes that have at least one neighbor in another cluster.

Initially the set of all the boundary nodes is identified. We use an n x k matrix to keep track of the neighborhood information of the boundary nodes. If B denotes this matrix and c denotes an n-dimensional vector holding the cluster label of each node, then

B_{i\ell} = \sum_{j \in A_\ell} w_{ij}, \qquad c_i = \ell \text{ such that } i \in A_\ell.   (7)

For each of the boundary nodes, the improvement in normalized association obtained by moving the node from its current cluster to each of its neighboring clusters is calculated. This improvement for node i moving from cluster a to cluster b is

\Delta_{i:\,a \to b} = \frac{w_{aa} - 2 B_{ia} + w_{ii}}{d(a) - d(i)} + \frac{w_{bb} + 2 B_{ib} + w_{ii}}{d(b) + d(i)} - \frac{w_{aa}}{d(a)} - \frac{w_{bb}}{d(b)}.   (8)

If no move leads to improvement, the node stays in the cluster to which it belongs. If one or more moves result in improvement, then the node is moved to the cluster that results in the maximum improvement. If node i is moved from cluster a to cluster b, the corresponding cluster degrees and associations are updated:

d(a) \leftarrow d(a) - d(i),   (9)
d(b) \leftarrow d(b) + d(i),   (10)
w_{aa} \leftarrow w_{aa} - 2 B_{ia} + w_{ii},   (11)
w_{bb} \leftarrow w_{bb} + 2 B_{ib} + w_{ii},   (12)

and for all the neighboring nodes j of i, the entries of B and the boundary-node set are updated as follows:

B_{ja} \leftarrow B_{ja} - w_{ij},   (13)
B_{jb} \leftarrow B_{jb} + w_{ij},   (14)
j \text{ is a boundary node } \iff B_{j\ell} > 0 \text{ for some } \ell \neq c_j,   (15)

and finally the entries of B in the row corresponding to node i itself, the cluster label c_i \leftarrow b, and the membership of i in the boundary-node set are updated ((16)-(18)).

While performing the updates (9) to (18), the set of boundary nodes is also updated. Whenever a node i is moved from cluster a to cluster b, its neighbors in cluster a are added to the set of boundary nodes. For some node j neighboring i, moving i might result in B_{j\ell} becoming zero for all \ell \neq c_j in (15); i.e., some nodes might be removed from the set of boundary nodes.

One pass through all the boundary nodes is considered a single refinement iteration. When an iteration causes no positive improvement, the refinement procedure is stopped. Alternatively, one could specify the maximum number of refinement iterations. Note that although we prohibit the refinement step from emptying any cluster, we have not observed such an attempt in our experiments.
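The sketch below illustrates the quantity being optimized when a boundary node is examined, using per-node totals of edge weight into each cluster (the role played by the matrix B above). It illustrates the gain computation only, not the bookkeeping of Section 4.4; the struct layout and names are ours.

```cpp
#include <iostream>

struct Cluster {
    double self_weight; // W(A_c, A_c), each intra-cluster edge counted twice
    double degree;      // d(A_c)
};

// Gain in NAssoc from moving node i out of cluster `from` and into cluster `to`.
// d_i and w_ii are the degree and self-weight of node i; w_i_from and w_i_to are
// the total edge weights from i into each cluster (w_i_from includes w_ii).
double move_gain(const Cluster& from, const Cluster& to,
                 double d_i, double w_ii, double w_i_from, double w_i_to) {
    double from_self = from.self_weight - 2.0 * w_i_from + w_ii;
    double to_self   = to.self_weight   + 2.0 * w_i_to   + w_ii;
    double after  = from_self / (from.degree - d_i) + to_self / (to.degree + d_i);
    double before = from.self_weight / from.degree + to.self_weight / to.degree;
    return after - before; // the move is accepted only if this is positive
}

int main() {
    // Two unit-weight triangles joined by one edge; consider moving the bridge node.
    Cluster a{6.0, 7.0}, b{6.0, 7.0};
    // Bridge node: degree 3, no self-loop, weight 2 into its own cluster, 1 into the other.
    std::cout << move_gain(a, b, 3.0, 0.0, 2.0, 1.0) << "\n"; // about -0.414: do not move
    return 0;
}
```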

GREEDY AGGLOMERATION

1:  Build the initial improvement matrix Δ, the weight matrix, and the cluster degrees
2:  for k = n - 1 down to 1 do
3:     (a, b) ← the pair of clusters with the largest entry of Δ
4:     Merge clusters a and b into a new cluster c'
5:     Update rows and columns of Δ, the weights, and the degrees corresponding to clusters a and b (3 - 5)
6:  end for

MODEL ORDER SELECTION

1:  for k = n - 1 down to 2 do
2:     Calculate the curvature C(k) (6)
3:  end for
4:  k* ← arg max of C(k), or k* provided by the user

REFINEMENT

1:  Obtain the flat partitioning with k* clusters
2:  Build the initial B and c (7), and set refine ← true
3:  while refine do
4:     improved ← false
5:     for i = 1 to n do
6:        if i is a boundary node and the best move for i has positive gain then
7:           Move i from c_i to b, where b is the neighboring cluster maximizing the gain (8)
8:           Update B, c, the weights, and the degrees (9 - 18)
9:           improved ← true
10:        end if
11:     end for
12:     if not improved then
13:        refine ← false
14:     end if
15:  end while
Algorithm 1 GANC: Greedy Agglomerative Normalized Cut

4.4 Implementation and Computational Complexity of GANC

We take a similar approach to (clauset04) when implementing the agglomerative clustering procedure of GANC. Max-heaps and balanced binary trees are used to store the rows of the improvement matrix Δ and the adjacency matrix. A separate heap is also used to store the maximum of each row of Δ. This leads to a complexity of O(M h log n) for the agglomerative clustering procedure, where h is the height of the generated dendrogram and M is the number of edges with non-zero weight (see (clauset04) for details).
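The following is a simplified sketch of that bookkeeping: each cluster keeps a map from neighbouring clusters to the current merge improvement, and a single global max-heap of candidates is maintained with lazy invalidation, so stale entries are simply discarded when popped. It mirrors the spirit of the data structures in (clauset04) rather than our exact implementation.

```cpp
#include <cstddef>
#include <map>
#include <queue>
#include <utility>
#include <vector>

struct Candidate {
    double gain;         // merge improvement Delta for the cluster pair (a, b)
    std::size_t a, b;
    std::size_t version; // version of the pair when this entry was pushed
    bool operator<(const Candidate& other) const { return gain < other.gain; }
};

struct MergeQueue {
    // Per-cluster map: neighbouring cluster -> (current gain, current version).
    std::vector<std::map<std::size_t, std::pair<double, std::size_t>>> rows;
    std::priority_queue<Candidate> heap;

    explicit MergeQueue(std::size_t n) : rows(n) {}

    // Insert or refresh the improvement of merging clusters a and b.
    void update(std::size_t a, std::size_t b, double gain) {
        auto& cell = rows[a][b];
        cell.first = gain;
        ++cell.second; // older heap entries for (a, b) become stale
        heap.push(Candidate{gain, a, b, cell.second});
    }

    // Remove a pair when one of its clusters disappears in a merge.
    void erase(std::size_t a, std::size_t b) { rows[a].erase(b); }

    // Pop the best still-valid candidate merge; returns false if none remain.
    bool pop_best(std::size_t& a, std::size_t& b) {
        while (!heap.empty()) {
            Candidate c = heap.top();
            heap.pop();
            auto it = rows[c.a].find(c.b);
            if (it != rows[c.a].end() && it->second.second == c.version) {
                a = c.a;
                b = c.b;
                return true;
            }
            // Otherwise the entry is stale (updated or erased since it was pushed).
        }
        return false;
    }
};
```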

The model order selection step of the algorithm requires O(n) computations. The computational requirements are much less than those of the methods analyzing the eigenvalue fall-off, which require eigendecomposition (e.g., (ZelnikManor04) and (azran06)). Performing an eigendecomposition to study the eigenvalue fall-off for all of the eigenvalues requires O(n^3) operations.

To implement the refinement step, we use the map data structure to store and update B. Every row of B is stored in a map (the C++ STL implementation of the map data structure is used (plauger2000cpp)). Updates (9) to (12) are performed in constant time. Accessing an entry of a row of B to insert, update, or delete it requires no more than O(log k) operations, because the maximum number of clusters connected to a boundary node is strictly less than k. Updates (13), (14), and (15) each take no more than O(log k) operations and are repeated for all neighbors of a node that is moved. We have observed through experiments that nodes with larger degrees do not tend to be on the boundaries, and hence the average number of neighbors to be updated is in practice smaller than the average node degree. (16), (18), and any insertion to or deletion from the set of boundary nodes are performed in O(log n) time. (17) is trivial and is performed in constant time. No more than n nodes can be moved in a given iteration. When node i is moved, O(d(i) log n) operations are required, where d(i) is the degree of node i. Summing over all boundary nodes, the complexity is O(M log n) because the node degrees sum to 2M. Hence the number of operations in a single iteration is at most O(M log n). Throughout our experiments we have observed a very small number of refinement iterations, even when the refinement is applied to networks with millions of nodes and edges. Furthermore, the algorithm is capable of limiting the number of iterations. Hence the average complexity of the whole refinement step does not exceed O(M log n).

The total computational complexity of GANC is dominated by the agglomerative clustering procedure. In practice, the height of the resulting dendrogram is often O(log n). Also, the graphs that are studied are usually sparse, and hence M = O(n). Therefore the complexity of the agglomerative clustering step of our algorithm is O(n log^2 n) for many real-life graphs.

The worst case runtime occurs when the graph is complete and the dendrogram is totally unbalanced. There are n - 1 agglomerations to construct the whole hierarchy, each of which requires at most O(n log n) operations. Hence the worst case complexity for the agglomerative clustering step of the algorithm is O(n^2 log n), which is not expected to occur in practical situations.

The memory requirement of GANC is O(M), which is equivalent to O(n) for sparse graphs. In summary, in many practical cases (relatively sparse graphs and balanced dendrograms), the computational complexity is O(n log^2 n) and the memory requirement is O(n). GANC can be downloaded from http://www.ece.mcgill.ca/~coates/software/.

5 Experimental Results

In this section we provide a comparison of the performance and the runtime of our proposed algorithm GANC with a selection of the state-of-the-art graph clustering algorithms from the literature whose implementations were downloaded from the corresponding authors’ websites. Experiments were performed on an Intel 3.0 GHz Core 2 Quad CPU with 8 GB of RAM and Ubuntu 9.10 operating system.

5.1 Comparing Algorithms

We compare to the following algorithms: Shi and Malik (recursive NCut) (shi00); Meila and Shi (k-way NCut) (meila01); Ng, Jordan, and Weiss (ng02); Dhillon, Guan, and Kulis (dhillon07); Rosvall and Bergstrom (rosvall08); and Blondel et al. (blondel08). The first four algorithms focus on maximizing NAssoc. Despite not addressing our criterion of interest directly, the other algorithms are included because they are scalable and represent the state-of-the-art in graph clustering.

The discussed clustering algorithms ((shi00), (meila01), (ng02)) are not scalable as they include eigendecomposition. Dhillon, Guan, and Kulis proposed an algorithm that strives to maximize normalized association without requiring any eigendecomposition (dhillon07). Similar to the other algorithms that address the maximization of normalized association, the Dhillon-Guan-Kulis algorithm requires the user to provide the number of clusters. After specifying the number of clusters, three steps are performed: a coarsening phase, a base clustering step using region growing, and finally a refinement step using weighted kernel k-means with appropriate choices of the kernel and the node weights to maximize the normalized association.

Minimization of the criterion proposed by Rosvall and Bergstrom (rosvall08) results in a coarse-grained representation of the information flow through the network. Their proposed clustering objective aims to compress the description of the underlying random walk on a graph without losing important network structures. To facilitate this, a coding is designed as follows: every cluster is assigned a unique code; within a given cluster, every node is also assigned a unique code. However, two nodes in different clusters are allowed to share the same code. The algorithm then strives to minimize the average description length of a single step of the random walk.

The algorithm by Blondel et al. targets maximizing modularity (blondel08). Modularity (Q) is defined as the total fraction of intra-cluster sum-weight minus the expected fraction if the edges (and weights) were distributed randomly while the node degrees were preserved:

Q(\Gamma_k) = \sum_{c=1}^{k} \left[ \frac{W(A_c, A_c)}{2m} - \left( \frac{d(A_c)}{2m} \right)^2 \right],   (19)

where 2m = \sum_{i \in V} d(i) is the total weight of the graph. Modularity provides a valuable metric of the connectedness of clusters, but a number of authors have demonstrated that it suffers from a resolution limit when used to select the number of clusters (good09). The partitioning that maximizes modularity will generally not isolate clusters if the number of edges in the cluster is a small fraction of the total number of edges in the graph. The reason behind the resolution limit of modularity is the second term in the summation of (19); when the network gets larger, 2m increases monotonically. However, the cluster degrees are not necessarily a function of the network size and are often bounded, regardless of the number of nodes (lekovec2008natural). This causes the second term to vanish, and hence the value of modularity becomes dependent only on the fraction of intra-cluster weight. By normalizing each term of the summation by d(A_c)/(2m), as suggested in (yu10), the resolution limit phenomenon is resolved; but then we have

Q_n(\Gamma_k) = \sum_{c=1}^{k} \left[ \frac{W(A_c, A_c)}{d(A_c)} - \frac{d(A_c)}{2m} \right] = \mathrm{NAssoc}(\Gamma_k) - 1,   (20)

since the cluster degrees sum to 2m; i.e., maximization of Q_n and maximization of the normalized association are equivalent.

5.2 Synthetic Graphs

We first analyze the performance on synthetic graphs, for which the true clustering behavior is known. We use benchmark graphs developed by Lancichinetti, Fortunato, and Radicchi (lancichinetti09) (LFR graphs). These random graphs are designed based on the planted partition model (condon2001algorithms). Each node is assigned to one of several clusters. When edges are added to the graph, the probability that an edge joins nodes from different clusters is the mixing parameter μ, and the probability that it joins nodes in the same cluster is 1 - μ. The LFR benchmarks have heterogeneous cluster sizes with user-specified lower and upper bounds. Furthermore, the node degrees are upper-bounded by a user-specified maximum, and the average node degree is also specified. As μ decreases, edges are increasingly likely to be intra-cluster, making the partitioning task easier.
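For intuition about this construction, the sketch below samples a bare-bones planted-partition graph; it is not the LFR generator itself (which additionally imposes power-law degree and cluster-size distributions), and all parameter names are ours.

```cpp
#include <cstddef>
#include <iostream>
#include <random>
#include <utility>
#include <vector>

// Sample a simple planted-partition graph: n nodes assigned round-robin to k
// clusters; each potential edge is kept with probability p_in inside a cluster
// and p_out across clusters. Returns the sampled edge list.
std::vector<std::pair<std::size_t, std::size_t>>
planted_partition(std::size_t n, std::size_t k, double p_in, double p_out,
                  std::mt19937& rng) {
    std::vector<std::pair<std::size_t, std::size_t>> edges;
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i + 1; j < n; ++j) {
            double p = ((i % k) == (j % k)) ? p_in : p_out;
            if (coin(rng) < p) edges.emplace_back(i, j);
        }
    }
    return edges;
}

int main() {
    std::mt19937 rng(42);
    auto edges = planted_partition(1000, 25, 0.3, 0.005, rng);
    std::cout << edges.size() << " edges sampled\n";
    return 0;
}
```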

The Meila-Shi algorithm performs at least as well as the other spectral clustering algorithms (Ng-Jordan-Weiss (ng02) and Shi-Malik (shi00)), and hence we only display the Meila-Shi results. The local search of Dhillon-Guan-Kulis is also often excluded, because it results in negligible improvement while introducing a very large computational overhead.

Figure 1: LFR benchmarks with 1,000 nodes and variable proportions of inter-cluster edges (μ). In all of the above figures, Meila-Shi, Rosvall-Bergstrom, and GANC (refined) are overlapping. In (b) and (d), a Jaccard index of 1 for the overlapping algorithms corresponds to perfect extraction of the clusters.

5.2.1 Maximizing normalized association

Figure 1 examines how the algorithms perform with respect to maximizing NAssoc for 1,000-node LFR graphs. In order to observe how the algorithms perform on graphs with heterogeneous clusters, we let the cluster sizes range from 20 to 50 nodes. (In the real-life networks that we consider in this paper, the clusters have been observed to be of a limited size, regardless of the network size (lekovec2008natural); for example, the Dunbar number suggests an upper limit of 150 nodes for clusters in a social network (dunbar1998grooming).) For each value of μ, 100 graph realizations are generated. The value of NAssoc is divided by the number of clusters to obtain a value between 0 and 1, which represents the average NAssoc per cluster.

The algorithms perform almost identically with the exception of the Dhillon et al. algorithm (dhillon07). The refinement step of GANC results in a significant improvement in the value of NAssoc. However, the local search of Dhillon's algorithm does not improve NAssoc significantly. Note that the Blondel et al. algorithm results in higher values of NAssoc because it selects a smaller number of clusters than the true one (the average NAssoc per cluster is a monotonically decreasing function of the number of clusters); a fair comparison between algorithms is possible only when the numbers of clusters are equal.

5.2.2 Comparing to the planted partitions

The advantage of exploring performance on synthetic graphs is that a ground-truth partitioning is available. Because the LFR benchmarks are based on the planted partition model, there is the additional advantage that NAssoc is an appropriate criterion to adopt. Apart from some possible small errors due to the randomness inherent in the construction of the benchmark graphs, the ground-truth partitioning will correspond to a maximum of NAssoc for the corresponding number of clusters.

For comparing two partitions of the same graph, we use the Jaccard index (downton80), which for two partitions, P_1 and P_2, is defined as

J(P_1, P_2) = \frac{n_{11}}{n_{11} + n_{10} + n_{01}},   (21)

where n_{11}, n_{10}, and n_{01} are, respectively, the total number of pairs of nodes that are assigned to the same cluster in both P_1 and P_2, to the same cluster in P_1 and different clusters in P_2, and to the same cluster in P_2 and different clusters in P_1. When two partitions are identical, the Jaccard index evaluates to one, and it decreases as the partitions deviate from each other.
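A direct (quadratic-time) sketch of this computation for partitions given as per-node label vectors; this is adequate for the small graphs of Section 5.4.1, though a contingency-table formulation would be preferable at scale.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Jaccard index between two partitions given as per-node cluster labels
// (Equation (21)): n11 pairs together in both, n10 together only in the
// first, n01 together only in the second.
double jaccard_index(const std::vector<int>& p1, const std::vector<int>& p2) {
    std::size_t n11 = 0, n10 = 0, n01 = 0;
    for (std::size_t i = 0; i < p1.size(); ++i) {
        for (std::size_t j = i + 1; j < p1.size(); ++j) {
            bool same1 = (p1[i] == p1[j]);
            bool same2 = (p2[i] == p2[j]);
            if (same1 && same2) ++n11;
            else if (same1) ++n10;
            else if (same2) ++n01;
        }
    }
    return static_cast<double>(n11) / static_cast<double>(n11 + n10 + n01);
}

int main() {
    std::vector<int> truth = {0, 0, 0, 1, 1, 1};
    std::vector<int> found = {0, 0, 1, 1, 1, 1};
    std::cout << jaccard_index(truth, found) << "\n"; // 4 / (4 + 2 + 3) = 0.444...
    return 0;
}
```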

Figure 1(b) shows the closeness of the partitionings identified by the algorithms to the ground truth in terms of the Jaccard index. The algorithm by Blondel et al. deviates dramatically from the true clustering (Figure 1(b)). The reason is that the minimum and maximum cluster sizes are fixed for all the networks, in addition to the average node degree. Hence the average number of edges per cluster remains the same, making the proportion of intra-cluster edges decrease. Due to the resolution limit of modularity, as the proportion of intra-cluster edges is decreased, modularity maximization algorithms tend to group several clusters into a single cluster. If we reduce the size of the network, the Blondel et al. algorithm produces results closer to the ground truth. The algorithm by Dhillon et al. does not perform as well as the other algorithms that strive to maximize NAssoc, and its local search results in negligible improvement in terms of the Jaccard index.

Figure 2: The ring of 24 cliques; the highest peak of curvature is at 24. (a) Ring graph; (b) curvature plot of the ring.
Figure 3: Curvature versus number of clusters for varying proportions of inter-cluster edges (μ). The true clustering is indicated by a red X. The peak of the curvature becomes less and less distinguishable as μ increases. LFR benchmark graph with 1000 nodes.

5.3 Model order selection: the curvature metric

We first provide an illustrative example of the use of the curvature metric for selecting the number of clusters in the partitioning. This example highlights the difference in behavior compared to the modularity metric used by modularity maximization algorithms (blondel08, ).

An example used by Good et al. (good09) to illustrate the resolution limit is the ring of 24 cliques (a clique is a subset of nodes in a graph which are fully connected), each of which has 5 nodes and is connected to its neighboring cliques by a single edge. The value of modularity is maximized by a 12-cluster partition that merges pairs of cliques together. A more natural clustering is to consider each clique as an individual cluster. Figure 2(b) shows the curvature plot for the ring of cliques. In addition to the peak of the curvature, which is correctly located at 24 clusters, it is interesting to note that the other local peaks of the curvature are also meaningful; e.g., the peak at 12 clusters corresponds to the clustering that maximizes modularity.

To see how the curvature indicates the true number of clusters for networks with more complex structures, we consider the LFR benchmarks. We fix the LFR parameters for a 1,000-node graph and explore how the curvature changes as μ is varied. Figure 3 indicates that the peak of the curvature plot correctly identifies the true number of clusters for μ in the range 0.1 - 0.5. As μ increases, the main peak becomes less distinct; for the largest value of μ shown, the second highest peak is almost as large as the primary peak. As the clusters become increasingly interconnected, the curvature plot provides a less clear indication of the true number of clusters.

Note that Figure 3 corresponds to a single realization of the LFR benchmarks. Over 100 realizations of the LFR benchmark with parameters as in Figure 3, we also compute the empirical probability that the highest peak of the curvature corresponds to the true number of clusters when no refinement step is applied, as well as the empirical probability that one of the two highest peaks corresponds to the true number of clusters.

These probabilities are obtained using the approximate curvatures derived from the agglomerative clustering procedure. To achieve better insight into the model selection capabilities of the curvature metric, we derive better approximations by applying the refinement step to every level of the hierarchy and then recomputing the curvature estimates. With this new procedure, the empirical success probabilities improve further. This indicates that curvature can provide a good indication of model order even for the larger values of μ.

The model order selection of the Rosvall-Bergstrom algorithm quite successfully indicates the true number of clusters in the case of the LFR benchmarks. The Blondel et al. algorithm does not successfully identify the true number of clusters as μ increases; its failure becomes more evident for larger networks with bounded cluster sizes, while it succeeds for smaller networks over the whole range of μ (this is reflected in Figure 1). However, the refined GANC and Rosvall-Bergstrom algorithms obtain similar success ratios.

5.4 Real networks

In this section, we examine the behavior of GANC and compare it to the other algorithms on graphs representing real networks. We first examine the behavior for small networks where there is knowledge of the ground-truth partition. We then experiment with large networks, which allows us to assess the scalability of GANC.

5.4.1 Small networks

We conduct experiments using the Zachary karate club network (zachary), the football network (girvan02), and the political books network (collected by V. Krebs, http://www.orgnet.com). The Zachary karate club network portrays friendships in a karate club. During the polling period, there was a dispute between the manager and the instructor which led to the establishment of a new club by the instructor. The karate students then split into two groups, either staying with the original club or following the instructor to the new club. The two clusters associated with the network correspond to these two groups. Each node in the football network corresponds to a team in US college football. There are 11 conferences and 5 independent teams. Each edge corresponds to the existence of a match between the connected nodes. The independent teams can be interpreted as outliers. Nodes of the political books network correspond to political books sold by Amazon.com and are categorized as neutral, conservative, or liberal. An edge between a pair of nodes corresponds to frequent co-purchasing of the two books by the same buyers. The three networks are unweighted.

Tables 1 and 2 compare the performance of GANC with other clustering algorithms for these small, well-known datasets. The tables provide the average NAssoc per cluster and the Jaccard index. The "true" number of clusters is indicated within parentheses beside the name of each network. (Note that the "true" number of clusters is something of an artificial social construct and does not necessarily correspond to a partitioning that maximizes any meaningful graph clustering metric.) In Table 1, all algorithms (including GANC) are informed of the true number of clusters and instructed to generate a partitioning with that number of clusters. Table 1 indicates that the spectral clustering algorithms (Ng-Jordan-Weiss, Meila-Shi, and Shi-Malik) achieve similar performance in terms of NAssoc. We note that when the number of clusters is specified for these three networks, GANC identifies partitionings with average NAssoc as large as or larger than those of the partitionings identified by the spectral clustering techniques, with the exception of Shi-Malik in the case of the football network.

Network | Karate (2) | Football | PolBooks (3)
GANC (refined) | 0.872/0.89 | 0.704/0.820 | 0.88/0.67
Dhillon-Guan-Kulis | 0.820/0.64 | 0.700/0.820 | 0.87/0.64
Ng-Jordan-Weiss | 0.872/1.00 | 0.701/0.825 | 0.82/0.57
Meila-Shi | 0.869/0.89 | 0.696/0.820 | 0.88/0.67
Shi-Malik | 0.872/1.00 | 0.706/0.825 | 0.86/0.67
Table 1: Small real networks, with the number of clusters known in advance (average NAssoc per cluster / Jaccard index). Maximum NAssoc is indicated by bold text.

Table 2 compares the performance of algorithms that select the number of clusters based on some aspect of the data. The number of clusters selected by each algorithm is shown within parentheses after the Jaccard index. Rosvall's algorithm uses a description length metric to select the number of clusters (rosvall08); Blondel's algorithm chooses the partitioning that maximizes the modularity (blondel08). A direct comparison of NAssoc is not valid when the number of clusters is not fixed.

Karate Football PolBooks
GANC (curv) 0.80 () 0.76 () 0.69 ()
Blondel et al. 0.53 () 0.70 () 0.67 ()
Rosvall-Bergstrom 0.71 () 0.83 () 0.63 ()
Table 2: Small real networks: Jaccard index for the estimated model orders.

The first row of this table shows the performance of GANC when the curvature plot is used to select the number of clusters. Although the selected number of clusters does not correspond to the true number of clusters for any of the networks, all three values can be explained. In the karate club network there is a group of students who have weak ties to other members of the network and these are identified as a third cluster; in the football network, GANC isolates the independent teams; in the political books network, GANC does not identify a neutral cluster and assigns each of the neutral books to either the liberal or conservative cluster.

5.4.2 Cortical Networks

Here we study the networks presented in (hagmann08, ) which are weighted graphs, each with 998 nodes. The nodes correspond to small regions of the human cerebral cortex and the edges correspond to cortico-cortical axonal pathways. The networks are developed for five patients (the extraction is performed twice for patient A). We first perform a comparison of the competing algorithms, then we discuss the clustering results of GANC.

When applying the Rosvall-Bergstrom and Blondel et al. algorithms, the user does not have the freedom to choose the number of clusters. Hence we repeat the experiment twice to make a meaningful objective comparison in terms of NAssoc. Table 3 lists the average NAssoc for every clustering algorithm and patient. In the first subtable, the number of clusters is set to the model order selected by the Rosvall-Bergstrom algorithm. Then, in the second subtable, it is set to the model order selected by the Blondel et al. algorithm, and in the third subtable, the model orders given by the peaks of the curvature are used. Note that the Shi-Malik algorithm outperforms the other two spectral algorithms on these datasets; hence we only include the Shi-Malik results.

Number of clusters set to the Rosvall-Bergstrom model order:
Algorithm | A-1 | A-2 | B | C | D | E
Rosvall-Bergstrom | 0.725 (62) | 0.706 (68) | 0.738 (61) | 0.706 (62) | 0.691 (63) | 0.706 (58)
GANC (refined) | 0.730 | 0.716 | 0.740 | 0.712 | 0.705 | 0.718
Dhillon-Guan-Kulis | 0.696 | 0.682 | 0.712 | 0.688 | 0.670 | 0.684
Shi-Malik | 0.715 | 0.692 | 0.727 | 0.690 | 0.686 | 0.694

Number of clusters set to the Blondel et al. model order:
Algorithm | A-1 | A-2 | B | C | D | E
Blondel et al. | 0.873 (14) | 0.872 (13) | 0.885 (15) | 0.862 (14) | 0.856 (14) | 0.828 (21)
GANC (refined) | 0.886 | 0.879 | 0.883 | 0.873 | 0.864 | 0.840
Dhillon-Guan-Kulis | 0.860 | 0.865 | 0.872 | 0.859 | 0.851 | 0.810
Shi-Malik | 0.877 | 0.874 | 0.877 | 0.860 | 0.870 | 0.826

Number of clusters set to the GANC curvature peaks:
Algorithm | A-1 | A-2 | B | C | D | E
GANC (refined) | 0.908 (10) | 0.945 (4) | 0.899 (12) | 0.880 (13) | 0.930 (5) | 0.907 (9)
Dhillon-Guan-Kulis | 0.889 | 0.946 | 0.897 | 0.863 | 0.916 | 0.885
Shi-Malik | 0.898 | 0.951 | 0.886 | 0.863 | 0.936 | 0.902

Table 3: NAssoc of the cortical networks. Maximum NAssoc is indicated by bold text. The model order selected for each column in each sub-table is identified within parentheses.

The clustering results corresponding to the curvature peaks include cluster(s) that contain nodes from both of the brain hemispheres and clusters that include nodes from only one of the hemispheres. In the following discussion, we call the clusters that contain nodes from both of the hemispheres, the central clusters. Graphs A1, B, D, and E include only one central cluster. Graphs A2 and C each include two central clusters.

The regions of the brain that are commonly grouped in the central clusters are the posterior cingulate cortex, precuneus, cuneus, paracentral lobule, pericalcarine cortex, caudal anterior cingulate cortex, isthmus of the cingulate cortex, and lingual gyrus, provided that graph B is excluded. If the second largest peak of the curvature is selected for B, the same regions are assigned to its central cluster. The first five of the mentioned regions are also classified as part of the structural core proposed by Hagmann et al. (hagmann08); Hagmann et al. used cluster strengths, cluster degrees, k-Core, s-Core, betweenness centrality, and efficiency to propose a structural core of the brain consisting of eight regions. (Cluster strength and degree are the weighted and unweighted degrees, respectively. The k-Core (s-Core) is the largest subgraph whose nodes all have degree (respectively, strength) at least k (respectively, s). Betweenness centrality of a region is a measure of the proportion of shortest paths passing through it. Efficiency of a region indicates how short the region's average path lengths to other regions are.)

The authors of (hagmann08, ) also used modularity maximization and found six modules, two of which included nodes from both of the hemispheres (central clusters). The regions that are assigned to the central clusters by modularity maximization are similar to the ones extracted by GANC.

5.4.3 Larger networks

Here we illustrate the performance of our algorithm on large graphs. We apply our algorithm to four networks of different natures: Cond-Mat (a collaboration network) (newman01), Googleweb (a web graph, from the Google programming contest, http://www.google.com/programming-contest/) (lekovec2008natural), Amazon (a product co-purchasing network) (leskovec2007dynamics), and AS-Skitter (an autonomous system graph) (leskovec2005graphs). The nodes in Cond-Mat represent scientists who submit their preprints to the condensed matter archive at www.arxiv.org; the edges represent co-authorships. The nodes in Googleweb represent websites and the directed edges represent the existence of hyperlinks. Each node in the Amazon network corresponds to a product purchased at www.amazon.com; a directed edge from one node to another means that when the former is purchased, the latter is frequently also purchased. AS-Skitter is a network of autonomous systems extracted by traceroute analysis (http://www.caida.org/tools/measurement/skitter).

The Cond-Mat graph is weighted and the rest of the graphs are unweighted. In Table 4, the average and maximum degrees do not take the weights into account; the number of edges affects the run-time of GANC, not the weight values. None of the above graphs are originally connected. However each has a very large connected component that includes the majority of the nodes and the edges. Here we conduct clustering analysis of the largest connected component. Some of these graphs are directed, but we construct undirected graphs by adding the adjacency matrix to its transpose.
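As a minor practical note, this symmetrization amounts to the following sketch (container choices and names are ours); reciprocal directed edges simply have their weights summed.

```cpp
#include <algorithm>
#include <iostream>
#include <map>
#include <utility>
#include <vector>

// Given a directed, weighted edge list, build the undirected graph whose
// weight matrix is A + A^T: the weights of reciprocal edges are summed.
std::map<std::pair<int, int>, double>
symmetrize(const std::vector<std::pair<std::pair<int, int>, double>>& directed) {
    std::map<std::pair<int, int>, double> undirected;
    for (const auto& e : directed) {
        int u = e.first.first, v = e.first.second;
        undirected[{std::min(u, v), std::max(u, v)}] += e.second;
    }
    return undirected;
}

int main() {
    std::vector<std::pair<std::pair<int, int>, double>> d = {
        {{0, 1}, 1.0}, {{1, 0}, 2.0}, {{2, 0}, 1.0}};
    auto g = symmetrize(d);
    std::cout << g.size() << " undirected edges\n"; // 2: (0,1) weight 3, (0,2) weight 1
    return 0;
}
```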

Network | Cond-Mat | Googleweb | Amazon | AS-Skitter
# of Nodes/Edges | 36458/171736 | 342408/1142134 | 403364/2443311 | 1694616/11094209
Avg/Max Degree | 28.5/278 | 6.7/1367 | 12.1/2752 | 13.1/35455
GANC | (2121) 0.784/0m2s | (12213) 0.869/2m30s | (11237) 0.768/3m4s | (26399) 0.769/46m33s
Dhillon-Guan-Kulis | (2121) 0.669/0m3s | (12213) — | (11237) 0.679/6m0s | (26399) —
Rosvall-Bergstrom | (2121) 0.746/0m10s | (12213) 0.844/1m17s | (11237) 0.712/11m26s | (26399) 0.720/99m37s
GANC | (82) 0.957/0m2s | (239) 0.993/2m30s | (230) 0.984/3m4s | (1776) 0.933/46m33s
Dhillon-Guan-Kulis | (82) 0.784/1s | (239) 0.935/0m2s | (230) 0.872/0m4s | (1776) 0.709/8m0s
Blondel et al. | (82) 0.849/0m1s | (239) 0.990/0m5s | (230) 0.952/0m14s | (1776) 0.859/0m4s
GANC | (37) 0.970/0m2s | (7) 0.999/2m30s | (22) 0.997/3m4s | (32) 0.994/46m33s
GANC | (526) 0.902/0m2s | (855) 0.984/2m30s | (206) 0.996/3m4s |

Table 4: Larger real networks: partition metrics, reported as (number of clusters) average NAssoc per cluster / execution time.

Table 4 compares the performance of the algorithms that are scalable to such large networks. The table shows the average NAssoc per cluster of the identified partitioning and the time required for completion of the algorithm. The spectral clustering algorithms cannot be executed on our test machine when either the number of nodes or the number of clusters is very large. Both the computational time and the memory requirements are excessive. We therefore compare the performance of GANC, Dhillon-Guan-Kulis (no local search), Rosvall-Bergstrom, and the Blondel et al. algorithms.

The Rosvall-Bergstrom and Blondel et al. algorithms automatically select the number of clusters. A meaningful comparison of the average NAssoc is only possible when the number of clusters are equal; to facilitate comparisons, we therefore construct multiple partitionings for GANC and the Dhillon-Guan-Kulis algorithms, each with a different number of clusters. We also construct a partitioning corresponding to one or more of the peaks in the curvature plot (for some of the networks, there are two peaks that are very similar in value, so we consider it useful to examine the partitionings corresponding to each). The last two rows of Table 4 correspond to those peaks.

GANC is superior to other algorithms with respect to maximizing NAssoc in all of the cases. GANC also significantly outperforms the Dhillon-Guan-Kulis algorithm, which is also striving to maximize NAssoc. The latter is also outperformed by Rosvall-Bergstrom and Blondel et al. algorithms.

In terms of computation time, GANC is often faster than Rosvall-Bergstrom, but the scaling behavior is different. To illustrate this, in addition to the graphs listed in Table 4, we have applied the algorithms to the road network graph of California (lekovec2008natural), which is extremely sparse (the average degree is 2.8). The graph contains around 2 million nodes. While GANC performs the clustering in 32 seconds, it takes 197 minutes for the Rosvall-Bergstrom algorithm to converge to a solution. The complexity of GANC is dominated by the agglomerative clustering procedure, which requires O(M h log n) operations. Hence the speed of GANC depends on the number of edges in the graph and the height of the dendrogram. However, the graph sparsity does not appear to affect Rosvall-Bergstrom's execution time. Blondel's algorithm is much faster than GANC, but, as mentioned previously, it suffers from the resolution limit associated with clustering algorithms that maximize modularity. The Dhillon-Guan-Kulis algorithm is also very fast for small numbers of clusters, but it gets slower and its memory requirements become excessive as the number of clusters becomes large.

To illustrate the properties of the algorithm outputs in terms of the cluster sizes and the connectivity of individual clusters, we have examined plots of the NAssoc of individual clusters versus the number of nodes in them. Each algorithm behaves similarly on the different graphs presented in Table 4. Figure 4(a) compares GANC and Rosvall-Bergstrom when these algorithms are applied to the Amazon co-purchasing graph. For the same number of clusters, GANC produces clusters with higher values of NAssoc than Rosvall-Bergstrom. The latter produces many small clusters with very low values of NAssoc. The behavior of the Dhillon-Guan-Kulis and the Blondel et al. algorithms is discussed in our case study.

Figure 4: NAssoc of individual clusters. (a) Amazon; (b) US Patent Citation.

5.5 Case Study: US Patent Citation Graph

As a case study, we consider the undirected version of the citation graph released by the National Bureau of Economic Research (website:patents, ; hall01, ). The patents are classified into 6 broad technological categories. A more refined classification leads to 36 sub-categories. We use each patent’s label (category or sub-category) in addition to NAssoc and runtime to perform a comparison.

The original graph is not connected; however the largest connected component contains more than 99.7% of the nodes and all of the edges. Hence we focus on the largest connected component which contains 3,764,117 nodes and 16,511,740 edges. The maximum node degree is 793.

5.5.1 Clustering Runtime

The Rosvall-Bergstrom algorithm (rosvall08, ) was terminated after more than 30 hours without converging to a solution. GANC takes 77 minutes to construct the full hierarchy; after the hierarchy has been generated, each flat partitioning, including the refinement, takes less than 35 seconds. The algorithm by Blondel et al. (blondel08, ) takes 8 minutes. The Dhillon-Guan-Kulis algorithm (dhillon07, ) takes 72 seconds for small k, and its runtime increases as k is increased.

5.5.2 Maximization of Normalized Association

In order to have a fair comparison in terms of NAssoc, we fix k, the number of clusters, to match the Blondel et al. result. The average NAssoc values for GANC, Dhillon-Guan-Kulis, and Blondel et al. are 0.964, 0.859, and 0.855, respectively. The NAssoc values of the individual clusters are shown in Figure 4(b), which illustrates the clear superiority of GANC.

5.5.3 Extraction of True Clusters and Absence of Large Well-Defined Clusters

We use the categories and sub-categories to classify the nodes in the patent citation graph. We refer to the maximum proportion of nodes from the same class in a given cluster as the homogeneity proportion of that cluster. For the Blondel et al. and Dhillon-Guan-Kulis algorithms we use k = 57 (the number of clusters produced by Blondel et al.), and for GANC we use the number of clusters at the closest peak of the curvature plot. The clusters are sorted according to their homogeneity proportion in Figure 5(a). The figure shows the superiority of GANC in extracting clusters of nodes from the same categories; when sub-categories are employed for the assessment, the superiority of GANC becomes more pronounced.
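The homogeneity proportion defined above is straightforward to compute; the following sketch shows one way to do so, with data-structure and function names of our own choosing.

```python
from collections import Counter

def homogeneity_proportions(clusters, labels):
    """clusters: list of sets of nodes; labels: {node: class label}.
    Returns the largest fraction of same-label nodes in each cluster,
    sorted in decreasing order (as in Figure 5(a))."""
    props = []
    for cluster in clusters:
        counts = Counter(labels[v] for v in cluster)
        props.append(counts.most_common(1)[0][1] / len(cluster))
    return sorted(props, reverse=True)

# Example: a cluster in which 3 of 4 nodes belong to the same category.
labels = {0: "chemical", 1: "chemical", 2: "chemical", 3: "drugs"}
print(homogeneity_proportions([{0, 1, 2, 3}], labels))  # [0.75]
```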

Figure 5: The 52 most homogeneous clusters out of the 57 clusters produced by the Blondel et al. and the Dhillon-Guan-Kulis algorithms. (a) Maximum homogeneity proportions; (b) homogeneity proportion per cluster.

Figure 5(a) alone does not provide a fair comparison, as any singleton cluster would have a homogeneity proportion of 1. In Figure 5(b), the homogeneity proportion is therefore plotted against the size of the extracted clusters for the three algorithms.

The Blondel et al. algorithm produces clusters ranging from 14 nodes to as many as 250,000 nodes, yet the largest homogeneous cluster it extracts has only 38 nodes. The quality of the clusters degrades as their size increases; this trend is very likely due to the resolution limit of modularity.

Both GANC and Dhillon-Guan-Kulis produce a very large cluster (the core cluster (lekovec2008natural, )). Excluding the core, Dhillon-Guan-Kulis produces clusters of roughly uniform size, around 37,000 nodes on average in this example (see Figure 5(b)). This behavior of the Dhillon-Guan-Kulis algorithm is due to the underlying region-growing procedure it adopts from Metis (Karypis99, ).

The clusters extracted by GANC are much smaller than those of the other two algorithms; excluding the core, the cluster sizes are between 7 and 225. Consistent with the earlier observations of Leskovec et al. (lekovec2008natural, ), the strongly-connected clusters of the large graphs listed in Table 4 diminish in number as we move towards the center of the graph. A small homogeneity proportion is therefore expected for the core, since it lumps together several clusters that are not very well-connected. If we examine a partitioning at a lower level of the hierarchy, further clusters are extracted from the core and the core shrinks, although the resulting clusters are not as well-connected in terms of NAssoc. In other words, the most isolated and well-connected clusters are extracted at the higher levels of the hierarchy.

6 Conclusion

We proposed a novel algorithm for maximizing normalized association that consists of three steps: an agglomerative hierarchical clustering procedure, a model order selection step, and a refinement step.

The agglomerative clustering procedure requires O(n log^2 n) operations for many real-life graphs, where n is the number of nodes in the graph, and it dominates the computational complexity of the algorithm. The second step, model order selection, is based on the relative improvement of normalized association as the number of clusters changes; the curvature plot is used to select one or several model orders for the final clustering. Unlike modularity, the curvature metric does not exhibit an intrinsic resolution limit. For a multi-resolution analysis, a user can specify a range on the allowable number of clusters, and the algorithm selects the number of clusters with the maximum curvature in that range. After the number of clusters has been selected, the clusters are passed to the refinement step, which iterates over the boundary nodes of the clusters and explores possible improvements in normalized association by moving each boundary node to a neighboring cluster. Using a map data structure, the overhead added by the refinement is negligible. Experiments show that despite its negligible runtime, the refinement step significantly improves the initial results with respect to normalized association maximization.
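As an illustration of the kind of move the refinement step considers, the sketch below tentatively reassigns each boundary node to a neighboring cluster and keeps the move only if the average NAssoc improves. It deliberately recomputes the metric from scratch (reusing the average_nassoc sketch given earlier), whereas GANC updates it incrementally with a map data structure; the function name and loop structure are our own assumptions, not the paper's implementation.

```python
def refine(adj, clusters, passes=1):
    """adj: {node: {neighbor: weight}}; clusters: list of sets, modified in place.
    Naive illustration of boundary-node refinement; uses average_nassoc above."""
    member = {v: i for i, c in enumerate(clusters) for v in c}
    for _ in range(passes):
        for u in list(adj):
            home = member[u]
            neighbor_clusters = {member[v] for v in adj[u]} - {home}
            # Skip nodes that are not on a cluster boundary, and moves that
            # would empty a cluster.
            if not neighbor_clusters or len(clusters[home]) == 1:
                continue
            best, best_score = home, average_nassoc(adj, clusters)
            for c in neighbor_clusters:
                clusters[home].discard(u); clusters[c].add(u)   # try the move
                score = average_nassoc(adj, clusters)
                clusters[c].discard(u); clusters[home].add(u)   # undo it
                if score > best_score:
                    best, best_score = c, score
            if best != home:                                    # commit the best improving move
                clusters[home].discard(u); clusters[best].add(u)
                member[u] = best
    return clusters
```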

Our experimental analysis on relatively small networks indicated that the proposed algorithm identified partitions that have normalized association values comparable to the spectral algorithms that involve an eigendecomposition. These algorithms are too computationally complex to be applied to very large graphs. We demonstrated that our proposed algorithm can be applied to large graphs (millions of nodes and edges). For these large graphs, the proposed algorithm identifies partitions that have larger values of normalized association than those identified by the only comparable algorithm that directly addresses the normalized cut metric.

No clustering algorithm is suitable for every application. For example, although the Dhillon-Guan-Kulis algorithm fails to extract clusters of nodes from the same categories in the patent citation network, it generates clusters of very similar sizes, which is more suitable for VLSI applications, for instance (li06, ). On the other hand, when clustering is meant to extract "communities" (groups of nodes with strong internal connectivity), GANC and Rosvall-Bergstrom are clearly preferred.

The Blondel et al. and Rosvall-Bergstrom algorithms are not able to generate a partitioning into an arbitrary number of clusters (the available implementation of the Rosvall-Bergstrom algorithm is based on the Blondel et al. algorithm; see http://www.tp.umu.se/~rosvall/algorithm.pdf); even though they are agglomerative, they do not generate the complete hierarchy because they merge several nodes/clusters in each iteration. Hence, if one is interested in several arbitrary clustering levels, GANC best fits this requirement among the competing algorithms discussed.

In this paper we considered only undirected graphs, although edge directions can carry valuable information about the structure of a graph. An extension of GANC could be developed by adopting the generalized normalized cut criterion (meila2007clustering, ).

References

  • (1) S. Wasserman, K. Faust, Social network analysis: Methods and applications, Cambridge Univ. Press, 1994.
  • (2) R. Guimera, L. Amaral, Functional cartography of complex metabolic networks, Nature 433 (7028) (2005) 895–900.
  • (3) Y. Loewenstein, E. Portugaly, M. Fromer, M. Linial, Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space, Bioinformatics 24 (13) (2008) i41 – i49.
  • (4) Z. Chen, Y. He, P. Rosa-Neto, J. Germann, A. Evans, Revealing modular architecture of human brain structural networks by using cortical thickness from MRI, Cerebral Cortex 18 (10) (2008) 2374 – 2381.
  • (5) W. Crum, Spectral Clustering and Label Fusion For 3D Tissue Classification: Sensitivity and Consistency Analysis, in: Proc. Med. Im. Underst. Anal., Dundee, Scotland, 2008.
  • (6) M. Girvan, M. E. J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. 99 (12) (2002) 7821 – 7826.
  • (7) M. Rosvall, C. T. Bergstrom, Maps of random walks on complex networks reveal community structure, Proc. Natl. Acad. Sci. 105 (4) (2008) 1118 – 1123.
  • (8) S. Lafon, A. Lee, Diffusion maps and coarse-graining: a unified framework for dimensionality reduction, graph partitioning, and data set parameterization, IEEE Trans. Patt. Anal. Mach. Intel. 28 (9) (2006) 1393 – 1403.
  • (9) J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Patt. Anal. Mach. Intel. 22 (8) (2000) 888–905.
  • (10) C. Fowlkes, S. Belongie, F. Chung, J. Malik, Spectral grouping using the Nyström method, IEEE Trans. Patt. Anal. Mach. Intel. 26 (2) (2004) 214–225.
  • (11) I. Dhillon, Y. Guan, B. Kulis, Weighted graph cuts without eigenvectors: a multilevel approach, IEEE Trans. Patt. Anal. Mach. Intel. 29 (11) (2007) 1944–1957.
  • (12) L. Yu, C. Ding, Network community discovery: solving modularity clustering via normalized cut, in: Proc. ACM Wkshp Mining and Learn. with Graphs, 2010.
  • (13) A. Sinclair, M. Jerrum, Approximate counting, uniform generation and rapidly mixing Markov chains, Info. Comp. 82 (1) (1989) 93 – 133.
  • (14) W. Donath, A. Hoffman, Algorithms for partitioning of graphs and computer logic based on eigenvectors of connection matrices, IBM Tech. Disc. Bull. 15 (3) (1972) 938 – 944.
  • (15) A. Pothen, H. Simon, K.-P. Liou, Partitioning sparse matrices with eigenvectors of graphs, SIAM J. Matrix Anal. App. 11 (1) (1990) 430–452.
  • (16) B. Hendrickson, R. Leland, A multilevel algorithm for partitioning graphs, in: Proc. ACM Int. Conf. Supercomp., Barcelona, Spain, 1995.
  • (17) G. Karypis, V. Kumar, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM J. Sci. Comp. 20 (1) (1998) 359 – 392.
  • (18) M. Meila, J. Shi, A random walks view of spectral segmentation, in: Proc. Int. Wkshp Art. Intel. Stat., Key West, FL, USA, 2001.
  • (19) A. Ng, M. Jordan, Y. Weiss, On spectral clustering: Analysis and an algorithm, in: Proc. Adv. Neur. Inf. Proc. Sys., Vancouver, BC, Canada, 2001.
  • (20) D. Yan, L. Huang, M. I. Jordan, Fast approximate spectral clustering, in: Proc. ACM Int. Conf. Knowl. Disc. Data Mining, Paris, France, 2009.
  • (21) E. Sharon, A. Brandt, R. Basri, Fast multiscale image segmentation, in: Proc. IEEE Conf. Comp. Vis. Patt. Recog., Hilton Head, SC, USA, 2000.
  • (22) C. Ding, X. He, H. Zha, M. Gu, H. Simon, A min-max cut algorithm for graph partitioning and data clustering, in: Proc. IEEE Int. Conf. Data Mining, San Jose, CA, USA, 2001.
  • (23) E. Sharon, A. Brandt, R. Basri, Segmentation and boundary detection using multiscale intensity measurements, in: Proc. IEEE Conf. Comp. Vis. Patt. Recog., Kauai, HI, USA, 2001.
  • (24) A. Akselrod-Ballin, M. Galun, R. Basri, A. Brandt, M. Gomori, M. Filippi, P. Valsasina, An integrated segmentation and classification approach applied to multiple sclerosis analysis, in: Proc. IEEE Conf. Comp. Vis. Patt. Recog., New York, NY, USA, 2006.
  • (25) J. Corso, E. Sharon, S. Dube, S. El-Saden, U. Sinha, A. Yuille, Efficient multilevel brain tumor segmentation with integrated bayesian model classification, IEEE Trans. Med. Im. 27 (5) (2008) 629 –640.
  • (26) G. Milligan, M. Cooper, An examination of procedures for determining the number of clusters in a data set, Psychometrika 50 (2) (1985) 159–179.
  • (27) M. Halkidi, Y. Batistakis, M. Vazirgiannis, On clustering validation techniques, J. Intel. Inf. Sys. 17 (2) (2001) 107–145.
  • (28) T. Caliński, J. Harabasz, A dendrite method for cluster analysis, Comm. Stat. Theo. Meth. 3 (1) (1974) 1–27.
  • (29) L. Hubert, J. Levin, A general statistical framework for assessing categorical clustering in free recall, Psych. Bull. 83 (6) (1976) 1072–1080.
  • (30) F. Baker, L. Hubert, Measuring the power of hierarchical cluster analysis, J. American Stat. Assoc. (1975) 31–38.
  • (31) W. Sarle, Cubic Clustering Criterion, Tech. rep., Cary, NC: SAS Inst. Inc (1983).
  • (32) R. Tibshirani, G. Walther, T. Hastie, Estimating the number of clusters in a data set via the gap statistic, J. Royal Stat. Soc.: Series B (Stat. Meth.) 63 (2) (2001) 411–423.
  • (33) J. Edachery, A. Sen, F. Brandenburg, Graph clustering using distance-k cliques, in: Graph Drawing, Vol. 1731 of Lecture Notes in Comp. Science, Springer Berlin / Heidelberg, 1999, pp. 98–106.
  • (34) R. Gnanadesikan, J. Kettenring, J. Landwehr, Interpreting and assessing the results of cluster analyses, Bull. Int. Stat. Inst. 47 (2) (1977) 451–463.
  • (35) W. Krzanowski, Y. Lai, A criterion for determining the number of groups in a data set using sum-of-squares clustering, Biometrics 44 (1) (1988) 23–34.
  • (36) T. Pedersen, A. Kulkarni, Automatic cluster stopping with criterion functions and the gap statistic, in: Proc. Conf. North American Chap. Assoc. Comp. Ling. Human Lang. Tech., New York City, NY, USA, 2006.
  • (37) S. Arnold, A test for clusters, J. Market. Res. 16 (4) (1979) 545–551.
  • (38) G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6 (2) (1978) 461–464.
  • (39) H. Akaike, A new look at the statistical model identification, IEEE Trans. Auto. Control 19 (6) (1974) 716 – 723.
  • (40) J. Wolfe, Pattern clustering by multivariate mixture analysis, Multivar. Behav. Res. 5 (3) (1970) 329–350.
  • (41) D. Pelleg, A. Moore, X-means: Extending K-means with Effcient Estimation of the Number of Clusters, in: Proc. Int. Conf. Mach. Learn., Stanford, CA, USA, 2000.
  • (42) M. Mariadassou, S. Robin, C. Vacher, Uncovering latent structure in valued graphs: a variational approach, Ann. App. Stat. 4 (2) (2010) 715–742.
  • (43) P. Latouche, E. Birmelé, C. Ambroise, Bayesian methods for graph clustering, Tech. rep., Laboratoire Statistique et Génome, UMR CNRS 8071-INRA 1152-UEVE, 91000 Evry, France (2008).
  • (44) P. Latouche, E. Birmelé, C. Ambroise, Bayesian methods for graph clustering, in: Adv. Data Anal. Data Hand. Bus. Intel., 2010, pp. 229–239.
  • (45) Y. Gdalyahu, D. Weinshall, M. Werman, Stochastic image segmentation by typical cuts, in: Proc. IEEE Comp. Soc. Conf. Comp. Vis. Patt. Recog., Fort Collins, CO, USA, 1999.
  • (46) D. Harel, Y. Koren, On clustering using random walks, in: Proc. Conf. Found. Soft. Tech. Theor. Comp. Sci., Bangalore, India, 2001.
  • (47) A. Azran, Z. Ghaharamani, Spectral methods for automatic multiscale data clustering, in: Proc. IEEE Conf. Comp. Vis. Patt. Recog., New York, NY, USA, 2006.
  • (48) U. von Luxburg, A tutorial on spectral clustering, Stat. Comp. 17 (4) (2007) 395–416.
  • (49) L. Zelnik-Manor, P. Perona, Self-tuning spectral clustering, in: Proc. Adv. Neur. Inf. Proc. Sys., Vancouver, BC, Canada, 2004.
  • (50) A. Ben-Hur, A. Elisseeff, I. Guyon, A stability based method for discovering structure in clustered data, in: Pac. Symp. Biocomp., Lihue, HI, USA, 2002.
  • (51) T. Lange, V. Roth, M. Braun, J. Buhmann, Stability-based validation of clustering solutions, Neural Comp. 16 (6) (2004) 1299–1323.
  • (52) S. Ben-David, U. Von Luxburg, D. Pál, A sober look at clustering stability, Learning Theory (2006) 5–19.
  • (53) M. Smolkin, D. Ghosh, Cluster stability scores for microarray data in cancer studies, BMC Bioinformatics 4 (1) (2003) 36.
  • (54) S. I. Daitch, J. A. Kelner, D. A. Spielman, Fitting a graph to vector data, in: Proc. ACM Int. Conf. Mach. Learn., Montreal, QC, Canada, 2009.
  • (55) M. Maier, U. Von Luxburg, M. Hein, Influence of graph construction on graph-based clustering measures, in: Proc. Adv. Neur. Inf. Proc. Sys., Vancouver, BC, Canada, 2009.
  • (56) V. Blondel, J. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, J. Stat. Mech.: Theor. Exp. 2008 (2008) P10008.
  • (57) A. Clauset, M. Newman, C. Moore, Finding community structure in very large networks, Phys. Rev. E 70 (6) (2004) 66111.
  • (58) P. Plauger, M. Lee, D. Musser, A. Stepanov, C++ Standard Template Library, Prentice Hall PTR Upper Saddle River, NJ, USA, 2000.
  • (59) B. H. Good, Y.-A. de Montjoye, A. Clauset, Performance of modularity maximization in practical contexts, Phys. Rev. E 81 (4) (2010) 46106 – 46124.
  • (60) J. Leskovec, K. J. Lang, A. Dasgupta, M. W. Mahoney, Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters, Comp. Res. Repos. abs/0810.1355.
  • (61) A. Lancichinetti, S. Fortunato, Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities, Phys. Rev. E 80 (1) (2009) 16118.
  • (62) A. Condon, R. M. Karp, Algorithms for graph partitioning on the planted partition model, Rand. Struct. Alg. 18 (2) (2001) 116 – 140.
  • (63) R. Dunbar, Grooming, gossip, and the evolution of language, Harvard Univ. Press, MA, USA, 1998.
  • (64) M. Downton, T. Brennan, Comparing classifications: an evaluation of several coefficients of partition agreement, in: Proc. Meet. Class. Soc., Boulder, CO, USA, 1980.
  • (65) W. Zachary, An information flow model for conflict and fission in small groups, J. Anthrop. Res. 33 (4) (1977) 452–473.
  • (66) P. Hagmann, L. Cammoun, X. Gigandet, R. Meuli, C. Honey, V. Wedeen, O. Sporns, Mapping the structural core of human cerebral cortex, PLoS Biol. 6 (7) (2008) e159.
  • (67) M. Newman, The structure of scientific collaboration networks, Proc. Natl. Acad. Sci. 98 (2) (2001) 404–409.
  • (68) J. Leskovec, L. A. Adamic, B. A. Huberman, The dynamics of viral marketing, in: Proc. Int. Conf. Elect. Commerce, Ann Arbor, MI, USA, 2006.
  • (69) J. Leskovec, J. Kleinberg, C. Faloutsos, Graphs over time: densification laws, shrinking diameters and possible explanations, in: Proc. ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining, ACM, Chicago, Illinois, USA, 2005.
  • (70) The NBER Citations Data File, http://www.nber.org/patents/.
  • (71) B. Hall, A. Jaffe, M. Trajtenberg, The NBER patent citation data file: Lessons, insights and methodological tools, NBER working paper 8498.
  • (72) J. Li, L. Behjat, A connectivity based clustering algorithm with application to VLSI circuit partitioning, IEEE Trans. Circ. Sys. II: Exp. Briefs 53 (5) (2006) 384 – 388.
  • (73) M. Meila, W. Pentney, Clustering by weighted cuts in directed graphs, in: Proc. SIAM Int. Conf. Data Mining, Minneapolis, MN, USA, 2007.