Hierarchical graph clustering
We present a novel hierarchical graph clustering algorithm inspired by modularity-based clustering techniques. The algorithm is agglomerative and based on a simple distance between clusters induced by the probability of sampling node pairs. We prove that this distance is reducible, which enables the use of the nearest-neighbor chain to speed up the agglomeration. The output of the algorithm is a regular dendrogram, which reveals the multi-scale structure of the graph. The results are illustrated on both synthetic and real datasets.
Many datasets can be represented as graphs, whether the graph is explicitly embedded in the data (e.g., the friendship relation of a social network) or built through some suitable similarity measure between data items (e.g., the number of papers co-authored by two researchers). Such graphs often exhibit a complex, multi-scale community structure where each node is involved in many groups of nodes, so-called communities, of different sizes.
One of the most popular graph clustering algorithms is known as Louvain, named after the university of its inventors [Blondel et al., 2008]. It is based on the greedy maximization of the modularity, a classical objective function introduced in [Newman and Girvan, 2004]. The Louvain algorithm is fast, memory-efficient, and provides meaningful clusters in practice. It does not enable an analysis of the graph at different scales, however [Fortunato and Barthelemy, 2007, Lancichinetti and Fortunato, 2011]. The resolution parameter available in the current version of the algorithm (see the python-louvain Python package) is not directly related to the number of clusters or the cluster sizes and thus is hard to adjust in practice.
In this paper, we present a novel algorithm for hierarchical clustering that captures the multi-scale nature of real graphs. The algorithm is fast, memory-efficient and parameter-free. It relies on a novel notion of distance between clusters induced by the probability of sampling node pairs. We prove that this distance is reducible, which guarantees that the resulting hierarchical clustering can be represented by regular dendrograms and enables a fast implementation of our algorithm through the nearest-neighbor chain scheme, a classical technique for agglomerative algorithms [Murtagh and Contreras, 2012].
The rest of the paper is organized as follows. We present the related work in Section 2. The notation used in the paper, the distance between clusters used to aggregate nodes and the clustering algorithm are presented in Sections 3, 4 and 5. The link with modularity and the Louvain algorithm is explained in Section 6. Section 7 shows the experimental results and Section 8 concludes the paper.
Most graph clustering algorithms are not hierarchical and rely on some resolution parameter that allows one to adapt the clustering to the dataset and to the intended purpose [Reichardt and Bornholdt, 2006, Arenas et al., 2008, Lambiotte et al., 2014, Newman, 2016]. This parameter is hard to adjust in practice, which motivates the present work.
Usual hierarchical clustering techniques apply to vector data [Ward, 1963, Murtagh and Contreras, 2012]. They do not directly apply to graphs, unless the graph is embedded in some metric space, through spectral techniques for instance [Luxburg, 2007, Donetti and Munoz, 2004].
A number of hierarchical clustering algorithms have been developed specifically for graphs. The most popular algorithms are agglomerative and characterized by some distance between clusters, see [Newman, 2004, Pons and Latapy, 2005, Huang et al., 2010, Chang et al., 2011]. None of these distances has been proved to be reducible, a key property of our algorithm. Among non-agglomerative algorithms, the divisive approach of [Newman and Girvan, 2004] is based on the notion of edge betweenness, while the iterative approaches of [Sales-Pardo et al., 2007] and [Lancichinetti et al., 2009] look for local maxima of modularity or some fitness function; other approaches rely on statistical inference [Clauset et al., 2008], replica correlations [Ronhovde and Nussinov, 2009] and graph wavelets [Tremblay and Borgnat, 2014]. To our knowledge, none of these algorithms has been proved to lead to regular dendrograms (that is, without inversion).
Finally, the Louvain algorithm also provides a hierarchy, induced by the successive aggregation steps of the algorithm [Blondel et al., 2008]. This is not a full hierarchy, however, as there are typically only a few aggregation steps. Moreover, the same resolution is used in the optimization of modularity across all levels of the hierarchy, while the number of clusters decreases rapidly after a few aggregation steps. We shall see that our algorithm may be seen as a modified version of Louvain using a sliding resolution that is adapted to the current agglomeration step of the algorithm.
Consider a weighted, undirected graph $G = (V, E)$ of $n$ nodes, with $V = \{1, \ldots, n\}$. Let $A$ be the corresponding weighted adjacency matrix. This is a symmetric, non-negative matrix such that for each $i, j \in V$, $A_{ij} > 0$ if and only if there is an edge between $i$ and $j$, in which case $A_{ij}$ is the weight of edge $\{i, j\}$. We refer to the weight of node $i$ as the sum of the weights of its incident edges,
$$w_i = \sum_{j \in V} A_{ij}.$$
Observe that for unit weights, $w_i$ is the degree of node $i$. The total weight of the nodes is:
$$w = \sum_{i \in V} w_i = \sum_{i, j \in V} A_{ij}.$$
We refer to a clustering $C$ as any partition of $V$. In particular, each element of $C$ is a subset of $V$, which we refer to as a cluster.
The weights induce a probability distribution on node pairs,
$$\forall i, j \in V, \quad p(i, j) = \frac{A_{ij}}{w},$$
and a probability distribution on nodes,
$$\forall i \in V, \quad p(i) = \sum_{j \in V} p(i, j) = \frac{w_i}{w}.$$
Observe that the joint distribution $p(i, j)$ depends on the graph (in particular, only neighbors $i, j$ are sampled with positive probability), while the marginal distribution $p(i)$ depends on the graph through the node weights only.
We define the distance between two distinct nodes $i, j$ as the node pair sampling ratio (this distance is not a metric in general; we only require symmetry and non-negativity):
$$d(i, j) = \frac{p(i) p(j)}{p(i, j)}, \tag{1}$$
with $d(i, j) = +\infty$ if $p(i, j) = 0$ (i.e., $i$ and $j$ are not neighbors). Nodes $i, j$ are close with respect to this distance if the pair $i, j$ is sampled much more frequently through the joint distribution $p(i, j)$ than through the product distribution $p(i) p(j)$. For unit weights, the joint distribution is uniform over the edges, so that the closest node pair is the pair of neighbors having the lowest degree product.
Another interpretation of the node distance follows from the conditional probability,
$$\forall i, j \in V, \quad p(i | j) = \frac{p(i, j)}{p(j)}.$$
This is the conditional probability of sampling $i$ given that $j$ is sampled (from the joint distribution). The distance between $i$ and $j$ can then be written
$$d(i, j) = \frac{p(i)}{p(i | j)} = \frac{p(j)}{p(j | i)}.$$
Nodes $i, j$ are close with respect to this distance if $i$ (respectively, $j$) is sampled much more frequently given that $j$ is sampled (respectively, $i$).
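As a quick numerical check of the node pair sampling ratio, the following minimal sketch computes $d(i, j) = p(i) p(j) / p(i, j)$ on a small path graph. It assumes the graph is given as a symmetric weight matrix stored as nested lists; the function name `node_distances` and the example graph are illustrative choices, not taken from the authors' code.

```python
from math import inf

def node_distances(adj):
    """Return the matrix of node pair sampling ratios d(i, j) = p(i) p(j) / p(i, j).

    `adj` is assumed to be a symmetric, non-negative weight matrix (list of lists).
    Pairs of nodes that are not neighbors get an infinite distance.
    """
    n = len(adj)
    node_weight = [sum(row) for row in adj]           # w_i
    total = sum(node_weight)                          # w
    p_node = [w_i / total for w_i in node_weight]     # p(i) = w_i / w
    dist = [[inf] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and adj[i][j] > 0:
                p_pair = adj[i][j] / total            # p(i, j) = A_ij / w
                dist[i][j] = p_node[i] * p_node[j] / p_pair
    return dist

# Path graph 0 - 1 - 2 - 3 with unit weights (degrees 1, 2, 2, 1).
path = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
dist = node_distances(path)
# Closest pair of nodes, as a (distance, i, j) tuple.
closest = min((dist[i][j], i, j) for i in range(4) for j in range(i + 1, 4))
```

With unit weights the distance reduces to the degree product divided by the total weight, so the two end edges of the path (degree product 2) are closer than the middle edge (degree product 4).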
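Similarly, consider a clustering $C$ of the graph (that is, a partition of $V$). The weights induce a probability distribution on cluster pairs,
$$\forall a, b \in C, \quad p(a, b) = \sum_{i \in a, j \in b} p(i, j),$$
and a probability distribution on clusters,
$$\forall a \in C, \quad p(a) = \sum_{i \in a} p(i) = \sum_{b \in C} p(a, b).$$
We define the distance between two distinct clusters $a, b$ as the cluster pair sampling ratio:
$$d(a, b) = \frac{p(a) p(b)}{p(a, b)}, \tag{2}$$
with $d(a, b) = +\infty$ if $p(a, b) = 0$ (i.e., there is no edge between clusters $a$ and $b$). Defining the conditional probability
$$\forall a, b \in C, \quad p(a | b) = \frac{p(a, b)}{p(b)},$$
which is the conditional probability of sampling $a$ given that $b$ is sampled, we get
$$d(a, b) = \frac{p(a)}{p(a | b)} = \frac{p(b)}{p(b | a)}.$$
This distance will be used in the agglomerative algorithm to merge the closest clusters. We have the following key results.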
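Proposition 1 (Update formula). For any distinct clusters $a, b, c \in C$,
$$d(a \cup b, c) = \left( \frac{p(a)}{p(a \cup b)} \frac{1}{d(a, c)} + \frac{p(b)}{p(a \cup b)} \frac{1}{d(b, c)} \right)^{-1}.$$
Proof. We have:
$$\frac{p(a \cup b) p(c)}{d(a \cup b, c)} = p(a \cup b, c) = p(a, c) + p(b, c) = \frac{p(a) p(c)}{d(a, c)} + \frac{p(b) p(c)}{d(b, c)},$$
from which the formula follows.
Proposition 2 (Reducibility). For any distinct clusters $a, b, c \in C$,
$$d(a \cup b, c) \ge \min(d(a, c), d(b, c)).$$
This follows from the update formula: since $p(a \cup b) = p(a) + p(b)$, the distance $d(a \cup b, c)$ is a weighted harmonic mean of $d(a, c)$ and $d(b, c)$, and thus lies between them. By the reducibility property, merging clusters $a$ and $b$ cannot decrease their minimum distance to any other cluster $c$.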
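The update formula and the reducibility property can be checked numerically. The sketch below, under the assumption that the graph is a small symmetric weight matrix, compares the cluster distance computed directly from its definition $p(a) p(b) / p(a, b)$ with the value given by the weighted harmonic mean of the two previous distances. The helper names and the example weights are hypothetical.

```python
from math import inf, isclose

def cluster_prob(adj, a):
    """p(a): total weight of the nodes of cluster `a`, divided by the total weight."""
    total = sum(sum(row) for row in adj)
    return sum(sum(adj[i]) for i in a) / total

def cluster_distance(adj, a, b):
    """d(a, b) = p(a) p(b) / p(a, b), infinite if no edge links a and b."""
    total = sum(sum(row) for row in adj)
    p_ab = sum(adj[i][j] for i in a for j in b) / total
    if p_ab == 0:
        return inf
    return cluster_prob(adj, a) * cluster_prob(adj, b) / p_ab

def merged_distance(p_a, p_b, d_ac, d_bc):
    """Distance between the merged cluster (a and b) and c, from the update formula."""
    # Dividing by an infinite distance gives 0.0, which is the intended limit.
    inverse = (p_a / (p_a + p_b)) / d_ac + (p_b / (p_a + p_b)) / d_bc
    return inf if inverse == 0 else 1.0 / inverse

# Illustrative weighted graph on 4 nodes (symmetric weight matrix).
adj = [
    [0, 2, 3, 0],
    [2, 0, 1, 0],
    [3, 1, 0, 1],
    [0, 0, 1, 0],
]
a, b, c = {0}, {1}, {2}
d_ac = cluster_distance(adj, a, c)
d_bc = cluster_distance(adj, b, c)
direct = cluster_distance(adj, a | b, c)
updated = merged_distance(cluster_prob(adj, a), cluster_prob(adj, b), d_ac, d_bc)
```

Both computations agree, and the merged distance is not smaller than the smaller of the two original distances, as reducibility requires.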
The agglomerative approach consists in starting from individual clusters (i.e., each node is in its own cluster) and merging clusters recursively. At each step of the algorithm, the two closest clusters are merged. We obtain the following algorithm:
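1. Initialization
$C \leftarrow \{\{1\}, \ldots, \{n\}\}$; $L \leftarrow \emptyset$
2. Agglomeration
For $t = 1, \ldots, n - 1$,
$a, b \leftarrow$ closest clusters of $C$ (the pair minimizing $d(a, b)$)
$C \leftarrow C \setminus \{a, b\}$; $C \leftarrow C \cup \{a \cup b\}$
$L \leftarrow L \cup \{\{a, b\}\}$
3. Return $L$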
The successive clusterings $C_0, C_1, \ldots, C_{n-1}$ produced by the algorithm, with $C_0 = \{\{1\}, \ldots, \{n\}\}$, can be recovered from the list $L$ of successive merges. Observe that clustering $C_t$ consists of $n - t$ clusters, for $t = 0, 1, \ldots, n - 1$. By the reducibility property, the corresponding sequence of distances $d_1, \ldots, d_{n-1}$ between merged clusters, with $d_t$ the distance between the two clusters merged at step $t$, is non-decreasing, resulting in a regular dendrogram (that is, without inversions) [Murtagh and Contreras, 2012].
It is worth noting that the graph $G$ does not need to be connected. If the graph consists of $k$ connected components, then the clustering $C_{n-k}$ gives these $k$ connected components, whose respective distances are infinite; the $k - 1$ last merges can then be done in an arbitrary order. Moreover, the hierarchies associated with these connected components are independent of one another (i.e., the algorithm successively applied to the corresponding subgraphs would produce exactly the same clustering). Similarly, we expect the clustering of weakly connected subgraphs to be approximately independent of one another. This is not the case of the Louvain algorithm, whose clustering depends on the whole graph through the total weight $w$, a shortcoming related to the resolution limit of modularity (see Section 6).
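In view of (2), for any clustering $C$ of $V$, the distance $d(a, b)$ between two clusters $a, b \in C$ is the distance between two nodes of the following aggregate graph: nodes are the elements of $C$ and the weight between $a, b \in C$ (including the case $a = b$, corresponding to a self-loop) is $p(a, b)$. Thus the agglomerative algorithm can be implemented by merging nodes and updating the weights (and thus the distances between nodes) at each step of the algorithm. Since the initial nodes of the graph are indexed from $1$ to $n$, we index the cluster created at step $t$ of the algorithm by $n + t$. We obtain the following version of the above algorithm, where the clusters are replaced by their respective indices:
1. Initialization
$L \leftarrow \emptyset$
2. Agglomeration
For $t = 1, \ldots, n - 1$,
$i, j \leftarrow$ pair of nodes that minimizes $d(i, j)$
$L \leftarrow L \cup \{\{i, j\}\}$
$p(n + t) \leftarrow p(i) + p(j)$
$p(n + t, u) \leftarrow p(i, u) + p(j, u)$ for each node $u \ne i, j$ that is a neighbor of $i$ or $j$
3. Return $L$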
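The index-based algorithm above can be sketched in a few lines of Python. This is a naive version that scans all edges of the aggregate graph at each step, written for readability rather than speed, and it is not the authors' implementation. It assumes 0-based node indices (so the cluster created at step `t`, counted from 0, gets index `n + t`), an edge dictionary `{(i, j): weight}` without self-loops, and a hypothetical function name `paris_greedy`.

```python
from math import inf

def paris_greedy(n, edges):
    """Return the list of merges (u, v, distance) of the agglomerative algorithm."""
    nbrs = {u: {} for u in range(n)}        # nbrs[u][v]: weight between clusters u and v
    weight = {u: 0.0 for u in range(n)}     # weight[u]: total weight of cluster u
    for (i, j), a in edges.items():
        nbrs[i][j] = nbrs[i].get(j, 0.0) + a
        nbrs[j][i] = nbrs[j].get(i, 0.0) + a
        weight[i] += a
        weight[j] += a
    total = sum(weight.values())
    merges = []
    for t in range(n - 1):
        # Global search for the closest pair: d(u, v) = w_u w_v / (w A_uv).
        best_d, u, v = inf, None, None
        for x in nbrs:
            for y, a in nbrs[x].items():
                d = weight[x] * weight[y] / (total * a)
                if x < y and d < best_d:
                    best_d, u, v = d, x, y
        if u is None:
            # Only disconnected clusters remain: merge in arbitrary order.
            u, v = sorted(nbrs)[:2]
        new = n + t
        merged = {}
        for x in (u, v):
            for y, a in nbrs.pop(x).items():
                if y != u and y != v:
                    merged[y] = merged.get(y, 0.0) + a   # weights to a common neighbor add up
                    del nbrs[y][x]
        for y, a in merged.items():
            nbrs[y][new] = a
        nbrs[new] = merged
        weight[new] = weight.pop(u) + weight.pop(v)
        merges.append((u, v, best_d))
    return merges

# Illustrative weighted graph on 4 nodes.
example_edges = {(0, 1): 2, (0, 2): 3, (1, 2): 1, (2, 3): 1}
merges = paris_greedy(4, example_edges)
```

On this example the merge distances come out in non-decreasing order, which is what the reducibility property guarantees. Self-loops of the aggregate graph are not stored because they do not enter the distance between two distinct clusters.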
By the reducibility property of the distance, the algorithm can be implemented through the nearest-neighbor chain scheme [Murtagh and Contreras, 2012]. Starting from an arbitrary node, a chain of nearest neighbors is formed. Whenever two nodes of the chain are mutual nearest neighbors, these two nodes are merged and the chain is updated recursively, until the initial node is eventually merged. This scheme reduces the search of a global minimum (the pair of nodes $i, j$ that minimizes $d(i, j)$) to that of a local minimum (any pair of nodes $i, j$ such that $d(i, j) = \min_k d(i, k) = \min_k d(k, j)$), which speeds up the algorithm while returning exactly the same hierarchy. It only requires a consistent tie-breaking rule for equal distances (e.g., any node at equal distance of $i$ and $j$ is considered as closer to $i$ if and only if $i < j$). Observe that the space complexity of the algorithm is in $O(m)$, where $m$ is the number of edges of $G$ (i.e., the graph size).
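A minimal sketch of the nearest-neighbor chain scheme is given below, under several assumptions that are not from the paper: the graph is connected, has no self-loops, uses 0-based node indices, and is passed as a dictionary `{(i, j): weight}`. The function name `paris_nn_chain` is hypothetical, and ties are broken in favor of the previous node of the chain rather than by the index rule mentioned above. Merges are returned in the order in which they are found, so they must be sorted by distance to recover the order of the greedy algorithm.

```python
from math import inf

def paris_nn_chain(n, edges):
    """Return merges (u, v, distance) found by the nearest-neighbor chain scheme."""
    nbrs = {u: {} for u in range(n)}        # aggregate graph: nbrs[u][v] = weight
    weight = {u: 0.0 for u in range(n)}     # cluster weights
    for (i, j), a in edges.items():
        nbrs[i][j] = nbrs[i].get(j, 0.0) + a
        nbrs[j][i] = nbrs[j].get(i, 0.0) + a
        weight[i] += a
        weight[j] += a
    total = sum(weight.values())

    def dist(u, v):
        return weight[u] * weight[v] / (total * nbrs[u][v])

    merges, chain, new = [], [], n
    while len(nbrs) > 1:
        if not chain:
            chain.append(next(iter(nbrs)))  # restart from an arbitrary cluster
        top = chain[-1]
        if not nbrs[top]:
            raise ValueError("this sketch assumes a connected graph")
        prev = chain[-2] if len(chain) > 1 else None
        # Nearest neighbor of `top`; on ties, keep the previous node of the chain.
        best, best_d = prev, (dist(top, prev) if prev is not None else inf)
        for v in nbrs[top]:
            d = dist(top, v)
            if d < best_d:
                best, best_d = v, d
        if best != prev:
            chain.append(best)              # extend the chain
            continue
        # `top` and `prev` are mutual nearest neighbors: merge them.
        chain.pop()
        chain.pop()
        merged = {}
        for x in (top, prev):
            for y, a in nbrs.pop(x).items():
                if y != top and y != prev:
                    merged[y] = merged.get(y, 0.0) + a
                    del nbrs[y][x]
        for y, a in merged.items():
            nbrs[y][new] = a
        nbrs[new] = merged
        weight[new] = weight.pop(top) + weight.pop(prev)
        merges.append((prev, top, best_d))
        new += 1
    return merges

example_edges = {(0, 1): 2, (0, 2): 3, (1, 2): 1, (2, 3): 1}
chain_merges = paris_nn_chain(4, example_edges)
sorted_distances = sorted(d for _, _, d in chain_merges)
```

Only the neighbors of the node at the top of the chain are inspected at each step, instead of all edges of the aggregate graph; this is where the speed-up over a global search comes from.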
The modularity is a standard metric to assess the quality of a clustering $C$ (any partition of $V$). Let $\delta_C(i, j) = 1$ if $i, j$ are in the same cluster under clustering $C$, and $\delta_C(i, j) = 0$ otherwise. The modularity of clustering $C$ is defined by [Newman and Girvan, 2004]:
$$Q(C) = \frac{1}{w} \sum_{i, j \in V} \left( A_{ij} - \frac{w_i w_j}{w} \right) \delta_C(i, j), \tag{3}$$
which can be written in terms of probability distributions,
$$Q(C) = \sum_{i, j \in V} \left( p(i, j) - p(i) p(j) \right) \delta_C(i, j).$$
Thus the modularity is the difference between the probabilities of sampling two nodes of the same cluster under the joint distribution $p(i, j)$ and under the product distribution $p(i) p(j)$. It can also be expressed from the probability distributions at the cluster level,
$$Q(C) = \sum_{a \in C} \left( p(a, a) - p(a)^2 \right).$$
It is clear from (3) that any clustering maximizing modularity has some resolution limit, as pointed out in [Fortunato and Barthelemy, 2007], because the second term is normalized by the total weight $w$ and thus becomes negligible for too small clusters. To go beyond this resolution limit, it is necessary to introduce a multiplicative factor $\gamma$, called the resolution. The modularity becomes:
$$Q_\gamma(C) = \sum_{i, j \in V} \left( p(i, j) - \gamma p(i) p(j) \right) \delta_C(i, j), \tag{4}$$
or equivalently,
$$Q_\gamma(C) = \sum_{a \in C} \left( p(a, a) - \gamma p(a)^2 \right).$$
This resolution parameter can be interpreted through the Potts model of statistical physics [Reichardt and Bornholdt, 2006], random walks [Lambiotte et al., 2014], or statistical inference of a stochastic block model [Newman, 2016]. For $\gamma \to 0$, the resolution is minimum and there is a single cluster, that is $C = \{\{1, \ldots, n\}\}$; for $\gamma \to +\infty$, the resolution is maximum and each node has its own cluster, that is $C = \{\{1\}, \ldots, \{n\}\}$.
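To make the role of the resolution concrete, the sketch below evaluates the modularity with resolution, written as the sum over node pairs of the same cluster of $p(i, j) - \gamma p(i) p(j)$, on a toy graph made of two triangles joined by one edge. The function name `modularity`, the `resolution` argument and the example graph are illustrative assumptions.

```python
def modularity(adj, labels, resolution=1.0):
    """Modularity of the clustering `labels` (one cluster label per node).

    Sums p(i, j) - resolution * p(i) p(j) over all ordered node pairs (i, j)
    that belong to the same cluster, including i = j.
    """
    n = len(adj)
    total = sum(sum(row) for row in adj)
    p_node = [sum(row) / total for row in adj]
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] / total - resolution * p_node[i] * p_node[j]
    return q

# Two triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3), unit weights.
edge_list = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for i, j in edge_list:
    adj[i][j] = adj[j][i] = 1

two_triangles = [0, 0, 0, 1, 1, 1]
single_cluster = [0] * 6
singletons = list(range(6))
q_two = modularity(adj, two_triangles)
```

On this graph the two-triangle clustering has the highest modularity at resolution 1, the single cluster wins at resolution 0, and the singletons win at a large resolution such as 10, in line with the two limits described above.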
The Louvain algorithm consists, for any fixed resolution parameter $\gamma$, of the following steps:
1. Initialization
$C \leftarrow \{\{1\}, \ldots, \{n\}\}$
2. Iteration
While modularity $Q_\gamma(C)$ increases, update $C$ by moving one node from one cluster to another.
3. Aggregation
Merge all nodes belonging to the same cluster, update the weights and apply step 2 to the resulting aggregate graph while modularity is increased.
4. Return $C$
The result of step 2 depends on the order in which nodes and clusters are considered; typically, nodes are considered in a cyclic way and the target cluster of each node is that maximizing the modularity increase.
Our algorithm can be viewed as a modularity-maximizing scheme with a sliding resolution. Starting from the maximum resolution where each node has its own cluster, we look for the first value of the resolution parameter $\gamma$, say $\gamma_1$, that triggers a single merge between two nodes, resulting in clustering $C_1$. In view of (4), we have:
$$\gamma_1 = \max_{i \ne j} \frac{p(i, j)}{p(i) p(j)}.$$
These two nodes are merged (corresponding to the aggregation phase of the Louvain algorithm) and we look for the next value of the resolution parameter, say $\gamma_2$, that triggers a single merge between two nodes, resulting in clustering $C_2$, and so on. By construction, the resolution at time $t$ (that triggers the $t$-th merge) is $\gamma_t = 1 / d_t$ and the corresponding clustering $C_t$ is that of our algorithm. In particular, the sequence of resolutions $\gamma_1, \ldots, \gamma_{n-1}$ is non-increasing.
To summarize, our algorithm consists of a simple but deep modification of the Louvain algorithm, where the iterative step (step 2) is replaced by a single merge, at the best current resolution (that resulting in a single merge). In particular, unlike the Louvain algorithm, our algorithm provides a full hierarchy. Moreover, the sequence of resolutions can be used as an input to the Louvain algorithm. Specifically, the resolution $\gamma_t$ provides exactly $n - t$ clusters in our case, and the Louvain algorithm is expected to provide approximately the same number of clusters at this resolution.
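The link between merge distances and resolutions can be illustrated as follows. Merging two clusters $a, b$ changes the modularity with resolution $\gamma$ by $2 (p(a, b) - \gamma p(a) p(b))$, which is positive exactly when $\gamma$ is below $p(a, b) / (p(a) p(b))$, the inverse of the cluster distance; the resolution that triggers a merge is therefore the inverse of the merge distance. The sketch below assumes a list of non-decreasing merge distances (hypothetical values for a graph of 4 nodes) and counts a merge as triggered when its resolution is at least the target resolution, which is a convention chosen here for the boundary case.

```python
def resolutions(merge_distances):
    """Resolution triggering each merge: the inverse of the merge distance."""
    # An infinite distance (disconnected clusters) gives resolution 0.0.
    return [1.0 / d for d in merge_distances]

def clusters_at_resolution(n, merge_distances, gamma):
    """Number of clusters obtained by applying all merges with resolution >= gamma."""
    triggered = sum(1 for g in resolutions(merge_distances) if g >= gamma)
    return n - triggered

# Hypothetical merge distances of a dendrogram on n = 4 nodes.
distances = [5 / 14, 15 / 28, 6 / 7]
gammas = resolutions(distances)
n_clusters = [clusters_at_resolution(4, distances, g) for g in (3.0, 2.0, 1.0)]
```

A single run of the agglomerative algorithm thus gives the number of clusters for every resolution, whereas a fixed-resolution method has to be run once per value.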
We have coded our hierarchical clustering algorithm, which we refer to as Paris (Pairwise AgglomeRation Induced by Sampling), in Python. All material necessary to reproduce the experiments presented below is available online (see https://github.com/tbonald/paris).
We start with a hierarchical stochastic block model, as described in [Lyzinski et al., 2017]. The nodes are structured in 2 levels, with blocks of 40 nodes at level 1, each block of 40 nodes being divided into smaller blocks at level 2 (see Figure 1).
The output of Paris is shown in Figure 2 as a dendrogram where the distances (on the $y$-axis) are in log-scale. The two levels of hierarchy clearly appear.
We also show in Figure 3 the number of clusters with respect to the resolution parameter $\gamma$ for Paris (top) and Louvain (bottom). The results are very close, and clearly show the hierarchical structure of the model (vertical lines correspond to changes in the number of clusters). The key difference between both algorithms is that, while Louvain needs to be run for each resolution parameter (here, 100 values), Paris is run only once, the relevant resolutions being direct outputs of the algorithm, embedded in the dendrogram (see Section 6).
We now consider four real datasets, whose characteristics are summarized in Table 1.
Graph | # nodes | # edges | Avg. degree |
---|---|---|---|
OpenStreet | 5,993 | 6,957 | 2.3 |
OpenFlights | 3,097 | 18,193 | 12 |
Wikipedia Schools | 4,589 | 106,644 | 46 |
Wikipedia Humans | 702,782 | 3,247,884 | 9.2 |
The first dataset, extracted from OpenStreetMap (https://openstreetmap.org), is the graph formed by the streets of the center of Paris. To illustrate the quality of the hierarchical clustering returned by our algorithm, we have extracted the two “best” clusterings, in terms of ratio between successive merge distances in the corresponding dendrogram; the results are shown in Figure 4. The best clustering gives two clusters, Rive Droite (with Ile de la Cité) and Rive Gauche, the two banks separated by the river Seine; the second best clustering divides these two clusters into sub-clusters.
The second dataset, extracted from OpenFlights (https://openflights.org), is the graph of airports with the weight between two airports equal to the number of daily flights between them. We run Paris and extract the best clusterings from the largest component of the graph, as for the OpenStreet graph. The first two best clusterings isolate the Iceland/Greenland area and Alaska from the rest of the world, the corresponding airports forming dense clusters, lightly connected with the other airports. The following two best clusterings are shown in Figure 5, with respectively 5 and 10 clusters corresponding to meaningful continental regions of the world.
The third dataset is the graph formed by links between pages of Wikipedia for Schools (https://schools-wikipedia.org), see [West et al., 2009, West and Leskovec, 2012]. The graph is considered as undirected. Table 2 (top) shows the 10 largest clusters of $C_{n-100}$, the 100 last (and thus largest) clusters found by Paris. Only pages of highest degrees are shown for each cluster.
Observe that the ability to select the clustering associated with some target number of clusters is one of the key advantages of Paris over Louvain. Moreover, Paris gives a full hierarchy of the pages, meaning that each of these clusters is divided into sub-clusters in the output of the algorithm. Table 2 (bottom) gives for instance, among the 500 clusters found by Paris (that is, in $C_{n-500}$), the 10 largest clusters that are subclusters of the first cluster of Table 2 (top), related to animals / taxonomy. The subclusters tend to give meaningful groups of animals, revealing the multi-scale structure of the dataset.
The fourth dataset is the subgraph of Wikipedia restricted to pages related to humans. We have done the same experiment as for Wikipedia for Schools. The results are shown in Table 3. Again, we observe that clusters form relevant groups of people, with cluster #1 corresponding to political figures for instance, this cluster consisting of meaningful subgroups as shown in the bottom of Table 3. All this information is embedded in the dendrogram returned by Paris.
To assess the quality of the hierarchical clustering, we use the cost function proposed in [Dasgupta, 2016] and given by:
$$\sum_{a, b} p(a, b) (|a| + |b|), \tag{5}$$
where the sum is over all left and right clusters $a, b$ attached to any internal node of the tree representing the hierarchy. This is the expected size of the smallest subtree containing two random nodes $i, j$, sampled from the joint distribution $p(i, j)$ introduced in Section 4. If the tree indeed reflects the underlying hierarchical structure of the graph, we expect most edges to link nodes that are close in the tree, i.e., whose common ancestor is relatively far from the root. The size of the corresponding subtree (whose root is this common ancestor) is expected to be small, meaning that (5) is a relevant cost function. Moreover, it was proved in [Cohen-Addad et al., 2018] that if the graph is perfectly hierarchical, the underlying tree is optimal with respect to this cost function.
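The cost can be evaluated from a list of merges, as in the sketch below. It follows the interpretation stated above: the expected size of the smallest cluster of the hierarchy containing both ends of a node pair sampled from the joint distribution, divided by the number of nodes so that the value lies between 0 and 1. The conventions are assumptions made for illustration: 0-based node indices, an edge dictionary `{(i, j): weight}`, merges given as pairs `(u, v)` in the order they are performed, and the cluster created by the merge of rank `t` (from 0) indexed by `n + t`.

```python
def normalized_tree_cost(n, edges, merges):
    """Expected size of the smallest cluster containing a sampled node pair, over n."""
    members = {u: {u} for u in range(n)}     # nodes of each current cluster
    cluster_of = list(range(n))              # current cluster index of each node
    total = 2.0 * sum(edges.values())        # total node weight w
    pending = dict(edges)                    # edges whose ends are not yet merged
    cost = 0.0
    for t, (u, v) in enumerate(merges):
        new = n + t
        members[new] = members.pop(u) | members.pop(v)
        for i in members[new]:
            cluster_of[i] = new
        size = len(members[new])
        for (i, j), a in list(pending.items()):
            if cluster_of[i] == cluster_of[j]:
                # Pairs (i, j) and (j, i) each have probability a / w.
                cost += 2.0 * a / total * size
                del pending[(i, j)]
    return cost / n

# Illustrative graph and hierarchy: merge {2, 3}, then {0, 1}, then the two clusters.
example_edges = {(0, 1): 2, (0, 2): 3, (1, 2): 1, (2, 3): 1}
example_merges = [(2, 3), (0, 1), (4, 5)]
cost = normalized_tree_cost(4, example_edges, example_merges)
```

A hierarchy that merges heavily connected nodes early places most of the edge weight in small subtrees and therefore gets a low cost.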
The results are presented in Table 5 for the graphs considered so far and the graphs of Table 4, selected from the SNAP datasets [Leskovec and Krevl, 2014]. The cost function is normalized by the number of nodes $n$ so as to get a value between 0 and 1. We compare the performance of Paris to that of a spectral algorithm where the nodes are embedded in a space of dimension 20 by using the 20 leading eigenvectors of the Laplacian matrix $L = D - A$ ($D$ is the diagonal matrix of node weights) and applying the Ward method in the embedding space. The spectral decomposition of the Laplacian is based on standard functions on sparse matrices available in the Python package scipy.
Observe that we do not include Louvain in these experiments as this algorithm does not provide a full hierarchy of the graph, so that the cost function (5) is not applicable.
Table 4: Graphs selected from the SNAP datasets.
Graph | # nodes | # edges | Avg. degree |
---|---|---|---|
Facebook | 4,039 | 88,234 | 44 |
Amazon | 334,863 | 925,872 | 5.5 |
DBLP | 317,080 | 1,049,866 | 6.6 |
Twitter | 81,306 | 1,342,310 | 33 |
Youtube | 1,134,890 | 2,987,624 | 5.2 |
Google | 855,802 | 4,291,352 | 10 |
The results are shown when the algorithm runs within some time limit (less than 1 hour on a computer equipped with a 2.8GHz Intel Core i7 CPU with 16GB of RAM). The best performance is displayed in bold characters. Observe that both algorithms have similar performance on those graphs where the results are available. However, Paris is much faster than the spectral algorithm, as shown by Table 6 (for each algorithm, the initial load or copy of the graph is not included in the running time; running times exceeding 1 hour are not shown). Paris is even faster than Louvain in most cases, while providing much richer information on the graph.
Table 5: Normalized cost of the hierarchy.
Graph | Spectral | Paris |
---|---|---|
OpenStreet | 0.0103 | 0.0102 |
OpenFlights | 0.125 | 0.130 |
Facebook | 0.0479 | 0.0469 |
Wikipedia Schools | 0.452 | 0.402 |
Amazon | - | 0.0297 |
DBLP | - | 0.110 |
Twitter | - | 0.0908 |
Youtube | - | 0.185 |
Wikipedia Humans | - | 131 |
Google | - | 0.0121 |
Table 6: Running times.
Graph | Spectral | Louvain | Paris |
---|---|---|---|
OpenStreet | 0.86s | 0.19s | 0.17s |
OpenFlights | 0.51s | 0.31s | 0.33s |
Facebook | 5.9s | 1s | 0.71s |
Wikipedia Schools | 16s | 2.2s | 1.5s |
Amazon | - | 45s | 43s |
DBLP | - | 52s | 31s |
Twitter | - | 35s | 21s |
Youtube | - | 8 min | 16 min 30s |
Wikipedia Humans | - | 2 min 30s | 2 min 10s |
Google | - | 3 min 50s | 1 min 50s |
We have proposed an agglomerative clustering algorithm for graphs based on a reducible distance between clusters. The algorithm captures the multi-scale structure of real graphs. It is parameter-free, fast and memory-efficient. In the future, we plan to work on the automatic extraction of clusterings from the dendrogram, at the most relevant resolutions.
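[Dasgupta, 2016] Dasgupta, S. (2016). A cost function for similarity-based hierarchical clustering. In Proceedings of ACM Symposium on Theory of Computing.
[Luxburg, 2007] Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing.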