1 Introduction
Many datasets have a graph structure. Examples include infrastructure networks, communication networks, social networks, databases and co-occurrence networks, to name a few. These graphs often exhibit a complex, multi-scale structure where each node belongs to many groups of nodes, so-called clusters, of different sizes.
Hierarchical graph clustering is a common technique to reveal the multiscale structure of complex networks. Instead of looking for a single partition of the set of nodes, as in usual clustering techniques, the graph is represented by a hierarchical structure known as a dendrogram, which can then be used to find relevant clusterings at different resolutions, by suitable cuts of this dendrogram.
We address the issue of the quality of a dendrogram representing a graph. For usual clustering, the quality of a partition is commonly assessed through its modularity, which corresponds to the proportion of edges within clusters, compared to that of a random graph [7]. For hierarchical clustering, a cost function has recently been proposed by Dasgupta [3] and extended by Cohen-Addad and his co-authors [2]; it can be viewed as the expected size of the smallest cluster (as induced by the hierarchy) containing two random nodes. In this paper, we propose a quality metric based on the ability to reconstruct the graph from the dendrogram. The optimal representation of the graph defines a class of reducible linkages leading to regular dendrograms by greedy agglomerative clustering.
In the next section, we introduce the sampling distributions of nodes and clusters induced by the graph; these play a central role in our approach. We then formalize the problem of graph representation by a dendrogram. Our quality metric follows from the characterization of the optimal solution in terms of graph reconstruction. The corresponding hierarchical graph clustering algorithms are then presented and interpreted in terms of modularity.
2 Sampling distribution
Consider a weighted, undirected, connected graph $G = (V, E)$ of $n$ nodes, without self-loops. Let $w(u,v)$ be equal to the weight of edge $u,v$, if any, and to 0 otherwise. We refer to the weight of node $u$ as:
$$w(u) = \sum_{v \in V} w(u,v).$$
We denote by $w$ the total weight of nodes:
$$w = \sum_{u \in V} w(u) = \sum_{u,v \in V} w(u,v).$$
Similarly, for any sets $A, B \subset V$, let
$$w(A,B) = \sum_{u \in A, v \in B} w(u,v),$$
and
$$w(A) = w(A,V) = \sum_{u \in A} w(u).$$
Node sampling.
The weights induce a probability distribution on node pairs:
$$p(u,v) = \frac{w(u,v)}{w},$$
with marginal distribution:
$$p(u) = \sum_{v \in V} p(u,v) = \frac{w(u)}{w}.$$
Cluster sampling.
For any partition $\mathcal{P}$ of $V$ into clusters, the weights induce a probability distribution on cluster pairs:
$$p(A,B) = \frac{w(A,B)}{w}, \quad A, B \in \mathcal{P},$$
with marginal distribution:
$$p(A) = \sum_{B \in \mathcal{P}} p(A,B) = \frac{w(A)}{w}.$$
The joint distribution $p(A,B)$ is the relative frequency of moves from cluster $A$ to cluster $B$ by the random walk associated with the graph.
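These sampling distributions are straightforward to compute. The snippet below (our own toy example, with made-up weights and names) builds the node pair distribution $p(u,v)$, its marginal $p(u)$, and the cluster pair distribution for a small partition:

```python
from collections import defaultdict

# Toy example (ours): sampling distributions induced by the edge weights.
edges = {(0, 1): 2.0, (1, 2): 1.0, (2, 3): 3.0}  # undirected, no self-loops

w = defaultdict(float)  # symmetric weight function w(u, v)
for (u, v), x in edges.items():
    w[u, v] += x
    w[v, u] += x

nodes = sorted({u for e in edges for u in e})
w_node = {u: sum(w[u, v] for v in nodes) for u in nodes}  # node weight w(u)
w_tot = sum(w_node.values())                              # total weight w

p = {(u, v): w[u, v] / w_tot for u in nodes for v in nodes}  # p(u, v)
p_node = {u: w_node[u] / w_tot for u in nodes}               # marginal p(u)

# Cluster pair distribution for the partition {{0, 1}, {2, 3}}.
partition = [frozenset({0, 1}), frozenset({2, 3})]
p_clu = {(A, B): sum(p[u, v] for u in A for v in B)
         for A in partition for B in partition}
```

Both $p(u,v)$ and $p(A,B)$ sum to 1 over ordered pairs, since every edge weight is counted once in each direction.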
3 Representation by a dendrogram
Assume the graph $G$ is represented by a dendrogram, that is, a rooted binary tree $T$ whose leaves are the nodes of the graph, $V$. We denote by $\mathrm{int}(T)$ the set of internal nodes of the tree $T$. For each $i \in \mathrm{int}(T)$, a positive number $d(i)$ is assigned to node $i$, corresponding to its height in the dendrogram. We assume that the dendrogram is regular in the sense that $d(i) \le d(j)$ if $j$ is an ancestor of $i$ in the tree. For each $i \in \mathrm{int}(T)$, there are two subtrees attached to node $i$, with sets of leaves $A$ and $B$; these sets uniquely identify the internal node, so that we can write $i = A \cup B$ and $d(A,B) = d(i)$.
Ultrametric.
The dendrogram defines a metric on $V$: for each $u, v \in V$, we define $d(u,v) = d(i)$, where $i$ is the closest common ancestor of $u$ and $v$ in the dendrogram. This is an ultrametric in the sense that:
$$\forall u, v, x \in V, \quad d(u,v) \le \max(d(u,x), d(x,v)).$$
Conversely, any ultrametric defines a dendrogram, which can be built from bottom to top by successive merges of the closest nodes.
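The ultrametric inequality can be checked directly on a small dendrogram. The snippet below (a toy illustration of ours, with made-up heights) reads $d(u,v)$ as the height of the smallest cluster of the hierarchy containing both nodes:

```python
import itertools

# Toy dendrogram (ours) over leaves {0, 1, 2, 3}: merge {0, 1} at height
# 1.0, {2, 3} at height 2.0, then everything at height 5.0.
heights = {frozenset({0, 1}): 1.0,
           frozenset({2, 3}): 2.0,
           frozenset({0, 1, 2, 3}): 5.0}

def d(u, v):
    # Height of the smallest cluster of the hierarchy containing u and v.
    return min(h for c, h in heights.items() if u in c and v in c)

# The ultrametric inequality holds for every triple of leaves.
for u, v, x in itertools.permutations([0, 1, 2, 3], 3):
    assert d(u, v) <= max(d(u, x), d(x, v))
```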
Graph representation.
A natural question is whether, given the dendrogram, it is possible to generate some graph $\hat{G}$ that is “close” to the original graph $G$. Let $q$ be some probability distribution on $V$ representing the prior information known about the relative node weights: the distribution $q$ is uniform in the absence of such information and is equal to the sampling distribution $p$ for a perfect knowledge of the node weights. We look for the best representation of $G$ by a dendrogram in the sense of the reconstruction of $\hat{G}$ from $d$, which can be viewed as the autoencoding scheme:
$$G \longrightarrow d \longrightarrow \hat{G}.$$
Since $d(u,v)$ can be interpreted as a distance between $u$ and $v$, its inverse corresponds to a similarity. Thus we define the weight between any two nodes $u \ne v$ in the graph $\hat{G}$ as:
$$\hat{w}(u,v) = \frac{q(u)\, q(v)}{d(u,v)}.$$
Denoting by $\hat{w}$ the total weight,
$$\hat{w} = \sum_{u \ne v} \hat{w}(u,v),$$
we get the following node pair sampling distribution associated with $\hat{G}$:
$$\hat{p}(u,v) = \frac{\hat{w}(u,v)}{\hat{w}}.$$
The distance between graphs $G$ and $\hat{G}$ can then be assessed through the Kullback-Leibler divergence between the respective sampling distributions:
$$D(p \,\|\, \hat{p}) = \sum_{u \ne v} p(u,v) \log \frac{p(u,v)}{\hat{p}(u,v)}.$$
Optimization problem.
Minimizing the Kullback-Leibler divergence in $d$ is equivalent to minimizing the cost function:
$$J(d) = \sum_{u \ne v} p(u,v) \log d(u,v) + \log \left( \sum_{u \ne v} \frac{q(u)\, q(v)}{d(u,v)} \right)$$
over all ultrametrics $d$ defined on $V$. Observe that $J(\alpha d) = J(d)$ for any $\alpha > 0$, so that the best ultrametric is defined up to a multiplicative constant. Using the fact that $d(u,v) = d(i)$, where $i$ is the closest common ancestor of $u$ and $v$ in the dendrogram, we get:
$$J(d) = \sum_{i \in \mathrm{int}(T)} p(i) \log d(i) + \log \left( \sum_{i \in \mathrm{int}(T)} \frac{q(i)}{d(i)} \right), \qquad (1)$$
with
$$p(i) = \sum_{u,v:\, u \wedge v = i} p(u,v) = 2\, p(A,B), \qquad q(i) = \sum_{u,v:\, u \wedge v = i} q(u)\, q(v) = 2\, q(A)\, q(B),$$
where $u \wedge v$ denotes the closest common ancestor of $u$ and $v$, and $A, B$ are the sets of leaves of the two subtrees attached to node $i$.
The problem of the best representation of $G$ by a dendrogram now reduces to the optimization problem:
$$\min_{d} J(d) \qquad (2)$$
over all ultrametrics $d$ on $V$.
4 Optimal representation
We seek to solve the optimization problem (2).
Optimal distances.
We first assume that the underlying tree $T$ of the ultrametric is given and look for the best corresponding distance $d$. Let $\mathrm{int}(T)$ be the set of internal nodes of the tree. Then for each $i \in \mathrm{int}(T)$, we get by the differentiation of (1) in $d(i)$:
$$\frac{p(i)}{d(i)} - \frac{1}{\sigma} \frac{q(i)}{d(i)^2} = 0,$$
where
$$\sigma = \sum_{j \in \mathrm{int}(T)} \frac{q(j)}{d(j)},$$
that is, up to a multiplicative constant,
$$d(i) = \frac{q(i)}{p(i)}. \qquad (3)$$
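The optimality of (3) can be sanity-checked numerically. The toy check below (our own, with made-up distributions over two internal nodes) evaluates the cost of Eq. (1) and verifies that random multiplicative perturbations of the distances of Eq. (3) never improve it:

```python
import math
import random

# Toy check (ours): for a fixed tree with two internal nodes i = 0, 1, the
# cost J(d) = sum_i p(i) log d(i) + log(sum_i q(i)/d(i)) of Eq. (1) is
# minimized at d(i) proportional to q(i)/p(i), as stated in Eq. (3).
p = [0.7, 0.3]  # cluster pair distribution p(i), sums to 1
q = [0.4, 0.6]  # prior cluster pair distribution q(i), sums to 1

def J(d):
    return (sum(pi * math.log(di) for pi, di in zip(p, d))
            + math.log(sum(qi / di for qi, di in zip(q, d))))

d_opt = [qi / pi for qi, pi in zip(q, p)]  # optimal distances, Eq. (3)
best = J(d_opt)

random.seed(0)
for _ in range(1000):
    d = [di * math.exp(random.uniform(-1.0, 1.0)) for di in d_opt]
    assert J(d) >= best - 1e-9  # no perturbation improves on Eq. (3)
```

Note that $J$ is scale-invariant, so any rescaling of `d_opt` attains the same minimal value, consistent with the ultrametric being defined up to a multiplicative constant.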
Optimal tree.
Replacing $d(i)$ by its optimal value (3) in (1), we deduce that the optimization problem (2) reduces to:
$$\max_{T} \sum_{i \in \mathrm{int}(T)} p(i) \log \frac{p(i)}{q(i)}, \qquad (4)$$
where $\mathrm{int}(T)$ is the set of internal nodes of the tree $T$. The dendrogram is then fully determined by (3), for each internal node $i$. The function to maximize in (4) is the Kullback-Leibler divergence between the cluster pair distributions $p(i)$ and $q(i)$, obtained when nodes are sampled from the edges and independently from the distribution $q$, respectively. It provides a meaningful objective function for hierarchical clustering, that can be interpreted in terms of graph reconstruction.
A key difference between our objective function (4) and the cost functions proposed in the literature [3, 2] lies in the entropy term:
$$\sum_{i \in \mathrm{int}(T)} p(i) \log p(i).$$
Removing this term yields the cost function:
$$J(T) = \sum_{i \in \mathrm{int}(T)} p(i) \log q(i),$$
to be compared with usual cost functions, of the form:
$$C(T) = \sum_{i \in \mathrm{int}(T)} p(i)\, q(A_i \cup B_i),$$
where $A_i, B_i$ are the sets of leaves of the two subtrees attached to node $i$. When $q$ is the uniform distribution (no prior information on the node weights), these cost functions become respectively, up to additive and multiplicative constants:
$$\sum_{i \in \mathrm{int}(T)} p(i) \log (|A_i|\, |B_i|) \quad \text{and} \quad \sum_{i \in \mathrm{int}(T)} p(i)\, (|A_i| + |B_i|).$$
The latter is Dasgupta's cost function, equal to the expected size of the smallest cluster containing two random nodes sampled from $p$.
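This expectation is easy to evaluate on a toy instance. The helper below (our own code, with hypothetical naming) computes Dasgupta's cost as the expected size of the smallest cluster of the hierarchy containing the two endpoints of a sampled node pair:

```python
# Toy computation (ours) of Dasgupta's cost: expected size of the smallest
# cluster of the hierarchy containing two nodes sampled from p.
def dasgupta_cost(edges, clusters):
    """edges: {(u, v): weight}; clusters: leaf sets of the internal nodes."""
    w_tot = 2.0 * sum(edges.values())
    cost = 0.0
    for (u, v), x in edges.items():
        # Smallest cluster of the hierarchy containing both endpoints.
        size = min(len(c) for c in clusters if u in c and v in c)
        cost += (2.0 * x / w_tot) * size  # mass p(u,v) + p(v,u), times size
    return cost
```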
The optimization problem (4) is NP-hard, just like the minimization of Dasgupta's cost function [3]. In the next section, we present heuristics based on greedy agglomerative algorithms for finding approximate solutions to this optimization problem.
5 Hierarchical clustering
The optimal distances (3) suggest a greedy algorithm for solving the optimization problem (4). The algorithm consists in starting from $n$ clusters (one per node) and in successively merging the two closest clusters in terms of the inter-cluster distance (3). This is a usual agglomerative algorithm with linkage (inter-cluster similarity):
$$s(A,B) = \frac{p(A,B)}{q(A)\, q(B)}. \qquad (5)$$
The dendrogram is built from bottom to top, with distance $d(A,B) = 1/s(A,B)$ attached to the internal node resulting from the merge of clusters $A, B$. The agglomeration relies on the following update formula:
Proposition 1
For any pairwise disjoint clusters $A, B, C$, we have:
$$s(A \cup B, C) = \frac{q(A)}{q(A) + q(B)}\, s(A,C) + \frac{q(B)}{q(A) + q(B)}\, s(B,C).$$
Proof. We have:
$$s(A \cup B, C) = \frac{p(A \cup B, C)}{q(A \cup B)\, q(C)} = \frac{p(A,C) + p(B,C)}{(q(A) + q(B))\, q(C)} = \frac{q(A)}{q(A) + q(B)}\, s(A,C) + \frac{q(B)}{q(A) + q(B)}\, s(B,C). \qquad \square$$
The update formula shows that the linkage (5) is reducible, in the sense that:
$$\forall A, B, C, \quad s(A \cup B, C) \le \max(s(A,C), s(B,C)).$$
This inequality guarantees that the resulting dendrogram is regular (the sequence of distances attached to successive internal nodes is non-decreasing) and that the corresponding distance $d$ on $V$ is an ultrametric. Moreover, the search for the clusters to merge can be done through the nearest-neighbor chain algorithm to speed up the computation [5].
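The greedy algorithm can be sketched as follows. This is our own minimal implementation (not the authors' code), which scans all cluster pairs at each step instead of using the nearest-neighbor chain; it merges the pair maximizing the linkage of Eq. (5) and records the distance of Eq. (3):

```python
# Minimal sketch (ours) of greedy agglomerative clustering with the linkage
# s(A, B) = p(A, B) / (q(A) q(B)) of Eq. (5).
def agglomerate(edges, n, q=None):
    """edges: {(u, v): weight} on nodes 0..n-1 (undirected, connected);
    q: prior node distribution, uniform if None.
    Returns the merge sequence [(cluster_a, cluster_b, distance), ...]."""
    w_tot = 2.0 * sum(edges.values())
    q_clu = dict(q) if q else {u: 1.0 / n for u in range(n)}
    members = {u: frozenset({u}) for u in range(n)}
    p = {}  # p[frozenset({i, j})] = p(A_i, A_j), inter-cluster probability
    for (u, v), x in edges.items():
        key = frozenset({u, v})
        p[key] = p.get(key, 0.0) + x / w_tot
    merges, next_id = [], n
    while len(members) > 1:
        # Merge the pair with the largest linkage (smallest distance).
        key = max(p, key=lambda k: p[k] / (q_clu[min(k)] * q_clu[max(k)]))
        i, j = key
        d = q_clu[i] * q_clu[j] / p[key]  # distance (3), i.e. 1 / s(A, B)
        merges.append((members[i], members[j], d))
        c = next_id
        next_id += 1
        members[c] = members.pop(i) | members.pop(j)
        q_clu[c] = q_clu.pop(i) + q_clu.pop(j)
        del p[key]
        # Update formula: p(A ∪ B, C) = p(A, C) + p(B, C).
        new_p = {}
        for k, v in p.items():
            k2 = frozenset(c if x in (i, j) else x for x in k)
            new_p[k2] = new_p.get(k2, 0.0) + v
        p = new_p
    return merges
```

By reducibility of the linkage, the recorded distances are non-decreasing, so the dendrogram is regular.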
Linkage.
For $q$ the uniform distribution (no prior information on the node weights), the linkage (5) is proportional to the usual average linkage:
$$s(A,B) \propto \frac{w(A,B)}{|A|\, |B|},$$
corresponding to the density of the cut separating clusters $A$ and $B$.
For $q$ equal to $p$ (perfect knowledge of the node weights), this is the linkage proposed in [1]:
$$s(A,B) = \frac{p(A,B)}{p(A)\, p(B)},$$
which can be interpreted as a sampling ratio: the probability that the cluster pair $A, B$ is sampled from the edges, relative to the probability that it is sampled independently from the nodes, in proportion to their weights.
We refer to this linkage as modular in view of its relationship with modularity.
Modularity.
The modularity of any partition $\mathcal{P}$ of the set of nodes $V$ is defined by [7]:
$$Q(\mathcal{P}) = \sum_{A \in \mathcal{P}} \left( p(A,A) - p(A)^2 \right).$$
This is the difference between the probabilities that two nodes belong to the same cluster when sampled from the edges and independently from the nodes, in proportion to their weights. The former sampling distribution depends on the graph $G$ while the latter depends on $G$ through the node weights only.
A more general definition of modularity is:
$$Q(\mathcal{P}) = \sum_{A \in \mathcal{P}} \left( p(A,A) - q(A)^2 \right).$$
Now the node sampling distribution $q$ is any distribution with support $V$. For instance, it may be uniform (no prior information on the node weights) or equal to $p$ (the usual definition of modularity, where the information on the node weights is known).¹

¹ Another common interpretation of modularity is the difference between the proportions of edge weights within clusters in the original graph and in some null model where nodes $u$ and $v$ are linked with probability proportional to $q(u)\, q(v)$; for $q$ the uniform distribution (no prior information on the node weights), the null model is an Erdős–Rényi graph, while for $q = p$ (perfect knowledge of the node weights), the null model is the configuration model.
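The generalized definition is direct to evaluate. The snippet below (our own code, with made-up example weights) computes $Q(\mathcal{P}) = \sum_{A} (p(A,A) - q(A)^2)$ for an arbitrary prior $q$, defaulting to the uniform distribution:

```python
# Toy computation (ours) of the generalized modularity
# Q(P) = sum_A [ p(A, A) - q(A)^2 ] for a partition P of the nodes.
def modularity(edges, partition, q=None):
    nodes = sorted({u for e in edges for u in e})
    w_tot = 2.0 * sum(edges.values())
    if q is None:  # uniform prior on the nodes
        q = {u: 1.0 / len(nodes) for u in nodes}
    Q = 0.0
    for A in partition:
        # p(A, A): ordered pairs, so each intra-cluster edge counts twice.
        p_AA = 2.0 * sum(x for (u, v), x in edges.items()
                         if u in A and v in A) / w_tot
        q_A = sum(q[u] for u in A)
        Q += p_AA - q_A ** 2
    return Q
```

Passing $q(u) = w(u)/w$ recovers the usual definition of modularity [7].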
Maximizing modularity usually provides a unique clustering. To explore the multi-scale structure of real graphs, it is common to introduce some positive resolution parameter $\gamma$ that controls the respective weights of both terms in the definition of modularity [8, 4, 6]. The modularity of partition $\mathcal{P}$ at resolution $\gamma$ is defined by:
$$Q_\gamma(\mathcal{P}) = \sum_{A \in \mathcal{P}} \left( p(A,A) - \gamma\, q(A)^2 \right).$$
When $\gamma \to 0$, the second term becomes negligible and the best partition is trivial, with a single cluster equal to the set of nodes $V$. When $\gamma \to +\infty$, the second term becomes preponderant and the best partition has $n$ clusters, one per node. Now the maximum resolution beyond which the best partition has $n$ clusters is given by:
$$\gamma_{\max} = \max_{u \ne v} \frac{p(u,v)}{q(u)\, q(v)}.$$
The first node pair to be merged (at maximum resolution) is that achieving this maximum, which is the closest pair in terms of linkage (5).
More generally, given some clustering $\mathcal{P}$, the modularity of any coarser partition $\mathcal{P}'$ at resolution $\gamma$ is:
$$Q_\gamma(\mathcal{P}') = \sum_{A' \in \mathcal{P}'} \sum_{A, B \in \mathcal{P}:\, A, B \subset A'} \left( p(A,B) - \gamma\, q(A)\, q(B) \right).$$
Again, the best partition is $\mathcal{P}$ when $\gamma \to +\infty$, and the first cluster merge occurs at resolution:
$$\gamma_{\max} = \max_{A \ne B} \frac{p(A,B)}{q(A)\, q(B)}.$$
The cluster pair to be merged (at maximum resolution) is that achieving this maximum, which is the closest pair in terms of linkage (5). The agglomerative algorithm based on linkage (5) can thus be interpreted as the greedy maximization of modularity at maximum resolution.
References
 [1] T. Bonald, B. Charpentier, A. Galland, and A. Hollocou. Hierarchical graph clustering based on node pair sampling. In Proceedings of the 14th International Workshop on Mining and Learning with Graphs (MLG), 2018.
 [2] V. Cohen-Addad, V. Kanade, F. Mallmann-Trenn, and C. Mathieu. Hierarchical clustering: Objective functions and algorithms. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2018.
 [3] S. Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the ACM Symposium on Theory of Computing, 2016.
 [4] R. Lambiotte, J.-C. Delvenne, and M. Barahona. Random walks, Markov processes and the multiscale modular organization of complex networks. IEEE Transactions on Network Science and Engineering, 2014.
 [5] F. Murtagh and P. Contreras. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2012.
 [6] M. Newman. Community detection in networks: Modularity optimization and maximum likelihood are equivalent. arXiv preprint, 2016.
 [7] M. E. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical review E, 2004.
 [8] J. Reichardt and S. Bornholdt. Statistical mechanics of community detection. Physical Review E, 74(1), 2006.