# An Information-theoretic Perspective of Hierarchical Clustering

A combinatorial cost function for hierarchical clustering was introduced by Dasgupta [6]. It was generalized by Cohen-Addad et al. [4] to a general form called an admissible function. In this paper, we investigate hierarchical clustering from an information-theoretic perspective and formulate a new objective function. We also establish the relationship between these two perspectives. On the algorithmic side, we depart from the traditional top-down and bottom-up frameworks and propose a new one that recursively stratifies the sparsest level of a cluster tree, guided by our objective function. For practical use, our resulting cluster tree is not binary. Our algorithm, called HCSE, outputs a k-level cluster tree and chooses k automatically through a novel and interpretable mechanism that requires no hyper-parameter. Our experimental results on synthetic datasets show that HCSE has a great advantage in finding the intrinsic number of hierarchies, and the results on real datasets show that HCSE also achieves costs competitive with those of the popular algorithms LOUVAIN and HLP.


## 1 Introduction

Hierarchical clustering for graphs plays an important role in the structural analysis of a given data set. Understanding hierarchical structures at multiple levels of granularity is fundamental in various disciplines including artificial intelligence, physics, biology, sociology, etc. [3, 7, 9, 5]. Hierarchical clustering requires a cluster tree that represents a recursive partitioning of a graph into smaller clusters as the tree nodes get deeper. In this cluster tree, each leaf represents a graph node while each non-leaf node represents a cluster containing its descendant leaves. The root is the largest one containing all leaves.

During the last two decades, flat clustering has attracted great attention, which has bred plenty of algorithms, such as k-means [10], DBSCAN [8], spectral clustering [1], and so on. From the combinatorial perspective, we have cost functions, e.g., modularity and centrality measures, to evaluate the quality of partition-based clustering. Therefore, community detection is usually formulated as an optimization problem over these objectives. By contrast, no comparable cost function with a clear and reasonable combinatorial explanation was developed for hierarchical clustering until Dasgupta [6] introduced a cost function for cluster trees. In this definition, similarity or dissimilarity between data points is represented by weighted edges. Taking the similarity scenario as an example, a cluster is a set of nodes with relatively denser intra-links compared with its inter-links, and in a good cluster tree, heavier edges tend to connect leaves whose lowest common ancestor is as deep as possible. This intuition leads to Dasgupta’s cost function, which is a weighted linear combination of the sizes of lowest common ancestors over all edges.

Motivated by Dasgupta’s cost function, Cohen-Addad et al. [4] proposed the concept of admissible cost function. In their definition, the size of each lowest common ancestor in Dasgupta’s cost function is generalized to be a function of the sizes of its left and right children. For all similarity graphs generated from a minimal ultrametric, a cluster tree achieves the minimum cost if and only if it is a generating tree that is a “natural” ground truth tree in an axiomatic sense therein. A necessary condition of admissibility of an objective function is that it achieves the same value for every cluster tree for a uniformly weighted clique that has no structure in common sense. However, any slight deviation of edge weights would generally separate the two end-points of a light edge on a high level of its optimal (similarity-based) cluster tree. Thus, it seems that admissible objective functions, which take Dasgupta’s cost function as a specific form, ought to be an unchallenged criterion in evaluating cluster trees since they are formulated by an axiomatic approach.

However, an admissible cost function seems imperfect in practice. The arbitrariness of optimal cluster trees for cliques indicates that the division of each internal node on an optimal cluster tree totally neglects the balance of its two children. Edge weight is the only factor that decides the structure of optimal trees. But a balanced tree is commonly considered an ideal candidate in hierarchical clustering compared to an unbalanced one. So even when clustering cliques, a balanced partition should be preferable for each internal node. At the least, an optimal cluster tree whose height is logarithmic in the graph size is intuitively more reasonable than a caterpillar-shaped cluster tree whose height is linear in the graph size. Moreover, a simple proof would imply that the optimal cluster tree for any connected graph is binary. This property is not always useful in practice, since a real system usually has its inherent number of hierarchies and a natural partition for each internal cluster. For instance, the natural levels of administrative division in a country are usually intrinsic, and it is not suitable to differentiate hierarchies for parallel cities in the same state. This structure cannot be obtained by simply minimizing admissible cost functions.

In this paper, we investigate hierarchical clustering from the perspective of information theory. Our study is based on Li and Pan’s structural information theory [11], whose core concept, named structural entropy, measures the complexity of hierarchical networks. We formulate a new objective function from this point of view, which builds a bridge between the combinatorial and information-theoretic perspectives for hierarchical clustering. For this cost function, the balance of cluster trees is involved naturally as a factor, just as in the design of optimal codes, for which the balance of probability over objects is fundamental in constructing an efficient coding tree. We also define cluster trees with a specific height, which coincides with our cognition of natural clustering. For practical use, we develop a novel algorithm for natural hierarchical clustering for which the number of hierarchies can be determined automatically. The idea of our algorithm is essentially different from the popular recursive division or agglomeration framework. We formulate two basic operations on cluster trees, called stretch and compress respectively, to search for the sparsest level iteratively. Our algorithm HCSE terminates when a specific criterion that intuitively coincides with the natural hierarchies is met. Our extensive experiments on both synthetic and real datasets demonstrate that HCSE outperforms the present popular heuristic algorithms LOUVAIN [2] and HLP [12]. The latter two algorithms proceed simply by recursively invoking flat clustering algorithms based on modularity and label propagation, respectively. For both of them, the number of hierarchies is solely determined by the number of rounds when the algorithm terminates, for which the interpretability is quite poor. Our experimental results on synthetic datasets show that HCSE has a great advantage in finding the intrinsic number of hierarchies, and the results on real datasets show that HCSE achieves costs competitive with those of LOUVAIN and HLP.

We organize this paper as follows. The structural information theory and its relationship with combinatorial cost functions will be introduced in Section 2, and the algorithm will be given in Section 3. The experiments and their results are presented in Section 4. We conclude the paper in Section 5.

## 2 A cost function from information-theoretic perspective

In this section, we introduce Li and Pan’s structural information theory [11] and the combinatorial cost functions of Dasgupta [6] and Cohen-Addad et al. [4]. Then we propose a new cost function that is developed from structural information theory and establish the relationship between the information-theoretic and combinatorial perspectives.

### 2.1 Notations

Let $G=(V,E,w)$ be an undirected weighted graph with a set of vertices $V$, a set of edges $E$ and a weight function $w: E\to\mathbb{R}^+$, where $\mathbb{R}^+$ denotes the set of all positive real numbers. An unweighted graph can be viewed as a weighted one whose weights are unit. For each vertex $u\in V$, denote by $d_u=\sum_{(u,v)\in E}w(u,v)$ the weighted degree of $u$. (From now on, whenever we say the degree of a vertex, we always refer to the weighted degree.) For a subset of vertices $S\subseteq V$, define the volume of $S$ to be the sum of degrees of its vertices. We denote it by $\mathrm{vol}(S)=\sum_{u\in S}d_u$.

A cluster tree $T$ for graph $G$ is a rooted tree with $|V|$ leaves, each of which is labeled by a distinct vertex $v\in V$. Each non-leaf node on $T$ is labeled by the subset of $V$ that consists of all the leaves having this node as an ancestor. For each node $\alpha$ on $T$, denote by $\alpha^-$ the parent of $\alpha$. For each pair of leaves $u$ and $v$, denote by $u\vee v$ the least common ancestor (LCA) of them on $T$.

### 2.2 Structural information and structural entropy

The idea of structural information is to encode a random walk with a certain rule by using a high-dimensional encoding system for a graph $G$. It is well known that a random walk, in which a neighbor is chosen with probability proportional to edge weights, has a stationary distribution on vertices that is proportional to vertex degree. (For connected graphs, this stationary distribution is unique, but not for disconnected ones. Here, we consider this one for all graphs.) So to position a random walk under its stationary distribution, the amount of information needed is typically Shannon’s entropy, denoted by

$$H^{(1)}(G)=-\sum_{v\in V}\frac{d_v}{\mathrm{vol}(V)}\log\frac{d_v}{\mathrm{vol}(V)}.$$

(In this paper, the omitted base of the logarithm is always $2$.)
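As a small illustration of the entropy of the degree-proportional stationary distribution, the following sketch computes it for a graph given as a dictionary of weighted edges. The function names and the edge-dictionary representation are assumptions made for this example, not part of the paper.

```python
import math
from collections import defaultdict


def weighted_degrees(edges):
    """edges: dict mapping an unordered vertex pair (u, v) to a positive weight."""
    deg = defaultdict(float)
    for (u, v), w in edges.items():
        deg[u] += w
        deg[v] += w
    return dict(deg)


def one_dim_entropy(edges):
    """Shannon entropy (base 2) of the distribution d_v / vol(V) over vertices."""
    deg = weighted_degrees(edges)
    vol = sum(deg.values())
    return -sum(d / vol * math.log2(d / vol) for d in deg.values())


# Unweighted 4-cycle: every vertex has degree 2, so the distribution is uniform.
cycle = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0}
h1 = one_dim_entropy(cycle)
print(h1)  # 2.0, i.e. log2(4)
```

For a regular graph the value is simply the logarithm of the number of vertices; irregular degrees lower it.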

By Shannon’s noiseless coding theorem, $H^{(1)}(G)$ is the limit of the average code length generated from the memoryless source for one step of the random walk. However, dependence of locations may shorten the code length. For each level on cluster trees, the uncertainty of locations is measured by the entropy of the stationary distribution on the clusters of this level. Consider an encoding for every cluster, including the leaves. Each non-root node $\alpha$ is labeled by its order among the children of its parent $\alpha^-$. So the self-information of $\alpha$ within this local parent-children substructure is $-\log\frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^-)}$, which is also roughly the length of the Shannon code for $\alpha$ and its siblings. The codeword of $\alpha$ consists of the sequential labels of nodes along the unique path from the root (excluded) to itself (included). The key idea is as follows. For one step of the random walk from $u$ to $v$ in $G$, to indicate $v$, we omit from $v$’s codeword the longest common prefix of $u$ and $v$, which is exactly the codeword of $u\vee v$. This means that the random walk takes this step in the cluster $u\vee v$ (and also in the ancestors of $u\vee v$), and the uncertainty at this level may not be involved. Therefore, intuitively, a good similarity-based cluster tree would trap the random walk with high frequency in the deep clusters that are far from the root, and a long codeword of $u\vee v$ would be omitted. This shortens the average code length of the random walk. Note that we ignore the uniqueness of decoding, since a practical design of codewords is not our purpose. We utilize this scheme to evaluate and differentiate hierarchical structures.

We now formalize the above scheme and measure the average code length as follows. Given a weighted graph $G=(V,E,w)$ and a cluster tree $T$ for $G$, note that under the stationary distribution, the random walk takes one step out of a cluster $\alpha$ on $T$ with probability $g_\alpha/\mathrm{vol}(V)$, where $g_\alpha$ is the sum of weights of edges with exactly one end-point in $\alpha$. Therefore, the aforementioned uncertainty measured by the average code length is

$$H^T(G)=-\sum_{\alpha\in T}\frac{g_\alpha}{\mathrm{vol}(V)}\log\frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^-)}.$$

(For notational convenience, for the root $\lambda$ of $T$, set $\lambda^-=\lambda$. So the term for $\lambda$ in the summation is $0$.)

We call $H^T(G)$ the structural entropy of $G$ on $T$. We define the structural entropy of $G$ to be the minimum one among all cluster trees, denoted by $H(G)=\min_T\{H^T(G)\}$. Note that the structural entropy of $G$ on the trivial $1$-level cluster tree, which has no non-trivial cluster, is consistent with the previously defined $H^{(1)}(G)$.
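The definition can be evaluated directly on a small example. The sketch below is illustrative only: the cluster tree is represented as nested Python lists whose leaves are vertex labels, and the names `leaves` and `structural_entropy` are hypothetical choices for this example.

```python
import math


def leaves(node):
    """Set of graph vertices below a tree node (a nested list or a single label)."""
    if not isinstance(node, list):
        return {node}
    out = set()
    for child in node:
        out |= leaves(child)
    return out


def structural_entropy(edges, tree):
    """Sum over non-root nodes a of -(g_a / vol(V)) * log2(vol(a) / vol(parent of a))."""
    deg = {}
    for (u, v), w in edges.items():
        deg[u] = deg.get(u, 0.0) + w
        deg[v] = deg.get(v, 0.0) + w
    total = sum(deg.values())

    def vol(s):
        return sum(deg[x] for x in s)

    def cut(s):
        # total weight of edges with exactly one end-point in s
        return sum(w for (u, v), w in edges.items() if (u in s) != (v in s))

    def walk(node, parent_vol):
        s = leaves(node)
        h = 0.0
        if parent_vol is not None:  # the root contributes nothing
            h -= cut(s) / total * math.log2(vol(s) / parent_vol)
        if isinstance(node, list):
            for child in node:
                h += walk(child, vol(s))
        return h

    return walk(tree, None)


# Two triangles joined by a single bridge edge (2, 3).
two_triangles = {(0, 1): 1, (1, 2): 1, (0, 2): 1, (3, 4): 1, (4, 5): 1, (3, 5): 1, (2, 3): 1}
h_flat = structural_entropy(two_triangles, [0, 1, 2, 3, 4, 5])
h_tree = structural_entropy(two_triangles, [[0, 1, 2], [3, 4, 5]])
print(round(h_flat, 4), round(h_tree, 4))
```

On this graph the tree that separates the two triangles has a clearly smaller value than the flat tree, which matches the intuition that the walk is mostly trapped inside a triangle.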

### 2.3 Combinatorial explanation of structural entropy

The cost function of a cluster tree $T$ for graph $G=(V,E,w)$ introduced by Dasgupta [6] is defined to be $\mathrm{cost}^T(G)=\sum_{(u,v)\in E}w(u,v)\,|u\vee v|$, where $|u\vee v|$ denotes the size of cluster $u\vee v$. The admissible cost function introduced by Cohen-Addad et al. [4] generalizes the term $|u\vee v|$ in the definition of $\mathrm{cost}^T(G)$ to a general function $g$ of the sizes of the two children of $u\vee v$, for which Dasgupta defined $g(x,y)=x+y$. For both definitions, the optimal hierarchical clustering of $G$ corresponds to a cluster tree of minimum cost in the combinatorial sense that heavy edges are cut as far down the tree as possible. The following theorem establishes the relationship between structural entropy and this kind of combinatorial form of cost functions.
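To make the combinatorial cost concrete, here is a minimal sketch that evaluates Dasgupta's cost on a tree given as nested lists. The representation and the name `dasgupta_cost` are assumptions for illustration; an edge is charged at a node exactly when its end-points fall into different children of that node, which is when the node is the edge's lowest common ancestor.

```python
def dasgupta_cost(edges, tree):
    """Sum over edges of weight times the number of leaves under the edge's LCA."""
    total = 0.0

    def visit(node):
        nonlocal total
        if not isinstance(node, list):
            return {node}
        child_sets = [visit(child) for child in node]
        members = set().union(*child_sets)
        owner = {x: i for i, s in enumerate(child_sets) for x in s}
        for (u, v), w in edges.items():
            # this node is the LCA of (u, v) iff u and v lie in different children
            if u in members and v in members and owner[u] != owner[v]:
                total += w * len(members)
        return members

    visit(tree)
    return total


# Path 0-1-2-3 with a light middle edge.
path = {(0, 1): 3.0, (1, 2): 1.0, (2, 3): 3.0}
good = dasgupta_cost(path, [[0, 1], [2, 3]])    # cuts the light edge at the root
bad = dasgupta_cost(path, [[0, [1, 2]], 3])     # cuts heavy edges high in the tree
print(good, bad)  # 16.0 23.0
```

The tree that cuts the light edge first is cheaper, in line with the statement that heavy edges should be cut as far down the tree as possible.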

###### Theorem 2.1.

For a weighted graph $G=(V,E,w)$, to minimize $H^T(G)$ (over $T$) is equivalent to minimizing the cost function

$$\mathrm{cost}^T(G)=\sum_{(u,v)\in E}w(u,v)\log\mathrm{vol}(u\vee v). \tag{1}$$

###### Proof.

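Note that

$$\begin{aligned}
H^T(G) &= -\sum_{\alpha\in T}\frac{g_\alpha}{\mathrm{vol}(V)}\log\frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^-)} \\
&= -\sum_{\alpha\in T}\sum_{(u,v)\in g_\alpha}\frac{w(u,v)}{\mathrm{vol}(V)}\log\frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^-)} \\
&= -\sum_{(u,v)\in E}\left(\frac{w(u,v)}{\mathrm{vol}(V)}\sum_{\alpha:(u,v)\in g_\alpha}\log\frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^-)}\right),
\end{aligned}$$

where, with a slight abuse of notation, $(u,v)\in g_\alpha$ means that the edge $(u,v)$ has exactly one end-point in $\alpha$.

For a single edge $(u,v)$, all the terms $\log\frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^-)}$ for the nodes $\alpha$ that contain the leaf $u$ and satisfy $(u,v)\in g_\alpha$ sum (over $\alpha$) up to $\log\frac{d_u}{\mathrm{vol}(u\vee v)}$ along the unique path from $u$ to $u\vee v$, since the sum telescopes. The situation is symmetric for $v$. Therefore, considering ordered pairs $(u,v)$,

$$\begin{aligned}
H^T(G) &= -\sum_{\text{ordered }(u,v)\in E}\frac{w(u,v)}{\mathrm{vol}(V)}\log\frac{d_u}{\mathrm{vol}(u\vee v)} \\
&= \frac{1}{\mathrm{vol}(V)}\left(-\sum_{u\in V}d_u\log d_u+\sum_{\text{ordered }(u,v)\in E}w(u,v)\log\mathrm{vol}(u\vee v)\right) \\
&= \frac{1}{\mathrm{vol}(V)}\left(-\sum_{u\in V}d_u\log d_u+2\cdot\sum_{(u,v)\in E}w(u,v)\log\mathrm{vol}(u\vee v)\right).
\end{aligned}$$

The second equality follows from the fact that $\sum_{v:(u,v)\in E}w(u,v)=d_u$, and the last equality from the symmetry of $(u,v)$. Since the first summation is independent of $T$, to minimize $H^T(G)$ is equivalent to minimizing $\mathrm{cost}^T(G)$. ∎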
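The final identity of the proof can be checked numerically on a small graph. The sketch below is an illustration under assumed conventions (nested-list trees, an edge dictionary, and the hypothetical helper name `check_identity`): it computes the structural entropy from its definition and compares it with the closed form built from the degrees and the cost in Eq. (1).

```python
import math


def leaf_set(node):
    if not isinstance(node, list):
        return {node}
    return set().union(*(leaf_set(child) for child in node))


def check_identity(edges, tree):
    """Return (H^T(G) from the definition, closed form using cost(SE))."""
    deg = {}
    for (u, v), w in edges.items():
        deg[u] = deg.get(u, 0.0) + w
        deg[v] = deg.get(v, 0.0) + w
    vol_all = sum(deg.values())

    def vol(s):
        return sum(deg[x] for x in s)

    entropy, cost = 0.0, 0.0
    stack = [(tree, None)]
    while stack:
        node, parent = stack.pop()
        s = leaf_set(node)
        if parent is not None:
            g = sum(w for (u, v), w in edges.items() if (u in s) != (v in s))
            entropy -= g / vol_all * math.log2(vol(s) / vol(parent))
        if isinstance(node, list):
            kids = [leaf_set(child) for child in node]
            for (u, v), w in edges.items():
                iu = [i for i, k in enumerate(kids) if u in k]
                iv = [i for i, k in enumerate(kids) if v in k]
                # the edge's LCA is this node iff its end-points sit in different children
                if iu and iv and iu != iv:
                    cost += w * math.log2(vol(s))
            stack.extend((child, s) for child in node)
    closed = (-sum(d * math.log2(d) for d in deg.values()) + 2 * cost) / vol_all
    return entropy, closed


graph = {(0, 1): 1, (1, 2): 1, (0, 2): 1, (3, 4): 1, (4, 5): 1, (3, 5): 1, (2, 3): 1}
lhs, rhs = check_identity(graph, [[0, 1, 2], [3, 4, 5]])
print(lhs, rhs)
```

Both numbers agree up to floating-point error, and the same holds for any other tree over the same vertices, since only the cost term depends on the tree.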

Theorem 2.1 indicates that when we view $g$ as a function of vertices rather than of numbers and define $g(u,v)=\log\mathrm{vol}(u\vee v)$, the “admissible” function becomes equivalent to structural entropy in evaluating cluster trees, although it is not admissible any more.

So what is the difference between these two cost functions? As stated by Cohen-Addad et al. [4], an important axiomatic hypothesis for admissible functions, and thus also for Dasgupta’s cost function, is that the cost of every binary cluster tree of an unweighted clique is identical. This means that any binary tree for clustering on cliques is reasonable, which coincides with the common sense that structureless datasets can be organized hierarchically in any way. However, for structural entropy, the following theorem indicates that balanced organization matters even for structureless datasets.

###### Theorem 2.2.

For any positive integer $n$, let $K_n$ be the clique of $n$ vertices with identical weight on every edge. Then a cluster tree $T$ of $K_n$ achieves minimum structural entropy if and only if $T$ is a balanced binary tree, that is, the two children clusters of each sub-tree of $T$ differ in size by at most $1$.

The proof of Theorem 2.2 is a bit technical, and we defer it to Appendix A. The intuition behind Theorem 2.2 is that balanced codes are the most efficient encoding scheme for unrelated data. So the codewords of a random walk that jumps freely among clusters on each level of a cluster tree have the minimum average length if all the clusters on this level are in balance. This is guaranteed exactly by a balanced cluster tree.

In the cost function (1), which we call cost(SE) from now on, $\log\mathrm{vol}(u\vee v)$ is a concave function of the volume of $u\vee v$. In the context of regular graphs (e.g. cliques), replacing $\mathrm{vol}(u\vee v)$ by $|u\vee v|$ is equivalent for optimization. Dasgupta [6] claimed that for the clique $K_4$ of four vertices, a balanced tree is preferable when $|u\vee v|$ is replaced by $g(|u\vee v|)$ for any strictly increasing concave function $g$ with $g(0)=0$. However, it is interesting to note that this generalization does not hold for all those concave functions. For example, it is easy to check that for $g(x)=1-e^{-x}$, the cluster tree of $K_6$ that achieves minimum cost partitions $K_6$ into $K_2$ and $K_4$ on the first level, rather than $K_3$ and $K_3$. Theorem 2.2 shows that for all cliques, balanced trees are preferable when $g$ is a logarithmic function.
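The effect of the choice of concave function on cliques can be explored with a short dynamic program. The sketch below makes several assumptions for illustration: edges have unit weight, only binary trees are considered, each edge is charged $g$ of the number of leaves under its lowest common ancestor, and the concave function $g(x)=1-e^{-x}$ and the clique size $6$ are our illustrative choices. The name `best_root_split` is hypothetical.

```python
import math
from functools import lru_cache


def best_root_split(n, g):
    """Sizes (a, n - a), a <= n - a, of the root split of a min-cost binary tree on an n-clique."""

    @lru_cache(maxsize=None)
    def opt(m):
        # minimum cost of a binary cluster tree on an m-clique with unit weights
        if m == 1:
            return 0.0
        return min(a * (m - a) * g(m) + opt(a) + opt(m - a) for a in range(1, m // 2 + 1))

    # a * (n - a) edges are separated at the root and each pays g(n)
    costs = {a: a * (n - a) * g(n) + opt(a) + opt(n - a) for a in range(1, n // 2 + 1)}
    a = min(costs, key=costs.get)
    return a, n - a


log_split = best_root_split(6, math.log2)
exp_split = best_root_split(6, lambda x: 1 - math.exp(-x))
print(log_split)  # (3, 3)
print(exp_split)  # (2, 4)
```

With the logarithm the root split is balanced, while the bounded concave function prefers an uneven split, so concavity alone does not force balance on cliques.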

It is worth noting that the admissible function introduced by Cohen-Addad et al. [4] is defined from the viewpoint that a generating tree of a similarity-based graph that is generated from a minimal ultrametric achieves the minimum cost. In this definition, the monotonicity of edge weights between clusters on each level from bottom to top on $T$, which is given by Cohen-Addad et al. [4] as a property of a “natural” ground-truth hierarchical clustering, is the only factor when evaluating $T$. However, Theorem 2.2 implies that for cost(SE), besides cluster weights, the balance of cluster trees is implicitly involved as another factor. Moreover, for cliques, the minimum cost should be achieved on every subtree, which makes an optimal cluster tree balanced everywhere. This optimal clustering for cliques is also robust in the sense that a slight perturbation to the minimal ultrametric, which can be considered as slight variations to the weights of a batch of edges, will not change the optimal cluster tree structure wildly, due to the holdback force of balance.

## 3 Our hierarchical clustering algorithm

In this section, we develop an algorithm to optimize cost(SE) (Eq. (1), equivalent to optimizing structural entropy) and yield the associated cluster tree. At present, all existing algorithms for hierarchical clustering can be categorized into two frameworks: top-down division and bottom-up agglomeration [4]. The top-down division approach usually yields a binary tree by recursively dividing a cluster into two parts with a cut-related criterion. But a binary cluster tree is far from a practical one, as discussed in Section 1. For practical use, bottom-up agglomeration, also known as hierarchical agglomerative clustering (HAC), is commonly preferable. It constructs a cluster tree from the leaves to the root recursively, and during each round the newly generated clusters shrink into single vertices.

Our algorithm departs from these two frameworks. We establish a new one that stratifies the sparsest level of a cluster tree recursively rather than in a sequential order. In general, guided by cost(SE), we construct a $k$-level cluster tree from the previous $(k-1)$-level one. In this step, the level that is differentiated into two levels is the one whose stratification most decreases the average local cost, where the local cost is measured on a local reduced subgraph. The process of stratification consists of two basic operations: stretch and compression. In stretch steps, given an internal node of a cluster tree, a local binary subtree is constructed by an agglomerative approach, while in compression steps, the overlong paths from the root to leaves on the binary tree are compressed by shrinking the tree edges whose removal increases the cost least. This framework can be combined with any cost function.

We now define the operations “stretch” and “compression” formally. Given a cluster tree $T$ for graph $G$, let $u$ be an internal node on $T$ and $v_1,v_2,\ldots,v_\ell$ be its children. We call this local parent-children structure a $u$-triangle of $T$, denoted by $T_u$. The two operations are defined on $u$-triangles. Note that each child $v_i$ of $u$ is a cluster in $G$. We reduce the subgraph induced by $u$ by shrinking each $v_i$ to a single vertex while maintaining each link between different children and ignoring each internal edge of $v_i$. This reduction captures the connections of clusters at this level in the parent cluster $u$. The stretch operation proceeds by the HAC approach on the $u$-triangle. That is, initially, view each $v_i$ as a cluster and recursively combine the two clusters whose combination makes cost(SE) drop most. The sequence of combinations yields a binary subtree $T'_u$ rooted at $u$ which has $v_1,v_2,\ldots,v_\ell$ as leaves. Then the compression operation is proposed to reduce the height of $T'_u$ to $2$. Let $\hat{E}$ be the set of edges on $T'_u$ each of which appears on a path of length more than $2$ from the root of $T'_u$ to some leaf. For an edge $e$, denote by $\Delta(e)$ the amount by which structural entropy increases when $e$ is shrunk. We pick from $\hat{E}$ the edge with the least $\Delta(e)$. Note that since the compression of a tree edge turns the grandchildren of some internal node into its children, it must increase the cost. The compression operation picks the least increase. The process of stretch and compression is illustrated in Figure 1 and stated in Algorithms 1 and 2, respectively.
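The reduction step on a triangle can be sketched in a few lines. This is an illustration under assumed conventions rather than the paper's implementation: the children of the internal node are given as vertex sets, reduced vertices are identified by the index of the child, and the name `reduce_triangle` is hypothetical. Links that run between different children are kept and accumulated, while edges inside a child or leaving the parent cluster are dropped.

```python
from collections import defaultdict


def reduce_triangle(edges, children):
    """Shrink each child cluster to one vertex; return weights between distinct children."""
    owner = {x: i for i, cluster in enumerate(children) for x in cluster}
    reduced = defaultdict(float)
    for (a, b), w in edges.items():
        # keep only edges whose end-points lie in two different children of the node
        if a in owner and b in owner and owner[a] != owner[b]:
            key = (min(owner[a], owner[b]), max(owner[a], owner[b]))
            reduced[key] += w
    return dict(reduced)


edges = {
    (0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,   # first triangle
    (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0,   # second triangle
    (2, 3): 1.0, (1, 3): 0.5,                # links between the triangles
    (5, 6): 2.0,                             # link to a singleton cluster
}
reduced = reduce_triangle(edges, [{0, 1, 2}, {3, 4, 5}, {6}])
print(reduced)  # {(0, 1): 1.5, (1, 2): 2.0}
```

An agglomerative stretch step would then operate on this small reduced graph instead of the full graph.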

We now define the sparsest level of a cluster tree $T$. Let $U_j$ be the set of $j$-level nodes on $T$, that is, $U_j$ is the set of nodes each of which has distance $j$ from $T$’s root. Suppose that the height of $T$ is $k$; then $U_0,U_1,\ldots,U_{k-1}$ is a partition of all internal nodes of $T$. For each internal node $u$, define $H(u)=-\sum_{v:\,v^-=u}\frac{g_v}{\mathrm{vol}(V)}\log\frac{\mathrm{vol}(v)}{\mathrm{vol}(u)}$. Note that $H(u)$ is the partial sum contributed by $u$ in $H^T(G)$. After a “stretch-and-compress” round on the $u$-triangle, denote by $\Delta(u)$ the structural entropy by which the new cluster tree reduces. Since the reconstruction of the $u$-triangle stratifies cluster $u$, $\Delta(u)$ is always non-negative. Define the sparsity of $u$ to be $\mathrm{Spar}(u)=\Delta(u)/H(u)$, which is the relative variation of structural entropy in cluster $u$. From the information-theoretic perspective, this means that the uncertainty of the random walk can be measured locally in any internal cluster, which reflects the quality of clustering in this local area. Finally, we define the sparsest level of $T$ to be the $j$-th level such that the average sparsity of triangles rooted at nodes in $U_j$ is maximum, that is, $\arg\max_j\{\overline{\mathrm{Spar}}_j(T)\}$, where $\overline{\mathrm{Spar}}_j(T)=\frac{1}{|U_j|}\sum_{u\in U_j}\mathrm{Spar}(u)$. Then the operation of stratification stretches and compresses on the sparsest level of $T$. This is illustrated in Figure 2.
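The selection of the sparsest level can be sketched as below, assuming the per-node quantities have already been computed. Reading "relative variation" as the ratio of the entropy reduction of a node to the entropy it contributes is our interpretation; the input layout and the name `sparsest_level` are hypothetical.

```python
def sparsest_level(levels):
    """levels[j] is a list of (reduction, contribution) pairs, one per node at depth j.

    Returns (index of the level with maximum average sparsity, that average).
    A node with zero contribution is treated as having sparsity 0 (an assumption).
    """
    best_j, best_avg = None, float("-inf")
    for j, nodes in enumerate(levels):
        if not nodes:
            continue
        avg = sum(red / contrib for red, contrib in nodes if contrib > 0) / len(nodes)
        if avg > best_avg:
            best_j, best_avg = j, avg
    return best_j, best_avg


# Depth 0 holds the root; depth 1 holds two internal nodes.
levels = [[(0.1, 1.0)], [(0.3, 0.5), (0.1, 0.5)]]
level, avg = sparsest_level(levels)
print(level, avg)
```

In this toy input the second level has the larger average ratio, so it would be the one to stratify next.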

For a given positive integer $k$, to construct a cluster tree of height $k$ for graph $G$, we start from the trivial $1$-level cluster tree that involves all vertices of $G$ as leaves. Then we keep stratifying at the sparsest level recursively until a $k$-level cluster tree is obtained. This process is described in Algorithm 3.

To determine the height of the cluster tree automatically, we derive the natural clustering from the variation of sparsity on each level. Intuitively, a natural hierarchical cluster tree $T$ should have not only sparse boundaries on clusters, but also low sparsity for triangles of $T$, which means that further stratification within the reduced subgraphs corresponding to the triangles on the sparsest level makes little sense. For this reason, we consider the inflection points of the sequence $\{\delta_t\}_{t=1,2,\ldots}$, where $\delta_t$ is the structural entropy by which the $t$-th round of stratification reduces. Formally, an inflection point is defined through two conditions on this sequence that must hold simultaneously. Our algorithm finds the least $t$ at which an inflection point occurs and fixes the height of the cluster tree to be $t$ (note that after $t-1$ rounds of stratification, the number of levels is $t$). This process is described as Algorithm 4.

## 4 Experiments

Our experiments are conducted both on synthetic networks generated from the Hierarchical Stochastic Block Model (HSBM) and on real datasets. We compare our algorithm HCSE with the popular practical algorithms LOUVAIN [2] and HLP [12]. Both of these algorithms construct a non-binary cluster tree with the same framework, that is, the hierarchies are formed from bottom to top one by one. In each round, they invoke different flat clustering algorithms, Modularity and Label Propagation, respectively. To avoid over-fitting to higher levels, which possibly results in under-fitting to lower levels, LOUVAIN admits a sequential input of vertices. Usually, to avert the worst-case trap, the order in which the vertices come is random, and so the resulting cluster tree depends on this order. HLP invokes the common LP algorithm recursively, and so it cannot be guaranteed to avoid under-fitting in each round. This can be seen in our experiments on synthetic datasets, for which these two algorithms usually miss a ground-truth level.

For real datasets, we run comparative experiments on real networks. Some of them have (possibly hierarchical) ground truth, e.g., Amazon, while most of them do not. We evaluate the resulting cluster trees for the Amazon network by the Jaccard index, and for the others without ground truth by both cost(SE) and Dasgupta’s cost function cost(Das).

### 4.1 HSBM

The stochastic block model (SBM) is a probabilistic generative model based on the idea of stochastic equivalence. The SBM for flat graphs requires specifying the number of clusters, the number of vertices in each cluster, the probability of generating an edge for each pair of vertices within a cluster, and the probability of generating an edge for each pair of vertices from different clusters. For HSBM, a ground truth of hierarchies should be presumed. Suppose that there are $m$ clusters at the bottom level. Then the probability of generating edges is determined by a symmetric $m\times m$ matrix, in which the $(i,j)$-th entry is the probability of connecting each pair of vertices from clusters $i$ and $j$, respectively. Two clusters that have a higher LCA on the ground-truth tree have a lower probability.
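A generator of this kind can be sketched with the standard library only. The following is a minimal sketch under our own assumptions: each bottom-level cluster is identified by its path from the root of the ground-truth tree, the connection probability depends only on the depth of the LCA of two clusters, and the names `hsbm_matrix` and `sample_sbm` as well as the example probabilities are hypothetical.

```python
import random
from itertools import combinations


def hsbm_matrix(paths, p):
    """Symmetric matrix whose (i, j) entry is p[depth of the LCA of clusters i and j]."""

    def lca_depth(a, b):
        d = 0
        while d < len(a) and d < len(b) and a[d] == b[d]:
            d += 1
        return d

    return [[p[lca_depth(a, b)] for b in paths] for a in paths]


def sample_sbm(block_sizes, prob, seed=0):
    """Sample an undirected graph; returns (block label per vertex, list of edges)."""
    rng = random.Random(seed)
    block_of = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    edges = []
    for u, v in combinations(range(len(block_of)), 2):
        if rng.random() < prob[block_of[u]][block_of[v]]:
            edges.append((u, v))
    return block_of, edges


# Four bottom clusters arranged in a two-level binary hierarchy.
paths = [(0, 0), (0, 1), (1, 0), (1, 1)]
matrix = hsbm_matrix(paths, p=(0.01, 0.1, 0.6))   # deeper LCA, higher probability
labels, sampled = sample_sbm([20, 20, 20, 20], matrix, seed=1)
print(len(labels), len(sampled))
```

The probability vector is indexed by LCA depth with the root at depth 0, so pairs inside the same bottom cluster use its last entry.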

Our experiments utilize an HSBM with a fixed number of levels. For simplicity, let $\vec{p}=(p_0,p_1,\ldots)$ be the probability vector for which $p_i$ is the probability of generating edges for vertex pairs whose LCA on the ground-truth cluster tree has depth $i$. Note that the $0$-depth node is the root. We compare the Normalized Mutual Information (NMI) at each level of the ground-truth cluster tree to those of the three algorithms. Since the randomness in LOUVAIN, and the tie-breaking rule as well as the convergence of HLP, lead to different results, we choose the most effective strategy and pick the best result in five runs for both of them. Compared to their uncertainty, our algorithm HCSE yields stable results.
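For reference, NMI between two flat labelings can be computed as sketched below. The normalization used here (mutual information divided by the arithmetic mean of the two entropies) is an assumption, since the text does not state which variant is used, and the function name `nmi` is hypothetical.

```python
import math
from collections import Counter


def nmi(labels_a, labels_b):
    """Normalized mutual information of two labelings of the same items (base-2 logs)."""
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum(
        c / n * math.log2(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = -sum(c / n * math.log2(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log2(c / n) for c in count_b.values())
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both labelings are a single cluster
    return 2 * mi / (h_a + h_b)


same = nmi([0, 0, 1, 1], [1, 1, 0, 0])      # identical partitions, renamed labels
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # independent partitions
print(same, unrelated)  # 1.0 0.0
```

Comparing a level of the ground-truth tree with a level of an output tree amounts to calling this function on the two induced labelings of the vertices.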

Table 1 shows the results for three groups of probabilities, for which the clarity of the hierarchical structure increases from one group to the next. Our algorithm HCSE is always able to find the right number of levels, while LOUVAIN always misses the top level, and HLP misses the top level in two groups. The inflection points for choosing the intrinsic number of hierarchies are shown in Figure 3.

### 4.2 Real datasets

First, we run our experiments on the Amazon network, for which the set of ground-truth clusters has been given. For two sets $A$ and $B$, their Jaccard index is defined as $J(A,B)=|A\cap B|/|A\cup B|$. We pick the largest cluster, which forms a subgraph of the network, and run the HCSE algorithm on it. For each ground-truth cluster $c$ that appears in this subgraph, we find from the resulting cluster tree an internal node that has maximum Jaccard index with $c$. Then we calculate the average Jaccard index $\bar{J}$ over all such $c$. We also calculate cost(SE) and cost(Das). The results are shown in Table 2. HCSE performs better for $\bar{J}$ and cost(SE), while LOUVAIN performs better for cost(Das). Because of an imbalance between over-fitting and under-fitting traps, HLP outperforms neither of the other two algorithms on any criterion.
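The evaluation by best-matching Jaccard index can be sketched as follows. The function names and the toy clusters are hypothetical; the internal nodes of the resulting cluster tree are assumed to be available as a list of vertex sets.

```python
def jaccard(a, b):
    """|A intersect B| / |A union B|, with 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_best_jaccard(ground_truth, tree_clusters):
    """For each ground-truth cluster take its best-matching tree node, then average."""
    scores = [max(jaccard(c, t) for t in tree_clusters) for c in ground_truth]
    return sum(scores) / len(scores)


truth = [{1, 2, 3}, {4, 5}]
nodes = [{1, 2}, {3, 4, 5}, {1, 2, 3, 4, 5}]
score = average_best_jaccard(truth, nodes)
print(score)
```

Here each ground-truth cluster is matched best by a node with index 2/3, so the average is 2/3 as well.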

Second, we run our experiments on a series of real networks without ground truth. We compare cost(SE) and cost(Das), respectively. Since the different numbers of levels given by the three algorithms influence the costs seriously, that is, lower costs may be obtained just due to greater heights, we only list in Table 3 the networks for which the three algorithms yield similar numbers of levels. It can be observed that HLP does not achieve the optimum for any network, while HCSE performs best w.r.t. cost(Das) for all networks, but does not outperform LOUVAIN for most networks. This is mainly due to the fact that LOUVAIN always finds at least as many hierarchies as HCSE, and its better cost probably benefits from its depth.

## 5 Conclusions and future discussions

In this paper, we investigate the hierarchical clustering problem from an information-theoretic perspective and propose a new objective function that relates to the combinatorial cost functions raised by Dasgupta [6] and Cohen-Addad et al. [4]. We define the optimal $k$-level cluster tree for practical use and devise a hierarchical clustering algorithm that stratifies the sparsest level of the cluster tree recursively. This is a general framework that can be combined with any cost function. We also propose an interpretable strategy to find the intrinsic number of levels without any hyper-parameter. The experimental results on HSBM demonstrate that our algorithm HCSE has a great advantage in finding the intrinsic number of hierarchies compared to the popular but strongly heuristic algorithms LOUVAIN and HLP. Our results on real datasets show that HCSE also achieves competitive costs compared to these two algorithms.

There are several directions that are worth further study. The first problem is about the relationship between the concavity of the function $g$ in the cost function and the balance of the optimal cluster tree. We have seen that for cliques, being concave is not a sufficient condition for total balance, so is it a necessary condition? Moreover, is there any explicit necessary and sufficient condition for total balance of the optimal cluster tree for cliques? The second problem is about approximation algorithms for both structural entropy and cost(SE). Due to the non-linear and volume-related function $g$, the proof techniques for approximation algorithms in [4] become unavailable. The third one is about more precise characterizations of “natural” hierarchical clustering whose depth is limited. Since any reasonable choice of $g$ makes the cost function achieve its optimum on some binary tree, a blind pursuit of minimization of cost functions seems not to be a rational approach. More criteria in this scenario need to be studied.

## References

• [1] C. J. Alpert and S. Yao (1995) Spectral partitioning: the more eigenvectors, the better. In Proceedings of the 32nd Conference on Design Automation, pp. 195–200.
• [2] V. D. Blondel, J. Guillaume, R. Lambiotte, and E. Lefebvre (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008 (10), pp. P10008.
• [3] P. F. Brown, V. J. Della Pietra, P. V. Desouza, J. C. Lai, and R. L. Mercer (1992) Class-based n-gram models of natural language. Computational Linguistics 18 (4), pp. 467–480.
• [4] V. Cohen-Addad, V. Kanade, F. Mallmann-Trenn, and C. Mathieu (2019) Hierarchical clustering: objective functions and algorithms. Journal of the ACM (JACM) 66 (4), pp. 1–42.
• [5] A. Culotta, P. Kanani, R. Hall, M. Wick, and A. McCallum (2007) Author disambiguation using error-driven machine learning with a ranking loss function. In Sixth International Workshop on Information Integration on the Web (IIWeb-07), Vancouver, Canada.
• [6] S. Dasgupta (2016) A cost function for similarity-based hierarchical clustering. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, D. Wichs and Y. Mansour (Eds.), pp. 118–127.
• [7] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95 (25), pp. 14863–14868.
• [8] M. Ester, H. Kriegel, J. Sander, X. Xu, et al. (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Vol. 96, pp. 226–231.
• [9] A. N. Gorban, B. Kégl, D. C. Wunsch, A. Y. Zinovyev, et al. (2008) Principal manifolds for data visualization and dimension reduction. Vol. 58, Springer.
• [10] J. A. Hartigan and M. A. Wong (1979) A k-means clustering algorithm. Journal of the Royal Statistical Society. Series C (Applied Statistics) 28 (1), pp. 100–108.
• [11] A. Li and Y. Pan (2016) Structural information and dynamical complexity of networks. IEEE Trans. Inf. Theory 62 (6), pp. 3290–3339.
• [12] R. A. Rossi, N. K. Ahmed, E. Koh, and S. Kim (2020) Fast hierarchical graph clustering in linear-time. In Companion Proceedings of the Web Conference 2020, pp. 10–12.

## Appendix A Proof of Theorem 2.2

We restate Theorem 2.2 as follows.

Theorem 2.2. For any positive integer $n$, let $G$ be the clique of $n$ vertices with identical weight on every edge. Then a cluster tree $T$ of $G$ achieves minimum structural entropy if and only if $T$ is a balanced binary tree, that is, the two children clusters of each sub-tree of $T$ have difference in size at most $1$.

Note that a balanced binary tree (BBT for short) means the tree is balanced on every internal node. Formally, for an internal node of cluster size $k$, its two sub-trees are of cluster sizes $\lfloor k/2 \rfloor$ and $\lceil k/2 \rceil$, respectively.

For cliques, since the weights of all edges are identical, we can safely assume each weight to be $1$. By Theorem 2.1, minimizing the structural entropy is equivalent to minimizing the cost function (over $T$)

$$
\begin{aligned}
\mathrm{cost}_T(G) &= \sum_{(u,v)\in E} \log \mathrm{vol}(u \vee v) \\
&= \sum_{(u,v)\in E} \log\big((n-1)\,|u \vee v|\big) \\
&= \sum_{(u,v)\in E} \log(n-1) + \sum_{(u,v)\in E} \log |u \vee v|.
\end{aligned}
$$

Since the first term in the last equation is independent of $T$, the optimization turns to minimizing the last term, which we denote by $\Gamma(T)$. Grouping all edges in $E$ by the least common ancestor (LCA) of their two end-points, the cost $\Gamma(T)$ can be written as the sum of the costs at every internal node of $T$. Formally, for every internal node $N$, let $A$ and $B$ be the leaves of the sub-trees rooted at the left and right child of $N$, respectively. We have

$$
\Gamma(T) = \sum_{N} \gamma(N), \qquad
\gamma(N) = \Big(\sum_{x\in A,\, y\in B} 1\Big) \cdot \log(|A|+|B|) = |A| \cdot |B| \cdot \log(|A|+|B|).
$$
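As a numerical sanity check of Theorem 2.2 for small cliques, the node-wise decomposition above can be turned into a recursion on the root split. The following Python sketch is only an illustration under stated assumptions: the function names (`split_cost`, `min_gamma`, `best_splits`) and the tolerance `tol` are hypothetical and not part of the paper, logarithms are taken to base 2, and checking finitely many sizes does not replace the proof.

```python
from functools import lru_cache
from math import log2
from typing import List


def split_cost(n: int, a: int) -> float:
    """Best cost of a tree on n leaves whose root splits them into a and n - a."""
    # The root contributes |A| * |B| * log(|A| + |B|); both sub-trees are optimal.
    return a * (n - a) * log2(n) + min_gamma(a) + min_gamma(n - a)


@lru_cache(maxsize=None)
def min_gamma(n: int) -> float:
    """Minimum of Gamma(T) over binary cluster trees of a clique on n vertices."""
    if n <= 1:
        return 0.0  # a single leaf has no internal node
    return min(split_cost(n, a) for a in range(1, n // 2 + 1))


def best_splits(n: int, tol: float = 1e-9) -> List[int]:
    """Sizes a <= n - a of the smaller root cluster that attain min_gamma(n)."""
    best = min_gamma(n)
    return [a for a in range(1, n // 2 + 1) if split_cost(n, a) <= best + tol]


if __name__ == "__main__":
    for n in range(2, 17):
        print(n, best_splits(n), round(min_gamma(n), 4))
```

For every size tried, the only optimal root split should be the balanced one, `n // 2` versus `n - n // 2`, which is what the theorem predicts.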

Now we only have to show the following lemma.

###### Lemma A.1.

For any positive integer $n$, a cluster tree $T$ of $G$ achieves minimum cost $\Gamma(T)$ if and only if $T$ is a BBT.

###### Proof.

Lemma A.1 is proved by induction on $n$. The key technique of tree swapping we use here is inspired by Cohen-Addad et al. [4]. The basis step holds since, for $n = 2$ or $n = 3$, the cluster tree is balanced and unique, so it certainly achieves the minimum cost exclusively.

Now, consider a clique $G$ with $n \ge 4$. Let $T_1$ be an arbitrary unbalanced cluster tree and $r$ be its root. We need to prove that the cost $\Gamma(T_1)$ does not achieve the minimum. Without loss of generality, we can safely assume the root node $r$ is unbalanced, since otherwise, we set $T_1$ to be the sub-tree that is rooted at an unbalanced node. Let $T_2$ be a tree with root $r$ whose left and right sub-trees are BBTs such that they have the same sizes as the left and right sub-trees of $T_1$, respectively. Let $A_{ll}$, $A_{lr}$, $A_{rl}$ and $A_{rr}$ be the sets of nodes on the four sub-trees at the second level of $T_2$, and let $n_{ll}$, $n_{lr}$, $n_{rl}$ and $n_{rr}$ denote their sizes, respectively. Our proof is also valid when some of them are empty. We always assume $n_{ll} \le n_{lr}$ and $n_{rl} \ge n_{rr}$. Next, we construct $T_3$ by swapping (transplanting) $A_{lr}$ and $A_{rl}$ with each other. Finally, let $T_4$ be a tree with root $r$ whose left and right sub-trees are BBTs obtained by balancing the left and right sub-trees of $T_3$. So $T_4$ is a BBT. Then we only have to prove that $\Gamma(T_1) > \Gamma(T_4)$. Note that the strict “$>$” is necessary since we need to negate all unbalanced cluster trees.

Then we show that the transformation process that consists of the above three steps makes the cost decrease step by step. Formally,

• (a) $T_1$ to $T_2$. The sub-trees of $T_1$ become BBTs in $T_2$. Since the number of edges whose end-points treat the root as LCA is the same, by induction we have $\Gamma(T_1) \ge \Gamma(T_2)$.

• (b) $T_2$ to $T_3$. We will show that $\Gamma(T_2) > \Gamma(T_3)$ in Lemma A.2.

• (c) $T_3$ to $T_4$. The sub-trees of $T_3$ become BBTs in $T_4$. For the same reason as (a), we have $\Gamma(T_3) \ge \Gamma(T_4)$.

This transformation process is illustrated in Figure 4. Putting them together, we get $\Gamma(T_1) > \Gamma(T_4)$ and Lemma A.1 follows.
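The three-step transformation can be traced on a concrete tree. The Python sketch below is a minimal illustration, assuming a hypothetical encoding in which a leaf is the integer `1` and an internal node is a pair `(left, right)`; the helper names (`bbt`, `caterpillar`, `transformation_costs`) are not from the paper. It builds the four trees of the proof for one unbalanced starting tree whose two root sub-trees each have at least two leaves, balances the left sub-tree with its smaller child on the left and the right sub-tree with its larger child on the left, swaps the two inner sub-trees, and returns the four values of $\Gamma$.

```python
from math import log2

LEAF = 1  # a leaf is the integer 1; an internal node is a pair (left, right)


def size(t) -> int:
    return 1 if t == LEAF else size(t[0]) + size(t[1])


def gamma(t) -> float:
    """Gamma(T): sum over internal nodes of |A| * |B| * log2(|A| + |B|)."""
    if t == LEAF:
        return 0.0
    a, b = size(t[0]), size(t[1])
    return a * b * log2(a + b) + gamma(t[0]) + gamma(t[1])


def bbt(n: int, small_left: bool = True):
    """Balanced binary tree on n leaves; the root's smaller child goes left or right."""
    if n == 1:
        return LEAF
    lo, hi = n // 2, n - n // 2
    a, b = (lo, hi) if small_left else (hi, lo)
    return (bbt(a), bbt(b))


def caterpillar(n: int):
    """A maximally unbalanced tree on n leaves."""
    t = LEAF
    for _ in range(n - 1):
        t = (t, LEAF)
    return t


def transformation_costs(left, right):
    """Return Gamma of the four trees of the proof, starting from (left, right)."""
    nl, nr = size(left), size(right)
    if nl < 2 or nr < 2:
        raise ValueError("this sketch assumes both root sub-trees have >= 2 leaves")
    t1 = (left, right)
    # Step (a): balance both sub-trees (smaller child left on the left side,
    # larger child left on the right side).
    t2 = (bbt(nl, small_left=True), bbt(nr, small_left=False))
    # Step (b): swap the left-right and right-left second-level sub-trees.
    (ll, lr), (rl, rr) = t2
    t3 = ((ll, rl), (lr, rr))
    # Step (c): balance both sub-trees again.
    t4 = (bbt(size(t3[0])), bbt(size(t3[1])))
    return tuple(gamma(t) for t in (t1, t2, t3, t4))


if __name__ == "__main__":
    print(transformation_costs(caterpillar(3), caterpillar(8)))
```

For the starting tree with root split 3 versus 8, the four costs should be non-increasing, with a strict drop at the swap step.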

###### Lemma A.2.

After swapping $A_{lr}$ and $A_{rl}$, we obtain $T_3$ from $T_2$, for which $\Gamma(T_2) > \Gamma(T_3)$.

###### Proof.

We only need to consider the changes in cost of three nodes: root and its left and right children, since the cost contributed by each of the remaining nodes does not change after swapping. Ignoring the unchanged costs, define

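$$
\begin{aligned}
\mathrm{cost}(T_2) &= n_l n_r \log n + n_{ll} n_{lr} \log n_l + n_{rl} n_{rr} \log n_r \\
&= n_l n_r \log n + \left\lfloor \tfrac{n_l}{2} \right\rfloor \left\lceil \tfrac{n_l}{2} \right\rceil \log n_l + \left\lceil \tfrac{n_r}{2} \right\rceil \left\lfloor \tfrac{n_r}{2} \right\rfloor \log n_r,
\end{aligned}
$$

where $n_l = n_{ll} + n_{lr}$ and $n_r = n_{rl} + n_{rr}$. Both of them are at least $1$. The second equality holds because the sub-trees of $T_2$ are BBTs with $n_{ll} \le n_{lr}$ and $n_{rl} \ge n_{rr}$, so that $n_{ll} = \lfloor n_l/2 \rfloor$, $n_{lr} = \lceil n_l/2 \rceil$, $n_{rl} = \lceil n_r/2 \rceil$ and $n_{rr} = \lfloor n_r/2 \rfloor$. Similarly, define

$$
\begin{aligned}
\mathrm{cost}(T_3) &= (n_{ll}+n_{rl})(n_{lr}+n_{rr}) \log n + n_{ll} n_{rl} \log(n_{ll}+n_{rl}) + n_{lr} n_{rr} \log(n_{lr}+n_{rr}) \\
&= \left(\left\lfloor \tfrac{n_l}{2} \right\rfloor + \left\lceil \tfrac{n_r}{2} \right\rceil\right)\left(\left\lceil \tfrac{n_l}{2} \right\rceil + \left\lfloor \tfrac{n_r}{2} \right\rfloor\right) \log n
+ \left\lfloor \tfrac{n_l}{2} \right\rfloor \left\lceil \tfrac{n_r}{2} \right\rceil \log\left(\left\lfloor \tfrac{n_l}{2} \right\rfloor + \left\lceil \tfrac{n_r}{2} \right\rceil\right) \\
&\quad + \left\lceil \tfrac{n_l}{2} \right\rceil \left\lfloor \tfrac{n_r}{2} \right\rfloor \log\left(\left\lceil \tfrac{n_l}{2} \right\rceil + \left\lfloor \tfrac{n_r}{2} \right\rfloor\right).
\end{aligned}
$$

Denote

$$
\Delta = \Gamma(T_2) - \Gamma(T_3) = \mathrm{cost}(T_2) - \mathrm{cost}(T_3), \tag{2}
$$

with both costs taken in their rounded forms above.

So we only have to show that $\Delta > 0$. We consider the following three cases according to the parity of $n_l$ and $n_r$.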

• Case 1: $n_l$ and $n_r$ are even.

• Case 2: $n_l$ and $n_r$ are odd.

• Case 3: $n_l$ is odd while $n_r$ is even.

The case that $n_l$ is even while $n_r$ is odd is symmetric to Case 3.

For Case 1, if both $n_l$ and $n_r$ are even, then the rounding notations in Eq. (2) can be removed and $\Delta$ can be simplified as

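$$
\Delta = \frac{n_l^2}{4} \log\frac{n_l}{n} + \frac{n_r^2}{4} \log\frac{n_r}{n} + \frac{n_l n_r}{2}.
$$

Let $p = n_l/n$ and $q = n_r/n$, so $p + q = 1$. Recall that $T_1$ is unbalanced on the root $r$, and so is $T_2$. Thus $p \neq q$. Multiplying by $4/n^2$ on both sides, we only have to prove that

$$
p^2 \log p + q^2 \log q + 2pq > 0.
$$

That is,

$$
\frac{p}{q} \log p + \frac{q}{p} \log q + 2 > 0.
$$

Let $g(x) = \frac{x}{1-x} \log x$. Then we only need to show that $g(p) + g(q) > -2$ when $p + q = 1$ and $p \neq q$. We have

$$
g'(x) = \frac{(1-x) + \ln x}{\ln 2 \cdot (1-x)^2}, \qquad
g''(x) = -\frac{x^2 - 2x \ln x - 1}{\ln 2 \cdot x (1-x)^3}.
$$

It is easy to check that $g''(x) > 0$ when $0 < x < 1$. So $g$ is strictly convex in the interval $(0, 1)$. Since $p \neq q$,

$$
g(p) + g(q) > 2 g\left(\frac{p+q}{2}\right) = -2.
$$

Thus $\Delta > 0$ holds.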
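The convexity argument for Case 1 can be checked numerically. The Python sketch below is a hedged illustration rather than part of the proof: it assumes $g(x) = \frac{x}{1-x}\log_2 x$, evaluates $g(p) + g(1-p) + 2$ on a grid that excludes $p = 1/2$, and evaluates the simplified Case 1 expression for $\Delta$ on small even sizes. The function names and the grid resolution are illustrative choices.

```python
from math import log2


def g(x: float) -> float:
    """g(x) = x / (1 - x) * log2(x) on the open interval (0, 1)."""
    return x / (1.0 - x) * log2(x)


def delta_even(nl: int, nr: int) -> float:
    """Delta for Case 1 (both sizes even), using the simplified formula."""
    n = nl + nr
    return nl**2 / 4 * log2(nl / n) + nr**2 / 4 * log2(nr / n) + nl * nr / 2


def min_gap(steps: int = 1000) -> float:
    """Smallest g(p) + g(1 - p) + 2 over the grid p = k / steps with p != 1/2."""
    return min(
        g(k / steps) + g(1 - k / steps) + 2
        for k in range(1, steps)
        if 2 * k != steps
    )


if __name__ == "__main__":
    print("g(1/2) =", g(0.5))
    print("min gap on grid =", min_gap())
    print("Delta(2, 4) =", delta_even(2, 4))
```

The gap shrinks towards zero as $p$ approaches $1/2$, which is why the strict inequality needs $p \neq q$.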

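For Case 2, if both $n_l$ and $n_r$ are odd, then $\Delta$ can be split into two parts $\Delta = \Delta_1 + \Delta_2$, in which

$$
\Delta_1 = \frac{n_l^2}{4} \log\frac{n_l}{n} + \frac{n_r^2}{4} \log\frac{n_r}{n} + \frac{n_l n_r}{2}, \qquad
\Delta_2 = \frac{1}{4}\left(2 \log\frac{n}{2} - \log n_l - \log n_r\right).
$$

Since we have shown that $\Delta_1 > 0$, if we can prove $\Delta_2 \ge 0$, then the lemma will hold for Case 2. Due to the concavity of the logarithmic function, this holds clearly since

$$
2 \log\left(\frac{n}{2}\right) \ge \log n_l + \log n_r.
$$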

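For Case 3, if $n_l$ is odd while $n_r$ is even,

$$
\Delta = \frac{n_l^2 - 1}{4} \log\frac{n_l}{n} + \frac{n_r^2}{4} \log\frac{n_r}{n} - \left[\frac{(n_l - 1) n_r}{4} \log\frac{n-1}{2n} + \frac{(n_l + 1) n_r}{4} \log\frac{n+1}{2n}\right].
$$

Multiplying the above equation by $4 \ln 2$, without changing its sign, yields

$$
(4 \ln 2)\Delta = (n_l^2 - 1) \ln\frac{n_l}{n} + n_r^2 \ln\frac{n_r}{n} - \left[(n_l - 1) n_r \ln\frac{n-1}{2n} + (n_l + 1) n_r \ln\frac{n+1}{2n}\right].
$$

Splitting the right hand side into two parts $A + B$,

$$
\begin{aligned}
A &= n_l^2 \ln\frac{n_l}{n} + n_r^2 \ln\frac{n_r}{n} + 2 n_l n_r \ln 2, \\
B &= -\ln\frac{n_l}{n} - (n_l + 1) n_r \ln\left(1 + \frac{1}{n}\right) - (n_l - 1) n_r \ln\left(1 - \frac{1}{n}\right).
\end{aligned}
$$

Since $n$ is odd and the root of $T_2$ is unbalanced, we only need to consider the case that $n_l = \frac{n-i}{2}$, $n_r = \frac{n+i}{2}$ (note that $n_l$ and $n_r$ are symmetric in $A$, so if $\frac{n-i}{2}$ is even, exchange $n_l$ and $n_r$), where both $n$ and $i$ are odd satisfying $3 \le i \le n-2$. Next we show that in this case, $A > 0.36$ and $B > -0.09$. By calculation, $\Delta > 0$ for Case 3.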

###### Claim A.1.

$A > 0.36$ for odd integers $n$ and $i$ with $3 \le i \le n-2$.

###### Proof.

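Substituting $n_l = \frac{n-i}{2}$, $n_r = \frac{n+i}{2}$ into $A$ yields

$$
A = C(n, i) \triangleq \left(\frac{n-i}{2}\right)^2 \ln\left(\frac{n-i}{2n}\right) + \left(\frac{n+i}{2}\right)^2 \ln\left(\frac{n+i}{2n}\right) + 2 \cdot \frac{n-i}{2} \cdot \frac{n+i}{2} \ln 2.
$$

Treating $n$ as a continuous variable, we have

$$
\frac{\partial C(n, i)}{\partial n} = \frac{1}{2}\left[(n+i) \ln\left(1 + \frac{i}{n}\right) + (n-i) \ln\left(1 - \frac{i}{n}\right) - \frac{i^2}{n}\right].
$$

Multiplying the above equation by $2/n$ and setting $x = i/n$ yields

$$
\begin{aligned}
f(x) &\triangleq (1+x) \ln(1+x) + (1-x) \ln(1-x) - x^2, \\
f'(x) &= \ln(1+x) - \ln(1-x) - 2x, \\
f''(x) &= \frac{2x^2}{1-x^2}.
\end{aligned}
$$

It is easy to check that $f(0) = 0$ and $f'(0) = 0$. When $0 < x < 1$, $f''(x) > 0$. Thus $f'(x) > 0$ and $f(x) > 0$. This means that $\frac{\partial C(n,i)}{\partial n} > 0$ for all $n > i$. So $C(n, i) \ge C(i+2, i)$ for $n \ge i+2$ (when $i$ is fixed, the minimum value of $n$ can be taken to be $i+2$, which makes $n_l$ and $n_r$ integral). The curves of $C(n, i)$ for varying $i$ are plotted in Figure 5.

When $n = i+2$, we get $n_l = 1$ and $n_r = n-1$. Substituting them into $A$ yields

$$
\begin{aligned}
D(n) &\triangleq -\ln n + (n-1)^2 \ln\left(1 - \frac{1}{n}\right) + 2(n-1) \ln 2, \\
\frac{dD}{dn} &= 1 - \frac{2}{n} + 2 \ln 2 + 2(n-1) \ln\left(1 - \frac{1}{n}\right).
\end{aligned}
$$

When $n \ge 5$, it is easy to check that $\frac{dD}{dn} > 0$. So the minimum value of $D(n)$, which is also the minimum value of $A$, is achieved at $n = 5$. So $A \ge D(5) > 0.36$. ∎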
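The monotonicity argument of Claim A.1 can be cross-checked numerically. The Python sketch below is an illustration under stated assumptions: it takes $n_l = (n-i)/2$, $n_r = (n+i)/2$ with odd $n$ and $i$ and $3 \le i \le n-2$, defines $D(n)$ as the boundary case $n_l = 1$, $n_r = n-1$ of $A$, and uses natural logarithms as in the displayed formulas. The function names and the scanned range are illustrative choices.

```python
from math import log


def C(n: int, i: int) -> float:
    """A as a function of n and i, with n_l = (n - i) / 2 and n_r = (n + i) / 2."""
    nl, nr = (n - i) / 2, (n + i) / 2
    return nl**2 * log(nl / n) + nr**2 * log(nr / n) + 2 * nl * nr * log(2)


def D(n: int) -> float:
    """Boundary case of A with n_l = 1 and n_r = n - 1 (that is, i = n - 2)."""
    return -log(n) + (n - 1) ** 2 * log(1 - 1 / n) + 2 * (n - 1) * log(2)


def min_C(n_max: int = 101) -> float:
    """Minimum of C(n, i) over odd n in [5, n_max] and odd i in [3, n - 2]."""
    return min(
        C(n, i)
        for n in range(5, n_max + 1, 2)
        for i in range(3, n - 1, 2)
    )


if __name__ == "__main__":
    print("D(5) =", D(5))
    print("min C =", min_C())
```

On the scanned range the minimum of $C(n, i)$ should coincide with $D(5)$, attained at $n = 5$, $i = 3$.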

###### Claim A.2.

$B > -0.09$ for $n \ge 5$.

###### Proof.

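Due to the facts that

$$
\ln\left(1 + \frac{1}{n}\right) < \frac{1}{n} - \frac{1}{2n^2} + \frac{1}{3n^3}, \qquad
\ln\left(1 - \frac{1}{n}\right) < -\frac{1}{n} - \frac{1}{2n^2} - \frac{1}{3n^3},
$$

we have

$$
\begin{aligned}
B &= -\ln\frac{n_l}{n} - (n_l + 1) n_r \ln\left(1 + \frac{1}{n}\right) - (n_l - 1) n_r \ln\left(1 - \frac{1}{n}\right) \\
&> -\ln\frac{n_l}{n} + \frac{n_l n_r}{n^2} - \frac{2 n_r}{n} - \frac{2 n_r}{3 n^3} \\
&> -\ln\frac{n_l}{n} + \frac{n_l n_r}{n^2} - \frac{2 n_r}{n} - \frac{2}{3 n^2}.
\end{aligned}
$$

Let $\alpha = n_l/n$, then

$$
B > -\ln\alpha + \alpha(1-\alpha) - 2(1-\alpha) - \frac{2}{3n^2} \ge \ln 2 - \frac{3}{4} - \frac{2}{3n^2},
$$

where the last inequality holds because $-\ln\alpha + \alpha(1-\alpha) - 2(1-\alpha)$ attains its minimum over $0 < \alpha < 1$ at $\alpha = 1/2$.

When $n \ge 5$, $B > -0.09$. ∎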

Combining Claims A.1 and A.2, Lemma A.2 follows. ∎
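The whole of Case 3 can also be cross-checked against the definition of $\Delta$. The Python sketch below is a hedged illustration, not the authors' code: it assumes the sub-tree sizes $n_{ll} = \lfloor n_l/2 \rfloor$, $n_{lr} = \lceil n_l/2 \rceil$, $n_{rl} = \lceil n_r/2 \rceil$, $n_{rr} = \lfloor n_r/2 \rfloor$, computes $\Delta = \mathrm{cost}(T_2) - \mathrm{cost}(T_3)$ directly, and compares $(4\ln 2)\Delta$ with $A + B$. It also compares $B$ with the lower bound $\ln 2 - \frac{3}{4} - \frac{2}{3n^2}$ derived in the proof of Claim A.2. All function names and the scanned range are illustrative.

```python
from math import log, log1p, log2


def delta(nl: int, nr: int) -> float:
    """cost(T2) - cost(T3) with rounded sub-tree sizes (n_ll <= n_lr, n_rl >= n_rr)."""
    n = nl + nr
    ll, lr = nl // 2, nl - nl // 2
    rl, rr = nr - nr // 2, nr // 2
    cost2 = nl * nr * log2(n) + ll * lr * log2(nl) + rl * rr * log2(nr)
    cost3 = (
        (ll + rl) * (lr + rr) * log2(n)
        + ll * rl * log2(ll + rl)
        + lr * rr * log2(lr + rr)
    )
    return cost2 - cost3


def A(nl: int, nr: int) -> float:
    n = nl + nr
    return nl**2 * log(nl / n) + nr**2 * log(nr / n) + 2 * nl * nr * log(2)


def B(nl: int, nr: int) -> float:
    n = nl + nr
    return (
        -log(nl / n)
        - (nl + 1) * nr * log1p(1 / n)
        - (nl - 1) * nr * log1p(-1 / n)
    )


def B_lower_bound(n: int) -> float:
    """The bound ln 2 - 3/4 - 2 / (3 n^2) from the proof of Claim A.2."""
    return log(2) - 0.75 - 2 / (3 * n**2)


def case3_pairs(limit: int = 60):
    """Odd n_l, even n_r, root unbalanced (sizes differ by at least 3)."""
    return [
        (nl, nr)
        for nl in range(1, limit, 2)
        for nr in range(2, limit, 2)
        if abs(nl - nr) >= 3
    ]


if __name__ == "__main__":
    for nl, nr in [(1, 4), (5, 2), (3, 8)]:
        print(nl, nr, delta(nl, nr), A(nl, nr) + B(nl, nr))
```

Note that $B$ itself can be slightly negative (for example at $n_l = 5$, $n_r = 2$), so the positivity of $\Delta$ in this case relies on $A$ dominating.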

This completes the proof of Theorem 2.2.