1 Introduction
Hierarchical clustering for graphs plays an important role in the structural analysis of a given data set. Understanding hierarchical structures at multiple levels of granularity is fundamental in various disciplines, including artificial intelligence, physics, biology and sociology [3, 7, 9, 5]. Hierarchical clustering requires a cluster tree that represents a recursive partitioning of a graph into smaller clusters as the tree nodes get deeper. In this cluster tree, each leaf represents a graph node, while each non-leaf node represents a cluster containing its descendant leaves. The root is the largest cluster, containing all leaves.

During the last two decades, flat clustering has attracted great attention and bred plenty of algorithms, such as $k$-means [10], DBSCAN [8], spectral partitioning [1], and so on. From the combinatorial perspective, we have cost functions, e.g., modularity and centrality measures, to evaluate the quality of partition-based clustering. Therefore, community detection is usually formulated as an optimization problem over these objectives. By contrast, no comparable cost function with a clear and reasonable combinatorial explanation was developed until Dasgupta [6] introduced a cost function for cluster trees. In this definition, similarity or dissimilarity between data points is represented by weighted edges. Taking the similarity scenario as an example, a cluster is a set of nodes with relatively denser intra-links compared with its inter-links, and in a good cluster tree, heavier edges tend to connect leaves whose lowest common ancestor is as deep as possible. This intuition leads to Dasgupta’s cost function, a weighted linear combination of the sizes of the lowest common ancestors over all edges.

Motivated by Dasgupta’s cost function, Cohen-Addad et al. [4] proposed the concept of an admissible cost function. In their definition, the size of each lowest common ancestor in Dasgupta’s cost function is generalized to a function of the sizes of its left and right children. For all similarity graphs generated from a minimal ultrametric, a cluster tree achieves the minimum cost if and only if it is a generating tree, which is a “natural” ground-truth tree in an axiomatic sense therein. A necessary condition for the admissibility of an objective function is that it achieves the same value on every cluster tree of a uniformly weighted clique, which has no structure in the common sense. However, any slight deviation of edge weights would generally separate the two endpoints of a light edge at a high level of the optimal (similarity-based) cluster tree.
Thus, admissible objective functions, of which Dasgupta’s cost function is a specific form, would seem to be an unchallenged criterion for evaluating cluster trees, since they are formulated by an axiomatic approach.
However, admissible cost functions appear imperfect in practice. The arbitrariness of optimal cluster trees for cliques indicates that the division at each internal node of an optimal cluster tree totally neglects the balance of its two children. Edge weight is the unique factor that decides the structure of optimal trees. But a balanced tree is commonly considered a better candidate in hierarchical clustering than an unbalanced one. So even when clustering cliques, a balanced partition should be preferable for each internal node. At least, the height of a balanced optimal cluster tree, which is logarithmic in the graph size, is intuitively more reasonable than that of a caterpillar-shaped cluster tree, whose height is linear in the graph size. Moreover, a simple proof implies that the optimal cluster tree for any connected graph is binary. This property is not always desirable in practice, since a real system usually has its inherent number of hierarchies and a natural partition for each internal cluster. For instance, the levels of administrative division in a country are usually intrinsic, and it is not suitable to differentiate hierarchies for parallel cities in the same state. Such a structure cannot be obtained by simply minimizing admissible cost functions.
In this paper, we investigate hierarchical clustering from the perspective of information theory. Our study is based on Li and Pan’s structural information theory [11], whose core concept, named structural entropy, measures the complexity of hierarchical networks. We formulate a new objective function from this point of view, which builds a bridge between the combinatorial and information-theoretic perspectives on hierarchical clustering. For this cost function, the balance of cluster trees is naturally involved as a factor, just as in the design of optimal codes, for which the balance of probability over objects is fundamental in constructing an efficient coding tree. We also define cluster trees of a specific height, which coincides with our intuition of natural clustering. For practical use, we develop a novel algorithm for natural hierarchical clustering in which the number of hierarchies is determined automatically. The idea of our algorithm is essentially different from the popular recursive division or agglomeration frameworks. We formulate two basic operations on cluster trees, called stretch and compress, respectively, to search for the sparsest level iteratively. Our algorithm HCSE terminates when a specific criterion, which intuitively coincides with the natural hierarchies, is met. Our extensive experiments on both synthetic and real datasets demonstrate that HCSE outperforms the present popular heuristic algorithms LOUVAIN [2] and HLP [12]. The latter two algorithms proceed simply by recursively invoking flat clustering algorithms based on modularity and label propagation, respectively. For both of them, the number of hierarchies is solely determined by the number of rounds executed before termination, so their interpretability is quite poor. Our experimental results on synthetic datasets show that HCSE has a great advantage in finding the intrinsic number of hierarchies, and the results on real datasets show that HCSE achieves competitive costs compared with LOUVAIN and HLP.

We organize this paper as follows. Structural information theory and its relationship with combinatorial cost functions are introduced in Section 2, and our algorithm is given in Section 3. The experiments and their results are presented in Section 4. We conclude the paper in Section 5.
2 A cost function from the information-theoretic perspective
In this section, we introduce Li and Pan’s structural information theory [11] and the combinatorial cost functions of Dasgupta [6] and Cohen-Addad et al. [4]. Then we propose a new cost function that is developed from structural information theory and establish the relationship between the information-theoretic and combinatorial perspectives.
2.1 Notations
Let $G=(V,E,w)$ be an undirected weighted graph with a set $V$ of vertices, a set $E$ of edges and a weight function $w: E \to \mathbb{R}^+$, where $\mathbb{R}^+$ denotes the set of all positive real numbers. An unweighted graph can be viewed as a weighted one whose weights are all units. For each vertex $u \in V$, denote by $d_u$ the weighted degree of $u$. (From now on, whenever we say the degree of a vertex, we always refer to the weighted degree.) For a subset of vertices $S \subseteq V$, define the volume of $S$ to be the sum of the degrees of the vertices in $S$. We denote it by $\mathrm{vol}(S)$.
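As a quick illustration of these notations, the following sketch computes weighted degrees and volumes on an invented toy graph (the edge-list representation and helper names are our own, not the authors’ code):

```python
# Toy weighted graph as an edge list (u, v, weight); illustrative only.
edges = [("a", "b", 2.0), ("b", "c", 1.0), ("a", "c", 3.0)]

def degree(u, edges):
    """Weighted degree d_u: total weight of edges incident to u."""
    return sum(w for x, y, w in edges if u in (x, y))

def volume(S, edges):
    """vol(S): sum of weighted degrees d_u over all u in S."""
    return sum(degree(u, edges) for u in S)
```

Note that $\mathrm{vol}(V)$ is twice the total edge weight, since every edge contributes its weight to the degrees of both endpoints.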
A cluster tree $T$ for a graph $G$ is a rooted tree with $|V|$ leaves, each of which is labeled by a distinct vertex $u \in V$. Each non-leaf node on $T$ is labeled by the subset of $V$ that consists of all the leaves treating it as an ancestor. For each node $\alpha$ on $T$, denote by $\alpha^-$ the parent of $\alpha$. For each pair of leaves $u$ and $v$, denote by $u \vee v$ the least common ancestor (LCA) of $u$ and $v$ on $T$.
2.2 Structural information and structural entropy
The idea of structural information is to encode a random walk with a certain rule by using a high-dimensional encoding system for a graph $G$. It is well known that a random walk, for which a neighbor is randomly chosen with probability proportional to edge weights, has a stationary distribution on vertices that is proportional to vertex degrees. (For connected graphs, this stationary distribution is unique, but not for disconnected ones. Here, we consider this one for all graphs.) So to position the random walk under its stationary distribution, the amount of information needed is Shannon’s entropy, denoted by
$$\mathcal{H}^{(1)}(G) = -\sum_{u \in V} \frac{d_u}{\mathrm{vol}(G)} \log_2 \frac{d_u}{\mathrm{vol}(G)}.$$
By Shannon’s noiseless coding theorem, $\mathcal{H}^{(1)}(G)$ is the limit of the average code length generated from the memoryless source given by one step of the random walk. However, dependence among locations may shorten the code length. For each level of a cluster tree, the uncertainty of locations is measured by the entropy of the stationary distribution over the clusters at this level. Consider an encoding of every cluster, including the leaves. Each non-root node $\alpha$ is labeled by its order among the children of its parent $\alpha^-$. The self-information of $\alpha$ within this local parent-children substructure is $-\log_2(\mathrm{vol}(\alpha)/\mathrm{vol}(\alpha^-))$, which is also roughly the length of the Shannon code for $\alpha$ and its siblings. The codeword of $\alpha$ consists of the sequential labels of the nodes along the unique path from the root (excluded) to $\alpha$ itself (included). The key idea is as follows. For one step of the random walk from $u$ to $v$ in $G$, to indicate $v$, we omit from $v$’s codeword the longest common prefix of $u$’s and $v$’s codewords, which is exactly the codeword of $u \vee v$. This means that the random walk takes this step within the cluster $u \vee v$ (and also within the ancestors of $u \vee v$), and the uncertainty at these levels need not be involved. Therefore, intuitively, a high-quality similarity-based cluster tree traps the random walk with high frequency in the deep clusters that are far from the root, so that long codewords of LCAs are omitted. This shortens the average code length of the random walk. Note that we ignore the uniqueness of decoding, since a practical design of codewords is not our purpose. We utilize this scheme to evaluate and differentiate hierarchical structures.
We now formalize the above scheme and measure the average code length. Given a weighted graph $G$ and a cluster tree $T$ for $G$, note that under the stationary distribution, the random walk takes one step out of a cluster $\alpha$ on $T$ with probability $g_\alpha/\mathrm{vol}(G)$, where $g_\alpha$ is the sum of the weights of the edges with exactly one endpoint in $\alpha$. Therefore, the aforementioned uncertainty, measured by the average code length, is
$$\mathcal{H}^T(G) = -\sum_{\alpha \in T,\ \alpha \neq \mathrm{root}} \frac{g_\alpha}{\mathrm{vol}(G)} \log_2 \frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^-)}.$$
We call $\mathcal{H}^T(G)$ the structural entropy of $G$ on $T$. We define the structural entropy of $G$ to be the minimum over all cluster trees, denoted by $\mathcal{H}(G) = \min_T \mathcal{H}^T(G)$. Note that the structural entropy of $G$ on the trivial $1$-level cluster tree, which has no non-trivial cluster, coincides with the previously defined $\mathcal{H}^{(1)}(G)$.
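To make the definition concrete, here is a minimal sketch that evaluates $\mathcal{H}^T(G)$ for a two-level cluster tree (root, one layer of clusters, then leaves). The toy graph, the function names, and the two-level restriction are our own illustrative assumptions:

```python
import math

# Toy graph: two natural clusters {a, b} and {c, d} joined by a light edge.
edges = [("a", "b", 1.0), ("c", "d", 1.0), ("b", "c", 0.5)]

def degree(u):
    return sum(w for x, y, w in edges if u in (x, y))

def vol(S):
    return sum(degree(u) for u in S)

def cut(S):
    """g_S: total weight of edges with exactly one endpoint in S."""
    return sum(w for x, y, w in edges if (x in S) != (y in S))

def two_level_entropy(partition, vertices):
    """H^T(G) for the tree root -> clusters in `partition` -> leaves:
    sum over non-root nodes alpha of
    -(g_alpha / vol(G)) * log2(vol(alpha) / vol(parent))."""
    volG = vol(vertices)
    H = 0.0
    for S in partition:
        H -= cut(S) / volG * math.log2(vol(S) / volG)              # cluster level
        for u in S:
            H -= degree(u) / volG * math.log2(degree(u) / vol(S))  # leaf level
    return H
```

On this graph, the natural partition $\{a,b\},\{c,d\}$ yields a strictly smaller entropy than the crossing partition $\{a,c\},\{b,d\}$, as the definition suggests.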
2.3 Combinatorial explanation of structural entropy
The cost function of a cluster tree $T$ for a graph $G$ introduced by Dasgupta [6] is defined to be $\mathrm{cost}^{\mathrm{Das}}(T) = \sum_{(u,v)\in E} w(u,v)\,|u \vee v|$, where $|\alpha|$ denotes the size of the cluster $\alpha$. The admissible cost functions introduced by Cohen-Addad et al. [4] generalize the term $|u \vee v|$ in the definition of $\mathrm{cost}^{\mathrm{Das}}(T)$ to a general function of the sizes of the two children of $u \vee v$, for which Dasgupta’s definition corresponds to the sum of the two sizes. For both definitions, the optimal hierarchical clustering of $G$ corresponds to a cluster tree of minimum cost, in the combinatorial sense that heavy edges are cut as far down the tree as possible. The following theorem establishes the relationship between structural entropy and this kind of combinatorial cost function.
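As a toy illustration of Dasgupta’s cost, the sketch below computes $\sum_{(u,v)\in E} w(u,v)\,|u\vee v|$ on a small tree encoded as nested tuples (the representation, example graph, and names are our own illustrative choices, not from the paper):

```python
def leaves(t):
    """Set of leaf labels under node t (a leaf label or a nested pair)."""
    return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])

def lca_size(t, u, v):
    """|u v v|: number of leaves under the LCA of u and v in tree t."""
    while isinstance(t, tuple):
        for child in t:
            s = leaves(child)
            if u in s and v in s:   # descend while some child contains both
                t = child
                break
        else:
            break                    # t is the LCA
    return len(leaves(t))

def dasgupta_cost(t, edges):
    """cost^Das(T) = sum over edges (u, v, w) of w * |lca(u, v)|."""
    return sum(w * lca_size(t, u, v) for u, v, w in edges)
```

For the tree `(("a","b"),("c","d"))` and edges `[("a","b",1.0),("c","d",1.0),("b","c",0.5)]`, the cost is $1\cdot 2 + 1\cdot 2 + 0.5\cdot 4 = 6$: the light edge is cut at the root, the heavy edges deep down.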
Theorem 2.1.
For a weighted graph $G$, minimizing $\mathcal{H}^T(G)$ (over $T$) is equivalent to minimizing the cost function
$$\mathrm{cost}^{\mathrm{SE}}(T) = \sum_{(u,v)\in E} w(u,v) \log_2 \mathrm{vol}(u \vee v). \qquad (1)$$
Proof.
Note that
$$\mathcal{H}^T(G) = -\sum_{\alpha \neq \mathrm{root}} \frac{g_\alpha}{\mathrm{vol}(G)} \log_2 \frac{\mathrm{vol}(\alpha)}{\mathrm{vol}(\alpha^-)}.$$
For a single edge $(u,v)$, the terms $\log_2(\mathrm{vol}(\alpha^-)/\mathrm{vol}(\alpha))$ over the nodes $\alpha$ along the unique path from the leaf $u$ up to (but excluding) $u \vee v$ telescope and sum to $\log_2(\mathrm{vol}(u \vee v)/d_u)$. It is symmetric for $v$. Therefore, considering each edge from both of its endpoints,
$$\mathcal{H}^T(G) = \frac{1}{\mathrm{vol}(G)} \sum_{(u,v)\in E} w(u,v)\left(\log_2\frac{\mathrm{vol}(u\vee v)}{d_u} + \log_2\frac{\mathrm{vol}(u\vee v)}{d_v}\right) = -\frac{1}{\mathrm{vol}(G)}\sum_{u\in V} d_u\log_2 d_u + \frac{2}{\mathrm{vol}(G)}\sum_{(u,v)\in E} w(u,v)\log_2\mathrm{vol}(u\vee v).$$
The second equality follows from the fact that $\sum_{(u,v)\in E} w(u,v)(\log_2 d_u + \log_2 d_v) = \sum_{u\in V} d_u \log_2 d_u$ and the symmetry of the two endpoints. Since the first summation is independent of $T$, minimizing $\mathcal{H}^T(G)$ is equivalent to minimizing $\sum_{(u,v)\in E} w(u,v)\log_2 \mathrm{vol}(u\vee v)$. ∎
Theorem 2.1 indicates that if we view the generalized term as a function of vertex sets rather than of their sizes and define it to be $\log_2 \mathrm{vol}(A \cup B)$ for children clusters $A$ and $B$, then this function becomes equivalent to structural entropy in evaluating cluster trees, although it is no longer admissible.
So what is the difference between these two kinds of cost functions? As stated by Cohen-Addad et al. [4], an important axiomatic hypothesis for admissible functions, and thus also for Dasgupta’s cost function, is that the cost of every binary cluster tree of an unweighted clique is identical. This means that any binary tree is a reasonable clustering of a clique, which coincides with the common sense that structureless datasets can be organized hierarchically in an arbitrary way. However, for structural entropy, the following theorem indicates that balanced organization is important even for structureless datasets.
Theorem 2.2.
For any positive integer $n$, let $K_n$ be the clique of $n$ vertices with identical weight on every edge. Then a cluster tree $T$ of $K_n$ achieves the minimum structural entropy if and only if $T$ is a balanced binary tree, that is, the two children clusters of each subtree of $T$ differ in size by at most $1$.
The proof of Theorem 2.2 is a bit technical, and we defer it to Appendix A. The intuition behind Theorem 2.2 is that balanced codes are the most efficient encoding scheme for unrelated data. So the codewords of a random walk that jumps freely among the clusters at each level of a cluster tree have the minimum average length if all the clusters at each level are balanced. This is exactly what a balanced cluster tree guarantees.
In the cost function (1), which we call cost(SE) from now on, the term $\log_2 \mathrm{vol}(u \vee v)$ is a concave function of the volume of $u \vee v$. In the context of regular graphs (e.g., cliques), replacing the volume by the cluster size is equivalent for optimization. Dasgupta [6] claimed that for the clique of four vertices, a balanced tree is preferable when the size $|u \vee v|$ is replaced by $f(|u \vee v|)$ for any strictly increasing concave function $f$ with $f(0) = 0$. However, it is interesting to note that this preference does not generalize to all such concave functions on larger cliques. For example, it is easy to check that for $f(n) = \sqrt{n}$, there are cliques whose minimum-cost cluster trees split unbalancedly at the top level. Theorem 2.2 shows that for all cliques, balanced trees are preferable when $f$ is a logarithmic function.
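Theorem 2.2 can be spot-checked by brute force on small cliques. The sketch below enumerates all binary cluster trees of $K_n$ through recursive root splits; it relies on the fact that in $K_n$ every cluster $\alpha$ satisfies $\mathrm{vol}(\alpha) = |\alpha|(n-1)$, so an internal node with child sizes $(a,b)$ contributes $ab\log_2((a+b)(n-1))$ to cost(SE). The function and names are our own:

```python
import math

def min_cost(size, n):
    """Minimum cost(SE) over binary cluster trees of a cluster of `size`
    vertices inside the clique K_n, together with the root split (a, b)
    achieving it (a <= b). Returns (cost, split)."""
    if size == 1:
        return 0.0, None
    best = None
    for a in range(1, size // 2 + 1):              # split into sizes (a, size - a)
        b = size - a
        # edges with LCA at this node: a * b of them, each of weight 1,
        # each paying log2(vol of this cluster) = log2(size * (n - 1))
        c = (a * b * math.log2(size * (n - 1))
             + min_cost(a, n)[0] + min_cost(b, n)[0])
        if best is None or c < best[0]:
            best = (c, (a, b))
    return best
```

For $K_4$ the optimal root split is $(2,2)$ and for $K_5$ it is the balanced $(2,3)$, matching the theorem; swapping `math.log2` for `math.sqrt`-style functions lets one probe the concave-function discussion above.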
It is worth noting that the admissible functions introduced by Cohen-Addad et al. [4] are defined from the viewpoint that a generating tree of a similarity-based graph generated from a minimal ultrametric achieves the minimum cost. In this definition, the monotonicity of edge weights between clusters at each level from bottom to top, which is given by Cohen-Addad et al. [4] as a property of a “natural” ground-truth hierarchical clustering, is the unique factor in evaluating cluster trees. However, Theorem 2.2 implies that for cost(SE), besides cluster weights, the balance of cluster trees is implicitly involved as another factor. Moreover, for cliques, the minimum cost is achieved on every subtree, which makes an optimal cluster tree balanced everywhere. This optimal clustering for cliques is also robust in the sense that a slight perturbation of the minimal ultrametric, which can be viewed as slight variations to the weights of a batch of edges, will not change the optimal cluster tree structure wildly, due to the holdback force of balance.
3 Our hierarchical clustering algorithm
In this section, we develop an algorithm to optimize cost(SE) (Eq. (1), equivalent to optimizing structural entropy) and yield the associated cluster tree. At present, existing algorithms for hierarchical clustering fall into two frameworks: top-down division and bottom-up agglomeration [4]. The top-down division approach usually yields a binary tree by recursively dividing a cluster into two parts according to a cut-related criterion. But a binary cluster tree is far from a practical one, as discussed in Section 1. For practical use, bottom-up agglomeration, also known as hierarchical agglomerative clustering (HAC), is commonly preferred. It constructs a cluster tree from the leaves to the root recursively, and in each round the newly generated clusters shrink into single vertices.
Our algorithm jumps out of these two frameworks. We establish a new one that stratifies the sparsest level of a cluster tree recursively, rather than in a sequential order. In general, guided by cost(SE), we construct a $(k+1)$-level cluster tree from the previous $k$-level one, in which the level whose stratification most decreases the average local cost, measured on a locally reduced subgraph, is differentiated into two levels. The process of stratification consists of two basic operations: stretch and compress. In a stretch step, given an internal node of a cluster tree, a local binary subtree is constructed by an agglomerative approach, while in a compress step, the overlength paths from the root of this binary subtree to its leaves are compressed by shrinking the tree edges whose contraction increases the cost least. This framework can be combined with any cost function.
We now define the operations stretch and compress formally. Given a cluster tree $T$ for a graph $G$, let $\alpha$ be an internal node on $T$ and let $\beta_1, \beta_2, \ldots, \beta_t$ be its children. We call this local parent-children structure a triangle of $T$. These two operations are defined on triangles. Note that each child $\beta_i$ of $\alpha$ is a cluster in $G$. We reduce $G$ by shrinking each $\beta_i$ into a single vertex, maintaining the edges between different children and ignoring the edges inside each child. This reduction captures the connections of the clusters at this level within the parent cluster $\alpha$. The stretch operation proceeds on a triangle in the HAC manner. That is, initially, view each $\beta_i$ as a cluster, and recursively combine the two clusters into a new one for which cost(SE) drops most. The sequence of combinations yields a binary subtree rooted at $\alpha$ that has $\beta_1, \ldots, \beta_t$ as leaves. The compress operation is then proposed to reduce the height of this binary subtree. Consider the set of tree edges each of which appears on an overlength path from the root of the subtree to some leaf. Shrinking any tree edge turns the grandchildren of some internal node into its children, so it necessarily amplifies the cost; the compress operation repeatedly shrinks the edge in this set for which the structural entropy increases least. The processes of stretch and compress are illustrated in Figure 1 and stated in Algorithms 1 and 2, respectively.
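The stretch step can be sketched as a greedy agglomeration over the children of a triangle. The tuple-based tree representation and the `delta_cost` callback below are our own illustrative assumptions, not the paper’s pseudocode:

```python
def stretch(children, delta_cost):
    """Greedy HAC over the children of a triangle.

    children: list of cluster ids.
    delta_cost(x, y): change in cost(SE) if clusters x and y are merged
    under a new parent (lower is better).
    Returns a binary subtree encoded as nested tuples."""
    nodes = list(children)
    while len(nodes) > 1:
        # pick the pair whose merge decreases the cost most
        i, j = min(
            ((i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))),
            key=lambda p: delta_cost(nodes[p[0]], nodes[p[1]]),
        )
        merged = (nodes[i], nodes[j])
        nodes = [n for k, n in enumerate(nodes) if k not in (i, j)] + [merged]
    return nodes[0]
```

Each round removes two nodes and adds one, so the loop yields a binary subtree after exactly $t-1$ merges; a real implementation would compute `delta_cost` from the reduced subgraph of the triangle.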
We now define the sparsest level of a cluster tree $T$. Let $T_k$ be the set of level-$k$ nodes on $T$, that is, the set of nodes each of which has distance $k$ from the root of $T$. Suppose that the height of $T$ is $h$. Then $T_0, T_1, \ldots, T_{h-1}$ form a partition of all the internal nodes of $T$. For each internal node $\alpha$, consider the partial sum of the structural entropy $\mathcal{H}^T(G)$ contributed within the cluster $\alpha$. After a stretch-and-compress round on the triangle of $\alpha$, the structural entropy of the new cluster tree is reduced by some amount; since the reconstruction of the triangle stratifies the cluster $\alpha$, this amount is always non-negative. Define the sparsity of $\alpha$ to be the ratio of this reduction to the partial sum, that is, the relative variation of structural entropy within the cluster $\alpha$. From the information-theoretic perspective, this means that the uncertainty of the random walk can be measured locally in any internal cluster, which reflects the quality of clustering in this local area. At last, we define the sparsest level of $T$ to be the level $k$ for which the average sparsity of the triangles rooted at the nodes in $T_k$ is maximum. The operation of stratification then stretches and compresses on the sparsest level of $T$. This is illustrated in Figure 2.
For a given positive integer $k$, to construct a cluster tree of height $k$ for a graph $G$, we start from the trivial $1$-level cluster tree that has all the vertices of $G$ as leaves. Then we stratify the sparsest level recursively until a $k$-level cluster tree is obtained. This process is described in Algorithm 3.
To determine the height of the cluster tree automatically, we derive the natural clustering from the variation of sparsity at each level. Intuitively, a natural hierarchical cluster tree should have not only sparse boundaries between clusters, but also low sparsity for its triangles, meaning that further stratification within the reduced subgraphs corresponding to the triangles on the sparsest level makes little sense. For this reason, we consider the inflection points of the sequence $\{S_i\}$, where $S_i$ is the amount by which the $i$-th round of stratification reduces the structural entropy. We say that $S_i$ is an inflection point if both $S_{i-1} \ge S_i$ and $S_i \le S_{i+1}$ hold. Our algorithm finds the least $i$ such that $S_i$ is an inflection point and fixes the height of the cluster tree to be $i+1$ (note that after $i$ rounds of stratification, the number of levels is $i+1$). This process is described as Algorithm 4.
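Under one plausible reading of this stopping rule (the exact inequalities are our assumption, matching the local-minimum condition stated above), the height selection can be sketched as:

```python
def first_inflection(S):
    """S[k] is the entropy reduction of stratification round k + 1.
    Return the 1-indexed round of the first inflection point
    (S[k-1] >= S[k] <= S[k+1]), or None if there is none."""
    for k in range(1, len(S) - 1):
        if S[k - 1] >= S[k] <= S[k + 1]:
            return k + 1
    return None
```

For a sequence of reductions like `[5, 3, 1, 2, 1]`, the first inflection point is at round 3, so the tree height would be fixed at 4 (the initial tree has one level, and each round adds one).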
4 Experiments
Our experiments are conducted both on synthetic networks generated from the Hierarchical Stochastic Block Model (HSBM) and on real datasets. We compare our algorithm HCSE with the popular practical algorithms LOUVAIN [2] and HLP [12]. Both of these algorithms construct a non-binary cluster tree within the same framework, that is, the hierarchies are formed from bottom to top, one by one. In each round, they invoke different flat clustering algorithms, based on modularity and label propagation, respectively. To avoid overfitting to higher levels, which possibly results in underfitting to lower levels, LOUVAIN admits a sequential input of vertices. Usually, to avert worst-case traps, the order in which the vertices are processed is random, and so the resulting cluster tree depends on this order. HLP invokes the common LP algorithm recursively, and so it cannot be guaranteed to avoid underfitting in each round. This can be seen in our experiments on synthetic datasets, in which these two algorithms usually miss a ground-truth level.
For real datasets, we conduct comparative experiments on real networks. Some of them have (possibly hierarchical) ground truth, e.g., Amazon, while most of them do not. We evaluate the resulting cluster trees for the Amazon network by the Jaccard index, and for those without ground truth by both cost(SE) and Dasgupta’s cost function cost(Das).
4.1 HSBM
The stochastic block model (SBM) is a probabilistic generative model based on the idea of stochastic equivalence. The SBM for flat graphs specifies the number of clusters, the number of vertices in each cluster, the probability of generating an edge for each pair of vertices within a cluster, and the probability of generating an edge for each pair of vertices from different clusters. For HSBM, a ground truth of hierarchies is presumed. Suppose that there are $k$ clusters at the bottom level. Then the probability of generating edges is determined by a symmetric $k \times k$ matrix, in which the $(i,j)$-th entry is the probability of connecting each pair of vertices from clusters $i$ and $j$, respectively. Two clusters that have a higher LCA on the ground-truth tree have a lower connecting probability.
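A minimal two-level HSBM sampler along these lines can be sketched as follows; the parameter names and the two-level restriction are our own, and the paper’s generator may differ:

```python
import random

def sample_hsbm(groups, sizes, p_in, p_mid, p_out, seed=0):
    """Two-level HSBM: bottom clusters grouped into super-clusters.

    groups: list of lists of bottom-cluster ids, one list per super-cluster.
    sizes:  dict mapping each bottom-cluster id to its number of vertices.
    p_in > p_mid > p_out: edge probabilities for pairs in the same cluster,
    the same super-cluster, and different super-clusters (the deeper the
    LCA in the ground-truth tree, the higher the probability).
    Returns (label, edges): per-vertex bottom-cluster labels and edge list."""
    rng = random.Random(seed)
    label, super_of = [], {}
    for g, members in enumerate(groups):
        for c in members:
            super_of[c] = g
            label += [c] * sizes[c]
    n, edges = len(label), []
    for u in range(n):
        for v in range(u + 1, n):
            if label[u] == label[v]:
                p = p_in
            elif super_of[label[u]] == super_of[label[v]]:
                p = p_mid
            else:
                p = p_out
            if rng.random() < p:
                edges.append((u, v))
    return label, edges
```

Deeper hierarchies follow the same pattern: the probability of a vertex pair is read off from the depth of the LCA of their clusters in the ground-truth tree.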
Our experiments utilize multi-level HSBM. For simplicity, let $\vec{p} = (p_0, p_1, \ldots)$ be the probability vector for which $p_i$ is the probability of generating edges for vertex pairs whose LCA on the ground-truth cluster tree has depth $i$. Note that the depth-$0$ node is the root. We compare the Normalized Mutual Information (NMI) between each level of the ground-truth cluster tree and the corresponding level found by each of the three algorithms. Since the randomness in LOUVAIN, as well as the tie-breaking rule and convergence behavior of HLP, yields different results across runs, we choose the most effective strategy and pick the best result over five runs for both of them. Compared to their variability, our algorithm HCSE yields stable results. Table 1 demonstrates the results on three groups of probabilities, whose hierarchical structures become clearer one by one. Our algorithm HCSE is always able to find the right number of levels, while LOUVAIN always misses the top level, and HLP misses the top level in two groups. The inflection points for choosing the intrinsic number of hierarchies are demonstrated in Figure 3.
4.2 Real datasets
First, we conduct experiments on the Amazon network (http://snap.stanford.edu/data/), for which a set of ground-truth clusters is given. For two sets $A$ and $B$, their Jaccard index is defined as $J(A, B) = |A \cap B| / |A \cup B|$. We pick the largest cluster and run the HCSE algorithm on this subgraph. For each ground-truth cluster $C$ that appears in this subgraph, we find from the resulting cluster tree an internal node that has the maximum Jaccard index with $C$. Then we calculate the average Jaccard index over all such $C$. We also calculate cost(SE) and cost(Das). The results are demonstrated in Table 2. HCSE performs best on the Jaccard index and cost(SE), while LOUVAIN performs best on cost(Das). Because of unbalance between overfitting and underfitting traps, HLP outperforms neither of the other two algorithms on any criterion.
index  HCSE  HLP  LOUVAIN 
Jaccard  0.20  0.16  0.17  
cost(SE)  1.85E6  2.05E6  1.89E6 
cost(Das)  5.57E8  3.99E8  3.08E8 
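The evaluation described above can be sketched as follows (the names are illustrative; the matching runs over all internal nodes of the produced tree):

```python
def jaccard(A, B):
    """Jaccard index J(A, B) = |A & B| / |A | B|."""
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)

def avg_best_jaccard(ground_truth, tree_clusters):
    """For each ground-truth cluster, take the best Jaccard index over the
    internal nodes of the cluster tree, then average over all clusters."""
    return sum(
        max(jaccard(g, c) for c in tree_clusters) for g in ground_truth
    ) / len(ground_truth)
```

Note that this score is asymmetric by design: it measures how well the tree covers the ground truth, not the reverse.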
Second, we conduct experiments on a series of real networks without ground truth (http://networkrepository.com/index.php). We compare cost(SE) and cost(Das), respectively. Since the different numbers of levels given by the three algorithms influence the costs seriously, that is, lower costs may be obtained merely due to greater heights, we only list in Table 3 the networks for which the three algorithms yield similar numbers of levels, differing by at most one or two; each cell reports cost(SE) / cost(Das) / number of levels. It can be observed that HLP achieves no optimum on any network, while HCSE performs best w.r.t. cost(Das) for all networks but does not outperform LOUVAIN w.r.t. cost(SE) for most networks. This is mainly due to the fact that LOUVAIN always finds at least as many hierarchies as HCSE, and its better cost probably benefits from its depth.
Networks  HCSE  HLP  LOUVAIN 
CSphd  1.30E4 / 5.19E4 / 5  1.54E4 / 5.58E4 / 4  1.28E4 / 7.61E4 / 5 
fb-pages-government  2.48E6 / 1.18E8 / 4  2.53E6 / 1.76E8 / 3  2.43E6 / 1.33E8 / 4 
email-univ  1.16E5 / 2.20E6 / 3  1.46E5 / 6.14E6 / 3  1.14E5 / 2.20E6 / 4 
fb-messages  1.58E5 / 4.50E6 / 4  1.76E5 / 8.12E6 / 3  1.52E5 / 4.96E6 / 4 
G22  5.56E5 / 2.68E7 / 4  6.11E5 / 4.00E7 / 3  5.63E5 / 2.80E7 / 5 
As20000102  2.64E5 / 2.36E7 / 4  3.62E5 / 7.63E7 / 3  2.42E5 / 2.42E7 / 5 
bibd136  7.41E5 / 2.56E7 / 3  8.05E5 / 4.41E7 / 2  7.50E5 / 2.75E7 / 4 
delaunay_n10  4.65E4 / 3.39E5 / 4  4.87E4 / 3.55E5 / 4  4.24E4 / 4.25E5 / 5 
p2p-Gnutella05  9.00E5 / 1.48E8 / 3  1.01E6 / 2.78E8 / 3  8.05E5 / 1.49E8 / 5 
p2p-Gnutella08  5.59E5 / 5.51E7 / 4  6.36E5 / 1.28E8 / 4  4.88E5 / 6.03E7 / 5 
5 Conclusions and future discussions
In this paper, we investigate the hierarchical clustering problem from an information-theoretic perspective and propose a new objective function that relates to the combinatorial cost functions raised by Dasgupta [6] and Cohen-Addad et al. [4]. We define the optimal $k$-level cluster tree for practical use and devise a hierarchical clustering algorithm that stratifies the sparsest level of the cluster tree recursively. This is a general framework that can be combined with any cost function. We also propose an interpretable strategy to find the intrinsic number of levels without any hyperparameter. The experimental results on HSBM demonstrate that our algorithm HCSE has a great advantage in finding the intrinsic number of levels compared to the popular but strongly heuristic algorithms LOUVAIN and HLP. Our results on real datasets show that HCSE also achieves competitive costs compared to these two algorithms.
There are several directions that are worth further study. The first problem concerns the relationship between the concavity of the function in the cost function and the balance of the optimal cluster tree. We have seen that for cliques, concavity is not a sufficient condition for total balance, but is it a necessary one? Moreover, is there any explicit necessary and sufficient condition for total balance of the optimal cluster tree for cliques? The second problem concerns approximation algorithms for both structural entropy and cost(SE). Due to the non-linear and volume-related term in the cost function, the proof techniques for the approximation algorithms in [4] become unavailable. The third problem concerns more precise characterizations of “natural” hierarchical clustering whose depth is limited. Since any reasonable cost function achieves its optimum on some binary tree, a blind pursuit of cost minimization seems not to be a rational approach. More criteria in this scenario need to be studied.
References

[1] (1995) Spectral partitioning: the more eigenvectors, the better. In Proceedings of the 32nd Conference on Design Automation, pp. 195–200.
[2] (2008) Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment 2008(10), P10008.
[3] (1992) Class-based n-gram models of natural language. Computational Linguistics 18(4), pp. 467–480.
[4] (2019) Hierarchical clustering: objective functions and algorithms. Journal of the ACM 66(4), pp. 1–42.
[5] (2007) Author disambiguation using error-driven machine learning with a ranking loss function. In Sixth International Workshop on Information Integration on the Web (IIWeb-07), Vancouver, Canada.
[6] (2016) A cost function for similarity-based hierarchical clustering. In Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2016), pp. 118–127.
[7] (1998) Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95(25), pp. 14863–14868.
[8] (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Vol. 96, pp. 226–231.
[9] (2008) Principal manifolds for data visualization and dimension reduction. Vol. 58, Springer.
[10] (1979) A k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1), pp. 100–108.
[11] (2016) Structural information and dynamical complexity of networks. IEEE Transactions on Information Theory 62(6), pp. 3290–3339.
[12] (2020) Fast hierarchical graph clustering in linear-time. In Companion Proceedings of the Web Conference 2020, pp. 10–12.
Appendix A Proof of Theorem 2.2
We restate Theorem 2.2 as follows.
Theorem 2.2. For any positive integer $n$, let $K_n$ be the clique of $n$ vertices with identical weight on every edge. Then a cluster tree $T$ of $K_n$ achieves the minimum structural entropy if and only if $T$ is a balanced binary tree, that is, the two children clusters of each subtree of $T$ differ in size by at most $1$.
Note that a balanced binary tree (BBT for abbreviation) means that the tree is balanced at every internal node. Formally, for an internal node of cluster size $m$, its two subtrees have cluster sizes $\lceil m/2 \rceil$ and $\lfloor m/2 \rfloor$, respectively.
For cliques, since the weights of all edges are identical, we safely assume them to be $1$. By Theorem 2.1, minimizing the structural entropy is equivalent to minimizing (over $T$) the cost function
$$\mathrm{cost}^{\mathrm{SE}}(T) = \sum_{(u,v)\in E} \log_2 \mathrm{vol}(u \vee v) = |E| \log_2 (n-1) + \sum_{(u,v)\in E} \log_2 |u \vee v|,$$
where the equality holds since every vertex of $K_n$ has degree $n-1$, so that $\mathrm{vol}(\alpha) = (n-1)|\alpha|$ for every cluster $\alpha$. Since the first term in the last equation is independent of $T$, the optimization turns to minimizing the last term, which we denote by $\mathrm{cost}(T)$. Grouping all the edges in $K_n$ by the least common ancestor (LCA) of their two endpoints, the cost can be written as the sum of the costs at the internal nodes of $T$. Formally, for every internal node $\alpha$, let $A_\alpha$ and $B_\alpha$ be the sets of leaves of the subtrees rooted at the left and right child of $\alpha$, respectively. We have
$$\mathrm{cost}(T) = \sum_{\alpha} |A_\alpha| \cdot |B_\alpha| \cdot \log_2 (|A_\alpha| + |B_\alpha|).$$
Now we only have to show the following lemma.
Lemma A.1.
For any positive integer $n$, a cluster tree $T$ of $K_n$ achieves the minimum $\mathrm{cost}(T)$ if and only if $T$ is a BBT.
Proof.
Lemma A.1 is proved by induction on $n$. The key technique of tree swapping used here is inspired by Cohen-Addad et al. [4]. The basis step holds since for $n = 1$ or $2$, the cluster tree is balanced and unique, so it certainly achieves the minimum cost exclusively.
Now, consider a clique with $n \ge 3$ vertices. Let $T$ be an arbitrary unbalanced cluster tree and $\lambda$ be its root. We need to prove that its cost is not minimum. Without loss of generality, we can safely assume that the root node is unbalanced, since otherwise we set $T$ to be a subtree rooted at an unbalanced node. Let $T_1$ be a tree with root $\lambda$ whose left and right subtrees are BBTs of the same sizes as the left and right subtrees of $T$, respectively. Let $A$, $B$, $C$ and $D$ be the sets of leaves of the four subtrees at the second level of $T_1$, and let $a$, $b$, $c$ and $d$ denote their sizes, respectively. Our proof remains valid when some of them are empty. We always assume $a \ge b$ and $c \ge d$. Next, we construct $T_2$ by swapping (transplanting) two of these second-level subtrees with each other. Finally, let $T_3$ be a tree with root $\lambda$ whose left and right subtrees are BBTs obtained by balancing the left and right subtrees of $T_2$. So $T_3$ is a BBT. Then we only have to prove that $\mathrm{cost}(T) > \mathrm{cost}(T_3)$. Note that the strict inequality is necessary since we need to rule out all unbalanced cluster trees.
We now show that the transformation process consisting of the above three steps makes the cost decrease step by step. Formally,

(a) $T$ to $T_1$. The subtrees of $T$ become BBTs in $T_1$. Since the number of edges whose endpoints treat the root as their LCA is unchanged, by induction we have $\mathrm{cost}(T) \ge \mathrm{cost}(T_1)$.

(b) $T_1$ to $T_2$. We will show that $\mathrm{cost}(T_1) > \mathrm{cost}(T_2)$ in Lemma A.2.

(c) $T_2$ to $T_3$. The subtrees of $T_2$ become BBTs in $T_3$. For the same reason as in (a), we have $\mathrm{cost}(T_2) \ge \mathrm{cost}(T_3)$.

This transformation process is illustrated in Figure 4. Putting these together, we get $\mathrm{cost}(T) > \mathrm{cost}(T_3)$, and Lemma A.1 follows.
∎
Lemma A.2.
After the swap in the construction above, we obtain $T_2$ from $T_1$, for which $\mathrm{cost}(T_1) > \mathrm{cost}(T_2)$.
Proof.
We only need to consider the changes in the costs at three nodes: the root and its left and right children, since the cost contributed by each of the remaining nodes does not change after the swap. Ignoring the unchanged costs, define
where , . Both of them are at least . Similarly, define
Denote
(2)  
So we only have to show that this difference is positive. We consider the following three cases according to the parities of the two sizes involved.

Case 1: both are even.

Case 2: both are odd.

Case 3: one is odd while the other is even.

The case that the first is even while the second is odd is symmetric to Case 3.
For Case 1, if both sizes are even, then the rounding operations in Eq. (2) can be removed and the expression can be simplified as
Let , and so . Recall that is unbalanced on the root , so is . Thus . Multiplying by on both sides, we only have to prove that
That is,
Let . Then we only need to show that when . Since
It is easy to check that when . So is strictly convex in the interval . Since ,
Thus holds.
For Case 2, if both sizes are odd, then the expression can be split into two parts, in which
Since we have shown the former, if we can prove the latter, then the lemma will hold for Case 2. Due to the convexity of the logarithmic function, this holds clearly since
For Case 3, if the first size is odd while the second is even,
Multiplying the above equation by , without changing its sign, yields
Splitting the right hand side into two parts,
Since one of the sizes is odd and the root of $T_1$ is unbalanced, we only need to consider the case where both remaining quantities are odd and satisfy the required bound (the two sides are symmetric, so in the opposite parity case we exchange them). We establish the two required inequalities for Case 3 in Claims A.1 and A.2 below.
Claim A.1.
for odd integers .
Proof.
Substituting the two quantities into the expression yields
Treating the variable as continuous, we have
Multiplying the above equation by and setting yields
It is easy to check that and . When , . Thus and . This means that for all . So for (When is fixed, the minimum value of can be taken to , which makes and integral). The curves of for varying are plotted in Figure 5.
When , we get and . Substituting them into yields
When , it is easy to check that . So the minimum value of , which is also the minimum value of , is achieved at . So . ∎
Claim A.2.
.
Proof.
Due to the facts that
we have
Let , then
When , . ∎
This completes the proof of Theorem 2.2.