# A New Cost Function for Hierarchical Cluster Trees

Hierarchical clustering has been a popular method in various data analysis applications. It partitions a data set into a hierarchical collection of clusters, and can provide a global view of the (cluster) structure behind data across different granularity levels. A hierarchical clustering (HC) of a data set can be naturally represented by a tree, called a HC-tree, where leaves correspond to input data and subtrees rooted at internal nodes correspond to clusters. Many hierarchical clustering algorithms used in practice are developed in a procedural manner. Dasgupta proposed to study the hierarchical clustering problem from an optimization point of view, and introduced an intuitive cost function for similarity-based hierarchical clustering with nice properties as well as natural approximation algorithms. We observe that while Dasgupta's cost function is effective at differentiating a good HC-tree from a bad one for a fixed graph, the value of this cost function does not reflect how consistent an input similarity graph is with a hierarchical structure. In this paper, we present a new cost function, developed from Dasgupta's cost function, to address this issue. The optimal tree under the new cost function remains the same as the one under Dasgupta's cost function. However, the value of our cost function is more meaningful. The new way of formulating the cost function also leads to a polynomial-time algorithm to compute the optimal cluster tree when the input graph has a perfect HC-structure, or an approximation algorithm when the input graph 'almost' has a perfect HC-structure. Finally, we provide further understanding of the new cost function by studying its behavior for random graphs sampled from an edge probability matrix.


## 1 Introduction

Clustering has been one of the most important and popular data analysis methods in the modern data era, with numerous clustering algorithms proposed in the literature [1]. Theoretical studies on clustering have so far focused mostly on flat clustering algorithms, e.g., [4, 5, 10, 11, 17], which aim to partition the input data set into an (often pre-specified) number of groups, called clusters. However, there are many scenarios where it is more desirable to perform hierarchical clustering, which recursively partitions data into a hierarchical collection of clusters. A hierarchical clustering (HC) of a data set can be naturally represented by a tree, called a HC-tree, where leaves correspond to input data and subtrees rooted at internal nodes correspond to clusters. Hierarchical clustering can provide a more thorough view of the cluster structure behind input data across all levels of granularity simultaneously, and is sometimes better at revealing the complex structure behind modern data. It has been broadly used in data mining, machine learning and bioinformatics applications, e.g., in the study of phylogenetics.

Most hierarchical clustering algorithms used in practice are developed in a procedural manner. For example, the family of agglomerative methods builds a HC-tree bottom-up by starting with all data points in individual clusters, and then repeatedly merging them to form bigger clusters at coarser levels. Prominent merging strategies include the single-linkage, average-linkage and complete-linkage heuristics. The family of divisive methods instead partitions the data in a top-down manner, starting with a single cluster, and then recursively dividing it into smaller clusters using strategies based on spectral cut, $k$-means, $k$-center and so on. Many of these algorithms work well in different practical situations; for example, the average-linkage algorithm is known as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm [19], commonly used in evolutionary biology for phylogenetic inference. However, it is in general not clear what the output HC-tree aims to optimize, and what one should expect to obtain. This lack of optimization understanding of the HC-tree also makes it hard to decide which hierarchical clustering algorithm one should use given a specific type of input data.

The optimization formulation of the HC-tree was recently tackled by Dasgupta in [9]. Specifically, given a similarity graph $G$ (which is a weighted graph with edge weights corresponding to the similarity between nodes), he proposed an intuitive cost function for any HC-tree, and defined an optimal HC-tree for $G$ to be one that minimizes this cost. Dasgupta showed that the optimal tree under this objective function has many nice properties and is indeed desirable. Furthermore, while it is NP-hard to find the optimal tree, he showed that a simple heuristic using an $\alpha_n$-approximation of the sparsest graph cut will lead to an algorithm computing a HC-tree whose cost is an $O(\alpha_n \log n)$-approximation of the optimal cost. Given that the best approximation factor for the sparsest cut is $O(\sqrt{\log n})$ [3], this gives an $O(\log^{1.5} n)$-approximation algorithm for the optimal HC-tree as defined by Dasgupta’s cost function. The approximation factor has since been improved in several subsequent works [7, 8, 18], and it has been shown independently in [7, 8] that one can obtain an $O(\sqrt{\log n})$-approximation.

##### Our work.

For a fixed graph $G$, the value of Dasgupta’s cost function can be used to differentiate “better” HC-trees (with smaller cost) from “worse” ones (with larger cost), and the HC-tree with the smallest cost is optimal. However, we observe that this cost function, in its current form, does not indicate whether an input graph has a strong hierarchical structure or not, or whether one graph has “more” hierarchical structure than another graph. For example, consider a star graph and a path graph, both with $n$ nodes and $n-1$ edges, and unit edge weights. It turns out that the cost of the optimal HC-tree for the star is $\Theta(n^2)$, while that for the path graph is $\Theta(n \log n)$. However, the star, with all nodes connected to a center node, is intuitively more cluster-like than the path with sequential edges. (In fact, we will show later that a star, and the so-called linked star where there are two stars with their center vertices linked by an edge, see Figure 5 (a) in the Appendix, both have what we call perfect HC-structure.) Furthermore, for a dense unit-weight graph with $\Theta(n^2)$ edges, one can show that the optimal cost is always $\Theta(n^3)$, whether or not the graph exhibits any cluster structure at all. Hence in general, it is not meaningful to compare the optimal HC-tree costs across different graphs.

We propose a modification of Dasgupta’s cost function to address this issue and study its properties and algorithms. In particular, by reformulating Dasgupta’s cost function, we observe that for a fixed graph, there exists a base-cost which reflects the minimum cost one can hope to achieve. Based on this observation, we develop a new cost $\rho_G(T)$ to evaluate how well a HC-tree $T$ represents an input graph $G$. An optimal HC-tree for $G$ is one minimizing $\rho_G(T)$. The new cost function has several interesting properties:

• For a fixed graph $G$, a tree $T$ minimizes $\rho_G(T)$ if and only if it minimizes Dasgupta’s cost function; thus the optimal tree under our cost function remains the same as the optimal tree under Dasgupta’s cost function. Furthermore, hardness results and the existing approximation algorithm developed in [7] still apply to our cost function.

• For any positively weighted graph $G$ with $n$ vertices, the optimal cost $\rho^*_G$ is bounded; see Theorem 1 (while the optimal cost under Dasgupta’s cost function could be made arbitrarily large). The optimal cost $\rho^*_G$ intuitively indicates how much HC-structure the graph has. In particular, $\rho^*_G = 1$ if and only if there exists a HC-tree that is consistent with all triplet relations in $G$ (see Section 2 for more precise definitions), in which case we say that this graph has a perfect hierarchical-clustering (HC) structure.

• The new formulation enables us to develop a polynomial-time algorithm to test whether an input graph $G$ has a perfect HC-structure (i.e., $\rho^*_G = 1$) or not, as well as to compute an optimal tree if $\rho^*_G = 1$. If an input graph is what we call a $\delta$-perturbation of a graph with a perfect HC-structure, then in polynomial time we can compute a HC-tree whose cost approximates the optimal one within a factor depending on $\delta$ (see Theorem 7).

• Finally, in Section 4, we study the behavior of our cost function for a random graph $G$ generated from an edge probability matrix $P$. Under mild conditions on $P$, we show that the optimal cost $\rho^*_G$ concentrates on a certain value, which, interestingly, is different from the optimal cost for the expectation-graph (i.e., the graph whose edge weights equal the entries of $P$). Furthermore, for random graphs sampled from probability matrices, the optimal cost will decrease if we strengthen in-cluster connections. For instance, the optimal cost of an Erdős-Rényi random graph with connection probability $p$ is $(1+o(1))\frac{2}{3p-p^2}$. In other words, the optimal cost reflects how much HC-structure a random graph has.

In general, we believe that the new formulation and the results of our investigation help to reveal insights into the hierarchical structure behind graph data. We remark that [8] proposed a concept of ground-truth input (graph), which, informally, is a graph consistent with an ultrametric under some monotone transformation of weights. Our concept of graphs with perfect HC-structure is more general than their ground-truth graph (see Theorem 4), and, for example, is much more meaningful for unweighted graphs (Proposition 1).

##### More on related work.

The present study is inspired by the work of Dasgupta [9], as well as the subsequent work in [7, 8, 15, 18], especially [8]. As mentioned earlier, after [9], there have been several independent follow-up works that improve the approximation factor to $O(\log n)$ [18] and then to $O(\sqrt{\log n})$, via a more refined analysis of the sparsest-cut based algorithm of Dasgupta [7, 8] or via SDP relaxation [7]. It was also shown in [7, 18] that it is hard to approximate the optimal cost (under Dasgupta’s cost function) within any constant factor, assuming the Small Set Expansion (SSE) hypothesis originally introduced in [16]. A dual version of Dasgupta’s cost function was considered in [15]; and an analog of Dasgupta’s cost function for dissimilarity graphs (i.e., graphs where edge weights represent dissimilarity) was studied in [8]. In both cases, the formulation leads to a maximization problem, and thus exhibits a rather different flavor from an approximation perspective: indeed, simple constant-factor approximation algorithms are proposed in both cases.

We remark that the investigation in [8] in fact targets a broader family of cost functions than Dasgupta’s cost function (and thus ours as well).

## 2 An Improved Cost Function for HC-trees and Properties

In this section, we first describe the cost function proposed by Dasgupta [9] in Section 2.1. We then introduce our cost function in Section 2.2 and present several of its properties in Section 2.3.

### 2.1 Problem setup and Dasgupta’s cost function

Our input is a set of data points $V = \{v_1, \ldots, v_n\}$ as well as their pairwise similarity, represented as an $n \times n$ weight matrix $W$ with $w_{ij}$ representing the similarity between points $v_i$ and $v_j$. We assume that $W$ is symmetric and each entry is non-negative. Alternatively, we can assume that the input is a weighted (undirected) graph $G = (V, E)$, with the weight for an edge $(v_i, v_j) \in E$ being $w_{ij}$. The two views are equivalent: if the input graph is not a complete graph, then in the weight matrix view, we simply set $w_{ij} = 0$ if $(v_i, v_j) \notin E$. We use the two views interchangeably in this paper. Finally, we use “unweighted graph” to refer to a graph where each edge has unit weight $1$.

Given a set of data points $V$, a hierarchical clustering tree (HC-tree) is a rooted tree $T$ whose leaf set equals $V$. We also say that $T$ is a HC-tree spanning (its leaf set) $V$. Given any tree node $u \in T$, $T[u]$ represents the subtree rooted at $u$, and $\mathrm{leaves}(T[u])$ denotes the set of leaves contained in the subtree $T[u]$. Given any two points $v_i, v_j \in V$, $\mathrm{LCA}_T(i,j)$ denotes the lowest common ancestor of leaves $v_i$ and $v_j$ in $T$; the subscript $T$ is often omitted when the choice of $T$ is clear. To simplify presentation, we sometimes use indices of vertices to refer to vertices, e.g., $(i,j) \in E$ means $(v_i, v_j) \in E$. The following cost function to evaluate a HC-tree $T$ w.r.t. a similarity graph $G$ was introduced in [9]:

$$\mathrm{cost}_G(T) = \sum_{(i,j) \in E} w_{ij} \, \bigl|\mathrm{leaves}(T[\mathrm{LCA}(i,j)])\bigr|.$$

An optimal HC-tree is defined as one that minimizes $\mathrm{cost}_G(T)$. Intuitively, to minimize the cost, pairs of nodes with high similarity should be merged (into a single cluster) earlier. It was shown in [9] that the optimal tree under this cost function has several nice properties, e.g., behaving as expected for graphs such as disconnected graphs, cliques, and planted partitions. In particular, if $G$ is an unweighted clique, then all trees have the same cost, and thus are optimal; this is intuitive as no preference should be given to any specific tree shape in this case.
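As an illustration of the definition, the following minimal Python sketch evaluates Dasgupta's cost on a small example. The representation is an assumption made here for illustration only: a tree is a nested tuple whose leaves are integer vertex indices, and the graph is a symmetric weight matrix `w` with zeros for non-edges; the names `leaves` and `dasgupta_cost` are hypothetical.

```python
def leaves(tree):
    """Return the list of leaves of a tree given as nested tuples of ints."""
    if isinstance(tree, int):
        return [tree]
    return [v for child in tree for v in leaves(child)]


def dasgupta_cost(tree, w):
    """Sum of w[i][j] * |leaves(T[LCA(i, j)])| over all leaf pairs."""
    if isinstance(tree, int):
        return 0
    parts = [leaves(child) for child in tree]
    size = sum(len(p) for p in parts)
    cost = sum(dasgupta_cost(child, w) for child in tree)
    # Pairs split between two different children have this node as their LCA.
    for a in range(len(parts)):
        for b in range(a + 1, len(parts)):
            cost += size * sum(w[i][j] for i in parts[a] for j in parts[b])
    return cost


# Unit-weight path 0-1-2-3 as a symmetric weight matrix.
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
balanced = ((0, 1), (2, 3))
caterpillar = (((0, 1), 2), 3)
print(dasgupta_cost(balanced, path), dasgupta_cost(caterpillar, path))  # 8 9
```

On this path the balanced tree is cheaper because only the middle edge is cut at the root, whereas the caterpillar pays for progressively larger clusters.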

### 2.2 The new cost function

To introduce our new cost function, it is more convenient to take the matrix view where the weight $w_{ij}$ is defined for all pairs of nodes $v_i$ and $v_j$ (as mentioned earlier, if the input is a weighted graph $G = (V, E)$, then we set $w_{ij} = 0$ for $(v_i, v_j) \notin E$). First, a triplet $\{i,j,k\}$ means three distinct indices $i, j, k \in [1,n]$. We say that relation $\{i,j|k\}$ holds in $T$ if the lowest common ancestor $\mathrm{LCA}(i,j)$ of $i$ and $j$ is a proper descendant of $\mathrm{LCA}(i,j,k)$. Intuitively, subtrees containing leaves $i$ and $j$ will merge first, before they merge with a subtree containing $k$. We say that relation $\{i|j|k\}$ holds in $T$ if they are merged at the same time; that is, $\mathrm{LCA}(i,j) = \mathrm{LCA}(j,k) = \mathrm{LCA}(i,k)$. See Figure 1 for an illustration.

###### Definition 1.

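Given any triplet $\{i,j,k\}$ of $V$, the cost of this triplet (induced by $T$) is

$$\mathrm{triC}_{T,G}(i,j,k) = \begin{cases} w_{ik} + w_{jk} & \text{if relation } \{i,j|k\} \text{ holds} \\ w_{ij} + w_{jk} & \text{if relation } \{i,k|j\} \text{ holds} \\ w_{ij} + w_{ik} & \text{if relation } \{j,k|i\} \text{ holds} \\ w_{ij} + w_{jk} + w_{ik} & \text{if relation } \{i|j|k\} \text{ holds} \end{cases}$$

We omit $G$ from the above notation when its choice is clear. The total-cost of tree $T$ w.r.t. $G$ is

$$\mathrm{totalC}_G(T) = \sum_{i \neq j \neq k \in [1,n]} \mathrm{triC}_T(i,j,k).$$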

The rather simple proof of the following claim is in Appendix A.1.

###### Claim 1.

The total-cost $\mathrm{totalC}_G(T)$ of any HC-tree $T$ differs from Dasgupta’s cost $\mathrm{cost}_G(T)$ by a quantity depending only on the input graph $G$. Hence for a fixed graph $G$, it maintains the relative order of the costs of any two trees, implying that the optimal tree under $\mathrm{totalC}_G$ or $\mathrm{cost}_G$ remains the same.
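A sketch of why such an identity should hold, reconstructed here from Definition 1 rather than taken from Appendix A.1, and assuming that each unordered triplet is counted once: the weight $w_{ij}$ enters $\mathrm{triC}_T(i,j,k)$ exactly when relation $\{i,j|k\}$ does not hold, that is, when $k$ is a leaf of $T[\mathrm{LCA}(i,j)]$ other than $i$ and $j$. Counting such $k$ for every pair gives

$$\mathrm{totalC}_G(T) = \sum_{i<j} w_{ij}\,\bigl(\bigl|\mathrm{leaves}(T[\mathrm{LCA}(i,j)])\bigr| - 2\bigr) = \mathrm{cost}_G(T) - 2\sum_{i<j} w_{ij}.$$

For unit weights the shift is $2|E|$, which matches the form used in Eqn. (4) of the proof of Theorem 3.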

While the difference from Dasgupta’s cost function seems to be minor, it is easy to see from this formulation that for a fixed graph $G$, there is a least-possible cost that any HC-tree will incur, which we call the base-cost. It is important to note that the following base-cost depends only on the input graph $G$.

###### Definition 2.

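Given an $n$-node graph $G$ associated with similarity matrix $W$, for any distinct triplet $\{i,j,k\}$, define its min-triplet cost to be

$$\mathrm{minTriC}_G(i,j,k) = \min\{\, w_{ij} + w_{ik},\; w_{ij} + w_{jk},\; w_{ik} + w_{jk} \,\}.$$

The base-cost of similarity graph $G$ is

$$\mathrm{baseC}(G) = \sum_{i \neq j \neq k \in [1,n]} \mathrm{minTriC}_G(i,j,k).$$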

To differentiate from Dasgupta’s cost function, we call our new cost function the ratio-cost.

###### Definition 3 (Ratio-cost function).

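Given a similarity graph $G$ and a HC-tree $T$, the ratio-cost of $T$ w.r.t. $G$ is defined as

$$\rho_G(T) = \frac{\mathrm{totalC}_G(T)}{\mathrm{baseC}(G)}.$$

The optimal tree for $G$ is a tree $T^*$ such that $\rho_G(T^*) = \min_T \rho_G(T)$; and its ratio-cost is called the optimal ratio-cost $\rho^*_G$.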
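Definitions 1 to 3 can be turned into a short computation. The Python sketch below is an illustration under assumptions that are not part of the paper: trees are nested tuples with integer leaves, the graph is a symmetric weight matrix `w`, each unordered triplet is counted once, and the value returned when the base-cost is zero follows the 0/0 = 1 convention; all function names are hypothetical.

```python
from itertools import combinations


def lca_leaf_sets(tree):
    """Return (leaf set, table), where table maps each pair (i, j) with i < j
    to the leaf set of the subtree rooted at LCA(i, j)."""
    if isinstance(tree, int):
        return frozenset([tree]), {}
    parts, table = [], {}
    for child in tree:
        leaf_set, child_table = lca_leaf_sets(child)
        parts.append(leaf_set)
        table.update(child_table)
    here = frozenset().union(*parts)
    for a, b in combinations(parts, 2):
        for i in a:
            for j in b:
                table[(min(i, j), max(i, j))] = here
    return here, table


def total_cost(tree, w):
    """Sum of triplet costs: w[a][b] is charged unless relation {a,b|c} holds."""
    all_leaves, table = lca_leaf_sets(tree)
    total = 0
    for i, j, k in combinations(sorted(all_leaves), 3):
        for a, b, c in ((i, j, k), (i, k, j), (j, k, i)):
            if c in table[(a, b)]:
                total += w[a][b]
    return total


def base_cost(w):
    """Sum of min-triplet costs over all unordered triplets."""
    return sum(min(w[i][j] + w[i][k], w[i][j] + w[j][k], w[i][k] + w[j][k])
               for i, j, k in combinations(range(len(w)), 3))


def ratio_cost(tree, w):
    total, base = total_cost(tree, w), base_cost(w)
    if base == 0:  # assumed convention: 0/0 = 1, x/0 = infinity for x > 0
        return 1.0 if total == 0 else float("inf")
    return total / base


star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
path = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
print(ratio_cost((((0, 1), 2), 3), star))  # 1.0
print(ratio_cost((((0, 1), 2), 3), path))  # 1.5
print(ratio_cost(((0, 1), (2, 3)), path))  # 1.0
```

The pairwise table avoids walking the tree once per triplet; for large inputs one would want a faster formulation, which this sketch does not attempt.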

###### Observation 1.

(i) For any HC-tree $T$, $\mathrm{totalC}_G(T) \ge \mathrm{baseC}(G)$, implying that $\rho_G(T) \ge 1$.
(ii) A tree optimizing $\rho_G$ also optimizes $\mathrm{totalC}_G$ (and thus $\mathrm{cost}_G$), as $\mathrm{baseC}(G)$ is a constant for a fixed graph.
(iii) There is always an optimal tree for $\rho_G$ that is binary.

Observations (i) and (ii) above follow directly from the definitions of these costs and Claim 1. Observation (iii) holds as it holds for Dasgupta’s cost function ([9], Section 2.3). Hence in the remainder of the paper, we will consider only binary trees when talking about optimal trees.

Note that it is possible that $\mathrm{baseC}(G) = 0$ for a non-empty graph $G$. We follow the convention that $\frac{0}{0} = 1$ while $\frac{x}{0} = \infty$ for any positive number $x$. We show in Appendix A.2 that in this case, there must exist a tree $T$ such that $\mathrm{totalC}_G(T) = 0$. Thus $\rho^*_G = 1$ for this case.

##### Intuition behind the costs.

Consider any triplet $\{i,j,k\}$, and assume w.l.o.g. that $w_{ij}$ is the largest among the three pairwise similarities. If there exists a “perfect” HC-tree $T$, then it should first merge $i$ and $j$ as they are most similar, before merging them with $k$. That is, the relation for this triplet in the HC-tree should be $\{i,j|k\}$; and we say that this relation (and the tree $T$) is consistent with (the similarities of) this triplet. The cost of the triplet is designed to reward this “perfect” relation: $\mathrm{triC}_T(i,j,k)$ is minimized, in which case $\mathrm{triC}_T(i,j,k) = \mathrm{minTriC}_G(i,j,k)$, only when $\{i,j|k\}$ holds in $T$. In other words, $\mathrm{minTriC}_G(i,j,k)$ is the smallest cost possible for this triplet, and a HC-tree $T$ can achieve this cost only when the relation of this triplet in $T$ is consistent with their similarities.

If there is a HC-tree $T$ such that for all triplets, their relations in $T$ are consistent with their similarities, then $\mathrm{totalC}_G(T) = \mathrm{baseC}(G)$, implying that $T$ must be optimal as $\rho_G(T) = 1$. Similarly, if $\rho^*_G = 1$ for a graph $G$, then the optimal tree $T^*$ has to be consistent with all triplets from $V$. Intuitively, this graph has a perfect HC-structure, in the sense that there is a HC-tree such that the desired merging order for all triplets is preserved in this tree.

###### Definition 4 (Graph with perfect HC-structure).

A similarity graph $G$ has perfect HC-structure if $\rho^*_G = 1$. Equivalently, there exists a HC-tree $T$ such that $\mathrm{totalC}_G(T) = \mathrm{baseC}(G)$.

##### Examples.

The base-cost is independent of the tree $T$ and can be computed easily for a graph. If we find a tree $T$ with $\rho_G(T) = 1$, then $T$ must be optimal and $G$ has perfect HC-structure.

With this in mind, it is now easy to see that for a clique $G$ (with unit weight), for any HC-tree $T$ and any triplet $\{i,j,k\}$, $\mathrm{triC}_T(i,j,k) = \mathrm{minTriC}_G(i,j,k)$. Hence $\rho_G(T) = 1$ for any HC-tree $T$, and thus the clique has perfect HC-structure. A complete graph whose edge weights equal the edge probabilities of a planted partition also has a perfect HC-structure. It is also easy to check that the $n$-node star graph (or two linked stars) has perfect HC-structure, while for the $n$-node path $P_n$, $\rho^*_{P_n} = \Theta(\log n)$. Intuitively, a path does not possess much hierarchical structure. See Appendix A.3 for details. We also refer the reader to further results and discussions on Erdős-Rényi random graphs and planted bipartition (also called planted bisection) random graphs in Section 4. In particular, as we describe in the discussion after Corollary 2, the ratio-cost function exhibits an interesting, yet natural, behavior as the in-cluster and between-cluster probabilities for a planted bisection model change.
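The star and path examples can be checked by brute force on tiny graphs. The sketch below is an illustration, not the paper's algorithm: it runs in exponential time, relies on Observation 1 (iii) to restrict attention to binary trees, and assumes the shift $\mathrm{totalC}_G(T) = \mathrm{cost}_G(T) - 2\sum_{i<j} w_{ij}$ (the weighted form of the identity used in Eqn. (4)) with each unordered triplet counted once. The helper names `unit_graph` and `optimal_ratio_cost` are hypothetical.

```python
from functools import lru_cache
from itertools import combinations


def unit_graph(n, edges):
    """Symmetric 0/1 weight matrix of an unweighted graph on vertices 0..n-1."""
    w = [[0] * n for _ in range(n)]
    for i, j in edges:
        w[i][j] = w[j][i] = 1
    return w


def optimal_ratio_cost(w):
    """Optimal ratio-cost by exhaustive search over binary trees (tiny n only)."""
    n = len(w)

    @lru_cache(maxsize=None)
    def best_cost(verts):
        """Minimum Dasgupta cost over binary trees on the sorted tuple verts."""
        if len(verts) == 1:
            return 0
        first, rest = verts[0], verts[1:]
        best = float("inf")
        # Enumerate root splits (a, b); first stays in a to avoid duplicates.
        for r in range(len(rest)):
            for extra in combinations(rest, r):
                a = (first,) + extra
                b = tuple(v for v in rest if v not in extra)
                cut = sum(w[i][j] for i in a for j in b)
                best = min(best, cut * len(verts) + best_cost(a) + best_cost(b))
        return best

    pairs = combinations(range(n), 2)
    total = best_cost(tuple(range(n))) - 2 * sum(w[i][j] for i, j in pairs)
    base = sum(min(w[i][j] + w[i][k], w[i][j] + w[j][k], w[i][k] + w[j][k])
               for i, j, k in combinations(range(n), 3))
    if base == 0:  # assumed convention: 0/0 = 1
        return 1.0 if total == 0 else float("inf")
    return total / base


star = unit_graph(5, [(0, 1), (0, 2), (0, 3), (0, 4)])
path = unit_graph(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
print(optimal_ratio_cost(star), optimal_ratio_cost(path))
```

On five nodes the star attains ratio-cost 1 while the path stays strictly above 1, in line with the discussion above; for very short paths (four nodes or fewer) the path itself still admits a consistent tree.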

Remark: Note that in general, the value of Dasgupta’s cost is affected by the (edge) density of a graph. One may think that we could normalize by the number of edges in the graph (or by total edge weights). However, the examples of the star and path show that this strategy itself is not sufficient to reveal the cluster structure behind an input graph, as those two graphs have equal number of edges.

### 2.3 Some properties

For an unweighted graph, optimal cost under Dasgupta’s cost function is bounded by . But this cost can be made arbitrarily large for a weighted graph. For our ratio-cost function, it turns out that the optimal cost is always bounded for both unweighted and weighted graphs. Proof of the following result can be found in Appendix A.4. For an unweighted graph, we can in fact obtain an upper bound that is asymptotically the same as the one in (ii) below using a much simpler argument (than the one in our current proof). However, the stated upper bound () is tight in the sense that it equals ‘1’ for a clique, matching the fact that for a clique.

###### Theorem 1.

(i) Given a similarity graph with being symmetric, having non-negative entries, we have that where .
(ii) For a connected unweighted graph with and , we have that .

We now show that the bound in (i) above is asymptotically tight. To prove that, we will use a certain family of edge expanders.

###### Definition 5.

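Given $\alpha > 0$ and integer $n$, a graph $G = (V, E)$ with $|V| = n$ is an $\alpha$-edge expander if for every $S \subseteq V$ with $|S| \le n/2$, the set of crossing edges from $S$ to $V \setminus S$, denoted as $E(S, V \setminus S)$, has cardinality at least $\alpha |S|$.

An (undirected) graph is $d$-regular if all nodes have degree $d$.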

###### Theorem 2.

[3] For any natural number , and sufficiently large , there exist -regular graphs on vertices which are also -edge expanders.

###### Theorem 3.

A -regular -edge expander graph satisfies that .

###### Proof.

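The existence of a $d$-regular $\frac{d}{10}$-edge expander is guaranteed by Theorem 2. Given a cut $(S, V \setminus S)$, its sparsity is $\phi(S) = \frac{|E(S, V \setminus S)|}{|S| \cdot |V \setminus S|}$. Consider the sparsest cut $(S^*, V \setminus S^*)$ for $G$, with minimum sparsity $\phi^*$. Assume w.l.o.g. that $|S^*| \le |V \setminus S^*|$. Its sparsity satisfies:

$$\phi^* = \frac{|E(S^*, V \setminus S^*)|}{|S^*| \cdot |V \setminus S^*|} \ge \frac{\frac{d}{10} \cdot |S^*|}{|S^*| \cdot |V \setminus S^*|} = \frac{d}{10\,|V \setminus S^*|} \ge \frac{d}{10n}. \tag{1}$$

The first inequality above holds as $G$ is a $\frac{d}{10}$-edge expander.

On the other hand, consider an arbitrary bisection cut $(\hat{S}, V \setminus \hat{S})$ with $|\hat{S}| = \frac{n}{2}$. Its sparsity satisfies:

$$\phi(\hat{S}) = \frac{|E(\hat{S}, V \setminus \hat{S})|}{|\hat{S}| \cdot |V \setminus \hat{S}|} = \frac{|E(\hat{S}, V \setminus \hat{S})|}{\frac{n}{2} \cdot \frac{n}{2}} \le \frac{d \cdot \frac{n}{2}}{n^2/4} = \frac{2d}{n}. \tag{2}$$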

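Combining (Eqn. 1) and (Eqn. 2), we have that the bisection cut $(\hat{S}, V \setminus \hat{S})$ is a $20$-approximation for the sparsest cut. On the other hand, it is shown in [8] that a recursive sparsest-cut algorithm with a given approximation factor will lead to a HC-tree whose cost approximates $\mathrm{cost}_G(T^*)$ within a constant multiple of that factor, where $T^*$ is the optimal tree. Now consider the HC-tree $T$ resulting from first bisecting the input graph via $(\hat{S}, V \setminus \hat{S})$, then identifying the sparsest cut for each subtree recursively. By [8], $\mathrm{cost}_G(T) \le 20c \cdot \mathrm{cost}_G(T^*)$ for some constant $c$. Furthermore, $\mathrm{cost}_G(T)$ is at least the cost induced by those edges split at the top level; that is,

$$\mathrm{cost}_G(T) = \sum_{(i,j) \in E} \bigl|\mathrm{leaves}(T[\mathrm{LCA}(i,j)])\bigr| \ge \sum_{(i,j) \in E(\hat{S}, V \setminus \hat{S})} \bigl|\mathrm{leaves}(T[\mathrm{LCA}(i,j)])\bigr| = \sum_{(i,j) \in E(\hat{S}, V \setminus \hat{S})} n = n \cdot |E(\hat{S}, V \setminus \hat{S})| \ge n \cdot \frac{nd}{20} = \frac{dn^2}{20}, \tag{3}$$

where the last inequality holds as $G$ is a $\frac{d}{10}$-edge expander and $|\hat{S}| = \frac{n}{2}$. Hence

$$\mathrm{cost}_G(T^*) \ge \frac{1}{20c}\,\mathrm{cost}_G(T) \ge \frac{dn^2}{400c},$$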

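with $c$ being a constant. By Claim 1, we have that

$$\mathrm{totalC}_G(T^*) = \mathrm{cost}_G(T^*) - 2|E| \ge \frac{dn^2}{400c} - dn = \Omega(dn^2). \tag{4}$$

Now call a triplet a wedge (resp. a triangle) if there are exactly two (resp. exactly three) edges among the three nodes. Only wedges and triangles contribute to $\mathrm{baseC}(G)$. Thus:

$$\mathrm{baseC}(G) = \#\text{wedges} + 2\,\#\text{triangles} \le \sum_{i \in [1,n]} \binom{d}{2} = O(d^2 n) \quad\Longrightarrow\quad \rho^*_G = \frac{\mathrm{totalC}_G(T^*)}{\mathrm{baseC}(G)} = \Omega\!\left(\frac{dn^2}{d^2 n}\right) = \Omega\!\left(\frac{n}{d}\right).$$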

Combining the above with Theorem 1, the claim then follows. ∎
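The counting step used in the proof, namely that for a unit-weight graph the base-cost equals the number of wedges plus twice the number of triangles, can be checked numerically. The sketch below is only an illustration on small graphs; it assumes a symmetric 0/1 weight matrix and counts each unordered triplet once, and the function names are hypothetical.

```python
from itertools import combinations


def base_cost(w):
    """Sum of min-triplet costs over all unordered triplets."""
    return sum(min(w[i][j] + w[i][k], w[i][j] + w[j][k], w[i][k] + w[j][k])
               for i, j, k in combinations(range(len(w)), 3))


def wedges_and_triangles(w):
    """Count triplets with exactly two edges (wedges) and three edges (triangles)."""
    wedges = triangles = 0
    for i, j, k in combinations(range(len(w)), 3):
        edges = w[i][j] + w[i][k] + w[j][k]
        wedges += edges == 2
        triangles += edges == 3
    return wedges, triangles


# A 5-cycle (2-regular) and the complete graph on 4 vertices (3-regular).
cycle5 = [[1 if abs(i - j) in (1, 4) else 0 for j in range(5)] for i in range(5)]
k4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
for w in (cycle5, k4):
    wedges, triangles = wedges_and_triangles(w)
    print(base_cost(w), wedges + 2 * triangles)
```

A triplet with zero or one edge contributes nothing because the two lightest pair sums can both avoid the single edge, which is why only wedges and triangles appear in the count.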

##### Relation to the “ground-truth input” graphs of [8].

Cohen-Addad et al. introduced what they call the “ground-truth input” graphs to describe inputs that admit a “natural” ground-truth cluster tree [8]. A brief review of this concept is given in Appendix A.5. Interestingly, we show that a ground-truth input graph always has a perfect HC-structure. However, the converse is not necessarily true and the family of graphs with perfect HC-structure is broader while still being natural. Proof of the following theorem can be found in Appendix A.5.

###### Theorem 4.

Given a similarity graph $G$, if $G$ is a ground-truth input graph of [8], then $G$ must have perfect HC-structure. However, the converse is not true.

Intuitively, a graph has a perfect HC-structure if there exists a tree such that for all triplets, the most similar pair (with the largest similarity) will be merged first in . Such a triplet-order constraint is much weaker than the requirement of the ground-truth graph of [8] (which intuitively is generated by an ultrametric). An example is given in Figure 2. In fact, the following proposition shows that the concept of ground-truth graph is rather stringent for graphs with unit edge weights (i.e, unweighted graphs). In particular, a connected unweighted graph is a ground-truth graph of [8] if and only if it is the complete graph. In contrast, unweighted graphs with perfect HC-structure represent a much broader family of graphs. The proof of this proposition is in Appendix A.5.

###### Proposition 1.

Let $G = (V, E)$ be an unweighted graph (i.e., $w_{ij} = 1$ if $(i,j) \in E$ and $w_{ij} = 0$ otherwise). $G$ is a ground-truth graph if and only if each connected component of $G$ is a clique.

## 3 Algorithms

### 3.1 Hardness and approximation algorithms

By Claim 1, equals minus a constant (depending only on ). Thus the hardness results for optimizing also holds for optimizing . Hence the following theorem follows from results of [7] and [9]. The simple proof is in Appendix B.1.

###### Theorem 5.

(1) It is NP-hard to compute , even when is an unweighted graph (i.e, edges have unit weights). (2) Furthermore, under Small Set Expansion hypothesis, it is NP-hard to approximate within any constant factor.

We remark that while the hardness results for optimizing translate into hardness results for , it is not immediately evident that an approximation algorithm translates too, as differs from by a positive quantity. Nevertheless, it turns out that the -approximation algorithm of [7] for also approximates within the same asymptotic approximation factor. See appendix B.2 for details.

### 3.2 Algorithms for graphs with perfect or near-perfect HC-structure

While in general, it remains open how to approximate (as well as ) to a factor better than , we now show that we can check whether a graph has perfect HC-structure or not, and compute an optimal tree if it has, in polynomial time. We also provide a polynomial-time approximation algorithm for graphs with near-perfect HC-structures (to be defined later).

Remark: One may wonder whether a simple agglomerative (bottom-up) hierarchical clustering algorithm, such as the single-linkage, average-linkage, or complete-linkage clustering algorithm, could have recovered the perfect HC-structure. For the ground-truth input graphs introduced in [8], it is known that it can be recovered via average-linkage clustering algorithms. However, as the example in Figure 2 shows, this in general is not true for recovering graphs with perfect HC-structures, as depending on the strategy, a bottom-up approach may very well merge nodes and first, leading to a non-optimal HC-tree.

Intuitively, it is necessary to have a top-down approach to recover the perfect HC-structure. The high-level framework of our recursive algorithm BuildPerfectTree($G$) is given below; it outputs a HC-tree that spans (i.e., with its leaf set being) a subset of vertices from $V$. $G_A$ (resp. $G_B$) in the algorithm denotes the subgraph of $G$ spanned by vertices in $A$ (resp. in $B$). We will prove that the output tree $T$ spans all vertices $V$ if and only if $G$ has a perfect HC-structure (in which case $T$ will also be an optimal tree).

BuildPerfectTree()

Input: graph . Output: a binary HC-tree

• Set = validBipartition();   If( or ) Return()

• Set = BuildPerfectTree();    =BuildPerfectTree()

• Build tree with and being the two subtrees of its root. Return()
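The recursive framework above can be sketched in Python as follows. This is not the paper's procedure: the efficient validBipartition() is replaced here by an exhaustive search over all bi-partitions (exponential time, suitable only for tiny graphs), and the sketch returns `None` when no consistent binary tree exists rather than a partial tree. The validity test encodes the assumption that a split placing $i$, $j$ on one side and $k$ on the other is acceptable exactly when $w_{ij}$ is a largest weight of the triplet. The names `split_is_valid` and `build_perfect_tree` are hypothetical.

```python
from itertools import combinations


def split_is_valid(w, a, b):
    """Check that the bi-partition (a, b) respects every triplet it separates."""
    side = {v: 0 for v in a}
    side.update({v: 1 for v in b})
    for x, y, z in combinations(sorted(side), 3):
        for i, j, k in ((x, y, z), (x, z, y), (y, z, x)):
            # The split realises relation {i,j|k} when i, j share a side and k does not.
            if side[i] == side[j] != side[k] and w[i][j] < max(w[i][k], w[j][k]):
                return False
    return True


def build_perfect_tree(w, verts):
    """Return a binary tree (nested tuples) consistent with all triplets, or None."""
    verts = tuple(verts)
    if len(verts) == 1:
        return verts[0]
    first, rest = verts[0], verts[1:]
    # Brute-force stand-in for validBipartition(): try every split (a, b).
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            a = (first,) + extra
            b = tuple(v for v in rest if v not in extra)
            if not split_is_valid(w, a, b):
                continue
            left, right = build_perfect_tree(w, a), build_perfect_tree(w, b)
            if left is not None and right is not None:
                return (left, right)
    return None


star = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
path5 = [[1 if abs(i - j) == 1 else 0 for j in range(5)] for i in range(5)]
print(build_perfect_tree(star, range(4)))   # (((0, 1), 2), 3)
print(build_perfect_tree(path5, range(5)))  # None
```

The search backtracks over splits for safety, so it finds a consistent binary tree whenever one exists; the algorithm in the text avoids this enumeration.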

We say that is a partial bi-partition of if and ; and is a bi-partition of (or ) if , , and . Let .

###### Definition 6 (Triplet types).

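A triplet $\{i,j,k\}$ with edge weights $w_{ij}$, $w_{ik}$ and $w_{jk}$ is
type-1: if the largest weight, say $w_{ij}$, is strictly larger than the other two; i.e., $w_{ij} > \max\{w_{ik}, w_{jk}\}$;
type-2: if exactly two weights, say $w_{ij}$ and $w_{ik}$, are the largest; i.e., $w_{ij} = w_{ik} > w_{jk}$;
type-3: otherwise, which means all three weights are equal; i.e., $w_{ij} = w_{ik} = w_{jk}$.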
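The three types are determined by how many of the three weights attain the maximum, which the following small Python sketch makes explicit (the function name `triplet_type` is hypothetical, and exact equality is used, so floating-point weights would need a tolerance):

```python
def triplet_type(w_ij, w_ik, w_jk):
    """Return 1, 2 or 3: the number of weights that attain the maximum."""
    weights = (w_ij, w_ik, w_jk)
    return weights.count(max(weights))


print(triplet_type(3, 1, 2), triplet_type(2, 2, 1), triplet_type(1, 1, 1))  # 1 2 3
```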

###### Definition 7 (Valid partition).

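A partition $\mathcal{C} = \{C_1, \ldots, C_r\}$, $r > 1$, of $V$ (i.e., $\bigcup_i C_i = V$, $C_i \neq \emptyset$, and $C_i \cap C_j = \emptyset$ for $i \neq j$) is valid w.r.t. $G$ if (i) for any type-1 triplet $\{i,j,k\}$ with $w_{ij} > \max\{w_{ik}, w_{jk}\}$, either all three vertices belong to the same set from $\mathcal{C}$; or $i$ and $j$ are in one set from $\mathcal{C}$, while $k$ is in another set; and (ii) for any type-2 triplet $\{i,j,k\}$ with $w_{ij} = w_{ik} > w_{jk}$, it cannot be that $j$ and $k$ are from the same set of $\mathcal{C}$, while $i$ is in another one.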

If this partition is a bi-partition, then it is also called a valid bi-partition.

In what follows, we also refer to each set in the partition as a cluster.

The goal of procedure validBipartition() is to compute a valid bi-partition if it exists. Otherwise, it returns “fail” (more precisely, it returns () ). At a high level, it has two steps. It turns out that (Step-1) follows from existing literature on the so-called rooted triplet consistency problem. The main technical challenge is to develop (Step-2).

Procedure validBipartition()

(Step-1): Compute a certain valid partition of , if possible.

(Step-2): Compute a valid bi-partition from this valid partition .

##### (Step 1): compute a valid partition.

It turns out that the partition procedure within the algorithm BUILD of [2] will compute a specific partition with nice properties. The following result can be derived from Lemma 1, Theorem 2, and proof of Theorem 4 of [2].

###### Proposition 2 ([2]).

(1) We can check whether there is a valid partition for in time. Furthermore, if one exists, then within the same time bound we can compute a valid partition which is minimal in the following sense: Given any other valid partition , is a refinement of (i.e, all points from the same cluster are contained within some ).

(2) This implies that, let be any valid bi-partition for , and the partition as computed above. Then for any , if (resp. ), then (resp. ).

##### (Step 2): compute a valid bi-partition from C if possible.

Suppose (Step 1) succeeds and let , , be the minimum valid partition computed. It is not clear whether a valid bi-partition exists (and how to find it) even though a valid partition exists. Our algorithm computes a valid bi-partition depending on whether a claw-configuration exists or not.

###### Definition 8 (A claw).

Four points form a claw w.r.t. if (i) each point is from a different cluster in ; and (ii) . See Figure 3 (a).

(Case-1): Suppose there is a claw w.r.t. . Fix any claw, and assume w.l.o.g that it is such that for or . We choose an arbitrary but fixed representative vertex for each cluster with . Compute the subgraph of spanned by vertices . (Recall that we can view as a complete graph where has weight if it is not an edge in .) Set . We say that an edge is light if its weight is strictly less than ; otherwise, it is heavy. Easy to see that by definition of claw, edges and are all light. Now, consider the subgraph of spanned by only light-edges, and w.l.o.g. let be the maximum clique in contains and . See Figure 3 (b) for this clique, where solid edges are heavy, while dashed ones are light. (It turns out that this maximum clique can be computed efficiently, as we show in Appendix B.6.2.)

We then set up potential bi-partitions , for each . We check the validity of each such . If none of them is valid, then procedure validBipartition() returns ‘fail’. Otherwise, it returns the valid one. The correctness of this step is guaranteed by Lemma 1, whose proof can be found in Appendix B.4.

###### Lemma 1.

Suppose there is a claw w.r.t. . Let ’s, be described above. There exists a valid bi-partition for if and only if one of the , , is valid.

(Case-2): Suppose there is no claw w.r.t. . The case when there is no claw is slightly more complicated. We present the lemma below with a constructive proof in Appendix B.5.

###### Lemma 2.

If there is no claw w.r.t. , then we can check whether there is a valid bi-partition (and compute one if it exists) in time, where .

This finishes the description of procedure validBipartition(). Putting everything together, we conclude with the following theorem, with proof in Appendix B.6. We note that it is easy to obtain a time complexity of . However, we show in Appendix B.6 how to modify our algorithm as well as to provide a much more refined analysis to improve the time to .

###### Theorem 6.

Given a similarity graph with vertices, algorithm BuildPerfectTree() can be implemented to run in time. It returns a tree spanning all vertices in if and only if has a perfect HC-structure, in which case this spanning tree is an optimal HC-tree.

Hence we can check whether a graph has a perfect HC-structure, and compute an optimal HC-tree if it does, in time.

##### Graphs with almost perfect HC-structure.

In practice, a graph with perfect HC-structure could be corrupted with noise. We introduce a concept of graphs with an almost-perfect HC-structure, and present a polynomial time algorithm to approximate the optimal cost.

###### Definition 9 (δ-perfect HC-structure).

A graph has -perfect HC-structure, , if there exist weights such that (i) the graph has perfect HC-structure; and (ii) for any , we have . In this case, we also say that is a -perturbation of graph .

Note that a graph with -perfect HC-structure is simply a graph with perfect HC-structure.

The proof of the following theorem is in Appendix B.7. Note that when (i.e., the graph has a perfect HC-structure), this theorem also gives rise to a -approximation algorithm for the optimal ratio-cost in . In contrast, an exact algorithm to compute the optimal tree in this case takes time, as shown in Theorem 6.

###### Theorem 7.

Suppose is a -perturbation of a graph with perfect HC-structure. Then we have (i) ; and (ii) we can compute a HC-tree s.t. (i.e., we can ()-approximate ) in time.

## 4 Ratio-cost function for Random Graphs

###### Definition 10.

Given a symmetric matrix with each entry , is a random graph generated from if there is an edge with probability . Each edge in has unit weight.

The expectation-graph refers to the weighted graph where the edge has weight .
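Definition 10 and the expectation-graph can be illustrated with a minimal sketch. The function names `sample_graph` and `expectation_graph`, the 0-based vertex indices, and the edge representation (pairs `(i, j)` with `i < j`) are assumptions made for this illustration and do not come from the paper.

```python
import random


def sample_graph(P, seed=None):
    """Sample an undirected, unit-weight graph from a symmetric
    edge-probability matrix P (a list of lists with entries in [0, 1]).

    Each pair (i, j) with i < j becomes an edge independently with
    probability P[i][j]. Returns the set of sampled edges.
    """
    rng = random.Random(seed)
    n = len(P)
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            # rng.random() lies in [0, 1), so probability 1 always
            # yields an edge and probability 0 never does.
            if rng.random() < P[i][j]:
                edges.add((i, j))
    return edges


def expectation_graph(P):
    """Return the weighted expectation-graph: edge (i, j) has weight P[i][j]."""
    n = len(P)
    return {(i, j): P[i][j] for i in range(n) for j in range(i + 1, n)}
```

Here the sampled graph has unit edge weights, while the expectation-graph is a deterministic weighted graph on the same vertex set.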

In all statements, “with high probability (w.h.p.)” means “with probability larger than for some constant ”. The main result is as follows. The proof is in Appendix C.1.

###### Theorem 8.

Given an edge probability matrix , assume each entry , for any . Given a random graph sampled from , let denote the optimal HC-tree for (w.r.t. ratio-cost), and an optimal HC-tree for the expectation-graph . Then we have that w.h.p,

$$\rho^*_G = \rho_G(T^*) = (1+o(1))\,\frac{\mathrm{totalC}_{\overline{G}}(\overline{T}^*)}{\mathbb{E}[\mathrm{baseC}(G)]},$$

where is the expected base-cost for the random graph .

Hence the optimal cost of a randomly generated graph concentrates around a quantity. However, this quantity is different from the optimal ratio-cost for the expectation graph , which would be . Rather, is sensitive to the specific random graph sampled. We give two examples below. An Erdős-Rényi random graph is generated by the edge probability matrix with for all . The probability matrix for the planted bisection model , , is such that (1) for or ; (2) otherwise. See Appendix C.2 and C.3 for the proofs of these results.
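As a rough illustration of the two edge probability matrices just described, the sketch below builds them explicitly. The parameter names `p` and `q`, the 0-based indexing, and the choice to put zeros on the diagonal (no self-loops) are assumptions of this sketch rather than notation fixed by the text.

```python
def erdos_renyi_matrix(n, p):
    """Edge-probability matrix with the same probability p for every pair i != j."""
    return [[0.0 if i == j else p for j in range(n)] for i in range(n)]


def planted_bisection_matrix(n, p, q):
    """Edge-probability matrix for a planted bisection on n vertices.

    Vertices 0 .. n//2 - 1 form one side and the remaining vertices the
    other. Pairs on the same side get probability p; pairs across the
    two sides get probability q.
    """
    half = n // 2

    def same_side(i, j):
        return (i < half) == (j < half)

    return [
        [0.0 if i == j else (p if same_side(i, j) else q) for j in range(n)]
        for i in range(n)
    ]
```

Both matrices are symmetric by construction, so either can serve as the input from which a random graph is sampled.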

###### Corollary 1.

For an Erdős-Rényi random graph , where , w.h.p., the optimal HC-tree has ratio-cost .

###### Corollary 2.

For a planted bisection model , where , w.h.p., the optimal HC-tree has ratio-cost .

Note that the decreases as increases. When , increases as increases, otherwise, it decreases as increases.

Note that while the cost of an optimal tree w.r.t. Dasgupta’s cost (and also our total-cost) always concentrates around the cost for the expectation graph, as we mentioned above, the value of the ratio-cost depends on the sampled random graph itself. For Erdős-Rényi random graphs, a larger value indicates tighter connections among nodes, making the graph more clique-like, and thus decreases till when . (In contrast, note that the expectation-graph is a complete graph where all edges have weight ; thus it always has no matter what value it is.)

For the planted bisection model, increasing value also strengthens in-cluster connections and thus decreases. Interestingly, increasing value when it is small hurts the clustering structure, because it adds more cross-cluster connections, thereby making the two clusters formed by vertices with indices from and from , respectively, less evident; indeed, increases when increases for small . However, when is already relatively large (close to ), increasing it further makes the entire graph closer to a clique, and starts to decrease when increases for large . Note that such a refined behavior (w.r.t. probability ) does not hold for the original cost function by Dasgupta.

##### Acknowledgements.

We would like to thank Sanjoy Dasgupta for very insightful discussions at the early stage of this work, which led to the observation of the base-cost for a graph. The work was initiated during the semester-long program “Foundations of Machine Learning” at the Simons Institute for the Theory of Computing in spring 2017. This work is partially supported by the National Science Foundation under grants CCF-1740761, DMS-1547357, and RI-1815697.

## References

• [1] C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications. Chapman & Hall/CRC, 1st edition, 2013.
• [2] A. V. Aho, Y. Sagiv, T. G. Szymanski, and J. D. Ullman. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing, 10(3):405–421, 1981.
• [3] S. Arora and B. Barak. Computational Complexity: A Modern Approach. Cambridge University Press, New York, NY, USA, 1st edition, 2009.
• [4] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
• [5] M. Balcan and Y. Liang. Clustering under perturbation resilience. SIAM Journal on Computing, 45(1):102–155, 2016.
• [6] J. Byrka, S. Guillemot, and J. Jansson. New results on optimizing rooted triplets consistency. Discrete Applied Mathematics, 158(11):1136–1147, 2010.
• [7] M. Charikar and V. Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’17, pages 841–854, Philadelphia, PA, USA, 2017. Society for Industrial and Applied Mathematics.
• [8] V. Cohen-Addad, V. Kanade, F. Mallmann-Trenn, and C. Mathieu. Hierarchical clustering: Objective functions and algorithms. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’18, pages 378–397, 2018.
• [9] S. Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing, STOC ’16, pages 118–127, New York, NY, USA, 2016. ACM.
• [10] W. F. de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems. In Proceedings of the Thirty-fifth Annual ACM Symposium on Theory of Computing, STOC ’03, pages 50–58, New York, NY, USA, 2003. ACM.
• [11] D. Ghoshdastidar and A. Dukkipati. Consistency of spectral hypergraph partitioning under planted partition model. Ann. Statist., 45(1):289–315, 02 2017.
• [12] S. Janson. Large deviations for sums of partly dependent random variables. Random Structures Algorithms, 24:234–248, 2004.
• [13] J. Jansson, J. H. K. Ng, K. Sadakane, and W.-K. Sung. Rooted maximum agreement supertrees. Algorithmica, 43(4):293–307, Dec. 2005.
• [14] R. Krauthgamer, J. S. Naor, and R. Schwartz. Partitioning graphs into balanced components. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’09, pages 942–949, Philadelphia, PA, USA, 2009. Society for Industrial and Applied Mathematics.
• [15] B. Moseley and J. Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3097–3106. Curran Associates, Inc., 2017.
• [16] P. Raghavendra and D. Steurer. Graph expansion and the unique games conjecture. In Proceedings of the Forty-second ACM Symposium on Theory of Computing, STOC ’10, pages 755–764, New York, NY, USA, 2010. ACM.
• [17] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist., 39(4):1878–1915, 08 2011.
• [18] A. Roy and S. Pokutta. Hierarchical clustering via spreading metrics. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2316–2324. Curran Associates, Inc., 2016.
• [19] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman and Co., 1973.

## Appendix A Missing details from Section 2

### A.1 Proof for Claim 1

We prove the first equality. In particular, fix any pair of leaves , , and count how many times is added on both sides. On the left-hand side, each time there is a node such that relation , or holds, will be added once. There are exactly number of such ’s, which is the same as the number of times will be added to the right-hand side. Hence the first two terms are equal.

The second equality is straightforward after we plug in the definition of .

### A.2 Proving that $\mathrm{baseC}(G)=0 \Rightarrow \mathrm{totalC}_G(T^*)=0$

Let denote an input graph with . This means that for each triplet , there can be at most one edge among them, as otherwise will not be zero. Hence the degree of each node in will be at most 1. In other words, the collection of edges in forms a matching of nodes in it. Let , and denote these matched nodes, where and thus these two nodes are paired together. Let denote the remaining isolated nodes. It is easy to verify that the tree shown in Figure 4 has .
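The step from "every triplet contains at most one edge" to "the edges form a matching" can be sanity-checked by brute force on small graphs. The sketch below is only an illustration for unweighted graphs. The function names are hypothetical, and it takes the triplet condition as stated in the argument above as its starting point instead of computing the base-cost itself.

```python
from itertools import combinations


def every_triplet_has_at_most_one_edge(n, edges):
    """True if no three vertices of {0, ..., n-1} span two or more edges."""
    edge_set = {frozenset(e) for e in edges}
    for triple in combinations(range(n), 3):
        count = sum(
            1 for pair in combinations(triple, 2) if frozenset(pair) in edge_set
        )
        if count > 1:
            return False
    return True


def is_matching(n, edges):
    """True if every vertex has degree at most 1."""
    degree = [0] * n
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return max(degree, default=0) <= 1


def check_all_graphs(n):
    """Compare the two conditions on every simple graph with n vertices."""
    pairs = list(combinations(range(n), 2))
    for mask in range(1 << len(pairs)):
        edges = [pairs[b] for b in range(len(pairs)) if (mask >> b) & 1]
        if every_triplet_has_at_most_one_edge(n, edges) != is_matching(n, edges):
            return False
    return True
```

Calling `check_all_graphs(5)` enumerates all 1024 simple graphs on five vertices and compares the two conditions on each. This is a finite check, not a replacement for the argument in the text.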

### A.3 Examples of ratio-costs of some graphs

Consider the two examples in Figure 5 (a). The linked-stars consists of two stars centered at and respectively, connected via edge . For the linked-stars , assuming that is even and by symmetry, we have that:

 baseC(G1) =2∑i≠j≠k∈[1,n/2]minTriC(i,j,k)+2∑i