1 Introduction
Clustering has been one of the most important and popular data analysis methods in the modern data era, with numerous clustering algorithms proposed in the literature [1]. Theoretical studies on clustering have so far focused mostly on flat clustering algorithms, e.g., [4, 5, 10, 11, 17], which aim to partition the input data set into a (often prespecified) number of groups, called clusters. However, there are many scenarios where it is more desirable to perform hierarchical clustering, which recursively partitions data into a hierarchical collection of clusters. A hierarchical clustering (HC) of a data set can be naturally represented by a tree, called an HC-tree, where leaves correspond to input data points and subtrees rooted at internal nodes correspond to clusters. Hierarchical clustering can provide a more thorough view of the cluster structure behind input data across all levels of granularity simultaneously, and is sometimes better at revealing the complex structure behind modern data. It has been broadly used in data mining, machine learning and bioinformatics applications, e.g., the study of phylogenetics.
Most hierarchical clustering algorithms used in practice are developed in a procedural manner. For example, the family of agglomerative methods builds an HC-tree bottom-up, starting with each data point in its own cluster and repeatedly merging clusters to form bigger ones at coarser levels. Prominent merging strategies include the single-linkage, average-linkage and complete-linkage heuristics. The family of divisive methods instead partitions the data in a top-down manner, starting with a single cluster and recursively dividing it into smaller clusters using strategies based on spectral cuts, k-means, k-center and so on. Many of these algorithms work well in different practical situations; for example, the average-linkage algorithm, known as the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) algorithm [19], is commonly used in evolutionary biology for phylogenetic inference. However, it is in general not clear what the output HC-tree aims to optimize, and what one should expect to obtain. This lack of an optimization understanding of the HC-tree also makes it hard to decide which hierarchical clustering algorithm one should use for a specific type of input data.
The optimization formulation of the HC-tree was recently tackled by Dasgupta in [9]. Specifically, given a similarity graph (a weighted graph whose edge weights encode the similarity between nodes), he proposed an intuitive cost function for any HC-tree, and defined an optimal HC-tree to be one that minimizes this cost. Dasgupta showed that the optimal tree under this objective function has many nice properties and is indeed desirable. Furthermore, while it is NP-hard to find the optimal tree, he showed that a simple heuristic using an approximation of the sparsest graph cut leads to an algorithm computing an HC-tree whose cost approximates the optimal cost; combined with the best known approximation factor for the sparsest cut [3], this gives an approximation algorithm for the optimal HC-tree as defined by Dasgupta’s cost function. The approximation factor has since been improved in several subsequent works [7, 8, 18]; in particular, it was shown independently in [7, 8] that one can obtain an improved approximation.
Our work.
For a fixed graph, the value of Dasgupta’s cost function can be used to differentiate “better” HC-trees (with smaller cost) from “worse” ones (with larger cost), and the HC-tree with the smallest cost is optimal. However, we observe that this cost function, in its current form, does not indicate whether an input graph has a strong hierarchical structure or not, or whether one graph has “more” hierarchical structure than another graph. For example, consider a star graph and a path graph, both with the same number of nodes and edges, and with unit edge weights. It turns out that the optimal HC-tree cost for the star is asymptotically larger than that for the path. However, the star, with all nodes connected to a center node, is intuitively more cluster-like than the path with its sequential edges. (In fact, we will show later that a star, as well as the so-called linked star, where two stars have their center vertices linked by an edge (see Figure 5 (a) in Appendix), both have what we call a perfect HC-structure.) Furthermore, for a dense unit-weight graph, one can show that the optimal cost is always of the same asymptotic order, whether or not the graph exhibits any cluster structure at all. Hence in general, it is not meaningful to compare the optimal HC-tree costs across different graphs.
We propose a modification of Dasgupta’s cost function to address this issue, and study its properties and algorithms. In particular, by reformulating Dasgupta’s cost function, we observe that for a fixed graph, there exists a base-cost which reflects the minimum cost one can hope to achieve. Based on this observation, we develop a new cost to evaluate how well an HC-tree represents an input graph. An optimal HC-tree is one minimizing this new cost. The new cost function has several interesting properties:

For any fixed graph, a tree minimizes our new cost function if and only if it minimizes Dasgupta’s cost function; thus the optimal tree under our cost function remains the same as under Dasgupta’s. Furthermore, the hardness results and the existing approximation algorithm developed in [7] still apply to our cost function.

For any positively weighted graph, the optimal cost is bounded (while the optimal cost under Dasgupta’s cost function can be made arbitrarily large). The optimal cost intuitively indicates how much HC-structure the graph has. In particular, the optimal cost attains its minimum possible value if and only if there exists an HC-tree that is consistent with all triplet relations in the graph (see Section 2 for precise definitions), in which case we say that this graph has a perfect hierarchical-clustering (HC) structure.

The new formulation enables us to develop an efficient algorithm to test whether an input graph has a perfect HC-structure, as well as to compute an optimal tree if it does. If an input graph is what we call a perturbation of a graph with a perfect HC-structure, then in polynomial time we can compute an HC-tree whose cost approximates the optimal one up to a factor depending on the perturbation.

Finally, in Section 4, we study the behavior of our cost function for a random graph generated from an edge probability matrix. Under mild conditions on the matrix, we show that the optimal cost concentrates around a certain value which, interestingly, is different from the optimal cost for the expectation-graph (i.e., the graph whose edge weights equal the entries of the matrix). Furthermore, for random graphs sampled from probability matrices, the optimal cost decreases if we strengthen in-cluster connections; for instance, we characterize the optimal cost of an Erdős–Rényi random graph as a function of its connection probability. In other words, the optimal cost reflects how much HC-structure a random graph has.
In general, we believe that the new formulation and the results of our investigation help to reveal insights into the hierarchical structure behind graph data. We remark that [8] proposed a concept of ground-truth input (graph), which, informally, is a graph consistent with an ultrametric under some monotone transformation of weights. Our concept of graphs with perfect HC-structure is more general than their ground-truth graphs (see Theorem 4), and, for example, is much more meaningful for unweighted graphs (Proposition 1).
More on related work.
The present study is inspired by the work of Dasgupta [9], as well as the subsequent work in [7, 8, 15, 18], especially [8]. As mentioned earlier, after [9] there have been several independent follow-up works that improve the approximation factor, first in [18] and then further via a more refined analysis of the sparsest-cut based algorithm of Dasgupta [7, 8], or via an SDP relaxation [7]. It was also shown in [7, 18] that it is hard to approximate the optimal cost (under Dasgupta’s cost function) within any constant factor, assuming the Small Set Expansion (SSE) hypothesis originally introduced in [16]. A dual version of Dasgupta’s cost function was considered in [15]; and an analog of Dasgupta’s cost function for dissimilarity graphs (i.e., graphs where edge weights represent dissimilarity) was studied in [8]. In both cases, the formulation leads to a maximization problem, and thus exhibits a rather different flavor from an approximation perspective: indeed, simple constant-factor approximation algorithms are proposed in both cases.
We remark that the investigation in [8] in fact targets a broader family of cost functions than Dasgupta’s cost function (and thus ours as well).
2 An Improved Cost Function for HC-trees and Properties
In this section, we first describe the cost function proposed by Dasgupta [9] in Section 2.1. We then introduce our cost function in Section 2.2 and present several of its properties in Section 2.3.
2.1 Problem setup and Dasgupta’s cost function
Our input is a set of n data points together with their pairwise similarities, represented as a weight matrix W whose entry w_ij gives the similarity between points i and j. We assume that W is symmetric and each entry is non-negative. Alternatively, we can assume that the input is a weighted (undirected) graph G = (V, E), with w_ij being the weight of edge (i, j) ∈ E. The two views are equivalent: if the input graph is not a complete graph, then in the weight matrix view we simply set w_ij = 0 if (i, j) ∉ E. We use the two views interchangeably in this paper. Finally, we use “unweighted graph” to refer to a graph where each edge has unit weight.
Given a set of data points V, a hierarchical clustering tree (HC-tree) is a rooted tree T whose leaf set equals V. We also say that T is an HC-tree spanning (its leaf set) V. Given any tree node u, T_u represents the subtree rooted at u, and leaves(T_u) denotes the set of leaves contained in the subtree T_u. Given any two points i, j ∈ V, lca_T(i, j) denotes the lowest common ancestor of leaves i and j in T; the subscript T is often omitted when the choice of T is clear. To simplify presentation, we sometimes use indices to refer to vertices. The following cost function to evaluate an HC-tree T w.r.t. a similarity graph G was introduced in [9]:

cost_T(G) = Σ_{i<j} w_ij · |leaves(T_{lca(i,j)})|.
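To make this concrete, the cost is easy to compute for small inputs. Below is a minimal Python sketch; the nested-pair encoding of an HC-tree and the function names are illustrative conventions rather than notation from the paper:

```python
def leaves(t):
    """Leaf set of an HC-tree encoded as nested pairs, e.g. ((0, 1), 2)."""
    return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])

def dasgupta_cost(t, w):
    """Sum over pairs (i, j) of w_ij times |leaves(T_lca(i,j))|.

    w maps sorted pairs (i, j) to similarities; missing pairs have weight 0.
    Pairs split at a node have their lca exactly there, so each internal node
    charges every pair it separates the size of its own leaf set.
    """
    if not isinstance(t, tuple):
        return 0
    left, right = leaves(t[0]), leaves(t[1])
    size = len(left) + len(right)
    cross = sum(w.get((min(i, j), max(i, j)), 0) * size
                for i in left for j in right)
    return cross + dasgupta_cost(t[0], w) + dasgupta_cost(t[1], w)
```

For the unit-weight path 0 – 1 – 2, the tree ((0, 1), 2) costs 1·2 + 1·3 = 5, while ((0, 2), 1) costs 6, so merging an actual edge first is preferred.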
An optimal HC-tree is defined as one minimizing this cost. Intuitively, to minimize the cost, pairs of nodes with high similarity should be merged (into a single cluster) earlier. It was shown in [9] that the optimal tree under this cost function has several nice properties, e.g., behaving as expected for graphs such as disconnected graphs, cliques, and planted partitions. In particular, if G is an unweighted clique, then all trees have the same cost and thus all are optimal; this is intuitive, as no preference should be given to any specific tree shape in this case.
2.2 The new cost function
To introduce our new cost function, it is more convenient to take the matrix view, where a weight w_ij is defined for every pair of nodes i and j (as mentioned earlier, if the input is a weighted graph G = (V, E), then we set w_ij = 0 for (i, j) ∉ E). First, a triplet {i, j, k} means three distinct indices i, j, k. We say that relation {i, j} | k holds in T if the lowest common ancestor of i and j is a proper descendant of the lowest common ancestor of all three. Intuitively, the subtrees containing leaves i and j will merge first, before they merge with a subtree containing k. We say that relation i | j | k holds in T if all three are merged at the same time; that is, lca(i, j) = lca(i, k) = lca(j, k). See Figure 1 for an illustration.
Definition 1.
Given any triplet {i, j, k} of V, the cost of this triplet (induced by T) is

cost_T({i, j, k}) = w_ik + w_jk if relation {i, j} | k holds in T (and symmetrically for the other two pair relations), and cost_T({i, j, k}) = w_ij + w_ik + w_jk if relation i | j | k holds in T.
We omit T from the above notation when its choice is clear. The total-cost of tree T w.r.t. G is

totalcost_T(G) = Σ_{triplets {i,j,k}} cost_T({i, j, k}).
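As a sanity check of this reformulation, the triplet relation in a binary HC-tree can be read off by descending to the node where the triplet first splits 2-versus-1. The sketch below is our own illustration (the nested-pair tree encoding is an assumed convention) and computes the total-cost this way:

```python
from itertools import combinations

def leaves(t):
    """Leaf set of an HC-tree encoded as nested pairs, e.g. ((0, 1), 2)."""
    return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])

def merged_first(t, trip):
    """Return the pair of the triplet that merges first in binary tree t."""
    while True:
        for child in t:
            inside = trip & leaves(child)
            if len(inside) == 3:
                t = child      # all three still together: descend
                break
            if len(inside) == 2:
                return inside  # the triplet splits 2-vs-1 at this node
        else:
            raise ValueError("tree does not span the triplet")

def totalcost(t, w, nodes):
    """Sum of triplet costs: relation {i, j} | k pays w_ik + w_jk."""
    def wt(a, b):
        return w.get((min(a, b), max(a, b)), 0)
    cost = 0
    for trip in combinations(nodes, 3):
        pair = merged_first(t, set(trip))
        k = (set(trip) - pair).pop()
        cost += sum(wt(x, k) for x in pair)
    return cost
```

For the unit-weight path 0 – 1 – 2, the tree ((0, 1), 2) has total-cost 1, which differs from its Dasgupta cost by twice the total edge weight, matching Claim 1 below.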
The rather simple proof of the following claim is in Appendix A.1.
Claim 1.
The total-cost of any HC-tree differs from Dasgupta’s cost by a quantity depending only on the input graph G. Hence for a fixed graph G, it maintains the relative order of the costs of any two trees, implying that the optimal tree under the total-cost or under Dasgupta’s cost remains the same.
While the difference from Dasgupta’s cost function seems minor, it is easy to see from this formulation that for a fixed graph G, there is a least-possible cost that any HC-tree must incur, which we call the base-cost. It is important to note that the following base-cost depends only on the input graph G.
Definition 2.
Given an n-node graph G associated with similarity matrix W, for any distinct triplet {i, j, k}, define its min-triplet cost to be

c(i, j, k) = w_ij + w_ik + w_jk − max{w_ij, w_ik, w_jk},

i.e., the sum of the two smallest of the three pairwise similarities.
The base-cost of the similarity graph G is

basecost(G) = Σ_{triplets {i,j,k}} c(i, j, k).
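In code, the min-triplet cost is just the triplet’s total weight minus its largest pair, and the base-cost sums this over all triplets (a short self-contained sketch; the dictionary weight encoding is our own convention):

```python
from itertools import combinations

def basecost(w, nodes):
    """Base-cost of a graph: for every triplet, add the two smallest of its
    three pairwise weights (w maps sorted pairs to similarities; missing
    pairs have weight 0)."""
    def wt(a, b):
        return w.get((min(a, b), max(a, b)), 0)
    total = 0
    for i, j, k in combinations(nodes, 3):
        ws = sorted((wt(i, j), wt(i, k), wt(j, k)))
        total += ws[0] + ws[1]   # = w_ij + w_ik + w_jk - max of the three
    return total
```

The unit-weight triangle on {0, 1, 2} has base-cost 2, and a 4-clique has base-cost 8 (each of its four triplets contributes 2).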
To differentiate from Dasgupta’s cost function, we call our new cost function the ratio-cost.
Definition 3 (Ratio-cost function).
Given a similarity graph G and an HC-tree T, the ratio-cost of T w.r.t. G is defined as

ρ_T(G) = totalcost_T(G) / basecost(G).
An optimal tree T* for G is a tree such that ρ_{T*}(G) = min_T ρ_T(G); its ratio-cost is called the optimal ratio-cost ρ*(G).
Observation 1.
(i) For any HC-tree T, totalcost_T(G) ≥ basecost(G), implying that ρ_T(G) ≥ 1.
(ii) A tree optimizing the total-cost also optimizes the ratio-cost (and thus Dasgupta’s cost), as basecost(G) is a constant for a fixed graph.
(iii) There is always an optimal tree for the ratio-cost that is binary.
Observations (i) and (ii) above follow directly from the definitions of these costs and Claim 1. (iii) holds as it holds for Dasgupta’s cost function ([9], Section 2.3). Hence in the remainder of the paper, we consider only binary trees when talking about optimal trees.
Note that it is possible that basecost(G) = 0 for a non-empty graph G. We follow the convention that 0/0 = 1, while c/0 = +∞ for any positive number c. We show in Appendix A.2 that in this case there must exist a tree T such that totalcost_T(G) = 0. Thus ρ*(G) = 1 in this case.
Intuition behind the costs.
Consider any triplet {i, j, k}, and assume w.l.o.g. that w_ij is the largest among the three pairwise similarities. If there exists a “perfect” HC-tree T, then it should first merge i and j, as they are most similar, before merging them with k. That is, the relation for this triplet in the HC-tree should be {i, j} | k; and we say that this relation (and the tree T) is consistent with the similarities of this triplet. The cost of the triplet is designed to reward this “perfect” relation: cost_T({i, j, k}) is minimized, in which case cost_T({i, j, k}) = c(i, j, k), only when {i, j} | k holds in T. In other words, c(i, j, k) is the smallest cost possible for this triplet, and an HC-tree can achieve this cost only when the relation of this triplet in T is consistent with the similarities.
If there is an HC-tree T such that the relations of all triplets in T are consistent with their similarities, then totalcost_T(G) = basecost(G), implying that T must be optimal, as ρ_T(G) = 1 is the smallest ratio-cost possible. Similarly, if ρ*(G) = 1 for a graph G, then the optimal tree has to be consistent with all triplets of G. Intuitively, such a graph has a perfect HC-structure, in the sense that there is an HC-tree in which the desired merging order of every triplet is preserved.
Definition 4 (Graph with perfect HC-structure).
A similarity graph G has a perfect HC-structure if ρ*(G) = 1. Equivalently, there exists an HC-tree T such that totalcost_T(G) = basecost(G).
Examples.
The base-cost is independent of the tree and can be computed easily for any graph. If we find a tree T with totalcost_T(G) = basecost(G), then T must be optimal and G has a perfect HC-structure.
With this in mind, it is now easy to see that for a clique (with unit weights), any HC-tree T and any triplet {i, j, k} satisfy cost_T({i, j, k}) = 2 = c(i, j, k). Hence totalcost_T(G) = basecost(G) for any HC-tree T, and thus the clique has a perfect HC-structure. A complete graph whose edge weights equal the edge probabilities of a planted partition also has a perfect HC-structure. It is also easy to check that the star graph (or two linked stars) has a perfect HC-structure, while a long enough path graph has optimal ratio-cost strictly greater than 1. Intuitively, a path does not possess much hierarchical structure. See Appendix A.3 for details. We also refer the reader to Section 4 for more results and discussion on Erdős–Rényi and planted bipartition (also called planted bisection) random graphs. In particular, as we describe in the discussion after Corollary 2, the ratio-cost function exhibits an interesting, yet natural, behavior as the in-cluster and between-cluster probabilities of a planted bisection model change.
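These examples can be verified by brute force on small instances: enumerate every binary HC-tree, compute its total-cost from the triplet relations, and compare the best value with the base-cost. The harness below is our own illustration (using the triplet-cost reading where the pair merged first charges its two cross weights); it confirms that a 5-node star attains the base-cost while the 5-node path cannot:

```python
from itertools import combinations

def leaves(t):
    return {t} if not isinstance(t, tuple) else leaves(t[0]) | leaves(t[1])

def wt(w, a, b):
    return w.get((min(a, b), max(a, b)), 0)

def merged_first(t, trip):
    """Pair of the triplet that merges first in binary tree t."""
    while True:
        for child in t:
            inside = trip & leaves(child)
            if len(inside) == 3:
                t = child      # all three still together: descend
                break
            if len(inside) == 2:
                return inside  # triplet splits 2-vs-1 here
        else:
            raise ValueError("tree does not span the triplet")

def totalcost(t, w, nodes):
    cost = 0
    for trip in combinations(nodes, 3):
        pair = merged_first(t, set(trip))
        k = (set(trip) - pair).pop()
        cost += sum(wt(w, x, k) for x in pair)
    return cost

def basecost(w, nodes):
    total = 0
    for i, j, k in combinations(nodes, 3):
        ws = sorted((wt(w, i, j), wt(w, i, k), wt(w, j, k)))
        total += ws[0] + ws[1]
    return total

def all_trees(items):
    """Every rooted binary tree shape over the given leaves: (2n-3)!! of them."""
    if len(items) == 1:
        yield items[0]
        return
    for t in all_trees(items[1:]):
        yield from graft(t, items[0])

def graft(t, x):
    """All ways of attaching a new leaf x along an edge of t or above its root."""
    yield (x, t)
    if isinstance(t, tuple):
        for sub in graft(t[0], x):
            yield (sub, t[1])
        for sub in graft(t[1], x):
            yield (t[0], sub)

nodes = list(range(5))
star = {(0, i): 1 for i in range(1, 5)}    # center 0, four unit-weight spokes
path = {(i, i + 1): 1 for i in range(4)}   # 0 - 1 - 2 - 3 - 4

best_star = min(totalcost(t, star, nodes) for t in all_trees(nodes))
best_path = min(totalcost(t, path, nodes) for t in all_trees(nodes))

assert best_star == basecost(star, nodes)  # star: perfect HC-structure
assert best_path > basecost(path, nodes)   # path: no tree attains the base-cost
```

At this machine-checkable scale the star’s best total-cost equals its base-cost of 6, while the path’s best total-cost strictly exceeds its base-cost of 3.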
Remark: Note that in general, the value of Dasgupta’s cost is affected by the (edge) density of a graph. One may think that we could normalize it by the number of edges in the graph (or by the total edge weight). However, the star and path examples show that this strategy by itself is not sufficient to reveal the cluster structure behind an input graph, as those two graphs have an equal number of edges.
2.3 Some properties
For an unweighted graph, the optimal cost under Dasgupta’s cost function is bounded, but it can be made arbitrarily large for a weighted graph. For our ratio-cost function, it turns out that the optimal cost is always bounded, for both unweighted and weighted graphs. The proof of the following result can be found in Appendix A.4. For an unweighted graph, we can in fact obtain an upper bound asymptotically the same as the one in (ii) below using a much simpler argument (than the one in our current proof). However, the stated upper bound is tight in the sense that it equals 1 for a clique, matching the fact that ρ*(G) = 1 for a clique.
Theorem 1.
(i) Given a similarity graph G whose similarity matrix W is symmetric with non-negative entries, the optimal ratio-cost ρ*(G) admits an upper bound depending only on W; the bound equals 1 for a clique.
(ii) For a connected unweighted graph G with n vertices and m edges, ρ*(G) admits a corresponding upper bound in terms of n and m.
We now show that the bound in (i) above is asymptotically tight. To prove this, we will use a certain family of edge expanders.
Definition 5.
Given ε > 0 and an integer d, a graph G = (V, E) is an edge expander if for every S ⊆ V with |S| ≤ |V|/2, the set of crossing edges from S to V \ S, denoted E(S, V \ S), has cardinality at least ε|S|.
An (undirected) graph is d-regular if all of its nodes have degree d.
Theorem 2.
[3] For a suitable natural number d and all sufficiently large n, there exist d-regular graphs on n vertices which are also edge expanders.
Theorem 3.
A d-regular edge expander graph G attains the upper bound of Theorem 1 (i) asymptotically.
Proof.
The existence of a d-regular edge expander is guaranteed by Theorem 2. Given a cut (S, V \ S), its sparsity is |E(S, V \ S)| / (|S| · |V \ S|). Consider the sparsest cut (S*, V \ S*) for G, the one with minimum sparsity, and assume w.l.o.g. that |S*| ≤ n/2. Its sparsity satisfies:
(1) 
The first inequality above holds as G is an edge expander.
On the other hand, consider an arbitrary bisection cut (B, V \ B) with |B| = n/2. Its sparsity satisfies:
(2) 
Combining (Eqn. 1) and (Eqn. 2), we have that the bisection cut is a constant-factor approximation of the sparsest cut. On the other hand, it is shown in [8] that a recursive approximate sparsest-cut algorithm leads to an HC-tree whose cost approximates cost_{T*}(G), where T* is the optimal tree. Now consider the HC-tree T resulting from first bisecting the input graph via (B, V \ B), and then identifying sparsest cuts within each subtree recursively. By [8], cost_T(G) is within a constant factor of cost_{T*}(G). Furthermore, cost_T(G) is at least the cost induced by those edges split at the top level; that is,
(3) 
where the last inequality holds as G is an edge expander. Hence
with a constant leading factor. By Claim 1, we have that
(4) 
Now call a triplet a wedge (resp. a triangle) if there are exactly two (resp. exactly three) edges among its three nodes. Only wedges and triangles contribute to basecost(G). Thus:
Combining the above with Theorem 1, the claim then follows. ∎
Relation to the “ground-truth input” graphs of [8].
Cohen-Addad et al. introduced what they call “ground-truth input” graphs to describe inputs that admit a “natural” ground-truth cluster tree [8]. A brief review of this concept is given in Appendix A.5. Interestingly, we show that a ground-truth input graph always has a perfect HC-structure. However, the converse is not necessarily true, and the family of graphs with perfect HC-structure is broader while still being natural. The proof of the following theorem can be found in Appendix A.5.
Theorem 4.
Given a similarity graph G, if G is a ground-truth input graph in the sense of [8], then G must have a perfect HC-structure. However, the converse is not true.
Intuitively, a graph has a perfect HC-structure if there exists a tree T such that for every triplet, its most similar pair (the one with the largest similarity) is merged first in T. Such a triplet-order constraint is much weaker than the requirement of the ground-truth graph of [8] (which intuitively is generated by an ultrametric). An example is given in Figure 2. In fact, the following proposition shows that the concept of ground-truth graph is rather stringent for graphs with unit edge weights (i.e., unweighted graphs). In particular, a connected unweighted graph is a ground-truth graph of [8] if and only if it is the complete graph. In contrast, unweighted graphs with perfect HC-structure form a much broader family of graphs. The proof of this proposition is in Appendix A.5.
Proposition 1.
Let G be an unweighted graph (i.e., w_ij = 1 if (i, j) ∈ E and w_ij = 0 otherwise). G is a ground-truth graph if and only if each connected component of G is a clique.
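For unit-weight graphs this criterion is easy to check mechanically; the sketch below (our own helper names, using a small union-find) tests whether every connected component is a clique:

```python
def components(n, edges):
    """Connected components of an n-vertex unit-weight graph (union-find)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in edges:
        parent[find(a)] = find(b)
    comps = {}
    for v in range(n):
        comps.setdefault(find(v), set()).add(v)
    return list(comps.values())

def is_ground_truth_unweighted(n, edges):
    """Per Proposition 1: every connected component must be a clique."""
    es = {frozenset(e) for e in edges}
    return all(frozenset((a, b)) in es
               for comp in components(n, edges)
               for a in comp for b in comp if a < b)
```

For instance, two disjoint triangles pass the test, while a 3-node path fails (its single component misses one pair).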
3 Algorithms
3.1 Hardness and approximation algorithms
By Claim 1, totalcost_T(G) equals cost_T(G) minus a constant (depending only on G). Thus the hardness results for optimizing Dasgupta’s cost also hold for optimizing the total-cost. Hence the following theorem follows from the results of [7] and [9]. The simple proof is in Appendix B.1.
Theorem 5.
(1) It is NP-hard to compute ρ*(G), even when G is an unweighted graph (i.e., all edges have unit weight). (2) Furthermore, under the Small Set Expansion hypothesis, it is NP-hard to approximate ρ*(G) within any constant factor.
We remark that while the hardness results for optimizing Dasgupta’s cost translate into hardness results for the total-cost, it is not immediately evident that an approximation algorithm translates too, as the total-cost differs from Dasgupta’s cost by subtracting a positive quantity. Nevertheless, it turns out that the approximation algorithm of [7] for Dasgupta’s cost also approximates the total-cost within the same asymptotic approximation factor. See Appendix B.2 for details.
3.2 Algorithms for graphs with perfect or near-perfect HC-structure
While in general it remains open how to approximate the ratio-cost (as well as Dasgupta’s cost) within a factor better than the current best one, we now show that we can check whether a graph has a perfect HC-structure, and compute an optimal tree if it does, in polynomial time. We also provide a polynomial-time approximation algorithm for graphs with near-perfect HC-structures (to be defined later).
Remark: One may wonder whether a simple agglomerative (bottom-up) hierarchical clustering algorithm, such as the single-linkage, average-linkage, or complete-linkage algorithm, could recover the perfect HC-structure. For the ground-truth input graphs introduced in [8], it is known that the structure can be recovered via the average-linkage clustering algorithm. However, as the example in Figure 2 shows, this is in general not true for recovering graphs with perfect HC-structures: depending on the strategy, a bottom-up approach may very well merge the wrong pair of nodes first, leading to a non-optimal HC-tree.
Intuitively, a top-down approach is necessary to recover the perfect HC-structure. The high-level framework of our recursive algorithm BuildPerfectTree(G) is given below; it outputs an HC-tree spanning (i.e., with its leaf set being) a subset of the vertices of G. In the algorithm, G_{S1} (resp. G_{S2}) denotes the subgraph of G spanned by the vertices in S1 (resp. S2). We will prove that the output tree spans all vertices of G if and only if G has a perfect HC-structure (in which case the output will also be an optimal tree).
Algorithm BuildPerfectTree(G)

Input: graph G. Output: a binary HC-tree T

Set (S1, S2) = validBipartition(G); If (S1 == ∅ or S2 == ∅) Return(∅)

Set T1 = BuildPerfectTree(G_{S1}); T2 = BuildPerfectTree(G_{S2})

Build tree T with T1 and T2 being the two subtrees of its root. Return(T)

We say that (S1, S2) is a partial bipartition of V if S1, S2 ⊆ V and S1 ∩ S2 = ∅; and (S1, S2) is a bipartition of V (or of G) if additionally S1, S2 ≠ ∅ and S1 ∪ S2 = V.
Definition 6 (Triplet types).
A triplet {i, j, k} with edge weights w_ij, w_ik and w_jk is
type-1: if the largest weight, say w_ij, is strictly larger than the other two; i.e., w_ij > w_ik and w_ij > w_jk;
type-2: if exactly two weights, say w_ij and w_ik, are the largest; i.e., w_ij = w_ik > w_jk;
type-3: otherwise, which means all three weights are equal; i.e., w_ij = w_ik = w_jk.
Definition 7 (Valid partition).
A partition S = {S1, …, Ss} of V (i.e., the sets Si are non-empty, disjoint, and their union is V) is valid w.r.t. W if (i) for any type-1 triplet {i, j, k} with w_ij > w_ik, w_jk, either all three vertices belong to the same set from S, or i and j are in one set from S while k is in another set; and (ii) for any type-2 triplet {i, j, k} with w_ij = w_ik > w_jk, it cannot be that j and k are in the same set of S while i is in another one.
If this partition is a bipartition, then it is also called a valid bipartition.
In what follows, we also refer to each set in the partition as a cluster.
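Definitions 6 and 7 translate directly into code. The sketch below classifies each triplet by sorting its three weights and checks a candidate partition; for intuition it also wires a brute-force validBipartition (exponential in |V|, unlike the efficient procedure developed next) into the recursive skeleton of BuildPerfectTree. All names and the nested-pair tree encoding are illustrative conventions:

```python
from itertools import combinations

def wt(w, a, b):
    return w.get((min(a, b), max(a, b)), 0)

def is_valid(parts, w, verts):
    """Check Definition 7 for a partition given as a list of disjoint sets."""
    def block(x):
        return next(idx for idx, p in enumerate(parts) if x in p)
    for trip in combinations(verts, 3):
        i, j, k = trip
        entries = sorted([(wt(w, i, j), k), (wt(w, i, k), j), (wt(w, j, k), i)])
        (lo, lo_out), (mid, _), (hi, hi_out) = entries
        if hi > mid:                       # type-1: heaviest pair may not split
            a, b = set(trip) - {hi_out}
            if block(a) != block(b):
                return False
        elif mid > lo:                     # type-2: lightest pair may not sit
            a, b = set(trip) - {lo_out}    # together without the shared vertex
            if block(a) == block(b) != block(lo_out):
                return False
        # type-3 (all weights equal): no constraint
    return True

def valid_bipartition(w, verts):
    """Brute force over all bipartitions; exponential, for illustration only."""
    vs = sorted(verts)
    for mask in range(1, 2 ** (len(vs) - 1)):   # last vertex stays in side 2
        s1 = {v for b, v in enumerate(vs) if mask >> b & 1}
        s2 = set(vs) - s1
        if is_valid([s1, s2], w, vs):
            return s1, s2
    return None

def build_perfect_tree(w, verts):
    """Skeleton of BuildPerfectTree: a nested-pair tree, or None on failure."""
    if len(verts) == 1:
        return next(iter(verts))
    bp = valid_bipartition(w, verts)
    if bp is None:
        return None
    subtrees = [build_perfect_tree(w, s) for s in bp]
    return None if None in subtrees else tuple(subtrees)
```

On a 4-node star the recursion returns a spanning tree, matching the perfect HC-structure of stars; on the 5-node path no valid bipartition exists at the top level (every edge sits in a type-1 triplet with a far-away third vertex), so the recursion fails.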
The goal of procedure validBipartition(G) is to compute a valid bipartition if one exists; otherwise, it returns “fail” (more precisely, it returns (∅, ∅)). At a high level, it has two steps. It turns out that (Step 1) follows from the existing literature on the so-called rooted triplet consistency problem. The main technical challenge is to develop (Step 2).
Procedure validBipartition(G)

(Step 1): Compute a certain valid partition S of V, if possible.

(Step 2): Compute a valid bipartition from this valid partition S.
(Step 1): compute a valid partition.
It turns out that the partition procedure within the algorithm BUILD of [2] computes a specific partition with nice properties. The following result can be derived from Lemma 1, Theorem 2, and the proof of Theorem 4 of [2].
Proposition 2 ([2]).
(1) We can check whether there is a valid partition for V in polynomial time. Furthermore, if one exists, then within the same time bound we can compute a valid partition S which is minimal in the following sense: given any other valid partition S′, S is a refinement of S′ (i.e., every cluster of S is contained within some cluster of S′).
(2) This implies the following: let (V1, V2) be any valid bipartition for V, and let S be the partition computed above. Then for any cluster S_a of S, if S_a intersects V1 (resp. V2), then S_a ⊆ V1 (resp. S_a ⊆ V2).
(Step 2): Compute a valid bipartition from S if possible.
Suppose (Step 1) succeeds, and let S = {S1, …, Ss} be the minimal valid partition computed. It is not clear whether a valid bipartition exists (and how to find it) even though a valid partition exists. Our algorithm computes a valid bipartition depending on whether a claw configuration exists or not.
Definition 8 (A claw).
Four points form a claw w.r.t. S if (i) each point is from a different cluster in S; and (ii) their pairwise similarities follow the pattern shown in Figure 3 (a).
(Case 1): Suppose there is a claw w.r.t. S. Fix any such claw. We choose an arbitrary but fixed representative vertex for each cluster of S, and compute the subgraph of G spanned by these representatives. (Recall that we can view G as a complete graph in which a non-edge of G has weight 0.) We say that an edge of this subgraph is light if its weight is strictly below the threshold set by the claw; otherwise, it is heavy. It is easy to see that, by the definition of a claw, the edges among the claw’s points are all light. Now consider the subgraph spanned by only the light edges, and w.l.o.g. let K be a maximum clique in this light-edge subgraph containing the claw’s points. See Figure 3 (b) for this clique, where solid edges are heavy while dashed ones are light. (It turns out that this maximum clique can be computed efficiently, as we show in Appendix B.6.2.)
We then set up a family of candidate bipartitions, one for each choice arising from the clique above. We check the validity of each candidate; if none of them is valid, then procedure validBipartition() returns ‘fail’. Otherwise, it returns the valid one. The correctness of this step is guaranteed by Lemma 1, whose proof can be found in Appendix B.4.
Lemma 1.
Suppose there is a claw w.r.t. S, and let the candidate bipartitions be as described above. Then there exists a valid bipartition for V if and only if one of the candidates is valid.
(Case 2): Suppose there is no claw w.r.t. S. This case is slightly more complicated; we present the lemma below, with a constructive proof in Appendix B.5.
Lemma 2.
If there is no claw w.r.t. S, then we can check whether there is a valid bipartition (and compute one if it exists) in polynomial time.
This finishes the description of procedure validBipartition(). Putting everything together, we conclude with the following theorem, whose proof is in Appendix B.6. We note that a cruder time bound is easy to obtain; however, in Appendix B.6 we show how to modify our algorithm, and provide a much more refined analysis, to improve the running time.
Theorem 6.
Given a similarity graph G with n vertices, algorithm BuildPerfectTree(G) can be implemented to run in polynomial time. It returns a tree spanning all vertices of G if and only if G has a perfect HC-structure, in which case this spanning tree is an optimal HC-tree.
Hence we can check whether a graph has a perfect HC-structure, as well as compute an optimal HC-tree if it does, in polynomial time.
Graphs with almost-perfect HC-structure.
In practice, a graph with a perfect HC-structure could be corrupted by noise. We introduce a concept of graphs with an almost-perfect HC-structure, and present a polynomial-time algorithm to approximate their optimal cost.
Definition 9 (Almost-perfect HC-structure).
A graph G has an almost-perfect HC-structure (with a given perturbation parameter) if there exist weights w̃ such that (i) the graph with weights w̃ has a perfect HC-structure; and (ii) for all i, j, the weight w̃_ij is within the perturbation parameter of w_ij. In this case, we also say that G is a perturbation of the graph with weights w̃.
Note that a graph with a perfect HC-structure is simply a graph with an almost-perfect HC-structure under zero perturbation.
The proof of the following theorem is in Appendix B.7. Note that when the perturbation is zero (i.e., the graph has a perfect HC-structure), this theorem also gives rise to an approximation algorithm for the optimal ratio-cost that is faster than the exact algorithm of Theorem 6.
Theorem 7.
Suppose G is a perturbation of a graph with a perfect HC-structure. Then we have (i) the optimal ratio-cost ρ*(G) is bounded in terms of the perturbation; and (ii) in polynomial time we can compute an HC-tree whose ratio-cost approximates ρ*(G) up to a factor depending on the perturbation.
4 Ratio-cost function for Random Graphs
Definition 10.
Given a symmetric matrix P with each entry P_ij ∈ [0, 1], G = G(P) is a random graph generated from P if each edge (i, j) is present independently with probability P_ij. Each edge in G has unit weight.
The expectation-graph refers to the weighted graph in which edge (i, j) has weight P_ij.
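This generation process is straightforward to instantiate; the sketch below (function names are our own, and the planted-bisection helper assumes the two halves are the first and second half of the index range) samples such a graph and builds the expectation-graph from the same matrix:

```python
import random
from itertools import combinations

def sample_graph(P, rng=random):
    """Random graph from edge-probability matrix P: each pair (i, j), i < j,
    becomes a unit-weight edge independently with probability P[i][j]."""
    n = len(P)
    return {(i, j): 1 for i, j in combinations(range(n), 2)
            if rng.random() < P[i][j]}

def expectation_graph(P):
    """Weighted graph in which edge (i, j) carries weight P[i][j]."""
    n = len(P)
    return {(i, j): P[i][j]
            for i, j in combinations(range(n), 2) if P[i][j] > 0}

def planted_bisection(n, p, q):
    """Edge probabilities for the planted bisection model: p within each half
    of the index range, q across the two halves."""
    half = n // 2
    return [[0 if i == j else (p if (i < half) == (j < half) else q)
             for j in range(n)] for i in range(n)]
```

With p = 1 and q = 0 the sample is deterministic: exactly the two in-half cliques appear.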
In all statements, “with high probability (w.h.p.)” means “with probability at least 1 − n^{−c} for some constant c > 0”. The main result is as follows; the proof is in Appendix C.1.
Theorem 8.
Given an edge probability matrix P whose entries satisfy a mild lower-bound condition, let G be a random graph sampled from P, let T* denote the optimal HC-tree for G (w.r.t. ratio-cost), and let T̂ denote an optimal HC-tree for the expectation-graph. Then we have that, w.h.p.,
where the centering quantity is the expected base-cost of the random graph G.
Hence the optimal cost of a randomly generated graph concentrates around a fixed quantity. However, this quantity is different from the optimal ratio-cost of the expectation-graph; rather, it is sensitive to the specific random graph model sampled from. We give two examples below. An Erdős–Rényi random graph G(n, p) is generated by the edge probability matrix with P_ij = p for all i ≠ j. The probability matrix for the planted bisection model with parameters p > q is such that (1) P_ij = p if i and j lie in the same half of the vertex set; and (2) P_ij = q otherwise. See Appendices C.2 and C.3 for the proofs of these results.
Corollary 1.
For an Erdős–Rényi random graph G(n, p), with p in a suitable range, w.h.p. the ratio-cost of the optimal HC-tree concentrates around an explicit function of p.
Corollary 2.
For a planted bisection model with parameters p > q, w.h.p. the ratio-cost of the optimal HC-tree concentrates around an explicit function of p and q.
Note that the optimal ratio-cost for G(n, p) decreases as p increases. For the planted bisection model, when q is small, the optimal ratio-cost increases as q increases; otherwise, it decreases as q increases.
Note that while the cost of an optimal tree w.r.t. Dasgupta’s cost (and also our total-cost) always concentrates around the corresponding cost for the expectation-graph, as mentioned above, the value of the ratio-cost depends on the sampled random graph itself. For Erdős–Rényi random graphs, a larger p indicates tighter connections among nodes, making the graph more clique-like, and thus the optimal ratio-cost decreases, reaching 1 when p = 1. (In contrast, note that the expectation-graph is a complete graph in which all edges have weight p; thus its optimal ratio-cost is always 1 no matter what p is.)
For the planted bisection model, increasing p also strengthens in-cluster connections, and thus the optimal ratio-cost decreases. Interestingly, increasing q when it is small hurts the clustering structure, because it adds more cross-cluster connections, thereby making the two clusters, formed by the vertices of the first and second half of the index range respectively, less evident; indeed, the optimal ratio-cost increases as q increases while q is small. However, when q is already relatively large (close to p), increasing it further makes the entire graph closer to a clique, and the optimal ratio-cost starts to decrease as q increases. Note that such refined behavior (w.r.t. the probability q) does not hold for the original cost function of Dasgupta.
Acknowledgements.
We would like to thank Sanjoy Dasgupta for very insightful discussions at the early stage of this work, which led to the observation of the base-cost for a graph. The work was initiated during the semester-long program on “Foundations of Machine Learning” at the Simons Institute for the Theory of Computing in spring 2017. This work is partially supported by the National Science Foundation under grants CCF-1740761, DMS-1547357, and RI-1815697.
References
 [1] C. C. Aggarwal and C. K. Reddy. Data Clustering: Algorithms and Applications. Chapman & Hall/CRC, 1st edition, 2013.
 [2] A. V. Aho, Y. Sagiv, T. G. Szymanski, and J. D. Ullman. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM Journal on Computing, 10(3):405–421, 1981.
 [3] S. Arora and B. Barak. Computational Complexity: A Modern Approach. Cambridge University Press, New York, NY, USA, 1st edition, 2009.
 [4] D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’07, pages 1027–1035, Philadelphia, PA, USA, 2007. Society for Industrial and Applied Mathematics.
 [5] M. Balcan and Y. Liang. Clustering under perturbation resilience. SIAM Journal on Computing, 45(1):102–155, 2016.
 [6] J. Byrka, S. Guillemot, and J. Jansson. New results on optimizing rooted triplets consistency. Discrete Applied Mathematics, 158(11):1136–1147, 2010.
 [7] M. Charikar and V. Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. In Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’17, pages 841–854, Philadelphia, PA, USA, 2017. Society for Industrial and Applied Mathematics.
 [8] V. Cohen-Addad, V. Kanade, F. Mallmann-Trenn, and C. Mathieu. Hierarchical clustering: Objective functions and algorithms. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’18, pages 378–397, 2018.
 [9] S. Dasgupta. A cost function for similarity-based hierarchical clustering. In Proceedings of the Forty-eighth Annual ACM Symposium on Theory of Computing, STOC ’16, pages 118–127, New York, NY, USA, 2016. ACM.
 [10] W. F. de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems. In Proceedings of the Thirty-fifth Annual ACM Symposium on Theory of Computing, STOC ’03, pages 50–58, New York, NY, USA, 2003. ACM.
 [11] D. Ghoshdastidar and A. Dukkipati. Consistency of spectral hypergraph partitioning under planted partition model. Ann. Statist., 45(1):289–315, 2017.
 [12] S. Janson. Large deviations for sums of partly dependent random variables. Random Structures & Algorithms, 24:234–248, 2004.
 [13] J. Jansson, J. H. K. Ng, K. Sadakane, and W.-K. Sung. Rooted maximum agreement supertrees. Algorithmica, 43(4):293–307, 2005.
 [14] R. Krauthgamer, J. S. Naor, and R. Schwartz. Partitioning graphs into balanced components. In Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’09, pages 942–949, Philadelphia, PA, USA, 2009. Society for Industrial and Applied Mathematics.
 [15] B. Moseley and J. Wang. Approximation bounds for hierarchical clustering: Average linkage, bisecting k-means, and local search. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3097–3106. Curran Associates, Inc., 2017.
 [16] P. Raghavendra and D. Steurer. Graph expansion and the unique games conjecture. In Proceedings of the Forty-second ACM Symposium on Theory of Computing, STOC ’10, pages 755–764, New York, NY, USA, 2010. ACM.
 [17] K. Rohe, S. Chatterjee, and B. Yu. Spectral clustering and the high-dimensional stochastic blockmodel. Ann. Statist., 39(4):1878–1915, 2011.
 [18] A. Roy and S. Pokutta. Hierarchical clustering via spreading metrics. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2316–2324. Curran Associates, Inc., 2016.
 [19] P. H. A. Sneath and R. R. Sokal. Numerical Taxonomy: The Principles and Practice of Numerical Classification. W. H. Freeman and Co., 1973.
Appendix A Missing details from Section 2
A.1 Proof for Claim 1
We prove the first equality. In particular, fix any pair of leaves and count how many times the corresponding term is added on each side. On the left-hand side, the term is added once for each node for which one of the stated relations holds. The number of such nodes is exactly the number of times the term is added on the right-hand side. Hence the first two terms are equal.
The second equality is straightforward after we plug in the definition.
A.2 Proving that
Let $G$ denote an input graph satisfying this condition. This means that for each triplet of nodes, there can be at most one edge among them, as otherwise the quantity in question would not be zero. Hence the degree of each node in $G$ is at most 1. In other words, the edges of $G$ form a matching of its nodes. Denote the matched nodes so that the two endpoints of each edge are paired together, and let the remaining nodes be the isolated ones. It is easy to verify that the tree shown in Figure 4 achieves the claimed cost.
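The equivalence used above, that every triplet of nodes spanning at most one edge forces maximum degree 1, i.e., the edges form a matching, can be checked on small examples. The sketch below (function names are ours) tests both conditions on an edge set:

```python
from itertools import combinations

def at_most_one_edge_per_triplet(adj):
    """Check that every triplet of nodes spans at most one edge.
    `adj` is a set of frozenset({u, v}) edges."""
    nodes = sorted({v for e in adj for v in e})
    return all(
        sum(1 for e in combinations(t, 2) if frozenset(e) in adj) <= 1
        for t in combinations(nodes, 3)
    )

def is_matching(adj):
    """The edges form a matching iff every node has degree at most 1."""
    deg = {}
    for e in adj:
        for v in e:
            deg[v] = deg.get(v, 0) + 1
    return all(d <= 1 for d in deg.values())
```

A node of degree 2 together with its two neighbors is a triplet spanning two edges, which is why the two predicates agree on graphs with at least three nodes.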
A.3 Examples of ratio-costs of some graphs
Consider the two examples in Figure 5(a). The linked-stars graph consists of two stars, centered at two nodes respectively and connected via an edge between the centers. For the linked-stars graph, assuming that the number of nodes is even, by symmetry we have that:
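The linked-stars construction is easy to build explicitly. The sketch below is our own indexing convention (centers at $0$ and $n/2$; the paper's figure may label nodes differently) and returns the graph as an edge list:

```python
def linked_stars(n):
    """Linked-stars graph on n nodes (n even): two stars centered at
    nodes 0 and n // 2, joined by the edge (0, n // 2)."""
    assert n % 2 == 0 and n >= 4
    c1, c2 = 0, n // 2
    edges = [(c1, i) for i in range(1, c2)]        # leaves of the first star
    edges += [(c2, i) for i in range(c2 + 1, n)]   # leaves of the second star
    edges.append((c1, c2))                         # the linking edge
    return edges
```

The result is a tree on $n$ nodes (hence $n - 1$ edges) in which the two centers each have degree $n/2$ and every leaf has degree 1.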