# Hierarchical Clustering: a 0.585 Revenue Approximation

Hierarchical Clustering trees have been widely accepted as a useful form of clustering data, and have been adopted in fields including phylogenetics, image analysis, bioinformatics and more. Recently, Dasgupta (STOC '16) initiated the analysis of these types of algorithms through the lens of approximation. Later, the dual problem was considered by Moseley and Wang (NIPS '17), who dubbed it the Revenue goal function. In this problem, given a nonnegative weight $w_{ij}$ for each pair $i,j \in [n]=\{1,2,\ldots,n\}$, the objective is to find a tree $T$ whose set of leaves is $[n]$ that maximizes the function $\sum_{i<j \in [n]} w_{ij}(n-|T_{ij}|)$, where $|T_{ij}|$ is the number of leaves in the subtree rooted at the least common ancestor of $i$ and $j$. In our work we consider the revenue goal function and prove the following results. First, we prove the existence of a bisection (i.e., a tree of depth 2 in which the root has two children, each being a parent of $n/2$ leaves) which approximates the general optimal tree solution up to a factor of $1/2$ (which is tight). Second, we apply this result in order to prove a $\frac{2}{3}p$ approximation for the general revenue problem, where $p$ is defined as the approximation ratio of the Max-Uncut Bisection problem. Since $p$ is known to be at least 0.8776 (Wu et al., 2015; Austrin et al., 2016), we get a 0.585 approximation algorithm for the revenue problem. This improves a sequence of earlier results which culminated in a 0.4246-approximation guarantee (Ahmadian et al., 2019).

## 1 Introduction

The notion of Hierarchical Clustering (HC) trees has been introduced and subsequently studied for several decades. The notion was first considered due to its applications to the realm of phylogenetics [Numerical_taxonomy] [A_model_for_taxonomy]. Here, given genomic similarities between species our goal is to hierarchically cluster the species in a way that captures fine-grained relations between different species. Since then, the notion of HC trees has expanded to many additional fields. See [A_Survey_of_Clustering_Data_Mining_Techniques] for a survey on the subject.

Typically, schemes for generating HC trees fall into one of two categories: Agglomerative algorithms (i.e., bottom-up) and Divisive algorithms (i.e., top-down). Agglomerative algorithms initially start with a partition in which each data point forms its own set. The algorithm then proceeds to recursively merge different sets, terminating once all data points are contained in the same set. Notably, the well-studied average linkage algorithm is an example of such a procedure in which in each step two sets maximizing the average induced weight are merged. On the other hand, divisive algorithms start with a single set containing all the data points, and then proceed to recursively split sets, terminating once each data point remains alone in its set.
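As a rough illustration of the agglomerative scheme described above, the following Python sketch implements a naive average-linkage procedure. The nested-tuple tree representation, the `weights` dictionary keyed by sorted pairs, and the function name `average_linkage` are assumptions made for this example and do not come from the paper; ties are broken arbitrarily by iteration order.

```python
from itertools import combinations


def average_linkage(points, weights):
    """Merge clusters bottom-up and return a nested-tuple binary tree.

    `weights` maps a sorted pair (i, j) to a nonnegative similarity;
    missing pairs are treated as weight 0 (an assumption of this sketch).
    """
    # Each cluster is stored as (subtree, list_of_leaves).
    clusters = [(p, [p]) for p in points]

    def average_weight(a, b):
        total = sum(weights.get((min(i, j), max(i, j)), 0) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Pick the pair of clusters with the largest average similarity.
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_weight(clusters[pair[0]][1], clusters[pair[1]][1]),
        )
        merged = ((clusters[x][0], clusters[y][0]), clusters[x][1] + clusters[y][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters[0][0]


example_weights = {(1, 2): 3.0, (2, 3): 2.0, (3, 4): 1.5}
print(average_linkage([1, 2, 3, 4], example_weights))  # ((1, 2), (3, 4))
```

The quadratic scan over cluster pairs keeps the sketch short; it is not meant as an efficient implementation.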

Recently, Dasgupta [a_cost_function_for_similarity-based_hierarchical_clustering] formally defined the notion of a "good" HC tree. With this definition he elegantly bridged the gap between HC trees and the field of approximation algorithms by defining a minimization cost function (see related work for an extended discussion on this goal function). Thereafter, Moseley and Wang [Approximation_Bounds_for_Hierarchical_Clustering:_Average_Linkage] considered the complementary maximization variant of this problem, namely the revenue goal function.

Both problems considered the following model. Assume we are given a set of data points, with some notion of similarity between them. The similarity is formally captured through similarity-edges between any two data points, represented by a weighted graph $G=(V,E,w)$ with $w_e \ge 0$ for every $e \in E$. Alternatively, $G$ may be viewed as a complete graph in which $w_e = 0$ for any $e \notin E$. Our goal is then to construct an HC tree, $T$, such that its leaves are in a 1-1 correspondence with our data points, $V$. Note that given such a tree, every internal node represents a set of data points, which are the leaves of its subtree, and a partition of this set given by the sets of leaves of the subtrees rooted at the children of the node.

Intuitively speaking, since higher weighted edges correspond to more similar data points, it is desirable to split the endpoints of such edges low in a good HC tree. Formally, Moseley and Wang [Approximation_Bounds_for_Hierarchical_Clustering:_Average_Linkage] defined the revenue problem as,

$$\max_T R_G(T) = \max_T \sum_{e=\{i,j\}\in E} w_{ij}\,(n-|T_{ij}|),$$

where $T$ is an HC tree, $T_{ij}$ denotes the subtree rooted at the least-common-ancestor (LCA) of data points $i$ and $j$, and $|T_{ij}|$ denotes the number of data points in $T_{ij}$.
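To make the objective concrete, here is a minimal Python sketch that evaluates the revenue of a given tree. The nested-tuple encoding of trees (a leaf is an `int`) and the helper names `leaves`, `lca_size` and `revenue` are assumptions chosen for illustration only.

```python
def leaves(tree):
    """Return the list of leaves of a nested-tuple tree (a leaf is an int)."""
    if isinstance(tree, int):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def lca_size(tree, i, j):
    """Number of leaves under the least common ancestor of leaves i and j."""
    for child in tree:
        if not isinstance(child, int):
            below = leaves(child)
            if i in below and j in below:
                return lca_size(child, i, j)
    # No single child contains both leaves, so this node is their LCA.
    return len(leaves(tree))


def revenue(tree, weights):
    """Sum of w_ij * (n - |T_ij|) over all weighted pairs."""
    n = len(leaves(tree))
    return sum(w * (n - lca_size(tree, i, j)) for (i, j), w in weights.items())


weights = {(1, 2): 3.0, (3, 4): 1.0, (2, 3): 2.0}
print(revenue(((1, 2), (3, 4)), weights))  # 8.0
```

In the example, the pairs (1, 2) and (3, 4) are separated at nodes with two leaves and each contributes $w_{ij}(4-2)$, while the pair (2, 3) is separated at the root and contributes nothing.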

In [Approximation_Bounds_for_Hierarchical_Clustering:_Average_Linkage], Moseley and Wang considered several algorithms. Notably, they considered the random algorithm, which simply splits the data points randomly at every cut, and the average-linkage algorithm. They showed that both yield an approximation factor of 1/3. Subsequently, Charikar et al. [Hierarchical_Clustering_better_than_Average_Linkage] showed that one can beat average-linkage through the use of semi-definite programming, improving the bound to 0.3364. Recently, Ahmadian et al. [Bisect_and_Conquer:_Hierarchical_Clustering_via_Max-Uncut_Bisection] managed to leverage the Max-Uncut Bisection (MUB) problem in order to prove a 0.4246 approximation. In this paper we improve upon this result and show a 0.585 approximation.

Our contributions. We consider the revenue goal function and prove the following results.

• We show that for any revenue instance, there exists a bisection, $X$, that is, a tree of depth $2$ in which the root has two children, each being the parent of $n/2$ leaves, such that $R(X) \ge \frac{1}{2}OPT$, where $OPT$ denotes the revenue gained by the optimal tree (see Theorem 3.1). In order to show such existence we make use of two random processes: we randomly fix the order of the leaves in the optimal tree in an appropriate way, and then randomly generate our bisection, $X$. We emphasize the fact that even though $OPT$ makes use of an arbitrarily deep tree, it is enough to consider a single cut in order to gain half the revenue.

• Using our result regarding the existence of a large revenue generating bisection, we prove a 0.585 approximation for the revenue problem. We note that in fact we show a $\frac{2}{3}p$ approximation where $p$ is the best known approximation for the MUB problem.

###### Remark.

The algorithm we consider solves the MUB problem for the first cut and then proceeds using the random algorithm. In [Bisect_and_Conquer:_Hierarchical_Clustering_via_Max-Uncut_Bisection], Ahmadian et al. considered this algorithm coupled with the random algorithm: they ran both algorithms and took the larger of the two revenues. They showed that this results in a 0.4246 approximation with respect to the gained revenue. Somewhat surprisingly, we show that the former algorithm (MUB and then random) on its own is enough to yield an approximation of 0.585.

Techniques. Our first result makes use of a new upper bound on the optimal solution, which may be of independent interest. Specifically, given an optimal solution, we embed its leaves on a line such that its root is above the line and we have no resulting crossing edges. This clearly yields an ordering of the leaves (see Figure 1). We then consider the distance between any two data points $i$ and $j$ within this ordering (simply the difference in rank) and observe that this is in fact a lower bound on $|T_{ij}|$. This in turn yields an upper bound on the optimal solution.

Next, we make use of this bound by randomly generating a bisection that gains revenue that is "large" with respect to the bound and by showing that in expectation this bound is "far" from the optimal solution. Both "large" and "far" will be formally defined later on.

Related work. Dasgupta [a_cost_function_for_similarity-based_hierarchical_clustering] kicked off the line of work considering HC trees within the realm of approximation algorithms. In his paper, he considered similarity-edges and defined the cost of an HC tree as,

$$\min_T C_G(T) = \min_T \sum_{e=\{i,j\}\in E} w_{ij}\,|T_{ij}|.$$

Note that the revenue goal function is complementary to Dasgupta's cost function: for every tree $T$ the revenue and the cost sum to $n\sum_{e\in E} w_e$, so the optimal solution is the same for both, albeit with different goal function values.
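The complementarity can be checked numerically. The sketch below, under the assumption of a nested-tuple tree encoding and hypothetical helper names, computes both objectives for a few trees on the same instance and shows that their sum does not depend on the tree.

```python
def leaves(tree):
    """Return the list of leaves of a nested-tuple tree (a leaf is an int)."""
    if isinstance(tree, int):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def lca_size(tree, i, j):
    """Number of leaves under the least common ancestor of leaves i and j."""
    for child in tree:
        if not isinstance(child, int):
            below = leaves(child)
            if i in below and j in below:
                return lca_size(child, i, j)
    return len(leaves(tree))


def revenue_and_cost(tree, weights):
    """Return (revenue, Dasgupta cost) of the tree for the given weights."""
    n = len(leaves(tree))
    sizes = {pair: lca_size(tree, *pair) for pair in weights}
    revenue = sum(w * (n - sizes[pair]) for pair, w in weights.items())
    cost = sum(w * sizes[pair] for pair, w in weights.items())
    return revenue, cost


weights = {(1, 2): 3, (3, 4): 1, (2, 3): 2}
for tree in [((1, 2), (3, 4)), (((1, 2), 3), 4), ((1, 3), (2, 4))]:
    revenue, cost = revenue_and_cost(tree, weights)
    # The last printed value is n * (total weight) = 4 * 6 for every tree.
    print(tree, revenue, cost, revenue + cost)
```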

In [a_cost_function_for_similarity-based_hierarchical_clustering], many general properties pertaining to this goal function were discussed. Notably, it was shown that this goal function is intuitive in that (1) on complete graphs (with no structure) all HC trees yield the same cost, (2) on disconnected graphs, optimal HC trees begin by splitting disconnected components and (3) the goal function is modular. He further presented an approximation algorithm via recursive sparsest cut. Later, both Charikar and Chatziafratis [Approximate_Hierarchical_Clustering_via_Sparsest_Cut_and_Spreading_Metrics] and Cohen-Addad et al. [Hierarchical_Clustering:_Objective_Functions_and_Algorithms] showed that this algorithm is in fact an $O(\sqrt{\log n})$ approximation. In the hardness domain, Dasgupta [a_cost_function_for_similarity-based_hierarchical_clustering] showed that the problem is NP-hard via a reduction from a variant of the NAE-SAT problem. This was later strengthened by Charikar and Chatziafratis [Approximate_Hierarchical_Clustering_via_Sparsest_Cut_and_Spreading_Metrics], who showed that in fact no constant approximation exists (assuming the Small Set Expansion hypothesis). Cohen-Addad et al. [Hierarchical_Clustering_Beyond_the_Worst-Case] managed to overcome the latter worst-case specific result by considering average case inputs defined by a stochastic block model and its hierarchical extension. Here, they managed to establish a $(1+o(1))$ approximation.

Following Dasgupta's work, Cohen-Addad et al. [Hierarchical_Clustering:_Objective_Functions_and_Algorithms] considered the case of dissimilarity-edges. In this case, Dasgupta's cost function is now translated to a maximization problem. Given an HC tree, $T$, we now refer to its gained value as its gained dissimilarity. In this setting, both the random algorithm and the average-linkage algorithm yield dissimilarity values of at least $2/3$ of the optimal solution. Charikar et al. [Hierarchical_Clustering_better_than_Average_Linkage] improved upon this by proving an approximation of 0.6671 which makes use of a more delicate multi-phase algorithm.

Several other extensions to the formerly defined HC goal functions were also considered. One such extension is that of structural constraints. Specifically, every constraint appears in the form of a triplet $ab|c$ for data points $a$, $b$ and $c$. A constraint is then considered satisfied by an HC tree, $T$, if the LCA of $a$ and $b$ in $T$ is a proper descendant of the LCA of $a$ and $c$. Aho et al. [Inferring_a_tree_from_lowest_common_ancestors_with_an_application_to_the_optimization_of_relational_expressions] considered this problem in the phylogenetic realm where this notion gives rise to the problem of constructing a phylogenetic tree that satisfies a set of lineage constraints on common ancestors. This notion has more recently been investigated in the domain of HC, by Chatziafratis et al. [Hierarchical_Clustering_with_Structural_Constraints]. In their paper they extended Dasgupta's goal function to include structural constraints and showed an approximation guarantee that depends on $k$, the number of constraints. For additional works in the realm of HC trees see [Clustering_with_interactive_feedback], [Local_algorithms_for_interactive_clustering], [Interactive_bayesian_hierarchical_clustering], [Hierarchical_Clustering_for_Euclidean_Data].

## 2 Notation and Preliminaries

We introduce several definitions that will aid us throughout the following sections.

###### Definition 1.

Given a revenue instance $G=(V,E,w)$, we denote the optimal revenue tree by $OPT$.

We shall abuse notation and refer both to the optimal revenue tree and to its generated revenue as $OPT$ (it will be clear from context which of the two is meant).

###### Definition 2.

Given an HC tree $T$, we denote by $R(T)$ the revenue it yields and for any similarity edge $e$, we denote by $R_T(e)$ the revenue gained by the edge with respect to the HC tree, $T$.

The following definition will be useful when considering the revenue generated with respect to some similarity edge.

###### Definition 3.

Given a revenue instance $G$ and a similarity edge $e=\{i,j\}$ we denote $T_e = T_{ij}$.

Note that therefore, for any HC tree $T$, $R_T(e) = w_e(n-|T_e|)$, where $|T_e|$ is the number of leaves in the tree $T_e$. Further note that with these definitions we may assume w.l.o.g. that any optimal tree is binary.

Finally, to ease notation we shall refer to the data points of a given revenue instance $G$ as $[n]=\{1,\ldots,n\}$.

## 3 Existence of a High Revenue Bisection

###### Theorem 3.1.

For any revenue instance $G$ and corresponding optimal solution, $OPT$, there exists a bisection, $X$, satisfying

$$R(X) \ge \tfrac{1}{2}\,OPT,$$

and this is asymptotically tight.
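The statement can be sanity-checked by exhaustive search on a tiny instance. The sketch below is only an illustration under stated assumptions: the instance and the helper names are made up for this example, the search over trees is restricted to binary trees (which the preliminaries allow w.l.o.g.), and a bisection is scored as its uncut weight times $n - n/2$.

```python
from itertools import combinations


def pair_weight(weights, a, b):
    """Weight of the pair {a, b}; missing pairs count as 0."""
    return weights.get((min(a, b), max(a, b)), 0)


def best_tree_revenue(points, weights, n):
    """Maximum revenue over all binary trees on `points` (exhaustive search)."""
    if len(points) == 1:
        return 0
    first, rest = points[0], points[1:]
    best = 0
    # Enumerate every split of `points` into two nonempty sides;
    # fixing `first` on the left avoids counting each split twice.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = (first,) + extra
            right = tuple(p for p in rest if p not in extra)
            split_weight = sum(pair_weight(weights, a, b) for a in left for b in right)
            total = (
                split_weight * (n - len(points))
                + best_tree_revenue(left, weights, n)
                + best_tree_revenue(right, weights, n)
            )
            best = max(best, total)
    return best


def best_bisection_revenue(points, weights):
    """Maximum revenue over all bisections (depth-2 trees with equal halves)."""
    n = len(points)
    best = 0
    for left in combinations(points, n // 2):
        uncut = sum(w for (a, b), w in weights.items() if (a in left) == (b in left))
        best = max(best, uncut * (n - n // 2))
    return best


points = (1, 2, 3, 4, 5, 6)
weights = {(1, 2): 5, (2, 3): 1, (3, 4): 4, (4, 5): 2, (5, 6): 3, (1, 6): 1}
opt = best_tree_revenue(points, weights, len(points))
bisection = best_bisection_revenue(points, weights)
print(opt, bisection, 2 * bisection >= opt)
```

Both searches are exponential in the number of points and are meant only for very small inputs.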

Before we prove the theorem, we introduce some notation. To simplify the presentation we assume from now on that the number of leaves is even.

Defining a tree ordering: Given an HC tree $T$, we may fix its leaves in several different orders. Each order is produced by representing $T$ as a planar graph where its leaves are all embedded on a line, the root is above the line and all edges of the tree are straight lines going down from each parent to its children with no crossing edges. Denote each such ordering by the function $\pi$, where $\pi(i)$ is the rank of leaf $i$ on the line.

Henceforth we will consider $\pi$ with respect to the optimal solution and denote $T = OPT$. Recall that we may assume w.l.o.g. that $T$ is a binary tree. Next we define a distribution over all feasible orderings, $\pi$.

###### Definition 4.

Let $P_\pi$ denote the distribution generated by choosing uniformly at random a subset of internal nodes from $T$ and swapping the placements of the left and right subtrees of each of these chosen nodes.

Finally, given an ordering $\pi$, and an edge $e=\{i,j\}$, let $y^\pi_e = |\pi(i)-\pi(j)|$ denote the distance between leaves $i$ and $j$ in the fixed tree, $T$. For a pictorial example, see Figure 1.
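The orderings reachable by swapping children can be enumerated directly for a small tree. The following sketch (nested-tuple trees and all helper names are assumptions for illustration) lists them and confirms that the rank distance of two leaves never exceeds the number of leaves under their LCA minus one, since the leaves of any subtree stay contiguous.

```python
from itertools import combinations


def leaves(tree):
    """Return the list of leaves of a nested-tuple tree (a leaf is an int)."""
    if isinstance(tree, int):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def lca_size(tree, i, j):
    """Number of leaves under the least common ancestor of leaves i and j."""
    for child in tree:
        if not isinstance(child, int):
            below = leaves(child)
            if i in below and j in below:
                return lca_size(child, i, j)
    return len(leaves(tree))


def orderings(tree):
    """Yield every leaf order obtainable by swapping children of internal nodes."""
    if isinstance(tree, int):
        yield [tree]
        return
    left, right = tree
    for a in orderings(left):
        for b in orderings(right):
            yield a + b  # children kept in place
            yield b + a  # children swapped


tree = (((1, 2), 3), (4, (5, 6)))
all_orders = list(orderings(tree))
for order in all_orders:
    rank = {leaf: pos for pos, leaf in enumerate(order)}
    for i, j in combinations(leaves(tree), 2):
        assert abs(rank[i] - rank[j]) <= lca_size(tree, i, j) - 1
print(len(all_orders))  # 32, one ordering per subset of the 5 internal nodes
```

Drawing one element of `all_orders` uniformly at random corresponds to choosing a uniformly random subset of internal nodes to swap.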

In order to prove Theorem 3.1 we first upper bound the revenue gained by $OPT$.

###### Observation 1.

$OPT \le \sum_{e\in E} w_e\,(n-y^\pi_e)$, where both $\pi$ and $y^\pi_e$ are defined with respect to $OPT$.

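In itself, this will not result in a good enough bound on $OPT$. Therefore, we show that on average, $y^\pi_e$ is far from $|T_e|$.

###### Lemma 3.2.

For every edge $e \in E$, $E_{P_\pi}[y^\pi_e] = \frac{|T_e|}{2}$.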

###### Proof.

Let $e=\{i,j\}$. It can be shown through simple induction that for any tree with $m$ leaves, and for any leaf $\ell$, $E_{P_\pi}[\pi(\ell)] = \frac{m+1}{2}$. (This also follows by linearity of expectation from the simple fact that for every two distinct leaves $\ell$ and $\ell'$ the probability that $\pi(\ell') < \pi(\ell)$ is exactly $\frac{1}{2}$.)

Denote by $n_i$ and $n_j$ the number of leaves in the subtrees containing $i$ and $j$, each rooted at a separate child of $i$ and $j$'s least-common-ancestor. By the definition of $P_\pi$, the probability that $i$ (together with the leaves of its subtree) appears before $j$ (and the leaves in its subtree) is exactly $\frac{1}{2}$, resulting in:

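$$E_{P_\pi}[y^\pi_e] = \frac{1}{2}\left(n_i + \frac{n_j+1}{2} - \frac{n_i+1}{2}\right) + \frac{1}{2}\left(n_j + \frac{n_i+1}{2} - \frac{n_j+1}{2}\right) = \frac{n_i+n_j}{2} = \frac{|T_e|}{2}.$$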
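The expectation identity above can be verified exactly on a small tree by averaging over all swap-induced orderings. This is only a sketch: the nested-tuple encoding and the helper names are assumptions, and the uniform distribution over swap subsets is realized by enumerating every ordering once.

```python
from fractions import Fraction
from itertools import combinations


def leaves(tree):
    """Return the list of leaves of a nested-tuple tree (a leaf is an int)."""
    if isinstance(tree, int):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def lca_size(tree, i, j):
    """Number of leaves under the least common ancestor of leaves i and j."""
    for child in tree:
        if not isinstance(child, int):
            below = leaves(child)
            if i in below and j in below:
                return lca_size(child, i, j)
    return len(leaves(tree))


def orderings(tree):
    """Yield every leaf order obtainable by swapping children of internal nodes."""
    if isinstance(tree, int):
        yield [tree]
        return
    left, right = tree
    for a in orderings(left):
        for b in orderings(right):
            yield a + b
            yield b + a


def expected_distance(tree, i, j):
    """Exact average rank distance of leaves i and j over all orderings."""
    orders = list(orderings(tree))
    total = sum(abs(order.index(i) - order.index(j)) for order in orders)
    return Fraction(total, len(orders))


tree = (((1, 2), 3), (4, (5, 6)))
for i, j in combinations(leaves(tree), 2):
    # The average distance equals half the number of leaves under the LCA.
    assert expected_distance(tree, i, j) == Fraction(lca_size(tree, i, j), 2)
print(expected_distance(tree, 1, 3))  # 3/2
```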

Let $Y^\pi = \sum_{e\in E} w_e\, y^\pi_e$. By linearity of expectation we get the following lemma.

###### Lemma 3.3.

There exists an ordering of $T$, $\pi^*$, such that,

$$Y^{\pi^*} \le \sum_{e\in E} w_e \cdot \frac{|T_e|}{2}.$$

###### Proof.

By linearity of expectation and Lemma 3.2,

$$E_{P_\pi}[Y^\pi] = \sum_{e\in E} w_e \cdot E_{P_\pi}[y^\pi_e] = \sum_{e\in E} w_e \cdot \frac{|T_e|}{2}.$$

Therefore, there exists an ordering as needed. ∎

Next we show that there exists a distribution over the set of all bisections with high (to some degree) revenue with respect to the revenue gained by considering $y^\pi_e$ rather than $|T_e|$.

###### Lemma 3.4.

Given any ordering of $T$, $\pi$, there exists a distribution, $P_X$, over all bisections, $X$, such that for any edge $e$,

$$E_{P_X}[R_X(e)] \ge \tfrac{1}{2}\, w_e\,(n-2y^\pi_e).$$

###### Proof.

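Fix $\pi$. We relabel the leaves of $T$ such that $\pi(i)=i$ (this is simply to ease the notation). Next we define a distribution over all bisections, $P_X$. Consider the following random process: choose $t$ uniformly at random from $\{1,\ldots,n/2\}$. Then define $X$ to be the bisection $\left(\{t,\ldots,t+n/2-1\},\ [n]\setminus\{t,\ldots,t+n/2-1\}\right)$.

Now consider some edge, $e=\{i,j\}$. If $y^\pi_e > n/2$, then $n-2y^\pi_e < 0$. $R_X(e)$ is always nonnegative (since it is defined as the revenue gained by $e$) and thus the assertion of the lemma holds in this case.

Otherwise, $y^\pi_e = |i-j| \le n/2$ for $e=\{i,j\}$. In this case, the probability that $i$ and $j$ are cut by the bisection is exactly $\frac{2y^\pi_e}{n}$. Furthermore, since $X$ is a bisection, any uncut edge yields a revenue of $w_e\cdot\frac{n}{2}$. Overall,

$$E_{P_X}[R_X(e)] \ge w_e\left(1-\frac{2y^\pi_e}{n}\right)\frac{n}{2},$$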

completing the proof of the lemma. ∎
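The random process in the proof can be simulated exactly by enumerating the $n/2$ equally likely windows of consecutive leaves. The sketch below assumes the leaves are already relabeled $1,\ldots,n$ in line order; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations


def sliding_bisections(n):
    """Yield one side of each bisection: n/2 consecutive leaves starting at t."""
    half = n // 2
    for t in range(1, half + 1):
        yield set(range(t, t + half))


def cut_probability(n, i, j):
    """Exact probability that leaves i and j land on different sides."""
    cuts = sum((i in window) != (j in window) for window in sliding_bisections(n))
    return Fraction(cuts, n // 2)


def expected_edge_revenue(n, i, j, weight=1):
    """An uncut edge earns weight * n/2; a cut edge earns nothing."""
    return weight * (1 - cut_probability(n, i, j)) * Fraction(n, 2)


n = 8
for i, j in combinations(range(1, n + 1), 2):
    y = j - i  # rank distance of the two leaves
    if y <= n // 2:
        assert cut_probability(n, i, j) == Fraction(2 * y, n)
    assert expected_edge_revenue(n, i, j) >= Fraction(n - 2 * y, 2)
print(cut_probability(8, 1, 2), cut_probability(8, 1, 5))  # 1/4 1
```

For pairs at distance greater than $n/2$ the lower bound is negative, so it holds trivially, which matches the case analysis in the proof.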

Recall that by the definition of $R(X)$ it follows that $R(X) = \sum_{e\in E} R_X(e)$. Therefore, by combining Lemmas 3.3 and 3.4, we may sum over all edges and get the following observation.

###### Observation 2.

Given the ordering, $\pi^*$, guaranteed by Lemma 3.3, we get,

$$E_{P_X}[R(X)] \ge \sum_{e\in E} \tfrac{1}{2}\, w_e\,(n-2y^{\pi^*}_e).$$

We are now ready to prove Theorem 3.1.

###### Proof of Theorem 3.1.

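Fix an ordering of $T$ to be $\pi^*$ as guaranteed by Lemma 3.3. By Observation 2, there exists a bisection $X^*$ satisfying

$$R(X^*) \ge E_{P_X}[R(X)] \ge \sum_{e\in E} \tfrac{1}{2}\, w_e\,(n-2y^{\pi^*}_e).$$

Thus, by Lemma 3.3,

$$\begin{aligned}
R(X^*) &\ge \sum_{e\in E} \tfrac{1}{2}\, w_e\,(n-2y^{\pi^*}_e) \\
&= \sum_{e\in E} \tfrac{1}{2}\, w_e\, n - Y^{\pi^*} \\
&\ge \sum_{e\in E} \tfrac{1}{2}\, w_e\, n - \sum_{e\in E} w_e \cdot \frac{|T_e|}{2} \\
&= \tfrac{1}{2} \sum_{e\in E} w_e\,(n-|T_e|) \\
&= \tfrac{1}{2}\, OPT.
\end{aligned}$$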

To show that the result is asymptotically tight, consider the instance on $n$ vertices which is a matching of $n/2$ edges, each having weight $1$. In this case, one solution is a binary tree in which if $\{i,j\}\in E$ then the LCA of $i$ and $j$ (as defined by the solution) is $i$ and $j$'s immediate parent. This solution generates revenue $n-2$ for every edge of the matching, guaranteeing that $OPT \ge \frac{n}{2}(n-2)$.

On the other hand, consider an arbitrary tree of depth 2 such that its first cut is a bisection and denote it by $T$. Let $W_L$ and $W_R$ denote the total weight of edges in each set generated by the cut. Therefore, $W_L + W_R \le \frac{n}{2}$. The revenue generated from $T$ is exactly $(W_L+W_R)\frac{n}{2} \le \frac{n^2}{4}$. Overall,

$$R(T) \le \left(\tfrac{1}{2}+o(1)\right) OPT.$$
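The arithmetic behind the tightness example is easy to tabulate. The sketch below assumes a perfect matching with unit weights on $n$ vertices and compares the lower bound on $OPT$ with the upper bound on the revenue of any depth-2 tree whose first cut is a bisection; the function name is hypothetical.

```python
from fractions import Fraction


def matching_bounds(n):
    """Return (lower bound on OPT, upper bound on any bisection, their ratio)."""
    opt_lower = Fraction(n, 2) * (n - 2)                # n/2 edges, each earning n - 2
    bisection_upper = Fraction(n, 2) * Fraction(n, 2)   # at most n/2 uncut edges, each earning n/2
    return opt_lower, bisection_upper, bisection_upper / opt_lower


for n in (10, 100, 1000):
    _, _, ratio = matching_bounds(n)
    print(n, ratio, float(ratio))
```

The ratio equals $\frac{n}{2(n-2)}$, which approaches $\frac{1}{2}$ from above as $n$ grows.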

###### Remark.

Note that Theorem 3.1 may be derandomized in the sense that given any binary HC tree we may produce a bisection with at least half its revenue deterministically and efficiently. Indeed, this can be done using the method of conditional expectations. We omit the (simple) details, since although this implies that given an optimal HC tree one can obtain a bisection with revenue at least $OPT/2$ deterministically and efficiently, this does not yield any new algorithmic consequence when an optimal HC tree is not given.

## 4 A 0.585 Approximation for the Revenue Goal Function

We define the approximation algorithm as a two-step process: first we cut all data points using some black box algorithm that produces a bisection. Thereafter, we continue splitting each cluster randomly. Denote the combined algorithm by ALG.

In order to show an approximation bound for ALG, we first need to consider the MUB problem. In this problem we are given a weighted graph and our goal is to create a bisection maximizing the total weight of uncut edges. We note that if we restrict ourselves to revenue trees which are bisections (i.e., the first cut splits the data points into two equal sets, then each set is cut only once using a "star" subtree, see Figure 2), the MUB optimal solution and the revenue optimal solution are in fact the same.
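The two-step structure of ALG can be sketched as follows. This is not the algorithm analyzed in the paper in full: the black-box bisection step is replaced here by an exhaustive Max-Uncut Bisection search that is only feasible for tiny inputs, and the random step is assumed to place each point on a uniformly random side, retrying when one side comes out empty. All function names are hypothetical.

```python
import random
from itertools import combinations


def leaves(tree):
    """Return the list of leaves of a nested-tuple tree (a leaf is an int)."""
    if isinstance(tree, int):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def max_uncut_bisection(points, weights):
    """Exhaustive stand-in for the black-box bisection step (tiny inputs only)."""
    points = sorted(points)

    def uncut_weight(left):
        # An edge is uncut when both endpoints fall on the same side.
        return sum(w for (i, j), w in weights.items() if (i in left) == (j in left))

    left = max((set(c) for c in combinations(points, len(points) // 2)), key=uncut_weight)
    return sorted(left), [p for p in points if p not in left]


def random_tree(points, rng):
    """Recursively split the points at random until every cluster is a single leaf."""
    points = list(points)
    if len(points) == 1:
        return points[0]
    while True:
        left = [p for p in points if rng.random() < 0.5]
        right = [p for p in points if p not in left]
        if left and right:  # retry if one side is empty
            break
    return (random_tree(left, rng), random_tree(right, rng))


def alg(points, weights, seed=0):
    """First cut: a bisection with maximum uncut weight; afterwards: random splits."""
    rng = random.Random(seed)
    left, right = max_uncut_bisection(points, weights)
    return (random_tree(left, rng), random_tree(right, rng))


weights = {(1, 2): 5, (2, 3): 1, (3, 4): 4, (4, 5): 2, (5, 6): 3, (1, 6): 1}
print(alg(range(1, 7), weights))
```

In practice the exhaustive search would be replaced by an approximation algorithm for Max-Uncut Bisection, which is what the analysis below assumes.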

Let $p$ denote the approximation ratio that the black-box algorithm guarantees for the MUB problem. Note that $p$ can be taken to be at least $0.8776$ (see [Better_Balance_by_Being_Biased_A_08776_Approximation_for_Max_Bisection], [An_improved_semidefinite_programming_hierarchies_rounding_approximation_algorithm_for_maximum_graph_bisection_problems]). Thus, by choosing the black box to be such an algorithm, the following theorem shows that algorithm ALG is a 0.585 approximation for the revenue problem.

###### Theorem 4.1.

Algorithm ALG guarantees an approximation ratio of $\frac{2}{3}p$ for the revenue problem, where $p$ is defined as the approximation ratio for the MUB problem.

###### Proof.

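Let $T_{ALG}$ denote the tree generated by algorithm ALG. It is a known simple fact that, on an instance with $n$ data points, the random algorithm generates in expectation a revenue of about $\frac{1}{3}\, n\, w_e$ for any edge $e$ (precisely, $\frac{1}{3}(n-2)\,w_e$). Therefore, if we denote by $W_L$ and $W_R$ the weight of the uncut edges generated by the first cut of our algorithm, then,

$$R(T_{ALG}) \ge W_L\left(\frac{n}{2}+\frac{1}{3}\cdot\frac{n}{2}\right) + W_R\left(\frac{n}{2}+\frac{1}{3}\cdot\frac{n}{2}\right) = (W_L+W_R)\,\frac{2n}{3}.$$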

Denote by $W_L^*$ and $W_R^*$ the weights of the uncut edges generated by the optimal MUB solution and let $p$ denote our first cut's approximation with respect to the MUB problem. Furthermore, let $X^*$ denote the optimal solution to the revenue problem restricted to bisections. As noted earlier, $X^*$ corresponds to the optimal MUB solution and $R(X^*) = (W_L^*+W_R^*)\frac{n}{2}$. Therefore,

$$W_L+W_R \ge p\,(W_L^*+W_R^*) = \frac{2p}{n}\,R(X^*).$$

Let $OPT$ denote the revenue gained by the optimal solution. Thus, leveraging Theorem 3.1 yields,

$$R(T_{ALG}) \ge \frac{2n}{3}\,(W_L+W_R) \ge \frac{4p}{3}\,R(X^*) \ge \frac{2p}{3}\,OPT.$$

Since $p$ is known to be at least $0.8776$, we get that ALG is a $0.585$ approximation algorithm. ∎
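The final constant follows from plugging the quoted bound on $p$ into $\frac{2}{3}p$. A small exact computation, with variable names chosen only for this example:

```python
from fractions import Fraction

p = Fraction(8776, 10000)         # lower bound on the MUB approximation ratio quoted above
ratio = Fraction(2, 3) * p        # guarantee of ALG from Theorem 4.1
previous = Fraction(4246, 10000)  # earlier best guarantee cited in the introduction

print(float(ratio))               # roughly 0.58507
print(ratio > Fraction(585, 1000), ratio > previous)
```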