Hierarchical clustering is a popular data analysis method, with various applications in data mining (Berkhin, 2006), phylogeny (Eisen et al., 1998), and even finance (Tumminello et al., 2010). In practice, simple agglomerative procedures like average-linkage, single-linkage and complete-linkage are often used for this task (See the book by Manning, Raghavan, and Schütze (2008) for a comprehensive discussion). While the study of hierarchical clustering has focused on algorithms, there was a lack of objective functions to measure the quality of the output and compare the performance of algorithms. To remedy this, Dasgupta (2016) recently introduced and studied an interesting objective function for hierarchical clustering for similarity measures.
The input to the above problem is a set of points with a similarity measure between and . Given a hierarchical clustering represented by a tree whose leaves correspond to points in , Dasgupta’s objective is defined as where is the subtree of rooted in the least common ancestor of and in . Dasgupta showed that solutions obtained from minimizing this objective has several desirable properties, which prompted a line of work on objective driven algorithms for hierarchical clustering, resulting in new algorithms as well as shedding light on the performance of classical methods (Roy and Pokutta, 2016; Charikar and Chatziafratis, 2017; Cohen-Addad et al., 2018).
Recently, this viewpoint has been applied to understand the performance of average linkage agglomerative clustering, one of the most popular algorithms used in practice. Moseley and Wang (2017) introduced a new objective, in some sense dual to the objective introduced by Dasgupta:
They showed that Average-Linkage obtains a -approximation for maximizing this objective function.
It turns out that a random solution also achieves a -approximation ratio for this problem. Recently Charikar, Chatziafratis, and Niazadeh (2019) showed that in the worst case, the approximation ratio achieved by Average-Linkage is no better than for maximizing . They also gave an SDP based algorithm that achieves a ()-approximation for the problem, for small constant .
One drawback of the prior work on these hierarchical clustering objectives is the fact that they all consider arbitrary similarity scores (specified as an matrix); however, there is much more structure to such similarity scores in practice. In this paper, we initiate the study of the commonly encountered case of Euclidean data, where the similarity score is computed by applying a monotone decreasing function to the Euclidean distance between and . Roughly speaking, we show how to exploit this structure to design improved approximation/faster algorithms for hierarchical clustering, and how to re-analyze algorithms commonly used in practice. In this paper, we focus on the problem of maximizing , introduced by Moseley and Wang (2017).
Arguably the most common distance-based similarity measure used for Euclidean data is the Gaussian kernel. Here we use the spherical version . The parameter , referred to as the bandwidth, plays an important role for applications and a large body of literature exists on selection of this parameter (Zelnik-Manor and Perona, 2005).
As discussed above, devising simple practical algorithms that improve on the -approximation for general similarity measures appears to be a major challenge as observed in Moseley and Wang (2017); Charikar et al. (2019). In this paper, we show that this approximation can be improved for Euclidean data, through fast algorithms that can be scaled to very large datasets. While it might seem that the case of Euclidean data with the Gaussian kernel is a very restricted class of inputs to the HC problem, we show that for suitably high dimensions and suitable small choice of bandwidth it can simulate arbitrary similarity scores (scaled appropriately). Thus any improvements to approximation guarantees for the Euclidean case (that do not apply to general similar scores) must necessarily involve assumptions about the dimension of the data (not very high) or on (not pathologically small). Such assumptions on are consistent with common methods for computing the bandwidth parameter based on data (e.g., (Zelnik-Manor and Perona, 2005)).
We start with the simplest case of 1-dimensional Euclidean data. Even in this seemingly simple setting, obtaining an efficient algorithm that produces an exactly optimum solution seems non-trivial, motivating the study of approximation algorithms. In Section 3 we prove that two algorithms – Random Cut (RC) and Average-Linkage (AL) – obtain -approximation of the optimal solution, i.e., tree maximizing (AL achieves this deterministically, while RC only in expectation). This beats the best known approximation for the general case, which is (Charikar et al., 2019). Here RC is substantially faster than AL, with a running time of vs. .
We next consider the high-dimensional case with the Gaussian Kernel and show that Average-Linkage cannot beat the factor even in poly-logarithmic dimensions. We propose the Projected Random Cut (PRC) algorithm that gets a constant improvement over , irrespective of the dimension (the improvement is a function of the ratio of the diameter to , and drops as this ratio gets large).
Furthermore, PRC runs in time while Average-Linkage runs in time. Even single-linkage runs in almost-linear time only for constant and has exponential dependence on the dimension (see e.g., Yaroslavtsev and Vadapalli (2018)) and it is open whether it can be scaled to large when is large.
Many existing algorithms have time efficiency shortcomings, and none of them can be used for really large datasets. On the contrary, our Projected Random Cut (see Section 4) is a fast HC algorithm that scales to the largest ML datasets. The running time scales almost linearly and the algorithm can be implemented in one pass without needing to store similarities, so the memory is . We also evaluate its quality on a small dataset (Zoo).
Further related work.
HC has been extensively studied across different domains (we refer the reader to Berkhin (2006) for a survey). One killer application in biology is in phylogenetics Sneath and Sokal (1962); Jardine and Sibson (1968) and actually popular algorithms, like Average-Linkage are originating in this field. HC is also widely used to perform community detection in social networks (Leskovec et al., 2014; Mann et al., 2008), is used in bioinformatics (Diez et al., 2015) and even in finance (Tumminello et al., 2010). Other important applications include image and text classification (Steinbach et al., 2000b).
After a formal optimization HC objective was introduced, many works studied HC from an approximation algorithms perspective. The original top-down Sparsest Cut algorithm by Dasgupta, was later shown that it achieves an approximation by Roy and Pokutta (2016); this was finally improved to by Charikar and Chatziafratis (2017); Cohen-Addad et al. (2018). Subsequent work on beyond-worst-case analysis (Cohen-Addad et al., 2017) proposes a hierarchical stochastic block model under which an older spectral algorithm of McSherry (2001) combined with linkage steps gets a -approximation. Finally, alongside the objective of scaling HC to large graphs, the work of Bateni et al. (2017) presents two MapReduce implementations with a focus on minimum-spanning-tree based clusterings that scales to graphs with trillions of edges.
Here is a list of open questions that we remain unanswered in this paper: (1) Can one get an improvement over for the problem of maximizing as a function of for small ? (2) Can projection on low-dimensional subspaces be used to improve the approximation ratio for the high-dimensional case further? (3) Does Average-Linkage achieve a -approximation for 1-dimensional data?
We consider data sets represented as sets of -dimensional feature vectors. Suppose these vectors are . We focus on similarity measures between pairs of data points, denoted by , where the similarities only depend on the underlying vectors, i.e., for some function and furthermore are fully determined by monotone functions of distances between them.
Definition 2.1 (Distance-based similarity measure).
A similarity measure is “distance-based” if for some function , and is “monotone distance-based” if furthermore is a monotone non-increasing function.
As an example of the monotone similarity measure it is natural to consider the Gaussian kernel similarity, i.e.,
where is a normalization factor determining the bandwidth of the Gaussian kernel (Gärtner, 2003). For simplicity, we ignore the multiplicative factor (unless noted otherwise), as our focus is on multiplicative approximations and scaling has no effects.
Linkage-based hierarchical clustering.
Among various algorithms popular in practice for HC, we focus on two very popular ones: single-linkage and average-linkage, which are two simple agglomerative clustering algorithms that recursively merge pairs of nodes (or super nodes) to create the final binary tree. Single-linkage picks the pair of super nodes with the maximum similarity weight between their data points, i.e., merges and maximizing . On the contrary, average-linkage picks the pair of super-nodes with maximum average similarity weight at each step, i.e., merges and maximizing , where . Average-linkage has an approximation ratio of for maximizing the objective function (Moseley and Wang, 2017), and this factor is tight (Charikar et al., 2019).
General upper bounds.
In order to analyze the linkage-based clustering algorithms and our proposed algorithms, we propose a natural upper bound on the value of the objective function. The idea is to decompose the objective function into contributions of triple of vertices:
where denotes the event that is separated first among the vertices of triple in tree . Note that the final tree scores only one of the similarity weights between the triple . Given this observation, we define the following benchmark:
Clearly, for all trees , .
Consider the special case of 1D data points (with any monotone distance-based similarity measure as in Definition 2.1), and suppose . Now, for any triple , as a simple observation we have . Hence we can modify the above benchmark to obtain two refined new benchmarks:
Again, clearly for all trees we have:
3 Hierarchical Clustering in 1D
In this section we look at the extreme case where the feature vectors have , and we try to analyze the popular/natural algorithms existing in this domain by evaluating how well they approximate the objective function . We focus on average-linkage and Random Cut (will be formally defined later). In particular, Random Cut
is a building block of our algorithm for high-dimensional data given in Section4.
We show (1) average-linkage gives a -approximation, and can obtain no better than fraction of the optimal objective value in the worst-case. We then show (2) Random Cut also is a -approximation (in expectation) and this factor is tight. In the Appendix A, we discuss other simple algorithms: single-linkage and greedy cutting (will be defined formally later). We start by the observation that greedy cutting and single-linkage output the same tree (and so are equivalent). We finish by showing (3) there is an instance where single-linkage attains only of the optimum objective value. We further show on this instance both average-linkage and random cutting are almost optimal.
For the rest of this section, suppose we have 1D points , where for some monotone non-increasing function .
In order to simplify the notation, we define to be for any two sets and . We first prove a simple structural property of average-linkage in 1D.
For , under any monotone distance-based similarity measure , average-linkage can always merge neighbouring clusters.
We do the proof by induction. This property holds at the first step of average-linkage (base of induction). Now suppose up to a particular step of average-linkage, the algorithm has always merged neighbours. At this step, we have an ordered collection of super nodes , where for every and for all , . If at this step average-linkage doesn’t merge two neighbouring super nodes, then there exists , where
But note that because of the monotone non-increasing distance-based similarity weights, for any triple we have , This is a contradiction to the above inequality, which finishes the inductive step. ∎
For , under any monotone distance-based similarity measure , average-linkage obtains at least of the 1D-SUM-upper, and hence is a -approximation for the objective .
The proof uses a potential function argument. Given a partitioning of points into sets , a triple of points is called separated if no pair of these three points belong to the same set. Now, the potential function gets as input, and maps it to a summation over all separated triples by as below:
Note that , and .
We now run average-linkage. Based on the definition of , every time that average-linkage merges two super nodes and it scores , and sum over of all these per-step scores is equal to its final objective value. Let Score-AL denotes the variable that stores of the score of average-linkage over time. At every step we keep track of (1) the change in the potential function, denoted by and (2) how much progress average-linkage had towards the final objective value, denoted by . In order to prove -approximation, it suffices to show that at every step of average-linkage we have:
To see this note that average-linkage starts with all points separated, and ends with one cluster/super node with all the points. Therefore, by summing eq. over all merging steps of average-linkage and canceling the terms in the telescopic sum, we have:
Plugging values of at the start and the end, we get:
which implies the -approximation factor.
To prove eq. , we focus on a single step of average-linkage where by Lemma 3.1 some two neighboring clusters denoted as and get merged. Let denote the nodes on the left of and let denote the nodes on the right of (Figure 1).111Note that may be empty sets. By merging two clusters , the change in the score of average-linkage is:
Moreover, any separated triple such that either or will not be separated anymore after this merge. For each such triple, the potential function drops by . Therefore:
To compare the two, we show that , and hence:
which implies eq. as desired. To prove the last claim, note that average-linkage picks the pair over both and . Therefore, by definition of average-linkage:
By summing above inequalities, we get , which finishes the proof. ∎
As a final note, in Section 5 we discuss hard instances for average-linkage under Gaussian kernels when , where we essentially show the result of Theorem 3.2 is tight when comparing against 1D-SUM-upper, and there is no hope to get an approximation factor better than for average-linkage in general for .
The following algorithm (termed as Random Cut) picks a uniformly random point in the range and divides the set of points into left and right using as the splitter. The same process is applied recursively until the leaves are reached.
For under any monotone distance-based similarity measure the algorithm Random Cut obtains at least fraction of the 1D-MAX-upper, and hence gives a -approximation for the objective in expectation.
For every triple conditioned on partitioning the interval for the first time the longer edge amongst and
gets cut with probabilityand the shorter with probability so that . W.l.o.g let’s assume that is the longer edge. Then the algorithm Random Cut scores in expectation for the triple. Note that:
By the linearity of expectation taking the sum over all triples Random Cut scores at least of 1D-MAX-upper in expectation and hence gives -approximation for the objective . ∎
4 Hierarchical Clustering in High Dimensions
We now describe an algorithm Projected Random Cut which we use for high-dimensional data. This algorithm is given as Algorithm 2. It first projects on a random spherical Gaussian vector and then clusters the resulting projections using Random Cut.
For any input set of vectors the algorithm Projected Random Cut gives an -approximation (in expectation) for the objective under the Gaussian kernel similarity measure where for .
Recall an upper bound on the optimum:
Fix any triple where . Note that the objective value achieved by the algorithm Projected Random Cut can also be expressed as where is the contribution to the objective from the triple defined as follows. Consider the tree constructed by the algorithm. If is the first vector in the triple to be separated from the other two in the hierarchical partition (starting from the root) then is defined to be (in the other two cases when or are separated first the definition is analogous). Note that since Projected Random Cut is a randomized algorithm
is a random variable. By the linearity of expectation we have. Thus in order to complete the proof it suffices to show that for every it holds that:
Fix any triple which forms a triangle in . Conditioned on cutting this triangle for the first time let be the vector of probabilities corresponding to the events that the corresponding edge is not cut. I.e. this is the probability that we score the contribution of this edge in the objective. Note that .
Consider any triangle whose vertices are . To simplify presentation we set . We can assume that . Let , and so that . See Figure 2. Note that the probability that the -th longest side of the triangle has the longest projection is then .
If is the shortest edge in the triangle then it holds that
Suppose that forms an angle with , i.e. the vector orthogonal to forms angle with . We define three auxiliary points as follows (see also Figure 2.). Let be a line parallel to going through , let to be a line parallel to going through and let be the line parallel to going through . We then let be the intersection of and , be the intersection of and and be the intersection of and (see Fig 2).
Thus the projections of , and on are , and . Hence conditioned on having the longest projection the probability of scoring the contribution of is given as since we are applying the Random Cut algorithm after the projection. Note that by Thales’s theorem we have . Applying the law of sines to the triangles and we have:
where we used the fact that . Similarly, we use the fact that . Using the above we can express as:
Thus the overall probability of scoring the contribution of the edge conditioned on having the longest projection which we denote as is given as:
Similarly, consider the probability of scoring the contribution of conditioned on having the longest projection. We can express it as:
Below we will show that . In fact, we will show that for any fixed it holds that . Comparing the expressions for both it suffices to show that for all . Since this is equivalent to:
It suffices to show that:
Using the formula it suffices to show that:
The above inequality follows for all since . This shows that . Since the probability that has the longest projection is we have that the probability of scoring and having the longest projection is at least . An analogous argument shows that the probability of scoring and having the longest projection is at least .
Putting things together:
where we used that since . ∎
We are now ready to complete the proof of Theorem 4.1. Let and note that since . If then the desired guarantee follows immediately. Otherwise, if then we have:
where we used the fact that
4.1 Gaussian Kernels with small
Theorem 4.1 only provides an improved approximation guarantee for Projected Random Cut compared to the factor (i.e., the tight approximation guarantees of average-linkage in high dimensions; see Section 5) if is not too small, where . In particular, we get constant improvement if . Is this a reasonable assumption? Interestingly, we answer this question in the affirmative by showing that if we have , then the Gaussian kernel can encode arbitrary similarity weights (up to scaling, which has no effects on multiplicative approximations). For simplicity, we only prove this result for weights here, while it can be generalized to arbitrary weights.
Given any undirected graph on nodes and , there exist unit vectors in and bandwidth parameter , such that , , and for some we have:
Our proof is constructive. Let . Pick orthogonal vectors in such that , where is the degree of node in graph . For each , define as follows:
Note that As the next step, pick a set of orthonormal vectors in the null space of . Finally, for each , define the final vector as follows:
First, note that these vectors have unit length:
Now, pick any two vertices and . If , then:
where we used the fact that , and the fact that for every edge incident to and every edge incident to , as when . Similarly, when :
Now, consider a Gaussian kernel with bandwidth and vectors . Since from the above calculations of the inner products it follows that:
Thus the ratio is , as desired. ∎
5 Hard Instances for Average-Linkage under Gaussian Kernel Similarity
There exists a set of vectors for for which the average-linkage clustering algorithm achieves an approximation at most for under the Gaussian kernel similarity measure.
Proof of Theorem 5.
We start by this theorem:
Theorem 5.2 (Charikar et al. (2019)).
For any constant the instance average-linkage clustering achieves the value of at most while the optimum is at least .
Let be real-valued parameters to be chosen later. We use indices and to index our set of vectors. For , let
where and is the -th standard unit vector with the in the -th entry. Then it is easy to see that for any fixed and it holds that:
Similarly, for any fixed and it holds that:
Otherwise if and then: