 # Guarantees for Hierarchical Clustering by the Sublevel Set method

Meila (2018) introduces an optimization-based method, called the Sublevel Set method, to guarantee that a clustering is nearly optimal and "approximately correct" without relying on any assumptions about the distribution that generated the data. This paper extends the Sublevel Set method to the cost-based hierarchical clustering paradigm proposed by Dasgupta (2016).


## 1 Introduction

Compared to simple clustering, i.e. partitioning data into clusters, hierarchical clustering is much more complex and much less understood. One of the few seminal advances in hierarchical clustering is the introduction by Dasgupta (2016) of a general yet simple paradigm of hierarchical clustering as loss minimization. This paradigm was expanded by Charikar and Chatziafratis (2016) and Roy and Pokutta (2016). The latter work also introduces a new set of techniques for obtaining hierarchical clusterings by showing that optimizing the loss can be relaxed to a Linear Program (LP).

This paper introduces the first method to obtain optimality guarantees in the context of hierarchical clustering. Specifically, it is shown that the Sublevel Set (SS) paradigm invented by Meila (2018) for simple, non-hierarchical clustering can be extended to hierarchical clustering as well. The main contribution is to show that there is a natural distance between hierarchical clusterings whose properties can be exploited in the setting of the SS problem we will present in Section 3.

The Sublevel Set method produces stability theorems of the following form.

###### Theorem 1 (Informal Stability Theorem)

If a clustering has low enough loss for a data set, then, subject to some conditions verifiable from the data, any other clustering whose loss is lower than or equal to that of the first cannot differ from it by more than a given bound.

When a stability theorem holds in practice, it means that the clustering is not just a “good” clustering; it must be the only good clustering supported by the data, up to small variations. This property is called stability. Even though a Stability Theorem does not guarantee optimality, it implies that the optimal clustering of the data lies within the stated distance of the given clustering. The value that bounds the amount of variation in the above theorem defines the radius of a ball around the given clustering that contains all the good clusterings, including the optimal one. This ball is called an optimality interval (OI) and, with a slight abuse of terminology, we will also refer to its radius as an OI.

The main result of this paper is Theorem 2 in Section 3, which gives an OI for hierarchical clustering in the paradigm of Dasgupta (2016), along with a simple algorithm for calculating the OI, based on the LP relaxation of Roy and Pokutta (2016). The next section formally defines the distance in the space of hierarchical clusterings in which the OI is measured.

## 2 Preliminaries and a distance between hierarchical clusterings

### A loss function for hierarchical clustering

Let $n$ be the number of points to be clustered, and $X$ be a hierarchical clustering, or tree for short, whose leaves are the $n$ nodes. All trees have $n$ levels, and between one level and the level below, a single cluster is split into two non-empty sets, so the number of clusters changes by one from each level to the next. A tree can be represented as a set of matrices with entries $x^t_{ij}$, one matrix per level $t$. The variable $x^t_{ij}=1$ if nodes $i,j$ are separated at level $t$ of $X$, and 0 otherwise. The levels of the tree are numbered from the bottom up, with 0 the level of all leaves ($x^0_{ij}=1$ implicitly), and the highest split at level $n-1$. Each level matrix is symmetric with 0 on the diagonal. Note also that $\sum_t x^t_{ij}$ is determined by the path length from $i$ or $j$ to their lowest common ancestor.

Denote by $S$ a symmetric matrix of similarities, such that $S_{ij}$ is the cost of not having $i,j$ together at any level in the clustering. The cost of a hierarchical clustering is the sum of the costs $S_{ij}x^t_{ij}$ for each pair of nodes $i,j$, and each level $t$. It is assumed that $S_{ii}=0$ to simplify the algebraic expressions. This cost was introduced by Dasgupta (2016), who showed that it has many interesting properties. With the notation $X$ for a hierarchical clustering, the cost can be rewritten as shown by Roy and Pokutta (2016):

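$$
\mathrm{Loss}(S,X)=\sum_{i,j=1}^{n}\sum_{t=1}^{n-1}S_{ij}x^t_{ij}+\sum_{i,j=1}^{n}S_{ij}=\sum_{i,j=1}^{n}S_{ij}\left(\sum_{t=1}^{n-1}x^t_{ij}\right)+\sum_{i,j=1}^{n}S_{ij}. \tag{1}
$$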

Note that the second term is a constant independent of the structure of $X$. In Roy and Pokutta (2016) it is shown that minimizing $\mathrm{Loss}(S,X)$ over hierarchical clusterings can be formulated as an Integer Linear Program, which can be relaxed as usual to a Linear Program (LP).
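As a minimal sketch of how the loss in (1) can be evaluated numerically, the snippet below assumes the tree is stored as a NumPy array of shape `(n-1, n, n)` whose slice `X[t-1]` holds the level-`t` entries. The function name `hierarchical_loss`, the array layout, and the toy inputs `S_demo` and `X_demo` are illustrative assumptions, not part of the original method.

```python
import numpy as np


def hierarchical_loss(S, X):
    """Evaluate equation (1) for similarities S (n x n) and level matrices X (n-1 x n x n)."""
    S = np.asarray(S, dtype=float)
    X = np.asarray(X, dtype=float)
    # First term: sum over ordered pairs (i, j) and levels t of S_ij * x^t_ij.
    level_term = np.einsum("ij,tij->", S, X)
    # Second term: the constant sum of all similarities.
    constant_term = S.sum()
    return float(level_term + constant_term)


# Hypothetical 3-point example: points 0 and 1 are the most similar pair.
S_demo = np.array([[0.0, 3.0, 1.0],
                   [3.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])
# Assumed encoding: every pair is separated at level 1; at level 2 only the
# pairs involving point 2 are still separated.
X_demo = np.array([
    [[0, 1, 1], [1, 0, 1], [1, 1, 0]],
    [[0, 0, 1], [0, 0, 1], [1, 1, 0]],
])

print(hierarchical_loss(S_demo, X_demo))
```

Because the second term does not depend on the tree, comparing two trees on the same data only requires the first term.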

### The Matrix Hamming distance between hierarchical clusterings

For any two matrices $A, B$, let $\langle A,B\rangle$ denote the Frobenius scalar product, and $\|A\|_F^2$ denote the Frobenius norm, squared. For two hierarchical clusterings $X, Y$, we define $\langle X,Y\rangle=\sum_{i,j=1}^{n}\sum_{t=1}^{n-1}x^t_{ij}y^t_{ij}$. This is clearly a scalar product on the space of hierarchical clusterings, and

$$
\|X\|_F^2=\langle X,X\rangle=\sum_{i,j=1}^{n}\sum_{t=1}^{n-1}\left(x^t_{ij}\right)^2=\sum_{i,j=1}^{n}\sum_{t=1}^{n-1}x^t_{ij}. \tag{2}
$$

The last equality holds because all $x^t_{ij}$ are either 0 or 1. Moreover, in Dasgupta (2016) it is proved that

$$
\sum_{i,j=1}^{n}\sum_{t=1}^{n-1}x^t_{ij}=\frac{n^3-n}{3}. \tag{3}
$$

Therefore, we have the following simple results.

###### Proposition 1

For any hierarchical clustering $X$ over $n$ points, $\|X\|_F^2=\frac{n^3-n}{3}$.

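For any two binary matrices $A=[a_{ij}]$ and $B=[b_{ij}]$, we define the Matrix Hamming (MH) distance to be the number of entries in which $A$ differs from $B$, i.e. $\sum_{i,j}a_{ij}\oplus b_{ij}$, where $\oplus$ denotes the exclusive-or Boolean operator. We further extend the MH distance to hierarchical clusterings $X, Y$ by summing over the levels, i.e. $\sum_{i,j=1}^{n}\sum_{t=1}^{n-1}x^t_{ij}\oplus y^t_{ij}$.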

###### Proposition 2

Let $X, Y$ be two hierarchical clusterings over $n$ points. Then the MH distance between $X$ and $Y$ equals $\|X-Y\|_F^2$.

Proof

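\begin{align}
\|X-Y\|_F^2 &= \|X\|_F^2+\|Y\|_F^2-2\langle X,Y\rangle \tag{4}\\
&= 2\,\frac{n^3-n}{3}-2\langle X,Y\rangle \tag{5}\\
&= \sum_{i,j=1}^{n}\sum_{t=1}^{n-1}\left(x^t_{ij}+y^t_{ij}-2x^t_{ij}y^t_{ij}\right) \tag{6}\\
&= \sum_{i,j=1}^{n}\sum_{t=1}^{n-1}\left(x^t_{ij}\oplus y^t_{ij}\right) \tag{7}\\
&= \text{the MH distance between } X \text{ and } Y. \tag{8}
\end{align}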
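The step from (6) to (7) relies only on the entries being 0 or 1, which can be checked numerically. The sketch below uses random binary arrays rather than actual cluster trees, since the entrywise identity holds for any binary input; the helper names `mh_distance` and `squared_frobenius` are hypothetical.

```python
import numpy as np


def mh_distance(X, Y):
    """Count the entries in which two binary arrays differ (exclusive-or)."""
    X = np.asarray(X).astype(bool)
    Y = np.asarray(Y).astype(bool)
    return int(np.count_nonzero(X ^ Y))


def squared_frobenius(X, Y):
    """Squared Frobenius norm of X - Y, summed over all levels and entries."""
    D = np.asarray(X, dtype=float) - np.asarray(Y, dtype=float)
    return float(np.sum(D * D))


# Random binary stacks of 3 levels of 4 x 4 matrices (not necessarily trees).
rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(3, 4, 4))
B = rng.integers(0, 2, size=(3, 4, 4))

# Entrywise, x + y - 2xy coincides with x XOR y when x, y are in {0, 1}.
assert np.array_equal(A + B - 2 * A * B, A ^ B)
assert mh_distance(A, B) == squared_frobenius(A, B)
```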

For simple, non-hierarchical clusterings, the MH distance is equivalent to the unadjusted Rand Index (Meilă, 2007). The (unadjusted) Rand Index has long been abandoned in the analysis of simple clusterings because, when the number of clusters is larger than 4 or 5, all “usual” clusterings appear very close under this distance. This was further formalized by Meilă (2005).

This disadvantage for simple clusterings may turn out to be an advantage in the hierarchical setting. We expect that, for small values of the level $t$, near the leaves of the tree, the level-$t$ contribution to the MH distance will be very small w.r.t. its upper bound. Indeed, at the lowest level, each of the two trees contains a single pair of merged points, so the two level matrices can differ in only a few entries. Hence, in the MH distance, the levels of the cluster tree below the very top ones are strongly down-weighted, letting the top splits dominate the distance.

## 3 Sublevel Set method

Now we are ready to apply the SS method of Meila (2018) to a hierarchical clustering.

From Proposition 2 it follows that maximizing the MH distance between $X$ and $Y$ is equivalent to minimizing $\langle X,Y\rangle$. Therefore, we can obtain a stability theorem and an OI as follows. Assume we have data $S$ and a hierarchical clustering $X$, obtained by minimizing $\mathrm{Loss}(S,\cdot)$ as well as possible. Hence we assume $X$ is fixed; $Y$ is any other clustering. We define the following optimization problem, which we call a Sublevel Set problem.

\begin{align}
(\text{SS})\qquad \delta=\min_{Y}\quad & \langle X,Y\rangle \tag{9}\\
\text{s.t.}\quad & \mathrm{Loss}(S,Y)\le \mathrm{Loss}(S,X) \tag{10}\\
& y^t_{ij}\ge y^{t+1}_{ij}, \text{ for all } t,i,j \tag{11}\\
& y^t_{ij}+y^t_{jk}\ge y^t_{ik}, \text{ for all } t,i,j,k \tag{12}\\
& \sum_{j\in\mathcal{S}}y^t_{ij}\ge |\mathcal{S}|-t, \text{ for } t,\ \mathcal{S}\subseteq[n] \text{ with } |\mathcal{S}|>t,\ i\in\mathcal{S} \tag{13}\\
& y^t_{ij}\in[0,1], \text{ for all } t,i,j \tag{14}\\
& y^t_{ij}=y^t_{ji},\ y^t_{ii}=0, \text{ for all } t,i,j \tag{15}
\end{align}

The problem above optimizes over a relaxed space of non-binary matrices $Y$ that satisfy constraints (11)–(15); all matrices representing cluster trees also satisfy these constraints. Constraint (10) restricts the feasible set to those $Y$ that have lower or equal cost to $X$; therefore this set is called a sublevel set for $\mathrm{Loss}$. We note that $\mathrm{Loss}(S,Y)$ is linear in $Y$, therefore the (SS) problem is a Linear Program.

The Sublevel Set problem above follows the (LP-ultrametric) problem formulation from Section 4 of Roy and Pokutta (2016), with the addition of the sublevel set constraint (10) and with the loss replaced by $\langle X,Y\rangle$ in the objective. Note that (13) contains an exponential number of constraints; Roy and Pokutta (2016) claim that the LP can still be optimized in a polynomial number of operations. In Meila (2018) the SDPNAL software (Yang et al., 2015) was used, and this software can also solve LPs.
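To give a feel for how quickly the family of constraints indexed by subsets of $[n]$ grows, the following sketch simply counts them: one constraint for every level $t$, every subset with more than $t$ elements, and every element $i$ of that subset. The function name `count_set_constraints` is a hypothetical helper, and the count assumes levels $t=1,\dots,n-1$ as in the sums of equation (1).

```python
from math import comb


def count_set_constraints(n):
    """Number of (t, subset, i) triples with |subset| > t and i in the subset."""
    total = 0
    for t in range(1, n):                  # levels 1 .. n-1
        for size in range(t + 1, n + 1):   # subsets strictly larger than t
            total += comb(n, size) * size  # one constraint per element i
    return total


for n in (3, 4, 10, 20):
    print(n, count_set_constraints(n))
```

Writing all of these constraints out explicitly is only practical for very small $n$, which is why an efficient way of handling them matters for larger problems.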

###### Theorem 2 (Stability Theorem for hierarchical clustering)

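Let $S, X$ be defined as above, and let $\delta$ be the optimal value of the (SS) problem. Then, any other clustering $Y$ with $\mathrm{Loss}(S,Y)\le\mathrm{Loss}(S,X)$ is at MH distance at most $\epsilon$ from $X$, with $\epsilon=\frac{2(n^3-n)}{3}-2\delta$.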

The proof of the Theorem is immediate from the sublevel set constraint of (SS) and Propositions 1 and 2. In more detail, the solution of (SS) may not be an integer solution. However, the (SS) problem guarantees that for any hierarchical clustering $Y$ that has $\mathrm{Loss}(S,Y)\le\mathrm{Loss}(S,X)$, $\langle X,Y\rangle\ge\delta$. From Propositions 1 and 2, it follows that $\|X-Y\|_F^2=\frac{2(n^3-n)}{3}-2\langle X,Y\rangle\le\frac{2(n^3-n)}{3}-2\delta=\epsilon$. Hence, all good clusterings of the data must be in a ball of radius $\epsilon$ around $X$, and $\epsilon$ gives an Optimality Interval for $X$, in terms of Hamming distance.

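The (SS) optimization problem can be solved numerically to obtain $\delta$, as follows.

1. Given similarity matrix $S$, use a hierarchical clustering method to obtain a clustering $X$.

2. Compute $\mathrm{Loss}(S,X)$.

3. Set up and solve the (SS) problem by calling an LP solver. Let $\delta$ be the optimal value of (SS), attained at some optimal solution.

4. Compute $\epsilon=\frac{2(n^3-n)}{3}-2\delta$.

5. The optimality interval is the ball of radius $\epsilon$ around $X$ in the MH distance.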
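The steps above can be sketched for very small $n$ as follows. This is an illustration under several assumptions, not the implementation used in the paper: it uses SciPy's `linprog` with the HiGHS backend in place of SDPNAL, enumerates every constraint explicitly (so it is exponential in $n$), applies all constraints to the variables $y^t_{ij}$, and computes the radius as $2\|X\|_F^2-2\delta$ from the norm of the given $X$ rather than from the closed form. The function `solve_ss` and the toy inputs are hypothetical.

```python
from itertools import combinations, permutations

import numpy as np
from scipy.optimize import linprog


def solve_ss(S, X):
    """Return (delta, eps) for the (SS) LP by brute-force enumeration (small n only)."""
    S = np.asarray(S, dtype=float)
    X = np.asarray(X, dtype=float)   # shape (n-1, n, n); X[t-1] is level t
    n = S.shape[0]
    pairs = list(combinations(range(n), 2))
    n_levels = n - 1
    n_var = n_levels * len(pairs)

    def var(t, i, j):
        # One variable per level and unordered pair, so symmetry and the
        # zero diagonal hold by construction.
        return t * len(pairs) + pairs.index((min(i, j), max(i, j)))

    def row(entries):
        r = np.zeros(n_var)
        for column, coef in entries:
            r[column] += coef
        return r

    # Objective <X, Y> and level-dependent part of Loss(S, Y); the factor 2
    # accounts for the sums running over ordered pairs (i, j).
    cost = row([(var(t, i, j), 2.0 * X[t, i, j])
                for t in range(n_levels) for i, j in pairs])
    loss = row([(var(t, i, j), 2.0 * S[i, j])
                for t in range(n_levels) for i, j in pairs])

    # Sublevel set constraint; the constant term of the loss cancels.
    A, b = [loss], [float(np.sum(S[None, :, :] * X))]
    for t in range(n_levels):
        level = t + 1
        if t + 1 < n_levels:                        # y^t_ij >= y^{t+1}_ij
            for i, j in pairs:
                A.append(row([(var(t + 1, i, j), 1.0), (var(t, i, j), -1.0)]))
                b.append(0.0)
        for i, j, k in permutations(range(n), 3):   # y^t_ij + y^t_jk >= y^t_ik
            A.append(row([(var(t, i, k), 1.0),
                          (var(t, i, j), -1.0), (var(t, j, k), -1.0)]))
            b.append(0.0)
        for size in range(level + 1, n + 1):        # subset constraints
            for subset in combinations(range(n), size):
                for i in subset:
                    A.append(row([(var(t, i, j), -1.0) for j in subset if j != i]))
                    b.append(-float(size - level))

    res = linprog(cost, A_ub=np.array(A), b_ub=np.array(b),
                  bounds=(0, 1), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    delta = float(res.fun)
    eps = 2.0 * float(np.sum(X * X)) - 2.0 * delta
    return delta, eps


# Hypothetical 3-point example, encoded so that X itself is feasible for the LP.
S_demo = np.array([[0.0, 3.0, 1.0], [3.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
X_demo = np.array([
    [[0, 1, 1], [1, 0, 1], [1, 1, 0]],
    [[0, 0, 1], [0, 0, 1], [1, 1, 0]],
], dtype=float)
delta, eps = solve_ss(S_demo, X_demo)
print(delta, eps)
```

For realistic $n$ the subset constraints cannot be enumerated like this, and a solver that handles them implicitly would be needed.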

Meilă (2012) gives formulas that bound the clustering misclassification error distance (also known as earthmover’s distance) by the Matrix Hamming distance. They can be used to translate the bound on the MH distance into a bound on the more intuitive misclassification distance; this would come with a further relaxation of the bound.

## 4 Conclusion

In this paper we have used the SS method to develop an algorithm that outputs optimality guarantees, in the form of Optimality Intervals in the metric space of hierarchical clusterings defined by the matrix Hamming distance. Besides guaranteeing (sub)-optimality, the OI also guarantees stability, provided that it is small enough. In other words, when the OI is small, not only is the cost of the estimated clustering almost optimal, we are also guaranteed that there is no other very different way to partition the data that will give the same or better cost. Much remains to be studied, in particular how small the OI must be for the bound to be truly meaningful.

## References

• Charikar and Chatziafratis (2016) Moses Charikar and Vaggos Chatziafratis. Approximate hierarchical clustering via sparsest cut and spreading metrics. Technical Report 1609.09548, arXiv, 2016.
• Dasgupta (2016) Sanjoy Dasgupta. A cost function for similarity-based hierarchical clustering. In Daniel Wichs and Yishay Mansour, editors, Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2016, Cambridge, MA, USA, June 18-21, 2016, pages 118–127. ACM, 2016. ISBN 978-1-4503-4132-5.
• Meilă (2005) Marina Meilă. Comparing clusterings – An axiomatic view. International Conference on Machine Learning (ICML), pp. 577–584, 2005.
• Meilă (2007) Marina Meilă. Comparing clusterings – an information based distance. Journal of Multivariate Analysis, 98(5):873–895, 2007.
• Meilă (2012) Marina Meilă. Local equivalence of distances between clusterings – a geometric perspective. Machine Learning, 86(3):369–389, 2012.
• Meila (2018) Marina Meilă. How to tell when a clustering is (approximately) correct using convex relaxations. Advances in Neural Information Processing Systems (NeurIPS), pp. 7407–7418, 2018.
• Roy and Pokutta (2016) Aurko Roy and Sebastian Pokutta. Hierarchical clustering via spreading metrics. In Isabelle Guyon and Ulrike von Luxburg, editors, Advances in Neural Information Processing Systems (NIPS), 2016.
• Yang et al. (2015) L. Yang, D. Sun, and K.-C. Toh. SDPNAL+: a majorized semismooth Newton-CG augmented Lagrangian method for semidefinite programming with nonnegative constraints. Mathematical Programming Computation, 7(3):331–366, 2015.