# Data ultrametricity and clusterability

The increasing need to cluster massive datasets and the high cost of running clustering algorithms pose difficult problems for users. In this context it is important to determine whether a data set is clusterable, that is, whether it may be partitioned efficiently into well-differentiated groups containing similar objects. We approach data clusterability from an ultrametric-based perspective. A novel method for determining the ultrametricity of a dataset is proposed via a special type of matrix product, which allows us to evaluate the clusterability of the dataset. Furthermore, we show that applying our technique to a dissimilarity space generates the sub-dominant ultrametric of the dissimilarity.


## 1 Introduction

Clustering is the prototypical unsupervised learning activity which consists in identifying cohesive and well-differentiated groups of records in data. A data set is clusterable if such groups exist; however, due to the variety in data distributions and the inadequate formalization of certain basic notions of clustering, determining data clusterability before applying specific clustering algorithms is a difficult task.

Evaluating data clusterability before the application of clustering algorithms can be very helpful because clustering algorithms are expensive. However, many such evaluations are impractical because they are NP-hard, as shown in [4]. Other notions define data as clusterable when the minimum between-cluster separation is greater than the maximum intra-cluster distance [13], or when each element is closer to all elements in its cluster than to all other data [7].

Several approaches exist for assessing data clusterability. The main hypothesis of [1] is that clusterability can be inferred from a one-dimensional view of pairwise distances between objects. Namely, clusterability is linked to the multimodality of the histogram of inter-object dissimilarities. The basic assumption is that “the presence of multiple modes in the set of pairwise dissimilarities indicates that the original data is clusterable.” Multimodality is evaluated using the Dip and Silverman statistical multimodality tests, an approach that is computationally efficient.

Alternative approaches to data clusterability are linked to the feasibility of producing a clustering; a corollary of this assumption is that “data that are hard to cluster do not have a meaningful clustering structure” [12]. Other approaches to clusterability are based on clustering quality measures and on loss function optimization [4, 9, 3, 8, 7, 11].

We propose a novel approach that relates data clusterability to the extent to which the dissimilarity defined on the data set is close to a special ultrametric defined on the set.

The paper is structured as follows. In Section 2 we introduce the dissimilarities and ultrametrics that play a central role in our definition of clusterability, together with a special product on matrices with non-negative elements that allows an efficient computation of the subdominant ultrametric. In Section 3 a measure of clusterability based on the iterative properties of the dissimilarity matrix is defined. We provide experimental evidence of the effectiveness of the proposed measure through several experiments on small artificial data sets in Section 4. Finally, we present our conclusions and future plans in Section 5.

## 2 Dissimilarities, Ultrametrics, and Matrices

A dissimilarity on a set $S$ is a mapping $d: S \times S \longrightarrow \mathbb{R}_{\geqslant 0}$ such that

1. $d(x,y) \geqslant 0$ and $d(x,y) = 0$ if and only if $x = y$;

2. $d(x,y) = d(y,x)$

for every $x, y \in S$.

A dissimilarity $d$ on $S$ that satisfies the triangular inequality

$$d(x,y) \leqslant d(x,z) + d(z,y)$$

for every $x, y, z \in S$ is a metric. If, instead, the stronger inequality

$$d(x,y) \leqslant \max\{d(x,z), d(z,y)\}$$

is satisfied, $d$ is said to be an ultrametric and the pair $(S,d)$ is an ultrametric space.

A closed sphere in $(S,d)$ is a set defined by

$$B[x,r] = \{y \in S \mid d(x,y) \leqslant r\}.$$

When $(S,d)$ is an ultrametric space, two spheres having the same radius are either disjoint or coincide [18]. Therefore, the collection of closed spheres of radius $r$ in $(S,d)$ is a partition of $S$; we refer to this partition as an $r$-spheric clustering of $S$.

In an ultrametric space every triangle is isosceles. Indeed, let $x, y, z$ be a triplet of points in $S$ and suppose that $d(x,y)$ is the least of the distances between these points. Since $d(x,z) \leqslant \max\{d(x,y), d(y,z)\} = d(y,z)$ and $d(y,z) \leqslant \max\{d(y,x), d(x,z)\} = d(x,z)$, it follows that $d(x,z) = d(y,z)$, so the triangle is isosceles; the two longest sides of this triangle are equal.

It is interesting to note that every $r$-spheric clustering in an ultrametric space is a perfect clustering [5]. This means that all of its in-cluster distances are smaller than all of its between-cluster distances. Indeed, if $u, v$ belong to the same cluster $B[x,r]$, then $d(u,v) \leqslant \max\{d(u,x), d(x,v)\} \leqslant r$. If $u \in B[x,r]$ and $v \in B[y,r]$, where $B[x,r] \neq B[y,r]$, then $d(u,v) > r$, and this implies $d(u,v) > d(u,u')$ for every $u'$ in the cluster of $u$ because the triangle $u u' v$ is isosceles and $d(u,u')$ is not the longest side of this triangle.
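The defining properties above are easy to check computationally. The following Python sketch (the helper names `is_ultrametric` and `is_isosceles` are ours, not from the paper) verifies the ultrametric inequality by brute force and confirms that every triangle is isosceles:

```python
def is_ultrametric(d):
    """Check zero diagonal, symmetry, and the ultrametric inequality by brute force."""
    n = len(d)
    return all(
        d[i][i] == 0 and d[i][j] == d[j][i] and
        d[i][j] <= max(d[i][k], d[k][j])
        for i in range(n) for j in range(n) for k in range(n)
    )

def is_isosceles(d, i, j, k):
    """The two largest sides of the triangle {i, j, k} must coincide."""
    sides = sorted([d[i][j], d[j][k], d[i][k]])
    return sides[1] == sides[2]

U = [[0, 2, 5], [2, 0, 5], [5, 5, 0]]
assert is_ultrametric(U)
assert all(is_isosceles(U, i, j, k)
           for i in range(3) for j in range(3) for k in range(3))
```

For a metric that is not an ultrametric, such as three evenly spaced collinear points, `is_ultrametric` returns `False`.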

###### Example 2.1.

Let $S = \{x_1, \ldots, x_8\}$ and let $(S,d)$ be the ultrametric space, where the ultrametric $d$ is defined by the following table:

| $d(x_i,x_j)$ | $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $x_6$ | $x_7$ | $x_8$ |
|---|---|---|---|---|---|---|---|---|
| $x_1$ | 0 | 4 | 4 | 10 | 10 | 16 | 16 | 16 |
| $x_2$ | 4 | 0 | 4 | 10 | 10 | 16 | 16 | 16 |
| $x_3$ | 4 | 4 | 0 | 10 | 10 | 16 | 16 | 16 |
| $x_4$ | 10 | 10 | 10 | 0 | 6 | 16 | 16 | 16 |
| $x_5$ | 10 | 10 | 10 | 6 | 0 | 16 | 16 | 16 |
| $x_6$ | 16 | 16 | 16 | 16 | 16 | 0 | 4 | 4 |
| $x_7$ | 16 | 16 | 16 | 16 | 16 | 4 | 0 | 4 |
| $x_8$ | 16 | 16 | 16 | 16 | 16 | 4 | 4 | 0 |

The closed spheres of this space are:

$$B[x_i, r] = \begin{cases} \{x_i\} & \text{for } r < 4,\\ \{x_1,x_2,x_3\} & \text{for } 4 \leqslant r < 10,\\ \{x_1,x_2,x_3,x_4,x_5\} & \text{for } 10 \leqslant r < 16,\\ S & \text{for } r \geqslant 16, \end{cases}$$

for $1 \leqslant i \leqslant 3$,

$$B[x_i, r] = \begin{cases} \{x_i\} & \text{for } r < 6,\\ \{x_4,x_5\} & \text{for } 6 \leqslant r < 10,\\ \{x_1,x_2,x_3,x_4,x_5\} & \text{for } 10 \leqslant r < 16,\\ S & \text{for } r \geqslant 16, \end{cases}$$

for $4 \leqslant i \leqslant 5$, and

$$B[x_i, r] = \begin{cases} \{x_i\} & \text{for } r < 4,\\ \{x_6,x_7,x_8\} & \text{for } 4 \leqslant r < 16,\\ S & \text{for } r \geqslant 16, \end{cases}$$

for $6 \leqslant i \leqslant 8$.

Based on the properties of spheric clusterings mentioned above, meaningful clusterings can be produced in time linear in the number of objects. For the ultrametric space of Example 2.1, the closed spheres of radius $r$ with $6 \leqslant r < 10$ produce the clustering

$$\{x_1, x_2, x_3\}, \{x_4, x_5\}, \{x_6, x_7, x_8\}.$$
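To make the sphere construction concrete, here is a small Python sketch (the helper names are ours) that recovers the $r$-spheric clusterings of Example 2.1 directly from the table:

```python
# The 8x8 ultrametric of Example 2.1, indexed 0..7 for x1..x8.
D = [
    [0, 4, 4, 10, 10, 16, 16, 16],
    [4, 0, 4, 10, 10, 16, 16, 16],
    [4, 4, 0, 10, 10, 16, 16, 16],
    [10, 10, 10, 0, 6, 16, 16, 16],
    [10, 10, 10, 6, 0, 16, 16, 16],
    [16, 16, 16, 16, 16, 0, 4, 4],
    [16, 16, 16, 16, 16, 4, 0, 4],
    [16, 16, 16, 16, 16, 4, 4, 0],
]

def closed_sphere(d, x, r):
    """B[x, r] = {y | d(x, y) <= r} as a frozenset of indices."""
    return frozenset(y for y in range(len(d)) if d[x][y] <= r)

def spheric_clustering(d, r):
    """The distinct closed spheres of radius r; in an ultrametric space
    they are pairwise disjoint and therefore partition the set."""
    return {closed_sphere(d, x, r) for x in range(len(d))}

print(sorted(sorted(c) for c in spheric_clustering(D, 6)))
# [[0, 1, 2], [3, 4], [5, 6, 7]] -- the clustering {x1,x2,x3}, {x4,x5}, {x6,x7,x8}
```

Raising the radius to 10 merges the first two spheres, reproducing the hierarchy of the example.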

If a dissimilarity $d$ defined on a data set is close to an ultrametric, it is natural to assume that the data set is clusterable. We assess the closeness between a dissimilarity $d$ and a special ultrametric known as the subdominant ultrametric of $d$ using a matrix approach.

Let $S$ be a set. Define a partial order “$\leqslant$” on the set $\mathcal{D}_S$ of definite dissimilarities on $S$ by $d \leqslant d'$ if $d(x,y) \leqslant d'(x,y)$ for every $x, y \in S$. It is easy to verify that $(\mathcal{D}_S, \leqslant)$ is a poset.

The set $\mathcal{U}_S$ of ultrametrics on $S$ is a subset of $\mathcal{D}_S$.

###### Theorem 2.2.

Let $\{d_i \mid i \in I\}$ be a collection of ultrametrics on the set $S$. Then, the mapping $d: S \times S \longrightarrow \mathbb{R}_{\geqslant 0}$ defined as

$$d(x,y) = \sup\{d_i(x,y) \mid i \in I\}$$

is an ultrametric on $S$.

###### Proof.

We need to verify only that $d$ satisfies the ultrametric inequality. Since each mapping $d_i$ is an ultrametric, for $x, y, z \in S$ we have

$$d_i(x,y) \leqslant \max\{d_i(x,z), d_i(z,y)\} \leqslant \max\{d(x,z), d(z,y)\}$$

for every $i \in I$. Therefore,

$$d(x,y) = \sup\{d_i(x,y) \mid i \in I\} \leqslant \max\{d(x,z), d(z,y)\},$$

hence $d$ is an ultrametric on $S$. ∎

###### Theorem 2.3.

Let $d$ be a dissimilarity on a set $S$ and let $\mathcal{U}_d$ be the set of ultrametrics $e$ on $S$ such that $e \leqslant d$. The set $\mathcal{U}_d$ has a largest element in the poset $(\mathcal{D}_S, \leqslant)$.

###### Proof.

The set $\mathcal{U}_d$ is nonempty because the zero dissimilarity $e_0$ given by $e_0(x,y) = 0$ for every $x, y \in S$ is an ultrametric and $e_0 \leqslant d$.

Since the set $\{e(x,y) \mid e \in \mathcal{U}_d\}$ has $d(x,y)$ as an upper bound, it is possible to define the mapping $e_1$ as $e_1(x,y) = \sup\{e(x,y) \mid e \in \mathcal{U}_d\}$ for $x, y \in S$. It is clear that $e \leqslant e_1$ for every ultrametric $e \in \mathcal{U}_d$. We claim that $e_1$ is an ultrametric on $S$.

We prove only that $e_1$ satisfies the ultrametric inequality. Suppose that there exist $x, y, z \in S$ such that $e_1$ violates the ultrametric inequality; that is,

$$\max\{e_1(x,z), e_1(z,y)\} < e_1(x,y).$$

This is equivalent to

$$\sup\{e(x,y) \mid e \in \mathcal{U}_d\} > \max\{\sup\{e(x,z) \mid e \in \mathcal{U}_d\}, \sup\{e(z,y) \mid e \in \mathcal{U}_d\}\}.$$

Thus, there exists $\hat{e} \in \mathcal{U}_d$ such that

$$\hat{e}(x,y) > \sup\{e(x,z) \mid e \in \mathcal{U}_d\}$$

and

$$\hat{e}(x,y) > \sup\{e(z,y) \mid e \in \mathcal{U}_d\}.$$

In particular, $\hat{e}(x,y) > \hat{e}(x,z)$ and $\hat{e}(x,y) > \hat{e}(z,y)$, which contradicts the fact that $\hat{e}$ is an ultrametric. ∎

The ultrametric $e_1$ defined in the proof of Theorem 2.3 is known as the maximal subdominant ultrametric for the dissimilarity $d$.

The situation is not symmetric with respect to the infimum of a set of ultrametrics because, in general, the infimum of a set of ultrametrics is not necessarily an ultrametric.

Let $\mathbb{P}$ be the set

$$\mathbb{P} = \{x \mid x \in \mathbb{R}, x \geqslant 0\} \cup \{\infty\}.$$

The usual operations defined on $\mathbb{R}_{\geqslant 0}$ can be extended to $\mathbb{P}$ by defining

$$x + \infty = \infty + x = \infty, \quad x \cdot \infty = \infty \cdot x = \infty$$

for $x \in \mathbb{P}$.

Let $\mathbb{P}^{n \times n}$ be the set of $n \times n$ matrices over $\mathbb{P}$. If $A, B \in \mathbb{P}^{n \times n}$ we have $A \leqslant B$ if $a_{ij} \leqslant b_{ij}$ for $1 \leqslant i \leqslant n$ and $1 \leqslant j \leqslant n$.

If $A, B \in \mathbb{P}^{n \times n}$ and $C = AB$, the matrix product is defined as

$$c_{ij} = \min\{\max\{a_{ik}, b_{kj}\} \mid 1 \leqslant k \leqslant n\},$$

for $1 \leqslant i \leqslant n$ and $1 \leqslant j \leqslant n$.
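This “min-max” product can be implemented in a few lines. The following Python sketch (function names ours) implements the product and the identity matrix, representing the element $\infty$ of $\mathbb{P}$ by `float("inf")`:

```python
INF = float("inf")  # plays the role of the element "infinity" of P

def minmax_product(a, b):
    """Min-max product: c[i][j] = min over k of max(a[i][k], b[k][j])."""
    n = len(a)
    return [[min(max(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    """E_n: zeros on the main diagonal, infinity everywhere else."""
    return [[0 if i == j else INF for j in range(n)] for i in range(n)]

A = [[0, 3, INF],
     [3, 0, 1],
     [INF, 1, 0]]

assert minmax_product(A, identity(3)) == A  # E_3 is a two-sided identity
assert minmax_product(identity(3), A) == A
print(minmax_product(A, A))  # [[0, 3, 3], [3, 0, 1], [3, 1, 0]]
```

Note how the product replaces the entry $\infty$ of $A$ at position $(1,3)$ by $3$, the best “max over a two-step path” value.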

If $E_n$ is the matrix defined by

$$(E_n)_{ij} = \begin{cases} 0 & \text{if } i = j,\\ \infty & \text{otherwise}, \end{cases}$$

that is, the matrix whose main diagonal elements are $0$ and whose other elements equal $\infty$, then $A E_n = E_n A = A$ for every $A \in \mathbb{P}^{n \times n}$ and for every $n \geqslant 1$.

The matrix multiplication defined above is associative, hence $(\mathbb{P}^{n \times n}, \cdot)$ is a semigroup with the identity $E_n$. The powers of $A \in \mathbb{P}^{n \times n}$ are inductively defined as

$$A^0 = E_n, \quad A^{m+1} = A^m A,$$

for $m \geqslant 0$.

Note that if $A \leqslant E_n$, then $A^2 \leqslant A$, because $(A^2)_{ij} \leqslant \max\{a_{ij}, a_{jj}\} = a_{ij}$ for $1 \leqslant i, j \leqslant n$. It is also immediate that for $A, B, C, D \in \mathbb{P}^{n \times n}$, $A \leqslant B$ implies $AC \leqslant BC$; similarly, $AC \leqslant BD$ if $A \leqslant B$ and $C \leqslant D$.

Let $W_A$ be the finite set of elements of $\mathbb{P}$ that occur in the matrix $A$. Since the entries of any power of $A$ are also included in $W_A$, the sequence $A, A^2, A^3, \ldots$ is ultimately periodic because it contains a finite number of distinct matrices.

Let $k(A)$ be the least integer such that $A^{k(A)} = A^{k(A)+d}$ for some $d \geqslant 1$. The sequence of powers of $A$ has the form

$$A, A^2, \ldots, A^{k(A)-1}, A^{k(A)}, \ldots, A^{k(A)+d-1}, A^{k(A)}, \ldots, A^{k(A)+d-1}, \ldots,$$

where $d$ is the least integer such that $A^{k(A)+d} = A^{k(A)}$. This integer is denoted by $d(A)$.

The set $\{A^{k(A)}, A^{k(A)+1}, \ldots, A^{k(A)+d(A)-1}\}$ is a cyclic group with respect to this multiplication.

If $(S, d)$ is a dissimilarity space, where $S = \{x_1, \ldots, x_n\}$, the matrix of this space is the matrix $A \in \mathbb{P}^{n \times n}$ defined by $a_{ij} = d(x_i, x_j)$ for $1 \leqslant i, j \leqslant n$. Clearly, $A$ is a symmetric matrix and all its diagonal elements are $0$, that is, $A \leqslant E_n$.

If, in addition, we have $a_{ij} \leqslant a_{ik} + a_{kj}$ for $1 \leqslant i, j, k \leqslant n$, then $A$ is a metric matrix. If this condition is replaced by the stronger condition $a_{ij} \leqslant \max\{a_{ik}, a_{kj}\}$ for $1 \leqslant i, j, k \leqslant n$, then $A$ is an ultrametric matrix. Thus, for an ultrametric matrix we have $a_{ij} \leqslant \min\{\max\{a_{ik}, a_{kj}\} \mid 1 \leqslant k \leqslant n\}$, that is, $A \leqslant A^2$; since $A \leqslant E_n$ implies $A^2 \leqslant A$, this amounts to $A^2 = A$.

###### Theorem 2.4.

If $A$ is a dissimilarity matrix, there exists $m$ such that

$$\cdots = A^{m+1} = A^m \leqslant \cdots \leqslant A^2 \leqslant A \leqslant E_n$$

and $A^m$ is an ultrametric matrix.

###### Proof.

Since $A \leqslant E_n$, the powers of $A$ form a decreasing sequence, and the existence of the number $m$ with the property mentioned in the theorem is immediate since there exists only a finite number of matrices whose elements belong to $W_A$. Since $(A^m)^2 = A^{2m} = A^m$, it follows that $A^m$ is an ultrametric matrix. ∎

For a matrix $A$ with $A \leqslant E_n$, let $\mathrm{sp}(A)$ be the least number $m$ such that $A^{m+1} = A^m$. We refer to $\mathrm{sp}(A)$ as the stabilization power of the matrix $A$. The matrix $A^{\mathrm{sp}(A)}$ is denoted by $A^\infty$.

The previous considerations suggest evaluating the ultrametricity of a matrix $A$ with $A \leqslant E_n$ through its stabilization power $\mathrm{sp}(A)$. Since $A^{m+1} = A^m$ for $m = \mathrm{sp}(A)$, it follows that $A^\infty$ is an ultrametric matrix. If $\mathrm{sp}(A) = 1$, $A$ is ultrametric itself and $A^\infty = A$.
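Computing the stabilization power amounts to iterating the min-max product until the powers stop changing. A Python sketch (the names `minmax_product` and `stabilization_power` are ours), applied to a matrix that is already ultrametric and to one that is not:

```python
INF = float("inf")

def minmax_product(a, b):
    n = len(a)
    return [[min(max(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def stabilization_power(a):
    """Least m with A^(m+1) = A^m; exists because A <= E_n makes the powers decrease."""
    m, power = 1, a
    while (nxt := minmax_product(power, a)) != power:
        m, power = m + 1, nxt
    return m

# An ultrametric matrix is a fixed point of the product: its stabilization power is 1.
U = [[0, 2, 5], [2, 0, 5], [5, 5, 0]]
print(stabilization_power(U))  # -> 1

# A line metric (points 0, 1, 5, 6) is not ultrametric and stabilizes later.
D = [[0, 1, 5, 6], [1, 0, 4, 5], [5, 4, 0, 1], [6, 5, 1, 0]]
print(stabilization_power(D))  # -> 3
```

Each multiplication costs $O(n^3)$, so the whole computation is polynomial in the number of objects.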

###### Theorem 2.5.

Let $(S, d)$ be a dissimilarity space, where $S = \{x_1, \ldots, x_n\}$, having the dissimilarity matrix $A$. If $m$ is the least number such that $A^{m+1} = A^m$, then the mapping $e: S \times S \longrightarrow \mathbb{R}_{\geqslant 0}$ defined by $e(x_i, x_j) = (A^m)_{ij}$ is the subdominant ultrametric for the dissimilarity $d$.

###### Proof.

As we observed, $A^m$ is an ultrametric matrix, so $e$ is an ultrametric on $S$. Since $A^m \leqslant A$, it follows that $e(x_i, x_j) \leqslant d(x_i, x_j)$ for all $x_i, x_j \in S$.

Suppose that $B$ is an ultrametric matrix such that $B \leqslant A$. Since $B$ is ultrametric we have $B^m = B$, which implies $B = B^m \leqslant A^m$. Thus, $A^m$ dominates any ultrametric matrix that is dominated by $A$. Consequently, the dissimilarity $e$ defined by $A^m$ is the subdominant ultrametric for $d$. ∎

The subdominant ultrametric of a dissimilarity is usually studied in the framework of weighted graphs [14].

A weighted graph is a triple $G = (V, E, w)$, where $V$ is the set of vertices of $G$, $E$ is a set of two-element subsets of $V$ called edges, and $w: E \longrightarrow \mathbb{R}_{\geqslant 0}$ is the weight function of the edges. If $e \in E$, then $e = \{v_i, v_j\}$, where $v_i, v_j$ are distinct vertices in $V$. The weight $w$ is extended to all two-element subsets of $V$ as

$$w(\{v_i, v_j\}) = \begin{cases} w(\{v_i, v_j\}) & \text{if } \{v_i, v_j\} \in E,\\ \infty & \text{otherwise}. \end{cases}$$

A path of length $n$ in a weighted graph is a sequence

$$\wp = (v_0, v_1, v_2, \ldots, v_{n-1}, v_n),$$

where $\{v_{i-1}, v_i\} \in E$ for $1 \leqslant i \leqslant n$.

The set of paths of length $n$ in the graph $G$ is denoted by $\mathrm{Paths}_n(G)$. The set of paths of length $n$ that join the vertex $v_i$ to the vertex $v_j$ is denoted by $\mathrm{Paths}_n(G, v_i, v_j)$. The set of all paths is

$$\mathrm{Paths}(G) = \bigcup_{n \geqslant 1} \mathrm{Paths}_n(G).$$

For a weighted graph $G$, the extension of the weight function to paths is the function $M: \mathrm{Paths}(G) \longrightarrow \mathbb{P}$ defined as

$$M(\wp) = \max\{w(\{v_{i-1}, v_i\}) \mid 1 \leqslant i \leqslant n\},$$

where $\wp = (v_0, v_1, \ldots, v_n)$. Thus, if $\wp = (v_0, v_1)$, we have $M(\wp) = w(\{v_0, v_1\})$.

If $G = (V, E, w)$ is a weighted graph with $V = \{v_1, \ldots, v_n\}$, its incidence matrix is the matrix $A_G \in \mathbb{P}^{n \times n}$ defined by $(A_G)_{ij} = w(\{v_i, v_j\})$ for $i \neq j$ and $(A_G)_{ii} = 0$ for $1 \leqslant i \leqslant n$.

Let $P^{(\ell)}_{ij}$ be the set of paths of length $\ell$ that join the vertex $v_i$ to the vertex $v_j$. Note that

$$P^{(\ell+1)}_{ij} = \{(v_i, \ldots, v_k, v_j) \mid \wp = (v_i, \ldots, v_k) \in P^{(\ell)}_{ik} \text{ and } v_j \text{ does not occur in } \wp\}.$$

Define $a^{(\ell)}_{ij} = (A_G^\ell)_{ij}$. The powers of the incidence matrix of the graph are given by

$$a^{(\ell+1)}_{ik} = \min\{M(\wp') \mid \wp' \in P^{(\ell+1)}_{ik}\} = \min\{\max\{M(\wp), w(e)\} \mid \wp' = (v_i, \ldots, v_j, v_k), \wp = (v_i, \ldots, v_j) \in P^{(\ell)}_{ij}, e = \{v_j, v_k\} \in E\} = \min_j\{\max\{a^{(\ell)}_{ij}, w(e)\} \mid e = \{v_j, v_k\} \in E\}.$$

Thus, we have

$$(A_G^\ell)_{ij} = \min\{M(\wp) \mid \wp \in P^{(\ell)}_{ij}\}$$

for $\ell \geqslant 1$.
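This path characterization can be checked by brute force on a small graph: the stable min-max power of the incidence matrix must agree with the minimum, over all simple paths, of the largest edge weight along the path. A Python sketch (helper names ours; `INF` marks absent edges):

```python
INF = float("inf")

def minmax_product(a, b):
    n = len(a)
    return [[min(max(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def minmax_closure(a):
    """Stable power of the incidence matrix (diagonal is 0, so powers decrease)."""
    power = a
    while (nxt := minmax_product(power, a)) != power:
        power = nxt
    return power

def brute_minimax(a, i, j):
    """Minimum, over all simple i-j paths, of the largest edge weight on the path."""
    n, best = len(a), INF

    def dfs(v, seen, worst):
        nonlocal best
        if v == j:
            best = min(best, worst)
            return
        for u in range(n):
            if u not in seen and a[v][u] < INF:
                dfs(u, seen | {u}, max(worst, a[v][u]))

    dfs(i, {i}, 0)
    return best

# A small weighted graph on 4 vertices; INF marks absent edges.
W = [[0, 2, INF, 6],
     [2, 0, 3, INF],
     [INF, 3, 0, 1],
     [6, INF, 1, 0]]

C = minmax_closure(W)
assert all(C[i][j] == brute_minimax(W, i, j)
           for i in range(4) for j in range(4) if i != j)
print(C)  # [[0, 2, 3, 3], [2, 0, 3, 3], [3, 3, 0, 1], [3, 3, 1, 0]]
```

For instance, the direct edge between vertices 1 and 4 has weight 6, but the path through vertices 2 and 3 has maximal edge weight 3, which is the value the closure records.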

## 3 A Measure of Clusterability

We conjecture that a dissimilarity space $(S, d)$ is more clusterable if the dissimilarity $d$ is closer to an ultrametric, hence if the stabilization power $\mathrm{sp}(A)$ of the dissimilarity matrix $A$ of $S$ is small. Thus, it is natural to define the clusterability of a data set in terms of $\mathrm{sp}(A)$: the lower the stabilization power, the closer $A$ is to an ultrametric matrix, and thus, the higher the clusterability of the data set.
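The following Python sketch illustrates the measure on toy one-dimensional data of our own choosing (helper names ours): a data set with two tight, well-separated groups stabilizes after fewer min-max multiplications than an evenly spaced data set of the same size:

```python
INF = float("inf")

def minmax_product(a, b):
    n = len(a)
    return [[min(max(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def stabilization_power(a):
    """Least m with A^(m+1) = A^m under the min-max product."""
    m, power = 1, a
    while (nxt := minmax_product(power, a)) != power:
        m, power = m + 1, nxt
    return m

def dist_matrix(points):
    """Dissimilarity matrix for points on the real line."""
    return [[abs(p - q) for q in points] for p in points]

clustered = [0, 1, 2, 10, 11, 12]  # two tight groups separated by a gap of 8
uniform = [0, 2, 4, 6, 8, 10]      # evenly spaced points, no cluster structure

print(stabilization_power(dist_matrix(clustered)))  # -> 3: stabilizes quickly
print(stabilization_power(dist_matrix(uniform)))    # -> 5: stabilizes slowly
```

The well-clustered data set is closer to an ultrametric and is therefore deemed more clusterable by the proposed measure.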

Our hypothesis is supported by previous results obtained in [1], where the clusterability of nine data sets was statistically examined using the Dip and Silverman tests of unimodality. The approach used in [1] starts with the hypothesis that the presence of multiple modes in the one-dimensional set of pairwise distances indicates that the original data set is clusterable. Multimodality is assessed using the tests mentioned above. The time required by this evaluation is quadratic in the number of objects.

The first four data sets, iris, swiss, faithful, and rivers, were deemed clusterable; the remaining ones were evaluated as not clusterable. Tests published in [6] have produced low $p$-values for the first four data sets, which is an indication of clusterability. The remaining data sets, USArrests, attitude, cars, and trees, produce much larger $p$-values, which shows a lack of clusterability. Table 1 shows that all data sets deemed clusterable by the unimodality statistical test have values of the clusterability index that exceed those of the data sets deemed not clusterable.

In our approach, the clusterability of a data set is expressed primarily through the stabilization power of its dissimilarity matrix; in addition, the histogram of the dissimilarity values is less differentiated when the data is not clusterable.

## 4 Experimental Evidence on Small Artificial Data Sets

Another series of experiments involved small datasets having the same number of points in $\mathbb{R}^2$ arranged in lattices. The points have integer coordinates and the distance between points is the Manhattan distance.

By shifting the data points to different locations, we create several distinct structured clusterings that consist of rectangular clusters.

Figures 2 and 3 show an example of a series of datasets with a total of 36 data points. Initially, the data set has 4 rectangular clusters containing 9 data points each, with a gap of 3 distance units between the clusters. The ultrametricity of the dataset and, therefore, its clusterability is affected by the number of clusters, the size of the clusters, and the inter-cluster distances. Figure 3 shows that the stabilization power reaches its highest value, and therefore the clusterability is the lowest, when there is only one cluster in the dataset (see the third row of Figure 3).

If points are uniformly distributed, as is the case in the third row of Figure 3, the clustering structure disappears and the clusterability has its lowest value.

Histograms are used by some authors [10, 2] to identify the degree of clusterability. Note, however, that in the case of the data shown in Figures 2 and 3, the histograms of the original dissimilarity of the dataset do not offer guidance on clusterability (second column of Figures 2 and 3). By applying the “min-max” power operation to the original matrix, we get an ultrametric matrix. The new histogram of the ultrametric shows a clear difference for each dataset. In the third column of Figures 2 and 3, the histogram of the ultrametric matrix for each dataset shows a decrease in the number of distinct distances after the “power” operation.

If the dataset has no clustering structure the histogram of the ultrametric distance has only one bar.

The number of peaks of the histogram indicates the minimum number of clusters in the ultrametric space specified by the matrix. The largest values of the valleys of the histogram indicate the radii of the spheres in the ultrametric space that define the clusters.
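This collapse of the histogram can be reproduced on a toy example. The sketch below (helper names ours) counts the off-diagonal dissimilarity values before and after the min-max closure; the closure reduces many distinct values to a handful of ultrametric levels, here two levels for a two-cluster data set:

```python
from collections import Counter

INF = float("inf")

def minmax_product(a, b):
    n = len(a)
    return [[min(max(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
            for i in range(n)]

def minmax_closure(a):
    """Iterate the min-max product until the power stabilizes."""
    power = a
    while (nxt := minmax_product(power, a)) != power:
        power = nxt
    return power

def offdiag_values(a):
    """Histogram (as a Counter) of the off-diagonal entries of a matrix."""
    n = len(a)
    return Counter(a[i][j] for i in range(n) for j in range(i + 1, n))

points = [0, 1, 2, 10, 11, 12]  # two clusters on the line
A = [[abs(p - q) for q in points] for p in points]

print(offdiag_values(A))                  # seven distinct dissimilarity values
print(offdiag_values(minmax_closure(A)))  # only two levels: 1 (within) and 8 (across)
```

The two bars of the final histogram correspond to the within-cluster level and the between-cluster level of the subdominant ultrametric.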

If a data set contains a large number of small clusters, these clusters can be regarded as outliers and the clusterability of the data set is reduced. This is the case in the third line of Figure 4, which shows a particular case of 9 clusters with 36 data points. Since the size of each cluster is too small to be considered a real cluster, all of them together are merely regarded as a one-cluster dataset with 9 points.

## 5 Conclusions and Future Work

The special matrix powers of the adjacency matrix of the weighted graph of object dissimilarities provide a tool for computing the subdominant ultrametric of a dissimilarity and an assessment of the existence of an underlying clustering structure in a dissimilarity space.

The “power” operation successfully eliminates the redundant information in the dissimilarity matrix of the dataset but maintains the useful information that can discriminate the cluster structures of the dataset.

In a series of seminal papers [15, 16, 17], F. Murtagh argued that as the dimensionality of a linear metric space increases, an equalization process of distances takes place and the metric of the space gets increasingly closer to an ultrametric. This raises issues related to the comparative evaluation (statistical and algebraic) of the ultrametricity of such spaces and of their clusterability, which we intend to examine in the future.

## References

• [1] M. Ackerman, A. Adolfsson, and N. Brownstein. An effective and efficient approach for clusterability evaluation. CoRR, abs/1602.06687, 2016.
• [2] M. Ackerman and S. Ben-David. Clusterability: A theoretical study. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009, pages 1–8, 2009.
• [3] M. Ackerman, S. Ben-David, and D. Loker. Towards property-based classification of clustering paradigms. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010. Proceedings of a meeting held 6-9 December 2010, Vancouver, British Columbia, Canada., pages 10–18.
• [4] Margareta Ackerman and Shai Ben-David. Clusterability: A theoretical study. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, AISTATS 2009, Clearwater Beach, Florida, USA, April 16-18, 2009, pages 1–8, 2009.
• [5] Margareta Ackerman, Shai Ben-David, Simina Brânzei, and David Loker. Weighted clustering. In Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, July 22-26, 2012, Toronto, Ontario, Canada., pages 858–863, 2012.
• [6] Andreas Adolfsson, Margareta Ackerman, and N. C. Brownstein. To cluster, or not to cluster: An analysis of clusterability methods. CoRR, abs/1808.08317, 2018.
• [7] M. F. Balcan, A. Blum, and S. Vempala. A discriminative framework for clustering via similarity functions. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, Victoria, British Columbia, Canada, May 17-20, 2008, pages 671–680, 2008.
• [8] S. Ben-David. Computational feasibility of clustering under clusterability assumptions. CoRR, abs/1501.00437.
• [9] S. Ben-David and M. Ackerman. Measures of clustering quality: A working set of axioms for clustering. In Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 121–128, 2008.
• [10] S. Ben-David and M. Ackerman. Measures of clustering quality: A working set of axioms for clustering. In Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 121–128, 2008.
• [11] A. Ben-Hur, A. Elisseeff, and I. Guyon. A stability based method for discovering structure in clustered data. In Proceedings of the 7th Pacific Symposium on Biocomputing, PSB 2002, Lihue, Hawaii, USA, January 3-7, 2002, pages 6–17, 2002.
• [12] Amit Daniely, Nati Linial, and Michael E. Saks. Clustering is difficult only when it does not matter. CoRR, abs/1205.4891, 2012.
• [13] S. Epter, M. Krishnamoorthy, and M. Zaki. Clusterability detection and initial seed selection in large datasets. In The International Conference on Knowledge Discovery in Databases, volume 7, 1999.
• [14] B. Leclerc. Description combinatoire des ultramétriques. Mathématiques et science humaines, 73:5–37, 1981.
• [15] Fionn Murtagh. Quantifying ultrametricity. In COMPSTAT, pages 1561–1568, 2004.
• [16] Fionn Murtagh. Clustering in very high dimensions. In UK Workshop on Computational Intelligence, page 226, 2005.
• [17] Fionn Murtagh. Identifying and exploiting ultrametricity. In Advances in Data Analysis, pages 263–272. Springer, 2007.
• [18] D. A. Simovici and C. Djeraba. Mathematical Tools for Data Mining – Set Theory, Partial Orders, Combinatorics. Springer-Verlag, London, second edition, 2008.