Revisiting Agglomerative Clustering

by   Eric K. Tokuda, et al.

In data clustering, emphasis is often placed in finding groups of points. An equally important subject concerns the avoidance of false positives. As it could be expected, these two goals oppose one another, in the sense that emphasis on finding clusters tends to imply in higher probability of obtaining false positives. The present work addresses this problem considering some traditional agglomerative methods, namely single, average, median, complete, centroid and Ward's applied to unimodal and bimodal datasets following uniform, gaussian, exponential and power-law distributions. More importantly, we adopt a generic model of clusters involving a higher density core surrounded by a transition zone, followed by a sparser set of outliers. Combined with preliminary specification of the size of the expected clusters, this model paved the way to the implementation of an objective means for identifying the clusters from dendrograms. In addition, the adopted model also allowed the relevance of the detected clusters to be estimated in terms of the height of the subtrees corresponding to the identified clusters. More specifically, the lower this height, the more compact and relevant the clusters tend to be. Several interesting results have been obtained, including the tendency of several of the considered methods to detect two clusters in unimodal data. The single-linkage method has been found to provide the best resilience to this tendency. In addition, several methods tended to detect clusters that do not correspond directly to the cores, therefore characterized by lower relevance. The possibility of identifying the type of distribution of points from the adopted measurements was also investigated.


page 1

page 2

page 3

page 4


Clustering with Confidence: Finding Clusters with Statistical Guarantees

Clustering is a widely used unsupervised learning method for finding str...

Toward Generalized Clustering through an One-Dimensional Approach

After generalizing the concept of clusters to incorporate clusters that ...

DISCERN: Diversity-based Selection of Centroids for k-Estimation and Rapid Non-stochastic Clustering

As one of the most ubiquitously applied unsupervised learning methods, c...

Identifying meaningful clusters in malware data

Finding meaningful clusters in drive-by-download malware data is a parti...

Dimensionality's Blessing: Clustering Images by Underlying Distribution

Many high dimensional vector distances tend to a constant. This is typic...

How to Find a Good Explanation for Clustering?

k-means and k-median clustering are powerful unsupervised machine learni...