Revisiting Agglomerative Clustering

05/16/2020
by Eric K. Tokuda, et al.

In data clustering, emphasis is often placed on finding groups of points. An equally important subject concerns the avoidance of false positives. As might be expected, these two goals oppose one another, in the sense that emphasis on finding clusters tends to imply a higher probability of obtaining false positives. The present work addresses this problem by considering some traditional agglomerative methods, namely single, average, median, complete, centroid and Ward's linkage, applied to unimodal and bimodal datasets following uniform, Gaussian, exponential and power-law distributions. More importantly, we adopt a generic model of clusters involving a higher-density core surrounded by a transition zone, followed by a sparser set of outliers. Combined with a preliminary specification of the size of the expected clusters, this model paved the way to the implementation of an objective means for identifying the clusters from dendrograms. In addition, the adopted model also allowed the relevance of the detected clusters to be estimated in terms of the height of the subtrees corresponding to the identified clusters. More specifically, the lower this height, the more compact and relevant the clusters tend to be. Several interesting results have been obtained, including the tendency of several of the considered methods to detect two clusters in unimodal data. The single-linkage method has been found to provide the best resilience to this tendency. In addition, several methods tended to detect clusters that do not correspond directly to the cores and are therefore characterized by lower relevance. The possibility of identifying the type of distribution of points from the adopted measurements was also investigated.
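
As a rough illustration of the setting described above, the following sketch builds dendrograms for the six linkage methods on a single unimodal Gaussian sample and inspects the split at the root of each tree. It is not the authors' identification procedure: the sample size, the random seed, and the helper name `root_split` are assumptions made for this example, and it relies only on SciPy's `scipy.cluster.hierarchy.linkage`. A very unbalanced root split suggests that the method peels off outliers rather than reporting two sizeable groups.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# The six agglomerative methods named in the abstract, as SciPy spells them.
METHODS = ["single", "average", "median", "complete", "centroid", "ward"]


def root_split(Z):
    """Return the sizes of the two subtrees joined by the final merge,
    together with the height recorded for that merge.

    Z is a SciPy linkage matrix: row i merges clusters Z[i, 0] and Z[i, 1]
    at height Z[i, 2], forming a cluster with Z[i, 3] points. Indices below
    n refer to original points; index n + i refers to the cluster of row i.
    """
    n = Z.shape[0] + 1
    sizes = []
    for child in Z[-1, :2]:
        child = int(child)
        sizes.append(1 if child < n else int(Z[child - n, 3]))
    return sorted(sizes), float(Z[-1, 2])


# Hypothetical unimodal dataset: 200 points from one 2-D Gaussian.
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 2))

results = {}
for method in METHODS:
    Z = linkage(points, method=method)  # Euclidean distances by default
    results[method] = root_split(Z)
    sizes, height = results[method]
    print(f"{method:>8}: root split sizes = {sizes}, root height = {height:.3f}")
```

Heights are not directly comparable across methods, since each linkage criterion defines the merge distance differently, so the root height is best read within a single method, for instance across datasets of different distributions.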

research
12/27/2016

Clustering with Confidence: Finding Clusters with Statistical Guarantees

Clustering is a widely used unsupervised learning method for finding str...
research
01/01/2020

Toward Generalized Clustering through an One-Dimensional Approach

After generalizing the concept of clusters to incorporate clusters that ...
research
07/16/2023

Using Decision Trees for Interpretable Supervised Clustering

In this paper, we address an issue of finding explainable clusters of cl...
research
10/14/2019

DISCERN: Diversity-based Selection of Centroids for k-Estimation and Rapid Non-stochastic Clustering

As one of the most ubiquitously applied unsupervised learning methods, c...
research
07/31/2020

Identifying meaningful clusters in malware data

Finding meaningful clusters in drive-by-download malware data is a parti...
research
04/29/2019

Clustering Optimization: Finding the Number and Centroids of Clusters by a Fourier-based Algorithm

We propose a Fourier-based approach for optimization of several clusteri...