An empirical comparison and characterisation of nine popular clustering methods

02/06/2021
by Christian Hennig, et al.

Nine popular clustering methods are applied to 42 real data sets. The aim is to characterise the methods in detail by means of several cluster validation indexes, introduced in Hennig (2019), that measure individual aspects of the resulting clusters, such as small within-cluster distances, separation of clusters, and closeness to a Gaussian distribution. Thirty of the data sets come with a "true" clustering. On these data sets, the similarity of the clusterings from the nine methods to the "true" clusterings is explored. Furthermore, a mixed effects regression relates the observable individual aspects of the clusters to the similarity with the "true" clusterings, which is unobservable in real clustering problems. The study gives new insight not only into the ability of the methods to discover "true" clusterings, but also into the properties of the clusterings that can be expected from the methods, which is crucial for choosing a method in a real situation where no "true" clustering is given.
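
To make the idea of "similarity to a true clustering" concrete, the following is a minimal sketch of one widely used similarity measure between two partitions, the adjusted Rand index. The abstract does not state which measure the study uses, so the choice here is an assumption for illustration only; the function name `adjusted_rand_index` and the toy label vectors are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same observations.

    Illustrative sketch only; not taken from the paper.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        # With fewer than two observations there are no pairs to compare.
        return 1.0

    # Contingency counts: observations sharing each (a, b) label pair.
    pair_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)

    # Number of observation pairs placed together in both, in a, and in b.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())
    total = comb(n, 2)

    # Agreement expected by chance, and the maximum attainable agreement.
    expected = sum_a * sum_b / total
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings use a single cluster.
        return 1.0
    return (sum_pairs - expected) / (maximum - expected)


# Toy example: cluster labels are arbitrary, so a pure relabeling
# of the "true" clustering counts as a perfect match.
truth = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]
print(adjusted_rand_index(truth, relabeled))  # 1.0
```

The index is invariant to how clusters are numbered and is corrected for chance agreement, which is why measures of this kind are a common choice when clusterings with different numbers of clusters are compared against a reference partition.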

