reval: a Python package to determine the best number of clusters with stability-based relative clustering validation

08/27/2020
by Isotta Landi, et al.

Determining the number of clusters that best partitions a dataset can be a challenging task because of 1) the lack of a priori information within an unsupervised learning framework; and 2) the absence of a unique clustering validation approach to evaluate clustering solutions. Here we present reval: a Python package that leverages stability-based relative clustering validation methods to determine the best clustering solutions. Statistical software, both in R and Python, usually relies on internal validation metrics, such as the silhouette index, to select the number of clusters that best fits the data. Meanwhile, open-source software that makes relative clustering validation techniques easy to apply is lacking. Internal validation methods exploit characteristics of the data itself to produce a result, whereas relative approaches attempt to leverage the unknown underlying distribution of data points, looking for a replicable and generalizable clustering solution. Implementing relative validation solutions can further the theory of clustering by enriching the set of available methods for investigating clustering results in different situations and for different data distributions. This work aims to contribute to this effort by developing a stability-based method that selects the best clustering solution as the one that replicates, via supervised learning, on unseen subsets of data. The package works with multiple clustering and classification algorithms, hence allowing further assessment of the stability of different clustering mechanisms.
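
The idea of selecting the solution that "replicates, via supervised learning, on unseen subsets of data" can be illustrated with a minimal sketch. It is not reval's API: it uses scikit-learn, NumPy, and SciPy directly, and the function names `misclassification_after_matching` and `stability_error`, the choice of k-means and k-nearest neighbors, the 50/50 split, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def misclassification_after_matching(labels_a, labels_b, k):
    """Fraction of points that disagree after optimally relabeling clusters.

    Hypothetical helper: cluster labels are arbitrary, so the two labelings
    are aligned with the Hungarian algorithm before counting disagreements.
    """
    # Contingency table: rows index labels_a, columns index labels_b.
    table = np.zeros((k, k), dtype=int)
    for a, b in zip(labels_a, labels_b):
        table[a, b] += 1
    # Negate the table so that minimizing cost maximizes agreement.
    rows, cols = linear_sum_assignment(-table)
    return 1.0 - table[rows, cols].sum() / len(labels_a)


def stability_error(X, k, n_splits=5, seed=0):
    """Average disagreement between transferred and native test labels."""
    errors = []
    for i in range(n_splits):
        X_tr, X_ts = train_test_split(X, test_size=0.5, random_state=seed + i)
        # 1) Cluster the training half.
        tr_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_tr)
        # 2) Learn the training partition with a supervised classifier.
        clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, tr_labels)
        # 3) Transfer that partition to the unseen half.
        predicted = clf.predict(X_ts)
        # 4) Cluster the unseen half independently and compare.
        ts_labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_ts)
        errors.append(misclassification_after_matching(predicted, ts_labels, k))
    return float(np.mean(errors))


# Synthetic data with three well-separated groups (an assumption for the demo).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]],
    cluster_std=0.6,
    random_state=42,
)
scores = {k: stability_error(X, k) for k in range(2, 6)}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

A lower score means the partition found on one half of the data is reproduced on the other half. The sketch leaves out refinements that a full method may need, such as normalizing the error against a random-labeling baseline, which some stability-based approaches use to avoid favoring small numbers of clusters.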

