Sanity Check for External Clustering Validation Benchmarks using Internal Validation Measures

09/20/2022
by Hyeon Jeon, et al.

We address the lack of reliability in benchmarking clustering techniques based on labeled datasets. A standard scheme in external clustering validation is to use class labels as ground-truth clusters, on the assumption that each class forms a single, clearly separated cluster. However, because this cluster-label matching (CLM) assumption often breaks, the absence of a sanity check for the CLM of benchmark datasets casts doubt on the validity of external validations. Still, evaluating the degree of CLM is challenging. For example, internal clustering validation measures can quantify CLM within a single dataset, where they compare different clusterings of that dataset, but they are not designed to compare clusterings across different datasets. In this work, we propose a principled way to generate between-dataset internal measures that enable the comparison of CLM across datasets. We first determine four axioms for between-dataset internal measures, complementing Ackerman and Ben-David's within-dataset axioms. We then propose processes that generalize internal measures to fulfill these new axioms, and use them to extend the widely used Calinski-Harabasz index for between-dataset CLM evaluation. Through quantitative experiments, we (1) verify the validity and necessity of the generalization processes and (2) show that the proposed between-dataset Calinski-Harabasz index accurately evaluates CLM across datasets. Finally, we demonstrate the importance of evaluating the CLM of benchmark datasets before conducting external validation.
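To make the within-dataset use of an internal measure concrete, the sketch below computes the standard Calinski-Harabasz index with class labels treated as clusters. It is only an illustration under stated assumptions: the function name `calinski_harabasz` and the toy points are hypothetical, and this is the textbook index (between-cluster dispersion over within-cluster dispersion, each divided by its degrees of freedom), not the between-dataset variant proposed in the paper.

```python
from collections import defaultdict


def calinski_harabasz(points, labels):
    """Standard (within-dataset) Calinski-Harabasz index.

    Class labels are used as the cluster assignment. Illustrative helper,
    not the between-dataset measure proposed in the paper.
    """
    n = len(points)
    dim = len(points[0])

    # Group the points by their class label.
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)
    k = len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 classes and more points than classes")

    def mean(rows):
        return [sum(r[d] for r in rows) / len(rows) for d in range(dim)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = mean(points)
    between = 0.0  # size-weighted squared distance of class centroids to the overall mean
    within = 0.0   # squared distance of each point to its own class centroid
    for rows in groups.values():
        centroid = mean(rows)
        between += len(rows) * sq_dist(centroid, overall)
        within += sum(sq_dist(r, centroid) for r in rows)

    if within == 0.0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


# Toy data (assumed for illustration): two tight groups of three points each.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
matching_labels = [0, 0, 0, 1, 1, 1]  # labels agree with the two groups
mixed_labels = [0, 1, 0, 1, 0, 1]     # labels cut across the two groups

score_matching = calinski_harabasz(toy_points, matching_labels)
score_mixed = calinski_harabasz(toy_points, mixed_labels)
print(score_matching > score_mixed)
```

Both scores here come from the same dataset, which is the setting the standard index is designed for. Comparing raw scores across datasets of different size or dimensionality is the case the abstract identifies as unsupported by existing internal measures.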


