The Area Under the ROC Curve as a Measure of Clustering Quality

09/04/2020
by Pablo Andretta Jaskowiak, et al.

The Area Under the Receiver Operating Characteristics (ROC) Curve, referred to as AUC, is a well-known performance measure in the supervised learning domain. Due to its compelling features, it has been employed in a number of studies to evaluate and compare the performance of different classifiers. In this work, we explore AUC as a performance measure in the unsupervised learning domain, more specifically in the context of cluster analysis. In particular, we elaborate on the use of AUC as an internal/relative measure of clustering quality, which we refer to as Area Under the Curve for Clustering (AUCC). We show that the AUCC of a given candidate clustering solution has an expected value under a null model of random clustering solutions, regardless of the size of the dataset and, more importantly, regardless of the number or the (im)balance of clusters under evaluation. In addition, we demonstrate that, in the context of internal/relative clustering validation, AUCC is actually a linear transformation of the Gamma criterion from Baker and Hubert (1975), for which we also formally derive a theoretical expected value for chance clusterings. We also discuss the computational complexity of these criteria and show that, while an ordinary implementation of Gamma can be computationally prohibitive and impractical for most real applications of cluster analysis, its equivalence with AUCC unveils a much more efficient and practical algorithmic procedure. Our theoretical findings are supported by experimental results.
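To make the idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that AUCC is obtained by treating every pair of objects as one instance, scoring it by similarity (here, negative Euclidean distance) and labelling it positive when both objects share a cluster, then taking the ordinary rank-based AUC. It also assumes the linear relation takes the form Gamma = 2 * AUCC - 1; the abstract does not give the coefficients, and this identity is only exact here when no within-cluster distance ties with a between-cluster distance. The function names `pair_scores`, `aucc`, and `gamma_naive` are hypothetical.

```python
from itertools import combinations
import math


def pair_scores(points, labels):
    """Return (similarity, same-cluster flag) lists over all object pairs."""
    scores, same = [], []
    for i, j in combinations(range(len(points)), 2):
        scores.append(-math.dist(points[i], points[j]))  # closer = more similar
        same.append(labels[i] == labels[j])
    return scores, same


def aucc(points, labels):
    """Rank-based AUC (Mann-Whitney U) over object pairs; sorting dominates the cost."""
    scores, same = pair_scores(points, labels)
    n_pos = sum(same)
    n_neg = len(same) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both within-cluster and between-cluster pairs")
    order = sorted(range(len(scores)), key=scores.__getitem__)
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores and give each member the average rank.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    rank_sum = sum(r for r, flag in zip(ranks, same) if flag)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def gamma_naive(points, labels):
    """Baker-Hubert Gamma by brute force: compares every within/between pair of pairs."""
    scores, same = pair_scores(points, labels)
    within = [s for s, flag in zip(scores, same) if flag]
    between = [s for s, flag in zip(scores, same) if not flag]
    concordant = sum(w > b for w in within for b in between)
    discordant = sum(w < b for w in within for b in between)
    return (concordant - discordant) / (concordant + discordant)


if __name__ == "__main__":
    # One-dimensional toy data chosen so that all pairwise distances are distinct.
    points = [(0.0,), (1.0,), (3.0,), (10.0,), (14.0,), (22.0,)]
    labels = [0, 0, 1, 1, 0, 1]
    print(aucc(points, labels), gamma_naive(points, labels))
```

The brute-force Gamma loops over every combination of a within-cluster pair and a between-cluster pair, whereas the rank-based AUC needs only one sort of the pairwise scores, which illustrates the kind of saving the abstract alludes to.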

research
04/04/2023

Clustering Validation with The Area Under Precision-Recall Curves

Confusion matrices and derived metrics provide a comprehensive framework...
research
06/17/2021

A Distance-based Separability Measure for Internal Cluster Validation

To evaluate clustering results is a significant part of cluster analysis...
research
09/02/2020

An Internal Cluster Validity Index Based on Distance-based Separability Measure

To evaluate clustering results is a significant part in cluster analysis...
research
08/27/2020

reval: a Python package to determine the best number of clusters with stability-based relative clustering validation

Determining the number of clusters that best partitions a dataset can be...
research
02/24/2020

Revisiting Saliency Metrics: Farthest-Neighbor Area Under Curve

Saliency detection has been widely studied because it plays an important...
research
05/21/2021

Computational Efficient Approximations of the Concordance Probability in a Big Data Setting

Performance measurement is an essential task once a statistical model is...
