On the True Number of Clusters in a Dataset

10/31/2018
by Amber Srivastava, et al.

One of the main challenges in cluster analysis is estimating the true number of clusters in a dataset. This paper quantifies a notion of persistence of a clustering solution over a range of resolution scales, and uses it to characterize the natural clusters and estimate the true number of clusters in a dataset. We show that this quantification of persistence is associated with evaluating the largest eigenvalue of the underlying cluster covariance matrix. Detailed experiments on a variety of standard and synthetic datasets demonstrate that the proposed persistence-based indicator outperforms existing approaches, such as the gap-statistic method, the X-means, G-means, PG-means, and dip-means algorithms, and the information-theoretic method, in accurately predicting the true number of clusters. Interestingly, our method can be explained in terms of the phase-transition phenomenon in the deterministic annealing algorithm, where the number of cluster centers changes (bifurcates) with respect to an annealing parameter. However, the approach suggested in this paper is independent of the choice of clustering algorithm and can be used in conjunction with any suitable clustering algorithm.
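The link between persistence and the largest covariance eigenvalue can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions. It borrows the standard deterministic annealing result that a cluster splits once the inverse temperature reaches 1 / (2 * lambda_max), where lambda_max is the largest eigenvalue of that cluster's covariance matrix. It then assumes that the persistence of a k-cluster solution can be scored as the log-ratio of the critical inverse temperatures at which the solution appears and at which it splits, which reduces to log(lambda_max at k-1) minus log(lambda_max at k). The function names (`kmeans_labels`, `largest_cluster_eigenvalue`, `estimate_num_clusters`) and the farthest-first k-means initialization are hypothetical choices made for this illustration only.

```python
import numpy as np


def kmeans_labels(X, k, n_iter=100):
    """Lloyd's k-means with deterministic farthest-first initialization.

    The initialization is an illustrative choice, not taken from the paper.
    """
    # First center: the point farthest from the data mean.
    first = np.argmax(((X - X.mean(axis=0)) ** 2).sum(axis=1))
    centers = [X[first]]
    # Each further center: the point farthest from all centers chosen so far.
    while len(centers) < k:
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)

    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old center if a cluster happens to become empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


def largest_cluster_eigenvalue(X, labels):
    """Largest eigenvalue over all per-cluster covariance matrices."""
    lam = 0.0
    for j in np.unique(labels):
        pts = X[labels == j]
        if len(pts) < 2:
            continue  # a single point has no spread
        cov = np.atleast_2d(np.cov(pts, rowvar=False, bias=True))
        lam = max(lam, np.linalg.eigvalsh(cov)[-1])  # eigenvalues ascend
    return lam


def estimate_num_clusters(X, k_max):
    """Return the k in 2..k_max with the largest assumed persistence score."""
    lam = {k: largest_cluster_eigenvalue(X, kmeans_labels(X, k))
           for k in range(1, k_max + 1)}
    # Assumed score: log(beta_split(k)) - log(beta_split(k-1)),
    # with beta_split(k) = 1 / (2 * lam[k]); the factor 2 cancels.
    persistence = {k: np.log(lam[k - 1]) - np.log(lam[k])
                   for k in range(2, k_max + 1)}
    return max(persistence, key=persistence.get)
```

For well-separated groups, the score peaks at the k where the largest within-cluster eigenvalue drops sharply, because the last merged pair of groups has just been separated. The score for k = 1 is left out because a single cluster exists from zero inverse temperature, which makes its log-ratio unbounded under this assumption.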

research
10/31/2018

On the Persistence of Clustering Solutions and True Number of Clusters in a Dataset

Typically clustering algorithms provide clustering solutions with prespe...
research
11/22/2022

Global k-means++: an effective relaxation of the global k-means clustering algorithm

The k-means algorithm is a very prevalent clustering method because of i...
research
10/28/2017

Partitioning Relational Matrices of Similarities or Dissimilarities using the Value of Information

In this paper, we provide an approach to clustering relational matrices ...
research
04/21/2020

Revealing Cluster Structures Based on Mixed Sampling Frequencies

This paper proposes a new nonparametric mixed data sampling (MIDAS) mode...
research
12/02/2019

Identifying the number of clusters for K-Means: A hypersphere density based approach

Application of K-Means algorithm is restricted by the fact that the numb...
research
05/04/2022

Exploring Rawlsian Fairness for K-Means Clustering

We conduct an exploratory study that looks at incorporating John Rawls' ...
research
10/15/2020

Cascade of Phase Transitions for Multi-Scale Clustering

We present a novel framework exploiting the cascade of phase transitions...
