On the Persistence of Clustering Solutions and True Number of Clusters in a Dataset

10/31/2018
by Amber Srivastava, et al.

Clustering algorithms typically produce solutions with a prespecified number of clusters. Because the true number of underlying clusters in a dataset is rarely known a priori, a metric is needed to compare clustering solutions with different numbers of clusters. This article quantifies a notion of persistence of clustering solutions that enables such comparisons. Persistence relates to the range of data-resolution scales over which a clustering solution persists; it is quantified in terms of the maximum over the two-norms of all the associated cluster-covariance matrices. We thus associate a persistence value with each element in a set of clustering solutions with different numbers of clusters. We show that, for datasets where the natural clusters are known a priori, the clustering solutions that identify the natural clusters are the most persistent; in this way, the notion can be used to identify solutions with the true number of clusters. Detailed experiments on a variety of standard and synthetic datasets demonstrate that the proposed persistence-based indicator outperforms existing approaches, such as the gap-statistic method, X-means, G-means, PG-means, dip-means, and the information-theoretic method, in accurately identifying the clustering solutions with the true number of clusters. Interestingly, our method can be explained in terms of the phase-transition phenomenon in the deterministic annealing algorithm, where the number of distinct cluster centers changes (bifurcates) with respect to an annealing parameter.
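
The quantity named in the abstract, the maximum over the two-norms of the cluster-covariance matrices, can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function name `max_cluster_cov_norm` is hypothetical, hard cluster labels are assumed, each covariance is normalized by the cluster size, and the abstract does not say how this quantity is converted into the final persistence value, so that step is left out.

```python
import numpy as np

def max_cluster_cov_norm(X, labels):
    """Largest spectral (two-) norm among the per-cluster covariance matrices.

    X      : array of shape (n_samples, n_features)
    labels : array of shape (n_samples,), one hard cluster label per sample
    Assumption: covariances are normalized by the cluster size (not size - 1).
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    norms = []
    for c in np.unique(labels):
        pts = X[labels == c]                     # samples assigned to cluster c
        centered = pts - pts.mean(axis=0)        # subtract the cluster centroid
        cov = centered.T @ centered / len(pts)   # cluster covariance matrix
        norms.append(np.linalg.norm(cov, 2))     # ord=2 gives the largest singular value
    return max(norms)

# Toy data: two tight pairs of points far apart along the first axis.
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [10.0, 4.0]])
two_clusters = max_cluster_cov_norm(X, [0, 0, 1, 1])
one_cluster = max_cluster_cov_norm(X, [0, 0, 0, 0])
print(two_clusters, one_cluster)
```

On this toy data the two-cluster labeling gives a much smaller maximum norm than lumping all points into one cluster, which is the kind of per-solution quantity the abstract describes comparing across different numbers of clusters.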

research
10/31/2018

On the True Number of Clusters in a Dataset

One of the main challenges in cluster analysis is estimating the true nu...
research
10/28/2017

Partitioning Relational Matrices of Similarities or Dissimilarities using the Value of Information

In this paper, we provide an approach to clustering relational matrices ...
research
10/24/2018

A Binary Optimization Approach for Constrained K-Means Clustering

K-Means clustering still plays an important role in many computer vision...
research
02/13/2021

ThetA – fast and robust clustering via a distance parameter

Clustering is a fundamental problem in machine learning where distance-b...
research
10/15/2020

Cascade of Phase Transitions for Multi-Scale Clustering

We present a novel framework exploiting the cascade of phase transitions...
research
05/02/2023

Jacobian-Scaled K-means Clustering for Physics-Informed Segmentation of Reacting Flows

This work introduces Jacobian-scaled K-means (JSK-means) clustering, whi...
research
02/27/2017

Contextualization of topics: Browsing through the universe of bibliographic information

This paper describes how semantic indexing can help to generate a contex...
