k is the Magic Number -- Inferring the Number of Clusters Through Nonparametric Concentration Inequalities

07/04/2019
by Sibylle Hess, et al.

Most convex and nonconvex clustering algorithms come with one crucial parameter: the k in k-means. To this day, there is no generally accepted way to determine this parameter accurately. Popular methods are simple yet theoretically unfounded, such as searching for an elbow in the curve of a given cost measure. In contrast, statistically founded methods often make strict assumptions about the data distribution or come with their own optimization scheme for the clustering objective. This limits either the set of applicable datasets or the set of applicable clustering algorithms. In this paper, we strive to determine the number of clusters by answering a simple question: given two clusters, is it likely that they jointly stem from a single distribution? To this end, we propose a bound on the probability that two clusters originate from the distribution of the unified cluster, specified only by the sample mean and variance. Our method is applicable as a simple wrapper around the result of any clustering method that minimizes the objective of k-means, which includes Gaussian mixtures and Spectral Clustering. In our experimental evaluation, we focus on an application to nonconvex clustering and demonstrate the suitability of our theoretical results. Our SpecialK clustering algorithm automatically determines the appropriate value for k, without requiring any data transformation or projection, and without assumptions on the data distribution. Additionally, it can decide that the data consists of only a single cluster, something many existing algorithms cannot do.
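
To make the central question concrete, the following is a minimal sketch of a merge test that uses only the sample mean and variance of the unified cluster. It is not the bound proposed in the paper: it substitutes a plain multivariate Chebyshev inequality, under the simplifying assumption that each cluster behaves like an i.i.d. sample from the unified distribution. The function name `chebyshev_split_bound` and the plug-in variance estimate are assumptions made for illustration.

```python
import numpy as np


def chebyshev_split_bound(a, b):
    """Crude upper bound on the probability that two clusters whose
    centroids lie this far from the joint mean were drawn from the
    distribution of their union.

    Illustrative stand-in only (hypothetical helper, not the paper's bound).
    For n i.i.d. points with mean mu and total variance s2, Chebyshev gives
    P(||sample_mean - mu|| >= t) <= s2 / (n * t**2).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    union = np.vstack([a, b])

    # Sample mean and total variance (mean squared distance to the mean)
    # of the unified cluster, used as plug-in estimates.
    mu = union.mean(axis=0)
    s2 = np.mean(np.sum((union - mu) ** 2, axis=1))

    bounds = []
    for part in (a, b):
        # Squared distance between this cluster's centroid and the joint mean.
        dev2 = np.sum((part.mean(axis=0) - mu) ** 2)
        if dev2 == 0.0:
            bounds.append(1.0)  # no deviation observed: the bound is vacuous
        else:
            bounds.append(min(1.0, s2 / (len(part) * dev2)))

    # Both deviations must occur jointly, so the smaller bound applies.
    return min(bounds)
```

A small value suggests keeping the two clusters apart, while a value near 1 gives no evidence against merging them. Because this sketch ignores that the clustering algorithm itself chose the split, it should be read as an illustration of the idea rather than a calibrated test.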


