Algorithms for finding k in k-means

by Chiranjib Bhattacharyya, et al.

k-means clustering requires as input the exact value of k, the number of clusters. Two questions are open: (i) Is there a data-determined definition of k that is provably correct? (ii) Is there a polynomial-time algorithm to find k from data? This paper provides the first affirmative answers to both questions. As is common in the literature, we assume that the data admits an unknown Ground Truth (GT) clustering with separated cluster centers. This assumption alone is not sufficient to answer yes to (i). We therefore assume a novel but natural second constraint, called no tight sub-cluster (NTSC), which stipulates that no substantially large subset of a GT cluster can be "tighter" (in a sense we define) than the cluster itself. Our affirmative answers to (i) and (ii) hold under these two deterministic assumptions. We give a polynomial-time algorithm to identify k; it relies on NTSC to peel off one cluster at a time by identifying points that are tightly packed. We also show that our algorithm(s) apply to data generated by mixtures of Gaussians and, more generally, by mixtures of sub-Gaussian pdfs, and hence can find the number of components of the mixture from data. To our knowledge, previous results, even for these specialized settings, generally assume that k is given along with the data.
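
The abstract does not spell out the algorithm, so the following is only a toy illustration of the general "peel off one tightly packed cluster at a time" idea, not the paper's method. The function name `count_clusters_by_peeling` and the parameters `radius` and `min_size` are hypothetical; in particular, the sketch assumes a known distance scale, whereas the paper derives its notion of tightness from the data under the NTSC assumption.

```python
import numpy as np

def count_clusters_by_peeling(points, radius, min_size):
    """Toy greedy peeling (an assumption-laden sketch, not the paper's algorithm).

    Repeatedly find the point with the most neighbours within `radius`,
    remove that whole neighbourhood as one cluster, and count how many
    such removals happen before no group of `min_size` points remains.
    """
    remaining = np.asarray(points, dtype=float)
    k = 0
    while len(remaining) >= min_size:
        # Pairwise Euclidean distances among the points not yet peeled off.
        diffs = remaining[:, None, :] - remaining[None, :, :]
        dists = np.linalg.norm(diffs, axis=2)
        within = dists <= radius
        counts = within.sum(axis=1)
        best = int(np.argmax(counts))
        if counts[best] < min_size:
            break  # no tightly packed group left; leftovers are treated as noise
        # Peel off the densest neighbourhood and count it as one cluster.
        remaining = remaining[~within[best]]
        k += 1
    return k
```

On three small, far-apart groups of points (for example, 5x5 grids of spacing 0.2 placed 10 units apart), calling `count_clusters_by_peeling(points, radius=2.0, min_size=10)` peels one group per iteration. The sketch is quadratic in the number of points per iteration and is meant only to make the peeling idea concrete.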


Clustering Semi-Random Mixtures of Gaussians

Gaussian mixture models (GMM) are the most widely used statistical model...

Information Elicitation Meets Clustering

In the setting where we want to aggregate people's subjective evaluation...

Structures of Spurious Local Minima in k-means

k-means clustering is a fundamental problem in unsupervised learning. Th...

Clustering Mixtures with Almost Optimal Separation in Polynomial Time

We consider the problem of clustering mixtures of mean-separated Gaussia...

Beyond Parallel Pancakes: Quasi-Polynomial Time Guarantees for Non-Spherical Gaussian Mixtures

We consider mixtures of k≥ 2 Gaussian components with unknown means and ...

From Clustering to Cluster Explanations via Neural Networks

A wealth of algorithms have been developed to extract natural cluster st...

An Analysis of the t-SNE Algorithm for Data Visualization

A first line of attack in exploratory data analysis is data visualizatio...