Penalized k-means algorithms for finding the correct number of clusters in a dataset

11/15/2019
by Behzad Kamgar-Parsi, et al.

In many applications we want to find the number of clusters in a dataset. A common approach is to use the penalized k-means algorithm with an additive penalty term that is linear in the number of clusters. An open problem is how to estimate the coefficient of the penalty term. Since estimating this coefficient in a principled manner appears to be intractable for general clusters, we investigate "ideal clusters", i.e., identical spherical clusters with no overlap and no outlier background noise. In this paper: (a) We derive, for the case of ideal clusters, rigorous bounds for the coefficient of the additive penalty. Unsurprisingly, the bounds depend on the correct number of clusters, which is what we want to find in the first place. We further show that the additive penalty, even in this simplest case of ideal clusters, typically produces a weak and often ambiguous signature for the correct number of clusters. (b) As an alternative, we examine k-means with a multiplicative penalty and show that this parameter-free formulation has a stronger, and less often ambiguous, signature for the correct number of clusters. We also empirically investigate certain types of deviations from the ideal-cluster assumption and show that a combination of k-means with additive and multiplicative penalties can resolve ambiguous solutions.
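To make the two criteria concrete, here is a minimal Python sketch, not the authors' implementation. It assumes the additive criterion has the form SSE(k) + lam * k, as the abstract describes, and it assumes the multiplicative criterion is k * SSE(k); the abstract does not state that formula, so treat it as an illustrative guess. The toy data (three cross-shaped clusters), the deterministic farthest-point initialization, and all function and variable names are hypothetical choices made only to keep the example reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic initialization (an assumption of this sketch):
    start at the first point, then repeatedly add the point that is
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    return centers


def kmeans_sse(points, k, n_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of
    squared errors (SSE) of the partition it converges to."""
    centers = farthest_first(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Move each center to the mean of its group; keep it if the group is empty.
        updated = [
            tuple(sum(coord) / len(g) for coord in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if updated == centers:
            break
        centers = updated
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def select_k(points, k_max, lam):
    """Return (k chosen by additive penalty, k chosen by multiplicative penalty)."""
    sse = {k: kmeans_sse(points, k) for k in range(1, k_max + 1)}
    additive = {k: e + lam * k for k, e in sse.items()}    # needs the coefficient lam
    multiplicative = {k: k * e for k, e in sse.items()}    # assumed form, no coefficient
    return min(additive, key=additive.get), min(multiplicative, key=multiplicative.get)


# Toy "ideal" data: three identical, well-separated clusters of five points each.
offsets = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
true_centers = [(0, 0), (10, 0), (0, 10)]
points = [(cx + dx, cy + dy) for cx, cy in true_centers for dx, dy in offsets]

print(select_k(points, k_max=6, lam=10.0))    # (3, 3)
print(select_k(points, k_max=6, lam=1000.0))  # (1, 3)
```

On this toy dataset the additive criterion recovers three clusters only when lam lies in a suitable range, while an overly large lam collapses the answer to one cluster. The assumed multiplicative criterion needs no coefficient here, but this single example says nothing about how either criterion behaves on less ideal data.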

