Noisy, Greedy and Not So Greedy k-means++

12/02/2019
by   Anup Bhattacharya, et al.

The k-means++ algorithm due to Arthur and Vassilvitskii has become the most popular seeding method for Lloyd's algorithm. It samples the first center uniformly at random from the data set and the remaining k-1 centers iteratively by D^2-sampling, in which the probability that a data point becomes the next center is proportional to its squared distance to the closest center chosen so far. k-means++ is known to achieve an approximation factor of O(log k) in expectation. Already in the original paper on k-means++, Arthur and Vassilvitskii suggested a variation called the greedy k-means++ algorithm, in which multiple candidate centers are sampled by D^2-sampling in each iteration and only the candidate that decreases the objective the most is chosen as the center for that iteration. They state as an open question whether this variation also yields an O(log k)-approximation (or an even better one). We show that it does not, by presenting a family of instances on which greedy k-means++ yields only an Ω(ℓ·log k)-approximation in expectation, where ℓ is the number of candidate centers sampled in each iteration. We also study a second variation, which we call the noisy k-means++ algorithm. In this variation only one center is sampled in every iteration, but no longer exactly by D^2-sampling. Instead, in each iteration an adversary may change the probability that D^2-sampling assigns to each point individually by a factor between 1-ϵ_1 and 1+ϵ_2, for parameters ϵ_1 ∈ [0,1) and ϵ_2 > 0. We prove that noisy k-means++ computes an O(log^2 k)-approximation in expectation. We also discuss some applications of this result.
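The three seeding rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the names `seed_centers`, `ell`, and `noise` are hypothetical, and the adversary of noisy k-means++ is replaced here by an independent uniform random factor per point, which is an assumption made purely for demonstration. Because `random.choices` renormalizes the perturbed weights, the effective change in each probability may fall slightly outside the interval [1-ϵ_1, 1+ϵ_2].

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers):
    """k-means objective: sum of squared distances to the closest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def seed_centers(points, k, ell=1, noise=None, rng=None):
    """Choose k centers from `points`.

    ell=1, noise=None  -> plain k-means++ (D^2-sampling)
    ell>1              -> greedy k-means++ (best of ell candidates)
    noise=(eps1, eps2) -> noisy k-means++ with a RANDOM stand-in for the
                          adversary (assumption; the paper allows any adversary)
    """
    rng = rng or random.Random()
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance of every point to its closest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        weights = d2
        if noise is not None:
            eps1, eps2 = noise
            # Perturb each D^2 weight by a factor in [1 - eps1, 1 + eps2].
            weights = [w * rng.uniform(1 - eps1, 1 + eps2) for w in d2]
        if sum(weights) == 0:
            # Every point coincides with a center; fall back to uniform.
            candidates = [rng.choice(points) for _ in range(ell)]
        else:
            candidates = rng.choices(points, weights=weights, k=ell)
        # Greedy step: keep the candidate that lowers the objective the most.
        # With ell == 1 this simply returns the single sampled point.
        best = min(
            candidates,
            key=lambda c: sum(min(d, sq_dist(p, c)) for p, d in zip(points, d2)),
        )
        centers.append(best)
    return centers


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1), (0.0, 10.0)]
    rng = random.Random(0)
    for label, kwargs in [
        ("k-means++", {}),
        ("greedy (ell=3)", {"ell": 3}),
        ("noisy (eps=0.5)", {"noise": (0.5, 0.5)}),
    ]:
        chosen = seed_centers(pts, 3, rng=rng, **kwargs)
        print(label, chosen, kmeans_cost(pts, chosen))
```

On small, well-separated inputs like the one above, all three variants tend to pick one point per cluster; the lower bound in the abstract concerns specially constructed instance families, which this sketch does not reproduce.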


