Improved Outlier Robust Seeding for k-means

09/06/2023
by Amit Deshpande, et al.

k-means is a popular clustering objective, although it is inherently non-robust and sensitive to outliers. Its popular seeding (initialization) method, k-means++, uses D^2 sampling and comes with a provable O(log k) approximation guarantee <cit.>. However, in the presence of adversarial noise or outliers, D^2 sampling is more likely to pick centers from distant outliers than from inlier clusters, and therefore its approximation guarantee w.r.t. the k-means solution on inliers does not hold. Assuming that the outliers constitute a constant fraction of the given data, we propose a simple variant of the D^2 sampling distribution that makes it robust to outliers. Our algorithm runs in O(ndk) time, outputs O(k) clusters, discards marginally more points than the optimal number of outliers, and comes with a provable O(1) approximation guarantee. Our algorithm can also be modified to output exactly k clusters instead of O(k) clusters, while keeping its running time linear in n and d. This is an improvement over previous results for robust k-means based on LP relaxation and rounding <cit.>, <cit.> and robust k-means++ <cit.>. Our empirical results show the advantage of our algorithm over k-means++ <cit.>, uniform random seeding, greedy sampling for k-means <cit.>, and robust k-means++ <cit.>, on standard real-world and synthetic data sets used in previous work. Our proposal is easily amenable to scalable, faster, parallel implementations of k-means++ <cit.> and is of independent interest for coreset constructions in the presence of outliers <cit.>.
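
To make the outlier sensitivity concrete, here is a minimal sketch of the standard D^2 sampling used by k-means++, in plain Python. It is not the robust variant proposed in the paper, whose modified distribution is not spelled out in this abstract. The function names (`sq_dist`, `d2_seeding`, `make_points`, `outlier_pick_rate`) and the toy data set are assumptions made purely for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_seeding(points, k, rng):
    """Standard k-means++ seeding (illustrative sketch, not the paper's variant).

    The first center is uniform at random; each later center is drawn with
    probability proportional to its squared distance to the nearest center
    chosen so far. Each round costs O(nd), so the total is O(ndk).
    """
    centers = [rng.choice(points)]
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center: fall back to uniform.
            centers.append(rng.choice(points))
        else:
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
            centers.append(points[idx])
        # Keep, for every point, the squared distance to its nearest center.
        d2 = [min(old, sq_dist(p, centers[-1])) for old, p in zip(d2, points)]
    return centers


OUTLIER = (1000.0, 0.0)  # hypothetical single far-away outlier


def make_points(rng):
    """Toy data: two tight inlier clusters near (0, 0) and (10, 0), plus OUTLIER."""
    pts = [(rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)) for _ in range(50)]
    pts += [(rng.gauss(10.0, 0.1), rng.gauss(0.0, 0.1)) for _ in range(50)]
    pts.append(OUTLIER)
    return pts


def outlier_pick_rate(trials, seed):
    """Fraction of k=2 seedings in which OUTLIER ends up as one of the centers."""
    rng = random.Random(seed)
    pts = make_points(rng)
    hits = sum(OUTLIER in d2_seeding(pts, 2, rng) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print(outlier_pick_rate(trials=200, seed=0))
```

In this toy setup a single distant point carries almost all of the D^2 sampling mass once the first center lands on an inlier, so with k = 2 one of the two true clusters is usually left without a center. This is the failure mode that motivates a robust sampling distribution.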


