Adapting k-means algorithms for outliers

07/02/2020
by Christoph Grunau, et al.

This paper shows how to adapt several simple, classical sampling-based algorithms for the k-means problem to the setting with outliers. Recently, Bhaskara et al. (NeurIPS 2019) showed how to adapt the classical k-means++ algorithm to this setting. However, their algorithm needs to output O(log(k) · z) outliers, where z is the number of true outliers, to match the O(log k)-approximation guarantee of k-means++. In this paper, we build on their ideas and adapt several sequential and distributed k-means algorithms to the setting with outliers, with substantially stronger theoretical guarantees: our algorithms output (1+ε)z outliers while achieving an O(1/ε)-approximation to the objective function. In the sequential setting, we achieve this by adapting a recent algorithm of Lattanzi and Sohler (ICML 2019). In the distributed setting, we adapt a simple algorithm of Guha et al. (IEEE Trans. Know. and Data Engineering 2003) and the popular k-means|| algorithm of Bahmani et al. (PVLDB 2012). A theoretical application of our techniques is an algorithm with running time Õ(nk^2/z) that achieves an O(1)-approximation to the objective function while outputting O(z) outliers, assuming k ≪ z ≪ n. This is complemented by a matching lower bound of Ω(nk^2/z) for this problem in the oracle model.
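To make the objective concrete, the following is a minimal sketch of the standard k-means cost with outliers: for a fixed set of centers, the num_outliers points farthest from their nearest center are discarded, and the cost is the sum of squared distances of the remaining points. It illustrates the objective only, not any of the algorithms from the paper; the function names kmeans_outlier_cost and sq_dist are hypothetical and chosen for this example.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_outlier_cost(
    points: Sequence[Point],
    centers: Sequence[Point],
    num_outliers: int,
) -> Tuple[float, List[int]]:
    """Return (cost, outlier_indices) for the given centers.

    The num_outliers points farthest from their nearest center are
    treated as outliers and excluded from the cost. This is an
    illustrative helper, not code from the paper.
    """
    # Squared distance from each point to its nearest center.
    dists = [min(sq_dist(p, c) for c in centers) for p in points]
    # Point indices ordered from farthest to closest.
    order = sorted(range(len(points)), key=lambda i: dists[i], reverse=True)
    outliers = sorted(order[:num_outliers])
    cost = sum(dists[i] for i in order[num_outliers:])
    return cost, outliers


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (11.0, 10.0), (100.0, 100.0)]
    ctrs = [(0.0, 0.0), (10.0, 10.0)]
    print(kmeans_outlier_cost(pts, ctrs, 0))  # the far point dominates the cost
    print(kmeans_outlier_cost(pts, ctrs, 1))  # the far point is discarded
```

In this terminology, an algorithm that outputs (1+ε)z outliers is compared against the optimal cost with only z outliers, so calling the helper with a larger num_outliers for the candidate solution mirrors that relaxed comparison.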


research
10/18/2018

Distributed k-Clustering for Data with Heavy Noise

In this paper, we consider the k-center/median/means clustering with out...
research
03/05/2020

Simple and sharp analysis of k-means||

We present a truly simple analysis of k-means|| (Bahmani et al., PVLDB 2...
research
09/06/2023

Improved Outlier Robust Seeding for k-means

The k-means is a popular clustering objective, although it is inherently...
research
03/05/2020

Fast Noise Removal for k-Means Clustering

This paper considers k-means clustering in the presence of noise. It is ...
research
07/25/2023

Noisy k-means++ Revisited

The k-means++ algorithm by Arthur and Vassilvitskii [SODA 2007] is a cla...
research
09/10/2019

Robust Multivariate Estimation Based On Statistical Data Depth Filters

In the classical contamination models, such as the gross-error (Huber an...
research
11/28/2017

Adapting Sequential Algorithms to the Distributed Setting

In this paper we aim to define a robust family of sequential algorithms ...
