Fast Noise Removal for k-Means Clustering

03/05/2020
by   Sungjin Im, et al.
0

This paper considers k-means clustering in the presence of noise. It is known that k-means clustering is highly sensitive to noise, and thus noise should be removed to obtain a quality solution. A popular formulation of this problem is called k-means clustering with outliers. The goal of k-means clustering with outliers is to discard up to a specified number z of points as noise/outliers and then find a k-means solution on the remaining data. The problem has received significant attention, yet current algorithms with theoretical guarantees suffer from either high running time or inherent loss in the solution quality. The main contribution of this paper is two-fold. Firstly, we develop a simple greedy algorithm that has provably strong worst case guarantees. The greedy algorithm adds a simple preprocessing step to remove noise, which can be combined with any k-means clustering algorithm. This algorithm gives the first pseudo-approximation-preserving reduction from k-means with outliers to k-means without outliers. Secondly, we show how to construct a coreset of size O(k log n). When combined with our greedy algorithm, we obtain a scalable, near linear time algorithm. The theoretical contributions are verified experimentally by demonstrating that the algorithm quickly removes noise and obtains a high-quality clustering.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
12/22/2020

Fast and Accurate k-means++ via Rejection Sampling

k-means++ <cit.> is a widely used clustering algorithm that is easy to i...
research
09/06/2023

Improved Outlier Robust Seeding for k-means

The k-means is a popular clustering objective, although it is inherently...
research
07/02/2020

Adapting k-means algorithms for outliers

This paper shows how to adapt several simple and classical sampling-base...
research
01/24/2019

Greedy Strategy Works for Clustering with Outliers and Coresets Construction

We study the problems of clustering with outliers in high dimension. Tho...
research
02/16/2022

Spatial Transformer K-Means

K-means defines one of the most employed centroid-based clustering algor...
research
11/22/2022

Algorithm for detection of illegal discounting in North Carolina Education Lottery

The lottery is a very lucrative industry. Popular fascination often focu...
research
07/25/2023

Noisy k-means++ Revisited

The k-means++ algorithm by Arthur and Vassilvitskii [SODA 2007] is a cla...

Please sign up or login with your details

Forgot password? Click here to reset