Distributed k-Means with Outliers in General Metrics

02/16/2022
by Enrico Dandolo, et al.

Center-based clustering is a pivotal primitive for unsupervised learning and data analysis. A popular variant is the k-means problem: given a set P of points from a metric space and a parameter k<|P|, determine a subset S of k centers that minimizes the sum of the squared distances of the points in P from their closest centers. A more general formulation, known as k-means with z outliers, was introduced to deal with noisy datasets; it features a further parameter z and allows up to z points of P (the outliers) to be disregarded when computing this sum. We present a distributed, coreset-based, 3-round approximation algorithm for k-means with z outliers in general metric spaces, using MapReduce as the computational model. Our distributed algorithm requires sublinear local memory per reducer and yields a solution whose approximation ratio is an additive term O(γ) away from the one achievable by the best known sequential (possibly bicriteria) algorithm, where γ can be made arbitrarily small. An important feature of our algorithm is that it obliviously adapts to the intrinsic complexity of the dataset, captured by the doubling dimension D of the metric space. To the best of our knowledge, no previous distributed approach has attained similar quality-performance tradeoffs for general metrics.
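To make the objective concrete, the following minimal sketch evaluates the cost of a given set of centers under the k-means with z outliers formulation described above: each point contributes its squared distance to the closest center, and the z largest contributions are dropped. It only evaluates the objective for fixed centers; it is not the distributed coreset-based algorithm of the paper. The function name `kmeans_outliers_cost` and the `dist` callback are hypothetical names chosen for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def kmeans_outliers_cost(
    points: Sequence[T],
    centers: Sequence[T],
    z: int,
    dist: Callable[[T, T], float],
) -> float:
    """Cost of `centers` on `points` when up to z points may be ignored.

    `dist` is any metric on the points (an assumption of this sketch);
    nothing here relies on Euclidean structure.
    """
    if not centers:
        raise ValueError("at least one center is required")
    if z < 0:
        raise ValueError("z must be non-negative")

    # Squared distance of every point from its closest center, ascending.
    squared = sorted(min(dist(p, c) for c in centers) ** 2 for p in points)

    # Discarding the z farthest points minimizes the sum for fixed centers.
    keep = max(len(squared) - z, 0)
    return sum(squared[:keep])
```

For example, with points 0, 1, 2, 10, 100 on the real line, a single center at 1, and the absolute difference as the metric, the two far points dominate the plain k-means cost (z = 0) but are disregarded when z = 2.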


research
04/29/2019

Accurate MapReduce Algorithms for k-median and k-means in General Metric Spaces

Center-based clustering is a fundamental primitive for data analysis and...
research
02/18/2020

Coreset-based Strategies for Robust Center-type Problems

Given a dataset V of points from some metric space, the popular k-center...
research
05/06/2018

Generalized Center Problems with Outliers

We study the F-center problem with outliers: given a metric space (X,d),...
research
02/18/2021

No-Substitution k-means Clustering with Low Center Complexity and Memory

Clustering is a fundamental task in machine learning. Given a dataset X ...
research
01/07/2022

k-Center Clustering with Outliers in Sliding Windows

Metric k-center clustering is a fundamental unsupervised learning primit...
research
04/13/2021

A New Coreset Framework for Clustering

Given a metric space, the (k,z)-clustering problem consists of finding k...
research
07/31/2020

MSPP: A Highly Efficient and Scalable Algorithm for Mining Similar Pairs of Points

The closest pair of points problem or closest pair problem (CPP) is an i...
