A Practical Algorithm for Distributed Clustering and Outlier Detection

05/24/2018
by Jiecao Chen, et al.

We study the classic k-means and k-median clustering problems, which are fundamental in unsupervised learning, in the setting where data are partitioned across multiple sites and where we are allowed to discard a small portion of the data by labeling them as outliers. We propose a simple approach based on constructing a small summary of the original dataset. The proposed method is time- and communication-efficient, has good approximation guarantees, and can identify the global outliers effectively. To the best of our knowledge, this is the first practical algorithm with theoretical guarantees for distributed clustering with outliers. Our experiments on both real and synthetic data demonstrate the clear superiority of our algorithm over all the baseline algorithms in almost all metrics.
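
The abstract does not spell out how the summary is built, so the following is only a minimal sketch of the general summarize-then-cluster pattern it describes, not the authors' construction. It assumes each site sends a few weighted representatives to a coordinator, which then runs weighted Lloyd-style iterations while setting aside the farthest summary points up to an outlier budget. The function names (`local_summary`, `cluster_with_outliers`), the uniform sampling of representatives, and the parameters `m`, `k`, and `t` are all hypothetical choices made for illustration.

```python
import random
from math import dist

def local_summary(points, m, rng):
    """Hypothetical site-side step: sample up to m local points as
    representatives and weight each by how many local points are nearest to it."""
    reps = rng.sample(points, min(m, len(points)))
    weights = [0] * len(reps)
    for p in points:
        nearest = min(range(len(reps)), key=lambda i: dist(p, reps[i]))
        weights[nearest] += 1
    # Drop representatives that attracted no points (possible with duplicates).
    return [(r, w) for r, w in zip(reps, weights) if w > 0]

def cluster_with_outliers(summary, k, t, rng, iters=20):
    """Hypothetical coordinator-side step: weighted Lloyd iterations that set
    aside the farthest summary points, up to total weight t, before each update."""
    centers = [p for p, _ in rng.sample(summary, k)]
    outliers = []
    for _ in range(iters):
        # Rank summary points by distance to their nearest center, farthest first.
        ranked = sorted(summary,
                        key=lambda pw: -min(dist(pw[0], c) for c in centers))
        outliers, inliers, budget = [], [], t
        for p, w in ranked:
            if w <= budget:          # still within the outlier budget
                outliers.append((p, w))
                budget -= w
            else:
                inliers.append((p, w))
        # Weighted mean update over inliers only.
        dim = len(centers[0])
        sums = [[0.0] * dim for _ in centers]
        totals = [0] * len(centers)
        for p, w in inliers:
            j = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            totals[j] += w
            for d, x in enumerate(p):
                sums[j][d] += w * x
        centers = [tuple(s / totals[j] for s in sums[j]) if totals[j] else centers[j]
                   for j in range(len(centers))]
    # Outliers are those flagged in the final iteration.
    return centers, outliers

# Two synthetic "sites", one of which holds a single far-away point.
rng = random.Random(0)
sites = [[(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for _ in range(50)]
         for cx, cy in [(0.0, 0.0), (10.0, 10.0)]]
sites[0].append((100.0, 100.0))

# Only the weighted summaries travel to the coordinator.
summary = [pair for site in sites for pair in local_summary(site, 8, rng)]
centers, outliers = cluster_with_outliers(summary, k=2, t=1, rng=rng)
```

In this sketch the communication cost is the summary size (at most `m` weighted points per site) rather than the full dataset. Uniform sampling can absorb a true outlier into a nearby representative, which is one reason a more careful summary construction would be needed in practice.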


research
10/18/2018

Distributed k-Clustering for Data with Heavy Noise

In this paper, we consider the k-center/median/means clustering with out...
research
05/02/2023

FPT Approximations for Capacitated/Fair Clustering with Outliers

Clustering problems such as k-Median, and k-Means, are motivated from ap...
research
06/03/2013

Distributed k-Means and k-Median Clustering on General Topologies

This paper provides new algorithms for distributed clustering for two po...
research
12/01/2022

Clustering What Matters: Optimal Approximation for Clustering with Outliers

Clustering with outliers is one of the most fundamental problems in Comp...
research
06/16/2023

Adversarially robust clustering with optimality guarantees

We consider the problem of clustering data points coming from sub-Gaussi...
research
06/15/2020

Hypergraph Clustering Based on PageRank

A hypergraph is a useful combinatorial object to model ternary or higher...
research
04/05/2023

A system for exploring big data: an iterative k-means searchlight for outlier detection on open health data

The interactive exploration of large and evolving datasets is challengin...
