Fast Distributed k-Means with a Small Number of Rounds

01/31/2022
by Tom Hess, et al.

We propose a new algorithm for k-means clustering in a distributed setting, where the data is spread across many machines and a coordinator communicates with these machines to compute the output clustering. Our algorithm guarantees a cost approximation factor and a number of communication rounds that depend only on the computational capacity of the coordinator. Moreover, the algorithm includes a built-in stopping mechanism, which allows it to use fewer communication rounds whenever possible. We show, both theoretically and empirically, that 1-4 rounds suffice in many natural cases. In comparison with the popular k-means|| algorithm, our approach can exploit a larger coordinator capacity to obtain a smaller number of rounds. Our experiments show that the k-means cost obtained by the proposed algorithm is usually better than the cost obtained by k-means||, even when the latter is allowed a larger number of rounds. Moreover, the machine running time in our approach is considerably smaller than that of k-means||. Code for running the algorithm and experiments is available at https://github.com/selotape/distributed_k_means.
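
For readers unfamiliar with the objective being approximated, the following is a minimal sketch of the standard k-means cost (sum of squared distances from each point to its nearest center) evaluated in a distributed fashion. It is not the algorithm proposed in the paper and is not taken from the linked repository; the function names `local_kmeans_cost` and `distributed_kmeans_cost` are hypothetical, and the "machines" are simply modeled as a list of NumPy arrays. The sketch only illustrates that the cost decomposes into per-machine sums, so each machine can report a single number to the coordinator.

```python
import numpy as np


def local_kmeans_cost(points, centers):
    """k-means cost of one machine's points for a fixed set of centers.

    Hypothetical helper: returns the sum of squared Euclidean distances
    from each local point to its nearest center.
    """
    # Pairwise differences via broadcasting, shape (n_points, n_centers, dim).
    diffs = points[:, None, :] - centers[None, :, :]
    # Squared distances, shape (n_points, n_centers).
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Each point pays the squared distance to its closest center.
    return float(np.sum(np.min(sq_dists, axis=1)))


def distributed_kmeans_cost(machines, centers):
    """Coordinator-side total cost (illustrative assumption, not the paper's protocol).

    Each entry of `machines` stands in for the data held by one machine;
    the coordinator only needs one scalar per machine.
    """
    return sum(local_kmeans_cost(points, centers) for points in machines)


# Toy data: two "machines" holding three 2-D points in total.
machines = [
    np.array([[0.0, 0.0], [1.0, 0.0]]),
    np.array([[10.0, 0.0]]),
]
centers = np.array([[0.0, 0.0], [10.0, 0.0]])
total_cost = distributed_kmeans_cost(machines, centers)
```

Because the cost is a plain sum over points, the distributed total equals the cost computed on the pooled data, which is why a coordinator can compare candidate center sets without ever collecting the raw points.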


research
11/28/2022

A Faster k-means++ Algorithm

K-means++ is an important algorithm to choose initial cluster centers fo...
research
08/07/2023

TeraHAC: Hierarchical Agglomerative Clustering of Trillion-Edge Graphs

We introduce TeraHAC, a (1+ϵ)-approximate hierarchical agglomerative clu...
research
02/25/2019

Optimal Distributed Covering Algorithms

We present a time-optimal deterministic distributed algorithm for approx...
research
06/15/2018

Straggler-Resilient and Communication-Efficient Distributed Iterative Linear Solver

We propose a novel distributed iterative linear inverse solver method. O...
research
01/30/2018

On the Interactive Communication Cost of the Distributed Nearest Lattice Point Problem

We consider the problem of distributed computation of the nearest lattic...
research
05/01/2019

Recombinator-k-means: Enhancing k-means++ by seeding from pools of previous runs

We present a heuristic algorithm, called recombinator-k-means, that can ...
research
02/19/2021

Distributed Bootstrap for Simultaneous Inference Under High Dimensionality

We propose a distributed bootstrap method for simultaneous inference on ...
