Fast Distributed k-Means with a Small Number of Rounds

by Tom Hess, et al.

We propose a new algorithm for k-means clustering in a distributed setting, where the data is distributed across many machines and a coordinator communicates with these machines to calculate the output clustering. Our algorithm guarantees a cost approximation factor and a number of communication rounds that depend only on the computational capacity of the coordinator. Moreover, the algorithm includes a built-in stopping mechanism, which allows it to use fewer communication rounds whenever possible. We show both theoretically and empirically that in many natural cases 1-4 rounds indeed suffice. Compared with the popular k-means|| algorithm, our approach can exploit a larger coordinator capacity to obtain a smaller number of rounds. Our experiments show that the k-means cost obtained by the proposed algorithm is usually better than the cost obtained by k-means||, even when the latter is allowed a larger number of rounds. Moreover, the machine running time in our approach is considerably smaller than that of k-means||. Code for running the algorithm and experiments is available at



