Optimal Time Bounds for Approximate Clustering

12/12/2012
by Ramgopal Mettu, et al.

Clustering is a fundamental problem in unsupervised learning, and has been studied widely both as a problem of learning mixture models and as an optimization problem. In this paper, we study clustering with respect to the k-median objective function, a natural formulation of clustering in which we attempt to minimize the average distance to cluster centers. One of the main contributions of this paper is a simple but powerful sampling technique, which we call successive sampling, that could be of independent interest. We show that our sampling procedure can rapidly identify a small set of points (of size just O(k log(n/k))) that summarize the input points for the purpose of clustering. Using successive sampling, we develop an algorithm for the k-median problem that runs in O(nk) time for a wide range of values of k and is guaranteed, with high probability, to return a solution with cost at most a constant factor times optimal. We also establish a lower bound of Ω(nk) on any randomized constant-factor approximation algorithm for the k-median problem that succeeds with even a negligible (say 1/100) probability. Thus we establish a tight time bound of Θ(nk) for the k-median problem for a wide range of values of k. The best previous upper bound for the problem was Õ(nk), where the Õ-notation hides polylogarithmic factors in n and k. The best previous lower bound of Ω(nk) applied only to deterministic k-median algorithms. While we focus our presentation on the k-median objective, all our upper bounds are valid for the k-means objective as well. In this context our algorithm compares favorably to the widely used k-means heuristic, which requires O(nk) time for just one iteration and provides no useful approximation guarantees.
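
To make the objective concrete, here is a minimal sketch (not the paper's algorithm) that evaluates the k-median cost of a given set of centers as the average distance from each point to its nearest center. The function name `kmedian_cost`, the use of Euclidean distance, and the toy data are all assumptions chosen for illustration; the abstract does not fix a particular metric. Evaluating the cost this way already takes n·k distance computations, which is the same order as the Θ(nk) bound discussed above.

```python
import math


def kmedian_cost(points, centers):
    """Average distance from each point to its nearest center.

    Hypothetical helper for illustration: assumes Euclidean distance
    and performs len(points) * len(centers) distance evaluations.
    """
    total = 0.0
    for p in points:
        # Distance from p to the closest of the k centers.
        total += min(math.dist(p, c) for c in centers)
    return total / len(points)


# Toy input: two well-separated pairs of points, one center per pair.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]
cost = kmedian_cost(points, centers)
print(cost)
```

In this toy input every point lies at distance 1 from its nearest center, so the average is 1. Replacing the distance with its square inside the sum would give the analogous k-means cost.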

research
09/07/2020

Achieving anonymity via weak lower bound constraints for k-median and k-means

We study k-clustering problems with lower bounds, including k-median and...
research
05/28/2018

High Probability Frequency Moment Sketches

We consider the problem of sketching the p-th frequency moment of a vect...
research
06/30/2021

Nearly-Tight and Oblivious Algorithms for Explainable Clustering

We study the problem of explainable clustering in the setting first form...
research
11/03/2017

The Bane of Low-Dimensionality Clustering

In this paper, we give a conditional lower bound of n^Ω(k) on running ti...
research
02/27/2019

Reconciliation k-median: Clustering with Non-Polarized Representatives

We propose a new variant of the k-median problem, where the objective fu...
research
07/07/2021

Oblivious Median Slope Selection

We study the median slope selection problem in the oblivious RAM model. ...
research
02/27/2023

On Coresets for Clustering in Small Dimensional Euclidean Spaces

We consider the problem of constructing small coresets for k-Median in E...
