No-substitution k-means Clustering with Adversarial Order

12/28/2020
by Robi Bhattacharjee, et al.

We investigate k-means clustering in the online no-substitution setting when the input arrives in arbitrary order. In this setting, points arrive one after another, and the algorithm must decide immediately whether to take the current point as a center, before observing the next point. Decisions are irrevocable. The goal is to minimize both the number of centers and the k-means cost. Previous works in this setting assume that the input's order is random, or that the input's aspect ratio is bounded. It is known that if the order is arbitrary and there is no assumption on the input, then any algorithm must take all points as centers. Moreover, assuming a bounded aspect ratio is too restrictive: it excludes natural inputs generated from mixture models. We introduce a new complexity measure that quantifies the difficulty of clustering a dataset arriving in arbitrary order. We design a new randomized algorithm and prove that, when applied to data with complexity d, it takes O(d log(n) k log(k)) centers and is an O(k^3)-approximation. We also prove that if the data is sampled from a "natural" distribution, such as a mixture of k Gaussians, then the new complexity measure is O(k^2 log(n)). This implies that for data generated from those distributions, our new algorithm takes only poly(k log(n)) centers and is a poly(k)-approximation. In terms of negative results, we prove that the number of centers needed to achieve an α-approximation is at least Ω(d/(k log(nα))).
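To make the setting concrete, here is a minimal Python sketch of the online no-substitution protocol described above: points are seen one at a time, each accept/reject decision is final, and the k-means cost is measured against the centers that were kept. The distance-threshold rule is only an illustrative assumption and is not the algorithm from the paper; the names `run_no_substitution`, `make_threshold_rule`, and the `threshold` parameter are hypothetical.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points: Sequence[Point], centers: Sequence[Point]) -> float:
    """Sum over all points of the squared distance to the nearest center."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def run_no_substitution(
    stream: Sequence[Point],
    take_as_center: Callable[[Point, List[Point]], bool],
) -> List[Point]:
    """Process points in arrival order; every decision is irrevocable.

    A point is either appended to `centers` when it arrives or never
    considered again, and centers are never removed.
    """
    centers: List[Point] = []
    for x in stream:
        if take_as_center(x, centers):
            centers.append(x)
    return centers


def make_threshold_rule(
    threshold: float, rng: random.Random
) -> Callable[[Point, List[Point]], bool]:
    """Illustrative rule (an assumption, NOT the paper's algorithm).

    Takes the first point, then takes a point with probability
    min(1, d^2 / threshold), where d is the distance to the nearest
    center chosen so far.
    """
    def rule(x: Point, centers: List[Point]) -> bool:
        if not centers:
            return True
        d2 = min(sq_dist(x, c) for c in centers)
        return rng.random() < min(1.0, d2 / threshold)

    return rule


if __name__ == "__main__":
    rng = random.Random(0)
    data = [(rng.gauss(m, 0.1), rng.gauss(m, 0.1))
            for m in (0.0, 5.0) for _ in range(50)]
    chosen = run_no_substitution(data, make_threshold_rule(1.0, rng))
    print(len(chosen), kmeans_cost(data, chosen))
```

Under an adversarial order, a fixed threshold like this can be forced to take many centers, which is the kind of difficulty the complexity measure in the abstract is meant to quantify.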

research
02/18/2021

No-Substitution k-means Clustering with Low Center Complexity and Memory

Clustering is a fundamental task in machine learning. Given a dataset X ...
research
08/11/2019

Fully Dynamic k-Center Clustering in Doubling Metrics

In the k-center clustering problem, we are given a set of n points in a ...
research
08/09/2019

Unexpected Effects of Online K-means Clustering

In this paper we study k-means clustering in the online setting. In the ...
research
02/08/2021

A Constant Approximation Algorithm for Sequential No-Substitution k-Median Clustering under a Random Arrival Order

We study k-median clustering under the sequential no-substitution settin...
research
07/08/2017

Learning Mixture of Gaussians with Streaming Data

In this paper, we study the problem of learning a mixture of Gaussians w...
research
02/20/2023

Replicable Clustering

In this paper, we design replicable algorithms in the context of statist...
research
04/13/2020

Learning Mixtures of Spherical Gaussians via Fourier Analysis

Suppose that we are given independent, identically distributed samples x...
