No-Substitution k-means Clustering with Low Center Complexity and Memory

02/18/2021
by Robi Bhattacharjee, et al.

Clustering is a fundamental task in machine learning. Given a dataset X = {x_1, …, x_n}, the goal of k-means clustering is to pick k "centers" from X in a way that minimizes the sum of squared distances from each point to its nearest center. We consider k-means clustering in the online, no-substitution setting, where one must decide whether to take x_t as a center immediately upon streaming it and cannot remove centers once taken. The online, no-substitution setting is challenging for clustering: one can show that there exist datasets X for which any O(1)-approximation k-means algorithm must have center complexity Ω(n), meaning that it takes Ω(n) centers in expectation. Bhattacharjee and Moshkovitz (2020) refined this bound by defining a complexity measure called Lower_{α,k}(X), and proving that any α-approximation algorithm must have center complexity Ω(Lower_{α,k}(X)). They then complemented their lower bound by giving an O(k^3)-approximation algorithm with center complexity Õ(k^2 · Lower_{k^3,k}(X)), thus showing that their parameter is a tight measure of required center complexity. However, a major drawback of their algorithm is its memory requirement, which is O(n). This makes the algorithm impractical for very large datasets. In this work, we strictly improve upon their algorithm on all three fronts (approximation factor, center complexity, and memory); we develop a 36-approximation algorithm with center complexity Õ(k · Lower_{36,k}(X)) that uses only O(k) additional memory. In addition to having nearly optimal memory, this algorithm is the first known algorithm with center complexity bounded by Lower_{36,k}(X) that is a true O(1)-approximation, with its approximation factor independent of both k and n.
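To make the setting concrete, the following is a minimal sketch of the k-means cost and of the online, no-substitution interface described above. It is not the algorithm from the paper: the selection rule (take a point if its squared distance to every current center exceeds a fixed `threshold`) is a hypothetical placeholder chosen only to illustrate that each decision is made immediately and is irrevocable. The names `sq_dist`, `kmeans_cost`, `stream_no_substitution`, and `threshold` are assumptions introduced for this example.

```python
from typing import Iterable, List, Sequence

Point = Sequence[float]


def sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points: Iterable[Point], centers: List[Point]) -> float:
    """Sum of squared distances from each point to its nearest center."""
    return sum(min(sq_dist(x, c) for c in centers) for x in points)


def stream_no_substitution(stream: Iterable[Point], threshold: float) -> List[Point]:
    """Illustrative online, no-substitution loop (NOT the paper's algorithm).

    Each point x_t is seen once. The decision to take it as a center is made
    immediately, and a taken center is never removed. The threshold rule is a
    hypothetical stand-in for a real selection rule.
    """
    centers: List[Point] = []
    for x in stream:
        # Only the chosen centers are stored; past points are not kept.
        if not centers or min(sq_dist(x, c) for c in centers) > threshold:
            centers.append(x)  # irrevocable: no later substitution
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = stream_no_substitution(points, threshold=4.0)
cost = kmeans_cost(points, centers)
print(centers, cost)  # [(0.0, 0.0), (10.0, 10.0)] 2.0
```

In this sketch the number of centers taken plays the role of center complexity, and the memory used beyond the centers themselves is constant. A naive rule like this one carries no approximation guarantee, which is the gap the abstract's results address.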


research
07/16/2022

A Nearly Tight Analysis of Greedy k-means++

The famous k-means++ algorithm of Arthur and Vassilvitskii [SODA 2007] i...
research
12/28/2020

No-substitution k-means Clustering with Adversarial Order

We investigate k-means clustering in the online no-substitution setting ...
research
02/20/2023

Replicable Clustering

In this paper, we design replicable algorithms in the context of statist...
research
12/02/2019

Noisy, Greedy and Not So Greedy k-means++

The k-means++ algorithm due to Arthur and Vassilvitskii has become the m...
research
03/24/2019

Generalization of k-means Related Algorithms

This article briefly introduced Arthur and Vassilvitskii's work on k-mea...
research
12/01/2020

(k, l)-Medians Clustering of Trajectories Using Continuous Dynamic Time Warping

Due to the massively increasing amount of available geospatial data and ...
research
02/16/2022

Distributed k-Means with Outliers in General Metrics

Center-based clustering is a pivotal primitive for unsupervised learning...
