Coresets for Clustering with Missing Values

06/30/2021
by   Vladimir Braverman, et al.

We provide the first coreset for clustering points in ℝ^d that have multiple missing values (coordinates). Previous coreset constructions only allow one missing coordinate. The challenge in this setting is that objective functions, like k-Means, are evaluated only on the set of available (non-missing) coordinates, which varies across points. Recall that an ϵ-coreset of a large dataset is a small proxy, usually a reweighted subset of points, that (1+ϵ)-approximates the clustering objective for every possible center set. Our coresets for k-Means and k-Median clustering have size (jk)^{O(min(j,k))} · (ϵ^{-1} d log n)^2, where n is the number of data points, d is the dimension and j is the maximum number of missing coordinates for each data point. We further design an algorithm to construct these coresets in near-linear time, and consequently improve a recent quadratic-time PTAS for k-Means with missing values [Eiben et al., SODA 2021] to near-linear time. We validate our coreset construction, which is based on importance sampling and is easy to implement, on various real data sets. Our coreset exhibits a flexible tradeoff between coreset size and accuracy, and generally outperforms the uniform-sampling baseline. Furthermore, it significantly speeds up a Lloyd's-style heuristic for k-Means with missing values.
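To make the objective concrete, here is a minimal Python sketch of the k-Means cost with missing values, where each point's distance to a center is computed only over that point's available coordinates. It also includes the uniform-sampling baseline mentioned above, not the importance-sampling construction of the paper. The function names (`dist2`, `kmeans_cost`, `uniform_sample`), the use of NaN to mark a missing coordinate, and sampling with replacement are all assumptions made for illustration.

```python
import math
import random

NAN = math.nan  # assumption: a missing coordinate is encoded as NaN


def dist2(x, c):
    """Squared distance from point x to center c, summed only over
    the coordinates of x that are present (not NaN)."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c) if not math.isnan(xi))


def kmeans_cost(points, centers, weights=None):
    """Weighted k-Means cost: each point pays its weight times the
    squared distance to its nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(dist2(x, c) for c in centers)
               for x, w in zip(points, weights))


def uniform_sample(points, m, rng):
    """Uniform-sampling baseline: draw m points with replacement and
    give each the weight n/m, so the total weight equals n."""
    n = len(points)
    sample = [points[rng.randrange(n)] for _ in range(m)]
    return sample, [n / m] * m


# Two points in R^3, each with one missing coordinate.
points = [(1.0, NAN, 3.0), (0.0, 0.0, NAN)]
centers = [(0.0, 5.0, 1.0), (0.0, 0.0, 0.0)]
full_cost = kmeans_cost(points, centers)

# A weighted sample is an eps-coreset if, for every center set, its cost
# is within a (1 +/- eps) factor of the cost on the full data.
sample, weights = uniform_sample(points, 4, random.Random(0))
sample_cost = kmeans_cost(sample, weights=weights, centers=centers)
```

Comparing `sample_cost` with `full_cost` over many candidate center sets gives an empirical estimate of the approximation error; a uniform sample carries no guarantee here, which is why it serves only as a baseline.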


