Turning Big Data into Tiny Data: Constant-Size Coresets for k-Means, PCA and Projective Clustering

07/12/2018
by Dan Feldman, et al.

We develop and analyze a method to reduce the size of a very large set of data points in a high-dimensional Euclidean space R^d to a small set of weighted points, such that the result of a predetermined data analysis task on the reduced set is approximately the same as the result on the original point set. For example, computing the first k principal components of the reduced set returns approximately the first k principal components of the original set, and computing the centers of a k-means clustering on the reduced set returns an approximation of the centers for the original set. Such a reduced set is also known as a coreset.

The main new feature of our construction is that the cardinality of the reduced set is independent of the dimension d of the input space, and that the sets are mergeable: the union of two reduced sets is a reduced set for the union of the two original sets (this property has recently also been called composability; see Indyk et al., PODS 2014). Mergeability allows us to turn our methods into streaming or distributed algorithms using standard approaches. For problems such as k-means and subspace approximation, the coreset sizes are also independent of the number of input points.

Our method is based on projecting the points onto a low-dimensional subspace and then reducing the cardinality of the points inside this subspace using known methods. The proposed approach works for a wide range of data analysis techniques, including k-means clustering, principal component analysis, and subspace clustering. The main conceptual contribution is a new coreset definition that allows us to charge costs that appear for every solution to an additive constant.
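To make the two-stage idea concrete, here is a minimal numpy sketch: project the points onto a low-dimensional subspace (via SVD), then reduce the cardinality inside that subspace by weighted importance sampling. The function name, the crude sensitivity proxy, and the sampling weights below are illustrative assumptions, not the paper's actual construction or guarantees.

```python
# Illustrative sketch only: rank-k projection followed by weighted sampling.
# The sensitivity proxy is an assumption, not the paper's construction.
import numpy as np

def coreset_sketch(P, k, m, rng):
    """P: (n, d) array of points; k: subspace dimension; m: coreset size.
    Returns (C, w): m weighted points meant to stand in for P."""
    mu = P.mean(axis=0)
    # Stage 1: project onto the span of the top-k right singular vectors.
    _, _, Vt = np.linalg.svd(P - mu, full_matrices=False)
    Q = (P - mu) @ Vt[:k].T @ Vt[:k] + mu      # rank-k projection of P
    # Stage 2: importance sampling with a crude sensitivity proxy
    # (squared distance to the mean, mixed with a uniform term).
    s = np.linalg.norm(Q - mu, axis=1) ** 2
    p = 0.5 * s / s.sum() + 0.5 / len(Q)       # probabilities sum to 1
    idx = rng.choice(len(Q), size=m, p=p)
    w = 1.0 / (m * p[idx])                     # inverse-probability weights
    return Q[idx], w
```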
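The mergeability property is easy to exercise in the same toy setting: the union of two weighted coresets is treated as a coreset for the union of the two original sets. This continues the hypothetical sketch above; the printed comparison is only a sanity check, not a proof of the approximation guarantee.

```python
# Toy check of mergeability, reusing coreset_sketch from the sketch above.
import numpy as np

def kmeans_cost(X, w, centers):
    """Weighted k-means cost: sum_i w_i * min_j ||x_i - c_j||^2."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return float((w * d2.min(axis=1)).sum())

rng = np.random.default_rng(0)
A = rng.normal(size=(5000, 50))                # first chunk of the stream
B = rng.normal(size=(5000, 50)) + 3.0          # second chunk, shifted
CA, wA = coreset_sketch(A, k=10, m=300, rng=rng)
CB, wB = coreset_sketch(B, k=10, m=300, rng=rng)
C, w = np.vstack([CA, CB]), np.concatenate([wA, wB])   # merged summary
centers = np.vstack([A[:3], B[:3]])            # arbitrary candidate centers
full = kmeans_cost(np.vstack([A, B]), np.ones(len(A) + len(B)), centers)
small = kmeans_cost(C, w, centers)
print(f"cost on full data: {full:.1f}, on merged coreset: {small:.1f}")
```

In a streaming setting one would keep a small number of such summaries and re-reduce merged pairs whenever they exceed the size budget, which is the standard merge-and-reduce approach the abstract alludes to.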

