Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means

We analyze a compression scheme for large data sets that randomly keeps a small percentage of the components of each data sample. The output is a sparse matrix, so subsequent processing, such as PCA or K-means, is significantly faster, especially in a distributed-data setting. Furthermore, the sampling is single-pass and applicable to streaming data. The sampling mechanism is a variant of methods previously proposed in the literature, combined with a randomized preconditioning step that smooths the data. We provide guarantees for PCA in terms of the covariance matrix, and guarantees for K-means in terms of the error in the center estimators at a given step. We present numerical evidence that our bounds are nearly tight and that our algorithms provide a real benefit on standard test data sets, along with certain advantages over related sampling approaches.
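The two-stage idea described above (precondition, then keep a few components per sample) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the paper's exact algorithm: the abstract does not specify the preconditioner, so random sign flips followed by an orthonormal Hadamard transform are assumed here, and the names `hadamard`, `precondition_and_sparsify`, and `keep_fraction` are hypothetical.

```python
import numpy as np


def hadamard(p):
    """Orthonormal p x p Hadamard matrix; p must be a power of two."""
    H = np.array([[1.0]])
    while H.shape[0] < p:
        H = np.block([[H, H], [H, -H]])
    if H.shape[0] != p:
        raise ValueError("p must be a power of two")
    return H / np.sqrt(p)


def precondition_and_sparsify(X, keep_fraction, rng):
    """Precondition the columns of X, then keep m random entries per column.

    X has shape (p, n), one sample per column. The preconditioner
    (random signs, then Hadamard) is an assumption made for illustration.
    """
    p, n = X.shape
    m = max(1, int(round(keep_fraction * p)))

    # Assumed preconditioning: flip signs at random, then mix coordinates
    # with an orthonormal transform so energy is spread across entries.
    signs = rng.choice([-1.0, 1.0], size=p)
    Y = hadamard(p) @ (signs[:, None] * X)

    # Independently for each sample, keep m coordinates chosen uniformly
    # without replacement. Each column is handled on its own, so the
    # procedure needs only one pass over the data.
    mask = np.zeros((p, n), dtype=bool)
    for j in range(n):
        mask[rng.choice(p, size=m, replace=False), j] = True

    return np.where(mask, Y, 0.0), mask, signs


rng = np.random.default_rng(0)
X = rng.standard_normal((8, 5))
S, mask, signs = precondition_and_sparsify(X, keep_fraction=0.25, rng=rng)
print(mask.sum(axis=0))  # number of entries kept in each column
```

Because each entry of a column is kept with probability m/p, multiplying the sparsified column by p/m gives an unbiased estimate of the preconditioned sample. In practice the result would be stored in a sparse format rather than as a dense array with zeros.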

