On the Nyström and Column-Sampling Methods for the Approximate Principal Components Analysis of Large Data Sets

02/02/2016
by Darren Homrighausen, et al.

In this paper we analyze approximate methods for undertaking a principal components analysis (PCA) on large data sets. PCA is a classical dimension reduction method that projects the data onto the subspace spanned by the leading eigenvectors of the covariance matrix. This projection can be used either for exploratory purposes or as an input for further analysis, e.g., regression. If the data have billions of entries or more, the computational and storage requirements for saving and manipulating the design matrix in fast memory are prohibitive. Recently, the Nyström and column-sampling methods have appeared in the numerical linear algebra community for the randomized approximation of the singular value decomposition of large matrices. However, their utility for statistical applications remains unclear. We compare these approximations theoretically by bounding the distance between the induced subspaces and the desired, but computationally infeasible, PCA subspace. Additionally, we show empirically, through simulations and a real data example involving a corpus of emails, the trade-off between approximation accuracy and computational complexity.
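To make the two approximations concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a centered n-by-p data matrix `X`, uniform sampling of `l` columns of the Gram matrix `X.T @ X` without replacement, and a target dimension `d` no larger than `l`. The function names, the QR orthonormalization of the Nyström vectors, and the projection-difference distance are illustrative choices that do not come from the abstract.

```python
import numpy as np

def sample_gram_columns(X, l, rng):
    """Sample l columns of G = X.T @ X without forming G (assumed uniform sampling)."""
    p = X.shape[1]
    idx = rng.choice(p, size=l, replace=False)
    C = X.T @ X[:, idx]   # p x l block: the sampled columns of G
    W = C[idx, :]         # l x l block: rows and columns of G at the sampled indices
    return C, W

def nystrom_basis(C, W, d):
    """Nystrom-style extension: lift the top-d eigenvectors of W through C."""
    evals, evecs = np.linalg.eigh(W)
    top = np.argsort(evals)[::-1][:d]
    # Assumes the top-d eigenvalues of W are strictly positive.
    V = C @ (evecs[:, top] / evals[top])
    # The lifted vectors are not orthonormal; QR gives a basis for their span.
    Q, _ = np.linalg.qr(V)
    return Q

def column_sampling_basis(C, d):
    """Column sampling: top-d left singular vectors of the sampled block C."""
    U, _, _ = np.linalg.svd(C, full_matrices=False)
    return U[:, :d]

def exact_pca_basis(X, d):
    """Reference PCA subspace: top-d right singular vectors of centered X."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:d].T

def subspace_distance(A, B):
    """Spectral norm of the difference of the two orthogonal projections."""
    return np.linalg.norm(A @ A.T - B @ B.T, 2)

# Illustrative usage on synthetic low-rank-plus-noise data.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 100))
X += 0.1 * rng.standard_normal(X.shape)
X -= X.mean(axis=0)

C, W = sample_gram_columns(X, l=20, rng=rng)
V = exact_pca_basis(X, d=5)
print("Nystrom distance:        ", subspace_distance(V, nystrom_basis(C, W, 5)))
print("Column-sampling distance:", subspace_distance(V, column_sampling_basis(C, 5)))
```

Both approximations only touch a p-by-l block of the Gram matrix, which is where the savings in computation and storage come from. Increasing `l` generally trades more computation for a smaller distance to the exact PCA subspace.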


