Selecting the number of components in PCA via random signflips

12/05/2020
by David Hong et al.

Dimensionality reduction via PCA and factor analysis is an important tool of data analysis. A critical step is selecting the number of components. However, existing methods (such as the scree plot, likelihood ratio, and parallel analysis) do not have statistical guarantees in the increasingly common setting where the data are heterogeneous, i.e., where each noise entry can have a different distribution. To address this problem, we propose the Signflip Parallel Analysis (Signflip PA) method: it compares the data singular values to those of "empirical null" data generated by flipping the sign of each entry independently with probability one-half. We show that Signflip PA consistently selects factors above the noise level in high-dimensional signal-plus-noise models (including spiked models and factor models) under heterogeneous settings, where classical parallel analysis is no longer effective. To do so, we leverage recent breakthroughs in random matrix theory, such as dimension-free operator norm bounds [Latala et al., 2018, Inventiones Mathematicae] and large deviations for the top eigenvalues of nonhomogeneous matrices [Husson, 2020]. To our knowledge, some of these results have not previously been used in statistics. We also illustrate that Signflip PA performs well in numerical simulations and on empirical data examples.
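The core procedure described above can be sketched in a few lines of NumPy: compare the singular values of the data matrix to the top singular values of signflipped copies, and count how many exceed a null quantile. This is a minimal illustration of the idea, not the authors' reference implementation; the function name, `n_flips`, and `quantile` parameters are hypothetical choices.

```python
import numpy as np

def signflip_pa(X, n_flips=20, quantile=0.95, seed=None):
    """Sketch of Signflip Parallel Analysis.

    Compares the singular values of X to the top singular values of
    "empirical null" matrices obtained by flipping the sign of each
    entry of X independently with probability 1/2, and returns the
    number of data singular values exceeding the null quantile.
    """
    rng = np.random.default_rng(seed)
    data_sv = np.linalg.svd(X, compute_uv=False)
    null_top = np.empty(n_flips)
    for i in range(n_flips):
        # Random signflips preserve entrywise noise scale but
        # destroy the low-rank signal structure.
        signs = rng.choice([-1.0, 1.0], size=X.shape)
        null_top[i] = np.linalg.svd(signs * X, compute_uv=False)[0]
    threshold = np.quantile(null_top, quantile)
    return int(np.sum(data_sv > threshold))
```

For example, on a strong rank-2 signal plus i.i.d. Gaussian noise, this sketch selects two components, since the signflipped nulls retain the entrywise magnitudes (and hence any heteroscedasticity) while breaking the signal's alignment.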


