On unsupervised projections and second order signals

04/11/2022
by Thomas Lartigue, et al.

Linear projections are widely used in the analysis of high-dimensional data. In unsupervised settings where the data harbour latent classes/clusters, the question of whether class discriminatory signals are retained under projection is crucial. In the case of mean differences between classes, this question has been well studied. However, in many contemporary applications, notably in biomedicine, group differences at the level of covariance or graphical model structure are important. Motivated by such applications, in this paper we ask whether linear projections can preserve differences in second order structure between latent groups. We focus on unsupervised projections, which can be computed without knowledge of class labels. We discuss a simple theoretical framework to study the behaviour of such projections, which we use to inform an analysis via quasi-exhaustive enumeration. This allows us to consider the performance, over more than a hundred thousand sets of data-generating population parameters, of two popular projections, namely random projections (RP) and Principal Component Analysis (PCA). Across this broad range of regimes, PCA turns out to be more effective at retaining second order signals than RP and is often even competitive with supervised projection. We complement these results with fully empirical experiments, reporting 0-1 loss on simulated and real data. We also study the effect of projection dimension, drawing attention to a bias-variance trade-off in this respect. Our results show that PCA can indeed be a suitable first step for unsupervised analysis, including in cases where differential covariance or graphical model structure is of interest.
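The abstract contrasts PCA and random projections as unsupervised dimension-reduction steps when the class signal lies in the covariance rather than the mean. The following Python sketch is our illustration of that setting, not the authors' code: it draws two equal-mean Gaussian classes whose covariances differ only in a small block, projects the pooled unlabelled data with PCA and with a random orthonormal projection of the same dimension, and measures how much of the covariance difference each projection retains. The dimensions, the block structure, and the Frobenius-norm retention measure (the helper `second_order_signal`) are all illustrative assumptions.

```python
# A minimal, self-contained sketch (not the paper's code) of the setting described
# above: two latent classes share the same mean but differ in covariance, so the
# discriminatory signal is purely "second order". We compare how much of that
# signal survives two unsupervised k-dimensional projections: PCA fitted on the
# pooled, unlabelled data, and a random projection (RP) of the same dimension.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
p, k, n = 50, 5, 2000  # ambient dimension, projection dimension, samples per class

# Population covariances: identical except for a correlated 5x5 block in class B,
# giving a differential-covariance signal between the two latent groups.
sigma_a = np.eye(p)
sigma_b = np.eye(p)
sigma_b[:5, :5] += 0.8 * (np.ones((5, 5)) - np.eye(5))

x_a = rng.multivariate_normal(np.zeros(p), sigma_a, size=n)
x_b = rng.multivariate_normal(np.zeros(p), sigma_b, size=n)
x = np.vstack([x_a, x_b])  # pooled, unlabelled data: the projections see only this

def second_order_signal(w):
    """Frobenius norm of the between-class covariance difference after projecting by w (p x k).
    Class labels are used here only to evaluate a projection, never to compute it."""
    z_a, z_b = x_a @ w, x_b @ w
    return np.linalg.norm(np.cov(z_a, rowvar=False) - np.cov(z_b, rowvar=False))

# Unsupervised projections of equal dimension k:
w_pca = PCA(n_components=k).fit(x).components_.T      # leading principal directions
w_rp, _ = np.linalg.qr(rng.standard_normal((p, k)))   # random orthonormal directions

print("signal retained by PCA:", second_order_signal(w_pca))
print("signal retained by RP :", second_order_signal(w_rp))
print("signal in ambient space:", np.linalg.norm(sigma_a - sigma_b))
```

In this toy regime one would typically see PCA retain noticeably more of the block-covariance difference than an equal-dimension RP, consistent with the paper's broader findings, although the outcome depends on the chosen population parameters; sweeping k in the same script is a simple way to see the projection-dimension trade-off mentioned above.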

Related research

08/06/2019  Online Detection of Sparse Changes in High-Dimensional Data Streams Using Tailored Projections
When applying principal component analysis (PCA) for dimension reduction...

05/01/2020  How to reduce dimension with PCA and random projections?
In our "big data" age, the size and complexity of data is steadily incre...

02/22/2019  Model-based clustering in very high dimensions via adaptive projections
Mixture models are a standard approach to dealing with heterogeneous dat...

09/14/2017  Generalized Biplots for Multidimensional Scaled Projections
Dimension reduction and visualization is a staple of data analytics. Met...

04/30/2015  Semi-Orthogonal Multilinear PCA with Relaxed Start
Principal component analysis (PCA) is an unsupervised method for learnin...

12/31/2022  A Study on a User-Controlled Radial Tour for Variable Importance in High-Dimensional Data
Principal component analysis is a long-standing go-to method for explori...

05/15/2019  Which principal components are most sensitive to distributional changes?
PCA is often used in anomaly detection and statistical process control t...
