Compressibility: Power of PCA in Clustering Problems Beyond Dimensionality Reduction

04/22/2022
by Chandra Sekhar Mukherjee, et al.

In this paper we take a step towards understanding the impact of principal component analysis (PCA) in unsupervised clustering, beyond its role as a dimensionality reduction tool. We explore another property of PCA in vector clustering problems, which we call compressibility. This phenomenon shows that PCA significantly reduces the distances between data points belonging to the same cluster, while reducing inter-cluster distances only mildly. This gap explains many empirical observations. For example, in single-cell RNA-sequencing analysis, an application of vector clustering in biology, it has been observed that applying PCA to datasets significantly improves the accuracy of classical clustering algorithms such as K-means. We study this compression gap in both theory and practice. On the theoretical side, we analyze PCA in a fairly general probabilistic setup, which we call the random vector model. On the practical side, we verify the compressibility of PCA on multiple single-cell RNA-seq datasets.
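The compression gap described above can be illustrated on synthetic data. The following is a minimal sketch, not the paper's experimental setup: the Gaussian mixture, the dimensions, the centered PCA via SVD, the choice of as many components as clusters, and the names `pca_project` and `mean_pairwise_distances` are all assumptions made for illustration. It compares the average intra-cluster and inter-cluster distances before and after projection.

```python
import numpy as np

def pca_project(X, n_components):
    """Project rows of X onto the top principal components (centered PCA via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def mean_pairwise_distances(X, labels):
    """Return (mean intra-cluster distance, mean inter-cluster distance)."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    D = np.sqrt(np.maximum(D2, 0.0))           # clip tiny negatives from round-off
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)     # exclude each point's distance to itself
    return D[same & off_diag].mean(), D[~same].mean()

# Synthetic data (assumed for illustration): k Gaussian clusters in d dimensions.
rng = np.random.default_rng(0)
k, n_per_cluster, d = 3, 100, 500
centers = rng.normal(scale=0.5, size=(k, d))
labels = np.repeat(np.arange(k), n_per_cluster)
X = centers[labels] + rng.normal(size=(k * n_per_cluster, d))

intra_pre, inter_pre = mean_pairwise_distances(X, labels)
Z = pca_project(X, n_components=k)             # assumption: keep k components
intra_post, inter_post = mean_pairwise_distances(Z, labels)

# Compression ratio = distance before PCA / distance after PCA.
intra_ratio = intra_pre / intra_post
inter_ratio = inter_pre / inter_post
print(f"intra-cluster compression ratio: {intra_ratio:.2f}")
print(f"inter-cluster compression ratio: {inter_ratio:.2f}")
```

In this toy setting the intra-cluster ratio should come out noticeably larger than the inter-cluster ratio, which is the gap the abstract refers to. Real single-cell RNA-seq data will behave differently in detail.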


research
10/10/2021

Scaled torus principal component analysis

A particularly challenging context for dimensionality reduction is multi...
research
05/19/2022

Confident Clustering via PCA Compression Ratio and Its Application to Single-cell RNA-seq Analysis

Unsupervised clustering algorithms for vectors has been widely used in t...
research
06/08/2023

Subject clustering by IF-PCA and several recent methods

Subject clustering (i.e., the use of measured features to cluster subjec...
research
09/07/2020

Improving Problem Identification via Automated Log Clustering using Dimensionality Reduction

Goal: We consider the problem of automatically grouping logs of runs tha...
research
08/17/2020

Principal Ellipsoid Analysis (PEA): Efficient non-linear dimension reduction clustering

Even with the rise in popularity of over-parameterized models, simple di...
research
02/24/2015

Phase Transitions for High Dimensional Clustering and Related Problems

Consider a two-class clustering problem where we observe X_i = ℓ_i μ + Z...
research
12/19/2016

High Performance Software in Multidimensional Reduction Methods for Image Processing with Application to Ancient Manuscripts

Multispectral imaging is an important technique for improving the readab...
