Phase Transitions for High Dimensional Clustering and Related Problems

02/24/2015
by   Jiashun Jin, et al.

Consider a two-class clustering problem where we observe X_i = ℓ_i μ + Z_i, 1 ≤ i ≤ n, with Z_i iid ∼ N(0, I_p). The feature vector μ ∈ R^p is unknown but is presumably sparse. The class labels ℓ_i ∈ {-1, 1} are also unknown, and the main interest is to estimate them. We are interested in the statistical limits. In the two-dimensional phase space calibrating the rarity and strength of useful features, we find the precise demarcation between the Region of Impossibility and the Region of Possibility. In the former, useful features are too rare/weak for successful clustering. In the latter, useful features are strong enough to allow successful clustering. The results are extended to the case of colored noise using Le Cam's idea on comparison of experiments. We also extend the study of statistical limits for clustering to signal recovery and to hypothesis testing. We compare the statistical limits for the three problems and highlight some interesting insights. We propose classical PCA and Important Features PCA (IF-PCA) for clustering. For a threshold t > 0, IF-PCA clusters by applying classical PCA to all columns of the data matrix X (whose rows are the X_i) with an L^2-norm larger than t. We also propose two aggregation methods. For any parameter in the Region of Possibility, some of these methods yield successful clustering. We find an interesting phase transition for IF-PCA. Our results require delicate analysis, especially on post-selection Random Matrix Theory and on lower bound arguments.
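
As a rough illustration of the IF-PCA step described in the abstract (keep the columns of X whose L^2-norm exceeds t, then apply classical PCA), the following is a minimal sketch in Python with NumPy. It is not the authors' implementation. The function names, the simulation parameters (n, p, s, tau), and the particular choice of threshold t are all assumptions made for this example, and taking the sign of the leading left singular vector is one common way to read two-class labels off a PCA step, assumed here rather than taken from the paper.

```python
import numpy as np


def if_pca(X, t):
    """Sketch of IF-PCA: keep columns of X with L2 norm > t, then run PCA.

    Returns estimated labels in {-1, +1} and the boolean mask of kept columns.
    """
    col_norms = np.linalg.norm(X, axis=0)  # L2 norm of every column (feature)
    keep = col_norms > t
    if not keep.any():
        raise ValueError("threshold t removed every column")
    # Classical PCA on the selected columns: leading left singular vector.
    U, _, _ = np.linalg.svd(X[:, keep], full_matrices=False)
    labels = np.sign(U[:, 0])
    labels[labels == 0] = 1.0  # break exact ties arbitrarily
    return labels, keep


def simulate(n, p, s, tau, rng):
    """Draw X_i = l_i * mu + Z_i with Z_i ~ N(0, I_p); mu has s entries equal to tau."""
    labels = rng.choice([-1.0, 1.0], size=n)
    mu = np.zeros(p)
    mu[:s] = tau  # hypothetical sparse feature vector
    X = np.outer(labels, mu) + rng.standard_normal((n, p))
    return X, labels


def clustering_error(est, truth):
    """Fraction of mislabeled samples, up to a global sign flip."""
    err = np.mean(est != truth)
    return min(err, 1.0 - err)


rng = np.random.default_rng(0)
n, p, s, tau = 200, 2000, 20, 1.0  # illustrative values, not from the paper
X, truth = simulate(n, p, s, tau, rng)

# Hypothetical threshold: squared column norms of pure-noise columns
# concentrate around n, so cut a few standard deviations above that.
t = np.sqrt(n + 2.0 * np.sqrt(n * np.log(p)))

est, keep = if_pca(X, t)
err = clustering_error(est, truth)
print(f"kept {keep.sum()} of {p} columns, clustering error = {err:.3f}")
```

The signal in this toy setting is deliberately strong so that feature selection and clustering both succeed; shrinking tau or s moves the problem toward the regime where, per the abstract, useful features are too rare or weak for successful clustering.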


research
06/08/2023

Subject clustering by IF-PCA and several recent methods

Subject clustering (i.e., the use of measured features to cluster subjec...
research
04/22/2022

Compressibility: Power of PCA in Clustering Problems Beyond Dimensionality Reduction

In this paper we take a step towards understanding the impact of princip...
research
09/13/2020

Statistical Query Algorithms and Low-Degree Tests Are Almost Equivalent

Researchers currently use a number of approaches to predict and substant...
research
08/24/2021

Phase Transitions for High-Dimensional Quadratic Discriminant Analysis with Rare and Weak Signals

Consider a two-class classification problem where we observe samples (X_...
research
07/02/2018

Optimality and Sub-optimality of PCA I: Spiked Random Matrix Models

A central problem of random matrix theory is to understand the eigenvalu...
research
02/18/2020

Optimal Structured Principal Subspace Estimation: Metric Entropy and Minimax Rates

Driven by a wide range of applications, many principal subspace estimati...
research
11/15/2022

Solving clustering as ill-posed problem: experiments with K-Means algorithm

In this contribution, the clustering procedure based on K-Means algorith...
