Biwhitening Reveals the Rank of a Count Matrix

03/25/2021
by Boris Landa, et al.

Estimating the rank of a corrupted data matrix is an important task in data science, most notably for choosing the number of components in principal component analysis. Significant progress on this task has been made using random matrix theory by characterizing the spectral properties of large noise matrices. However, utilizing such tools is not straightforward when the data matrix consists of count random variables, such as Poisson or binomial, in which case the noise can be heteroskedastic with an unknown variance in each entry. In this work, focusing on a Poisson random matrix with independent entries, we propose a simple procedure termed biwhitening that makes it possible to estimate the rank of the underlying data matrix (i.e., the Poisson parameter matrix) without any prior knowledge on its structure. Our approach is based on the key observation that one can scale the rows and columns of the data matrix simultaneously so that the spectrum of the corresponding noise agrees with the standard Marchenko-Pastur (MP) law, justifying the use of the MP upper edge as a threshold for rank selection. Importantly, the required scaling factors can be estimated directly from the observations by solving a matrix scaling problem via the Sinkhorn-Knopp algorithm. Aside from the Poisson distribution, we extend our biwhitening approach to other discrete distributions, such as the generalized Poisson, binomial, multinomial, and negative binomial. We conduct numerical experiments that corroborate our theoretical findings, and demonstrate our approach on real single-cell RNA sequencing (scRNA-seq) data, where we show that our results agree with a slightly overdispersed generalized Poisson model.
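The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' reference implementation: it assumes the Poisson case, where each count serves as an estimate of its own variance, and it assumes the rows and columns of the m × n count matrix are scaled so that the scaled variance estimates sum to n in every row and to m in every column. The function names `sinkhorn_knopp` and `biwhitening_rank`, the iteration limit, and the tolerance are hypothetical choices made for this example.

```python
import numpy as np


def sinkhorn_knopp(A, row_sums, col_sums, max_iter=1000, tol=1e-9):
    """Find positive vectors u, v so that diag(u) @ A @ diag(v) has the
    prescribed row and column sums.

    Assumes A is nonnegative with no all-zero row or column, and that
    row_sums.sum() == col_sums.sum().
    """
    u = np.ones(A.shape[0])
    v = np.ones(A.shape[1])
    for _ in range(max_iter):
        u = row_sums / (A @ v)      # make the row sums exact
        v = col_sums / (A.T @ u)    # make the column sums exact
        # The column sums are exact at this point, so only rows are checked.
        row_err = np.abs(u * (A @ v) - row_sums) / row_sums
        if row_err.max() < tol:
            break
    return u, v


def biwhitening_rank(Y):
    """Estimate the rank of the Poisson parameter matrix behind counts Y.

    Assumption (Poisson case): Y[i, j] estimates the variance of entry
    (i, j), so the scaling factors are computed from Y itself.
    """
    Y = np.asarray(Y, dtype=float)
    m, n = Y.shape
    # Scale the variance estimates to average 1 in every row and column.
    u, v = sinkhorn_knopp(Y, np.full(m, float(n)), np.full(n, float(m)))
    # Scaling the data by sqrt(u) and sqrt(v) scales variances by u and v.
    Y_scaled = np.sqrt(u)[:, None] * Y * np.sqrt(v)[None, :]
    singular_values = np.linalg.svd(Y_scaled, compute_uv=False)
    # MP upper edge for the eigenvalues of Y_scaled @ Y_scaled.T / n is
    # (1 + sqrt(m / n))**2, i.e. sqrt(m) + sqrt(n) for singular values.
    threshold = np.sqrt(m) + np.sqrt(n)
    return int(np.sum(singular_values > threshold))


# Usage on synthetic data (all sizes and scales are illustrative).
rng = np.random.default_rng(0)
m, n, r = 200, 300, 3
X = 50.0 * rng.uniform(size=(m, r)) @ rng.uniform(size=(r, n))
Y = rng.poisson(X)
estimated_rank = biwhitening_rank(Y)
```

At finite sizes the largest noise singular value can occasionally land slightly above the MP edge, so the estimate in a sketch like this may exceed the true rank by a small amount.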

research
06/20/2023

The Dyson Equalizer: Adaptive Noise Stabilization for Low-Rank Signal Detection and Recovery

Detecting and recovering a low-rank signal in a noisy data matrix is a f...
research
04/20/2015

Poisson Matrix Recovery and Completion

We extend the theory of low-rank matrix recovery and completion to the c...
research
06/01/2019

Multi-reference factor analysis: low-rank covariance estimation under unknown translations

We consider the problem of estimating the covariance matrix of a random ...
research
07/24/2023

Negative binomial count splitting for single-cell RNA sequencing data

The analysis of single-cell RNA sequencing (scRNA-seq) data often involv...
research
01/25/2022

Zero-Truncated Poisson Regression for Sparse Multiway Count Data Corrupted by False Zeros

We propose a novel statistical inference methodology for multiway count ...
research
06/23/2020

Solving the Phantom Inventory Problem: Near-optimal Entry-wise Anomaly Detection

We observe that a crucial inventory management problem ('phantom invento...
research
08/27/2018

Identifiability of Low-Rank Sparse Component Analysis

Sparse component analysis (SCA) is the following problem: Given an input...
