A New Theory for Sketching in Linear Regression

10/14/2018
by Edgar Dobriban, et al.

Large datasets create opportunities as well as analytic challenges. A recent development is to use random projection or sketching methods for dimension reduction in statistics and machine learning. In this work, we study the statistical performance of sketching algorithms for linear regression. Suppose we randomly project the data matrix and the outcome using a random sketching matrix, reducing the sample size, and then run linear regression on the resulting data. How much do we lose compared to the original linear regression? The existing theory does not give a precise enough answer, and this has been a bottleneck for using random projections in practice. In this paper, we introduce a new mathematical approach to the problem, relying on very recent results from asymptotic random matrix theory and free probability theory. This is a natural fit, as the sketching matrices are random in practice. We allow the dimension and sample sizes to have an arbitrary ratio. We study the most popular sketching methods in a unified framework, including random projection methods (Gaussian and iid projections, uniform orthogonal projections, subsampled randomized Hadamard transforms) as well as sampling methods (including uniform, leverage-based, and greedy sampling). We find precise and simple expressions for the accuracy loss of these methods. These go beyond classical Johnson–Lindenstrauss-type results because they are exact, rather than bounds up to constants. Our theoretical formulas are surprisingly accurate in extensive simulations and on two empirical datasets.
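To make the setup concrete, here is a minimal NumPy sketch (not the authors' code) of one of the methods the abstract names: a Gaussian projection that reduces the n rows of the data to r sketched rows, followed by ordinary least squares on the sketched data. The sizes, noise level, and random seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 2000, 20, 400  # samples, features, sketch size (r < n); illustrative choices

# Synthetic linear-regression data: y = X @ beta + noise
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
y = X @ beta + 0.5 * rng.standard_normal(n)

# Baseline: ordinary least squares on the full data
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian sketch: S has iid N(0, 1/r) entries, so S @ X has r rows instead of n
S = rng.standard_normal((r, n)) / np.sqrt(r)
beta_sketch, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)

# The accuracy loss the paper quantifies is how much worse beta_sketch
# estimates beta than beta_full does
err_full = np.linalg.norm(beta_full - beta)
err_sketch = np.linalg.norm(beta_sketch - beta)
print(err_full, err_sketch)
```

Sketching trades some estimation accuracy for a much smaller regression problem (r rows instead of n); the paper derives exact asymptotic expressions for that trade-off across this and the other sketching schemes listed above.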


