An Econometric Perspective on Algorithmic Subsampling

07/03/2019
by Sokbae Lee, et al.

Datasets that are terabytes in size are increasingly common, but computing bottlenecks often frustrate a complete analysis of the data. While more data are better than less, diminishing returns suggest that we may not need terabytes of data to estimate a parameter or test a hypothesis. But which rows of data should we analyze, and might an arbitrary subset of rows preserve the features of the original data? This paper reviews a line of work that is grounded in theoretical computer science and numerical linear algebra, and which finds that an algorithmically desirable sketch of the data must have a subspace embedding property. Building on this work, we study how prediction and inference are affected by data sketching within a linear regression setup. The sketching error is small compared to the sample size effect, which is within the control of the researcher. As a sketch size that is algorithmically optimal may not be suitable for prediction and inference, we use statistical arguments to provide 'inference conscious' guides to the sketch size. When appropriately implemented, an estimator that pools over different sketches can be nearly as efficient as the infeasible one using the full sample.
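The ideas in the abstract can be illustrated with a minimal sketch in Python. It is not the authors' implementation: it assumes a Gaussian random projection as the sketching matrix (one standard example of a subspace embedding), a simple average as the pooling rule, and hypothetical function names (`sketched_ols`, `pooled_sketched_ols`, `embedding_distortion`) chosen only for this illustration.

```python
import numpy as np

def gaussian_sketch(m, n, rng):
    # Assumed sketching matrix: m x n with i.i.d. N(0, 1/m) entries,
    # so that E[Pi' Pi] equals the n x n identity.
    return rng.standard_normal((m, n)) / np.sqrt(m)

def sketched_ols(X, y, m, rng):
    # Least squares on the sketched data (Pi X, Pi y) with m << n rows.
    Pi = gaussian_sketch(m, X.shape[0], rng)
    beta, *_ = np.linalg.lstsq(Pi @ X, Pi @ y, rcond=None)
    return beta

def pooled_sketched_ols(X, y, m, n_sketches, rng):
    # Assumed pooling rule: average the estimates from independent sketches.
    betas = [sketched_ols(X, y, m, rng) for _ in range(n_sketches)]
    return np.mean(betas, axis=0)

def embedding_distortion(X, m, rng):
    # Subspace embedding check: for an orthonormal basis U of col(X),
    # the eigenvalues of (Pi U)'(Pi U) should all be close to 1.
    U, _ = np.linalg.qr(X)
    PU = gaussian_sketch(m, X.shape[0], rng) @ U
    eigvals = np.linalg.eigvalsh(PU.T @ PU)
    return np.max(np.abs(eigvals - 1.0))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, m = 2000, 3, 200            # illustrative sizes, not from the paper
    X = rng.standard_normal((n, d))
    beta_true = np.array([1.0, -2.0, 0.5])
    y = X @ beta_true + rng.standard_normal(n)

    beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
    print("full sample   :", beta_full)
    print("one sketch    :", sketched_ols(X, y, m, rng))
    print("pooled (10)   :", pooled_sketched_ols(X, y, m, 10, rng))
    print("max distortion:", embedding_distortion(X, m, rng))
```

In this setup the sketch size `m` plays the role of the sample size that the researcher controls, and averaging over several sketches is one simple way to recover some of the efficiency lost by not using all `n` rows.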

research
07/03/2019

An Econometric View of Algorithmic Subsampling

Datasets that are terabytes in size are increasingly common, but compute...
research
07/15/2020

Sketching for Two-Stage Least Squares Estimation

When there is so much data that they become a computation burden, it is ...
research
03/04/2020

Optimal Regularization Can Mitigate Double Descent

Recent empirical and theoretical studies have shown that many learning a...
research
07/04/2019

Sketched MinDist

We consider sketch vectors of geometric objects J through the function ...
research
10/10/2016

Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data

Sketching techniques have become popular for scaling up machine learning...
research
06/06/2023

Statistical inference for sketching algorithms

Sketching algorithms use random projections to generate a smaller sketch...
research
02/02/2023

Sketched Ridgeless Linear Regression: The Role of Downsampling

Overparametrization often helps improve the generalization performance. ...
