Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis with Limited Computational Resources

04/13/2023
by Shuyuan Wu et al.

Modern statistical analysis often encounters datasets of very large size. For such datasets, conventional estimation methods can hardly be applied directly, because practitioners typically have limited computational resources and, in most cases, no access to powerful distributed systems (e.g., Hadoop or Spark). How to analyze large datasets in practice with limited computational resources thus becomes a problem of great importance. To solve this problem, we propose a novel subsampling-based method with jackknifing. The key idea is to treat the whole sample as if it were the population. Multiple subsamples of greatly reduced size are then drawn by simple random sampling with replacement. Notably, we do not recommend sampling without replacement, because it would incur a significant cost for data processing on the hard drive; no such cost arises when the data are processed in memory. Because the subsampled datasets are relatively small, each can be read into computer memory as a whole and then processed easily. From each subsample, a jackknife-debiased estimator of the target parameter is obtained; these estimators are statistically consistent, with extremely small bias. Finally, the jackknife-debiased estimators from the different subsamples are averaged to form the final estimator. We show theoretically that the final estimator is consistent and asymptotically normal, and that its asymptotic statistical efficiency can be as good as that of the whole-sample estimator under very mild conditions. The proposed method is simple enough to be implemented on most practical computer systems and should therefore have very wide applicability.
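To make the procedure concrete, here is a minimal Python sketch of the idea described above: draw subsamples with replacement, apply a jackknife bias correction to the estimator on each subsample, and average the corrected estimates. The function names, the toy variance estimator, and all parameter values are illustrative assumptions, not code from the paper.

```python
import numpy as np

def jackknife_debias(sample, estimator):
    # Jackknife bias correction on one subsample of size n:
    # n * theta_hat - (n - 1) * mean(leave-one-out estimates),
    # which removes the O(1/n) bias term of a plug-in estimator.
    n = len(sample)
    theta_hat = estimator(sample)
    loo = np.array([estimator(np.delete(sample, i)) for i in range(n)])
    return n * theta_hat - (n - 1) * loo.mean()

def subsample_jackknife_estimate(data, estimator, n_sub, n_reps, seed=None):
    # Treat `data` as the population; draw `n_reps` subsamples of size
    # `n_sub` by simple random sampling WITH replacement, debias each,
    # and average the results into the final estimator.
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_reps):
        idx = rng.integers(0, len(data), size=n_sub)  # with replacement
        estimates.append(jackknife_debias(data[idx], estimator))
    return float(np.mean(estimates))

# Toy usage: the plug-in variance estimator is biased by a factor of
# (n - 1) / n on a sample of size n, so the correction is easy to see.
data = np.random.default_rng(0).normal(size=100_000)
print(subsample_jackknife_estimate(
    data, estimator=lambda x: x.var(), n_sub=200, n_reps=50, seed=1))
```

In an actual large-data setting, each subsample would be read from the hard drive into memory rather than indexed from an in-memory array as above; the abstract argues this is exactly where sampling with replacement avoids the extra disk-processing cost of without-replacement schemes.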


