On the Subbagging Estimation for Massive Data

02/28/2021
by   Tao Zou, et al.

This article introduces subbagging (subsample aggregating) estimation approaches for big data analysis under computer memory constraints. Specifically, from the whole dataset of size N, m_N subsamples are randomly drawn, each of size k_N ≪ N so that it fits within the memory constraint, and each sampled uniformly without replacement. Aggregating the estimators computed on the m_N subsamples yields the subbagging estimator. To analyze its theoretical properties, we adapt the theory of incomplete U-statistics with an infinite-order kernel, which allows the drawn subsamples to overlap. Using this novel theoretical framework, we demonstrate that, with a proper selection of the hyperparameters k_N and m_N, the subbagging estimator achieves √N-consistency and asymptotic normality under the condition (k_N m_N)/N → α ∈ (0, ∞]. We also show theoretically that, compared to the full-sample estimator, the √N-consistent subbagging estimator has an inflation rate of 1/α in its asymptotic variance. Simulation experiments demonstrate the finite-sample performance. An American airline dataset is analyzed to illustrate that the subbagging estimate is numerically close to the full-sample estimate and can be computed quickly under the memory constraint.
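The sampling-and-aggregating procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `subbagging_estimate`, the choice of the sample mean as the per-subsample estimator, the simple averaging step, and the simulated data are all assumptions made for illustration. In a genuinely memory-constrained setting each subsample would be read from disk, whereas here the full array is held in memory for simplicity.

```python
import numpy as np

def subbagging_estimate(data, estimator, k, m, seed=0):
    """Average an estimator over m subsamples of size k (hypothetical helper).

    Each subsample is drawn uniformly without replacement from `data`;
    different subsamples are drawn independently, so they may overlap.
    """
    n = len(data)
    if not 0 < k <= n:
        raise ValueError("subsample size k must satisfy 0 < k <= len(data)")
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(m):
        # Indices of one subsample, sampled without replacement.
        idx = rng.choice(n, size=k, replace=False)
        estimates.append(estimator(data[idx]))
    # Aggregate the m subsample estimates by simple averaging.
    return np.mean(estimates, axis=0)

# Simulated data (an assumption; not the airline dataset from the article).
rng = np.random.default_rng(42)
N = 100_000
x = rng.normal(loc=2.0, scale=1.0, size=N)

k, m = 1_000, 200          # here k * m / N = 2, playing the role of alpha
theta_sub = subbagging_estimate(x, np.mean, k, m)
theta_full = x.mean()
print(f"subbagging: {theta_sub:.4f}, full sample: {theta_full:.4f}")
```

In this sketch the ratio k * m / N corresponds to the quantity whose limit is called α in the abstract; the code only shows the mechanics of the procedure and does not verify the asymptotic results.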

research
04/13/2023

On the asymptotic properties of a bagging estimator with a massive dataset

Bagging is a useful method for large-scale statistical analysis, especia...
research
07/22/2023

Survey Design and Estimating Equations when Combining Big Data with Probability Samples

The use of big data in official statistics and the applied sciences is a...
research
07/08/2019

A Versatile Estimation Procedure without Estimating the Nonignorable Missingness Mechanism

We consider the estimation problem in a regression setting where the out...
research
11/14/2019

On Data Enriched Logistic Regression

Biomedical researchers usually study the effects of certain exposures on...
research
10/03/2021

A Sequential Addressing Subsampling Method for Massive Data Analysis under Memory Constraint

The emergence of massive data in recent years brings challenges to autom...
research
08/16/2017

Adaptive Threshold Sampling and Estimation

Sampling is a fundamental problem in both computer science and statistic...
research
01/08/2019

Efficient Minimum Distance Estimation of Pareto Exponent from Top Income Shares

We propose an efficient estimation method for the income Pareto exponent...