Survey Design and Estimating Equations when Combining Big Data with Probability Samples

07/22/2023
by Ryan Covey, et al.

The use of big data in official statistics and the applied sciences is accelerating, but statistics computed using only big data often suffer from substantial selection bias. This leads to inaccurate estimation and invalid statistical inference. We rectify the issue for a broad class of linear and nonlinear statistics by producing estimating equations that combine big data with a probability sample. Under weak assumptions about an unknown superpopulation, we show that our integrated estimator is consistent, asymptotically unbiased and asymptotically normal. Variance estimators with respect to both the sampling design alone and jointly with the superpopulation are obtained at once using a single, unified theoretical approach. A surprising corollary is that strategies minimising the design variance almost minimise the joint variance when the population and sample sizes are large. The integrated estimator is shown to be more efficient than its survey-only counterpart if dependence between sample membership indicators is weak and the finite population is large. We illustrate our method for quantiles, the Gini index, linear regression coefficients and maximum likelihood estimators where the sampling design is stratified simple random sampling without replacement. We demonstrate our results in a simulation of individual Australian incomes.
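
To make the idea of combining the two data sources concrete, here is a minimal sketch for the simplest case, a population mean. It is not the paper's estimating equations, which the abstract does not spell out. It assumes a common data-integration setup: the big data set records the variable for every unit it covers, and the probability sample (stratified simple random sampling without replacement) also observes whether each sampled unit belongs to the big data set. The function name `integrated_mean`, the simulated incomes and the inclusion mechanism are all hypothetical choices made for illustration.

```python
import random

def integrated_mean(y, in_big, strata, sample_sizes, rng):
    """Illustrative integrated estimator of a finite-population mean.

    Assumption (not taken from the paper): the total splits into
    (a) units covered by the big data, whose total is known exactly, and
    (b) units missing from the big data, whose total is estimated from a
        stratified SRSWOR sample with design weights N_h / n_h.
    """
    N = len(y)
    # (a) total over big-data units: no sampling error
    big_total = sum(y_i for y_i, b in zip(y, in_big) if b)

    # group unit indices by stratum label
    by_stratum = {}
    for i, h in enumerate(strata):
        by_stratum.setdefault(h, []).append(i)

    # (b) weighted survey estimate of the total over units NOT in the big data
    rest_total = 0.0
    for h, units in by_stratum.items():
        n_h = sample_sizes[h]
        sampled = rng.sample(units, n_h)      # SRSWOR within stratum h
        weight = len(units) / n_h             # design weight N_h / n_h
        rest_total += weight * sum(y[i] for i in sampled if not in_big[i])

    return (big_total + rest_total) / N

# Hypothetical population: two strata with different income levels.
rng = random.Random(2023)
strata = ["A"] * 2000 + ["B"] * 3000
y = [rng.lognormvariate(10.2 if h == "A" else 10.7, 0.6) for h in strata]
# Selection bias by construction: higher incomes are more likely to be covered.
in_big = [rng.random() < min(0.95, y_i / 80000) for y_i in y]

true_mean = sum(y) / len(y)
big_only = sum(v for v, b in zip(y, in_big) if b) / sum(in_big)
combined = integrated_mean(y, in_big, strata, {"A": 100, "B": 150}, rng)
print(f"true mean      {true_mean:12.2f}")
print(f"big data only  {big_only:12.2f}")
print(f"integrated     {combined:12.2f}")
```

In this sketch the big-data-only mean is biased by construction, because coverage depends on income, while the integrated estimate uses the survey only for the part of the population the big data misses. Two sanity checks follow directly from the formula: if the sample is a census, or if the big data covers every unit, the estimator returns the true mean.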

research
06/28/2023

Integrating Big Data and Survey Data for Efficient Estimation of the Median

An ever-increasing deluge of big data is becoming available to national ...
research
06/15/2019

Proxy expenditure weights for Consumer Price Index: Audit sampling inference for big data statistics

Purchase data from retail chains provide proxy measures of private house...
research
05/30/2021

Statistical Inference from Partially Nominated Sets: An Application to Estimating the Prevalence of Osteoporosis

This paper focuses on drawing statistical inference based on a novel var...
research
02/28/2021

On the Subbagging Estimation for Massive Data

This article introduces subbagging (subsample aggregating) estimation ap...
research
12/06/2022

Efficient Stratification Method for Socioeconomic Survey in Remote Areas

The problems that exist in implementing a sampling design for socio-econ...
research
06/06/2023

A Calibrated Data-Driven Approach for Small Area Estimation using Big Data

Where the response variable in a big data set is consistent with the var...
research
01/03/2021

Better understanding of the multivariate hypergeometric distribution with implications in design-based survey sampling

Multivariate hypergeometric distribution arises frequently in elementary...
