Are deep learning models superior for missing data imputation in large surveys? Evidence from an empirical comparison

03/14/2021
by Zhenhua Wang, et al.

Multiple imputation (MI) is the state-of-the-art approach for dealing with missing data arising from non-response in sample surveys. Multiple imputation by chained equations (MICE) is the most widely used MI method, but it lacks a theoretical foundation and is computationally intensive. Recently, MI methods based on deep learning models have been developed, with encouraging results in small studies. However, there has been limited research systematically evaluating their performance against MICE in realistic settings, particularly in large-scale surveys. This paper provides a general framework for comparing MI methods, using simulations based on real survey data and several performance metrics. We conduct extensive simulation studies based on American Community Survey data to compare the repeated sampling properties of four machine learning-based MI methods: MICE with classification trees, MICE with random forests, generative adversarial imputation networks, and multiple imputation using denoising autoencoders. We find that the deep learning-based MI methods dominate MICE in terms of computational time; however, MICE with classification trees consistently outperforms the deep learning MI methods in terms of bias, mean squared error, and coverage under a range of realistic settings.
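
To make the three accuracy metrics named above concrete, here is a minimal Python sketch of how bias, mean squared error, and interval coverage could be computed across simulation replicates once each replicate's imputed datasets have been pooled. It is an illustration under stated assumptions, not the paper's code: the function names `pool_rubin` and `evaluate` are hypothetical, pooling follows the standard Rubin's combining rules, and the interval uses a normal approximation rather than the t reference distribution usually paired with those rules.

```python
import math
from statistics import NormalDist, mean, variance


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates using Rubin's combining rules.

    estimates: point estimates, one per imputed dataset (m >= 2).
    variances: their estimated sampling variances, in the same order.
    """
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    u_bar = mean(variances)            # within-imputation variance
    b = variance(estimates)            # between-imputation variance (m - 1 denominator)
    total = u_bar + (1 + 1 / m) * b    # total variance
    return q_bar, total


def evaluate(pooled, truth, level=0.95):
    """Summarise repeated-sampling performance over simulation replicates.

    pooled: list of (pooled_estimate, total_variance) pairs, one per replicate.
    truth:  the known population value of the estimand in the simulation.
    """
    # Assumption: normal quantile instead of Rubin's t reference distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    ests = [q for q, _ in pooled]
    bias = mean(ests) - truth
    mse = mean([(q - truth) ** 2 for q in ests])
    covered = [abs(q - truth) <= z * math.sqrt(t) for q, t in pooled]
    return {"bias": bias, "mse": mse, "coverage": sum(covered) / len(covered)}
```

In a comparison like the one described, each replicate would draw a sample, impose missingness, impute with a given method, and pass the per-imputation estimates to `pool_rubin`; `evaluate` then summarises the replicates against the known population value.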

research
05/08/2017

Multiple Imputation Using Deep Denoising Autoencoders

Missing data is a well-recognized problem impacting all domains. State-o...
research
09/22/2022

Multistage Large Segment Imputation Framework Based on Deep Learning and Statistic Metrics

Missing value is a very common and unavoidable problem in sensors, and r...
research
09/30/2022

Leveraging variational autoencoders for multiple data imputation

Missing data persists as a major barrier to data analysis across numerou...
research
07/13/2020

Imputation procedures in surveys using nonparametric and machine learning methods: an empirical comparison

Nonparametric and machine learning methods are flexible methods for obta...
research
07/25/2018

Propensity score estimation using classification and regression trees in the presence of missing covariate data

Data mining and machine learning techniques such as classification and r...
research
04/30/2020

Multiple imputation using chained random forests: a preliminary study based on the empirical distribution of out-of-bag prediction errors

Missing data are common in data analyses in biomedical fields, and imput...
research
06/22/2021

Multiple Organ Failure Prediction with Classifier-Guided Generative Adversarial Imputation Networks

Multiple organ failure (MOF) is a severe syndrome with a high mortality ...
