Solving the "many variables" problem in MICE with principal component regression

06/30/2022
by Edoardo Costantini, et al.

Multiple imputation (MI) is one of the most popular approaches to addressing missing values in questionnaires and surveys. MI with multivariate imputation by chained equations (MICE) allows flexible imputation of many types of data. In MICE, the imputer must specify, for each variable under imputation, which variables should act as predictors in the imputation model. Selecting these predictors is a difficult but fundamental step in the MI procedure, especially when a data set contains many variables. In this project, we explore the use of principal component regression (PCR) as a univariate imputation method in the MICE algorithm to automatically address the "many variables" problem that arises when imputing large social science data sets. We compare different implementations of PCR-based MICE with a correlation-thresholding strategy by means of a Monte Carlo simulation study and a case study. We find that using PCR on a variable-by-variable basis performs best and that it can perform nearly as well as expertly designed imputation procedures.
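To make the idea of PCR as a univariate imputation method concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes a single numeric target variable, predictors that are already complete (or filled in by an earlier MICE iteration) and non-constant, and a fixed number of components. The function name `pcr_impute_step`, the simulated data, and the choice of five components are all hypothetical. For brevity the sketch adds residual noise to the predictions but does not draw the regression parameters from their posterior, which a proper multiple imputation routine would also do.

```python
import numpy as np

def pcr_impute_step(y, X, n_components, rng):
    """One illustrative PCR imputation step for a single variable y.

    y: 1-D float array with np.nan marking missing entries.
    X: 2-D array of complete, non-constant predictors.
    """
    missing = np.isnan(y)
    # Standardize the predictors, then get principal component scores via SVD.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:n_components].T
    # Regress the observed part of y on the component scores (with intercept).
    design = np.column_stack([np.ones(len(y)), scores])
    beta, *_ = np.linalg.lstsq(design[~missing], y[~missing], rcond=None)
    resid = y[~missing] - design[~missing] @ beta
    dof = max(int((~missing).sum()) - design.shape[1], 1)
    sigma = np.sqrt(resid @ resid / dof)
    # Fill the missing entries with predictions plus residual noise.
    y_imp = y.copy()
    noise = rng.normal(0.0, sigma, int(missing.sum()))
    y_imp[missing] = design[missing] @ beta + noise
    return y_imp

# Simulated example: 10 predictors, about 20% of y missing at random.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, :3].sum(axis=1) + rng.normal(scale=0.5, size=200)
y[rng.random(200) < 0.2] = np.nan
y_filled = pcr_impute_step(y, X, n_components=5, rng=rng)
```

In a full MICE run, a step like this would be repeated for each incomplete variable in turn and cycled over several iterations. The component scores would be recomputed from the current filled-in predictors each time, which corresponds to the variable-by-variable use of PCR described above.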
