Propensity score estimation using classification and regression trees in the presence of missing covariate data

Data mining and machine learning techniques such as classification and regression trees (CART) represent a promising alternative to conventional logistic regression for propensity score estimation. Whereas incomplete data preclude the fitting of a logistic regression on all subjects, CART is appealing in part because some implementations allow incomplete records to be incorporated in the tree fitting and provide propensity score estimates for all subjects. Based on theoretical considerations, we argue that the automatic handling of missing data by CART may, however, not be appropriate. Using a series of simulation experiments, we examined the performance of three approaches to handling missing covariate data: (i) applying the CART algorithm directly to the (partially) incomplete data, (ii) complete case analysis, and (iii) multiple imputation. Performance was assessed in terms of bias in estimating exposure-outcome effects among the exposed, standard error, mean squared error, and coverage. Applying the CART algorithm directly to incomplete data resulted in bias, even in scenarios where data were missing completely at random. Overall, multiple imputation followed by CART resulted in the best performance. Our study showed that automatic handling of missing data in CART can cause serious bias and does not outperform multiple imputation as a means to account for missing data.
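As a rough illustration of approaches (ii) and (iii), the following minimal sketch estimates propensity scores with a classification tree after complete case analysis and after multiple imputation. It is not the authors' implementation: the simulated data, the helper name `fit_cart_ps`, the tree settings, the number of imputations `m`, and the use of scikit-learn's `DecisionTreeClassifier` and `IterativeImputer` are all assumptions made for the example. Approach (i) is left out because how a tree handles missing values natively depends on the implementation.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000

# Hypothetical simulated data: two confounders and a binary exposure.
X = rng.normal(size=(n, 2))
p_true = 1.0 / (1.0 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))
t = rng.binomial(1, p_true)

# Make about 30% of the first covariate missing completely at random.
X_mis = X.copy()
X_mis[rng.random(n) < 0.3, 0] = np.nan


def fit_cart_ps(X_fit, t_fit, X_pred):
    """Fit a classification tree for exposure and return P(exposed | X_pred)."""
    tree = DecisionTreeClassifier(min_samples_leaf=50, random_state=0)
    tree.fit(X_fit, t_fit)
    return tree.predict_proba(X_pred)[:, 1]


# (ii) Complete case analysis: only subjects with fully observed covariates.
complete = ~np.isnan(X_mis).any(axis=1)
ps_cc = fit_cart_ps(X_mis[complete], t[complete], X_mis[complete])

# (iii) Multiple imputation followed by CART: impute m times, fit a tree per
# imputed data set, and keep one propensity score vector per data set.
m = 5
ps_mi = []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    # The exposure is included in the imputation model as an extra column.
    completed = imputer.fit_transform(np.column_stack([X_mis, t]))
    X_imp = completed[:, :2]
    ps_mi.append(fit_cart_ps(X_imp, t, X_imp))
ps_mi = np.array(ps_mi)  # shape (m, n)

print("complete cases:", ps_cc.shape[0], "of", n)
print("MI propensity scores:", ps_mi.shape)
```

Complete case analysis yields propensity scores only for subjects without missing covariates, whereas the imputation route yields a score for every subject in each imputed data set. In a full analysis one would typically estimate the exposure effect within each imputed data set and then combine the estimates, for example with Rubin's rules.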
