Correcting for Selection Bias and Missing Response in Regression using Privileged Information

03/29/2023
by Philip Boeken, et al.

When estimating a regression model, some labels may be missing, or the data may be biased by a selection mechanism. When the response or selection mechanism is ignorable (i.e., independent of the response variable given the features), one can use off-the-shelf regression methods; in the nonignorable case, one typically has to adjust for bias. We observe that privileged data (i.e., data that is only available during training) might render a nonignorable selection mechanism ignorable, and we refer to this scenario as Privilegedly Missing at Random (PMAR). We propose a novel imputation-based regression method, named repeated regression, that is suitable for PMAR. We also consider an importance weighted regression method, and a doubly robust combination of the two. The proposed methods are easy to implement with most popular out-of-the-box regression algorithms. We empirically assess the performance of the proposed methods in extensive simulated experiments and on a synthetically augmented real-world dataset. We conclude that repeated regression can appropriately correct for bias and can have a considerable advantage over weighted regression, especially when extrapolating to regions of the feature space where the response is never observed.
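The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of an imputation-based approach under PMAR, not the authors' implementation. It assumes features `x`, privileged features `z` available for every training sample, a response `y` observed only where the selection indicator `s` is true, and that `y` is independent of `s` given `(x, z)`. The first regression learns `y` from `(x, z)` on the selected samples; the second regresses the imputed values on `x` alone, so that prediction needs no privileged data. The simulated data, the plain least-squares models, and all function names are illustrative assumptions.

```python
import numpy as np


def fit_linear(features, target):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict_linear(coef, features):
    """Predictions of a model returned by fit_linear."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef


def simulate(n, seed=0):
    """Toy data (an assumption for illustration): selection depends on z only,
    so it is nonignorable given x but ignorable given (x, z)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    z = x + rng.normal(size=n)                   # privileged feature
    y = 2.0 * x + 3.0 * z + rng.normal(size=n)   # implies E[y | x] = 5 * x
    p_select = 1.0 / (1.0 + np.exp(-2.0 * z))    # selection probability
    s = rng.random(n) < p_select                 # True where y is observed
    return x, z, y, s


def repeated_regression(x, z, y, s):
    """Two-stage sketch: impute y from (x, z), then regress imputations on x."""
    # Stage 1: uses only rows where the response is observed (s is True).
    coef_xz = fit_linear(np.column_stack([x[s], z[s]]), y[s])
    # Impute the response for every row, selected or not.
    y_imputed = predict_linear(coef_xz, np.column_stack([x, z]))
    # Stage 2: final model uses x only, so z is not needed at test time.
    return fit_linear(x, y_imputed)


x, z, y, s = simulate(20_000)
coef_naive = fit_linear(x[s], y[s])              # ignores the selection bias
coef_repeated = repeated_regression(x, z, y, s)
print("naive    [intercept, slope]:", coef_naive)
print("repeated [intercept, slope]:", coef_repeated)
```

In this toy setup the true regression function is `E[y | x] = 5 * x`, so the repeated estimate should land near intercept 0 and slope 5, while the naive fit on selected rows alone is pulled away from it. Any regression algorithm could replace the least-squares helpers in either stage.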

research
06/27/2023

Triply robust estimation under missing at random

Missing data is frequently encountered in many areas of statistics. Impu...
research
08/04/2022

Using Instruments for Selection to Adjust for Selection Bias in Mendelian Randomization

Selection bias is a common concern in epidemiologic studies. In the lite...
research
09/14/2023

On Prediction Feature Assignment in the Heckman Selection Model

Under missing-not-at-random (MNAR) sample selection bias, the performanc...
research
11/10/2021

Variable selection and missing data imputation in categorical genomic data analysis by integrated ridge regression and random forest

Genomic data arising from a genome-wide association study (GWAS) are oft...
research
09/24/2018

Preserving the distribution function in surveys in case of imputation for zero inflated data

Item non-response in surveys is usually handled by single imputation, wh...
research
12/14/2021

Navigating the corporate disclosure gap: Modelling of Missing Not at Random Carbon Data

Corporate carbon emissions data is disclosed by approximately 65 and mid...
research
11/06/2021

In Nonparametric and High-Dimensional Models, Bayesian Ignorability is an Informative Prior

In problems with large amounts of missing data one must model two distin...