Imputation for High-Dimensional Linear Regression

We study high-dimensional regression with missing entries in the covariates. A common strategy in practice is to impute the missing entries with an appropriate substitute and then apply a standard statistical procedure as if the covariates were fully observed. Recent literature on this subject instead proposes designing a specific, often complicated or non-convex, algorithm tailored to the case of missing covariates. We investigate a simpler approach in which we fill in the missing entries with their conditional mean given the observed covariates. We show that this imputation scheme, coupled with standard off-the-shelf procedures such as the LASSO and square-root LASSO, retains the minimax estimation rate in the random-design setting where the covariates are i.i.d. sub-Gaussian. We further show that the square-root LASSO remains pivotal in this setting. Often the conditional expectation cannot be computed exactly and must be approximated from data. We study two cases, in which the covariates either follow an autoregressive (AR) process or are jointly Gaussian with a sparse precision matrix. In both cases we propose tractable estimators for the conditional expectation, then perform linear regression via the LASSO, and show similar estimation rates. We complement our theoretical results with simulations on synthetic and semi-synthetic examples, illustrating not only the sharpness of our bounds but also the broader utility of this strategy beyond our theoretical assumptions.
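To make the impute-then-regress strategy concrete, the following is a minimal sketch, not the authors' implementation. It assumes the covariates are jointly Gaussian with a known mean and covariance (the case where the conditional expectation can be computed exactly) and that entries are missing completely at random. The function names `conditional_mean_impute` and `lasso_cd`, the AR(1)-style covariance, and all numerical settings are illustrative assumptions; the plain coordinate-descent solver stands in for any off-the-shelf LASSO routine.

```python
import numpy as np


def conditional_mean_impute(X, observed, mu, Sigma):
    """Replace missing entries by E[x_missing | x_observed] under N(mu, Sigma).

    X        : (n, d) array; values at unobserved positions are ignored.
    observed : (n, d) boolean array, True where the entry was observed.
    """
    X_imp = np.where(observed, X, 0.0)
    for i in range(X.shape[0]):
        obs, mis = observed[i], ~observed[i]
        if not mis.any():
            continue
        if not obs.any():
            X_imp[i, mis] = mu[mis]  # nothing observed: fall back to the mean
            continue
        S_oo = Sigma[np.ix_(obs, obs)]
        S_mo = Sigma[np.ix_(mis, obs)]
        # Gaussian conditional mean: mu_m + S_mo S_oo^{-1} (x_o - mu_o)
        X_imp[i, mis] = mu[mis] + S_mo @ np.linalg.solve(S_oo, X[i, obs] - mu[obs])
    return X_imp


def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - X b||_2^2 + lam * ||b||_1."""
    n, d = X.shape
    beta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = y.astype(float).copy()
    for _ in range(n_iter):
        for j in range(d):
            if col_sq[j] == 0.0:
                continue
            resid += X[:, j] * beta[j]          # remove coordinate j from the fit
            z = X[:, j] @ resid / n
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]  # soft-threshold
            resid -= X[:, j] * beta[j]          # add the updated coordinate back
    return beta


# Synthetic example (all settings below are illustrative assumptions).
rng = np.random.default_rng(0)
n, d, rho = 200, 50, 0.5
idx = np.arange(d)
Sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # AR(1)-style covariance
mu = np.zeros(d)
X_full = rng.multivariate_normal(mu, Sigma, size=n)

beta_true = np.zeros(d)
beta_true[:3] = [2.0, -1.5, 1.0]                    # sparse signal
y = X_full @ beta_true + 0.5 * rng.standard_normal(n)

observed = rng.random((n, d)) > 0.2                 # about 20% of entries missing
X_missing = np.where(observed, X_full, np.nan)

X_imp = conditional_mean_impute(X_missing, observed, mu, Sigma)
beta_hat = lasso_cd(X_imp, y, lam=0.2)
print("l2 estimation error:", np.linalg.norm(beta_hat - beta_true))
```

In this sketch the regularization level `lam=0.2` is picked by hand; in practice it would typically be tuned, for example by cross-validation, or replaced by the square-root LASSO, which the abstract notes remains pivotal. When the mean and covariance are unknown, they would have to be estimated from the partially observed data, which is the setting of the AR and sparse-precision cases mentioned above.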


