An ensemble learning method for variable selection: application to high dimensional data and missing values

08/21/2018
by Avner Bar-Hen, et al.

Standard approaches to variable selection in linear models are not designed to handle data that are both high-dimensional and incomplete. Methods dedicated to high-dimensional data currently handle missing values with ad hoc strategies, such as complete-case analysis or single imputation, while methods dedicated to missing values, mainly based on multiple imputation, do not address which imputation method to use with high-dimensional data. Both approaches are therefore limited for many modern applications. Drawing on ensemble methods, a new variable selection method is proposed. It extends classical variable selection methods such as stepwise, lasso, or knockoff to high-dimensional data, with or without missing values. Theoretical properties are studied, and the practical interest is demonstrated through a simulation study. In the low-dimensional case without missing values, the method can perform better than standard techniques, and the procedure improves control of the error risks. With missing values, the method performs better than reference selection methods based on multiple imputation. Similar performance is obtained in the high-dimensional case, with or without missing values.
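The abstract does not spell out the algorithm, so the following is only a rough sketch of what an ensemble-style selection could look like, not the authors' procedure. It assumes one plausible reading: draw many small random subsets of variables, run a simple selection rule on each subset using the rows that are complete for that subset, and keep the variables that are selected in a large share of the draws in which they appear. The function name `ensemble_select`, the t-statistic rule, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def ensemble_select(X, y, n_draws=200, subset_size=5,
                    t_threshold=2.5, min_rate=0.5, seed=0):
    """Hypothetical ensemble selection: aggregate selections over random subsets.

    Assumption (not from the abstract): the base selector is an OLS fit on each
    subset, keeping variables whose |t-statistic| exceeds t_threshold.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    drawn = np.zeros(p)   # number of times each variable was drawn
    picked = np.zeros(p)  # number of times each variable was selected
    for _ in range(n_draws):
        cols = rng.choice(p, size=subset_size, replace=False)
        Xs = X[:, cols]
        # Use only rows that are complete for this small subset of variables.
        rows = ~np.isnan(Xs).any(axis=1) & ~np.isnan(y)
        A = np.column_stack([np.ones(rows.sum()), Xs[rows]])
        b = y[rows]
        dof = A.shape[0] - A.shape[1]
        if dof <= 0:
            continue  # not enough complete rows to fit this subset
        coef, *_ = np.linalg.lstsq(A, b, rcond=None)
        resid = b - A @ coef
        sigma2 = resid @ resid / dof
        cov = sigma2 * np.linalg.pinv(A.T @ A)
        t_stats = coef[1:] / np.sqrt(np.diag(cov)[1:])
        drawn[cols] += 1
        picked[cols[np.abs(t_stats) > t_threshold]] += 1
    # Selection rate = share of draws in which the variable was selected.
    rate = np.divide(picked, drawn, out=np.zeros(p), where=drawn > 0)
    return np.flatnonzero(rate >= min_rate), rate


# Synthetic demo: 3 active variables out of 30, about 10% of entries missing.
demo_rng = np.random.default_rng(42)
X_full = demo_rng.normal(size=(200, 30))
y_demo = 2.0 * X_full[:, 0] + 2.0 * X_full[:, 1] + 2.0 * X_full[:, 2] \
    + demo_rng.normal(size=200)
X_demo = X_full.copy()
X_demo[demo_rng.random(X_demo.shape) < 0.1] = np.nan

selected, rate = ensemble_select(X_demo, y_demo)
print("selected variables:", selected)
```

Working on small subsets means each fit needs complete rows only for a handful of columns, which is one way such a scheme could tolerate missing values without imputing them. A lasso, stepwise, or knockoff selector could be swapped in for the t-statistic rule under the same aggregation step.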

research
03/30/2022

A comparison of strategies for selecting auxiliary variables for multiple imputation

Multiple imputation (MI) is a popular method for handling missing data. ...
research
11/16/2022

The Missing Indicator Method: From Low to High Dimensions

Missing data is common in applied data science, particularly for tabular...
research
02/06/2018

An Imputation-Consistency Algorithm for High-Dimensional Missing Data Problems and Beyond

Missing data are frequently encountered in high-dimensional problems, bu...
research
01/31/2023

Naive imputation implicitly regularizes high-dimensional linear models

Two different approaches exist to handle missing values for prediction: ...
research
01/14/2019

Supervised Learning for Multi-Block Incomplete Data

In the supervised high dimensional settings with a large number of varia...
research
02/26/2022

Missing Value Knockoffs

One limitation of the most statistical/machine learning-based variable s...
research
04/12/2022

Evolutionary shift detection with ensemble variable selection

1. Abrupt environmental changes can lead to evolutionary shifts in trait...
