A comprehensive review of variable selection in high-dimensional regression for molecular biology

by Perrine Lacroix, et al.

Variable selection methods are widely used in molecular biology to detect biomarkers or to infer gene regulatory networks from transcriptomic data. Most methods are based on the high-dimensional Gaussian linear regression model, and this review focuses on that framework. We propose a comparison study of variable selection procedures built on regularization paths, considering three simulation settings. In the first, the variables are independent, which allows the methods to be evaluated in the theoretical framework used to develop them. In the second, two correlation structures between variables are considered to evaluate how the dependencies usually observed in biological data affect the estimation. The third setting mimics the biological complexity of transcription factor regulations and is the farthest from the Gaussian framework. In all settings, the prediction performance and the identification of the explanatory variables are evaluated for each method. Our results show that variable selection procedures rely on statistical assumptions that should be carefully checked. The Gaussian assumption and the number of explanatory variables are the two key points. As soon as correlation exists, the Elastic-net regularization provides better results than the Lasso. LinSelect, a non-asymptotic model selection method, should be preferred to the commonly used eBIC criterion. Bolasso is a judicious strategy to limit the selection of non-explanatory variables.
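To make the Bolasso idea mentioned above concrete, here is a minimal sketch in pure Python. It is not the authors' code or their simulation design. It assumes the usual description of Bolasso: fit the Lasso on bootstrap resamples and keep only the variables selected in (nearly) all of them. The Lasso solver is a plain cyclic coordinate descent without intercept, and the data mimic only the independent-variables setting. The names `lasso_cd` and `bolasso`, and the values of `lam`, `n_boot` and `threshold`, are illustrative assumptions.

```python
import random

def soft_threshold(z, t):
    """Soft-thresholding operator used by the Lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def lasso_cd(X, y, lam, n_iter=50):
    """Minimise (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.
    No intercept is fitted (the toy data below are centred by construction)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - X beta
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # covariance of column j with the residual, adding back its own fit
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def bolasso(X, y, lam, n_boot=20, threshold=1.0, seed=0):
    """Keep variables with a non-zero Lasso coefficient in at least
    threshold * n_boot bootstrap resamples (threshold=1.0: all of them)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # rows drawn with replacement
        beta = lasso_cd([X[i] for i in idx], [y[i] for i in idx], lam)
        for j in range(p):
            if beta[j] != 0.0:
                counts[j] += 1
    return [j for j in range(p) if counts[j] >= threshold * n_boot]

# Toy data (assumed values): independent Gaussian variables, two true effects.
rng = random.Random(1)
n, p = 80, 8
true_beta = [3.0, -2.0] + [0.0] * (p - 2)
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [sum(X[i][j] * true_beta[j] for j in range(p)) + rng.gauss(0.0, 1.0)
     for i in range(n)]

lasso_support = [j for j, b in enumerate(lasso_cd(X, y, lam=0.3)) if b != 0.0]
selected = bolasso(X, y, lam=0.3)
print("single Lasso support:", lasso_support)
print("Bolasso support:", selected)
```

Requiring selection in every resample makes the rule conservative, which matches the abstract's point that Bolasso limits the selection of non-explanatory variables. Lowering `threshold` trades some of that protection for sensitivity.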




The EAS approach to variable selection for multivariate response data in high-dimensional settings

In this paper, we extend the epsilon admissible subsets (EAS) model sele...

The Loss Rank Criterion for Variable Selection in Linear Regression Analysis

Lasso and other regularization procedures are attractive methods for var...

High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking

Penalized likelihood methods are widely used for high-dimensional regres...

Structural randomised selection

An important problem in the analysis of high-dimensional omics data is t...

Effect of hyperparameters on variable selection in random forests

Random forests (RFs) are well suited for prediction modeling and variabl...

Generalized Linear Model for Gamma Distributed Variables via Elastic Net Regularization

The Generalized Linear Model (GLM) for the Gamma distribution (glmGamma)...

Inferring diagnostic and prognostic gene expression signatures across WHO glioma classifications: A network-based approach

Tumor heterogeneity is a challenge to designing effective and targeted t...
