High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking

08/02/2018
by   Fan Wang, et al.

Penalized likelihood methods are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well developed, the relative efficacy of different methods in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users of these methods. In this paper we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 1,800 data-generating scenarios, allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely used methods (Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection). We find considerable variation in performance between methods, with results dependent on the details of the data-generating scenario and the specific goal. Our results support a 'no panacea' view, with no unambiguous winner across all scenarios, even in this restricted setting where all data align well with the assumptions underlying the methods. Lasso is well behaved, performing competitively in many scenarios, while SCAD is highly variable. Substantial benefits from a Ridge penalty are seen only in the most challenging scenarios with strong multicollinearity. These results are supported by semi-synthetic analyses using gene expression data from cancer samples. Our empirical results complement existing theory and provide a resource for comparing methods across a range of scenarios and metrics.
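The prediction-versus-selection distinction the abstract draws can be illustrated with a minimal numpy-only sketch. The scenario parameters below (n=100, p=200, 5 true signals) are illustrative assumptions, not the paper's actual 1,800-scenario grid; ridge is computed in closed form and the lasso via a simple cyclic coordinate descent with soft-thresholding:

```python
import numpy as np

rng = np.random.default_rng(0)

# One illustrative data-generating scenario (assumed parameters, not the
# paper's grid): n samples, p > n predictors, k truly nonzero coefficients.
n, p, k = 100, 200, 5
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:k] = 2.0
y = X @ beta_true + rng.standard_normal(n)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam * I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso via cyclic coordinate descent with soft-thresholding.

    Minimizes (1/2) * ||y - X b||^2 + lam * ||b||_1.
    """
    p_dim = X.shape[1]
    b = np.zeros(p_dim)
    col_sq = (X ** 2).sum(axis=0)  # per-coordinate curvature ||x_j||^2
    for _ in range(n_iter):
        for j in range(p_dim):
            # Partial residual excluding coordinate j.
            r = y - X @ b + X[:, j] * b[j]
            z = X[:, j] @ r
            # Soft-threshold the univariate least-squares update.
            b[j] = np.sign(z) * max(abs(z) - lam, 0.0) / col_sq[j]
    return b

b_ridge = ridge(X, y, lam=10.0)
b_lasso = lasso_cd(X, y, lam=10.0)

# Ridge shrinks but keeps all p coefficients nonzero (no selection);
# lasso sets most coefficients exactly to zero (sparse selection).
print("ridge nonzeros:", int(np.sum(b_ridge != 0)))
print("lasso nonzeros:", int(np.sum(np.abs(b_lasso) > 1e-8)))
```

The contrast mirrors the abstract's point: which penalty "wins" depends on the goal, since the ridge fit is dense (useful for prediction under strong multicollinearity) while the lasso fit is sparse (useful for variable selection and ranking).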

