Ranked Sparsity: A Cogent Regularization Framework for Selecting and Estimating Feature Interactions and Polynomials

07/15/2021
by Ryan A. Peterson, et al.

We explore and illustrate the concept of ranked sparsity, a phenomenon that often occurs naturally in modeling applications when an expected disparity exists in the quality of information between different feature sets. Its presence can cause traditional and modern model selection methods to fail because such procedures commonly presume that each potential parameter is equally worthy of entering into the final model; we call this presumption "covariate equipoise". However, this presumption does not always hold, especially in the presence of derived variables. For instance, when all possible interactions are considered as candidate predictors, the premise of covariate equipoise will often produce over-specified and opaque models. The sheer number of additional candidate variables grossly inflates the number of false discoveries in the interactions, resulting in unnecessarily complex and difficult-to-interpret models with many (truly spurious) interactions. We suggest a modeling strategy that requires a stronger level of evidence before certain variables (e.g., interactions) are selected into the final model. This ranked sparsity paradigm can be implemented with the sparsity-ranked lasso (SRL). We compare the performance of the SRL with that of competing methods in a series of simulation studies, showing that the SRL is an attractive method because it is fast, accurate, and produces more transparent models (with fewer false interactions). We illustrate its utility in an application to predict the survival of lung cancer patients using a set of gene expression measurements and clinical covariates, searching in particular for gene-environment interactions.
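
One way to picture "requiring a stronger level of evidence" for interactions is a lasso in which each coefficient carries its own penalty weight. The sketch below is only an illustration of that general idea, not the authors' SRL implementation: the function names, the simulated data, the penalty level `lam=0.1`, and the interaction weight of 3.0 are all assumptions chosen for the example, and the abstract does not say how the SRL actually sets its weights.

```python
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly 0."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def weighted_lasso(X, y, weights, lam, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * sum_j weights[j] * |b_j|.

    No intercept is fitted; the toy data below are generated around zero.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, which equals y while b is all zeros
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual (column j added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            # A larger weight raises the threshold this column must clear to enter.
            new_bj = soft_threshold(rho, lam * weights[j]) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

# Simulated data (hypothetical): only the main effects x1 and x2 matter.
random.seed(0)
n = 200
X, y = [], []
for _ in range(n):
    x1, x2, x3 = (random.gauss(0, 1) for _ in range(3))
    # Columns 0-2 are main effects; columns 3-5 are pairwise interactions.
    X.append([x1, x2, x3, x1 * x2, x1 * x3, x2 * x3])
    y.append(2.0 * x1 - 1.5 * x2 + random.gauss(0, 1))

equal_weights = [1.0] * 6                          # "covariate equipoise"
ranked_weights = [1.0, 1.0, 1.0, 3.0, 3.0, 3.0]    # illustrative weights only

beta_equal = weighted_lasso(X, y, equal_weights, lam=0.1)
beta_ranked = weighted_lasso(X, y, ranked_weights, lam=0.1)
print("equal weights: ", [round(b, 3) for b in beta_equal])
print("ranked weights:", [round(b, 3) for b in beta_ranked])
```

In this toy setup, raising the weights on the product columns raises the threshold those columns must clear to enter the model, while the penalty on the main effects is unchanged.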


research
11/02/2022

Fast, effective, and coherent time series modeling using the sparsity-ranked lasso

The sparsity-ranked lasso (SRL) has been developed for model selection a...
research
11/01/2019

Analysis of Genomic and Transcriptomic Variations as Prognostic Signature for Lung Adenocarcinoma

Lung cancer is the leading cause of the largest number of deaths worldwi...
research
11/01/2019

Meta-Analysis of Genomic and Transcriptomic Variations in Lung Adenocarcinoma

Lung cancer is the leading cause of the largest number of deaths worldwi...
research
11/10/2020

Gaussian Graphical Regression Models with High Dimensional Responses and Covariates

Though Gaussian graphical models have been widely used in many scientifi...
research
01/09/2017

MEBoost: Variable Selection in the Presence of Measurement Error

We present a novel method for variable selection in regression models wh...
research
11/03/2021

Inference of Microbial Interactions Using Copula Models with Mixture Margins

Quantification of microbial interactions from 16S rRNA and meta-genomic ...
