SuRF: a New Method for Sparse Variable Selection, with Application in Microbiome Data Analysis

09/13/2019
by Lihui Liu, et al.

In this paper, we present a new variable selection method for regression and classification purposes. Our method, called Subsampling Ranking Forward selection (SuRF), is based on LASSO-penalised regression, subsampling and forward-selection methods. SuRF offers major advantages over existing variable selection methods in terms of both sparsity of selected models and model inference. We provide an R package that implements our method for generalized linear models. We apply our method to classification problems from microbiome data, using a novel agglomeration approach to deal with the special tree-like correlation structure of the variables. Existing methods arbitrarily choose a taxonomic level a priori before performing the analysis, whereas by combining SuRF with these aggregated variables, we are able to identify the key biomarkers at the appropriate taxonomic level, as suggested by the data. We present simulations in multiple sparse settings to demonstrate that our approach recovers the true variables better than several widely used existing approaches. We apply SuRF to two microbiome data sets: one on the prediction of pouchitis and another on identifying samples from two healthy individuals. We find that SuRF provides prediction that is better than or comparable to that of other methods while controlling the false positive rate of variable selection.
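The idea of aggregated variables can be illustrated with a minimal sketch. It assumes, as one plausible reading only, that an aggregated variable is the sum of the counts of all leaf-level units falling under a taxon, computed at every rank of the tree; the construction used in the paper may differ. The function name `agglomerate`, the rank list `RANKS`, and the toy taxa and counts are all hypothetical and serve only as an illustration.

```python
from collections import defaultdict

# Hypothetical rank names; a real taxonomy would usually have more levels.
RANKS = ["phylum", "class", "genus"]


def agglomerate(counts, lineages):
    """Sum leaf-level counts into one variable per taxon at every rank.

    counts:   {sample_id: {leaf_id: count}}
    lineages: {leaf_id: tuple of taxon names, one per entry in RANKS}
    returns:  {sample_id: {"rank:taxon": summed count}}
    """
    features = {}
    for sample, leaf_counts in counts.items():
        totals = defaultdict(int)
        for leaf, n in leaf_counts.items():
            # Each leaf contributes its count to every ancestor taxon.
            for rank, taxon in zip(RANKS, lineages[leaf]):
                totals[f"{rank}:{taxon}"] += n
        features[sample] = dict(totals)
    return features


# Toy data (invented for illustration): three leaves under two phyla.
lineages = {
    "otu1": ("P1", "C1", "G1"),
    "otu2": ("P1", "C1", "G2"),
    "otu3": ("P2", "C2", "G3"),
}
counts = {
    "s1": {"otu1": 5, "otu2": 3, "otu3": 10},
    "s2": {"otu1": 0, "otu2": 7, "otu3": 2},
}

features = agglomerate(counts, lineages)
for sample in sorted(features):
    print(sample, features[sample])
```

Under this assumption the resulting table holds candidate variables at every rank side by side, so a selection procedure can choose whichever level the data support instead of fixing one in advance. The variables are strongly correlated along each lineage (a phylum total contains its genus totals), which is the tree-like structure the abstract refers to.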


