Variable selection in social-environmental data: Sparse regression and tree ensemble machine learning approaches

08/31/2020
by Elizabeth Handorf, et al.

Objective: Social-environmental data obtained from the U.S. Census is an important resource for understanding health disparities, but the full dataset is rarely used in analysis. One barrier to incorporating the full data is a lack of solid recommendations for variable selection, so researchers often hand-select a few variables. We therefore evaluated the ability of empirical machine learning approaches to identify social-environmental factors that have a true association with a health outcome.

Materials and Methods: We compared several popular machine learning methods, including penalized regressions (e.g. lasso, elastic net) and tree ensemble methods. Via simulation, we assessed each method's ability to identify census variables truly associated with binary and continuous outcomes while minimizing false-positive results (10 true associations, 1,000 total variables). We applied the most promising method to the full census data (p=14,663 variables) linked to prostate cancer registry data (n=76,186 cases) to identify social-environmental factors associated with advanced prostate cancer.

Results: In simulations, elastic net identified many true-positive variables, while lasso provided good control of false positives. On a combined measure of accuracy, hierarchical clustering based on Spearman's correlation followed by sparse group lasso regression performed best overall. Bayesian Additive Regression Trees outperformed the other tree ensemble methods, but not the sparse group lasso. In the full dataset, the sparse group lasso successfully identified a subset of variables, three of which replicated earlier findings.

Discussion: This analysis demonstrated the potential of empirical machine learning approaches to identify a small subset of census variables that have a true association with the outcome and that replicate across empirical methods.
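To make the simulation idea concrete, the following is a minimal sketch of how true-positive and false-positive selections can be counted for a lasso fit when the truly associated variables are known. It is not the authors' code or design: the independent standard-normal predictors, the continuous outcome, the absence of an intercept, the problem size (5 true associations among 50 variables), the penalty value, and all function names (`simulate`, `lasso_cd`, `selection_counts`, `lambda_max`) are assumptions chosen for illustration. Real census variables are strongly correlated, which is what motivates the clustering and group penalties described in the abstract.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def simulate(n, p, n_true, effect, seed):
    """Hypothetical data: independent N(0, 1) predictors, first n_true are real."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [effect * sum(row[:n_true]) + rng.gauss(0.0, 1.0) for row in X]
    return X, y


def lasso_cd(X, y, lam, n_iter=50):
    """Minimise (1/2n)||y - Xb||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    col_ss = [sum(v * v for v in col) / n for col in cols]
    beta = [0.0] * p
    resid = list(y)  # residual y - Xb, which equals y while b is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = cols[j]
            # Covariance of column j with the residual, adding back its own fit
            rho = sum(c * r for c, r in zip(col, resid)) / n + col_ss[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_ss[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                resid = [r - delta * c for r, c in zip(resid, col)]
                beta[j] = new_b
    return beta


def lambda_max(X, y):
    """Smallest penalty at which every lasso coefficient is zero."""
    n, p = len(X), len(X[0])
    return max(abs(sum(X[i][j] * y[i] for i in range(n))) / n for j in range(p))


def selection_counts(beta, n_true):
    """Count selected variables that are truly associated (TP) or not (FP)."""
    tp = sum(1 for b in beta[:n_true] if b != 0.0)
    fp = sum(1 for b in beta[n_true:] if b != 0.0)
    return tp, fp


if __name__ == "__main__":
    X, y = simulate(n=200, p=50, n_true=5, effect=1.0, seed=0)
    for lam in (0.05, 0.2, 0.5):
        tp, fp = selection_counts(lasso_cd(X, y, lam), n_true=5)
        print(f"lambda={lam}: true positives={tp}, false positives={fp}")
```

Raising the penalty generally trades true positives for fewer false positives, which is the trade-off the simulation results above describe for lasso and elastic net. A fuller comparison would repeat this over many simulated datasets and over correlated predictors.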


research
09/13/2019

SuRF: a New Method for Sparse Variable Selection, with Application in Microbiome Data Analysis

In this paper, we present a new variable selection method for regression...
research
09/10/2021

Reducing bias and alleviating the influence of excess of zeros with multioutcome adaptive LAD-lasso

Zero-inflated explanatory variables are common in fields such as ecology...
research
04/17/2019

Variable Selection in Functional Linear Concurrent Regression

We propose a novel method for variable selection in functional linear co...
research
02/23/2018

Variable selection via Group LASSO Approach : Application to the Cox Regression and frailty model

In the analysis of survival outcome supplemented with both clinical info...
research
09/09/2015

Sélection de variables par le GLM-Lasso pour la prédiction du risque palustre

In this study, we propose an automatic learning method for variables sel...
research
03/16/2020

Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods

Penalized regression methods, such as lasso and elastic net, are used in...
