Effect of hyperparameters on variable selection in random forests

09/13/2023
by Cesaire J. K. Fouodo, et al.

Random forests (RFs) are well suited for prediction modeling and variable selection in high-dimensional omics studies. The effects of the hyperparameters of the RF algorithm on prediction performance and variable importance estimation have previously been investigated. However, how hyperparameters impact RF-based variable selection remains unclear. We evaluate the effects of hyperparameters on the Vita and Boruta variable selection procedures in two simulation studies, one based on theoretical distributions and one on empirical gene expression data. We assess the ability of the procedures to select important variables (sensitivity) while controlling the false discovery rate (FDR). Our results show that the proportion of splitting candidate variables (mtry.prop) and the sample fraction (sample.fraction) for the training dataset influence the selection procedures more than the drawing strategy of the training datasets and the minimal terminal node size. A suitable setting of the RF hyperparameters depends on the correlation structure in the data. For weakly correlated predictor variables, the default value of mtry is optimal, but smaller values of sample.fraction result in larger sensitivity. In contrast, for strongly correlated predictor variables, the difference in sensitivity between the optimal and the default value of sample.fraction is negligible, whereas values smaller than the default are better in the other settings. In conclusion, the default values of the hyperparameters will not always be suitable for identifying important variables. Thus, adequate values differ depending on whether the aim of the study is optimizing prediction performance or variable selection.
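The study itself works with R implementations of RF-based selection (Vita and Boruta). As an illustration only, the sketch below maps the hyperparameters discussed in the abstract onto scikit-learn analogues (max_features for mtry.prop, max_samples for sample.fraction, bootstrap for the drawing strategy, min_samples_leaf for the minimal node size) and uses plain hold-out permutation importance as a rough stand-in for the selection procedures. The simulated data, hyperparameter values, and selection rule are illustrative assumptions, not the authors' setup.

```python
# Minimal sketch (not the authors' code): scikit-learn analogues of the RF
# hyperparameters discussed in the abstract, plus a naive permutation-
# importance-based variable screening. Data and thresholds are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Simulated high-dimensional data with few informative predictors.
X, y = make_classification(n_samples=300, n_features=500,
                           n_informative=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

p = X.shape[1]
mtry_prop = np.sqrt(p) / p          # default mtry is sqrt(p) for classification
rf = RandomForestClassifier(
    n_estimators=1000,
    max_features=max(1, int(mtry_prop * p)),  # analogue of mtry.prop
    max_samples=0.632,                        # analogue of sample.fraction
    bootstrap=True,                           # drawing strategy (with replacement)
    min_samples_leaf=1,                       # analogue of minimal node size
    random_state=1,
    n_jobs=-1,
)
rf.fit(X_train, y_train)

# Hold-out permutation importance as a crude stand-in for importance-based
# selection; Vita and Boruta derive more principled null distributions.
imp = permutation_importance(rf, X_test, y_test, n_repeats=20, random_state=1)
selected = np.where(imp.importances_mean > 0)[0]  # naive positive-importance rule
print(f"{len(selected)} variables selected:", selected[:20])
```

Varying max_features and max_samples in such a sketch is one way to probe, on a small scale, the sensitivity/FDR trade-offs that the paper examines systematically.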

