Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance

12/10/2015
by Amichai Painsky et al.

Recursive partitioning approaches that produce tree-like models are a long-standing staple of predictive modeling, used in the last decade mostly as "sub-learners" within state-of-the-art ensemble methods such as Boosting and Random Forest. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree-building methods prevents them from treating different types of variables equally. This manifests most clearly in these methods' inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the age of big data. Such variables can often be very informative, but current tree methods essentially leave us with a choice of either not using them or exposing our models to severe overfitting. We propose a conceptual framework for splitting that uses leave-one-out (LOO) cross-validation to select the splitting variable, and then performs a regular split (in our case, following CART's approach) on the selected variable. The most important consequence of our approach is that categorical variables with many categories can be used safely in tree building: they are chosen only if they contribute to predictive power. We demonstrate in extensive simulations and real-data analyses that our splitting approach significantly improves the performance of both single-tree models and tree-based ensemble methods. Importantly, we design an algorithm for LOO splitting-variable selection which, under reasonable assumptions, does not increase the overall computational complexity relative to CART for two-class classification. For regression tasks, our approach carries an increased computational burden, replacing an O(log(n)) factor in CART's splitting-rule search with an O(n) term.
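To make the proposed split-selection step concrete, here is a minimal Python sketch for numeric predictors under squared loss: each candidate variable is scored by refitting its best split on n-1 points and testing on the held-out point, and the winner is then split normally on the full data. This is a deliberately naive version, roughly O(n^2 log n) per variable, whereas the paper's algorithm attains the complexities quoted above; it also omits the handling of high-cardinality categorical variables that is the paper's main target. All function names are illustrative, not taken from the authors' code.

```python
import numpy as np

def best_threshold(x, y):
    # CART-style exhaustive search: return the threshold on x that
    # minimizes the total within-leaf squared error of a two-leaf split.
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best_t, best_loss = None, np.inf
    for k in range(1, len(xs)):
        if xs[k] == xs[k - 1]:
            continue  # no valid threshold between tied values
        left, right = ys[:k], ys[k:]
        loss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if loss < best_loss:
            best_loss, best_t = loss, (xs[k - 1] + xs[k]) / 2.0
    return best_t

def loo_split_loss(x, y):
    # LOO score of one variable: for each point i, refit the best split
    # on the remaining n-1 points, then measure the squared error of that
    # split's leaf prediction on the held-out point.
    n, total = len(y), 0.0
    idx = np.arange(n)
    for i in range(n):
        mask = idx != i
        t = best_threshold(x[mask], y[mask])
        if t is None:  # variable is constant once point i is removed
            return np.inf
        same_side = (x[mask] <= t) if x[i] <= t else (x[mask] > t)
        total += (y[i] - y[mask][same_side].mean()) ** 2
    return total

def select_split_variable(X, y):
    # Choose the column whose single split generalizes best under LOO;
    # the actual split is then refit on the full data, as in plain CART.
    losses = [loo_split_loss(X[:, j], y) for j in range(X.shape[1])]
    return int(np.argmin(losses))

# Toy usage: the informative column should win despite the noise columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 2] > 0).astype(float) + rng.normal(scale=0.1, size=200)
print(select_split_variable(X, y))  # expected: 2
```

The key design point the sketch captures is the separation of concerns: LOO cross-validation is used only to pick which variable to split on, so an uninformative many-category variable cannot win merely by offering many ways to partition the training data, while the split itself remains an ordinary CART split.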


Related research

02/27/2023 · Prediction-based Variable Selection for Component-wise Gradient Boosting
Model-based component-wise gradient boosting is a popular tool for data-...

02/15/2017 · Probing for sparse and fast variable selection with model-based boosting
We present a new variable selection method based on model-based gradient...

11/19/2021 · MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data
A major challenge in embedding or visualizing clinical patient data is t...

06/17/2020 · FREEtree: A Tree-based Approach for High Dimensional Longitudinal Data With Correlated Features
This paper proposes FREEtree, a tree-based method for high dimensional l...

06/24/2019 · The Power of Unbiased Recursive Partitioning: A Unifying View of CTree, MOB, and GUIDE
A core step of every algorithm for learning regression trees is the sele...

09/12/2021 · Feature Importance in Gradient Boosting Trees with Cross-Validation Feature Selection
Gradient Boosting Machines (GBM) are among the go-to algorithms on tabul...

12/30/2020 · Optimal trees selection for classification via out-of-bag assessment and sub-bagging
The effect of training data size on machine learning methods has been we...
