Unbiased Gradient Boosting Decision Tree with Unbiased Feature Importance

05/18/2023
by Zheyu Zhang et al.

Gradient Boosting Decision Tree (GBDT) has achieved remarkable success in a wide variety of applications. The split finding algorithm, which determines the tree construction process, is one of the most crucial components of GBDT. However, the split finding algorithm has long been criticized for its bias towards features with a large number of potential splits. This bias introduces severe interpretability and overfitting issues in GBDT. We therefore provide a fine-grained analysis of bias in GBDT and demonstrate that the bias originates from 1) the systematic bias in the gain estimation of each split and 2) the bias in the split finding algorithm resulting from the use of the same data to evaluate the split improvement and to determine the best split. Based on this analysis, we propose unbiased gain, a new unbiased measurement of gain importance using out-of-bag samples. Moreover, we incorporate the unbiased property into the split finding algorithm and develop UnbiasedGBM to solve the overfitting issue of GBDT. We assess the performance of UnbiasedGBM and unbiased gain in a large-scale empirical study comprising 60 datasets and show that: 1) UnbiasedGBM exhibits better average performance on the 60 datasets than popular GBDT implementations such as LightGBM, XGBoost, and CatBoost, and 2) unbiased gain achieves better average performance in feature selection than popular feature importance methods. The code is available at https://github.com/ZheyuAqaZhang/UnbiasedGBM.
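The second source of bias described in the abstract, selecting and evaluating a split on the same data, can be reproduced in a few lines. The following is a minimal NumPy sketch, not the paper's implementation: on pure-noise data (where the true gain of any split is zero), the standard greedy search over variance-reduction gains systematically reports a large positive gain, while re-estimating the chosen split on held-out "out-of-bag" samples removes that selection bias. A smaller systematic estimation bias (the paper's source 1) still remains in this naive sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def gain_at(x, y, t):
    """Variance-reduction gain of splitting (x, y) at threshold t."""
    left, right = y[x <= t], y[x > t]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    sse = lambda v: ((v - v.mean()) ** 2).sum()
    return sse(y) - sse(left) - sse(right)

def best_threshold(x, y):
    """Greedy split search: threshold with the largest in-sample gain."""
    cands = np.unique(x)[:-1]
    gains = [gain_at(x, y, t) for t in cands]
    return cands[int(np.argmax(gains))]

n, trials = 64, 300
inbag_gain, oob_gain = [], []
for _ in range(trials):
    # Pure-noise feature and target: the true gain of every split is zero.
    x, y = rng.normal(size=2 * n), rng.normal(size=2 * n)
    t = best_threshold(x[:n], y[:n])              # split chosen on in-bag samples
    inbag_gain.append(gain_at(x[:n], y[:n], t))   # standard (biased) gain estimate
    oob_gain.append(gain_at(x[n:], y[n:], t))     # gain re-estimated on out-of-bag samples

# Selecting the split inflates its own gain; the out-of-bag estimate is far smaller.
print(np.mean(inbag_gain) > np.mean(oob_gain))  # True
```

Because the split is a maximum over many candidate thresholds, the in-bag gain on noise grows with the number of candidates, which is exactly the cardinality bias the abstract describes; evaluating on out-of-bag samples decouples selection from evaluation, the idea behind the paper's unbiased gain.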


Related research

- 03/12/2019: Unbiased Measurement of Feature Importance in Tree-Based Methods. "We propose a modification that corrects for split-improvement variable i..."
- 01/25/2018: Information gain ratio correction: Improving prediction with more balanced decision tree splits. "Decision trees algorithms use a gain function to select the best split d..."
- 08/16/2021: Task-wise Split Gradient Boosting Trees for Multi-center Diabetes Prediction. "Diabetes prediction is an important data science application in the soci..."
- 06/26/2017: GPU-acceleration for Large-scale Tree Boosting. "In this paper, we present a novel massively parallel algorithm for accel..."
- 07/25/2023: Feature Importance Measurement based on Decision Tree Sampling. "Random forest is effective for prediction tasks but the randomness of tr..."
- 04/28/2021: [Re] Don't Judge an Object by Its Context: Learning to Overcome Contextual Bias. "Singh et al. (2020) point out the dangers of contextual bias in visual r..."
- 03/26/2020: From unbiased MDI Feature Importance to Explainable AI for Trees. "We attempt to give a unifying view of the various recent attempts to (i)..."
