A XGBoost risk model via feature selection and Bayesian hyper-parameter optimization

01/24/2019
by Yan Wang, et al.

This paper explores models based on the extreme gradient boosting (XGBoost) approach for business risk classification. Feature selection (FS) algorithms and hyper-parameter optimization are considered together during model training. Five commonly used FS methods (weight by Gini, weight by Chi-square, hierarchical variable clustering, weight by correlation, and weight by information) are applied to alleviate the effect of redundant features. Two hyper-parameter optimization approaches, random search (RS) and the Bayesian tree-structured Parzen Estimator (TPE), are applied in XGBoost. The effects of the different FS and hyper-parameter optimization methods on model performance are investigated with the Wilcoxon signed-rank test. The performance of XGBoost is compared to that of the traditionally used logistic regression (LR) model in terms of classification accuracy, area under the curve (AUC), recall, and F1 score obtained from 10-fold cross-validation. Results show that hierarchical clustering is the optimal FS method for LR, while weight by Chi-square achieves the best performance in XGBoost. XGBoost with either TPE or RS optimization significantly outperforms LR. TPE optimization is superior to RS, yielding a significantly higher accuracy and marginally higher AUC, recall, and F1 score. Furthermore, XGBoost with TPE tuning shows lower variability than with RS tuning. Finally, the ranking of feature importance based on XGBoost enhances model interpretation. Therefore, XGBoost with Bayesian TPE hyper-parameter optimization serves as an operative yet powerful approach for business risk modeling.
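
The abstract compares tuning methods across the 10 cross-validation folds with the Wilcoxon signed-rank test. As a rough illustration of how such a paired comparison could be computed, the sketch below implements an exact two-sided version of the test using only the Python standard library. The function name `wilcoxon_signed_rank` and the per-fold accuracy values are hypothetical placeholders made up for this example; they are not the paper's code or results, and the paper may have used a different variant of the test (for instance, a normal approximation).

```python
from itertools import product


def wilcoxon_signed_rank(scores_a, scores_b):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Returns (statistic, p_value), where statistic = min(W+, W-).
    Zero differences are dropped; tied |differences| share an average rank.
    Exact enumeration costs 2**n, so this is only meant for small n
    (for example, 10 cross-validation folds).
    """
    # Rounding guards against float noise turning true ties into non-ties.
    diffs = [round(a - b, 10) for a, b in zip(scores_a, scores_b)]
    diffs = [d for d in diffs if d != 0]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0

    # Rank the absolute differences, averaging ranks within tie groups.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1

    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    statistic = min(w_plus, total - w_plus)

    # Under the null hypothesis every sign pattern is equally likely,
    # so count the patterns at least as extreme as the observed one.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= statistic:
            extreme += 1
    return statistic, extreme / 2 ** n


# Hypothetical per-fold accuracies (NOT taken from the paper).
tpe_accuracy = [0.842, 0.851, 0.839, 0.847, 0.855, 0.844, 0.850, 0.848, 0.841, 0.853]
rs_accuracy = [0.836, 0.845, 0.841, 0.840, 0.846, 0.839, 0.843, 0.842, 0.838, 0.845]

statistic, p_value = wilcoxon_signed_rank(tpe_accuracy, rs_accuracy)
print(f"W = {statistic}, exact two-sided p = {p_value:.4f}")
```

A small p-value here would suggest that the per-fold differences between the two tuning methods are unlikely to be due to chance alone. In practice a library routine such as SciPy's `scipy.stats.wilcoxon` would normally be preferred over a hand-rolled version.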


