CatBoost model with synthetic features in application to loan risk assessment of small businesses

06/15/2021
by Haoxue Wang, et al.

Assessing loan risk for small businesses has long been a complex problem worth exploring. Predicting loan risk accurately can support entrepreneurship and, through it, job creation. CatBoost (Categorical Boosting) is a powerful machine learning algorithm well suited to datasets with many categorical variables, such as those used for forecasting loan risk. In this paper, we identify the important risk factors that contribute to the loan status classification problem. We then compare the performance of boosting-type algorithms (especially CatBoost) with that of other traditional yet popular ones. The dataset we adopt comes from the U.S. Small Business Administration (SBA) and is very large (899,164 observations and 27 features). To make the best use of the important features in the dataset, we propose a technique named "synthetic generation", which derives additional combined features through arithmetic operations and improves the accuracy and AUC of the original CatBoost model. We obtain an accuracy of 95.84% and an AUC of 98.80%.
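The abstract does not list which arithmetic operations or which features the "synthetic generation" technique combines, so the following is only a minimal sketch of the general idea: deriving new columns from pairwise arithmetic combinations of numeric features. The function name `synthesize_features` and the column names `loan_amount`, `guaranteed_amount`, and `term_months` are hypothetical and are not taken from the paper or the SBA dataset.

```python
from itertools import combinations


def synthesize_features(row, columns):
    """Return a copy of `row` extended with pairwise arithmetic combinations.

    Hypothetical helper: the choice of operations (+, -, *, /) is an
    assumption, not the authors' documented procedure.
    """
    out = dict(row)
    for a, b in combinations(columns, 2):
        x, y = row[a], row[b]
        out[f"{a}_plus_{b}"] = x + y
        out[f"{a}_minus_{b}"] = x - y
        out[f"{a}_times_{b}"] = x * y
        # Guard against division by zero; None marks an undefined ratio.
        out[f"{a}_div_{b}"] = x / y if y != 0 else None
    return out


# Illustrative record with made-up values and hypothetical column names.
example = {
    "loan_amount": 50000.0,
    "guaranteed_amount": 40000.0,
    "term_months": 84.0,
}
features = synthesize_features(
    example, ["loan_amount", "guaranteed_amount", "term_months"]
)
```

In a pipeline of this kind, the extended records would then be passed to the classifier in place of the original ones, and only the combinations that improve validation metrics would be retained.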
