Evaluating XGBoost for Balanced and Imbalanced Data: Application to Fraud Detection

03/27/2023
by Gissel Velarde, et al.

This paper evaluates XGBoost's performance across different dataset sizes and class distributions, from perfectly balanced to highly imbalanced. XGBoost was selected for evaluation because it stands out in several benchmarks for its detection performance and speed. After introducing the problem of fraud detection, the paper reviews evaluation metrics for detection systems, or binary classifiers, and illustrates with examples how different metrics behave on balanced and imbalanced datasets. It then examines the principles of XGBoost, proposes a pipeline for data preparation, and compares a Vanilla XGBoost against a random search-tuned XGBoost. Random search fine-tuning provides consistent improvement for large datasets of 100 thousand samples, but not for medium and small datasets of 10 thousand and 1 thousand samples, respectively. As expected, XGBoost's detection performance improves as more data becomes available and deteriorates as the datasets become more imbalanced. Tests on distributions with 50, 45, 25, and 5 percent positive samples show that the largest drop in detection performance occurs for the distribution with only 5 percent positive samples. Sampling to balance the training set does not provide consistent improvement. Future work will therefore include a systematic study of different techniques for dealing with data imbalance, as well as an evaluation of other approaches, including graphs, autoencoders, and generative adversarial methods, for dealing with the lack of labels.
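
To make the point about metrics on balanced versus imbalanced data concrete, the following minimal sketch scores a trivial baseline that always predicts the negative class on the four class distributions named in the abstract. It is an illustration under stated assumptions, not the paper's own example: the helper names (`confusion_counts`, `metrics`, `make_labels`), the sample size of 1000, and the always-negative baseline are all hypothetical choices made here.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1; undefined ratios are reported as 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def make_labels(n, positive_fraction):
    """Build n synthetic labels with the requested share of positives (hypothetical data)."""
    n_pos = round(n * positive_fraction)
    return [1] * n_pos + [0] * (n - n_pos)


if __name__ == "__main__":
    # Positive-class shares taken from the abstract: 50, 45, 25 and 5 percent.
    for fraction in (0.50, 0.45, 0.25, 0.05):
        y_true = make_labels(1000, fraction)
        y_pred = [0] * len(y_true)  # baseline that never flags fraud
        m = metrics(y_true, y_pred)
        print(f"positives={fraction:.0%}  accuracy={m['accuracy']:.2f}  "
              f"recall={m['recall']:.2f}  f1={m['f1']:.2f}")
```

The baseline's accuracy rises as positives become rarer, while its recall and F1 stay at zero because it never detects a positive sample. This is why accuracy alone can be misleading for imbalanced problems such as fraud detection.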

research
10/12/2020

Class-Weighted Evaluation Metrics for Imbalanced Data Classification

Class distribution skews in imbalanced datasets may lead to models with ...
research
05/07/2020

Minority Class Oversampling for Tabular Data with Deep Generative Models

In practice, data scientists are often confronted with imbalanced data. ...
research
10/17/2019

KDE sampling for imbalanced class distribution

Imbalanced response variable distribution is not an uncommon occurrence ...
research
12/21/2020

Natural vs Balanced Distribution in Deep Learning on Whole Slide Images for Cancer Detection

The class distribution of data is one of the factors that regulates the ...
research
05/16/2023

BSGAN: A Novel Oversampling Technique for Imbalanced Pattern Recognitions

Class imbalanced problems (CIP) are one of the potential challenges in d...
research
09/07/2018

VOS: a Method for Variational Oversampling of Imbalanced Data

Class imbalanced datasets are common in real-world applications that ran...
