Minority Class Oversampling for Tabular Data with Deep Generative Models

05/07/2020
by Ramiro Camino, et al.

In practice, data scientists are often confronted with imbalanced data. Without accounting for the imbalance, common classifiers perform poorly and standard evaluation metrics mislead the data scientist about the model's performance. A common way to treat imbalanced datasets is under- and oversampling: samples are either removed from the majority class or synthetic samples are added to the minority class. In this paper, we follow up on recent developments in deep learning. We take proposals of generative adversarial networks, including our own, and study the ability of these approaches to provide realistic samples that improve performance on imbalanced classification tasks via oversampling. Across 160K+ experiments, we show that all of the new methods tend to perform better than simple baseline methods such as SMOTE, but require different under- and oversampling ratios to do so. Our experiments show that the sampling method does not affect quality, but runtime varies widely. We also observe that the improvements in performance metrics, while shown to be significant when ranking the methods, are often minor in absolute terms, especially compared to the required effort. Furthermore, we notice that a large part of the improvement is due to undersampling, not oversampling. We make our code and testing framework available.
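To make the two resampling operations concrete, here is a minimal NumPy sketch of random undersampling of the majority class followed by SMOTE-style interpolation within the minority class. It is an illustration only, not the authors' code or testing framework: the function names, the `keep_fraction` and `k` parameters, and the toy data are assumptions, and the sketch handles numerical features only (no categorical columns).

```python
import numpy as np


def undersample_majority(X, y, majority_label, keep_fraction, rng):
    """Randomly keep `keep_fraction` of the majority-class rows (hypothetical helper)."""
    maj_idx = np.flatnonzero(y == majority_label)
    other_idx = np.flatnonzero(y != majority_label)
    n_keep = max(1, int(round(keep_fraction * maj_idx.size)))
    kept = rng.choice(maj_idx, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([kept, other_idx]))
    return X[idx], y[idx]


def oversample_minority(X, y, minority_label, n_new, rng, k=5):
    """Add `n_new` synthetic minority rows by interpolating between a minority
    row and one of its k nearest minority neighbors (SMOTE-style sketch)."""
    X_min = X[y == minority_label]
    n = X_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)
    # Pairwise squared Euclidean distances between minority rows.
    dist = ((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(dist, np.inf)  # a row is never its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, n, size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    X_new = X_min[base] + gap * (X_min[chosen] - X_min[base])
    y_new = np.full(n_new, minority_label, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


# Toy data: 100 majority rows (label 0) and 10 minority rows (label 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(110, 3))
y = np.array([0] * 100 + [1] * 10)

X_u, y_u = undersample_majority(X, y, majority_label=0, keep_fraction=0.5, rng=rng)
X_b, y_b = oversample_minority(X_u, y_u, minority_label=1, n_new=40, rng=rng)
```

In this toy run the majority class is halved and the minority class is grown until the two are equal in size. The under- and oversampling ratios are the knobs the abstract refers to, and a deep generative model would replace the interpolation step as the source of synthetic rows.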


research
06/20/2022

Convex space learning improves deep-generative oversampling for tabular imbalanced classification on smaller datasets

Data is commonly stored in tabular format. Several fields of research (e...
research
02/21/2022

Imbalanced Classification via Explicit Gradient Learning From Augmented Data

Learning from imbalanced data is one of the most significant challenges ...
research
04/19/2022

Imbalanced Classification via a Tabular Translation GAN

When presented with a binary classification problem where the data exhib...
research
03/27/2023

Evaluating XGBoost for Balanced and Imbalanced Data: Application to Fraud Detection

This paper evaluates XGBoost's performance given different dataset sizes...
research
09/07/2018

VOS: a Method for Variational Oversampling of Imbalanced Data

Class imbalanced datasets are common in real-world applications that ran...
research
01/03/2021

Synthetic Embedding-based Data Generation Methods for Student Performance

Given the inherent class imbalance issue within student performance data...
research
03/22/2022

Dazzle: Using Optimized Generative Adversarial Networks to Address Security Data Class Imbalance Issue

Background: Machine learning techniques have been widely used and demons...