Minority Class Oversampling for Tabular Data with Deep Generative Models

05/07/2020
by Ramiro Camino, et al.

In practice, data scientists are often confronted with imbalanced data. Without accounting for the imbalance, common classifiers perform poorly and standard evaluation metrics mislead the data scientist about the model's performance. A common way to treat imbalanced datasets is under- and oversampling. In this process, samples are either removed from the majority class or synthetic samples are added to the minority class. In this paper, we follow up on recent developments in deep learning. We take proposals of generative adversarial networks, including our own, and study the ability of these approaches to provide realistic samples that improve performance on imbalanced classification tasks via oversampling. Across 160K+ experiments, we show that all of the new methods tend to perform better than simple baseline methods such as SMOTE, but require different under- and oversampling ratios to do so. Our experiments show that the method of sampling does not affect quality, but runtime varies widely. We also observe that the improvements in terms of the performance metric, while shown to be significant when ranking the methods, are often minor in absolute terms, especially compared to the required effort. Furthermore, we notice that a large part of the improvement is due to undersampling, not oversampling. We make our code and testing framework available.
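
To make the under- and oversampling step described in the abstract concrete, here is a minimal NumPy sketch. It is not the authors' code or any of the generative models they evaluate. The function names, the ratio, and the toy data are all hypothetical, and the oversampler is a simplified SMOTE-like interpolation between random pairs of minority rows (real SMOTE interpolates towards nearest neighbours).

```python
import numpy as np

def undersample_majority(X, y, majority_label, ratio, rng):
    """Keep a random fraction `ratio` of the majority-class rows."""
    maj_idx = np.flatnonzero(y == majority_label)
    other_idx = np.flatnonzero(y != majority_label)
    n_keep = int(round(ratio * len(maj_idx)))
    kept = rng.choice(maj_idx, size=n_keep, replace=False)
    idx = np.sort(np.concatenate([kept, other_idx]))
    return X[idx], y[idx]

def oversample_minority(X, y, minority_label, n_new, rng):
    """Append n_new synthetic minority rows.

    Assumption: each synthetic row is a random convex combination of two
    randomly chosen minority rows (no nearest-neighbour search).
    """
    X_min = X[y == minority_label]
    a = X_min[rng.integers(0, len(X_min), size=n_new)]
    b = X_min[rng.integers(0, len(X_min), size=n_new)]
    t = rng.random((n_new, 1))          # interpolation weights in [0, 1)
    X_new = a + t * (b - a)
    y_new = np.full(n_new, minority_label, dtype=y.dtype)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])

# Toy imbalanced dataset: 950 majority rows (label 0), 50 minority rows (label 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.array([0] * 950 + [1] * 50)

# Drop half of the majority class, then top up the minority class to match.
X_u, y_u = undersample_majority(X, y, majority_label=0, ratio=0.5, rng=rng)
X_b, y_b = oversample_minority(X_u, y_u, minority_label=1, n_new=425, rng=rng)
print(np.bincount(y_b))  # [475 475]
```

In a study like the one described, the undersampling ratio and the number of synthetic rows would be tuned per method, and the interpolation step would be replaced by samples drawn from a trained generative model.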

