Oversampling for Imbalanced Learning Based on K-Means and SMOTE

11/02/2017
by Felix Last, et al.

Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods that generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language.
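To illustrate the idea, the following sketch (not the authors' reference implementation) clusters the training data with k-means, keeps only clusters in which the minority class is sufficiently represented, and performs SMOTE-style interpolation between minority samples inside each selected cluster. The function name kmeans_smote and parameters such as minority_threshold and k_neighbors are chosen here purely for illustration, X and y are assumed to be NumPy arrays, and synthetic samples are spread evenly across the selected clusters as a simplification of the paper's sparsity-based weighting.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors


def kmeans_smote(X, y, minority_label, n_new, n_clusters=10,
                 minority_threshold=0.5, k_neighbors=5, random_state=0):
    """Illustrative sketch of k-means SMOTE: cluster, filter, then interpolate."""
    rng = np.random.default_rng(random_state)
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                         random_state=random_state).fit_predict(X)

    # Filter step: keep only clusters dominated by the minority class, so that
    # interpolation happens in safe regions and no noise is generated.
    selected = []
    for c in range(n_clusters):
        members = cluster_ids == c
        minority = X[members & (y == minority_label)]
        if members.sum() > 0 and len(minority) >= 2 \
                and len(minority) / members.sum() >= minority_threshold:
            selected.append(minority)
    if not selected:
        return X, y  # nothing safe to oversample

    # Simplification: distribute the synthetic samples evenly over the selected
    # clusters (the paper weights clusters by their sparsity instead).
    per_cluster = int(np.ceil(n_new / len(selected)))
    new_points = []
    for minority in selected:
        k = min(k_neighbors, len(minority) - 1)
        neigh = NearestNeighbors(n_neighbors=k + 1).fit(minority)
        _, idx = neigh.kneighbors(minority)
        for _ in range(per_cluster):
            i = rng.integers(len(minority))
            j = idx[i][rng.integers(1, k + 1)]   # a random minority neighbour
            gap = rng.random()                   # SMOTE interpolation factor
            new_points.append(minority[i] + gap * (minority[j] - minority[i]))

    new_points = np.asarray(new_points)[:n_new]
    X_res = np.vstack([X, new_points])
    y_res = np.concatenate([y, np.full(len(new_points), minority_label)])
    return X_res, y_res

Under these assumptions, a call such as X_res, y_res = kmeans_smote(X, y, minority_label=1, n_new=200) yields a rebalanced training set on which any standard classifier can subsequently be fit.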
