Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification

09/29/2020
by   Min Qian, et al.
0

With the abundance of industrial datasets, imbalanced classification has become a common problem in several application domains. Oversampling is an effective method to solve imbalanced classification. One of the main challenges of the existing oversampling methods is to accurately label the new synthetic samples. Inaccurate labels of the synthetic samples would distort the distribution of the dataset and possibly worsen the classification performance. This paper introduces the idea of weakly supervised learning to handle the inaccurate labeling of synthetic samples caused by traditional oversampling methods. Graph semi-supervised SMOTE is developed to improve the credibility of the synthetic samples' labels. In addition, we propose cost-sensitive neighborhood components analysis for high dimensional datasets and bootstrap based ensemble framework for highly imbalanced datasets. The proposed method has achieved good classification performance on 8 synthetic datasets and 3 real-world datasets, especially for high imbalance and high dimensionality problems. The average performances and robustness are better than the benchmark methods.

READ FULL TEXT

page 1

page 5

research
02/17/2020

Class-Imbalanced Semi-Supervised Learning

Semi-Supervised Learning (SSL) has achieved great success in overcoming ...
research
11/01/2022

Automated Imbalanced Learning

Automated Machine Learning has grown very successful in automating the t...
research
02/14/2018

Dealing with Difficult Minority Labels in Imbalanced Mutilabel Data Sets

Multilabel classification is an emergent data mining task with a broad r...
research
08/03/2015

A Weakly Supervised Learning Approach based on Spectral Graph-Theoretic Grouping

In this study, a spectral graph-theoretic grouping strategy for weakly s...
research
03/24/2019

Exploiting Synthetically Generated Data with Semi-Supervised Learning for Small and Imbalanced Datasets

Data augmentation is rapidly gaining attention in machine learning. Synt...
research
02/14/2018

Tackling Multilabel Imbalance through Label Decoupling and Data Resampling Hybridization

The learning from imbalanced data is a deeply studied problem in standar...
research
10/23/2019

GenSample: A Genetic Algorithm for Oversampling in Imbalanced Datasets

Imbalanced datasets are ubiquitous. Classification performance on imbala...

Please sign up or login with your details

Forgot password? Click here to reset