Empirical Similarity for Absent Data Generation in Imbalanced Classification

08/05/2015
by Arash Pourhabib, et al.

When the training data in a two-class classification problem is overwhelmed by one class, most classification techniques fail to correctly identify the data points belonging to the underrepresented class. We propose Similarity-based Imbalanced Classification (SBIC), which learns patterns in the training data based on an empirical similarity function. To take the imbalanced structure of the training data into account, SBIC utilizes the concept of absent data, i.e., data from the minority class that can help better locate the boundary between the two classes. SBIC simultaneously optimizes the weights of the empirical similarity function and finds the locations of the absent data points. As such, SBIC uses an embedded mechanism for synthetic data generation that does not modify the training dataset but alters the algorithm to suit imbalanced datasets. SBIC therefore draws on the ideas of both major schools of thought in imbalanced classification: like cost-sensitive approaches, it operates at the algorithm level to handle imbalanced structures; and like synthetic data generation approaches, it utilizes the properties of unobserved data points from the minority class. The application of SBIC to imbalanced datasets suggests that it is comparable to, and in some cases outperforms, other commonly used classification techniques for imbalanced datasets.
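As a rough illustration of what prediction with a weighted similarity function can look like, the sketch below scores a query point by a similarity-weighted average of the training labels. The abstract does not give the functional form, so the exponential similarity, the function names `similarity` and `predict_score`, the toy data, and the fixed weights `w` are all assumptions made for illustration. The sketch also omits the part that distinguishes SBIC, namely the joint optimization of the weights and the absent data locations.

```python
import numpy as np

def similarity(x, X, w):
    """Assumed form: exponential of the negative weighted squared distance
    between the query x and each row of X."""
    return np.exp(-np.sum(w * (X - x) ** 2, axis=1))

def predict_score(x, X_train, y_train, w):
    """Similarity-weighted average of training labels (labels in {0, 1})."""
    s = similarity(x, X_train, w)
    return float(np.dot(s, y_train) / np.sum(s))

# Toy imbalanced data: five majority points (label 0), one minority point (label 1).
X_train = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.1],
    [2.0, 2.0],
])
y_train = np.array([0, 0, 0, 0, 0, 1])

# Hypothetical fixed weights; in SBIC these would be learned from the data.
w = np.array([1.0, 1.0])

score_near_minority = predict_score(np.array([1.9, 2.1]), X_train, y_train, w)
score_near_majority = predict_score(np.array([0.1, 0.1]), X_train, y_train, w)
```

In this toy setup a query close to the single minority point receives a score near 1, while a query inside the majority cluster receives a score near 0. With fewer or more distant minority points the score is pulled toward the majority label, which is the kind of imbalance effect the absent data points are meant to counteract.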

research
11/02/2017

Oversampling for Imbalanced Learning Based on K-Means and SMOTE

Learning from class-imbalanced data continues to be a common and challen...
research
10/23/2019

GenSample: A Genetic Algorithm for Oversampling in Imbalanced Datasets

Imbalanced datasets are ubiquitous. Classification performance on imbala...
research
11/02/2018

Clustering and Learning from Imbalanced Data

A learning classifier must outperform a trivial solution, in case of imb...
research
09/10/2019

Spam filtering on forums: A synthetic oversampling based approach for imbalanced data classification

Forums play an important role in providing a platform for community inte...
research
03/21/2020

Prob2Vec: Mathematical Semantic Embedding for Problem Retrieval in Adaptive Tutoring

We propose a new application of embedding techniques for problem retriev...
research
04/28/2018

A Cost-Sensitive Deep Belief Network for Imbalanced Classification

Imbalanced data with a skewed class distribution are common in many real...
research
05/09/2021

RB-CCR: Radial-Based Combined Cleaning and Resampling algorithm for imbalanced data classification

Real-world classification domains, such as medicine, health and safety, ...
