SMOTE-ENC: A novel SMOTE-based method to generate synthetic data for nominal and continuous features

03/13/2021
by   Mimi Mukherjee, et al.

Real-world datasets are often heavily skewed, with some classes significantly outnumbered by others. In these situations, machine learning algorithms fail to achieve substantial efficacy when predicting the under-represented instances. To solve this problem, many variations of the synthetic minority over-sampling technique (SMOTE) have been proposed to balance datasets with continuous features. However, for datasets with both nominal and continuous features, SMOTE-NC is the only SMOTE-based over-sampling technique available to balance the data. In this paper, we present a novel minority over-sampling method, SMOTE-ENC (SMOTE - Encoded Nominal and Continuous), in which nominal features are encoded as numeric values and the difference between two such numeric values reflects the amount of change in association with the minority class. Our experiments show that a classification model using SMOTE-ENC offers better prediction than a model using SMOTE-NC when the dataset has a substantial number of nominal features and when there is some association between the categorical features and the target class. Additionally, our proposed method addresses one of the major limitations of the SMOTE-NC algorithm: SMOTE-NC can be applied only to mixed datasets that contain both continuous and nominal features, and it cannot function if all the features of the dataset are nominal. Our method has been generalized so that it can be applied to both mixed datasets and nominal-only datasets. The code is available from mkhushi.github.io
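
The encoding idea described above can be illustrated with a minimal sketch. The abstract does not give the formula, so the one used here is an assumption chosen for illustration: each nominal category is scored by how far its observed minority count deviates from the count expected under no association, relative to that expected count. The function name `encode_nominal` and the toy data are hypothetical and are not taken from the authors' code.

```python
from collections import Counter


def encode_nominal(values, labels, minority_label):
    """Map each nominal category to a numeric code (illustrative assumption).

    code = (observed - expected) / expected, where
      observed = minority instances carrying the category
      expected = category count * overall minority fraction
    A larger code means a stronger association with the minority class.
    """
    if len(values) != len(labels):
        raise ValueError("values and labels must have the same length")
    n = len(values)
    minority_total = sum(1 for y in labels if y == minority_label)
    if n == 0 or minority_total == 0:
        raise ValueError("need at least one minority instance")

    minority_fraction = minority_total / n
    totals = Counter(values)
    observed = Counter(v for v, y in zip(values, labels) if y == minority_label)

    codes = {}
    for category, count in totals.items():
        expected = count * minority_fraction
        codes[category] = (observed[category] - expected) / expected
    return codes


# Toy data: category "a" is over-represented in the minority class (label 1).
values = ["a", "a", "a", "a", "b", "b", "b", "b", "b", "b"]
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
codes = encode_nominal(values, labels, minority_label=1)
```

Under this assumed scoring, a category seen more often than expected in the minority class gets a positive code and one seen less often gets a negative code, so the numeric distance between two categories tracks the change in minority association. Once encoded, the nominal column can be treated like a continuous one by a SMOTE-style interpolation step.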


research
09/22/2020

Gamma distribution-based sampling for imbalanced data

Imbalanced class distribution is a common problem in a number of fields ...
research
11/05/2021

Solving the Class Imbalance Problem Using a Counterfactual Method for Data Augmentation

Learning from class imbalanced datasets poses challenges for many machin...
research
02/25/2019

FPRAS for the Potts Model and the Number of k-colorings

In this paper, we give a sampling algorithm for the Potts model using Ma...
research
07/03/2020

Improved Preterm Prediction Based on Optimized Synthetic Sampling of EHG Signal

Preterm labor is the leading cause of neonatal morbidity and mortality a...
research
09/30/2020

An Online Learning Algorithm for a Neuro-Fuzzy Classifier with Mixed-Attribute Data

General fuzzy min-max neural network (GFMMNN) is one of the efficient ne...
research
08/19/2023

DatasetEquity: Are All Samples Created Equal? In The Quest For Equity Within Datasets

Data imbalance is a well-known issue in the field of machine learning, a...
research
05/19/2017

Data-adaptive Active Sampling for Efficient Graph-Cognizant Classification

The present work deals with active sampling of graph nodes representing ...
