Experimenting with an Evaluation Framework for Imbalanced Data Learning (EFIDL)

01/26/2023
by   Chenyu Li, et al.

Introduction: Data imbalance is one of the crucial issues in big data analysis with few labels, for example in real-world healthcare data, spam detection, and financial fraud detection datasets. Many data balancing methods have been introduced to improve the performance of machine learning algorithms. Prior research claims that SMOTE and SMOTE-based data augmentation methods, which generate new data points, can improve algorithm performance. However, we found that in many online tutorials the evaluation was carried out on synthesized datasets, which introduced bias into the evaluation and produced a false improvement in performance. In this study, we propose a new evaluation framework for imbalanced data learning methods. We experimented with five data balancing methods to test whether they improve algorithm performance.

Methods: We collected 8 imbalanced healthcare datasets with different imbalance rates from different domains. We applied 6 data augmentation methods with 11 machine learning methods to test whether data augmentation helps improve machine learning performance. We compared the traditional data augmentation evaluation methods with our proposed cross-validation evaluation framework.

Results: Using traditional data augmentation evaluation methods gives a false impression of improved performance. However, our proposed evaluation method shows that data augmentation has limited ability to improve the results.

Conclusion: EFIDL is more suitable for evaluating the prediction performance of an ML method when data are augmented. Using an unsuitable evaluation framework will give false results. Future researchers should consider the evaluation framework we proposed when dealing with augmented datasets. Our experiments showed that data augmentation does not help improve ML prediction performance.
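The bias described in the abstract can be illustrated with a small sketch. It is not the authors' EFIDL code, and the abstract does not spell out the framework's exact steps. The sketch assumes that the "traditional" evaluation balances the whole dataset before splitting it, while a cross-validation-based evaluation splits first and balances only the training part of each fold. Plain random oversampling (duplicating minority records) stands in for SMOTE, and all function names here are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle positions 0..n-1 and deal them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def oversample(ids, labels, seed=0):
    """Random oversampling (a stand-in for SMOTE, assumed for illustration):
    repeat minority-class record ids until both classes have equal counts."""
    rng = random.Random(seed)
    pos = [i for i in ids if labels[i] == 1]
    neg = [i for i in ids if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return list(ids)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(ids) + extra

def augment_then_split(labels, k=5, seed=0):
    """Balance the full dataset first, then cross-validate on the result.
    Copies of one original record can land in both train and test."""
    pool = oversample(list(range(len(labels))), labels, seed)
    folds = k_fold_indices(len(pool), k, seed)
    for f in range(k):
        test = [pool[j] for j in folds[f]]
        train = [pool[j] for g in range(k) if g != f for j in folds[g]]
        yield train, test

def split_then_augment(labels, k=5, seed=0):
    """Split the original records first, then balance only the training part.
    The test fold contains untouched original records only."""
    folds = k_fold_indices(len(labels), k, seed)
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield oversample(train, labels, seed), test

def shared_records(train, test):
    """Number of distinct original records present in both train and test."""
    return len(set(train) & set(test))

# Toy data: 10 positive and 90 negative records (labels only, no features).
labels = [1] * 10 + [0] * 90
leaky = sum(shared_records(tr, te) for tr, te in augment_then_split(labels))
clean = sum(shared_records(tr, te) for tr, te in split_then_augment(labels))
print("shared records, augment-then-split:", leaky)
print("shared records, split-then-augment:", clean)
```

With interpolation-based methods such as SMOTE the overlap is not an exact duplicate but a synthetic point derived from records that may sit in the test fold, so the same ordering concern applies. A model would be fitted on each `train` list and scored on the matching `test` list; that step is left out to keep the sketch free of third-party dependencies.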


