Smart Data based Ensemble for Imbalanced Big Data Classification

01/16/2020
by   Diego García-Gil, et al.

Big Data scenarios pose a new challenge to traditional data mining algorithms, which are not designed to work with such large amounts of data. Smart Data refers to data of sufficient quality to improve the outcome of a data mining algorithm. The inability of existing data mining algorithms to handle Big Datasets prevents the transition from Big Data to Smart Data. The automated data acquisition that characterizes Big Data also brings problems of its own, such as differences in the amount of data per class, which lead classifiers to lean towards the most represented classes. This problem is known as imbalanced data distribution, where one class is underrepresented in the dataset. Ensembles of classifiers are machine learning methods that improve on the performance of a single base classifier by combining several of them. Ensembles are not exempt from the imbalanced classification problem; to deal with it, the ensemble method has to be designed specifically for imbalanced data. In this paper, a data preprocessing ensemble for imbalanced Big Data classification is presented, with a focus on two-class problems. Experiments carried out on 21 Big Datasets show that our ensemble classifier outperforms classic machine learning models, such as Random Forests, combined with an added data balancing method.
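
The abstract does not describe the internals of the proposed ensemble, so the following is only a generic illustration of the two ideas it names: balancing a two-class dataset and combining several base classifiers by voting. It is a minimal sketch, not the authors' method. It assumes random undersampling of the majority class as the balancing step and a toy nearest-centroid base learner; all function names and the synthetic data are hypothetical.

```python
import random
from collections import Counter

def centroid(rows):
    """Mean vector of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def train_nearest_centroid(X, y):
    """Toy base learner (an assumption): one centroid per class."""
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_class.items()}

def predict_one(model, features):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(features, c))
    return min(model, key=lambda label: dist(model[label]))

def balanced_subset(X, y, rng):
    """Random undersampling for a two-class problem:
    keep every minority row and sample the majority down to the same size."""
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    maj_idx = [i for i, label in enumerate(y) if label != minority]
    keep = min_idx + rng.sample(maj_idx, len(min_idx))
    return [X[i] for i in keep], [y[i] for i in keep]

def train_ensemble(X, y, n_members=5, seed=0):
    """Train each member on its own balanced subset of the data."""
    rng = random.Random(seed)
    return [train_nearest_centroid(*balanced_subset(X, y, rng))
            for _ in range(n_members)]

def predict_ensemble(models, features):
    """Combine the members by majority vote."""
    votes = Counter(predict_one(m, features) for m in models)
    return votes.most_common(1)[0][0]

# Synthetic imbalanced data (hypothetical): 200 majority vs. 10 minority rows.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)] for _ in range(200)]
X += [[data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)] for _ in range(10)]
y = [0] * 200 + [1] * 10

models = train_ensemble(X, y)
print(predict_ensemble(models, [3.0, 3.0]), predict_ensemble(models, [0.0, 0.0]))
```

Because every member sees all minority rows but a different random slice of the majority class, no single model is trained on the skewed class ratio, while the vote still draws on many majority examples overall.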

research
12/18/2017

MEBoost: Mixing Estimators with Boosting for Imbalanced Data Classification

Class imbalance problem has been a challenging research problem in the f...
research
10/09/2022

An Instance Selection Algorithm for Big Data in High imbalanced datasets based on LSH

Training of Machine Learning (ML) models in real contexts often deals wi...
research
01/06/2021

The Shapley Value of Classifiers in Ensemble Games

How do we decide the fair value of individual classifiers in an ensemble...
research
10/03/2019

Recognizing the Tractability in Big Data Computing

Due to the limitation on computational power of existing computers, the ...
research
05/15/2019

Ignorance-Aware Approaches and Algorithms for Prototype Selection in Machine Learning

Operating with ignorance is an important concern of the Machine Learning...
research
07/24/2021

Imbalanced Big Data Oversampling: Taxonomy, Algorithms, Software, Guidelines and Future Directions

Learning from imbalanced data is among the most challenging areas in con...
