An Instance Selection Algorithm for Big Data in High imbalanced datasets based on LSH

10/09/2022
by Germán E. Melo-Acosta, et al.

Training Machine Learning (ML) models in real-world contexts often involves big data sets with high class imbalance, where the class of interest (the minority class) is underrepresented. Practical solutions based on classical ML models address the problem of large data sets through parallel/distributed implementations of training algorithms, approximate model-based solutions, or instance selection (IS) algorithms that eliminate redundant information. However, the combined problem of big and highly imbalanced datasets has received less attention. This work proposes three new IS methods that handle large and imbalanced data sets. The proposed methods use Locality Sensitive Hashing (LSH) as a base clustering technique, and then apply one of three different sampling methods on top of the clusters (or buckets) generated by LSH. The algorithms were developed in the Apache Spark framework to ensure their scalability. Experiments on three different datasets suggest that the proposed IS methods can improve the performance of a base ML model between 5
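The general idea of bucketing instances with LSH and then sampling inside each bucket can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the three sampling methods, so the rule used here (keep every minority instance, keep at most `keep_per_bucket` randomly chosen majority instances per bucket) is a hypothetical stand-in. The random-hyperplane hash family, the function names, and all parameter values are also assumptions, and the sketch uses plain Python rather than Apache Spark.

```python
import random
from collections import defaultdict


def lsh_signature(x, hyperplanes):
    # One bit per hyperplane: 1 if x lies on the non-negative side, else 0.
    return tuple(
        int(sum(w * v for w, v in zip(h, x)) >= 0.0) for h in hyperplanes
    )


def lsh_instance_selection(X, y, minority_label, n_planes=4,
                           keep_per_bucket=1, seed=0):
    """Return sorted indices of the selected instances.

    Hypothetical sampling rule (an assumption, not taken from the paper):
    every minority instance is kept, and at most `keep_per_bucket`
    majority instances are kept from each LSH bucket.
    """
    rng = random.Random(seed)
    dim = len(X[0])
    # Random hyperplanes through the origin (assumed LSH family).
    hyperplanes = [
        [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)
    ]

    # Group instance indices by their hash signature (the "buckets").
    buckets = defaultdict(list)
    for i, x in enumerate(X):
        buckets[lsh_signature(x, hyperplanes)].append(i)

    selected = []
    for members in buckets.values():
        minority = [i for i in members if y[i] == minority_label]
        majority = [i for i in members if y[i] != minority_label]
        selected.extend(minority)
        k = min(keep_per_bucket, len(majority))
        selected.extend(rng.sample(majority, k))
    return sorted(selected)


# Synthetic, illustrative data: 200 majority points (label 0) and
# 5 minority points (label 1) in two dimensions.
data_rng = random.Random(1)
X = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(200)]
X += [[data_rng.gauss(3.0, 0.5), data_rng.gauss(3.0, 0.5)] for _ in range(5)]
y = [0] * 200 + [1] * 5

selected = lsh_instance_selection(X, y, minority_label=1)
print(len(X), "instances reduced to", len(selected))
```

With `n_planes` hyperplanes there are at most `2 ** n_planes` buckets, so this rule bounds the number of retained majority instances by `keep_per_bucket * 2 ** n_planes`, while the minority class is left intact.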


