Scalable mRMR feature selection to handle high dimensional datasets: Vertical partitioning based Iterative MapReduce framework

08/21/2022
by Yelleti Vivek, et al.

When building machine learning models, feature selection (FS) is an essential preprocessing step used to handle the uncertainty and vagueness in the data. Recently, the minimum Redundancy and Maximum Relevance (mRMR) approach has proven effective in obtaining a non-redundant feature subset. Owing to the generation of voluminous datasets, it is essential to design scalable solutions using distributed/parallel paradigms. MapReduce has proven to be one of the best approaches to designing fault-tolerant and scalable solutions. This work analyses the existing MapReduce approaches for mRMR feature selection and identifies their limitations. In the current study, we propose VMR_mRMR, an efficient vertical partitioning-based approach that uses a memorization approach, thereby overcoming the limitations of the extant approaches. The experimental analysis shows that VMR_mRMR significantly outperformed the extant approaches and achieved a better computational gain (C.G.). In addition, we conducted a comparative analysis with the horizontal partitioning approach HMR_mRMR [1] to assess the strengths and limitations of the proposed approach.
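
To make the mRMR criterion concrete, the following is a minimal single-process sketch in Python, not the authors' VMR_mRMR implementation. It assumes discrete features, scores each candidate as relevance minus mean redundancy with the already selected features, and splits the feature columns round-robin into groups to mimic vertical partitioning. The names `mutual_information`, `vertical_partitions`, and `mrmr`, the round-robin split, and the cached `redundancy_sum` are all illustrative assumptions; the cache is only one plausible way to reuse earlier computations across iterations.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def vertical_partitions(columns, n_parts):
    """Split feature names round-robin into n_parts column groups (hypothetical scheme)."""
    names = sorted(columns)
    return [names[i::n_parts] for i in range(n_parts)]


def mrmr(columns, target, k, n_parts=2):
    """Greedy mRMR: pick k features maximizing relevance minus mean redundancy."""
    parts = vertical_partitions(columns, n_parts)
    # "Map"-like step: each column group scores the relevance of its own features.
    relevance = {
        f: mutual_information(columns[f], target) for part in parts for f in part
    }
    # Cached running sum of I(f; s) over selected features s, so each pairwise
    # term is computed once instead of being recomputed every iteration.
    redundancy_sum = {f: 0.0 for f in relevance}
    selected = []

    def score(f):
        mean_redundancy = redundancy_sum[f] / len(selected) if selected else 0.0
        return relevance[f] - mean_redundancy

    while len(selected) < min(k, len(relevance)):
        candidates = [f for f in relevance if f not in selected]
        best = max(candidates, key=score)
        selected.append(best)
        # Update the cache only with the newly selected feature.
        for f in candidates:
            if f != best:
                redundancy_sum[f] += mutual_information(columns[f], columns[best])
    return selected


# Toy data: "b" duplicates "a"; "c" is independent of "a" but informative.
columns = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 0, 0, 0, 1, 1, 1, 1],
    "c": [0, 0, 1, 1, 0, 0, 1, 1],
}
target = [0, 0, 1, 1, 2, 2, 3, 3]
selected = mrmr(columns, target, k=2)
```

On this toy data the duplicate column is penalized by its redundancy with the first pick, so the two selected features include `c` and only one of `a` and `b`. In a real MapReduce setting each column group would live on a different worker; here the groups are only iterated in a single process.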
