Distributed Correlation-Based Feature Selection in Spark

01/31/2019
by   Raul-Jose Palma-Mendoza, et al.
0

CFS (Correlation-Based Feature Selection) is an FS algorithm that has been successfully applied to classification problems in many domains. We describe Distributed CFS (DiCFS) as a completely redesigned, scalable, parallel and distributed version of the CFS algorithm, capable of dealing with the large volumes of data typical of big data applications. Two versions of the algorithm were implemented and compared using the Apache Spark cluster computing model, currently gaining popularity due to its much faster processing times than Hadoop's MapReduce model. We tested our algorithms on four publicly available datasets, each consisting of a large number of instances and two also consisting of a large number of features. The results show that our algorithms were superior in terms of both time-efficiency and scalability. In leveraging a computer cluster, they were able to handle larger datasets than the non-distributed WEKA version while maintaining the quality of the results, i.e., exactly the same features were returned by our algorithms when compared to the original algorithm available in WEKA.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/01/2018

Distributed ReliefF based Feature Selection in Spark

Feature selection (FS) is a key research area in the machine learning an...
research
10/13/2016

An Information Theoretic Feature Selection Framework for Big Data under Apache Spark

With the advent of extremely high dimensional datasets, dimensionality r...
research
05/10/2018

Scaling associative classification for very large datasets

Supervised learning algorithms are nowadays successfully scaling up to d...
research
07/21/2020

ADAGES: adaptive aggregation with stability for distributed feature selection

In this era of "big" data, not only the large amount of data keeps motiv...
research
04/16/2018

BELIEF: A distance-based redundancy-proof feature selection method for Big Data

With the advent of Big Data era, data reduction methods are highly deman...
research
02/05/2021

Scalable Robust Graph and Feature Extraction for Arbitrary Vessel Networks in Large Volumetric Datasets

Recent advances in 3D imaging technologies provide novel insights to res...
research
08/23/2017

Massively-Parallel Feature Selection for Big Data

We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm ...

Please sign up or login with your details

Forgot password? Click here to reset