Parallel Large-Scale Attribute Reduction on Cloud Systems

10/06/2016
by Junbo Zhang, et al.

The rapid growth of emerging information technologies and application patterns, e.g., the Internet, the Internet of Things, cloud computing, and tri-network convergence, has ushered in the era of big data. Big data holds enormous value, but mining knowledge from it is tremendously challenging because of data uncertainty and inconsistency. Attribute reduction (also known as feature selection) not only serves as an effective preprocessing step but also exploits data redundancy to reduce uncertainty. However, existing solutions are designed either 1) for a single machine, which means the entire dataset must fit in main memory and parallelism is limited, or 2) for the Hadoop platform, which means the data must be loaded into distributed memory repeatedly, making them inefficient. In this paper, we overcome these shortcomings and propose a unified framework for Parallel Large-scale Attribute Reduction, termed PLAR, for big data analysis. PLAR consists of three components: 1) Granular Computing (GrC)-based initialization, which converts a decision table (i.e., the original data representation) into a granularity representation that takes less space and can therefore be easily cached in distributed memory; 2) model-parallelism, which evaluates all feature candidates simultaneously and makes attribute reduction highly parallelizable; 3) data-parallelism, which computes the significance of an attribute in parallel in a MapReduce style. We implement PLAR with four representative heuristic feature selection algorithms on Spark and evaluate them on various huge datasets, including UCI and astronomical datasets; the experiments show its advantages over existing solutions.
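The three components can be illustrated with a small single-process sketch in plain Python. It is not the authors' Spark implementation. The abstract does not say which significance measures the four heuristics use, so the positive-region dependency degree from rough set theory is assumed here. The function names (`to_granules`, `map_partition`, `merge`, `dependency`) and the toy data are hypothetical, and the candidate loop runs sequentially where PLAR would evaluate candidates in parallel.

```python
from collections import Counter, defaultdict
from functools import reduce

def to_granules(rows):
    """GrC-style initialization (assumed form): collapse identical
    (condition values, decision) rows into a single granule with a count."""
    return Counter((tuple(cond), dec) for cond, dec in rows)

def map_partition(granules, attrs):
    """Map step: project each granule onto `attrs` and tally decision
    counts per equivalence class within one data partition."""
    out = defaultdict(Counter)
    for (cond, dec), n in granules.items():
        out[tuple(cond[a] for a in attrs)][dec] += n
    return out

def merge(a, b):
    """Reduce step: add the per-class decision tallies of two partitions."""
    for key, counts in b.items():
        a[key].update(counts)
    return a

def dependency(partitions, attrs):
    """Assumed significance measure: fraction of objects that fall in
    equivalence classes with a single decision value (positive region)."""
    merged = reduce(merge, (map_partition(p, attrs) for p in partitions),
                    defaultdict(Counter))
    total = sum(sum(c.values()) for c in merged.values())
    pos = sum(sum(c.values()) for c in merged.values() if len(c) == 1)
    return pos / total

# Toy decision table split into two partitions (hypothetical data):
# each row is (condition attribute values, decision).
parts = [
    to_granules([((0, 0), "n"), ((0, 0), "n"), ((0, 1), "y")]),
    to_granules([((1, 0), "y"), ((1, 1), "y"), ((0, 1), "y")]),
]

# Evaluate every single-attribute candidate (sequential in this sketch).
scores = {a: dependency(parts, [a]) for a in (0, 1)}
best = max(scores, key=scores.get)
```

In this toy table, attribute 1 scores higher than attribute 0, so a greedy heuristic would select it first. On Spark, the granules would typically be held as a cached distributed dataset, with the map and merge steps expressed as a keyed reduction.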


