Benchmarking Apache Spark and Hadoop MapReduce on Big Data Classification

09/21/2022
by   Taha Tekdogan, et al.

Most popular Big Data analytics tools have evolved to adapt their working environments to extract valuable information from vast amounts of unstructured data. The ability of data mining techniques to filter this useful information from Big Data led to the term Big Data Mining. Shifting the scope of data from small, structured, and stable data to high-volume, unstructured, and rapidly changing data brings many data management challenges. Different tools cope with these challenges in their own ways because of their architectural limitations. Numerous parameters must be taken into consideration when choosing the right data management framework for the task at hand. In this paper, we present a comprehensive benchmark of two widely used Big Data analytics tools, namely Apache Spark and Hadoop MapReduce, on a common data mining task, i.e., classification. We employ several evaluation metrics to compare the performance of the benchmarked frameworks, such as execution time, accuracy, and scalability. These metrics are specialized to measure performance on the classification task. To the best of our knowledge, no previous study in the literature employs all of these metrics while taking task-specific concerns into consideration. We show that Spark is 5 times faster than MapReduce at training the model. Nevertheless, the performance of Spark degrades as the input workload gets larger. Scaling the environment with additional clusters significantly improves the performance of Spark. However, a similar improvement is not observed in Hadoop. The machine learning utility of MapReduce tends to have better accuracy scores than that of Spark, like around 3

