Performance Evaluation of Distributed Computing Environments with Hadoop and Spark Frameworks

07/16/2017
by Vladyslav Taran, et al.

Recently, due to the rapid development of information and communication technologies, data are being created and consumed at an avalanche-like rate. Distributed computing creates the preconditions for analyzing and processing such Big Data by distributing the computations among a number of compute nodes. In this work, the performance of distributed computing environments based on the Hadoop and Spark frameworks is estimated for real and virtual versions of clusters. As a test task, we chose the classic use case of word counting in texts of various sizes. It was found that the running times grow very fast with the dataset size, even faster than a power function. This tendency is similar for both the Hadoop and Spark frameworks in the real and virtual cluster implementations. Moreover, speedup values decrease significantly as the dataset size grows, especially for the virtual cluster configuration. The problem of growing data generated by IoT and multimodal (visual, sound, tactile, neuro and brain-computing, muscle and eye tracking, etc.) interaction channels is presented. In the context of this problem, the current observations on running times and speedup for the Hadoop and Spark frameworks in real and virtual cluster configurations can be very useful for proper scaling-up and efficient job management, especially for machine learning and Deep Learning applications, where Big Data are widely present.
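
The word-counting test task and the speedup metric mentioned in the abstract can be illustrated with a minimal single-process sketch. This is not the authors' code and does not use the Hadoop or Spark APIs; it only mimics the map and reduce steps in plain Python. The function names (`map_phase`, `reduce_phase`, `word_count`, `speedup`) and the tokenization rule are hypothetical, and the speedup formula is the conventional ratio of baseline running time to parallel running time, which the abstract does not define explicitly.

```python
import re
from collections import Counter
from functools import reduce


def map_phase(chunk: str) -> Counter:
    """Map step: count the words in one chunk of text.

    Assumption: words are lowercase runs of letters, digits, and apostrophes.
    """
    return Counter(re.findall(r"[a-z0-9']+", chunk.lower()))


def reduce_phase(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial word counts into one."""
    return left + right


def word_count(chunks) -> Counter:
    """Apply the map step to every chunk, then fold the partial counts."""
    return reduce(reduce_phase, map(map_phase, chunks), Counter())


def speedup(t_baseline: float, t_parallel: float) -> float:
    """Conventional speedup: baseline running time over parallel running time.

    Assumption: this definition is not taken from the abstract.
    """
    return t_baseline / t_parallel


if __name__ == "__main__":
    # Each string stands in for the share of the text given to one node.
    chunks = ["big data big clusters", "Big Data, small clusters"]
    print(word_count(chunks))
    # Hypothetical timings in seconds, for illustration only.
    print(speedup(120.0, 30.0))
```

In a real cluster, the map step would run on separate nodes and the partial counts would be shuffled over the network before the reduce step, so running times include overheads that this sketch does not model.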


research
02/08/2018

Deep Learning with Apache SystemML

Enterprises operate large data lakes using Hadoop and Spark frameworks t...
research
04/18/2019

Big Data in IoT Systems

Big Data in IoT is a large and fast-developing area where many different...
research
07/04/2018

Analyzing Big Datasets of Genomic Sequences: Fast and Scalable Collection of k-mer Statistics

Distributed approaches based on the map-reduce programming paradigm have...
research
04/27/2018

Intermediate Data Caching Optimization for Multi-Stage and Parallel Big Data Frameworks

In the era of big data and cloud computing, large amounts of data are ge...
research
11/15/2017

PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development

This paper describes PlinyCompute, a system for development of high-perf...
research
02/17/2021

Deployment of Elastic Virtual Hybrid Clusters Across Cloud Sites

Virtual clusters are widely used computing platforms that can be deploye...
research
03/22/2023

How does SSD Cluster Perform for Distributed File Systems: An Empirical Study

As the capacity of Solid-State Drives (SSDs) is constantly being optimis...
