Evaluation of Distributed Data Processing Frameworks in Hybrid Clouds

01/06/2022
by Faheem Ullah, et al.

Distributed data processing frameworks (e.g., Hadoop, Spark, and Flink) are widely used to distribute data among the computing nodes of a cloud. Recent efforts have increasingly evaluated the performance of these frameworks when hosted in private and public clouds. However, there is little research on their performance in a hybrid cloud, an emerging cloud model that integrates private and public clouds to combine the strengths of both. In this paper, we therefore evaluate the performance of Hadoop, Spark, and Flink in a hybrid cloud in terms of execution time, resource utilization, horizontal scalability, vertical scalability, and cost. Our hybrid cloud consists of OpenStack (private cloud) and MS Azure (public cloud), and we use both batch and iterative workloads for the evaluation. Our results show that in a hybrid cloud (i) execution time increases as the private cloud borrows more nodes from the public cloud, (ii) Flink outperforms Spark, which in turn outperforms Hadoop, in terms of execution time, (iii) Hadoop transfers the largest amount of data among the nodes during workload execution, while Spark transfers the least, (iv) all three frameworks scale better horizontally than vertically, and (v) Spark is the least expensive for data processing, while Hadoop is the most expensive.


research
07/31/2020

The Impact of Distance on Performance and Scalability of Distributed Database Systems in Hybrid Clouds

The increasing need for managing big data has led to the emergence of advan...
research
06/05/2020

Skedulix: Hybrid Cloud Scheduling for Cost-Efficient Execution of Serverless Applications

We present a framework for scheduling multifunction serverless applicati...
research
12/15/2021

Data Placement for Multi-Tenant Data Federation on the Cloud

Due to privacy concerns of users and law enforcement in data security an...
research
06/10/2018

An Enhanced BPSO based Approach for Service Placement in Hybrid Cloud

Due to the challenges of competition and the rapidly evolving market, co...
research
03/20/2023

Benchmarking scalability of stream processing frameworks deployed as event-driven microservices in the cloud

Event-driven microservices are an emerging architectural style for data-...
research
09/20/2022

Design and Implementation of Fragmented Clouds for Evaluation of Distributed Databases

In this paper, we present a Fragmented Hybrid Cloud (FHC) that provides ...
research
04/14/2023

Hybrid DLT as a data layer for real-time, data-intensive applications

We propose a new approach, termed Hybrid DLT, to address a broad range o...
