Co-Tuning of Cloud Infrastructure and Distributed Data Processing Platforms

09/01/2023
by Isuru Dharmadasa, et al.

Distributed data processing platforms (e.g., Hadoop, Spark, and Flink) are widely used to store and process data in cloud environments. These platforms distribute the storage and processing of data among the computing nodes of a cloud. Using them efficiently requires users to (i) configure the cloud, i.e., determine the number and type of computing nodes, and (ii) tune the configuration parameters of the platform (e.g., the data replication factor). Both tasks require in-depth knowledge of the cloud infrastructure and of the distributed data processing platform. Therefore, in this paper, we first study the relationship between the configuration of the cloud and the configuration of distributed data processing platforms to determine how cloud configuration impacts platform configuration. Building on these findings, we propose a co-tuning approach that recommends an optimal co-configuration of the cloud and the distributed data processing platform. The proposed approach uses machine learning and optimization techniques to maximize the performance of the distributed data processing system deployed on the cloud. We evaluated our approach for Hadoop, Spark, and Flink in a cluster deployed on the OpenStack cloud, using three benchmarking workloads (WordCount, Sort, and K-means). Our results reveal that, in comparison to default settings, our co-tuning approach reduces execution time by 17.5
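To make the idea of co-tuning concrete, the following is a minimal sketch, not the authors' method. It assumes a joint search space made of two cloud settings (number and type of nodes) and one platform setting (replication factor), and it uses a hand-written toy function, `predict_runtime`, as a stand-in for a learned performance model. All names, parameter ranges, and cost constants are hypothetical. The optimization step is a plain exhaustive search, chosen here only for clarity.

```python
from itertools import product

# Hypothetical search space: cloud settings and one platform setting.
CLOUD_SPACE = {
    "num_nodes": [2, 4, 8],
    "node_type": ["small", "large"],
}
PLATFORM_SPACE = {
    "replication_factor": [1, 2, 3],
}

# Made-up relative compute speed per node type (illustrative only).
NODE_SPEED = {"small": 1.0, "large": 2.0}


def predict_runtime(config):
    """Toy stand-in for a learned performance model (an assumption)."""
    nodes = config["num_nodes"]
    rf = config["replication_factor"]
    # More or faster nodes shorten the compute phase.
    compute = 100.0 / (nodes * NODE_SPEED[config["node_type"]])
    # Each extra replica adds write overhead.
    write_cost = 3.0 * rf
    # Few replicas on many nodes means more remote reads, so the best
    # platform setting depends on the cloud setting.
    locality_penalty = 2.0 * nodes / rf
    return compute + write_cost + locality_penalty


def co_tune(cloud_space, platform_space, model):
    """Search cloud and platform settings jointly, not one after the other."""
    space = {**cloud_space, **platform_space}
    names = list(space)
    candidates = (
        dict(zip(names, values)) for values in product(*space.values())
    )
    # Pick the joint configuration with the lowest predicted runtime.
    return min(candidates, key=model)


best = co_tune(CLOUD_SPACE, PLATFORM_SPACE, predict_runtime)
print(best, predict_runtime(best))
```

In this toy model the preferred replication factor shifts with the number of nodes, which is why the two groups of settings are searched together. A realistic setup would replace `predict_runtime` with a model fitted to measured runs and the exhaustive loop with a more scalable optimizer.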


