C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds

07/28/2021
by Jonathan Will, et al.

Distributed dataflow systems enable data-parallel processing of large datasets on clusters. Public cloud providers offer a large variety and quantity of resources that can be used for such clusters. Yet selecting appropriate cloud resources for dataflow jobs, resources that lead neither to bottlenecks nor to low resource utilization, is often challenging, even for expert users such as data engineers. We present C3O, a collaborative system for optimizing data processing cluster configurations in public clouds based on shared historical runtime data. The shared data is used to predict the runtimes of data processing jobs on different possible cluster configurations, using specialized regression models. These models take the diverse execution contexts of different users into account and exhibit mean absolute errors below 3% in our evaluation with 930 unique Spark jobs.
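
The general idea of predicting runtimes from historical runs and then choosing a cluster configuration can be sketched as follows. This is a minimal illustration only, not the regression models used by C3O: it assumes a simple one-feature model, runtime ≈ a + b * (data size / machine count), fitted by ordinary least squares. The names `Run`, `fit_runtime_model`, `predict`, and `cheapest_config`, as well as the price and deadline parameters, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One historical job execution (hypothetical record layout)."""
    machines: int      # number of machines in the cluster
    data_gb: float     # input dataset size in GB
    runtime_s: float   # observed runtime in seconds


def fit_runtime_model(runs):
    """Least-squares fit of runtime ~ a + b * (data_gb / machines).

    Assumed toy model: a is a fixed overhead, b is the time per GB
    processed by a single machine. Returns the pair (a, b).
    """
    xs = [r.data_gb / r.machines for r in runs]
    ys = [r.runtime_s for r in runs]
    n = len(runs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict(model, machines, data_gb):
    """Predicted runtime in seconds for a given configuration."""
    a, b = model
    return a + b * data_gb / machines


def cheapest_config(model, data_gb, candidates, price_per_machine_hour, deadline_s):
    """Return (machines, cost, predicted_runtime_s) for the cheapest
    candidate that meets the deadline, or None if none does."""
    best = None
    for machines in candidates:
        runtime_s = predict(model, machines, data_gb)
        if runtime_s > deadline_s:
            continue  # too slow: violates the runtime target
        cost = machines * price_per_machine_hour * runtime_s / 3600
        if best is None or cost < best[1]:
            best = (machines, cost, runtime_s)
    return best
```

In a collaborative setting, the list of `Run` records would be pooled from several users running the same job. A realistic model would also need features for the execution context (for example machine type and job parameters), which this sketch leaves out.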

research
11/16/2020

Towards Collaborative Optimization of Cluster Configurations for Distributed Dataflow Jobs

Analyzing large datasets with distributed dataflow systems requires the ...
research
06/01/2022

Collaborative Cluster Configuration for Distributed Data-Parallel Processing: A Research Overview

Many organizations routinely analyze large datasets using systems for di...
research
06/18/2019

MultiCloud Resource Management using Apache Mesos with Apache Airavata

We discuss initial results and our planned approach for incorporating Ap...
research
11/08/2022

Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing

Selecting appropriate computational resources for data processing jobs o...
research
11/16/2021

On the Potential of Execution Traces for Batch Processing Workload Optimization in Public Clouds

With the growing amount of data, data processing workloads and the manag...
research
03/22/2023

How does SSD Cluster Perform for Distributed File Systems: An Empirical Study

As the capacity of Solid-State Drives (SSDs) is constantly being optimis...
research
11/15/2021

Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud

Distributed dataflow systems like Apache Flink and Apache Spark simplify...
