Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing

11/08/2022
by   Jonathan Will, et al.
0

Selecting appropriate computational resources for data processing jobs on large clusters is difficult, even for expert users like data engineers. Inadequate choices can result in vastly increased costs, without significantly improving performance. One crucial aspect of selecting an efficient resource configuration is avoiding memory bottlenecks. By knowing the required memory of a job in advance, the search space for an optimal resource configuration can be greatly reduced. Therefore, we present Ruya, a method for memory-aware optimization of data processing cluster configurations based on iteratively exploring a narrowed-down search space. First, we perform job profiling runs with small samples of the dataset on just a single machine to model the job's memory usage patterns. Second, we prioritize cluster configurations with a suitable amount of total memory and within this reduced search space, we iteratively search for the best cluster configuration with Bayesian optimization. This search process stops once it converges on a configuration that is believed to be optimal for the given job. In our evaluation on a dataset with 1031 Spark and Hadoop jobs, we see a reduction of search iterations to find an optimal configuration by around half, compared to the baseline.

READ FULL TEXT
research
06/28/2022

Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

Distributed dataflow systems like Apache Spark and Apache Hadoop enable ...
research
07/05/2022

Blink: Lightweight Sample Runs for Cost Optimization of Big Data Applications

Distributed in-memory data processing engines accelerate iterative appli...
research
07/28/2021

C3O: Collaborative Cluster Configuration Optimization for Distributed Data Processing in Public Clouds

Distributed dataflow systems enable data-parallel processing of large da...
research
08/22/2023

Karasu: A Collaborative Approach to Efficient Cluster Configuration for Big Data Analytics

Selecting the right resources for big data analytics jobs is hard becaus...
research
10/29/2019

Real-time Bidding campaigns optimization using attribute selection

Real-Time Bidding is nowadays one of the most promising systems in the o...
research
05/06/2019

Lynceus: Tuning and Provisioning Data Analytic Jobs on a Budget

Many enterprises need to run data analytic jobs on the cloud. Significan...
research
03/28/2022

LOCAT: Low-Overhead Online Configuration Auto-Tuning of Spark SQL Applications

Spark SQL has been widely deployed in industry but it is challenging to ...

Please sign up or login with your details

Forgot password? Click here to reset