Get Your Memory Right: The Crispy Resource Allocation Assistant for Large-Scale Data Processing

06/28/2022
by Jonathan Will, et al.

Distributed dataflow systems like Apache Spark and Apache Hadoop enable data-parallel processing of large datasets on clusters. Yet, selecting appropriate computational resources for dataflow jobs – resources that lead neither to bottlenecks nor to low resource utilization – is often challenging, even for expert users such as data engineers. Further, existing automated approaches to resource selection assume that a job is recurring, so that they can learn from previous runs or justify the cost of full test runs. However, this assumption often does not hold, since many jobs are too unique. Therefore, we present Crispy, a method for optimizing data processing cluster configurations based on job profiling runs with small samples of the dataset on just a single machine. Crispy attempts to extrapolate the memory usage for the full dataset and then chooses a cluster configuration with enough total memory. In our evaluation on a dataset with 1031 Spark and Hadoop jobs, we see a reduction of job execution costs by 56%, while on average spending less than ten minutes on profiling runs per job on a consumer-grade laptop.
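The idea summarized in the abstract (profile on small samples, extrapolate memory use to the full dataset, then pick a cluster with enough total memory) can be illustrated with a minimal sketch. This is not the authors' implementation: the linear memory model, the `ClusterConfig` type, the function names, the profiling numbers, and the prices are all assumptions made up for illustration, and the hourly price is used only as a stand-in for job cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical cluster option; names and prices are illustrative only."""
    name: str
    nodes: int
    memory_gb_per_node: float
    price_per_node_hour: float

    @property
    def total_memory_gb(self) -> float:
        return self.nodes * self.memory_gb_per_node

    @property
    def hourly_cost(self) -> float:
        return self.nodes * self.price_per_node_hour


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate_memory_gb(sample_sizes_gb, peak_memory_gb, full_size_gb):
    """Assumption: peak memory grows linearly with input size."""
    slope, intercept = fit_line(sample_sizes_gb, peak_memory_gb)
    return slope * full_size_gb + intercept


def choose_config(configs, required_memory_gb):
    """Cheapest (by hourly price) config whose total memory suffices, else None."""
    feasible = [c for c in configs if c.total_memory_gb >= required_memory_gb]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.hourly_cost)


# Made-up profiling results from single-machine runs on small samples.
sample_sizes_gb = [0.5, 1.0, 1.5, 2.0]
peak_memory_gb = [1.2, 2.2, 3.2, 4.2]

# Made-up candidate cluster configurations.
configs = [
    ClusterConfig("small", nodes=4, memory_gb_per_node=32, price_per_node_hour=0.2),
    ClusterConfig("medium", nodes=8, memory_gb_per_node=32, price_per_node_hour=0.2),
    ClusterConfig("large", nodes=4, memory_gb_per_node=128, price_per_node_hour=0.9),
]

estimate = extrapolate_memory_gb(sample_sizes_gb, peak_memory_gb, full_size_gb=100.0)
chosen = choose_config(configs, estimate)
print(f"estimated memory: {estimate:.1f} GB, chosen config: {chosen.name}")
```

With these invented numbers the fitted line is roughly 2.0 GB of memory per GB of input plus 0.2 GB, so a 100 GB dataset is estimated at about 200 GB and the 128 GB option is ruled out. A real job may not scale linearly, and the lowest hourly price does not necessarily give the lowest total job cost, so both choices should be read as simplifications.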

research
11/08/2022

Ruya: Memory-Aware Iterative Optimization of Cluster Configurations for Big Data Processing

Selecting appropriate computational resources for data processing jobs o...
research
06/06/2023

Selecting Efficient Cluster Resources for Data Analytics: When and How to Allocate for In-Memory Processing?

Distributed dataflow systems such as Apache Spark or Apache Flink enable...
research
08/23/2019

Griffon: Reasoning about Job Anomalies with Unlabeled Data in Cloud-based Platforms

Microsoft's internal big data analytics platform is comprised of hundred...
research
02/16/2022

Learning Transferrable Representations of Career Trajectories for Economic Prediction

Understanding career trajectories – the sequences of jobs that individua...
research
05/20/2018

Machine Learning for Predictive Analytics of Compute Cluster Jobs

We address the problem of predicting whether sufficient memory and CPU r...
research
07/05/2022

Blink: Lightweight Sample Runs for Cost Optimization of Big Data Applications

Distributed in-memory data processing engines accelerate iterative appli...
research
02/27/2022

Past, Present and Future of Hadoop: A Survey

In this paper, a technology for massive data storage and computing named...
