Training Data Reduction for Performance Models of Data Analytics Jobs in the Cloud

11/15/2021
by Jonathan Will, et al.

Distributed dataflow systems like Apache Flink and Apache Spark simplify processing large amounts of data on clusters in a data-parallel manner. However, choosing suitable cluster resources for distributed dataflow jobs, in both type and number, is difficult, especially for users who do not have access to previous performance metrics. One approach to overcoming this issue is to have users share runtime metrics to train context-aware performance models that help find a suitable configuration for the job at hand. A problem when sharing runtime data instead of trained models or model parameters is that the data size can grow substantially over time. This paper examines several clustering techniques to minimize training data size while keeping the associated performance models accurate. Our results indicate that efficiency gains in data transfer, storage, and model training can be achieved through training data reduction. In the evaluation of our solution on a dataset of runtime data from 930 unique distributed dataflow jobs, we observed that, on average, a 75% reduction of the training data increases prediction errors by only about one percentage point.
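To illustrate the general idea of clustering-based training data reduction, the sketch below clusters runtime samples and keeps one representative per cluster before fitting a simple performance model. This is a minimal sketch under stated assumptions, not the paper's exact method: it assumes runtime metrics are numeric feature vectors, uses k-means (via scikit-learn) as the clustering technique, picks the member closest to each centroid as the retained example, and uses a linear regression as a stand-in performance model.

```python
# Illustrative sketch: reduce training data via clustering, then fit a
# performance model on the reduced set. Feature layout, k-means, and the
# centroid-nearest representative are assumptions for this example only.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression


def reduce_training_data(X, y, keep_fraction=0.25, random_state=0):
    """Cluster the training points and keep one representative per cluster."""
    n_clusters = max(1, int(len(X) * keep_fraction))
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = kmeans.fit_predict(X)

    kept_indices = []
    for cluster_id in range(n_clusters):
        members = np.where(labels == cluster_id)[0]
        if len(members) == 0:
            continue
        # Keep the member closest to the cluster centroid.
        distances = np.linalg.norm(
            X[members] - kmeans.cluster_centers_[cluster_id], axis=1
        )
        kept_indices.append(members[np.argmin(distances)])

    kept_indices = np.array(kept_indices)
    return X[kept_indices], y[kept_indices]


# Synthetic example: features are (scale-out, input size in GB),
# target is job runtime in seconds.
rng = np.random.default_rng(42)
X = rng.uniform([2, 1], [32, 100], size=(200, 2))
y = 50 + 5 * X[:, 1] / X[:, 0] + rng.normal(0, 2, size=200)

X_small, y_small = reduce_training_data(X, y, keep_fraction=0.25)
model = LinearRegression().fit(X_small, y_small)  # model trained on reduced data
print(f"Kept {len(X_small)} of {len(X)} samples; R^2 on full data = {model.score(X, y):.3f}")
```

In this setup the model is trained on roughly a quarter of the original samples, which mirrors the trade-off the paper evaluates: smaller shared datasets at the cost of a (hopefully small) loss in prediction accuracy.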


