Runtime Variation in Big Data Analytics

04/07/2023
by Yiwen Zhu, et al.

The dynamic nature of resource allocation and runtime conditions in the cloud can cause a job's runtime to vary widely across repeated runs, leading to a poor user experience. Identifying the sources of this variation, and being able to predict and adjust for them, is crucial for cloud service providers that need to design reliable data processing pipelines, provision and allocate resources, adjust pricing, meet SLOs, and debug performance hazards. In this paper, we analyze the runtime variation of millions of production SCOPE jobs on Cosmos, an exabyte-scale internal analytics platform at Microsoft. We propose an innovative 2-step approach to predicting a job's runtime distribution: we first characterize typical distribution shapes and then apply a classification model, which achieves an average accuracy of >96% and better captures long tails. We examine factors such as job plan characteristics and inputs, resource allocation, physical cluster heterogeneity and utilization, and scheduling policies. To the best of our knowledge, this is the first study on predicting categories of runtime distributions for enterprise analytics workloads at scale. Furthermore, we examine how our methods can be used to analyze what-if scenarios, focusing on the impact of resource allocation, scheduling, and physical cluster provisioning decisions on a job's runtime consistency and predictability.
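
To make the 2-step idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method used in the paper: the abstract does not name the clustering algorithm, the features, or the classifier. The sketch assumes that distribution shapes are histograms of median-normalized runtimes, that typical shapes are found with a small k-means, and that a nearest-centroid classifier maps job features to a shape. The bin edges, the single numeric feature, and all function names are hypothetical.

```python
from statistics import median

# Hypothetical bin edges over runtime divided by the job group's median runtime.
BINS = [0.0, 0.5, 0.9, 1.1, 1.5, 2.0, float("inf")]

def shape_histogram(runtimes):
    """Return the share of runs per bin after normalizing by the median.

    Assumes all runtimes are positive."""
    med = median(runtimes)
    counts = [0] * (len(BINS) - 1)
    for r in runtimes:
        x = r / med
        for i in range(len(counts)):
            if BINS[i] <= x < BINS[i + 1]:
                counts[i] += 1
                break
    return [c / len(runtimes) for c in counts]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def kmeans(points, k, iters=20):
    """Step 1 (assumed): group histograms into k typical shapes.

    Uses deterministic farthest-point initialization instead of random seeds."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids

def fit_nearest_centroid(features, labels):
    """Step 2 (assumed): store the mean feature vector of each shape label."""
    by_label = {}
    for f, y in zip(features, labels):
        by_label.setdefault(y, []).append(f)
    return {y: [sum(col) / len(fs) for col in zip(*fs)] for y, fs in by_label.items()}

def predict_shape(model, feature):
    """Predict the shape label whose mean feature vector is closest."""
    return min(model, key=lambda y: sq_dist(feature, model[y]))

# Synthetic history for four recurring job groups (made-up numbers).
history = [
    [98, 99, 100, 101, 102],    # tight around the median
    [49, 50, 50, 50, 51],       # tight around the median
    [100, 100, 100, 180, 320],  # long right tail
    [10, 10, 10, 19, 31],       # long right tail
]
# One hypothetical feature per group, e.g. share of work on preemptible resources.
features = [[0.1], [0.2], [0.9], [0.8]]

histograms = [shape_histogram(r) for r in history]
centroids = kmeans(histograms, k=2)
labels = [nearest(h, centroids) for h in histograms]
model = fit_nearest_centroid(features, labels)
print(labels, predict_shape(model, [0.7]))
```

Predicting a shape category rather than a single runtime value lets the output describe the tail explicitly: the predicted label points at a whole histogram, including the bins far above the median.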
