Phoebe: A Learning-based Checkpoint Optimizer

10/05/2021
by Yiwen Zhu, et al.

Easy-to-use programming interfaces paired with cloud-scale processing engines have enabled big data system users to author arbitrarily complex analytical jobs over massive volumes of data. However, as the complexity and scale of analytical jobs increase, they encounter a number of unforeseen problems: hotspots with large intermediate data on temporary storage, longer job recovery time after failures, and worse query optimizer estimates are examples of issues that we are facing at Microsoft. To address these issues, we propose Phoebe, an efficient learning-based checkpoint optimizer. Given a set of constraints and an objective function at compile-time, Phoebe is able to determine the decomposition of job plans and the optimal set of checkpoints to preserve their outputs to durable global storage. Phoebe consists of three machine learning predictors and one optimization module. For each stage of a job, Phoebe makes accurate predictions for: (1) the execution time, (2) the output size, and (3) the start/end time, taking into account the inter-stage dependencies. Using these predictions, we formulate checkpoint optimization as an integer programming problem and propose a scalable heuristic algorithm that meets the latency requirement of the production environment. We demonstrate the effectiveness of Phoebe in production workloads, and show that we can free the temporary storage on hotspots by more than 70% and restart failed jobs 68% faster on average with minimum performance impact. Phoebe also illustrates that adding multiple sets of checkpoints is not cost-efficient, which dramatically reduces the complexity of the optimization.
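To make the optimization step concrete, here is a toy sketch of choosing checkpoints from per-stage predictions. It is not Phoebe's formulation or heuristic, which the abstract does not spell out. The objective (temporary-storage GB-seconds freed), the single global-storage budget constraint, the exhaustive search, and all names (`Stage`, `best_checkpoints`, `budget_gb`) are assumptions made for illustration only, and the numbers are invented.

```python
from itertools import combinations
from typing import NamedTuple


class Stage(NamedTuple):
    name: str
    output_gb: float  # predicted output size (hypothetical value)
    end_time: float   # predicted end time, seconds from job start (hypothetical)


def best_checkpoints(stages, job_end, budget_gb):
    """Exhaustively solve a tiny 0/1 selection problem.

    Assumed objective: checkpointing a stage frees its output from temporary
    storage for the rest of the job, saving output_gb * (job_end - end_time).
    Assumed constraint: total checkpointed data must fit in budget_gb.
    """
    best_set, best_saving = (), 0.0
    for k in range(1, len(stages) + 1):
        for subset in combinations(stages, k):
            if sum(s.output_gb for s in subset) > budget_gb:
                continue  # violates the global-storage budget
            saving = sum(s.output_gb * (job_end - s.end_time) for s in subset)
            if saving > best_saving:
                best_set, best_saving = subset, saving
    return [s.name for s in best_set], best_saving


stages = [
    Stage("extract", 40.0, 100.0),
    Stage("join", 25.0, 300.0),
    Stage("aggregate", 5.0, 550.0),
]
chosen, saving = best_checkpoints(stages, job_end=600.0, budget_gb=50.0)
print(chosen, saving)
```

Exhaustive search is exponential in the number of stages, which is why a production setting with a compile-time latency requirement would need a scalable heuristic rather than this brute-force approach.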

research
08/13/2018

Allocation of Graph Jobs in Geo-Distributed Cloud Networks

Recently, processing of big-data has drawn tremendous attention, where c...
research
02/27/2020

Cost Models for Big Data Query Processing: Learning, Retrofitting, and Our Findings

Query processing over big data is ubiquitous in modern clouds, where the...
research
12/14/2019

Approximating Bounded Job Start Scheduling with Application in Royal Mail Deliveries under Uncertainty

Motivated by mail delivery scheduling problems arising in Royal Mail, we...
research
06/30/2021

Optimally rescheduling jobs with a LIFO buffer

This paper considers single-machine scheduling problems in which a given...
research
12/10/2022

Acela: Predictable Datacenter-level Maintenance Job Scheduling

Datacenter operators ensure fair and regular server maintenance by using...
research
10/24/2022

Deploying a Steered Query Optimizer in Production at Microsoft

Modern analytical workloads are highly heterogeneous and massively compl...
research
05/24/2023

Towards Optimizing Storage Costs on the Cloud

We study the problem of optimizing data storage and access costs on the ...
