Acela: Predictable Datacenter-level Maintenance Job Scheduling

12/10/2022
by Yi Ding, et al.

Datacenter operators ensure fair and regular server maintenance by using automated processes that schedule maintenance jobs to complete within a strict time budget. Automating this scheduling problem is challenging because maintenance job duration varies with both job type and hardware. While it is tempting to use prior machine learning techniques to predict job duration, we find that the structure of the maintenance job scheduling problem creates a unique challenge. In particular, we show that the prior machine learning methods that produce the lowest-error predictions do not produce the best scheduling outcomes, because the costs of prediction errors are asymmetric. Specifically, underpredicting maintenance job duration results in more servers being taken offline and longer server downtime than overpredicting it, so the system cost of underprediction is much larger than that of overprediction. We present Acela, a machine learning system for predicting maintenance job duration, which uses quantile regression to bias duration predictions toward overprediction. We integrate Acela into a maintenance job scheduler and evaluate it on datasets from large-scale production datacenters. Compared to machine-learning-based predictors from prior work, Acela reduces the number of servers that are taken offline by 1.87-4.28X and reduces server offline time by 1.40-2.80X.
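To illustrate how quantile regression can bias predictions toward overprediction, the following is a minimal sketch of the quantile (pinball) loss, not Acela's actual implementation. The function names, the sample durations, and the quantile level of 0.9 are all assumptions chosen for illustration. With a quantile level above 0.5, each unit of underprediction is penalized more heavily than a unit of overprediction, so the loss-minimizing prediction moves above the median.

```python
def pinball_loss(actual, predicted, q):
    """Quantile (pinball) loss for a single prediction, with 0 < q < 1."""
    error = actual - predicted
    # Underprediction (error > 0) costs q per unit of error;
    # overprediction (error < 0) costs (1 - q) per unit of error.
    return q * error if error >= 0 else (q - 1) * error


def best_constant_prediction(durations, q):
    """Return the observed duration that minimizes the total pinball loss."""
    return min(
        durations,
        key=lambda c: sum(pinball_loss(d, c, q) for d in durations),
    )


def count_underpredicted(durations, prediction):
    """Count the jobs that would run longer than the predicted duration."""
    return sum(1 for d in durations if d > prediction)


# Hypothetical job durations in minutes (illustrative only, not from the paper).
durations = [10, 12, 15, 20, 30, 45, 60, 90, 120, 200]

median_like = best_constant_prediction(durations, 0.5)    # symmetric penalty
high_quantile = best_constant_prediction(durations, 0.9)  # penalizes underprediction more

print("q=0.5 prediction:", median_like,
      "underpredicted jobs:", count_underpredicted(durations, median_like))
print("q=0.9 prediction:", high_quantile,
      "underpredicted jobs:", count_underpredicted(durations, high_quantile))
```

In this toy setting the prediction is a single constant; a real predictor would fit a model of job type and hardware features under the same loss. The trade-off is that a higher quantile level reserves more time per job, which is only worthwhile when underprediction is the costlier error, as the abstract argues.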

