NURD: Negative-Unlabeled Learning for Online Datacenter Straggler Prediction

03/16/2022
by Yi Ding, et al.

Datacenters execute large computational jobs, which are composed of smaller tasks. A job completes only when all of its tasks finish, so stragglers – rare but extremely slow tasks – are a major impediment to datacenter performance. Accurately predicting stragglers would enable proactive intervention, allowing datacenter operators to mitigate them before they delay a job. Much prior work applies machine learning to predict computer system performance, but these approaches rely on complete labels – i.e., sufficient examples of all possible behaviors, including straggling and non-straggling – or on strong assumptions about the underlying latency distributions – e.g., whether or not they are Gaussian. Within a running job, however, none of this information is available until stragglers have revealed themselves, by which point they have already delayed the job. To predict stragglers accurately and early, without labeled positive examples or assumptions about latency distributions, this paper presents NURD, a novel Negative-Unlabeled learning approach with Reweighting and Distribution-compensation that trains only on negative and unlabeled streaming data. The key idea is to train a predictor on finished, non-straggling tasks to predict the latency of unlabeled running tasks, and then to reweight each unlabeled task's prediction using a weighting function over its feature space. We evaluate NURD on two production traces from Google and Alibaba and find that, compared to the best baseline approach, NURD improves the F1 score of straggler prediction by 2–11 percentage points and job completion time by 2.0–8.8 percentage points.

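The key idea above lends itself to a short sketch: fit a latency model on finished non-straggler tasks only, then reweight its predictions for unlabeled running tasks by a function of where each task sits in feature space. The code below is a minimal illustration of that pattern, assuming scikit-learn; the estimator choice, the nearest-neighbor weighting function, and the latency-cutoff rule are illustrative stand-ins, not NURD's actual implementation.

```python
# Minimal sketch of negative-unlabeled latency prediction with reweighting.
# The weighting scheme and threshold rule here are assumptions for
# illustration only, not NURD's actual formulation.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import NearestNeighbors


def fit_negative_model(finished_features, finished_latencies):
    """Train a latency regressor using only finished (non-straggler) tasks."""
    model = GradientBoostingRegressor()
    model.fit(finished_features, finished_latencies)
    return model


def feature_space_weights(running_features, finished_features, k=5):
    """Weight each running (unlabeled) task by how far it lies from the
    labeled non-straggler region of feature space; tasks far from that
    region get their predicted latency inflated."""
    nn = NearestNeighbors(n_neighbors=k).fit(finished_features)
    dists, _ = nn.kneighbors(running_features)
    mean_dist = dists.mean(axis=1)
    return 1.0 + mean_dist / (mean_dist.mean() + 1e-9)


def flag_stragglers(model, running_features, finished_features, latency_cutoff):
    """Reweight latency predictions and flag tasks expected to exceed the cutoff."""
    base = model.predict(running_features)
    weights = feature_space_weights(running_features, finished_features)
    return (base * weights) > latency_cutoff
```

The distribution-compensation component in NURD's name is not modeled here; the sketch only shows where the reweighting step sits relative to the negative-only training step.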
