Learning Scheduling Algorithms for Data Processing Clusters

10/03/2018
by Hongzi Mao, et al.

Efficiently scheduling data processing jobs on distributed compute clusters requires complex algorithms. Current systems, however, use simple generalized heuristics and ignore workload structure, since developing and tuning a bespoke heuristic for each workload is infeasible. In this paper, we show that modern machine learning techniques can generate highly efficient policies automatically. Decima uses reinforcement learning (RL) and neural networks to learn workload-specific scheduling algorithms without any human instruction beyond specifying a high-level objective such as minimizing average job completion time. Off-the-shelf RL techniques, however, cannot handle the complexity and scale of the scheduling problem. To build Decima, we had to develop new representations for jobs' dependency graphs, design scalable RL models, and invent new RL training methods for continuous job arrivals. Our prototype integration with Spark on a 25-node cluster shows that Decima outperforms several heuristics, including hand-tuned ones, by at least 21%. Further experiments with an industrial production workload trace demonstrate that Decima delivers up to a 17% reduction in average job completion time and scales to large clusters.
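To make the objective concrete, the following is a minimal sketch of how average job completion time depends on the order in which jobs run. It is not Decima's method or code: it assumes a single executor, jobs that all arrive at time 0, and no dependency graphs, and the function name `average_jct` and the job durations are hypothetical, chosen only for illustration.

```python
from typing import List


def average_jct(durations: List[float]) -> float:
    """Average completion time when jobs run back to back in the given order.

    Toy assumptions (not from the paper): one executor, all jobs arrive
    at time 0, and each job is a single indivisible unit of work.
    """
    clock = 0.0   # time at which the current job finishes
    total = 0.0   # running sum of completion times
    for d in durations:
        clock += d
        total += clock
    return total / len(durations)


# Hypothetical job durations, in arbitrary time units.
jobs = [8.0, 1.0, 3.0]

fifo = average_jct(jobs)          # run in arrival order
sjf = average_jct(sorted(jobs))   # run shortest job first

print(f"FIFO average JCT: {fifo:.2f}")
print(f"SJF  average JCT: {sjf:.2f}")
```

Even in this toy setting, the same jobs yield a lower average completion time under a shortest-job-first order than under arrival order, which suggests why a policy tailored to the workload can outperform a generic heuristic.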
