RLScheduler: Learn to Schedule HPC Batch Jobs Using Deep Reinforcement Learning

10/20/2019
by Di Zhang, et al.

We present RLScheduler, a deep reinforcement learning based job scheduler for independent batch jobs in high-performance computing (HPC) environments. Starting with no knowledge of scheduling, RLScheduler autonomously learns how to schedule HPC batch jobs effectively toward a given optimization goal. It does so through deep reinforcement learning, aided by specially designed neural network structures and various optimizations that stabilize and accelerate learning. Our results show that RLScheduler can outperform existing heuristic scheduling algorithms, including a manually fine-tuned machine learning-based scheduler, on the same workload. More importantly, we show that RLScheduler does not blindly overfit the given workload to achieve this; instead, it learns general rules for scheduling batch jobs that can be applied to different workloads and systems with similarly optimized performance. We also demonstrate that RLScheduler can adjust itself as goals and workloads change, making it an attractive solution for future autonomous HPC management.
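To make the setting concrete, the following is a minimal sketch of the batch scheduling problem that such a scheduler operates on. It is not RLScheduler's implementation: the `Job` fields, the `simulate` function, the average-wait-time metric, and the absence of backfilling are all assumptions made for illustration. The `pick` callback is the decision point; here it is filled by two classic heuristics (first-come-first-served and shortest-job-first), and a learned policy would take its place.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    # Hypothetical job record; the field names are assumptions for this sketch.
    submit: float   # submission time
    runtime: float  # execution time once started
    procs: int      # processors requested


def simulate(jobs, total_procs, pick):
    """Run jobs on a toy cluster and return the average wait time.

    `pick(queue)` returns the index of the waiting job to start next.
    No backfilling: if the chosen job does not fit, the cluster waits
    for the next running job to finish.
    """
    if any(j.procs > total_procs for j in jobs):
        raise ValueError("a job requests more processors than the cluster has")
    pending = sorted(jobs, key=lambda j: j.submit)
    queue, running = [], []  # running is a min-heap of (finish_time, procs)
    free, now, i = total_procs, 0.0, 0
    waits = []
    while i < len(pending) or queue:
        # Release processors held by jobs that have finished by `now`.
        while running and running[0][0] <= now:
            free += heapq.heappop(running)[1]
        # Move newly arrived jobs into the waiting queue.
        while i < len(pending) and pending[i].submit <= now:
            queue.append(pending[i])
            i += 1
        if not queue:
            now = pending[i].submit  # idle until the next arrival
            continue
        idx = pick(queue)
        job = queue[idx]
        if job.procs <= free:
            queue.pop(idx)
            free -= job.procs
            heapq.heappush(running, (now + job.runtime, job.procs))
            waits.append(now - job.submit)
        else:
            now = running[0][0]  # advance to the next completion
    return sum(waits) / len(waits) if waits else 0.0


def fcfs(queue):
    # First-come-first-served: earliest submission first.
    return min(range(len(queue)), key=lambda k: queue[k].submit)


def sjf(queue):
    # Shortest-job-first: smallest runtime first.
    return min(range(len(queue)), key=lambda k: queue[k].runtime)


jobs = [Job(0, 10, 4), Job(1, 8, 4), Job(2, 1, 4)]
print(simulate(jobs, 4, fcfs))  # average wait 25/3, about 8.33
print(simulate(jobs, 4, sjf))   # average wait 6.0
```

On this toy workload the two heuristics give different average wait times, which illustrates why the choice of selection rule matters for a given optimization goal. A reinforcement learning approach would replace the fixed rule in `pick` with a policy trained against such a metric.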

