DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling

05/16/2021
by Yuping Fan, et al.

For decades, system administrators have been striving to design and tune cluster scheduling policies to improve the performance of high-performance computing (HPC) systems. However, increasingly complex HPC systems combined with highly diverse workloads make this manual process challenging, time-consuming, and error-prone. We present DRAS-CQSim, a reinforcement-learning-based HPC scheduling framework that automatically learns an optimal scheduling policy. DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, which allows system administrators to quickly obtain customized scheduling policies.
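
To make the separation between a simulation environment and an agent concrete, the following is a minimal sketch of an environment-agent interaction loop over a toy cluster simulator. It is not the DRAS-CQSim API: the names `Job`, `ToyClusterEnv`, `average_wait`, `earliest_submitted`, and `shortest_runtime` are hypothetical, the reward (negative wait time of the started job) is an assumption, and two fixed heuristics stand in for the policy that a reinforcement learning algorithm would learn.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Job:
    job_id: int
    submit: int   # submission time
    nodes: int    # number of nodes requested
    runtime: int  # run time once started


class ToyClusterEnv:
    """Hypothetical toy simulator; this is NOT the DRAS-CQSim API."""

    def __init__(self, jobs: List[Job], total_nodes: int):
        if any(j.nodes > total_nodes for j in jobs):
            raise ValueError("a job requests more nodes than the cluster has")
        self.jobs = sorted(jobs, key=lambda j: j.submit)
        self.total_nodes = total_nodes

    def reset(self) -> List[Job]:
        self.time = 0
        self.free = self.total_nodes
        self.pending = list(self.jobs)            # not yet submitted
        self.queue: List[Job] = []                # submitted, waiting
        self.running: List[Tuple[int, int]] = []  # (end_time, nodes)
        self._advance()
        return self.runnable()

    def runnable(self) -> List[Job]:
        # Waiting jobs that fit into the currently free nodes.
        return [j for j in self.queue if j.nodes <= self.free]

    def _advance(self) -> None:
        # Move simulated time forward until a waiting job fits
        # or no further events (job end, job arrival) remain.
        while True:
            finished = [r for r in self.running if r[0] <= self.time]
            self.running = [r for r in self.running if r[0] > self.time]
            self.free += sum(nodes for _, nodes in finished)
            while self.pending and self.pending[0].submit <= self.time:
                self.queue.append(self.pending.pop(0))
            if self.runnable():
                return
            events = [end for end, _ in self.running]
            if self.pending:
                events.append(self.pending[0].submit)
            if not events:
                return
            self.time = min(events)

    def step(self, job: Job) -> Tuple[List[Job], float, bool]:
        # Start the chosen job, then advance to the next decision point.
        if job not in self.runnable():
            raise ValueError("job is not runnable right now")
        self.queue.remove(job)
        self.free -= job.nodes
        self.running.append((self.time + job.runtime, job.nodes))
        reward = -float(self.time - job.submit)  # assumed reward: negative wait
        self._advance()
        done = not self.queue and not self.pending
        return self.runnable(), reward, done


Policy = Callable[[List[Job]], Job]


def earliest_submitted(candidates: List[Job]) -> Job:
    return min(candidates, key=lambda j: j.submit)


def shortest_runtime(candidates: List[Job]) -> Job:
    return min(candidates, key=lambda j: j.runtime)


def average_wait(env: ToyClusterEnv, policy: Policy) -> float:
    # Run one episode and report the mean wait time over all started jobs.
    candidates = env.reset()
    waits: List[float] = []
    done = not candidates
    while not done:
        candidates, reward, done = env.step(policy(candidates))
        waits.append(-reward)
    return sum(waits) / len(waits) if waits else 0.0


demo_jobs = [Job(0, 0, 2, 10), Job(1, 1, 2, 5), Job(2, 2, 2, 1)]
demo_env = ToyClusterEnv(demo_jobs, total_nodes=2)
print(average_wait(demo_env, earliest_submitted),
      average_wait(demo_env, shortest_runtime))
```

Because a policy here is just a function from the runnable jobs to one chosen job, a learned agent could be swapped in for either heuristic without changing the simulator, which is the kind of decoupling the abstract attributes to the framework.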

02/11/2021

Deep Reinforcement Agent for Scheduling in HPC

Cluster scheduler is crucial in high-performance computing (HPC). It det...
10/20/2019

RLScheduler: Learn to Schedule HPC Batch Jobs Using Deep Reinforcement Learning

We present RLScheduler, a deep reinforcement learning based job schedule...
10/03/2019

Running Alchemist on Cray XC and CS Series Supercomputers: Dask and PySpark Interfaces, Deployment Options, and Data Transfer Times

Newly developed interfaces for Python, Dask, and PySpark enable the use ...
02/22/2021

BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics

Hardware performance counters (HPCs) that measure low-level architectura...
05/01/2021

Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling

We consider the problem of scheduling in constrained queueing networks w...
11/17/2017

RLWS: A Reinforcement Learning based GPU Warp Scheduler

The Streaming Multiprocessors (SMs) of a Graphics Processing Unit (GPU) ...
04/28/2020

Enabling EASEY deployment of containerized applications for future HPC systems

The upcoming exascale era will push the changes in computing architectur...