DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling

by Yuping Fan et al.

For decades, system administrators have been striving to design and tune cluster scheduling policies to improve the performance of high-performance computing (HPC) systems. However, increasingly complex HPC systems, combined with highly diverse workloads, make this manual process challenging, time-consuming, and error-prone. We present a reinforcement learning based HPC scheduling framework named DRAS-CQSim that automatically learns optimal scheduling policies. DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, allowing system administrators to quickly obtain customized scheduling policies.
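To make the simulation-environment-plus-agent structure concrete, here is a minimal, self-contained sketch of an RL-based scheduling loop in the same spirit: a toy cluster environment where the action is which queued job to start, trained with tabular Q-learning. All class and function names (`ToyClusterEnv`, `train_q_learning`) are illustrative assumptions for this sketch, not DRAS-CQSim's actual API.

```python
import random

class ToyClusterEnv:
    """Toy HPC scheduling environment (hypothetical, not DRAS-CQSim's API).

    State: number of free nodes. Action: index of a queued job to start.
    Reward: +1 for starting a feasible job, -1 for picking one that cannot fit.
    """
    def __init__(self, total_nodes=4, seed=0):
        self.total_nodes = total_nodes
        self.rng = random.Random(seed)

    def reset(self):
        self.free = self.total_nodes
        # Each job requests a random number of nodes.
        self.queue = [self.rng.randint(1, self.total_nodes) for _ in range(5)]
        self.time = 0
        return self.free  # state = current free-node count

    def step(self, action):
        job = self.queue[action]
        if job <= self.free:
            self.free -= job
            self.queue.pop(action)
            reward = 1.0              # job started: nodes kept busy
        else:
            reward = -1.0             # infeasible choice: job must wait
            self.free = self.total_nodes  # running jobs drain (toy dynamics)
        self.time += 1
        done = not self.queue or self.time >= 20
        return self.free, reward, done

def train_q_learning(env, episodes=200, alpha=0.5, gamma=0.9, eps=0.1):
    """Tabular Q-learning over (free-nodes, queue-index) state-action pairs."""
    q = {}
    rng = random.Random(1)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            n_actions = len(env.queue)
            if rng.random() < eps:                 # epsilon-greedy exploration
                action = rng.randrange(n_actions)
            else:
                action = max(range(n_actions),
                             key=lambda a: q.get((state, a), 0.0))
            next_state, reward, done = env.step(action)
            best_next = max((q.get((next_state, a), 0.0)
                             for a in range(max(len(env.queue), 1))),
                            default=0.0)
            q[(state, action)] = (1 - alpha) * q.get((state, action), 0.0) \
                + alpha * (reward + gamma * best_next)
            state = next_state
    return q
```

In the real framework the environment would replay workload traces through the CQSim simulator and the agent would be a deep network rather than a Q-table, but the control flow (reset, act, step, learn) is the same shape.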


Deep Reinforcement Agent for Scheduling in HPC

Cluster scheduler is crucial in high-performance computing (HPC). It det...

RLScheduler: Learn to Schedule HPC Batch Jobs Using Deep Reinforcement Learning

We present RLScheduler, a deep reinforcement learning based job schedule...

Running Alchemist on Cray XC and CS Series Supercomputers: Dask and PySpark Interfaces, Deployment Options, and Data Transfer Times

Newly developed interfaces for Python, Dask, and PySpark enable the use ...

BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics

Hardware performance counters (HPCs) that measure low-level architectura...

Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling

We consider the problem of scheduling in constrained queueing networks w...

RLWS: A Reinforcement Learning based GPU Warp Scheduler

The Streaming Multiprocessors (SMs) of a Graphics Processing Unit (GPU) ...

Enabling EASEY deployment of containerized applications for future HPC systems

The upcoming exascale era will push the changes in computing architectur...