DRAS-CQSim: A Reinforcement Learning based Framework for HPC Cluster Scheduling

05/16/2021
by Yuping Fan, et al.

For decades, system administrators have worked to design and tune cluster scheduling policies to improve the performance of high performance computing (HPC) systems. However, increasingly complex HPC systems combined with highly diverse workloads make this manual process challenging, time-consuming, and error-prone. We present DRAS-CQSim, a reinforcement learning based HPC scheduling framework that automatically learns optimal scheduling policies. DRAS-CQSim encapsulates simulation environments, agents, hyperparameter tuning options, and different reinforcement learning algorithms, which allows system administrators to quickly obtain customized scheduling policies.

research
02/11/2021

Deep Reinforcement Agent for Scheduling in HPC

Cluster scheduler is crucial in high-performance computing (HPC). It det...
research
10/20/2019

RLScheduler: Learn to Schedule HPC Batch Jobs Using Deep Reinforcement Learning

We present RLScheduler, a deep reinforcement learning based job schedule...
research
11/21/2022

Fine-Grained Scheduling for Containerized HPC Workloads in Kubernetes Clusters

Containerization technology offers lightweight OS-level virtualization, ...
research
02/22/2021

BayesPerf: Minimizing Performance Monitoring Errors Using Bayesian Statistics

Hardware performance counters (HPCs) that measure low-level architectura...
research
10/03/2019

Running Alchemist on Cray XC and CS Series Supercomputers: Dask and PySpark Interfaces, Deployment Options, and Data Transfer Times

Newly developed interfaces for Python, Dask, and PySpark enable the use ...
research
05/01/2021

Better than the Best: Gradient-based Improper Reinforcement Learning for Network Scheduling

We consider the problem of scheduling in constrained queueing networks w...
research
01/17/2021

Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System

Kubernetes (k8s) has the potential to merge the distributed edge and the...
