Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning

06/25/2023
by Qiyang Ding, et al.

Accommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall-clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower research productivity and the QoS of services deployed in production. To mitigate these interruptions, we investigate a set of statistical learning and reinforcement learning (RL) techniques, including random forest, XGBoost, Deep Q-Network, and policy gradient, to design a proactive provisioner using production job traces from three GPU clusters. We follow standard machine learning practice by partitioning each job trace into training and validation subsets, then training each model on the training subset and evaluating its generality on the validation subset. We introduce Mirage, a Slurm-compatible resource provisioner that integrates the candidate RL methods. Our experiments show that Mirage can reduce interruption by 17-100% across varying load levels on the three clusters.
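The trace partitioning described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from Mirage: the `JobRecord` fields, the `split_trace` and `interruption` helpers, the 80/20 ratio, and the choice of a chronological split (the abstract says only that each trace is partitioned into training and validation subsets). Interruption is modeled here as the idle gap between one job in a sequence ending and its successor starting, which is one plausible reading of the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class JobRecord:
    """One row of a hypothetical job trace (field names are assumptions)."""
    job_id: int
    submit_time: float  # seconds since the start of the trace
    start_time: float
    end_time: float


def split_trace(
    trace: List[JobRecord], train_fraction: float = 0.8
) -> Tuple[List[JobRecord], List[JobRecord]]:
    """Split a trace chronologically into training and validation subsets.

    Ordering by submit time keeps every validation job later than every
    training job, so the model is never evaluated on jobs that precede
    the ones it was trained on.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    ordered = sorted(trace, key=lambda job: job.submit_time)
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def interruption(predecessor: JobRecord, successor: JobRecord) -> float:
    """Idle gap in seconds between a job ending and its successor starting."""
    return max(0.0, successor.start_time - predecessor.end_time)


# A tiny synthetic trace: ten one-hour jobs submitted 30 minutes apart,
# each waiting 10 minutes in the queue before it starts.
trace = [
    JobRecord(job_id=i, submit_time=1800.0 * i,
              start_time=1800.0 * i + 600.0,
              end_time=1800.0 * i + 600.0 + 3600.0)
    for i in range(10)
]
train, validation = split_trace(trace)
```

A random split would also match the abstract's wording; the chronological variant is used here only because scheduler traces are time-ordered.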

