Aryl: An Elastic Cluster Scheduler for Deep Learning

02/16/2022
by Jiamin Li, et al.

Companies build separate training and inference GPU clusters for deep learning and use separate schedulers to manage them. This causes problems on both sides: inference clusters have low GPU utilization when traffic is low, while training jobs often experience long queueing times due to a lack of resources. We introduce Aryl, a new cluster scheduler that addresses these problems. Aryl introduces capacity loaning, which lends idle inference GPU servers to training jobs. It further exploits elastic scaling, which grows or shrinks a training job's GPU allocation to better utilize the loaned resources. Capacity loaning and elastic scaling create new challenges for cluster management: when loaned servers must be returned, we need to minimize the number of job preemptions; when more GPUs become available, we need to allocate them to elastic jobs so as to minimize job completion time (JCT). Aryl addresses these combinatorial problems with principled heuristics. It introduces the notion of server preemption cost, which it greedily reduces during server reclaiming. It further relies on a JCT reduction value, defined for each additional worker of an elastic job, to solve the scheduling problem as a multiple-choice knapsack problem. A prototype implementation on a 64-GPU testbed and large-scale simulation with 15-day traces of over 50,000 production jobs show that Aryl brings 1.53x and 1.50x reductions in average queueing time and JCT, and improves cluster usage by up to 26.9% over a scheduler without capacity loaning or elastic scaling.
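To make the reclaiming heuristic concrete, below is a minimal Python sketch of greedy reclaiming by preemption cost. Everything here (the `Server` and `Job` classes, the `preemption_cost` field, `reclaim_servers`) is an illustrative assumption based only on the abstract, not Aryl's actual interface: the idea is simply to return first the loaned servers that are cheapest to vacate.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative data model; names and fields are assumptions, not Aryl's API.
@dataclass
class Job:
    name: str
    preemption_cost: float  # e.g., GPU-hours of work lost if preempted

@dataclass
class Server:
    name: str
    jobs: List[Job] = field(default_factory=list)

    @property
    def preemption_cost(self) -> float:
        # Reclaiming a server preempts every training job running on it.
        return sum(j.preemption_cost for j in self.jobs)

def reclaim_servers(loaned: List[Server], num_needed: int) -> List[Server]:
    """Greedily return the loaned servers that are cheapest to vacate,
    so inference gets its capacity back with minimal preempted work."""
    return sorted(loaned, key=lambda s: s.preemption_cost)[:num_needed]
```

The elastic-scaling decision can likewise be read as a multiple-choice knapsack: each elastic job forms a group of mutually exclusive options (1, 2, ... additional workers), where an option costs that many GPUs and is worth its total JCT reduction, and at most one option per job may be chosen within the idle-GPU budget. The dynamic program below is a textbook group-knapsack sketch under that reading, not the paper's exact formulation:

```python
def allocate_idle_gpus(options_per_job, budget):
    """Multiple-choice knapsack over elastic jobs.

    options_per_job[i][k] = total JCT reduction if job i gets k+1 extra
    workers (costing k+1 GPUs); at most one option per job is taken.
    Returns the maximum total JCT reduction achievable within `budget` GPUs.
    """
    NEG = float("-inf")
    dp = [0.0] + [NEG] * budget        # dp[g]: best value using exactly g GPUs
    for options in options_per_job:
        new_dp = dp[:]                 # default: give this job no extra GPUs
        for k, value in enumerate(options):
            cost = k + 1
            for g in range(cost, budget + 1):
                if dp[g - cost] != NEG:
                    new_dp[g] = max(new_dp[g], dp[g - cost] + value)
        dp = new_dp
    return max(v for v in dp if v != NEG)

# Example: 4 idle GPUs, two elastic jobs with diminishing returns per worker.
# Job A: 1 worker -> 10, 2 -> 14, 3 -> 16; Job B: 1 -> 9, 2 -> 12.
# The best split is 2 GPUs each: 14 + 12 = 26.
assert allocate_idle_gpus([[10, 14, 16], [9, 12]], 4) == 26.0
```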


Related research:

08/30/2022 · EasyScale: Accuracy-consistent Elastic Training for Deep Learning
Distributed synchronized GPU training is commonly used for deep learning...

09/26/2019 · Elastic deep learning in multi-tenant GPU cluster
Multi-tenant GPU clusters are common nowadays due to the huge success of...

05/10/2023 · Fast Distributed Inference Serving for Large Language Models
Large language models (LLMs) power a new generation of interactive AI ap...

06/24/2020 · Effective Elastic Scaling of Deep Learning Workloads
The increased use of deep learning (DL) in academia, government and indu...

05/19/2020 · Optimal Resource Allocation for Elastic and Inelastic Jobs
Modern data centers are tasked with processing heterogeneous workloads c...

06/25/2023 · Mirage: Towards Low-interruption Services on Batch GPU Clusters with Reinforcement Learning
Accommodating long-running deep learning (DL) training and inference job...

12/19/2021 · Efficient Strong Scaling Through Burst Parallel Training
As emerging deep neural network (DNN) models continue to grow in size, u...
