Elastic deep learning in multi-tenant GPU cluster

by Yidi Wu, et al.

Multi-tenant GPU clusters are now common thanks to the success of deep learning, and training jobs usually run on multiple distributed GPUs. These clusters are managed with various goals, including short job completion time (JCT), high resource utilization, and quick response to small jobs. In this paper, we show that elasticity, the ability to adjust the parallelism (number of GPUs) of a job with low overhead, helps achieve these goals. With elasticity, we can adjust the trade-off between throughput and efficiency, adapt to variations in cluster load, and utilize transient idle resources, among other uses. Motivated by these benefits, we designed Amoeba, which requires minimal changes to user code and provides a simple API for the scheduler to control the parallelism of jobs. Amoeba is general in that it delegates single-machine execution to existing deep learning frameworks and uses a lightweight control layer for coordination and management. Because it is crucial to reduce the overhead of parallelism adjustment, Amoeba adopts key designs including automatic job management, background scaling, and a dynamic data pipeline. Experimental results show that Amoeba introduces negligible overhead to normal training without parallelism adjustment and pays significantly lower cost (around 95% [...]). We also show that a state-of-the-art GPU cluster scheduler can leverage elasticity with simple modifications and reduce the average JCT by as much as 29% compared with the case without elasticity.
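
To make the idea of a scheduler-facing parallelism API concrete, the following is a minimal sketch and not Amoeba's actual interface. The class `ElasticJob`, its methods `scale_out` and `scale_in`, and the policy function `rebalance` are hypothetical names chosen for illustration; the abstract only states that such an API exists, not what it looks like.

```python
from dataclasses import dataclass, field


@dataclass
class ElasticJob:
    """Hypothetical handle a scheduler might hold for one training job."""
    name: str
    gpus: set = field(default_factory=set)

    def scale_out(self, new_gpus):
        # Add workers. A real system would prepare them in the background
        # while the existing workers keep training (assumption).
        self.gpus |= set(new_gpus)

    def scale_in(self, old_gpus):
        # Remove workers, but never drop the job below one GPU.
        remaining = self.gpus - set(old_gpus)
        if not remaining:
            raise ValueError("a job must keep at least one GPU")
        self.gpus = remaining

    @property
    def parallelism(self):
        return len(self.gpus)


def rebalance(job, idle_gpus, reclaim):
    """Toy policy: give up reclaimed GPUs, then absorb transient idle ones."""
    if reclaim:
        job.scale_in(reclaim)
    if idle_gpus:
        job.scale_out(idle_gpus)
    return job.parallelism
```

In this sketch the scheduler only names the GPUs to add or remove; everything needed to keep training running during the change would sit behind the two methods.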



Aryl: An Elastic Cluster Scheduler for Deep Learning

Companies build separate training and inference GPU clusters for deep le...

Efficient Strong Scaling Through Burst Parallel Training

As emerging deep neural network (DNN) models continue to grow in size, u...

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Modern GPU datacenters are critical for delivering Deep Learning (DL) mo...

Speeding up Deep Learning with Transient Servers

Distributed training frameworks, like TensorFlow, have been proposed as ...

Varuna: Scalable, Low-cost Training of Massive Deep Learning Models

Systems for training massive deep learning models (billions of parameter...

PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep Learning Clusters

DNN learning jobs are common in today's clusters due to the advances in ...

Using Multi-Instance GPU for Efficient Operation of Multi-Tenant GPU Clusters

GPU technology has been improving at an expedited pace in terms of size ...