Elastic deep learning in multi-tenant GPU cluster

09/26/2019
by Yidi Wu, et al.

Multi-tenant GPU clusters are common nowadays due to the huge success of deep learning, and training jobs are usually conducted with multiple distributed GPUs. These GPU clusters are managed with various goals, including short job completion time (JCT), high resource utilization, and quick response to small jobs. In this paper, we show that elasticity, i.e., the ability to adjust the parallelism (number of GPUs) of a job with low overhead, helps to achieve these goals of GPU cluster management. With elasticity, we can adjust the trade-off between throughput and efficiency, adapt to variations in cluster load, utilize transient idle resources, and so on. Motivated by these benefits, we designed Amoeba, which requires minimal changes to user code and provides a simple API for the scheduler to control the parallelism of jobs. Amoeba is general in that it delegates single-machine execution to existing deep learning frameworks and uses a lightweight control layer for coordination and management. As it is crucial to reduce the overhead of parallelism adjustment, Amoeba adopts key designs including automatic job management, background scaling, and a dynamic data pipeline. Experimental results show that Amoeba introduces negligible overhead to normal training without parallelism adjustment and pays a significantly lower cost (around 95% lower) for parallelism adjustment than existing approaches. We also show that a state-of-the-art GPU cluster scheduler can leverage elasticity with simple modifications and reduce the average JCT by as much as 29% compared with the case without elasticity.

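The abstract does not spell out the scheduler-facing API, but the idea of a simple parallelism-control interface can be illustrated with a short sketch. The code below is hypothetical: the names ElasticJob, scale, Scheduler, expand_on_idle, and shrink_for_new_job are illustrative, not Amoeba's actual API. It shows how an elasticity-aware scheduler might grow running jobs onto transient idle GPUs and shrink them again to respond quickly to a newly arriving small job.

```python
# Hypothetical sketch of a scheduler-facing elasticity interface in the spirit
# of the abstract: the scheduler adjusts a job's parallelism (GPU count) and a
# framework-level control layer would perform the actual scaling in the
# background. All names here are illustrative, not Amoeba's real API.

from dataclasses import dataclass, field


@dataclass
class ElasticJob:
    """A training job whose GPU parallelism can be adjusted at runtime."""
    job_id: str
    num_gpus: int          # current parallelism
    min_gpus: int = 1      # smallest useful allocation
    max_gpus: int = 8      # beyond this, scaling gives diminishing returns

    def scale(self, target_gpus: int) -> int:
        """Request a new parallelism level, clamped to the job's valid range.

        In a real system this would trigger background scaling: workers join
        or leave while training continues and the data pipeline is
        re-partitioned dynamically. Here we only record the new value.
        """
        new_size = max(self.min_gpus, min(self.max_gpus, target_gpus))
        self.num_gpus = new_size
        return new_size


@dataclass
class Scheduler:
    """Toy elasticity-aware scheduler: hand transient idle GPUs to running jobs."""
    total_gpus: int
    jobs: list[ElasticJob] = field(default_factory=list)

    def idle_gpus(self) -> int:
        return self.total_gpus - sum(j.num_gpus for j in self.jobs)

    def expand_on_idle(self) -> None:
        """Greedily grow jobs (smallest first) while idle GPUs remain."""
        for job in sorted(self.jobs, key=lambda j: j.num_gpus):
            spare = self.idle_gpus()
            if spare <= 0:
                break
            job.scale(job.num_gpus + spare)

    def shrink_for_new_job(self, new_job: ElasticJob) -> None:
        """Reclaim GPUs from the largest job so a small job can start quickly."""
        while self.idle_gpus() < new_job.num_gpus:
            victim = max(self.jobs, key=lambda j: j.num_gpus)
            if victim.num_gpus <= victim.min_gpus:
                break  # nothing left to reclaim
            victim.scale(victim.num_gpus - 1)
        self.jobs.append(new_job)


if __name__ == "__main__":
    sched = Scheduler(total_gpus=8)
    sched.jobs.append(ElasticJob("resnet", num_gpus=4))
    sched.expand_on_idle()                       # resnet grows onto idle GPUs
    sched.shrink_for_new_job(ElasticJob("bert", num_gpus=2))
    print([(j.job_id, j.num_gpus) for j in sched.jobs])
```

Running the example prints [('resnet', 6), ('bert', 2)]: the large job gives back two GPUs so the small job starts immediately, which is the kind of load adaptation the abstract attributes to elasticity.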

Related research

02/16/2022 | Aryl: An Elastic Cluster Scheduler for Deep Learning
Companies build separate training and inference GPU clusters for deep le...

12/19/2021 | Efficient Strong Scaling Through Burst Parallel Training
As emerging deep neural network (DNN) models continue to grow in size, u...

10/01/2020 | PipeTune: Pipeline Parallelism of Hyper and System Parameters Tuning for Deep Learning Clusters
DNN learning jobs are common in today's clusters due to the advances in ...

02/28/2019 | Speeding up Deep Learning with Transient Servers
Distributed training frameworks, like TensorFlow, have been proposed as ...

07/23/2022 | MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant Systems for Machine Learning
GPU technology has been improving at an expedited pace in terms of size ...

11/07/2021 | Varuna: Scalable, Low-cost Training of Massive Deep Learning Models
Systems for training massive deep learning models (billions of parameter...

12/03/2018 | Hoard: A Distributed Data Caching System to Accelerate Deep Learning Training on the Cloud
Deep Learning system architects strive to design a balanced system where...
