EasyScale: Accuracy-consistent Elastic Training for Deep Learning

08/30/2022
by   Mingzhen Li, et al.

Distributed synchronized GPU training is commonly used for deep learning. Training on a fixed number of GPUs makes large-scale deep learning jobs suffer from resource constraints and also lowers cluster utilization. Resource elasticity can alleviate this, but it often introduces non-determinism in model accuracy, mainly because the model training procedure is not isolated from the underlying hardware resources. We introduce EasyScale, an elastic framework that scales distributed training on heterogeneous GPUs while producing deterministic deep learning models. EasyScale strictly follows the data-parallel training flow, carefully traces the accuracy-relevant factors, and exploits deep learning characteristics for efficient worker context switching, thereby achieving accuracy-consistent elastic training. To saturate the computation capability of heterogeneous GPUs, EasyScale dynamically assigns workers based on our intra-job and inter-job scheduling policies, minimizing GPU idle time and maximizing aggregated job throughput. Deployed in an online serving cluster of CompanyA, EasyScale allows elastic deep learning training jobs to utilize free GPUs opportunistically, improving overall cluster utilization by 62.1% without violating the SLA.
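
The accuracy-consistent elasticity described above hinges on keeping the logical data-parallel layout fixed while the number of physical GPUs varies, and on capturing every accuracy-relevant factor (data-shard position, per-worker RNG state) so a logical worker can be context-switched between devices. The abstract does not show the actual mechanism; the PyTorch-style sketch below only illustrates the idea, and the names LogicalWorkerContext and train_step_on_gpu are hypothetical, not EasyScale's API.

```python
import torch

class LogicalWorkerContext:
    """Accuracy-relevant state of one logical data-parallel worker (illustrative).

    If this state is captured and restored faithfully, the worker produces the
    same sequence of mini-batches and random numbers regardless of which
    physical GPU executes it.
    """
    def __init__(self, worker_id: int, num_logical_workers: int, seed: int):
        self.worker_id = worker_id
        self.num_logical_workers = num_logical_workers
        # Per-worker RNG that a real system would use for augmentation/dropout;
        # saved and restored below so results do not depend on placement.
        self.generator = torch.Generator().manual_seed(seed + worker_id)
        self.next_sample_index = worker_id  # strided, worker-private data shard

    def next_batch_indices(self, dataset_len: int, batch_size: int):
        # Indices depend only on the logical worker id and the fixed logical
        # worker count, never on the number of physical GPUs.
        indices = []
        for _ in range(batch_size):
            indices.append(self.next_sample_index % dataset_len)
            self.next_sample_index += self.num_logical_workers
        return indices

    def state_dict(self):
        # Everything needed to resume this logical worker on another GPU.
        return {"worker_id": self.worker_id,
                "rng_state": self.generator.get_state(),
                "next_sample_index": self.next_sample_index}

    def load_state_dict(self, state):
        self.worker_id = state["worker_id"]
        self.generator.set_state(state["rng_state"])
        self.next_sample_index = state["next_sample_index"]


def train_step_on_gpu(contexts, model, dataset, batch_size, device):
    """One global step: a single physical GPU time-shares several logical
    workers and accumulates their gradients, mimicking a fixed-size
    data-parallel job no matter how many GPUs are actually available."""
    model.zero_grad()
    for ctx in contexts:  # context switch per logical worker
        idx = ctx.next_batch_indices(len(dataset), batch_size)
        x = torch.stack([dataset[i][0] for i in idx]).to(device)
        y = torch.stack([dataset[i][1] for i in idx]).to(device)
        loss = torch.nn.functional.cross_entropy(model(x), y)
        (loss / len(contexts)).backward()  # average as if truly data-parallel
    # In a real multi-GPU run, gradients would be all-reduced here.
```

Because each context's batch indices and RNG depend only on its worker id and the fixed logical worker count, executing the same contexts on one GPU or on four yields identical sample orderings, which is what makes the resulting model deterministic.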

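The abstract likewise does not detail the intra-job scheduling policy. One plausible reading of "saturating heterogeneous GPUs" is to assign each device a number of logical workers proportional to its measured throughput, as in the hypothetical helper below (assign_logical_workers is an assumption for illustration, not EasyScale's API).

```python
def assign_logical_workers(num_workers: int, gpu_throughputs: list[float]) -> list[int]:
    """Hypothetical intra-job assignment: give each heterogeneous GPU a share of
    logical workers proportional to its measured throughput, so all GPUs finish
    their part of a global step at roughly the same time."""
    total = sum(gpu_throughputs)
    raw = [num_workers * t / total for t in gpu_throughputs]
    assignment = [int(r) for r in raw]
    # Hand out the remaining workers to the GPUs with the largest fractional parts.
    remainder = num_workers - sum(assignment)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - assignment[i], reverse=True)
    for i in order[:remainder]:
        assignment[i] += 1
    return assignment
```

For example, assign_logical_workers(8, [3.0, 1.0]) returns [6, 2], packing three times as many logical workers onto the GPU that runs three times faster.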

