Taming Unbalanced Training Workloads in Deep Learning with Partial Collective Operations

08/12/2019
by Shigang Li, et al.

Load imbalance is pervasive in distributed deep learning training systems, caused either by inherent imbalance in the learned tasks or by the system itself. Traditional synchronous Stochastic Gradient Descent (SGD) achieves good accuracy for a wide variety of tasks, but relies on global synchronization to accumulate the gradients at every training step. In this paper, we propose eager-SGD, which relaxes global synchronization in favor of decentralized, partial gradient accumulation. To implement eager-SGD, we propose two partial collective operations: solo and majority allreduce. With solo allreduce, the faster processes contribute their gradients eagerly without waiting for the slower processes, whereas with majority allreduce, at least half of the participants must contribute their gradients before continuing; neither requires a central parameter server. We theoretically prove the convergence of the algorithms and describe the partial collectives in detail. Experimental results in load-imbalanced environments (CIFAR-10, ImageNet, and UCF101 datasets) show that eager-SGD achieves a 1.27x speedup over state-of-the-art synchronous SGD without losing accuracy.
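To make the semantics of the two partial collectives concrete, below is a minimal, self-contained Python sketch that simulates their gradient-accumulation behavior on a single machine. It is not the paper's implementation (which builds the collectives on asynchronous communication inside an MPI-style runtime); the function name partial_allreduce, the quorum parameter, and the per-rank stale-gradient buffers are illustrative assumptions, used only to show how contributions from slow processes are deferred rather than waited for.

import numpy as np

def partial_allreduce(grads, arrived, stale, quorum):
    """Average gradients from the ranks that reached the collective in time.

    grads   : list of per-rank gradient arrays for this step
    arrived : boolean mask of ranks that arrived at the collective in time
    stale   : per-rank buffers holding gradients deferred from earlier steps
    quorum  : minimum number of contributors (1 = solo, P//2 + 1 = majority)
    """
    P = len(grads)
    if arrived.sum() < quorum:
        raise RuntimeError("collective would block until the quorum is reached")
    total = np.zeros_like(grads[0])
    for p in range(P):
        if arrived[p]:
            # Fast rank: contribute the fresh gradient plus anything deferred
            # from steps where this rank was too slow.
            total += grads[p] + stale[p]
            stale[p] = np.zeros_like(stale[p])
        else:
            # Slow rank: buffer the gradient so it is folded into a later
            # step instead of delaying everyone now.
            stale[p] += grads[p]
    return total / P  # every rank applies the same averaged update

# Example: 4 ranks, rank 3 is a straggler this step.
P, dim = 4, 8
rng = np.random.default_rng(0)
stale = [np.zeros(dim) for _ in range(P)]
grads = [rng.standard_normal(dim) for _ in range(P)]
arrived = np.array([True, True, True, False])

# Majority allreduce proceeds because 3 of 4 ranks arrived (quorum = P//2 + 1 = 3).
update = partial_allreduce(grads, arrived, stale, quorum=P // 2 + 1)
print(update)

In this toy model, solo allreduce is simply the quorum=1 case: the collective is triggered as soon as the first rank arrives, and every other rank's gradient becomes a deferred contribution. Slow ranks' gradients are not discarded; they are folded into a later step and therefore arrive with some staleness.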

Related research

08/29/2023 · ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm with Adaptive Batch Size for Heterogeneous GPU Clusters
12/10/2020 · A Mechanism for Distributed Deep Learning Communication Optimization
05/14/2020 · OD-SGD: One-step Delay Stochastic Gradient Descent for Distributed Training
07/26/2020 · CSER: Communication-efficient SGD with Error Reset
10/27/2019 · PopSGD: Decentralized Stochastic Gradient Descent in the Population Model
11/06/2019 · DC-S3GD: Delay-Compensated Stale-Synchronous SGD for Large-Scale Decentralized Neural Network Training
04/30/2020 · Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
