I Introduction
The introduction of deep learning is one of the most important advancements in science over the past two decades, powering industries from autonomous driving [15] to drug discovery [88]. With the rise of deep neural networks, their training evolved into a computationally intensive task that consumes as many resources as modern complex high-performance computing problems [4]. As a result, an abundance of research has been conducted into its scaling and distribution [13]. The leading contenders for largest workloads in deep learning are Neural Language Models [53, 78], Deep Reinforcement Learning (RL) [100, 69], and Neural Architecture Search [67, 66]. In these regimes, computation time is measured in thousands of "GPU days", with some workloads utilizing hundreds of accelerators (GPUs, TPUs) for several weeks [99, 83, 79].
Distributed training is largely supported by data parallelism, where sample evaluation is partitioned across processors. In this mode of parallelism, all participants must exchange their gradients or model, resulting in an Allreduce operation across a cluster [72]. In practice, the exchange communication dominates the overall runtime [83], especially in large-minibatch SGD. To exacerbate the problem, certain datasets and environments are inherently imbalanced, e.g., with different sentence/video lengths [60] or heterogeneous environments in RL [103].
In order to mitigate the wait time for gradient/weight exchange, existing approaches attempt to relax model consistency between processors [13, 94]. Examples include synchronous gossip-based SGD [61, 5], asynchronous SGD [80, 73, 62, 22], and asynchronous SGD with bounded staleness [39, 107, 20, 92]. Gossip-based SGD replaces the global allreduce by communicating with randomly selected neighbors. Asynchronous SGD breaks the global synchronization to mitigate the effect of stragglers (slow processes). However, most of these approaches adversely impact convergence, necessitating an increase in the number of iterations [5, 74], sometimes to the point where synchronous waits are preferable.
In this paper, we solve this problem by introducing Wait-Avoiding Group Model Averaging (WAGMA) SGD, a novel optimizer that combines group collective communication with bounded staleness, in order to ensure competitive performance with decentralized and asynchronous methods, while retaining the convergence rate of synchronous model-averaging SGD. WAGMA-SGD locally communicates model updates across subgroups of processors, mitigating the need for global communication at every training iteration. Specifically, we propose to use a group allreduce operation for model averaging, in which the fastest process triggers exchanges within all subgroups. Grouping is performed dynamically to facilitate model update propagation, and as a result not only speeds up communication, but also mitigates the effect of unbalanced workloads, all without harming convergence in practice.
We theoretically prove the convergence of WAGMA-SGD, showing that, for certain parameter values, its convergence rate is comparable to synchronous SGD with model averaging. Subsequently, we test the algorithm on a supercomputer equipped with GPUs for three different categories of deep learning: supervised image classification on the ImageNet dataset; semi-supervised language modeling on the WMT17 translation dataset; and deep reinforcement learning on the Habitat indoor navigation dataset. We show that both theoretically and empirically, WAGMA-SGD is favorable over other asynchronous algorithms and the baselines, which makes it an excellent approach for scaling up distributed deep learning.
Our main contributions are:

We propose WAGMA-SGD and realize it based on a wait-avoiding group allreduce operation.

We theoretically analyze the convergence of WAGMA-SGD.

Compared with state-of-the-art decentralized SGD, WAGMA-SGD improves the training throughput by up to 2.1x on 1,024 GPUs, and achieves the fastest time-to-convergence for all three evaluated tasks.
II Background and Related Work
Deep neural networks are primarily trained with minibatch stochastic gradient descent (SGD) [16]. Let B be the batch size, w_t the neural network weights at step t, S_t a sampled set of size B, and ℓ a loss function. We compute the loss for each sample s ∈ S_t as ℓ(w_t, s), and then a stochastic gradient as g_t = (1/B) · Σ_{s ∈ S_t} ∇ℓ(w_t, s). SGD then iterates steps of the form w_{t+1} = w_t − η·g_t, where η is the learning rate. In more general terms, first-order stochastic gradient update rules can take different forms (e.g., by adding a momentum term), which we represent as w_{t+1} = w_t + U(g_t, w_0, …, w_t, t). In distributed environments with P processors, B/P denotes the local batch size per processor. We refer to Ben-Nun & Hoefler [13] for a general overview of distributed deep learning.
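To make the update rule above concrete, here is a minimal mini-batch SGD step in NumPy (an illustrative sketch, not the paper's implementation; `grad_loss` is a hypothetical user-supplied per-sample gradient function):

```python
import numpy as np

def sgd_step(w, samples, grad_loss, lr=0.1):
    """One mini-batch SGD step: average the per-sample gradients over the
    batch, then apply w <- w - lr * g. grad_loss(w, s) returns the gradient
    of the loss for a single sample s."""
    g = np.mean([grad_loss(w, s) for s in samples], axis=0)
    return w - lr * g

# Toy example: scalar least squares, loss(w, (x, y)) = 0.5 * (w*x - y)**2,
# so grad_loss(w, (x, y)) = (w*x - y) * x.
grad = lambda w, s: (w * s[0] - s[1]) * s[0]
w = 0.0
for _ in range(100):
    w = sgd_step(w, [(1.0, 2.0), (2.0, 4.0)], grad, lr=0.1)
# w now approximates 2.0, the minimizer of the summed losses
```

The same loop structure underlies all the distributed variants discussed below; only where and how the averaging happens changes.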
Thanks to the robustness of stochastic optimization, in distributed environments one can relax the weight updates along several axes, trading off communication overhead for convergence. Data-parallel distributed SGD algorithms can be broadly characterized by answering the following five questions:
Q1. What are we averaging?
There are two typical approaches for aggregating distributed updates: gradient averaging and model averaging. When performing gradient averaging, we simply compute the stochastic gradient g_t as the average over the global batch. With standard model averaging, the SGD update is applied locally at each node, and the resulting models are then averaged over all processors.
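To make the distinction concrete, the two schemes can be sketched as follows (schematic Python over scalar "models", simulating all workers in one process; not the paper's implementation):

```python
def gradient_averaging_step(ws, grads, lr):
    """ws: per-worker models; grads: per-worker stochastic gradients.
    Average the gradients across workers (the allreduce), then update."""
    g_avg = sum(grads) / len(grads)
    return [w - lr * g_avg for w in ws]

def model_averaging_step(ws, grads, lr):
    """Apply the SGD update locally first, then average the resulting models."""
    local = [w - lr * g for w, g in zip(ws, grads)]
    w_avg = sum(local) / len(local)
    return [w_avg] * len(local)
```

Starting from identical weights, a single plain-SGD step yields the same result under both schemes (up to floating-point rounding); they diverge once workers take multiple local steps between averages or use stateful update rules such as momentum.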
Complementary to these approaches is the degree of quantization or sparsity in the exchanged updates. As these concepts are out of the scope of this paper, we refer to Tang et al. [94] for a comprehensive survey.
Q2. Who is coordinating the averaging?
Earlier implementations of distributed SGD for deep learning [24] use a centralized coordination architecture, where a parameter server or other coordinator maintains a master copy of the model that the workers use. As this approach does not scale to large numbers of processors, an alternative is a decentralized architecture with a global clock synchronized across workers, where each worker maintains a local replica of the model and communicates updates to the other workers directly.
To mitigate the overheads of global communication and synchronization, several decentralized instances of SGD have been proposed, e.g., [61, 62, 5, 74], where each worker maintains a local model but communicates updates in separate schedules, rather than synchronizing globally.
Q3. How old (stale) can averaged components be?
In a synchronous system, model or gradient averaging occurs when all processes are on the same training iteration t. This does not guarantee that every worker uses the same parameters (i.e., a consistent model); however, standard parameter-server or globally-coordinated methods do ensure all workers have a consistent model. In an asynchronous system, averaging can occur between workers at any point. We thus define the staleness τ of models/gradients as the number of iterations that have passed since the produced value's model was updated. A bounded-staleness system mitigates the convergence issues of asynchronous systems by ensuring that the difference in the number of training iterations between the slowest and fastest processor is bounded, using τ as a proxy.
Q4. How often are we globally averaging?
While bounded- and unbounded-staleness SGD variants do not adhere to rigid communication schedules, some algorithms may periodically synchronize all processors' model replicas. This ensures not only that the staleness is bounded by the synchronization period, but also that the consistency of the model is retained throughout training, mitigating its divergence across processors. In other algorithms, this global consensus is achieved post-training, by choosing the model average or the model with the best generalization scores. Note that under this nomenclature, the synchronous variants' global averaging frequency is one step.
Q5. How many learners are averaging at every step?
In the steps between the aforementioned global model averaging periods, decentralized SGD variants perform local averages within groups (or quorums) of a certain size q, leveraging the fact that several averaging steps can be performed in parallel.
Coordination | Staleness | Gradient Averaging | Model Averaging
---|---|---|---
Centralized | None | Parameter server [59], P3 [49] | —
Centralized | Unbounded | Hogwild! [80], Downpour SGD [24], AASGD [35] | SAPSPSGD [95]
Centralized | Bounded | SSP [39], Rudra [36], Softsync SGD [108], Gaia [45], async SGD [30], Qsparse-local-SGD [10], Hybrid sync/async [58] | EASGD [107], Federated learning [71, 55]
Decentralized (global) | None | Allreduce-SGD [34, 89, 75, 31] | BMUF [20]
Decentralized (global) | Unbounded | — | One-shot SGD [70], SimuParallelSGD [109]
Decentralized (global) | Bounded | Eager-SGD [60], SFB [105], Gradient lag [57] | —
Decentralized (group) | None | — | —
Decentralized (group) | Unbounded | — | —
Decentralized (group) | Bounded | — | WAGMA-SGD
Decentralized (gossip) | None | — | D-PSGD [61], SGP [5]
Decentralized (gossip) | Unbounded | GossipGraD [22], Choco-SGD [54] | AD-PSGD [62], Gossiping SGD [52], SwarmSGD [74]
Decentralized (gossip) | Bounded | CDSGD [51] | Local SGD [92, 64, 37, 101, 81]
Removing the global communication bottleneck in decentralized SGD has been shown to enable scaling to tens and even hundreds of nodes [61, 62, 5]. However, performing averaging in pairs comes at the cost of worse convergence: early proposals for decentralized algorithms [61, 62] lose accuracy with respect to the synchronous baseline at scale, while more recent work [5, 74] observes that the algorithms can achieve full accuracy if executed for more iterations than the synchronous baseline; in particular, they execute between two and four times more SGD iterations in total, erasing much of the speedup due to increased scalability. This decreased convergence is connected to the analytical bounds provided for these algorithms: while the theoretical convergence rates suggest linear speedup with respect to the number of SGD steps and executing nodes, these rates only apply after a very large number of SGD steps have been taken, in order to allow the pairwise averaging process to "mix" well, thereby simulating all-to-all averaging. See Section IV-A for a detailed discussion.
II-A Training at Scale
An orthogonal challenge to distributed stochastic optimization is that of unbalanced workloads. Imbalance may be caused by the training system [47, 60, 62] or by the task itself [60, 103]. Training on multi-tenant cloud systems can suffer from performance variability due to resource sharing. Several deep learning tasks, such as video classification and machine translation, have inherent load imbalance, because input/output sequences have different lengths [60]. In deep reinforcement learning, an agent must interact with the environment to generate training data. For RL tasks using heterogeneous environments [103], the runtime of training data generation varies significantly.
Beyond deep learning, allreduces have a long history within the HPC community [96, 77, 9, 91, 7, 8, 18, 17, 48, 106, 6, 11], and non-blocking versions have been used to improve performance [44]. Particular implementations have become widely used within the deep learning community, including Baidu-Allreduce [34], NCCL [75], Gloo [31], and Horovod [89]. Most deep learning frameworks incorporate support for distributed training, either via parameter servers or allreduces [24, 21, 1, 46]. Communication compression is another common (and complementary) approach to reducing communication overhead [87, 93, 29, 65, 2, 3, 102, 63, 14, 82]. Communication may also be impacted by different approaches to partitioning layer parameters, such as model parallelism [56, 97, 33, 27, 28, 50].
II-B Comparison Targets
In Table I we summarize and classify the distributed SGD algorithms most relevant to our work. Algorithms in bold are used for comparison in this work. Since decentralized algorithms typically scale and perform better on large-scale systems, we limit our comparison to them. The algorithms we compare against are chosen specifically to be spread across the different answers to the above five questions, prioritizing popular algorithms with proven convergence, both in theory and in practical deep learning applications:
Allreduce-SGD is the standard data-parallel training.

Decentralized parallel SGD (D-PSGD) [61] uses a ring topology, where each process averages its local model with its two neighbors. Processes advance synchronously with a single global clock.

Stochastic gradient push (SGP) [5] generalizes the topology used in D-PSGD to support more flexible, asymmetric communication patterns.

Eager-SGD [60] uses partial collective allreduces over the gradients, allowing at most half of the processors to contribute stale gradients if they are not ready.

Asynchronous decentralized parallel SGD (AD-PSGD) [62] extends the idea of D-PSGD by allowing processors to communicate updates at any point in time.
These cover nearly all varieties of consistency and averaging, as well as practical differences in communication patterns.
II-C Discussion
Following the discussion on the impact of quorum size on convergence (Q5), it is natural to ask whether performing decentralized averaging in larger groups would be able to provide the best of both worlds, enabling the full convergence of the synchronous algorithm, and the scalability of fully decentralized ones. There are two main barriers to this solution: the first one is at the implementation level, since, to our knowledge, no efficient nonblocking implementation of group model averaging exists. The second is at the application level, since it is not clear whether group averaging would be able to achieve the same convergence as the synchronous solution (both in theory and in practice). In the following sections, we address both of these issues.
III Wait-Avoiding Group Communication
The allreduce collective operation [72] is defined as a reduction whose results are shared among all participants. Although several optimizations [34, 77] have been designed to improve the performance of this collective, allreduce poses an implicit global synchronization point, which makes it vulnerable to stragglers during deep learning training. On larger systems, the performance of the compute nodes can be impacted by different internal (e.g., load imbalance) and external factors (e.g., network [23] or OS [43] noise), potentially increasing the synchronization overhead. We define this collective as a synchronous allreduce. While non-blocking collectives [42] can alleviate the synchronization overhead, they do not fully remove it, and waiting for completion is still required. Even if the participating processes are perfectly synchronized, the runtime of an allreduce of size N scales at best as Ω(log P + N) for P processes [76, 83]. Therefore, growing process counts will reduce the parallel efficiency and eventually make the reduction a scaling bottleneck.
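As a back-of-the-envelope illustration of this scaling argument, consider the standard alpha-beta communication cost model (the constants below are assumed for illustration, not measured on any system):

```python
import math

def allreduce_cost(P, N, alpha=1e-6, beta=1e-9):
    """Approximate allreduce time in the alpha-beta model for N words over P
    processes (alpha: per-message latency, beta: per-word transfer time),
    using the classic reduce-scatter + allgather scheme. Illustrative only."""
    latency = 2 * alpha * math.log2(P)
    bandwidth = 2 * beta * N * (P - 1) / P
    return latency + bandwidth

# The latency term grows with log P while the bandwidth term approaches a
# constant 2*beta*N, so a fixed-size reduction loses parallel efficiency
# as the process count grows.
```

Under this model, doubling the node count for a fixed model size strictly increases the reduction time, which is the bottleneck WAGMA-SGD's group collectives aim to avoid.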
III-A Wait-Avoiding Group Allreduce
To overcome the synchronization overhead and overall collective cost, we introduce a new class of wait-avoiding group collectives, focusing on group allreduce for the purpose of this work. We relax synchronization by making the collectives externally-triggerable [26, 60], namely, a collective can be initiated without requiring that all the processes enter it, by externally activating the communication schedule of late processes with activation messages sent by the early ones. Once activated, a group allreduce does not perform a global reduction. Instead, it partially reduces the data within non-overlapping groups of processes, limiting the number of communications needed to implement the group collective.
III-A1 Collective activation
In a wait-avoiding group allreduce, any process can make progress regardless of what the other processes are working on. This wait-avoidance is achieved by the activation component. We call the process that reaches the collective call first the activator. The activator is in charge of informing all the other processes that an allreduce operation has started and that they have to participate, regardless of whether they have reached the collective call-site.
In a wait-avoiding group allreduce, any process can initiate the collective. We use a modified version of the recursive doubling algorithm that builds a butterfly topology, which can be seen as a set of overlapping binomial trees, one rooted at each process. Any node can activate the collective by sending activation messages along the binomial tree rooted at itself. Fig. 1 shows an example where P1 is the activator. In this case, P1 uses its broadcast tree and sends the activation messages to P0 and P3. Once activated, P0 first forwards the activation message to P2, after which it starts executing its group allreduce schedule.
It is possible that several processes arrive at the wait-avoiding group allreduce operation in close proximity, which means we may have more than one activator during the activation phase. To guarantee that a process does not execute the same collective twice, we assign each operation a version number that is increased every time the collective is executed. The collective version number is encoded in the tag of the activation messages: once an activation is received, the collective schedule is activated only if its version number is lower than or equal to the one carried by the activation message. The version-number check is also executed when a process reaches the collective call: if it fails, then the version of the collective that the process wants to activate has already been executed (and the process has passively participated in it). In this case, no activation messages are sent.
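The version-number logic can be sketched as follows (schematic Python, not fflib's actual API; `activation_targets` assumes a simplified recursive-doubling neighbor set):

```python
import math

class WaitAvoidingCollective:
    """Schematic version-number bookkeeping for wait-avoiding activation
    (illustrative only; the real logic lives in the library's schedules)."""

    def __init__(self, rank, num_procs):
        self.rank, self.P = rank, num_procs
        self.version = 0   # highest collective version executed so far
        self.calls = 0     # number of times we reached the call-site

    def activation_targets(self):
        # Neighbors of this rank in its binomial broadcast tree
        # (one per dimension of the butterfly; simplified here).
        return [self.rank ^ (1 << k) for k in range(int(math.log2(self.P)))]

    def on_activation(self, msg_version):
        """An activation message arrived: execute only if not yet executed."""
        if self.version < msg_version:
            self.version = msg_version
            return True    # activate the group-allreduce schedule
        return False       # this version already ran; ignore the duplicate

    def on_call(self):
        """This process reached the collective call-site itself."""
        self.calls += 1
        if self.version >= self.calls:
            return []      # we already participated passively: send nothing
        self.version = self.calls
        return self.activation_targets()  # act as activator: notify the tree
```

For example, a process that was externally activated for version 1 sends no activation messages when it later reaches that call-site itself, but does activate the next instance.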
III-A2 Asynchronous execution
To enable asynchronous execution of the custom collectives, we extend the fflib communication library [26], adding support for wait-avoiding group allreduce. fflib allows programmers to customize collective operations via a flexible, DAG-based representation of point-to-point and local compute operations, defined as schedules. The library provides a C-based interface for schedule creation and non-blocking invocation, using MPI as its primary backend, with additional support for network offloading engines such as sPIN [41]. Our schedule for group operations models both the activation and group allreduce phases.
III-B Dynamic Grouping Strategy
As discussed in Section II, in data-parallel SGD variants such as allreduce SGD [89, 12] and gossip SGD [61, 5, 62], each process keeps propagating local model updates to all the other processes at every iteration to make global progress. We propose a dynamic grouping strategy to reduce the latency (in steps) of local update propagation. Together with the group allreduce operation, the grouping strategy guarantees that the local updates can be globally propagated within ⌈log_q P⌉ iterations. The larger the group size q, the faster the updates are propagated. By carefully selecting the group size, we can achieve both lower latency than gossip SGD and efficient communication by reducing contention.
We define the dynamic grouping strategy in Algorithm 1. We assume the number of processes P is a power of two, which is a common case in current distributed training systems. The group size q is also set to a power of two. In line 2, we initialize the mask and calculate the number of phases in a butterfly topology for q and P processes, respectively. Line 3 initializes the shift. In each training iteration t, the algorithm first initializes P groups, each of which contains one process (line 4). In line 8, an equivalence relation between each pair of processes i and j is established using the bitwise XOR operation. For a pair of processes with an equivalence relation, we find the groups that i and j belong to, respectively (line 9); if i and j are not in the same group, we merge the two groups into one using the union operation (lines 10–12). By line 15, all the processes will have been partitioned into groups of size q for iteration t. Note that the initial value of shift changes periodically in every iteration (line 3), which, in turn, changes the group composition in every iteration.
To demonstrate dynamic grouping, we use P = 8 and q = 4 as an example. In iteration t, all processes are initially partitioned into 8 groups. The set of equivalence relations^1 includes (0,1), (2,3), (4,5), (6,7), (0,2), (1,3), (4,6), and (5,7). By recursively merging the two groups to which a pair of processes with an equivalence relation belongs, we obtain two non-overlapping groups, which contain the processor sets {0, 1, 2, 3} and {4, 5, 6, 7}. In iteration t+1, the set of equivalence relations changes; thus, the grouping changes accordingly (i.e., {0, 1, 4, 5} and {2, 3, 6, 7}).

^1 The equivalence relations satisfy reflexivity, symmetry, and transitivity; thus, some of the listed pairs are implied by the others. We still put them in the set for easier explanation.
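The grouping in this example can be reproduced with a short union-find sketch. Note that the phase schedule `shift = t*log2(q) mod log2(P)` below is an assumption inferred from the example above, not necessarily the paper's exact schedule:

```python
import math

def groups(t, P, q):
    """Partition ranks 0..P-1 into P//q groups of size q for iteration t.

    Sketch of the dynamic grouping: each iteration selects log2(q) of the
    log2(P) butterfly phases; ranks i and j = i XOR 2**k are merged
    (union-find) for every selected phase k."""
    lp, lq = int(math.log2(P)), int(math.log2(q))
    shift = (t * lq) % lp                 # assumed rotation of the phase window
    parent = list(range(P))

    def find(x):                          # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for k in range(lq):                   # merge along the selected phases
        phase = (shift + k) % lp
        for i in range(P):
            ri, rj = find(i), find(i ^ (1 << phase))
            if ri != rj:
                parent[ri] = rj
    out = {}
    for i in range(P):
        out.setdefault(find(i), []).append(i)
    return sorted(out.values())

print(groups(0, 8, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(groups(1, 8, 4))  # [[0, 1, 4, 5], [2, 3, 6, 7]]
```

Both printed partitions match the worked example, and rotating the phase window changes the group composition every iteration while keeping the group size fixed.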
Note that we use Algorithm 1 only to formally describe the grouping strategy. The grouping strategy, together with the allreduce within each group, is implemented concisely following the phases of the butterfly topology, namely each pair of processes with an equivalence relation in a phase exchanges messages. We use the shift variable to select the phases that should be executed in the current iteration. Fig. 2 presents the iterative execution of group allreduce with dynamic grouping in WAGMA-SGD; the grouping is shown on the right side. We can see that although the group size is fixed, the groups dynamically change over the iterations. Within each group, the allreduce is conducted following the phases of the butterfly topology. To maintain convergence with this communication scheme in data-parallel deep learning training, a standard synchronous allreduce across all the processes is conducted periodically, bounding the staleness of the weights. In the following section, we present the algorithm in detail and further discuss this periodic synchronization.
IV Wait-Avoiding Group Model Averaging SGD
Based on the insight that larger groups converge faster, and on the novel implementation of wait-avoiding group collectives, we design the Wait-Avoiding Group Model Averaging (WAGMA) SGD algorithm. The algorithm can be classified as a model-averaging, bounded-staleness, decentralized SGD with a group size of q and a fixed global synchronization period. As listed in Algorithm 2, WAGMA-SGD is similar to minibatch SGD, but with a few important distinctions.
In lines 3–7, each process calculates the local gradients and then applies them to derive the local model update, as in distributed SGD. Subsequently, the wait-avoiding group model averaging is conducted (lines 8–17) using the aforementioned wait-avoiding communication scheme. From an algorithmic perspective, WAGMA-SGD does not rely on a particular choice of group members for the local collectives. However, instead of randomly choosing groups of processes, we use the butterfly strategy (Algorithm 1) for topology-aware, efficient, deterministic communication.
In each iteration, faster processes trigger the model averaging immediately without waiting (line 9, where the shift variable controls the grouping), which may incur averaging the local models with some stale models from slower processes. To both bound the staleness and mitigate divergence across the local model replicas, we define a synchronization period, at which the models are averaged across all the processes using a global allreduce (line 16). Empirically, we set the synchronization period to 10 training iterations, which balances model accuracy with training throughput, as we show in Section V.
An execution snapshot of WAGMA-SGD is presented in Fig. 3. Suppose P1 is a straggler. When the group allreduce in iteration t is triggered by any of the other three processes, P1 can only contribute its stale model parameters from the previous iteration. In iteration t, P1 and P0 are in the same group; therefore, P0's up-to-date model and P1's stale model are averaged, and P0 uses the averaged model for the next iteration of training. P1 subsequently finishes the calculation of its local updated model for iteration t, but finds out that the group allreduce of iteration t has already finished. In this case, it averages its fresh local model with the received result (line 13 in Algorithm 2), and uses the average for the next iteration of training. Meanwhile, the data in the send buffer of P1 is updated with the fresh model. If the group allreduce of the next iteration is triggered by some faster process at this time, P1 will again passively contribute a stale model. When a standard allreduce is called at the synchronization point, it forces all the processes to contribute their model parameters after training for the same number of iterations. In Fig. 3, P1 eventually catches up with the other processes; thus, it contributes a timely model to P3, as they are in the same group.
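Ignoring asynchrony and staleness, the data flow of one WAGMA-SGD iteration can be simulated synchronously as follows (a sketch of Algorithm 2's averaging structure over scalar models; in the real algorithm the group averaging is wait-avoiding and may mix stale models):

```python
def wagma_sgd_iteration(t, ws, grads, lr, groups_fn, sync_period):
    """Synchronous simulation of one WAGMA-SGD iteration over all workers.

    ws: per-worker models; grads: per-worker stochastic gradients;
    groups_fn(t): partition of ranks for iteration t (e.g., Algorithm 1);
    sync_period: global synchronization period bounding staleness."""
    ws = [w - lr * g for w, g in zip(ws, grads)]      # local SGD updates
    new_ws = list(ws)
    for group in groups_fn(t):                        # group model averaging
        avg = sum(ws[r] for r in group) / len(group)
        for r in group:
            new_ws[r] = avg
    if (t + 1) % sync_period == 0:                    # periodic global allreduce
        avg = sum(new_ws) / len(new_ws)               # restores consistency
        new_ws = [avg] * len(new_ws)
    return new_ws
```

With four workers and groups {0,1} and {2,3}, each pair ends the iteration with a shared average; on a synchronization step, all four collapse to the global average.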
IV-A Proof of Convergence
Algorithm Modelling
For analysis purposes, we model the algorithm's execution as follows. We proceed in steps, indexed by time t. Each node maintains its own local model, and has a local partition of the data. In each step, a group of nodes of size q is chosen to interact. Each node takes a local gradient step, and then the nodes in the group average their models. This averaging step might be inconsistent, as per the above semantics.
In the analysis, we will assume that the group of interacting nodes is chosen uniformly at random—in the long run, the resulting interaction graph will have the same properties as the butterfly interaction strategy used in the implementation. While our analysis considers each interaction sequentially, in practice interaction steps can occur in parallel.
Setup and Analytic Assumptions
We assume a standard setting in which we are given a dataset of m samples, and with each sample i we associate a differentiable loss function f_i. Each node is given a random partition of the dataset, and we wish to solve the empirical risk minimization problem of finding argmin_w f(w), where f(w) = (1/m) Σ_{i=1}^{m} f_i(w).
Let f_p be the loss function corresponding to the dataset partition of the p-th node, so that f = (1/P) Σ_{p=1}^{P} f_p. To make the analysis tractable, we make the following standard assumptions on the loss function.
Assumption 1.
We assume the following hold:

(Lipschitz gradients) All functions f_i have L-Lipschitz gradients, for some constant L > 0.

(Bounded Staleness) The staleness during the averaging step is upper bounded by a parameter τ. That is, for any node participating in the averaging step at time t, averaging is performed with respect to a model from some step t′ with t − τ ≤ t′ ≤ t, and every gradient update is applied globally at most τ steps after it was generated.
Convergence result
We can now state the main convergence result. For readability, we state a simplified variant that highlights the relationship between the parameter values, in particular the relation between the convergence time T, the number of processors P, and the size q of the interacting group. The full statement and its proof are deferred to the full version of our convergence proof.
Theorem 1.
Consider the setting described above, in which we wish to optimize a non-convex function f. Let q be the size of a communication group, and assume that the maximum staleness τ is constant. Fix a success parameter ε > 0. For each time t, we define μ_t to be the average of the local models at time t. Then there exists a setting of the learning rate such that, if the algorithm has taken a sufficiently large number of steps T (the precise bound is given in the full version of the proof),
then there exists an iterate μ_t, with t ≤ T, such that E‖∇f(μ_t)‖² ≤ ε,
where the expectation is taken w.r.t. the randomness in the sampling and interactions.
Discussion
At a high level, this claim shows that the algorithm will eventually reach a point where the model average has negligible gradient, i.e., is at a local minimum. While this does not guarantee convergence to a global minimum, it matches the best achievable guarantees for SGD in the non-convex setting [32]. The convergence proof follows the general decentralized asynchronous framework of Lian et al. [62], with differences due to the specific structure of the group communication pattern we employ and the asynchronous nature of wait-avoiding collectives. Further, we note that the convergence guarantee can be extended to 1) apply to individual models instead of the model average; and 2) relax the bounded second moment assumption to a bound on the variance. Both of these improvements come at the cost of additional technical complexity, so we defer them to the full version of our convergence proof. The current statement of the theorem obscures the rate at which convergence occurs: for standard parameter settings, the convergence speedup (i.e., the rate at which we reach a point of negligible gradient) with respect to the number of nodes is linear. This linear speedup is the best possible, and matches the rates for other decentralized algorithms, e.g., [61, 62]. We refer the interested reader to the full version of our convergence proof.

It is interesting to examine the impact of the group size q on convergence: in particular, the time to convergence decreases quadratically in q. Specifically, if the group size is small, say q = 2, and the other parameters are constant, the number of steps T to reach a local optimum matches the best known bounds in the decentralized model for pairwise interactions [61, 62]. However, this T can exceed the number of SGD steps taken during regular training even for moderate node counts, making the bound meaningless in practice. For example, given the number of SGD steps typically taken when training ResNet-50 on ImageNet, the bound is meaningful only for small numbers of processes. For larger group sizes, our analysis decreases this step bound so that it asymptotically matches the convergence rate and step bound obtained when model averaging is performed synchronously and globally (e.g., the bound of [61] for all-to-all communication), which is also practically more relevant.
V Experimental Evaluation
We conduct our experiments on the CSCS Piz Daint supercomputer. Each Cray XC50 compute node contains a 12-core Intel Xeon E5-2690 CPU with 64 GB RAM and one NVIDIA Tesla P100 GPU with 16 GB of memory. The compute nodes are connected by the Cray Aries interconnect in a Dragonfly topology. The communication library is Cray MPICH 7.7.2. We use one MPI process per node and utilize the GPU for acceleration in all of the following experiments. We evaluate three different deep learning problems: image classification (ResNet-50 [38] on ImageNet [25]), machine translation (Transformer [98] on WMT17), and deep reinforcement learning (PPO [86, 103] for navigation in Habitat [84]). For throughput tests, we scale the number of nodes up to the point where the global batch size becomes too large to converge [90].
V-A Baselines
We compare WAGMA-SGD with state-of-the-art data-parallel SGD variants, including Allreduce-SGD [89, 12], local SGD [64, 20], gossip-based SGD variants (D-PSGD [61], AD-PSGD [62], and SGP [5]), and eager-SGD [60]. Unless mentioned otherwise, the synchronization period of local SGD is set to one, namely calling a standard allreduce to average the models in each training iteration, which essentially makes it a synchronous SGD. For SGP, we evaluate its performance with different numbers of communication neighbors [5]. For a more detailed discussion of the baselines, please refer to Section II.
V-B Image Classification with Simulated Workload Imbalance
Residual Networks (ResNets) [38] are pervasively used in computer vision tasks. To evaluate performance, we train ResNet-50 on ImageNet (25,559,081 trainable parameters in total) with a local batch size of 128 images. Although the training workload is balanced, since the input size is fixed, performance variability is observed in practice when training on multi-tenant cloud systems [62, 60, 85]. To simulate the same degree of imbalance, we randomly select two processes at every training step and inject a certain amount of delay (320 ms), in accordance with the performance variability observed on cloud machines [60]. For WAGMA-SGD, we set the synchronization period to 10 training iterations (Section IV); both the number of processes and the group size are powers of two in our experimental configuration.

Fig. 4 shows the training throughput as the number of GPU nodes increases from 4 to 256; the top of the rectangle wrapping each cluster indicates the ideal throughput without communication overhead. Compared with local SGD, Allreduce-SGD (implemented in Deep500 [12]), D-PSGD, SGP (two communication neighbors), and eager-SGD when training on 64 GPU nodes, WAGMA-SGD achieves 1.25x, 1.26x, 1.23x, 1.25x, and 1.13x speedups, respectively. The speedup becomes larger as the number of GPU nodes increases to 256: WAGMA-SGD achieves up to 1.37x speedup. The only algorithm with higher throughput than WAGMA-SGD is AD-PSGD, in which the asynchronous communication is completely overlapped with the computation. These results show that WAGMA-SGD handles unbalanced workloads better than the synchronous SGD algorithms (i.e., local SGD, Allreduce-SGD, D-PSGD, and SGP), as well as the bounded-staleness eager-SGD variant. In the latter case, while staleness is bounded, the algorithm still conducts a global collective communication for gradient averaging in each training iteration. In contrast, WAGMA-SGD keeps the collectives within each group, and thus has better parallel scalability.
Fig. 5 presents the Top-1 validation accuracy (in accordance with MLPerf [68]) when training for 90 epochs on 64 nodes with a total batch size of 8,192. The accuracy of WAGMA-SGD (75.0%) is very close to that of standard Allreduce-SGD (75.8%) and local SGD (75.3%), but WAGMA-SGD significantly reduces the training time. Gossip-based SGD algorithms, such as D-PSGD and the higher-throughput AD-PSGD, attain much lower accuracy than the other variants. This can be explained by the fact that these algorithms have not fully converged, requiring more steps to achieve comparable accuracy [74]. For SGP, we tune the number of communication neighbors to achieve the highest generalization using a directed exponential graph [5], which makes it more accurate than D-PSGD and AD-PSGD, yet still less accurate than WAGMA-SGD. Note that the default number of communication neighbors in SGP is one, whereas we set it to two for better generalization. Overall, WAGMA-SGD achieves the best accuracy-vs-time among all parallel SGD variants.
With its chosen group size, WAGMA-SGD propagates model updates across all processes in fewer iterations than the gossip-based algorithms, which is what enables its higher accuracy. This is consistent with our analysis in Section IV-A.
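To make the propagation argument concrete, the toy simulation below tracks which workers' initial models have "influenced" each worker under a hierarchically shifted grouping. This is a simplified sketch of a dynamic grouping strategy, not the exact schedule used by WAGMA-SGD. With 16 workers, groups of 4 fully mix in 2 averaging rounds, whereas pairwise (gossip-like, group size 2) mixing needs 4:

```python
def rounds_to_full_mixing(num_workers, group_size):
    """Count averaging rounds until every worker's model has been
    influenced by every other worker, under a dynamic grouping that
    shifts group membership hierarchically between rounds (a simplified
    illustration of the strategy described in the text)."""
    influence = [{i} for i in range(num_workers)]
    rounds = 0
    stride = 1
    while any(len(s) < num_workers for s in influence):
        # Group workers whose indices differ only in the current "digit"
        # (base `group_size`), so each round mixes across a wider span.
        groups = {}
        for w in range(num_workers):
            key = (w // (stride * group_size)) * stride + w % stride
            groups.setdefault(key, []).append(w)
        for members in groups.values():
            merged = set().union(*(influence[m] for m in members))
            for m in members:
                influence[m] = set(merged)
        stride *= group_size
        rounds += 1
    return rounds
```

In general, this schedule needs on the order of log base G of P rounds for P workers and group size G, which is why a larger group size (up to a point) propagates updates faster than pairwise gossip.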
To further analyze the convergence properties of WAGMA-SGD, we conduct additional experiments. ❶ In the first experiment, we remove the wait-avoiding group collectives in WAGMA-SGD and only keep the standard allreduce operations at the synchronization points, which is essentially equivalent to local SGD with the same synchronization period. This causes the Top-1 validation accuracy to drop sharply to 66.9%. ❷ In a second experiment, we execute group model averaging without the dynamic grouping strategy (i.e., the groups are fixed). In this case, the Top-1 validation accuracy drops to 73.5%. ❸ We also experiment with increasing the group size to 64 (i.e., a global collective). While accuracy does not increase, the throughput drops by a factor of 1.07x. ❹ Lastly, we decrease the group size to 4 and observe that the Top-1 validation accuracy drops to 72.8%.
The results from experiments ❶ and ❷ indicate that the combination of group allreduce operations and the dynamic grouping strategy is essential for good generalization. The results from experiments ❸ and ❹ demonstrate that our chosen group size empirically performs best among the settings tested.
V-C Machine Translation
Transformers are sequence-to-sequence transducers that can be used to translate a sequence of words from one language to another. We use the standard-sized Transformer network [98], which has 61,362,176 trainable parameters, to train English-to-German translation on the WMT17 dataset. While training the model, the computation overhead changes with the length of the input and output sentences. The samples in the training dataset typically consist of sentences of various lengths, and thus the training workload is unbalanced. As shown in Fig. 6, even when using a bucketing strategy to group sentences with similar lengths, there is high variance in workload size between samples. Specifically, in our experiment each local batch contains an equal number of sentences sampled from a randomly selected bucket, where the maximum local batch size is set to 8,192 tokens. For WAGMA-SGD, we set the synchronization period and the group size.

Fig. 7 presents the training throughput as the number of GPU nodes increases from 4 to 64, where the top of the rectangle indicates the ideal throughput without communication overhead. On 16 GPU nodes, WAGMA-SGD achieves the highest throughput, compared with local SGD, Allreduce-SGD (implemented in Horovod [89]), D-PSGD, AD-PSGD, and SGP (one communication neighbor). When the number of GPU nodes increases to 64, as with image classification, WAGMA-SGD exhibits lower throughput than AD-PSGD but higher than all the other variants. Observe that on 64 nodes, all the algorithms perform far below the ideal throughput. We believe this effect stems from the balance between the number of parameters (occupying 245 MB alone) and the operational intensity of backpropagation: since Transformer networks mostly consist of tensor contractions implemented as batched matrix products, which utilize GPUs well, communication overhead dominates, and not even AD-PSGD manages to fully overlap communication with computation.
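The length-bucketing used for this workload can be sketched as follows; `bucket_width` and the helper names are our own illustrative choices, with only the 8,192-token batch cap taken from the experimental setup:

```python
from collections import defaultdict
import random

def bucket_by_length(sentences, bucket_width=10):
    """Group sentences of similar length into buckets (illustrative;
    `bucket_width` is an arbitrary choice for this sketch)."""
    buckets = defaultdict(list)
    for s in sentences:
        buckets[len(s) // bucket_width].append(s)
    return dict(buckets)

def sample_batch(buckets, batch_tokens=8192, rng=random.Random(0)):
    """Draw one batch from a randomly selected bucket, keeping the total
    token count below `batch_tokens`, as in the setup described above."""
    bucket = buckets[rng.choice(sorted(buckets))]
    batch, tokens = [], 0
    for s in bucket:
        if tokens + len(s) > batch_tokens:
            break
        batch.append(s)
        tokens += len(s)
    return batch
```

Even with such bucketing, batches drawn from different buckets contain sentences of different lengths, so the per-iteration compute still varies across processes.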
As for accuracy, Fig. 8 presents the Bilingual Evaluation Understudy (BLEU) score (higher is better) on the test dataset after training for 10 epochs on 16 nodes. As the total batch size is a relatively large number of tokens (131,072), it incurs quality degradation similar to that seen in other deep learning problems [90]. Still, among all SGD variants, WAGMA-SGD achieves the highest score in the shortest training time. Gossip-based SGD variants, including D-PSGD, AD-PSGD, and SGP (1n, i.e., one communication neighbor), have lower scores than the others, likely because of their slower model update propagation. To verify this claim, we increase the number of communication neighbors to two in SGP (2n), which improves the score to 24.5 (equivalent to local SGD). However, this accuracy increase comes at the cost of significantly reduced training speed compared with SGP (1n).
We conduct additional experiments for WAGMA-SGD, similarly to Section V-B: (1) without the dynamic grouping strategy (i.e., fixed groups), the score drops to 23.8; (2) increasing the group size to 16 (i.e., a global collective) does not improve accuracy, while training throughput drops by a factor of 1.28x; and (3) decreasing the group size to 2 drops the score to 23.3. These results reaffirm the conclusions from image classification.
V-D Deep Reinforcement Learning
Due to the inherent characteristics of the problem, reinforcement learning poses a more challenging training process than supervised and semi-supervised learning. This also applies to the heterogeneity of workloads during training: since the problems in question involve interacting with an environment in episodes (where failure terminates an episode early), a variety of episode lengths may occur within a single minibatch, in a way that cannot be anticipated or categorized into buckets.
We use the popular Proximal Policy Optimization (PPO) policy gradient optimizer [86] to train a model for robot navigation on Habitat [84], a meta-dataset composed of multiple heterogeneous environments. We first confirm previous claims [103] and our own in Fig. 9, where we collate the runtime distribution of 5,000 training iterations. The runtime is very widely distributed, from 1.7 to 43.5 seconds with a median below 2 seconds, which makes this an excellent use case for the load-rebalancing properties of WAGMA-SGD.
To evaluate performance, we train a standard ResNet-LSTM model for navigation. In particular, the network is composed of a ResNet-18 visual encoder connected to a stack of two Long Short-Term Memory (LSTM) [40] recurrent units functioning as the policy, containing 8,476,421 trainable parameters. The measured heterogeneous environments in Habitat, Gibson [104] and Matterport3D [19], consist of interactive RGB-D datasets. We set the number of experience steps to 128 and use two vectorized (namely, optimized) environments, which means each GPU node needs to collect 256 experience steps for each training iteration, and we set the WAGMA-SGD synchronization period accordingly.

Fig. 10 presents the training throughput as the number of GPU nodes increases from 16 to 1,024, where the top of the rectangle indicates the ideal throughput without communication overhead. Compared with local SGD, D-PSGD, and SGP (four communication neighbors) on 1,024 GPU nodes, WAGMA-SGD achieves 2.33x, 1.88x, and 2.10x speedup, respectively. The violin plot shows the throughput distribution. WAGMA-SGD only has lower throughput than AD-PSGD, since AD-PSGD is fully asynchronous. These results show that WAGMA-SGD excels at handling highly unbalanced workloads, achieving good scaling efficiency.
Complementary to performance, we study the Success weighted by Path Length (SPL) score (higher is better) after training the model for 10 hours on 64 GPUs. All models are tested four separate times to account for variability, and the average scores together with the standard deviation (shaded regions) are plotted in Fig. 11. As the figure shows, despite the scalability of AD-PSGD, it only achieves 0.051 SPL on average and appears to have converged there, rendering it unusable for RL problems. On the other hand, WAGMA-SGD achieves the highest score over time, even over local SGD. A possible reason is that asynchronous methods tend to overcome local convergence issues in deep reinforcement learning [73]. This is also seen in SGP, which scores higher than local SGD, but not as well as WAGMA-SGD, whose quorum size is larger.

Beyond our experiments, the current state-of-the-art SPL score is 0.922 [103], achieved after training on 2.5 billion experience steps. WAGMA-SGD consumes a total of 2.6 million experience steps after training for 10 hours on 64 GPUs, and achieves on average 83.1% (up to 91.2%) of the state-of-the-art score, with the score still increasing. This indicates that we are able to achieve almost equivalent generalization with three orders of magnitude fewer iterations.
VI Collectives in Context
Collective operations play a core role in running applications efficiently at scale. As such, their optimization has led to several implementation and algorithmic variants.
Blocking collectives [72] constitute the basic class of operations. In this case, the collective call is allowed to return only once the calling process has completed all the actions needed for its participation in the operation. A first optimization over blocking collectives is to make them non-blocking [42], enabling processes to return immediately and overlap other activities with the ongoing collective.
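The overlap enabled by non-blocking collectives can be illustrated with a thread-based stand-in; the sleep-based `simulated_allreduce` below is our own placeholder for a real operation such as `MPI_Iallreduce`, not an MPI binding:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def simulated_allreduce(data):
    """Stand-in for a communication phase (e.g., MPI_Iallreduce):
    sleeps briefly to mimic network time, then returns the average."""
    time.sleep(0.05)
    return sum(data) / len(data)

def step_nonblocking(data, compute):
    """Launch the collective, overlap it with local computation, and
    only then wait for its completion -- the start/overlap/wait pattern
    that non-blocking collectives enable."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        handle = pool.submit(simulated_allreduce, data)  # "start" the collective
        local = compute()                                # overlapped local work
        return local, handle.result()                    # "wait" for completion
```

A blocking collective corresponds to calling `simulated_allreduce` directly before `compute()`, serializing communication and computation.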
Some collectives require all processes to invoke them in order to complete; e.g., a reduction cannot be computed before all the values to reduce are known. Hence, their completion time can be influenced by any skew (imbalance) among the processes.
Solo collectives [26] remove this synchronization overhead by making the collectives externally triggerable: once a process joins the collective, it sends an activation message to all the others, making them start the collective independently of their current state. An issue with solo collectives is that a single joining process suffices to trigger the collective. Majority collectives [60] extend the solo idea by requiring that a minimum number of processes join the collective before triggering it. While these collectives are not guaranteed to be equivalent to their blocking or non-blocking counterparts, they are well suited for machine learning tasks, owing to the robustness of stochastic optimization to staleness.
Both solo and majority collectives aim to minimize the synchronization overhead. However, once activated, the collective is fully performed, making the application pay the full operation cost plus the activation overhead. Wait-avoiding group collectives (this work) adopt the approach of solo collectives to achieve asynchrony, and further reduce the overall operation cost by dynamically selecting subgroups of processes, each of which executes the collective independently of the others.
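A simple latency-only cost model illustrates why keeping collectives within groups is cheaper: a tree-based allreduce over G group members needs on the order of log2(G) latency steps, versus log2(P) for a global collective over all P processes. The function below is an illustrative model of this effect, not a measurement:

```python
import math

def allreduce_latency_steps(participants):
    """Latency term of a tree-style allreduce: the number of
    communication rounds grows with the log of the participant count
    (a simplified alpha-only cost model)."""
    return math.ceil(math.log2(participants))
```

For example, a group collective over 8 of 1,024 processes costs 3 latency steps under this model, versus 10 for the global operation, on top of avoiding the global synchronization itself.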
VII Conclusion
We show, both theoretically and in practice, that stochastic optimization via group model averaging (averaging the learned weights across subgroups of nodes) functions well in large clusters. We prove that the algorithm converges under the standard conditions of SGD, and through a careful implementation of wait-avoiding collectives, we use the topology of the network to attain the best scaling results without losing accuracy. For the same number of steps, WAGMA-SGD achieves generalization scores equivalent to (or higher than) synchronous SGD, while achieving up to 2.1x speedup (on RL) over the previous state-of-the-art, gossip-based SGD. Similar results are observed in other models from various subfields, empirically demonstrating that this approach is the first to successfully tackle deep learning at extreme scales, dispensing with step-wise global synchronization and bringing SGD to the regime of supercomputers.
References
[1] (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
[2] (2017) Sparse communication for distributed gradient descent. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[3] (2017) QSGD: communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems (NeurIPS).
[4] (2018) AI and compute. https://openai.com/blog/ai-and-compute/.
[5] (2019) Stochastic gradient push for distributed deep learning. In Proceedings of the Thirty-sixth International Conference on Machine Learning (ICML).
[6] (2018) Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL?. In EuroMPI.
[7] (1995) CCL: a portable and tunable collective communication library for scalable parallel computers. IEEE Transactions on Parallel and Distributed Systems 6(2).
[8] (1995) Global combine algorithms for 2D meshes with wormhole routing. Journal of Parallel and Distributed Computing 24(2).
[9] (1994) Interprocessor collective communication library (InterCom). In Proceedings of the IEEE Scalable High Performance Computing Conference.
[10] (2019) Qsparse-local-SGD: distributed SGD with quantization, sparsification and local computations. In Advances in Neural Information Processing Systems (NeurIPS).
[11] (2017) Scalable reduction collectives with data partitioning-based multi-leader design. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–11.
[12] (2019) A modular benchmarking infrastructure for high-performance and reproducible deep learning. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 66–77.
[13] (2019) Demystifying parallel and distributed deep learning: an in-depth concurrency analysis. ACM Computing Surveys 52(4).
[14] (2018) signSGD: compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning (ICML).
[15] (2016) End to end learning for self-driving cars. arXiv:1604.07316.
[16] (2018) Optimization methods for large-scale machine learning. SIAM Review 60(2).
[17] (2007) Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 19(13).
[18] (2006) Collective communication on architectures that support simultaneous communication over multiple links. In PPoPP.
[19] (2017) Matterport3D: learning from RGB-D data in indoor environments. In International Conference on 3D Vision (3DV).
[20] (2016) Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5880–5884.
[21] (2014) Project Adam: building an efficient and scalable deep learning training system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 571–582.
[22] (2018) GossipGraD: scalable deep learning using gossip communication based asynchronous gradient descent. arXiv:1803.05880.
[23] (2019) Mitigating network noise on dragonfly networks through application-aware routing. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–32.
[24] (2012) Large scale distributed deep networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1223–1231.
[25] (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 248–255.
[26] (2015) Exploiting offload enabled network interfaces. In 2015 IEEE 23rd Annual Symposium on High-Performance Interconnects, pp. 26–33.
[27] (2019) Improving strong-scaling of CNN training by exploiting finer-grained parallelism. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS).
[28] (2019) Channel and filter parallelism for large-scale CNN training. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[29] (2016) Communication quantization for data-parallel training of deep neural networks. In 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC).
[30] (2018) Slow and stale gradients can win the race: error-runtime trade-offs in distributed SGD. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS).
[31] (2018) Gloo.
[32] (2013) Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4), pp. 2341–2368.
[33] (2018) Integrated model, batch, and domain parallelism in training neural networks. In Proceedings of the 30th Symposium on Parallelism in Algorithms and Architectures (SPAA).
[34] (2017) Bringing HPC techniques to deep learning. http://research.baidu.com/bringing-hpc-techniques-deep-learning.
[35] (2018) Asynchronous distributed learning with sparse communications and identification. arXiv:1812.03871.
[36] (2016) Model accuracy and runtime tradeoff in distributed deep learning: a systematic study. In 2016 IEEE 16th International Conference on Data Mining (ICDM).
[37] (2019) Local SGD with periodic averaging: tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems (NeurIPS).
[38] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
[39] (2013) More effective distributed ML via a stale synchronous parallel parameter server. In Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS'13), pp. 1223–1231.
[40] (1997) Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
[41] (2017) sPIN: high-performance streaming processing in the network. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC17).
[42] (2007) Implementation and performance analysis of non-blocking collective operations for MPI. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 52.
[43] (2010) Characterizing the influence of system noise on large-scale applications by simulation. In SC'10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–11.
[44] (2006) A case for non-blocking collective operations. In Frontiers of High Performance Computing and Networking (ISPA'06 Workshops), Vol. 4331/2006, pp. 155–164.
[45] (2017) Gaia: geo-distributed machine learning approaching LAN speeds. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation (NSDI'17), pp. 629–647.
[46] (2016) FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In CVPR.
[47] (2011) On the performance variability of production cloud services. In 2011 11th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp. 104–113.
[48] (2010) Optimal bucket algorithms for large MPI collectives on torus interconnects. In ICS.
[49] (2019) Priority-based parameter propagation for distributed DNN training. In Proceedings of the 2nd SysML Conference.
[50] (2019) Beyond data and model parallelism for deep neural networks. In Proceedings of the 2nd Conference on Systems and Machine Learning (SysML).
[51] (2017) Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems (NeurIPS).
[52] (2016) How to scale distributed deep learning?. In Workshop on Machine Learning Systems at NeurIPS 2016.
[53] (2020) Scaling laws for neural language models. arXiv:2001.08361.
[54] (2019) Decentralized stochastic optimization and gossip algorithms with compressed communication. arXiv:1902.00340.
[55] (2016) Federated learning: strategies for improving communication efficiency. In NeurIPS Workshop on Private Multi-Party Machine Learning.
[56] (2014) One weird trick for parallelizing convolutional neural networks. arXiv:1404.5997.
[57] (2018) Exascale deep learning for climate analytics. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[58] (2017) Deep learning at 15PF: supervised and semi-supervised classification for scientific data. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[59] (2014) Scaling distributed machine learning with the parameter server. In Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation (OSDI'14), pp. 583–598.
[60] (2020) Taming unbalanced training workloads in deep learning with partial collective operations. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP).
[61] (2017) Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS'17), pp. 5336–5346.
[62] (2018) Asynchronous decentralized parallel stochastic gradient descent. In Proceedings of the 35th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 80, pp. 3043–3052.
[63] (2018) 3LC: lightweight and effective traffic compression for distributed machine learning. arXiv:1802.07389.
[64] (2018) Don't use large mini-batches, use local SGD. arXiv:1808.07217.
[65] (2018) Deep gradient compression: reducing the communication bandwidth for distributed training. In Proceedings of the Sixth International Conference on Learning Representations (ICLR).
[66] (2018) DARTS: differentiable architecture search. arXiv:1806.09055.
[67] (2018) Neural architecture optimization. arXiv:1808.07233.
[68] (2020) MLPerf: an industry standard benchmark suite for machine learning performance. IEEE Micro 40(2), pp. 8–16.
[69] (2018) An empirical model of large-batch training. arXiv:1812.06162.
[70] (2009) Efficient large-scale distributed training of conditional maximum entropy models. In Advances in Neural Information Processing Systems (NeurIPS).
[71] (2017) Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS).
[72] (2015) MPI: a Message-Passing Interface standard, version 3.1.
[73] (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning (ICML), pp. 1928–1937.
[74] (2019) SwarmSGD: scalable decentralized SGD with local updates. arXiv:1910.12308.
[75] (2020) NVIDIA collective communications library.
[76] (2009) Bandwidth optimal allreduce algorithms for clusters of workstations. Journal of Parallel and Distributed Computing 69(2), pp. 117–124.
[77] (2004) Optimization of collective reduction operations. In International Conference on Computational Science, pp. 1–9.
[78] (2018) Language models are unsupervised multitask learners. Unpublished manuscript.
[79] (2019) Regularized evolution for image classifier architecture search. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), pp. 4780–4789.
[80] (2011) Hogwild: a lock-free approach to parallelizing stochastic gradient descent. In Advances in Neural Information Processing Systems 24, pp. 693–701.
[81] (2019) Robust and communication-efficient collaborative learning. In Advances in Neural Information Processing Systems (NeurIPS).
[82] (2019) SparCML: high-performance sparse communication for machine learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[83] (2019) SparCML: high-performance sparse communication for machine learning. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC'19).
[84] (2019) Habitat: a platform for embodied AI research. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9339–9347.
[85] (2010) Runtime measurements in the cloud: observing, analyzing, and reducing variance. Proceedings of the VLDB Endowment 3(1–2), pp. 460–471.
[86] (2017) Proximal policy optimization algorithms. arXiv:1707.06347.
[87] (2014) 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association (INTERSPEECH).
[88] (2020) Improved protein structure prediction using potentials from deep learning. Nature 577(7792), pp. 706–710.
[89] (2018) Horovod: fast and easy distributed deep learning in TensorFlow. arXiv:1802.05799.
[90] (2018) Measuring the effects of data parallelism on neural network training. arXiv:1811.03600.
[91] (2000) CollMark: MPI collective communication benchmark. In International Conference on Supercomputing (ICS).
[92] (2019) Local SGD converges fast and communicates little. In Proceedings of the Seventh International Conference on Learning Representations (ICLR).
[93] (2015) Scalable distributed DNN training using commodity GPU cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association (INTERSPEECH).
[94] (2020) Communication-efficient distributed deep learning: a comprehensive survey. arXiv:2003.06307.
[95] (2020) Communication-efficient decentralized learning with sparsification and adaptive peer selection. arXiv:2002.09692.
[96] (2005) Optimization of collective communication operations in MPICH. The International Journal of High Performance Computing Applications 19(1), pp. 49–66.
[97] (2015) LBANN: Livermore big artificial neural network HPC toolkit. In Proceedings of the Workshop on Machine Learning in High-Performance Computing Environments (MLHPC).
[98] (2017) Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pp. 5998–6008.
[99] (2019) AlphaStar: mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/.
[100] (2019) Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782), pp. 350–354.
[101] (2019) Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. In Proceedings of the Second SysML Conference.
[102] (2017) TernGrad: ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems (NeurIPS).
[103] (2019) Decentralized distributed PPO: solving PointGoal navigation. arXiv:1911.00357.
[104] (2018) Gibson Env: real-world perception for embodied agents. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[105] (2016) Lighter-communication distributed machine learning via sufficient factor broadcasting. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence (UAI'16), pp. 795–804.
[106] (2018) Image classification at supercomputer scale. In NeurIPS Systems for ML Workshop.
[107] (2015) Deep learning with elastic averaging SGD. In Advances in Neural Information Processing Systems (NeurIPS), pp. 685–693.
[108] (2016) Staleness-aware async-SGD for distributed deep learning. In 25th International Joint Conference on Artificial Intelligence (IJCAI).
[109] (2010) Parallelized stochastic gradient descent. In Advances in Neural Information Processing Systems (NeurIPS).