Making Asynchronous Stochastic Gradient Descent Work for Transformers

06/08/2019
by Alham Fikri Aji, et al.

Asynchronous stochastic gradient descent (SGD) is attractive from a speed perspective because workers do not wait for synchronization. However, the Transformer model converges poorly with asynchronous SGD, resulting in substantially lower quality than synchronous SGD. To investigate why, we isolate the differences between asynchronous and synchronous methods and examine the effects of batch size and gradient staleness separately. We find that summing several asynchronous updates, rather than applying each immediately, restores convergence. With this hybrid method, Transformer training for a neural machine translation task reaches near-convergence 1.36x faster in single-node multi-GPU training with no loss in model quality.
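
As a rough illustration of the hybrid method described in the abstract, here is a minimal Python sketch of a server-side update rule that sums incoming asynchronous gradients and applies them in one step. This is a sketch under stated assumptions, not the authors' implementation: the accumulation count, learning rate, and names such as on_gradient_received are all hypothetical.

import numpy as np

# Hypothetical settings; the actual accumulation count would be tuned.
ACCUMULATE_K = 4        # number of asynchronous updates to sum before applying
LEARNING_RATE = 1e-3    # illustrative learning rate

params = np.zeros(10)               # stand-in for the model parameters
grad_sum = np.zeros_like(params)    # buffer holding the summed gradients
received = 0                        # asynchronous updates collected so far

def on_gradient_received(grad):
    """Called whenever any worker finishes a batch, in arrival order."""
    global grad_sum, received, params
    grad_sum += grad                # sum the update instead of applying it now
    received += 1
    if received == ACCUMULATE_K:    # once K updates arrive, apply their sum
        params -= LEARNING_RATE * grad_sum
        grad_sum = np.zeros_like(params)
        received = 0

# Simulate eight gradients arriving asynchronously from workers;
# the parameters change only on every fourth arrival.
for _ in range(8):
    on_gradient_received(np.random.randn(10))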


