1 Introduction
Stochastic gradient descent (SGD) is the backbone of state-of-the-art supervised learning, which is revolutionizing inference and decision-making in many diverse applications. Classical SGD was designed to be run on a single computing node, and its error convergence with respect to the number of iterations has been extensively analyzed and improved via accelerated SGD methods. Due to the massive training datasets and neural network architectures used today, it has become imperative to design distributed SGD implementations, where gradient computation and aggregation are parallelized across multiple worker nodes. Although parallelism boosts the amount of data processed per iteration, it exposes SGD to unpredictable node slowdowns and communication delays stemming from variability in the computing infrastructure. Thus, there is a critical need to make distributed SGD fast, yet robust to system variability.
Need to Optimize Convergence in terms of Error versus Wall-clock Time. The convergence speed of distributed SGD is a product of two factors: 1) the error in the trained model versus the number of iterations, and 2) the number of iterations completed per second. Traditional single-node SGD analysis focuses on optimizing the first factor, because the second factor is generally a constant when SGD is run on a single dedicated server. In distributed SGD, which is often run on shared cloud infrastructure, the second factor depends on several aspects such as the number of worker nodes, their local computation and communication delays, and the protocol (synchronous, asynchronous, or periodic) used to aggregate their gradients. Hence, in order to achieve the fastest convergence speed we need: 1) optimization techniques (e.g., variable learning rate) to maximize the error-convergence rate with respect to iterations, and 2) scheduling techniques (e.g., straggler mitigation, infrequent communication) to maximize the number of iterations completed per second. These directions are interdependent and need to be explored together rather than in isolation. While many works have advanced the first direction, the second is less explored from a theoretical point of view, and the juxtaposition of both is an unexplored problem.
Local-Update SGD to Reduce Communication Delays. A popular distributed SGD implementation is the parameter server framework Dean et al. (2012); Cui et al. (2014); Li et al. (2014); Gupta et al. (2016); Mitliagkas et al. (2016), where in each iteration, worker nodes compute gradients on one mini-batch of data and a central parameter server aggregates these gradients (synchronously or asynchronously) and updates the parameter vector. The constant communication between the parameter server and worker nodes in each iteration can be expensive and slow in bandwidth-limited computing environments. Recently proposed distributed SGD frameworks such as Elastic Averaging Zhang et al. (2015); Chaudhari et al. (2017), Federated Learning McMahan et al. (2016); Smith et al. (2017b) and decentralized SGD Lian et al. (2017); Jiang et al. (2017) save this communication cost by allowing worker nodes to perform local updates to the parameters instead of just computing gradients. The resulting locally trained models (which differ due to variability in training data across nodes) are periodically averaged through a central server, or via direct inter-worker communication. Periodic averaging has been shown to offer significant speedup in deep neural network training Moritz et al. (2015); Zhang et al. (2016); Su & Chen (2015); Zhou & Cong (2017); Lin et al. (2018).

Error-Runtime Tradeoffs in Local-Update SGD. While local updates reduce the communication delay incurred per iteration, discrepancies between the models can result in inferior error convergence. For example, consider the case of periodic-averaging SGD where each of $m$ worker nodes makes local updates, and the resulting models are averaged after every $\tau$ iterations. A larger value of $\tau$ leads to slower convergence with respect to the number of iterations, as illustrated in Figure 1. However, if we look at the true convergence with respect to the wall-clock time, then a larger $\tau$, that is, less frequent averaging, saves communication delay and reduces the runtime per iteration. While some recent theoretical works Zhou & Cong (2017); Yu et al. (2018); Wang & Joshi (2018); Stich (2018) study the dependence of the error convergence on the number of iterations as $\tau$ varies, achieving a provably optimal speedup in the true convergence with respect to wall-clock time is an open problem that we aim to address in this work.
Need for Adaptive Communication Strategies. In the error-runtime tradeoff in Figure 1, we observe a tradeoff between the convergence speed and the error floor as the number of local updates $\tau$ is varied. A larger $\tau$ gives a faster initial drop in the training loss but results in a higher error floor. This calls for adaptive communication strategies that start with a larger $\tau$ and gradually decrease it as the model approaches convergence. Such an adaptive strategy offers a win-win in the error-runtime tradeoff by achieving fast convergence as well as a low error floor. To the best of our knowledge, this is the first work to propose an adaptive communication frequency strategy.
Main Contributions. In this paper we consider periodic-averaging distributed SGD (PASGD), where each worker node performs $\tau$ local updates by processing one mini-batch of data per iteration. A fusion node takes a simple average of these local models, and the worker nodes then start from the averaged model and perform the next $\tau$ local updates. The main contributions are as follows:

We provide the first runtime analysis of local-update SGD algorithms by modeling local computing times and communication delays as random variables, and quantifying the runtime speedup in comparison with fully synchronous SGD. A novel insight from this analysis is that the periodic-averaging strategy not only reduces the communication delay but also mitigates stragglers.

Combining the runtime analysis with previous error-convergence analyses of PASGD, we obtain the error-runtime tradeoff for different values of $\tau$. Using this combined error-runtime tradeoff, we derive an expression for the optimal communication period, which can serve as a useful guideline in practice.

We present a convergence analysis for PASGD with variable communication period $\tau$ and variable learning rate $\eta$, generalizing previous works Zhou & Cong (2017); Wang & Joshi (2018). This analysis shows that decaying $\tau$ provides convergence benefits similar to those of decaying the learning rate, the difference being that varying $\tau$ improves the true convergence with respect to wall-clock time. Adaptive communication can also be used in conjunction with existing learning rate schedules.

Based on the observations from the runtime and convergence analyses, we develop an adaptive communication scheme: AdaComm. Experiments on training the VGG-16 and ResNet-50 deep neural networks in different settings (with/without momentum, fixed/decaying learning rate) show that AdaComm gives a runtime speedup while still reaching the same low training loss as fully synchronous SGD.
Although we focus on periodic simple-averaging of local models, the insights on error-runtime tradeoffs and adaptive communication strategies extend directly to other communication-efficient SGD algorithms including Federated Learning McMahan et al. (2016), Elastic Averaging Zhang et al. (2015) and decentralized averaging Jiang et al. (2017); Lian et al. (2017), as well as synchronous/asynchronous distributed SGD with a central parameter server Dean et al. (2012); Cui et al. (2014); Dutta et al. (2018).
2 Problem Framework
Empirical Risk Minimization via Mini-batch SGD. Our objective is to minimize the empirical risk function $F(\mathbf{x})$ with respect to the model parameters, denoted by $\mathbf{x}$. The training dataset is denoted by $\mathcal{S} = \{s_1, s_2, \dots, s_N\}$, where $s_i$ represents the $i$-th labeled data point. The objective function can be expressed as the empirical risk calculated using the training data and is given by

$$F(\mathbf{x}) := \frac{1}{N} \sum_{i=1}^{N} f(\mathbf{x}; s_i), \qquad (1)$$

where $f(\mathbf{x}; s_i)$ is the composite loss function at the $i$-th data point. In classic mini-batch stochastic gradient descent (SGD) Dekel et al. (2012), updates to the parameter vector are performed as follows. If $\xi_k \subset \mathcal{S}$ represents a randomly sampled mini-batch, then the update rule is

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \eta\, g(\mathbf{x}_k; \xi_k), \qquad (2)$$

where $\eta$ denotes the learning rate and the stochastic gradient is defined as $g(\mathbf{x}_k; \xi_k) := \frac{1}{|\xi_k|} \sum_{s_i \in \xi_k} \nabla f(\mathbf{x}_k; s_i)$. For simplicity, we will use $g(\mathbf{x}_k)$ instead of $g(\mathbf{x}_k; \xi_k)$ in the rest of the paper. A complete review of convergence properties of serial SGD can be found in Bottou et al. (2018).
Periodic-Averaging SGD (PASGD). We consider a distributed SGD framework with $m$ worker nodes, where all workers can communicate with the others via a central server or via direct inter-worker communication. In periodic-averaging SGD, all workers start at the same initial point $\mathbf{x}_0$. Each worker performs $\tau$ local mini-batch SGD updates according to (2), and the local models are then averaged by a fusion node or by performing an all-node broadcast. The workers then update their local models with the averaged model, as illustrated in Figure 2. Thus, the overall update rule at the $i$-th worker is given by

$$\mathbf{x}_{k+1}^{(i)} = \begin{cases} \frac{1}{m} \sum_{j=1}^{m} \left[ \mathbf{x}_k^{(j)} - \eta\, g(\mathbf{x}_k^{(j)}) \right], & (k+1) \bmod \tau = 0, \\ \mathbf{x}_k^{(i)} - \eta\, g(\mathbf{x}_k^{(i)}), & \text{otherwise}, \end{cases} \qquad (3)$$

where $\mathbf{x}_k^{(i)}$ denotes the model parameters at the $i$-th worker after $k$ iterations, and $\tau$ is defined as the communication period. Note that the iteration index $k$ corresponds to the local iterations, not the number of averaging steps.
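As a concrete illustration of update rule (3), the sketch below simulates PASGD on a toy quadratic objective with noisy gradients. The objective, dimensions, and hyperparameter values are illustrative assumptions for this example, not the paper's experimental setup.

```python
import numpy as np

def pasgd(m=4, tau=10, eta=0.05, rounds=50, dim=5, noise=0.1, seed=0):
    """Periodic-averaging SGD on a toy objective F(x) = 0.5*||x||^2,
    where the stochastic gradient is the true gradient x plus noise."""
    rng = np.random.default_rng(seed)
    x = np.ones((m, dim))          # all m workers start at the same point x_0
    for _ in range(rounds):
        for _ in range(tau):       # tau local mini-batch SGD steps per worker
            grads = x + noise * rng.standard_normal((m, dim))
            x -= eta * grads
        x[:] = x.mean(axis=0)      # all-node averaging every tau iterations
    return x[0]

x_final = pasgd()
print(np.linalg.norm(x_final))    # gradient norm shrinks toward the noise floor
```

Setting `tau=1` in this sketch recovers fully synchronous SGD, while larger `tau` trades averaging frequency for (in a real system) lower communication cost.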
Special Case ($\tau = 1$): Fully Synchronous SGD. When $\tau = 1$, that is, when the local models are synchronized after every iteration, periodic-averaging SGD is equivalent to fully synchronous SGD, which has the update rule

$$\mathbf{x}_{k+1} = \mathbf{x}_k - \frac{\eta}{m} \sum_{i=1}^{m} g(\mathbf{x}_k; \xi_k^{(i)}). \qquad (4)$$

The analysis of fully synchronous SGD is identical to that of serial SGD with an $m$-fold larger mini-batch size.
Local Computation Times and Communication Delay. In order to analyze the effect of $\tau$ on the expected runtime per iteration, we consider the following delay model. The time taken by the $i$-th worker to compute a mini-batch gradient at the $k$-th local step is modeled as a random variable $Y_{i,k} \sim F_Y$, assumed to be i.i.d. across workers and mini-batches. The communication delay is a random variable $D$ for each all-node broadcast, as illustrated in Figure 3. The value of the random variable $D$ can depend on the number of workers $m$ as follows:

$$D = \delta \cdot \phi(m), \qquad (5)$$

where $\delta$ represents the time taken for each inter-node communication, and $\phi(m)$ describes how the delay scales with the number of workers, which depends on the implementation and system characteristics. For example, in the parameter server framework, the communication delay can be made proportional to $\log m$ by exploiting a reduction tree structure Iandola et al. (2016). We assume that $\phi(m)$ is known beforehand for the communication-efficient distributed SGD framework under consideration.
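The delay model (5) can be coded with a user-supplied scaling function; the logarithmic default below corresponds to the reduction-tree example, and the function and parameter names are illustrative.

```python
import math

def comm_delay(m, delta=1.0, scaling=None):
    """Communication delay D = delta * phi(m) for one all-node broadcast.

    delta: time for one inter-node communication.
    scaling: phi(m); defaults to the reduction-tree case phi(m) = log2(m).
    """
    phi = scaling or (lambda n: math.log2(n))
    return delta * phi(m)

print(comm_delay(8))                            # log2 scaling: 3.0
print(comm_delay(8, scaling=lambda n: n - 1))   # linear scaling: 7.0
```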
Convergence Criteria. In the error-convergence analysis, since the objective function is non-convex, we use the expected gradient norm as an indicator of convergence, following Ghadimi & Lan (2013); Bottou et al. (2018). We say the algorithm achieves an $\epsilon$-suboptimal solution if

$$\mathbb{E}\left[ \min_{k \in \{1, \dots, K\}} \left\| \nabla F(\mathbf{x}_k) \right\|^2 \right] \leq \epsilon. \qquad (6)$$

When $\epsilon$ is arbitrarily small, this condition guarantees that the algorithm converges to a stationary point.
3 Jointly Analyzing Runtime and ErrorConvergence
3.1 Runtime Analysis
We now present a comparison of the runtime per iteration of periodic-averaging SGD with that of fully synchronous SGD, to illustrate how increasing $\tau$ can lead to a large runtime speedup. Another interesting effect of performing more local updates is that it mitigates the slowdown due to straggling worker nodes.
Runtime Per Iteration of Fully Synchronous SGD. Fully synchronous SGD is equivalent to periodic-averaging SGD with $\tau = 1$. Each of the $m$ workers computes the gradient of one mini-batch and updates the parameter vector, which takes time $Y_i$ at the $i$-th worker. (Instead of local updates, typical implementations of fully synchronous SGD have a central server that performs the update; here we compare PASGD with fully synchronous SGD without a central parameter server.) After all workers finish their local updates, an all-node broadcast is performed to synchronize and average the models. Thus, the total time to complete each iteration is given by

$$T_{\text{sync}} = \max(Y_1, Y_2, \dots, Y_m) + D, \qquad (7)$$
$$\mathbb{E}[T_{\text{sync}}] = \mathbb{E}[Y_{m:m}] + \mathbb{E}[D], \qquad (8)$$

where $Y_1, Y_2, \dots, Y_m$ are i.i.d. random variables with probability distribution $F_Y$, and $D$ is the communication delay. The term $Y_{m:m} = \max(Y_1, \dots, Y_m)$ denotes the highest order statistic of $m$ i.i.d. random variables David & Nagaraja (2003).

Runtime Per Iteration of Periodic-Averaging SGD (PASGD). In periodic-averaging SGD, each worker performs $\tau$ local updates before communicating with other workers. Let us denote the average local computation time at the $i$-th worker by

$$\bar{Y}_i = \frac{1}{\tau} \sum_{k=1}^{\tau} Y_{i,k}. \qquad (9)$$
Since the communication delay $D$ is amortized over $\tau$ iterations, the average computation time per iteration is

$$T_{\text{PASGD}} = \frac{1}{\tau} \left[ \max_{i} \sum_{k=1}^{\tau} Y_{i,k} + D \right], \qquad (10)$$
$$\mathbb{E}[T_{\text{PASGD}}] = \mathbb{E}[\bar{Y}_{m:m}] + \frac{\mathbb{E}[D]}{\tau}, \qquad (11)$$

where $\bar{Y}_{m:m} = \max(\bar{Y}_1, \dots, \bar{Y}_m)$.
The value of the first term $\mathbb{E}[\bar{Y}_{m:m}]$, and how it compares with $\mathbb{E}[Y_{m:m}]$, depends on the probability distribution of $Y$. We can obtain the following distribution-independent bound on the runtime of PASGD that depends only on the mean and the variance of $Y$.

[Upper Bound on the Runtime per Iteration] Suppose each worker takes time $Y_{i,k}$ to compute gradients and perform a local update, i.i.d. across workers and mini-batches, with mean $\mathbb{E}[Y] = \mu$ and variance $\operatorname{Var}[Y] = \sigma^2$. Then,

$$\mathbb{E}[T_{\text{PASGD}}] \leq \mu + \frac{\sigma (m-1)}{\sqrt{\tau (2m-1)}} + \frac{\mathbb{E}[D]}{\tau}. \qquad (12)$$
The proof follows from the bound on expected order statistics given by Arnold & Groeneveld (1979). Observe that as we increase $\tau$, the runtime bound (12) decreases in two ways: 1) a $\tau$-fold reduction in the communication delay, and 2) a reduction in the variance of the maximum of the local computation times, that is, a reduction in the additional delay due to slow or straggling workers.
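The distribution-independent bound (12) can be checked numerically. The sketch below compares a Monte Carlo estimate of the PASGD runtime per iteration against the bound, assuming (for illustration only) exponential local computation times and a constant communication delay.

```python
import numpy as np

def runtime_bound(mu, sigma, m, tau, d):
    """Upper bound (12): mu + sigma*(m-1)/sqrt(tau*(2m-1)) + E[D]/tau."""
    return mu + sigma * (m - 1) / np.sqrt(tau * (2 * m - 1)) + d / tau

def runtime_mc(m, tau, d, mu=1.0, trials=20000, seed=0):
    """Monte Carlo E[T_PASGD] with exponential Y (mean mu) and constant D."""
    rng = np.random.default_rng(seed)
    y = rng.exponential(mu, size=(trials, m, tau))
    per_worker_avg = y.mean(axis=2)            # bar{Y}_i averaged over tau steps
    return per_worker_avg.max(axis=1).mean() + d / tau

m, tau, d = 8, 10, 2.0
emp = runtime_mc(m, tau, d)
bnd = runtime_bound(mu=1.0, sigma=1.0, m=m, tau=tau, d=d)
print(emp, bnd)   # the empirical mean runtime stays below the bound
```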
Speedup over Fully Synchronous SGD. We now evaluate the speedup of periodic-averaging SGD over fully synchronous SGD for different $m$ and $\tau$, to demonstrate how the relative value of computation versus communication delays affects the speedup. Consider the simplest case where $Y$ and $D$ are constants, and let $\alpha = D / Y$ denote the communication/computation ratio. Besides systems aspects such as network bandwidth and computing capacity, for deep neural network training this ratio also depends on the size of the neural network model and the mini-batch size. See Figure 8 for a comparison of the communication/computation delays of common deep neural network architectures. In this case, $Y_{m:m}$ and $\bar{Y}_{m:m}$ are both equal to $Y$, and the ratio of $\mathbb{E}[T_{\text{sync}}]$ to $\mathbb{E}[T_{\text{PASGD}}]$ is given by

$$\frac{\mathbb{E}[T_{\text{sync}}]}{\mathbb{E}[T_{\text{PASGD}}]} = \frac{Y + D}{Y + D/\tau} = \frac{1 + \alpha}{1 + \alpha/\tau}. \qquad (13)$$

Figure 4 shows the speedup for different values of $\alpha$ and $\tau$. When $D$ is comparable with $Y$ ($\alpha \approx 1$), periodic-averaging SGD (PASGD) can be almost twice as fast as fully synchronous SGD.
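Equation (13) is easy to tabulate; the snippet below reproduces the limiting behavior noted above, namely a speedup approaching 2 as $\tau$ grows when $\alpha = 1$.

```python
def speedup(alpha, tau):
    """Speedup of PASGD over fully synchronous SGD for constant Y and D,
    with alpha = D / Y (communication/computation ratio), per (13)."""
    return (1 + alpha) / (1 + alpha / tau)

for tau in (1, 2, 10, 100):
    print(tau, speedup(alpha=1.0, tau=tau))
# as tau grows, speedup(1.0, tau) approaches (1 + alpha) = 2
```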
Straggler Mitigation due to Local Updates. Suppose that
is exponentially distributed with mean
and variance . For fully synchronous SGD, the term in (8) is equal to , which is approximately equal to . Thus, the expected runtime per iteration of fully synchronous SGD (8) increases logarithmically with the number of workers . Let us compare this with the scaling of the runtime of periodicaveraging SGD (11). Here, (9) is an Erlang random variable with mean and variable . Since the variance is times smaller than that of , the maximum order statistic is smaller than . Figure 5 shows the probability distribution of and for exponentially distributed . Observe that has a much lighter tail. This is because the effect of the variability in on is reduced due to the in (8) being replaced by (which has lower variance) in (11).3.2 Joint Analysis with Errorconvergence
In this subsection, we combine the runtime analysis with a previous error-convergence analysis of PASGD Wang & Joshi (2018). Due to space limitations, we state the necessary theoretical assumptions in the Appendix; the assumptions are similar to previous works Zhou & Cong (2017); Wang & Joshi (2018) on the convergence of local-update SGD algorithms.

[Error-Runtime Convergence of PASGD] For PASGD, under certain assumptions (stated in the Appendix), if the learning rate satisfies $\eta L + \eta^2 L^2 \tau(\tau - 1) \leq 1$ and all workers are initialized at the same point $\mathbf{x}_0$, then after a total wall-clock time $T$, the minimal expected squared gradient norm within the time interval will be bounded by

$$\mathbb{E}\left[ \min_{k} \left\| \nabla F(\mathbf{x}_k) \right\|^2 \right] \leq \frac{2\left[ F(\mathbf{x}_0) - F_{\text{inf}} \right]}{\eta T} \left( Y + \frac{D}{\tau} \right) + \frac{\eta L \sigma^2}{m} + \eta^2 L^2 \sigma^2 (\tau - 1), \qquad (14)$$

where $L$ is the Lipschitz constant of the gradient of the objective function and $\sigma^2$ is the variance bound of the mini-batch stochastic gradients. The proof is presented in the Appendix. From the optimization error upper bound (14), one can readily observe the error-runtime tradeoff for different communication periods. While a larger $\tau$ reduces the runtime per iteration and makes the first term in (14) smaller, it also adds noise and increases the last term. In Figure 6, we plot the theoretical bounds for both fully synchronous SGD ($\tau = 1$) and PASGD. Although PASGD with $\tau > 1$ starts with a rapid drop, it eventually converges to a higher error floor. This theoretical result is also corroborated by the experiments in Section 5. Another direct outcome of this theorem is the determination of the best communication period that balances the first and last terms in (14). We discuss the selection of the communication period in Section 4.1.
4 AdaComm: Proposed Adaptive Communication Strategy
Inspired by the clear tradeoff in the learning curves in Figure 6, it would be better to have an adaptive communication strategy that starts with infrequent communication to improve convergence speed, and then increases the frequency to achieve a low error floor. In this section, we develop such an adaptive communication scheme.
The basic idea is to choose the communication period that minimizes the optimization error at each wall-clock time. One way to achieve this is to switch between the learning curves at their intersections. However, without prior knowledge of the various curves, it would be difficult to determine the switching points.
Instead, we divide the whole training procedure into uniform wall-clock time intervals of the same length $T_0$. At the beginning of each time interval, we select the value of $\tau$ that gives the fastest decay rate over the next $T_0$ of wall-clock time. If the interval length $T_0$ is small enough and the best choice of communication period for each interval can be precisely estimated, then this adaptive scheme should achieve a win-win in the error-runtime tradeoff, as illustrated in Figure 7.

After setting the interval length, the next question is how to estimate the best communication period for each time interval. In Section 4.1 we use the error-runtime analysis of Section 3.2 to find the best $\tau$ for each interval.
4.1 Determining the Best Communication Period for Each Time Interval
From the error-runtime bound in Section 3.2, it can be observed that there is an optimal value $\tau^*$ that minimizes the optimization error bound at a given wall-clock time. In particular, consider the simplest setting where $Y$ and $D$ are constants. Then, minimizing the upper bound (14) over $\tau$, we obtain the following. For PASGD, under the same assumptions as in Section 3.2, the optimization error upper bound (14) at time $T$ is minimized when the communication period is

$$\tau^* = \sqrt{\frac{2\left[ F(\mathbf{x}_0) - F_{\text{inf}} \right] D}{\eta^3 L^2 \sigma^2 T}}. \qquad (15)$$

The proof is straightforward by setting the derivative of (14) with respect to $\tau$ to zero; we present the details in the Appendix. Suppose all workers start from the same initial point $\mathbf{x}_{t_0}$, where the subscript denotes the wall-clock time. Directly applying this result to the first time interval, the best choice of communication period is

$$\tau_0 = \sqrt{\frac{2\left[ F(\mathbf{x}_{t_0}) - F_{\text{inf}} \right] D}{\eta^3 L^2 \sigma^2 T_0}}. \qquad (16)$$

Similarly, for the $l$-th time interval, the workers can be viewed as restarting training from a new initial point $\mathbf{x}_{t_l}$. Applying the same result again, we have

$$\tau_l = \sqrt{\frac{2\left[ F(\mathbf{x}_{t_l}) - F_{\text{inf}} \right] D}{\eta^3 L^2 \sigma^2 T_0}}. \qquad (17)$$
Comparing (17) and (16), it is easy to see that the generated communication period sequence decreases along with the objective value $F(\mathbf{x}_{t_l})$. This result is consistent with the intuition that the tradeoff between error convergence and communication efficiency varies over time. Compared to the initial phase of training, the benefit of using a large communication period diminishes as the model approaches convergence. At this later stage, a lower error floor is preferable to a faster runtime.
Practical SGD implementations generally decay the learning rate or increase the mini-batch size Smith et al. (2017a); Goyal et al. (2017) in order to reduce the variance of the gradient updates. As we saw from the convergence analysis in Section 3.2, performing local updates adds noise to the stochastic gradients, resulting in a higher error floor at the end of training. Decaying the communication period gradually reduces this variance and yields a similar improvement in convergence. Thus, adaptive communication strategies are similar in spirit to decaying the learning rate or increasing the mini-batch size. The key difference is that here we are optimizing the true error convergence with respect to wall-clock time rather than the number of iterations.
4.2 Practical Considerations
Although (17) and (16) provide useful insights about how to adapt $\tau$ over time, it is difficult to use them directly in practice because the Lipschitz constant $L$ and the gradient variance bound $\sigma^2$ are unknown. For deep neural networks, estimating these constants can be difficult and unreliable due to the highly non-convex and high-dimensional loss surface. As an alternative, we propose a simpler rule where we approximate $F_{\text{inf}}$ by $0$, and divide (17) by (16) to obtain the basic communication period update rule:

$$\tau_l = \left\lceil \sqrt{\frac{F(\mathbf{x}_{t_l})}{F(\mathbf{x}_{t_0})}}\, \tau_0 \right\rceil, \qquad (18)$$

where $\lceil \cdot \rceil$ is the ceiling function that rounds up to the nearest integer. Since the objective function values (i.e., the training loss) $F(\mathbf{x}_{t_0})$ and $F(\mathbf{x}_{t_l})$ can be easily obtained during training, the only remaining step is to determine the initial communication period $\tau_0$. We obtain a heuristic estimate of $\tau_0$ by a simple grid search over different $\tau$, running each candidate for one or two epochs.
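The basic update rule (18) then reduces to a one-liner. In the sketch below, `loss0` is the training loss at the start of training and `tau0` the grid-searched initial period; the names and example values are illustrative.

```python
import math

def next_tau(loss, loss0, tau0):
    """Basic AdaComm rule (18): tau_l = ceil(sqrt(F(x_{t_l}) / F(x_{t_0})) * tau0)."""
    return math.ceil(math.sqrt(loss / loss0) * tau0)

# as the training loss decreases, the communication period shrinks
print([next_tau(f, loss0=2.0, tau0=16) for f in (2.0, 1.0, 0.5, 0.1)])
```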
4.3 Refinements to the Proposed Adaptive Strategy
4.3.1 Faster Decay When Training Saturates
The communication period update rule (18) tends to give a decreasing sequence $\{\tau_l\}$. Nonetheless, it is possible that the best value of $\tau$ for the next time interval is larger than the current one due to random noise in the training process. Moreover, when the training loss gets stuck on a plateau and decreases very slowly, (18) will cause $\tau$ to saturate at the same value for a long time. To address this issue, we borrow an idea used in classic SGD, where the learning rate is decayed by a factor when the training loss saturates for several epochs Goyal et al. (2017). Similarly, in our scheme, the communication period is multiplied by a decay factor $\gamma < 1$ whenever the value given by (18) is not strictly less than the current period. To be specific, the communication period for the $l$-th time interval is determined as follows:

$$\tau_l = \begin{cases} \left\lceil \sqrt{\frac{F(\mathbf{x}_{t_l})}{F(\mathbf{x}_{t_0})}}\, \tau_0 \right\rceil, & \text{if } \left\lceil \sqrt{\frac{F(\mathbf{x}_{t_l})}{F(\mathbf{x}_{t_0})}}\, \tau_0 \right\rceil < \tau_{l-1}, \\ \lceil \gamma\, \tau_{l-1} \rceil, & \text{otherwise}. \end{cases} \qquad (19)$$

In the experiments, a fixed value of $\gamma$ turns out to be a good choice. One can obtain a more aggressive decay in $\tau$ by either reducing the value of $\gamma$ or introducing a slack variable in the condition.
4.3.2 Incorporating Adaptive Learning Rate
So far we have considered a fixed learning rate for the local SGD updates at the workers. We now present an adaptive communication strategy that adjusts $\tau$ for a given variable learning rate schedule, in order to obtain the best error-runtime tradeoff. Suppose $\eta_l$ denotes the learning rate for the $l$-th time interval. Then, combining (17) and (16) again, we have

$$\tau_l = \left\lceil \sqrt{\frac{\eta_0^3\, F(\mathbf{x}_{t_l})}{\eta_l^3\, F(\mathbf{x}_{t_0})}}\, \tau_0 \right\rceil. \qquad (20)$$

Observe that when the learning rate becomes smaller, the communication period increases. This corresponds to the intuition that a small learning rate reduces the discrepancy between the local models, and hence is more tolerant of large communication periods.
Equation (20) states that the communication period should be proportional to $\eta_l^{-3/2}$. However, in practice it is common to decay the learning rate by a large factor after a given number of epochs. This dramatic change in the learning rate may push the communication period to an unreasonably large value. In our experiments with momentum SGD, we observed that when applying (20), the communication period can grow so large that the training loss diverges.
To avoid this issue, we propose the adaptive strategy given by (21) below. This strategy can also be justified by theoretical analysis. Suppose that in the $l$-th time interval, the objective function has a local Lipschitz smoothness $L_l$. Then, using the approximation $\eta_l L_l \approx \eta_0 L_0$, which is common in the SGD literature Balles et al. (2016), we derive the following adaptive strategy:

$$\tau_l = \left\lceil \sqrt{\frac{\eta_0\, F(\mathbf{x}_{t_l})}{\eta_l\, F(\mathbf{x}_{t_0})}}\, \tau_0 \right\rceil. \qquad (21)$$
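Rules (19) and (21) can be combined into a small controller invoked once per wall-clock interval. The following is a sketch under the paper's approximations, not the authors' reference implementation; the value of `gamma` here is an illustrative assumption.

```python
import math

class AdaCommSchedule:
    """Adaptive communication period: rule (21) proposes a new tau from the
    current loss and learning rate; rule (19) accepts it only if it strictly
    decreases, otherwise decays tau multiplicatively by gamma."""

    def __init__(self, tau0, loss0, eta0, gamma=0.9):
        self.tau0, self.loss0, self.eta0 = tau0, loss0, eta0
        self.gamma = gamma     # multiplicative decay factor (illustrative value)
        self.tau = tau0

    def update(self, loss, eta):
        # rule (21): tau_l = ceil(sqrt(eta0 * F(x_{t_l}) / (eta_l * F(x_{t_0}))) * tau0)
        proposed = math.ceil(
            math.sqrt((self.eta0 * loss) / (eta * self.loss0)) * self.tau0)
        if proposed < self.tau:            # rule (19): accept strict decrease only
            self.tau = proposed
        else:                              # otherwise decay multiplicatively
            self.tau = max(1, math.ceil(self.gamma * self.tau))
        return self.tau

sched = AdaCommSchedule(tau0=16, loss0=2.0, eta0=0.1)
print(sched.update(loss=1.0, eta=0.1))   # loss halved -> smaller tau
print(sched.update(loss=1.0, eta=0.1))   # loss flat -> gamma decay kicks in
```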
Apart from coupling the communication period with the learning rate, when to decay the learning rate is another key design factor. In order to eliminate the noise introduced by local updates, we choose to first gradually decay the communication period to $1$ and then decay the learning rate as usual. For example, if the learning rate is scheduled to be decayed at a certain epoch but the communication period is still larger than $1$ at that time, then we continue to use the current learning rate until $\tau$ reaches $1$.
4.4 Theoretical Guarantees for the Convergence of AdaComm
In this subsection, we provide a convergence guarantee for the proposed adaptive communication scheme by extending the error analysis of PASGD. Without loss of generality, we analyze an arbitrary communication period sequence $\{\tau_0, \tau_1, \dots, \tau_{J-1}\}$, where $J$ represents the total number of communication rounds. (Note that in the error analysis, the subscripts of the communication period and learning rate represent the index of the local update period rather than the index of the length-$T_0$ wall-clock time intervals considered in Sections 4.1-4.3.) It will be shown that a decreasing sequence of $\tau_j$ is beneficial for guaranteeing convergence.

[Convergence of the adaptive communication scheme] For PASGD with adaptive communication period and adaptive learning rate, suppose the learning rate $\eta_j$ remains the same within each local update period. If the following conditions are satisfied as $J \to \infty$:

$$\sum_{j=0}^{J-1} \eta_j \tau_j \to \infty, \qquad \sum_{j=0}^{J-1} \eta_j^2 \tau_j < \infty, \qquad \sum_{j=0}^{J-1} \eta_j^3 \tau_j^2 < \infty, \qquad (22)$$

then the averaged model is guaranteed to converge to a stationary point:

$$\lim_{J \to \infty} \frac{\sum_{j=0}^{J-1} \eta_j \tau_j\, \mathbb{E}\left[ \| \nabla F(\mathbf{u}_j) \|^2 \right]}{\sum_{j=0}^{J-1} \eta_j \tau_j} = 0, \qquad (23)$$
where $\mathbf{u}_j = \frac{1}{m} \sum_{i=1}^{m} \mathbf{x}_j^{(i)}$ denotes the averaged model at the $j$-th communication round. The proof details and a non-asymptotic result (similar to the bound in Section 3.2 but with variable $\tau$) are provided in the Appendix. In order to understand the meaning of condition (22), let us first consider the case where $\tau_j$ is a constant. In this case, the convergence condition is identical to that of mini-batch SGD Bottou et al. (2018):

$$\sum_{j=0}^{J-1} \eta_j \to \infty, \qquad \sum_{j=0}^{J-1} \eta_j^2 < \infty. \qquad (24)$$

As long as the communication period sequence is bounded, it is trivial to adapt a learning rate scheme satisfying the mini-batch SGD conditions (24) to satisfy (22). In particular, when the communication period sequence is decreasing, the last two terms in (22) become easier to satisfy and place fewer constraints on the learning rate sequence.
5 Experimental Results
5.1 Experimental Setting
Platform. The proposed adaptive communication scheme was implemented in PyTorch Paszke et al. (2017) with Mpi4Py Dalcín et al. (2005). All experiments were conducted on a local cluster where each worker node has an NVIDIA TitanX GPU and a 16-core Intel Xeon CPU.
Dataset. We evaluate our method on image classification tasks on the CIFAR-10 and CIFAR-100 datasets Krizhevsky (2009), which consist of 50,000 training images and 10,000 validation images in 10 and 100 classes, respectively. Each worker machine is assigned a partition of the dataset, which is randomly shuffled after every epoch.
Model. We train the deep neural networks VGG-16 Simonyan & Zisserman (2014) and ResNet-50 He et al. (2016) from scratch. These two neural networks have different architectures and parameter sizes, and thus yield different performance under periodic averaging. As shown in Figure 8, for VGG-16 the communication time is considerably higher than the computation time. Thus, compared to ResNet-50, it requires a larger $\tau$ in order to reduce the runtime per iteration and achieve fast convergence.
Moreover, unless otherwise stated, we use 4 worker nodes with a mini-batch size of 128 per worker; the total mini-batch size per iteration is therefore 512. The initial learning rates for VGG-16 and ResNet-50 are 0.2 and 0.4, respectively, and the weight decay for both networks is 0.0005. In the variable learning rate setting, we decay the learning rate by a fixed factor after a given number of epochs. We set the time interval length $T_0$ to a fixed number of seconds (about 10 epochs for the initial communication period).
Metrics. We compare the performance of the proposed adaptive communication scheme against the following methods with a fixed communication period: (1) baseline: fully synchronous SGD ($\tau = 1$); (2) an extreme high-throughput case with a large fixed $\tau$; (3) a manually tuned case where a moderate value of $\tau$ is selected after trial runs with different communication periods. Instead of training for a fixed number of epochs, we train all methods sufficiently long to converge, and compare the training loss and test accuracy, both of which are recorded every 100 iterations.
5.2 Adaptive Communication in PASGD
We first validate the effectiveness of AdaComm, which uses the communication period update rule (19) combined with (21), on the original PASGD without momentum.
Figure 9 presents the results for VGG-16 for both fixed and variable learning rates. A large communication period initially results in a rapid drop in the error, but the error finally converges to a higher floor. By adapting $\tau$, the proposed AdaComm scheme strikes the best error-runtime tradeoff in all settings. In both Figure 8(a) and Figure 8(b), AdaComm reaches the same training loss as fully synchronous SGD in substantially less wall-clock time.
However, for ResNet-50, the communication overhead is no longer the bottleneck. For a fixed communication period, the negative effect of performing local updates becomes more pronounced and cancels the benefit of the lower communication delay (see Figures 9(b) and 9(c)). It is not surprising that fully synchronous SGD is nearly the best among all fixed-$\tau$ methods in the error-runtime plot. Even in this extreme case, adaptive communication still achieves competitive performance. When combined with learning rate decay, the adaptive scheme is about 1.3 times faster than fully synchronous SGD (see Figure 9(a)).
Table 1 lists the test accuracies in the different settings; we report the best accuracy within a time budget for each setting. The results show that the adaptive communication method has better generalization than fully synchronous SGD. In the variable learning rate case, the adaptive method even gives better test accuracy than PASGD with the best fixed $\tau$.
Model      Method        Fixed lr   Variable lr
VGG-16     Fixed $\tau$  90.5       92.75
           Fixed $\tau$  92.25      92.5
           Fixed $\tau$  92.0       92.4
           AdaComm       91.1       92.85
ResNet-50  Fixed $\tau$  88.76      92.26
           Fixed $\tau$  90.42      92.26
           Fixed $\tau$  88.66      91.8
           AdaComm       89.57      92.42
5.3 Adaptive Communication in Momentum SGD
The adaptive communication scheme was derived from the joint error-runtime analysis of PASGD without momentum. However, it can also be extended to other SGD variants; in this subsection, we show that the proposed method works well for SGD with momentum.
5.3.1 Block Momentum in Periodic Averaging
Before presenting the empirical results, it is worth describing how to introduce momentum in PASGD. The most straightforward way is to apply momentum independently to each local model, where each worker maintains an independent momentum buffer, which is the latest change in its parameter vector. However, this does not account for the potentially dramatic change in the parameters at each averaging step. When the local models are synchronized, the local momentum buffer still contains the update steps from before averaging, resulting in a large momentum term in the first SGD step of the next local update period. When the communication period is large, this large momentum term can sidetrack the SGD descent direction, resulting in slower convergence.
To address this issue, a block momentum scheme was proposed in Chen & Huo (2016) and applied to speech recognition tasks. The basic idea is to treat the accumulated local updates in one period as one big gradient step between two synchronized models, and to introduce a global momentum for this big accumulated step. The update rule can be written as follows in terms of the global momentum buffer $\mathbf{v}_j$:

$$\mathbf{v}_{j+1} = \beta\, \mathbf{v}_j + G_j, \qquad (25)$$
$$\mathbf{x}_{(j+1)\tau} = \mathbf{x}_{j\tau} - \mathbf{v}_{j+1}, \qquad (26)$$

where $G_j = \mathbf{x}_{j\tau} - \frac{1}{m} \sum_{i=1}^{m} \mathbf{x}_{(j+1)\tau}^{(i)}$ represents the accumulated update in the $j$-th local update period and $\beta$ denotes the global momentum factor. Moreover, workers can also apply momentum SGD to their local models, but the local momentum buffers are cleared at the beginning of each local update period. That is, we restart momentum SGD on the local models after every averaging step. The same strategy was also suggested in Microsoft's CNTK framework Seide & Agarwal (2016). In our experiments, we set the global and local momentum factors following Lin et al. (2018). In the fully synchronous case, there is no need to introduce block momentum, and we simply follow the common practice for setting the momentum factor.
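A sketch of the block-momentum averaging step (25)-(26): the accumulated local update over one period is treated as a single gradient step with a global momentum buffer. The tensor shapes, worker drift values, and momentum factor below are illustrative assumptions.

```python
import numpy as np

def block_momentum_step(x_sync, x_locals, v, beta=0.5):
    """One averaging step of PASGD with block momentum (after Chen & Huo, 2016).

    x_sync:   last synchronized model x_{j*tau}
    x_locals: list of m local models after tau local updates
    v:        global momentum buffer
    beta:     global momentum factor (illustrative value)
    """
    g = x_sync - np.mean(x_locals, axis=0)   # accumulated update G_j of the period
    v = beta * v + g                         # (25): global momentum update
    x_next = x_sync - v                      # (26): big step from the synced model
    return x_next, v

x = np.zeros(3)
v = np.zeros(3)
locals_ = [x - 0.1, x - 0.2]                 # two workers drifted during the period
x, v = block_momentum_step(x, locals_, v)
print(x)                                     # moved by the averaged local progress
```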
5.3.2 AdaComm plus Block Momentum
We applied our adaptive communication strategy to PASGD with block momentum and observed significant performance gains on CIFAR-10/100 (see Figure 11). In particular, the adaptive communication scheme has the fastest convergence with respect to wall-clock time throughout training. While fully synchronous SGD gets stuck on a plateau before the first learning rate decay, the training loss of the adaptive method continuously decreases until convergence. For both VGG-16 (Figure 10(b)) and ResNet-50 (Figure 10(a)), AdaComm reaches the target training loss in substantially less wall-clock time than fully synchronous SGD.
6 Concluding Remarks
The design of fast, communication-efficient distributed SGD algorithms that are robust to system variability is vital to enable machine learning training to scale to resource-limited computing nodes. This paper is one of the first to analyze the convergence of error with respect to wall-clock time instead of the number of iterations, by accounting for the dependence of the runtime per iteration on system aspects such as computation and communication delays. We present a theoretical analysis of the error-runtime trade-off for periodic averaging SGD (PASGD), where each worker node performs local updates and the models are averaged after every τ iterations. Based on the joint error-runtime analysis, we design the first (to the best of our knowledge) adaptive communication strategy, called AdaComm, for distributed deep learning. Experimental results using VGGNet and ResNet show that the proposed method can achieve a significant improvement in runtime while achieving the same error floor as fully synchronous SGD. Going beyond periodic-averaging SGD, our idea of adapting the frequency of averaging distributed SGD updates can be easily extended to other SGD frameworks, including elastic averaging Zhang et al. (2015), decentralized SGD Lian et al. (2017), and parameter-server-based training Dean et al. (2012).
Acknowledgments
This work was partially supported by the CMU Dean’s fellowship and an IBM Faculty Award. The experiments were conducted on the ORCA cluster provided by the Parallel Data Lab at CMU, and on Amazon AWS (supported by an AWS credit grant).
References
 Arnold & Groeneveld (1979) Arnold, B. C. and Groeneveld, R. A. Bounds on expectations of linear systematic statistics based on dependent samples. The Annals of Statistics, 7(1):220–223, January 1979. doi: 10.1214/aos/1176344567. URL https://doi.org/10.1214/aos/1176344567.
 Balles et al. (2016) Balles, L., Romero, J., and Hennig, P. Coupling adaptive batch sizes with learning rates. arXiv preprint arXiv:1612.05086, 2016.
 Bottou et al. (2018) Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
 Chaudhari et al. (2017) Chaudhari, P., Baldassi, C., Zecchina, R., Soatto, S., Talwalkar, A., and Oberman, A. Parle: parallelizing stochastic gradient descent. arXiv preprint arXiv:1707.00424, 2017.
 Chen & Huo (2016) Chen, K. and Huo, Q. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pp. 5880–5884. IEEE, 2016.
 Cui et al. (2014) Cui, H., Cipar, J., Ho, Q., Kim, J. K., Lee, S., Kumar, A., Wei, J., Dai, W., Ganger, G. R., Gibbons, P. B., et al. Exploiting bounded staleness to speed up big data analytics. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pp. 37–48, 2014.
 Dalcín et al. (2005) Dalcín, L., Paz, R., and Storti, M. MPI for python. Journal of Parallel and Distributed Computing, 65(9):1108–1115, 2005.
 David & Nagaraja (2003) David, H. A. and Nagaraja, H. N. Order statistics. John Wiley, Hoboken, N.J., 2003.
 Dean et al. (2012) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.
 Dekel et al. (2012) Dekel, O., Gilad-Bachrach, R., Shamir, O., and Xiao, L. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13(Jan):165–202, 2012.
 Dutta et al. (2018) Dutta, S., Joshi, G., Ghosh, S., Dube, P., and Nagpurkar, P. Slow and stale gradients can win the race: Error-runtime trade-offs in distributed SGD. arXiv preprint arXiv:1803.01113, 2018.
 Ghadimi & Lan (2013) Ghadimi, S. and Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
 Goyal et al. (2017) Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large mini-batch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 Gupta et al. (2016) Gupta, S., Zhang, W., and Wang, F. Model accuracy and runtime trade-off in distributed deep learning: A systematic study. In IEEE 16th International Conference on Data Mining (ICDM), pp. 171–180. IEEE, 2016.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
 Iandola et al. (2016) Iandola, F. N., Moskewicz, M. W., Ashraf, K., and Keutzer, K. FireCaffe: near-linear acceleration of deep neural network training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2592–2600, 2016.
 Jiang et al. (2017) Jiang, Z., Balu, A., Hegde, C., and Sarkar, S. Collaborative deep learning in fixed topology networks. In Advances in Neural Information Processing Systems, pp. 5906–5916, 2017.
 Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
 Li et al. (2014) Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.Y. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pp. 583–598, 2014.
 Lian et al. (2017) Lian, X., Zhang, C., Zhang, H., Hsieh, C.J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 5336–5346, 2017.
 Lin et al. (2018) Lin, T., Stich, S. U., and Jaggi, M. Don't use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.
 McMahan et al. (2016) McMahan, H. B., Moore, E., Ramage, D., Hampson, S., et al. Communication-efficient learning of deep networks from decentralized data. arXiv preprint arXiv:1602.05629, 2016.
 Mitliagkas et al. (2016) Mitliagkas, I., Zhang, C., Hadjis, S., and Ré, C. Asynchrony begets momentum, with an application to deep learning. In 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 997–1004. IEEE, 2016.
 Moritz et al. (2015) Moritz, P., Nishihara, R., Stoica, I., and Jordan, M. I. SparkNet: Training deep networks in spark. arXiv preprint arXiv:1511.06051, 2015.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NIPSW, 2017.
 Seide & Agarwal (2016) Seide, F. and Agarwal, A. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2135–2135. ACM, 2016.
 Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Smith et al. (2017a) Smith, S. L., Kindermans, P.J., and Le, Q. V. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017a.
 Smith et al. (2017b) Smith, V., Chiang, C.K., Sanjabi, M., and Talwalkar, A. S. Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434. 2017b.
 Stich (2018) Stich, S. U. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
 Su & Chen (2015) Su, H. and Chen, H. Experiments on parallel training of deep neural network using model averaging. arXiv preprint arXiv:1507.01239, 2015.
 Wang & Joshi (2018) Wang, J. and Joshi, G. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.
 Yu et al. (2018) Yu, H., Yang, S., and Zhu, S. Parallel restarted SGD for nonconvex optimization with faster convergence and less communication. arXiv preprint arXiv:1807.06629, 2018.
 Zhang et al. (2016) Zhang, J., De Sa, C., Mitliagkas, I., and Ré, C. Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365, 2016.
 Zhang et al. (2015) Zhang, S., Choromanska, A. E., and LeCun, Y. Deep learning with elastic averaging SGD. In NIPS’15 Proceedings of the 28th International Conference on Neural Information Processing Systems, pp. 685–693, 2015.
 Zhou & Cong (2017) Zhou, F. and Cong, G. On the convergence properties of a step averaging stochastic gradient descent algorithm for nonconvex optimization. arXiv preprint arXiv:1708.01012, 2017.
Appendix A Inefficient Local Updates
It is worth noting an interesting phenomenon in the convergence of periodic averaging SGD (PASGD). When the learning rate is fixed, PASGD with a fine-tuned communication period achieves better test accuracy than both fully synchronous SGD and the adaptive method, while its training loss remains higher than both (see Figure 9, Figure 10). In particular, on the CIFAR-100 dataset, we observe a noticeable improvement in test accuracy at the best fixed communication period τ. To investigate this phenomenon, we evaluate the test accuracy of PASGD at two frequencies: 1) every τ iterations, i.e., just after each averaging step; and 2) at a fixed iteration interval that is not divisible by τ. In the former case, the reported test accuracy always comes from the synchronized/averaged model. In the latter case, the test accuracy can come from either the synchronized/averaged model or from a local model.
From Figure 12, it is clear that the local models' accuracy is much lower than that of the synchronized model, even after the algorithm has converged. Thus, we conjecture that the improvement in test accuracy occurs only at the synchronized model. That is, after averaging, the test accuracy undergoes a rapid increase, but it decreases again during the following local steps due to the noise in stochastic gradients. Such behavior may depend on the geometric structure of the loss surface of the specific neural network. The observation also reveals that the local updates are inefficient, as they reduce the accuracy and make no progress. In this sense, it is necessary for PASGD to reduce the gradient variance by decaying either the learning rate or the communication period.
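The evaluation protocol above can be sketched in code. This is a hypothetical illustration (the function names and toy parameters are ours, not the paper's code): the averaged model is only materialized at synchronization steps, so an evaluation schedule that is not aligned with the communication period τ will sometimes score a local model instead.

```python
import numpy as np

def averaged_model(local_models):
    """Materialize the synchronized model by parameter averaging."""
    return np.mean(local_models, axis=0)

def evaluation_kinds(total_iters, tau, eval_every):
    """For each evaluation point, report whether the scored model is the
    averaged model (iteration divisible by tau) or a single local model."""
    kinds = []
    for it in range(eval_every, total_iters + 1, eval_every):
        kinds.append("averaged" if it % tau == 0 else "local")
    return kinds

# with tau = 4, evaluating every 4 iterations always hits the averaged
# model, while evaluating every 3 iterations mostly hits local models
kinds_aligned = evaluation_kinds(total_iters=12, tau=4, eval_every=4)
kinds_misaligned = evaluation_kinds(total_iters=12, tau=4, eval_every=3)

# averaging two toy "models" coordinate-wise
avg = averaged_model([np.ones(3), 3 * np.ones(3)])
```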
Appendix B Assumptions for Convergence Analysis
The convergence analysis is conducted under the following assumptions, which are similar to those made in previous work on the analysis of PASGD Zhou & Cong (2017); Yu et al. (2018); Wang & Joshi (2018); Stich (2018). In particular, we make no assumptions on the convexity of the objective function. We also remove the uniform bound assumption on the norm of stochastic gradients.
[Lipschitz smooth & lower bound on F] The objective function F(x) is differentiable and L-Lipschitz smooth, i.e., ‖∇F(x) − ∇F(y)‖ ≤ L‖x − y‖. The function value is bounded below by a scalar F_inf.
[Unbiased estimation] The stochastic gradient g(x) evaluated on a mini-batch is an unbiased estimator of the full-batch gradient: E[g(x)] = ∇F(x).
[Bounded variance] The variance of the stochastic gradient evaluated on a mini-batch is bounded as
E‖g(x) − ∇F(x)‖² ≤ β‖∇F(x)‖² + σ²,
where β and σ² are non-negative constants, inversely proportional to the mini-batch size.
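As a quick numerical illustration of the unbiasedness and variance assumptions above (a toy least-squares problem of our own construction, not from the paper), mini-batch gradients match the full-batch gradient in expectation, and their variance shrinks in inverse proportion to the mini-batch size:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 20000
A = rng.standard_normal((n, d))   # per-sample data
x = rng.standard_normal(d)

# gradient of F(x) = ||Ax||^2 / (2n)
full_grad = A.T @ (A @ x) / n

def minibatch_grad(batch_size):
    """Unbiased mini-batch estimator of full_grad (sampling with replacement)."""
    idx = rng.integers(0, n, size=batch_size)
    B = A[idx]
    return B.T @ (B @ x) / batch_size

def empirical_variance(batch_size, trials=2000):
    gs = np.stack([minibatch_grad(batch_size) for _ in range(trials)])
    return np.mean(np.sum((gs - full_grad) ** 2, axis=1))

v_small = empirical_variance(batch_size=8)
v_large = empirical_variance(batch_size=32)
ratio = v_small / v_large   # should be close to 32/8 = 4

# unbiasedness: the mean of many mini-batch gradients approaches full_grad
mean_gap = np.linalg.norm(
    np.mean(np.stack([minibatch_grad(16) for _ in range(4000)]), axis=0)
    - full_grad)
```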
Appendix C Proof of Theorem 2: Error-runtime Convergence of PASGD
Firstly, let us recall the error analysis of PASGD. We adapt the theorem from Wang & Joshi (2018).
[Error-convergence of PASGD, Wang & Joshi (2018)] For PASGD, under the assumptions in Appendix B, if the learning rate η satisfies ηL + η²L²τ(τ − 1) ≤ 1 and all workers are initialized at the same point x̄₁, then after K iterations, we have
(27) E[(1/K) Σ_{k=1}^{K} ‖∇F(x̄_k)‖²] ≤ 2[F(x̄₁) − F_inf]/(ηK) + ηLσ²/m + η²L²σ²(τ − 1),
where L is the Lipschitz constant of the objective function, σ² is the variance bound of mini-batch stochastic gradients, and x̄_k denotes the averaged model at the k-th iteration. From the runtime analysis in Section 2, we know that the expected runtime per iteration of PASGD is
(28) E[T_iter] = Y + D/τ,
where Y and D denote the expected computation time per iteration and the communication delay per synchronization round, respectively. Accordingly, the total wall-clock time of training for K iterations is
(29) E[T_total] = K(Y + D/τ).
Then, directly substituting K = E[T_total]/(Y + D/τ) into (27), we complete the proof.
Appendix D Proof of Theorem 3: the Best Communication Period
Taking the derivative of the upper bound (14) with respect to the communication period τ, we obtain
(30) −2[F(x̄₁) − F_inf]D/(ηTτ²) + η²L²σ²,
where D and T are the communication delay and the total training time from the runtime model in Section 2. Setting the derivative equal to zero, the communication period is
(31) τ* = sqrt(2[F(x̄₁) − F_inf]D/(η³L²σ²T)).
Since the second derivative of (14),
(32) 4[F(x̄₁) − F_inf]D/(ηTτ³),
is positive for all τ > 0, the bound is convex in τ, and the optimal value obtained in (31) must be a global minimum.
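Using made-up placeholder constants (all values below are ours, chosen only for illustration), one can cross-check a closed-form minimizer of an error-runtime bound of this shape against a brute-force grid search:

```python
import math

# placeholder constants for illustration only
F_gap  = 2.0    # F(x_1) - F_inf
eta    = 0.05   # learning rate
L      = 1.0    # Lipschitz constant
sigma2 = 1.0    # gradient-variance bound
m      = 8      # number of workers
Y, D   = 0.01, 0.2   # computation time per iteration, communication delay
T      = 100.0  # wall-clock time budget

def bound(tau):
    """Error bound at wall-clock time T as a function of the period tau:
    optimization error + variance floor + local-drift penalty."""
    return (2 * F_gap * (Y + D / tau) / (eta * T)
            + eta * L * sigma2 / m
            + eta**2 * L**2 * sigma2 * (tau - 1))

# closed-form minimizer, obtained by setting d bound / d tau = 0
tau_star = math.sqrt(2 * F_gap * D / (eta**3 * L**2 * sigma2 * T))

# brute-force check over a fine grid of periods
grid = [1 + 0.01 * i for i in range(20000)]
tau_num = min(grid, key=bound)
```

With these placeholders the closed form gives tau_star = 8, and the grid search lands on the same value, confirming that the stationary point is the global minimum of the convex-in-tau bound.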
Appendix E Proof of Theorem 4: Error-Convergence of Adaptive Communication Scheme
E.1 Notation
In order to facilitate the analysis, we first introduce some useful notation. Define matrices that concatenate all local models and gradients:
(33) X_k = [x_k^(1), x_k^(2), …, x_k^(m)],
(34) G_k = [g(x_k^(1)), g(x_k^(2)), …, g(x_k^(m))].
Besides, define the averaging matrix J = 11ᵀ/m, where 1 denotes the all-ones column vector. Unless otherwise stated, 1 is a column vector of size m, and the matrix J and the identity matrix I are of size m × m, where m is the number of workers.
E.2 Proof
Let us first focus on the j-th local update period, where j ∈ {0, 1, 2, …}. Without loss of generality, suppose the iteration index in this local update period starts from jτ + 1 and ends with (j + 1)τ. Then, for the k-th iteration in this period, we have the following lemma.
[Lemma 1 in Wang & Joshi (2018)] For PASGD, under the assumptions in Appendix B, at the k-th iteration, we have the following bound on the objective value:
(35) E[F(x̄_{k+1})] − E[F(x̄_k)] ≤ −(η/2) E‖∇F(x̄_k)‖² + (ηL²/(2m)) Σ_{i=1}^{m} E‖x̄_k − x_k^(i)‖² + η²Lσ²/(2m),
where x̄_k denotes the averaged model at the k-th iteration. Taking the total expectation and summing over all iterates in the j-th local update period, we can obtain
(36) E[F(x̄_{(j+1)τ+1})] − E[F(x̄_{jτ+1})] ≤ −(η/2) Σ_{k=jτ+1}^{(j+1)τ} E‖∇F(x̄_k)‖² + (ηL²/(2m)) Σ_{k=jτ+1}^{(j+1)τ} Σ_{i=1}^{m} E‖x̄_k − x_k^(i)‖² + η²Lσ²τ/(2m).
Next, we provide an upper bound for the last term in (36). Note that
(37) Σ_{i=1}^{m} ‖x̄_k − x_k^(i)‖² = ‖X_k(I − J)‖²_F
(38) = ‖(X_{jτ+1} − η Σ_{s=jτ+1}^{k−1} G_s)(I − J)‖²_F
(39) = ‖X_{jτ+1}(I − J) − η Σ_{s=jτ+1}^{k−1} G_s(I − J)‖²_F
(40) = η² ‖Σ_{s=jτ+1}^{k−1} G_s(I − J)‖²_F,
where (40) follows from the fact that all workers start from the same point at the beginning of each local update period, i.e., X_{jτ+1}(I − J) = 0. Accordingly, we have
(41) E[Σ_{i=1}^{m} ‖x̄_k − x_k^(i)‖²] = η² E‖Σ_{s=jτ+1}^{k−1} G_s(I − J)‖²_F
(42) ≤ η² E‖Σ_{s=jτ+1}^{k−1} G_s‖²_F,
where the inequality (42) is due to the operator norm of I − J being no larger than 1. Furthermore, using the fact E‖z‖² = ‖E[z]‖² + E‖z − E[z]‖², one can get
(43) E‖Σ_{s=jτ+1}^{k−1} G_s‖²_F = E‖Σ_{s=jτ+1}^{k−1} (G_s − ∇F_s)‖²_F + E‖Σ_{s=jτ+1}^{k−1} ∇F_s‖²_F
(44) = T₁ + T₂,
where ∇F_s = [∇F(x_s^(1)), …, ∇F(x_s^(m))] stacks the full-batch gradients at the local models.
For the first term T₁, since the stochastic gradients are unbiased, all cross terms are zero. Thus, combining with the bounded-variance assumption in Appendix B, we have
(45) T₁ = Σ_{s=jτ+1}^{k−1} E‖G_s − ∇F_s‖²_F
(46) = Σ_{s=jτ+1}^{k−1} Σ_{i=1}^{m} E‖g(x_s^(i)) − ∇F(x_s^(i))‖²
(47) ≤ β Σ_{s=jτ+1}^{k−1} Σ_{i=1}^{m} E‖∇F(x_s^(i))‖² + mσ²(k − jτ − 1).
For the second term in (44), directly applying Jensen's inequality, we get
(48) T₂ = E‖Σ_{s=jτ+1}^{k−1} ∇F_s‖²_F
(49) ≤ (k − jτ − 1) Σ_{s=jτ+1}^{k−1} Σ_{i=1}^{m} E‖∇F(x_s^(i))‖².
Substituting the bounds of T₁ and T₂ into (44), we obtain
(50) E[Σ_{i=1}^{m} ‖x̄_k − x_k^(i)‖²] ≤ η²(β + k − jτ − 1) Σ_{s=jτ+1}^{k−1} Σ_{i=1}^{m} E‖∇F(x_s^(i))‖² + η² mσ²(k − jτ − 1).
Recalling the upper bound (36), we further derive the following bound:
(51) E[F(x̄_{(j+1)τ+1})] − E[F(x̄_{jτ+1})] ≤ −(η/2) Σ_{k=jτ+1}^{(j+1)τ} E‖∇F(x̄_k)‖² + (η³L²/(2m)) Σ_{k=jτ+1}^{(j+1)τ} [(β + k − jτ − 1) Σ_{s=jτ+1}^{k−1} Σ_{i=1}^{m} E‖∇F(x_s^(i))‖² + mσ²(k − jτ − 1)] + η²Lσ²τ/(2m)
(52) ≤ −(η/2) Σ_{k=jτ+1}^{(j+1)τ} E‖∇F(x̄_k)‖² + (η³L²(β + τ − 1)τ/(2m)) Σ_{s=jτ+1}^{(j+1)τ} Σ_{i=1}^{m} E‖∇F(x_s^(i))‖² + η³L²σ²τ(τ − 1)/4 + η²Lσ²τ/(2m)
(53) = Σ_{k=jτ+1}^{(j+1)τ} [−(η/2) E‖∇F(x̄_k)‖² + (η³L²(β + τ − 1)τ/(2m)) Σ_{i=1}^{m} E‖∇F(x_k^(i))‖²] + η³L²σ²τ(τ − 1)/4 + η²Lσ²τ/(2m),
where (52) uses k − jτ − 1 ≤ τ − 1, extends the inner sum over s to the whole period, and evaluates Σ_{k}(k − jτ − 1) = τ(τ − 1)/2. Then, since ‖∇F(x_k^(i))‖² ≤ 2‖∇F(x̄_k)‖² + 2L²‖x̄_k − x_k^(i)‖² by Lipschitz smoothness, we have
(54) E[F(x̄_{(j+1)τ+1})] − E[F(x̄_{jτ+1})] ≤ −(η/2)(1 − 2η²L²(β + τ − 1)τ) Σ_{k=jτ+1}^{(j+1)τ} E‖∇F(x̄_k)‖² + (η³L⁴(β + τ − 1)τ/m) Σ_{k=jτ+1}^{(j+1)τ} Σ_{i=1}^{m} E‖x̄_k − x_k^(i)‖²
(55) + η³L²σ²τ(τ − 1)/4 + η²Lσ²τ/(2m).