Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD

02/21/2020
by Jianyu Wang, et al.

Distributed stochastic gradient descent (SGD) is essential for scaling machine learning algorithms to a large number of computing nodes. However, infrastructure variability, such as high communication delay or random node slowdowns, greatly impedes the performance of distributed SGD, especially in wireless systems or sensor networks. In this paper, we propose an algorithmic approach named Overlap-Local-SGD (and its momentum variant) that overlaps communication with computation so as to speed up the distributed training procedure; the approach also helps mitigate straggler effects. We achieve this by adding an anchor model on each node. After multiple local updates, locally trained models are pulled back towards the synchronized anchor model rather than communicating directly with other nodes. Experimental results from training a deep neural network on the CIFAR-10 dataset demonstrate the effectiveness of Overlap-Local-SGD. We also provide a convergence guarantee for the proposed algorithm under non-convex objective functions.
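
The mechanism described in the abstract translates into a short training loop. Below is a minimal single-process Python/PyTorch sketch of that loop, written from the abstract alone: the hyperparameter names `tau` (local steps per round) and `alpha` (pull-back strength), the helper `sync_anchor`, and the use of a background thread are illustrative assumptions, not the paper's API. The all-reduce is a no-op here so the sketch runs standalone; in a real deployment it would be a non-blocking collective, e.g. via torch.distributed.

```python
import copy
import threading

import torch
import torch.nn as nn

def sync_anchor(anchor):
    # Stand-in for the background all-reduce that averages the anchor
    # model across nodes; a no-op so the sketch runs on one process.
    pass

def train(model, data_loader, tau=8, alpha=0.5, lr=0.1):
    loss_fn = nn.MSELoss()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    anchor = copy.deepcopy(model)      # anchor model kept on each node
    comm = None                        # handle for the in-flight sync
    for step, (x, y) in enumerate(data_loader, start=1):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()                     # one local SGD update
        if step % tau == 0:            # end of a round of local updates
            if comm is not None:
                comm.join()            # make sure the last sync finished
            with torch.no_grad():
                # Pull the local model back toward the synchronized anchor
                # instead of blocking on communication with other nodes.
                for p, a in zip(model.parameters(), anchor.parameters()):
                    p.mul_(1.0 - alpha).add_(a, alpha=alpha)
                    a.copy_(p)         # anchor tracks the pulled-back model
            # Launch the next anchor synchronization in the background so
            # it overlaps with the following tau local computation steps.
            comm = threading.Thread(target=sync_anchor, args=(anchor,))
            comm.start()
    if comm is not None:
        comm.join()

# Toy usage: a linear model on random data, tau = 8 mini-batches per round.
xs, ys = torch.randn(64, 10), torch.randn(64, 1)
loader = list(zip(xs.split(8), ys.split(8)))
train(nn.Linear(10, 1), loader)
```

The background thread is the structural point: the expensive averaging of the anchor runs concurrently with the next round of local gradient steps, which is what hides the communication delay, and a slow node only delays the pull-back rather than every gradient step, which is one way the straggler effect is softened.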

Related research

05/09/2019
On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization
Recent developments on large-scale distributed machine learning applicat...

11/20/2019
Local AdaAlter: Communication-Efficient Stochastic Gradient Descent with Adaptive Learning Rates
Recent years have witnessed the growth of large-scale distributed machin...

10/19/2018
Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD
Large-scale machine learning training, in particular distributed stochas...

11/12/2019
Throughput Prediction of Asynchronous SGD in TensorFlow
Modern machine learning frameworks can train neural networks using multi...

11/27/2018
Stochastic Gradient Push for Distributed Deep Learning
Large mini-batch parallel SGD is commonly used for distributed training ...

07/27/2020
Multi-Level Local SGD for Heterogeneous Hierarchical Networks
We propose Multi-Level Local SGD, a distributed gradient method for lear...

05/19/2021
Communication-Efficient Distributed SGD using Preamble-based Random Access
In this paper, we study communication-efficient distributed stochastic g...
