Stochastic Gradient Push for Distributed Deep Learning

11/27/2018
by   Mahmoud Assran, et al.
0

Large mini-batch parallel SGD is commonly used for distributed training of deep networks. Approaches that use tightly-coupled exact distributed averaging based on AllReduce are sensitive to slow nodes and high-latency communication. In this work we show the applicability of Stochastic Gradient Push (SGP) for distributed training. SGP uses a gossip algorithm called PushSum for approximate distributed averaging, allowing for much more loosely coupled communications, which can be beneficial in high-latency or high-variability scenarios. The tradeoff is that approximate distributed averaging injects additional noise in the gradient which can affect the train and test accuracies. We prove that SGP converges to a stationary point of smooth, non-convex objective functions. Furthermore, we validate empirically the potential of SGP. For example, using 32 nodes with 8 GPUs per node to train ResNet-50 on ImageNet, where nodes communicate over 10Gbps Ethernet, SGP completes 90 epochs in around 1.6 hours while AllReduce SGD takes over 5 hours, and the top-1 validation accuracy of SGP remains within 1.2 using AllReduce SGD.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
07/05/2015

Experiments on Parallel Training of Deep Neural Network using Model Averaging

In this work we apply model averaging to parallel training of deep neura...
research
03/12/2019

A Distributed Hierarchical SGD Algorithm with Sparse Global Reduction

Reducing communication overhead is a big challenge for large-scale distr...
research
12/30/2020

Crossover-SGD: A gossip-based communication in distributed deep learning for alleviating large mini-batch problem and enhancing scalability

Distributed deep learning is an effective way to reduce the training tim...
research
02/21/2020

Overlap Local-SGD: An Algorithmic Approach to Hide Communication Delays in Distributed SGD

Distributed stochastic gradient descent (SGD) is essential for scaling t...
research
07/13/2020

Adaptive Periodic Averaging: A Practical Approach to Reducing Communication in Distributed Learning

Stochastic Gradient Descent (SGD) is the key learning algorithm for many...
research
07/03/2017

Parle: parallelizing stochastic gradient descent

We propose a new algorithm called Parle for parallel training of deep ne...
research
03/09/2020

Revisiting SGD with Increasingly Weighted Averaging: Optimization and Generalization Perspectives

Stochastic gradient descent (SGD) has been widely studied in the literat...

Please sign up or login with your details

Forgot password? Click here to reset