Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

11/20/2019
by Shaohuai Shi, et al.

To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by the communication overhead. Two approaches, pipelining and gradient sparsification, have been separately proposed to alleviate the impact of communication overhead. Yet, existing gradient sparsification methods can only initiate communication after backpropagation has completed, and hence miss the opportunity for pipelining. In this paper, we propose a new distributed optimization method named LAGS-SGD, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme. In LAGS-SGD, each worker independently selects a small set of "significant" gradients from each layer, and the size of this set can adapt to the communication-to-computation ratio of that layer. The layer-wise nature of LAGS-SGD opens the opportunity to overlap communication with computation, while its adaptive nature makes the communication time flexible to control. We prove that LAGS-SGD has convergence guarantees, and that it has the same order of convergence rate as vanilla S-SGD under a weak analytical assumption. Extensive experiments are conducted to verify the analytical assumption and the convergence performance of LAGS-SGD. Experimental results show that LAGS-SGD achieves around 40% to 95% of the maximum benefit of pipelining on a 16-node GPU cluster. Combining the benefits of pipelining and sparsification, the speedup of LAGS-SGD over S-SGD ranges from 2.86x to 8.52x on the tested CNN and LSTM models, with no obvious loss of model accuracy.
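To make the layer-wise selection concrete, below is a minimal Python/NumPy sketch of per-layer top-k gradient selection. The function names (sparsify_layer, lags_step) and the per-layer density parameters are hypothetical illustrations, not the authors' implementation; the sketch only shows the core idea that each layer is sparsified independently with its own density, so its communication can begin as soon as that layer's backward pass finishes rather than waiting for the whole backward pass.

    import numpy as np

    def sparsify_layer(grad, density):
        """Keep only the largest-magnitude entries of one layer's gradient.

        Returns (indices, values) of the retained entries; the remaining
        entries are treated as zero for communication. Sparsification
        methods of this family commonly accumulate the unselected
        gradients locally (error feedback); that detail is omitted here.
        """
        flat = grad.ravel()
        k = max(1, int(density * flat.size))
        # argpartition finds the k largest |g| without a full sort.
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        return idx, flat[idx]

    def lags_step(layer_grads, densities):
        """Layer-wise sparsification: layer l uses its own density rho_l,
        chosen (in the adaptive scheme) from that layer's
        communication-to-computation ratio, so small ratios can afford
        denser updates while communication-bound layers stay sparse."""
        return [sparsify_layer(g, rho) for g, rho in zip(layer_grads, densities)]

Because sparsify_layer is applied per layer, in an actual training loop each call can be issued immediately after the corresponding layer's gradients are computed, which is what enables overlapping communication with the remaining backpropagation.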
