O(1) Communication for Distributed SGD through Two-Level Gradient Averaging

06/12/2020
by Subhadeep Bhattacharya, et al.

Large neural network models present a hefty communication challenge to distributed Stochastic Gradient Descent (SGD), with a communication complexity of O(n) per worker for a model of n parameters. Many sparsification and quantization techniques have been proposed to compress the gradients, some reducing the communication complexity to O(k), where k << n. In this paper, we introduce a strategy called two-level gradient averaging (A2SGD) to consolidate all gradients down to merely two local averages per worker before the computation of two global averages for an updated model. A2SGD also retains local errors to maintain the variance for fast convergence. Our theoretical analysis shows that A2SGD converges similarly to the default distributed SGD algorithm. Our evaluation validates the theoretical conclusion and demonstrates that A2SGD significantly reduces the communication traffic per worker and improves the overall training time of LSTM-PTB by 3.2x and 23.2x compared to Top-K and QSGD, respectively. To the best of our knowledge, A2SGD is the first to achieve O(1) communication complexity per worker for distributed SGD.
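
The abstract describes A2SGD only at a high level: each worker collapses its gradient into two local averages, the workers combine those into two global averages, and whatever the two averages cannot represent is kept locally as error feedback. The paper's actual partition and reconstruction rules are not spelled out here, so the Python/NumPy sketch below is a hypothetical illustration of that data flow only. It assumes the error-compensated gradient is split by sign into two groups, each group is reduced to one scalar, the two scalars are averaged across workers, and the residual is retained as local error; the names a2sgd_round and allreduce_mean, the sign-based partition, and the per-coordinate reconstruction are all assumptions, not the authors' implementation.

import numpy as np

def allreduce_mean(values):
    # Stand-in for an MPI/NCCL all-reduce: the "cluster" is simulated
    # in-process, so averaging the per-worker scalars suffices.
    return float(np.mean(values))

def a2sgd_round(worker_grads, worker_errors):
    # One hypothetical A2SGD communication round for all simulated workers.
    # Each worker reduces its error-compensated gradient to two scalars
    # (assumed here: the means of its non-negative and negative coordinates),
    # only those two scalars are exchanged, and the unrepresented remainder
    # is kept as local error for the next round.
    pos_locals, neg_locals, masks, corrected = [], [], [], []
    for g, e in zip(worker_grads, worker_errors):
        c = g + e                                  # error feedback: fold in last round's residual
        m = c >= 0                                 # assumed two-group partition by sign
        pos_locals.append(c[m].mean() if m.any() else 0.0)      # local average 1
        neg_locals.append(c[~m].mean() if (~m).any() else 0.0)  # local average 2
        masks.append(m)
        corrected.append(c)

    # O(1) communication per worker: exactly two scalars cross the network.
    global_pos = allreduce_mean(pos_locals)
    global_neg = allreduce_mean(neg_locals)

    updates, new_errors = [], []
    for c, m in zip(corrected, masks):
        approx = np.where(m, global_pos, global_neg)  # each coordinate gets its group's global average
        updates.append(approx)
        new_errors.append(c - approx)                 # retained locally to preserve variance
    return updates, new_errors

# Toy usage: 4 workers, a 10-parameter "model".
rng = np.random.default_rng(0)
grads = [rng.normal(size=10) for _ in range(4)]
errors = [np.zeros(10) for _ in range(4)]
updates, errors = a2sgd_round(grads, errors)
print(updates[0], errors[0])

In a real deployment the two scalars per worker would travel through an MPI or NCCL all-reduce, which is where the O(1) communication volume per worker comes from; in this in-process simulation the all-reduce is simply a mean over the workers' values, and the per-coordinate group membership never leaves the worker.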


Related research

05/31/2020  DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging
The state-of-the-art deep learning algorithms rely on distributed traini...

07/13/2020  Adaptive Periodic Averaging: A Practical Approach to Reducing Communication in Distributed Learning
Stochastic Gradient Descent (SGD) is the key learning algorithm for many...

03/15/2018  GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent
In this paper, we present GossipGraD - a gossip communication protocol b...

03/12/2020  Machine Learning on Volatile Instances
Due to the massive size of the neural network models and training datase...

08/16/2019  NUQSGD: Improved Communication Efficiency for Data-parallel SGD via Nonuniform Quantization
As the size and complexity of models and datasets grow, so does the need...

05/21/2020  rTop-k: A Statistical Estimation Approach to Distributed SGD
The large communication cost for exchanging gradients between different ...

08/22/2018  Cooperative SGD: A unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms
State-of-the-art distributed machine learning suffers from significant d...
