Sparse Communication for Training Deep Networks

09/19/2020
by Negar Foroutan Eghlidi, et al.

Synchronous stochastic gradient descent (SGD) is the most common method for distributed training of deep learning models. In this algorithm, each worker shares its local gradients with the others and updates the parameters using the average of all workers' gradients. Although distributed training reduces computation time, the communication overhead of exchanging gradients forms a scalability bottleneck. Many compression techniques have been proposed to reduce the number of gradients that need to be communicated, but compressing the gradients introduces yet another overhead. In this work, we study several compression schemes and identify how three key parameters affect performance. We also provide a set of insights on how to increase performance and introduce a simple sparsification scheme, random-block sparsification, which reduces communication while keeping performance close to that of standard SGD.
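To make the setting concrete, here is a minimal sketch of a synchronous SGD step with block-based gradient sparsification. The abstract does not spell out how the random block is chosen, so this sketch assumes each worker keeps one randomly selected contiguous block of its flattened gradient and zeros out the rest; the function name `random_block_sparsify`, the `block_fraction` parameter, and the simulated four-worker averaging (a stand-in for an all-reduce) are illustrative, not the paper's implementation.

```python
import numpy as np

def random_block_sparsify(grad, block_fraction, rng):
    """Keep one randomly chosen contiguous block of the flattened gradient
    (covering `block_fraction` of its entries) and zero out the rest.
    This is a hypothetical reading of a random-block sparsification scheme."""
    flat = grad.ravel().copy()
    block_len = max(1, int(block_fraction * flat.size))
    start = rng.integers(0, flat.size - block_len + 1)
    sparse = np.zeros_like(flat)
    sparse[start:start + block_len] = flat[start:start + block_len]
    return sparse.reshape(grad.shape)

# Simulated synchronous SGD step: each worker sparsifies its local gradient,
# the sparse gradients are averaged (the communication step), and every
# worker applies the same averaged update to its parameters.
rng = np.random.default_rng(0)
params = np.zeros(10)
local_grads = [rng.normal(size=10) for _ in range(4)]   # 4 simulated workers
sparse_grads = [random_block_sparsify(g, 0.25, rng) for g in local_grads]
avg_grad = np.mean(sparse_grads, axis=0)                # stand-in for all-reduce
params -= 0.1 * avg_grad                                # SGD update, lr = 0.1
```

Because each worker transmits only a block of the gradient, the volume communicated per step shrinks by roughly the chosen fraction; how much this costs in accuracy relative to standard SGD is exactly the trade-off the paper studies.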


