S2 Reducer: High-Performance Sparse Communication to Accelerate Distributed Deep Learning

10/05/2021
by Keshi Ge, et al.

Distributed stochastic gradient descent (SGD) is widely used in large-scale deep learning, and the gradient collective communication method is vital to the training scalability of a distributed deep learning system. Collective communication such as AllReduce has been widely adopted in distributed SGD to reduce communication time. However, AllReduce consumes substantial bandwidth even though gradients are often sparse: many gradient values are zero and could be compressed to save bandwidth. To reduce this sparse gradient communication overhead, we propose Sparse-Sketch Reducer (S2 Reducer), a novel sketch-based sparse gradient aggregation method with convergence guarantees. S2 Reducer reduces the communication cost by compressing only the non-zero gradients with a count-sketch and a bitmap, and it enables efficient AllReduce operators for parallel SGD training. We perform an extensive evaluation against four state-of-the-art methods over five training models. Our results show that S2 Reducer converges to the same accuracy, reduces sparse communication overhead by 81%, and achieves a 1.8× speedup compared with state-of-the-art approaches.
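The core mechanism described above, compressing only the non-zero gradient values with a count-sketch while recording their positions in a bitmap so that both structures can be merged by an ordinary sum-style AllReduce, can be illustrated with the minimal Python sketch below. This is not the authors' implementation: the sketch dimensions, hash functions, and the simulated four-worker aggregation are assumptions made purely for illustration.

```python
# A minimal, hypothetical sketch of the idea behind S2 Reducer, not the
# authors' implementation: compress only the non-zero gradient values with
# a count-sketch, record their positions in a bitmap, and merge both
# structures across workers (count-sketches are linear, so a sum-AllReduce
# on them equals the sketch of the summed gradient; bitmaps merge by OR).
# The sketch size, hash functions, and simulated workers are assumptions.
import numpy as np

ROWS, COLS = 3, 64                      # count-sketch dimensions (assumed)
PRIME = 2_147_483_647                   # prime for simple universal hashing
rng = np.random.default_rng(0)
A = rng.integers(1, PRIME, size=(ROWS, 2))   # bucket-hash parameters
B = rng.integers(1, PRIME, size=(ROWS, 2))   # sign-hash parameters

def _bucket(idx, r):
    return int((A[r, 0] * idx + A[r, 1]) % PRIME % COLS)

def _sign(idx, r):
    return 1 if (B[r, 0] * idx + B[r, 1]) % PRIME % 2 == 0 else -1

def compress(grad):
    """Return a bitmap of non-zero positions and a count-sketch of their values."""
    bitmap = grad != 0
    sketch = np.zeros((ROWS, COLS))
    for idx in np.flatnonzero(bitmap):
        for r in range(ROWS):
            sketch[r, _bucket(idx, r)] += _sign(idx, r) * grad[idx]
    return bitmap, sketch

def decompress(bitmap, sketch, dim):
    """Estimate each non-zero coordinate as the median over the sketch rows."""
    grad = np.zeros(dim)
    for idx in np.flatnonzero(bitmap):
        grad[idx] = np.median([_sign(idx, r) * sketch[r, _bucket(idx, r)]
                               for r in range(ROWS)])
    return grad

# Simulate four workers with sparse gradients and aggregate their compressed
# representations; the two reductions below stand in for AllReduce calls.
dim, workers = 1000, 4
grads = []
for _ in range(workers):
    g = np.zeros(dim)
    nz = rng.choice(dim, size=20, replace=False)
    g[nz] = rng.standard_normal(20)
    grads.append(g)

bitmaps, sketches = zip(*(compress(g) for g in grads))
merged_bitmap = np.logical_or.reduce(bitmaps)    # union of non-zero positions
merged_sketch = np.sum(sketches, axis=0)         # linearity of count-sketch
approx_sum = decompress(merged_bitmap, merged_sketch, dim)
exact_sum = np.sum(grads, axis=0)
print("max abs error:", np.abs(approx_sum - exact_sum).max())
```

Because the bitmap merges by OR (element-wise max) and the count-sketch merges by element-wise sum, both fit directly into standard AllReduce primitives; the estimation error of such a scheme depends on the sketch width relative to the number of non-zero gradient coordinates.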

Related research

08/13/2021 · A Distributed SGD Algorithm with Global Sketching for Deep Learning Training Acceleration
Distributed training is an effective way to accelerate the training proc...

01/19/2022 · Near-Optimal Sparse Allreduce for Distributed Deep Learning
Communication overhead is one of the major obstacles to train large deep...

12/08/2021 · FastSGD: A Fast Compressed SGD Framework for Distributed Machine Learning
With the rapid increase of big data, distributed Machine Learning (ML) h...

02/05/2021 · DeepReduce: A Sparse-tensor Communication Framework for Distributed Deep Learning
Sparse tensors appear frequently in distributed deep learning, either as...

04/03/2023 · SparDL: Distributed Deep Learning Training with Efficient Sparse Communication
Top-k sparsification has recently been widely used to reduce the communi...

05/22/2017 · TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
High network communication cost for synchronizing gradients and paramete...

05/11/2022 · Libra: In-network Gradient Aggregation for Speeding up Distributed Sparse Deep Training
Distributed sparse deep learning has been widely used in many internet-s...
