A Distributed SGD Algorithm with Global Sketching for Deep Learning Training Acceleration

08/13/2021
by LingFei Dai, et al.

Distributed training is an effective way to accelerate the training of large-scale deep learning models. However, the parameter exchange and synchronization in distributed stochastic gradient descent (SGD) introduce substantial communication overhead. Gradient compression is an effective way to reduce this overhead. Among synchronous SGD compression methods, many Top-k sparsification-based gradient compression schemes have been proposed to reduce communication. However, centralized approaches based on parameter servers suffer from a single point of failure and limited scalability, while decentralized approaches with global parameter exchange may slow the convergence of training. In contrast with Top-k based methods, we propose a gradient compression method with global gradient vector sketching, named global-sketching SGD (gs-SGD), which uses a Count-Sketch data structure to store the gradients and thus reduces the loss of accuracy during training. gs-SGD achieves better convergence efficiency on deep learning models and has a communication complexity of O(log d · log P), where d is the number of model parameters and P is the number of workers. We conducted experiments on GPU clusters to verify that our method has better convergence efficiency than global Top-k and sketching-based methods. In addition, gs-SGD achieves 1.3-3.1x higher throughput than gTop-k, and 1.1-1.2x higher throughput than the original Sketched-SGD.
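The abstract describes compressing gradients with a Count-Sketch so that sketches from different workers can be summed with a cost independent of the model dimension. The sketch below is a minimal, illustrative Count-Sketch over a gradient vector, assuming conventional bucket and sign hashes; the class name, parameters (rows, cols, seed), and querying by per-row medians are choices made here for illustration, not the authors' gs-SGD implementation.

```python
# Illustrative Count-Sketch for gradient compression (assumed interface,
# not the gs-SGD reference code): a gradient of dimension d is folded into
# a small rows x cols table; tables from different workers can be summed
# element-wise, so the communication cost does not grow with d.
import numpy as np

class CountSketch:
    def __init__(self, rows: int, cols: int, dim: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.rows, self.cols = rows, cols
        # Each row has its own bucket hash and sign hash over the d coordinates.
        self.bucket = rng.integers(0, cols, size=(rows, dim))
        self.sign = rng.choice([-1.0, 1.0], size=(rows, dim))
        self.table = np.zeros((rows, cols))

    def accumulate(self, grad: np.ndarray) -> None:
        # Add a local gradient into the sketch (repeated bucket indices are
        # handled correctly by np.add.at).
        for r in range(self.rows):
            np.add.at(self.table[r], self.bucket[r], self.sign[r] * grad)

    def query(self, dim: int) -> np.ndarray:
        # Estimate each coordinate as the median over rows of sign * bucket value;
        # heavy (large-magnitude) coordinates are recovered approximately.
        est = np.empty((self.rows, dim))
        for r in range(self.rows):
            est[r] = self.sign[r] * self.table[r, self.bucket[r]]
        return np.median(est, axis=0)

# Usage: compress a sparse-ish gradient, then recover an approximation.
d = 10_000
grad = np.zeros(d)
grad[:50] = np.random.randn(50) * 10      # a few "heavy" coordinates
sketch = CountSketch(rows=5, cols=512, dim=d)
sketch.accumulate(grad)
approx = sketch.query(d)                  # approximate de-sketched gradient
```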

