A Distributed SGD Algorithm with Global Sketching for Deep Learning Training Acceleration

by LingFei Dai, et al.

Distributed training is an effective way to accelerate the training of large-scale deep learning models. However, the parameter exchange and synchronization in distributed stochastic gradient descent (SGD) introduce substantial communication overhead. Gradient compression is an effective way to reduce this overhead, and many Top-k sparsification based compression methods have been proposed for synchronous SGD. However, centralized methods based on parameter servers suffer from a single point of failure and limited scalability, while decentralized methods with global parameter exchange may reduce the convergence rate of training. In contrast with Top-k based methods, we propose a gradient compression method with global gradient vector sketching, named global-sketching SGD (gs-SGD), which uses the Count-Sketch structure to store the gradients and thereby reduces the loss of accuracy during training. gs-SGD converges efficiently on deep learning models and has a communication complexity of O(log d · log P), where d is the number of model parameters and P is the number of workers. We conducted experiments on GPU clusters to verify that our method converges more efficiently than global Top-k and sketching-based methods. In addition, gs-SGD achieves 1.3-3.1x higher throughput than gTop-k and 1.1-1.2x higher throughput than the original Sketched-SGD.
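The Count-Sketch idea behind gs-SGD can be illustrated with a minimal sketch (a toy illustration under our own assumptions, not the paper's implementation; all names are ours): each worker sketches its sparse gradient into an r x c table of signed counters, the sketches are merged by element-wise summation (Count-Sketch is linear, which is what permits all-reduce style aggregation), and large gradient coordinates are estimated with a median point query.

```python
import hashlib
import random

class CountSketch:
    """Minimal Count-Sketch: r rows of c counters with per-row index
    and +/-1 sign hashes. Illustrative only, not the gs-SGD code."""

    def __init__(self, rows, cols, seed=0):
        self.rows, self.cols = rows, cols
        rng = random.Random(seed)
        # One (index-hash, sign-hash) salt pair per row.
        self.salts = [(rng.getrandbits(32), rng.getrandbits(32))
                      for _ in range(rows)]
        self.table = [[0.0] * cols for _ in range(rows)]

    def _hash(self, key, salt, mod):
        h = hashlib.blake2b(f"{salt}:{key}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % mod

    def add(self, key, value):
        # Scatter the update into one signed counter per row.
        for r, (s_idx, s_sign) in enumerate(self.salts):
            c = self._hash(key, s_idx, self.cols)
            sign = 1 if self._hash(key, s_sign, 2) == 0 else -1
            self.table[r][c] += sign * value

    def query(self, key):
        # Median of the r signed counter readings is robust to
        # collisions in a minority of rows.
        est = []
        for s_idx, s_sign in self.salts:
            c = self._hash(key, s_idx, self.cols)
            sign = 1 if self._hash(key, s_sign, 2) == 0 else -1
            est.append(sign * self.table[self.salts.index((s_idx, s_sign))][c])
        est.sort()
        return est[len(est) // 2]

    def merge(self, other):
        # Linearity: summing tables yields the sketch of the summed
        # gradient vector, so workers' sketches can be aggregated.
        for r in range(self.rows):
            for c in range(self.cols):
                self.table[r][c] += other.table[r][c]

# Two "workers" sketch their (sparse) gradients, sketches are merged,
# and the heavy coordinate 7 is recovered approximately (4.0 + 3.5).
g1 = {0: 0.5, 7: 4.0, 42: -0.2}
g2 = {7: 3.5, 13: 0.1}
s1, s2 = CountSketch(5, 64), CountSketch(5, 64)
for k, v in g1.items():
    s1.add(k, v)
for k, v in g2.items():
    s2.add(k, v)
s1.merge(s2)
print(s1.query(7))  # close to 7.5 up to small collision error
```

The communication win is that each worker transmits only the r x c table (size O(log d) in the sketch-size regime the abstract assumes) rather than the full d-dimensional gradient.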



