Global Momentum Compression for Sparse Communication in Distributed SGD

by Shen-Yi Zhao, et al.

With the rapid growth of data, distributed stochastic gradient descent (DSGD) has been widely used for solving large-scale machine learning problems. Because of network latency and limited bandwidth, communication has become the bottleneck of DSGD when training large-scale models such as deep neural networks. Communication compression with sparsified gradients, abbreviated as sparse communication, has been widely used to reduce the communication cost of DSGD. A recent method, deep gradient compression (DGC), combines memory gradient and momentum SGD for sparse communication. DGC has achieved promising performance in practice, but a convergence theory for DGC is lacking. In this paper, we propose a novel method, called global momentum compression (GMC), for sparse communication in DSGD. GMC also combines memory gradient and momentum SGD, but unlike DGC, which adopts local momentum, GMC adopts global momentum. We theoretically prove the convergence rate of GMC for both convex and non-convex problems. To the best of our knowledge, this is the first work that proves the convergence of distributed momentum SGD (DMSGD) with sparse communication and memory gradient. Empirical results show that, compared with its DMSGD counterpart without sparse communication, GMC can reduce the communication cost approximately 100-fold without loss of generalization accuracy. GMC can also achieve comparable (sometimes better) performance compared with DGC, with an additional theoretical guarantee.
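To make "sparse communication with memory gradient" concrete, the following is a minimal sketch of one worker's compression step. It is not the GMC algorithm itself, whose update rule and momentum handling are not given in this abstract. It assumes top-k selection by magnitude as the sparsifier, and the function name `sparsify_with_memory` and its parameters are hypothetical.

```python
from typing import Dict, List, Tuple


def sparsify_with_memory(
    grad: List[float], memory: List[float], k: int
) -> Tuple[Dict[int, float], List[float]]:
    """Illustrative top-k sparsification with a memory (residual) vector.

    Assumption: the sparsifier keeps the k largest-magnitude coordinates.
    Returns the sparse message to send and the updated memory.
    """
    # Add the residual left over from earlier rounds to the fresh gradient.
    acc = [g + m for g, m in zip(grad, memory)]
    # Indices of the k largest-magnitude accumulated entries.
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    # Only these (index, value) pairs would be communicated.
    sparse = {i: acc[i] for i in top}
    # Coordinates that were not sent stay in memory for later rounds.
    new_memory = [0.0 if i in sparse else v for i, v in enumerate(acc)]
    return sparse, new_memory


# Two rounds on a 4-dimensional toy gradient, sending one coordinate per round.
msg1, mem1 = sparsify_with_memory([0.5, -2.0, 0.25, 1.0], [0.0] * 4, k=1)
msg2, mem2 = sparsify_with_memory([0.75, 0.0, 0.0, 0.0], mem1, k=1)
```

In this toy run the first coordinate is not sent in round one, but its value is carried in memory and wins the selection in round two. Under the same assumption, keeping about 1% of coordinates would send roughly 100 times fewer values per round, ignoring the cost of transmitting indices.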




On the Convergence of Memory-Based Distributed SGD

Distributed stochastic gradient descent (DSGD) has been widely used for ...

On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Recent developments on large-scale distributed machine learning applicat...

Compressing Gradient Optimizers via Count-Sketches

Many popular first-order optimization methods (e.g., Momentum, AdaGrad, ...

Compressing gradients by exploiting temporal correlation in momentum-SGD

An increasing bottleneck in decentralized optimization is communication....

Losing momentum in continuous-time stochastic optimisation

The training of deep neural networks and other modern machine learning m...

A New Adaptive Gradient Method with Gradient Decomposition

Adaptive gradient methods, especially Adam-type methods (such as Adam, A...

Scaling up Stochastic Gradient Descent for Non-convex Optimisation

Stochastic gradient descent (SGD) is a widely adopted iterative method f...
