APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm

08/26/2020
by Hanlin Tang, et al.

Adam is an important optimization algorithm for efficiently and accurately training many important tasks such as BERT and ImageNet. However, Adam is generally not compatible with information (gradient) compression techniques, so communication usually becomes the bottleneck when parallelizing Adam. In this paper, we propose a communication-efficient Adam-preconditioned momentum SGD algorithm, named APMSqueeze, which compresses gradients using an error-compensated method. The proposed algorithm achieves convergence efficiency similar to Adam in terms of epochs, but significantly reduces the running time per epoch. In terms of end-to-end performance (including the full-precision pre-conditioning step), APMSqueeze provides up to a 2-10x speed-up depending on the network bandwidth. We also provide a theoretical analysis of the convergence and efficiency.
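To make the idea concrete, below is a minimal, illustrative Python/NumPy sketch of one worker-side update combining error-compensated gradient compression with a momentum SGD step preconditioned by a frozen Adam-style second-moment estimate. This is not the authors' implementation: the sign-based compressor, the function and variable names, and the assumption that the preconditioner v comes from a full-precision warm-up phase are all illustrative choices.

    import numpy as np

    def compress_1bit(x):
        """Sign-based compression with a single magnitude scale (illustrative compressor)."""
        scale = np.mean(np.abs(x))  # one float communicated alongside the signs
        return scale * np.sign(x)

    def apmsqueeze_step(w, grad, m, e, v, lr=1e-3, beta=0.9, eps=1e-8):
        """One illustrative step: error-compensated compressed momentum update,
        preconditioned by a fixed Adam-style second-moment estimate v.

        w    : parameters
        grad : local stochastic gradient
        m    : momentum buffer
        e    : local error-compensation (residual) buffer
        v    : second-moment estimate from a full-precision warm-up phase (assumed fixed)
        """
        # Error compensation: add the residual left over from the previous compression step.
        corrected = grad + e
        compressed = compress_1bit(corrected)
        e = corrected - compressed          # store what compression discarded

        # (In a distributed run, `compressed` would be exchanged/averaged across workers here.)
        m = beta * m + (1 - beta) * compressed

        # Preconditioned momentum update using the frozen sqrt(v) scaling.
        w = w - lr * m / (np.sqrt(v) + eps)
        return w, m, e

In this sketch, the full-precision pre-conditioning phase supplies v; once v is fixed, the remaining updates behave like momentum SGD and are therefore amenable to compression with error feedback.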


Related research

05/27/2019  Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback
Communication overhead is a major bottleneck hampering the scalability o...

02/04/2021  1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
Scalable training of large models (like BERT and GPT-3) requires careful...

08/17/2021  Compressing gradients by exploiting temporal correlation in momentum-SGD
An increasing bottleneck in decentralized optimization is communication....

04/13/2021  1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed
To train large models (like BERT and GPT-3) with hundreds or even thousa...

02/13/2018  signSGD: compressed optimisation for non-convex problems
Training large neural networks requires distributing learning across mul...

08/24/2020  Periodic Stochastic Gradient Descent with Momentum for Decentralized Training
Decentralized training has been actively studied in recent years. Althou...

10/31/2018  MaSS: an Accelerated Stochastic Method for Over-parametrized Learning
In this paper we introduce MaSS (Momentum-added Stochastic Solver), an a...
