DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training

04/24/2021
by Kun Yuan, et al.

The scale of deep learning nowadays calls for efficient distributed training algorithms. Decentralized momentum SGD (DmSGD), in which each node averages only with its neighbors, is more communication-efficient than vanilla parallel momentum SGD, which incurs a global average across all computing nodes. On the other hand, large-batch training has been shown to be critical for achieving runtime speedup. This motivates us to investigate how DmSGD performs in the large-batch scenario. In this work, we find that the momentum term can amplify the inconsistency bias in DmSGD. This bias becomes more evident as the batch size grows and hence results in severe performance degradation. We next propose DecentLaM, a novel decentralized large-batch momentum SGD method that removes the momentum-incurred bias. We establish convergence rates for both the non-convex and strongly-convex scenarios. Our theoretical results justify the superiority of DecentLaM over DmSGD, especially in the large-batch scenario. Experimental results on a variety of computer vision tasks and models demonstrate that DecentLaM enables both efficient and high-quality training.
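The following is a minimal NumPy sketch of a generic decentralized momentum SGD (DmSGD) step of the kind described above: each node keeps a local model copy and a momentum buffer, and averages only with its ring neighbors through a doubly-stochastic mixing matrix W instead of performing a global average. The ring topology, the quadratic local losses, and all hyperparameters are illustrative assumptions only; this is not the paper's DecentLaM update, whose bias correction is its stated contribution.

    # Minimal DmSGD sketch: quadratic local losses f_i(x) = 0.5 * ||x - t_i||^2
    # on a ring of nodes. Topology, losses, and hyperparameters are illustrative.
    import numpy as np

    np.random.seed(0)
    n_nodes, dim = 8, 4
    lr, beta = 0.1, 0.9

    targets = np.random.randn(n_nodes, dim)    # heterogeneous local data
    x = np.zeros((n_nodes, dim))               # one model copy per node
    m = np.zeros((n_nodes, dim))               # one momentum buffer per node

    # Doubly-stochastic mixing matrix: each node averages with its two ring neighbors
    W = np.zeros((n_nodes, n_nodes))
    for i in range(n_nodes):
        for j in (i - 1, i, i + 1):
            W[i, j % n_nodes] = 1.0 / 3.0

    for step in range(200):
        grad = x - targets            # local gradients of the quadratic losses
        m = beta * m + grad           # momentum update on every node
        x = W @ (x - lr * m)          # local step followed by neighbor averaging

    x_bar = x.mean(axis=0)
    print("disagreement across nodes:", np.linalg.norm(x - x_bar))
    print("distance to global optimum:", np.linalg.norm(x_bar - targets.mean(axis=0)))

Because the local losses are heterogeneous, the nodes in this sketch do not reach exact consensus; the final print statements report both the disagreement across nodes and the distance of the averaged model from the global minimizer of the combined losses.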


Related research

05/13/2020 · SQuARM-SGD: Communication-Efficient Momentum SGD for Decentralized Optimization
In this paper, we consider the problem of communication-efficient decent...

03/04/2021 · Correcting Momentum with Second-order Information
We develop a new algorithm for non-convex stochastic optimization that f...

08/24/2020 · Periodic Stochastic Gradient Descent with Momentum for Decentralized Training
Decentralized training has been actively studied in recent years. Althou...

02/09/2020 · Momentum Improves Normalized SGD
We provide an improved analysis of normalized SGD showing that adding mo...

10/26/2021 · Exponential Graph is Provably Efficient for Decentralized Deep Training
Decentralized SGD is an emerging training method for deep learning known...

04/26/2019 · Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources
With an increasing demand for training powers for deep learning algorith...

02/04/2020 · Improving Efficiency in Large-Scale Decentralized Distributed Training
Decentralized Parallel SGD (D-PSGD) and its asynchronous variant Asynchr...
