Accelerating Gossip SGD with Periodic Global Averaging

05/19/2021
by Yiming Chen, et al.

Communication overhead hinders the scalability of large-scale distributed training. Gossip SGD, in which each node averages only with its neighbors, is more communication-efficient than the prevalent parallel SGD. However, its convergence rate is inversely proportional to the quantity 1-β, which measures the network connectivity. On large and sparse networks, where 1-β → 0, Gossip SGD requires more iterations to converge, which offsets its communication benefit. This paper introduces Gossip-PGA, which adds Periodic Global Averaging to Gossip SGD. Its transient stage, i.e., the number of iterations required to reach the asymptotic linear-speedup stage, improves from Ω(β^4 n^3/(1-β)^4) to Ω(β^4 n^3 H^4) for non-convex problems. The influence of the network topology on Gossip-PGA can thus be controlled by the averaging period H. Its transient-stage complexity is also superior to that of Local SGD, which is of order Ω(n^3 H^4). Empirical results from large-scale training on image classification (ResNet50) and language modeling (BERT) validate our theoretical findings.
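In a nutshell, each node takes a local SGD step and then averages its model with its immediate neighbors; every H iterations this gossip step is replaced by a global average across all nodes. The minimal NumPy sketch below simulates such a scheme on a toy quadratic objective. It is only an illustration of the idea, not the authors' implementation: the ring topology, step size, and names such as ring_mixing_matrix and gossip_pga are assumptions made here for the example.

```python
import numpy as np

def ring_mixing_matrix(n):
    """Doubly stochastic mixing matrix W for a ring topology:
    each node averages equally with itself and its two neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1 / 3
        W[i, (i - 1) % n] = 1 / 3
        W[i, (i + 1) % n] = 1 / 3
    return W

def gossip_pga(n=16, d=10, H=8, lr=0.05, steps=200, seed=0):
    """Toy Gossip-SGD-with-periodic-global-averaging loop (illustrative)."""
    rng = np.random.default_rng(seed)
    # Node i holds the local quadratic loss f_i(x) = 0.5 * ||x - b_i||^2,
    # so its stochastic gradient is (x_i - b_i) plus noise.
    b = rng.normal(size=(n, d))
    x = np.zeros((n, d))              # one model copy per node
    W = ring_mixing_matrix(n)
    for t in range(1, steps + 1):
        grads = (x - b) + 0.1 * rng.normal(size=(n, d))
        x = x - lr * grads            # local SGD step on every node
        if t % H == 0:
            x[:] = x.mean(axis=0)     # periodic global averaging (all-reduce)
        else:
            x = W @ x                 # gossip: average with ring neighbors only
    return x

if __name__ == "__main__":
    x = gossip_pga()
    print("consensus error:", np.linalg.norm(x - x.mean(axis=0)))
```

In this sketch, setting H = 1 reduces the loop to fully synchronous parallel SGD, while letting H grow without bound recovers plain Gossip SGD, which mirrors the role of H in the abstract as the knob that controls the dependence on the network topology.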

Related research

05/17/2021
Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD
We consider decentralized stochastic optimization problems where a netwo...

07/17/2018
Parallel Restarted SGD for Non-Convex Optimization with Faster Convergence and Less Communication
For large scale non-convex stochastic optimization, parallel mini-batch ...

03/12/2019
A Distributed Hierarchical SGD Algorithm with Sparse Global Reduction
Reducing communication overhead is a big challenge for large-scale distr...

07/13/2020
Adaptive Periodic Averaging: A Practical Approach to Reducing Communication in Distributed Learning
Stochastic Gradient Descent (SGD) is the key learning algorithm for many...

10/30/2019
Local SGD with Periodic Averaging: Tighter Analysis and Adaptive Synchronization
Communication overhead is one of the key challenges that hinders the sca...

10/15/2021
Trade-offs of Local SGD at Scale: An Empirical Study
As datasets and models become increasingly large, distributed training h...

07/27/2020
Multi-Level Local SGD for Heterogeneous Hierarchical Networks
We propose Multi-Level Local SGD, a distributed gradient method for lear...
