Parallel Restarted SGD for Non-Convex Optimization with Faster Convergence and Less Communication

07/17/2018
by   Hao Yu, et al.
0

For large scale non-convex stochastic optimization, parallel mini-batch SGD using multiple workers ideally can achieve a linear speed-up with respect to the number of workers compared with SGD over a single worker. However, such linear scalability in practice is significantly limited by the growing demand for communication as more workers are involved. This is because the classical parallel mini-batch SGD requires gradient or model exchanges between workers (possibly through an intermediate server) at every iteration. In this paper, we study whether it is possible to maintain the linear speed-up property of parallel mini-batch SGD by using less frequent message passing between workers. We consider the parallel restarted SGD method where each worker periodically restarts its SGD by using the node average as a new initial point. Such a strategy invokes inter-node communication only when computing the node average to restart local SGD but otherwise is fully parallel with no communication overhead. We prove that the parallel restarted SGD method can maintain the same convergence rate as the classical parallel mini-batch SGD while reducing the communication overhead by a factor of O(T^1/4). The parallel restarted SGD strategy was previously used as a common practice, known as model averaging, for training deep neural networks. Earlier empirical works have observed that model averaging can achieve an almost linear speed-up if the averaging interval is carefully controlled. The results in this paper can serve as theoretical justifications for these empirical results on model averaging and provide practical guidelines for applying model averaging.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/24/2018

Local SGD Converges Fast and Communicates Little

Mini-batch stochastic gradient descent (SGD) is the state of the art in ...
research
05/10/2019

On the Computation and Communication Complexity of Parallel SGD with Dynamic Batch Sizes for Stochastic Non-Convex Optimization

For SGD based distributed stochastic optimization, computation complexit...
research
03/12/2019

A Distributed Hierarchical SGD Algorithm with Sparse Global Reduction

Reducing communication overhead is a big challenge for large-scale distr...
research
06/23/2016

Parallel SGD: When does averaging help?

Consider a number of workers running SGD independently on the same pool ...
research
05/19/2021

Accelerating Gossip SGD with Periodic Global Averaging

Communication overhead hinders the scalability of large-scale distribute...
research
10/25/2019

The Scalability for Parallel Machine Learning Training Algorithm: Dataset Matters

To gain a better performance, many researchers put more computing resour...
research
12/30/2020

Crossover-SGD: A gossip-based communication in distributed deep learning for alleviating large mini-batch problem and enhancing scalability

Distributed deep learning is an effective way to reduce the training tim...

Please sign up or login with your details

Forgot password? Click here to reset