Local SGD with Periodic Averaging: Tighter Analysis and Adaptive Synchronization

10/30/2019
by   Farzin Haddadpour, et al.

Communication overhead is one of the key challenges that hinders the scalability of distributed optimization algorithms. In this paper, we study local distributed SGD, where data is partitioned among computation nodes, and each node performs local updates, periodically exchanging the model with the other workers for averaging. While local SGD has empirically been shown to provide promising results, a theoretical understanding of its performance remains open. We strengthen the convergence analysis for local SGD and show that it can be far less expensive and applied far more generally than current theory suggests. Specifically, we show that for loss functions satisfying the Polyak-Łojasiewicz condition, O((pT)^(1/3)) rounds of communication suffice to achieve a linear speedup, that is, an error of O(1/(pT)), where p is the number of workers and T is the total number of model updates at each worker. This is in contrast with previous work, which required more communication rounds and was limited to strongly convex loss functions for a similar asymptotic performance. We also develop an adaptive synchronization scheme that provides a general condition for linear speedup. Finally, we validate the theory with experiments run on AWS EC2 clouds and an internal GPU cluster.
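
To make the setting concrete, below is a minimal sketch of local SGD with periodic averaging, not the authors' implementation: p workers each take local SGD steps on their own data partition and average their models every tau steps, so only T/tau communication rounds are needed for T local updates per worker. The names local_sgd, stochastic_grad, tau, and the toy quadratic objective are illustrative assumptions.

import numpy as np

def local_sgd(w0, stochastic_grad, p=4, T=1000, tau=10, lr=0.01):
    """Run T local updates per worker, averaging the p models every tau steps."""
    workers = [np.array(w0, dtype=float) for _ in range(p)]
    for t in range(1, T + 1):
        # Each worker takes one local SGD step on its own data partition.
        for i in range(p):
            workers[i] -= lr * stochastic_grad(workers[i], i)
        # Every tau steps, a communication round replaces all local models
        # with their average, so T local updates cost only T/tau rounds.
        if t % tau == 0:
            avg = np.mean(workers, axis=0)
            workers = [avg.copy() for _ in range(p)]
    return np.mean(workers, axis=0)

# Toy usage (hypothetical objective): a noisy quadratic, one gradient oracle per worker.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = lambda w, i: 2 * w + 0.1 * rng.standard_normal(w.shape)
    w_final = local_sgd(np.ones(5), grad, p=4, T=500, tau=10, lr=0.05)
    print(w_final)

In this sketch the synchronization period tau is fixed; the adaptive scheme discussed in the paper would instead vary how often the averaging step is triggered.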


