Communication-efficient SGD: From Local SGD to One-Shot Averaging

06/09/2021
by Artin Spiridonoff, et al.

We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers. We assume the same data set is shared among N workers, who can take SGD steps and coordinate with a central server. While it is possible to obtain a linear reduction in variance by averaging all the stochastic gradients at every step, this requires a great deal of communication between the workers and the server, which can dramatically reduce the gains from parallelism. The Local SGD method, proposed and analyzed in the earlier literature, suggests that machines should make many local steps between such communications. While the initial analysis of Local SGD showed that it needs Ω(√T) communications over T local gradient steps for the error to scale proportionately to 1/(NT), this has been successively improved in a string of papers, with the state of the art requiring Ω(N polylog(T)) communications. In this paper, we suggest a Local SGD scheme that communicates less overall by communicating less frequently as the number of iterations grows. Our analysis shows that this can achieve an error that scales as 1/(NT) with a number of communications that is completely independent of T. In particular, we show that Ω(N) communications are sufficient. Empirical evidence suggests this bound is close to tight: we further show that √N or N^(3/4) communications fail to achieve linear speed-up in simulations. Moreover, we show that under mild assumptions, the main one being twice differentiability in a neighborhood of the optimal solution, one-shot averaging, which uses only a single round of communication, can also achieve the optimal convergence rate asymptotically.
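
To make the idea concrete, below is a minimal Python sketch of Local SGD in which averaging rounds become sparser as training proceeds, so the total number of communications depends on N but not on T. The quadratic objective, the decaying step size, the quadratically spaced communication times, and the names local_sgd, N, T, d, seed are illustrative assumptions for this sketch, not the exact algorithm or constants analyzed in the paper.

# Minimal sketch: Local SGD on a shared quadratic objective with N workers.
# Averaging rounds are quadratically spaced, so there are at most N of them,
# independent of T. All constants and schedules here are illustrative.
import numpy as np


def local_sgd(N=8, T=10_000, d=10, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros((N, d))                  # one local iterate per worker
    x_opt = np.ones(d)                    # minimizer of f(x) = 0.5 * ||x - x_opt||^2

    # Communication times with growing gaps (assumed schedule):
    # t_i = ceil(T * (i / N)^2), i = 1, ..., N.
    comm_times = {int(np.ceil(T * (i / N) ** 2)) for i in range(1, N + 1)}

    for t in range(1, T + 1):
        eta = 1.0 / (t + 10)              # decaying step size (assumed)
        # Each worker takes one local step with an independent noisy gradient.
        noisy_grads = (x - x_opt) + rng.normal(size=(N, d))
        x -= eta * noisy_grads
        if t in comm_times:               # infrequent averaging via the server
            x[:] = x.mean(axis=0)

    # One-shot averaging is the special case with a single round at t = T.
    x_bar = x.mean(axis=0)
    return float(np.sum((x_bar - x_opt) ** 2)), len(comm_times)


if __name__ == "__main__":
    err, rounds = local_sgd()
    print(f"squared error {err:.4f} after {rounds} communication rounds")

With this kind of schedule the number of averaging rounds stays fixed as T grows, which is the regime studied in the paper; replacing comm_times with {T} gives one-shot averaging, while replacing it with every step recovers fully synchronized mini-batch SGD.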

Related research

06/03/2020 · Local SGD With a Communication Overhead Depending Only on the Number of Workers
We consider speeding up stochastic gradient descent (SGD) by parallelizi...

12/30/2019 · Variance Reduced Local SGD with Lower Communication Complexity
To accelerate the training of machine learning models, distributed stoch...

04/25/2019 · Communication trade-offs for synchronized distributed SGD with large step size
Synchronous mini-batch SGD is state-of-the-art for large-scale distribut...

03/14/2022 · The Role of Local Steps in Local SGD
We consider the distributed stochastic optimization problem where n agen...

10/31/2022 · Communication-Efficient Local SGD with Age-Based Worker Selection
A major bottleneck of distributed learning under parameter-server (PS) f...

01/11/2022 · Partial Model Averaging in Federated Learning: Performance Guarantees and Benefits
Local Stochastic Gradient Descent (SGD) with periodic model averaging (F...

01/11/2018 · MXNET-MPI: Embedding MPI parallelism in Parameter Server Task Model for scaling Deep Learning
Existing Deep Learning frameworks exclusively use either Parameter Serve...
