Local SGD With a Communication Overhead Depending Only on the Number of Workers

06/03/2020
by   Artin Spiridonoff, et al.
6

We consider speeding up stochastic gradient descent (SGD) by parallelizing it across multiple workers. We assume the same data set is shared among n workers, who can take SGD steps and coordinate with a central server. Unfortunately, this could require a lot of communication between the workers and the server, which can dramatically reduce the gains from parallelism. The Local SGD method, proposed and analyzed in the earlier literature, suggests machines should make many local steps between such communications. While the initial analysis of Local SGD showed it needs Ω ( √(T) ) communications for T local gradient steps in order for the error to scale proportionately to 1/(nT), this has been successively improved in a string of papers, with the state-of-the-art requiring Ω( n ( (T) ) ) communications. In this paper, we give a new analysis of Local SGD. A consequence of our analysis is that Local SGD can achieve an error that scales as 1/(nT) with only a fixed number of communications independent of T: specifically, only Ω(n) communications are required.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/09/2021

Communication-efficient SGD: From Local SGD to One-Shot Averaging

We consider speeding up stochastic gradient descent (SGD) by parallelizi...
research
12/30/2019

Variance Reduced Local SGD with Lower Communication Complexity

To accelerate the training of machine learning models, distributed stoch...
research
10/06/2022

STSyn: Speeding Up Local SGD with Straggler-Tolerant Synchronization

Synchronous local stochastic gradient descent (local SGD) suffers from s...
research
10/31/2022

Communication-Efficient Local SGD with Age-Based Worker Selection

A major bottleneck of distributed learning under parameter-server (PS) f...
research
10/06/2018

Anytime Stochastic Gradient Descent: A Time to Hear from all the Workers

In this paper, we focus on approaches to parallelizing stochastic gradie...
research
07/01/2020

Shuffle-Exchange Brings Faster: Reduce the Idle Time During Communication for Decentralized Neural Network Training

As a crucial scheme to accelerate the deep neural network (DNN) training...
research
10/11/2018

signSGD with Majority Vote is Communication Efficient And Byzantine Fault Tolerant

Training neural networks on large datasets can be accelerated by distrib...

Please sign up or login with your details

Forgot password? Click here to reset