# Slow and Stale Gradients Can Win the Race: Error-Runtime Trade-offs in Distributed SGD

Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in waiting for the slowest learners (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect convergence. In this work we present the first theoretical characterization of the speed-up offered by asynchronous methods, obtained by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time). The novelty of our work is that our runtime analysis considers random straggler delays, which helps us design and compare distributed SGD algorithms that strike a balance between straggling and staleness. We also present a new convergence analysis of asynchronous SGD variants without bounded or exponential delay assumptions.

## Authors

Sanghamitra Dutta, et al.


## 1 Introduction

Stochastic gradient descent (SGD) is the backbone of most state-of-the-art machine learning algorithms. Thus, improving the stability and convergence rate of SGD algorithms is critical for making machine learning algorithms fast and efficient.

Traditionally, SGD is run serially at a single node. However, for massive datasets, running SGD serially at a single server can be prohibitively slow. A solution that has proved successful in recent years is to parallelize the training across many learners (processing units). This method was first used at large scale in Google’s DistBelief [dean2012large], which used a central parameter server (PS) to aggregate gradients computed by learner nodes. While parallelism dramatically speeds up training, distributed machine learning frameworks face several challenges, such as:

Straggling Learners. In synchronous SGD, the PS waits for all learners to push gradients before it updates the model parameters. Random delays in computation (referred to as straggling) are common in today’s distributed systems [dean2013tail]. Waiting for slow and straggling learners can diminish the speed-up offered by parallelizing the training.

Gradient Staleness. To alleviate the problem of stragglers, SGD can be run in an asynchronous manner, where the central parameters are updated without waiting for all learners. However, learners may return stale gradients that were evaluated at an older version of the model, and this can make the algorithm unstable.

The key contributions of this work are:

1. Most SGD algorithms optimize the trade-off between training error and the number of iterations or epochs. However, the wallclock time per iteration is a random variable that depends on the gradient aggregation algorithm. We present a rigorous analysis of the trade-off between error and the actual runtime (instead of iterations), modelling runtimes as random variables with a general distribution. This analysis is then used to compare different SGD variants such as $K$-sync SGD, $K$-async SGD and $K$-batch-async SGD, as illustrated in Figure 1.

2. We present a new convergence analysis of asynchronous SGD and some of its variants, where we relax several commonly made assumptions such as bounded delays and gradients, exponential service times, and independence of the staleness process.

3. We propose a novel learning rate schedule that compensates for gradient staleness and improves the stability and convergence of asynchronous SGD, while preserving its fast runtime.

### 1.1 Related Works

Single Node SGD: Analysis of gradient descent dates back to classical works [boyd2004convex] in the optimization community. The problem of interest is the minimization of empirical risk of the form:

$$\min_{w} \Big\{ F(w) \stackrel{\text{def}}{=} \frac{1}{N}\sum_{n=1}^{N} f(w, \xi_n) \Big\}. \tag{1}$$

Here, $\xi_n$ denotes the $n$-th data point and its label, for $n = 1, \ldots, N$, and $f$ denotes the composite loss function. Gradient descent iteratively minimizes this objective by updating the parameter $w$ in the direction opposite to the gradient of $F$ at every iteration, as given by:

$$w_{j+1} = w_j - \eta \nabla F(w_j) = w_j - \frac{\eta}{N}\sum_{n=1}^{N} \nabla f(w_j, \xi_n).$$

The computation of $\nabla F$ over the entire dataset is expensive. Thus, stochastic gradient descent [robbins1951stochastic] with mini-batching is generally used in practice, where the gradient is evaluated over small, randomly chosen subsets of the data. Smaller mini-batches result in higher variance of the gradients, which affects convergence and the error floor [dekel2012optimal, li2014efficient, bottou2016optimization]. Algorithms such as AdaGrad [duchi2011adaptive] and Adam [kingma2015adam] gradually reduce the learning rate to achieve a lower error floor. Another class of algorithms includes stochastic variance reduction techniques such as SVRG [johnson2013accelerating], SAGA [roux2012stochastic] and their variants listed in [nguyen2017sarah]. For a detailed survey of different SGD variants, refer to [ruder2016overview].
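
As a concrete illustration, the following sketch runs serial mini-batch SGD on a toy one-dimensional least-squares objective, whose minimizer is the data mean. All names and constants here are illustrative, not from the paper.

```python
import random

def sgd(grad, w0, data, eta=0.1, batch=4, iters=500, seed=0):
    """Serial mini-batch SGD: at each step, average the gradient over a
    small random subset of the data and step against it."""
    rng = random.Random(seed)
    w = w0
    for _ in range(iters):
        mini = rng.sample(data, batch)
        g = sum(grad(w, x) for x in mini) / batch  # stochastic gradient estimate
        w -= eta * g
    return w

# Toy objective: F(w) = (1/N) sum_n (w - x_n)^2 / 2, minimized at the data mean.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w_hat = sgd(lambda w, x: w - x, 0.0, data)
```

Smaller `batch` values make `w_hat` noisier around the minimizer, illustrating the variance and error-floor discussion above.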

Synchronous SGD and Stragglers: To process large datasets, SGD is parallelized across multiple learners with a central PS. Each learner processes one mini-batch, and the PS aggregates all the gradients. The convergence of synchronous SGD is the same as that of mini-batch SGD with a $P$-fold larger mini-batch, where $P$ is the number of learners. However, the time per iteration grows with the number of learners, because of straggling learners that slow down randomly [dean2013tail]. Thus, it is important to juxtapose the error reduction per iteration with the runtime per iteration to understand the true convergence speed of distributed SGD.

To deal with stragglers and speed up machine learning, system designers have proposed several straggler mitigation techniques such as [harlap2016addressing] that try to detect and avoid stragglers. An alternate direction of work is to use redundancy techniques, e.g., replication or erasure codes, as proposed in [joshi2014delay, wang2015using, joshi2015queues, joshi2017efficient, lee2017speeding, tandon2017gradient, dutta2016short, halbawi2017improving, yang2017coded, yang2016fault, yu2017polynomial, karakus2017encoded, karakus2017straggler, charles2017approximate, li2017terasort, fahim2017optimal, ye2018communication, li2018fundamental, NewsletterPaper, DNNPaperISIT, mallick2018rateless] to deal with the stragglers, as also discussed in Remark 1.

Asynchronous SGD and Staleness: A complementary approach to deal with the issue of straggling is to use asynchronous SGD. In asynchronous SGD, any learner can evaluate the gradient and update the central PS without waiting for the other learners. Asynchronous variants of existing SGD algorithms have also been proposed and implemented in systems [dean2012large, gupta2016model, cipar2013solving, cui2014exploiting, ho2013more].

In general, analyzing the convergence of asynchronous SGD with the number of iterations is difficult in itself because of the randomness of gradient staleness. There are only a few pioneering works such as [tsitsiklis1986distributed, lian2015asynchronous, mitliagkas2016asynchrony, recht2011hogwild, agarwal2011distributed, mania2017perturbed, chaturapruek2015asynchronous, zhang2016staleness, peng2016arock, hannah2017more, hannah2016unbounded, sun2017asynchronous, leblond2017asaga] in this direction. In [tsitsiklis1986distributed], a fully decentralized analysis was proposed that considers no central PS. In [recht2011hogwild], a new asynchronous algorithm called Hogwild was proposed and analyzed under bounded gradient and bounded delay assumptions. This direction of research has been followed by several interesting works such as [lian2015asynchronous], which proposed a novel theoretical analysis under the bounded delay assumption for other asynchronous SGD variants. In [peng2016arock, hannah2017more, hannah2016unbounded, sun2017asynchronous], the framework of ARock was proposed for parallel coordinate descent and analyzed using Lyapunov functions, relaxing several existing assumptions such as the bounded delay assumption and the independence of the delays from the index of the blocks being updated. In algorithms such as Hogwild and ARock, every learner updates only a part of the central parameter vector $w$ at every iteration; these methods are thus essentially different in spirit from conventional asynchronous SGD settings [lian2015asynchronous, agarwal2011distributed] where every learner updates the entire vector $w$. In an alternate direction of work [mania2017perturbed], asynchrony is modelled as a perturbation.

### 1.2 Our Contributions

Existing machine learning algorithms mostly try to optimize the trade-off of error with the number of iterations, epochs or “work complexity” [bottou2016optimization]. Time to complete a task has traditionally been calculated in terms of work complexity measures [sedgewick2011algorithms], where the time taken to complete a task is a deterministic function of its size (number of operations). However, due to straggling and synchronization bottlenecks in the system, the same task can take different amounts of time across different learners or iterations. We bring a statistical perspective to traditional work complexity analysis that incorporates the randomness introduced by straggling. In this paper, we provide a systematic approach to analyze the expected error versus runtime for both synchronous and asynchronous SGD, and variants such as $K$-sync, $K$-batch-sync, $K$-async and $K$-batch-async SGD, by modelling the runtimes at each learner as i.i.d. random variables with a general distribution.

We also propose a new error convergence analysis for async and $K$-async SGD that holds for strongly convex objectives and can also be extended to non-convex formulations. In this analysis we relax the bounded delay assumption in [lian2015asynchronous] and the bounded gradient assumption in [recht2011hogwild]. We also remove the assumptions of exponential computation time and of the staleness process being independent of the parameter values [mitliagkas2016asynchrony], as we elaborate in Section 3.2. Interestingly, our analysis also brings out the regimes where asynchrony can be better or worse than synchrony in terms of speed of convergence. Further, we propose a new learning rate schedule to compensate for staleness and stabilize asynchronous SGD, which is related to but different from the momentum tuning in [mitliagkas2016asynchrony, zhang2017yellowfin], as we clarify in Remark 2.

The rest of the paper is organized as follows. Section 2 describes our problem formulation, introducing the system model and assumptions. Section 3 provides the main results of the paper: an analytical characterization of the expected runtime, a new convergence analysis for async and $K$-async SGD, and the proposed learning rate schedule to compensate for staleness. The analysis of expected runtime is elaborated further in Section 4. Proofs and detailed discussions are presented in the Appendix.

## 2 Problem Formulation

Our objective is to minimize the risk function of the parameter vector $w$ given in (1), using $N$ training samples. The total training set is a collection of $N$ data points with their corresponding labels or values. We use the notation $\xi$ to denote a random seed, which consists of either a single data point and its label or a single mini-batch ($m$ samples) of data and their labels.

### 2.1 System Model

We assume that there is a central parameter server (PS) with $P$ parallel learners, as shown in Figure 2. The learners fetch the current parameter vector $w$ from the PS as and when instructed in the algorithm. Then they compute gradients using one mini-batch and push their gradients back to the PS as and when instructed in the algorithm. At each iteration, the PS aggregates the gradients computed by the learners and updates the parameter $w$. Based on how these gradients are fetched and aggregated, we obtain different variants of synchronous or asynchronous SGD.

The time taken by learner $i$ to compute the gradient of one mini-batch is denoted by the random variable $X_i$ for $i = 1, \ldots, P$. We assume that the $X_i$'s are i.i.d. across mini-batches and learners.

### 2.2 Performance Metrics

There are two metrics of interest: Expected Runtime and Error.

###### Definition 1 (Expected Runtime per iteration).

The expected runtime per iteration is the expected (average) time taken to perform each iteration, i.e., the expected time between two consecutive updates of the parameter $w$ at the central PS.

###### Definition 2 (Expected Error).

The expected error after $J$ iterations is defined as $\mathbb{E}[F(w_J)] - F^*$, the expected gap of the risk function from its optimal value.

Our aim is to determine the trade-off between the expected error (which measures the accuracy of the algorithm) and the expected runtime after a total of $J$ iterations for the different SGD variants.

### 2.3 Variants of SGD

We now describe the SGD variants considered in this paper. Please refer to Figure 3 and Figure 4 for a pictorial illustration.

$K$-sync SGD: This is a generalized form of synchronous SGD, also suggested in [gupta2016model, chen2016revisiting], which offers some resilience to straggling since the PS does not wait for all the learners to finish. The PS waits only for the first $K$ out of $P$ learners to push their gradients. Once it receives $K$ gradients, it updates $w$ and cancels the remaining learners. The updated parameter vector is sent to all learners for the next iteration. The update rule is given by:

$$w_{j+1} = w_j - \frac{\eta}{K}\sum_{l=1}^{K} g(w_j, \xi_{l,j}). \tag{2}$$

Here $l$ denotes the indices of the $K$ learners that finish first, $\xi_{l,j}$ denotes the mini-batch of $m$ samples used by the $l$-th learner at the $j$-th iteration, and $g(w_j, \xi_{l,j})$ denotes the average gradient of the loss function evaluated over the mini-batch of size $m$. For $K = P$, the algorithm is exactly equivalent to fully synchronous SGD with $P$ learners.
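
To make the update rule (2) concrete, the sketch below performs $K$-sync steps on the toy objective $F(w) = w^2/2$: the PS keeps the $K$ earliest-finishing gradients and discards the stragglers. The exponential finish times and all constants are illustrative assumptions, not from the paper.

```python
import random

def k_sync_step(w, reports, K, eta):
    """One K-sync SGD update (cf. (2)): sort learner reports by finish
    time, average the K earliest gradients, discard the stragglers."""
    fastest = sorted(reports, key=lambda r: r[0])[:K]
    g_avg = sum(g for _, g in fastest) / K
    return w - eta * g_avg

rng = random.Random(1)
w = 10.0
for _ in range(100):
    # P = 4 learners all evaluate grad F(w) = w at the same w; random finish times.
    reports = [(rng.expovariate(1.0), w) for _ in range(4)]
    w = k_sync_step(w, reports, K=2, eta=0.1)
```

Since every learner here computes the same exact gradient, each step contracts $w$ by a factor $(1-\eta)$ regardless of which two learners win the race.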

$K$-batch-sync: In $K$-batch-sync, all $P$ learners start computing gradients with the same $w$. Whenever any learner finishes, it pushes its update to the PS and evaluates the gradient on the next mini-batch at the same $w$. The PS updates $w$ using the first $K$ mini-batches that finish and cancels the remaining learners. Theoretically, the update rule is still the same as (2), but $l$ now denotes the index of the mini-batch (out of the $K$ mini-batches that finished first) instead of the learner. However, $K$-batch-sync offers an advantage over $K$-sync in runtime per iteration, as no learner is idle.

$K$-async SGD: This is a generalized version of asynchronous SGD, also suggested in [gupta2016model]. In $K$-async SGD, all $P$ learners compute their respective gradients, each on a single mini-batch. The PS waits for the first $K$ out of $P$ learners that finish, but it does not cancel the remaining learners. As a result, for every update, the gradients returned by each learner might be computed at a stale or older value of the parameter $w$. The update rule is thus given by:

$$w_{j+1} = w_j - \frac{\eta}{K}\sum_{l=1}^{K} g(w_{\tau(l,j)}, \xi_{l,j}). \tag{3}$$

Here $l$ denotes the indices of the $K$ learners that contribute to the update at the corresponding iteration, $\xi_{l,j}$ is one mini-batch of $m$ samples used by the $l$-th learner at the $j$-th iteration, and $\tau(l,j)$ denotes the iteration index when the $l$-th learner last read $w$ from the central PS, where $\tau(l,j) \le j$. Also, $g(w_{\tau(l,j)}, \xi_{l,j})$ is the average gradient of the loss function evaluated over the mini-batch $\xi_{l,j}$ based on the stale value of the parameter $w_{\tau(l,j)}$. For $K = 1$, the algorithm is exactly equivalent to fully asynchronous SGD, and the update rule simplifies to:

$$w_{j+1} = w_j - \eta\, g(w_{\tau(j)}, \xi_j). \tag{4}$$

Here $\xi_j$ denotes the mini-batch of $m$ samples used by the learner that updates $w$ at the $j$-th iteration, and $\tau(j)$ denotes the iteration index when that particular learner last read $w$ from the central PS. Note that $\tau(j) \le j$.

$K$-batch-async: Observe in Figure 4 that $K$-async also suffers from some learners being idle while others are still working on their gradients, until any $K$ finish. In $K$-batch-async (proposed in [lian2015asynchronous]), the PS waits for $K$ mini-batches before updating itself, irrespective of which learners they come from. So whenever any learner finishes, it pushes its gradient to the PS, fetches the current parameter at the PS, and starts computing the gradient on the next mini-batch based on that current value. Interestingly, the update rule is again the same as (3), except that $l$ now denotes the indices of the $K$ mini-batches that finish first instead of the learners, and $\tau(l,j)$ denotes the version of the parameter that the learner computing the $l$-th mini-batch last read from the PS. While the error convergence of $K$-batch-async is similar to that of $K$-async, it reduces the runtime per iteration as no learner is idle.
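
The scheduling logic above can be sketched as a small discrete-event simulation of $K$-batch-async: each learner computes on the parameter version it last read, and the PS updates after every $K$ pushes, so gradients may be stale. The toy objective $F(w) = w^2/2$, the exponential service times, and all constants are illustrative assumptions.

```python
import heapq
import random

def k_batch_async(P=4, K=2, iters=50, eta=0.05, seed=0):
    """Event-driven sketch of K-batch-async SGD on F(w) = w^2/2."""
    rng = random.Random(seed)
    w, version = 10.0, 0
    # Pending computations: (finish_time, learner_id, w_read, version_read).
    events = [(rng.expovariate(1.0), i, w, 0) for i in range(P)]
    heapq.heapify(events)
    t, batch, staleness = 0.0, [], []
    while version < iters:
        t, i, w_read, v_read = heapq.heappop(events)
        batch.append(w_read)                # grad of w^2/2 at w_read is w_read
        staleness.append(version - v_read)  # how many updates this gradient missed
        if len(batch) == K:                 # K pushes form one PS update
            w -= eta * sum(batch) / K
            version += 1
            batch = []
        # The learner immediately reads the current w and starts its next mini-batch.
        heapq.heappush(events, (t + rng.expovariate(1.0), i, w, version))
    return w, t, max(staleness)

w_final, runtime, max_stale = k_batch_async()
```

No learner is ever idle in this scheme, which is exactly the runtime advantage over $K$-async discussed above.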

###### Remark 1.

Recent works such as [tandon2017gradient] propose erasure coding techniques to overcome straggling learners. Instead, the SGD variants considered in this paper, such as $K$-sync and $K$-batch-sync SGD, exploit the inherent redundancy in the data itself and ignore the gradients returned by straggling learners. If the data is well-shuffled such that it can be assumed to be i.i.d. across learners, then for the same effective batch size, ignoring straggling gradients gives equivalent error scaling as coded strategies, at a lower computing cost. However, coding strategies may be useful in the non-i.i.d. case, when the gradients supplied by each learner provide diverse information that is important to capture in the trained model.

### 2.4 Assumptions

Closely following [bottou2016optimization], we also make the following assumptions:

1. $F(w)$ is an $L$-smooth function. Thus,

$$\|\nabla F(w_1) - \nabla F(w_2)\|_2 \le L\, \|w_1 - w_2\|_2. \tag{5}$$

2. $F(w)$ is strongly convex with parameter $c$. Thus,

$$2c\,\big(F(w) - F^*\big) \le \|\nabla F(w)\|_2^2 \quad \forall\, w. \tag{6}$$

Refer to Appendix A for a discussion of strong convexity. Our results also extend to non-convex objectives, as discussed in Section 3.

3. The stochastic gradient is an unbiased estimate of the true gradient:

$$\mathbb{E}_{\xi_j \mid w_k}\big[g(w_k, \xi_j)\big] = \nabla F(w_k) \quad \forall\, k \le j. \tag{7}$$

Observe that this is slightly different from the common assumption that $\mathbb{E}_{\xi_j}[g(w_j, \xi_j)] = \nabla F(w_j)$ for all $j$. The parameter $w_k$ for $k > j$ is actually not independent of the data $\xi_j$. We thus make the assumption more rigorous by conditioning on $w_k$ for $k \le j$. Our requirement means that $w_k$ is the value of the parameter at the PS before the data $\xi_j$ was accessed, and can thus be assumed to be independent of the data $\xi_j$.

4. Similar to the previous assumption, we also assume that the variance of the stochastic update, given the value of $w_k$ at an iteration $k \le j$ before the data point $\xi_j$ was accessed, is bounded as follows:

$$\mathbb{E}_{\xi_j \mid w_k}\big[\|g(w_k, \xi_j) - \nabla F(w_k)\|_2^2\big] \le \frac{\sigma^2}{m} + \frac{M_G}{m}\, \|\nabla F(w_k)\|_2^2 \quad \forall\, k \le j. \tag{8}$$

Table 1 lists the notation used in this paper for reference.

## 3 Main Results

### 3.1 Runtime Analysis

We compare the theoretical wall clock runtime of the different SGD variants to illustrate the speed-up offered by different asynchronous and batch variants. A detailed discussion is provided in Section 4.

###### Theorem 1.

Let the wall clock time of each learner to process a single mini-batch be i.i.d. random variables $X_1, X_2, \ldots, X_P$. Then the ratio of the expected runtimes per iteration for synchronous and asynchronous SGD is

$$\frac{\mathbb{E}[T_{\text{Sync}}]}{\mathbb{E}[T_{\text{Async}}]} = \frac{P\, \mathbb{E}[X_{P:P}]}{\mathbb{E}[X]}$$

where $X_{P:P}$ is the $P$-th order statistic (the maximum) of the $P$ i.i.d. random variables $X_1, X_2, \ldots, X_P$.

This result analytically characterizes the speed-up offered by asynchronous SGD for any general distribution on the wall clock time of each learner. To prove this result, we use ideas from renewal theory, as we discuss in Section 4. In the following corollary, we highlight this speed-up for the special case of exponential computation time.

###### Corollary 1.

Let the wall clock time of each learner to process a single mini-batch be i.i.d. exponential random variables $X_i \sim \exp(\mu)$. Then the ratio of the expected runtimes per iteration for synchronous and asynchronous SGD is approximately $P \log P$.

Thus, the speed-up scales as $P \log P$ and can diverge to infinity for large $P$. We illustrate the speed-up for different distributions in Figure 5. It might be noted that a speed-up similar to Corollary 1 has also been obtained in a recent work [hannah2017more] under exponential assumptions.
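
Theorem 1 can also be checked numerically. The sketch below estimates $P\,\mathbb{E}[X_{P:P}]/\mathbb{E}[X]$ by Monte Carlo for exponential service times and compares it against the exact value $P \cdot H_P$, where $H_P = \sum_{i=1}^{P} 1/i \approx \log P$ is the harmonic number, matching Corollary 1; the parameters are illustrative.

```python
import random

def speedup_sync_vs_async(P, trials=20000, seed=0):
    """Monte Carlo estimate of P * E[X_{P:P}] / E[X] for X ~ exp(1)."""
    rng = random.Random(seed)
    max_mean = sum(max(rng.expovariate(1.0) for _ in range(P))
                   for _ in range(trials)) / trials
    return P * max_mean / 1.0  # E[X] = 1 for rate-1 exponentials

est = speedup_sync_vs_async(P=10)
exact = 10 * sum(1.0 / i for i in range(1, 11))  # P * H_P
```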

The next result illustrates the advantages offered by -batch-sync and async over their corresponding counterparts -sync and -async respectively.

###### Theorem 2.

Let the wall clock time of each learner to process a single mini-batch be i.i.d. exponential random variables $X_i \sim \exp(\mu)$. Then the ratio of the expected runtimes per iteration for $K$-async (or $K$-sync) SGD and $K$-batch-async (or $K$-batch-sync) SGD is

$$\frac{\mathbb{E}[T_{K\text{-async}}]}{\mathbb{E}[T_{K\text{-batch-async}}]} = \frac{P\, \mathbb{E}[X_{K:P}]}{K\, \mathbb{E}[X]} \approx \frac{P \log\big(\frac{P}{P-K}\big)}{K}$$

where $X_{K:P}$ is the $K$-th order statistic of the $P$ i.i.d. random variables $X_1, X_2, \ldots, X_P$.

To prove this, we derive an exact expression (see Lemma 5 in Section 4) for the expected runtime of $K$-batch-async SGD, for any given i.i.d. distribution of the $X_i$'s, not necessarily exponential. The expected runtime per iteration is obtained as $\frac{K\,\mathbb{E}[X]}{P}$, using ideas from renewal theory. The full proof of Theorem 2 is also provided in Section 4.

Theorem 2 shows that as $K$ increases, the speed-up from using $K$-batch-async over $K$-async increases. For non-exponential distributions, we simulate the behaviour of the expected runtime in Figure 6 for $K$-sync, $K$-async and $K$-batch-async, for the Pareto and shifted-exponential distributions respectively.
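
Similarly, the ratio in Theorem 2 can be verified by Monte Carlo for exponential service times; for exponentials, $\mathbb{E}[X_{K:P}]$ equals the partial harmonic sum $\frac{1}{\mu}\sum_{i=P-K+1}^{P} 1/i$, and the logarithmic expression is its approximation. The parameters below are illustrative.

```python
import math
import random

def runtime_ratio(P, K, trials=20000, seed=1):
    """Monte Carlo estimate of P * E[X_{K:P}] / (K * E[X]) for X ~ exp(1)."""
    rng = random.Random(seed)
    kth_mean = sum(sorted(rng.expovariate(1.0) for _ in range(P))[K - 1]
                   for _ in range(trials)) / trials
    return P * kth_mean / K

est = runtime_ratio(P=10, K=5)
exact = 10 * sum(1.0 / i for i in range(6, 11)) / 5   # P * E[X_{K:P}] / K, exact
approx = 10 * math.log(10 / (10 - 5)) / 5             # P * log(P / (P - K)) / K
```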

### 3.2 Error Analysis Under Fixed Learning Rate

Theorem 3 below gives a convergence analysis of $K$-async SGD for a fixed learning rate $\eta$, relaxing the following assumptions made in the existing literature.


• In several prior works such as [mitliagkas2016asynchrony, lee2017speeding, dutta2016short, hannah2017more], it is often assumed, for ease of analysis, that runtimes are exponentially distributed. In this paper, we extend our analysis to any general service time distribution $X$.

• In [mitliagkas2016asynchrony], it is also assumed that the staleness process is independent of $w$. While this assumption simplifies the analysis greatly, it is not true in practice. For instance, in a two-learner case, the parameter after two iterations depends on whether the update in the first iteration was based on a stale gradient or on the current gradient, depending on which learner finished first. In this work, we remove this independence assumption.

• Instead of the bounded delay assumption in [lian2015asynchronous], we use a general staleness bound

$$\mathbb{E}\big[\|\nabla F(w_j) - \nabla F(w_{\tau(l,j)})\|_2^2\big] \le \gamma\, \mathbb{E}\big[\|\nabla F(w_j)\|_2^2\big]$$

which allows for large, but rare, delays.

• In [recht2011hogwild], the norm of the gradient is assumed to be bounded. However, if we assume that $\|g(w, \xi)\|_2 \le M$ for some constant $M$, then using (6) we obtain $2c\,(F(w) - F^*) \le \|\nabla F(w)\|_2^2 \le M^2$, implying that $F(w) - F^*$ is itself bounded, which is a very strong and restrictive assumption that we relax in this result.

Some of these assumptions have been addressed in the context of alternative asynchronous SGD variants in the recent works of [hannah2017more, hannah2016unbounded, sun2017asynchronous, leblond2017asaga].

###### Theorem 3.

Suppose the objective $F(w)$ is $c$-strongly convex and the learning rate $\eta$ is fixed. Also assume that, for some constant $\gamma$,

$$\mathbb{E}\big[\|\nabla F(w_j) - \nabla F(w_{\tau(l,j)})\|_2^2\big] \le \gamma\, \mathbb{E}\big[\|\nabla F(w_j)\|_2^2\big].$$

Then, the error of $K$-async SGD after $J$ iterations is

$$\mathbb{E}[F(w_J)] - F^* \le \frac{\eta L \sigma^2}{2c\gamma' K m} + (1 - \eta c \gamma')^J \left(\mathbb{E}[F(w_0)] - F^* - \frac{\eta L \sigma^2}{2c\gamma' K m}\right) \tag{9}$$

where $\gamma'$ is a constant depending on $\gamma$ and $p_0$, and $p_0$ is a lower bound on the conditional probability that a gradient is not stale, i.e., that $\tau(l,j) = j$, given all the past delays and parameters.

Here, $\gamma$ is a measure of the staleness of the gradients returned by the learners; a smaller $\gamma$ indicates less staleness.

The full proof is provided in Appendix C. We first prove the result for $K = 1$ in Section C.1 for ease of understanding, and then provide the more general proof for any $K$ in Section C.2. We use Lemma 1 below to prove Theorem 3.

###### Lemma 1.

Suppose that $p_0^{(l,j)}$ is the conditional probability that $\tau(l,j) = j$, given all the past delays and all the previous values of $w$, and suppose $p_0^{(l,j)} \ge p_0$ for all $l$ and $j$. Then,

$$\mathbb{E}\big[\|\nabla F(w_{\tau(l,j)})\|_2^2\big] \ge p_0\, \mathbb{E}\big[\|\nabla F(w_j)\|_2^2\big]. \tag{10}$$
###### Proof.

By the law of total expectation,

$$\mathbb{E}\big[\|\nabla F(w_{\tau(l,j)})\|_2^2\big] = p_0^{(l,j)}\, \mathbb{E}\big[\|\nabla F(w_{\tau(l,j)})\|_2^2 \,\big|\, \tau(l,j) = j\big] + \big(1 - p_0^{(l,j)}\big)\, \mathbb{E}\big[\|\nabla F(w_{\tau(l,j)})\|_2^2 \,\big|\, \tau(l,j) \ne j\big] \ge p_0\, \mathbb{E}\big[\|\nabla F(w_j)\|_2^2\big]. \qquad \blacksquare$$

For the exponential distribution, $p_0$ is equal to $\frac{K}{P}$, as we discuss in Lemma 2. For non-exponential distributions, it is a constant in $(0, 1)$. For some special classes of distributions, such as new-longer-than-used (respectively new-shorter-than-used) distributions as defined in Definition 3, we can formally show that $p_0$ lies in $(0, \frac{K}{P}]$ (respectively $[\frac{K}{P}, 1)$). The following Lemma 2 provides bounds on $p_0$.

###### Lemma 2 (Bounds on p0).

Define $p_0 = \inf_{l,j} p_0^{(l,j)}$, i.e., the largest constant such that $p_0^{(l,j)} \ge p_0$ for all $l$ and $j$.

• For exponential computation times, $p_0^{(l,j)} = \frac{K}{P}$ for all $l, j$, and $p_0 = \frac{K}{P}$ is thus invariant of $l$ and $j$.

• For new-longer-than-used (see Definition 3) computation times, $p_0^{(l,j)} \le \frac{K}{P}$ and thus $p_0 \in (0, \frac{K}{P}]$.

• For new-shorter-than-used computation times, $p_0^{(l,j)} \ge \frac{K}{P}$ and thus $p_0 \in [\frac{K}{P}, 1)$.

The proof is provided in Section C.1.1.

For $K$-batch-async, the update rule is the same as for $K$-async, except that the index $l$ denotes the index of the mini-batch. Thus, the error analysis is exactly analogous. Our analysis can also be extended to non-convex objectives, as we show in Section C.2.1.

Now let us compare with $K$-sync SGD. We observe that the analysis of $K$-sync SGD is the same as that of serial SGD with mini-batch size $Km$. Thus,

###### Lemma 3 (Error of K-sync).

[bottou2016optimization] Suppose that the objective is $c$-strongly convex and the learning rate $\eta$ is fixed. Then, the error after $J$ iterations of $K$-sync SGD is

$$\mathbb{E}[F(w_J)] - F^* \le \frac{\eta L \sigma^2}{2cKm} + (1 - \eta c)^J \left(\mathbb{E}[F(w_0)] - F^* - \frac{\eta L \sigma^2}{2cKm}\right).$$

Can stale gradients win the race? For the same $\eta$, observe that the error given by Theorem 3 decays at the rate $(1 - \eta c \gamma')$ per iteration for $K$-async or $K$-batch-async SGD, while for $K$-sync SGD the decay rate with the number of iterations is $(1 - \eta c)$. Thus, depending on the values of $\gamma$ and $p_0$, the decay rate of $K$-async or $K$-batch-async SGD can be faster or slower than that of $K$-sync SGD; it is faster if $\gamma' > 1$. As an example, one might consider an exponential or new-shorter-than-used service time, where $p_0 \ge \frac{K}{P}$ and the staleness $\gamma$ can be made smaller by increasing $K$. It might be noted that asynchronous SGD can still be faster than synchronous SGD with respect to wall clock time even if its decay rate with respect to the number of iterations is lower, as every iteration is much faster in asynchronous SGD (roughly $P \log P$ times faster for fully asynchronous SGD with exponential service times, by Corollary 1).
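
The per-iteration comparison above can be evaluated numerically from the bound (9). The constants below ($\eta$, $c$, $L$, $\sigma^2$, $K$, $m$, and the initial gap) are illustrative assumptions, with the synchronous bound treated as the $\gamma' = 1$ case; the point is only that a smaller $\gamma'$ slows the per-iteration decay and raises the error floor.

```python
def error_bound(J, gamma_p, eta=0.01, c=1.0, L=10.0, sigma2=1.0, K=4, m=8, gap0=1.0):
    """Right-hand side of (9): an error floor plus geometric decay of the
    initial optimality gap at rate (1 - eta * c * gamma')."""
    floor = eta * L * sigma2 / (2 * c * gamma_p * K * m)
    return floor + (1 - eta * c * gamma_p) ** J * (gap0 - floor)

# gamma' = 1 recovers the synchronous decay rate; gamma' < 1 models staleness.
sync_bound = error_bound(J=1000, gamma_p=1.0)
async_bound = error_bound(J=1000, gamma_p=0.5)
```

Per iteration, the smaller $\gamma'$ gives the worse bound; the asynchronous variant can still win in wallclock time because its iterations are much faster.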

The maximum allowable learning rate for synchronous SGD can be much higher than that for asynchronous SGD. Similarly, the error floor for synchronous SGD is $\frac{\eta L \sigma^2}{2cKm}$, as compared to the asynchronous error floor of $\frac{\eta L \sigma^2}{2c\gamma' K m}$.

In Figure 7, we compare the theoretical trade-offs between synchronous SGD (Lemma 3) and asynchronous SGD (Theorem 3). Async-SGD converges very quickly, but to a higher error floor. Figure 8 shows the same comparison on the MNIST dataset, along with $K$-batch-async SGD.

### 3.3 Variable Learning Rate for Staleness Compensation

The staleness of the gradient is random and can vary across iterations. Intuitively, if the gradient is less stale, we want to weigh it more while updating the parameter $w$, and if it is more stale we want to scale down its contribution to the update. With this motivation, we propose the following condition on the learning rate $\eta_j$ at different iterations:

$$\eta_j\, \mathbb{E}\big[\|w_j - w_{\tau(j)}\|_2^2\big] \le C \tag{11}$$

for a constant $C$. This condition is also inspired by our error analysis in Theorem 3, because it helps remove the bounded-staleness assumption made there. Using (11), we obtain the following convergence result.

###### Theorem 4.

Suppose the learning rate in the $j$-th iteration is $\eta_j$, and

$$\eta_j\, \mathbb{E}\big[\|w_j - w_{\tau(j)}\|_2^2\big] \le C$$

for some constant $C$. Then, we have

$$\mathbb{E}[F(w_J)] - F^* \le \Delta + \big(\mathbb{E}[F(w_0)] - F^*\big) \prod_{j=1}^{J} (1 - \rho_j)$$

where the per-iteration decay factors $\rho_j$ and the error floor $\Delta$ are constants determined by the learning rates $\eta_j$, the constant $C$, and the problem parameters.

The proof is provided in Section C.3. In our analysis of asynchronous SGD, we observe that the staleness term $\mathbb{E}\big[\|\nabla F(w_j) - \nabla F(w_{\tau(j)})\|_2^2\big]$ is the most difficult to bound. For a fixed learning rate, we had assumed that it is bounded by $\gamma\, \mathbb{E}\big[\|\nabla F(w_j)\|_2^2\big]$. However, if we impose the condition (11) on $\eta_j$, we do not require this assumption. Our proposed condition actually provides a bound on the staleness term as follows:

$$\frac{\eta_j}{2}\, \mathbb{E}\big[\|\nabla F(w_j) - \nabla F(w_{\tau(j)})\|_2^2\big] \le \frac{\eta_j L^2}{2}\, \mathbb{E}\big[\|w_j - w_{\tau(j)}\|_2^2\big] \le \frac{C L^2}{2}. \tag{12}$$

Proposed Algorithmic Modification. Inspired by this analysis, we propose the learning rate schedule

$$\eta_j = \min\left\{ \frac{C}{\|w_j - w_{\tau(j)}\|_2^2},\ \eta_{\max} \right\} \tag{13}$$

where $\eta_{\max}$ is a suitably large ceiling on the learning rate. It ensures stability when the first term in (13) becomes large because the staleness $\|w_j - w_{\tau(j)}\|_2^2$ is small. The constant $C$ is chosen to be of the same order as the desired error floor. To implement this schedule, the PS needs to store the last-read model parameters for every learner. In Figure 9 we illustrate how this schedule can stabilize asynchronous SGD. We also show simulation results that characterize the performance of this algorithm in comparison with naive asynchronous SGD with a fixed learning rate.
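
A minimal sketch of the schedule (13), assuming the PS stores each learner's last-read copy of the parameter vector; the constants `C` and `eta_max` are illustrative.

```python
def staleness_lr(w_current, w_last_read, C=1e-3, eta_max=0.1):
    """Learning rate schedule (13): cap the step at eta_max, and scale it
    down inversely with the squared distance between the current parameter
    and the (possibly stale) copy the learner last read."""
    stale2 = sum((a - b) ** 2 for a, b in zip(w_current, w_last_read))
    return eta_max if stale2 == 0 else min(C / stale2, eta_max)

fresh = staleness_lr([1.0, 2.0], [1.0, 2.0])       # no staleness -> eta_max
very_stale = staleness_lr([1.0, 2.0], [4.0, 6.0])  # ||diff||^2 = 25 -> C / 25
```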

###### Remark 2.

The idea of a variable learning rate is related to the idea of momentum tuning in [mitliagkas2016asynchrony, zhang2017yellowfin] and may have a similar effect of stabilizing the convergence of asynchronous SGD. However, learning rate tuning is arguably more general, since asynchrony results in a momentum term in the gradient update (as shown in [mitliagkas2016asynchrony, zhang2017yellowfin]) only under the assumption that the staleness process is geometric and independent of $w$.

## 4 Runtime Analysis

In this section, we provide our analysis of the expected runtime of different variants of SGD. These lemmas are then used in the proofs of Theorem 1 and Theorem 2.

### 4.1 Runtime of K-Sync SGD

###### Lemma 4 (Runtime of K-sync SGD).

The expected runtime per iteration for $K$-sync SGD is

$$\mathbb{E}[T] = \mathbb{E}[X_{K:P}] \tag{14}$$

where $X_{K:P}$ is the $K$-th order statistic of the $P$ i.i.d. random variables $X_1, X_2, \ldots, X_P$.

###### Proof of Lemma 4.

We assume that the learners have i.i.d. computation times. When all $P$ learners start together and we wait for the first $K$ out of $P$ i.i.d. random variables to finish, the expected computation time for that iteration is $\mathbb{E}[X_{K:P}]$, where $X_{K:P}$ denotes the $K$-th order statistic of the $P$ i.i.d. random variables $X_1, X_2, \ldots, X_P$. ∎

Thus, for a total of $J$ iterations, the expected runtime is given by $J\, \mathbb{E}[X_{K:P}]$.

###### Remark 3.

For $X_i \sim \exp(\mu)$, the expected runtime per iteration is given by

$$\mathbb{E}[T] = \frac{1}{\mu} \sum_{i=P-K+1}^{P} \frac{1}{i} \approx \frac{1}{\mu} \log\Big(\frac{P}{P-K}\Big)$$

where the last step uses an approximation from [sheldon2002first]. For justification, the reader is referred to Section B.1.
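
The harmonic sum above and its logarithmic approximation can be checked directly:

```python
import math

def exact_runtime(P, K, mu=1.0):
    """Exact E[X_{K:P}] for exp(mu) service times: (1/mu) * sum_{i=P-K+1}^{P} 1/i."""
    return sum(1.0 / i for i in range(P - K + 1, P + 1)) / mu

def approx_runtime(P, K, mu=1.0):
    """Remark 3 approximation: (1/mu) * log(P / (P - K))."""
    return math.log(P / (P - K)) / mu

exact = exact_runtime(P=100, K=50)
approx = approx_runtime(P=100, K=50)
```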

### 4.2 Runtime of K-Batch-Sync SGD

The expected runtime of $K$-batch-sync SGD is not analytically tractable in general, but for $X_i \sim \exp(\mu)$, the runtime per iteration is distributed as the sum of $K$ i.i.d. $\exp(P\mu)$ random variables. Refer to Section B.2 for an explanation. Thus, for $K$-batch-sync SGD, the expected time per iteration is given by

$$\mathbb{E}[T] = \frac{K}{P\mu}.$$

### 4.3 Runtime of K-Batch-Async SGD

###### Lemma 5 (Runtime of K-batch-async SGD).

The expected runtime per iteration for $K$-batch-async SGD, in the limit of a large number of iterations, is given by:

$$\mathbb{E}[T] = \frac{K\, \mathbb{E}[X]}{P}. \tag{15}$$

Unlike the results for the synchronous variants, this result on the average runtime per iteration holds only in the limit of a large number of iterations. To prove the result we use ideas from renewal theory. For a brief background on renewal theory, the reader is referred to Section B.3.

###### Proof of Lemma 5.

For the i-th learner, let N_i(t) be the number of times the i-th learner pushes its gradient to the PS in time t. The time between two successive pushes is an independent realization of X_i. Thus, the inter-arrival times are i.i.d. with mean inter-arrival time E[X_i]. Using the elementary renewal theorem [gallager2013stochastic, Chapter 5], we have

 lim_{t→∞} E[N_i(t)]/t = 1/E[X_i]. (16)

Thus, the rate of gradient pushes by the i-th learner is 1/E[X_i]. As there are P learners, we have a superposition of P renewal processes, and thus the average rate of gradient pushes to the PS is

 lim_{t→∞} ∑_{i=1}^{P} E[N_i(t)]/t = ∑_{i=1}^{P} 1/E[X_i] = P/E[X]. (17)

Every K pushes constitute one iteration. Thus, the expected runtime per iteration, or effectively the expected time for K pushes, is given by E[T] = K E[X]/P. ∎

Thus, for a large total number of iterations, the average runtime can be approximated by multiplying K E[X]/P by the number of iterations. Note that fully-synchronous SGD is actually K-sync SGD with K = P, i.e., waiting for all the P learners to finish. On the other hand, fully-asynchronous SGD is actually K-batch-async SGD with K = 1. Now, we provide the proofs of Theorem 1 and Corollary 1 respectively, which compare these two variants.
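To illustrate Lemma 5, here is a small renewal-style simulation (our own sketch, not from the original text; all parameter values are arbitrary). We deliberately use a shifted-exponential computation time to emphasize that the result depends only on E[X], not on X being exponential:

```python
import random

def kbatch_async_time_per_iter(P, K, sample_x, horizon, seed=0):
    """Simulate P independent learners, each repeatedly drawing an i.i.d.
    computation time X and pushing one gradient per completion; every K
    pushes form one iteration. Returns average wallclock time per iteration."""
    rng = random.Random(seed)
    pushes = 0
    for _ in range(P):
        t = 0.0
        while True:
            t += sample_x(rng)
            if t > horizon:
                break
            pushes += 1
    return horizon / (pushes / K)

# Shifted-exponential computation time X = delta + exp(mu), so E[X] = delta + 1/mu.
delta, mu, P, K = 0.5, 2.0, 8, 4
expected = K * (delta + 1.0 / mu) / P          # Lemma 5: K * E[X] / P
sim = kbatch_async_time_per_iter(P, K, lambda r: delta + r.expovariate(mu), horizon=50_000)
print(f"simulated {sim:.4f}  predicted {expected:.4f}")
```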

###### Proof of Theorem 1.

By taking the ratio of the expected runtimes per iteration in Lemma 4 with K = P and Lemma 5 with K = 1, we get the result in Theorem 1. ∎

###### Proof of Corollary 1.

The expectation of the maximum of P i.i.d. exp(μ) random variables is approximately (log P)/μ [sheldon2002first]. This can be substituted into Theorem 1 to obtain Corollary 1. ∎

### 4.4 Runtime of K-Async SGD

The expected runtime per iteration of K-async SGD is not analytically tractable for non-exponential X_i, but we obtain an upper bound on it for a class of distributions called the "new-longer-than-used" distributions, defined below.

###### Definition 3 (New-longer-than-used).

A random variable U is said to have a new-longer-than-used distribution if the following holds for all t, u ≥ 0:

 Pr(U > u + t | U > t) ≤ Pr(U > u).

Most of the continuous distributions we encounter, such as the normal, exponential, gamma, and beta distributions, are new-longer-than-used. Conversely, the hyperexponential distribution is new-shorter-than-used, and it satisfies Pr(U > u + t | U > t) ≥ Pr(U > u) for all t, u ≥ 0.
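The two cases can be checked numerically. In the sketch below (an illustration we add; the mixture parameters are arbitrary), the hyperexponential's conditional survival probability exceeds its unconditional one, while the memoryless exponential attains equality:

```python
import math

def survival_hyperexp(u, p=0.5, lam1=0.5, lam2=5.0):
    """Survival function Pr(U > u) of a two-phase hyperexponential:
    exp(lam1) with probability p, exp(lam2) with probability 1 - p."""
    return p * math.exp(-lam1 * u) + (1 - p) * math.exp(-lam2 * u)

def survival_exp(u, lam=1.0):
    """Survival function of an exponential with rate lam."""
    return math.exp(-lam * u)

def cond_survival(S, u, t):
    """Pr(U > u + t | U > t) for a survival function S."""
    return S(u + t) / S(t)

u, t = 1.0, 2.0
# Hyperexponential is new-shorter-than-used: having already survived t,
# it is MORE likely to survive a further u than a fresh draw is.
assert cond_survival(survival_hyperexp, u, t) >= survival_hyperexp(u)
# The memoryless exponential sits exactly at the boundary (equality).
assert abs(cond_survival(survival_exp, u, t) - survival_exp(u)) < 1e-12
print("hyperexponential: new-shorter-than-used; exponential: memoryless")
```

Intuitively, a straggler with a hyperexponential computation time that has already run for a long while is likely to be in its slow phase, so its remaining time looks longer than a fresh start.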

###### Lemma 6 (Runtime of K-async SGD).

Suppose that each X_i has a new-longer-than-used distribution. Then, the expected runtime per iteration for K-async SGD is upper-bounded as

 E[T] ≤ E[X_{K:P}] (18)

where X_{K:P} is the K-th order statistic of P i.i.d. random variables X_1, X_2, …, X_P.

The proof of this lemma is provided in Section B.4.

We provided a comparison of the expected runtimes of the K-async and K-batch-async SGD variants in Theorem 2 for the special case of exponential computation times. Here, we provide the proof of Theorem 2.

###### Proof of Theorem 2.

For exponential X_i, equality holds in (18) in Lemma 6, as we justify in Section B.4.1. The expectation can be derived as E[T] = E[X_{K:P}] = (H_P − H_{P−K})/μ, where H_P denotes the P-th harmonic number. For exponential X_i, the expected runtime per iteration for K-batch-async SGD is K/(Pμ) from Lemma 5. ∎
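As a numeric illustration of this equality (our own sketch, with arbitrary `P`, `K`, `mu`), the following event-driven simulation of K-async SGD tracks the residual work of each learner and compares the average iteration time against E[X_{K:P}]:

```python
import random

def kasync_mean_iter_time(P, K, mu, iters=100_000, seed=1):
    """Event-driven sketch of K-async SGD with exp(mu) computation times:
    an iteration ends when K of the P in-flight computations finish; those
    K learners draw fresh work, the others keep their leftover (residual) work."""
    rng = random.Random(seed)
    residual = [rng.expovariate(mu) for _ in range(P)]
    total = 0.0
    for _ in range(iters):
        order = sorted(range(P), key=residual.__getitem__)
        tau = residual[order[K - 1]]      # wallclock length of this iteration
        total += tau
        done = set(order[:K])
        for i in range(P):
            if i in done:
                residual[i] = rng.expovariate(mu)  # finished learner restarts
            else:
                residual[i] -= tau                 # still computing
    return total / iters

P, K, mu = 8, 3, 1.0
sim = kasync_mean_iter_time(P, K, mu)
# For exponential times, memorylessness gives equality in (18):
exact = sum(1.0 / i for i in range(P - K + 1, P + 1)) / mu  # E[X_{K:P}]
print(f"simulated {sim:.4f}  E[X_{{K:P}}] = {exact:.4f}")
```

For new-longer-than-used but non-exponential distributions, the simulated mean would fall below `exact`, consistent with the upper bound in Lemma 6.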

In Figure 10, we pictorially illustrate the expected error-runtime trade-offs of K-async and K-batch-async SGD.

## 5 Conclusions

The speed of distributed SGD depends on the error reduction per iteration as well as the runtime per iteration. This paper presents a novel runtime analysis of synchronous and asynchronous SGD and their variants, for any general distribution on the wall-clock computation time of each learner. When juxtaposed with the error analysis, this yields error-runtime trade-offs that can be used to compare different SGD algorithms. We also give a new analysis of asynchronous SGD by relaxing some commonly made assumptions, and we propose a novel learning-rate schedule to compensate for gradient staleness.

In the future, we plan to explore methods that gradually increase synchrony, so that we can achieve fast convergence as well as a low error floor. We are also looking into the use of local updates to reduce the frequency of communication between the PS and the learners, which is closely related to [zhang2016parallel, yin2017gradient, zhou2017convergence, zhang2015deep].

### Acknowledgements

The authors thank Mark Wegman, Pulkit Grover and Jianyu Wang for their suggestions and feedback.

## Appendix A Strong Convexity Discussion

###### Definition 4 (Strong-Convexity).

A function h(u) is defined to be c-strongly convex if the following holds for all u and u′ in the domain:

 h(u′) ≥ h(u) + ∇h(u)^T (u′ − u) + (c/2) ||u′ − u||_2^2.

For strongly convex functions, the following result holds for all u in the domain of h:

 2c (h(u) − h*) ≤ ||∇h(u)||_2^2 (19)

The proof is derived in [bottou2016optimization]. For completeness, we give the sketch here.

###### Proof.

Given a particular u, let us define the quadratic function q(u′) as follows:

 q(u′) = h(u) + ∇h(u)^T (u′ − u) + (c/2) ||u′ − u||_2^2.

Now, q(u′) is minimized at u′ = u − (1/c) ∇h(u), and its minimum value is h(u) − (1/(2c)) ||∇h(u)||_2^2. Since strong convexity gives h(u′) ≥ q(u′) for all u′, we now have

 h* = min_{u′} h(u′) ≥ min_{u′} q(u′) = h(u) − (1/(2c)) ||∇h(u)||_2^2.

Rearranging yields (19). ∎
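The inequality can be sanity-checked numerically. The sketch below (our illustration; the curvature values are arbitrary) uses the separable quadratic h(u) = (1/2) Σ_i a_i u_i^2, which is c-strongly convex with c = min_i a_i and minimized at u = 0:

```python
import random

# Numeric check of 2c(h(u) - h*) <= ||grad h(u)||_2^2, i.e., inequality (19),
# on the separable quadratic h(u) = 0.5 * sum_i a_i * u_i^2.
a = [1.0, 3.0, 10.0]       # arbitrary curvatures for this illustration
c = min(a)                 # strong-convexity constant
h_star = 0.0               # h is minimized at u = 0

def h(u):
    return 0.5 * sum(ai * ui * ui for ai, ui in zip(a, u))

def grad_sq_norm(u):
    # ||grad h(u)||_2^2 with (grad h(u))_i = a_i * u_i
    return sum((ai * ui) ** 2 for ai, ui in zip(a, u))

rng = random.Random(0)
for _ in range(1000):
    u = [rng.uniform(-5.0, 5.0) for _ in a]
    assert 2 * c * (h(u) - h_star) <= grad_sq_norm(u) + 1e-9
print("inequality (19) held at all 1000 sampled points")
```

For this quadratic the check reduces to c · a_i ≤ a_i^2 coordinate-wise, which holds since c = min_i a_i.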

## Appendix B Runtime Analysis Proofs

Here we provide all the remaining proofs and supplementary information for the results in Section 4.

### B.1 Runtime of K-sync SGD

K-th order statistic of exponential random variables: Here we give a sketch of why the K-th order statistic of P i.i.d. exponentials scales as log(P/(P−K)). A detailed derivation can be obtained in [sheldon2002first]. Consider P i.i.d. exponential random variables with parameter μ. The minimum of P independent exponential random variables with parameter μ is exponential with parameter Pμ. Conditional on the minimum, the second smallest value is distributed as the sum of the minimum and an independent exponential random variable with parameter (P−1)μ. And so on, until the K-th smallest value, which is distributed as the sum of the (K−1)-th smallest value and an independent exponential random variable with parameter (P−K+1)μ. Thus,

 X_{K:P} = Y_P + Y_{P−1} + ⋯ + Y_{P−K+1}

where the random variables Y_i are independent and Y_i is exponential with parameter iμ. Thus,

 E[X_{K:P}] = ∑_{i=P−K+1}^{P} 1/(iμ) = (H_P − H_{P−K})/μ ≈ (1/μ) log( P/(P−K) ).

Here H_P and H_{P−K} denote the P-th and (P−K)-th harmonic numbers respectively.

For the case where K = P, the expectation is given by

 E[X_{P:P}] = (1/μ) ∑_{i=1}^{P} 1/i = H_P/μ ≈ (1/μ) log P.
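The quality of the logarithmic approximation to the harmonic-number difference can be tabulated directly (a small illustration we add here, with μ = 1 and arbitrary (P, K) pairs):

```python
import math

def expected_kth_order_stat(P, K, mu=1.0):
    """E[X_{K:P}] for P i.i.d. exp(mu) variables: (H_P - H_{P-K}) / mu."""
    return sum(1.0 / i for i in range(P - K + 1, P + 1)) / mu

for P, K in [(10, 5), (100, 50), (100, 99), (1000, 999)]:
    exact = expected_kth_order_stat(P, K)
    approx = math.log(P / (P - K))
    print(f"P={P:5d} K={K:4d}  harmonic sum {exact:.4f}  log(P/(P-K)) {approx:.4f}")
```

The approximation tightens as P grows for fixed K/P, and is loosest when K is very close to P.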

### B.2 Runtime of K-batch-sync SGD

In general, the expected runtime per iteration of K-batch-sync SGD is not tractable, but for the special case of exponentials it follows the Erlang(K, Pμ) distribution. This is obtained from the memoryless property of exponentials.

All the P learners start their computation together. The time taken for the first mini-batch to be completed is the minimum of P i.i.d. exponential random variables, which is another exponential random variable distributed as exp(Pμ). At the time when the first mini-batch is complete, by the memoryless property of exponentials, the remaining computation times may be viewed as P i.i.d. exponential random variables starting afresh. Thus, the time to complete each mini-batch is distributed as exp(Pμ), and an iteration, being the sum of the times to complete K such mini-batches, has the Erlang(K, Pμ) distribution.

### B.3 Runtime of K-batch-async SGD

Here we include a discussion of renewal processes for completeness, to provide background for the proof of Lemma 5, which gives the expected runtime of K-batch-async SGD. The familiar reader can merely skim this section and refer to the proof provided in Section 4 of the main paper.

###### Definition 5 (Renewal Process).

A renewal process is an arrival process where the inter-arrival intervals are positive, independent and identically distributed random variables.

###### Lemma 7 (Elementary Renewal Theorem).

[gallager2013stochastic, Chapter 5] Let {N(t), t ≥ 0} be a renewal counting process denoting the number of renewals in time t, and let E[Z] be the mean inter-arrival time. Then,

 lim_{t→∞} E[N(t)]/t = 1/E[Z]. (20)

Observe that for asynchronous SGD or K-batch-async SGD, the gradient pushes by a learner to the PS can be thought of as an arrival process. The time between two consecutive pushes by the i-th learner follows the distribution of X_i and is independent across pushes, as the computation time has been assumed to be independent across learners and mini-batches. Thus the inter-arrival intervals are positive, independent, and identically distributed, and hence the gradient pushes form a renewal process.

### B.4 Runtime of K-async SGD

###### Proof of Lemma 6.

For new-longer-than-used distributions, observe that the following holds:

 Pr(X_i > u + t | X_i > t) ≤ Pr(X_i > u). (21)

Thus, the residual time (X_i − t | X_i > t) is stochastically dominated by X_i. Now suppose we want to compute the expected computation time of one iteration of K-async SGD starting at time t_0. Suppose also that the P learners last read their parameter values at times t_1, t_2, …, t_P respectively, where K of these are equal to t_0 (since K out of the P learners were updated at time t_0) and the remaining P − K are strictly less than t_0. Let Y_1, Y_2, …, Y_P be the random variables denoting the residual computation times of the P learners starting from time t_0. Thus,

 Y_i = ( X_i − (t_0 − t_i) | X_i > (t_0 − t_i) )  ∀ i = 1, 2, …, P. (22)

Now each of the Y_i's is independent and stochastically dominated by the corresponding X_i:

 Pr(Y_i > u) ≤ Pr(X_i > u)  ∀ i = 1, 2, …, P. (23)

The expectation of the K-th order statistic of Y_1, Y_2, …, Y_P is the expected runtime of the iteration. Let h_K(x_1, x_2, …, x_P) denote the K-th order statistic of the P numbers x_1, x_2, …, x_P. And let g_{K,s}(x) denote the K-th order statistic of P numbers where P − 1 of them are fixed as s = (s^(1), s^(2), …, s^(P−1)) and x is the P-th number. Thus,

 g_{K,s}(x) = h_K(x, s^(1), s^(2), …, s^(P−1)).

First observe that g_{K,s}(x) is an increasing function of x, since given the other P − 1 values, the K-th order statistic either stays the same or increases with x. Now we use the property that if Y_1 is stochastically dominated by X_1, then for any increasing function g, we have

 E_{Y_1}[g(Y_1)] ≤ E_{X_1}[g(X_1)].

This result is derived in [kreps1990course].

This implies that for a given s,

 E_{Y_1}[g_{K,s}(Y_1)] ≤ E_{X_1}[g_{K,s}(X_1)].