Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates

03/03/2021
by   Sebastian U. Stich, et al.

It has been experimentally observed that the efficiency of distributed training with stochastic gradient descent (SGD) depends decisively on the batch size and, in asynchronous implementations, on the gradient staleness. Notably, the speedup has been observed to saturate beyond a certain batch size and/or when the delays grow too large. We identify a data-dependent parameter that explains the speedup saturation in both of these settings. Our comprehensive theoretical analysis, covering strongly convex, convex, and non-convex settings, unifies and generalizes prior lines of work that often focused on only one of these two aspects. In particular, our approach allows us to derive improved speedup results under frequently considered sparsity assumptions. Our insights give rise to theoretically grounded guidelines on how the learning rates can be adjusted in practice. We show that our results are tight and illustrate key findings in numerical experiments.
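The staleness effect described above can be illustrated with a short simulation. The following Python sketch runs asynchronous-style SGD in which every update applies a gradient that is tau steps old and damps the step size by the delay; the 1/tau scaling is a common heuristic from the delayed-SGD literature, and the quadratic objective, tau, and base_lr values are illustrative assumptions, not the paper's exact prescription.

```python
import numpy as np

def delayed_sgd(grad_fn, x0, n_steps=1000, tau=8, base_lr=0.1):
    """Run SGD where every update applies a gradient computed tau steps ago.

    Illustrative only: the 1/tau learning-rate damping is a common heuristic
    for delayed/asynchronous SGD, not the exact rule from the paper.
    """
    lr = base_lr / max(1, tau)                   # delay-damped step size
    x = x0.astype(float).copy()
    queue = [grad_fn(x) for _ in range(tau)]     # stale gradients "in flight"
    for _ in range(n_steps):
        g = queue.pop(0)                         # apply the oldest (stalest) gradient
        x = x - lr * g
        queue.append(grad_fn(x))                 # enqueue a fresh gradient at the current iterate
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.diag([1.0, 10.0])                     # simple quadratic f(x) = 0.5 x^T A x
    grad = lambda x: A @ x + 0.01 * rng.standard_normal(2)   # noisy gradient oracle
    x_final = delayed_sgd(grad, x0=np.array([5.0, 5.0]), tau=8)
    print("distance to optimum:", np.linalg.norm(x_final))
```

Experimenting with larger values of tau (and the correspondingly smaller effective step size) gives a qualitative feel for why the wall-clock speedup flattens as delays grow, which is the phenomenon the paper's data-dependent parameter characterizes.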


Related research

05/24/2018 - Taming Convergence for Asynchronous Stochastic Gradient Descent with Unbounded Delay in Non-Convex Learning
Understanding the convergence performance of asynchronous stochastic gra...

10/21/2019 - Sparsification as a Remedy for Staleness in Distributed Asynchronous SGD
Large scale machine learning is increasingly relying on distributed opti...

06/15/2022 - Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays
The existing analysis of asynchronous stochastic gradient descent (SGD) ...

06/27/2015 - Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization
Asynchronous parallel implementations of stochastic gradient (SG) have b...

09/06/2019 - Decentralized Stochastic Gradient Tracking for Non-convex Empirical Risk Minimization
This paper studies a decentralized stochastic gradient tracking (DSGT) a...

11/08/2018 - (Near) Optimal Parallelism Bound for Fully Asynchronous Coordinate Descent with Linear Speedup
When solving massive optimization problems in areas such as machine lear...

01/14/2020 - Distributed Learning in the Non-Convex World: From Batch to Streaming Data, and Beyond
Distributed learning has become a critical enabler of the massively conn...
