Asynchronous Stochastic Optimization Robust to Arbitrary Delays

06/22/2021
by Alon Cohen, et al.

We consider stochastic optimization with delayed gradients where, at each time step t, the algorithm makes an update using a stale stochastic gradient from step t - d_t for some arbitrary delay d_t. This setting abstracts asynchronous distributed optimization, where a central server receives gradient updates computed by worker machines whose computation and communication loads may vary significantly over time. In the general non-convex smooth optimization setting, we give a simple and efficient algorithm that requires O(σ^2/ϵ^4 + τ/ϵ^2) steps to find an ϵ-stationary point x, where τ = (1/T)∑_{t=1}^{T} d_t is the average delay and σ^2 is the variance of the stochastic gradients. This improves over previous work, which showed that stochastic gradient descent achieves the same rate but with respect to the maximal delay max_t d_t, which can be significantly larger than the average delay, especially in heterogeneous distributed systems. Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
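To make the delayed-gradient setting concrete, here is a minimal sketch (not the authors' algorithm) that simulates plain SGD with stale gradients: at step t the update uses a stochastic gradient evaluated at the earlier iterate x_{t - d_t}, with delays drawn from a heavy-tailed distribution so that the maximal delay far exceeds the average delay τ. The toy objective, step size, noise level, and Pareto delay distribution are illustrative assumptions, not taken from the paper.

```python
# Sketch of the delayed-gradient setting: each update uses a stochastic
# gradient computed at the iterate from d_t steps ago. Heavy-tailed delays
# make max_t d_t much larger than the average delay tau.
import numpy as np

rng = np.random.default_rng(0)
dim, T, eta, sigma = 10, 5000, 0.01, 1.0   # illustrative hyperparameters

def grad(x):
    # Gradient of a smooth non-convex toy objective f(x) = sum_i x_i^2 / (1 + x_i^2)
    return 2 * x / (1 + x**2) ** 2

x = rng.normal(size=dim)        # current iterate
history = [x.copy()]            # past iterates, so stale gradients can be formed
delays = []

for t in range(T):
    # Heavy-tailed (Pareto) delay, capped so the referenced iterate exists.
    d_t = min(int(rng.pareto(1.5)), t)
    delays.append(d_t)
    stale_x = history[t - d_t]                          # iterate from step t - d_t
    g = grad(stale_x) + sigma * rng.normal(size=dim)    # noisy gradient at the stale point
    x = x - eta * g                                     # SGD step with the delayed gradient
    history.append(x.copy())

print(f"avg delay tau = {np.mean(delays):.2f}, max delay = {max(delays)}")
print(f"final gradient norm = {np.linalg.norm(grad(x)):.4f}")
```

Running this and comparing the printed average and maximal delays illustrates why a convergence bound depending on τ, as in the abstract, can be much stronger than one depending on max_t d_t under skewed delay distributions.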


Related research

06/16/2022  Sharper Convergence Guarantees for Asynchronous SGD for Distributed and Federated Learning
08/20/2015  AdaDelay: Delay Adaptive Distributed Stochastic Convex Optimization
09/11/2019  The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication
07/06/2021  Distributed stochastic optimization with large delays
06/08/2019  Reducing the variance in online optimization by transporting past gradients
08/28/2022  Asynchronous Training Schemes in Distributed Learning with Time Delay
07/21/2023  Robust Fully-Asynchronous Methods for Distributed Training over General Architecture
