Straggler-Resilient Decentralized Learning via Adaptive Asynchronous Updates

06/11/2023
by   Guojun Xiong, et al.

With the increasing demand for large-scale training of machine learning models, fully decentralized optimization methods have recently been advocated as alternatives to the popular parameter-server framework. In this paradigm, each worker maintains a local estimate of the optimal parameter vector and iteratively updates it by waiting for and averaging the estimates obtained from its neighbors, then correcting it on the basis of its local dataset. However, this synchronization phase is sensitive to stragglers. An efficient way to mitigate the straggler effect is to use asynchronous updates, where each worker computes stochastic gradients and communicates with other workers at its own pace. Unfortunately, fully asynchronous updates suffer from the staleness of the stragglers' parameters. To address these limitations, we propose DSGD-AAU, a fully decentralized algorithm with adaptive asynchronous updates that adaptively determines the number of neighbors each worker communicates with. We show that DSGD-AAU achieves a linear speedup for convergence, i.e., convergence performance improves linearly with the number of workers. Experimental results on a suite of datasets and deep neural network models verify our theoretical results.
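To make the idea concrete, below is a minimal toy sketch (not the authors' implementation) of decentralized SGD with adaptive asynchronous updates: each worker averages its estimate with whichever neighbors have fresh parameters available, waits for more only if too few responded, and then takes a local stochastic gradient step on its own data shard. The names (`k_min`, `local_grad`, the 0.7 response probability, the ring topology, the least-squares objective) are illustrative assumptions, not details from the paper.

```python
# Toy simulation of decentralized SGD with adaptive asynchronous updates.
# Hypothetical sketch: worker i averages only with the neighbors whose fresh
# parameters arrive "in time" (simulated with a Bernoulli draw per round),
# extending the wait only when fewer than k_min neighbors have responded,
# then takes a local SGD step on its own data shard.
import numpy as np

rng = np.random.default_rng(0)

n_workers, dim, n_rounds = 8, 5, 200
k_min = 2                       # minimum number of neighbors to wait for (assumption)
lr = 0.05

# Synthetic least-squares problem, sharded across workers.
X = rng.normal(size=(n_workers, 50, dim))
w_true = rng.normal(size=dim)
Y = X @ w_true + 0.01 * rng.normal(size=(n_workers, 50))

# Ring topology: each worker's neighbors are the two adjacent workers.
neighbors = [[(i - 1) % n_workers, (i + 1) % n_workers] for i in range(n_workers)]

params = rng.normal(size=(n_workers, dim))   # local parameter estimates

def local_grad(i, w):
    """Stochastic gradient of worker i's local least-squares loss."""
    idx = rng.integers(0, 50, size=8)        # mini-batch of local samples
    xb, yb = X[i, idx], Y[i, idx]
    return xb.T @ (xb @ w - yb) / len(idx)

for t in range(n_rounds):
    new_params = params.copy()
    for i in range(n_workers):
        # Simulated straggling: each neighbor responds this round w.p. 0.7.
        fresh = [j for j in neighbors[i] if rng.random() < 0.7]
        # Adaptive waiting: if too few neighbors responded, fall back to
        # waiting for all of them (a stand-in for waiting until k_min arrive).
        if len(fresh) < k_min:
            fresh = neighbors[i]
        # Average own estimate with the fresh neighbor estimates, then take a
        # local stochastic gradient step.
        avg = np.mean(params[[i] + fresh], axis=0)
        new_params[i] = avg - lr * local_grad(i, avg)
    params = new_params

print("mean distance to w_true:", np.linalg.norm(params - w_true, axis=1).mean())
```

In this sketch the fallback to a full wait plays the role of the adaptive rule: the fewer neighbors respond promptly, the longer a worker waits, which limits parameter staleness without forcing a global synchronization barrier every round.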


