Convergence Analysis of Decentralized ASGD

09/07/2023
by Mauro DL Tosi, et al.

Over the last few decades, Stochastic Gradient Descent (SGD) has been studied intensively by the Machine Learning community. Despite its versatility and excellent performance, optimizing large models via SGD is still a time-consuming task. To reduce training time, it is common to distribute the training process across multiple devices. Recently, it has been shown that asynchronous SGD (ASGD) always converges faster than mini-batch SGD. However, despite these improvements in the theoretical bounds, most ASGD convergence-rate proofs still rely on a centralized parameter server, which is prone to becoming a bottleneck when scaling out the gradient computations across many distributed processes. In this paper, we present a novel convergence-rate analysis for decentralized and asynchronous SGD (DASGD) that requires neither partial synchronization among nodes nor restrictive network topologies. Specifically, we provide a bound of 𝒪(σϵ^-2) + 𝒪(QS_avgϵ^-3/2) + 𝒪(S_avgϵ^-1) on the convergence rate of DASGD, where S_avg is the average staleness between models, Q is a constant that bounds the norm of the gradients, and ϵ is a (small) error that is allowed within the bound. Furthermore, when gradients are not bounded, we prove the convergence rate of DASGD to be 𝒪(σϵ^-2) + 𝒪(√(Ŝ_avgŜ_max)ϵ^-1), with Ŝ_avg and Ŝ_max representing loose versions of the average and maximum staleness, respectively. Our convergence proof holds for a fixed stepsize and any non-convex, homogeneous, and L-smooth objective function. We anticipate that our results will be of high relevance for the adoption of DASGD by a broad community of researchers and developers.
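For readability, the two rates quoted above can also be written as display equations. This is only a LaTeX rendering of the bounds stated in the abstract, with σ, Q, S_avg, Ŝ_avg, Ŝ_max, and ϵ exactly as defined there; the subscript typesetting is an editorial choice, not additional notation from the paper.

$$
\mathcal{O}\!\left(\sigma\,\epsilon^{-2}\right)
+ \mathcal{O}\!\left(Q\,S_{\mathrm{avg}}\,\epsilon^{-3/2}\right)
+ \mathcal{O}\!\left(S_{\mathrm{avg}}\,\epsilon^{-1}\right)
\qquad \text{(gradient norm bounded by } Q\text{)}
$$

$$
\mathcal{O}\!\left(\sigma\,\epsilon^{-2}\right)
+ \mathcal{O}\!\left(\sqrt{\hat{S}_{\mathrm{avg}}\,\hat{S}_{\mathrm{max}}}\;\epsilon^{-1}\right)
\qquad \text{(no bound on the gradient norm)}
$$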
