Robust Fully-Asynchronous Methods for Distributed Training over General Architecture

07/21/2023
by Zehan Zhu, et al.

Perfect synchronization in distributed machine learning is inefficient and often impossible due to latency, packet losses, and stragglers. We propose a Robust Fully-Asynchronous Stochastic Gradient Tracking method (R-FAST), in which each device performs local computation and communication at its own pace without any form of synchronization. Unlike existing asynchronous distributed algorithms, R-FAST eliminates the impact of data heterogeneity across devices and tolerates packet losses by employing a robust gradient tracking strategy that relies on properly designed auxiliary variables for tracking and buffering the overall gradient vector. More importantly, the proposed method uses two spanning-tree graphs for communication, provided that the two trees share at least one common root, enabling flexible design of the communication architecture. We show that R-FAST converges in expectation to a neighborhood of the optimum at a geometric rate for smooth and strongly convex objectives, and to a stationary point at a sublinear rate in general non-convex settings. Extensive experiments demonstrate that R-FAST runs 1.5-2 times faster than synchronous benchmark algorithms such as Ring-AllReduce and D-PSGD while achieving comparable accuracy, and outperforms state-of-the-art asynchronous algorithms such as AD-PSGD and OSGP, especially in the presence of stragglers.
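To make the gradient-tracking strategy concrete, below is a minimal synchronous NumPy sketch of the classical gradient-tracking template that R-FAST builds on; it is an illustration under simplifying assumptions, not the paper's method. The function name gradient_tracking, the mixing matrix W, and the step size alpha are hypothetical choices, and R-FAST's asynchronous updates, gradient buffering, and spanning-tree communication are not reproduced here.

```python
import numpy as np

def gradient_tracking(grads, W, x0, alpha=0.1, iters=200):
    """Synchronous gradient tracking (illustrative sketch only).

    grads : list of callables; grads[i](x) returns node i's local gradient
    W     : (n, n) doubly stochastic mixing matrix of the comm. graph
    x0    : (n, d) initial iterates, one row per node
    """
    n, _ = x0.shape
    x = x0.copy()
    # y_i tracks the network-wide average gradient; start from local grads
    y = np.stack([grads[i](x[i]) for i in range(n)])
    g_prev = y.copy()
    for _ in range(iters):
        x = W @ x - alpha * y           # mix with neighbors, then descend
        g = np.stack([grads[i](x[i]) for i in range(n)])
        y = W @ y + g - g_prev          # update the gradient tracker
        g_prev = g
    return x.mean(axis=0)

# Toy check: 4 nodes each hold f_i(x) = 0.5 * ||x - b_i||^2, so the
# global minimizer of the sum is the mean of the b_i.
rng = np.random.default_rng(0)
b = rng.normal(size=(4, 3))
grads = [lambda x, bi=bi: x - bi for bi in b]
W = np.full((4, 4), 0.25)  # complete graph with uniform weights
x_star = gradient_tracking(grads, W, np.zeros((4, 3)))
print(np.allclose(x_star, b.mean(axis=0), atol=1e-3))  # -> True
```

Because each y_i tracks the average of all local gradients rather than only its own, the iteration converges to the minimizer of the global objective even when data across nodes is heterogeneous; this is the property R-FAST preserves in the fully asynchronous, packet-loss-prone setting.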

