LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning

05/22/2019
by Jingjing Zhang, et al.

Gradient-based distributed learning in Parameter Server (PS) computing architectures is subject to random delays due to straggling worker nodes, as well as to possible communication bottlenecks between the PS and the workers. Solutions have recently been proposed to address these impairments separately, based on the ideas of gradient coding, worker grouping, and adaptive worker selection. This paper provides a unified analysis of these techniques in terms of wall-clock time, communication, and computation complexity measures. Furthermore, in order to combine the robustness to stragglers offered by gradient coding and grouping with the communication and computation load gains of adaptive selection, novel strategies, named Lazily Aggregated Gradient Coding (LAGC) and Grouped-LAG (G-LAG), are introduced. Analysis and results show that G-LAG provides the best wall-clock time and communication performance, while maintaining a low computational cost, for two representative distributions of the computing times of the worker nodes.
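To make the gradient coding idea mentioned above concrete, the following is a minimal sketch of a widely used three-worker illustration, not the LAGC or G-LAG scheme from the paper. It assumes the data is split into three partitions with partial gradients g1, g2, g3, that each worker computes two of them and uploads one coded combination, and that the PS must recover g1 + g2 + g3 from any two workers (tolerating one straggler). The names `B`, `DECODE`, `encode`, and `decode` are hypothetical and chosen for illustration only.

```python
from itertools import combinations

# Encoding matrix (an assumption for this sketch): row i holds the weights
# that worker i applies to the partial gradients (g1, g2, g3).
# Each worker only needs two of the three partitions.
B = [
    [0.5, 1.0, 0.0],   # worker 0 uploads g1/2 + g2
    [0.0, 1.0, -1.0],  # worker 1 uploads g2 - g3
    [0.5, 0.0, 1.0],   # worker 2 uploads g1/2 + g3
]

# Decoding weights for every pair of workers that may finish first.
DECODE = {
    (0, 1): (2.0, -1.0),  # 2*(g1/2 + g2) - (g2 - g3)
    (0, 2): (1.0, 1.0),   # (g1/2 + g2) + (g1/2 + g3)
    (1, 2): (1.0, 2.0),   # (g2 - g3) + 2*(g1/2 + g3)
}


def encode(worker, partials):
    """Return the coded vector that a given worker uploads to the PS."""
    dim = len(partials[0])
    return [
        sum(B[worker][k] * partials[k][d] for k in range(3))
        for d in range(dim)
    ]


def decode(received):
    """Recover the full gradient from the coded vectors of any two workers.

    `received` maps a worker index to its coded vector.
    """
    pair = tuple(sorted(received))
    a, b = DECODE[pair]
    u, v = received[pair[0]], received[pair[1]]
    return [a * x + b * y for x, y in zip(u, v)]


# Toy partial gradients (made-up numbers, two-dimensional model).
partials = [[1.0, 2.0], [3.0, -1.0], [0.5, 4.0]]
full = [sum(col) for col in zip(*partials)]

# Whichever single worker straggles, the other two suffice.
for pair in combinations(range(3), 2):
    received = {w: encode(w, partials) for w in pair}
    assert decode(received) == full
```

The price of this robustness is redundancy: each worker processes two partitions instead of one, which is the kind of computation load that the complexity measures in the abstract account for.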


research
01/12/2022

Adaptive Worker Grouping For Communication-Efficient and Straggler-Tolerant Distributed SGD

Wall-clock convergence time and communication load are key performance m...
research
04/20/2021

A Note on Slepian-Wolf Bounds for Several Node Grouping Configurations

The Slepian-Wolf bound on the admissible coding rate forms the most fund...
research
01/21/2023

ABS: Adaptive Bounded Staleness Converges Faster and Communicates Less

Wall-clock convergence time and communication rounds are critical perfor...
research
03/02/2021

Optimal Communication-Computation Trade-Off in Heterogeneous Gradient Coding

Gradient coding allows a master node to derive the aggregate of the part...
research
11/17/2017

Approximate Gradient Coding via Sparse Random Graphs

Distributed algorithms are often beset by the straggler effect, where th...
research
05/14/2020

Communication-Efficient Gradient Coding for Straggler Mitigation in Distributed Learning

Distributed implementations of gradient-based methods, wherein a server ...
research
06/08/2020

Adaptive Gradient Coding

This paper focuses on mitigating the impact of stragglers in distributed...
