Adaptive Worker Grouping For Communication-Efficient and Straggler-Tolerant Distributed SGD

01/12/2022
by   Feng Zhu, et al.

Wall-clock convergence time and communication load are key performance metrics for the distributed implementation of stochastic gradient descent (SGD) in parameter server settings. Communication-adaptive distributed Adam (CADA) was recently proposed to reduce the communication load through the adaptive selection of workers, but its wall-clock convergence time degrades in the presence of stragglers. This paper proposes grouping-based CADA (G-CADA), a novel scheme that retains CADA's advantage of a reduced communication load while increasing robustness to stragglers, at the cost of additional storage at the workers. G-CADA partitions the workers into groups, with all workers in a group assigned the same data shards. Groups are scheduled adaptively at each iteration, and the server waits only for the fastest worker in each selected group. We provide analysis and experimental results demonstrating the significant gains of G-CADA over benchmark schemes in wall-clock time, as well as in communication and computation load.
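To make the grouping idea concrete, below is a minimal simulation sketch of group-based adaptive scheduling for distributed SGD. It is not the paper's algorithm: the toy least-squares objective, the exponential compute-time model, and the lazy-aggregation-style selection rule (skip a group whose shard gradient has barely changed) are illustrative stand-ins for CADA's actual worker-selection criterion. It only shows the structural point that workers in a group share a data shard, groups are selected adaptively, and the server waits for the fastest replica in each selected group.

```python
# Illustrative sketch only: toy data, toy selection rule, simulated compute times.
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem, sharded across groups (one shard per group,
# replicated at every worker of that group).
DIM, SAMPLES_PER_SHARD, NUM_GROUPS, WORKERS_PER_GROUP = 10, 200, 4, 3
x_true = rng.normal(size=DIM)
A = [rng.normal(size=(SAMPLES_PER_SHARD, DIM)) for _ in range(NUM_GROUPS)]
b = [A_g @ x_true + 0.01 * rng.normal(size=SAMPLES_PER_SHARD) for A_g in A]

def shard_gradient(g, x):
    """Least-squares gradient on shard g (identical for all workers in group g)."""
    return A[g].T @ (A[g] @ x - b[g]) / SAMPLES_PER_SHARD

x = np.zeros(DIM)
lr = 0.1
stale_grad = [shard_gradient(g, x) for g in range(NUM_GROUPS)]  # server-side cache
threshold = 1e-3   # illustrative rule: reuse the cached gradient if the change is small
wall_clock, uploads = 0.0, 0

for it in range(200):
    fresh, round_time = {}, 0.0
    for g in range(NUM_GROUPS):
        new_grad = shard_gradient(g, x)
        # Adaptive group selection (lazy-aggregation flavour, not CADA's rule):
        # skip the group if its gradient barely moved since its last upload.
        if np.linalg.norm(new_grad - stale_grad[g]) <= threshold:
            continue
        # All workers in the selected group hold the same shard, so the server
        # only waits for the fastest of the WORKERS_PER_GROUP replicas.
        compute_times = rng.exponential(1.0, size=WORKERS_PER_GROUP)
        round_time = max(round_time, compute_times.min())
        fresh[g] = new_grad
        uploads += 1
    for g, grad in fresh.items():
        stale_grad[g] = grad
    # The server aggregates fresh and cached gradients and takes an SGD step.
    x -= lr * np.mean(stale_grad, axis=0)
    wall_clock += round_time

print(f"distance to x_true: {np.linalg.norm(x - x_true):.4f}")
print(f"simulated wall-clock: {wall_clock:.2f}, group uploads: {uploads}")
```

In this sketch the per-iteration time is the maximum, over the selected groups, of the minimum compute time within each group, which is where the straggler tolerance comes from; communication is reduced because unselected groups upload nothing and the server reuses their cached gradients.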


Related research

research · 05/22/2019
LAGC: Lazily Aggregated Gradient Coding for Straggler-Tolerant and Communication-Efficient Distributed Learning
Gradient-based distributed learning in Parameter Server (PS) computing a...

research · 01/21/2023
ABS: Adaptive Bounded Staleness Converges Faster and Communicates Less
Wall-clock convergence time and communication rounds are critical perfor...

research · 04/17/2023
Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load
In distributed machine learning, a central node outsources computational...

research · 11/12/2020
Distributed Sparse SGD with Majority Voting
Distributed learning, particularly variants of distributed stochastic gr...

research · 10/06/2022
STSyn: Speeding Up Local SGD with Straggler-Tolerant Synchronization
Synchronous local stochastic gradient descent (local SGD) suffers from s...

research · 07/07/2023
DEFT: Exploiting Gradient Norm Difference between Model Layers for Scalable Gradient Sparsification
Gradient sparsification is a widely adopted solution for reducing the ex...

research · 11/08/2018
Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training
Distributed training of deep nets is an important technique to address s...
