Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning

03/01/2021
by Baturalp Buyukates, et al.

Distributed implementations are crucial for speeding up large-scale machine learning applications. Distributed gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers. A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers. Coded distributed computation techniques have been introduced recently to mitigate stragglers and speed up GD iterations by assigning redundant computations to workers. In this paper, we consider gradient coding (GC) and propose a novel dynamic GC scheme, which assigns redundant data to the workers so that, at each iteration, the scheme can dynamically choose among a set of possible codes based on the past straggling behavior. In particular, we consider GC with clustering and regulate the number of stragglers in each cluster by dynamically forming the clusters at each iteration; hence, the proposed scheme is called GC with dynamic clustering (GC-DC). Under time-correlated straggling behavior, GC-DC benefits from adapting to the straggling behavior over time: at each iteration, it aims to distribute the stragglers across the clusters as uniformly as possible based on the past straggler behavior. For both homogeneous and heterogeneous worker models, we numerically show that GC-DC provides significant improvements in the average per-iteration completion time with no increase in the communication load compared to the original GC scheme.
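To illustrate the dynamic clustering idea described above, the following Python sketch shows one plausible way to re-form the clusters at every iteration so that workers that straggled in the past are spread across clusters as evenly as possible. This is a minimal toy model under stated assumptions, not the paper's algorithm: the function names, the greedy round-robin assignment, and the two-state straggler transition probabilities are all illustrative choices.

```python
import random

def dynamic_clusters(num_workers, num_clusters, straggler_history):
    """Greedily assign workers to equal-size clusters so that workers flagged
    as stragglers in the previous iteration are balanced across clusters
    (an illustrative stand-in for the cluster re-formation step in GC-DC)."""
    cluster_size = num_workers // num_clusters
    stragglers = [w for w in range(num_workers) if straggler_history[w]]
    non_stragglers = [w for w in range(num_workers) if not straggler_history[w]]

    clusters = [[] for _ in range(num_clusters)]
    # Round-robin the past stragglers first so each cluster gets a similar count.
    for i, w in enumerate(stragglers):
        clusters[i % num_clusters].append(w)
    # Fill the remaining slots with the non-straggling workers.
    rest = iter(non_stragglers)
    for cluster in clusters:
        while len(cluster) < cluster_size:
            cluster.append(next(rest))
    return clusters

def simulate_stragglers(prev, p_stay=0.7, p_new=0.1):
    """Toy time-correlated straggler model: a worker that straggled in the
    previous iteration straggles again with higher probability (assumed
    parameters, for illustration only)."""
    return [random.random() < (p_stay if s else p_new) for s in prev]

if __name__ == "__main__":
    n, c = 12, 3          # 12 workers split into 3 clusters of 4
    history = [False] * n
    for t in range(5):
        history = simulate_stragglers(history)
        print(f"iter {t}: clusters = {dynamic_clusters(n, c, history)}")
```

In this sketch, balancing past stragglers across clusters mimics the goal stated in the abstract: no single cluster should accumulate most of the slow workers, which would otherwise dominate the per-iteration completion time.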
