Gradient Coding with Clustering and Multi-message Communication

03/05/2019
by Emre Ozfatura, et al.

Gradient descent (GD) methods are commonly employed in machine learning problems to optimize the parameters of the model in an iterative fashion. For problems with massive datasets, computations are distributed to many parallel computing servers (i.e., workers) to speed up GD iterations. While distributed computing can increase the computation speed significantly, the per-iteration completion time is limited by the slowest, straggling workers. Coded distributed computing can mitigate the effect of straggling workers by introducing redundant computations; however, existing coded computing schemes are mainly designed against persistent stragglers, and partial computations at straggling workers are discarded, leading to wasted computational capacity. In this paper, we propose a novel gradient coding (GC) scheme that allows multiple coded computations to be conveyed from each worker to the master per iteration. We numerically show that the proposed GC with multi-message communication (MMC), together with clustering, significantly reduces the average per-iteration completion time, with minimal or no increase in the communication load.
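To make the setting concrete, the sketch below simulates the clustered (fractional-repetition style) gradient coding baseline that the proposed MMC scheme builds on, not the authors' full scheme: workers are grouped into clusters, every worker in a cluster computes the same coded message (here, the partial gradient over that cluster's data partition), and the master only needs one response per cluster, so the remaining workers in a cluster may straggle without delaying the iteration. All names and parameters (coded_iteration, p_straggle, the Bernoulli model of straggling, one partition per cluster) are illustrative assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares problem: the gradient of 0.5 * ||X w - y||^2 is X^T (X w - y).
n_samples, dim = 120, 5
X = rng.normal(size=(n_samples, dim))
y = rng.normal(size=n_samples)

n_clusters = 3        # one data partition per cluster (a simplifying assumption)
cluster_size = 2      # workers per cluster; tolerates cluster_size - 1 stragglers
partitions = np.array_split(np.arange(n_samples), n_clusters)

def partial_gradient(idx, w):
    Xi, yi = X[idx], y[idx]
    return Xi.T @ (Xi @ w - yi)

def coded_iteration(w, p_straggle=0.3):
    """One GD iteration under clustered gradient coding: all workers in a
    cluster send the same coded message, so the master uses the first
    response from each cluster and ignores the rest."""
    grad = np.zeros(dim)
    for idx in partitions:
        # Each worker in the cluster finishes in time with probability 1 - p_straggle.
        responses = [partial_gradient(idx, w)
                     for _ in range(cluster_size)
                     if rng.random() > p_straggle]
        if not responses:
            # Whole cluster straggled; the master would wait for the fastest
            # worker in that cluster, which we emulate here.
            responses = [partial_gradient(idx, w)]
        grad += responses[0]          # any single message from the cluster suffices
    return grad

w = np.zeros(dim)
lr = 1e-3
for _ in range(100):
    w -= lr * coded_iteration(w)
```

The MMC idea described in the abstract goes one step further: instead of a single coded message per worker per iteration, each worker transmits several partial coded computations as it completes them, so the master can also exploit the work done by non-persistent stragglers.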

Related research

11/03/2020 - Gradient Coding with Dynamic Clustering for Straggler Mitigation
In distributed synchronous gradient descent (GD) the main performance bo...

10/23/2018 - Computation Scheduling for Distributed Machine Learning with Straggling Workers
We study the scheduling of computation tasks across n workers in a large...

04/10/2020 - Straggler-aware Distributed Learning: Communication Computation Latency Trade-off
When gradient descent (GD) is scaled to many parallel workers for large ...

08/07/2018 - Speeding Up Distributed Gradient Descent by Utilizing Non-persistent Stragglers
Distributed gradient descent (DGD) is an efficient way of implementing g...

05/16/2022 - Two-Stage Coded Federated Edge Learning: A Dynamic Partial Gradient Coding Perspective
Federated edge learning (FEL) can train a global model from terminal ...

03/01/2021 - Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning
Distributed implementations are crucial in speeding up large scale machi...

03/02/2021 - Stream Distributed Coded Computing
The emerging large-scale and data-hungry algorithms require the computat...
