Speeding Up Distributed Gradient Descent by Utilizing Non-persistent Stragglers

08/07/2018
by Emre Ozfatura, et al.

Distributed gradient descent (DGD) is an efficient way of implementing gradient descent (GD), particularly for large data sets, by dividing the computation into smaller subtasks and assigning them to different computing servers (CSs) to be executed in parallel. In standard synchronous parallel execution, the per-iteration waiting time is dictated by the execution time of the slowest (straggling) servers. Coded DGD techniques have recently been introduced to tolerate straggling servers by assigning redundant computation tasks to the CSs. In most existing DGD schemes, whether based on coded computation or coded communication, each non-straggling CS transmits one message per iteration, once it has completed all of its assigned computation tasks. However, although the straggling servers cannot complete all of their assigned tasks, they can often complete a certain portion of them. In this paper, we allow multiple transmissions from each CS at each iteration, so that the maximum possible number of completed computations, including those of the straggling servers, can be reported to the aggregating server (AS). We show numerically that the average completion time per iteration can be reduced significantly at the cost of a slight increase in the communication load per server.
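The multi-message idea lends itself to a simple simulation. The sketch below is a minimal illustration, not the authors' implementation: it assumes a cyclic redundant assignment of k = n data partitions with redundancy r and shifted-exponential per-task computation times (all parameter names and values are illustrative), and compares the per-iteration completion time when each CS sends a single message after finishing all of its tasks against a multi-message scheme in which every completed partial computation is reported immediately.

```python
import numpy as np

def simulate_iteration(n=20, r=3, mu=1.0, shift=0.5, rng=None):
    """Simulate one DGD iteration with n workers and k = n data partitions.

    Worker i is assigned partitions i, i+1, ..., i+r-1 (mod n), processed in
    that order. Returns the iteration completion time under the single-message
    and multi-message schemes.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Per-task computation times: shifted exponential, a common straggler model.
    task_times = shift + rng.exponential(1.0 / mu, size=(n, r))
    # finish[i, j]: time at which worker i completes its j-th assigned task.
    finish = np.cumsum(task_times, axis=1)

    # Multi-message: worker i reports partition (i + j) mod n as soon as its
    # j-th task finishes, so partial work from stragglers still counts.
    partition_ready = np.full(n, np.inf)
    for i in range(n):
        for j in range(r):
            p = (i + j) % n
            partition_ready[p] = min(partition_ready[p], finish[i, j])
    t_multi = partition_ready.max()  # AS needs every partition at least once.

    # Single-message: worker i sends one message only after all r tasks finish,
    # so all of its partitions become available at finish[i, -1].
    partition_ready = np.full(n, np.inf)
    for i in range(n):
        for j in range(r):
            p = (i + j) % n
            partition_ready[p] = min(partition_ready[p], finish[i, -1])
    t_single = partition_ready.max()

    return t_single, t_multi

rng = np.random.default_rng(0)
results = np.array([simulate_iteration(rng=rng) for _ in range(1000)])
print("avg completion time, single-message:", results[:, 0].mean())
print("avg completion time, multi-message :", results[:, 1].mean())
```

Under these assumptions the multi-message scheme finishes earlier on average, since the AS can assemble the full gradient from partial reports, at the cost of up to r messages per worker instead of one.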


Related research

03/05/2019: Gradient Coding with Clustering and Multi-message Communication
Gradient descent (GD) methods are commonly employed in machine learning ...

11/22/2018: Distributed Gradient Descent with Coded Partial Gradient Computations
Coded computation techniques provide robustness against straggling serve...

10/23/2018: Computation Scheduling for Distributed Machine Learning with Straggling Workers
We study the scheduling of computation tasks across n workers in a large...

04/10/2020: Straggler-aware Distributed Learning: Communication Computation Latency Trade-off
When gradient descent (GD) is scaled to many parallel workers for large ...

11/03/2020: Gradient Coding with Dynamic Clustering for Straggler Mitigation
In distributed synchronous gradient descent (GD) the main performance bo...

03/01/2021: Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning
Distributed implementations are crucial in speeding up large scale machi...

02/17/2021: A Game-theoretic Approach Towards Collaborative Coded Computation Offloading
Coded distributed computing (CDC) has emerged as a promising approach be...
