Straggler Mitigation with Tiered Gradient Codes

09/05/2019
by Shanuja Sasi, et al.

Coding-theoretic techniques have been proposed for synchronous Gradient Descent (GD) on multiple servers to mitigate stragglers. These techniques provide the flexibility that the job is complete when any k out of n servers finish their assigned tasks. The task size on each server is determined by the values of k and n. However, these schemes assume that all n tasks are started when the job is requested. In contrast, we consider a tiered system: we start with n_1 > k tasks and, on completion of c of them, launch n_2 - n_1 additional tasks. The goal is that the job completes as long as any k servers finish their tasks. This paper exploits the flexibility of not starting all servers at the request time to derive achievable task sizes for each server. The resulting task sizes are in general lower than when all n_2 tasks are started at the request time, which helps reduce both the job completion time and the total server utilization.
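As a rough illustration of the tiered launch policy described above (a minimal sketch of the scheduling idea only, not the paper's gradient-coding construction), the following assumes hypothetical per-server runtimes and the parameters n_1, c, n_2, k from the abstract:

```python
def tiered_completion(task_times, n1, c, n2, k):
    """Simulate the tiered launch policy.

    Start n1 tasks at time 0; once c of them have finished, launch the
    remaining n2 - n1 tasks.  The job completes when any k tasks (across
    both tiers) have finished.  task_times[i] is the (hypothetical) run
    time of server i's task, measured from when that server starts.
    """
    # Tier 1: the first n1 servers start at time 0.
    tier1 = sorted(task_times[:n1])
    # The second tier is launched when the c-th tier-1 task finishes.
    t_launch = tier1[c - 1]
    # Tier 2: servers n1..n2-1 start at t_launch.
    all_finish = sorted(tier1 + [t_launch + t for t in task_times[n1:n2]])
    # Job completion time: when the k-th task overall finishes.
    return all_finish[k - 1]


# Deterministic example with made-up runtimes:
times = [1, 2, 3, 4, 5, 6]
tiered = tiered_completion(times, n1=4, c=2, n2=6, k=5)      # -> 7
all_at_once = sorted(times[:6])[4]                           # -> 5
```

In this toy sample the tiered policy finishes later than starting all n_2 tasks up front, but it only occupies the extra servers after tier 1 has made progress, lowering total server utilization. The paper's additional gain, smaller task sizes enabled by the tiered design, is not modeled in this sketch.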


