DSAG: A mixed synchronous-asynchronous iterative method for straggler-resilient learning

11/27/2021
by   Albin Severinson, et al.

We consider straggler-resilient learning. In many previous works, e.g., in the coded computing literature, straggling is modeled as random delays that are independent and identically distributed between workers. However, in many practical scenarios, a given worker may straggle over an extended period of time. We propose a latency model that captures this behavior and is substantiated by traces collected on Microsoft Azure, Amazon Web Services (AWS), and a small local cluster. Building on this model, we propose DSAG, a mixed synchronous-asynchronous iterative optimization method, based on the stochastic average gradient (SAG) method, that combines timely and stale results. We also propose a dynamic load-balancing strategy to further reduce the impact of straggling workers. We evaluate DSAG for principal component analysis of a large genomics dataset, cast as a finite-sum optimization problem, and for logistic regression, on a cluster of 100 AWS workers, and find that DSAG is up to about 50% faster than coded computing methods for the particular scenario that we consider.
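The core idea of combining timely and stale results can be illustrated with a SAG-style sketch. This is not the authors' implementation; the worker model (refreshing a random subset of gradients each round to mimic stragglers) and all names below are illustrative assumptions:

```python
import numpy as np

def dsag_sketch(grad_fns, x0, lr=0.1, iters=100, timely=None, rng=None):
    """Illustrative SAG-style update combining fresh and stale gradients.

    A table stores the most recent gradient received from each worker.
    Each round, only the "timely" (non-straggling) workers return fresh
    results; here that is modeled by refreshing a random subset of table
    entries. The iterate moves along the average of the table, so stale
    results from straggling workers still contribute to every step.
    """
    rng = rng or np.random.default_rng(0)
    n = len(grad_fns)
    x = np.asarray(x0, dtype=float)
    table = np.zeros((n, x.size))          # last known gradient per worker
    for _ in range(iters):
        k = timely if timely is not None else max(1, n // 2)
        fresh = rng.choice(n, size=k, replace=False)
        for i in fresh:
            table[i] = grad_fns[i](x)      # fresh gradient from worker i
        x = x - lr * table.mean(axis=0)    # SAG-style averaged update
    return x
```

For example, with each worker holding the gradient of one term of a finite sum, the iterate converges to the minimizer of the sum even though most table entries are stale at any given step; this tolerance to staleness is what lets the method make progress without waiting for every worker each round.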


research
06/04/2018

Lagrange Coded Computing: Optimal Design for Resiliency, Security and Privacy

We consider a distributed computing scenario that involves computations ...
research
11/03/2020

Gradient Coding with Dynamic Clustering for Straggler Mitigation

In distributed synchronous gradient descent (GD) the main performance bo...
research
04/04/2016

Revisiting Distributed Synchronous SGD

Distributed training of deep learning models on large-scale training dat...
research
06/18/2023

DropCompute: simple and more robust distributed synchronous training via compute variance reduction

Background: Distributed training is essential for large scale training o...
research
05/16/2022

Two-Stage Coded Federated Edge Learning: A Dynamic Partial Gradient Coding Perspective

Federated edge learning (FEL) can train a global model from terminal ...
research
08/08/2023

Iterative Sketching for Secure Coded Regression

In this work, we propose methods for speeding up linear regression distr...
research
04/11/2019

Timely-Throughput Optimal Coded Computing over Cloud Networks

In modern distributed computing systems, unpredictable and unreliable in...
