Anytime MiniBatch: Exploiting Stragglers in Online Distributed Optimization

06/10/2020
by Nuwan Ferdinand, et al.

Distributed optimization is vital in solving large-scale machine learning problems. A widely shared feature of distributed optimization techniques is the requirement that all nodes complete their assigned tasks in each computational epoch before the system can proceed to the next epoch. In such settings, slow nodes, called stragglers, can greatly slow progress. To mitigate the impact of stragglers, we propose an online distributed optimization method called Anytime Minibatch. In this approach, all nodes are given a fixed time to compute the gradients of as many data samples as possible. The result is a variable per-node minibatch size. Workers are then given a fixed communication time to average their minibatch gradients via several rounds of consensus, and the averaged gradients are used to update the primal variables via dual averaging. Anytime Minibatch prevents stragglers from holding up the system without wasting the work that stragglers can complete. We present a convergence analysis and analyze the wall time performance. Our numerical results show that our approach is up to 1.5 times faster on Amazon EC2 and up to five times faster when there is greater variability in compute node performance.
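To make the mechanics concrete, the sketch below simulates one Anytime Minibatch epoch in plain Python/NumPy under stated assumptions: each simulated worker accumulates stochastic gradients for a fixed compute budget, which yields a variable per-worker minibatch size; the multi-round consensus phase is stood in for by an exact mean across workers; and the primal variable is updated by dual averaging. Names such as anytime_minibatch_step, the synthetic least-squares gradient, and the step-size schedule are illustrative assumptions, not the authors' reference implementation.

# Minimal single-process sketch of the Anytime Minibatch idea (assumptions noted above).
import time
import numpy as np

def stochastic_gradient(w, rng):
    # Placeholder least-squares gradient on one randomly drawn sample (assumption).
    x = rng.standard_normal(w.shape)
    y = x @ np.ones_like(w)                  # synthetic target
    return (x @ w - y) * x

def anytime_minibatch_step(w, z, t, compute_budget, num_workers, lr, rng):
    """One epoch: fixed compute time, variable minibatch, dual-averaging update."""
    worker_grads = []
    for _ in range(num_workers):
        g_sum, b = np.zeros_like(w), 0
        deadline = time.perf_counter() + compute_budget
        while time.perf_counter() < deadline:   # fixed time, variable minibatch size
            g_sum += stochastic_gradient(w, rng)
            b += 1
        worker_grads.append(g_sum / max(b, 1))  # per-worker minibatch gradient
    # Stand-in for several rounds of consensus: exact average of worker gradients.
    g_avg = np.mean(worker_grads, axis=0)
    # Dual averaging: accumulate gradients in the dual variable z,
    # then map back to the primal variable with a shrinking step size.
    z = z + g_avg
    w = -lr / np.sqrt(t + 1) * z
    return w, z

rng = np.random.default_rng(0)
w, z = np.zeros(10), np.zeros(10)
for t in range(50):
    w, z = anytime_minibatch_step(w, z, t, compute_budget=1e-3,
                                  num_workers=4, lr=0.5, rng=rng)
print("final iterate norm:", np.linalg.norm(w))

Because every worker stops at the same deadline, a straggler simply contributes a smaller minibatch rather than delaying the epoch, which is the property the paper exploits.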

Related research

Anytime Minibatch with Delayed Gradients (12/15/2020)
Distributed optimization is widely deployed in practice to solve a broad...

DRACO: Robust Distributed Training via Redundant Gradients (03/27/2018)
Distributed model training is vulnerable to worst-case system failures a...

Customizing Graph500 for Tianhe Pre-exascale system (02/02/2021)
BFS (Breadth-First Search) is a typical graph algorithm used as a key co...

Straggler-Agnostic and Communication-Efficient Distributed Primal-Dual Algorithm for High-Dimensional Data Mining (10/09/2019)
Recently, reducing communication time between machines has become the main...

Tell Me Something New: a new framework for asynchronous parallel learning (05/19/2018)
We present a novel approach for parallel computation in the context of m...

Fundamental Resource Trade-offs for Encoded Distributed Optimization (03/31/2018)
Dealing with the sheer size and complexity of today's massive data sets...

Communication Efficient Sparsification for Large Scale Machine Learning (03/13/2020)
The increasing scale of distributed learning problems necessitates the d...
