Fast and Straggler-Tolerant Distributed SGD with Reduced Computation Load

04/17/2023
by Maximilian Egger, et al.

In distributed machine learning, a central node outsources computationally expensive calculations to external worker nodes. The properties of optimization procedures such as stochastic gradient descent (SGD) can be leveraged to mitigate the effect of unresponsive or slow workers, called stragglers, which otherwise diminish the benefit of outsourcing the computation. This can be done by waiting for only a subset of the workers to finish their computation at each iteration of the algorithm. Previous works proposed adapting the number of workers to wait for as the algorithm evolves in order to optimize the speed of convergence. In contrast, we model the communication and computation times using independent random variables. Based on this model, we construct a novel scheme that adapts both the number of workers and the computation load throughout the run-time of the algorithm. As a result, we improve the convergence speed of distributed SGD while significantly reducing the computation load, at the expense of a slight increase in communication load.
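To illustrate the general idea of straggler-tolerant distributed SGD described above, the following is a minimal simulation sketch, not the authors' scheme: a parameter server dispatches the model to n workers, each worker returns a mini-batch gradient after a random compute-plus-communication delay, and the server aggregates only the first k responses. The synthetic least-squares task, the delay model, and the schedule that varies k and the per-worker batch size are illustrative assumptions only.

```python
# Sketch of fastest-k distributed SGD with an adaptive number of workers and
# adaptive per-worker computation load (illustrative assumptions, not the paper's rule).
import numpy as np

rng = np.random.default_rng(0)
d, n_workers = 10, 20
A = rng.normal(size=(2000, d))              # synthetic least-squares data
x_true = rng.normal(size=d)
y = A @ x_true + 0.1 * rng.normal(size=2000)

def worker_gradient(x, batch_size):
    """One worker: sample a mini-batch, return (gradient, simulated finish time)."""
    idx = rng.choice(len(A), size=batch_size, replace=False)
    grad = A[idx].T @ (A[idx] @ x - y[idx]) / batch_size
    compute_time = batch_size * rng.exponential(1e-3)   # load-dependent compute delay
    comm_time = rng.exponential(5e-3)                    # independent communication delay
    return grad, compute_time + comm_time

x = np.zeros(d)
lr = 0.1
for it in range(200):
    # Illustrative schedule: wait for more workers and use smaller batches
    # as the iterations progress (an assumed rule for demonstration).
    k = min(n_workers, 2 + it // 20)
    batch_size = max(8, 64 - it // 4)

    results = [worker_gradient(x, batch_size) for _ in range(n_workers)]
    results.sort(key=lambda r: r[1])   # order workers by simulated finish time
    fastest = results[:k]              # aggregate only the fastest k; ignore stragglers
    grad = np.mean([g for g, _ in fastest], axis=0)
    x -= lr * grad

print("final error:", np.linalg.norm(x - x_true))
```

In this toy setting, waiting for only the fastest k responses bounds the per-iteration wall-clock time by the k-th order statistic of the worker delays, at the cost of averaging fewer gradients per step; adapting k and the batch size over the run trades off these two effects.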

Related research

01/12/2022 · Adaptive Worker Grouping For Communication-Efficient and Straggler-Tolerant Distributed SGD
Wall-clock convergence time and communication load are key performance m...

12/16/2022 · Nested Gradient Codes for Straggler Mitigation in Distributed Machine Learning
We consider distributed learning in the presence of slow and unresponsiv...

02/16/2022 · Efficient Distributed Machine Learning via Combinatorial Multi-Armed Bandits
We consider the distributed stochastic gradient descent problem, where a...

07/07/2023 · DEFT: Exploiting Gradient Norm Difference between Model Layers for Scalable Gradient Sparsification
Gradient sparsification is a widely adopted solution for reducing the ex...

11/12/2020 · Distributed Sparse SGD with Majority Voting
Distributed learning, particularly variants of distributed stochastic gr...

07/07/2020 · Divide-and-Shuffle Synchronization for Distributed Machine Learning
Distributed Machine Learning suffers from the bottleneck of synchronizat...

06/16/2021 · Exponential Error Convergence in Data Classification with Optimized Random Features: Acceleration by Quantum Machine Learning
Random features are a central technique for scalable learning algorithms...
