DropCompute: simple and more robust distributed synchronous training via compute variance reduction

06/18/2023
by Niv Giladi, et al.

Background: Distributed training is essential for large-scale training of deep neural networks (DNNs). The dominant methods for large-scale DNN training are synchronous (e.g., All-Reduce), but they require waiting for all workers at each step, so they are limited by the delays caused by straggling workers. Results: We study a typical scenario in which workers straggle due to variability in compute time. We find an analytical relation between compute time properties and the scalability limitations caused by such straggling workers. Based on these findings, we propose a simple yet effective decentralized method that reduces the variation among workers and thus improves the robustness of synchronous training. This method can be integrated with the widely used All-Reduce. We validate our findings on large-scale training tasks using 200 Gaudi accelerators.
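To illustrate why straggling limits scalability in synchronous training, the following is a minimal simulation sketch, not the paper's analysis or method. It assumes that each worker's per-step compute time is an independent Gaussian draw (the distribution, the `mean` and `std` values, and the function name `mean_step_time` are all hypothetical choices made for illustration) and that a synchronous step lasts as long as its slowest worker.

```python
import random
import statistics


def mean_step_time(num_workers, num_steps=2000, mean=1.0, std=0.1, seed=0):
    """Estimate the average duration of a synchronous step.

    Assumption (not from the paper): worker compute times are i.i.d.
    Gaussian with the given mean and standard deviation, clipped at zero.
    """
    rng = random.Random(seed)
    step_times = []
    for _ in range(num_steps):
        # Every worker computes independently for this step.
        worker_times = [max(0.0, rng.gauss(mean, std)) for _ in range(num_workers)]
        # A synchronous step waits for all workers, so the slowest one sets the pace.
        step_times.append(max(worker_times))
    return statistics.fmean(step_times)


if __name__ == "__main__":
    for n in (1, 8, 64, 200):
        print(f"workers={n:4d}  mean step time={mean_step_time(n):.3f}")
```

Under these assumptions, the average step time grows with the number of workers even though each worker's mean compute time is unchanged, which is the straggler effect that reducing compute variance is meant to counter.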

