Accelerating Distributed ML Training via Selective Synchronization

07/16/2023
by Sahil Tyagi, et al.

In distributed training, deep neural networks (DNNs) are launched over multiple workers concurrently, and in bulk-synchronous parallel (BSP) training the workers aggregate their local updates at every step. However, BSP does not scale out linearly because of the high communication cost of aggregation. To mitigate this overhead, alternatives like Federated Averaging (FedAvg) and Stale-Synchronous Parallel (SSP) either reduce the synchronization frequency or eliminate it altogether, usually at the cost of lower final accuracy. In this paper, we present a practical, low-overhead method for DNN training that dynamically chooses at each step whether to incur or avoid communication, either calling the aggregation op or applying local updates, based on the significance of those updates. We propose several optimizations as part of our method to improve convergence in the context of such semi-synchronous training. Our system converges to the same or better accuracy than BSP while reducing training time by up to 14×.
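
To make the per-step decision concrete, below is a minimal sketch of a selective-synchronization training step in PyTorch. The significance metric (relative gradient norm), the threshold value, and the helper names (update_significance, selective_sync_step) are illustrative assumptions, not the paper's exact algorithm; the step either all-reduces gradients across workers or applies the update purely locally.

```python
import torch
import torch.distributed as dist


def update_significance(model):
    """Relative size of the pending local update: ||grad|| / ||params||.
    This particular metric is an assumption for illustration."""
    grad_sq, param_sq = 0.0, 0.0
    for p in model.parameters():
        if p.grad is not None:
            grad_sq += p.grad.norm().item() ** 2
            param_sq += p.data.norm().item() ** 2
    return (grad_sq ** 0.5) / (param_sq ** 0.5 + 1e-12)


def selective_sync_step(model, optimizer, loss, threshold=1e-3):
    """One training step that synchronizes only when the update is significant."""
    optimizer.zero_grad()
    loss.backward()

    if (dist.is_available() and dist.is_initialized()
            and update_significance(model) >= threshold):
        # Significant update: aggregate gradients across all workers (BSP-style step).
        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size
    # Otherwise skip communication this step and apply the update locally.
    optimizer.step()
```

In a single-process run the dist.is_initialized() guard keeps every step local, so the sketch can be exercised without a distributed launcher; under a multi-worker launch, only the steps that cross the threshold pay the all-reduce cost.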

Related research

09/06/2020
PSO-PS: Parameter Synchronization with Particle Swarm Optimization for Distributed Training of Deep Neural Networks
Parameter updating is an important stage in parallelism-based distribute...

06/18/2023
DropCompute: simple and more robust distributed synchronous training via compute variance reduction
Background: Distributed training is essential for large scale training o...

02/17/2021
Oscars: Adaptive Semi-Synchronous Parallel Model for Distributed Deep Learning with Global View
Deep learning has become an indispensable part of life, such as face rec...

07/22/2022
WRHT: Efficient All-reduce for Distributed DNN Training in Optical Interconnect System
Communication efficiency plays an important role in accelerating the dis...

08/16/2019
Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning
Deep learning is a popular machine learning technique and has been appli...

01/06/2020
Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning
The bulk synchronous parallel (BSP) is a celebrated synchronization mode...

04/12/2021
Accelerating Neural Network Training with Distributed Asynchronous and Selective Optimization (DASO)
With increasing data and model complexities, the time required to train ...
