Divide-and-Shuffle Synchronization for Distributed Machine Learning

07/07/2020
by Weiyan Wang, et al.

Distributed machine learning suffers from a synchronization bottleneck when all-reducing workers' updates. Previous works mainly pursue better network topologies, gradient compression, or stale updates to speed up communication and relieve this bottleneck. However, they overlook the importance of reducing the number of elements that must be synchronized together and the serially executed operators this synchronization entails. To address the problem, our work proposes Divide-and-Shuffle Synchronization (DS-Sync), which divides workers into several parallel groups and shuffles group membership across rounds. DS-Sync synchronizes only the workers within the same group, so each synchronization involves a much smaller scale. Shuffling workers between groups preserves the algorithm's convergence speed, which we justify theoretically. Comprehensive experiments also show significant improvements on recent, popular models such as BERT, WideResNet, and DeepFM over challenging datasets.
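
A minimal sketch of how such shuffled grouping might be coordinated (this is an illustrative assumption, not the paper's implementation; the function name ds_sync_groups and its parameters are hypothetical): every worker derives the same per-round partition from a shared seed, then all-reduces only within its own group.

    import random

    def ds_sync_groups(world_size, num_groups, round_idx, seed=0):
        """Partition worker ranks into parallel groups for one synchronization round.

        Every worker calls this with the same arguments, so the shared seed keeps
        the shuffled assignment identical across the cluster. (Hypothetical sketch;
        the seeding scheme is an assumption, not the authors' code.)
        """
        rng = random.Random(seed + round_idx)   # same permutation on every worker
        ranks = list(range(world_size))
        rng.shuffle(ranks)                      # reshuffle group membership each round
        # split the permuted ranks into num_groups roughly equal groups
        return [ranks[i::num_groups] for i in range(num_groups)]

    # Example: 8 workers, 2 groups; each worker would all-reduce only within its
    # group, and group membership changes from round to round.
    for r in range(3):
        print(r, ds_sync_groups(world_size=8, num_groups=2, round_idx=r))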


