ShadowSync: Performing Synchronization in the Background for Highly Scalable Distributed Training

03/07/2020
by   Qinqing Zheng, et al.
0

Distributed training is useful to train complicated models to shorten the training time. As each of the workers only sees a small fraction of data, workers need to synchronize on the parameter updates. One of the central questions in distributed training is how to parsimoniously synchronize parameters while preserving model quality. To address this problem, we propose the ShadowSync framework, in which we isolate synchronization from training and run it in the background. In contrast to common strategies including synchronous stochastic gradient descent (SGD), asynchronous SGD, and model averaging on independently trained sub-models, where synchronization happens in the foreground, ShadowSync synchronization is neither part of the backward pass, nor happens every k iterations. Our framework is generic to host various types of synchronization algorithms, and we propose 3 approaches under this theme. The superiority of ShadowSync is confirmed by experiments on training deep neural networks for click-through-rate prediction. Our methods all succeed in making the training throughput linearly scale with the number of trainers. Comparing to their foreground counterparts, our methods exhibit neutral to better model quality and better scalability when we keep the number of parameter servers the same. In our training system which expresses both replication and Hogwild parallelism, ShadowSync also accomplishes the highest example level parallelism number comparing to the prior arts.

READ FULL TEXT

page 4

page 5

research
06/08/2019

Making Asynchronous Stochastic Gradient Descent Work for Transformers

Asynchronous stochastic gradient descent (SGD) is attractive from a spee...
research
08/29/2023

ABS-SGD: A Delayed Synchronous Stochastic Gradient Descent Algorithm with Adaptive Batch Size for Heterogeneous GPU Clusters

As the size of models and datasets grows, it has become increasingly com...
research
04/16/2021

Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep Learning

Stochastic Gradient Descent (SGD) has become the de facto way to train d...
research
03/14/2021

CrossoverScheduler: Overlapping Multiple Distributed Training Applications in a Crossover Manner

Distributed deep learning workloads include throughput-intensive trainin...
research
09/06/2020

PSO-PS: Parameter Synchronization with Particle Swarm Optimization for Distributed Training of Deep Neural Networks

Parameter updating is an important stage in parallelism-based distribute...
research
10/06/2022

STSyn: Speeding Up Local SGD with Straggler-Tolerant Synchronization

Synchronous local stochastic gradient descent (local SGD) suffers from s...
research
04/17/2021

ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table

Because of the superior feature representation ability of deep learning,...

Please sign up or login with your details

Forgot password? Click here to reset