HPSGD: Hierarchical Parallel SGD With Stale Gradients Featuring

09/06/2020
by Yuhao Zhou, et al.

While distributed training significantly speeds up the training of deep neural networks (DNNs), cluster utilization remains relatively low due to the time-consuming data synchronization between workers. To alleviate this problem, a novel Hierarchical Parallel SGD (HPSGD) strategy is proposed, based on the observation that the data synchronization phase can be overlapped with the local training phase (i.e., feed-forward and back-propagation). Furthermore, an improved model-updating method is utilized to remedy the introduced stale-gradient problem: updates are committed to a replica (i.e., a temporary model with the same parameters as the global model), and the averaged changes are then merged into the global model. Extensive experiments demonstrate that the proposed HPSGD approach substantially accelerates distributed DNN training, reduces the disturbance caused by stale gradients, and achieves better accuracy within a given fixed wall-time.
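Below is a minimal, single-process sketch of the replica-based, overlapped update described in the abstract. It is not the authors' implementation: the cross-worker synchronization is stood in for by a background thread, and the names (hpsgd_round, fake_allreduce, local_sgd_steps) and the toy quadratic loss are illustrative assumptions.

```python
# Illustrative sketch of an HPSGD-style round: local training on a replica of the
# global model overlaps with synchronization of the previous round's (stale) delta,
# and the averaged change is merged into the global model afterwards.
# Names and the toy loss are assumptions, not the paper's actual code.
import threading
import time
import numpy as np

def local_sgd_steps(params, data_batches, lr=0.01):
    """Local feed-forward/back-propagation on the replica (toy quadratic loss)."""
    for x, y in data_batches:
        grad = 2 * (params @ x - y) * x          # gradient of (x.w - y)^2
        params -= lr * grad                       # in-place local SGD update
    return params

def fake_allreduce(delta, result, n_workers=4):
    """Stand-in for the costly cross-worker synchronization phase."""
    time.sleep(0.05)                              # pretend network latency
    result["avg_delta"] = delta / n_workers       # averaged change across workers

def hpsgd_round(global_params, data_batches, pending_delta):
    # 1) Launch synchronization of the *previous* round's delta in the background.
    sync_out = {}
    sync_thread = threading.Thread(
        target=fake_allreduce, args=(pending_delta, sync_out))
    sync_thread.start()

    # 2) Meanwhile, train on a replica of the global model (the overlapped phase).
    replica = global_params.copy()
    replica = local_sgd_steps(replica, data_batches)
    new_delta = replica - global_params           # this round's local change

    # 3) Wait for the stale synchronization to finish, then merge the averaged
    #    change into the global model instead of overwriting it.
    sync_thread.join()
    global_params += sync_out["avg_delta"]
    return global_params, new_delta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=3)                        # global model parameters
    pending = np.zeros_like(w)                    # no delta yet on the first round
    for _ in range(5):
        batches = [(rng.normal(size=3), 1.0) for _ in range(8)]
        w, pending = hpsgd_round(w, batches, pending)
    print("parameters after 5 rounds:", w)
```

In a real multi-worker setting the threaded fake_allreduce would be replaced by a non-blocking collective, but the control flow (train on a replica while the one-round-stale delta synchronizes, then merge the averaged change into the global model) is the point of the sketch.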


research
07/23/2020

DBS: Dynamic Batch Size For Distributed Deep Neural Network Training

Synchronous strategies with data parallelism, such as the Synchronous St...
research
01/14/2019

A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks

Distributed synchronous stochastic gradient descent (S-SGD) with data pa...
research
05/22/2023

Adaptive Gradient Prediction for DNN Training

Neural network training is inherently sequential where the layers finish...
research
07/02/2021

ResIST: Layer-Wise Decomposition of ResNets for Distributed Training

We propose ResIST, a novel distributed training protocol for Residual Networks...
research
03/20/2023

Machine Learning Automated Approach for Enormous Synchrotron X-Ray Diffraction Data Interpretation

Manual analysis of XRD data is usually laborious and time consuming. The...
research
10/04/2019

Distributed Learning of Deep Neural Networks using Independent Subnet Training

Stochastic gradient descent (SGD) is the method of choice for distribute...
research
05/10/2019

Priority-based Parameter Propagation for Distributed DNN Training

Data parallel training is widely used for scaling distributed deep neura...
