Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization

12/03/2020
by Huiping Zhuang, et al.

Decoupled learning is a branch of model parallelism that parallelizes the training of a network by splitting it depth-wise into multiple modules. Because of their asynchronous implementation, decoupled learning techniques usually suffer from the stale gradient effect, which degrades performance. In this paper, we propose accumulated decoupled learning (ADL), which incorporates the gradient accumulation technique to mitigate the stale gradient effect. We give both theoretical and empirical evidence of how the gradient staleness is reduced. We prove that the proposed method converges to critical points, i.e., the gradients converge to 0, despite its asynchronous nature. Empirical validation is provided by training deep convolutional neural networks on classification tasks on the CIFAR-10 and ImageNet datasets. ADL is shown to outperform several state-of-the-art methods on these classification tasks and is the fastest among the compared methods.
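To make the core idea concrete, below is a minimal PyTorch-style sketch of gradient accumulation inside a depth-wise split network. It is not the authors' ADL implementation: the asynchronous pipeline and the delayed gradients it produces are omitted, and names such as module0, module1, and the accumulation interval K are illustrative assumptions.

```python
# Minimal sketch (not the ADL implementation) of gradient accumulation in a
# depth-wise split network. Names (module0, module1, K, ...) are illustrative;
# the asynchronous scheduling and gradient delay of ADL are omitted for brevity.
import torch
import torch.nn as nn

torch.manual_seed(0)

# The network is split depth-wise into two modules, each with its own optimizer.
module0 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
module1 = nn.Sequential(nn.Linear(64, 10))
opt0 = torch.optim.SGD(module0.parameters(), lr=0.1)
opt1 = torch.optim.SGD(module1.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

K = 4  # number of micro-batches whose gradients are accumulated per update

for step in range(20):
    x = torch.randn(8, 32)              # dummy micro-batch
    y = torch.randint(0, 10, (8,))

    # Detaching between modules lets each module run its backward pass
    # independently (asynchronously, in the real decoupled setting).
    h = module0(x)
    h_detached = h.detach().requires_grad_(True)
    loss = loss_fn(module1(h_detached), y)

    # Backward for module1; the resulting activation gradient is then fed
    # back to module0 (in the asynchronous setting it would arrive delayed).
    (loss / K).backward()
    h.backward(h_detached.grad)

    # Gradient accumulation: apply an update only every K micro-batches, so
    # staleness measured in parameter updates shrinks by a factor of K.
    if (step + 1) % K == 0:
        opt0.step()
        opt1.step()
        opt0.zero_grad()
        opt1.zero_grad()
```

The point of the accumulation step is that each module's parameters change only once every K micro-batches, so a gradient computed with activations from several micro-steps earlier lags by correspondingly fewer parameter updates.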

Related research

06/21/2019: Fully Decoupled Neural Network Learning Using Delayed Gradients
Using the back-propagation (BP) to train neural networks requires a sequ...

02/14/2018: Exploring Hidden Dimensions in Parallelizing Convolutional Neural Networks
The past few years have witnessed growth in the size and computational r...

09/05/2019: Diversely Stale Parameters for Efficient Training of CNNs
The backpropagation algorithm is the most popular algorithm training neu...

02/25/2020: Optimal Gradient Quantization Condition for Communication-Efficient Distributed Training
The communication of gradients is costly for training deep neural networ...

09/09/2020: Tunable Subnetwork Splitting for Model-parallelism of Neural Network Training
Alternating minimization methods have recently been proposed as alternat...

07/02/2020: Adaptive Braking for Mitigating Gradient Delay
Neural network training is commonly accelerated by using multiple synchr...

06/11/2021: Decoupled Greedy Learning of CNNs for Synchronous and Asynchronous Distributed Learning
A commonly cited inefficiency of neural network training using back-prop...
