Fully Decoupled Neural Network Learning Using Delayed Gradients

06/21/2019 ∙ by Huiping Zhuang, et al. ∙ Nanyang Technological University

Using the back-propagation (BP) to train neural networks requires a sequential passing of the activations and the gradients, which forces the network modules to work in a synchronous fashion. This has been recognized as the lockings (i.e., the forward, backward and update lockings) inherited from the BP. In this paper, we propose a fully decoupled training scheme using delayed gradients (FDG) to break all these lockings. The proposed method splits a neural network into multiple modules that are trained independently and asynchronously in different GPUs. We also introduce a gradient shrinking process to reduce the stale gradient effect caused by the delayed gradients. In addition, we prove that the proposed FDG algorithm guarantees a statistical convergence during training. Experiments are conducted by training deep convolutional neural networks to perform classification tasks on benchmark datasets. The proposed FDG is able to train very deep networks (>100 layers) and very large networks (>35 million parameters) with significant speed gains while outperforming the state-of-the-art methods and the standard BP.


1 Introduction

In recent years, deep neural networks, e.g., the convolutional neural network (CNN) lecun1998gradient and the recurrent neural network hochreiter1997long; cho2014learning, have demonstrated great success in numerous highly complex tasks. Such success is built, to a great extent, on the ability to train extremely deep networks enabled by ResNet he2016deep or other techniques with skip-connection-like structures zagoruyko2016wide; xie2017aggregated; huang2017densely; gastaldi2017shake. Training networks with back-propagation (BP) werbos1974beyond is standard practice, but it requires a complete forward and backward pass before the parameters can be updated. This requirement, recognized as the lockings jaderberg2017decoupled (i.e., the forward, backward and update lockings) inherited from the standard BP, easily leads to inefficiency, especially when training deeper networks. The existence of these lockings keeps the majority of the network on hold during training, thereby compromising the efficiency.

In order to improve the efficiency, a number of contributions have decoupled the training by splitting the network into multiple modules to facilitate model parallelization. These techniques can be roughly categorized into two groups: the backward-unlocking (BU) methods and the local error learning (LEL) methods.

The BU-based methods have access to the global information from the top layer and can break the backward locking. An additional benefit is that they often introduce no extra trainable parameters while enabling decoupled behaviors. Nonetheless, a full forward pass is still required before any parameter update. One important motivation for these techniques is to promote biological plausibility, which focuses on removing the weight symmetry and the gradient propagation of the BP. Feedback alignment (FA) lillicrap2016random removes the weight symmetry by replacing the symmetric weights with random ones. Direct feedback alignment nokland2016direct, which follows the FA, further unlocks the backward pass and enables a simultaneous update of all layers. However, these biologically inspired approaches suffer from performance losses and have been shown to scale poorly to more complex datasets bartunov2018assessing. On the other hand, delayed gradients provide another way to break the backward locking. The recently proposed decoupled learning using delayed gradients (DDG) huo2018decoupled is able to train extremely deep (up to 110 layers) CNNs, shows no performance loss in certain cases, and reduces the training time. However, since the DDG is still constrained by the forward locking, its computation time can only be reduced by about 50% even with multiple GPUs.

The LEL-based methods use local information and are more promising in terms of decoupling ability, because they are able to fully decouple (i.e., break the forward, backward and update lockings of) the neural network training. The full decoupling is achieved by building auxiliary local loss functions to generate local error gradients, severing the gradient flow between adjacent modules. The decoupled neural interface (DNI) proposed in jaderberg2017decoupled is one of the pioneers exhibiting parallel training potential for neural networks. This technique uses a local neural network to generate synthetic error gradients for the hidden layers so that the update can happen before either the forward or the backward pass completes. However, the DNI has been shown to be less capable of learning well and even to exhibit convergence problems in deeper networks huo2018decoupled. In mostafa2018deep, local classifiers with a cross-entropy loss are adopted, showing the potential to train the hidden layers simultaneously; however, the local classifier alone fails to match the performance of a standard BP. In nokland2019training, a similarity measure combined with the local classifier is introduced to provide local error gradients. The mixed loss functions can produce classification performances comparable with or even better than the BP baselines but have so far been tested only on VGG-like networks. Very recently, the depth problem of the LEL-based methods was alleviated by decoupled greedy learning (DGL) belilovsky2019decoupled, which is able to train extremely deep networks while maintaining comparable performance against a standard BP. The common sacrifice that any LEL technique has to make is the introduction of extra trainable parameters imposed by the auxiliary networks. For instance, to match the standard BP, the local learning in nokland2019training needs to train several times more parameters.

Methods                      DDG huo2018decoupled   DNI jaderberg2017decoupled   DGL belilovsky2019decoupled   FDG (ours)
Any lockings                 Yes                    No                           No                            No
Extra trainable parameters   No                     Yes                          Yes                           No
Table 1: Comparison with state-of-the-art methods in terms of lockings and extra trainable parameters.

In summary, both the BU-based and the LEL-based methods can decouple the training of neural networks while obtaining comparable performances against the standard BP. The LEL-based methods lead in fully decoupling the network learning but introduce extra trainable parameters, whereas the BU-based methods behave in the opposite way. In this paper, we propose a fully decoupled training scheme using delayed gradients (FDG) that shares the merits of both the BU-based and the LEL-based techniques (see Table 1). Although we adopt delayed gradients like the DDG huo2018decoupled and other asynchronous SGD methods dean2012large; lian2015asynchronous; zheng2017asynchronous, the proposed FDG utilizes a different training scheme, which is more efficient and has better generalization ability. The main contributions of this work are as follows:

We propose the FDG, a novel training technique that breaks the forward, backward and update lockings without introducing extra trainable parameters. We also develop a gradient shrinking (GS) process that can reduce the stale gradient effect caused by utilizing the delayed gradients.

Theoretical analysis is provided showing that the proposed technique guarantees a statistical convergence under certain conditions.

We conduct experiments by training deep CNNs and show that the proposed FDG obtains comparable or even better performances on benchmark datasets while significantly reducing the computation time.

2 Background

In this section, we provide some basic background knowledge for training a feedforward neural network. The forward, backward and update lockings are also revisited.

Assume we need to train an $n$-layer network. The $l$-th ($1 \le l \le n$) layer produces an activation $h_l = f_l(h_{l-1}; w_l)$ by taking $h_{l-1}$ as its input, where $f_l$ is an activation function and $w_l$ is a column vector representing the weights in layer $l$. The sequential generation of the activations constitutes the forward locking jaderberg2017decoupled, since $h_l$ will not be available before all the dependent activations $h_1, \dots, h_{l-1}$ are obtained. Let $w = [w_1^T, w_2^T, \dots, w_n^T]^T$ denote the parameter vector of the network. Assume $f(w; x, y)$ is a loss function that maps a high-dimensional vector to a scalar. The learning of the feedforward network can then be summarized as the following optimization problem:

(1)   $\min_{w} f(w; x, y),$

where $(x, y)$ represents the input-label information (or training samples). We will drop the $(x, y)$ in (1) in this paper for convenience: $\min_{w} f(w)$.

The gradient descent algorithm is often used to solve (1) by updating the parameters iteratively. At iteration $t$, we have

(2)   $w_{t+1} = w_t - \gamma_t \nabla f(w_t),$

where $\gamma_t$ is the learning rate and $\nabla f(w_t)$ is the gradient vector of the loss with respect to $w_t$. If the training sample size is large, we apply stochastic gradient descent (SGD) as a replacement by obtaining the gradient vector with respect to a mini-batch $x_{i(t)}$ of $x$, denoted $\nabla f_{x_{i(t)}}(w_t)$. Such a replacement is based on the following realistic assumption:

(3)   $\mathbb{E}_{x_{i(t)}}\big[\nabla f_{x_{i(t)}}(w_t)\big] = \nabla f(w_t).$
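For illustration, a minimal PyTorch sketch of the SGD replacement in (2)-(3) is given below; the model, loss function and mini-batch names are illustrative placeholders rather than code from the paper.

```python
import torch

def sgd_step(model, loss_fn, x_batch, y_batch, lr):
    # One SGD step: the mini-batch gradient replaces the full gradient in (2),
    # and is an unbiased estimate of it under assumption (3).
    model.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)      # f_{x_i(t)}(w_t)
    loss.backward()                              # mini-batch gradient w.r.t. all parameters
    with torch.no_grad():
        for w in model.parameters():
            if w.grad is not None:
                w -= lr * w.grad                 # w_{t+1} = w_t - gamma_t * gradient
    return loss.item()
```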

To obtain the gradient vectors, the BP (also known as the chain rule) can be employed. One can calculate the gradients in layer $l$ from the gradients in layer $l+1$ ($1 \le l < n$):

(4)   $\dfrac{\partial f}{\partial h_l} = \Big(\dfrac{\partial h_{l+1}}{\partial h_l}\Big)^T \dfrac{\partial f}{\partial h_{l+1}}, \qquad \dfrac{\partial f}{\partial w_l} = \Big(\dfrac{\partial h_l}{\partial w_l}\Big)^T \dfrac{\partial f}{\partial h_l},$

which indicates a dependency of $\partial f / \partial h_l$ on $\partial f / \partial h_{l+1}$. In other words, the gradients in layer $l$ remain unavailable until the gradient computations of all the dependent layers $l+1, \dots, n$ are completed. This is known as the backward locking jaderberg2017decoupled in BP. The existence of this locking prevents the update of layer $l$ before layers $l+1, \dots, n$. In addition, the parameter update must come after executing the forward pass. This is recognized as the update locking jaderberg2017decoupled. In the following, we will show that a full decoupling (i.e., breaking the forward, backward and update lockings) can be achieved.
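The backward locking in (4) can be made explicit with a small PyTorch example that back-propagates the error signal one layer at a time; the toy network and tensor shapes are illustrative only.

```python
import torch
import torch.nn as nn

# Toy illustration of (4): the error gradient of layer l is obtained from the
# error gradient of layer l+1, so per-layer weight gradients must be computed
# strictly from the top layer downwards (backward locking).
layers = nn.ModuleList([nn.Linear(8, 8) for _ in range(4)])
x = torch.randn(2, 8, requires_grad=True)

# Forward pass: each activation depends on all previous ones (forward locking).
activations = [x]
for layer in layers:
    activations.append(torch.relu(layer(activations[-1])))
loss = activations[-1].sum()

# Backward pass, layer by layer, passing the error signal downwards.
delta = torch.autograd.grad(loss, activations[-1], retain_graph=True)[0]
for l in reversed(range(len(layers))):
    h_in, h_out = activations[l], activations[l + 1]
    # The weight gradients of layer l need delta from layer l+1.
    grads = torch.autograd.grad(
        h_out, list(layers[l].parameters()) + [h_in],
        grad_outputs=delta, retain_graph=True)
    *weight_grads, delta = grads  # delta is now the error gradient w.r.t. this layer's input
```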

3 Fully Decoupled Neural Network Learning

In this section, we give details of the proposed FDG. This technique provides a fully decoupled asynchronous learning algorithm with a gradient shrinking (GS) process that is able to reduce the accuracy loss caused by the delayed gradients.

3.1 The Proposed FDG

We first split the network into $K$ modules, with each module containing a stack of layers. Then we rewrite $w$ in terms of modules as $w = [w_{\mathcal{G}(1)}^T, w_{\mathcal{G}(2)}^T, \dots, w_{\mathcal{G}(K)}^T]^T$, where $\mathcal{G}(k)$ denotes the layer indices in module $k$ and $g_k$ represents the first index in $\mathcal{G}(k)$.

As illustrated in Figure 1(a), during the decoupled training, module $k$ is able to perform BP using the delayed gradients passed from module $k+1$. Also, the error gradients in the first layer of module $k$ are passed to module $k-1$, while the activations of the last layer are passed to module $k+1$. This can be summarized as the following steps for module $k$ ($k < K$):

backward: after receiving the delayed error gradients from module $k+1$, we run BP within the module to compute the gradients $\tilde{g}_{\mathcal{G}(k),t}$ for each of its layers, and then update the module through

(5)   $w_{\mathcal{G}(k),t+1} = w_{\mathcal{G}(k),t} - \gamma_t\, \tilde{g}_{\mathcal{G}(k),t}.$

After that, we save the error gradients of the first layer for communication.

forward: run the input through this module and save the activation for communication.

communication: pass the saved stale error gradients to module $k-1$ and pass the saved activation to module $k+1$ as its new input.

In particular, the delayed gradient $\tilde{g}_{\mathcal{G}(k),t}$ applied in (5) is computed from the error gradients that module $K$ generated $K-k$ iterations earlier; that is, there is a delay of $K-k$ iterations of gradients from module $K$ to module $k$ (see Figure 1(b)). In module $K$, no such delay occurs because it interacts with the label information directly. It is easily noticed that the backward, forward and communication steps break the forward and the backward lockings, since all the modules can be trained in parallel, as shown in Figure 1(b).

On the other hand, different from the traditional training strategy, which forwards the input before back-propagating the error gradients, we perform the backward pass first and update the module before producing the module output. This update-before-forward strategy also breaks the update locking.
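As a rough sketch of this update-before-forward order, the function below processes one iteration of a single (non-final) module. The names (module, optimizer, inputs, gradients) are illustrative, and for brevity the forward graph is rebuilt from the stored input instead of being retained from the earlier forward pass.

```python
import torch

def fdg_module_step(module, optimizer, delayed_input, delayed_grad, new_input):
    # Backward: rebuild the graph for the earlier input, back-propagate the
    # delayed error gradients, and update the module before producing any
    # new output (update-before-forward).
    optimizer.zero_grad()
    delayed_input = delayed_input.detach().requires_grad_(True)
    module(delayed_input).backward(delayed_grad)
    optimizer.step()
    grad_to_send = delayed_input.grad.detach()   # error gradients of the first layer

    # Forward: run the newly received input through the updated module and
    # save the activation for the next communication step.
    new_activation = module(new_input.detach()).detach()
    return grad_to_send, new_activation
```

In the full scheme, the two returned values are exactly what the communication step passes to modules $k-1$ and $k+1$, respectively.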

Figure 1: The proposed FDG: an example with 3 split modules. (a) The backward, forward and communication steps for the modules. (b) A pipeline-like parallel training scheme for the proposed FDG. We can observe that the gradients from the last module reach module $k$ with a delay of $K-k$ iterations.

3.2 The Gradient Shrinking Process

Using the delayed gradients enables model parallelization but can also lead to a certain performance loss. This is a common phenomenon observed in algorithms with stale gradients chen2016revisiting. To compensate for the performance loss, we introduce a gradient shrinking (GS) process before back-propagating the delayed error gradients through each module.

Figure 2: An intuitive interpretation of the benefit brought by the GS process.

The GS process works in a very straightforward manner. At iteration $t$, before back-propagating the delayed error gradients through module $k$, we shrink the error gradients by multiplying them with a shrinking factor $\alpha$ ($0 < \alpha \le 1$). Denoting by $\delta_{k,t}$ the delayed error gradients received by module $k$, this can be written as

(7)   $\delta_{k,t} \leftarrow \alpha\, \delta_{k,t}.$

Then the module is updated through (5). In particular, if $\alpha = 1$, the GS process is not used.

The GS process acts similarly to scaling the learning rate of the corresponding module, as it determines how far we move along the direction of the negative gradients. We can interpret this process in an intuitive way, as shown in Figure 2. The delayed error gradients, especially those with longer delays, can lead to deteriorated performance chen2016revisiting. Figure 2(a) shows a scenario where the delayed gradients cause the learning to miss the local minimum by a large margin. By using the shrunk delayed gradients, we have a better chance of reducing the stale gradient effect (see Figure 2(b) for an illustration). The proposed FDG with the GS process is summarized in Algorithm I with the SGD optimizer.
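Because back-propagation is linear in the error signal, shrinking the delayed error gradients by $\alpha$ scales the resulting weight gradients (and hence the effective step) by the same factor; the short check below illustrates this with an arbitrary layer and error tensor, both of which are illustrative stand-ins.

```python
import torch
import torch.nn as nn

# Quick check: multiplying the delayed error gradients by alpha (the GS process)
# scales the resulting weight gradients by alpha, i.e., it acts like a
# per-module learning-rate scaling.
torch.manual_seed(0)
layer = nn.Linear(4, 4)
x = torch.randn(3, 4)
err = torch.randn(3, 4)      # stand-in for the delayed error gradients
alpha = 0.5

layer(x).backward(err)
g_full = layer.weight.grad.clone()
layer.zero_grad()
layer(x).backward(alpha * err)
g_shrunk = layer.weight.grad.clone()

assert torch.allclose(g_shrunk, alpha * g_full)   # backprop is linear in the error signal
```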

Comparison to DDG huo2018decoupled: although the DDG also adopts delayed gradients, the proposed FDG breaks all the lockings, whereas the DDG only succeeds in unlocking the backward pass. Additionally, the FDG shrinks the delayed gradients through the GS process, while the DDG feeds them to the modules directly. The effectiveness of the GS process is illustrated in the experiments.

Algorithm I: FDG (SGD)
Required: learning rate $\gamma_t$, number of split modules $K$, gradient shrinking factor $\alpha$.
Split the network into $K$ modules and initialize them with $w_{\mathcal{G}(1),0}, \dots, w_{\mathcal{G}(K),0}$.
for $t = 0, 1, 2, \dots$:
  Parallel for $k = 1, \dots, K$, do (backward and forward):
    if module $k$ is not the last module:
      compute the shrunk delayed gradients in each layer of module $k$ from the received error gradients multiplied by $\alpha$.
      update the module through (5); save the error gradients of the first layer of module $k$.
      forward the input through module $k$; produce and save the activation.
    else:
      update the module (without delay).
      forward the input through module $K$, calculating the loss.
      do BP for the module with respect to the loss; save the error gradients of the first layer.
  for $k = 1, \dots, K-1$, do (communication):
    clone the activation of module $k$ as the input of module $k+1$.
    clone the error gradients of the first layer in module $k+1$ as the delayed gradients of module $k$.
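For concreteness, the following is a compact, single-process PyTorch sketch of Algorithm I. It is not the authors' implementation: the "parallel for" is executed sequentially, each module rebuilds its forward graph when the delayed gradients arrive, and the last module's update/forward ordering is simplified; `split_modules`, `loader` and the other names are placeholders.

```python
import collections
import torch

def train_fdg(split_modules, loader, loss_fn, lr=0.1, alpha=1.0):
    K = len(split_modules)
    opts = [torch.optim.SGD(m.parameters(), lr=lr) for m in split_modules]
    new_input = [None] * K        # (activation, label) received at the last communication
    delayed_grad = [None] * K     # error gradients received from module k+1
    pending = [collections.deque() for _ in range(K)]  # inputs forwarded, awaiting their gradients
    sent_act = [None] * K         # activation saved for module k+1
    sent_grad = [None] * K        # input gradients saved for module k-1

    for x, y in loader:
        new_input[0] = (x, y)
        # --- backward and forward (run in parallel across GPUs in the real scheme) ---
        for k, (module, opt) in enumerate(zip(split_modules, opts)):
            sent_act[k], sent_grad[k] = None, None
            if k < K - 1:
                if delayed_grad[k] is not None:      # backward with the shrunk delayed gradients
                    h_in, _ = pending[k].popleft()
                    h_in = h_in.detach().requires_grad_(True)
                    opt.zero_grad()
                    module(h_in).backward(alpha * delayed_grad[k])   # GS with factor alpha
                    opt.step()
                    sent_grad[k] = h_in.grad.detach()
                    delayed_grad[k] = None
                if new_input[k] is not None:         # forward the newly received input
                    h, label = new_input[k]
                    sent_act[k] = (module(h.detach()).detach(), label)
                    pending[k].append(new_input[k])
                    new_input[k] = None
            elif new_input[k] is not None:           # last module: loss available, no delay
                h, label = new_input[k]
                h = h.detach().requires_grad_(True)
                opt.zero_grad()
                loss_fn(module(h), label).backward()
                opt.step()
                sent_grad[k] = h.grad.detach()
                new_input[k] = None
        # --- communication ---
        for k in range(K - 1):
            if sent_act[k] is not None:
                new_input[k + 1] = sent_act[k]       # activation of module k -> input of module k+1
            if sent_grad[k + 1] is not None:
                delayed_grad[k] = sent_grad[k + 1]   # error gradients of module k+1 -> module k
    return split_modules
```

A FIFO buffer per module keeps the inputs whose gradients have not yet returned, so that each delayed error gradient is back-propagated for the mini-batch it actually belongs to.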

4 Convergence Analysis

In this section, we prove that the proposed FDG in Algorithm I guarantees a statistical convergence. This proof is mainly based on two commonly used assumptions as follows.

Assumption 1.

The gradient of the loss function $f(\cdot)$ is Lipschitz continuous. This means there exists a constant $L > 0$ such that, for all $w_1$ and $w_2$:

(8)   $\|\nabla f(w_1) - \nabla f(w_2)\| \le L \|w_1 - w_2\|,$
(9)   $f(w_2) \le f(w_1) + \nabla f(w_1)^T (w_2 - w_1) + \tfrac{L}{2}\|w_2 - w_1\|^2.$

Assumption 2.

The second moment of the stochastic gradient is bounded. This means there exists a constant $M > 0$ such that:

(10)   $\mathbb{E}\big[\|\nabla f_{x_{i(t)}}(w_t)\|^2\big] \le M.$

Under Assumptions 1 and 2, we can obtain the FDG’s convergence property that is similar to the DDG’s huo2018decoupled in the following theorem.

Theorem 1.

Let Assumptions 1 and 2 hold, and assume that the learning rate $\gamma_t$ is diminishing. Then the proposed FDG in Algorithm I satisfies a per-iteration bound, given in (11), on the expected decrease of the loss; its right-hand side depends on $\gamma_t$, the gradient norm $\|\nabla f(w_t)\|$, and the constants $L$ and $M$ from Assumptions 1 and 2.

As shown in Theorem 1, the behavior of the expected loss value is controlled by the learning rate $\gamma_t$: whenever the right-hand side of (11) is less than zero, the FDG is guaranteed to converge statistically. The proof of Theorem 1 is provided in the supplementary materials.

5 Experiments

In this section, we conduct experiments with several ResNet-like structures on image classification tasks (on the CIFAR-10 and CIFAR-100 krizhevsky2009learning datasets). The experiments show that the proposed FDG is able to obtain comparable or better results than the state-of-the-art methods and the BP baselines while accelerating the training significantly. The source code and trained models will be made publicly available at https://github.com/ZHUANGHP/FDG.git.

Implementation Details: We implement our proposed method on the PyTorch platform paszke2017automatic and evaluate it using ResNet he2016deep and WRN zagoruyko2016wide models on CIFAR-10 and CIFAR-100 krizhevsky2009learning. These datasets are pre-processed with standard data augmentation (i.e., random cropping, random horizontal flipping and normalization he2016deep; huang2017densely). We use the SGD optimizer with an initial learning rate of 0.1 and a momentum of 0.9, together with weight decay. All the models are trained using a batch size of 128 for 300 epochs. The learning rate is divided by 10 at 150, 225 and 275 epochs. Our experiments are run using one or more Tesla K80 GPUs. The test errors of the FDG are reported as the median of 3 runs.
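A sketch of this training configuration is given below. The architecture constructor (torchvision's ResNet-18 stands in for the evaluated networks) and the weight-decay value (not specified above) are placeholders; the augmentation, batch size, momentum and learning-rate schedule follow the description, and the inner loop shows the standard BP baseline step.

```python
import torch
import torch.nn.functional as F
import torchvision
import torchvision.transforms as T

# CIFAR-10 training setup: standard augmentation, SGD with momentum 0.9,
# batch size 128, 300 epochs, learning rate 0.1 divided by 10 at 150/225/275.
transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True,
                                         transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True,
                                     num_workers=4)

model = torchvision.models.resnet18(num_classes=10)   # placeholder architecture
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            weight_decay=5e-4)         # weight decay: placeholder value
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[150, 225, 275],
                                                 gamma=0.1)

for epoch in range(300):
    for x, y in loader:
        # Standard BP baseline step; the FDG runs replace this inner loop
        # with the decoupled iteration sketched after Algorithm I.
        optimizer.zero_grad()
        F.cross_entropy(model(x), y).backward()
        optimizer.step()
    scheduler.step()
```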

Architecture   # params   BP                                DDG huo2018decoupled   DGL belilovsky2019decoupled   FDG (ours)
ResNet-20      0.27M      8.75% he2016deep / 7.78%†         -                      -                             7.92% (α=1) / 7.23% (α=0.2)
ResNet-56      0.46M      6.97% he2016deep / 6.19%†         6.89%                  -                             6.20% (α=1) / 5.90% (α=0.5)
ResNet-110     1.70M      6.41% he2016deep / 5.79%†         6.59%                  6.50±0.10%                    5.79% (α=1) / 5.73% (α=0.5)
ResNet-18      11.2M      6.48% huang2018learning / 4.87%†  -                      -                             4.82% (α=1) / 4.79% (α=0.8)
WRN-28-10      36.5M      4.00% zagoruyko2016wide           -                      -                             4.13% (α=1) / 3.85% (α=0.7)
Table 2: The Top 1 errors for various CNN structures on the CIFAR-10 dataset under a split number K=2. "-" indicates that the result is not reported in the original paper. Results marked with "†" are rerun using our training strategy. A warm-up is used for the first 3 epochs when training ResNet-110.

5.1 Compare with BP and State-of-the-Art Methods

We compare performances of four different methods, including the BP, the DDG huo2018decoupled , the DGL belilovsky2019decoupled and our proposed FDG. The DNI jaderberg2017decoupled is not included as its performance deteriorates severely with deeper networks huo2018decoupled .

CIFAR-10: We begin by reporting the classification results on the CIFAR-10 dataset, which includes 50000 training and 10000 testing color images with 10 classes. The networks are trained using 50000 training samples without any validation and we report the test error at the last epoch.

In this CIFAR-10 experiment, we split the original network roughly at the center into two modules (K=2) and train them asynchronously and independently on 2 GPUs. For the conventional ResNet structures, we rerun the BP baselines of he2016deep using our training strategy, which gives better results than those reported in huo2018decoupled and belilovsky2019decoupled. Since we use SGD instead of Adam kingma2014adam in the experiments, the improved baselines might be explained by adaptive optimizers (e.g., Adam) being more prone to over-fitting wilson2017marginal.

The corresponding classification results are reported in Table 2. We can observe that our proposed method is able to achieve lower classification errors than all the state-of-the-art methods. We report the FDG results both without (α=1) and with (α<1) the GS process. By shrinking the delayed error gradients, the generalization abilities of these ResNet-like networks are enhanced to even surpass their BP counterparts. In particular, the improvements on ResNet-20 and WRN-28-10 are not trivial. The delayed gradients can be treated as up-to-date gradients corrupted by noise; this poses difficulties for the networks to learn, but it also introduces a certain regularization during training, which explains the improved performances. We also provide the learning curves (see the top panel in Figure 3) for ResNet-56 and ResNet-110. It is clear that our proposed method converges in the same way a standard BP does throughout the training process. In particular, the error rate of 3.85% obtained by decoupling the WRN-28-10 is a new state-of-the-art result for CIFAR-10 among the published decoupling techniques.

Architecture   # params   BP                                  DDG huo2018decoupled   FDG (ours)
ResNet-56      0.46M      30.21% huo2018decoupled / 27.68%†   29.83%                 27.87% (α=1) / 27.70% (α=0.8)
ResNet-110     1.70M      28.10% huo2018decoupled / 25.82%†   28.61%                 25.73% (α=1) / 25.43% (α=0.5)
ResNet-18      11.2M      22.35%                              -                      22.78% (α=1) / 22.18% (α=0.5)
WRN-28-10      36.5M      19.2% zagoruyko2016wide             -                      20.28% (α=1) / 19.08% (α=0.6)
Table 3: The Top 1 errors for various CNN structures on the CIFAR-100 dataset under a split number K=2. "-" indicates that the result is not reported in the original paper. Results marked with "†" are rerun using our training strategy. A warm-up is used for the first 3 epochs when training ResNet-110.
Figure 3: Learning curves of the BP, the DDG and the FDG for ResNet-56 and ResNet-110 on CIFAR-10 and CIFAR-100 datasets. The top panel indicates the learning curves (on error and loss) on CIFAR-10. The bottom panel shows the learning curves on CIFAR-100.

CIFAR-100: We now study the classification performance (K=2) on CIFAR-100, which contains the same number of training and testing samples as CIFAR-10 but with 100 classes. The training strategy follows the CIFAR-10 experiment and the performance is reported as the Top 1 error rate.

We also rerun the baselines using our training strategy, which again surpass the baselines provided in huo2018decoupled. The classification results are reported in Table 3. We observe that the proposed FDG again beats the state-of-the-art methods, improving on their classification results by at least 2%. More importantly, the classification performances of the proposed FDG are able to match the rerun BP baselines. The learning curves shown in the bottom panel of Figure 3 indicate that the proposed method converges in the same way as the standard BP. The Top 1 error rate of 19.08% is also a new state-of-the-art result for the CIFAR-100 dataset among the published decoupling methods.

Figure 4: (a) The impact of the GS process with different values of α for training ResNet-20 on the CIFAR-10 dataset. The classification results that surpass the standard BP are shown in red. (b)-(c) The learning curves (on error and loss, respectively) for ResNet-56 on CIFAR-10 when scaling to more GPUs.

5.2 The Impact of the GS Process

We empirically evaluate the impact of the GS process by experimenting with various values of the shrinking factor α. This evaluation is conducted by training ResNet-20 on the CIFAR-10 dataset. The bar chart in Figure 4(a) reports the Top 1 error rates. We notice that the results for the proposed FDG are able to surpass the BP baseline with a small amount of tuning of α. This also shows that the introduction of the GS process does enhance a network's generalization ability.

5.3 Speed Gains for Scaling to More GPUs

In this experiment, we study the performance of ResNet-56 on CIFAR-10 by splitting it into K=3 and K=4 modules, with each module trained on an independent GPU (a sketch of such a split is given after Table 4). This experiment shows the empirical behavior of the proposed FDG in the presence of more split modules. The results are shown in Table 4, where we list the test errors of the FDG, with and without the GS process, against the BP. It becomes more obvious that more split modules cause the FDG to lose accuracy when the delayed gradients are used directly (α=1). By applying the GS process, the classification performances can be restored to the BP baselines. The improved performances indicate that the GS process plays an essential role in reducing the stale gradient effect, especially with more split modules. Table 4 also shows that the use of more GPUs significantly reduces the computation time, by more than 55%. (One could obtain a more significant acceleration by improving the efficiency of the communication among GPUs, but this is beyond the scope of this work.) On the other hand, as indicated in Figure 4(b)-(c), the convergence behaviors of the FDG with K=3 and K=4 still exhibit little difference from the BP.

Method      Test error                      Time
BP          6.19%                           100%
FDG (K=2)   6.20% (α=1) / 5.90% (α=0.5)     60.64%
FDG (K=3)   6.40% (α=1) / 6.08% (α=0.2)     52.03%
FDG (K=4)   6.83% (α=1) / 6.14% (α=0.3)     44.32%
Table 4: The Top 1 errors and the computation time (as a percentage of the BP time) for training ResNet-56 on CIFAR-10 when scaling to more GPUs. A warm-up is used for the first 3 epochs when K=3 and K=4.
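The sketch below illustrates one naive way to cut a torchvision ResNet into K sequential modules and place each on its own GPU. The split heuristic and the nn.Flatten insertion are ours, not the paper's exact split points, which balance the modules roughly by depth.

```python
import torch
import torch.nn as nn
import torchvision

def split_into_modules(model: nn.Module, K: int):
    """Cut a sequential-style network into K chunks, one per GPU if available."""
    blocks = list(model.children())
    blocks.insert(-1, nn.Flatten(1))            # keep the final fc layer usable
    per_module = -(-len(blocks) // K)           # ceiling division
    modules = [nn.Sequential(*blocks[i:i + per_module])
               for i in range(0, len(blocks), per_module)]
    n_gpu = torch.cuda.device_count()
    for k, m in enumerate(modules):
        m.to(f"cuda:{k % n_gpu}" if n_gpu > 0 else "cpu")
    return modules

# Example: ResNet-18 cut into K=3 modules for the scaling experiment.
modules = split_into_modules(torchvision.models.resnet18(num_classes=10), K=3)
print([sum(p.numel() for p in m.parameters()) for m in modules])  # rough balance check
```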

6 Conclusion

In this paper, we have utilized delayed gradients to develop the FDG, a novel training technique that breaks the forward, backward and update lockings in neural network learning. We have also introduced the gradient shrinking process, which helps reduce the stale gradient effect caused by the delayed gradients. In addition, theoretical analysis has shown that the proposed FDG guarantees a statistical convergence under certain conditions. Finally, we have conducted experiments on CNNs, showing that the FDG outperforms the state-of-the-art methods and obtains comparable or even better performances than the standard BP while significantly accelerating the training process.

References

A. Proof of Theorem 1

Proof.

According to Assumption 1, applying the descent inequality (9) to the consecutive iterates $w_t$ and $w_{t+1}$ gives an upper bound, denoted (a), on $f(w_{t+1}) - f(w_t)$ in terms of the inner product $\nabla f(w_t)^T (w_{t+1} - w_t)$ and the squared step length $\|w_{t+1} - w_t\|^2$.

Taking the expectation on both sides of (a) and substituting the FDG update rule yields an inequality, denoted (b), whose right-hand side consists of an inner-product term between the true gradient and the expected delayed stochastic gradient, and a second-moment term of the update.

The inner-product term is bounded by combining the unbiasedness of the stochastic gradients in (3), the second-moment bound in Assumption 2, and the Lipschitz continuity in Assumption 1, which controls the gap between the delayed gradients and the up-to-date gradients. The second-moment term is bounded directly using the unbiasedness of the SGD gradients and Assumption 2.

Substituting both bounds into (b) and rearranging, together with the diminishing learning rate $\gamma_t$, gives the inequality in (11), which completes the proof. ∎