Improving Layer-wise Adaptive Rate Methods using Trust Ratio Clipping

11/27/2020
by Jeffrey Fong, et al.
National University of Singapore

Training neural networks with large batches is of fundamental significance to deep learning. Large batch training remarkably reduces training time but has difficulty maintaining accuracy. Recent works have put forward optimization methods such as LARS and LAMB to tackle this issue through adaptive layer-wise optimization using trust ratios. Though prevailing, such methods are observed to still suffer from unstable and extreme trust ratios, which degrade performance. In this paper, we propose a new variant of LAMB, called LAMBC, which employs trust ratio clipping to stabilize the magnitude of the trust ratio and prevent extreme values. We conducted experiments on image classification tasks such as ImageNet and CIFAR-10, and our empirical results demonstrate promising improvements across different batch sizes.


1 Introduction

The recent trend towards large scale datasets Deng et al. (2009); Abu-El-Haija et al. (2016); Sun et al. (2020) requires training large neural networks to learn effectively. However, employing such large neural networks incurs the cost of larger computational power requirements and additional training time. For example, previous work took 29 hours to train ResNet-50, a state-of-the-art deep learning model, on 8 Tesla P100 GPUs He et al. (2016b). Therefore, many optimization techniques have been proposed to accelerate the training of large deep neural networks. Some works have focused on data-parallel optimization, where each global minibatch of data is distributed among the workers Krizhevsky (2014); Goyal et al. (2017); Li et al. (2014), while others have focused on model-parallel methods Shoeybi et al. (2019); Rajbhandari et al. (2019).

One prominent type of technique involves large-batch optimization, whereby gradients are computed on large minibatches in parallel. Such techniques have seen a resurgence recently due to advances in hardware capabilities, and have been shown in previous works to accelerate the training of large deep neural networks. For example, Goyal et al. (2017) successfully trained ResNet-50 in 1 hour on 256 GPUs using distributed Stochastic Gradient Descent (SGD) with 8K minibatch size. However, such methods also underscore the need for adaptive learning rate mechanisms for large batch training. To address this need, recent work implemented layerwise adaptive learning rates for large batch training. The most successful ones are LARS You et al. (2017) and LAMB You et al. (2019), which calculate the trust ratio (ratio of L2-norm of weights over L2-norm of gradients) of each layer in the network. LARS and LAMB have been shown to scale ResNet-50 and BERT Devlin et al. (2018) models up to a batch size of 32K without loss of accuracy, while drastically reducing the training time.

Though prevailing, such layerwise adaptive methods are observed to still suffer from unstable and extreme trust ratios, which degrade performance. This happens when the weight norm becomes too large compared to the gradient norm, resulting in possible divergence. To this end, we propose an approach that clips the trust ratio to a range of values. Inspired by recent work NVIDIA (2020) that proposed trust ratio clipping on LARS, we propose a new variant of LAMB, called LAMBC, that clips the trust ratio for LAMB.

Contributions.

Our contributions in this paper are twofold: (1) we develop a new variant of LAMB, called LAMBC, for achieving stability and improvement in performance over standard LAMB, and (2) we demonstrate the effectiveness of trust ratio clipping across different image classification tasks such as ImageNet and CIFAR-10.

2 Background

Many neural networks can be trained using stochastic gradient based methods, which follow the update rule:

$x_{t+1} = x_t + \eta_t u_t$   (1)

where $x_t$ are the weights, $\eta_t$ is the learning rate and $u_t$ is the update at time step $t$. $u_t$ differs between optimizers. For example, in SGD, $u_t = -g_t$, where $g_t$ is the minibatch gradient, while in Adam Kingma and Ba (2014), $u_t = -m_t / (\sqrt{v_t} + \epsilon)$, where $m_t$ and $v_t$ are the first and second moment estimates. To enable training with large batches, one approach is to adjust the learning rate (LR). However, the main obstacle for such a method is the instability of training with a high LR. Goyal et al. (2017) proposed LR warm-up, which starts with a small LR and gradually increases it to the target. However, such methods require manual adjustment of the LR schedule (e.g., the rate of increase and the target LR in LR warm-up). Furthermore, such methods are unable to maintain accuracy for batch sizes larger than 8K. These problems led to the layerwise adaptive methods proposed by You et al. (2017, 2019).
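As an illustration of the LR warm-up idea described above, the following is a minimal sketch that assumes a linear ramp. The function name `warmup_lr`, its parameters and the example values are hypothetical; the text only states that the LR starts small and increases gradually to the target.

```python
def warmup_lr(step, warmup_steps, start_lr, target_lr):
    """Linearly ramp the learning rate from start_lr to target_lr.

    The linear shape is an assumption made for illustration only.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    frac = step / warmup_steps
    return start_lr + frac * (target_lr - start_lr)


# The rate of increase (warmup_steps) and the target LR both have to be
# chosen by hand, which is the manual tuning burden mentioned above.
for step in (0, 250, 500, 1000):
    print(step, warmup_lr(step, warmup_steps=500, start_lr=0.01, target_lr=0.1))
```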

2.1 Layerwise Adaptive Methods

In layerwise adaptive methods, the general strategy is to perform layerwise normalization, where each layer's update is normalized to unit L2-norm. This is performed in the form $u_t^{(i)} / \|u_t^{(i)}\|$, where $u_t^{(i)}$ is the update and $(i)$ refers to the $i$-th layer. Similarly, the learning rate is also scaled layerwise by $\phi(\|x_t^{(i)}\|)$ for some function $\phi$. Thus, the modifications result in the following weight update rule:

$x_{t+1}^{(i)} = x_t^{(i)} - \eta_t \frac{\phi(\|x_t^{(i)}\|)}{\|g_t^{(i)}\|} g_t^{(i)}$   (2)

where $\eta_t$ is the learning rate, and $x_t^{(i)}$ and $g_t^{(i)}$ are the weights and gradients of the $i$-th layer. For LARS, $g_t^{(i)}$ is taken to be $m_t^{(i)}$, where $m_t$ is the first moment. For LAMB, it is $m_t^{(i)} / (\sqrt{v_t^{(i)}} + \epsilon)$, where $v_t$ and $\epsilon$ are the second moment and a small offset respectively. Eq. 2 introduces a new term, $\phi(\|x_t^{(i)}\|) / \|g_t^{(i)}\|$, which is called the trust ratio. The trust ratio is essentially the ratio of the L2-norm of the weights over the L2-norm of the gradients. Intuitively, this offers a major benefit for large batch training. Such a normalization provides robustness to exploding and vanishing gradients, since the trust ratio explicitly compares the magnitudes of the weights and the gradients for each layer. When gradients explode, they are much larger than the weights, so the trust ratio becomes small and lowers the effective LR, reducing the chance of divergence. The opposite effect occurs for vanishing gradients.
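The effect of the trust ratio on exploding and vanishing gradients can be checked numerically. The sketch below is not taken from any LARS or LAMB implementation: the helper names are hypothetical, the scaling function is assumed to be the identity, and the fallback to 1.0 for zero norms is an assumption.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def trust_ratio(weights, grads):
    """L2-norm of weights over L2-norm of gradients (identity scaling assumed)."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0  # assumed fallback when either norm is zero
    return w_norm / g_norm


def layerwise_step(weights, grads, lr):
    """One layerwise update: w <- w - lr * trust_ratio * g."""
    ratio = trust_ratio(weights, grads)
    return [w - lr * ratio * g for w, g in zip(weights, grads)]


weights = [3.0, 4.0]                 # norm 5
exploding = [30.0, 40.0]             # norm 50, much larger than the weights
vanishing = [0.03, 0.04]             # norm 0.05, much smaller than the weights
print(trust_ratio(weights, exploding))   # small ratio damps the step
print(trust_ratio(weights, vanishing))   # large ratio amplifies the step
```

In both cases the length of the resulting step is `lr` times the weight norm, independent of the gradient magnitude, which is the normalization property the paragraph above describes.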

2.2 LAMB

The LAMB algorithm is an instantiation of the layerwise adaptive strategy, with the normalization modification performed on the Adam optimizer. In LAMB, there are two normalizations. The first occurs when the first moment $m_t$ is normalized by the square root of the second moment, $\sqrt{v_t}$, providing adaptivity for each weight. The second occurs layerwise when computing the trust ratio. Despite having two normalizations, the authors of LAMB You et al. (2019) provided convergence guarantees for the algorithm. Algorithm 1 shows the pseudocode for LAMB.

3 Methodology

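Algorithm 1: LAMB and LAMBC algorithms

Given: initial weights $x_1$, learning rate policy $\eta_t$, $\beta_1$, $\beta_2$, $\epsilon$, $L$-layer neural network model $f$, clipping parameters $clip \in \{True, False\}$, $\mu_{\min}$, $\mu_{\max}$
Initialize: $m_0 = 0$, $v_0 = 0$
for $t = 1$ to $T$ do
    Draw b samples from training set.
    for $i = 1$ to $L$ do
        $g_t^{(i)} = \nabla f(x_t^{(i)})$ on the drawn samples
        $m_t^{(i)} = \beta_1 m_{t-1}^{(i)} + (1 - \beta_1) g_t^{(i)}$
        $v_t^{(i)} = \beta_2 v_{t-1}^{(i)} + (1 - \beta_2) (g_t^{(i)})^2$
        $\hat{m}_t^{(i)} = m_t^{(i)} / (1 - \beta_1^t)$
        $\hat{v}_t^{(i)} = v_t^{(i)} / (1 - \beta_2^t)$
    end for
    Compute ratio $r_t^{(i)} = \hat{m}_t^{(i)} / (\sqrt{\hat{v}_t^{(i)}} + \epsilon)$
    Compute trust ratio $\rho_t^{(i)} = \phi(\|x_t^{(i)}\|) / \|r_t^{(i)}\|$
    if $clip$ = True then
        $\rho_t^{(i)} = \min(\max(\rho_t^{(i)}, \mu_{\min}), \mu_{\max})$
    end if
    $x_{t+1}^{(i)} = x_t^{(i)} - \eta_t \rho_t^{(i)} r_t^{(i)}$
end for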

Layerwise adaptive methods, such as LAMB, are observed to suffer from unstable and extreme trust ratios, which degrade performance. This happens when the weight norm becomes too large compared to the gradient norm, leading to divergence during training. To solve this problem, we apply a clipping operation to the trust ratio (the ratio between the L2-norm of the weights and the L2-norm of the per-layer gradients), constraining it to lie between predefined lower and upper bounds. Specifically, at any time $t$,

$\rho_t^{(i)} = \min\left(\max\left(\frac{\phi(\|x_t^{(i)}\|)}{\|g_t^{(i)}\|}, \mu_{\min}\right), \mu_{\max}\right)$   (3)

where $\rho_t^{(i)}$ is the clipped trust ratio, $\phi$ is the layerwise scaling function, $\mu_{\min}$ and $\mu_{\max}$ are the lower and upper bounds for all layers, and $x_t^{(i)}$ and $g_t^{(i)}$ are the weights and gradients for layer $i$ respectively. In our implementation, the lower bound $\mu_{\min}$ is set to a fixed constant. Clipping the trust ratio prevents the weight update from exploding to huge values. As such, the degraded performance or training divergence caused by extreme and unstable trust ratios can be alleviated.
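To make the clipping step concrete, the following is a minimal single-layer sketch of an Adam-style update with a clipped trust ratio. It is an illustration under stated assumptions, not the authors' implementation: the function name `lambc_layer_step`, the default hyperparameters, the identity scaling function, the omission of weight decay and the lower bound of 0.0 are all assumptions. The upper bound of 1.0 follows the setting used in Section 4.1.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def lambc_layer_step(x, g, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-6,
                     lower=0.0, upper=1.0, use_clip=True):
    """One update for a single layer; returns (new_x, new_m, new_v, trust_ratio).

    lower=0.0 is an assumed value; weight decay is left out for brevity.
    """
    # Adam-style first and second moment estimates with bias correction.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    r = [mh / (math.sqrt(vh) + eps) for mh, vh in zip(m_hat, v_hat)]

    # Layerwise trust ratio: weight norm over update norm.
    x_norm, r_norm = l2_norm(x), l2_norm(r)
    ratio = x_norm / r_norm if x_norm > 0 and r_norm > 0 else 1.0

    # Clipping keeps the ratio inside [lower, upper].
    if use_clip:
        ratio = max(lower, min(ratio, upper))

    new_x = [xi - lr * ratio * ri for xi, ri in zip(x, r)]
    return new_x, m, v, ratio


x, g = [3.0, 4.0], [0.1, -0.2]
zeros = [0.0, 0.0]
_, _, _, clipped = lambc_layer_step(x, g, zeros, zeros, t=1, lr=0.01)
_, _, _, unclipped = lambc_layer_step(x, g, zeros, zeros, t=1, lr=0.01, use_clip=False)
print(clipped, unclipped)
```

With `use_clip=False` the same function reduces to the unclipped variant, so the two can be compared on identical inputs.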

The upper and lower bounds in the clipping operation in Eq. 3 are hyperparameters that are manually tuned and held constant for all layers. To improve the flexibility of training, one may consider using adaptive methods to decide the upper and lower bounds for different layers during training. In our preliminary experiments, we tried an adaptive method from Luo et al. (2019) to investigate this possibility. However, our empirical results show that this adaptive method does not perform as well as manually defined bounds, although it still improves over no clipping. We postulate that the dynamics of the upper and lower bound functions in Luo et al. (2019) do not fit the evolution of the trust ratio values during training, resulting in scenarios where the trust ratio is clipped when it should not be.

4 Experiments

In this section, we describe the experiments conducted to validate the performance of trust ratio clipping. Our experiments aim to answer the following questions:

  1. Can trust ratio clipping help with the task generalization and test performance?

  2. If trust ratio clipping works, what is the best or recommended trust ratio value that we should adopt?

  3. Does the trust ratio clipping work in a more complex image classification task such as on ImageNet Deng et al. (2009)?

Following the three aforementioned questions, we divide the experiment section into three parts, each with individual experiments that address the respective research questions in detail.

4.1 Image classification on CIFAR10

In this section, we aim to find out whether applying trust ratio clipping improves test performance compared to training without clipping. We test our hypothesis on the CIFAR10 dataset Krizhevsky et al. (2009), which contains 60000 32x32 colour images in 10 classes. Due to limited computation resources, we choose ResNet-18 He et al. (2016a) as our neural network model backbone. We set the learning rate to 1e-2 and the number of epochs to 80, and compare the performance with and without trust ratio clipping on various batch sizes, ranging from 1000 to 3000. If trust ratio clipping is enabled, we clip the trust ratio to be at most 1.

Figure 1: Image classification task on CIFAR10 dataset Krizhevsky et al. (2009). We compare the scenarios with and without the trust ratio clipping. X-axis is the number of epochs and y-axis is the prediction accuracy in percentage. We conduct experiments on different batch sizes: 1000 (left), 2000 (middle) and 3000 (right).
Batch Size                       1000     2000     3000
Test Accuracy (clip), %          87.71    87.30    86.29
Test Accuracy (no clip), %       85.68    86.61    85.41
Table 1: Quantitative results for the test performance across different batch sizes.

Figure 1 and Table 1 show the evaluation results on the CIFAR10 dataset. All three batch sizes clearly indicate an improvement in final test performance brought by trust ratio clipping. With batch size 1K, trust ratio clipping yields the largest improvement in test accuracy, about 2 percentage points, while the other batch sizes improve by about 0.7 to 0.9 percentage points. Therefore, we conclude that trust ratio clipping can improve task generalization and test performance.
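The per-batch-size gains can be recomputed directly from the accuracies reported in Table 1. The variable names below are illustrative only.

```python
batch_sizes = [1000, 2000, 3000]
acc_clip = [87.71, 87.3, 86.29]      # test accuracy with clipping, in %
acc_no_clip = [85.68, 86.61, 85.41]  # test accuracy without clipping, in %

# Gain in percentage points for each batch size.
gains = {
    b: round(c - n, 2)
    for b, c, n in zip(batch_sizes, acc_clip, acc_no_clip)
}
print(gains)  # {1000: 2.03, 2000: 0.69, 3000: 0.88}
```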

4.2 Selecting the suitable trust ratio clipping bound

Having observed the success of trust ratio clipping in the previous experiments, we next ask what the best or recommended clipping value is. We conduct another set of experiments on the CIFAR10 dataset Krizhevsky et al. (2009) with different upper bound values for trust ratio clipping. We test four upper bound values: 1, 3, 5 and 10.

Figure 2: Image classification task on CIFAR10 dataset Krizhevsky et al. (2009) with different trust ratio values. X-axis is the number of epochs and y-axis is the prediction accuracy in percentage.

Figure 2 shows the evaluation results. Interestingly, all the conditions with trust ratio clipping outperform the no-clipping setup, and the final test performance decreases as the maximum clipping value increases. From the figure, clipping with a maximum value of 1 is the best, 10 is the worst, and 3 and 5 sit in between. Since the trust ratio reflects the ratio between the magnitudes of the neural network weights and gradients, a possible explanation is that drastic updates to weights whose scale has stabilized may jeopardize generalization performance: once the scale of the weights stabilizes, drastic changes (with trust ratio > 1) to the weights carry a higher risk of degrading generalization. This observation is also consistent with the first experiment in Figure 1, where models with clipping only start to surpass models without clipping at a later stage, after the scale of the weights has stabilized.

One insight from these results is that the maximum trust ratio clipping value could be adjusted dynamically. At the beginning of training, larger trust ratios should be allowed, but they should be avoided after the weight scale stabilizes (e.g., in the converging phase) to prevent the risk of degrading generalization performance.
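One way such a dynamic bound could look is sketched below. This schedule was not evaluated in the experiments; the function name, the linear decay and the parameter `decay_steps` are hypothetical, and the start and end values simply reuse the largest and smallest bounds tested above (10 and 1).

```python
def upper_bound_schedule(step, decay_steps, start_bound=10.0, end_bound=1.0):
    """Hypothetical schedule: allow large trust ratios early, tighten later.

    Decays linearly from start_bound to end_bound over decay_steps,
    then stays at end_bound for the rest of training.
    """
    if decay_steps <= 0 or step >= decay_steps:
        return end_bound
    frac = step / decay_steps
    return start_bound + frac * (end_bound - start_bound)


# The value returned here would replace a constant upper bound when
# clipping the trust ratio at each training step.
for step in (0, 50, 100, 200):
    print(step, upper_bound_schedule(step, decay_steps=100))
```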

4.3 Image classification on ImageNet

In this section, we aim to find out whether trust ratio clipping works in more complex image classification tasks such as ImageNet Deng et al. (2009). ImageNet consists of 14 million real images with a total of 1000 classes. The original dataset occupies about 500 GB of disk space, which exceeds the capability of our computational resources, so we conduct our experiments on the down-sampled ImageNet dataset with images of size 64x64x3. The batch size is also limited to 400.

The final result is shown in Figure 3. From the figure, even on the more complex ImageNet classification task, our proposed trust ratio clipping still helps generalization performance and outperforms the model without trust ratio clipping. Although our model seems to over-fit the training data, with a much higher accuracy on the training dataset, our objective is not to achieve the best absolute performance on the test dataset but to show the effectiveness of trust ratio clipping relative to the model without it.

Figure 3: Image classification task on the ImageNet dataset. X-axis is the number of epochs and y-axis is the prediction accuracy in percentage.

5 Future Work

As observed from the empirical results, it is crucial to define a good clipping bound for the task. Ideally, the clipping bound should be tuned as accurately as possible, so that each weight update is significant while its magnitude remains just large enough to be controllable. This requires going beyond manual specification of clipping bound values and exploring adaptive methods for the clipping bounds. In our preliminary experiments, we attempted the technique from Luo et al. (2019) and provided an analysis of its possible subpar performance compared to manual clipping. Following this line of thought, we suggest other possible methods for trust ratio adaptivity. Inspired by Ede and Beanland (2020), a possible approach is to maintain a running mean and standard deviation of the trust ratio and to set the clipping bound a chosen number of standard deviations away from the mean.
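A rough sketch of this idea is given below. It is an assumption-laden illustration rather than the method of Ede and Beanland (2020): the class name, the use of Welford's running statistics, the warm-up of 5 observations and the choice of 3 standard deviations are all hypothetical.

```python
import math


class RunningClipBound:
    """Clip trust ratios at mean + num_std * std of previously seen ratios."""

    def __init__(self, num_std=3.0, warmup=5):
        self.num_std = num_std  # assumed value, not specified in the text
        self.warmup = warmup    # number of ratios observed before clipping starts
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0           # running sum of squared deviations (Welford)

    def _update(self, ratio):
        self.count += 1
        delta = ratio - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (ratio - self.mean)

    def upper_bound(self):
        std = math.sqrt(self.m2 / self.count) if self.count > 0 else 0.0
        return self.mean + self.num_std * std

    def clip(self, ratio):
        clipped = ratio
        if self.count >= self.warmup:
            clipped = min(ratio, self.upper_bound())
        # Record the clipped value so one outlier does not inflate later bounds.
        self._update(clipped)
        return clipped


bound = RunningClipBound()
for ratio in (1.0, 1.1, 0.9, 1.0, 1.05):
    bound.clip(ratio)       # warm-up: values pass through unchanged
print(bound.clip(50.0))     # an extreme ratio is pulled down to the bound
```

Keeping one such object per layer would give a layerwise bound, at the cost of extra state for every layer.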

One of the main limitations of our approach is that the same clipping bound value is applied for all layers in the neural network. However, it has been observed in You et al. (2017) that the trust ratio values can vary significantly among different layers in the network. Therefore, the direction towards applying an adaptive trust ratio clipping should also take this into consideration and adopt a layerwise trust ratio clipping approach too.

Through our experiments, we have verified LAMBC on image classification tasks with the ImageNet and CIFAR-10 datasets. However, we were unable to investigate its effectiveness on large batch sizes (8K on CIFAR-10 and 1K on ImageNet) due to a lack of computational resources. It would be interesting to analyze the effects of clipping on both small and large batch training using LAMBC. Furthermore, the algorithm's effectiveness should also be tested and verified on a wider range of tasks, such as language modeling and neural machine translation. It is important that trust ratio clipping does not degrade LAMBC's generalization ability.

6 Conclusion

Large batch training is critical to accelerating the training of large deep neural networks. The existing approach for large batch training, the LAMB optimizer, features adaptive layerwise learning rates based on computing the trust ratio. The trust ratio explicitly compares the L2-norm of the layer weights to the L2-norm of the layer gradients, and uses this ratio as adaptive feedback to adjust the layerwise learning rate.

However, the trust ratio introduced by LAMB is still vulnerable to extreme values due to the increasing norm of the layer weights within neural networks. Unstable and extreme trust ratios can degrade the performance of the trained model. To solve this problem, we present a new variant of LAMB, called LAMBC, that clips the trust ratio to a predefined clipping bound value. Clipping constrains the trust ratios to a reasonable range of values, which prevents the weight update from exploding to huge values, while improving the final performance of the trained model by encouraging a reasonable rate of weight update.

We evaluated LAMBC on image classification tasks using different datasets, including CIFAR-10 and ImageNet. LAMBC achieves better performance than LAMB in all of the experiments, with better generalization ability and higher test accuracies. LAMBC also works effectively across the tested batch sizes, as well as across different clipping bound values. Although all the investigated clipping bound values improve performance compared to no clipping, the choice of clipping bound value remains paramount to the success of trust ratio clipping. Therefore, training LAMBC with a suitable adaptive trust ratio clipping approach is an immediate direction for future work.

Acknowledgement

We would like to thank Yang You for his valuable input regarding potential adaptive trust ratio clipping methods. We also want to thank the National University of Singapore for computational resource support. We would like to acknowledge that this work is done for the course CS6285: Bridging Systems and Deep Learning.

References

  • S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan (2016) Youtube-8m: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675. Cited by: §1.
  • J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §1, item 3, §4.3.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
  • J. M. Ede and R. Beanland (2020) Adaptive learning rate clipping stabilizes learning. Machine Learning: Science and Technology 1 (1), pp. 015011. Cited by: §5.
  • P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §1, §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016a) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016b) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §2.
  • A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: Figure 1, Figure 2, §4.1, §4.2.
  • A. Krizhevsky (2014) One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997. Cited by: §1.
  • M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su (2014) Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 583–598. Cited by: §1.
  • L. Luo, Y. Xiong, Y. Liu, and X. Sun (2019) Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843. Cited by: §3, §5.
  • NVIDIA (2020) NVCaffe user guide :: nvidia deep learning frameworks documentation. Note: https://docs.nvidia.com/deeplearning/frameworks/caffe-user-guide/index.html#larc. Online; accessed 25 November 2020. Cited by: §1.
  • S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He (2019) Zero: memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054. Cited by: §1.
  • M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro (2019) Megatron-lm: training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053. Cited by: §1.
  • P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. (2020) Scalability in perception for autonomous driving: waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454. Cited by: §1.
  • Y. You, I. Gitman, and B. Ginsburg (2017) Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888. Cited by: §1, §2, §5.
  • Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C. Hsieh (2019) Large batch optimization for deep learning: training bert in 76 minutes. arXiv preprint arXiv:1904.00962. Cited by: §1, §2.2, §2.