1 Introduction
The recent trend towards large-scale datasets Deng et al. (2009); Abu-El-Haija et al. (2016); Sun et al. (2020) requires training large neural networks to learn effectively. However, such large networks demand more computational power and longer training time. For example, previous work took 29 hours to train ResNet-50, a state-of-the-art deep learning model, on 8 Tesla P100 GPUs He et al. (2016b). Many optimization techniques have therefore been proposed to accelerate the training of large deep neural networks. Some works focus on data-parallel optimization, where each global mini-batch of data is distributed among the workers Krizhevsky (2014); Goyal et al. (2017); Li et al. (2014), while others pursue model-parallel methods Shoeybi et al. (2019); Rajbhandari et al. (2019).
One prominent type of technique is large-batch optimization, whereby gradients are computed on large mini-batches in parallel. Such techniques have seen a resurgence recently due to advances in hardware capabilities, and previous works have shown that they can accelerate the training of large deep neural networks. For example, Goyal et al. (2017) successfully trained ResNet-50 in 1 hour on 256 GPUs using distributed Stochastic Gradient Descent (SGD) with a mini-batch size of 8K. However, such methods also underscore the need for adaptive learning rate mechanisms for large-batch training. To address this need, recent work implemented layerwise adaptive learning rates for large-batch training. The most successful ones are LARS You et al. (2017) and LAMB You et al. (2019), which calculate the trust ratio (the ratio of the L2-norm of the weights to the L2-norm of the gradients) of each layer in the network. LARS and LAMB have been shown to scale ResNet-50 and BERT Devlin et al. (2018) models up to a batch size of 32K without loss of accuracy, while drastically reducing the training time.
Though prevalent, such layerwise adaptive methods are observed to still suffer from unstable and extreme trust ratios, which degrade performance. This happens when the weight norm becomes too large compared to the gradient norm, resulting in possible divergence. To this end, we propose clipping the trust ratio to a range of values. Inspired by recent work NVIDIA (2020) that proposed trust ratio clipping on LARS, we propose a new variant of LAMB, called LAMBC, that clips the trust ratio for LAMB.
Contributions.
Our contributions in this paper are twofold: (1) we develop a new variant of LAMB, called LAMBC, that improves stability and performance over standard LAMB, and (2) we demonstrate the effectiveness of trust ratio clipping on image classification tasks with the ImageNet and CIFAR-10 datasets.
2 Background
Many neural networks can be trained using stochastic gradient-based methods, which follow the update rule
$$x_{t+1} = x_t + \eta_t u_t \tag{1}$$
where $\eta_t$ is the learning rate and $u_t$ is the update at time step $t$. The update $u_t$ differs between optimizers. For example, in SGD, $u_t = -g_t$, where $g_t$ is the stochastic gradient, while in Adam Kingma and Ba (2014), $u_t = -m_t / (\sqrt{v_t} + \epsilon)$, where $m_t$ and $v_t$ are the first and second moment estimates and $\epsilon$ is a small offset. One way to enable training with large batches is to adjust the learning rate (LR). However, the main obstacle for such a method is the instability of training with a high LR. Goyal et al. (2017) proposed LR warmup, which starts with a small LR and gradually increases it to the target. However, such methods require manual adjustment of the LR schedule (e.g. the rate of increase and the target LR in LR warmup). Furthermore, such methods are unable to maintain accuracy for batch sizes larger than 8K. These problems motivated the layerwise adaptive methods proposed by You et al. (2017, 2019).
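To make the manual knobs of LR warmup concrete, the following is a minimal sketch of a warmup schedule. The linear ramp, the function name `warmup_lr`, and the parameters `warmup_steps`, `target_lr` and `start_lr` are assumptions made for illustration, not the exact schedule used in the cited work.

```python
def warmup_lr(step, warmup_steps, target_lr, start_lr=0.0):
    """Linearly ramp the learning rate from start_lr to target_lr.

    Hypothetical helper: both warmup_steps and target_lr must be chosen
    by hand, which is the manual tuning burden described above.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr
    frac = step / warmup_steps
    return start_lr + frac * (target_lr - start_lr)


# Learning rate for the first seven steps of a five-step warmup.
schedule = [warmup_lr(s, warmup_steps=5, target_lr=0.1) for s in range(7)]
```

Both the ramp length and the target value have to be re-tuned whenever the batch size changes, which is the limitation that layerwise adaptive methods try to address.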
2.1 Layerwise Adaptive Methods
In layerwise adaptive methods, the general strategy is to perform layerwise normalization, where each layer's update is normalized to unit L2-norm. This is performed in the form $g_t^{(i)} / \|g_t^{(i)}\|$, where $(i)$ refers to the $i$-th layer. Similarly, the learning rate is also scaled layerwise by $\phi(\|x_t^{(i)}\|)$ for some function $\phi$. Thus, the modifications result in the following weight update rule:
$$x_{t+1}^{(i)} = x_t^{(i)} - \eta_t \frac{\phi(\|x_t^{(i)}\|)}{\|g_t^{(i)}\|} g_t^{(i)} \tag{2}$$
where $g_t^{(i)}$ are the gradients of the $i$-th layer. For LARS, $g_t^{(i)} = m_t^{(i)}$, where $m_t^{(i)}$ is the first moment. For LAMB, $g_t^{(i)} = m_t^{(i)} / (\sqrt{v_t^{(i)}} + \epsilon)$, where $v_t^{(i)}$ and $\epsilon$ are the second moment and a small offset respectively. Eq. 2 introduces a new term, $\phi(\|x_t^{(i)}\|) / \|g_t^{(i)}\|$, which is called the trust ratio. The trust ratio is essentially the ratio of the L2-norm of the weights to the L2-norm of the gradients. Intuitively, this offers a major benefit for large-batch training. Such a normalization provides robustness to exploding and vanishing gradients, since the trust ratio explicitly compares the magnitudes of the weights and the gradients for each layer. Exploding gradients occur when the gradients are significantly larger than the weights. In that case the trust ratio becomes small and lowers the LR, reducing the chance of divergence. The converse happens for vanishing gradients: the trust ratio becomes large and raises the LR.
2.2 LAMB
The LAMB algorithm is an instantiation of the layerwise adaptive strategy with the normalization modification performed on the Adam optimizer. In LAMB, there are two normalizations. The first occurs when the first moment $m_t$ is normalized by the square root of the second moment, $\sqrt{v_t}$, providing adaptivity for each weight. The second occurs layerwise when computing the trust ratio. Despite having two normalizations, the authors of LAMB You et al. (2019) provided convergence guarantees for the algorithm. Algorithm 1 shows the pseudocode for LAMB.
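The two normalizations can be sketched for a single layer as follows. This is a minimal illustration and not the authors' Algorithm 1: it assumes Adam-style bias correction, an identity scaling function for the weight norm, decoupled weight decay, and a fallback trust ratio of 1 when either norm is zero. The names `lamb_step` and `l2_norm` and all default hyperparameters are hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def lamb_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.0):
    """One LAMB-style update for a single layer (illustrative sketch).

    Returns (new_w, new_m, new_v, trust_ratio). The step counter t starts at 1.
    """
    # Adam-style first and second moment estimates.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Bias correction (an assumption of this sketch).
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # First normalization: per-weight, first moment over sqrt of second moment.
    update = [mh / (math.sqrt(vh) + eps) + weight_decay * wi
              for mh, vh, wi in zip(m_hat, v_hat, w)]
    # Second normalization: layerwise trust ratio ||w|| / ||update||.
    w_norm, u_norm = l2_norm(w), l2_norm(update)
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    new_w = [wi - lr * trust_ratio * ui for wi, ui in zip(w, update)]
    return new_w, m, v, trust_ratio


# Toy layer with two weights.
w0 = [3.0, 4.0]
new_w, m1, v1, ratio = lamb_step(w0, [0.1, -0.2], [0.0, 0.0], [0.0, 0.0], t=1)
```

With no weight decay, the norm of the applied step equals `lr * ||w||` regardless of the gradient scale, which is the layerwise normalization described above.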
3 Methodology
Layerwise adaptive methods, such as LAMB, are observed to suffer from unstable and extreme trust ratios, which degrade performance. This happens when the weight norm becomes too large compared to the gradient norm, leading to divergence during training. To solve this problem, we apply a clipping operation to the trust ratio (the ratio between the L2-norm of the weights and the L2-norm of the per-layer gradients), constraining it to lie between predefined lower and upper bounds. Specifically, at any time $t$,
$$r_t^{(i)} = \min\left(\max\left(\frac{\phi(\|x_t^{(i)}\|)}{\|g_t^{(i)}\|},\ \mu_l\right),\ \mu_u\right) \tag{3}$$
where $\mu_l$ and $\mu_u$ are the lower and upper bounds for all layers, while $x_t^{(i)}$ and $g_t^{(i)}$ are the weights and gradients for layer $i$ respectively. In our implementation, we fix the lower bound $\mu_l$ to a constant. Clipping the trust ratio prevents the weight update from exploding to huge values. As such, the degraded performance or training divergence caused by extreme and unstable trust ratios can be alleviated.
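The clipping operation can be sketched as a small helper. This is an illustrative sketch rather than the authors' implementation: the name `clipped_trust_ratio`, the identity scaling of the weight norm, the fallback ratio of 1 for zero norms, and the bound values in the usage lines are all assumptions.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def clipped_trust_ratio(weights, update, lower, upper):
    """Trust ratio ||weights|| / ||update||, clipped to [lower, upper].

    Hypothetical helper: `update` stands for the per-layer gradient term,
    and the weight norm is used unscaled.
    """
    w_norm, u_norm = l2_norm(weights), l2_norm(update)
    # Fall back to a neutral ratio when either norm is zero (an assumption).
    ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    return min(max(ratio, lower), upper)


# Large weights with a tiny gradient give a raw ratio of 500, clipped to 10.
extreme = clipped_trust_ratio([3.0, 4.0], [0.01, 0.0], lower=0.0, upper=10.0)
# A ratio already inside the bounds (0.5) passes through unchanged.
moderate = clipped_trust_ratio([0.3, 0.4], [1.0, 0.0], lower=0.0, upper=10.0)
```

In a LAMB-style step, the clipped value would replace the raw trust ratio that multiplies the learning rate for that layer.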
The upper and lower bounds in the clipping operation in Eq. 3 are hyperparameters that are manually tuned and held constant for all layers. To improve the flexibility of training, one may consider using adaptive methods to decide the upper and lower bounds for different layers during training. In our preliminary experiments, we tried an adaptive method from Luo et al. (2019) to investigate this possibility. However, our empirical results show that this adaptive method does not perform as well as manually defined bounds, although it still improves over no clipping. We postulate that the dynamics of the upper and lower bound functions in Luo et al. (2019) do not fit the evolution of the trust ratio values during training, so that the trust ratio is sometimes clipped when it should not be.
4 Experiments
In this section, we describe the experiments conducted to validate the performance of trust ratio clipping. Our experiments aim to answer the following questions:
1. Can trust ratio clipping improve generalization and test performance?
2. If trust ratio clipping works, what is the best or recommended clipping value to adopt?
3. Does trust ratio clipping work on a more complex image classification task such as ImageNet Deng et al. (2009)?
Following these three questions, we divide the experiment section into three parts, each with experiments that address the respective research question in detail.
4.1 Image classification on CIFAR-10
In this section, we aim to find out whether applying trust ratio clipping improves test performance compared with no clipping. We test our hypothesis on the CIFAR-10 dataset Krizhevsky et al. (2009), which contains 60,000 32x32 colour images in 10 classes. Due to limited computational resources, we choose ResNet-18 He et al. (2016a) as our neural network backbone. We set the learning rate to 1e-2 and the number of epochs to 80, and compare the performance with and without trust ratio clipping on batch sizes ranging from 1000 to 3000. When trust ratio clipping is enabled, we clip the trust ratio to be at most 1.
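Table 1: Test accuracy (%) on CIFAR-10 with and without trust ratio clipping.

| Batch size | 1000 | 2000 | 3000 |
|---|---|---|---|
| Test accuracy (clip) | 87.71 | 87.30 | 86.29 |
| Test accuracy (no clip) | 85.68 | 86.61 | 85.41 |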
Figure 1 and Table 1 show the evaluation results on the CIFAR-10 dataset. All three batch sizes show an improvement in final test performance from trust ratio clipping. The improvement is largest at batch size 1000, at about 2 percentage points of test accuracy, and is about 0.7 to 0.9 percentage points for the other two batch sizes. Therefore, we conclude that trust ratio clipping can improve generalization and test performance.
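The per-batch-size gains quoted above follow directly from the Table 1 accuracies. The short script below simply recomputes them; the variable names are illustrative.

```python
# Test accuracies (%) copied from Table 1.
batch_sizes = [1000, 2000, 3000]
acc_clip = [87.71, 87.30, 86.29]
acc_no_clip = [85.68, 86.61, 85.41]

# Gain from clipping, in percentage points, rounded to two decimals.
gains = [round(c - n, 2) for c, n in zip(acc_clip, acc_no_clip)]

for size, gain in zip(batch_sizes, gains):
    print(f"batch size {size}: +{gain:.2f} percentage points")
# batch size 1000: +2.03 percentage points
# batch size 2000: +0.69 percentage points
# batch size 3000: +0.88 percentage points
```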
4.2 Selecting the suitable trust ratio clipping bound
Given the success of trust ratio clipping in the previous experiments, we next ask what the best or recommended clipping value is. We conduct another set of experiments on the CIFAR-10 dataset Krizhevsky et al. (2009) with different upper bound values for trust ratio clipping. We test four upper bound values: 1, 3, 5 and 10.
Figure 2 shows the evaluation results. Interestingly, all the conditions with trust ratio clipping outperform the no-clipping setup, and the final test performance decreases as the maximum clipping value increases. From the figure, clipping with a maximum value of 1 is the best, 10 is the worst, and 3 and 5 sit in between. Since the trust ratio reflects the ratio between the magnitudes of the neural network weights and gradients, a possible explanation is that drastic gradient updates on a stabilized weight parameter may jeopardize generalization: once the scale of the weights has stabilized, drastic changes (with trust ratio > 1) to the weights carry a higher risk of degrading generalization performance. This observation is also consistent with the first experiments in Figure 1, where models with clipping only start to surpass models without clipping at a later stage, after the scale of the weights has stabilized.
An insight from these results is that the maximum trust ratio clipping value could be adjusted dynamically. At the beginning of training, larger trust ratios should be allowed, but they should be avoided after the weight scale stabilizes, e.g. in the converging phase, to reduce the risk of degrading generalization performance.
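One way such a dynamic adjustment could look is a schedule that relaxes the upper bound early in training and tightens it later. The sketch below is purely hypothetical and was not evaluated in our experiments: the linear decay, the function name `upper_bound_schedule`, and the `decay_fraction` parameter are assumptions, while the endpoints 10 and 1 are the largest and smallest bounds tested above.

```python
def upper_bound_schedule(epoch, total_epochs, start_bound=10.0,
                         end_bound=1.0, decay_fraction=0.5):
    """Linearly decay the trust ratio upper bound, then hold it constant.

    Hypothetical schedule: the bound moves from start_bound to end_bound
    over the first decay_fraction of training and stays at end_bound after.
    """
    decay_epochs = max(1, int(total_epochs * decay_fraction))
    if epoch >= decay_epochs:
        return end_bound
    frac = epoch / decay_epochs
    return start_bound + frac * (end_bound - start_bound)


# Upper bound at a few epochs of an 80-epoch run.
bounds_over_time = [upper_bound_schedule(e, 80) for e in (0, 20, 40, 79)]
```

The point at which the weight scale stabilizes is not known in advance, so `decay_fraction` would itself need tuning or a stabilization criterion.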
4.3 Image classification on ImageNet
In this section, we aim to find out whether trust ratio clipping works on more complex image classification tasks such as ImageNet Deng et al. (2009). ImageNet consists of 14 million real images with a total of 1000 classes. The original dataset occupies about 500 GB of disk space, which exceeds our computational resources, so we conduct our experiments on the downsampled ImageNet dataset with 64x64x3 images. The batch size is also limited to 400.
The final result is shown in Figure 3. Even on the more complex ImageNet classification task, our proposed trust ratio clipping still helps generalization performance and outperforms the model without trust ratio clipping. Although our model seems to overfit the training data, with a much higher accuracy on the training set, our objective is not to achieve the best absolute test performance but to show the effectiveness of trust ratio clipping relative to the model without it.
5 Future Work
As observed from the empirical results, it is crucial to define a good clipping bound for the task. Ideally, the clipping bound should be tuned as accurately as possible, so that each weight update is significant while its magnitude remains small enough to be controllable. This requires going beyond manual specification of clipping bound values and exploring adaptive methods for the clipping bounds. In our preliminary experiments, we attempted the technique from Luo et al. (2019) and provided an analysis of its possible subpar performance compared with manual clipping. Following this line of thought, we suggest other possible methods for trust ratio adaptivity. Inspired by Ede and Beanland (2020), a possible approach is to maintain running statistics of the trust ratio and set the clipping bound a chosen number of standard deviations away from the mean.
One of the main limitations of our approach is that the same clipping bound value is applied to all layers in the neural network. However, it has been observed in You et al. (2017) that the trust ratio values can vary significantly among different layers in the network. Therefore, an adaptive trust ratio clipping method should also take this into consideration and clip the trust ratio layerwise.
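As a rough sketch of how the standard-deviation-based bound could be kept separately for each layer, the class below tracks an exponential moving mean and variance of one layer's trust ratio. This is a hypothetical illustration of the future direction and not a method from Ede and Beanland (2020) or from our experiments; the class name `RunningTrustRatioBound`, the exponential moving statistics, the warmup period, and the default values are all assumptions.

```python
import math


class RunningTrustRatioBound:
    """Adaptive clipping bound for one layer's trust ratio (illustrative)."""

    def __init__(self, num_std=2.0, decay=0.9, warmup=5):
        self.num_std = num_std  # bound width in standard deviations
        self.decay = decay      # smoothing factor of the moving statistics
        self.warmup = warmup    # number of ratios seen before clipping starts
        self.count = 0
        self.mean = 0.0
        self.var = 0.0

    def clip_and_update(self, ratio):
        """Clip `ratio` with the current statistics, then update them."""
        clipped = ratio
        if self.count >= self.warmup:
            std = math.sqrt(self.var)
            lower = max(0.0, self.mean - self.num_std * std)
            upper = self.mean + self.num_std * std
            clipped = min(max(ratio, lower), upper)
        # Statistics are updated with the raw ratio (a design choice of
        # this sketch), so the bounds can follow genuine drifts over time.
        if self.count == 0:
            self.mean = ratio
        else:
            delta = ratio - self.mean
            self.mean += (1.0 - self.decay) * delta
            self.var = self.decay * (self.var + (1.0 - self.decay) * delta * delta)
        self.count += 1
        return clipped


# One tracker per layer; the layer names are placeholders.
bounds = {"conv1": RunningTrustRatioBound(), "fc": RunningTrustRatioBound()}
for r in [1.0, 1.1, 0.9, 1.0, 1.1, 0.9, 1.0, 1.1]:
    bounds["conv1"].clip_and_update(r)
spike = bounds["conv1"].clip_and_update(50.0)  # an outlier gets pulled back
```

Because each layer owns its tracker, layers with very different trust ratio scales receive different bounds, which addresses the limitation described above.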
Through our experiments, we have verified LAMBC on image classification tasks with the ImageNet and CIFAR-10 datasets. However, we were unable to investigate its effectiveness on large batch sizes (on the order of 8K for CIFAR-10 and 1K for ImageNet) due to a lack of computational resources. It would be interesting to analyze the effects of clipping on both small- and large-batch training using LAMBC. Furthermore, the algorithm's effectiveness should be tested and verified on a wider range of tasks, such as language modeling and neural machine translation. It is important that trust ratio clipping does not degrade LAMBC's generalization ability.
6 Conclusion
Large-batch training is critical to accelerating the training of large deep neural networks. An existing approach for large-batch training, the LAMB optimizer, features adaptive layerwise learning rates based on computing the trust ratio. The trust ratio explicitly compares the L2-norm of the layer weights with the L2-norm of the layer gradients, and this ratio is used as adaptive feedback to adjust the layerwise learning rate.
However, the trust ratio introduced by LAMB is still vulnerable to extreme values caused by the increasing norm of the layer weights. An unstable and extreme trust ratio can degrade the performance of the trained model. To solve this problem, we present a new variant of LAMB, called LAMBC, that clips the trust ratio to a predefined clipping bound. Clipping constrains the trust ratio to a reasonable range of values, which prevents the weight update from exploding to huge values while improving the final performance of the trained model by encouraging a reasonable rate of weight update.
We evaluated LAMBC on image classification tasks using different datasets, including CIFAR-10 and ImageNet. LAMBC achieves better performance than LAMB in all of the experiments, with better generalization ability and higher test accuracy. LAMBC also works effectively across the batch sizes we tested, as well as across different clipping bound values. Although all the investigated clipping bound values improve performance compared with no clipping, the selection of the clipping bound value is still paramount to the success of trust ratio clipping. Therefore, training LAMBC with a suitable adaptive trust ratio clipping approach is an immediate direction for future work.
Acknowledgement
We would like to thank Yang You for his valuable input regarding potential adaptive trust ratio clipping methods. We also want to thank the National University of Singapore for computational resource support. We would like to acknowledge that this work is done for the course CS6285: Bridging Systems and Deep Learning.
References
Abu-El-Haija et al. (2016). YouTube-8M: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675.
Deng et al. (2009). ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
Devlin et al. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Ede and Beanland (2020). Adaptive learning rate clipping stabilizes learning. Machine Learning: Science and Technology 1 (1), pp. 015011.
Goyal et al. (2017). Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
He et al. (2016a). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
He et al. (2016b). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Krizhevsky et al. (2009). Learning multiple layers of features from tiny images.
Krizhevsky (2014). One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997.
Li et al. (2014). Scaling distributed machine learning with the parameter server. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pp. 583–598.
Luo et al. (2019). Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843.
NVIDIA (2020). NVCaffe user guide :: NVIDIA deep learning frameworks documentation. https://docs.nvidia.com/deeplearning/frameworks/caffeuserguide/index.html#larc (online; accessed 25 November 2020).
Rajbhandari et al. (2019). ZeRO: memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054.
Shoeybi et al. (2019). Megatron-LM: training multi-billion parameter language models using GPU model parallelism. arXiv preprint arXiv:1909.08053.
Sun et al. (2020). Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2446–2454.
You et al. (2017). Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.
You et al. (2019). Large batch optimization for deep learning: training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.