Large-batch optimization is becoming an important research topic, as it is an efficient approach for current distributed deep learning systems. It has enabled researchers to reduce ImageNet/ResNet-50 training from 29 hours to around 1 minute (Table 1). Researchers have also reduced BERT pre-training time from 3 days to 76 minutes you2019large , 67 minutes shoeybi2019megatron , 47 minutes nvidia2019bert , and 44 minutes rajbhandari2019zero . The speedup comes from the fact that researchers can scale the training to a larger batch size, and thus exploit larger-scale parallelism, without losing accuracy in a fixed number of epochs. For example, researchers scaled the batch size of ImageNet training from 256 he2016deep to 1K krizhevsky2014one , 5K li2017scaling , 8K goyal2017accurate , 16K you2017imagenet , 32K ying2018image , and 64K kumar2019scale . Hardware designers are also interested in large-batch training, as chip makers can no longer improve a single processor's clock frequency due to power consumption issues asanovic2009view . Thus, hardware vendors have to use more processors to increase raw floating-point performance, which requires high parallelism from future algorithms. Many successful use cases of large-batch training have encouraged industry to design and implement AI supercomputers like the Google TPU Pod, NVIDIA SuperPOD, and Huawei Atlas 900 AI Cluster to take advantage of the extremely high parallelism provided by large-batch algorithms.
|Work||Date||Top-1 Accuracy||Time||Optimizer|
|He et al. he2016deep||12/10/2015||75.3%||29h||Momentum (Rumelhart et al. rumelhart1986learning )|
|Goyal et al. goyal2017accurate||06/08/2017||76.3%||65m||Momentum (Rumelhart et al. rumelhart1986learning )|
|You et al. you2017imagenet||11/02/2017||75.3%||48m||LARS (You et al. you2017scaling )|
|You et al. you2017imagenet||11/07/2017||75.3%||31m||LARS (You et al. you2017scaling )|
|Akiba et al. akiba2017extremely||11/12/2017||74.9%||15m||RMSprop (Hinton tieleman2012lecture )|
|You et al. you2017imagenet||12/07/2017||74.9%||14m||LARS (You et al. you2017scaling )|
|Jia et al. jia2018highly||07/30/2018||75.8%||6.6m||LARS (You et al. you2017scaling )|
|Mikami et al. mikami2018imagenet||11/14/2018||75.0%||3.7m||LARS (You et al. you2017scaling )|
|Ying et al. ying2018image||11/16/2018||76.3%||2.2m||LARS (You et al. you2017scaling )|
|Yamazaki et al. yamazaki2019yet||03/29/2019||75.1%||1.25m||LARS (You et al. you2017scaling )|
|Kumar et al. kumar2019scale||10/02/2019||75.9%||67.1s||LARS (You et al. you2017scaling )|
In this paper, we study the limit of the batch size and hope to provide guidance to AI supercomputer and algorithm designers. To the best of our knowledge, no previous work has studied a batch size over 128K. We pick realistic applications like ImageNet/ResNet-50 and provide detailed numerical optimization instructions for step-by-step accuracy comparison. Moreover, it is important to understand the generalization and optimization difficulties encountered when the batch size is huge. There is a series of works hoffer2017train keskar2016large smith2017don on this topic. Specifically, Hoffer et al. hoffer2017train introduced the “ultra-slow diffusion” theory to large-batch training. They pointed out that there is an “ultra-slow” logarithmic increase in the distance of the weights from their initialization. Their paper indicated that, under a series of optimization techniques, the generalization performance can be controlled by the number of iterations. However, our experiments show contradictory results once the batch size is increased beyond a certain boundary, which we refer to as the “huge-batch” regime. We provide comprehensive experimental results and detailed analysis to study the limitations of batch size scaling and the “ultra-slow diffusion” theory.
Concretely, we scale the batch size of ImageNet/ResNet-50 to 819K and 1.28 million, an order of magnitude larger than in any previous work. This is the first work to report an accuracy for huge/full-batch ImageNet/ResNet-50 training. We also scale the batch size to the full dataset for MNIST, CIFAR-10, and ImageNet. The contributions of our paper can be summarized as follows:
For the first time, we scale the batch size on ImageNet to at least a magnitude larger than all previous works, and provide detailed studies of the performance of many state-of-the-art optimization schemes in this setting. We propose an optimization recipe that improves the top-1 test accuracy by 18% compared to the baseline.
We identify a “huge-batch” regime for realistic optimization tasks (e.g. ImageNet/ResNet-50), where the optimization process becomes intrinsically harder and we cannot reach the target test accuracy in polynomial time by applying any of the current optimization techniques. The “ultra-slow diffusion” theory no longer holds in this regime.
Our results help system researchers and designers to understand the limit of parallelism in realistic machine learning tasks and design future machine learning hardware accordingly.
2 Background and Related Work
For any vector $x$, either $x_i$ or $[x]_i$ is used to denote its $i$-th coordinate, where $i \in [d]$. For any function $f$, we use $\nabla f(x)$ to denote the gradient with respect to $x$. We use $\|x\|$ to denote the $\ell_2$-norm of a vector $x$. We study nonconvex stochastic optimization problems of the form
$$\min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{s \sim \mathbb{P}}\left[\ell(x, s)\right], \qquad (1)$$
where $\ell$ is a smooth (possibly nonconvex) function and $\mathbb{P}$ is a probability distribution on the domain. Here, $x$ corresponds to the model parameters, $\ell$ is the loss function, and $\mathbb{P}$ is an unknown data distribution. Stochastic gradient descent (SGD) is a method for solving the problem in Equation (1). The update at the $t$-th iteration of SGD is of the following form:
$$x_{t+1} = x_t - \eta_t \frac{1}{|\mathcal{S}_t|} \sum_{s \in \mathcal{S}_t} \nabla \ell(x_t, s), \qquad (2)$$
where $\mathcal{S}_t$ is a set of random samples drawn from the distribution $\mathbb{P}$. Several SGD variants (e.g. duchi2011adaptive kingma2014adam tieleman2012lecture ) have been proposed in recent years. We assume the batch size of our baseline is $B_s$, and that our baseline can achieve a testing accuracy of $A$ with a validation loss of $L$ using a learning rate of $\eta$ in $E$ epochs. We define three optimization regimes with batch sizes $B_l$, $B_h$, and $B_f$:
Large Batch: If we use a batch size $B_l$ ($B_l > B_s$) and a re-tuned learning rate, then we can still get a testing accuracy of $A$ (or higher) and a validation loss of $L$ (or lower) in $E$ epochs. We say a batch size is a large batch if $B_s < B \le B_l$ (examples in Table 2).
Huge Batch: If we use a batch size $B_h$ ($B_h > B_l$) with currently available optimization techniques, then we can not get a testing accuracy of $A$ in $E$ epochs. We say a batch size is a huge batch if $B_l < B < B_f$ (examples in Table 2).
Full Batch: We define a positive integer $B_f$, which is equal to the total number of available training samples at runtime (examples in Table 2).
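The SGD update in Equation (2) can be sketched in a few lines of NumPy. The toy least-squares problem, the `grad` closure, and all constants below are illustrative stand-ins, not part of our experimental setup:

```python
import numpy as np

def sgd_step(x, grad_fn, data, batch_size, lr, rng):
    """One mini-batch SGD step: average per-sample gradients over a random batch S_t."""
    # Draw a random mini-batch S_t from the training set.
    idx = rng.choice(len(data), size=batch_size, replace=False)
    # Average the per-sample gradients over the mini-batch, then step.
    g = np.mean([grad_fn(x, data[i]) for i in idx], axis=0)
    return x - lr * g

# Toy example: least-squares loss 0.5 * (w . a - y)^2 on random data.
rng = np.random.default_rng(0)
data = [(rng.standard_normal(3), rng.standard_normal()) for _ in range(64)]
grad = lambda w, s: (w @ s[0] - s[1]) * s[0]   # gradient of the per-sample loss
w = np.zeros(3)
for _ in range(100):
    w = sgd_step(w, grad, data, batch_size=8, lr=0.1, rng=rng)
```

Scaling `batch_size` here from 8 toward `len(data)` is exactly the large-batch-to-full-batch transition studied in this paper.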
Keskar et al. keskar2016large reported that traditional first-order optimization techniques fail to scale the batch size to a very large number: large-batch training hurts the testing accuracy. They indicated that there is an inherent generalization gap in large-batch optimization within a fixed number of epochs. Hoffer et al. hoffer2017train suggested that training longer can close the generalization gap. Since then, researchers have developed a series of optimization techniques and scaled the batch size to 64K without losing accuracy for various applications within a fixed number of epochs (Table 1). There is also other related literature anil2019memory ginsburg2019stochastic gupta2020stochastic golmant2018computational krizhevsky2014one lin2018don liu2019variance ma2019inefficiency osawa2019large shallue2018measuring smith2017don . Inspired by these works, we focus on a different direction. As far as we know, there is no previous work studying huge-batch optimization (e.g. over 128K). Our investigation includes: (1) pushing the batch size to the limit and maximizing the accuracy in a fixed number of epochs; (2) studying the optimization problems of huge-batch and full-batch training; (3) studying the relationship between generalization and optimization.
In this section, we use a series of techniques to maximize the accuracy in huge/full-batch training. The information about our applications is in Table 2. Due to lack of space, some experimental results are in the Appendix. The first step of our study is to identify the “huge-batch” regime, for which we conduct experiments with extremely large batch sizes. We detail the process of finding the “huge-batch” boundaries in Table 2 and provide more results on how accuracy drops with batch size in the Appendix.
|Dataset||Model||Implementation||Baseline Batch||Epochs||Top-1 Accuracy||Large Batch||Huge Batch||Full Batch|
|ImageNet||ResNet-50 v1.5 he2016deep||resnet2019mlperf||256||90||75.9%||[2K, 64K]||(64K, 1.28M)||1.28M|
|MNIST||LeNet lecun2015lenet||lenet2019google||256||30||99.2%||[2K, 8K]||(8K, 60K)||60K|
|CIFAR-10||ResNet-50 v1 he2016deep||resnet2020cifar||128||200||93.9%||[2K, 25K]||(25K, 50K)||50K|
3.1 Huge Batch
Researchers have been able to scale the batch size of ImageNet/ResNet-50 to 64K and achieve 75.9% top-1 accuracy in 90 epochs kumar2019scale . As far as we know, no one has reported that the batch size of ImageNet/ResNet-50 can be scaled beyond 64K without losing accuracy. The baseline of our implementation uses the Momentum SGD optimizer. For a batch size of 819200, the training diverges and only achieves a top-1 accuracy of 0.0977%. We train it for 141 steps, which corresponds to the same number of epochs as the small-batch baseline. We investigate the effectiveness of existing large-batch optimization techniques in the huge-batch regime, starting from simple Momentum SGD and moving to more advanced techniques proposed by recent works pereyra2017regularizing ; kumar2019scale ; ying2018image ; goyal2017accurate ; hoffer2017train :
Optimization 0: this is our baseline; we use the Momentum SGD optimizer with linear learning rate scaling (or sqrt scaling if it is better) krizhevsky2014one , cosine learning rate decay loshchilov2016sgdr (see Appendix), gradient clipping pascanu2013difficulty , and ghost batch normalization hoffer2017train . The top-1 accuracy is 0.0977%.
Optimization 1: we add learning rate warmup goyal2017accurate , which allows us to use linear learning rate scaling without diverging. We improved the top-1 accuracy from 0.0977% to 2.712%.
Optimization 2: we tune the learning rate and the number of warmup epochs with an auto-tuner, but we found that grid search achieves roughly the same accuracy as the auto-tuner within a reasonable computing budget. We improved the top-1 accuracy from 2.712% to 4.563%.
Optimization 3: we add label smoothing pereyra2017regularizing , a useful technique in large-batch training. The accuracy changed from 4.563% to 4.226%, neither a significant increase nor a significant decrease.
Optimization 4: we switch the optimizer from Momentum SGD to LARS. This increases the accuracy from 4.226% to 18.95%.
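The learning-rate schedule shared by the recipes above combines the linear scaling rule with a linear warmup phase and cosine decay. A minimal sketch follows; the constants (base learning rate, warmup length) are illustrative, not the exact values from our runs:

```python
import math

def lr_schedule(step, total_steps, base_lr=0.1, base_batch=256,
                batch_size=819200, warmup_steps=25):
    """Linearly scaled peak LR, linear warmup, then cosine decay to zero."""
    peak_lr = base_lr * batch_size / base_batch  # linear scaling rule
    if step < warmup_steps:
        # Linear warmup from near 0 to peak_lr avoids early divergence.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * progress))
```

Note how large the linearly scaled peak becomes in the huge-batch regime (here 0.1 × 819200/256 = 320), which is why warmup is essential before any other technique helps.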
The results are summarized in Figure 1. Since LARS gives us the largest boost in accuracy, we further study its sensitivity to hyperparameters in Figure 2, varying the two most important ones: the learning rate and the number of warmup epochs. From Figure 2, we can observe that the advantage of the layer-wise adaptive technique provided by the LARS optimizer is maintained over a wide range of hyperparameters, while Momentum SGD fails to obtain good accuracy regardless of hyperparameter tuning. We conclude that LARS can make a difference in huge-batch learning.
We also study an application with relatively small number of training samples. We pick MNIST with LeNet. We partition the dataset as 55K/5K/10K samples for training/validation/testing111We partition the dataset as 60K/10K samples for training/testing in Table 2. By using Momentum SGD, the baseline can achieve 99.2% accuracy in 30 epochs with a batch size of 256. As mentioned in Table 2, 8K is the boundary between the large batch region and the huge batch region. When we scale the batch size to 8K, we find the baseline (Momentum SGD) diverges in the training. So we tried other state-of-the-art optimizers like AdaGrad, RMSprop, Adam and LARS. AdaGrad and RMSprop can not achieve an accuracy over 99%. Adam has a slight accuracy loss. However, LARS achieved a better accuracy than the baseline. We want to find the reason behind this. We find one key feature of LARS is the layer-wise adaptive learning technique. So we add layer-wise technique to all the other optimizers. After that, SGD and AdaGrad can get the same accuracy as the baseline. However, Adam becomes very unstable and RMSprop diverges in the middle of the training. We find the reason is that the trust ratio222In the simplest form (for SGD), trust ratio is the ratio between the L2 norm of the weights and the L2 norm of the gradients. of layer-wise learning is either too large or too small for Adam and RMSprop. So we add a bound to the trust ratio for Adam optimizer, which is a similar idea as AdaBound luo2019adaptive . By doing so, it can also achieve 99.4% accuracy in 30 epochs with a batch size of 8K, which is the same as LARS. It is worth noting that Adam + layer-wise + bound-ratio is a special case of LAMB optimizer you2019large . The results are shown in Table 3. Figure 3 shows all the training details of LARS optimizer.
|Solver||batch=256||Original||+ layer-wise||+ layer-wise and bound-ratio|
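The layer-wise trust ratio and the bound on it discussed above can be sketched as follows. This is a simplified SGD-style variant of the layer-wise adaptation, not the full LARS/LAMB implementation (which also handles momentum/Adam statistics and weight decay); the clip bounds are illustrative:

```python
import numpy as np

def layerwise_step(weights, grads, lr, bound=(0.1, 10.0)):
    """Apply a layer-wise trust ratio ||w|| / ||g|| to each layer's update.

    `bound` clips the trust ratio, similar in spirit to AdaBound; pass
    bound=None to recover plain (unbounded) layer-wise scaling as in LARS.
    """
    new_weights = []
    for w, g in zip(weights, grads):
        w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
        # Trust ratio: how large this layer's update may be relative to its weights.
        ratio = w_norm / g_norm if w_norm > 0 and g_norm > 0 else 1.0
        if bound is not None:
            ratio = float(np.clip(ratio, *bound))
        new_weights.append(w - lr * ratio * g)
    return new_weights
```

The bound is what stabilizes Adam/RMSprop here: when a layer's gradients are tiny, the raw ratio explodes, and clipping it keeps the update proportionate.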
Since LARS with learning rate warmup and polynomial decay gave us the best performance for large-batch MNIST training, we use this scheme for huge-batch MNIST training. We increase the batch size from 8K to 32K. However, the testing error rate increased from 0.8% to 4.2% (Figure 4). With other optimizers, the error rate is even higher (8%). In Figure 4, we can observe that huge-batch training may lead the optimizer down a wrong path, leaving the model poorly optimized: in the middle of training, even though the training error rate and validation error rate are decreasing, the training loss is increasing. Thus, in this situation, huge-batch training suffers not only a generalization problem but also an optimization problem. After trying various optimization techniques, we find that only the LAMB optimizer with extremely long learning rate warmup and polynomial learning rate decay can stabilize the training and generalize well (Figure 5). This scheme achieves 98.7% testing accuracy. However, even with this scheme, we can not reach the target testing accuracy of the baseline; huge-batch training may have an inherent generalization problem that cannot be resolved by any existing techniques. Huge-batch ImageNet/ResNet-50 training has a similar problem, but much worse than MNIST/LeNet: the target top-1 accuracy is 75.9% in 90 epochs, yet a huge batch of 819200 only achieves an accuracy below 20%. We conclude that huge-batch training suffers generalization problems, and sometimes also optimization problems, regardless of which current first-order optimization techniques are used.
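The long-warmup-plus-polynomial-decay schedule described above can be sketched in the same style as the cosine schedule; the warmup fraction and decay power below are illustrative placeholders, not our tuned values:

```python
def poly_lr(step, total_steps, peak_lr, warmup_frac=0.25, power=2.0):
    """Linear warmup for a large fraction of training, then polynomial decay to zero."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # "Extremely long" warmup: a sizable fraction of all training steps.
        return peak_lr * (step + 1) / warmup_steps
    remaining = 1.0 - (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * remaining ** power
```

The design choice is the long warmup: with a huge batch there are very few total steps, so spending a large fraction of them ramping up the learning rate is what keeps the early updates from destabilizing training.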
3.2 Full Batch
In this section, we increase the batch size from the huge-batch region to the full-batch case. For ImageNet/ResNet-50, we increase the batch size from 819200 to 1.28 million, the first such demonstration in the literature. For MNIST/LeNet, we partition the dataset as 60K/10K for training/testing and increase the batch size from 32768 to 60K. We used the optimization techniques of Section 3.1. Even though we increase the batch size by less than a factor of two, the accuracy becomes much lower. The results are shown in Figures 6 and 7. We can observe that LARS also makes a significant difference in full-batch training. However, we can not reach the target testing accuracy in 90 epochs. Momentum SGD is our baseline optimizer; LARS is the optimizer achieving the best accuracy for us. For MNIST, full-batch training can not reach the 99.2% target accuracy (results in the Appendix). LARS achieves a higher accuracy than Momentum SGD, and we can also observe that LARS is more stable than Momentum SGD for full-batch training.
4.1 Train longer, generalize better?
We focus on ImageNet/ResNet-50 training in Sections 4.1 and 4.2. Our experiments indicate that huge/full-batch training suffers a serious generalization problem. Hoffer et al. hoffer2017train suggest that training longer leads to better generalization for large-batch training. We want to study the validity of this suggestion for huge/full-batch training. A large batch size of 32K is able to reach a testing accuracy of 76.7% in just 90 epochs. Huge-batch training (batch size = 819200) only achieves 19% accuracy in 90 epochs. By increasing the number of epochs to 1200, we can reach around 70% accuracy. However, even with an extravagant computing budget, we can not reach the target accuracy by just training longer. Figure 8 shows that there is almost no accuracy improvement as we increase the number of epochs from 3000 to 10000. For a huge batch size of 819200, the best accuracy we observed is 71.8%. For full-batch training, the best accuracy we can get within 15000 epochs is 71.1%. We hypothesize that there is an inherent generalization gap in huge/full-batch optimization and simply training longer can not close this gap.
A typical approach on distributed systems is to perform batch normalization (BN) per replica, which reduces the cross-device communication cost. However, distributed BN may also have an impact on testing accuracy. For example, Ying et al. observed that using an effective batch size of 64 for batch normalization can lead to the best accuracy (Figure 6 of ying2018image ). We use the same approach as Ying et al. ying2018image ; Ghost Batch Normalization by Hoffer et al. hoffer2017train is a similar idea. We only conduct a reduction across a few peers to compute the mean and variance over a subset of all the replicas. In our experiments, we achieve the best accuracy on 256 TPU v3 cores for a batch size of 32K, a ratio of 128 between the batch size and the number of cores. We keep this ratio as we increase the number of cores. The current largest TPU-based supercomputer has 2048 cores, which corresponds to a batch size of 256K; that is the largest useful batch size for Ghost Batch Norm on the current TPU supercomputer. However, 256K lies in the huge-batch region: we only achieve 65% accuracy in 90 epochs, and even if we increase the number of epochs from 90 to 500, we can not reach the target accuracy (Figure 9). This means the current optimization algorithms are not scalable enough to make full use of current hardware. To make better use of future hardware, we offer advice for system/hardware designers in the Appendix.
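The per-replica (ghost) normalization described above can be sketched as follows: statistics are computed over fixed-size sub-batches rather than the full batch. This sketch is simplified to the stateless, training-time computation, without the learned scale/shift parameters or running averages of a full BN layer:

```python
import numpy as np

def ghost_batch_norm(x, ghost_size=64, eps=1e-5):
    """Normalize each ghost batch of `ghost_size` rows with its own mean/variance.

    x: array of shape (batch, features). On a real distributed system, each
    replica (or small group of replicas) would compute these statistics
    locally, avoiding a global all-reduce over the full batch.
    """
    out = np.empty_like(x, dtype=float)
    for start in range(0, len(x), ghost_size):
        chunk = x[start:start + ghost_size]
        mean = chunk.mean(axis=0)
        var = chunk.var(axis=0)
        out[start:start + ghost_size] = (chunk - mean) / np.sqrt(var + eps)
    return out
```

With a fixed ratio of 128 samples per core, `ghost_size` is what stays constant as the global batch grows, which is why the communication cost per BN layer does not grow with the batch size.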
4.2 Optimization and Generalization
Hoffer et al. hoffer2017train introduced the “ultra-slow diffusion” theory to explain the generalization of large-batch training. According to some statistical physics models, even though the shape of the loss function can not be visualized, we can describe the complicated DNN learning process as a random process on some potential. Specifically, they use “Random Walk on a Random Potential” to model this complicated optimization and generalization problem. Based on a series of theoretical works bouchaud1990anomalous ; bray2007statistics ; dauphin2014identifying ; soudry2017exponentially , they built a relationship between the number of iterations $t$ and the weights $w_t$ at the $t$-th iteration: $\|w_t - w_0\| \sim (\log t)^{2/\alpha}$. It is worth noting that the typical relationship in standard diffusion (on a flat potential) is $\|w_t - w_0\| \sim \sqrt{t}$. They speculate $\alpha$ to be 2, which means the distance between the weights and the initial weights increases logarithmically with the number of iterations. They argued that the optimizer needs to travel at least a distance of $d$ to find a minimum of width $d$, which takes $e^{O(d)}$ iterations. Essentially, their conclusion implies that, given the best optimization technique, the quality of a minimizer depends on the number of iterations regardless of the batch size. Thus, the authors proposed an approach, “regime adaptation”, which suggests using the same number of iterations as the baseline (e.g. batch size = 256) regardless of the batch size. However, our results show that huge/full-batch training can not reach the target accuracy even with a large number of epochs. A natural question to ask is: why can longer training not lead to better generalization in the huge/full-batch regime? Unfortunately, the relationship between optimization and generalization is missing in their theory. Our experiments (Figure 10) show that training longer leads to a better training loss; however, it can not reach the target testing accuracy of the baseline.
By running the same number of iterations, full-batch training can achieve an even lower training loss than the 32K batch size. This indicates that there is an inherent generalization gap in huge/full-batch optimization that cannot be closed by any of the existing techniques, which hardware designers should keep in mind when designing systems for machine learning.
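Under our reading, the exponential iteration count quoted above follows directly from inverting the diffusion law (taking $\alpha = 2$ as Hoffer et al. speculate):

```latex
\|w_t - w_0\| \sim (\log t)^{2/\alpha}
\;\Longrightarrow\;
t \sim \exp\!\big(d^{\alpha/2}\big) \text{ iterations to travel a distance } d,
\qquad
\alpha = 2 \;\Rightarrow\; t \sim e^{d}.
```

Contrast this with standard diffusion, $\|w_t - w_0\| \sim \sqrt{t}$, where traveling a distance $d$ only takes $t \sim d^2$ iterations.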
We study the limit of the batch size for deep neural network training. For the first time, we scale the batch size on ImageNet to at least a magnitude larger than all previous works (819200 versus 64K), and provide detailed studies of the performance of many state-of-the-art optimization schemes in this setting. We propose an optimization recipe that improves the top-1 test accuracy by 18% compared to the baseline. We identify a “huge-batch” regime for realistic optimization tasks (e.g. ImageNet/ResNet-50), where the optimization process becomes intrinsically harder and we cannot reach the target test accuracy in polynomial time by applying any of the current optimization techniques (an empirical observation). The “ultra-slow diffusion” theory hoffer2017train no longer holds in this regime. Our results help system and algorithm designers understand the limit of parallelism in realistic machine learning tasks and design future machine learning hardware/algorithms accordingly. For more results and explanations, please check the Appendix.
We study the potential applications of huge-batch training and full-batch training, which may have an influence on future hardware design.
-  Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet-50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.
-  Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. Memory efficient adaptive optimization. In Advances in Neural Information Processing Systems, pages 9746–9755, 2019.
-  Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, et al. A view of the parallel computing landscape. Communications of the ACM, 52(10):56–67, 2009.
-  Jean-Philippe Bouchaud and Antoine Georges. Anomalous diffusion in disordered media: statistical mechanisms, models and physical applications. Physics reports, 195(4-5):127–293, 1990.
-  Alan J Bray and David S Dean. Statistics of critical points of gaussian fields on large-dimensional spaces. Physical review letters, 98(15):150201, 2007.
-  Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in neural information processing systems, pages 2933–2941, 2014.
-  John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(Jul):2121–2159, 2011.
-  Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, and Jonathan M Cohen. Stochastic gradient methods with layer-wise adaptive moments for training of deep networks. arXiv preprint arXiv:1905.11286, 2019.
-  Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W Mahoney, and Joseph Gonzalez. On the computational inefficiency of large batch sizes for stochastic gradient descent. arXiv preprint arXiv:1811.12941, 2018.
-  Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
-  Vipul Gupta, Santiago Akle Serrano, and Dennis DeCoste. Stochastic weight averaging in parallel: Large-batch training that generalizes well. arXiv preprint arXiv:2001.02312, 2020.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.
-  Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. arXiv preprint arXiv:1807.11205, 2018.
-  Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.
-  Implementation of cifar-10 by pytorch, 2020.
-  Sameer Kumar, Victor Bitorff, Dehao Chen, Chiachen Chou, Blake Hechtman, HyoukJoong Lee, Naveen Kumar, Peter Mattson, Shibo Wang, Tao Wang, et al. Scale mlperf-0.6 models on google tpu-v3 pods. arXiv preprint arXiv:1909.09756, 2019.
-  Yann LeCun et al. Lenet-5, convolutional neural networks, 2015.
-  Mu Li. Scaling distributed machine learning with system and algorithm co-design. PhD thesis, Carnegie Mellon University, 2017.
-  Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large mini-batches, use local sgd. arXiv preprint arXiv:1808.07217, 2018.
-  Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
-  Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
-  Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843, 2019.
-  Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W Mahoney. Inefficiency of k-fac for large batch size training. arXiv preprint arXiv:1903.06237, 2019.
-  James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In International conference on machine learning, pages 2408–2417, 2015.
-  Hiroaki Mikami, Hisahiro Suganuma, et al. Imagenet/resnet-50 training in 224 seconds. arXiv preprint arXiv:1811.05233, 2018.
-  MLperf. Implementation of resnet-50 by mlperf, 2019.
-  NVIDIA. Nvidia clocks world’s fastest bert training time and largest transformer based model, paving path for advanced conversational ai, 2019.
-  Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Large-scale distributed second-order optimization using kronecker-factored approximate curvature for deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12359–12367, 2019.
-  Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, 2013.
-  Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
-  Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054, 2019.
-  David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. nature, 323(6088):533–536, 1986.
-  Christopher J Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.
-  Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053, 2019.
-  Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.
-  Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
-  Daniel Soudry and Elad Hoffer. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.
-  Google TPU Team. Implementation of lenet by google, 2019.
-  Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
-  Masafumi Yamazaki, Akihiko Kasagi, Akihiro Tabuchi, Takumi Honda, Masahiro Miwa, Naoto Fukumoto, Tsuguchika Tabaru, Atsushi Ike, and Kohta Nakashima. Yet another accelerated sgd: Resnet-50 training on imagenet in 74.7 seconds. arXiv preprint arXiv:1903.12650, 2019.
-  Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. Image classification at supercomputer scale. arXiv preprint arXiv:1811.06992, 2018.
-  Yang You, Igor Gitman, and Boris Ginsburg. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 6, 2017.
-  Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. In International Conference on Learning Representations, 2019.
-  Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. arXiv preprint arXiv:1709.05011, 2017.
6.1 Discussion on Ultra-Slow Diffusion Theory
As in Section 2 of the main text, let us refer to $g_t$ as the stochastic gradient at the $t$-th iteration and $B$ as the batch size. Here, we define the signal as the component of the update in the direction of the true gradient ($\nabla f(x_t)$) and the noise as the component of the update perpendicular to the true gradient. We also define the signal-to-noise ratio as the ratio between the expected power of the signal and the expected power of the noise: $\mathrm{SNR} = \mathbb{E}[\|\text{signal}\|^2] \,/\, \mathbb{E}[\|\text{noise}\|^2]$.
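The decomposition above can be sketched numerically. Here `true_grad` is a stand-in for $\nabla f(x_t)$, which in practice is unknown and would have to be approximated (e.g. by a full-batch gradient):

```python
import numpy as np

def snr(update, true_grad):
    """Split `update` into its component along `true_grad` (signal) and the
    perpendicular remainder (noise); return the ratio of their powers."""
    unit = true_grad / np.linalg.norm(true_grad)
    signal = (update @ unit) * unit   # projection onto the true gradient direction
    noise = update - signal           # perpendicular component
    return np.sum(signal ** 2) / np.sum(noise ** 2)
```

For example, an update of (2, 1) against a true gradient along the x-axis has signal power 4 and noise power 1, so the SNR is 4.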
In this section we briefly discuss why training longer can not close the generalization gap in the huge-batch and full-batch training regimes. In the “ultra-slow diffusion” theory hoffer2017train , the relationship between the random-walk distance and the number of iterations is technically sound and empirically supported by strong experimental results. However, we believe its ineffectiveness in the huge- and full-batch regimes is caused by a few potential mismatches between this theory and the huge/full-batch setting:
Based on the analysis of hoffer2017train , we assume a minimum with a larger “width” will have better generalization performance. However, it is not guaranteed that we can find a minimum of “width” $d$ after traveling a distance of $d$ (although we at least need to travel a distance of $d$ to find a minimum of “width” $d$). Figures 11 and 12 support our conjecture: even though all the cases in Figure 11 or Figure 12 travel the same distance, each one reaches a very different minimum. Figure 2 in hoffer2017train only shows batch sizes up to 2K; the optimization problem is significantly different once we enter the huge-batch regime.
The theory does not analyze the effect of the signal-to-noise ratio in stochastic gradients. The signal grows linearly with the batch size ($\text{signal} \propto B$), while the noise grows only at a square-root rate ($\text{noise} \propto \sqrt{B}$). Thus, increasing the batch size increases the signal-to-noise ratio of the stochastic gradients, and the full-batch setting has no noise at all. The authors essentially normalized the covariance in their paper; however, they did not consider the effect of the signal-to-noise ratio in large-batch training, and the effect is even bigger in huge-batch training. When the noise is small compared to the signal, it is unclear whether the “ultra-slow diffusion theory” still holds.
In a real-world training process, the randomness of the stochastic potential is very limited. With a limited number of training samples and data augmentations, the information becomes largely redundant and correlated after enough iterations of huge-batch training, violating the theory’s assumption of a random potential.
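The scaling argument in the second point can be checked with a short Monte-Carlo sketch (our own illustration, assuming a fixed true gradient plus i.i.d. Gaussian per-sample noise): the summed mini-batch gradient's signal scales like $B$ while its noise scales like $\sqrt{B}$, so the power signal-to-noise ratio grows roughly linearly in $B$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 100, 1.0
g = np.ones(d)  # assumed fixed true gradient for the sketch

def power_snr(batch_size, trials=200):
    """Empirical power SNR of the summed mini-batch gradient."""
    ratios = []
    for _ in range(trials):
        grads = g + sigma * rng.standard_normal((batch_size, d))
        u = grads.sum(axis=0)          # summed batch update
        ghat = g / np.linalg.norm(g)
        s = (u @ ghat) * ghat          # signal component
        n = u - s                      # noise component
        ratios.append((s @ s) / (n @ n))
    return np.mean(ratios)

for B in [16, 64, 256]:
    print(B, power_snr(B))  # SNR grows roughly linearly with B
```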
6.2 Additional Results
We do not have enough space in the main text, so we present additional results here. Figure 13 shows the top-5 accuracy for huge-batch ImageNet training. We can observe that LARS is very important for huge-batch training, which is in line with the results of previous large-batch training studies. The batch size of CIFAR-10 is 25K, which is at the boundary between the large-batch and huge-batch regimes; LARS also helps in this mixed regime.
Figure 14 shows the results of full-batch training. We scale the batch size of ImageNet training to 1.28 million and the batch size of CIFAR-10 training to 50K. The top-5 accuracy of full-batch ImageNet training and the top-1 accuracy of full-batch CIFAR-10 training increase dramatically after using a series of optimization techniques including LARS. Figure 15 shows the hyper-parameter tuning results of full-batch CIFAR-10 training with ResNet-50 (200 epochs, i.e. 200 iterations at a batch size of 50K). We only show the tuning region of the best hyper-parameters. We can observe that LARS increases the accuracy of full-batch training; however, current first-order optimizers cannot achieve the same accuracy as the baseline for full-batch training. Figure 16 shows the full-batch training results on MNIST. We partition the dataset into 60K/10K samples for training/testing and increase the batch size to 60K. Momentum SGD is our baseline optimizer, and LARS is the optimizer that achieves the best accuracy for us. We can observe that neither optimizer reaches the target accuracy (99.2%); however, LARS achieves a higher accuracy than Momentum SGD. Figure 16 only shows the most effective hyper-parameter tuning region. We can also observe that LARS is more stable than Momentum SGD for full-batch training. Figure 17 shows a similar result.
As shown in the main text, training longer cannot close the generalization gap for huge-batch and full-batch training. Consider MNIST with LeNet (batch size = 60K). The baseline with a batch size of 256 can achieve 99.2% testing accuracy in 30 epochs; with LARS, the baseline can achieve 99.3% accuracy in 30 epochs. For huge-batch MNIST training, we set the batch size to 32768. From Figure 18 we can observe that even if we increase the number of epochs from 30 to 1000, we only achieve 99.16% accuracy, which still cannot match the target accuracy. For full-batch training, the accuracy of 1000-epoch training is even lower: only 98.3%. It is worth noting that a batch size of 256 can achieve this level of accuracy in just 10 epochs, which means there is a huge generalization gap between full-batch optimization and regular SGD optimization. For CIFAR-10 training with ResNet-50 (batch size = 50K), the baseline can achieve 93.9% accuracy in 200 epochs. However, the full-batch version cannot reach the target accuracy even if we train it for a long time; the best accuracy we can get is 92.58% after 5000 epochs (Figure 20).
6.3 Why Can LARS Help?
From the results in the previous section, we can observe that LARS makes a significant difference in huge-batch and full-batch training. We want to briefly explain why LARS helps in huge/full-batch training, which is beyond the scope of the original LARS paper. As mentioned by Keskar et al., a small-batch learner uses noisy gradients in the computation of each step. This noise can actually push the learner away from the basin of a sharp minimizer. However, the noise is greatly reduced in large/full-batch training and is no longer enough to push the learner out of the basin of a sharp minimizer. Thus, adding noise to the huge/full-batch learner may help. But how do we add proper noise? We tried adding Gaussian noise and tuning the hyper-parameters extensively, but it did not improve the testing accuracy. Specifically, we did the following experiments:
Add noise to activations, i.e. the outputs of each layer.
Add noise to the weights, i.e. an alternative to perturbing the inputs.
Add noise to the gradients, i.e. the direction to update weights.
Add noise to the outputs, i.e. the labels or target variables.
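As a concrete sketch of the gradient-noise variant above (our illustration with placeholder hyper-parameters, not the exact setup from the experiments), one SGD step with additive Gaussian gradient noise looks like:

```python
import numpy as np

rng = np.random.default_rng(42)

def sgd_step_with_grad_noise(w, grad, lr=0.1, noise_std=0.01):
    """One SGD step where Gaussian noise is added to the gradient
    before the update (one of the noise-injection variants tried).
    lr and noise_std are placeholder values."""
    noisy_grad = grad + noise_std * rng.standard_normal(grad.shape)
    return w - lr * noisy_grad

w = np.zeros(3)
w = sgd_step_with_grad_noise(w, grad=np.array([1.0, -2.0, 0.5]))
print(w)  # close to -lr * grad, perturbed by the injected noise
```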
Keskar et al. did similar experiments, but it also did not help (see https://openreview.net/forum?id=H1oyRlYgg&noteId=H1oyRlYgg). On the other hand, LARS computes the trust ratio of each layer $i$ at each iteration $t$:
$$\text{trust\_ratio}_i^t = \frac{\|w_i^t\|_2}{\left\|\frac{1}{B}\sum_{b=1}^{B} \nabla L(x_b, w_i^t)\right\|_2 + \beta \|w_i^t\|_2}$$
where $i$ is the layer ID, $B$ is the batch size, $t$ is the iteration ID, and $\beta$ is the weight decay (e.g. $\beta$ = 0.01). Then LARS multiplies the learning rate by the trust ratio. From Figure 2 of the LARS paper (https://arxiv.org/pdf/1708.03888v3.pdf) we can observe that:
For a different iteration, the learning rate is different.
For a different layer, the learning rate is different.
For a different batch size, the learning rate is different.
Therefore, we think the trust ratio of LARS provides dynamics to the learning process, and these dynamics may act as a proper noise that helps the learner escape a sharp minimum.
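A minimal sketch of a LARS-style update for one layer (our simplification of the rule discussed above; real implementations also fold in momentum and the weight-decay term of the update itself):

```python
import numpy as np

def lars_update(w, grad, lr=0.1, beta=0.01, eps=1e-9):
    """Scale one layer's learning rate by the trust ratio
    ||w|| / (||grad|| + beta * ||w||), where beta is the weight
    decay. Each layer thus gets its own effective learning rate."""
    trust_ratio = np.linalg.norm(w) / (
        np.linalg.norm(grad) + beta * np.linalg.norm(w) + eps)
    return w - lr * trust_ratio * grad

w = np.array([3.0, 4.0])   # ||w|| = 5
g = np.array([0.0, 10.0])  # ||g|| = 10
print(lars_update(w, g))   # trust ratio ~ 5 / (10 + 0.05)
```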
6.4 Identifying the Huge-Batch Regime
As mentioned in the main text, we identified the huge-batch regime of ImageNet/ResNet-50 as (64K, 1.28 million). Kumar et al. reported that they can reach the target accuracy for ImageNet/ResNet-50 at a batch size of 64K by using the LARS optimizer. We used the training recipe of Kumar et al. and an auto-tuner to maximize the accuracy of ImageNet/ResNet-50 at a batch size of 128K. However, we could not reach the target accuracy. Table 4 shows the best results we achieved: the best top-1 accuracy is only 73.37%, which is much lower than the target accuracy (75.9%). Thus, we identify batch sizes over 64K as the huge-batch regime for ImageNet/ResNet-50 training. Figure 21 shows that the huge-batch regime of MNIST/LeNet training is (8K, 60K), and Figure 22 shows that the huge-batch regime of CIFAR-10/ResNet-50 training is (25K, 50K).
|LR||warmup epochs||momentum||LR schedule||TPU chips||Top-1 accuracy|
6.5 Learning Rate Schedule
Since the learning rate is the most important hyper-parameter, we tried several different learning rate schedules. Specifically, they include: a regular cosine learning rate schedule (Figure 23); regular learning rate warmup with poly learning rate decay (Figure 24); regular learning rate warmup with a fine-grained cosine schedule (Figure 25); regular learning rate warmup with a regular cosine schedule (Figure 26); regular learning rate warmup with a coarse-grained cosine schedule (Figure 27); regular learning rate warmup with cosine learning rate decay (Figure 28); and cosine learning rate warmup with cosine learning rate decay (Figure 29). We pick the best schedule for each application.
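One of the schedules above, linear warmup followed by cosine decay, can be sketched as follows (our illustration; the warmup length and peak learning rate are placeholder values, not the tuned settings):

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps=5, peak_lr=0.1):
    """Linear warmup to peak_lr over warmup_steps, then a single
    cosine decay from peak_lr toward zero for the remaining steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

lrs = [warmup_cosine_lr(s, total_steps=100) for s in range(100)]
```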
For example, we tried a cyclical learning rate scheme in MNIST training. This brings an additional hyper-parameter: the cycle length. If we do not tune it carefully, the testing accuracy can be hurt significantly (the error rate increases from 4.2% to 12.7% in Figure 32). After tuning it, we can get a slightly better accuracy (the error rate decreases from 4.2% to 3.8%), as shown in Figure 33. However, in both of these figures, the training loss reaches its peak in the middle of training and remains very high at the end of training.
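The cyclical scheme can be sketched as a triangular wave whose cycle length is exactly the extra hyper-parameter that needed careful tuning (our illustration; the minimum/maximum learning rates are placeholders):

```python
def cyclical_lr(step, cycle_len=20, min_lr=0.001, max_lr=0.1):
    """Triangular cyclical schedule: the learning rate ramps
    linearly from min_lr up to max_lr and back down over each
    cycle of cycle_len steps. cycle_len is the hyper-parameter
    that must be tuned carefully."""
    pos = (step % cycle_len) / cycle_len   # position within cycle, [0, 1)
    tri = 1.0 - abs(2.0 * pos - 1.0)       # triangular wave in [0, 1]
    return min_lr + (max_lr - min_lr) * tri
```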
6.6 Hardware Usage Discussion
BERT pre-training is a good example of a deep learning application on supercomputers: it can make full use of the most powerful TPU-based supercomputers, which means researchers are able to scale it to 1024 TPU chips. However, current huge-batch algorithms cannot make full use of supercomputers for some applications like ImageNet. Here, we list a few concrete suggestions for hardware/algorithm designers on what we can do to make better use of supercomputers.
We give the following suggestions:
Using model parallelism and pipeline parallelism together with data parallelism, which can maximize the potential performance.
Using larger samples, which may also be a future trend.
For ImageNet with ResNet-50, a batch size of 32 keeps one GPU busy.
A 32K batch size can therefore keep 1024 GPUs busy.
For large images (e.g. 1920x1080 HD images on ResNet-50), even a batch size of 1 may keep a GPU busy.
A 32K batch size can then keep 32K GPUs busy.
Trying second-order approaches like K-FAC.
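The parallelism arithmetic in the suggestions above is simple division (a sketch; the per-GPU batch sizes are the ones quoted in the text):

```python
def gpus_kept_busy(global_batch, per_gpu_batch):
    """Number of data-parallel GPUs a global batch can keep busy,
    given the smallest batch size that saturates one GPU."""
    return global_batch // per_gpu_batch

print(gpus_kept_busy(32 * 1024, 32))  # ImageNet/ResNet-50: 1024 GPUs
print(gpus_kept_busy(32 * 1024, 1))   # HD images: 32768 GPUs
```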