1 Introduction
Large-batch optimization is becoming an important research topic because it is an efficient approach for current distributed deep learning systems. It has enabled researchers to reduce ImageNet/ResNet50 training time from 29 hours to around 1 minute (Table 1). Researchers have also reduced BERT pretraining time from 3 days to 76 minutes you2019large , 67 minutes shoeybi2019megatron , 47 minutes nvidia2019bert and 44 minutes rajbhandari2019zero . The speedup comes from the fact that researchers can scale training to a larger batch size, and thus use larger-scale parallelism, without losing accuracy in a fixed number of epochs. For example, researchers scaled the batch size of ImageNet training from 256 he2016deep to 1K krizhevsky2014one , 5K li2017scaling , 8K goyal2017accurate , 16K you2017imagenet , 32K ying2018image , and 64K kumar2019scale . On the other hand, hardware designers are also interested in large-batch training, as chip makers can no longer keep improving a single processor’s clock frequency due to power consumption issues asanovic2009view . Thus, hardware vendors have to use more processors to increase raw floating-point performance, which requires high parallelism from future algorithms. Many successful use cases of large-batch training have encouraged industry to design and implement several AI supercomputers, such as Google TPU Pod, NVIDIA SuperPOD, and Huawei Atlas 900 AI Cluster, to take advantage of the extremely high parallelism provided by large-batch algorithms.
Teams  Date  Accuracy  Time  Optimizer 

He et al. he2016deep  12/10/2015  75.3%  29h  Momentum (Rumelhart et al. rumelhart1986learning ) 
Goyal et al. goyal2017accurate  06/08/2017  76.3%  65m  Momentum (Rumelhart et al. rumelhart1986learning ) 
You et al. you2017imagenet  11/02/2017  75.3%  48m  LARS (You et al. you2017scaling ) 
You et al. you2017imagenet  11/07/2017  75.3%  31m  LARS (You et al. you2017scaling ) 
Akiba et al. akiba2017extremely  11/12/2017  74.9%  15m  RMSprop (Hinton tieleman2012lecture ) 
You et al. you2017imagenet  12/07/2017  74.9%  14m  LARS (You et al. you2017scaling ) 
Jia et al. jia2018highly  07/30/2018  75.8%  6.6m  LARS (You et al. you2017scaling ) 
Mikami et al. mikami2018imagenet  11/14/2018  75.0%  3.7m  LARS (You et al. you2017scaling ) 
Ying et al. ying2018image  11/16/2018  76.3%  2.2m  LARS (You et al. you2017scaling ) 
Yamazaki et al. yamazaki2019yet  03/29/2019  75.1%  1.25m  LARS (You et al. you2017scaling ) 
Kumar et al. kumar2019scale  10/02/2019  75.9%  67.1s  LARS (You et al. you2017scaling ) 
In this paper, we focus on studying the limit of the batch size and hope to provide guidance to AI supercomputer and algorithm designers. To the best of our knowledge, no previous work has studied a batch size over 128K. We pick realistic applications such as ImageNet/ResNet50 and provide detailed numerical optimization instructions for step-by-step accuracy comparison. Moreover, it is important to understand the generalization and optimization difficulties encountered when the batch size is huge. There is a series of works hoffer2017train keskar2016large smith2017don on this topic. Specifically, Hoffer et al. hoffer2017train introduced the “ultra-slow diffusion” theory to large-batch training. They pointed out that there is an “ultra-slow” logarithmic increase in the distance of the weights from their initialization. Their paper indicated that, under a series of optimization techniques, the generalization performance can be controlled by the number of iterations. However, our experiments show contradictory results once the batch size is increased beyond a certain boundary, which we refer to as the “huge-batch” regime. We provide comprehensive experimental results and detailed analysis to study the limitations of batch size scaling and of the “ultra-slow diffusion” theory.
Concretely, we scale the batch size of ImageNet/ResNet50 to 819K and 1.28 million, which is an order of magnitude larger than in any previous work. This is the first work to report an accuracy for huge/full-batch ImageNet/ResNet50 training. We also scale the batch size to the full dataset for MNIST, CIFAR10, and ImageNet. The contributions of our paper can be summarized as follows:

For the first time, we scale the batch size on ImageNet to at least an order of magnitude larger than all previous works, and provide detailed studies on the performance of many state-of-the-art optimization schemes under this setting. We propose an optimization recipe that improves the top-1 test accuracy by 18% compared to the baseline.

We identify a “huge-batch” regime for realistic optimization tasks (e.g. ImageNet/ResNet50), where the optimization process becomes intrinsically harder and we cannot reach the target test accuracy in polynomial time by applying any of the current optimization techniques. The “ultra-slow diffusion” theory no longer holds in this regime.

Our results help system researchers and designers to understand the limit of parallelism in realistic machine learning tasks and design future machine learning hardware accordingly.
2 Background and Related Work
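For any vector $x_t \in \mathbb{R}^d$, either $x_{t,j}$ or $[x_t]_j$ is used to denote its $j$-th coordinate, where $j \in [d]$. For any function $f: \mathbb{R}^d \rightarrow \mathbb{R}$, we use $\nabla f(x)$ to denote the gradient with respect to $x$. We use $\|\cdot\|$ to denote the $\ell_2$ norm of a vector. We study nonconvex stochastic optimization problems of the form

$$\min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{s \sim \mathbb{P}}[\ell(x, s)] \qquad (1)$$

where $\ell$ is a smooth (possibly nonconvex) function and $\mathbb{P}$ is a probability distribution on the domain $\mathcal{S}$. Here, $x$ corresponds to the model parameters, $\ell$ is the loss function, and $\mathbb{P}$ is an unknown data distribution. Stochastic gradient descent (SGD) is a method for solving the problem in Equation (1). The update at the $t$-th iteration of SGD is of the following form:

$$x_{t+1} = x_t - \eta_t \frac{1}{|\mathcal{S}_t|} \sum_{s_t \in \mathcal{S}_t} \nabla \ell(x_t, s_t) \qquad \text{(SGD)}$$

where $\mathcal{S}_t$ is a set of random samples drawn from the distribution $\mathbb{P}$. Several SGD variants (e.g. duchi2011adaptive kingma2014adam tieleman2012lecture ) have been proposed in recent years. We assume the batch size of our baseline is $B$, and our baseline can achieve a testing accuracy of $A$ with a validation loss of $L$ by using a learning rate of $\eta$ in $E$ epochs. We define three optimization regimes with batch sizes $B_l$, $B_h$, and $B_f$: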
Definition 2.1.
Large Batch: If we use a batch size $B_l$ larger than the baseline batch size, together with a corresponding learning rate, and we can still get the baseline testing accuracy (or higher) and the baseline validation loss (or lower) in the same number of epochs, then we define $B_l$ to be a large batch (examples in Table 2).
Definition 2.2.
Huge Batch: If we use a batch size $B_h$ with the currently available optimization techniques and we cannot get the baseline testing accuracy in the same number of epochs, then we define $B_h$ to be a huge batch (examples in Table 2).
Definition 2.3.
Full Batch: We define a positive integer $B_f$, which is equal to the total number of available training samples at runtime (examples in Table 2).
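As a minimal sketch of the minibatch SGD update described above, the following Python snippet applies it to a toy one-dimensional least-squares loss. The loss, the data, the learning rate, and the names `grad_loss`, `sgd_step`, and `run_sgd` are illustrative assumptions and are not part of our experimental setup.

```python
import random

def grad_loss(x, s):
    """Gradient w.r.t. x of the toy per-sample loss l(x, s) = 0.5 * (x - s)**2."""
    return x - s

def sgd_step(x, batch, lr):
    """One SGD update: x <- x - lr * (average gradient over the minibatch)."""
    avg_grad = sum(grad_loss(x, s) for s in batch) / len(batch)
    return x - lr * avg_grad

def run_sgd(data, batch_size, lr, steps, seed=0):
    """Run `steps` SGD updates, drawing a fresh random minibatch at every step."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)  # the random sample set for this step
        x = sgd_step(x, batch, lr)
    return x

# Toy data (assumption): values spread evenly around 3.0, so the minimizer is 3.0.
data = [3.0 + 0.1 * (i % 11 - 5) for i in range(1100)]
x_final = run_sgd(data, batch_size=32, lr=0.1, steps=500)
```

In this sketch a larger `batch_size` only reduces the noise of each update; it does not change the expected update direction.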
Keskar et al. keskar2016large reported that traditional first-order optimization techniques fail to scale the batch size up to a very large number, and that large-batch training hurts the testing accuracy. They indicate that there is an inherent generalization gap in large-batch optimization within a fixed number of epochs. Hoffer et al. hoffer2017train suggested that training longer can close the generalization gap. Since then, researchers have developed a series of optimization techniques and scaled the batch size to 64K without losing accuracy for various applications within a fixed number of epochs (Table 1). Other related works include anil2019memory ginsburg2019stochastic gupta2020stochastic golmant2018computational krizhevsky2014one lin2018don liu2019variance ma2019inefficiency osawa2019large shallue2018measuring smith2017don . Inspired by these works, we focus on a different direction. As far as we know, there is no previous work studying huge-batch optimization (e.g. over 128K). Our investigation includes: (1) pushing the batch size to the limit and maximizing the accuracy in a fixed number of epochs; (2) studying the optimization problems of huge-batch and full-batch training; (3) studying the relationship between generalization and optimization.
3 Optimization
In this section, we use a series of techniques to maximize the accuracy in huge/full-batch training. Information about our applications is given in Table 2. Due to lack of space, some experimental results are in the Appendix. The first step of our study is to identify the “huge-batch” regime, for which we conduct experiments with extremely large batch sizes. The process of finding the “huge-batch” numbers in Table 2, and more results on how accuracy drops with batch size, are detailed in the Appendix.
Dataset  Model  Implementation  Baseline Batch  Epochs  Top-1 Accuracy  Large Batch  Huge Batch  Full Batch 

ImageNet  ResNet50 v1.5 he2016deep  resnet2019mlperf  256  90  75.9%  [2K, 64K]  (64K, 1.28M)  1.28M 
MNIST  LeNet lecun2015lenet  lenet2019google  256  30  99.2%  [2K, 8K]  (8K, 60K)  60K 
CIFAR10  ResNet50 v1 he2016deep  resnet2020cifar  128  200  93.9%  [2K, 25K]  (25K, 50K)  50K 
3.1 Huge Batch
Researchers are able to scale the batch size of ImageNet/ResNet50 to 64K and achieve 75.9% top-1 accuracy in 90 epochs kumar2019scale . As far as we know, no one has reported that the batch size of ImageNet/ResNet50 can be scaled beyond 64K without losing accuracy. The baseline of our implementation uses the Momentum SGD optimizer. With a batch size of 819200, the training diverges and only achieves a top-1 accuracy of 0.0977%. We train for 141 steps, which corresponds to the same number of epochs as the small-batch baseline. We investigate the effectiveness of existing large-batch optimization techniques in the huge-batch regime, starting from simple Momentum SGD and moving to more advanced techniques proposed by recent works pereyra2017regularizing ; kumar2019scale ; ying2018image ; goyal2017accurate ; hoffer2017train :

Optimization 0: this is our baseline; we use the Momentum SGD optimizer with linear learning rate scaling (or sqrt scaling if it is better) krizhevsky2014one , cosine learning rate decay loshchilov2016sgdr (see Appendix), gradient clipping pascanu2013difficulty , and ghost batch normalization hoffer2017train . The top-1 accuracy is 0.0977%.
Optimization 1: we add learning rate warmup goyal2017accurate , which allows us to use linear learning rate scaling without diverging. This improves the top-1 accuracy from 0.0977% to 2.712%.

Optimization 2: we tune the learning rate and the number of warmup epochs with an auto-tuner. We found that simple grid-search tuning achieves roughly the same accuracy as the auto-tuner within a reasonable computing budget. This improves the top-1 accuracy from 2.712% to 4.563%.

Optimization 3: we add label smoothing pereyra2017regularizing , which is a useful technique in large-batch training. The accuracy neither increases nor decreases significantly: it changes from 4.563% to 4.226%.

Optimization 4: we switch the optimizer from Momentum SGD to LARS. This increases the accuracy from 4.226% to 18.95%.
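The learning rate schedule used in the steps above (batch-size scaling, warmup, and cosine decay) can be sketched as follows. This is a minimal illustration, not our training code: the baseline learning rate of 0.1, the 30 warmup steps, the ImageNet training-set size of 1,281,167 images, and all function names are assumptions made for the example.

```python
import math

def total_steps(num_samples, batch_size, epochs):
    """Number of optimizer steps needed for `epochs` passes over the data."""
    return math.ceil(epochs * num_samples / batch_size)

def scaled_lr(base_lr, base_batch, batch_size, rule="linear"):
    """Scale the baseline learning rate with the batch size (linear or sqrt rule)."""
    ratio = batch_size / base_batch
    return base_lr * (ratio if rule == "linear" else math.sqrt(ratio))

def lr_at_step(step, peak_lr, warmup_steps, num_steps):
    """Linear warmup up to `peak_lr`, followed by cosine decay towards zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, num_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# 90 epochs at batch size 819200 over an assumed 1,281,167 training images.
steps = total_steps(1_281_167, 819_200, 90)  # 141 steps
peak = scaled_lr(base_lr=0.1, base_batch=256, batch_size=819_200)
schedule = [lr_at_step(t, peak, warmup_steps=30, num_steps=steps) for t in range(steps)]
```

Under these assumptions the step count reproduces the 141 steps mentioned above, which shows how few updates a huge batch leaves for warmup and decay within 90 epochs.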
The results are summarized in Figure 1. Since LARS gives the largest boost in accuracy, we further study its sensitivity to hyperparameters in Figure 2 by varying the two most important hyperparameters: the learning rate and the number of warmup epochs. From Figure 2, we observe that the advantage of the layerwise adaptive technique provided by the LARS optimizer is maintained over a wide range of hyperparameters, while Momentum SGD fails to obtain good accuracy regardless of hyperparameter tuning. We conclude that LARS can make a difference in huge-batch learning.

We also study an application with a relatively small number of training samples. We pick MNIST with LeNet. We partition the dataset into 55K/5K/10K samples for training/validation/testing (in Table 2, we partition the dataset into 60K/10K samples for training/testing). Using Momentum SGD, the baseline achieves 99.2% accuracy in 30 epochs with a batch size of 256. As shown in Table 2, 8K is the boundary between the large-batch region and the huge-batch region. When we scale the batch size to 8K, the baseline (Momentum SGD) diverges during training. So we tried other state-of-the-art optimizers such as AdaGrad, RMSprop, Adam, and LARS. AdaGrad and RMSprop cannot achieve an accuracy over 99%. Adam has a slight accuracy loss. However, LARS achieves a better accuracy than the baseline. To understand why, we note that one key feature of LARS is the layerwise adaptive learning technique, so we add the layerwise technique to all the other optimizers. After that, SGD and AdaGrad reach the same accuracy as the baseline. However, Adam becomes very unstable and RMSprop diverges in the middle of training. We find that the reason is that the trust ratio of layerwise learning (in its simplest form, for SGD, the ratio between the L2 norm of the weights and the L2 norm of the gradients) is either too large or too small for Adam and RMSprop. So we add a bound to the trust ratio for the Adam optimizer, an idea similar to AdaBound luo2019adaptive . With this bound, Adam also achieves 99.4% accuracy in 30 epochs with a batch size of 8K, which is the same as LARS. It is worth noting that Adam + layerwise + bound-ratio is a special case of the LAMB optimizer you2019large . The results are shown in Table 3. Figure 3 shows all the training details of the LARS optimizer.
Solver  batch=256  Original  + layerwise  + layerwise and boundratio 

SGD  99.2%  diverge  99.2%  99.2% 
AdaGrad  99.2%  98.3%  99.3%  99.3% 
RMSprop  99.2%  98.6%  diverge  99.0% 
Adam  99.3%  99.1%  unstable  99.4% 
LARS  99.3%  99.4%  N/A  N/A 
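To illustrate the layerwise trust ratio and its bounded variant discussed above, the following Python sketch computes the ratio per layer and applies it to a plain SGD update. It is a simplified illustration under stated assumptions: the clipping bounds of 0.1 and 10.0, the toy weights and gradients, and the function names are hypothetical, and the full LARS and LAMB optimizers contain additional terms (such as weight decay and momentum) that are omitted here.

```python
import math

def l2_norm(values):
    """Euclidean (L2) norm of a flat list of numbers."""
    return math.sqrt(sum(v * v for v in values))

def trust_ratio(weights, grads, lower=None, upper=None, eps=1e-12):
    """Ratio ||w|| / ||g|| for one layer, optionally clipped to [lower, upper]."""
    ratio = l2_norm(weights) / (l2_norm(grads) + eps)
    if lower is not None:
        ratio = max(ratio, lower)
    if upper is not None:
        ratio = min(ratio, upper)
    return ratio

def layerwise_sgd_step(layers, grads, lr, lower=None, upper=None):
    """Apply w <- w - lr * trust_ratio * g independently to each layer."""
    new_layers = []
    for w, g in zip(layers, grads):
        r = trust_ratio(w, g, lower, upper)
        new_layers.append([wi - lr * r * gi for wi, gi in zip(w, g)])
    return new_layers

layers = [[3.0, 4.0], [0.3, 0.4]]   # two toy "layers" (assumption)
grads = [[0.0, 1e-3], [6.0, 8.0]]   # one tiny gradient, one large gradient
unbounded = [trust_ratio(w, g) for w, g in zip(layers, grads)]
bounded = [trust_ratio(w, g, lower=0.1, upper=10.0) for w, g in zip(layers, grads)]
updated = layerwise_sgd_step(layers, grads, lr=0.1, lower=0.1, upper=10.0)
```

In this toy case the unbounded ratios differ by five orders of magnitude between the two layers, while the bounded ratios stay within the chosen interval, which is the effect the bound is meant to have.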
Since LARS with learning rate warmup and polynomial decay gave us the best performance for large-batch MNIST training, we use this scheme for huge-batch MNIST training. We increase the batch size from 8K to 32K. However, the testing error rate increases from 0.8% to 4.2% (Figure 4). With other optimizers, the error rate is higher (8%). In Figure 4, we observe that huge-batch training may lead the optimizer down a wrong path, so that the model is not optimized well: in the middle of training, even though the training and validation error rates are decreasing, the training loss is increasing. Thus, in this situation, huge-batch training suffers not only from a generalization problem but also from an optimization problem. After trying various optimization techniques, we find that only the LAMB optimizer with an extremely long learning rate warmup and polynomial learning rate decay can stabilize the training and generalize well (Figure 5). This scheme reaches 98.7% testing accuracy. However, even with this scheme, we cannot reach the target testing accuracy of the baseline: huge-batch training may have an inherent generalization problem that cannot be resolved by any existing technique. Huge-batch ImageNet/ResNet50 training has a similar problem, but it is much worse than for MNIST/LeNet. The target top-1 accuracy is 75.9% in 90 epochs, whereas a huge batch of 819200 only achieves an accuracy lower than 20%. We conclude that huge-batch training suffers from generalization problems, and sometimes also from optimization problems, with any of the current first-order optimization techniques.
3.2 Full Batch
In this section, we increase the batch size from the huge-batch region to the full-batch case. For ImageNet/ResNet50, we increase the batch size from 819200 to 1.28 million, which is demonstrated for the first time in the literature. For MNIST/LeNet, we partition the dataset into 60K/10K for training/testing and increase the batch size from 32768 to 60K. We use the optimization techniques from Section 3.1. Even though we increase the batch size by less than a factor of two, the accuracy becomes much lower. The results are shown in Figures 6 and 7. We observe that LARS also makes a significant difference in full-batch training; however, we cannot reach the target testing accuracy in 90 epochs. Momentum SGD is our baseline optimizer, and LARS is the optimizer that achieves the best accuracy for us. For MNIST, huge-batch training cannot reach the 99.2% target accuracy (results in Appendix). LARS achieves a higher accuracy than Momentum SGD, and we also observe that it is more stable than Momentum SGD for huge-batch training.
4 Discussion
4.1 Train longer, generalize better?
We focus on ImageNet/ResNet50 training in Sections 4.1 and 4.2. Our experiments indicate that huge/full-batch training suffers from a serious generalization problem. Hoffer et al. hoffer2017train suggest that training longer leads to better generalization for large-batch training. We want to study the validity of this suggestion for huge/full-batch training. A large batch size of 32K is able to reach a testing accuracy of 76.7% in just 90 epochs. Huge-batch training (batch size = 819200) only achieves 19% accuracy in 90 epochs. By increasing the number of epochs to 1200, we can reach around 70% accuracy. However, even with an extravagant computing budget, we cannot reach the target accuracy by just training longer. Figure 8 shows that there is almost no accuracy improvement as we increase the number of epochs from 3000 to 10000. For a huge batch size of 819200, the best accuracy we observed is 71.8%. For full-batch training, the best accuracy we can get within 15000 epochs is 71.1%. We hypothesize that there is an inherent generalization gap in huge/full-batch optimization and that simply training longer cannot close this gap.
A typical approach on distributed systems is to perform batch normalization (BN) per replica, which reduces the cross-device communication cost. However, distributed BN may also have an impact on the testing accuracy. For example, Ying et al. observed that using an effective batch size of 64 for batch normalization can lead to the best accuracy (Figure 6 of ying2018image ). We use the same approach as Ying et al. ying2018image . Ghost Batch Normalization by Hoffer et al. hoffer2017train is a similar idea. We only conduct the reduction across a few peers to compute the mean and variance over a subset of all the replicas. In our experiments, we achieve the best accuracy on 256 v3 TPU cores for a batch size of 32K, so the ratio between the batch size and the number of cores is 128. We keep this ratio as we increase the number of cores. The current largest TPU-based supercomputer has 2048 cores, which corresponds to a batch size of 256K. That means the largest useful batch size for Ghost Batch Norm on the current TPU supercomputer is 256K. However, we believe 256K is in the huge-batch region: we only achieve 65% accuracy in 90 epochs, and even if we increase the number of epochs from 90 to 500, we cannot reach the target accuracy (Figure 9). This means the current optimization algorithms are not scalable enough to make full use of the current hardware. To make better use of future hardware, we offer advice for system/hardware designers in the Appendix.
4.2 Optimization and Generalization
Hoffer et al. hoffer2017train introduced the “ultra-slow diffusion” theory to explain the generalization of large-batch training. According to some statistical physics models, even though the shape of the loss function cannot be visualized, we can describe the complicated DNN learning process as a random process with some potential. Specifically, they use a “Random Walk on a Random Potential” to model this complicated optimization and generalization problem. Based on a series of theoretical works bouchaud1990anomalous ; bray2007statistics ; dauphin2014identifying ; soudry2017exponentially , they built a relationship between the number of iterations $t$ and the weight $w_t$ at the $t$-th iteration: $\|w_t - w_0\| \sim (\log t)^{2/\alpha}$. It is worth noting that the typical relationship in standard diffusion (on a flat potential) is $\|w_t - w_0\| \sim \sqrt{t}$. They speculate $\alpha$ to be 2, which means that the distance between the weight and the initial weight increases logarithmically with the number of iterations. They argued that the optimizer needs to travel at least a distance of $d$ to find a minimum with a width of $d$, which takes $e^{d}$ iterations. Essentially, their conclusion implies that, given the best optimization technique, the quality of a minimizer depends on the number of iterations regardless of the batch size. Thus, the authors proposed an approach, “regime adaptation”, which suggests using the same number of iterations as the baseline (e.g. batch size = 256) regardless of the batch size. However, our results show that huge/full-batch training cannot reach the target accuracy even with a large number of epochs. A natural question to ask is “why can longer training not lead to better generalization in the huge/full-batch regime?” Unfortunately, the relationship between optimization and generalization is missing in their theory. Our experiments (Figure 10) show that training longer leads to a better training loss; however, it cannot reach the target testing accuracy of the baseline. When running the same number of iterations, full-batch training can achieve an even lower training loss than the 32K batch size. This indicates that there is an inherent generalization gap in huge/full-batch optimization that cannot be closed by any of the existing techniques, which hardware designers should keep in mind when designing systems for machine learning.
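To make the iteration count in the diffusion argument explicit, the short derivation below compares the two regimes. It is a sketch that assumes the scaling $\|w_t - w_0\| \sim (\log t)^{2/\alpha}$ with $\alpha = 2$, ignores all constants, and writes $d$ for the distance the optimizer has to travel.

```latex
\begin{align*}
\text{ultra-slow diffusion } (\alpha = 2):\quad
  & \|w_t - w_0\| \sim \log t = d \;\Longrightarrow\; t \sim e^{d}, \\
\text{standard diffusion}:\quad
  & \|w_t - w_0\| \sim \sqrt{t} = d \;\Longrightarrow\; t \sim d^{2}.
\end{align*}
```

Under these assumptions, doubling the distance squares the required number of iterations in the ultra-slow case ($e^{2d} = (e^{d})^2$), whereas it only quadruples it under standard diffusion.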
5 Conclusion
We study the limit of the batch size for deep neural network training. For the first time, we scale the batch size on ImageNet to at least an order of magnitude larger than all previous works (819200 versus 64K), and provide detailed studies on the performance of many state-of-the-art optimization schemes under this setting. We propose an optimization recipe that improves the top-1 test accuracy by 18% compared to the baseline. We identify a “huge-batch” regime for realistic optimization tasks (e.g. ImageNet/ResNet50), where the optimization process becomes intrinsically harder and we cannot reach the target test accuracy in polynomial time by applying any of the current optimization techniques (empirical observation). The “ultra-slow diffusion” theory hoffer2017train no longer holds in this regime. Our results help system and algorithm designers to understand the limit of parallelism in realistic machine learning tasks and to design future machine learning hardware/algorithms accordingly. For more results and explanations, please see the Appendix.
Broader Impact
We study the potential applications of hugebatch training and fullbatch training, which may have an influence on future hardware design.
References
 [1] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch sgd: Training resnet50 on imagenet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.
 [2] Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. Memory efficient adaptive optimization. In Advances in Neural Information Processing Systems, pages 9746–9755, 2019.
 [3] Krste Asanovic, Rastislav Bodik, James Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Nelson Morgan, David Patterson, Koushik Sen, John Wawrzynek, et al. A view of the parallel computing landscape. Communications of the ACM, 52(10):56–67, 2009.
 [4] Jean-Philippe Bouchaud and Antoine Georges. Anomalous diffusion in disordered media: statistical mechanisms, models and physical applications. Physics reports, 195(4-5):127–293, 1990.
 [5] Alan J Bray and David S Dean. Statistics of critical points of gaussian fields on largedimensional spaces. Physical review letters, 98(15):150201, 2007.
 [6] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in highdimensional nonconvex optimization. In Advances in neural information processing systems, pages 2933–2941, 2014.
 [7] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(Jul):2121–2159, 2011.
 [8] Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, and Jonathan M Cohen. Stochastic gradient methods with layerwise adaptive moments for training of deep networks. arXiv preprint arXiv:1905.11286, 2019.
 [9] Noah Golmant, Nikita Vemuri, Zhewei Yao, Vladimir Feinberg, Amir Gholami, Kai Rothauge, Michael W Mahoney, and Joseph Gonzalez. On the computational inefficiency of large batch sizes for stochastic gradient descent. arXiv preprint arXiv:1811.12941, 2018.
 [10] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 [11] Vipul Gupta, Santiago Akle Serrano, and Dennis DeCoste. Stochastic weight averaging in parallel: Largebatch training that generalizes well. arXiv preprint arXiv:2001.02312, 2020.

 [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 [13] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.
 [14] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, et al. Highly scalable deep learning training system with mixedprecision: Training imagenet in four minutes. arXiv preprint arXiv:1807.11205, 2018.
 [15] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On largebatch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
 [16] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [17] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

 [18] kuangliu. Implementation of cifar10 by pytorch, 2020.
 [19] Sameer Kumar, Victor Bitorff, Dehao Chen, Chiachen Chou, Blake Hechtman, HyoukJoong Lee, Naveen Kumar, Peter Mattson, Shibo Wang, Tao Wang, et al. Scale mlperf0.6 models on google tpuv3 pods. arXiv preprint arXiv:1909.09756, 2019.

 [20] Yann LeCun et al. Lenet5, convolutional neural networks.
 [21] Mu Li. Scaling distributed machine learning with system and algorithm codesign. PhD thesis, Carnegie Mellon University, 2017.
 [22] Tao Lin, Sebastian U Stich, Kumar Kshitij Patel, and Martin Jaggi. Don’t use large minibatches, use local sgd. arXiv preprint arXiv:1808.07217, 2018.
 [23] Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
 [24] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
 [25] Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843, 2019.
 [26] Linjian Ma, Gabe Montague, Jiayu Ye, Zhewei Yao, Amir Gholami, Kurt Keutzer, and Michael W Mahoney. Inefficiency of kfac for large batch size training. arXiv preprint arXiv:1903.06237, 2019.
 [27] James Martens and Roger Grosse. Optimizing neural networks with kroneckerfactored approximate curvature. In International conference on machine learning, pages 2408–2417, 2015.
 [28] Hiroaki Mikami, Hisahiro Suganuma, et al. Imagenet/resnet50 training in 224 seconds. arXiv preprint arXiv:1811.05233, 2018.
 [29] MLperf. Implementation of resnet50 by mlperf, 2019.
 [30] NVIDIA. Nvidia clocks world’s fastest bert training time and largest transformer based model, paving path for advanced conversational ai, 2019.
 [31] Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Largescale distributed secondorder optimization using kroneckerfactored approximate curvature for deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12359–12367, 2019.

 [32] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning, pages 1310–1318, 2013.
 [33] Gabriel Pereyra, George Tucker, Jan Chorowski, Łukasz Kaiser, and Geoffrey Hinton. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548, 2017.
 [34] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimization towards training a trillion parameter models. arXiv preprint arXiv:1910.02054, 2019.
 [35] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by backpropagating errors. nature, 323(6088):533–536, 1986.
 [36] Christopher J Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018.
 [37] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatronlm: Training multibillion parameter language models using gpu model parallelism. arXiv preprint arXiv:1909.08053, 2019.
 [38] Leslie N Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 464–472. IEEE, 2017.
 [39] Samuel L Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V Le. Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489, 2017.
 [40] Daniel Soudry and Elad Hoffer. Exponentially vanishing suboptimal local minima in multilayer neural networks. arXiv preprint arXiv:1702.05777, 2017.
 [41] Google TPU Team. Implementation of lenet by google, 2019.
 [42] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
 [43] Masafumi Yamazaki, Akihiko Kasagi, Akihiro Tabuchi, Takumi Honda, Masahiro Miwa, Naoto Fukumoto, Tsuguchika Tabaru, Atsushi Ike, and Kohta Nakashima. Yet another accelerated sgd: Resnet50 training on imagenet in 74.7 seconds. arXiv preprint arXiv:1903.12650, 2019.
 [44] Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. Image classification at supercomputer scale. arXiv preprint arXiv:1811.06992, 2018.
 [45] Yang You, Igor Gitman, and Boris Ginsburg. Scaling sgd batch size to 32k for imagenet training. arXiv preprint arXiv:1708.03888, 6, 2017.
 [46] Yang You, Jing Li, Sashank Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. Large batch optimization for deep learning: Training bert in 76 minutes. In International Conference on Learning Representations, 2019.
 [47] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. arXiv preprint arXiv:1709.05011, 2017.
6 Appendix
6.1 Discussion on Ultra-Slow Diffusion Theory
As in Section 2 of the main text, we consider the stochastic gradient at each iteration and the batch size used to compute it. Here, we define the signal as the update in the direction of the true gradient and the noise as the update perpendicular to the true gradient. We also define the signal-to-noise ratio as the ratio between the expected power of the signal and the expected power of the noise.
In this section, we briefly discuss why training longer cannot close the generalization gap in the huge-batch and full-batch training regimes. In the “ultra-slow diffusion” theory [13], the relationship between the random walk distance and the number of iterations is technically sound and empirically supported by strong experimental results. However, we believe its ineffectiveness in the huge-batch and full-batch regimes is caused by a few potential mismatches between this theory and the huge/full-batch setting:

Based on the analysis of [15], we assume a minimum with a larger "width" will have a better generalization performance. However, it is not guaranteed that we find a minimum of a given "width" after traveling a distance equal to that width (although we need to travel at least that distance to find such a minimum). Figures 11 and 12 support our conjecture. Even though all the cases travel the same distance in Figure 11 or Figure 12, each one reaches a very different minimum. Figure 2 in [13] only shows batch sizes up to 2K. However, the optimization problem is significantly different once we enter the huge-batch regime.

They did not analyze the effect of the signal-to-noise ratio in stochastic gradients. The signal increases linearly with the batch size, whereas the noise increases only at a square-root rate with the batch size. Thus, increasing the batch size increases the signal-to-noise ratio in stochastic gradients, and the full-batch setting has no noise. The authors essentially normalized the covariance in the paper [13]. However, they did not consider the effect of the signal-to-noise ratio in large-batch training. The effect is even bigger in huge-batch training. When the noise is small compared to the signal, it is unclear whether the “ultra-slow diffusion theory” still holds.

In a real-world training process, the randomness of the stochastic potential is very limited. With limited training samples and data augmentation, the information is largely redundant and correlated after enough iterations for huge-batch training, violating the assumption of a random potential.
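The scaling argument in the second point can be checked numerically. The following is a minimal sketch, assuming each per-sample gradient is a scalar equal to a true gradient `mu` plus independent Gaussian noise of standard deviation `sigma`; this toy model, the function names and the default values are illustrative assumptions and are not taken from the experiments in this paper. Under these assumptions, the summed gradient has a signal that grows like the batch size and a noise that grows like its square root, so the power ratio grows roughly linearly with the batch size.

```python
import math
import random


def summed_gradient_stats(batch_size, mu=1.0, sigma=1.0, trials=4000, seed=0):
    """Estimate signal and noise of a summed mini-batch gradient.

    Each per-sample gradient is modeled as mu + sigma * eps with eps ~ N(0, 1)
    (an illustrative assumption). Returns (signal, noise): the empirical mean
    and standard deviation of the sum over `batch_size` samples.
    """
    rng = random.Random(seed)
    sums = [
        sum(rng.gauss(mu, sigma) for _ in range(batch_size))
        for _ in range(trials)
    ]
    mean = sum(sums) / trials
    var = sum((s - mean) ** 2 for s in sums) / (trials - 1)
    return mean, math.sqrt(var)


def snr_power(batch_size, **kwargs):
    """Ratio of signal power to noise power for the summed gradient."""
    signal, noise = summed_gradient_stats(batch_size, **kwargs)
    return signal ** 2 / noise ** 2


if __name__ == "__main__":
    for b in (16, 64, 256):
        signal, noise = summed_gradient_stats(b)
        print(b, round(signal, 2), round(noise, 2), round(snr_power(b), 1))
```

In the full-batch limit the sampling noise of this toy model vanishes entirely, which is the regime where the random-walk picture is least likely to apply.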
6.2 Additional Results
We do not have enough space in the main text, so we present additional results here. Figure 13 shows the top-5 accuracy for huge-batch ImageNet training. We can observe that LARS is very important for huge-batch training, which is in line with the results of previous large-batch training studies [14] [19] [44]. The batch size of CIFAR10 is 25K, which is at the boundary between the large-batch regime and the huge-batch regime. LARS can also help in this mixed regime.
Figure 14 shows the results of full-batch training. We scale the batch size of ImageNet training to 1.28 million and the batch size of CIFAR10 to 50K. The top-5 accuracy for full-batch ImageNet training and the top-1 accuracy for CIFAR10 training increase dramatically after applying a series of optimization techniques including LARS. Figure 15 shows the hyperparameter tuning results of full-batch CIFAR10 training with ResNet50 (200 epochs or 200 iterations, batch size = 50K). We only show the tuning region of the best hyperparameters. We can observe that LARS increases the accuracy of full-batch training. However, the current first-order optimizers cannot achieve the same accuracy as the baseline for full-batch training. Figure 16 shows the full-batch training results for MNIST. We partition the dataset into 60K/10K samples for training/testing and increase the batch size to 60K. Momentum SGD is our baseline optimizer, and LARS is the optimizer that achieves the best accuracy for us. Neither optimizer reaches the target accuracy (99.2%), but LARS achieves a higher accuracy than Momentum SGD. Figure 16 only shows the most effective hyperparameter tuning region. We can also observe that LARS is more stable than Momentum SGD for full-batch training. Figure 17 shows a similar result.
As the results in the main text show, training longer cannot close the generalization gap for huge-batch and full-batch training. Consider MNIST with LeNet (full-batch size = 60K). The baseline with a batch size of 256 achieves 99.2% testing accuracy in 30 epochs. With LARS, the baseline achieves 99.3% accuracy in 30 epochs. For huge-batch MNIST training, we set the batch size to 32768. From Figure 18 we can observe that even if we increase the number of epochs from 30 to 1000, we only achieve 99.16% accuracy, which still does not match the target accuracy. For full-batch training, the accuracy of 1000-epoch training is even lower, at only 98.3%. It is worth noting that a batch size of 256 achieves this level of accuracy in just 10 epochs. This means that there is a huge generalization gap between full-batch optimization and regular SGD optimization. For CIFAR10 training with ResNet50 (full-batch size = 50K), the baseline achieves 93.9% accuracy in 200 epochs. However, the full-batch version cannot reach the target accuracy even if we train it for a long time. The best accuracy we obtain is 92.58% after 5000 epochs (Figure 20).
6.3 Why can LARS help?
From the results in the previous section, we can observe that LARS makes a significant difference in huge-batch and full-batch training. We want to briefly explain why LARS helps in huge/full-batch training, which is beyond the scope of the original LARS paper [45]. As mentioned by Keskar et al. [15], a small-batch learner uses noisy gradients in the computation of each step. This noise can push the learner away from the basin of sharp minimizers. However, the noise is greatly reduced in large/full-batch training and is not enough to push the learner out of the basin of sharp minimizers. Thus, adding noise to a huge/full-batch learner may help. However, how should we add proper noise? We tried adding Gaussian noise and extensively tuning the hyperparameters, but it did not improve the testing accuracy. Specifically, we did the following experiments.

Add noise to activations, i.e. the outputs of each layer.

Add noise to weights, i.e. an alternative to the inputs.

Add noise to the gradients, i.e. the direction to update weights.

Add noise to the outputs, i.e. the labels or target variables.
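As an illustration of the third experiment in this list, the following is a minimal sketch of an SGD step with zero-mean Gaussian noise added to the gradient. The function name `noisy_sgd_step`, the flat list representation of the weights and the `noise_std` parameter are assumptions made for this example; they do not describe the actual experimental code.

```python
import random


def noisy_sgd_step(weights, grads, lr, noise_std, rng):
    """One plain SGD step with Gaussian noise added to each gradient entry.

    weights, grads: lists of floats of equal length.
    noise_std: standard deviation of the injected noise (a tunable
    hyperparameter in this sketch); 0.0 recovers plain SGD.
    rng: a random.Random instance, passed in for reproducibility.
    """
    return [
        w - lr * (g + rng.gauss(0.0, noise_std))
        for w, g in zip(weights, grads)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    print(noisy_sgd_step([1.0, 2.0], [0.5, -0.5], lr=0.1, noise_std=0.01, rng=rng))
```

The other three variants differ only in where the noise term is added (activations, weights, or labels).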
Keskar et al. did similar experiments, which also did not help (see https://openreview.net/forum?id=H1oyRlYgg&noteId=H1oyRlYgg). On the other hand, LARS computes the trust ratio at each iteration [45]:
(2) 
where the trust ratio depends on the layer ID, the batch size, the iteration ID, and the weight decay (e.g. 0.01). LARS then multiplies the learning rate by the trust ratio. From Figure 2 of [45] (https://arxiv.org/pdf/1708.03888v3.pdf) we can observe that:

For a different iteration, the learning rate is different.

For a different layer, the learning rate is different.

For a different batch size, the learning rate is different.
Therefore, we think the trust ratio of LARS can provide dynamics to the learning process. The dynamics may act as a proper noise to help the learner get out of the sharp minimum.
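To make the layer-wise behavior concrete, the following is a minimal sketch of a trust-ratio computation. It assumes the general form used in the LARS paper [45], namely the weight norm of a layer divided by the sum of its gradient norm and the weight decay times its weight norm; the exact expression in Equation (2) may differ, the trust coefficient of [45] is omitted, and the function names and the small `eps` stabilizer are assumptions of this example.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def lars_trust_ratio(weights, grads, weight_decay=0.01, eps=1e-12):
    """Assumed LARS-style trust ratio for one layer at one iteration.

    ratio = ||w|| / (||g|| + weight_decay * ||w||)
    eps only guards against division by zero in this sketch.
    """
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    return w_norm / (g_norm + weight_decay * w_norm + eps)


def layerwise_learning_rates(global_lr, layers, weight_decay=0.01):
    """Scale the global learning rate by each layer's trust ratio.

    layers: dict mapping a layer name to a (weights, grads) pair.
    """
    return {
        name: global_lr * lars_trust_ratio(w, g, weight_decay)
        for name, (w, g) in layers.items()
    }


if __name__ == "__main__":
    layers = {
        "conv1": ([3.0, 4.0], [0.6, 0.8]),    # ||w|| = 5, ||g|| = 1
        "fc": ([0.3, 0.4], [3.0, 4.0]),       # ||w|| = 0.5, ||g|| = 5
    }
    print(layerwise_learning_rates(0.1, layers))
```

Because the gradient norm changes with the iteration and with the batch size used to compute it, the effective learning rate of each layer varies along all three axes listed above.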
6.4 Identifying the huge-batch regime
As mentioned in the main text, we identified the huge-batch regime of ImageNet/ResNet50 as (64K, 1.28 million). Kumar et al. [19] reported that they can reach the target accuracy for ImageNet/ResNet50 at a batch size of 64K by using the LARS optimizer [45]. We use the training recipe of Kumar et al. [19] and an auto-tuner to maximize the accuracy of ImageNet/ResNet50 at a batch size of 128K. However, we cannot reach the target accuracy. Table 4 shows the best results that we can achieve. The best top-1 accuracy we obtain is only 73.37%, which is much lower than the target accuracy (75.9%). Thus, we identify batch sizes over 64K as huge-batch for ImageNet/ResNet50 training. Figure 21 shows that we can identify the huge-batch regime of MNIST/LeNet training as (8K, 60K). Figure 22 shows that we can identify the huge-batch regime of CIFAR10/ResNet50 training as (25K, 50K).
LR  warmup epochs  momentum  LR schedule  TPU chips  Top1 accuracy 

26  25  0.94429  cosine  256  72.92% 
28  25  0.94429  cosine  256  73.37% 
30  25  0.94429  cosine  256  73.08% 
32  25  0.94429  cosine  256  72.82% 
26  25  0.94429  cosine  512  73.14% 
28  25  0.95  cosine  512  73.30% 
30  25  0.95  cosine  512  73.14% 
32  25  0.95  cosine  512  72.82% 
6.5 Learning Rate Schedule
Since the learning rate is the most important hyperparameter, we tried several different learning rate schedules. Specifically, they include a regular cosine learning rate schedule (Figure 23), regular learning rate warmup and poly learning rate decay (Figure 24), regular learning rate warmup and a fine-grained cosine learning rate schedule (Figure 25), regular learning rate warmup and a regular cosine learning rate schedule (Figure 26), regular learning rate warmup and a coarse-grained cosine learning rate schedule (Figure 27), regular learning rate warmup and cosine learning rate decay (Figure 28), and cosine learning rate warmup and cosine learning rate decay (Figure 29). We pick the best schedule for each application.
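Two of these schedule families can be sketched as follows: linear warmup followed by cosine decay, and linear warmup followed by polynomial decay. This is a minimal sketch assuming a per-step schedule with a linear warmup; the function names, the warmup shape and the `power` value are illustrative assumptions, not the exact settings used for the figures.

```python
import math


def _warmup(step, warmup_steps, peak_lr):
    """Linear warmup that reaches peak_lr on the last warmup step."""
    return peak_lr * (step + 1) / warmup_steps


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup, then cosine decay from peak_lr down to 0."""
    if step < warmup_steps:
        return _warmup(step, warmup_steps, peak_lr)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def warmup_poly_lr(step, total_steps, warmup_steps, peak_lr, power=2.0):
    """Linear warmup, then polynomial decay from peak_lr down to 0."""
    if step < warmup_steps:
        return _warmup(step, warmup_steps, peak_lr)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * (1.0 - progress) ** power


if __name__ == "__main__":
    for step in (0, 9, 10, 55, 100):
        print(step,
              round(warmup_cosine_lr(step, 100, 10, 1.0), 4),
              round(warmup_poly_lr(step, 100, 10, 1.0), 4))
```

A coarse-grained variant could update the rate once per epoch instead of once per step by passing the epoch index as `step`.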
For example, we tried the cyclical learning rate [38] scheme in MNIST training. This brings an additional hyperparameter: the cycle length. If we do not tune it carefully, the testing accuracy can be significantly hurt (the error rate increases from 4.2% to 12.7% in Figure 32). After tuning it, we get a slightly better accuracy (the error rate decreases from 4.2% to 3.8%), as shown in Figure 33. However, in both figures the training loss peaks in the middle of training and remains very high at the end of training.
6.6 Hardware Usage Discussion
BERT pretraining is a good example of a deep learning application on supercomputers. BERT pretraining can make full use of the most powerful TPU-based supercomputers: researchers are able to scale it to 1024 TPU chips [46]. However, current huge-batch algorithms cannot make full use of supercomputers for some applications such as ImageNet. What can we do to make better use of supercomputers? We give the following suggestions for hardware and algorithm designers:

Using model parallelism and pipeline parallelism together with data parallelism, which can maximize the potential performance.

Using larger samples, which may also be a future trend. For ImageNet with ResNet50, a batch size of 32 keeps a GPU busy, so a batch size of 32K can keep 1024 GPUs busy. For large images, even a batch size of 1 may keep a GPU busy (e.g. 1920x1080 HD images on ResNet50), so a batch size of 32K can keep 32K GPUs busy.

Trying a second-order approach such as K-FAC [27].
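The arithmetic behind the larger-sample suggestion is a simple division of the global batch size by the per-GPU batch size needed to saturate one device. The following is a minimal sketch under pure data parallelism; the function name is hypothetical, and 32K is taken to mean 32 x 1024 = 32768.

```python
def gpus_kept_busy(global_batch_size, per_gpu_batch_size):
    """Number of GPUs that can each receive a saturating local batch.

    per_gpu_batch_size: the smallest local batch that keeps one GPU busy
    (32 for ImageNet/ResNet50 and 1 for HD images in the examples above).
    """
    if per_gpu_batch_size <= 0:
        raise ValueError("per_gpu_batch_size must be positive")
    return global_batch_size // per_gpu_batch_size


if __name__ == "__main__":
    print(gpus_kept_busy(32 * 1024, 32))  # ImageNet-sized samples
    print(gpus_kept_busy(32 * 1024, 1))   # large HD samples
```

With a fixed global batch size, larger samples therefore expose more data parallelism without entering the huge-batch regime.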