The Limit of the Batch Size

06/15/2020 ∙ by Yang You, et al. ∙ The University of Texas at Austin ∙ UC Berkeley

Large-batch training is an efficient approach for current distributed deep learning systems. It has enabled researchers to reduce ImageNet/ResNet-50 training from 29 hours to around 1 minute. In this paper, we focus on studying the limit of the batch size, which may provide guidance to AI supercomputer and algorithm designers. We provide detailed numerical optimization instructions for step-by-step comparison. Moreover, it is important to understand the generalization and optimization performance of huge-batch training. Hoffer et al. introduced the "ultra-slow diffusion" theory for large-batch training; however, our experiments show results that contradict their conclusion. We provide comprehensive experimental results and detailed analysis to study the limitations of batch size scaling and the "ultra-slow diffusion" theory. For the first time, we scale the batch size on ImageNet to at least an order of magnitude larger than all previous work, and provide detailed studies on the performance of many state-of-the-art optimization schemes under this setting. We propose an optimization recipe that improves the top-1 test accuracy by 18% compared to the baseline.


1 Introduction

Large-batch optimization is becoming an important research topic as it is an efficient approach for current distributed deep learning systems. It has enabled researchers to reduce ImageNet/ResNet-50 training from 29 hours to around 1 minute (Table 1). Researchers can also reduce BERT pre-training time from 3 days to 76 minutes you2019large, 67 minutes shoeybi2019megatron, 47 minutes nvidia2019bert, and 44 minutes rajbhandari2019zero. The speedup comes from the fact that researchers can scale the training to a larger batch size, and thus exploit larger-scale parallelism, without losing accuracy in a fixed number of epochs. For example, researchers scaled the batch size of ImageNet training from 256 he2016deep to 1K krizhevsky2014one, 5K li2017scaling, 8K goyal2017accurate, 16K you2017imagenet, 32K ying2018image, and 64K kumar2019scale. On the other hand, hardware designers are also interested in large-batch training, as chip makers can no longer keep improving a single processor's clock frequency due to power-consumption issues asanovic2009view. Thus, hardware vendors have to use more processors to increase raw floating-point performance, which requires high parallelism from future algorithms. Many successful use cases of large-batch training have encouraged industry to design and implement AI supercomputers such as the Google TPU Pod, NVIDIA SuperPOD, and Huawei Atlas 900 AI Cluster to take advantage of the extremely high parallelism provided by large-batch algorithms.

Teams Date Accuracy Time Optimizer
He et al. he2016deep 12/10/2015 75.3% 29h Momentum (Rumelhart et al. rumelhart1986learning )
Goyal et al. goyal2017accurate 06/08/2017 76.3% 65m Momentum (Rumelhart et al. rumelhart1986learning )
You et al. you2017imagenet 11/02/2017 75.3% 48m LARS (You et al. you2017scaling )
You et al. you2017imagenet 11/07/2017 75.3% 31m LARS (You et al. you2017scaling )
Akiba et al. akiba2017extremely 11/12/2017 74.9% 15m RMSprop (Hinton tieleman2012lecture )
You et al. you2017imagenet 12/07/2017 74.9% 14m LARS (You et al. you2017scaling )
Jia et al. jia2018highly 07/30/2018 75.8% 6.6m LARS (You et al. you2017scaling )
Mikami et al. mikami2018imagenet 11/14/2018 75.0% 3.7m LARS (You et al. you2017scaling )
Ying et al. ying2018image 11/16/2018 76.3% 2.2m LARS (You et al. you2017scaling )
Yamazaki et al. yamazaki2019yet 03/29/2019 75.1% 1.25m LARS (You et al. you2017scaling )
Kumar et al. kumar2019scale 10/02/2019 75.9% 67.1s LARS (You et al. you2017scaling )
Table 1: ImageNet/ResNet-50 Training Speed Records.

In this paper, we focus on studying the limit of the batch size and hope to provide guidance to AI supercomputer and algorithm designers. To the best of our knowledge, we have not found any previous works studying a batch size over 128K. We pick some realistic applications like ImageNet/ResNet-50, and provide detailed numerical optimization instructions for step-by-step accuracy comparison. Moreover, it is important to understand the generalization and optimization difficulties encountered when batch size is huge. There are a series of works hoffer2017train keskar2016large smith2017don on this topic. Specifically, Hoffer et al. hoffer2017train introduced “ultra-slow diffusion” theory to large-batch training. They pointed out that there is an “ultra-slow” logarithmic increase in the distance of the weights from their initialization. Their paper indicated that, under a series of optimization techniques, the generalization performance can be controlled by the number of iterations. However, our experiments show contradictory results once the batch size is increased beyond a certain boundary, which we refer to as the “huge-batch” regime. We provide comprehensive experimental results and detailed analysis to study the limitations of batch size scaling and the “ultra-slow diffusion” theory.

Concretely, we scale the batch size of ImageNet/ResNet-50 to 819K and 1.28 million, which is an order of magnitude larger than any previous work. This is the first work to report an accuracy for huge/full-batch ImageNet/ResNet-50 training. We also scale the batch size to the full dataset for MNIST, CIFAR-10, and ImageNet. The contributions of our paper can be summarized as follows:

  • For the first time, we scale the batch size on ImageNet to at least an order of magnitude larger than all previous work, and provide detailed studies on the performance of many state-of-the-art optimization schemes under this setting. We propose an optimization recipe that improves the top-1 test accuracy by 18% compared to the baseline.

  • We identify a “huge-batch” regime for realistic optimization tasks (e.g. ImageNet/ResNet-50), where the optimization process becomes intrinsically harder and we cannot reach the target test accuracy in polynomial time by applying any of the current optimization techniques. The “ultra-slow diffusion” theory no longer holds in this regime.

  • Our results help system researchers and designers to understand the limit of parallelism in realistic machine learning tasks and design future machine learning hardware accordingly.

2 Background and Related Work

For any vector $x \in \mathbb{R}^d$, either $x_j$ or $[x]_j$ is used to denote its $j$-th coordinate, where $j \in [d]$. For any function $f$, we use $\nabla f(x)$ to denote the gradient with respect to $x$. We use $\|\cdot\|$ to denote the $\ell_2$-norm of a vector. We study nonconvex stochastic optimization problems of the form

$\min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{s \sim \mathbb{P}}\left[\ell(x, s)\right]$,  (1)

where $\ell$ is a smooth (possibly nonconvex) function and $\mathbb{P}$ is a probability distribution on the domain $\mathcal{S}$. Here, $x$ corresponds to the model parameters, $\ell$ is the loss function, and $\mathbb{P}$ is an unknown data distribution. Stochastic gradient descent (SGD) is a method for solving the problem in Equation (1). The update at the $t$-th iteration of SGD is of the following form:

$x_{t+1} = x_t - \eta_t \frac{1}{|\mathcal{S}_t|} \sum_{s_t \in \mathcal{S}_t} \nabla \ell(x_t, s_t)$,  (SGD)

where $\mathcal{S}_t$ is a set of random samples drawn from the distribution $\mathbb{P}$ and $\eta_t$ is the learning rate at iteration $t$. Several SGD variants (e.g. duchi2011adaptive, kingma2014adam, tieleman2012lecture) have been proposed in recent years. We assume the batch size of our baseline is $B_{base}$, and that the baseline achieves a testing accuracy of $A_{base}$ and a validation loss of $L_{base}$ with a learning rate of $\eta_{base}$ in $E_{base}$ epochs. We define three optimization regimes with batch sizes $B_l$, $B_h$, and $B_f$:
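To make the notation concrete, the following is a minimal sketch of the (SGD) update above on a toy quadratic loss; the loss function, data, and names (loss_grad, sgd_step) are illustrative assumptions, not part of the paper's experiments.

    import numpy as np

    rng = np.random.default_rng(0)

    def loss_grad(x, sample):
        """Gradient of a per-sample loss l(x, s) = 0.5 * ||x - s||^2."""
        return x - sample

    def sgd_step(x, minibatch, lr):
        """x_{t+1} = x_t - eta_t * (1/|S_t|) * sum of per-sample gradients."""
        grads = np.stack([loss_grad(x, s) for s in minibatch])
        return x - lr * grads.mean(axis=0)

    # Toy run: samples drawn from an unknown distribution P.
    data = rng.normal(loc=1.0, scale=0.5, size=(1024, 4))
    x = np.zeros(4)
    for t in range(100):
        batch = data[rng.choice(len(data), size=32, replace=False)]  # S_t
        x = sgd_step(x, batch, lr=0.1)
    print(x)  # approaches the data mean, the minimizer of the toy objective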

Definition 2.1.

Large Batch: If we use a batch size $B$ ($B > B_{base}$) and a learning rate of $\eta_B$, then we can get a testing accuracy of $A_{base}$ (or higher) and a validation loss of $L_{base}$ (or lower) in $E_{base}$ epochs. We define a batch size $B$ to be a large batch if $B_l \le B \le B_h$ (examples in Table 2).

Definition 2.2.

Huge Batch: If we use a batch size $B$ ($B > B_h$) with currently available optimization techniques, then we cannot get a testing accuracy of $A_{base}$ in $E_{base}$ epochs. We define a batch size $B$ to be a huge batch if $B_h < B < B_f$ (examples in Table 2).

Definition 2.3.

Full Batch: We define a positive integer $B_f$, which is equal to the total number of available training samples at runtime (examples in Table 2).

Keskar et al. keskar2016large reported that traditional first-order optimization techniques fail to scale the batch size to a very large number: large-batch training hurts the testing accuracy. They indicate that there is an inherent generalization gap in large-batch optimization within a fixed number of epochs. Hoffer et al. hoffer2017train suggested that training longer can close the generalization gap. Since then, researchers have developed a series of optimization techniques and scaled the batch size to 64K without losing accuracy for various applications within a fixed number of epochs (Table 1). There is also other related literature anil2019memory ginsburg2019stochastic gupta2020stochastic golmant2018computational krizhevsky2014one lin2018don liu2019variance ma2019inefficiency osawa2019large shallue2018measuring smith2017don. Inspired by these works, we focus on a different direction. As far as we know, there is no previous work studying huge-batch optimization (e.g. over 128K). Our investigation includes: (1) pushing the batch size to the limit and maximizing the accuracy in a fixed number of epochs; (2) studying the optimization problems of huge-batch and full-batch training; (3) studying the relationship between generalization and optimization.

3 Optimization

In this section, we use a series of techniques to maximize the accuracy of huge/full-batch training. The information about our applications is in Table 2. Due to lack of space, some experimental results are in the Appendix. The first step of our study is to identify the “huge-batch” regime, where we conduct experiments with extremely large batch sizes. We detail the process of finding the “huge-batch” boundaries in Table 2 and provide more results on how accuracy drops with batch size in the Appendix.

Dataset Model Implementation Baseline Batch Epochs Top-1 Accuracy Large Batch Huge Batch Full Batch
ImageNet ResNet-50 v1.5 he2016deep resnet2019mlperf 256 90 75.9% [2K, 64K] (64K, 1.28M) 1.28M
MNIST LeNet lecun2015lenet lenet2019google 256 30 99.2% [2K, 8K] (8k, 60k) 60K
CIFAR-10 ResNet-50 v1 he2016deep resnet2020cifar 128 200 93.9% [2K, 25K] (25k, 50k) 50K
Table 2: The baseline information of our applications (e.g. for ImageNet, $B_l$ = 2K, $B_h$ = 64K, $B_f$ = 1.28M).

3.1 Huge Batch

Researchers are able to scale the batch size of ImageNet/ResNet-50 to 64K and achieve 75.9% top-1 accuracy in 90 epochs kumar2019scale. As far as we know, no one has reported that the batch size of ImageNet/ResNet-50 can be scaled beyond 64K without losing accuracy. The baseline of our implementation uses the Momentum SGD optimizer. For a batch size of 819200, the training diverges and only achieves a top-1 accuracy of 0.0977%. We train it for 141 steps, which corresponds to the same number of epochs as the small-batch baseline. We investigate the effectiveness of existing large-batch optimization techniques in the huge-batch regime, starting from simple Momentum SGD and moving to more advanced techniques proposed by recent works pereyra2017regularizing ; kumar2019scale ; ying2018image ; goyal2017accurate ; hoffer2017train :

  • Optimization 0: this is our baseline; we use the Momentum SGD optimizer with linear learning rate scaling (or sqrt scaling if it is better) krizhevsky2014one, cosine learning rate decay loshchilov2016sgdr (see Appendix), gradient clipping pascanu2013difficulty, and ghost batch normalization hoffer2017train. The top-1 accuracy is 0.0977%.

  • Optimization 1: we add learning rate warmup goyal2017accurate , which allows us to use linear learning rate scaling without diverging. We improved the top-1 accuracy from 0.0977% to 2.712%.

  • Optimization 2: we tune the learning rate and the number of warmup epochs with an auto-tuner; however, we found that grid search can roughly match the auto-tuner's accuracy within a reasonable computing budget. We improved the top-1 accuracy from 2.712% to 4.563%.

  • Optimization 3: we add label smoothing pereyra2017regularizing, a useful technique in large-batch training (a minimal sketch follows this list). The accuracy did not increase, but it also did not significantly decrease: it changed from 4.563% to 4.226%.

  • Optimization 4: we switch the optimizer from Momentum SGD to LARS. In this way, we can increase the accuracy from 4.226% to 18.95%.
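To make Optimization 3 concrete, here is a minimal sketch of label smoothing as we understand the standard formulation in pereyra2017regularizing: the one-hot target is mixed with a uniform distribution over the classes before computing the cross-entropy. The function names and the smoothing value of 0.1 are illustrative assumptions.

    import numpy as np

    def smooth_labels(labels, num_classes, epsilon=0.1):
        """Return soft targets: (1 - eps) on the true class, eps/K spread uniformly."""
        one_hot = np.eye(num_classes)[labels]
        return (1.0 - epsilon) * one_hot + epsilon / num_classes

    def cross_entropy(logits, soft_targets):
        """Cross-entropy between soft targets and softmax(logits)."""
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -(soft_targets * log_probs).sum(axis=1).mean()

    labels = np.array([0, 2, 1])
    logits = np.random.default_rng(0).normal(size=(3, 10))
    loss = cross_entropy(logits, smooth_labels(labels, num_classes=10, epsilon=0.1))
    print(loss)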

The results are summarized in Figure 1. Since LARS gives us the largest boost in accuracy, we further study its sensitivity to hyperparameters in Figure 2 by varying the two most important hyperparameters: the learning rate and the number of warmup epochs. From Figure 2, we can observe that the advantage of the layer-wise adaptive technique provided by the LARS optimizer is maintained over a wide range of hyperparameter settings, while Momentum SGD fails to obtain good accuracy regardless of hyperparameter tuning. We conclude that LARS makes a real difference in huge-batch learning.

Figure 1: msgd: Momentum SGD optimizer with linear learning rate (LR) scaling, cosine LR decay, gradient clipping and ghost batch norm; 1: LR warmup; 2: tuning LR and warmup epochs; 3: label smoothing; 4: LARS. This figure shows the top-1 accuracy for huge-batch ImageNet training.
(a) Momentum SGD
(b) LARS
Figure 2: The results of ImageNet training with ResNet-50 (90 epochs or 141 iterations, batch size = 819200). We only show the two most important hyper-parameters and the tuning region with the highest accuracy. We can observe that LARS is essential in huge-batch learning.

We also study an application with a relatively small number of training samples: MNIST with LeNet. We partition the dataset as 55K/5K/10K samples for training/validation/testing (in Table 2 we partition the dataset as 60K/10K samples for training/testing). Using Momentum SGD, the baseline achieves 99.2% accuracy in 30 epochs with a batch size of 256. As mentioned in Table 2, 8K is the boundary between the large-batch region and the huge-batch region. When we scale the batch size to 8K, the baseline (Momentum SGD) diverges during training. So we tried other state-of-the-art optimizers: AdaGrad, RMSprop, Adam, and LARS. AdaGrad and RMSprop cannot achieve an accuracy over 99%, and Adam has a slight accuracy loss. However, LARS achieves a better accuracy than the baseline. To find the reason behind this, we note that one key feature of LARS is its layer-wise adaptive learning technique, so we add the layer-wise technique to all the other optimizers. After that, SGD and AdaGrad can get the same accuracy as the baseline; however, Adam becomes very unstable and RMSprop diverges in the middle of training. We find the reason is that the trust ratio (in its simplest form, for SGD, the ratio between the L2 norm of the weights and the L2 norm of the gradients) of layer-wise learning is either too large or too small for Adam and RMSprop. So we add a bound to the trust ratio for the Adam optimizer, a similar idea to AdaBound luo2019adaptive. By doing so, it can also achieve 99.4% accuracy in 30 epochs with a batch size of 8K, the same as LARS. It is worth noting that Adam + layer-wise + bound-ratio is a special case of the LAMB optimizer you2019large. The results are shown in Table 3, and Figure 3 shows the training details of the LARS optimizer.
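The following is a minimal sketch of the "layer-wise + bound-ratio" idea described above, applied to whatever per-layer update a base optimizer (SGD, Adam, RMSprop) proposes; the clipping bounds, epsilon, and function names are illustrative assumptions, not the exact recipe used in our experiments.

    import numpy as np

    def bounded_layerwise_step(weights, updates, lr, lo=0.01, hi=10.0, eps=1e-8):
        """Apply a layer-wise scaled update to each layer's weight tensor.

        `updates` is whatever the base optimizer proposes for each layer; the
        trust ratio ||w|| / ||update|| rescales it, and clipping keeps the ratio
        from becoming too large or too small (the instability we observed for
        Adam and RMSprop without the bound).
        """
        new_weights = []
        for w, u in zip(weights, updates):
            w_norm = np.linalg.norm(w)
            u_norm = np.linalg.norm(u)
            trust_ratio = np.clip(w_norm / (u_norm + eps), lo, hi)
            new_weights.append(w - lr * trust_ratio * u)
        return new_weights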

Solver batch=256 batch=8K (original) batch=8K (+ layer-wise) batch=8K (+ layer-wise and bound-ratio)
SGD 99.2% diverge 99.2% 99.2%
AdaGrad 99.2% 98.3% 99.3% 99.3%
RMSprop 99.2% 98.6% diverge 99.0%
Adam 99.3% 99.1% unstable 99.4%
LARS 99.3% 99.4% N/A N/A
Table 3: Scaling the MNIST/LeNet batch size to 8K (test accuracy).
Figure 3: LARS optimizer with learning rate warmup and polynomial learning rate decay. The baseline with a batch size of 256 can achieve 99.2% test accuracy (0.8% test error rate) in 30 epochs.

Since LARS with learning rate warmup and polynomial decay gives us the best performance for large-batch MNIST training, we use this scheme for huge-batch MNIST training. We increase the batch size from 8K to 32K. However, the testing error rate increases from 0.8% to 4.2% (Figure 4), and with other optimizers the error rate is even higher (8%). In Figure 4, we can observe that huge-batch training may lead the optimizer down a wrong path so that the model is not optimized well: in the middle of training, even though the training error rate and validation error rate are becoming lower, the training loss is increasing. Thus, in this situation, huge-batch training suffers not only a generalization problem but also an optimization problem. After trying various optimization techniques, we find that only the LAMB optimizer with an extremely long learning rate warmup and polynomial learning rate decay can stabilize the training and generalize well (Figure 5). This scheme gets 98.7% testing accuracy. However, even with this scheme, we cannot reach the target testing accuracy of the baseline: huge-batch optimization may have an inherent generalization problem that cannot be resolved by any existing techniques. Huge-batch ImageNet/ResNet-50 training has a similar problem but is much worse than MNIST/LeNet: the target top-1 accuracy is 75.9% in 90 epochs, but a huge batch of 819200 only achieves an accuracy lower than 20%. We conclude that huge-batch training suffers from generalization problems, and sometimes also optimization problems, with all current first-order optimization techniques.

Figure 4: LARS optimizer with learning rate (LR) warmup and polynomial LR decay.
Figure 5: LAMB optimizer with extremely long LR warmup epochs and polynomial LR decay.

3.2 Full Batch

In this section, we increase the batch size from the huge-batch region to the full-batch case. For ImageNet/ResNet-50, we increase the batch size from 819200 to 1.28 million, which, to our knowledge, is the first time this has been demonstrated in the literature. For MNIST/LeNet, we partition the dataset as 60K/10K for training/testing and increase the batch size from 32768 to 60K. We use the optimization techniques from Section 3.1. Even though we increase the batch size by less than a factor of two, the accuracy becomes much lower. The results are shown in Figures 6 and 7. We can observe that LARS also makes a significant difference in full-batch training; however, we cannot reach the target testing accuracy in 90 epochs. Momentum SGD is our baseline optimizer, and LARS is the optimizer achieving the best accuracy for us. For MNIST, full-batch training cannot reach the 99.2% target accuracy (results in the Appendix), but LARS achieves a higher accuracy than Momentum SGD. We can also observe that LARS is more stable than Momentum SGD for full-batch training.

Figure 6: msgd: Momentum SGD optimizer with linear learning rate (LR) scaling, cosine LR decay, gradient clipping and ghost batch norm; 1: LR warmup; 2: tuning LR and warmup epochs; 3: label smoothing; 4: LARS. This figure shows the top-1 accuracy for full-batch ImageNet training.
(a) Momentum SGD
(b) LARS
Figure 7: Impact of hyperparameters on ImageNet/ResNet-50 in the full batch regime (90 epochs or 90 iterations, batch size = 1.28 million). We only show the two most important hyper-parameters and the tuning region with highest accuracy. We can see that LARS is very useful in full-batch learning.

4 Discussion

4.1 Train longer, generalize better?

We focus on ImageNet/ResNet-50 training in Sections 4.1 and 4.2. Our experiments indicate that huge/full-batch training suffers a serious generalization problem. Hoffer et al. hoffer2017train suggest that training longer will lead to better generalization for large-batch training; we want to study the validity of this suggestion for huge/full-batch training. A large batch size of 32K is able to reach a testing accuracy of 76.7% in just 90 epochs, whereas huge-batch training (batch size = 819200) only achieves 19% accuracy in 90 epochs. By increasing the number of epochs to 1200, we can reach around 70% accuracy. However, even with an extravagant computing budget, we cannot reach the target accuracy by just training longer. Figure 8 shows that there is almost no accuracy improvement as we increase the number of epochs from 3000 to 10000. For a huge batch size of 819200, the best accuracy we observed is 71.8%; for full-batch training, the best accuracy we can get within 15000 epochs is 71.1%. We hypothesize that there is an inherent generalization gap in huge/full-batch optimization and simply training longer cannot close this gap.

Figure 8: For a realistic application like ImageNet/ResNet-50, training longer cannot close the generalization gap for huge-batch and full-batch training. A batch size of 32K is able to reach a testing accuracy of 76.7% in just 90 epochs. However, even if we increase the number of epochs to over 10000, huge-batch and full-batch training cannot reach the target accuracy.

A typical approach on distributed systems is to perform batch normalization (BN) per replica, which reduces the cross-device communication cost. However, distributed BN may also have an impact on the testing accuracy. For example, Ying et al. observed that using an effective batch size of 64 for batch normalization can lead to the best accuracy (Figure 6 of ying2018image). We use the same approach as Ying et al. ying2018image; Ghost Batch Normalization by Hoffer et al. hoffer2017train is a similar idea. We only conduct the reduction across a few peers to compute the mean and variance over a subset of all the replicas. In our experiments, we achieved the best accuracy on 256 TPU v3 cores for a batch size of 32K, so the ratio between the batch size and the number of cores is 128. We keep this ratio as we increase the number of cores. The current largest TPU-based supercomputer has 2048 cores, which corresponds to a batch size of 256K. That means the largest useful batch size for Ghost Batch Norm on the current TPU supercomputer is 256K. However, we believe 256K is in the huge-batch region: we only achieve 65% accuracy in 90 epochs, and even if we increase the number of epochs from 90 to 500, we cannot reach the target accuracy (Figure 9). This means the current optimization algorithm is not scalable enough to make full use of the current hardware. To make better use of future hardware, we offer advice for system/hardware designers in the Appendix.
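For reference, the following is a minimal sketch of ghost batch normalization in the spirit of hoffer2017train: normalization statistics are computed over small "ghost" sub-batches (e.g. 64 samples each) rather than over the whole replica batch. It omits the learnable scale/shift parameters and running statistics; the names and sizes are illustrative assumptions.

    import numpy as np

    def ghost_batch_norm(x, ghost_size=64, eps=1e-5):
        """Normalize a (batch, features) array per ghost batch of `ghost_size`."""
        out = np.empty_like(x, dtype=np.float64)
        for start in range(0, x.shape[0], ghost_size):
            chunk = x[start:start + ghost_size]
            mean = chunk.mean(axis=0, keepdims=True)
            var = chunk.var(axis=0, keepdims=True)
            out[start:start + ghost_size] = (chunk - mean) / np.sqrt(var + eps)
        return out

    # With a batch-size-to-core ratio of 128, each replica would further split
    # its local 128 samples into two ghost batches of 64.
    activations = np.random.default_rng(0).normal(size=(128, 16))
    normalized = ghost_batch_norm(activations, ghost_size=64)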

Figure 9: Because of distributed batch normalization, the largest useful batch size for the current TPU Pod is 256K. However, we can not reach the target accuracy for ImageNet/ResNet-50.

4.2 Optimization and Generalization

Hoffer et al. hoffer2017train introduced the “ultra-slow diffusion” theory to explain the generalization of large-batch training. According to some statistical physics models, even though the shape of the loss function cannot be visualized, we can describe the complicated DNN learning process as a random process on some potential. Specifically, they use “Random Walk on a Random Potential” to model this complicated optimization and generalization problem. Based on a series of theoretical works bouchaud1990anomalous ; bray2007statistics ; dauphin2014identifying ; soudry2017exponentially, they built a relationship between the number of iterations $t$ and the weight $w_t$ at the $t$-th iteration: $\|w_t - w_0\| \sim (\log t)^{2/\alpha}$. It is worth noting that the typical relationship in standard diffusion (on a flat potential) is $\|w_t - w_0\| \sim \sqrt{t}$. They speculate $\alpha$ to be 2, which means the distance between the weight $w_t$ and the initial weight $w_0$ increases logarithmically with the number of iterations. They reached the argument that the optimizer needs to travel at least a distance of $d$ to find a minimum of width $d$, which takes about $e^d$ iterations. Essentially, their conclusion implies that, given the best optimization technique, the quality of a minimizer depends on the number of iterations regardless of the batch size. Thus, the authors proposed “regime adaptation”, which suggests using the same number of iterations as the baseline (e.g. batch size = 256) regardless of the batch size. However, our results show that huge/full-batch training cannot reach the target accuracy even with a large number of epochs. A natural question to ask is: why can longer training not lead to better generalization in the huge/full-batch regime? Unfortunately, the relationship between optimization and generalization is missing in their theory. Our experiments (Figure 10) show that training longer leads to a better training loss; however, it cannot reach the target testing accuracy of the baseline. By running the same number of iterations, full-batch training can achieve an even lower training loss than the 32K batch size. This indicates that there is an inherent generalization gap in huge/full-batch optimization that cannot be closed by any of the existing techniques, which should be kept in mind by hardware designers when designing systems for machine learning.

5 Conclusion

Figure 10: Training longer does not generalize better.

We study the limit of the batch size for deep neural network training. For the first time, we scale the batch size on ImageNet to at least an order of magnitude larger than all previous work (819200 versus 64K), and provide detailed studies on the performance of many state-of-the-art optimization schemes under this setting. We propose an optimization recipe that improves the top-1 test accuracy by 18% compared to the baseline. We identify a “huge-batch” regime for realistic optimization tasks (e.g. ImageNet/ResNet-50), where the optimization process becomes intrinsically harder and we cannot reach the target test accuracy in polynomial time by applying any of the current optimization techniques (an empirical observation). The “ultra-slow diffusion” theory hoffer2017train no longer holds in this regime. Our results help system and algorithm designers understand the limit of parallelism in realistic machine learning tasks and design future machine learning hardware/algorithms accordingly. For more results and explanations, please see the Appendix.

Broader Impact

We study the potential applications of huge-batch training and full-batch training, which may have an influence on future hardware design.

References

6 Appendix

6.1 Discussion on Ultra-Slow Diffusion Theory

As in Section 2 of the main text, let us refer to $g_t$ as the stochastic gradient at the $t$-th iteration and $B$ as the batch size. Here, we define the signal as the component of the update in the direction of the true gradient ($g_t^{\parallel}$) and the noise as the component of the update perpendicular to the true gradient ($g_t^{\perp}$). We also define the signal-to-noise ratio as the ratio between the expected power of the signal and the expected power of the noise: $\mathrm{SNR} = \mathbb{E}\left[\|g_t^{\parallel}\|^2\right] / \mathbb{E}\left[\|g_t^{\perp}\|^2\right]$.
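The following is a minimal sketch of how this signal-to-noise ratio could be estimated empirically on a toy problem: each mini-batch gradient is decomposed into its component along the full-dataset gradient (signal) and the perpendicular remainder (noise). The toy loss and sampling scheme are our own illustrative assumptions; the sketch only demonstrates that the estimated SNR grows with the batch size.

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=1.0, scale=2.0, size=(4096, 8))
    x = np.zeros(8)

    def grad(x, samples):
        # Per-sample loss l(x, s) = 0.5 * ||x - s||^2, averaged over `samples`.
        return x - samples.mean(axis=0)

    def gradient_snr(x, data, batch_size, trials=200):
        g_true = grad(x, data)                  # "true" (full-dataset) gradient
        unit = g_true / np.linalg.norm(g_true)
        signal_power, noise_power = 0.0, 0.0
        for _ in range(trials):
            idx = rng.choice(len(data), size=batch_size, replace=False)
            g = grad(x, data[idx])
            parallel = (g @ unit) * unit        # component along the true gradient
            perp = g - parallel                 # component perpendicular to it
            signal_power += np.sum(parallel ** 2)
            noise_power += np.sum(perp ** 2)
        return signal_power / noise_power

    for b in (32, 256, 2048):
        print(b, gradient_snr(x, data, b))      # the estimated SNR grows with b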

In this section we briefly discuss why training longer cannot close the generalization gap in the huge-batch and full-batch regimes. In the “ultra-slow diffusion” theory [13], the relationship between the random-walk distance and the number of iterations is technically sound and empirically supported by strong experimental results. However, we believe its ineffectiveness in the huge- and full-batch regimes is caused by a few potential mismatches between the theory and the huge/full-batch setting:

  • Based on the analysis of [15], we assume a minimum with a larger "width" will have better generalization performance. However, it is not guaranteed that we can find a minimum of width $d$ after traveling a distance of $d$ (although we need to travel a distance of at least $d$ to find a minimum of width $d$). Figures 11 and 12 support our conjecture: even though all the cases travel the same distance in Figure 11 or Figure 12, each one reaches a very different minimum. Figure 2 in [13] only shows batch sizes up to 2K; the optimization problem is significantly different once we enter the huge-batch regime.

  • They did not analyze the effect of the signal-to-noise ratio of the stochastic gradients. If the mini-batch update is viewed as the sum of $B$ per-sample gradients, the signal grows linearly with the batch size ($\|g^{\parallel}\| \propto B$), while the noise grows only at a square-root rate ($\|g^{\perp}\| \propto \sqrt{B}$). Thus, increasing the batch size increases the signal-to-noise ratio of the stochastic gradients, and the full-batch setting has no noise at all. The authors essentially normalized the covariance in their paper [13]; however, they did not consider the effect of the signal-to-noise ratio in large-batch training, and the effect is even bigger in huge-batch training. When the noise is small compared to the signal, it is unclear whether the “ultra-slow diffusion” theory still holds.

  • In a real-world training process, the randomness of the stochastic potential is very limited. With limited training samples and data augmentation, the information seen by a huge-batch learner becomes largely redundant and correlated after enough iterations, violating the assumption of a random potential.

Figure 11: Euclidean distance of weight vector from initialization. b denotes the batch size.
Figure 12: Euclidean distance of weight vector from initialization. b denotes the batch size.

6.2 Additional Results

We do not have enough space in the main text, so we present additional results here. Figure 13 shows the top-5 accuracy for huge-batch ImageNet training. We can observe that LARS is very important for huge-batch training, which is in line with the results of previous large-batch training studies [14] [19] [44]. The batch size of CIFAR-10 is 25K, which is at the boundary between the large-batch regime and the huge-batch regime; LARS also helps in this mixed regime.

Figure 13: msgd: Momentum SGD optimizer; 1: learning rate warmup; 2: tuning learning rate and warmup epochs; 3: label smoothing; 4: LARS. This figure shows the top-5 accuracy for huge-batch ImageNet training (the batch size is 819200, which is larger than 50% of the total training set). The batch size of CIFAR-10 is 25K, which is at the boundary between the large-batch regime and the huge-batch regime.

Figure 14 shows the results of full-batch training. We scale the batch size of ImageNet training to 1.28 million and the batch size of CIFAR-10 to 50K. It shows that the top-5 accuracy for full-batch ImageNet training and the top-1 accuracy for full-batch CIFAR-10 training increase dramatically after using a series of optimization techniques including LARS. Figure 15 shows the hyper-parameter tuning results of full-batch CIFAR-10 training with ResNet-50 (200 epochs or 200 iterations, batch size = 50K). We only show the tuning region of the best hyper-parameters. We can observe that LARS can increase the accuracy of full-batch training; however, the current first-order optimizers cannot achieve the same accuracy as the baseline. Figure 16 shows the full-batch training results for MNIST. We partition the dataset as 60K/10K samples for training/testing and increase the batch size to 60K. Momentum SGD is our baseline optimizer, and LARS is the optimizer achieving the best accuracy for us. We can observe that neither optimizer can reach the target accuracy (99.2%), but LARS achieves a higher accuracy than Momentum SGD. Figure 16 only shows the most effective hyper-parameter tuning region. We can also observe that LARS is more stable than Momentum SGD for full-batch training. Figure 17 shows a similar result.

Figure 14: msgd: Momentum SGD optimizer; 1: learning rate warmup; 2: tuning learning rate and warmup epochs; 3: label smoothing; 4: LARS. This figure shows the top-5 accuracy for full-batch ImageNet training and the top-1 accuracy for full-batch CIFAR-10 training.
(a) Momentum SGD on CIFAR-10
(b) LARS on CIFAR-10
Figure 15: This figure shows the hyper-parameter tuning results of full-batch CIFAR-10 training by ResNet-50 (200 epochs or 200 iterations, batch size = 50K). We only show the tuning region of the best hyper-parameters. We can observe that LARS can increase the accuracy of full-batch training. However, the current first-order optimizers can not achieve the same accuracy as the baseline for full-batch training.
Figure 16: We partition the dataset as 60K/10K samples for training/testing and increase the batch size to 60K. Momentum SGD is our baseline optimizer, and LARS is the optimizer achieving the best accuracy for us. We can observe that neither optimizer can reach the target accuracy (99.2%), but LARS achieves a higher accuracy than Momentum SGD. This figure only shows the most effective hyper-parameter tuning region. We can also observe that LARS is more stable than Momentum SGD for full-batch training.
(a) Momentum SGD on CIFAR-10
(b) LARS on CIFAR-10
Figure 17: This figure shows the results of ImageNet training by ResNet-50 (90 epochs or 90 iterations, batch size = 1.28 million). We only show the tuning region of the best hyper-parameters. We can observe that the current first-order optimizers can only achieve a very low accuracy for the full-batch training.

As shown in the main text, training longer cannot close the generalization gap for huge-batch and full-batch training. For MNIST with LeNet, the baseline with a batch size of 256 can achieve 99.2% testing accuracy in 30 epochs; with LARS, the baseline can achieve 99.3% accuracy in 30 epochs. For huge-batch MNIST training, we set the batch size to 32768. From Figure 18 we can observe that even if we increase the number of epochs from 30 to 1000, we only achieve 99.16% accuracy, which still does not match the target accuracy. For full-batch training (batch size = 60K), the accuracy after 1000 epochs is even lower, only 98.3%; it is worth noting that a batch size of 256 can achieve this level of accuracy in just 10 epochs. That means there is a huge generalization gap between full-batch optimization and regular SGD optimization. For CIFAR-10 training with ResNet-50 (batch size = 50K), the baseline can achieve 93.9% accuracy in 200 epochs. However, the full-batch version cannot reach the target accuracy even if we train it for a long time: the best accuracy we can get is 92.58% after 5000 epochs (Figure 20).

Figure 18: MNIST with LeNet (batch size = 32768). The baseline with a batch size of 256 can achieve 99.2% testing accuracy in 30 epochs. With LARS, the baseline can achieve 99.3% accuracy in 30 epochs. However, huge-batch training cannot reach the target accuracy even if we train it for a long time. The best accuracy we can get is 99.16% after 1000 epochs.
Figure 19: MNIST with LeNet (batch size = 60K). The baseline with a batch size of 256 can achieve 99.2% testing accuracy in 30 epochs. With LARS, the baseline can achieve 99.3% accuracy in 30 epochs. However, full-batch training cannot reach the target accuracy even if we train it for a long time. The best accuracy we can get is 98.30% after 1000 epochs.
Figure 20: CIFAR-10 with ResNet-50 (batch size = 50K). The baseline can achieve 93.9% accuracy in 200 epochs. However, full-batch training cannot reach the target accuracy even if we train it for a long time. The best accuracy we can get is 92.58% after 5000 epochs.

6.3 Why Does LARS Help?

From the results in the previous section, we can observe that LARS makes a significant difference in huge-batch and full-batch training. We want to briefly explain why LARS helps for huge/full-batch training, which is beyond the scope of the original LARS paper [45]. As mentioned by Keskar et al. [15], a small-batch learner uses noisy gradients in each step; this noise can push the learner away from the basins of sharp minimizers. However, the noise is greatly reduced in huge/full-batch training and is no longer enough to push the learner out of the basin of a sharp minimizer. Thus, adding noise to the huge/full-batch learner may help. But how do we add the proper noise? We tried adding Gaussian noise and extensively tuning the hyperparameters, but it did not improve the testing accuracy. Specifically, we did the following experiments.

  • Add noise to activations, i.e. the outputs of each layer.

  • Add noise to weights, i.e. an alternative to the inputs.

  • Add noise to the gradients, i.e. the direction to update weights.

  • Add noise to the outputs, i.e. the labels or target variables.

Keskar et al. did similar experiments, but they also did not help (https://openreview.net/forum?id=H1oyRlYgg&noteId=H1oyRlYgg). On the other hand, LARS computes a trust ratio at each iteration [45]:

$r_t^{(i)} = \frac{\|w_t^{(i)}\|_2}{\left\|\frac{1}{B}\sum_{s \in \mathcal{S}_t} \nabla \ell(w_t^{(i)}, s)\right\|_2 + \beta \|w_t^{(i)}\|_2}$,  (2)

where $i$ is the layer ID, $B$ is the batch size, $t$ is the iteration ID, and $\beta$ is the weight decay (e.g. $\beta$ = 0.01). LARS then multiplies the learning rate by this trust ratio. From Figure 2 of the LARS paper (https://arxiv.org/pdf/1708.03888v3.pdf) we can observe that:

  • For a different iteration, the learning rate is different.

  • For a different layer, the learning rate is different.

  • For a different batch size, the learning rate is different.

Therefore, we think the trust ratio of LARS provides dynamics to the learning process. These dynamics may act as a proper form of noise that helps the learner get out of sharp minima.
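The following is a minimal sketch of the LARS scaling in Equation (2), written as a small diagnostic: it computes the per-layer trust ratio so that one can inspect how the effective learning rate changes across layers and iterations, which is the dynamic behavior discussed above. Function names and the weight-decay value are illustrative; this is not the full LARS optimizer (momentum and the global learning rate schedule are omitted).

    import numpy as np

    def lars_trust_ratios(layer_weights, layer_grads, weight_decay=0.01, eps=1e-9):
        """Return the per-layer trust ratio ||w|| / (||g|| + wd * ||w||)."""
        ratios = []
        for w, g in zip(layer_weights, layer_grads):
            w_norm = np.linalg.norm(w)
            g_norm = np.linalg.norm(g)
            ratios.append(w_norm / (g_norm + weight_decay * w_norm + eps))
        return ratios

    def lars_step(layer_weights, layer_grads, lr, weight_decay=0.01):
        """One LARS-style update: the base LR is multiplied by each layer's trust ratio."""
        ratios = lars_trust_ratios(layer_weights, layer_grads, weight_decay)
        return [w - lr * r * (g + weight_decay * w)
                for w, g, r in zip(layer_weights, layer_grads, ratios)]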

6.4 Identifying the Huge-Batch Regime

As mentioned in the main text, we identified the huge-batch regime of ImageNet/ResNet-50 as (64K, 1.28 million). Kumar et al. [19] reported that they can reach the target accuracy for ImageNet/ResNet-50 at a batch size of 64K by using the LARS optimizer [45]. We use the training recipe of Kumar et al. [19] and an auto-tuner to maximize the accuracy of ImageNet/ResNet-50 at a batch size of 128K. However, we cannot reach the target accuracy. Table 4 shows the best results we can achieve: the best top-1 accuracy is only 73.37%, which is much lower than the target accuracy (75.9%). Thus, we identify any batch size over 64K as a huge batch for ImageNet/ResNet-50 training. Figure 21 shows that we can identify the huge-batch regime of MNIST/LeNet training as (8K, 60K), and Figure 22 shows that we can identify the huge-batch regime of CIFAR-10/ResNet-50 training as (25K, 50K).

LR warmup epochs momentum LR schedule TPU chips Top1 accuracy
26 25 0.94429 cosine 256 72.92%
28 25 0.94429 cosine 256 73.37%
30 25 0.94429 cosine 256 73.08%
32 25 0.94429 cosine 256 72.82%
26 25 0.94429 cosine 512 73.14%
28 25 0.95 cosine 512 73.30%
30 25 0.95 cosine 512 73.14%
32 25 0.95 cosine 512 72.82%
Table 4: ImageNet training with ResNet-50. The batch size is 128K, weight decay is 0.0001, and label smoothing is 0.1.
Figure 21: Even if we increase the number of training epochs from 30 to 200, we cannot reach the target accuracy for MNIST/LeNet at a batch size of 16K or larger.
Figure 22: Even if we increase the number of training epochs from 200 to 1000, we cannot reach the target accuracy for CIFAR-10/ResNet-50 at a batch size of 25K or larger.
Figure 23: Regular cosine learning rate schedule.
Figure 24: Regular learning rate warmup and poly learning rate decay.
Figure 25: Regular learning rate warmup and fine-grained cosine learning rate schedule
Figure 26: Regular learning rate warmup and regular cosine learning rate schedule
Figure 27: Regular learning rate warmup and coarse-grained cosine learning rate schedule
Figure 28: Regular learning rate warmup and cosine learning rate decay.
Figure 29: Cosine learning rate warmup and cosine learning rate decay.

6.5 Learning Rate Schedule

Since learning rate is the most important hyper-parameter, we tried several different learning rate schedules. Specifically, they include regular cosine learning rate schedule (Figure 23), regular learning rate warmup and poly learning rate decay (Figure 24), regular learning rate warmup and fine-grained cosine learning rate schedule (Figure 25), regular learning rate warmup and regular cosine learning rate schedule (Figure 26), regular learning rate warmup and coarse-grained cosine learning rate schedule (Figure 27), regular learning rate warmup and cosine learning rate decay (Figure 28), and cosine learning rate warmup and cosine learning rate decay (Figure 29). We pick the best schedule for each application.
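As an illustration, here is a minimal sketch of the "regular learning rate warmup and cosine learning rate decay" schedule: linear warmup for the first warmup steps, followed by cosine decay over the remaining steps. The parameter values (peak learning rate, step counts) are illustrative assumptions only.

    import math

    def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr):
        if step < warmup_steps:
            return peak_lr * (step + 1) / warmup_steps              # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay

    # Example: 141 total steps, as in the 819200-batch ImageNet run of Section 3.1;
    # the warmup length and peak LR here are purely illustrative.
    schedule = [warmup_cosine_lr(s, total_steps=141, warmup_steps=20, peak_lr=10.0)
                for s in range(141)]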

For example, we tried the cyclical learning rate [38] scheme in MNIST training. This brings an additional hyper-parameter, the cycle length. If we do not tune it carefully, the testing accuracy can be significantly hurt (the error rate increases from 4.2% to 12.7% in Figure 32). After tuning it, we can get a slightly better accuracy (the error rate decreases from 4.2% to 3.8%), as shown in Figure 33. However, in both of these figures, the training loss reaches its peak in the middle of training and remains very high at the end of training.

Figure 30: LARS optimizer with learning rate warmup and polynomial learning rate decay.
Figure 31: LARS optimizer with learning rate warmup and polynomial learning rate decay. After we increase the batch size from 8K to 32K, we observed a significant increase in the test error rate (from 0.8% to 4.2%).
Figure 32: LARS optimizer with Cyclical Learning Rate. The baseline with a batch size of 256 can achieve 99.2% test accuracy (0.8% test error rate) in 30 epochs.
Figure 33: LARS optimizer with Cyclical Learning Rate. The baseline with a batch size of 256 can achieve 99.2% test accuracy (0.8% test error rate) in 30 epochs. The cyclical length is longer than Figure 32.

6.6 Hardware Usage Discussion

BERT pre-training is a good example of a deep learning application on supercomputers. BERT pre-training can make full use of the most powerful TPU-based supercomputers, which means researchers are able to scale it to 1024 TPU chips [46]. However, current huge-batch algorithms cannot make full use of such supercomputers for some applications like ImageNet. Here, we list a few concrete suggestions for hardware/algorithm designers on how to make better use of supercomputers:

  • Use model parallelism and pipelining together with data parallelism, which can maximize the potential performance.

  • Use larger samples, which may also be a future trend.

    • For ImageNet with ResNet-50, a batch size of 32 keeps a GPU busy.

      • A batch size of 32K can keep 1024 GPUs busy.

    • For large images (e.g. 1920x1080 HD images on ResNet-50), even a batch size of 1 may keep a GPU busy.

      • A batch size of 32K can keep 32K GPUs busy.

  • Try second-order approaches like K-FAC [27].