1 Introduction
Minibatch stochastic gradient descent (SGD) is the dominant optimization method for training deep neural networks (DNNs)
[1, 2]. In the face of unprecedented growth in dataset size, a large body of work has attempted to scale SGD to train DNN models on increasingly large datasets, while keeping wallclock time manageable [13, 9, 29, 7]. The most common approach to train large models at scale is distributed synchronous minibatch SGD, which exploits additional computational resources through data parallelism. This technique reduces wallclock training time by increasing the minibatch size, i.e., the number of examples used to compute a stochastic estimate of the gradient of the loss function at each training iteration, while holding the number of epochs constant. Proponents of large batch size training often argue that the merits stem from its ability to decrease wallclock training time while maintaining final model performance. Indeed, an enormous amount of work has gone into designing systems that seem to operate under an assumption that equates large batch size training with machine learning at scale
[9, 15, 24]. Increasing the batch size improves the scaling performance of SGD per epoch, but there are significant challenges in building efficient distributed systems that can exploit additional computational resources to use large batch sizes [15]. However, even if we were able to address these systems challenges, there are more fundamental limitations to this approach. Large batch sizes often negatively impact important performance metrics of interest, including total computational cost (which usually determines monetary cost) and prediction quality.
In this paper, we measure the total computational cost as the number of training iterations times the work done per iteration; to simplify measurements, we use the number of training iterations as a proxy for wallclock time. We do this because the implementation of parallel algorithms depends on software and hardware choices, and our goal is to draw more general conclusions about the performance of SGD-based methods.
Based on this model for total computational cost and wallclock time, the following should be clear: unless increasing the batch size leads to a commensurate decrease in the total number of training iterations needed to find a good model, large batch training will result in greater total computational cost with little-to-no decrease in wallclock training time.
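This cost model can be made concrete with a short sketch (the function names and numbers below are illustrative choices, not taken from our experiments):

```python
def total_cost(iterations, batch_size, cost_per_example=1.0):
    """Total computational cost: number of iterations times the work per
    iteration, where per-iteration work scales with the batch size."""
    return iterations * batch_size * cost_per_example

def wallclock_proxy(iterations):
    """Under perfect data parallelism, wallclock time is proportional to
    the number of iterations, independent of batch size."""
    return iterations

# Doubling the batch size while iterations shrink by only 25%:
# wallclock time drops modestly, but total cost rises by 50%.
cost_small = total_cost(iterations=1000, batch_size=256)
cost_large = total_cost(iterations=750, batch_size=512)
```

In this hypothetical case, the larger batch size buys a 25% reduction in wallclock time at the price of 50% more computation, which is exactly the diminishing-returns trade-off described above.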
Based on our empirical results across a range of datasets and architectures, we find that as the batch size becomes larger, there are three main phases of scaling behavior for convergence speed:

Linear: there is a small regime of batch sizes in which increasing the batch size results in linear gains in convergence speed;

Diminishing returns: there is a larger regime of batch sizes that results in sublinear gains in convergence speed—in this regime, increasing the batch size can improve wallclock training time at the expense of greater total computational cost;

Stagnation: eventually, we reach a third regime where a higher batch size yields marginal or nonexistent gains in convergence speed.
In our experiments, we find that this third regime begins at a batch size that is too small to fully populate the memory of all GPUs at our disposal, leading to low GPU utilization. Even though training past this batch size allows for the GPU cycles to be fully utilized, doing so increases the total computational cost without reducing wallclock training time or improving prediction quality.
While there has been considerable excitement around heuristics that have been shown to make large batch training practical for certain problems
[9, 29], we demonstrate that these techniques still suffer from the same convergence trends we observe, and they often decrease the stability of the training process. Recent work has observed that the final test performance of models trained with large batch sizes degrades after training for a fixed number of epochs [31, 17]. This phenomenon is known as the generalization gap. Previous work addressing this problem has focused on training for more iterations in the large batch case [12] or on adopting various heuristics to select a learning rate for larger batch sizes [9, 29]. Based on our empirical results, we find that existing techniques to mitigate the generalization gap do not work on some problems, and for other problems they only work for batch sizes that are too small to fully populate the memory of all GPUs at our disposal. Perhaps more importantly, they do little to affect the diminishing returns in rates of convergence for training loss as batch size increases.
Our objective is to understand the behavior of SGD and existing large batch techniques for many network architectures and problem domains, e.g., image classification/segmentation and natural language processing (NLP). We observe markedly worse performance for these techniques in domains other than image classification, where large batch optimization has received the most attention
[15, 33]. Because we eschew the challenges of an efficient distributed implementation by measuring the number of iterations instead of wallclock time, our results assume the most optimistic circumstances for large batch training. Our key observations are:
Increasing the batch size beyond a certain point yields no improvement in wallclock time to convergence, even for a system with perfect parallelism. We observe that larger batch sizes result in a limited reduction in the number of training iterations needed to achieve low training or test error, and that eventually these gains become near-zero.

Increasing the batch size leads to a significant increase in generalization error, which cannot be mitigated by existing techniques. We observe that these techniques often result in divergent training behavior or that they only mitigate degradation in test performance for small batch sizes relative to available compute.

Dataset size plays a less decisive role in determining training efficiency than factors such as model architecture and data complexity. We observe that both the diminishing returns in convergence speed and the failure of existing methods seem to correlate more with these other problem properties than dataset size alone.
In Section 2, we review the formulation of SGD as well as existing strategies to train with large batch sizes. In Section 3, we review recent theoretical results regarding the convergence rates of SGD in highly overparameterized settings and discuss the potential impact of these results on the computational efficiency of SGD for deep learning. Section 4 presents our empirical results that demonstrate the inefficiencies of training SGD with large batch sizes, and we show that these persist when using existing large batch optimization techniques.
2 Background and Related Work
Stochastic Gradient Descent. SGD is the most widely used algorithm to train DNN models. The model is parameterized by weights $w \in \mathbb{R}^d$, and the objective is to minimize the empirical loss over $n$ data points:

$$F(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; x_i) \qquad (1)$$

where $f$ is a loss function, e.g., cross-entropy or squared error. This loss gives a corresponding gradient

$$\nabla F(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla f(w; x_i) \qquad (2)$$
A minibatch of size $B$ is a collection of indices $\mathcal{B}_k$ drawn uniformly at random from the set $\{1, \dots, n\}$, and we can use it to form an unbiased estimate of the gradient at iteration $k$, as well as the corresponding SGD update:

$$g_k = \frac{1}{B} \sum_{i \in \mathcal{B}_k} \nabla f(w_k; x_i), \qquad w_{k+1} = w_k - \eta_k g_k \qquad (3)$$

where $\eta_k$ is the learning rate for iteration $k$. One iteration of training for SGD corresponds to a single gradient computation / weight update. One epoch corresponds to $\lceil n/B \rceil$ iterations of training. This constitutes a single pass over the dataset, assuming the dataset is sampled without replacement.
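As a concrete illustration of the update in equation (3), the following sketch implements minibatch SGD on a small least squares problem (the data, dimensions, and learning rate are illustrative choices, not taken from this paper):

```python
import numpy as np

def sgd_step(w, X, y, grad_f, batch_size, lr, rng):
    """One SGD iteration: form the unbiased minibatch gradient estimate
    g_k and apply the update w_{k+1} = w_k - eta_k * g_k (Eq. 3)."""
    n = X.shape[0]
    idx = rng.choice(n, size=batch_size, replace=False)  # minibatch indices B_k
    g = grad_f(w, X[idx], y[idx])                        # average over the batch
    return w - lr * g

# Least squares loss: f(w; x_i) = 0.5 * (x_i @ w - y_i)^2
def lsq_grad(w, Xb, yb):
    return Xb.T @ (Xb @ w - yb) / len(yb)

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true                      # noiseless targets: an interpolation setting
w = np.zeros(4)
for _ in range(200):
    w = sgd_step(w, X, y, lsq_grad, batch_size=32, lr=0.1, rng=rng)
```

Because the targets are noiseless, this toy problem lies in the interpolated regime later referenced via [21], so SGD converges to the true weights rather than to a noise floor.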
Efficient distributed systems reduce wallclock training time by parallelizing gradient calculations across many machines. When the batch size is large enough to populate all available compute resources, this allows us to amortize the cost of coordination for each weight update.
Existing large batch techniques. With the hope of keeping training times manageable as dataset sizes escalate, recent work has focused on the development of techniques that allow practitioners to increase the batch size to make use of growing computational resources [16, 15, 32]. However, there is a growing body of theoretical and empirical results suggesting that large batch sizes adversely affect the generalization performance of the final model [31, 17, 7].
In response to this, recent work has proposed changing two parameters in relation to batch size: the number of training iterations and the learning rate. However, they also make assumptions that limit the effectiveness of their proposals as useful heuristics for practitioners.

Training longer: [12] suggest increasing the number of training iterations. Even if this does reduce the generalization gap, it significantly increases both wallclock training time and computational cost. Moreover, in some problems it does not lead to minima with better generalization performance (as we found when running our experiments).

Square root LR scaling: Scaling the learning rate as $\eta \propto \sqrt{B}$ attempts to keep the statistics of the weight-update length constant, but the distance between SGD iterates is governed more by properties of the objective function than by the ratio of learning rate to batch size [3, 35]. This rule has also been found to be empirically suboptimal in various problem domains [18].

Linear LR scaling: The performance of large batch training can also be improved by using the linear scaling rule, which suggests choosing a learning rate proportional to the batch size ($\eta \propto B$) [9]. There are two motivations for this rule: the first assumes that one large-batch gradient step should resemble a series of small-batch gradient steps in order for convergence rates to improve linearly [9]; the other regards the SGD update equation as the Euler-Maruyama discretization of a stochastic differential equation [26, 30], and attempts to maintain a constant level of minibatch noise to help SGD explore the loss landscape [3, 35, 29].
Both justifications for the linear scaling rule implicitly impose strong conditions on the loss function by requiring that it behave linearly near SGD iterates; therefore, if the loss function is highly nonlinear along the SGD trajectory or the step size is not small enough, then we should not expect these rules to provide useful guidance for many problems. Whereas several groups have successfully used this rule to train on the ImageNet dataset in under an hour, e.g.
[9, 33], applying this heuristic to other datasets has not led to similarly impressive results so far [24]. The focus of this paper, however, is on more fundamental limitations of large batch training, and we empirically show that the above approaches fail to prevent diminishing returns in the rate of convergence for large batch sizes. We believe that these diminishing returns are of more immediate concern than the generalization gap and warrant more careful examination: if we cannot even minimize training error quickly, there is no real opportunity to minimize test error quickly, regardless of the difference in final test error across batch sizes by the time the model has converged.
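The scaling heuristics above can be sketched as simple learning rate schedules (the parameter names are ours; the warmup form follows the gradual warmup described in [9]):

```python
def linear_scaling(base_lr, base_bs, batch_size):
    """Linear scaling rule: eta proportional to B [9]."""
    return base_lr * batch_size / base_bs

def sqrt_scaling(base_lr, base_bs, batch_size):
    """Square root scaling rule: eta proportional to sqrt(B)."""
    return base_lr * (batch_size / base_bs) ** 0.5

def warmup_lr(step, warmup_steps, start_lr, target_lr):
    """Linearly ramp the learning rate from start_lr to target_lr over
    warmup_steps iterations, then hold it at target_lr."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps
```

For example, moving from a base batch size of 256 at learning rate 0.1 to a batch size of 1024 gives a target learning rate of 0.4 under linear scaling but only 0.2 under square root scaling; the warmup schedule then ramps up to that target to avoid early divergence.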
3 Critical Batch Sizes and Diminishing Returns
The convergence rate of SGD, denoted by $N_\varepsilon(B)$, is the number of iterations needed to achieve training error less than a fixed constant $\varepsilon$ by using SGD with batch size $B$ (we will drop the subscript $\varepsilon$ when it is unambiguous). In order to guarantee that large batch sizes speed up training, $N(B)$ should continue to decrease near-linearly with $B$. Otherwise, a larger batch size increases computational cost with only limited reductions in wallclock training time. For near-constant $N(B)$, the benefit of large batch sizes becomes near-zero.
[21] showed theoretically that in convex, overparameterized settings, the reduction in convergence time obtained by increasing the batch size decays dramatically to a near-constant level after a critical batch size that is independent of the dataset size. This speedup is measured with respect to the number of SGD iterations required to reach a fixed loss error $\varepsilon$ for some baseline batch size $B_0$, and for this purpose we define the speedup ratio $s(B) = N(B_0)/N(B)$. The speedup ratio represents the amount of time we save by increasing the batch size from $B_0$ to $B$. Beyond the critical batch size mentioned above, even with no communication overhead and unlimited resources (where each batch size requires the same amount of wallclock time to process), we would prefer to use the critical batch size because it requires less overall computation.
This result is surprising because researchers have asserted that it should be possible to achieve linear gains in convergence speed so long as the batch size remains small relative to the dataset size [29]. It presents significant difficulties for future work on large minibatch training because it prevents us from using large batch sizes as a catch-all approach to quickly train models as datasets grow larger.
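These quantities are straightforward to compute from training curves. A small sketch (the iteration counts below are hypothetical, chosen only to illustrate the three regimes from the introduction):

```python
import numpy as np

def iterations_to_threshold(losses, eps):
    """N_eps(B): the first iteration at which training loss drops below eps,
    or None if the threshold is never reached."""
    below = np.flatnonzero(np.asarray(losses) < eps)
    return int(below[0]) + 1 if below.size else None

def speedup_ratio(n_base, n_b):
    """s(B) = N(B0) / N(B): wallclock time saved (under perfect parallelism)
    by raising the batch size from B0 to B."""
    return n_base / n_b

# Hypothetical iteration counts for B0 = 64, illustrating the three regimes.
counts = {64: 8000, 128: 4000, 256: 2000,   # linear scaling
          512: 1400, 1024: 1100,            # diminishing returns
          2048: 1050, 4096: 1040}           # stagnation
speedups = {B: speedup_ratio(counts[64], n) for B, n in counts.items()}
```

Here $s(256) = 4 = 256/64$ (linear regime), while $s(4096) \approx 7.7$, far below the ideal $4096/64 = 64$, so almost all of the extra computation at batch size 4096 is wasted.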
4 Empirical Evaluation
Recent work studying large batch training has looked primarily at image classification (IC) [14, 31], especially on the ImageNet dataset [6]. We perform large batch size experiments on traditional IC tasks (such as CIFAR10/100 [19]), as well as on previously unexplored tasks such as image segmentation (IS) using the Cityscapes dataset [4] and natural language processing (NLP) using the WikiText2 dataset [23]. We also test how these results vary across other modern DNN architectures, namely ResNets [10], LSTMs [11, 8], AlexNet [20], VGG [27], Dilated Residual Networks [34], and MobileNetV2 [25]. We tested all of the large batch training techniques described in Section 2. We tried training longer, based on the work of [12], but we found that this by construction cannot improve convergence speed and often does not improve final test performance. The two other techniques are the square root scaling rule strategy (SRSR) and the linear scaling rule strategy (LSR); for the latter, we used a warmup period at the start of training as suggested by [9]. Table 1 reports our datasets, models, and training strategies. For each model, we evaluated against a base learning rate strategy (BLR) that used the same learning rate across all batch sizes. We selected this learning rate based on its performance at a small baseline batch size.
4.1 Diminishing Returns in Rates of Convergence
We demonstrate the rapidly diminishing returns in rates of convergence across various problem domains and network configurations. Researchers increase the batch size in an attempt to achieve nearly linear speedups in convergence compared to a small minibatch size. In particular, if the speedup ratio $s(B) = N(B_0)/N(B)$ is near-linear, i.e., $s(B) \approx B/B_0$, then the computational cost remains nearly constant between large and small minibatch SGD. However, if $s(B) \ll B/B_0$, then the benefit of using large batch size training is negligible.
In Figure 1, we show contour plots of training loss as a function of both the batch size and the number of training iterations for ResNet34 on CIFAR10, an LSTM on WikiText2, and DRN-D-22 on Cityscapes. Consider, for example, the contour plot for ResNet34 trained on CIFAR10. We can see that as the batch size increases from 16 to roughly 2048, in the reasonably well-trained model regime, the number of SGD iterations needed to achieve a particular loss value decreases linearly. Beyond this regime, however, the speedup ratio becomes increasingly sublinear, and soon we have $s(B) \ll B/B_0$. For batch sizes of roughly 4096 and above, the training procedure does not achieve the lowest training loss. From this perspective, even if we did not care about computational cost or training time, we would not be able to find an accurate model. We observe even worse scaling behavior for test performance (see Figure 5 for details).
Dataset     Task  Architecture                           Training Strategy  BS range
MNIST       IC    ResNet34                               BLR, LSR           –
CIFAR10     IC    AlexNet, MobileNetV2,                  BLR, LSR, SRSR     –
                  ResNet34, VGG16
CIFAR100    IC    ResNet34                               BLR, LSR           –
SVHN        IC    ResNet34                               BLR, LSR           –
WikiText2   NLP   LSTM                                   BLR, LSR           –
Cityscapes  IS    DRN-D-22                               BLR, LSR           –
For NLP and IS, note that the gain from large batch training diminishes even faster. Neither the LSTM on WikiText2 nor DRND22 on Cityscapes can reach their respective baseline performances after reasonably small batch sizes of about and , respectively. Although [24] showed that training on the Amazon Reviews dataset [22] can be done within 4 hours, they tune hyperparameters heavily. This poses an issue for many practical deployments because these problems are often already slow to train.
4.2 Existing Strategies Break Down for Large Batch Sizes
We further explore how training with the linear and square root scaling rules compares to training with a fixed baseline learning rate (BLR) that does not change with batch size. In the left subfigure of Figure 2, we show the speedup curves of the BLR, LSR, and SRSR strategies for ResNet34 on CIFAR10. Note that LSR and SRSR outperform BLR from batch size 256 to 2048, which implies that LSR and SRSR can help the model train at small-to-medium batch sizes. However, the speedup of LSR and SRSR is still worse than the ideal linear case, and the curves plateau quickly after a batch size of 2048, at which point BLR becomes better than LSR and SRSR. This means that for certain problems, scaling up the learning rate to compensate for an increased batch size hurts performance.
In the right subfigure of Figure 2, we plot the test performance and the approximation error for LSR of ResNet34 on CIFAR10. We measure the approximation error at the end of training, with final weights $w_T$. We take this error to be the absolute difference between the true loss value $F(w_T - \eta g)$ and the linear approximation at $w_T$, given by $F(w_T) - \eta g^\top \nabla F(w_T)$. The approximation is calculated for a single SGD iterate using the LSR, in order to understand the behavior of the approximation along the trajectory. It appears that there exists a strong relationship between linear approximation error and test accuracy: as the linear approximation error increases, the test accuracy drops. Note the transition that happens at the critical batch size of 2048. After this point, the test accuracy drops significantly and the linear approximation error grows sharply, showing that we quickly exit the regime in which the linear approximation is valid.
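The approximation error described above is simple to compute directly; a minimal sketch on a quadratic toy loss (the loss and step direction are our own illustrative choices, not the paper's experimental setup):

```python
import numpy as np

def linear_approx_error(F, grad_F, w, eta, g):
    """Absolute gap between the true loss after one step, F(w - eta*g),
    and its first-order model, F(w) - eta * g @ grad_F(w)."""
    actual = F(w - eta * g)
    predicted = F(w) - eta * (g @ grad_F(w))
    return abs(actual - predicted)

# Toy quadratic loss with an ill-conditioned Hessian (illustrative only).
A = np.diag([1.0, 10.0])
F = lambda w: 0.5 * w @ A @ w
grad_F = lambda w: A @ w

w = np.array([1.0, 1.0])
g = grad_F(w)  # use the full gradient as the step direction for simplicity
err_small = linear_approx_error(F, grad_F, w, 0.01, g)
err_large = linear_approx_error(F, grad_F, w, 0.5, g)
```

For a quadratic, the gap is exactly $0.5\,\eta^2\, g^\top A g$, growing quadratically with the step length; large effective steps, such as those produced by the LSR at large batch sizes, quickly leave the regime where the linear model is accurate.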
4.3 Convergence speed has a weak dependence on dataset size
Figure 3: Speedup curves across different problem configurations. Left: different architectures result in different rates of convergence on CIFAR10. Right: ResNet34 exhibits different rates of convergence on CIFAR10, CIFAR100, and SVHN. Note that in this experiment, we only used 50k training examples for SVHN so that the dataset sizes would be consistent for all runs. Loss thresholds are obtained by computing the lower quartile of loss values achieved by the largest batch size.
Previous works have conjectured that the maximum batch size that can result in a good model is proportional to the size of the whole dataset [28, 29]. However, for convex, overparameterized problems, [21] show that there is a model-dependent critical batch size after which we observe rapidly diminishing returns in convergence speed. In this section, to determine whether a similar critical batch size exists in the nonconvex case, we compare how changing the model architecture or data complexity affects the shapes of speedup curves, relative to changing the dataset size alone.
First, in order to show that these diminishing returns depend on data complexity and DNN architecture, we plot speedup curves in Figure 3 to compare the scaling behaviors across different models and dataset configurations. For the error threshold $\varepsilon$, we chose the lower quartile of loss values reached by the largest batch size, to make a fair comparison across configurations. This setup actually favors the large batch case, because there are lower loss thresholds that are attainable only in the small batch case. On the left, for the CIFAR10 dataset, we compared four model architectures. For each architecture, we plotted the speedup curve obtained by training this model on the dataset for various batch sizes. The variety of speedup curve shapes indicates that model architecture is an important factor in determining the convergence speed of training for large batch sizes. For MobileNetV2/AlexNet, the diminishing returns become visible at a batch size of 1024. However, for VGG16/ResNet34, the speedup does not flatten out until batch size 8192. Hence, in practice, the choice of model strongly affects our ability to use large batch sizes in SGD.
On the right, in order to investigate the effect of problem complexity, we compared the performance of ResNet34 on four datasets of the same size: CIFAR10, CIFAR100, MNIST, and SVHN (we cut off MNIST and SVHN so that all datasets have the same number of training examples). Although all problems display diminishing returns in rates of convergence, the point at which the curves plateau varies according to problem complexity. For simpler problems such as SVHN, the curves flatten out later than for harder problems (e.g., CIFAR10/100).
In all of the above cases, the diminishing rates of return in convergence speed become visible after only moderate increases in the batch size. Previous works have only studied convergence behavior for a fairly limited range of batch sizes (e.g., up to for CIFAR10) [12, 17]. By increasing the batch size past this point, it becomes immediately apparent that the primary issue with large batch size optimization is training speed, not the generalization gap.
In order to test whether the sublinear behavior of depends primarily on dataset size, we compare the speedup curves obtained when training a single model on different fractions of the original training data. We trained ResNet34 models on the CIFAR10 and SVHN datasets (for SVHN in this experiment, we train on all available training images). For each dataset, we trained on , , and then of the available training data.
In Figure 4, we plot the resulting speedup curves for the various partitions. In order to maintain a fair comparison (as baseline loss values change for different dataset sizes), we again choose the loss threshold to be the lower quartile of loss values obtained by the largest batch size.¹ Notably, the batch size at which the curves begin to plateau remains constant as dataset size changes. For ResNet34 on CIFAR10, the linear speedup behavior breaks down around batch size 128 for all three curves. By a batch size of 1024, all curves have flattened. We see similar behavior for ResNet34 on SVHN. Overall, looking back to Figure 3, the choice of model and the complexity of the dataset appear to be more related to the shape of the speedup curve than dataset size alone.

¹ We observed that the loss threshold for a smaller partition is higher than that of the full dataset. This may be because, as we decrease the dataset size, the large batch behavior that determines our threshold approaches that of vanilla gradient descent, which typically displays poor training convergence speed for DNN problems.
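The threshold and plateau conventions used throughout this section can be sketched as follows (the tolerance and example numbers are our own illustrative choices):

```python
import numpy as np

def quartile_threshold(largest_batch_losses):
    """Loss threshold for fair comparison: the lower quartile of the loss
    values achieved by the largest batch size."""
    return float(np.percentile(largest_batch_losses, 25))

def plateau_batch_size(batch_sizes, iter_counts, tol=0.1):
    """Return the first batch size beyond which doubling B reduces N(B)
    by less than a fraction tol: a simple proxy for where the speedup
    curve flattens."""
    for b, n_prev, n in zip(batch_sizes, iter_counts, iter_counts[1:]):
        if (n_prev - n) / n_prev < tol:
            return b
    return batch_sizes[-1]
```

Applied to the hypothetical curve `plateau_batch_size([128, 256, 512, 1024], [1000, 600, 560, 555])`, doubling past 256 reduces iterations by under 10%, so 256 would be flagged as the plateau point.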
5 Conclusion
By experimenting across a wide range of network architectures and problem domains, we find that, after a certain point, increasing the batch size fails to decrease wallclock time to convergence and results in low computational efficiency, even assuming perfect parallelism. The critical batch size after which these returns diminish tends to be small relative to existing system capabilities. These trends present impediments to progress in developing effective machine learning systems that are capable of handling growing data demands.
Recent works also suggest heuristics to decrease the generalization gap, but we find that these heuristics cannot be used to solve the underlying issue of training convergence speed. Moreover, we find that they usually only help decrease the generalization error in a smalltomedium batch size regime. There does not seem to be a simple training heuristic to improve large batch performance in general.
These results suggest that we should not assume that increasing the batch size for larger datasets will keep training times manageable for all problems. Even though it is a natural form of data parallelism for largescale optimization, alternative forms of parallelism should be explored to utilize all of our data more efficiently.
References
 [1] Y. Bengio and Y. LeCun, Scaling Learning Algorithms Towards AI, in Large Scale Kernel Machines, MIT Press, 2007.
 [2] L. Bottou, Largescale machine learning with stochastic gradient descent, in Proceedings of COMPSTAT’2010, Springer, 2010, pp. 177–186.
 [3] P. Chaudhari and S. Soatto, Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks, arXiv:1710.11029, (2017).

 [4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, The Cityscapes Dataset for Semantic Urban Scene Understanding, in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [5] D. Das, S. Avancha, D. Mudigere, K. Vaidyanathan, S. Sridharan, D. D. Kalamkar, B. Kaul, and P. Dubey, Distributed Deep Learning Using Synchronous Stochastic Gradient Descent, arXiv:1602.06709, (2016).
 [6] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei, ImageNet: A largescale hierarchical image database, in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 248–255.
 [7] A. Devarakonda, M. Naumov, and M. Garland, AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks, arXiv:1712.02029, (2017).
 [8] F. A. Gers, J. A. Schmidhuber, and F. A. Cummins, Learning to Forget: Continual Prediction with LSTM, Neural Computing, 12 (2000), pp. 2451–2471.
 [9] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, Accurate, large minibatch SGD: training ImageNet in 1 hour, arXiv:1706.02677, (2017).
 [10] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [11] S. Hochreiter and J. Schmidhuber, Long ShortTerm Memory, Neural Computing, 9 (1997), pp. 1735–1780.
 [12] E. Hoffer, I. Hubara, and D. Soudry, Train longer, generalize better: closing the generalization gap in large batch training of neural networks, in Advances in Neural Information Processing Systems, 2017, pp. 1731–1741.
 [13] F. N. Iandola, K. Ashraf, M. W. Moskewicz, and K. Keutzer, FireCaffe: nearlinear acceleration of deep neural network training on compute clusters, arXiv:1511.00175, (2015).
 [14] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. J. Storkey, Three Factors Influencing Minima in SGD, arXiv:1711.04623, (2018).
 [15] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, T. Chen, G. Hu, S. Shi, and X. Chu, Highly Scalable Deep Learning Training System with MixedPrecision: Training ImageNet in Four Minutes, arXiv:1807.11205, (2018).
 [16] P. H. Jin, Q. Yuan, F. N. Iandola, and K. Keutzer, How to scale distributed deep learning?, arXiv:1611.04581, (2016).
 [17] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, On largebatch training for deep learning: Generalization gap and sharp minima, arXiv:1609.04836, (2016).
 [18] A. Krizhevsky, One weird trick for parallelizing convolutional neural networks, arXiv:1404.5997, (2014).
 [19] A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images, tech. rep., Citeseer, 2009.

 [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, Curran Associates Inc., 2012, pp. 1097–1105.
 [21] S. Ma, R. Bassily, and M. Belkin, The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Overparametrized Learning, arXiv:1712.06559, (2017).
 [22] J. McAuley, C. Targett, Q. Shi, and A. van den Hengel, ImageBased Recommendations on Styles and Substitutes, in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, ACM, 2015, pp. 43–52.
 [23] S. Merity, C. Xiong, J. Bradbury, and R. Socher, Pointer Sentinel Mixture Models, arXiv:1609.07843, (2016).
 [24] R. Puri, R. Kirby, N. Yakovenko, and B. Catanzaro, Large Scale Language Modeling: Converging on 40GB of Text in Four Hours, arXiv:1808.01371, (2018).
 [25] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, MobileNetV2: Inverted Residuals and Linear Bottlenecks, arXiv:1801.04381, (2018).
 [26] T. Sauer, Numerical solution of stochastic differential equations in finance, in Handbook of computational finance, Springer, 2012, pp. 529–550.
 [27] K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for LargeScale Image Recognition, arXiv:1409.1556, (2014).
 [28] S. L. Smith, P.J. Kindermans, and Q. V. Le, Don’t Decay the Learning Rate, Increase the Batch Size, arXiv:1711.00489, (2017).
 [29] S. L. Smith and Q. V. Le, A Bayesian Perspective on Generalization and Stochastic Gradient Descent, in International Conference on Learning Representations, 2018.
 [30] C. Xing, D. Arpit, C. Tsirigotis, and Y. Bengio, A Walk with SGD, arXiv:1802.08770, (2018).
 [31] Z. Yao, A. Gholami, Q. Lei, K. Keutzer, and M. W. Mahoney, Hessianbased Analysis of Large Batch Training and Robustness to Adversaries, arXiv:1802.08241, (2018).
 [32] Y. You, I. Gitman, and B. Ginsburg, Large Batch Training of Convolutional Networks, arXiv:1708.03888, (2017).
 [33] Y. You, Z. Zhang, C. Hsieh, and J. Demmel, 100epoch ImageNet Training with AlexNet in 24 Minutes, arXiv:1709.05011, (2017).
 [34] F. Yu, V. Koltun, and T. Funkhouser, Dilated Residual Networks, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 07 2017, pp. 636–644.
 [35] Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma, The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects, arXiv:1803.00195, (2018).
Appendix A Additional results
Batch Size   BLR    LSR
32           93.58  93.58
64           93.44  93.23
128          92.92  93.21
512          91.91  92.90
1024         91.50  92.37
2048         90.63  92.17
4096         89.92  87.15
8192         86.93  13.00
16384        81.66  11.01
32768        70.01  10.96