1 Introduction
Neural networks have proven highly effective at solving a wide variety of prediction tasks, including image classification, machine translation, and speech recognition. Larger models trained on larger data sets are partly responsible for these recent successes and, in general, we expect that models trained on more data will continue to yield improvements in predictive performance (Hestness et al., 2017). Although modern GPUs and custom neural network accelerators let us train state of the art models faster than ever before, training time still limits both the predictive performance of these techniques and how widely they can be applied. For many important problems, the best models are still improving at the end of training because researchers cannot afford to train for more than a few days or weeks at a time. In extreme cases, training must end before completing a single pass over the data (e.g. Anil et al., 2018). One way to reduce training time is to increase the rate at which data is processed during training. This can facilitate dramatic improvements in model quality, not only by allowing more data to be processed, but also by decreasing the experiment iteration time and allowing researchers to try new ideas and configurations more rapidly. Faster training also allows neural networks to be deployed in applications where models have to be updated frequently, for instance when new models have to be produced when training data get added or removed regularly.
Data parallelism offers a straightforward, popular means of accelerating neural network training. For our purposes, data parallelism refers to distributing training examples across multiple processors to compute gradient updates (or higher-order derivative information) and then aggregating these locally computed updates. As long as the training objective decomposes into a sum over training examples, data parallelism is model agnostic and applicable to any neural network architecture. In contrast, the maximum degree of model parallelism (distributing parameters and computation across different processors for the same training examples) depends on the model size and structure. Although data parallelism can be simple to implement, ultimately, large-scale systems should consider all types of parallelism at their disposal. In this work, we focus on the costs and benefits of data parallelism in the synchronous training setting.
Hardware for training neural networks is trending towards ever-increasing capacity for data parallelism. Specialized systems using GPUs or custom ASICs (e.g. Jouppi et al., 2017) combined with high-performance interconnect technology are unlocking unprecedented scales of data parallelism where the costs and benefits have not yet been well studied. On the one hand, if data parallelism can provide a significant speedup at the limits of today’s systems, we should build much bigger systems. On the other hand, if additional data parallelism comes with minimal benefits or significant costs, we might consider designing systems to maximize serial execution speed, exploit other types of parallelism, or even prioritize separate design goals such as power use or cost.
There is considerable debate in the literature about the costs and benefits of data parallelism in neural network training and several papers take seemingly contradictory positions. Some authors contend that large-scale data parallelism is harmful in a variety of ways, while others contend that it is beneficial. The range of conjectures, suggestive empirical results, and folk knowledge seems to cover most of the available hypothesis space. Answering these questions definitively has only recently become important (as increasing amounts of data parallelism have become practical) so it is perhaps unsurprising that the literature remains equivocal, especially in the absence of sufficiently comprehensive experimental data.
In this work, we attempt to provide the most rigorous and extensive experimental study on the effects of data parallelism on neural network training to date. In order to achieve this goal, we consider realistic workloads up to the current limits of data parallelism. We try to avoid making assumptions about how the optimal metaparameters vary as a function of batch size. Finally, in order to guide future work, we consider any remaining limitations in our methodology, and we discuss what we see as the most interesting unanswered questions that arise from our experiments.
1.1 Scope
We restrict our attention to variants of minibatch stochastic gradient descent (SGD), which are the dominant algorithms for training neural networks. These algorithms iteratively update the model’s parameters in the direction opposite an estimate of the gradient of the training objective. The gradient is estimated at each step using a different subset, or batch, of training examples. See Section 2.2 for a more detailed description of these algorithms. A data-parallel implementation computes gradients for different training examples in each batch in parallel, and so, in the context of minibatch SGD and its variants, we equate the batch size with the amount of data parallelism.^{1} We restrict our attention to synchronous SGD because of its popularity and advantages over asynchronous SGD (Chen et al., 2016).
^{1} Minibatch SGD can be implemented in a variety of ways, including data-serially, but a data-parallel implementation is always possible in principle.
Practitioners are primarily concerned with out-of-sample error and the cost they pay to achieve that error. Cost can be measured in a variety of ways, including training time and hardware costs. Training time can be decomposed into number of steps multiplied by average time per step, and hardware cost into number of steps multiplied by average hardware cost per step. The average time and hardware costs depend on the practitioner’s hardware, but the number of training steps is hardware-agnostic and can be used to compute the total costs for any hardware given its average per-step costs. Furthermore, for an idealized data-parallel system, the wall time is conveniently proportional to the number of steps. Therefore, we focus on number of training steps as our main unit of measurement for training cost.
An alternative hardware-agnostic measure of training cost is the number of training examples processed, or equivalently the number of passes (epochs) over the training data. This measure describes the case where the per-step costs are proportional to the number of examples processed. However, in an idealized data-parallel system, the time cost depends only on the number of training steps and is independent of the number of examples processed. Indeed, in real-world systems such as TPU pods^{2} there can be a range of batch sizes for which the time per step is almost constant. Thus, on realistic data-parallel hardware, a neural network that trains in fewer steps with a larger batch size can incur a lower time cost, even if it processes more epochs of training data.
^{2} A TPU pod is an accelerator designed for machine learning workloads. See https://www.blog.google/products/googlecloud/googlecloudoffertpusmachinelearning/.
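The two cost measures discussed above can be compared with a small sketch. All numbers below are hypothetical and chosen only for illustration; the sketch assumes the idealized case in which the time per step is the same for both batch sizes.

```python
def training_time_ms(num_steps: int, ms_per_step: int) -> int:
    """Training time as number of steps times average time per step."""
    return num_steps * ms_per_step


def epochs_processed(num_steps: int, batch_size: int, dataset_size: int) -> float:
    """Passes over the training data implied by a step count and a batch size."""
    return num_steps * batch_size / dataset_size


# Hypothetical workload: none of these values come from the experiments.
DATASET_SIZE = 50_000
MS_PER_STEP = 100  # assumed constant across both batch sizes

small = {"batch_size": 64, "num_steps": 40_000}
large = {"batch_size": 1024, "num_steps": 5_000}

for run in (small, large):
    run["time_ms"] = training_time_ms(run["num_steps"], MS_PER_STEP)
    run["epochs"] = epochs_processed(
        run["num_steps"], run["batch_size"], DATASET_SIZE
    )
    print(run)
```

In this made-up example the larger batch size finishes in fewer steps and less time while processing more epochs, which is the situation the paragraph describes.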
In light of practitioners’ primary concerns of out-of-sample error and the resources needed to achieve it, we believe the following questions are the most important to study in order to understand the costs and benefits of data parallelism with minibatch SGD and its variants:
1. What is the relationship between batch size and number of training steps to reach a goal out-of-sample error?
2. What governs this relationship?
3. Do large batch sizes incur a cost in out-of-sample error?
1.2 Contributions of This Work

1. We show that the relationship between batch size and number of training steps to reach a goal out-of-sample error has the same characteristic form across six different families of neural network, three training algorithms, and seven data sets.
   Specifically, for each workload (model, training algorithm, and data set), increasing the batch size initially decreases the required number of training steps proportionally, but eventually there are diminishing returns until finally increasing the batch size no longer changes the required number of training steps. To the best of our knowledge, we are the first to experimentally validate this relationship across models, training algorithms, and data sets while independently tuning the learning rate, momentum, and learning rate schedule (where applicable) for each batch size. Unlike prior work that made strong assumptions about these metaparameters, our results reveal a universal relationship that holds across all workloads we considered, across different error goals, and when considering either training error or out-of-sample error.
2. We show that the maximum useful batch size varies significantly between workloads and depends on properties of the model, training algorithm, and data set. Specifically, we show that:
   (a) SGD with momentum (as well as Nesterov momentum) can make use of much larger batch sizes than plain SGD, suggesting future work to study the batch size scaling properties of other algorithms.
   (b) Some models allow training to scale to much larger batch sizes than others. We include experimental data on the relationship between various model properties and the maximum useful batch size, demonstrating that the relationship is not as simple as one might hope from previous work (e.g. wider models do not always scale better to larger batch sizes).
   (c) The effect of the data set on the maximum useful batch size tends to be smaller than the effects of the model and training algorithm, but this effect does not depend on data set size in a consistent way.
3. We show that the optimal values of training metaparameters do not consistently follow any simple relationships with the batch size. In particular, popular learning rate heuristics, such as linearly scaling the learning rate with the batch size, do not hold across all problems or across all batch sizes.
4. Finally, by reviewing the specifics of the experimental protocols used in prior work, we at least partially reconcile conflicting stances in the literature on whether increasing the batch size degrades model quality. Specifically, we show that assumptions about computational budgets and the procedures for selecting metaparameters at different batch sizes can explain many of the disagreements in the literature. We find no evidence that increasing the batch size necessarily degrades model quality, but additional regularization techniques may become important at larger batch sizes.
2 Setup and Background
In this section we set up the basic definitions and background concepts used throughout the paper.
2.1 Learning
A data distribution is a probability distribution $\mathcal{D}$ over a data domain $\mathcal{Z}$. For example, we might consider a supervised learning task over a domain $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the set of 32-by-32-pixel color images and $\mathcal{Y}$ the possible labels denoting what appears in the image. A training set $z_1, \ldots, z_n \in \mathcal{Z}$ is a collection of examples from the data domain, conventionally assumed to be drawn i.i.d. from the data distribution $\mathcal{D}$.
A machine learning model is a function that, given parameters $\theta$ from some set $\Theta$, and given a data point $z \in \mathcal{Z}$, produces a prediction whose quality is measured by a differentiable non-negative scalar-valued loss function.^{3} We denote by $\ell(\theta; z)$ the loss of a prediction made by the model, under parameters $\theta$, on the data point $z$. We denote by $L$ the out-of-sample loss or expected loss:
$$L(\theta) = \mathbb{E}_{z \sim \mathcal{D}}\left[\ell(\theta; z)\right], \tag{1}$$
and by $\hat{L}$ the empirical average loss under a data set $S = (z_1, \ldots, z_n)$:
$$\hat{L}(\theta; S) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; z_i). \tag{2}$$
^{3} Technically, the loss need only be sub-differentiable, and extending our setup to this end is straightforward.
When $S$ is the training set, we call $\hat{L}$ the average training loss. We will say that the data source $\mathcal{D}$, loss $\ell$, and model with parameter set $\Theta$ together specify a learning task, in which our aim is to find parameters $\theta$ that achieve low out-of-sample loss (Equation 1), while given access only to training examples. A common approach is to find parameters of low average training loss (Equation 2) as an estimate of the out-of-sample loss (Shalev-Shwartz and Ben-David, 2014).
When minimizing average training loss $\hat{L}$, it is common to add regularization penalties to the objective function. For a differentiable penalty $R$, regularization weight $\lambda$, and training set $S$, the training objective might be
$$J(\theta) = \hat{L}(\theta; S) + \lambda R(\theta). \tag{3}$$
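As a concrete illustration of the average training loss (Equation 2) and a regularized training objective (Equation 3), here is a minimal sketch. The scalar linear model, the squared loss, the squared penalty, and the tiny data set are all hypothetical choices made for the example; the text does not prescribe them.

```python
from typing import Callable, Sequence, Tuple

Example = Tuple[float, float]  # (input x, target y)


def average_loss(
    loss: Callable[[float, Example], float],
    theta: float,
    data: Sequence[Example],
) -> float:
    """Empirical average loss of parameters theta over a data set."""
    return sum(loss(theta, z) for z in data) / len(data)


def training_objective(
    loss: Callable[[float, Example], float],
    penalty: Callable[[float], float],
    weight: float,
    theta: float,
    data: Sequence[Example],
) -> float:
    """Average training loss plus a weighted regularization penalty."""
    return average_loss(loss, theta, data) + weight * penalty(theta)


# Hypothetical instance: the model predicts theta * x.
def squared_loss(theta: float, z: Example) -> float:
    x, y = z
    return (theta * x - y) ** 2


def squared_penalty(theta: float) -> float:
    return theta ** 2


train_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(average_loss(squared_loss, 2.0, train_set))
print(training_objective(squared_loss, squared_penalty, 0.5, 2.0, train_set))
```

With these choices, theta = 2 fits the toy data exactly, so the objective at that point consists only of the penalty term.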
In practice, we often approach a task by replacing its loss with another that is more amenable to training. For instance, in supervised classification, we might be tasked with learning under the 0/1 loss, which is an indicator of whether a prediction is correct (e.g. matches a ground-truth label), but we train by considering instead a surrogate loss (e.g. the logistic loss) that is more amenable to continuous optimization. When the surrogate loss bounds the original, achieving low loss under the surrogate implies low loss under the original. To distinguish the two, we say error to describe the original loss (e.g. 0/1), and we save loss to refer to the surrogate used in training.
2.2 Algorithms
The dominant algorithms for training neural networks are based on minibatch stochastic gradient descent (SGD, Robbins and Monro, 1951; Kiefer et al., 1952; Rumelhart et al., 1986; Bottou and Bousquet, 2008; LeCun et al., 2015). Given an initial point $\theta_0 \in \Theta$, minibatch SGD attempts to decrease the objective $J$ via the sequence of iterates:^{4}
$$\theta_{t+1} = \theta_t - \eta_t \, g(\theta_t; B_t),$$
where each $B_t$ is a random subset of training examples, the sequence $\{\eta_t\}$ of positive scalars is called the learning rate, and where, for any $\theta \in \Theta$ and $B \subseteq S$,
$$g(\theta; B) = \frac{1}{|B|} \sum_{z \in B} \nabla \ell(\theta; z) + \lambda \nabla R(\theta). \tag{4}$$
When the examples $B$ are a uniformly random subset of training examples, $g(\theta; B)$ forms an unbiased estimate of the gradient of the objective $J$ that we call a stochastic gradient. In our larger-scale experiments, when we sample subsequent batches $B_t$, we actually follow the common practice of cycling through permutations of the training set (Shamir, 2016).
^{4} In practice, we may pick any of the iterates $\theta_t$ for which we estimate that $L(\theta_t)$ is low using a validation data set.
Variants of SGD commonly used with neural networks include SGD with momentum (Polyak, 1964; Rumelhart et al., 1986; Sutskever et al., 2013), Nesterov momentum (Nesterov, 1983; Sutskever et al., 2013), RMSProp (Hinton et al., 2012), and Adam (Kingma and Ba, 2015). All of these optimization procedures, or optimizers, interact with the training examples only by repeatedly estimating stochastic gradients (Equation 4), so they support the same notion of batch size that we equate with the scale of data parallelism. In this work, we focus on SGD, SGD with momentum, and Nesterov momentum. The latter two optimizers are configured by a learning rate $\{\eta_t\}$ and a scalar $\gamma$ that we call momentum. They define the iterates:^{5}
SGD with momentum:
$$v_{t+1} = \gamma v_t + g(\theta_t; B_t), \qquad \theta_{t+1} = \theta_t - \eta_t v_{t+1}$$
Nesterov momentum:
$$v_{t+1} = \gamma v_t + g(\theta_t; B_t), \qquad \theta_{t+1} = \theta_t - \eta_t \, g(\theta_t; B_t) - \eta_t \gamma v_{t+1}$$
given $v_0 = 0$ and an initial $\theta_0$. Note that plain SGD can be recovered from either optimizer by taking $\gamma = 0$. The outcome of using these optimizers should therefore be no worse if, in any experiment, the momentum is tuned across values including zero.
^{5} These iteration rules take slightly different forms across the literature and across library implementations. Here we present and use the update rules used by the MomentumOptimizer class in TensorFlow (Abadi et al., 2016).
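The momentum updates can be sketched in a few lines of plain Python. This is an illustration under an assumed convention (accumulate a velocity from the stochastic gradient, then step along it, with the Nesterov variant adding a direct gradient term); the function name `momentum_step` and the toy values are hypothetical, and library implementations may differ in detail.

```python
from typing import List, Tuple


def momentum_step(
    theta: List[float],
    velocity: List[float],
    grad: List[float],
    learning_rate: float,
    momentum: float,
    nesterov: bool = False,
) -> Tuple[List[float], List[float]]:
    """One optimizer step; returns the updated (theta, velocity)."""
    # Accumulate the velocity: v <- momentum * v + grad.
    new_velocity = [momentum * v + g for v, g in zip(velocity, grad)]
    if nesterov:
        # Nesterov variant: theta <- theta - lr * grad - lr * momentum * v.
        new_theta = [
            p - learning_rate * g - learning_rate * momentum * v
            for p, g, v in zip(theta, grad, new_velocity)
        ]
    else:
        # Plain momentum variant: theta <- theta - lr * v.
        new_theta = [p - learning_rate * v for p, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


# With momentum = 0, both variants reduce to the plain SGD step.
theta0, v0, grad0 = [1.0, -2.0], [0.0, 0.0], [0.5, 1.0]
plain_sgd = [p - 0.1 * g for p, g in zip(theta0, grad0)]
with_momentum, _ = momentum_step(theta0, v0, grad0, 0.1, 0.0)
with_nesterov, _ = momentum_step(theta0, v0, grad0, 0.1, 0.0, nesterov=True)
print(plain_sgd, with_momentum, with_nesterov)
```

The three printed parameter vectors agree, which mirrors the remark that plain SGD is recovered from either optimizer when the momentum is zero.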
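If we run SGD with momentum under a constant learning rate $\eta_t = \eta$, then, at a given iteration $t$, the algorithm computes
$$\theta_{t+1} = \theta_t - \eta v_{t+1} = \theta_0 - \eta \sum_{u=0}^{t} v_{u+1} = \theta_0 - \eta \sum_{u=0}^{t} \sum_{s=0}^{u} \gamma^{u-s} g(\theta_s; B_s).$$
For any fixed $\tau \in \{0, \ldots, t\}$, the coefficient accompanying the stochastic gradient $g(\theta_\tau; B_\tau)$ in the above update is, up to the leading minus sign, $\eta \sum_{u=\tau}^{t} \gamma^{u-\tau}$. We define the effective learning rate, $\eta^{\text{eff}}$, as the value of this coefficient at the end of training ($t = T$), in the limit of a large number of training steps ($T \to \infty$, while $\tau$ is held fixed):
$$\eta^{\text{eff}} = \lim_{T \to \infty} \sum_{u=\tau}^{T} \eta \gamma^{u-\tau} = \frac{\eta}{1 - \gamma}.$$
Put intuitively, $\eta^{\text{eff}}$ captures the contribution of a given minibatch gradient to the parameter values at the end of training.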
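A quick numerical check of the effective learning rate: assuming a constant learning rate and a momentum strictly between 0 and 1, the coefficient on a fixed minibatch gradient is a finite geometric sum that approaches the learning rate divided by one minus the momentum. The function names and the values 0.01 and 0.9 below are illustrative only.

```python
def gradient_coefficient(
    learning_rate: float, momentum: float, tau: int, t: int
) -> float:
    """Coefficient on the step-tau gradient after step t.

    Computes learning_rate * sum_{u=tau}^{t} momentum ** (u - tau).
    """
    return learning_rate * sum(momentum ** (u - tau) for u in range(tau, t + 1))


def effective_learning_rate(learning_rate: float, momentum: float) -> float:
    """Limit of the coefficient above as the number of steps grows."""
    if not 0.0 <= momentum < 1.0:
        raise ValueError("momentum must lie in [0, 1) for the limit to exist")
    return learning_rate / (1.0 - momentum)


for steps in (1, 10, 100, 1000):
    print(steps, gradient_coefficient(0.01, 0.9, tau=0, t=steps))
print("limit", effective_learning_rate(0.01, 0.9))
```

The printed coefficients increase toward the limit as the number of steps grows, and with zero momentum the effective learning rate equals the learning rate itself.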
2.3 Additional Terminology in Experiments
A data-parallel implementation of minibatch SGD (or one of its variants) computes the summands of Equation 4 in parallel and then synchronizes to coordinate their summation.
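The following sketch mimics that pattern on a single machine: the batch is split into shards, each worker sums per-example gradients over its shard, and the partial sums are combined after all workers finish. It is a toy illustration rather than a real distributed system; the thread pool stands in for separate processors, the squared-loss gradient is a hypothetical choice, and the regularization term is omitted.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (input x, target y)


def example_gradient(theta: float, z: Example) -> float:
    """Gradient of the squared loss (theta * x - y) ** 2 with respect to theta."""
    x, y = z
    return 2.0 * (theta * x - y) * x


def shard_gradient_sum(theta: float, shard: Sequence[Example]) -> float:
    """Work done by one worker: sum of per-example gradients on its shard."""
    return sum(example_gradient(theta, z) for z in shard)


def data_parallel_gradient(
    theta: float, batch: List[Example], num_workers: int
) -> float:
    """Average gradient over the batch, computed shard by shard."""
    shards = [batch[i::num_workers] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(lambda s: shard_gradient_sum(theta, s), shards))
    # Synchronization point: combine partial sums, then normalize by batch size.
    return sum(partial_sums) / len(batch)


batch = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 9.0)]
serial = sum(example_gradient(1.0, z) for z in batch) / len(batch)
parallel = data_parallel_gradient(1.0, batch, num_workers=2)
print(serial, parallel)
```

The serial and sharded computations give the same average gradient, which is why the batch size can be identified with the amount of data parallelism.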
The models and algorithms in our experiments are modifiable by what we call metaparameters.^{6} These include architectural choices, such as the number of layers in a neural network, and training parameters, such as learning rates $\{\eta_t\}$ and regularization weights $\lambda$. When we use the term model, we typically assume that all architectural metaparameters have been set. In our experiments, we tune the training metaparameters by selecting the values that yield the best performance on a validation set. We use the term workload to jointly refer to a data set, model, and training algorithm.
^{6} Sometimes called “hyperparameters,” but we prefer a different name so as not to clash with the notion of hyperparameters in Bayesian statistics.
3 Related Work
In this section we review prior work related to our three main questions from Section 1.1. First we review studies that considered the relationship between batch size and number of training steps (Questions 1 and 2), and then we review studies that considered the effects of batch size on solution quality (Question 3).
3.1 Steps to Reach a Desired Out-Of-Sample Error
We broadly categorize the related work on this topic as either analytical or empirical in nature.
3.1.1 Analytical Studies
Convergence upper bounds from the theory of stochastic (convex) optimization can be specialized to involve terms dependent on batch size, so in this sense they comprise basic related work. These upper bounds arise from worst-case analysis, and moreover make convexity and regularity assumptions that are technically violated in neural network training, so whether they predict the actual observed behavior of our experimental workloads is an empirical question in its own right.
Given a sequence of examples drawn i.i.d. from a data source, an upper bound on the performance of SGD applied to Lipschitz convex losses is (Hazan, 2016; Shalev-Shwartz and Ben-David, 2014)
$$J(\hat{\theta}_T) - J^\star \le O\left(\frac{1}{\sqrt{T}}\right) \tag{5}$$
for any batch size. Here, $J$ is our objective function, $J^\star$ is its value at the global optimum, and $\hat{\theta}_T$ denotes the final output of the algorithm supposing it took $T$ iterations.^{7} Meanwhile, when losses are convex and the objective is smooth, accelerated parallel minibatch SGD enjoys the bound (Lan, 2012)
$$J(\hat{\theta}_T) - J^\star \le O\left(\frac{1}{T^2} + \frac{1}{\sqrt{bT}}\right) \tag{6}$$
where $b$ is the batch size.
^{7} Not necessarily the $T$-th iterate, which may differ from $\hat{\theta}_T$ if the algorithm averages its iterates.
Compared to sequential processing without batching (i.e. a batch size of one), the bounds Equation 5 and Equation 6 offer two extremes, respectively:
1. No benefit: Increasing the batch size $b$ does not change the number of steps to convergence, as per Equation 5.
2. A $b$-fold benefit: The term in Equation 6 proportional to $1/\sqrt{bT}$ dominates the bound. Increasing the batch size $b$ by a multiplicative factor decreases the number of steps $T$ to a given suboptimality by the same factor.
In other words, under these simplifications, batching cannot hurt the asymptotic guarantees of steps to convergence, but it could be wasteful of examples. The two extremes imply radically different guidance for practitioners, so the critical task of establishing a relationship between batch size and number of training steps remains one to resolve experimentally.
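To make the two extremes concrete, the following worked step solves each kind of bound for the number of steps $T$ needed to reach a target suboptimality $\epsilon$ at batch size $b$. It is a sketch under assumptions: the constants $c_1$ and $c_2$ are hypothetical stand-ins for whatever the $O(\cdot)$ notation hides, the first case assumes a bound that decays as $1/\sqrt{T}$ regardless of batch size, and the second assumes the dominant term decays as $1/\sqrt{bT}$.

```latex
\begin{align*}
\text{No benefit:}\quad
  & \frac{c_1}{\sqrt{T}} \le \epsilon
  \iff T \ge \frac{c_1^2}{\epsilon^2}, \\
\text{$b$-fold benefit:}\quad
  & \frac{c_2}{\sqrt{bT}} \le \epsilon
  \iff T \ge \frac{c_2^2}{b\,\epsilon^2}.
\end{align*}
```

In the first case the step count does not depend on $b$, so the number of examples processed, $bT$, grows linearly with the batch size. In the second case $T$ shrinks by the same factor by which $b$ grows, and $bT$ stays fixed.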
A few recent papers propose analytical notions of a critical batch size: a point at which a transition occurs from a $b$-fold benefit to no benefit. Under assumptions including convexity, Ma et al. (2018) derive such a critical batch size, and argue that a batch size of one is optimal for minimizing the number of training epochs required to reach a given target error. Under different assumptions, Yin et al. (2018) establish a critical batch size and a pathological loss function that together exhibit a transition from a $b$-fold benefit to no benefit. Although they experiment with neural networks, their experiments are designed to investigate the effect of data redundancy and they do not provide enough information to reveal the empirical relationship between batch size and number of training steps. Focusing on linear least-squares regression, Jain et al. (2018) also derive a threshold batch size, here in terms of (i) the operator norm of the objective’s Hessian and (ii) a constant from a fourth-moment bound on example inputs.
To our knowledge, in all previous work that aims to analytically characterize a critical batch size, the thresholds defined are either (i) parameter-dependent, or (ii) specific to linear least-squares regression. A critical batch size that depends on model parameters can change over the course of optimization; it is not a problem-wide threshold that can be estimated efficiently a priori. Focusing on least-squares has issues as well: while it sheds intuitive light on how batching affects stochastic optimization locally, the quantities defined inherently cannot generalize to the nonlinear optimization setting of neural network training, both because the objective’s Hessian is not constant across the space of parameters as it is in a quadratic problem, and more broadly because it is unclear whether the Hessian of the objective is still the correct analogue to consider.
3.1.2 Empirical Studies
Wilson and Martinez (2003) investigated the relationship between batch size and training speed for plain minibatch SGD. They found that a simple fully connected neural network took more epochs to converge with larger batch sizes on a data set of 20,000 examples, and also that using a batch size equal to the size of the training set took more epochs to converge than a batch size of one on several small data sets. However, their experimental protocol and assumptions limit the conclusions we can draw from their results. One issue is that training time was measured to different out-of-sample errors for different batch sizes on the same data set. To compare training speed fairly, the error goal should be fixed across all training runs being compared. Additionally, only four learning rates were tried for each data set, but quite often the best learning rate was at one of the two extremes and it appeared that a better learning rate might be found outside of the four possibilities allowed. Finally, despite the conclusions of the authors, their results do not imply slower training with larger batch sizes in a data-parallel implementation: for the most part, their larger batch size experiments took fewer training steps than the corresponding batch size one experiments.
In the last few years, increasingly specialized computing systems have spurred practitioners to try much larger batch sizes than ever before, while increasingly promising results have driven hardware designers to create systems capable of even more data parallelism. Chen et al. (2016) used a pool of synchronized worker machines to increase the effective batch size of minibatch SGD. They demonstrated speedups in both wall time and steps to convergence for an Inception model (Szegedy et al., 2016) on ImageNet (Russakovsky et al., 2015) by scaling the effective batch size from 1,600 to 6,400. More recently, Goyal et al. (2017) showed that the number of training epochs could be held constant across a range of batch sizes to achieve the same validation error for ResNet-50 (He et al., 2016a) on ImageNet. Holding the number of training epochs constant is equivalent to scaling the number of training steps inversely with the batch size, and this reduction in training steps with increasing batch size produced nearly proportional wall time speedups on their hardware. Although this hints at a $b$-fold benefit regime in which increasing the batch size reduces the number of training steps by the same factor, the authors did not attempt to minimize the number of training steps (or epochs) required to reach the goal at each batch size separately. It is unclear whether any of the batch sizes that achieved the goal could do so in fewer steps than given, or how many steps the other batch sizes would have needed to achieve the same error goal.
Two studies performed concurrently with this work also investigate the relationship between batch size and training speed for neural networks. Chen et al. (2018) provide experimental evidence of a problem-dependent critical batch size after which a $b$-fold benefit is no longer achieved for plain minibatch SGD. They contend that wider and shallower networks have larger critical batch sizes, and while their empirical results are equivocal for this particular claim, they show that the threshold batch size can depend on aspects of both the data set and the model. Additionally, the anonymous authors of an ICLR 2019 submission (Anonymous, 2019, under review at time of writing) study how three previously proposed heuristics for adjusting the learning rate as a function of batch size (linear scaling, square root scaling, and no scaling) affect the number of training steps required to reach a particular result. They find that if the learning rate is tuned for the smallest batch size only, all three of these common scaling techniques break down for larger batch sizes and result in either (i) divergent training, or (ii) training that cannot reach the same error goal within a fixed number of training epochs. They also describe a basic relationship between batch size and training steps to a fixed error goal, which comprises three regions: $b$-fold benefit initially, then diminishing returns, and finally no benefit for all batch sizes greater than a maximum useful batch size. However, at least at the time of writing, their results are inconclusive because (i) not all model and data set pairs exhibit this basic relationship, (ii) it does not appear consistently across error goals, and (iii) the relationship is primarily evident in training error but not out-of-sample error. These inconsistent results may be due to suboptimal predetermined learning rates arising from the scaling rules, especially at larger batch sizes. Finally, they also find that the maximum useful batch size depends on aspects of the model and the data set type, but not on the data set size. Since all their experiments use plain minibatch SGD, their results are unable to reveal any effects from the choice of optimizer and might not generalize to other popular optimizers, such as SGD with momentum.
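For reference, the three learning rate heuristics named above (linear scaling, square root scaling, and no scaling) can be written as a single helper. This is an illustrative sketch only: the function name, the rule labels, and the base values are hypothetical, and the discussion above reports that rules of this kind can break down at larger batch sizes.

```python
import math


def scaled_learning_rate(
    base_learning_rate: float,
    base_batch_size: int,
    batch_size: int,
    rule: str,
) -> float:
    """Learning rate at `batch_size`, given one tuned at `base_batch_size`."""
    ratio = batch_size / base_batch_size
    if rule == "linear":
        return base_learning_rate * ratio
    if rule == "sqrt":
        return base_learning_rate * math.sqrt(ratio)
    if rule == "constant":
        return base_learning_rate
    raise ValueError(f"unknown scaling rule: {rule!r}")


# Hypothetical base setting: learning rate 0.1 tuned at batch size 256.
for rule in ("linear", "sqrt", "constant"):
    print(rule, scaled_learning_rate(0.1, 256, 1024, rule))
```

Each rule fixes the learning rate at every batch size from a single tuned value, in contrast to tuning the learning rate separately for each batch size.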
3.2 Solution Quality
The literature contains some seemingly conflicting claims regarding the effects of batch size on solution quality (out-of-sample error at the conclusion of training). Primarily, the debate centers on whether increasing the batch size incurs a cost in solution quality. Keskar et al. (2017) argue that large batch^{8} training converges to so-called “sharp” minima with worse generalization properties. However, Dinh et al. (2017) show that a minimum with favorable generalization properties can be made, through reparameterization, arbitrarily sharp in the same sense. LeCun et al. (1998) suggest that a batch size of one can result in better solutions because the noisier updates allow for the possibility of escaping from local minima in a descent algorithm. However, they also note that we usually stop training long before reaching any sort of critical point. Hoffer et al. (2017) argue that increasing the batch size need not degrade out-of-sample error at all, assuming training has gone on long enough. Goyal et al. (2017), among others, tested batch sizes larger than those used in Keskar et al. (2017) without noticing any reduction in solution quality. Still, their results with yet larger batch sizes do not rule out the existence of a more sudden degradation once the batch size is large enough. Meanwhile, Goodfellow et al. (2016) state that small batches can provide a regularization effect such that they result in the best observed out-of-sample error, although in this case other regularization techniques might serve equally well.
^{8} The term “large batch” is inherently ambiguous, and in this case accompanies experiments in Keskar et al. (2017) that only compare two absolute batch sizes per data set, rather than charting out a curve to its apparent extremes.
Alas, the best possible out-of-sample error for a particular model and data set cannot be measured unconditionally due to practical limits on wall time and hardware resources, as well as practical limits on our ability to tune optimization metaparameters (e.g. the learning rate). An empirical study can only hope to measure solution quality subject to the budgets allowed for each model experiment, potentially with caveats due to limitations of the specific procedures for selecting the metaparameters. To the best of our knowledge, all published results handle the training budget issue in exactly one of three ways: by ignoring budgets (train to convergence, which is not always possible); by using a step budget (restrict the number of gradient descent updates performed); or by using an epoch budget (restrict number of training examples processed).^{9} Furthermore, while some published results tune the learning rate anew for each batch size, others tune for only a single batch size and use a preordained heuristic to set the learning rate for the remaining batch sizes (the most common heuristics are constant, square root, and linear learning rate scaling rules). Tuning metaparameters at a single batch size and then heuristically adjusting them for others could clearly create a systematic advantage for trials at batch sizes near to the one tuned. All in all, the conclusions we can draw from previous studies depend on the budgets they assume and on how they select metaparameters across batch sizes. The following subsections attempt an investigation of their experimental procedures to this end.
^{9} There are, of course, budgets in between an epoch budget and a step budget that might allow the possibility of trading off time, computation, and/or solution quality. For example, it may be possible to trade the total number of gradient computations for faster training time to reach the same quality solution. However, we are not aware of work that emphasizes these budgets.
3.2.1 Studies That Ignore Budgets
All studies we mention in this section compared solution quality for different batch sizes after they deemed their models to have converged. To ensure convergence, they used manual inspection or a heuristic to determine the stopping time, or used compute budgets that they considered sufficient to guarantee convergence.^{10}
^{10} As discussed further in Section 4.8, we find that millions of training steps for small batch sizes, or thousands of epochs for large batch sizes, are required to saturate performance even for data sets as small and simple as MNIST. In our experiments, this corresponded to more than 25 hours of wall time for each metaparameter configuration.
Keskar et al. (2017)
trained several neural network architectures on MNIST and CIFAR10, each with two batch sizes, using the Adam optimizer and without changing the learning rate between batch sizes. They found that the larger batch size consistently achieved worse outofsample error after training error had ceased to improve. However, all models used batch normalization
(Ioffe and Szegedy, 2015) and presumably computed the batch normalization statistics using the full batch size. For a fair comparison between batch sizes, batch normalization statistics should be computed over the same number of examples or else the training objective differs between batch sizes (Goyal et al., 2017). Indeed, Hoffer et al. (2017) found that computing batch normalization statistics over larger batches can degrade solution quality, which suggests an alternative explanation for the results of Keskar et al. (2017). Moreover, Keskar et al. (2017) reported that data augmentation eliminated the difference in solution quality between small and large batch experiments.Smith and Le (2018) trained a small neural network on just 1,000 examples sampled from MNIST with two different batch sizes, using SGD with momentum and without changing the learning rate between batch sizes. They observed that the larger batch size overfit more than the small batch size resulting in worse outofsample error, but this gap was mitigated by applying L2 regularization (Smith and Le, 2018, Figures 3 and 8). They also compared a wider range of batch sizes in experiments that either (i) used a step budget without changing the learning rate for each batch size (Smith and Le, 2018, Figures 4 and 6), or (ii) varied the learning rate and used a step budget that was a function of the learning rate (Smith and Le, 2018, Figure 5). Instead, we focus on the case where the learning rate and batch size are chosen independently.
Breuel (2015a, b) trained a variety of neural network architectures on MNIST with a range of batch sizes, using the SGD and SGD with momentum optimizers with a range of learning rates and momentum values. They found that batch size had no effect on solution quality for LSTM networks (Breuel, 2015a), but found that larger batch sizes achieved worse solutions for fully connected and convolutional networks, and that the scale of the effect depended on the activation function in the hidden and output layers (Breuel, 2015b).

Finally, Chen et al. (2016) observed no difference in solution quality when scaling the batch size from 1,600 to 6,400 for an Inception model on ImageNet when using the RMSProp optimizer and a heuristic to set the learning rate for each batch size.
3.2.2 Studies with Step Budgets
Hoffer et al. (2017) trained neural networks with two different batch sizes on several image data sets. They found that, by computing batch normalization statistics over a fixed number of examples per iteration (“ghost batch normalization”), and by scaling the learning rate with the square root of the batch size instead of some other heuristic, the solution quality arising from the larger batch size was as good as or better than the smaller batch size. However, the largest batch size used was 4,096, which does not rule out an effect appearing at still larger batch sizes, as suggested by the work of Goyal et al. (2017). Moreover, it remains open whether their proposed learning rate heuristic extends to arbitrarily large batch sizes, or whether it eventually breaks down for batch sizes sufficiently far from the base batch size.
3.2.3 Studies with Epoch Budgets
An epoch budget corresponds to fixing the total number of per-example gradient computations, but, in an idealized data-parallel implementation of SGD, it also corresponds to a step (or even wall time) budget that scales inversely with the batch size. With an epoch budget, a larger batch size can only achieve the same solution quality as a smaller batch size if it achieves perfect scaling efficiency (a reduction in steps proportional to the increase in batch size, as described in Section 3.1.1).
Masters and Luschi (2018) show that after a critical batch size depending on the model and data set, solution quality degrades with increasing batch size when using a fixed epoch budget. Their results effectively show a limited region of perfect scaling for those model and data set pairs when trained with SGD, although they did not investigate whether this critical batch size depends on the optimizer used, and they did not consider more than one epoch budget for each problem. We reproduced a subset of their experiments and discuss them in Section 5.
Goyal et al. (2017) recently popularized a linear learning rate scaling heuristic for training the ResNet50 model using different batch sizes. Using this heuristic, a 90 epoch budget, and SGD with momentum without adjusting or tuning the momentum, they increased the batch size from 64 to 8,192 with no loss in accuracy. However, their learning rate heuristic broke down for even larger batch sizes. Inspired by these results, a sequence of follow-up studies applied additional techniques to further increase the batch size while still achieving the same accuracy and using the same 90 epoch budget. These follow-up studies (Codreanu et al., 2017; You et al., 2017; Akiba et al., 2017) confirm that the best solution quality for a given batch size will also depend on the exact optimization techniques used.
There are several additional papers (Lin et al., 2018; Devarakonda et al., 2017; Anonymous, 2019) with experiments relevant to solution quality that use an epoch budget, tune the learning rate for the smallest batch size, and then use a heuristic to choose the learning rate for all larger batch sizes. For instance, Devarakonda et al. (2017) and Lin et al. (2018) used linear learning rate scaling and Anonymous (2019) tried constant, square root, and linear learning rate scaling heuristics. All of them conclude that, with a fixed epoch budget, small batch sizes yield better solution quality than large batch sizes, for various notions of “small” and “large.” This could just as easily be an artifact of the learning rate heuristics, and a possible alternative conclusion is that these heuristics are limited (as heuristics can often be).
4 Experiments and Results
The primary quantity we measure is the number of steps needed to first reach a desired out-of-sample error, or steps to result. To measure steps to result, we used seven image and text data sets with training set sizes ranging from 45,000 to 26 billion examples. Table 1 summarizes these data sets and Appendix A provides the full details. We chose six families of neural network to train on these data sets. For MNIST and Fashion MNIST, we chose a simple fully connected neural network and a simple convolutional neural network (CNN). For CIFAR10, we chose the ResNet8 model without batch normalization, partly to compare our results to Masters and Luschi (2018), and partly to have a version of ResNet without batch normalization. For ImageNet, we chose ResNet50, which uses batch normalization and residual connections, and VGG11, which uses neither. For Open Images, we chose ResNet50. For LM1B, we chose the Transformer model and an LSTM model. For Common Crawl, we chose the Transformer model. Table 2 summarizes these models and Appendix B provides the full details.

Table 1: Summary of data sets.

| Data Set | Type | Task | Size | Evaluation Metric |
|---|---|---|---|---|
| MNIST | Image | Classification | 55,000 | Classification error |
| Fashion MNIST | Image | Classification | 55,000 | Classification error |
| CIFAR10 | Image | Classification | 45,000 | Classification error |
| ImageNet | Image | Classification | 1,281,167 | Classification error |
| Open Images | Image | Classification (multilabel) | 4,526,492 | Average precision |
| LM1B | Text | Language modeling | 30,301,028 | Cross entropy error |
| Common Crawl | Text | Language modeling | ~26 billion | Cross entropy error |
Table 2: Summary of models.

| Model Class | Sizes | Optimizers | Data Sets | Learning rate schedule |
|---|---|---|---|---|
| Fully Connected | Various | SGD | MNIST | Constant |
| Simple CNN | Base, Narrow, Wide | SGD, Momentum, Nesterov mom. | MNIST, Fashion MNIST | Constant |
| ResNet | ResNet8 | SGD, Nesterov mom. | CIFAR10 | Linear decay |
| ResNet | ResNet50 | Nesterov mom. | ImageNet, Open Images | Linear decay |
| VGG | VGG11 | Nesterov mom. | ImageNet | Linear decay |
| Transformer | Base, Narrow and shallow, Shallow, Wide | SGD, Momentum, Nesterov mom. | LM1B, Common Crawl | Constant |
| LSTM | — | Nesterov mom. | LM1B | Constant |
Measuring steps to result requires a particular value of out-of-sample error to be chosen as the goal. Ideally, for each task and model, we would select the best achievable error, but since validation error is noisy, the best error is sometimes obtained unreliably. Moreover, for some workloads, the validation error continues to improve steadily beyond the maximum practical training time. Therefore, we generally tried to select the best validation error that we could achieve reliably within a practical training time.
Table 2 also shows the learning rate schedule we used for each model and data set. Learning rate schedules are often used to accelerate neural network training, but finding the best schedule is an optimization problem in its own right (Wu et al., 2018). Instead, researchers typically choose from a range of common learning rate functions based on validation performance and individual preference. While most schedules decay the learning rate monotonically over training, some researchers also “warm up” the learning rate at the start of training (e.g. He et al., 2016a), particularly when training with large batch sizes (Goyal et al., 2017). We ran experiments with both constant learning rates and with learning rate decay. We used decay for ResNet8, ResNet50, and VGG11, which significantly reduced training time for those models. We selected our decay function by running an extensive set of experiments with ResNet50 on ImageNet (see Appendix C for details). We chose linear decay because it performed at least as well as all other schedules we tried, while also being the simplest and requiring only two additional metaparameters. In experiments that used linear decay, we specified metaparameters $(\alpha, T)$ such that the learning rate decayed linearly from $\eta_0$ to $\alpha \eta_0$. That is, the learning rate at step $t$ is given by

$$\eta(t) = \begin{cases} \eta_0 - (1 - \alpha)\,\eta_0\,\dfrac{t}{T} & \text{if } t \le T, \\ \alpha\,\eta_0 & \text{if } t > T. \end{cases}$$
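As an illustration, a linear decay schedule of the kind described here can be sketched in a few lines of Python. This is a minimal sketch, not the code used in our experiments: the function name and the argument names `eta0` (initial learning rate), `alpha` (ratio of final to initial learning rate), and `decay_steps` (number of steps over which the decay happens) are hypothetical, and holding the learning rate constant after the decay ends is an assumption of the sketch.

```python
def linear_decay_lr(step, eta0, alpha, decay_steps):
    """Learning rate at `step` under a linear decay schedule.

    The rate falls linearly from eta0 at step 0 to alpha * eta0 at
    step `decay_steps`, and is held at alpha * eta0 afterwards
    (an assumption of this sketch).
    """
    if step >= decay_steps:
        return alpha * eta0
    return eta0 - (1.0 - alpha) * eta0 * step / decay_steps


# Illustrative values only: decay from 1.0 to 0.5 over 100 steps.
schedule = [linear_decay_lr(t, eta0=1.0, alpha=0.5, decay_steps=100)
            for t in (0, 50, 100, 500)]
```

Only two values beyond the initial learning rate are needed to specify the schedule, which is what makes it cheap to tune alongside the other metaparameters.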
Steps to result depends on the training metaparameters, and, for a given task and model, each batch size might have a different metaparameter configuration that minimizes steps to result. In all experiments, we independently tuned the metaparameters at each batch size, including the initial learning rate and, where learning rate decay was used, the decay schedule. Also, unless otherwise specified, we used the Nesterov momentum optimizer (Sutskever et al., 2013) and tuned the momentum. Tuning anew for each batch size is extremely important since otherwise we would not be measuring steps to result as a function of batch size; rather, we would be measuring steps to result as a function of batch size and the specific values of the learning rate and other metaparameters. We used quasirandom search (Bousquet et al., 2017) to tune the metaparameters with equal budgets of nondivergent^{11} trials for different batch sizes. We selected metaparameter search spaces by hand based on preliminary experiments. The exact number of nondivergent trials needed to produce stable results depends on the search space, but 100 trials seemed to suffice in all of our experiments.^{12} If the optimal trial occurred near the boundary of the search space, or if the goal validation error was not achieved within the search space, we repeated the search with a new search space. We measured steps to result for each batch size by selecting the metaparameter tuning trial that reached the goal validation error in the fewest number of steps.

^{11} We discarded trials with a divergent training loss. Typically, this occurred when the learning rate was too high.

^{12} LSTM on LM1B used 50 trials because we only tuned with fixed . We validated that tuning did not significantly affect the results for , 1,024, and 4,096.
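The selection rule at the end of this protocol can be sketched as follows. This is an illustrative sketch rather than our experiment code: the trial representation (a dictionary with a `diverged` flag and a `curve` of `(step, validation_error)` pairs sorted by step) and the function name are hypothetical, and the numbers are made up.

```python
def steps_to_result(trials, goal_error):
    """Fewest steps any non-divergent trial needed to first reach goal_error.

    Returns None when no trial reaches the goal, which under the protocol
    in the text would call for a new search space.
    """
    best = None
    for trial in trials:
        if trial["diverged"]:
            continue  # divergent trials are discarded
        for step, error in trial["curve"]:  # assumed sorted by step
            if error <= goal_error:
                if best is None or step < best:
                    best = step
                break  # only the first time the goal is reached counts
    return best


# Made-up tuning trials for a single batch size.
trials = [
    {"diverged": False, "curve": [(100, 0.30), (200, 0.09), (300, 0.08)]},
    {"diverged": False, "curve": [(100, 0.20), (200, 0.12), (300, 0.10)]},
    {"diverged": True, "curve": [(100, 0.05)]},
]
result = steps_to_result(trials, goal_error=0.10)
```

Repeating this for every batch size, each with its own independently tuned trials, yields one point per batch size on a steps to result curve.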
4.1 Steps to Result Depends on Batch Size in a Similar Way Across Problems
To get a sense of the basic empirical relationship, we measured the number of steps required to reach a goal validation error as a function of batch size across several different data sets and models (Figure 1). In all cases, as the batch size grows, there is an initial period of perfect scaling (a benefit proportional to the increase in batch size, indicated with a dashed line on the plots) where the steps needed to achieve the error goal halves for each doubling of the batch size. However, for all problems, this is followed by a region of diminishing returns that eventually leads to a regime of maximal data parallelism where additional parallelism provides no benefit whatsoever. In other words, for any given problem and without making strong assumptions about learning rates or other optimizer parameters, we can achieve both extremes suggested by theory (see Section 3.1.1). A priori, it is not obvious that every workload in our experiments should exhibit perfect scaling at the smallest batch sizes instead of immediately showing diminishing returns.
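To make the three regimes concrete, the following sketch labels each transition between consecutive batch sizes on a steps to result curve. It is only an illustration: the function names, the tolerance `tol`, and the numbers are hypothetical and are not measurements from our experiments.

```python
def scaling_ratios(batch_sizes, steps):
    """Return (batch size ratio, step reduction) for consecutive batch sizes."""
    pairs = []
    for i in range(1, len(batch_sizes)):
        batch_ratio = batch_sizes[i] / batch_sizes[i - 1]
        step_reduction = steps[i - 1] / steps[i]
        pairs.append((batch_ratio, step_reduction))
    return pairs


def classify(batch_ratio, step_reduction, tol=0.05):
    """Label one transition; `tol` is an arbitrary illustrative tolerance."""
    if step_reduction >= batch_ratio * (1 - tol):
        return "perfect scaling"
    if step_reduction <= 1 + tol:
        return "maximal data parallelism"
    return "diminishing returns"


# Made-up steps to result values for five batch sizes.
batch_sizes = [64, 128, 256, 512, 1024]
steps = [8000, 4000, 2500, 2000, 2000]
labels = [classify(b, s) for b, s in scaling_ratios(batch_sizes, steps)]
```

In this made-up example the first doubling halves the steps, the next two doublings help less and less, and the last doubling does not help at all.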
4.2 Validating Our Measurement Protocol
If the curves in Figure 1 were sensitive to the exact choice of goal validation error, then measuring the steps needed to first reach a particular validation error would not be a meaningful proxy for training speed. For small changes in the goal validation error, we do not care about vertical shifts as long as the transition points between the three scaling regions remain relatively unchanged. Figure 2 shows that varying the error goal only vertically shifts the steps-to-result curve, at least for modest variations centered around a good absolute validation error. Furthermore, although we ultimately care about out-of-sample error, if our steps-to-result plots looked very different when measuring the steps needed to reach a particular training error, then we would need to present our results somewhat differently and include both curves. However, switching to training error does not change the plots much at all (see Figure 12 in the Appendix).
Our experiments depend on extensive metaparameter tuning for the learning rate, momentum, and, where applicable, the learning rate schedule. For each experiment, we verified our metaparameter search space by checking that the optimal trial was not too close to a boundary of the space. See Figures 13 and 14 in the Appendix for examples of how we verified our search spaces.
4.3 Some Models Can Exploit Much Larger Batch Sizes Than Others
We investigated whether some models can make more use of larger batches than others by experimenting with different models while keeping the data set and optimizer fixed. We explored this question in two ways: (i) by testing completely different model architectures on the same data set, and (ii) by varying the size (width and depth) of a model within a particular model family. Since the absolute number of steps needed to reach a goal validation error depends on the model, the steps to result vs batch size curves for each model generally appear at different vertical offsets from each other. Since we primarily care about the locations of the perfect scaling, diminishing returns, and maximal data parallelism regions, we normalized the y-axis of each plot by dividing by the number of steps needed to reach the goal for a particular batch size and data set. This normalization corresponds to a vertical shift of each curve (on log-scale plots), and makes it easier to compare different models. Appendix D contains all plots in this section without the y-axis normalized.
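The normalization can be sketched as below. The function name, the choice of reference batch size, and the numbers are hypothetical and serve only to show why dividing by a constant is a vertical shift on a log-scale plot.

```python
import math


def normalize_steps(steps_by_batch, reference_batch):
    """Divide every steps to result value by the value at a reference batch size."""
    reference = steps_by_batch[reference_batch]
    return {batch: steps / reference for batch, steps in steps_by_batch.items()}


# Made-up curve for one model; the reference batch size is arbitrary.
raw = {64: 8000, 128: 4000, 256: 2500}
normalized = normalize_steps(raw, reference_batch=64)

# On a log axis, every point moves down by the same amount, log(reference),
# so the shape of the curve (and the location of each regime) is unchanged.
shifts = [math.log(raw[b]) - math.log(normalized[b]) for b in raw]
```

Because the shift is the same for every batch size, the batch sizes at which a curve leaves perfect scaling or flattens out are unaffected by the normalization.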
Figures 3(a)–3(c) show that the model architecture significantly affects the relationship between batch size and the number of steps needed to reach a goal validation error. In Figure 3(a), the curve for the Fully Connected model flattens later than for the Simple CNN model on MNIST (although in this case the Simple CNN model can ultimately achieve better performance than the Fully Connected model). In Figure 3(b), the curve for ResNet50 flattens much later than the curve for VGG11, indicating that ResNet50 can make better use of large batch sizes on this data set. Unlike ResNet50, VGG11 does not use batch normalization or residual connections. Figure 3(c) shows that Transformer can make better use of large batch sizes than LSTM on LM1B.
Figures 3(d)–3(f) show that varying the depth and width can affect a model’s ability to exploit larger batches, but not necessarily in a consistent way across different model architectures. In Figure 3(d), the regions of perfect scaling, diminishing returns, and maximum useful batch size do not change much when the width is varied for the Fully Connected model on MNIST, although the shallower model seems less able to exploit larger batches than the deeper models. This contrasts with the findings of Chen et al. (2018), although they changed width and depth simultaneously while keeping the number of parameters fixed. For Simple CNN on MNIST, the relationship between batch size and steps to a goal validation error seems not to depend on width at all (Figure 15(e) in the Appendix shows that the curves are the same even when the y-axis is not normalized). However, in Figure 3(f), the curves for narrower Transformer models on LM1B flatten later than for wider Transformer models, while the depth seems to have less of an effect. Thus, reducing width appears to allow Transformer to make more use of larger batch sizes on LM1B.
4.4 Momentum Extends Perfect Scaling to Larger Batch Sizes, but Matches Plain SGD at Small Batch Sizes
We investigated whether some optimizers can make better use of larger batches than others by experimenting with plain SGD, SGD with momentum, and Nesterov momentum on the same model and data set. Since plain SGD is a special case of both Nesterov momentum and SGD with momentum (with the momentum set to zero in each case), and since we tune the momentum in all experiments, we expect that experiments with either of these optimizers should do no worse than plain SGD at any batch size. However, it is not clear a priori whether momentum optimizers should outperform SGD, either by taking fewer training steps or by extending the perfect scaling region to larger batch sizes.
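The special-case relationship can be checked with a small sketch. The update rules below are one common formulation of SGD with momentum and Nesterov momentum and are an assumption of this sketch (the exact update rules we use are defined in a section not reproduced here). The function names and the toy objective are hypothetical.

```python
def sgd_step(theta, grad, lr):
    """Plain SGD update on a scalar parameter."""
    return theta - lr * grad


def momentum_step(theta, velocity, grad, lr, gamma):
    """SGD with (heavy-ball) momentum, one common formulation."""
    velocity = gamma * velocity + grad
    return theta - lr * velocity, velocity


def nesterov_step(theta, velocity, grad, lr, gamma):
    """Nesterov momentum, one common formulation."""
    velocity = gamma * velocity + grad
    return theta - lr * (grad + gamma * velocity), velocity


def run(update, steps=5, lr=0.1, gamma=0.0):
    """Minimize f(theta) = theta**2 / 2, whose gradient is theta."""
    theta, velocity = 1.0, 0.0
    for _ in range(steps):
        grad = theta
        if update is sgd_step:
            theta = update(theta, grad, lr)
        else:
            theta, velocity = update(theta, velocity, grad, lr, gamma)
    return theta


# With zero momentum, all three optimizers follow the same trajectory.
same = run(sgd_step) == run(momentum_step) == run(nesterov_step)
```

Because zero momentum lies inside the momentum search space, a well-tuned momentum optimizer can always fall back to the behavior of plain SGD.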
Figure 4 shows that Nesterov momentum and SGD with momentum can both extend the perfect scaling region beyond that achieved by SGD, and thus can significantly reduce the number of training steps required to reach a goal validation error at larger batch sizes. However, at batch sizes small enough that all optimizers are within their perfect scaling region, momentum optimizers perform identically to SGD without momentum. Though initially surprising, this identical performance at small batch sizes is consistent with observations made in Kidambi et al. (2018). In our experiments, we did not see a large difference between Nesterov momentum and SGD with momentum – Nesterov momentum appears to scale slightly better for Transformer on LM1B, but both perform about equally well for Simple CNN on MNIST.
4.5 The Data Set Matters, at Least Somewhat
We investigated whether properties of the data set make some problems able to exploit larger batch sizes than others by experimenting with different data sets while keeping the model and optimizer fixed. We explored this question in two ways: (i) by testing the same model on completely different data sets, and (ii) by testing the same model on different subsets of the same data set. We normalized the y-axis of all plots in this section in the same way as Section 4.3. Appendix D contains all plots in this section without the y-axis normalized.
Figure 5 shows that changing the data set can affect the relationship between batch size and the number of steps needed to reach a goal validation error. Figure 5(a) shows that Fashion MNIST deviates from perfect scaling at a slightly larger batch size than MNIST for the Simple CNN model. Figure 5(b) shows that ImageNet and Open Images are extremely similar in how well ResNet50 can make use of larger batch sizes, although, if anything, ImageNet might make slightly better use of larger batch sizes. Figure 5(c) shows that LM1B scales slightly better with increasing batch size than Common Crawl for Transformer. Since Fashion MNIST is the same size as MNIST, Open Images is larger than ImageNet, and Common Crawl is far larger than LM1B, these differences are not as straightforward as larger data sets making larger batch sizes more valuable.
To disentangle the effects from changes to the distribution and changes to the number of examples, we generated steps to result vs batch size plots for different random subsets of MNIST (Figure 6(a)) and ImageNet (Figure 6(b)). For MNIST, we selected subsets of different sizes, while for ImageNet, we selected a random subset of half the images and a similar sized subset that only includes images from half of the classes. At least on MNIST, any effect on the maximum useful batch size is extremely small or nonexistent. For ImageNet, Figure 6(b) shows that the random subset of half the images deviates from perfect scaling sooner than the full data set, but the curve for the subset with half the classes is very close to the curve for the full data set and, if anything, deviates from perfect scaling later, even though it contains roughly the same number of images as the random subset.
4.6 Regularization Can Be More Helpful at Some Batch Sizes Than Others
We used label smoothing (Szegedy et al., 2016) to regularize training in our experiments with ResNet50 on ImageNet. Without label smoothing, we could not achieve our goal validation error rate of 0.25 with batch sizes greater than within our training budget. With a fixed compute budget for each batch size, label smoothing improved the error by as much as one percentage point at large batch sizes, while having no apparent effect at small batch sizes (Figure 7(a)). Meanwhile, if multiple choices for the label smoothing metaparameter achieved the goal within the training budget, then label smoothing did not change the number of steps needed (Figure 7(b)).
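For readers unfamiliar with label smoothing, the standard construction of Szegedy et al. (2016) replaces the one-hot training target with a mixture of the one-hot vector and the uniform distribution over classes. A minimal sketch follows. The function and argument names are hypothetical, and the smoothing value in the example is illustrative rather than one we used.

```python
def smooth_labels(true_class, num_classes, smoothing):
    """Mix a one-hot target with the uniform distribution over classes.

    With smoothing = 0 this is the usual one-hot target; larger values
    move probability mass from the true class to all classes equally.
    """
    uniform_mass = smoothing / num_classes
    target = [uniform_mass] * num_classes
    target[true_class] += 1.0 - smoothing
    return target


# Illustrative example: 4 classes, true class 2, smoothing 0.1.
target = smooth_labels(true_class=2, num_classes=4, smoothing=0.1)
```

The cross entropy loss is then computed against `target` instead of the one-hot vector, which penalizes overconfident predictions.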
We confirmed that label smoothing reduced overfitting at large batch sizes for ResNet50 on ImageNet (see Figure 18 in the Appendix). This is consistent with the idea that noise from small batch training is a form of implicit regularization (e.g. Goodfellow et al., 2016). However, although our results show that other forms of regularization can serve in place of this noise, it might be difficult to select and tune other forms of regularization for large batch sizes. For example, we unsuccessfully tried to control overfitting with larger batch sizes by increasing the L2 weight penalty and by applying additive Gaussian gradient noise before we obtained good results with label smoothing.
Finally, we also tried label smoothing with Simple CNN on MNIST and Fashion MNIST, and found that it generally helped all batch sizes, with no consistent trend of helping smaller or larger batch sizes more (see Figure 19 in the Appendix), perhaps because these data sets are sufficiently small and simple that overfitting is an issue at all batch sizes.
4.7 The Best Learning Rate and Momentum Vary with Batch Size
Across all problems we considered, the effective learning rate (see Section 2.2) that minimized the number of training steps to a goal validation error tended to increase with increasing batch size (Figure 8). However, it did not always follow either a linear or square root scaling heuristic, despite the popularity of these rules of thumb. In some cases, the optimal effective learning rate even decreased for larger batch sizes. We also found that the best effective learning rate should be chosen by jointly tuning the learning rate and momentum, rather than tuning only the learning rate. For example, the optimal way to scale the effective learning rate for Transformer was to increase the momentum while decreasing the learning rate or holding it constant (see Figures 21 and 22 in the Appendix). This is a refinement to past prescriptions that only change the learning rate while keeping the momentum fixed.
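For reference, the two rules of thumb mentioned above can be written down in a few lines. The sketch below also includes an effective learning rate helper that assumes the common definition of learning rate divided by one minus the momentum; that definition, like all names and numbers here, is an assumption of the sketch and should be checked against Section 2.2.

```python
import math


def linear_scaling(base_lr, base_batch, batch):
    """Linear heuristic: learning rate grows in proportion to batch size."""
    return base_lr * batch / base_batch


def sqrt_scaling(base_lr, base_batch, batch):
    """Square root heuristic: learning rate grows with sqrt of batch size."""
    return base_lr * math.sqrt(batch / base_batch)


def effective_lr(lr, momentum):
    """Assumed definition of the effective learning rate: lr / (1 - momentum)."""
    return lr / (1.0 - momentum)


# Illustrative values: scaling from batch size 256 to 1,024.
lr_linear = linear_scaling(0.1, 256, 1024)
lr_sqrt = sqrt_scaling(0.1, 256, 1024)

# Raising the momentum at a fixed learning rate also raises the
# effective learning rate, without touching the learning rate itself.
eff_low = effective_lr(0.1, 0.9)
eff_high = effective_lr(0.1, 0.99)
```

Both heuristics fix the functional form in advance, whereas tuning the learning rate and momentum jointly at each batch size lets the data decide how the effective learning rate should change.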
We further investigated the relationship between learning rate, momentum, and training speed by examining our metaparameter search spaces for different batch sizes and model sizes. For this analysis, we used Transformer on LM1B with Nesterov momentum because the metaparameter search spaces are consistent between all batch and model sizes, and can be easily visualized because they consist only of the constant learning rate and the momentum. We observe the following behaviors:

- With increasing batch size, the region in metaparameter space corresponding to rapid training in terms of epochs becomes smaller (Figure 9(a), consistent with the findings of Breuel, 2015b), while the region in metaparameter space corresponding to rapid training in terms of step count grows larger (Figure 9(b), although it eventually plateaus for batch sizes in the maximal data parallelism regime). Thus, with a fixed error goal and in a setting where training epochs are constrained (e.g. a compute budget), it may become more challenging to choose good values for the metaparameters with increasing batch size. Conversely, with a fixed error goal and in a setting where training steps are constrained (e.g. a wall time budget), it may become easier to choose good values for the metaparameters with increasing batch size.

- The metaparameters yielding the fastest training are typically on the edge of the feasible region of the search space (Figure 9). In other words, small changes in the optimal metaparameters might make training diverge. This behavior may pose a challenge for metaparameter optimization techniques, such as Gaussian Process approaches, that assume a smooth relationship between metaparameter values and model performance. It could motivate techniques such as learning rate warmup that enable stability at larger eventual learning rates, since the maximum stable learning rate depends on the current model parameters. That said, we did not need to use learning rate warmup for any of our problems. We also did not observe this behavior for ResNet50 on ImageNet. Figure 20 in the Appendix shows the results for a range of effective learning rates near the optimum for ResNet50 on ImageNet and Transformer on LM1B.
4.8 Solution Quality Depends on Compute Budget More Than Batch Size
We investigated the relationship between batch size and out-of-sample error for Simple CNN on MNIST and Fashion MNIST, and for two sizes of Transformer on LM1B. For each task, we ran a quasirandom metaparameter search over the constant learning rate and Nesterov momentum. For MNIST and Fashion MNIST, we also added label smoothing and searched over the label smoothing parameter to mitigate any confounding effects of overfitting (see Section 4.6). We ran 100 metaparameter trials for each batch size with a large practical wall time budget.
To disentangle the effects of the batch size from the compute budget, we compared batch sizes subject to budgets of either training steps or training epochs. For each batch size and compute budget, we found the model checkpoint that achieved the best validation accuracy across all metaparameter trials, and across all training steps that fell within the compute budget. Figure 11 shows the validation error for these best-validation-error checkpoints, as a function of batch size, for a range of compute budgets. We observe that, subject to a budget on training steps, larger batch sizes achieve better out-of-sample error than smaller batch sizes, but subject to a budget on training epochs, smaller batch sizes achieve better out-of-sample error than larger batch sizes. These results are likely explained by the fact that, for a fixed number of training steps, larger batch sizes train on more data, while for a fixed number of epochs, smaller batch sizes perform more training steps.
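The checkpoint selection under the two kinds of budget can be sketched as follows. This is an illustrative sketch with hypothetical names and made-up numbers: checkpoints from all trials at one batch size are pooled into a list of `(step, validation_error)` pairs, and an epoch budget is converted to the number of whole steps it allows.

```python
def best_error_within_steps(checkpoints, max_steps):
    """Best validation error among pooled checkpoints within a step budget."""
    eligible = [error for step, error in checkpoints if step <= max_steps]
    return min(eligible) if eligible else None


def epoch_budget_to_steps(num_epochs, num_examples, batch_size):
    """Number of whole training steps that fit in an epoch budget."""
    return num_epochs * num_examples // batch_size


# Made-up pooled checkpoints for one batch size.
checkpoints = [(100, 0.50), (200, 0.30), (400, 0.20)]

# The same epoch budget allows fewer steps at a larger batch size.
steps_small_batch = epoch_budget_to_steps(2, 1000, 10)
steps_large_batch = epoch_budget_to_steps(2, 1000, 20)
best_small = best_error_within_steps(checkpoints, steps_small_batch)
```

Under a step budget every batch size gets the same `max_steps`, so larger batches see more data; under an epoch budget `max_steps` shrinks as the batch size grows, so smaller batches take more steps.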
The workloads in Figure 11 represent two distinct modes of neural network training. For the small MNIST and Fashion MNIST data sets, we chose training budgets that would saturate (or almost saturate) performance at each batch size. In other words, out-of-sample error cannot be improved any further by simply increasing the budget, with caveats due to practical limitations on our ability to find optimal values for the metaparameters. Figures 11(a) and 11(b) show that differences in maximum performance between batch sizes on these data sets are very small (Figures 23 and 24 in the Appendix contain zoomed versions of these plots). We cannot rule out that any differences at this magnitude are due to noise from metaparameter choices and training stochasticity. Thus, for these workloads at least, the effect of batch size on solution quality is either very small or nonexistent. On the other hand, we cannot saturate performance with Transformer on LM1B within a practical training time. In this case, the scenario is much simpler: for a given batch size, the best error is achieved by the largest compute budget. Larger batch sizes are favored by compute budgets defined in terms of training steps, while smaller batch sizes are favored by compute budgets defined in terms of training epochs.
Taken together, these observations suggest that in practice the relevant question is not which batch size leads to the best performance, but rather how compute budget varies as a function of batch size. Although we tried our best to saturate performance with MNIST and Fashion MNIST, we found that it took millions of training steps for small batch sizes, and thousands of epochs for large batch sizes, even for data sets as small and simple as these. Indeed, despite sampling 100 metaparameter configurations per batch size and training for up to 25 hours per configuration, it is still not certain whether we truly saturated performance at the smallest and largest batch sizes (see Figures 23 and 24 in the Appendix). Thus, the regime of saturated performance is of limited practical concern for most workloads – the compute budget required to saturate performance is likely beyond what a practitioner would typically use. For realistic workloads, practitioners should be most concerned with identifying the batch size at which they can most efficiently apply their compute.
5 Discussion
Our goals in measuring the effects of data parallelism on neural network training were twofold: first, we hoped to produce actionable advice for practitioners, and second, we hoped to understand the utility of building systems capable of very high degrees of data parallelism. Our results indicate that, for idealized data parallel hardware, there is a universal relationship between training time and batch size, but there is dramatic variation in how well different workloads can make use of larger batch sizes. Across all our experiments, increasing the batch size initially reduced the number of training steps needed proportionally. However, depending on the workload, this perfect scaling regime ended anywhere from a batch size of to a batch size of . As batch size increases beyond the perfect scaling regime, there are diminishing returns (where increasing the batch size by a factor of $k$ only reduces the number of training steps needed by a factor less than $k$) that end with a maximum useful batch size (where increasing the batch size no longer changes the number of training steps needed). Once again, the maximum useful batch size is extremely problem-dependent and varied between roughly and in our experiments. Other workloads may have the region of perfect scaling end at batch sizes even smaller or larger than the range we observed, as well as having even smaller or larger maximum useful batch sizes.
On the one hand, the possibility that perfect scaling can extend to batch sizes beyond for some workloads is good news for practitioners because it suggests that efficient data-parallel systems can provide extremely large speedups for neural network training. On the other hand, the wide variation in scaling behavior across workloads is bad news because any given workload might have a maximum useful batch size well below the limits of our hardware. Moreover, for a new workload, measuring the training steps needed as a function of batch size and confirming the boundaries of the three basic scaling regimes requires expensive experiments. In this work, we have only described how to retrospectively predict the scaling behavior by tuning the optimization metaparameters for every batch size. Although Anonymous (2019) also described the same basic scaling behavior we found, in their experiments the relationship did not appear consistently across problems, across error goals, or in out-of-sample error. In light of our own results, the heuristics they assumed for adjusting the learning rate as a function of batch size are the likely cause of these inconsistencies, but this explanation only drives home the inconvenience of having to carefully tune at every new batch size. We were unable to find reliable support for any of the previously proposed heuristics for adjusting the learning rate as a function of batch size. Thus we are forced to recommend that practitioners tune all optimization parameters anew when they change the batch size or they risk masking the true behavior of the training procedure.
If the scaling behavior of workloads with respect to batch size has a simple dependence on properties of the workload, then we might be able to predict the limits of perfect scaling (or the maximum useful batch size) before running extensive experiments. We could then prioritize workloads to run on specialized hardware or decide whether gaining access to specialized hardware would be useful for a given workload of interest. On the one hand, our results are bad news for practitioners because they show that accurate scaling predictions must depend on a combination of nonobvious properties of the model, properties of the optimizer, and properties of the data set. On the other hand, we have a lot of control over the choice of model and optimizer and there is some indication that model and optimizer properties might be responsible for the largest portion of the variation between workloads. Our results comparing SGD and SGD with momentum (or Nesterov momentum) show that, at least for the problems we tried, momentum can extend perfect scaling to much larger batch sizes, offering clear guidance for practitioners. Other optimizers, such as KFAC (Martens and Grosse, 2015; Grosse and Martens, 2016; Ba et al., 2017), or optimization techniques designed specifically for massively data parallel systems (e.g. Li et al., 2014), might allow perfect scaling to extend much further. Intuitively, it seems plausible that optimizers that estimate local curvature information might be able to benefit more from large batches than optimizers that only use gradients.
Although the model seems to have a large effect on the maximum useful batch size and the limit of perfect scaling, our results do not give definitive answers on exactly how to design models that scale better for a given optimizer and data set. Even when we kept the model family fixed, we observed somewhat inconsistent results from changing the model width and depth. Chen et al. (2018) suggested that wider models can exploit larger batch sizes than narrower models, but their theoretical arguments only apply to linear networks and fully connected networks with a single hidden layer. In contrast, we found that narrower variants of the Transformer model scaled better to larger batch sizes, although it is unclear if the same notion of “width” transfers between different types of neural networks.
Unlike the model and optimizer, we generally have much less control over the data set. Unfortunately, properties of the data set also affect how well training scales in practice. Our results are equivocal on whether the number of training examples has any effect, but changing the data set entirely can certainly change the scaling behavior with respect to batch size.
Finally, our results at least partially reconcile conflicting stances in the literature on whether increasing the batch size degrades model quality. Our experiments show that:

Any study that only tunes the learning rate for one batch size and then uses a heuristic to choose the learning rate for other batch sizes (Goyal et al., 2017; Keskar et al., 2017; Hoffer et al., 2017; Lin et al., 2018; Devarakonda et al., 2017; Anonymous, 2019) gives a systematic advantage to the batch size used in tuning (as well as nearby batch sizes). Our results did not show a simple relationship between the optimal learning rate and batch size that scales indefinitely (see Figures 8 and 21), so the use of simple heuristics for batch sizes sufficiently far from the base batch size could very well explain the degraded solutions and divergent training reported in prior work. Similarly, the optimal values of other metaparameters, such as the momentum and learning rate decay schedule, should not be assumed to remain constant or scale in a simple way as the batch size increases.

Assuming an epoch budget when comparing solution quality between batch sizes (Masters and Luschi, 2018; Goyal et al., 2017; Lin et al., 2018; Devarakonda et al., 2017), in effect, limits an investigation to the perfect scaling region of the steps to result vs batch size curve (see Figure 1). This budget favors smaller batch sizes because they will perform more optimizer steps for the same number of training examples (see Section 4.8). Certainly, there are situations where an epoch budget is appropriate, but there may exist budgets just outside the perfect scaling region that can achieve the same quality solution, and those budgets may still represent a significant reduction in the number of training steps required. Moreover, even for a fixed model and data set, simply changing the optimizer can significantly extend the perfect scaling regime to larger batch sizes. For example, Masters and Luschi (2018) found that test performance of ResNet8 (without batch normalization) on CIFAR10 with a fixed epoch budget degraded after batch size 16, but considered only plain minibatch SGD. Our experiments confirmed that perfect scaling ends at batch size 16 with plain minibatch SGD, but using Nesterov momentum extends the perfect scaling regime to batch size 256 (see Figure 0(c)).

Assuming a step budget when comparing solution quality between batch sizes (Hoffer et al., 2017) might favor larger batch sizes because they will see more training examples for the same number of gradient updates (see Section 4.8). A step budget is likely sufficient for a larger batch size to reach at least the same performance as a smaller batch size: we never saw the number of steps to reach a goal validation error increase when the batch size was increased (see Figure 1).

Increasing the batch size reduces noise in the gradient estimates (see Equation 4). However, the noise in updates due to small batches might, in some cases, provide a helpful regularization effect (Goodfellow et al., 2016; Smith and Le, 2018). Thankfully, other regularization techniques, such as label smoothing, can replace this effect (see Section 4.6). Others have also used regularization techniques, such as data augmentation (Keskar et al., 2017) and L2 regularization (Smith and Le, 2018), to eliminate the “generalization gap” between two batch sizes.

Finally, although we do not believe there is an inherent degradation in solution quality associated with increasing the batch size, depending on the compute budget, it may become increasingly difficult to find good values for the metaparameters with larger batch sizes. Specifically, increasing the batch size may shrink the region in metaparameter space corresponding to rapid training in terms of epochs (see Figure 8(a)), as previously reported by Breuel (2015b). On the other hand, increasing the batch size may increase the region in metaparameter space corresponding to rapid training in terms of steps (see Figure 8(b)).
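The difference between an epoch budget and a step budget discussed above comes down to simple arithmetic. The following is a minimal sketch; the function names and the budget values used in the usage note are hypothetical and are not taken from our experiments.

```python
import math


def steps_under_epoch_budget(num_examples, batch_size, num_epochs):
    """Optimizer steps taken when the budget is a fixed number of epochs."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


def examples_under_step_budget(batch_size, num_steps):
    """Training examples processed when the budget is a fixed number of steps."""
    return batch_size * num_steps
```

For a training set of 45,000 examples and a hypothetical budget of 10 epochs, batch size 16 gets 28,130 optimizer steps while batch size 256 gets only 1,760, which is why an epoch budget favors small batches. Under a hypothetical budget of 1,000 steps, batch size 256 processes 16 times as many examples as batch size 16, which is why a step budget favors large batches.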
5.1 Limitations of our experimental protocol
When interpreting our results, one should keep in mind any limitations of our experimental protocol, even if they seem minor. We do not believe any of these limitations are debilitating, and we hope that describing these potential areas of concern will spur methodological innovation in future work.
Firstly, we were unable to avoid some amount of human judgment when tuning metaparameters. Although we did not tune metaparameters by hand, we specified the search spaces for automatic tuning by hand and they may not have been equally appropriate for all batch sizes, despite our best efforts. We are most confident in our search spaces that tuned the fewest metaparameters (such as in our experiments that only tuned learning rate and momentum). We found it quite difficult to be confident that our tuning was sufficient when we searched over learning rate decay schedules; readers should be aware that the steps to result measurement is generally quite sensitive to the learning rate schedule. Thus, we may not have sampled enough trials at some batch sizes or, nearly equivalently, our search spaces may have been too wide at some batch sizes. Even though we verified that the best trial was not on the boundary of the search space, this by no means guarantees that we found the globally optimal metaparameters.
Smaller batch sizes typically had more opportunities to measure validation error and, when validation error was noisy, got more chances to sample a lucky validation error. Batch sizes (usually larger ones) that did not reach the goal validation error using the first search space we tried used revised search spaces that gave them an extra bite of the apple, so to speak.
Finally, our analysis does not consider how robustly we can reach a goal error rate. For instance, we did not distinguish between batch sizes where all 100 trials achieved the goal validation error and batch sizes where only one of the 100 trials achieved the goal. The maximum or minimum value over a set of trials is not usually a very robust statistic, but something like the 50th percentile trial is a close to meaningless quantity that mostly reveals information about the search space. We tried to strike a balance between our desire to study realistic workloads and our desire to be able to repeat our experiments so many times over that these uncertainty questions become trivial. Ultimately, for this work, we opted for simplicity of presentation and reported results for optimal trials.
6 Conclusions and Future Work
Increasing the batch size is a simple way to produce valuable speedups across a range of workloads, but, for all the workloads we tried, the benefits diminished well within the limits of state-of-the-art hardware. Unfortunately, blindly increasing the batch size to the current limits of our hardware will not produce a large speedup for all workloads. However, our results suggest that some optimization algorithms may be able to consistently extend perfect scaling across many models and data sets. Future work should perform our same measurements with other optimizers, beyond the closely related ones we tried, to see if any existing optimizer extends perfect scaling across many problems. Alternatively, if we only crave speedups for specific, high-value problems, we can also consider designing models that extend perfect scaling to much larger batch sizes. However, unlike the optimizer, practitioners are likely to tailor their model architectures to the specific problems at hand. Therefore, instead of searching for model architectures that happen to scale extremely well, future work should try to uncover general principles for designing models that can scale perfectly to larger batch sizes. Even if such principles remain elusive, we would still benefit from methods to prospectively predict the scaling behavior of a given workload without requiring careful metaparameter tuning at several different batch sizes. Although not all of these avenues of future work may pan out, the deep learning community can always benefit from methodical experiments designed to test hypotheses, characterize phenomena, and reduce confusion, to balance more exploratory work designed to generate new ideas for algorithms and models.
Acknowledgements
We would like to thank Tomer Koren for helpful discussions. We would also like to thank Justin Gilmer and Simon Kornblith for helpful suggestions and comments on the manuscript. Finally, we would like to thank Matt J. Johnson for letting us borrow some computing resources.
Appendix A Data Set Details
In this section we give details of the data sets summarized in Table 1.
A.1 Descriptions and data augmentation
MNIST (LeCun et al., 1998) is a classic handwritten digit image classification data set with 10 mutually exclusive classes. We split the original training set into 55,000 training images and 5,000 validation images, and used the official test set of 10,000 images. We did not use data augmentation.
Fashion MNIST (Xiao et al., 2017) is another reasonably simple image classification data set with 10 mutually exclusive classes. It was designed as a drop-in replacement for MNIST. We split the original training set into 55,000 training images and 5,000 validation images, and used the official test set of 10,000 images. We did not use data augmentation.
CIFAR10 (Krizhevsky, 2009) is an image classification data set of 32 × 32 color images with 10 mutually exclusive classes. We split the original training set into 45,000 training images and 5,000 validation images. We used the official test set of 10,000 images. We preprocessed each image by subtracting the average value across all pixels and channels and dividing by the standard deviation (we used the TensorFlow op tf.image.per_image_standardization). We did not use data augmentation.
ImageNet (Russakovsky et al., 2015) is an image classification data set with 1,000 mutually exclusive classes. We split the official training set into 1,281,167 training images and 50,045 test images, and used the official validation set of 50,000 images. We preprocessed the images and performed data augmentation in a similar way to Simonyan and Zisserman (2014). Specifically, at training time, we sampled a random integer side length, performed an aspect-preserving resize so that the smallest side had that length, and took a random crop of fixed size. We randomly reflected the images horizontally, but unlike Simonyan and Zisserman (2014) we did not distort the colors. At evaluation time, we performed an aspect-preserving resize so that the smallest side had length 256, and took a central crop. In both training and evaluation, we then subtracted the global mean RGB value from each pixel using the values computed by Simonyan and Zisserman (2014) (see https://gist.github.com/ksimonyan/211839e770f7b538e2d8#description for the mean RGB values used).
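The per-image standardization applied to CIFAR10 can be sketched in NumPy as follows. This is an illustrative re-implementation rather than the code we ran; it assumes the behavior documented for tf.image.per_image_standardization, which floors the standard deviation at 1/sqrt(N) (N being the number of values in the image) to avoid division by zero on uniform images.

```python
import numpy as np


def per_image_standardization(image):
    """Subtract the mean over all pixels and channels, then divide by the std.

    The std is floored at 1/sqrt(N), where N is the number of values in the
    image, so that a constant image maps to all zeros instead of NaNs.
    """
    image = np.asarray(image, dtype=np.float64)
    mean = image.mean()
    adjusted_std = max(image.std(), 1.0 / np.sqrt(image.size))
    return (image - mean) / adjusted_std
```

The statistics are computed per image, not over the data set, so no pass over the training set is needed before preprocessing.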
Open Images v4 (Krasin et al., 2017) is a data set of 9 million images that are annotated with image-level labels and object bounding boxes (available at https://storage.googleapis.com/openimages/web/index.html). The image labels were generated by a computer vision model and then verified as either positive or negative labels by human annotators. We only considered the 7,186 “trainable” classes with at least 100 human-annotated positives in the training set. We filtered the official subsets to images with at least one positive trainable label, which produced training, validation and test sets of size 4,526,492; 41,225; and 124,293 images, respectively. On average, each image in the training set has 2.9 human-annotated positive labels, while each image in the validation and test sets has 8.4 human-annotated positive labels. We only considered the human-annotated positives and assumed all other classes were negative. We preprocessed the images and performed data augmentation identically to ImageNet.
LM1B (Chelba et al., 2014) is a text data set of English news articles (available at http://www.statmt.org/lmbenchmark/). We used the official training set and created validation and test sets using files news.en.heldout00000of00050 and news.en.heldout00001of00050, respectively. These splits contain 30,301,028; 6,075; and 6,206 sentences, respectively. We used an invertible word tokenizer to split the text into sub-word tokens with a vocabulary of size 32,000 (the code for processing the raw data and generating the vocabulary is available at https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/lm1b.py). On average, the training set contains around 20 tokens per sentence and the validation and test sets contain around 29 tokens per sentence. At training time, we clipped long sentences to the first 64 tokens, which affected only about 2% of sentences. We did not clip long sentences at evaluation time. The longest sentence across the validation and test sets has 476 tokens.
Common Crawl is a repository of web data containing over 3 billion web pages (available at http://commoncrawl.org/2017/07/june2017crawlarchivenowavailable/). We filtered and processed the data set identically to Anil et al. (2018) (see https://github.com/googleresearch/googleresearch/tree/master/codistillation for document ids). The vocabulary contains 24,006 sub-word tokens. We randomly partitioned the sentences into a training set (99.98%) and a holdout set (0.02%). Our training set contains billion sentences. We used the first 6,075 sentences of the holdout set as our validation set, which is the same number of sentences in our LM1B validation set. Some sentences are tens of thousands of tokens long. To maintain consistency with our LM1B processing, we clipped sentences to 64 tokens at training time and 476 at evaluation time.
A.2 Evaluation metrics
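We use classification error for MNIST, Fashion MNIST, CIFAR10, and ImageNet. To compute this metric, we consider the model’s classification for each image to be the class it assigned the highest probability. Then
\[ \text{classification error} = \frac{\#\,\text{incorrect classifications}}{\#\,\text{classifications}}. \]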
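Classification error as described here, the fraction of examples whose highest-probability class differs from the label, can be sketched in a few lines of NumPy. The function name and the array layout (one row of class probabilities per example) are assumptions made for illustration.

```python
import numpy as np


def classification_error(probabilities, labels):
    """Fraction of examples whose arg-max class differs from the true label.

    probabilities: array of shape (num_examples, num_classes).
    labels: integer array of shape (num_examples,).
    """
    predictions = np.argmax(np.asarray(probabilities), axis=1)
    return float(np.mean(predictions != np.asarray(labels)))
```

For example, four predictions of which one is wrong give an error of 0.25.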
We use class-agnostic average precision ($AP$) for Open Images. To compute this metric, we first rank each image-class pair by the predicted likelihood of the class being a true positive for that image. Then
\[ AP = \frac{1}{w} \sum_{k=1}^{NC} \mathrm{Precision}(k) \cdot \mathrm{Relevance}(k), \tag{7} \]
where $\mathrm{Precision}(k)$ is the precision when considering the top $k$ image-class pairs, $\mathrm{Relevance}(k)$ is an indicator function equal to 1 if the $k$th image-class pair is a verified positive and 0 otherwise, $N$ is the number of images in the validation set, $C$ is the number of classes, and $w$ is the number of positive labels. Average precision was proposed for Open Images by Veit et al. (2017). Due to false negatives in the validation set, Veit et al. (2017) only computed $AP$ over the human-annotated classes in each image. However, on average, each image in the validation set only has 8.4 positive and 4 negative human-annotated classes, so each image is only evaluated over roughly 12 classes out of 7,186. This yields misleadingly high values of $AP$. Instead, we compute $AP$ over all classes in each image, which may underestimate the true $AP$ due to false negatives in the validation set, but is more indicative of the true performance in our experience. We compute $AP$ using an efficient approximation of the area under the discrete precision-recall curve. (Equation 7 can be interpreted as a right Riemann sum of the discrete precision-recall curve $\{(r_i, p_i) \mid i = 1, \dots, w\}$, where $r_i = i/w$ and $p_i$ is the maximum precision among all values of precision with recall $r_i$; each value of recall may correspond to different values of precision at different classification thresholds. We use the TensorFlow op tf.metrics.auc with curve="PR", num_thresholds=200, and summation_method="careful_interpolation".)
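The ranking-based definition of average precision above can be evaluated directly for small inputs. The sketch below is an exact computation of the sum of precision at each verified-positive rank, divided by the number of positives; it is not the thresholded tf.metrics.auc approximation we actually used, and the function and argument names are hypothetical.

```python
import numpy as np


def class_agnostic_average_precision(scores, is_positive):
    """Average precision over all image-class pairs, ranked by score.

    scores: predicted likelihoods, any shape (e.g. images x classes).
    is_positive: same shape, 1 for verified positives and 0 otherwise.
    """
    scores = np.asarray(scores, dtype=np.float64).ravel()
    relevance = np.asarray(is_positive, dtype=np.float64).ravel()
    num_positives = relevance.sum()
    if num_positives == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank pairs from highest to lowest predicted likelihood.
    order = np.argsort(-scores, kind="stable")
    relevance = relevance[order]
    # Precision when considering the top k pairs, for every k.
    ranks = np.arange(1, relevance.size + 1)
    precision_at_k = np.cumsum(relevance) / ranks
    # Only ranks holding a verified positive contribute to the sum.
    return float(np.sum(precision_at_k * relevance) / num_positives)
```

Sorting all image-class pairs is affordable for toy inputs but costly at the scale of tens of thousands of images and thousands of classes, which is one motivation for a thresholded approximation.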
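We use average per-word cross entropy error for LM1B and Common Crawl. For a single sentence $s = (w_1, \dots, w_m)$, let $p(w_j \mid w_1, \dots, w_{j-1})$ denote the model’s predicted probability of the word $w_j$ given all prior words in the sentence. Thus, the predicted log-probability of $s$ is $\log p(s) = \sum_{j=1}^{m} \log p(w_j \mid w_1, \dots, w_{j-1})$. We compute the average per-word cross entropy error over a data set $\{s_1, \dots, s_n\}$ as
\[ -\frac{\sum_{i=1}^{n} \log p(s_i)}{\sum_{i=1}^{n} \mathrm{len}(s_i)}, \]
where $\mathrm{len}(s)$ denotes the number of words in $s$. This is the logarithm of the per-word perplexity.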
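As a sketch of the per-word cross entropy described above, the function below sums the log-probabilities of every word in every sentence and divides by the total word count. The input format (one sequence of per-word probabilities per sentence) is an assumption made for illustration.

```python
import numpy as np


def per_word_cross_entropy(sentence_word_probs):
    """Average negative log-probability per word over a set of sentences.

    sentence_word_probs: iterable of sequences; each sequence holds the
    model's predicted probability of each word given the preceding words.
    """
    total_log_prob = 0.0
    total_words = 0
    for word_probs in sentence_word_probs:
        word_probs = np.asarray(word_probs, dtype=np.float64)
        total_log_prob += np.log(word_probs).sum()
        total_words += word_probs.size
    return float(-total_log_prob / total_words)
```

Exponentiating the result gives the per-word perplexity. Because the average is taken over words rather than sentences, long sentences contribute proportionally more than short ones.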
Appendix B Model Details
In this section we give the architectural details of the models summarized in Table 2. In addition to the descriptions below, each model has a task-specific output layer. Models trained on MNIST, Fashion MNIST, CIFAR10, and ImageNet (classification with mutually exclusive labels) use a softmax output layer to model the probability distribution over classes. Models trained on Open Images (classification with multiple labels per image) use a sigmoid output layer to model the probability of each class. Models trained on LM1B and Common Crawl (language modeling) use a softmax output layer to model the probability of the next word in a sentence given all prior words in the sentence.
Fully Connected is a fully connected neural network with ReLU activation function. Hidden layers use dropout with probability 0.4 during training. We vary the number of layers and number of units per layer in different experiments to investigate the impact of model size. We use the notation FC-$N_1$-…-$N_k$ to denote a fully connected neural network with $k$ hidden layers and $N_i$ units in the $i$th layer.
Simple CNN consists of 2 convolutional layers with max pooling followed by 1 fully connected hidden layer. The convolutional layers use filters with stride length 1, “same” padding (Goodfellow et al., 2016), and ReLU activation function. Max pooling uses windows with stride length 2. The fully connected layer uses dropout with probability 0.4 during training. We used three different model sizes: base has 32 and 64 filters in the convolutional layers and 1,024 units in the fully connected layer; narrow has 16 and 32 filters in the convolutional layers and 512 units in the fully connected layer; and wide has 64 and 128 filters in the convolutional layers and 2,048 units in the fully connected layer. We used the base model unless otherwise specified.
ResNet8 consists of 7 convolutional layers with residual connections followed by 1 fully connected hidden layer. We used the model described in Section 4.2 of He et al. (2016a) with $n = 1$, but with the improved residual block described by He et al. (2016b). We removed batch normalization, which is consistent with Masters and Luschi (2018).
ResNet50 consists of 49 convolutional layers with residual connections followed by 1 fully connected hidden layer. We used the model described in Section 4.1 of He et al. (2016a), but with the improved residual block described by He et al. (2016b). We replaced batch normalization (Ioffe and Szegedy, 2015) with ghost batch normalization to keep the training objective fixed between batch sizes and to avoid possible negative effects from computing batch normalization statistics over a large number of examples (Hoffer et al., 2017). We used a ghost batch size of 32 for all experiments. We also applied label smoothing (Szegedy et al., 2016) to regularize the model at training time, which was helpful for larger batch sizes. The label smoothing coefficient was a metaparameter that we tuned in our experiments.
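The idea behind ghost batch normalization is to compute normalization statistics over fixed-size sub-batches ("ghost batches") rather than over the whole batch, so that the normalization is the same regardless of the overall batch size. The NumPy sketch below illustrates only the training-time normalization under simplifying assumptions: a 2-D input of shape (batch, features), no learned scale or shift, and no moving averages for evaluation. It is not the implementation we used.

```python
import numpy as np


def ghost_batch_norm(x, ghost_batch_size=32, eps=1e-5):
    """Normalize each ghost batch of `x` with its own mean and variance.

    x: array of shape (batch_size, num_features); batch_size must be a
    multiple of ghost_batch_size.
    """
    x = np.asarray(x, dtype=np.float64)
    batch_size = x.shape[0]
    if batch_size % ghost_batch_size != 0:
        raise ValueError("batch size must be a multiple of the ghost batch size")
    # Split the batch into consecutive ghost batches.
    ghosts = x.reshape(batch_size // ghost_batch_size, ghost_batch_size, *x.shape[1:])
    # Statistics are computed within each ghost batch only.
    mean = ghosts.mean(axis=1, keepdims=True)
    var = ghosts.var(axis=1, keepdims=True)
    normalized = (ghosts - mean) / np.sqrt(var + eps)
    return normalized.reshape(x.shape)
```

With this scheme, the first 32 examples of a batch of 1,024 are normalized exactly as they would be in a batch of 32, which is what keeps the training objective fixed as the batch size changes.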
VGG11 consists of 8 convolutional layers followed by 3 fully connected hidden layers. We used the model referred to as “model A” by Simonyan and Zisserman (2014).
LSTM is a one-hidden-layer LSTM model (Hochreiter and Schmidhuber, 1997). It is a simpler variant of the LSTM-2048-512 model described by Jozefowicz et al. (2016), with 1,024 embedding dimensions, 2,048 hidden units, and 512 projection dimensions. We did not use bias parameters in the output layer because we found this improved performance in our preliminary experiments.
Transformer is a self-attention model that was originally presented for machine translation (Vaswani et al., 2017). We used it as an autoregressive language model by applying the decoder directly to the sequence of word embeddings for each sentence. We used four different sizes: the base model described by Vaswani et al. (2017); a shallow model that is identical to the base model except with only two hidden layers instead of six; a narrow and shallow model that is identical to the shallow model except with half as many hidden units and attention heads as well as half the filter size; and a wide model that is identical to the base model except with double the number of hidden units and attention heads as well as double the filter size. We used the base model unless otherwise specified.
Appendix C Learning Rate Schedules
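We chose our learning rate schedule by experimenting with a variety of different schedules for ResNet50 on ImageNet. For each schedule, we specified the following metaparameters:

$\eta_0$: initial learning rate

$\alpha$: decay factor ($\alpha > 0$)

$T$: number of training steps until the learning rate decays from $\eta_0$ to $\alpha \eta_0$
Each schedule corresponds to a decay function $d(t)$, such that the learning rate at training step $t$ is
\[ \eta(t) = \begin{cases} d(t) \cdot \eta_0 & \text{if } t \le T, \\ \alpha \cdot \eta_0 & \text{if } t > T. \end{cases} \]
We experimented with the following decay functions:

Constant: $d(t) = 1$

Linear: $d(t) = 1 - (1 - \alpha) \frac{t}{T}$

Cosine (Loshchilov and Hutter, 2017): $d(t) = \alpha + \frac{1 - \alpha}{2} \left( 1 + \cos \frac{\pi t}{T} \right)$

Exponential Polynomial: $d(t) = \alpha + (1 - \alpha) \left( 1 - \frac{t}{T} \right)^{\lambda}$, where $\lambda > 0$

Inverse Exponential Polynomial: $d(t) = \frac{\alpha}{\alpha + (1 - \alpha) \left( \frac{t}{T} \right)^{\lambda}}$, where $\lambda > 0$

Exponential: $d(t) = \alpha^{t/T}$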
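A schedule of this kind, where a decay function takes the learning rate from its initial value to a fraction of that value over a fixed number of steps and then holds it there, can be sketched as follows. The three decay functions shown are assumed forms, chosen so that each equals 1 at step 0 and the decay factor at the final decay step; the function and argument names are hypothetical.

```python
import math


def linear_decay(t, decay_steps, alpha):
    """Goes linearly from 1 at t = 0 to alpha at t = decay_steps."""
    return 1.0 - (1.0 - alpha) * t / decay_steps


def cosine_decay(t, decay_steps, alpha):
    """Follows half a cosine period from 1 down to alpha."""
    return alpha + 0.5 * (1.0 - alpha) * (1.0 + math.cos(math.pi * t / decay_steps))


def exponential_decay(t, decay_steps, alpha):
    """Decays geometrically from 1 to alpha."""
    return alpha ** (t / decay_steps)


def learning_rate(t, initial_lr, alpha, decay_steps, decay_fn=linear_decay):
    """Learning rate at step t; held at alpha * initial_lr after decay_steps."""
    if t > decay_steps:
        return alpha * initial_lr
    return initial_lr * decay_fn(t, decay_steps, alpha)
```

Linear decay needs only the decay factor and the number of decay steps on top of the initial learning rate, which makes it one of the cheapest schedules to tune.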
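We also tried piecewise linear learning rate schedules. These schedules are specified by a sequence of pairs $\{(t_0, \eta_0), \dots, (t_k, \eta_k)\}$, with $0 = t_0 < t_1 < \dots < t_k$, such that the learning rate at training step $t$ is
\[ \eta(t) = \begin{cases} \eta_i + \frac{\eta_{i+1} - \eta_i}{t_{i+1} - t_i} (t - t_i) & \text{if } t_i \le t < t_{i+1}, \\ \eta_k & \text{if } t \ge t_k. \end{cases} \]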
The schedules used by both He et al. (2016a) (piecewise constant) and Goyal et al. (2017) (linear warmup followed by piecewise constant) for ResNet50 on ImageNet can both be expressed as piecewise linear.
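A piecewise linear schedule can be sketched as linear interpolation between (step, learning rate) breakpoints, holding the last value afterwards. The function name and the breakpoint values in the usage note are hypothetical and are not the schedules of He et al. (2016a) or Goyal et al. (2017).

```python
def piecewise_linear(t, points):
    """Learning rate at step t for breakpoints [(step, lr), ...].

    `points` must be sorted by strictly increasing step and start at step 0.
    Between breakpoints the rate is interpolated linearly; after the last
    breakpoint it stays at the last rate.
    """
    if points[0][0] != 0:
        raise ValueError("the first breakpoint must be at step 0")
    if t >= points[-1][0]:
        return points[-1][1]
    for (t0, lr0), (t1, lr1) in zip(points, points[1:]):
        if t0 <= t < t1:
            return lr0 + (lr1 - lr0) * (t - t0) / (t1 - t0)
    raise ValueError("t must be non-negative")
```

A linear warmup followed by a piecewise constant schedule fits this form by placing two breakpoints one step apart at each drop, for example `[(0, 0.0), (5, 0.5), (10, 0.5), (11, 0.05)]`: the rate ramps up over the first 5 steps, holds at 0.5 until step 10, and drops to 0.05 by step 11.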
We ran experiments with ResNet50 on ImageNet, using Nesterov momentum with batch size 1,024 for 150,000 training steps, while tuning the momentum and all metaparameters governing the learning rate schedule. We used the quasi-random metaparameter search discussed in Section 4. We tried piecewise linear schedules with 1, 3, and 5 decay events. We found that it was possible to get good results with several of the schedules we tried, and it is likely that other schedules would also work well. Ultimately, we chose linear decay because it performed at least as well as all other schedules we tried, while also being the simplest and requiring only two additional metaparameters.
Appendix D Additional Plots




References
 Abadi et al. (2016) Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: a system for large-scale machine learning. In Conference on Operating Systems Design and Implementation, volume 16, pages 265–283. USENIX, 2016.
 Akiba et al. (2017) Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch SGD: Training ResNet50 on ImageNet in 15 minutes. arXiv preprint arXiv:1711.04325, 2017.
 Anil et al. (2018) Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E. Dahl, and Geoffrey E. Hinton. Large scale distributed neural network training through online distillation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rkr1UDeC.
 Anonymous (2019) Anonymous. On the computational inefficiency of large batch sizes for stochastic gradient descent. In International Conference on Learning Representations, 2019. URL https://openreview.net/forums?id=S1en0sRqKm. Under review.
 Ba et al. (2017) Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using Kronecker-factored approximations. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=SkkTMpjex.
 Bottou and Bousquet (2008) Léon Bottou and Olivier Bousquet. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pages 161–168, 2008.
 Bousquet et al. (2017) Olivier Bousquet, Sylvain Gelly, Karol Kurach, Olivier Teytaud, and Damien Vincent. Critical hyperparameters: No random, no cry. arXiv preprint arXiv:1706.03200, 2017.
 Breuel (2015a) Thomas M Breuel. Benchmarking of LSTM networks. arXiv preprint arXiv:1508.02774, 2015a.
 Breuel (2015b) Thomas M Breuel. The effects of hyperparameters on SGD training of neural networks. arXiv preprint arXiv:1508.02788, 2015b.
 Chelba et al. (2014) Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. In Conference of the International Speech Communication Association, 2014.
 Chen et al. (2016) Jianmin Chen, Xinghao Pan, Rajat Monga, Samy Bengio, and Rafal Jozefowicz. Revisiting distributed synchronous SGD. In International Conference on Learning Representations Workshop Track, 2016. URL https://openreview.net/forum?id=D1VDZ5kMAu5jEJ1zfEWL.
 Chen et al. (2018) Lingjiao Chen, Hongyi Wang, Jinman Zhao, Dimitris Papailiopoulos, and Paraschos Koutris. The effect of network width on the performance of large-batch training. arXiv preprint arXiv:1806.03791, 2018.
 Codreanu et al. (2017) Valeriu Codreanu, Damian Podareanu, and Vikram Saletore. Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv preprint arXiv:1711.04291, 2017.
 Devarakonda et al. (2017) Aditya Devarakonda, Maxim Naumov, and Michael Garland. AdaBatch: Adaptive batch sizes for training deep neural networks. arXiv preprint arXiv:1712.02029, 2017.
 Dinh et al. (2017) Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pages 1019–1028, 2017.
 Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org.
 Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
 Grosse and Martens (2016) Roger Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573–582, 2016.
 Hazan (2016) Elad Hazan. Introduction to online convex optimization. Foundations and Trends in Optimization, 2(3-4):157–325, 2016.

 He et al. (2016a) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition, pages 770–778. IEEE, 2016a.
 He et al. (2016b) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016b.
 Hestness et al. (2017) Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Patwary, Mostofa Ali, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
 Hinton et al. (2012) Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: overview of minibatch gradient descent, 2012. URL https://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
 Hoffer et al. (2017) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.
 Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
 Jain et al. (2018) Prateek Jain, Sham M. Kakade, Rahul Kidambi, Praneeth Netrapalli, and Aaron Sidford. Parallelizing stochastic gradient descent for least squares regression: Minibatching, averaging, and model misspecification. Journal of Machine Learning Research, 18(223):1–42, 2018. URL http://jmlr.org/papers/v18/16595.html.

 Jouppi et al. (2017) Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. In-datacenter performance analysis of a tensor processing unit. In International Symposium on Computer Architecture, pages 1–12. IEEE, 2017.
 Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016.
 Karakida et al. (2018) Ryo Karakida, Shotaro Akaho, and Shunichi Amari. Universal statistics of Fisher information in deep neural networks: Mean field approach. arXiv preprint arXiv:1806.01316, 2018.
 Keskar et al. (2017) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=H1oyRlYgg.
 Kidambi et al. (2018) Rahul Kidambi, Praneeth Netrapalli, Prateek Jain, and Sham M. Kakade. On the insufficiency of existing momentum schemes for stochastic optimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJTutzbA.
 Kiefer et al. (1952) Jack Kiefer, Jacob Wolfowitz, et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3):462–466, 1952.
 Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 Krasin et al. (2017) Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Shahab Kamali, Matteo Malloci, Jordi Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. OpenImages: A public dataset for large-scale multi-label and multi-class image classification, 2017. URL https://storage.googleapis.com/openimages/web/index.html.
 Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL http://www.cs.toronto.edu/~kriz/learningfeatures2009TR.pdf.
 Lan (2012) Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2):365–397, 2012.
 Le Cun et al. (1998) Yann Le Cun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science LNCS 1524. Springer Verlag, 1998. URL http://leon.bottou.org/papers/lecun98x.
 LeCun et al. (1998) Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database, 1998. URL http://yann.lecun.com/exdb/mnist.
 LeCun et al. (2015) Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.
 Li et al. (2014) Mu Li, Tong Zhang, Yuqiang Chen, and Alexander J Smola. Efficient mini-batch training for stochastic optimization. In International Conference on Knowledge Discovery and Data Mining, pages 661–670. ACM, 2014.
 Lin et al. (2018) Tao Lin, Sebastian U Stich, and Martin Jaggi. Don’t use large mini-batches, use local SGD. arXiv preprint arXiv:1808.07217, 2018.
 Loshchilov and Hutter (2017) Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=Skq89Scxx.

 Ma et al. (2018) Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of SGD in modern over-parametrized learning. In International Conference on Machine Learning, pages 3331–3340, 2018.
 Martens and Grosse (2015) James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417, 2015.
 Masters and Luschi (2018) Dominic Masters and Carlo Luschi. Revisiting small batch training for deep neural networks. arXiv preprint arXiv:1804.07612, 2018.
 Nesterov (1983) Yurii Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). In Doklady AN USSR, volume 269, pages 543–547, 1983.
 Polyak (1964) Boris T Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
 Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
 Rumelhart et al. (1986) David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning representations by back-propagating errors. Nature, 323(6088):533, 1986.
 Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 Shalev-Shwartz and Ben-David (2014) Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From foundations to algorithms. Cambridge University Press, 2014. URL https://books.google.com/books?id=OE9etAEACAAJ.
 Shamir (2016) Ohad Shamir. Without-replacement sampling for stochastic gradient methods. In Advances in Neural Information Processing Systems, pages 46–54, 2016.
 Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Smith and Le (2018) Samuel L. Smith and Quoc V. Le. A Bayesian perspective on generalization and stochastic gradient descent. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=BJij4yg0Z.
 Sutskever et al. (2013) Ilya Sutskever, James Martens, George E. Dahl, and Geoffrey E. Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
 Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. In Conference on Computer Vision and Pattern Recognition, pages 2818–2826. IEEE, 2016.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
 Veit et al. (2017) Andreas Veit, Neil Alldrin, Gal Chechik, Ivan Krasin, Abhinav Gupta, and Serge J Belongie. Learning from noisy large-scale datasets with minimal supervision. In Conference on Computer Vision and Pattern Recognition, pages 6575–6583. IEEE, 2017.
 Wilson and Martinez (2003) D Randall Wilson and Tony R Martinez. The general inefficiency of batch training for gradient descent learning. Neural Networks, 16(10):1429–1451, 2003.
 Wu et al. (2018) Yuhuai Wu, Mengye Ren, Renjie Liao, and Roger Grosse. Understanding short-horizon bias in stochastic meta-optimization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=H1MczcgR.
 Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.

 Yin et al. (2018) Dong Yin, Ashwin Pananjady, Max Lam, Dimitris Papailiopoulos, Kannan Ramchandran, and Peter Bartlett. Gradient diversity: a key ingredient for scalable distributed learning. In International Conference on Artificial Intelligence and Statistics, 2018. URL http://proceedings.mlr.press/v84/yin18a.html.
 You et al. (2017) Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. ImageNet training in minutes. arXiv preprint arXiv:1709.05011, 2017.