Stochastic Training is Not Necessary for Generalization

09/29/2021 ∙ by Jonas Geiping, et al. ∙ University of Maryland

It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD, using modern architectures in settings with and without data augmentation. To this end, we utilize modified hyperparameters and show that the implicit regularization of SGD can be completely replaced with explicit regularization. This strongly suggests that theories that rely heavily on properties of stochastic sampling to explain generalization are incomplete, as strong generalization behavior is still observed in the absence of stochastic sampling. Fundamentally, deep learning can succeed without stochasticity. Our observations further indicate that the perceived difficulty of full-batch training is largely the result of its optimization properties and the disproportionate time and effort spent by the ML community tuning optimizers and hyperparameters for small-batch training.


1 Introduction

Stochastic gradient descent (SGD) is the backbone of optimization for deep neural networks, going back at least as far as LeCun et al. (1998a), and SGD is the de-facto tool for optimizing the parameters of modern neural networks (Krizhevsky et al., 2012; He et al., 2015a; Brown et al., 2020). A central reason for the success of stochastic gradient descent is its efficiency in the face of large datasets – a noisy estimate of the loss function gradient is generally sufficient to improve the parameters of a neural network, and it can be computed much faster than a full gradient over the entire training set.

At the same time, folk wisdom dictates that small-batch SGD is not only faster but also has a unique bias towards good loss function minima that cannot be replicated with full-batch gradient descent. Some even believe that stochastic sampling is the fundamental force behind the success of neural networks. These popular beliefs are linked to various properties of SGD, such as its gradient noise, fast escape from saddle points, and its uncanny ability to avoid sub-optimal local minima (Hendrik, 2017; LeCun, 2018). It is common to under-saturate compute capabilities and retain small batch sizes, even when enough compute is available for larger batches, in order to reap these apparent benefits. These properties are also attributed in varying degrees to all mini-batched first-order optimizers, such as Adam (Kingma and Ba, 2015) and others (Schmidt et al., 2020).

But why does stochastic mini-batching really aid generalization? In this work, we set out to isolate the mechanisms which underlie the benefits of SGD and use these mechanisms to replicate the empirical benefits of SGD without stochasticity. In this way, we provide a counterexample to the hypothesis that stochastic mini-batching, which leads to noisy estimates of the gradient of the loss function, is fundamental for the strong generalization success of over-parameterized neural networks.

We show that a standard ResNet-18 can be trained using batch size 50K (the entire training dataset) and still achieve validation accuracy on CIFAR-10 that is comparable to the same network trained with a strong SGD baseline, provided data augmentation is used for both methods. We then extend these findings to train without (random) data augmentations, for an entirely non-stochastic full-batch training routine with exact computation of the full loss gradient, while still achieving over 95% accuracy. Because existing training routines are heavily optimized for small-batch SGD, the success of our experiments requires us to eschew standard training parameters in favor of an increased number of training steps, aggressive gradient clipping, and specialized explicit regularization terms.

The existence of this example raises questions about the role of stochastic mini-batching, and by extension gradient noise, in generalization. In particular, it shows that the practical effects of such gradient noise can be captured by explicit, non-stochastic, regularization. Overall, deep learning succeeds, even in the absence of mini-batched training.

A number of authors have studied relatively large batch training, often finding trade-offs between batch size and model performance (Yamazaki et al., 2019; Mikami et al., 2019; You et al., 2020). However, the goal of these studies has been first and foremost to accelerate training speed (Goyal et al., 2018; Jia et al., 2018), with maintaining accuracy as a secondary goal. In this study, we seek to achieve high performance on full-batch training at all costs. Our focus is not on fast runtimes or ultra-efficient parallelism, but rather on the implications of our experiments for deep learning theory.

We begin our discussion by reviewing the literature on SGD and describing some of the many studies that have sought to explain various successes of deep learning through the lens of stochastic sampling. Then, we explain the hyper-parameters needed to achieve strong results in the full-batch setting, and present benchmark results using a range of settings, both with and without data augmentation.

Figure 1: One-dimensional loss landscapes (random direction) of models trained with gradient descent. Default full-batch gradient descent (left) produces sharp models that do not train and generalize well, yet it can be modified to converge to flatter minima with longer training, gradient clipping and appropriate regularization (right).

2 Perspectives on Generalization via SGD

The widespread success of SGD in practical neural network implementations has inspired theorists to investigate the gradient noise created by stochastic sampling as a potential source of observed generalization phenomena in neural networks. This section will cover some of the recent literature concerning hypothesized effects of stochastic mini-batch gradient descent (SGD). We explicitly focus on generalization effects of SGD in this work. Other possible sources of generalization for neural networks have been proposed that do not lean on stochastic sampling, for example generalization results that only require overparametrization (Neyshabur et al., 2018; Advani et al., 2020), large width (Golubeva et al., 2021), and well-behaved initialization schemes (Wu et al., 2017; Mehta et al., 2020). We will not discuss these here. Furthermore, because we wish to isolate the effect of stochastic sampling in our experiments, we fix an architecture and network hyperparameters in our studies, acknowledging that they were likely chosen because of their synergy with SGD.

Notation:

We denote the optimization objective for training a neural network by $\mathcal{L}(\theta, x)$, where $\theta$ represents the network parameters and $x$ is a single data sample. Over a dataset of $N$ data points, $\{x_i\}_{i=1}^N$, the neural network training problem is the minimization of

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(\theta, x_i). \qquad (1)$$

This objective can be optimized via first-order optimization, of which the simplest form is descent in the direction of the negative gradient with respect to the parameters on a batch $\mathcal{B}_k$ of data points and with step size $\tau$:

$$\theta^{k+1} = \theta^{k} - \tau \frac{1}{|\mathcal{B}_k|} \sum_{i \in \mathcal{B}_k} \nabla_\theta \mathcal{L}(\theta^k, x_i). \qquad (2)$$

Now, full-batch gradient descent corresponds to descent on the full dataset, $\mathcal{B}_k = \{1, \dots, N\}$, stochastic gradient descent corresponds to sampling a single random data point, $|\mathcal{B}_k| = 1$ (with or without replacement), and mini-batch stochastic gradient descent corresponds to sampling $1 < |\mathcal{B}_k| < N$ data points at once. When sampling without replacement, the set of available data points is commonly reset after all elements are depleted.
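As a concrete illustration of these three regimes, the following minimal sketch implements the update of Equation 2 on a toy regression problem; the data, model, and step size are placeholders, not the paper's setup.

```python
import torch

# Toy setup; all names and constants here are illustrative placeholders.
torch.manual_seed(0)
X, y = torch.randn(512, 10), torch.randn(512, 1)
model = torch.nn.Linear(10, 1)
loss_fn = torch.nn.MSELoss()

def descent_step(batch_size=None, lr=0.1):
    """One step of Equation 2: batch_size=None gives full-batch GD,
    batch_size=1 single-sample SGD, and 1 < batch_size < N mini-batch SGD."""
    n = X.shape[0]
    idx = torch.arange(n) if batch_size is None else torch.randperm(n)[:batch_size]
    loss = loss_fn(model(X[idx]), y[idx])                  # loss on the batch B_k
    grads = torch.autograd.grad(loss, list(model.parameters()))
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p -= lr * g                                    # theta <- theta - tau * grad

descent_step()                 # full-batch gradient descent
descent_step(batch_size=128)   # mini-batch SGD
```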

This update equation is often analyzed as an update with the full-batch gradient that is contaminated by gradient noise $\varepsilon_k$ arising from the stochastic mini-batch sampling:

$$\theta^{k+1} = \theta^{k} - \tau \left( \nabla_\theta \mathcal{L}(\theta^k) + \varepsilon_k \right). \qquad (3)$$

Although stochastic gradient descent has been used intermittently in applications of pattern recognition as far back as the 1990s, its advantages were debated as late as Wilson and Martinez (2003), who in support of SGD discuss its efficiency benefits (which would become much more prominent in the following years due to increasing dataset sizes), in addition to earlier ideas that stochastic training can escape from local minima, and its relationship to Brownian motion and "quasi-annealing", both of which are also discussed in practical guides such as LeCun et al. (1998b).

SGD and critical points While early results from an optimization perspective were concerned with showing the effectiveness and convergence properties of SGD (Bottou, 2010), later ideas focused on the generalization benefits of stochastic training via navigating the optimization landscape, finding global minima, and avoiding bad local minima and saddle points. Ge et al. (2015) show that stochastic descent is advantageous compared to full-batch gradient descent (GD) in its ability to escape saddle points. Although the same conditions actually also allow vanilla gradient descent to avoid saddle points (Lee et al., 2016), full-batch descent is slowed down significantly by the existence of saddle points compared to stochastically perturbed variants (Du et al., 2017). Random perturbations also appear necessary to facilitate escape from saddle points in Jin et al. (2019). Related works study a critical mini-batch size (Ma et al., 2018; Jain et al., 2018) after which SGD behaves similarly to full-batch gradient descent (GD) and converges slowly. The idea of a critical batch size is echoed for noisy quadratic models in Zhang et al. (2019a), and an empirical measure of critical batch size is proposed in McCandlish et al. (2018). There are also hypotheses (HaoChen et al., 2020) that GD necessarily overfits at sub-optimal minima as it trains in the linearized neural tangent kernel regime of Jacot et al. (2018); Arora et al. (2019b).

It is unclear though whether the analysis of sub-optimal critical points can explain the benefits of SGD, given that modern neural networks can generally be trained to reach global minima even with deterministic algorithms (for wide enough networks (Du et al., 2019)). This phenomenon is itself puzzling: sub-optimal local minima exist and can be found by specialized optimization (Yun et al., 2018; Goldblum et al., 2020), but they are not found by first-order descent methods with standard initialization. With this observation in mind, it has been postulated that "good" minima that generalize well share geometric properties that make it likely for SGD to find them (Huang et al., 2020).

Flatness and Noise Shapes One such geometric property of a global minimizer is its flatness (Hochreiter and Schmidhuber, 1997). Empirically, Keskar et al. (2016) discuss the advantages of small-batch stochastic gradient descent and propose that finding flat basins is a benefit of small-batch SGD: large-batch training converges to models with both lower generalization and sharper minimizers. Although flatness is difficult to measure (Dinh et al., 2017), flatness-based measures appear to be the most promising tool for predicting generalization in Jiang et al. (2019).

The analysis of such stochastic effects is often facilitated by considering the stochastic differential equation that arises from Equation 3 for small enough step sizes, under the assumption that the gradient noise is effectively a Gaussian random variable:

$$d\theta_t = -\nabla_\theta \mathcal{L}(\theta_t)\, dt + \sqrt{\tau\, \Sigma(\theta_t)}\, dW_t, \qquad (4)$$

where $\Sigma(\theta_t)$ represents the covariance of the gradient noise at time $t$, and $W_t$ is a Brownian motion modeling it. The magnitude of $\Sigma$ is inversely proportional to mini-batch size (Jastrzębski et al., 2018), and it is also connected to the flatness of minima reached by SGD in Dai and Zhu (2018) and Jastrzębski et al. (2018) if $\Sigma$ is isotropic. Analysis therein as well as in Le (2018) provides evidence that the step size should increase linearly with the batch size to keep the magnitude of noise fixed. However, the anisotropy of $\Sigma$ is strong enough to generate behavior that qualitatively differs from Brownian motion around critical points (Chaudhari and Soatto, 2018; Simsekli et al., 2019), and isotropic diffusion is insufficient to explain generalization benefits in Saxe et al. (2019).
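For intuition only, Equation 4 can be simulated directly with an Euler-Maruyama scheme. The toy quadratic loss, the 1/B scaling of the noise covariance, and all constants in the sketch below are illustrative assumptions rather than the analysis of any of the cited works.

```python
import numpy as np

def simulate_sgd_diffusion(theta0=2.0, lr=0.1, batch_size=32,
                           sigma2=1.0, steps=1000, seed=0):
    """Euler-Maruyama simulation of Equation 4 for a 1-D toy loss L(theta) = theta^2 / 2,
    with gradient-noise covariance Sigma assumed proportional to 1/batch_size."""
    rng = np.random.default_rng(seed)
    theta, dt = theta0, lr                    # the SDE time step corresponds to the step size
    cov = sigma2 / batch_size                 # larger batches -> smaller noise covariance
    for _ in range(steps):
        grad = theta                          # gradient of the toy quadratic loss
        dW = rng.normal(0.0, np.sqrt(dt))     # Brownian increment over dt
        theta = theta - grad * dt + np.sqrt(lr * cov) * dW
    return theta

print(simulate_sgd_diffusion(batch_size=8), simulate_sgd_diffusion(batch_size=50_000))
```

With a batch size equal to the full dataset the diffusion term essentially vanishes and the iterate settles deterministically into the minimum, which is the limit the works above argue removes the beneficial noise.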

The shape of $\Sigma$ is thus further discussed in Zhu et al. (2019), where the anisotropic noise induced by SGD is found to be beneficial for escaping sharp minima and reaching flat minima, in contrast to isotropic noise; in Zhou et al. (2020), where it is contrasted with noise induced by Adam (Kingma and Ba, 2015); and in HaoChen et al. (2020), who discuss that such parameter-dependent noise, also induced by label noise, biases SGD towards well-generalizing minima in a case study. Empirical studies in Wen et al. (2020); Wu et al. (2020) and Li et al. (2021) show that large-batch training can be improved by adding the right kind of anisotropic noise.

Notably, in all of these works, the noise introduced by SGD is in the end both unbiased and (mostly) Gaussian, and its disappearance in full-batch gradient descent should remove its beneficial effects. However, Equation 4 only approximates SGD to first order, while for non-vanishing step sizes $\tau$, Li et al. (2017) find that a second-order approximation,

$$d\theta_t = -\nabla_\theta \left( \mathcal{L}(\theta_t) + \frac{\tau}{4} \left\| \nabla_\theta \mathcal{L}(\theta_t) \right\|^2 \right) dt + \sqrt{\tau\, \Sigma(\theta_t)}\, dW_t, \qquad (5)$$

does include an implicit bias proportional to the step size.

does include an implicit bias proportional to the step size. Later studies such as Li et al. (2020) discuss the importance of large initial learning rates, which are also not well modeled by first-order SDE analysis but have a noticeable impact on generalization.

Analysis of flatness through other means, such as dynamical system theory (Wu et al., 2018; Hu et al., 2018), also derives stability conditions for SGD and GD, where among all possible global minima, SGD both converges to flatter minima than GD and also can escape from sharp minima. Xing et al. (2018) analyze SGD and GD empirically in response to the aforementioned theoretical findings about noise shape, finding that both algorithms (without momentum) significantly differ in their exploration of the loss landscape and that the structure of the noise induced by SGD is closely related to this behavior. Yin et al. (2018) introduce gradient diversity as a measure of the effectiveness of SGD:

$$\Delta(\theta) = \frac{\sum_{i=1}^{N} \left\| \nabla_\theta \mathcal{L}(\theta, x_i) \right\|^2}{\left\| \sum_{i=1}^{N} \nabla_\theta \mathcal{L}(\theta, x_i) \right\|^2}, \qquad (6)$$

which works well up to a critical batch size proportional to $N \cdot \Delta(\theta)$. Crucially, gradient diversity is a ratio of per-example gradient norms to the full gradient norm. This relationship is also investigated as gradient coherence in Chatterjee (2020), as it depends on the amount of alignment of these gradient vectors.

An explicit, non-stochastic bias? Several of these theoretical investigations into the nature of generalization via SGD rely on earlier intuitions that this generalization effect would not be capturable by explicit regularization, e.g. Arora et al. (2019a), who write that "standard regularizers may not be rich enough to fully encompass the implicit regularization brought forth by gradient-based optimization" and further rule out norm-based regularizers rigorously. Such a behavior would not be entirely impossible, given that similar statements were already shown for the generalization effects of overparametrization in Arora et al. (2018), who show that no regularizer exists that could replicate the effects of overparametrization in deep linear networks. Yet, Barrett and Dherin (2020); Smith et al. (2020b) find that the implicit regularization induced by GD and SGD can be analyzed via backward-error analysis and a scalar regularizer can be derived. The implicit regularization of mini-batched gradient descent with batches $\mathcal{B}_1, \dots, \mathcal{B}_m$ can be described explicitly (up to third-order terms, for sampling without replacement) by the modified loss function

$$\tilde{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \frac{\tau}{4m} \sum_{k=1}^{m} \left\| \frac{1}{|\mathcal{B}_k|} \sum_{i \in \mathcal{B}_k} \nabla_\theta \mathcal{L}(\theta, x_i) \right\|^2, \qquad (7)$$

which simplifies for full-batch gradient descent to

$$\tilde{\mathcal{L}}(\theta) = \mathcal{L}(\theta) + \frac{\tau}{4} \left\| \nabla_\theta \mathcal{L}(\theta) \right\|^2, \qquad (8)$$

as found in Barrett and Dherin (2020). Training with this regularizer can induce the generalization benefits of larger learning rates, even if optimized with small learning rates, and can induce the generalization behavior of small batch sizes when training with moderately larger batch sizes. However, Smith et al. (2020b) "expect this phenomenon to break down for very large batch sizes". A related discussion of this gradient bias of SGD can be found in Roberts (2018), while in Poggio and Cooper (2020), a simplified setting is discussed in which SGD can be shown to converge to a critical point where $\nabla_\theta \mathcal{L}(\theta, x_i) = 0$ holds separately for each data point $x_i$, a condition which implies that the regularizer of Equation 7 is zero.
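For concreteness, the full-batch form of this penalty (Equation 8) can be written as an explicit regularizer by differentiating through the gradient norm. The sketch below is an illustration of the idea using double backpropagation, not the implementation used in the paper (which instead relies on the finite-difference approximation introduced in Section 3.3).

```python
import torch

def regularized_loss(model, loss_fn, inputs, targets, lr):
    """Explicit version of Equation 8: L(theta) + (lr / 4) * ||grad_theta L(theta)||^2."""
    loss = loss_fn(model(inputs), targets)
    # create_graph=True keeps the graph so the gradient-norm term is itself differentiable
    grads = torch.autograd.grad(loss, list(model.parameters()), create_graph=True)
    grad_norm_sq = sum((g ** 2).sum() for g in grads)   # squared l2 norm of the gradient
    return loss + (lr / 4.0) * grad_norm_sq
```

Calling backward() on this quantity differentiates through the gradient-norm term, which requires Hessian-vector products and is what makes the exact computation expensive for large models.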

Large-batch training in practice In response to Keskar et al. (2016), Hoffer et al. (2017) show that the adverse effects of (moderately) large batch training can be mitigated by improved hyperparameters – tuning learning rates, optimization steps, and batch normalization behavior. A resulting line of work suggests hyperparameter improvements that successively allow larger batch sizes (You et al., 2017) with reduced trade-offs in generalization. Yet, parity in generalization between small and large batch training has proven elusive in many applications, even after extensive hyperparameter studies in De et al. (2017); Golmant et al. (2018); Masters and Luschi (2018) and Smith et al. (2020a). Golmant et al. (2018) go on to discuss that this is not only a problem of generalization in their experiments but also one of optimization during training, as they find that the number of iterations it takes to even reach low training loss increases significantly after the critical batch size is surpassed. Conversely, Shallue et al. (2019) find that training in a large-batch regime is often still possible, but this is dependent on finding an optimal learning rate that is not predicted by simple scaling rules, and it also depends on choosing optimal hyperparameters and momentum, a choice which becomes more difficult with increased batch size. This reduction of possible learning rates that converge reliably is also discussed in Masters and Luschi (2018), but a significant gap in generalization is observed in Smith et al. (2020a) even after grid-searching for an optimal learning rate.

Empirical studies continue to optimize hyperparameters for large-batch training with reasonable sacrifices in generalization performance, including learning rate scaling and warmup (Goyal et al., 2018; You et al., 2019a), adaptive optimizers (You et al., 2017, 2019b), omitting weight regularization on scales and biases (Jia et al., 2018), adaptive momentum (Mikami et al., 2019), second-order optimization (Osawa et al., 2019), and label smoothing Yamazaki et al. (2019). Yet, You et al. (2020) find that full-batch gradient descent cannot be tuned to reach the performance of SGD, even when optimizing for long periods, indicating a fundamental “limit of batch size”. The difficulty of achieving good generalization with large batches has been linked to instability of training. As discussed in Cohen et al. (2020), training with GD progressively increases the sharpness of the objective function until training destabilizes in a sudden loss spike. Surprisingly however, the algorithm does not diverge entirely, but the loss instead quickly recovers and continues to decrease non-monotonically, while the sharpness remains close to the stability threshold. This phenomenon of non-monotone, but effective training close to a stability threshold is also found in Lewkowycz et al. (2020).

2.1 A more subtle hypothesis

From the above literature, we find two main advantages of SGD over GD. First, its optimization behavior appears qualitatively different, both in terms of stability and in terms of convergence speed beyond the critical batch size. Second, there is evidence that the implicit bias induced by large-step-size SGD on mini-batches can be replaced with explicit regularization as derived in Equation 5 and Equation 7; this bias effectively approximates a penalty on the per-example gradient norms, relating it to Equation 6. In light of these apparent advantages, we hypothesize that we can modify and tune optimization hyperparameters for GD and also add an explicit regularizer in order to recover SGD's generalization performance without injecting any noise into training. This would imply that gradient noise from mini-batching is not necessary for generalization, but merely an intermediate factor: while modelling the bias of gradient noise and its optimization properties is sufficient for generalization, mini-batching by itself is not necessary, and these benefits can also be procured by other means.

This hypothesis stands in contrast to the possibilities that gradient noise injection is either necessary to reach state-of-the-art performance (as in Wu et al. (2020); Li et al. (2021)) or that no regularizing function exists whose gradient replicates the practical effect of gradient noise (Arora et al., 2018). A further "cultural" roadblock in this endeavor is that existing models and hyperparameter strategies have been extensively optimized for SGD, with a significant number of hours spent improving performance on CIFAR-10 for models trained with small-batch SGD, which raises the question of whether these mechanisms are by now self-reinforcing.

3 Full-batch GD with randomized data augmentation

We now set out to investigate our hypothesis empirically, attempting to set up training so that strong generalization occurs even without gradient noise from mini-batching. We will thus compare full-batch settings in which the gradient of the full loss is computed every iteration and mini-batch settings in which a noisy estimate of the loss is computed. Our central goal is to reach good full-batch performance without resorting to gradient noise, via mini-batching or explicit injection. Yet, we will occasionally make remarks regarding full-batch training in practical scenarios outside the limitations of this construction.

For this, we focus on a well-understood case in the literature and train a ResNet model on CIFAR-10 for image classification. We consider a standard ResNet-18 (He et al., 2015a, 2019) with randomly initialized linear layer parameters (He et al., 2015b) and batch normalization parameters initialized with mean zero and unit variance, except for the scale of the last batch normalization layer in each residual branch, which is initialized to zero (Goyal et al., 2018). This model and its initialization were tuned to reach optimal performance when trained with SGD. The default random CIFAR-10 data ordering is kept as is.

We proceed in several stages from baseline experiments using standard settings to specialized schemes for full-batch training, comparing stochastic gradient descent performance with full-batch gradient descent. Over the course of this and the next section we first examine full-batch training with standard data augmentations, and later remove randomized data augmentations from training as well to evaluate a completely noise-less pipeline.

3.1 Baseline SGD

We start by describing our baseline setup, which is well-tuned for SGD. For the entirety of Section 3, every image is randomly augmented by horizontal flips and random crops after padding by 4 pixels.

Baseline SGD:

For the SGD baseline, we train with stochastic gradient descent, a batch size of 128, Nesterov momentum, and weight decay. Mini-batches are drawn randomly without replacement in every epoch. The learning rate is warmed up from 0.0 to 0.1 over the first 5 epochs and then reduced via cosine annealing to 0 over the course of training (Loshchilov and Hutter, 2017). The model is trained for 300 epochs. In total, 117,000 update steps occur in this setting.

With these hyperparameters, mini-batch SGD (sampling without replacement) reaches a validation accuracy that we consider a very competitive modern baseline for this architecture. Mini-batch SGD provides this strong baseline largely independently of the exact flavor of mini-batching, see Table 1, reaching the same accuracy when sampling with replacement. In both cases the gradient noise induced by random mini-batching leads to strong generalization. If batches are sampled without replacement and in the same order every epoch, i.e. without shuffling in every epoch, then mini-batching still provides its generalization benefit. The apparent discrepancy between both versions of shuffling is not actually an SGD effect; rather, shuffling benefits the batch normalization layers also present in the ResNet-18. This can be seen by replacing batch norm with group normalization (Wu and He, 2018), which has no dependence on batching: with group-normalized ResNets, SGD with and without shuffling reaches nearly identical accuracy, a negligible difference. Overall, all of these variations of mini-batched stochastic gradient descent lead to strong generalization after training.

Source of Gradient Noise Batch size Val. Accuracy %
Sampling without replacement 128
Sampling with replacement 128
Sampling without replacement (fixed across epochs) 128
Additive 50’000
Multiplicative 50’000
- 50’000
Table 1: Summary of validation accuracies on the CIFAR-10 validation dataset for baseline types of gradient noise in experiments with data augmentations considered in Section 3.

With the same settings, we now switch to full-batch gradient descent. We replace the mini-batch updates by full batches and accumulate the gradients over all mini-batches. To rule out confounding effects of batch normalization, batch normalization is still computed over blocks of size 128 (Hoffer et al., 2017), although the assignment of data points to these blocks is kept fixed throughout training so that no stochasticity is introduced by batch normalization. In line with the literature on large-batch training, applying full-batch gradient descent with these settings reaches a substantially lower validation accuracy, yielding a wide gap in accuracy between SGD and GD (see Table 1). As a validation of previous work, we also note that this gap is not easily closed by injecting simple forms of gradient noise, such as additive or multiplicative noise, as can also be seen in Table 1.

In the following experiments, we will close the gap between full-batch and mini-batch training. We do this by eschewing common training hyper-parameters used for small batches, and re-designing the training pipeline to maintain stability without mini-batching.

3.2 Stabilizing Training

Training with huge batches leads to unstable behavior. As the model is trained close to its edge of stability (Cohen et al., 2020), we soon encounter spike instabilities, where the cross-entropy objective suddenly increases in value before quickly returning to its previous value and improving further. While this behavior can be mitigated with small-enough learning rates and aggressive learning rate decay (see supp. material), small learning rates not only mean that training makes less progress, but also induce a smaller implicit gradient regularization, i.e. Equation 8. Accordingly, we seek to reduce the negative effects of instability while keeping learning rates from becoming vanishingly small. In our experiments, we found that very gentle warmup learning rate schedules combined with aggressive gradient clipping enable us to maintain stability with a manageable learning rate.

Gentle learning rate schedules

Because full-batch training is notoriously unstable, the learning rate is now warmed up from 0.0 to 0.4 over 400 steps (each step is now an epoch) to maintain stability, and then decayed by cosine annealing (with a single decay without restarts) to 0.1 over the course of 3000 steps/epochs.

The initial learning rate of 0.4 is not significantly larger than in the small-batch regime, and it is extremely small by the standards of the linear scaling rule (Goyal et al., 2018), which would suggest a learning rate of roughly 39, or even a square-root scaling rule (Hoffer et al., 2017), which would predict a learning rate of roughly 2 when training longer. As the size of the full dataset is certainly larger than any critical batch size, we would not expect to succeed in fewer steps than SGD. Yet, the number of steps, 3000, is simultaneously huge, when measuring efficiency in passes through the dataset, and tiny, when measuring parameter update steps. Compared to the SGD baseline, this approach requires a ten-fold increase in dataset passes, but it provides a 39-fold decrease in parameter update steps. Another point of consideration is the effective learning rate of Li and Arora (2019). Due to the effects of weight decay over 3000 steps and limited annealing, the effective learning rate is not actually decreasing during training.
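A sketch of a schedule with this shape is given below: linear warmup to 0.4 over 400 steps followed by a single cosine decay that leaves roughly 0.1 at step 3000. The 4000-tick annealing horizon follows the description in the appendix; the exact offsets are otherwise assumptions.

```python
import math

def full_batch_lr(step, peak_lr=0.4, warmup=400, anneal_ticks=4000):
    """Linear warmup to peak_lr, then cosine annealing towards zero over anneal_ticks,
    so that roughly a quarter of peak_lr remains at step 3000."""
    if step < warmup:
        return peak_lr * step / warmup
    t = min(step - warmup, anneal_ticks)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * t / anneal_ticks))

print(full_batch_lr(400), full_batch_lr(3000))   # 0.4 at the end of warmup, ~0.11 at step 3000
```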

Experiment Mini-batching Epochs Steps Modifications Val. Accuracy %
Baseline SGD 300 117,000 -
Baseline FB 300 300 -
FB train longer 3000 3000 -
FB clipped 3000 3000 clip
FB regularized 3000 3000 clip+reg
FB strong reg. 3000 3000 clip+reg+bs32
FB in practice 3000 3000 clip+reg+bs32+shuffle
Table 2: Summary of validation accuracies in percent on the CIFAR-10 validation dataset for each of the experiments with data augmentations considered in Section 3. All validation accuracies are averaged over 5 runs.

Training with these changes leads to improved full-batch gradient descent performance, a significant increase over the baseline, but still well short of the performance of SGD. We summarize validation scores in Table 2 as we move across experiments.

Gradient Clipping:

We clip the gradient computed over the entire dataset so that its ℓ2 norm does not exceed a fixed threshold before updating parameters.

Training with all the previous hyperparameters and additional gradient clipping obtains a markedly higher validation accuracy. This is a significant increase over the previous result due to a surprisingly simple modification, even as other improvements suggested in the literature (label smoothing (Yamazaki et al., 2019), partial weight decay (Jia et al., 2018), adaptive optimization (You et al., 2017), sharpness-aware minimization (Foret et al., 2021)) fail to produce significant gains (see supp. material).

Gradient clipping is used in some applications to stabilize training (Pascanu et al., 2013). However, in contrast to its usual application in mini-batch SGD, where a few batches with high gradient contributions might be clipped in every epoch, here the entire dataset gradient is clipped. As such, the method is not a tool against heavy-tailed noise (Gorbunov et al., 2020), but it is effectively a limit on the maximum distance moved in parameter space during a single update. Because clipping simply changes the size of the gradient update but not its direction, clipping is equivalent to choosing a small learning rate when the gradient is large. Theoretical analysis of gradient clipping for GD in Zhang et al. (2019b) and Zhang et al. (2020) supports these findings, where it is shown that clipped descent algorithms can converge faster than unclipped algorithms for a class of functions with a relaxed smoothness condition. Note also that the clipping does not actually repress the spike behavior entirely. To remove spikes entirely would require a combination of even stronger clipping and reduced step sizes, but the latter would reduce both training progress and the gradient regularization via Equation 8.
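A minimal sketch of clipping the accumulated full gradient is shown below (the threshold is a required argument here, since the paper's exact value is part of its configuration). This is essentially what torch.nn.utils.clip_grad_norm_ does; it is written out explicitly to emphasize that only the magnitude of the update changes, never its direction.

```python
import torch

def clip_full_gradient(params, max_norm, eps=1e-6):
    """Rescale the accumulated full-dataset gradient to an l2 norm of at most max_norm."""
    grads = [p.grad for p in params if p.grad is not None]
    total_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    if total_norm > max_norm:
        for g in grads:
            g.mul_(max_norm / (total_norm + eps))   # shrink magnitude, keep direction
    return total_norm
```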

Figure 2: Cross-Entropy Loss on the training and validation set and full loss (including weight decay) during training for full-batch gradient descent. Left when training as described in Section 3.2, right with additional gradient clipping. Validation computed every 100 steps.

3.3 Bridging the gap with Explicit Regularization

Finally, there is still the bias of mini-batch gradient descent towards solutions with low gradient norm per batch, described in Equation 5 and Equation 7. This bias, although a second-order effect, is noticeable in our experiments. We can replicate this bias as an explicit regularizer via Equation 7. However, computing exact gradients of this regularizer directly is computationally expensive due to the repeated Hessian-vector products in each accumulated batch, especially within frameworks without forward-mode automatic differentiation, which would allow the method of Pearlmutter (1994) for fast computation of Hessian-vector products. As such, we approximate the gradient of the regularizer through a finite-differences approximation and compute

$$\nabla_\theta \left\| \nabla_\theta \mathcal{L}(\theta) \right\|^2 \approx \frac{2}{\epsilon} \left( \nabla_\theta \mathcal{L}\big(\theta + \epsilon\, \nabla_\theta \mathcal{L}(\theta)\big) - \nabla_\theta \mathcal{L}(\theta) \right). \qquad (9)$$

This approximation only requires one additional forward-backward pass, given that $\nabla_\theta \mathcal{L}(\theta)$ is already required for the main loss function. Its accuracy is similar to a full computation of the Hessian-vector products (see supplementary material). In all experiments, we choose the differential length $\epsilon$ similar to (Liu et al., 2018a). To compute Equation 7, the same derivation is applied to the averaged gradient of each mini-batch $\mathcal{B}_k$.
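A sketch of this forward-difference approximation follows; the handling of parameters and the value of ε are illustrative placeholders, while the in-place variant actually used is described in the appendix.

```python
import torch

def grad_penalty_grad(model, loss_fn, inputs, targets, eps=0.01):
    """Approximate grad_theta ||grad_theta L(theta)||^2 via Equation 9:
    (2 / eps) * (grad L(theta + eps * grad L(theta)) - grad L(theta))."""
    params = list(model.parameters())
    grads = torch.autograd.grad(loss_fn(model(inputs), targets), params)
    with torch.no_grad():                            # offset parameters to theta + eps * grad
        for p, g in zip(params, grads):
            p.add_(g, alpha=eps)
    grads_offset = torch.autograd.grad(loss_fn(model(inputs), targets), params)
    with torch.no_grad():                            # restore the original parameters
        for p, g in zip(params, grads):
            p.sub_(g, alpha=eps)
    return [(go - g).mul_(2.0 / eps) for go, g in zip(grads_offset, grads)]
```

The result would then be scaled by the regularization coefficient and added to the accumulated loss gradient before the parameter update.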

Gradient Penalty:

We regularize the loss via the gradient penalty in Equation 7, whose strength we control via a coefficient $\alpha$ that is kept fixed for these experiments.

We use this regularizer entirely without sampling, computing it over the fixed mini-batch blocks $\mathcal{B}_k$ already used for batch normalization, which are never shuffled. Note that this regularizer can be computed in parallel across all batches in the dataset. Theoretical results from Smith et al. (2020b) do not guarantee that the regularizer can work in this setting, especially given the relatively large step sizes we employ. However, the regularizer has the direct effect that after optimization not only the full gradient $\nabla_\theta \mathcal{L}(\theta)$ is small, but also the gradient on each mini-batch $\mathcal{B}_k$. Intuitively, the model is optimized so that it is still optimal when evaluated only on subsets of the training set (such as these mini-batches).

Applying this regularizer on top of clipping and longer training periods further improves validation accuracy when the regularizer is accumulated over the default batch size of 128. The accuracy increases further if the accumulation batch size is reduced to 32 (reducing the SGD batch size in the same way does not lead to an additional improvement, see supp. material). Reducing the accumulation batch size is beneficial because the regularizer of Equation 7 moves closer to a direct penalty on the per-example gradient norms of Poggio and Cooper (2020), yet the computational effort increases proportionally.

Double the learning rate:

We again increase the initial learning rate, now to 0.8 at iteration 400, which then decays to 0.2 by the same cosine schedule over the course of 3000 steps/epochs.

This second modification of the learning rate is, interestingly, only an advantage after the regularizer is included. Training with this learning rate and clipping, but without the regularizer (i.e. as in Section 3.2), reduces accuracy slightly. However, the larger learning rate significantly improves performance when the regularizer is included, for both accumulation batch sizes, and is finally fully on par with SGD.

Overall, we find that after all modifications, both full-batch training (with random data augmentations) and SGD behave similarly, achieving more than 95% validation accuracy. Figure 3 visualizes the loss landscape around the found solution throughout these changes. Noticeably, both clipping and gradient regularization correlate with a flatter landscape.

Remark (The Practical View).

Throughout these experiments with full-batch GD, we have decided not to shuffle the data in every epoch, to rule out confounding effects of batch normalization. If we turn shuffling back on, we reach a validation accuracy that even slightly exceeds SGD. The beneficial effect of shuffling appears to be on par with the performance gain seen in Table 1 attributed to shuffling batch normalization layers, but it is also conceivable that shuffling improves the regularization effect, as the batch order in the regularizer is then random. This is the practical view, given that shuffling is nearly free in terms of performance, but it of course potentially introduces a meaningful source of gradient noise, which is why it is not our main focus.

Furthermore, to verify that this behavior is not specific to the ResNet-18 model considered so far, we also evaluate a range of related vision models with exactly the same hyperparameters. Results can be found in Table 3 for ResNet-50, ResNet-152 and a DenseNet-121, where we find that our methods generalize to these models as well. It is surprising that we are able to utilize the same hyperparameters easily, as previous work such as Shallue et al. (2019) has highlighted the brittleness of hyperparameters in the default large-batch setting. We provide code to replicate all experiments at github.com/JonasGeiping/fullbatchtraining.

Experiment ResNet-18 ResNet-50 ResNet-152 DenseNet-121
Baseline SGD
Baseline FB
FB train longer
FB clipped
FB regularized
FB strong reg.
FB in practice
Table 3: Summary of validation accuracies in percent on the CIFAR-10 validation dataset for each of the experiments with data augmentations considered in Section 3 for multiple modern convolutional models.

4 Full-batch gradient descent in the totally non-stochastic setting

A final question remains – if the full-batch experiments shown so far work to capture the effect of mini-batch SGD, what about the stochastic effect of random data augmentations on gradient noise? It is conceivable that the generalization effect is affected by the noise variance of data augmentations. As such, we repeat the experiments of the last section in several variations.

Figure 3: One-dimensional loss landscapes visualizations (random direction) of models trained with gradient descent, going from SGD (left) to GD with successive modifications (right). Whereas the models trained with unmodified gradient descent (middle) are noticeably sharper than the model trained with stochastic gradient descent (left), the final model trained with modified gradient descent (right) replicates the qualitative properties of the SGD model.

No Data Augmentation

If we do not apply any data augmentations and repeat the previous experiments, then GD with clipping and regularization substantially beats SGD with default hyperparameters and nearly matches SGD with newly tuned hyperparameters, see Table 4. Interestingly, in this example not only is SGD matched by the modified GD, the modified GD is even more stable: it works well with the same hyperparameters as described in the previous section, and it is SGD that we have to tune, even though it benefits from the same regularization implicitly.

Enlarged CIFAR-10

To analyze both GD and SGD in a setting where they enjoy the benefits of augmentation, but without stochasticity, we replace the random data augmentations with a fixed, enlarged CIFAR-10 dataset. This dataset is generated by sampling random data augmentations for each data point before training. These samples are kept fixed during training and never resampled, resulting in a ten-times larger CIFAR-10 dataset. This dataset contains the same kind of variations that would appear through data augmentation, but is entirely devoid of stochastic effects on training.
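A sketch of how such a fixed, enlarged dataset can be materialized is given below; it is illustrative only, and the paper's implementation writes the result to an LMDB database rather than holding it in memory.

```python
import torch
import torchvision
import torchvision.transforms as T

augment = T.Compose([T.RandomHorizontalFlip(), T.RandomCrop(32, padding=4), T.ToTensor()])
base = torchvision.datasets.CIFAR10(root="./data", train=True, download=True)

def enlarge(dataset, copies=10):
    """Sample `copies` augmented versions of every image once, then freeze them."""
    images, labels = [], []
    for _ in range(copies):                  # each round draws fresh random augmentations
        for img, label in dataset:           # img is a PIL image, label an int
            images.append(augment(img))
            labels.append(label)
    return torch.utils.data.TensorDataset(torch.stack(images), torch.tensor(labels))

fixed_cifar10_x10 = enlarge(base)            # never re-augmented during training
```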

If we consider this experiment for the ten-times enlarged CIFAR-10 dataset, then we do recover strong validation accuracy (Table 4). Note that we present this experiment only because of its implications for deep learning theory; computing the gradient over the enlarged CIFAR-10 is ten times as expensive, and there are additional training expenses incurred through increased step numbers and regularization. For this reason we do not endorse training this way as a practical mechanism. Note that SGD still has an advantage over this enlarged CIFAR-10: SGD effectively sees 300 differently augmented CIFAR-10 datasets, once each, over its 300 epochs of training. If we take the same enlarged CIFAR-10 dataset and train SGD by selecting one of the 10 augmented versions in each epoch, then SGD reaches the accuracy reported in Table 4.

Overall, we find that we can reach more than 95% validation accuracy entirely without stochasticity, after disabling gradient noise induced through mini-batching and shuffling as well as through data augmentations. The gains compared to the setting without data augmentations are realized only through the increased dataset size. This shows that the noise introduced through data augmentations does not appear to influence generalization in our setting and is by itself also not necessary for generalization.

Experiment Fixed Dataset Mini-batching Steps Modifications Val. Accuracy
Baseline SGD CIFAR-10 -
Baseline SGD* CIFAR-10 -
FB strong reg. CIFAR-10 clip+reg+bs32
Baseline SGD CIFAR-10 -
FB CIFAR-10 -
FB strong reg. CIFAR-10 clip+reg+bs32
Table 4: Summary of validation accuracies on the CIFAR-10 validation dataset when training with a fixed version of the dataset with no random data augmentations in Section 4. Hyperparameters are fixed from the previous section except for the baseline SGD marked *, where the initial learning rate is doubled to 0.2 to provide a stronger baseline.

5 Discussion & Conclusions

SGD, which was originally introduced to speed up computation, has become a mainstay of neural network training. The hacks and tricks at our disposal for improving generalization in neural models are the result of millions of hours of experimentation in the small batch regime. For this reason, it should come as no surprise that conventional training routines work best with small batches. The heavy reliance of practitioners on small batch training has made stochastic noise a prominent target for theorists, and SGD continues to be the practical algorithm of choice, but the assumption that stochastic mini-batching by itself is the unique key to reaching the impressive generalization performance of popular models may not be well founded.

In this paper, we show that full-batch training matches the performance of stochastic small-batch training for a popular image classification benchmark. We observe that (i) with randomized augmentations, full-batch training can match the performance of even a highly optimized SGD baseline for a ResNet-18 on CIFAR-10, (ii) without any form of data augmentation, fully non-stochastic training beats SGD with standard hyperparameters, and matches it when the SGD hyperparameters are optimized, and (iii) after a ten-fold fixed dataset expansion, full-batch training with no stochasticity exceeds 95% accuracy, matching SGD on the same dataset.

The results in this paper focus heavily on commonly used vision models. While the scope may seem narrow, the existence of these counter-examples is enough to show that stochastic mini-batching, and by extension gradient noise, is not required for generalization. It also strongly suggests that any theory that relies exclusively on stochastic properties to explain generalization is unlikely to capture the true phenomena responsible for the success of deep learning. Stochastic sampling has become a focus of the theory community in efforts to explain generalization. However, experimental evidence in this paper and others suggests that strong generalization can be achieved with large or even full batches. If stochastic regularization does indeed have benefits that cannot be captured through non-stochastic, regularized training, those benefits are just the cherry on top of a large and complex cake.

This research was heavily supported by the OMNI cluster of the University of Siegen which contributed a notable part of its GPU resources to the project and we thank the Zentrum für Informations- und Medientechnik of the University of Siegen for their support. We further thank the University of Maryland Institute for Advanced Computer Studies for additional resources and support through the Center for Machine Learning cluster.

This research was supported by the universities of Siegen and Maryland and by the ONR MURI program, AFOSR MURI Program, and the National Science Foundation Division of Mathematical Sciences.

References

Appendix A Experimental Setup

A.1 Experimental Details

As mentioned in the main body, all experiments are evaluated on the CIFAR-10 dataset. The data is normalized per color channel. When augmented, the augmentations are random horizontal flips and random crops of size 32×32 after zero-padding by 4 pixels in both spatial dimensions. For the experiments with a fixed CIFAR-10, the data is fully written to a database (LMDB) in rounds, to guarantee that the dataset is fixed. The same fixed dataset is used for all experiments using that dataset. The ResNet-18 model used for most of the experiments is the default model as described in He et al. (2015a, 2019), with the usual CIFAR-10 adaptation of replacing the ImageNet stem (7×7 convolution and max-pooling) with a 3×3 convolution. All experiments run in float32 precision.

Unless otherwise mentioned, the batch size during gradient accumulation is 128. Gradients are averaged over all machines and batches using a running mean. The optimizer is always gradient descent with Nesterov momentum, with learning rates as specified in the main body. Weight decay is applied to all layers. The learning rate in the basic full-batch setting with 3000 steps decays from 0.4 towards 0.0 (after the 400-step linear warmup from 0.0 to 0.4) via cosine annealing over 4000 ticks, so that 0.1 is reached at the final iterate 3000 (the algorithm could be run for the additional 1000 steps to anneal to 0, but we did not find that this helped or hindered generalization performance, and as such we iterate only up to 3000 steps for efficiency reasons). For the initial learning rate of 0.8 used later, this corresponds to the same schedule with warmup, but starting from 0.8 at iteration 400, which decays to 0.2 at iteration 3000. The gradient clipping is computed based on the norm of the fully accumulated gradient vector (after addition of regularizer gradients if applicable), and the gradient vector is divided by this norm value with a fudge factor of 1e-6 if the target norm value is exceeded. This is also the PyTorch (Paszke et al., 2017) gradient clipping fudge factor. Batch normalization statistics are accumulated sequentially. If multiple GPUs are used, then these accumulated statistics are averaged over all machines before each validation.
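A compact sketch of this accumulation scheme is given below; the running-mean bookkeeping is an illustration of the described procedure, not the repository's actual code, and it assumes equally sized blocks.

```python
import torch

def accumulate_full_gradient(model, loss_fn, blocks):
    """Average the gradient over all fixed mini-batch blocks with a running mean,
    yielding the gradient of the full-dataset loss (for equally sized blocks)."""
    params = list(model.parameters())
    mean_grads = [torch.zeros_like(p) for p in params]
    for k, (inputs, targets) in enumerate(blocks):       # blocks of size 128 by default
        loss = loss_fn(model(inputs), targets)
        grads = torch.autograd.grad(loss, params)
        for m, g in zip(mean_grads, grads):
            m.add_((g - m) / (k + 1))                    # running-mean update
    return mean_grads
```

The clipped (and, if applicable, regularized) result is then written into the parameters' gradient fields and a single optimizer step is taken.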

Gradient regularization as described in the main body via the forward-differences approximation is implemented by in-place addition of the already computed batch gradient to the model parameters, with differential length $\epsilon$ as suggested in (Liu et al., 2018a). The gradient at this offset location is computed through automatic differentiation as usual, and the finite difference of both gradients is added to the loss gradient, scaled by the regularization coefficient $\alpha$. Afterwards, the gradient is subtracted from the model parameters in-place to restore their original values.

All reported statistics are based on averaged results over 4-5 trials in all cases where a standard deviation is reported. Numbers without standard deviation correspond to single runs. In a few edge cases where spike behavior such as in Figure 2 is seen at the final iteration 3000, and the training loss is accordingly large, the validation accuracy at minimal training loss is reported as a fallback. Note that this fallback option is entirely based on training performance (and as such does not optimize on the validation set).

The loss landscape visualizations in Figure 1 and Figure 3 are computed by sampling the loss landscape of a fixed model (with batch normalization in evaluation mode) in a fixed random direction, which is drawn by filter normalization as described in Li et al. (2018) in a single dimension.

We provide the code to repeat these experiments with our PyTorch implementation at github.com/JonasGeiping/fullbatchtraining.

A.2 Computational Setup

This experimental setup is implemented in PyTorch (Paszke et al., 2017). All experiments are run on an internal SLURM cluster of NVIDIA Tesla V100-PCIE-16GB GPUs. Jobs in the data-augmented setting were mainly run on single GPUs one at a time, whereas the jobs on the enlarged fixed CIFAR-10 dataset and jobs with gradient regularization were run distributed over multiple GPUs. In both settings, this amounts to 16-32 GPU hours of computation time for each experiment (depending on hyperparameters, especially gradient regularization), and several times as much for the enlarged CIFAR-10 variants.

Appendix B Ablation Studies

B.1 Other techniques for generalization

Several other strategies to increase generalization performance have been proposed in the literature on large-batch training. Here we enumerate some of these strategies as alternatives to the gradient clipping used in the main body. The baseline here is FB with long training. Label smoothing leads to a full-batch performance that is lower on average, due to a single outlier run; even with that run removed, the result remains significantly lower than the gradient clipping result, with which label smoothing also does not stack. Applying weight decay only to linear (convolutional and fully-connected) layers likewise does not produce a significant gain. Sharpness-aware minimization (SAM) on the full gradient level also falls short, even with our increased budget of 3000 iterations and gentle learning rate scheduling; however, note the connection between gradient regularization and SAM when accumulated over mini-batches discussed in Section B.4.

Furthermore, regarding discussions about reducing the batch size for SGD, we find that SGD with a reduced batch size of 32 and without hyperparameter adaptation does not improve over the baseline; when the learning rate is scaled down accordingly, SGD reaches an accuracy slightly below the batch-size-128 value. Using a batch size of 32 for accumulation is further not an advantage for variants without regularization: both FB train longer and FB clipped from Table 2 reach lower accuracy with this accumulation batch size than with accumulation batch size 128.

B.2 Training without spikes

Figure 4 shows training curves for stable full-batch training without spiking behavior. However, with these reduced learning rates the optimization does not reach the levels of performance shown in the main body within the allotted 3000 steps.

Figure 4: Cross-entropy loss on the training and validation set during training for full-batch gradient descent, in direct comparison to Figure 2 in the main body, but with reduced learning rates. Top row: reduced learning rates without clipping. Bottom row: the same step sizes but with gradient clipping. The learning rate of 0.4 is pictured with and without clipping in the main body. Training behavior is stabilized at lower learning rates, but significantly slowed in most cases. In the first case, some overfitting appears in the validation loss, possibly corresponding to the reduced regularization of the full gradient, i.e. Barrett and Dherin (2020), and catapult behavior of Lewkowycz et al. (2020). Weight decay regularization not pictured. Validation computed every 100 steps.

B.3 Hessian-vector product approximation

The Hessian-vector product necessary to compute the gradient of the gradient regularization of Equation 7 can also be computed by automatic differentiation instead of using finite differences. Using PyTorch's automatic differentiation via autograd.grad led to a validation performance nearly identical to that of the forward-differences approximation we otherwise employ, in a preliminary experimental setup with label smoothing.

Another alternative would be to improve the precision of the finite-differences approximation. Although we employ a forward-difference scheme, it is also possible to utilize a central-difference scheme, which has beneficial approximation properties. Using central differences instead of forward differences leads to similar performance in another preliminary FB regularized setting.

However, the additional computational effort is significant in both cases: the FB regularized experiment with the forward-differences approximation runs noticeably faster than the same experiment with central differences or with full automatic differentiation. We therefore employ the forward-differences approximation only, especially in preparation for the enlarged CIFAR-10 experiments.

B.4 Relationship of HVP approximation and sharpness-aware minimization

The gradient regularization of Smith et al. (2020b) is related to the sharpness-aware minimization of Foret et al. (2021) if the latter is computed on a mini-batch level and accumulated over the entire dataset. This relationship is especially apparent when using the approximation of the Hessian-vector product proposed in the main body. In our notation, the sharpness-aware minimization update consists of an update step based on the gradient

$$\nabla_\theta \mathcal{L}\left(\theta + \rho\, \frac{\nabla_\theta \mathcal{L}(\theta)}{\|\nabla_\theta \mathcal{L}(\theta)\|}\right). \qquad (10)$$

In comparison, the update via the loss gradient plus the derivative of the regularizer can be written (with constant factors absorbed into the coefficient $\alpha$) as

$$\nabla_\theta \mathcal{L}(\theta) + \frac{\alpha}{\epsilon}\left(\nabla_\theta \mathcal{L}\big(\theta + \epsilon\, \nabla_\theta \mathcal{L}(\theta)\big) - \nabla_\theta \mathcal{L}(\theta)\right). \qquad (11)$$

If we consider the differential step size $\epsilon = \rho / \|\nabla_\theta \mathcal{L}(\theta)\|$ of (Liu et al., 2018b) and identify the offset point in Equation 11 with the SAM point in Equation 10, then we can rewrite this update as

$$\left(1 - \frac{\alpha}{\epsilon}\right)\nabla_\theta \mathcal{L}(\theta) + \frac{\alpha}{\epsilon}\, \nabla_\theta \mathcal{L}\left(\theta + \rho\, \frac{\nabla_\theta \mathcal{L}(\theta)}{\|\nabla_\theta \mathcal{L}(\theta)\|}\right). \qquad (12)$$

This shows that from the point of view of Foret et al. (2021), gradient regularization is an interpolation between the normal loss gradient and the adversarial gradient that depends on the step size. From the point of view of Smith et al. (2020b), SAM minimization accumulated over mini-batches is a finite-difference approximation of the gradient regularization for a fixed step size. Both are equivalent iff $\alpha = \epsilon$, i.e. iff $\alpha \|\nabla_\theta \mathcal{L}(\theta)\| = \rho$, so that equivalence is reached at a specific value of the gradient norm determined by the hyperparameters at the given stage of training. If the gradient norm is greater than this value of equivalence, then the adversarial gradient dominates; if it is smaller, the loss gradient dominates. However, we note that according to the experiments in Section B.3, gradient regularization can also be implemented via automatic differentiation, so that (in the spirit of this work) the finite-differences approximation itself is not necessary for the generalization effect of this regularizer.

B.5 Chaos Theory - What our results do not show

We would like to point out that while our results show that stochastic mini-batching (or even non-stochastic minibatching) in gradient descent is not necessary to achieve state-of-the-art generalization behavior, this does not entirely rule out stochastic modelling of the behavior of GD for deep neural networks as proposed in works such as Chaudhari and Soatto (2018); Kunin et al. (2021) and Simsekli et al. (2019). Even a full-batch gradient descent algorithm could potentially exhibit chaotic behavior on the loss surface of deep neural networks, which could be modelled by statistical techniques. In this work, we can make no statement about whether chaotic behavior exists for these examples of gradient descent and whether it has an impact on model performance.

Appendix C Additional Details

C.1 Social Impact

We foresee no direct social impact of this work at the moment.

C.2 Asset Licences

We use only variants of CIFAR-10 data (Krizhevsky, 2009) in our experiments, for more information refer to https://www.cs.toronto.edu/~kriz/cifar.html. Code licenses for submodules are included within their respective files and can be found as part of our code repository.