Previous work has shown that neural networks with large capacity can generalize better than smaller-capacity networks, even in the absence of explicit regularization. Neyshabur et al. (2014) suggested, through an analogy to matrix factorization, that network size is not the main form of capacity control in neural networks. Zhang et al. (2016) then demonstrated that neural networks are capable of memorizing random labels, yet still generalize given good data. These findings prompted investigation into whether stochastic gradient descent (SGD) implements some form of regularization that allows larger architectures to outperform smaller ones, even in the absence of explicit regularization such as dropout Srivastava et al. (2014), batch normalization Ioffe and Szegedy (2015), and weight decay Krogh and Hertz (1992).
One line of inquiry has studied how noise may improve generalization ability. An (1996) studied the effect of adding noise to backpropagation. Blundell et al. (2015) found that training so that the weights learn to cope with uncertainty improves generalization. Later, Mandt et al. (2016) noted that when training with SGD, each minibatch of size $S$ provides independent samples of the gradient. Letting $w_t$ be the weights at time $t$, $L$ the training loss, and $\eta$ the learning rate, Mandt et al. (2016) describe the SGD update as
$$w_{t+1} = w_t - \eta\,\big(\nabla L(w_t) + \epsilon_t\big),$$
where $\epsilon_t$ has zero mean and some covariance, and is referred to as the noise induced by minibatch sampling. It was later discovered that the noise in SGD is anisotropic, prompting study of the gradient noise when the covariance matrix of $\epsilon_t$ is not constant Zhu et al. (2019).
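The zero-mean property of minibatch noise can be checked directly on a toy problem. The sketch below is illustrative only: it assumes a scalar linear model with squared loss and made-up data, and it defines the noise as the minibatch gradient minus the full-batch gradient. Enumerating every minibatch of a given size (sampling without replacement) makes the average exact rather than a Monte Carlo estimate.

```python
from itertools import combinations

# Toy data for a scalar linear model y ~ w * x (illustrative values only).
xs = [0.5, -1.0, 2.0, 1.5, -0.5, 1.0]
ys = [1.0, -2.5, 3.5, 3.0, -1.0, 2.5]


def example_grad(w, x, y):
    """Gradient of the per-example loss 0.5 * (w * x - y) ** 2 with respect to w."""
    return (w * x - y) * x


def batch_grad(w, indices):
    """Average per-example gradient over the examples selected by `indices`."""
    indices = list(indices)
    return sum(example_grad(w, xs[i], ys[i]) for i in indices) / len(indices)


def minibatch_noise(w, batch_size):
    """Minibatch gradient minus full gradient, for every possible minibatch."""
    full = batch_grad(w, range(len(xs)))
    batches = combinations(range(len(xs)), batch_size)
    return [batch_grad(w, b) - full for b in batches]


noise = minibatch_noise(w=0.3, batch_size=2)
mean_noise = sum(noise) / len(noise)
var_noise = sum(e * e for e in noise) / len(noise)
print(f"mean noise = {mean_noise:.2e}, noise variance = {var_noise:.4f}")
```

The mean vanishes up to floating-point error because every example appears in the same number of minibatches, while the variance stays positive and shrinks as the batch size grows.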
Related to the idea of noise improving generalization performance are parameter perturbations. Parameter perturbations are a tool used in PAC-Bayes bounds Dziugaite and Roy (2017), which include a term measuring the ‘sharpness’ of the final minimum found by training. The ‘flatness’ of a minimum of the training loss relates to the volume of the space around the final minimizer whose loss is similar to that of the minimizer itself. Keskar et al. (2017) found that flat minimizers tend to be more robust to noise introduced by parameter perturbations, and that large batch training produces sharper minimizers than small batch training. Noise has also been used as an explanation for explicit regularization such as dropout Wager et al. (2013).
In this paper, we consider how the inherent noisiness of using a gradient-based optimizer, together with capacity, may contribute to generalization in neural networks. In particular, we notice that instead of iterating through the scalar parameters and updating them one by one, GD updates all the parameters at the same time. As a result, each scalar parameter computes its partial derivative at the stale parameter vector, but then suffers the loss evaluated at the jointly updated parameter vector.
We find a term to describe the above noise, and find that the optimization introduces a penalty for solutions that are sensitive to parameter perturbation. We then relate it to the Taylor series of the loss and compare the first-order approximation to the loss made by SGD with the actual change in loss. We find that although larger models may overfit more in a first-order sense, this implicit penalty is also higher for them, potentially producing a regularization effect.
1.1 Related Work
There has been a line of inquiry into the dot product of the gradients during SGD training. Sankararaman et al. (2019) noted how width and depth affect a quantity they call ‘gradient confusion’, and determined how this affects the speed of convergence of SGD. Others Arpit et al. (2017) have measured the ‘loss’ sensitivity of networks of different capacity on good versus corrupted data. Several works have examined whether neural networks learn ‘simpler’ functions before learning more complex ones Kalimeris et al. (2019) Rahaman et al. (2019).
In order to study the implicit regularization provided by SGD, one line of work has examined the ‘flatness’ or ‘sharpness’ of the minima found by SGD Hoffer et al. (2017), with the hypothesis that flatter minima generalize better. Other work has posited that the ratio of learning rate to batch size is important in SGD optimization Jastrzebski et al. (2017), or has analyzed the anisotropic nature of the noise in SGD Zhu et al. (2019). Dinh et al. (2017) examine whether sharp minima for neural networks can generalize, and conclude that flatness must be defined carefully. Dziugaite and Roy (2017) took a PAC-Bayes approach to computing generalization bounds. Neyshabur et al. (2019) studied the effect of over-parameterization on generalization by looking at ‘unit capacity’ and ‘unit impact’ for two-layer ReLU networks. Other work has empirically studied how network width may affect the ‘noise scale’ of the network Park et al. (2019).
We use . We denote by the true (general) loss associated with the weights of the neural network, at a certain time t. That is, let be training data points and labels, such that and be data drawn from some distribution, , and let
be some loss function, then. We will sometimes omit the term and write , where the corresponding to the is taken implicitly. We describe the training loss, which is the average loss over the training data as
where is a set containing the training data points.
SGD: We take to be the empirical loss evaluated on the th minibatch, . Instead of taking the full gradient update over , SGD computes
Typically, the learning rate schedule is manipulated; for our experiments and analysis, however, we maintain a fixed constant learning rate, so that we may separate the effects of the learning rate schedule from the effects of batching and SGD. Although SGD is explicitly given (and told to minimize) only the training loss, without direct knowledge of the true loss, in practice it often manages to find a solution that has reasonable generalization loss.
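For concreteness, the quantities described in this section can be written in one standard form, sketched below. All symbols are assumptions made for illustration and may differ from the notation used elsewhere in the paper: $\mathcal{D}$ is the data distribution, $S$ the training set, $B_k$ the $k$-th minibatch, $\ell$ the per-example loss, and $\eta$ the constant learning rate.

```latex
\begin{align*}
L(w_t) &= \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(w_t; x, y)\big]
  && \text{(true loss)}\\
L_S(w_t) &= \frac{1}{|S|}\sum_{(x,y)\in S}\ell(w_t; x, y)
  && \text{(training loss)}\\
L_{B_k}(w_t) &= \frac{1}{|B_k|}\sum_{(x,y)\in B_k}\ell(w_t; x, y)
  && \text{(loss on the $k$-th minibatch)}\\
w_{t+1} &= w_t - \eta\,\nabla L_{B_k}(w_t)
  && \text{(SGD step with constant learning rate)}
\end{align*}
```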
2 Our model
Simultaneous move games. GD updates all the scalar weights at the same time instead of updating them individually. Each scalar weight therefore knows the values of the other weights at time $t$ but is then evaluated at time $t+1$. An analogy to this process is the game of synchronous chess, in which two scalar weights play the role of the players, and each player must make their move based on the current state of the board, but simultaneously and without knowledge of the other player’s concurrent move (of course, in synchronous chess each player may try to ‘guess’ what the other player will do, whereas the weights do not).
2.1 Parameter updates at the same time
Because we are considering GD in this section, we sometimes omit the data argument of the loss, since it is always the full training set. In contrast to gradient descent, consider the following algorithm:
In other words, this algorithm takes the partial derivative of each scalar weight, and updates one of them at a time, instead of updating them all at the same time. The are optimized jointly, so that each knows what the current values of the others are when it makes its decision on how to update. The change in loss experienced by weight is . Notice that the only weight that changes is , so when updates itself there is no uncertainty introduced by the other weights .
Gradient descent, by contrast, computes all the gradients at the old weights, as follows:
However, following these updates, each weight suffers a loss
Each weight computed its partial derivative at , and therefore had full information about the other parameters at time . However, because all the parameters are combined to produce a single model with , an implicit penalty is introduced for weight changes , that were not robust to perturbations made by the other weights. More specifically, if all weights updated in the same round, and no uncertainty were introduced by any of the weights for any updating weight , the change in loss at the end of the round would be:
but in actuality, GD first combines the various weight updates into a single model with weights , and then produces a joint penalty as follows:
The first term is the objective function, and searches for weights that would most improve the loss if no uncertainty were introduced by any weight for any other weight . The second term can be thought of as a regularizer, or penalty. It will reward weight choices whose effect on the loss is similar or better when they are implemented alongside other parameter updates than when they are implemented individually. These effects apply to the discrete dynamics of GD. Namely, if the learning rate is small enough, it may be close to the case that the other parameters don’t change very much.
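The decomposition into an objective and a penalty can be written out explicitly. The following is a sketch under assumed notation, not a restatement of the paper's own displays: $w_t \in \mathbb{R}^d$ is the weight vector, $e_i$ the $i$-th standard basis vector, $\eta$ the learning rate, and $\Delta_i = -\eta\,\partial_i L(w_t)$ the GD update applied to the $i$-th scalar weight.

```latex
\begin{align*}
w_{t+1} &= w_t + \sum_{i=1}^{d} \Delta_i e_i
  \qquad \text{(all partial derivatives are taken at } w_t\text{)}\\[4pt]
L(w_{t+1}) - L(w_t)
  &= \underbrace{\sum_{i=1}^{d}\big[L(w_t + \Delta_i e_i) - L(w_t)\big]}_{\text{objective: each update applied alone}}\\
  &\quad + \underbrace{\big[L(w_{t+1}) - L(w_t)\big]
     - \sum_{i=1}^{d}\big[L(w_t + \Delta_i e_i) - L(w_t)\big]}_{\text{penalty: effect of applying the updates jointly}}
\end{align*}
```

With this sign convention, a negative penalty means the joint update reduces the loss by more than the individual updates would suggest, and a positive penalty means the updates interfere with one another.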
Notice that if the loss were to behave linearly over this round:
So linear models, in which no weight introduces uncertainty for any other weight, would not receive a penalty. We will be interested in experimentally examining the effect of this penalty for SGD. To do so, we will create a Taylor approximation to the loss and measure the first-order effects versus the higher-order effects, but first we discuss why the above penalty may be linked to generalization.
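The contrast between a linear loss and a loss whose weights interact can be illustrated numerically. The sketch below is a toy example under assumptions of our own: the objective is taken to be the sum of the loss changes obtained when each scalar weight moves alone, the penalty is the actual change after the joint step minus that sum, and both loss functions are hypothetical. Gradients are estimated by central differences to keep the example dependency-free.

```python
def grad(loss, w, h=1e-6):
    """Central-difference estimate of the gradient of `loss` at `w`."""
    g = []
    for i in range(len(w)):
        up = w[:i] + [w[i] + h] + w[i + 1:]
        down = w[:i] + [w[i] - h] + w[i + 1:]
        g.append((loss(up) - loss(down)) / (2 * h))
    return g


def objective_and_penalty(loss, w, lr):
    """Split the change in loss from one GD step into two terms.

    objective: sum over i of the change in loss when only weight i moves.
    penalty:   actual change in loss after the joint step, minus the objective.
    """
    g = grad(loss, w)
    base = loss(w)
    objective = 0.0
    for i in range(len(w)):
        alone = list(w)
        alone[i] -= lr * g[i]  # only weight i moves
        objective += loss(alone) - base
    joint = [wi - lr * gi for wi, gi in zip(w, g)]  # all weights move together
    penalty = (loss(joint) - base) - objective
    return objective, penalty


def linear_loss(w):
    """Hypothetical loss that is linear in the weights."""
    return 2.0 * w[0] - 3.0 * w[1] + 0.5


def coupled_loss(w):
    """Hypothetical loss in which the two weights interact multiplicatively."""
    return (w[0] * w[1] - 1.0) ** 2


w0 = [2.0, 1.5]
print("linear :", objective_and_penalty(linear_loss, w0, lr=0.1))
print("coupled:", objective_and_penalty(coupled_loss, w0, lr=0.1))
```

For the linear loss the penalty is zero up to rounding, whereas for the coupled loss each weight overestimates the progress it can make because it does not account for the other weight moving in the same round.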
We notice that larger models have more nodes, and hence have a propensity to behave more non-linearly, and a potential ability to claim higher rewards from Equation 5 without generalizing well. However, we hypothesize that any undesirable non-linear behavior will be curbed by producing a higher value of the penalty above. We reason that if large models are regularized more using this mechanism, they may achieve better generalization performance.
2.2 Penalizing functions not robust to perturbation
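There is a rich literature relating noise to generalization. Consider, as a counterexample, a decision tree, which is prone to overfitting unless ensembled. From The Elements of Statistical Learning, Friedman et al. (2001), p. 307: for splitting variable $j$ and split point $s$, the split can be decided according to the following optimization problem:
$$\min_{j,\,s}\Big[\min_{c_1}\sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2}\sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\Big],$$
where $R_1(j,s) = \{x \mid x_j \le s\}$ and $R_2(j,s) = \{x \mid x_j > s\}$ are the two regions produced by the split.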
This optimization problem gives the tree a greedy, but precise look at the loss after the update, and it may choose the that produce the best value of the loss a posteriori. The optimization is therefore not inherently noisy, since is aware of exactly which it will be paired with and has access to the resulting loss, and the penalty term described in the previous section does not apply. The neural network, by contrast, cannot for example try all possible weight vectors such that and select the one that produces the lowest loss.
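The greedy but exact nature of the tree's search can be made concrete with a small sketch. It assumes a regression setting with squared error, in which each candidate pair of splitting variable `j` and split point `s` is scored by the loss it actually produces; the function name `best_split` and the toy data are hypothetical.

```python
def best_split(X, y):
    """Exhaustively search (j, s) for the split with the lowest squared error.

    Rows with X[row][j] <= s go left and the rest go right. Each side is
    predicted by its mean, which minimizes the squared error on that side.
    """
    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    for j in range(len(X[0])):
        for s in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            if not left or not right:
                continue  # a split must leave data on both sides
            cost = sse(left) + sse(right)  # loss evaluated after the split
            if best is None or cost < best[0]:
                best = (cost, j, s)
    return best


# Toy data: feature 0 separates the small targets from the large ones.
X = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
y = [1.0, 1.2, 3.9, 4.1]
cost, j, s = best_split(X, y)
print(f"best split: feature {j} at {s} with squared error {cost:.4f}")
```

Every candidate is evaluated on the resulting loss before it is chosen, so nothing about the eventual pairing of `j` and `s` is hidden from the search. A simultaneous gradient step has no analogous look-ahead.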
In particular, due to the partial derivative, expects the weight vector to move from to , but in reality it moves from to . The movement of the other parameters can be seen as a perturbation to the update made by . Therefore, from the perspective of , its loss at time is:
where models the effect on the loss due to other weights changing and is the value of the loss when the optimizer chooses assuming all other weights remain at their time values. We would expect that a larger would produce a larger perturbation, and could increase the magnitude of . Although larger models tend to have closer distance to initialization, so that could be smaller, larger models have more weights and more possible activation patterns, which could still cause the loss perturbation to be large. Unlike a Gaussian perturbation, is driven by the data, so it is not unreasonable to expect to be able to withstand it.
Penalizing weight changes that were not robust to other parameters in the network being simultaneously perturbed could qualitatively bias the network towards flatter minima, which reflect weight settings which are not too sensitive to perturbation.
2.3 Expected behavior on experiments
We run our experiments with SGD, not GD, so that we may observe the interaction of the penalty with the stochasticity introduced by SGD. Notice that the penalty described can be taken on a particular batch. If a batch is used to update , we would expect each weight to successfully make progress on if only were to update. Therefore, would be high. However, we would also expect that because has access to the other weights along with the particular activations produced by the data , that the weight change may have a larger penalty on than on other batches. For a batch, , that updated long ago, we would expect , but we would also expect the penalty on to be smaller, as is less likely to be very overfitted on a batch that updated long ago. We expect that recently updating batches, , may display an intermediate behavior.
2.4 Taylor series on the weights
We will use the Taylor series so that we find a way to experimentally measure the penalty. We will be interested in the behavior of different batches, as well as different capacity models. We will use and interchangeably. We consider the effect that moving has on the loss of another batch, :
where we have made the higher order terms.
We use to approximate Equation 5 so that the penalty can be written
In Section 2.2 we discussed the penalty in the case of GD where the entire training data is shown in every round. However, as shown in previous work, selecting a minibatch introduces additional noise. The batches in SGD have different relationships to the parameter . We call the batch being used to perform the gradient update the updating batch. In Equation 10, is the updating batch. As discussed in Section 2.3, we expect , and . We will discuss this effect in more detail in Section 3.
We are also interested in the behavior of larger versus smaller capacity models. We would expect that, since larger models generalize better than smaller ones, even if is larger for larger models, or is also larger for larger models, providing a regularization effect. We will discuss this more in Section 4.
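The split between first-order and higher-order effects can be measured directly. The sketch below assumes the standard first-order expansion of the loss on a batch $B'$ after an SGD step on the updating batch $B$, namely $L_{B'}(w_{t+1}) - L_{B'}(w_t) \approx -\eta\,\nabla L_{B'}(w_t)^\top \nabla L_B(w_t)$, and treats whatever is left over as the higher-order remainder. The logistic model, the batches, and all names are hypothetical and serve only to make the bookkeeping concrete.

```python
import math


def loss(w, batch):
    """Mean logistic loss of a linear model on (x, y) pairs with y in {0, 1}."""
    total = 0.0
    for x, y in batch:
        z = sum(wi * xi for wi, xi in zip(w, x))
        total += math.log(1.0 + math.exp(z)) - y * z
    return total / len(batch)


def grad(w, batch):
    """Gradient of the mean logistic loss with respect to w."""
    g = [0.0] * len(w)
    for x, y in batch:
        z = sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for i, xi in enumerate(x):
            g[i] += (p - y) * xi / len(batch)
    return g


def first_order_and_remainder(w, updating, other, lr):
    """Effect on `other` of one SGD step computed on `updating`.

    Returns (first-order term, higher-order remainder). Their sum is the
    actual change in the loss on `other`.
    """
    g_up = grad(w, updating)
    g_other = grad(w, other)
    w_next = [wi - lr * gi for wi, gi in zip(w, g_up)]
    first_order = -lr * sum(a * b for a, b in zip(g_other, g_up))
    actual = loss(w_next, other) - loss(w, other)
    return first_order, actual - first_order


# Hypothetical toy batches of (features, label) pairs.
batch_a = [([1.0, 2.0], 1), ([2.0, -1.0], 0), ([0.5, 1.5], 1)]
batch_b = [([1.5, 0.5], 0), ([-1.0, 2.0], 1), ([2.0, 2.0], 1)]
w = [0.2, -0.4]
for name, other in [("updating batch", batch_a), ("other batch", batch_b)]:
    fo, rem = first_order_and_remainder(w, batch_a, other, lr=0.5)
    print(f"{name}: first-order = {fo:.4f}, remainder = {rem:.4f}")
```

On the updating batch the first-order term is minus the learning rate times the squared gradient norm, so it is never positive. Because the logistic loss is convex, the remainder in this toy case is never negative, so it acts against the first-order gain. For the non-convex losses of neural networks the sign of the remainder is not guaranteed.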
3 Plotting the penalty term and dot product over different batches
Notation Generically, we will use to refer to the updating batch to refer to a recently updating batch and to a long ago updating batch. We will also use , and
Gist: In this section, we will show that the updating batch 1) is able to claim a larger reward from Equation 5 than other batches, but 2) also experiences a larger penalty for doing so. We conclude that the penalty penalizes the updating batch, which seems most at risk for being overfitted in a particular round.
Figure 2 shows vs in Column 1 and 3 and and versus in Columns 2 and 4 for a fully connected two layer 8000 node network on MNIST in the left two columns and a Resnet 18 on CIFAR 10 in the right two columns. Consistently with what we would expect, we find , is higher than and .
Next, we examine the penalty term. Figure 1 depicts in Row 1, from left to right, versus for FC 8000 on MNIST followed by versus for FC 8000 on MNIST followed by versus for Alexnet on CIFAR 10 followed by versus for Alexnet on CIFAR 10. In Row 2 it depicts versus for Resnet 18 on CIFAR 10 followed by versus for Resnet 18 on CIFAR 10. As expected from Section 2.3, we see that in all cases.
From this, we conclude that the updating batch is able to make more progress on itself because of its success in the first order, but it also incurs a large penalty because the weights do not work as well together as they do individually.
Therefore, even if the updating batch can cause the weights to overfit on its data in a first order sense, it has an increased higher order penalty that penalizes weight changes that may not generalize well.
4 Comparing different model dot product and reduction in loss
Gist: We wish to compare how larger versus smaller capacity models behave in terms of the penalty. We expect that larger models are more heavily regularized, even if they are able to claim a larger reward from Equation 5.
We plot the reduction in loss so far on the x-axis in order to compare the models at similar stages in training. We run an experiment on MNIST using for a fully connected two-layer network with (green), (orange), and (blue) nodes respectively. The results are shown in Figure 3.
Figure 3 shows the results of plotting in Row 1, followed by followed by . In Row 2, followed by followed by . And in Row 3 followed by followed by .
First we notice that , and , and are higher for larger models as can be seen in Column 1 (we will actually find that for more complex datasets they are higher for a majority of training, but become lower at a later point in training). We interpret this to mean that larger models can fit more in a first-order sense, i.e. they are more able to find weights that would reduce the loss if implemented individually, and would therefore be able to increase the reward given by Equation 5. However, we notice that for the recently updating batches, larger models experience a higher dot product, but also a larger increase in loss, and therefore a higher magnitude penalty. Larger models also experience a larger penalty over training, as can be seen in Column 3 (again, we will find that for more complex datasets this stops being true late in training).
We also show the analogue of Figure 3 for Alexnet (blue) and Alexnet with only one fully connected layer of size 256 (orange) on CIFAR 10 in Figure 4. We use vanilla SGD with no momentum, batch normalization, or dropout, and a constant learning rate of 0.01. We see in Figure 4 that the larger model (Alexnet) has a larger magnitude penalty and dot product until the models reach a training loss of about 1.0. We notice that this tracks the time at which the models begin overfitting in the test loss (see Appendix Figure 6).
We also show the analogue of Figure 3 for Resnet 18 (orange) versus a wide Resnet 50 (blue) on CIFAR 10 in Figure 5. We train a Resnet 18 model and a wide Resnet 50 model with width factor 2. We use a learning rate of 0.01 and a batch size of 150. We plot on the x-axis the fractional reduction in loss. We observe similar results to the Alexnet case.
We identify a property of using gradient-based optimizers, namely that they update all the scalar weights at the same time instead of updating them individually. We find that this introduces uncertainty into the optimization, as each scalar weight knows the values of the other weights at time $t$ but is then evaluated at time $t+1$. We relate this phenomenon to the Taylor series. We find that penalties are most pronounced for batches that are currently being used to update, and that penalties are higher for larger models. Examining a broader array of datasets and architectures, and studying how this phenomenon interacts with other regularizers such as batch normalization and skip connections, is an interesting investigation we leave to future work.
- An (1996). The effects of adding noise during backpropagation training on a generalization performance. Neural Computation 8 (3), pp. 643–674.
- Arpit et al. (2017). A closer look at memorization in deep networks. Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 233–242.
- Blundell et al. (2015). Weight uncertainty in neural networks. JMLR.
- Dinh et al. (2017). Sharp minima can generalize for deep nets. In International Conference on Machine Learning, pp. 1019–1028.
- Dziugaite and Roy (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. UAI.
- Friedman et al. (2001). The elements of statistical learning. Vol. 1, Springer series in statistics New York.
- He and Su (2020). The local elasticity of neural networks. ICLR.
- Hoffer et al. (2017). Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pp. 1731–1741.
- Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. JMLR.
- Jastrzebski et al. (2017). Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623.
- Kalimeris et al. (2019). SGD on neural networks learns functions of increasing complexity. In Advances in Neural Information Processing Systems, pp. 3491–3501.
- Keskar et al. (2017). On large-batch training for deep learning: generalization gap and sharp minima. ICLR.
- Krogh and Hertz (1992). A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pp. 950–957.
- Mandt et al. (2016). A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pp. 354–363.
- Neyshabur et al. (2019). The role of over-parametrization in generalization of neural networks. ICLR.
- Neyshabur et al. (2014). In search of the real inductive bias: on the role of implicit regularization in deep learning. Proceedings of the International Conference on Learning Representations workshop track.
- Novak et al. (2018). Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760.
- Park et al. (2019). The effect of network width on stochastic gradient descent and generalization: an empirical study. In International Conference on Machine Learning, pp. 5042–5051.
- Rahaman et al. (2019). On the spectral bias of neural networks. PMLR.
- Sankararaman et al. (2019). The impact of neural network overparameterization on gradient confusion and stochastic gradient descent. arXiv preprint arXiv:1904.06963.
- Srivastava et al. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
- Wager et al. (2013). Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, pp. 351–359.
- Zhang et al. (2016). Understanding deep learning requires rethinking generalization. ICLR.
- Zhu et al. (2019). The anisotropic noise in stochastic gradient descent: its behavior of escaping from sharp minima and regularization effects. In International Conference on Machine Learning, pp. 7654–7663.
Appendix A
A.1 Small Alexnet architecture
Our small Alexnet model retains the original Alexnet convolutional layers, but replaces the fully connected layers by a single one with 256 nodes. We use a batch size of 150 and a constant learning rate of 0.01 for CIFAR 10 experiments. We depict the test loss for Alexnet (blue) and small Alexnet (orange) in Figure 6. By comparing with Figure 4 we see that the crossover point, where Alexnet starts to have smaller dot product than small Alexnet, approximately tracks the point where Alexnet begins overfitting.