Generalization in over-parameterized neural networks trained using Stochastic Gradient Descent (SGD) is not well understood. Such networks typically have sufficient capacity to memorize their training set Zhang17 which naturally leads to the question: Among all the maps that are consistent with the training set, why does SGD learn one that generalizes well to the test set?
This question has spawned a lot of research in the past few years but no satisfactory answer has emerged. There have been many attempts to extend classical algorithm-independent techniques for reasoning about generalization (e.g., VC-dimension) to incorporate the “implicit bias” of SGD to get tighter bounds (by limiting the size of the hypothesis space to that reachable through SGD). Although this line of work is too large to review here, the recent paper of Nagarajan19 provides a nice overview. However, they also point out some fundamental problems with this approach (particularly, poor asymptotics), and come to the conclusion that the underlying technique itself (uniform convergence) may be inadequate. They argue instead for looking at algorithmic stability Bousquet02. While there has been work on analysing the stability of SGD Hardt16; Kuzborskij18, it does not take into account the training data. Since SGD can memorize training data with random labels, and yet generalize on real data (i.e., its generalization behavior is data-dependent Arpit17), any such analysis must lead to vacuous bounds in practical settings Zhang17. Thus, in order for a stability based argument to work, what is needed is an approach that takes into account both the algorithmic details of SGD as well as the training data.
Recently, a new approach for understanding generalization along these lines has been proposed Chatterjee20. Called the Coherent Gradients Hypothesis (CGH), the approach is motivated by random forests which also display data-dependent generalization. For example, forests can easily fit training data with random labels, and yet, generalize well when trained on real data. However, generalization in forests is not as much of a mystery since it is understood that tree construction algorithms attempt to extract commonality from the training examples by grouping similar examples together. When they find such commonality the resultant model generalizes, and when they do not, the model simply memorizes the training data and fails to generalize.
CGH postulates that neural networks trained with SGD also extract commonality from training examples. The key observation in CGH is that descent directions that are common to multiple examples (i.e., similar) add up in the overall gradient (i.e., reinforce each other) whereas directions that are idiosyncratic to particular examples fail to add up. Thus, the biggest changes to the network parameters are those that benefit multiple examples.
In other words, certain directions in the tangent space of the loss function are “strong” gradient directions supported by multiple examples whereas other directions are “weak” directions supported by only a few examples. Intuitively (and CGH is only a qualitative theory at this point), strong directions are stable (i.e., altered marginally by the removal of a single example) whereas weak directions are unstable (they could disappear entirely if the example supporting them is removed). Thus a change to the parameters along a strong direction should generalize better than one along a weak direction. Since the overall gradient is the mean of per-example gradients, if strong directions exist, the overall gradient has large components along them, and thus the parameter updates are biased towards stability.
Now, whether a strong direction exists or not depends entirely on the parameters of the network and the dataset. For example, the per-example gradients could be pairwise orthogonal. This would correspond to perfect memorization. On the other hand, they could all point in the same direction which would correspond to perfect generalization.
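As a small numerical illustration of these two extremes, the sketch below averages four pairwise orthogonal unit gradients and four identical unit gradients and compares the norms of the resulting mean gradients. The gradients are made-up toy vectors, and the helper names `mean_gradient` and `norm` are hypothetical.

```python
import math

def mean_gradient(per_example_grads):
    """Average a list of equal-length gradient vectors coordinate-wise."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[k] for g in per_example_grads) / n for k in range(dim)]

def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

n = 4
# Case 1: pairwise orthogonal unit gradients (every example has its own direction).
orthogonal = [[1.0 if k == i else 0.0 for k in range(n)] for i in range(n)]
# Case 2: identical unit gradients (every example agrees on the direction).
aligned = [[1.0, 0.0, 0.0, 0.0] for _ in range(n)]

orthogonal_norm = norm(mean_gradient(orthogonal))  # shrinks like 1/sqrt(n)
aligned_norm = norm(mean_gradient(aligned))        # stays at 1
```

With orthogonal gradients the mean gradient has norm 1/sqrt(n), so each step does little for any one example; with aligned gradients the norm stays at 1 no matter how many examples are averaged.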
Since CGH is a causal explanation for generalization, Chatterjee20 performed a couple of causal interventions to test the theory. While they found good agreement between the qualitative predictions of the theory and their experiments, an important limitation of their work is that their experiments were on shallow (1 and 3 hidden layers) fully connected networks trained on mnist using SGD with a fixed learning rate.
In this work, we present new evidence for CGH by reproducing and scaling up previous studies as well as through new experiments and observations. Our main contributions are:
We significantly expand the scope of the original study to include large, practically relevant architectures such as ResNet, Inception, and VGG; more complex datasets such as cifar-10 and ImageNet; and realistic training protocols such as SGD with momentum and variable learning rates (Section 3 and Section 4). We also collect additional statistics not reported in the original study that provide greater insight into CGH.
An intervention experiment in the original study (winsorization) required per-example gradients on a large batch. By eliminating outlier examples on a per-coordinate basis, they were able to greatly reduce over-fitting. Since that would be impractically slow for large networks on ImageNet, we propose a significant modification that runs as fast as regular training but achieves a similar effect (Section 4.2). Although our present interest in this modification is as a proxy for winsorization that scales up to ImageNet, we believe it has practical significance for deep learning beyond understanding memorization and generalization.
CGH presents a new perspective on “easy” and “hard” examples that suggests a new test of the theory (i.e., one that was not considered in the original study) and one that is not based on adding noise to the training labels. We perform that test with ImageNet and find good agreement thus providing further evidence for CGH (Section 5).
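Let $x \in \mathcal{X}$ be the inputs to a neural net $f$ with some weights $w \in \mathcal{W}$: we want the neural net to learn to predict a target $y \in \mathcal{Y}$ which may be discrete or continuous. We will do so by minimizing the loss function $\mathbb{E}_{(x,y) \sim \mathcal{D}}[\ell(y, f(x; w))]$ where $(x, y)$ is drawn from the data distribution $\mathcal{D}$, and $\ell$ is a per-sample loss function. Thus, we want to solve the optimization problem
$$w^* = \arg\min_{w \in \mathcal{W}} \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[\ell(y, f(x; w))\right].$$
If the true data distribution, $\mathcal{D}$, is not known (as is the case in practice), the expected loss is replaced with an empirical loss. Given a set of $n$ training samples $\{(x_i, y_i)\}_{i=1}^{n}$, let $\ell_i(w) := \ell(y_i, f(x_i; w))$ be the loss for a particular sample $x_i$. Then the problem we want to solve is
$$w^* = \arg\min_{w \in \mathcal{W}} L(w), \qquad L(w) := \frac{1}{n} \sum_{i=1}^{n} \ell_i(w).$$
Consider a regularized first order approximation of $L$ around the point $w_t$:
$$G(w) := L(w_t) + \langle g_t, w - w_t \rangle + \frac{1}{2\eta} \|w - w_t\|^2, \qquad g_t := \nabla L(w_t).$$
Minimizing $G$ leads to the familiar rule for gradient descent, $w_{t+1} = w_t - \eta g_t$. Now, if we assume that the learning rate $\eta$ is small, then the change in loss after a single step of gradient descent can be written as
$$L(w_{t+1}) - L(w_t) \approx \langle g_t, w_{t+1} - w_t \rangle = -\eta \|g_t\|^2.$$
This implies that the rate of change of loss is captured by the squared norm of the mean gradient. If the overall gradient is large, the loss drops quickly; otherwise it drops slowly. Note that the squared norm of the mean gradient is the same as the mean of the dot products of all pairs of example gradients:
$$\|g_t\|^2 = \left\langle \frac{1}{n} \sum_{i=1}^{n} g_{t,i}, \frac{1}{n} \sum_{j=1}^{n} g_{t,j} \right\rangle = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \langle g_{t,i}, g_{t,j} \rangle,$$
where $g_{t,i} := \nabla \ell_i(w_t)$. If, with the current weights $w_t$, all the example gradients are well aligned, then we expect the gradients to add up and the mean gradient to have a higher norm. However, if the individual example gradients prefer distinctive directions, then the norm of the mean gradient will be smaller and hence the loss will presumably drop less. Similar measures have been used by other researchers to explain training speed Sankararaman19 and have been observed to correlate with generalization Fort19. In what follows, we use the derivative of the loss curve as a proxy to measure coherence in the gradients.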
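The identity between the squared norm of the mean gradient and the mean of the pairwise dot products (taken over all ordered pairs, including each gradient with itself) can be checked numerically. The sketch below does so on random stand-in gradients; the vectors are synthetic and the function names are hypothetical.

```python
import random

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def squared_norm_of_mean(grads):
    """Squared Euclidean norm of the coordinate-wise mean of the gradients."""
    n, dim = len(grads), len(grads[0])
    mean = [sum(g[k] for g in grads) / n for k in range(dim)]
    return dot(mean, mean)

def mean_pairwise_dot(grads):
    """Average of <g_i, g_j> over all n*n ordered pairs (i, j)."""
    n = len(grads)
    return sum(dot(gi, gj) for gi in grads for gj in grads) / (n * n)

rng = random.Random(0)
# Eight synthetic per-example gradients in five dimensions.
grads = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(8)]

lhs = squared_norm_of_mean(grads)
rhs = mean_pairwise_dot(grads)
# lhs and rhs agree up to floating-point rounding.
```

The double loop costs quadratic time in the number of examples, so in practice the left-hand side (one mean, one dot product) is the cheaper way to obtain the same quantity.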
Minibatch SGD is a stochastic version of gradient descent Robbins1951Stochastic, where one computes the average gradient of the loss over a small set of examples chosen i.i.d., and takes a step in the direction of the negative gradient. The argument presented earlier carries over in a straightforward way to the stochastic setting.
Figure 5: We keep a fixed sample of 50k examples from the pristine and corrupt sets to measure our accuracy and loss. Panel (a) shows the training and test accuracy with various levels of noise, while (b) shows the corresponding training loss. Note that early in training, the magnitude of the derivative of the training loss decreases with more label noise. Panels (c) and (d) show the accuracy and loss plots on the held-out pristine and corrupt training samples. The plots confirm our prediction that with increasing label noise, we expect a slower learning of pristine examples. (The jumps at 30 and 60 epochs are due to reductions in learning rate as per the usual learning rate schedule for ResNet-18.)
Huber’s $\epsilon$-corruption model. Modern data sets that arise in various branches of science and engineering are characterized by their ever increasing scale and richness. However, these large and rich data sets are usually not carefully curated, are often collected in a decentralized, distributed fashion, and consequently are plagued with the complexities of heterogeneity, adversarial manipulations, and outliers. A suitable theoretical model to consider when reasoning with such data is Huber’s $\epsilon$-corruption model Huber81, where the sampling distribution is modeled as a well-behaved distribution contaminated by an $\epsilon$-fraction of arbitrary outliers. In this setting, instead of observing samples directly from the true distribution $\mathcal{D}$, we observe samples drawn from $\mathcal{D}_\epsilon$, which for an arbitrary distribution $Q$ is defined as a mixture model,
$$\mathcal{D}_\epsilon = (1 - \epsilon)\,\mathcal{D} + \epsilon\, Q.$$
In order to reason about and validate CGH, we will create experiments where we will manipulate the data distribution under this model.
3 Reducing Similarity on ImageNet
Since CGH proposes a mechanism for commonality extraction from similar examples, one test of CGH is to study how dataset similarity impacts training. Directly studying similarity is difficult since which examples are considered similar may change during training. Recall that two examples are similar if their gradients are similar. To get around this, Chatterjee20 proposed adding label noise to a dataset based on the intuition that no matter what the notion of similarity, adding label noise is likely to decrease it. In this section, we scale up their study to ImageNet and augment it by tracking new metrics that provide additional insight.
Setup. In accordance with Huber’s $\epsilon$-corruption model, we randomize 25%, 50%, 75%, and 100% of the training labels to get 5 variants of ImageNet (including the original training set which we say has 0% label noise) and train ResNet-18, Inception-V3, and VGG-13 on these variants. We use the 50k examples in the validation set as our test set, and no noise is added to those labels. Those training examples whose labels are unaffected by randomization are called pristine and the rest are corrupt. Note that even in the 100% label noise case, we expect 1 in 1000 examples to be pristine due to chance. (Although it is easy to randomize to avoid this, pristine examples in the 100% noise case provide a natural sanity check for our experiments, since in that case they are no different than the corrupt examples and, therefore, should behave identically.)
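One plausible way to build such variants is sketched below: a chosen fraction of examples is selected and each selected label is redrawn uniformly over the classes, so a selected example can keep its label by chance and remain pristine. The exact randomization procedure used in the experiments is not spelled out in the text, so the function `randomize_labels`, its parameters, and the toy data are assumptions for illustration only.

```python
import random

def randomize_labels(labels, noise_fraction, num_classes, seed=0):
    """Redraw the labels of a random `noise_fraction` of the examples uniformly.

    Returns (new_labels, pristine), where pristine[i] is True when example i
    still carries its original label (possibly by chance).
    """
    rng = random.Random(seed)
    n = len(labels)
    selected = set(rng.sample(range(n), round(noise_fraction * n)))
    new_labels = [rng.randrange(num_classes) if i in selected else y
                  for i, y in enumerate(labels)]
    pristine = [new == old for new, old in zip(new_labels, labels)]
    return new_labels, pristine

# Toy usage: 1000 examples over 10 classes with 50% label noise.
labels = [i % 10 for i in range(1000)]
noisy_labels, pristine = randomize_labels(labels, 0.5, num_classes=10)
num_pristine = sum(pristine)  # at least 500, plus chance agreements
```

With 1000 classes, as in ImageNet, roughly 1 in 1000 of the redrawn labels would match the original by chance under this scheme, which is the effect noted above for the 100% noise case.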
By “memorization” we mean the ability to learn corrupt examples. To make it easier to observe memorization in a reasonable amount of training time, we turn off weight decay and random augmentation. We otherwise follow normal training and testing protocols for the models we consider (e.g. use momentum, usual learning rate schedules, batch normalization, etc.), and verify that we get expected state-of-the-art accuracy with random augmentation and weight decay turned on.
Predictions. Noise is added to the training labels to control the similarity between examples. Although the notion of similarity changes in the course of training, in general, we expect variants with more label noise to have less similarity between examples. Thus, with more noise, we expect per-example gradients to be more distinctive, and the overall gradient to be more diffuse, i.e., have a smaller norm.
Now, within the 25%, 50%, and 75% variants, we expect pristine examples as a group to be more similar to each other than the corrupt examples. Thus, early in training, we expect the gradients of the pristine examples to add up and dominate the overall gradient. In contrast, corrupt examples being more idiosyncratic have a smaller presence in the overall gradient. Consequently, we expect pristine examples to be learned faster than corrupt examples. Furthermore, with more label noise, we have fewer pristine examples, and thus a smaller pristine contribution to the overall gradient. Thus, with increasing label noise, we expect a slower learning of pristine examples.
Experimental Results. Figure 5 shows the training and test curves (top-1 accuracy and loss) for the ResNet-18 training for various amounts of label noise. In addition, during training, we track 50k pristine and corrupt examples and show their top-1 accuracy and loss respectively (except for the 100% noise case where there are fewer than 50k pristine examples, as noted before).
Our first observation, from the slopes of the training losses, confirms that increasing amounts of label noise cause a decrease in the average gradient norm, indicating that noise reduces similarity between examples as expected (and it takes longer to reach a given level of training accuracy). Next, as predicted above, we note that in all cases, pristine examples are learned much faster than corrupt examples (except for 100% label noise where, as we saw before, we expect parity). Furthermore, the rate at which pristine examples are learned goes down with increasing noise, also as predicted.
Furthermore, the loss for corrupt examples (a statistic not reported in the original study) increases in early training (instead of decreasing). This could be interpreted as additional evidence of CGH: the early gradient is dominated by pristine gradients (even in the 75% noise case where only 25% of the examples are pristine) since a step in a direction that improves training accuracy on the actual dataset is likely to increase loss when an example is mislabeled.
Finally, in the presence of label noise, we see that the test accuracy reaches a maximum relatively early in training before high accuracy is achieved on the training set. This is in line with what Arpit17 observed on cifar-10 from which they concluded that “the model first learns the simple and general patterns of the real data before fitting the noise.” CGH helps shed light on what these “simple and general” patterns are, and why they are learned first: they arise from examples whose gradients are well-aligned, i.e., the pristine examples.
Figure 6 shows the expected dot product of the gradients in the minibatch over the course of training ResNet-18 on ImageNet with 0% and 50% label noise. We can see that on the clean dataset, the gradient dot products are much higher than those for the corrupted dataset early on in the training. Once most of the examples are learned the dot product drops in comparison with the noisy dataset. The plots are in agreement with the derivative of the loss curves in Figure 5.
Anti-adversarial initialization. In Figure 5(d), we see that with increasing noise, the effective learning rate on the pristine examples decreases as expected. But we also observe that the effective learning rate on the corrupt examples also decreases. We conjecture that this is because features learned from pristine examples (early on in training) help with memorizing corrupt examples. To test this, we initialized the parameters of a ResNet-18 from a model trained to 100% accuracy on original ImageNet, and trained it on 100% noisy label data. On this particular experiment, our conjecture was confirmed. It took only 62 epochs to reach 90% accuracy as opposed to 81 epochs for a model starting from a random initialization.
Benign and Malignant Overfitting. Observe that with 0% label noise the test accuracy rises and then stays flat. However, with label noise, the test accuracy reaches a maximum and then starts falling. Note that in both cases there is overfitting, but in one case the overfitting is benign (training accuracy increases but test accuracy remains flat) whereas in the other case it is malignant (test accuracy falls as training accuracy increases). This has been observed both in the original study and in Arpit17 in different settings.
What causes malignant overfitting? Limited model capacity is not a problem in this case since the network is able to memorize the whole training set (as the 100% noise curve shows). One intriguing conjecture suggested by CGH is that SGD extracts common patterns even from corrupt examples (e.g. from accidentally consistent mis-labelings). These spurious patterns may generalize, i.e., trigger even on the test set (since if they are common enough to show up in training, they may show up in test as well) and cause mis-classification. This may explain why the fall from maximum test accuracy increases with increasing noise.
Other Architectures. Finally, we note that the results for Inception-V3 and VGG-13 are very similar to those for ResNet-18 (please see Supplementary Material). Since these are very different architectures from ResNet-18, these results help build confidence in the broader validity of CGH.
4 Suppressing Weak Gradient Directions
Mean estimation is a well studied problem in robust statistics Huber81. In many real-world applications, collected data are contaminated by noise with a heavy-tailed distribution and might contain outliers of large magnitude. In this situation, it is necessary to apply methods which produce reliable outcomes even if the input contains corrupted measurements.
In the context of learning with SGD, at each step a random minibatch of examples is chosen and the mean gradient is applied to the parameters with a learning rate to update the model. If the examples chosen are such that their gradients exhibit outlying or heavy-tail behavior, we would like mean estimators that are robust to noise. We pose the problem of suppressing weak gradient directions as a problem of robust mean estimation. The median of means algorithm Minsker13 is an optimal estimation technique in the sense that deviation from the true mean is bounded above by $O(1/\sqrt{n})$ with high probability ($n$ is the number of samples). The sample mean satisfies this property only if the observations are Gaussian. The main idea of the median of means algorithm is to divide the samples into $k$ groups, compute the sample mean of each group, and then return the geometric median of these $k$ means. The geometric median of $k$ vectors $v_1, \dots, v_k \in \mathbb{R}^d$ is the vector $y^*$ such that
$$y^* = \arg\min_{y \in \mathbb{R}^d} \sum_{i=1}^{k} \|y - v_i\|_2.$$
When $d = 1$, the geometric median is just the ordinary median of scalars. However, in high dimensions the algorithm to compute the geometric median weiszfeld1937point is iterative and is difficult to integrate seamlessly into a traditional training loop. A simpler technique is to apply the median of means algorithm to each coordinate, which gives a dimension-dependent bound on the performance of the estimator.
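The coordinate-wise variant mentioned last can be sketched in a few lines. The function below splits the samples into groups, averages each group, and takes the ordinary median of the group means separately in every coordinate. The function name, the strided grouping (a stand-in for random grouping, reasonable when samples arrive in random order), and the toy data are assumptions for illustration.

```python
import statistics

def coordinatewise_median_of_means(samples, num_groups):
    """Per-coordinate median of the means of `num_groups` groups of samples."""
    if not 1 <= num_groups <= len(samples):
        raise ValueError("num_groups must be between 1 and the number of samples")
    dim = len(samples[0])
    # Strided partition: sample i goes to group i % num_groups.
    groups = [samples[g::num_groups] for g in range(num_groups)]
    group_means = [[sum(x[k] for x in grp) / len(grp) for k in range(dim)]
                   for grp in groups]
    return [statistics.median(m[k] for m in group_means) for k in range(dim)]

# Nine well-behaved samples and one gross outlier.
samples = [[1.0, 2.0]] * 9 + [[1000.0, -1000.0]]
robust = coordinatewise_median_of_means(samples, num_groups=5)
plain = [sum(x[k] for x in samples) / len(samples) for k in range(2)]
# robust == [1.0, 2.0], while the plain mean is pulled far away by the outlier.
```

The outlier contaminates only one of the five group means, so the median ignores it, whereas the plain sample mean shifts to roughly [100.9, -98.2].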
Another statistical approach to limit the influence of outliers is called winsorization. It is the transformation of statistics by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers. The typical strategy in winsorization is to set all outliers to a specified percentile of the data; for example, a 90% winsorization would set all data below the 5th percentile to the 5th percentile, and all data above the 95th percentile to the 95th percentile. This approach was used by Chatterjee20 as a way to perform stable training on shallow networks on mnist.
We implement winsorization and median of means in the context of training deep networks and show their influence on learning and generalization. We do not use momentum in the experiments in this section since it was found to lead to instability. Furthermore, as in the previous section, we turn off weight decay and random augmentation.
4.1 Winsorization on CIFAR-10
We train a ResNet-32 on cifar-10 using SGD with a batch size of 32. We use a normal learning rate schedule where the initial rate is 0.1 and is lowered by th first at 40K steps, and then every 20K steps thereafter. We train for a total of 100K steps (i.e., 64 epochs).
The amount of winsorization is controlled by a parameter $c$ and it is performed as described in Chatterjee20: First, we compute individual gradients for each example in a minibatch (this procedure is memory intensive for a large network like ResNet-32, which is why we use a smaller batch size than usual). We perform winsorization by manipulating the per-example gradients in a coordinate-wise fashion. For the $j$-th component, we collect the per-example gradients to get 32 scalars. Instead of simply summing them to get the $j$-th component of the overall gradient, we first reduce the impact of outliers by clipping each per-example scalar to be no less than the $(c+1)$-smallest and no larger than the $(c+1)$-largest of the 32 values, and then sum. Thus $c = 0$ corresponds to no clipping and thus regular SGD.
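The coordinate-wise clipping step can be sketched as follows. The sketch assumes that, for a winsorization level c, each scalar is clipped to lie between the (c+1)-th smallest and the (c+1)-th largest value in its coordinate, so that c = 0 leaves the gradients untouched. The function name and toy gradients are hypothetical, and plain Python lists stand in for the tensors a real training loop would use.

```python
def winsorized_gradient_sum(per_example_grads, c):
    """Sum per-example gradients after clipping outliers in every coordinate."""
    n = len(per_example_grads)
    if not 0 <= 2 * c < n:
        raise ValueError("c must satisfy 0 <= 2*c < number of examples")
    dim = len(per_example_grads[0])
    total = []
    for j in range(dim):
        column = [g[j] for g in per_example_grads]  # j-th component of each example
        ordered = sorted(column)
        low = ordered[c]           # (c+1)-th smallest value
        high = ordered[n - 1 - c]  # (c+1)-th largest value
        total.append(sum(min(max(v, low), high) for v in column))
    return total

# One coordinate, four examples, one of them an outlier.
toy = [[1.0], [2.0], [3.0], [100.0]]
unclipped = winsorized_gradient_sum(toy, c=0)  # [106.0], same as a plain sum
clipped = winsorized_gradient_sum(toy, c=1)    # [10.0], values clipped to [2, 3]
```

Sorting every coordinate costs O(n log n) per parameter on top of storing n per-example gradients, which is why the procedure is expensive for large models.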
Figure 7 shows the performance of winsorization. The top row of the grid shows our results on the original dataset for several values of the winsorization parameter $c$. We can see that as $c$ increases, the gap between train and test accuracy decreases. Also worth noting is that for small levels of winsorization, the model performance on test data does not change significantly. The bottom row in Figure 7 shows similar results when we add 100% label noise. At $c = 0$, we can see that the model completely memorizes the training data while generalizing very poorly. By increasing $c$ to 2, we can completely suppress overfitting as the model fails to train. This, we feel, is a strong validation for CGH.
4.2 Rolling Median-of-3 Minibatches (RM3) on ImageNet
A challenge with winsorization is the need to compute and store per-example gradients. Even in the cifar-10 example, we had to reduce the minibatch size to 32 to make training feasible. This issue is exacerbated when we deal with larger models and more complex datasets, such as ResNet-18 on ImageNet. Therefore, we implemented a per-coordinate version of the median of means algorithm (implemented in a rolling fashion) to generate our robust gradient estimates for ImageNet training. We modify the basic training loop of SGD as follows.
We keep track of the gradients computed in the previous two steps of training along with the current gradient to compute the per-coordinate median of the three. This is used as the update vector at the current step. We use a standard mini-batch of size 256 and the standard learning rate schedule but no momentum (since as in winsorization it led to instability). We compare it with vanilla SGD (i.e. without momentum). We train without data augmentation or weight decay. We run experiments for 0%, 50% and 100% label noise.
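A minimal sketch of this rolling median is given below, using plain Python lists in place of tensors. It is an illustration under stated assumptions rather than the training code itself: the class name `RollingMedianOf3`, the helper `sgd_step`, and the choice to fall back to the plain gradient during the first two steps (before three gradients are available) are all hypothetical.

```python
import statistics

class RollingMedianOf3:
    """Keeps the last two minibatch gradients and returns a per-coordinate median."""

    def __init__(self):
        self.history = []  # at most the two most recent gradients

    def update_direction(self, grad):
        """Return the update vector for the current minibatch gradient."""
        window = self.history + [list(grad)]
        self.history = window[-2:]
        if len(window) < 3:
            # Assumption: use the plain gradient until three are available.
            return list(grad)
        # Median of the three gradients, taken separately in each coordinate.
        return [statistics.median(column) for column in zip(*window)]

def sgd_step(weights, direction, lr):
    """Plain SGD update without momentum."""
    return [w - lr * d for w, d in zip(weights, direction)]

rm3 = RollingMedianOf3()
minibatch_grads = [[1.0, 0.0], [2.0, 5.0], [3.0, -4.0], [10.0, 1.0]]
directions = [rm3.update_direction(g) for g in minibatch_grads]
# directions[2] == [2.0, 0.0] and directions[3] == [3.0, 1.0]
weights = sgd_step([0.0, 0.0], directions[2], lr=0.1)
```

Only two extra gradient-sized buffers are needed, and no per-example gradients are computed, which is what keeps the cost close to that of regular training.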
The results of our experiments are shown in Figure 11. Each plot shows the training and test performance of regular SGD and our Rolling Median-of-3 Minibatches (RM3) modification. As we can see, RM3 significantly reduces the generalization gap while maintaining model performance. Furthermore, as Figure 12 shows, RM3 greatly reduces memorization which provides further evidence for CGH.
Remark. Since Chatterjee20 only trained with mnist which has low generalization gap, the effect of suppressing weak directions only manifested with label noise. Our results on cifar-10 and ImageNet are stronger because even the 0% case has significant overfitting which can be reduced by suppressing weak directions.
5 Easy and Hard Examples
Background. Arpit17 conducted a detailed study of memorization in shallow fully connected networks and small AlexNet-style convolutional networks on mnist and cifar-10. One of their main findings is that for real data sets, starting from different random initializations, many examples are consistently classified correctly or incorrectly after one epoch of training, which is in contrast to what happens with noisy data. They call these easy or hard examples respectively. They conjecture that this variability of difficulty in real data “is because the easier examples are explained by some simple patterns, which are reliably learned within the first epoch of training.”
We believe that CGH provides an explanation of this phenomenon. But rather than say easy examples are explained by “simple” patterns (which leads to the question of what makes a pattern simple), CGH would posit that easy examples are those that have a lot in common with other examples (where commonality is measured by the dot product of the gradients during training). With this postulate it is easy to see why an easy example is learned sooner reliably: most gradient steps benefit it.
Note that this is a more nuanced phenomenon than claimed in Arpit17. The dynamics of training (including initialization) can determine the ease or hardness of examples. In particular, it may explain the results on adversarial initialization Liu19 (where examples that are easy to learn with random initialization become significantly harder) and our own experience with anti-adversarial initialization (Section 3) since in both these cases the dataset remains the same (and thus the patterns remain the same).
If our hypothesis is true, the easy examples as a group have more in common with each other than the hard examples. Therefore, we would expect the gradients for the easy examples to be stronger than those for hard examples, and thus the easy examples to generalize better to other easy examples than we would expect hard examples to generalize to other hard examples. To test the predictions as an indirect validation of our hypothesis, we ran the following experimental study.
Experimental Study. We trained a ResNet-18 on ImageNet to 50% training accuracy using the normal training protocol. As per the discussion above, we call the examples that have been learned easy and the rest hard. From the easy examples, we pick 500K examples and 100K examples at random (without replacement) to create a training set (e-train) and a test set (e-test). Likewise, from the hard examples we create a training set (h-train) and a test set (h-test).
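The construction of the four sets can be sketched as below, given a per-example flag saying whether the partially trained model already classifies that example correctly. The function name, the dictionary keys, and the toy sizes are assumptions for illustration; the study itself uses 500K training and 100K test examples per group.

```python
import random

def split_easy_hard(is_learned, train_size, test_size, seed=0):
    """Draw disjoint train/test index sets from the easy and the hard pools."""
    rng = random.Random(seed)
    easy = [i for i, learned in enumerate(is_learned) if learned]
    hard = [i for i, learned in enumerate(is_learned) if not learned]

    def draw(pool):
        # Sampling without replacement keeps the train and test sets disjoint.
        picked = rng.sample(pool, train_size + test_size)
        return picked[:train_size], picked[train_size:]

    e_train, e_test = draw(easy)
    h_train, h_test = draw(hard)
    return {"e-train": e_train, "e-test": e_test,
            "h-train": h_train, "h-test": h_test}

# Toy usage: 20 examples, the even-indexed ones count as learned (easy).
flags = [i % 2 == 0 for i in range(20)]
splits = split_easy_hard(flags, train_size=5, test_size=2)
```

Drawing each train and test set from the same pool is what ensures that a model trained on e-train (or h-train) is evaluated without a train-test distribution mismatch.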
We then train two new ResNet-18s (e-model and h-model) on e-train and h-train respectively using the same training procedure as before. These are then evaluated on e-test and h-test respectively (which ensures that there is no test-train distribution mismatch). The results are shown in Figure 16. First, we verify that even in this setup of separate training, the easy examples (e-train) are learned faster than the hard examples (h-train). The slopes of the corresponding losses early in training confirm that the gradients of the easy examples are more coherent than those of the hard examples. For completeness, we also show the performance of the two models on the original ImageNet test set in Figure 16(c). Finally, as predicted, at the end of training, we see that the generalization gap for e-model is significantly smaller than that of h-model.
Discussion. Does SGD generalize well because it explores functions of increasing complexity during training?
One intuitive explanation of generalization is that SGD somehow explores candidate hypotheses of increasing “complexity” during training, thus finding the simplest hypothesis that explains the data.
While there is some evidence backing this view Arpit17; Nakkiran19, and this is not incompatible with CGH, one aspect of our experiment suggests that this may not be the best way to look at the situation. From this viewpoint, one might think of the examples far away from the decision boundary as easy (since they can be separated by simpler hypotheses explored early on) and ones closer as hard (since they need more complex hypotheses to be separated). The decision boundary learned from the easy examples, one might guess, generalizes poorly to the hard examples, and that is what we observe (e-model leads to 17% accuracy on h-test). But, this would also suggest that what we learn from the hard examples (provided there are sufficiently many of them, which is true for us since they are 50% of the total training set) should generalize well to the easy test set. But this is not what we find. We find h-model has 44% accuracy on e-test which is much lower than the 85% e-model accuracy on e-test. In other words, the examples learned late by SGD by themselves do not define the decision boundary.
Remark. Finally, we note that easy and hard examples provide a fundamentally different way of testing CGH than the label noise techniques used so far (in the original study and in this paper) since they do not force memorization.
Coherent Gradients provides an approach to understanding the data-dependent generalization observed in neural networks trained with SGD. In this paper, we have presented new evidence for CGH both through reproducing and scaling up previous work, and through new methods and experiments.
If CGH is true, our experiments with naturally occurring easy and hard examples suggest a subtle yet important shift in perspective: Generalization happens not because “simple” patterns are prioritized by SGD, but because common patterns are found first. (Of course, since common patterns are usually also simple, prevailing intuitions are not incorrect.)
Furthermore, as we see in both the label noise experiments and the hardness experiments, when per-example gradients are well aligned, the network not only learns quicker, but also generalizes better. Thus CGH suggests a rough rule of thumb for the practitioner: all else being equal (e.g., architecture, optimizer, learning rate, initialization), faster training on a dataset likely leads to better generalization.
In terms of future work, it would be interesting to test CGH in non-image settings such as language models, reinforcement learning, etc. Furthermore, the simplicity and low computational overhead of RM3, and its ability to reduce overfitting suggests that further research in this direction could lead to practically useful stable training algorithms for deep learning perhaps with generalization guarantees.
We thank Michele Covell, Sergey Ioffe, Rahul Sukthankar and Ying Xiao for valuable advice and feedback.
Figures 21 and 26 show the results of the label noise experiments on Inception-V3 and VGG-13 respectively. The results are essentially the same as those for ResNet-18, which were analysed in Section 3.