When and how epochwise double descent happens

Deep neural networks are known to exhibit a ‘double descent’ behavior as the number of parameters increases. Recently, it has also been shown that an ‘epochwise double descent’ effect exists in which the generalization error initially drops, then rises, and finally drops again with increasing training time. This presents a practical problem in that the amount of time required for training is long, and early stopping based on validation performance may result in suboptimal generalization. In this work we develop an analytically tractable model of epochwise double descent that allows us to characterise theoretically when this effect is likely to occur. This model is based on the hypothesis that the training data contains features that are slow to learn but informative. We then show experimentally that deep neural networks behave similarly to our theoretical model. Our findings indicate that epochwise double descent requires a critical amount of noise to occur, but above a second critical noise level early stopping remains effective. Using insights from theory, we give two methods by which epochwise double descent can be removed: one that removes slow to learn features from the input and reduces generalization performance, and another that instead modifies the training dynamics and matches or exceeds the generalization performance of standard training. Taken together, our results suggest a new picture of how epochwise double descent emerges from the interplay between the dynamics of training and noise in the training data.

1 Introduction

Training deep neural networks (DNNs) is typically a time consuming and computationally expensive process owing to the large number of parameters, non-linearity, and the resulting use of slow to converge first order optimization algorithms (SGD and its variants) ruder2016overview ; du2018gradient . Fortunately, good solutions are often obtained well before actual minimization of the objective function occurs, found either by early stopping (ES) using a validation data set, or by training until diminishing returns in generalization performance are no longer worth additional training bengio2012practical . It is therefore of practical interest to determine when these strategies may be employed without a loss in generalization performance.

In practice the number of training iterations is often estimated by monitoring generalization performance on validation data under the assumption that if the generalization performance no longer improves, or improves at a very slow rate, then further training is unlikely to be worthwhile. However, it has been shown recently nakkiran2019deep that this is not always the case, owing to a phenomenon called ‘Epochwise Double Descent’ (EDD) in which the generalization performance initially improves, then begins to decrease before reversing course and improving again. This behavior mirrors the more familiar double descent effect belkin2019reconciling seen in model complexity. If the generalization performance follows the EDD behavior, simple heuristics for determining the end of training may give sub-optimal results. For example, a common early stopping criterion is to end training if the generalization performance does not improve for a set (typically fairly small) number of training epochs. Such a heuristic will likely only capture the first descent and miss the second, even if training longer would be beneficial.
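The failure mode described above can be illustrated with a small sketch. The patience rule, the function name `early_stop_epoch`, and the synthetic validation curve are all illustrative assumptions rather than anything taken from the experiments in this paper; the curve is simply constructed to have a first descent, a bump, and a deeper second descent.

```python
import numpy as np

def early_stop_epoch(val_errors, patience=5):
    """Return (stop_epoch, best_epoch) for a simple patience-based rule.

    Training stops once `patience` epochs have passed without a new best
    validation error. Hypothetical helper, for illustration only.
    """
    best_err = float("inf")
    best_epoch = 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_errors) - 1, best_epoch

# Synthetic curve (an assumption): fast first descent, a bump centred at
# epoch 40, then a slow second descent that ends below the first minimum.
epochs = np.arange(200)
curve = (0.6 * np.exp(-epochs / 5.0)
         + 0.25 * np.exp(-((epochs - 40) / 15.0) ** 2)
         + 0.3 * np.exp(-epochs / 60.0)
         + 0.1)

stop_epoch, best_epoch = early_stop_epoch(curve, patience=5)
print("patience rule stops at epoch", stop_epoch)
print("error kept by early stopping:", curve[best_epoch])
print("error at the end of training:", curve[-1])
```

On a curve of this shape the rule halts shortly after the first minimum, so the checkpoint it keeps is worse than the one at the end of training. A larger patience would help here only if it exceeded the width of the bump, which is not known in advance.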

The goal of this work is to characterize the dependence of Epochwise Double Descent on the size of the model and the amount of noise in the data, and determine when this phenomenon is or is not likely to occur. To do this we use a combination of experiments and theory, and extend a line of work on double descent in linear models opper1990ability ; le1991eigenvalues ; krogh1992generalization ; watkin1993statistical ; opper1995statistical ; advani2017high ; mei2019generalization to enable analysis of epochwise double descent in a theoretically tractable setting. Our main contributions can be summarized as follows:

1. We introduce a solvable model that exhibits Epochwise Double Descent and find its phase diagram, showing when it occurs and when early stopping gives optimal generalization.

2. In agreement with our theory, we show experimentally that Epochwise Double Descent occurs only if the amount of label noise exceeds a critical threshold, and simple early stopping heuristics fail only for intermediate amounts of noise.

3. Using insights gained from our analysis, we give two modifications to the training procedure that each remove Epochwise Double Descent. One comes at a price of reduced generalization, the other matches or exceeds the generalization of standard training.

On a practical level, our findings give insight into when simple heuristics may be used to determine when to stop training, and we provide a means of removing the epochwise double descent should it occur. From a more fundamental perspective, our findings have some tension with the generality of the previously proposed ‘effective model complexity’ (EMC) hypothesis nakkiran2019deep . The EMC is related to the number of training samples a given training procedure can learn, and so it increases with increasing training time in the case of DNNs trained with SGD. The EMC hypothesis holds that double descent occurs as the training time is increased and the EMC exceeds the size of the training set. In contrast, we provide a training procedure that doesn’t exhibit an epochwise double descent. Instead, we hypothesize an alternate explanation of the phenomenon: the presence of informative but slow to learn features which only become relevant late in training. We provide some evidence supporting this view by showing that removal of certain features in the input can also remove the epochwise double descent phenomenon, but this comes at a cost of reducing the generalization performance.

2 Related work

The non-monotonic behavior of generalization error as a function of model and sample size in some types of linear models has been known since at least the late 1980s vallet1989linear and was observed theoretically and experimentally in several works in the early 1990s opper1990ability ; le1991eigenvalues ; krogh1992generalization ; watkin1993statistical ; opper1995statistical that studied learning using tools from statistical physics. These works connect the generalization properties of the pseudoinverse solution for perceptron-based classifiers vallet1989linear to the eigenvalue distribution of the input covariance matrix. For i.i.d. normally distributed input data in the limit that the number of datapoints and input features are large ( and ) but the ratio is a finite constant, the eigenvalue distribution is given by the Marchenko–Pastur distribution marvcenko1967distribution . As a result, the generalization error of the pseudoinverse solution exhibits a phase transition at characterized by poor generalization in the vicinity of . The generalization error of this solution continues to improve for increasing provided , and exhibits a U-shaped behavior for with poor generalization performance at and and better performance in between.

The peak in generalization error as the model transitions from underparameterized to overparameterized was rediscovered and phrased in the language of the bias-variance tradeoff in belkin2019reconciling , which also named the phenomenon ‘double descent’. While the connection to earlier work is unclear loog2020brief ; belkin2020reply , this work invokes the small norm bias of many common training procedures to show that the norm of the solution decreases as the amount of overparameterization increases, which is empirically associated with improvements in generalization performance belkin2018understand . Experimentally they show that double descent occurs across many types of models, including random feature models, decision tree ensembles, and neural networks, and argue that it is a general phenomenon. However, belkin2019reconciling notes that experiments involving neural networks trained with (stochastic) gradient descent are difficult due to the lack of theoretical solutions, long training times, and sensitivity to weight initialization.

Double descent in deep neural networks was experimentally investigated in nakkiran2019deep . In this setting, double descent is observed as a function of the number of parameters and the size of the dataset (see the follow-up work nakkiran2019more ), a finding compatible with earlier results opper1990ability ; le1991eigenvalues ; krogh1992generalization ; watkin1993statistical ; opper1995statistical . It is also shown that early stopping removes the double descent peak, similar to the results obtained for linear networks in advani2017high ; mei2019generalization . Intriguingly, double descent as a function of the number of SGD iterations is also observed in that the test error has a peak near when the training error reaches zero, but decreases with further training. To unify these findings, this work builds from kalimeris2019sgd and defines the ‘Effective Model Complexity’ (EMC) as the maximum number of samples, given a dataset and training procedure, that a model can fit with error . This increases with training time as well as the number of parameters in a given model. Hypothesis 1 of nakkiran2019deep proposes that double descent occurs generally as a function of the EMC (EMC hypothesis).

Particularly relevant for the case of deep neural networks, which typically operate deep into the overparameterized regime zhang2016understanding , is the part of the hypothesis in nakkiran2019deep which states (paraphrasing) that if the EMC is much larger than the size of the training dataset, any modification to the training procedure that increases the EMC results in a decrease in the test error. While supported by the experiments of belkin2019reconciling ; nakkiran2019deep in the parameter ranges considered, we note that this is different from what is known in the case of linear models, in which a minimum of the test error is achieved at a finite ratio of features to datapoints, after which increasing the number of features for a fixed amount of data results in an increasing test error.

Exploring the EMC hypothesis, heckel2020early finds that in the case of linear regression, epochwise double descent occurs as a consequence of the gradient descent training dynamics when differently scaled features are learned at different times. In this setting, the test risk decomposes into a sum of bias-variance tradeoffs which occur at different times, resulting in an epochwise double descent. This effect can be removed by taking an appropriate learning rate for each feature, which lines up the different bias-variance tradeoffs in time, an effect which sometimes can improve generalization. Additionally, heckel2020early shows analytically that similar effects occur in a two layer network, and experimentally demonstrates similar behavior in a 5-layer convolutional network.

Our work aims to bring our understanding of epochwise double descent closer to that of double descent with model complexity. Specifically, we characterize how the amount of data and noise interacts to produce an epochwise double descent given generic assumptions on the training data.

3 Setup for analysis of training dynamics

We consider training a model on samples where , and . That is, a dataset of samples from a distribution with input features and classes. For convenience, we adopt the bold notation for matrices, and group the input features into the data matrix , and similarly, the matrix given by the result of an dimensional embedding of by where each is a map . Let , and be the training target matrix and model output matrix respectively. In the case of probabilistic (hard) labels we take .

We first consider training with mean squared error (MSE) cost defined as

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{2N}\,\mathrm{Tr}\left[(w\Phi - y)^\intercal (w\Phi - y)\right] \tag{1}$$

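For some weights $w \in \mathbb{R}^{C \times F}$. Gradient descent on this cost with a learning rate of $\gamma$ gives the dynamics

$$w^{(t+1)}_{\mathrm{MSE}} = w^{(t)}_{\mathrm{MSE}} - \frac{\gamma}{N}\, w^{(t)}_{\mathrm{MSE}}\,\Phi\Phi^\intercal + \frac{\gamma}{N}\, y\,\Phi^\intercal \tag{2}$$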

The MSE dynamics permit an analytic solution (see appendix for the derivation)

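$$w^{(t)}_{\mathrm{MSE}} = w^{(\infty)}_{\mathrm{MSE}} + \left(w^{(0)}_{\mathrm{MSE}} - w^{(\infty)}_{\mathrm{MSE}}\right) U \left[I_F - \gamma\Lambda\right]^{t} U^\intercal \tag{3}$$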

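Where we have used the decomposition $\frac{1}{N}\Phi\Phi^\intercal = U\Lambda U^\intercal$. The parameters after infinite training time are given by

$$w^{(\infty)}_{\mathrm{MSE}} = y\,\Phi^\intercal(\Phi\Phi^\intercal)^{+} + w^{(0)}_{\mathrm{MSE}}\left(I_F - U\Lambda^{+}\Lambda U^\intercal\right) \tag{4}$$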

Where the superscript $+$ denotes the Moore-Penrose pseudoinverse and $I_F$ is the $F \times F$ identity.
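As a sanity check on Eqs. 2-4, the closed-form trajectory can be compared numerically against the plain gradient descent recursion. The sketch below is illustrative only: the function names, the random data, and the sizes are assumptions, and it takes $\frac{1}{N}\Phi\Phi^\intercal = U\Lambda U^\intercal$ as the decomposition. It uses `np.linalg.pinv(Phi)` in place of $\Phi^\intercal(\Phi\Phi^\intercal)^{+}$, which is the same matrix but better conditioned numerically.

```python
import numpy as np

def gd_iterate(Phi, y, w0, gamma, steps):
    """Apply the update of Eq. 2 `steps` times."""
    N = Phi.shape[1]
    w = w0.copy()
    for _ in range(steps):
        w = w - (gamma / N) * (w @ Phi @ Phi.T) + (gamma / N) * (y @ Phi.T)
    return w

def gd_closed_form(Phi, y, w0, gamma, steps):
    """Evaluate Eqs. 3-4 using the eigendecomposition of Phi Phi^T / N."""
    F, N = Phi.shape
    lam, U = np.linalg.eigh(Phi @ Phi.T / N)
    lam = np.clip(lam, 0.0, None)                    # remove tiny negative round-off
    nonzero = (lam > 1e-10 * lam.max()).astype(float)
    # U Lambda^+ Lambda U^T projects onto directions that are actually trained
    trained = U @ np.diag(nonzero) @ U.T
    # Phi^T (Phi Phi^T)^+ is the pseudoinverse of Phi
    w_inf = y @ np.linalg.pinv(Phi) + w0 @ (np.eye(F) - trained)
    decay = np.diag((1.0 - gamma * lam) ** steps)
    return w_inf + (w0 - w_inf) @ U @ decay @ U.T

rng = np.random.default_rng(0)
C, N = 3, 20                                         # illustrative sizes
for F in (8, 30):                                    # under- and overparameterized
    Phi = rng.normal(size=(F, N))
    y = rng.normal(size=(C, N))
    w0 = rng.normal(size=(C, F))
    gap = np.abs(gd_iterate(Phi, y, w0, 0.05, 200)
                 - gd_closed_form(Phi, y, w0, 0.05, 200)).max()
    print(f"F={F}: max |iterated - closed form| = {gap:.2e}")
```

The second case has more features than samples, so some eigenvalues are zero and the second term of Eq. 4 (the frozen part of the initial weights) is exercised as well.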

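In the appendix we show that in the high temperature limit the dynamics of training with softmax cross entropy permits a similar solution for the trajectory of the parameters

$$M w^{(t)}_{\mathrm{XENT}} = M w^{(\infty)}_{\mathrm{XENT}} + M\left(w^{(0)}_{\mathrm{XENT}} - w^{(\infty)}_{\mathrm{XENT}}\right) U \left[I_F - \gamma\Lambda\right]^{t} U^\intercal \tag{5}$$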

After defining , and introducing as a matrix of ones. The parameters after infinite training time are given similarly:

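$$M w^{(\infty)}_{\mathrm{XENT}} = M\left(C P_L - 1_{CN}\right)\Phi^\intercal(\Phi\Phi^\intercal)^{+} + M w^{(0)}_{\mathrm{XENT}}\left(I_F - U\Lambda^{+}\Lambda U^\intercal\right) \tag{6}$$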

This is nearly identical to the solution for MSE given in Eq. 3 if the label matrix in the MSE case is identified as . The matrix acts to subtract the mean across classes of whatever it acts on, so the cross entropy and MSE solutions are identical other than differences in this mean. This relationship allows us to gain insight into the generalization behavior of training with softmax cross entropy by analyzing the easier MSE case.

4 A simple model of epochwise double descent

In building a toy model of epochwise double descent, we note that the gradient descent training dynamics shown in Sec. 3 is such that large eigenvalues of the covariance matrix are learned first, and this has implications for observing epochwise double descent in linear models. In particular, the small eigenvalues will dominate the learning at long training times, and if test error is to decrease in this regime, the small eigenvalues must be relatively noise free. However, if the test error is to increase at intermediate training times, the larger eigenvalues must be noisy. In many analyses, such as advani2017high , the noise is typically assumed to be uniform across eigenvalues, and so epochwise double descent does not occur in these works. Here we construct a model in which the noise only has an effect on the large eigenvalues of the data covariance matrix, which is sufficient to produce an epochwise double descent effect near .

We assume a training dataset consisting of datapoints each with normally distributed features such that . These data points are labeled using a teacher matrix chosen such that . Here, $\epsilon$ is a label noise term, but we do not assume it is i.i.d. normal. Instead, we assume $\epsilon$ couples to eigenvectors with large eigenvalues only. Using the SVD , we define $z$ such that , or equivalently, , and choose $z$ according to the following:

$$[z]_{ij} \sim \begin{cases} \mathcal{N}(0, \sigma) & [\Lambda]_{jj} \geq 1 \\ 0 & [\Lambda]_{jj} \leq 1 \end{cases} \tag{7}$$

Note that we still have and , however the noise is now correlated across examples. For a further discussion on this type of noise, and how it relates to randomly labeled examples in a typical training setup, see the appendix.

The eigenvalues of the data covariance matrix follow the Marchenko-Pastur (MP) distribution marvcenko1967distribution in the limit with .

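$$p_{\mathrm{MP}}(x|\lambda) = \frac{1}{2\pi}\,\frac{\sqrt{(\lambda_{+} - x)(x - \lambda_{-})}}{\lambda x}\,\mathbb{1}_{x \in [\lambda_{-}, \lambda_{+}]} \tag{8}$$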

with . Note that there is a point mass of at if . Additionally, the discontinuity at in Eq. 7 corresponds to the mean of the MP distribution. With this distribution and the solutions in Sec. 3 we can calculate the expected generalization error (as measured by the MSE cost function) after $t$ iterations of training as well as its dependence on the amount of noise, measured by $\sigma$, and the amount of over/under parameterization as measured by $\lambda$. The underparameterized regime has and the overparameterized regime has .

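$$\begin{aligned} \left\langle \mathcal{L}_{\mathrm{test}}(t,\lambda,\sigma) \right\rangle_{\epsilon, X, w_T} = {} & \int_{1}^{\infty} \left[\left(1 + \frac{\sigma}{x}\right)(1 - \gamma x)^{2t} + \frac{\sigma}{x}\left(1 - 2(1 - \gamma x)^{t}\right)\right] p_{\mathrm{MP}}(x|\lambda)\,dx \\ & + \int_{0}^{1} \left[(1 - \gamma x)^{2t}\right] p_{\mathrm{MP}}(x|\lambda)\,dx \end{aligned} \tag{9}$$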

Where the average has been taken over the teacher matrix, the training data, and the noise.
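To get a feel for Eq. 9, the integrand can be averaged over the eigenvalues of a finite random covariance matrix instead of integrating against $p_{\mathrm{MP}}$ analytically. This is only a finite-size sketch under stated assumptions: the function name `eq9_integrand_mean`, the sizes `F` and `N`, the learning rate, and the noise levels are illustrative choices, the ratio of features to samples is assumed to play the role of $\lambda$, and the noisy term is read as $\sigma/x$.

```python
import numpy as np

def eq9_integrand_mean(eigs, t, sigma, gamma):
    """Average the integrand of Eq. 9 over a set of covariance eigenvalues."""
    x = np.clip(np.asarray(eigs, dtype=float), 0.0, None)  # drop negative round-off
    decay = (1.0 - gamma * x) ** t
    loss = decay ** 2                      # clean directions (second integral)
    noisy = x >= 1.0                       # noise couples to eigenvalues >= 1 (Eq. 7)
    r = sigma / x[noisy]
    loss[noisy] = (1.0 + r) * decay[noisy] ** 2 + r * (1.0 - 2.0 * decay[noisy])
    return float(loss.mean())

# Hypothetical finite-size setup with as many features as samples
rng = np.random.default_rng(0)
F, N = 300, 300
X = rng.normal(size=(F, N))
eigs = np.linalg.eigvalsh(X @ X.T / N)

gamma = 0.2                                # chosen so gamma * x stays below 1 here
steps = [0, 1, 3, 10, 30, 100, 300, 1000, 3000]
for sigma in (0.0, 0.5, 2.0):
    curve = [eq9_integrand_mean(eigs, t, sigma, gamma) for t in steps]
    print("sigma =", sigma, np.round(curve, 3))
```

Printing or plotting these curves over a finer grid of `t` for several noise levels and feature-to-sample ratios is one way to look for the qualitatively different behaviors discussed next. Because the eigenvalues come from one finite matrix, the values only approximate the infinite-size average.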

The dynamics given by Eq. 9 can be qualitatively different depending on the amount of training data and the amount of noise. We define four different classes of behavior depending on whether or not an epochwise double descent occurs and whether or not the minimum of the generalization cost occurs before the end of training, necessitating early stopping. We define:

1. NDD-NES: No epochwise double descent, no early stopping: Here the generalization cost decreases monotonically with increasing training time. We observe this behavior for small amounts of noise especially near critical parameterization .

2. NDD-ES: No epochwise double descent, early stopping: Here the generalization cost first decreases, then increases and plateaus above its minimum value, meaning early stopping is necessary for optimal generalization performance.

3. EDD-NES: Epochwise double descent occurs, no early stopping: Here, epochwise double descent occurs in that the generalization cost first decreases, then increases, and then decreases again. The second decrease is large enough to overcome the increase, and so early stopping gives sub-optimal generalization.

4. EDD-ES: Epochwise double descent occurs, early stopping is necessary: Here, epochwise double descent occurs in that the generalization cost first decreases, then increases, and then decreases again. The second decrease is not large enough to overcome the increase, and so early stopping is necessary.
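The four classes above can be turned into a simple rule for labelling a generalization curve. The sketch below is one possible reading, not the procedure used to produce the phase diagram: the function name `classify_regime` and the tolerance are assumptions, and a measured (noisy) curve would need smoothing before a rule like this is meaningful.

```python
import numpy as np

def classify_regime(loss, tol=1e-6):
    """Label a test-loss curve as NDD-NES, NDD-ES, EDD-NES or EDD-ES."""
    loss = np.asarray(loss, dtype=float)
    diffs = np.diff(loss)
    signs = np.sign(diffs[np.abs(diffs) > tol])    # ignore flat steps

    # Collapse consecutive equal signs, e.g. [-1, -1, 1, -1] -> [-1, 1, -1]
    runs = []
    for s in signs:
        if not runs or runs[-1] != int(s):
            runs.append(int(s))

    # Epochwise double descent: the curve goes down, then up, then down again
    edd = runs[:3] == [-1, 1, -1]
    # Early stopping matters if the minimum is reached before the final step
    es = loss.min() < loss[-1] - tol
    return ("EDD" if edd else "NDD") + "-" + ("ES" if es else "NES")

print(classify_regime([1.0, 0.6, 0.4, 0.3]))             # monotone decrease
print(classify_regime([1.0, 0.5, 0.3, 0.4, 0.5, 0.5]))   # dip, then plateau above it
print(classify_regime([1.0, 0.5, 0.6, 0.2]))             # second descent wins
print(classify_regime([1.0, 0.3, 0.6, 0.5]))             # first minimum wins
```

The two flags are deliberately computed independently, which mirrors how the four classes are defined: one asks about the shape of the curve, the other about where its minimum sits.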

Each of these types of behaviors occurs in this model, and the situation is summarized in Fig. 1. The generalization error obtained by early stopping and training to convergence is shown in Fig. 2. We observe a discontinuous drop in generalization performance when using early stopping at the onset of epochwise double descent in Fig. 2B. We identify four properties P1-P4 with practical implications:

P1. For clean datasets (small $\sigma$), early stopping is not crucial for achieving good generalization performance.

P2. Epochwise double descent only occurs near critical parameterization.

P3. Epochwise double descent requires noise above a certain threshold to occur.

P4. If the amount of noise is too large (large $\sigma$), early stopping gives the best generalization performance, and epochwise double descent can be ignored in practice.

The first of these is known to hold true in deep neural networks as well, as common practice is to train models for as long as practically feasible regardless of the possibility of overfitting. The second of these has been partially demonstrated in nakkiran2019deep which shows that networks below a critical size do not exhibit epochwise double descent.

5 EDD is only relevant within a specific noise range

To evaluate properties 1-4 extracted from the linear model of Sec. 4 we replicated the experimental setup of nakkiran2019deep and trained instances of ResNet18 he2016deep with SGD and varying widths on the CIFAR10 dataset krizhevsky2009learning . This work was implemented using PyTorch NEURIPS2019_9015 and the TorchVision ResNet18 implementation. We did not replicate the experiments on CIFAR100 krizhevsky2009learning as clean, unambiguous labels are required to investigate the generalization behavior with small amounts of label noise. CIFAR100 is known to have a much larger amount of label noise/ambiguity than CIFAR10 northcutt2021pervasive . We investigated noise fractions ranging from to random label noise, and averaged the test error across 10 random seeds. (This random seed determines the selection and relabeling of the randomly labeled examples as well as the network initialization for each combination of width and amount of label noise.)
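Random label noise of the kind described above can be sketched as follows. The helper `add_label_noise` and its arguments are hypothetical. In particular, whether a relabeled example may keep its original class by chance is an assumption here (it may), and the exact convention used in the experiments is not stated in this section.

```python
import numpy as np

def add_label_noise(labels, noise_fraction, num_classes, seed=0):
    """Relabel a random fraction of examples with uniformly drawn classes.

    Returns the noisy labels and the indices that were selected. The seed
    fixes both which examples are selected and the labels they receive.
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    n_noisy = int(round(noise_fraction * len(noisy)))
    idx = rng.choice(len(noisy), size=n_noisy, replace=False)
    noisy[idx] = rng.integers(0, num_classes, size=n_noisy)
    return noisy, idx

# Example with stand-in labels for a 10-class problem
clean = np.arange(1000) % 10
noisy, idx = add_label_noise(clean, noise_fraction=0.2, num_classes=10, seed=3)
print("examples selected for relabeling:", len(idx))
print("labels that actually changed:", int((noisy != clean).sum()))
```

Because some selected examples are redrawn to their original class, the number of labels that actually change is typically somewhat below the number selected.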

Results can be seen in Fig. 3. At , test error monotonically decreases for all network widths, and so there is no well-defined early stopping epoch (property 1). For smaller network widths ( and ) we observe little to no epochwise double descent at any noise level, a finding consistent with nakkiran2019deep (property 2). At larger widths, such as the standard ResNet18 () and larger () we see a strong epochwise double descent effect above ( noisy labels). At a small double descent peak begins to emerge, though it is similar in magnitude to the jitter in the test error due to SGD. Below , we do not observe epochwise double descent at any width. This is consistent with the phase diagram in Fig. 1A in the NDD-NES regime which suggests there is a critical noise level below which epochwise double descent does not occur (property 3).

At larger widths () and larger noise values (), we observe a strong epochwise double descent effect. However, the initial descent achieves a lower test error than the second descent, and so a simple early stopping heuristic outperforms training to convergence. This is also consistent with the EDD-ES regime in the phase diagram in Fig. 1A in which epochwise double descent occurs but can be safely ignored via early stopping (property 4). For large widths and intermediate noise levels ( for and for ) epochwise double descent occurs, and the lowest test error is achieved at the end of training. This is the regime in which early stopping gives sub-optimal performance, similar to the EDD-NES regime in Fig. 1A.

Our linear model also predicts that networks which are above a certain size will also not exhibit epochwise double descent. While this may happen for very large networks, we were not able to observe it experimentally in ResNet18. Experiments at large values of are quite computationally expensive (the number of parameters grows like ), so it is possible our networks are simply not large enough to show this effect. Data augmentation and parameter redundancy both act to reduce the number of effective parameters, so in the case of ResNet18 it is unclear how large should be to reach the correct regime. We note that this occurs in the same regime as the decrease in generalization performance with increasing overparameterization shown in linear models (advani2017high and others), which has similarly not been observed in deep neural networks.

6 Removing epochwise double descent by removing features

A central assumption of the linear model in Sec. 4 is the existence of slow to learn but informative features. In the linear setting, these features consist of eigenvectors with small eigenvalues. In the case of DNNs, it is less clear what features in the input are slow to learn owing to the nonlinearity of the training dynamics and differing architecture choices. However, very wide DNNs are known to behave approximately linearly lee2019wide in a transformed feature space determined by the network architecture and initial weights. While this transformation can change the ordering in which features are learned and introduce new nonlinearly constructed features, there may still be some correspondence between the slow to learn features in the linear case and the slow to learn features in a DNN. With this in mind, we carried out an experiment in which we discarded the smaller eigenvalue features of a dataset that shows epochwise double descent and trained a DNN on the remaining components.

Fig. 4A shows the result of training ResNet18 on CIFAR10 with and a varying number of the largest principal components, leaving the test set unmodified. (Note: hyperparameters here are different from nakkiran2019deep and were obtained with a hyperparameter sweep; see the appendix for details.) All results are averaged across 10 random seeds. When all components are kept (yellow line) we see epochwise double descent. However, as the number of components is reduced the epochwise double descent phenomenon becomes less pronounced, and vanishes when only the largest components (which account for 90% of the variance) are kept. This is consistent with the features responsible for the epochwise double descent having been removed. As the removed features are informative, this also corresponds to a decrease in generalization performance. The top 100 PCA components are shown in Fig. 4C and the modified data with 100 components in Fig. 4D. As might be expected, the larger principal components tend to represent larger scale features. Fig. 4B demonstrates that ResNet18 is still able to fit this modified training data.
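Keeping only the largest principal components of the training inputs can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than the preprocessing actually used: the helper `keep_top_components` is hypothetical, images are assumed to be flattened to rows of a matrix, and the data are centred before the decomposition.

```python
import numpy as np

def keep_top_components(X, k):
    """Project rows of X (n_samples, n_features) onto the top-k principal components.

    Returns the reconstructed data in the original feature space and the
    fraction of variance retained.
    """
    mean = X.mean(axis=0, keepdims=True)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by decreasing singular value
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    top = Vt[:k]
    X_reduced = Xc @ top.T @ top + mean
    explained = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return X_reduced, explained

# Stand-in data: 500 "images" with 64 features of decreasing scale
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 64)) * np.linspace(3.0, 0.1, 64)
X_small, frac = keep_top_components(X_train, k=16)
print("shape kept:", X_small.shape, "variance retained:", round(frac, 3))
```

The reconstructed array has the same shape as the input, so it can be reshaped back into images and fed to an unchanged network, while the test set is left as it is.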

7 Removing epochwise double descent by training the last layer

While the experiment in Sec. 6 demonstrates that epochwise double descent can be eliminated by removing certain useful features in the training data, this also causes a significant drop in generalization performance. However, in the linear model of Sec. 4 epochwise double descent appears as a property of the training dynamics as well as the training data which hints at an alternate method of removing the effect. If it were possible to skip to the end of training, the order in which features are learned becomes irrelevant and there is no possibility for an epochwise double descent to exist. In general, this is not possible for a DNN as a whole as analytic solutions don’t exist, with the exception of the final layer.

As previous works have shown, representations in early layers of DNNs typically stabilize quickly raghu2017svcca ; morcos2018insights , while the final layers are responsible for memorizing noise stephenson2021on and continue to learn during training. This leads us to suspect that training just the final layer to convergence might be sufficient to remove the epochwise double descent as this layer plays a dominant role in the later epochs of training. This can be done in the case of a network trained with softmax cross entropy using the smooth-label approximation for the weights given in Eq. 6. Our procedure is as follows: Train a DNN with SGD using the softmax cross entropy objective for epochs (standard training procedure). Then, replace the weights of the final classification layer with the ‘converged’ weights given by Eq. 6 using the penultimate layer activations as features. The result is a network which has the final layer trained to convergence, and the earlier layers trained for epochs with a standard training procedure.
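The replacement step of this procedure can be sketched as below. Several things here are assumptions and not confirmed by the text: the targets are read from Eq. 6 as $C$ times the one-hot label matrix minus a matrix of ones, the final layer is treated as bias-free with zero initial weights (so only the first term of Eq. 6 contributes), and the helper name `converged_last_layer` is hypothetical. Extracting penultimate activations from a trained network is left out.

```python
import numpy as np

def converged_last_layer(features, labels, num_classes):
    """Closed-form final-layer weights in the spirit of Eq. 6.

    features: (F, N) penultimate-layer activations, one column per example.
    labels:   (N,) integer class labels.
    Returns a (num_classes, F) weight matrix.
    """
    F, N = features.shape
    one_hot = np.zeros((num_classes, N))
    one_hot[labels, np.arange(N)] = 1.0
    # Assumed reading of Eq. 6: C * P_L - 1, which already has zero mean
    # across classes, so the mean-subtracting matrix M leaves it unchanged
    targets = num_classes * one_hot - np.ones((num_classes, N))
    # Phi^T (Phi Phi^T)^+ is the pseudoinverse of Phi
    return targets @ np.linalg.pinv(features)

# Stand-in activations: more features than examples, so the fit is exact
rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 12))
labels = rng.integers(0, 5, size=12)
W = converged_last_layer(feats, labels, num_classes=5)
print("train accuracy:", float(((W @ feats).argmax(axis=0) == labels).mean()))
```

Since softmax predictions are unchanged by adding a constant across classes, only the class-mean-subtracted part of the weights matters, which is why working with zero-mean targets is sufficient in this sketch.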

Fig. 5 shows the test error achieved by ResNet18 () trained on CIFAR10 for values of between 0.0 and 0.4 (hyperparameters are those from Sec. 6). The blue line represents the error achieved with standard training, and the orange ‘Converged’ line represents the error achieved after substituting in the converged weights of Eq. 6. We see that in all cases, using the ‘converged’ weights eliminates epochwise double descent. When , ‘converged’ weights also give a lower test error than standard training, as the second descent in test error results in a very slow convergence to this final value. In the noise free case of , we find that standard training and using ‘converged’ weights for the final layer give equivalent performance, with differences smaller than the variance arising from different random seeds.

8 Discussion and Conclusion

In constructing the linear model of epochwise double descent, we assumed the existence of small scale (small eigenvalue) features that are largely unaffected by the presence of noise, and learn slowly compared to larger scale (large eigenvalue) features which are noisy. Near critical parameterization, this gives rise to an epochwise double descent for noise levels above a threshold. This is similar to other linear models (recently advani2017high ; heckel2020early ) that achieve the well-studied double descent in model complexity by assuming the existence of small scale (small eigenvalue) features which couple to uniform noise. While not a focus of this work, we note that both effects can coexist if the small scale features are weakly affected by noise in comparison to the large scale features. We believe that the study of other noise models is likely to yield other behaviors which may be interesting, and the problem of determining which noise model is most appropriate for the case of training deep neural nets with label noise is a promising direction for future work.

We also demonstrated that epochwise double descent can be removed by the deletion of certain features in the input data (Sec. 6). While in this work we simply discarded the smallest principal components, this is likely suboptimal. Other approaches which are more selective about which input features to remove might have the potential to remove epochwise double descent at a smaller cost to generalization performance. Determining which features are responsible for epochwise double descent may also shed light on what features DNNs are sensitive to at different stages of training, and so this is a potential avenue for further work on DNN interpretability in connection to the various double descent phenomena.

Finally, we note an apparent tension between our experimental results and the EMC hypothesis of nakkiran2019deep . In Sec. 7 we give an example of a training procedure that does not exhibit double descent as the number of samples fit with error exceeds the size of the training set. Strictly speaking, this acts as a counterexample to the EMC hypothesis. Furthermore, we also note that our results in Sec. 5 suggest that a critical amount of noise is necessary to produce epochwise double descent, while the EMC hypothesis suggests that epochwise double descent should occur for all noise levels. We provide an alternate view: epochwise double descent occurs as a result of an interplay between the features in the data and the noise in the labels (as in our linear model) rather than as a result of the EMC. We believe additional work clarifying which view of the epochwise double descent phenomenon is correct would improve the understanding of the generalization dynamics of DNNs.

In summary, we have developed a linear model which exhibits an epochwise double descent effect. We have calculated a phase diagram that shows when epochwise double descent does/does not occur, and when early stopping does/does not lead to optimal generalization performance. In spite of the simplicity of our linear model, all four combinations of these two effects occur depending on the amount of overparameterization and the amount of label noise. Experiments on deep neural networks show that these highly nonlinear models behave qualitatively similarly to our linear model, in that epochwise double descent requires both overparameterization and noise above a critical threshold to occur. Experimentally and in our linear model we find that above a second critical noise level epochwise double descent occurs but is not relevant as early stopping gives superior generalization performance. Using insights from the linear model we give two methods that experimentally eliminate the epochwise double descent effect, one which harms generalization and another which improves or matches the generalization performance of standard training. These findings follow from our hypothesis that the epochwise double descent effect arises from slow to learn but informative features. We hope future works will shed further light on the nature of these features and their relation to the influence of noisy labels on the generalization dynamics of deep neural nets.

Deep learning has become an incredibly compute-intensive field, where long-running model training runs are the norm. Epochwise double descent poses a challenge in this context, since, as described, early stopping may yield worse generalization performance. Beyond this, our work is application-agnostic, so we cannot foresee any clear negative impacts.

9 Appendix

9.1 Derivation of MSE dynamics

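The mean-squared error ($L_{\mathrm{MSE}}$) is given by

$$L_{\mathrm{MSE}} = \frac{1}{2N}\sum_{j=1}^{N}\sum_{c=1}^{C}\left[w\Phi - y\right]^{2}_{cj} \tag{10}$$

Computing the gradient with respect to one of the parameters gives

$$\frac{\partial L_{\mathrm{MSE}}}{\partial w_{cj}} = \frac{1}{N}\left[w\Phi\Phi^{\top} - y\Phi^{\top}\right]_{cj} \tag{11}$$

Carrying out gradient descent with a learning rate $\gamma$ gives the recursion relation for the parameters (Eq. 2 from the main text)

$$w^{(t+1)}_{\mathrm{MSE}} = w^{(t)}_{\mathrm{MSE}} - \frac{\gamma}{N}\, w^{(t)}_{\mathrm{MSE}}\Phi\Phi^{\top} + \frac{\gamma}{N}\, y\Phi^{\top} \tag{12}$$

where $w^{(t)}_{\mathrm{MSE}}$ denotes the value of the parameters obtained after training for time $t$ with MSE. This relation has a fixed point at

$$w^{*}_{\mathrm{MSE}} = y\Phi^{\top}\left(\Phi\Phi^{\top}\right)^{+} \tag{13}$$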

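where $^{+}$ denotes the Moore-Penrose pseudoinverse. Changing variables to $z^{(t)} = w^{(t)}_{\mathrm{MSE}} - w^{*}_{\mathrm{MSE}}$ gives

$$z^{(t+1)} = z^{(t)} - \frac{\gamma}{N}\, z^{(t)}\Phi\Phi^{\top} \tag{14}$$

Changing variables to the eigenbasis via $\frac{1}{N}\Phi\Phi^{\top} = U\Lambda U^{\top}$ and $\tilde{z}^{(t)} = z^{(t)}U$ gives a set of decoupled linear recursion relations

$$\tilde{z}^{(t+1)} = \tilde{z}^{(t)} - \gamma\, \tilde{z}^{(t)}\Lambda \tag{15}$$

which can easily be solved to give

$$\tilde{z}^{(t)} = \tilde{z}^{(0)}\left(I_F - \gamma\Lambda\right)^{t} \tag{16}$$

Changing variables back to the original basis with $z^{(t)} = \tilde{z}^{(t)}U^{\top}$ and $w^{(t)}_{\mathrm{MSE}} = z^{(t)} + w^{*}_{\mathrm{MSE}}$ gives the solution for the parameters

$$w^{(t)}_{\mathrm{MSE}} = w^{*}_{\mathrm{MSE}} + \left(w^{(0)}_{\mathrm{MSE}} - w^{*}_{\mathrm{MSE}}\right)U\left(I_F - \gamma\Lambda\right)^{t}U^{\top} \tag{17}$$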

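In the case where some eigenvalues of $\Phi\Phi^{\top}$ are zero, there is a frozen subspace in which the parameters do not change in time, and so the infinite time solution can be written as

$$w^{(\infty)}_{\mathrm{MSE}} = y\Phi^{\top}\left(\Phi\Phi^{\top}\right)^{+} + w^{(0)}_{\mathrm{MSE}}\left(I_F - U\Lambda^{+}\Lambda U^{\top}\right) \tag{18}$$

where the quantity $\Lambda^{+}\Lambda$ is a diagonal matrix with zeros on the diagonal corresponding to zero eigenvalues of $\Phi\Phi^{\top}$ and ones on the diagonal corresponding to nonzero eigenvalues of $\Phi\Phi^{\top}$. The dynamics can be written in terms of the infinite time solution as

$$w^{(t)}_{\mathrm{MSE}} = w^{(\infty)}_{\mathrm{MSE}} + \left(w^{(0)}_{\mathrm{MSE}} - w^{(\infty)}_{\mathrm{MSE}}\right)U\left(I_F - \gamma\Lambda\right)^{t}U^{\top} \tag{19}$$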

Which is Eq. 3 of the main text.
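As a sanity check on the derivation above, the following sketch iterates the recursion of Eq. 12 numerically and compares the result with the closed form of Eq. 19. The dimensions, random data, learning rate and number of steps are illustrative assumptions, not values from the experiments. It also assumes that $\Lambda$ holds the eigenvalues of $\Phi\Phi^{\top}/N$, which is what the disappearance of the factor $1/N$ between Eqs. 14 and 15 suggests.

```python
import numpy as np

rng = np.random.default_rng(0)
C, F, N = 3, 8, 5            # classes, features, samples; F > N forces zero eigenvalues
gamma, steps = 0.05, 200     # illustrative learning rate and training time

Phi = rng.normal(size=(F, N))    # feature matrix
y = rng.normal(size=(C, N))      # targets
w0 = rng.normal(size=(C, F))     # initial weights

# Iterate the gradient descent recursion of Eq. 12 directly.
w = w0.copy()
for _ in range(steps):
    w = w - (gamma / N) * w @ Phi @ Phi.T + (gamma / N) * y @ Phi.T

# Eigendecomposition of Phi Phi^T / N (assumed definition of U and Lambda).
lam, U = np.linalg.eigh(Phi @ Phi.T / N)
nonzero = lam > 1e-10            # diagonal of Lambda^+ Lambda as a boolean mask

# Fixed point of Eq. 13, using the identity Phi^T (Phi Phi^T)^+ = Phi^+.
w_star = y @ np.linalg.pinv(Phi)
# Infinite time solution of Eq. 18: fixed point plus the frozen part of w0.
w_inf = w_star + w0 @ (np.eye(F) - (U * nonzero) @ U.T)
# U (I - gamma Lambda)^t U^T, then the closed form of Eq. 19.
decay = (U * (1.0 - gamma * lam) ** steps) @ U.T
w_closed = w_inf + (w0 - w_inf) @ decay

max_abs_diff = np.max(np.abs(w - w_closed))
print(f"max |iterated - closed form| = {max_abs_diff:.2e}")
```

Because `F > N` here, three eigenvalues vanish and the corresponding components of `w0` never move, which is the frozen subspace described above.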

9.2 Derivation of high temperature softmax cross entropy dynamics

In the case of cross entropy, we take the model outputs $[P_M]_{ij}$ to be the probability assigned to class $i$ on example $j$, given by the matrix $P_M$. For our models, $P_M$ is given by the softmax function applied to the model outputs

$$P_M = \frac{e^{\beta w_{\mathrm{XENT}}\Phi}}{\mathbf{1}_C^{\top}\, e^{\beta w_{\mathrm{XENT}}\Phi}} \tag{20}$$

where $w_{\mathrm{XENT}}$ is a weight matrix of learnable parameters, $\beta$ is the inverse temperature, and $\mathbf{1}_C$ is a $C$-dimensional vector of ones. Exponentials and the division should be interpreted as elementwise operations. We also have the label matrix

$$[P_L]_{ij} = \delta_{i,\,\operatorname{argmax} y_j} \tag{21}$$

where $\delta$ is the Kronecker delta. $P_L$ is thus a matrix of one-hot vectors defining the labels given by the $y_j$s. We also define the label-smoothed label matrix as

$$\tilde{P}_L = \alpha P_L + (1-\alpha)\frac{\mathbf{1}_{CN}}{C} \tag{22}$$

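where $\mathbf{1}_{CN}$ is a $C \times N$ matrix of ones, and $\alpha$ is a smoothing parameter [30, 31] that obeys $0 \le \alpha \le 1$. Given an $\alpha$, we find the parameters by gradient descent on the cross entropy cost defined by

$$L_{\mathrm{XENT}} = -\frac{1}{N}\operatorname{Tr}\left[\tilde{P}_L^{\top}\log(P_M)\right] \tag{23}$$

where the log is elementwise. The gradient of $L_{\mathrm{XENT}}$ with respect to the weights is given by

$$\frac{\partial L}{\partial w} = \frac{\beta}{N}\left(P_M - \tilde{P}_L\right)\Phi^{\top} \tag{24}$$

and the discrete time gradient descent dynamics of $w_{\mathrm{XENT}}$ (abbreviated as $w$) is given by the nonlinear recursion relation

$$w^{(t+1)} = w^{(t)} - \frac{\gamma\beta}{N}\left(P_M - \tilde{P}_L\right)\Phi^{\top} \tag{25}$$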

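for learning rate $\gamma$. These dynamics are not solvable in general, but we can gain insight into their behavior by analyzing the high temperature limit. We construct this by taking the first-order Taylor expansion of $P_M$ about $\beta = 0$, which is

$$P_M \approx \frac{\mathbf{1}_{CN}}{C} + \frac{\beta}{C} M w \Phi + O(\beta^{2}) \tag{26}$$

Here $M$ is defined as

$$M \equiv I_C - \frac{1}{C}\mathbf{1}_{CC} \tag{27}$$

where $\mathbf{1}_{CC}$ is a $C \times C$ matrix of ones, and $I_C$ is a $C \times C$ identity matrix. Under this approximation, the dynamics simplify to

$$w^{(t+1)} = w^{(t)} - \frac{\gamma\beta^{2}}{NC}\left[M w^{(t)}\Phi - \frac{\alpha}{\beta}\left(C P_L - \mathbf{1}_{CN}\right)\right]\Phi^{\top} \tag{28}$$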

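For simplicity, we absorb the factor of $\beta^{2}/C$ into the learning rate and take $\alpha = \beta$ (which presumes that $\beta$ lies in the allowed range of $\alpha$) to get

$$w^{(t+1)} = w^{(t)} - \frac{\gamma}{N}\left[M w^{(t)}\Phi - \left(C P_L - \mathbf{1}_{CN}\right)\right]\Phi^{\top} \tag{29}$$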

Lastly, we make use of two special properties of the matrix $M$:

$$M^{2} = M, \qquad M\left(C P_L - \mathbf{1}_{CN}\right) = \left(C P_L - \mathbf{1}_{CN}\right) \tag{30}$$

Both of these follow from the action of $M$, which subtracts the mean across classes of whatever it acts on, and the second follows from the fact that $C P_L - \mathbf{1}_{CN}$ is mean zero across classes. Multiplying through the recursion relation with $M$ and using these properties gives

$$M w^{(t+1)} = M w^{(t)} - \frac{\gamma}{N} M w^{(t)}\Phi\Phi^{\top} + \frac{\gamma}{N}\left(C P_L - \mathbf{1}_{CN}\right)\Phi^{\top} \tag{31}$$

These dynamics are identical to the MSE dynamics given in the preceding section with labels $y = C P_L - \mathbf{1}_{CN}$, and can be solved the same way except with a weight matrix given by $Mw$. This correspondence generates the solutions given in the main text.
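The two ingredients of this argument, the first-order expansion of Eq. 26 and the properties of $M$ in Eq. 30, can be checked numerically. The sketch below does so for randomly drawn weights, features and labels; the sizes and the value of the inverse temperature are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
C, F, N = 4, 6, 10
beta = 1e-3                       # small inverse temperature (high temperature)

w = rng.normal(size=(C, F))
Phi = rng.normal(size=(F, N))
labels = rng.integers(0, C, size=N)
P_L = np.eye(C)[:, labels]        # one-hot label matrix of Eq. 21, shape (C, N)

M = np.eye(C) - np.ones((C, C)) / C            # Eq. 27

# Exact softmax over classes (axis 0) at inverse temperature beta, Eq. 20.
logits = beta * (w @ Phi)
expo = np.exp(logits - logits.max(axis=0, keepdims=True))
P_M = expo / expo.sum(axis=0, keepdims=True)

# First-order (high temperature) expansion, Eq. 26.
P_M_lin = np.ones((C, N)) / C + (beta / C) * (M @ w @ Phi)
lin_error = np.max(np.abs(P_M - P_M_lin))      # remainder is second order in beta

# The two properties of M stated in Eq. 30.
centered = C * P_L - np.ones((C, N))
idempotent = np.allclose(M @ M, M)
fixes_centered = np.allclose(M @ centered, centered)

print(lin_error, idempotent, fixes_centered)
```

Rerunning with a larger `beta` should show `lin_error` growing roughly quadratically, which is the regime in which the linearized dynamics stop being a good description.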

9.3 Relationship between dynamics of softmax cross entropy and MSE

As shown in the previous two sections, the training dynamics of a highly label-smoothed softmax cross entropy cost are similar to the dynamics of training with a mean squared error cost. However, differences in the learned weights may occur in the mean across classes. Here we show that in the case of softmax cross entropy, the mean across classes of the weight matrix remains fixed at its initial value, while in the case of the mean squared error, it decays away to zero. First, we reparameterize the weight matrix via

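$$w = Mw + \frac{1}{C}\mathbf{1}_{CC}\, w = \tilde{w} + \mu \tag{32}$$

where $\tilde{w}$ has zero mean across classes, and $\mu$ is the mean of $w$ across classes. It is clear then that $M\tilde{w} = \tilde{w}$ and $M\mu = 0$ by the action of $M$. In this parameterization, the high temperature dynamics of softmax cross entropy training become

$$\begin{aligned}
\tilde{w}^{(t+1)}_{\mathrm{XENT}} &= \tilde{w}^{(t)}_{\mathrm{XENT}} - \frac{\gamma}{N}\tilde{w}^{(t)}_{\mathrm{XENT}}\Phi\Phi^{\top} + \frac{\gamma}{N}\left(C P_L - \mathbf{1}_{CN}\right)\Phi^{\top} \\
\mu^{(t+1)}_{\mathrm{XENT}} &= \mu^{(t)}_{\mathrm{XENT}}
\end{aligned} \tag{33}$$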

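For MSE, the situation is similar with the exception that the mean couples to $\Phi\Phi^{\top}$. Acting on both sides of the MSE recursion relation with $M$ gives a recursion relation for $\tilde{w}$, and a recursion relation for $\mu$ can be obtained from $\mu = w - \tilde{w}$. The result is

$$\begin{aligned}
\tilde{w}^{(t+1)}_{\mathrm{MSE}} &= \tilde{w}^{(t)}_{\mathrm{MSE}} - \frac{\gamma}{N}\tilde{w}^{(t)}_{\mathrm{MSE}}\Phi\Phi^{\top} + \frac{\gamma}{N}\left(C P_L - \mathbf{1}_{CN}\right)\Phi^{\top} \\
\mu^{(t+1)}_{\mathrm{MSE}} &= \mu^{(t)}_{\mathrm{MSE}} - \frac{\gamma}{N}\mu^{(t)}_{\mathrm{MSE}}\Phi\Phi^{\top}
\end{aligned} \tag{34}$$

which is identical to the dynamics of high temperature softmax cross entropy with the exception that the mean across classes decays to zero with a rate determined by $\Phi\Phi^{\top}$. In general, the argmax model output is independent of $\mu$, and so in all cases the top-1 accuracy of training with cross entropy and a high-temperature softmax and MSE will be identical.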
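The comparison above can be illustrated by running the two recursions side by side from the same initialization. This is a minimal sketch under assumed sizes, learning rate and random data; it uses the high temperature recursion of Eq. 29 and the MSE recursion of Eq. 12 with the centered labels $C P_L - \mathbf{1}_{CN}$, with more samples than features so that no direction is frozen.

```python
import numpy as np

rng = np.random.default_rng(2)
C, F, N = 3, 5, 40               # N > F, so Phi Phi^T has no zero eigenvalues
gamma, steps = 0.1, 2000         # illustrative learning rate and training time

Phi = rng.normal(size=(F, N))
labels = rng.integers(0, C, size=N)
y = C * np.eye(C)[:, labels] - np.ones((C, N))   # centered labels C P_L - 1_CN
M = np.eye(C) - np.ones((C, C)) / C

w_xent = rng.normal(size=(C, F))
w_mse = w_xent.copy()
mu0 = w_xent.mean(axis=0)        # initial mean across classes (one row of mu)

for _ in range(steps):
    # High temperature softmax cross entropy update, Eq. 29.
    w_xent = w_xent - (gamma / N) * (M @ w_xent @ Phi - y) @ Phi.T
    # MSE update, Eq. 12, on the same centered labels.
    w_mse = w_mse - (gamma / N) * (w_mse @ Phi - y) @ Phi.T

mu_xent = w_xent.mean(axis=0)    # expected to stay at mu0
mu_mse = w_mse.mean(axis=0)      # expected to decay towards zero
same_centered = np.allclose(M @ w_xent, M @ w_mse)
same_argmax = np.array_equal((w_xent @ Phi).argmax(axis=0),
                             (w_mse @ Phi).argmax(axis=0))

print(np.abs(mu_xent - mu0).max(), np.abs(mu_mse).max(), same_centered, same_argmax)
```

The predicted classes agree because the two weight matrices differ only by a term that is constant across classes, which shifts every class score for a given example by the same amount.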

9.4 Derivation of expected test cost

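We begin with the solution to the training dynamics in the case of MSE training, which behaves equivalently to high-temperature softmax cross entropy training, and take the special case $w^{(0)} = 0$ and $\Phi = X$:

$$w^{(t)} = yX^{\top}\left(XX^{\top}\right)^{+} - yX^{\top}\left(XX^{\top}\right)^{+}U\left(I_D - \gamma\Lambda\right)^{t}U^{\top} \tag{35}$$

Taking the labels $y = w_T X + \epsilon$ for a teacher matrix $w_T$ and noise $\epsilon$, we change to the eigenbasis using the SVD $X = U\Lambda^{1/2}V^{\top}$. This involves defining $q = wU$, $q_T = w_T U$, and $\eta = \epsilon V$. The result is

$$q^{(t)} = \left(q_T + \eta\Lambda^{-1/2}\right)\left(I_D - \left(I_F - \gamma\Lambda\right)^{t}\right) \tag{36}$$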

We can then evaluate the expected generalization performance on noise free test data via

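$$\langle L_{\mathrm{test}}\rangle_{\epsilon,X,x_T,w_T} = \frac{1}{2}\left\langle \left(w x_T - w_T x_T\right)^{\top}\left(w x_T - w_T x_T\right)\right\rangle_{\epsilon,X,x_T,w_T} \tag{37}$$

Here the expectation is taken over the distribution of the training data $X$, the test data $x_T$, the teacher matrix $w_T$, and the label noise $\epsilon$. The expectation over $x_T$ is easy, as $\langle x_T x_T^{\top}\rangle = I_D$. Since the MSE is invariant to orthogonal transformations, we can compute the expectations in the rotated basis

$$\langle L_{\mathrm{test}}\rangle_{\epsilon,X,w_T} = \frac{1}{2}\left\langle \left(q - q_T\right)^{\top}\left(q - q_T\right)\right\rangle_{\eta,\Lambda,q_T} \tag{38}$$

This can be evaluated for each dimension identically, as so far everything is isotropic. Using the solution to the dynamics, and dropping terms first order in $q_T$ and $\eta$ since $\langle q_T\rangle = 0$ and $\langle \eta\rangle = 0$, gives

$$\langle L_{\mathrm{test}}\rangle_{\epsilon,X,w_T} = \frac{1}{2}\left\langle \left(q_T^{2} + \frac{\eta^{2}}{l}\right)\left(1-\gamma l\right)^{2t} + \frac{\eta^{2}}{l}\left(1 - 2\left(1-\gamma l\right)^{t}\right)\right\rangle_{\eta,l,q_T} \tag{39}$$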

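where $q_T$ is a single component of the teacher in the rotated basis, $\eta$ is a single component of the noise, and $l$ is the corresponding eigenvalue. Using $\langle q_T^{2}\rangle = 1$, along with the assumption for the noise

$$\eta \sim \begin{cases} \mathcal{N}(0,\sigma) & l \ge 1 \\ 0 & l \le 1 \end{cases} \tag{40}$$

we get

$$\begin{aligned}
\langle L_{\mathrm{test}}\rangle_{\epsilon,X,w_T} = {} & \frac{1}{2}\int_{1}^{\infty}\left[\left(1+\frac{\sigma}{l}\right)\left(1-\gamma l\right)^{2t} + \frac{\sigma}{l}\left(1 - 2\left(1-\gamma l\right)^{t}\right)\right]p(l)\,dl \\
& + \frac{1}{2}\int_{0}^{1}\left[\left(1-\gamma l\right)^{2t}\right]p(l)\,dl
\end{aligned} \tag{41}$$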

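Here, $p(l)$ is the distribution of eigenvalues of the training data. The expectation over the training data has thus been replaced with an expectation over $l$. Owing to our assumption that the training data is drawn from an uncorrelated Gaussian, the eigenvalues of the data covariance matrix follow the Marchenko-Pastur (MP) distribution in the limit $D, N \to \infty$ with their ratio $\lambda$ held fixed.

$$p_{\mathrm{MP}}(x \mid \lambda) = \frac{1}{2\pi}\frac{\sqrt{\left(\lambda_{+} - x\right)\left(x - \lambda_{-}\right)}}{\lambda x}\,\mathbb{1}_{x \in [\lambda_{-},\lambda_{+}]} \tag{42}$$

Taking this high-dimensional limit, we get

$$\begin{aligned}
\langle L_{\mathrm{test}}\rangle_{\epsilon,X,w_T} = {} & \frac{1}{2}\int_{1}^{\infty}\left[\left(1+\frac{\sigma}{l}\right)\left(1-\gamma l\right)^{2t} + \frac{\sigma}{l}\left(1 - 2\left(1-\gamma l\right)^{t}\right)\right]p_{\mathrm{MP}}(l \mid \lambda)\,dl \\
& + \frac{1}{2}\int_{0}^{1}\left[\left(1-\gamma l\right)^{2t}\right]p_{\mathrm{MP}}(l \mid \lambda)\,dl
\end{aligned} \tag{43}$$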

Which is Eq. 9 of the main text.
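The expression in Eq. 43 is straightforward to evaluate numerically. The sketch below is one possible way to do it, not the procedure used for the figures in the main text. It assumes the usual edges $\lambda_{\pm} = (1 \pm \sqrt{\lambda})^{2}$ for the Marchenko-Pastur density, restricts itself to $0 < \lambda < 1$ so that the density has no point mass at zero, and uses a simple trapezoidal rule on a uniform grid. The function names and all parameter values are hypothetical.

```python
import numpy as np

def mp_density(x, lam):
    """Continuous Marchenko-Pastur density of Eq. 42, assuming 0 < lam < 1."""
    lo, hi = (1 - np.sqrt(lam)) ** 2, (1 + np.sqrt(lam)) ** 2
    inside = (x > lo) & (x < hi)
    rad = np.clip((hi - x) * (x - lo), 0.0, None)
    safe_x = np.where(inside, x, 1.0)          # avoid dividing outside the support
    return np.where(inside, np.sqrt(rad) / (2 * np.pi * lam * safe_x), 0.0)

def expected_test_loss(t, lam, sigma, gamma, n_grid=200_001):
    """Trapezoidal evaluation of Eq. 43 over the Marchenko-Pastur support."""
    lo, hi = (1 - np.sqrt(lam)) ** 2, (1 + np.sqrt(lam)) ** 2
    l = np.linspace(lo, hi, n_grid)
    p = mp_density(l, lam)
    decay = (1 - gamma * l) ** t
    # Eigenvalues l >= 1 carry label noise of variance sigma (Eq. 40).
    noisy = (1 + sigma / l) * decay**2 + (sigma / l) * (1 - 2 * decay)
    # Eigenvalues l < 1 are noise free.
    clean = decay**2
    integrand = np.where(l >= 1, noisy, clean) * p
    dl = l[1] - l[0]
    integral = dl * np.sum(0.5 * (integrand[1:] + integrand[:-1]))
    return 0.5 * integral

# Illustrative learning curve; these parameter values are not taken from the paper.
times = [0, 1, 10, 100, 1000, 10_000]
curve = [expected_test_loss(t, lam=0.5, sigma=1.0, gamma=0.1) for t in times]
for t, loss in zip(times, curve):
    print(t, loss)
```

Sweeping `lam` and `sigma` over a grid and inspecting the shape of each curve in `t` is one way to trace out where a second descent appears, subject to the choice of `gamma` keeping $|1 - \gamma l| < 1$ over the whole support.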

9.5 Correspondence between label noise and type of noise used in model

Our assumption that the noise in the training labels couples only to large eigenvalue features is key to obtaining an epochwise double descent effect. Other assumptions, such as noise that couples to all eigenvalues equally, do not show epochwise double descent anywhere in the phase diagram. Here we informally explain some of the intuition behind this assumption.

First, we note that noise introduced by randomly shuffling labels (as in the experiments on deep networks that show epochwise double descent) does not couple to all features equally in the eigenbasis. To see this, note that we can write totally random labels $y_{\mathrm{rand}}$ in terms of the clean labels $y_{\mathrm{clean}}$ and a noise term $\eta$.

$$y_{\mathrm{rand}} = y_{\mathrm{clean}} + \eta \tag{44}$$

Hence, noise in a dataset with totally random labels can be written as $\eta = y_{\mathrm{rand}} - y_{\mathrm{clean}}$. In a dataset where only a fraction of the training examples have random labels, we introduce the matrix $F$, where $F$ is a diagonal matrix with ones randomly distributed on the diagonal, and zeros in the remaining positions on the diagonal. Then we have

$$\eta = \left(y_{\mathrm{rand}} - y_{\mathrm{clean}}\right)F \tag{45}$$

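In the solution to the dynamics, noise enters via the product $yX^{\top}$ for $y = y_{\mathrm{clean}} + \eta$, and so through $FX^{\top}$. Using the SVD $X = U\Lambda^{1/2}V^{\top}$, we can write

$$FX^{\top} = VF\Lambda^{1/2}U^{\top} - [F,V]\Lambda^{1/2}U^{\top} \tag{46}$$

The first term on the RHS includes a product $F\Lambda^{1/2}$ which acts to drop a subset of the eigenvalues. The second term arises from the commutator $[F,V]$, which has a simple structure:

$$[F,V] = V \odot D \tag{47}$$

where $\odot$ denotes the elementwise product and the entries of $D$ are differences between diagonal entries of $F$. Thus, the action of fractional label permutation has two properties: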

1. Some eigenvectors do not couple to the noise (for )
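The masking of Eq. 45 and the elementwise structure of Eq. 47 can be made concrete with a small numerical sketch. Several things here are assumptions rather than details from the text: the commutator convention $[F,V] = FV - VF$, the resulting form $D_{ij} = F_{ii} - F_{jj}$, the use of a random orthogonal matrix as a stand-in for $V$, and all sizes. The variable `F_mask` plays the role of $F$.

```python
import numpy as np

rng = np.random.default_rng(3)
N, frac = 8, 0.25                       # number of examples and corrupted fraction

y_clean = rng.normal(size=(1, N))
y_rand = y_clean[:, rng.permutation(N)]  # fully shuffled labels

# Diagonal selector with ones on a random subset of the diagonal.
mask = np.zeros(N)
mask[rng.choice(N, size=int(frac * N), replace=False)] = 1.0
F_mask = np.diag(mask)

# Eq. 45: noise is nonzero only on the selected examples.
eta = (y_rand - y_clean) @ F_mask
untouched_is_clean = np.allclose(eta[:, mask == 0], 0.0)

# Stand-in for V: a random orthogonal matrix from a QR decomposition.
V = np.linalg.qr(rng.normal(size=(N, N)))[0]

# Assumed convention [F, V] = F V - V F, giving D_ij = F_ii - F_jj.
commutator = F_mask @ V - V @ F_mask
D = mask[:, None] - mask[None, :]
hadamard_ok = np.allclose(commutator, V * D)

print(untouched_is_clean, hadamard_ok)
```

Under this convention the commutator vanishes wherever two indices are both selected or both unselected, so it only mixes corrupted and uncorrupted examples.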

9.6 Compute requirements

All of the experiments described here were run on an internal compute cluster. Each experiment was run on a compute unit with 1 GPU, 12 CPU cores and 16 GB RAM. The results in the main paper stem from 340 ResNet18 training runs, which consume the bulk of our overall compute usage.

Our experiments were developed using the PyTorch library (pytorch.org), which is open-sourced under the BSD license. Many of our experiments were run on the CIFAR10 dataset, which is publicly available under the MIT license.

9.7 Experiment hyperparameters

Experiments in Section 5 were performed using hyperparameters found in [4]. This was done to ease comparisons with prior work, but we found those hyperparameters to be suboptimal. In Sections 6 and 7 we instead performed a hyperparameter sweep over learning rate, learning rate decay, batch size and momentum. For learning rate decay, we used a schedule of , evaluated at the end of each epoch. We found the following hyperparameters to work best: