1 Introduction
Modern deep learning models achieve good generalization performance even with more parameters than data points. This surprising phenomenon is referred to as benign overfitting, and differs from the canonical learning regime where good generalization requires limiting the model complexity (Mohri et al., 2018). One widely accepted explanation for benign overfitting is that optimization algorithms benefit from implicit bias and find good solutions among the interpolating ones in overparameterized settings. The implicit bias can vary from problem to problem. Examples include the min-norm solution in regression settings and the max-margin solution in classification settings (Gunasekar et al., 2018a; Soudry et al., 2018; Gunasekar et al., 2018b). These types of bias in optimization can further result in good generalization performance (Bartlett et al., 2020; Zou et al., 2021; Frei et al., 2022). These studies provide novel insights, yet they sometimes differ from deep learning practice: state-of-the-art models, despite being overparameterized, often do not interpolate the data points (e.g., He et al. (2016); Devlin et al. (2018)).

In this work, we first examine the existence of benign overfitting in realistic setups by testing whether ResNet (He et al., 2016) models can overfit data benignly under common experimental setups: image classification on CIFAR10 and ImageNet. Our results are shown in Figure 1. In particular, we trained ResNet18 on CIFAR10 for 200 epochs, and the model interpolates the train data. In addition, we trained ResNet101 on ImageNet for 160 epochs, as opposed to the common schedule that stops at 90 epochs. Surprisingly, we found that although benign overfitting happens on the CIFAR10 dataset, overfitting is not benign on the ImageNet dataset: the test loss increased as the model further fit the train set. This phenomenon cannot be explained by known analyses of benign overfitting for classification tasks, as no negative results have been established there.

Motivated by the above observation, our work aims to understand the cause of the two different overfitting behaviors on ImageNet and CIFAR10, and to reconcile the empirical phenomenon with previous analyses of benign overfitting. Our first hint comes from the level of overparameterization. Previous results on benign overfitting in the classification setting usually require $d \gg n$, where $d$ denotes the number of parameters and $n$ denotes the training sample size (Wang and Thrampoulidis, 2021; Cao et al., 2021; Chatterji and Long, 2021; Frei et al., 2022). However, in practice many deep learning models fall in the mild overparameterization regime, where the number of parameters is only slightly larger than the number of samples. In our case, the sample size is on the order of $10^6$ in ImageNet, whereas the parameter size is on the order of $10^7$ in ResNets.
To close the gap, we study the overfitting behavior of classification models under the mild overparameterization setup where $d = \Theta(n)$ (this is sometimes referred to as the asymptotic regime). In particular, following Wang and Thrampoulidis (2021); Cao et al. (2021); Chatterji and Long (2021); Frei et al. (2022), we analyze the solution of stochastic gradient descent for Gaussian mixture models. We found that a phase change happens when we move from $d \gg n$ (studied in Wang and Thrampoulidis (2021)) to $d = \Theta(n)$. Unlike previous analyses, we show that benign overfitting now provably fails in the presence of label noise (see Table 1 and Figure 2). This aligns with our empirical findings, as ImageNet is known to suffer from mislabeling and multi-labels (Yun et al., 2021; Shankar et al., 2020).

More specifically, our analysis (see Theorem 3.1 for details) under the mild overparameterization ($d = \Theta(n)$) setup supports the following statements, which align with our empirical observations in Figures 1 and 5:


When the labels are noiseless, benign overfitting still holds under conditions similar to those in previous analyses.

When the labels are noisy, the interpolating solution can provably lead to a positive excess risk that does not diminish with the sample size.

When the labels are noisy, early stopping can provably lead to better generalization performance compared to interpolating models.
Furthermore, our analysis naturally leads to a simple technique, known as self-training, that can improve a model's generalization performance by removing "hard data points". The technique was studied in (Hinton et al., 2015; Allen-Zhu and Li, 2020; Huang et al., 2020). We confirm through our experimental design that the gain in generalization is coupled with the level of label noise.
More importantly, the empirical and theoretical results in our work point out that modern models may not operate under the benign interpolating regime. Hence, it highlights the importance of characterizing implicit bias when the model, though mildly overparameterized, does not interpolate the train data.
Figure 2 (caption): We use SGD to train a linear classifier and plot the excess risk. A brighter grid cell means that the classifier has smaller excess error and therefore does not overfit. Panel (a) shows that training on noiseless data does not cause overfitting, and panel (b) shows that training on noisy data causes overfitting with overparameterization. Besides, we find that overfitting consistently happens under noisy regimes, despite the theoretical guarantee that the model overfits benignly when $d \gg n$.
2 Related Works
Although the benign overfitting phenomenon has been systematically studied both empirically (Belkin et al., 2018b, 2019; Nakkiran et al., 2021) and theoretically (see later), our work differs in that we aim to understand why benign overfitting fails in the classification setup. In particular, among all theoretical studies, the closest ones to ours are (Chatterji and Long, 2021), (Wang et al., 2021) and (Cao et al., 2021), as the analyses are done under the classification setup with Gaussian mixture models. Our work provides new results by moving from heavy overparameterization ($d \gg n$) to mild overparameterization ($d = \Theta(n)$). We found that this new condition leads to an analysis that is consistent with our empirical observations and can explain why benign overfitting fails. In short, compared to previous studies on benign overfitting in classification tasks, we focus on identifying the condition that breaks benign overfitting in our ImageNet experiments. The comparison is summarized in Table 1.
More related works on benign overfitting:
Researchers have made substantial efforts to generalize the notion of benign overfitting beyond the seminal work on linear regression (Bartlett et al., 2020), e.g., to variants of linear regression (Tsigler and Bartlett, 2020; Muthukumar et al., 2020; Zou et al., 2021), linear classification under different distributional assumptions (instead of the Gaussian mixture model) (Liang and Sur, 2020; Belkin et al., 2018a), kernel-based estimators (Liang and Rakhlin, 2018; Liang et al., 2020; Mei and Montanari, 2022), and neural network classifiers (Frei et al., 2022).

Gaussian Mixture Model
(GMM) represents the data distribution where the input is drawn from a Gaussian distribution with different centers for each class. The model was widely studied in hidden Markov models, anomaly detection, and many other fields
(Hastie et al., 2001; Xuan et al., 2001; Reynolds, 2009; Zong et al., 2018). A closely related work is Jin (2009), which analyzes the lower bound for the excess risk of the Gaussian Mixture Model under noiseless regimes. However, their analysis cannot be directly extended to either the overparameterized or the noisy-label regime. Another closely related work (Mai and Liao, 2019) focuses on the GMM setting with mild overparameterization, but it requires a noiseless label regime and relies on a small signal-to-noise ratio on the input, and thus cannot be extended to our theoretical results.
Asymptotic (Mildly Overparameterized) Regimes. This paper considers asymptotic regimes where the ratio of the parameter dimension to the sample size is upper and lower bounded by absolute constants. Previous work studied this setup with a different focus from benign overfitting (e.g., double descent), from both the regression perspective (Hastie et al., 2019) and the classification perspective (Sur and Candès, 2019; Mai and Liao, 2019; Deng et al., 2019). We study the mild overparameterization case because it commonly occurs in realistic machine learning tasks, where the number of parameters exceeds the number of training samples, but not by orders of magnitude.
Label Noise: Label noise often exists in real-world datasets, e.g., in ImageNet (Yun et al., 2021; Shankar et al., 2020). However, its effect on generalization remains debated. Recently, Damian et al. (2021) claimed that in regression settings, label noise may prefer flat global minimizers and thus help generalization. Another line of work claims that label noise hurts the effectiveness of empirical risk minimization in classification regimes, and proposes to improve model performance by explicit regularization (Mignacco et al., 2020) or robust training (Brodley and Friedl, 1996; Guan et al., 2011; Huang et al., 2020). Among them, Bagherinezhad et al. (2018) study how to refine the labels in ImageNet using label propagation to improve model performance, demonstrating the importance of labels in ImageNet. This paper mainly falls in the latter branch, as we analyze how label noise acts in the mild overparameterization regime.
3 Overfitting under Mild Overparameterization
We observed in Figure 1 that training ResNets on CIFAR10 and ImageNet can result in different overfitting behaviors. This discrepancy was not reflected in previous analyses of benign overfitting. In this section, we provide a theoretical analysis by studying the Gaussian mixture model under the mild overparameterization setup. Our analysis shows that mild overparameterization along with label noise can break benign overfitting.
3.1 Overparameterized Linear Classification
In this subsection, we study the generalization performance of linear models on the Gaussian Mixture Model (GMM). We assume that linear models are obtained by solving logistic regression with stochastic gradient descent (SGD). This simplified model may help explain phenomena in neural networks, since previous work shows that neural networks converge to linear models as the width goes to infinity (Arora et al., 2019; Allen-Zhu et al., 2019).

We next introduce GMM under two setups of overparameterized linear classification: the noiseless regime and the noisy regime.
Noiseless Regime. Let $y \in \{-1, +1\}$ denote the ground-truth label. The corresponding feature is generated by $x = y \mu + \xi$, where $\xi$ denotes noise drawn from a sub-Gaussian distribution. We denote by $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{n}$ the dataset whose points are generated by the above mechanism.

Noisy Regime with contamination rate $\eta$. For the noisy regime, we first generate noiseless data as above, and then consider the data point $(x, \tilde{y})$, where $\tilde{y}$ is the contaminated version of $y$. Formally, given the contamination rate $\eta$, the contaminated label satisfies $\tilde{y} = -y$ with probability $\eta$ and $\tilde{y} = y$ with probability $1 - \eta$. The returned dataset is $\tilde{\mathcal{S}} = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}$.

For simplicity, we assume that the data points in the train set are linearly separable in both noiseless and noisy regimes. This assumption holds almost surely under mild overparameterization. Besides, we make the following assumptions about the data distribution:
Assumption 1 (Assumptions on the data distribution.).

The noise $\xi$ used when generating the feature is drawn from a Gaussian distribution, i.e., $\xi \sim \mathcal{N}(0, I_d)$.

The signal-to-noise ratio satisfies a lower-bound condition governed by a given constant $c$.

The ratio $d/n$ is a fixed constant.
The three assumptions are all crucial but can be made slightly more general (see Section 4). The first Assumption [A1] stems from our need to derive a lower bound for the excess risk under the noisy regime. The second Assumption [A2] is widely used in the analysis (Chatterji and Long, 2021; Frei et al., 2022); for a smaller ratio, the model may be unable to learn even under the noiseless regime and return vacuous bounds. The third Assumption [A3] is the main difference from previous analyses, where we consider mild overparameterization instead of heavy overparameterization (i.e., $d \gg n$).
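To make the setup concrete, the data-generating process above can be sketched in a few lines. The dimension, mean vector, and contamination rate below are illustrative assumptions, not the paper's constants:

```python
import numpy as np

def make_gmm(n, d, mu, eta=0.0, seed=0):
    """Sample the two-class GMM: x = y * mu + xi with xi ~ N(0, I_d),
    then flip each label independently with contamination rate eta."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)                    # ground-truth labels
    x = y[:, None] * mu[None, :] + rng.standard_normal((n, d))
    flip = rng.random(n) < eta                         # contamination mask
    y_noisy = np.where(flip, -y, y)                    # observed labels
    return x, y, y_noisy

# mild overparameterization: d/n held at a fixed constant (here d/n = 2)
n, d = 200, 400
mu = np.zeros(d)
mu[0] = 3.0                                            # illustrative mean direction
x, y, y_noisy = make_gmm(n, d, mu, eta=0.1)
```

With $d > n$ as here, the sampled points are linearly separable almost surely, matching the separability assumption above.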
Training Procedure. We consider multi-pass SGD training with the logistic loss $\ell(z) = \log(1 + e^{-z})$. During each epoch, each data point is visited exactly once, in a random order without replacement. Formally, at the beginning of each epoch $e$, we uniformly sample a random permutation $\sigma_e$ of $[n]$; then at iteration $t$ within the epoch, given the learning rate $\alpha$, we update $w \leftarrow w - \alpha \nabla_w \ell\big(y_{\sigma_e(t)} \langle w, x_{\sigma_e(t)} \rangle\big)$, where $(x_{\sigma_e(t)}, y_{\sigma_e(t)})$ is the data point visited at iteration $t$.
Under the above procedure, Proposition 3.1 shows that the classifier under the GMM regime with multi-pass SGD training converges in direction to the max-margin interpolating classifier. This paper considers zero initialization, $w_0 = 0$, for simplicity.
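A minimal numpy sketch of this multi-pass, without-replacement SGD on the logistic loss, with zero initialization as above; the learning rate and epoch count are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def multipass_sgd(x, y, epochs=100, lr=0.1, seed=0):
    """Multi-pass SGD on the logistic loss l(z) = log(1 + exp(-z)).
    Each epoch visits every sample exactly once in a fresh random order."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = np.zeros(d)                         # zero initialization
    for _ in range(epochs):
        for i in rng.permutation(n):        # without-replacement pass
            z = y[i] * (x[i] @ w)
            # d/dw log(1 + exp(-z)) = -sigmoid(-z) * y_i * x_i
            w += lr * sigmoid(-z) * y[i] * x[i]
    return w
```

On linearly separable data the iterates interpolate the training set, and per Proposition 3.1 the direction of $w$ approaches the max-margin solution.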
Proposition 3.1 (Interpolator of multi-pass SGD under the GMM regime, from Nacson et al. (2019)).
Under the GMM regime with the logistic loss, denote the iterates of multi-pass SGD by $\{w_t\}$. Then for any initialization $w_0$, the iterates converge in direction to the max-margin solution almost surely, namely, $\lim_{t \to \infty} w_t / \|w_t\| = \hat{w} / \|\hat{w}\|$, where $\hat{w} = \arg\min_{w} \{\|w\| : y_i \langle w, x_i \rangle \ge 1 \text{ for all } i\}$ denotes the max-margin solution.
For simplicity, we write $w_t$ for the parameter at iteration $t$ in the noiseless setting, and $\tilde{w}_t$ for the parameter in the noisy setting. By the proposition above, both sequences converge in direction to max-margin classifiers on their respective training data points.
During the evaluation process, we focus on the 0-1 loss, where the population 0-1 loss is $\mathcal{L}^{01}(w) = \Pr_{(x, y)}\left[\mathrm{sign}(\langle w, x \rangle) \ne y\right]$. Based on the above assumptions and discussions, we state the following Theorem 3.1, indicating the different performances between the noiseless setting and the noisy setting.
Theorem 3.1.
We consider the above GMM regime with Assumptions [A1-A3]. Specifically, denote the noise level by $\eta$ and the mild overparameterization ratio by $d/n$. Then there exist absolute constants such that the following statements hold with high probability (the probability is taken over the training set and the randomness of the algorithm):

Under the noiseless setting, the max-margin classifier obtained from SGD has non-vacuous 0-1 loss.

Under the noisy setting, the max-margin classifier has vacuous 0-1 loss with a constant lower bound, for any training sample size $n$.

Under the noisy setting, if the learning rate is sufficiently small, there exists a time $t$ such that the early-stopped classifier $\tilde{w}_t$ has non-vacuous 0-1 loss.
Intuitively, Theorem 3.1 illustrates that although SGD leads to benign overfitting under noiseless regimes, it provably overfits when the labels are noisy. In particular, it incurs a constant error on noiseless data and hence also incurs a constant error on noisily labeled data, since label noise in the test set is independent of the algorithm. Furthermore, Theorem 3.1.3 shows that the overfitting is avoidable through early stopping. Therefore, it would be insufficient to consider only the interpolators as under the noiseless regimes, and further study of the early-stopping classifier is necessary.
One may doubt the fast convergence rates in Statements One and Three, which seem too good to be true. This phenomenon arises because we split the randomness of the training set and algorithm from the randomness of the test set. Note that the failure probability over the former is of small order, so the bound on the total 0-1 loss follows after a union bound. We refer to Cao et al. (2021); Wang and Thrampoulidis (2021) for similar types of bounds.
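For intuition, under the special case of identity covariance the population 0-1 loss of a linear classifier on this GMM admits a closed form. The snippet below is an assumed identity-covariance instantiation, not the theorem's general statement; it shows how label noise imposes a floor of roughly $\eta$ on the achievable 0-1 loss:

```python
import math

def population_01_loss(w_dot_mu, w_norm, eta=0.0):
    """Population 0-1 loss of x -> sign(<w, x>) under the GMM
    x = y * mu + xi with xi ~ N(0, I_d), when test labels are flipped
    with probability eta.  The clean error is Phi(-<w, mu> / ||w||),
    with Phi the standard normal CDF."""
    s = w_dot_mu / w_norm
    phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    # with prob 1 - eta the test label is clean, with prob eta it is flipped
    return (1.0 - eta) * phi(-s) + eta * phi(s)

# a direction orthogonal to mu gives loss 1/2; a well-aligned one approaches eta
uninformative = population_01_loss(0.0, 1.0, eta=0.1)
aligned = population_01_loss(3.0, 1.0, eta=0.1)
```

Here `aligned` is close to the noise floor $\eta = 0.1$, while `uninformative` equals $1/2$, illustrating why a constant gap above the floor (Statement Two) is a meaningful failure.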
Comparison against benign overfitting under heavy overparameterization: Previous work usually analyzes the GMM model under the heavy overparameterization regime $d \gg n$, e.g., (Cao et al., 2021; Chatterji and Long, 2021) or (Wang and Thrampoulidis, 2021). In comparison, our paper focuses on the mild overparameterization regime, where $d = \Theta(n)$. We note that this leads to a phase change: the interpolating model under noisy settings now provably overfits.
3.2 Experiment in Neural Networks: Overfitting in Noisy CIFAR10
In Section 3.1, we proved that under noisy-label regimes with mild overparameterization, early-stopped classifiers and interpolators perform differently. This section aims to verify whether the phenomenon occurs empirically on a real-world dataset. Specifically, we generate a noisy CIFAR10 dataset where each label is randomly flipped with probability $p$. We show that the test accuracy first increases and then dramatically decreases, demonstrating that the interpolator performs worse than the models in the middle of training.
Setup. The base dataset is CIFAR10, where each label is randomly flipped with probability $p$. We use ResNet18 and train the model with SGD and cosine learning rate decay. For each model, we train for 200 epochs, evaluate the validation accuracy, and plot the training accuracy and validation accuracy in Figure 3. More details can be found in the code.
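The label-flipping step of this setup can be sketched as follows. Symmetric noise that redraws a uniformly random different class is one common reading of "randomly flipped"; the paper's exact scheme may differ:

```python
import numpy as np

def flip_labels(labels, p, num_classes=10, seed=0):
    """Symmetric label noise: with probability p, replace each label
    by a uniformly random *different* class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    flip = rng.random(labels.shape[0]) < p
    # adding an offset in {1, ..., num_classes - 1} guarantees a new class
    offsets = rng.integers(1, num_classes, size=labels.shape[0])
    labels[flip] = (labels[flip] + offsets[flip]) % num_classes
    return labels

clean = np.arange(50000) % 10        # stand-in for the CIFAR10 label array
noisy = flip_labels(clean, p=0.4)
```

Training then proceeds on `noisy` while validation uses the untouched test set, so the flipped fraction directly controls the noise level in Figure 3.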
Figure 3 illustrates a phenomenon similar to the results for linear models under the mild overparameterization analysis (see Section 3.1). Precisely, the interpolator achieves suboptimal accuracy under the mild overparameterization regime, but we can still find a better classifier through early stopping. In Figure 3, this phenomenon is manifested as the validation accuracy curve first increasing and then decreasing. We also notice that the drop becomes sharper as the noise level increases.
There is another interesting phenomenon during the training process in Figure 3: the neural networks go through an oscillation phase between the increasing period and the decreasing period. That is to say, the training process can be roughly split into three phases:

Phase One (Climbing Phase). The training accuracy and the test accuracy both increase.

Phase Two (Oscillation Phase). The training accuracy and the test accuracy both oscillate.

Phase Three (Overfitting Phase). The training accuracy increases while the test accuracy decreases.
These three phases lead us to make the following conjecture on the neural network training process.
A conjecture on the training trajectory. Nagarajan and Kolter (2019) conjecture that in overparameterized deep networks, SGD finds a fit that is simple at a macroscopic level but also has many microscopic fluctuations. Stemming from this insight and the experimental observations, we conjecture that the processes of fitting at the macroscopic and microscopic levels can be separated under mild overparameterization regimes. Precisely, during the training process, SGD first fits the features at the macroscopic level, which leads to Phase One, where the training accuracy and the test accuracy both increase. Then the model oscillates in preparation for fitting the noise, leading to Phase Two, where the training accuracy and the test accuracy oscillate. Finally, the model oscillates to a proper position and starts to fit the noise, leading to poor generalization in Phase Three. We plot a sketch in Figure 4 to illustrate this process. Previous work mainly considers the last iterate in Phase Three, which may rely heavily on the noiseless assumption or the heavy overparameterization assumption. However, such analysis can rarely hold in practice since (a) realistic data usually contains heavy noise due to data collection and data poisoning, and (b) the heavy overparameterization regime becomes impossible as we collect increasingly more data points.
We propose the following experiments based on our conjecture.
3.3 Control Experiment: Avoid Overfitting in Noisy Label Regimes
Section 3.2 illustrates that noisy labels can indeed lead to overfitting on CIFAR10. This section aims to test whether removing label noise can avoid overfitting. Motivated by the conjecture in the previous section, we propose an experiment that prevents the model from fitting noisy labels by dropping hard-to-fit samples during training. The procedure of removing hard data points resembles self-training techniques (Guan et al., 2011; Huang et al., 2020), which train neural networks using the predictions from another trained network (referred to as the teacher network) and achieve better generalization than the teacher network. Therefore, our analysis and the three-phase conjecture in Section 3 provide an explanation for self-training methods. On the other hand, the gain achieved by self-training further validates our three-phase conjecture empirically.
Setup. For the self-training experiment, we consider two models, based on noisy CIFAR10 and ImageNet respectively, and name them Self-trained.
For noisy CIFAR10, we randomly flip each label with probability $p$ and train a ResNet18. Starting from epoch 132, in each iteration, we remove the hard data points whose (noisy) training label differs from the model's prediction.
For the ImageNet dataset, which naturally contains label noise under our conjecture, we train a ResNet101. Starting from epoch 89, in each iteration, we remove data points with incorrect top-5 predictions and continue to train the model with the remaining data points.
As a comparison, we also continue to train the models without removing hard data points, and name them Baseline.
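In the linear model of Section 3.1, one filtering step of this procedure reduces to a one-line mask; this is a sketch, and the experiments apply the analogous rule to network predictions:

```python
import numpy as np

def self_train_filter(x, y_noisy, w):
    """Drop 'hard' points whose (possibly noisy) label disagrees with the
    current model's prediction; training then continues on the rest."""
    keep = np.sign(x @ w) == y_noisy
    return x[keep], y_noisy[keep]

# toy check: a classifier w = e_1 keeps only the points it currently fits
w = np.array([1.0, 0.0])
x = np.array([[2.0, 0.0], [-1.0, 0.0], [3.0, 0.0]])
y_noisy = np.array([1, 1, -1])       # the second and third labels disagree
x_kept, y_kept = self_train_filter(x, y_noisy, w)
```

The filter is applied once the model has passed the climbing phase, so that disagreements are more likely to flag mislabeled points than unlearned ones.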
Result. We plot the training accuracy and the test accuracy of the Self-trained models and the baselines in Figure 5, where the training accuracy is calculated on the whole training set. For both models, the training accuracy increases. In baseline training, the validation accuracy decreases dramatically due to overfitting the label noise, as observed in Section 3.2. In self-training, however, this trend is effectively stopped, and the validation accuracy in our Self-trained experiments consistently increases compared to the early-stopped model. This is consistent with previous results on self-training, where researchers similarly use the model itself to distinguish possibly mislabeled data (Zhu et al., 2003; Huang et al., 2020).
Another interesting phenomenon is that the training accuracy (calculated on the whole training set) increases in the self-training experiments, which means the model is able to correct its own mistakes while training only on the data that it has labeled correctly. We leave the analysis for future work.
4 Challenges in Proving Theorem 3.1
This section provides more details on the three statements in Theorem 3.1 under the milder conditions given in Assumption 2. We also explain why existing analyses cannot be applied to our setup.
Assumption 2.
The following assumptions are more general:


The noise $\xi$ used in generating the feature is drawn from a sub-Gaussian distribution.

The signal-to-noise ratio satisfies a lower-bound condition.

The signal-to-noise ratio satisfies a stronger lower-bound condition.
We compare Assumption 1 and Assumption 2. Assumption [A4] is a relaxation of Assumption [A1], and Assumptions [A5, A6] can be obtained from Assumptions [A2, A3]. Therefore, we conclude that Assumption 2 is weaker than Assumption 1. We next introduce the generalized versions of the three statements in Theorem 3.1.
[Statement One] Under the noiseless setting, for a fixed confidence level, under Assumptions [A2, A4, A5], there exists a constant such that the following statement holds with high probability.

Previous results on noiseless GMM (e.g., Cao et al. (2021); Wang and Thrampoulidis (2021)) rely on the heavy overparameterization assumption $d \gg n$. In contrast, our results only require mild overparameterization and can be deployed in the $d = \Theta(n)$ regime. Therefore, the existing results cannot directly imply Statement One.
[Statement Two] Under the noisy regime with noise level $\eta$ and mild overparameterization ratio $d/n$, for a fixed confidence level, under Assumptions [A1, A3], there exists a constant such that the following statement holds with high probability.

Therefore, the interpolator has constant excess risk, given that $\eta$ and $d/n$ are both constant.
Previous results (e.g., Chatterji and Long (2021); Wang and Thrampoulidis (2021)) mainly focus on deriving non-vacuous bounds for noisy GMM, which also rely on the heavy overparameterization assumption $d \gg n$. Instead, our results show that the interpolator dramatically fails and suffers from a constant lower bound under mild overparameterization. Therefore, heavy overparameterization behaves differently from the mild overparameterization case. We finally remark that although it is still an open problem exactly where the phase change happens, we conjecture that realistic training procedures are closer to the mild overparameterization regime according to the experimental results.
[Statement Three] Under the noisy regime, suppose one runs the SGD update with initialization $w_0 = 0$ and a sufficiently small learning rate. For a fixed confidence level, under Assumptions [A2, A4, A6], the following statement holds with high probability.
Therefore, the above bound is non-vacuous under Assumption [A6]. One may wonder whether stability-based bounds (Bousquet and Elisseeff, 2002; Hardt et al., 2016) can be applied to the analysis, since the training objective is convex. However, such analysis may not apply due to a bad Lipschitz constant during the training process, and would only return vacuous bounds in this regime. Besides, previous results on convex optimization with one-pass SGD (e.g., Sekhari et al. (2021)) cannot be directly applied either, since most results for one-pass SGD are in-expectation bounds, while we provide a high-probability bound in Statement Three.
5 Conclusions and Discussions
In this work, we aim to understand why benign overfitting happens when training ResNets on CIFAR10 but fails on ImageNet. We start by identifying a phase change in the theoretical analysis of benign overfitting. We found that when the number of model parameters is of the same order as the number of data points, benign overfitting fails due to label noise. We conjecture that label noise leads to the different behaviors on CIFAR10 and ImageNet. We verify the conjecture by injecting label noise into CIFAR10 and adopting self-training on ImageNet. The results support our hypothesis.
Our work also leaves many questions unanswered. First, our theoretical and empirical evidence shows that realistic deep learning models may not work in the interpolating regime. Still, although there are more parameters than data points, the models generalize well. Understanding the implicit bias in deep learning when the model underfits remains open. A closely related topic is algorithmic stability (Bousquet and Elisseeff, 2002; Hardt et al., 2016); however, the benefit of overparameterization within the stability framework still requires future study. Second, the GMM provides a convenient vehicle for analysis, but how the number of parameters in the linear setup relates to that in a neural network remains unclear.
References
A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 242–252. External Links: Link Cited by: §3.1.
 Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. CoRR abs/2012.09816. External Links: Link, 2012.09816 Cited by: §1.
 Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 322–332. External Links: Link Cited by: §3.1.
 Label refinery: improving imagenet classification through label progression. CoRR abs/1805.02641. External Links: Link, 1805.02641 Cited by: §2.
 Benign overfitting in linear regression. Proceedings of the National Academy of Sciences 117 (48), pp. 30063–30070. Cited by: Table 1, §1, §2.
 Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. Advances in neural information processing systems 31. Cited by: §2.

Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116 (32), pp. 15849–15854. Cited by: §2.
 To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pp. 541–549. Cited by: §2.
 Stability and generalization. J. Mach. Learn. Res. 2, pp. 499–526. External Links: Link Cited by: §4, §5.

Identifying and eliminating mislabeled training instances. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and Eighth Innovative Applications of Artificial Intelligence Conference, AAAI 96, IAAI 96, Portland, Oregon, USA, August 4-8, 1996, Volume 1, W. J. Clancey and D. S. Weld (Eds.), pp. 799–805. External Links: Link Cited by: §2.
 Risk bounds for over-parameterized maximum margin classification on sub-Gaussian mixtures. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan (Eds.), pp. 8407–8418. External Links: Link Cited by: Table 1, §1, §1, §2, §3.1, §3.1, §4.
Finite-sample analysis of interpolating linear classifiers in the overparameterized regime. J. Mach. Learn. Res. 22, pp. 129:1–129:30. External Links: Link Cited by: Table 1, §1, §1, §2, §3.1, §3.1, §4.
 Label noise SGD provably prefers flat global minimizers. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan (Eds.), pp. 27449–27461. External Links: Link Cited by: §2.
 A model of double descent for high-dimensional binary linear classification. CoRR abs/1911.05822. External Links: Link, 1911.05822 Cited by: §2.
 BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
 Benign overfitting without linearity: neural network classifiers trained by gradient descent for noisy linear data. CoRR abs/2202.05928. External Links: Link, 2202.05928 Cited by: Table 1, §1, §1, §1, §2, §3.1.
 Identifying mislabeled training data with the aid of unlabeled data. Appl. Intell. 35 (3), pp. 345–358. External Links: Link, Document Cited by: §2, §3.3.
Characterizing implicit bias in terms of optimization geometry. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, J. G. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, pp. 1827–1836. External Links: Link Cited by: §1.
 Implicit bias of gradient descent on linear convolutional networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 9482–9491. External Links: Link Cited by: §1.
 Train faster, generalize better: stability of stochastic gradient descent. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, M. Balcan and K. Q. Weinberger (Eds.), JMLR Workshop and Conference Proceedings, Vol. 48, pp. 1225–1234. External Links: Link Cited by: §4, §5.
 The elements of statistical learning: data mining, inference, and prediction. Springer Series in Statistics, Springer. External Links: Link, Document, ISBN 9781489905192 Cited by: §2.
Surprises in high-dimensional ridgeless least squares interpolation. CoRR abs/1903.08560. External Links: Link, 1903.08560 Cited by: §2.

Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp. 770–778. External Links: Link, Document Cited by: §1, §1.
 Distilling the knowledge in a neural network. CoRR abs/1503.02531. External Links: Link, 1503.02531 Cited by: §1.
 Selfadaptive training: beyond empirical risk minimization. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 612, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: §1, §2, §3.3, §3.3.
 Impossibility of successful classification when useful features are rare and weak. Proceedings of the National Academy of Sciences 106 (22), pp. 8859–8864. Cited by: §2.
 On the multiple descent of minimumnorm interpolants and restricted lower isometry of kernels. In Conference on Learning Theory, COLT 2020, 912 July 2020, Virtual Event [Graz, Austria], J. D. Abernethy and S. Agarwal (Eds.), Proceedings of Machine Learning Research, Vol. 125, pp. 2683–2711. External Links: Link Cited by: §2.
 Just interpolate: kernel ”ridgeless” regression can generalize. CoRR abs/1808.00387. External Links: Link, 1808.00387 Cited by: §2.
 A precise highdimensional asymptotic theory for boosting and minl1norm interpolated classifiers. CoRR abs/2002.01586. External Links: Link, 2002.01586 Cited by: §2.
 High dimensional classification via empirical risk minimization: improvements and optimality. CoRR abs/1905.13742. External Links: Link, 1905.13742 Cited by: §2, §2.
 The generalization error of random features regression: precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics 75 (4), pp. 667–766. Cited by: §2.
 The role of regularization in classification of highdimensional noisy gaussian mixture. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 1318 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 6874–6883. External Links: Link Cited by: §2.
 Foundations of machine learning. MIT press. Cited by: §1.
 Harmless interpolation of noisy data in regression. IEEE J. Sel. Areas Inf. Theory 1 (1), pp. 67–83. External Links: Link, Document Cited by: §2.
 Stochastic gradient descent on separable data: exact convergence with a fixed learning rate. In The 22nd International Conference on Artificial Intelligence and Statistics, AISTATS 2019, 1618 April 2019, Naha, Okinawa, Japan, K. Chaudhuri and M. Sugiyama (Eds.), Proceedings of Machine Learning Research, Vol. 89, pp. 3051–3059. External Links: Link Cited by: Proposition 3.1.
 Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 814, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’AlchéBuc, E. B. Fox, and R. Garnett (Eds.), pp. 11611–11622. External Links: Link Cited by: §3.2.
 Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment 2021 (12), pp. 124003. Cited by: §2.
 Gaussian mixture models. In Encyclopedia of Biometrics, S. Z. Li and A. K. Jain (Eds.), pp. 659–663. External Links: Link, Document Cited by: §2.
 SGD: the role of implicit regularization, batchsize and multipleepochs. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 614, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan (Eds.), pp. 27422–27433. External Links: Link Cited by: §4.
 Evaluating machine accuracy on imagenet. In International Conference on Machine Learning, pp. 8634–8644. Cited by: §1, §2.
 The implicit bias of gradient descent on separable data. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30  May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §1.
 A modern maximumlikelihood theory for highdimensional logistic regression. Proceedings of the National Academy of Sciences 116 (29), pp. 14516–14525. Cited by: §2.

Benign overfitting in ridge regression
. arXiv preprint arXiv:2009.14286. Cited by: §2.  Benign overfitting in multiclass classification: all roads lead to interpolation. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 614, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan (Eds.), pp. 24164–24179. External Links: Link Cited by: §2.
 Benign overfitting in binary classification of gaussian mixtures. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 611, 2021, pp. 4030–4034. External Links: Link, Document Cited by: Table 1, §1, §1, §3.1, §3.1, §4, §4.

EM algorithms of gaussian mixture model and hidden markov model
. In Proceedings of the 2001 International Conference on Image Processing, ICIP 2001, Thessaloniki, Greece, October 710, 2001, pp. 145–148. External Links: Link, Document Cited by: §2.  Relabeling imagenet: from single to multilabels, from global to localized labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2340–2350. Cited by: §1, §2.
 Eliminating class noise in large datasets. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 2124, 2003, Washington, DC, USA, T. Fawcett and N. Mishra (Eds.), pp. 920–927. External Links: Link Cited by: §3.3.

Deep autoencoding gaussian mixture model for unsupervised anomaly detection
. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30  May 3, 2018, Conference Track Proceedings, External Links: Link Cited by: §2.  Benign overfitting of constantstepsize SGD for linear regression. In Conference on Learning Theory, COLT 2021, 1519 August 2021, Boulder, Colorado, USA, M. Belkin and S. Kpotufe (Eds.), Proceedings of Machine Learning Research, Vol. 134, pp. 4633–4635. External Links: Link Cited by: Table 1, §1, §2.
Appendix A Detailed Proofs
A.1 Proof of Theorem 2
The first statement in Theorem 3.1 states that the interpolator enjoys a non-vacuous bound in the noiseless setting, which is a direct corollary of the following Theorem 2. The proof of Theorem 2 mainly depends on bounding the projection of the classifier onto the optimal classification direction, which relies on an analysis of the classification margin.
Theorem 2 (restated).
Proof of Theorem 2.
We refer to the final classifier simply as the classifier throughout the proof for simplicity. Due to Proposition 3.1, the final classifier converges to the max-margin solution. Without loss of generality, let the final classifier be normalized to unit norm. Therefore, the following equation for the margin holds, since the classifier is the max-margin solution:
where the margin function denotes the margin of a classifier on the dataset. We next consider the margin of the optimal classifier. Note that this margin can be rewritten as
We note that the noise component is sub-Gaussian, by the definition of a sub-Gaussian random vector. Therefore, due to Claim B.1, we have a bound in which the last inequality is due to Assumption [A2].
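Since the extraction dropped the inline symbols, the standard margin notions this argument relies on can be written as follows; the symbols (w for a unit-norm linear classifier, (x_i, y_i) for the n training samples) are our own reconstruction, not the paper's original notation:

```latex
% Margin of a unit-norm linear classifier w on n training samples:
\gamma(w) \;=\; \min_{i \in [n]} \, y_i \langle w, x_i \rangle,
\qquad \|w\|_2 = 1.
% The max-margin solution that Proposition 3.1 refers to:
\hat{w} \;=\; \operatorname*{arg\,max}_{\|w\|_2 = 1} \gamma(w).
```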
We next bound the projection term via the above margin. On the one hand, we notice that, by the definition of the margin function,
(1) 
On the other hand, we rewrite the margin as
(2) 
The right-hand side can be bounded as
(3) 
where the final equality is due to Claim B.2. Therefore, combining Equations 1, 2, and 3 above, we bound the projection of the classifier onto the optimal direction as:
(4) 
where the last equality is due to Assumption [A5]. Rewriting Equation (4), we can then bound the 0-1 loss as follows for a given constant:
Due to Assumption [A2], and by an appropriate choice of the threshold, we have
The proof is done. ∎
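The last step uses a standard sub-Gaussian tail bound, which is presumably the content of the stripped inequality; the symbols σ, v, z, and t below are our notation:

```latex
% For a sigma-sub-Gaussian random vector z and any fixed direction v:
\Pr\bigl[\langle v, z \rangle \ge t\bigr]
\;\le\; \exp\!\Bigl(-\frac{t^{2}}{2\,\sigma^{2}\|v\|_{2}^{2}}\Bigr),
\qquad t \ge 0.
```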
A.2 Proof of Theorem 2
Theorem 2 shows that the interpolator's bound can be non-vacuous in noiseless regimes with mild overparameterization. However, things can be much different in noisy regimes. Statement Two proves a vacuous lower bound for interpolators in noisy settings, which can be derived from the following Theorem 2. The core of the proof lies in controlling the distance between the center of the wrongly labeled samples and a reference point, which further leads to an upper bound on the classifier's projection onto the optimal direction. One can then derive the corresponding 0-1 loss for the classifier.
Theorem 2 (restated).
Proof of Theorem 2.
We simplify the notation during the proof and fix a sign convention without loss of generality, distinguishing each sample's original label from its corrupted label.
Without loss of generality, we consider those samples whose original and corrupted labels disagree, indexed by a fixed subset. Consider the center point of these samples, which is
Due to the interpolation guarantee in Proposition 3.1, we derive that
(5) 
Case 1: the classifier misclassifies the center point. In this case, the 0-1 loss naturally admits a lower bound, since the classifier fails even at the center point.
Case 2: the classifier correctly classifies the center point. In this case, the relevant distance must be less than the distance from the center point
to its projection on the separating hyperplane that is perpendicular to the classifier
and passes through the origin. Formally, note that the noise term is independent and sub-Gaussian, and therefore, applying Claim B.2, we obtain a norm bound; besides, we derive a further bound by Claim B.3. Therefore,
(6) 
We next consider the corresponding test error of the classifier, where we consider the test error in the noiseless regime instead of the noisy regime. The two arguments are equivalent; we refer to Claim B.4 for more details. We rewrite Equation 6 with a slight abuse of notation, treating the resulting quantity as a fixed constant. Therefore,
(7) 
where the randomness is over a test point sampled from the Gaussian distribution, and Φ
denotes the CDF of the standard Gaussian random variable.
Case 1. If , then .
Case 2. If , note that if . Therefore,
Taking the above two cases together and denoting the resulting constant accordingly, we have
which is a constant lower bound under Assumption [A3]. The proof is done. ∎
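The Gaussian-CDF expression for the test error used above (Equation 7) can be sanity-checked numerically. The sketch below assumes the symmetric Gaussian mixture x = y·μ + z with z ~ N(0, I) and unit noise scale; the model, names, and parameters are our illustration rather than the paper's exact setup:

```python
import numpy as np
from math import erf, sqrt

def gaussian_cdf(x):
    # CDF of the standard Gaussian, via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def clean_test_error(w, mu):
    # For x = y * mu + z with z ~ N(0, I), sign(<w, x>) errs exactly when
    # y<w, x> < 0, and y<w, z> ~ N(0, ||w||^2), so the clean 0-1 error
    # equals Phi(-<w, mu> / ||w||).
    return gaussian_cdf(-np.dot(w, mu) / np.linalg.norm(w))

rng = np.random.default_rng(0)
d = 50
mu = np.zeros(d)
mu[0] = 2.0                              # class mean along the first axis
w = mu + 0.3 * rng.standard_normal(d)    # a perturbed classifier direction

# Monte-Carlo estimate of the same error for comparison.
n = 200_000
y = rng.choice([-1.0, 1.0], size=n)
x = y[:, None] * mu + rng.standard_normal((n, d))
empirical = np.mean(np.sign(x @ w) != y)
print(empirical, clean_test_error(w, mu))  # the two should closely agree
```

With 2·10^5 samples the Monte-Carlo standard error is below 10^-3, so the two numbers should match to about two decimal places.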
A.3 Proof of Theorem 2
Statement Two shows that the interpolator fails in the noisy regime with mild overparameterization. How can we derive a non-vacuous bound under such regimes? The key is early stopping. To show this, Statement Three provides a non-vacuous bound for early-stopped classifiers in noisy regimes, which is induced by the following Theorem 2.
Theorem 2 (restated).
To show the relationship between Theorem 2 and Theorem 3.1, one can directly use Assumptions [A2, A3] in Theorem 2 to reach the generalization bound in Theorem 3.1 (Statement Three). The derivation of Theorem 2 relies on an analysis of one-pass SGD, where we show that one-pass SGD is sufficient to reach a non-vacuous bound. The proof of Theorem 2, again, relies on bounding the projection term, but in a different way. Unlike the previous approaches, where we could directly impose a normalization assumption, the classifier here is trained, and we need to bound it first. We then define a surrogate classifier and show that (a) the surrogate classifier is close to the trained classifier for a sufficiently small learning rate, and (b) the surrogate classifier has a satisfactory projection onto the optimal direction. Bounding these terms together leads to the results.
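The one-pass SGD procedure analyzed here can be sketched as follows; the logistic loss, the step size, and the Gaussian-mixture data with flipped labels are our assumptions, since the extracted formulas are not recoverable:

```python
import numpy as np

def one_pass_sgd(X, y, lr=0.1):
    # One pass of SGD for a linear classifier w under the logistic loss,
    # visiting each training sample exactly once.
    n, d = X.shape
    w = np.zeros(d)
    for i in range(n):
        margin = y[i] * (X[i] @ w)
        # Gradient of log(1 + exp(-margin)) with respect to w.
        grad = -y[i] * X[i] / (1.0 + np.exp(margin))
        w -= lr * grad
    return w

rng = np.random.default_rng(1)
d, n = 20, 2000
mu = np.zeros(d)
mu[0] = 2.0                              # optimal classification direction
y = rng.choice([-1.0, 1.0], size=n)
X = y[:, None] * mu + rng.standard_normal((n, d))
flip = rng.random(n) < 0.1               # 10% label noise (the noisy regime)
y_noisy = np.where(flip, -y, y)

w = one_pass_sgd(X, y_noisy)
# One pass already aligns w with the optimal direction mu despite the noise.
corr = (w @ mu) / (np.linalg.norm(w) * np.linalg.norm(mu))
print(corr)
```

The point of the surrogate-classifier argument is that, for a small enough learning rate, a single pass already yields a classifier with a large projection onto the optimal direction, without interpolating the noisy labels.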
Proof of Theorem 2.
We abuse notation and write the classifier for the one returned by one-pass SGD. We first lower bound its projection onto the optimal classification direction. To achieve this goal, we bound the inner-product term and the norm term individually.
Before diving into the proof, we first introduce a surrogate classifier. From the definition, we have that, for the update step size,
(8) 
Therefore, we have that
(9) 
Bounding the inner-product term. We first bound the difference between the trained classifier and the surrogate classifier. We note that
To bound the above difference, the first step is to bound the first term. Since the noise is sub-Gaussian, we have that, with high probability,
(10) 
We then bound the second term. By the iteration in Equation 8, we have that
where the first equality is due to the iteration, and the last equality is due to an appropriate choice of the constant. Therefore,
(11) 
On the other hand, we show the bound for the surrogate term. We have
where we introduce a random variable that takes each of two values with the corresponding probabilities.
Due to Claim B.3, we obtain the first bound. Besides, since the noise is sub-Gaussian, we have that, with high probability,
where the additional term comes from the failure probability. In summary, we have that
(13) 
where the last equality follows from Assumption [A2].
We additionally note that Equation 14 holds by choosing a proper constant in Equation (11) (and in the corresponding choice of the step size).
Bounding the norm term. We next bound the norm of the trained classifier. Before that, we first bound the norm of the surrogate classifier. Note that
(15) 
where the last equality is due to Claim B.2, with an appropriate choice of the failure probability.
Therefore, we have
Note that according to the iteration in Equation 8, we have that
where the last equality follows from the choice of the step size.
Therefore, we bound the norm as
(16) 
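Claim B.2, invoked in the norm bounds above, is presumably the standard norm-concentration bound for sub-Gaussian vectors; in our own notation it reads:

```latex
% For a d-dimensional sigma-sub-Gaussian vector z and any delta in (0, 1),
% with probability at least 1 - delta (C an absolute constant):
\|z\|_{2} \;\le\; C\,\sigma\Bigl(\sqrt{d} + \sqrt{\log(1/\delta)}\Bigr).
```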
We next consider the probability over the test point. Note that, conditioning on the dataset and taking the probability over the test point, we have
where the noise component of the test point is sub-Gaussian. Plugging Equation 17 into the above equation, we have that, with high probability,
If , .
If , , which is large when .
Therefore, the above bound is non-vacuous under the stated condition.
Note that, for a given constant, under Assumption [A2], we have