
Realistic Deep Learning May Not Fit Benignly

06/01/2022
by   Kaiyue Wen, et al.
MIT

Studies on benign overfitting provide insights into the success of overparameterized deep learning models. In this work, we examine the benign overfitting phenomenon in real-world settings. We find that for tasks such as training a ResNet model on the ImageNet dataset, the model does not fit benignly. To understand why benign overfitting fails in the ImageNet experiment, we analyze previous benign overfitting models under a more restrictive setup where the number of parameters is not significantly larger than the number of data points. Under this mild overparameterization setup, our analysis identifies a phase change: unlike in the heavy overparameterization setting, benign overfitting can now fail in the presence of label noise. Our analysis explains our empirical observations and naturally leads to a simple technique, known as self-training, that can boost the model's generalization performance. Furthermore, our work highlights the importance of understanding implicit bias in underfitting regimes as a future direction.



1 Introduction

Modern deep learning models achieve good generalization performance even with more parameters than data points. This surprising phenomenon is referred to as benign overfitting, and it differs from the canonical learning regime where good generalization requires limiting the model complexity (Mohri et al., 2018). One widely accepted explanation for benign overfitting is that optimization algorithms benefit from implicit bias and find good solutions among the interpolating ones in overparameterized settings. The implicit bias can vary from problem to problem; examples include the min-norm solution in regression settings and the max-margin solution in classification settings (Gunasekar et al., 2018a; Soudry et al., 2018; Gunasekar et al., 2018b). These types of bias in optimization can further result in good generalization performance (Bartlett et al., 2020; Zou et al., 2021; Frei et al., 2022). These studies provide novel insights, yet they sometimes differ from deep learning practice: state-of-the-art models, despite being overparameterized, often do not interpolate the data points (e.g., He et al. (2016); Devlin et al. (2018)).

In this work, we first examine the existence of benign overfitting in realistic setups by testing whether ResNet (He et al., 2016) models can overfit data benignly under common experimental setups: image classification on CIFAR10 and ImageNet. Our results are shown in Figure 1. In particular, we trained ResNet18 on CIFAR10 for 200 epochs, and the model interpolates the training data. In addition, we trained ResNet101 on ImageNet for 160 epochs, as opposed to the common schedule that stops at 90 epochs. Surprisingly, we found that although benign overfitting happens on the CIFAR10 dataset, overfitting is not benign on the ImageNet dataset: the test loss increased as the model further fit the training set. This phenomenon cannot be explained by known analyses of benign overfitting for classification tasks, as no negative results had been established.

Motivated by the above observation, our work aims to understand the cause of the two different overfitting behaviors on ImageNet and CIFAR10, and to reconcile the empirical phenomenon with previous analyses of benign overfitting. Our first hint comes from the level of overparameterization. Previous results on benign overfitting in the classification setting usually require that the number of parameters far exceeds the sample size, i.e., p ≫ n, where p denotes the number of parameters and n denotes the training sample size (Wang and Thrampoulidis, 2021; Cao et al., 2021; Chatterji and Long, 2021; Frei et al., 2022). However, in practice many deep learning models fall in the mild overparameterization regime, where the number of parameters is only slightly larger than the number of samples despite overparameterization. In our case, the ImageNet training set has roughly 1.3 million samples, whereas ResNet101 has roughly 45 million parameters.

To close the gap, we study the overfitting behavior of classification models under the mild overparameterization setup where p = Θ(n) (this is sometimes referred to as the asymptotic regime). In particular, following Wang and Thrampoulidis (2021); Cao et al. (2021); Chatterji and Long (2021); Frei et al. (2022), we analyze the solution of stochastic gradient descent for Gaussian mixture models. We found that a phase change happens when we move from p ≫ n (studied in Wang and Thrampoulidis (2021)) to p = Θ(n). Unlike previous analyses, we show that benign overfitting now provably fails in the presence of label noise (see Table 1 and Figure 2). This aligns with our empirical findings, as ImageNet is known to suffer from mislabeling and multiple labels (Yun et al., 2021; Shankar et al., 2020).

More specifically, our analysis (see Theorem 3.1 for details) under the mild overparameterization (p = Θ(n)) setup supports the following statements, which align with our empirical observations in Figures 1 and 5:


  • When the labels are noiseless, benign overfitting still holds under similar conditions as in previous analyses.

  • When the labels are noisy, the interpolating solution can provably lead to a positive excess risk that does not diminish with the sample size.

  • When the labels are noisy, early stopping can provably lead to better generalization performance than interpolating models.

Furthermore, our analysis naturally leads to a simple technique, known as self-training, that can improve a model's generalization performance by removing "hard data points". Related techniques were studied in (Hinton et al., 2015; Allen-Zhu and Li, 2020; Huang et al., 2020). We confirm through our experiments that the gain in generalization is coupled with the level of label noise.

More importantly, the empirical and theoretical results in our work point out that modern models may not operate under the benign interpolating regime. Hence, it highlights the importance of characterizing implicit bias when the model, though mildly overparameterized, does not interpolate the train data.

Figure 1 (a: ImageNet, b: CIFAR10): Different Overfitting Behaviors between ImageNet and CIFAR10. We train ResNet101 on ImageNet and ResNet18 on CIFAR10 and plot the training and validation losses. ResNet101 overfits on ImageNet, while ResNet18 does not overfit on CIFAR10.
Figure 2 (a: Noiseless, b: Noisy): Phase Transition in Noisy Regimes. We train a linear classifier with SGD on simulated GMM data and plot the excess risk over a grid of settings. Brighter cells indicate smaller excess risk, i.e., no overfitting. Panel (a) shows that training on noiseless data does not cause overfitting, and panel (b) shows that training on noisy data causes overfitting under overparameterization. Moreover, overfitting consistently happens in the noisy regime, despite theoretical guarantees that the model overfits benignly under sufficiently heavy overparameterization.
|                              | classification, noiseless                              | classification, noisy                                                                   | regression, noisy                       |
| mild overparameterization    | Work (Ours)                                            | Fail (Ours)                                                                             | Fail (Bartlett et al., 2020)            |
| heavy overparameterization   | Work (Cao et al., 2021; Wang and Thrampoulidis, 2021)  | Work (Chatterji and Long, 2021; Frei et al., 2022; Wang and Thrampoulidis, 2021)        | Work (Bartlett et al., 2020; Zou et al., 2021) |
Table 1: For classification problems, previous work (Cao et al., 2021; Chatterji and Long, 2021) provided upper bounds under the heavy overparameterization setting. This paper focuses on a mild overparameterization setting and shows that label noise may break benign overfitting. A similar result was known for regression problems: for regression with noisy responses, Bartlett et al. (2020) show that the interpolator fails under mild overparameterization while it may work under heavy overparameterization, consistent with the double descent curve. However, the analysis under mild overparameterization for the classification task remained open.

2 Related Works

Although the benign overfitting phenomenon has been systematically studied both empirically (Belkin et al., 2018b, 2019; Nakkiran et al., 2021) and theoretically (see below), our work differs in that we aim to understand why benign overfitting fails in the classification setup. Among all theoretical studies, the closest to ours are (Chatterji and Long, 2021), (Wang et al., 2021) and (Cao et al., 2021), as the analyses are done in the classification setup with Gaussian mixture models. Our work provides a new result by moving from heavy overparameterization (p ≫ n) to mild overparameterization (p = Θ(n)). We find that this new condition leads to an analysis that is consistent with our empirical observations and can explain why benign overfitting fails. In short, compared to all previous studies on benign overfitting in classification tasks, we focus on identifying the condition that breaks benign overfitting in our ImageNet experiments. The comparison is summarized in Table 1.

More related works on benign overfitting: Researchers have made substantial efforts to generalize the notion of benign overfitting beyond the seminal work on linear regression (Bartlett et al., 2020), e.g., to variants of linear regression (Tsigler and Bartlett, 2020; Muthukumar et al., 2020; Zou et al., 2021), linear classification with different distributional assumptions (instead of the Gaussian mixture model) (Liang and Sur, 2020; Belkin et al., 2018a), kernel-based estimators (Liang and Rakhlin, 2018; Liang et al., 2020; Mei and Montanari, 2022), and neural networks (Frei et al., 2022).

Gaussian Mixture Model (GMM) refers to the data distribution in which the input is drawn from a Gaussian distribution whose center depends on the class. The model has been widely studied in hidden Markov models, anomaly detection, and many other fields (Hastie et al., 2001; Xuan et al., 2001; Reynolds, 2009; Zong et al., 2018). A closely related work is Jin (2009), which analyzes a lower bound on the excess risk of the Gaussian mixture model in the noiseless regime. However, that analysis cannot be directly extended to either the overparameterized or the noisy-label regime. Another closely related work (Mai and Liao, 2019) focuses on the GMM setting with mild overparameterization, but it requires noiseless labels and relies on a small signal-to-noise ratio on the input, and thus cannot be extended to our theoretical results.

Asymptotic (Mildly Overparameterized) Regimes. This paper considers asymptotic regimes where the ratio between the parameter dimension and the sample size is upper and lower bounded by absolute constants. Previous work studied this setup with a different focus from benign overfitting (e.g., double descent), from both the regression perspective (Hastie et al., 2019) and the classification perspective (Sur and Candès, 2019; Mai and Liao, 2019; Deng et al., 2019). We study the mild overparameterization case because it commonly occurs in realistic machine learning tasks, where the number of parameters exceeds the number of training samples, but not by orders of magnitude.

Label Noise: Label noise often exists in real-world datasets, e.g., in ImageNet (Yun et al., 2021; Shankar et al., 2020). However, its effect on generalization remains debated. Recently, Damian et al. (2021) claimed that in regression settings, label noise may prefer flat global minimizers and thus help generalization. Another line of work claims that label noise hurts the effectiveness of empirical risk minimization in classification regimes, and proposes to improve model performance through explicit regularization (Mignacco et al., 2020) or robust training (Brodley and Friedl, 1996; Guan et al., 2011; Huang et al., 2020). Among them, Bagherinezhad et al. (2018) study how to refine ImageNet labels using label propagation to improve model performance, demonstrating the importance of labels in ImageNet. This paper mainly falls in the latter branch, analyzing how label noise acts in the mild overparameterization regime.

3 Overfitting under Mild Overparameterization

We observed in Figure 1 that training ResNets on CIFAR10 and ImageNet can result in different overfitting behaviors. This discrepancy was not reflected in previous analyses of benign overfitting. In this section, we provide a theoretical analysis by studying the Gaussian mixture model under the mild overparameterization setup. Our analysis shows that mild overparameterization, together with label noise, can break benign overfitting.

3.1 Overparameterized Linear Classification

In this subsection, we study the generalization performance of linear models on the Gaussian mixture model (GMM). We assume that the linear models are obtained by solving logistic regression with stochastic gradient descent (SGD). This simplified model may help explain phenomena in neural networks, since previous work shows that neural networks converge to linear models as the width goes to infinity (Arora et al., 2019; Allen-Zhu et al., 2019).

We next introduce GMM under two setups of overparameterized linear classification: the noiseless regime and the noisy regime.

Noiseless Regime. Let y ∈ {−1, +1} denote the ground-truth label. The corresponding feature is generated by x = yμ + ε, where μ denotes the class mean and ε denotes noise drawn from a subGaussian distribution. We denote by S = {(x_i, y_i)}_{i=1}^n the dataset whose samples are generated by the above mechanism.

Noisy Regime with contamination rate η. For the noisy regime, we first generate noiseless data as above, and then replace each label y_i with a contaminated version ỹ_i. Formally, given the contamination rate η, the contaminated label satisfies ỹ_i = y_i with probability 1 − η and ỹ_i = −y_i with probability η. The returned dataset is S̃ = {(x_i, ỹ_i)}_{i=1}^n.
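To make the data model concrete, the following is a minimal sketch of how one could generate the noiseless and noisy GMM datasets. The symbols (class mean mu, noise scale sigma, contamination rate eta) follow our notational choices above, and the function names are ours, not the authors' released code.

```python
import numpy as np

def sample_gmm(n, p, mu, sigma=1.0, eta=0.0, seed=None):
    """Sample n points from the two-class GMM described above.

    Each clean label y is +1 or -1; the feature is x = y * mu + eps with
    isotropic Gaussian noise eps of scale sigma.  With contamination rate
    eta, each returned label is flipped independently.
    """
    rng = np.random.default_rng(seed)
    y_clean = rng.choice([-1.0, 1.0], size=n)          # ground-truth labels
    eps = sigma * rng.standard_normal((n, p))          # feature noise
    X = y_clean[:, None] * mu[None, :] + eps           # x_i = y_i * mu + eps_i
    flip = rng.random(n) < eta                         # which labels get corrupted
    y_noisy = np.where(flip, -y_clean, y_clean)
    return X, y_clean, y_noisy

# Example: mild overparameterization, p/n is a small constant.
n, p = 1000, 2000
mu = np.zeros(p); mu[0] = 3.0                          # signal along one coordinate
X, y, y_tilde = sample_gmm(n, p, mu, sigma=1.0, eta=0.1, seed=0)
```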

For simplicity, we assume that the data points in the train set are linearly separable in both noiseless and noisy regimes. This assumption holds almost surely under mild overparameterization. Besides, we make the following assumptions about the data distribution:

Assumption 1 (Assumptions on the data distribution).
  1. The noise ε used to generate the features is drawn from a Gaussian distribution.

  2. The signal-to-noise ratio satisfies a lower bound specified by a given constant.

  3. The ratio p/n between the number of parameters and the number of samples is a fixed constant.

The three assumptions are all crucial, but they can be made slightly more general (see Section 4). The first assumption [A1] stems from the need to derive a lower bound on the excess risk under the noisy regime. The second assumption [A2] is widely used in previous analyses (Chatterji and Long, 2021; Frei et al., 2022); for a smaller signal-to-noise ratio, the model may be unable to learn even in the noiseless regime, and the bounds become vacuous. The third assumption [A3] is the main difference from previous analyses: we consider mild overparameterization instead of heavy overparameterization (i.e., p ≫ n).

Training Procedure. We consider multi-pass SGD training with the logistic loss ℓ(w; (x, y)) = log(1 + exp(−y⟨w, x⟩)). During each epoch, each data point is visited exactly once in a uniformly random order (sampling without replacement). Formally, at the beginning of each epoch e we sample a uniformly random permutation σ_e of {1, …, n}; then at iteration t of that epoch, given the learning rate γ, we update w ← w − γ ∇ℓ(w; (x_i, y_i)) with i = σ_e(t).
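Below is a minimal sketch of this multi-pass SGD procedure on a linear classifier (logistic loss, zero initialization, a fresh random permutation per epoch). The learning rate, epoch count, and function names are illustrative assumptions, not the paper's actual training configuration.

```python
import numpy as np

def logistic_grad(w, x, y):
    """Gradient of log(1 + exp(-y * <w, x>)) with respect to w."""
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

def multipass_sgd(X, y, epochs=100, lr=0.01, seed=0):
    """Multi-pass SGD: every epoch visits each sample exactly once in random order."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)                      # zero initialization, as assumed in the text
    for _ in range(epochs):
        for i in rng.permutation(n):     # sampling without replacement within the epoch
            w -= lr * logistic_grad(w, X[i], y[i])
    return w
```

Training on the noisy labels ỹ instead of the clean labels y gives the noisy-regime classifier; by Proposition 3.1 below, running enough epochs drives both toward the corresponding max-margin directions.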

Under the above procedure, Proposition 3.1 shows that the classifier trained with multi-pass SGD under the GMM regime converges in direction to the max-margin interpolating classifier. For simplicity, this paper considers zero initialization, i.e., w_0 = 0.

Proposition 3.1 (Interpolator of multi-pass SGD under the GMM regime, from Nacson et al. (2019)).
Under the GMM regime with logistic loss, denote the iterates of multi-pass SGD by {w_t}. Then for any initialization w_0, the iterates converge in direction to the max-margin solution almost surely, namely,
lim_{t→∞} w_t / ‖w_t‖ = ŵ / ‖ŵ‖,
where ŵ denotes the max-margin solution on the training data.

For simplicity, we denote by w_t the parameter at iteration t in the noiseless setting and by w̃_t the parameter at iteration t in the noisy setting. By the proposition above, both converge in direction to the max-margin classifiers on their respective training data.

During evaluation, we focus on the 0-1 loss, where the population 0-1 loss of a classifier w is L_{0-1}(w) = P_{(x,y)}[sign(⟨w, x⟩) ≠ y]. Based on the above assumptions and discussions, we state the following Theorem 3.1, which highlights the different performance between the noiseless and the noisy setting.

Theorem 3.1.

We consider the above GMM regime under Assumptions [A1-A3]. Denote the noise level by η and the mild overparameterization ratio by c = p/n. Then there exist absolute constants such that the following statements hold with high probability, where the probability is taken over the training set and the randomness of the algorithm:

  1. Under the noiseless setting, the max-margin classifier obtained from SGD has non-vacuous 0-1 loss.

  2. Under the noisy setting, the max-margin classifier has vacuous 0-1 loss: its 0-1 loss is lower bounded by a positive constant for any training sample size n.

  3. Under the noisy setting, if the learning rate is sufficiently small, there exists a time t such that the early-stopped classifier w̃_t has non-vacuous 0-1 loss.

Intuitively, Theorem 3.1 shows that although SGD leads to benign overfitting in the noiseless regime, it provably overfits when the labels are noisy: the interpolator incurs constant error on clean test data, and hence also on noisily labeled test data, since label noise in the test set is independent of the algorithm. Furthermore, Statement 3 of Theorem 3.1 shows that the overfitting is avoidable through early stopping. Therefore, it is insufficient to consider only interpolators as in the noiseless regime; further study of the early-stopping classifier is necessary.

One may doubt the fast convergence rates in Statements One and Three, which may seem too good to be true. They arise because we separate the randomness of the training set and the algorithm from the randomness of the test point; combining the two contributions (e.g., via a union bound) gives the total 0-1 loss. We refer to Cao et al. (2021); Wang and Thrampoulidis (2021) for similar types of bounds.
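To make the evaluated quantity concrete, the population 0-1 loss defined before Theorem 3.1 can be estimated by Monte Carlo on clean test samples; under the Gaussian noise assumption [A1] it also admits a closed form, included below for comparison. The symbols follow our earlier notational choices, and the functions are an illustrative sketch rather than the paper's evaluation code.

```python
import numpy as np
from math import erf, sqrt

def zero_one_loss_mc(w, mu, sigma=1.0, n_test=200_000, seed=0):
    """Monte Carlo estimate of P[sign(<w, x>) != y] on clean GMM test data."""
    rng = np.random.default_rng(seed)
    p = mu.shape[0]
    y = rng.choice([-1.0, 1.0], size=n_test)
    X = y[:, None] * mu[None, :] + sigma * rng.standard_normal((n_test, p))
    return float(np.mean(np.sign(X @ w) != y))

def zero_one_loss_exact(w, mu, sigma=1.0):
    """Closed form under isotropic Gaussian noise: Phi(-<w, mu> / (sigma * ||w||))."""
    z = -float(np.dot(w, mu)) / (sigma * float(np.linalg.norm(w)))
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal CDF evaluated at z
```

The closed form follows because y⟨w, x⟩ = ⟨w, μ⟩ + y⟨w, ε⟩ and y⟨w, ε⟩ is Gaussian with standard deviation σ‖w‖, so the error probability depends only on the projection of w onto the signal direction relative to its norm.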

Comparison against benign overfitting under heavy overparameterization: Previous work usually analyzes the GMM model under the heavy overparameterization regime, e.g., requiring the dimension to grow much faster than the sample size (Cao et al., 2021; Chatterji and Long, 2021; Wang and Thrampoulidis, 2021). In comparison, our paper focuses on the mild overparameterization regime, where p = Θ(n). We note that this leads to a phase change: the interpolating model under noisy settings now provably overfits.

3.2 Experiment in Neural Networks: Overfitting in Noisy CIFAR10

In Section 3.1, we proved that under noisy label regimes with mild overparameterization, early-stopping classifiers and interpolators perform differently. This section aims to verify whether this phenomenon occurs empirically on a real-world dataset. Specifically, we generate a noisy CIFAR10 dataset in which each label is randomly flipped with a fixed probability. We show that the test accuracy first increases and then dramatically decreases, demonstrating that the interpolator performs worse than models obtained in the middle of training.

Figure 3 (panels a–d): Noisy CIFAR10 under mild overparameterization. In each experiment, the validation accuracy first increases and then dramatically decreases. This confirms Statements 2 and 3 of Theorem 3.1: the model overfits under noisy label regimes.

Setup. The base dataset is CIFAR10, where each label is randomly flipped with a fixed probability. We use ResNet18 and train with SGD and cosine learning rate decay. Each model is trained for 200 epochs; we evaluate the validation accuracy and plot the training and validation accuracy in Figure 3. More details can be found in the code.
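As a sketch of how the label corruption could be constructed, one can overwrite the `targets` attribute of torchvision's CIFAR10. This is our illustration, not the authors' released code; in particular, we assume a flip sends the label to a uniformly random different class.

```python
import numpy as np
from torchvision.datasets import CIFAR10

def make_noisy_cifar10(root, flip_prob, num_classes=10, seed=0):
    """Return a CIFAR10 train set whose labels are flipped with probability flip_prob.

    A flipped label is replaced by a uniformly random *different* class, so the
    effective contamination rate equals flip_prob.
    """
    rng = np.random.default_rng(seed)
    dataset = CIFAR10(root=root, train=True, download=True)
    targets = np.array(dataset.targets)
    flip = rng.random(len(targets)) < flip_prob
    # Draw a random offset in {1, ..., num_classes-1} so the new label always differs.
    offsets = rng.integers(1, num_classes, size=len(targets))
    dataset.targets = np.where(flip, (targets + offsets) % num_classes, targets).tolist()
    return dataset, flip          # `flip` marks which samples were corrupted
```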

Figure 3 shows a phenomenon similar to the mild overparameterization analysis for linear models (see Section 3.1). Precisely, the interpolator achieves suboptimal accuracy under the mild overparameterization regime, but a better classifier can still be found through early stopping. In Figure 3, this manifests as the validation accuracy curve first increasing and then decreasing. We also notice that the drop becomes sharper as the noise level increases.

There is another interesting phenomenon during training in Figure 3: the neural networks go through an oscillation phase between the increasing and the decreasing periods. That is to say, the training process can be roughly split into three phases:

  • Phase One (Climbing Phase). The training accuracy and the test accuracy both increase.

  • Phase Two (Oscillation Phase). The training accuracy and the test accuracy both oscillate.

  • Phase Three (Overfitting Phase). The training accuracy increases while the test accuracy decreases.

These three phases lead us to make the following conjecture on the neural network training process.

A conjecture on the training trajectory. Nagarajan and Kolter (2019) conjecture that in overparameterized deep networks, SGD finds a fit that is simple at a macroscopic level but also has many microscopic fluctuations. Starting from this insight and our experimental observations, we conjecture that the processes of fitting at the macroscopic level and at the microscopic level can be separated under mild overparameterization. Precisely, during training, SGD first fits the features at the macroscopic level, leading to Phase One, where the training accuracy and the test accuracy both increase. Then the model oscillates in preparation for fitting the noise, leading to Phase Two, where the training accuracy and the test accuracy oscillate. Finally, the model oscillates to a proper position and starts to fit the noise, leading to poor generalization in Phase Three. We sketch this process in Figure 4. Previous work mainly considers the last iterate in Phase Three, which may rely heavily on the noiseless assumption or the heavy overparameterization assumption. However, such analyses rarely hold in practice since (a) realistic data usually contains substantial noise due to data collection and data poisoning, and (b) the heavy overparameterization regime becomes unrealistic as we collect increasingly more data points.

We propose the following experiments based on our conjecture.

Figure 4 (a: Bayes optimal, b: Phase One, c: Phase Two, d: Phase Three): Three-Phase Phenomenon. The Bayes-optimal classifier is simple and smooth. In phase one, the classifier fits the data at a macroscopic level and ignores the hard-to-fit samples, which is already sufficient for good test error. In phase two, the classifier slowly starts to fit the hard-to-fit samples. In phase three, the classifier completely fits those samples, and the test error increases.

3.3 Control Experiment: Avoid Overfitting in Noisy Label Regimes

Section 3.2 illustrates that noisy labels can indeed lead to overfitting on CIFAR10. This section aims to test whether removing label noise can avoid overfitting. Motivated by the conjecture in the previous section, we propose an experiment that prevents the model from fitting noisy labels by dropping hard-to-fit samples during training. The procedure of removing hard data points resembles self-training techniques (Guan et al., 2011; Huang et al., 2020), which train a neural network using the predictions of another trained network (referred to as the teacher network) and achieve better generalization than the teacher. Therefore, our analysis and the three-phase conjecture in Section 3 provide an explanation for self-training methods. Conversely, the gain achieved by self-training further validates our three-phase conjecture empirically.

Setup. For the self-training experiment, we consider two models, one trained on noisy CIFAR10 and one on ImageNet, and refer to them as self-trained.

For noisy CIFAR10, we randomly flip each label with a fixed probability and train a ResNet18. Starting from epoch 132, in each iteration we remove the hard data points whose (noisy) training label differs from the model prediction.

For the ImageNet dataset, which by our conjecture naturally contains label noise, we train a ResNet101. Starting from epoch 89, in each iteration we remove data points whose labels do not appear in the model's top-5 predictions and continue to train on the remaining data points.

As a comparison, we also continue to train the models without removing hard data points; these runs are referred to as the baseline.
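A minimal sketch of the filtering step shared by both self-training runs follows: from the chosen epoch onward, samples whose (possibly noisy) training label disagrees with the model's current prediction are dropped from the update (top-1 agreement for CIFAR10, top-5 for ImageNet). The function name and signature are ours; this illustrates the described procedure and is not the authors' implementation.

```python
import torch

def filtered_train_step(model, optimizer, criterion, images, labels, topk=1):
    """One mini-batch update that skips samples the model currently disagrees with.

    With topk=1 a sample is kept only if the top prediction matches its (noisy)
    label; with topk=5 it is kept if the label appears among the top-5 classes.
    """
    model.train()
    with torch.no_grad():
        logits = model(images)
        top = logits.topk(topk, dim=1).indices           # (batch, topk) predicted classes
        keep = (top == labels.unsqueeze(1)).any(dim=1)   # agreement mask
    if keep.sum() == 0:
        return 0                                         # nothing left to train on
    optimizer.zero_grad()
    loss = criterion(model(images[keep]), labels[keep])
    loss.backward()
    optimizer.step()
    return int(keep.sum())
```

In the actual runs, this filtering only starts after the model has been trained normally for many epochs (132 for noisy CIFAR10, 89 for ImageNet), so the model's own predictions are reliable enough to flag likely mislabeled samples.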

Figure 5 (a: noisy CIFAR10 training, b: noisy CIFAR10 validation, c: ImageNet training, d: ImageNet validation): Self-Training Experiment on CIFAR10 and ImageNet. In both tasks, we first load the same pretrained model for the baseline and the self-trained model. We then train the self-trained model by iteratively removing hard data points, and train the baseline model using the labels from the training set. The self-trained model does not overfit, whereas the baseline model suffers from overfitting.

Result. We plot the training and test accuracy of the self-trained model and the baseline in Figure 5, where the training accuracy is computed on the whole training set. For both models, the training accuracy increases. In the baseline, the validation accuracy decreases dramatically due to overfitting the label noise, as observed in Section 3.2. In self-training, however, this trend is effectively stopped, and the validation accuracy consistently improves over the early-stopped model. This is consistent with previous results on self-training, where a similar approach uses the model itself to identify possibly mislabeled data (Zhu et al., 2003; Huang et al., 2020).

Another interesting phenomenon is that the training accuracy (computed on the whole training set) increases in the self-training experiments, which means the model is able to correct its own mistakes while training only on the data it has labeled correctly. We leave the analysis for future work.

4 Challenges in Proving Theorem 3.1

This section provides more details on the three statements in Theorem 3.1 under milder assumptions than those in Assumption 1 (collected in Assumption 2 below). We also explain why existing analyses cannot be applied to our setup.

Assumption 2.

The following assumptions are more general:

  1. [A4] The noise ε used to generate the features is drawn from a subGaussian distribution.

  2. [A5] The signal-to-noise ratio satisfies a relaxed lower-bound condition.

  3. [A6] The signal-to-noise ratio satisfies a second, relaxed lower-bound condition.

We compare Assumption 1 and Assumption 2: Assumption [A4] is a relaxation of Assumption [A1], and Assumptions [A5, A6] are implied by Assumptions [A2, A3]. Therefore, Assumption 2 is weaker than Assumption 1. We next introduce the generalized versions of the three statements in Theorem 3.1 (Statements One, Two, and Three below).

[Statement One] Under the noiseless setting, for a fixed failure probability δ, under Assumptions [A2, A4, A5], there exists a constant such that, with probability at least 1 − δ, the max-margin classifier has non-vacuous 0-1 loss.

Previous results on the noiseless GMM (e.g., Cao et al. (2021); Wang and Thrampoulidis (2021)) rely on heavy overparameterization assumptions. In contrast, our result only requires p = Θ(n) and can be deployed in the mild overparameterization regime. Therefore, the existing results do not directly imply Statement One.

[Statement Two] Under the noisy regime with noise level η and mild overparameterization ratio c, for a fixed failure probability δ, under Assumptions [A1, A3], there exists a constant such that, with probability at least 1 − δ, the 0-1 loss of the max-margin classifier w̃ is lower bounded by a positive constant. Therefore, w̃ has constant excess risk, given that η and c are both constants.

Previous results (e.g., Chatterji and Long (2021); Wang and Thrampoulidis (2021)) mainly focus on deriving non-vacuous bounds for the noisy GMM, which also rely on heavy overparameterization assumptions. Instead, our result shows that the interpolator dramatically fails and suffers a constant lower bound under the mild overparameterization regime. Therefore, heavy overparameterization behaves differently from mild overparameterization. We finally remark that, although it remains an open problem exactly where the phase change happens, we conjecture that realistic training procedures are closer to the mild overparameterization regime according to our experimental results.

[Statement Three] Under the noisy regime, suppose one runs the SGD update with zero initialization and a sufficiently small learning rate. Then for a fixed failure probability δ, under Assumptions [A2, A4, A6], with probability at least 1 − δ, the early-stopped classifier has non-vacuous 0-1 loss.

Therefore, the above bound is non-vacuous under Assumption [A6]. One may wonder whether stability-based bounds (Bousquet and Elisseeff, 2002; Hardt et al., 2016) could be applied, since the training objective is convex. However, such an analysis is problematic due to a poor Lipschitz constant during training, and the stability-based analysis may only return vacuous bounds in this regime. Besides, previous results on convex optimization with one-pass SGD (e.g., Sekhari et al. (2021)) cannot be directly applied either, since most results on one-pass SGD are in-expectation bounds, while we provide a high-probability bound.

5 Conclusions and Discussions

In this work, we aim to understand why benign overfitting happens when training ResNet on CIFAR10 but fails on ImageNet. We start by identifying a phase change in the theoretical analysis of benign overfitting: when the number of model parameters is of the same order as the number of data points, benign overfitting fails in the presence of label noise. We conjecture that label noise causes the different behaviors on CIFAR10 and ImageNet, and we verify the conjecture by injecting label noise into CIFAR10 and adopting self-training on ImageNet. The results support our hypothesis.

Our work also leaves many questions unanswered. First, our theoretical and empirical evidence shows that realistic deep learning models may not operate in the interpolating regime. Still, although there are more parameters than data points, the model generalizes well. Understanding the implicit bias in deep learning when the model underfits remains open. A closely related topic is algorithmic stability (Bousquet and Elisseeff, 2002; Hardt et al., 2016); however, the benefit of overparameterization within the stability framework still requires future study. Second, the GMM model provides a convenient setting for analysis, but how the number of parameters in the linear setup relates to that in a neural network remains unclear.

References

  • Z. Allen-Zhu, Y. Li, and Z. Song (2019) A convergence theory for deep learning via over-parameterization. In ICML 2019, Proceedings of Machine Learning Research, Vol. 97, pp. 242–252.
  • Z. Allen-Zhu and Y. Li (2020) Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. CoRR abs/2012.09816.
  • S. Arora, S. S. Du, W. Hu, Z. Li, and R. Wang (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In ICML 2019, Proceedings of Machine Learning Research, Vol. 97, pp. 322–332.
  • H. Bagherinezhad, M. Horton, M. Rastegari, and A. Farhadi (2018) Label refinery: improving ImageNet classification through label progression. CoRR abs/1805.02641.
  • P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler (2020) Benign overfitting in linear regression. Proceedings of the National Academy of Sciences 117 (48), pp. 30063–30070.
  • M. Belkin, D. J. Hsu, and P. Mitra (2018a) Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. Advances in Neural Information Processing Systems 31.
  • M. Belkin, D. Hsu, S. Ma, and S. Mandal (2019) Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences 116 (32), pp. 15849–15854.
  • M. Belkin, S. Ma, and S. Mandal (2018b) To understand deep learning we need to understand kernel learning. In International Conference on Machine Learning, pp. 541–549.
  • O. Bousquet and A. Elisseeff (2002) Stability and generalization. Journal of Machine Learning Research 2, pp. 499–526.
  • C. E. Brodley and M. A. Friedl (1996) Identifying and eliminating mislabeled training instances. In AAAI 96, pp. 799–805.
  • Y. Cao, Q. Gu, and M. Belkin (2021) Risk bounds for over-parameterized maximum margin classification on sub-Gaussian mixtures. In NeurIPS 2021, pp. 8407–8418.
  • N. S. Chatterji and P. M. Long (2021) Finite-sample analysis of interpolating linear classifiers in the overparameterized regime. Journal of Machine Learning Research 22, pp. 129:1–129:30.
  • A. Damian, T. Ma, and J. D. Lee (2021) Label noise SGD provably prefers flat global minimizers. In NeurIPS 2021, pp. 27449–27461.
  • Z. Deng, A. Kammoun, and C. Thrampoulidis (2019) A model of double descent for high-dimensional binary linear classification. CoRR abs/1911.05822.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • S. Frei, N. S. Chatterji, and P. L. Bartlett (2022) Benign overfitting without linearity: neural network classifiers trained by gradient descent for noisy linear data. CoRR abs/2202.05928.
  • D. Guan, W. Yuan, Y. Lee, and S. Lee (2011) Identifying mislabeled training data with the aid of unlabeled data. Applied Intelligence 35 (3), pp. 345–358.
  • S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro (2018a) Characterizing implicit bias in terms of optimization geometry. In ICML 2018, Proceedings of Machine Learning Research, Vol. 80, pp. 1827–1836.
  • S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro (2018b) Implicit bias of gradient descent on linear convolutional networks. In NeurIPS 2018, pp. 9482–9491.
  • M. Hardt, B. Recht, and Y. Singer (2016) Train faster, generalize better: stability of stochastic gradient descent. In ICML 2016, pp. 1225–1234.
  • T. Hastie, J. H. Friedman, and R. Tibshirani (2001) The elements of statistical learning: data mining, inference, and prediction. Springer Series in Statistics, Springer.
  • T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani (2019) Surprises in high-dimensional ridgeless least squares interpolation. CoRR abs/1903.08560.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR 2016, pp. 770–778.
  • G. E. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. CoRR abs/1503.02531.
  • L. Huang, C. Zhang, and H. Zhang (2020) Self-adaptive training: beyond empirical risk minimization. In NeurIPS 2020.
  • J. Jin (2009) Impossibility of successful classification when useful features are rare and weak. Proceedings of the National Academy of Sciences 106 (22), pp. 8859–8864.
  • T. Liang, A. Rakhlin, and X. Zhai (2020) On the multiple descent of minimum-norm interpolants and restricted lower isometry of kernels. In COLT 2020, Proceedings of Machine Learning Research, Vol. 125, pp. 2683–2711.
  • T. Liang and A. Rakhlin (2018) Just interpolate: kernel "ridgeless" regression can generalize. CoRR abs/1808.00387.
  • T. Liang and P. Sur (2020) A precise high-dimensional asymptotic theory for boosting and min-l1-norm interpolated classifiers. CoRR abs/2002.01586.
  • X. Mai and Z. Liao (2019) High dimensional classification via empirical risk minimization: improvements and optimality. CoRR abs/1905.13742.
  • S. Mei and A. Montanari (2022) The generalization error of random features regression: precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics 75 (4), pp. 667–766.
  • F. Mignacco, F. Krzakala, Y. Lu, P. Urbani, and L. Zdeborová (2020) The role of regularization in classification of high-dimensional noisy Gaussian mixture. In ICML 2020, Proceedings of Machine Learning Research, Vol. 119, pp. 6874–6883.
  • M. Mohri, A. Rostamizadeh, and A. Talwalkar (2018) Foundations of machine learning. MIT Press.
  • V. Muthukumar, K. Vodrahalli, V. Subramanian, and A. Sahai (2020) Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory 1 (1), pp. 67–83.
  • M. S. Nacson, N. Srebro, and D. Soudry (2019) Stochastic gradient descent on separable data: exact convergence with a fixed learning rate. In AISTATS 2019, Proceedings of Machine Learning Research, Vol. 89, pp. 3051–3059.
  • V. Nagarajan and J. Z. Kolter (2019) Uniform convergence may be unable to explain generalization in deep learning. In NeurIPS 2019, pp. 11611–11622.
  • P. Nakkiran, G. Kaplun, Y. Bansal, T. Yang, B. Barak, and I. Sutskever (2021) Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment 2021 (12), pp. 124003.
  • D. A. Reynolds (2009) Gaussian mixture models. In Encyclopedia of Biometrics, pp. 659–663.
  • A. Sekhari, K. Sridharan, and S. Kale (2021) SGD: the role of implicit regularization, batch-size and multiple-epochs. In NeurIPS 2021, pp. 27422–27433.
  • V. Shankar, R. Roelofs, H. Mania, A. Fang, B. Recht, and L. Schmidt (2020) Evaluating machine accuracy on ImageNet. In International Conference on Machine Learning, pp. 8634–8644.
  • D. Soudry, E. Hoffer, M. S. Nacson, and N. Srebro (2018) The implicit bias of gradient descent on separable data. In ICLR 2018.
  • P. Sur and E. J. Candès (2019) A modern maximum-likelihood theory for high-dimensional logistic regression. Proceedings of the National Academy of Sciences 116 (29), pp. 14516–14525.
  • A. Tsigler and P. L. Bartlett (2020) Benign overfitting in ridge regression. arXiv preprint arXiv:2009.14286.
  • K. Wang, V. Muthukumar, and C. Thrampoulidis (2021) Benign overfitting in multiclass classification: all roads lead to interpolation. In NeurIPS 2021, pp. 24164–24179.
  • K. Wang and C. Thrampoulidis (2021) Benign overfitting in binary classification of Gaussian mixtures. In ICASSP 2021, pp. 4030–4034.
  • G. Xuan, W. Zhang, and P. Chai (2001) EM algorithms of Gaussian mixture model and hidden Markov model. In ICIP 2001, pp. 145–148.
  • S. Yun, S. J. Oh, B. Heo, D. Han, J. Choe, and S. Chun (2021) Re-labeling ImageNet: from single to multi-labels, from global to localized labels. In CVPR 2021, pp. 2340–2350.
  • X. Zhu, X. Wu, and Q. Chen (2003) Eliminating class noise in large datasets. In ICML 2003, pp. 920–927.
  • B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, and H. Chen (2018) Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In ICLR 2018.
  • D. Zou, J. Wu, V. Braverman, Q. Gu, and S. M. Kakade (2021) Benign overfitting of constant-stepsize SGD for linear regression. In COLT 2021, Proceedings of Machine Learning Research, Vol. 134, pp. 4633–4635.

Appendix A Detailed Proofs

A.1 Proof of Statement One

The first statement in Theorem 3.1 says that the interpolator has a non-vacuous bound in the noiseless setting; it is a direct corollary of Statement One (the generalized version in Section 4). The proof mainly depends on bounding the projection of the classifier onto the signal direction, which relies on a calculation of the classification margin.


Proof of Statement One.

We denote the trained classifier by w during the proof for simplicity. By Proposition 3.1, the final classifier converges to the max-margin solution; without loss of generality, assume it is normalized. Since w is the max-margin solution, its margin is at least that of any other unit-norm classifier, in particular the oracle direction μ/‖μ‖:

where the quantity above denotes the margin of a classifier on the dataset. We next consider the margin of the oracle direction μ/‖μ‖. Note that this margin can be rewritten as

We note that the noise is subGaussian by the definition of a subGaussian random vector. Therefore, due to Claim B.1, we have the following bound, where the last inequality is due to Assumption [A2].

We next bound the projection of w onto μ via the above margin comparison. On the one hand, by the definition of the margin function,

(1)

On the other hand, we rewrite the margin as

(2)

The right hand side can be bounded as

(3)

where the final step is due to Claim B.2. Therefore, combining Equations 1, 2 and 3, we bound the projection of w onto μ as:

(4)

where the last step is due to Assumption [A5]. Rewriting Equation (4), we can bound the 0-1 loss as follows for a given constant. Due to Assumption [A2], and by an appropriate choice of this constant, we have

The proof is done. ∎

A.2 Proof of Statement Two

Statement One shows that the interpolator can have a non-vacuous bound in the noiseless regime with mild overparameterization. Things are much different in the noisy regime: Statement Two establishes a vacuous, constant lower bound for interpolators in the noisy setting. The core of the proof lies in controlling the distance between the center of the wrongly labeled samples and a reference point, which further leads to an upper bound on the projection of the classifier onto the signal direction. One can then derive the corresponding 0-1 loss of the classifier.


Proof of Statement Two.

For simplicity of notation, and without loss of generality after a suitable normalization, let y denote the original label and ỹ denote its corrupted version.

Without loss of generality, we consider the samples whose labels are flipped from positive to negative, indexed by a set I. Consider the center point of these samples, which is

By the interpolation property in Proposition 3.1, we derive that

(5)

Case 1. In this case, the 0-1 loss naturally has the claimed lower bound, since the classifier fails even at the center point of these samples.

Case 2. In this case, the classifier correctly classifies the center point. Therefore, the distance in question must be less than the distance from the corresponding point to its projection onto the separating hyperplane that is perpendicular to the classifier and passes through the origin. Formally,

Note that the noise is independent and subGaussian; therefore, applying Claim B.2, we obtain the first bound. Besides, we derive the second bound by Claim B.3. Therefore,

(6)

We next consider the corresponding test error of the classifier, where we consider the test error on the noiseless regime instead of the noisy regime; the two arguments are equivalent, and we refer to Claim B.4 for more details. We rewrite Equation 6, abusing notation to denote a fixed constant. Therefore,

(7)

where the noise is sampled from a Gaussian distribution, and Φ denotes the CDF of a standard Gaussian random variable.

Case 1. In the first case, the desired lower bound follows directly.

Case 2. In the second case, the bound follows from Equation 7.

Taking the above two cases together, we have

which is a constant lower bound under Assumption [A3]. The proof is done. ∎

A.3 Proof of Statement Three

Statement Two shows that the interpolator fails in the noisy regime with mild overparameterization. How can we derive a non-vacuous bound in this regime? The key is early stopping. Statement Three provides a non-vacuous bound for early-stopping classifiers in the noisy regime, which we prove below.


To relate this result to Theorem 3.1, one can directly apply Assumptions [A2, A3] to reach the generalization bound in Theorem 3.1 (Statement Three). The derivation relies on an analysis of one-pass SGD, where we show that one-pass SGD already suffices to reach a non-vacuous bound. The proof again relies on bounding the projection onto the signal direction, but in a different way. Unlike the previous approaches, where we could make a direct assumption on the classifier, the classifier here is obtained by a finite number of training steps, and we first need to control it. We therefore define a surrogate classifier and show that (a) the surrogate classifier is close to the trained classifier for a sufficiently small learning rate, and (b) the surrogate classifier has a sufficiently large projection onto the signal direction. Together, these bound the projection term and lead to the result.

Proof of Statement Three.

We abuse notation and let w denote the classifier returned by one-pass SGD. We first lower bound the projection of w onto the optimal classification direction. To achieve this, we bound the inner-product term and the norm term individually.

Before diving into the proof, we first introduce a surrogate classifier . From the definition, we have that for update step size ,

(8)

Therefore, we have that

(9)

Bounding the inner-product term. We first bound the difference between the trained classifier and the surrogate classifier. To bound this difference, the first step is to bound the noise term. Since the noise is subGaussian, we have with high probability that

(10)

We then bound the second term. By the iteration in Equation 8, we have the bound below, where the first equality is due to the iteration and the last step follows by setting the learning rate appropriately. Therefore,

(11)

Combining Equation 10 and Equation 11, we have that with probability at least ,

(12)

On the other hand, we bound the projection of the surrogate classifier. Note the following decomposition, where we introduce a random variable that takes each of two values with the corresponding probability.

Due to Claim B.3, we obtain the first bound. Besides, since the noise is subGaussian, we have the second bound with high probability, where the additional term comes from the failure probability. In summary, we have that

(13)

where the last step follows from Assumption [A2].

Combining Equation (12) and Equation (13), we have

(14)

We additionally note that Equation 14 holds by choosing a proper constant in Equation (11) (and in the choice of the learning rate).

Bounding the norm. We next bound the norm of the trained classifier. Before that, we first bound the norm of the surrogate classifier. Note that

(15)

where the last step is due to Claim B.2, by choosing the failure probability appropriately.

Therefore, we have

Note that, according to the iteration in Equation 8, we have the bound below, where the last equality follows from the choice of learning rate.

Therefore, we bound the norm as

(16)

Combining Equation 14 and Equation 16, we derive that with probability at least ,

(17)

We next consider the probability over the test point. Given the dataset and taking probability over the test point, we have the expression below, where the noise of the test point is subGaussian. Plugging Equation 17 into the above equation, we have that with high probability,

If , .

If , , which is large when .

Therefore, the above bound is non-vacuous under the stated condition.

Note that for given a constant , under Assumption [A2], we have