# Gradient Descent with Early Stopping is Provably Robust to Label Noise for Overparameterized Neural Networks

Modern neural networks are typically trained in an over-parameterized regime where the number of model parameters far exceeds the size of the training data. Due to over-parameterization these neural networks in principle have the capacity to (over)fit any set of labels including pure noise. Despite this high fitting capacity, somewhat paradoxically, neural network models trained via first-order methods continue to predict well on yet unseen test data. In this paper we take a step towards demystifying this phenomenon. Specifically, we show that first-order methods such as gradient descent are provably robust to noise/corruption on a constant fraction of the labels despite over-parameterization under a rich dataset model. In particular: i) First, we show that in the first few iterations, where the updates are still in the vicinity of the initialization, these algorithms only fit to the correct labels, essentially ignoring the noisy labels. ii) Secondly, we prove that to start to overfit to the noisy labels these algorithms must stray rather far from the initial model, which can only occur after many more iterations. Together, these show that gradient descent with early stopping is provably robust to label noise and shed light on the empirical robustness of deep networks as well as commonly adopted heuristics to prevent overfitting.


## 1 Introduction

### 1.1 Motivation

Deep neural networks (DNNs) are ubiquitous in a growing number of domains ranging from computer vision to healthcare. State-of-the-art DNN models are typically overparameterized and contain more parameters than the size of the training dataset. It is well understood that in this overparameterized regime, DNNs are highly expressive and have the capacity to (over)fit arbitrary training datasets including pure noise [56]. Mysteriously, however, neural network models trained via simple algorithms such as stochastic gradient descent continue to predict well on yet unseen test data. In such overparameterized scenarios there may be infinitely many globally optimal network parameters consistent with the training data, so the key challenge is to understand which network parameters (stochastic) gradient descent converges to and what their properties are. Indeed, a recent series of papers [52, 56, 16] suggests that solutions found by first-order methods tend to have favorable generalization properties. As DNNs begin to be deployed in safety-critical applications, the need for a foundational understanding of their noise robustness and their unique prediction capabilities intensifies.

This paper focuses on an intriguing phenomenon: overparameterized neural networks are surprisingly robust to label noise when first-order methods with early stopping are used to train them [25]. To observe this phenomenon consider Figure 1 where we perform experiments on the MNIST data set. Here, we corrupt a fraction of the labels of the training data by assigning their label uniformly at random. We then fit a four-layer model via stochastic gradient descent and plot various performance metrics in Figures 1(a) and 1(b). Figure 1(a) (blue curve) shows that indeed with a sufficiently large number of iterations the neural network does in fact perfectly fit the corrupted training data. However, Figure 1(a) also shows that such a model does not generalize to the test data (yellow curve) and the accuracy with respect to the ground truth labels degrades (orange curve). These plots clearly demonstrate that the model overfits with many iterations. In Figure 1(b) we repeat the same experiment but this time stop the updates after a few iterations (i.e. use early stopping). In this case the train accuracy degrades linearly (blue curve). However, perhaps unexpectedly, the test accuracy (yellow curve) remains high even with a significant amount of corruption. This suggests that with early stopping the model does not overfit and generalizes to new test data. Even more surprisingly, the train accuracy (orange curve) with respect to the ground truth labels also stays high even when a large fraction of the labels are corrupted (related empirical observations are made by [25] and [44]). That is, with early stopping overparameterized neural networks even correct the corrupted labels! These plots collectively demonstrate that overparameterized neural networks when combined with early stopping have unique generalization and robustness capabilities. As we detail further in Section 4 this phenomenon holds (albeit less pronounced) for richer data models and architectures.

This paper aims to demystify the surprising robustness of overparameterized neural networks when early stopping is used. We show that gradient descent is indeed provably robust to noise/corruption on a constant fraction of the labels in such over-parametrized learning scenarios. In particular, under a fairly expressive dataset model and focusing on one-hidden layer networks, we show that after a few iterations (a.k.a. early stopping), gradient descent finds a model (i) that is within a small neighborhood of the point of initialization and (ii) only fits to the correct labels essentially ignoring the noisy labels. We complement these findings by proving that if the network is trained to overfit to the noisy labels, then the solution found by gradient descent must stray rather far from the initial model. Together, these results highlight the key features of a solution that generalizes well vs a solution that fits well.

Our theoretical results further highlight the role of the distance between final and initial network weights as a key feature that determines noise robustness vs. overfitting. This is inherently connected to the commonly used early stopping heuristic for DNN training as this heuristic helps avoid models that are too far from the point of initialization. In the presence of label noise, we show that gradient descent implicitly ignores the noisy labels as long as the model parameters remain close to the initialization. Hence, our results help explain why early stopping improves robustness and helps prevent overfitting. Under proper normalization, the required distance between the final and initial network and the predictive accuracy of the final network are independent of the size of the network, such as the number of hidden nodes. Our extensive numerical experiments corroborate our theory and verify the surprising robustness of DNNs to label noise. Finally, we would like to note that while our results show that solutions found by gradient descent are inherently robust to label noise, specialized techniques such as penalization or sample reweighting are known to further improve robustness. Our theoretical framework may enable a more rigorous understanding of the benefits of such heuristics when training overparameterized models.

### 1.2 Prior Art

Our work is connected to recent advances on theory for deep learning as well as heuristics and theory surrounding outlier robust optimization.

Robustness to label corruption: DNNs have the ability to fit to pure noise [56], however they are also empirically observed to be highly resilient to label noise and generalize well despite large corruption [44]. In addition to early stopping, several heuristics have been proposed to specifically deal with label noise [42, 37, 57, 47, 31, 27]. See also [23, 38, 43, 48] for additional work on dealing with label noise in classification tasks. When learning from pairwise relations, noisy labels can be connected to graph clustering and community detection problems [14, 54, 1]. Label noise is also connected to outlier robustness in regression which is a traditionally well-studied topic. In the context of robust regression and high-dimensional statistics, much of the focus is on regularization techniques to automatically detect and discard outliers by using tools such as penalization [17, 33, 6, 36, 10, 15, 22]. We would also like to note that there is an interesting line of work that focuses on developing robust algorithms for corruption not only in the labels but also input data [19, 41, 32]. Finally, noise robustness is particularly important in safety critical domains. Noise robustness of neural nets has been empirically investigated by Hinton and coauthors in the context of automated medical diagnosis [25].

Overparameterized neural networks: Intriguing properties and benefits of overparameterized neural networks have been the focus of a growing list of publications [56, 49, 12, 18, 4, 29, 53, 58, 51, 11]. A recent line of work [34, 2, 3, 21, 59, 20, 39] shows that overparameterized neural networks can fit the data with random initialization if the number of hidden nodes is polynomially large in the size of the dataset. Recently in [40] we showed that this conclusion continues to hold with more modest amounts of overparameterization, as soon as the number of parameters of the model exceeds the square of the size of the training data set. This line of work however is not informative about the robustness of the trained network against corrupted labels. Indeed, such theory predicts that (stochastic) gradient descent will eventually fit the corrupted labels. In contrast, our focus here is not on finding a global minimum, but rather a solution that is robust to label corruption. In particular, we show that with early stopping we fit to the correct labels without overfitting to the corrupted training data. Our result also differs from this line of research in another way. The key property utilized in this research area is that the Jacobian of the neural network is well-conditioned at a random initialization if the dataset is sufficiently diverse (e.g. if the points are well-separated). In contrast, in our model the Jacobian is inherently low-rank with the rank of the Jacobian corresponding to different clusters/classes within the dataset. We harness this low-rank nature to prove that gradient descent is robust to label corruptions. We further utilize this low-rank structure to explain why neural networks can work with much more modest amounts of overparameterization where the number of parameters in the model exceeds the number of clusters raised to the fourth power and is independent of the number of data points.
Furthermore, our numerical experiments verify that the Jacobian matrices of real datasets (such as CIFAR10) indeed exhibit low-rank structure. This is closely related to the observations on the Hessian of deep networks which is empirically observed to be low-rank [45]. We would also like to note that the importance of the Jacobian for overparameterized neural network analysis has also been noted by other papers including [39, 49, 21] and also [30, 16] which investigate the optimization landscape and properties of SGD for training neural networks. A question that is as important as understanding the convergence behavior of optimization algorithms for overparameterized models is understanding their generalization capabilities. This is the subject of a few interesting recent papers [5, 7, 24, 50, 13, 8, 35, 9]. While in this paper we do not tackle generalization in the traditional sense, we do show that solutions found by gradient descent are robust to label noise/corruption which demonstrates their predictive capabilities and in turn suggests better generalization.

### 1.3 Models

We first describe the dataset model used in our theoretical results. In this model we assume that the input samples come from $K$ clusters which are located on the unit Euclidean ball in $\mathbb{R}^d$. We also assume our data set consists of $\bar{K} \le K$ classes where each class can be composed of multiple clusters. We consider a deterministic data set with $n$ samples with roughly balanced clusters each consisting of on the order of $n/K$ samples (this is for ease of exposition rather than a particular challenge arising in the analysis). Finally, while we allow for multiple classes, in our model we assume the labels are scalars and take values in the $[-1, 1]$ interval. We formally define our dataset model below and provide an illustration in Figure 2.

###### Definition 1.1 (Clusterable dataset)

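Consider a data set of size $n$ consisting of input/label pairs $\{(x_i, y_i)\}_{i=1}^n \in \mathbb{R}^d \times \mathbb{R}$. We assume the input data have unit Euclidean norm and originate from $K$ clusters with the $\ell$th cluster containing $n_\ell$ data points. We assume the number of points originating from each cluster is well-balanced in the sense that $c_{low}\frac{n}{K} \le n_\ell \le c_{up}\frac{n}{K}$ with $c_{low}$ and $c_{up}$ two numerical constants obeying $0 < c_{low} < c_{up}$. We use $\{c_\ell\}_{\ell=1}^K \subset \mathbb{R}^d$ to denote the cluster centers which are distinct unit Euclidean norm vectors. We assume the input data points $x$ that belong to the $\ell$-th cluster obey

$$\|x - c_\ell\|_{\ell_2} \le \varepsilon_0,$$

with $\varepsilon_0 > 0$ denoting the input noise level.

We assume the labels $y_i$ belong to one of $\bar{K} \le K$ classes. Specifically, we assume $y_i \in \{\alpha_1, \alpha_2, \ldots, \alpha_{\bar{K}}\}$ with $\{\alpha_\ell\}_{\ell=1}^{\bar{K}} \in [-1, 1]$ denoting the labels associated with each class. We assume all the elements of the same cluster belong to the same class and hence have the same label. However, a class can contain multiple clusters. Finally, we assume the labels are separated in the sense that

$$|\alpha_r - \alpha_s| \ge \delta \quad \text{for} \quad r \neq s, \tag{1.1}$$

with $\delta > 0$ denoting the class separation.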

In the data model above $\{c_\ell\}_{\ell=1}^K$ are the cluster centers that govern the input distribution. We note that in this model different clusters can be assigned to the same label. Hence, this setup is rich enough to model data which is not linearly separable: e.g. over , we can assign cluster centers and to label and cluster centers and to label . Note that the maximum number of classes is dictated by the separation $\delta$. In particular, we can have at most $\bar{K} \le \frac{2}{\delta} + 1$ classes. We remark that this model is related to the setup of [34] which focuses on providing polynomial guarantees for learning shallow networks. Finally, note that we need some sort of separation between the cluster centers to distinguish them. While Definition 1.1 doesn't specify such a separation explicitly, Definition 2.1 establishes a notion of separation in terms of how well a neural net can distinguish the cluster centers. Next, we introduce our noisy/corrupted dataset model.

###### Definition 1.2 ((ρ,ε0,δ) corrupted dataset)

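Let $\{(x_i, \tilde{y}_i)\}_{i=1}^n$ be an $(\varepsilon_0, \delta)$ clusterable dataset with $\alpha_1, \alpha_2, \ldots, \alpha_{\bar{K}}$ denoting the $\bar{K}$ possible class labels. A $(\rho, \varepsilon_0, \delta)$ noisy/corrupted dataset $\{(x_i, y_i)\}_{i=1}^n$ is generated from $\{(x_i, \tilde{y}_i)\}_{i=1}^n$ as follows. For each cluster $1 \le \ell \le K$, at most $\rho n_\ell$ of the labels associated with that cluster (which contains $n_\ell$ points) are assigned to another label value chosen from $\{\alpha_\ell\}_{\ell=1}^{\bar{K}}$. We shall refer to the initial labels $\{\tilde{y}_i\}_{i=1}^n$ as the ground truth labels.

We note that this definition allows for a fraction $\rho$ of corruptions in each cluster.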
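To make Definitions 1.1 and 1.2 concrete, the following is a minimal NumPy sketch of one way to sample such a dataset. It is an illustration only: the function names, the random choice of cluster centers, and the particular perturbation scheme (a random offset of length at most $\varepsilon_0/2$ followed by renormalization) are assumptions made here, not part of the paper's model, which only requires unit-norm inputs within distance $\varepsilon_0$ of their center and at most a fraction $\rho$ of corrupted labels per cluster.

```python
import numpy as np

def make_clusterable_dataset(centers, n_per_cluster, eps0, cluster_labels, rng):
    """Draw unit-norm inputs within distance eps0 of each unit-norm cluster center.

    centers: (K, d) array with unit-norm rows.
    cluster_labels: (K,) array giving the class label of each cluster.
    Returns inputs X (n, d), ground-truth labels (n,), and cluster ids (n,).
    """
    K, d = centers.shape
    cluster_id = np.repeat(np.arange(K), n_per_cluster)
    # Random unit directions scaled to length at most eps0 / 2.
    u = rng.standard_normal((cluster_id.size, d))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    t = 0.5 * eps0 * rng.uniform(size=(cluster_id.size, 1))
    X = centers[cluster_id] + t * u
    # Renormalizing moves each point by at most t, so ||x - c|| <= 2t <= eps0.
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    return X, cluster_labels[cluster_id], cluster_id

def corrupt_labels(y_true, cluster_id, alphas, rho, rng):
    """Reassign at most a fraction rho of the labels in every cluster."""
    y = y_true.copy()
    for l in np.unique(cluster_id):
        idx = np.flatnonzero(cluster_id == l)
        m = int(np.floor(rho * idx.size))
        for i in rng.choice(idx, size=m, replace=False):
            # Pick a different class label for the corrupted sample.
            y[i] = rng.choice(alphas[alphas != y_true[i]])
    return y

rng = np.random.default_rng(0)
centers = rng.standard_normal((4, 10))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
alphas = np.array([-1.0, 1.0])                 # two classes, separation delta = 2
cluster_labels = alphas[[0, 1, 0, 1]]          # each class contains two clusters
X, y_true, cluster_id = make_clusterable_dataset(centers, 25, 0.1, cluster_labels, rng)
y_noisy = corrupt_labels(y_true, cluster_id, alphas, rho=0.2, rng=rng)
```

Here every class is made of two clusters, so the example also illustrates that a class may span several clusters.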

Network model: We will study the ability of neural networks to learn this corrupted dataset model. To proceed, let us introduce our neural network model. We consider a network with one hidden layer that maps $\mathbb{R}^d$ to $\mathbb{R}$. Denoting the number of hidden nodes by $k$, this network is characterized by an activation function $\phi$, input weight matrix $W \in \mathbb{R}^{k \times d}$ and output weight vector $v \in \mathbb{R}^k$. In this work, we will fix the output $v$ to be a unit vector where half the entries are $1/\sqrt{k}$ and the other half are $-1/\sqrt{k}$ to simplify exposition (if the number of hidden units is odd we set one entry of $v$ to zero). We will only optimize over the weight matrix $W$, which contains most of the network parameters and will be shown to be sufficient for robust learning. We will also assume $\phi$ has bounded first and second order derivatives, i.e. $|\phi'(z)|, |\phi''(z)| \le \Gamma$ for all $z$. The network's prediction at an input sample $x$ is given by

$$x \mapsto f(W, x) = v^T \phi(Wx), \tag{1.2}$$

where the activation function $\phi$ applies entrywise. Given a dataset $\{(x_i, y_i)\}_{i=1}^n$, we shall train the network by minimizing the empirical risk over the training data via a quadratic loss

$$\mathcal{L}(W) = \frac{1}{2}\sum_{i=1}^{n} \left(y_i - f(W, x_i)\right)^2. \tag{1.3}$$

In particular, we will run gradient descent with a constant learning rate $\eta$, starting from a random initialization $W_0$, via the following updates

$$W_{\tau+1} = W_\tau - \eta \nabla \mathcal{L}(W_\tau). \tag{1.4}$$
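The model (1.2), loss (1.3) and update (1.4) can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' code: the softplus activation is an assumption chosen here because its first and second derivatives are bounded, and the function names are hypothetical.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)

def softplus_prime(z):
    # Derivative of softplus is the logistic sigmoid, written stably via tanh.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def make_output_weights(k):
    """Fixed output layer: half the entries +1/sqrt(k), half -1/sqrt(k), zero if k is odd."""
    half = k // 2
    v = np.zeros(k)
    v[:half] = 1.0 / np.sqrt(k)
    v[half:2 * half] = -1.0 / np.sqrt(k)
    return v

def predict(W, X, v):
    """f(W, x) = v^T phi(W x) for every row x of X."""
    return softplus(X @ W.T) @ v

def loss(W, X, y, v):
    """Quadratic loss (1.3)."""
    r = predict(W, X, v) - y
    return 0.5 * r @ r

def grad(W, X, y, v):
    """Gradient of (1.3) with respect to W: sum_i r_i (v * phi'(W x_i)) x_i^T."""
    Z = X @ W.T                       # (n, k) pre-activations
    r = softplus(Z) @ v - y           # (n,) residuals
    return ((r[:, None] * softplus_prime(Z)) * v[None, :]).T @ X

def gradient_descent(W0, X, y, v, eta, num_iters):
    """Run the update (1.4) for a fixed number of iterations (early stopping)."""
    W = W0.copy()
    for _ in range(num_iters):
        W = W - eta * grad(W, X, y, v)
    return W
```

Only `W` is trained, matching the setup above; stopping after a small `num_iters` plays the role of early stopping.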

## 2 Main results

Throughout, $\|\cdot\|$ denotes the largest singular value of a given matrix. The notation denotes that a certain identity holds up to a fixed numerical constant. Also, , , , etc. represent numerical constants.

### 2.1 Robustness of neural network to label noise with early stopping

Our main result shows that overparameterized neural networks, when trained via gradient descent using early stopping, are fairly robust to label noise. The ability of neural networks to learn from the training data, even without label corruption, naturally depends on the diversity of the input training data. Indeed, if two input data points are nearly the same but have different uncorrupted labels, reliable learning is difficult. We will quantify this notion of diversity via a notion of condition number related to a covariance matrix involving the activation $\phi$ and the cluster centers $\{c_\ell\}_{\ell=1}^K$.

###### Definition 2.1 (Neural Net Cluster Covariance and Condition Number)

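Define the matrix of cluster centers

$$C = [c_1 \ \ldots \ c_K]^T \in \mathbb{R}^{K \times d}.$$

Let $g \sim \mathcal{N}(0, I_d)$. Define the neural net covariance matrix $\Sigma(C)$ as

$$\Sigma(C) = (CC^T) \odot \mathbb{E}_g\left[\phi'(Cg)\phi'(Cg)^T\right].$$

Here $\odot$ denotes the elementwise product. Also denote the minimum eigenvalue of $\Sigma(C)$ by $\lambda(C)$ and define the following condition number associated with the cluster centers $C$

$$\kappa(C) = \sqrt{\frac{d}{K}}\,\frac{\|C\|}{\lambda(C)}.$$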

One can view $\Sigma(C)$ as an empirical kernel matrix associated with the network where the kernel is given by $\mathcal{K}(c_i, c_j) = \Sigma_{ij}(C)$. Note that $\Sigma(C)$ is trivially rank deficient if there are two cluster centers that are identical. In this sense, the minimum eigenvalue of $\Sigma(C)$ will quantify the ability of the neural network to distinguish between distinct cluster centers. Therefore, one can think of $\kappa(C)$ as a condition number associated with the neural network which characterizes the distinctness/diversity of the cluster centers. The more distinct the cluster centers, the larger $\lambda(C)$ and the smaller the condition number $\kappa(C)$ is. Indeed, based on results in [40], when the cluster centers are maximally diverse, e.g. drawn uniformly at random from the unit sphere, $\kappa(C)$ scales like a constant. Throughout we shall assume that $\lambda(C)$ is strictly positive (and hence $\kappa(C) < \infty$). This property is empirically verified to hold in earlier works [55] when $\phi$ is a standard activation (e.g. ReLU, softplus). As a concrete example, for ReLU activation, using results from [40] one can show if the cluster centers are separated by a distance , then . We note that variations of the assumption based on the data points (i.e. not cluster centers) [40, 21, 20] are utilized to provide convergence guarantees for DNNs. Also see [3, 59] for other publications using related definitions.
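The quantities in Definition 2.1 can be estimated numerically. The sketch below is an assumption-laden illustration: it replaces the expectation over $g \sim \mathcal{N}(0, I_d)$ by a Monte Carlo average, uses the ReLU derivative as $\phi'$, and takes the condition number in the form $\kappa(C) = \sqrt{d/K}\,\|C\|/\lambda(C)$. The function names are hypothetical.

```python
import numpy as np

def relu_prime(z):
    return (z > 0).astype(float)

def neural_net_covariance(C, phi_prime, num_samples, rng):
    """Monte Carlo estimate of Sigma(C) = (C C^T) * E_g[phi'(C g) phi'(C g)^T]."""
    G = rng.standard_normal((num_samples, C.shape[1]))   # rows are draws of g
    A = phi_prime(G @ C.T)                                # (num_samples, K)
    E = A.T @ A / num_samples                             # estimate of the expectation
    return (C @ C.T) * E                                  # elementwise product

def cluster_condition_number(C, Sigma):
    """kappa(C) = sqrt(d / K) * ||C|| / lambda(C), with lambda the smallest eigenvalue."""
    K, d = C.shape
    lam = np.linalg.eigvalsh(Sigma)[0]
    return np.sqrt(d / K) * np.linalg.norm(C, 2) / lam

rng = np.random.default_rng(0)
C = rng.standard_normal((4, 10))
C /= np.linalg.norm(C, axis=1, keepdims=True)
Sigma = neural_net_covariance(C, relu_prime, 20000, rng)
kappa = cluster_condition_number(C, Sigma)

# Duplicating a cluster center makes Sigma rank deficient, so lambda(C) collapses to zero.
C_dup = np.vstack([C, C[:1]])
Sigma_dup = neural_net_covariance(C_dup, relu_prime, 20000, rng)
lam_dup = np.linalg.eigvalsh(Sigma_dup)[0]
```

The duplicated-center case mirrors the remark above that $\Sigma(C)$ is rank deficient when two cluster centers coincide.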

Now that we have a quantitative characterization of distinctiveness/diversity in place, we are ready to state our main result. Throughout we use $c_\Gamma$, $C_\Gamma$, etc. to denote constants only depending on $\Gamma$. We note that this theorem is slightly simplified by ignoring logarithmic terms and precise dependencies on $\Gamma$. We refer the reader to Theorem 6.13 for the precise statement including logarithmic terms.

###### Theorem 2.2 (Robust learning with early stopping-simplified)

Consider a $(\rho, \varepsilon_0, \delta)$ clusterable corrupted data set of input/label pairs $\{(x_i, y_i)\}_{i=1}^n$ per Definition 1.2 with cluster centers $\{c_\ell\}_{\ell=1}^K$ aggregated as rows of a matrix $C \in \mathbb{R}^{K \times d}$. Furthermore, let $\{\tilde{y}_i\}_{i=1}^n$ be the corresponding uncorrupted ground truth labels. Also consider a one-hidden layer neural network with $k$ hidden units and one output of the form $x \mapsto v^T \phi(Wx)$ with $W \in \mathbb{R}^{k \times d}$ and $v \in \mathbb{R}^k$ the input-to-hidden and hidden-to-output weights. Also suppose the activation $\phi$ obeys and for all and some . Furthermore, we set half of the entries of $v$ to $1/\sqrt{k}$ and the other half to $-1/\sqrt{k}$ (if $k$ is odd we set one entry to zero and split the remaining entries between the two values) and train only over $W$. Starting from an initial weight matrix $W_0$ selected at random with i.i.d. $\mathcal{N}(0, 1)$ entries we run gradient descent updates of the form (1.4) on the least-squares loss (1.3) with step size with . Furthermore, assume the number of parameters obey

 kd≥CΓκ4(C)K4d,

with $\kappa(C)$ the neural net cluster condition number per Definition 2.1. Then as long as and with probability at least , after iterations, the neural network found by gradient descent assigns all input samples $x_i$ to the correct ground truth labels $\tilde{y}_i$. That is,

$$\arg\min_{\alpha_\ell :\, 1 \le \ell \le \bar{K}} |f(W_\tau, x_i) - \alpha_\ell| = \tilde{y}_i, \tag{2.1}$$

holds for all $1 \le i \le n$. Furthermore, for all , the distance to the initial point obeys

 ∥Wτ−W0∥F≤¯CΓ(√K+K2∥C∥2τε0).
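The two quantities appearing in the theorem's conclusion are easy to compute for a trained network. The snippet below is a small sketch with hypothetical function names: it implements the nearest-label decoding rule of (2.1) and the Frobenius distance to initialization that the final bound controls.

```python
import numpy as np

def decode_to_nearest_label(outputs, alphas):
    """Map each scalar network output to the closest class label, as in (2.1)."""
    outputs = np.asarray(outputs, dtype=float)[:, None]
    alphas = np.asarray(alphas, dtype=float)
    return alphas[np.argmin(np.abs(outputs - alphas[None, :]), axis=1)]

def distance_from_init(W, W0):
    """Frobenius norm of W - W0."""
    return np.linalg.norm(W - W0)

# Example with three class labels separated by delta = 1.
alphas = np.array([-1.0, 0.0, 1.0])
decoded = decode_to_nearest_label([0.9, -0.2, 0.4], alphas)
```

Comparing `decoded` against the ground truth labels, rather than the corrupted ones, measures the kind of label correction the theorem describes.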

Theorem 2.2 shows that gradient descent with early stopping has a few intriguing properties. We further discuss these properties below.
Robustness. The solution found by gradient descent with early stopping degrades gracefully as the label corruption level $\rho$ grows. In particular, as long as $\rho \le \delta/8$, the final model is able to correctly classify all samples including the corrupted ones. In our setup, intuitively the label gap obeys $\delta \sim 1/\bar{K}$, hence we prove robustness to

$$\text{Total number of corrupted labels} \lesssim \frac{n}{\bar{K}}.$$

This result is independent of the number of clusters and only depends on the number of classes. An interesting future direction is to improve this result to allow on the order of $n$ corrupted labels. Such a result may be possible by using a multi-output classification neural network.

Early stopping time. We show that gradient descent finds a model that is robust to outliers after a few iterations. In particular using the maximum allowed step size, the required number of iterations is of the order of which scales with up to condition numbers.

Modest overparameterization. Our result requires modest overparameterization and applies as soon as the number of parameters exceeds the number of clusters to the power four ($kd \gtrsim K^4$). Interestingly, under our data model the required amount of overparameterization is essentially independent of the size of the training data (ignoring logarithmic terms) and of the conditioning of the data points, only depending on the number of clusters and the conditioning of the cluster centers. This can be interpreted as ensuring that the network has enough capacity to fit the cluster centers and the associated true labels.

Distance from initialization. Another feature of Theorem 2.2 is that the network weights do not stray far from the initialization, as the distance between the initial model and the final model grows (at most) with the square root of the number of clusters ($\sqrt{K}$). This dependence implies that the more clusters there are, the further the updates travel, while still staying within a certain radius. This dependence is intuitive as the Rademacher complexity of the function space is dictated by the distance to initialization and should grow with the square root of the number of input clusters to ensure the model is expressive enough to learn the dataset.

Before we end this section we would like to note that in the limit of $\varepsilon_0 \to 0$, where the input data set is perfectly clustered, one can improve the amount of overparameterization. Indeed, the result above is obtained via a perturbation argument from this more refined result stated below.

###### Theorem 2.3 (Training with perfectly clustered data)

Consider the setting and assumptions of Theorem 2.2 with $\varepsilon_0 = 0$. Starting from an initial weight matrix $W_0$ selected at random with i.i.d. $\mathcal{N}(0, 1)$ entries we run gradient descent updates of the form (1.4) on the least-squares loss (1.3) with step size . Furthermore, assume the number of parameters obey

$$kd \ge C\,\Gamma^4 \kappa^2(C) K^2,$$

with $\kappa(C)$ the neural net cluster condition number per Definition 2.1. Then, with probability at least over randomly initialized $W_0$, the iterates $W_\tau$ obey the following properties.

• The distance to initial point is upper bounded by

 ∥Wτ−W0∥F≤cΓ√KlogKλ(C).
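• After iterations, the entrywise predictions of the learned network with respect to the ground truth labels $\{\tilde{y}_i\}_{i=1}^n$ satisfy

$$|f(W_\tau, x_i) - \tilde{y}_i| \le 4\rho,$$

for all $1 \le i \le n$. Furthermore, if the noise level $\rho$ obeys $\rho \le \delta/8$ the network predicts the correct label for all samples, i.e.

$$\arg\min_{\alpha_\ell :\, 1 \le \ell \le \bar{K}} |f(W_\tau, x_i) - \alpha_\ell| = \tilde{y}_i \quad \text{for} \quad i = 1, 2, \ldots, n. \tag{2.2}$$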

This result shows that in the limit where the data points are perfectly clustered, the required amount of overparameterization can be reduced from $kd \gtrsim K^4$ to $kd \gtrsim K^2$. In this sense this can be thought of as a nontrivial analogue of [40] where the number of data points is replaced with the number of clusters and the condition number of the data points is replaced with a cluster condition number. This can be interpreted as ensuring that the network has enough capacity to fit the cluster centers and the associated true labels. Interestingly, the robustness benefits continue to hold in this case. However, in this perfectly clustered scenario there is no need for early stopping and a robust network is trained as soon as the number of iterations is sufficiently large. In fact, in this case given the clustered nature of the input data the network never overfits to the corrupted data even after many iterations.

### 2.2 To (over)fit to corrupted labels requires straying far from initialization

In this section we wish to provide further insight into why early stopping enables robustness and generalizable solutions. Our main insight is that while a neural network may be expressive enough to fit a corrupted dataset, the model has to travel a longer distance from the point of initialization as a function of the distance from the cluster centers and the amount of corruption. We formalize this idea as follows. Suppose

1. two input points are close to each other (e.g. they are from the same cluster),

2. but their labels are different, hence the network has to map them to distant outputs.

Then, the network has to be large enough so that it can amplify the small input difference to create a large output difference. Our first result formalizes this for a randomly initialized network. Our random initialization picks $W_0$ with i.i.d. standard normal entries which ensures that the network is isometric i.e. given input , .

###### Theorem 2.4

Let be two vectors with unit Euclidean norm obeying . Let where is fixed, , and with a fixed constant. Assume . Let and be two scalars satisfying . Suppose . Then, with probability at least , for any such that and

$$f(W, x_1) = y_1 \quad \text{and} \quad f(W, x_2) = y_2,$$

holds, we have

 ∥W−W0∥≥δCΓε0−t1000.

In words, this result shows that in order to fit a data set with a single corrupted label, a randomly initialized network has to traverse a distance on the order of at least $\delta/\varepsilon_0$. The next lemma clarifies the role of the corruption amount and shows that more label corruption within a fixed class requires a model with a larger norm in order to fit the labels. For this result we consider a randomized model with $\varepsilon_0^2/d$ input noise variance.

###### Lemma 2.5

Let be a cluster center. Consider data points and in generated i.i.d. around according to the following distribution

$$c + g \quad \text{with} \quad g \sim \mathcal{N}\left(0, \frac{\varepsilon_0^2}{d} I_d\right).$$

Assign with labels and with labels and assume these two labels are separated i.e. . Also suppose and . Then, any satisfying

$$f(W, x_i) = y_i \quad \text{and} \quad f(W, \tilde{x}_i) = \tilde{y}_i \quad \text{for} \quad i = 1, \ldots, s,$$

obeys with probability at least .

Unlike Theorem 2.4, this result lower bounds the network norm in lieu of the distance to the initialization $\|W - W_0\|$. However, using the triangle inequality we can in turn get a guarantee on the distance from initialization as long as (e.g. by choosing a small ).

The above Theorem implies that the model has to traverse a distance of at least

$$\|W_\tau - W_0\|_F \gtrsim \sqrt{\frac{\rho n}{K}}\,\frac{\delta}{\varepsilon_0},$$

to perfectly fit corrupted labels. In contrast, we note that the conclusions of the upper bound in Theorem 2.2 show that to be able to fit to the uncorrupted true labels the distance to initialization grows at most by after iterates. This demonstrates that there is a gap in the required distance to initialization for fitting enough to generalize and overfitting. To sum up, our results highlight that, one can find a network with good generalization capabilities and robustness to label corruption within a small neighborhood of the initialization and that the size of this neighborhood is independent of the corruption. However, to fit to the corrupted labels, one has to travel much more, increasing the search space and likely decreasing generalization ability. Thus, early stopping can enable robustness without overfitting by restricting the distance to the initialization.

## 3 Technical Approach and General Theory

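In this section, we outline our approach to proving robustness of overparameterized neural networks. Towards this goal, we consider a general formulation where we aim to fit a general nonlinear model of the form $x \mapsto f(\theta, x)$ with $\theta$ denoting the parameters of the model. For instance, in the case of neural networks $\theta$ represents the weights. Given a data set of $n$ input/label pairs $\{(x_i, y_i)\}_{i=1}^n$, we fit to this data by minimizing a nonlinear least-squares loss of the form

$$\mathcal{L}(\theta) = \frac{1}{2}\sum_{i=1}^{n} \left(y_i - f(\theta, x_i)\right)^2,$$

which can also be written in the more compact form

$$\mathcal{L}(\theta) = \frac{1}{2}\|f(\theta) - y\|_{\ell_2}^2 \quad \text{with} \quad f(\theta) := \begin{bmatrix} f(\theta, x_1) \\ f(\theta, x_2) \\ \vdots \\ f(\theta, x_n) \end{bmatrix}.$$

To solve this problem we run gradient descent iterations with a constant learning rate $\eta$ starting from an initial point $\theta_0$. These iterations take the form

$$\theta_{\tau+1} = \theta_\tau - \eta \nabla \mathcal{L}(\theta_\tau) \quad \text{with} \quad \nabla \mathcal{L}(\theta) = \mathcal{J}^T(\theta)\left(f(\theta) - y\right). \tag{3.1}$$

Here, $\mathcal{J}(\theta)$ is the Jacobian matrix associated with the nonlinear mapping $f$ defined via

$$\mathcal{J}(\theta) = \left[\frac{\partial f(\theta, x_1)}{\partial \theta} \ \ldots \ \frac{\partial f(\theta, x_n)}{\partial \theta}\right]^T. \tag{3.2}$$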
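For the one-hidden-layer network of Section 1.3 the Jacobian in (3.2) has a closed form, and the gradient identity in (3.1) can be checked numerically. The sketch below assumes a tanh activation and a row-major flattening of the weight matrix into $\theta$; both choices, and all function names, are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def model(theta, X, v, k, d):
    """f(theta) stacked over all inputs, with theta the flattened k x d weight matrix."""
    W = theta.reshape(k, d)
    return np.tanh(X @ W.T) @ v

def jacobian(theta, X, v, k, d):
    """n x (k d) Jacobian; row i is the flattened matrix (v * phi'(W x_i)) x_i^T."""
    W = theta.reshape(k, d)
    S = (1.0 - np.tanh(X @ W.T) ** 2) * v[None, :]        # (n, k)
    return (S[:, :, None] * X[:, None, :]).reshape(X.shape[0], k * d)

def loss(theta, X, y, v, k, d):
    r = model(theta, X, v, k, d) - y
    return 0.5 * r @ r

def gradient(theta, X, y, v, k, d):
    """Gradient of the least-squares loss, written as J^T (f(theta) - y) as in (3.1)."""
    return jacobian(theta, X, v, k, d).T @ (model(theta, X, v, k, d) - y)

def gradient_step(theta, X, y, v, k, d, eta):
    """One gradient descent iteration of the form (3.1)."""
    return theta - eta * gradient(theta, X, y, v, k, d)
```

Writing the gradient through the Jacobian makes explicit that each update moves $\theta$ only along directions spanned by the rows of $\mathcal{J}(\theta)$.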

### 3.1 Bimodal Jacobian structure

Our approach is based on the hypothesis that the nonlinear model has a Jacobian matrix with bimodal spectrum where few singular values are large and the remaining singular values are small. This assumption is inspired by the fact that realistic datasets are clusterable in a proper, possibly nonlinear, representation space. Indeed, one may argue that one reason for using neural networks is to automate the learning of such a representation (essentially the input to the softmax layer). We formalize the notion of bimodal spectrum below.

###### Assumption 1 (Bimodal Jacobian)

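Let $\beta \ge \alpha \ge \epsilon > 0$ be scalars. Let $f : \mathbb{R}^p \to \mathbb{R}^n$ be a nonlinear mapping and consider a set $\mathcal{D} \subset \mathbb{R}^p$ containing the initial point $\theta_0$ (i.e. $\theta_0 \in \mathcal{D}$). Let $\mathcal{S}_+ \subset \mathbb{R}^n$ be a subspace and $\mathcal{S}_-$ be its complement. We say the mapping $f$ has a Bimodal Jacobian with respect to the complementary subspaces $\mathcal{S}_+$ and $\mathcal{S}_-$ as long as the following two assumptions hold for all $\theta \in \mathcal{D}$.

• Spectrum over $\mathcal{S}_+$: For all $v \in \mathcal{S}_+$ with unit Euclidean norm we have

$$\alpha \le \left\|\mathcal{J}^T(\theta) v\right\|_{\ell_2} \le \beta.$$
• Spectrum over $\mathcal{S}_-$: For all $v \in \mathcal{S}_-$ with unit Euclidean norm we have

$$\left\|\mathcal{J}^T(\theta) v\right\|_{\ell_2} \le \epsilon.$$

We will refer to $\mathcal{S}_+$ as the signal subspace and $\mathcal{S}_-$ as the noise subspace.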

When $\epsilon \ll \alpha$ the Jacobian is approximately low-rank. An extreme special case of this assumption is where $\epsilon = 0$ so that the Jacobian matrix is exactly low-rank. We formalize this assumption below for later reference.

###### Assumption 2 (Low-rank Jacobian)

Let be scalars. Consider a set containing the initial point (i.e. ). Let be a subspace and be its complement. For all , and with unit Euclidian norm, we have that

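Our dataset model in Definition 1.2 naturally has a low-rank Jacobian when the data are perfectly clustered, i.e. each input example is equal to one of the $K$ cluster centers. In this case, the Jacobian has rank at most $K$, since by (3.2) each row is the gradient of the model evaluated at one of the cluster centers. The subspace $\mathcal{S}_+$ is dictated by the membership of each cluster as follows: Let $\Lambda_\ell \subset \{1, \dots, n\}$ be the set of coordinates $i$ such that $x_i$ is equal to the $\ell$-th cluster center. Then, the subspace $\mathcal{S}_+$ is characterized by

$$\mathcal{S}_+ = \{ v \in \mathbb{R}^n \mid v_{i_1} = v_{i_2} \ \text{for all} \ i_1, i_2 \in \Lambda_\ell \ \text{and} \ 1 \le \ell \le K \}.$$

When the data points of each cluster are not the same as the cluster center, we have the bimodal Jacobian structure of Assumption 1, where over $\mathcal{S}_-$ the spectral norm is small but nonzero.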
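The two regimes described above can be illustrated numerically. The following is a minimal sketch, not the setup of the experiments in Section 4: it assumes a one-hidden-layer model $f(\theta, x) = v^T \tanh(W x)$ with fixed output weights $v$, takes the Jacobian with respect to $W$ only, and uses hypothetical sizes and helper names (`jacobian_wrt_hidden_weights`, `spectrum`, `eps0`). With inputs exactly at the cluster centers the Jacobian has at most $K$ nonzero singular values; with a small perturbation around the centers the remaining singular values become small but nonzero.

```python
import numpy as np

def jacobian_wrt_hidden_weights(W, v, X):
    """Rows are d f(theta, x_i) / d W for the assumed model f = v^T tanh(W x)."""
    pre = X @ W.T                      # (n, k) pre-activations
    dact = 1.0 - np.tanh(pre) ** 2     # tanh'(z), shape (n, k)
    # Gradient w.r.t. W is the outer product (v * tanh'(W x)) x^T, flattened.
    return ((dact * v)[:, :, None] * X[:, None, :]).reshape(X.shape[0], -1)

rng = np.random.default_rng(0)
K, per_cluster, d, k = 3, 20, 20, 50   # hypothetical sizes
centers = rng.standard_normal((K, d))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
labels = np.repeat(np.arange(K), per_cluster)

W = rng.standard_normal((k, d))
v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)

def spectrum(eps0):
    """Singular values of the Jacobian when inputs sit at distance eps0 from their center."""
    noise = rng.standard_normal((K * per_cluster, d))
    noise /= np.linalg.norm(noise, axis=1, keepdims=True)
    X = centers[labels] + eps0 * noise
    return np.linalg.svd(jacobian_wrt_hidden_weights(W, v, X), compute_uv=False)

sv_exact = spectrum(0.0)     # perfectly clustered inputs
sv_noisy = spectrum(1e-3)    # inputs slightly off the cluster centers
rank_exact = int(np.sum(sv_exact > 1e-10 * sv_exact[0]))

print("numerical rank, perfectly clustered:", rank_exact)
print("top singular values, perturbed:", sv_noisy[: K + 2])
```

In the perturbed case the first $K$ singular values play the role of the spectrum over the signal subspace and the rest that of the spectrum over the noise subspace.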

In Section 4, we verify that the Jacobian matrices of real datasets indeed have a bimodal structure, i.e. there are a few large singular values and the remaining singular values are small, which further motivates Assumption 2. This is in line with earlier papers which observed that Hessian matrices of deep networks have a bimodal spectrum (approximately low-rank) [45] and is related to various results demonstrating that there are flat directions in the loss landscape [28].

### 3.2 Meta result on learning with label corruption

Define the $n$-dimensional residual vector $r(\theta) = f(\theta) - y$, whose $i$-th entry is $f(\theta, x_i) - y_i$. A key idea in our approach is to argue that (1) in the absence of any corruption $r(\theta)$ approximately lies on the subspace $\mathcal{S}_+$ and (2) if the labels are corrupted by a vector $e$, then $e$ approximately lies on the complement space $\mathcal{S}_-$. Before we state our general result we need to discuss another assumption and definition.

###### Assumption 3 (Smoothness)

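The Jacobian mapping $\mathcal{J}(\theta)$ associated to a nonlinear mapping $f$ is $L$-smooth if for all $\theta_1, \theta_2$ we have $\|\mathcal{J}(\theta_2) - \mathcal{J}(\theta_1)\| \le L \|\theta_2 - \theta_1\|_{\ell_2}$. (Footnote 5: if the derivative of $\mathcal{J}$ is continuous, the smoothness condition holds over any compact domain, albeit for a possibly large $L$.)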

Additionally, to connect our results to the number of corrupted labels, we introduce the notion of subspace diffusedness defined below.

###### Definition 3.1 (Diffusedness)

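$\mathcal{S}_+$ is $\gamma$ diffused if for any vector $v \in \mathcal{S}_+$

$$\|v\|_{\ell_\infty} \le \sqrt{\gamma / n} \, \|v\|_{\ell_2}$$

holds for some $\gamma > 0$.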
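To make the definition concrete, the sketch below computes the smallest admissible $\gamma$ for a subspace given by an orthonormal basis, and evaluates it on the cluster-membership subspace of Section 3.1. It relies on the elementary fact that, for a unit vector $v = Uc$ in the span of an orthonormal $U$, the largest possible $|v_i|$ equals the norm of the $i$-th row of $U$. The function names `diffusedness` and `cluster_basis` and the toy cluster sizes are assumptions made for this illustration.

```python
import numpy as np

def diffusedness(U):
    """Smallest gamma with ||v||_inf <= sqrt(gamma / n) ||v||_2 for all v in span(U).

    U must have orthonormal columns. For unit v = U c, the largest value of
    |v_i| is the Euclidean norm of row i of U, so gamma = n * max_i ||U[i]||^2.
    """
    n = U.shape[0]
    return n * np.max(np.sum(U ** 2, axis=1))

def cluster_basis(labels):
    """Orthonormal basis of the vectors that are constant on each cluster."""
    K = labels.max() + 1
    M = np.eye(K)[labels]               # (n, K) one-hot membership matrix
    return M / np.sqrt(M.sum(axis=0))   # normalize each column by sqrt(cluster size)

balanced = np.repeat(np.arange(3), 4)              # three clusters of size 4
unbalanced = np.array([0] * 2 + [1] * 4 + [2] * 6) # sizes 2, 4, 6

gamma_balanced = diffusedness(cluster_basis(balanced))
gamma_unbalanced = diffusedness(cluster_basis(unbalanced))
print(gamma_balanced, gamma_unbalanced)
```

For the cluster subspace this evaluates to $n$ divided by the size of the smallest cluster, so balanced clusters give $\gamma = K$ and a very small cluster makes the subspace less diffused.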

The following theorem is our meta result on the robustness of gradient descent to sparse corruptions on the labels when the Jacobian mapping is exactly low-rank. Theorem 2.3 for perfectly clustered data is obtained by combining this result with specific estimates developed for neural networks.

###### Theorem 3.2 (Gradient descent with label corruption)

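Consider a nonlinear least squares problem of the form $\mathcal{L}(\theta) = \frac{1}{2}\|f(\theta) - y\|_{\ell_2}^2$ with the nonlinear mapping $f$ obeying Assumptions 2 and 3 over a Euclidean ball of radius $4\|r_0\|_{\ell_2}/\alpha$ around an initial point $\theta_0$, where $r_0 = f(\theta_0) - y$ is the initial residual and $y$ denotes the corrupted labels. Also let $\tilde{y}$ denote the uncorrupted labels and $e = y - \tilde{y}$ the corruption. Furthermore, suppose the initial residual $f(\theta_0) - \tilde{y}$ with respect to the uncorrupted labels obeys $f(\theta_0) - \tilde{y} \in \mathcal{S}_+$. Then, running gradient descent updates of the form (3.1) with a learning rate , all iterates obey

$$\|\theta_\tau - \theta_0\|_{\ell_2} \le \frac{4\|r_0\|_{\ell_2}}{\alpha}.$$

Furthermore, assume $\nu > 0$ is a precision level obeying . Then, after iterations, $\theta_\tau$ achieves the following error bound with respect to the true labels

$$\|f(\theta_\tau) - \tilde{y}\|_{\ell_\infty} \le 2\nu.$$

Furthermore, if $e$ has at most $s$ nonzeros and $\mathcal{S}_+$ is $\gamma$ diffused per Definition 3.1, then using $\nu = \|\Pi_{\mathcal{S}_+}(e)\|_{\ell_\infty}$

$$\|f(\theta_\tau) - \tilde{y}\|_{\ell_\infty} \le 2\|\Pi_{\mathcal{S}_+}(e)\|_{\ell_\infty} \le \frac{\gamma\sqrt{s}}{n}\|e\|_{\ell_2}.$$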

This result shows that when the Jacobian of the nonlinear mapping is low-rank, gradient descent enjoys two intriguing properties. First, gradient descent iterations remain rather close to the initial point. Second, the estimated labels of the algorithm enjoy sample-wise robustness guarantees in the sense that the noise in the estimated labels is gracefully distributed over the dataset and the effects on individual label estimates are negligible. This theorem is the key result that allows us to prove Theorem 2.3 when the data points are perfectly clustered. Furthermore, this theorem, when combined with a perturbation analysis, allows us to deal with data that is not perfectly clustered and to conclude that with early stopping neural networks are rather robust to label corruption (Theorem 2.2).
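Both properties can be observed on a toy problem. The sketch below is an illustration under simplifying assumptions, not the neural network setting of Theorem 2.3: it uses a linear model $f(\theta) = A\theta$ whose Jacobian $A$ has the cluster subspace as its range (so the low-rank assumption holds exactly), clean labels $\pm 1$ that are constant on each cluster, and a step size of $1/\beta^2$ chosen for convenience. All sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, per_cluster, p = 3, 20, 50              # hypothetical sizes
n = K * per_cluster
labels = np.repeat(np.arange(K), per_cluster)
M = np.eye(K)[labels]                      # (n, K) cluster membership
A = M @ rng.standard_normal((K, p))        # Jacobian of f(theta) = A theta; its range is S_+
P_plus = M @ np.linalg.pinv(M)             # projector onto S_+ (within-cluster averaging)

y_clean = M @ np.array([1.0, -1.0, 1.0])   # uncorrupted labels, constant per cluster
e = np.zeros(n)
flip = np.concatenate([c * per_cluster + np.arange(3) for c in range(K)])
e[flip] = -2.0 * y_clean[flip]             # flip 3 of the 20 labels in each cluster
y = y_clean + e                            # corrupted labels

sv = np.linalg.svd(A, compute_uv=False)
beta, alpha = sv[0], sv[K - 1]             # spectrum of the Jacobian over S_+
eta = 1.0 / beta ** 2                      # assumed step size for this sketch

theta0 = np.zeros(p)                       # f(theta0) - y_clean = -y_clean lies in S_+
theta = theta0.copy()
r0 = A @ theta0 - y
max_dist = 0.0
for _ in range(500):
    theta -= eta * A.T @ (A @ theta - y)   # gradient step on 0.5 * ||A theta - y||^2
    max_dist = max(max_dist, np.linalg.norm(theta - theta0))

r = A @ theta - y
nu = np.max(np.abs(P_plus @ e))            # ||Pi_{S_+}(e)||_inf
err_inf = np.max(np.abs(A @ theta - y_clean))

print("largest distance from initialization:", max_dist)
print("bound 4 ||r0|| / alpha:", 4 * np.linalg.norm(r0) / alpha)
print("error w.r.t. clean labels (inf norm):", err_inf, " 2 nu:", 2 * nu)
```

In this toy run the component of the residual on the noise subspace never changes, the predictions keep the sign of the clean labels on every sample including the corrupted ones, and the iterates stay inside the ball of radius $4\|r_0\|_{\ell_2}/\alpha$.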

Finally, we note that a few recent publications [39, 3, 21] require the Jacobian to be well-conditioned to fit labels perfectly. In contrast, our low-rank model cannot perfectly fit the corrupted labels. Furthermore, when the Jacobian is bimodal (as seems to be the case for many practical datasets and neural network models) it would take a very long time to perfectly fit the labels and, as demonstrated earlier, such a model does not generalize and is not robust to corruptions. Instead we focus on proving robustness with early stopping.

### 3.3 To (over)fit to corrupted labels requires straying far from initialization

In this section we state a result that provides further justification as to why early stopping of gradient descent leads to more robust models without overfitting to corrupted labels. This is based on the following observation: to find an estimate that fits the uncorrupted labels one does not have to move far from the initial estimate, whereas in the presence of corruption one has to stray rather far from the initialization, with the distance from initialization increasing further as the corruption grows. We make this observation rigorous below by showing that it is more difficult to fit the portion of the residual that lies on the noise space than the portion on the signal space (assuming $\alpha \gg \epsilon$).

###### Theorem 3.3

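Denote the residual at initialization by $r_0 = f(\theta_0) - y$. Define the residual projection over the signal and noise space as

$$E_+ = \|\Pi_{\mathcal{S}_+}(r_0)\|_{\ell_2} \quad \text{and} \quad E_- = \|\Pi_{\mathcal{S}_-}(r_0)\|_{\ell_2}.$$

Suppose Assumption 1 holds over a Euclidean ball $\mathcal{D}$ of radius $R < \max\left(\frac{E_+}{\beta}, \frac{E_-}{\epsilon}\right)$ around the initial point $\theta_0$ with . Then, over $\mathcal{D}$ there exists no $\theta$ that achieves zero training loss. In particular, if Assumption 1 holds over the whole parameter space, any parameter $\theta$ achieving zero training loss ($f(\theta) = y$) satisfies the distance bound

$$\|\theta - \theta_0\|_{\ell_2} \ge \max\left(\frac{E_+}{\beta}, \frac{E_-}{\epsilon}\right).$$

This theorem shows that the higher the corruption (and hence $E_-$) the further the iterates need to stray from the initial model to fit the corrupted data.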
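The distance bound can be checked directly on a linear model, for which the interpolating parameter is available in closed form. The sketch below is an illustration under assumptions: it takes $f(\theta) = A\theta$ with a square matrix $A$ whose singular values lie in $[\alpha, \beta]$ on a $K$-dimensional signal subspace and equal $\epsilon$ on its complement, starts from $\theta_0 = 0$, and uses hypothetical values for all constants.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K = 30, 3
alpha, beta, eps = 1.0, 2.0, 0.05       # hypothetical bimodal spectrum levels

U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
sing = np.concatenate([np.linspace(beta, alpha, K), np.full(n - K, eps)])
A = U @ np.diag(sing) @ V.T             # f(theta) = A theta, with theta0 = 0
U_plus, U_minus = U[:, :K], U[:, K:]    # bases of the signal and noise subspaces

def distance_and_bound(y):
    """Distance of the interpolating parameter from theta0 = 0, and the lower bound."""
    r0 = -y                             # residual at initialization
    E_plus = np.linalg.norm(U_plus.T @ r0)
    E_minus = np.linalg.norm(U_minus.T @ r0)
    theta = np.linalg.solve(A, y)       # A is invertible, so this is the only zero-loss point
    return np.linalg.norm(theta), max(E_plus / beta, E_minus / eps)

y_clean = U_plus @ np.array([1.0, -1.0, 1.0])   # clean labels lie in the signal subspace
results = []
for num_corrupted in (0, 2, 8):
    e = np.zeros(n)
    e[:num_corrupted] = 1.0             # sparse corruption on the first few labels
    results.append(distance_and_bound(y_clean + e))

for (dist, bound), num_corrupted in zip(results, (0, 2, 8)):
    print(num_corrupted, "corrupted labels: distance", dist, "lower bound", bound)
```

With clean labels the interpolating parameter stays within a distance of order $E_+/\alpha$ of the initialization, while even a few corrupted labels push it out to a distance of order $E_-/\epsilon$.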

## 4 Numerical experiments

We conduct several experiments to investigate the robustness capabilities of deep networks to label corruption. In our first set of experiments, we explore the relationship between loss, accuracy, and amount of label corruption on the MNIST dataset to corroborate our theory. Our next experiments study the distribution of the loss and the Jacobian on the CIFAR-10 dataset. Finally, we simulate our theoretical model by generating data according to the corrupted data model of Definition 1.2 and verify the robustness capability of gradient descent with early stopping in this model.

In Figure 3, we train the same model used in Figure 1 with MNIST samples for different amounts of corruption. Our theory predicts that more label corruption leads to a larger distance to initialization. To probe this hypothesis, Figures 2(a) and 2(b) visualize training accuracy and training loss as a function of the distance from the initialization. These results demonstrate that the distance from initialization gracefully increases with more corruption.

Next, we study the distribution of the individual sample losses on the CIFAR-10 dataset. We conducted two experiments using ResNet-20 with cross entropy loss (footnote 6: we opted for cross entropy as it is the standard classification loss; however, least-squares loss achieves similar accuracy). In Figure 4 we assess the noise robustness of gradient descent where we used all 50,000 samples with either 30% random corruption or 50% random corruption. Theorem 2.3 predicts that when the corruption level is small, the loss distribution of corrupted vs clean samples should be separable. Figure 4 shows that when 30% of the data is corrupted the distributions are approximately separable. When we increase the shuffling amount to 50%, the training loss on the clean data increases as predicted by our theory and the distributions start to gracefully overlap.

As described in Section 3, our technical framework utilizes a bimodal prior on the Jacobian matrix (3.2) of the model. We now further investigate this hypothesis. For a multiclass task, the Jacobian matrix is essentially a 3-way tensor whose dimensions are the sample size, the total number of parameters in the model, and the number of classes. The neural network model we used for CIFAR-10 has around 270,000 parameters in total. In Figure 5 we illustrate the singular value spectrum of the two multiclass Jacobian models where we form the Jacobian from all layers except the five largest (in total we use parameters). (Footnote 7: we depict the smaller Jacobian due to the computational cost of calculating the full Jacobian.) We train the model with all samples and focus on the spectrum before and after the training. In Figure 4(a), we picked samples and unfolded this tensor along parameters to obtain a matrix which verifies our intuition on bimodality. In particular, only 10 to 20 singular values are larger than the top one. This is consistent with earlier works that studied the Hessian spectrum. However, focusing on the Jacobian has the added advantage of requiring only first order information [45, 26]. A disadvantage is that the size of the Jacobian grows with the number of classes. Intuitively, cross entropy loss focuses on the class associated with the label, hence in Figure 4(b) we only picked the partial derivative associated with the correct class so that each sample is responsible for a single (size ) vector. This allowed us to scale to samples and the corresponding spectrum is strikingly similar. Another intriguing finding is that the spectra before and after training are fairly close to each other, highlighting that even at random initialization the spectrum is bimodal.
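The two ways of turning the multiclass Jacobian tensor into a matrix can be sketched as follows. This is only an illustration of the reshaping step: the tensor is filled with random placeholder values instead of network gradients, the axis order (samples, classes, parameters) is an assumption, and the sizes are hypothetical and far smaller than in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, p = 8, 10, 30                       # hypothetical: samples, classes, parameters
# Placeholder for the tensor whose (i, k, :) slice is the gradient of the
# class-k output on sample i with respect to the parameters.
J = rng.standard_normal((n, K, p))
labels = rng.integers(0, K, size=n)       # placeholder class label of each sample

# Variant 1: unfold along the parameter axis, one row per (sample, class) pair.
J_unfolded = J.reshape(n * K, p)

# Variant 2: keep only the gradient of the output for the labelled class,
# so each sample contributes a single row.
J_correct = J[np.arange(n), labels]

sv_unfolded = np.linalg.svd(J_unfolded, compute_uv=False)
sv_correct = np.linalg.svd(J_correct, compute_uv=False)
print(J_unfolded.shape, J_correct.shape)
```

The second variant has a factor of $K$ fewer rows, which is what makes it cheaper to compute the spectrum for a larger number of samples.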

In Figure 6, we turn our attention to verifying our findings for the corrupted dataset model of Definition 1.2. We generated two classes where the associated cluster centers are generated uniformly at random on the unit sphere of . We also generate the input samples uniformly at random on a sphere of radius around the corresponding cluster center. Hence, the clusters are guaranteed to be at least distance from each other to prevent overlap. Overall we generate samples ( per class/cluster). Here, and the class labels are and . We picked a network with hidden units and trained on a dataset with samples where 30% of the labels were corrupted. Figure 5(a) plots the trajectory of training error and highlights that the model achieves good classification in the first few iterations and ends up overfitting later on. In Figures 5(b) and 5(c), we focus on the loss distribution of 5(a) at iterations and . In these figures, we visualize the loss distributions of clean and corrupted data. Figure 5(b) highlights the loss distribution with early stopping and implies that the gap between corrupted and clean loss distributions is surprisingly resilient despite a large amount of corruption and the high capacity of the model. In Figure 5(c), we repeat the plot after many more iterations, at which point the model overfits. This plot shows that the distributions of the two classes overlap, demonstrating that the model has overfit the corruption and lacks generalization/robustness.

## 5 Conclusions

In this paper, we studied the robustness of overparameterized neural networks to label corruption through a theoretical lens. We provided robustness guarantees for training networks with gradient descent when early stopping is used and complemented these guarantees with lower bounds. Our results point to the distance between final and initial network weights as a key feature to determine robustness vs. overfitting, which is in line with weight decay and early stopping heuristics. We also carried out extensive numerical experiments to verify the theoretical predictions as well as technical assumptions. While our results shed light on the intriguing properties of overparameterized neural network optimization, it would be appealing (i) to extend our results to deeper network architectures, (ii) to more complex data models, and also (iii) to explore other heuristics that can further boost the robustness of gradient descent methods.

## 6 Proofs

### 6.1 Proofs for General Theory

We begin by defining the average Jacobian which will be used throughout our analysis.

###### Definition 6.1 (Average Jacobian)

We define the average Jacobian along the path connecting two points $x$ and $y$ as

$$\mathcal{J}(y, x) := \int_0^1 \mathcal{J}(x + \alpha (y - x)) \, d\alpha. \tag{6.1}$$

###### Lemma 6.2 (Linearization of the residual)

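Given the gradient descent iterate $\hat{\theta} = \theta - \eta \mathcal{J}(\theta)^T (f(\theta) - y)$, define

$$\mathcal{C}(\theta) = \mathcal{J}(\hat{\theta}, \theta) \mathcal{J}(\theta)^T.$$

The residuals $\hat{r} = f(\hat{\theta}) - y$ and $r = f(\theta) - y$ obey the following equation

$$\hat{r} = (I - \eta \mathcal{C}(\theta)) r.$$

Proof Following Definition 6.1, denoting $f(\hat{\theta}) - y = \hat{r}$ and $f(\theta) - y = r$, we find that

$$\begin{aligned} \hat{r} &= r - f(\theta) + f(\hat{\theta}) \\ &\overset{(a)}{=} r + \mathcal{J}(\hat{\theta}, \theta)(\hat{\theta} - \theta) \\ &\overset{(b)}{=} r - \eta \mathcal{J}(\hat{\theta}, \theta) \mathcal{J}(\theta)^T r \\ &= (I - \eta \mathcal{C}(\theta)) r. \end{aligned} \tag{6.2}$$

Here (a) uses the fact that the Jacobian is the derivative of $f$ and (b) uses the fact that $\hat{\theta} - \theta = -\eta \mathcal{J}(\theta)^T r$.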
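The identity can be verified numerically on a model for which the average Jacobian is available in closed form. The sketch below assumes a quadratic model $f_i(\theta) = \frac{1}{2}\theta^T Q_i \theta + a_i^T \theta$ with symmetric $Q_i$, chosen only because its Jacobian is affine in $\theta$, so that the average Jacobian along a segment equals the Jacobian at the midpoint. The model, sizes, and names are assumptions made for this check.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eta = 4, 6, 0.01                     # hypothetical sizes and step size
Q = rng.standard_normal((n, p, p))
Q = (Q + Q.transpose(0, 2, 1)) / 2         # make each Q_i symmetric
a = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def f(theta):
    # f_i(theta) = 0.5 * theta^T Q_i theta + a_i^T theta
    return 0.5 * np.einsum("j,ijk,k->i", theta, Q, theta) + a @ theta

def jac(theta):
    # Row i is Q_i theta + a_i, which is affine in theta.
    return Q @ theta + a

theta = rng.standard_normal(p)
r = f(theta) - y
theta_hat = theta - eta * jac(theta).T @ r         # one gradient descent step
r_hat = f(theta_hat) - y

J_avg = jac(0.5 * (theta + theta_hat))             # exact average for an affine Jacobian
C = J_avg @ jac(theta).T
r_pred = (np.eye(n) - eta * C) @ r                 # right-hand side of the lemma

print(np.max(np.abs(r_hat - r_pred)))
```

For a general nonlinear model the midpoint shortcut is no longer exact and the integral in (6.1) would have to be approximated, for instance by quadrature along the segment.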

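Using Definition 3.1, one can show that sparse vectors have a small projection on $\mathcal{S}_+$.

###### Lemma 6.3

Suppose $\mathcal{S}_+$ is $\gamma$ diffused per Definition 3.1. If $r \in \mathbb{R}^n$ is a vector with $s$ nonzero entries, we have that

$$\|\Pi_{\mathcal{S}_+}(r)\|_{\ell_\infty} \le \frac{\gamma\sqrt{s}}{n}\|r\|_{\ell_2}. \tag{6.3}$$

Proof First, we bound the projection of $r$ on $\mathcal{S}_+$ as follows

$$\|\Pi_{\mathcal{S}_+}(r)\|_{\ell_2} = \sup_{v \in \mathcal{S}_+} \frac{v^T r}{\|v\|_{\ell_2}} \le \sqrt{\frac{\gamma}{n}} \|r\|_{\ell_1} \le \sqrt{\frac{\gamma s}{n}} \|r\|_{\ell_2},$$

where we used the fact that $|v^T r| \le \|v\|_{\ell_\infty}\|r\|_{\ell_1} \le \sqrt{\gamma/n} \, \|v\|_{\ell_2}\|r\|_{\ell_1}$, together with $\|r\|_{\ell_1} \le \sqrt{s}\|r\|_{\ell_2}$ for a vector with $s$ nonzero entries. Next, we conclude with

$$\|\Pi_{\mathcal{S}_+}(r)\|_{\ell_\infty} \le \sqrt{\frac{\gamma}{n}} \|\Pi_{\mathcal{S}_+}(r)\|_{\ell_2} \le \frac{\gamma\sqrt{s}}{n}\|r\|_{\ell_2}.$$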
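As a sanity check of (6.3), the sketch below evaluates both sides for the cluster-membership subspace with balanced clusters, where the projection is within-cluster averaging and the diffusedness constant can be taken as $\gamma = K$. The sizes and the random sparse vector are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, per_cluster, s = 4, 25, 5              # hypothetical sizes; s is the sparsity level
n = K * per_cluster
labels = np.repeat(np.arange(K), per_cluster)
M = np.eye(K)[labels]                     # (n, K) cluster membership
P_plus = M @ M.T / per_cluster            # projection onto S_+: average within each cluster
gamma = K                                 # balanced clusters: n / (cluster size) = K

r = np.zeros(n)
support = rng.choice(n, size=s, replace=False)
r[support] = rng.standard_normal(s)       # a vector with s nonzero entries

lhs = np.max(np.abs(P_plus @ r))                       # ||Pi_{S_+}(r)||_inf
rhs = gamma * np.sqrt(s) / n * np.linalg.norm(r)       # gamma * sqrt(s) / n * ||r||_2
print(lhs, rhs)
```

The right-hand side shrinks like $1/n$ for fixed $\gamma$ and $s$, which is the sense in which a few corrupted labels are spread thinly over the signal subspace.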

#### 6.1.1 Proof of Theorem 3.2

Proof The proof will be done inductively over the properties of gradient descent iterates and is inspired by the recent work [39]. In particular, [39] requires a well-conditioned Jacobian to fit labels perfectly. In contrast, we have a low-rank Jacobian model which cannot fit the noisy labels (or it would have trouble fitting if the Jacobian were approximately low-rank). Despite this, we wish to prove that gradient descent satisfies desirable properties such as robustness and closeness to initialization. Let us introduce the notation related to the residual. Set $r_\tau = f(\theta_\tau) - y$ and let $r_0 = f(\theta_0) - y$ be the initial residual. We keep track of the growth of the residual by partitioning the residual as $r_\tau = \bar{r}_\tau + \bar{e}_\tau$ where

$$\bar{e}_\tau = \Pi_{\mathcal{S}_-}(r_\tau), \quad \bar{r}_\tau = \Pi_{\mathcal{S}_+}(r_\tau).$$

We claim that for all iterations $\tau \ge 0$, the following conditions hold.

$$\begin{aligned} \bar{e}_\tau &= \bar{e}_0, & (6.4) \\ \|\bar{r}_\tau\|_{\ell_2}^2 &\le \left(1 - \frac{\eta\alpha^2}{2}\right)^{\tau} \|\bar{r}_0\|_{\ell_2}^2, & (6.5) \\ \|\bar{r}_0\|_{\ell_2} &\le \|r_0\|_{\ell_2}. & (6.6) \end{aligned}$$

Assuming these conditions hold till some $\tau > 0$, inductively, we focus on iteration $\tau + 1$. First, note that these conditions imply that for all , where is the Euclidean ball around $\theta_0$ of radius . This directly follows from the induction hypothesis (6.6). Next, we claim that $\theta_{\tau+1}$ is still within the set . This can be seen as follows:

###### Claim 1

Under the induction hypothesis (6.4), .

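Proof Since the range space of the Jacobian is in $\mathcal{S}_+$ and $\eta \le 1/\beta^2$, we begin by noting that

$$\begin{aligned} \|\theta_{\tau+1} - \theta_\tau\|_{\ell_2} &= \eta \|\mathcal{J}^T(\theta_\tau)(f(\theta_\tau) - y)\|_{\ell_2} & (6.7) \\ &\overset{(a)}{=} \eta \|\mathcal{J}^T(\theta_\tau)\,\Pi_{\mathcal{S}_+}(f(\theta_\tau) - y)\|_{\ell_2} & (6.8) \\ &\overset{(b)}{=} \eta \|\mathcal{J}^T(\theta_\tau)\,\bar{r}_\tau\|_{\ell_2} & (6.9) \\ &\overset{(c)}{\le} \eta\beta \|\bar{r}_\tau\|_{\ell_2} & (6.10) \\ &\overset{(d)}{\le} \frac{\|\bar{r}_\tau\|_{\ell_2}}{\beta} & (6.11) \\ &\overset{(e)}{\le} \frac{\|\bar{r}_\tau\|_{\ell_2}}{\alpha} & (6.12) \end{aligned}$$

In the above, (a) follows from the fact that the row range space of the Jacobian is a subset of $\mathcal{S}_+$ via Assumption 2, (b) follows from the definition of $\bar{r}_\tau$, (c) follows from the upper bound on the spectral norm of the Jacobian per Assumption 2, (d) from the fact that $\eta \le 1/\beta^2$, and (e) from $\alpha \le \beta$. The latter combined with the triangle inequality and induction hypothesis (6.6) yields (after scaling (6.6) by )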

 ∥θτ+1−θ0∥ℓ2≤∥θτ+1−θτ∥ℓ2+∥θ0−θτ∥ℓ2≤∥θτ−θ0∥ℓ2+∥¯rτ∥ℓ2α≤