Understanding Generalization of Deep Neural Networks Trained with Noisy Labels

05/27/2019 ∙ by Wei Hu, et al. ∙ Princeton University

Over-parameterized deep neural networks trained by simple first-order methods are known to be able to fit any labeling of data. When the training dataset contains a fraction of noisy labels, can neural networks be resistant to over-fitting and still generalize on the true distribution? Inspired by recent theoretical work that established connections between over-parameterized neural networks and neural tangent kernel (NTK), we propose two simple regularization methods for this purpose: (i) regularization by the distance from the network parameters to their initialization, and (ii) adding a trainable auxiliary variable to the network output for each training example. Theoretically, both methods are related to kernel ridge regression with respect to the NTK, and we prove a generalization guarantee on the true data distribution despite training on noisy labels. The generalization bound is independent of the network size, and only depends on the training inputs and true labels (instead of noisy labels) as well as the noise level in the labels. Empirical results verify the effectiveness of these methods on noisily labeled datasets.


1 Introduction

Modern deep neural networks are trained in a highly over-parameterized regime, with many more trainable parameters than training examples. It is well-known that these networks trained with simple first-order methods can fit any labels, even completely random ones (Zhang et al., 2017). Although training on properly labeled data usually leads to good generalization performance, the ability to over-fit the entire training dataset is undesirable for generalization when noisy labels are present. Since mislabeled data are ubiquitous in very large datasets (Krishna et al., 2016), principled methods that are robust to noisy labels are expected to improve generalization significantly.

In order to prevent over-fitting to mislabeled data, some form of regularization is necessary. A simple example is early stopping, which has been observed to be effective for this purpose (Rolnick et al., 2017; Guan et al., 2018; Li et al., 2019). For instance, on MNIST, even when 90% of the labels are corrupted, training a neural net with early stopping can still give 90% test accuracy (Li et al., 2019). How to explain such a generalization phenomenon is an intriguing theoretical question.

In an effort to understand the optimization and generalization mysteries in deep learning, theoretical progress has been made recently by considering very wide deep neural nets (Du et al., 2019, 2018; Li and Liang, 2018; Allen-Zhu et al., 2018a, b; Zou et al., 2018; Arora et al., 2019b; Cao and Gu, 2019). When the width of every hidden layer is sufficiently large, it was shown that (stochastic) gradient descent with random initialization can almost always drive the training loss to 0; under further assumptions, the trained net can be shown to have a good generalization guarantee. In this line of work, width plays an important role: the parameters of a wide neural net stay close to their initialization during gradient descent training, and as a consequence the neural net can be effectively approximated by its first-order Taylor expansion with respect to its parameters at initialization. This leads to tractable linear dynamics, and the final solution can be characterized by kernel regression using a particular kernel, which Jacot et al. (2018) named the neural tangent kernel (NTK). Kernels of this type were further studied explicitly by Lee et al. (2019); Yang (2019); Arora et al. (2019a).

The connection between over-parameterized neural nets and neural tangent kernels is theoretically appealing because a pure kernel method captures the power of a fully trained neural net. These kernels also exhibit reasonable empirical performance on CIFAR-10 (Arora et al., 2019a), thus suggesting that ultra-wide (or infinitely wide) neural nets are at least not irrelevant.

The aforementioned papers do not provide an explanation of the generalization behavior in the presence of mislabeled training data, because the training loss always goes to 0 for any labels, i.e., over-fitting will happen. However, one may use the theoretical insights from the kernel viewpoint to design and analyze regularization methods for neural net training in the over-parameterized regime.

Our contribution.

This paper makes progress towards a theoretical explanation of the generalization phenomenon in over-parameterized neural nets when noisy labels are present. In particular, inspired by the correspondence between wide neural nets and neural tangent kernels (NTKs), we propose two simple regularization methods for neural net training:

  1. Regularization by distance to initialization. Denote by $\theta$ the network parameters and by $\theta_0$ its random initialization. This method adds a regularization term proportional to the squared distance $\|\theta - \theta_0\|_2^2$ to the training objective.

  2. Adding an auxiliary variable for each training example. Let $(x_i, \tilde y_i)$ be the $i$-th training example and let $f(\theta, \cdot)$ represent the neural net. This method adds a trainable variable $b_i$ and tries to fit the $i$-th label using the sum of the network output $f(\theta, x_i)$ and (a scaled copy of) $b_i$. At test time, only the neural net is used and the auxiliary variables are discarded.

We prove that for wide neural nets, both methods, when trained with gradient descent to convergence, correspond to kernel ridge regression using the NTK, which is usually regarded as an alternative to early stopping in the kernel literature.

Then we prove a generalization bound of the learned predictor on the true data distribution when the training data labels are corrupted. This generalization bound depends on the (unobserved) true labels, and is comparable to the bound one can get when there is no label noise, therefore indicating that the proposed regularization methods are robust to noisy labels.

The effectiveness of these two regularization methods is verified empirically: on MNIST and CIFAR, they achieve test accuracy similar to that of early stopping, and their test error is much lower than the label noise level in the training dataset.

Additional related work.

Li et al. (2019) proved that gradient descent with early stopping is robust to label noise for an over-parameterized two-layer neural net. Under a clustering assumption on the data, they showed that gradient descent fits the correct labels before starting to over-fit wrong labels. Their result differs from ours in several respects: they only considered two-layer nets while we allow arbitrarily deep nets; they required a clustering assumption on the data while our generalization bound is general and data-dependent; furthermore, they did not address the question of generalization, but only provided guarantees on the training data. Interestingly, Li et al. (2019) quantified over-fitting through the distance of the parameters to their initialization, $\|\theta - \theta_0\|$. This exactly corresponds to our first regularization method, which we show can help generalization despite the presence of noisy labels. Distance to initialization was also believed to be related to generalization in deep learning (Neyshabur et al., 2019; Nagarajan and Kolter, 2019). These studies provide additional motivation for our regularization method.

Our methods are inspired by kernel ridge regression, which is one of the most common kernel methods and has been widely studied. It was shown to perform comparably to early-stopped gradient descent (Bauer et al., 2007; Gerfo et al., 2008; Raskutti et al., 2014; Wei et al., 2017). Accordingly, we indeed observe in our experiments that our regularization methods perform similarly to gradient descent with early stopping in neural net training.

A line of empirical work proposed various methods to deal with mislabeled examples, e.g. (Sukhbaatar et al., 2014; Liu and Tao, 2015; Veit et al., 2017; Northcutt et al., 2017; Jiang et al., 2017; Ren et al., 2018). Our work provides a theoretical guarantee on generalization by using principled and considerably simpler regularization methods.

Paper organization.

In Section 2 we review how training wide neural nets is connected to the kernel method. In Section 3 we introduce the setting considered in this paper. In Section 4 we describe our regularization methods and show how they lead to kernel ridge regression. In Section 5 we prove a generalization guarantee for our methods. We provide experimental results in Section 6 and conclude in Section 7.

Notation.

We use bold-faced letters for vectors and matrices. For a matrix $A$, $A_{ij}$ is its $(i, j)$-th entry. We use $\|\cdot\|$ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. $\langle \cdot, \cdot \rangle$ represents the standard inner product. Denote by $\lambda_{\min}(\cdot)$ the minimum eigenvalue of a positive semidefinite matrix. $I$ is the identity matrix of appropriate dimension. Let $[n] = \{1, 2, \ldots, n\}$. Let $\mathbb{1}\{E\}$ be the indicator of event $E$.

2 Recap of Neural Tangent Kernel

It has been proved in a series of recent papers (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019a) that a sufficiently wide neural net trained with randomly initialized gradient descent stays close to its first-order Taylor expansion with respect to its parameters at initialization. This leads to essentially linear dynamics, and the final learned neural net is close to the kernel regression solution with respect to a particular kernel, the neural tangent kernel (NTK). This section briefly and informally recaps this theory.

Let $f(\theta, x)$ be an $L$-layer fully connected neural network with scalar output, where $x$ is the input and $\theta$ collects all the network parameters. Here $W_l$ is the weight matrix in the $l$-th layer ($l \in [L]$) and is initialized using i.i.d. $\mathcal{N}(0, 1)$ entries, $\sigma$ is a coordinate-wise activation function, and $c_\sigma$ is an activation-dependent normalization constant. The hidden widths are allowed to go to infinity. Suppose that the net is trained by minimizing the squared loss over a training dataset $\{(x_i, y_i)\}_{i=1}^n$:

$\frac{1}{2} \sum_{i=1}^n \big( f(\theta, x_i) - y_i \big)^2.$

Let the random initial parameters be $\theta_0$, and let the parameters $\theta$ be updated according to gradient descent on this objective. It has been shown that if the network is sufficiently wide, the parameters stay close to the initialization during training, so that the following first-order approximation is accurate:

$f(\theta, x) \approx f(\theta_0, x) + \langle \nabla_\theta f(\theta_0, x),\, \theta - \theta_0 \rangle.$    (1)

The above approximation is exact in the infinite-width limit, but can also be shown to hold when the width is finite but sufficiently large.
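As an illustration of this regime, the following is a minimal sketch (not the paper's code; the architecture, data, and hyper-parameters are illustrative assumptions) that trains a moderately wide two-layer ReLU net with full-batch gradient descent and compares its outputs to the right-hand side of (1) built from the gradient features at initialization.

```python
# Sketch: check the first-order approximation (1) for a wide two-layer net.
# Everything here (width, learning rate, data) is an illustrative assumption.
import torch

torch.manual_seed(0)
n, d, m, lr, steps = 20, 10, 4096, 0.05, 100
X = torch.randn(n, d)
X = X / X.norm(dim=1, keepdim=True)                       # unit-norm inputs
y = torch.randn(n)

def net(params, x):
    W1, w2 = params
    return torch.relu(x @ W1.t()) @ w2 / m ** 0.5          # NTK-style scaling

params0 = [torch.randn(m, d), torch.randn(m)]              # theta_0
params = [p.clone().requires_grad_(True) for p in params0]

def grad_features(x):
    # z(x) = gradient of f(theta_0, x) w.r.t. all parameters, flattened
    p = [q.clone().requires_grad_(True) for q in params0]
    out = net(p, x.unsqueeze(0)).squeeze()
    return torch.cat([g.reshape(-1) for g in torch.autograd.grad(out, p)])

Z = torch.stack([grad_features(X[i]) for i in range(n)])   # n x (#params)
f0 = net(params0, X)

for _ in range(steps):                                     # full-batch GD
    loss = 0.5 * (net(params, X) - y).pow(2).sum()
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= lr * g

delta = torch.cat([(p - p0).reshape(-1) for p, p0 in zip(params, params0)])
lin = f0 + Z @ delta                      # right-hand side of (1) on the data
print((net(params, X) - lin).abs().max().item())           # small for large m
```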

In the following, we assume the linear approximation (1) and proceed to derive the solution of gradient descent. Denote $z_i = \nabla_\theta f(\theta_0, x_i)$ for each $i \in [n]$ and let $z(x) = \nabla_\theta f(\theta_0, x)$ for any input $x$. Then using (1) we obtain the following approximation for the training objective:

$\frac{1}{2} \sum_{i=1}^n \big( f(\theta_0, x_i) + \langle z_i,\, \theta - \theta_0 \rangle - y_i \big)^2.$

Since we have reduced the problem to training a linear model with the squared loss, its gradient descent iterates can be written analytically: starting from $\theta_0$ and updating using gradient descent with learning rate $\eta$, we can easily derive the solution at every iteration:

$\theta_t - \theta_0 = Z^\top (Z Z^\top)^{-1} \big( I - (I - \eta Z Z^\top)^t \big) (y - f_0),$

where $Z = [z_1, \ldots, z_n]^\top$ (assuming $Z Z^\top$ is invertible), $y = (y_1, \ldots, y_n)^\top$, and $f_0 = \big( f(\theta_0, x_1), \ldots, f(\theta_0, x_n) \big)^\top$. Assuming $\eta < \frac{2}{\|Z Z^\top\|}$, the final solution at $t \to \infty$ is $\theta_\infty - \theta_0 = Z^\top (Z Z^\top)^{-1} (y - f_0)$, and the corresponding predictor is

$x \mapsto f(\theta_0, x) + z(x)^\top Z^\top (Z Z^\top)^{-1} (y - f_0).$

Suppose that the neural net and its initialization are defined so that the initial output is small, i.e., $f(\theta_0, x_i) \approx 0$ for all $i \in [n]$ and $f(\theta_0, x) \approx 0$ for any test input $x$. (Footnote 1: We can ensure small or even zero output at initialization by either multiplying the output by a small factor (as done in (Arora et al., 2019a, b)), or using the following "difference trick": define the network to be the scaled difference between two networks with the same architecture, i.e., $f(\theta, x) = \frac{1}{\sqrt{2}}\big(g(\theta^{(1)}, x) - g(\theta^{(2)}, x)\big)$ with $\theta = (\theta^{(1)}, \theta^{(2)})$; then initialize $\theta^{(1)}$ and $\theta^{(2)}$ to be the same (and still random); this ensures $f(\theta_0, x) = 0$ at initialization, while keeping the same NTK for both $f$ and $g$. See details in Appendix A.) Then we have

$x \mapsto z(x)^\top Z^\top (Z Z^\top)^{-1} y.$    (2)
Neural tangent kernel (NTK).

The solution in (2) is in fact the kernel regression solution with respect to the kernel induced by the (random) features $z(x) = \nabla_\theta f(\theta_0, x)$, which is defined as $k(x, x') = \langle z(x), z(x') \rangle$ for all $x, x'$. This kernel was named the neural tangent kernel (NTK) by Jacot et al. (2018). Although this kernel is random, it has been shown that when the network is sufficiently wide, this random kernel converges to a deterministic limit in probability (Arora et al., 2019a). Note that using the NTK we can rewrite the predictor at the end of training, (2), as

$x \mapsto k(x, X)^\top K^{-1} y,$    (3)

where $X = (x_1, \ldots, x_n)$ represents the training inputs, $k(x, X) = \big( k(x, x_1), \ldots, k(x, x_n) \big)^\top$, and $K \in \mathbb{R}^{n \times n}$ with $K_{ij} = k(x_i, x_j)$.
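A compact sketch of the predictor in (3), with a placeholder kernel standing in for the NTK (the RBF kernel below is purely illustrative; any positive-definite kernel, in particular the NTK, could be plugged into `kernel`):

```python
# Sketch of the kernel regression predictor (3): x -> k(x, X)^T K^{-1} y.
# The RBF kernel below is a placeholder standing in for the NTK.
import numpy as np

def kernel(x1, x2):
    return np.exp(-np.sum((x1 - x2) ** 2) / 2.0)

def kernel_regression(X, y, k=kernel):
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.linalg.solve(K, y)                     # K^{-1} y (K assumed invertible)
    return lambda x: np.array([k(x, xi) for xi in X]) @ alpha

rng = np.random.default_rng(0)
X_train = rng.standard_normal((30, 5))
y_train = np.sign(X_train[:, 0])
f = kernel_regression(X_train, y_train)
print(f(X_train[0]), y_train[0])                      # interpolates the training labels
```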

3 Setting: Learning from Noisily Labeled Data

In this section we formally describe the model of learning from noisily labeled data considered throughout this paper. Suppose that there is an underlying data distribution $\mathcal{D}$ over $\mathbb{R}^d \times \mathbb{R}$ of interest, but we only have access to samples from a noisy version of $\mathcal{D}$. Formally, consider the following data generation process:

  1. draw $(x, y) \sim \mathcal{D}$,

  2. conditioned on $(x, y)$, let the noise $\epsilon$ be drawn from a noise distribution over $\mathbb{R}$ that may depend on $x$ and $y$, and

  3. let $\tilde y = y + \epsilon$.

Let $\widetilde{\mathcal{D}}$ be the joint distribution of $(x, \tilde y)$ from the above process. Intuitively, $y$ represents the true label and $\tilde y$ is the noisy label.

Let $(x_1, y_1, \tilde y_1), \ldots, (x_n, y_n, \tilde y_n)$ be i.i.d. samples from the above process, and suppose that we only have access to $(x_1, \tilde y_1), \ldots, (x_n, \tilde y_n)$, i.e., we only observe inputs and their noisy labels, but do not observe true labels. The goal is to learn a function (in the form of a neural net) that can predict the true label well on the distribution $\mathcal{D}$, i.e., to find a function $f$ that makes the population loss

$L_{\mathcal{D}}(f) = \mathbb{E}_{(x, y) \sim \mathcal{D}} \big[ \ell(f(x), y) \big]$

as small as possible for a certain loss function $\ell(\cdot, \cdot)$.

To set up notation for later sections, we denote $X = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n)^\top$, $\tilde y = (\tilde y_1, \ldots, \tilde y_n)^\top$, and $\epsilon = \tilde y - y$.

The goal of learning from noisy labels would be impossible without assuming some correlation between true and noisy labels. The assumption we make throughout this paper is that for any $(x, y)$, the noise distribution has mean $0$ and is subgaussian with parameter $\sigma$. As we explain below, this already captures the interesting case of having corrupted labels in a classification task.

Example: binary classification with partially corrupted labels.

Let $\mathcal{D}$ be a distribution over $\mathbb{R}^d \times \{\pm 1\}$, but in the noisy observations, each label is flipped with probability $p$ ($0 \le p < \frac{1}{2}$). Namely, for $(x, y) \sim \mathcal{D}$, the observed noisy label $\tilde y$ is equal to $y$ with probability $1 - p$, and is equal to $-y$ with probability $p$. The joint distribution of $(x, \tilde y)$ in this case does not satisfy our assumption, because the mean of $\tilde y - y$ is non-zero (conditioned on $y$). Nevertheless, this issue can be fixed by considering $(1 - 2p)\, y$ as the true label instead. (Footnote 2: For binary classification, only the sign of the label matters, so we can assume that the true labels are from $\{\pm(1 - 2p)\}$ instead of $\{\pm 1\}$, without changing the classification problem.) Then we can easily check that conditioned on $(x, y)$, the noise $\tilde y - (1 - 2p)\, y$ has mean $0$ and is subgaussian. Therefore, this is a special case of having zero-mean subgaussian noise in the label.
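A small numerical sketch of this re-centering (sample size and flip probability are arbitrary choices for illustration):

```python
# Sketch: label flipping with probability p, re-centered so that the residual
# noise around (1 - 2p) * y has zero mean (values here are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n, p = 10000, 0.2
y = rng.choice([-1.0, 1.0], size=n)          # true labels in {-1, +1}
flip = rng.random(n) < p
y_tilde = np.where(flip, -y, y)              # observed noisy labels

y_bar = (1 - 2 * p) * y                      # rescaled "true" labels
noise = y_tilde - y_bar                      # zero-mean, bounded noise
print(noise[y == 1].mean(), noise[y == -1].mean())   # both approximately 0
```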

4 Regularization Methods

Given a noisily labeled training dataset $(x_1, \tilde y_1), \ldots, (x_n, \tilde y_n)$, let $f(\theta, \cdot)$ be a neural net to be trained. A direct, unregularized training method would involve minimizing an objective function like $\frac{1}{2} \sum_{i=1}^n \big( f(\theta, x_i) - \tilde y_i \big)^2$. To prevent over-fitting, we suggest the following simple regularization methods that slightly modify this objective.

Method 1: Regularization using Distance to Initialization (RDI).

We let the initial parameters $\theta_0$ be randomly generated, and minimize the following regularized objective:

$\mathcal{L}^{\mathrm{RDI}}(\theta) = \frac{1}{2} \sum_{i=1}^n \big( f(\theta, x_i) - \tilde y_i \big)^2 + \frac{\lambda^2}{2} \|\theta - \theta_0\|_2^2.$    (4)
Method 2: adding an AUXiliary variable for each training example (AUX).

We add an auxiliary trainable parameter $b_i$ for each $i \in [n]$, and minimize the following modified objective:

$\mathcal{L}^{\mathrm{AUX}}(\theta, b) = \frac{1}{2} \sum_{i=1}^n \big( f(\theta, x_i) + \lambda b_i - \tilde y_i \big)^2,$    (5)

where $b = (b_1, \ldots, b_n)^\top$ is initialized to be $0$.
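Below is a minimal PyTorch sketch of the two training objectives. The architecture, data, optimizer settings, and helper names are assumptions made for illustration; in particular the $\lambda^2/2$ penalty and the $\lambda b_i$ scaling follow the parametrization written in (4) and (5) above rather than any official implementation.

```python
# Sketch of RDI (4) and AUX (5); all names and hyper-parameters are illustrative.
import torch
import torch.nn as nn

def make_net(d, m=1024):
    return nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))

def rdi_loss(net, theta0, X, y_tilde, lam):
    # squared loss + (lam^2 / 2) * ||theta - theta_0||^2
    fit = 0.5 * (net(X).squeeze(-1) - y_tilde).pow(2).sum()
    dist = sum((p - p0).pow(2).sum() for p, p0 in zip(net.parameters(), theta0))
    return fit + 0.5 * lam ** 2 * dist

def aux_loss(net, b, X, y_tilde, lam):
    # squared loss on f(theta, x_i) + lam * b_i; b is trained jointly with theta
    return 0.5 * (net(X).squeeze(-1) + lam * b - y_tilde).pow(2).sum()

n, d, lam = 256, 20, 1.0
X = torch.randn(n, d)
flips = 1.0 - 2.0 * (torch.rand(n) < 0.2).float()          # 20% label noise
y_tilde = torch.sign(X[:, 0]) * flips

net = make_net(d)
theta0 = [p.detach().clone() for p in net.parameters()]    # snapshot of theta_0
b = torch.zeros(n, requires_grad=True)                     # auxiliary variables

opt = torch.optim.SGD(list(net.parameters()) + [b], lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    aux_loss(net, b, X, y_tilde, lam).backward()   # or: rdi_loss(net, theta0, X, y_tilde, lam)
    opt.step()
# at test time only net(.) is used; the auxiliary variables b are discarded
```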

4.1 Equivalence to Kernel Ridge Regression in Wide Neural Nets

Now we assume the setting in Section 2, where the neural net architecture is sufficiently wide so that the first-order approximation (1) is accurate during gradient descent. Recall that we have the gradient features $z(x) = \nabla_\theta f(\theta_0, x)$, which induce the NTK $k(x, x') = \langle z(x), z(x') \rangle$. Also recall that we can assume near-zero initial output: $f(\theta_0, x) \approx 0$ (see Footnote 1). Therefore we have the approximation:

$f(\theta, x) \approx \langle z(x),\, \theta - \theta_0 \rangle.$    (6)

Under the approximation (6), it suffices to consider gradient descent on the objectives (4) and (5) using the linearized model instead:

$\mathcal{L}^{\mathrm{RDI}}_{\mathrm{lin}}(\theta) = \frac{1}{2} \sum_{i=1}^n \big( \langle z(x_i), \theta - \theta_0 \rangle - \tilde y_i \big)^2 + \frac{\lambda^2}{2} \|\theta - \theta_0\|_2^2, \qquad \mathcal{L}^{\mathrm{AUX}}_{\mathrm{lin}}(\theta, b) = \frac{1}{2} \sum_{i=1}^n \big( \langle z(x_i), \theta - \theta_0 \rangle + \lambda b_i - \tilde y_i \big)^2.$

The following theorem shows that in either case, gradient descent leads to the same dynamics and converges to the kernel ridge regression solution using the NTK.

Theorem 4.1.

Fix a learning rate $\eta > 0$. Consider gradient descent on $\mathcal{L}^{\mathrm{RDI}}_{\mathrm{lin}}$ with initialization $\theta_0$:

$\theta_{t+1} = \theta_t - \eta\, \nabla \mathcal{L}^{\mathrm{RDI}}_{\mathrm{lin}}(\theta_t),$    (7)

and gradient descent on $\mathcal{L}^{\mathrm{AUX}}_{\mathrm{lin}}$ with initialization $\theta'_0 = \theta_0$ and $b_0 = 0$:

$(\theta'_{t+1}, b_{t+1}) = (\theta'_t, b_t) - \eta\, \nabla \mathcal{L}^{\mathrm{AUX}}_{\mathrm{lin}}(\theta'_t, b_t).$    (8)

Then we must have $\theta_t = \theta'_t$ for all $t$. Furthermore, if the learning rate satisfies $\eta < \frac{2}{\|K\| + \lambda^2}$, then $\theta_t$ converges linearly to a limit solution $\theta_\infty$ such that:

$\langle z(x),\, \theta_\infty - \theta_0 \rangle = k(x, X)^\top (K + \lambda^2 I)^{-1}\, \tilde y \quad \text{for all } x.$

Proof.

The gradient of $\mathcal{L}^{\mathrm{AUX}}_{\mathrm{lin}}$ can be written as

$\nabla_{\theta'} \mathcal{L}^{\mathrm{AUX}}_{\mathrm{lin}}(\theta', b) = Z^\top r, \qquad \nabla_{b} \mathcal{L}^{\mathrm{AUX}}_{\mathrm{lin}}(\theta', b) = \lambda r,$

where $r = Z(\theta' - \theta_0) + \lambda b - \tilde y$. Therefore we have $\lambda\, \nabla_{\theta'} \mathcal{L}^{\mathrm{AUX}}_{\mathrm{lin}} = Z^\top \nabla_{b} \mathcal{L}^{\mathrm{AUX}}_{\mathrm{lin}}$. Then, according to the gradient descent update rule (8), we know that $\theta'_t$ and $b_t$ can always be related by $\lambda (\theta'_t - \theta_0) = Z^\top b_t$. It follows that

$\theta'_{t+1} = \theta'_t - \eta \Big( Z^\top \big( Z(\theta'_t - \theta_0) - \tilde y \big) + \lambda^2 (\theta'_t - \theta_0) \Big).$

On the other hand, from (7) we have

$\theta_{t+1} = \theta_t - \eta \Big( Z^\top \big( Z(\theta_t - \theta_0) - \tilde y \big) + \lambda^2 (\theta_t - \theta_0) \Big).$

Comparing the above two equations, we find that $\theta_t$ and $\theta'_t$ have the same update rule. Since $\theta_0 = \theta'_0$, this proves $\theta_t = \theta'_t$ for all $t$.

The second part of the theorem can be proved by noticing that $\mathcal{L}^{\mathrm{RDI}}_{\mathrm{lin}}$ is a strongly convex quadratic function. Therefore gradient descent will converge linearly to its unique optimum, which can be easily calculated. We defer the details to Appendix B. ∎
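Theorem 4.1 is easy to verify numerically on the linearized model. The following sketch uses synthetic gradient features and the $\lambda b$ / $\lambda^2$ scalings from (4)–(8); dimensions, step counts, and values are arbitrary.

```python
# Numerical sanity check of Theorem 4.1 on the linearized model (a sketch).
import numpy as np

rng = np.random.default_rng(0)
n, p, lam, eta, T = 15, 40, 0.7, 0.01, 5000
Z = rng.standard_normal((n, p))           # rows play the role of z(x_i)
y_tilde = rng.standard_normal(n)          # noisy labels

v_rdi = np.zeros(p)                       # theta_t - theta_0 under RDI
v_aux, b = np.zeros(p), np.zeros(n)       # theta'_t - theta_0 and b_t under AUX
for _ in range(T):
    # RDI: gradient of 0.5*||Z v - y~||^2 + 0.5*lam^2*||v||^2
    v_rdi = v_rdi - eta * (Z.T @ (Z @ v_rdi - y_tilde) + lam ** 2 * v_rdi)
    # AUX: gradient of 0.5*||Z v + lam*b - y~||^2 w.r.t. (v, b)
    r = Z @ v_aux + lam * b - y_tilde
    v_aux, b = v_aux - eta * Z.T @ r, b - eta * lam * r

print(np.abs(v_rdi - v_aux).max())        # ~ 0: identical iterates
krr = Z.T @ np.linalg.solve(Z @ Z.T + lam ** 2 * np.eye(n), y_tilde)
print(np.abs(v_rdi - krr).max())          # -> 0: limit is kernel ridge regression
```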

Theorem 4.1 indicates that, when the neural net is sufficiently wide, gradient descent on the regularized objectives (4) and (5) both approximately learn the following function at the end of training:

$x \mapsto k(x, X)^\top (K + \lambda^2 I)^{-1}\, \tilde y.$    (9)

If no regularization were used, the labels would be fitted perfectly and the learned function would be $x \mapsto k(x, X)^\top K^{-1} \tilde y$ (cf. (2) and (3)). Therefore the effect of regularization is to add $\lambda^2 I$ to the kernel matrix, and (9) is known as the solution to kernel ridge regression in the kernel literature. In Section 5, we give a generalization bound of this solution on the true data distribution, which is comparable to the bound one can obtain even when true labels are available.
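In code, (9) differs from the kernel regression sketch in Section 2 only by the ridge term added to the Gram matrix (the kernel below is again a placeholder standing in for the NTK):

```python
# Sketch of the kernel ridge regression solution (9).
import numpy as np

def kernel(x1, x2):
    return np.exp(-np.sum((x1 - x2) ** 2) / 2.0)      # placeholder for the NTK

def kernel_ridge_regression(X, y_tilde, lam, k=kernel):
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    alpha = np.linalg.solve(K + lam ** 2 * np.eye(n), y_tilde)   # (K + lam^2 I)^{-1} y~
    return lambda x: np.array([k(x, xi) for xi in X]) @ alpha

# lam = 0 recovers the interpolating solution (2)/(3); a larger lam keeps the
# predictor from fitting the noisy labels exactly.
```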

Although RDI and AUX are equivalent in the setting considered in Theorem 4.1, we find that AUX enjoys additional theoretical benefits. In Appendix D, we prove that with AUX, gradient flow can converge even if the neural net is not close to its linearization or if the loss function is not quadratic.

5 Generalization

In this section we analyze the population loss of the function defined in (9) on the true data distribution $\mathcal{D}$. Our main result is the following theorem:

Theorem 5.1.

Assume that the true labels are bounded, i.e., , and that the kernel matrix satisfies . Consider any loss function that is -Lipschitz in the first argument such that . Then, with probability at least we have

(10)

where .

Remark 5.1.

As , we have . In order for the second term in (10) to go to $0$, we need to choose $\lambda$ to grow with $n$, e.g., for some small constant . Then, the only remaining term in (10) to worry about is . Notice that it depends on the (unobserved) true labels $y$, instead of the noisy labels $\tilde y$. By a very similar proof, one can show that training on the true labels (without regularization) leads to a population loss bound . In comparison, we can see that even when there is label noise, we only lose a factor of in the population loss on the true distribution, which can be chosen as any slow-growing function of $n$. If grows much slower than , by choosing an appropriate $\lambda$, our result indicates that the underlying distribution is learnable in the presence of label noise. See Remark 5.2 for an example.

Remark 5.2.

Arora et al. (2019b) proved that two-layer ReLU neural nets trained with gradient descent can learn a class of smooth functions on the unit sphere. Their proof is by showing that $\sqrt{y^\top (H^{\infty})^{-1} y}$ is bounded if $y_i = g(x_i)$ for certain functions $g$, where $H^{\infty}$ is the NTK matrix corresponding to two-layer ReLU nets. Combined with their result, Theorem 5.1 implies that the same class of functions can be learned by the same network even if the labels are noisy.

Proof sketch of Theorem 5.1.

The proof is given in Section C.1. It has three main steps: first, bound the training error of the learned function on the true labels $y$; second, bound the RKHS (reproducing kernel Hilbert space) norm of the learned function, which in fact equals the distance of the parameters to their initialization, $\|\theta_\infty - \theta_0\|$; finally, the population loss bound is proved via Rademacher complexity. ∎

As shown in Section 3, binary classification with partially corrupted labels can be viewed as a special case of the noisy label model considered in this paper. Therefore, Theorem 5.1 has the following corollary on classification. Its proof is given in Section C.2.

Corollary 5.1.

Under the same assumption as in Theorem 5.1 and additionally assuming that for , we have , and for , with probability at least we have

6 Experiments

In this section, we compare the performance of our regularization methods from Section 4 – regularization by distance to initialization (RDI) and regularization by adding auxiliary variables (AUX) – against vanilla gradient descent or stochastic gradient descent (GD/SGD) with or without early stopping. We perform binary classification with the squared loss in two settings: a two-layer fully-connected (FC) net on MNIST ("5" vs "8") and a deep convolutional neural net (CNN) on CIFAR ("airplanes" vs "automobiles"). See the detailed description in Appendix E. The labels are corrupted with different levels of noise. We summarize our experimental findings as follows:

  1. Both RDI and AUX can prevent over-parameterized neural networks from over-fitting the noise in the labels. They have almost the same performance when trained with GD.

  2. The weights of over-parameterized networks trained by GD/SGD remain close to the initialization, even with noisy labels. Moreover, as expected, RDI and AUX force the weights to stay even closer to the initialization, and thus generalize well even with noise in the labels.

6.1 Performance of Regularization Methods

(a) Test error of GD with the FC network on MNIST. For each noise level, we do a grid search over $\lambda$ and report the best test accuracy.
(b) Train/test error of GD with the CNN on CIFAR where 20% of labels are flipped. Training error of AUX is measured with auxiliary variables.
Figure 1: Comparison between the performances of GD with different regularizations.
(a) The plot of train/test error. Training error of AUX is measured with auxiliary variables.
(b) Distances of the weights to their initialization in the CNN.
Figure 2: Performance and weights when SGD is used with different regularization methods on CIFAR with 20% of labels flipped.

We first evaluate the performance of GD with AUX and RDI for binary classification with 1000 training samples from MNIST (see Figure 1). We observe that both GD+AUX and GD+RDI achieve much higher test accuracy than vanilla GD, which over-fits the noisy training dataset, and they achieve test accuracy similar to that of GD with early stopping. According to Theorem 4.1, GD+AUX and GD+RDI should have similar performances, which is also empirically verified in Figure 1.

We then evaluate the performance of SGD on CIFAR with 20% of labels flipped (see Figure 2(a)). In this setting, SGD+AUX outperforms SGD with early stopping by 1%. We also observe that there is a gap between the performance of SGD+AUX and that of SGD+RDI. An explanation is that SGD+AUX and SGD+RDI may have different trajectories, since Theorem 4.1 only applies to GD and sufficiently over-parameterized networks. Noting that the weights in SGD+RDI are farther from initialization than the weights in SGD+AUX (Figure 2(b)), we conjecture that the gap comes from the fact that SGD+RDI suffers more perturbation along its trajectory due to the finite width. Indeed, Lemma D.1 provides some theoretical evidence for the benefits of AUX: it states that AUX enjoys a linear convergence rate under gradient flow even if the network is not over-parameterized. Additional figures for other noise levels can be found in Appendix F.

6.2 Distance of Weights to Their Initialization

A key property in the NTK regime is that the training loss is decreased to 0 with minimal movement of the weights from their initialization, which in our case is bounded in Lemma C.2. We indeed find that the distance does not increase as the width of the network changes, as long as the network remains over-parameterized – it solely depends on the training data and labels and is much smaller than the norm of the initial weights. However, the weights still tend to move more with noise than without noise (Footnote 3: See Figures 9 and 8 in Appendix F for the distance in the noiseless case.), which leads to over-fitting. As shown in Figure 2(b), explicit regularization (AUX and RDI) can reduce the moving distance of the weights, thus giving better generalization (see Figure 2(a)).
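The quantity tracked in this subsection is easy to log during training; a hypothetical helper (not from the paper) might look like this:

```python
# Sketch: track ||theta - theta_0|| during training (a hypothetical helper).
import torch

def distance_to_init(model: torch.nn.Module, init_params) -> float:
    # init_params: detached copies of model.parameters() taken at initialization
    sq = sum((p.detach() - p0).pow(2).sum()
             for p, p0 in zip(model.parameters(), init_params))
    return sq.sqrt().item()

# usage:
#   init_params = [p.detach().clone() for p in model.parameters()]
#   ... train for an epoch ...
#   print(distance_to_init(model, init_params))
```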

Table 1 summarizes the relationship between the distance to initialization and other hyper-parameters that we observe from various experiments.

             # samples   noise level   width   regularization strength   learning rate
Distance         ↑             ↑          —                ↓                    —
Table 1: Relationship between distance to initialization at convergence and other hyper-parameters. Here "↑" means positive correlation, "↓" means negative correlation, and "—" means that width and learning rate do not affect the distance as long as the width is sufficiently large and the learning rate is sufficiently small.

7 Conclusion

Towards understanding the generalization of deep neural networks in the presence of noisy labels, this paper presents two simple regularization methods and shows that they are theoretically and empirically effective. The theoretical insights behind these methods come from the correspondence between neural networks and kernels. We believe that a better understanding of such correspondence could help the design of other principled methods in practice.

Acknowledgments

This work is supported by NSF, ONR, Simons Foundation, Schmidt Foundation, Mozilla Research, Amazon Research, DARPA and SRC. The authors thank Sanjeev Arora for helpful discussions and suggestions.

References

Appendix A Difference Trick

We give a quick analysis of the "difference trick" described in Footnote 1, i.e., let $f(\theta, x) = \frac{1}{\sqrt{2}}\big(g(\theta^{(1)}, x) - g(\theta^{(2)}, x)\big)$ where $\theta = (\theta^{(1)}, \theta^{(2)})$, and initialize $\theta^{(1)}$ and $\theta^{(2)}$ to be the same (and still random). The following lemma implies that the NTKs for $f$ and $g$ are the same.

Lemma A.1.

If $\theta^{(1)} = \theta^{(2)}$, then

  1. $f(\theta, x) = 0$ for all $x$;

  2. $\big\langle \nabla_\theta f(\theta, x),\, \nabla_\theta f(\theta, x') \big\rangle = \big\langle \nabla_{\theta^{(1)}} g(\theta^{(1)}, x),\, \nabla_{\theta^{(1)}} g(\theta^{(1)}, x') \big\rangle$ for all $x, x'$.

Proof.

(i) holds by definition. For (ii), we can calculate that $\nabla_{\theta^{(1)}} f(\theta, x) = \frac{1}{\sqrt{2}} \nabla_{\theta^{(1)}} g(\theta^{(1)}, x)$ and $\nabla_{\theta^{(2)}} f(\theta, x) = -\frac{1}{\sqrt{2}} \nabla_{\theta^{(2)}} g(\theta^{(2)}, x)$, so when $\theta^{(1)} = \theta^{(2)}$,

$\big\langle \nabla_\theta f(\theta, x),\, \nabla_\theta f(\theta, x') \big\rangle = \tfrac{1}{2} \big\langle \nabla g(\theta^{(1)}, x), \nabla g(\theta^{(1)}, x') \big\rangle + \tfrac{1}{2} \big\langle \nabla g(\theta^{(2)}, x), \nabla g(\theta^{(2)}, x') \big\rangle = \big\langle \nabla_{\theta^{(1)}} g(\theta^{(1)}, x),\, \nabla_{\theta^{(1)}} g(\theta^{(1)}, x') \big\rangle. \qquad ∎$

The above lemma allows us to ensure zero output at initialization while preserving the NTK. As a comparison, Chizat and Bach [2018] proposed the following "doubling trick": neurons in the last layer are duplicated, with the new neurons having the same input weights and opposite output weights. This satisfies zero output at initialization, but destroys the NTK. To see why, note that with the "doubling trick", the network will output 0 at initialization no matter what the input to its second-to-last layer is. Thus the gradients with respect to all parameters that are not in the last two layers are 0.

In our experiments, we observe that the performance of the neural net improves with the "difference trick." See Figure 3. This intuitively makes sense, since the initial network output is independent of the label (it only depends on the input) and thus can be viewed as noise. When the width of the neural net goes to infinity, the initial network output is a zero-mean Gaussian process, whose covariance matrix is equal to the part of the NTK contributed by the gradients of the parameters in the last layer. Therefore, learning an infinitely wide neural network with nonzero initial output is equivalent to doing kernel regression with additive correlated Gaussian noise on the training and testing labels.
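A sketch of the construction (the $1/\sqrt{2}$ scaling and the module names follow the description above and are assumptions of this sketch):

```python
# Sketch of the "difference trick": f(x) = (g1(x) - g2(x)) / sqrt(2), with g1
# and g2 sharing the same (random) initialization so that f is 0 at init.
import copy
import torch
import torch.nn as nn

class DifferenceNet(nn.Module):
    def __init__(self, base: nn.Module):
        super().__init__()
        self.g1 = base
        self.g2 = copy.deepcopy(base)       # identical copy of the initialization

    def forward(self, x):
        return (self.g1(x) - self.g2(x)) / 2 ** 0.5

base = nn.Sequential(nn.Linear(10, 512), nn.ReLU(), nn.Linear(512, 1))
f = DifferenceNet(base)
x = torch.randn(4, 10)
print(f(x))     # exactly zero at initialization; both copies are trained afterwards
```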

Figure 3: Plot of test error for a fully connected two-layer network on MNIST (binary classification between "5" and "8") with the difference trick and different mixing coefficients $\alpha$. Note that this parametrization preserves the NTK. The network has 10,000 hidden neurons and we train both layers with gradient descent with a fixed learning rate for 5,000 steps. The training loss is less than 0.0001 at the time of stopping. We observe that when $\alpha$ increases, the test error drops because the scale of the initial output of the network goes down.

Appendix B Missing Proof in Section 4

Proof of Theorem 4.1 (second part).

Now we prove the second part of the theorem. Notice that $\mathcal{L}^{\mathrm{RDI}}_{\mathrm{lin}}$ is a strongly convex quadratic function with Hessian $Z^\top Z + \lambda^2 I$, where $Z = [z_1, \ldots, z_n]^\top$. From classical convex optimization theory, as long as $\eta < \frac{2}{\|Z^\top Z + \lambda^2 I\|}$, gradient descent converges linearly to the unique optimum $\theta_\infty$ of $\mathcal{L}^{\mathrm{RDI}}_{\mathrm{lin}}$, which can be easily obtained:

$\theta_\infty = \theta_0 + (Z^\top Z + \lambda^2 I)^{-1} Z^\top \tilde y = \theta_0 + Z^\top (Z Z^\top + \lambda^2 I)^{-1} \tilde y.$

Then we have

$\langle z(x),\, \theta_\infty - \theta_0 \rangle = z(x)^\top Z^\top (K + \lambda^2 I)^{-1} \tilde y = k(x, X)^\top (K + \lambda^2 I)^{-1} \tilde y,$

finishing the proof. ∎

Appendix C Missing Proofs in Section 5

C.1 Proof of Theorem 5.1

We first prove two lemmas.

Lemma C.1.

With probability at least , we have

Proof.

In this proof we are conditioned on $X$ and $y$, and only consider the randomness in the noise $\epsilon = \tilde y - y$ given $X$ and $y$.

First of all, we can write

so we can write the training loss on true labels as

(11)

Next, since, conditioned on $X$ and $y$, $\epsilon$ has independent and subgaussian entries (with parameter $\sigma$), by [Hsu et al., 2012], for any symmetric matrix $A$, with probability at least $1 - \delta$,

(12)

Let and let be the eigenvalues of . We have

(13)

Therefore,

Finally, since (note ), we have

(14)

The proof is finished by combining (11), (13) and (14). ∎

Let $\mathcal{H}$ be the reproducing kernel Hilbert space (RKHS) corresponding to the kernel $k$, and let $\|\cdot\|_{\mathcal{H}}$ denote the RKHS norm.

Lemma C.2.

With probability at least , we have

Proof.

In this proof we are still conditioned on $X$ and $y$, and only consider the randomness in $\epsilon$ given $X$ and $y$. Note that with . Since and , we can bound