Understanding Generalization of Deep Neural Networks Trained with Noisy Labels
Over-parameterized deep neural networks trained by simple first-order methods are known to be able to fit any labeling of the data. When a fraction of the training labels is noisy, can neural networks resist over-fitting and still generalize on the true distribution? Inspired by recent theoretical work that established connections between over-parameterized neural networks and the neural tangent kernel (NTK), we propose two simple regularization methods for this purpose: (i) penalizing the distance between the network parameters and their initialization, and (ii) adding a trainable auxiliary variable to the network output for each training example. Theoretically, both methods are related to kernel ridge regression with respect to the NTK, and we prove generalization guarantees for both on the true data distribution, even though training uses noisy labels. The generalization bound is independent of the network size and depends only on the training inputs, the true labels (rather than the noisy labels), and the noise level in the labels. Empirical results verify the effectiveness of these methods on noisily labeled datasets.
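The two regularizers can be illustrated with a minimal sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: the squared loss, the `lam ** 2` scaling of the penalty, the linear stand-in model `predict`, and the names `loss_dist_to_init` and `loss_aux_variable`.

```python
import numpy as np


def predict(theta, X):
    # Hypothetical stand-in for the network f(theta, x): a linear model.
    return X @ theta


def loss_dist_to_init(theta, theta0, X, y_noisy, lam):
    # Method (i): squared loss on the noisy labels plus a penalty on the
    # squared distance between the parameters and their initialization.
    # The lam ** 2 scaling is an assumption, not taken from the abstract.
    residual = predict(theta, X) - y_noisy
    penalty = 0.5 * lam ** 2 * np.sum((theta - theta0) ** 2)
    return 0.5 * np.sum(residual ** 2) + penalty


def loss_aux_variable(theta, b, X, y_noisy, lam):
    # Method (ii): one trainable auxiliary scalar b[i] per training example
    # is added to the model output (scaled by lam, an assumed choice).
    # theta and b would both be updated by the optimizer.
    residual = predict(theta, X) + lam * b - y_noisy
    return 0.5 * np.sum(residual ** 2)
```

In this sketch the auxiliary variables `b` exist only for the training examples, so a prediction on new inputs would presumably use `predict(theta, X_new)` alone. This lets `b` absorb part of the label noise instead of the network parameters.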