Regularization-wise double descent: Why it occurs and how to eliminate it

by Fatih Furkan Yilmaz, et al.

The risk of overparameterized models, in particular deep neural networks, is often double-descent shaped as a function of the model size. Recently, it was shown that the risk as a function of the early-stopping time can also be double-descent shaped, and that this behavior can be explained as a superposition of bias-variance tradeoffs. In this paper, we show that the risk of explicitly L2-regularized models can exhibit double descent behavior as a function of the regularization strength, both in theory and in practice. We find that for linear regression, a double-descent-shaped risk is caused by a superposition of bias-variance tradeoffs corresponding to different parts of the model, and that it can be mitigated by scaling the regularization strength of each part appropriately. Motivated by this result, we study a two-layer neural network and show that double descent can be eliminated by adjusting the regularization strengths for the first and second layers. Lastly, we study a 5-layer CNN and ResNet-18 trained on CIFAR-10 with label noise, and on CIFAR-100 without label noise, and demonstrate that all exhibit double descent behavior as a function of the regularization strength.
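To make the linear-regression statement concrete, the following is a minimal sketch of ridge regression in which each coordinate of the model gets its own regularization weight, so that the test risk can be traced as a function of the overall strength. It is not the paper's experimental setup: the synthetic data, the two feature groups, the helper names `generalized_ridge` and `risk_curve`, and the particular choice of per-part scaling are all assumptions made for illustration.

```python
import numpy as np

def generalized_ridge(X, y, lam, scales):
    """Minimize ||y - X w||^2 + lam * sum_j scales[j] * w[j]^2 in closed form."""
    penalty = lam * np.diag(scales)
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

def risk_curve(X_train, y_train, X_test, y_test, lams, scales):
    """Mean squared test error for each regularization strength in lams."""
    risks = []
    for lam in lams:
        w = generalized_ridge(X_train, y_train, lam, scales)
        risks.append(np.mean((X_test @ w - y_test) ** 2))
    return np.array(risks)

# Assumed toy problem: two groups of features with very different scales.
rng = np.random.default_rng(0)
n, d1, d2 = 50, 10, 10
feature_std = np.concatenate([np.full(d1, 1.0), np.full(d2, 0.1)])
w_true = rng.normal(size=d1 + d2)

def sample(m):
    """Draw m noisy linear-regression examples from the toy model."""
    X = rng.normal(size=(m, d1 + d2)) * feature_std
    y = X @ w_true + 0.5 * rng.normal(size=m)
    return X, y

X_train, y_train = sample(n)
X_test, y_test = sample(2000)
lams = np.logspace(-4, 3, 30)

# Same strength for every coordinate (standard ridge).
uniform_scales = np.ones(d1 + d2)
# Hypothetical per-part scaling: penalty proportional to each feature's variance.
per_part_scales = feature_std ** 2

risk_uniform = risk_curve(X_train, y_train, X_test, y_test, lams, uniform_scales)
risk_scaled = risk_curve(X_train, y_train, X_test, y_test, lams, per_part_scales)
```

Plotting `risk_uniform` and `risk_scaled` against `lams` on a logarithmic axis is one way to compare the shape of the two curves; whether a second descent appears depends on the data model, so the sketch makes no claim about what the curves look like here.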



Early Stopping in Deep Networks: Double Descent and How to Eliminate it

Over-parameterized models, in particular deep networks, often exhibit a ...

Kernel regression in high dimension: Refined analysis beyond double descent

In this paper, we provide a precise characterization of generalization prope...

Optimal Regularization Can Mitigate Double Descent

Recent empirical and theoretical studies have shown that many learning a...

Superior generalization of smaller models in the presence of significant label noise

The benefits of over-parameterization in achieving superior generalizati...

Asymptotic Risk of Overparameterized Likelihood Models: Double Descent Theory for Deep Neural Networks

We investigate the asymptotic risk of a general class of overparameteriz...

Double Descent in Adversarial Training: An Implicit Label Noise Perspective

Here, we show that the robust overfitting shall be viewed as the early p...

Mitigating deep double descent by concatenating inputs

The double descent curve is one of the most intriguing properties of dee...