The “double descent” risk curve was proposed by \citet{belkin2018reconciling} to qualitatively describe the out-of-sample prediction performance of several variably-parameterized machine learning models.
This risk curve reconciles the classical bias-variance trade-off with the behavior of predictive models that interpolate training data, as observed for several model families (including neural networks) in a wide variety of applications \citep{bos1998dynamics,advani2017high,spigler2018jamming,belkin2018reconciling}.
In these studies, a predictive model with $p$ parameters is fit to a training sample of size $n$, and the test risk (i.e., out-of-sample error) is examined as a function of $p$.
When $p$ is below the sample size $n$, the test risk is governed by the usual bias-variance decomposition.
As $p$ is increased towards $n$, the training risk (i.e., in-sample error) is driven to zero, but the test risk shoots up towards infinity.
The classical bias-variance analysis identifies a “sweet spot” value of $p$ at which the bias and variance are balanced to achieve low test risk.
As $p$ grows beyond $n$, the test risk again decreases, provided that the model is fit using a suitable inductive bias (e.g., least norm solution).
In many (but not all) cases, the limiting risk as $p \to \infty$ is lower than what is achieved at the “sweet spot” value of $p$.
In this article, we study the key aspects of the “double descent” risk curve for the least squares/least norm predictor in two simple random features models.
The first is a Gaussian model, which was studied by \citet{breiman1983many} in the $p \le n$ regime.
The second is a Fourier series model for functions on the circle.
In both cases, we prove that the risk is infinite around $p = n$, and decreases towards its minimum as $p$ increases beyond $n$.
Our results provide a precise mathematical analysis of the mechanism described by \citet{belkin2018reconciling}.
The transition from under- to over-parametrized regimes was also analyzed by \citet{spigler2018jamming} by drawing a connection to the physical phenomenon of “jamming” in particle systems.
We note that in both models, the features are selected randomly, which makes them useful for studying scenarios where features are plentiful but individually too “weak” to be selected in an informed manner.
Such scenarios are commonplace in machine learning practice, but they should be contrasted with scenarios where features are carefully designed or curated, as is often the case in scientific applications.
For comparison, we give an example of “prescient” feature selection, where the $p$ most useful features are included in the model.
In this case, the optimal test risk is achieved at some $p < n$, which is consistent with the classical analysis of \citet{breiman1983many}.
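A toy version of this comparison (again our own construction, with decaying coefficient magnitudes standing in for feature “usefulness”) exhibits the classical behavior: when the $p$ individually strongest features are chosen presciently, the test risk is minimized at a moderate $p < n$ rather than near the interpolation threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 40, 100
beta = 1.0 / np.arange(1, d + 1)  # feature j has strength 1/j

def prescient_risk(p, trials=50):
    """Test risk when only the p individually strongest features are kept."""
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        y = X @ beta + 0.5 * rng.standard_normal(n)
        bhat = np.linalg.pinv(X[:, :p]) @ y  # least squares on the top-p features
        X_test = rng.standard_normal((1000, d))
        total += np.mean((X_test[:, :p] @ bhat - X_test @ beta) ** 2)
    return total / trials

risks = {p: prescient_risk(p) for p in [2, 5, 10, 20, 39]}
best = min(risks, key=risks.get)
# the minimizing p sits well below n = 40
```

Here adding more (increasingly weak) features past the sweet spot only inflates the variance, so no second descent rescues the risk before $p$ reaches $n$.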