A Brief Prehistory of Double Descent

04/07/2020, by Marco Loog et al.

In their thought-provoking paper [1], Belkin et al. illustrate and discuss the shape of risk curves in the context of modern high-complexity learners. Given a fixed training sample size n, such curves show the risk of a learner as a function of some (approximate) measure of its complexity N. When N is the number of features, these curves are also referred to as feature curves. A salient observation in [1] is that these curves can display what they call double descent: with increasing N, the risk initially decreases, attains a minimum, and then increases until N equals n, at which point the training data is fitted perfectly. Increasing N even further, the risk decreases a second and final time, leaving a peak at N=n. This twofold descent may come as a surprise, but contrary to what [1] suggests, it has not been overlooked historically. Our letter draws attention to some original, earlier findings of interest to contemporary machine learning.








  • [1] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
  • [2] F. Vallet, J.-G. Cailton, and Ph. Refregier. Linear and nonlinear extension of the pseudo-inverse solution for learning Boolean functions. Europhysics Letters, 9(4):315, 1989.
  • [3] Roger Penrose. On best approximate solutions of linear matrix equations. Mathematical Proceedings of the Cambridge Philosophical Society, 52(1):17–19, 1956.
  • [4] M. Opper, W. Kinzel, J. Kleinz, and R. Nehl. On the ability of the optimal perceptron to generalise. Journal of Physics A: Mathematical and General, 23(11):L581, 1990.
  • [5] Timothy L. H. Watkin, Albrecht Rau, and Michael Biehl. The statistical mechanics of learning a rule. Reviews of Modern Physics, 65(2):499, 1993.
  • [6] Robert P. W. Duin. Classifiers in almost empty spaces. In Proceedings of the 15th International Conference on Pattern Recognition, volume 2, pages 1–7. IEEE, 2000.
  • [7] Marina Skurichina and R. P. W. Duin. Regularization by adding redundant features. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 564–572. Springer, 1998.
  • [8] Jesse H. Krijthe and Marco Loog. The peaking phenomenon in semi-supervised learning. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 299–309. Springer, 2016.
  • [9] Š. Raudys and R. P. W. Duin. Expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix. Pattern Recognition Letters, 19(5-6):385–392, 1998.
  • [10] Marco Loog, Tom Viering, and Alexander Mey. Minimizers of the empirical risk and risk monotonicity. In Advances in Neural Information Processing Systems, pages 7476–7485, 2019.