A Brief Prehistory of Double Descent

04/07/2020 ∙ by Marco Loog, et al.

In their thought-provoking paper [1], Belkin et al. illustrate and discuss the shape of risk curves in the context of modern high-complexity learners. Given a fixed training sample size n, such curves show the risk of a learner as a function of some (approximate) measure of its complexity N. When N is the number of features, these curves are also referred to as feature curves. A salient observation in [1] is that these curves can display what the authors call double descent: with increasing N, the risk initially decreases, attains a minimum, and then increases until N equals n, at which point the training data is fitted perfectly. Increasing N even further, the risk decreases a second and final time, so that a peak appears at N=n. This twofold descent may come as a surprise, but, contrary to what [1] reports, it has not been overlooked historically. Our letter draws attention to some original, earlier findings of interest to contemporary machine learning.
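A minimal sketch of such a feature curve, not taken from the letter itself: for a fixed sample size n, fit minimum-norm least squares (via the pseudo-inverse, cf. [3]) on the first N of d Gaussian features and record the test risk as N grows. The concrete setup (the target w_true, the noise level, the choice of n and d) is an illustrative assumption; the qualitative shape, with a peak near N=n and a second descent beyond it, is what double descent refers to.

```python
# Illustrative sketch of a double-descent feature curve (assumed setup, not
# the experiment of the letter or of [1]): minimum-norm least squares on a
# growing set of Gaussian features, fixed training sample size n.
import numpy as np

rng = np.random.default_rng(0)
n, d, n_test = 40, 200, 2000              # training size, total features, test size
w_true = rng.normal(size=d) / np.sqrt(d)  # hypothetical linear target

X_train = rng.normal(size=(n, d))
y_train = X_train @ w_true + 0.1 * rng.normal(size=n)
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_true + 0.1 * rng.normal(size=n_test)

risks = {}
for N in range(1, d + 1):
    # Minimum-norm least-squares fit on the first N features (pseudo-inverse).
    w_hat = np.linalg.pinv(X_train[:, :N]) @ y_train
    risks[N] = np.mean((X_test[:, :N] @ w_hat - y_test) ** 2)

# The test risk typically peaks near N = n, where the training data is first
# fitted perfectly, and descends a second time for N > n.
print(f"risk at N = n: {risks[n]:.3f}")
print(f"risk at N = d: {risks[d]:.3f}")
```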


References

  • [1] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
  • [2] F Vallet, J-G Cailton, and Ph Refregier. Linear and nonlinear extension of the pseudo-inverse solution for learning Boolean functions. Europhysics Letters, 9(4):315, 1989.
  • [3] Roger Penrose. On best approximate solutions of linear matrix equations. Mathematical Proceedings of the Cambridge Philosophical Society, 52(1):17–19, 1956.
  • [4] M Opper, W Kinzel, J Kleinz, and R Nehl. On the ability of the optimal perceptron to generalise. Journal of Physics A: Mathematical and General, 23(11):L581, 1990.
  • [5] Timothy LH Watkin, Albrecht Rau, and Michael Biehl. The statistical mechanics of learning a rule. Reviews of Modern Physics, 65(2):499, 1993.
  • [6] Robert P W Duin. Classifiers in almost empty spaces. In Proceedings of the 15th International Conference on Pattern Recognition, volume 2, pages 1–7. IEEE, 2000.
  • [7] Marina Skurichina and R P W Duin. Regularization by adding redundant features. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 564–572. Springer, 1998.
  • [8] Jesse H Krijthe and Marco Loog. The peaking phenomenon in semi-supervised learning. In Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), pages 299–309. Springer, 2016.
  • [9] Š Raudys and R P W Duin. Expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix. Pattern Recognition Letters, 19(5-6):385–392, 1998.
  • [10] Marco Loog, Tom Viering, and Alexander Mey. Minimizers of the empirical risk and risk monotonicity. In Advances in Neural Information Processing Systems, pages 7476–7485, 2019.