Understanding the generalisation properties of deep artificial neural networks (ANN) has recently motivated a number of statistical questions. These models perform well in practice despite perfectly fitting (interpolating) the data, a property that seems at odds with classical statistical theory. This has motivated the investigation of the generalisation performance of methods that achieve zero training error (interpolators) [32, 9, 11, 10, 8] and, in the context of linear least squares, the unique least norm solution to which gradient descent converges [22, 5, 37, 8, 21, 38, 20, 39]. Overparameterised linear models, where the number of variables exceeds the number of samples, are arguably the simplest and most natural setting in which interpolation can be studied. Moreover, in certain regimes ANN can be approximated by suitable linear models [24, 17, 18, 2, 13].
The learning curve (test error versus model capacity) for interpolators has been shown to exhibit a characteristic “Double Descent” [1, 7] shape, where the test error decreases after peaking at the “interpolating” threshold, that is, the model capacity required to interpolate the data. The regime beyond this threshold naturally captures the setting of ANN, and has thus motivated investigation [36, 44, 39]. Indeed, for least squares regression, sharp characterisations of a double descent curve have been obtained for the least norm interpolating solution in the case of isotropic or auto-regressive covariates [22, 8] and for random features.
For least squares regression, the structure of the features and data naturally influences generalisation performance. Arguably this also arises for ANN where, for instance, inductive biases can be encoded in the network architecture, e.g. convolution layers for image classification [29, 30]. In contrast, least squares models investigated beyond the interpolation threshold have focused on cases where the ground truth parameter is symmetric in nature [16, 22, 5], without a natural notion of the estimation problem's difficulty. This has left open the natural question of what characteristics the learning curve exhibits beyond the interpolating threshold when the features and data are drawn from more structured distributions, such as lower-dimensional spaces.
In this work we investigate the performance of ridge regression, and its ridgeless limit, assuming the data is generated from a noisy linear model with a structured regression parameter. This structure is encoded through a general function analogous to the source condition used in kernel regression and inverse problems, see e.g. [35, 6]. The function is applied to the spectrum of the population covariance of covariates and represents how well the true regression parameter is aligned to the variation in the covariates. We then study the test error of the ridge regression estimator in a high-dimensional asymptotic regime where the number of samples and the ambient dimension go to infinity in proportion to one another. The limits of the resulting quantities are then characterised by utilising tools from asymptotic Random Matrix Theory [3, 31, 16, 22], with results specifically developed to characterise the influence of the source condition. This provides a more general framework for studying the limiting test error of ridge regression, characterised by the signal-to-noise ratio, regularisation, overparameterisation, and now the structure of the parameter through the source condition.
We then instantiate our general framework and results on a stylised structure, allowing us to study model misspecification and its effect on prediction error. Specifically, we consider a population covariance with two types of Eigenvectors: strong features, associated with a common large Eigenvalue (hence favoured by the ridge estimator), as well as weak features, with a common smaller Eigenvalue. This model is an idealisation of a realistic structure for distributions, with some parts of the signal (associated for instance to high smoothness, or low-frequency components) easier to estimate than other, higher-frequency components. The use of source conditions allows us to study situations where the true coefficients exhibit either faster or slower decay than implicitly postulated by the ridge estimator, a form of model misspecification which affects predictive performance. This encodes the difficulty of the problem, and allows us to distinguish between “easy” and “hard” learning problems. We now summarise the primary contributions of this work.
Asymptotic Prediction Error under General Source Condition. An asymptotic characterisation of the test error under a general source condition on the ground truth is provided. This required characterising the limit of certain trace quantities, and provides a richer framework for investigating the performance of ridge regression. (Theorem 1)
Zero Ridge Regularisation Optimal for Easy Problems with High SNR.
In the “easy”, overparameterised and high signal-to-noise ratio (SNR) case, we show that the optimal regularisation choice is zero. Previously, for least squares regression with an isotropic prior, the optimal regularisation choice was zero only in the limit of infinite signal-to-noise ratio [14, 16]. (Section 3.1)
Our analysis of the strong and weak features model also provides asymptotic characterisations of a number of phenomena recently observed within the literature. That is, adding noisy weak features performs implicit regularisation and can recover the performance of optimally tuned regression restricted to the strong features. Also, we show an additional peak occurring in the learning curve beyond the interpolation threshold for the ridgeless bias and variance. These particular insights are presented in Sections 3.2 and 3.3, respectively.
Let us now describe the remainder of this work. Section 1.1 covers the related literature. Section 2 describes the setting, and provides the general theorem. Section 3 formally introduces the strong and weak features model, and presents the aforementioned insights. Section 4 gives the conclusion.
1.1 Related Literature
Due to the large number of works investigating interpolating methods as well as double descent, we next focus on works that consider the asymptotic regime.
Random matrix theory has found numerous applications in high-dimensional statistics [48, 19]. In particular, asymptotic random matrix theory has been leveraged to study the predictive performance of ridge regression under a well-specified linear model with an isotropic prior on the parameter, first for identity population covariance [27, 26, 14, 47] and then for general population covariance. More recently, the limiting test error of the least norm predictor has been considered under the spiked covariance model, where both a subset of Eigenvalues and the ratio of dimension to samples diverge to infinity; there, the bias is shown to be bounded by the norm of the ground truth projected on the Eigenvectors associated to the subset of large Eigenvalues. In contrast to these works, our work follows the kernel regression and inverse problems literature, by adding structural assumptions on the parameter through the variation of its coefficients along the covariance basis.
Double Descent for Least Squares.
While interpolating predictors (which perfectly fit training data) are classically expected to be sensitive to noise and exhibit poor out-of-sample performance, empirical observations about the behaviour of artificial neural networks challenged this received wisdom. This surprising phenomenon, where interpolators can generalise, has first been shown for some local averaging estimators [11, 9], kernel “ridgeless” regression, and linear regression, where conditions on the covariance structure were characterised under which ridgeless estimation has small variance. A “double descent” phenomenon for interpolating predictors, where test error can decrease past the interpolation threshold, has been suggested in this line of work. This double descent curve has been established in the context of asymptotic least squares [22, 36, 8, 20, 38, 39]. Some of these works consider either isotropic or auto-regressive features, while others consider random features constructed from a non-linear function applied to the product of isotropic covariates and a random matrix. Meanwhile, the works [37, 20, 38] consider recovery guarantees under sparsity assumptions on the parameter, including a peak in the test error when the number of samples equals the sparsity of the true predictor. Recovery properties of interpolators have also been studied in the non-asymptotic regime. In contrast to these works, we make a direct connection between the population covariance and the ground truth parameter. Finally, recent empirical evidence shows that additional peaks in the test error can occur beyond the interpolation threshold when the covariance is misaligned with the ground truth predictor. These empirical observations are verified by the theory we develop in this paper.
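As an illustration of this threshold behaviour, a minimal simulation (our own sketch; sizes and the noise level are arbitrary choices) of the least norm interpolator under isotropic Gaussian covariates shows the risk spiking when the dimension approaches the sample size:

```python
import numpy as np

def min_norm_risk(n, d, sigma, trials, rng):
    """Average risk E||beta_hat - beta*||^2 of the least norm interpolator
    beta_hat = X^+ y under isotropic covariates; with identity covariance
    this coincides with the excess test risk. A minimal illustration."""
    risks = []
    for _ in range(trials):
        beta = rng.standard_normal(d) / np.sqrt(d)    # ||beta*|| approx 1
        X = rng.standard_normal((n, d))
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.linalg.pinv(X) @ y              # least norm solution
        risks.append(np.sum((beta_hat - beta) ** 2))
    return np.mean(risks)
```

For fixed $n$, the average risk is moderate when $d \ll n$ or $d \gg n$, but blows up near $d = n$, reproducing the peak at the interpolation threshold.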
2 Dense Regression with General Source Condition
In this section we formally introduce the setting as well as the main theorem. Section 2.1 introduces the linear regression setting. Section 2.2 introduces the functionals that arise from asymptotic random matrix theory. Section 2.3 then presents the main theorem.
2.1 Problem Setting
We start by introducing the linear regression setting and the general source condition.
We consider prediction in a random-design linear regression setting with Gaussian covariates. Let $\beta_* \in \mathbb{R}^d$ denote the true regression parameter, $\Sigma$ the population covariance, and $\sigma^2$ the noise variance; up to rescaling, these quantities can be normalised. We consider an i.i.d. dataset $(x_1, y_1), \dots, (x_n, y_n)$ such that for $i = 1, \dots, n$,
In what follows, we let $y = (y_1, \dots, y_n)^\top$, and denote by $X \in \mathbb{R}^{n \times d}$ the design matrix with rows $x_1, \dots, x_n$. Given the samples, the objective is to derive an estimator that minimises the error of predicting a new response. For a fixed parameter $\beta$, the test risk is $R(\beta) = \mathbb{E}[(y - x^\top \beta)^2]$, where the expectation is with respect to a new response sampled according to (1). We consider ridge regression [23, 46], defined for a regularisation parameter $\lambda > 0$ by
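In closed form, the ridge estimator can be sketched as follows (a minimal sketch; the $1/n$ normalisation of the Gram matrix is a convention we assume here):

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Ridge regression estimate (X^T X / n + lam I)^{-1} X^T y / n.

    The 1/n normalisation keeps lam comparable across sample sizes
    (an assumed convention, matching the asymptotic analysis)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
```

As $\lambda \to 0$ with $n > d$ this recovers ordinary least squares, while large $\lambda$ shrinks the estimate towards zero.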
We consider an average-case analysis where the parameter $\beta_*$ is random, sampled with covariance encoded by a source function $f$, which describes how the coefficients of $\beta_*$ vary along the Eigenvectors of $\Sigma$. Specifically, denote by $(\lambda_i, v_i)_{1 \leq i \leq d}$ the Eigenvalue–Eigenvector pairs of $\Sigma$, ordered so that $\lambda_1 \geq \dots \geq \lambda_d$. For a nonnegative function $f$ on the spectrum (suitably normalised), the parameter $\beta_*$ is such that
For estimators linear in $y$ (such as ridge regression), the expected risk only depends on the first two moments of the prior on $\beta_*$; hence one can assume a Gaussian prior. Under prior (3), the coordinate of $\beta_*$ in the $i$-th Eigen-direction has standard deviation proportional to $\sqrt{f(\lambda_i)/d}$. We note that, as $d \to \infty$, $\beta_*$ has a “dense” high-dimensional structure, where the number of its components grows with $d$, while their magnitude decreases proportionally. This prior is an average-case, high-dimensional analogue of the standard source condition considered in inverse problems and nonparametric regression [35, 6], which describes the behaviour of the coefficients of $\beta_*$ along the Eigenvector basis of $\Sigma$. In the special case of constant $f$, the prior is isotropic: a Gaussian draw is rotation invariant, with squared norm distributed as a scaled $\chi^2_d$ variable that concentrates as $d \to \infty$, hence “close” to the uniform distribution on a sphere.
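Sampling from this prior can be sketched as follows (assuming, as described above, that the coefficient along the $i$-th Eigenvector has variance $f(\lambda_i)/d$; the unit signal scale and the function name are our illustrative choices):

```python
import numpy as np

def sample_prior(eigvals, eigvecs, f, rng):
    """Draw a parameter whose coefficient along the i-th Eigenvector of the
    population covariance has variance f(lambda_i)/d (source-condition prior).

    eigvals: (d,) population Eigenvalues; eigvecs: (d, d) with Eigenvectors
    as columns; f: vectorised source function."""
    d = len(eigvals)
    coeffs = rng.standard_normal(d) * np.sqrt(f(eigvals) / d)
    return eigvecs @ coeffs
```

For constant $f \equiv 1$ this reduces to the isotropic prior $N(0, I_d/d)$, whose squared norm concentrates around $1$ as $d$ grows.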
Easy and Hard Problems.
The case of a constant function $f$ corresponds to an isotropic prior under the Euclidean norm used for regularisation, and has been studied by [14, 16, 22]. In this case (see Remark 1 below), properly-tuned ridge regression (in terms of $\lambda$) is optimal in terms of average risk. The influence of $f$ can be understood in terms of the average signal strength in Eigen-directions of $\Sigma$. Specifically, let $v$ be an Eigenvector of $\Sigma$, with associated Eigenvalue $\lambda$. The signal strength in direction $v$ (namely, the contribution of this direction to the signal) is $\lambda \langle \beta_*, v \rangle^2$, and its expectation over $\beta_*$ is proportional to $\lambda f(\lambda)$. When $f$ is increasing, strength along direction $v$ decays faster as $\lambda$ decreases than postulated by the ridge estimator. In this sense, the problem is lower-dimensional, and hence “easier” than for constant $f$; likewise, a decreasing $f$ is associated to a slower decay of coefficients, and therefore a “harder”, higher-dimensional problem. While our results do not require this restriction, it is natural to consider functions $f$ such that $\lambda \mapsto \lambda f(\lambda)$ is non-decreasing, so that principal components (with larger Eigenvalue) carry more signal on average; otherwise, the norm used by the ridge estimator favours the wrong directions. In this respect, the hardest prior is obtained for $f(\lambda) = 1/\lambda$, corresponding to the isotropic prior in the prediction norm induced by $\Sigma$: for this un-informative prior, all directions have the same signal strength. Finally, note that in the standard nonparametric setting of reproducing kernel Hilbert spaces, source conditions are related to smoothness of the regression function.
As $\beta_*$ is random, we study the expected performance of the ridge estimator against the ground truth, i.e. the expected test error $\mathbb{E}[R(\hat{\beta}_\lambda)]$, where the expectation is with respect to the parameter $\beta_*$ and the noise within the samples.
Remark 1 (Oracle Estimator)
The best linear (in $y$) estimator in terms of average risk can be described explicitly. It corresponds to the Bayes-optimal estimator under the prior (3) on $\beta_*$, which writes:
This estimator requires knowledge of $f$, $\Sigma$ and the signal-to-noise ratio. In the special case of an isotropic prior with constant $f$, the oracle estimator is the ridge estimator (2) with a regularisation parameter inversely proportional to the signal-to-noise ratio.
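As a sketch, under a Gaussian prior with covariance $(r^2/d) f(\Sigma)$ (where $r^2$ denotes the signal strength, a symbol of our choosing) and Gaussian noise of variance $\sigma^2$, standard conjugacy gives the posterior mean

```latex
\hat{\beta}_{\mathrm{oracle}}
  = \left( X^\top X + \frac{\sigma^2 d}{r^2}\, f(\Sigma)^{-1} \right)^{-1} X^\top y .
```

For constant $f$ this reduces to ridge regression with penalty $\lambda \propto \sigma^2 d / (n r^2) = \gamma \sigma^2 / r^2$, inversely proportional to the signal-to-noise ratio $r^2/\sigma^2$.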
2.2 Random Matrix Theory
Let us now describe the considered asymptotic regime, as well as quantities and notions from random matrix theory that appear in the analysis.
We study the performance of the ridge estimator under high-dimensional asymptotics [27, 26, 14, 16, 47, 3], where the number of samples $n$ and dimension $d$ go to infinity proportionally, with $d/n \to \gamma \in (0, \infty)$. This setting enables a precise characterisation of the risk, beyond the classical regime where $n \to \infty$ with a fixed true distribution.
The ratio $\gamma = d/n$ plays a key role. A value of $\gamma > 1$ corresponds to an overparameterised model, with more parameters than samples. Some care is required in interpreting this quantity: indeed, for a fixed sample size $n$, varying $\gamma$ changes $d$ and hence the underlying distribution. Hence, $\gamma$ should not be interpreted as a degree of overparameterisation. Rather, it quantifies the sample size relative to the dimension of the problem.
Random Matrix Theory.
We assume that the spectral distribution of the population covariance converges almost surely to a probability distribution $H$ supported on a compact subset of $(0, \infty)$. Specifically, denoting the cumulative distribution function of the population covariance Eigenvalues by $H_d$, we have $H_d \to H$ almost surely as $d \to \infty$.
A key quantity utilised within the analysis is the Stieltjes Transform of the empirical spectral distribution, defined for $z \in \mathbb{C} \setminus \mathbb{R}_{+}$ as $m_d(z) = \frac{1}{d} \operatorname{Tr}\big( (\hat{\Sigma} - z I)^{-1} \big)$, where $\hat{\Sigma} = \frac{1}{n} X^\top X$ denotes the empirical covariance. Under appropriate assumptions on the covariates, it is known that this transform converges almost surely to a limit $m(z)$ that satisfies the following stationary point equation
$$ m(z) = \int \frac{\mathrm{d}H(\tau)}{\tau \big(1 - \gamma - \gamma z\, m(z)\big) - z} . \qquad (5) $$
In the case of an isotropic covariance $\Sigma = I$, where the limiting spectral distribution is a point mass at one, the above equation can be solved in closed form, yielding the Stieltjes Transform of the Marchenko–Pastur distribution. For more general spectral densities, the stationary point equation (5) may not be as easily solved algebraically, but can still yield insights into the limiting properties of quantities that arise. One tool that we will use extensively to gain insights into quantities depending on $m(z)$ is its companion transform $v(z)$, the Stieltjes transform of the limiting spectral distribution of $\frac{1}{n} X X^\top$; it is related to $m(z)$ through the equality $v(z) = \gamma\, m(z) - (1 - \gamma)/z$. Finally, we introduce the $f$-weighted Stieltjes Transform, which is the limit of the trace quantity $\frac{1}{d} \operatorname{Tr}\big( f(\Sigma) (\hat{\Sigma} - z I)^{-1} \big)$.
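The stationary point equation can be evaluated numerically. Below is a sketch of our own (assuming an atomic population spectral distribution and real $z < 0$, for which plain fixed-point iteration is assumed to converge):

```python
import numpy as np

def mp_stieltjes(z, gamma, eigvals, weights, iters=500):
    """Fixed-point iteration for the stationary point equation
        m(z) = sum_j w_j / ( t_j * (1 - gamma - gamma * z * m(z)) - z ),
    where the population spectral distribution H puts mass w_j on t_j.
    A numerical sketch for real z < 0; convergence is assumed."""
    m = -1.0 / z  # natural starting point, correct as z -> -infinity
    for _ in range(iters):
        m = np.sum(weights / (eigvals * (1 - gamma - gamma * z * m) - z))
    return m
```

For a point mass at one (isotropic covariance) the result matches the empirical trace $\frac{1}{d}\operatorname{Tr}\big((\hat\Sigma - zI)^{-1}\big)$ of a simulated Gaussian design.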
2.3 Main Theorem: Asymptotic Risk under General Source Condition
Let us now state the main theorem of this work, which provides the limit of the ridge regression risk.
The above theorem characterises the expected test error of the ridge estimator when the sample size and dimension go to infinity with $d/n \to \gamma$, and $\beta_*$ is distributed as (3). The asymptotic risk in Theorem 1 is characterised by the relative sample size, the limiting spectral distribution $H$, and the source function $f$ (suitably normalised). This provides a general form for studying the asymptotic test error of ridge regression in a dense high-dimensional setting. The source condition affects the limiting bias; to evaluate it, we are required to study the limit of the trace quantity $\frac{1}{d} \operatorname{Tr}\big( f(\Sigma) (\hat{\Sigma} - z I)^{-1} \big)$, which is achieved utilising techniques from the random matrix literature (key steps in the proof of Lemma 1, Appendix C). The variance term in Theorem 1 aligns with that seen previously, as the structure of $\beta_*$ only influences the bias.
We now give some examples of the asymptotic expected risk in Theorem 1 for different structures of $f$, namely constant $f$ (isotropic), $f(\lambda) = \lambda$ (easier case) and $f(\lambda) = 1/\lambda$ (harder case).
Consider the setting of Theorem 1 with one of these source functions. Then, almost surely,
The three choices of source function in Corollary 1 are cases where the functional for the asymptotic bias in Theorem 1 can be expressed in terms of the companion transform and its first derivative. The expression in the isotropic case was previously investigated in the literature, while for $f(\lambda) = 1/\lambda$ the bias aligns with quantities previously studied, and thus can be simply plugged in. For $f(\lambda) = \lambda$, we show how similar algebraic manipulations allow the bias to be simplified. Finally, in the isotropic case it is clear how the bias and variance can be brought together and simplified, yielding the optimal regularisation choice; see also Remark 1. As noted in Section 2.1, $f(\lambda) = 1/\lambda$ corresponds to a hardest case, with no favoured direction. Finally, $f(\lambda) = \lambda$ corresponds to an “easier” case with faster coefficient decay.
3 Strong and Weak Features Model
In this section we consider a simple covariance structure, the strong and weak features model. Let $V_s \in \mathbb{R}^{d_s \times d}$ and $V_w \in \mathbb{R}^{d_w \times d}$ be two matrices with orthonormal rows such that $d_s + d_w = d$ and their collection of rows forms an orthonormal basis of $\mathbb{R}^d$. The covariance considered is then, for Eigenvalues $s \geq w > 0$,
$$ \Sigma = s\, V_s^\top V_s + w\, V_w^\top V_w . $$
Unless stated otherwise, we adopt the convention that the Eigenvalues are ordered $s \geq w$. Naturally, we call elements of the span of the rows of $V_s$ strong features, since they are associated to the dominant Eigenvalue $s$. Similarly, $V_w$ is associated to the weak features. The sizes $d_s, d_w$ then go to infinity with the sample size, with $d_s/d \to \phi_s$ and $d_w/d \to \phi_w = 1 - \phi_s$. The limiting spectral measure of $\Sigma$ in this case is then atomic: $H = \phi_s \delta_s + \phi_w \delta_w$.
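A sketch of this covariance in code (the names `phi_s`, `s` and `w` are our illustrative choices for the strong-feature proportion and the two Eigenvalues):

```python
import numpy as np

def strong_weak_covariance(d, phi_s, s, w, rng):
    """Covariance with a fraction phi_s of Eigenvalues equal to s (strong)
    and the remainder equal to w (weak), in a random orthonormal basis."""
    k = int(phi_s * d)
    eigvals = np.concatenate([np.full(k, s), np.full(d - k, w)])
    # A Haar-like orthonormal basis via QR of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q @ np.diag(eigvals) @ Q.T
```

The spectrum of the resulting matrix is exactly the two-atom measure described above, with multiplicities $\phi_s d$ and $(1 - \phi_s) d$.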
The parameter $\beta_*$ then has covariance determined by coefficients $\rho_s$ and $\rho_w$ for each type of feature, with source condition $f(s) = \rho_s$ and $f(w) = \rho_w$. These coefficients encode the composition of the ground truth in terms of strong and weak features, and thus the difficulty of the estimation problem. Equal coefficients correspond to the isotropic prior, while $\rho_s > \rho_w$ corresponds to faster decay and hence an “easier” problem. In particular, as $\rho_s/\rho_w$ increases, $\beta_*$ has faster decay and the problem becomes “easier”, since the ground truth is increasingly made of strong features. We then say that the problem is easy when the ratio $\rho_s/\rho_w$ is large, and hard when it is small.
Under the model just introduced, Theorem 1 gives the following asymptotic characterisation of the expected test risk in terms of the companion transform:
We now investigate the above limit in the regime where the dimension exceeds the sample size, in order to gain insight into the performance of least squares when data is generated from the strong and weak features model. (Evaluating the companion transform requires solving a polynomial equation, since the limiting measure is atomic; in our case the polynomial is of order at most 3 and can be solved efficiently.) The insights are summarised in the following sections. Section 3.1 shows that zero regularisation can be optimal in some situations. Section 3.2 shows how noisy weak features can be added and used as a form of regularisation similar to ridge regression. Section 3.3 presents findings related to the ridgeless bias and variance.
3.1 Zero Regularisation can be Optimal for Easy Problems with High SNR
In this section, we investigate how the true regression function, namely the parameter $\beta_*$ (through the source condition), affects optimal ridge regularisation. Here we consider the easy case; the hard case is investigated in Appendix A.1. Figure 1 plots the performance of optimally tuned ridge regression (Left) and the optimal choice of regularisation parameter (Right) against (a monotonic transform of) the Eigenvalue ratio $s/w$, for a range of coefficient ratios $\rho_s/\rho_w$.
As shown in the right plot of Figure 1, for a fixed distribution of covariates and a fixed sample size, as the coefficient ratio $\rho_s/\rho_w$ increases (that is, as the signal concentrates more on strong features), the optimal regularisation decreases. Remarkably, if the ratio is large enough, the optimal ridge regularisation parameter can be $\lambda = 0$, corresponding to ridgeless interpolation.
Comparison with the Isotropic Model.
In the case of a parameter drawn from an isotropic prior (see Section 2.1), the optimal ridge parameter is always strictly positive and inversely proportional to the signal-to-noise ratio (see Remark 1, as well as [16, 22]). Studying the influence of $\beta_*$ through a general source function shows that optimal regularisation also depends on the coefficient decay of $\beta_*$; the optimal regularisation can equal zero, in which case the estimator interpolates the training data. Finally, let us note that the optimal estimator of Remark 1 (with oracle knowledge of the prior) does not interpolate; hence, the optimality of interpolation among the family of ridge estimators arises from a form of “prior misspecification”. We believe this phenomenon extends beyond the specific case of ridge estimators.
3.2 The Special Case of Noisy Weak Features
In this section we consider the special case where the weak features are pure noise variables carrying no signal, while their dimension is large. Such noisy weak features can be artificially introduced to the dataset, to induce an overparameterised problem. We refer to this technique as Noisy Feature Regularisation, and note it corresponds to design matrix augmentation. In Figure 2, the ridgeless test error is plotted against the Eigenvalue ratio (Left) and against the number of weak features with a tuned Eigenvalue ratio (Right).
Observe (right plot) that as we increase the number of weak features and tune the Eigenvalue ratio, the performance converges to that of optimally tuned ridge regression on the strong features only. The left plot then shows the “regularisation path” as a function of the Eigenvalue ratio for several numbers of weak features. We repeated this experiment on the real dataset SUSY with Random Fourier Features; the test error is plotted in Figure 5 in Appendix A.2.
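The augmentation mechanism can be sketched directly (our own minimal simulation; the effective penalty $d_w \sigma_w^2$ and all sizes are illustrative choices): appending $d_w$ pure-noise columns of variance $\sigma_w^2$ to the design and computing the least norm interpolator acts, on the original coordinates, approximately like ridge regression with penalty $d_w \sigma_w^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_s, d_w = 40, 10, 20000
sigma_w2 = 1.0 / d_w          # chosen so the effective penalty d_w * sigma_w2 = 1

X = rng.standard_normal((n, d_s))
beta = rng.standard_normal(d_s)
y = X @ beta + rng.standard_normal(n)

# Augment with pure-noise weak features and interpolate with least norm.
Z = np.sqrt(sigma_w2) * rng.standard_normal((n, d_w))
beta_aug = np.linalg.pinv(np.hstack([X, Z])) @ y
beta_strong = beta_aug[:d_s]   # components on the original (strong) features

# Ridge on the strong features alone with penalty lam = d_w * sigma_w2.
lam = d_w * sigma_w2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d_s), X.T @ y)
```

As $d_w \to \infty$ the noise block's Gram matrix concentrates around $d_w \sigma_w^2 I$, so the two solutions agree, consistent with the implicit ridge regularisation described by Kobak et al. [28].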
Weak Features Can Implicitly Regularise.
The results in Sections 3.1 and 3.2 suggest that weak features can implicitly regularise when the ground truth is associated to a subset of stronger features. Specifically, Section 3.1 demonstrated how this can occur passively in an easy learning problem, with the weak features providing sufficient stability that zero ridge regularisation can be the optimal choice. (Zero regularisation has been shown to be optimal for Random Feature regression with a high signal-to-noise ratio. For ridge regression, prior work numerically estimated the derivative of the test risk with respect to the regularisation parameter under a spiked covariance model and found that the derivative could be positive, suggesting zero regularisation.) Meanwhile, in this section we demonstrated an active approach where weak features can purposely be added to a model and tuned similarly to ridge regularisation.
3.3 Ridgeless Bias and Variance
In this section we investigate how the ridgeless bias and variance depend on the ratio of dimension to sample size. Conveniently, the companion transform takes a closed form in this case; see equation (15) in Appendix B.4.1. In Figure 3, the ridgeless bias and variance are plotted against the ratio of dimension to sample size.
Note that an additional peak in the ridgeless bias and variance is observed beyond the interpolation threshold. This has only recently been observed empirically for the test error; these plots now theoretically verify the phenomenon. The location of the peaks naturally depends on the number of strong and weak features as well as the ambient dimension, as denoted by the vertical lines. Specifically, a peak occurs in the ridgeless bias in the “hard” setting when the number of samples and the number of strong features are equal. Meanwhile, a peak occurs in the ridgeless variance when the number of samples and the number of strong features are equal and the Eigenvalue ratio is large. This demonstrates that learning curves beyond the interpolation threshold can have different characteristics due to the interplay between the covariate structure and the underlying data. We conjecture this arises due to instabilities of the Moore–Penrose Pseudo-inverse of the design matrix, similar to the isotropic setting.
In this work, we introduced a general framework for studying ridge regression in a high-dimensional regime. We characterised the limiting risk of ridge regression in terms of the dimension to sample size ratio, the spectrum of the population covariance and the coefficients of the true regression parameter along the covariance basis. This extends prior work [14, 16], which considered an isotropic ground truth parameter. Our extension enables the study of “prior misspecification”, where signal strength may decrease faster or slower than postulated by the ridge estimator, and its effect on ideal regularisation.
We instantiated this general framework to a simple structure, with strong and weak features. In this case, we deduced that in some situations, “ridgeless” regression with zero regularisation can be optimal among all ridge regression estimators. This occurs when the signal-to-noise ratio is large and when strong features (with large Eigenvalue of the covariance matrix) have sufficiently more signal than weak ones. The latter condition corresponds to an “easy” or “lower-dimensional” problem, where ridge tends to over-penalise along strong features. This phenomenon does not occur for isotropic priors, where optimal regularisation is always strictly positive. Finally, we discussed noisy weak features, which act as a form of regularisation, and concluded by showing additional peaks in ridgeless bias and variance can occur for our model.
Moving forward, it would be natural to consider non-Gaussian covariates. Other structures for the ground truth and data-generating process can be investigated through Theorem 1 by considering different source functions and population Eigenvalue distributions. The trade-off between prediction and estimation error exhibited in the isotropic case can also be explored under a general source condition.
D.R. is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). Part of this work has been carried out at the Machine Learning Genoa (MaLGa) center, Università di Genova (IT). L.R. acknowledges the financial support of the European Research Council (grant SLING 819789), the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826.
-  Madhu S Advani and Andrew M Saxe. High-dimensional dynamics of generalization error in neural networks. arXiv preprint arXiv:1710.03667, 2017.
-  Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 242–252, 2019.
-  Zhidong Bai and Jack W Silverstein. Spectral analysis of large dimensional random matrices, volume 20. Springer, 2010.
-  Pierre Baldi, Peter Sadowski, and Daniel Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5:4308, 2014.
-  Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 2020.
-  Frank Bauer, Sergei Pereverzev, and Lorenzo Rosasco. On regularization algorithms in learning theory. Journal of complexity, 23(1):52–72, 2007.
-  Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
-  Mikhail Belkin, Daniel Hsu, and Ji Xu. Two models of double descent for weak features. arXiv preprint arXiv:1903.07571, 2019.
-  Mikhail Belkin, Daniel J Hsu, and Partha Mitra. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate. In Advances in neural information processing systems, pages 2300–2311, 2018.
-  Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need to understand kernel learning. arXiv preprint arXiv:1802.01396, 2018.
-  Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality? Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1611–1619, 2019.
-  Lin S Chen, Debashis Paul, Ross L Prentice, and Pei Wang. A regularized Hotelling's T² test for pathway analysis in proteomic studies. Journal of the American Statistical Association, 106(496):1345–1360, 2011.
-  Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pages 2933–2943, 2019.
-  Lee H. Dicker. Ridge regression and asymptotic minimax estimation over spheres of growing dimension. Bernoulli, 22(1):1–37, 2016.
-  Edgar Dobriban. Efficient computation of limit spectra of sample covariance matrices. Random Matrices: Theory and Applications, 4(04):1550019, 2015.
-  Edgar Dobriban, Stefan Wager, et al. High-dimensional asymptotics of prediction: Ridge regression and classification. The Annals of Statistics, 46(1):247–279, 2018.
-  Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 1675–1685, 2019.
-  Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. International Conference on Learning Representations (ICLR), 2019.
-  Noureddine El Karoui. Random matrices and high-dimensional statistics: beyond covariance matrices. In Proceedings of the International Congress of Mathematicians, volume 4, pages 2875–2894, Rio de Janeiro, 2018.
-  Cédric Gerbelot, Alia Abbara, and Florent Krzakala. Asymptotic errors for convex penalized linear regression beyond gaussian matrices. arXiv preprint arXiv:2002.04372, 2020.
-  Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019.
-  Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
-  Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
-  Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018.
-  Iain M Johnstone. On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, pages 295–327, 2001.
-  Noureddine El Karoui. Asymptotic behavior of unregularized and ridge-regularized high-dimensional robust regression estimators: rigorous results. arXiv preprint arXiv:1311.2445, 2013.
-  Noureddine El Karoui and Holger Kösters. Geometric sensitivity of random matrix results: consequences for shrinkage estimators of covariance and related statistical methods. arXiv preprint arXiv:1105.1404, 2011.
-  Dmitry Kobak, Jonathan Lomond, and Benoit Sanchez. Implicit ridge regularization provided by the minimum-norm least squares estimator when n <= p. arXiv preprint arXiv:1805.10939, 2018.
-  Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4):541–551, 1989.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Olivier Ledoit and Sandrine Péché. Eigenvectors of some large sample covariance matrix ensembles. Probability Theory and Related Fields, 151(1-2):233–264, 2011.
-  Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize. arXiv preprint arXiv:1808.00387, 2018.
-  Yasaman Mahdaviyeh and Zacharie Naulet. Asymptotic risk of least squares minimum norm estimator under the spike covariance model. arXiv preprint arXiv:1912.13421, 2019.
-  Vladimir A Marčenko and Leonid Andreevich Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.
-  Peter Mathé and Sergei V Pereverzev. Geometry of linear ill-posed problems in variable hilbert scales. Inverse problems, 19(3):789, 2003.
-  Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and double descent curve. arXiv preprint arXiv:1908.05355, 2019.
-  Partha P Mitra. Understanding overfitting peaks in generalization error: Analytical risk curves for $\ell_2$ and $\ell_1$ penalized interpolation. arXiv preprint arXiv:1906.03667, 2019.
-  Vidya Muthukumar, Kailas Vodrahalli, Vignesh Subramanian, and Anant Sahai. Harmless interpolation of noisy data in regression. IEEE Journal on Selected Areas in Information Theory, 2020.
-  Preetum Nakkiran, Prayaag Venkat, Sham Kakade, and Tengyu Ma. Optimal regularization can mitigate double descent. arXiv preprint arXiv:2003.01897, 2020.
-  Debashis Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642, 2007.
-  Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in neural information processing systems, pages 1177–1184, 2008.
-  Jack W Silverstein and Sang-Il Choi. Analysis of the limiting spectral distribution of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):295–309, 1995.
-  Jack W Silverstein and Patrick L Combettes. Signal detection via spectral theory of large dimensional random matrices. IEEE Transactions on Signal Processing, 40(8):2100–2105, 1992.
-  S Spigler, M Geiger, S d’Ascoli, L Sagun, G Biroli, and M Wyart. A jamming transition from under-to over-parametrization affects generalization in deep learning. Journal of Physics A: Mathematical and Theoretical, 52(47):474001, 2019.
-  Ingo Steinwart, Don Hush, and Clint Scovel. Optimal rates for regularized least squares regression. In Proceedings of the 22nd Annual Conference on Learning Theory (COLT), pages 79–93, 2009.
-  Andrey N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics Doklady, 4:1035–1038, 1963.
-  Antonia M Tulino and Sergio Verdú. Random matrix theory and wireless communications. Foundations and Trends® in Communications and Information Theory, 1(1):1–182, 2004.
-  Jianfeng Yao, Shurong Zheng, and ZD Bai. Sample covariance matrices and high-dimensional data analysis. Cambridge University Press, Cambridge, 2015.
-  Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Appendix A Additional Material - Strong and Weak Features Model
In this section we provide additional material related to the strong and weak features model introduced in the main body of the manuscript. Section A.1 presents insights for the hard learning setting, a case not considered in the main body. Section A.2 provides plots related to applying noisy weak feature regularisation to real data.
A.1 Insights for Hard Problems
In this section we discuss insights related to the setting of Section 3.1, but for the case of hard problems, that is, the case when . Looking to Figure 4, we see plots similar to those in Section 3.1 but for choices of weights .
Observe that the test error for optimally tuned ridge regression peaks, and then decreases for large values of the ratio . We believe this is due to the characteristic of ridge regression of "suppressing" smaller eigenvalues, in this case , improving performance for sufficiently large , even though . Intuitively, this is because the contribution to the signal takes the form , and thus, when and , ridge regression can still perform well since it suppresses the small contribution to the signal . Looking to the right plot of Figure 4, we observe that the optimal choice of regularisation initially increases as the eigenvalue ratio . One explanation is that the estimated coefficients associated with the strong features are inflated in order to explain the signal coming from the weak features, and thus, for prediction, ought to be corrected through regularisation.
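The eigenvalue-suppression intuition above can be sketched numerically. The snippet below is an illustrative sketch only, not the manuscript's model: the eigenvalues, the regularisation level, and the function name `shrinkage` are hypothetical choices. It evaluates the standard ridge shrinkage factor $\lambda_j/(\lambda_j + \lambda)$ along an eigendirection, showing that a weak-eigenvalue direction is suppressed far more strongly than a strong one.

```python
import numpy as np

# Along an eigendirection with population eigenvalue lam_j, ridge regression
# recovers roughly the fraction lam_j / (lam_j + reg) of the true coefficient,
# where reg is the ridge regularisation parameter.  Small eigenvalues are
# therefore "suppressed" far more strongly than large ones.
def shrinkage(lam_j, reg):
    return lam_j / (lam_j + reg)

reg = 1.0
strong, weak = 10.0, 0.01   # hypothetical strong/weak eigenvalues

f_strong = shrinkage(strong, reg)   # strong direction mostly retained
f_weak = shrinkage(weak, reg)       # weak direction almost entirely suppressed

print(f"strong kept: {f_strong:.3f}, weak kept: {f_weak:.4f}")
```

With these hypothetical values the strong direction keeps about 91% of its coefficient while the weak direction keeps under 1%, matching the suppression intuition in the paragraph above.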
A.2 Additional Plots for Noisy Weak Feature Regularisation
[Figure caption; plot not reproduced here] Weak features constructed from standard Gaussian random variables, scaled by . Predictions made from: Left, extracting the known signal co-ordinates from the estimated coefficients (discarding the weak features); Right, sampling weak features for the test data. Error bars from (Left) and (Right) replications over RFF. Responses are or , so predictions were an indicator of whether the predicted response was greater than . Red line: performance of ridge regression with strong features only. Plotted against the regularisation parameter , with error bars from 100 replications.
Appendix B Proofs for Ridge Regression
In this section we provide the calculations associated to ridge regression. Section B.1 provides some preliminary calculations. Section B.2 gives the proof of Theorem 1. Section B.3 provides the proof of Corollary 1. Section B.4 provides the calculations associated to the strong and weak features model.
B.1 Preliminary Calculations
We then have, for , that the companion transform is the unique solution to the Silverstein equation with such that the sign of the imaginary part is preserved . The above can then be differentiated with respect to to obtain a formula for in terms of :
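For reference, the Silverstein equation can be written in the standard notation of the random-matrix literature; the symbols below ($\gamma$ for the limiting aspect ratio $p/n$ and $H$ for the limiting population spectral distribution) are our own choices here and may differ from the manuscript's:

```latex
% Silverstein equation: v(z) is the companion Stieltjes transform,
% H the limiting population spectral distribution, gamma = lim p/n.
z \;=\; -\frac{1}{v(z)} \;+\; \gamma \int \frac{t}{1 + t\, v(z)}\,\mathrm{d}H(t),
\qquad z \in \mathbb{C}^{+}.
```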
Meanwhile, from the equality we note that we have the following equalities
which we will readily use to simplify/rewrite a number of the limiting functions.
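One standard identity of this kind relates the Stieltjes transform $m(z)$ of the limiting spectral distribution of the sample covariance to its companion transform $v(z)$; again the notation below is generic and may differ from the manuscript's:

```latex
% The p x p and n x n Gram matrices share the same non-zero eigenvalues,
% which yields, with gamma = lim p/n,
v(z) \;=\; \gamma\, m(z) \;+\; \frac{\gamma - 1}{z},
\qquad z \in \mathbb{C}^{+}.
```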
B.2 Proof of Theorem 1
We begin with the decomposition into bias and variance terms, following . The difference for the ridge parameter can be written
Thus, taking expectation with respect to the noise in the observations,
Taking expectation with respect to we arrive at
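For orientation, the standard conditional bias–variance decomposition for the ridge estimator takes the following form (in a generic notation that may differ from the manuscript's: $\hat\Sigma$ the sample covariance, $\Sigma$ the population covariance, $\beta^{\star}$ the ground truth, $\sigma^{2}$ the noise variance, $\lambda$ the ridge parameter; see e.g. Hastie et al. in the references above):

```latex
% Conditional on the design, the excess risk of ridge splits as bias + variance:
\mathbb{E}\,R(\lambda)
  = \underbrace{\lambda^{2}\,\beta^{\star\top}(\hat\Sigma + \lambda I)^{-1}
      \Sigma\,(\hat\Sigma + \lambda I)^{-1}\beta^{\star}}_{\text{bias}}
  \;+\; \underbrace{\frac{\sigma^{2}}{n}\,
      \operatorname{tr}\!\bigl(\Sigma\,(\hat\Sigma + \lambda I)^{-2}\hat\Sigma\bigr)}_{\text{variance}} .
```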
It is now a matter of showing the asymptotic almost sure convergence of the following three functionals
The limit of the first trace quantity comes directly from , while the limit of the second trace quantity is proven in . The third trace quantity depends upon the source condition, and computing its limit is one of the main technical contributions of this work. The limits for these objects are summarised in the following lemma, the proof of which provides the key steps for computing the limit involving the source function.
Under the assumptions of Theorem 1, for any , we have almost surely as with
B.3 Proof of Corollary 1
In this section we provide the proof of Corollary 1. It is broken into three parts, associated with the three cases , and .
The purpose of this section is to demonstrate, in the case , how the functional can be written in terms of the Stieltjes transform . For this particular choice of , the asymptotics were calculated in ; see also Lemma 7.9 in . We therefore repeat this calculation for completeness. Now, in this case we have
Following the steps at the start of the proof of Lemma 2.2 in , consider
Solving for we have
Picking and differentiating with respect to we get
This leads to the final form
where in the second equality we used (9). Multiplying through by then yields the quantity presented.
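As a sanity check on manipulations of this kind, the special case of identity population covariance can be verified numerically: the empirical Stieltjes transform of a large sample covariance matrix should match the closed-form Marchenko–Pastur expression. The snippet below is a self-contained illustration with hypothetical sizes, not the manuscript's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 1000            # hypothetical sizes; gamma = p / n = 0.5
gamma = p / n

# Sample covariance of i.i.d. standard Gaussian data (population Sigma = I).
X = rng.standard_normal((n, p))
eigs = np.linalg.eigvalsh(X.T @ X / n)

# Empirical Stieltjes transform m(z) = (1/p) sum_i 1 / (lambda_i - z),
# evaluated at a real point z < 0, outside the support of the spectrum.
z = -1.0
m_emp = np.mean(1.0 / (eigs - z))

# Closed-form Marchenko-Pastur Stieltjes transform for Sigma = I: the root of
# gamma*z*m^2 + (z + gamma - 1)*m + 1 = 0 with m(z) > 0 for real z < 0.
a = 1.0 - gamma - z
m_mp = (a - np.sqrt(a * a - 4.0 * gamma * z)) / (2.0 * gamma * z)

print(f"empirical: {m_emp:.4f}, Marchenko-Pastur: {m_mp:.4f}")
assert abs(m_emp - m_mp) < 0.02   # finite-size error is small at this n
```

The two quantities agree to a few decimal places at these sizes, reflecting the almost sure convergence used throughout this appendix.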
The functional in the case takes the form
Observe that we have
Solving for and plugging in the definition of the companion transform we arrive at