# Asymptotics of Ridge(less) Regression under General Source Condition

We analyze the prediction performance of ridge and ridgeless regression when both the number and the dimension of the data go to infinity. In particular, we consider a general setting introducing prior assumptions characterizing "easy" and "hard" learning problems. In this setting, we show that ridgeless (zero regularisation) regression is optimal for easy problems with a high signal to noise. Furthermore, we show that additional descents in the ridgeless bias and variance learning curve can occur beyond the interpolating threshold, verifying recent empirical observations. More generally, we show how a variety of learning curves are possible depending on the problem at hand. From a technical point of view, characterising the influence of prior assumptions requires extending previous applications of random matrix theory to study ridge regression.


## 1 Introduction

Understanding the generalisation properties of Artificial Deep Neural Networks (ANN) has recently motivated a number of statistical questions. These models perform well in practice despite perfectly fitting (interpolating) the data, a property that seems at odds with classical statistical theory [49]. This has motivated the investigation of the generalisation performance of methods that achieve zero training error (interpolators) [32, 9, 11, 10, 8] and, in the context of linear least squares, the unique least norm solution to which gradient descent converges [22, 5, 37, 8, 21, 38, 20, 39]. Overparameterized linear models, where the number of variables exceeds the number of points, are arguably the simplest and most natural setting where interpolation can be studied. Moreover, in certain regimes ANN can be approximated by suitable linear models [24, 17, 18, 2, 13].

The learning curve (test error versus model capacity) for interpolators has been shown to exhibit a characteristic “Double Descent” [1, 7] shape, where the test error decreases after peaking at the “interpolating” threshold, that is, the model capacity required to interpolate the data. The regime beyond this threshold naturally captures the settings of ANN [49], and thus, has motivated its investigation [36, 44, 39]. Indeed, for least squares regression, sharp characterisations of a double descent curve have been obtained for the least norm interpolating solution in the case of isotropic or auto-regressive covariates [22, 8] and random features [36].

For least squares regression the structure of the features and data can naturally influence generalisation performance. This can be argued to arise also in the case of ANN where, for instance, inductive biases can be encoded in the network architecture, e.g. convolution layers for image classification [29, 30]. In contrast, least squares models investigated beyond the interpolation threshold have focused on cases where the ground truth parameter is symmetric in nature [16, 22, 5], without a natural notion of the estimation problem’s difficulty. This has left open the natural question of what characteristics the learning curve exhibits beyond the interpolating threshold when the features and data are drawn from more structured distributions, such as lower-dimensional spaces.

In this work we investigate the performance of ridge regression, and its ridgeless limit, assuming the data is generated from a noisy linear model with a structured regression parameter. This structure is encoded through a general function analogous to the source condition used in kernel regression and inverse problems, see e.g. [35, 6]. The function is applied to the spectrum of the population covariance of covariates and represents how well the true regression parameter is aligned to the variation in the covariates. We then study the test error of the ridge regression estimator in a high-dimensional asymptotic regime when the number of samples and ambient dimension go to infinity in proportion to one another. The limits of resulting quantities are then characterised by utilising tools from asymptotic Random Matrix Theory [3, 31, 16, 22], with results specifically developed to characterise the influence of the source condition. This provides a more general framework for studying the limiting test error of ridge regression, characterised by the signal-to-noise ratio, regularisation, overparameterisation, and now, the structure of the parameter through the source condition.

We then instantiate our general framework and results to a stylized structure, allowing us to study model misspecification and its effect on prediction error. Specifically, we consider a population covariance with two types of Eigenvectors: strong features, associated with a common large Eigenvalue (hence favored by the ridge estimator), as well as weak features, with a common smaller Eigenvalue. This model is an idealization of a realistic structure for distributions, with some parts of the signal (associated for instance to high smoothness, or low-frequency components) easier to estimate than other, higher-frequency components. The use of source conditions allows us to study situations where the true coefficients exhibit either faster or slower decay than implicitly postulated by the ridge estimator, a form of model misspecification which affects predictive performance. This encodes the difficulty of the problem, and allows us to distinguish between “easy” and “hard” learning problems. We now summarise the primary contributions of this work.

• Asymptotic Prediction Error under General Source Condition. An asymptotic characterisation of the test error under a general source condition on the ground truth is provided. This required characterizing the limit of certain trace quantities, and provides a richer framework for investigating the performance of ridge regression. (Theorem 1)

• Zero Ridge Regularisation Optimal for Easy Problems with High SNR. In the “easy”, overparameterised and high signal-to-noise ratio (SNR) case, we show that the optimal regularisation choice is zero. Previously, for least squares regression with isotropic prior, the optimal regularisation choice was zero only in the limit of infinite signal-to-noise ratio [14, 16]. (Section 3.1)

Our analysis of the strong and weak features model also provides asymptotic characterisations of a number of phenomena recently observed within the literature. That is, adding noisy weak features performs implicit regularisation and can recover the performance of optimally tuned regression restricted to the strong features [28]. Also, we show an additional peak occurring in the learning curve beyond the interpolation threshold for the ridgeless bias and variance [39]. These particular insights are presented in Sections 3.2 and 3.3, respectively.

Let us now describe the remainder of this work. Section 1.1 covers the related literature. Section 2 describes the setting, and provides the general theorem. Section 3 formally introduces the strong and weak features model, and presents the aforementioned insights. Section 4 gives the conclusion.

### 1.1 Related Literature

Due to the large number of works investigating interpolating methods as well as double descent, we next focus on works that consider the asymptotic regime.

##### High-Dimensional Statistics.

Random matrix theory has found numerous applications in high-dimensional statistics [48, 19]. In particular, asymptotic random matrix theory has been leveraged to study the predictive performance of ridge regression under a well-specified linear model with an isotropic prior on the parameter, for identity population covariance [27, 26, 14, 47] and then general population covariance [16]. More recently, [33] considered the limiting test error of the least norm predictor under the spiked covariance model [25] where both a subset of Eigenvalues and the ratio of dimension to samples diverge to infinity. They show the bias is bounded by the norm of the ground truth projected on the Eigenvectors associated to the subset of large Eigenvalues. In contrast to these works, our work follows the kernel regression or inverse problems literature [6], by adding structural assumptions on the parameter through the variation of its coefficients along the covariance basis.

##### Double Descent for Least Squares.

While interpolating predictors (which perfectly fit training data) are classically expected to be sensitive to noise and exhibit poor out-of-sample performance, empirical observations about the behaviour of artificial neural networks [49] challenged this received wisdom. This surprising phenomenon, where interpolators can generalize, was first shown for some local averaging estimators [11, 9], kernel “ridgeless” regression [32], and linear regression, where [5] characterised conditions on the covariance structure under which ridgeless estimation has small variance. A “double descent” phenomenon for interpolating predictors, where test error can decrease past the interpolation threshold, has been suggested by [7]. This double descent curve has been established in the context of asymptotic least squares [22, 36, 8, 20, 38, 39]. The work [22] considers either isotropic or auto-regressive features, while [36] considers Random Features constructed from a non-linear functional applied to the product of isotropic covariates and a random matrix. Meanwhile, the works [37, 20, 38] consider recovery guarantees under sparsity assumptions on the parameter, with [20] showing a peak in the test error when the number of samples equals the sparsity of the true predictor. The work [38] considers recovery properties of interpolators in the non-asymptotic regime. In contrast to these works, we make a direct connection between the population covariance and the ground truth parameter. Finally, [39] recently gave empirical evidence showing that additional peaks in the test error occur beyond the interpolation threshold when the covariance is misaligned with the ground truth predictor. These empirical observations are verified by the theory we develop in this paper.

## 2 Dense Regression with General Source Condition

In this section we formally introduce the setting as well as the main theorem. Section 2.1 introduces the linear regression setting. Section 2.2 introduces the functionals that arise from asymptotic random matrix theory. Section 2.3 then presents the main theorem.

### 2.1 Problem Setting

We start by introducing the linear regression setting and the general source condition.

##### Linear Regression.

We consider prediction in a random-design linear regression setting with Gaussian covariates. Let denote the true regression parameter, the population covariance, and the noise variance; up to rescaling, one can assume We consider an i.i.d. dataset such that for ,

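$$ y_i = \langle \beta^\star, x_i \rangle + \sigma \epsilon_i, \qquad x_i \sim \mathcal{N}(0, \Sigma), \qquad \mathbb{E}[\epsilon_i \mid x_i] = 0, \qquad \mathbb{E}[\epsilon_i^2 \mid x_i] = 1. \tag{1} $$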

In what follows, we let , as well as the design matrix . Given the samples the objective is to derive an estimator that minimises the error of predicting a new response. For a fixed parameter , the test risk is then , where the expectation is with respect to a new response sampled according to (1). We consider ridge regression [23, 46], defined for by

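$$ \beta_\lambda := \Big( \frac{X^\top X}{n} + \lambda I \Big)^{-1} \frac{X^\top Y}{n} = \operatorname*{argmin}_{\beta \in \mathbb{R}^d} \Big\{ \frac{1}{n} \sum_{i=1}^{n} \big( y_i - \langle \beta, x_i \rangle \big)^2 + \lambda \|\beta\|^2 \Big\}. \tag{2} $$
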
##### Source Condition.

We consider an average-case analysis where the parameter is random, sampled with covariance encoded by a source function , which describes how coefficients of vary along Eigenvectors of . Specifically, denote by the Eigenvalue-Eigenvector pairs of , ordered so that , and let . For (one can assume up to change of ), the parameter is such that

$$ \mathbb{E}[\beta^\star] = 0, \qquad \mathbb{E}\big[\beta^\star (\beta^\star)^\top\big] = \frac{r^2}{d} \Phi(\Sigma). \tag{3} $$

For estimators linear in (such as ridge regression), the expected risk only depends on the first two moments of the prior on , hence one can assume a Gaussian prior . Under prior (3), has isotropic covariance , so that . This means that the coordinate of in the -th direction has standard deviation . We note that, as , has a “dense” high-dimensional structure, where the number of its components grows with , while their magnitude decreases proportionally. This prior is an average-case, high-dimensional analogue of the standard source condition considered in inverse problems and nonparametric regression [35, 6], which describes the behaviour of coefficients of along the Eigenvector basis of . In the special case , one has . For a Gaussian prior, , which is rotation invariant with squared norm distributed as (converging to as ), hence “close” to the uniform distribution on the sphere of radius .
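The setting of this section can be simulated directly. The following is a minimal sketch, assuming a diagonal population covariance (so that the source function acts coordinate-wise on the Eigenvalues) and Gaussian noise; the function names `sample_problem`, `ridge` and `excess_risk`, and all numerical values, are illustrative choices rather than part of the original analysis.

```python
import numpy as np

def sample_problem(n, d, eigvals, phi, r=1.0, sigma=0.5, seed=0):
    """Draw (X, Y, beta_star) from model (1) with a Gaussian prior satisfying (3).

    Assumption: Sigma = diag(eigvals), hence Phi(Sigma) = diag(phi(eigvals)).
    """
    rng = np.random.default_rng(seed)
    eigvals = np.asarray(eigvals, dtype=float)
    # Prior (3): coordinate j has variance (r^2 / d) * phi(eigval_j).
    beta_star = r * np.sqrt(phi(eigvals) / d) * rng.standard_normal(d)
    # Covariates x_i ~ N(0, Sigma): scale column j by sqrt(eigval_j).
    X = rng.standard_normal((n, d)) * np.sqrt(eigvals)
    Y = X @ beta_star + sigma * rng.standard_normal(n)
    return X, Y, beta_star

def ridge(X, Y, lam):
    """Ridge estimator (2): (X^T X / n + lam I)^{-1} X^T Y / n."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)

def excess_risk(beta, beta_star, eigvals):
    """R(beta) - R(beta_star) = ||Sigma^{1/2} (beta - beta_star)||^2 for diagonal Sigma."""
    return float(np.sum(eigvals * (beta - beta_star) ** 2))

# Illustrative values: 50 directions with Eigenvalue 1 and 150 with Eigenvalue 0.1.
eigvals = np.concatenate([np.full(50, 1.0), np.full(150, 0.1)])
X, Y, beta_star = sample_problem(n=100, d=200, eigvals=eigvals, phi=lambda t: t)
beta_hat = ridge(X, Y, lam=0.1)
print(excess_risk(beta_hat, beta_star, eigvals))
```

Choosing `phi=lambda t: t` puts more signal on the directions with larger Eigenvalue; a constant `phi` would recover the isotropic prior.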

##### Easy and Hard Problems.

The case of a constant function corresponds to an isotropic prior under the Euclidean norm used for regularisation, and has been studied by [14, 16, 22]. In this case (see Remark 1 below), properly-tuned ridge regression (in terms of ) is optimal in terms of average risk. The influence of can be understood in terms of the average signal strength in eigen-directions of . Specifically, let be an eigenvector of , with associated Eigenvalue . Then, given , the signal strength in direction (namely, the contribution of this direction to the signal) is , its expectation over is . When is increasing, strength along direction decays faster as decreases than postulated by the ridge estimator. In this sense, the problem is lower-dimensional, and hence “easier” than for constant ; likewise, a decreasing is associated to a slower decay of coefficients, and therefore a “harder”, higher-dimensional problem. While our results do not require this restriction, it is natural to consider functions such that is non-decreasing, so that principal components (with larger Eigenvalue) carry more signal on average; otherwise, the norm used by the ridge estimator favours the wrong directions. In this respect, the hardest prior is obtained for , corresponding to the isotropic prior in the prediction norm induced by : for this un-informative prior, all directions have same signal strength. Finally, note that in the standard nonparametric setting of reproducing kernel Hilbert spaces, source conditions are related to smoothness of the regression function [45].

As is random, we study the expected performance of the ridge estimator against the ground truth i.e. the expected test error , where the expectation is with respect to the parameter and the noise within the samples.

###### Remark 1 (Oracle Estimator)

The best linear (in ) estimator in terms of average risk can be described explicitly. It corresponds to the Bayes-optimal estimator under prior on , which writes:

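$$ \tilde{\beta} = \Big( \frac{X^\top X}{n} + \frac{\sigma^2}{r^2} \frac{d}{n} \Phi(\Sigma)^{-1} \Big)^{-1} \frac{X^\top Y}{n}. \tag{4} $$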

This estimator requires knowledge of and . In the special case of an isotropic prior with , the oracle estimator is the ridge estimator (2) with .
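As a quick numerical illustration of this remark, the sketch below implements the oracle estimator (4) and checks that, when the source function is identically one, it coincides with the ridge estimator (2). The regularisation value used in the comparison, `sigma**2 * d / (r**2 * n)`, is read off from (4) by setting the source term to the identity; the function names and the random inputs are hypothetical.

```python
import numpy as np

def oracle_estimator(X, Y, phi_of_sigma, r, sigma):
    """Oracle estimator (4), given the matrix Phi(Sigma) and the scalars r, sigma."""
    n, d = X.shape
    penalty = (sigma**2 / r**2) * (d / n) * np.linalg.inv(phi_of_sigma)
    return np.linalg.solve(X.T @ X / n + penalty, X.T @ Y / n)

def ridge(X, Y, lam):
    """Ridge estimator (2)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)

rng = np.random.default_rng(0)
n, d, r, sigma = 60, 40, 1.0, 0.5
X = rng.standard_normal((n, d))
Y = rng.standard_normal(n)  # the identity is algebraic, so any response vector works

beta_oracle = oracle_estimator(X, Y, np.eye(d), r, sigma)
beta_ridge = ridge(X, Y, lam=sigma**2 * d / (r**2 * n))
print(np.allclose(beta_oracle, beta_ridge))
```

For a non-constant source function the penalty matrix is no longer a multiple of the identity, which is why no single ridge parameter reproduces the oracle in general.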

### 2.2 Random Matrix Theory

Let us now describe the considered asymptotic regime, as well as quantities and notions from random matrix theory that appear in the analysis.

##### High-Dimensional Asymptotics.

We study the performance of the ridge estimator under high-dimensional asymptotics [27, 26, 14, 16, 47, 3], where the number of samples and dimension go to infinity proportionally with . This setting enables precise characterisation of the risk, beyond the classical regime where with fixed true distribution.

The ratio plays a key role. A value of corresponds to an overparameterised model, with more parameters than samples. Some care is required in interpreting this quantity: indeed, for a fixed sample size , varying changes and hence the underlying distribution. Hence, should not be interpreted as a degree of overparameterisation. Rather, it quantifies the sample size relative to the dimension of the problem.

##### Random Matrix Theory.

Following standard assumptions [31, 16], assume the spectral distribution of the covariance converges almost surely to a probability distribution $H$ supported on for . Specifically, denoting the cumulative distribution function of the population covariance Eigenvalues as , we have almost surely as .

A key quantity utilised within the analysis is the Stieltjes Transform of the empirical spectral distribution, defined for as . Under appropriate assumptions on the covariates (see for instance [16]), it is known that the Stieltjes Transform of the empirical covariance converges almost surely to a Stieltjes transform $m(z)$ that satisfies the following stationary point equation

$$ m(z) = \int \frac{1}{\tau \big( 1 - \gamma (1 + z m(z)) \big) - z} \, dH(\tau). \tag{5} $$

In the case of an isotropic covariance $\Sigma = I$, where the limiting spectral distribution is a point mass at one, the above equation can be solved for $m(z)$, which is then the Stieltjes Transform of the Marchenko-Pastur distribution [34]. For more general spectral densities, the stationary point equation (5) may not be as easily solved algebraically, but can still yield insights into the limiting properties of quantities that arise. One tool that we will use extensively to gain insights into such quantities is the companion transform $v(z)$, which is the Stieltjes transform of the limiting spectral distribution of $XX^\top/n$. It is related to $m(z)$ through the equality $v(z) = \gamma \big( m(z) + 1/z \big) - 1/z$. Finally, introduce the $\Phi$-weighted Stieltjes Transform

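$$ \Theta^{\Phi}(z) = \int \Phi(\tau) \, \frac{1}{\tau \big( 1 - \gamma (1 + z m(z)) \big) - z} \, dH(\tau) \quad \text{for all } z \in \mathbb{C} \setminus \mathbb{R}_+, $$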

which is the limit of the trace quantity [31].
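The convergence of the empirical Stieltjes transform can be checked numerically in the isotropic case. The sketch below assumes an identity population covariance and evaluates everything at a real point `z = -lam` with `lam > 0`; under that assumption the stationary point equation reduces to the quadratic `gamma * lam * m**2 + (1 - gamma + lam) * m - 1 = 0`, whose positive root is the Marchenko-Pastur value. The helper names and the sample sizes are illustrative.

```python
import numpy as np

def mp_stieltjes(lam, gamma):
    """m(-lam) for identity covariance: positive root of
    gamma * lam * m^2 + (1 - gamma + lam) * m - 1 = 0."""
    b = 1.0 - gamma + lam
    return (-b + np.sqrt(b * b + 4.0 * gamma * lam)) / (2.0 * gamma * lam)

def empirical_stieltjes(X, lam):
    """(1/d) Tr((X^T X / n + lam I)^{-1}), computed from the Eigenvalues."""
    n, d = X.shape
    eig = np.linalg.eigvalsh(X.T @ X / n)
    return float(np.mean(1.0 / (eig + lam)))

rng = np.random.default_rng(0)
n, d, lam = 600, 300, 0.5            # gamma = d / n = 0.5
X = rng.standard_normal((n, d))      # isotropic Gaussian covariates

m_limit = mp_stieltjes(lam, gamma=d / n)
m_empirical = empirical_stieltjes(X, lam)
print(m_limit, m_empirical)
```

The two printed values should be close at this size, and the gap is expected to shrink as `n` and `d` grow with their ratio held fixed.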

### 2.3 Main Theorem: Asymptotic Risk under General Source Condition

Let us now state the main theorem of this work, which provides the limit of the ridge regression risk.

###### Theorem 1

Consider the setting described in Sections 2.1 and 2.2. Suppose is a real-valued bounded function defined on with finitely many points of discontinuity and let . If with then almost surely

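$$ \mathbb{E}_{\epsilon, \beta^\star}\big[ R(\beta_\lambda) - R(\beta^\star) \big] \to \underbrace{\sigma^2 \Big( \frac{v'(-\lambda)}{(v(-\lambda))^2} - 1 \Big)}_{\text{Variance}} + \underbrace{r^2 \, \frac{\Theta^{\Phi}(-\lambda) + \lambda \frac{\partial \Theta^{\Phi}(-\lambda)}{\partial \lambda}}{v(-\lambda)^2}}_{\text{Bias}}. $$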

The above theorem characterises the expected test error of the ridge estimator when the sample size and dimension go to infinity with , and is distributed as (3). The asymptotic risk in Theorem 1 is characterised by the relative sample size , the limiting spectral distribution , and the source function (normalising ). This provides a general form for studying the asymptotic test error for ridge regression in a dense high-dimensional setting. The source condition affects the limiting bias; to evaluate it we are required to study the limit of the trace quantity , which is achieved utilising techniques from both [12] and [31] (key steps in proof of Lemma 1 Appendix C). The variance term in Theorem 1 aligns with that seen previously in [16], as the structure of only influences the bias.

We now give some examples of the asymptotic expected risk in Theorem 1 for different structures of $\Phi$, namely $\Phi(x) = 1$ (isotropic), $\Phi(x) = x$ (easier case) and $\Phi(x) = 1/x$ (harder case).

###### Corollary 1

Consider the setting of Theorem 1. If with , then almost surely

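$$ \mathbb{E}_{\epsilon, \beta^\star}\big[ R(\beta_\lambda) - R(\beta^\star) \big] \to \sigma^2 \Big( \frac{v'(-\lambda)}{(v(-\lambda))^2} - 1 \Big) + r^2 \begin{cases} \dfrac{v'(-\lambda)}{\gamma (v(-\lambda))^4} - \dfrac{1}{\gamma v(-\lambda)^2} & \text{if } \Phi(x) = x, \\[2ex] \dfrac{1}{\gamma v(-\lambda)} - \dfrac{\lambda}{\gamma} \dfrac{v'(-\lambda)}{(v(-\lambda))^2} & \text{if } \Phi(x) = 1, \\[2ex] \dfrac{2\lambda}{\gamma} \dfrac{v'(-\lambda)}{v(-\lambda)} + \Big( 1 - \dfrac{1}{\gamma} \Big) \dfrac{v'(-\lambda)}{v(-\lambda)^2} - \dfrac{1}{\gamma} & \text{if } \Phi(x) = 1/x. \end{cases} $$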

The three choices of source function in Corollary 1 are cases where the functional for the asymptotic bias in Theorem 1 can be expressed in terms of the companion transform and its first derivative. The expression in the case was previously investigated in [16], while for the bias aligns with quantities previously studied in [12], and thus, can be simply plugged in. For , we show how algebraic manipulations similar to the case allow to be simplified. Finally, while for it is clear how the bias and variance can be brought together and simplified, yielding optimal regularisation choice [16], see also Remark 1. As noted in Section 2.1, corresponds to a hardest case, with no favoured direction. Finally, corresponds to an “easier” case with faster coefficient decay.

## 3 Strong and Weak Features Model

In this section we consider a simple covariance structure, the strong and weak features model. Let and be two orthonormal matrices such that and their collection of rows forms an orthonormal basis of . The covariance considered is then for

$$ \Sigma = \rho_1 U_1^\top U_1 + \rho_2 U_2^\top U_2. \tag{6} $$

Unless stated otherwise, we adopt the convention that the Eigenvalues are ordered . Naturally, we call elements of the span of rows of strong features, since they are associated to the dominant Eigenvalue . Similarly, is associated to the weak features. The size of then go to infinity with the sample size , with and thus . The limiting spectral measure of in this case is then atomic .

The parameter then has covariance , where are the coefficients for each type of feature and the source condition is . The coefficients encode the composition of the ground truth in terms of strong and weak features, and thus, the difficulty of the estimation problem. The case corresponds to the isotropic prior, while the case corresponds to faster decay and hence an “easier” problem. In particular, if increases, has faster decay, the problem becomes “easier” since the ground truth is increasingly made of strong features. Then, we say that if then the problem is easy, meanwhile when the problem is hard.

Under the model just introduced, Theorem 1 gives us the following asymptotic characterization for the expected test risk in terms of the companion transform as

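$$ \mathbb{E}_{\epsilon, \beta^\star}\big[ R(\beta_\lambda) - R(\beta^\star) \big] \to \underbrace{\sigma^2 \Big( \frac{v'(-\lambda)}{(v(-\lambda))^2} - 1 \Big)}_{\text{Variance}} + \underbrace{r^2 \sum_{i=1}^{2} \phi_i \psi_i \, \frac{\rho_i \, v'(-\lambda)}{v(-\lambda)^2 \big( \rho_i v(-\lambda) + 1 \big)^2}}_{\text{Bias}}. \tag{7} $$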

We now investigate the above limit in the regime where the dimension exceeds the sample size, in order to gain insights into the performance of least squares when data is generated from the strong and weak features model. (Evaluating the companion transform requires solving a polynomial since the limiting measure is atomic, see for instance [15]. The polynomial in our case can be solved efficiently as it is of degree at most 3.) The insights are then summarised in the following sections. Section 3.1 shows that zero regularisation can be optimal in some situations. Section 3.2 shows how noisy weak features can be added and used as a form of regularisation similar to ridge regression. Section 3.3 presents findings related to the ridgeless bias and variance.
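The limit (7) can be evaluated numerically once the companion transform is known. The sketch below is one possible way to do this, under the following assumptions: the limiting spectral measure places mass `psi[i]` on the Eigenvalue `rho[i]`, the companion transform at `-lam` is the unique positive root of the Silverstein equation (8), and either `lam > 0` or `gamma > 1` so that this root exists. Plain bisection is used in place of a closed-form polynomial root, and the function names are hypothetical.

```python
def companion_v(lam, gamma, rho, psi):
    """Positive root v = v(-lam) of equation (8) for an atomic measure.

    Multiplying (8) at z = -lam by v gives
    g(v) = 1 - lam * v - gamma * sum_i psi_i * rho_i * v / (1 + rho_i * v) = 0,
    and g is decreasing in v > 0, so bisection applies.
    """
    def g(v):
        return 1.0 - lam * v - gamma * sum(p * t * v / (1.0 + t * v) for t, p in zip(rho, psi))
    lo, hi = 0.0, 1.0
    while g(hi) > 0.0:      # grow the bracket until the sign changes
        hi *= 2.0
    for _ in range(200):    # halve the bracket down to floating-point resolution
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def strong_weak_risk(lam, gamma, rho, psi, phi, r=1.0, sigma=1.0):
    """Return the (variance, bias) terms of the limit (7)."""
    v = companion_v(lam, gamma, rho, psi)
    # Derivative of the companion transform, from differentiating (8).
    dv = 1.0 / (1.0 / v**2 - gamma * sum(p * t * t / (1.0 + t * v) ** 2 for t, p in zip(rho, psi)))
    variance = sigma**2 * (dv / v**2 - 1.0)
    bias = r**2 * sum(f * p * t * dv / (v**2 * (t * v + 1.0) ** 2) for f, p, t in zip(phi, psi, rho))
    return variance, bias

# Illustrative values: a quarter of the features are strong and carry more signal.
variance, bias = strong_weak_risk(lam=0.1, gamma=2.0, rho=(1.0, 0.1), psi=(0.25, 0.75), phi=(2.0, 0.5))
print(variance, bias)
```

Setting both Eigenvalues to one and both coefficients to one gives the isotropic model, in which case the bias term agrees with the corresponding expression in Corollary 1.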

### 3.1 Zero Regularisation can be Optimal for Easy Problems with High SNR

In this section, we investigate how the true regression function, namely the parameter (through the source condition), affects optimal ridge regularisation. Here we consider the easy case; the hard case is investigated in Appendix A.1. Figure 1 plots the performance of optimally tuned ridge regression (Left) and the optimal choice of regularisation parameter (Right) against a monotonic transform of the Eigenvalue ratio , for coefficient ratios .

As shown in the right plot of Figure 1, for a fixed distribution of (characterised by ) and sample size (characterised by ) as the ratio increases (that is, signal concentrates more on strong features), the optimal regularisation decreases. Remarkably, if the ratio is large enough, the optimal ridge regularisation parameter can be , corresponding to ridgeless interpolation.

##### Comparison with the Isotropic Model.

In the case of a parameter drawn from an isotropic prior (see Section 2.1), the optimal ridge parameter is given by (see Remark 1, as well as [16, 22]). This parameter is always positive, and is inversely proportional to the signal-to-noise ratio . Studying the influence of through a general shows that optimal regularisation also depends on the coefficient decay of ; optimal regularisation can be equal to , which interpolates training data. Finally, let us note that the optimal estimator of Remark 1 (with oracle knowledge of ) does not interpolate; hence, the optimality of interpolation among the family of ridge estimators arises from a form of “prior misspecification”. We believe this phenomenon to extend beyond the specific case of ridge estimators.

### 3.2 The Special Case of Noisy Weak Features

In this section we consider the special case where weak features are pure noise variables, namely , while their dimension is large. Such noisy weak features can be artificially introduced to the dataset, to induce an overparameterised problem. We then refer to this technique as Noisy Feature Regularisation, and note it corresponds to the design matrix augmentation in [28]. Looking to Figure 2, the ridgeless test error is then plotted against the Eigenvalue ratio (Left) and the number of weak features with the tuned Eigenvalue ratio (Right).

Observe (right plot) that as we increase the number of weak features (as encoded by ), and tune the Eigenvalue , the performance converges to that of optimally tuned ridge regression with the strong features only. The left plot then shows the “regularisation path” as a function of the Eigenvalue ratio for some numbers of weak features . We repeated this experiment on the real dataset SUSY [4] with Random Fourier Features [41]. The test error is plotted in Figure 5 in Appendix A.2.

##### Weak Features Can Implicitly Regularise.

The results in Sections 3.1 and 3.2 suggest that weak features can implicitly regularise when the ground truth is associated to a subset of stronger features. Specifically, Section 3.1 demonstrated how this can occur passively in an easy learning problem, with the weak features providing sufficient stability that zero ridge regularisation can be the optimal choice. (Zero regularisation has been shown to be optimal for Random Feature regression with a high signal-to-noise ratio [36]. For ridge regression, the work [28] numerically estimated the derivative of the test risk with respect to the regularisation parameter with a spiked covariance model and found that the derivative could be positive, suggesting zero regularization.) Meanwhile, in this section we demonstrated an active approach where weak features can purposely be added to a model and tuned similarly to ridge.

### 3.3 Ridgeless Bias and Variance

In this section we investigate how the ridgeless bias and variance depend on the ratio of dimension to sample size . Conveniently, the companion transform takes a closed form in this case, see equation (15) in Appendix B.4.1. In Figure 3, the ridgeless bias and variance are plotted against the ratio of dimension to sample size .

Note that an additional peak in the ridgeless bias and variance is observed beyond the interpolation threshold. This has only recently been empirically observed for the test error [39]; as such, these plots now theoretically verify this phenomenon. The location of the peaks naturally depends on the number of strong and weak features as well as the ambient dimension, as denoted by the vertical lines. Specifically, the peak occurs in the ridgeless bias for the “hard” setting when the number of samples and number of strong features are equal . Meanwhile, a peak occurs in the ridgeless variance when the number of samples and number of strong features are equal and the Eigenvalue ratio is large . This demonstrates that learning curves beyond the interpolation threshold can have different characteristics due to the interplay between the covariate structure and underlying data. We conjecture this arises due to instabilities of the design matrix Moore-Penrose Pseudo-inverse, similar to the isotropic setting [8].

## 4 Conclusion

In this work, we introduced a general framework for studying ridge regression in a high-dimensional regime. We characterised the limiting risk of ridge regression in terms of the dimension to sample size ratio, the spectrum of the population covariance and the coefficients of the true regression parameter along the covariance basis. This extends prior work [14, 16], that considered an isotropic ground truth parameter. Our extension enables the study of “prior misspecification”, where signal strength may decrease faster or slower than postulated by the ridge estimator, and its effect on ideal regularisation.

We instantiated this general framework to a simple structure, with strong and weak features. In this case, we deduced that in some situations, “ridgeless” regression with zero regularisation can be optimal among all ridge regression estimators. This occurs when the signal-to-noise ratio is large and when strong features (with large Eigenvalue of the covariance matrix) have sufficiently more signal than weak ones. The latter condition corresponds to an “easy” or “lower-dimensional” problem, where ridge tends to over-penalise along strong features. This phenomenon does not occur for isotropic priors, where optimal regularisation is always strictly positive. Finally, we discussed noisy weak features, which act as a form of regularisation, and concluded by showing additional peaks in ridgeless bias and variance can occur for our model.

Moving forward, it would be natural to consider non-Gaussian covariates. Other structures for the ground truth and data generating process can be investigated through Theorem 1 by considering different functions and population Eigenvalue distributions. The tradeoff between prediction and estimation error exhibited by [16] in the isotropic case can be explored with a general source .

## 5 Acknowledgments

D.R. is supported by the EPSRC and MRC through the OxWaSP CDT programme (EP/L016710/1). Part of this work has been carried out at the Machine Learning Genoa (MaLGa) center, Università di Genova (IT). L.R. acknowledges the financial support of the European Research Council (grant SLING 819789), the AFOSR projects FA9550-17-1-0390 and BAA-AFRL-AFOSR-2016-0007 (European Office of Aerospace Research and Development), and the EU H2020-MSCA-RISE project NoMADS - DLV-777826.

## Appendix A Additional Material - Strong and Weak Features Model

In this section we provide additional material related to the strong and weak features model introduced within the main body of the manuscript. Section A.1 presents insights for the hard learning setting, covering a case not considered within the main body of the manuscript. Section A.2 provides plots related to applying noisy weak feature regularisation to real data.

### A.1 Insights for Hard Problems

In this section we discuss insights related to the setting of Section 3.1, but for the case of hard problems, that is, the case when . Looking to Figure 4 we see plots similar to those in Section 3.1 but for choices of weights .

Observe that the test error for optimally tuned ridge regression peaks, and then decreases for large values of the ratio . We believe this is due to the characteristic of ridge regression “suppressing” smaller Eigenvalues, in this case , improving performance for sufficiently large , even though . Intuitively, this is due to the contribution to the signal taking the form , and thus, when when and ridge regression can still perform well since it suppresses the small contribution to the signal . Looking to the right plot of Figure 4, we observe that the optimal choice of regularisation initially increases as the Eigenvalue ratio . One explanation is that the estimated coefficients associated to the strong features are inflated in order to explain the signal coming from the weak features, and thus, for prediction, ought to be corrected through regularisation.

### A.2 Additional Plots for Noisy Weak Feature Regularisation

In this section we present additional plots associated to Section 3.2. In particular Figure 5 presents noisy weak feature regularisation applied to a real world example.

## Appendix B Proofs for Ridge Regression

In this section we provide the calculations associated to ridge regression. Section B.1 provides some preliminary calculations. Section B.2 gives the proof of Theorem 1. Section B.3 provides the proof of Corollary 1. Section B.4 provides the calculations associated to the strong and weak features model.

### B.1 Preliminaries

We begin by introducing some useful properties of the Stieltjes transform as well as its companion transform. Firstly, we know the companion transform satisfies the Silverstein equation [43, 42]

$$
-\frac{1}{v(z)} = z - \gamma \int \frac{\tau}{1+\tau v(z)}\, dH(\tau). \tag{8}
$$

We then have for , the companion transform is the unique solution to the Silverstein equation with such that the sign of the imaginary part is preserved . The above can then be differentiated with respect to $z$ to obtain a formula for $\partial v(z)/\partial z$ in terms of $v(z)$:

$$
\frac{\partial v(z)}{\partial z} = \left( \frac{1}{v(z)^2} - \gamma \int \frac{\tau^2}{(1+\tau v(z))^2}\, dH(\tau) \right)^{-1}
$$
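As a sanity check on the two displays above, the following minimal sketch evaluates them in the simplest case of an identity covariance, i.e. assuming $H=\delta_1$ (this choice, and the helper names `companion_v`, `companion_v_prime` and `silverstein_residual`, are illustrative assumptions and not part of the paper). With $z=-\lambda<0$ the Silverstein equation reduces to a quadratic in $v$, and the derivative formula can be compared against a finite difference.

```python
import numpy as np

def companion_v(lam, gamma):
    """Companion transform v(-lam), assuming H = delta_1 (identity covariance).

    With z = -lam < 0 and tau = 1, equation (8) reduces to
    lam * v**2 + (lam + gamma - 1) * v - 1 = 0; we take the positive root.
    """
    b = lam + gamma - 1.0
    return (-b + np.sqrt(b * b + 4.0 * lam)) / (2.0 * lam)

def companion_v_prime(lam, gamma):
    """dv/dz at z = -lam from the differentiated Silverstein equation (H = delta_1)."""
    v = companion_v(lam, gamma)
    return 1.0 / (1.0 / v**2 - gamma / (1.0 + v) ** 2)

def silverstein_residual(lam, gamma):
    """Left side minus right side of (8) at z = -lam; should vanish."""
    v = companion_v(lam, gamma)
    return -1.0 / v - (-lam - gamma / (1.0 + v))

lam, gamma, h = 1.0, 0.5, 1e-6
# Central difference in z = -lam: z + h corresponds to lam - h.
finite_diff = (companion_v(lam - h, gamma) - companion_v(lam + h, gamma)) / (2 * h)
print(silverstein_residual(lam, gamma), companion_v_prime(lam, gamma), finite_diff)
```

The residual should be zero up to rounding, and the closed-form derivative should agree with the finite difference to several digits.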

Meanwhile, from the equality we note that we have the following equalities

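$$
1-\gamma\big(1-\lambda m(-\lambda)\big) = \lambda v(-\lambda) \tag{9}
$$

$$
\begin{aligned}
1-\lambda m(-\lambda) &= \gamma^{-1}\big(1-\lambda v(-\lambda)\big) \\
m(-\lambda)-\lambda m'(-\lambda) &= \gamma^{-1}\big(v(-\lambda)-\lambda v'(-\lambda)\big)
\end{aligned}
$$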

which we will readily use to simplify/rewrite a number of the limiting functions.
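The first two equalities can be illustrated at finite sample size. The sketch below assumes that $m$ and $v$ are the Stieltjes transforms of the spectra of $X^\top X/n$ and $XX^\top/n$ respectively, replaces them by their empirical counterparts, and sets $\gamma=d/n$; the function name `empirical_transforms` and the Gaussian design are illustrative assumptions. Because the two matrices share their non-zero eigenvalues, the identities then hold exactly, not just in the limit.

```python
import numpy as np

def empirical_transforms(X, lam):
    """Empirical Stieltjes transform m and companion transform v at z = -lam."""
    n, d = X.shape
    m = np.trace(np.linalg.inv(X.T @ X / n + lam * np.eye(d))) / d
    v = np.trace(np.linalg.inv(X @ X.T / n + lam * np.eye(n))) / n
    return m, v

rng = np.random.default_rng(0)
n, d, lam = 60, 30, 0.7
gamma = d / n
X = rng.standard_normal((n, d))  # assumed Gaussian design, for illustration only
m, v = empirical_transforms(X, lam)

# First equality in (9): 1 - gamma (1 - lam m) = lam v
lhs_first, rhs_first = 1 - gamma * (1 - lam * m), lam * v
# Second equality: 1 - lam m = (1 - lam v) / gamma
lhs_second, rhs_second = 1 - lam * m, (1 - lam * v) / gamma
print(lhs_first - rhs_first, lhs_second - rhs_second)
```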

### B.2 Proof of Theorem 1

We begin with the decomposition into bias and variance terms following [16]. The difference for the ridge parameter can be denoted

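$$
\beta_\lambda-\beta^\star = -\lambda\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\beta^\star + \sigma\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\frac{X^\top\epsilon}{n}
$$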
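The decomposition above is an exact algebraic identity, which the following sketch checks numerically. It assumes the ridge estimator $\beta_\lambda=(X^\top X/n+\lambda I)^{-1}X^\top y/n$ and the linear model $y=X\beta^\star+\sigma\epsilon$ (both are taken here as assumptions, since they are defined in the main body rather than in this appendix); the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, lam, sigma = 50, 20, 0.3, 0.5

X = rng.standard_normal((n, d))
beta_star = rng.standard_normal(d) / np.sqrt(d)
eps = rng.standard_normal(n)
y = X @ beta_star + sigma * eps  # assumed data model

A = X.T @ X / n + lam * np.eye(d)
beta_lam = np.linalg.solve(A, X.T @ y / n)  # assumed ridge estimator

# The two terms on the right-hand side of the decomposition.
bias_part = -lam * np.linalg.solve(A, beta_star)
noise_part = sigma * np.linalg.solve(A, X.T @ eps / n)

print(np.max(np.abs((beta_lam - beta_star) - (bias_part + noise_part))))
```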

And thus taking expectation with respect to the noise in the observations

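$$
\begin{aligned}
\mathbb{E}_\epsilon[R(\beta_\lambda)]-R(\beta^\star)
&= \mathbb{E}_\epsilon\big[\|\Sigma^{1/2}(\beta_\lambda-\beta^\star)\|_2^2\big] \\
&= \mathbb{E}_\epsilon\big[\|\Sigma^{1/2}(\beta_\lambda-\mathbb{E}_\epsilon[\beta_\lambda])\|_2^2\big] + \|\Sigma^{1/2}(\mathbb{E}_\epsilon[\beta_\lambda]-\beta^\star)\|_2^2 \\
&= \frac{\sigma^2}{n}\mathrm{Tr}\left(\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Sigma\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\frac{X^\top X}{n}\right) \\
&\quad + \lambda^2\,\mathrm{Tr}\left((\beta^\star)^\top\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Sigma\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\beta^\star\right)
\end{aligned}
$$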

Taking expectation with respect to $\beta^\star$ we arrive at

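$$
\begin{aligned}
\mathbb{E}_{\beta^\star}\big[\mathbb{E}_\epsilon[R(\beta_\lambda)]-R(\beta^\star)\big]
&= \frac{\sigma^2}{n}\mathrm{Tr}\left(\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Sigma\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\frac{X^\top X}{n}\right) \\
&\quad + \lambda^2 r^2\,\frac{1}{d}\mathrm{Tr}\left(\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Sigma\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Phi(\Sigma)\right) \\
&= \sigma^2\gamma\,\frac{1}{d}\mathrm{Tr}\left(\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Sigma\right) - \lambda\sigma^2\gamma\,\frac{1}{d}\mathrm{Tr}\left(\left(\frac{X^\top X}{n}+\lambda I\right)^{-2}\Sigma\right) \\
&\quad + \lambda^2 r^2\,\frac{1}{d}\mathrm{Tr}\left(\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Sigma\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Phi(\Sigma)\right)
\end{aligned}
$$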
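The rewriting of the variance term uses only $X^\top X/n = (X^\top X/n+\lambda I)-\lambda I$, cyclicity of the trace and $1/n=\gamma/d$. A minimal numerical sketch of this step follows, taking $\gamma=d/n$ and an arbitrary diagonal $\Sigma$ as illustrative assumptions; the function name `variance_two_ways` is hypothetical.

```python
import numpy as np

def variance_two_ways(X, Sigma, lam, sigma):
    """Variance term computed directly and after splitting into two traces."""
    n, d = X.shape
    gamma = d / n
    S = X.T @ X / n
    A_inv = np.linalg.inv(S + lam * np.eye(d))
    direct = sigma**2 / n * np.trace(A_inv @ Sigma @ A_inv @ S)
    split = sigma**2 * gamma * (
        np.trace(A_inv @ Sigma) / d - lam * np.trace(A_inv @ A_inv @ Sigma) / d
    )
    return direct, split

rng = np.random.default_rng(2)
n, d = 40, 25
Sigma = np.diag(rng.uniform(0.5, 2.0, size=d))  # arbitrary covariance for the check
X = rng.standard_normal((n, d)) @ np.sqrt(Sigma)
direct, split = variance_two_ways(X, Sigma, lam=0.2, sigma=1.5)
print(direct, split)
```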

It is now a matter of showing the asymptotic almost sure convergence of the following three functionals

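$$
\frac{1}{d}\mathrm{Tr}\left(\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Sigma\right),\qquad
\frac{1}{d}\mathrm{Tr}\left(\left(\frac{X^\top X}{n}+\lambda I\right)^{-2}\Sigma\right)
\qquad\text{and}\qquad
\frac{1}{d}\mathrm{Tr}\left(\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Sigma\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Phi(\Sigma)\right)
$$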

The limit of the first trace quantity comes directly from [31], while the limit of the second trace quantity is proven in [16]. The third trace quantity depends upon the source condition, and computing its limit is one of the main technical contributions of this work. The limits for these objects are summarised in the following lemma, the proof of which provides the key steps for computing the limit involving the source function.

###### Lemma 1

Under the assumptions of Theorem 1 for any we have almost surely as with

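$$
\frac{1}{d}\mathrm{Tr}\left(\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Sigma\right) \to \frac{1-\lambda m(-\lambda)}{1-\gamma(1-\lambda m(-\lambda))} \tag{10}
$$

$$
\frac{1}{d}\mathrm{Tr}\left(\left(\frac{X^\top X}{n}+\lambda I\right)^{-2}\Sigma\right) \to \frac{m(-\lambda)-\lambda m'(-\lambda)}{\big(1-\gamma(1-\lambda m(-\lambda))\big)^2} \tag{11}
$$

$$
\frac{1}{d}\mathrm{Tr}\left(\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Sigma\left(\frac{X^\top X}{n}+\lambda I\right)^{-1}\Phi(\Sigma)\right) \to \frac{\Theta_\Phi(-\lambda)+\lambda\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda}}{\big(1-\gamma(1-\lambda m(-\lambda))\big)^2} \tag{12}
$$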

The result is arrived at by plugging in the above limits and noting from the definition of the Companion Transform that , and, taking derivatives, . The proof of Lemma 1, which is the key technical step in the proof of Theorem 1, is provided in Appendix C.
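To give a feel for the limits (10) and (11), the sketch below compares them with a single finite-size draw in the simplest setting of an identity covariance. The choices $\Sigma=I$ (so $H=\delta_1$), a Gaussian design, $\gamma=d/n$, and the helper names are all assumptions made for illustration. In this case the fixed-point equation for $m(-\lambda)$ is a quadratic, and the right-hand side of (10) collapses to $m(-\lambda)$ itself.

```python
import numpy as np

def stieltjes_m(lam, gamma):
    """m(-lam) assuming H = delta_1: positive root of
    gamma*lam*m**2 + (1 - gamma + lam)*m - 1 = 0."""
    b = 1.0 - gamma + lam
    return (-b + np.sqrt(b * b + 4.0 * gamma * lam)) / (2.0 * gamma * lam)

def limit_10(lam, gamma):
    """Right-hand side of (10)."""
    m = stieltjes_m(lam, gamma)
    return (1 - lam * m) / (1 - gamma * (1 - lam * m))

def limit_11(lam, gamma, h=1e-6):
    """Right-hand side of (11), with m'(-lam) taken by central difference in z."""
    m = stieltjes_m(lam, gamma)
    m_prime = (stieltjes_m(lam - h, gamma) - stieltjes_m(lam + h, gamma)) / (2 * h)
    return (m - lam * m_prime) / (1 - gamma * (1 - lam * m)) ** 2

def empirical_traces(n, d, lam, seed=0):
    """Left-hand sides of (10) and (11) for one Gaussian draw with Sigma = I."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    A_inv = np.linalg.inv(X.T @ X / n + lam * np.eye(d))
    return np.trace(A_inv) / d, np.trace(A_inv @ A_inv) / d

trace_1, trace_2 = empirical_traces(n=800, d=400, lam=1.0)
print(trace_1, limit_10(1.0, 0.5))
print(trace_2, limit_11(1.0, 0.5))
```

A single draw at moderate size is only indicative; the lemma itself concerns almost sure convergence in the limit.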

### B.3 Proof of Corollary 1

In this section we provide the proof of Corollary 1. It will be broken into three parts associated to the three cases $\Phi(x)=x$, $\Phi(x)=1$ and $\Phi(x)=1/x$.

#### B.3.1 Case: Φ(x)=x

The purpose of this section is to demonstrate, in the case $\Phi(x)=x$, how the functional $\Theta_\Phi$ can be written in terms of the Stieltjes Transform $m$. For this particular choice of $\Phi$ the asymptotics were calculated in [12], see also Lemma 7.9 in [16]. We repeat this calculation here for completeness. Now, in this case we have

$$
\Theta_\Phi(z) = \int \frac{\tau}{\tau\big(1-\gamma(1+zm(z))\big)-z}\, dH(\tau)
$$

Following the steps at the start of the proof of Lemma 2.2 in [31], consider

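$$
\begin{aligned}
1+zm(z) &= \int \left(1+\frac{z}{\tau\big(1-\gamma(1+zm(z))\big)-z}\right) dH(\tau) \\
&= \int \frac{\tau\big(1-\gamma(1+zm(z))\big)}{\tau\big(1-\gamma(1+zm(z))\big)-z}\, dH(\tau) \\
&= \big(1-\gamma(1+zm(z))\big)\,\Theta_\Phi(z)
\end{aligned}
$$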

Solving for $\Theta_\Phi(z)$ we have

$$
\Theta_\Phi(z) = \frac{1+zm(z)}{1-\gamma(1+zm(z))} = \frac{1}{\gamma}\left(\frac{1}{1-\gamma(1+zm(z))}-1\right)
$$

Picking $z=-\lambda$ and differentiating with respect to $\lambda$ we get

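$$
\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda} = -\frac{m(-\lambda)-\lambda m'(-\lambda)}{\big(1-\gamma(1-\lambda m(-\lambda))\big)^2}
$$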

This leads to the final form

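$$
\begin{aligned}
\frac{\Theta_\Phi(-\lambda)+\lambda\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda}}{\big(1-\gamma(1-\lambda m(-\lambda))\big)^2}
&= \frac{1-\lambda m(-\lambda)}{\big(1-\gamma(1-\lambda m(-\lambda))\big)^3} - \lambda\,\frac{m(-\lambda)-\lambda m'(-\lambda)}{\big(1-\gamma(1-\lambda m(-\lambda))\big)^4} \\
&= \frac{\gamma^{-1}\big(1-\lambda v(-\lambda)\big)}{(\lambda v(-\lambda))^3} - \lambda\,\frac{\gamma^{-1}\big(v(-\lambda)-\lambda v'(-\lambda)\big)}{(\lambda v(-\lambda))^4} \\
&= \frac{v'(-\lambda)}{\gamma\lambda^2 v(-\lambda)^4} - \frac{1}{\gamma(\lambda v(-\lambda))^2}
\end{aligned}
$$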

where on the second equality we used (9). Multiplying through by then yields the quantity presented.
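The chain of equalities leading to the final form is purely algebraic once the identities in (9) are used, so it can be checked with arbitrary numbers. In the sketch below the values of $\lambda$, $\gamma$, $v(-\lambda)$ and $v'(-\lambda)$ are placeholders, $m(-\lambda)$ and $m'(-\lambda)$ are recovered from them through the identities of Section B.1, and the function name is hypothetical.

```python
import math

def phi_identity_first_and_last(lam, gamma, v, v_prime):
    """First and last expressions of the final-form display for Phi(x) = x."""
    # Recover m(-lam) and m'(-lam) from the identities following (9).
    m = (1 - (1 - lam * v) / gamma) / lam
    m_prime = (m - (v - lam * v_prime) / gamma) / lam
    denom = 1 - gamma * (1 - lam * m)  # equals lam * v by (9)
    first = (1 - lam * m) / denom**3 - lam * (m - lam * m_prime) / denom**4
    last = v_prime / (gamma * lam**2 * v**4) - 1 / (gamma * (lam * v) ** 2)
    return first, last

first, last = phi_identity_first_and_last(lam=1.3, gamma=0.5, v=0.78, v_prime=0.67)
print(first, last, math.isclose(first, last, rel_tol=1e-9))
```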

#### B.3.2 Case: Φ(x)=1

The functional of interest in this case aligns with that calculated within [16], which we include below for completeness. In particular we have $\Theta_\Phi(z)=m(z)$ and as such we get

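$$
\Theta_\Phi(-\lambda)+\lambda\frac{\partial\Theta_\Phi(-\lambda)}{\partial\lambda} = m(-\lambda)-\lambda m'(-\lambda) = \gamma^{-1}\big(v(-\lambda)-\lambda v'(-\lambda)\big)
$$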

where on the second equality we used (9). Dividing by as well as adding the asymptotic variance we get, from Theorem 1, the limit as

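$$
\begin{aligned}
\mathbb{E}_{\beta^\star}\big[\mathbb{E}_\epsilon[R(\beta_\lambda)]-R(\beta^\star)\big]
&\to \sigma^2\,\frac{1-\lambda v(-\lambda)}{\lambda v(-\lambda)} - \lambda\sigma^2\,\frac{v(-\lambda)-\lambda v'(-\lambda)}{(\lambda v(-\lambda))^2} + \frac{r^2}{\gamma}\,\frac{v(-\lambda)-\lambda v'(-\lambda)}{v(-\lambda)^2} \\
&= \sigma^2\left(\frac{v'(-\lambda)}{v(-\lambda)^2}-1\right) + \frac{r^2}{\gamma v(-\lambda)} - \frac{r^2\lambda}{\gamma}\,\frac{v'(-\lambda)}{v(-\lambda)^2}
\end{aligned}
$$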
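The limit above can be compared with a finite-size computation. The sketch below is an illustration under several assumptions that are not spelled out in this appendix: $\Sigma=I$ (so $H=\delta_1$), a Gaussian design, $\gamma=d/n$, and a prior with $\mathbb{E}[\beta^\star(\beta^\star)^\top]=(r^2/d)I$, which gives a bias term $\lambda^2 r^2\,\frac{1}{d}\mathrm{Tr}\big((X^\top X/n+\lambda I)^{-2}\big)$. The function names are hypothetical.

```python
import numpy as np

def v_and_v_prime(lam, gamma):
    """v(-lam) and v'(-lam) assuming H = delta_1, from (8) and its derivative."""
    b = lam + gamma - 1.0
    v = (-b + np.sqrt(b * b + 4.0 * lam)) / (2.0 * lam)
    v_prime = 1.0 / (1.0 / v**2 - gamma / (1.0 + v) ** 2)
    return v, v_prime

def limiting_risk(lam, gamma, sigma, r):
    """Last line of the limit displayed above (case Phi(x) = 1)."""
    v, vp = v_and_v_prime(lam, gamma)
    return (sigma**2 * (vp / v**2 - 1)
            + r**2 / (gamma * v)
            - r**2 * lam * vp / (gamma * v**2))

def finite_sample_risk(n, d, lam, sigma, r, seed=0):
    """Variance plus bias traces for one Gaussian draw with Sigma = I."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    S = X.T @ X / n
    A_inv = np.linalg.inv(S + lam * np.eye(d))
    variance = sigma**2 / n * np.trace(A_inv @ A_inv @ S)
    bias = lam**2 * r**2 / d * np.trace(A_inv @ A_inv)  # assumed isotropic prior
    return variance + bias

print(finite_sample_risk(800, 400, 1.0, 1.0, 1.0), limiting_risk(1.0, 0.5, 1.0, 1.0))
```

Evaluating `limiting_risk` over a grid of $\lambda$ values is one way to trace out the corresponding learning curve under these assumptions.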

#### B.3.3 Case: Φ(x)=1/x

The functional in the case $\Phi(x)=1/x$ takes the form

$$
\Theta_\Phi(z) = \int \frac{1}{\tau}\,\frac{1}{\tau\big(1-\gamma(1+zm(z))\big)-z}\, dH(\tau).
$$

Observe that we have

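$$
\begin{aligned}
\int \frac{1}{\tau}\, dH(\tau) + z\,\Theta_\Phi(z)
&= \int \frac{1}{\tau}\left(1+\frac{z}{\tau\big(1-\gamma(1+zm(z))\big)-z}\right) dH(\tau) \\
&= \int \frac{1}{\tau}\,\frac{\tau\big(1-\gamma(1+zm(z))\big)}{\tau\big(1-\gamma(1+zm(z))\big)-z}\, dH(\tau) \\
&= \big(1-\gamma(1+zm(z))\big)\int \frac{1}{\tau\big(1-\gamma(1+zm(z))\big)-z}\, dH(\tau) \\
&= \big(1-\gamma(1+zm(z))\big)\, m(z).
\end{aligned}
$$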

Solving for $\Theta_\Phi(z)$ and plugging in the definition of the companion transform we arrive at

 ΘΦ(z) =1z((1−γ(1+zm(z)))m(z)−1z∫1τdH(