
# A finite sample analysis of the double descent phenomenon for ridge function estimation

Recent extensive numerical experiments in large-scale machine learning have uncovered a quite counterintuitive phase transition, as a function of the ratio between the sample size and the number of parameters in the model. As the number of parameters p approaches the sample size n, the generalisation error (a.k.a. testing error) increases, but in many cases it starts decreasing again past the threshold p=n. This surprising phenomenon, brought to the attention of the theoretical community in <cit.>, has been thoroughly investigated lately, more specifically for simpler models than deep neural networks, such as the linear model when the parameter is taken to be the minimum norm solution to the least-squares problem, mostly in the asymptotic regime when p and n tend to +∞; see e.g. <cit.>. In the present paper, we propose a finite sample analysis of non-linear models of ridge type, where we investigate the double descent phenomenon for both the estimation problem and the prediction problem. Our results show that the double descent phenomenon can be precisely demonstrated in non-linear settings and complement recent works of <cit.> and <cit.>. Our analysis is based on efficient but elementary tools closely related to the continuous Newton method <cit.>.


## 1 Introduction

The tremendous achievements of deep learning models in the solution of complex prediction tasks have been the focus of great attention in the applied Computer Science, Artificial Intelligence and Statistics communities in recent years. Many success stories related to the use of Deep Neural Networks have even been reported in the media, and no data scientist can ignore the Deep Learning tools available via open-source machine learning libraries such as TensorFlow, Keras, PyTorch and many others.

One of the key ingredients in their success is the huge number of parameters involved in all current architectures, a very counterintuitive approach that defies traditional statistical wisdom. Indeed, as intuition suggests, overparametrisation often results in interpolation, i.e. zero training error, and the expected outcome of this approach should be very poor generalisation performance. However, the main surprise came from the observation that interpolating networks can still generalise well, as shown in the following table [soltanolkotabi2019imaginginparis] reporting the error rate of various networks on the CIFAR 10 dataset, where $p$ is the number of parameters, the training sample size is $n = 50{,}000$, the feature size is $32 \times 32 \times 3 = 3072$ and the number of classes is 10:

| Model | Parameters $p$ | $p/n$ | Train loss | Test error |
|---|---|---|---|---|
| CudaConvNet | 145,578 | 2.9 | 0 | |
| MicroInception | 1,649,402 | 33 | 0 | |
| ResNet | 2,401,440 | 48 | 0 | |

Belkin et al. [belkin2019reconciling] recently addressed the problem of resolving this paradox, and shed new light on the relationships between interpolation and generalization to unseen data. In the particular instance of kernel ridge regression, [liang2018just] proved that interpolation can coexist with good generalization. In a subsequent line of work, connections between kernel regression and wide neural networks were extensively studied by [jacot2018neural], [du2018gradient], [allen2018convergence], [belkin2018understand]; they provide additional motivation for a deeper understanding of the double descent phenomenon for kernel methods. Further motivation is provided by Chizat and Bach [chizat2018note], who consider any nonlinear model of the form

$$E[Y \mid X] = f(X;\theta) \tag{1}$$

with parameter $\theta$. As elegantly summarised in the introduction of [mei2019generalization], if we assume that the number of parameters is so large that training by gradient flow moves each of them by just a small amount with respect to some random initialization $\theta_0$, linearising the model around $\theta_0$ gives

$$E(Y \mid X) \approx f(X;\theta_0) + \nabla_\theta f(X;\theta_0)^t \beta,$$

which leads, in the Empirical Risk Minimisation setting, to consider a simpler linear regression problem, with high dimensional random features $\nabla_\theta f(X;\theta_0)$, which owe their randomness to the randomness of the initialisation $\theta_0$. This approximation is now well known to be missing the main features of deep neural networks [chizat2019lazy] but it is still a good test bench for new methods of analysing the double descent phenomenon.

In this paper, we consider a statistical model of the form

$$E[Y_i \mid X_i] = f(X_i^t\theta^*), \quad i = 1, \dots, n, \tag{2}$$

where $\theta^* \in \mathbb{R}^p$ and the function $f$ is assumed increasing and thrice differentiable, with bounded derivatives up to third order. The data will be assumed isotropic and subGaussian, and the observation errors will be assumed subGaussian as well. When the estimation of $\theta^*$ is performed using Empirical Risk Minimisation, i.e. by solving

$$\hat\theta = \operatorname{argmin}_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^n \ell(Y_i - f(X_i^t\theta)) \tag{3}$$

for a given smooth loss function $\ell$, we show that the double descent phenomenon takes place and we give the precise order of dependence with respect to all the intrinsic parameters of the model, such as the dimensions $n$ and $p$, and various bounds on the derivatives of $f$ and of the loss function used in the Empirical Risk Minimisation.

Our contribution is the first non-asymptotic analysis of the double descent phenomenon for non-linear models. Our results precisely characterise the proximity in $\ell_2$-norm of a certain solution of (3) to $\theta^*$, from which the performance of the minimum norm solution follows naturally. Our proofs are very elementary as they utilise an elegant continuous Newton method argument initially promoted by Neuberger in a series of papers intended to provide a new proof of the Nash-Moser theorem [castro2001inverse], [neuberger2007continuous].

## 2 Main Results

In this section, we describe our mathematical model and set the notations.

### 2.1 Mathematical presentation of the problem

We assume that (2) holds and that $f$ satisfies the following properties

• $f$ is increasing,

• the first (resp. second, third) derivative of $f$ is uniformly bounded by a positive constant $C_{f'}$ (resp. $C_{f''}$ and $C_{f'''}$), and

• for all

$$w \in \left[E[X_i]^t\theta^* - \mu\sqrt{\log(n)},\ E[X_i]^t\theta^* + \mu\sqrt{\log(n)}\right], \tag{4}$$

we have

$$f'(w) \ge c_{f'} \tag{5}$$

for some positive constant $c_{f'}$.

Concerning the statistical data, we will assume that

• the random vectors $X_1, \dots, X_n$ are independent subGaussian vectors in $\mathbb{R}^p$, with $\psi_2$-norm upper bounded by $K_X$, and are such that the matrix

$$X^t = [X_1, \dots, X_n] \tag{6}$$

is full rank with probability one.

• for all $i = 1, \dots, n$, the random vectors $X_i$ are assumed

• to have a second moment matrix equal to the identity, i.e. to be isotropic up to a scaling factor,

• to have Euclidean norm exactly equal to a fixed value (notice that this is different from the usual regression model, where the columns are assumed to be normalised),

• the errors $\epsilon_1, \dots, \epsilon_n$ are independent subGaussian centered random variables with $\psi_2$-norm upper bounded by $K_\epsilon$.

The performance of an estimator is often measured by the theoretical risk

$$R(\theta) = E\left[\ell(Y - f(X^t\theta))\right].$$

Here, we will assume that the loss function $\ell$ satisfies

• $\ell$ is a four times differentiable convex loss function,

• $\ell'(0) = 0$,

• the derivative $\ell'$ is a $C_{\ell''}$-Lipschitz function, which implies that $\ell''$ is bounded by $C_{\ell''}$,

• the third and fourth derivatives are also assumed bounded, with the bound $C_{\ell'''}$ for $\ell'''$ and $C_{\ell^{(4)}}$ for $\ell^{(4)}$,

• $\ell''$ is lower bounded by a constant $c_{\ell''}$.

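In order to estimate $\theta^*$, the Empirical Risk Minimizer is defined as a solution to the following optimisation problem

$$\hat\theta \in \operatorname{argmin}_{\theta \in \Theta} \hat R_n(\theta) \tag{7}$$

with

$$\hat R_n(\theta) = \frac{1}{n} \sum_{i=1}^n \ell(Y_i - f(X_i^t\theta)). \tag{8}$$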
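As a concrete illustration of the objective in (7) and (8), the following minimal sketch evaluates the empirical risk on simulated data. The link `f(z) = z + 0.5 tanh(z)` and the loss `l(z) = z^2/2 + log cosh(z)` are hypothetical choices, not taken from the paper; they were picked only because they are smooth, `f` is increasing with bounded derivatives, `l'(0) = 0` and `l''` takes values in [1, 2]. The Gaussian design and noise level are likewise assumptions made for the example.

```python
import numpy as np

def f(z):
    # Hypothetical increasing link: f'(z) = 1 + 0.5 / cosh(z)^2 lies in [1, 1.5].
    return z + 0.5 * np.tanh(z)

def loss(z):
    # Hypothetical loss l(z) = z^2/2 + log cosh(z); log cosh is computed stably
    # as logaddexp(z, -z) - log(2). Here l'(0) = 0 and l''(z) lies in [1, 2].
    return 0.5 * z**2 + np.logaddexp(z, -z) - np.log(2.0)

def empirical_risk(theta, X, Y):
    # (1/n) * sum_i l(Y_i - f(X_i^t theta)), the quantity defined in (8).
    return np.mean(loss(Y - f(X @ theta)))

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.standard_normal((n, p))              # isotropic Gaussian rows (assumption)
theta_star = rng.standard_normal(p) / np.sqrt(p)
Y = f(X @ theta_star) + 0.1 * rng.standard_normal(n)

risk_at_truth = empirical_risk(theta_star, X, Y)
risk_at_zero = empirical_risk(np.zeros(p), X, Y)
print(risk_at_truth, risk_at_zero)
```

With low noise, the empirical risk at the true parameter is much smaller than at an uninformed guess such as the zero vector, which is what makes minimising (8) a sensible estimation strategy.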

### 2.2 Statement of our main theorems

Our main results are the following. Their proofs are given in Section B and Section C in the appendix.

###### Theorem 2.1.

(Underparametrised setting) Let , and let . Let

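$$r = K_\epsilon \rho_X \frac{12 C' \sqrt{C}\, C_{\ell''} C_{f'}}{c_{\ell''} c_{f'}^2 (1-\beta)} \sqrt{\frac{p}{n}} + K_\epsilon^2 \rho_X^2\, \frac{2\left(2\kappa_1 + C_{f'''}\nu\sqrt{\log(n)}\right)}{c_{\ell''}^3 c_{f'}^6 (1-\upsilon)^{3/2}}\, \frac{\left(12 C' \sqrt{C}\, C_{\ell''} C_{f'}\right)^2}{(1-\beta)^2}\, \frac{p^{3/2}}{n}$$

and

$$\Delta_{p,n} = \left(2\kappa_1 + C_{f'''}\nu\sqrt{\log(n)}\right) \sqrt{p}\, r + \frac{1}{2}\left(6\kappa_2 + C_{f^{(4)}}\nu\sqrt{\log(n)}\right) p\, r^2, \tag{9}$$

where

$$\kappa_1 = \max\left(3 C_{\ell''} C_{f'} C_{f''},\ C_{\ell'''} C_{f'}^3\right) \tag{10}$$

and

$$\kappa_2 = \max\left\{3 C_{\ell'''} C_{f'}^2 C_{f''},\ 3 C_{\ell''} C_{f''}^2,\ 3 C_{\ell''} C_{f'} C_{f'''},\ C_{\ell^{(4)}} C_{f'}^4,\ 3 C_{\ell'''} C_{f'}^2 C_{f''},\ C_{\ell''} C_{f'} C_{f'''}\right\}. \tag{11}$$

Assume that $n$ and $p$ are such that

$$(\alpha + C_{K_X})^2\, p < \beta n, \tag{12}$$

and

$$\Delta_{p,n} \le c_{\ell''} c_{f'}^2. \tag{13}$$

Then, with probability at least

$$1 - \left(2\exp\left(-c_{K_X}\alpha^2 n\right) + 2n\exp\left(-\frac{\mu^2\log(n)}{\|\theta^*\|_2^2 K_X^2}\right) + \exp\left(-\frac{p}{2}\right) + 2n\left(\exp\left(-\frac{\nu^2\log(n)}{C_{\ell''}^2 K_\epsilon^2}\right) + \exp\left(-\frac{\left(c_{\ell''} c_{f'}^2 - \Delta_{p,n}\right)^2}{4 C_{f''}^2 C_{\ell''}^2 K_\epsilon^2}\right)\right)\right), \tag{14}$$

the unique solution $\hat\theta$ to the optimisation problem (7) satisfies

$$\|\hat\theta - \theta^*\|_2 \le r.$$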
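To make the roles of the constants in (9), (10) and (11) easier to follow, here is a small sketch that evaluates $\kappa_1$, $\kappa_2$ and $\Delta_{p,n}$ from user-supplied bounds. The function and argument names are hypothetical (for instance `C_l2` stands for $C_{\ell''}$ and `C_f3` for $C_{f'''}$), and the numerical values in the example are arbitrary. Duplicate entries in the maximum of (11) are omitted since they do not change its value.

```python
import math

def kappa1(C_l2, C_l3, C_f1, C_f2):
    # (10): max(3 C_l'' C_f' C_f'', C_l''' C_f'^3)
    return max(3 * C_l2 * C_f1 * C_f2, C_l3 * C_f1**3)

def kappa2(C_l2, C_l3, C_l4, C_f1, C_f2, C_f3):
    # (11): maximum over the distinct products listed in the theorem.
    return max(
        3 * C_l3 * C_f1**2 * C_f2,
        3 * C_l2 * C_f2**2,
        3 * C_l2 * C_f1 * C_f3,
        C_l4 * C_f1**4,
    )

def delta(p, n, r, nu, k1, k2, C_f3, C_f4):
    # (9): first and second order perturbation terms, with t = nu * sqrt(log n).
    t = nu * math.sqrt(math.log(n))
    return (2 * k1 + C_f3 * t) * math.sqrt(p) * r + 0.5 * (6 * k2 + C_f4 * t) * p * r**2

# Linear model with quadratic loss: all higher derivatives vanish.
k1_lin = kappa1(C_l2=1.0, C_l3=0.0, C_f1=1.0, C_f2=0.0)
k2_lin = kappa2(C_l2=1.0, C_l3=0.0, C_l4=0.0, C_f1=1.0, C_f2=0.0, C_f3=0.0)
delta_lin = delta(p=10, n=1000, r=0.1, nu=1.0, k1=k1_lin, k2=k2_lin, C_f3=0.0, C_f4=0.0)
print(k1_lin, k2_lin, delta_lin)
```

In the linear, quadratic-loss case every term vanishes, so condition (13) holds trivially; this is consistent with the simpler statements obtained for linear regression.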
###### Theorem 2.2.

(Overparametrised setting) Let , and let . Let

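$$r = K_\epsilon \rho_X \frac{12 C' \sqrt{C}\, C_{\ell''} C_{f'}}{c_{\ell''} c_{f'}^2 (1-\beta)} \sqrt{\frac{n}{p}} + K_\epsilon^2 \rho_X^2\, \frac{2\left(2\kappa_1 + C_{f'''}\nu\sqrt{\log(n)}\right)}{c_{\ell''}^3 c_{f'}^6 (1-\upsilon)^{3/2}}\, \frac{\left(12 C' \sqrt{C}\, C_{\ell''} C_{f'}\right)^2}{(1-\beta)^2}\, \frac{n^2}{\sqrt{p}} \tag{15}$$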

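and let $\Delta_{p,n}$, $\kappa_1$ and $\kappa_2$ be defined by (9), (10) and (11), respectively. Assume that $n$ and $p$ are such that

$$(\alpha + C_{K_X})^2\, n < \beta p, \tag{16}$$

and

$$\Delta_{p,n} \le c_{\ell''} c_{f'}^2. \tag{17}$$

Then, there exists a first order stationary point $\hat\theta$ to the optimisation problem (7) such that, with probability larger than or equal to

$$1 - \left(2\exp\left(-c_{K_X}\alpha^2 n\right) + 2n\exp\left(-\frac{\mu^2\log(n)}{\|\theta^*\|_2^2 K_X^2}\right) + \exp\left(-\frac{n}{2}\right) + 2n\left(\exp\left(-\frac{\nu^2\log(n)}{C_{\ell''}^2 K_\epsilon^2}\right) + \exp\left(-\frac{\left(c_{\ell''} c_{f'}^2 - \Delta_r\right)^2}{4 C_{f''}^2 C_{\ell''}^2 K_\epsilon^2}\right)\right)\right),$$

we have

$$\|\hat\theta - \theta^*\|_2^2 \le r. \tag{18}$$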

Using these two theorems, we now establish the following risk bound.

###### Corollary 2.3.

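Let the assumptions of Theorem 2.2 hold and let $r$ be defined by (15). Then, there exists a minimum norm first order stationary point $\hat\theta^\sharp$ of problem (7) which satisfies

$$\frac{1}{\sqrt{n}}\|X\hat\theta^\sharp - X\theta^*\|_2 \le \left(1 + 2C\sqrt{\frac{p}{n}}\right) r \tag{19}$$

with probability larger than or equal to

$$1 - \left(2\exp\left(-c_{K_X}\alpha^2 n\right) + 2n\exp\left(-\frac{\mu^2\log(n)}{\|\theta^*\|_2^2 K_X^2}\right) + \exp\left(-\frac{n}{2}\right) + 2n\left(\exp\left(-\frac{\nu^2\log(n)}{C_{\ell''}^2 K_\epsilon^2}\right) + \exp\left(-\frac{\left(c_{\ell''} c_{f'}^2 - \Delta_r\right)^2}{4 C_{f''}^2 C_{\ell''}^2 K_\epsilon^2}\right)\right) + 2\exp(-cp)\right).$$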
###### Proof.

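Let $\hat\theta^\sharp$ denote the minimum norm solution in the variable $\theta$ to the system

$$X\theta = X\hat\theta, \tag{20}$$

where $\hat\theta$ is the first order stationary point of problem (7) of Theorem 2.2. Then, $\hat\theta^\sharp$ is a minimum norm first order stationary point of problem (7). Moreover

$$X\hat\theta^\sharp - X\theta^* = X\hat\theta - X\theta^* = X(\hat\theta - \theta^*).$$

Therefore, we obtain

$$\frac{1}{\sqrt{n}}\|X\hat\theta^\sharp - X\theta^*\|_2 \le \frac{1}{\sqrt{n}}\|X\|\,\|\hat\theta - \theta^*\|_2. \tag{21}$$

Using [vershynin2010introduction, Theorem 5.39], we get that with probability at least $1 - 2\exp(-cp)$,

$$\|X\| \le \sqrt{n} + 2C\sqrt{p} \tag{22}$$

for some constants $c$ and $C$ depending on $K_X$ only, and combining this bound with the result of Theorem 2.2, the announced result follows. ∎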
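The two algebraic steps of this proof can be checked numerically. The sketch below is an illustration under assumptions that are not in the paper: a Gaussian design, and a vector `theta_hat` that is just a small random perturbation of `theta_star` standing in for the stationary point of Theorem 2.2. It builds the minimum norm solution of (20) with the Moore-Penrose pseudo-inverse and verifies inequality (21).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 120                                   # overparametrised regime: p > n
X = rng.standard_normal((n, p))
theta_star = rng.standard_normal(p)
# Hypothetical stand-in for the stationary point of Theorem 2.2.
theta_hat = theta_star + 0.05 * rng.standard_normal(p)

# Minimum norm solution of X theta = X theta_hat, cf. (20).
theta_sharp = np.linalg.pinv(X) @ (X @ theta_hat)

# Left and right hand sides of (21); np.linalg.norm(X, 2) is the spectral norm.
lhs = np.linalg.norm(X @ theta_sharp - X @ theta_star) / np.sqrt(n)
rhs = np.linalg.norm(X, 2) * np.linalg.norm(theta_hat - theta_star) / np.sqrt(n)
print(lhs, rhs)
```

The pseudo-inverse projects `theta_hat` onto the row space of `X`, so `theta_sharp` gives the same predictions as `theta_hat` while having a norm that is no larger.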

### 2.3 The case of linear regression

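In the linear case where $f(z) = z$ and the loss is quadratic, $\ell(z) = \frac{1}{2}z^2$, the optimisation problem (7) is

$$\hat\theta = \operatorname{argmin}_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\left(Y_i - X_i^t\theta\right)^2.$$

We have $f'(z) = 1$ and $f^{(k)}(z) = 0$ for $k \ge 2$.

The quadratic loss function is used with

$$\ell(z) = \frac{1}{2}z^2, \quad \ell'(z) = z, \quad \ell''(z) = 1$$

and $\ell'''(z) = 0$, $\ell^{(4)}(z) = 0$. We therefore have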

###### Corollary 2.4.

In addition to the assumptions about our model, let us assume that and that . Let , and let . Let

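$$r = K_\epsilon \rho_X \frac{12 C' \sqrt{C}}{1-\beta} \sqrt{\frac{p}{n}}.$$

Assume that $n$ and $p$ are such that

$$(\alpha + C_{K_X})^2\, p < \beta n. \tag{23}$$

Then, with probability at least

$$1 - 2\exp\left(-c_{K_X}\alpha^2 n\right) - 2n\exp\left(-\frac{\mu^2\log(n)}{\|\theta^*\|_2^2 K_X^2}\right) - e^{-p/2},$$

the unique solution $\hat\theta$ to the optimisation problem (7) satisfies

$$\|\hat\theta - \theta^*\|_2 \le r.$$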
###### Corollary 2.5.

In addition to the assumptions about our model, let us assume that and that . Let , and let . Let

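$$r = K_\epsilon \rho_X \frac{12 C' \sqrt{C}}{1-\beta} \sqrt{\frac{n}{p}}.$$

Assume that $n$ and $p$ are such that

$$(\alpha + C_{K_X})^2\, n < \beta p. \tag{24}$$

Then, there exists a solution $\hat\theta$ to the optimisation problem (7) such that, with probability larger than or equal to

$$1 - 2\exp\left(-c_{K_X}\alpha^2 n\right) - 2n\exp\left(-\frac{\mu^2\log(n)}{\|\theta^*\|_2^2 K_X^2}\right) - e^{-n/2},$$

we have

$$\|\hat\theta - \theta^*\|_2^2 \le r. \tag{25}$$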
###### Corollary 2.6.

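In addition to the assumptions about our model, let us assume that $f(z) = z$ and that $\ell(z) = \frac{1}{2}z^2$, and let the other assumptions of Corollary 2.5 hold. Then, the minimum norm solution $\hat\theta^\sharp$ of problem (7) satisfies

$$\frac{1}{\sqrt{n}}\|X\hat\theta^\sharp - X\theta^*\|_2 \le \left(1 + 2C\sqrt{\frac{p}{n}}\right) K_\epsilon \rho_X \frac{12 C' \sqrt{C}}{1-\beta} \sqrt{\frac{n}{p}} \tag{26}$$

with probability at least

$$1 - 2\exp\left(-c_{K_X}\alpha^2 n\right) - 2n\exp\left(-\frac{\mu^2\log(n)}{\|\theta^*\|_2^2 K_X^2}\right) - e^{-n/2} - 2\exp(-cp)$$

for some constants $c$ and $C$ depending on $K_X$ only.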

### 2.4 Discussion of the results and new implications for some classical models

Theorem 2.1 and Theorem 2.2 provide a new finite sample analysis of the problem of estimating ridge functions in both underparametrised and overparametrised regimes, i.e. where the number of parameters is smaller (resp. larger) than the sample size.

• Our analysis of the underparametrised setting shows that:

• When , we can obtain an error of order less than or equal to .

• When , our bound undergoes a transition to a worse order of . Moreover, this error goes to at a logarithmic rate as grows to if becomes proportional to , and at a rate if becomes proportional to .

• One condition for this to work is (13).

• In the regime, will be of the order and we can choose to be of that same order so that (13) is satisfied. The last term in (14) will be of small order when which can be assumed under the natural assumption that the matrix is rescaled using .

• In the regime, is of the order . We can choose to multiply by this quantity in order to ensure that (13) is satisfied, a choose and as argued for the previous case.

• In the overparametrised setting, we get the following results:

• The error bound is of order and decreases as grows to . We therefore recover the "double descent phenomenon".

In the linear model setting, Corollary 2.4 and Corollary 2.5 give the following simpler results:

• In the underparametrised setting, the error bound is of order $\sqrt{p/n}$, which is a simpler behavior than in the non-linear setting.

• In the overparametrised setting, the error bound is of order $\sqrt{n/p}$, which is a simpler behavior than in the non-linear setting, with a faster decay towards zero.

Concerning the prediction error, Corollary 2.3 and Corollary 2.6 provide corresponding prediction bounds. In the linear setting of Corollary 2.6, the obtained bound is decreasing as a function of $p$ and is of the order of the noise level. In the general case, the results of Corollary 2.3 depend on quantities that vanish in the linear case and that may be taken to be small as a function of and for fixed. We leave it for further study in relevant specific cases.
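The remark about the noise level in the linear case can be illustrated with a short simulation. This is only a sketch under assumptions not made in the paper (Gaussian design, Gaussian noise, arbitrary sizes), and it is not the authors' experiment. When $p > n$ and $X$ has full row rank, the minimum norm least-squares solution interpolates the observations, so the in-sample prediction error $\frac{1}{\sqrt{n}}\|X\hat\theta^\sharp - X\theta^*\|_2$ coincides with the empirical noise level $\frac{1}{\sqrt{n}}\|\epsilon\|_2$.

```python
import numpy as np

def min_norm_prediction_error(n, p, sigma, rng):
    # Simulate the linear model Y = X theta_star + eps with p > n.
    X = rng.standard_normal((n, p))
    theta_star = rng.standard_normal(p) / np.sqrt(p)
    eps = sigma * rng.standard_normal(n)
    Y = X @ theta_star + eps
    # Minimum norm least-squares solution via the pseudo-inverse.
    theta_sharp = np.linalg.pinv(X) @ Y
    pred_err = np.linalg.norm(X @ theta_sharp - X @ theta_star) / np.sqrt(n)
    noise_level = np.linalg.norm(eps) / np.sqrt(n)
    return pred_err, noise_level

rng = np.random.default_rng(2)
n, sigma = 40, 0.5
results = {p: min_norm_prediction_error(n, p, sigma, rng) for p in (80, 160, 320)}
for p, (pred_err, noise_level) in results.items():
    print(p, pred_err, noise_level)
```

The printed pairs agree up to rounding for every value of `p`, which matches the statement that the prediction bound stays of the order of the noise level as the number of parameters grows.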

### 2.5 Comparison with previous results

Our results are based on a new zero finding approach inspired by [neuberger2007continuous] and we obtain precise quantitative results in the finite sample setting for linear and non-linear models. Following the initial discovery of the "double descent phenomenon" in [belkin2019reconciling], many authors have addressed the question of precisely characterising the error decay as a function of the number of parameters in the linear and non-linear setting (mostly based on random feature models). Some of the latest works [mei2019generalization] address the problem in the asymptotic regime. Recently, the finite sample analysis has been addressed in the very interesting works [bartlett2020benign] and [chinot2020benign] for the linear model only. The works of [bartlett2020benign] and [chinot2020benign] give very precise upper and lower bounds on the prediction risk for general covariate covariance matrices under the subGaussian assumption.

In the present work, we show that similar, very precise results can be obtained for non-linear models of the ridge function class, using elementary perturbation results and some (now) standard random matrix theory. Our results provide an explicit control of the distance between some empirical estimators and the ground truth in terms of the subGaussian norms of the error and the covariate vector in the case where the covariate vectors are assumed isotropic (more general results can easily be recovered by a simple change of variable). Our analysis is made very elementary by using Neuberger’s result [neuberger2007continuous] and the subGaussian isotropic assumption, which allows us to leverage previous results of [vershynin2010introduction] about finite random matrices with subGaussian rows or columns, depending on the setting (underparametrised vs overparametrised).

## 3 Conclusion and perspectives

This work presents a precise quantitative, finite sample analysis of the double descent phenomenon in the estimation of linear and non-linear models. We make use of a zero-finding result of Neuberger [neuberger2007continuous] which can be applied to a large number of settings in machine learning.

Extending our work to the case of Deep Neural Networks is an exciting avenue for future research. We are currently working on the analysis of the double descent phenomenon in the case of Residual Neural Networks and we expect to post our new findings in the near future. Another possible direction is to include penalisation, which can be treated using the same techniques via Karush-Kuhn-Tucker conditions. This can be applied to Ridge Regression and $\ell_1$-penalised estimation and makes a promising avenue for future investigations. Weakening the assumptions on our data, which are here of subGaussian type, could also lead to interesting new results; this could be achieved by utilising, e.g., the work of [mendelson2017extending].

## Appendix A Common framework: Chasing close-to-ideal solutions using Neuberger’s quantitative perturbation theorem

### A.1 Neuberger’s theorem

The following theorem of Neuberger [neuberger2007continuous] will be instrumental in our study of the ERM. In our context, this theorem can be restated as follows.

###### Theorem A.1 (Neuberger’s theorem for ERM).

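Suppose that $r > 0$, that $\theta^* \in \mathbb{R}^p$ and that the Jacobian $D\hat R_n(\cdot)$ is a continuous map on $B_r(\theta^*)$ with the property that for each $\theta$ in $B_r(\theta^*)$ there exists a vector $d$ in $\bar B_r(0)$ such that,

$$\lim_{t \downarrow 0} \frac{D\hat R_n(\theta + td) - D\hat R_n(\theta)}{t} = -D\hat R_n(\theta^*). \tag{27}$$

Then there exists $u$ in $\bar B_r(\theta^*)$ such that $D\hat R_n(u) = 0$.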
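The continuous Newton method behind this theorem can be sketched numerically. For a smooth map $F$ with invertible Jacobian $J$, the flow $z'(t) = -J(z(t))^{-1}F(z(0))$ satisfies $F(z(t)) = (1-t)F(z(0))$, so $z(1)$ is a zero of $F$; the direction $-J(z)^{-1}F(z(0))$ plays the role of the vector $d$ in (27). The example below is a minimal illustration, not the paper's construction: the two-dimensional map `F` is a hypothetical stand-in for the gradient of the empirical risk, and the flow is discretised with explicit Euler steps.

```python
import numpy as np

def F(x):
    # Hypothetical smooth map standing in for the gradient of the empirical risk.
    return np.array([x[0] + 0.1 * np.sin(x[1]) - 1.0,
                     x[1] + 0.1 * x[0] ** 2 - 0.5])

def jacobian(x):
    # Jacobian of F, computed by hand.
    return np.array([[1.0, 0.1 * np.cos(x[1])],
                     [0.2 * x[0], 1.0]])

def continuous_newton(x0, steps=1000):
    # Explicit Euler scheme for z'(t) = -J(z(t))^{-1} F(x0) on [0, 1].
    z = np.array(x0, dtype=float)
    F0 = F(z)
    h = 1.0 / steps
    for _ in range(steps):
        z = z + h * np.linalg.solve(jacobian(z), -F0)
    return z

x0 = np.zeros(2)
root = continuous_newton(x0)
residual = np.linalg.norm(F(root))
print(root, residual)
```

The residual is not exactly zero because of the Euler discretisation, but it shrinks as `steps` grows. The length of the path from `x0` to `root` is what the radius $r$ controls in the theorem.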

### A.2 Computing the second derivative

Since the loss is twice differentiable, the empirical risk is itself twice differentiable. The gradient of the empirical risk is given by

$$\nabla \hat R_n(\theta) = -\frac{1}{n}\sum_{i=1}^n \ell'(Y_i - f(X_i^t\theta))\, f'(X_i^t\theta)\, X_i = -\frac{1}{n} X^t D(\nu)\, \ell'(\epsilon)$$

where $\ell'(\epsilon)$ is to be understood componentwise, and

$$\nu_i = f'(X_i^t\theta) \tag{28}$$

and the Hessian is given by

$$\nabla^2 \hat R_n(\theta) = \frac{1}{n}\sum_{i=1}^n \left(\ell''(Y_i - f(X_i^t\theta))\, f'(X_i^t\theta)^2 - \ell'(Y_i - f(X_i^t\theta))\, f''(X_i^t\theta)\right) X_i X_i^t. \tag{29}$$

The condition we have to satisfy in order to use Neuberger’s theorem, i.e. the version of (27) associated with our setting, is the following

$$\nabla^2 \hat R_n(\theta)\, d = -\nabla \hat R_n(\theta^*). \tag{30}$$

The Hessian matrix can be rewritten as

$$\nabla^2 \hat R_n(\theta) = \frac{1}{n} X^t D(\mu) X \tag{31}$$

where $D(\mu)$ is a diagonal matrix with diagonal entries given by

$$\mu_i = \ell''(Y_i - f(X_i^t\theta))\, f'(X_i^t\theta)^2 - \ell'(Y_i - f(X_i^t\theta))\, f''(X_i^t\theta).$$

Notice that

$$\mu_i = \ell''(Y_i - f(X_i^t\theta^*))\, f'(X_i^t\theta^*)^2 - \ell'(Y_i - f(X_i^t\theta^*))\, f''(X_i^t\theta^*) + \delta_i$$

with

$$\delta_i = \ell''(Y_i - f(X_i^t\theta))\, f'(X_i^t\theta)^2 - \ell'(Y_i - f(X_i^t\theta))\, f''(X_i^t\theta) - \left(\ell''(Y_i - f(X_i^t\theta^*))\, f'(X_i^t\theta^*)^2 - \ell'(Y_i - f(X_i^t\theta^*))\, f''(X_i^t\theta^*)\right).$$

The proofs of the main results, Theorems 2.1 and 2.2, will be based on controlling these quantities and using these estimates for controlling particular solutions of (27). The proof of Theorem 2.1 is given in Section B and the proof of Theorem 2.2 is given in Section C.
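The gradient and Hessian formulas above can be sanity-checked against finite differences. The sketch below assumes a hypothetical link `f(z) = z + 0.5 tanh(z)` and a hypothetical loss `l(z) = z^2/2 + log cosh(z)`, neither of which comes from the paper; the helper `central_diff` is likewise introduced only for this check.

```python
import numpy as np

# Hypothetical link and loss, with derivatives computed by hand.
def f(z):   return z + 0.5 * np.tanh(z)
def df(z):  return 1.0 + 0.5 / np.cosh(z) ** 2
def d2f(z): return -np.tanh(z) / np.cosh(z) ** 2
def dl(z):  return z + np.tanh(z)                 # l'(z)
def d2l(z): return 1.0 + 1.0 / np.cosh(z) ** 2    # l''(z)

def risk(theta, X, Y):
    z = Y - f(X @ theta)
    return np.mean(0.5 * z**2 + np.logaddexp(z, -z) - np.log(2.0))

def gradient(theta, X, Y):
    # -(1/n) sum_i l'(Y_i - f(X_i^t theta)) f'(X_i^t theta) X_i
    u = X @ theta
    return -(X.T @ (dl(Y - f(u)) * df(u))) / len(Y)

def hessian(theta, X, Y):
    # (1/n) X^t D(mu) X with mu_i as in the display following (31).
    u = X @ theta
    res = Y - f(u)
    mu = d2l(res) * df(u) ** 2 - dl(res) * d2f(u)
    return (X.T * mu) @ X / len(Y)

def central_diff(fun, theta, h=1e-6):
    # Central finite differences of fun along each coordinate of theta.
    rows = []
    for j in range(len(theta)):
        e = np.zeros_like(theta)
        e[j] = h
        rows.append((fun(theta + e) - fun(theta - e)) / (2 * h))
    return np.array(rows)

rng = np.random.default_rng(3)
n, p = 50, 4
X = rng.standard_normal((n, p))
theta = rng.standard_normal(p)
Y = f(X @ rng.standard_normal(p)) + 0.1 * rng.standard_normal(n)

grad_err = np.max(np.abs(gradient(theta, X, Y) - central_diff(lambda t: risk(t, X, Y), theta)))
hess_err = np.max(np.abs(hessian(theta, X, Y) - central_diff(lambda t: gradient(t, X, Y), theta)))
print(grad_err, hess_err)
```

Both discrepancies are at the level of finite-difference error, which supports the signs in (29): the curvature term $\ell'' f'^2$ enters with a plus sign and the term $\ell' f''$ with a minus sign.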

## Appendix B Proof of Theorem 2.1: The under-parametrised case

### B.1 Four technical lemmæ

###### Lemma B.1.

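With probability larger than or equal to $1 - 2n\exp\left(-\frac{\mu^2\log(n)}{\|\theta^*\|_2^2 K_X^2}\right)$, we have, for all $i = 1, \dots, n$,

$$f'(X_i^t\theta^*) \ge c_{f'}. \tag{32}$$
###### Proof.

As the rows of $X$ are subGaussian, $X_i^t\theta^*$ is subGaussian with variance proxy $\|\theta^*\|_2^2 K_X^2$. Consequently, for all $t > 0$,

$$P\left(\left|X_i^t\theta^* - E[X_i]^t\theta^*\right| \ge t\right) \le 2\exp\left(-\frac{t^2}{\|\theta^*\|_2^2 K_X^2}\right). \tag{33}$$

Set $t = \mu\sqrt{\log(n)}$, take a union bound over $i = 1, \dots, n$, and use the assumption that $f'(w) \ge c_{f'}$ for all $w$ in the interval (4). ∎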

###### Lemma B.2.

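For all $i = 1, \dots, n$, the variable $\ell'(\epsilon_i)$ is subGaussian, with variance proxy $C_{\ell''}^2 K_\epsilon^2$.

###### Proof.

Let us compute

$$\|\ell'(\epsilon_i)\|_{\psi_2} = \sup_{\gamma \ge 1} \gamma^{-1/2}\left(E|\ell'(\epsilon_i)|^\gamma\right)^{1/\gamma}.$$

The Lipschitz property of $\ell'$ implies that

$$|\ell'(\epsilon_i) - \ell'(0)| \le C_{\ell''}|\epsilon_i - 0|$$

and since $\ell'(0) = 0$, we get

$$|\ell'(\epsilon_i)| \le C_{\ell''}|\epsilon_i|$$

which implies that

$$\|\ell'(\epsilon_i)\|_{\psi_2} \le C_{\ell''} \sup_{\gamma \ge 1} \gamma^{-1/2}\left(E|\epsilon_i|^\gamma\right)^{1/\gamma}$$

and thus,

$$\|\ell'(\epsilon_i)\|_{\psi_2} \le C_{\ell''}\|\epsilon_i\|_{\psi_2} < +\infty$$

as announced. ∎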

###### Lemma B.3.

Define

$$\delta_i(\theta) = \mu_i(X_i^t\theta) - \left(\ell''(Y_i - f(X_i^t\theta^*))\, f'(X_i^t\theta^*)^2 - \ell'(Y_i - f(X_i^t\theta^*))\, f''(X_i^t\theta^*)\right).$$

Then, with probability at least , conditional on , we have

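$$\max_{i=1,\dots,n} |\delta_i(\theta)| \le \Delta_r(t) \tag{34}$$

with

$$\Delta_r(t) = \left(2\kappa_1 + C_{f'''}t\right)\sqrt{p}\, r + \frac{1}{2}\left(6\kappa_2 + C_{f^{(4)}}t\right) p\, r^2 \tag{35}$$

for all $\theta$ such that

$$\|\theta - \theta^*\|_2 \le r, \tag{36}$$

with

$$\kappa_1 = \max\left(3 C_{\ell''} C_{f'} C_{f''},\ C_{\ell'''} C_{f'}^3\right) \tag{37}$$

and

$$\kappa_2 = \max\left\{3 C_{\ell'''} C_{f'}^2 C_{f''},\ 3 C_{\ell''} C_{f''}^2,\ 3 C_{\ell''} C_{f'} C_{f'''},\ C_{\ell^{(4)}} C_{f'}^4,\ 3 C_{\ell'''} C_{f'}^2 C_{f''},\ C_{\ell''} C_{f'} C_{f'''}\right\}.$$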
###### Proof.

Recall that

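$$\mu_i(X_i^t\theta) = \ell''(Y_i - f(X_i^t\theta))\, f'(X_i^t\theta)^2 - \ell'(Y_i - f(X_i^t\theta))\, f''(X_i^t\theta).$$

By Taylor’s formula

$$\mu_i(X_i^t\theta) = \mu_i(X_i^t\theta^*) + \frac{d\mu_i}{d(X_i^t\theta)}(X_i^t\theta^*)\,(X_i^t\theta - X_i^t\theta^*) + \frac{1}{2}\frac{d^2\mu_i}{d(X_i^t\theta)^2}\left(X_i^t\theta^* + c_i(X_i^t\theta - X_i^t\theta^*)\right)(X_i^t\theta - X_i^t\theta^*)^2$$

where $c_i$ is a number between $0$ and $1$. Then

$$|\delta_i(\theta)| = \left|\frac{d\mu_i}{d(X_i^t\theta)}(X_i^t\theta^*)\,(X_i^t\theta - X_i^t\theta^*) + \frac{1}{2}\frac{d^2\mu_i}{d(X_i^t\theta)^2}\left(X_i^t\theta^* + c_i(X_i^t\theta - X_i^t\theta^*)\right)(X_i^t\theta - X_i^t\theta^*)^2\right| \le \left|\frac{d\mu_i}{d(X_i^t\theta)}(X_i^t\theta^*)\right| \left|X_i^t\theta - X_i^t\theta^*\right| + \frac{1}{2}(X_i^t\theta - X_i^t\theta^*)^2 \left|\frac{d^2\mu_i}{d(X_i^t\theta)^2}\left(X_i^t\theta^* + c_i(X_i^t\theta - X_i^t\theta^*)\right)\right|.$$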

We have to control the quantities

 ∣∣dμid(Xtiθ)(Xtiθ