1 Introduction
The tremendous achievements of deep learning models in the solution of complex prediction tasks have been the focus of great attention in the applied Computer Science, Artificial Intelligence and Statistics communities in recent years. Many success stories related to the use of Deep Neural Networks have even been reported in the media, and no data scientist can ignore the deep learning tools available via open-source machine learning libraries such as TensorFlow, Keras, PyTorch and many others.
One of the key ingredients in their success is the huge number of parameters involved in all current architectures, a very counterintuitive approach that defies traditional statistical wisdom. Indeed, as intuition suggests, overparametrisation often results in interpolation, i.e. zero training error, and the expected outcome of this approach should be very poor generalisation performance. The main surprise, however, came from the observation that interpolating networks can still generalise well, as shown in the following table
[soltanolkotabi2019imaginginparis], reporting the error rate of various networks on the CIFAR-10 dataset, where $p$ is the number of parameters, $n$ is the training sample size, $d$ is the feature size and the number of classes is 10:

Model | # parameters $p$ | $p/n$ | Train loss | Test error
---|---|---|---|---
CudaConvNet | 145,578 | 2.9 | 0 | 
MicroInception | 1,649,402 | 33 | 0 | 
ResNet | 2,401,440 | 48 | 0 | 
Belkin et al. [belkin2019reconciling]
recently addressed the problem of resolving this paradox, and shed new light on the relationship between interpolation and generalisation to unseen data. In the particular instance of kernel ridge regression,
[liang2018just] proved that interpolation can coexist with good generalisation. In a subsequent line of work, connections between kernel regression and wide neural networks were extensively studied by [jacot2018neural], [du2018gradient], [allen2018convergence], [belkin2018understand], providing additional motivation for a deeper understanding of the double descent phenomenon for kernel methods. Further motivation was provided by Chizat and Bach [chizat2018note] for any nonlinear model of the form
$x \mapsto f(x; \theta)$     (1)
with parameter $\theta$. As elegantly summarised in the introduction of [mei2019generalization], if we assume that the number of parameters is so large that training by gradient flow moves each of them by just a small amount with respect to some random initialisation $\theta_0$, linearising the model around $\theta_0$ gives
$f(x; \theta) \approx f(x; \theta_0) + \nabla_\theta f(x; \theta_0)^\top (\theta - \theta_0),$
which leads, in the Empirical Risk Minimisation setting, to considering a simpler linear regression problem with high dimensional random features $\nabla_\theta f(x_i; \theta_0)$, which owe their randomness to the randomness of the initialisation $\theta_0$. This approximation is now well known to miss the main features of deep neural networks [chizat2019lazy], but it is still a good test bench for new methods of analysing the double descent phenomenon.
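To fix ideas, here is a minimal numerical sketch of this random features test bench (not taken from any of the cited works; the architecture, sample sizes and noise level below are arbitrary illustrative choices), in which the minimum-norm least squares fit on random ReLU features is computed for a growing number of features:

```python
# Minimal illustration of double descent with random ReLU features and the
# minimum-norm least squares interpolator (all names and values are ours).
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 100, 20, 0.5                     # sample size, input dimension, noise level
beta_star = rng.normal(size=d) / np.sqrt(d)    # ground-truth signal

X = rng.normal(size=(n, d))
y = X @ beta_star + sigma * rng.normal(size=n)
X_test = rng.normal(size=(2000, d))
y_test = X_test @ beta_star

for p in (10, 50, 90, 100, 110, 200, 1000):    # number of random features
    W = rng.normal(size=(d, p)) / np.sqrt(d)   # frozen random first layer
    Phi, Phi_test = np.maximum(X @ W, 0.0), np.maximum(X_test @ W, 0.0)
    theta = np.linalg.pinv(Phi) @ y            # least squares / minimum-norm interpolator
    print(f"p = {p:5d}   test MSE = {np.mean((Phi_test @ theta - y_test) ** 2):.3f}")
```

Running this typically shows the test error peaking when the number of random features approaches the sample size and decreasing again well beyond it, mirroring the behaviour reported in the table above.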
In this paper, we consider a statistical model of the form
$y_i = g(x_i^\top \beta^*) + \varepsilon_i, \qquad i = 1, \ldots, n,$     (2)
where $\beta^* \in \mathbb{R}^p$ and the function $g$ is assumed increasing and thrice differentiable, with bounded derivatives up to third order. The data $x_1, \ldots, x_n \in \mathbb{R}^p$ will be assumed isotropic and subGaussian, and the observation errors $\varepsilon_1, \ldots, \varepsilon_n$ will be assumed subGaussian as well. When the estimation of $\beta^*$ is performed using Empirical Risk Minimisation, i.e. by solving
$\hat\beta \in \operatorname{argmin}_{\beta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i - g(x_i^\top \beta)\big)$     (3)
for a given smooth loss function $\ell$, we show that the double descent phenomenon takes place and we give precise orders of dependence with respect to all the intrinsic parameters of the model, such as the dimensions $n$ and $p$ and the various bounds on the derivatives of $g$ and of the loss function used in the Empirical Risk Minimisation.
Our contribution is the first non-asymptotic analysis of the double descent phenomenon for non-linear models. Our results precisely characterise the proximity in Euclidean norm of a certain solution of (3) to $\beta^*$, from which the performance of the minimum norm solution follows naturally. Our proofs are very elementary, as they utilise an elegant continuous Newton method argument initially promoted by Neuberger in a series of papers intended to provide a new proof of the Nash–Moser theorem [castro2001inverse], [neuberger2007continuous].
2 Main Results
In this section, we describe our mathematical model and set the notations.
2.1 Mathematical presentation of the problem
We assume that (2) holds and that the function $g$ satisfies the following properties (a concrete example is given after the list):
-
$g$ is increasing,
-
the first, second and third derivatives of $g$ are each uniformly bounded by a positive constant, and
-
for all
(4) we have
(5) for some positive constant .
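For instance, the identity map $g(t) = t$, which is the link function used in the linear case of Section 2.3, is increasing with $g' \equiv 1$ and $g'' = g''' \equiv 0$, so the boundedness requirements above hold trivially.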
Concerning the statistical data, we will assume that
-
the random vectors $x_1, \ldots, x_n$ are independent subGaussian vectors in $\mathbb{R}^p$, the observation errors $\varepsilon_1, \ldots, \varepsilon_n$ are assumed to be subGaussian with $\psi_2$-norm upper bounded by a constant, and the design matrix
$X = [x_1, \ldots, x_n]^\top \in \mathbb{R}^{n \times p}$     (6)
is full rank with probability one.
-
for all $i = 1, \ldots, n$, the random vectors $x_i$ are assumed
-
to have a second moment matrix equal to the identity up to a scaling factor (i.e. the $x_i$ are isotropic up to rescaling), and
-
to have $\ell_2$-norm exactly equal to a prescribed deterministic value (notice that this is different from the usual regression model, where the columns of the design matrix are assumed to be normalised).
-
The performance of the estimators is often measured by the theoretical risk, defined as the expected loss on an independent observation drawn from the same model.
Here, we will assume that the loss function $\ell$ satisfies the following conditions (a concrete example is given after the list):
-
$\ell$ is a four times differentiable convex loss function,
-
,
-
the derivative $\ell'$ is a Lipschitz function, which implies that $\ell''$ is bounded by the corresponding Lipschitz constant,
-
the third and fourth derivatives of $\ell$ are also assumed to be bounded,
-
$\ell''$ is lower bounded by a positive constant.
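As a concrete example (with a normalisation chosen here for illustration), the quadratic loss $\ell(r) = r^2/2$ used in the linear case of Section 2.3 is convex and four times differentiable, $\ell'(r) = r$ is $1$-Lipschitz so that $\ell'' \equiv 1$, its third and fourth derivatives vanish, and $\ell''$ is bounded below by $1$; hence all of the above conditions are satisfied.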
In order to estimate $\beta^*$, the Empirical Risk Minimizer is defined as a solution to the following optimisation problem
$\hat\beta \in \operatorname{argmin}_{\beta \in \mathbb{R}^p} \hat R_n(\beta)$     (7)
with
$\hat R_n(\beta) = \frac{1}{n} \sum_{i=1}^n \ell\big(y_i - g(x_i^\top \beta)\big).$     (8)
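The following short sketch (illustrative only) simulates model (2) and runs plain gradient descent on the empirical risk (8) with a quadratic loss; the link `g`, the scalings and the step size are arbitrary choices made for the example:

```python
# Illustrative simulation of model (2) and gradient descent on the empirical
# risk (7)-(8) with quadratic loss; g, the scalings and the step size are ours.
import numpy as np

def g(t):                                   # an increasing, smooth link with bounded derivatives
    return np.tanh(t) + 0.1 * t

def g_prime(t):
    return 1.0 - np.tanh(t) ** 2 + 0.1

def grad_risk(beta, X, y):
    """Gradient of (1/2n) * sum_i (y_i - g(x_i^T beta))^2."""
    u = X @ beta
    r = y - g(u)
    return -(X.T @ (r * g_prime(u))) / len(y)

rng = np.random.default_rng(1)
n, p, noise = 500, 50, 0.1                  # underparametrised example: p < n
X = rng.normal(size=(n, p))                 # isotropic subGaussian covariates
beta_star = rng.normal(size=p) / np.sqrt(p)
y = g(X @ beta_star) + noise * rng.normal(size=n)

beta = np.zeros(p)
for _ in range(2000):                       # plain gradient descent
    beta -= 0.3 * grad_risk(beta, X, y)
print("||beta_hat - beta_star|| =", np.linalg.norm(beta - beta_star))
```

The same sketch can be rerun with $p > n$ to explore numerically the overparametrised regime of Theorem 2.2.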
2.2 Statement of our main theorems
Our main results are the following two theorems. Their proofs are given in Section B and Section C of the appendix.
Theorem 2.1.
(Underparametrised setting) Let , and let . Let
and
(9) |
where
(10) |
and
(11) |
Assume that and are such that
(12) |
and
(13) |
Then, with probability at least
(14) |
the unique solution to the optimisation problem (7) satisfies
Theorem 2.2.
(Overparametrised setting) Let , and let . Let
(15) |
and and and defined by (13), (10) and (11), respectively. Assume that and are such that
(16) |
and
(17) |
Then, there exists a first order stationary point to the optimisation problem (7) such that, with probability larger than or equal to
we have
(18) |
Using these two theorems, we now establish the following risk bound.
Corollary 2.3.
Proof.
Let denote the minimum norm solution in the variable to the system
(20) |
where is the first order stationary point of problem (7) of Theorem 2.2. Then, is a minimum norm first order stationary point of problem (7). Moreover
Therefore, we obtain
(21) |
Using [vershynin2010introduction, Theorem 5.39], we get that with probability at least ,
(22) |
for some constants depending only on the subGaussian norm of the covariates, and combining this bound with the result of Theorem 2.2, the announced result follows. ∎
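For the reader's convenience, we recall the form of the bound being invoked, restated in our notation (see [vershynin2010introduction, Theorem 5.39] for the precise statement and constants): if $X \in \mathbb{R}^{n \times p}$ has independent subGaussian isotropic rows, then for every $t \ge 0$, with probability at least $1 - 2\exp(-c t^2)$,
$$\sqrt{n} - C\sqrt{p} - t \;\le\; s_{\min}(X) \;\le\; s_{\max}(X) \;\le\; \sqrt{n} + C\sqrt{p} + t,$$
where $C$ and $c$ depend only on the subGaussian norm of the rows.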
2.3 The case of linear regression
In the linear case where $g(t) = t$ and the loss $\ell$ is quadratic, the optimisation problem (7) is the ordinary least squares problem
$\hat\beta \in \operatorname{argmin}_{\beta \in \mathbb{R}^p} \ \frac{1}{n} \sum_{i=1}^n \ell\big(y_i - x_i^\top \beta\big),$
and we have $g' \equiv 1$ and $g'' = g''' \equiv 0$. The quadratic loss is used, with $\ell''$ constant and $\ell''' = \ell'''' \equiv 0$. We therefore have
Corollary 2.4.
In addition to the assumptions about our model, let us assume that and that . Let , and let . Let
Assume that and are such that
(23) |
Then, with probability at least
the unique solution to the optimisation problem (7) satisfies
Corollary 2.5.
In addition to the assumptions about our model, let us assume that and that . Let , and let . Let
Assume that and are such that
(24) |
Then, there exists a solution to the optimisation problem (7) such that, with probability larger than or equal to
we have
(25) |
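To make the linear case fully explicit, here is a short worked computation (with the quadratic loss normalised as $\ell(r) = r^2/2$, a convention chosen here for illustration). The empirical risk is $\hat R_n(\beta) = \frac{1}{2n}\|y - X\beta\|_2^2$, so that
$$\nabla \hat R_n(\beta) = \frac{1}{n} X^\top (X\beta - y), \qquad \nabla^2 \hat R_n(\beta) = \frac{1}{n} X^\top X.$$
In the underparametrised regime ($p \le n$, $X$ of full column rank) the unique stationary point is the least squares estimator $\hat\beta = (X^\top X)^{-1} X^\top y$, while in the overparametrised regime ($p \ge n$, $X$ of full row rank) the minimum norm stationary point is the minimum norm interpolator $\hat\beta = X^\top (X X^\top)^{-1} y$.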
2.4 Discussion of the results and new implications for some classical models
Theorem 2.1 and Theorem 2.2 provide a new finite sample analysis of the problem of estimating ridge functions in both underparametrised and overparametrised regimes, i.e. where the number of parameters is smaller (resp. larger) than the sample size.
-
Our analysis of the underparametrised setting shows that:
-
When , we can obtain an error of order less than or equal to .
-
When , our bound undergoes a transition to a worse order of . Moreover, this error goes to at a logarithmic rate as grows to if becomes proportional to , and at a rate if becomes proportional to .
-
-
In the overparametrised setting, we get the following results:
-
The error bound decreases as $p$ grows to $+\infty$. We therefore recover the "double descent phenomenon".
-
In the linear model setting, Corollary 2.4 and Corollary 2.5 give the following simpler results
-
In the underparametrised setting, the error bound is of order , which is a simpler behavior than in the non-linear setting.
-
In the overparametrised setting, the error bound is of order , which is a simpler behavior than in the non-linear setting, with a faster decay towards zero.
Concerning the prediction error, Corollary 2.3 and Corollary 2.6 provide the corresponding prediction bounds. In the linear setting of Corollary 2.6, the obtained bound is decreasing and is of the order of the noise level. In the general case, the results of Corollary 2.3 depend on quantities that vanish in the linear case and that may be taken to be small as a function of $n$ and $p$; we leave this for further study in relevant specific cases.
2.5 Comparison with previous results
Our results are based on a new zero-finding approach inspired by [neuberger2007continuous], and we obtain precise quantitative results in the finite sample setting for linear and non-linear models. Following the initial discovery of the "double descent phenomenon" in [belkin2019reconciling], many authors have addressed the question of precisely characterising the error decay as a function of the number of parameters in the linear and non-linear settings (mostly based on random feature models). Some of the latest works [mei2019generalization] address the problem in the asymptotic regime. Recently, the finite sample analysis has been addressed in the very interesting works [bartlett2020benign] and [chinot2020benign], for the linear model only. These works give very precise upper and lower bounds on the prediction risk for general covariate covariance matrices under the subGaussian assumption.
In the present work, we show that similar, very precise results can be obtained for non-linear models of the ridge function class, using elementary perturbation results and some (now) standard random matrix theory. Our results provide an explicit control of the distance between some empirical estimators and the ground truth in terms of the subGaussian norms of the error and the covariate vector in the case where the covariate vectors are assumed isotropic (more general results can easily be recovered by a simple change of variable). Our analysis is made very elementary by using Neuberger’s result
[neuberger2007continuous] and the subGaussian isotropic assumption, which allows us to leverage previous results of [vershynin2010introduction] about finite random matrices with subGaussian rows or columns, depending on the setting (underparametrised vs overparametrised).
3 Conclusion and perspectives
This work presents a precise quantitative, finite sample analysis of the double descent phenomenon in the estimation of linear and non-linear models. We make use of a zero-finding result of Neuberger [neuberger2007continuous] which can be applied to a large number of settings in machine learning.
Extending our work to the case of Deep Neural Networks is an exciting avenue for future research. We are currently working on the analysis of the double descent phenomenon in the case of Residual Neural Networks and we expect to post our new findings in the near future. Another possible direction is to include penalisation, which can be treated using the same techniques via the Karush-Kuhn-Tucker conditions. This can be applied to Ridge Regression and $\ell_1$-penalised estimation and makes for a promising avenue for future investigations. Weakening the assumptions on our data, which are here of subGaussian type, could also lead to interesting new results; this could be achieved by utilising, e.g., the work of [mendelson2017extending].
References
Appendix A Common framework: Chasing close-to-ideal solutions using Neuberger’s quantitative perturbation theorem
A.1 Neuberger’s theorem
The following theorem of Neuberger [neuberger2007continuous] will be instrumental in our study of the ERM. In our context, this theorem can be restated as follows.
Theorem A.1 (Neuberger’s theorem for ERM).
Suppose that , that and that the Jacobian is a continuous map on with the property that for each in there exists a vector in such that,
(27) |
Then there exists in such that .
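As a purely numerical illustration of the continuous Newton idea underlying this theorem (the toy map, its Jacobian and all step sizes below are our own choices, not part of the paper's argument): following the Newton flow $\dot z = -J(z)^{-1} F(z)$ makes the residual decay as $F(z(t)) = e^{-t} F(z(0))$, so a zero of $F$ is reached in the limit.

```python
# Toy continuous Newton (Newton flow) iteration for a smooth map F: R^2 -> R^2;
# purely illustrative, the map and all constants are ours.
import numpy as np

def F(z):
    x, y = z
    return np.array([x + 0.3 * np.sin(y) - 1.0, y + 0.3 * np.sin(x) + 0.5])

def J(z):                                   # Jacobian of F (diagonally dominant, hence invertible)
    x, y = z
    return np.array([[1.0, 0.3 * np.cos(y)], [0.3 * np.cos(x), 1.0]])

z, dt = np.zeros(2), 0.05
for _ in range(400):                        # explicit Euler discretisation of dz/dt = -J(z)^{-1} F(z)
    z -= dt * np.linalg.solve(J(z), F(z))
print("z =", z, "  F(z) =", F(z))           # F(z) is now numerically close to zero
```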
A.2 Computing the second derivative
Since the loss is twice differentiable, the empirical risk is itself twice differentiable. The gradient of the empirical risk is given by
where is to be understood componentwise, and
(28) |
and the Hessian is given by
(29) |
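For concreteness, under the residual form of the empirical risk given in (8) and writing $r_i = y_i - g(x_i^\top \beta)$, these objects take the explicit form
$$\nabla \hat R_n(\beta) = -\frac{1}{n}\sum_{i=1}^n \ell'(r_i)\, g'(x_i^\top\beta)\, x_i, \qquad
\nabla^2 \hat R_n(\beta) = \frac{1}{n}\sum_{i=1}^n \Big(\ell''(r_i)\, g'(x_i^\top\beta)^2 - \ell'(r_i)\, g''(x_i^\top\beta)\Big)\, x_i x_i^\top = \frac{1}{n} X^\top D X,$$
with $D$ the diagonal matrix collecting the scalar weights $\ell''(r_i)\, g'(x_i^\top\beta)^2 - \ell'(r_i)\, g''(x_i^\top\beta)$.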
The condition we have to satisfy in order to use Neuberger’s theorem, i.e. the version of (27) associated with our setting, is the following
(30) |
The Hessian matrix can be rewritten as
(31) |
where is a diagonal matrix given by
Notice that
with
The proofs of the main results, Theorem 2.1 and Theorem 2.2, will be based on controlling these quantities and using these estimates to control particular solutions of (27). The proof of Theorem 2.1 is given in Section B and the proof of Theorem 2.2 is given in Section C.
Appendix B Proof of Theorem 2.1: The under-parametrised case
B.1 Four technical lemmæ
Lemma B.1.
With probability larger than or equal to , we have
(32) |
Proof.
As the rows of are subGaussian,
is subGaussian with variance proxy
. Consequently, for all(33) |
Set and use the assumption that for all . ∎
Lemma B.2.
For all , the variable is subGaussian, with variance proxy .
Proof.
Let us compute
Lipschitzianity of implies that
and since , we get
which implies that
and thus,
as announced. ∎
Lemma B.3.
Define
Then, with probability at least , conditional on , we have
(34) |
with
(35) |
for all such that
(36) |
with
(37) |
and
Proof.
Recall that
By Taylor’s formula
where is a number between and . Then
We have to control the quantities