Two models of double descent for weak features

The "double descent" risk curve was recently proposed to qualitatively describe the out-of-sample prediction accuracy of variably-parameterized machine learning models. This article provides a precise mathematical analysis for the shape of this curve in two simple data models with the least squares/least norm predictor. Specifically, it is shown that the risk peaks when the number of features p is close to the sample size n, but also that the risk decreases towards its minimum as p increases beyond n. This behavior is contrasted with that of "prescient" models that select features in an a priori optimal order.


1 Introduction

The “double descent” risk curve was proposed by belkin2018reconciling to qualitatively describe the out-of-sample prediction performance of several variably-parameterized machine learning models. This risk curve reconciles the classical bias-variance trade-off with the behavior of predictive models that interpolate training data, as observed for several model families (including neural networks) in a wide variety of applications [bos1998dynamics, advani2017high, spigler2018jamming, belkin2018reconciling]. In these studies, a predictive model with $p$ parameters is fit to a training sample of size $n$, and the test risk (i.e., out-of-sample error) is examined as a function of $p$. When $p$ is below the sample size $n$, the test risk is governed by the usual bias-variance decomposition. As $p$ is increased towards $n$, the training risk (i.e., in-sample error) is driven to zero, but the test risk shoots up towards infinity. The classical bias-variance analysis identifies a “sweet spot” value of $p \in [0, n]$ at which the bias and variance are balanced to achieve low test risk. However, as $p$ grows beyond $n$, the test risk again decreases, provided that the model is fit using a suitable inductive bias (e.g., least norm solution). In many (but not all) cases, the limiting risk as $p \to \infty$ is lower than what is achieved at the “sweet spot” value of $p$.

In this article, we study the key aspects of the “double descent” risk curve for the least squares/least norm predictor in two simple random features models. The first is a Gaussian model, which was studied by breiman1983many in the $p \le n$ regime. The second is a Fourier series model for functions on the circle. In both cases, we prove that the risk is infinite around $p = n$, and decreases towards its minimum as $p$ increases beyond $n$. Our results provide a precise mathematical analysis of the mechanism described by belkin2018reconciling. The transition from under- to over-parameterized regimes was also analyzed by spigler2018jamming by drawing a connection to the physical phenomenon of “jamming” in particle systems.

We note that in both of the models, the features are selected randomly, which makes them useful for studying scenarios where features are plentiful but individually too “weak” to be selected in an informed manner. Such scenarios are commonplace in machine learning practice, but they should be contrasted with scenarios where features are carefully designed or curated, as is often the case in scientific applications. For comparison, we give an example of “prescient” feature selection, where the $p$ most useful features are included in the model. In this case, the optimal test risk is achieved at some $p < n$, which is consistent with the classical analysis of breiman1983many.

2 Gaussian model

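We consider a regression problem where the response $y$ is equal to a linear function $\beta = (\beta_1, \dots, \beta_D) \in \mathbb{R}^D$ of $D$ real-valued variables $x = (x_1, \dots, x_D)$ plus noise $\sigma\epsilon$:

$$y = x^\top \beta + \sigma\epsilon = \sum_{j=1}^{D} x_j \beta_j + \sigma\epsilon.$$

The learner observes $n$ iid copies $((x^{(i)}, y^{(i)}))_{i=1}^{n}$ of $(x, y)$, but fits a linear model to the data only using a subset $T \subseteq [D] := \{1, \dots, D\}$ of $p := |T|$ variables.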

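Let $X := [x^{(1)} | \cdots | x^{(n)}]^\top$ be the $n \times D$ design matrix, and let $\mathbf{y} := (y^{(1)}, \dots, y^{(n)})^\top$ be the vector of responses. For a subset $A \subseteq [D]$ and a $D$-dimensional vector $v$, we use $v_A := (v_j : j \in A)$ to denote its $|A|$-dimensional subvector of entries from $A$; we also use $X_A$ to denote the $n \times |A|$ design matrix with variables from $A$. For $A \subseteq [D]$, we denote its complement by $A^c := [D] \setminus A$. Finally, $\|\cdot\|$ denotes the Euclidean norm.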

The learner fits regression coefficients $\hat\beta$ with

$$\hat\beta_T := X_T^\dagger \mathbf{y}, \qquad \hat\beta_{T^c} := 0.$$

Above, the symbol $\dagger$ denotes the Moore-Penrose pseudoinverse. In other words, the learner uses the solution to the normal equations of least norm for $\hat\beta_T$ and forces $\hat\beta_{T^c}$ to all-zeros.
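The fitting rule above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name `fit_least_norm`, the sizes `n = 10` and `D = 40`, and the two index sets are all assumptions chosen for the example.

```python
import numpy as np

def fit_least_norm(X, y, T):
    """Least squares / least norm fit on the columns in T; zeros elsewhere."""
    D = X.shape[1]
    beta_hat = np.zeros(D)
    # pinv gives the least squares solution when p <= n and the
    # minimum-norm interpolating solution when p >= n.
    beta_hat[T] = np.linalg.pinv(X[:, T]) @ y
    return beta_hat

rng = np.random.default_rng(0)
n, D = 10, 40                      # illustrative sizes (assumed)
beta = rng.standard_normal(D) / np.sqrt(D)
X = rng.standard_normal((n, D))    # rows are iid standard normal
y = X @ beta                       # noise-free responses (sigma = 0)

T_small = np.arange(5)             # p = 5 < n
T_large = np.arange(25)            # p = 25 > n
b_small = fit_least_norm(X, y, T_small)
b_large = fit_least_norm(X, y, T_large)

# Training residuals: nonzero for p < n, numerically zero for p > n.
train_err_small = np.linalg.norm(X @ b_small - y)
train_err_large = np.linalg.norm(X @ b_large - y)
```

With $p > n$ the fit interpolates the training responses, which is the regime the rest of the section analyzes.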

In the remainder of this section, we analyze the risk of $\hat\beta$ in the case where the distribution of $x$ is the standard normal in $\mathbb{R}^D$, and then specialize the risk under particular selection models for $T$. The Gaussian model was also studied by breiman1983many, although their analysis is restricted to the case where the number of variables used $p$ is always at most $n$; our analysis will also consider the $p > n$ regime. We show that the “interpolating” regime is preferred to the “classical” regime in our model.

We focus on the noise-free setting in Section 2.1 and Section 2.2; we consider noise in Section 2.3.

2.1 Risk analysis

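In this section, we derive an expression for the risk of $\hat\beta$ for an arbitrary choice of features $T$.

We assume $\sigma = 0$ (i.e., the noise-free setting), so $y = x^\top \beta$. Recall that we also assume $x$ follows a standard normal distribution in $\mathbb{R}^D$; since $x$ is isotropic (i.e., zero mean and identity covariance), the mean squared prediction error of any $\beta' \in \mathbb{R}^D$ can be written as

$$\mathbb{E}[(y - x^\top \beta')^2] = \|\beta - \beta'\|^2 = \|\beta_{T^c} - \beta'_{T^c}\|^2 + \|\beta_T - \beta'_T\|^2.$$

Since $\hat\beta_{T^c} = 0$, it follows that the risk of $\hat\beta$ is

$$\mathbb{E}[(y - x^\top \hat\beta)^2] = \|\beta_{T^c}\|^2 + \mathbb{E}[\|\beta_T - \hat\beta_T\|^2].$$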

Classical regime.

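The risk of $\hat\beta$ was computed by breiman1983many in the regime where $p \le n$:

$$\mathbb{E}[(y - x^\top \hat\beta)^2] = \begin{cases} \|\beta_{T^c}\|^2 \cdot \left(1 + \dfrac{p}{n-p-1}\right) & \text{if } p \le n-2; \\ 0 & \text{if } \beta_{T^c} = 0; \\ +\infty & \text{if } p \in \{n-1, n\} \text{ and } \beta_{T^c} \ne 0. \end{cases}$$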

Interpolating regime.

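We consider the regime where $p \ge n$. Recall that the pseudoinverse of $X_T$ can be written as $X_T^\dagger = X_T^\top (X_T X_T^\top)^\dagger$. Thus,

$$\begin{aligned} \beta_T - \hat\beta_T &= \beta_T - X_T^\top (X_T X_T^\top)^\dagger \mathbf{y} \\ &= \beta_T - X_T^\top (X_T X_T^\top)^\dagger (X_{T^c}\beta_{T^c} + X_T \beta_T) \\ &= (I - X_T^\top (X_T X_T^\top)^\dagger X_T)\beta_T - X_T^\top (X_T X_T^\top)^\dagger X_{T^c}\beta_{T^c}. \end{aligned}$$

On the right hand side, the first term is the orthogonal projection of $\beta_T$ onto the null space of $X_T$, while the second term is a vector in the row space of $X_T$. By the Pythagorean theorem, the squared norm of their sum is equal to the sum of their squared norms, so

$$\|\beta_T - \hat\beta_T\|^2 = \|(I - X_T^\top (X_T X_T^\top)^\dagger X_T)\beta_T\|^2 + \|X_T^\top (X_T X_T^\top)^\dagger X_{T^c}\beta_{T^c}\|^2.$$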

We analyze the expected values of these two terms by exploiting properties of the standard normal distribution.

First term.

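Note that $\Pi_T := X_T^\top (X_T X_T^\top)^\dagger X_T$ is the orthogonal projection matrix for the row space of $X_T$. So, by the Pythagorean theorem, we have

$$\|(I - X_T^\top (X_T X_T^\top)^\dagger X_T)\beta_T\|^2 = \|\beta_T\|^2 - \|\Pi_T \beta_T\|^2.$$

By rotational symmetry of the standard normal distribution, it follows that

$$\mathbb{E}[\|\Pi_T \beta_T\|^2] = \|\beta_T\|^2 \cdot \frac{n}{p}.$$

Therefore

$$\mathbb{E}[\|(I - X_T^\top (X_T X_T^\top)^\dagger X_T)\beta_T\|^2] = \|\beta_T\|^2 \cdot \left(1 - \frac{n}{p}\right).$$

Second term.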

We use the “trace trick” to write

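$$\begin{aligned} \|X_T^\top (X_T X_T^\top)^\dagger X_{T^c}\beta_{T^c}\|^2 &= \operatorname{tr}\bigl((X_T X_T^\top)^\dagger (X_T X_T^\top)(X_T X_T^\top)^\dagger (X_{T^c}\beta_{T^c})(X_{T^c}\beta_{T^c})^\top\bigr) \\ &= \operatorname{tr}\bigl((X_T X_T^\top)^\dagger (X_{T^c}\beta_{T^c})(X_{T^c}\beta_{T^c})^\top\bigr) \end{aligned}$$

where the second equality holds almost surely because $X_T X_T^\top$ is almost surely invertible. Since $X_T$ and $X_{T^c}$ are uncorrelated and jointly Gaussian, hence independent, it follows that

$$\mathbb{E}[\|X_T^\top (X_T X_T^\top)^\dagger X_{T^c}\beta_{T^c}\|^2] = \operatorname{tr}\bigl(\mathbb{E}[(X_T X_T^\top)^\dagger]\, \mathbb{E}[(X_{T^c}\beta_{T^c})(X_{T^c}\beta_{T^c})^\top]\bigr).$$

The distribution of $X_{T^c}\beta_{T^c}$ is normal with mean zero and covariance $\|\beta_{T^c}\|^2 \cdot I$, so

$$\mathbb{E}[(X_{T^c}\beta_{T^c})(X_{T^c}\beta_{T^c})^\top] = \|\beta_{T^c}\|^2 \cdot I.$$

The distribution of $(X_T X_T^\top)^\dagger$ is inverse-Wishart with identity scale matrix $I \in \mathbb{R}^{n \times n}$ and $p$ degrees-of-freedom, so

$$\operatorname{tr}\bigl(\mathbb{E}[(X_T X_T^\top)^\dagger]\bigr) = \begin{cases} \dfrac{n}{p-n-1} & \text{if } p \ge n+2; \\ +\infty & \text{if } p \in \{n, n+1\}. \end{cases}$$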

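Combining the last two displayed equations yields a simple expression for $\mathbb{E}[\|X_T^\top (X_T X_T^\top)^\dagger X_{T^c}\beta_{T^c}\|^2]$.

Thus, we obtain expressions for $\mathbb{E}[\|\beta_T - \hat\beta_T\|^2]$ and hence also $\mathbb{E}[(y - x^\top \hat\beta)^2]$.

We summarize the risk of $\hat\beta$ in the following theorem.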

Theorem 1.

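Assume the distribution of $x$ is the standard normal in $\mathbb{R}^D$ and $y = x^\top \beta$ for some $\beta \in \mathbb{R}^D$. Pick any $p \in \{0, \dots, D\}$ and $T \subseteq [D]$ of cardinality $p$. The risk of $\hat\beta$, where $\hat\beta_T = X_T^\dagger \mathbf{y}$ and $\hat\beta_{T^c} = 0$, is

$$\mathbb{E}[(y - x^\top \hat\beta)^2] = \begin{cases} \|\beta_{T^c}\|^2 \cdot \left(1 + \dfrac{p}{n-p-1}\right) & \text{if } p \le n-2; \\ +\infty & \text{if } n-1 \le p \le n+1 \text{ and } \beta_{T^c} \ne 0; \\ \|\beta_T\|^2 \cdot \left(1 - \dfrac{n}{p}\right) + \|\beta_{T^c}\|^2 \cdot \left(1 + \dfrac{n}{p-n-1}\right) & \text{if } p \ge n+2; \\ \|\beta_T\|^2 \cdot \max\left\{1 - \dfrac{n}{p}, 0\right\} & \text{if } \beta_{T^c} = 0. \end{cases}$$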
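For readers who want to evaluate the theorem numerically, the case analysis can be transcribed into a small helper. This is only a sketch: the function name `theorem1_risk` and its argument names are hypothetical, and the convention of returning zero when $p = 0$ and $\beta_{T^c} = 0$ (i.e., $\beta = 0$) is an assumption made to avoid dividing by zero.

```python
import math

def theorem1_risk(beta_T_sq, beta_Tc_sq, n, p):
    """Noise-free risk from Theorem 1.

    beta_T_sq  = ||beta_T||^2   (signal on the selected features)
    beta_Tc_sq = ||beta_{T^c}||^2 (signal on the omitted features)
    """
    if beta_Tc_sq == 0:
        # Nothing is omitted: only the null-space component of beta_T is lost.
        if p == 0:
            return 0.0  # assumed convention: beta is the zero vector
        return beta_T_sq * max(1 - n / p, 0)
    if p <= n - 2:
        return beta_Tc_sq * (1 + p / (n - p - 1))
    if p <= n + 1:
        return math.inf
    return beta_T_sq * (1 - n / p) + beta_Tc_sq * (1 + n / (p - n - 1))

# Example with n = 10 and the signal split evenly between T and T^c.
risk_null = theorem1_risk(0.5, 0.5, 10, 0)      # no features used
risk_peak = theorem1_risk(0.5, 0.5, 10, 10)     # at the interpolation threshold
risk_over = theorem1_risk(0.5, 0.5, 10, 21)     # interpolating regime
risk_full = theorem1_risk(1.0, 0.0, 10, 20)     # all signal captured, p = 2n
```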

2.2 Feature selection model

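We now study the risk of $\hat\beta$ under a random selection model for $T$. Again, we assume $\sigma = 0$.

Let $T$ be a uniformly random subset of $[D]$ of cardinality $p$, so

$$\mathbb{E}[\|\beta_T\|^2] = \frac{p}{D} \cdot \|\beta\|^2, \qquad \mathbb{E}[\|\beta_{T^c}\|^2] = \left(1 - \frac{p}{D}\right) \cdot \|\beta\|^2.$$

We analyze the risk of $\hat\beta$, taking expectation with respect to the random choice of $T$. First, consider $p \le n-2$. By Theorem 1, the risk of $\hat\beta$ is

$$\mathbb{E}[(y - x^\top \hat\beta)^2] = \|\beta\|^2 \cdot \left(1 - \frac{p}{D}\right) \cdot \left(1 + \frac{p}{n-p-1}\right),$$

which increases with $p$ as long as $D \ge n$. Now consider $p \ge n+2$. By Theorem 1, the risk of $\hat\beta$ is

$$\mathbb{E}[(y - x^\top \hat\beta)^2] = \|\beta\|^2 \cdot \left(1 - \frac{n}{D} \cdot \left(2 - \frac{D-n-1}{p-n-1}\right)\right),$$

which decreases with $p$ as long as $D > n+1$.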
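As a worked check (an added illustration, not part of the original derivation), the interpolating-regime expression can be recovered by substituting $\mathbb{E}[\|\beta_T\|^2] = (p/D)\|\beta\|^2$ and $\mathbb{E}[\|\beta_{T^c}\|^2] = (1 - p/D)\|\beta\|^2$ into the $p \ge n+2$ case of Theorem 1:

$$\begin{aligned} \frac{\mathbb{E}[(y - x^\top \hat\beta)^2]}{\|\beta\|^2} &= \frac{p}{D}\left(1 - \frac{n}{p}\right) + \left(1 - \frac{p}{D}\right)\left(1 + \frac{n}{p-n-1}\right) \\ &= \frac{p-n}{D} + \frac{D-p}{D} + \frac{n}{D} \cdot \frac{D-p}{p-n-1} \\ &= 1 - \frac{n}{D} + \frac{n}{D}\left(\frac{D-n-1}{p-n-1} - 1\right) \\ &= 1 - \frac{n}{D}\left(2 - \frac{D-n-1}{p-n-1}\right). \end{aligned}$$

The third line uses $D - p = (D-n-1) - (p-n-1)$. Only the term $(D-n-1)/(p-n-1)$ depends on $p$, and it is decreasing in $p$ exactly when $D - n - 1 > 0$.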

Thus, we observe that the risk first increases with $p$ up to the “interpolation threshold” ($p = n$), after which the risk decreases with $p$. Moreover, the risk is smallest at $p = D$. This is the “double descent” risk curve observed by belkin2018reconciling where the first “descent” is degenerate (i.e., the “sweet spot” that balances bias and variance is at $p = 0$). See Figure 1 for an illustration. For a scenario where the first “descent” is non-degenerate, see Appendix A.
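The shape of this curve can be traced directly from the two closed-form expressions for the random selection model. The sketch below is illustrative only: the function name `random_selection_risk` and the values `n = 20`, `D = 100` are assumptions, and the helper assumes $D \ge n + 2$ and $\|\beta\|^2 = 1$ by default.

```python
import math

def random_selection_risk(n, D, p, beta_sq=1.0):
    """Noise-free risk averaged over a uniformly random T of size p.

    Assumes D >= n + 2, so that some signal is omitted with positive
    probability whenever p is near n and the risk is infinite there.
    """
    if p <= n - 2:
        return beta_sq * (1 - p / D) * (1 + p / (n - p - 1))
    if p <= n + 1:
        return math.inf
    return beta_sq * (1 - (n / D) * (2 - (D - n - 1) / (p - n - 1)))

n, D = 20, 100  # illustrative sizes (assumed)
curve = [random_selection_risk(n, D, p) for p in range(D + 1)]

classical = curve[: n - 1]       # p = 0, ..., n - 2
interpolating = curve[n + 2 :]   # p = n + 2, ..., D
best_p = min(range(D + 1), key=lambda p: curve[p])
```

In this example the minimum is attained at `best_p == D`, where the risk equals $1 - n/D$, while the classical branch never drops below its value at $p = 0$.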

It is worth pointing out that the behavior under the random selection model of $T$ can be very different from that under a deterministic model of $T$. Consider including variables in $T$ by decreasing order of $\beta_j^2$, a kind of “prescient” selection model studied by breiman1983many. For simplicity, assume $\beta_j^2 = 1/j^2$ and $D = \infty$, so

$$\|\beta_T\|^2 = \sum_{j=1}^{p} \frac{1}{j^2}, \qquad \|\beta_{T^c}\|^2 = \frac{\pi^2}{6} - \sum_{j=1}^{p} \frac{1}{j^2}.$$

The behavior of the risk as a function of $p$, illustrated in Figure 2, reveals a striking difference between the random selection model and the “prescient” selection model.
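To see the contrast concretely, one can plug the “prescient” values of $\|\beta_T\|^2$ and $\|\beta_{T^c}\|^2$ into Theorem 1 and scan over $p$. The following is a rough sketch under assumed settings: the sample size `n = 40` and the scan range up to `p = 400` are illustrative choices, not values taken from the paper's figure.

```python
import math

TOTAL = math.pi ** 2 / 6  # ||beta||^2 = sum_{j >= 1} 1/j^2 when D is infinite

def prescient_risk(n, p):
    """Theorem 1 risk when T holds the p largest coefficients, beta_j^2 = 1/j^2."""
    head = sum(1.0 / j ** 2 for j in range(1, p + 1))  # ||beta_T||^2
    tail = TOTAL - head                                # ||beta_{T^c}||^2
    if p <= n - 2:
        return tail * (1 + p / (n - p - 1))
    if p <= n + 1:
        return math.inf
    return head * (1 - n / p) + tail * (1 + n / (p - n - 1))

n = 40  # illustrative sample size (assumed)
risks = {p: prescient_risk(n, p) for p in range(0, 401)}
best_p = min(risks, key=risks.get)
best_interpolating = min(risks[p] for p in range(n + 2, 401))
```

Under these settings the minimizer `best_p` falls well below `n`, and every interpolating choice of $p$ in the scanned range does worse than the classical optimum.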

2.3 Noise

We now briefly discuss the effect of additive noise in the responses. The following is a straightforward generalization of Theorem 1.

Theorem 2.

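Assume the distribution of $x$ is the standard normal in $\mathbb{R}^D$, $\epsilon$ is a standard normal random variable independent of $x$, and $y = x^\top \beta + \sigma\epsilon$ for some $\beta \in \mathbb{R}^D$ and $\sigma > 0$. Pick any $p \in \{0, \dots, D\}$ and $T \subseteq [D]$ of cardinality $p$. The risk of $\hat\beta$, where $\hat\beta_T = X_T^\dagger \mathbf{y}$ and $\hat\beta_{T^c} = 0$, is

$$\mathbb{E}[(y - x^\top \hat\beta)^2] = \begin{cases} (\|\beta_{T^c}\|^2 + \sigma^2) \cdot \left(1 + \dfrac{p}{n-p-1}\right) & \text{if } p \le n-2; \\ +\infty & \text{if } n-1 \le p \le n+1; \\ \|\beta_T\|^2 \cdot \left(1 - \dfrac{n}{p}\right) + (\|\beta_{T^c}\|^2 + \sigma^2) \cdot \left(1 + \dfrac{n}{p-n-1}\right) & \text{if } p \ge n+2. \end{cases}$$

A similar analysis with a random selection model for $T$ can be obtained from Theorem 2.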
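A quick Monte Carlo experiment can serve as a sanity check on Theorem 2. The sketch below is an assumption-laden illustration: the helper names, the sizes `D = 30`, `n = 10`, the noise level `sigma = 0.5`, the choice of `T` as the first `p` coordinates, and the 4000 trials are all arbitrary, and the simulated value is only expected to agree with the formula up to sampling error.

```python
import numpy as np

def theorem2_risk(beta_T_sq, beta_Tc_sq, sigma, n, p):
    """Risk from Theorem 2 (noisy responses)."""
    s = beta_Tc_sq + sigma ** 2
    if p <= n - 2:
        return s * (1 + p / (n - p - 1))
    if p <= n + 1:
        return np.inf
    return beta_T_sq * (1 - n / p) + s * (1 + n / (p - n - 1))

def simulate_risk(beta, T, sigma, n, trials, rng):
    """Average test risk of the least norm fit over fresh training samples."""
    D = beta.size
    total = 0.0
    for _ in range(trials):
        X = rng.standard_normal((n, D))
        y = X @ beta + sigma * rng.standard_normal(n)
        beta_hat = np.zeros(D)
        beta_hat[T] = np.linalg.pinv(X[:, T]) @ y
        # For isotropic x and independent noise, the test risk given the
        # fitted coefficients is ||beta - beta_hat||^2 + sigma^2.
        total += np.sum((beta - beta_hat) ** 2) + sigma ** 2
    return total / trials

rng = np.random.default_rng(0)
D, n, sigma = 30, 10, 0.5
beta = rng.standard_normal(D)
beta /= np.linalg.norm(beta)  # normalize so that ||beta||^2 = 1

results = {}
for p in (3, 25):  # one classical and one interpolating choice
    T = np.arange(p)
    b_T = float(np.sum(beta[:p] ** 2))
    simulated = simulate_risk(beta, T, sigma, n, 4000, rng)
    predicted = theorem2_risk(b_T, 1.0 - b_T, sigma, n, p)
    results[p] = (simulated, predicted)
```

Values of $p$ close to $n$ are deliberately avoided here, since the risk has heavy tails near the interpolation threshold and sample averages converge slowly.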

3 Fourier series model

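Let $F \in \mathbb{C}^{D \times D}$ denote the $D \times D$ discrete Fourier transform matrix: its $(i,j)$-th entry is

$$F_{i,j} = \frac{1}{\sqrt{D}}\, \omega^{(i-1)(j-1)},$$

where $\omega$ is a primitive $D$-th root of unity. Let $\mu := F\beta$ for some $\beta \in \mathbb{C}^D$. Consider the following observation model:

1. $S$ and $T$ are independent and uniformly random subsets of $[D]$ of cardinalities $n$ and $p$, respectively.

2. We observe the design matrix $F_{S,T}$ and $n$-dimensional vector of responses $\mu_S$. Here, $F_{S,T}$ is the $n \times p$ submatrix of $F$ with rows from $S$ and columns from $T$, and $\mu_S$ is the subvector of $\mu$ of entries from $S$.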

The learner fits regression coefficients $\hat\beta$ with

$$\hat\beta_T := F_{S,T}^\dagger \mu_S, \qquad \hat\beta_{T^c} := 0.$$

This can be regarded as a one-dimensional version of the random Fourier features model studied by rahimi2008random for functions defined on the unit circle.

One important property of the discrete Fourier transform matrix that we use is that the matrix $F_{S,T}$ has rank $\min\{|S|, |T|\}$ for any $S, T \subseteq [D]$. This is a consequence of the fact that $F$ is Vandermonde. Thus, for $p \ge n$, we have

$$F_{S,T}^\dagger = F_{S,T}^* (F_{S,T} F_{S,T}^*)^{-1}.$$

In the remainder of this section, we analyze the risk of $\hat\beta$ under a random model for $\beta$, where

$$\mathbb{E}[\beta\beta^*] = \frac{1}{D} \cdot I$$

(which implies $\mathbb{E}[\|\beta\|^2] = 1$). The random choice of $\beta$ is independent of $S$ and $T$. Considering the risk under this random model for $\beta$ is a form of average-case analysis. For simplicity, we only consider the regime where $p \ge n$, as it suffices to reveal some key aspects of the risk of $\hat\beta$.

Following the arguments from Section 2.1, we have

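$$\begin{aligned} \|\beta - \hat\beta\|^2 &= \|\beta_{T^c}\|^2 + \|(I - F_{S,T}^\dagger F_{S,T})\beta_T\|^2 + \|F_{S,T}^\dagger F_{S,T^c}\beta_{T^c}\|^2 \\ &= \|\beta\|^2 - \|F_{S,T}^\dagger F_{S,T}\beta_T\|^2 + \|F_{S,T}^\dagger F_{S,T^c}\beta_{T^c}\|^2. \end{aligned}$$

Now we take (conditional) expectations with respect to $\beta$, given $S$ and $T$:

$$\mathbb{E}[\|\beta - \hat\beta\|^2 \mid S, T] = 1 - \frac{1}{D} \cdot \operatorname{tr}\bigl((F_{S,T}^\dagger F_{S,T})^* (F_{S,T}^\dagger F_{S,T})\bigr) + \frac{1}{D} \cdot \operatorname{tr}\bigl((F_{S,T}^\dagger F_{S,T^c})^* (F_{S,T}^\dagger F_{S,T^c})\bigr). \tag{1}$$

Since $F_{S,T}$ has rank $n$, the first trace expression is equal to

$$\operatorname{tr}\bigl((F_{S,T}^\dagger F_{S,T})^* (F_{S,T}^\dagger F_{S,T})\bigr) = n.$$

For the second trace expression, we use the explicit formula for $F_{S,T}^\dagger$ and the fact that $F_{S,T} F_{S,T}^* + F_{S,T^c} F_{S,T^c}^* = I$ to obtain

$$\begin{aligned} \operatorname{tr}\bigl((F_{S,T}^\dagger F_{S,T^c})^* (F_{S,T}^\dagger F_{S,T^c})\bigr) &= \operatorname{tr}\bigl(F_{S,T^c}^* (F_{S,T} F_{S,T}^*)^{-1} F_{S,T^c}\bigr) \\ &= \operatorname{tr}\bigl(F_{S,T^c}^* (I - F_{S,T^c} F_{S,T^c}^*)^{-1} F_{S,T^c}\bigr) \\ &= \operatorname{tr}\bigl((I - F_{S,T^c} F_{S,T^c}^*)^{-1} F_{S,T^c} F_{S,T^c}^*\bigr) \\ &= \sum_{i=1}^{n} \frac{\lambda_i}{1 - \lambda_i} \\ &= -n + \sum_{i=1}^{n} \frac{1}{1 - \lambda_i}, \end{aligned}$$

where $\lambda_1, \dots, \lambda_n$ are the eigenvalues of $F_{S,T^c} F_{S,T^c}^*$. Therefore, from Equation 1, we have

$$\mathbb{E}[\|\beta - \hat\beta\|^2] = 1 - \frac{2n}{D} + \frac{n}{D} \cdot \underbrace{\mathbb{E}\left[\frac{1}{n}\sum_{i=1}^{n} \frac{1}{1 - \lambda_i}\right]}_{(*)}.$$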
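The trace identities used in this derivation are easy to check numerically for a single draw of $S$ and $T$. The sketch below makes several assumptions that are not in the text: it takes $\omega = e^{-2\pi \mathrm{i}/D}$ as the primitive root, uses a prime dimension `D = 31` so that every square submatrix of the DFT matrix is invertible, and picks `n = 8`, `p = 16` arbitrarily.

```python
import numpy as np

def dft_matrix(D):
    """Unitary D x D DFT matrix with omega = exp(-2*pi*i/D) (assumed choice)."""
    idx = np.arange(D)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / D) / np.sqrt(D)

rng = np.random.default_rng(0)
D, n, p = 31, 8, 16   # illustrative sizes; D is prime by assumption
F = dft_matrix(D)
S = rng.choice(D, size=n, replace=False)
T = rng.choice(D, size=p, replace=False)
Tc = np.setdiff1d(np.arange(D), T)

F_ST = F[np.ix_(S, T)]
F_STc = F[np.ix_(S, Tc)]
F_ST_pinv = np.linalg.pinv(F_ST)

# tr(A^* A) equals the squared Frobenius norm of A.
first_trace = np.linalg.norm(F_ST_pinv @ F_ST, "fro") ** 2
second_trace = np.linalg.norm(F_ST_pinv @ F_STc, "fro") ** 2

# Eigenvalues of the Hermitian matrix F_{S,T^c} F_{S,T^c}^*.
lam = np.linalg.eigvalsh(F_STc @ F_STc.conj().T)
closed_form = -n + np.sum(1.0 / (1.0 - lam))

# Rows of F are orthonormal, so the two Gram matrices sum to the identity.
row_identity = F_ST @ F_ST.conj().T + F_STc @ F_STc.conj().T

# Conditional risk from Equation (1) for this particular S and T.
conditional_risk = 1 - first_trace / D + second_trace / D
```

Averaging `conditional_risk` over many independent draws of `S` and `T` would give a Monte Carlo estimate of the unconditional risk.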

A precise characterization of $(*)$ is difficult to obtain. Under a slightly different model, in which membership in $S$ (respectively, $T$) is determined by independent Bernoulli variables with mean $\rho_n$ (respectively, $\rho_p$), we can use asymptotic arguments to characterize the empirical eigenvalue distribution for $F_{S,T^c} F_{S,T^c}^*$. (Footnote 1: We can also derive essentially the same formula for the risk under this Bernoulli model, although the derivation is somewhat more cumbersome since we do not always have $|S| \le |T|$ even if $\rho_n < \rho_p$. Hence, we have opted to present the derivation only for the simpler model.) Assuming the asymptotic equivalence of these random models for $S$ and $T$, we find that the quantity $(*)$ approaches

$$\frac{\rho_p \cdot (1 - \rho_n)}{\rho_p - \rho_n}$$

as $n, p, D \to \infty$, where $\rho_n := n/D$ and $\rho_p := p/D$ are held fixed and $\rho_n < \rho_p$ [farrell2011limiting]. So, in this limit, we have

$$\mathbb{E}[\|\beta - \hat\beta\|^2] \to 1 - \rho_n \cdot \left(2 - \frac{\rho_p \cdot (1 - \rho_n)}{\rho_p - \rho_n}\right).$$

This quantity diverges to $+\infty$ as $\rho_p \to \rho_n$, and decreases as $\rho_p \to 1$. This is the same behavior as in the Gaussian model from Section 2 with random selection; we depict it empirically in Figure 3.
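The limiting expression is simple enough to evaluate directly. The helper below is a small sketch with a hypothetical name, `limiting_fourier_risk`; the value $\rho_n = 0.2$ and the grid of $\rho_p$ values are arbitrary choices for illustration.

```python
def limiting_fourier_risk(rho_n, rho_p):
    """Asymptotic risk 1 - rho_n * (2 - rho_p * (1 - rho_n) / (rho_p - rho_n)).

    Only defined here for 0 < rho_n < rho_p <= 1 (the interpolating regime).
    """
    if not 0 < rho_n < rho_p <= 1:
        raise ValueError("requires 0 < rho_n < rho_p <= 1")
    return 1 - rho_n * (2 - rho_p * (1 - rho_n) / (rho_p - rho_n))

rho_n = 0.2  # illustrative sampling ratio n/D (assumed)
grid = [0.25 + 0.05 * k for k in range(15)] + [1.0]  # rho_p from 0.25 up to 1.0
risks = [limiting_fourier_risk(rho_n, rho_p) for rho_p in grid]

near_threshold = limiting_fourier_risk(rho_n, rho_n + 1e-4)
at_full = limiting_fourier_risk(rho_n, 1.0)
```

At $\rho_p = 1$ the expression reduces to $1 - \rho_n$, matching the value $1 - n/D$ found at $p = D$ in the Gaussian model with random selection.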

4 Discussion

Our analyses show that when features are chosen in an uninformed manner, it may be optimal to choose as many as possible (even more than the number of data points) rather than limit the number to that which balances bias and variance. This choice is conceptually and algorithmically simple and avoids the need for precise control of regularization parameters. It is reminiscent of the practice in machine learning applications like image and speech recognition, where signal processing-based features are individually weak but in great abundance, and models that use all of the features are highly successful. This stands in contrast to scenarios where informed selection of features is possible; for example, in many scientific and medical applications, features are hand-crafted and purposefully selected. As illustrated by the “prescient” selection model, choosing the number of features to balance bias and variance can be better than incurring the costs that come with using all of the features. The best practices for model and feature selection thus crucially depend on which regime the application is operating under.

Appendix A Non-degenerate double descent in a Fourier series model

To observe the “double descent” risk curve where the first “descent” is non-degenerate, we consider a model in which the distribution of the feature vector is non-isotropic; instead, has a diagonal covariance matrix with decaying eigenvalues (e.g., ). In such a scenario, it is natural to select features in decreasing order of the eigenvalues, as is done in principal components regression. Like random feature selection, this form of feature selection is also generally uninformed by the responses.

Formally, we let for some and positive sequence , and use the same observation model as in Section 3, except that is deterministically set to . We fit in the same manner as in Section 3. We are interested in the risk

$$\sum_{j=1}^{D} t_j^2 (\beta_j - \hat\beta_j)^2,$$

which can be regarded as the mean squared error when is drawn uniformly at random from the rows of the design matrix .

We carried out the same simulation as from Figure 3 under the modified model, with and . We chose uniformly at random (once) from the unit sphere in for . Then, for each , we computed from independent random choices of (with ), and plotted the average value of . The plot is shown below. (The vertical axis is truncated for clarity; the curve peaks around with value on the order of .)

The plot shows the usual “U”-shaped curve arising from the bias-variance trade-off when $p < n$, and a second “descent” towards the overall minimum for $p > n$. This risk curve is qualitatively the same as those observed by belkin2018reconciling for neural networks and other predictive models.