Common statistical intuition suggests that more data should never harm the performance of an estimator. It was recently highlighted in [deep] that this may not hold for overparameterized
models: there are settings in modern deep learning where training on more data actually hurts. In this note, we analyze a simple setting to understand the mechanisms behind this behavior.
We focus on well-specified linear regression with Gaussian covariates, and we analyze the test risk of the minimum-norm ridgeless regression estimator, or equivalently, the estimator found by gradient descent on the least squares objective. We show that as we increase the number of samples, performance is non-monotonic: the test risk first decreases, then increases, before decreasing again.
Such a “double-descent” behavior has been observed in the behavior of test risk as a function of the model size in a variety of machine learning settings [opper1995statistical, opper2001learning, advani2017high, belkin2018reconciling, spigler2018jamming, geiger2019jamming, deep]. Many of these works are motivated by understanding the test risk as a function of model size, for a fixed number of samples. In this work, we take a complementary view and understand the test risk as a function of sample size, for a fixed model. We hope that understanding such simple settings can eventually lead to understanding the general phenomenon, and lead us to design learning algorithms which make the best use of data (and in particular, are monotonic in samples).
We note that similar analyses appear in recent works, which we discuss in Section 1.1; our focus is to highlight the sample non-monotonicity implicit in these works, and give intuitions for the mechanisms behind it. We specifically refer the reader to [hastie2019surprises, mei2019generalization] for analysis in a setting most similar to ours.
We first define the linear regression setting in Section 2. Then in Section 3 we state the form of the estimator found by gradient descent, and give intuitions for why this estimator has a peak in test risk when the number of samples is equal to the ambient dimension. In Section 3.1, we decompose the expected excess risk into bias and variance contributions, and we state approximate expressions for the bias, variance, and excess risk as a function of samples. We show that these approximate theoretical predictions closely agree with practice, as in Figure 1.
The peak in test risk turns out to be related to the conditioning of the data matrix, and in Section 3.2 we give intuitions for why this matrix is poorly conditioned in the “critical regime”, but well conditioned outside of it. We also analyze the marginal effect of adding a single sample to the test risk, in Section 3.3. We conclude with discussion and open questions in Section 4.
1.1 Related Works
This work was inspired by the long line of work studying “double descent” phenomena in deep and shallow models. The general principle is that as the model complexity increases, the test risk of trained models first decreases and then increases (the standard U-shape), and then decreases again. The peak in test risk occurs in the “critical regime” when the models are just barely able to fit the training set. The second descent occurs in the “overparameterized regime”, when the model capacity is large enough to contain several interpolants on the training data. This phenomenon appears to be fairly universal among natural learning algorithms, and is observed in simple settings such as linear regression, random features regression, classification with random forests, as well as modern neural networks. Double descent of test risk with model size was introduced in generality by [belkin2018reconciling], building on similar behavior observed as early as [opper1995statistical, opper2001learning] and more recently by [advani2017high, neal2018modern, spigler2018jamming, geiger2019jamming]. A generalized double descent phenomenon was demonstrated on modern deep networks by [deep], which also highlighted “sample-wise nonmonotonicity” as a consequence of double descent, showing that more data can hurt for deep neural networks.
A number of recent works theoretically analyze the double descent behavior in simplified settings, often for linear models [belkin2019two, hastie2019surprises, bartlett2019benign, muthukumar2019harmless, bibas2019new, Mitra2019UnderstandingOP, mei2019generalization, liang2018just, liang2019risk, xu2019number, dereziski2019exact, lampinen2018analytic, deng2019model]. At a high level, these works analyze the test risk of estimators in overparameterized linear regression with different assumptions on the covariates. We specifically refer the reader to [hastie2019surprises, mei2019generalization] for rigorous analysis in a setting most similar to ours. In particular, [hastie2019surprises] considers the asymptotic risk of the minimum norm ridgeless regression estimator in the limit where dimension $d$ and number of samples $n$ are scaled as $n, d \to \infty$ with $n/d \to \gamma$. We instead focus on the sample-wise perspective: a fixed large $d$, but varying $n$. In terms of technical content, the analysis technique is not novel to our work, and similar calculations appear in some of the prior works above. Our main contribution is highlighting the sample non-monotonic behavior in a simple setting, and elaborating on the mechanisms responsible.
While many of the above theoretical results are qualitatively similar, we highlight one interesting distinction: our setting is well-specified, and the bias of the estimator is monotone nonincreasing in number of samples (see Equation 3, and also [hastie2019surprises, Section 3]). In contrast, for misspecified problems (e.g. when the ground-truth is nonlinear, but we learn a linear model), the bias can actually increase with number of samples in addition to the variance increasing (see [mei2019generalization]).
2 Problem Setup
Consider the following learning problem: The ground-truth distribution is $(x, y) \sim \mathcal{D}$, with covariates $x \sim \mathcal{N}(0, I_d)$ and response $y = \langle x, \beta \rangle + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, for some unknown, arbitrary $\beta \in \mathbb{R}^d$ such that $\|\beta\|_2 = 1$. That is, the ground-truth is an isotropic Gaussian with observation noise. We are given $n$ samples $(x_i, y_i)$ from the distribution, and we want to learn a linear model $x \mapsto \langle x, w \rangle$ for estimating $y$ given $x$. That is, we want to find $w \in \mathbb{R}^d$ with small test mean squared error
$$\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[(\langle x, w\rangle - y)^2\big] = \|w - \beta\|^2 + \sigma^2 \qquad \text{(for isotropic } x\text{)}.$$
Suppose we do this by performing ridgeless linear regression. Specifically, we run gradient descent initialized at $w = 0$ on the following objective (the empirical risk):
$$\hat{L}(w) = \|Xw - y\|^2, \qquad (1)$$
where $X \in \mathbb{R}^{n \times d}$ is the data matrix of samples $x_i$, and $y \in \mathbb{R}^n$ are the observations.
The solution found by gradient descent at convergence is $\hat{\beta} = X^{+}y$, where $X^{+}$ denotes the Moore–Penrose pseudoinverse. (To see this, notice that the iterates of gradient descent lie in the row-space of $X$.) Figure 0(a) plots the expected test MSE of this estimator as we vary the number of train samples $n$. Note that it is non-monotonic, with a peak in test MSE at $n = d$.
There are two surprising aspects of the test risk in Figure 0(a), in the overparameterized regime ($n \leq d$):
The first descent: where test risk initially decreases even when we have fewer samples than dimensions ($n < d$). This occurs because the bias decreases.
The first ascent: where test risk increases, and peaks when $n = d$. This is because the variance increases, and diverges as $n \to d$.
When $n > d$, this is the classical underparameterized regime, and test risk is monotone decreasing with the number of samples.
Thus overparameterized linear regression exhibits a bias-variance tradeoff: bias decreases with more samples, but variance can increase. Below, we elaborate on the mechanisms and provide intuition for this non-monotonic behavior.
The solution found by gradient descent, $\hat{\beta} = X^{+}y$, has different forms depending on the ratio $n/d$. When $n \geq d$, we are in the “underparameterized” regime and there is a unique minimizer of the objective in Equation 1. When $n < d$, we are “overparameterized” and there are many minimizers of Equation 1. In fact, since $X$ is full rank with probability 1, there are many minimizers which interpolate, i.e. satisfy $Xw = y$. In this regime, gradient descent finds the minimizer with smallest norm $\|w\|_2$. That is, the solution can be written as
$$\hat{\beta} = X^T(XX^T)^{-1}y. \qquad (2)$$
The overparameterized form yields insight into why the test MSE peaks at $n = d$. Recall that the observations are noisy, i.e. $y = X\beta + \varepsilon$ where $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$. When $n \ll d$, there are many interpolating estimators $\hat{\beta}$, and in particular there exist such $\hat{\beta}$ with small norm. In contrast, when $n = d$, there is exactly one interpolating estimator $\hat{\beta}$, but this estimator must have high norm in order to fit the noise $\varepsilon$. More precisely, consider the decomposition
$$\hat{\beta} = X^{+}y = \underbrace{X^{+}X\beta}_{\text{signal}} + \underbrace{X^{+}\varepsilon}_{\text{noise}}.$$
The signal term $X^{+}X\beta$ is simply the orthogonal projection of $\beta$ onto the rows of $X$. When we are “critically parameterized” and $n \approx d$, the data matrix $X$ is very poorly conditioned, and hence the noise term $X^{+}\varepsilon$ has high norm, overwhelming the signal. This argument is made precise in Section 3.1, and in Section 3.2 we give intuition for why $X$ becomes poorly conditioned when $n \approx d$.
The main point is that when $n \approx d$, forcing the estimator to interpolate the noise will force it to have very high norm, far from the ground-truth $\beta$. (See also Corollary 1 of [hastie2019surprises] for a quantification of this point.)
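The norm blow-up is easy to observe directly. A small numpy sketch (ours; the parameters $d = 50$, $\sigma = 0.5$ are illustrative) compares the norm of the minimum-norm interpolant far below criticality and at $n = d$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, sigma = 50, 0.5
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)                 # ground truth, ||beta|| = 1

def interpolant_norm(n, trials=100):
    """Average norm of the minimum-norm interpolant beta_hat = pinv(X) @ y."""
    norms = []
    for _ in range(trials):
        X = rng.standard_normal((n, d))
        y = X @ beta + sigma * rng.standard_normal(n)
        norms.append(np.linalg.norm(np.linalg.pinv(X) @ y))
    return float(np.mean(norms))

# n << d: many interpolants exist, and the minimum-norm one stays small.
# n = d: the unique interpolant must fit the noise, so its norm blows up.
assert interpolant_norm(d) > 3 * interpolant_norm(d // 5)
```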
3.1 Excess Risk and Bias-Variance Tradeoffs
For ground-truth parameter $\beta$, the excess risk of an estimator $\hat{\beta}$ is:
$$\overline{R}(\hat{\beta}) := \|\hat{\beta} - \beta\|^2.$$
(For clarity, we consider the excess risk, which omits the unavoidable additive error $\sigma^2$ in the true risk.)
For an estimator $\hat{\beta}_n$ that is derived from $n$ samples $(X, y)$, we consider the expected excess risk of $\hat{\beta}_n$ in expectation over the samples:
$$\mathbb{E}_{X,\varepsilon}\big[\overline{R}(\hat{\beta}_n)\big] = \underbrace{\|\mathbb{E}[\hat{\beta}_n] - \beta\|^2}_{B_n} + \underbrace{\mathbb{E}\|\hat{\beta}_n - \mathbb{E}[\hat{\beta}_n]\|^2}_{V_n},$$
where $B_n$ and $V_n$ are the bias and variance of the estimator on $n$ samples.
For the specific estimator $\hat{\beta} = X^{+}y$ in the regime $n < d$, the bias and variance can be written as (see Appendix A.1):
$$B_n = \|\mathbb{E}_X[\Pi_\perp]\,\beta\|^2 \qquad (3)$$
$$V_n = \underbrace{\mathbb{E}_X\|(\Pi - \mathbb{E}_X[\Pi])\beta\|^2}_{(A)} + \underbrace{\sigma^2\,\mathbb{E}_X\big[\operatorname{Tr}((XX^T)^{-1})\big]}_{(B)} \qquad (4)$$
where $\Pi := X^{+}X$ is the orthogonal projector onto the rowspace of the data $X$, and $\Pi_\perp := I - \Pi$ is the projector onto the orthogonal complement of the rowspace.
From Equation 3, the bias is non-increasing with samples ($B_{n+1} \leq B_n$), since an additional sample can only grow the rowspace: $\operatorname{rowspace}(X_n) \subseteq \operatorname{rowspace}(X_{n+1})$. The variance in Equation 4 has two terms: the first term (A) is due to the randomness of $X$, and is bounded. But the second term (B) is due to the randomness in the noise $\varepsilon$ of $y$, and diverges as $n \to d$, since $X$ becomes poorly conditioned. This trace term is responsible for the peak in test MSE at $n = d$.
We can also approximately compute the bias, variance, and excess risk.
[Overparameterized Risk] Let $\gamma := n/d < 1$ be the underparameterization ratio. The bias and variance are:
$$B_n \approx (1 - \gamma)^2\|\beta\|^2, \qquad V_n \approx \gamma(1-\gamma)\|\beta\|^2 + \sigma^2\,\frac{\gamma}{1-\gamma}.$$
And thus the expected excess risk for $n < d$ is:
$$\mathbb{E}_{X,\varepsilon}\big[\overline{R}(\hat{\beta}_n)\big] \approx (1-\gamma)\|\beta\|^2 + \sigma^2\,\frac{\gamma}{1-\gamma}.$$
These approximations are not exact because they hold asymptotically in the limit of large $d$ (when scaling $n = \gamma d$), but may deviate for finite samples. In particular, the bias and term (A) of the variance can be computed exactly for finite samples: $\Pi$ is simply a projector onto a uniformly random $n$-dimensional subspace, so $\mathbb{E}_X[\Pi] = \frac{n}{d}I$ and $B_n = (1 - \frac{n}{d})^2\|\beta\|^2$, and similarly term (A) is exactly $\frac{n}{d}(1 - \frac{n}{d})\|\beta\|^2$. The trace term (B) is nontrivial to understand for finite samples, but converges to $\sigma^2\frac{\gamma}{1-\gamma}$ in the limit of large $d$: the spectrum of $XX^T$ is understood by the Marchenko–Pastur law [marvcenko1967distribution], and Lemma 3 of [hastie2019surprises] uses this to show that $\mathbb{E}_X[\operatorname{Tr}((XX^T)^{-1})] \to \frac{\gamma}{1-\gamma}$. In Section 3.3, we give intuitions for why the trace term converges to this.
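These finite-sample statements can be checked numerically. The sketch below (ours; the sizes are illustrative) verifies that $\mathbb{E}_X[\Pi] = \frac{n}{d}I$, and that the trace in term (B) matches $\frac{n}{d-n-1}$, the standard inverse-Wishart identity, which tends to $\frac{\gamma}{1-\gamma}$ for large $d$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 40, 20

# E[Pi] = (n/d) I, which gives the exact bias (1 - n/d)^2 ||beta||^2.
Pis = [np.linalg.pinv(X) @ X for X in rng.standard_normal((300, n, d))]
assert np.allclose(np.mean(Pis, axis=0), (n / d) * np.eye(d), atol=0.05)

# Term (B): for Gaussian X, E[Tr((X X^T)^{-1})] = n / (d - n - 1),
# a standard inverse-Wishart identity; here 20/19 ~ 1.05.
traces = [np.trace(np.linalg.inv(X @ X.T))
          for X in rng.standard_normal((2000, n, d))]
exact = n / (d - n - 1)
assert abs(np.mean(traces) - exact) / exact < 0.05
```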
For completeness, the bias, variance, and excess risk in the underparameterized regime are given in [hastie2019surprises, Theorem 1] as: [Underparameterized Risk, [hastie2019surprises]] Let $\gamma := n/d > 1$ be the underparameterization ratio. The bias and variance are:
$$B_n = 0, \qquad V_n \approx \sigma^2\,\frac{1}{\gamma - 1},$$
and thus the expected excess risk is $\mathbb{E}[\overline{R}] \approx \frac{\sigma^2}{\gamma - 1}$.
3.2 Conditioning of the Data Matrix
Here we give intuitions for why the data matrix $X$ is well conditioned for $n$ far from $d$, but has small singular values for $n \approx d$.
3.2.1 Near Criticality
First, let us consider the effect of adding a single sample when $n = d$. For simplicity, assume the first $n - 1 = d - 1$ samples are just the standard basis vectors, scaled appropriately. That is, assume the data matrix is
$$X_0 = \sqrt{d}\,\big[\,I_{d-1} \;\big|\; 0\,\big] \in \mathbb{R}^{(d-1)\times d}.$$
This has all non-zero singular values equal to $\sqrt{d}$. Then, consider adding a new isotropic Gaussian sample $x \sim \mathcal{N}(0, I_d)$. Split this into coordinates as $x = (\hat{x}, x_d)$, with $\hat{x} \in \mathbb{R}^{d-1}$. The new data matrix is
$$X = \begin{bmatrix} X_0 \\ x^T \end{bmatrix}.$$
We claim that $X$ has small singular values. Indeed, consider left-multiplication by $v = (-\hat{x}/\sqrt{d},\, 1) \in \mathbb{R}^d$:
$$v^T X = -\hat{x}^T\,\big[\,I_{d-1} \;\big|\; 0\,\big] + x^T = (0, \ldots, 0, x_d).$$
Thus, $\|v^T X\| = |x_d| = O(1)$, while $\|v\|^2 = 1 + \|\hat{x}\|^2/d \approx 2$. Since $X$ is full-rank, it must have a singular value less than roughly $|x_d| = O(1)$. That is, adding a new sample has shrunk the minimum non-zero singular value of $X$ from $\sqrt{d}$ to less than a constant.
The intuition here is: although the new sample $x$ adds rank to the existing samples, it does so in a very fragile way. Most of the mass of $x$ is contained in the span of existing samples, and $x$ only contains a small component outside of this subspace. This causes $X$ to have small singular values, which in turn causes the ridgeless regression estimator (which applies $X^{+}$) to be sensitive to noise.
A more careful analysis shows that the singular values are actually even smaller than the above simplification suggests, since in the real setting the matrix $X_0$ was already poorly conditioned even before the new sample $x$. In Section 3.3 we calculate the exact effect of adding a single sample on the excess risk.
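The construction above can be checked in a few lines of numpy (our sketch; $d = 400$ is an arbitrary illustrative dimension):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 400
X0 = np.sqrt(d) * np.hstack([np.eye(d - 1), np.zeros((d - 1, 1))])
# All d - 1 non-zero singular values of X0 equal sqrt(d) = 20.
assert np.allclose(np.linalg.svd(X0, compute_uv=False), np.sqrt(d))

x = rng.standard_normal(d)               # one new isotropic Gaussian sample
X = np.vstack([X0, x])

# The witness direction v = (-x_hat / sqrt(d), 1) kills all but the last
# coordinate: v @ X = (0, ..., 0, x_d).
v = np.append(-x[:-1] / np.sqrt(d), 1.0)
assert np.allclose(v @ X, np.append(np.zeros(d - 1), x[-1]))

# Hence sigma_min(X) <= |x_d| / ||v||, an O(1) quantity: one new sample
# shrank the smallest singular value from sqrt(d) = 20 to a constant.
sigma_min = np.linalg.svd(X, compute_uv=False)[-1]
assert sigma_min <= abs(x[-1]) / np.linalg.norm(v) + 1e-9
assert sigma_min < 5.0
```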
3.2.2 Far from Criticality
When $n \ll d$, the data matrix $X$ does not have singular values close to $0$. One way to see this is to notice that since our data model treats features and samples symmetrically, $X$ is well conditioned in the regime $n \ll d$ for the same reason that standard linear regression works in the classical underparameterized regime $n \gg d$ (by “transposing” the setting).
More precisely, since $X$ is full rank, its smallest non-zero singular value can be written as
$$\sigma_{\min}(X) = \min_{v \in \mathbb{R}^n,\ \|v\| = 1} \|X^T v\|.$$
Since $X$ has entries i.i.d. $\mathcal{N}(0,1)$, for every fixed unit vector $v$ we have $\mathbb{E}\|X^T v\|^2 = d$. Moreover, for $n \ll d$ uniform convergence holds, and $\|X^T v\|^2$ concentrates around its expectation for all vectors in the unit ball. Thus:
$$\sigma_{\min}(X) \approx \sqrt{d} \quad \text{with high probability.}$$
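Numerically, this concentration is visible already at moderate sizes. A quick sketch (ours; the sizes are illustrative) checks that all singular values of an $n \times d$ Gaussian matrix with $n \ll d$ lie near $\sqrt{d}$, in fact within roughly $[\sqrt{d} - \sqrt{n},\ \sqrt{d} + \sqrt{n}]$ (a standard non-asymptotic bound):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 1000
for n in [10, 50, 100]:                  # n << d: far below criticality
    X = rng.standard_normal((n, d))
    s = np.linalg.svd(X, compute_uv=False)
    # Standard non-asymptotic bound: the singular values lie in
    # [sqrt(d) - sqrt(n), sqrt(d) + sqrt(n)] with high probability.
    assert s[-1] > np.sqrt(d) - 2 * np.sqrt(n)   # sigma_min stays ~ sqrt(d)
    assert s[0] < np.sqrt(d) + 2 * np.sqrt(n)
```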
3.3 Effect of Adding a Single Sample
Here we show how the trace term of the variance in Equation 4 changes with increasing samples. Specifically, the following claim shows how $\operatorname{Tr}((XX^T)^{-1})$ grows when we add a new sample to $X$.
Let $X_n \in \mathbb{R}^{n \times d}$ be the data matrix after $n$ samples, and let $x \in \mathbb{R}^d$ be the $(n+1)$-th sample. The new data matrix is $X_{n+1} = \begin{bmatrix} X_n \\ x^T \end{bmatrix}$, and
$$\operatorname{Tr}\big((X_{n+1}X_{n+1}^T)^{-1}\big) = \operatorname{Tr}\big((X_n X_n^T)^{-1}\big) + \frac{1 + \|(X_n^{+})^T x\|^2}{\|\Pi_\perp x\|^2}.$$
By computation in Appendix A.2. ∎
If we heuristically assume the denominator concentrates around its expectation, $\mathbb{E}_x\|\Pi_\perp x\|^2 = d - n$, then we can use Claim 3.3 to estimate the expected effect of a single sample:
$$\mathbb{E}[T_{n+1}] \approx \mathbb{E}[T_n] + \frac{1 + \mathbb{E}[T_n]}{d - n}, \qquad \text{where } T_n := \operatorname{Tr}\big((X_n X_n^T)^{-1}\big),$$
using that $\mathbb{E}_x\|(X_n^{+})^T x\|^2 = \operatorname{Tr}((X_n X_n^T)^{-1})$.
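Claim 3.3 and the heuristic for the denominator are straightforward to verify numerically (our sketch; $d = 60$, $n = 30$ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n = 60, 30
Xn = rng.standard_normal((n, d))
x = rng.standard_normal(d)               # the (n+1)-th sample
Xn1 = np.vstack([Xn, x])

# Claim 3.3: exact rank-one update of the trace term.
pinv = np.linalg.pinv(Xn)
Pi_perp = np.eye(d) - pinv @ Xn          # projector onto complement of rowspace
lhs = np.trace(np.linalg.inv(Xn1 @ Xn1.T))
rhs = (np.trace(np.linalg.inv(Xn @ Xn.T))
       + (1 + np.sum((pinv.T @ x) ** 2)) / np.sum((Pi_perp @ x) ** 2))
assert np.isclose(lhs, rhs)

# The heuristic denominator: E ||Pi_perp x||^2 = Tr(Pi_perp) = d - n = 30.
samples = [np.sum((Pi_perp @ rng.standard_normal(d)) ** 2) for _ in range(4000)]
assert abs(np.mean(samples) - (d - n)) < 1.0
```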
4 Discussion

We hope that understanding such simple settings can eventually lead to understanding the general behavior of overparameterized models in machine learning. We consider it extremely unsatisfying that the most popular technique in modern machine learning (training an overparameterized neural network with SGD) can be nonmonotonic in samples [deep]. We hope that a greater understanding here could help develop learning algorithms which make the best use of data (and in particular, are monotonic in samples).
In general, we believe it is interesting to understand when and why learning algorithms are monotonic – especially when we don’t explicitly enforce them to be.
We especially thank Jacob Steinhardt and Aditi Raghunathan for discussions and suggestions that motivated this work. We thank Jarosław Błasiok, Jonathan Shi, and Boaz Barak for useful discussions throughout this work, and we thank Gal Kaplun and Benjamin L. Edelman for feedback on an early draft.
This work was supported in part by NSF awards CCF 1565264, CNS 1618026, and CCF 1715187, a Simons Investigator Fellowship, and a Simons Investigator Award.
Appendix A Appendix: Computations
A.1 Bias and Variance
The computations in this section are standard.
Assume the data distribution and problem setting from Section 2.
For $n$ samples $(X, y)$ with $y = X\beta + \varepsilon$, the estimator is:
$$\hat{\beta} = X^{+}y = X^{+}X\beta + X^{+}\varepsilon.$$
For $n < d$, the bias and variance of the estimator are
$$B_n = \|\mathbb{E}_X[\Pi_\perp]\,\beta\|^2, \qquad V_n = \mathbb{E}_X\|(\Pi - \mathbb{E}_X[\Pi])\beta\|^2 + \sigma^2\,\mathbb{E}_X\big[\operatorname{Tr}((XX^T)^{-1})\big].$$
Bias. Note that
$$\mathbb{E}_\varepsilon[\hat{\beta}] = X^{+}X\beta.$$
Thus the bias is
$$B_n = \|\mathbb{E}_X[X^{+}X]\,\beta - \beta\|^2.$$
Notice that $X^{+}X$ is projection onto the rowspace of $X$, i.e. $X^{+}X = \Pi$. Thus,
$$B_n = \|(\mathbb{E}_X[\Pi] - I)\beta\|^2 = \|\mathbb{E}_X[\Pi_\perp]\,\beta\|^2.$$
A.2 Trace Computations
Proof of Claim 3.3.
Let $X_n \in \mathbb{R}^{n \times d}$ be the data matrix after $n$ samples, and let $x \in \mathbb{R}^d$ be the $(n+1)$-th sample. The new data matrix is $X_{n+1} = \begin{bmatrix} X_n \\ x^T \end{bmatrix}$, and
$$X_{n+1}X_{n+1}^T = \begin{bmatrix} X_n X_n^T & X_n x \\ x^T X_n^T & \|x\|^2 \end{bmatrix} =: \begin{bmatrix} A & b \\ b^T & c \end{bmatrix}.$$
Now by Schur complements, with $s := c - b^T A^{-1} b = \|x\|^2 - x^T \Pi x = \|\Pi_\perp x\|^2$:
$$(X_{n+1}X_{n+1}^T)^{-1} = \begin{bmatrix} A^{-1} + \frac{1}{s}A^{-1}bb^T A^{-1} & -\frac{1}{s}A^{-1}b \\ -\frac{1}{s}b^T A^{-1} & \frac{1}{s} \end{bmatrix}.$$
Finally, we have
$$\operatorname{Tr}\big((X_{n+1}X_{n+1}^T)^{-1}\big) = \operatorname{Tr}(A^{-1}) + \frac{1 + b^T A^{-2} b}{s} = \operatorname{Tr}\big((X_n X_n^T)^{-1}\big) + \frac{1 + \|(X_n^{+})^T x\|^2}{\|\Pi_\perp x\|^2}. \qquad \blacksquare$$