Deep learning methodology has revealed a surprising statistical phenomenon: overfitting can perform well. The classical perspective in statistical learning theory is that there should be a tradeoff between the fit to the training data and the complexity of the prediction rule. Whether complexity is measured in terms of the number of parameters, the number of non-zero parameters in a high-dimensional setting, the number of neighbors averaged in a nearest-neighbor estimator, the scale of an estimate in a reproducing kernel Hilbert space, or the bandwidth of a kernel smoother, this tradeoff has been ubiquitous in statistical learning theory. Deep learning seems to operate outside the regime where results of this kind are informative, since deep neural networks can perform well even with a perfect fit to the training data.
As one example of this phenomenon, consider the experiment illustrated in Figure 1(c) in : standard deep network architectures and stochastic gradient algorithms, run until they perfectly fit a standard image classification training set, give respectable prediction performance, even when significant levels of label noise are introduced. That paper reported zero loss for training data classifications. However, the deep networks in the experiments reported in  also achieved essentially zero cross-entropy loss on the training data.111Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals, personal communication, January 2017.
In statistics and machine learning textbooks, an estimate that fits every training example perfectly is often presented as an illustration of overfitting (“… interpolating fits… [are] unlikely to predict future data well at all.”[14, p37]). Thus, to arrive at a scientific understanding of the success of deep learning methods, it is a central challenge to understand the performance of prediction rules that interpolate the training data.
In this paper, we consider perhaps the simplest setting where we might hope to witness this phenomenon: linear regression with gaussian data. That is, we assume that the covariates and responses have a gaussian distribution, the loss is quadratic, the prediction rules are linear, and the dimension of the parameter space is large enough that a perfect fit is guaranteed. We consider data in an infinite dimensional space (a separable Hilbert space), but our results apply to a finite-dimensional subspace as a special case. There is an ideal value of the parameters,, the least squares parameters. We ask when it is possible to fit the data exactly and still compete with the prediction accuracy of
. Since we require more parameters than the sample size in order to fit exactly, the solution might be underdetermined, so there might be many interpolating solutions. We consider the most natural: choose the parameter vectorwith the smallest norm among all vectors that give perfect predictions on the training sample. (This corresponds to using the pseudoinverse to solve the normal equations; see Section 2.) We ask when it is possible to overfit in this way—and embed all of the noise of the labels into the parameter estimate —without harming prediction accuracy.
Our main result is a finite sample characterization of when overfitting is benign in this setting. The gaussian linear regression problem depends on the least squares parameters and the covariance of the covariates . The properties of
turn out to be crucial, since the magnitude of the variance in different directions determines both how the label noise gets distributed across the parameter space and how errors in parameter estimation in different directions in parameter space affect prediction accuracy. There is a classical decomposition of the excess prediction error into a bias term and a variance term. The bias part is rather standard: provided that the scale of the problem (that is, the sum of the eigenvalues of) is small compared to the sample size , the contribution to that we can view as coming from is not too distorted. The variance part is more interesting, since it reflects the impact of the noise in the labels on prediction accuracy. We show that this part is small if and only if the effective rank of in the subspace corresponding to low variance directions is large compared to . This necessary and sufficient condition of a large effective rank can be viewed as a property of significant overparameterization: fitting the training data exactly but with near-optimal prediction accuracy occurs if and only if there are many low variance (and hence unimportant) directions in parameter space where the label noise can be hidden.
The details are more complicated. The characterization depends in a specific way on two notions of effective rank, and ; the smaller one, , determines a split of into large and small eigenvalues, and the excess prediction error depends on the effective rank, as measured by the larger notion , of the subspace corresponding to the smallest eigenvalues. For the excess prediction error to be small, the smallest eigenvalues of must decay slowly.
The phenomenon of interpolating prediction rules has been an object of study by several authors over the last two years, since it emerged as an intriguing mystery at the Simons Institute program on Foundations of Machine Learning in Spring 2017. Belkin, Ma and Mandal  described an experimental study demonstrating that this phenomenon of accurate prediction for functions that interpolate noisy data also occurs for prediction rules chosen from reproducing kernel Hilbert spaces, and explained the mismatch between this phenomenon and classical generalization bounds. Belkin, Hsu and Mitra  gave an example of an interpolating nearest neighbor decision rule with an asymptotic consistency property as the input dimension gets large. Belkin, Rakhlin, and Tsybakov  showed that kernel smoothing methods based on singular kernels both interpolate and, with suitable bandwidth choice, give optimal rates for nonparametric estimation (building on earlier consistency results  for these unusual kernels). Liang and Rakhlin  considered minimum norm interpolating kernel regression with kernels defined as nonlinear functions of the Euclidean inner product, and showed that, with certain properties of the training sample (expressed in terms of the empirical kernel matrix), these methods can have good prediction accuracy. Belkin, Hsu, Ma and Mandal  studied experimentally the excess risk as a function of the dimension of a sequence of parameter spaces for linear and non-linear classes.
Subsequent to our work,  considered the properties of the interpolating linear prediction rule with minimal expected squared error. After this work was presented at the NAS Colloquium on the Science of Deep Learning , we became aware of the concurrent work of Belkin, Hsu and Xu  and of Hastie, Montanari, Rosset and Tibshirani . Belkin et al  calculated the excess risk for certain linear models (a regression problem with identity covariance, sparse least squares parameters, both with and without noise, and a problem with random Fourier features with no noise), and Hastie et al consider linear regression in an asymptotic regime, where sample size and input dimension go to infinity together with asymptotic ratio
. Rather than exploiting independence in the components of a gaussian distribution, they assume that the covariate vectors are linear transformations of vectors withi.i.d. entries. They also assume that, as gets large, the empirical spectral distribution of
(the discrete measure on its set of eigenvalues) converges to a fixed measure. They apply random matrix theory to explore the range of behaviors of the asymptotics of the excess prediction error as, the noise variance, and the eigenvalue distribution vary. They also study the asymptotics of a model involving random nonlinear features. In contrast, we assume that the data is gaussian, and we give upper and lower bounds on the excess prediction error for data of arbitrary dimension, for arbitrary finite sample size, and for arbitrary covariance matrices.
The next section introduces notation and definitions used throughout the paper, including definitions of the problem of linear regression in a gaussian setting and of various notions of effective rank of the covariance operator. Section 3 gives the characterization of benign overfitting, and illustrates why the effective rank condition corresponds to significant overparameterization, and presents several examples of patterns of eigenvalues that allow benign overfitting. Section 4 gives the proofs of these results.
2 Definitions and Notation
We consider linear regression problems, where a linear function of covariates from a (potentially infinite dimensional) Hilbert space
is used to predict a real-valued response variable. We use vector notation, so that denotes the inner product between and and
denotes the tensor product of.
Definition 1 (Linear regression).
A linear regression problem in a separable Hilbert space is defined by a random covariate vector and outcome . We assume are jointly gaussian and mean zero, and define
the covariance operator ,
the least squares parameter vector , satisfying , and
the noise variance, .
Given a training sample of i.i.d. pairs with the same distribution as , an estimator returns a parameter estimate . The excess risk of the estimator is defined as
where denotes the conditional expectation given all random quantities other than (in this case, given the estimate ). Define the vectors with entries and with entries . We use infinite matrix notation: denotes the linear map from to corresponding to , so that has th component . We use similar notation for the linear map from to .
The assumption that is gaussian simplifies the statement of the main theorem, but the proof requires only that the conditional distribution of given is subgaussian, independent of , and that the conditional expectation is linear in , all of which follow from the joint gaussianity assumption.
We shall be concerned with situations where an estimator can fit the data perfectly, that is,
Typically this implies that there are many such vectors. We consider the interpolating estimator with minimal norm in . We use to denote both the Euclidean norm of a vector in and the Hilbert space norm.
Definition 2 (Minimum norm estimator).
Given data and , the minimum norm estimator solves the optimization problem
By the projection theorem, parameter vectors that solve the least squares problem solve the normal equations, so we can equivalently write as the minimum norm solution to the normal equations,
where denotes the pseudoinverse of the bounded linear operator (for infinite dimensional , the existence of the pseudoinverse is guaranteed because is bounded and has a closed range; see ). When is finite dimensional with and has rank , there is a unique solution to the normal equations. We shall assume for the remainder of the paper that , in which case the normal equations can have many solutions. Since is gaussian with a positive definite covariance, almost surely has full rank, and so we can find a solution that achieves . The minimum norm solution is given by
Our main result gives tight bounds on the excess risk of the minimum norm estimator in terms of certain notions of effective rank of the covariance that are defined in terms of its eigenvalues.
We use to denote the eigenvalues of in descending order, and we denote the operator norm of by . We use to denote the identity operator on and to denote the identity matrix.
Definition 3 (Effective Ranks).
For the covariance operator , define for . If and for , define
3 Main Result
The following theorem establishes nearly matching upper and lower bounds for the risk of the minimum-norm interpolating estimator.
There are universal constants for which the following holds. Consider a linear regression problem with least squares parameter vector , covariance operator , noise variance and sample size . Suppose that , so that the minimum norm estimator satisfies almost surely. Define
where the minimum of the empty set is defined as .
Suppose with .
Then with probability at least
. Then with probability at least,
3.1 Effective Ranks and Overparameterization
In order to understand the implications of Theorem 4, we now study relationships between the two notions of effective rank, and , and establish sufficient and necessary conditions for the sequence of eigenvalues to lead to small excess risk.
The following lemma shows that the two notions of effective rank are closely related. See Appendix A.6 for its proof, and for other properties of and .
, , and
Notice that . More generally, if all the non-zero eigenvalues of are identical, then . For with finite rank, we can express both and as a product of the rank and a notion of symmetry. In particular, for we can write
Both notions of symmetry and lie between (when ) and (when the are all equal).
Theorem 4 shows that, for the minimum norm estimator to have near-optimal prediction accuracy, should be small compared to the sample size (from the first term) and and should be large compared to . Together, these conditions imply that overparameterization is essential for benign overfitting in this setting: the number of non-zero eigenvalues should be large compared to , they should have a small sum compared to , and there should be many eigenvalues no larger than . If the number of these small eigenvalues is not much larger than , then they should be roughly equal, but they can be more assymmetric if there are many more of them.
The following theorem shows that the kind of overparameterization that is essential for benign overfitting requires to have a heavy tail. (The proof is in Appendix A.7.) In particular, if we fix in an infinite-dimensional Hilbert space and ask when does the excess risk of the minimum norm estimator approach zero as , it imposes tight restrictions on the eigenvalues of . But there are many other possibilities for these asymptotics if can change with .
We say that a sequence of covariance operators is benign if
where for the universal constant in Theorem 4.
Define for all .
If , then is benign iff and .
If , then is benign iff . Furthermore,
then is benign iff either , and or , and .
then is benign iff and . Furthermore, for and ,
Many popular learning algorithms may be viewed as the combination of a feature transformation with a linear method applied to the transformed features. In practice, hyperparameter choices such as kernel parameters and deep network architectures can be made with the knowledge of, and of course these affect the spectrum of the covariance of the transformed features.
It is informative to compare the situations described by Parts 1 and 4 of Theorem 6. Part 1 shows that for infinite-dimensional data with a fixed covariance, benign overfitting occurs iff the eigenvalues of the covariance operator decay just slowly enough for their sum to remain finite. However, Part 4 shows that the situation is very different if the data has finite dimension and a small amount of isotropic noise is added to the covariates. In that case, even if the eigenvalues of the original covariance operator (before the addition of isotropic noise) decay very rapidly, benign overfitting occurs iff both the dimension is large compared to the sample size, and the isotropic component of the covariance is sufficiently small—but not exponentially small—compared to the sample size.
How relevant is Theorem 4 to the phenomenon of benign overfitting in deep neural networks? Certain very wide neural networks trained with suitable random initialization and gradient descent can be accurately approximated by linear functions in a certain randomly chosen Hilbert space, and in this case gradient descent finds an interpolating solution quickly; see [17, 11, 10, 1]. (Note that these papers do not consider prediction accuracy, except when there is no noise; for example, [17, Assumption A1] implies that the network can compute a suitable real-valued response exactly, and the data-dependent bound of [1, Theorem 5.1] becomes vacuous when independent noise is added to the s.) The eigenvalues of the covariance operator in this case appear to have a heavy tail (see , where this kernel was introduced, and ), as required for benign overfitting. Of course, the assumptions of Theorem 4 do not apply. Even so, the assumption that the network width is large compared to the sample size is somewhat strange, so extending Theorem 4 to more realistic deep network architectures seems to require significant progress beyond the linear case.
Throughout the proofs, we use the symbols , , to refer to universal constants whose values are suitably large (and always at least ) but do not depend on any parameters of the problems we consider.
The first step is a standard decomposition of the excess risk into two pieces, one that corresponds to the distortion that is introduced by viewing through the lens of the finite sample and one that corresponds to the distortion introduced by the noise . The impact of both sources of error in on the excess risk is modulated by the covariance , which gives different weight to different directions in parameter space.
The excess risk of the minimum norm estimator satisfies
Since is independent of and has mean zero,
Using (1), the definition of , and the fact that ,
Also, since has zero mean and is independent of and , we have
We can control the term using a standard argument. The proof of the following lemma is in Appendix A.1.
There is a universal constant such that for any , with probability at least ,
The following lemma shows that we can obtain a high-probability upper bound on the term in terms of the trace of . It is Lemma 36 in .
Consider random variables
Consider random variables, conditionally independent given and conditionally -subgaussian, that is, for all ,
Suppose that, given , is a.s. positive semidefinite. Then a.s. on , with conditional probability at least ,
Since and , with probability at least ,
Combining this with Lemma 7, both upper and lower bounds follow from suitable bounds on .
The gaussian distribution of the covariates allows the trace of to be expressed as a function of many standard normal vectors.
Consider a covariance operator with
and . Write its spectral
where the orthonormal are the eigenvectors
corresponding to the
are the eigenvectors corresponding to the. Define . Then we can write
and these are independent with distribution . Furthermore, for any ,
Each has distribution , and they are uncorrelated, hence independent. Without loss of generality, we can assume that the rows of have coordinates corresponding to the , that is, we can think of as diagonal. We can write
by Lemma 19, for the case and . ∎
The weighted sum of outer products of these gaussian vectors plays a central role in the rest of the proof. Define
where the are independent with distribution . Note that the vector is independent of the matrix , so the last part of Lemma 10 allows us to write in terms of ratios of quantities involving only independent vectors.
The next step is to show that eigenvalues of , and are concentrated. The proof of the following inequality is in Appendix A.3. Recall that and denote the largest and the smallest eigenvalues of the matrix .
With probability at least ,
Hence, there is a universal constant such that with probability at least ,
The following lemma uses this result to give bounds on the eigenvalues of , which in turn give bounds on some eigenvalues of and . For these upper and lower bounds to match up to a constant factor, the sum of the eigenvalues of should dominate the term involving its leading eigenvalue, which is a condition on the effective rank .
There are universal constants such that for any , with probability at least ,
for all ,
for all ,
if , then
By Lemma 11, we know that with probability at least ,
First, the matrix has rank at most (as a sum of matrices of rank ). Thus, there is a linear space of dimension such that for all , , and so .
Second, by the Courant-Fischer-Weyl Theorem, for all and , (see Lemma 25). On the other hand, for , , so all the eigenvalues of are lower bounded by .
Finally, if ,
Choosing and sufficiently large gives the third claim of the lemma. ∎
Upper bound on the trace term
There are universal constants such that if , , then with probability at least ,
The proof uses the following lemma. Its proof is in Appendix A.3.
Lemma 14 (Concentration of a weighted sum of ).
Suppose are i.i.d. random variables distributed as . Then for any and any non-negative non-increasing sequence such that , with probability at least ,
First, consider the sum up to . If , Lemma 12 shows that with probability at least , for all , , and, for all , . The lower bounds on the ’s imply that, for all and ,
and the upper bounds on the ’s give
where is the orthogonal projection on , the span of the eigenvectors of corresponding to its smallest eigenvalues, using the upper bound on the -th largest eigenvalue of . So for ,
Next, we apply Lemma 14 times, together with a union bound, to show that with probability at least , for all ,
Notice that is a weighted sum of random variables, with the weights given by the in blocks of size . Lemma 14 implies that, with probability at least ,
where the third inequality follows from the arithmetic mean-geometric mean inequality. Combining the above gives