Deep Neural Networks have shown amazing versatility across a large range of domains. Among one of their main features is their ability to perform better with scale. Indeed, some of the most impressive results [see e.g. brock2021high; brown2020language; senior2020improved; schrittwieser2020mastering; silver2017mastering; he2016deep and references therein] have been obtained often by exploiting this fact, leading to models that have at least as many parameters as the number of examples in the dataset they are trained on. Empirically, the limitation on the model size seems to be mostly imposed by hardware or compute. From a theoretical point of view, however, this property is quite surprising and counter-intuitive, as one would expect that in such extremely overparametrized regimes the learning would be prone to overfitting (hastie2009elements; shalev2014understanding).
) phenomena as an explanation. They argue that the classical view of overfitting does not apply in extremely over-parameterized regimes, which were less studied prior to the emergence of the deep learning era. The classical view in the parametric learning models was based on error curves showing that the training error decreases monotonically when plotted against model size, while the corresponding test errors displayed a U-shape curve, where the model size for the bottom of the U-shape was taken to achieve the ideal trade-off between model size and generalization, and larger model sizes than that were thought to lead to ‘overfitting’ since the gap between test errors and training errors increased.
The classical U-shape error curve dwells in what is now called the under-parameterized regime, where the model size is smaller than the size of the dataset. Arguably, the restricted model sizes used in the past were tied to the available computing power. By contrast, it is common nowadays for model sizes to be larger than the amount of available data, which we call the over-parameterized regime. The divide between these two regimes is marked by a point where model size matches dataset size, which belkin2019reconciling called the interpolation threshold.
The work of belkin2019reconciling argues that as model size grows beyond the interpolation threshold, one will observe a second descent of the test error that asymptotes in the limit to smaller values than those in the underparameterized regime, which indicates better generalization rather than overfitting. To some extent this was already known in the nonparametric learning where model complexity scales with the amount of data by design (such as in nearest neighbor rules and kernels), yet one can generalize well and even achieve statistical consistency (gyorfi2002distribution). This has lead to a growing body of works trying to identify the mechanisms behind DD, to which the current manuscript belongs too. We refer the reader to Section 2, where the related literature is discussed. Similar to these works, our goal is also to understand the cause of DD. Our approach is slightly different: we explore the least squares problem that allows us to work with analytic expressions for all the quantities involved. Fig. 1 provides a summary of our findings. In particular, it shows the behaviour of the excess risk in a setting with random inputs and noise-free labels, for which in Section 3 we prove a bound that has the form , for a rapidly decaying spectrum of the sample covariance. In this setting, the linear predictors project
-dimensional features by dot product with a weight vector which must be learned from data; thenrefers to the optimal solution, is a constant learning rate, and is the number of examples in the training set. Note that the feature dimension coincides with the number of parameters in this particular setting, hence is the overparameterized regime. The quantity is of special importance: It is the smallest positive eigenvalue of the sample covariance matrix of the features. In particular, we observe that the excess risk is controlled by the smallest non-zero eigenvalue of the covariance of the features, and its functional dependence exhibits a profile similar to the DD curve. This offers a new perspective on the problem.
In Fig. 1 we observe a peaking behavior, not only in the excess risk, but also in the quantity that we label ‘optimization error’ which is a special term of the excess risk bound that is purely related to optimization. The peaking behaviour of the excess risk (MSE in case of the square loss) was observed and studied in a number of settings (belkin2019reconciling; mei2019generalization; derezinski2020exact); however, the connection between the peaking behavior and optimization so far received less attention. This pinpoints a less-studied setting and we conjecture that the DD phenomenon occurs due to . In the absence of label noise, we conclude that DD manifests due to the optimization process. On the other hand, when label noise is present, in addition to the optimization effect, also has an effect on the generalization error.
Our main theoretical contribution is provided in Section 3. In particular, Section 3.1 focuses on the noise-free least squares problem, Section 3.2 adds noise to the problem, and Section 3.3 deals with concentration of the sample-dependent around its population counterpart. Sections 4 and 5 provide an in-depth discussion on the implications of our findings and an empirical exploration of the question whether simple neural networks have a similar behaviour.
The linear algebra/analysis notation used in this work is defined in Appendix A. We briefly mention here that we denote column vectors and matrices with small and capital bold letters, respectively, e.g. and
. Singular values of a rectangular matrixare denoted by . The rank of is . Eigenvalues of a Positive Semi-Definite (PSD) matrix are non-negative and are denoted , while the smallest non-zero eigenvalue is denoted .
Next, we set the learning theory notation. In a parametric statistical learning problem the learner is given a training set , which is an -tuple consisting of independent random elements, called training examples, distributed according to some unknown distribution , where is called the example space. The learner’s goal is to select parameter from some parameter space so as to minimize the population loss , where is some given loss function. A learner following the Empirical Risk Minimization (ERM) principle selects a with the smallest empirical loss over the training set. In this report we consider a Euclidean parameter space: .
We consider a least squares regression problem. In this setting, each example is an instance-label pair: . We assume that inputs are from the Euclidean ball of unit radius , and labels are in the unit interval . For a suitably chosen parameter vector , the noiseless regression model is and the model with label noise is where . The loss function is the square loss: .
2 Related Work
The literature on the DD
of the test error has mainly focused on the ordinary least squares with the explicit solution given by the Moore-Penrose pseudo-inverse. Early works have focused on instance-specific settings (making distributional assumptions on the inputs) while arguing when the analytic pseudo-inverse solutions yield DD behaviour(belkin2019two). This was later extended to a more general setting showcasing the control of DD by the spectrum of the feature matrix (derezinski2020exact). In this paper we also argue that the spectrum of the covariance matrix has a critical role in DD, however we take into account the effect of GD optimization, which was missed by virtually all the previous literature due to their focusing on analytic solutions. The effect of the smallest non-zero eigenvalue on DD, through a condition number, was briefly noticed by rangamani2020interpolating. In this work we carry out a more comprehensive analysis and show how the excess risk of GD is controlled the smallest eigenvalue. In particular,
has a “U”-shaped behaviour as the number of features increases, and we give a high-probability characterization of this behavior when inputs are subgaussian. To some extent, this is a non-asymptotic manifestation of the Bai-Yin law, whose connection toDD in an asymptotic setting was noted by hastie2019surprises.
Some interest was also dedicated to the effect of bias and variance of DD (mei2019generalization) in the same pseudo-inverse setting, while more involved fine-grained analysis was later carried out by adlam2020understanding. In this work we focus on the influence of the optimization error, which is complementary to the bias-variance effects (typically we care about it once optimization error is negligible).
DD behaviour was also observed beyond least squares, in neural networks and other interpolating models (belkin2019reconciling). To some extent a formal connection to neural networks was first made by mei2019generalization who studied asymptotic behaviour of the risk under the random feature model, when while having and fixed. Later on, with popularity of Neural Tangent Kernel (NTK) the connection became clearer as within NTK interpretation shallow neural networks can be paralleled with kernelized predictors (bartlett2021deep). A detailed experimental study of DD in deep neural networks was carried out by (nakkiran2019deep), who showed that various forms of regularization mitigate DD. In this work, we explain DD in least-squares solution obtained by GD through the spectrum of the features, where optimization error has a visible role. While we do not present formal results for neural networks, but we empirically investigate whether our conclusions extend to shallow neural nets as would be suggested by NTK theory.
3 Excess Risk of the Gradient Descent Solution
We focus on learners that optimize parameters via the Gradient Descent algorithm. We treat GD as a measurable map , where is the space of size- training sets. Given a training set and an initialization point , we write to indicate the output obtained recursively by the standard GD update rule with some fixed step size , i.e. , where
We look at the behavior of GD in the overparameterized regime () when the initialization parameters are sampled from an isotropic Gaussian density, that is with some initialization variance . It is well-known that in the overparameterized regime, GD is able to achieve zero empirical loss. Therefore, rather than focusing on the generalization gap it is natural to compare the loss of to that of the best possible predictor. Thus, we consider the excess risk defined as
Our results are based on a the requirement that satisfies the following regularity condition:
A map is called -admissible, where is a fixed PSD matrix and , if for all the following holds:
Notice that the norm on the left-hand side is , while that on the right-hand side is the standard Euclidean norm. Also note that this inequality entails a Lipschitz condition with Lipschitz factor .
Our first main result gives an upper bound on the excess risk of GD output, assuming that the output of is of low-rank, in the sense that for some low-rank orthogonal projection we assume that almost surely (a.s.) with respect to , for any initialization . This condition is of interest in the overparameterized regime, where the learning dynamics effectively happens in a subspace which is arguably of much smaller dimension than the whole parameter space. The following theorem bounds the excess risk (with respect to a possibly non-convex but smooth loss) of any algorithm that satisfies Definition 1 with some . Later it will become apparent that in a particular learning problem this pair consists of data-dependent quantities. Importantly, the theorem demonstrates how the excess risk is controlled by the learning dynamics on the subspace spanned by (the first and the second terms on the right hand side). It also shows how much is lost due to not learning on the complementary subspace (the third term). The first two terms will become crucial in our analysis of the double descent, while we will show that the last term will vanish as .
Theorem 1 (Excess Risk).
The proof is in Appendix C. The main steps are using the -smoothness of to upper-bound in terms of the squared norm of and decomposing the latter as the sum of the squared norms of its projections onto the space spanned by and its orthogonal complement, by the Pythagorean theorem. Then is used on the subspace spanned by : the norm of is controlled by using the admissibility of and Gaussian integration, and the norm of is controlled by the accumulated squared norms of gradients of over steps of gradient descent, which is conveniently bounded by when due to the -smoothness of .
We will rely on Theorem 1 for our analysis of the Least-Squares problem as follows.
3.1 Least-Squares with Random Design and No Label Noise
Consider a noise-free linear regression model with random design:
where instances are distributed according to some unknown distribution supported on a -dimensional unit Euclidean ball. After observing a training sample , we run GD on the given empirical square loss
In the setting of our interest, the sample covariance matrix might be degenerate, and therefore we will occasionally refer to the non-degenerate subspace , where is given by the Singular Value Decomposition (SVD): and
are the eigenvectors corresponding to the eigenvalues, where , arranged in decreasing order:
and . We write for the minimal non-zero eigenvalue, and we denote . Note that . Now we state our main result in this setting.
Assume that . Then, for any and any , with probability over random samples we have
The proof is in Appendix C. This is a consequence of Theorem 1, modulo showing that GD with the least squares objective is -admissible with , and upper-bounding by controlling the expected squared norm of the projection onto the orthogonal complement of the space spanned by . The later comes up in the analysis of PCA (see e.g. shawe2005eigenspectrum) and, as we show in Appendix E, this term is expected to be small enough whenever the eigenvalues have exponential decay, in which case with high probability we have as . Note that the middle term in the upper bound of our Theorem 1 vanishes in the noise-free case: .
Looking at Theorem 2, we can see that the excess risk is bounded by the sum of two terms. Note that the second term is negligible in many cases (consider the limit of infinite data) and additionally it is a term that remains constant during training as it does not depend on training data. Therefore, we are particularly interested in the first term of the bound, which is data-dependent. This term depends on via a functional form that has a double descent behaviour if plotted against for fixed . Before going into that analysis, let us also consider the scenario with label noise.
3.2 Least-Squares with Random Design and Label Noise
Now, in addition to the random design we introduce label noise into our model:
where we have random noise such that and , independent of the instances.
Assume that . Then, for any and any , with probability over random samples we have
The proof is in Appendix C. Again, this follows from Theorem 1, by the same steps used in the proof of Theorem 2, except that the term is now handled by conditioning on the sample and analyzing the expectation with respect to the random noise (Lemma 4 and its proof in Section C.2), leading to the new term
. The latter closely resembles the term one would get for ridge regression(shalev2014understanding, Cor. 13.7) due to algorithmic stability (bousquet2002stability), but here we have a dependence on the smallest non-zero eigenvalue instead of a regularization parameter.
3.3 Concentration of the Smallest Non-zero Eigenvalue
In this section we take a look at the behaviour of assuming that input instances are i.i.d. random vectors, sampled from some underlying marginal density that meets some regularity requirements (Definitions 2 and 3
below) so that we may use the results from random matrix theory(vershynin2010). Recall that the covariance matrix of the input features is . We focus on the concentration of around its population counterpart , where is the population covariance matrix: .
In particular, the Bai-Yin limit characterization of the extreme eigenvalues of sample covariance matrices (bai1993limit) implies that has almost surely an asymptotic behavior as the dimensions grow to infinity, assuming that the matrix has independent entries. We are interested in the non-asymptotic version of this result. However, unlike bai1993limit, we do not assume independence of all entries, but rather independence of observation vectors (columns of ). This will be done by introducing a distributional assumption: we assume that observations are sub-Gaussian and isotropic random vectors.
Definition 2 (Sub-Gaussian random vectors).
A random vector is sub-Gaussian if the random variables
is sub-Gaussian if the random variablesare sub-Gaussian for all . The sub-Gaussian norm of a random vector is defined as
Definition 3 (Isotropic random vectors).
A random vector is called isotropic if its covariance is the identity: . Equivalently, is isotropic if for all .
Let be the Moore-Penrose pseudoinverse of . In Appendix D we prove the following.111
Lemma 1 (Smallest non-zero eigenvalue of sample covariance matrix).
Let be a matrix with i.i.d. columns, such that , and let , and . Then, for every , with probability at least , we have
and furthermore, assuming that a.s. for all , we have
where we have an absolute constant .
Lemma 1 is a non-asymptotic result that allows us to understand the behaviour of , and hence the behaviour of the excess risk that depends on this quantity, for fixed dimensions. We will exploit this fact in the following section in which we discuss the implications of our findings.
4 Excess risk as a function of over-parameterization
First we note that, in the noise-free case, the middle term in the upper bound of Theorem 1 vanishes: . Thus, as in Theorem 2, the upper bound consists only of the term involving the smallest positive eigenvalue and the term involving . The behaviour of the former was clarified in Section 3.3, and the latter is controlled as explained in Appendix E. Thus, in the overparametrized regime () we have: 222We use when there exists a universal constant such that uniformly over all arguments.
A similar bound holds in the underparameterized case () but replacing the term with . Note that the term multiplying the learning rate is , in accordance with the Bai-Yin limit which says that asymptotically . It is interesting to see how varies with model size for a given fixed dataset size and fixed number of gradient updates . Setting and considering the cases (underparameterized regime), (the peak), and (overparameterized regime) it becomes evident that this term has a double descent behaviour. Thus, the double descent is captured in the part of the excess risk bound that corresponds to learning dynamics on the space spanned by .
Similarly, we can now consider the scenario with label noise: we can similarly bound the excess risk, following the same logic as for noise-free case; however we have an additional dependence on via the term . While this does not interfere with the DD shape as we change model size, it does imply that the peak is dependent on the amount of noise. In particular, the more noise we have in the learning problem the larger we expect the peak at the interpolation boundary to be.
While the presence of the double descent has been studied by several works, our derivation provides two potentially new interesting insights. The first one is that there is a dependency between the noise in the learning problem and the shape of the curve, the larger the noise is, the larger the peak in DD curve. This agrees with the typical intuition in the underparmetrized regime that the model fits the noise when it has enough capacity, leading towards a spike in test error. However, due to the dependence on , it is subdued as the model size grows. Secondly, and maybe considerably more interesting, there seems to be a connection between the double descent curve of the excess risk and the optimization process. In particular, our derivation is specific to gradient descent. In this case the excess risk seems to depend on the conditioning of the features in the least squares problem on the subspace spanned by the data through , which also affects convergence of the optimization process. For the least squares problem this can easily be seen, as the sample covariance of the features corresponds to the Gauss-Newton approximation of the Hessian (e.g. NoceWrig06), hence it impacts the convergence. In a more precise way, conditioning of any matrix is measured by the ratio (the ‘condition number’) which is determined solely by the smallest singular value in cases when is of constant order, such as the case that we studied here: Note that by our boundedness assumption, is constant, but in general one needs to consider both and in order to characterize the condition numbers, which interestingly have been observed to display a double descent as well poggio2019double.
More generally, normalization, standardization, whitening and various other preprocessing of the input data have been a default step in many computer vision systems(e.g. lecun98b; Krizhevsky09learningmultiple) where it has been shown empirically that they greatly affect learning. Such preprocessing techniques are usually aimed to improve conditioning of the data. Furthermore, various normalization layers like batch-norm (Ioffe15) or layer-norm (Ba16) are typical components of recent architectures, ensuring that features of intermediary layers are well conditioned. Furthermore, it has been suggested that model size improves conditioning of the learning problem (Li18), which is in line with our expectation given the behaviour of . Taking inspiration from the optimization literature, it is natural for us to ask whether for neural networks, we can also connect the conditioning or of intermediary features and double descent. This particular might be significant if we think of the last layer of the architecture as a least squares problem (assuming we are working with mean square error), and all previous layers as some random projection, ignoring that learning is affecting this projection as well.
This relationship between generalization and double descent on one hand, and the conditioning of the features and optimization process raises some additional interesting questions, particularly since, compared to the typical least squares setting, the conditioning of the problem for deep architectures does not solely depend on size. In the next section we empirically look at some of these questions.
5 Empirical exploration in neural networks
The first natural question to ask is whether the observed behaviour for the least squares problem is reflected when working with neural networks. To explore this hypothesis, and to allow tractability of computing various quantities of interest (like
), we focus on one hidden layer MLPs on the MNIST and FashionMNIST datasets. We follow the protocol used bybelkin2019reconciling, relying on a squared error loss. In order to increase the model size we simply increase the dimensionality of the latent space, and rely on gradient descent with a fixed learning rate and a training set to randomly chosen examples for both datasets. More details can be found in Appendix G.
. The first row shows test error (number of miss-classified examples out of the test examples) computed on the full test set ofdata points which as expected shows the double descent curve with a peak around hidden units. Note that the peak is relatively small, however the behaviour seems consistent under random seeds for the MNIST experiment.333The error bars for the test error in all the other experiments are estimated by splitting the test set into 10 subsets. The second row and potentially the more interesting one looks at the computed on the covariance of the activations of the hidden layer, which as predicted by our theoretical derivation shows a dip around the interpolation threshold, giving the expected U-shape. Even more surprisingly this shape seems to be robust throughout learning, and the fact that the input weights and biases are being trained seems not to alter it, thus suggesting that our derivation might provide insights in the behaviour of deep models.
Following this, if we think of the output layer as solving a least squares problem, while the rest of the network provides a projection of the data, we can consider what can affect the conditioning of the last latent space of the network. We put forward the hypothesis that is not simply affected by the number of parameters, but actually the distribution of these parameters in the architecture matters.
To test this hypothesis, we conduct an experiment where we compare the behavior of a network with a single hidden layer and a network with three hidden layers. For both networks, we increase the size of the hidden layers. For the deeper network, we consider either increasing the size of all the hidden layers or grow only the last hidden layer while keeping the others to a fixed small size, creating a strong bottleneck in the network. Figure 3 shows the results obtained with the former, while the effect of the bottleneck can be seen in Appendix F. We first observe that for the three tested networks, the drop in the minimum eigenvalues happens when the size of the last hidden layer reaches the number of training samples, as predicted by the theory. The magnitude of this drop and behavior across the different tested sizes depends however on the previous layers. In particular, we observe that the bottleneck yields features that are more ill-conditioned than the network with wide hidden layers, where the width of the last layer on its own can not compensate for the existence of the bottleneck. Moreover, from Figure 3, we can clearly see that the features obtained by the deeper network have a bigger drop in the minimum eigenvalue, which results, as expected in a higher increase in the test error around the interpolation threshold.
It is well known that depth can harm optimization making the problem ill-conditioned, hence the reliance on skip-connections and batch normalizationDe2020 to train very deep architecture. Our construction provides a way of reasoning about double descent that allows us to factor in the ill-conditioning of the learning problem. Rather than focusing simply on the model size, it suggests that for neural networks the quantity of interest might also be for intermediary features, which is affected by size of the model but also by the distribution of the weights and architectural choices. For now we present more empirical explorations and ablations in Appendix G, and put forward this perspective as a conjecture for further exploration.
6 Conclusion and Future Work
In this work we analyse the double descent phenomenon in the context of the least squares problem. We make the observation that the excess risk of gradient descent is controlled by the smallest positive eigenvalue, , of the feature covariance matrix. Furthermore, this quantity follows the Bai-Yin law with high probability under mild distributional assumptions on features, that is, it manifests a U-shaped behaviour as the number of features increases, which we argue induces a double descent shape of the excess risk. Through this we provide a connection between the widely known phenomena and optimization process and conditioning of the problem. We believe this insight provides a different perspective compared to existing results focusing on the Moore-Penrose pseudo-inverse solution. In particular our work conjectures that the connection between the known double descent shape and model size is through of the features at intermediary layers. For the least squares problem correlates strongly with model size (and hence feature size). However this might not necessarily be always true for neural networks. For example we show empirically that while both depth and width increase the model size, they might affect differently. We believe that our work could enable much needed effort, either empirical or theoretical, to disentangle further the role of various factors, like depth and width or other architectural choices like skip connections on double descent.
Appendix A Definitions from Linear Analysis
We denote column vectors and matrices with small and capital bold letters, respectively, e.g. and . Singular values of a rectangular matrix are denoted by . The rank of is . Eigenvalues of a Positive Semi-Definite (PSD) matrix are nonnegative and are denoted , while the smallest non-zero eigenvalue is denoted .
When is positive definite, we define for by It is easy to check that is indeed a norm on , hence it induces a metric over , with the distance between and given by . If is only semi-definite, these definitions would give a semi-norm and semi-metric. Note that where is the matrix square root of . If we set
, the identity matrix, then the normreduces to the standard Euclidean norm: .
Combining the Cauchy-Schwarz inequality and the definition of operator norm , which implies , we get the inequality .
The distance from a point to some set is defined as usual
and for a positive defninte matrix we define similarly
Given and , the Euclidean ball of radius centered at is defined as
and for a positive definite matrix the ellipsoid w.r.t. metric is defined as
Appendix B Minimum eigenvalue and condition number
Previous works considered the link between the condition number of the features and the DD behavior rangamani2020interpolating. In this work, the analysis focuses more particularly on the minimum eigenvalue. In the following small experiments, we empirically show that in the experiments shown in the main paper, the condition number is driven by the minimum eigenvalue, and that the maximum eigenvalue stays close to a constant order when we increase the size of the features. In Figure 4, we use the same setting as in the MNIST experiment in Figure 2 is the main paper. We obseve that the behavior of the condition number follows the minimum eigenvalue, while the maximum eigenvalue stays between 10 and 100 as we increase the width of the networks.
Appendix C Excess Risk of Gradient Descent
In this section we consider the standard GD algorithm, that is , which is obtained recursively by applying the update rule with some step size and initialization . The rule is iterated for . We will pay attention to which satisfy the following regularity condition:
A map is called -admissible, where is a fixed PSD matrix and , if for all the following holds:
Notice that the norm on the left-hand side is , while that on the right-hand side is the standard Euclidean norm. Also notice that this inequality entails a Lipschitz condition with Lipschitz factor .
The excess risk of is defined as
Next we give upper bounds on the excess risk of GD output, assuming that the output of is of low-rank, which is of interest in the overparameterized regime. Specifically, for some low-rank orthogonal projection we assume that a.s. with respect to random samples , for any initialization . The following theorem gives us a general bound on the excess risk of any admissible algorithm in a sense of Definition 1 w.r.t. to any smooth loss (not necessarily convex). In the following we will demonstrate that GD satisfies Definition 1.
Theorem 1 (Excess Risk of Admissible Algorithm).
Assume that , and assume that is -admissible (Definition 1), where and are independent. Further assume for any , and that and are -smooth. Then, for any we have
In particular, having ,
By the -smoothness of , and noting that since is a minimizer,
where the last equality is justified since is an orthogonal projection satisfying the assumption .
Focusing on the first term above, we get
where the first inequality is due to the following inequality for squared Euclidean norms: , and the last inequality is due to -admissibility of . Taking expectation on both sides we have
where is random only due to the sample (recall that is independent of by assumption). The term is a standard Gaussian integral, whose calculation is summarized in following lemma:
Lemma 2 (Expectation of a squared norm of the Gaussian random vector).
For any and :
where is the normalization constant. Therefore, if then for any we have the identity .
For our case with this gives .
The term is bounded next by using the standard “descent lemma”.
Lemma 3 (Descent Lemma).
Assuming that ,
Proof of Lemma 3.
Since is -smooth, Taylor expansion and the gradient descent rule give us
and rearranging we have
Summing over we arrive at
Finally, note that by the assumption that . ∎
In particular, if are the iterates of GD when starting from (so that ), then