1 Introduction
Classical statistical learning theory asserts that, to achieve generalization, one must use a training sample size that sufficiently exceeds the complexity of the learning model, where the latter is typically represented by the number of parameters (or some related structural parameter)
[FHT01]. In particular, this seems to suggest the conventional wisdom that one should not use models that fit the training data exactly. However, modern machine learning practice often goes against this intuition, using models with so many parameters that the training data can be perfectly interpolated, in which case the training error vanishes. It has been shown that models such as deep neural networks, as well as certain so-called interpolating kernels and decision trees, can generalize well in this regime. In particular, recent work
[BHMM19] empirically demonstrated a phase transition in the generalization performance of learning models which occurs at an interpolation threshold, i.e., a point where the training error goes to zero (as one varies the ratio between the model complexity and the sample size). Moving away from this threshold in either direction tends to reduce the generalization error, leading to the so-called double descent curve.

To understand this surprising phenomenon in perhaps the simplest possible setting, we study it in the context of linear (least squares) regression. Consider a full rank $n\times d$ data matrix $\mathbf{X}$ and a vector $\mathbf{y}$ of responses corresponding to each of the $n$ data points (the rows of $\mathbf{X}$), where we wish to find the best linear model $\mathbf{x}^\top\mathbf{w}$, parameterized by a $d$-dimensional vector $\mathbf{w}$. The simplest example of an estimator that has been shown to exhibit the double descent phenomenon [BHX19] is the Moore-Penrose estimator, $\mathbf{X}^+\mathbf{y}$: in the so-called overdetermined regime, i.e., when $n\ge d$, it corresponds to the least squares solution, i.e., $\operatorname{argmin}_{\mathbf{w}}\|\mathbf{X}\mathbf{w}-\mathbf{y}\|^2$; and in the underdetermined regime (also known as overparameterized or interpolating), i.e., when $n\le d$, it corresponds to the minimum norm solution of the linear system $\mathbf{X}\mathbf{w}=\mathbf{y}$. Given the ubiquity of linear regression and the Moore-Penrose solution, e.g., in kernel-based machine learning, studying the performance of this estimator can shed light on the effects of overparameterization/interpolation in machine learning more generally. Of particular interest are results that are exact (i.e., not upper/lower bounds) and non-asymptotic (i.e., for large but still finite $n$ and $d$).

We build on methods from Randomized Numerical Linear Algebra (RandNLA) to obtain exact non-asymptotic expressions for the mean squared error (MSE) of the Moore-Penrose estimator (see Theorem 1). This provides a precise characterization of the double descent phenomenon for perhaps the simplest and most ubiquitous regression problem. In obtaining these results, we provide precise formulas for the implicit regularization induced by minimum norm solutions of underdetermined training samples, relating it to classical ridge regularization (see Theorem 2). This phenomenon has been observed empirically for RandNLA methods [Mah11]
, but it has also been shown in deep learning
[Ney17] and machine learning [Mah12] more generally. To obtain our precise results, we use a somewhat non-standard random design, which we term a surrogate random design (see Section 2 for a detailed discussion), and which we expect to be of more general interest. Informally, the goal of a surrogate random design is to modify an original design to capture its main properties while being “nicer” in some useful way. In Theorem 3 and Section 5 we show, both theoretically and empirically, that our surrogate design accurately preserves the key properties of the original design when the data distribution is a multivariate Gaussian.

1.1 Main results: double descent and implicit regularization
As the performance metric in our analysis, we use the mean squared error (MSE), defined as $\mathrm{MSE}[\widehat{\mathbf{w}}] = \mathbb{E}\big[\|\widehat{\mathbf{w}} - \mathbf{w}^*\|^2\big]$, where $\mathbf{w}^*$ is a fixed underlying linear model of the responses. In analyzing the MSE, we make the following standard assumption on the response noise.
Assumption 1 (Homoscedastic noise).
Responses are $y(\mathbf{x}) = \mathbf{x}^\top\mathbf{w}^* + \xi$, where the noise $\xi$ has mean zero and variance $\sigma^2$, independently of $\mathbf{x}$.
Our main result provides an exact expression for the MSE of the Moore-Penrose estimator under our surrogate design, denoted $S_\mu^n$, where $\mu$ is the $d$-variate distribution of the row vector $\mathbf{x}^\top$ and $n$ is the sample size (details in Section 2). This surrogate is used in place of the standard random design $\mu^n$, where the $n$ data points (the rows of $\mathbf{X}$) are sampled independently from $\mu$. Unlike for the standard design, our MSE formula is fully expressible as a function of the covariance matrix $\Sigma_\mu = \mathbb{E}[\mathbf{x}\mathbf{x}^\top]$. To state our main result, we need an additional minor assumption on $\mu$, which is satisfied by most standard continuous distributions, such as any multivariate Gaussian with positive definite covariance matrix.
Assumption 2 (General position).
For $n \le d$, if the row vectors $\mathbf{x}_1^\top,\dots,\mathbf{x}_n^\top$ are sampled i.i.d. from $\mu$, then the matrix with these rows has rank $n$ almost surely.
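Before stating the result, the two regimes of the Moore-Penrose estimator can be checked numerically. The following numpy sketch (our own illustration, not part of the paper's experiments) verifies that the pseudoinverse solution coincides with least squares in the overdetermined regime, and that in the underdetermined regime it interpolates the data with minimum norm among all interpolants:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10

def moore_penrose_estimator(X, y):
    """Return X^+ y: least squares for n >= d, minimum norm interpolant for n <= d."""
    return np.linalg.pinv(X) @ y

# Overdetermined regime (n >= d): X^+ y matches the least squares solution.
n = 25
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + 0.1 * rng.standard_normal(n)
w_hat = moore_penrose_estimator(X, y)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w_hat, w_lstsq)

# Underdetermined regime (n <= d): X^+ y interpolates the data exactly
# (zero training error) and has minimum norm among all interpolants.
n = 5
X = rng.standard_normal((n, d))
y = X @ w_star
w_hat = moore_penrose_estimator(X, y)
assert np.allclose(X @ w_hat, y)  # perfect interpolation
# Add an arbitrary null-space component to obtain another interpolant ...
w_other = w_hat + (np.eye(d) - np.linalg.pinv(X) @ X) @ rng.standard_normal(d)
assert np.allclose(X @ w_other, y)
# ... whose norm can only be larger, since w_hat is orthogonal to the null space.
assert np.linalg.norm(w_hat) <= np.linalg.norm(w_other) + 1e-9
```

The minimum norm property follows because $\mathbf{X}^+\mathbf{y}$ lies in the row space of $\mathbf{X}$, orthogonal to the null-space directions that parameterize all other interpolants.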
Theorem 1 (Exact non-asymptotic MSE).
Definition 1.
We will use to denote the above expressions for .
For illustration, we plot these MSE expressions in Figure 1a, comparing them with empirical estimates of the true MSE under the i.i.d. design for a multivariate Gaussian distribution with several different covariance matrices $\Sigma$. We keep the number of features $d$ fixed and vary the number of samples $n$, observing a double descent peak at $n = d$. We observe that our theory aligns well with the empirical estimates, whereas previously no such theory was available except for special cases such as isotropic covariance (more details in Theorem 3 and Section 5). The plots show that varying the spectral decay of $\Sigma$ has a significant effect on the shape of the curve in the underdetermined regime. We use a horizontal line to denote the MSE of the null estimator. When the eigenvalues of $\Sigma$ decay rapidly, the Moore-Penrose estimator suffers less error than the null estimator for some values of $n < d$, and the curve exhibits a local optimum in this regime.

[Figure 1: surrogate MSE expressions (solid lines), accompanied by empirical estimates based on the standard i.i.d. design (error bars are three times the standard error of the mean); three different condition numbers of $\Sigma$ are considered, with the isotropic case corresponding to $\Sigma \propto \mathbf{I}$.]

One important aspect of Theorem 1 comes from the relationship between the sample size $n$ and the parameter $\lambda$, which together satisfy $n = \operatorname{tr}\!\big(\Sigma_\mu(\Sigma_\mu + \lambda\mathbf{I})^{-1}\big)$. This expression is precisely the classical notion of effective dimension for ridge regression regularized with $\lambda$ [AM15], and it arises here even though there is no explicit ridge regularization in the problem being considered in Theorem 1. The global solution of the ridge regression task (i.e., $\ell_2$-regularized least squares) with parameter $\lambda$ is defined as:

$$\mathbf{w}^*_\lambda = \operatorname{argmin}_{\mathbf{w}}\ \mathbb{E}_\mu\big[(\mathbf{x}^\top\mathbf{w} - y)^2\big] + \lambda\,\|\mathbf{w}\|^2.$$

When Assumption 1 holds, then $\mathbf{w}^*_0 = \mathbf{w}^*$; however, ridge-regularized least squares is well-defined for much more general response models. Our second result makes a direct connection between the (expectation of the) unregularized minimum norm solution on the sample and the global ridge-regularized solution. While the underdetermined regime (i.e., $n \le d$) is of primary interest to us, for completeness we state this result for arbitrary values of $n$ and $d$. Note that, just like the definition of regularized least squares, this theorem applies more generally than Theorem 1, in that it does not require the responses to follow any linear model as in Assumption 1.
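The effective dimension and the implied regularization level are easy to compute numerically from the spectrum of the covariance matrix. The following sketch (function names are ours, and the bisection scheme is one simple choice among many) solves for the $\lambda$ at which $\operatorname{tr}\!\big(\Sigma(\Sigma+\lambda\mathbf{I})^{-1}\big)$ equals the sample size $n$; this works because the effective dimension decreases monotonically from $d$ to $0$ as $\lambda$ grows:

```python
import numpy as np

def effective_dimension(eigs, lam):
    """d_lambda = tr(Sigma (Sigma + lam I)^{-1}), computed from the eigenvalues of Sigma."""
    return np.sum(eigs / (eigs + lam))

def implicit_lambda(eigs, n):
    """Solve d_lambda = n for lam >= 0 by bisection; requires 0 < n < d."""
    d = len(eigs)
    assert 0 < n < d
    lo, hi = 0.0, max(eigs) * d / n
    while effective_dimension(eigs, hi) > n:  # grow hi until d_lambda(hi) <= n
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if effective_dimension(eigs, mid) > n:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Decaying spectrum: d = 20 features, eigenvalues 1, 1/2, ..., 1/20.
eigs = 1.0 / (1 + np.arange(20))
lam = implicit_lambda(eigs, n=5)
assert abs(effective_dimension(eigs, lam) - 5) < 1e-6

# Isotropic sanity check: d_lambda = d / (1 + lambda) = n gives lambda = d/n - 1.
assert abs(implicit_lambda(np.ones(20), n=5) - 3.0) < 1e-6
```

In the isotropic case the implicit regularization grows linearly with the degree of overparameterization ($\lambda = d/n - 1$), while for decaying spectra the dependence on $n$ is nonlinear.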
Theorem 2 (Implicit regularization of the Moore-Penrose estimator).
For $\mu$ satisfying Assumption 2${}^{1}$ and any response model such that $\mathbf{w}^*_\lambda$ is well-defined, if $\bar{\mathbf{X}}\sim S_\mu^n$ and $\mathbf{y}$ is the corresponding response vector, then

$$\mathbb{E}\big[\bar{\mathbf{X}}^+\mathbf{y}\big] = \mathbf{w}^*_\lambda,$$

where, as in Theorem 1, $\lambda$ is chosen so that the effective dimension equals $n$ when $n \le d$, and $\lambda = 0$ when $n \ge d$.

${}^{1}$The proof of Theorem 2 can be easily extended to probability measures that do not satisfy Assumption 2 (such as discrete distributions). We include this assumption here to simplify the presentation.
That is, when $n \le d$, the Moore-Penrose estimator (which itself is not regularized), computed on the random training sample, in expectation equals the global ridge-regularized least squares solution of the underlying regression problem. Moreover, $\lambda$, i.e., the amount of implicit regularization, is controlled by the degree of overparameterization, in such a way as to ensure that the sample size $n$ becomes the ridge effective dimension (a.k.a. the effective degrees of freedom).
We illustrate this result in Figure 1b, plotting the norm of the expectation of the Moore-Penrose estimator. As for the MSE, our surrogate theory aligns well with the empirical estimates for i.i.d. Gaussian designs, showing that the shrinkage of the unregularized estimator in the underdetermined regime matches the implicit ridge regularization characterized by Theorem 2. While the shrinkage is a linear function of the sample size for isotropic features (i.e., $\Sigma_\mu \propto \mathbf{I}$), it exhibits a nonlinear behavior for other spectral decays. Such implicit regularization has been studied previously [MO11, PM11, GM14, Mah12]; it has been observed empirically for RandNLA sampling algorithms [Mah11, MMY15]; and it has also received attention more generally within the context of neural networks [Ney17]. While our implicit regularization result is limited to the Moore-Penrose estimator, this new connection (and others, described below) between the minimum norm solution of an unregularized underdetermined system and a ridge-regularized least squares solution offers a simple interpretation for the implicit regularization observed in modern machine learning architectures.
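In the special case of isotropic Gaussian features, this shrinkage can be checked directly: by rotational symmetry, the expected projection $\mathbb{E}[\mathbf{X}^+\mathbf{X}]$ under the i.i.d. design equals $(n/d)\,\mathbf{I}$, which matches the ridge shrinkage factor $1/(1+\lambda) = n/d$ obtained from $\lambda = d/n - 1$. The Monte Carlo sketch below (an illustration for the i.i.d. design, not the surrogate design of Theorem 2) averages noiseless minimum norm estimators over many random designs:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, trials = 10, 5, 4000
w_star = np.ones(d) / np.sqrt(d)  # arbitrary unit-norm underlying model

# Average the (noiseless) minimum norm estimator X^+ y = (X^+ X) w_star over
# many isotropic Gaussian designs; X^+ X projects onto the row space of X.
est = np.zeros(d)
for _ in range(trials):
    X = rng.standard_normal((n, d))
    est += np.linalg.pinv(X) @ (X @ w_star)
est /= trials

# Implicit ridge prediction: for Sigma = I, solving d/(1 + lambda) = n gives
# lambda = d/n - 1, so the shrunken solution is w_star / (1 + lambda) = (n/d) w_star.
assert np.linalg.norm(est - (n / d) * w_star) < 0.05
```

The unregularized estimator is thus shrunk toward zero purely by the randomness of the underdetermined sample, with no explicit penalty anywhere in the computation.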
Our exact non-asymptotic expressions in Theorem 1 and our exact implicit regularization results in Theorem 2 are derived for the surrogate design, but Figure 1 suggests that they accurately describe the MSE (up to lower order terms) also under the standard i.i.d. design $\mu^n$, particularly when $\mu$ is a multivariate Gaussian. As a third result, we verify this in the cases where known expressions exist for the MSE under the i.i.d. design (standard Gaussian for the underdetermined setting, and arbitrary Gaussian for the overdetermined one).
Theorem 3 (Asymptotic consistency of surrogate design).
Let , and satisfy Assumption 1. If and
then the absolute difference between surrogate expressions and the true MSE is bounded as follows:
Remark 1.
For equal to , or , the true MSE under Gaussian random design can be infinite, whereas the surrogate MSE is finite and has a closed form expression.
Empirical estimates given in Figure 1 suggest that the consistency of the surrogate expressions holds much more generally than stated above. Based on a detailed empirical analysis described in Section 5.2, we conjecture that an asymptotic consistency result similar to the statement of Theorem 3 holds true in the underdetermined regime without the assumption of isotropic covariance. In this case, no formula is known for the MSE under the i.i.d. design, whereas the expressions for the surrogate Gaussian design naturally extend.
1.2 Key techniques: surrogate designs and determinant preserving matrices
The standard random design model for linear regression assumes that each pair $(\mathbf{x}_i, y_i)$ is drawn independently, where the row vector $\mathbf{x}_i^\top$ comes from some $d$-variate distribution $\mu$ and $y_i$ is a random response variable drawn conditionally on $\mathbf{x}_i$. Precise theoretical analysis of underdetermined regression in this setting poses significant challenges, even in such special cases as the Moore-Penrose estimator and a Gaussian data distribution $\mu$. Rather than trying to directly analyze the usual i.i.d. random design described above, we modify it slightly by introducing the notion of a surrogate random design, $S_\mu^n$. Informally, the goal of a surrogate random design is to modify an original design to capture its main properties, while being “nicer” for theoretical or empirical analysis. In particular, here, we will modify the distribution of the matrix $\mathbf{X}$ so as to:
1. closely preserve the behavior of the Moore-Penrose estimator from the i.i.d. design; and

2. obtain exact expressions for double descent in terms of the mean squared error.
A key element in the construction of our surrogate designs involves rescaling the measure $\mu^K$ by the pseudo-determinant $\operatorname{pdet}(\mathbf{X}\mathbf{X}^\top)$, i.e., the product of the nonzero eigenvalues of $\mathbf{X}\mathbf{X}^\top$. A similar type of determinantal design was suggested in prior work [DWH19b], but it was restricted there to the overdetermined regime. We broaden this definition by not only allowing the sample size to be less than $d$, but also allowing it to be randomized. Our definition of a determinantal design matrix $\bar{\mathbf{X}}$ follows by expressing $\mathbb{E}[f(\bar{\mathbf{X}})]$ for any real-valued function $f$ as (see Definition 2):

$$\mathbb{E}\big[f(\bar{\mathbf{X}})\big] \;\propto\; \mathbb{E}\big[f(\mathbf{X})\operatorname{pdet}(\mathbf{X}\mathbf{X}^\top)\big],$$

where $\mathbf{X}\sim\mu^K$ and $K$ is a random variable. Then, we define (in Definition 3) our surrogate design $S_\mu^n$ for each $n$ as a determinantal design with a carefully chosen random variable $K$, so that the expected sample size is equal to $n$ and so that it is possible to derive closed form expressions for the MSE. We achieve this by using modifications of the Poisson distribution to construct the variable $K$.

The key technical contribution that allows us to derive the MSE for determinantal designs is the concept of determinant preserving random matrices, a notion that we expect to be useful more generally. Specifically, in Section 3 we define a class of random matrices for which taking the determinant commutes with taking the expectation, for the matrix itself and any of its square submatrices (see Definition 4):

$$\mathbb{E}\big[\det(\mathbf{A}_{\mathcal{I},\mathcal{J}})\big] = \det\!\big(\mathbb{E}[\mathbf{A}_{\mathcal{I},\mathcal{J}}]\big)\quad\text{for all index sets }\mathcal{I},\mathcal{J}\text{ of equal size}.$$

Not all random matrices satisfy this property; however, many interesting and nontrivial examples can be found. Constructing these examples is facilitated by the closure properties that this class enjoys. In particular, if $\mathbf{A}$ and $\mathbf{B}$ are determinant preserving and independent, then $\mathbf{A}+\mathbf{B}$ and $\mathbf{A}\mathbf{B}$ are also determinant preserving (see Lemma 3). We use these techniques to prove a number of determinantal expectation formulas. For example, we show that if $\mathbf{X}\sim\mu^K$, where $K$ is a Poisson random variable, then:

$$\mathbb{E}\big[\det(\mathbf{X}^\top\mathbf{X})\big] = \det\!\big(\mathbb{E}[\mathbf{X}^\top\mathbf{X}]\big).$$

These formulas are used to derive the normalization constants for our surrogate design distribution, which are later used in proving Theorems 1 and 2.
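The role of the Poisson sample size can be checked numerically. The sketch below (our own illustration) verifies by Monte Carlo that, with $K\sim\operatorname{Poisson}(\gamma)$ and i.i.d. Gaussian rows with covariance $\Sigma$, the expectation of $\det(\mathbf{X}^\top\mathbf{X})$ matches $\det(\gamma\Sigma) = \det(\mathbb{E}[\mathbf{X}^\top\mathbf{X}])$:

```python
import numpy as np

rng = np.random.default_rng(2)
Sigma = np.diag([1.0, 0.5])
L = np.sqrt(Sigma)            # Sigma = L L^T for this diagonal example
gamma, trials = 3.0, 50_000

# With a Poisson number of i.i.d. rows, X^T X is determinant preserving:
# E[det(X^T X)] should equal det(E[X^T X]) = det(gamma * Sigma).
dets = []
for _ in range(trials):
    K = rng.poisson(gamma)
    X = rng.standard_normal((K, 2)) @ L   # K x 2 sample (possibly empty)
    dets.append(np.linalg.det(X.T @ X))
mc = np.mean(dets)
target = np.linalg.det(gamma * Sigma)     # = 4.5 here

# Note: with a FIXED sample size K = 3 this would fail, since a 2x2 Wishart
# with 3 degrees of freedom has E[det] = 3*2*det(Sigma), not 9*det(Sigma).
assert abs(mc - target) < 0.3
```

The randomized sample size is essential here; the comment above records the classical Wishart moment showing that a deterministic $K$ breaks the identity.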
1.3 Related work
There is a large body of related work, which for simplicity we cluster into three groups.
Double descent. The double descent phenomenon (a term introduced by [BHMM19]) corresponds to the phase transition in the generalization error that occurs when the ratio between the model complexity and the sample size crosses the so-called interpolation threshold. It has been observed empirically in a number of learning models, including neural networks [BHMM19, GJS19], kernel methods [BMM18, BRT19], nearest neighbor models [BHM18], and decision trees [BHMM19]. The theoretical analysis of double descent, and more broadly of the generalization properties of interpolating estimators, has primarily focused on various forms of linear regression. The most comparable to our work are [BLLT19, LR19] and [HMRT19], who provide non-asymptotic upper/lower bounds and asymptotic formulas, respectively, for the generalization error of the Moore-Penrose estimator under essentially the same i.i.d. random design setting as ours. On the other hand, [MVSS19] provide bounds for the error of the ideal linear interpolator (instead of the minimum norm one). Note that while we analyze the classical mean squared error, many works focus on the squared prediction error instead (some of them still refer to it as the MSE). Another line of literature deals with linear regression in the so-called misspecified setting, where the set of observed features does not match the feature space in which the response model is linear [BHX19, HMRT19, Mit19, MM19b], e.g., when the learner observes a random subset of features from a larger population. This is an important distinction, because it allows varying the model complexity by changing the number of observed features while keeping the linear model fixed (see further discussion in Section 6). We believe that our results can be extended to this important setting, and we leave this as a direction for future work.
RandNLA. Randomized numerical linear algebra [Mah11, DM16, DM17] has traditionally focused on obtaining algorithmic improvements for tasks such as least squares and low-rank approximation via techniques that include sketching [Sar06] and i.i.d. leverage score sampling [DMM06]. However, there has been growing interest in understanding the statistical properties of these randomized methods [MMY15, RM16], for example looking at the mean squared error of the least squares estimator obtained via i.i.d. subsampling under the standard linear response model. Determinantal sampling methods (a.k.a. volume sampling, or determinantal point processes), which first found their way into RandNLA in the context of low-rank approximation [DRVW06], have recently been shown to combine strong worst-case guarantees with elegant statistical properties. In particular, [DW17] showed that the least-squares estimator subsampled via so-called volume sampling (loosely corresponding to the special case of our surrogate design where $n = d$) is an unbiased estimator that admits exact formulas for both the expected square loss (a worst-case metric) and the mean squared error (a statistical metric). These results were developed further by [DWH18, DWH19a, DCMW19]; however, they were still limited to the overdetermined setting (with the exception of [DW18a, DLM19], who gave upper bounds on the mean squared error of the ridge estimator under different determinantal samplings). Also in the overdetermined setting, [DWH19b] provided evidence for the fact that determinantal rescaling can be used to modify the original data distribution (particularly, a multivariate Gaussian) without a significant distortion to the estimator, while making certain statistical quantities expressible analytically. We take this direction further by analyzing the unregularized least squares estimator in the underdetermined setting, which is less well understood, partly due to the presence of implicit regularization.

Implicit regularization. The term implicit regularization typically refers to the notion that approximate computation (e.g., rather than exactly minimizing a function $f$, instead running an approximation algorithm to get an approximately optimal solution) can implicitly lead to statistical regularization (e.g., exactly minimizing an objective of the form $f + \lambda g$, for some well-specified $\lambda$ and $g$). See [MO11, PM11, GM14] and references therein for early work on the topic; and see [Mah12] for an overview. More recently, often motivated by neural networks, there has been work on implicit regularization that typically considered SGD-based optimization algorithms. See, e.g., theoretical results on simplified models [NTS14, Ney17, SHN18, GWB17, ACHL19, KBMM19], as well as extensive empirical and phenomenological results on state-of-the-art neural network models [MM18, MM19a]. The implicit regularization observed by us is different in that it is not caused by an inexact approximation algorithm (such as SGD) but rather by the selection of one out of many exact solutions (e.g., the minimum norm solution).
In this context, most relevant are the asymptotic results of [LJB19] (which used the asymptotic risk results for ridge regression of [DW18b]) and [KLS18]. Our non-asymptotic results are also related to recent work in RandNLA on the expectation of the inverse [DM19] and generalized inverse [MDK19] of a subsampled matrix.
2 Surrogate random designs
In this section, we provide the definition of our surrogate random design $S_\mu^n$, where $\mu$ is a $d$-variate probability measure and $n$ is the sample size. This distribution is used in place of the standard random design $\mu^n$, consisting of $n$ row vectors drawn independently from $\mu$. Our surrogate design uses determinantal rescaling to alter the joint distribution of the vectors so that certain expected quantities (such as the mean squared error of the Moore-Penrose estimator) can be expressed in a closed form. We start by introducing notation.

Preliminaries. The set $\{1,\dots,n\}$ will be denoted by $[n]$. For a square matrix $\mathbf{A}$, we use $\operatorname{pdet}(\mathbf{A})$ to denote the pseudo-determinant of $\mathbf{A}$, which is the product of its nonzero eigenvalues. For index subsets $\mathcal{I}$ and $\mathcal{J}$, we use $\mathbf{A}_{\mathcal{I},\mathcal{J}}$ to denote the submatrix of $\mathbf{A}$ with rows indexed by $\mathcal{I}$ and columns indexed by $\mathcal{J}$. We may write $\mathbf{A}_{\mathcal{I}}$ to indicate that we take a subset of rows. We use $\operatorname{adj}(\mathbf{A})$ to denote the adjugate of $\mathbf{A}$, defined as follows: the $(i,j)$th entry of $\operatorname{adj}(\mathbf{A})$ is $(-1)^{i+j}$ times the minor obtained by removing the $j$th row and $i$th column of $\mathbf{A}$. We will use two useful identities related to the adjugate: (1) $\operatorname{adj}(\mathbf{A}) = \det(\mathbf{A})\,\mathbf{A}^{-1}$ for invertible $\mathbf{A}$, and (2) the rank-one update formula $\det(\mathbf{A}+\mathbf{u}\mathbf{v}^\top) = \det(\mathbf{A}) + \mathbf{v}^\top\operatorname{adj}(\mathbf{A})\,\mathbf{u}$. For a probability measure $\mu$ over $\mathbb{R}^d$, we use $\mathbf{x}^\top\sim\mu$ to denote a random row vector sampled according to this distribution. We let $\mathbf{X}\sim\mu^K$ denote a random matrix with $K$ rows drawn i.i.d. according to $\mu$, where the $i$th row is denoted $\mathbf{x}_i^\top$. We also let $\Sigma_\mu = \mathbb{E}_\mu[\mathbf{x}\mathbf{x}^\top]$, where $\mathbb{E}_\mu$ refers to the expectation with respect to $\mu$, assuming throughout that $\Sigma_\mu$ is well-defined and positive definite. We use $\operatorname{Poisson}_{\le d}(\gamma)$ for the Poisson distribution restricted to values less than or equal to $d$, and a similar convention is used for the restriction to values greater than or equal to $d$. Finally, we use $\#(\mathbf{X})$ to denote the number of rows of $\mathbf{X}$.
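The pseudo-determinant and adjugate can be implemented directly from these definitions. The following reference sketch (our own, with the usual convention that an empty product equals 1) is handy for numerically sanity-checking the identities used below:

```python
import numpy as np

def pdet(A, tol=1e-10):
    """Pseudo-determinant: product of the nonzero eigenvalues of a square matrix A."""
    eigs = np.linalg.eigvals(A)
    nz = eigs[np.abs(eigs) > tol]
    return float(np.real(np.prod(nz))) if nz.size else 1.0

def adjugate(A):
    """Adjugate of A: entry (i, j) is (-1)^{i+j} times the minor with row j, column i removed."""
    n = A.shape[0]
    adj = np.empty_like(A, dtype=float)
    for i in range(n):
        for j in range(n):
            minor = np.delete(np.delete(A, j, axis=0), i, axis=1)
            adj[i, j] = (-1) ** (i + j) * np.linalg.det(minor)
    return adj

A = np.array([[2.0, 1.0], [0.0, 3.0]])
# Identity (1): adj(A) = det(A) A^{-1} for invertible A.
assert np.allclose(adjugate(A), np.linalg.det(A) * np.linalg.inv(A))
# For a rank-1 PSD matrix, pdet is the single nonzero eigenvalue ||u||^2.
u = np.array([3.0, 4.0])
assert np.isclose(pdet(np.outer(u, u)), 25.0)
```

The $O(n^4)$ cofactor loop in `adjugate` is for clarity only; for invertible matrices one would use identity (1) directly.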
We now define a family of determinantal distributions over random matrices $\bar{\mathbf{X}}$, where not only the entries but also the number of rows is randomized. This randomized sample size is a crucial property of our designs that enables our analysis. Our definition follows by expressing $\mathbb{E}[f(\bar{\mathbf{X}})]$ for real-valued functions $f$ (the expectation may be undefined for some functions).
Definition 2.
Let $\mu$ satisfy Assumption 2 and let $K$ be a random variable over non-negative integers. A determinantal design is a distribution over random matrices $\bar{\mathbf{X}}$ such that, for any $f$ as above,

$$\mathbb{E}\big[f(\bar{\mathbf{X}})\big] \;\propto\; \mathbb{E}\big[f(\mathbf{X})\operatorname{pdet}(\mathbf{X}\mathbf{X}^\top)\big],\qquad \mathbf{X}\sim\mu^K.$$
Setting $f \equiv 1$, observe that the proportionality constant must be $1/\mathbb{E}\big[\operatorname{pdet}(\mathbf{X}\mathbf{X}^\top)\big]$. The above definition can be interpreted as rescaling the density function of $\mu^K$ by the pseudo-determinant, and then renormalizing it.
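Because the design is defined by rescaling and renormalizing, expectations under it can be approximated from i.i.d. samples by self-normalized importance weighting, with weights given by the pseudo-determinant. A toy sketch (our own, in one dimension with a single row, where the weight reduces to $x^2$):

```python
import numpy as np

rng = np.random.default_rng(3)

def determinantal_expectation(f, sample_X, trials=200_000):
    """Estimate E[f(Xbar)] under the determinantal rescaling by weighting
    i.i.d. draws X ~ mu^K with pdet(X X^T) and renormalizing."""
    num = den = 0.0
    for _ in range(trials):
        X = sample_X(rng)
        # For full-rank X with K <= d rows, pdet(X X^T) = det(X X^T).
        w = np.linalg.det(X @ X.T)
        num += w * f(X)
        den += w
    return num / den

# Sanity check with K = 1 row and d = 1 feature, mu = N(0, 1): the weight is
# x^2, so the rescaled design satisfies E[xbar^2] = E[x^4] / E[x^2] = 3.
est = determinantal_expectation(lambda X: float(X[0, 0] ** 2),
                                lambda r: r.standard_normal((1, 1)))
assert abs(est - 3.0) < 0.15
```

This reweighting view is only a numerical device for intuition; the paper's analysis works with the rescaled measure directly.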
We now construct our surrogate design by appropriately selecting the random variable $K$. One might be tempted to use the obvious choice of a fixed $K = n$, but this does not result in simple closed form expressions for the MSE in the underdetermined regime (i.e., $n \le d$), which is the regime of primary interest to us. Instead, we derive our random variables from the Poisson distribution.
Definition 3.
For $\mu$ satisfying Assumption 2, define the surrogate design $S_\mu^n$ as the determinantal design whose sample size variable $K$ is chosen as follows:

1. if $n < d$, then $K \sim \operatorname{Poisson}_{\le d}(\gamma)$, with $\gamma$ being the solution of ,

2. if $n = d$, then we simply let $K = d$,

3. if $n > d$, then $K \sim \operatorname{Poisson}_{\ge d}(\gamma)$, with .

Note that the underdetermined case, i.e., $n < d$, is restricted to $K \le d$ so that, under Assumption 2, the rows of $\bar{\mathbf{X}}$ are linearly independent with probability 1. On the other hand, in the overdetermined case, i.e., $n > d$, we have $K \ge d$, so that $\bar{\mathbf{X}}$ has full column rank. In the special case of $n = d$, both of these constraints are satisfied: $K = d$.
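The restricted Poisson variables appearing in Definition 3 are straightforward to sample by rejection, as in this small sketch (our own; adequate when the restriction is not far in the Poisson tail):

```python
import numpy as np

rng = np.random.default_rng(4)

def truncated_poisson(rng, gamma, d, lower=False, size=1):
    """Rejection sampler for Poisson(gamma) restricted to values <= d,
    or to values >= d when lower=True."""
    out = []
    while len(out) < size:
        k = int(rng.poisson(gamma))
        if (k >= d) if lower else (k <= d):
            out.append(k)
    return np.array(out)

# Underdetermined regime: sample sizes never exceed d, so under Assumption 2
# the rows of the design are linearly independent with probability 1.
ks = truncated_poisson(rng, gamma=4.0, d=10, size=1000)
assert ks.max() <= 10

# Overdetermined regime: sample sizes never fall below d (full column rank).
ks = truncated_poisson(rng, gamma=12.0, d=10, lower=True, size=1000)
assert ks.min() >= 10
```

Note that the acceptance rate degrades if $\gamma$ is far from the cutoff $d$; more careful samplers exist, but rejection suffices for illustration.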
The first nontrivial property of the surrogate design is that the expected sample size is in fact always equal to $n$, which we prove at the end of this section.
Lemma 1.
Let $\bar{\mathbf{X}}\sim S_\mu^n$ for any $n$. Then, we have $\mathbb{E}\big[\#(\bar{\mathbf{X}})\big] = n$.
Our general template for computing expectations under a surrogate design is to use the following expressions based on the i.i.d. random design $\mu^K$:

These formulas follow from Definitions 2 and 3 because the determinants $\det(\mathbf{X}\mathbf{X}^\top)$ and $\det(\mathbf{X}^\top\mathbf{X})$ are nonzero precisely in the regimes $K \le d$ and $K \ge d$, respectively, which is why we can drop the restrictions on the range of the Poisson distribution. Crucially, the normalization constants for computing the expectations can be obtained using the following formulas: if then
Remark 2.
We will use as a shorthand for the above normalization constants.
We prove Lemmas 4 and 6 in Section 3 by introducing the concept of determinant preserving random matrices. The lemmas play a crucial role in deriving a number of new expectation formulas for the under- and overdetermined surrogate designs, which we use to prove Theorems 1 and 2 in Section 4. On the other hand, Lemma 5 and the $n = d$ case of the design can be found in the literature [vdV65, DWH19b], and we will rely on those known results in this special case. Importantly, the case $n = d$ offers a continuous transition between the under- and overdetermined regimes, because the distribution $S_\mu^n$ converges to $S_\mu^d$ as $n$ approaches $d$ from above and from below. Another important property of the $n = d$ design is that it can be used to construct an overdetermined design $S_\mu^n$ for any $n \ge d$. A similar version of this result was also previously shown by [DWH19b] for a different determinantal design.
Lemma 2.
Let $\bar{\mathbf{X}}\sim S_\mu^d$ and $\mathbf{X}\sim\mu^K$, where $K\sim\operatorname{Poisson}(n-d)$ for some $n \ge d$. Then the matrix composed of a random permutation of the rows from $\bar{\mathbf{X}}$ and $\mathbf{X}$ is distributed according to $S_\mu^n$.
Proof.
Let denote the matrix constructed from the permuted rows of and . Letting , we derive the expectation by summing over the possible index subsets that correspond to the rows coming from :
where the marked step uses the Cauchy-Binet formula to sum over all subsets of size $d$. Finally, since the sum shifts from to , the last expression can be rewritten as , where recall that and , matching the definition of $S_\mu^n$. ∎
We now return to the proof of Lemma 1, where we establish that the expected sample size of is indeed .
Proof.
(of Lemma 1) The result is obvious when $n = d$, whereas for $n > d$ it is an immediate consequence of Lemma 2. Finally, for $n < d$ the expected sample size follows as a corollary of a more general expectation formula proven in Section 4, which states that

where $\mathbf{P} = \bar{\mathbf{X}}^+\bar{\mathbf{X}}$ is the orthogonal projection onto the subspace spanned by the rows of $\bar{\mathbf{X}}$. Since the rank of this subspace is equal to the number of rows, we have $\operatorname{tr}(\mathbf{P}) = \#(\bar{\mathbf{X}})$, so
which completes the proof. ∎
3 Determinant preserving random matrices
In this section, we introduce the key tool for computing expectation formulas of matrix determinants. It is used in our analysis of the surrogate design, and it should be of independent interest.
The key question motivating the following definition is: when does taking expectation commute with computing a determinant for a square random matrix?
Definition 4.
A random matrix $\mathbf{A}$ is called determinant preserving (d.p.) if
Note that from the definition of an adjugate matrix (see Section 2) it immediately follows that if $\mathbf{A}$ is determinant preserving, then the adjugate commutes with expectation for this matrix:

$$\mathbb{E}\big[\operatorname{adj}(\mathbf{A})\big] = \operatorname{adj}\!\big(\mathbb{E}[\mathbf{A}]\big). \tag{1}$$
We next give a few simple examples to provide some intuition. First, note that every $1\times 1$ random matrix is determinant preserving, simply because taking a determinant is an identity transformation in one dimension. Similarly, every fixed matrix is determinant preserving, because in this case taking the expectation is an identity transformation. In all other cases, however, Definition 4 has to be verified more carefully. Further examples (positive and negative) follow.
Example 1.
If $\mathbf{Z}$ has i.i.d. Gaussian entries $z_{ij}\sim\mathcal{N}(0,1)$, then $\mathbf{Z}$ is d.p. because $\mathbb{E}\big[\det(\mathbf{Z}_{\mathcal{I},\mathcal{J}})\big] = 0 = \det\!\big(\mathbb{E}[\mathbf{Z}_{\mathcal{I},\mathcal{J}}]\big)$.
In fact, it can be shown that all random matrices with independent entries are determinant preserving. However, this is not a necessary condition.
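Example 1 is easy to probe by Monte Carlo (an illustration of ours): expanding the determinant of a matrix with independent zero-mean entries gives a sum of products of independent zero-mean factors, so the average determinant should hover around zero:

```python
import numpy as np

rng = np.random.default_rng(5)
trials = 100_000

# E[det(Z)] for a 3x3 matrix of i.i.d. N(0, 1) entries: every term in the
# permutation expansion is a product of independent zero-mean variables.
dets = [np.linalg.det(rng.standard_normal((3, 3))) for _ in range(trials)]
mc = np.mean(dets)

# Consistent with Example 1: E[det(Z)] = 0 = det(E[Z]).
assert abs(mc) < 0.05
```

The same averaging argument applies to every square submatrix, which is what Definition 4 requires.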
Example 2.
Let $\mathbf{A} = s\,\mathbf{B}$, where $\mathbf{B}$ is fixed with $\operatorname{rank}(\mathbf{B}) \ge 2$, and $s$ is a scalar random variable. Then for submatrices of size $k$ we have
$$\mathbb{E}\big[\det(s\,\mathbf{B}_{\mathcal{I},\mathcal{J}})\big] = \mathbb{E}[s^k]\,\det(\mathbf{B}_{\mathcal{I},\mathcal{J}}),$$
so if $s$ is deterministic then $\mathbf{A}$ is determinant preserving, whereas if $\operatorname{Var}[s] > 0$ then it is not (this already fails for $k = 2$, since $\mathbb{E}[s^2] > \mathbb{E}[s]^2$).
To construct more complex examples, we show that determinant preserving random matrices are closed under addition and multiplication. The proof of this result is an extension of an argument given by [DM19] (Lemma 7) for computing the expected determinant of a sum of rank-1 random matrices.
Lemma 3.
If $\mathbf{A}$ and $\mathbf{B}$ are independent and d.p., then $\mathbf{A}+\mathbf{B}$ and $\mathbf{A}\mathbf{B}$ are also determinant preserving.
Proof.
First, we show that $\mathbf{A} + \mathbf{u}\mathbf{v}^\top$ is d.p. for fixed vectors $\mathbf{u}$ and $\mathbf{v}$. Below, we use a standard identity for the rank-one update of a determinant: $\det(\mathbf{A} + \mathbf{u}\mathbf{v}^\top) = \det(\mathbf{A}) + \mathbf{v}^\top\operatorname{adj}(\mathbf{A})\,\mathbf{u}$. It follows that for any $\mathcal{I}$ and $\mathcal{J}$ of the same size,

where the marked step used (1), i.e., the fact that for d.p. matrices, the adjugate commutes with expectation. Crucially, through the definition of the adjugate, this step implicitly relies on the assumption that all the square submatrices of $\mathbf{A}$ are also determinant preserving. Iterating this, we get that $\mathbf{A} + \mathbf{B}$ is d.p. for any fixed matrix $\mathbf{B}$. We now show the same for random $\mathbf{B}$:

where the marked step uses the fact that after conditioning on $\mathbf{B}$ we can treat it as a fixed matrix. Next, we show that $\mathbf{A}\mathbf{B}$ is determinant preserving via the Cauchy-Binet formula:

where recall that $\mathbf{B}_{\mathcal{S}}$ denotes the submatrix of $\mathbf{B}$ consisting of its (entire) rows indexed by $\mathcal{S}$. ∎
Finally, we introduce another important class of d.p. matrices: a sum of i.i.d. rank-1 random matrices, where the number of i.i.d. samples is a Poisson random variable. Our use of the Poisson distribution is crucial for the result below to hold. It is an extension of an expectation formula given by [Der19] for sampling from discrete distributions.
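The reason the Poisson distribution works here is a standard fact about its factorial moments: for $K\sim\operatorname{Poisson}(\gamma)$, $\mathbb{E}\big[K(K-1)\cdots(K-k+1)\big] = \gamma^k$, which is exactly the factor needed to turn the $n!/(n-k)!$ counting terms of a fixed-size design into powers of the expected sample size. A quick numerical check (ours):

```python
import numpy as np

rng = np.random.default_rng(6)
gamma, trials = 3.0, 400_000
ks = rng.poisson(gamma, size=trials).astype(float)

# Factorial moments of a Poisson variable: E[K (K-1) ... (K-k+1)] = gamma^k.
for k in (1, 2, 3):
    falling = np.ones(trials)
    for j in range(k):
        falling *= ks - j          # accumulate the falling factorial (K)_k
    assert abs(np.mean(falling) - gamma ** k) < 0.1 * gamma ** k
```

No other nonnegative integer distribution has all factorial moments equal to powers of its mean, which is why the Poisson sample size is singled out.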
Lemma 4.
If $K$ is a Poisson random variable and $\mathbf{X}$, $\tilde{\mathbf{X}}$ are random $K\times d$ matrices whose rows are sampled as an i.i.d. sequence of joint pairs of random vectors, then $\mathbf{X}^\top\tilde{\mathbf{X}}$ is d.p., and in particular,
To prove the above result, we will use the following lemma, many variants of which appeared in the literature (e.g., [vdV65]). We use the one given by [DWH19a].
Lemma 5 ([DWH19a]).
If the rows of random matrices are sampled as an i.i.d. sequence of pairs of joint random vectors, then
(2) 
Here, we use the following standard shorthand: . Note that the above result almost looks like we are claiming that the matrix is d.p., but in fact it is not because . The difference in those factors is precisely what we are going to correct with the Poisson random variable. We now present the proof of Lemma 4.
Proof.
Finally, we use Lemma 4 combined with Lemma 3 to show the expectation formula needed for obtaining the normalization constant of the underdetermined surrogate design (proven by setting $\tilde{\mathbf{X}} = \mathbf{X}$). Note that the result below is more general than the normalization constant requires, because it allows the matrices $\mathbf{X}$ and $\tilde{\mathbf{X}}$ to be different. In fact, we will use this more general statement later in our analysis.
Lemma 6.
If $K$ is a Poisson random variable and $\mathbf{X}$, $\tilde{\mathbf{X}}$ are random matrices whose rows are sampled as an i.i.d. sequence of joint pairs of random vectors, then
Proof.
By Lemma 4, the matrix $\mathbf{X}^\top\tilde{\mathbf{X}}$ is determinant preserving. Applying Lemma 3, we conclude that its sum with a fixed matrix is also d.p., so
where the second equality is known as Sylvester’s Theorem. We now use the following standard determinantal formula.
Lemma 7 ([KT12]).
For any matrices we have .
We rewrite the expectation of by applying the lemma. Letting , we obtain:
where the marked step follows from the exchangeability of the rows of $\mathbf{X}$ and $\tilde{\mathbf{X}}$, which implies that the distribution of the corresponding submatrices is the same for all subsets $\mathcal{S}$ of a fixed size. ∎
4 Expectation formulas for surrogate designs
In this section, we prove a number of expectation formulas for determinantal surrogate designs, which we then use to prove Theorems 1 and 2. In the process, we derive closed form expressions for the expectation of the orthogonal projection onto the subspace spanned by the rows of $\bar{\mathbf{X}}$, and for the expected trace of the pseudoinverse of the sample covariance matrix. To our knowledge, neither of these quantities admits a closed form expression for standard i.i.d. random designs such as Gaussian with general covariance (except for the isotropic case).
4.1 Proof of Theorem 1
Let $\mathbf{y}$ follow the homoscedastic noise model with variance $\sigma^2$ (Assumption 1). Recall that we have $\widehat{\mathbf{w}} = \bar{\mathbf{X}}^+\mathbf{y}$ and $\mathbf{y} = \bar{\mathbf{X}}\mathbf{w}^* + \boldsymbol{\xi}$, where $\boldsymbol{\xi}$ has mean $\mathbf{0}$ and covariance $\sigma^2\mathbf{I}$. A standard decomposition of the MSE of the Moore-Penrose estimator proceeds as follows:

Thus, our task is to find closed form expressions for the two expectations above. If $n \ge d$, then the latter vanishes, because when $\bar{\mathbf{X}}$ has full column rank, $\bar{\mathbf{X}}^+\bar{\mathbf{X}} = \mathbf{I}$. When $n < d$, this expectation is given in the following result.
Lemma 8.
If $\bar{\mathbf{X}}\sim S_\mu^n$ and $n \le d$, then we have
The proof of Lemma 8 is deferred to Section 4.2 because it follows as a corollary of a more general result (Lemma 11). We next derive the second expectation needed to compute the MSE. The under and overdetermined cases are proven separately, starting with the former.
Lemma 9.
If $\bar{\mathbf{X}}\sim S_\mu^n$ for $n < d$, then we have
Proof.
Let $\bar{\mathbf{X}}\sim S_\mu^n$ for $n < d$. Note that if then, using the fact that for any invertible matrix , we can write:

where is a shorthand for . Assumption 2 ensures that , which allows us to write: