Surprises in High-Dimensional Ridgeless Least Squares Interpolation

03/19/2019 · by Trevor Hastie, et al.

Interpolators -- estimators that achieve zero training error -- have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum ℓ_2-norm interpolation in high-dimensional linear regression. Motivated by the connection with overparametrized neural networks, we consider the case of random features. We study two distinct models for the features' distribution: a linear model in which the feature vectors x_i∈ R^p are obtained by applying a linear transform to vectors of i.i.d. entries, x_i = Σ^1/2 z_i (with z_i∈ R^p); a nonlinear model, in which the features are obtained by passing the input through a random one-layer neural network, x_i = φ(W z_i) (with z_i∈ R^d, and φ an activation function acting independently on the coordinates of W z_i). We recover -- in a precise quantitative way -- several phenomena that have been observed in large-scale neural networks and kernel machines, including the 'double descent' behavior of the generalization error and the potential benefit of overparametrization.


1 Introduction

Modern deep learning models involve a huge number of parameters. In nearly all applications of these models, current practice suggests that we should design the network to be sufficiently complex so that the model (as trained, typically, by gradient descent) interpolates the data, i.e., achieves zero training error. Indeed, in a thought-provoking experiment, Zhang et al. (2016) showed that state-of-the-art deep neural network architectures can be trained to interpolate the data even when the actual labels are replaced by entirely random ones.

Despite their enormous complexity, deep neural networks are frequently seen to generalize well, in meaningful practical problems. At first sight, this seems to defy conventional statistical wisdom: interpolation (vanishing training error) is usually taken to be a proxy for overfitting or poor generalization (large gap between training and test error). In an insightful series of papers, Belkin et al. (2018b, c, a) pointed out that these concepts are, in general, distinct, and interpolation does not contradict generalization. For example, kernel ridge regression is a relatively well-understood setting in which interpolation can coexist with good generalization (Liang and Rakhlin, 2018).

In this paper, we examine the prediction risk of minimum norm or "ridgeless" least squares regression, under different models for the features. A skeptical reader might ask what least squares has to do with neural networks. To motivate our study, we appeal to a line of work that draws concrete connections between these two frameworks (Jacot et al., 2018; Du et al., 2018b; Allen-Zhu et al., 2018; Du et al., 2018a). Following Chizat and Bach (2018b), consider a general nonlinear model f(z; θ) that relates responses y_i ∈ R and inputs z_i ∈ R^d, i = 1, …, n, in terms of a parameter vector θ ∈ R^p (while we have in mind a neural network, the setting here is actually quite general). In some contexts, the number of parameters p is so large that training effectively changes each of them by just a small amount with respect to a random initialization θ_0. It thus makes sense to linearize the model around θ_0. Furthermore, supposing that the initialization is such that f(z; θ_0) ≈ 0, and letting β = θ − θ_0, we obtain

    f(z_i; θ) ≈ ∇_θ f(z_i; θ_0)^⊤ β ≡ x_i^⊤ β.   (1)

We are therefore led to consider a linear regression problem, with random features x_i = ∇_θ f(z_i; θ_0), i = 1, …, n, of high dimensionality (p much greater than n). Notice that the features are random because of the random initialization. In this setting, many vectors β give rise to a model that interpolates the data. However, using gradient descent on the least squares objective for training yields a special interpolating parameter (having implicit regularity): the minimum ℓ_2-norm least squares solution.
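As a concrete (if toy) illustration of this linearization, the following numpy sketch builds the gradient features x_i = ∇_θ f(z_i; θ_0) for a small one-hidden-layer tanh network at a random initialization and fits min-norm least squares on them. The architecture, dimensions, and responses below are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 10, 200                 # samples, input dimension, hidden units

Z = rng.standard_normal((n, d))       # inputs z_i
W0 = rng.standard_normal((m, d)) / np.sqrt(d)   # first-layer weights at initialization
a0 = rng.standard_normal(m) / np.sqrt(m)        # second-layer weights at initialization

def grad_features(z):
    # Gradient of f(z; W, a) = a^T tanh(W z) with respect to (W, a), at (W0, a0).
    h = np.tanh(W0 @ z)
    grad_a = h                                           # df/da_j = tanh(w_j^T z)
    grad_W = (a0 * (1 - h ** 2))[:, None] * z[None, :]   # df/dW_jk = a_j tanh'(w_j^T z) z_k
    return np.concatenate([grad_W.ravel(), grad_a])

X = np.stack([grad_features(z) for z in Z])   # n x p feature matrix, p = m*d + m = 2200
y = rng.standard_normal(n)                    # placeholder responses (illustrative)

beta_hat = np.linalg.pinv(X) @ y              # min-norm least squares on the gradient features
print(X.shape, np.allclose(X @ beta_hat, y))  # p >> n, so the fit interpolates the data
```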

We consider two different models for the features.

  • Linear model. Here each x_i = Σ^{1/2} z_i, where z_i ∈ R^p has i.i.d. entries with zero mean and unit variance, and Σ ∈ R^{p×p} is deterministic and positive definite. This is a standard model in random matrix theory and we leverage known results from that literature to obtain a fairly complete asymptotic characterization of the out-of-sample prediction risk. In particular, we recover a number of phenomena that have been observed in kernel regression and neural networks (summary in the next subsection).

  • Nonlinear model. Here each x_i = φ(W z_i), where z_i ∈ R^d has i.i.d. standard Gaussian entries, W ∈ R^{p×d} has i.i.d. Gaussian entries, and φ is an activation function acting componentwise. This corresponds to a two-layer neural network with random weights at the first layer, and weights at the second layer given by the regression coefficients β. This model was introduced by Rahimi and Recht (2008) as a randomized approach for scaling kernel methods to large datasets. But (relative to the linear model described above) it is far less studied in the random matrix theory literature. We prove a new asymptotic result that allows us to characterize the prediction variance. The result is remarkably close to the result for the linear model (and in some cases, identical to it), despite the fact that d can be much smaller than p.

Because of its simplicity, the linear model abstracts away some interesting properties of the model (1), specifically, the distinction between the input dimension d and the number of parameters p. But, also due to its simplicity, we are able to develop a fairly complete picture of the asymptotic risk (which shows the effect of correlations between the features, the signal-to-noise ratio, the strength of model misspecification, etc.). The fact that the two models give closely related results (for the variance component of the risk) supports the hope that the linear model can still provide useful insights into the model (1).

1.1 Summary of results

In what follows, we study the out-of-sample prediction risk of the minimum norm (or min-norm, for short) least squares estimator, in an asymptotic setting where both the number of samples n and the number of features p diverge, n, p → ∞, and their ratio converges to a nonzero constant, p/n → γ ∈ (0, ∞). When γ < 1, we call the problem underparametrized, and when γ > 1, we call it overparametrized. Below we summarize our main results on the asymptotic risk. Denote by β ∈ R^p and Σ ∈ R^{p×p} the underlying signal and feature covariance matrix, respectively, and by SNR the signal-to-noise ratio (defined precisely in Section 3.3). Note that all but the last point below pertain to the linear model; also, see Figure 1 for supporting plots of the asymptotic risk curves for different cases of interest.

Figure 1: Asymptotic risk curves for the linear feature model, as a function of the limiting aspect ratio γ = p/n. The risks for min-norm least squares, when SNR = 1 and SNR = 5, are plotted in black and red, respectively. These two match for γ < 1 but differ for γ > 1. The null risks for SNR = 1 and SNR = 5 are marked by the dotted black and red lines, respectively. The risk for the misspecified case (with significant approximation bias) is plotted in green. The risk for optimally-tuned ridge regression, in the same misspecified setup, is plotted in blue. The points denote finite-sample risks, across various values of γ, computed from features having i.i.d. standard Gaussian entries. Meanwhile, the "x" points mark finite-sample risks for a nonlinear feature model, x_i = φ(W z_i), where z_i has i.i.d. standard Gaussian entries, W has i.i.d. Gaussian entries, and φ is a "purely nonlinear" activation function. The theory predicts that the nonlinear risk here should converge to that in the linear model case with p features (regardless of d). The general agreement between finite-sample and asymptotic risks is excellent, and the similarity between the linear and nonlinear cases is striking.
  0. In the underparametrized regime (γ < 1), the risk is purely variance (there is no bias), and does not depend on the signal or on Σ (see Theorem 1). Moreover, the risk diverges as we approach the interpolation boundary (as γ → 1).

  1. In the overparametrized regime (γ > 1), the risk is composed of both bias and variance, and generally depends on the signal and on Σ (see Theorem 4, Corollary 1, Appendix A.5).

  2. When SNR ≤ 1, the risk is monotonically decreasing over γ ∈ (1, ∞), and approaches the null risk (from above) as γ → ∞ (see Section 3.3).

  3. When SNR > 1, the risk has a local minimum on γ ∈ (1, ∞), is better than the null risk for γ large enough, and approaches the null risk (from below) as γ → ∞ (see Section 3.3).

  4. For a misspecified model, when SNR > 1, the risk can attain its global minimum on γ ∈ (1, ∞) (when there is strong enough approximation bias, see Section 5.3).

  5. Optimally-tuned ridge regression dominates the min-norm least squares estimator in risk, across all values of γ and SNR, in both the well-specified and misspecified settings. For a misspecified model, the location of the global minimum of the optimally-tuned ridge risk is characterized in Section 6.

  6. For the nonlinear model, the asymptotics we obtain are still (surprisingly) exact (meaning, no hidden constants) (see Theorem 7). For the case of a "purely nonlinear" activation function, our theory reveals that the asymptotic risk, somewhat remarkably, matches that of the linear model with isotropic features (see Corollary 2).

A few remarks are in order. We number the first point above (which provides important context for the points that follow) as 0, because it is a known result that has appeared in various places in the random matrix theory literature. Points 3 through 5 are formally established for isotropic features, Σ = I, but qualitatively similar behavior can be seen for general Σ. Several of the arguments in this paper rely on more or less standard results in random matrix theory; even though the mathematics is standard, the insights, we believe, are new. An important exception is the result for the nonlinear model, which is not covered by the existing literature. In this setting, we derive a new asymptotic result, which is of independent interest (see Theorem 8).

1.2 Intuition and implications

We discuss some intuition behind and implications of our results.

Bias and variance.

The shape of the asymptotic risk curve for min-norm least squares is, of course, controlled by its components: bias and variance. In the overparametrized regime, the bias increases with γ, which is intuitive. When p > n, the min-norm least squares estimate of β is constrained to lie in the row space of X, the training feature matrix. This is a subspace of dimension n lying in a feature space of dimension p. Thus as γ = p/n increases, so does the bias, since this row space accounts for less and less of the ambient p-dimensional feature space.

Meanwhile, in the overparametrized regime, the variance decreases with γ. This may seem counterintuitive at first, because it says, in a sense, that the min-norm least squares estimator becomes more regularized as p grows. However, this too can be explained intuitively, as follows. As p grows, the minimum norm least squares solution—i.e., the minimum ℓ_2 norm solution to the linear system Xb = y, for a training feature matrix X and response vector y—will generally have decreasing ℓ_2 norm. Why? Compare two such linear systems: in each, we are asking for the min-norm solution to a linear system with the same y, but in one instance we are given more columns in X, so we can generally decrease the components of the solution (by distributing them over more columns), and achieve a smaller ℓ_2 norm. This can in fact be formalized asymptotically; see Theorem 2.
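The following small numpy sketch illustrates this point directly: for a fixed right-hand side y, the minimum norm solution of Xb = y tends to shrink in norm as columns are appended to X. The dimensions and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
y = rng.standard_normal(n)
X = rng.standard_normal((n, 60))              # start already overparametrized (p = 60 > n)

for extra in [0, 20, 100, 500]:
    Xp = np.hstack([X, rng.standard_normal((n, extra))])  # same y, more columns
    b = np.linalg.pinv(Xp) @ y                            # min-norm solution of Xp b = y
    print(Xp.shape[1], np.linalg.norm(b))                 # the norm shrinks as p grows
```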

Double descent.

Recently, Spigler et al. (2018) and Belkin et al. (2018a) pointed out a fascinating empirical trend where, for popular methods like neural networks and random forests, we can see a second bias-variance tradeoff in the out-of-sample prediction risk, beyond the interpolation limit. The resulting risk curve resembles a W shape, which these authors call a "double descent" risk curve. Our results in this paper formally verify that the double descent phenomenon can occur even in the simple and fundamental case of least squares regression. The appearance of the second U-shape in the risk, past the interpolation boundary (γ = 1), is explained by the opposing behavior of bias and variance as γ grows, as we discussed above.

In the misspecified case, the variance still decreases with γ (for the same reasons), but interestingly, the bias can now also decrease with γ, provided γ is not too large (not too far past the interpolation boundary). The intuition here is that in a misspecified model, some part of the true regression function is always unaccounted for, and adding features generally improves our approximation capacity. Our results show that this double descent phenomenon can be even more pronounced in the misspecified case (depending on the strength of the approximation bias), and that the risk can attain its global minimum past the interpolation limit.

In-sample prediction risk.

Our focus throughout this paper is out-of-sample prediction risk. It is reasonable to ask how the results would change if we instead look at in-sample prediction risk. In the data model (2), (3) we study, the in-sample prediction risk of the min-norm least squares estimator is (1/n) E[‖Xβ̂ − Xβ‖_2² ∣ X] = σ² min(n, p)/n (where we abbreviate X for the training feature matrix, and we are assuming that X has rank min(n, p)). The asymptotic in-sample prediction risk, as n, p → ∞ with p/n → γ, is therefore just σ² min(γ, 1). This does not in and of itself have an interesting behavior; it just ascends linearly in γ, and then becomes flat after γ = 1. Compare this to the much richer and more complex behavior exhibited by the limiting out-of-sample risk (see the curves in Figure 1, or (8), (14) for the precise mathematical forms in the well-specified and misspecified settings, respectively): their behaviors could not be more different. This serves as an important reminder that the former (in-sample prediction risk) is not always a good proxy for the latter (out-of-sample prediction risk). Although much of classical regression theory is based on the former (e.g., optimism, effective degrees of freedom, and covariance penalties), the latter is more broadly relevant to practice.
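For concreteness, here is a short derivation of this in-sample limit (a sketch under the assumption that X has rank min(n, p); the notation P_X for the projection onto the column space of X is introduced just for this display):

```latex
\[
\frac{1}{n}\,\mathbb{E}\big[\|X\hat\beta - X\beta\|_2^2 \,\big|\, X\big]
 = \frac{1}{n}\,\mathbb{E}\big[\|P_X \varepsilon\|_2^2 \,\big|\, X\big]
 = \sigma^2\,\frac{\operatorname{rank}(X)}{n}
 = \sigma^2\,\frac{\min(n,p)}{n}
 \;\longrightarrow\; \sigma^2 \min(\gamma, 1).
\]
```

Here P_X = X X^+ , and we used Xβ̂ = X X^+ y = Xβ + P_X ε, since Xβ lies in the column space of X.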

Interpolation versus regularization.

The min-norm least squares estimator can be seen as the limit of ridge regression as the tuning parameter tends to zero. It is also the convergence point of gradient descent run on the least squares loss. We would not in general expect the best-predicting ridge solution to be at the end of its regularization path. Our results, comparing min-norm least squares to optimally-tuned ridge regression, show that (asymptotically) this is never the case, and dramatically so near the interpolation boundary γ = 1. It is worth noting that early stopped gradient descent is known to be closely connected to ridge regularization, especially for least squares problems; in fact, the tight connection developed in Ali et al. (2019) suggests that optimally-stopped gradient descent will also often have better risk than min-norm least squares. In practice, of course, we would not have access to the optimal tuning parameter for ridge (optimal stopping for gradient descent), and we would rely on, e.g., cross-validation (CV). Will the CV-tuned ridge estimator (CV-stopped gradient descent) still outperform min-norm least squares? Empirical experiments suggest yes, but a formal analysis has not yet been pursued.

Historically, the debate between interpolation and regularization has been alive for the last 30 or so years. Support vector machines find maximum-margin decision boundaries, which often perform very well for problems where the Bayes error is close to zero. But for less-separated classification tasks, one needs to tune the cost parameter (Hastie et al., 2004). Relatedly, in classification, it is common to run boosting until the training error is zero. Similar to the connection between gradient descent and ridge regularization, the boosting path is tied to ℓ_1 regularization (Rosset et al., 2004; Tibshirani, 2015); again, we now know that boosting can overfit, and the number of boosting iterations should be regarded as a tuning parameter.

Virtues of nonlinearity.

Gradient descent is the method of choice in the deep learning community, where common practice is to run gradient descent until zero training error. Hence, heuristically, training a deep learning model with squared error loss by gradient descent is akin to finding the min-norm least squares solution in an adaptively-learned high-dimensional feature space. Turning to rigor, however, it is a priori unclear whether success in precisely analyzing the linear model setting should carry over to the nonlinear setting. Remarkably, under high-dimensional asymptotics with n, d, p → ∞, and with d/n and p/n converging to nonzero constants, this turns out to be the case. Even more surprising, the relevant dimensions-to-samples ratio is given by p/n (not d/n): for "purely nonlinear" activations (these are activations whose linear component vanishes, with respect to a suitable inner product), the results from the linear model with p isotropic features remain asymptotically exact. In other words, each component of the feature vector x_i = φ(W z_i) behaves "as if" it were independent from the others, even when d is much smaller than p.
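As a rough numerical illustration of this claim (not the paper's experiment), the sketch below estimates the prediction variance of min-norm least squares on purely nonlinear random features φ(W z), with φ(t) proportional to |t| minus its Gaussian mean, and compares it to σ²/(γ − 1), the value for p isotropic linear features. All dimensions and the activation normalization are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, p, sigma = 400, 100, 800, 1.0           # gamma = p / n = 2, with d << p
gamma = p / n

c = np.sqrt(2 / np.pi)                        # E|G| for G ~ N(0, 1)
def phi(t):                                   # zero mean, unit variance, no linear component
    return (np.abs(t) - c) / np.sqrt(1 - 2 / np.pi)

W = rng.standard_normal((p, d)) / np.sqrt(d)  # random first-layer weights

def features(num):
    Z = rng.standard_normal((num, d))         # inputs z_i
    return phi(Z @ W.T)                       # nonlinear random features phi(W z_i)

X = features(n)                               # training features (n x p)
X_test = features(2000)                       # fresh test features

reps, var_est = 20, 0.0
for _ in range(reps):                         # pure-noise responses isolate the variance term
    eps = sigma * rng.standard_normal(n)
    beta_hat = np.linalg.pinv(X) @ eps
    var_est += np.mean((X_test @ beta_hat) ** 2) / reps

print(var_est, sigma**2 / (gamma - 1))        # roughly comparable, despite d << p
```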

1.3 Related work

The present work connects to and is motivated by the recent interest in interpolators in machine learning (Belkin et al., 2018b, a; Liang and Rakhlin, 2018; Belkin et al., 2018c; Geiger et al., 2019). Several authors have argued that minimum norm least squares regression captures the behavior of deep neural networks, at least in an early (lazy) training regime (Jacot et al., 2018; Du et al., 2018a, b; Allen-Zhu et al., 2018; Chizat and Bach, 2018b; Lee et al., 2019). The connection between neural networks and kernel ridge regression arises when the number of hidden units diverges. The same limit was also studied (beyond the linearized regime) by Mei et al. (2018); Rotskoff and Vanden-Eijnden (2018); Sirignano and Spiliopoulos (2018); Chizat and Bach (2018a).

For the linear model, our risk result in the general Σ case basically follows by taking a limit (as the ridge tuning parameter tends to zero) in the ridge regression result of Dobriban and Wager (2018). For the Σ = I case, our analysis bears similarities to the ridge regression analysis of Dicker (2016) (though we manage to avoid the assumption of Gaussianity of the features, by invoking a local version of the Marchenko-Pastur theorem). Moreover, our discussion of min-norm least squares versus ridge regression is somewhat related to the "regimes of learning" problem studied by Liang and Srebro (2010); Dobriban and Wager (2018).

For the nonlinear model, the random matrix theory literature is much sparser, and focuses on the related model of kernel random matrices, namely, symmetric matrices whose entries are a fixed nonlinear function of the inner products of the data points. El Karoui (2010) studied the spectrum of such matrices in a regime in which the nonlinearity can be approximated by a linear function, and hence the spectrum converges to a rescaled Marchenko-Pastur law. This approximation does not hold for the regime of interest here, which was studied instead by Cheng and Singer (2013) (who determined the limiting spectral distribution) and Fan and Montanari (2015) (who characterized the extreme eigenvalues). The resulting eigenvalue distribution is the free convolution of a semicircle law and a Marchenko-Pastur law. In the current paper, we must consider asymmetric (rectangular) matrices of the form φ(ZW^⊤) (with the activation applied entrywise), whose singular value distribution was recently computed by Pennington and Worah (2017), using the moment method. Unfortunately, the prediction variance does not depend uniquely on the singular values but also on the singular vectors of this matrix. We use the method of Cheng and Singer (2013) (applied to a suitable block-structured matrix) to address this point.

1.4 Outline

Section 2 provides important background. Sections 3–6 consider the linear model case, focusing on isotropic features, correlated features, misspecified models, and ridge regularization, respectively. Section 7 covers the nonlinear model case. Nearly all proofs are deferred until the appendix.

2 Preliminaries

We describe our setup and gather a number of important preliminary results.

2.1 Data model and risk

Assume we observe training data (x_i, y_i) ∈ R^p × R, i = 1, …, n, from a model

    (x_i, ε_i) ∼ P_x × P_ε,   (2)
    y_i = x_i^⊤ β + ε_i,   (3)

where the random draws across i = 1, …, n are independent. Here, P_x is a distribution on R^p such that E(x_i) = 0 and Cov(x_i) = Σ, and P_ε is a distribution on R such that E(ε_i) = 0 and Var(ε_i) = σ². We collect the responses in a vector y ∈ R^n, and the features in a matrix X ∈ R^{n×p} (with rows x_i, i = 1, …, n).

Consider a test point x_0 ∼ P_x, independent of the training data. For an estimator β̂ (a function of the training data X, y), we define its out-of-sample prediction risk (or simply, risk) as

    R_X(β̂; β) = E[(x_0^⊤ β̂ − x_0^⊤ β)² ∣ X] = E[‖β̂ − β‖_Σ² ∣ X],

where ‖v‖_Σ² = v^⊤ Σ v. Note that our definition of risk is conditional on X (as emphasized by our notation R_X). Note also that we have the bias-variance decomposition

    R_X(β̂; β) = ‖E(β̂ ∣ X) − β‖_Σ² + Tr[Cov(β̂ ∣ X) Σ] = B_X(β̂; β) + V_X(β̂; β).

2.2 Ridgeless least squares

Consider the minimum norm (min-norm) least squares regression estimator, of y on X, defined by

    β̂ = (X^⊤X)^+ X^⊤ y,   (4)

where (X^⊤X)^+ denotes the Moore-Penrose pseudoinverse of X^⊤X. Equivalently, we can write

    β̂ = argmin { ‖b‖_2 : b minimizes ‖y − Xb‖_2² },

which justifies its name. An alternative name for (4) is the "ridgeless" least squares estimator, motivated by the fact that

    β̂ = lim_{λ→0+} β̂_λ,

where β̂_λ denotes the ridge regression estimator,

    β̂_λ = argmin_{b ∈ R^p} { (1/n) ‖y − Xb‖_2² + λ ‖b‖_2² },   (5)

which we can equivalently write as β̂_λ = (X^⊤X + nλI)^{-1} X^⊤ y.

When X has full column rank (equivalently, when X^⊤X is invertible), the min-norm least squares estimator reduces to β̂ = (X^⊤X)^{-1} X^⊤ y, the usual least squares estimator. When X has rank n, importantly, this estimator interpolates the training data: x_i^⊤ β̂ = y_i, for i = 1, …, n.
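A brief numerical sketch of these definitions (illustrative dimensions and tuning value): the pseudoinverse solution interpolates the data when rank(X) = n, and ridge regression with a tiny tuning parameter essentially recovers it.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, lam = 30, 100, 1e-8                     # overparametrized: p > n; tiny ridge parameter
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_min = np.linalg.pinv(X) @ y              # (X^T X)^+ X^T y, the min-norm solution (4)
beta_ridge = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)  # ridge as in (5)

print(np.allclose(X @ beta_min, y))           # interpolates the training data (rank(X) = n)
print(np.linalg.norm(beta_min - beta_ridge))  # small: ridge approaches min-norm as lam -> 0
```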

Lastly, the following is a well-known fact that connects the min-norm least squares solution to gradient descent (as referenced in the introduction).

Proposition 1.

Initialize β^(0) = 0, and consider running gradient descent on the least squares loss, yielding iterates

    β^(k) = β^(k−1) + t X^⊤ (y − X β^(k−1)), for k = 1, 2, 3, …,

where we take 0 < t ≤ 1/λ_max(X^⊤X) (and λ_max(X^⊤X) is the largest eigenvalue of X^⊤X). Then lim_{k→∞} β^(k) = β̂, the min-norm least squares solution in (4).

Proof.

The choice of step size guarantees that β^(k) converges to a least squares solution as k → ∞; call it β̃. Note that β^(1), β^(2), β^(3), … all lie in the row space of X; therefore β̃ must also lie in the row space of X; and the min-norm least squares solution is the unique least squares solution with this property. ∎
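A quick numerical check of Proposition 1 (illustrative sizes and iteration count): gradient descent started at zero, with the step size above, approaches the pseudoinverse solution.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 30, 100
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

t = 1.0 / np.linalg.eigvalsh(X.T @ X).max()   # step size t = 1 / lambda_max(X^T X)
beta = np.zeros(p)                            # initialize at zero
for _ in range(20000):
    beta = beta + t * X.T @ (y - X @ beta)    # gradient step on (1/2) ||y - X b||_2^2

beta_min = np.linalg.pinv(X) @ y
print(np.linalg.norm(beta - beta_min))        # near zero: iterates reach the min-norm solution
```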

2.3 Bias and variance

We recall expressions for the bias and variance of the min-norm least squares estimator, which are standard.

Lemma 1.

Under the model (2), (3), the min-norm least squares estimator (4) has bias and variance

    B_X(β̂; β) = β^⊤ Π Σ Π β   and   V_X(β̂; β) = (σ²/n) Tr(Σ̂^+ Σ),

where Σ̂ = X^⊤X/n is the (uncentered) sample covariance of X, and Π = I − Σ̂^+ Σ̂ is the projection onto the null space of X.

Proof.

As E(β̂ ∣ X) = Σ̂^+ Σ̂ β = (I − Π) β and Cov(β̂ ∣ X) = (σ²/n) Σ̂^+, the bias and variance expressions follow from plugging these into their respective definitions. ∎
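The expressions in Lemma 1 are easy to verify numerically; the sketch below compares them to a Monte Carlo estimate of the risk for isotropic features (so Σ = I and the risk is simply E[‖β̂ − β‖_2² ∣ X]). All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 50, 120, 1.0                    # overparametrized, Sigma = I
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
beta /= np.linalg.norm(beta)                  # ||beta||_2 = 1

Sigma_hat = X.T @ X / n
Pi = np.eye(p) - np.linalg.pinv(Sigma_hat) @ Sigma_hat   # projection onto null(X)
bias_formula = beta @ Pi @ beta                          # beta^T Pi Sigma Pi beta, with Sigma = I
var_formula = sigma**2 / n * np.trace(np.linalg.pinv(Sigma_hat))

risks = []
for _ in range(500):                          # Monte Carlo over the noise
    y = X @ beta + sigma * rng.standard_normal(n)
    beta_hat = np.linalg.pinv(X) @ y
    risks.append(np.sum((beta_hat - beta) ** 2))         # conditional risk when Sigma = I

print(bias_formula + var_formula, np.mean(risks))        # should roughly agree
```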

2.4 Underparametrized asymptotics

We consider an asymptotic setup where n, p → ∞, in such a way that p/n → γ ∈ (0, ∞). Recall that when γ < 1, we call the problem underparametrized; when γ > 1, we call it overparametrized. Here, we recall the risk of the min-norm least squares estimator in the underparametrized case. The rest of this paper focuses on the overparametrized case.

The following is a known result in random matrix theory, and can be found in Chapter 6 of Serdobolskii (2007), where the author traces it back to work by Girko and Serdobolskii from the 1990s through early 2000s. It can also be found in the wireless communications literature (albeit with minor changes in the presentation and conditions), see Chapter 4 of Tulino and Verdu (2004), where it is traced back to work by Verdu, Tse, and others from the late 1990s through early 2000s. Before stating the result, we recall that for a symmetric matrix A ∈ R^{p×p}, we define its spectral distribution as F_A(x) = (1/p) Σ_{i=1}^p 1{λ_i(A) ≤ x}, where λ_i(A), i = 1, …, p, are the eigenvalues of A.

Theorem 1.

Assume the model (2), (3), and assume x_i is of the form x_i = Σ^{1/2} z_i, for a random vector z_i ∈ R^p with i.i.d. entries that have zero mean, unit variance, and bounded 4th moment, and a deterministic positive definite matrix Σ ∈ R^{p×p}. As n, p → ∞, assume that the spectral distribution F_Σ converges weakly to a measure H. Then as n, p → ∞, such that p/n → γ < 1, the risk of the least squares estimator (4) satisfies, almost surely,

    R_X(β̂; β) → σ² γ / (1 − γ).

Proof.

Recall the bias and variance results from Lemma 1. We may assume without loss of generality that Σ̂ is almost surely invertible (it suffices to compute the limit when z_i is Gaussian, because the Marchenko-Pastur theorem tells us that the limit will be the same for any Σ̂ = Σ^{1/2}(Z^⊤Z/n)Σ^{1/2}, where Z has i.i.d. entries with zero mean and unit variance; in the Gaussian case, Z^⊤Z has a Wishart distribution, and is almost surely invertible when p < n, e.g., see von Rosen (1988)), therefore Π = 0 and B_X(β̂; β) = 0. In addition, writing X = Z Σ^{1/2}, so that Tr(Σ̂^{-1} Σ) = Tr((Z^⊤Z/n)^{-1}),

    V_X(β̂; β) = (σ²/n) Tr((Z^⊤Z/n)^{-1}) = σ² (p/n) ∫ (1/s) dF_n(s),

where F_n is the spectral measure of Z^⊤Z/n. Now apply the Marchenko-Pastur theorem (Marchenko and Pastur, 1967; Silverstein, 1995), which says that F_n converges weakly, almost surely, to the Marchenko-Pastur law F_γ (depending only on γ). By the Portmanteau theorem, weak convergence is equivalent to convergence in expectation of all bounded functions that are continuous except on a set of zero probability under the limiting measure. Defining f_a(s) = (1/s) 1{s ≥ a}, for a constant 0 < a < (1 − √γ)², it follows that as n, p → ∞, almost surely,

    ∫ f_a(s) dF_n(s) → ∫ f_a(s) dF_γ(s) = ∫ (1/s) dF_γ(s).

Notice we have left the indicator out of the integral on the right-hand side, as the support of the Marchenko-Pastur law F_γ is [(1 − √γ)², (1 + √γ)²] (recall γ < 1, so the support is bounded away from zero). By the Bai-Yin theorem (Bai and Yin, 1993), the smallest eigenvalue of Z^⊤Z/n is almost surely larger than a for sufficiently large n, therefore the last display implies as n, p → ∞, almost surely,

    V_X(β̂; β) → σ² γ ∫ (1/s) dF_γ(s).   (6)

It remains to compute the right-hand side above. This can be done in various ways. One approach is to recognize the right-hand side as the evaluation of the Stieltjes transform of the Marchenko-Pastur law at 0. Fortunately, this has an explicit form (e.g., Lemma 3.11 in Bai and Silverstein 2010), for real z < (1 − √γ)²:

    m_{F_γ}(z) = ∫ 1/(s − z) dF_γ(s) = [(1 − γ − z) − √((1 − γ − z)² − 4γz)] / (2γz).   (7)

Since the limit as z → 0 is of the indeterminate form 0/0, we can use l'Hopital's rule to calculate:

    lim_{z→0} m_{F_γ}(z) = 1 / (1 − γ).

Plugging this into (6) completes the proof. ∎

2.5 Limiting norm

Using the same asymptotic setup as Theorem 1, it is straightforward to compute the limiting (expected, squared) ℓ_2 norm of the min-norm least squares estimator, for both γ < 1 and γ > 1. (The asymptotic risk when γ > 1 is a more difficult calculation, and is the focus of the next two sections.)

Theorem 2.

Assume the conditions of Theorem 1. Also assume that ‖β‖_2² = r² for all n, p. Then as n, p → ∞, such that p/n → γ, the expected squared ℓ_2 norm E‖β̂‖_2² of the min-norm least squares estimator (4) converges, almost surely, to an explicit finite limit, for both γ < 1 and γ > 1.

Proof.

Observe that E‖β̂‖_2² = ‖E(β̂ ∣ X)‖_2² + Tr(Cov(β̂ ∣ X)). When γ < 1, the first term is just r² by assumption, and the limit of the second term was already computed in the proof of Theorem 1; when γ > 1, the limit can be computed by using the fact that X^⊤X and XX^⊤ have the same nonzero eigenvalues; see the proof of Lemma 3. ∎

3 Isotropic features

For the next two sections, we focus on the limiting risk of the min-norm least squares estimator when γ > 1. In the overparametrized case, an important issue that we face is that of bias: B_X(β̂; β) is generally nonzero, because Π is. We consider two approaches to analyze the limiting bias. We assume throughout that x_i takes the form x_i = Σ^{1/2} z_i for a random vector z_i with i.i.d. entries that have zero mean and unit variance. In the first approach, considered in this section, we assume Σ = I, in which case the limiting bias is seen to depend on β only through its norm ‖β‖_2². In the second, considered in Section 4, we allow Σ to be general but place an isotropic prior on β, in which case the limiting bias is seen to depend on β only through its expected norm E‖β‖_2².

3.1 Limiting bias

In the next lemma, we compute the asymptotic bias for isotropic features, where we will see that it depends only on ‖β‖_2² (and γ). To give some intuition as to why this is true, consider the special case where x_i has i.i.d. entries from N(0, 1). By rotational invariance, for any orthogonal matrix U ∈ R^{p×p}, the distribution of XU and X is the same, so B_X(β̂; β) = β^⊤ Π β is distributed as (Uβ)^⊤ Π (Uβ). Choosing U so that Uβ = ‖β‖_2 e_i, the ith standard basis vector, then averaging over i = 1, …, p, yields

    E[B_X(β̂; β)] = (‖β‖_2²/p) E[Tr(Π)] = ‖β‖_2² (1 − n/p).

As n/p → 1/γ with ‖β‖_2² = r², we see that B_X(β̂; β) → r² (1 − 1/γ), almost surely. As the next result shows, this is still true outside of the Gaussian case, provided the features are isotropic. The intuition is that an isotropic feature distribution, with i.i.d. components, will begin to look rotationally invariant in large samples, which is made precise by an isotropic local version of the Marchenko-Pastur theorem from Bloemendal et al. (2014). The proof of the next result is deferred to Appendix A.1.

Lemma 2.

Assume (2), (3), where x_i has i.i.d. entries with zero mean, unit variance, and bounded 4th moment. Assume that ‖β‖_2² = r² for all n, p. Then for the min-norm least squares estimator in (4), as n, p → ∞, such that p/n → γ > 1, its bias satisfies, almost surely,

    B_X(β̂; β) → r² (1 − 1/γ).

3.2 Limiting variance

The next lemma computes the limiting variance for isotropic features. As in Theorem 1, the calculation is a more or less standard one of random matrix theory (in fact, our proof reduces the calculation to that from Theorem 1).

Lemma 3.

Assume (2), (3), where x_i has i.i.d. entries with zero mean and unit variance. Then for the min-norm least squares estimator in (4), as n, p → ∞, such that p/n → γ > 1, its variance satisfies, almost surely,

    V_X(β̂; β) → σ² / (γ − 1).

Proof.

Recalling the expression for the variance from Lemma 1 (where now Σ = I), we have

    V_X(β̂; β) = (σ²/n) Tr(Σ̂^+) = (σ²/n) Σ_{i=1}^n 1/s_i,

where s_i, i = 1, …, n, are the nonzero eigenvalues of Σ̂ = X^⊤X/n. Let λ_i, i = 1, …, n, denote the eigenvalues of XX^⊤/p. Then we may write s_i = (p/n) λ_i, i = 1, …, n, and

    V_X(β̂; β) = σ² (n/p) ∫ (1/s) dG_n(s),

where G_n is the spectral measure of XX^⊤/p. Now as n, p → ∞ with p/n → γ > 1, we are back precisely in the setting of Theorem 1 (with the roles of n and p reversed), and by the same arguments, we may conclude that almost surely

    V_X(β̂; β) → σ² (1/γ) · 1/(1 − 1/γ) = σ² / (γ − 1),

completing the proof. ∎

3.3 Limiting risk

Putting together Lemmas 2 and 3 leads to the following result for isotropic features.

Theorem 3.

Assume the model (2), (3), where x_i has i.i.d. entries with zero mean, unit variance, and bounded 4th moment. Also assume that ‖β‖_2² = r² for all n, p. Then for the min-norm least squares estimator in (4), as n, p → ∞, such that p/n → γ > 1, it holds almost surely that

    R_X(β̂; β) → r² (1 − 1/γ) + σ² / (γ − 1).

Now write R(γ) for the asymptotic risk of the min-norm least squares estimator, as a function of the aspect ratio γ. Putting together Theorems 1 and 3, we have in the isotropic case,

    R(γ) = σ² γ/(1 − γ) for γ < 1, and R(γ) = r² (1 − 1/γ) + σ²/(γ − 1) for γ > 1.   (8)

On (0, 1), there is no bias, and the variance increases with γ; on (1, ∞), the bias increases with γ, and the variance decreases with γ. Below we discuss some further interesting aspects of this curve. Let SNR = r²/σ². Observe that the risk of the null estimator β̂ = 0 is r², which we hence call the null risk. The following facts are immediate from the form of the risk curve in (8); a small simulation illustrating them is sketched after the list below. See Figure 2 for an accompanying plot when SNR varies from 1 to 5.

Figure 2: Asymptotic risk curves in (8) for the min-norm least squares estimator, when SNR = r²/σ² varies from 1 to 5. For each value of SNR, the null risk r² is marked as a dotted line, and the points denote finite-sample risks, across various values of γ, computed from features having i.i.d. standard Gaussian entries.
  1. On (0, 1), the least squares risk is better than the null risk if and only if γ < SNR/(1 + SNR).

  2. On (1, ∞), when SNR ≤ 1, the min-norm least squares risk is always worse than the null risk. Moreover, it is monotonically decreasing, and approaches the null risk (from above) as γ → ∞.

  3. On (1, ∞), when SNR > 1, the min-norm least squares risk beats the null risk if and only if γ > SNR/(SNR − 1). Further, it has a local minimum at γ = √SNR/(√SNR − 1), and approaches the null risk (from below) as γ → ∞.
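The following sketch (illustrative parameters, SNR = 5) compares finite-sample risks of min-norm least squares against the asymptotic curve (8), reproducing the divergence near γ = 1 and the second descent for γ > 1.

```python
import numpy as np

rng = np.random.default_rng(6)
n, r2, sigma2 = 200, 5.0, 1.0                 # SNR = r2 / sigma2 = 5

def asymptotic_risk(g):                       # the curve in (8)
    return sigma2 * g / (1 - g) if g < 1 else r2 * (1 - 1 / g) + sigma2 / (g - 1)

for g in [0.2, 0.5, 0.9, 1.5, 3.0, 10.0]:
    p = int(g * n)
    beta = rng.standard_normal(p)
    beta *= np.sqrt(r2) / np.linalg.norm(beta)            # ||beta||_2^2 = r2
    risks = []
    for _ in range(20):
        X = rng.standard_normal((n, p))                   # isotropic features
        y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
        beta_hat = np.linalg.pinv(X) @ y
        risks.append(np.sum((beta_hat - beta) ** 2))      # risk when Sigma = I
    print(g, round(np.mean(risks), 2), round(asymptotic_risk(g), 2))
```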

4 Correlated features

We broaden the scope of our analysis from the last section, where we examined isotropic features. In this section, we take x_i to be of the form x_i = Σ^{1/2} z_i, where z_i is a random vector with i.i.d. entries that have zero mean and unit variance, and Σ is arbitrary (but still deterministic and positive definite). To make the analysis (i.e., the bias calculation) tractable, we introduce a prior for β:

    β ∼ P_β, with E(β) = 0 and Cov(β) = (r²/p) I.   (9)

We consider an integrated or Bayes risk,

    R_X(β̂) = E[R_X(β̂; β)],

where the expectation is over the prior in (9). We have the bias-variance decomposition

    R_X(β̂) = B_X(β̂) + V_X(β̂), where B_X(β̂) = E[B_X(β̂; β)] and V_X(β̂) = E[V_X(β̂; β)].

For the min-norm least squares estimator (4), its Bayes variance is as before, V_X(β̂) = (σ²/n) Tr(Σ̂^+ Σ), from Lemma 1 (because, as we can see, V_X(β̂; β) does not actually depend on β). Its Bayes bias is computed next.

4.1 Bayes bias

With the prior (9) in place (in which, note, E‖β‖_2² = r²), we have the following result for the Bayes bias.

Lemma 4.

Under the prior (9), and data model (2), (3), the min-norm least squares estimator (4) has Bayes bias

    B_X(β̂) = (r²/p) Tr(Σ Π).

Proof.

Using trace rotation, we can rewrite the bias as B_X(β̂; β) = Tr(Σ Π β β^⊤ Π). Taking an expectation over β, and using trace rotation again, gives B_X(β̂) = (r²/p) Tr(Σ Π²) = (r²/p) Tr(Σ Π), which is the desired result. ∎

4.2 Limiting risk

We compute the asymptotic risk for a general feature covariance Σ. Before stating the result, we recall that for a measure F supported on [0, ∞), its Stieltjes transform m_F(z), for any z outside the support of F, is defined by

    m_F(z) = ∫ 1/(s − z) dF(s).

Furthermore, the companion Stieltjes transform v_F(z) is defined by

    v_F(z) = γ m_F(z) − (1 − γ)/z

(this is the Stieltjes transform of the spectral law of the companion matrix XX^⊤/n, which shares its nonzero eigenvalues with Σ̂). The proof of the next result is found in Appendix A.2. The main work for calculating the asymptotic risk here was in fact already done by Dobriban and Wager (2018) (who in turn used a key result on trace functionals involving Σ from Ledoit and Peche 2011): these authors studied the asymptotic risk of ridge regression for general Σ, and the next result for min-norm least squares can be obtained by taking a limit in their result as the ridge parameter tends to zero (though some care is required in exchanging the limits as n, p → ∞ and as the ridge parameter tends to zero).

Theorem 4.

Assume the prior (9), and data model (2), (3). Assume x_i is of the form x_i = Σ^{1/2} z_i, for a random vector z_i with i.i.d. entries that have zero mean, unit variance, and bounded 12th moment, and a deterministic positive definite matrix Σ, with c ≤ λ_min(Σ) ≤ λ_max(Σ) ≤ C for all n, p and constants c, C > 0 (where λ_min(Σ), λ_max(Σ) denote the smallest and largest eigenvalues of Σ, respectively). As n, p → ∞, assume that F_Σ converges weakly to a measure H. For the min-norm least squares estimator (4), as n, p → ∞, with p/n → γ > 1, the Bayes risk R_X(β̂) converges, almost surely, to an explicit limit expressed in terms of γ, r², σ², v(0), and v'(0), where we abbreviate v(0) for the limit at zero of the companion Stieltjes transform of the empirical spectral distribution given by the Marchenko-Pastur theorem, and we write v'(0) for the corresponding limit of its derivative; both exist under our assumptions above.

It is not always possible to analytically evaluate v(0) or v'(0). But when Σ = I, the companion Stieltjes transform is available in closed form (see (41)), and a tedious but straightforward calculation, deferred to Appendix A.3, shows that the asymptotic risk from Theorem 4 reduces to that from Theorem 3 (as it should). The next subsection generalizes this result, by looking at covariance matrices with constant off-diagonals.

4.3 Equicorrelated features

As a corollary to Theorem 4, we consider a ρ-equicorrelation structure for Σ, for a constant ρ ∈ [0, 1), meaning that Σ_{ii} = 1 for all i, and Σ_{ij} = ρ for all i ≠ j. Interestingly, we recover the same asymptotic form for the variance as in the Σ = I case, but the bias is affected—in fact, helped—by the presence of correlation. In the proof, deferred to Appendix A.4, we leverage the Silverstein equation (Silverstein, 1995) to derive an explicit form for the companion Stieltjes transform when Σ has ρ-equicorrelation structure (by relating it to the transform when Σ = I).

Corollary 1.

Assume the conditions of Theorem 4, and moreover, assume Σ has ρ-equicorrelation structure for all n, p, and some ρ ∈ [0, 1). Then as n, p → ∞, with p/n → γ > 1, the Bayes variance converges, almost surely, to the same limit as in the Σ = I case, while the Bayes bias converges, almost surely, to a limit that is strictly smaller for ρ > 0 than its value at ρ = 0.

Figure 10, deferred until Appendix A.5, displays asymptotic risk curves when Σ has equicorrelation structure, as ρ varies from 0 to 0.75. This same section in the appendix details the computation of the asymptotic risk when we have a ρ-autoregressive structure for Σ, for a constant ρ ∈ [0, 1), meaning that Σ_{ij} = ρ^{|i−j|} for all i, j. Figure 10, also in Appendix A.5, displays the asymptotic risk curves in the autoregressive case, as ρ varies from 0 to 0.75.

We make one further point. Inspection of the asymptotic bias and variance curves individually (rather than the risk as a whole) reveals that in the autoregressive setting, both the bias and the variance depend on the correlation structure (cf. the equicorrelation setting in Corollary 1, where only the bias did). Figure 11, in Appendix A.5, shows that the bias improves as ρ increases, and the variance worsens as ρ increases.

5 Misspecified model

In this section, we consider a misspecified model, in which the regression function is still linear, but we observe only a subset of the features. Such a setting is more closely aligned with practical interest in interpolation: in many problems, we do not know the form of the regression function, and we generate features in order to improve our approximation capacity. Increasing the number of features past the point of interpolation (increasing γ past 1) can now decrease both bias and variance (i.e., not just the variance, as in the well-specified setting considered previously).

As such, the misspecified model setting also yields more interesting asymptotic comparisons between the γ < 1 and γ > 1 regimes. Recall that in Section 3.3, assuming isotropic features, we showed that when SNR > 1 the asymptotic risk can have a local minimum on (1, ∞). Of course, the risk function in (8) is globally minimized at γ = 0, which is a consequence of the fact that, in previous sections, we were assuming a well-specified linear model (3) at each γ, and trivially at γ = 0 there is no bias and no variance, and hence no risk. In a misspecified model, we will see that the story can be quite different, and the asymptotic risk can actually attain its global minimum on (1, ∞).

5.1 Data model and risk

Consider, instead of (2), (3), a data model

    (x_i, w_i, ε_i) ∼ P_(x,w) × P_ε,   (10)
    y_i = x_i^⊤ β + w_i^⊤ θ + ε_i,   (11)

where as before the random draws across i = 1, …, n are independent. Here, we partition the features according to (x_i, w_i) ∈ R^p × R^q, where the joint distribution P_(x,w) is such that E[(x_i, w_i)] = 0 and

    Cov((x_i, w_i)) = [ Σ_x  Σ_xw ; Σ_wx  Σ_w ].

We collect the features in a block matrix [X W] ∈ R^{n×(p+q)} (which has rows (x_i, w_i), i = 1, …, n). We presume that X is observed but W is unobserved, and focus on the min-norm least squares estimator exactly as before in (4), from the regression of y on X (not the full feature matrix [X W]).

Given a test point (x_0, w_0) ∼ P_(x,w), and an estimator β̂ (fit using X, y only, and not W), we define its out-of-sample prediction risk as

    R_X(β̂) = E[(x_0^⊤ β̂ − x_0^⊤ β − w_0^⊤ θ)² ∣ X].

Note that this definition is conditional on X, and we are integrating over the randomness not only in ε (the training errors), but in the unobserved features W, as well. The next lemma decomposes this notion of risk in a useful way.

Lemma 5.

Under the misspecified model (10), (11), for any estimator β̂ (a function of X, y only), the risk R_X(β̂) decomposes into the sum of two terms: a prediction-risk term of the same form as in the well-specified case (with a modified response distribution), plus a second term that does not depend on β̂.

Proof.

Simply add and subtract the relevant conditional mean of the unobserved contribution inside the square in the definition of R_X(β̂), then expand, and note that the cross term, conditional on X, has mean zero and hence vanishes.

The first term in the decomposition in Lemma 5 is precisely the risk that we studied previously in the well-specified case, except that the response distribution has changed (due to the presence of the middle term in (11)). We call the second term in Lemma 5 the misspecification bias. In general, computing the bias and variance in finite samples can be very difficult, owing to the potential complexities created by the middle term in (11). However, in some special cases—for example, when the observed and unobserved features are independent, or jointly Gaussian—we can precisely characterize the contribution of the middle term in (11) to the overall response distribution, and can then essentially leverage our previous results to characterize risk in the misspecified model setting. In what follows, we restrict our attention to the independence setting, for simplicity.

5.2 Isotropic features

When the observed and unobserved features are independent of each other, the middle term in (11) only adds a constant to the variance, and the analysis of the bias and variance becomes tractable. Here, we make the additional simplifying assumption that (x_i, w_i) has i.i.d. entries with unit variance, which implies that Σ_x = I and Σ_w = I. (The case of independent features but general covariances is similar, and we omit the details.) Therefore, we may write the response distribution in (11) as

    y_i = x_i^⊤ β + ε'_i,

where ε'_i is independent of x_i, having mean zero and variance σ² + ‖θ‖_2², for i = 1, …, n. Denote the total signal by r² = ‖β‖_2² + ‖θ‖_2², and the fraction of the signal captured by the observed features by κ = ‖β‖_2²/r². Then the first term in Lemma 5 behaves exactly as we computed previously, for isotropic features in the well-specified setting (Theorem 1 for γ < 1, and Theorem 3 for γ > 1), after we make the substitutions:

    r² → κ r²,  σ² → σ² + (1 − κ) r².   (12)

Furthermore, we can easily calculate the misspecification bias:

    E[(w_0^⊤ θ)²] = ‖θ‖_2² = (1 − κ) r².

Putting these results together leads to the next conclusion.

Theorem 5.

Assume the misspecified model (10), (11), with the observed and unobserved features independent of each other, and assume (x_i, w_i) has i.i.d. entries with zero mean, unit variance, and bounded 4th moment. Also assume that ‖β‖_2² = κ r² and ‖θ‖_2² = (1 − κ) r² for all n, p. Then for the min-norm least squares estimator in (4), as n, p → ∞, with p/n → γ, it holds almost surely that R_X(β̂) converges to the limit obtained from Theorems 1 and 3 after the substitutions (12), plus the misspecification bias (1 − κ) r².

We remark that, in the independence setting considered in Theorem 5, the dimension q of the unobserved feature space does not play any role, and the result only depends on the unobserved features via (1 − κ) r² = ‖θ‖_2². Therefore, we may equally well take