1 Introduction
Modern deep learning models involve a huge number of parameters. In nearly all applications of these models, current practice suggests that we should design the network to be sufficiently complex so that the model (as trained, typically, by gradient descent) interpolates the data, i.e., achieves zero training error. Indeed, in a thoughtprovoking experiment,
Zhang et al. (2016) showed that stateoftheart deep neural network architectures can be trained to interpolate the data even when the actual labels are replaced by entirely random ones.Despite their enormous complexity, deep neural networks are frequently seen to generalize well, in meaningful practical problems. At first sight, this seems to defy conventional statistical wisdom: interpolation (vanishing training error) is usually taken to be a proxy for overfitting or poor generalization (large gap between training and test error). In an insightful series of papers, Belkin et al. (2018b, c, a)
pointed out that these concepts are, in general, distinct, and interpolation does not contradict generalization. For example, kernel ridge regression is a relatively wellunderstood setting in which interpolation can coexist with good generalization
(Liang and Rakhlin, 2018).In this paper, we examine the prediction risk of minimum norm or “ridgeless” least squares regression, under different models for the features. A skeptical reader might ask what least squares has to do with neural networks. To motivate our study, we appeal to line of work that draws concrete connections between these two frameworks (Jacot et al., 2018; Du et al., 2018b; AllenZhu et al., 2018; Du et al., 2018a). Following Chizat and Bach (2018b), consider a general nonlinear model that relates responses and inputs , , in terms of a parameter vector (while we have in mind a neural network, the setting here is actually quite general). In some contexts, the number of parameters is so large that training effectively changes each of them by a just small amount with respect to a random initialization . It thus makes sense to linearize the model around . Furthermore, supposing that the initialization is such that , and letting , we obtain
(1) 
We are therefore led to consider a linear regression problem, with random features , , of highdimensionality ( much greater than ). Notice that the features are random because of the initialization. In this setting, many vectors give rise to a model that interpolates the data. However, using gradient descent on the least squares objective for training yields a special interpolating parameter (having implicit regularity): the minimum norm least squares solution.
We consider two different models for the features.

Linear model. Here each , where
has i.i.d. entries with zero mean and unit variance and
deterministic and positive definite. This is a standard model in random matrix theory and we leverage known results from that literature to obtain a fairly complete asymptotic characterization of the outofsample prediction risk. In particular, we recover a number of phenomena that have been observed in kernel regression and neural networks (summary in the next subsection).

Nonlinear model. Here each , where has i.i.d. entries from , has i.i.d. entries from , and is an activation function acting componentwise. This corresponds to a twolayer neural network with random weights at the first layer, and weights at the second layer given by the regression coefficients . This model was introduced by Rahimi and Recht (2008) as a randomized approach for scaling kernel methods to large datasets. But (relative to the linear model described above) it is far less studied in the random matrix theory literature. We prove a new asymptotic result that allows us to characterize the prediction variance. The result is remarkably close to the result for the linear model (and in some cases, identical to it), despite the fact that can be much smaller than .
Because of its simplicity, the linear model abstracts away some interesting properties of the model (1), specifically, the distinction between the input dimension and the number of parameters . But, also due to its simplicity, we are able to develop a fairly complete picture of the asymptotic risk (which shows the effect of correlations between the features, signaltonoise ratio, strength of model misspecification, etc.). The fact that the two models give closely related results (for the variance component of the risk) supports the hope that the linear model can still provide useful insights into the model (1).
1.1 Summary of results
In what follows, we study the outofsample prediction risk of the minimum norm (or minnorm, for short) least squares estimator, in an asymptotic setting where both the number of samples and features diverge, , and their ratio converges to a nonzero constant, . When , we call the problem underparametrized, and when , we call it overparametrized. Below we summarize our main results on the asymptotic risk. Denote by the underlying signal and feature covariance matrix, respectively, and by the signaltonoise ratio (defined precisely in Section 3.3). Note that all but the last point below pertain to the linear model; also, see Figure 1 for a supporting plots, of the asymptotic risk curves for different cases of interest.

In the underparametrized regime (), the risk is purely variance (there is no bias), and does not depend on (see Theorem 1). Moreover, the risk diverges as we approach the interpolation boundary (as ).

When , the risk is monotonically decreasing over , and approaches the null risk (from above) as (see Section 3.3).

When , the risk has a local minimum on , is better than the null risk for large enough , and approaches the null risk (from below) as (see Section 3.3).

For a misspecified model, when , the risk can attain its global minimum on (when there is strong enough approximation bias, see Section 5.3).

Optimallytuned ridge regression dominates the minnorm least squares estimator in risk, across all values of and , in both the wellspecified and misspecified settings. For a misspecified model, optimallytuned ridge regression attains its global minimum around (see Section 6).

For the nonlinear model, the asymptotics we obtain are still (surprisingly) exact (meaning, no hidden constants) (see Theorem 7). For the case of a “purely nonlinear” activation function, our theory reveals that the asymptotic risk, somewhat remarkably, matches that the linear model with isotropic features (see Corollary 2).
A few remarks are in order. We number the first point above (which provides important context for the points that follow) as 0, because it is a known result that has appeared in various places in the random matrix theory literature. Points 3 through 5 are formally established for isotropic features, , but qualitatively similar behavior can be seen for general . Several of the arguments in this paper rely on more or less standard results in random matrix theory; even though the mathematics is standard, the insights, we believe, are new. An important distinction is the result for nonlinear models, which is not covered by existing literature. In this setting, we derive a new asymptotic result, which is of independent interest (see Theorem 8).
1.2 Intuition and implications
We discuss some intuition behind and implications of our results.
Bias and variance.
The shape of the asymptotic risk curve for minnorm least squares is, of course, controlled by its components: bias and variance. In the overparametrized regime, the bias increases with , which is intuitive. When , the minnorm least squares estimate of is constrained to lie the row space of , the training feature matrix. This is a subspace of dimension lying in a feature space of dimension . Thus as increases, so does the bias, since this row space accounts for less and less of the ambient dimensional feature space.
Meanwhile, in the overparametrized regime, the variance decreases with . This may seem counterintuitive at first, because it says, in a sense, that the minnorm least squares estimator becomes more regularized as grows. However, this too can be explained intuitively, as follows. As grows, the minimum norm least squares solution—i.e., the minimum norm solution to the linear system , for a training feature matrix and response vector —will generally have decreasing norm. Why? Compare two such linear systems: in each, we are asking for the minnorm solution to a linear system with the same , but in one instance we are given more columns in , so we can generally decrease the components of (by distributing them over more columns), and achieve a smaller norm. This can in fact be formalized asymptotically, see Theorem 2.
Double descent.
Recently, Spigler et al. (2018) and Belkin et al. (2018a)
pointed out a fascinating empirical trend where, for popular methods like neural networks and random forests, we can see a
second biasvariance tradeoff in the outofsample prediction risk, beyond the interpolation limit. The resulting risk curve resembles a Wshaped curve, which these authors call a “double descent” risk curve. Our results in this paper formally verify that the double descent phenomenon can occur even in the simple and fundamental case of least squares regression. The appearance of the second Ushape in the risk, past the interpolation boundary (), is explained by the opposing behavior of bias and variance as grows, as we discussed above.In the misspecified case, the variance still decreases with (for the same reasons), but interestingly, the bias can now also decrease with , provided is not too large (not too far past the interpolation boundary). The intuition here is that in a misspecified model, some part of the true regression function is always unaccounted for, and adding features generally improves our approximation capacity. Our results show that this double descent phenomenon can be even more pronounced in the misspecified case (depending on the strength of the approximation bias), and that the risk can attain its global minimum past the interpolation limit.
Insample prediction risk.
Our focus throughout this paper is outofsample prediction risk. It is reasonable to ask how the results would change if we instead look at insample prediction risk. In the data model (2), (3) we study, the insample prediction risk of the minnorm least squares estimator is (where we abbreviate , and we are assuming that ). The asymptotic insample prediction risk, as , is therefore just . This does not in of itself have an interesting behavior; it just ascends linearly in , and then becomes flat after . Compare this to the much richer and more complex behavior exhibited by the limiting outofsample risk (see the curves in Figure 1, or (8), (14
) for the precise mathematical forms in the wellspecified and misspecified settings, respectively): their behaviors could not be more different. This serves as an important reminder that the former (insample prediction risk) is not always a good proxy for the latter (outofsample prediction risk). Although much of classical regression theory is based on the former (e.g., optimism, effective degrees of freedom, and covariance penalties), the latter is more broadly relevant to practice.
Interpolation versus regularization.
The minnorm least squares estimator can be seen as the limit of ridge regression as the tuning parameter tends to zero. It is also the convergence point of gradient descent run on the least squares loss. We would not in general expect the bestpredicting ridge solution to be at the end of its regularization path. Our results, comparing minnorm least squares to optimallytuned ridge regression, show that (asymptotically) this is never the case, and dramatically so near . It is worth noting that early stopped gradient descent is known to be closely connected to ridge regularization, especially for least squares problems; in fact, the tight connection developed in Ali et al. (2019) suggests that optimallystopped gradient descent will also often have better risk than minnorm least squares. In practice, of course, we would not have access to the optimal tuning parameter for ridge (optimal stopping for gradient descent), and we would rely on, e.g., crossvalidation (CV). Will the CVtuned ridge estimator (CVstopped gradient descent) still outperform minnorm least squares? Empirical experiments suggest yes, but a formal analysis has not yet been pursued.
Historically, the debate between interpolation and regularization has been alive for the last 30 or so years. Support vector machines find maximummargin decision boundaries, which often perform very well for problems where the Bayes error is close to zero. But for lessseparated classification tasks, one needs to tune the cost parameter
(Hastie et al., 2004). Relatedly, in classification, it is common to run boosting until the training error is zero. Similar to the connection between gradient descent and regularization, the boosting path is tied to regularization (Rosset et al., 2004; Tibshirani, 2015); again, we now know that boosting can overfit, and the number of boosting iterations should be regarded as a tuning parameter.Virtues of nonlinearity.
Gradient descent is the method of choice in the deep learning community, where common practice is to run gradient descent until zero training error. Hence, heuristically, training a deep learning model with squared error loss by gradient descent is akin to finding the minnorm least squares solution in an adaptivelylearned highdimensional feature space. Turning to rigor, however, it is a priori unclear whether success in precisely analyzing the linear model setting should carry over to the nonlinear setting. Remarkably, under highdimensional asymptotics with
, , and , this turns out to be the case. Even more surprising, the relevant dimensionstosamples ratio is given by (not ): for “purely nonlinear” activations (these are activations whose linear component vanishes, with respect to a suitable inner product), the results from the linear model with features remain asymptotically exact. In other words, each component of the feature vector behaves “as if” it was independent from the others, even when is much smaller than .1.3 Related work
The present work connects to and is motivated by the recent interest in interpolators in machine learning (Belkin et al., 2018b, a; Liang and Rakhlin, 2018; Belkin et al., 2018c; Geiger et al., 2019) Several auhors have argued that minimum norm least squares regression captures the behavior of deep neural networks, at least in an early (lazy) training regime (Jacot et al., 2018; Du et al., 2018b, b; AllenZhu et al., 2018; Chizat and Bach, 2018b; Lee et al., 2019). The connection between neural networks and kernel ridge regression arises when the number of hidden units diverges. The same limit was also studied (beyond the linearized regime) by Mei et al. (2018); Rotskoff and VandenEijnden (2018); Sirignano and Spiliopoulos (2018); Chizat and Bach (2018a).
For the linear model, our risk result in the general case basically follows by taking a limit (as the ridge tuning parameter tends to zero) in the ridge regression result of Dobriban and Wager (2018). For the
case, our analysis bears similarities to the ridge regression analysis of
Dicker (2016) (though we manage to avoid the assumption of Gaussianity of the features, by invoking a local version of the MarchenkoPastur theorem). Moreover, our discussion of minnorm least squares versus ridge regression is somewhat related to the “regimes of learning” problem studied by Liang and Srebro (2010); Dobriban and Wager (2018).For the nonlinear model, the random matrix theory literature is much sparser, and focuses on the related model of kernel random matrices, namely, symmetric matrices of the form . El Karoui (2010) studied the spectrum of such matrices in a regime in which can be approximated by a linear function (for ) and hence the spectrum converges to a rescaled MarchenkoPastur law. This approximation does not hold for the regime of interest here, which was studied instead by Cheng and Singer (2013) (who determined the limiting spectral distribution) and Fan and Montanari (2015)
(who characterized the extreme eigenvalues). The resulting eigenvalue distribution is the free convolution of a semicircle and a MarchenkoPastur law. In the current paper, we must consider asymmetric (rectangular) matrices
, whose singular value distribution was recently computed by
Pennington and Worah (2017), using the moment method. Unfortunately, the prediction variance does not depend uniquely on the singular values but also on the singular vectors of this matrix. We use the method of
Cheng and Singer (2013) (applied to a suitable blockstructured matrix) to address this point.1.4 Outline
2 Preliminaries
We describe our setup and gather a number of important preliminary results.
2.1 Data model and risk
Assume we observe training data from a model
(2)  
(3) 
where the random draws across are independent. Here, is a distribution on such that , , and is a distribution on such that , . We collect the responses in a vector , and the features in a matrix (with rows , ).
Consider a test point , independent of the training data. For an estimator (a function of the training data ), we define its outofsample prediction risk (or simply, risk) as
where . Note that our definition of risk is conditional on (as emphasized by our notation ). Note also that we have the biasvariance decomposition
2.2 Ridgeless least squares
Consider the minimum norm (minnorm) least squares regression estimator, of on , defined by
(4) 
where denotes the MoorePenrose pseudoinverse of . Equivalently, we can write
which justifies its name. An alternative name for (4) is the “ridgeless” least squares estimator, motivated by the fact that
where denotes the ridge regression estimator,
(5) 
which we can equivalently write as
When has full column rank (equivalently, when is invertible), the minnorm least squares estimator reduces to , the usual least squares estimator. When has rank , importantly, this estimator interpolates the training data: , for .
Lastly, the following is a wellknown fact that connects the minnorm least squares solution to gradient descent (as referenced in the introduction).
Proposition 1.
Initialize , and consider running gradient descent on the least squares loss, yielding iterates
where we take (and is the largest eigenvalue of ). Then , the minnorm least squares solution in (4).
Proof.
The choice of step size guarantees that converges to a least squares solution as , call it . Note that , all lie in the row space of ; therefore must also lie in the row space of ; and the minnorm least squares solution is the unique least squares solution with this property. ∎
2.3 Bias and variance
We recall expressions for the bias and variance of the minnorm least squares estimator, which are standard.
Lemma 1.
Proof.
As and , the bias and variance expressions follow from plugging these into their respective definitions. ∎
2.4 Underparametrized asymptotics
We consider an asymptotic setup where , in such a way that . Recall that when , we call the problem underparametrized; when , we call it overparametrized. Here, we recall the risk of the minnorm least squares estimator in the underparametrized case. The rest of this paper focuses on the overparametrized case.
The following is a known result in random matrix theory, and can be found in Chapter 6 of Serdobolskii (2007), where the author traces it back to work by Girko and Serdobolskii from the 1990s through early 2000s. It can also be found in the wireless communications literature (albeit with minor changes in the presentation and conditions), see Chapter 4 of Tulino and Verdu (2004), where it is traced back to work by Verdu, Tse, and others from the late 1990s through early 2000s. Before stating the result, we recall that for a symmetric matrix , we define its spectral distribution as , where , are the eigenvalues of .
Theorem 1.
Assume the model (2), (3), and assume is of the form , for a random vector with i.i.d. entries that have zero mean, unit variance, and bounded 4th moment, and a deterministic positive definite matrix . As , assume that the spectral distribution converges weakly to a measure . Then as , such that , the risk of the least squares estimator (4) satistifies, almost surely,
Proof.
Recall the bias and variance results from Lemma 1. We may assume without a loss of generality that is almost surely invertible^{1}^{1}1It suffices to compute the limit when is Gaussian, because the MarchenkoPastur theorem tells us that the limit will be the same for any such that , where has i.i.d. entries with zero mean and unit variance. In the Gaussian case, has a Wishart distribution, and is almost surely invertible when , e.g., see von Rosen (1988)., therefore and . In addition, writing ,
where is the spectral measure of . Now apply the MarchenkoPastur theorem (Marchenko and Pastur, 1967; Silverstein, 1995), which says that converges weakly, almost surely, to the MarchenkoPastur law (depending only on ). By the Portmanteau theorem, weak convergence is equivalent to convergence in expectation of all bounded functions
, that are continuous except on a set of zero probability under the limiting measure. Defining
, where , it follows that as , almost surely,Notice we have left the indicator out of the integral on the righthand side, as the support of the MarchenkoPastur law is , where . By the BaiYin theorem (Bai and Yin, 1993), the smallest eigenvalue of is almost surely larger than for sufficiently large , therefore the last display implies as , almost surely,
(6) 
It remains to compute the righthand side above. This can be done in various ways. One approach is to recognize the righthand side as the evaluation of the Stieltjes transform of MarchenkoPastur law at . Fortunately, this has an explicit form (e.g., Lemma 3.11 in Bai and Silverstein 2010), for real :
(7) 
Since the limit as is indeterminate, we can use l’Hopital’s rule to calculate:
Plugging this into (6) completes the proof. ∎
2.5 Limiting norm
Using the same asymptotic setup as Theorem 1, it is straightforward to compute the limiting (expected, squared) norm of the minnorm least squares estimator, for both and . (The asymptotic risk when is a more difficult calculation, and is the focus of the next two sections.)
Theorem 2.
3 Isotropic features
For the next two sections, we focus on the limiting risk of the minnorm least squares estimator when . In the overparametrized case, an important issue that we face is that of bias: is generally nonzero, because is. We consider two approaches to analyze the limiting bias. We assume throughout that takes the form for a random vector with i.i.d. entries that have zero mean and unit variance. In the first approach, considered in this section, we assume , in which case the limiting bias is seen to depend only on . In the second, considered in Section 4, we allow to be general but place an isotropic prior on , in which case the limiting bias is seen to depend only on .
3.1 Limiting bias
In the next lemma, we compute the asymptotic bias in for isotropic features, where we will see that it depends only on . To give some intuition as to why this is true, consider the special case where has i.i.d. entries from . By rotational invariance, for any orthogonal , the distribution of and is the same. Thus
Choosing so that , the th standard basis vector, then averaging over , yields
As with , we see that , almost surely. As the next result shows, this is still true outside of the Gaussian case, provided the features are isotropic. The intuition is that an isotropic feature distribution, with i.i.d. components, will begin to look rotationally invariant in large samples, which is made precise by an isotropic local version of the MarchenkoPastur theorem from Bloemendal et al. (2014). The proof of the next result is deferred to Appendix A.1.
3.2 Limiting variance
The next lemma computes the limiting variance for isotropic features. As in Theorem 1, the calculation is a more or less standard one of random matrix theory (in fact, our proof reduces the calculation to that from Theorem 1).
Lemma 3.
Proof.
Recalling the expression for the bias from Lemma 1 (where now ), we have
where , are the nonzero eigenvalues of . Let , denote the eigenvalues of . Then we may write , , and
where is the spectral measure of . Now as , we are back precisely in the setting of Theorem 1, and by the same arguments, we may conclude that almost surely
completing the proof. ∎
3.3 Limiting risk
Theorem 3.
Now write for the asymptotic risk of the minnorm least squares estimator, as a function of the aspect ratio . Putting together Theorems 1 and 3, we have in the isotropic case,
(8) 
On , there is no bias, and the variance increases with ; on , the bias increases with , and the variance decreases with . Below we discuss some further interesting aspects of this curve. Let . Observe that the risk of the null estimator is , which we hence call the null risk. The following facts are immediate from the form of the risk curve in (8). See Figure 2 for an accompanying plot when varies from 1 to 5.

On , the least squares risk is better than the null risk if and only if .

On , when , the minnorm least squares risk is always worse than the null risk. Moreover, it is monotonically decreasing, and approaches the null risk (from above) as .

On , when , the minnorm least squares risk beats the null risk if and only if . Further, it has a local minimum at , and approaches the null risk (from below) as .
4 Correlated features
We broaden the scope of our analysis from the last section, where we examined isotropic features. In this section, we take to be of the form , where is a random vector with i.i.d. entries that have zero mean and unit variance, and is arbitrary (but still deterministic and positive definite). To make the analysis (i.e, the bias calculation) tractable, we introduce a prior
(9) 
We consider an integrated or Bayes risk,
where the expectation is over the prior in (9). We have the biasvariance decomposition
For the minnorm least squares estimator (4), its Bayes variance is as before, , from Lemma (1) (because, as we can see, does not actually depend on ). Its Bayes bias is computed next.
4.1 Bayes bias
With the prior (9) in place (in which, note, ), we have the following result for the Bayes bias
Lemma 4.
Proof.
Using trace rotation, we can rewrite the bias as . Taking an expectation over , and using trace rotation again, gives , which is the desired result. ∎
4.2 Limiting risk
We compute the asymptotic risk for a general feature covariance . Before stating the result, we recall that for a measure supported on , we define its Stieltjes transform , any , by
Furthermore, the companion Stieltjes transform is defined by
The proof of the next result is found in Appendix A.2. The main work for calculating for the asymptotic risk here was in fact already done by Dobriban and Wager (2018) (who in turn used a key result on trace functionals involving from Ledoit and Peche 2011): these authors studied the asymptotic risk of ridge regression for general , and the next result for minnorm least squares can be obtained by taking a limit in their result as the ridge parameter tends to zero (though some care is required in exchanging limits as and ).
Theorem 4.
Assume the prior (9), and data model (2), (3). Assume is of the form , for a random vector with i.i.d. entries that have zero mean, unit variance, and bounded 12th moment, and a deterministic positive definite matrix , with , for all and constants (where denote the smallest and largest eigenvalues of , respectively). As , assume that converges weakly to a measure . For the minnorm least squares estimator (4), as , with , we have almost surely
where we abbreviate , the companion Stieltjes transform of the empirical spectral distribution given by the MarchenkoPastur theorem, and we write for its derivative. Also, we write to denote , and likewise , which exist under our assumptions above.
It is not always possible to analytically evaluate or . But when , the companion Stieltjes transform is available in closedform (41), and a tedious but straightforward calculation, deferred to Appendix A.3, shows that the asymptotic risk from Theorem 4 reduces to that from Theorem 3 (as it should). The next subsection generalizes this result, by looking at covariance matrices with constant offdiagonals.
4.3 Equicorrelated features
As a corollary to Theorem 4, we consider a equicorrelation structure for , for a constant , meaning that for all , and for all . Interestingly, we recover the same asymptotic form for the variance as in the case, but the bias is affected—in fact, helped—by the presence of correlation. In the proof, deferred to Appendix A.4, we leverage the Silverstein equation (Silverstein, 1995) to derive an explicit form for the companion Stieltjes transform when has equicorrelation structure (by relating it to the transform when ).
Corollary 1.
Assume the conditions of Theorem 4, and moreover, assume has equicorrelation structure for all , and some . Then as , with , we have almost surely
Figure 10, deferred until Appendix A.5, displays asymptotic risk curves when has equicorrelation structure, as varies from 0 to 0.75. This same section in the appendix details the computation of the asymptotic risk when we have a autoregressive structure for , for a constant , meaning that for all . Figure 10, also in Appendix A.5, displays the asymptotic risk curves in the autoregressive case, as varies from 0 to 0.75.
We make one further point. Inspection of the asymptotic bias and variance curves individually (rather than the risk as a whole) reveals that in the autoregressive setting, both the bias and the variance depend on the correlation structure (cf. the equicorrelation setting in Corollary 1, where only the bias did). Figure 11, in Appendix A.5, shows that the bias improves as increases, and the variance worsens with as increases.
5 Misspecified model
In this section, we consider a misspecified model, in which the regression function is still linear, but we observe only a subset of the features. Such a setting is more closely aligned with practical interest in interpolation: in many problems, we do not know the form of the regression function, and we generate features in order to improve our approximation capacity. Increasing the number of features past the point of interpolation (increasing past 1) can now decrease both bias and variance (i.e., not just the variance, as in the wellspecified setting considered previously).
As such, the misspecified model setting also yields more interesting asymptotic comparisons between the and regimes. Recall that in Section 3.3, assuming isotropic features, we showed that when the asymptotic risk can have a local minimum on . Of course, the risk function in (8) is globally minimized at , which is a consequence of the fact that, in previous sections, we were assuming a wellspecified linear model (3) at each , and trivially at there is no bias and no variance, and hence no risk. In a misspecified model, we will see that the story can be quite different, and the asymptotic risk can actually attain its global minimum on .
5.1 Data model and risk
Consider, instead of (2), (3), a data model
(10)  
(11) 
where as before the random draws across are independent. Here, we partition the features according to ,
, where the joint distribution
is such that andWe collect the features in a block matrix (which has rows , ). We presume that is observed but is unobserved, and focus on the minnorm least squares estimator exactly as before in (4), from the regression of on (not the full feature matrix ).
Given a test point , and an estimator (fit using only, and not ), we define its outofsample prediction risk as
Note that this definition is conditional on , and we are integrating over the randomness not only in (the training errors), but in the unobserved features , as well. The next lemma decomposes this notion of risk in a useful way.
Proof.
Simply add an subtract inside the square in the definition of , then expand, and note that the cross term can be written, conditional on , as
∎
The first term in the decomposition in Lemma 5 is the precisely the risk that we studied previously in the wellspecified case, except that the response distribution has changed (due to the presence of the middle term in (11)). We call the second term in Lemma 5 the misspecification bias. In general, computing and in finitesample can be very difficult, owing to the potential complexities created by the middle term in (11). However, in some special cases—for example, when the observed and unobserved features are independent, or jointly Gaussian—we can precisely characterize the contribution of the middle term in (11) to the overall response distribution, and can then essentially leverage our previous results to characterize risk in the misspecified model setting. In what follows, we restrict our attention to the independence setting, for simplicity.
5.2 Isotropic features
When the observed and unobserved features are independent, , the middle term in (11) only adds a constant to the variance, and the analysis of and becomes tractable. Here, we make the additional simplifying assumption that has i.i.d. entries with unit variance, which implies that . (The case of independent features but general covariances is similar, and we omit the details.) Therefore, we may write the response distribution in (11) as
where is independent of , having mean zero and variance , for . Denote the total signal by , and the fraction of the signal captured by the observed features by . Then behaves exactly as we computed previously, for isotropic features in the wellspecified setting (Theorem 1 for , and Theorem 3 for ), after we make the substitutions:
(12) 
Furthermore, we can easily calculate the misspecification bias:
Putting these results together leads to the next conclusion.
Theorem 5.
We remark that, in the independence setting considered in Theorem 5, the dimension of the unobserved feature space does not play any role, and the result only depends on the unobserved features via . Therefore, we may equally well take
Comments
There are no comments yet.