1 Introduction
Traditionally in econometrics, a statistical inference about a population is formed using a model imposed and estimated on a sample. Put another way, the goal of inference is to see whether a model estimated on a sample may be generalized to the population. Deciding whether the estimated model is valid for generalization is referred to as the process of model evaluation. The performance of a model is usually evaluated using the sample data at hand, referred to as the ‘internal validity’ of the model. However, internal validity may not be a useful indicator of model performance in many different scenarios. A perspective of increasing interest in applied econometrics considers the performance of a model on out-of-sample data. In this paper, we focus on the ability of a model estimated from a given sample to fit new samples, referred to as the generalization ability (GA) of the model.
Generalization ability is one aspect of external validity, the extent to which the results from a study generalize to other settings. Researchers evaluating pilot programs and randomized trials in a variety of settings from labour economics to development economics are increasingly focused on the external validity of their findings.^{2} (Footnote 2: In the literature, external validity sometimes refers to whether results based on a sample from one population generalize to another population. The concept of GA may be adapted to this interpretation provided the heterogeneity across the two populations is adequately controlled.) For instance, Heckman and Vytlacil (2007), Ludwig et al. (2011) and Allcott and Mullainathan (2012) discuss the importance of externally valid results in policy and program evaluation. In a similar vein, Guala and Mittone (2005) and List (2011) emphasize the importance of external validity in determining whether experiments can offer a robust explanation of causes and effects in the field. While external validity encompasses many issues from experimental design, sample selection, and economic theory, it clearly also raises econometric issues around model estimation and evaluation.
In order to explore the properties of estimators from the perspective of GA, three questions need to be addressed. How do we measure GA? Given a useful measure, what are its properties? Given its properties, how can GA be exploited for estimation? These questions have received only tangential interest in the literature. Roe and Just (2009) discuss the trade-off between internal and external validity in empirical data, pointing out that while internal validity (or in-sample fit) is well studied, there is much less research on how to control external validity (out-of-sample fit). Some work on external validity has focussed on the properties of specific estimation methods. Angrist and Fernández-Val (2010), for example, study the external validity of instrumental variable estimation in a labor market setting. In this paper, we answer these questions, starting by proposing a measure of GA that is straightforward to implement empirically.
With a new sample at hand, GA is easily measured by using validation or cross-validation to quantify the goodness-of-fit of the estimated model on new (out-of-sample) data. Without a new sample, however, it can be difficult to measure GA ex ante. In this paper, when only a single sample is available, we quantify the GA of the in-sample estimates by deriving upper bounds on the empirical out-of-sample errors, which we call the empirical generalization errors (eGE). The upper bounds reveal the properties of the eGE and offer insight into other important statistical properties, such as the trade-off between in-sample and out-of-sample fit in estimation. Furthermore, the upper bounds can be extended to analyze the performance of validation, $K$-fold cross-validation and specific estimators such as penalized regression. Thus, the GA approach yields insight into the finite-sample and asymptotic properties of model estimation and evaluation from several different perspectives.
Essentially, GA measures the performance of a model from an external point of view: the greater the GA of an estimator, the better its predictions on out-of-sample data. Furthermore, we show that GA also serves as a natural criterion for model selection, throwing new light on the model selection process. We propose the criterion of minimizing eGE (maximizing GA), or generalization error minimization (GEM), as a framework for model selection. Using penalized regression as an example of a specific model selection scenario, we show how the traditional bias-variance trade-off is connected to GEM and to the trade-off between in-sample and out-of-sample fit. Moreover, the GEM framework allows us to establish additional properties for penalized regression implicit in the bias-variance trade-off.
1.1 Approaches to model selection
Given the increasing prevalence of high-dimensional data in economics, model selection is coming to the forefront in empirical work. Researchers often desire a smaller set of predictors in order to gain insight into the most relevant relationships between outcomes and covariates. Without explicitly introducing the concept of GA, the classical approach to model selection focusses on the bias-variance trade-off, yielding methods such as the information criteria (IC), cross-validation, and penalized regression. Consider the standard linear regression model
$Y = X\beta + \varepsilon,$
where $Y$ is an $n \times 1$ vector of outcome variables, $X$ is an $n \times p$ matrix of covariates and $\varepsilon$ is an $n \times 1$ vector of i.i.d. random errors. The parameter vector $\beta$ may be sparse in the sense that many of its elements are zero. Model selection typically involves using a score or penalty function that depends on the data (Heckerman et al., 1995), such as the Akaike information criterion (Akaike, 1973), Bayesian information criterion (Schwarz, 1978), cross-validation errors (Stone, 1974, 1977) or the mutual information score among variables (Friedman et al., 1997, 2004). An alternative approach to model selection is penalized regression, implemented through the objective function:
$\hat{\beta}(\lambda) = \arg\min_{\beta} \; \frac{1}{n}\|Y - X\beta\|_{2}^{2} + \lambda \|\beta\|_{\gamma}^{\gamma}, \qquad (1)$
where $\|\cdot\|_{\gamma}$ is the $\ell_{\gamma}$ norm and $\lambda$ is a penalty parameter. Note in eq. (1) that if $\lambda = 0$, the OLS estimator is obtained. The IC can be viewed as special cases with $\gamma = 0$ and particular choices of $\lambda$. The lasso (Tibshirani, 1996) corresponds to the case with $\gamma = 1$ (an $\ell_1$ penalty). When $\gamma = 2$ (an $\ell_2$ penalty), we have the ridge estimator (Hoerl and Kennard, 1970). For any $\gamma > 0$, we have the bridge estimator (Frank and Friedman, 1993), proposed as a generalization of the ridge.
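To make the role of $\gamma$ in eq. (1) concrete, the following sketch evaluates the penalized least-squares objective for different choices of $\gamma$. It is an illustration under our own assumptions (the name `bridge_objective`, the toy data, and the convention that $\gamma = 0$ counts nonzero coefficients, as the IC do), not code from the paper.

```python
import numpy as np

def bridge_objective(beta, X, y, lam, gamma):
    """Penalized least-squares objective of eq. (1) with an l_gamma
    penalty: gamma = 1 gives the lasso penalty, gamma = 2 the ridge
    penalty, and gamma = 0 counts nonzero coefficients as in the IC."""
    fit = np.sum((y - X @ beta) ** 2)
    if gamma == 0:
        penalty = np.count_nonzero(beta)
    else:
        penalty = np.sum(np.abs(beta) ** gamma)
    return fit + lam * penalty

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
beta_true = np.array([1.0, 0.0, -2.0])
y = X @ beta_true + 0.1 * rng.normal(size=50)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
# With lam = 0, every gamma reduces to the same OLS objective.
assert np.isclose(bridge_objective(beta_ols, X, y, 0.0, 1),
                  bridge_objective(beta_ols, X, y, 0.0, 2))
```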
One way to derive the penalized regression estimates is using the cross-validation approach, summarized in Algorithm 1. As shown in Algorithm 1, cross-validation solves the constrained minimization problem in eq. (1) for each value of the penalty parameter $\lambda$ to derive a $\hat{\beta}(\lambda)$. When the feasible range of $\lambda$ is exhausted, the estimate that produces the smallest out-of-sample error among all the estimated $\hat{\beta}(\lambda)$ is chosen to be the penalized regression estimate, $\hat{\beta}$.
Algorithm 1: Penalized regression estimation under cross-validation
1.  Set the penalty parameter $\lambda = 0$.
2.  Partition the sample into a training set and a test set. Standardize all variables.
3.  Compute the penalized regression estimate $\hat{\beta}(\lambda)$ on the training set. Use $\hat{\beta}(\lambda)$ to calculate the prediction error on the test set.
4.  Increase the penalty parameter $\lambda$ by a preset step size. Repeat steps 2 and 3 until the feasible range of $\lambda$ is exhausted.
5.  Select $\hat{\beta}$ to be the $\hat{\beta}(\lambda)$ that minimizes the prediction error on the test set.
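A minimal sketch of Algorithm 1 in Python, under two simplifying assumptions: the $\gamma = 2$ (ridge) case, which has a closed-form fit, and a single training/test split rather than a loop over folds; the names `ridge_fit` and `select_lambda`, the candidate grid and the data are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (the gamma = 2 case of eq. (1))."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def select_lambda(X, y, lambdas, train_frac=0.7, seed=0):
    """Sketch of Algorithm 1: for each candidate lambda, fit on the
    training set, score the squared prediction error on the test set,
    and return the lambda (and estimate) with the smallest error."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr = int(train_frac * len(y))
    tr, te = idx[:n_tr], idx[n_tr:]
    best = None
    for lam in lambdas:
        b = ridge_fit(X[tr], y[tr], lam)
        err = np.mean((y[te] - X[te] @ b) ** 2)
        if best is None or err < best[0]:
            best = (err, lam, b)
    return best[1], best[2]

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, 0, 0, -1.0, 0]) + rng.normal(size=200)
lam_star, beta_star = select_lambda(X, y, lambdas=[0.01, 0.1, 1.0, 10.0])
```

Stepping $\lambda$ over a preset grid stands in for the "increase by a step size until the feasible range is exhausted" loop in steps 1 and 4.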
A range of consistency properties have been established for the IC and penalized regression. Shao (1997) proves that various IC and cross-validation are consistent in model selection. Breiman (1995); Chickering et al. (2004) show that the IC have drawbacks: they tend to select more variables than necessary and are sensitive to small changes in the data. Zhang and Huang (2008); Knight and Fu (2000); Meinshausen and Bühlmann (2006); Zhao and Yu (2006) show that penalized regression is consistent in different settings. Huang et al. (2008); Hoerl and Kennard (1970) show the consistency of penalized regression in further settings. Zou (2006); Caner (2009); Friedman et al. (2010) propose variants of penalized regression in different scenarios and Fu (1998) compares different penalized regressions using a simulation study. Alternative approaches to model selection, such as combinatorial search algorithms, may be computationally challenging to implement, especially with high-dimensional data.^{3} (Footnote 3: Chickering et al. (2004) point out that the best subset selection method is unable to deal with a large number of variables, heuristically at most 30.)
1.2 Major results and contribution
A central idea in this paper is that model evaluation and model selection may be reframed from the perspective of GA. If the objective is to improve the GA of a model, model selection is necessary. Conversely, GA provides a new and elegant angle from which to understand model selection. By introducing generalization errors as the measure of GA, we connect GEM to the traditional bias-variance trade-off, the trade-off between in-sample and out-of-sample fit and the properties of validation and cross-validation. By the same token, the concept of GA may be used to derive additional theoretical properties of model selection. Specifically for the case of regression analysis, we use GEM to unify the properties of the class of penalized regressions, and show that the finite-sample and asymptotic properties of penalized regression are closely related to GA.

The first contribution of this paper is to quantify the GA of a model, ex ante, by deriving an upper bound for the GE. The upper bounds depend on the sample size, an index of the complexity of models, a loss function, and the distribution of the underlying population. The upper bound also characterizes the trade-off between in-sample fit and out-of-sample fit. As shown in Vapnik and Chervonenkis (1971a, b); McDonald et al. (2011); Smale and Zhou (2009); Hu and Zhou (2009), the inequalities underlying conventional analysis of GA focus on the relation between the population error and the empirical in-sample error. Conventional methods to improve GA involve computing discrete measures of model complexity, such as the VC dimension, Rademacher dimension or Gaussian complexity, which typically are hard to compute. In contrast, for any out-of-sample data, we quantify bounds for the prediction error of the extremum estimate learned from in-sample data. Furthermore, we show that empirical GA analysis is straightforward to implement via validation or cross-validation and possesses desirable finite-sample and asymptotic properties for model selection.
A second contribution of the paper is to propose GEM as a general framework for model selection and estimation. By reframing the bias-variance trade-off from the perspective of in-sample and out-of-sample fit, GA is connected to the traditional bias-variance trade-off and model selection. To be specific: model selection is necessary to improve the GA of an estimated model while GA naturally serves as an elegant way to understand the properties of model selection. Thus, a model estimated by GEM will not only achieve the highest GA—and thus some degree of external validity—but also possess a number of nice theoretical properties. GEM is also naturally connected to the properties of cross-validation. As argued by Varian (2014), cross-validation could be adopted more often in empirical analysis in economics, especially as big data sets are becoming increasingly available in many fields.
A third contribution of the paper is to use GA analysis to unify a class of penalized regression estimators and derive their finite-sample and asymptotic properties (including consistency) under the same set of assumptions. Various properties of penalized regression estimators have previously been established, such as probabilistic consistency or the oracle property (Knight and Fu, 2000; Zhao and Yu, 2006; Candes and Tao, 2007; Meinshausen and Yu, 2009; Bickel et al., 2009). GA analysis reveals that similar properties can be established more generally and for a wider class of penalized regression estimators. We also show that the difference between the OLS estimate and any penalized regression estimate depends on their respective GAs.
Lastly, a fourth contribution of the paper is to show that GA analysis may be used to tune the hyperparameter for validation (i.e., the ratio of training sample size to test sample size) or cross-validation (i.e., the number of folds $K$). Existing research has studied cross-validation for the estimation of specific parametric and nonparametric models (Hall and Marron, 1991; Hall et al., 2004; Stone, 1974, 1977). In contrast, by adapting the classical error bound inequalities that follow from our analysis of GA, we derive the optimal tuning parameters for validation and cross-validation in a model-free setting. We also show how $K$ affects the bias-variance trade-off for cross-validation: a higher $K$ increases the variance and lowers the bias.
The paper is organized as follows. In Section 2, we propose empirical generalization errors as the measure of GA and the criterion to evaluate the external performance of the estimated model. Also, by deriving the upper bounds for the empirical generalization errors in different scenarios, we reveal important features of the extremum estimator such as the trade-off between in-sample and out-of-sample fit. Our approach also offers a framework to study the properties of validation and cross-validation. In Section 3, we apply the GEM framework as a model selection criterion for penalized regression. We also demonstrate a number of new properties for penalized regression that flow from the GEM framework. We prove the consistency of penalized regression estimators in both cases considered. Further, we establish the finite-sample upper bound for the difference between penalized and unpenalized estimators based on their respective GAs. In Section 4, we use simulations to demonstrate the ability of penalized regression to control for overfitting. We also propose a measure of overfitting based on the empirical generalization errors, called the generalized $R^2$. Section 5 concludes with a brief discussion of our results. Proofs (showing detailed steps) are contained in Appendix A and summary plots of the simulations are in Appendix B.
2 Generalization ability and the upper bound for the finite-sample generalization error
In this section, we propose the eGE, the measure of GA for a model, as a new angle from which to evaluate the ‘goodness’ of a model. By deriving upper bounds on the eGE of the extremum estimator, we show that these bounds directly quantify how much the model overfits or underfits in finite samples, and that they may be used to study the properties of the eGE itself. Furthermore, the upper bounds can be used to study the properties of validation and cross-validation on finite samples. All of these results imply that the eGE is a convenient and useful criterion for model evaluation.
2.1 Generalization ability, generalization error and overfitting
In econometrics, choosing the best approximation to data often involves measuring a loss function, $Q(z, f)$, defined as a functional that depends on some estimate $f$ and the sample points $z_i = (x_i, y_i)$. The population error (or risk) functional is defined as
$R(f) = \int Q(z, f) \, dF(z),$
where $F(z)$ is the joint distribution of $x$ and $y$. Without knowing the distribution a priori, given a random sample $\{z_i\}_{i=1}^{n}$, we define the empirical error functional as follows:
$R_n(f) = \frac{1}{n} \sum_{i=1}^{n} Q(z_i, f).$
For example, in the regression case, $f$ is indexed by the estimated parameter vector $\beta$ and $Q(z_i, \beta) = (y_i - x_i'\beta)^2$.
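The relation between the empirical error functional and the population error functional can be checked in a small simulation. The DGP, the squared-error loss and the function names below are our own illustrative assumptions: for $y = x + e$ with $x \sim N(0,1)$ and $e \sim N(0, 0.25)$, the population error of a slope $\theta$ is $(1-\theta)^2 + 0.25$, and the law of large numbers drives the empirical error toward it.

```python
import numpy as np

def empirical_error(theta, x, y):
    """Empirical error functional R_n(theta) under squared loss."""
    return np.mean((y - theta * x) ** 2)

def population_error(theta):
    """Population error R(theta) for the assumed DGP y = x + e."""
    return (1 - theta) ** 2 + 0.25

rng = np.random.default_rng(2)
n = 200_000
x = rng.normal(size=n)
y = x + 0.5 * rng.normal(size=n)

# LLN: R_n(theta) approaches R(theta) for every candidate slope.
for theta in (0.5, 1.0, 1.5):
    gap = abs(empirical_error(theta, x, y) - population_error(theta))
    assert gap < 0.02
```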
When estimation involves minimizing the in-sample empirical error, we have the extremum estimator (Amemiya, 1985). In many settings, however, minimizing the in-sample empirical error does not guarantee a reliable model. In regression, for example, the $R^2$ is often used to measure goodness-of-fit for the in-sample data. However, an estimate with a high in-sample $R^2$ may fit out-of-sample data poorly, a feature commonly referred to as overfitting: the in-sample estimate is too tailored to the sample data, compromising its out-of-sample performance.^{4} (Footnote 4: Likewise, if the model fits in-sample data poorly and hence compromises its out-of-sample performance, we say that the model underfits the data.) As a result, in-sample fit (internal validity) of the model may not be a reliable indicator of the general applicability (external validity) of the model.
Thus, Vapnik and Chervonenkis (1971a) refer to the generalization ability (GA) of a model: a measure of how an extremum estimator performs on out-of-sample data. GA can be measured in several different ways. In the case where the outcomes and their predictions are directly observed, GA is a function of the difference between the actual and predicted values for out-of-sample data. In this paper, GA is measured by the out-of-sample empirical error functional.
Definition 1 (Training set, test set, empirical training error and empirical generalization error).

Let $z = (x, y)$ denote a sample point from $F(z)$, the joint distribution of $x$ and $y$. Let $\mathcal{F}$ denote the space of all models. The loss function for $f \in \mathcal{F}$ is $Q(z, f)$. The population error functional for $f$ is $R(f) = \int Q(z, f)\, dF(z)$. The empirical error functional is $R_n(f) = \frac{1}{n}\sum_{i=1}^{n} Q(z_i, f)$.

Given a sample $\{z_i\}_{i=1}^{n}$, the training set $S_1$ refers to data used to estimate $f$ (i.e., the in-sample data) and the test set $S_2$ refers to data not used to estimate $f$ (i.e., the out-of-sample data). Let $S_1 \cap S_2 = \emptyset$. The effective sample size for the training set, test set and the total sample, respectively, is $n_1$, $n_2$ and $n = n_1 + n_2$.

Let $\hat{f}$ denote an extremum estimator, where $\hat{f}$ minimizes the empirical error on the training set. Under validation, the empirical training error (eTE) for $\hat{f}$ is $\frac{1}{n_1}\sum_{z_i \in S_1} Q(z_i, \hat{f})$, the empirical generalization error (eGE) is $\frac{1}{n_2}\sum_{z_i \in S_2} Q(z_i, \hat{f})$ and the population error is $R(\hat{f})$.

For $K$-fold cross-validation, denote the training set and test set in the $q$th round, respectively, as $S_{-q}$ and $S_q$. In each round, the sample size for the training set is $n(K-1)/K$ and the sample size for the test set is $n/K$. Thus, the error of the round-$q$ estimate on $S_q$ is the eGE and its error on $S_{-q}$ is the eTE, respectively, in the $q$th round of cross-validation.
Two methods are typically used to compute the eGE of an estimate: validation and cross-validation. Under the validation approach, the sample is randomly divided into a training set and a test set. Following estimation on the training set, the fitted model is applied to the test set to compute the validated eGE. $K$-fold cross-validation may be thought of as ‘multiple-round validation’. Under the cross-validation approach, the sample is randomly divided into $K$ subsamples or folds.^{5} (Footnote 5: In practice, researchers arbitrarily choose $K = 5$, 10, 20, 40 or $n$.) One fold is chosen to be the test set and the remaining $K-1$ folds comprise the training set. Following estimation on the training set, the fitted model is applied to the test set to compute the eGE. The process is repeated $K$ times, with each of the $K$ folds in turn taking the role of the test set while the remaining folds are used as the training set. In this way, $K$ different estimates of the eGE for the fitted model are obtained. The average of the $K$ eGEs yields the cross-validated eGE.
Cross-validation uses each data point in both the training and test sets. The method also reduces resampling error by running the validation $K$ times over different training and test sets. Intuitively this suggests that cross-validation is more robust to resampling error and on average will perform at least as well as validation. In Section 3, we study the generalization ability of penalized extremum estimators in both the validation and cross-validation cases.
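The validation and $K$-fold computations of the eTE and eGE described above can be sketched as follows, assuming OLS as the extremum estimator and squared loss; the function names and the toy DGP are our own illustration.

```python
import numpy as np

def ete_ege_validation(X, y, train_frac=0.7, seed=0):
    """Validation: fit OLS on the training set; the eTE is its
    training-set error and the eGE its error on the held-out test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_tr = int(train_frac * len(y))
    tr, te = idx[:n_tr], idx[n_tr:]
    b = np.linalg.lstsq(X[tr], y[tr], rcond=None)[0]
    ete = np.mean((y[tr] - X[tr] @ b) ** 2)
    ege = np.mean((y[te] - X[te] @ b) ** 2)
    return ete, ege

def ege_kfold(X, y, K=5, seed=0):
    """K-fold cross-validated eGE: average the K per-round eGEs."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    eges = []
    for q in range(K):
        te = folds[q]
        tr = np.concatenate([folds[j] for j in range(K) if j != q])
        b = np.linalg.lstsq(X[tr], y[tr], rcond=None)[0]
        eges.append(np.mean((y[te] - X[te] @ b) ** 2))
    return np.mean(eges)

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -1.0, 0.5, 0.0]) + rng.normal(size=300)
ete, ege = ete_ege_validation(X, y)
cv_ege = ege_kfold(X, y, K=5)
```

The cross-validated eGE `cv_ege` is the average of the five per-round eGEs, exactly as in the description above.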
To study the properties of eGE for a model, three assumptions are required for the analysis in this section of the paper, stated as follows.
Assumptions

In the probability space $(\Omega, \mathcal{A}, P)$, we assume measurability of the loss function $Q(z, f)$, the population error $R(f)$ and the empirical error $R_n(f)$, for any $f$ and any sample point. Distributions of loss functions have closed-form, first-order moments.

The sample is randomly drawn from the population. In cases with multiple random samples, both the training set and the test set are randomly sampled from the population. In cases with a single random sample, both the training set and the test set are randomly partitioned from the sample.

For any sample, the extremum estimator exists. The in-sample error for the extremum estimator converges in probability to the minimal population error as the sample size tends to infinity.
A few comments are in order for assumptions A1–A3. The loss distribution assumption A1 is made to simplify the analysis. The existence and convergence assumption A3 is standard (see, for example, Newey and McFadden (1994)).
The independence assumption A2 is not essential; GA analysis is valid for both i.i.d. and non-i.i.d. data. While the original research on GA in Vapnik and Chervonenkis (1974a, b) imposes the i.i.d. restriction, subsequent work has generalized their results to cases where the data are dependent or not identically distributed.^{6} (Footnote 6: See, for example, Yu (1994); Cesa-Bianchi et al. (2004); Smale and Zhou (2009); Mohri and Rostamizadeh (2009); Kakade and Tiwari (2009); McDonald et al. (2011).)
Others have shown that if heterogeneity is due to an observed random variable, the variable may be added to the model to control for the heterogeneity, while if the heterogeneity is related to a latent variable, various approaches—such as the hidden Markov model, mixture modelling or factor modelling—are available for heterogeneity control.^{7} (Footnote 7: See Michalski and Yashin (1986); Skrondal and Rabe-Hesketh (2004); Wang and Feng (2005); Yu and Joachims (2009); Pearl (2015).) In this paper, due to the different measure-theoretic setting for dependent data, we focus on the independent case as a first step.
Given A1–A3, both the eTE and the eGE converge in probability to the population error as the corresponding sample sizes tend to infinity.
2.2 The upper bound of the empirical generalization error
The problem of GA and overfitting was identified many years ago. To improve the generalization ability of a model, Vapnik and Chervonenkis (1971a, b) propose minimizing the upper bound of the population error of the estimator as opposed to minimizing the eTE. The balance between in-sample fit and out-of-sample fit is formulated by Vapnik and Chervonenkis (1974b) using the Glivenko-Cantelli theorem and Donsker's theorem for empirical processes. Specifically, the so-called VC inequality (Vapnik and Chervonenkis, 1974b) summarizes the relationship between the population error and the empirical training error.
Lemma 1 (VC inequality: the upper bound for the population error).
Under A1–A3, the following inequality holds with probability at least $1-\eta$, simultaneously for all $f \in \mathcal{F}$:

$R(f) \leq R_{n_1}(f) + \frac{\epsilon}{2}\left(1 + \sqrt{1 + \frac{4 R_{n_1}(f)}{\epsilon}}\right), \qquad \epsilon = \frac{4}{n_1}\left(h\left(\ln\frac{2 n_1}{h} + 1\right) + \ln\frac{4}{\eta}\right), \qquad (2)$

where $R(f)$ is the population error, $R_{n_1}(f)$ is the empirical training error and $h$ is the VC dimension.
The VC dimension $h$ is a more general measure of the geometric complexity of a model than the number of parameters, $p$, which does not readily extend as a measure of complexity in nonlinear or non-nested models. While $h$ reduces to $p$ directly for generalized linear models, $h$ can also be used to order the complexity of nonlinear or non-nested models.^{8} (Footnote 8: In empirical processes, several other geometric complexity measures are connected to or derived from the VC dimension, such as the minimum description length (MDL) score, the Rademacher dimension (or complexity), Pollard's pseudo-dimension and the Natarajan dimension. Most of these measures, like the VC dimension, are derived from the Glivenko-Cantelli class of empirical processes.) Thus, eq. (2) can be used as a tool for linear, nonlinear and non-nested model selection.
While Lemma 1 is established under A2, eq. (2) can be generalized to non-i.i.d. cases. McDonald et al. (2011) generalize the VC inequality to mixing stationary time series while Smale and Zhou (2009) generalize the VC inequality to panel data. Moreover, a number of papers (Michalski and Yashin, 1986; Skrondal and Rabe-Hesketh, 2004; Wang and Feng, 2005; Yu and Joachims, 2009; Pearl, 2015) show that heterogeneity can be controlled in the context of the VC inequality by implementing a latent variable model or by adding the variable causing the heterogeneity into the model.
The VC inequality provides an upper bound for the population error based on the eTE and the VC dimension of the model. As shown in Figure 1, when the VC dimension is low (i.e., the model is simple), the effective sample size of the training set relative to the model complexity is large and the second term on the RHS of (2) is small, so the eTE is close to the population error. In this case the extremum estimator has a good GA. However, when the VC dimension is high (i.e., the model is very complicated), the relative effective sample size is small and the second term on the RHS of (2) becomes larger. In such situations, a small eTE does not guarantee a good GA, and overfitting is more likely.
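The behaviour in Figure 1 can be mimicked numerically. The sketch below uses the classical Vapnik form of the complexity term, whose constants may differ from those in the paper's eq. (2); `vc_epsilon` is our own name.

```python
import numpy as np

def vc_epsilon(n1, h, eta):
    """Classical Vapnik complexity term for a training set of size n1,
    VC dimension h and confidence level 1 - eta. The constants follow
    Vapnik's textbook bound and may differ from those in eq. (2)."""
    return 4.0 / n1 * (h * (np.log(2.0 * n1 / h) + 1.0) + np.log(4.0 / eta))

# A simple model (low h) keeps the term small; a complex model on the
# same sample inflates it, loosening the bound and inviting overfitting.
eps_simple = vc_epsilon(n1=1000, h=5, eta=0.05)
eps_complex = vc_epsilon(n1=1000, h=200, eta=0.05)
assert eps_simple < eps_complex
```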
Vapnik and Chervonenkis (1971a) show that minimizing the RHS of eq. (2) reduces overfitting and improves the GA of the extremum estimator. However, this can be hard to implement because the VC dimension is hard to calculate for anything other than linear models.^{9} (Footnote 9: For example, the VC dimension is complicated and not handy to implement if we try to compare the GA between two different distributions in kernel density estimation.)

Also, the VC inequality refers to the population. In practice, however, the population is often not accessible, the data we collect are always finite and, moreover, we are often interested in how an estimated model would perform on finite out-of-sample data, such as "how well would the effect of a policy, estimated from a sample from suburb A, explain the variation in suburb B, given relevant heterogeneity is controlled?" Hence, as a handy and intuitive measure of GA or overfitting in empirical studies, the eGE has been proposed and widely used in applications. As a reasonable criterion for model evaluation, we would expect the eGE to demonstrate a number of nice properties that characterize its pattern or regularity as an empirical process. For example, if we estimate a model on the training set and apply it to many test sets of the same size, we would intuitively expect the values of the different eGEs to be as stable as possible despite sampling error; likewise, if we calculate the eGE of a model on test sets of different sizes, we would expect the values of the eGEs to be influenced by the size of the training set, the size of the test set and the tail behavior of the error distribution. However, owing to the procedures of sampling from the population and subsampling by validation or cross-validation, the randomness of the eGE as an empirical process makes such properties harder to derive. To reduce and absorb the influence of sampling and subsampling, one approach is to derive an upper bound for the eGE,^{10} (Footnote 10: All error functionals are non-negative, hence zero may be seen as a natural lower bound. Moreover, the consequence of a large eGE is typically more severe than that of a small one. As a result, we focus on the upper bound here.) similar in spirit to what eq. (2) does for the population error. Thus, we now describe the finite-sample relation between the eGE and the eTE in validation, by adapting eq. (2).
Theorem 1 (The upper bound for the eGE of the extremum estimator under validation).
Under A1–A3 and given an extremum estimator, the following upper bound for the eGE holds with probability at least $1-\eta$:
(3) 
where is the eGE and the eTE, is defined in Lemma 1, if and is bounded, and otherwise
where
Eq. (3) provides an upper bound for the eGE on the test set that depends on the eTE of the model estimated on the training set. Thus, eq. (3) shows how the eTE from extremum estimation on the training set may be used to compute the eGE of the model on the test set. In other words, eq. (3) measures the GA of a model of a given complexity under the validation approach.
Theorem 1 has several important implications.

Upper bound of the eGE. Eq. (3) establishes the upper bound of the eGE for any out-of-sample data of size $n_2$ based on the eTE from any in-sample data of size $n_1$. Thus, eq. (3) quantifies the upper bound of the eGE, as opposed to Lemma 1, which quantifies the upper bound of the population error. Previously, measuring the eGE of a model with new data required the use of validation or cross-validation. Given Theorem 1, the eGE may be quantified directly using the RHS of eq. (3), avoiding the need for validation.

The trade-off between accuracy and looseness of the upper bound. Theorem 1 also shows that the confidence level influences the trade-off between the accuracy and efficiency of the upper bound. The higher the confidence level $1-\eta$, the more likely the upper bound holds and the larger the bound becomes. Thus, while the probability the bound holds increases, it comes at the cost of a looser upper bound. In contrast, a lower confidence level reduces the upper bound, offering an empirically efficient upper bound at the cost of reducing the probability that it holds.

The eGE–eTE trade-off in model selection. Eq. (3) also characterizes the trade-off between eGE and eTE for model selection in both the finite-sample and asymptotic cases. The population GE and the population TE converge to the population error. Hence, minimizing the eTE can lead directly to the true DGP in the population. By contrast, for the finite-sample case illustrated in Figure 2, while an overcomplicated model will have a small eTE, eq. (3) shows it may result in a large eGE on new data. Thus, an overcomplicated model will tend to overfit the in-sample data and have poor GA. On the other hand, an oversimplified model will be unlikely to recover the true DGP, leading to a large upper bound for the eGE. Thus, an oversimplified model will tend to underfit, fitting both the in-sample and out-of-sample data poorly. Thus, the complexity of a model entails a trade-off between the eTE and eGE.

GA and tails of the loss function distribution. Eq. (3) also shows how the tail of the loss function distribution affects the upper bound of the eGE through the second term on the RHS of eq. (3). If the loss is bounded or light-tailed, this term is mathematically simple and converges to zero as the test-set size grows. If the loss function is heavy-tailed and measurable, the highest order of the population moment that is closed-form for the loss distribution (Footnote 11: It is closed-form owing to A1, which guarantees closed-form, first-order moments for all loss distributions.) can be used to measure the heaviness of the loss distribution tail, a smaller order implying a heavier tail. In the case of a heavy tail, the term is mathematically complicated and its convergence rate decreases. Hence, the heavier the tail of the loss distribution, the higher the upper bound of the eGE and the harder it is to control GA in finite samples. In the extreme case, there is no way to adapt eq. (3).
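The tail effect can be illustrated by simulation: the sketch below assumes a Student-t error with 2.5 degrees of freedom as a stand-in for a heavy-tailed loss (its fourth moment does not exist, so squared errors have infinite variance) and compares the volatility of the eGE across fresh test sets with the Gaussian case. The setup is our own toy design, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(4)

def ege_spread(draw_noise, n_train=200, n_test=200, reps=300):
    """Fit OLS once on a training sample, then measure how much the
    eGE varies across many independently drawn test sets."""
    X = rng.normal(size=(n_train, 3))
    y = X @ np.ones(3) + draw_noise(n_train)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    eges = []
    for _ in range(reps):
        Xt = rng.normal(size=(n_test, 3))
        yt = Xt @ np.ones(3) + draw_noise(n_test)
        eges.append(np.mean((yt - Xt @ b) ** 2))
    return np.std(eges)

light = ege_spread(lambda m: rng.normal(size=m))
heavy = ege_spread(lambda m: rng.standard_t(2.5, size=m))
# Heavier tails make the eGE across test sets far more volatile,
# consistent with the looser upper bound in eq. (3).
assert light < heavy
```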
Our next step is to establish a bound similar to eq. (3) for $K$-fold cross-validation. Given that $K$-fold cross-validation is simply multiple-round validation, a similar bound can be established for cross-validation by convolution. For convenience, define the corresponding empirical process as the eGE gap in the $q$th round of cross-validation.
Theorem 2 (The upper bound for the eGE of the extremum estimator for cross-validation).
Under A1–A3, given an extremum estimator and given $K$, then
 (i)

if , the following upper bound for the eGE gap holds with probability at least
(4) where
 (ii)

if the loss is heavy-tailed and sub-exponential, the following upper bound for the eGE holds with probability at least
(5) where
is the eGE and is the eTE, respectively, in the $q$th round of cross-validation.
Theorem 2 provides an upper bound for the average eGE from the $K$ rounds of cross-validation. Generally speaking, the errors generated from cross-validation are affected both by sampling randomness (from the population) and by subsampling randomness that arises from partitioning the sample into $K$ folds. Thus, the errors from cross-validation are potentially more volatile than the usual errors from estimation. Eqs. (4) and (5) offer a way to characterize the properties of the eGE despite the effect of subsampling randomness.
The implications of eqs. (4) and (5) are as follows.

Upper bound of the eGE. Similar to eq. (3), eqs. (4) and (5) serve as upper bounds of the cross-validated eGE. Both equations reveal the trade-off between eTE and eGE and the influence of the tails of the loss distribution. Under convolution, the tails dramatically affect the upper bound of the eGE. If the loss is light-tailed, the bound converges exponentially. When the loss is sub-exponential, the convolution is more complicated and it is hard to approximate the probability. However, with sufficiently large $K$, Theorem 2 shows that we can approximate the convoluted probability.

The trade-off between accuracy and looseness of the upper bound. Similar to eq. (3), in both eqs. (4) and (5) we need to consider the trade-off between and (or , as in Theorem 1), i.e., the trade-off between efficiency and accuracy. Similar to Theorem 1, ceteris paribus, a larger in each round of cross-validation increases and . Thus, in each round, the upper bound moves upwards and the probability that the upper bound holds increases, implying that both increase overall under cross-validation. While the probability that the bound holds increases, this comes at the cost of a looser bound.

Cross-validation hyperparameter
. Eqs. (4) and (5) characterize how affects the average eGE from cross-validation (also called the cross-validation error in the literature). With a given sample and fixed , subsampling randomness will produce a different average eGE each time cross-validation is performed. From Definition 1, and , so the sizes of the training and test sets change with . As increases, the test sets become smaller, increasing the influence of subsampling randomness on the eGE. On the other hand, as decreases, the training sets become smaller, increasing the influence of subsampling randomness on the eTE. These two effects complicate the analysis of the effect of on the eTE–eGE trade-off under cross-validation. Thus, to characterize the influence of subsampling randomness, we establish the trade-off for cross-validation by a bound, after running cross-validation many times.^{12}^{12}12By contrast, for extremum estimators like OLS, the bias–variance trade-off is much more straightforward to analyze for different because the sample is fixed. Figure 5 illustrates the effect of .
Small . For low values of , is low in each round of in-sample estimation and the eTE in each round, , is more biased away from the population error, as shown in Figure 5(a). Also, for small , the round-averaged eTE (the first term on the RHS of eqs. (4) and (5)) is more biased away from the true population error, as shown in Figure 5(b). As a result, the RHS of eqs. (4) and (5) suffers more from finite-sample bias for low values of . However, since a small implies is relatively large, more data are used to compute the eGE in each round, so the eGE on each test set should be less volatile. Thus, the round-averaged eGE for cross-validation is relatively less volatile, reflecting the fact that is not very large in eqs. (4) and (5).

Large . For high values of , is low. Given a small test-set size, the eGE in each round may be hard to bound from above, so the averaged eGE from the rounds will be more volatile and will increase. However, with a high , the first term on the RHS of eqs. (4) and (5) tends to be closer to the true population error, so the averaged eGE suffers less from bias.
In summary, Figure 5(b) shows that as the value of increases, the averaged eGE from cross-validation follows a typical bias–variance trade-off. For low values of , the averaged eGE is less volatile but more biased away from the population error. As increases, the averaged eGE becomes more volatile but less biased away from the population error.
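The subsampling randomness behind this trade-off can be made visible by re-running cross-validation many times on one fixed sample. The sketch below is our illustration (the simulated DGP, fold counts and all names are assumptions, not the paper's simulation design); it records the distribution of the averaged eGE across repeated random partitions for a small and a large number of folds:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 120, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(scale=1.0, size=n)

def cv_error(K, seed):
    """Averaged eGE over one run of K-fold cross-validation with a random partition."""
    r = np.random.default_rng(seed)
    folds = np.array_split(r.permutation(n), K)
    errs = []
    for q in range(K):
        te = folds[q]
        tr = np.concatenate([folds[j] for j in range(K) if j != q])
        b = np.linalg.lstsq(X[tr], y[tr], rcond=None)[0]
        errs.append(np.mean((y[te] - X[te] @ b) ** 2))
    return float(np.mean(errs))

# Repeat cross-validation under subsampling randomness for few and many folds.
runs_small = [cv_error(K=2, seed=s) for s in range(50)]   # large test sets, small training sets
runs_large = [cv_error(K=30, seed=s) for s in range(50)]  # small test sets, large training sets

print(np.mean(runs_small), np.std(runs_small))
print(np.mean(runs_large), np.std(runs_large))
```

Comparing the two printed (mean, standard deviation) pairs across repeated runs mirrors the bias and volatility pattern the text describes for low versus high fold counts.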

Theorem 2 confirms the exhaustive simulation-study results of Kohavi (1995). As a side benefit, Theorem 2 also suggests that an optimal number of folds may exist for each sample when cross-validation is used with extremum estimation, as shown in Figure 5(b). More specifically, the that minimizes the upper bounds (4) and (5) also maximizes the GA from cross-validation. We leave discussion of the optimal for regression to the next section.
Theorems 1 and 2 establish upper bounds for the eGE of the extremum estimator for a random sample of any size, yielding a method to analyze the properties of the eGE for extremum estimators. Potentially, if we take the eGE as the criterion for model selection, Theorems 1 and 2 may be used to evaluate the performance of model selection under validation and cross-validation. Hence it is natural to select the model with minimal eGE in the space of alternative models.
2.3 Generalization error minimization
By establishing upper bounds for the eGE under validation and cross-validation in section 2.1, we showed that, as a criterion for model evaluation, the eGE has a number of desirable properties both in finite samples and asymptotically, suggesting that it offers a useful lens on model selection. Hence, taking the eGE as the criterion for model selection, we propose selecting the model that minimizes the eGE, which we refer to as generalization error minimization, or GEM.
Generally speaking, GEM can be implemented alongside many conventional model-selection techniques, such as penalized regression, information criteria and maximum a posteriori (MAP) estimation. However, in the next section, we show that GEM works especially well for penalized regression. As shown in Algorithm 1, penalized-regression estimation returns a for each . Each value of generates a different model and a different eGE. As a result, Theorems 1 and 2 guarantee that the model with the minimum eGE among has the best empirical generalization ability. By applying GEM in conjunction with validation/cross-validation and various penalty methods, the theoretical properties of penalized regression, such as its robustness, mode of consistency and convergence rate, can be analyzed by directly applying the upper bounds derived above.
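As a hedged sketch of the GEM recipe (our code, not the paper's Algorithm 1 verbatim; ridge stands in for the generic penalized estimator and the λ grid, split sizes and simulated DGP are our assumptions), one can fit a candidate for each penalty level on the training set and keep the candidate with minimal eGE on the test set:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 150, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]  # sparse true DGP
y = X @ beta + rng.normal(scale=1.0, size=n)

# Hold-out validation split: train on the first 100 rows, test on the rest.
n_tr = 100
Xtr, ytr, Xte, yte = X[:n_tr], y[:n_tr], X[n_tr:], y[n_tr:]

def ridge(X, y, lam):
    """Closed-form ridge (L2-penalized) estimator."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# One candidate model per penalty level.
lambdas = np.logspace(-3, 3, 25)
candidates = {lam: ridge(Xtr, ytr, lam) for lam in lambdas}

# GEM: pick the candidate with minimal empirical generalization error.
eges = {lam: float(np.mean((yte - Xte @ b) ** 2)) for lam, b in candidates.items()}
lam_star = min(eges, key=eges.get)
b_star = candidates[lam_star]
```

The selected `b_star` is the model GEM keeps: the unique candidate attaining the minimal eGE on the held-out test set.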
3 Finite-sample and asymptotic properties for penalized regression under GEM
In Section 2, to analyze the eGE from a model-free perspective, we established a class of upper bounds for the eGE of extremum estimators, which connects to the eTE–eGE trade-off, the bias–variance trade-off and cross-validation; we also proposed GEM as a general framework for model selection. In this section, we apply GEM to regression analysis and show that it offers a new and clear angle on penalized regression. Taking the eGE as the criterion for model selection, model selection directly serves as an effective method for improving GA; conversely, the properties of model selection can be directly explained and reframed through the eGE of the model. Moreover, beyond the properties in Section 2, additional properties can be established for penalized regression under the GEM framework. Specifically, we establish: (1) specific error bounds for any penalized regression; (2) consistency for all penalized-regression estimators; (3) that the upper bound for the difference between the penalized-regression estimator and the OLS estimator is a function of the eGE, the tail behavior of the loss-function distribution and the exogeneity of the sample.
3.1 Penalized regression
Firstly, we formally define penalized regression and its two most popular variants: the lasso (ℓ1-penalized regression) and ridge (ℓ2-penalized regression). It is important to stress that each variable in must be standardized before implementing penalized regression. As shown by Tibshirani (1996), without standardization the penalized-regression estimates may be influenced by the magnitudes (units) of the variables. After standardization, of course, and are unit- and scale-free.
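A minimal sketch of this standardization step (our illustration; the guard for constant columns and all names are our assumptions) is to center and scale each regressor using training-set statistics only, so that the penalty treats all variables on a common, unit-free scale:

```python
import numpy as np

def standardize(X_train, X_test):
    """Center and scale each column of X using TRAINING-set statistics only,
    so the penalty is applied on a common, unit-free scale and the test set
    is transformed consistently with the training set."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0.0] = 1.0  # guard against constant columns (our assumption)
    return (X_train - mu) / sd, (X_test - mu) / sd

rng = np.random.default_rng(3)
# Two regressors with wildly different units (scales 1 vs 100).
Xtr = rng.normal(loc=5.0, scale=[1.0, 100.0], size=(50, 2))
Xte = rng.normal(loc=5.0, scale=[1.0, 100.0], size=(20, 2))
Ztr, Zte = standardize(Xtr, Xte)
```

Using training-set means and standard deviations for both sets avoids leaking test-set information into the estimation step.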
Definition 2 (Penalized regression, ridge regression, lasso regression, eGE and eTE).

The general form of the objective function for penalized regression is
(6) where the penalty term is a function of the norm of .

Let denote the solution to the constrained minimization eq. (6) for a given value of the penalty parameter . Let denote the estimator with the minimum eGE among all the alternative (as in Algorithm 1 in Section 1). Let denote the OLS estimator on the training set.

The objective function for the lasso (ℓ1-norm penalty) is
(7) and the objective function for ridge regression (ℓ2-norm penalty) is (8)
The norm eTE and eGE for any are defined, respectively, as
The idea behind penalized regression is illustrated in Figure 9. As shown in Figure 9(a), different penalties correspond to different boundaries of the estimation feasible set. For ℓ1-penalized regression (lasso), the feasible set is a diamond. The feasible set expands to a circle under an ℓ2 penalty. As illustrated in Figures 9(b) and 9(c), for a given , the smaller , the more likely is a corner solution. It follows that under the ℓ1 penalty, variables are more likely to be dropped than under the ℓ2 penalty.^{13}^{13}13For , the penalized regression may be a non-convex programming problem. While general algorithms for non-convex optimization have not been found, Strongin and Sergeyev (2000), Yan and Ma (2001) and Noor (2008) have developed working algorithms. For , the penalized regression becomes a discrete programming problem, which can be solved by Dantzig-type methods; see Candes and Tao (2007). In the special case when and (), the penalized regression is identical to the Akaike (Bayesian) information criterion.
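The corner-solution geometry can be seen with one line of algebra per penalty: the proximal map of the ℓ1 penalty is soft-thresholding, which sets small coefficients exactly to zero, while the ℓ2 map only shrinks coefficients proportionally. A small sketch for intuition (ours; the coefficient vector and threshold are arbitrary):

```python
import numpy as np

def prox_l1(b, t):
    """Soft-thresholding: the proximal map of the L1 penalty.
    Coefficients with magnitude below the threshold t are set exactly
    to zero (the 'corner solutions' of the diamond feasible set)."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def prox_l2(b, t):
    """Proximal map of the squared L2 penalty: proportional shrinkage,
    which moves coefficients toward zero but never exactly to zero."""
    return b / (1.0 + t)

b = np.array([3.0, -0.4, 0.05])
shrunk_l1 = prox_l1(b, 0.5)  # small coefficients are zeroed out
shrunk_l2 = prox_l2(b, 0.5)  # every coefficient shrinks but stays nonzero
print(shrunk_l1)
print(shrunk_l2)
```

This is why the lasso performs variable selection while ridge only shrinks: the ℓ1 map has a flat region around zero, the ℓ2 map does not.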
Penalized regression primarily targets overfitting. By contrast, OLS minimizes the eTE without any penalty, often resulting in a large eGE (e.g., the ‘overfitting’ case in Figure 1). It is also possible that OLS fits the training data poorly, causing both the eTE and the eGE to be large (e.g., the ‘underfitting’ case in Figure 1). Generally speaking, overfitting is easier to deal with than underfitting.^{14}^{14}14See eqs. (13) and (14). Overfitting in OLS is usually caused by including too many variables, which can be resolved by reducing . Underfitting, however, is likely due to a lack of data (variables), and the only remedy is to collect more data.
3.2 GEM for penalized regression
The conventional route to establishing finite-sample and asymptotic properties for regression is to analyze the properties of the estimator in the space of the eTE. By contrast, to study how penalized regression improves GA, we reformulate the analysis in the space of the eGE. Figure 10 outlines our proof strategy. We show that a number of finite-sample properties of penalized regression can be established under the GEM framework.
In asymptotic analysis, consistency is typically considered one of the most fundamental properties. To demonstrate that GEM is a viable estimation framework, we prove that the penalized-regression model selected by eGE minimization converges to the true DGP as
. Essentially, we show that penalized regression bijectively maps to the minimal eGE among on the test set. To link the finite-sample and asymptotic results, we need to show that, if the true DGP is bijectively assigned to the global minimum eGE in the population, then is consistent in probability, or
To establish consistency for penalized regression, we make the following three additional assumptions.
Further assumptions

The true DGP is .

.

No perfect collinearity in .
Assumptions A4–A6 ensure that the true DGP, indexed by , is identifiable. Otherwise, there may exist an alternative that is not statistically different from the true DGP. These assumptions are standard for linear regression.
Under assumptions A1–A6, we first show that the true DGP has the lowest generalization error.
Proposition 1 (Identification of in the space of eGE).
Under A1–A6, the true DGP, , is the unique model attaining the minimal eGE as .
Proposition 1 states that there is a bijective mapping between and the global minimum eGE in the population. If A5 or A6 were violated, there may exist variables in the sample that prevent the true DGP from having the minimum eGE in the population.
As shown in Algorithm 1, penalized regression chooses to be the model with the minimum eGE in . Thus, we need to establish that, when the sample size is large enough, the true DGP is included in , the list of models selected by validation or cross-validation.
Proposition 2 (Existence of consistency).
Under A1–A6 and Proposition 1, there exists at least one such that .
Using the lasso as the example of penalized regression, Figure 14 illustrates Propositions 1 and 2. In Figure 14, refers to the true DGP, refers to the solution of eq. (7), and the diamond-shaped feasible sets are due to the ℓ1 penalty. Different values of imply different areas for the feasible sets, which shrink as the value of increases. There are three possible cases: (i) under-shrinkage: for a low value of (Figure 14(a)), lies within the feasible set and has the minimum eTE in the population; (ii) perfect shrinkage: for the oracle (Figure 14(b)), is located precisely on the boundary of the feasible set and still has the minimum eTE in the population; (iii) over-shrinkage: for a high value of (Figure 14(c)), lies outside the feasible set. In cases (i) and (ii), the constraints become inactive as , so . However, in case (iii), . An important implication is that tuning the penalty parameter is critical for the theoretical properties of penalized-regression estimators.
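The shrinkage regimes can be mimicked numerically. The sketch below is our illustration (it uses ridge rather than the lasso so the estimator has a closed form, and the penalty levels are arbitrary assumptions): a mild penalty leaves the estimator close to the true DGP, while a heavy penalty over-shrinks it away from the truth:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.2, size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimator; lam = 0 recovers OLS."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

b_ols = ridge(X, y, 0.0)       # unconstrained benchmark
b_under = ridge(X, y, 1.0)      # mild penalty: constraint effectively inactive
b_over = ridge(X, y, 5000.0)    # heavy penalty: truth lies outside the feasible set

print(np.linalg.norm(b_under - beta_true))
print(np.linalg.norm(b_over - beta_true))
```

With the mild penalty the estimate is nearly indistinguishable from OLS and from the truth; with the heavy penalty the estimate is pulled far toward zero, the numerical analogue of case (iii).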
3.3 Main results for penalized regression under GEM
Intuitively, the penalized-regression estimator will be consistent in some norm or measure as long as, for a specific , lies in the feasible set and offers the minimum eTE. In practice, however, we may not know a priori whether causes over-shrinkage, especially when the number of variables, , is not fixed. As a result, we need an approach like validation or cross-validation to tune the value of . Applying the results in Section 2, we show that GEM guarantees that the model selected by penalized regression with the appropriate , , asymptotically converges in to the true DGP.
In this section we analyze the finitesample and asymptotic properties of the GEM estimator in two settings: and . In the case where , OLS is feasible, and is the unpenalized regression estimator. In the case where , OLS is not feasible, and forward stagewise regression (FSR) is the unpenalized regression estimator.
3.3.1 GEM for penalized regression with
Firstly, by adapting eqs. (3)–(5) for regression, we establish the upper bound of the eGE.
Lemma 2 (Upper bound for the eGE of the OLS estimator).
Under A1–A6, if we assume ,

Validation case. The following bound for the eGE for holds with probability at least , .
(9) where is the eGE and is the eTE of the OLS estimator, and is defined in Lemma 1.

fold cross-validation case. The following bound for the eGE for holds with probability at least , .
(10) where is the eGE and is the eTE of the OLS estimator in the qth round of cross-validation, and is defined in Lemma 1.
In a similar fashion to eqs. (3)–(5), eqs. (9) and (10) measure the upper bound of the eGE for the OLS estimator under validation and cross-validation, respectively. In standard practice, of course, neither validation nor cross-validation is implemented as part of OLS estimation, and the eGE of the OLS estimator is not computed. Nevertheless, eqs. (9) and (10) show that it is possible to characterize the eGE of the OLS estimator without having to carry out validation or cross-validation. Eqs. (9) and (10) also show that the higher the variance of in the true DGP, the higher the upper bound of the eGE under validation and cross-validation.
As a bonus of the GEM approach, eq. (10) also shows that we can find the that maximizes the GA from cross-validation by tuning to the lowest upper bound of the cross-validated eGE, determined by minimizing the expectation of the RHS of eq. (10).
Corollary 1 (The optimal for penalized regression).
The penalty parameter can be tuned by validation or fold cross-validation. For , we have different test sets for tuning and different training sets for estimation. Using eqs. (9) and (10), we now establish, in two steps, an upper bound for the norm difference between the unpenalized estimator and the corresponding penalized estimator under validation and cross-validation.
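In the cross-validation case, tuning amounts to evaluating the averaged eGE on a grid of penalty levels and keeping the minimizer, as in this hedged sketch (ours; ridge stands in for the generic penalized estimator, and the grid, fold count and simulated DGP are assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 120, 8
X = rng.normal(size=(n, p))
beta = np.r_[np.ones(2), np.zeros(p - 2)]  # sparse true DGP
y = X @ beta + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimator on a training fold."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_ege(lam, K=5, seed=0):
    """Averaged eGE across the K rounds for a given penalty level."""
    r = np.random.default_rng(seed)
    folds = np.array_split(r.permutation(n), K)
    errs = []
    for q in range(K):
        te = folds[q]
        tr = np.concatenate([folds[j] for j in range(K) if j != q])
        b = ridge(X[tr], y[tr], lam)
        errs.append(np.mean((y[te] - X[te] @ b) ** 2))
    return float(np.mean(errs))

lambdas = np.logspace(-2, 4, 20)
scores = {lam: cv_ege(lam) for lam in lambdas}
lam_star = min(scores, key=scores.get)  # the penalty level minimizing the cross-validated eGE
```

Keeping the partition fixed across penalty levels (same `seed`) makes the comparison across the grid fair, since every candidate is scored on the same folds.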
Proposition 3 (difference between the penalized and unpenalized predicted values).

Validation case. The following bound for the difference between the predicted values from and the validated holds with probability at least
(11) where is defined in Theorem 1.

fold cross-validation case. The following bound for the difference between the predicted values from the fold cross-validated and holds with probability at least
where is the OLS estimator and is the penalized estimator in the qth round of cross-validation, and is defined in Theorem 2.
Following Markov’s classical proof of consistency for OLS, Proposition 3 establishes the norm convergence of the fitted values to the true values. Under A1–A6, the identification condition is satisfied, and the convergence of the fitted values implies the norm consistency of the penalized-regression estimator.
We now establish the upper bound of the norm difference between and , or , under validation and cross-validation.
Theorem 3 (difference between the penalized and unpenalized regression estimators).

Validation case. The following bound for the difference between and the validated holds with probability at least
(13) where
is the minimum eigenvalue of
and is defined in Lemma 2. 
fold cross-validation case. The following bound for the difference between the fold cross-validated and holds with probability at least
(14) where is defined in Lemma 2.
Several important remarks apply to Theorem 3. The LHS of eq. (13) measures the norm difference between the penalized-regression estimator and the OLS estimator under validation. The RHS of eq. (13) essentially captures the maximum norm difference between and . As eq. (13) shows, this maximum difference depends on the GE of the true DGP and the GE of the OLS model.

The first term on the RHS of eq. (13) (ignoring ) is the difference between the eGE from OLS and the upper bound of the population error, or, equivalently, the difference between the GA of the OLS estimator and its maximum. The better the GA of , the less overfitting OLS generates, the closer the eGE of is to the upper bound of the population error, and the smaller the first term on the RHS of eq. (13).

The second term on the RHS of eq. (13) (ignoring ) measures the empirical endogeneity of the OLS estimator on the test set. On the training set , but on the test set, in general, . Hence, measures the GA for the empirical moment condition of the OLS estimator on out-of-sample data.^{15}^{15}15Because we standardize the test and training data, the moment condition holds directly. The more generalizable the OLS estimate, the closer is to zero on out-of-sample data, and the smaller the second term on the RHS of eq. (13).
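The contrast between the in-sample and out-of-sample moment conditions can be checked directly: by the OLS normal equations the training-set moment is zero to machine precision, while the test-set moment is generically nonzero. A small sketch (ours; the simulated DGP and split are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 400, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 2.0]) + rng.normal(size=n)

# Train/test split: estimate on the first 300 observations.
Xtr, ytr, Xte, yte = X[:300], y[:300], X[300:], y[300:]
b = np.linalg.lstsq(Xtr, ytr, rcond=None)[0]

# The OLS normal equations force the in-sample moment condition to hold exactly...
m_train = Xtr.T @ (ytr - Xtr @ b) / len(ytr)
# ...but out of sample the empirical moment is generally nonzero.
m_test = Xte.T @ (yte - Xte @ b) / len(yte)

print(np.abs(m_train).max(), np.abs(m_test).max())
```

The magnitude of `m_test` is the quantity the second term on the RHS of eq. (13) controls: the closer it is to zero, the more generalizable the OLS moment condition.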