Generalization error minimization: a new approach to model evaluation and selection with an application to penalized regression

10/18/2016
by   Ning Xu, et al.
The University of Sydney

We study model evaluation and model selection from the perspective of generalization ability (GA): the ability of a model to predict outcomes in new samples from the same population. We believe that GA is one way formally to address concerns about the external validity of a model. The GA of a model estimated on a sample can be measured by its empirical out-of-sample errors, called the generalization errors (GE). We derive upper bounds for the GE, which depend on sample sizes, model complexity and the distribution of the loss function. The upper bounds can be used to evaluate the GA of a model, ex ante. We propose using generalization error minimization (GEM) as a framework for model selection. Using GEM, we are able to unify a big class of penalized regression estimators, including lasso, ridge and bridge, under the same set of assumptions. We establish finite-sample and asymptotic properties (including L_2-consistency) of the GEM estimator for both the n ≥ p and the n < p cases. We also derive the L_2-distance between the penalized and corresponding unpenalized regression estimates. In practice, GEM can be implemented by validation or cross-validation. We show that the GE bounds can be used for selecting the optimal number of folds in K-fold cross-validation. We propose a variant of R^2, the GR^2, as a measure of GA, which considers both in-sample and out-of-sample goodness of fit. Simulations are used to demonstrate our key results.



1 Introduction

Traditionally in econometrics, a statistical inference about a population is formed using a model imposed and estimated on a sample. Put another way, the goal of inference is to see whether a model estimated on a sample may be generalized to the population. Deciding whether the estimated model is valid for generalization is referred to as the process of model evaluation. The performance of a model is usually evaluated using the sample data at hand, referred to as the ‘internal validity’ of the model. However, internal validity may not be a useful indicator of model performance in many different scenarios. A perspective of increasing interest in applied econometrics considers the performance of a model on out-of-sample data. In this paper, we focus on the ability of a model estimated from a given sample to fit new samples, referred to as the generalization ability (GA) of the model.

Generalization ability is one aspect of external validity, the extent to which the results from a study generalize to other settings. Researchers evaluating pilot programs and randomized trials in a variety of settings from labour economics to development economics are increasingly focused on the external validity of their findings. (In the literature, external validity sometimes refers to whether results based on a sample from one population generalize to another population. The concept of GA may be adapted to this interpretation provided the heterogeneity across the two populations is adequately controlled.) For instance, Heckman and Vytlacil (2007), Ludwig et al. (2011) and Allcott and Mullainathan (2012) discuss the importance of externally valid results in policy and program evaluation. In a similar vein, Guala and Mittone (2005) and List (2011) emphasize the importance of external validity in determining whether experiments can offer a robust explanation of causes and effects in the field. While external validity encompasses many issues from experimental design, sample selection, and economic theory, it clearly also raises econometric issues around model estimation and evaluation.

In order to explore the properties of estimators from the perspective of GA, three questions need to be addressed. How do we measure GA? Given a useful measure, what are its properties? Given its properties, how can GA be exploited for estimation? These questions have received only tangential interest in the literature. Roe and Just (2009) discuss the trade-off between internal and external validity in empirical data, pointing out that while internal validity (or in-sample fit) is well studied, there is much less research on how to control external validity (out-of-sample fit). Some work on external validity has focussed on the properties of specific estimation methods. Angrist and Fernandez-Val (2010), for example, study the external validity of instrumental variable estimation in a labor market setting. In this paper, we answer these questions, starting by proposing a measure of GA that is straightforward to implement empirically.

With a new sample at hand, GA is easily measured by using validation or cross-validation to quantify the goodness-of-fit of the estimated model on new (out-of-sample) data. Without a new sample, however, it can be difficult to measure GA ex ante. In this paper, when only a single sample is available, we quantify the GA of the in-sample estimates by deriving upper bounds on the empirical out-of-sample errors, which we call the empirical generalization errors (eGE). The upper bounds reveal the properties of the eGE and provide insight into other important statistical properties, such as the trade-off between in-sample and out-of-sample fit in estimation. Furthermore, the upper bounds can be extended to analyze the performance of validation, K-fold cross-validation and specific estimators such as penalized regression. Thus, the GA approach yields insight into the finite-sample and asymptotic properties of model estimation and evaluation from several different perspectives.

Essentially, GA measures the performance of a model from an external point-of-view: the greater the GA of an estimator, the better its predictions on out-of-sample data. Furthermore, we show that GA also serves as a natural criterion for model selection, throwing new light on the model selection process. We propose the criterion of minimizing eGE (maximizing GA), or generalization error minimization (GEM), as a framework for model selection. Using penalized regression as an example of a specific model selection scenario, we show how the traditional bias-variance trade-off is connected to GEM and to the trade-off between in-sample and out-of-sample fit. Moreover, the GEM framework allows us to establish additional properties for penalized regression implicit in the bias-variance trade-off.

1.1 Approaches to model selection

Given the increasing prevalence of high-dimensional data in economics, model selection is coming to the forefront in empirical work. Researchers often desire a smaller set of predictors in order to gain insight into the most relevant relationships between outcomes and covariates. Without explicitly introducing the concept of GA, the classical approach to model selection focusses on the bias-variance trade-off, yielding methods such as the information criteria (IC), cross-validation, and penalized regression. Consider the standard linear regression model

y = Xβ + ε,

where y is an n × 1 vector of outcome variables, X is an n × p matrix of covariates and ε is a vector of i.i.d. random errors. The parameter vector β may be sparse in the sense that many of its elements are zero. Model selection typically involves using a score or penalty function that depends on the data (Heckerman et al., 1995), such as the Akaike information criterion (Akaike, 1973), Bayesian information criterion (Schwarz, 1978), cross-validation errors (Stone, 1974, 1977) or the mutual information score among variables (Friedman et al., 1997, 2004).

An alternative approach to model selection is penalized regression, implemented through the objective function:

β̂_λ = argmin_β ‖y − Xβ‖_2^2 + λ‖β‖_q^q,    (1)

where ‖β‖_q is the ℓ_q-norm of β and λ ≥ 0 is a penalty parameter. Note in eq. (1) that if λ = 0, the OLS estimator is obtained. The IC can be viewed as special cases with q = 0 and appropriate fixed values of λ. The lasso (Tibshirani, 1996) corresponds to the case q = 1 (an ℓ_1 penalty). When q = 2 (an ℓ_2 penalty), we have the ridge estimator (Hoerl and Kennard, 1970). For any q > 0, we have the bridge estimator (Frank and Friedman, 1993), proposed as a generalization of the ridge.

One way to derive the penalized regression estimates is to use the cross-validation approach, summarized in Algorithm 1. As shown in Algorithm 1, cross-validation solves the constrained minimization problem in eq. (1) for each value of the penalty parameter λ to derive an estimate β̂_λ. When the feasible range of λ is exhausted, the estimate that produces the smallest out-of-sample error among all the estimated β̂_λ is chosen to be the penalized regression estimate.

Algorithm 1: Penalized regression estimation under cross-validation

1. Set the initial value of the penalty parameter λ.
2. Partition the sample into a training set and a test set. Standardize all variables (to ensure the penalized regression residual satisfies the relevant moment condition in both the training set and the test set).
3. Compute the penalized regression estimate β̂_λ on the training set. Use β̂_λ to calculate the prediction error on the test set.
4. Increase the penalty parameter λ by a preset step size. Repeat 2 and 3 until the feasible range of λ is exhausted.
5. Select the penalized regression estimate to be the β̂_λ that minimizes the prediction error on the test set.
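
To make Algorithm 1 concrete, the following sketch tunes a lasso penalty over a grid of λ values on a single training/test split. It is a minimal illustration only: the simulated DGP, the λ grid, the 70/30 split and the use of scikit-learn's Lasso are our own assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0                                   # sparse true DGP (illustrative assumption)
y = X @ beta + rng.normal(size=n)

# Step 2: partition and standardize (the scaler is fit on the training set only)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Steps 1, 3 and 4: sweep the penalty parameter over a preset grid
lambdas = np.linspace(0.01, 1.0, 50)
test_errors = []
for lam in lambdas:
    fit = Lasso(alpha=lam).fit(X_tr, y_tr)       # estimate on the training set
    resid = y_te - fit.predict(X_te)             # prediction error on the test set
    test_errors.append(np.mean(resid ** 2))

# Step 5: keep the estimate with the smallest out-of-sample error
best_lambda = lambdas[int(np.argmin(test_errors))]
print(f"selected penalty: {best_lambda:.3f}")
```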

A range of consistency properties have been established for the IC and penalized regression. Shao (1997) proves that various IC and cross-validation are consistent in model selection. Breiman (1995); Chickering et al. (2004) show that the IC have drawbacks: they tend to select more variables than necessary and are sensitive to small changes in the data. Zhang and Huang (2008); Knight and Fu (2000); Meinshausen and Bühlmann (2006); Zhao and Yu (2006) show that ℓ_1-penalized regression is consistent in different settings. Huang et al. (2008); Hoerl and Kennard (1970) show the consistency of penalized regression for other values of q. Zou (2006); Caner (2009); Friedman et al. (2010) propose variants of penalized regression for different scenarios and Fu (1998) compares different penalized regressions using a simulation study. Alternative approaches to model selection, such as combinatorial search algorithms, may be computationally challenging to implement, especially with high-dimensional data. (Chickering et al. (2004) point out that the best subset selection method is unable to deal with a large number of variables, heuristically 30 at most.)

1.2 Major results and contribution

A central idea in this paper is that model evaluation and model selection may be re-framed from the perspective of GA. If the objective is to improve the GA of a model, model selection is necessary. Conversely, GA provides a new and elegant angle from which to understand model selection. By introducing generalization errors as the measure of GA, we connect GEM to the traditional bias-variance trade-off, the trade-off between in-sample and out-of-sample fit and the properties of validation and cross-validation. By the same token, the concept of GA may be used to derive additional theoretical properties of model selection. Specifically, for the case of regression analysis, we use GEM to unify the properties of the class of ℓ_q-penalized regressions, and show that the finite-sample and asymptotic properties of penalized regression are closely related to GA.

The first contribution of this paper is to quantify the GA of a model, ex ante, by deriving an upper bound for the GE. The upper bounds depend on the sample size, an index of the complexity of models, a loss function, and the distribution of the underlying population. The upper bound also characterizes the trade-off between in-sample fit and out-of-sample fit. As shown in Vapnik and Chervonenkis (1971a, b); McDonald et al. (2011); Smale and Zhou (2009); Hu and Zhou (2009), the inequalities underlying conventional analysis of GA focus on the relation between the population error and the empirical in-sample error. Conventional methods to improve GA involve computing discrete measures of model complexity, such as the VC dimension, Rademacher dimension or Gaussian complexity, which typically are hard to compute. In contrast, for any out-of-sample data, we quantify bounds for the prediction error of the extremum estimate learned from in-sample data. Furthermore, we show that empirical GA analysis is straightforward to implement via validation or cross-validation and possesses desirable finite-sample and asymptotic properties for model selection.

A second contribution of the paper is to propose GEM as a general framework for model selection and estimation. By re-framing the bias-variance trade-off from the perspective of in-sample and out-of-sample fit, GA is connected to the traditional bias-variance trade-off and model selection. To be specific: model selection is necessary to improve the GA of an estimated model while GA naturally serves as an elegant way to understand the properties of model selection. Thus, a model estimated by GEM will not only achieve the highest GA—and thus some degree of external validity—but also possess a number of nice theoretical properties. GEM is also naturally connected to the properties of cross-validation. As argued by Varian (2014), cross-validation could be adopted more often in empirical analysis in economics, especially as big data sets are becoming increasingly available in many fields.

A third contribution of the paper is to use GA analysis to unify a class of penalized regression estimators and derive their finite-sample and asymptotic properties (including L_2-consistency) under the same set of assumptions. Various properties of penalized regression estimators have previously been established, such as probabilistic consistency or the oracle property (Knight and Fu, 2000; Zhao and Yu, 2006; Candes and Tao, 2007; Meinshausen and Yu, 2009; Bickel et al., 2009). GA analysis reveals that similar properties can be established more generally and for a wider class of penalized regression estimators. We also show that the L_2-difference between the OLS estimate and any penalized regression estimate depends on their respective GAs.

Lastly, a fourth contribution of the paper is to show that GA analysis may be used to tune the hyper-parameter for validation (i.e., the ratio of training sample size to test sample size) or cross-validation (i.e., the number of folds K). Existing research has studied cross-validation for the estimation of specific parametric and nonparametric models (Hall and Marron, 1991; Hall et al., 2004; Stone, 1974, 1977). In contrast, by adapting the classical error bound inequalities that follow from our analysis of GA, we derive the optimal tuning parameters for validation and cross-validation in a model-free setting. We also show how K affects the bias-variance trade-off for cross-validation: a higher K increases the variance and lowers the bias.

The paper is organized as follows. In Section 2, we propose empirical generalization errors as the measure of GA and as the criterion to evaluate the external performance of the estimated model. Also, by deriving the upper bounds for the empirical generalization errors in different scenarios, we reveal important features of the extremum estimator such as the trade-off between in-sample and out-of-sample fit. Our approach also offers a framework to study the properties of validation and cross-validation. In Section 3, we apply the GEM framework as a model selection criterion for penalized regression. We also demonstrate a number of new properties for penalized regression that flow from the GEM framework. We prove the L_2-consistency of penalized regression estimators for both the n ≥ p and n < p cases. Further, we establish the finite-sample upper bound for the L_2-difference between penalized and unpenalized estimators based on their respective GAs. In Section 4, we use simulations to demonstrate the ability of penalized regression to control overfitting. We also propose a measure of overfitting based on the empirical generalization errors, called the generalized R², or GR². Section 5 concludes with a brief discussion of our results. Proofs (showing detailed steps) are contained in Appendix A and summary plots of the simulations are in Appendix B.

2 Generalization ability and the upper bound for the finite-sample generalization error

In this section, we propose the eGE, a measure of the GA of a model, as a new angle from which to evaluate the ‘goodness’ of a model. By deriving upper bounds on the eGE of the extremum estimator, we show that these bounds directly quantify how much the model overfits or underfits a finite sample, and that they may be used to study the properties of the eGE itself. Furthermore, the bounds can be used to study the finite-sample properties of validation and cross-validation. Together, these results imply that the eGE is a convenient and useful criterion for model evaluation.

2.1 Generalization ability, generalization error and overfitting

In econometrics, choosing the best approximation to the data often involves measuring a loss function, Q(z, f), defined as a functional that depends on some estimate f and the sample points z = (y, x). The population error (or risk) functional is defined as

R(f) = ∫ Q(z, f) dF(z),

where F(z) is the joint distribution of y and x. Without knowing the distribution a priori, given a random sample {z_1, …, z_n}, we define the empirical error functional as follows:

R_n(f) = (1/n) Σ_{i=1}^{n} Q(z_i, f).

For example, in the regression case, f is the estimated parameter vector β̂ and Q(z_i, f) = (y_i − x_i β̂)².

When estimation involves minimizing the in-sample empirical error, we have the extremum estimator (Amemiya, 1985). In many settings, however, minimizing the in-sample empirical error does not guarantee a reliable model. In regression, for example, the R² is often used to measure goodness-of-fit on the in-sample data. However, an estimate with a high in-sample R² may fit out-of-sample data poorly, a feature commonly referred to as overfitting: the in-sample estimate is too tailored to the sample data, compromising its out-of-sample performance. (Likewise, if the model fits the in-sample data poorly and hence compromises its out-of-sample performance, we say that the model underfits the data.) As a result, the in-sample fit (internal validity) of the model may not be a reliable indicator of the general applicability (external validity) of the model.
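
A small simulation illustrates the point. In the sketch below (a hypothetical DGP with only two relevant covariates; the data and dimensions are illustrative assumptions), adding irrelevant regressors mechanically raises the in-sample R² of OLS while the out-of-sample R² deteriorates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n, p_max = 100, 60
X = rng.normal(size=(2 * n, p_max))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=2 * n)   # only two relevant covariates (assumption)
X_in, y_in = X[:n], y[:n]      # in-sample (training) data
X_out, y_out = X[n:], y[n:]    # out-of-sample data

for p in (2, 10, 30, 60):
    ols = LinearRegression().fit(X_in[:, :p], y_in)
    r2_in = r2_score(y_in, ols.predict(X_in[:, :p]))
    r2_out = r2_score(y_out, ols.predict(X_out[:, :p]))
    # in-sample R^2 never decreases with p, while out-of-sample R^2 deteriorates
    print(f"p={p:2d}  in-sample R^2={r2_in:.3f}  out-of-sample R^2={r2_out:.3f}")
```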

Thus, Vapnik and Chervonenkis (1971a) refer to the generalization ability (GA) of a model: a measure of how an extremum estimator performs on out-of-sample data. GA can be measured several different ways. In the case where y and x are directly observed, GA is a function of the difference between the actual and predicted y for out-of-sample data. In this paper, GA is measured by the out-of-sample empirical error functional.

Definition 1 (Training set, test set, empirical training error and empirical generalization error).
  1. Let z = (y, x) denote a sample point from F(z), the joint distribution of y and x. Let 𝓕 denote the space of all models. The loss function for f ∈ 𝓕 is Q(z, f). The population error functional for f is R(f). The empirical error functional is R_n(f).

  2. Given a sample of size n, the training set refers to the data used to estimate f (i.e., the in-sample data) and the test set refers to the data not used to estimate f (i.e., the out-of-sample data). The effective sample sizes of the training set, the test set and the total sample are, respectively, n_tr, n_te and n = n_tr + n_te.

  3. Let f̂ denote an extremum estimator, where f̂ minimizes the empirical error on the training set. Under validation, the empirical training error (eTE) for f̂ is its empirical error on the training set, the empirical generalization error (eGE) is its empirical error on the test set, and the population error is R(f̂).

  4. For K-fold cross-validation, denote the training set and test set in the qth round, respectively, as the data outside and inside the qth fold. In each round, the sample size of the training set is n(K − 1)/K and the sample size of the test set is n/K. The eGE and the eTE in the qth round of cross-validation are, respectively, the empirical errors of the round-q estimate on the qth test set and the qth training set.

Two methods are typically used to compute the eGE of an estimate: validation and cross-validation. Under the validation approach, the sample is randomly divided into a training set and a test set. Following estimation on the training set, the fitted model is applied to the test set to compute the validated eGE. K-fold cross-validation may be thought of as ‘multiple-round validation’. Under the cross-validation approach, the sample is randomly divided into K subsamples or folds. (In practice, researchers often arbitrarily choose K = 5, 10, 20, 40 or n.) One fold is chosen to be the test set and the remaining K − 1 folds comprise the training set. Following estimation on the training set, the fitted model is applied to the test set to compute the eGE. The process is repeated K times, with each of the K folds in turn taking the role of the test set while the remaining folds are used as the training set. In this way, K different estimates of the eGE for the fitted model are obtained. The average of the K eGEs yields the cross-validated eGE.

Cross-validation uses each data point in both the training and test sets. The method also reduces resampling error by running the validation K times over different training and test sets. Intuitively this suggests that cross-validation is more robust to resampling error and on average will perform at least as well as validation. In Section 3, we study the generalization ability of penalized extremum estimators in both the validation and cross-validation cases.
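
The following sketch shows how the eTE, the validated eGE and the cross-validated eGE of Definition 1 can be computed in practice. The squared-error loss, the use of OLS as the extremum estimator and the simulated data are illustrative assumptions rather than part of the paper's setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

# Validation: one random split, one eTE and one eGE
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
fit = LinearRegression().fit(X_tr, y_tr)
eTE = mean_squared_error(y_tr, fit.predict(X_tr))   # empirical training error
eGE = mean_squared_error(y_te, fit.predict(X_te))   # empirical generalization error
print(f"validation: eTE={eTE:.3f}, eGE={eGE:.3f}")

# K-fold cross-validation: K rounds, the cross-validated eGE is the average of the fold eGEs
K = 5
fold_eges = []
for train_idx, test_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    fit_q = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_eges.append(mean_squared_error(y[test_idx], fit_q.predict(X[test_idx])))
print(f"{K}-fold cross-validated eGE = {np.mean(fold_eges):.3f}")
```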

To study the properties of eGE for a model, three assumptions are required for the analysis in this section of the paper, stated as follows.


Assumptions

  1. In the probability space (Ω, 𝒜, P), we assume 𝒜-measurability of the loss function Q(z, f), the population error R(f) and the empirical error R_n(f), for any f ∈ 𝓕 and any sample point z. Distributions of loss functions have closed-form first-order moments.

  2. The sample is randomly drawn from the population. In cases with multiple random samples, both the training set and the test set are randomly sampled from the population. In cases with a single random sample, both the training set and the test set are randomly partitioned from the sample.

  3. For any sample, the extremum estimator exists. The in-sample error of the extremum estimator converges in probability to the minimal population error as n → ∞.

A few comments are in order for assumptions A1–A3. The loss distribution assumption A1 is made to simplify the analysis. The existence and convergence assumption A3 is standard (see, for example, Newey and McFadden (1994)).

The independence assumption A2 is not essential; GA analysis is valid for both i.i.d. and non-i.i.d. data. While the original research on GA in Vapnik and Chervonenkis (1974a, b) imposes the i.i.d. restriction, subsequent work has generalized their results to cases where the data are dependent or not identically distributed (see, for example, Yu (1994); Cesa-Bianchi et al. (2004); Smale and Zhou (2009); Mohri and Rostamizadeh (2009); Kakade and Tiwari (2009); McDonald et al. (2011)).

Others have shown that if heterogeneity is due to an observed random variable, the variable may be added to the model to control for the heterogeneity (see Michalski and Yashin (1986); Skrondal and Rabe-Hesketh (2004); Wang and Feng (2005); Yu and Joachims (2009); Pearl (2015)), while if the heterogeneity is related to a latent variable, various approaches—such as the hidden Markov model, mixture modelling or factor modelling—are available for heterogeneity control. In this paper, due to the different measure-theoretic setting required for dependent data, we focus on the independent case as a first step.

Given A1–A3, both the eTE and the eGE converge to the population error as the corresponding sample sizes grow.

2.2 The upper bound of the empirical generalization error

The problems of GA and overfitting were identified many years ago. Originally, to improve the generalization ability of a model, Vapnik and Chervonenkis (1971a, b) propose minimizing the upper bound of the population error of the estimator as opposed to minimizing the eTE. The balance between in-sample fit and out-of-sample fit is formulated by Vapnik and Chervonenkis (1974b) using the Glivenko-Cantelli theorem and Donsker’s theorem for empirical processes. Specifically, the so-called VC inequality (Vapnik and Chervonenkis, 1974b) summarizes the relationship between the population error and the empirical training error.

Lemma 1 (VC inequality: the upper bound for the population error).

Under A1–A3, the following inequality holds with probability at least 1 − η for every model in the model space:

(2)

where the LHS is the population error, the first term on the RHS is the empirical training error, the second term is a complexity penalty depending on the sample size n, the VC dimension h and the confidence level η, and n/h is the effective sample size.

The VC dimension h is a more general measure of the geometric complexity of a model than the number of parameters, p, which does not readily extend as a measure of complexity in nonlinear or non-nested models. While h reduces to p directly for generalized linear models, h can also be used to order the complexity of nonlinear or non-nested models. (In empirical processes, several other geometric complexity measures are connected to or derived from the VC dimension, such as the minimum description length (MDL) score, the Rademacher dimension (or complexity), Pollard’s pseudo-dimension and the Natarajan dimension. Most of these measures, like the VC dimension, are derived from the Glivenko-Cantelli class of empirical processes.) Thus, eq. (2) can be used as a tool for linear, nonlinear and non-nested model selection.

While Lemma 1 is established under A2, eq. (2) can be generalized to non-i.i.d. cases. McDonald et al. (2011) generalize the VC inequality to α- and β-mixing stationary time series while Smale and Zhou (2009) generalize the VC inequality to panel data. Moreover, a number of papers (Michalski and Yashin, 1986; Skrondal and Rabe-Hesketh, 2004; Wang and Feng, 2005; Yu and Joachims, 2009; Pearl, 2015) show that heterogeneity can be controlled in the context of the VC inequality by implementing a latent variable model or by adding the variable causing the heterogeneity to the model.

The VC inequality provides an upper bound for the population error based on the eTE and the VC dimension of the model. As shown in Figure 1, when the VC dimension is low (i.e., the model is simple), the effective sample size n/h of the training set is large, the second term on the RHS of (2) is small, and the eTE is close to the population error. In this case the extremum estimator has a good GA. However, when the VC dimension is high (i.e., the model is very complicated), the effective sample size is small and the second term on the RHS of (2) becomes larger. In such situations, a small eTE does not guarantee a good GA, and overfitting is more likely.

Figure 1: The VC inequality and the upper bound of the population error

Vapnik and Chervonenkis (1971a) show that minimizing the RHS of eq. (2) reduces overfitting and improves the GA of the extremum estimator. However, this can be hard to implement because the VC dimension is difficult to calculate for anything other than linear models. (For example, the VC dimension is complicated and unwieldy to compute if we wish to compare the GA under two different distributions in kernel density estimation.)

Moreover, the VC inequality bounds the population error. In practice the population is rarely accessible and the data we collect are always finite; furthermore, we are often interested in how an estimated model performs on finite out-of-sample data, for example, "how well does the effect of a policy, estimated from a sample from suburb A, explain the variation in suburb B, given that the relevant heterogeneity is controlled?" Hence, as a handy and intuitive measure of GA or overfitting in empirical work, the eGE has been proposed and is widely used in applications.

As a reasonable criterion for model evaluation, we would expect the eGE to exhibit a number of regularities as an empirical process. For example, if we estimate a model on a training set and apply it to many test sets of the same size, we would expect the resulting eGEs to be as stable as possible despite sampling error; likewise, if we calculate the eGE of a model on test sets of different sizes, we would expect the eGEs to be influenced by the size of the training set, the size of the test set and the tail behavior of the error distribution. However, because of sampling from the population and sub-sampling by validation or cross-validation, the randomness of the eGE as an empirical process makes such properties difficult to derive. To reduce and absorb the influence of sampling and sub-sampling, one approach is to derive an upper bound for the eGE (every error functional is non-negative, so zero serves as a natural lower bound; moreover, the consequences of a large eGE are typically more severe than those of a small one, so we focus on the upper bound here), in the same spirit as what eq. (2) does for the population error. We now describe the finite-sample relation between the eGE and the eTE under validation, obtained by adapting eq. (2).

Theorem 1 (The upper bound for the eGE of the extremum estimator under validation).

Under A1–A3 and given an extremum estimator, the following upper bound for the eGE holds with probability at least 1 − η:

(3)

where the LHS is the eGE, the first term on the RHS is the eTE, the complexity term is defined as in Lemma 1, and the remaining term takes a simple form if the loss distribution is bounded and a more involved form otherwise.

Eq. (3) provides an upper bound for the eGE on the test set as a function of the eTE of the model estimated on the training set. Thus, eq. (3) shows how the eTE from extremum estimation on the training set may be used to bound the eGE of the model on the test set. In other words, eq. (3) measures the GA of a model of a given complexity under the validation approach.

Theorem 1 has several important implications.

  1. Upper bound of the eGE. Eq. (3) establishes the upper bound of the eGE for any out-of-sample data of size n_te based on the eTE from any in-sample data of size n_tr. Thus, eq. (3) quantifies the upper bound of the eGE, as opposed to Lemma 1, which quantifies the upper bound of the population error. Previously, measuring the eGE of a model with new data required the use of validation or cross-validation. Given Theorem 1, the eGE may be quantified directly using the RHS of eq. (3), avoiding the need for validation.

  2. The trade-off between accuracy and looseness of the upper bound. Theorem 1 also shows that the confidence level influences the trade-off between the accuracy and efficiency of the upper bound. The higher the required confidence, the more likely the upper bound holds and the larger it becomes. Thus, while the probability the bound holds increases, it comes at the cost of a looser upper bound. In contrast, a lower required confidence reduces the upper bound, offering an empirically efficient upper bound at the cost of reducing the probability that it holds.

  3. The eGE-eTE trade-off in model selection. Eq. (3) also characterizes the trade-off between eGE and eTE for model selection in both the finite-sample and asymptotic cases. Asymptotically, the population GE and the population TE converge to the population error; hence, minimizing the eTE can lead directly to the true DGP in the population. By contrast, for the finite-sample case illustrated in Figure 2, while an overcomplicated model (a low effective sample size n/h) will have a small eTE, eq. (3) shows it may result in a large eGE on new data. Thus, an overcomplicated model will tend to overfit the in-sample data and have poor GA. On the other hand, an oversimplified model (a high n/h) will be unlikely to recover the true DGP, leading to a large upper bound for the eGE. Thus, an oversimplified model will tend to underfit, fitting both the in-sample and out-of-sample data poorly. The complexity of a model is therefore related to a trade-off between the eTE and eGE.

  4. GA and tails of the loss function distribution. Eq. (3) also shows how the tail of the loss function distribution affects the upper bound of the eGE through the second term on the RHS of eq. (3). If the loss distribution is bounded or light-tailed, this term is mathematically simple and converges to zero quickly. If the loss function is heavy-tailed, the highest order of the population moment that is closed-form for the loss distribution (it is closed-form owing to A1, which guarantees closed-form first-order moments for all loss distributions) can be used to measure the heaviness of the loss distribution tail, a smaller order implying a heavier tail. In the case of a heavy tail, the term is mathematically more complicated and its convergence rate is slower. Hence, the heavier the tail of the loss distribution, the higher the upper bound of the eGE and the harder it is to control GA in finite samples. In the extreme case in which only the first moment exists, there is no way to adapt eq. (3).


Our next step is to establish a bound similar to eq. (3) for K-fold cross-validation. Given that K-fold cross-validation is simply multiple-round validation, a similar bound to eq. (3) can be established for cross-validation by convolution. For convenience, we refer to the corresponding empirical process in the qth round as the eGE gap in the qth round of cross-validation.

Figure 2: Representation of the trade-off between eGE and eTE
Theorem 2 (The upper bound for the eGE of the extremum estimator for cross-validation).

Under A1–A3, given an extremum estimator and a number of folds K, then

(i)

if the loss distribution is light-tailed, the following upper bound for the eGE gap holds with probability at least 1 − η:

(4)

(ii)

if the loss distribution is heavy-tailed and sub-exponential, the following upper bound for the eGE holds with probability at least 1 − η:

(5)

where the quantities entering (4) and (5) are, respectively, the eGE and the eTE in the qth round of cross-validation.

Theorem 2 provides an upper bound for the average eGE from the K rounds of cross-validation. Generally speaking, the errors generated from cross-validation are affected both by sampling randomness (from the population) and by the sub-sampling randomness that arises from partitioning the sample into folds. Thus, the errors from cross-validation are potentially more volatile than the usual errors from estimation. Eqs. (4) and (5) offer a way to characterize the properties of the eGE despite the effect of sub-sampling randomness.

The implications of eqs. (4) and (5) are as follows.

  1. Upper bound of the eGE. Similar to eq. (3), eqs. (4) and (5) serve as upper bounds for the cross-validated eGE. Both equations reveal the trade-off between eTE and eGE and the influence of the tails of the loss distribution. Under convolution, the tails dramatically affect the upper bound of the eGE. If the loss distribution is light-tailed, the eGE gap converges to zero exponentially as the sample size grows. When the loss distribution is sub-exponential, the convolution is more complicated and it is hard to approximate the probability. However, with a sufficiently large sample, Theorem 2 shows that we can approximate the convoluted probability.

  2. The trade-off between accuracy and looseness of the upper bound. Similar to eq. (3), in both eqs. (4) and (5) we need to consider the trade-off between the probability that the bound holds and its looseness (as in Theorem 1), i.e., the trade-off between efficiency and accuracy. Similar to Theorem 1, ceteris paribus, a larger confidence requirement in each round of cross-validation increases both the bound and the probability that it holds. Thus, in each round, the upper bound moves upwards and the probability the upper bound holds increases, implying that both increase overall under cross-validation. While the probability the bound holds increases, it comes at the cost of a looser bound.

  3. Cross-validation hyperparameter K. Eqs. (4) and (5) characterize how K affects the average eGE from cross-validation (also called the cross-validation error in the literature). With a given sample and fixed K, sub-sampling randomness will produce a different average eGE each time cross-validation is performed. From Definition 1, the training-set size is n(K − 1)/K and the test-set size is n/K, so the sizes of the training and test sets change with K. As K increases the test sets become smaller, increasing the influence of sub-sampling randomness on the eGE. On the other hand, as K decreases the training sets become smaller, increasing the influence of sub-sampling randomness on the eTE. The two effects complicate analysis of the effect of K on the trade-off between eTE and eGE under cross-validation. Thus, to characterize the influence of sub-sampling randomness, we establish the trade-off for cross-validation by a bound, after running cross-validation many times. (By contrast, for extremum estimators like OLS, the bias-variance trade-off is much more straightforward to analyze for different K because the sample is fixed.) Figure 5 illustrates the effect of K.

    • Small K. For low values of K, the training-set size n(K − 1)/K is low in each round of in-sample estimation, so the eTE in each round is more biased away from the population error, as shown in Figure 5(a). Also, for small K, the K-round average eTE (the first term on the RHS of eqs. (4) and (5)) is more biased away from the true population error, as shown in Figure 5(b). As a result, the RHS of eqs. (4) and (5) suffers more from finite-sample bias for low values of K. However, since a small K implies the test-set size n/K is relatively large, more data are used for the eGE calculation in each round, so the eGE on each test set should be less volatile. Thus, the K-round averaged eGE for cross-validation is relatively less volatile, reflecting the fact that the corresponding term in eqs. (4) and (5) is not very large.

    • Large K. For high values of K, the test-set size n/K is low. Given a small test set, the eGE in each round may be hard to bound from above, so the averaged eGE from the K rounds will be more volatile and the corresponding term will increase. However, with a high K, the first term on the RHS of eqs. (4) and (5) tends to be closer to the true population error and the averaged eGE suffers less from bias.

    In summary, Figure 5(b) shows that as the value of K increases, the averaged eGE from cross-validation follows a typical bias-variance trade-off. For low values of K, the average eGE is less volatile but more biased away from the population error. As K increases, the averaged eGE becomes more volatile but less biased away from the population error.


Theorem 2 confirms the exhaustive simulation study results of Kohavi (1995). As a side benefit, Theorem 2 also suggests that an optimal number of folds may exist for each sample when the cross-validation approach is used with extremum estimation, as shown in Figure 5(b). More specifically, the K that minimizes the upper bounds (4) and (5) also maximizes the GA from cross-validation. We leave discussion of the optimal K for regression to the next section.

(a) eGE in each round of cross-validation
(b) average eGEs from the K rounds of cross-validation
Figure 5: The bias-variance trade-off for cross-validation eGE
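
The repeated cross-validation sketch below gives one way to examine this trade-off empirically: for each K it re-runs K-fold cross-validation over many random fold assignments and reports the mean and the variability of the averaged eGE. The estimator (OLS), the squared-error loss and the simulated data are illustrative assumptions, not the paper's simulation design.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
n, p, n_repeats = 120, 10, 200
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

for K in (2, 5, 10, n):                       # K = n is leave-one-out
    cv_errors = []
    reps = n_repeats if K < n else 1          # leave-one-out has no sub-sampling randomness
    for r in range(reps):
        kf = KFold(n_splits=K, shuffle=True, random_state=r)
        fold_eges = [
            mean_squared_error(
                y[te], LinearRegression().fit(X[tr], y[tr]).predict(X[te])
            )
            for tr, te in kf.split(X)
        ]
        cv_errors.append(np.mean(fold_eges))  # K-round averaged eGE for this fold assignment
    print(f"K={K:3d}: mean CV error={np.mean(cv_errors):.3f}, "
          f"std across repetitions={np.std(cv_errors):.4f}")
```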

Theorems 1 and 2 establish upper bounds for the eGE of the extremum estimator for a random sample of any size, revealing a way to analyze the properties of the eGE of extremum estimators. If we take the eGE as the criterion for model selection, Theorems 1 and 2 may be used to evaluate the performance of model selection under validation and cross-validation. Hence it is natural to select the model with the minimal eGE in the space of alternative models.

2.3 Generalization error minimization

By establishing upper bounds for the eGE under validation and cross-validation in Section 2.2, we have shown that, as a criterion for model evaluation, the eGE possesses a number of finite-sample and asymptotic properties, which suggests that it offers a good angle from which to understand model selection. Hence, taking the eGE as a criterion for model selection, we propose selecting the model that minimizes the eGE, an approach we refer to as generalization error minimization or GEM.

Generally speaking, GEM can be implemented alongside many conventional model selection techniques, such as penalized regression, the information criteria and maximum a posteriori (MAP) estimation. However, in the next section, we show that GEM works especially well for penalized regression. As shown in Algorithm 1, penalized regression estimation returns a β̂_λ for each λ. Each value of λ generates a different model and a different eGE. As a result, Theorems 1 and 2 guarantee that the model with the minimum eGE among the β̂_λ has the best empirical generalization ability. By applying GEM in conjunction with validation/cross-validation and various penalty methods, the theoretical properties of penalized regression, such as its robustness, mode of consistency and convergence rate, can be analyzed by directly applying the upper bounds derived above.

3 Finite-sample and asymptotic properties for penalized regression under GEM

In Section 2, to analyze the eGE in a model-free setting, we established a class of upper bounds for the eGE of extremum estimators, connected to the eTE-eGE trade-off, the bias-variance trade-off and cross-validation; we also proposed GEM as a general framework for model selection. In this section, we apply GEM to regression analysis and show that it serves as a new and clear angle from which to understand the procedure of penalized regression. With the eGE as the criterion of model selection, model selection directly serves as an effective method for improving GA; conversely, the properties of model selection can be directly explained and re-framed in terms of the eGE of the model. Moreover, beyond the properties in Section 2, additional properties can be established for penalized regression under the GEM framework. Specifically, we establish: (1) specific error bounds for any penalized regression; (2) L_2-consistency for all penalized regression estimators; and (3) an upper bound for the L_2-difference between the penalized regression estimator and the OLS estimator as a function of the eGE, the tail behavior of the loss function distribution and the exogeneity of the sample.

3.1 Penalized regression

Firstly, we formally define penalized regression and its two most popular variants: lasso (ℓ_1-penalized regression) and ridge (ℓ_2-penalized regression). It is important to stress that each variable in X must be standardized before implementing penalized regression. As shown by Tibshirani (1996), without standardization the penalized regression estimates may be influenced by the magnitude (units) of the variables. After standardization, of course, X and y are unit- and scale-free.

Definition 2 (Penalized regression, ridge regression, lasso regression, L_2-eGE and L_2-eTE).
  1. The general form of the objective function for penalized regression is

    min_β ‖y − Xβ‖_2^2 + λ‖β‖_q^q,    (6)

    where the penalty term is a function of the ℓ_q-norm of β.

  2. Let β̂_λ denote the solution to the constrained minimization eq. (6) for a given value of the penalty parameter λ. The penalized regression estimator is the β̂_λ with the minimum eGE among all the alternative β̂_λ (as in Algorithm 1 in Section 1), and the OLS estimator is computed on the training set.

  3. The objective function for the lasso (ℓ_1-norm penalty) is

    min_β ‖y − Xβ‖_2^2 + λ‖β‖_1,    (7)

    and the objective function for ridge regression (ℓ_2-norm penalty) is

    min_β ‖y − Xβ‖_2^2 + λ‖β‖_2^2.    (8)

  4. The L_2-norm eTE and eGE for any estimate β̂ are defined, respectively, as the squared-error empirical errors computed on the training set and on the test set.

(a) OLS estimate, ℓ_1 and ℓ_2 constraints
(b) ℓ_1 penalty (lasso)
(c) ℓ_2 penalty (ridge)
Figure 9: Comparison of estimates from OLS and various penalized regressions

The idea behind penalized regression is illustrated in Figure 9. As shown in Figure 9(a), different penalties correspond to different boundaries of the estimation feasible set. For the ℓ_1-penalized regression (lasso), the feasible set is a diamond. The feasible set expands to a circle under an ℓ_2 penalty. As illustrated in Figures 9(b) and 9(c), for a given λ, the smaller q, the more likely the solution is a corner solution. It follows that under the ℓ_1 penalty, variables are more likely to be dropped compared with the ℓ_2 penalty. (For q < 1, the penalized regression may be a non-convex programming problem. While general algorithms have not been found for non-convex optimization, Strongin and Sergeyev (2000), Yan and Ma (2001) and Noor (2008) have developed working algorithms. For q = 0, the penalized regression becomes a discrete programming problem, which can be solved by Dantzig-type methods; see Candes and Tao (2007). In the special case when q = 0 and λ takes the appropriate value, the penalized regression is identical to the Akaike (Bayesian) information criterion.)
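
The corner-solution behaviour can also be seen numerically. The sketch below (simulated, standardized data; the λ grid is arbitrary) traces lasso coefficients with scikit-learn's lasso_path, showing that the ℓ_1 penalty sets coefficients exactly to zero as λ grows, whereas the ℓ_2 (ridge) penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import lasso_path, Ridge

rng = np.random.default_rng(4)
n, p = 100, 8
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])   # illustrative sparse DGP
y = X @ beta + rng.normal(size=n)
X = (X - X.mean(0)) / X.std(0)   # standardize, as required before penalizing
y = y - y.mean()                 # center y since lasso_path fits no intercept

# Lasso (l1) path: coefficients hit exactly zero as the penalty grows
lambdas, lasso_coefs, _ = lasso_path(X, y, alphas=np.logspace(-2, 0.5, 6))
for lam, coefs in zip(lambdas, lasso_coefs.T):
    n_zero = int(np.sum(coefs == 0.0))
    print(f"lasso  lambda={lam:6.3f}: {n_zero} of {p} coefficients exactly zero")

# Ridge (l2) penalty only shrinks: coefficients stay nonzero for any finite penalty
for lam in (0.01, 1.0, 100.0):
    r = Ridge(alpha=lam).fit(X, y)
    print(f"ridge  lambda={lam:6.2f}: {int(np.sum(r.coef_ == 0.0))} of {p} coefficients exactly zero")
```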

Penalized regression primarily focuses on overfitting. By contrast, OLS minimizes the eTE without any penalty, often resulting in a large eGE (e.g., the ‘overfitting’ case in Figure 1). It is also possible that OLS fits the training data poorly, causing both the eTE and the eGE to be large (e.g., the ‘underfitting’ case in Figure 1). Generally speaking, it is easier to deal with overfitting than underfitting (see eqs. (13) and (14)). Overfitting in OLS is usually caused by including too many variables, which can be resolved by reducing the number of included variables, p. Underfitting, however, is likely due to a lack of data (variables) and the only remedy is to collect more data.

3.2 GEM for penalized regression

The conventional route to establish finite-sample and asymptotic properties for regression is by analyzing the properties of the estimator in the space of the eTE. By contrast, to study how penalized regression improves GA, we reformulate the analysis in the space of the eGE. Figure 10 outlines our proof strategy. We show that a number of finite-sample properties for penalized regression can be established under the GEM framework.

Figure 10: Outline of proof strategy

In asymptotic analysis, consistency is typically considered one of the most fundamental properties. To demonstrate that GEM is a viable estimation framework, we prove that the penalized regression model selected by eGE minimization converges to the true DGP as n → ∞. Essentially, we show that penalized regression bijectively maps to the minimal eGE among the candidate estimates on the test set. To link the finite-sample and asymptotic results, we need to show that if the true DGP is bijectively assigned to the global minimum eGE in the population, then the estimate selected by minimizing the eGE is consistent, in probability and in L_2.


Further assumptions

  1. The true DGP is y = Xβ₀ + ε.

  2. E[ε | X] = 0.

  3. There is no perfect collinearity in X.

Assumptions A4–A6 restrict the true DGP, indexed by β₀, to be identifiable. Otherwise, there may exist an alternative model that is not statistically different from the true DGP. The assumptions are standard for linear regression.

Under assumptions A1–A6, we first show that the true DGP has the lowest generalization error.

Proposition 1 (Identification of the true DGP in the space of eGE).

Under A1–A6, the true DGP, β₀, is the one and only model offering the minimal eGE as n → ∞.

Proposition 1 states that there is a bijective mapping between β₀ and the global minimum eGE in the population. If A5 or A6 were violated, variables might exist in the sample such that the true DGP does not have the minimum eGE in the population.

As shown in Algorithm 1, penalized regression chooses as its estimate the model with the minimum eGE among the candidates β̂_λ. Thus, we need to establish that when the sample size is large enough, the true DGP is included among the β̂_λ, the list of models selected by validation or cross-validation.

Proposition 2 (Existence of L_2-consistency).

Under A1–A6 and Proposition 1, there exists at least one β̂_λ that converges in L_2 to β₀.

Using lasso as the example of penalized regression, Figure 14 illustrates Propositions 1 and 2. In Figure 14, β₀ refers to the true DGP, the estimate refers to the solution of eq. (7), and the diamond-shaped feasible sets are due to the ℓ_1 penalty. Different values of λ imply different areas for the feasible sets, which get smaller as the value of λ increases. There are three possible cases: (i) under-shrinkage—for a low value of λ (Figure 14(a)), β₀ lies within the feasible set and has the minimum eTE in the population; (ii) perfect shrinkage—for the oracle λ (Figure 14(b)), β₀ is located precisely on the boundary of the feasible set and still has the minimum eTE in the population; (iii) over-shrinkage—for a high value of λ (Figure 14(c)), β₀ lies outside the feasible set. In cases (i) and (ii), the constraints become inactive as n → ∞, so the lasso estimate converges to β₀. However, in case (iii), it does not. An important implication is that tuning the penalty parameter is critical for the theoretical properties of the penalized regression estimators.

(a) low λ (under-shrinkage)
(b) oracle λ (perfect shrinkage)
(c) high λ (over-shrinkage)
Figure 14: Shrinking for various values of λ under the ℓ_1 penalty

3.3 Main results for penalized regression under GEM

Intuitively, the penalized regression estimator will be consistent in some norm or measure as long as, for a specific λ, the true DGP lies in the feasible set and offers the minimum eTE. In practice, however, we may not know a priori whether a given λ causes over-shrinking, especially when the number of variables, p, is not fixed. As a result, we need to use an approach like validation or cross-validation to tune the value of λ. Applying the results in Section 2, we show that GEM guarantees that the model selected by penalized regression with the appropriate λ asymptotically converges in L_2 to the true DGP.

In this section we analyze the finite-sample and asymptotic properties of the GEM estimator in two settings: n ≥ p and n < p. In the case where n ≥ p, OLS is feasible, and is the unpenalized regression estimator. In the case where n < p, OLS is not feasible, and forward stagewise regression (FSR) is the unpenalized regression estimator.

3.3.1 GEM for penalized regression with n ≥ p

Firstly, by adapting eqs. (3), (4) and (5) for regression, we establish the upper bound of the eGE of the OLS estimator.

Lemma 2 (Upper bound for the eGE of the OLS estimator).

Under A1–A6, and assuming n ≥ p:

  1. Validation case. The following bound for the eGE of the OLS estimator holds with probability at least 1 − η:

    (9)

    where the two sides involve the eGE and the eTE of the OLS estimator, and the complexity term is defined as in Lemma 1.

  2. K-fold cross-validation case. The following bound for the eGE of the OLS estimator holds with probability at least 1 − η:

    (10)

    where the two sides involve the eGE and the eTE of the OLS estimator in the qth round of cross-validation, and the complexity term is defined as in Lemma 1.

In a similar fashion to eqs. (3), (4) and (5), eqs. (9) and (10) bound the eGE of the OLS estimator under validation and cross-validation, respectively. In standard practice, of course, neither validation nor cross-validation is implemented as part of OLS estimation and the eGE of the OLS estimator is not computed. Nevertheless, eqs. (9) and (10) show that it is possible to quantify the eGE of the OLS estimator without having to carry out validation or cross-validation. Eqs. (9) and (10) also show that the higher the variance of the error term in the true DGP, the higher the upper bound of the eGE under validation and cross-validation.

As a bonus of the GEM approach, eq. (10) also shows that we can find the K that maximizes the GA from cross-validation by tuning K to the lowest upper bound of the cross-validated eGE, determined by minimizing the expectation of the RHS of eq. (10).

Corollary 1 (The optimal K for penalized regression).

Under A1–A6, and based on eq. (10) from Lemma 2, if we also assume a distributional form for the error term, the optimal K for cross-validation in penalized regression (the K attaining the minimum expected upper bound of the eGE) is defined as the K that minimizes the expectation of the RHS of eq. (10).

The penalty parameter λ can be tuned by validation or K-fold cross-validation. For K-fold cross-validation, we have K different test sets for tuning and K different training sets for estimation. Using eqs. (9) and (10), we now establish, in two steps, an upper bound for the L_2-norm difference between the unpenalized estimator and the corresponding penalized estimator under validation and cross-validation.

Proposition 3 (L_2-difference between the penalized and unpenalized predicted values).

Under A1–A6 and based on Lemma 2 and Propositions 1 and 2:

  1. Validation case. The following bound for the difference between the predicted values from the OLS estimator and those from the validated penalized estimator holds with probability at least 1 − η:

    (11)

    where the remaining term is defined in Theorem 1.

  2. K-fold cross-validation case. The following bound for the difference between the predicted values from the K-fold cross-validated penalized estimator and those from the OLS estimator holds with probability at least 1 − η:

    (12)

    where the OLS estimator and the penalized estimator are those computed in the qth round of cross-validation, and the remaining term is defined in Theorem 2.

Following Markov’s classical proof of consistency for OLS, Proposition 3 establishes the L_2-norm convergence of the fitted values to the true values. Based on A1–A6, the identification condition is satisfied, and the convergence of the fitted values implies the L_2-norm consistency of the penalized regression estimator.

We now establish the upper bound of the L_2-norm difference between the penalized and the unpenalized regression estimators under validation and cross-validation.

Theorem 3 (L_2-difference between the penalized and unpenalized regression estimators).

Under A1–A6 and based on Propositions 1, 2 and 3:

  1. Validation case. The following bound for the L_2-difference between the OLS estimator and the validated penalized estimator holds with probability at least 1 − η:

    (13)

    where the bound is scaled by the minimum eigenvalue of the sample design matrix and the remaining terms are defined in Lemma 2.

  2. K-fold cross-validation case. The following bound for the L_2-difference between the K-fold cross-validated penalized estimator and the OLS estimator holds with probability at least 1 − η:

    (14)

    where the terms are defined as in the validation case and in Theorem 2.

Some important remarks apply to Theorem 3. The LHS of eq. (13) measures the L_2-norm difference between the penalized regression estimator and the OLS estimator under validation. The RHS of eq. (13) essentially captures the maximum L_2-norm difference between the two estimators. As shown in eq. (13), the maximum difference depends on the GE of the true DGP and the GE of the OLS model.

  • The first term on the RHS of eq. (13) (ignoring the scaling constant) is the difference between the eGE from OLS and the upper bound of the population error, or, equivalently, the difference between the GA of the OLS estimator and its maximum. The better the GA of the OLS estimator, the less overfitting OLS generates, the closer its eGE is to the upper bound of the population error, and the smaller the first term on the RHS of eq. (13).

  • The second term on the RHS of eq. (13) (ignoring the scaling constant) measures the empirical endogeneity of the OLS estimator on the test set. On the training set the OLS moment condition holds exactly, but on the test set, in general, it does not. Hence, this term measures the GA of the empirical moment condition of the OLS estimator on out-of-sample data. (Because we standardize the test and training data, the moment condition holds directly.) The more generalizable the OLS estimate, the closer the empirical moment condition is to zero on out-of-sample data, and the smaller the second term on the RHS of eq. (13).