# When Does More Regularization Imply Fewer Degrees of Freedom? Sufficient Conditions and Counter Examples from Lasso and Ridge Regression

Regularization aims to improve prediction performance of a given statistical modeling approach by moving to a second approach which achieves worse training error but is expected to have fewer degrees of freedom, i.e., better agreement between training and prediction error. We show here, however, that this expected behavior does not hold in general. In fact, counter examples are given that show regularization can increase the degrees of freedom in simple situations, including lasso and ridge regression, which are the most common regularization approaches in use. In such situations, the regularization increases both training error and degrees of freedom, and is thus inherently without merit. On the other hand, two important regularization scenarios are described where the expected reduction in degrees of freedom is indeed guaranteed: (a) all symmetric linear smoothers, and (b) linear regression versus convex constrained linear regression (as in the constrained variant of ridge regression and lasso).


## 1 Introduction

Let $y \in \mathbb{R}^n$ be a random data set generated according to a probability distribution $P$, where $\mu$ is a parameter that we wish to model. A modeling approach is a mapping from a training set $y$ to a model $\hat\mu$. Models are evaluated according to an error or loss criterion $E(\hat\mu, y^{new})$, where $y^{new}$ (the test set) is also drawn from $P$, independently of $y$. Here we focus on the squared-error criterion:

$$E(\tilde\mu, y) = \frac{1}{n}\,\|\tilde\mu - y\|_2^2, \qquad (1)$$

although, as will be discussed in Section 5, other choices are possible, with most of the theory that follows intact. We follow previous work in examining the in-sample error, where any covariate values are the same for the training and testing data (Efron, 1983; Hastie et al., 2009).

In the model selection problem we are given a collection of candidate modeling approaches, and our goal is to select an approach having small risk, i.e., small expected prediction error. A typical setting in which this problem arises is regularization, where a family of nested modeling approaches is considered (Hastie et al., 2009). To apply a modeling approach from the family, one first has to specify the value of a tuning parameter which controls the amount of fitting to the training data. Consider for example the problem of estimating $\mu$ by fitting a polynomial to a set of observations with least squares. Here the degree of the polynomial plays the role of the tuning parameter, where a higher degree leads to more fitting to training data than does a lower degree. Having specified the degree, the least squares optimization problem over the set of polynomials of this degree constitutes a modeling approach, and the solution given specific training data is a model. Choosing a very low degree (underfitting) is undesirable, as some of the information that could be gained from the training data is wasted. A high degree often leads to high variance and overfitting, and is also undesirable. Both underfitting and overfitting lead to models achieving suboptimal risk.

Model selection is facilitated by producing estimates for the risk of each candidate modeling approach (Mallows, 1973; Akaike, 1974; Schwarz, 1978; Stone, 1974). The training error, $E(\hat\mu_y, y)$, is a naive such estimate which is typically negatively biased, because fitting has been carried out on the same data used for performance evaluation. This bias (i.e., the difference between the expected training error and the risk) is termed expected optimism in Efron (1983), or (up to a constant) effective degrees of freedom in Hastie et al. (2009). Because within a nested family of regularized models the training error increases monotonically with the level of regularization, the latter term is premised on the idea that the effective degrees of freedom must correspondingly decrease (just as the degrees of freedom of nested linear regression models decrease when covariates are removed). In other words, for regularization to have the potential to reduce risk, less-fitting models (with larger training error) must have smaller optimism.

We prove in Section 3 that this is indeed true in some important cases. However, our main observation in this paper is that such monotonicity does not necessarily hold. In particular, the lasso (Tibshirani, 1996) and ridge regression (Hoerl, 1962) are two approaches of great practical and theoretical importance in regularized modeling. As we show in Section 4, both of them admit counter examples where more regularized, nested models, also have higher optimism. In fact, specifically for the lasso this can be argued to be a typical case that arises in natural examples. Thus monotonicity of degrees of freedom of nested models breaks down in perhaps the most common and important cases. When the monotonicity does not hold, the implication is that adding regularization counter-intuitively increases both the training error and the optimism, and hence is inherently without merit.

The remainder of the paper is organized as follows. Section 2 gives formal definitions for the notion of nesting, and reviews the basic concepts of optimism, degrees of freedom and their statistical properties. Section 3 gives sufficient conditions under which the effective degrees of freedom grow monotonically in the direction of nesting. Section 4 gives realistic examples where familiar nested models exhibit reverse monotonicity. Section 5 offers a discussion of our results.

## 2 Nesting and optimism

### 2.1 Nesting

In traditional linear regression methodology, least squares modeling approaches are projections on linear subspaces, and nested models are naturally defined according to the nesting structure of these subspaces. Specifically, one linear modeling approach is nested in another if the former admits a subset of the explanatory variables used in the latter. The span of this reduced explanatory set (defined as the linear span of the observed covariate vectors in the design matrix $X$) is geometrically nested in the span of the larger set. Here we formalize and generalize this definition to cover typical regularized modeling settings. Our first definition of strict-sense nesting is an immediate generalization of the least squares projection notion above. Intuitively, one modeling approach is nested in another if both fit the model by minimizing a loss criterion on the training data, over geometrically nested sets of candidate models (as done in empirical risk minimization (Vapnik, 2000)).

###### Definition 1 (Strict-sense nesting).

Let $M_S$ and $M_L$ be two modeling approaches that produce models by optimizing the training error. $M_S$ performs this optimization over the model set $S$, while $M_L$ considers the model set $L$:

$$M_S:\ \hat\mu_{y,S} = \operatorname*{argmin}_{\tilde\mu \in S}\, E(\tilde\mu, y), \qquad M_L:\ \hat\mu_{y,L} = \operatorname*{argmin}_{\tilde\mu \in L}\, E(\tilde\mu, y).$$

We say that $M_S$ is nested in the strict sense in $M_L$ if $S \subseteq L$. In this case we write $M_S \preceq M_L$.

As elaborated below, strict-sense nesting covers some interesting regularization families, but others (like penalized ridge regression or the lasso) are not covered by this definition, since different regularization levels modify the loss criterion rather than the set of candidate models. Towards this end we devise a second, looser definition, which we term wide-sense nesting. To be nested in this sense, the two modeling approaches have to be equivalent to strict-sense nested approaches for every specific training set $y$, but this correspondence can be data-dependent.

###### Definition 2 (Wide-sense nesting).

Let $Q = \{Q_s \mid s > 0\}$ be a family of nested sets ($Q_{s_1} \subseteq Q_{s_2}$ whenever $s_1 \le s_2$). Let $M_S$ and $M_L$ be modeling approaches that, given a training set $y$, produce the models $\hat\mu_{y,S}$ and $\hat\mu_{y,L}$, respectively. We say that $M_S$ is nested in the wide sense in $M_L$ if, for every value of $y$, there exist sets $Q_{s_S}, Q_{s_L} \in Q$ that depend on $y$, $M_S$ and $M_L$, such that $Q_{s_S} \subseteq Q_{s_L}$, and such that the models $\hat\mu_{y,S}$ and $\hat\mu_{y,L}$ are equivalent to the result of optimizing the same criterion over the model sets $Q_{s_S}$ and $Q_{s_L}$, respectively. Thus, the following holds for every value of $y$:

$$\hat\mu_{y,S} = \operatorname*{argmin}_{\tilde\mu \in Q_{s_S}}\, E(\tilde\mu, y), \qquad \hat\mu_{y,L} = \operatorname*{argmin}_{\tilde\mu \in Q_{s_L}}\, E(\tilde\mu, y).$$

In this case it is said that $Q$ induces the nesting.

Clearly, by taking $Q$ to contain the sets $S$ and $L$ themselves, Definition 2 is a generalization of Definition 1. The two definitions are embodied in the following example:

###### Example 2.1 (ridge regression).

The ridge regression modeling approach (Hoerl, 1962), in its common penalized form, fits a model by optimizing a criterion that incorporates a penalty term weighted by a prespecified tuning parameter $\lambda$:

$$\hat\mu = \operatorname*{argmin}_{\tilde\mu \in S}\,\left\{\|y - \tilde\mu\|_2^2 + \lambda\|\tilde\beta\|_2^2\right\}, \qquad S = \{\tilde\mu \mid \exists\,\tilde\beta \in \mathbb{R}^p : \tilde\mu = X\tilde\beta\}. \qquad (2)$$

The Lagrangian dual problem (Boyd and Vandenberghe, 2004) of (2) is the less common but conceptually important constrained-form ridge regression. The dual tuning parameter is $s$, which this time directly constrains the squared norm of the coefficient vector:

$$\hat\mu = \operatorname*{argmin}_{\tilde\mu \in S}\,\|y - \tilde\mu\|_2^2, \qquad S = \{\tilde\mu \mid \exists\,\tilde\beta \in \mathbb{R}^p,\ \|\tilde\beta\|_2^2 \le s : \tilde\mu = X\tilde\beta\}. \qquad (3)$$

For the ridge regression problem, the duality of the two forms essentially means that, for a given vector $y$, for each value of $\lambda$ there exists a value of $s$ such that the two problems are equivalent, i.e., give the same $\hat\mu$ (for details see Davidov, 2006).

In the constrained form, the constraint on $\tilde\beta$ defines an $\ell_2$-ball in which all coefficient vectors must lie. Consequently the model set, which is defined by the projection of this ball by the matrix $X$, is enclosed by a hyper-ellipsoid (embedded in the hyperplane spanned by the columns of $X$) that scales isotropically with the value of the tuning parameter $s$. Specifically, according to Def. 1, a constrained ridge regression with a smaller $s$ is nested in a constrained ridge regression with a larger $s$.

In the penalized form, nesting exists in the sense of Def. 2. Consider two penalized ridge cases: one with $\lambda_1$ and the other with $\lambda_2$, and assume $\lambda_1 > \lambda_2$. The criterion optimized in (2) depends on the value of $\lambda$, so there is no strict-sense nesting as per Def. 1. Now, let

$$Q = \left\{\,Q_s = \{\tilde\mu \mid \exists\,\tilde\beta \in \mathbb{R}^p,\ \|\tilde\beta\|_2^2 \le s : \tilde\mu = X\tilde\beta\} \,\middle|\, s > 0\,\right\}.$$

From the strong duality with the constrained form, the penalized ridge regression for a given $y$ and $\lambda$ in fact optimizes the squared-loss criterion over a model set in $Q$. In our case, denote these sets by $Q_{s_1}$ and $Q_{s_2}$. Since $\lambda_1 > \lambda_2$ in the dual (penalized) form, then $s_1 \le s_2$ for every value of $y$, and thus $Q_{s_1} \subseteq Q_{s_2}$ in the primal (constrained) form. Therefore the penalized approach with $\lambda_1$ is nested in the one with $\lambda_2$, and penalized ridge regression models are nested in the wide sense of Def. 2.
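The nesting argument above can be sketched numerically. The following is a minimal illustration (dimensions, seed and data are arbitrary choices, not taken from the paper) that the squared norm of the penalized ridge solution, which determines the equivalent constraint level $s(\lambda)$, decreases monotonically in $\lambda$:

```python
import numpy as np

# Wide-sense nesting of penalized ridge regression: for a fixed data set,
# the penalized solution beta_hat(lambda) = (X^T X + lambda I)^{-1} X^T y
# determines an equivalent constraint level s(lambda) = ||beta_hat(lambda)||^2.
# Nesting follows because s(lambda) is monotone decreasing in lambda.

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

def ridge_beta(lam):
    # closed-form penalized ridge solution
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
norms = [float(b @ b) for b in map(ridge_beta, lambdas)]
print(norms)  # decreasing: larger penalty -> smaller induced constraint set
```

Each induced set $Q_{s(\lambda)}$ is therefore contained in the one induced by any smaller penalty, exactly as Definition 2 requires.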

### 2.2 Optimism and effective degrees of freedom

Let the expected training error (averaged over all random training sets) be $E_{train}$, and the risk, or expected prediction error over all training and test sets, be $E_{pred}$. The expected optimism (Efron, 1983), $\omega$, is then defined as the difference of these two quantities:

$$\omega = E_{pred} - E_{train} = \frac{1}{n}\left(E\|\hat\mu - y^{new}\|_2^2 - E\|\hat\mu - y\|_2^2\right). \qquad (4)$$

The optimism theorem, due to Efron (1983, 2004), relates optimism to the self-influence of observations:

$$\omega = \frac{2}{n}\sum_{i=1}^{n}\operatorname{cov}(\hat\mu_i, y_i). \qquad (5)$$

Stein’s Lemma (Stein, 1981) further states that, under certain regularity conditions and for a normally distributed, homoscedastic and uncorrelated data set $y$ (with extensions for a variety of other cases as well (Kattumannil, 2009)), the expected optimism is proportional to the divergence (the trace of the Jacobian) of $\hat\mu$ as a function of $y$:

$$\omega = \frac{2\sigma^2}{n}\,E\!\left(\sum_{i=1}^{n}\frac{\partial\hat\mu_i}{\partial y_i}\right). \qquad (6)$$

Expected optimism can thus be thought of as a measure of the sensitivity of the fitted values to their respective observations. A short review of the importance of optimism for model selection is given in Section LABEL:sec:additional_background of the Supplementary Material.

Specifically, for the class of linear smoothers of the form $\hat\mu = Sy$, where $S$ is uncorrelated with $y$ (or is simply fixed), if we assume $\operatorname{var}(y_i) = \sigma^2$ for all $i$, then (5) allows us to directly derive the optimism as:

$$\omega = \frac{2\sigma^2}{n}\operatorname{tr}(S).$$

In the case of linear regression and penalized ridge regression, the fit has the form:

$$\hat\mu = X(X^TX + \lambda I_{p\times p})^{-1}X^Ty, \qquad (7)$$

and the optimism of these approaches can be expressed using the singular value decomposition (SVD) of the design matrix $X = UDV^T$:

$$\omega = \frac{2\sigma^2}{n}\sum_{j=1}^{p}\frac{d_j^2}{d_j^2+\lambda} \qquad (8)$$

(for the details, see Hastie et al., 2009). In linear regression ($\lambda = 0$) this simplifies to $\omega = 2\sigma^2 p/n$, thus the optimism here is proportional to the degrees of freedom, which are the number of optimized parameters in the linear model. This means that for nested linear regressions, adding explanatory variables indeed increases optimism. Similar monotonicity occurs in penalized ridge regression with a general $\lambda$, since (8) decreases as $\lambda$ increases.
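The SVD expression (8) can be checked numerically against the trace of the smoother matrix; a small sketch with an arbitrary simulated design:

```python
import numpy as np

# Check that tr(S_lambda), with S_lambda = X (X^T X + lambda I)^{-1} X^T,
# matches the SVD expression sum_j d_j^2 / (d_j^2 + lambda) from Eq. (8).
# Dimensions and lambda are arbitrary illustrative values.

rng = np.random.default_rng(1)
n, p, lam = 30, 4, 2.5
X = rng.standard_normal((n, p))

S_lam = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
d = np.linalg.svd(X, compute_uv=False)        # singular values d_j
trace_svd = np.sum(d**2 / (d**2 + lam))

# tr(S_lambda) is proportional to the optimism: omega = (2 sigma^2 / n) tr(S).
print(np.trace(S_lam), trace_svd)             # the two agree
```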

These results motivated the definition by Wahba (1983) of $\operatorname{tr}(S)$ as “equivalent degrees of freedom” for linear smoothers. A natural extension is the definition of “generalized degrees of freedom” in Ye (1998) or “effective degrees of freedom” in Hastie et al. (2009) for an arbitrary modeling approach, based on the concept of expected optimism:

$$df = \frac{n}{2\sigma^2}\,\omega \overset{(a)}{=} \frac{1}{\sigma^2}\sum_{i=1}^{n}\operatorname{cov}(\hat\mu_i, y_i) \overset{(b)}{=} E\!\left(\sum_{i=1}^{n}\frac{\partial\hat\mu_i}{\partial y_i}\right).$$

When applicable, the equality $(a)$ follows from the optimism theorem (5), while $(b)$ comes from Stein’s lemma (6). In the original linear regression context, the degrees of freedom are a measure both of the optimism and of the amount of regularization (implying, for any pair of nested models, which model is nested in the other). As shown above for penalized ridge regression, and as will be shown more generally in the theorems of the next section, a monotonic nondecreasing relation between the amount of regularization and the effective degrees of freedom also holds in other important regularization methods. This supports a notion that such monotonicity holds in general, providing a wide theoretical basis for applying regularization. However, this is not always the case.

Before going into the examples of Section 4, which deal with lasso and (constrained) ridge regression and are of more practical relevance, let us begin with a simple illustrative counter-example where strict-sense nesting does not imply monotonicity of optimism.

###### Example 2.2 (toy counterexample).

Assume our data vector is two dimensional: $y = (y_1, y_2) \in \mathbb{R}^2$. Let $S$ be the line segment in $\mathbb{R}^2$ whose first coordinate lies in $[-1,1]$ and whose second coordinate is $0$, and let $L$ be the unit disk, so that $S \subset L$. We shall relax this later, but first let our data be $y_1 = \varepsilon \sim U[-1,1]$ and $y_2 = 2$. Projecting any realization of $y$ onto the line segment gives $\hat\mu_S = (y_1, 0)$. On the other hand, projecting onto the disk $L$, the correlation in the first coordinate is partial (and since $y_2$ is fixed, the correlation of the second component is still zero). Formally, the optimism for these models can be computed as:

$$y = \begin{bmatrix}0\\2\end{bmatrix} + \begin{bmatrix}\varepsilon\\0\end{bmatrix}, \qquad \varepsilon \sim U[-1,1],$$

$$\hat\mu_S = \begin{cases}(-1,\,0), & \text{if } y_1 < -1\\ (1,\,0), & \text{if } y_1 > 1\\ (y_1,\,0), & \text{otherwise},\end{cases} \qquad \hat\mu_L = \begin{cases}y, & \text{if } \|y\| < 1\\ y/\|y\|, & \text{otherwise},\end{cases}$$

where only the latter cases of each model are relevant for our distribution.

$$\operatorname{cov}(\hat\mu_{S,1}, y_1) = E[(\hat\mu_{S,1} - 0)(y_1 - 0)] = E(y_1^2) = \tfrac{1}{3},$$
$$\operatorname{cov}(\hat\mu_{S,2}, y_2) = \operatorname{cov}(\hat\mu_{L,2}, y_2) = 0,$$
$$\operatorname{cov}(\hat\mu_{L,1}, y_1) = E\!\left[\frac{y_1}{\sqrt{y_1^2+4}}\cdot y_1\right] = \frac{\sqrt{5}}{2} - 2\log\frac{1+\sqrt{5}}{2} \approx 0.1556,$$
$$\omega_S = \frac{2}{n}\sum_{i=1}^{n}\operatorname{cov}(\hat\mu_{S,i}, y_i) = \frac{1}{3}, \qquad \omega_L = \frac{2}{n}\sum_{i=1}^{n}\operatorname{cov}(\hat\mu_{L,i}, y_i) \approx 0.1556,$$

and indeed, the smaller nested model set, $S$, leads to more optimism than the larger one, $L$.

Supplementary Figure LABEL:fig:ex_line_ball (right panel) demonstrates the phenomenon using a Monte-Carlo simulation with repeated draws of $y$. When $y$ is normally distributed with the same expectation and variance-covariance matrix, we still see that the larger approach has smaller optimism. This phenomenon is not limited to a two-dimensional setup; Supplementary Figure LABEL:fig:ex_line_ball_highdim shows its persistence when the disk is replaced by a higher-dimensional ball and the line segment is replaced by a hyperplane tile. Furthermore, Supplementary Section LABEL:sec:toy shows that such nonmonotonicity in optimism can have a significant effect on prediction error.
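The closed-form computation above is easy to confirm with a short Monte-Carlo sketch (sample size and seed are arbitrary choices):

```python
import numpy as np

# Monte-Carlo version of the toy counterexample: the segment S = [-1,1] x {0}
# is strictly contained in the unit disk L, yet projecting onto S yields
# more optimism.

rng = np.random.default_rng(2)
reps = 200_000
eps = rng.uniform(-1.0, 1.0, reps)
y = np.stack([eps, np.full(reps, 2.0)], axis=1)        # y = (eps, 2)

# Projection onto the segment: clip the first coordinate, zero the second.
mu_S = np.stack([np.clip(y[:, 0], -1, 1), np.zeros(reps)], axis=1)
# Projection onto the unit disk: ||y|| >= 2 always, so project to the boundary.
mu_L = y / np.linalg.norm(y, axis=1, keepdims=True)

def omega(mu, y):
    n = y.shape[1]
    covs = [np.cov(mu[:, i], y[:, i])[0, 1] for i in range(n)]
    return (2.0 / n) * sum(covs)

omega_S, omega_L = omega(mu_S, y), omega(mu_L, y)
print(omega_S, omega_L)   # approx 1/3 and approx 0.1556
```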

## 3 Sufficient Conditions

We propose in this section two theorems which address important special cases of nesting. We show first (Thm. 1) that for the class of symmetric linear smoothers (spanning most commonly used smoothing approaches), nesting in the wide sense (Def. 2) guarantees that the smaller modeling approach has less optimism. Our next result (Thm. 2) concerns the case where the smaller modeling approach is a projection on a convex set $S$, while the bigger one is a projection on a linear subspace $L$ containing $S$. Here the modeling approaches are nested in the strict sense (Def. 1), and we show that in this case as well monotonicity of optimism is guaranteed.

###### Theorem 1.

Let $y$ be a homoscedastic and mutually uncorrelated observation vector. Let $\hat\mu_S = Sy$ and $\hat\mu_L = Ly$ be two linear smoothers that are real and symmetric ($S = S^T$, $L = L^T$). If $S$ is nested in $L$ in the wide sense over convex sets as in Definition 2, then $L$ has more optimism than $S$.

A proof is provided in Appendix 5.1. The following example uses Thm. 1 to show that all nested generalized ridge regressions exhibit monotone optimism in the direction of nesting.

###### Example 3.1 (generalized ridge regression).

Consider the following family of modeling approaches:

$$\hat\mu = X\operatorname*{argmin}_{\beta\in\mathbb{R}^p}\left\{\|y - X\beta\|_2^2 + \lambda\beta^TK\beta\right\}, \qquad (9)$$

where $K$ is some symmetric matrix. The solution is $\hat\beta = (X^TX + \lambda K)^{-1}X^Ty$. It can easily be verified that the penalty $\hat\beta^TK\hat\beta$ is a monotone decreasing function of $\lambda$. Thus we have nesting in the wide sense over the sets

$$Q = \left\{\,Q_s = \{\tilde\mu : \tilde\mu^TXKX^T\tilde\mu \le s\} \mid s > 0\,\right\},$$

and according to Thm. 1 the optimism is indeed monotone in $\lambda$.

Although direct eigen-analysis can give an explicit derivation of the optimism and therefore also prove the monotonicity in special cases (including ridge regression as shown in Section 2, and natural smoothing splines as in Hastie et al. (2009)), to our knowledge there is no previous general result that can be used to prove monotonicity for all generalized ridge approaches.
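The conclusion for this family can be checked numerically; a sketch with an arbitrary design and an assumed symmetric positive semidefinite $K$ (so the sets $Q_s$ are convex):

```python
import numpy as np

# Generalized ridge (Eq. 9): mu_hat = X (X^T X + lambda K)^{-1} X^T y for a
# symmetric penalty matrix K.  The effective degrees of freedom
# tr(S_lambda) should decrease as lambda grows.
# Dimensions and K below are arbitrary choices (K = A^T A is PSD).

rng = np.random.default_rng(3)
n, p = 40, 6
X = rng.standard_normal((n, p))
A = rng.standard_normal((p, p))
K = A.T @ A                                    # symmetric PSD penalty

def smoother_trace(lam):
    S = X @ np.linalg.solve(X.T @ X + lam * K, X.T)
    return np.trace(S)

traces = [smoother_trace(lam) for lam in (0.01, 0.1, 1.0, 10.0)]
print(traces)   # decreasing in lambda
```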

###### Theorem 2.

Let $M_S$ and $M_L$ be nested modeling approaches as defined by Definition 1 with the squared-error criterion (1). Let $L$ be a linear subspace of $\mathbb{R}^n$ and $S \subseteq L$ a convex set. If the conditions for Stein’s lemma (6) are satisfied for both $M_S$ and $M_L$, then $\omega(M_S) \le \omega(M_L)$.

This theorem, proven in Appendix 5.2, implies that any constrained linear regression model including constrained ridge regression, constrained lasso (discussed below), constrained elastic net (Zou and Hastie, 2005) and others, has lower optimism and fewer degrees of freedom than the unconstrained linear regression model with the same variables.
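As a concrete sanity check of this conclusion, the following Monte-Carlo sketch uses an assumed toy setup (all parameter values are arbitrary) in which the constrained projection has a closed form: with an orthonormal design, projecting onto $S = \{X\beta : \|\beta\|_2^2 \le s\}$ amounts to shrinking the OLS coefficients onto a ball.

```python
import numpy as np

# Theorem 2 illustration: S = {X beta : ||beta||^2 <= s} (convex) versus
# L = column space of X.  With orthonormal columns, z = X^T y are the OLS
# coefficients and the constrained fit shrinks z onto the ball of radius
# sqrt(s).  The constrained approach should show no more optimism than OLS.

rng = np.random.default_rng(4)
n, p, s, sigma, reps = 10, 3, 0.5, 1.0, 100_000
X, _ = np.linalg.qr(rng.standard_normal((n, p)))    # orthonormal columns
beta0 = rng.standard_normal(p)

Y = X @ beta0 + sigma * rng.standard_normal((reps, n))   # reps data sets
Z = Y @ X                                           # per-replicate OLS coefs
shrink = np.minimum(1.0, np.sqrt(s) / np.linalg.norm(Z, axis=1))
Mu_L = Z @ X.T                                      # projection onto L
Mu_S = (Z * shrink[:, None]) @ X.T                  # projection onto S

def omega(Mu, Y):
    covs = [np.cov(Mu[:, i], Y[:, i])[0, 1] for i in range(Y.shape[1])]
    return (2.0 / Y.shape[1]) * sum(covs)

omega_S, omega_L = omega(Mu_S, Y), omega(Mu_L, Y)
print(omega_S, omega_L)  # omega_S <= omega_L, omega_L near 2*sigma^2*p/n
```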

###### Example 3.2 (convexity requirement in Thm. 2).

It is important to note that the convexity requirement is needed in Thm. 2. Let

$$y = \begin{bmatrix}1\\0\end{bmatrix} + \varepsilon, \qquad \varepsilon \sim N\!\left(\begin{bmatrix}0\\0\end{bmatrix},\ (0.1)^2 I_{2\times2}\right),$$

be a two-dimensional data distribution. Let $L$ be the vertical axis $\{y : y_1 = 0\}$ and let $S$ be the two-point set $\{(0,-1), (0,1)\}$. Let $M_L$ and $M_S$ be Euclidean projections on $L$ and $S$ respectively. Their fits are $\hat\mu_L = (0, y_2)$ and $\hat\mu_S = (0, \operatorname{sign}(y_2))$, with the first coordinate of both $\hat\mu_L$ and $\hat\mu_S$ equal to $0$. $S \subset L$, and therefore $M_S$ is nested in $M_L$ in the strict sense. But it is easy to verify that:

$$\omega_L = \operatorname{var}(y_2) = 0.01, \qquad \omega_S = E|y_2| = 0.1\sqrt{2/\pi} \approx 0.08.$$

Thus, the theorem does not hold without the convexity requirement.
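The computation in this example is also easy to confirm by simulation (seed and sample size are arbitrary choices):

```python
import numpy as np

# Monte-Carlo check of Example 3.2: with L the vertical axis and S the
# non-convex two-point set {(0,-1),(0,1)}, the smaller set gives MORE
# optimism, so Theorem 2 genuinely needs convexity.

rng = np.random.default_rng(5)
reps = 200_000
y = np.array([1.0, 0.0]) + 0.1 * rng.standard_normal((reps, 2))

mu_L = np.stack([np.zeros(reps), y[:, 1]], axis=1)           # project on axis
mu_S = np.stack([np.zeros(reps), np.sign(y[:, 1])], axis=1)  # nearer of two points

def omega(mu, y):
    covs = [np.cov(mu[:, i], y[:, i])[0, 1] for i in range(2)]
    return (2.0 / 2) * sum(covs)

omega_L, omega_S = omega(mu_L, y), omega(mu_S, y)
print(omega_L, omega_S)   # approx 0.01 and approx 0.08
```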

## 4 Counterexamples

Examples 3.2 and 2.2 were simple illustrations that a smaller nested approach can give higher optimism than a larger one. While it is easy to devise such anecdotal examples, a key question is to what extent may we expect to encounter this phenomenon in the wild, i.e., in practically interesting and relevant situations. To address this question we call on what are perhaps the two most widely used and studied regularization approaches in regression: lasso (Tibshirani, 1996) and ridge regression (Hoerl, 1962). Supplementary Section LABEL:sec:irp gives a more exotic third example, using regularized isotonic regression.

###### Example 4.1 (lasso counterexample).

Using similar notations to the ridge definitions above, the penalized and constrained formulations of lasso are, respectively:

$$\hat\mu = \operatorname*{argmin}_{\tilde\mu \in S}\,\left\{\|y - \tilde\mu\|_2^2 + \lambda\|\tilde\beta\|_1\right\}, \qquad S = \{\tilde\mu \mid \exists\,\tilde\beta\in\mathbb{R}^p : \tilde\mu = X\tilde\beta\}, \qquad (10)$$
$$\hat\mu = \operatorname*{argmin}_{\tilde\mu \in S}\,\|y - \tilde\mu\|_2^2, \qquad S = \{\tilde\mu \mid \exists\,\tilde\beta\in\mathbb{R}^p,\ \|\tilde\beta\|_1 \le s : \tilde\mu = X\tilde\beta\}. \qquad (11)$$

Like in ridge regression, constrained lasso modeling approaches are nested in the strict sense (Def. 1), and penalized lasso modeling approaches are nested in the wide sense (Def. 2).

If we denote the solution of (10) by $\hat\beta$, it is well known that the lasso performs soft variable selection, where $\hat\beta_j = 0$ for some of $j = 1,\dots,p$. Following Zou et al. (2007), for a specific lasso solution, we denote by $A$ the active set of variables with non-zero coefficients, i.e., $A = \{j : \hat\beta_j \neq 0\}$. To avoid complex notation, the dependence of $A$ on the penalty or constraint is left implicit. Zou et al. (2007) give the Stein unbiased estimate of optimism for the penalized lasso formulation:

$$\hat\omega = \frac{2\sigma^2}{n}\,|A|,$$

while Kato (2009) gives a slightly different result for the constrained version. As the regularization level decreases, in the lasso solution of a specific data set the number of active variables can decrease and not only increase (Efron et al., 2004; Zou et al., 2007). Hence the Stein unbiased estimate of optimism is in general not expected to be monotone increasing as regularization decreases. The important question, however, is to what extent this behavior can exist in expectation over a distribution. In other words, can the optimism itself adopt a similar pattern in realistic examples?
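The Stein estimate can be illustrated in a setting where it is exact. The following sketch uses an assumed orthonormal design (not the counterexample construction below), for which the penalized lasso reduces to coordinatewise soft-thresholding of the OLS coefficients, and compares the mean active-set size against the covariance-based degrees of freedom:

```python
import numpy as np

# Monte-Carlo check of the Zou et al. (2007) result in the orthonormal case:
# mean |A| matches the covariance-based degrees of freedom.  With orthonormal
# X the penalized lasso solution is soft-thresholding of z = X^T y.
# All parameter values are arbitrary illustrative choices.

rng = np.random.default_rng(6)
n, p, lam, sigma, reps = 20, 5, 1.0, 1.0, 50_000
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
beta0 = np.array([2.0, 1.0, 0.5, 0.0, 0.0])

Y = X @ beta0 + sigma * rng.standard_normal((reps, n))
Z = Y @ X                                                 # OLS coefficients
B = np.sign(Z) * np.maximum(np.abs(Z) - lam / 2.0, 0.0)   # soft threshold
Mu = B @ X.T

# Degrees of freedom two ways: mean active-set size vs covariance formula.
df_active = np.mean(np.sum(B != 0, axis=1))
df_cov = sum(np.cov(Mu[:, i], Y[:, i])[0, 1] for i in range(n)) / sigma**2
print(df_active, df_cov)   # the two estimates agree closely
```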

We offer a simple example which demonstrates that this is eminently possible. Consider a regression problem with three covariates, $n$ observations and the following characterization:

$$x_{1i} = \sqrt{\tfrac{1}{n-1}} \quad \text{if } i \dots$$

Figure 1 (left panel) shows the degrees of freedom of the penalized lasso solution for this problem as a function of $\lambda$, estimated from independent simulations in two manners: (i) directly, using the definition of optimism in Eq. (5), and (ii) by calculating the Stein estimate (the number of non-zero coefficients) and averaging it over the simulations. As expected, the two calculations agree well, and both demonstrate the clear non-monotonicity of the optimism in the regularization level. The implications for model selection are clear: at some value $\lambda_1$ the modeling approach has both higher in-sample error and higher optimism than at a smaller value $\lambda_2 < \lambda_1$; therefore $\lambda_1$ is not a useful value of the regularization parameter to consider. For this distribution, decreasing the tuning parameter from $\lambda_1$ to $\lambda_2$ in fact decreases the true measure of regularization as captured by optimism and degrees of freedom; thus, the apparently less regularized model is in fact more regularized. If we move from the penalized formulation to the constrained formulation (using Kato’s derivation of the Stein estimate), the phenomenon persists (Fig. 1, right panel).

###### Example 4.2 (ridge regression counterexample).

For penalized ridge regression, the explicit derivation of its optimism in Eq. (8) guarantees monotonicity between the regularization level and the optimism, hence it cannot admit counter examples. The standard result in Eq. (8) assumes homoscedastic error; we now generalize it to the heteroscedastic case as well:

###### Proposition 1.

Suppose the observations $y_i$, which are the components of the vector $y$, are mutually uncorrelated but not homoscedastic, i.e., the covariance matrix $\Lambda$ of $y$ has components:

$$\Lambda_{ij} = \begin{cases}\sigma_i^2, & \text{if } i = j\\ 0, & \text{otherwise}.\end{cases}$$

Then the expected optimism of the ridge regression modeling approach is

$$\omega = \frac{2}{n}\sum_{i=1}^{n}\sigma_i^2\sum_{j=1}^{p}\frac{d_j^2}{d_j^2+\lambda}\,u_{ij}^2,$$

where $d_j$ and $u_{ij}$ are components of the matrices $D$ (on the main diagonal) and $U$ respectively, in the SVD of the design matrix: $X = UDV^T$.

We leave the proof to Supplementary Section LABEL:apdx:heteroridge.
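Proposition 1 can also be verified numerically: for the linear smoother $S_\lambda$ and noise covariance $\Lambda$, the optimism is $(2/n)\operatorname{tr}(S_\lambda\Lambda)$, which should match the SVD expression exactly (the design and variances below are arbitrary choices):

```python
import numpy as np

# Check Proposition 1: for uncorrelated heteroscedastic noise with
# Lambda = diag(sigma_i^2), the ridge optimism (2/n) tr(S_lambda Lambda)
# equals (2/n) sum_i sigma_i^2 sum_j u_ij^2 d_j^2 / (d_j^2 + lambda).

rng = np.random.default_rng(7)
n, p, lam = 25, 4, 1.7
X = rng.standard_normal((n, p))
sig2 = rng.uniform(0.5, 3.0, n)                   # heteroscedastic variances

S_lam = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
omega_trace = (2.0 / n) * np.trace(S_lam @ np.diag(sig2))

U, d, Vt = np.linalg.svd(X, full_matrices=False)  # X = U D V^T
omega_svd = (2.0 / n) * np.sum(sig2[:, None] * U**2 * (d**2 / (d**2 + lam)))
print(omega_trace, omega_svd)                     # identical up to rounding
```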

On the other hand, such monotonicity does not always hold for the constrained form (3). As mentioned in Section 2.1, fitting is done by projection onto a model set that is enclosed by a hyper-ellipsoid centered at the origin. Changing the value of the regularization parameter shrinks or inflates the hyper-ellipsoid isotropically. Surprisingly, in this case too there are setups where we get smaller optimism when projecting onto a larger ellipsoid. We describe next a relatively simple setup that demonstrates this.

Let $L$ be the set enclosed by a hyper-ellipsoid centered at the origin, with principal directions parallel to the axes, and let $S$ be a similarly defined set with proportionally smaller radii, so that $S$ is nested in $L$. An eccentricity parameter determines the flattening in the last component, and we take the ellipses to be highly eccentric. This setup is presented in two dimensions in Supplementary Figure LABEL:fig:ex_ellipse_zoom (top left panel). These values can be thought of as constrained ridge regression with a diagonal design matrix $X$. We begin with an illustrative distribution concentrated near a fixed point. The rationale behind this setup is that this point is situated such that the image of its Euclidean projection on $L$ is a nearly horizontal line segment, while its projection on $S$ has a much larger vertical component. There is also the contradictory effect of the circumference of the ellipses at play, but since the ellipses are highly eccentric, it is much less pronounced. The correlation between observed and fitted values is therefore higher for $S$ than it is for $L$. Supplementary Figure LABEL:fig:ex_ellipse_zoom (bottom left panel) shows realizations of $y$, and the two right panels depict the corresponding fitted values on the same scale. The componentwise covariance of observed and fitted values is visibly greater with $S$ than it is with $L$.

This behavior persists beyond the illustrative setup described so far: it scales with the dimension (hyper-ellipsoids; results not shown), and endures if we take $y$ to be distributed according to a normal distribution that is uncorrelated but heteroscedastic. Results are difficult to obtain in closed form for this case because, in constrained ridge regression, projections involve the solution of quartic and higher-order polynomial equations. We therefore settle for an estimate of the optimism, with appropriate confidence intervals. Figure 2 gives the optimism profile for the normal distribution

$$y \sim N\!\left(\begin{bmatrix}3\\10\end{bmatrix},\ \begin{bmatrix}0.1 & 0\\ 0 & 3\end{bmatrix}\right),$$

when the larger model set $L$ is inflated starting from $S$ upwards. This profile is monotonic nondecreasing at first, but then becomes strictly decreasing for the remainder of the examined range.

Unlike the realistic lasso example, the constrained ridge example above is more contrived, requiring non-standard error distributions (heteroscedastic errors). However, the ridge case is potentially more intriguing, because the penalized form guarantees monotonicity while the constrained form admits counter examples.
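The geometry behind this example can be sketched in code. The Monte-Carlo estimate below uses illustrative, assumed parameter values (not the ones behind Figure 2), so it demonstrates the estimation procedure rather than reproducing the figure:

```python
import numpy as np

# Euclidean projection onto nested, highly eccentric filled ellipses, and
# a Monte-Carlo optimism estimate for each scale.  Ellipse radii, the data
# distribution parameters, and sample size are assumed illustrative values.

def project_ellipse(pt, a, b, tol=1e-10):
    """Project a 2-D point onto the filled ellipse x^2/a^2 + y^2/b^2 <= 1."""
    x, y = pt
    if (x / a) ** 2 + (y / b) ** 2 <= 1.0:
        return pt
    # Boundary projection: solve (a x/(a^2+t))^2 + (b y/(b^2+t))^2 = 1 for t.
    f = lambda t: (a * x / (a**2 + t)) ** 2 + (b * y / (b**2 + t)) ** 2 - 1.0
    lo, hi = 0.0, 10.0 * max(a, b) * np.hypot(x, y)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    t = 0.5 * (lo + hi)
    return np.array([a**2 * x / (a**2 + t), b**2 * y / (b**2 + t)])

rng = np.random.default_rng(8)
reps = 2_000
Y = rng.multivariate_normal([3.0, 10.0], [[0.1, 0.0], [0.0, 3.0]], reps)

def omega(scale):
    a, b = scale * 10.0, scale * 0.5          # eccentric: wide and flat
    Mu = np.array([project_ellipse(y, a, b) for y in Y])
    covs = [np.cov(Mu[:, i], Y[:, i])[0, 1] for i in range(2)]
    return sum(covs)                          # times 2/n with n = 2

print(omega(1.0), omega(2.0))  # larger ellipse need not give larger optimism
```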

## 5 Discussion

Regularization (nesting) and degrees of freedom (optimism) are key concepts in statistics and specifically in model selection. Because these concepts are closely related in the context of fundamental modeling approaches such as linear regression, a notion of a generally applicable nondecreasing monotonic relationship between them has permeated the statistics literature (e.g. Ye, 1998; Zou et al., 2007; Krämer and Sugiyama, 2011). This notion is also expressed by the use of optimism to define “effective degrees of freedom” (Hastie et al., 2009). We have shown here that for some important families of nested modeling approaches, the monotonicity is indeed preserved. However the general relationship is a misconception that does not hold even in simple and familiar scenarios such as lasso or constrained ridge regression. In particular, our lasso (Section 4) and isotonic recursive partitioning (Section LABEL:sec:irp) examples are natural and realistic. In such situations, the fundamental premise of regularization as controlling model complexity and decreasing optimism is in fact incorrect, and regularized models with more optimism than their less regularized counterparts are guaranteed to be inferior in their expected predictive performance.

Specifically for ridge regression with additive heteroscedastic normal noise, Proposition 1 shows that the penalized form guarantees a monotonic nondecreasing relationship between regularization (as captured in this case by the tuning parameter, $\lambda$) and optimism, $\omega$. Surprisingly, the constrained form does not guarantee this, as demonstrated in Example 4.2. On the one hand, the two forms are equivalent (strongly dual) in the optimization theory sense, in that for every realized training set we may switch from one form to the other with an appropriate choice of tuning parameter value, and produce the same model. On the other hand, this mapping is data-dependent and thus random, which means it does not imply that the two forms are equivalent statistical modeling approaches. This subtlety, which is reflected in our results, also comes up in the Stein unbiased estimates for penalized and constrained degrees of freedom for the lasso in Kato (2009) and Zou et al. (2007).

Alternative formulations for the model selection problem exist, which replace optimism with different notions of complexity for which monotonicity is guaranteed. In particular, the machine learning community traditionally defines model complexity via the Vapnik-Chervonenkis (VC) dimension of the model set (Vapnik, 2000), and calculates penalties on training error which give bounds on prediction error in place of the expected error expressed by optimism (Cherkassky and Mulier, 2007). The penalties depend monotonically on the VC dimension, hence the consistency between model complexity and prediction penalty is guaranteed. A major downside of this approach is that it gives worst-case bounds (which are often very loose) in place of estimates of expected prediction error. More critically, unlike the optimism, these penalties are independent of both the modeling approach and the true underlying distribution, depending only on the model set. Hence they are of a fundamentally different nature than modeling-approach-specific estimates based on optimism.

The concept of optimism is applicable to other loss functions besides the squared error loss we have focused on here, as shown by Efron (2004). Indeed, so is the concept of nesting defined in Section 2.1. We thus expect that the general spirit of our positive results from Section 3 and negative results from Section 4 should not change when considering other loss functions (for example, exponential family log-likelihoods). The details of these generalizations remain a topic for future research.

## Acknowledgements

The authors are grateful to E. Aharoni, R. Luss and M. Shahar for useful ideas and discussion, and to F. Abramovich, T. Hastie, R. Heller, G. Hooker, R. Tibshirani and the reviewing team for thoughtful and useful comments. This research was partially supported by Israeli Science Foundation grant 1487/12 and by a fellowship to SK from the Edmond J. Safra Center for Bioinformatics at Tel Aviv University.

## Appendix

### 5.1 Proof of Theorem 1

###### Proof.

Let the eigenvectors of $S$ be $z_1,\dots,z_n$ with associated eigenvalues $\delta_1,\dots,\delta_n$. Also let the eigenvectors of $L$ be $v_1,\dots,v_n$ with associated eigenvalues $\lambda_1,\dots,\lambda_n$ (some eigenvalues might be zero, some might not be unique). Since both eigenvector bases are orthonormal and span $\mathbb{R}^n$, we may transform one to the other via a rotation matrix $R$: $Z = RV$ (where $Z$ and $V$ are matrices whose rows are the individual $z_i$ and $v_j$ respectively).

Since $S$ is nested in $L$ in the wide sense, there exists a parametrization $Q$ with nested contours, such that for each value of $y$, $\hat\mu_{y,S}$ and $\hat\mu_{y,L}$ are the Euclidean projections of $y$ onto $Q_{s_S}$ and $Q_{s_L}$ respectively, and such that $Q_{s_S} \subseteq Q_{s_L}$. Because of this nesting, $L$ fits a model that is closer to $y$ than $S$ does: $\|\hat\mu_{y,L} - y\|_2 \le \|\hat\mu_{y,S} - y\|_2$. The conditions of this theorem also specify that the sets $Q_s$ are convex.

Since in this case both modeling approaches are linear smoothers, and we must have (more generally, the axes origin has to be at the limit of the smallest contour in the parametrization ). This also means that all eigenvalues must be in , else the projection of their eigenvector cannot be a projection to a convex set which includes the origin. Since , if follows that has to be outside the ball of radius around the origin, and hence (See Figure LABEL:fig:nested_linear_smoothers). We therefore have:

$$\|Lz_i\|_2^2 \;=\; \sum_{j=1}^{n} R_{ij}^2 \lambda_j^2 \;\geq\; \|Sz_i\|_2^2 \;=\; \delta_i^2.$$

Subsequently,

$$\sum_{i=1}^{n}\sum_{j=1}^{n} R_{ij}^2 \lambda_j^2 \;\geq\; \sum_{i=1}^{n}\delta_i^2 \;\Longleftrightarrow\; \sum_{j=1}^{n}\lambda_j^2 \sum_{i=1}^{n} R_{ij}^2 \;\geq\; \sum_{i=1}^{n}\delta_i^2.$$

But since $R$ is a rotation matrix, the sum of squares along any column or row is unity. Thus,

$$\sum_{j=1}^{n}\lambda_j^2 \;\geq\; \sum_{i=1}^{n}\delta_i^2 \;\Longleftrightarrow\; \operatorname{tr}(L^T L) \;\geq\; \operatorname{tr}(S^T S).$$

Let us now reexamine the nesting consequence:

$$\begin{aligned}
\|\hat{\mu}_S - y\|_2^2 \geq \|\hat{\mu}_L - y\|_2^2 \;&\Longleftrightarrow\; y^T(S-I)^T(S-I)y \geq y^T(L-I)^T(L-I)y \\
&\Longleftrightarrow\; (S-I)^T(S-I) - (L-I)^T(L-I) \succeq 0 \\
&\Longleftrightarrow\; (S^T S - L^T L) + (L-S) + (L-S)^T \succeq 0 \\
&\Longrightarrow\; \operatorname{tr}(S^T S) - \operatorname{tr}(L^T L) + 2\operatorname{tr}(L-S) \geq 0.
\end{aligned}$$
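As an aside, the matrix-algebra expansion used in the third step of this chain is easy to verify numerically; the random matrices below are arbitrary stand-ins for $S$ and $L$, used only to check the identity:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
S = rng.standard_normal((n, n))
L = rng.standard_normal((n, n))
I = np.eye(n)

# (S-I)'(S-I) - (L-I)'(L-I) = (S'S - L'L) + (L-S) + (L-S)'
lhs = (S - I).T @ (S - I) - (L - I).T @ (L - I)
rhs = (S.T @ S - L.T @ L) + (L - S) + (L - S).T
assert np.allclose(lhs, rhs)
```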

Combined with the previous result $\operatorname{tr}(L^T L) \geq \operatorname{tr}(S^T S)$, we must have $2\operatorname{tr}(L-S) \geq 0$, that is,

$$\operatorname{tr}(L) \;\geq\; \operatorname{tr}(S),$$

which, for data distributed according to $y \sim (\mu, \sigma^2 I)$ (i.e., homoscedastic and mutually uncorrelated) and for $\ell_2$ loss, means that

$$\omega(L) \;\geq\; \omega(S).$$

Thus, $L$ has more optimism than $S$. ∎
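As an illustrative sanity check (not part of the proof), the conclusion of Theorem 1 can be observed numerically for ridge regression, whose smoother matrices are symmetric linear smoothers nested in the wide sense; the design matrix and penalty values below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))

def ridge_smoother(X, lam):
    """Symmetric linear smoother H = X (X'X + lam I)^{-1} X'."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

L = ridge_smoother(X, lam=0.1)   # lightly regularized
S = ridge_smoother(X, lam=10.0)  # heavily regularized (smaller contours)

# As the proof requires, all eigenvalues of S lie in [0, 1].
ev = np.linalg.eigvalsh(S)
assert np.all(ev >= -1e-10) and np.all(ev <= 1 + 1e-10)

# Intermediate step: tr(S'S) <= tr(L'L).
assert np.trace(S.T @ S) <= np.trace(L.T @ L)
# Conclusion: the more regularized smoother has the smaller trace,
# i.e., fewer degrees of freedom / less optimism.
assert np.trace(S) <= np.trace(L)
```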

### 5.2 Proof of Theorem 2

###### Proof.

Jacobian main-diagonal components for modeling approach $S$ are given by

$$\frac{\partial \hat{\mu}_{y,S,i}}{\partial y_i} \;=\; \lim_{\varepsilon\to 0}\frac{\hat{\mu}_{y+\varepsilon e_i,S,i}-\hat{\mu}_{y-\varepsilon e_i,S,i}}{2\varepsilon},$$

and similarly for $L$ ($e_i$ is the unit vector whose $i$'th component equals 1).

For every value of $y$, for every value of $\varepsilon$ and for each $i$, we have

$$\hat{\mu}_{y+\varepsilon e_i,S,i}-\hat{\mu}_{y-\varepsilon e_i,S,i} \;\leq\; \|\hat{\mu}_{y+\varepsilon e_i,S}-\hat{\mu}_{y-\varepsilon e_i,S}\|_2.$$

A projection mapping onto a convex set is a non-expansive mapping (e.g., as used in Tibshirani and Taylor, 2011). There thus exist $k_S \leq 1$ and $k_L \leq 1$ such that for every two values of $y$, $a$ and $b$,

$$\|\hat{\mu}_{a,S}-\hat{\mu}_{b,S}\|_2 \;\leq\; k_S\|a-b\|_2, \qquad \|\hat{\mu}_{a,L}-\hat{\mu}_{b,L}\|_2 \;\leq\; k_L\|a-b\|_2.$$
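The non-expansion property is easy to see numerically; the sketch below uses a coordinate box as the convex set, an arbitrary illustrative choice rather than any set appearing in the theorem:

```python
import numpy as np

rng = np.random.default_rng(2)

def project_box(y, lo=0.0, hi=1.0):
    """Euclidean projection onto the convex box [lo, hi]^n (coordinate-wise clip)."""
    return np.clip(y, lo, hi)

for _ in range(1000):
    a, b = rng.standard_normal(8), rng.standard_normal(8)
    # Projection onto a convex set is non-expansive (Lipschitz constant k <= 1):
    assert np.linalg.norm(project_box(a) - project_box(b)) <= np.linalg.norm(a - b) + 1e-12
```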

Because $M_L$ is a linear subspace, the Euclidean projection onto $M_S$ may be broken down into first projecting onto $M_L$ and then projecting from there onto $M_S$: $\hat{\mu}_{y,S} = \hat{\mu}_{y,L,S}$ (as shown in Supplementary Figure LABEL:fig:sequential_projection). Hence

$$\|\hat{\mu}_{a,S}-\hat{\mu}_{b,S}\|_2 \;=\; \|\hat{\mu}_{a,L,S}-\hat{\mu}_{b,L,S}\|_2 \;\leq\; k_S\|\hat{\mu}_{a,L}-\hat{\mu}_{b,L}\|_2.$$
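As an aside, this sequential-projection decomposition can be checked in a toy example where $M_L$ is a coordinate subspace and $M_S$ is a convex subset of it (both sets are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 8, 3  # M_L: first k coordinates free, remaining coordinates zero

def proj_L(y):
    """Orthogonal projection onto the subspace M_L."""
    out = np.zeros_like(y)
    out[:k] = y[:k]
    return out

def proj_S(y):
    """Euclidean projection onto M_S = {x in M_L : 0 <= x_i <= 1}."""
    out = np.zeros_like(y)
    out[:k] = np.clip(y[:k], 0.0, 1.0)
    return out

for _ in range(100):
    y = rng.standard_normal(n)
    # Projecting onto M_S directly equals projecting onto M_L first, then onto M_S:
    assert np.allclose(proj_S(y), proj_S(proj_L(y)))
```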

Therefore

$$\|\hat{\mu}_{y+\varepsilon e_i,S}-\hat{\mu}_{y-\varepsilon e_i,S}\|_2 \;\leq\; k_S\|\hat{\mu}_{y+\varepsilon e_i,L}-\hat{\mu}_{y-\varepsilon e_i,L}\|_2, \tag{12}$$

but since $\hat{\mu}_{y,L}$ constitutes an orthogonal linear projection of $y$ onto $M_L$, there exists a projection matrix $P_L$ (Hermitian and idempotent) such that

$$\hat{\mu}_{y,L} \;\equiv\; P_L\,y,$$

and so we may further develop the right-hand side of inequality (12):

$$k_S\|\hat{\mu}_{y+\varepsilon e_i,L}-\hat{\mu}_{y-\varepsilon e_i,L}\|_2 \;=\; k_S\|P_L(y+\varepsilon e_i)-P_L(y-\varepsilon e_i)\|_2 \;=\; 2\varepsilon k_S\|P_L e_i\|_2 \;=\; 2\varepsilon k_S\sqrt{e_i^T P_L^T P_L e_i} \;=\; 2\varepsilon k_S\sqrt{P_{L,ii}}.$$

On the other hand,

$$\hat{\mu}_{y+\varepsilon e_i,L,i}-\hat{\mu}_{y-\varepsilon e_i,L,i} \;=\; e_i^T P_L(y+\varepsilon e_i) - e_i^T P_L(y-\varepsilon e_i) \;=\; 2\varepsilon\, e_i^T P_L e_i \;=\; 2\varepsilon P_{L,ii}.$$

In summary, for every value of $y$ and $\varepsilon$, and for each $i$, we have shown that

$$\hat{\mu}_{y+\varepsilon e_i,S,i}-\hat{\mu}_{y-\varepsilon e_i,S,i} \;\leq\; \hat{\mu}_{y+\varepsilon e_i,L,i}-\hat{\mu}_{y-\varepsilon e_i,L,i}.$$

This implies that every main-diagonal Jacobian component is smaller for the projection onto $M_S$ than it is for the projection onto $M_L$:

$$\frac{\partial \hat{\mu}_{y,S,i}}{\partial y_i} \;\leq\; \frac{\partial \hat{\mu}_{y,L,i}}{\partial y_i}, \quad\text{hence}\quad \sum_{i=1}^{n}\frac{\partial \hat{\mu}_{y,S,i}}{\partial y_i} \;\leq\; \sum_{i=1}^{n}\frac{\partial \hat{\mu}_{y,L,i}}{\partial y_i}.$$

Because this is true for any observed data $y$, it is also true in expectation, which, by Stein's lemma, leads to $\mathrm{df}(S) \leq \mathrm{df}(L)$. ∎
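A small numerical sketch of the theorem's conclusion, with $M_L$ a coordinate subspace and $M_S$ a Euclidean ball inside it (both illustrative choices), estimating the divergence $\sum_i \partial\hat{\mu}_{y,i}/\partial y_i$ by central finite differences:

```python
import numpy as np

rng = np.random.default_rng(4)
n, k, r = 6, 3, 1.0   # ambient dim, subspace dim, ball radius
eps = 1e-6

def mu_L(y):
    """Projection onto M_L: the subspace of the first k coordinates."""
    out = np.zeros_like(y)
    out[:k] = y[:k]
    return out

def mu_S(y):
    """Projection onto M_S: the radius-r ball within M_L (project, then shrink)."""
    z = mu_L(y)
    nrm = np.linalg.norm(z)
    return z if nrm <= r else z * (r / nrm)

def divergence(f, y):
    """Central finite-difference estimate of sum_i d f_i / d y_i."""
    total = 0.0
    for i in range(len(y)):
        e = np.zeros_like(y)
        e[i] = eps
        total += (f(y + e)[i] - f(y - e)[i]) / (2 * eps)
    return total

for _ in range(20):
    y = rng.standard_normal(n)
    # The constrained fit never has a larger divergence than the linear one:
    assert divergence(mu_S, y) <= divergence(mu_L, y) + 1e-6
```

Here `divergence(mu_L, y)` always equals $k$ (the rank of $P_L$), while the projection onto the ball has strictly smaller divergence whenever the constraint is active.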

## References

• Akaike (1974) Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control 19(6), 716–723.
• Boyd and Vandenberghe (2004) Boyd, S. and L. Vandenberghe (2004). Convex optimization. Cambridge University Press.
• Cherkassky and Mulier (2007) Cherkassky, V. and F. Mulier (2007). Learning from data: Concepts, theory, and methods. Wiley-IEEE Press.
• Davidov (2006) Davidov, O. (2006). Constrained estimation and the theorem of Kuhn-Tucker. Advances in Decision Sciences 2006.
• Efron (1983) Efron, B. (1983). Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American Statistical Association 78(382), 316–331.
• Efron (2004) Efron, B. (2004). The estimation of prediction error. Journal of the American Statistical Association 99(467), 619–632.
• Efron et al. (2004) Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of Statistics 32(2), 407–499.
• Hastie et al. (2009) Hastie, T., R. Tibshirani, and J. Friedman (2009). The elements of statistical learning: data mining, inference, and prediction. Springer Verlag.
• Hoerl (1962) Hoerl, A. (1962). Application of ridge analysis to regression problems. Chemical Engineering Progress 58(3), 54–59.
• Kato (2009) Kato, K. (2009). On the degrees of freedom in shrinkage estimation. Journal of Multivariate Analysis 100(7), 1338–1352.
• Kattumannil (2009) Kattumannil, S. (2009). On Stein's identity and its applications. Statistics & Probability Letters 79(12), 1444–1449.
• Krämer and Sugiyama (2011) Krämer, N. and M. Sugiyama (2011). The degrees of freedom of partial least squares regression. Journal of the American Statistical Association 106(494), 697–705.
• Mallows (1973) Mallows, C. (1973). Some comments on $C_p$. Technometrics 15(4), 661–675.
• Schwarz (1978) Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.
• Stein (1981) Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics 9(6), 1135–1151.
• Stone (1974) Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society. Series B (Methodological), 111–147.
• Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B 58(1), 267–288.
• Tibshirani and Taylor (2011) Tibshirani, R. and J. Taylor (2011). The solution path of the generalized lasso. The Annals of Statistics 39(3), 1335–1371.
• Vapnik (2000) Vapnik, V. (2000). The nature of statistical learning theory. Springer Verlag.
• Wahba (1983) Wahba, G. (1983). Bayesian “confidence intervals” for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Series B 45(1), 133–150.
• Ye (1998) Ye, J. (1998). On measuring and correcting the effects of data mining and model selection. Journal of the American Statistical Association 93(441), 120–131.
• Zou and Hastie (2005) Zou, H. and T. Hastie (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67(2), 301–320.
• Zou et al. (2007) Zou, H., T. Hastie, and R. Tibshirani (2007). On the degrees of freedom of the lasso. The Annals of Statistics 35(5), 2173–2192.