# The Risk of Machine Learning

Many applied settings in empirical economics involve simultaneous estimation of a large number of parameters. In particular, applied economists are often interested in estimating the effects of many-valued treatments (like teacher effects or location effects), treatment effects for many groups, and prediction models with many regressors. In these settings, machine learning methods that combine regularized estimation and data-driven choices of regularization parameters are useful to avoid over-fitting. In this article, we analyze the performance of a class of machine learning estimators that includes ridge, lasso and pretest in contexts that require simultaneous estimation of many parameters. Our analysis aims to provide guidance to applied researchers on (i) the choice between regularized estimators in practice and (ii) data-driven selection of regularization parameters. To address (i), we characterize the risk (mean squared error) of regularized estimators and derive their relative performance as a function of simple features of the data generating process. To address (ii), we show that data-driven choices of regularization parameters, based on Stein's unbiased risk estimate or on cross-validation, yield estimators with risk uniformly close to the risk attained under the optimal (unfeasible) choice of regularization parameters. We use data from recent examples in the empirical economics literature to illustrate the practical applicability of our results.

## 1 Introduction

Applied economists often confront problems that require estimation of a large number of parameters. Examples include (a) estimation of causal (or predictive) effects for a large number of treatments such as neighborhoods or cities, teachers, workers and firms, or judges; (b) estimation of the causal effect of a given treatment for a large number of subgroups; and (c) prediction problems with a large number of predictive covariates or transformations of covariates. The machine learning literature provides a host of estimation methods, such as ridge, lasso, and pretest, which are particularly well adapted to high-dimensional problems. In view of the variety of available methods, the applied researcher faces the question of which of these procedures to adopt in any given situation. This article provides guidance on this choice based on the study of the risk properties (mean squared error) of a class of regularization-based machine learning methods.

A practical concern that generally motivates the adoption of machine learning procedures is the potential for severe over-fitting in high-dimensional settings. To avoid over-fitting, most machine learning procedures for “supervised learning” (that is, regression and classification methods for prediction) involve two key features, (i) regularized estimation and (ii) data-driven choice of regularization parameters. These features are also central to more familiar non-parametric estimation methods in econometrics, such as kernel or series regression.

#### Setup

In this article, we consider the canonical problem of estimating the unknown means, $\mu_1, \ldots, \mu_n$, of a potentially large set of observed random variables, $X_1, \ldots, X_n$. After some transformations, our setup covers applications (a)-(c) mentioned above and many others. For example, in the context of a randomized experiment with $n$ subgroups, $X_i$ is the difference in the sample averages of an outcome variable between treated and non-treated for subgroup $i$, and $\mu_i$ is the average treatment effect on the same outcome and subgroup. Moreover, as we discuss in Section 2.1, the many means problem analyzed in this article encompasses the problem of nonparametric estimation of a regression function.

We consider componentwise estimators of the form $\hat\mu_i = m(X_i, \lambda)$, where $\lambda$ is a non-negative regularization parameter. Typically, $m(x, 0) = x$, so that $\lambda = 0$ corresponds to the unregularized estimator $\hat\mu_i = X_i$. Positive values of $\lambda$ typically correspond to regularized estimators, which shrink towards zero, $|\hat\mu_i| \le |X_i|$. The value $\lambda = \infty$ typically implies maximal shrinkage: $\hat\mu_i = 0$ for $i = 1, \ldots, n$. Shrinkage towards zero is a convenient normalization but it is not essential. Shifting $X_i$ by a constant to $X_i - c$, for $i = 1, \ldots, n$, results in shrinkage towards $c$.

#### The risk function of regularized estimators

Our article is structured according to the two mentioned features of machine learning procedures, regularization and data-driven choice of regularization parameters. We first focus on feature (i) and study the risk properties (mean squared error) of regularized estimators with fixed and with oracle-optimal regularization parameters. We show that for any given data generating process there is an (infeasible) risk-optimal regularized componentwise estimator. This estimator has the form of the posterior mean of $\mu_I$ given $X_I$ and given the empirical distribution of $\mu_1, \ldots, \mu_n$, where $I$ is a random variable with uniform distribution on the set of indices $\{1, 2, \ldots, n\}$. The optimal regularized estimator is useful to characterize the risk properties of machine learning estimators. It turns out that, in our setting, the risk function of any regularized estimator can be expressed as a function of the distance between that regularized estimator and the optimal one.

Instead of conditioning on $\mu_1, \ldots, \mu_n$, one can consider the case where each $(X_i, \mu_i)$ is a realization of a random vector $(X, \mu)$ with distribution $\pi$ and a notion of risk that is integrated over the distribution of $\mu$ in the population. For this alternative definition of risk, we derive results analogous to those of the previous paragraph.

We next turn to a family of parametric models for $\pi$. We consider models that allow for a probability mass at zero in the distribution of $\mu$, corresponding to the notion of sparsity, while conditional on $\mu \neq 0$ the distribution of $\mu$ is normal around some grand mean. For these parametric models we derive analytic risk functions under oracle choices of risk-minimizing values for $\lambda$, which allow for an intuitive discussion of the relative performance of alternative estimators. We focus our attention on three estimators that are widespread in the empirical machine learning literature: ridge, lasso, and pretest. When the point-mass of true zeros is small, ridge tends to perform better than lasso or pretest. When there is a sizable share of true zeros, the ranking of the estimators depends on the other characteristics of the distribution of $\mu$: (a) if the non-zero parameters are smoothly distributed in a vicinity of zero, ridge still performs best; (b) if most of the distribution of non-zero parameters assigns large probability to a set well-separated from zero, pretest estimation tends to perform well; and (c) lasso tends to do comparatively well in intermediate cases that fall somewhere between (a) and (b), and overall is remarkably robust across the different specifications. This characterization of the relative performance of ridge, lasso, and pretest is consistent with the results that we obtain for the empirical applications discussed later in the article.

#### Data-driven choice of regularization parameters

The second part of the article turns to feature (ii) of machine learning estimators and studies the data-driven choice of regularization parameters. We consider choices of regularization parameters based on the minimization of a criterion function that estimates risk. Ideally, a machine learning estimator evaluated at a data-driven choice of the regularization parameter would have a risk function that is uniformly close to the risk function of the infeasible estimator using an oracle-optimal regularization parameter (which minimizes true risk). We show that this type of uniform consistency can be achieved under fairly mild conditions whenever the dimension of the problem under consideration is large. This is in stark contrast to well-known results in Leeb and Pötscher (2006) for low-dimensional settings. We further provide fairly weak conditions under which machine learning estimators with data-driven choices of the regularization parameter, based on Stein’s unbiased risk estimate (SURE) and on cross-validation (CV), attain uniform risk consistency. In addition to allowing data-driven selection of regularization parameters, uniformly consistent estimation of the risk of shrinkage estimators can be used to select among alternative shrinkage estimators on the basis of their estimated risk in specific empirical settings.

#### Applications

We illustrate our results in the context of three applications taken from the empirical economics literature. The first application uses data from Chetty and Hendren (2015) to study the effects of locations on intergenerational earnings mobility of children. The second application uses data from the event-study analysis in Della Vigna and La Ferrara (2010) who investigate whether the stock prices of weapon-producing companies react to changes in the intensity of conflicts in countries under arms trade embargoes. The third application considers nonparametric estimation of a Mincer equation using data from the Current Population Survey (CPS), as in Belloni and Chernozhukov (2011). The presence of many neighborhoods in the first application, many weapon producing companies in the second one, and many series regression terms in the third one makes these estimation problems high-dimensional.

These examples showcase how simple features of the data generating process affect the relative performance of machine learning estimators. They also illustrate the way in which consistent estimation of the risk of shrinkage estimators can be used to choose regularization parameters and to select among different estimators in practice. For the estimation of location effects in Chetty and Hendren (2015) we find estimates that are not overly dispersed around their mean and no evidence of sparsity. In this setting, ridge outperforms lasso and pretest in terms of estimated mean squared error. In the setting of the event-study analysis in Della Vigna and La Ferrara (2010), our results suggest that a large fraction of values of parameters are closely concentrated around zero, while a smaller but non-negligible fraction of parameters are positive and substantially separated from zero. In this setting, pretest dominates. Similarly to the result for the setting in Della Vigna and La Ferrara (2010), the estimation of the parameters of a Mincer equation in Belloni and Chernozhukov (2011) suggests a sparse approximation to the distribution of parameters. Substantial shrinkage at the tails of the distribution is still helpful in this setting, so that lasso dominates.

The rest of this article is structured as follows. Section 2 introduces our setup: the canonical problem of estimating a vector of means under quadratic loss. Section 2.1 discusses a series of examples from empirical economics that are covered by our setup. Section 2.2 discusses the setup of this article in the context of the machine learning literature and of the older literature on estimation of normal means. Section 3 provides characterizations of the risk function of regularized estimators in our setting. We derive a general characterization in Section 3.1. Sections 3.2 and 3.3 provide analytic formulas for risk under additional assumptions. In particular, in Section 3.3 we derive analytic formulas for risk in a spike-and-normal model. These characterizations allow for a comparison of the mean squared error of alternative procedures and yield recommendations for the choice of an estimator. Section 4 turns to data-driven choices of regularization parameters. We show uniform risk consistency results for Stein’s unbiased risk estimate and for cross-validation. Section 5 discusses extensions and explains the apparent contradiction between our results and those in Leeb and Pötscher (2005). Section 6 reports simulation results. Section 7 discusses several empirical applications. Section 8 concludes. The appendix contains proofs and supplemental materials.

## 2 Setup

Throughout this paper, we consider the following setting. We observe a realization of an $n$-vector of real-valued random variables, $X = (X_1, \ldots, X_n)'$, where the components of $X$ are mutually independent with finite mean $\mu_i$ and finite variance $\sigma_i^2$, for $i = 1, \ldots, n$. Our goal is to estimate $\mu_1, \ldots, \mu_n$.

In many applications, the $X_i$ arise as preliminary least squares estimates of the coefficients of interest, $\mu_i$. Consider, for instance, a randomized controlled trial where randomization of treatment assignment is carried out separately for $n$ non-overlapping subgroups. Within each subgroup, the difference in the sample averages between treated and control units, $X_i$, has mean equal to the average treatment effect for that group in the population, $\mu_i$. Further examples are discussed in Section 2.1 below.
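As a minimal sketch of how such preliminary estimates might be assembled in the subgroup example, the Python snippet below computes one treated-minus-control difference in means per subgroup. The function name `subgroup_differences` and the `(subgroup, treated, outcome)` record layout are illustrative assumptions, not part of the original setup.

```python
from statistics import mean


def subgroup_differences(records):
    """Return {subgroup: treated mean minus control mean}.

    `records` is an iterable of (subgroup, treated, outcome) tuples; this
    layout is a hypothetical choice made for the illustration.
    """
    by_group = {}
    for group, treated, outcome in records:
        arms = by_group.setdefault(group, {True: [], False: []})
        arms[bool(treated)].append(outcome)
    return {
        group: mean(arms[True]) - mean(arms[False])
        for group, arms in by_group.items()
        if arms[True] and arms[False]  # skip subgroups missing one arm
    }


# Toy data: two subgroups, each with treated and control units.
records = [
    ("a", 1, 3.0), ("a", 1, 5.0), ("a", 0, 1.0), ("a", 0, 2.0),
    ("b", 1, 2.0), ("b", 0, 2.5),
]
X = subgroup_differences(records)
print(X)
```

Each value in `X` plays the role of one $X_i$: a noisy estimate of that subgroup's average treatment effect.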

#### Componentwise estimators

We restrict our attention to componentwise estimators of $\mu_i$,

$$\hat\mu_i = m(X_i, \lambda),$$

where $m: \mathbb{R} \times [0, \infty] \to \mathbb{R}$ defines an estimator of $\mu_i$ as a function of $X_i$ and a non-negative regularization parameter, $\lambda$. The parameter $\lambda$ is common across the components $i$ but might depend on the vector $X$. We study data-driven choices $\hat\lambda$ in Section 4 below, focusing in particular on Stein’s unbiased risk estimate (SURE) and cross-validation (CV).

Popular estimators of this componentwise form are ridge, lasso, and pretest. They are defined as follows:

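$$
\begin{aligned}
m_R(x, \lambda) &= \operatorname*{arg\,min}_{m \in \mathbb{R}} \left( (x - m)^2 + \lambda m^2 \right) = \frac{1}{1 + \lambda}\, x && \text{(ridge)}, \\
m_L(x, \lambda) &= \operatorname*{arg\,min}_{m \in \mathbb{R}} \left( (x - m)^2 + 2 \lambda |m| \right) = \mathbf{1}(x < -\lambda)(x + \lambda) + \mathbf{1}(x > \lambda)(x - \lambda) && \text{(lasso)}, \\
m_{PT}(x, \lambda) &= \operatorname*{arg\,min}_{m \in \mathbb{R}} \left( (x - m)^2 + \lambda^2\, \mathbf{1}(m \neq 0) \right) = \mathbf{1}(|x| > \lambda)\, x && \text{(pretest)},
\end{aligned}
$$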

where $\mathbf{1}(A)$ denotes the indicator function, which equals $1$ if $A$ holds and $0$ otherwise. Figure 1 plots $m_R(x, \lambda)$, $m_L(x, \lambda)$, and $m_{PT}(x, \lambda)$ as functions of $x$. For reasons apparent in Figure 1, ridge, lasso, and pretest estimators are sometimes referred to as linear shrinkage, soft thresholding, and hard thresholding, respectively. As we discuss below, the problem of determining the optimal choice among these estimators in terms of minimizing mean squared error is equivalent to the problem of determining which of these estimators best approximates a certain optimal estimating function, $m^*_P$.
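The three estimating functions translate directly into code. The following Python sketch implements the closed forms for ridge, lasso, and pretest; the function names are our own, and the explicit handling of $\lambda = \infty$ in `ridge` is an assumption made so that maximal shrinkage returns exactly zero.

```python
import math


def ridge(x, lam):
    """Linear shrinkage: x / (1 + lam)."""
    if math.isinf(lam):
        return 0.0  # maximal shrinkage
    return x / (1.0 + lam)


def lasso(x, lam):
    """Soft thresholding: move x towards zero by lam, stopping at zero."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def pretest(x, lam):
    """Hard thresholding: keep x unchanged only if |x| exceeds lam."""
    return x if abs(x) > lam else 0.0


# Evaluate the three shrinkage functions on a few points with lam = 1.
for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    print(x, ridge(x, 1.0), lasso(x, 1.0), pretest(x, 1.0))
```

With $\lambda = 0$ all three functions return `x` unchanged, which is the unregularized estimator.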

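Let $\mu = (\mu_1, \ldots, \mu_n)'$ and $\hat\mu = (\hat\mu_1, \ldots, \hat\mu_n)'$, where for simplicity we leave the dependence of $\hat\mu$ on $\lambda$ implicit in our notation. Let $P_1, \ldots, P_n$ be the distributions of $X_1, \ldots, X_n$, and let $P = (P_1, \ldots, P_n)$.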

#### Loss and risk

We evaluate estimates based on the squared error loss function, or compound loss,

$$L_n(X, m(\cdot, \lambda), P) = \frac{1}{n} \sum_{i=1}^{n} \left( m(X_i, \lambda) - \mu_i \right)^2,$$

where $L_n$ depends on $P$ via $\mu$. We will use expected loss to rank estimators. There are different ways of taking this expectation, resulting in different risk functions, and the distinction between them is conceptually important.
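A minimal Python sketch of compound loss follows. It takes any componentwise estimating function as an argument; the helper names `compound_loss` and `soft_threshold` and the toy numbers are illustrative assumptions.

```python
def compound_loss(xs, mus, m, lam):
    """Average squared error of the componentwise estimates m(x_i, lam)."""
    if len(xs) != len(mus):
        raise ValueError("xs and mus must have the same length")
    return sum((m(x, lam) - mu) ** 2 for x, mu in zip(xs, mus)) / len(xs)


def soft_threshold(x, lam):
    """Lasso estimating function, used here only as an example of m."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


xs = [2.0, -0.5, 0.1]   # observed values X_i (toy numbers)
mus = [1.0, 0.0, 0.0]   # true means mu_i (unknown in practice)
print(compound_loss(xs, mus, soft_threshold, 0.0))  # unregularized
print(compound_loss(xs, mus, soft_threshold, 1.0))  # lasso with lam = 1
```

Because the true means are unknown in applications, compound loss cannot be computed from data; the sketch only makes the definition concrete.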

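Componentwise risk fixes $P_i$ and considers the expected squared error of $\hat\mu_i$ as an estimator of $\mu_i$,

$$R(m(\cdot, \lambda), P_i) = E\left[ (m(X_i, \lambda) - \mu_i)^2 \mid P_i \right].$$

Compound risk $R_n$ averages componentwise risk over the empirical distribution of $P_i$ across the components $i = 1, \ldots, n$. Compound risk is given by the expectation of compound loss $L_n$ given $P$,

$$
\begin{aligned}
R_n(m(\cdot, \lambda), P) &= E\left[ L_n(X, m(\cdot, \lambda), P) \mid P \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} E\left[ (m(X_i, \lambda) - \mu_i)^2 \mid P_i \right] \\
&= \frac{1}{n} \sum_{i=1}^{n} R(m(\cdot, \lambda), P_i).
\end{aligned}
$$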

Finally, integrated (or empirical Bayes) risk $\bar R$ considers $P_1, \ldots, P_n$ to be themselves draws from some population distribution, $\Pi$. This induces a joint distribution, $\pi$, for $(X_i, \mu_i)$. Throughout the article, we will often use a subscript $\pi$ to denote characteristics of the joint distribution of $(X_i, \mu_i)$. Integrated risk refers to loss integrated over $\pi$ or, equivalently, componentwise risk integrated over $\Pi$,

$$
\begin{aligned}
\bar R(m(\cdot, \lambda), \pi) &= E_\pi\left[ L_n(X, m(\cdot, \lambda), P) \right] \\
&= E_\pi\left[ (m(X_i, \lambda) - \mu_i)^2 \right] \\
&= \int R(m(\cdot, \lambda), P_i) \, d\Pi(P_i).
\end{aligned}
\tag{1}
$$

Notice the similarity between compound risk and integrated risk: they differ only by replacing an empirical (sample) distribution by a population distribution. For large $n$, the difference between the two vanishes, as we will explore in Section 4.

#### Regularization parameter

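Throughout, we will use $R_n(m(\cdot, \lambda), P)$ to denote the risk function of the estimator $m(\cdot, \lambda)$ with fixed (non-random) $\lambda$, and similarly for $\bar R(m(\cdot, \lambda), \pi)$. In contrast, $R_n(m(\cdot, \hat\lambda_n), P)$ is the risk function taking into account the randomness of $\hat\lambda_n$, where the latter is chosen in a data-dependent manner, and similarly for $\bar R(m(\cdot, \hat\lambda_n), \pi)$.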

For a given $P$, we define the “oracle” selector of the regularization parameter as the value of $\lambda$ that minimizes compound risk,

$$\lambda^*(P) = \operatorname*{arg\,min}_{\lambda \in [0, \infty]} R_n(m(\cdot, \lambda), P),$$

whenever the argmin exists. We use $\lambda^*_R(P)$, $\lambda^*_L(P)$, and $\lambda^*_{PT}(P)$ to denote the oracle selectors for ridge, lasso, and pretest, respectively. Analogously, for a given $\pi$, we define

$$\bar\lambda^*(\pi) = \operatorname*{arg\,min}_{\lambda \in [0, \infty]} \bar R(m(\cdot, \lambda), \pi) \tag{2}$$

whenever the argmin exists, with $\bar\lambda^*_R(\pi)$, $\bar\lambda^*_L(\pi)$, and $\bar\lambda^*_{PT}(\pi)$ for ridge, lasso, and pretest, respectively. In Section 3, we characterize compound and integrated risk for fixed $\lambda$ and for the oracle-optimal $\lambda$. In Section 4 we show that data-driven choices $\hat\lambda$ are, under certain conditions, as good as the oracle-optimal choice, in a sense to be made precise.
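For ridge, the oracle selector can be illustrated numerically, because the ridge estimate is linear in $X_i$ and its compound risk therefore depends only on the means and variances: $R_n = \frac{1}{n}\sum_i \left[ (1+\lambda)^{-2}\sigma_i^2 + (\lambda/(1+\lambda))^2 \mu_i^2 \right]$. The Python sketch below minimizes this expression over a finite grid and compares the result with the closed-form minimizer $\sum_i \sigma_i^2 / \sum_i \mu_i^2$. The grid, the toy values of $\mu_i$ and $\sigma_i^2$, and the function names are assumptions made for the illustration; the definition above minimizes over all of $[0, \infty]$.

```python
def ridge_compound_risk(lam, mus, sigma2s):
    """Compound risk of ridge for fixed lam, given means and variances."""
    c = 1.0 / (1.0 + lam)  # shrinkage factor applied to each X_i
    total = sum(c * c * s2 + (1.0 - c) ** 2 * mu * mu
                for mu, s2 in zip(mus, sigma2s))
    return total / len(mus)


def oracle_lambda_on_grid(risk, grid):
    """Grid point with the smallest risk (a stand-in for the exact argmin)."""
    return min(grid, key=risk)


mus = [0.0, 0.5, -1.0, 2.0]        # toy means, assumed known to the oracle
sigma2s = [1.0, 1.0, 1.0, 1.0]     # toy variances
grid = [k / 1000.0 for k in range(5001)]  # lam in [0, 5], step 0.001

lam_grid = oracle_lambda_on_grid(
    lambda lam: ridge_compound_risk(lam, mus, sigma2s), grid)
lam_closed = sum(sigma2s) / sum(mu * mu for mu in mus)
print(lam_grid, lam_closed)
```

The oracle is infeasible in practice precisely because it requires the true $\mu_i$, which is what motivates the data-driven selectors of Section 4.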

### 2.1 Empirical examples

Our setup describes a variety of settings often encountered in empirical economics, where $X_1, \ldots, X_n$ are unbiased or close-to-unbiased but noisy least squares estimates of a set of parameters of interest, $\mu_1, \ldots, \mu_n$. As mentioned in the introduction, examples include (a) studies estimating causal or predictive effects for a large number of treatments such as neighborhoods, cities, teachers, workers, firms, or judges; (b) studies estimating the causal effect of a given treatment for a large number of subgroups; and (c) prediction problems with a large number of predictive covariates or transformations of covariates.

#### Large number of treatments

Examples in the first category include Chetty and Hendren (2015), who estimate the effect of geographic locations on intergenerational mobility for a large number of locations. Chetty and Hendren use differences between the outcomes of siblings whose parents move during their childhood in order to identify these effects. The problem of estimating a large number of parameters also arises in the teacher value-added literature when the objects of interest are individual teachers’ effects, see, for instance, Chetty et al. (2014). In labor economics, estimation of firm and worker effects in studies of wage inequality has been considered in Abowd et al. (1999). Another example within the first category is provided by Abrams et al. (2012), who estimate differences in the effects of defendant’s race on sentencing across individual judges.

#### Treatment for large number of subgroups

Within the second category, which consists of estimating the effect of a treatment for many sub-populations, our setup can be applied to the estimation of heterogeneous causal effects of class size on student outcomes across many subgroups. For instance, project STAR (Krueger, 1999) involved experimental assignment of students to classes of different sizes in 79 schools. Causal effects for many subgroups are also of interest in medical contexts or for active labor market programs, where doctors / policy makers have to decide on treatment assignment based on individual characteristics. In some empirical settings, treatment impacts are individually estimated for each sample unit. This is often the case in empirical finance, where event studies are used to estimate reactions of stock market prices to newly available information. For example, Della Vigna and La Ferrara (2010) estimate the effects of changes in the intensity of armed conflicts in countries under arms trade embargoes on the stock market prices of arms-manufacturing companies.

#### Prediction with many regressors

The third category is prediction with many regressors. This category fits in the setting of this article after orthogonalization of the regressors. Prediction with many regressors arises, in particular, in macroeconomic forecasting. Stock and Watson (2012), in an analysis complementing the present article, evaluate various procedures in terms of their forecast performance for a number of macroeconomic time series for the United States. Regression with many predictors also arises in series regression, where series terms are transformations of a set of predictors. Series regression and its asymptotic properties have been widely studied in econometrics (see for instance Newey, 1997). Wasserman (2006, Sections 7.2-7.3) provides an illuminating discussion of the equivalence between the normal means model studied in this article and nonparametric regression estimation. For that setting, $X_1, \ldots, X_n$ and $\mu_1, \ldots, \mu_n$ correspond to the estimated and true regression coefficients on an orthogonal basis of functions. Application of lasso and pretesting to series regression is discussed, for instance, in Belloni and Chernozhukov (2011). Appendix A.1 further discusses the relationship between the normal means model and prediction models.

In Section 7, we return to three of these applications, revisiting the estimation of location effects on intergenerational mobility, as in Chetty and Hendren (2015), the effect of changes in the intensity of conflicts in arms-embargo countries on the stock prices of arms manufacturers, as in Della Vigna and La Ferrara (2010), and nonparametric series estimation of a Mincer equation, as in Belloni and Chernozhukov (2011).

### 2.2 Statistical literature

Machine learning methods are becoming widespread in econometrics – see, for instance, Athey and Imbens (2015) and Kleinberg et al. (2015). A large number of estimation procedures are available to the applied researcher. Textbooks such as Hastie et al. (2009) or Murphy (2012) provide an introduction to machine learning. Lasso, which was first introduced by Tibshirani (1996), is becoming particularly popular in applied economics. Belloni and Chernozhukov (2011) provide a review of lasso including theoretical results and applications in economics.

Much of the research on machine learning focuses on algorithms and computational issues, while the formal statistical properties of machine learning estimators have received less attention. However, an older and superficially unrelated literature in mathematical statistics and statistical decision theory on the estimation of the normal means model has produced many deep results which turn out to be relevant for understanding the behavior of estimation procedures in non-parametric statistics and machine learning. A foundational article in this literature is James and Stein (1961), who study the case $X_i \sim N(\mu_i, 1)$. They show that the estimator $\hat\mu = X$ is inadmissible whenever $n \ge 3$. That is, there exists a (shrinkage) estimator that has mean squared error smaller than the mean squared error of $\hat\mu = X$ for all values of $\mu$. Brown (1971) provides more general characterizations of admissibility and shows that this dependence on dimension is deeply connected to the recurrence or transience of Brownian motion. Stein et al. (1981) characterizes the risk function of arbitrary estimators, $\hat\mu$, and based on this characterization proposes an unbiased estimator of the mean squared error of a given estimator, labeled “Stein’s unbiased risk estimator” or SURE. We return to SURE in Section 4.2 as a method to produce data-driven choices of regularization parameters. In Section 4.3, we discuss cross-validation as an alternative method to obtain data-driven choices of regularization parameters in the context studied in this article. (See, e.g., Arlot and Celisse (2010) for a survey on cross-validation methods for model selection.)

A general approach for the construction of regularized estimators, such as the one proposed by James and Stein (1961), is provided by the empirical Bayes framework, first proposed in Robbins (1956) and Robbins (1964). A key insight of the empirical Bayes framework, and the closely related compound decision problem framework, is that trying to minimize squared error in higher dimensions involves a trade-off across components of the estimand. The data are informative about which estimators and regularization parameters perform well in terms of squared error and thus allow one to construct regularized estimators that dominate the unregularized $\hat\mu = X$. This intuition is elaborated on in Stigler (1990). The empirical Bayes framework was developed further by Efron and Morris (1973) and Morris (1983), among others. Good reviews and introductions can be found in Zhang (2003) and Efron (2010).

In Section 4 we consider data-driven choices of regularization parameters and emphasize uniform validity of asymptotic approximations to the risk function of the resulting estimators. Lack of uniform validity of standard asymptotic characterizations of risk (as well as of test size) in the context of pretest and model-selection based estimators in low-dimensional settings has been emphasized by Leeb and Pötscher (2005).

While in this article we study risk-optimal estimation of $\mu$, a related literature has focused on the estimation of confidence sets for the same parameter. Wasserman (2006, Section 7.8) and Casella and Hwang (2012) survey some results in this literature. Efron (2010) studies hypothesis testing in high-dimensional settings from an empirical Bayes perspective.

## 3 The risk function

We now turn to our first set of formal results, which pertain to the mean squared error of regularized estimators. Our goal is to guide the researcher’s choice of estimator by describing the conditions under which each of the alternative machine learning estimators performs better than the others.

We first derive a general characterization of the mean squared error of regularized estimators. This characterization is based on the geometry of estimating functions $m$ as depicted in Figure 1. It is a priori not obvious which of these functions is best suited for estimation. We show that for any given data generating process there is an optimal function $m^*_P$ that minimizes mean squared error. Moreover, we show that the mean squared error for an arbitrary $m$ is equal, up to a constant, to the $L^2$ distance between $m$ and $m^*_P$. A function $m$ thus yields a good estimator if it is able to approximate the shape of $m^*_P$ well.

In Section 3.2, we provide analytic expressions for the componentwise risk of ridge, lasso, and pretest estimators, imposing the additional assumption of normality. Summing or integrating componentwise risk over some distribution for $(\mu_i, \sigma_i)$ delivers expressions for compound and integrated risk.

In Section 3.3, we turn to a specific parametric family of data generating processes where each $\mu_i$ is equal to zero with probability $p$, reflecting the notion of sparsity, and is otherwise drawn from a normal distribution with some mean $\mu_0$ and variance $\sigma_0^2$. For this parametric family indexed by $(p, \mu_0, \sigma_0)$, we provide analytic risk functions and visual comparisons of the relative performance of alternative estimators. This allows us to identify key features of the data generating process which affect the relative performance of alternative estimators.
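To make this family concrete, the following Python sketch draws $(\mu_i, X_i)$ pairs from a spike-and-normal data generating process, assuming in addition (as in Section 3.3) that $X_i$ given $\mu_i$ is normal with mean $\mu_i$ and a common standard deviation $\sigma$. The function name, the seed argument, and the parameter values are illustrative assumptions.

```python
import random


def draw_spike_and_normal(n, p, mu0, sigma0, sigma, seed=0):
    """Draw n pairs (mu_i, X_i).

    mu_i is 0 with probability p and N(mu0, sigma0^2) otherwise;
    X_i given mu_i is N(mu_i, sigma^2).
    """
    rng = random.Random(seed)
    mus, xs = [], []
    for _ in range(n):
        mu = 0.0 if rng.random() < p else rng.gauss(mu0, sigma0)
        mus.append(mu)
        xs.append(rng.gauss(mu, sigma))
    return mus, xs


mus, xs = draw_spike_and_normal(n=1000, p=0.5, mu0=2.0, sigma0=1.0, sigma=1.0)
share_zero = sum(mu == 0.0 for mu in mus) / len(mus)
print(share_zero)  # close to p in large samples
```

Varying `p`, `mu0`, and `sigma0` in such a simulation is one way to explore how sparsity and the separation of non-zero means from zero affect the estimators compared in the text.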

### 3.1 General characterization

Recall the setup introduced in Section 2, where we observe $n$ jointly independent random variables $X_1, \ldots, X_n$, with means $\mu_1, \ldots, \mu_n$. We are interested in the mean squared error for the compound problem of estimating all $\mu_1, \ldots, \mu_n$ simultaneously. In this formulation of the problem, $\mu_1, \ldots, \mu_n$ are fixed unknown parameters.

Let $I$ be a random variable with a uniform distribution over the set $\{1, 2, \ldots, n\}$ and consider the random component $(X_I, \mu_I)$ of $(X, \mu)$. This construction induces a mixture distribution for $(X_I, \mu_I)$ (conditional on $P$),

$$(X_I, \mu_I) \mid P \sim \frac{1}{n} \sum_{i=1}^{n} P_i \, \delta_{\mu_i},$$

where $\delta_{\mu_1}, \ldots, \delta_{\mu_n}$ are Dirac measures at $\mu_1, \ldots, \mu_n$. Based on this mixture distribution, define the conditional expectation

$$m^*_P(x) = E[\mu_I \mid X_I = x, P]$$

and the average conditional variance

$$v^*_P = E\left[ \operatorname{var}(\mu_I \mid X_I, P) \mid P \right].$$

The next theorem characterizes the compound risk of an estimator in terms of the average squared discrepancy relative to $m^*_P$, which implies that $m^*_P$ is optimal (lowest mean squared error) for the compound problem.

###### Theorem 1 (Characterization of risk functions)

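Under the assumptions of Section 2 and $\sup_{\lambda \in [0, \infty]} E[(m(X_I, \lambda))^2 \mid P] < \infty$, the compound risk function $R_n$ of $\hat\mu_i = m(X_i, \lambda)$ can be written as

$$R_n(m(\cdot, \lambda), P) = v^*_P + E\left[ (m(X_I, \lambda) - m^*_P(X_I))^2 \mid P \right],$$

which implies

$$\lambda^*(P) = \operatorname*{arg\,min}_{\lambda \in [0, \infty]} E\left[ (m(X_I, \lambda) - m^*_P(X_I))^2 \mid P \right]$$

whenever $\lambda^*(P)$ is well defined.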

The proof of this theorem and all further results can be found in the appendix.

The statement of this theorem implies that the risk of componentwise estimators is equal to an irreducible part $v^*_P$, plus the $L^2$ distance of the estimating function $m(\cdot, \lambda)$ to the infeasible optimal estimating function $m^*_P$. A given data generating process $P$ maps into an optimal estimating function $m^*_P$, and the relative performance of alternative estimators $m$ depends on how well they approximate $m^*_P$.
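The decomposition can be checked numerically in a simple case. The Python sketch below assumes $X_i \sim N(\mu_i, 1)$ and the ridge estimating function, computes the irreducible part and the distance term by trapezoid integration against the mixture density of $X_I$, and compares their sum with the closed-form compound risk of ridge. The toy means, the integration range, and all function names are assumptions of this illustration; the irreducible part is computed as $E[\mu_I^2] - E[m^*_P(X_I)^2]$, which follows from the law of total variance.

```python
import math


def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def m_star(x, mus):
    """Posterior mean of mu_I given X_I = x when X_i ~ N(mu_i, 1)."""
    logs = [-0.5 * (x - mu) ** 2 for mu in mus]
    top = max(logs)  # subtract the max exponent to avoid underflow
    weights = [math.exp(v - top) for v in logs]
    return sum(w * mu for w, mu in zip(weights, mus)) / sum(weights)


def mixture_density(x, mus):
    """Density of X_I: equal-weight mixture of N(mu_i, 1)."""
    return sum(phi(x - mu) for mu in mus) / len(mus)


def trapezoid(f, lo, hi, steps):
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h


def decomposition(mus, lam, pad=10.0, steps=4000):
    """Return (v_star, distance), the two terms on the right-hand side."""
    lo, hi = min(mus) - pad, max(mus) + pad
    second_moment = trapezoid(
        lambda x: m_star(x, mus) ** 2 * mixture_density(x, mus), lo, hi, steps)
    v_star = sum(mu * mu for mu in mus) / len(mus) - second_moment
    distance = trapezoid(
        lambda x: (x / (1.0 + lam) - m_star(x, mus)) ** 2
        * mixture_density(x, mus), lo, hi, steps)
    return v_star, distance


def ridge_compound_risk(mus, lam):
    """Closed-form compound risk of ridge with unit variances."""
    c = 1.0 / (1.0 + lam)
    return sum(c * c + (1.0 - c) ** 2 * mu * mu for mu in mus) / len(mus)


mus = [-1.0, 0.0, 2.0]
v_star, distance = decomposition(mus, lam=0.5)
print(v_star + distance, ridge_compound_risk(mus, 0.5))
```

The two printed numbers should agree up to numerical integration error, and `v_star` does not change with `lam`, which is the sense in which it is irreducible.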

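We can easily write $m^*_P$ explicitly because the conditional expectation defining $m^*_P$ is a weighted average of the values taken by $\mu_i$. Suppose, for example, that $X_i \sim N(\mu_i, 1)$ for $i = 1, \ldots, n$. Let $\phi$ be the standard normal probability density function. Then,

$$m^*_P(x) = \frac{\sum_{i=1}^{n} \mu_i \, \phi(x - \mu_i)}{\sum_{i=1}^{n} \phi(x - \mu_i)}.$$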
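This weighted average is straightforward to evaluate. The Python sketch below implements it for the unit-variance normal case; the function name `optimal_shrinkage` and the toy means are assumptions, and the weights are rescaled by their largest exponent purely for numerical stability (the common normalizing constant of $\phi$ cancels in the ratio).

```python
import math


def optimal_shrinkage(x, mus):
    """Evaluate m*_P(x) when X_i ~ N(mu_i, 1) for every i."""
    exponents = [-0.5 * (x - mu) ** 2 for mu in mus]
    top = max(exponents)  # rescale so the largest weight equals 1
    weights = [math.exp(e - top) for e in exponents]
    return sum(w * mu for w, mu in zip(weights, mus)) / sum(weights)


mus = [-1.0, 0.0, 0.0, 3.0]  # toy means
for x in (-2.0, 0.0, 1.5, 4.0):
    print(x, optimal_shrinkage(x, mus))
```

The result always lies between the smallest and largest $\mu_i$, since it is a convex combination of the means with weights that favor means close to `x`.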

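Theorem 1 conditions on the empirical distribution of $\mu_1, \ldots, \mu_n$, which corresponds to the notion of compound risk. Replacing this empirical distribution by the population distribution $\pi$, so that

$$(X_i, \mu_i) \sim \pi,$$

results analogous to those in Theorem 1 are obtained for the integrated risk and the integrated oracle selectors in equations (1) and (2). That is, let

$$\bar m^*_\pi(x) = E_\pi[\mu_i \mid X_i = x]$$

and

$$\bar v^*_\pi = E_\pi\left[ \operatorname{var}_\pi(\mu_i \mid X_i) \right],$$

and assume $\sup_{\lambda \in [0, \infty]} E_\pi[(m(X_i, \lambda))^2] < \infty$. Then

$$\bar R(m(\cdot, \lambda), \pi) = \bar v^*_\pi + E_\pi\left[ (m(X_i, \lambda) - \bar m^*_\pi(X_i))^2 \right]$$

and

$$\bar\lambda^*(\pi) = \operatorname*{arg\,min}_{\lambda \in [0, \infty]} E_\pi\left[ (m(X_i, \lambda) - \bar m^*_\pi(X_i))^2 \right]. \tag{3}$$

The proof of these assertions is analogous to the proof of Theorem 1. $m^*_P$ and $\bar m^*_\pi$ are optimal componentwise estimators or “shrinkage functions” in the sense that they minimize the compound and integrated risk, respectively.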

### 3.2 Componentwise risk

The characterization of the risk of componentwise estimators in the previous section relies only on the existence of second moments. Explicit expressions for compound risk and integrated risk can be derived under additional structure. We shall now consider a setting in which the $X_i$ are normally distributed,

$$X_i \sim N(\mu_i, \sigma_i^2).$$

This is a particularly relevant scenario in applied research, where the $X_i$ are often unbiased estimators with a normal distribution in large samples (as in examples (a) to (c) in Sections 1 and 2.1). For concreteness, we will focus on the three widely used componentwise estimators introduced in Section 2, ridge, lasso, and pretest, whose estimating functions $m$ were plotted in Figure 1. The following lemma provides explicit expressions for the componentwise risk of these estimators.

###### Lemma 1 (Componentwise risk)

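Consider the setup of Section 2. Then, for $i = 1, \ldots, n$, the componentwise risk of ridge is:

$$R(m_R(\cdot, \lambda), P_i) = \left( \frac{1}{1 + \lambda} \right)^2 \sigma_i^2 + \left( 1 - \frac{1}{1 + \lambda} \right)^2 \mu_i^2.$$

Assume in addition that $X_i$ has a normal distribution. Then, the componentwise risk of lasso is

$$
\begin{aligned}
R(m_L(\cdot, \lambda), P_i) &= \left( 1 + \Phi\!\left( \frac{-\lambda - \mu_i}{\sigma_i} \right) - \Phi\!\left( \frac{\lambda - \mu_i}{\sigma_i} \right) \right) (\sigma_i^2 + \lambda^2) \\
&\quad + \left( \frac{-\lambda - \mu_i}{\sigma_i} \, \phi\!\left( \frac{\lambda - \mu_i}{\sigma_i} \right) + \frac{-\lambda + \mu_i}{\sigma_i} \, \phi\!\left( \frac{-\lambda - \mu_i}{\sigma_i} \right) \right) \sigma_i^2 \\
&\quad + \left( \Phi\!\left( \frac{\lambda - \mu_i}{\sigma_i} \right) - \Phi\!\left( \frac{-\lambda - \mu_i}{\sigma_i} \right) \right) \mu_i^2.
\end{aligned}
$$

Under the same conditions, the componentwise risk of pretest is

$$
\begin{aligned}
R(m_{PT}(\cdot, \lambda), P_i) &= \left( 1 + \Phi\!\left( \frac{-\lambda - \mu_i}{\sigma_i} \right) - \Phi\!\left( \frac{\lambda - \mu_i}{\sigma_i} \right) \right) \sigma_i^2 \\
&\quad + \left( \frac{\lambda - \mu_i}{\sigma_i} \, \phi\!\left( \frac{\lambda - \mu_i}{\sigma_i} \right) - \frac{-\lambda - \mu_i}{\sigma_i} \, \phi\!\left( \frac{-\lambda - \mu_i}{\sigma_i} \right) \right) \sigma_i^2 \\
&\quad + \left( \Phi\!\left( \frac{\lambda - \mu_i}{\sigma_i} \right) - \Phi\!\left( \frac{-\lambda - \mu_i}{\sigma_i} \right) \right) \mu_i^2.
\end{aligned}
$$

Here $\Phi$ and $\phi$ denote the standard normal cumulative distribution function and probability density function, respectively.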

Figure 2 plots the componentwise risk functions in Lemma 1 as functions of $\mu_i$ (with a fixed value of $\lambda$ for each of ridge, lasso, and pretest). It also plots the componentwise risk of the unregularized maximum likelihood estimator, $\hat\mu_i = X_i$, which is equal to $\sigma_i^2$. As Figure 2 suggests, componentwise risk is large for ridge when $|\mu_i|$ is large. The same is true for lasso, except that risk remains bounded. For pretest, componentwise risk is large when $|\mu_i|$ is close to $\lambda$.

Notice that these functions are plotted for a fixed value of the regularization parameter. If is chosen optimally , then the componentwise risks of ridge, lasso, and pretest are no greater than the componentwise risk of the unregularized maximum likelihood estimator , which is . The reason is that ridge, lasso, and pretest nest the unregularized estimator (as the case ).
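The patterns described for Figure 2 can be explored with a small numerical sketch. It is illustrative only: it assumes the Section 2 estimating functions are $x/(1+\lambda)$ for ridge, soft thresholding for lasso, and hard thresholding for pretest, and it approximates $E[(m(X_i,\lambda)-\mu_i)^2]$ by a midpoint rule instead of using closed forms. All function names and the value of $\lambda$ are hypothetical choices, not taken from the paper.

```python
import math

def ridge(x, lam):
    return x / (1.0 + lam)

def lasso(x, lam):
    # Soft thresholding (assumed form of the lasso estimating function).
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def pretest(x, lam):
    # Hard thresholding (assumed form of the pretest estimating function).
    return x if abs(x) > lam else 0.0

def componentwise_risk(m, lam, mu, sigma=1.0, width=10.0, cells=20000):
    """Midpoint-rule approximation of E[(m(X, lam) - mu)^2], X ~ N(mu, sigma^2)."""
    h = 2.0 * width / cells
    total = 0.0
    for k in range(cells):
        z = -width + (k + 0.5) * h  # standardized value (X - mu) / sigma
        weight = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi) * h
        total += (m(mu + sigma * z, lam) - mu) ** 2 * weight
    return total

def ridge_risk_closed_form(lam, mu, sigma=1.0):
    # Ridge expression from Lemma 1.
    c = 1.0 / (1.0 + lam)
    return c ** 2 * sigma ** 2 + (1.0 - c) ** 2 * mu ** 2

if __name__ == "__main__":
    lam = 2.0  # illustrative value, not necessarily the one used in Figure 2
    for mu in (0.0, 1.0, 2.0, 4.0, 8.0):
        risks = [componentwise_risk(m, lam, mu) for m in (ridge, lasso, pretest)]
        print(mu, [round(r, 3) for r in risks])
```

Setting `lam = 0.0` returns a risk of approximately $\sigma_i^2$ for all three estimators, which is the nesting property used in the argument above.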

### 3.3 Spike and normal data generating process

If we take the expressions for componentwise risk derived in Lemma 1 and average them over some population distribution of $(\mu_i,\sigma_i^2)$, we obtain the integrated, or empirical Bayes, risk. For parametric families of distributions of $(\mu_i,\sigma_i^2)$, this might be done analytically. We shall do so now, considering a family of distributions that is rich enough to cover common intuitions about data generating processes, but simple enough to allow for analytic expressions. Based on these expressions, we characterize scenarios that favor the relative performance of each of the estimators considered in this article.

We consider a family of distributions $\pi$ for $(\mu_i,X_i)$ such that: (i) $\mu_i$ takes value zero with probability $p$ and is otherwise distributed as a normal with mean $\mu_0$ and variance $\sigma_0^2$, and (ii) conditional on $\mu_i$, $X_i$ is normal with mean $\mu_i$ and variance $\sigma^2$. The following proposition derives the optimal estimating function $\bar m^*_\pi$, as well as integrated risk functions for this family of distributions.

###### Proposition 1 (Spike and normal data generating process)

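Assume $\pi$ is such that (i) $\mu_1,\ldots,\mu_n$ are drawn independently from a distribution with probability mass $p$ at zero, and normal with mean $\mu_0$ and variance $\sigma_0^2$ elsewhere, and (ii) conditional on $\mu_i$, $X_i$ follows a normal distribution with mean $\mu_i$ and variance $\sigma^2$. Then, the optimal shrinkage function is

$$\bar m^*_\pi(x)=\frac{(1-p)\,\dfrac{1}{\sqrt{\sigma_0^2+\sigma^2}}\,\phi\!\left(\dfrac{x-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\dfrac{\mu_0\sigma^2+x\sigma_0^2}{\sigma_0^2+\sigma^2}}{p\,\dfrac{1}{\sigma}\,\phi\!\left(\dfrac{x}{\sigma}\right)+(1-p)\,\dfrac{1}{\sqrt{\sigma_0^2+\sigma^2}}\,\phi\!\left(\dfrac{x-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)}.$$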

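The integrated risk of ridge is

$$\bar R(m_R(\cdot,\lambda),\pi)=\left(\frac{1}{1+\lambda}\right)^2\sigma^2+(1-p)\left(\frac{\lambda}{1+\lambda}\right)^2(\mu_0^2+\sigma_0^2),$$

with

$$\bar\lambda^*_R(\pi)=\frac{\sigma^2}{(1-p)(\mu_0^2+\sigma_0^2)}.$$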

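The integrated risk of lasso is given by

$$\bar R(m_L(\cdot,\lambda),\pi)=p\,\bar R_0(m_L(\cdot,\lambda),\pi)+(1-p)\,\bar R_1(m_L(\cdot,\lambda),\pi),$$

where

$$\bar R_0(m_L(\cdot,\lambda),\pi)=2\,\Phi\!\left(\frac{-\lambda}{\sigma}\right)(\sigma^2+\lambda^2)-2\left(\frac{\lambda}{\sigma}\right)\phi\!\left(\frac{\lambda}{\sigma}\right)\sigma^2,$$

and

$$
\begin{aligned}
\bar R_1(m_L(\cdot,\lambda),\pi)={}&\left(1+\Phi\!\left(\frac{-\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)-\Phi\!\left(\frac{\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\right)(\sigma^2+\lambda^2)\\
&+\left(\Phi\!\left(\frac{\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)-\Phi\!\left(\frac{-\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\right)(\mu_0^2+\sigma_0^2)\\
&-\frac{1}{\sqrt{\sigma_0^2+\sigma^2}}\,\phi\!\left(\frac{\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)(\lambda+\mu_0)(\sigma_0^2+\sigma^2)\\
&-\frac{1}{\sqrt{\sigma_0^2+\sigma^2}}\,\phi\!\left(\frac{-\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)(\lambda-\mu_0)(\sigma_0^2+\sigma^2).
\end{aligned}
$$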

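Finally, the integrated risk of pretest is given by

$$\bar R(m_{PT}(\cdot,\lambda),\pi)=p\,\bar R_0(m_{PT}(\cdot,\lambda),\pi)+(1-p)\,\bar R_1(m_{PT}(\cdot,\lambda),\pi),$$

where

$$\bar R_0(m_{PT}(\cdot,\lambda),\pi)=2\,\Phi\!\left(\frac{-\lambda}{\sigma}\right)\sigma^2+2\left(\frac{\lambda}{\sigma}\right)\phi\!\left(\frac{\lambda}{\sigma}\right)\sigma^2$$

and

$$
\begin{aligned}
\bar R_1(m_{PT}(\cdot,\lambda),\pi)={}&\left(1+\Phi\!\left(\frac{-\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)-\Phi\!\left(\frac{\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\right)\sigma^2\\
&+\left(\Phi\!\left(\frac{\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)-\Phi\!\left(\frac{-\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\right)(\mu_0^2+\sigma_0^2)\\
&-\frac{1}{\sqrt{\sigma_0^2+\sigma^2}}\,\phi\!\left(\frac{\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\left(\lambda(\sigma_0^2-\sigma^2)+\mu_0(\sigma_0^2+\sigma^2)\right)\\
&-\frac{1}{\sqrt{\sigma_0^2+\sigma^2}}\,\phi\!\left(\frac{-\lambda-\mu_0}{\sqrt{\sigma_0^2+\sigma^2}}\right)\left(\lambda(\sigma_0^2-\sigma^2)-\mu_0(\sigma_0^2+\sigma^2)\right).
\end{aligned}
$$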
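The closed form for the optimal ridge parameter in Proposition 1 can be recovered from the ridge risk by a first-order condition. As a sketch, write $c=1/(1+\lambda)$ and $A=(1-p)(\mu_0^2+\sigma_0^2)$; both symbols are shorthand introduced here for the derivation, not notation from the paper.

```latex
\begin{aligned}
\bar R(m_R(\cdot,\lambda),\pi) &= c^2\sigma^2 + (1-c)^2 A, \\
0 &= 2c\,\sigma^2 - 2(1-c)A
  \;\Longrightarrow\; c^* = \frac{A}{\sigma^2 + A}, \\
\bar\lambda^*_R(\pi) &= \frac{1}{c^*} - 1 = \frac{\sigma^2}{A}
  = \frac{\sigma^2}{(1-p)(\mu_0^2+\sigma_0^2)}, \\
\bar R(m_R(\cdot,\bar\lambda^*_R(\pi)),\pi) &= \frac{\sigma^2 A}{\sigma^2 + A}.
\end{aligned}
```

The last line shows that the minimal ridge risk is always below $\sigma^2$, and that it approaches $\sigma^2$ as $A$ grows, that is, as the second moment of $\mu_i$ becomes large relative to the noise variance.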

Notice that, even under substantial sparsity (that is, if $p$ is large), the optimal shrinkage function, $\bar m^*_\pi$, never shrinks all the way to zero (unless, of course, $\mu_0=\sigma_0=0$ or $p=1$). This could in principle cast some doubt on the appropriateness of thresholding estimators, such as lasso or pretest, which induce sparsity in the estimated parameters. However, as we will see below, despite this stark difference between thresholding estimators and $\bar m^*_\pi$, lasso and, to a certain extent, pretest are able to approximate the integrated risk of $\bar m^*_\pi$ in the spike and normal model when the degree of sparsity in the parameters of interest is substantial.
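To make the contrast with thresholding concrete, the optimal shrinkage function of Proposition 1 can be evaluated directly. The following is a minimal sketch: the function names and parameter values are hypothetical, and the soft-thresholding form of lasso is an assumption about the Section 2 definition.

```python
import math

def phi(z):
    # Standard normal density.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def optimal_shrinkage(x, p, mu0, sigma0, sigma=1.0):
    """Posterior mean of mu given X = x in the spike and normal model."""
    s = math.sqrt(sigma0 ** 2 + sigma ** 2)
    spike = p * phi(x / sigma) / sigma            # marginal density weight of mu = 0
    slab = (1.0 - p) * phi((x - mu0) / s) / s     # marginal density weight of mu != 0
    slab_mean = (mu0 * sigma ** 2 + x * sigma0 ** 2) / (sigma0 ** 2 + sigma ** 2)
    # For very large |x| both weights can underflow to zero; this sketch
    # does not guard against that case.
    return slab * slab_mean / (spike + slab)

def lasso(x, lam):
    # Soft thresholding (assumed form of the lasso estimating function).
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

if __name__ == "__main__":
    p, mu0, sigma0 = 0.9, 0.0, 3.0  # illustrative values only
    for x in (0.5, 1.0, 2.0, 4.0):
        print(x, optimal_shrinkage(x, p, mu0, sigma0), lasso(x, 1.0))
```

For small $|x|$ the optimal function returns a small but nonzero value, whereas soft thresholding returns exactly zero whenever $|x|\le\lambda$.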

#### Visual representations

While it is difficult to directly interpret the risk formulas in Proposition 1, plotting these formulas as functions of the parameters governing the data generating process elucidates some crucial aspects of the risk of the corresponding estimators. Figure 3 does so, plotting the minimal integrated risk function of the different estimators. Each of the four subplots in Figure 3 is based on a fixed value of $p$, with $\mu_0$ and $\sigma_0$ varying along the bottom axes. For each value of the triple $(p,\mu_0,\sigma_0)$, Figure 3 reports minimal integrated risk of each estimator (minimized over $\lambda\in[0,\infty]$). As a benchmark, Figure 3 reports the risk of the optimal shrinkage function, $\bar m^*_\pi$, simulated over 10 million repetitions. Figure 4 maps the regions of parameter values over which each of the three estimators, ridge, lasso, or pretest, performs best in terms of integrated risk.
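The quantity reported in Figure 3 for each estimator can be approximated by evaluating the Proposition 1 formulas on a grid of $\lambda$ values and taking the minimum. The sketch below does this; the function names, the finite grid (which only approximates $[0,\infty]$), and the example parameter values are assumptions made for illustration, not the procedure used to produce the figures.

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ridge_risk(lam, p, mu0, s0, s=1.0):
    c = 1.0 / (1.0 + lam)
    return c ** 2 * s ** 2 + (1.0 - p) * (1.0 - c) ** 2 * (mu0 ** 2 + s0 ** 2)

def lasso_risk(lam, p, mu0, s0, s=1.0):
    t = math.sqrt(s0 ** 2 + s ** 2)
    a, b = (lam - mu0) / t, (-lam - mu0) / t
    r0 = 2.0 * Phi(-lam / s) * (s ** 2 + lam ** 2) - 2.0 * (lam / s) * phi(lam / s) * s ** 2
    r1 = ((1.0 + Phi(b) - Phi(a)) * (s ** 2 + lam ** 2)
          + (Phi(a) - Phi(b)) * (mu0 ** 2 + s0 ** 2)
          - phi(a) / t * (lam + mu0) * (s0 ** 2 + s ** 2)
          - phi(b) / t * (lam - mu0) * (s0 ** 2 + s ** 2))
    return p * r0 + (1.0 - p) * r1

def pretest_risk(lam, p, mu0, s0, s=1.0):
    t = math.sqrt(s0 ** 2 + s ** 2)
    a, b = (lam - mu0) / t, (-lam - mu0) / t
    r0 = 2.0 * Phi(-lam / s) * s ** 2 + 2.0 * (lam / s) * phi(lam / s) * s ** 2
    r1 = ((1.0 + Phi(b) - Phi(a)) * s ** 2
          + (Phi(a) - Phi(b)) * (mu0 ** 2 + s0 ** 2)
          - phi(a) / t * (lam * (s0 ** 2 - s ** 2) + mu0 * (s0 ** 2 + s ** 2))
          - phi(b) / t * (lam * (s0 ** 2 - s ** 2) - mu0 * (s0 ** 2 + s ** 2)))
    return p * r0 + (1.0 - p) * r1

def minimal_risk(risk, p, mu0, s0, s=1.0, lam_max=20.0, steps=2000):
    # Grid search over lambda in [0, lam_max]; lam_max stands in for infinity.
    grid = [lam_max * k / steps for k in range(steps + 1)]
    return min(risk(lam, p, mu0, s0, s) for lam in grid)

if __name__ == "__main__":
    p, mu0, s0 = 0.5, 2.0, 1.0  # hypothetical point in the parameter space
    for name, risk in (("ridge", ridge_risk), ("lasso", lasso_risk), ("pretest", pretest_risk)):
        print(name, minimal_risk(risk, p, mu0, s0))
```

Looping this over a grid of $(\mu_0,\sigma_0)$ for fixed $p$ and recording which estimator attains the smallest value gives a map of the kind shown in Figure 4.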

Figures 3 and 4 provide some useful insights on the performance of shrinkage estimators. With no true zeros, ridge performs better than lasso or pretest. A clear advantage of ridge in this setting is that, in contrast to lasso or pretest, ridge allows shrinkage without shrinking some observations all the way to zero. As the share of true zeros increases, the relative performance of ridge deteriorates for pairs $(\mu_0,\sigma_0)$ away from the origin. Intuitively, linear shrinkage imposes a disadvantageous trade-off on ridge. Using ridge to heavily shrink towards the origin in order to fit potential true zeros produces large expected errors for observations with $\mu_i$ away from the origin. As a result, ridge performance suffers considerably unless much of the probability mass of the distribution of $\mu_i$ is tightly concentrated around zero.

In the absence of true zeros, pretest performs particularly poorly unless the distribution of $\mu_i$ has much of its probability mass tightly concentrated around zero, in which case shrinking all the way to zero produces low risk. However, in the presence of true zeros, pretest performs well when much of the probability mass of the distribution of $\mu_i$ is located in a set that is well-separated from zero, which facilitates the detection of true zeros. Intermediate values of $\mu_0$ coupled with moderate values of $\sigma_0$ produce settings where the conditional distributions $X_i\mid\mu_i=0$ and $X_i\mid\mu_i\neq 0$ greatly overlap, inducing substantial risk for pretest estimation.

The risk performance of lasso is particularly robust. It outperforms ridge and pretest for values of $(\mu_0,\sigma_0)$ at intermediate distances to the origin, and uniformly controls risk over the parameter space. This robustness of lasso may explain its popularity in empirical practice. Despite the fact that, unlike optimal shrinkage, thresholding estimators impose sparsity, lasso and, to a certain extent, pretest are able to approximate the integrated risk of the optimal shrinkage function over much of the parameter space.

All in all, the results in Figures 3 and 4 for the spike and normal case support the adoption of ridge in empirical applications where there are no reasons to presume the presence of many true zeros among the parameters of interest. In empirical settings where many true zeros may be expected, Figures 3 and 4 show that the choice among estimators in the spike and normal model depends on how well separated the distributions $X_i\mid\mu_i=0$ and $X_i\mid\mu_i\neq 0$ are. Pretest is preferred in the well-separated case, while lasso is preferred in the non-separated case.

## 4 Data-driven choice of regularization parameters

In Section 3.3 we adopted a parametric model for the distribution of $(\mu_i,X_i)$ to study the risk properties of regularized estimators under an oracle choice of the regularization parameter, $\bar\lambda^*(\pi)$. In this section, we return to a nonparametric setting and show that it is possible to consistently estimate $\bar\lambda^*(\pi)$ from the data, $X_1,\ldots,X_n$, under some regularity conditions on $\pi$. We consider estimates $\hat\lambda_n$ of $\bar\lambda^*(\pi)$ based on Stein’s unbiased risk estimate and based on cross-validation. The resulting estimators $m(X_i,\hat\lambda_n)$ have risk functions which are uniformly close to those of the infeasible estimators $m(X_i,\bar\lambda^*(\pi))$.

The uniformity part of this statement is important and not obvious. Absent uniformity, asymptotic approximations might misleadingly suggest good behavior, while in fact the finite sample behavior of proposed estimators might be quite poor for plausible sets of data generating processes. The uniformity results in this section contrast markedly with other oracle approximations to risk, most notably approximations which assume that the true zeros, that is, the components $i$ for which $\mu_i=0$, are known. Asymptotic approximations of this latter form are often invoked when justifying the use of lasso and pretest estimators. Such approximations are in general not uniformly valid, as emphasized by Leeb and Pötscher (2005) and others.

### 4.1 Uniform loss and risk consistency

For the remainder of the paper we adopt the following short-hand notation:

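$$
\begin{aligned}
L_n(\lambda)&=L_n(X,m(\cdot,\lambda),P) &&\text{(compound loss)}\\
R_n(\lambda)&=R_n(m(\cdot,\lambda),P) &&\text{(compound risk)}\\
\bar R_\pi(\lambda)&=\bar R(m(\cdot,\lambda),\pi) &&\text{(empirical Bayes or integrated risk)}
\end{aligned}
$$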

We will now consider estimators $\hat\lambda_n$ of $\bar\lambda^*(\pi)$ that are obtained by minimizing some empirical estimate of the risk function $\bar R_\pi(\lambda)$ (possibly up to a constant that depends only on $\pi$). The resulting $\hat\lambda_n$ is then used to obtain regularized estimators of the form $\hat\mu_i=m(X_i,\hat\lambda_n)$. We will show that for large $n$ the compound loss, the compound risk, and the integrated risk functions of the resulting estimators are uniformly close to the corresponding functions of the same estimators evaluated at oracle-optimal values of $\lambda$. As $n\to\infty$, the differences between $L_n$, $R_n$, and $\bar R_\pi$ vanish, so compound loss optimality, compound risk optimality, and integrated risk optimality become equivalent.

The following theorem establishes our key result for this section. Let $\mathcal{Q}$ be a set of probability distributions for $(\mu_i,X_i)$. Theorem 2 provides sufficient conditions for uniform loss consistency over $\pi\in\mathcal{Q}$, namely that (i) the supremum of the difference between the loss, $L_n(\lambda)$, and the empirical Bayes risk, $\bar R_\pi(\lambda)$, vanishes in probability uniformly over $\pi\in\mathcal{Q}$ and (ii) that $\hat\lambda_n$ is chosen to minimize a uniformly consistent estimator, $r_n(\lambda)$, of the risk function, $\bar R_\pi(\lambda)$ (possibly up to a constant $\bar v_\pi$). Under these conditions, the difference between the loss $L_n(\hat\lambda_n)$ and the infeasible minimal loss $\inf_{\lambda\in[0,\infty]}L_n(\lambda)$ vanishes in probability uniformly over $\pi\in\mathcal{Q}$.

###### Theorem 2 (Uniform loss consistency)

Assume

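$$\sup_{\pi\in\mathcal{Q}}P_\pi\left(\sup_{\lambda\in[0,\infty]}\left|L_n(\lambda)-\bar R_\pi(\lambda)\right|>\epsilon\right)\to 0,\quad\forall\epsilon>0. \tag{4}$$

Assume also that there are functions, $\bar r_\pi(\lambda)$, $\bar v_\pi$, and $r_n(\lambda)$ (of $(\pi,\lambda)$, $\pi$, and $(\{X_i\}_{i=1}^n,\lambda)$, respectively) such that $\bar r_\pi(\lambda)=\bar R_\pi(\lambda)+\bar v_\pi$, and

$$\sup_{\pi\in\mathcal{Q}}P_\pi\left(\sup_{\lambda\in[0,\infty]}\left|r_n(\lambda)-\bar r_\pi(\lambda)\right|>\epsilon\right)\to 0,\quad\forall\epsilon>0. \tag{5}$$

Then,

$$\sup_{\pi\in\mathcal{Q}}P_\pi\left(\left|L_n(\hat\lambda_n)-\inf_{\lambda\in[0,\infty]}L_n(\lambda)\right|>\epsilon\right)\to 0,\quad\forall\epsilon>0,$$

where $\hat\lambda_n\in\operatorname*{argmin}_{\lambda\in[0,\infty]}r_n(\lambda)$.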

The sufficient conditions given by this theorem, as stated in equations (4) and (5), are rather high-level. We shall now give more primitive conditions for these requirements to hold. In Sections 4.2 and 4.3 below, we propose suitable choices of $r_n(\lambda)$ based on Stein’s unbiased risk estimator (SURE) and cross-validation (CV), and show that equation (5) holds for these choices of $r_n(\lambda)$.

The following Theorem 3 provides a set of conditions under which equation (4) holds, so the difference between compound loss and integrated risk vanishes uniformly. Aside from a bounded moment assumption, the conditions in Theorem 3 impose some restrictions on the estimating functions, $m(x,\lambda)$. Lemma 2 below shows that those conditions hold, in particular, for ridge, lasso, and pretest estimators.

###### Theorem 3 (Uniform L2-convergence)

Suppose that

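1. $m(x,\lambda)$ is monotonic in $\lambda$ for all $x$ in $\mathbb{R}$,

2. $m(x,0)=x$ and $\lim_{\lambda\to\infty}m(x,\lambda)=0$ for all $x$ in $\mathbb{R}$,

3. $\sup_{\pi\in\mathcal{Q}}E_\pi[X^4]<\infty$.

4. For any $\epsilon>0$ there exists a set of regularization parameters $0=\lambda_0<\ldots<\lambda_k=\infty$, which may depend on $\epsilon$, such that

$$E_\pi\left[\left(|X-\mu|+|\mu|\right)\left|m(X,\lambda_j)-m(X,\lambda_{j-1})\right|\right]\le\epsilon$$

for all $j=1,\ldots,k$ and all $\pi\in\mathcal{Q}$.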

Then,

$$\sup_{\pi\in\mathcal{Q}}E_\pi\left[\sup_{\lambda\in[0,\infty]}\left(L_n(\lambda)-\bar R_\pi(\lambda)\right)^2\right]\to 0. \tag{6}$$

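Notice that finiteness of $\sup_{\pi\in\mathcal{Q}}E_\pi[X^4]$ is equivalent to finiteness of $\sup_{\pi\in\mathcal{Q}}E_\pi[\mu^4]$ and $\sup_{\pi\in\mathcal{Q}}E_\pi[(X-\mu)^4]$ via Jensen’s and Minkowski’s inequalities.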

###### Lemma 2

If $\sup_{\pi\in\mathcal{Q}}E_\pi[X^4]<\infty$, then equation (6) holds for ridge and lasso. If, in addition, $X$ is continuously distributed with a bounded density, then equation (6) holds for pretest.

Theorem 2 provides sufficient conditions for uniform loss consistency. The following corollary shows that under the same conditions we obtain uniform risk consistency, that is, the integrated risk of the estimator based on the data-driven choice $\hat\lambda_n$ becomes uniformly close to the risk of the oracle-optimal $\bar\lambda^*(\pi)$. For the statement of this corollary, recall that $\bar R(m(\cdot,\hat\lambda_n),\pi)$ is the integrated risk of the estimator $m(\cdot,\hat\lambda_n)$ using the stochastic (data-dependent) $\hat\lambda_n$.

###### Corollary 1 (Uniform risk consistency)

Under the assumptions of Theorem 3,

$$\sup_{\pi\in\mathcal{Q}}\left|\bar R(m(\cdot,\hat\lambda_n),\pi)-\inf_{\lambda\in[0,\infty]}\bar R_\pi(\lambda)\right|\to 0. \tag{7}$$

In this section, we have shown that approximations to the risk function of machine learning estimators based on oracle-knowledge of $\lambda$ are uniformly valid over $\pi\in\mathcal{Q}$ under mild assumptions. It is worth pointing out that such uniformity is not a trivial result. This is made clear by comparison to an alternative approximation, sometimes invoked to motivate the adoption of machine learning estimators, based on oracle-knowledge of true zeros among $\mu_1,\ldots,\mu_n$ (see, e.g., Fan and Li 2001). As shown in Appendix A.2, assuming oracle knowledge of zeros does not yield a uniformly valid approximation.

### 4.2 Stein’s unbiased risk estimate

Theorem 2 provides sufficient conditions for uniform loss consistency using a general estimator $r_n$ of risk. We shall now establish that our conditions apply to a particular estimator of $\bar r_\pi(\lambda)$, known as Stein’s unbiased risk estimate (SURE), which was first proposed by Stein (1981). SURE leverages the assumption of normality to obtain an elegant expression of risk as an expected sum of squared residuals plus a penalization term.

SURE as originally proposed requires that $m(x,\lambda)$ be piecewise differentiable as a function of $x$, which excludes discontinuous estimators such as the pretest estimator $m_{PT}(x,\lambda)$. We provide a generalization in Lemma 3 that allows for discontinuities. This lemma is stated in terms of integrated risk; with the appropriate modifications, the same result holds verbatim for compound risk.
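For a continuous estimator such as lasso, the "squared residuals plus penalization" structure of SURE is easy to sketch in code. The example below is an illustration under stated assumptions, not the paper's implementation: it takes $\sigma=1$, assumes soft thresholding for lasso (so the squared residual is $\min(|x|,\lambda)^2$ and the derivative in $x$ is $1(|x|>\lambda)$), drops the additive constant, and simulates data from a spike and normal design with hypothetical parameter values, seed, and grid.

```python
import math
import random

def lasso(x, lam):
    # Soft thresholding (assumed form of the lasso estimating function).
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def sure_lasso(xs, lam):
    """Sample SURE criterion for soft thresholding with sigma = 1, up to a constant."""
    n = len(xs)
    residuals = sum(min(abs(x), lam) ** 2 for x in xs) / n   # mean of (m(X) - X)^2
    penalty = 2.0 * sum(1.0 for x in xs if abs(x) > lam) / n  # 2 * mean of dm/dx
    return residuals + penalty

def compound_loss(xs, mus, lam):
    # Infeasible in practice: requires the true mu_i.
    return sum((lasso(x, lam) - mu) ** 2 for x, mu in zip(xs, mus)) / len(xs)

def simulate(n, p, mu0, sigma0, rng):
    # Spike and normal design with unit noise variance.
    mus = [0.0 if rng.random() < p else rng.gauss(mu0, sigma0) for _ in range(n)]
    xs = [mu + rng.gauss(0.0, 1.0) for mu in mus]
    return xs, mus

def select_lambda(xs, grid):
    # Data-driven choice: minimize the SURE criterion over a finite grid.
    return min(grid, key=lambda lam: sure_lasso(xs, lam))

if __name__ == "__main__":
    rng = random.Random(0)
    xs, mus = simulate(5000, 0.8, 3.0, 1.0, rng)
    grid = [0.05 * k for k in range(101)]  # lambda in [0, 5]
    lam_hat = select_lambda(xs, grid)
    best = min(compound_loss(xs, mus, lam) for lam in grid)
    print("lambda_hat:", lam_hat)
    print("loss gap:", compound_loss(xs, mus, lam_hat) - best)
```

The printed loss gap is the finite-sample counterpart of the difference that Theorem 2 shows vanishes in probability; it can only be computed in a simulation, because the oracle loss depends on the unobserved $\mu_i$.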

###### Lemma 3 (SURE for piecewise differentiable estimators)

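Suppose that $\mu\sim\pi$ and

$$X\mid\mu\sim N(\mu,1).$$

Let $f_\pi=\pi*\phi$ be the marginal density of $X$, where $\phi$ is the standard normal density. Consider an estimator $m(X)$ of $\mu$, and suppose that $m(x)$ is differentiable everywhere in $\mathbb{R}\setminus\{x_1,\ldots,x_J\}$, but might be discontinuous at $\{x_1,\ldots,x_J\}$. Let $\nabla m$ be the derivative of $m$ (defined arbitrarily at $\{x_1,\ldots,x_J\}$), and let $\Delta m_j=\lim_{x\downarrow x_j}m(x)-\lim_{x\uparrow x_j}m(x)$ for $j\in\{1,\ldots,J\}$. Assume that $E_\pi[(m(X)-X)^2]<\infty$, $E_\pi[|\nabla m(X)|]<\infty$, and $(m(x)-x)\,\phi(x-\mu)\to 0$ as $|x|\to\infty$ $\pi$-a.s. Then,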

$$\bar R(m(\cdot),\pi)=E_\pi\left[(m(X)-X)^2\right]+2\left(E_\pi[\nabla m(X)]+\sum_{j=1}^{J}\Delta m_j\,f_\pi(x_j)\right)-1.$$
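The role of the jump terms can be checked numerically for the pretest estimator. The sketch below assumes hard thresholding, $m(x)=x\,1(|x|>\lambda)$, which has derivative $1(|x|>\lambda)$ away from $\pm\lambda$ and jumps of size $\lambda$ at both $-\lambda$ and $\lambda$, and takes $\pi$ to be the spike and normal prior with $\sigma=1$. It evaluates the right-hand side of the lemma by quadrature and compares it with the pretest integrated risk of Proposition 1. Function names and parameter values are hypothetical.

```python
import math

def phi(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def marginal_density(x, p, mu0, s0):
    # f_pi for the spike and normal prior with unit noise variance.
    t = math.sqrt(s0 ** 2 + 1.0)
    return p * phi(x) + (1.0 - p) * phi((x - mu0) / t) / t

def sure_identity_pretest(lam, p, mu0, s0, cells=4000):
    """Right-hand side of Lemma 3 for hard thresholding at lam."""
    # E[(m(X) - X)^2] = E[X^2 1(|X| <= lam)], midpoint rule on [-lam, lam].
    h = 2.0 * lam / cells
    residual = 0.0
    for k in range(cells):
        x = -lam + (k + 0.5) * h
        residual += x * x * marginal_density(x, p, mu0, s0) * h
    # E[grad m(X)] = P(|X| > lam).
    t = math.sqrt(s0 ** 2 + 1.0)
    outside = (p * 2.0 * Phi(-lam)
               + (1.0 - p) * (1.0 - Phi((lam - mu0) / t) + Phi((-lam - mu0) / t)))
    # Two discontinuities, each with jump size lam.
    jumps = lam * (marginal_density(-lam, p, mu0, s0) + marginal_density(lam, p, mu0, s0))
    return residual + 2.0 * (outside + jumps) - 1.0

def pretest_integrated_risk(lam, p, mu0, s0):
    # Proposition 1 with sigma = 1.
    t = math.sqrt(s0 ** 2 + 1.0)
    a, b = (lam - mu0) / t, (-lam - mu0) / t
    r0 = 2.0 * Phi(-lam) + 2.0 * lam * phi(lam)
    r1 = ((1.0 + Phi(b) - Phi(a))
          + (Phi(a) - Phi(b)) * (mu0 ** 2 + s0 ** 2)
          - phi(a) / t * (lam * (s0 ** 2 - 1.0) + mu0 * (s0 ** 2 + 1.0))
          - phi(b) / t * (lam * (s0 ** 2 - 1.0) - mu0 * (s0 ** 2 + 1.0)))
    return p * r0 + (1.0 - p) * r1

if __name__ == "__main__":
    args = (2.0, 0.8, 3.0, 1.0)  # lam, p, mu0, sigma0 (illustrative)
    print(sure_identity_pretest(*args), pretest_integrated_risk(*args))
```

Dropping the `jumps` term from `sure_identity_pretest` breaks the agreement, which is why the classical formula without the discontinuity correction cannot be used for pretest.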

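The result of this lemma yields an objective function for the choice of $\lambda$ of the general form we considered in Section 4.1, with the additive constant equal to one, so that $\bar r_\pi(\lambda)=\bar R_\pi(\lambda)+1$, and

$$\bar r_\pi(\lambda)=E_\pi\left[(m(X,\lambda)-X)^2\right]+2\left(E_\pi[\nabla_x m(X,\lambda)]+\sum_{j=1}^{J}\Delta m_j(\lambda)\,f_\pi(x_j)\right), \tag{8}$$

where $\nabla_x m(x,\lambda)$ is the derivative of $m(x,\lambda)$ with respect to its first argument, and $\{x_j\}_{j=1}^{J}$ may depend on $\lambda$. The expression in equation (8) can be estimated using its sample analog,

$$r_n(\lambda)=\frac{1}{n}\sum_{i=1}^{n}(m(X_i,\lambda)-X_i)^2+2\left(\frac{1}{n}\sum_{i=1}^{n}\nabla_x m(X_i,\lambda)+\sum_{j=1}^{J}\Delta m_j(\lambda)\,\hat f(x_j)\right), \tag{9}$$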

where