Regularization methods perform model selection subject to the choice of a regularization parameter, and are commonly used when the number of predictor variables is too large to consider all subsets. In regularized regression, these methods operate by minimizing the penalized least squares function
where is a
response vector,is a deterministic matrix of predictor variables, is a vector of coefficients, and is a penalty function. A common choice for the penalty function is the norm of the coefficients. This penalty function was proposed by Tibshirani (1996) and termed the Lasso (Least absolute shrinkage and selection operator). The solution to the Lasso is sparse in that it automatically sets some of the estimated coefficients equal to zero, and the entire regularization path can be found using the computationally efficient Lars algorithm (Efron et al., 2004). Given its computational advantages, understanding the theoretical properties of the Lasso is an important research area.
This paper focuses on the performance of the Lasso for predictive purposes. To that end, we evaluate the Lasso-estimated models using the loss function. Assume that the true data generating process is
where is a unknown mean vector and is a random noise vector. The loss is defined as
where is the Lasso estimated vector of coefficients for a specific choice of the regularization parameter and is the squared Euclidean norm. If the true model is included among the candidate models, then for some unknown true coefficient vector and the loss function takes the form
In most modern applications, it is assumed that is sparse and only has non-zero entries.
A large portion of the regularization literature has focused on establishing probability loss bounds for the Lasso and its variants (see e.g., Bunea et al., 2007, Negahban et al., 2009, Bickel et al., 2010, and Buhlmann and van de Geer, 2011). Roughly, for a particular deterministic choice of , these probability bounds are of the form
(p. 102, Buhlmann and van de Geer, 2011). Here
is the true error variance, andis a constant that does not depend on . These bounds are commonly termed “oracle inequalities” since, apart from the term and the constant, they closely resemble the loss expected if an oracle told us the true set of predictors and we fit least squares. In light of this connection, it is commonly noted in the literature that the “-estimator achieves the ideal risk up to a logarithmic ” (Fan and Lv, 2008) and that the “[ factor] can be seen as the price to pay for not knowing the active set” (p. 102, Buhlmann and van de Geer, 2011). Furthermore, since the bound depends on , one can conclude that “the ambient dimension and structural parameters can grow as some function of the sample size , while still having the statistical error decrease to zero”
(Negahban et al., 2009).
Similar asymptotic conclusions exist in the work by Greenshtein and Ritov (2004) on the “persistence” of the Lasso estimator. In this context, the authors showed that the difference in the expected prediction error of the Lasso estimator and the optimal estimator converges to zero in probability. From this result, the authors concluded that “there is ‘asymptotically no harm’ in introducing many more explanatory variables than observations.” The extended work by Greenshtein (2006) similarly concludes that “in some ‘asymptotic sense’, when assuming a sparsity condition, there is no loss in letting  be much larger than .”
Unfortunately, there is a disconnect between oracle inequalities and the way that the Lasso is implemented in practice. In practice is not taken to be a deterministic value, but rather it is selected using an information criterion such as Akaike’s information criterion (; Akaike, 1973), the corrected (; Hurvich and
Tsai, 1989), the Bayesian information criterion (; Schwarz, 1978), or Generalized cross-validation (;
Craven and Wahba, 1978) or using the data-dependent procedure 10-fold cross-validation () (see, e.g., Fan and Li, 2001, Leng et al., 2006, Zou et al., 2007, and Flynn et al., 2013). This motivates us to study the behavior of the loss based on a data-dependent choice of the tuning parameter.
Define the random variableto be the optimal (infeasible) choice of that minimizes the loss function over the regularization path. In what follows, we will focus on the loss of the Lasso evaluated at . This selector provides information about the performance of the method in an absolute sense, and it represents the ultimate goal for any model selection procedure designed for prediction. By the definition of , the loss bound (3) still applies for . Thus, with this choice of selector it is possible to compare the best-case performance of the method against the commonly held views in the literature with a particular focus on the sensitivity of the loss of the Lasso evaluated at to the number of predictor variables.
The remainder of this paper is organized as follows. Section 2 presents some theoretical results on the behavior of the Lasso based on a data-dependent choice of and proves that the best-case predictive performance can deteriorate as the number of predictor variables is increased. In particular, under the assumption of a sparse true model and orthonormal predictors, we prove that the probability of deterioration can be arbitrarily close to one for a sufficiently high signal to noise ratio and sufficiently large , and that the expected amount of deterioration is infinite. Section 3 investigates the rate of deterioration empirically and shows that it can be much worse than one would expect based on the established loss bounds. Section 4 presents an analysis of HIV data using the Lasso and exemplifies the occurrence of deterioration in practice. Finally, Section 5 presents some final remarks and areas for future research. The appendix includes some additional technical and simulation results.
2 Theoretical Results
Here we consider a simple framework for which there exists an exact solution for the Lasso estimator. We assume that
where is the response vector, is a matrix of deterministic predictors such that (the identity matrix), is the vector of true unknown coefficients, and is a noise vector where . We assume that but all of the other true coefficients are equal to zero. We further assume that there is no intercept. In practice, the intercept is not penalized and is computed as
for each value of , where is the mean of the and is a vector containing the column means of .
By construction is the vector of the least squares-estimated coefficients based on the full model. It follows that
for , and that is independent of for . Furthermore, for a given , the Lasso estimated coefficients are
for (Fan and Li, 2001). To measure the performance of this estimator, define
Here we subscript the loss by in order to emphasize that the loss at a particular value of depends on the number of predictor variables. In particular, for this example,
We wish to study the sensitivity of the Lasso to the number of predictor variables. To do this, we compare the loss when only the one true predictor is used, , to the loss where predictors (including the one true predictor) are used, . Under our set-up, if , then for any given . In what follows, we establish the stronger result that with probability arbitrarily close to one for an appropriately high signal to noise ratio and large , and that the expected value of this ratio is infinite.
To simplify notation, define
Under the orthonormality assumption, we require .
For all ,
where is the cumulative distribution function of a standard normal random variable.
is the cumulative distribution function of a standard normal random variable.
First note that one can always choose , which will shrink all of the estimated coefficients to zero. Thus, for all , . The following Lemma establishes that equality always occurs if the sign of is incorrect.
If , then for all .
Lemma 2.1 establishes that if the sign of is incorrect, for all , so no deterioration will occur.
Next we focus our attention on the situation where the sign of is correct. The following lemma establishes the optimal loss for the Lasso when only the one true predictor is used.
If , then
In this case, when the model includes superfluous predictors, the optimal level of shrinkage is determined by balancing the increase in loss due to the bias induced from over-shrinking the true estimated coefficient with the increase in loss due to under-shrinking the estimated coefficients for the superfluous predictors. The next two lemmas establish necessary and sufficient conditions on the ’s for deterioration to occur.
Assume that . If , then if and only if .
Assume that . If , then for all .
It follows that deterioration occurs unless it is possible to shrink optimally while at the same time shrinking all of the estimated coefficients for the superfluous predictors to zero. In particular, by Lemmas 2.3 and 2.4, when the sign of is correct, unless .
With these results in place, the two terms on the right-hand side of equation (7) can be explained intuitively. Specifically, the first term reflects the increasing likelihood that the sign of is correct as the signal-to-noise ratio increases, and the second term reflects the decreasing probability of no deterioration in this case as increases. We can also establish the probability of deterioration given that the sign of is correct.
For all ,
where is the cumulative distribution function of a standard normal random variable.
Theorems 2.1 and 2.2 establish that the best case predictive performance of the Lasso can deteriorate as increases. Under our set-up, the probability of deterioration can be arbitrarily close to one, and the following proposition establishes that the expected amount of deterioration is infinite.
For all ,
Example. To demonstrate the implications of Theorem 2.1, consider an ANOVA model based on an orthonormal regression matrix. Specifically, assume that we have binary predictor variables, each of which is coded using effects coding, and a balanced design with an equal number of observations falling into each of the combinations. If we scale these predictors to have unit variance, then an ANOVA model on only the main effects is equivalent to a regression on these predictors. Similarly, if we consider all pairwise products and then standardize, a regression including them as well as the main effects is equivalent to an ANOVA with all two-way interactions. We can continue to add higher-order interactions in a similar manner, where a model with all -way interactions includes predictors.
3 Empirical Study
This section empirically investigates the cost of not knowing the true set of predictors when working with high-dimensional data. We assume that is generated according to the generating model in (1
). The Lasso regressions are fit using the Rlars package (Hastie and Efron, 2011). We use the default package settings and include an intercept in the model. We consider two simulation set-ups. The first is in line with our theoretical work and studies the performance of the Lasso when the columns of are trigonometric predictors. Since these predictors are orthogonal, this setting requires . To allow for situations with , we also study the case where the columns of are independent standard normals.
The main goal of our simulations is to understand the behavior of the infeasible optimal loss for the Lasso as and vary. To measure the deterioration in optimal loss we consider the optimal loss ratio
which compares the minimum loss based on predictors to the minimum loss based on the true set of predictors. These predictors have nonzero coefficients. All other coefficients are zero. Here and the true predictors are always a subset of the predictors. We focus on cases where is large or is getting large in order to be consistent with high-dimensional frameworks.
3.1 Orthogonal Predictors
Define the true model to be
for , where . We compare and in order to study the impact of varying the signal-to-noise ratio (SNR). We refer to these cases as “High SNR” and “Low SNR”, respectively.
The columns of are trigonometric predictors defined by
for and . The columns of are orthogonal under this design and the true model is always included among the candidate models.
By the definition of the optimal loss, the oracle inequalities in the literature also apply to . In what follows, we compare the empirical performance of the optimal loss to two established bounds. First, by applying Corollary 6.2 in Buhlmann and van de Geer (2011),
with probability greater than , where
is a constant that satisfies a compatibility condition. This condition places a restriction on the minimum eigenvalue offor a restricted set of coefficients and it’s sufficient to take for an orthogonal design matrix. Second, by Theorem 6.2 in Bickel et al. (2010),
with probability at least , where is a constant tied to a restricted eigenvalue assumption. For orthogonal predictors, . In the simulations, and are both set so that the bounds hold with at least 95 percent probability. Since these bounds also depend on , we study if the deterioration in optimal loss is adequately predicted by these bounds. In other words, is the price that we pay (roughly) equal to ?
Figure 1 compares the medians of optimal loss ratios over 1000 realizations to the ratio suggested by the loss bound for varying values of . In both plots, the bottom two lines are the loss ratios implied by assuming that the bound equals the optimal loss, whereas the top line is the observed median optimal loss ratio. Comparing the two plots, the deterioration is worse in the High SNR case. This is consistent with our theoretical results, which established that we are more likely to observe deterioration when the SNR is high. When the SNR is low, it is more likely that the optimal loss will equal the loss for , where is equal to the value of that sets all of the estimated coefficients equal to zero. When this is the case, no further deterioration can occur when adding more superfluous variables.
Clearly the deterioration is far worse than is suggested by the bounds for both choices of the SNR. For example, if we include predictors in the High SNR case, then the loss bounds suggest we should be about 50 percent worse off than if we knew the true set of predictors, but in actuality we are typically more than 350 percent worse off. This discrepancy is a consequence of the fact that the bounds are inequalities rather than equalities.
To emphasize the danger of relying only on bounds, Figure 2 plots the ratio of the bounds to the median optimal loss for varying values of . This plot suggests that the bounds are overly conservative when compared to the optimal loss and the degree of conservatism depends on both and the signal-to-noise ratio. As a result of this behavior, the deterioration in optimal loss can be much worse than .
The results presented thus far suggest that the performance of the Lasso deteriorates for fixed as varies. In order to investigate its behavior when varies, we compare against and define the optimal loss ratio to be
Under this set-up, increases as increases, which is consistent with the standard settings in high-dimensional data analysis. Figure 3 compares the mean optimal loss ratios over 1000 realizations to the optimal loss ratio predicted by the bounds. These plots suggest that the deterioration persists as increases, and that the bounds under-predict the observed deterioration. Since the slopes with respect to are higher than the bounds imply, this further suggests that the deterioration gets worse for larger samples.
3.2 Independent Predictors
Here we again assume that is generated from the model given by (10) except in this section the columns of are independent standard normal random variables. This allows us to consider situations where . This matrix is simulated once and used for all realizations. We consider both a high and low SNR setting by taking and , respectively.
Figure 4 compares the medians of the optimal loss ratios as defined in (9) over 1000 realizations to the optimal loss ratio suggested by the bounds (11) and (12). We vary from six to 1000. In both plots, the bottom two lines are the optimal loss ratios predicted by the bounds, whereas the top line is the observed median optimal loss ratios. As in the orthonormal design case, this plot shows that the bounds do not adequately measure the deterioration in performance, and that the optimal performance of the Lasso is sensitive to the number of predictor variables. Also, we again see that less deterioration occurs in the Low SNR case, where it is more likely that the optimal choice for is equal to .
4 Real Data Analysis
In numerous applications it is desirable to model higher-order interactions; however, the inclusion of such interactions can greatly increase the computational burden of a regression analysis. The Lasso provides a computationally feasible solution to this problem.
As an example of this, Bien et al. (2013) recently used the Lasso to investigate the inclusion of all pairwise interactions in the analysis of six HIV-1 drug datasets. The goal of this analysis was to understand the impact of mutation sites on antiretroviral drug resistance. These datasets were originally studied by Rhee et al. (2006) and include a measure of (log) susceptibility for different combinations of mutation sites for each of six nucleoside reverse transcriptase inhibitors. The number of samples () and the number of mutation sites () for each dataset are listed in Table 2.
In their analysis, Bien et al. (2013) compared the performance of the Lasso with only main effects included in the set of predictors (MEL) to its performance with main effects and all pairwise interactions included (APL). Although not the focus of their analysis, we show here that this application demonstrates the sensitivity of the procedure to the number of predictor variables, which can result in deteriorating performance in the absence of strong interaction effects.
Since the true data-generating mechanism is unknown, we cannot compute the optimal loss ratios for this example. As an alternative, to measure deterioration we randomly split the data into a training- and test-set. We then fit the Lasso using the training-set and evaluate the predictive performance on the test-set by computing the mean square error (MSE). It is important to note that both the numerator and denominator in the MSE ratio include the true error variance, so the loss in estimation precision will be less apparent. To exemplify this, Appendix B studies the optimal MSE ratio in the context of the independent predictors example given in Section 3.2.
Figure 5 plots the ratios of the minimum test-set MSE obtained using the APL to that obtained using the MEL based on 20 random splits of the data for each of the six drugs.
For the ‘3TC’ drug, the inclusion of all pairwise interactions results in a dramatic improvement in performance. In particular, there are five interactions that are included in all twenty of the selected models: ‘p62:p69’,‘p65:p184’, ‘p67:p184’, ‘p184:p215’, and ‘p184:p190’. This suggests that there is a strong interaction effect in this example, and that the interactions between these molecular targets are useful for the predicting drug susceptibility.
On the other hand, in four of the five remaining drugs - ‘ABC’, ‘D4T’, ‘DDI’, and ‘TDF’ - the inclusion of all pairwise interactions results in a significant deterioration in performance. Here significance is determined using a Wilcoxon signed-rank test performed at a 0.05 significance level. Thus, although the MEL is a restricted version of the APL, we still observe deterioration in the best-case predictive performance. This suggests that although the Lasso allows the modeling of higher-order interactions, their inclusion should be done with care as doing so can hurt overall performance.
The Lasso allows the fitting of regression models with a large number of predictor variables, but the resulting cost can be much higher than the commonly-held views in the literature would suggest. We proved that when tuned optimally for prediction the performance of the Lasso deteriorates as the number of predictor variables increases with probability arbitrarily close to one under the assumptions of a sparse true model and an orthonormal deterministic design matrix. Our empirical results further suggest that this deterioration persists as the sample size increases, and carries over to more general contexts.
In light of this deterioration, data analysts should be careful when using the Lasso as a variable selection and estimation tool with high-dimensional data sets. One possible modification to the procedure is to use the Lasso as a subset selector, but not as an estimation procedure. An implementation of this is the extreme version of the Relaxed Lasso (Meinshausen, 2007), which fits least squares regressions to the Lasso selected subsets. Returning to the orthogonal predictors example in Section 3.1, we investigate the performance of this simple two-step procedure. Figure 6 plots the median optimal loss for the Lasso and the median optimal loss for the two-stage procedure for varying values of . In this example, the two-stage procedure improves performance when the SNR is high, but not when the SNR is low. However, the improvement in performance in the high SNR case is more than the worsening of performance in the low SNR case. These preliminary results suggest that a two-stage procedure can help improve performance when the SNR is sufficiently high.
Another possible solution is to screen the predictor variables before fitting the Lasso penalized regression. In screening, the typical goal is to reduce from a huge scale to something that is (Fan and Lv, 2008). However, our results suggest that it is not enough to merely reduce the number of predictors, which implies that how to optimally tune the number of screened predictors is an interesting model selection problem. Investigating proper screening techniques and two-stage procedures are both interesting areas for future research.
Appendix A Technical Results
[Lemma 2.1] If , then for any ,
[Lemma 2.2] Without loss of generality assume that , and therefore . Consider
First consider . Since is a convex function for , the minimum occurs at a place where the derivative is zero or when . Taking the derivative with respect to ,
Since the derivative is an increasing function of , a minimum occurs at if the derivative is non-negative at that point. In other words, a minimum occurs at if . Otherwise, a minimum occurs at a point where the derivative is zero. Thus
Next, for , . Since for all , it follows that
[Lemma 2.3] Without loss of generality assume that , and therefore . Also assume that . Consider
First consider . Since is a continuous differentiable function for , local extrema occur at points where the derivative is zero or at a boundary point. Taking the derivative with respect to ,
Since the derivative is a strictly increasing function of , a minimum occurs at if the derivative is non-negative at that point. Hence a minimum occurs at if . Otherwise a minimum occurs at a point where the derivative is zero. It follows that
Next, for , . Thus
To compare to , first note that . Next, comparing to it is clear that if . However, if , then and
Similarly, if , then either so that
Hence, if and only if .
[Lemma 2.4] Without loss of generality assume that , and therefore . Also assume that . Consider
The derivative of does not exist at . However, by Lemma 2.1, is never globally optimal since
To determine the optimal values of , we consider the intervals , , and separately. Define
First, for , is a continuous differentiable function and
Since the derivative is a strictly increasing function of , a minimum occurs at if the derivative is non-negative at that point. Thus, a minimum occurs at if . Similarly, a local minimum occurs at if
which holds if . Otherwise, a minimum occurs at a point where the derivative is zero. It follows that
Next, for , is a continuous differentiable function and
Since the derivative is negative for all , a local minimum occurs at , thus .
Lastly, for all , . It follows that
By a similar argument to that used in the proof of Lemma 2.3, it follows that .
By Lemma 2.1,
We can evaluate these probabilities explicitly. First consider
are the probability distribution functions (pdf) ofand , respectively, and
is the pdf of the standard normal distribution. Substituting,
[Theorem 2.2] Consider
From the proof of Theorem 2.1,
Appendix B Optimal MSE Ratio
Here we return to the independent predictors example in section 3.2. To study the behavior of the optimal MSE, we evaluate the MSE for each realization on a simulated test set. Figure 7 presents boxplots of the ratios of the estimated optimal MSE with predictors to the estimated optimal MSE with the six true predictors where is taken to be 100, 500, and 1000. A comparison of this figure to the median optimal loss ratios presented in Figure 4 demonstrates that while deterioration is still observed, the optimal MSE ratios can be smaller than the optimal loss ratios due to the presence of the error variance in both the numerator and denominator of the MSE ratio.
- Akaike (1973) Akaike, H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle. In International Symposium on Information Theory, 2nd, Tsahkadsor, Armenian SSR, pp. 267–281.
- Bickel et al. (2010) Bickel, J., Alexandre, and B. Tsybakov (2010). Simultaneous analysis of lasso and dantzig selector. Annals of Statistics 37, 1705–1732.
- Bien et al. (2013) Bien, J., J. Taylor, and R. Tibshirani (2013). A lasso for hierarchical interactions. Annals of Statistics 41, 1111–1141.
- Buhlmann and van de Geer (2011) Buhlmann, P. and S. van de Geer (2011). Statistics for High-Dimensional Data. Springer Series in Statistics.
- Bunea et al. (2007) Bunea, F., A. Tsybakov, and M. Wegkamp (2007). Sparsity oracle inequalities for the lasso. Electronic Journal of Statistics 1, 169–194.
- Craven and Wahba (1978) Craven, P. and G. Wahba (1978, December). Smoothing Noisy Data with Spline Functions. Numerische Mathematik 31, 377–403.
- Efron et al. (2004) Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. Annals of Statistics 32, 407–499.
- Fan and Li (2001) Fan, J. and R. Li (2001, December). Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties. Journal of the American Statistical Association 96, 1348–1360.
- Fan and Lv (2008) Fan, J. and J. Lv (2008). Sure Independence Screening for Ultrahigh Dimensional Feature Space with Discussion. J.R. Statist. Soc. B 70, 849–911.
- Flynn et al. (2013) Flynn, C. J., C. M. Hurvich, and J. S. Simonoff (2013). Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models. Journal of the American Statistical Association 108, 1031–1043.
Greenshtein, E. (2006).
Best Subset Selection, Persistence in High-Dimensional Statistical Learning and Optimization UnderConstraint. Annals of Statistics 34, 2367–2386.
- Greenshtein and Ritov (2004) Greenshtein, E. and Y. Ritov (2004). Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli 10(6), 971–988.
- Hastie and Efron (2011) Hastie, T. and B. Efron (2011). lars: Least Angle Regression, Lasso and Forward Stagewise. R package version 0.9-8.
- Hurvich and Tsai (1989) Hurvich, C. M. and C.-L. Tsai (1989). Regression and Time Series Model Selection in Small Samples. Biometrika 76, 297–307.
- Leng et al. (2006) Leng, C., Y. Lin, and G. Wahba (2006). A Note on the Lasso and Related Procedures in Model Selection. Statistica Sinica 16, 1273–1284.
- Meinshausen (2007) Meinshausen, N. (2007). Relaxed lasso. Computational Statistics and Data Analysis 52, 374–393.
- Negahban et al. (2009) Negahban, S., P. Ravikumar, M. Wainwright, and B. Yu (2009). A unified framework for high-dimensional analysis of -estimators with decomposable regularizers. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta (Eds.), Advances in Neural Information Processing Systems 22, pp. 1348–1356.
- Rhee et al. (2006) Rhee, S.-Y., J. Taylor, G. Wadhera, A. Ben-Hur, and D. L. Brutlag (2006). Genotypic predictors of human immunodeficiency virus type 1 drug resistance. Proceedings of the National Academy of Sciences USA 103, 17355–17360.
- Schwarz (1978) Schwarz, G. (1978, March). Estimating the Dimension of a Model. The Annals of Statistics 6, 461–464.
- Tibshirani (1996) Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society B 58, 267–288.
et al. (2007)
Zou, H., T. Hastie, and R. Tibshirani (2007).
On the “Degrees of Freedom” of the Lasso.The Annals of Statistics 35, 2173–2192.