Introduction
In recent years, statistical tools that can deal with high-dimensional data and models have become pivotal in many areas of science and engineering. The advent of high-throughput technologies, for example, has transformed biology into a data-driven science that requires mathematical models with many variables. The need to analyze and reduce the complexity of these models has triggered enormous interest in high-dimensional statistical methods that can separate relevant variables from irrelevant ones
[Belloni and Chernozhukov, 2011, Bühlmann and van de Geer, 2011, Hastie et al., 2001]. Among the many existing methods, Lasso [Tibshirani, 1996] and Square-Root Lasso (or Scaled Lasso) [Belloni et al., 2011, Owen, 2007, Städler et al., 2010, Sun and Zhang, 2012] have become very popular representatives.

In practice, however, high-dimensional variable selection turns out to be a difficult task. A major shortcoming of Lasso, in particular, is its need for a tuning parameter that is properly adjusted to all aspects of the model [Hebiri and Lederer, 2013] and is therefore difficult to calibrate in practice. Using Cross-Validation to adjust the tuning parameter is not a satisfactory approach to this problem, because Cross-Validation is computationally inefficient and provides unsatisfactory variable selection performance. Replacing Lasso by Square-Root Lasso is also not satisfactory, because Square-Root Lasso resolves only the adjustment of the tuning parameter to the variance of the noise but does not address the adjustment to the tail behavior of the noise and to the design. Similarly, more advanced Lasso-based procedures such as the Uncorrelated Lasso [Chen et al., 2013] or the Trace Lasso [Grave et al., 2011] also comprise tuning parameters that need proper calibration. In conclusion, none of the present approaches simultaneously provides parameter-free, accurate, and computationally attractive variable selection.
Our contribution: In this study, we present a novel approach to high-dimensional variable selection. First, we reveal how a systematic development of the Square-Root Lasso approach leads to TREX, an estimator without any tuning parameter. For optimal variable selection, we then combine TREX with a bootstrapping scheme. Next, we detail the implementation and demonstrate in a thorough numerical study that TREX is both accurate and computationally efficient. Finally, we discuss the findings and indicate directions for subsequent studies.
Methodology
Framework for our study
In this study, we aim at variable selection in linear regression. We therefore consider models of the form

    Y = Xβ* + σε,    (Model)

where Y ∈ ℝⁿ is a response vector, X ∈ ℝⁿˣᵖ a design matrix, σ > 0 a constant, and ε ∈ ℝⁿ a noise vector. We allow in particular for high-dimensional settings, where p rivals or exceeds n, and for undisclosed distributions of the noise ε. Statistical methods for models of the above form typically target β* (estimation), the support of β* (variable selection), Xβ* (prediction), or σ (variance estimation). In this study, we focus on variable selection.

To ease the exposition of the sequel, we append some conventions and notation: We allow for fixed and for random design matrices X but assume in either case the normalization ‖Xⱼ‖₂² = n for all columns j ∈ {1, …, p}. Moreover, we assume that the distribution of the noise vector ε has unit variance, so that σ is the standard deviation of the entire noise σε. Finally, we denote the support (the index set of the nonzero entries) of a vector β by supp(β), and the ℓ₁ norm and the maximum norm of β by ‖β‖₁ and ‖β‖∞, respectively.

TREX and B-TREX
We now introduce two novel estimators for high-dimensional linear regression: TREX and B-TREX. To motivate these estimators, let us first detail the calibration of Lasso. Recall that for a fixed tuning parameter λ > 0, Lasso is a minimizer of a least-squares criterion with ℓ₁ penalty:

    β̂_λ ∈ argmin_β { ‖Y − Xβ‖₂² + 2λ‖β‖₁ }.    (Lasso)

The tuning parameter λ determines the intensity of the regularization and is therefore highly influential; it is well understood that a reasonable choice is of the order λ ∼ σ‖Xᵀε‖∞.
For example, this becomes apparent when looking at the following prediction bound for Lasso (cf. [Koltchinskii et al., 2011, Rigollet and Tsybakov, 2011]; see also [Dalalyan et al., 2014] for an overview of Lasso prediction).
Lemma 1.
If λ ≥ σ‖Xᵀε‖∞, it holds that ‖X(β̂_λ − β*)‖₂² ≤ 4λ‖β*‖₁.
This suggests a tuning parameter λ that is small (since the bound is proportional to λ) but not too small (to satisfy the condition λ ≥ σ‖Xᵀε‖∞). In practice, however, the corresponding calibration is very difficult, because it needs to incorporate several, often unknown, aspects of the model:

(a) the design matrix X;

(b) the standard deviation σ of the noise;

(c) the tail behavior of the noise vector ε.
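The bound of Lemma 1, in the form used here (Lasso minimizing ‖Y − Xβ‖₂² + 2λ‖β‖₁, oracle choice λ = σ‖Xᵀε‖∞, and bound 4λ‖β*‖₁), can be checked in a small simulation. The ISTA solver below is a minimal stand-in for the solvers used later in the paper, not the paper's implementation:

```python
import numpy as np

def lasso_ista(X, Y, lam, n_iter=2000):
    """Minimize ||Y - X b||_2^2 + 2*lam*||b||_1 by proximal gradient (ISTA)."""
    L = 2 * np.linalg.eigvalsh(X.T @ X).max()   # Lipschitz constant of the gradient
    t = 1.0 / L
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = -2 * X.T @ (Y - X @ b)              # gradient of the least-squares term
        z = b - t * g
        b = np.sign(z) * np.maximum(np.abs(z) - 2 * t * lam, 0.0)  # soft-thresholding
    return b

rng = np.random.default_rng(0)
n, p, sigma = 50, 20, 1.0
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:3] = [2.0, -1.5, 1.0]                # sparse true regression vector
eps = rng.standard_normal(n)
Y = X @ beta_star + sigma * eps

lam = sigma * np.max(np.abs(X.T @ eps))         # oracle choice of the tuning parameter
beta_hat = lasso_ista(X, Y, lam)
pred_err = np.sum((X @ (beta_hat - beta_star)) ** 2)
bound = 4 * lam * np.sum(np.abs(beta_star))     # Lemma 1 guarantees pred_err <= bound
```

In practice, of course, ε is unobserved, which is exactly why this oracle choice of λ is unavailable and calibration is needed.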
While one line of research approaches (a) and describes the calibration of Lasso to the design matrix [van de Geer and Lederer, 2013, Hebiri and Lederer, 2013, Dalalyan et al., 2014], Square-Root Lasso approaches (b) and eliminates the calibration to the standard deviation of the noise. To elucidate the latter approach, we first recall that for a fixed tuning parameter λ > 0, Square-Root Lasso is defined similarly to Lasso:

    β̃_λ ∈ argmin_β { ‖Y − Xβ‖₂ + λ‖β‖₁ }.    (Square-Root Lasso)

Square-Root Lasso also requires a tuning parameter λ to determine the intensity of the regularization. However, the tuning parameter should here be of the order λ ∼ ‖Xᵀε‖∞ / ‖ε‖₂ (see, for example, [Belloni et al., 2011]), so that Square-Root Lasso does not require a calibration to the standard deviation of the noise. The origin of this feature can be readily located: Reformulating the definition of Square-Root Lasso as

    β̃_λ ∈ argmin_β { ‖Y − Xβ‖₂² / ‖Y − Xβ‖₂ + λ‖β‖₁ }

identifies the factor ‖Y − Xβ‖₂ in the denominator of the first term as the distinction to Lasso. This additional factor acts as an inherent estimator of the standard deviation of the noise (up to the factor √n) and therefore makes the calibration to σ obsolete. On the other hand, Square-Root Lasso still contains a tuning parameter that needs to be adjusted to (a) the design matrix and (c) the tail behavior of the noise vector.
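The scale-estimation property of the residual norm can be illustrated in a two-line simulation (a sketch with unit-variance noise, as in the framework above; the value of σ is a placeholder):

```python
import numpy as np

# At beta = beta*, the residual Y - X @ beta equals sigma * eps, so its
# Euclidean norm estimates sqrt(n) * sigma without knowing sigma in advance.
rng = np.random.default_rng(1)
n, sigma = 10_000, 2.5
eps = rng.standard_normal(n)                     # unit-variance noise
sigma_hat = np.linalg.norm(sigma * eps) / np.sqrt(n)
```

For large n, sigma_hat concentrates tightly around the true σ, which is what renders the explicit calibration to σ unnecessary.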
We now develop the Square-Root Lasso approach further to address all aspects (a), (b), and (c). For this, we aim at incorporating an inherent estimation not of σ but rather of the entire quantity of interest σ‖Xᵀε‖∞. Note that if β̄ is a consistent estimator of β*, then ‖Xᵀ(Y − Xβ̄)‖∞ is a consistent estimator of σ‖Xᵀε‖∞. In this spirit, we define TREX¹ according to

    β̂ ∈ argmin_β { ‖Y − Xβ‖₂² / ( (1/2) ‖Xᵀ(Y − Xβ)‖∞ ) + ‖β‖₁ }.    (TREX)

¹We call this new approach TREX to emphasize that it aims at Tuning-free Regression that adapts to the Entire noise σε and the design matrix X.
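As a concrete illustration, the TREX criterion can be written in a few lines; the function below is only a sketch of the objective (with the constant 1/2 exposed as an argument `c`), not a solver:

```python
import numpy as np

def trex_objective(beta, X, Y, c=0.5):
    """TREX criterion: ||Y - X b||_2^2 / (c * ||X'(Y - X b)||_inf) + ||b||_1."""
    r = Y - X @ beta                      # residual
    denom = c * np.max(np.abs(X.T @ r))   # inherent estimate of sigma*||X' eps||_inf
    return np.sum(r ** 2) / denom + np.sum(np.abs(beta))

# Tiny worked example: X = I_2, Y = (2, 0), beta = 0:
# residual (2, 0), squared norm 4, denominator 0.5 * 2 = 1, penalty 0.
X = np.eye(2)
Y = np.array([2.0, 0.0])
val = trex_objective(np.zeros(2), X, Y)   # = 4.0
```

Note that no tuning parameter appears anywhere: the denominator replaces the external calibration of λ.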
Square-Root Lasso and Lasso are equivalent families of estimators (there is a one-to-one mapping between the tuning parameter paths of Square-Root Lasso and Lasso); in contrast, TREX is a single, tuning-free estimator, and its solution is in general not on the tuning parameter paths of Lasso and Square-Root Lasso. However, we can establish an interesting relationship between these paths and TREX (we omit all proofs for the sake of brevity):
Theorem 1.
It holds that
In view of the Karush-Kuhn-Tucker conditions for Lasso, the latter formulation strongly resembles the Lasso path. This resemblance is no surprise: In fact, any consistent estimator of β* is related to a Lasso solution with an optimal (but in practice unknown) tuning parameter via the formulation of TREX:
Lemma 2.
Assume that β̄ is a consistent estimator of β*. Then, β̄ is close to a Lasso solution with a corresponding (in practice unknown) tuning parameter.
Equipped with TREX to estimate the regression vector β*, we can tackle a broad spectrum of tasks including estimation, prediction, and variance estimation. In this paper, however, we focus on variable selection. For this task, we advocate an additional refinement based on sequential bootstrapping [Rao et al., 1997]. More specifically, for a fixed number of bootstrap samples B, we advocate B-TREX:
B-TREX is the majority vote over the TREX solutions for B sequential bootstrap samples: a variable is selected if it belongs to the support of more than half of the bootstrapped TREX solutions. Note that related bootstrapping schemes (based, however, on traditional bootstrapping and different selection rules) have already been applied to Lasso [Bach, 2008, Bunea et al., 2011]. In practice, it can also be illustrative to report the selection frequencies of each parameter over the bootstrap samples (cf. Figure 3). We finally note that B-TREX readily provides estimation and prediction if a least-squares refitting on the selected set is performed. This refitting can improve the prediction and estimation accuracy if the selected set is a good estimator of the true support of β* [Belloni and Chernozhukov, 2013, Lederer, 2013].
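The majority-vote scheme can be sketched as follows. The solver `fit` is a generic placeholder for any TREX solver, and plain resampling with replacement is used here for simplicity, whereas the paper employs the sequential bootstrap of Rao et al.:

```python
import numpy as np

def btrex_select(X, Y, fit, B=30, rng=None):
    """Majority vote over the supports of `fit` on B bootstrap samples.

    `fit(X, Y)` is any solver returning a coefficient vector (e.g., a TREX
    solver).  Returns the majority-vote support and the selection frequencies.
    """
    rng = rng or np.random.default_rng()
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)           # bootstrap sample (with replacement)
        counts += fit(X[idx], Y[idx]) != 0          # record the selected support
    return np.flatnonzero(counts > B / 2), counts / B

# Dummy solver that always selects variables 0 and 2 (illustration only):
fit = lambda X, Y: np.array([1.0, 0.0, 2.0])
X, Y = np.zeros((5, 3)), np.zeros(5)
selected, freq = btrex_select(X, Y, fit, B=10)      # selected = [0, 2]
```

The returned frequencies correspond to the per-variable selection frequencies that can be reported alongside the majority vote.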
We point out that the norms ‖·‖∞ and ‖·‖₁ in the formulation of TREX are dual and that extensions to other pairs of dual norms are straightforward.
A theoretical analysis of TREX is beyond the scope of this paper but is the subject of a forthcoming theory paper (with different authors). Note also that theoretical results for standard variable selection methods are incomplete: in particular, there are currently no finite-sample guarantees for approaches based on Lasso and Square-Root Lasso. Finite-sample bounds (“oracle inequalities”) for Lasso [Bühlmann and van de Geer, 2011] and Square-Root Lasso [Bunea et al., 2014] require that the tuning parameters be properly calibrated to the model; yet, there are no guarantees that standard calibration schemes such as Cross-Validation or BIC-type criteria provide such tuning parameters.
Implementation of TREX
To compute TREX, we consider the objective function

    f(β) := ‖Y − Xβ‖₂² / ( (1/2) ‖Xᵀ(Y − Xβ)‖∞ ) + ‖β‖₁,

which comprises the data-fitting term (the first summand) and the regularization term ‖β‖₁. To make this objective function amenable to standard algorithms [Nesterov, 2007, Schmidt, 2010], we invoke a smooth approximation of the data-fitting term. For this, we note that for all vectors u and positive integers q, it holds that

    ‖u‖∞ = lim_{q→∞} ‖u‖_{2q},

and the data-fitting term can therefore be approximated by the smooth data-fitting term obtained by replacing the maximum norm with the ℓ_{2q} norm for a finite q. We find that moderate values of q work well in practice (see supplementary material). The gradient of this smooth approximation can be computed in closed form.
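The quality of the smooth ℓ_{2q} surrogate for the maximum norm is easy to inspect numerically (a sketch; the factoring out of the maximum is a standard numerical-stability device, not taken from the paper):

```python
import numpy as np

def lq_norm(u, q):
    """Smooth surrogate ||u||_{2q} = (sum_i |u_i|^(2q))^(1/(2q)) for ||u||_inf."""
    m = np.max(np.abs(u))                  # factor out the maximum for stability
    return m * np.sum((np.abs(u) / m) ** (2 * q)) ** (1.0 / (2 * q))

u = np.array([1.0, -2.0, 3.0])
approx = lq_norm(u, q=16)   # close to (and slightly above) ||u||_inf = 3
exact = np.max(np.abs(u))
```

Even for moderate q, the surrogate is within a small relative error of the maximum norm while being differentiable everywhere away from zero.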
The approximation of the criterion is now amenable to effective (local) optimization with projected scaled subgradient (PSS) algorithms [Schmidt, 2010]. PSS schemes are specifically tailored to objective functions with smooth, possibly nonconvex data-fitting terms and ℓ₁ regularization terms. PSS algorithms require only zeroth- and first-order information about the objective function, have linear time and space complexity per iteration, and are especially effective for problems with sparse solutions. Several PSS algorithms that fit our framework are described in [Schmidt, 2010, Chapter 2.3.1]²; among these, the Gafni-Bertsekas variant was particularly effective for our purposes.

²http://www.di.ens.fr/%7Emschmidt/Software/L1General.html provides the implementations.
The smooth formulation of the TREX criterion remains nonconvex; therefore, convergence to the global minimum cannot be guaranteed. Nevertheless, we show that the above implementation is fast and scalable and provides estimates with excellent statistical performance.
Note also that the advent of novel optimization procedures [Breheny and Huang, 2011, Mazumder et al., 2011] has led to an increasing popularity of nonconvex regularization terms such as the Smoothly Clipped Absolute Deviation (SCAD) [Fan and Li, 2001] and the Minimax Concave Penalty (MCP) [Zhang, 2010]. More recently, objective functions with nonconvex data-fitting terms have also been shown to be both statistically valuable and efficiently computable [Loh and Wainwright, 2013, Nesterov, 2007, Wang et al., 2013].
Numerical Examples
We demonstrate the performance of TREX and B-TREX on three numerical examples. We first consider a synthetic example inspired by [Belloni et al., 2011]. We then consider two high-dimensional biological data sets that involve riboflavin production in B. subtilis [Bühlmann et al., 2014] and mass spectrometry data from melanoma patients [Mian et al., 2005].
We perform the numerical computations in MATLAB 2012b on a standard MacBook Pro with a dual-core 2 GHz Intel Core i7 and 4 GB of 1333 MHz DDR3 memory. To compute Lasso and its cross-validated version, we use the MATLAB-internal procedure lasso.m (with standard values), which follows the popular glmnet R code. To compute TREX, we use Schmidt's PSS algorithm implemented in L1General2_PSSgb.m to optimize the approximate TREX objective function. We use the PSS algorithm with standard parameter settings and set the initial solution to the parsimonious all-zeros vector. We use the following PSS stopping criteria: optimality tolerance optTol = 1e-7, progress tolerance progTol = 1e-9, and a maximum number of iterations. For the number of bootstrap samples B in B-TREX, we use a fixed standard value.
Synthetic Example
We first evaluate the scalability and the variable selection performance of TREX and B-TREX on synthetic data. The method of comparison is Lasso with the tuning parameter that leads to the minimal cross-validated mean squared error (Lasso-CV). We generate data according to the linear regression model (Model) with parameters inspired by the Monte Carlo simulations in [Belloni et al., 2011]: We fix the sample size n and the number of variables p (or vary p over a grid) and set the true regression vector β* to a sparse vector; we sample standard normal errors and multiply them by a fixed standard deviation σ; and we sample the rows of X from the p-dimensional normal distribution N(0, Σ), where Σ is the covariance matrix with diagonal entries 1 and off-diagonal entries equal to a fixed correlation κ, and then normalize them to a fixed Euclidean norm. We report scalability and variable selection results averaged over the repetitions (thick, colored bars) and the corresponding standard deviations (thin, black bars). More precisely, we report the runtime of plain Lasso and of TREX as a function of p in Figure 1, and we report the runtime and the variable selection performance of Lasso-CV, TREX, and B-TREX in Hamming distance for fixed p in Figure 2.

The data shown in Figure 1 suggest that the runtime of TREX is between quadratic and cubic in p (as confirmed by a least-squares fit of the exponent) and thus illustrate the scalability of TREX. In comparison, the runtime for a single Lasso path (without Cross-Validation or any other calibration scheme), also shown in Figure 1, reveals a near-linear dependence on p, though with a higher offset and slope.
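The data-generating scheme can be sketched as follows; the concrete values of n, p, the sparsity s, the correlation κ, and σ below are placeholders (the paper's exact settings follow Belloni et al. but are not reproduced here), and the column normalization is one common convention:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, s, kappa, sigma = 100, 500, 5, 0.5, 1.0   # placeholder values, not the paper's

# Equicorrelated covariance: 1 on the diagonal, kappa off the diagonal.
Sigma = np.full((p, p), kappa) + (1 - kappa) * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)     # normalize columns to norm sqrt(n)

beta_star = np.zeros(p)
beta_star[:s] = 1.0                             # sparse true regression vector
Y = X @ beta_star + sigma * rng.standard_normal(n)
```

Varying p on a grid with this generator reproduces the kind of runtime-versus-dimension experiment reported in Figure 1.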
Figure 2 summarizes the numerical results for the settings with fixed p. The runtimes disclosed in Figure 2 indicate that both TREX and B-TREX can rival Lasso-CV in terms of speed. The variable selection results show that TREX and B-TREX provide near-perfect variable selection at low noise levels; for stronger noise, the Hamming distance of these two estimators to β* increases. Lasso-CV, on the other hand, consistently selects too many variables. For the largest noise levels (see supplementary material), the performance of TREX deteriorates as compared to Lasso-CV. B-TREX, on the other hand, provides excellent variable selection for all considered parameter settings. In summary, the numerical results for the standard synthetic example considered here provide first evidence that TREX and B-TREX can outmatch Lasso-CV in terms of variable selection.
Riboflavin Production in B. Subtilis
We next consider a recently published high-dimensional biological data set for the production of riboflavin (vitamin B2) in B. subtilis (Bacillus subtilis) [Bühlmann et al., 2014]. The data set comprises gene expression profiles of different B. subtilis strains for a number of experiments with varying settings. The corresponding expression profiles are stored in the design matrix X. Along with these expression profiles, the associated standardized riboflavin log-production rates Y have been measured. The main objective is now to identify a small set of genes that is highly predictive for the riboflavin production rate.
We first report the outcomes of standard Lasso-based approaches, which can be obtained along the lines of [Bühlmann et al., 2014]. The runtime for the computation of a single Lasso path with the MATLAB routine is approximately 58 seconds. Lasso-CV selects 38 genes, that is, its solution has 38 nonzero coefficients; the 20 genes with the largest coefficients and the associated coefficient values are listed in Table 1. For variable selection, Bühlmann et al. specifically propose stability selection [Meinshausen and Bühlmann, 2010]. The standard stability selection approach is based on Lasso computations on random subsamples of the data and on the coefficients that enter the corresponding Lasso paths first. This approach yields three genes: LYSC_at, YOAB_at, and YXLD_at [Bühlmann et al., 2014].
We next apply TREX and B-TREX. The runtime for a single TREX computation is approximately 30 seconds. TREX selects 20 genes and therefore provides a considerably sparser solution than Lasso-CV; the corresponding genes and the associated coefficients are listed in Table 1. B-TREX with the standard majority vote selects three genes: YXLE_at, YOAB_at, and YXLD_at. The outcomes of B-TREX with selection rules different from the majority vote can be deduced from Table 1, where we list the selection frequencies of the 20 genes that are selected most frequently across the bootstraps.
The numerical results reveal three key insights: First, the set of genes selected by TREX and the set of the 20 genes corresponding to the highest coefficients in the Lasso-CV solution are distinct but share a common subset of genes. Second, the sets of genes selected by B-TREX and by Lasso-CV stability selection have the two top-ranked Lasso-CV and TREX genes in common. On the other hand, the gene associated with the highest frequency in the B-TREX solution is not selected by stability selection. The B-TREX solution is nevertheless biologically plausible: since the genes YXLD_at and YXLE_at are located in the same operon, both genes are likely to be co-expressed and involved in similar cellular functions. Third, the runtime for a single Lasso path is about two times larger than for a single TREX solution.
The model complexities differ considerably, ranging from 4 parameters for B-TREX to 39 for Lasso-CV, and in applications, simple models are often preferred. We therefore evaluate the Leave-One-Out Cross-Validation errors (LOOCV errors) of the methods under consideration for fixed numbers of parameters. As a reference, we report the LOOCV error of Lasso-CV (with the cross-validations performed on the training sets) in the first row of Table 2. In the three subsequent rows, we then show the LOOCV errors of TREX, of TREX with least-squares refitting (TREX-LS), and of Lasso with the tuning parameter chosen such that the number of nonzero entries equals that of TREX (Lasso-T). Finally, we give the LOOCV errors of B-TREX and of Lasso with the tuning parameter chosen such that the number of nonzero entries equals that of B-TREX (Lasso-BT). The computations for stability selection are very intensive and therefore omitted. We observe that for fixed model complexity, the solutions of TREX (with least-squares refitting) and B-TREX have lower LOOCV errors than their Lasso-based counterparts.
We conclude that the genes selected by B-TREX are commensurate with biological knowledge and that B-TREX can provide small models with good predictive performance.
Classification of Melanoma Patients
We also demonstrate the usefulness of the ranked B-TREX list for a proteomics data set from a study on melanoma patients [Mian et al., 2005]. The data³ consist of mass spectrometry scans of serum samples from patients with Stage I melanoma (moderately severe) and patients with Stage IV melanoma (very severe). Each scan measures the intensities for a large number of mass over charge (m/Z) values. The objective is to find m/Z values that are indicators for the stage of the disease, eventually leading to proteins that can serve as discriminative biomarkers [Mian et al., 2005].

³See http://www.maths.nottingham.ac.uk/%7Eild/massspec
We want to compare the outcomes of our estimators with the results described in [Vasiliu et al., 2014]. For this, we use the same linear regression framework (even though one could also argue in favor of a logistic regression framework) and the same data preprocessing: We apply an initial peak filtering step that retains the most relevant m/Z values. The resulting data are then normalized and stored in the matrix X. Next, the class labels in Y are set to one fixed value for Stage I patients and to another for Stage IV patients.

We now demonstrate the usefulness of the ranked list of predictors provided by B-TREX. For this, we first report in Figure 4 the parameter values of the least-squares refitted versions of the three estimators cross-validated Lasso (Lasso-CV), TREX, and B-TREX; the numbers of predictors selected by each method are reported in Figure 4 as well. We then use the signs of the (least-squares refitted) responses to estimate the class labels, cf. [Vasiliu et al., 2014]. We depict in Figure 4 the averaged cross-validated classification errors of Sure Independence Screening (SIS), Iterative SIS (ISIS), Elastic Net, and Penalized Euclidean Distance (PED) (all taken from [Vasiliu et al., 2014]) and of TREX, B-TREX, and Lasso-CV. TREX shows almost identical classification error and model complexity as SIS and ISIS and outperforms Elastic Net in terms of model complexity. PED and Lasso-CV have lower classification errors but higher model complexities. B-TREX with the standard majority vote results in a very sparse model with moderate error. More importantly, classification based on the top predictors from B-TREX is insensitive to the threshold: for a wide range of numbers of top-ranked predictors, B-TREX outperforms all other estimators. We conclude that the ranked list of B-TREX predictors can lead to very robust and accurate model selection and, in particular, can outperform all other standard estimators on this data set.
Table 2: LOOCV errors and model complexities on the riboflavin data.

Method      LOOCV error   # of coefficients
Lasso-CV    0.42          39
TREX        0.51          21
TREX-LS     0.45          21
Lasso-T     0.47          21
B-TREX      0.50          4
Lasso-BT    0.62          4
Conclusions
We have introduced TREX, a simple, fast, and accurate method for high-dimensional variable selection. We have shown that TREX avoids tuning parameters and, therefore, challenging calibrations. Moreover, we have shown that TREX can outmatch a cross-validated Lasso in terms of speed and accuracy.
To further improve variable selection, we have proposed B-TREX, a combination of TREX with a bootstrapping scheme. This proposition is supported by the numerical results and is in line with earlier claims that bootstrapping can improve variable selection [Bach, 2008, Bunea et al., 2011]. Moreover, we argue that the solution of B-TREX on the recent riboflavin data set in [Bühlmann et al., 2014] is supported by biological insights. Finally, the results on the melanoma data show that TREX can yield robust classification.
Our contribution therefore suggests that TREX and B-TREX can challenge standard methods such as cross-validated Lasso and can be valuable in a wide range of applications. We will provide further theoretical guarantees, optimized implementations, and tests of prediction and estimation performance in a forthcoming paper. A TREX MATLAB toolbox as well as all presented numerical data will be made publicly available on the authors' websites.
Acknowledgments
We sincerely thank the reviewers for their insightful comments and Jacob Bien, Richard Bonneau, and Irina Gaynanova for the valuable discussions.
References

[Bach, 2008] F. Bach. Bolasso: model consistent Lasso estimation through the bootstrap. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 33–40, 2008.
[Belloni and Chernozhukov, 2011] A. Belloni and V. Chernozhukov. High-dimensional sparse econometric models: an introduction. In P. Alquier, E. Gautier, and G. Stoltz, editors, Inverse Problems and High-Dimensional Estimation, volume 203 of Lect. Notes Stat. Proc. Springer, 2011.
[Belloni and Chernozhukov, 2013] A. Belloni and V. Chernozhukov. Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547, 2013.
[Belloni et al., 2011] A. Belloni, V. Chernozhukov, and L. Wang. Square-root lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4):791–806, 2011.

[Breheny and Huang, 2011] P. Breheny and J. Huang. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Ann. Appl. Stat., 5(1):232–253, 2011.
[Bühlmann and van de Geer, 2011] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics. 2011.
[Bühlmann et al., 2014] P. Bühlmann, M. Kalisch, and L. Meier. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1):255–278, 2014.
 [Bunea et al.2011] F. Bunea, Y. She, H. Ombao, A. Gongvatana, K. Devlin, and R. Cohen. Penalized least squares regression methods and applications to neuroimaging. Neuroimage, 55, 2011.
 [Bunea et al.2014] F. Bunea, J. Lederer, and Y. She. The Group SquareRoot Lasso: Theoretical Properties and Fast Algorithms. IEEE Trans. Inform. Theory, 60(2):1313–1325, 2014.
 [Chen et al.2013] S. Chen, C. Ding, B. Luo, and Y. Xie. Uncorrelated Lasso. In AAAI, 2013.
 [Dalalyan et al.2014] A. Dalalyan, M. Hebiri, and J. Lederer. On the Prediction Performance of the Lasso. preprint, arXiv:1402.1700, 2014.
 [Fan and Li2001] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96:1348–1360, 2001.
 [Grave et al.2011] E. Grave, G. Obozinski, and F. Bach. Trace Lasso: a trace norm regularization for correlated designs. In Advances in Neural Information Processing Systems, pages 2187–2195, 2011.
 [Hastie et al.2001] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: Data mining, inference, and prediction. Springer Series in Statistics. 2001.
 [Hebiri and Lederer2013] M. Hebiri and J. Lederer. How Correlations Influence Lasso Prediction. IEEE Trans. Inform. Theory, 59(3):1846–1854, 2013.
 [Koltchinskii et al.2011] V. Koltchinskii, K. Lounici, and A. Tsybakov. Nuclearnorm penalization and optimal rates for noisy lowrank matrix completion. Ann. Statist., 39(5):2302–2329, 2011.
[Lederer, 2013] J. Lederer. Trust, but verify: benefits and pitfalls of least-squares refitting in high dimensions. preprint, arXiv:1306.0113, 2013.
[Loh and Wainwright, 2013] P.-L. Loh and M. Wainwright. Regularized M-estimators with nonconvexity: statistical and algorithmic theory for local optima. In NIPS, pages 476–484, 2013.
 [Mazumder et al.2011] R. Mazumder, J. Friedman, and T. Hastie. Sparsenet: Coordinate descent with nonconvex penalties. J. Amer. Statist. Assoc., 106(495):1125–1138, 2011.
 [Meinshausen and Bühlmann2010] N. Meinshausen and P. Bühlmann. Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol., 72(4):417–473, 2010.
 [Mian et al.2005] S. Mian, S. Ugurel, E. Parkinson, I. Schlenzka, I. Dryden, L. Lancashire, G. Ball, C. Creaser, R. Rees, and D. Schadendorf. Serum proteomic fingerprinting discriminates between clinical stages and predicts disease progression in melanoma patients. Journal of clinical oncology, 23(22):5088–5093, 2005.
 [Nesterov2007] Y. Nesterov. Gradient methods for minimizing composite objective function. CORE Discussion Papers 2007076, Université catholique de Louvain, 2007.

[Owen, 2007] A. Owen. A robust hybrid of lasso and ridge regression. In Prediction and Discovery, volume 443 of Contemp. Math., pages 59–71. Amer. Math. Soc., 2007.
[Rao et al., 1997] C. Rao, P. Pathak, and V. Koltchinskii. Bootstrap by sequential resampling. J. Statist. Plann. Inference, 64(2):257–281, 1997.
 [Rigollet and Tsybakov2011] P. Rigollet and A. Tsybakov. Exponential Screening and optimal rates of sparse estimation. Ann. Statist., 39(2):731–771, 2011.
[Schmidt, 2010] M. Schmidt. Graphical Model Structure Learning with L1-Regularization. PhD thesis, University of British Columbia, 2010.
[Städler et al., 2010] N. Städler, P. Bühlmann, and S. van de Geer. ℓ1-penalization for mixture regression models. Test, 19(2):209–256, 2010.
 [Sun and Zhang2012] T. Sun and C.H. Zhang. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012.
 [Tibshirani1996] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.
 [van de Geer and Lederer2013] S. van de Geer and J. Lederer. The Lasso, correlated design, and improved oracle inequalities. IMS Collections, 9:303–316, 2013.
 [Vasiliu et al.2014] D. Vasiliu, T. Dey, and I. L. Dryden. Penalized Euclidean Distance Regression. preprint, arxiv:1405.4578, 2014.
 [Wang et al.2013] Z. Wang, H. Liu, and T. Zhang. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. preprint, arXiv/1306.4960, 2013.
 [Zhang2010] C.H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, pages 894–942, 2010.