In a wide range of applications it is now routine to encounter regression problems where the number of features or covariates exceeds the sample size
, often greatly. Even in the simple case of linear models with independent Gaussian noise, estimation is nontrivial and requires specific assumptions. A common and often appropriate assumption is that of sparsity, where only a subset of the variables (the active set) have non-zero coefficients, with the numberof such active variables usually assumed much smaller than .
Penalized methods augment the regression log-likelihood with a penalty term that encodes a structural assumption such as sparsity. Recent years have seen much progress in theory and methodology for penalized regression (see Bühlmann and van de Geer, 2011, for a lucid account). However, while the theoretical developments have been remarkable and insightful, they cannot go quite as far as telling the user which method to use in a given finite-sample setting. Meanwhile, rapid methodological progress has meant a wide range of plausible approaches to choose between.
The present study aims to fill this gap via a systematic empirical comparison of a number of penalized regression methods, which could guide users towards selecting methods for specific applications. We consider six popular methods (Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection) and a range of data-generating scenarios. It is obvious that large departures from modeling assumptions can produce poor results. Here our intention is not so much to look at robustness to such departures, but rather to look at variation in performance even in the favorable case where assumptions broadly hold.
In the simulations, we vary a number of factors in a relatively fine-grained manner within an essentially full factorial design (i.e. all combinations of factors). We distinguish between prediction, variable selection and variable ranking. We consider variable ranking in addition to selection due to the fact that in many applications, users are interested in guidance for follow-up studies or data acquisition. Then, highlighting variables in a suitable rank order is particularly important. In addition to the main simulations, we extend some data-generating regimes in specific directions to further explore specific properties of the methods. Furthermore, we also compare methods using semi-synthetic data (real covariates but simulated responses) to study the generalizability of our findings.
Our four main findings are: (i) there is substantial variation in performance between methods, with no unambiguous winner across scenarios (i.e details of the data-generating set-up matter) and this is despite the fact that we focus on a relatively narrow class of models broadly favorable to the methods employed, (ii) relative performance depends on the specific goals, (iii) Lasso is relatively stable in the sense that it performs competitively in many of the scenarios considered here, and (iv) there is evidence of an interesting “phase transition”-like behavior for SCAD, where it goes from being the best performing method to the worst as scenario difficulty increases. We also find that a Ridge-penalty only offers substantial benefits over Lasso in the most challenging scenarios with strong multi-collinearity. In addition, our results and associated simulation code (see the “Code and data availablity” section below) provide a resource, allowing users to compare the methods considered here against each other across many scenarios and also to extend the study with other (existing or novel) methods.
A number of previous papers have examined the empirical performance of penalized regression methods. Meinshausen and Bühlmann (2010) consider large problems from a selection perspective. Bühlmann and Mandozzi (2014) is a more comprehensive study using semi-synthetic data and evaluating screening or ranking properties in high dimensional settings. In contrast to previous work, our design is more comprehensive and systematic. We use finer grids on factors including and signal-to-noise ratio (SNR) so that our results cover a wider range of designs, allowing us to more fully investigate the trends in relative performance. We also consider several multicollinearity parameters, so we can better understand this practically important factor. Furthermore, we evaluate all three of prediction, selection and ranking, using specific performance metrics for each. To limit scope we do not consider Bayesian methods here but note that there have been some interesting empirical comparisons of frequentist and Bayesian methods (including Celeux et al., 2012; Bondell and Reich, 2012).
The remainder of the paper is organized as follows. In Section 2, we outline the methods compared and describe our simulation strategy, including the data-generating factors considered. We also give details of how the methods are implemented and the performance metrics used. Section 3 presents the results from our main simulation study. Results from additional simulations appear in Section 4 and from semi-synthetic data, based on a cancer study, in Section 5. We conclude with a discussion in Section 6.
2.1 Model setting and notation
We focus on the best studied high-dimensional regression setting, namely the sparse linear model with independent Gaussian noise. That is, we consider models of the form
is a vector of responses,a design matrix, a vector of (true) regression coefficients and are the errors. We use to denote the active set with the number of active variables. We focus on the case where and where is small (i.e. a sparse setting). Unless otherwise noted, , where is the -dimensional Gaussian and the identity matrix.
2.2 The methods considered
A general penalized estimate for linear regression takes the following form:
where is a penalty function applied to each component of and is a tuning parameter that controls the amount of penalization. We consider several specific methods, outlined below.
Lasso. The Lasso estimator (Tibshirani, 1996) takes the form given in (2) with an -norm penalty: . This shrinks coefficients towards zero, with some set to exactly zero, and controlling the amount of shrinkage and degree of sparsity. The theoretical properties of the Lasso have been well-studied and an extensive treatment can be found in Bühlmann and van de Geer (2011).
We mention two points. First, that should be larger for consistent variable selection than for consistent prediction. Second, that the prediction-optimal (estimated using e.g. cross-validation) can lead to inclusion of many false positives (Meinshausen and Bühlmann, 2006).
Ridge Regression. Ridge Regression (Hoerl and Kennard, 1970) uses an -norm penalty in (2): . This shrinks coefficients towards zero, but results in non-sparse solutions because it is not singular at the origin. It also has a grouping effect where correlated variables have similar estimates. Note that Ridge Regression is the only method considered here that does not perform variable selection per se.
That is, - and -norm penalties combined with an additional parameter ( and correspond to Lasso and Ridge respectively). This combines some of the benefits of Ridge while giving sparse solutions. In the setting, Lasso can select at most variables, but Elastic Net has no such limitation.
where and . This is a non-convex, quadratic spline function by which small coefficients are shrunk towards zero with a Lasso penalty, while large coefficients are not penalized. The resulting estimator is, unlike Lasso, nearly unbiased for large coefficients. Fan and Li (2001) and Fan et al. (2004) also show that SCAD enjoys an oracle property (assuming some regularity conditions) – it is simultaneously consistent for variable selection and estimation, where the latter is as efficient (asymptotically) as the ideal case when the true model is known in advance. For further details on the properties of SCAD, see Fan and Lv (2010) and references therein.
The Dantzig Selector and the Lasso are closely connected as discussed in Bickel et al. (2009) and under certain conditions on the design matrix, Lasso and Dantzig provide the same solution (Meinshausen et al., 2007; James et al., 2009).
Stability Selection. This is a general approach by which to combine variable selection with data subsampling to obtain more stable selection and control the number of false positives. Specifically, random data subsamples of size are generated by sampling without replacement. Applying a variable selection procedure, with regularization parameter , to these datasets gives a score indicating the frequency with which variable is selected among the iterations. Let
denote the set of considered values for the regularization parameter. Then, a set of “stable variables” is obtained by choosing those variables that have selection probabilities larger than a cutoff valuefor any .
In contrast to the methods described above, Stability Selection does not require setting of the parameter , but instead requires the cutoff to be chosen. Meinshausen and Bühlmann (2010) provide theoretical results showing how can be chosen to achieve a user-specified upper bound on the expected number of false positives , assuming a fixed set of regularization parameters . Alternatively, the user can fix and then the theory shows how should be chosen to achieve the desired upper bound on . In our study we use the Lasso as the variable selection procedure with Stability Selection.
2.3 Simulation set-ups
|Independence and correlation designs||Sample size,||100, 200, 300|
|Dimensionality,||500, 1000, 2000, 4000|
|Sparsity,||10, 20, 40|
|Signal-to-noise ratio, SNR||1, 2, 4|
|Correlation designs only||Pairwise correlation within a block,||0.5, 0.7, 0.9|
|Toeplitz correlation within a block||for , in same block|
|Block size,||10, 100|
|Number of signals per block,||1, 2, 5|
We generate data from model (1). We set to have non-zero entries (all set to 3 except in Section 4.2 where we consider heterogeneous coefficients) with then set to obtain a desired SNR, defined here as . We consider the following three designs:
Independence design. All covariates are i.i.d. standard normal.
Pairwise correlation design. The covariates are partitioned into blocks, each of size . All covariates are standard normal but with correlation between any pair of covariates within the same block set to . Covariates in different blocks are independent of each other. The number of active variables within a block is for the first blocks, with the remaining blocks containing no active variables.
Toeplitz correlation design. As for pairwise correlation, but with covariates and within the same block having correlation . We only consider two active variables per block, ,with their positions, and , within a block chosen such that , to give a correlation of .
We consider the effects of the various factors in a systematic way via 1,863 simulation scenarios, each corresponding to a different configuration. The values considered for each factor are shown in Table 1 and we cover all combinations of the factors with the following exceptions: for Toeplitz correlation design, we consider only with two signals per block (); and for correlation designs we exclude some combinations of and which violate the necessary constraint (see Table 2).
2.4 Method implementation
Tuning parameters are set to reflect the way methods would typically be used by users. For Lasso, Elastic Net, Ridge Regression, SCAD and Dantzig Selector, is set via 10-fold cross-validation (CV). Following Bühlmann and Mandozzi (2014), we implement two versions of Elastic Net with and , referred to as heavy Elastic Net (HENet) and light Elastic Net (LENet) respectively. For SCAD, we set , as recommended by Fan and Li (2001). For Stability Selection, we set the number of iterations to with subsample size and selection probability cutoff (the R package defaults; see below). We do not place any explicit control on the expected number of false positives (i.e. we consider the full range of regularization parameters ). We avoid explicit false positive control as there will be no choice of upper bound that is optimal for all simulation scenarios considered. However, we assess sensitivity to the above choices in Section 4.
We use available R packages to implement the methods: glmnet for Lasso, Elastic Net and Ridge Regression (Friedman et al., 2010); ncvreg for SCAD (Breheny and Huang, 2011); flare for Dantzig Selector (Li et al., 2015); and c060 for Stability Selection (Sill et al., 2014). Covariates are standardized and the response vector is centered. We run all methods on all simulation scenarios with the exception of Dantzig; for correlation designs, Dantzig is run only for and due to its computational demands under multicollinearity for large . For each simulation scenario, we show results averaged across 64 simulated datasets.
2.5 Performance metrics
We distinguish between prediction, variable selection and ranking and use the following metrics.
Prediction. To assess predictive performance we use the root mean squared error (RMSE). For each simulation scenario, we generate training data with sample size and test data with sample size . Models are fitted on training data to obtain coefficient estimates and prediction error, calculated as , where and are the test responses and design matrix respectively. Stability Selection focuses on variable selection and we therefore do not include it in assessment of predictive performance.
Variable selection. For assessment of variable selection, we use true positive rate (TPR) and positive predictive value (PPV):
where TP, FP and FN are the number of true positives, false positives and false negatives respectively. Ridge Regression does not perform variable selection per se and is therefore excluded from this evaluation.
For ranking, we assess performance using the partial area under the receiver operating characteristic curve (pAUC). This is the area under the curve obtained when restricting to a maximum of 50 false positives (). The pAUC calculation requires a score under which to rank variables . For Ridge Regression, we rank by and for Stability Selection by . For the other methods (Lasso, Elastic Net, SCAD and Dantzig Selector), we could use as for Ridge, but due to sparsity this would involve ranking many covariates with . We instead consider the set of estimated active sets where is the set of candidate regularization parameters. We consider a covariate to be more important the longer it remains in as increases and more sparsity is induced. This motivates defining ranking scores as: or if , where .
3 Main results
Due to the large number of simulation regimes, we focus below mainly on the key patterns. All performance data and plotting code are made available on GitHub, allowing specific scenarios to be investigated further (see the “Code and data availablity” section below).
3.1 Independence design
3.1.1 An approximate guide to simulation scenario difficulty
Figure 1 shows the performance metrics versus rescaled sample size , for the independence design with SNR=2. The quantity equals (see Wainwright, 2009) and is motivated by scaling results for consistent Lasso variable selection. Large (small) values of can be interpreted as large (small) sample size relative to dimensionality and sparsity. We observe a clear overall trend of better pAUC (Fig. 1A) and TPR (Fig. 1C) for all methods as increases, with performance leveling off for larger values of . The trend is similar for RMSE as increases, although with more local variation in performance (Fig. 1B). The behavior of PPV is method-dependent and the overall trend is non-monotonic as increases (Fig. 1D). Performance with varying was qualitatively similar for other SNR values and we also observed an expected trend of deteriorating performance with decreasing SNR (see Figs. A1 and A2 in Appendix A for SNR=1 and SNR=4 respectively). Therefore, although the motivation for lies in asymptotic theory for variable selection, and SNR together serve as a useful approximate guide to the difficulty of each simulation scenario for all three tasks (selection, ranking and prediction). We make use of this characterization below.
3.1.2 Key observations
We first present an overview of the results before discussing the individual performance metrics in more detail below. Figures 2, 3 and 4 show ranking, prediction and selection performance respectively for a subset of independence design scenarios, while Figures A4 and A5 plot the performance of pairs of methods against each other across all scenarios.
Key observations for the independence design are:
No overall winner; large differences. For all metrics, there is no one method that consistently performs best across all or the majority of the independence design scenarios. Moreover, relative differences in method performance can be large in some scenarios. For example, in Figure 2B, for , there is a percentage relative decrease in pAUC of between the methods with the highest and lowest scores (SCAD and Ridge Regression respectively). Across all 108 independence design scenarios, the median percentage relative decrease is 25% for pAUC, 15% for RMSE, 30% for TPR and 70% for PPV.
SCAD transition. The performance of SCAD relative to other methods varies substantially across scenarios. SCAD can offer the best performance in “easier” scenarios, but does not retain this advantage as scenario difficulty increases. In particular, for ranking, SCAD undergoes a transition from best to worst performing method (see e.g. black line in Fig. 2F). A similar transition is also seen for prediction (see below).
Stability Selection typically best for PPV; trade-off between PPV and TPR. For selection, Stability Selection and SCAD typically outperform other methods in terms of PPV, with large gains in some scenarios and with Stability Selection offering the best performance except in “easy” scenarios. However, Stability Selection and SCAD often suffer from an inferior TPR (see, for example, circle symbols in Fig. 4B). In general, there is a trade-off between TPR and PPV.
Lasso performs well. Under the independence design Lasso is competitive in the majority of scenarios for all metrics except PPV. SCAD can outperform Lasso in “easy” scenarios, but Lasso can be considered as a “safe” option because its relative performance across scenarios is less variable than for SCAD.
We will see below that these key observations largely continue to hold in the correlation designs.
As expected, there is no benefit of using an penalty under the independence design; mostly Lasso outperforms or is competitive with Elastic Net, which itself outperforms or is competitive with Ridge (see red, green and yellow lines in Fig. 1 and see also Figs. A4 and A5). An exception is for the selection metric TPR (see below). We completely exclude LENet from our presentation below due to its performance being invariably between that of Lasso and HENet.
The Dantzig Selector mostly performed similarly to Lasso (Figs. A4 and A5), in line with theory (e.g. Meinshausen et al., 2007; Efron et al., 2007). However, Dantzig is more computationally expensive than Lasso (Meinshausen et al., 2007). For example, when and SNR=1, Dantzig takes around 1,500 seconds to compute the whole solution path, while Lasso takes less than one second. In the interest of brevity, we only include Dantzig below when its performance differs to Lasso.
3.1.3 Results by performance metric
Ranking. Ranking performance deteriorates for all methods as SNR or decreases, but SCAD retains its good performance for longest and achieves the best performance in some “easier” scenarios (e.g. Fig. 2B, black line). However, SCAD transitions from best- to worst-performing method with an unfavorable change in , , or SNR (see Fig. 2F for such a transition with increasing ; see also Key Observation 2). HENet and Ridge Regression fail to outperform Lasso in any scenario. Moreover, for some intermediate values of , an penalty can even be detrimental for ranking with the worst performance provided by Ridge Regression (see e.g. yellow lines in Fig. 2B and I). Stability Selection has a very similar performance to Lasso (Fig. A4). Thus, Lasso is competitive across all scenarios, except for those “easier” scenarios where SCAD performs best.
Prediction. Relative performance for prediction is broadly similar to that for ranking (contrast Fig. 2 with Fig. 3). In this sparse, independence setting we see again that an penalty offers no benefits, with Ridge performing substantially worse than all other methods in many scenarios. SCAD again shows transition behavior as difficulty increases, but it is never worse than Ridge, not even in “harder” scenarios.
Selection. All methods achieve optimal TPR when and SNR are sufficiently large, but can at the same time have substantial differences in terms of PPV (see e.g. Fig. 4A; range of PPVs). SCAD offers the best PPV in these “easier” scenarios, while Stability Selection typically outperforms Lasso and HENet. Note that the inferior performance of Stability Selection relative to SCAD in the “easiest” scenarios could at least in part be due to the lack of false positive control in the implementation used here.
In scenarios where TPR is sub-optimal (small-to-moderate values of or small SNR), the relative performance of two methods typically follows the rule: if method has a higher TPR than method , then method will have a lower PPV (see e.g. triangles in Fig. 4B). For the majority of these scenarios, Stability Selection has the highest PPV and lowest TPR, and SCAD performs similar to or better than Lasso and HENet in terms of PPV, but similar or worse in terms of TPR (see e.g. Figures 4B-D and A5; performance differences are greater for larger SNR).
Across the majority of scenarios, Lasso has small gains in PPV (of at most 0.1) over HENet and Dantzig Selector has PPV similar to or slightly worse than Lasso (see e.g. red, green and blue lines in Fig. 4B). Again, the converse relationships are true for TPR. Lasso, HENet and Dantzig fail to obtain PPV higher than 0.4 across all scenarios, contrasting with a maximum PPV greater than 0.8 for SCAD or Stability Selection. However, they are competitive in terms of TPR.
3.2 Pairwise correlation design
3.2.1 Key Observations
We again provide an overview of results and then discuss in more detail below.
Comparison with the independence design for fixed correlation designs. For simplicity, we initially focus on two pairwise correlation designs and compare performance with the independence design for varying and SNR. Both designs have two signals per block () and intra-block pairwise correlation of , but one has block size and the other, . Figures 5 and 6 show the impact of correlation on method performance, relative to the independence design, for all values of (i.e. , and ) and (see Figs. A6-A9 for and ). Figures 7, A10 and A11 show the relative performance of methods in the two pairwise correlation designs for ranking, prediction and selection respectively (see also Figures A12-A15).
We have the following key observations:
Effect of correlation is method- and scenario-specific. Broadly speaking, correlated variables have a negative effect on performance, but in some scenarios there are clear positive effects. Benefits from correlation typically occur when is small and are more salient for ranking and selection when block size is small (; contrast Figures 5A and C with Figures 6A and C), or for prediction when block size is large (, contrast Figure 5B with Figure 6B). SCAD tends to be the most negatively affected by correlation, while Ridge Regression and Elastic Net can, in some scenarios, be the least negatively affected or most positively affected (see e.g. black and yellow symbols in Fig. 6B).
Benefits of regularization in some “hard” correlated scenarios. In line with theory, we find that an penalty confers small gains in ranking and prediction over Lasso in some correlated scenarios, resulting in Ridge or HENet performing the best. We typically observe these gains in “harder” scenarios with small or SNR (see e.g. yellow, green and red lines in Figure 7G).
Key Observations I1-I4 (“No overall winner”, “SCAD transition”, “Stability Selection best for PPV” and “Lasso performs well”) from the independence design also still hold in these two correlation designs, with relative performance between methods remaining broadly the same.
Influence of correlation design parameters. We now turn our attention to the influence of the pairwise correlation design parameters (block size), (number of signals per block) and (intra-block pairwise correlation). To aid presentation of results, we fix or which give (“hard”) or 4.35 (“easy”) respectively. Ranking performance is presented in Figure 8 and analogous results for prediction can be found in Figure 9. Selection performance is shown in Figure 10.
Key observations are:
Influence of correlation parameters can be positive or negative. For ranking and selection, the influence of correlation and number of signals per block depends on scenario difficulty. In “easy” scenarios (with sufficiently large or SNR), performance typically declines with increasing or (see e.g. Figs. 8O, 8P, 10O and 10P) . However, in “harder” scenarios (with small or SNR), performance can improve with increasing or , particularly for HENet and Ridge Regression when block size is small (see e.g. Figs. 8E, 8F, 10E and 10F; see also Key Observation 1). An increase in block size typically has a negative effect or little effect on ranking and selection performance. Predictive performance generally worsens (or remains relatively stable) with an increasing number of signals per block (for example, contrast Fig. 9E with Fig. 9F), while the influence of increasing correlation has a strong dependence on block size and (see below for details). An increase in typically has a positive effect or little effect on prediction.
Benefits of regularization depend on correlation parameters. The gains in ranking performance from an penalty over Lasso in some “hard” scenarios (see Key Observation 2) can be substantial, particularly when block size is small, and correlation strength and number of signals per block are large (e.g. Fig. 8F). Similar benefits of an penalty are observed for selection with the TPR metric, which is in line with Elastic Net enjoying the grouping effect property for highly correlated variables (e.g. Fig. 10F). For prediction, advantages of an penalty observed in certain scenarios are very small (see below for details).
Stability Selection is competitive for ranking and selection (PPV). Stability Selection is competitive for ranking across the majority of scenarios and performs best for some large SNR scenarios (see Figs. 8C and 8D). As in the independence design, it also typically performs best in terms of PPV (Key Observation 3; Fig. 10A-D).
Lasso remains competitive in many correlated scenarios. Lasso remains competitive across many of the correlations designs for all metrics except PPV. Ridge Regression or HENet can offer substantive gains over Lasso for ranking and selection (TPR), particularly for small blocks that contain highly correlated variables, several of which are signals (see Key Observation 2). No such gains are observed for prediction. SCAD continues to perform best in a small number of scenarios that are “easy” (large or SNR) and have weak correlation (small and ).
3.2.2 Results by performance metric
Ranking. We saw in Key Observations 1 and 1 above that there is a method- and scenario-specific influence of correlation. The most salient benefits from correlation are seen for HENet and Ridge in “hard” scenarios with small SNR or when block size is small and blocks consist of highly correlated variables of which several are active (i.e. large and ). For example, when SNR=1, , and (Fig. 8F), all methods benefit from correlation, but HENet and Ridge have the largest improvements over the independence design. The gains from Ridge can be substantial with an increase in pAUC of 0.39 when . Note that improvements over the independence design are not typically monotonically increasing with ; the largest gains are often seen for moderate correlation strength (Figs. 8B and 8F).
This positive influence of correlation in “hard” scenarios with small, highly correlated blocks containing several signals gives rise to an improved ranking performance from an penalty over Lasso (Key Observations 2 and 2). Taking the same example as above (Fig. 8F), Ridge substantially outperforms all other methods when , with an improvement in pAUC of 0.25 over the second best method, HENet. HENet itself also improves over Lasso with a difference in pAUC of 0.14. So the magnitude of the gains over Lasso is linked to the strength of the penalty. Substantial improvements from an penalty over Lasso are not seen in the corresponding larger block size scenario ( with , SNR=1 and ; Figure 8H), suggesting that the proportion of signals per block is important. Indeed, when is increased to 40 (keeping all other parameters fixed), such improvements are obtained: pAUC=0.44, 0.21 and 0.09 for Ridge, Henet and Lasso respectively when .
SCAD again displays its characteristic transition behavior with increasing or SNR (see e.g. Fig. 7), but due to it typically being the most negatively affected by correlation (Key Observation 1), the number of scenarios where SCAD performs best is reduced. SCAD’s sensitivity to correlation also results in a transition with increasing or when ; SCAD can perform best in scenarios with weak correlation and only one signal per block, but performs worst when correlation is strong and there are many signals per block (see e.g. black lines in Figs. 8A and 8B).
Stability Selection is competitive across the majority of correlation design scenarios and can perform the best in some scenarios (see e.g. brown lines in Figs. 8C and 8D; see also Key Observation 3). The exceptions where it is not competitive are the scenarios described above where Ridge or SCAD are the best performers. It is also worth noting that Lasso can have modest gains over Dantzig Selector, in particular for larger (Fig. 8N).
Prediction. The impact of increasing correlation strength on predictive performance depends on block size and number of signals per block . The most salient improvements as increases are observed for large and small , while the most notable declines occur for small and larger . For example, in Figure 9C where and in a large SNR, small scenario, Ridge has a 57% reduction in RMSE relative to the independence design when . On the other hand, in Figure 9B where we have instead and , Ridge has a 56% increase in RMSE. In general, Ridge is affected the most as correlation parameters change. This influence of , taken together with possible positive effects of increasing and negative effects of increasing (Key Observation 1), means that correlation design scenarios with large, highly correlated blocks with signals spread across many blocks (i.e. large and , small ) are most favorable for prediction. These favorable correlation design scenarios have a large number of predictors that are associated with the response (and so are, in a sense, non-sparse due to the correlation, even though each block is sparse in terms of number of signals) and Ridge, which has a non-sparse solution, typically benefits the most (as seen by contrasting the yellow lines in Figs. 9B and 9C).
Ridge does not substantively outperform the other methods for prediction in any of the scenarios considered here, even the favorable correlation design scenarios described above that benefit Ridge the most. In such scenarios, Ridge can marginally outperform other methods when is very small (Key Observation 1), but performance remains poor. For example, for the most “difficult” scenario in Figure A10P (where and ), RMSE=33.05 and 30.87 for Lasso and Ridge respectively. Note that this contrasts with ranking, where Ridge performed best for small and large .
SCAD again shows transition behavior, offering modest gains over other methods when and SNR are large, and and are small (i.e. in “easy”, weakly correlated scenarios), but becoming worse than Lasso, HENet and sometimes Ridge as scenario difficulty increases or correlation becomes stronger.
Selection. Results in the majority of the correlation designs mirror those seen in the independence design (see Key Observation 3). Stability Selection typically offers the largest PPV, outperforming SCAD, which in turn outperforms Lasso, which itself has small gains over HENet. Due to the trade-off between PPV and TPR, the opposite relation generally holds for TPR. As before, a notable exception is that SCAD can have the best PPV in “easy” scenarios with large and this occurs across most correlation designs, as seen for in Figure 10I-L.
Stability Selection and SCAD can be sensitive to correlation. For example, in the small , large SNR scenario with large block size shown in Figure 10C and D, the substantial improvements in PPV provided by Stability Selection and SCAD in the independence design (see square symbols) are mostly or, in the case of SCAD, completely lost under stronger correlation (; see “+” symbols). SCAD also loses competitiveness in terms of TPR. Despite this sensitivity to correlation, Stability Selection still typically performs best except in “easy” scenarios.
Lasso is typically reasonably competitive in terms of TPR, but as outlined in Key Observations 1-2 and similar to ranking, an penalty gives substantial improvements in TPR over Lasso in “hard” scenarios with small, highly correlated blocks and a large proportion of signals per block. For example, in Figure 10F where SNR=1, , and , HENet has a TPR of 0.56 when (green “+” symbol) compared with 0.31 for Lasso (red “+” symbol). At the same time, PPV remains competitive at 0.22 for HENet and 0.23 for Lasso.
Note that Dantzig Selector, in the scenarios where results are available (), can perform slightly worse than Lasso in terms of TPR and PPV (see e.g. blue and red lines in Figure 10P)
3.3 Toeplitz correlation design
We now consider method performance in the Toeplitz correlation design with block size and number of signals per block where the two signals are correlated with (see Methods). Figure 11 compares performance in the Toeplitz design against that in the corresponding pairwise correlation design () for and all possible combinations of , and (see Figs. A16 and A17 for SNR=1 and SNR=4 respectively). Performance is typically similar for the two designs or worse in the Toeplitz design. For prediction, Ridge Regression is most negatively affected by Toeplitz correlation, while SCAD is most affected for the other metrics.
On the one hand, the pairwise correlation design could be considered more difficult than the Toeplitz design because the average correlation between signals and non-signals (within a block) is higher for pairwise than for Toeplitz (0.7 vs. 0.19). However, on the other hand, the Toeplitz design could be considered more difficult because there are several non-signals that are more strongly correlated with the signals than the signals are with each other; for the pairwise correlation design all signals and non-signals within a block are correlated with equal strength. The generally poorer performance observed for the Toeplitz design therefore suggests that having strongly correlated signals and non-signals is more detrimental than a higher average correlation.
Relative performance of methods in the Toeplitz design is generally consistent with that seen for the corresponding pairwise correlation design (contrast Figs. A18 and A19 with Figs. A14 and A15). For ranking, the impact of an penalty (relative to Lasso) is larger under the Toeplitz design; Ridge performs particularly well when SNR=1, but poorly when SNR=4 (see Fig. A18).
4 Additional investigations
Below we extend the main simulations above in two directions. Section 4.1 explores sensitivity of Stability Selection to its tuning parameters and Section 4.2 investigates the ability of methods to detect weak signals when coefficients are heterogeneous. All results in this section are averaged across 100 replicates.
4.1 Stability Selection tuning parameters
Stability Selection has several tuning parameters: the subsample size , an upper bound for (the expected number of false positives), and either a threshold on the selection probabilities or a set of regularization parameters to consider (see Section 2.2). Making appropriate choices for these parameters is non-trivial. Here, we explore the effects of varying , and on selection performance.
We simulated data (as described in Section 2.3) with SNR=2, , and or 20 (giving or 1.45 respectively) for the independence design, and the pairwise correlation design with =10, and . We applied Stability Selection with all possible combinations of the following tuning parameter values: , and where is the subsample proportion.
Results are shown in Figure 12. In general, as or increase, or decreases, the number of selected variables increases, resulting in higher TPR, but lower PPV. An exception is for , where, for the most conservative choices of the parameters ( and ), in addition to a very poor TPR, PPV is low on average and very unstable across iterations (see Fig. 12G, H, O and P). Here, selection is too stringent and the majority of signals are missed. An increase in and , and decrease in leads to substantial improvements in both TPR and PPV. When the underlying model size is smaller (), the most conservative parameter choices are again suboptimal in terms of performance, but the same is also true for the least conservative choices ( and ; Fig. 12I and J). However, in the scenarios considered here, being too stringent seems to have a more deleterious effect on performance than being too lenient.
Results from the main simulations, where we set , and had no explicit false positive control (i.e. the full range of regularization parameters was considered; see Section 2.4), are indicated by crosses in panels A-D and I-L of Figure 12. Performance in the main simulations is typically similar to that of the largest considered here (), but with better TPR and worse PPV (except for where TPR is already optimal and so there is only a decrease in PPV).
4.2 Heterogeneous coefficients
In the main simulations, all non-zero coefficients were assigned the same value. Here, we consider detection of signals with heterogeneous coefficients for three methods: Lasso, HENet and SCAD. We simulated data (for the independence design) as described in Section 2, except instead of active variables all having coefficient 3, half of them had coefficient and the other half had coefficient where . We chose such that with fixed SNR, remains the same as in the homogeneous ’s case. Note that gives the main simulation set-up with homogeneous coefficents. Informed by the main simulations, we set , , and SNR=2 or 4, guaranteeing that when non-zero coefficients all take the same value, we are in a relatively “easy” scenario where the majority of the signals can be detected.
Figure 13 shows the effect of heterogeneous coefficients on selection for . As decreases, signals with smaller coefficients are less likely to be detected, resulting in a decrease in TPR. All methods fail to detect the very weak signals when =0.1 (i.e. only the stronger 50% of the signals are detected giving TPR0.5). Consistent with the main simulations, SCAD has better false positive control (higher PPV) than Lasso and Elastic Net when SNR is large, and this is especially the case when is near 0.1 or 1 (contrast black dotted line with red and green dotted lines in Fig. 13B). The “U” shape of the SCAD PPV curve here is due to the fact that bias is largest when is moderate, which leads to selection of more variables to compensate (SCAD is known to be nearly unbiased for strong signals; for large all signals are relatively strong, while for small the weaker signals have such a small influence that the underlying model is well-approximated by a model with strong signals and no weak signals). In contrast, Lasso and Elastic Net are biased estimators, so their PPV are not as affected. SCAD also seems to have higher power to detect the weaker signals when SNR is large and is moderate (see solid lines in Fig. 13B). However, as observed in the main simulations, SCAD is more sensitive to SNR and so is less competitive in “harder” scenarios (SNR=2; Figure 13A). Relative performance of Lasso and Elastic Net is consistent with the homogeneous coefficient case (). Note that in the independence design scenario with SNR=4 considered here, Lasso has slightly higher TPR than HENet, but in the majority of independence design scenarios the opposite relationship is observed (see Fig. A5).
5 Results using semi-synthetic data
To complement the purely simulated data above, we considered an example using real covariates from The Cancer Genome Atlas (TCGA) study. We used gene expression data from TCGA ovarian cancer samples (The Cancer Genome Atlas Research Network, 2011), and specifically, we used the dataset provided in the Supplementary Appendix of Tucker et al. (2014) 111The dataset is available at http://bioinformatics.mdanderson.org/Supplements/ResidualDisease.. The dataset contained 594 samples and expression levels for 22,277 genes. The samples were a mixture of primary tumor (569), recurrent tumor (17) and normal tissue (8). We randomly subsampled the samples and genes to obtain a design matrix . A response vector was then obtained using the sparse linear model (1). Those samples not included in
were used as test data. Design matrix columns for both training and test data were centered and scaled to have zero mean and unit variance, and responses were centered.
Signals were allocated among the predictors to give either low or high correlation scenarios, using an approach similar to Bühlmann and Mandozzi (2014). Specifically, for the low correlation scenarios, signals were randomly allocated among and for the high correlation scenarios, we used the following procedure that mimics the main simulation set-up by forming correlated blocks: (i) form a block of predictors consisting of a randomly chosen predictor and the predictors that are most correlated with ; (ii) allocate signals to this block by designating and the predictors that are most correlated to it as signals; (iii) repeat steps (i) and (ii), but remove from consideration any predictors already allocated to a block, and continue repeating until all signals have been allocated.
We set , , or 20 (giving or respectively), or , and for the high correlation scenarios, and . We generated 100 semi-synthetic datasets for each of the 16 scenarios. We applied the penalized regression methods to each scenario, but excluded the Dantzig Selector because the main simulations provided little evidence to prefer Dantzig over Lasso and Dantzig is computationally intensive. For Stability Selection, for ranking, there is no explicit false positive control and the full range of tuning parameters is considered when calculating selection probabilities. For selection, we chose , , and subsample size .
Ranking, prediction and selection performance are shown in Figure 14. Results are largely consistent with those from the main simulations; we highlight a few observations here. SCAD performs well in “easy” scenarios with large and SNR, and weak correlation, but is less competitive otherwise (Key Observations 2 and 1; e.g., compare the solid black lines in Figs. 14A and 14E). An penalty is useful for ranking when data is noisy or strong multicollinearity exists (Key Observation 2; e.g., compare yellow and red solid lines at SNR=1 and SNR=8 in Fig. 14E). SCAD and Stability Section are conservative and tend to have good false positive control but less power (Key Observation 3; see black and brown lines in Figs. 14C, D, G and H). Except for PPV, Lasso is overall competitive (Key Observations 4 and 4).
Our results complement theory by shedding light on the finite-sample relative performance of methods. Many of our results do align with available theory. For instance, SCAD is known to have nearly unbiased estimates for coefficients that are large (relative to noise), explaining why it tends to have better selection performance in “easy” scenarios. On the other hand, the Lasso and Elastic Net are biased, so good prediction requires more irrelevant variables to compensate. However, some conditions of theoretical results (asymptotic or finite-sample) can be hard to verify in practice, and the results do not directly provide insight into the performance of a method relative to others, making it difficult to pick a suitable method in any given finite-sample setting. Our results suggest that there is no one method which clearly dominates others in all scenarios, even in the relatively narrow set of possibilities considered here (e.g. we did not consider heavy tailed noise, non-sparsity, non-block-type covariance etc.). Relative performance depends on many factors, and also on the specific metric(s) of interest.
Nevertheless, with the above caveats, we can highlight some general observations. For data generated from sparse linear models: Lasso is relatively stable and outperforms Elastic Net and Ridge in the mild correlation designs; Elastic Net and Ridge only outperform Lasso in the most challenging, correlated design scenarios we considered here; SCAD is double-edged, dominating in “easier” scenarios but deteriorating rapidly when conditions become difficult; Stability Selection is good for false positive control and ranking; and Dantzig is usually similar or worse than Lasso. Ridge does particularly badly in many cases, but it is worth pointing out that most scenarios in this paper were pro-Lasso (and unfriendly to Ridge) in the sense of being highly sparse, and with low overall correlation (across all predictors). In many areas such as biomedicine, signals can be rather weak and SNR can be smaller than the values used here. In such difficult settings, Ridge should perhaps receive more attention due to its robustness.
Choices of tuning parameters can be crucial. In line with known results, we saw that standard cross-validation often yielded overly large models for Lasso, Elastic Net and the Dantzig Selector. An interesting alternative is proposed in Lim and Yu (2016), where cross-validation is based on an estimation stability metric. Compared to traditional cross-validation, this approach significantly reduces the false positive rate while slightly sacrificing the true positive rate, and achieves similar prediction but higher accuracy in parameter estimation. For Stability Selection, in Zou (2010) the author points out that there is no established lower bound for the expected number of true positives, and the tuning parameters and have significant influences on the true positive rate. They also found in their simulation study that the number of false positives is usually smaller than the specified . This suggests that less stringent can help improve signal detection without sacrificing false positive control too much, thus providing a better balance between the two. This is reflected in our results in Section 4.1.
We focused on simulations from the sparse linear model to better understand the variability of performance in a broadly favorable setting. Extending this systematic empirical approach to (the huge range of) less favorable settings, spanning many kinds of model mis-specification, could be illuminating, but experimental design would be nontrivial. As one example, we revisited a low correlation scenario from the TCGA ovarian cancer data analysis (Section 5), but with a non-Gaussian error distribution. Figure A20 shows method performance for all metrics and provides details of data generation. Method performance deteriorates as non-normality increases. SCAD is the most affected and mirrors its previous behavior, with a transition in performance from best to worst as non-normality increases for ranking and prediction.
Our comparison focused on six popular penalized linear regression methods, but there are of course many others that have been proposed, and some of these are also well-known. The Adaptive Lasso (Zou, 2006) is one example and is an extension of the Lasso where the penalty is weighted, assigning a different penalty parameter to each coefficient. The idea is similar in spirit to SCAD in that large coefficients are penalized less, resulting in an estimator that is almost unbiased for these coefficients. The Adaptive Lasso, like SCAD, also enjoys the oracle property. There are also relatively well-known extensions of Lasso that have been proposed for data where covariates can be grouped (Group Lasso; Yuan and Lin, 2006) or ordered (Fused Lasso; Tibshirani et al., 2005). While, for reasons of tractability, our comparison was restricted to six methods, we make our simulation code and method performance data available, allowing users to add in other methods of interest into the comparison without the need to regenerate the results for the six methods considered here.
We explicitly defined the true model in terms of exact sparsity (i.e. some coefficients being precisely zero). Although this is the best studied case, in practice such a notion of sparsity may not be realistic and a more reasonable assumption may be that there are a few strong signals, several moderate signals and even more weak signals, but the majority of variables are irrelevant with small, but sometimes non-zero coefficients. In this case, since it may not be possible to find all relevant variables, a good method might be expected to detect all strong and moderate signals while removing the weaker ones. In this vein, Zhang and Huang (2008) consider the problem where weak signals exist outside the ideal model, such that their total signal strength is below a certain level. The authors prove that the Lasso estimate has model size of the correct order, and the selection bias is controlled by the weak signal coefficients and a threshold bias.
Due to the comprehensive nature of our simulation study, we focused on summarizing the predominant trends and relationships across the scenarios. There will always be some scenarios which are exceptions to these summaries, but this in itself motivates the need for extensive simulation studies. If a simulation study has limited scope then the derived conclusions may not generalize beyond the few scenarios considered. So while such studies may be useful in exploring and understanding the properties of a method, they may have limited practical implications for an end-user. In contrast, a large-scale simulation study, such as the one presented here, offers some insight as to which method may be the most appropriate, depending on the properties of the data.
Code and data availability
All analysis was performed in R (R Core Team, 2018). Scripts for generating the main simulation synthetic data sets, applying the regression methods, assessing performance and plotting results are available at https://github.com/fw307/high_dimensional_regression_comparison, together with performance metric data from the main simulation.
This work was supported by the UK Medical Research Council (University Unit Programme numbers MC_UU_00002/2 and MC_UU_00002/10).
Appendix A: Supplementary figures
- Bickel et al.  P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37:1705–1732, 2009.
- Bondell and Reich  H. D. Bondell and B. J. Reich. Consistent high-dimensional Bayesian variable selection via penalized credible regions. Journal of the American Statistical Association, 107:1610–1624, 2012.
Breheny and Huang 
P. Breheny and J. Huang.
Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection.The Annals of Applied Statistics, 5:232–253, 2011.
- Bühlmann and Mandozzi  P. Bühlmann and J. Mandozzi. High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Computational Statistics, 29:407–430, 2014.
Bühlmann and van de Geer 
P. Bühlmann and S. van de Geer.
Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Publishing Company, Incorporated, 1st edition, 2011.
- Candes and Tao  E. Candes and T. Tao. The Dantzig selector: statistical estimation when is much larger than . The Annals of Statistics, 35:2313–2351, 2007.
- Celeux et al.  G. Celeux, M. E. Anbari, J.-M. Marin, and C. P. Robert. Regularization in regression: Comparing Bayesian and frequentist methods in a poorly informative situation. Bayesian Analysis, 7:477–502, 2012.
- Efron et al.  B. Efron, T. Hastie, and R. Tibshirani. Discussion: The Dantzig selector: Statistical estimation when is much larger than . The Annals of Statistics, 35:2358–2364, 2007.
- Fan and Li  J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360, 2001.
- Fan and Lv  J. Fan and J. Lv. A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20:101–148, 2010.
- Fan et al.  J. Fan, H. Peng, et al. Nonconcave penalized likelihood with a diverging number of parameters. The Annals of Statistics, 32:928–961, 2004.
- Friedman et al.  J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33:1–22, 2010.
- Hoerl and Kennard  A. E. Hoerl and R. W. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12:55–67, 1970.
- James et al.  G. M. James, P. Radchenko, and J. Lv. DASSO: connections between the Dantzig selector and Lasso. Journal of the Royal Statistical Society Series B, 71:127–142, 2009.
Li et al. 
X. Li, T. Zhao, X. Yuan, and H. Liu.
The flare package for high dimensional linear regression and
precision matrix estimation in R.
Journal of Machine Learning Research, 16:553–557, 2015.
- Lim and Yu  C. Lim and B. Yu. Estimation stability with cross-validation (ESCV). Journal of Computational and Graphical Statistics, 25:464–492, 2016.
- Meinshausen and Bühlmann  N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34:1436–1462, 2006.
- Meinshausen and Bühlmann  N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society, Series B, 72:417–473, 2010.
- Meinshausen et al.  N. Meinshausen, G. Rocha, and B. Yu. Discussion: A tale of three cousins: Lasso, L2 Boosting and Dantzig. The Annals of Statistics, 35:2373–2384, 2007.
- R Core Team  R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2018. URL https://www.R-project.org/.
- Sill et al.  M. Sill, T. Hielscher, N. Becker, and M. Zucknick. c060: Extended inference with lasso and elastic-net regularized cox and generalized linear models. Journal of Statistical Software, 62:1–22, 2014.
- The Cancer Genome Atlas Research Network  The Cancer Genome Atlas Research Network. Integrated genomic analyses of ovarian carcinoma. Nature, 474:609–615, 2011.
- Tibshirani  R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, 58:267–288, 1996.
- Tibshirani et al.  R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society Series B, pages 91–108, 2005.
- Tucker et al.  S. L. Tucker, K. Gharpure, S. M. Herbrich, A. K. Unruh, A. M. Nick, E. K. Crane, R. L. Coleman, J. Guenthoer, H. J. Dalton, S. Y. Wu, R. Rupaimoole, G. Lopez-Berestein, B. Ozpolat, C. Ivan, W. Hu, K. A. Baggerly, and A. K. Sood. Molecular biomarkers of residual disease after surgical debulking of high-grade serous ovarian cancer. Clinical Cancer Research, 20:3280–3288, 2014.
- Wainwright  M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using L1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55:2183–2202, 2009.
- Yuan and Lin  M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67, 2006.
- Zhang and Huang  C.-H. Zhang and J. Huang. The sparsity and bias of the Lasso selection in high-dimensional linear regression. The Annals of Statistics, 36:1567–1594, 2008.
- Zou  H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:3133–3164, 2006.
- Zou  H. Zou. Discussion of ”Stability selection” by Nicolai Meinshausen and Peter Buhlmann. Journal of the Royal Statistical Society, Series B, 72:468, 2010.
- Zou and Hastie  H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B, 67:301–320, 2005.