This package provides functions for computing the ensembles of regularized linear regression estimators defined in Christidis, Lakshmanan, Smucler and Zamar (2017) (https://arxiv.org/abs/1712.03561).
We propose an approach for building ensembles of regularized linear models by optimizing a novel objective function that encourages sparsity within each model and diversity among them. Our procedure works on top of a given penalized linear regression estimator (e.g., Lasso, Elastic Net, SCAD) by fitting it to possibly overlapping subsets of features, while at the same time encouraging diversity among the subsets, to reduce the correlation between the predictions that result from each fitted model. The predictions from the models are then aggregated. For the case of an Elastic Net penalty and orthogonal predictors, we give a closed form solution for the regression coefficients in each of the ensembled models. An extensive simulation study and real-data applications show that the proposed method systematically improves the prediction accuracy of the base linear estimators being ensembled. Extensions to GLMs and other models are discussed.
Model ensembling is a powerful approach for prediction. Examples of ensemble methods for regression include Random Forests (Breiman, 2001) and Boosting (Schapire and Freund, 2012; Friedman, 2001). Both approaches can adapt well to the presence of non-linearity, but the resulting prediction rules are generally difficult to interpret. If the relation between the response and the predictor variables is approximately linear, an ensemble of linear models will produce highly competitive predictions, while at the same time yielding more easily interpretable results.
In this paper, we are interested in building ensembles of regularized linear models, with a special aim towards data sets in which the number of observations is relatively small: smaller than, or not much larger than, the number of predictors. For motivation, consider the following toy example. We generate replications of independent observations from the model
where the error ε and the predictors x_1, x_2 and x_3 are standard normal, ε is independent of (x_1, x_2, x_3), x_3 is independent of x_1 and x_2, and the correlation between x_1 and x_2 is close to one.
We fit three procedures to the data: the ordinary least squares estimator (LS), the Elastic Net (EN) (Zou and Hastie, 2005) with penalty parameter chosen by leave-one-out cross-validation, and the following ensemble: apply least squares to the data using only predictors x_1 and x_3, then apply least squares to the data using only predictor x_2, and average the predictions from these two fits. We computed the prediction mean squared error (PMSE) of each procedure on an independent test set of size five thousand. The PMSE of the ensemble is much smaller than those of LS and EN.
The intuitive idea is that, for problems with a number of observations that is relatively low when compared to the number of predictors
, the increase in bias due to leaving out variables from some of the models is compensated by a double reduction in variance: (i) the reduction in variance in each of the linear models due to the lower dimensionality and possibly lower multicollinearity, and (ii) the reduction in variance due to the averaging of the resulting predictors. Indeed, in the example above, the mean variances of LS and the ensemble are 0.74 and 0.32, respectively, whereas the increase in mean squared bias for the ensemble is comparatively small. In our toy example, since predictors x_1 and x_2 are highly correlated, it seems sensible to place them in separate models. Note that to build the ensemble, in particular to choose how to group the variables, we used our knowledge of the data generating process, which is not available in practice.
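A minimal numerical sketch of this toy example, in Python for illustration (the accompanying package is in R). The sample size, the correlation of 0.9 between x_1 and x_2, and the unit regression coefficients are assumptions, since the exact values are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, rho):
    """Draw n observations: x1 and x2 correlated (rho), x3 independent,
    all standard normal, with standard normal noise."""
    z = rng.standard_normal((n, 3))
    x1 = z[:, 0]
    x2 = rho * z[:, 0] + np.sqrt(1 - rho**2) * z[:, 1]
    x3 = z[:, 2]
    X = np.column_stack([x1, x2, x3])
    y = X.sum(axis=1) + rng.standard_normal(n)  # illustrative unit coefficients
    return X, y

def ls_predict(X_tr, y_tr, X_te, cols):
    """Least squares on a subset of columns, predictions on the test set."""
    b, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
    return X_te[:, cols] @ b

X_tr, y_tr = simulate(20, 0.9)
X_te, y_te = simulate(5000, 0.9)

# Full least squares versus the two-model ensemble {x1, x3} and {x2}.
pred_full = ls_predict(X_tr, y_tr, X_te, [0, 1, 2])
pred_ens = 0.5 * (ls_predict(X_tr, y_tr, X_te, [0, 2]) +
                  ls_predict(X_tr, y_tr, X_te, [1]))

pmse_full = np.mean((y_te - pred_full) ** 2)
pmse_ens = np.mean((y_te - pred_ens) ** 2)
```

The intercept is omitted because all variables have zero population mean in this sketch.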
In general, one could exhaustively search over all possible groupings of the variables into different models and choose the one with the lowest estimated prediction error (e.g. using cross-validation), but this is computationally unfeasible. For example, the number of possible splits of p features into two groups of sizes p_1 and p_2, plus a third group of p_0 = p - p_1 - p_2 left-out features, is the multinomial coefficient p! / (p_1! p_2! p_0!). The number of possible ensembles of G models plus a group of left-out features grows accordingly. This number becomes much larger if we allow the variables to be shared by the different models. At this point we notice that, in the simpler case of selecting a single subset of features (where G = 1), the combinatorial problem of evaluating all 2^p possible subsets can be bypassed by using greedy algorithms such as forward stepwise regression or penalized estimators. We will see that a penalization approach can also be adopted to deal with the case G > 1.
Suppose we have n samples of training data, (y_i, x_i), i = 1, ..., n, where y = (y_1, ..., y_n) is the vector of responses and X is the n x p matrix collecting all the available features from each of the n samples, and we want to build G linear models using the data. We propose to minimize a penalized objective function of the form

O(beta^1, ..., beta^G) = sum_{g=1}^G [ (1/2) ||y - X beta^g||^2 + lambda_s P_s(beta^g) ] + (lambda_d / 2) sum_{g=1}^G sum_{h != g} P_d(beta^g, beta^h).   (1)
where beta^g is the vector of coefficients for model g, lambda_s and lambda_d are non-negative penalty parameters, P_s is a penalty function encouraging sparsity within the models and P_d is another penalty function encouraging diversity among the models. In this paper, we take P_s to be the Elastic Net penalty, P_s(beta) = sum_j [ (1 - alpha) beta_j^2 / 2 + alpha |beta_j| ], and P_d(beta^g, beta^h) = sum_j |beta_j^g| |beta_j^h|.
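The objective can be sketched in code. The following is an illustrative implementation assuming the Elastic Net sparsity penalty and the pairwise diversity penalty written above; the exact scalings used in the package may differ:

```python
import numpy as np

def objective(B, X, y, lam_s, lam_d, alpha=1.0):
    """Sketch of objective (1): B is p x G, one column of coefficients per model.

    loss:      0.5 * sum_g ||y - X b_g||^2
    sparsity:  lam_s * sum_g sum_j ((1 - alpha)/2 * b_jg^2 + alpha * |b_jg|)
    diversity: (lam_d / 2) * sum_{g != h} sum_j |B[j, g]| * |B[j, h]|
    """
    resid = y[:, None] - X @ B                       # n x G matrix of residuals
    loss = 0.5 * np.sum(resid ** 2)
    sparsity = lam_s * np.sum((1 - alpha) / 2 * B**2 + alpha * np.abs(B))
    A = np.abs(B)
    gram = A.T @ A                                   # inner products of |b_g|, |b_h|
    diversity = 0.5 * lam_d * (gram.sum() - np.trace(gram))
    return loss + sparsity + diversity
```

Note that the diversity term vanishes exactly when the models have disjoint supports, which is the mechanism that pushes correlated variables into different models.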
In general, by appropriately choosing the penalty function P_s, our method generalizes penalized regression estimators such as the Lasso, the Elastic Net and the SCAD (Tibshirani, 1996; Zou and Hastie, 2005; Fan and Li, 2001), allowing for the selection of possibly overlapping subsets of features in such a way that variables that work well together end up in the same model, while at the same time encouraging diversity between the models, so as to reduce the correlation between the predictions resulting from each of them.
There is a vast body of work, both theoretical and practical, dealing with regularization in linear models. The task of reviewing it is daunting and beyond the scope of this paper. The interested reader can find excellent reviews in Bühlmann and van de Geer (2011) and Hastie et al. (2015). The main difference between our approach and the existing methodology is that the final output of our approach consists of a collection of regression models, whose predictions can be averaged or otherwise combined to produce a final prediction. Moreover, our procedure allows for the optimal choice of the model sizes and overlap to yield better predictions.
The rest of this article is organized as follows. In Section 2 we study the properties of the minimizers of (1) in some simple but illustrative cases. In Section 3 we propose an algorithm to compute the proposed estimators, to choose their tuning parameters and to aggregate the predictions from the constructed models. In the simulation study included in Section 4 we compare the performance with regards to prediction accuracy of the proposed method against that of several competitors. We apply the procedures considered in the simulation study to real data-sets in Section 5. Finally, some conclusions and possible extensions are discussed in Section 6. Section 7 is an Appendix containing all the proofs.
Assume we have training data (y_i, x_i), i = 1, ..., n, standardized so that the response and the predictors have zero mean and the predictors have unit norm.
Let X be the n x p matrix with the x_i as rows and let X_1, ..., X_p be its columns.
We consider ensembles defined as minimizers of the objective function given in (1), that is
where B is the p x G matrix with beta^1, ..., beta^G as columns. Hence, as mentioned in the introduction, we use the quadratic loss to measure the goodness of fit of each model, the Elastic Net penalty to regularize each of the models, and the penalty P_d to encourage diversity among the models.
The problem of minimizing (1) can be posed as an ‘artificial’ multivariate linear regression problem. Let Y be the n x G matrix with the vector y repeated G times as columns. Then
where ||·||_F is the Frobenius norm, |B| stands for taking the absolute value coordinate-wise and ||·||_1 is the sum of the absolute values of the entries of the matrix. It is seen that the diversity penalty term in a sense penalizes correlations between the different models. Further insight can be gained by analyzing the term corresponding to each model in (1) separately. Fix any g in {1, ..., G}. Then
where w_j = lambda_d sum_{h != g} |beta_j^h|. Hence, when looking at each model separately, we are solving an Elastic Net type problem, where the Lasso penalty has weights that depend on the solution itself. In particular, the coordinates most penalized in model g will be those that have large coefficients in the other models. Some intuition on the impact of using our diversity penalty can be obtained by considering an extreme situation in which there is only one variable and three models. In Figure 1 we show level surfaces of the full penalty term for fixed sparsity parameters and different values of lambda_d. Hence the surfaces plotted are the solutions of
We see that when lambda_d is small, the surface is similar to the three-dimensional ball. For larger values of lambda_d the surface becomes highly non-convex, with peaks aligned with the axes, where only one model is non-null.
The following proposition follows easily from the previous discussion.
For lambda_d = 0, the optimal B has all columns equal to the Elastic Net estimator.
Since lambda_d = 0 is always considered as a candidate penalty parameter (see Section 3), if the optimal model (in the sense of minimal cross-validated prediction mean squared error) is a single Elastic Net, this will be the final output of our method.
For any lambda_d >= 0, O(B) tends to infinity as ||B|| tends to infinity and hence a global minimum of O exists. The objective function is not a convex function of B if lambda_d > 0, due to the non-convexity of the diversity penalty. Moreover, if B is a global minimizer of O, any permutation of its columns is also a global minimizer. Importantly, the objective function is convex (strictly if lambda_s > 0) in each coordinate beta_j^g and in each group of coordinates beta^g, since the corresponding optimization problems are actually penalized least squares problems with a weighted Elastic Net penalty.
We derive a closed form solution for the minimizers of the objective function in the special case in which the predictors are orthogonal. We find that the closed form solution for the orthogonal case provides some insights into how the procedure works.
Assume X is orthogonal and G = 2. Fix any coordinate j. Then
If the -th coefficients of all models in all solutions of the ensemble are zero.
If all solutions of the ensemble satisfy
If any pair that satisfies and
is a solution to the ensemble.
If all solutions of the ensemble satisfy that only one of and is zero, and the non-zero one is equal to
Some comments are in order. First, as happens with the classical Elastic Net, if the maximal correlation between the predictors and the response is smaller than the threshold, then all the coefficients in the ensemble are zero. Otherwise, we have three distinct regimes. Fix some coordinate j. In the first regime, the coefficients for predictor j in both models are equal, and are equal to the univariate Elastic Net estimator corresponding to the same penalty parameters but with an added shrinkage: the factor dividing the soft-thresholding operator has an added term. In the second regime the objective function depends only on the sum of the two coefficients and hence more than one solution exists. Finally, in the third regime, for all possible solutions of the ensemble and for predictor j, only one of the models is non-null, and it is equal to the univariate Elastic Net. Note that, as a function of lambda_d, the solution path is continuous except at the boundary between the last two regimes.
Further insights into how our procedure works can be gained by analyzing the simple case in which there are only two correlated predictors and two models.
Assume X is normalized so that its columns have constant squared norm and G = 2. Let B be any solution of the ensemble.
If the models are disjoint then the active variables in each model have coefficients
If variable is inactive in both models, variable is active in both models and then the coefficients of variable are equal to
Assume and that both variables are active in both models. If and then all solutions of the ensemble satisfy
If , the solution is unique.
One special case is easier to analyze. In this case, the proposition above implies that if the fitted models are disjoint then
and the non-null coefficients in the ensemble are equal to the marginal Elastic Net regressions. Note that in the case in which , the size of the diversity penalty required to separate the models decreases as the correlation between the variables increases.
To obtain approximate solutions of the minimizer of (1), we propose an algorithm based on coordinate descent: we cycle through the coordinates of , optimizing with respect to each coordinate while keeping the others fixed. Coordinate descent has proven to be very efficient in solving regularized least squares problems, see Friedman et al. (2010) for example.
The coordinate descent update for beta_j^g is
where y-hat_j^g is the in-sample prediction of y using model g and leaving out variable j, soft is the soft-thresholding operator, defined by soft(z, t) = sign(z) (|z| - t)_+, the superscript 'new' stands for the new solution and the superscript 'old' stands for the old solution.
The proof of Proposition 4 is straightforward and for this reason it is omitted. Note that the shrinkage applied to variable j in model g increases with the sum of the absolute values of the coefficients of variable j in all the other models. This shows more clearly that the penalty encourages diversity among the models.
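A sketch of the resulting coordinate descent loop. The columns of X are assumed standardized with squared norm n, and the threshold combines the sparsity penalty with the diversity term lambda_d * sum_{h != g} |beta_j^h| described above; the exact normalizations are assumptions, not taken from the text:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def coordinate_descent(X, y, G=2, lam_s=0.1, lam_d=0.1, alpha=1.0, n_iter=200):
    """Cycle over the coordinates of each model, holding the rest fixed."""
    n, p = X.shape
    B = np.zeros((p, G))
    for _ in range(n_iter):
        for g in range(G):
            for j in range(p):
                # partial residual of y using model g, leaving out variable j
                r = y - X @ B[:, g] + X[:, j] * B[j, g]
                z = X[:, j] @ r / n
                # diversity contribution of the other models to the threshold
                t = lam_s * alpha + lam_d * (np.abs(B[j, :]).sum() - abs(B[j, g]))
                B[j, g] = soft(z, t) / (X[:, j] @ X[:, j] / n + lam_s * (1 - alpha))
    return B
```

Each inner update is the weighted Elastic Net update described above, so the objective decreases at every step.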
We cycle through the coordinates of beta^1, then through those of beta^2 and so on until we reach beta^G, where we check for convergence. Convergence is declared when
for some small positive . Since the data is standardized, the convergence criterion in the original units is:
Hence, the algorithm converges when the in-sample average predictions no longer change significantly. If the algorithm has not yet converged, we run another full cycle.
Once we have computed the models beta^1, ..., beta^G, we aggregate them to form a predictor by averaging the models: if x is a new observation, the prediction of the response is
where beta-bar = (1/G) sum_{g=1}^G beta^g, which is an estimate of the vector of regression coefficients.
Another way to aggregate the models is to weight them according to, for example, their cross-validated prediction errors. Let CV_g be the estimate of the prediction error of model g obtained by cross-validation. Let
Then we can take
Breiman (1996) proposes to aggregate predictors by averaging them according to weights determined by solving a constrained least squares problem, and calls the method stacking. In detail, given predictors f_1, ..., f_G, define their leave-one-out versions, each fitted without the corresponding observation. Then the weights used to form the final predictor are defined as the non-negative constants w_1, ..., w_G that minimize
Breiman also provides empirical evidence showing that using 10-fold cross-validation instead of leave-one-out to generate the stacked predictions can be more effective, as well as less computationally demanding.
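Stacking weights solve a non-negative least squares problem. A minimal sketch using projected gradient descent follows; this is an illustrative solver, not the implementation used by Breiman:

```python
import numpy as np

def stacking_weights(Z, y, n_iter=5000):
    """Non-negative weights minimizing ||y - Z w||^2 subject to w >= 0.

    Z: n x G matrix whose columns are the (cross-validated) predictions
       of each model.  Solved by projected gradient descent."""
    n, G = Z.shape
    w = np.full(G, 1.0 / G)
    # step size of 1 / Lipschitz constant of the gradient
    step = 1.0 / (np.linalg.norm(Z, 2) ** 2 + 1e-12)
    for _ in range(n_iter):
        grad = Z.T @ (Z @ w - y)
        w = np.maximum(w - step * grad, 0.0)   # gradient step, then projection
    return w
```

A dedicated NNLS solver would be faster, but the projection step makes the non-negativity constraint explicit.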
The theoretical properties of combining prediction procedures are discussed in Yang (2004) and references therein.
We choose lambda_s and lambda_d over grids of candidates, looking to minimize the cross-validated (CV) mean squared error (MSE). The grids of candidates are built as follows. It is easy to show that, for alpha > 0 and lambda_d = 0, the smallest lambda_s that makes all the models null is given by
This is the maximum sparsity penalty that will be considered. The smallest lambda_d that maximizes diversity among the models (makes them disjoint) for a given lambda_s, say lambda_d^max, is estimated using a grid search. Proposition 4 hints that in general lambda_d^max will depend in a complicated way on the correlations between the predictors. To build a grid to search for each optimal penalty we take 100 log-equispaced points between the corresponding maximum and a small fraction of it; the grid used for lambda_d also includes zero as a candidate.
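A sketch of the grid construction for the sparsity penalty, assuming the usual Elastic Net bound lambda_max = max_j |X_j' y| / (n alpha); the package's exact normalization and grid endpoints may differ:

```python
import numpy as np

def sparsity_grid(X, y, alpha=1.0, n_points=100, eps=1e-2):
    """100 log-equispaced candidates from lam_max down to eps * lam_max.

    lam_max = max_j |X_j' y| / (n * alpha) is the usual Elastic Net bound
    above which all coefficients are zero (assumed normalization)."""
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / (n * alpha)
    return np.logspace(np.log10(lam_max), np.log10(eps * lam_max), n_points)
```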
Even though we could also cross-validate over a grid of possible values of alpha, we find that taking a large value of alpha generally works well and hence in what follows we assume that alpha is fixed.
Fix one of lambda_s, lambda_d. We then minimize the objective function over the grid of candidates corresponding to the other penalty term, going from the largest to the smallest values in the grid; for each element of the grid of candidates, the solution to the problem using the previous element is used as a warm start. Even though the optimal B is not in general a continuous function of lambda_s and lambda_d (see Proposition 2), we find that using warm starts as described above works well in practice.
The main loop of the algorithm works as follows, starting with lambda_d = 0 and iterating until the CV MSE no longer decreases:
Find the lambda_s in the grid giving minimal CV MSE, keeping lambda_d fixed.
Take the optimal lambda_s from the previous step. Recompute lambda_d^max and the corresponding grid. Find the lambda_d in the grid giving minimal cross-validated MSE. Go to the previous step.
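The alternating search over the two grids can be sketched as follows; `cv_mse` is a hypothetical oracle standing in for the cross-validation described above:

```python
def alternate_search(grid_s, grid_d, cv_mse, max_rounds=10):
    """Alternate between the two penalty grids, holding the other penalty
    fixed, until the CV MSE no longer decreases.

    cv_mse(lam_s, lam_d) is a hypothetical function returning the
    cross-validated MSE for one penalty pair."""
    lam_d = 0.0                    # start with no diversity penalty
    lam_s = grid_s[0]
    best = float("inf")
    for _ in range(max_rounds):
        # optimize lam_s with lam_d fixed
        lam_s = min(grid_s, key=lambda l: cv_mse(l, lam_d))
        # optimize lam_d with lam_s fixed
        lam_d = min(grid_d, key=lambda l: cv_mse(lam_s, l))
        score = cv_mse(lam_s, lam_d)
        if score >= best:          # stop once the CV MSE no longer decreases
            break
        best = score
    return lam_s, lam_d
```

Because lam_d starts at zero and zero stays in the grid, the plain Elastic Net solution always remains a candidate.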
As we mentioned earlier, since we start with lambda_d = 0, the solution with all columns equal to the Elastic Net estimator is always a candidate to be chosen.
We conduct a small simulation study to illustrate the effect that increasing the number of models has on the computation time and the performance of an ensemble of Lassos. We generate 500 replications of a linear model, corresponding to the second covariance structure described in Section 4. For each replication, two data-sets are generated, one on which the ensemble is trained and one used for computing the prediction mean squared error (PMSE). The computation is repeated for various values of the proportion of active variables. The signal to noise ratio is 10. We show the PMSEs for different values of the number of models used (rows) and the proportion of active variables in the data generating process (columns). We also computed a measure of the overlap between the models in the ensemble. Let B be the p x G matrix with columns equal to the computed models, where G is the number of models and p the number of features. Let
then we define the overlap as
and as 0 otherwise. Note that the overlap always lies between 0 and 1. If all models are empty the overlap is defined to be 0; if at least one model is non-empty, the ratio is well defined. If the overlap is 0 then each variable that is active appears in only one model, and hence the overlap between the models is minimal, since they are disjoint. Finally, if the overlap is 1 then all the variables that are active in at least one model actually appear in all the models, and hence we have maximum overlap.
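One reading of this overlap measure, assuming it is the normalized count of shared active variables (here s_j is the number of models in which variable j is active and A the number of variables active in at least one model; this reconstruction is an assumption, since the exact formula is not reproduced above), is:

```python
import numpy as np

def overlap(B, tol=0.0):
    """Overlap of the models in B (p x G, one column per model):
    0 when every active variable appears in a single model (disjoint models),
    1 when every active variable appears in all G models."""
    p, G = B.shape
    active = np.abs(B) > tol        # p x G indicator of active coefficients
    s = active.sum(axis=1)          # number of models containing each variable
    A = np.count_nonzero(s)         # variables active in at least one model
    if A == 0 or G == 1:
        return 0.0
    return (s.sum() - A) / (A * (G - 1))
```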
Table 1 shows the results. The last column shows the average computation time. The computation time does not vary much between different sparsity levels, and hence we report the average over them. In this case, as the number of models used increases, both the overlap and the PMSE decrease, but the gain in prediction accuracy due to using more models also decreases. There seems to be a ‘diminishing returns’ type phenomenon. Of course, this pattern may not persist in other settings. An objective way to determine the number of models to be used is to cross-validate over a coarse grid of candidate numbers of models.
In all the settings studied in this paper, the increase in computational time due to using more models appears to be approximately linear in the number of models, as evidenced by Table 1. In our simulations and examples, we always use ten models, a possibly sub-optimal choice that is nonetheless good enough to give good performance.
We ran a simulation study, comparing the prediction accuracy of the following nine competitors. All computations were carried out in R.
The Lasso, computed using the glmnet package (Friedman et al., 2010).
The Elastic Net with , computed using the glmnet package.
An ensemble of Lassos, using ten models, called Ens-Lasso.
An ensemble of Elastic Nets, using ten models, called Ens-EN.
The forward stepwise algorithm, computed using the lars package, called Stepwise.
The Cluster Representative Lasso, proposed in Bühlmann et al. (2013), computed using code kindly provided by the authors, called CRL.
All tuning parameters were chosen via cross-validation. The CRL of Bühlmann et al. (2013) was not included in scenarios with due to its long computation time when compared with the rest of the methods. For the same reason, in the scenarios with , we only did 100 replications for CRL, instead of the 500 done for all the other procedures.
The popular Group Lasso (Yuan and Lin, 2006; Simon et al., 2013) is not included in the simulation, because we do not assume a priori knowledge of the existence of pre-defined groups among the features. The interesting recent proposals of Bühlmann et al. (2013), Sharma et al. (2013) and Witten et al. (2014) assume that there exist unknown clusters of correlated variables, and shrink the coefficients of variables in the same cluster towards each other. Because CRL has a relatively more efficient numerical implementation than the other two, we included it in our simulation to represent the cluster-based approaches. Finally, note that all the competitors above, except perhaps for the forward stepwise algorithm, could in principle be used as building blocks in our procedure for forming ensembles. Hence, our main objective in this simulation study is to show that the proposed method for building ensembles improves upon the prediction accuracy of the base estimators being ensembled, in this case the Lasso and the Elastic Net.
For each Monte Carlo replication, we generate data from a linear model:
where the x_i are multivariate normal with zero mean and correlation matrix Sigma and the errors are standard normal. We consider different combinations of n and p, see below. For each p we take the number of active variables to be a given proportion of p. Given p, n and a sparsity level, the following scenarios for Sigma and beta are considered:
for all , the first coordinates of are equal to 2 and the rest are 0.
for , for and the rest of the coordinates equal to zero.
the first coordinates of are equal to 2 and the rest are 0.
We consider several different values of the correlation parameter. Then sigma is chosen to give a desired signal to noise ratio (SNR), defined as the ratio between the variance of the signal, beta' Sigma beta, and the variance of the noise, sigma^2.
We consider SNRs of 3, 5, 10 and 50. In the first scenario all the predictors are correlated among each other. In scenario two we have two groups of active variables. This is similar to the simulation scenario considered in Witten et al. (2014). Variables within each group are correlated with each other, but the groups are independent. In the third scenario the active variables are only correlated with each other.
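Under the definition assumed above (SNR = beta' Sigma beta / sigma^2), the noise standard deviation needed to hit a target SNR can be computed as:

```python
import numpy as np

def noise_sd_for_snr(beta, Sigma, snr):
    """Standard deviation of the noise giving a target signal-to-noise ratio,
    assuming SNR = Var(x' beta) / sigma^2 = beta' Sigma beta / sigma^2."""
    signal_var = beta @ Sigma @ beta
    return np.sqrt(signal_var / snr)
```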
For each replication, two independent copies of the data are generated, one to fit the procedures and the other to compute the prediction mean squared error (PMSE), divided by the variance of the noise, sigma^2. Hence, the best possible result is 1. In each table reporting the PMSEs we also compute the standard error for each of the methods, and report the maximum among them in the caption.
We also compute the precision (PR) and recall (RC) of each method: precision is the proportion of variables selected by a method that are active in the true model, and recall is the proportion of truly active variables that are selected by the method.
For the ensembles, the vector of coefficients used to compute the precision and recall is the average of the models, see (2). For the SIS-SCAD method, the precision and recall are computed using the variables selected by the SIS step.
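For concreteness, the standard precision and recall computations on a coefficient vector (assumed to match the definitions used here) are:

```python
import numpy as np

def precision_recall(beta_hat, beta_true, tol=0.0):
    """Precision: fraction of selected variables that are truly active.
    Recall: fraction of truly active variables that are selected.
    The conventions for empty selections are assumptions of this sketch."""
    sel = np.abs(beta_hat) > tol
    act = np.abs(beta_true) > tol
    tp = np.count_nonzero(sel & act)
    precision = tp / sel.sum() if sel.any() else 1.0
    recall = tp / act.sum() if act.any() else 1.0
    return precision, recall
```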
We report results for Scenario 1 with and , Scenario 2 with and and and and Scenario 3 with and .
In the scenarios we consider the Elastic Net and the Lasso have very similar behaviours, as do the Ensemble of Elastic Nets and the Ensemble of Lassos. For this reason, and due to the fact that we are comparing nine different procedures, we will only report the results for: the Lasso, the ensemble of Lassos, and the best performing among the remaining methods, excluding the Lasso, the Elastic Net and the two ensembles. The full results of the simulation can be found in the supplement to this article, which is available at https://github.com/esmucler/ensembleEN_sims.
Tables 2 to 9 show the results, which can be summarized as follows. The ensemble does as well as or better than the base Lasso. In one case, see Table 2, the base Lasso has an error that is nominally lower than the ensemble's, but the difference is negligible and can be attributed to variability in the cross-validation process, and also to the fact that we are comparing the ensemble with the base estimator as computed by glmnet, not with the degenerate single-model ensemble, so details of the implementation can play a role. On the other hand, the ensemble can lead to important improvements over the base estimator. In some settings the improvements generally range from around 5 to 15%, whereas in others they range from around 10 to 30%. In cases with an admittedly artificially high SNR of 50, the ensemble can have a PMSE that is half or less than half that of the base estimator. In general, the improvements tend to increase with the SNR and with the proportion of active variables. Moreover, in the overwhelming majority of the cases considered here, the ensemble has the lowest PMSE of all the competitors considered.
We also note that in general the recall of the ensemble is higher than that of the base estimator and those of the other competitors. The price to pay for this improvement is a decrease in precision, generally minor, but in some cases important (for example in Table 9).
We analyze the performance of the competitors considered in the previous section when predicting on real data-sets. For each data-set, to evaluate the prediction accuracy of the competitors we randomly split the data into a training set containing most of the observations and a testing set containing the rest. This is repeated 500 times and the resulting prediction MSEs are averaged. The results are reported relative to the best average performance among all estimators. Hence, the estimator with the best average performance will have a score of 1. All ensembles are formed using ten models.
The glass data set (Lemberge et al., 2000) was obtained from an electron probe X-ray microanalysis (EPXMA) of archaeological glass samples. A spectrum on 1920 frequencies was measured on a total of 180 glass samples. The goal is to predict the concentrations of several chemical compounds from the spectrum. After removing predictors with little variation, we are left with a reduced set of predictors. Table 10 shows the results. Highlighted in black is the best performing method for each compound. In ten out of the thirteen compounds considered the ensembles have the best performance, with improvements of up to 30%.
In Figure 2 we report the proportion of splits in which each variable was selected when the response variable is Al2O3. It is seen that the ensembles tend to pick models with more variables, which could in this case explain their superior predictive performance.
Scheetz et al. (2006) conducted a study of mammalian eye diseases, measuring the gene expressions of the eye tissues from 120 twelve-week-old male rats. The expression level of gene TRIM32 is of particular interest here, since it is linked with the Bardet-Biedl syndrome, a genetic disorder that affects several body systems, and it is taken as the response variable. Previous studies indicate that TRIM32 is only linked to a small number of genes, and for this reason we keep only the top 5000 genes with the highest sample variance, which are used as predictors. Hence we have p = 5000 predictors and n = 120 observations. Table 11 shows the results. It is seen that the performances of the Lasso, the Elastic Net and the ensembles are similar, with the ensembles reducing the average prediction error of the base estimators by 4%.
Next, we list the genes that were selected by each method in a large proportion of the splits. Gene 1376747_at is selected by all the methods. The Ens-Lasso selects two more genes, which in turn are a subset of the five genes selected by the Ens-EN. The rest of the methods did not select any variables that consistently.
Ens-Lasso: 1376747_at, 1390539_at, 1381902_at
Ens-EN: 1376267_at, 1376747_at, 1390539_at, 1381902_at, 1377792_at
We have proposed a novel method for forming ensembles of linear regression models. Examples using real and synthetic data-sets show that the approach systematically improves the prediction accuracy of the base estimators being ensembled. In the synthetic data-sets, the improvements tend to increase with the signal to noise ratio and the number of active variables.
The approach taken in this paper can be extended in several ways. Other sparsity penalties such as the SCAD can be handled similarly. In fact, the algorithm proposed here will work with any regularized model approach provided the coordinate descent updates can be expressed in closed form. Our method can be extended to GLMs by ensembling regularized GLM estimators instead of linear regression estimators; for example, ensembles of logistic regression models can be formed by replacing the quadratic loss in (1) with the negative log-likelihood of a logistic model. The method could also be robustified by replacing the quadratic loss with a robust loss function; in this case regularized robust regression estimators (see Smucler and Yohai (2017) for example) would be ensembled. Lower computational times may be achieved by using early stopping strategies when computing solution paths over one of the penalties and also by using an active set strategy when cycling over the groups, see Friedman et al. (2010).
An R package that implements the procedures proposed in this paper, called ensembleEN is available from https://github.com/esmucler/ensembleEN.
Consider any solution of the ensemble problem and fix . Let . Using the orthogonality of we have that
Hence if , . This proves 1.
Assume now and until the end of this proof that . Moreover, assume for now that . The case is dealt with similarly. Since , it cannot happen that . Assume that only one of or holds. Without loss of generality, we can assume . Hence
so that it must be that
and hence that
which implies . We have thus shown that if , and are both non-zero. Now assume both and are non-zero and . Then
Solving gives 2.a) for the first model; the same argument proves it for the second.
Using the orthogonality of it is easy to show that for any
In particular, if as before is any solution to the ensemble problem, we have that and hence minimizes
If , we get
Assume is the non-null coordinate in the ensemble. If we have