Feature-weighted elastic net
In some supervised learning settings, the practitioner might have additional information on the features used for prediction. We propose a new method which leverages this additional information for better prediction. The method, which we call the feature-weighted elastic net ("fwelnet"), uses these "features of features" to adapt the relative penalties on the feature coefficients in the elastic net penalty. In our simulations, fwelnet outperforms the lasso in terms of test mean squared error and usually gives an improvement in true positive rate or false positive rate for feature selection. We also apply this method to early prediction of preeclampsia, where fwelnet outperforms the lasso in terms of 10-fold cross-validated area under the curve (0.86 vs. 0.80). We also provide a connection between fwelnet and the group lasso and suggest how fwelnet might be used for multi-task learning.
We consider the usual linear regression model: given $n$ realizations of $p$ predictors $X = \{x_{ij}\}$ for $i = 1, \dots, n$ and $j = 1, \dots, p$, the response $y = (y_1, \dots, y_n)$ is modeled as

$$y_i = \beta_0 + \sum_{j=1}^p x_{ij}\beta_j + \epsilon_i,$$

with $\epsilon_i$ having mean $0$ and variance $\sigma^2$. The ordinary least squares (OLS) estimates of the $\beta_j$ are obtained by minimizing the residual sum of squares (RSS). There has been much work on regularized estimators that offer an advantage over the OLS estimates, both in terms of accuracy of prediction on future data and interpretation of the fitted model. One popular regularized estimator is the elastic net (Zou & Hastie, 2005) which minimizes the sum of the RSS and a combination of the $\ell_1$ and $\ell_2$-squared penalties. More precisely, letting $\beta = (\beta_1, \dots, \beta_p)^T$, the elastic net minimizes the objective function

$$J(\beta_0, \beta) = \frac{1}{2}\|y - \beta_0\mathbf{1} - X\beta\|_2^2 + \lambda\left[\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2\right].$$

The elastic net has two tuning parameters: $\lambda \geq 0$, which controls the overall sparsity of the solution, and $\alpha \in [0, 1]$, which determines the relative weight of the $\ell_1$ and $\ell_2$-squared penalties. $\alpha = 0$ corresponds to ridge regression (Hoerl & Kennard, 1970), while $\alpha = 1$ corresponds to the lasso (Tibshirani, 1996). These two tuning parameters are often chosen via cross-validation (CV). One reason for the elastic net’s popularity is its computational efficiency: $J$ is convex in its parameters, which means that solutions can be found efficiently even for very large $n$ and $p$. In addition, the solution for a whole path of $\lambda$ values can be computed quickly using warm starts (Friedman et al., 2010).
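To make the roles of the two tuning parameters concrete, the following minimal sketch evaluates an elastic net objective of the form $\frac{1}{2}\mathrm{RSS} + \lambda[\alpha\|\beta\|_1 + \frac{1-\alpha}{2}\|\beta\|_2^2]$. The function name `elastic_net_objective` and the toy data are our own illustrative choices, and the parameterization is assumed to be the glmnet-style one described above.

```python
def elastic_net_objective(X, y, beta0, beta, lam, alpha):
    """Return 0.5 * RSS + lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)."""
    rss = 0.0
    for x_i, y_i in zip(X, y):
        fitted = beta0 + sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))
        rss += (y_i - fitted) ** 2
    l1 = sum(abs(b) for b in beta)
    l2_sq = sum(b * b for b in beta)
    return 0.5 * rss + lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_sq)


# Toy data (hypothetical) that beta fits exactly, so only the penalty contributes.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
beta = [1.0, 2.0]

lasso_value = elastic_net_objective(X, y, 0.0, beta, lam=0.5, alpha=1.0)  # pure l1 penalty
ridge_value = elastic_net_objective(X, y, 0.0, beta, lam=0.5, alpha=0.0)  # pure l2-squared penalty
```

With $\alpha = 1$ only the $\ell_1$ norm of the coefficients is penalized, while with $\alpha = 0$ only half the squared $\ell_2$ norm is.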
In some supervised learning settings, we often have some information about the features themselves. For example, in genomics, we know that each gene belongs to one or more genetic pathways, and we may expect genes in the same pathway to have correlated effects on the response of interest. Another example is in image data, where each pixel has a specific position (row and column) in the image. We would expect methods which leverage such information on the features to perform better prediction and inference than methods which ignore it. However, many popular supervised learning methods, including the elastic net, do not use such information about the features in the model-fitting process.
In this paper, we develop a framework for organizing such feature information as well as a variant of the elastic net which uses this information in model-fitting. We assume the information we have for each feature is quantitative. This allows us to think of each source as a “feature” of the features. For example, in the genomics setting, the $k$th source of information could be the indicator variable for whether the $j$th feature belongs to the $k$th genetic pathway.
We organize these “features of features” into an auxiliary matrix $Z \in \mathbb{R}^{p \times K}$, where $p$ is the number of features and $K$ is the number of sources of feature information. Each column of $Z$ represents the values for each feature information source, while each row of $Z$ represents the values that a particular feature takes on for the $K$ different sources. We let $z_j \in \mathbb{R}^K$ denote the $j$th row of $Z$ as a column vector.

To make use of the information in $Z$, we propose assigning each feature a score $z_j^T\theta$, which is simply a linear combination of its “features of features”. We then use these scores to influence the weight given to each feature in the model-fitting procedure. Concretely, we give each feature a different penalty weight in the elastic net objective function based on its score:

$$J_{\lambda,\alpha,\theta}(\beta_0, \beta) = \frac{1}{2}\|y - \beta_0\mathbf{1} - X\beta\|_2^2 + \lambda\sum_{j=1}^p w_j(\theta)\left[\alpha|\beta_j| + \frac{1-\alpha}{2}\beta_j^2\right],$$

where $w_j(\theta) = f(z_j^T\theta)$ for some function $f$. $\theta \in \mathbb{R}^K$ is a hyperparameter which the algorithm needs to select. In the final model, $z_j^T\theta$ can be thought of as an indication of how influential feature $j$ is on the response, while $\theta_k$ represents how important the $k$th source of feature information is in identifying which features are important for the prediction problem.
The rest of this paper is organized as follows. In Section 2, we survey past work on incorporating “features of features” in supervised learning. In Section 3, we propose a method, the feature-weighted elastic net (“fwelnet”), which uses the scores in model-fitting. We then illustrate its performance on simulated data in Section 4 and on a real data example in Section 5. In Section 6, we present a connection between fwelnet and the group lasso, and in Section 7, we show how fwelnet can be used in multi-task learning. We end with a discussion and ideas for future work. The appendix contains further details and proofs.
The idea of assigning different penalty weights for different features in the lasso or elastic net objective is not new. For example, the adaptive lasso (Zou, 2006) assigns feature $j$ a penalty weight $w_j = 1/|\hat{\beta}_j^{OLS}|^{\gamma}$, where $\hat{\beta}_j^{OLS}$ is the estimated coefficient for feature $j$ in the OLS model and $\gamma > 0$ is some hyperparameter. However, the OLS solution only depends on $X$ and $y$ and does not incorporate any external information on the features. In the work closest to ours, Bergersen et al. (2011) propose using weights $w_j = 1/|\eta_j(y, X, Z)|^q$, where $\eta_j$ is some function (possibly varying for $j$) and $q$ is a hyperparameter controlling the shape of the weight function. While the authors present two ideas for what the $\eta_j$’s could be, they do not give general guidance on how to choose these functions, which could drastically influence the model-fitting algorithm.
There is a correspondence between penalized regression estimates and Bayesian maximum a posteriori (MAP) estimates with a particular choice of prior for the coefficients. (For example, ridge regression and lasso regression are MAP estimates when the coefficient vector $\beta$ is given a normal and Laplace prior respectively.) Within this Bayesian framework, some methods have been developed to use external feature information to guide the choice of prior. For example, van de Wiel et al. (2016) take an empirical Bayes approach to estimate the prior for ridge regression, while Velten & Huber (2018) use variational Bayes to do so for general convex penalties.
We also note that most previous approaches for penalized regression with external information on the features only work with specific types of such information. A large number of methods have been developed to make use of feature grouping information. Popular methods for using grouping information in penalized regression include the group lasso (Yuan & Lin, 2006) and the overlap group lasso (Jacob et al., 2009). IPF-LASSO (integrative lasso with penalty factors) (Boulesteix et al., 2017) gives the features in each group their own penalty parameter, to be chosen via cross-validation. Tai & Pan (2007) modify the penalized partial least squares (PLS) and nearest shrunken centroids methods to have group-specific penalties.
Other methods have been developed to incorporate “network-like” or feature similarity information, where the user has information on how the features themselves are related to each other. For example, the fused lasso (Tibshirani et al., 2005) adds an $\ell_1$ penalty on the successive differences of the coefficients to impose smoothness on the coefficient profile. Structured elastic net (Slawski et al., 2010) generalizes the fused lasso by replacing the $\ell_2$-squared penalty in elastic net with $\beta^T\Lambda\beta$, where $\Lambda$ is a symmetric, positive semi-definite matrix chosen to reflect some a priori known structure between the features. Mollaysa et al. (2017) use the feature information matrix $Z$ to compute a feature similarity matrix, which is used to construct a regularization term in the loss criterion to be minimized. The regularizer encourages the model to give the same output as long as the total contribution of similar features is the same. We note that this approach implicitly assumes that the sources of feature information are equally relevant, which may or may not be the case.
It is not clear how most of the methods in the previous two paragraphs can be extended to more general sources of feature information. Our method has the distinction of being able to work directly with real-valued feature information and to integrate multiple sources of feature information. We note that while van de Wiel et al. (2016) claim to be able to handle binary, nominal, ordinal and continuous feature information, the method actually ranks and groups features based on such information and only uses this grouping information in the estimation of the group-specific penalties. Nevertheless, it is able to incorporate more than one source of feature information.
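As mentioned in the introduction, one direct way to utilize the scores $z_j^T\theta$ in model-fitting is to give each feature a different penalty weight in the elastic net objective function based on its score:

$$J_{\lambda,\alpha,\theta}(\beta_0, \beta) = \frac{1}{2}\|y - \beta_0\mathbf{1} - X\beta\|_2^2 + \lambda\sum_{j=1}^p w_j(\theta)\left[\alpha|\beta_j| + \frac{1-\alpha}{2}\beta_j^2\right],$$

where $w_j(\theta) = f(z_j^T\theta)$ for some function $f$. Our proposed method, which we call the feature-weighted elastic net (“fwelnet”), specifies $f$:

$$w_j(\theta) = \frac{\sum_{\ell=1}^p \exp(z_\ell^T\theta)}{p\,\exp(z_j^T\theta)}. \tag{6}$$

The fwelnet algorithm seeks the minimizer of this objective function over $\beta_0$ and $\beta$:

$$(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} J_{\lambda,\alpha,\theta}(\beta_0, \beta) = \arg\min_{\beta_0, \beta} \frac{1}{2}\|y - \beta_0\mathbf{1} - X\beta\|_2^2 + \lambda\sum_{j=1}^p w_j(\theta)\left[\alpha|\beta_j| + \frac{1-\alpha}{2}\beta_j^2\right]. \tag{7}$$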
There are a number of reasons for this choice of penalty factors. First, when $\theta = 0$, we have $w_j(\theta) = 1$ for all $j$, reducing fwelnet to the original elastic net algorithm. Second, $w_j(\theta) \geq 1/p$ for all $j$ and $\theta$, ensuring that we do not end up with features having negligible penalty. This allows the fwelnet solution to have a wider range of sparsity as we go down the path of $\lambda$ values. Third, this formulation provides a connection between fwelnet and the group lasso (Yuan & Lin, 2006) which we detail in Section 6. Finally, we have a natural interpretation of a feature’s score: if $z_j^T\theta$ is relatively large, then $w_j$ is relatively small, meaning that feature $j$ is more important for the response and hence should have smaller regularization penalty.
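These properties can be checked numerically. The sketch below assumes the penalty factor has the form $w_j(\theta) = \sum_{\ell} \exp(z_\ell^T\theta) / (p \exp(z_j^T\theta))$; the function name `penalty_factors` and the small grouping matrix are hypothetical and chosen only for illustration.

```python
import math


def penalty_factors(Z, theta):
    """w_j = sum_l exp(z_l^T theta) / (p * exp(z_j^T theta)) for each row z_j of Z."""
    scores = [sum(z_jk * t_k for z_jk, t_k in zip(z_j, theta)) for z_j in Z]
    shift = max(scores)  # subtracting a constant from every score leaves the ratio unchanged
    exps = [math.exp(s - shift) for s in scores]  # assumes a moderate spread of scores
    total = sum(exps)
    p = len(Z)
    return [total / (p * e) for e in exps]


# Four features in two groups, encoded as group indicators (hypothetical example).
Z = [[1, 0], [1, 0], [0, 1], [0, 1]]

w_zero = penalty_factors(Z, [0.0, 0.0])  # theta = 0: every factor equals 1
w_tilt = penalty_factors(Z, [2.0, 0.0])  # higher score for group 1: smaller penalty there
```

With `theta = [2.0, 0.0]` the first two features receive a smaller factor than the last two, and every factor stays at or above $1/p$.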
We illustrate the last property via a simulated example. In this simulation, we have $n$ observations and $p$ features which come in groups of equal size. The response is a linear combination of the first two groups with additive Gaussian noise. The coefficient for the first group is larger in magnitude than the coefficient for the second group, so that the features in the first group exhibit stronger correlation to the response compared to the second group. The “features of features” matrix $Z$ is grouping information, i.e. $z_{jk} = 1$ if feature $j$ belongs to group $k$, and is $0$ otherwise. Figure 1 shows the penalty factors $w_j$ that fwelnet assigns the features. As one would expect, the features in the first group have the smallest penalty factor followed by features in the second group. In contrast, the original elastic net algorithm would assign penalty factors $w_j = 1$ for all $j$.
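It can be easily shown that $\hat{\beta}_0 = \bar{y} - \sum_{j=1}^p \hat{\beta}_j \bar{x}_j$, where $\bar{y}$ and $\bar{x}_j$ denote the means of $y$ and of the $j$th column of $X$. Henceforth, we assume that $y$ and the columns of $X$ are centered so that $\hat{\beta}_0 = 0$ and we can ignore the intercept term in the rest of the discussion.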
For given values of $\lambda$, $\alpha$ and $\theta$, it is easy to solve (7): the objective function is convex in $\beta$ (in fact it is piecewise-quadratic in $\beta$) and $\hat{\beta}$ can be found efficiently using algorithms such as coordinate descent. However, to deploy fwelnet in practice we need to determine the hyperparameter values $\lambda$, $\alpha$ and $\theta$ that give good performance. When $K$, the number of sources of feature information, is small, one could run the algorithm for a grid of $\theta$ values, then pick the value which gives the smallest cross-validated loss. Unfortunately, this approach is computationally infeasible for even moderate values of $K$.
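For fixed penalty factors, the weighted elastic net problem can be solved with a textbook cyclic coordinate descent. The sketch below is our own minimal illustration, not the implementation used in glmnet or the fwelnet package: it assumes no intercept, no all-zero columns, and the objective $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\sum_j w_j[\alpha|\beta_j| + \frac{1-\alpha}{2}\beta_j^2]$. The names `soft_threshold` and `weighted_enet_cd` are hypothetical.

```python
def soft_threshold(a, t):
    """Soft-thresholding operator: sign(a) * max(|a| - t, 0)."""
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0


def weighted_enet_cd(X, y, weights, lam, alpha, n_sweeps=200):
    """Cyclic coordinate descent for the weighted elastic net without an intercept."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta initialized at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Inner product of column j with the partial residual (feature j's fit added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam * alpha * weights[j]) / (
                col_sq[j] + lam * (1.0 - alpha) * weights[j]
            )
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


# Toy data with orthogonal columns, so the solution can be checked by hand.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 1.0, 2.0, 1.0]
beta_equal = weighted_enet_cd(X, y, weights=[1.0, 1.0], lam=1.0, alpha=1.0)
beta_tilted = weighted_enet_cd(X, y, weights=[0.5, 3.0], lam=1.0, alpha=1.0)
```

Raising the penalty factor on the second feature from 1 to 3 drives its coefficient to exactly zero, while lowering the factor on the first feature shrinks it less.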
To avoid this computational bottleneck, we propose Algorithm 1 as a method to find $\hat{\beta}$ and $\theta$ at the same time. If we think of $\theta$ as an argument of the objective function $J$, Step 3 can be thought of as alternating minimization over $\theta$ and $\beta$. Notice that in Step 3(c), we allow the algorithm to have a different value of $\hat{\beta}$ for each $\lambda$ value. However, we force $\theta$ to be the same across all $\lambda$ values: Steps (a) and (b) can be thought of as a heuristic to perform gradient descent for $\theta$ under this constraint.
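Since the RSS term does not involve $\theta$, the gradient of the objective with respect to $\theta$ comes entirely from the penalty. Assuming the penalty factor $w_j(\theta) = \sum_\ell \exp(z_\ell^T\theta) / (p\exp(z_j^T\theta))$, differentiating gives $\partial w_j / \partial\theta_k = w_j(\bar{z}_k - z_{jk})$, where $\bar{z}_k = \sum_\ell \pi_\ell z_{\ell k}$ and $\pi_\ell \propto \exp(z_\ell^T\theta)$. This is our own derivation rather than a formula quoted from Algorithm 1; the sketch below implements it for a single $\lambda$ value, with hypothetical function names and inputs.

```python
import math


def weights_and_softmax(Z, theta):
    """Return penalty factors w_j and softmax probabilities pi_j of the scores z_j^T theta."""
    scores = [sum(z * t for z, t in zip(z_j, theta)) for z_j in Z]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    p = len(Z)
    return [total / (p * e) for e in exps], [e / total for e in exps]


def enet_cost(b, alpha):
    """Per-coefficient elastic net cost alpha*|b| + (1 - alpha)/2 * b^2."""
    return alpha * abs(b) + 0.5 * (1.0 - alpha) * b * b


def penalty_term(Z, theta, beta, lam, alpha):
    """lam * sum_j w_j(theta) * cost_j, the only part of the objective depending on theta."""
    weights, _ = weights_and_softmax(Z, theta)
    return lam * sum(w * enet_cost(b, alpha) for w, b in zip(weights, beta))


def penalty_gradient(Z, theta, beta, lam, alpha):
    """Analytic gradient of penalty_term with respect to theta."""
    weights, softmax = weights_and_softmax(Z, theta)
    p, K = len(Z), len(theta)
    zbar = [sum(softmax[l] * Z[l][k] for l in range(p)) for k in range(K)]
    return [
        lam * sum(weights[j] * enet_cost(beta[j], alpha) * (zbar[k] - Z[j][k]) for j in range(p))
        for k in range(K)
    ]


# Hypothetical inputs: three features, two sources of feature information.
Z = [[1.0, 0.5], [0.0, 1.0], [2.0, -1.0]]
theta = [0.3, -0.2]
beta = [1.0, -2.0, 0.0]
grad = penalty_gradient(Z, theta, beta, lam=0.7, alpha=0.5)
```

In a procedure like the one described above, such gradients would be computed at each $\lambda$ value and then combined (for example by a component-wise mean or median) before taking a descent step in $\theta$.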
We have developed an R package, fwelnet, which implements Algorithm 1. We note that Step 3(c) of Algorithm 1 can be done easily by using the glmnet function in the glmnet R package and specifying the penalty.factor option. In practice, we use the $\lambda$ sequence provided by glmnet’s implementation of the elastic net, as this sequence covers a sufficiently wide range of models. With this choice of $\lambda$ sequence, we find that fwelnet’s performance does not change much whether we use the component-wise mean or median in Step 3(a), or the mean or median in Step 3(b). Also, instead of running Step 3 until convergence, we recommend running it for a small fixed number of iterations. Step 3(c) is the bottleneck of the algorithm, so the runtime of fwelnet is roughly a small multiple of that of glmnet, growing with the number of iterations. In our simulation studies, we often find that one pass of Step 3 gives a sufficiently good solution. We suggest treating the number of iterations as a hyperparameter and running fwelnet for a few small values of it.
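In the exposition above, the elastic net is described as a regularized version of the ordinary least squares model. It is easy to extend elastic net regularization to generalized linear models (GLMs) by replacing the RSS term with the negative log-likelihood of the data:

$$(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \sum_{i=1}^n \ell\Big(y_i, \beta_0 + \sum_{j=1}^p x_{ij}\beta_j\Big) + \lambda\sum_{j=1}^p\left[\alpha|\beta_j| + \frac{1-\alpha}{2}\beta_j^2\right], \tag{8}$$

where $\ell\big(y_i, \beta_0 + \sum_{j} x_{ij}\beta_j\big)$ is the negative log-likelihood contribution of observation $i$. Fwelnet can be extended to GLMs in a similar fashion:

$$(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0, \beta} \sum_{i=1}^n \ell\Big(y_i, \beta_0 + \sum_{j=1}^p x_{ij}\beta_j\Big) + \lambda\sum_{j=1}^p w_j(\theta)\left[\alpha|\beta_j| + \frac{1-\alpha}{2}\beta_j^2\right], \tag{9}$$

with $w_j(\theta)$ as defined in (6). Theoretically, Algorithm 1 can be used as-is to solve (9). Because $\theta$ only appears in the penalty term and not in the negative log-likelihood, this extension is not hard to implement in code. The biggest hurdle to this extension is a solver for (8), which is needed for Steps 2 and 3(c). Step 3(a) is the same as before, while Step 3(b) simply requires a function that allows us to compute the negative log-likelihood $\ell$.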
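As a concrete instance of the kind of function Step 3(b) would need, the sketch below evaluates a weighted elastic net objective for logistic regression, where the negative log-likelihood of one observation with $y_i \in \{0, 1\}$ and linear predictor $\eta_i$ is $\log(1 + e^{\eta_i}) - y_i\eta_i$. The binomial family, the function names and the toy data are our assumptions for illustration; any other GLM family would only change `logistic_nll`.

```python
import math


def logistic_nll(y_i, eta):
    """Negative log-likelihood of one Bernoulli observation y_i in {0, 1} with logit eta."""
    # log(1 + exp(eta)), computed in a numerically stable way for large |eta|.
    if eta > 0:
        log1pexp = eta + math.log1p(math.exp(-eta))
    else:
        log1pexp = math.log1p(math.exp(eta))
    return log1pexp - y_i * eta


def fwelnet_glm_objective(X, y, beta0, beta, weights, lam, alpha):
    """Sum of logistic negative log-likelihoods plus the feature-weighted elastic net penalty."""
    nll = 0.0
    for x_i, y_i in zip(X, y):
        eta = beta0 + sum(x_ij * b_j for x_ij, b_j in zip(x_i, beta))
        nll += logistic_nll(y_i, eta)
    penalty = sum(
        w * (alpha * abs(b) + 0.5 * (1.0 - alpha) * b * b) for w, b in zip(weights, beta)
    )
    return nll + lam * penalty


# Hypothetical toy data with a binary response.
X = [[0.5, -1.0], [1.5, 0.0], [-2.0, 1.0]]
y = [1, 0, 1]
null_value = fwelnet_glm_objective(X, y, 0.0, [0.0, 0.0], [1.0, 1.0], lam=0.3, alpha=1.0)
unit_weights = fwelnet_glm_objective(X, y, 0.0, [1.0, 0.0], [1.0, 1.0], lam=0.3, alpha=1.0)
double_weight = fwelnet_glm_objective(X, y, 0.0, [1.0, 0.0], [2.0, 1.0], lam=0.3, alpha=1.0)
```

Changing the penalty factors changes only the penalty part of the objective; the likelihood part is untouched, which is why the extension is simple once a solver for the unweighted GLM problem is available.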
We tested the performance of fwelnet against other methods in a simulation study. In the three settings studied, the true signal is a linear combination of the columns of $X$, with the true coefficient vector $\beta$ being sparse. The response $y$ is the signal corrupted by additive Gaussian noise. In each setting, we gave different types of feature information to fwelnet to determine the method’s effectiveness.
For all methods, we used cross-validation (CV) to select the tuning parameter $\lambda$. Unless otherwise stated, the hyperparameter $\alpha$ was set to $1$ (i.e. no $\ell_2$-squared penalty) and Step 3 of Algorithm 1 was run for one iteration, with the mean used for Steps 3(a) and 3(b). To compare methods, we considered the mean squared error (MSE) achieved on 10,000 test points, as well as the true positive rate (TPR) and false positive rate (FPR) of the fitted models. (The oracle model which knows the true coefficient vector has a test MSE of .) We ran each simulation 30 times to get estimates for these quantities. (See Appendix B for details of the simulations.)
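For reference, the feature-selection metrics can be computed as in the sketch below. It uses the conventional definitions (TPR is the fraction of truly non-zero coefficients that are selected, FPR is the fraction of truly zero coefficients that are selected), which we assume match those used in the simulations; the function name and example vectors are hypothetical.

```python
def selection_rates(beta_true, beta_hat, tol=0.0):
    """Return (TPR, FPR) of the support of beta_hat relative to the support of beta_true."""
    true_support = {j for j, b in enumerate(beta_true) if b != 0}
    selected = {j for j, b in enumerate(beta_hat) if abs(b) > tol}
    n_null = len(beta_true) - len(true_support)
    tpr = len(selected & true_support) / len(true_support) if true_support else float("nan")
    fpr = len(selected - true_support) / n_null if n_null else float("nan")
    return tpr, fpr


# Two true signals; the fit recovers one of them and adds one false positive.
beta_true = [2.0, -1.0, 0.0, 0.0, 0.0]
beta_hat = [1.5, 0.0, 0.3, 0.0, 0.0]
tpr, fpr = selection_rates(beta_true, beta_hat)
```

Here one of the two true features is selected and one of the three null features is selected.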
In this setting, we have $n$ observations and $p$ features, with the true signal being a linear combination of just the first 10 features. The feature information matrix $Z$ has two columns: a noisy version of $|\beta|$ and a column of ones.
We compared fwelnet against the lasso (using the glmnet package) across a range of signal-to-noise ratios (SNR) in both the response $y$ and the feature information matrix $Z$ (see details in Appendix B.1). The results are shown in Figure 2. As we would expect, the test MSE figures for both methods decreased as the SNR in the response increased. The improvement of fwelnet over the lasso increased as the SNR in $Z$ increased. In terms of feature selection, fwelnet appeared to have similar TPR to the lasso but had smaller FPR.
In this setting, we have $n$ observations and $p$ features, with the features coming in 15 groups of size 10. The feature information matrix $Z$ contains group membership information for the features: $z_{jk} = 1\{\text{feature } j \in \text{group } k\}$. We compared fwelnet against the lasso and the group lasso (using the grpreg package) across a range of signal-to-noise ratios (SNR) in the response $y$.
We considered two different responses in this setting. The first response we considered was a linear combination of the features in the first group only, with additive Gaussian noise. The results are depicted in Figure 3. In terms of test MSE, fwelnet was competitive with the group lasso in the low SNR scenario and came out on top for the higher SNR settings. In terms of feature selection, fwelnet had comparable TPR to the group lasso but drastically smaller FPR. Fwelnet had better TPR and FPR than the lasso in this case. We believe that fwelnet’s improvement over the group lasso could be because the true signal was sparse: fwelnet’s connection to the $\ell_1$ version of the group lasso (see Section 6 for details) encourages greater sparsity than the usual group lasso penalty based on $\ell_2$ norms.
The second response we considered in this setting was not as sparse in the features: the true signal was a linear combination of the first 4 feature groups. The results are shown in Figure 4. In this case, the group lasso performed better than fwelnet when the hyperparameter $\alpha$ was fixed at 1, which is in line with our earlier intuition that fwelnet would perform better in sparser settings. It is worth noting that fwelnet with $\alpha = 1$ performs appreciably better than the lasso when the SNR is higher. Selecting $\alpha$ via cross-validation improved the test MSE performance of fwelnet, but not enough to outperform the group lasso. The improvement in test MSE also came at the expense of very high FPR.
In this setting, we have $n$ observations and $p$ features, with the true signal being a linear combination of just the first 10 features. The feature information matrix $Z$ consists of 10 noise variables that have nothing to do with the response. Since fwelnet is adapting to these features, we expect it to perform worse than comparable methods.
We compare fwelnet against the lasso across a range of signal-to-noise ratios (SNR) in the response $y$. The results are depicted in Figure 5. As expected, fwelnet has higher test MSE than the lasso, but the decrease in performance is not drastic. Fwelnet attained similar FPR and TPR to the lasso.
Preeclampsia is a leading cause of maternal and neonatal morbidity and mortality, affecting 5 to 10 percent of all pregnancies. The biological and phenotypical signals associated with late-onset preeclampsia strengthen during the course of pregnancy, often resulting in a clinical diagnosis after 20 weeks of gestation (Zeisler et al., 2016). An earlier test for prediction of late-onset preeclampsia will enable timely interventions for improvement of maternal and neonatal outcomes (Jabeen et al., 2011). In this example, we seek to leverage data collected in late pregnancy to guide the optimization of a predictive model for early diagnosis of late-onset preeclampsia.
We used a dataset of plasma proteins measured during various gestational ages of pregnancy (Erez et al., 2017). For this example, we classified each time point as “early” or “late” according to its gestational age in weeks. We had measurements for between 2 and 6 time points for each of the 166 patients, for a total of 666 time point observations. Protein measurements were log-transformed to reduce skew. We first split the patients equally into two buckets. For patients in the first bucket, we used only their late time points (83 patients with 219 time points) to train an elastic net model to predict whether the patient would have preeclampsia. From this late time point model, we extracted model coefficients at the hyperparameter value which gave the highest 10-fold cross-validated (CV) area under the curve (AUC). For patients in the second bucket, we used only their early time points (83 patients with 116 time points) to train a fwelnet model with the absolute values of the late time point model coefficients as feature information. When performing CV, we made sure that observations from one patient all belonged to the same CV fold to avoid “contamination” of the held-out fold. One can also run the fwelnet model with additional sources of feature information for each of the proteins.
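Keeping all of a patient's observations in the same CV fold amounts to assigning folds at the patient level and then mapping them back to observations. The sketch below shows one simple way to do this; the function name `grouped_folds`, the round-robin assignment and the patient labels are our own illustrative choices, not the procedure used to produce the reported results.

```python
import random


def grouped_folds(patient_ids, n_folds, seed=0):
    """Assign a fold index (0 .. n_folds - 1) to each observation so that all
    observations sharing a patient id land in the same fold."""
    patients = sorted(set(patient_ids))
    random.Random(seed).shuffle(patients)  # shuffle patients, not observations
    fold_of_patient = {pid: i % n_folds for i, pid in enumerate(patients)}
    return [fold_of_patient[pid] for pid in patient_ids]


# Hypothetical example: 10 time point observations from 6 patients, 3 folds.
patient_ids = ["a", "a", "b", "c", "c", "c", "d", "e", "e", "f"]
fold_ids = grouped_folds(patient_ids, n_folds=3)
```

The resulting vector of fold labels can then be supplied to whichever cross-validation routine is used, in place of randomly assigned observation-level folds.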
We compare the 10-fold CV AUC for fwelnet run with 1, 2 and 5 minimization iterations (i.e. Step 3 in Algorithm 1) against the lasso as a baseline. Figure 6 shows a plot of 10-fold CV AUC for these methods against the number of features with non-zero coefficients in the model. The lasso obtains a maximum CV AUC of 0.80, while fwelnet with 2 minimization iterations obtains the largest CV AUC of 0.86 among all methods.
We note that the results were somewhat dependent on (i) how the patients were split into the early and late time point models, and (ii) how patients were split into CV folds when training each of these models. We found that if the late time point model had few non-zero coefficients, then the fwelnet model for the early time point data was very similar to that for the lasso. This matches our intuition: if there are few non-zero coefficients, then we are injecting very little additional information through the relative penalty factors in fwelnet, and so it will give a very similar model to elastic net. Nevertheless, we did not encounter cases where running fwelnet gave worse CV AUC than the lasso.
One common setting where “features of features” arise naturally is when the features come in non-overlapping groups. Assume that the features in $X$ come in $K$ non-overlapping groups. Let $p_k$ denote the number of features in group $k$, and let $\beta^{(k)}$ denote the subvector of $\beta$ which belongs to group $k$. Assume also that $y$ and the columns of $X$ are centered so that $\hat{\beta}_0 = 0$. In this setting, Yuan & Lin (2006) introduced the group lasso estimate as the solution to the optimization problem

$$\min_{\beta} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\sum_{k=1}^K \|\beta^{(k)}\|_2.$$

The $\ell_2$ penalty on features at the group level ensures that features belonging to the same group are either all included in the model or all excluded from it. Often, the penalty given to group $k$ is modified by a factor of $\sqrt{p_k}$ to take into account the varying group sizes:

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\sum_{k=1}^K \sqrt{p_k}\,\|\beta^{(k)}\|_2.$$
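As a small illustration of the two group lasso penalties, the sketch below computes $\lambda\sum_k \|\beta^{(k)}\|_2$ and its size-weighted variant $\lambda\sum_k \sqrt{p_k}\,\|\beta^{(k)}\|_2$ for a coefficient vector with given group labels. The function name and the example values are hypothetical.

```python
import math


def group_lasso_penalty(beta, groups, lam, size_weighted=True):
    """lam * sum_k c_k * ||beta^(k)||_2, with c_k = sqrt(p_k) if size_weighted else 1."""
    by_group = {}
    for b, g in zip(beta, groups):
        by_group.setdefault(g, []).append(b)
    total = 0.0
    for coefs in by_group.values():
        norm = math.sqrt(sum(b * b for b in coefs))
        factor = math.sqrt(len(coefs)) if size_weighted else 1.0
        total += factor * norm
    return lam * total


# Group 0 has two active features; group 1 has three features, all zero.
beta = [3.0, 4.0, 0.0, 0.0, 0.0]
groups = [0, 0, 1, 1, 1]
plain = group_lasso_penalty(beta, groups, lam=2.0, size_weighted=False)
weighted = group_lasso_penalty(beta, groups, lam=2.0, size_weighted=True)
```

A group whose coefficients are all zero contributes nothing to either penalty, which is what allows whole groups to drop out of the model.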
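Theorem 1 below establishes a connection between fwelnet and the group lasso. For the moment, consider the more general penalty factor $w_j(\theta) = \dfrac{\sum_{\ell=1}^p f(z_\ell^T\theta)}{p\,f(z_j^T\theta)}$, where $f$ is some function with range $[0, +\infty)$. (Fwelnet makes the choice $f(x) = e^x$.)

Theorem 1. If the “features of features” matrix $Z \in \mathbb{R}^{p \times K}$ is given by $z_{jk} = 1\{\text{feature } j \in \text{group } k\}$, then minimizing the fwelnet objective function (7) jointly over $\beta_0$, $\beta$ and $\theta$ reduces to

$$\hat{\beta} = \begin{cases} \arg\min_{\beta} \dfrac{1}{2}\|y - X\beta\|_2^2 + \lambda'\displaystyle\sum_{k=1}^K \sqrt{p_k}\,\|\beta^{(k)}\|_2 & \text{if } \alpha = 0, \\[2ex] \arg\min_{\beta} \dfrac{1}{2}\|y - X\beta\|_2^2 + \lambda'\displaystyle\sum_{k=1}^K \sqrt{p_k\,\|\beta^{(k)}\|_1} & \text{if } \alpha = 1, \end{cases}$$

for some $\lambda' \geq 0$.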
We turn now to an application of fwelnet to multi-task learning. In some applications, we have a single model matrix $X$ but are interested in multiple responses. If there is some common structure between the signals in the responses, it can be advantageous to fit models for them simultaneously. This is especially the case if the signal-to-noise ratios in the responses are low.

We demonstrate how fwelnet can be used to learn better models in the setting with two responses, $y_1$ and $y_2$. The idea is to use the absolute value of coefficients of one response as the external information for the other response. That way, a feature which has larger influence on one response is likely to be given a correspondingly lower penalty weight when fitting the other response. Algorithm 2 presents one possible way of doing so.
We tested the effectiveness of Algorithm 2 (with step 2 run for 3 iterations) on simulated data. We generate 150 observations with 50 independent features. The signal in response 1 is a linear combination of features 1 to 10, while the signal in response 2 is a linear combination of features 1 to 5 and 11 to 15. The coefficients are set such that those for the common features (i.e. features 1 to 5) have larger absolute value than those for the features specific to one response. The signal-to-noise ratios (SNRs) in response 1 and response 2 are 0.5 and 1.5 respectively. (See Appendix D for more details of the simulation.)
We compared Algorithm 2 against: (i) the individual lasso (ind_lasso), where the lasso is run separately for $y_1$ and $y_2$; and (ii) the multi-response lasso (mt_lasso), where coefficients belonging to the same feature across the responses are given a joint $\ell_2$ penalty. Because of the $\ell_2$ penalty, a feature is either included or excluded in the model for all the responses at the same time.
The results are shown in Figure 7 for 50 simulation runs. Fwelnet outperforms the other two methods in test MSE as evaluated on 10,000 test points. As expected, the lasso run individually for each response performs well in the response with higher SNR but poorly in the response with lower SNR. The multi-response lasso is able to borrow strength from the higher SNR response to obtain good performance on the lower SNR response. However, because the models for both responses are forced to consist of the same set of features, performance suffers on the higher SNR response. Fwelnet has the ability to borrow strength across responses without being hampered by this restriction.
In this paper, we have proposed organizing information about predictor variables, which we call “features of features”, as a matrix $Z$, and modifying model-fitting algorithms by assigning each feature a score, $z_j^T\theta$, based on this auxiliary information. We have proposed one such method, the feature-weighted elastic net (“fwelnet”), which imposes a penalty modification factor $w_j(\theta)$ for the elastic net algorithm.
There is much scope for future work:
Choice of penalty modification factor. While the penalty modification factors we have proposed work well in practice and have several desirable properties, we make no claim about their optimality. We also do not have well-developed theory for the current choice of penalty factor.
Extending the use of scores beyond the elastic net. The use of feature scores in modifying the weight given to each feature in the model-fitting process is a general idea that could apply to any supervised learning algorithm. More work needs to be done on how such scores can be incorporated, with particular focus on how $\theta$ can be learned through the algorithm.
Whether $\theta$ should be treated as a parameter or a hyperparameter, and how to determine its value. In this paper, we introduced $\theta$ as a hyperparameter for (7). This formulation gives us a clear interpretation for $\theta$: $\theta_k$ is a proxy for how important the $k$th source of feature information is for identifying which features are important. With this interpretation, we do not expect $\theta$ to change across $\lambda$ values.
When $\theta$ is treated as a hyperparameter, we noted that the time needed for a grid search to find its optimal value grows exponentially with the number of sources of feature information. To avoid this computational burden, we suggested a descent algorithm for $\theta$ based on the gradient of the fwelnet objective function with respect to $\theta$ (Steps 3(a) and 3(b) in Algorithm 1). There are other methods for hyperparameter optimization such as random search (e.g. Bergstra & Bengio (2012)) or Bayesian optimization (e.g. Snoek et al. (2012)) that could be applied to this problem.
One could consider $\theta$ as an argument of the fwelnet objective function to be minimized over jointly with $\beta$. One benefit of this approach is that it gives us a theoretical connection to the group lasso (Section 6). However, we will obtain different estimates of $\theta$ for each value of the hyperparameter $\lambda$, which may be undesirable for interpretation. The objective function is also not jointly convex in $\theta$ and $\beta$, meaning that different minimization algorithms could end up at different local minima. In our attempts to make this approach work (see Appendix A), it did not fare as well in prediction performance and was computationally expensive. It remains to be seen if there is a computationally efficient algorithm which treats $\theta$ as a parameter to be optimized for each $\lambda$ value.
An R language package fwelnet which implements our method is available at https://www.github.com/kjytay/fwelnet.
Acknowledgements: Nima Aghaeepour was supported by the Bill & Melinda Gates Foundation (OPP1189911, OPP1113682), the National Institutes of Health (R01AG058417, R01HL13984401, R61NS114926, KL2TR003143), the Burroughs Wellcome Fund and the March of Dimes Prematurity Research Center at Stanford University. Trevor Hastie was partially supported by the National Science Foundation (DMS-1407548 and IIS1837931) and the National Institutes of Health (5R01 EB001988-21). Robert Tibshirani was supported by the National Institutes of Health (5R01 EB001988-16) and the National Science Foundation (19 DMS1208164).
Assume that $y$ and the columns of $X$ are centered so that $\hat{\beta}_0 = 0$ and we can ignore the intercept term in the rest of the discussion. If we consider $\theta$ as an argument of the objective function, then we wish to solve

$$(\hat{\beta}, \hat{\theta}) = \arg\min_{\beta, \theta} J_{\lambda,\alpha}(\beta, \theta) = \arg\min_{\beta, \theta} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda\sum_{j=1}^p w_j(\theta)\left[\alpha|\beta_j| + \frac{1-\alpha}{2}\beta_j^2\right].$$

$J_{\lambda,\alpha}$ is not jointly convex in $\beta$ and $\theta$, so reaching a global minimum is a difficult task. Instead, we content ourselves with reaching a local minimum. A reasonable approach for doing so is to alternate between optimizing $\beta$ and $\theta$: the steps are outlined in Algorithm 3.

Unfortunately, Algorithm 3 is slow due to repeated solving of the elastic net problem in Step 2(b)ii for each $\lambda$ value. The algorithm does not take advantage of the fact that once $\alpha$ and $\theta$ are fixed, the elastic net problem can be solved quickly for an entire path of $\lambda$ values. We have also found that Algorithm 3 does not predict as well as Algorithm 1 in our simulations.
Set , , with for , for , and otherwise.
Generate for and .
For each and :
Generate , where for .
Generate , where . Append a column of ones to get .
Set , .
For and , set if , otherwise.
Generate with or
with equal probability for, otherwise.
Generate for and .
For each :
Generate , where for .
Set , , with for , and otherwise.
Generate for and .
For each :
Generate , where for .
Generate for and . Append a column of ones to get .
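First note that if feature $j$ belongs to group $k$, then $z_j^T\theta = \theta_k$, and its penalty factor is

$$w_j(\theta) = \frac{\sum_{\ell=1}^p f(z_\ell^T\theta)}{p\,f(\theta_k)} = \frac{\sum_{\ell=1}^K p_\ell f(\theta_\ell)}{p\,f(\theta_k)},$$

where $p_k$ denotes the number of features in group $k$. Letting $v_k = \dfrac{f(\theta_k)}{\sum_{\ell=1}^K p_\ell f(\theta_\ell)}$ for $k = 1, \dots, K$, minimizing the fwelnet objective function (7) over $\beta$ and $\theta$ reduces to

$$\arg\min_{\beta, v} \frac{1}{2}\|y - X\beta\|_2^2 + \frac{\lambda}{p}\sum_{k=1}^K \frac{1}{v_k}\left[\alpha\|\beta^{(k)}\|_1 + \frac{1-\alpha}{2}\|\beta^{(k)}\|_2^2\right].$$

For fixed $\beta$, we can explicitly determine the $v_k$ values which minimize the expression above. By the Cauchy-Schwarz inequality,

$$\left(\sum_{k=1}^K \frac{1}{v_k}\left[\alpha\|\beta^{(k)}\|_1 + \frac{1-\alpha}{2}\|\beta^{(k)}\|_2^2\right]\right)\left(\sum_{k=1}^K p_k v_k\right) \geq \left(\sum_{k=1}^K \sqrt{p_k\left[\alpha\|\beta^{(k)}\|_1 + \frac{1-\alpha}{2}\|\beta^{(k)}\|_2^2\right]}\right)^2. \tag{12}$$

Note that equality is attainable for (12): letting $a_k = \alpha\|\beta^{(k)}\|_1 + \frac{1-\alpha}{2}\|\beta^{(k)}\|_2^2$, equality occurs when there is some $c \geq 0$ such that

$$c \cdot \frac{a_k}{v_k} = p_k v_k \quad \text{for all } k, \quad \text{i.e.} \quad v_k = \sqrt{\frac{c\,a_k}{p_k}}.$$

Since $\sum_{k=1}^K p_k v_k = 1$, we have $\sqrt{c} = \dfrac{1}{\sum_{k=1}^K \sqrt{p_k a_k}}$, giving $\dfrac{f(\theta_k)}{\sum_{\ell=1}^K p_\ell f(\theta_\ell)} = \dfrac{\sqrt{a_k / p_k}}{\sum_{\ell=1}^K \sqrt{p_\ell a_\ell}}$ for all $k$. A solution for this is $f(\theta_k) = \sqrt{a_k / p_k}$ for all $k$, which is feasible for $f$ having range $[0, +\infty)$. (Note that if $f$ only has range $(0, +\infty)$, the connection still holds if $\lim_{x \to -\infty} f(x) = 0$ or $\lim_{x \to +\infty} f(x) = 0$: the solution will just have $\theta_k = -\infty$ or $\theta_k = +\infty$.)

Thus, the fwelnet solution is

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2}\|y - X\beta\|_2^2 + \frac{\lambda}{p}\left(\sum_{k=1}^K \sqrt{p_k\left[\alpha\|\beta^{(k)}\|_1 + \frac{1-\alpha}{2}\|\beta^{(k)}\|_2^2\right]}\right)^2. \tag{13}$$

Writing in constrained form, (13) becomes minimizing $\frac{1}{2}\|y - X\beta\|_2^2$ subject to

$$\sum_{k=1}^K \sqrt{p_k\left[\alpha\|\beta^{(k)}\|_1 + \frac{1-\alpha}{2}\|\beta^{(k)}\|_2^2\right]} \leq \sqrt{C}$$

for some constant $C \geq 0$. Converting back to Lagrange form again, there is some $\lambda' \geq 0$ such that the fwelnet solution is

$$\hat{\beta} = \arg\min_{\beta} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda'\sum_{k=1}^K \sqrt{p_k\left[\alpha\|\beta^{(k)}\|_1 + \frac{1-\alpha}{2}\|\beta^{(k)}\|_2^2\right]}.$$

Setting $\alpha = 0$ and $\alpha = 1$ in the expression above gives the desired result (for $\alpha = 0$, the constant factor $1/\sqrt{2}$ is absorbed into $\lambda'$).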
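The key step of this argument can be checked numerically. Assuming group-indicator feature information, $f(x) = e^x$, and every group having at least one non-zero coefficient, the weighted penalty $\sum_j w_j(\theta)c_j$ (with $c_j = \alpha|\beta_j| + \frac{1-\alpha}{2}\beta_j^2$ and $a_k$ the sum of $c_j$ over group $k$) should attain its minimum over $\theta$ at $\theta_k = \log\sqrt{a_k / p_k}$, with minimum value $\frac{1}{p}\big(\sum_k \sqrt{p_k a_k}\big)^2$. The sketch below, with hypothetical function names and inputs, compares the two quantities.

```python
import math


def enet_cost(b, alpha):
    """Per-coefficient elastic net cost alpha*|b| + (1 - alpha)/2 * b^2."""
    return alpha * abs(b) + 0.5 * (1.0 - alpha) * b * b


def group_sizes_and_costs(beta, groups, alpha):
    """Return dicts p_k (group sizes) and a_k (summed per-coefficient costs)."""
    sizes, costs = {}, {}
    for b, g in zip(beta, groups):
        sizes[g] = sizes.get(g, 0) + 1
        costs[g] = costs.get(g, 0.0) + enet_cost(b, alpha)
    return sizes, costs


def weighted_penalty(beta, groups, theta, alpha):
    """sum_j w_j(theta) * c_j for group-indicator Z and f = exp (lambda omitted)."""
    p = len(beta)
    total = sum(math.exp(theta[g]) for g in groups)  # equals sum_k p_k * exp(theta_k)
    return sum(total / (p * math.exp(theta[g])) * enet_cost(b, alpha)
               for b, g in zip(beta, groups))


def closed_form_minimum(beta, groups, alpha):
    """(1/p) * (sum_k sqrt(p_k * a_k))^2."""
    sizes, costs = group_sizes_and_costs(beta, groups, alpha)
    return sum(math.sqrt(sizes[g] * costs[g]) for g in sizes) ** 2 / len(beta)


# Hypothetical coefficients in two groups (labels index into theta).
beta = [1.0, -2.0, 0.5, 0.5, 3.0]
groups = [0, 0, 1, 1, 1]
alpha = 0.5

sizes, costs = group_sizes_and_costs(beta, groups, alpha)
theta_star = [0.5 * math.log(costs[k] / sizes[k]) for k in range(2)]
at_optimum = weighted_penalty(beta, groups, theta_star, alpha)
lower_bound = closed_form_minimum(beta, groups, alpha)
```

Evaluating `weighted_penalty` at other values of `theta` gives values no smaller than `lower_bound`, in line with the Cauchy-Schwarz bound.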
Set , .
Generate for and .
Generate response 1, , in the following way:
Generate , where for .
Generate response 2, , in the following way:
Generate , where for .
Bergstra, J. & Bengio, Y. (2012), ‘Random search for hyper-parameter optimization’, Journal of Machine Learning Research 13, 281–305.
Tai, F. & Pan, W. (2007), ‘Incorporating prior knowledge of predictors into penalized classifiers with multiple penalty terms’, Bioinformatics 23(14), 1775–1782.