A comprehensive review of variable selection in high-dimensional regression for molecular biology

09/24/2021 ∙ by Perrine Lacroix, et al. ∙ 0

Variable selection methods are widely used in molecular biology to detect biomarkers or to infer gene regulatory networks from transcriptomic data. Methods are mainly based on the high-dimensional Gaussian linear regression model and we focus on this framework for this review. We propose a comparison study of variable selection procedures from regularization paths by considering three simulation settings. In the first one, the variables are independent allowing the evaluation of the methods in the theoretical framework used to develop them. In the second setting, two structures of the correlation between variables are considered to evaluate how biological dependencies usually observed affect the estimation. Finally, the third setting mimics the biological complexity of transcription factor regulations, it is the farthest setting from the Gaussian framework. In all the settings, the capacity of prediction and the identification of the explaining variables are evaluated for each method. Our results show that variable selection procedures rely on statistical assumptions that should be carefully checked. The Gaussian assumption and the number of explaining variables are the two key points. As soon as correlation exists, the regularization function Elastic-net provides better results than Lasso. LinSelect, a non-asymptotic model selection method, should be preferred to the eBIC criterion commonly used. Bolasso is a judicious strategy to limit the selection of non explaining variables.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Bioinformatics researches have been booming in the past decades with the researches in genomics generating numerous datasets describing the molecular activities. For instance, microarrays and RNA-sequencing technologies providing a quantification of the expression of all the genes simultaneously ease the study of their interactions. Genetic associations studies diversify by considering a large range of phenotypes including gene expression or proteomic and metabolomic data. In this review, we focus on the detection of interactions between molecular entities through variable selection procedures in high-dimensional linear Gaussian regression. Our work is guided by the identification of transcription factors regulating a target gene. From the responses of all the transcription factors and the responses of the target gene in the same experiments, the goal is to select a few transcription factors. The considered dataset with a number of variables close to or slightly higher than the number of observations

is a real challenge since it hampers the use of the traditional estimation methods. A regularization of the cost function is required so that only a subset of variables is selected to explain the response variable.

In most reviews on variable selection in high-dimensional Gaussian linear regression, a focus is done on the construction of the regularization path. [1] provides a meticulous theoretical analysis of the Lasso penalty function. In particular, for a given number of explaining variables, he discusses the choice of the number of observations to ensure asymptotic properties of finding these explaining variables. [2]

compared several regularization functions in a simulation study by using semi-real datasets. In their simulation design they considered several numbers of observations, variables and explaining variables. They also modified the signal-to-noise ratio and considered two scenarios of variable correlations. The results of the different regularization functions are inspected with ROC curves and partial ROC curves when 0.5

and 0.9 variables are selected. [3] compared a large set of regularization functions with a simulation design similar to [2]. They evaluated both prediction and variable identification but the main difference with our investigations is that the model selection is only performed by a cross-validation procedure. Finally some reviews considered different contexts. [4]

were interested in robust variable selection strategies when heavy-tailed errors and outliers in response variables exist. They discussed the different steps from the modification of the least squares function to the choice of the parameters for the model selection through a presentation about algorithms accounting for outliers.

[5] considered a variety of models from survival models to generalized linear models, frequently used in biomedical researches. [6] considered a wide range of model structures (linear, grouped, additive, partially linear and non-parametric) and discussed three main categories of algorithms for the variable selection.

Our review distinguishes itself from the previous ones since we propose an evaluation of both the regularization path construction and the choice of the final selected variables leading to 45 combinations. Moreover, we add in this review non-asymptotic methods of model selection which are generally not considered. To construct the regularization path, we benchmark two regularization functions (Lasso [7] and Elastic-Net [8]) combined with two algorithms (LARS [9] and gradient descent algorithm [10]). Each regularization path provides a collection of variable subsets and to choose one of them, we compare model selection and variable identification approaches. On the one hand, the model selection uses penalization criteria of the least squares (eBIC [11], data-driven calibration strategies [12, 13, 14, 15] and LinSelect [16, 17]). On the other hand, the variable identification methods (ESCV [18], Bolasso [19], Stability Selection [20], Tigress [21] and the knockoffs method [22]) use sampling strategies to stabilize the selected variable subset while limiting the number of non explaining variables. Methods based on multiple testing procedures [23] and Bayesian approaches are not included in this review. We refer the readers to [24] for an empirical comparison of frequentist and Bayesian points of view. Lastly, we assume no prior knowledge between interactions, spatial localization and chronological information and refer to [25, 26] for such approaches.

After a description of the methods and their theoretical properties, we compare them in a simulation study by considering three settings. In the first one, the variables are independent and drawn from a Gaussian distribution. It allows a method comparison in the theoretical framework used to develop them. In the second setting, two structures of the correlation between variables are considered to evaluate how biological dependencies usually observed affect the methods. Observations are still generated according to a Gaussian linear model, the most favorable case where assumptions broadly hold. Finally, the third setting mimics the biological complexity of transcription factor regulations, observations are generated using the FRANK algorithm

[27].

In these three settings, the method performances are evaluated for their capacity of prediction and of identification of the explaining variables. To discriminate explaining variables to the others, we use the pROC-AUC and the pPR-AUC metrics. We use the mean squared error to measure the prediction performance, and the recall, specificity and false discovery rate metrics to assess the quality of the selected variables in terms of explaining variables. As [2, 3], we notice that there is no unambiguous winner among all the studied approaches. Our idea is rather to offer recommendations for a judicious choice of a method according to the application. In particular, when correlations exist between variables, Elastic-Net should be preferred to Lasso for the regularization function. Moreover, to ensure a prediction ability, LinSelect seems to be the most judicious choice. If the goal is to recover the explaining variables, data-driven calibration strategies and ESCV success whereas Bolasso, Stability Selection or knockoffs method should be privileged to limit the non explaining variables in the selected subset.

2 Methods

2.1 Statistical framework

We consider a Gaussian linear regression model where the response variable is explained by a linear combination of variables :

where

is the unknown vector and

follows a centered Gaussian distribution with an unknown variance denoted

. To estimate the parameters and , independent observations are available: for , we observe and the variables .

We consider high-dimensional regression where or , preventing the traditional least squares estimation. So an additive regularization corresponding to a sparsity assumption is imposed: only a small number of variables explains the response variable. For , the criterion becomes:

where is the number of non-zero coefficients of . Its associated Lagrangian form is for :

(1)

The proof of the equivalence and the link between and can be found in [7]. A large (resp. small) value of gives a small (resp. large) subset of variables corresponding to a linear adjustment far from (resp. close to) the response variable. Unfortunately, the criterion is non-convex, so the existence and the uniqueness of the solution are not guaranteed. Hence, as presented in [5], (1) is generalized as:

(2)

where is a continuous and convex regularization function which guarantees the existence of a minimum for any .

2.2 Regularization functions

The Lasso procedure [7] considers the -norm:

Lasso is a good approximation of (1. If is well chosen, it provides a consistent estimator of . This procedure achieves the best trade-off between regularity (convexity, reasonable computationally solving) and sparsity for independent variables. However, when some variables are correlated, Lasso tends to select randomly only one of them rather than selecting none or all of them. A solution is the Adaptative Lasso procedure [28] where each variable is properly weighted. There is also the Ridge procedure [29]:

In addition to taking variable dependencies into account, Ridge provides a strictly convex and derivable optimization problem with an explicit estimator of . However, this estimator is not sparse. Another alternative is the Elastic-Net procedure [8] which balances the sparsity control with variable dependencies inclusion through the parameter :

When prior knowledge on variable dependencies is available, there exist other regularization functions, not considered here: the Group Lasso [30], Overlap Group Lasso [31], Hierarchical Group Lasso [32] and fused Lasso [33].

2.3 Regularization path construction for Lasso and Elastic-Net

The optimization problem (2) has generally no explicit solution and must be solved by a computational approach by considering a grid of . A first algorithm is LARS [9]. Briefly speaking, the first subset contains only one variable: the variable which has the largest absolute correlation with . The second model contains exactly two variables: and the variable with being the most correlated variable with the residuals of the regression of on . One variable is added at each step and each step corresponds to a value of . A second algorithm [10] is based on the gradient descent method. This algorithm constructs a regular grid of a given size by starting with the largest corresponding to the first nonempty variable subset. Then, a variable subset is obtained for each of this grid by solving (2) with the cyclical coordinate descent method.

LARS gives an exact solution of the optimization problem whereas the gradient descent method provides a proxy. LARS leads to nested subsets, an important property for theoretical considerations, whereas the gradient descent method gives independent solutions along the grid. Consequently, LARS gives at most one subset per dimension whereas the gradient descent method may provide a richer collection.

2.4 Model selection

Whatever the chosen algorithm, a collection of variable subsets is obtained. Each is associated with an estimator of , however it is known to be biased [34] and is usually replaced with the least-squares estimator calculated only on the variables of . It is denoted and its dimension is denoted . the number of variables in . To select the best subset

, model selection approaches consist of minimizing a penalized loss function in

:

(3)

The loss function, , quantifying the quality of the model fit is either the least-squares function or the deviance , where is the likelihood function calculated with and , the empirical estimators associated to the model . The penalty function pen accounts for the model complexity and the characteristics of the sample: higher the penalty values, smaller the number of selected variables and farther the linear combination to the response variable .

Asymptotic criterion The first criteria are asymptotic: theirs properties are verified when the sample size tends to infinity. In this review, we focus on the more recent asymptotic criterion, called eBIC [11], used to get a consistent estimator by penalizing the deviance by:

(4)

where is a value in .

Non-asymptotic criteria With the characteristics of genomics data, having estimator properties for going to infinity has no sense (see [17]) and applying criteria with properties confirmed for any sample size is more relevant. Introduced by Birgé and Massart [35], the goal of non-asymptotic criteria is to achieve the risk oracle:

and instead of getting asymptotic equality of the kind

they get an inequality holding for any value of :

(5)

where and are two functions of which tend to and respectively when tends to infinity.

The selection procedure which satisfies (5) is the same as for an asymptotic criterion: the selected model is the minimizer of Equation (3) where the loss function is the least-squares function and two penalty functions are available. The first penalty is a data-driven calibration function [12]

where the constant has been calculated in a context of detection of changepoints in a signal [13]. The constant

is calibrated from the sample. Two strategies are proposed: the first one is the slope heuristics, assuming that the least-squares function is linear in

as soon as is large enough (see Figure 2 of [14]). The constant equals the estimated slope. The second strategy is the dimension jump, assuming the existence of such that for all the values smaller than , the associated model has a very high dimension, whereas for all the values greater than , the associated model has a reasonable dimension (see Figure 1 of [14]). The constant equals . For more practical and theoretical details, we refer the reader to [14] and the survey [15]. The second penalty function is LinSelect proposed in [16] and generalized for a high dimensional context in [17]. It is built from the empirical estimator of the variance onto each :

(6)

where the are weights satisfying some properties and the function is the unique solution of the equation:

where is defined for :

for and two independent random variables with degrees of freedom and respectively.

Both penalties are based on the theory developed in [35] and define an explicit penalty function when the variance is unknown. LinSelect deals with the empirical variance leading to the constant in Equation (6), whereas data-driven penalties prefer to calibrate the variance from the sample. This last approach is an heuristic tested and verified in various frameworks without theoretical properties [14].

In our simulation study, methods are defined by the combination of a regularization function (Lasso or Elastic-net) with an algorithm (LARS or the gradient descent method) and a penalty function (eBIC, LinSelect or data-driven penalties).

2.5 Variable identification

The high-dimensional framework usually leads to an unstable result: addition, suppression, or modification of some observations could radically change the subset of selected variables. For prediction, different sets of variables can give the same prediction performance. However, when the objective is the identification of the explaining variables, this instability is a drawback.

To circumvent this sampling uncertainty, the idea is to replicate the procedure on perturbed data generated from the original dataset. Cross-validation [36, 37] consists in splitting times the original dataset into a training and a test set. For each and each , the training set is used to calculate and the test set is used to evaluate the mean squared error. The selected model minimizes in the mean squared error. Applying cross-validation in high dimension is computationally expensive and known to be unstable, ESCV [18] estimates the instability along the regularization path instead:

Sampling strategy is also a solution. Two widely used approaches are Bolasso and Stability Selection proposed by [19, 20], which mainly differ in their sampling strategy. Bolasso generates samples of data uniformly chosen with replacement among the original observations, whereas the Stability Selection strategy generates samples of distinct data randomly chosen. Moreover, in the Stability Selection method, to limit the subsampling effects, when a sample is generated, its complement is also taken into account and a random perturbation is added in the regularization function :

where with . For these two approaches, we consider either a given grid or a grid varying with each sample. Sampling strategies allow one to get an occurrence frequency of each variable for each of the grid and those having the highest occurrence frequencies are retained to constitute the final variable subset. Tigress method [21] modifies the calculation of the occurrence frequency by averaging also on the LARS steps.

The last type of variable identification method is the knockoffs method [22]. It controls the false discovery rate (FDR). This method starts with the construction of a matrix such that and have the same covariance structure with the less correlated to . The matrix is designed through linear algebra tools (see [22, 38] for details). Then, a regularization path is constructed on the augmented matrix of size where the explaining variables are expected to be selected very earlier than their copy. Let denote

(7)

where and correspond to the largest for which and respectively are selected. A positive value of states that is selected before its copy and a large positive value indicates that is selected rapidly. Let be the target FDR, the final variable subset is composed by the such that with:

where . If the set is empty, is set to infinity.

In our simulation study, methods are defined for variable identification. When the same samples are used for all the of the grid, 16 methods are defined by the combination of the sampling strategy (Bolasso or Stability Selection) with a regularization function (Lasso or Elastic-net, both randomized or not) and an algorithm (LARS or the gradient descent method). When the grid is fixed, the sampling strategy is performed for each of the grid, which implies using the gradient descent algorithm, hence 8 methods are defined. Furthermore, we compare Tigress, the knockoffs method and ESCV. For the last two, a gradient descent algorithm is used with either Lasso or Elastic-net.

2.6 Simulation framework

The design of the simulation study is directly linked to the question addressed and is composed of three simulation settings.

Simulation under independent design This is the simplest setting where the large number of variables is the single handicap [39, 3]. The matrix is simulated by concatenation of independent standard Gaussian vectors of size . The number of non-zero coefficients of the vector is generated from a uniform variable on integers between 1 and

. Theirs values are generated from a uniform distribution between 0.5 and 2 and the response variable

is got by .

Simulation under a Gaussian graphical model A sample of size is generated from a multivariate centered Gaussian distribution with covariance matrix , where the dependency structure is encoded in the precision matrix : if its coefficient is non-zero, it means that the variables and interact when all other variables are given. We refer to [40] and [41] for more information. The response variable is chosen as a column of the multivariate centered Gaussian, whereas in the previous papers [42, 43, 44, 3], the response variable is usually simulated after the built of the matrix . This choice has been motivated by applications in genomics where the status of the response and explaining variables are similar. We consider two types of dependency:

  • Cluster design: the precision matrix is simulated as a block diagonal matrix with blocks of equal size. The response variable is defined as the first variable.

  • Scale-free design: a few variables have a lot of neighbors while all the others have a few ones. We consider two response variables corresponding to the variables having the highest and the smallest number of neighbors. These simulation designs are named scale-free-max and scale-free-min respectively.

Simulation under a dynamical process It is the most realistic setting, based on the gene regulatory networks-simulating engine called FRANK [27]. This algorithm simulates large gene regulatory networks with characteristics observed in vivo: variables are categorized into a set of transcription factors that activates or inhibits a set of target genes. We use FRANK with only transcription factors in order to compare the results with those from the Gaussian settings. We consider two possible response variables corresponding to the variables having the highest and the smallest number of neighbors. These simulation designs are named FRANK-max and FRANK-min respectively.

Simulation parameters We set and . For the scenarios (independent, cluster, scale-free max, scale-free min, FRANK-max, FRANK-min), samples of size are generated to create a training set of size used for the estimation and a test set of size used for the method evaluation. For each sample, the variables are primarily centered and scaled. For the simulation under the Gaussian graphical model, we use the function huge.generator from the R package huge (version 1.3.4.1). For the cluster design, the block number equals

and the probability of connection within a component is set to the default value

. For simulation under a dynamical process, we use the online FRANK algorithm available on the website https://m2sb.org/?page=FRANK with transcription factors and

observations. The number of eigenvalues of the matrix on the unit circle is fixed to

and the minimum and maximum of sparsity are set to and . Other parameters are set to default values.

For the LARS algorithm, we use the function enet of the R package elasticnet (version 1.1.1). The maximal number of steps to define the grid size is the default value . For the gradient descent method, we use the function glmnet of the R package glmnet (version 3.0) and set the grid size at . Both functions propose the Lasso and elastic-net regularization functions. We set for Elastic-Net regularization.

To perform model selection, we implement eBIC setting to . For the non-asymptotic criteria, LinSelect is implemented in the function tuneLasso of the R package LINselect (version 1.1.3). The data-driven penalties are calculated by using the function capushe of the R package capushe (version 1.1.1). The parameters are set to the default values except for the minimum percentage of points for the plateau selection set to .

To perform variable identification, we use the function escv.glmnet of the R package HDCI (version 1.0.2) for the ESCV strategy, with a number of groups . We implement Bolasso and Stability Selection with samples. When we consider a random perturbation, weights are drawn from a uniform distribution between and . A variable is selected when its occurrence frequency is higher than . To apply Tigress, we use the function tigress of the R package tigress (version 0.1.0) with steps for the LARS algorithm. For the knockoffs method, we use the function knockoff.filter with option create.second_order of the R package knockoff (version 0.3.2), we calculate the with the function stat.lasso_lambdasmax and set the false discovery rate to .

2.7 Evaluation metrics

It requires to distinguish the explaining variables from the others among the selected variables. We define a relevant variable as a selected explaining variable whereas an irrelevant variable is a selected variable which is not an explaining variable. All the metrics presented below are calculated on each test set and discussed averaged on test sets.

First, we evaluate the performance of regularization path constructions. We use the partial area under the receiver operating characteristic curve (pROC-AUC) where the x-axis is the proportion of irrelevant variables among the non explaining variables and the y-axis is the proportion of relevant variables among the explaining variables. A high value of pROC-AUC states that the regularization path is able to discriminate the explaining variables from the others. We use also the partial area under the precision-recall curve (pPR-AUC) where the x-axis is the proportion of relevant variables among the explaining variables and the y-axis is the proportion of relevant variables among the selected variables. A high value of pPR-AUC states that the regularization path is able to select the explaining variables. The lengths of the regularization path differ between the methods. To fairly compare them, the pROC-AUC and pPR-AUC are calculated either by restricting the length of the paths to the number of non-zero coefficients of

or by taking the largest value of the x-axis common to all the methods.

Second, we evaluate the prediction performance of the methods by the mean squared error calculated on each set test :

(8)

where is the estimator of selected from the associated training set. As data are centered and scaled, a mean squared error value lower than means that the method has a prediction ability: the selected variables predict better than an empty set of variables.

Finally, we evaluate the variable identification by using three metrics: the recall (the proportion of the relevant variables among the explaining variables), the specificity (the proportion of the non explaining variables not selected among the non explaining variables) and the false discovery proportion (the proportion of irrelevant variables among the selected variables). By averaging on the test sets, the false discovery proportion becomes the FDR. Since the objective is to limit the irrelevant variables while selecting as many explaining variables as possible, recall and specificity are expected to be close to 1 while the false discovery rate is expected to be low or close to the fixed level for the knockoffs method.

3 Results

In the following tables, the notations GD, and E-Net denote the gradient descent algorithm, Lasso and Elastic-Net respectively. The results are discussed per simulation design.

Mean squared error

Table 1 summaries the mean squared errors of the different methods. As both data-driven penalties having similar performances, only the results of the heuristic slope are presented. Among the variable identification methods only results of the best method (the knockoffs) are given, since these methods are not tailored to have a predictive ability.

In the independent framework, eBIC with Lasso provides the best averaged mean squared error. The non-asymptotic criteria give significantly higher values, but still lower than . For the other settings, the mean squared errors of eBIC are greater than meaning that eBIC has no predictive ability when correlations exist. For the non-asymptotic criteria, the results do not depend on the regularization functions and algorithms. LinSelect usually provides the best prediction with the mean squared error values between and , suggesting that the correlation structure slightly impacts its performances. The mean squared errors of slope heuristic are reasonable for scale-free-max design, but are larger than in all other designs. Finally, the knockoffs method values are in the same range as the results of the best non-asymptotic criterion. The results for the other variable identification methods are similar and do not depend either on the choice of regularization function, algorithms or the sampling strategy for Bolasso and Stability Selection (data not shown).

Data setting ind cluster scale-free scale-free FRANK FRANK
max min max min
Explaining variable number 105.7 11.9 30.9 1 17.75 1.2
(52.36) (2.75) (9.56) (0) (3.41) (0.40)
GD + + eBIC 0.395 3.848 0.585 23.141 1.258 1.531
(0.316) (6.416) (0.210) (83.281) (1.555) (1.919)
LARS + + eBIC 0.328 3.385 1.124 4.598 4.230 5.142
(0.190) (0.761) (0.400) (1.087) (2.801) (3.461)
GD + E-Net + eBIC 1.064 889.648 0.615 21.798 27.639 45.889
(3.235) (5591.703) (0.221) (51.946) (139.048) (194.835)
LARS + E-Net + eBIC 355.312 7424.224 507.241 4036.118 77460.694 13716.184
(1413.632) (29627.089) (1583.563) (19395.594) (420614.468) (51209.882)
GD + + LinSelect 0.994 0.953 0.861 0.977 0.906 0.878
(0.066) (0.075) (0.105) (0.092) (0.459) (0.390)
LARS + + LinSelect 0.994 0.953 0.862 0.977 0.906 0.879
(0.066) (0.074) (0.105) (0.092) (0.459) (0.391)
GD + E-Net + LinSelect 0.994 0.953 0.861 0.977 0.899 0.871
(0.066) (0.075) (0.105) (0.092) (0.448) (0.403)
LARS + E-Net + LinSelect 0.994 0.953 0.862 0.977 0.891 0.877
(0.066) (0.074) (0.105) (0.092) (0.430) (0.404)
GD + + slope 0.431 1.665 0.510 1.979 1.553 1.111
(0.294) (0.640) (0.181) (0.859) (1.467) (0.748)
LARS + + slope 0.446 1.534 0.556 1.848 1.766 1.417
(0.314) (0.524) (0.196) (0.670) (1.435) (0.923)
GD + E-Net + slope 0.437 1.581 0.497 2.073 1.363 1.377
(0.305) (0.640) (0.184) (0.960) (1.075) (1.559)
LARS + E-Net + slope 0.488 1.136 0.375 1.318 1.007 0.977
(0.292) (0.495) (0.088) (0.221) (0.512) (0.488)
E-Net + knockoffs (GD) 0.887 0.993 0.376 0.997 0.983 1.003
(0.244) (0.000) (0.161) (0.022) (0.113) (0.059)
  • Results are the mean and standard deviation (in brackets) of the mean squared error defined in Equation (

    8

    ), computed over 40 samples. Bold values are the highest value for each design and italic values are significantly different of the bold value of the column. The significance is obtained by a t-test with

    of confidence.

Table 1: The mean squared error values

To conclude, non-asymptotic criteria have a better prediction ability than eBIC. The variable identification methods have an ability comparable to the best non-asymptotic criterion in each design. LinSelect seems to be preferable since its results are independent of the dataset characteristics, an advantage for analyzing real data.

Discrimination of the explaining variables from the others In a predictive task, the methods are designed to achieve the smallest mean squared error but the selected variables may have no meaning. Indeed, understanding and predicting are two different tasks. As an example, in genomics applications, some genes could be good markers to predict the status of a disease, without being involved in the underlying mechanisms.

To compare the ability to select the explaining variables first, we evaluate pROC-AUC and pPR-AUC along the regularization path for methods based on model selection and ESCV. Since Bolasso, Stability Selection and Tigress are not based on a unique regularization path, the pROC-AUC and pPR-AUC are calculated from the variables ranked according to their occurrence frequency. For the knockoffs method, we use the variables defined in (7).

Data setting ind cluster scale-free scale-free FRANK FRANK
max min max min
Explaining variable number 105.7 11.9 30.9 1 17.75 1.2
(52.36) (2.75) (9.56) (0) (3.41) (0.40)
GD + 0.315 0.503 0.542 0.593 0.151 0.204
(0.073) (0.075) (0.053) (0.242) (0.054) (0.184)
GD + E-Net 0.314 0.514 0.551 0.609 0.155 0.222
(0.071) (0.075) (0.053) (0.226) (0.054) (0.191)
LARS + 0.316 0.501 0.541 0.592 0.150 0.234
(0.072) (0.075) (0.053) (0.239) (0.054) (0.205)
LARS + E-Net 0.294 0.547 0.593 0.653 0.184 0.238
(0.071) (0.076) (0.042) (0.197) (0.072) (0.206)
E-Net + ESCV (GD) 0.309 0.512 0.549 0.605 0.154 0.221
(0.069) (0.075) (0.052) (0.225) (0.054) (0.190)
Tigress (LARS + ) 0.268 0.542 0.596 0.649 0.174 0.174
(0.061) (0.072) (0.039) (0.176) (0.077) (0.077)
E-net + knockoffs (GD) 0.277 0.494 0.552 0.593 0.172 0.192
(0.076) (0.089) (0.055) (0.271) (0.070) (0.203)
LARS + E-Net + Bolasso + 0.280 0.547 0.595 0.642 0.169 0.253
fixed resamples (0.060) (0.071) (0.040) (0.193) (0.063) (0.199)
reference value 0.49 0.7 0.609 0.729 0.579 0.597
(0.084) (0.027) (0.04) (0.015) (0.127) (0.099)
  • Mean and standard deviation (in brackets) of the pROC AUC restricted to the smallest value of the x-axis largest values obtained by the different strategies, computed over 40 samples. The two last rows are mean and standard deviation of the expected value of the pAUC ROC restricted to the smallest value of the x-axis largest values. Bold values are the highest value for each design and italic values are significantly different of bold values per column. The significance is obtained by a t-test with of confidence.

Table 2: The pROC-AUC values.

Table 2 states the results of pROC-AUC with a threshold based on the smallest value of the x-axis largest values since the conclusions drawn from the pROC-AUC with the second threshold and from the pPR-AUC are the same. Additionally, among all Bolasso and Stability Selection strategies, we only present Bolasso with Randomized regularization when samples are first generated since it gives the highest pROC-AUC values.

For the independent framework, we observed that all the pROC-AUC values significantly differ from the reference value, indicating that some non explaining variables are selected before all explaining variables. LARS combined with Lasso provides the closest result to the reference value.

For correlation structure settings, all values significantly differ from the reference values. The discrepancy depends on the correlation structure: the highest difference is observed for FRANK designs. For model selection methods, the use of Elastic-Net improves the performance of both algorithms, significantly when the response variable depends on several variables. Concerning variable identification methods, Bolasso or Tigress are generally the best methods for discriminating the explaining variables from the others and performs as well as the best model selection method.

To conclude, when a dependency structure between variables exist, Elastic-Net discriminates the explaining variables from the others better than Lasso. It should be used with LARS algorithm. Among the variable identification methods, Bolasso seems to be the most judicious choice. It should be used with a randomized Elastic-Net combined with LARS and the best sampling strategy is to first generate the samples to have the same along the grid.

Recall Although only methods for variable identification are developed to assess that the selected variables are the explaining variables, it is also interesting to measure the recall metric on methods developed for model selection. The expected value of this metric is meaning that all the explaining variables are selected. Table 3 states the most favorable combination per method since we did not observe significant differences between the algorithms and the regularization functions. However we can note that Elastic-Net is more present in the table.

For the independent framework, the best methods are eBIC, ESCV and the data-driven penalties with values from to . The values of the other methods are lower than .

For settings with a correlation structure, ESCV is the best variable identification method but its recall value at is far from the result of the best model selection method. For the scale-free settings, the best values are higher than , given by knockoffs or ESCV but these values are still far from eBIC which is the best method with a recall greater than . For FRANK settings, values for the variable identification methods are always lower than , whereas eBIC provides the best value around . We note that all values of Tigress are close to , meaning that this method is not sensitive at all. It is also mostly the case for the knockoffs method, Bolasso and Stability Selection. LinSelect is also not sensitive at all with values close to , except in scale-free min design. For the data-driven strategies, values are smaller than eBIC but rather reasonable except for data generated with FRANK algorithm and with the cluster design.

Data setting ind cluster scale-free scale-free FRANK FRANK
max min max min
Explaining variable number 105.7 11.9 30.9 1 17.75 1.2
(52.36) (2.75) (9.56) (0) (3.41) (0.40)
E-Net + ESCV (GD) 0.731 0.291 0.609 0.675 0.047 0.113
(0.233) (0.238) (0.266) (0.474) (0.059) (0.310)
Tigress (LARS + ) 0.008 0.003 0.003 0.000 0.003 0.000
(0.040) (0.016) (0.011) (0.000) (0.015) (0.000)
E-Net + knockoffs (GD) 0.114 0.000 0.699 0.025 0.008 0.000
(0.234) (0.000) (0.225) (0.158) (0.039) (0.000)
E-Net + Stab-Selec + 0.101 0.146 0.303 0.250 0.016 0.062
fixed grid (GD) (0.133) (0.080) (0.120) (0.439) (0.040) (0.232)
E-Net Rand + Bolasso + 0.057 0.107 0.236 0.225 0.016 0.037
fixed grid (GD) (0.079) (0.062) (0.092) (0.423) (0.040) (0.175)
LARS + E-Net + eBIC 0.871 0.956 0.994 0.950 0.728 0.713
(0.083) (0.054) (0.013) (0.221) (0.105) (0.437)
LARS + E-Net + LinSelect 0.015 0.081 0.047 0.600 0.037 0.138
(0.014) (0.060) (0.036) (0.496) (0.066) (0.339)
LARS + E-Net + slope 0.662 0.581 0.884 0.825 0.122 0.163
(0.225) (0.185) (0.118) (0.385) (0.163) (0.347)
  • Mean and standard deviation (in brackets) of the recall, computed over 40 samples. Bold values are the highest value for each design and italic values are significantly different of bold values per column. The significance is obtained by a t-test with of confidence.

Table 3: The recall values

To conclude, the existence of the correlation structure impacts the recall metric. For all methods and settings, eBIC provides the best recall values which are higher than the ones provided by the variable identification methods.

Specificity The expected value of this metric is meaning that all the non explaining variables are not selected. Table 4 states the most favorable combination per method since we did not observe significant differences between the algorithms and the regularization functions. However we can note that the gradient descent algorithm is more present in the table.

For the independent setting, while the specificity value of ESCV is , those of all other variable identification methods are higher than . For model selection method, the best methods are LinSelect and the data-driven penalty with the dimension jump strategy with a specificity higher than .

For the correlation structure settings, the specificity of the variable identification methods are always higher than and those of Tigress always equals one. For the methods dedicated to model selection, we observed that LinSelect performs the best with a specificity always higher than . For the data-driven penalties, the dimension jump strategy provides better results than the slope heuristic. The eBIC values are close to for scale-free-max and FRANK settings whereas they are lower than for the others. It suggests that eBIC specificity depend on the dataset complexity and the number of explaining variables.

To conclude, all methods except eBIC and the data-driven penalty with slope heuristics strategy control well the specificity, even for the FRANK settings.

Data setting ind cluster scale-free scale-free FRANK FRANK
max min max min
Explaining variable number 105.7 11.9 30.9 1 17.75 1.2
(52.36) (2.75) (9.56) (0) (3.41) (0.40)
E-Net + ESCV (GD) 0.706 0.974 0.994 0.978 0.977 0.970
(0.159) (0.047) (0.012) (0.071) (0.024) (0.025)
Tigress (LARS + ) 1.000 1.000 1.000 1.000 1.000 1.000
(0.000) (0.000) (0.000) (0.000) (0.001) (0.000)
E-Net + knockoffs (GD) 0.993 1.000 0.988 0.999 0.998 0.998
(0.016) (0.000) (0.011) (0.008) (0.011) (0.010)
GD + E-Net Rand + Stab-Selec + 1.000 0.998 1.000 0.999 0.997 0.997
fixed shuffles (0.002) (0.003) (0.001) (0.003) (0.005) (0.005)
GD + E-Net Rand + Bolasso + 1.000 0.999 1.000 1.000 0.998 0.998
fixed shuffles (0.002) (0.002) (0.000) (0.001) (0.005) (0.005)
GD + + eBIC 0.565 0.624 0.939 0.307 0.935 0.900
(0.166) (0.361) (0.186) (0.162) (0.192) (0.237)
GD + + LinSelect 1.000 0.998 0.999 0.998 0.986 0.987
(0.002) (0.004) (0.002) (0.003) (0.014) (0.012)
GD + + slope 0.787 0.697 0.803 0.728 0.786 0.870
(0.117) (0.178) (0.135) (0.182) (0.169) (0.121)
GD + + dim-jump 0.903 0.893 0.905 0.891 0.914 0.924
(0.092) (0.092) (0.095) (0.083) (0.082) (0.060)
  • Mean and standard deviation (in brackets) of the specificity, computed over 40 samples. Bold values are the highest value for each design and italic values are significantly different of bold values per column. The significance is obtained by a t-test with of confidence.

Table 4: The specificity values.

False discovery rate The single method controlling the FDR is the knockoffs method and we recall that its target FDR is . Concerning the others, we expect a small value. Table 5 presents one combination per method since we did not observe significant differences between the algorithms and the regularization functions. For the data-driven penalties, only results of slope heuristic strategy are presented since dimension jump performances are similar.

For the independent framework, the FDR of the knockoffs method equals , lower than the initial threshold. The FDR is lower than for the other variable identification methods except for ESCV with an FDR of . For the model selection methods, LinSelect gives the lowest value with whereas values for the others are larger than .

When dependencies between variables exist, the FDR of the knockoffs method varies between and , still lower than the initial threshold. Concerning variable identification methods, for FRANK designs, the FDR is smaller than except for ESCV with a FDR equals . For the other settings, the FDR varies between and except for ESCV where it can achieve . Note that for Tigress, the FDR always equals , except for FRANK-max setting. These results were expected given the recall and specificity values. For Bolasso and Stability Selection, values can increase up to but for scale-free-max, the FDR is whereas the selected subset is not empty. For the model selection methods, eBIC has a FDR higher than , except for the scale-free max setting with a value equals . For the non-asymptotic criteria, the FDR is high, between and , except for LinSelect in the scale-free-max setting with an FDR of .

Data setting ind cluster scale-free scale-free FRANK FRANK
max min max min
Explaining variable number 105.7 11.9 30.9 1 17.75 1.2
(52.36) (2.75) (9.56) (0) (3.41) (0.40)
E-Net + ESCV (GD) 0.235 0.303 0.037 0.278 0.714 0.904
(0.140) (0.309) (0.060) (0.415) (0.362) (0.268)
Tigress (LARS + ) 0.000 0.000 0.000 0.000 0.025 0.000
(0.000) (0.000) (0.000) (0.000) (0.158) (0.000)
E-Net + knockoffs (GD) 0.035 0.000 0.074 0.023 0.039 0.025
(0.069) (0.000) (0.070) (0.144) (0.171) (0.158)
GD + Rand + Stab-Selec + 0.025 0.108 0.000 0.100 0.163 0.175
fixed resamples (0.158) (0.284) (0.000) (0.282) (0.347) (0.385)
GD + Rand + Bolasso + 0.025 0.087 0.000 0.062 0.129 0.175
fixed resamples (0.158) (0.275) (0.000) (0.232) (0.301) (0.385)
GD + E-Net + eBIC 0.304 0.636 0.137 0.919 0.831 0.984
(0.171) (0.433) (0.296) (0.265) (0.260) (0.065)
GD + E-Net + LinSelect 0.025 0.216 0.054 0.400 0.853 0.982
(0.158) (0.396) (0.153) (0.496) (0.256) (0.059)
GD + E-Net + slope 0.250 0.827 0.469 0.971 0.879 0.990
(0.154) (0.091) (0.214) (0.040) (0.124) (0.018)
  • Mean and standard deviation (in brackets) of the false discovery rate, computed over 40 samples. Bold values are the highest value for each design and italic values are significantly different of bold values per column. The significance is obtained by a t-test with of confidence.

Table 5: The false discovery rate values.

To conclude, methods based on model selection and ESCV do not control the FDR. If we set Tigress apart since the final variable subset is usually empty, knockoffs method has the lowest values. Bolasso and Stability Selection are also reasonable choices, indicating that their sampling strategy do not select too much non explaining variables.

4 Discussion

High-dimensional regression is usually used to model genomics projects since the number of variables is often close to or larger than the number of observations. This framework raises many methodological questions and this review aims at highlighting the method performances according to biological considerations. To evaluate the different methods in a fair way, we simulated different settings, each one having its own characteristics. The independent setting is irrelevant for applications in molecular biology but it is the framework used to develop the statistical methods. The settings based on the Gaussian graphical model generate correlated variables and provide a more realistic framework. The last settings are far from the statistical framework but offer characteristics observed in molecular datasets.

A statistical framework always requires some hypotheses. For the high-dimensional Gaussian linear regression, data are assumed to be distributed according to a Gaussian distribution, observations are assumed to be independent, variable dependencies are well controlled and the set of explaining variables is assumed to be small. Real datasets generally do not verify all these hypotheses. The most striking conclusion of this review is the deterioration of the metrics in the FRANK settings, where variables are correlated and observations are not simulated directly from a Gaussian regression. To ensure that the high-dimensional context is not an additional issue, we run simulations with an increasing number of observations from to and we observe the same deteriorated performances. The dependence structure between the variables being close to the structure observed in a scale-free network, it cannot explain the deterioration of the metrics. So we computed the residuals of the linear regression from the FRANK data and observed that they are not Gaussian. It suggests that the Gaussian distribution is an important assumption. If this assumption is not verified on a real dataset, data transformations and pre-processing steps are recommended (see [45]). We use the independent setting, where the median number of explaining variables equals to investigate the impact of a high number of explaining variables. In the simulations with an increasing number of observations from to , we observe an improvement of the performances but not significantly, suggesting that the hypothesis about the small number of explaining variables is crucial.

As soon as variables are correlated, Elastic-net should be preferred to Lasso. This result is expected since Elastic-Net has been created for datasets with correlated variables and it is consistent with previous studies [3]. According to the different metrics, there is no need to favor one algorithm. However to discriminate the explaining variables from the others, LARS has the best performances for model selection methods. For explaining variable discrimination, Bolasso with randomized regularization and the same samples along the grid is the best variable identification method. This conclusion is unexpected since LARS uses a single regularization path while Bolasso is based on an average of several regularization paths. We also observed that obtained values are far from the reference value. It is explained by a very early entrance of some non explaining variables along the regularization path, especially for the knockoffs method. By a further study of the results of the sampling methods, we observe that the irrelevant variables are always the same. Consequently it does not alter significantly the occurrence frequencies. However we observe by increasing the number of observations from to that the first irrelevant variable appears more and more late and the regularization paths become more stable. Hence, we conclude that the high-dimensional context impacts the estimation process from the construction of the regularization paths.

To define a final variable subset from the ranked variables, our simulation study shows that asymptotic criteria should be avoided in high dimension. To illustrate this point, we studied the eBIC mean squared errors by varying the number of observations from to . We obtained a significant improvement: while mean squared error values are extremely high when , they became smaller than when exceeds . When we compared the performances of the non-asymptotic criteria and the variable identification strategies, we first observed that the predictive performances of these latter are similar to those of LinSelect. It was unexpected since variable identification methods do not aim at controlling the mean squared error. In contrast, the mean squared error values of LinSelect are small and lower than for the data-driven procedures. It was expected since LinSelect was constructed from an oracle inequality whereas the data-driven methods are based on a heuristic. We also want to point out that for the data-driven calibration methods, the shape penalty and the multiplicative constant are derived from other statistical frameworks. More theoretical work is needed to propose other choices which could improve their performances in high-dimensional regression.

As already said, prediction differs from variable identification. To ensure that the selected variables are explaining variables, a method with a high value of both recall and specificity should be preferred. Indeed a high recall means that almost all explaining variables are selected and a high specificity means that almost all non explaining variables are not selected. Based on our simulation study, the methods with the best trade-off between these two metrics are the data-driven procedure with dimension jump strategy and ESCV.

If the user prefers a small FDR, non-asymptotic criteria and ESCV have to be discarded as well as Tigress which is too conservative leading to an empty selected subset. In contrast, for knockoffs method, Bolasso and Stability Selection, the FDR is small enough, even if it is at the price of a large computational time. Moreover, among all of Bolasso and Stability Selection strategies, our results show that Bolasso with Randomized Lasso and when the samples are first generated should be privileged.

To summarize, the strategy used should be decided according to the metrics to favor. For a best predictive performance, LinSelect seems to be the most judicious choice. For both recall and specificity control, data-driven procedures with dimension jump strategy and ESCV are the best. For a FDR control, knockoffs and Bolasso with Randomized Lasso and when the samples are first generated have to be preferred.

5 Supplementary data

6 Acknowledgments

We are grateful to the INRAE MIGALE bioinformatics facility (MIGALE, INRAE, 2020. Migale bioinformatics Facility, doi: 10.15454/1.5572390655343293E12) for providing computing resources. IPS2 benefits from the support of the LabEx Saclay Plant Sciences-SPS (ANR-17-EUR-0007). This work was supported by a public grant as part of the Investissement d’avenir project, reference ANR-11-LABX-0056-LMH, LabEx LMH.

References

  • [1] M.J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using -constrained quadratic programming. IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
  • [2] P. Bühlmann and J. Mandozzi. High-dimensional variable screening and bias in subsequent inference, with an empirical comparison. Computational Statistics, 29(3):407–430, 2014.
  • [3] F. Wang, S. Mukherjee, S. Richardson, and S.M. Hill. High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking. Statistics and computing, 30(3):697–719, 2020.
  • [4] C. Wu and S. Ma. A selective review of robust variable selection with applications in bioinformatics. Briefings in bioinformatics, 16(5):873–883, 2015.
  • [5] S. Vinga. Structured sparsity regularization for analyzing high-dimensional omics data. Briefings in Bioinformatics, 22(1):77–87, 2021.
  • [6] Loann David Denis Desboulets.

    A review on variable selection in regression analysis.

    Econometrics, 6(4):45, 2018.
  • [7] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [8] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2):301–320, 2005.
  • [9] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. The Annals of statistics, 32(2):407–499, 2004.
  • [10] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1, 2010.
  • [11] J. Chen and Z. Chen. Extended bayesian information criteria for model selection with large model spaces. Biometrika, 95(3):759–771, 2008.
  • [12] L. Birgé and P. Massart. Minimal penalties for gaussian model selection. Probability theory and related fields, 138(1-2):33–73, 2007.
  • [13] E. Lebarbier. Detecting multiple change-points in the mean of gaussian process by model selection. Signal processing, 85(4):717–736, 2005.
  • [14] J.P. Baudry, C. Maugis, and B. Michel. Slope heuristics: overview and implementation. Statistics and Computing, 22(2):455–470, 2012.
  • [15] S. Arlot. Minimal penalties and the slope heuristics: a survey. arXiv preprint arXiv:1901.07277, 2019.
  • [16] Y. Baraud, C. Giraud, and S. Huet. Gaussian model selection with an unknown variance. The Annals of Statistics, 37(2):630–672, 2009.
  • [17] C. Giraud, S. Huet, and N. Verzelen. High-dimensional regression with unknown variance. Statistical Science, 27(4):500–518, 2012.
  • [18] C. Lim and B. Yu. Estimation stability with cross-validation (escv). Journal of Computational and Graphical Statistics, 25(2):464–492, 2016.
  • [19] F. Bach. Bolasso: model consistent lasso estimation through the bootstrap. In

    Proceedings of the 25th international conference on Machine learning

    , pages 33–40, 2008.
  • [20] N. Meinshausen and P. Bühlmann. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473, 2010.
  • [21] A.C. Haury, F. Mordelet, P. Vera-Licona, and J.P. Vert. Tigress: trustful inference of gene regulation using stability selection. BMC systems biology, 6(1):145, 2012.
  • [22] R. Barber and E.J. Candès. Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5):2055–2085, 2015.
  • [23] M. Kos and M. Bogdan. On the asymptotic properties of slope. Sankhya A, 82(2):499–532, 2020.
  • [24] G. Celeux, M. El Anbari, J.M. Marin, and C.P. Robert. Regularization in regression: comparing bayesian and frequentist methods in a poorly informative situation. Bayesian Analysis, 7(2):477–502, 2012.
  • [25] H.D. Bondell and B.J. Reich. Consistent high-dimensional bayesian variable selection via penalized credible regions. Journal of the American Statistical Association, 107(500):1610–1624, 2012.
  • [26] Z. Razaghi-Moghadam and Z. Nikoloski. Supervised learning of gene-regulatory networks based on graph distance profiles of transcriptomics data. NPJ systems biology and applications, 6(1):1–8, 2020.
  • [27] C. Carré, A. Mas, and G. Krouk. Reverse engineering highlights potential principles of large gene regulatory network design and learning. NPJ systems biology and applications, 3(1):1–15, 2017.
  • [28] H. Zou. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429, 2006.
  • [29] A Hoerl and R Kennard. Ridge regression, in ‘encyclopedia of statistical sciences’, vol. 8, 1988.
  • [30] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.
  • [31] L. Jacob, G. Obozinski, and J.P. Vert. Group lasso with overlap and graph lasso. In Proceedings of the 26th annual international conference on machine learning, pages 433–440, 2009.
  • [32] P. Zhao, G. Rocha, and B. Yu. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 37(6A):3468–3497, 2009.
  • [33] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(1):91–108, 2005.
  • [34] N. Meinshausen. Relaxed lasso. Computational Statistics & Data Analysis, 52(1):374–393, 2007.
  • [35] L. Birgé and P. Massart. Gaussian model selection. Journal of the European Mathematical Society, 3(3):203–268, 2001.
  • [36] D. Allen. The relationship between variable selection and data augmentation and slow feature analysis. Technometrics, 16:125–127, 1974.
  • [37] M. Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society: Series B (Methodological), 36(2):111–133, 1974.
  • [38] E. Jean Candès, Y. Fan, L. Janson, and J. Lv. Panning for gold: Model-free knockoffs for high-dimensional controlled variable selection. Department of Statistics, Stanford University, 2016.
  • [39] J. Fan, Y. Fan, and E. Barut. Adaptive robust variable selection. Annals of statistics, 42(1):324, 2014.
  • [40] S.L. Lauritzen. Graphical Models. Oxford University Press, 1996.
  • [41] I. Córdoba, G. Varando, C. Bielza, and P. Larrañaga. Generating random gaussian graphical models. arXiv preprint arXiv:1909.01062, 2019.
  • [42] H. Zou and M. Yuan.

    Regularized simultaneous model selection in multiple quantiles regression.

    Computational Statistics & Data Analysis, 52(12):5296–5304, 2008.
  • [43] L. Wang, Y. Wu, and R. Li. Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association, 107(497):214–222, 2012.
  • [44] B. Peng and L. Wang. An iterative coordinate descent algorithm for high-dimensional nonconvex penalized quantile regression. Journal of Computational and Graphical Statistics, 24(3):676–694, 2015.
  • [45] H. Liu, F. Han, M. Yuan, J. Lafferty, and L. Wasserman. High-dimensional semiparametric gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326, 2012.