Nowadays, where computing becomes less expensive, huge amounts of data are routinely collected by companies and organizations of all kinds. The scale of the data being collected increases greatly and thus also the number of measured features (variables/predictors/covariates). Fitting statistical models, which include a large number of features, often results in over-fitting since the resulting model complexity cannot be supported by the available sample size (Buehlmann et al. (2016)). Moreover, it becomes hard to detect the most predictive features and their importance with respect to the model. Thus, variable selection (i.e. the extraction of the relevant features for a given task) plays an ever more important role in many fileds including genetics (Lee et al. (2003)), astronomy (Zheng and Zhang (2007)), and economics (Foster and Stine (2004)).
In this paper we are interested in variable selection in the multiple linear regression model
where is a
-dimensional response vector,denotes the so-called design matrix which holds the potential predictors, is the vector of unknown regression coefficients, and denotes the noise vector with independent and identically distributed components. To avoid the need for an intercept it is assumed that the response is zero centered. Moreover, to allow for an easy evaluation of the influence of single predictors on the model based on the magnitude of the regression coefficients the predictors are treated as zero centered and standardized with for .
In the linear model one assumes that only some regression coefficients are different from zero. Thus, the problem of variable selection reduces to the identification of the nonzero regression coefficients (Alhamzawi and Taha Mohammad Ali (2018)). Especially shrinkage approaches such as the lasso (Tibshirani (1996)), the adaptive lasso (Zou (2006)), the elastic net (Zou and Hastie (2005)) and their Bayesian analogues (Park and Casella (2008); Alhamzawi et al. (2012); Leng et al. (2014); Huang et al. (2015)
) which simultaneously perform variable selection and coefficient estimation have been shown to be effective and are often the methods of choice in linear regression. These methods estimateas minimizer of the objective function , where
denotes the quadratic loss function (negative log-likelihood)and denotes a method specific penalty function that encourages a sparse solution. The parameter vector controls the penalization strength and thus the level of sparsity. Commonly, a good choice for is determined via cross-validation in practical applications. Among above mentioned approaches the most popular one is the least absolute shrinkage and selection operator (lasso), which was proposed by Tibshirani (1996)
. In the lasso regression the penalty function is defined as. Thus, the regression coefficients are continuously shrunken to zero and for sufficiently large some coefficients take exactly the value zero. The lasso can be viewed as a convex and therefore more efficiently solvable reformulation of the best subset selection approach, in which the penalty function is given by . Moreover, the lasso estimate for can be interpreted as a Bayesian posterior mode estimate when independent Laplace priors all with zero mean and the same scale parameter are assigned to the regression coefficients:
In contrast to the frequentist approach the Bayesian one (Park and Casella (2008)
) provides credible intervals for the model parameters which can be used for variable selection. A further generalization of the classical lasso is the adaptive lasso (Zou (2006)), which allows for different penalization factors of the regression coefficients:
For instance, the penalization vector
can be chosen based on the ordinary least squares estimatorwith , or in more complicated settings (collinear predictors, or ) similarly based on the minimizer of the ridge-regularized problem given by:
The adaptive lasso was proposed to address potential weaknesses of the classical lasso. It has been shown (Zou (2006)) that there are situations in which the classical lasso either tends to select inactive predictors, or over-shrinks the regression coefficients of correct predictores. In contrast, the adaptive lasso satisfies the so-called oracle properties introduced by Fan and Li (2001). A procedure satisfies these properties, if it asymptotically identifies the right subset model and has optimal estimation rate. For more details please refer to Fan and Li (2001). Recently, a Bayesain version of the adaptive lasso was proposed by Alhamzawi et al. (2012) and also by Leng et al. (2014). The Bayesian adaptive lasso generalizes the classical (i.e. non adaptive) Bayesian lasso by allowing different scale parameters in the Laplace priors of the regression coefficients:
The scale parameters
are then either choosen via marginal maximal likelihood in an emprical Bayesian setting or by assigning appropriate hyperpriors in a hierarchical Bayes approach. Another generalization of the classical lasso is the elastic net proposed byZou and Hastie (2005). Here the penalty function is given by:
The elastic net encourages a grouping of strongly correlated predictors, such that they tend to be in or out the model together. Morover, it works better than the classical lasso in the case where . There are also some Bayesian versions of the elastic net proposed in literature including the one proposed by Huang et al. (2015). In particular, we used their R-package EBglmnet to validate our novel method.
Besides above decribed shrinkage approaches, Bayesian methods that are based on the introduction of a random indicator vector gain increasing popularity (Liang et al. (2008); Guan and Stephens (2011); Fisher and Mehta (2014); Wang et al. (2015)). If equals zero the regression coefficient of the -th predictor is assumed to equal zero. This is equivalent to the assumption that the -th predictor does not explain the target . A common choice is to assign independent Bernoulli priors to the indicator variables:
For instance, Guan and Stephens (2011) used prior with the simplification and assigned an uniform Prior to . In the work of Fisher and Mehta (2014) the flat prior was used which is equivalent to seting in for . For given the vector of non-zero coefficients denoted by is usally assigned a conventional -prior (introduced by Zellner (1986)) where
denotes the design matrix with all columns deleted that correspond to indicator variables being equal to zero. The reason for the common choice of this prior is that it often leads to a computationally tractable Bayes factor.
Inspired by above described approaches, we propose a setting which is not based on a random indicator vector , but on a random set that holds the indices of the active predictors, i.e. the predictors with coefficients different from zero. We assign a prior to which depends on the cardinality of the set as well as on the actual elements of . This enables an easy formulation of useful a priori knowledge. In section 3 we show how the results of other methods can be used in order to define such a useful prior in terms of an empirical Bayes approach. To simulate from the joint posterior of our model we propose a novel random walk Metropolis-Hastings algorithm, in contrast to Guan and Stephens (2011); Fisher and Mehta (2014); Wang et al. (2015) where Gibbs sampling is applied for this task. This is especially useful when it is difficult or even impossible to determine the conditional distributions needed for the Gibbs sampling algorithm in an analytically closed form. In these cases the corresponding integrals have to be approximated and in particular for practitioners it can become a difficult task to choose a proper approximation technique among the various possible ones. For instance, in Liang et al. (2008) the Laplace approximation is used for this task. Wang et al. (2015) assigned on purpose a beta-prime prior with specific parameters to g in order to achive a closed-form expression of the marginal posterior of . Since we don’t need to care about closed-form expressions by using a random walk Metropolis algorithm, our approach is based on the more popular Zellner-Siow prior for (Zellner and Siow (1980)
), which is an inverse gamma distribution with shape parameterand scale parameter .
However, the -prior depends on the inverse of the empirical covariance matrix of the selected predictors. This matrix is singular if the number of selected covariates is greater than the number of observations and, further, may be almost rank deficient given that the predictors are highly correlated. To overcome this problem Wang et al. (2015) replaced the classical inverse with the Moore-Penrose generalized inverse and thus ended up with the so-called -prior (see West (2003)). In contrast to them, we adopt a -prior with an additional ridge parameter for the unknown regression coefficients to guarantee nonsingularity of the empirical covariance matrix. This modification of the classical -prior was first proposed by Gupta and Ibrahim (2007) and further investigated by Baragatti and Pommeret (2012).
Finally, in Section 2.2 we state that our approach is consistent in terms of model selection according to the consistency definition given by Fernández et al. (2001). The proof of this result is deferred to the appendix. Moreover, in Section 3, we evaluate our approach on the basis of real and simulated data and compare the results with the already described shrinkage methods. We show that our approach performs at least on par and often better than the comparative methods.
In this section we describe the hierarchical Bayesian approach we propose for the task of variable selection in multiple linear regression models. Moreover, we state that the novel model presented is consistent in terms of model selection. Finally, we propose an intelligent random walk Metropolis-Hastings algorithm to simulate from the joint posterior of the model parameters.
2.1 Hierarchical representation of the full model
In order to perform variable selection, a random set which holds the indices of the active predictors, i.e. the predictors with coefficients different from zero, is used. Moreover, it is assumed that the cardinality is greater than zero. In this paper the focus lies on problems where at least one of the measured features is predictive. Nevertheless, one can easily generalize the proposed approach in such a way that also the null model (all regression coefficients equal zero) is valid. Thus, we propose the following prior for :
with and for
The mapping can be arbitrarily chosen and should represent the a-priori belief in the model size, i.e. the number of parameters different from zero. For instance one could defineand the second parameter chosen in such a way that the mean of the distribution equals the a-priori most probable model size. On the other hand the parameters are used to represent the a-priori belief in the importance of the predictors to the model. In terms of empirical Bayes these parameters could be chosen based on the regression coefficients of a ridge regularized model (compare to the adaptive lasso) or based on the correlations (i.e. the linear dependencies) between the target variable and the predictors. Please note that in the case of equal a-priori importance of the predictors , the prior reduces to
Thus, in this case the prior only reflects the a-priori belief regarding , which shows the necessity of the factor in .
For the varianceof the error terms an inverse gamma prior is chosen:
For and the inverse gamma prior converges to the non-informative and improper Jeffreys prior (Jeffreys (1961)) which is invariant under a change of scale. In the experimental studies in Section 3, and are both set to such that the chosen prior is a proper approximation of the Jeffreys prior.
As already stated in the introduction, for given the vector of non-zero coefficients denoted by is commonly assigned a conventional -prior (introduced by Zellner (1986)), where denotes the submatrix of consisting of all columns corresponding to predictors with index in . Since the -prior depends on the inverse of the empirical covariance matrix of the selected predictors it is singular if the number of selected covariates is greater than the number of observations and further may be almost rank deficient provided that the predictors are highly correlated (see West (2003); Gupta and Ibrahim (2007); Baragatti and Pommeret (2012)). To overcome these two problems, based on the ideas of Gupta and Ibrahim (2007) and Baragatti and Pommeret (2012), we consider a ridge penalized version of the -prior:
with denoting an appropriately large constant. It should be explained what appropriately large means. Obviously, for the first possible problem () cannot appear. The second problem, i.e. an almost rank deficient matrix , also should not appear for , i.e. for sufficiently large . Even if the predictors are highly correlated, a huge sample size will prevent from beeing nearly singular and thus maybe computationally singular (compare Carsey and Harden (2013)). Although it has to be assumed that there is no predictor that is an exact linear combination of the other predictors. However, this assumption can be easily achieved by excluding such non-informative predictors from the beginning. It should be mentioned that setting to zero is not necessary at all in practical applications. This is just a theoretical consideration which will ease the proof of model selection consistency in A. The definition of for is taken from Baragatti and Pommeret (2012), who suggested to reduce the influence of the ridge parameter when the number of predictors is large up to the threshold .
Finally, we assign the popular Zellner-Siow prior (Zellner and Siow (1980)), i.e. an inverse gamma distribution with shape parameter and scale parameter to , resulting in the complete hierarchical representation of the model
where denotes the components of with indices not in ,
and satisfies equation .
2.2 Model consistency
In this section we state that model is consistent in terms of model selection. The proof of the result is is given in A. Therefore, throughout the section it is assumed that the sample is generated by model with parameters and , i.e.,
In rough words, model selection consistency means that the true model will be selected provided that enough data is available. A sound mathematical definition was given by Fernández et al. (2001)
. According to these authors the posterior probability is said to be consistent if
where the probability limit is taken with respect to the true sampling distribution . Moreover, some preliminary results which are required in our proof are formulated as Lemma A.1 in their work. In our notation this Lemma reads as:
Under the sampling model (also true model) ,
If is nested within or equal to model ,
Under the assumption that for any model that does not nest ,
Assume that holds. Then the proposed model setup is consistent in terms of model selection.
2.3 Metropolis-Hastings algorithm
As usual in Bayesian regression models, the joint posterior
is analytically intractable. The most common approach to overcome this problem is to use Markov chain Monte Carlo (MCMC) methods to simulate from this unknown distribution. Such a method generates realizations of a Markov chain which converges in distribution to the (conditional) random vector of interest, i.e. in this paper to. Thus, after some time of convergence, the so-called burn-in phase, the generated random numbers can be considered as a (dependent) sample from the posterior distribution. To finally obtain an independent sample only simulated numbers with a given distance are chosen, a so-called thinning is performed. In this paper a special Metropolis Hastings (MH) algorithm, and thus a special MCMC method, is proposed to simulate from . MH algorithms are iterative and in each iteration a random number is drawn from a proposal distribution which is then accepted or not based on a given criterion. If the proposal depends on the respectively last accepted random number one also speaks of a random walk MH algorithm. For further details on MCMC methods please refer to Fahrmeir et al. (2013). In order to define a MH algorithm for a given problem basically the only thing one has to specify is the proposal distribution used in the algorithm. Therefore, we define the proposal as follows:
Inspired by the work of Hastings (1970) the proposal distribution for the variance parameter as well as the proposal distribution of the parameter
are defined to be uniform distributions
where and denote tuning parameters. Tuning parameters are always chosen in such a way, that the acceptance rate of the algorithm is neither very low nor very high which should guarantee a fast convergence of the algorithm. In order to specify the proposal for the random setis introduced
where can be considered as a tuning parameter. The event means that the model size (number of nonzero regression coefficients) changes, i.e. . Moreover, a random variable with support on the index set of the regression coefficients is introduced in order to describe the model transition probabilities in case of changing model size. A realization of this random variable corresponds to the index of a predictor which is going to be added to or removed from the model. Transitions between models, which differ by two or more parameters, are not allowed. The probability mass function of is defined as
where the parameters are greater or equal to zero, sum up to one, and can but must not be identical to the parameters already used in the prior . Again, as the parameters the parameters can be used to represent a-priori beliefs about the importance of the predictors and a good choice of them should lead to a fast convergence of the MH algorithm. Especially, in problems with a high number of predictors it is essential to assign with informative values, since otherwise it might take a very large number of iterations until a MCMC method (see Section 2.3) used to simulate from the joint posterior might converge. According to in the case where a predictor is added with probability which might correspond to the a-priori importance of this predictor. Furthermore, a predictor is removed with a probability proportional to which might be the inverse a-priori importance of . In the case where a predictor is added with a probability proportional to and it is impossible that the only predictor in the model is removed, this would violate the prior assumption that with probability . Using and the proposal distribution is finally defined by:
Note that the probability mass function in equation is given by equation , since . Besides the easy evaluation of for a given input, one can easily simulate from this distribution. This is done by sampling from the distribution of followed by sampling from the conditional distribution of where the condioning takes places with respect to the sample of generated in the previous step. The only thing remaining to define the overall proposal is the conditional proposal . It is well known (see Fahrmeir et al. (2013)) that under the observation model with prior distributions and the conditional posterior of given is given by
Thus, a natural choice for is given by
Finally, the overall proposal distribution can be written as
One can easily simulate from the overall proposal by iteratively simulating from the factors in from right to left and conditioning on the values sampled up to the current step of execution.
3 Experimental studies
In this section the (predictive) performance of the proposed hierarchical Bayesian approach is analysed and compared with some other Bayesian and non-Bayesian methods. For this task we have implemented the MH algorithm described in Section 2.3 by using the interface between the programming languages R (R Core Team (2015)) and C. The comparison includes the lasso (Tibshirani (1996)), the adaptive lasso (alasso) (Zou (2006)), the elastic net (elastic) (Zou and Hastie (2005)), the Bayesian lasso (blasso) (Park and Casella (2008)), the Bayesian adaptive lasso (balasso) (Alhamzawi et al. (2012); Leng et al. (2014)) and the Bayesian elastic net (belastic) (Huang et al. (2015)). For our extensive comparisons three real-world datasets and also simulated datasets from two different artificial models are used. The performance comparison on the real-world datasets is carried out on the basis of a -fold cross-validation. Performing a -fold cross-validation on a given dataset means partitioning the dataset in equal sized subsets, iteratively selecting each of these subsets exactly once as testing data, whilst using the remaining subsets for training. Aggregating some measure of accuracy, respectively computed from each of the testing datasets, results in the overall performance of the models to evaluate. The performance comparison for the artificial models is accomplished by simulating multiple datasets from these models, then performing a train/test split for each simulated dataset, and finally again aggregating the single accuracies achieved. The accuracy measures chosen in this paper are the mean squared error (MSE) and the mean absolute deviation (MAD)
where denotes the cardinality of the testing dataset and the denote the predicted target values, for
. For the aggregation of single accuracies we are using the median since it is more robust to outliers than the mean of the values. The median of MSEs is then denoted by MMSE and the median of MADs by MMAD. For our novel approach it should be stressed that, for obtaining a predictioncorresponding to a test sample
, an estimate of the expected value of the posterior predictive distributionis used. Using Monte Carlo integration this expected value can be estimated as follows:
where is a sample of size from the marginal posterior . For all the non-Bayesian comparsion models the predicted target values are computed by calculating the inner product of the test samples with the estimate obtained by minimizing the corresponding model-specific objective function. For the Bayesian comparsion models the way the predictions are computed depend on the output provided by the R-packages used in this paper. In this work the R-function blasso included in the R-package monomvn (Gramacy (2018)) is applied in order to perform Bayesian lasso regression. Since the function blasso returns a sample from the marginal posterior , as for the model proposed by us, the predictions are computed by estimating the posterior predictive mean. Moreover, we use the R-Function brq included in the package Brq (Alhamzawi (2018)
) for applying the Bayesian adaptive lasso regression. It should be mentioned that this R-function is not developed for the classical linear regression setting where the conditional mean of the response variable is modelled, rather it is developed for the more general setting where any conditional quantile is modelled. For purposes of the comparision study in this paper the conditional median, i.e. the conditional-quantile, is selected. The R-function produces a Bayes estimate of which is then used for prediction (i.e. inner product computation with the test samples). In order to use the Bayesian elastic net the function EBglmnet included in the package EBglmnet (Huang and Liu (2016)) is used in this work. This function returns the posterior mean of the nonzero regression coefficients, which is then used for prediction (again via inner product computation with the test samples).
It is also important to mention how (inside the
-fold cross-validation for real-world datasets, or for the multiple simulated datasets) the hyperparameters of the trained models are determined. For the frequentist models, which are all implemented in the R-packageglmnet (Friedman et al. (2010)), the R-function cv.glmnet is used, which is also included in this package. The penalization factor of the lasso regression is chosen based on a -fold cross-validation on a one dimensional grid with the MAD as accuracy measure. Analogously, the penalization factors of the elastic net are selected based on a -fold cross-validation on a now two dimensional grid, again with MAD as accuracy measure. For the adaptive lasso the chosen penalization factors are the absolute values of the inverse regression coefficients obtained from the respective ridge-penalized problem, the penalization factor of which is also determined via -fold cross-validation. For the Bayesian version of the lasso and the adaptive lasso the default values of the corresponding functions blasso and brq are assigned to the hyperparameters of the a-priori distributions. For further details please refer to the help pages of these packages. However, it should be mentioned that for the Bayesian lasso the option of additional model selection based on a Reversible Jump MCMC algorithm is set to true for the experimental studies in this paper. Moreover, for the Bayesian elastic net the hyperparameters in the prior distribution are selected via -fold cross-validation. This is done by using the function cv.EBglmnet, which is also available in the package EBglmnet. The hyperparameters of our proposed Bayesian model are chosen differently in the following experimental studies to indicate useful possible choices. These specific choices are described later on in this article.
In order to obtain i.i.d. samples from the model-specific posterior distributions the R-functions used in this paper merely save a fraction of all samples generated by the MCMC methods implemented. For the Bayesian elastic net, the decision which samples to store is already taken by the implementation itself and cannot be user-defined. All other Bayesian models go along with implementations which allow for a user specific determination of this fraction. To do so the following procedure is chosen: In a first step, respectively, the first samples are deleted and, in a second step, all samples except of every -th one are deleted. Note that samples generated in the first step before convergence of the used MCMC algorithms are deleted, while thinning done in the second step should guarantee that the remaining samples can be considered as being independent. For each of the observed Bayesian models (dependent) samples are generated, except for the Bayesian adaptive lasso where ones are produced. This results in i.i.d. samples each, except of samples for the adaptive lasso. The reason for the special treatment is that in our experiments the Bayesian adaptive lasso turned out to require more samples to get stable estimates of the target variables.
3.1 Real-world studies
In this section the performance of the proposed method is compared with the performance of other well-established methods based on three real-world datasets. The comparison is made along the lines of the above considerations (see beginning of Section 3).
3.1.1 The diabetes dataset
In this section the popular diabetes dataset, originally used by Efron et al. (2004) to examine the so-called LARS algorithm, is used for the performance evaluation. This dataset includes ten predictors, age, sex, body mass index, average blood pressure, and six blood serum measurements, measured from diabetes patients. The target variable of interest is a quantitative measure of disease progression one year after baseline. In the R-package care (Zuber and Strimmer. (2017)) a standardized version (all variables have zero mean and variance equal to one) of the diabetes dataset is available. This version is used in our experiments.
The hyperparameters of the proposed Bayesian approach are specified as follows: In an empirical Bayes manner, the parameters which belong to the prior of and represent the a-priori belief in the importance of the predictors are assigned appropriate multiples (they must sum up to one) of the absolute values of the regression coefficients of the ridge regularized model which also determines the penalization factors of the adaptive lasso model. Moreover, the parameters (see Section 2.3) are specified identically to the parameters . As already mentioned in Section 2.1, one could define the mapping as the probability mass function of a zero truncated binomial distribution with size parameter equal to the number of predictors and the second parameter chosen in such a way that the mean of the distribution equals the a-priori most probable model size. Thus, a zero truncated binomial prior is assigned to the unknown model size . However, in general it is somewhat cumbersome to determine the second parameter in such a way that the distribution has a given mean. Note that the expected value of a zero truncated binomial distribution with size parameter and second parameter equals . Further, note that by the Abel-Ruffini theorem there is no algebraic solution to the general polynomial equations of degree five or higher with arbitrary coefficients. A classical binomial distribution with size parameter and second parameter equal to has mean . Thus, a zero truncated binomial distribution with these parameters will result in a distribution with expected value not too far away from . Therefore, defining as probability mass function of such a zero truncated distribution is a good alternative and our final choice in the performance evaluation of the diabetes dataset. Now the only question remaining is how to specify . Since there is no a-priori knowledge available one could determine a useful value of via cross-validation in an empirical Bayes manner. However, to save time we empirically tried some possible values for and very soon ended up with the specification (which results in a good performance as one can see later on in this section). The parameters and are both set to which will also hold for the remaining experimental studies, as already reported in Section 2. Finally, the tuning parameters (see Section 2.3) and are respectively set to the values and .
Performing a -fold cross-validation as described at the beginning of Section 3 results in Table 1. The proposed approach achieves the lowest MMSE as well as the lowest MMAD and thus performs better than all methods under comparison.
|Bayesian adaptive lasso||0.4997307||0.5786672|
|Bayesian elastic net||0.4895844||0.5727555|
3.1.2 The prostate cancer dataset
In this section the performance of the seven considered methods is compared on the prostate cancer data provided by Stamey et al. (1989). This dataset examines the relationship between the level of prostate specific antigen (lpsa) and eight clinical measures from patients who were about to receive a radical prostatectomy. The eight measurments are log cancer volume (lcavol), log prostate weight (lweight), age, log benign prostatic hyperplasia amount (lbph), seminal vesicle invasion (svi), log capsular penetration (lcp), Gleason score (gleason), and percentage Gleason scores 4 or 5 (pgg45). This dataset is included e.g. in the R-package lasso2 (Lokhorst et al. (2014)). The predictors are standardized, consequently we only have to zero-center the target variable lpsa to make the data compatible with the model proposed by us.
The hyperparameters of our new Bayesian approach are specified as follows: The parameters and are specified the same way as in Section 3.1.1. Again, is defined as probability mass function of a zero truncated binomial distribution with size parameter and second parameter , but now is specified as the number of nonzero regression coefficients from the corresponding lasso model. The remaining parameters are also defined as in Section 3.1.1.
|Bayesian adaptive lasso||0.547966||0.5614767|
|Bayesian elastic net||0.5267334||0.6138624|
Again, our method achieves the lowest MMSE, while some other methods go along with better MMADs on the prostate cancer dataset.
3.1.3 The ozone dataset
In this section the ozone datset originally studied by Breiman and Friedman (1985) is used for the performance comparison. The data was measured in the area of Upland, CA, east of Los Angeles on 330 days in 1976. The target variable is the daily maximum one-hour-average ozone reading (parts per million) and the predictors are:
Temperature (degrees F)
Inversion base height (feet)
Daggett pressure gradient (mmHg)
Vandenburg 500 millibar height (m)
Inversion base temperature (degrees F)
Wind speed (mph)
The dataset is available in the R-package gclus (Hurley (2012)). We decide to examine the regression model including all linear, quadratic and two-way interactions, resulting in possible predictors. Moreover, we zero center and standardize (variance equal to one) all variables previous to the training of the diverse models. We want to mention that in contrast to a statement at the beginning of Section 3 in this performance evaluation for all Bayesian models MCMC samples are drawn and not or as stated there. This is the only experimental study where this statement is violated.
The hyperparameters of the newly proposed Bayesian approach are specified as follows: We assign to each of the parameters and the value , i.e. assume equal a-priori importance of the predictors. The function is defined in such a way, that it maps a value to a value proportional to the third power of the probability , where denotes a random variable which is zero truncated binomial distributed with size parameter and second parameter . We empirically discover that taking the power of three can be a helpful step to guarantee fast convergence of the proposed MH algorithm (see Section 2.3) in case of an increasing number of predictors and a prior based on a binomial distribution. By choosing the constant of proportionality in the definition of in such a way, that
defines a probability distribution, the resulting distribution will have a smaller variance than the generating binomial distribution. Finally, the tuning parameters (see Section2.3) and are set to the values and , respectively.
|Bayesian adaptive lasso||0.2458003||0.3865291|
|Bayesian elastic net||0.2409481||0.3853373|
3.2 Simulation studies
In this section, on the basis of simulated data corresponding to two different artificial models, the performance of our new method is compared with the performance of other well-established methods. The artificial models on purpose include many correlated predictors, from which only a small subset is predictive, i.e. has regression coefficients different from zero. This should reflect difficult variable selection problems that more and more organizations have top face in their daily life nowadays and in the future. From both artificial models multiple datasets are simulated. In particular, datasets are sampled, each according to three different settings. First the sampling takes place with training observations, then with training observations and finally with training observations. The number of test observations is always the same and given by the value
. Moreover, all the observations are drawn independently from each other. For each sampled dataset the entire target vector (includes target values from both the training set and the testing set) is standardized via division by the sample standard deviation of the training target values. This simplifies finding a good choice of the tuning parameters for the MH algorithm proposed. The scaling allows for choosing them equal to specifications used in Section3.1. The predictors are not scaled at all since they are sampled from distributions with mean zero and standard deviation one.
3.2.1 Simulation study 1
In this simulation study data are generated from the model
with and the remaining coefficients equal to zero. This model is inspired by those used to evaluate the performance of the spike-and-slab lasso by Ročková and George (2018).
The hyperparameters of the proposed Bayesian approach are specified as follows: The parameters are defined as multiples of the absolute sample correlations between the training responses and the training predictors . Note, that the multiplicative factor has to be defined in such a way that the parameters sum up to one. Moreover, the parameters are specified identical to the parameters . The function maps each value to a value proportional to the third power of the probability , where denotes a random variable which is zero truncated binomially distributed with size parameter and second parameter . Finally, the tuning parameters and are set to the values and , respectively.
Aggregating the accuracy measures obtained by training the models to be compared, as described at the beginning of Section 3, results in Table 4. The proposed approach achieves the lowest MMSE as well as the lowest MMAD and thus performs better than all other methods.
|Bayesian adaptive lasso||0.7795937||0.7110935||0.8896619||0.7637107||0.3793791||0.4924849|
|Bayesian elastic net||0.8148157||0.7146019||0.3324611||0.4647419||0.2668035||0.4122355|
In the Figures 4-4 boxplots of the MSEs and MADs obtained from the simulated datasets with and training observations can be found. These plots again illustrate that our new model performs significantly better than the other models considered. Moreover, one can see that the implementation of the Bayesian adaptive lasso has some serious problems with these simulated data, due to numerical instabilities.
To provide some more insights into the new method proposed we consider a representative dataset simulated from model with observations. In Figure 6 one can see the Markov chain of the model size (i.e. the number of non-zero regression coefficients) produced from our MH algorithm. Further, in Figure 6 the relative frequencies of a given regression coefficient being non-zero are plotted for the reduced Markov chain. By the latter one we mean the Markov chain remaining after the deletion step performed to obtain i.i.d. samples. The relative frequencies can be interpreted as the a posteriori probabilities of the predictors being included in the model. All the truly non-zero regression coefficients are colored blue, while the remaining coefficients are colored red. One can see that the proposed approach selects the true data generating model pretty well. Finally, we want to report that the acceptance rate of the proposed MH algorithm is given by the value , which is a good acceptance rate indicating that the algorithm converges fast.
3.2.2 Simulation study 2
In this simulation study data are generated from the model
with and the remaining coefficients equal to zero. Simulating from this model results in more difficult datasets in context of variable selection than simulating from the model defined in Section 3.2.1. There are additional potential predictors and the pairwise correlations between them are increased from to .
The hyperparameters of the newly proposed Bayesian approach are specified, in an empirical Bayes manner, as follows: The parameters are assigned appropriate multiples (they must sum up to one) of the absolute values of the regression coefficients of the ridge regularized model, which also determines the penalization factors of the adaptive lasso model. The parameters are chosen analogously, except that the third power of the ridge coefficients is chosen. The function maps each value to a value proportional to the third power of the probability , where denotes a random variable which is zero truncated binomially distributed with size parameter and second parameter . Finally, the tuning parameters and are set to the values and , respectively.
Due to numerical issues in the corresponding R-implementation the Bayesian adaptive lasso has to be excluded from the model comparison. Aggregating the accuracy measures obtained by training the remaining models results in Table 5. The proposed approach achieves the lowest MMSE as well as the lowest MMAD in the settings with and training observations. In the setting with training observations it achieves a comparable accuracy. The Figures 10-10 show boxplots of the MSEs and MADs obtained from the simulated datasets with and training observations. In the setting our method outperforms the other ones. The median of the MADs as well as the median of the MSEs lies below all corresponding boxes of the other methods.
|Bayesian elastic net||1.147992||0.8520899||0.7743223||0.6907885||0.8825549||0.7553813|
In this article we have presented a novel Bayesian approach to cope with the problem of variable selection in the multiple linear regression model with dependent predictors. To make our method robust to challenges such as multicollinearity, or the number of observations being smaller than the number of potential predictors, a -prior with an additional ridge parameter was assigned to the unknown regression coefficients. While other authors apply Gibbs sampling to simulate from the joint posterior distribution we presented an intelligent Metropolis Hastings algorithm for this task. Thus, we do not need to compute the conditional distributions needed for the Gibbs sampling algorithm, which often are not available in an analytically closed form, and have to be approximated. Further, we have shown that the proposed Bayesian approach is consistent in terms of model selection under some nominal assumptions. Experimental studies with three different real-world datasets as well as simulated datasets from two artificial models demonstrated the good performance of the presented method. In particular, on the simulated datasets our new approach performed significantly better than all other well-established methods.
The publication is one of the results of the project iDev40 (www.idev40.eu). The iDev40 project has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 783163. The JU receives support from the European Union’s Horizon 2020 research and innovation programme. It is co-funded by the consortium members, grants from Austria, Germany, Belgium, Italy, Spain and Romania.
Appendix A Proof of Theorem 1
Note that for showing it suffices to show that
holds, where denotes the true model and denotes another model different from the true one. The sets and , respectively, denote the index sets of the active predictors (predictors probably being different from zero) of the two models and thus uniquely determine them. In a first step of the proof we compute the joint posterior of and
up to a constant of proportionality. Simple linear algebra and the facts that the densities of multivariate normal distributions and inverse gamma distributions integrate to one reveal for:
and denotes the cardinality of . In principle, the required marginal posterior of can be found by integrating out in equation (A.2). However, there exists no closed-form solution of the corresponding integral. To overcome this problem, a lower and an upper bound of the marginal posterior are derived. The numerator of the quotient in (A.1) is then replaced by the upper bound and the denominator by the lower bound such that the resulting quotient is an upper bound of the original one. Showing that the upper bound of the original quotient converges to zero will immediately imply that (A.1) holds. To determine these bounds, at first, a property of the Gaussian hypergeometric function given by
and convergent for with and for only if and is derived. It can be easily validated that for and the following equation holds: