 # Neuronized Priors for Bayesian Sparse Linear Regression

Although Bayesian variable selection procedures have been widely adopted in many scientific research fields, their routine use in practice has not caught up with their non-Bayesian counterparts, such as the Lasso, due to difficulties in both Bayesian computation and in testing the effects of different prior distributions. To ease these challenges, we propose the neuronized priors to unify and extend existing shrinkage priors such as one-group continuous shrinkage priors, continuous spike-and-slab priors, and discrete spike-and-slab priors with point-mass mixtures. The new priors are formulated as the product of a weight variable and a scale variable. The weight is a Gaussian random variable, but the scale is a Gaussian variable controlled through an activation function. By altering the activation function, practitioners can easily implement a large class of Bayesian variable selection procedures. Compared with classic spike-and-slab priors, the neuronized priors achieve the same explicit variable selection without employing any latent indicator variable, which results in more efficient MCMC algorithms and more effective posterior modal estimates obtained from a simple coordinate-ascent algorithm. We examine a wide range of simulated and real data examples and also show that using the "neuronization" representation is computationally as efficient as, or more efficient than, its standard counterpart in all well-known cases.


## 1 Introduction

We consider the Bayesian linear regression problem in high dimensions. Suppose that the observed data are generated from a model with unknown coefficients denoted by $\theta = (\theta_1, \dots, \theta_p)^T$. Here the linear model is

$$y = X\theta + \epsilon, \tag{1}$$

where $X \in \mathbb{R}^{n \times p}$ is the covariate matrix, $y \in \mathbb{R}^n$, and $\epsilon \sim N(0, \sigma^2 I_n)$. Under the sparsity assumption on $\theta$, we typically impose a shrinkage prior on each coefficient. A popular choice is the one-group (continuous) shrinkage prior, which we refer to as the "continuous shrinkage prior" here. It can be constructed via a product of independent hierarchical Gaussian mixture distributions:

$$\theta_j \mid \tau_w^2, \tau_j^2 \sim N(0, \tau_w^2 \tau_j^2), \qquad \tau_j \sim \pi_\tau, \quad \tau_w \sim \pi_g, \tag{2}$$

for $j = 1, \dots, p$, where $\pi_\tau$ and $\pi_g$ are some densities chosen by the user. The variance parameter $\tau_j^2$ that governs the shrinkage level of the individual parameter $\theta_j$ is called the local shrinkage parameter, and the variance parameter $\tau_w^2$ that controls the overall shrinkage effect is called the global shrinkage parameter (Polson & Scott 2010).
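As a concrete illustration of hierarchy (2), the following sketch (stdlib Python only; the half-Cauchy choice for $\pi_\tau$, i.e., the horseshoe, is our illustrative assumption, and $\tau_w$ is held fixed rather than given a hyperprior) draws from the scale-mixture construction:

```python
import random
import math

def sample_shrinkage_prior(p, tau_w=1.0, seed=0):
    """Draw theta_1,...,theta_p from hierarchy (2) with a half-Cauchy
    pi_tau (the horseshoe choice) and tau_w fixed at its argument."""
    rng = random.Random(seed)
    thetas = []
    for _ in range(p):
        # half-Cauchy(0, 1) local scale via the inverse-CDF of a Cauchy
        tau_j = abs(math.tan(math.pi * (rng.random() - 0.5)))
        # theta_j | tau_j ~ N(0, tau_w^2 * tau_j^2)
        thetas.append(rng.gauss(0.0, tau_w * tau_j))
    return thetas

draws = sample_shrinkage_prior(p=10000)
```

The heavy-tailed local scales produce both many near-zero draws and occasional very large ones, which is the qualitative signature of global-local shrinkage priors.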

There have been numerous choices of $\pi_\tau$ considered to induce shrinkage on the parameters. These priors include the Bayesian Lasso (Park & Casella 2008) with $\pi_\tau$ such that $\tau_j^2$ follows an exponential distribution, the horseshoe prior (Carvalho et al. 2010) with $\pi_\tau$ being a half-Cauchy distribution, the generalized double Pareto (Armagan et al. 2013) with the marginal prior being a scale mixture of Laplace distributions, and the Dirichlet-Laplace prior (Bhattacharya et al. 2015) with the marginal prior being the distribution of the product of a Dirichlet and a Laplace random variable. Recently, theoretical properties of the prior choice for $\theta_j$ have been investigated, and the results show that a marginal prior density of $\theta_j$ with a heavy tail and sufficient mass around zero achieves the minimax optimal rate of posterior contraction (Ghosh et al. 2017, van der Pas et al. 2016), like the point-mass mixtures of spike-and-slab priors that will be introduced later.

In Gaussian linear regression models, MCMC sampling of $\theta$ given the local and global shrinkage parameters can be efficiently implemented by taking advantage of the conjugacy. However, while the continuous shrinkage priors have computational advantages over point-mass priors, their posterior approximation is still difficult in high-dimensional settings. Also, the resulting posterior samples do not automatically provide sparse estimates of the coefficients, so extra steps are required for variable selection (Hahn & Carvalho 2015).

Another popular class of shrinkage priors is the class of two-group mixture priors, called the spike-and-slab (SpSL) priors (Mitchell & Beauchamp 1988, George & McCulloch 1993). These priors are represented by a mixture of two densities as follows:

$$\theta_j \mid \gamma_j \sim (1 - \gamma_j)\,\pi_0(\theta_j) + \gamma_j\,\pi_1(\theta_j), \qquad \gamma_j \sim \text{Bernoulli}(\eta), \tag{3}$$

for a hyperparameter $\eta \in [0, 1]$ and $j = 1, \dots, p$. The density function $\pi_0$ is typically chosen to be highly concentrated around zero so that its shape is spiked at zero, whereas $\pi_1$ is relatively disperse (the slab part). When $\gamma_j = 0$, the parameter $\theta_j$ is strongly forced to shrink towards zero, and when $\gamma_j = 1$, the prior imposed on $\theta_j$ should have a minimal shrinkage effect. Throughout this article, when a point-mass density at zero is used for $\pi_0$, we refer to the resulting prior as the "discrete SpSL prior", and we refer to the SpSL prior with a non-degenerate $\pi_0$ as the "continuous SpSL prior".

Common choices for $\pi_0$ and $\pi_1$ are Gaussian distributions with a small and a large variance, respectively (George & McCulloch 1993, 1997, Narisetty & He 2014). The role of $\eta$ is to control the sparsity, as it governs how many parameters are significant (Scott & Berger 2010). Under some regularity conditions, it has been shown that an appropriate choice of $\eta$ leads to model selection consistency (Narisetty & He 2014) and the optimal posterior contraction (Castillo et al. 2012, 2015) for high-dimensional linear regression and normal mean models. However, the computational implementation is challenging due to the adoption of the binary latent variables. In particular, when a point-mass prior at zero is used as $\pi_0$, the approximation of the posterior distribution of the $\theta_j$'s is notoriously challenging. MCMC sampling strategies (Dellaportas et al. 2002, Guan & Stephens 2011) and stochastic search strategies (Hans et al. 2007, Berger & Molina 2005, Zhang et al. 2007) have been proposed to attack this computational difficulty, mostly relying on the conjugacy of each component of the prior. However, a computational strategy for general discrete SpSL priors, such as one using reversible jump proposals (Green 1995), is rarely practical, especially under high-dimensional settings.

As a computationally scalable implementation of continuous SpSL priors, Rockova & George (2014) proposed the Expectation-Maximization Variable Selection (EMVS), an EM algorithm to evaluate the maximum a posteriori (MAP) estimator of the regression coefficients under continuous SpSL priors with Gaussian components, and Rockova & George (2018) extended the idea to the spike-and-slab Lasso (SSLasso) prior by adopting Laplace distributions for $\pi_0$ and $\pi_1$. These procedures, however, provide only point estimates, and are insufficient for quantifying the uncertainty in model selection and estimation.

To address these computational and practical issues with shrinkage priors, we propose the neuronized priors, which provide a unified form of shrinkage priors including as special cases continuous shrinkage priors, continuous SpSL priors, and discrete SpSL priors. In the form of neuronized priors, each parameter is reparameterized as the product of a weight parameter and a transformed scale parameter via an activation function. We define the proposed prior as follows:

###### Definition 1.1.

(Neuronized prior) For a nondecreasing activation function $T$ and a pre-fixed hyperparameter $\alpha_0 \in \mathbb{R}$, a neuronized prior for $\theta_j$ is defined as:

$$\theta_j \coloneqq T(\alpha_j - \alpha_0)\, w_j, \tag{4}$$

where the scale parameter $\alpha_j$ follows $N(0, 1)$ and the weight parameter $w_j$ follows $N(0, \tau_w^2)$ for a hyperparameter $\tau_w > 0$, and $j = 1, \dots, p$.
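A minimal sketch of Definition 1.1 (stdlib Python; the function and variable names are ours, not from the paper) that draws from a neuronized prior for any activation function:

```python
import random

def sample_neuronized_prior(T, alpha0, tau_w, size, seed=0):
    """Draw i.i.d. theta = T(alpha - alpha0) * w with alpha ~ N(0, 1)
    and w ~ N(0, tau_w^2), as in (4)."""
    rng = random.Random(seed)
    out = []
    for _ in range(size):
        alpha = rng.gauss(0.0, 1.0)   # scale parameter
        w = rng.gauss(0.0, tau_w)     # weight parameter
        out.append(T(alpha - alpha0) * w)
    return out

relu = lambda t: max(t, 0.0)   # discrete SpSL behavior (Section 2.1)
ident = lambda t: t            # Bayesian-Lasso-like behavior (Section 2.2)

theta_spsl = sample_neuronized_prior(relu, alpha0=0.0, tau_w=1.0, size=10000)
theta_blasso = sample_neuronized_prior(ident, alpha0=0.0, tau_w=1.0, size=10000)
```

With the ReLU activation and $\alpha_0 = 0$, about half of the draws in `theta_spsl` are exactly zero, matching the point-mass component of the discrete SpSL prior.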

As the name implies, this formulation is inspired by the use of activation functions in neural network models (Rosenblatt 1958, Rumelhart et al. 1986). However, unlike neural network models, the proposed formulation is fully parametric, and it retains clear interpretability of the regression coefficients. In neuronized priors, we use an activation function in the formulation of shrinkage priors, and show that, for most existing shrinkage priors, we can find specific activation functions such that the resulting neuronized priors correspond to the existing ones. As a consequence, existing theoretical properties of various shrinkage priors apply exactly to posterior behaviors based on the neuronized priors. This theoretical equivalence will be discussed in Section 2. We also show that variable selection procedures based on neuronized priors attain the following advantages over existing shrinkage priors:

• Unification. Without changing computational algorithms, various classes of shrinkage priors can be practically implemented by simply changing the activation function. This characteristic significantly reduces the practical and computational effort needed to migrate from one shrinkage prior to another in a different class, e.g., from a horseshoe prior to a discrete SpSL prior. It is of practical value and importance to scientists who adopt a Bayesian variable selection procedure to examine the effects of different prior choices, and our Bayesian computational procedure with neuronized priors can certainly help with this effort. In Table 1 in Section 2, we provide details regarding how a choice of the activation function connects the corresponding neuronized prior to a commonly considered shrinkage prior.

• Efficient MCMC implementation. By formulating the discrete SpSL prior as a neuronized prior using a Rectified Linear Unit (ReLU) activation function ($T(t) = \max(t, 0)$; Glorot et al. 2011), we can significantly improve the computational efficiency of the corresponding Bayesian variable selection procedure under high-dimensional settings (Section 2). Moreover, neuronized versions of continuous shrinkage priors also attain comparable or better efficiency in the corresponding MCMC sampling compared with the standard procedures.

• Scalable optimization to evaluate the MAP estimator. For massive data sets, MCMC algorithms are often less practical, and one needs to consider the problem of finding the MAP estimator. To this end, we propose an efficient coordinate-ascent optimization algorithm. Unlike the EMVS of Rockova & George (2014), the proposed procedure can be applied to a more general class of shrinkage priors, including continuous shrinkage priors as well as continuous and discrete SpSL priors. Compared with the Majorization-Minimization (MM) method of Yen et al. (2011), the EMVS, and the SSLasso, the proposed algorithm with a warm start performs much better in finding the MAP estimator for the notoriously challenging regression problem with discrete SpSL priors (Section 4.3).

We will demonstrate the neuronized counterparts of three popular shrinkage priors for Bayesian sparse regression in Section 2: the discrete SpSL, the Bayesian Lasso, and the horseshoe prior. In Section 3, we show how to manage the neuronized prior so as to achieve one's intended goals, such as matching a given prior or controlling the prior sparsity. We describe two main advantages of using neuronized priors in Section 4: more efficient MCMC sampling and more effective mode-finding. In Section 5, we cover a wide range of simulation studies to compare the effects of different priors and provide evidence showing that Bayesian solutions with discrete SpSL priors and their neuronized counterparts tend to perform better than other approaches when the signal is weak to moderate. Two real data examples are analyzed in Section 6, and a short conclusion is given in Section 7. All proofs of the main results are provided in the Appendix.

## 2 Connections of Neuronized Priors to Existing Priors

### 2.1 Discrete SpSL prior

Consider the ReLU (or hinge) activation function $T(t) = \max(t, 0)$. When $\alpha_0 = 0$ and $\alpha \sim N(0, 1)$, since $P(T(\alpha) = 0) = 1/2$, the distribution of $T(\alpha)$ follows an equal mixture of a point mass at zero and the half standard Gaussian, as shown in Figure 4(a). This implies that the marginal density of $\theta = T(\alpha)w$ in Figure 4(b) is equivalent to the standard discrete SpSL prior (Figure 4(c)), and can be written as

$$\theta \mid \gamma \sim (1 - \gamma)\,\delta_0(\theta) + \gamma\,\pi(\theta), \qquad \gamma \sim \text{Bernoulli}(1/2), \tag{5}$$

where $\pi$ is the marginal density of the product of two independent standard Gaussian random variables, which can be shown to have an exponential tail.

In general, the hyperparameter $\alpha_0$ controls the prior probability of sparsity. Since $\theta = T(\alpha - \alpha_0)w = 0$ if and only if $\alpha \le \alpha_0$, it follows that $P(\theta = 0) = \Phi(\alpha_0)$, where $\Phi$ is the standard Gaussian CDF. More precisely, setting $\eta = 1 - \Phi(\alpha_0)$ in (5) results in the same prior as the neuronized prior corresponding to $\alpha_0$. Conversely, given $\eta$, we choose $\alpha_0 = \Phi^{-1}(1 - \eta)$ to result in the desired neuronized prior.
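The relation $P(\theta = 0) = \Phi(\alpha_0)$ is easy to verify by simulation; the sketch below (stdlib only) compares the empirical zero frequency of the ReLU neuronized prior with the Gaussian CDF computed via the error function:

```python
import random
import math

def std_normal_cdf(x):
    """Phi(x) = 0.5 * (1 + erf(x / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def empirical_zero_freq(alpha0, size=100000, seed=11):
    """Fraction of exact zeros under the ReLU neuronized prior:
    theta = max(alpha - alpha0, 0) * w is zero iff alpha <= alpha0."""
    rng = random.Random(seed)
    zeros = sum(1 for _ in range(size) if rng.gauss(0.0, 1.0) <= alpha0)
    return zeros / size

for a0 in (-1.0, 0.0, 1.5):
    print(a0, empirical_zero_freq(a0), std_normal_cdf(a0))
```

The empirical frequencies agree with $\Phi(\alpha_0)$ up to Monte Carlo error, so a desired prior sparsity level $\eta$ can be set through $\alpha_0 = \Phi^{-1}(1 - \eta)$.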

In high-dimensional settings, theoretical results for discrete SpSL priors are well-studied, and Castillo et al. (2012) and Castillo et al. (2015) considered a class of hyperpriors on the proportion parameter $\eta$ of the Bernoulli variable in (3). That is,

$$\eta \sim \text{Beta}(1, p^a), \tag{6}$$

where $a > 1$. By using this prior on $\eta$, they investigated model selection consistency and the posterior contraction rate related to the choice of $a$ ($\alpha_0$ for the neuronized prior) under linear regression models. The same theoretical claims can be applied to the neuronized prior with a ReLU activation function by choosing $\alpha_0$ so that the prior inclusion probability $1 - \Phi(\alpha_0)$ matches that implied by (6). This connection is due to the fact that the prior inclusion probability under (6) is $E[\eta] = 1/(1 + p^a)$, so it follows that this probability is of order $p^{-a}$ by using Stirling's formula. Then, the corresponding $\alpha_0$ is $\Phi^{-1}(1 - p^{-a})$.

To make a comparison between the discrete SpSL prior and its neuronized version, we consider the Boston housing price data under the linear regression model in (1), which contains median housing prices of owner-occupied homes in the Boston area, together with ten variables that might be associated with the median prices. Under the Jeffreys prior on $\sigma^2$, that is, $\pi(\sigma^2) \propto 1/\sigma^2$, we consider the independent neuronized prior $\theta_j = T(\alpha_j - \alpha_0)w_j$, where $\alpha_j \sim N(0, 1)$ and $w_j \sim N(0, \sigma^2\tau_w^2)$ for $j = 1, \dots, p$. The solution path of each variable selection procedure is provided in Figure 5, and it shows that the solution path of the neuronized prior with the ReLU function is almost identical to that of the standard discrete SpSL prior.

Figure 5: Solution paths of the neuronized prior and the discrete SpSL prior.

### 2.2 Bayesian Lasso

The Bayesian Lasso imposes a double-exponential prior on $\theta_j$ and uses a latent-variable representation to facilitate efficient MCMC computations (Park & Casella 2008). We show below that with the identity activation function $T(t) = t$, the resulting neuronized prior is approximately equivalent to the Bayesian Lasso prior. The similarity between the Bayesian Lasso and the neuronized prior under the identity activation function can be explained by the marginal density form of the neuronized prior given in the following lemma.

###### Lemma 2.1.

The use of the activation function $T(t) = t$ results in the marginal density of the neuronized prior being proportional to $\int_0^\infty u^{-1} \exp\{-\theta^2/(2\tau_w^2 u) - u/2\}\, du$.

The form under the integral is similar to that of the Bayesian Lasso prior, i.e., $\int_0^\infty u^{-1/2} \exp\{-\theta^2/(2u) - \lambda^2 u/2\}\, du$, and the only difference is the extra $u^{-1/2}$ term in the integrand. Furthermore, the following proposition shows the tail behavior of the neuronized prior.

###### Proposition 2.2.

Let $\pi_\theta$ be the marginal density function of $\theta$ defined in (4) with $T(t) = t$ and $\alpha_0 = 0$. Then, for any $\epsilon > 0$, there exists $t_0$ such that $C_1\exp\{-(1+\epsilon)|t|/\tau_w\} \le \pi_\theta(t) \le C_2\exp\{-(1-\epsilon)|t|/\tau_w\}$ for all $|t| > t_0$ and some constants $C_1$ and $C_2$.

Proposition 2.2 shows that when $T$ is linear, the resulting neuronized prior attains the same tail behavior as that of the Bayesian Lasso (double-exponential) prior. This result also suggests that the slab part of the neuronized prior based on the ReLU function has an exponential tail. This tail behavior is theoretically desirable, because the adaptive minimax rate of posterior contraction can be achieved when the tails of the slab part in the discrete SpSL prior are at least exponential (or heavier) (Castillo et al. 2012, 2015).

Figure 8(a) shows a QQ plot of 100,000 samples from the Bayesian Lasso prior against its neuronized version, verifying that the two distributions are indeed very similar. There exists, however, a small bump in the QQ plot, showing that the neuronized prior has slightly more density around zero than the standard Bayesian Lasso prior. Figure 9 shows the solution paths of the Bayesian Lasso, the neuronized Bayesian Lasso, and the standard Lasso for the analysis of the Boston housing price data set; the three solution paths are almost identical.

Figure 9: Solution paths of the neuronized prior, the Bayesian Lasso and the Lasso.

A similar formulation related to the Bayesian Lasso was considered in Hoff (2017), who showed that the MAP estimator based on the product representation of the parameter (i.e., the neuronized prior) is identical to the standard Lasso estimator. This fact justifies the use of the linear activation function to approximate the Bayesian Lasso prior density.

### 2.3 Horseshoe prior

We propose a class of activation functions that lead the corresponding neuronized prior to approximate the horseshoe prior.

###### Proposition 2.3.

Let $\pi_\theta$ be the marginal density function of $\theta$ defined in (4) with $\alpha_0 = 0$ and $T(t) = \text{sign}(t)\exp(\lambda t^2)$ for some $\lambda > 0$, where $\text{sign}(\cdot)$ is the sign function. Then, for any $\epsilon > 0$, there exists $t_0$ such that $C_1|t|^{-(1 + 1/(2\lambda)) - \epsilon} \le \pi_\theta(t) \le C_2|t|^{-(1 + 1/(2\lambda)) + \epsilon}$ for all $|t| > t_0$, where $C_1$ and $C_2$ are some constants.

Proposition 2.3 indicates that when $T(t) = \text{sign}(t)\exp(\lambda t^2)$ for some $\lambda > 0$, the tail behavior of the corresponding neuronized prior is polynomial. We numerically evaluated the neuronized prior that was closest to the horseshoe prior by choosing the activation parameters via the matching procedure; the details of the general numerical derivation are given in Section 3.1.

Figure 8(b) shows a QQ plot of 100,000 samples from the horseshoe prior against its neuronized version, illustrating that the two marginal prior distributions are very similar. Figure 10 compares the solution paths under the two priors for the same Boston housing price data, again demonstrating their nearly identical behaviors. Finally, we summarize the connections between the existing priors and the neuronized priors in Table 1.

Figure 10: Solution paths of the neuronized prior and the horseshoe prior.

Although the neuronization formulation introduced in (4) covers a large class of prior densities as demonstrated, it cannot approximate all possible priors. For example, nonlocal prior densities (Johnson & Rossell 2010, 2012, Rossell & Telesca 2017), which have bimodal shapes symmetric around zero, cannot be formulated as neuronized priors. However, it is still possible to capture the bimodality of a density by imposing a bimodal prior density on $w_j$ in the neuronized prior, instead of the Gaussian. Also, dependent prior densities cannot be represented by the product of independent densities; examples include Zellner's $g$-prior (Zellner 1986) and the Dirichlet-Laplace prior (Bhattacharya et al. 2015). An extension of the neuronized prior to a multivariate version may overcome this limitation.

Remark. When a linear activation function is used, the parameter $\theta_j$ itself is identifiable. However, there is an unidentifiability issue for the individual $\alpha_j$ and $w_j$, since switching the signs of $\alpha_j$ and $w_j$ induces the same parameter value, i.e., $\theta_j = \alpha_j w_j = (-\alpha_j)(-w_j)$. In contrast, for the ReLU activation function and the exponential activation function for the horseshoe prior, the signs of $\alpha_j - \alpha_0$ and $w_j$ are identifiable, because these activation functions are non-decreasing and their codomains are non-negative.

## 3 Managing the Neuronized Prior

### 3.1 Finding the activation function to match a given prior

We have shown that some existing priors can be approximately represented by neuronized priors, as summarized in Table 1. However, it is still not clear how to choose the activation function when we want to approximate a given arbitrary prior density. To address this issue, we propose a numerical procedure to derive an activation function that leads the corresponding neuronized prior to match a given prior density $\pi^*$. We denote by $T_\phi$ the class of activation functions that may be used by a neuronized prior, where $\phi$ is a parameter that determines the form of the activation function. For example, B-spline basis functions can be used to approximate the activation function, expressed as $T_\phi(t) = \phi^T B(t)$, where $B(t) = (B_1(t), \dots, B_K(t))^T$ is a vector of $K$ B-spline basis functions and $\phi \in \mathbb{R}^K$.

We first draw a large number, $S$, of i.i.d. pairs $(\alpha_i, w_i)$ for $i = 1, \dots, S$, and then derive a sample of i.i.d. draws from the neuronized prior as

$$\tilde{\theta}_{\phi,i} = T_\phi(\alpha_i - \alpha_0)\, w_i, \qquad i = 1, \dots, S.$$

We also generate a sample of i.i.d. draws $\theta^*_1, \dots, \theta^*_S$ from the original prior $\pi^*$. We measure the distance between these two samples, for example, by defining the distance $D(\phi) = \sum_{i=1}^S |\tilde{\theta}_{\phi,(i)} - \theta^*_{(i)}|$, where $\tilde{\theta}_{\phi,(i)}$ and $\theta^*_{(i)}$ are the $i$-th order statistics of the generated samples $\{\tilde{\theta}_{\phi,i}\}$ and $\{\theta^*_i\}$, respectively. Other attractive measures are the Kolmogorov-Smirnov and the Wasserstein distances. Then, we can minimize $D(\phi)$ with respect to $\phi$ by using a grid search algorithm or a simulated annealing algorithm (Kirkpatrick & Vecchi 1983). This optimization is not computationally intensive as long as the dimension of $\phi$ is moderate.
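The following sketch illustrates this matching procedure in a deliberately simple setting: a one-parameter power family $T_\phi(t) = \mathrm{sign}(t)|t|^\phi$ in place of a B-spline basis, a grid search in place of simulated annealing, and, as a sanity check, a target sample that is itself a product of two standard Gaussians, so the search should recover $\phi \approx 1$:

```python
import random

def order_stat_distance(a, b):
    """D = sum of absolute differences between order statistics."""
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b)))

def neuronized_sample(phi, pairs, alpha0=0.0):
    """theta_i = T_phi(alpha_i - alpha0) * w_i, T_phi(t) = sign(t)|t|^phi."""
    T = lambda t: (1.0 if t >= 0 else -1.0) * abs(t) ** phi
    return [T(a - alpha0) * w for a, w in pairs]

rng = random.Random(0)
S = 20000
# common random pairs reused for every candidate phi (reduces search noise)
pairs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(S)]
# target prior sample: product of two independent standard Gaussians
target = [rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(S)]

grid = [round(0.5 + 0.1 * k, 1) for k in range(11)]   # phi in {0.5,...,1.5}
best_phi = min(grid, key=lambda phi: order_stat_distance(
    neuronized_sample(phi, pairs), target))
```

Reusing the same $(\alpha_i, w_i)$ pairs across candidate values of $\phi$ is the common-random-numbers trick: it makes $D(\phi)$ a smooth function of $\phi$ and keeps the grid search stable.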

### 3.2 The choice of hyperparameters

Neuronized priors have two hyperparameters: the weight variance $\tau_w^2$, which plays the role of the global shrinkage parameter, and the bias parameter $\alpha_0$. The roles of these hyperparameters differ according to the choice of the activation function. When a ReLU function is considered, the corresponding neuronized prior is equivalent to a discrete SpSL prior, and the sparsity level of the parameter is mainly determined by $\alpha_0$; i.e., the prior probability for each coefficient to be non-zero is $1 - \Phi(\alpha_0)$. One might consider a hyperprior on $\alpha_0$ to avoid choosing its value. However, sampling $\alpha_0$ in an MCMC step is challenging, because there is no explicit form of the conditional posterior distribution of $\alpha_0$, and this hyperparameter is highly correlated with $\alpha$ and $w$ a posteriori.

When we consider the neuronized version of a continuous shrinkage prior, the shrinkage level of the parameter is controlled by the global shrinkage parameter $\tau_w$, and we implicitly assume $\alpha_0 = 0$. As $\tau_w$ gets smaller, more prior density concentrates around zero, so the resulting posterior distribution also attains more density around zero. Even though an asymptotic rate for the global shrinkage parameter was proposed to achieve the minimax optimal posterior contraction based on the horseshoe prior in van der Pas et al. (2014) under normal mean models, a practical selection of this hyperparameter is still unclear. For this, a hyperprior can be imposed on $\tau_w$. Gelman (2006) argued that the hyperprior on the variance parameter should have enough density around zero, and recommended the use of a half-Cauchy prior. However, half-Cauchy priors implicitly contain a scale hyperparameter, and the standard half-Cauchy prior density is a special case with the scale parameter equal to one, i.e., $C^+(0, s)$ with $s = 1$. Piironen & Vehtari (2017) provided some general examples where the posterior distribution of the regression coefficients is sensitive to the scale of the half-Cauchy prior. They concluded that the use of the standard half-Cauchy prior can lead to bad results, and that it is desirable to explicitly choose the global shrinkage parameter. In this sense, instead of imposing the standard half-Cauchy prior on the global shrinkage parameter, we set $\tau_w = \sqrt{\log n}/n$ by following the theoretical rate investigated in van der Pas et al. (2014) for the optimal posterior contraction rate of Gaussian mean models using a horseshoe prior (assuming that the prior guess of the number of true variables is one). We subsequently use this setting and show that the empirical performance of the resulting procedure is promising in various simulation and real data examples.

For the linear regression model (1), the scale of the parameter $\theta$ can be critical in discerning the signal from the noise, so it may seem desirable to scale the prior variances relative to $\sigma^2$, so that both $\alpha_j$ and $w_j$ are scaled by $\sigma$. We do not consider this modification to the prior variance of $\alpha_j$. One reason is that a different scale of $\alpha_j$ affects the sparsity control: for a ReLU activation function, the prior probability of $\theta_j$ being zero becomes $\Phi(\alpha_0/\sigma)$ when $\alpha_j \sim N(0, \sigma^2)$. Usually, the value of $\sigma$ is unknown, so controlling the prior sparsity level is almost impossible in this case. The other reason is that when $\sigma^2$ multiplies the variances of both the scale and the weight parameters, the scale of the original parameter $\theta_j$ is inflated to the order of $\sigma^2$.

## 4 Sampling and Optimization with Neuronized Priors

In this section, we describe computational strategies for Bayesian linear regression inference with the neuronized priors including both MCMC algorithms for sampling from the posterior distribution and optimization algorithms to evaluate the MAP estimator.

### 4.1 MCMC sampling with neuronized priors

Consider the linear regression model in (1) and the independent neuronized prior (4) on each regression coefficient. The unnormalized posterior distribution of $\alpha$ and $w$ is expressible as

$$\pi(\alpha, w \mid y, \sigma^2) \propto \exp\left\{-\frac{\|y - X\theta(\alpha, w)\|_2^2}{2\sigma^2} - \frac{\alpha^T\alpha}{2} - \frac{w^Tw}{2\sigma^2\tau_w^2}\right\}\pi(\sigma^2), \tag{7}$$

where $\pi(\sigma^2)$ is the prior on $\sigma^2$, $\alpha = (\alpha_1, \dots, \alpha_p)^T$, $w = (w_1, \dots, w_p)^T$, and $\theta(\alpha, w) = D_{T(\alpha - \alpha_0)}\, w$, with $D_{T(\alpha - \alpha_0)}$ denoting the diagonal matrix of the $T(\alpha_j - \alpha_0)$'s.

The conditional posterior distribution of $w$ given $\alpha$ and other hyperparameters is Gaussian, which simplifies its sampling:

$$w \mid y, \alpha, \sigma^2, \tau_w^2 \sim N(\tilde{\mu}, \sigma^2\tilde{\Sigma}), \tag{8}$$

where $\tilde{\Sigma} = (D_{T(\alpha - \alpha_0)}X^TXD_{T(\alpha - \alpha_0)} + \tau_w^{-2}I)^{-1}$ and $\tilde{\mu} = \tilde{\Sigma}D_{T(\alpha - \alpha_0)}X^Ty$. When an inverse-gamma prior Inv-Gam($a_\sigma$, $b_\sigma$) is imposed on $\sigma^2$, the Gibbs update of $\sigma^2$ also follows an inverse-gamma distribution. When $p$ is large relative to $n$, the numerical calculation of $\tilde{\Sigma}$ is very expensive. Bhattacharya et al. (2016) proposed a fast sampling procedure that reduces the computational complexity from $O(p^3)$ to $O(n^2p)$, which is employed here. Conditional on $w$ and $\sigma^2$, each $\alpha_j$ can be sampled by a naive random-walk Metropolis-Hastings (RWMH) algorithm, for $j = 1, \dots, p$. Since $\alpha_j$ and $w_j$ tend to be highly correlated a posteriori, a better strategy is to integrate out $w_j$ so as to draw $\alpha_j$ from $\pi(\alpha_j \mid y, \alpha_{-j}, w_{-j}, \sigma^2)$, and then draw $w_j$ from $\pi(w_j \mid y, \alpha, w_{-j}, \sigma^2)$.

The RWMH step in Algorithm 1 is rather local and cheap; we typically iterate the RWMH step multiple times per scan, and we used this setting in all our numerical examples and found the resulting algorithm to perform well. We use a Gaussian distribution with standard deviation 2 as the proposal distribution, which enables $\alpha_j$ to jump efficiently between the regions $\{\alpha_j \le \alpha_0\}$ and $\{\alpha_j > \alpha_0\}$. We subsequently use Algorithm 1 as the default to implement the posterior inference based on the neuronized prior.
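The two conditional updates above can be sketched as follows (a simplified stand-in for Algorithm 1, which is not reproduced in this excerpt: it fixes $\sigma^2$ and $\tau_w$, uses a plain per-coordinate RWMH for $\alpha_j$ rather than the collapsed update, and inverts the full $p \times p$ matrix, so it is only practical for small $p$):

```python
import numpy as np

def gibbs_neuronized(X, y, T, alpha0=0.0, tau_w=1.0, sigma2=1.0,
                     n_iter=500, step=2.0, seed=0):
    """Gibbs sketch for posterior (7): exact Gaussian draw of w | alpha,
    then one RWMH scan over alpha with a N(0, step^2) proposal."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    alpha = rng.standard_normal(p)
    w = rng.standard_normal(p)

    def log_post(a, wv):
        resid = y - X @ (T(a - alpha0) * wv)
        return (-resid @ resid / (2 * sigma2) - a @ a / 2
                - wv @ wv / (2 * sigma2 * tau_w ** 2))

    draws = []
    for _ in range(n_iter):
        # w | alpha ~ N(mu, sigma2 * Sigma) as in (8)
        D = np.diag(T(alpha - alpha0))
        prec = (D @ X.T @ X @ D + np.eye(p) / tau_w ** 2) / sigma2
        cov = np.linalg.inv(prec)
        w = rng.multivariate_normal(cov @ (D @ X.T @ y) / sigma2, cov)
        # alpha_j | rest via random-walk Metropolis-Hastings
        for j in range(p):
            prop = alpha.copy()
            prop[j] += step * rng.standard_normal()
            if np.log(rng.uniform()) < log_post(prop, w) - log_post(alpha, w):
                alpha = prop
        draws.append(T(alpha - alpha0) * w)
    return np.array(draws)
```

With a ReLU activation (`T = lambda t: np.maximum(t, 0.0)`), the stored draws contain exact zeros, so posterior inclusion frequencies can be read off directly from the output.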

### 4.2 Properties of the ReLU activation function in MCMC

The most direct and effective approach for conducting sparse Bayesian linear regression is to employ a discrete SpSL prior on the coefficients. When the continuous component of this prior is conjugate to the Gaussian likelihood, one of the best-known computational strategies is to integrate out all the continuous parameters (e.g., the regression coefficients and the variance parameter) and to sample the binary indicator vector $\gamma$ in (3) directly from its posterior distribution by MCMC. This posterior distribution can be defined as

$$\pi(\gamma \mid y) = \frac{m_\gamma(y)\, g(\gamma)}{\sum_{\gamma'} m_{\gamma'}(y)\, g(\gamma')}, \tag{9}$$

where $m_\gamma(y)$ is the marginal likelihood of the model indexed by $\gamma$, and $g(\gamma)$ is the model prior. Since the number of possible $\gamma$'s increases exponentially in $p$, a naive RWMH algorithm on the discrete posterior model space can become very inefficient under high-dimensional settings. Moreover, in every MCMC iteration, one has to calculate the marginal likelihood of the current model, which requires a matrix inversion step. Even though the size of the matrix to be inverted should be much smaller than $p$ under sparsity settings, multiple evaluations of the matrix inversion at every iteration significantly slow down the computation. Furthermore, even this approach is unavailable if one cannot analytically integrate out the continuous parameters. In such cases, either a crude approximation strategy, a clever and specially designed yet case-specific data augmentation scheme (Polson et al. 2013), or a much less efficient reversible-jump scheme (Green 1995) has to be employed.

In contrast, our neuronized prior with ReLU activation can achieve the same effect as using the discrete SpSL prior and give rise to more efficient computation, even if one cannot integrate out the continuous component in the joint posterior distribution. In Sections 5 and 6, we show with numerical examples how this procedure improves the sampling efficiency compared to the best-available MCMC procedure based on the conjugate discrete SpSL.

When a ReLU activation function is adopted, the posterior space becomes non-smooth with respect to $\alpha$. As a result, the efficiency of RWMH sampling for $\alpha_j$ might be compromised. We show below that the conditional distribution $\pi(\alpha_j \mid y, \alpha_{-j}, w, \sigma^2)$ is a mixture of two truncated Gaussians, which can be sampled exactly so as to avoid the inefficient RWMH steps.

###### Proposition 4.1.

Let $r_j = y - \sum_{k \ne j} X_k T(\alpha_k - \alpha_0) w_k$ denote the partial residual excluding the $j$-th variable, and let $\alpha_{-j}$ denote the corresponding vector $\alpha$ excluding the $j$-th component $\alpha_j$. We denote by $N_A(\mu, \sigma^2)$ a truncated Gaussian distribution with mean $\mu$ and variance $\sigma^2$ on the set $A$. Suppose that the full posterior distribution based on the neuronized prior is expressible as (7) and a ReLU function is used as the activation function. Then, $\pi(\alpha_j \mid y, \alpha_{-j}, w, \sigma^2) = \kappa\, N_{(-\infty, \alpha_0]}(0, 1) + (1 - \kappa)\, N_{(\alpha_0, \infty)}(\tilde{\alpha}_j, \tilde{\sigma}_j^2)$, where $\tilde{\sigma}_j^2 = \sigma^2/(\sigma^2 + w_j^2\|X_j\|_2^2)$, $\tilde{\alpha}_j = \tilde{\sigma}_j^2\, w_j X_j^T(r_j + X_j\alpha_0 w_j)/\sigma^2$, and

$$\kappa = \frac{\Phi(\alpha_0)\exp\left\{-\frac{\|r_j\|_2^2}{2\sigma^2}\right\}}{\Phi(\alpha_0)\exp\left\{-\frac{\|r_j\|_2^2}{2\sigma^2}\right\} + \left\{1 - \Phi\left(\frac{\alpha_0 - \tilde{\alpha}_j}{\tilde{\sigma}_j}\right)\right\}\tilde{\sigma}_j\exp\left\{\frac{\tilde{\alpha}_j^2}{2\tilde{\sigma}_j^2} - \frac{\|r_j + X_j\alpha_0 w_j\|_2^2}{2\sigma^2}\right\}}.$$
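Sampling from a two-component truncated Gaussian mixture of this form is straightforward with inverse-CDF draws; the sketch below (stdlib `statistics.NormalDist`; the mixture weight and the slab mean and scale are generic inputs supplied by the caller rather than computed from data) replaces the RWMH step:

```python
import random
from statistics import NormalDist

def sample_trunc_gauss(mu, sigma, lo, hi, rng):
    """Inverse-CDF draw from N(mu, sigma^2) truncated to (lo, hi)."""
    nd = NormalDist(mu, sigma)
    u_lo, u_hi = nd.cdf(lo), nd.cdf(hi)
    return nd.inv_cdf(u_lo + rng.random() * (u_hi - u_lo))

def sample_alpha_j(kappa, alpha0, alpha_tilde, sigma_tilde, rng):
    """Draw alpha_j ~ kappa * N_(-inf, alpha0](0, 1)
                    + (1 - kappa) * N_(alpha0, inf)(alpha_tilde, sigma_tilde^2)."""
    if rng.random() < kappa:
        # spike region: standard Gaussian truncated to (-inf, alpha0]
        return sample_trunc_gauss(0.0, 1.0, -37.0, alpha0, rng)  # -37 ~ -inf
    # slab region: Gaussian truncated to (alpha0, inf)
    return sample_trunc_gauss(alpha_tilde, sigma_tilde,
                              alpha0, alpha_tilde + 37.0 * sigma_tilde, rng)
```

Because each draw is exact, this update has acceptance probability one, avoiding the sticky behavior of a random-walk proposal near the non-smooth point $\alpha_j = \alpha_0$.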

There is another computational advantage of using the ReLU activation function. When sampling $w$ in a Gibbs step, the conditional posterior distribution can be decomposed as a product of independent Gaussian densities, so that the numerical inversion of the $p \times p$ matrix in (8) is avoided. Note that the covariance matrix and the mean vector in (8) can be expressed as

$$\tilde{\Sigma} = \begin{pmatrix} (D^*_\alpha X^{*T}X^*D^*_\alpha + \tau_w^{-2}I)^{-1} & 0 \\ 0 & \tau_w^2 I \end{pmatrix}, \qquad \tilde{\mu} = \begin{pmatrix} (D^*_\alpha X^{*T}X^*D^*_\alpha + \tau_w^{-2}I)^{-1}D^*_\alpha X^{*T}y \\ 0 \end{pmatrix},$$

where $X^*$ and $D^*_\alpha$ are the sub-matrices induced by the indices of the nonzero regression coefficients (with the coordinates permuted so that these indices come first). This expression means that for $j$ such that $\alpha_j \le \alpha_0$, the corresponding coefficient $\theta_j$ is set to zero, and the sampling of $w_j$ follows the independent Gaussian distribution $N(0, \sigma^2\tau_w^2)$. The conditional distribution of the sub-vector $w^*$ corresponding to the nonzero coefficients is then $N(\tilde{\mu}^*, \sigma^2\tilde{\Sigma}^*)$, where $\tilde{\Sigma}^* = (D^*_\alpha X^{*T}X^*D^*_\alpha + \tau_w^{-2}I)^{-1}$ and $\tilde{\mu}^* = \tilde{\Sigma}^*D^*_\alpha X^{*T}y$. To sample $w^*$, we only need to invert $D^*_\alpha X^{*T}X^*D^*_\alpha + \tau_w^{-2}I$, which has a much smaller size than the matrix in (8). The computational complexity of this step is only $O(q^3)$, where $q$ is the number of elements in $w^*$, while the computational complexity of the original form is $O(p^3)$.

### 4.3 A scalable algorithm for finding posterior mode

For massive data sets, MCMC algorithms may be prohibitively slow, so one may need to consider optimization-based algorithms for obtaining the MAP estimator. We propose a coordinate-ascent algorithm for this purpose. The proposed algorithm adopts a warm-start procedure, which begins with a hyperparameter resulting in weak shrinkage and gradually increases the strength of the shrinkage. This warm-start idea has also been adopted by Rockova & George (2018) for finding the MAP estimator using an EM algorithm. While the warm-start technique requires multiple runs of the optimization with various hyperparameters, so that the total computational burden is heavier than that of a single optimization, it alleviates the danger of being trapped in a local optimum. Although the proposed algorithm is not theoretically free of the local-optimum issue, our empirical results from both simulation studies and real data examples in Sections 5 and 6 show that it performs significantly better than existing methods. The detailed algorithm is described in Algorithm 2.

A key to the success of this algorithm is the optimization with respect to each $\alpha_j$ while fixing the other parameters, such as $w$ and $\sigma^2$. The vector $w$ is updated jointly conditional on $\alpha$ by taking advantage of the Gaussian conjugacy. Because the objective as a function of $\alpha_j$ in Algorithm 2 is a linear combination of a quadratic function and a function of $T(\alpha_j - \alpha_0)$, we divide the optimization space into two parts ($\{\alpha_j \le \alpha_0\}$ and $\{\alpha_j > \alpha_0\}$), and find a local maximum in each part. Then, we update $\alpha_j$ to the local maximum that has the larger objective value. This is a one-dimensional optimization problem, and a local maximum can be easily found by existing optimization algorithms such as secant algorithms (Brent 1973, Dekker 1969), constrained Newton-Raphson algorithms (Fischer 1992), constrained gradient descent algorithms (Tseng & Yun 2009), etc. In this article, we use the secant algorithm proposed by Brent (1973).

For the ReLU activation function, $\alpha_0$ crucially affects the sparsity level through the prior non-zero probability $1 - \Phi(\alpha_0)$ for each parameter, while the global shrinkage parameter $\tau_w$ controls how much density is concentrated around zero for the neuronized version of continuous shrinkage priors. In the warm-start procedure for the ReLU activation function, we start with $\alpha_0 = 0$, so that the prior probability that a regression coefficient is non-zero is 1/2, and evaluate the MAP estimator. The resulting MAP estimator is then used as the initial value for the next optimization step with a slightly increased $\alpha_0$. By doing so, we gradually increase $\alpha_0$ until we reach the desired hyperparameter. In the simulation and real data examples, we use a hyperparameter schedule that is an equi-spaced sequence of size 20 between 0 and the target hyperparameter. For the neuronized version of continuous shrinkage priors, the hyperparameter controlled in the warm-start procedure is the global shrinkage parameter $\tau_w$. We typically start with a large $\tau_w$, such as 1, and then decrease it gradually to a target value; the schedule used in the following examples is $\tau_w^{(k)} = \exp(-s_k)$, where $\{s_k\}$ is an equi-spaced sequence of size 20 between 0 and $-\log \tau_w^*$, and $\tau_w^*$ is the target hyperparameter.

Throughout the optimization algorithm, the error variance is fixed in advance at the MLE using the top variables (no more than ) selected from all candidates based on their marginal correlations with the response. We do not update in the algorithm because the regression coefficients are highly correlated with in the posterior space, so the optimization would be more likely to get trapped in a local maximum. The results in the following sections show that this procedure works well for various real and simulated data sets.
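A minimal sketch of this pre-fixing step, assuming standardized columns and using our own illustrative default for the number of retained variables:

```python
import numpy as np

def fix_error_variance(X, y, k=10):
    """Fix the error variance at the MLE from an OLS fit on the k predictors
    (k no more than n; k = 10 is an illustrative default) having the largest
    absolute marginal correlation with y (columns assumed standardized)."""
    n, _ = X.shape
    score = np.abs(X.T @ y) / (np.linalg.norm(X, axis=0) * np.linalg.norm(y))
    top = np.argsort(score)[-min(k, n):]
    beta, *_ = np.linalg.lstsq(X[:, top], y, rcond=None)
    resid = y - X[:, top] @ beta
    return np.sum(resid**2) / n  # MLE of the error variance
```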

### 4.4 Comparisons with other posterior optimization procedures

Yen et al. (2011) proposed an MM algorithm to find the MAP estimator under discrete SpSL priors by approximating the norm with a log-transformed function. Following the notation used in that article, the approximation is . In practice, however, we need to fix the hyperparameter in advance, and the performance of the approximation is crucially determined by the choice of . While a smaller results in a better approximation of the original posterior distribution, the resulting target function becomes highly non-concave and much more difficult to optimize.

EMVS (Rockova & George 2014) and SSLasso (Rockova & George 2018) were proposed to evaluate the MAP estimator of a continuous SpSL prior in (3) based on an EM formulation. The prior for EMVS is a mixture of two Gaussian densities, and , and that for SSLasso is a mixture of two Laplace densities, and , where and . Such a mixture can mimic a point-mass sparsity-inducing prior by setting the variance of the spike component to be very small. Since the spike part is not a point mass but a continuous prior, an extra hyperparameter (or ) needs to be chosen to control how much the spike density is concentrated around zero. To make a computational comparison with the neuronized MAP estimator evaluated by Algorithm 2, we set in (2.1); we then choose and for EMVS, and and for SSLasso.
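For concreteness, an EMVS-type continuous spike-and-slab prior density (a two-Gaussian mixture) can be evaluated as below; the hyperparameter values are illustrative, not the ones used in the comparison:

```python
import numpy as np
from scipy.stats import norm

def spsl_mixture_density(theta, pi=0.5, v0=0.01, v1=10.0):
    """Continuous spike-and-slab density: a mixture of a tight spike N(0, v0)
    and a diffuse slab N(0, v1). Hyperparameter values are illustrative."""
    spike = norm.pdf(theta, scale=np.sqrt(v0))
    slab = norm.pdf(theta, scale=np.sqrt(v1))
    return (1.0 - pi) * spike + pi * slab
```

Shrinking v0 toward zero makes the spike component approach a point mass at zero, which is exactly the approximation discussed above.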

Figure 15 compares the optimization paths of the MM algorithm, EMVS, SSLasso, and Algorithm 2 for the variable selection procedure on the Bardet-Biedl data set ( and ), which will be discussed in more detail in Section 6. Each colored line indicates an optimization path from a randomly generated initial point. As shown in Figure 15, the other optimization-based procedures clearly failed to find the high posterior region, and their solutions were very sensitive to the initial point: ten randomly selected initial points resulted in ten different solutions in our example. In contrast, Algorithm 2 for the neuronized prior found the same MAP estimator from all initial points. In Sections 5 and 6, we provide a more thorough performance comparison of the aforementioned optimization algorithms, MM, EMVS, SSLasso, and Algorithm 2, on various synthetic and real data sets.

To reduce the risk of getting trapped in local maxima, we applied the warm start procedure to the MM algorithm and SSLasso by gradually decreasing (or increasing) their hyperparameters: for the MM algorithm and for SSLasso. Also, following a recommendation of Rockova & George (2014), we applied a deterministic annealing procedure to EMVS to mitigate the same issue. Nevertheless, the solutions obtained by the MM algorithm, EMVS, and SSLasso remained sensitive to the choice of initial points.

## 5 Simulation Studies

In this section, we examine how neuronized priors perform on synthetic data sets under both low-dimensional and high-dimensional settings. Under the Bayesian regression framework, we compare the effects of some standard priors, such as the Bayesian Lasso prior, the horseshoe prior, and the discrete SpSL prior in (2.1), with those of their corresponding neuronized versions. In this simulation study, we also consider two penalized likelihood procedures: Lasso (Tibshirani 1996) and the Smoothly Clipped Absolute Deviation penalty (SCAD; Fan & Li 2001).

### 5.1 Simulation setups and evaluation criteria

We examine two covariance structures to generate the predictors: (i) independent covariates: , where is the identity matrix; and (ii) an AR(1) dependence structure: for , where , if and , otherwise, for . For the low-dimensional cases, we test two settings of the sample size and the total number of predictor variables: (i) ; and (ii) . The number of nonzero regression coefficients is of , and the nonzero coefficients are equally set to be with random signs. We use and to examine the strong-signal and weak-signal scenarios, respectively. For the high-dimensional settings, we fix the regression coefficients at for and under two settings: (iii) ; and (iv) . The sign of each coefficient is randomly assigned. We set the true error variance to be for all scenarios.
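Generating a design matrix under the AR(1) structure described above can be sketched as follows (the correlation decay Sigma_ij = rho**|i - j| is the standard AR(1) form; rho = 0 recovers the independent case):

```python
import numpy as np

def ar1_design(n, p, rho, seed=0):
    """Draw an n-by-p Gaussian design whose columns follow the AR(1)
    correlation Sigma_ij = rho**|i - j|; rho = 0 gives independent covariates."""
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(p), Sigma, size=n)
```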

We choose the tuning parameters of Lasso and SCAD by cross-validation. Alternatively, we also consider the Bayesian Information Criterion (BIC) for the low-dimensional settings and the Extended BIC (EBIC; Chen & Chen 2008) for the high-dimensional settings to select the tuning parameter. The EBIC can be written as

 \mathrm{EBIC}_{\gamma}(S) = -2\log L(\hat{\theta}_S) + |S|\log n + 2\gamma \log\binom{p}{|S|},

where S denotes the set of selected variables, |S| is the cardinality of S, and \gamma is a tuning parameter. We set \gamma following the default setting in Chen & Chen (2008).
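A minimal sketch of computing the EBIC of Chen & Chen (2008) for a Gaussian linear model, dropping constants of the likelihood (the function and argument names are ours):

```python
import numpy as np
from math import lgamma, log

def log_binom(p, k):
    # log of the binomial coefficient C(p, k) via log-gamma
    return lgamma(p + 1) - lgamma(k + 1) - lgamma(p - k + 1)

def ebic(y, X, support, gamma=0.5):
    """EBIC for a Gaussian linear model, up to additive constants:
    n*log(RSS/n) + |S|*log(n) + 2*gamma*log(binom(p, |S|))."""
    n, p = X.shape
    S = list(support)
    if S:
        beta, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
        rss = np.sum((y - X[:, S] @ beta) ** 2)
    else:
        rss = np.sum(y**2)
    return n * log(rss / n) + len(S) * log(n) + 2.0 * gamma * log_binom(p, len(S))
```

The extra 2*gamma*log(binom(p, |S|)) term penalizes model size more heavily as p grows, which is what makes EBIC suitable for the high-dimensional settings.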

We evaluate the ability of Algorithm 2 to find the MAP estimator of our neuronized version of (2.1) (denoted as N-SpSL(MAP)) and compare it with MM, EMVS, and SSLasso. In this simulation study, we first fix in (2.1), and for EMVS and for SSLasso. Then, we evaluate the MAP estimators based on different choices of for EMVS and for SSLasso, and select the value that minimizes the information criterion (BIC for the low-dimensional and EBIC for the high-dimensional settings). To implement these procedures, we use the R packages EMVS and SSLasso (available on CRAN or at http://faculty.chicagobooth.edu/veronika.rockova/).

To evaluate the estimation performance of the neuronized priors, we report the Mean Squared Error (MSE) and the angle between the true and the estimated regression coefficients for each method. More precisely, the angle is defined as \arccos\{\theta^{\top}\hat{\theta}/(\|\theta\|\,\|\hat{\theta}\|)\}, where \theta is the true coefficient vector and \hat{\theta} is the estimate. The angle measure is more stringent, as it cannot benefit from a simple shrinkage. To measure model selection performance, we examine the Matthews correlation coefficient (MCC; Matthews 1975), defined as

 \mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. MCC is generally regarded as a balanced measure of variable selection performance that simultaneously takes into account TP, TN, FP, and FN. MCC is bounded above by one, and the closer it is to one, the better the model selection procedure.
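Both criteria are straightforward to compute; a sketch (vector and function names are ours):

```python
import numpy as np

def mcc(selected, truth):
    """Matthews correlation coefficient from 0/1 indicator vectors of the
    selected and the truly nonzero variables."""
    s, t = np.asarray(selected, bool), np.asarray(truth, bool)
    tp = np.sum(s & t)
    tn = np.sum(~s & ~t)
    fp = np.sum(s & ~t)
    fn = np.sum(~s & t)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def angle(theta_true, theta_hat):
    """Angle between the true and estimated coefficient vectors."""
    cos = theta_true @ theta_hat / (
        np.linalg.norm(theta_true) * np.linalg.norm(theta_hat))
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

Note that the angle is invariant to rescaling the estimate, which is why it does not reward a uniform shrinkage of all coefficients.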

We consider the Effective Sample Size (ESS) to measure the efficiency of an MCMC procedure, defined as \mathrm{ESS} = N/(1 + 2\sum_{k=1}^{\infty}\rho_k), where N is the number of MCMC samples and \rho_k is the lag-k autocorrelation. To make comparisons across different Bayesian procedures, we report the ESS per second, obtained from the R package coda.
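A simple sketch of the ESS computation, truncating the autocorrelation sum at the first non-positive estimate (one common convention; packages such as coda use more refined spectral estimators):

```python
import numpy as np

def effective_sample_size(chain, max_lag=1000):
    """ESS = N / (1 + 2 * sum of lag-k autocorrelations), truncating the
    sum at the first non-positive estimated autocorrelation."""
    x = np.asarray(chain, float)
    x = x - x.mean()
    N = x.size
    denom = np.dot(x, x)  # N times the sample variance
    s = 0.0
    for k in range(1, min(max_lag, N)):
        rho = np.dot(x[:-k], x[k:]) / denom
        if rho <= 0:
            break
        s += rho
    return N / (1.0 + 2.0 * s)
```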

### 5.2 Technical descriptions about computational strategies

For neuronized priors, we consider an RWMH step to sample in Algorithm 1; this setting is denoted by “RW” in parentheses in the tables. For the neuronized version of the discrete SpSL prior via the ReLU activation function, we additionally consider a computational strategy that samples from its exact conditional density in Proposition 4.1, denoted by “Exact”. We also evaluate the MAP estimator of the neuronized prior obtained by Algorithm 2, denoted by “MAP”. We examine the MM algorithm (Yen et al. 2011) for finding the MAP estimator with the discrete SpSL prior, in which two tuning parameters are tested: and , denoted by “MM1” and “MM2”, respectively. We do not consider the MAP estimator of the horseshoe prior or its neuronized version because, as discussed in Carvalho et al. (2010), the marginal density of the horseshoe prior is infinite at zero, so the resulting MAP estimator is always the null value.

For Bayesian procedures based on SpSL priors in (3), we fix . Its neuronized version uses the ReLU activation function with . We impose a prior on proportional to for all Bayesian procedures. For the horseshoe prior and its neuronized version, we fix the global shrinkage parameter as , which is a theoretical rate investigated in van der Pas et al. (2014) for optimal posterior contraction for Gaussian mean models. For the Bayesian Lasso and its neuronized version, we choose the global shrinkage parameter that matches the tuning parameter value determined by cross-validation for the standard Lasso procedure. This connection stems from the relationship between the Bayesian Lasso and the standard Lasso. That is, , where with the Lasso estimator based on tuning parameter .

For the discrete SpSL prior, we use a Gaussian distribution for the slab part and a point mass at zero for the spike part. We note that the use of a Gaussian slab does not exactly match the neuronized prior with the ReLU activation function, since the product of two independent Gaussian random variables in the neuronization formulation results in a Laplace-like slab. Nevertheless, we use the Gaussian slab prior to retain the computational efficiency afforded by Gaussian conjugacy: the marginal likelihood of each model then has a closed form, so no computationally demanding approximation algorithm is needed to evaluate it. We also note that the parameter estimation is mainly affected by the prior inclusion probability, which is controlled by in (2.1) and for the neuronized prior.

For the standard discrete SpSL prior, the MCMC algorithm works on the variable selection indicator space, i.e., the space of . At each iteration, the algorithm proposes a single-flip move with probability 0.7 in our simulation studies (i.e., randomly selecting a predictor and proposing to flip its inclusion indicator), and a double-flip move with probability 0.3. The proposed indicator vector is accepted with probability , where is defined in (9). Given , we can sample easily by using the Gaussian conjugacy. This MCMC algorithm for Bayesian variable selection has been used by numerous researchers (Madigan et al. 1995, Raftery et al. 1997, Brown et al. 1998, Guan & Stephens 2011), and its theoretical properties, including the convergence rate of the MCMC chain, have been investigated in Yang et al. (2016).
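The flip-move sampler can be sketched as follows on simulated data; we use a zero-mean Gaussian slab with known error variance so the marginal likelihood is available in closed form, and all hyperparameter values here are illustrative rather than the paper's:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Simulated data: n = 50, p = 8, two truly active predictors.
rng = np.random.default_rng(1)
n, p, tau2, sigma2, q = 50, 8, 4.0, 1.0, 0.2
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[[0, 1]] = 2.0
y = X @ beta + rng.standard_normal(n)

def log_post(gamma):
    # Marginal likelihood under beta_S ~ N(0, tau2*I) with known sigma2
    # (Gaussian conjugacy), plus an independent Bernoulli(q) prior on gamma.
    S = np.flatnonzero(gamma)
    cov = sigma2 * np.eye(n) + tau2 * X[:, S] @ X[:, S].T
    ll = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)
    return ll + S.size * np.log(q) + (p - S.size) * np.log(1 - q)

gamma = np.zeros(p, int)
lp = log_post(gamma)
for _ in range(2000):
    prop = gamma.copy()
    # Single flip with probability 0.7, double flip with probability 0.3.
    flips = rng.choice(p, size=1 if rng.random() < 0.7 else 2, replace=False)
    prop[flips] ^= 1
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:  # Metropolis acceptance
        gamma, lp = prop, lp_prop
```

With strong signals, the chain quickly concentrates on models containing the two active predictors.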

For the Gibbs sampler with Bayesian Lasso, the local shrinkage parameter can be sampled exactly from its conditional distribution, which is an inverse Gaussian distribution. For the Bayesian computation with the horseshoe prior, we use a slice sampler to sample each local shrinkage parameter. For both procedures, since the posterior distribution cannot provide a sparse solution directly, variables are selected by a hard thresholding step on the posterior mean of the regression coefficients. The threshold is set to be , where is the posterior mean of .
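A sketch of the exact inverse-Gaussian draw for the Bayesian Lasso local shrinkage parameter, using the parameterization of Park & Casella (2008) (an assumption on our part, since the text does not spell it out) and NumPy's `wald` sampler for the inverse-Gaussian distribution:

```python
import numpy as np

def sample_local_shrinkage(beta_j, lam, sigma2, rng):
    """One draw of the local shrinkage parameter tau_j^2: assuming the
    Park & Casella (2008) conditional, 1/tau_j^2 given the rest is
    inverse-Gaussian with mean sqrt(lam^2 * sigma2 / beta_j^2) and shape
    lam^2 (numpy's `wald` samples the inverse-Gaussian distribution)."""
    mu = np.sqrt(lam**2 * sigma2 / beta_j**2)
    inv_tau2 = rng.wald(mu, lam**2)
    return 1.0 / inv_tau2
```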

For all cases except the discrete SpSL prior, we generate 20,000 MCMC samples after 2,000 burn-in iterations. For the discrete SpSL prior, we simulate 200,000 MCMC samples because the acceptance rate of the RWMH algorithm for the standard procedure is very low (less than 2%). For each simulation scenario, we replicate 100 data sets and average the results over the replications. All MCMC algorithms are coded in C++ and run on a Xeon Broadwell processor with 16 cores at 1.8 GHz and 128 GB RAM.