# Bayesian inference in high-dimensional linear models using an empirical correlation-adaptive prior

In the context of a high-dimensional linear regression model, we propose the use of an empirical correlation-adaptive prior that makes use of information in the observed predictor variable matrix to adaptively address high collinearity, determining if parameters associated with correlated predictors should be shrunk together or kept apart. Under suitable conditions, we prove that this empirical Bayes posterior concentrates around the true sparse parameter at the optimal rate asymptotically. A simplified version of a shotgun stochastic search algorithm is employed to implement the variable selection procedure, and we show, via simulation experiments across different settings and a real-data application, the favorable performance of the proposed method compared to existing methods.

## Authors

• 89 publications
• 12 publications
• 2 publications
• 20 publications
• ### Bayesian variable selection in linear regression models with instrumental variables

Many papers on high-dimensional statistics have proposed methods for var...
01/10/2019 ∙ by Gautam Sabnis, et al. ∙ 0

• ### High-dimensional variable selection via low-dimensional adaptive learning

04/17/2019 ∙ by Christian Staerk, et al. ∙ 0

• ### Variational Bayes for high-dimensional linear regression with sparse priors

We study a mean-field variational Bayes (VB) approximation to Bayesian m...
04/15/2019 ∙ by Kolyan Ray, et al. ∙ 0

• ### Asymptotically optimal empirical Bayes inference in a piecewise constant sequence model

Inference on high-dimensional parameters in structured linear models is ...
12/11/2017 ∙ by Ryan Martin, et al. ∙ 0

• ### Predicting Phenotypes from Brain Connection Structure

This article focuses on the problem of predicting a response variable ba...
10/06/2019 ∙ by Subharup Guha, et al. ∙ 0

• ### Empirical priors for prediction in sparse high-dimensional linear regression

Often the primary goal of fitting a regression model is prediction, but ...
03/03/2019 ∙ by Ryan Martin, et al. ∙ 0

• ### Variational approximation for heteroscedastic linear models and matching pursuit algorithms

Modern statistical applications involving large data sets have focused a...
11/22/2010 ∙ by David J. Nott, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Consider the standard linear regression model

 Y=Xβ+ε,

where is a vector of response variables, is a matrix of predictor variables, is a vector of regression coefficients, and is a vector of iid errors. We are interested in the high-dimensional case, where . Furthermore, it is assumed that the true is sparse, i.e., only a small subset of the coefficients are nonzero.

There are a variety of methods available for estimating

under a sparsity constraint. This includes regularization-based methods like the lasso (Tibshirani 1996), the adaptive lasso (Zou 2006), the SCAD (Fan and Li 2001), and MCP (Zhang 2010); see Fan and Lv (2010) for a review. From a Bayesian point of view, varieties of priors for regression coefficients and the model space have been developed, leading to promising selection properties. For the regression coefficients, , the normal mixture prior is specified in George and McCullogh (1993); George and Foster (2000) introduce empirical Bayes ideas; Ishwaran and Rao (2005) use spike-and-slab priors; Bondell and Reich (2012) estimate as the “most sparse” among those in a suitable posterior credible region; Polson and Scott (2012) consider a horseshoe prior; Narisetty and He (2014) use shrinking and diffusing priors; and Martin et al. (2017) consider an empirical Bayes version of spike-and-slab.

Collinearity is unavoidable in high-dimensional settings, and methods such as lasso tend to smooth away the regression coefficients of highly collinear predictors, and hence deter correlated covariates to be included in the model simultaneously. This motivated Krishna et al. (2009) to propose an adaptive powered correlation prior that lets the data itself weigh in on how collinear predictors are to be handled. However, their suggested generalized Zellner’s prior is not applicable in the scenario. To overcome this, we use an empirical Bayes approach, and propose an empirical correlation-adaptive prior (ECAP) to this framework. In Section 2, we present our empirical Bayes model and a motivating example illustrating the effect of correlation-adaptation in the prior. Asymptotic posterior concentration properties are derived in Section 3, in particular, adaptive minimax rates are established for estimation of the mean response. In Section 4, we recommend a shotgun stochastic search approach for computation of the posterior distribution over the model space. Simulation experiments are presented in Section 5, and we demonstrate the benefits of adaptively varying the correlation structure in the prior for variable selection compared to existing methods. The real-data illustration in Section 6 highlights the improved predictive performance that can be achieved using the proposed correlation-adaptive prior. Proofs of the theorems are deferred to the Appendix.

## 2 Model specification

### 2.1 The prior

Under assumed sparsity, it is natural to decompose as , where is the set of non-zero coefficients, called the configuration of , and is the -vector of non-zero values, with denoting the cardinality of . We will write for the sub-matrix of corresponding to the configuration . With this decomposition of , a hierarchical prior is convenient, i.e., a prior for and a conditional prior for , given .

First, for the prior for , we follow Martin et al. (2017) and write

 π(S)=π(S∣|S|=s)fn(s),

where is a prior on and is a conditional prior on , given . Based on the recommendation in Castillo et al. (2015), we take

 fn(s)∝c−sp−as,s=0,1,…,R, (1)

where and are positive constants, and . It is common to take to be uniform, but here we break from this trend to take collinearity into account. Let denote the determinant of

, and consider the geometric mean of the eigenvalues,

, as a measure of the “degree of collinearity” in model . We set

 πλ(S∣|S|=s)=D(S)−λ/(2s)1{κ(S)

where is the condition number of , and and are positive constants specified to exclude models with extremely ill-conditioned . The constant is an important feature of the proposed model, and will be discussed in more detail below. Because of the dependence on above, we will henceforth write for the prior of .

Second, as the prior for , given , we take

 (βS∣S,λ)∼N(ϕ^βS,σ2gkS(X⊤SXS)λ). (3)

Here is the least squares estimator corresponding to configuration and design matrix , is a shrinkage factor to be specified, is a parameter controlling the prior spread, is an adaptive powered correlation matrix, and

 kS=tr{(X⊤SXS)−1}/tr{(X⊤SXS)λ}

is a standardizing factor as in Krishna et al. (2009) designed to control for the scale corresponding to different values of . This data-driven prior centering is advantageous because we can use a conjugate normal prior, which has computational benefits, but without sacrificing on theoretical posterior convergence rate properties (Martin et al. 2017). Let denote this prior density for , given .

The power parameter on the prior covariance matrix can encourage or discourage the inclusion of correlated predictors. When , the prior shrinks the coefficients of correlated predictors towards each other; when , they tend to be kept apart, with being the most familiar; and, finally, implies prior independence. Therefore, positive would prefer larger models by capturing as many correlated predictors as possible, while negative tends to select models with less collinearity; see Krishna et al. (2009) for additional discussion of this phenomenon. Our data-driven choice of , along with that of the other tuning parameters introduced here and in the next subsection, will be discussed in Section 4.2.

### 2.2 The posterior distribution

For this standard linear regression model, the likelihood function is

 Ln(β)=(2πσ2)−n/2e−12σ2∥Y−Xβ∥2,β∈Rp.

It would be straightforward to include as an argument in this likelihood function, introduce a prior for , and carry out a full Bayesian analysis. However, our intention is to use a plug-in estimator for

in what follows and, hence, we do not include the error variance as an argument in our likelihood function.

Given a prior and a likelihood, we can combine the two via Bayes’s formula to obtain a posterior distribution for or, equivalently, for the coefficient vector . To put a minor twist on this standard formulation, we follow a recent trend (e.g., Martin and Walker 2014; Grünwald and van Ommen 2017; Syring and Martin 2017) and use a power likelihood instead of the likelihood itself. That is, our posterior for is defined as

 πnλ(S,βS)∝Ln(βS+)απλ(βS∣S)πλ(S),

where is the -vector obtained by filling in around with zeros in the entries corresponding to , and is a regularization factor. Then the posterior distribution for , denoted by , is obtained by summing over all configurations , i.e.,

 Πnλ(A)∝∫{βS:βS+∈A}∑Sπnλ(S,βS)dβS,A⊆Rp.

Since one of our primary objectives is variable selection, it is of interest that we can obtain a closed-form expression for the posterior distribution of , up to a normalizing constant, a result of our use of a conjugate normal prior for , given . That is, we can integrate out above to get a marginal likelihood for , i.e.,

 mλ(Y∣S) =(2πσ2)−n/2s∏i=1(1+αgkSdλ+1S,i)−1/2 ×exp[−α2σ2{∥y−^yS∥2+(1−ϕ)2s∑i=1dS,i1+αgkSdλ+1S,iθ2S,i}], (4)

where is the eigenvalue of ; also, is the spectral decomposition of , with , and is the element of . Then it is straightforward to get the posterior distribution for :

 πnλ(S)∝mλ(Y∣S)πλ(S). (5)

The variable selection method described in Section 4 and illustrated in Sections 56 is based on this posterior distribution.

### 2.3 A motivating example

We now give a simple example to illustrate the effects of incorporating into (2) and (3). Consider a case with and let have iid rows, each with a standard multivariate normal with first-order autoregressive dependence and correlation parameter . Given , the conditional distribution of the response is determined by the linear model

 yi=xi1+0.8xi2+εi,whereε1,…,ε5\tiny iid∼N(0,1).

The black, blue, and red curves in Figure 1 represent , for three different configurations, namely, the true configuration , , and , respectively. Panel (a) corresponds to a high correlation case, , we see that the ECAP-based posterior would prefer for suitably large ; compare this to the choice in Martin et al. (2017) which would prefer the smaller configuration . On the other hand, when the correlation is relatively low, as in Panel (b), we see that large positive would encourage larger configuration while the true configuration would be preferred for sufficiently large negative values of . The take-away message is that, by allowing to vary, the ECAP-based model has the ability to adjust to the correlation structure, which can be beneficial in identifying the relevant variables.

## 3 Posterior convergence properties

### 3.1 Setup and assumptions

We will stick with the standard notation given previously, but it will help to keep in mind that and are better understood as triangular arrays. Therefore, we can have , , and all depend on . We assume throughout that ; more precise conditions given below. We also assume that , , and are fixed constants in this setting, not parameters to be estimated/tuned. Therefore, to simplify the notation here and in the proofs, we will drop the subscript and simply write for the posterior for instead of .

For the fundamental problem of estimating the mean response, the minimax rate does not depend on the correlation structure in , so we cannot expect any improvements in the rate by incorporating this correlation structure in our prior distribution. Therefore, our goal here is simply to show that the minimax rates can still be achieved while leaving room to adjust for collinearity in finite-samples. The finite-sample benefits of the correlation-adaptive prior will be seen in the numerical results presented in Section 5.

We start by stating the basic assumptions for all the results that follow, beginning with two simple-to-state assumptions about the asymptotic regime. In particular, relative to , the true configuration is not too complex.

###### Assumption 1.

.

The next assumption puts a very mild size condition on the non-zero regression coefficients as well as on the user-specified shrinkage factor in the prior.

###### Assumption 2.

The true satisfies and, moreover, the factor satisfies .

The condition on above basically says that , which is typically much weaker than the usual beta-min condition—see (10) in Section 3.5—that is required for variable selection consistency. Note, however, that one would need to know to check this assumption about , and we present a data-driven choice of that satisfies this condition in Section 4.2.3.

Finally, we need to make some assumptions on the design matrix . For a given configuration , let and denote the smallest and the largest eigenvalues of , respectively. Next, define

 ℓ(s)=minS:|S|=sλmin(S)andu(s)=maxS:|S|=sλmax(S).

Recall that these depend (implicitly) on because of the triangular array formulation. It is also clear that and are non-increasing and non-decreasing functions of the complexity , respectively. If is the condition number of , then we can define

 ω(s)=maxS:|S|=sκ(S),

and get the relation .

###### Assumption 3.

.

This assumption says, roughly, that every submatrix , for , is full rank, and is implied by, for example, the sparse Riesz condition of order in Zhang and Huang (2008). More general results can surely be derived, i.e., with weaker conditions on , but at the expense of more complicated statements and assumptions.

### 3.2 Rates under prediction error loss

Define the set

 Bε={β∈Rp:∥Xβ−Xβ⋆∥2>ε},ε>0, (6)

which contains those that would do a relatively poor job of estimating the mean or, equivalently, of predicting the response . We, therefore, hope that the posterior assigns asymptotically negligible mass there. The following theorem makes this precise. Recall the definitions of the prior and, in particular, the quantities and .

###### Theorem 1.

Under Assumptions 13, there exists a constant such that

 supEβ⋆{Πn(BMεn)}→0,n→∞,

where the supremum is over all such that ,

 εn=max{q(R,λ,r,a),s⋆log(p/s⋆)},

and

 q(R,λ,r,a)=⎧⎪⎨⎪⎩R{r(1+λ)−a}logpif λ∈[0,∞)R(r−a)logpif λ∈[−1,0)R(−rλ−a)logpif λ∈(−∞,−1).
###### Proof.

See Appendix A.2. ∎

In the so-called ordinary high-dimensional regime (e.g., Rigollet and Tsybakov 2012), is the minimax concentration rate. So the take-away message here is, as long as the in (1) and (2) are chosen to satisfy

 a>rmax{1+λ,1−λ},

the ECAP posterior attains the minimax optimal rate.

### 3.3 Effective posterior dimension

Theorem 1 suggests that the posterior for concentrates near the true in some sense. However, since is sparse, it would be beneficial, for the sake of parsimony, etc, if the posterior also concentrated on a roughly -dimensional subset of . This does not follow immediately from the prediction loss result but, the following theorem reveals that the effective dimension of the posterior for is approximately .

###### Theorem 2.

Suppose has that satisfy , and define

 ρ0=a+1a−rmax{1+λ,1,−λ}>1. (7)

Then, under Assumptions 13, we have as , where the supremum is over all -sparse and for any .

###### Proof.

See Appendix A.3. ∎

Aside from the economical benefits of having an effectively low-dimensional posterior distribution, Theorem 2 is useful in the proofs of all the remaining results.

### 3.4 Rates under estimation error loss

Define the set

 Vδ={β∈Rp:∥β−β⋆∥2>δ},δ>0. (8)

Just like in Section 3.2, we hope that, for suitable , the posterior will assign asymptotically negligible mass to this set of undesirable s.

###### Theorem 3.

Suppose has that satisfy . Under Assumptions 13, there exists a constant such that as , where the supremum is over all -sparse ,

 δn=s⋆log(p/s⋆)nℓ((ρ+1)s⋆), (9)

and is greater than in (7).

###### Proof.

See Appendix A.4. ∎

Under Assumption 1 and 3,

is bounded with probability

. Hence, our rate is , better than the rate for lasso (Zhang et al. 2008).

### 3.5 Variable selection consistency

One of our primary objectives in introducing the -dependent prior distribution that accounts for collinearity structure in the design matrix is for the purpose of more effective variable selection. So it is imperative that we can demonstrate theoretically that, at least asymptotically, our posterior distribution will concentrate on the correct configuration . The following theorem establishes this variable selection consistency property.

###### Theorem 4.

In addition to Assumptions 13, assume that the constant in the prior is such that and , where

 G=α+12log{1+αgκ(S⋆)max{λ+1,1,−λ}}=O(1).

Then

 supEβ⋆[Πn({β:Sβ⊃Sβ⋆})]→0,n→∞,

where the supremum is over all that are -sparse. Furthermore, if

 minj∈S⋆|β⋆j|≥ϱn:=σ√(ℓ(s⋆)){2Mnα(1−α)logp}1/2, (10)

where and , then

 Eβ⋆[Πn({β:Sβ⊉Sβ⋆})]→0,n→∞.

If both sets of conditions hold, then variable selection consistency holds, i.e.,

 Eβ⋆[Πn({β:Sβ=Sβ⋆})]→1,n→∞.
###### Proof.

See Appendix A.5. ∎

The extra conditions on in Theorem 4 effectively require that the true configuration size, , is small relative to and, furthermore, that the constant in (1) is large enough that concentrates on comparatively small configurations. Also, the non-zero values are more difficult to detect if their magnitudes are small. This is intuitively clear, and also shows up in our simulation results for Cases 1–2 in Section 5. Theorem 4 gives a mathematical explanation of this intuition, stating that variable selection based on our empirical Bayes posterior will be correct asymptotically if the true signals have magnitude greater than the threshold in (10). This is the so-called beta-min condition; see, e.g., Bühlmann and van de Geer (2011) or Arias-Castro and Lounici (2014).

## 4 Implementation details

### 4.1 Stochastic search of the configuration space

In order to compute the posterior probability, we need to evaluate

as defined in (2). The difficulty comes from the denominator of (2), i.e.,

 ∑S:|S|=sD(S)−λ/2s1{κ(S)

where, again, is the determinant. Since and could be chosen large enough so that only extremely ill-conditioned cases would be excluded, there are approximately terms in the above summation. So brute-force computation can be done only for relatively small . Given that the eigenvalues are bounded from above and below for a given predictor matrix , is too, so it is not unreasonable to approximate the above summation by . This approximation is exact in the case of if all are included, and numerical experiments suggest that it is stable across a range of , , and . Therefore, using this approximation, the posterior distribution for that we use is given by

 πnλ(S)∝mλ(Y∣S)D(S)−λ2|S|(p|S|)−1fn(|S|). (11)

Markov chain Monte Carlo (MCMC) methods can be used to compute this posterior but this tends to be inefficient in high-dimensional problems. As an alternative, we employ the simplified shotgun stochastic search algorithm with screening (S5, Shin et al. 2018), a simplified version of shotgun stochastic search (SSS, Hans et al. 2007), to explore our posterior distribution. Different from traditional MCMC method, SSS does not attempt to approximate the posterior probability; instead, it only tries to explore high posterior probability regions as thoroughly as possible.

Here is a summary of our SSS algorithm. Let be a configuration of size , with , its corresponding (unnormalized) posterior. Define the neighborhood of as , where is the set containing all dimensional configurations that include , is the set containing all -dimensional configurations that only have one variable different from variables in , and is the set containing all -dimensional configurations that are nested in . The iteration of SSS goes as follows:

1. Given , compute for all .

2. Sample , , respectively from , , , with probabilities .

3. Sample from with probabilities proportional to , , and , obtained by summing.

All visited configurations are recorded, and the final chosen configuration can be the maximum a posteriori model, median probability model (the model including those variables of which marginal inclusion probability is not less than ), or something else. For our simulations in Section 5, the selected configuration is the median probability model.

While SSS has the ability to explore many more high posterior configurations than MCMC, it is still computationally expensive, especially in high-dimensional case. For this reason, we adopt the S5 algorithm, which uses a screening technique to significantly decrease the computational cost.

### 4.2 Choice of tuning parameters

#### 4.2.1 Choice of λ

An “ideal” value of

is one that minimizes the Kullback–Leibler divergence of the marginal distribution

from the true distribution of or, equivalently, one that maximizes the expected log marginal likelihood, i.e.,

 λ⋆=argmaxλE{logmλ(Y)}.

Unfortunately, the ideal value is not available because we do not know the true distribution of , nor can we estimate it with an empirical distribution. However, a reasonable estimate of this ideal is

 ^λ=argmaxλlogmλ(Y).

Indeed, Figure 2 shows for several different samples, along with an approximation of based on point-wise averaging. Notice that the individual log marginal likelihoods are maximized very close by where the expectation is maximized.

There is still one more obstacle in obtaining , namely, that we cannot directly compute the summation involved in due to the large number of configurations . Fortunately, we can employ an importance sampling strategy to overcome this. Specifically, we have

 mλ(Y) =∑Smλ(Y∣S)D(S)−λ/2|S|fn(|S|)(p|S|)−1∑SD(S)−λ/2|S|fn(|S|)(p|S|)−1 ≈∑Nℓ=1mλ(Y∣Sℓ)D(Sℓ)−λ/2|Sℓ|∑Nℓ=1D(Sℓ)−λ/2|Sℓ|,

where are samples from . In our numerical results that follow, we use this to estimate defined above.

As discussed in Section 2.3, plays an important role in both model prior and coefficient prior. That is, for a fixed size , a positive favors model including predictors with relatively high correlation; a negative favors model including predictors with relatively low correlation; equals zero actually put equal mass on each model regardless of their predictors’ correlation structure. The in the conditional prior for , given , has a similar effect; see Krishna et al. (2009). Thus, a “good” estimate of should be such that it reflects the correlation structure in .

To help see this, consider a few examples, each with of dimension and , having an AR(1) correlation structure with varying correlation and true configuration . In particular, we consider two configurations:

 S⋆1 ={11,…,15,31,…,35} S⋆2 ={1,51,100,151,200,251,300,351,400,451}.

Figure 3 shows chosen by maximizing the marginal likelihood in three different cases, and we argue that is at least in the “right direction.” In particular, when the true predictors are highly correlated, as in Panel (a), tends to be positive which encourages highly correlated predictors to be selected; and when the true predictors have low correlation, as in Panel (b), the estimate of is close to 0 hence a nearly uniform prior for . The situation in Panel (c) is different since the true predictors are minimally correlated while unimportant predictors are highly correlated. In this case, tends to be negative which discourages selecting the highly correlated ones that are likely unimportant.

#### 4.2.2 Choice of g

Now, recall that determines the magnitude of the prior variance of . If is sufficiently large, the conditional prior for is effectively flat; if is extremely tiny, then the posterior probability for will concentrate on the prior center . Kass and Wasserman (1995) proposed the unit information criteria, which amounts to taking in the regression setting with Zellner’s prior. Foster and George (1994) suggest a choice of . Here, we use a local empirical Bayes estimator for . That is, for given and , we choose a that maximizes the local marginal likelihood, that is,

 ^gS=argmaxgmλ(y∣S).

In the special case where and

, and a conjugate prior for

, Feng et al. (2008) showed that , where is the usual statistic under model for testing . In general, our estimator, must be computed numerically.

#### 4.2.3 Choice of ϕ

In our choice of , we seek to employ a meaningful amount of shrinkage while still maintaining the condition in Assumption 2. Towards this, if we view as a shrinkage estimator, then it is possible to choose so that the corresponding James–Stein type estimate has smaller mean square error. In particular, a that achieves this is

 ϕn=1−2E∥^βS⋆−β⋆S⋆∥2∥β⋆S⋆∥2+E∥^βS⋆−β⋆S⋆∥2

and, moreover, it can be shown that . Unfortunately, this still depends on , so we need to use some data-driven proxy for this. We recommend first estimating by from the adaptive lasso, with and the corresponding least squares estimators, and then setting

 ^ϕn=[1−2^σ2tr{(X⊤^SX^S)−1}∥^β^S∥2+^σ2tr{(X⊤^SX^S)−1}]+.

In practice, variable selection results are not sensitive to the choice of unless it is too close to 1. That is, according to Figure 4, we see good curvature in the log marginal likelihood for , with roughly the same maximizer, for a range of . The curves flatten out when is too close to 1, but that “too close” cutoff gets larger with , consistent with the assumption that . To ensure identifiability of , we manually keep our estimate of away from 1, in particular, we take .

#### 4.2.4 Specification of remaining parameters

It remains to specify the likelihood power , the tuning parameters , specifying the prior on configuration size, and to specify a plug-in estimator for the error variance . As in Martin et al. (2017), we take , , and . For the error variance, we use the adaptive lasso to select a configuration and set equal to the mean square error for that selected configuration.

## 5 Simulation experiments

Here we investigate the variable selection performance of different methods in five simulated data settings. In each setting, and and the error variance is set to 1. The first two settings have severe collinearity. We employ the first order autoregressive structure with as the covariance structure of the design matrix ; and the true configuration includes two blocks of variables that the first block contains the 11th to the 15th variable and the second block contains the 31st to the 35th variable. We explored both large and small signal cases as follows.

Case 1:

Case 2:

Case 3:

In this case, we consider a block covariance setting, which is the same as the Case 4 in Narisetty and He (2014). In this setting, interesting variables have common correlation ; uninteresting variables have common correlation ; the common correlation between interesting and uninteresting ones is . The coefficients of the interesting variables are .

Case 4:

This case is similar to Case 3, but let , , and . Also, a larger is adopted.

Case 5:

This case is a low correlation case, which is set the same as Case 2 in Narisetty and He (2014). All variables are set to have common correlation and the coefficients for interesting variables are .

For each case, 1000 data sets are generated. Denoting the chosen configuration as , we compute and in these 1000 iterations to measure the performance of our method, denoted by ECAP. For comparison purposes, we also consider the lasso (Tibshirani 1996), the adaptive lasso (Zou 2006), the smoothly clipped absolute deviation (SCAD, Fan and Li 2001), the elastic net (EN, Zou and Hastie 2005) and an empirical Bayes approach (EB, Martin et al. 2017). Tuning parameters in the first four methods are chosen by BIC. The results are summarized in Table 1.

According to these results, ECAP performs significantly better than lasso, SCAD, and EN in terms of the probability of choosing the true configuration. It also has uniformly better performance compared with EB, which is expected since the new ECAP method takes the correlation information into account. However, when considering , ECAP is not always the highest, e.g., Case 1. Note that and for ECAP are always close to each other, which is not the case for lasso or EN. This is because the ECAP method is more likely to shrink the coefficients of unimportant predictors to zero, which is desirable if the goal is to find the true .

## 6 Real-data illustration

In this section, we examine our method in a real data example and evaluate its performance against other prevalent approaches including lasso, SCAD and the penalized credible region approach in Bondell and Reich (2012). We use the data from an experiment conducted by Lan et al. (2006) that studies the genetics of two inbred mouse populations (B6 and BTBR). The data include gene expressions of 31 female and 29 male mice. Some phenotypes, including phosphoenopiruvate (PEPCK) and glycerol-3-phosphate acyltransferase (GPAT) were also measured by quatitative real-time PCR. The data are available at Gene Expression Omnibus data repository (http://www.ncbi.nlm.nih.gov/geo; accession number GSE3330).

We choose PEPCK and GPAT as response variables. Given that this is an ultra-high dimensional problem, we use marginal correlation based screening method, to screen down from genes to genes. Combining the screened 1999 genes with the sex variable, the final dimension of the predictor matrix is . After screening, we apply our method to the data and select a best subset of predictors . Then we use the posterior mean of as the estimator for , for given and . The posterior distribution for is normal with

 mean=(X⊤^SX^S+V−1^S)−1(X⊤^Sy+ϕV−1^S^β^S)andcov=σ2(X⊤^SX^S+V−1^S)−1,

where

. For hyperparameters

, and , we can plug in their corresponding estimators which can be obtained in the same way as in Section 4.

In order to evaluate the performance of our approach, we randomly split the sample into a training data set of size 55 and a test set of 5. First, we apply our variable selection method to the training set and obtain the selected variables. Then conditioning on this model, we estimate the regression coefficients using the above method. Based on the estimated regression coefficient, we predict the remaining 5 observations and calculate the prediction loss. This process is repeated 100 times, and an estimated mean square prediction error (MSPE) along with its standard error can be computed; see Table

2.

In Table 2, BCR.joint and BCR.marginal denote methods using joint credible sets and marginal credible sets respectively, for details, see Bondell and Reich (2012). The first four methods, ECAP, lasso, BCR.joint and BCR.marginal are implemented on the screened data with dimension . The fifth row is the results of sure independence screening (SIS) combined with SCAD applied to the full data and the last row is the outcomes from directly applying ECAP to unscreened dataset. The stopping rules of lasso, SCAD, BCR.joint and BCR.marginal are based on BIC.

In terms of MSPE, ECAP outperforms all the other methods significantly in both PEPCK and GPAT cases, given the estimated standard errors. Moreover, the MSPE from ECAP is even smaller for the full dataset compared with the screened data. And for the model size, on average, ECAP, lasso, BCR.joint and SIS+SCAD select models with comparable sizes while BCR.marginal always chooses larger models. Overall, ECAP performs very well in this real data example compared to these other methods in terms of both MSPE and model size.

## Acknowledgments

The work presented herein is partially supported by the U.S. National Science Foundation, grant DMS–1737933.

## Appendix A Technical details

### a.1 Preliminary lemmas

Before getting to the proofs of the main theorems, we need to suitably lower-bound the posterior denominator and upper-bound the posterior numerator, the latter depending on the type of neighborhood being considered. In particular, for a generic measurable subset of the parameter space, write

 Πn(A)=Nn(A)Dn=∫A∑Sπ(S)Rn(βS+,β∗)απλ(βS∣S)dβS∫∑Sπ(S)Rn(βS+,β∗)απλ(βS∣S)dβS,

where is the likelihood ratio, with the -vector corresponding to with zeros filled in around the indices in . Lemma 1 gives a general lower bound on the denominator .

In what follows, will denote the true and sparse coefficient vector in , with the corresponding configuration of complexity .

###### Lemma 1.

Given , define

 c=c(α,g,λ,σ2)=⎧⎪ ⎪ ⎪⎨⎪ ⎪ ⎪⎩12log{1+αgκ(S⋆)1+λ},if λ∈[0,∞)12log{1+αgκ(S⋆)},if λ∈[−1,0)12log{1+αgκ(S⋆)−λ},if λ∈(−∞,−1),

where is the condition number of . Then under Assumptions 1–4, the denominator satisfies as .

###### Proof.

Since is a sum of non-negative terms, we get the trivial bound

 Dn >π(S⋆)∫Rn(βS⋆,β⋆S⋆)αN(dβS⋆∣ϕ^βS⋆,σ2gkS⋆(X⊤S⋆XS⋆)λ) =π(S⋆)s⋆∏i=1(1+αgkS⋆d1+λS⋆,i)−1/2exp{α2σ2(An−Bn)},

where

 An=n(^βS⋆−β⋆S⋆)⊤(n−1X⊤S⋆XS⋆)(^βS⋆−β⋆S⋆),Bn=(1−ϕ)2^β⊤S⋆Q−1S⋆^βS⋆

are both non-negative, and the matrix is defined as

 QS=(X⊤SXS)−1+αgkS(X⊤SXS)λ,S⊆{1,2,…,p}. (12)

From the well-known sampling distribution of , we have

 An∼σ2ChiSq(s⋆)=Op(s⋆).

Next, for , under Assumption 3, it can be verified that the maximal eigenvalue of is . Therefore, . Since

 ∥^βS⋆∥2≲∥^βS⋆−β⋆S⋆∥2+∥β⋆S⋆∥2,

and the first term is , it follows from Assumptions 1 and 2 that