Bayesian Variable Selection for Non-Gaussian Responses: A Marginally Calibrated Copula Approach

07/10/2019 · Nadja Klein, et al. · Melbourne Business School

We propose a new highly flexible and tractable Bayesian approach to undertake variable selection in non-Gaussian regression models. It uses a copula decomposition for the vector of observations on the dependent variable. This allows the marginal distribution of the dependent variable to be calibrated accurately using a nonparametric or other estimator. The family of copulas employed consists of 'implicit copulas' that are constructed from existing hierarchical Bayesian models used for variable selection, and we establish some of their properties. Even though the copulas are high-dimensional, they can be estimated efficiently and quickly using Monte Carlo methods. A simulation study shows that when the responses are non-Gaussian the approach selects variables more accurately than contemporary benchmarks. A marketing example illustrates that accounting for even mild deviations from normality can lead to a substantial improvement. To illustrate the full potential of our approach we extend it to spatial variable selection for fMRI data. It allows for voxel-specific marginal calibration of the magnetic resonance signal at over 6,000 voxels, leading to a considerable increase in the quality of the activation maps.


1 Introduction

Bayesian approaches to selecting covariates in regression models are well established; see Clyde and George (2004), O'Hara and Sillanpää (2009) and Bottolo and Richardson (2010) for overviews. However, most work remains focused on Gaussian regression models, and extensions to the non-Gaussian case are still limited. In particular, the importance of distributional calibration (Gneiting et al., 2007) in Bayesian variable (i.e., covariate) selection remains unexplored. We address this question here by proposing a new Bayesian approach to covariate selection that is based on a copula decomposition for the vector of $n$ observations on the dependent variable. The impact of the covariates on the dependent variable is captured by the copula function only. This separates the task of selecting covariates from that of modeling the marginal distribution of the dependent variable; the latter can then be calibrated accurately, either parametrically or non-parametrically, regardless of its distributional form. For the copula function we propose a new family of 'implicit copulas', which are constructed from existing Bayesian hierarchical regression models used for selecting covariates. The result is a very general and tractable approach that extends Bayesian variable selection methodology to data that have an arbitrary marginal distribution. Our aim is to show that by doing so, accurate marginal calibration increases the precision of covariate selection, and also the accuracy of the predictive distribution of the dependent variable.

Lower-dimensional copulas are often used to capture dependence between multiple variables (Nelsen, 2006; Joe, 2015). We stress that the copula is used here in a very different way: to capture the dependence between multiple observations on just one dependent variable. The specification of this $n$-dimensional copula function is the key ingredient of our method. Few existing copulas can be used in such high dimensions, although implicit copulas constructed by the inversion of a parametric distribution (Nelsen, 2006, Sec. 3.1) can. To construct a copula family we consider a Gaussian linear model for $n$ observations on a second dependent variable, which we label a 'pseudo-response' because it is not observed directly. Gaussian spike-and-slab priors (Smith and Kohn, 1996; George and McCulloch, 1997) with selection indicator variables are employed for the coefficients, which are integrated out to obtain the distribution of the pseudo-response vector conditional on the covariates and indicators. This is a Gaussian distribution, and its implicit copula is the popular Gaussian copula function (Song, 2000), with a parameter matrix that is a function of both the covariate values and the indicators. Finally, to obtain our copula family we mix this Gaussian copula over the scaling factor of the non-zero coefficients, with respect to the different hyper-priors suggested by Liang et al. (2008). The resulting implicit copulas are not Gaussian copulas, and we derive some of their properties.

Because of its high dimension, it is difficult to evaluate the implicit copula directly. However, we show how to construct a Markov chain Monte Carlo (MCMC) sampler to undertake stochastic search variable selection (George and McCulloch, 1993), where the scaling factor is generated using the Hamiltonian Monte Carlo (HMC) method of Hoffman and Gelman (2014). Careful use of matrix identities for the computations makes application of the method in high dimensions practical. A simulation study and a real marketing data example compare our approach to Gaussian Bayesian variable selection and the high-performing method of Rossell and Rubio (2018). They show that accurate marginal calibration can result in more precise covariate selection, along with more accurate predictive densities for the dependent variable.

To illustrate the generality of our approach, we extend it to spatial variable selection for functional magnetic resonance imaging (fMRI). In these studies a large vector of binary indicators signifies which voxels in a partition of the brain are active. We follow Smith and Fahrmeir (2007), Li and Zhang (2010) and Goldsmith et al. (2014) and use an Ising density as a prior to smooth these spatially, but employ our implicit copula model to allow for voxel-wise marginal calibration of the magnetic resonance (MR) signal. The neuroimaging literature suggests that accounting for deviations from normality in the MR signal is important to obtain accurate activation maps (Eklund et al., 2017). Application of our approach to data from a visual experiment with over 6,000 voxels shows this to be true here, and also produces much more accurate voxel-wise predictive distributions for the MR signal, as measured by the logarithmic score.

A number of other approaches allow Bayesian variable selection for non-Gaussian continuous-valued data. These include conditionally Gaussian models, where the disturbances follow a mixture of normals and/or data transformations of the dependent variable are considered (Smith and Kohn, 1996; Gottardo and Raftery, 2009; Wang et al., 2017; Ranciati et al., 2019). Rossell and Rubio (2018) propose a Bayesian variable selection approach that allows for skewness and thicker tails than the Gaussian distribution, but cover only two-piece errors applied to the Gaussian and Laplace distributions. Chung and Dunson (2009) and Kundu and Dunson (2014) propose non-parametric models where the mean and shape learn the effect of covariates, but assume symmetric residuals. Yu et al. (2013) propose variable selection in Bayesian quantile regression using a latent scale augmentation of the asymmetric Laplace errors, and Yan and Kottas (2017) extend Azzalini's skew normal to Laplace errors in Bayesian quantile regression with LASSO penalties. However, none of these approaches fits our copula framework or ensures accurate calibration of the marginal distribution of the response. Sharma and Das (2018) recently proposed an alternative class of non-conjugate priors for regression coefficients based on copulas; however, this is very different from our use of copulas to build a non-Gaussian regression model here. Last, Kraus and Czado (2017) use a D-vine copula to flexibly capture dependence between covariates and response in a regression model, also allowing for accurate marginal calibration of the response. However, this approach uses a copula in a very different way than that suggested here, and does not generalize existing Bayesian variable selection schemes. For example, it could not be employed to undertake the spatial variable selection for fMRI considered here.

The rest of this paper is structured as follows. Section 2 outlines our approach, including derivation of the implicit copula and some of its properties. Section 3 outlines how to compute Bayesian inference using stochastic search, along with the predictive densities of the dependent variable and Bayes factors for model comparison. Section 4 contains the simulation study, and Section 5 the analysis of the marketing data. Section 6 extends the approach to spatial variable selection for fMRI data, and Section 7 concludes.

2 Variable Selection in Regression using Copulas

2.1 Marginally calibrated variable selection

Consider a vector $y = (y_1, \ldots, y_n)'$ of $n$ realisations on a continuous dependent variable, along with an $n \times p$ design matrix $X$ for $p$ regression covariates. Bayesian approaches to variable selection usually proceed by introducing a vector of binary indicator variables $\gamma = (\gamma_1, \ldots, \gamma_p)'$, such that the $j$th covariate is included in the regression if $\gamma_j = 1$, and excluded if $\gamma_j = 0$. When the dependent variable is non-Gaussian, the most common approach is to consider non-Gaussian distributions for the disturbance to a linear model; for example, see Chung and Dunson (2009); Rossell and Rubio (2018) and references therein. Thus, either a parametric or non-parametric non-Gaussian distribution is selected for $y$, conditional on $X$ and $\gamma$. In this paper we suggest an alternative approach based on copulas that allows the marginal distribution of $y_i$, unconditional on $X$ and $\gamma$, to be selected instead.

To define our model, let each $y_i$ have marginal distribution and density functions $F_Y$ and $p_Y$, respectively. Then the copula representation (Sklar, 1959) of the joint density of $y$ is $p(y) = c(u) \prod_{i=1}^{n} p_Y(y_i)$, where $u = (u_1, \ldots, u_n)'$, $u_i = F_Y(y_i)$, and $c$ is an $n$-dimensional copula density. However, in general $c$ and $F_Y$ are both unknown, so that in copula modelling it is typical to select forms for both. In this paper we model the joint density as

p(y | X, \gamma) = c(u | X, \gamma) \prod_{i=1}^{n} p_Y(y_i).    (1)

Above, the distribution of $y_i$ is assumed to be marginally invariant with respect to $X$ and $\gamma$, with density $p_Y$ and distribution function $F_Y$ for all $i$. Instead, the impact of the covariate values $X$ and model indicators $\gamma$ on $y$ jointly is captured through the copula with density $c(\cdot | X, \gamma)$, which is a function of $X$ and $\gamma$. For this we use the copula proposed in Section 2.2 below.

A major advantage of employing Eq. (1) is that it separates the modeling of the marginal distribution of the data from the task of selecting the covariates. Therefore, $F_Y$ can be calibrated accurately, and we choose to model it non-parametrically in our work. Variable selection is based on the posterior distribution of $\gamma$, which is given by

p(\gamma | y) \propto p(y | X, \gamma)\, p(\gamma),    (2)

with model prior $p(\gamma)$ and likelihood $p(y | X, \gamma)$ given at Eq. (1). A major aim of this paper is to show that adopting Eq. (1) with our proposed copula provides a very general, but tractable, approach to undertaking variable selection and model averaging for non-Gaussian regression data.

We make two remarks concerning the appropriateness of the decomposition at Eq. (1). First, regression models are usually specified conditional on parameters for the mean, variance and possibly other moments. In contrast, the expressions above are unconditional on such parameters. We show later in Section 2.4 that when also conditioning on additional model parameters, the distribution of $y_i$ is a function of the covariates, as is expected in a regression model. Second, the assumption that the density $p_Y$ is not a function of $X$ is consistent with the standard Gaussian linear regression model. For example, when a proper g-prior is used for the regression coefficients, the margin of $y_i$ with the coefficients and error variance integrated out is asymptotically independent of $X$; see Appendix A for a proof.

Last, consider a new realization $y_{n+1}$ of the dependent variable, with a vector of observed covariate values $x_{n+1}$. To see how it is directly affected by the covariates and indicators, let $u_{n+1} = F_Y(y_{n+1})$ and $u_{1:n+1} = (u_1, \ldots, u_{n+1})'$, then from Eq. (1), $y_{n+1}$ has predictive density conditional on $\gamma$:

p(y_{n+1} | y, X, x_{n+1}, \gamma) = \frac{c(u_{1:n+1} | X, x_{n+1}, \gamma)}{c(u | X, \gamma)}\, p_Y(y_{n+1}).    (3)

This is a function of the observations on all the covariates, $X$ and $x_{n+1}$, and the indicator variables $\gamma$. Moreover, marginalizing over the posterior of $\gamma$ gives the posterior predictive density for $y_{n+1}$ as

p(y_{n+1} | y, X, x_{n+1}) = \sum_{\gamma} p(y_{n+1} | y, X, x_{n+1}, \gamma)\, p(\gamma | y).    (4)

Eq. (4) forms the basis of the model average predictive distribution of $y_{n+1}$ from the copula model, and we show how to evaluate it efficiently in Section 3.2.2.
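To make the decomposition at Eq. (1) concrete, the following sketch evaluates its logarithm for the simplest case in which the copula density $c$ is a Gaussian copula with a known correlation matrix (the mixture components used in Section 2.2). It is an illustration only: the function names are ours, the margin is taken as known, and forming the full $n \times n$ matrix is feasible only for small $n$.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal, expon

def log_joint_density(y, marg_cdf, marg_pdf, R):
    """Log of Eq. (1) when the copula density is a Gaussian copula with
    correlation matrix R: log c_Ga(u | R) + sum_i log p_Y(y_i)."""
    u = np.clip(marg_cdf(y), 1e-10, 1 - 1e-10)      # pseudo-data u_i = F_Y(y_i)
    z = norm.ppf(u)                                  # normal scores z_i = Phi^{-1}(u_i)
    log_cop = (multivariate_normal.logpdf(z, mean=np.zeros(len(z)), cov=R)
               - norm.logpdf(z).sum())               # Gaussian copula log density
    return log_cop + np.log(marg_pdf(y)).sum()       # add the marginal log densities

# Toy illustration: an exponential margin and an equicorrelation copula matrix.
rng = np.random.default_rng(0)
n = 5
y = rng.exponential(scale=2.0, size=n)
R = 0.3 * np.ones((n, n)) + 0.7 * np.eye(n)
print(log_joint_density(y, expon(scale=2.0).cdf, expon(scale=2.0).pdf, R))
```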

2.2 Variable selection copula

Key to our approach is the specification of our proposed copula function $C_{VS}$ with density $c_{VS}$. To derive these, we consider the linear model

\tilde z = X_\gamma \beta_\gamma + e,    (5)

where $\tilde z = (\tilde z_1, \ldots, \tilde z_n)'$, $X_\gamma$ is an $n \times q_\gamma$ sub-matrix of $X$ that comprises the columns of $X$ where $\gamma_j = 1$, $\beta_\gamma$ are the corresponding $q_\gamma$ regression coefficients, and $e \sim N(0, \sigma^2 I_n)$. We refer to $\tilde z$ as a vector of observations on a 'pseudo-response', because it is not observed directly in our model. Following Smith and Kohn (1996); George and McCulloch (1997); Liang et al. (2008) and many others, the conjugate g-prior of Zellner (1986)

\beta_\gamma | \sigma^2, g, \gamma \sim N\left(0,\, g \sigma^2 (X_\gamma' X_\gamma)^{-1}\right)    (6)

is used for the non-zero coefficients. The use of this prior has been much discussed in the context of a linear model; for example, see Clyde and George (2004) or Liang et al. (2008). Its conjugacy and scaling prove attractive features for constructing the variable selection copula.

To construct the variable selection copula, we first extract the 'implicit' copula of the joint distribution of $\tilde z$ conditional on $(X, \gamma, g)$, but with $\beta_\gamma$ integrated out. Implicit copulas, also called 'inversion' copulas, are obtained by inverting the usual expression of Sklar's theorem; see Nelsen (2006, Sec. 3.1). Here, we construct it from the distribution $\tilde z | X, \gamma, g, \sigma^2 \sim N(0, \sigma^2 \Omega_\gamma)$, where

\Omega_\gamma = I_n + g X_\gamma (X_\gamma' X_\gamma)^{-1} X_\gamma',

which follows from integrating out $\beta_\gamma$ as a normal, and applying the Woodbury formula. The implicit copula of a normal distribution is the Gaussian copula (Song, 2000), with parameter matrix equal to the correlation matrix of the distribution. To derive this matrix we standardize $\tilde z_i$ by the marginal standard deviations $\sigma s_i$, with

s_i^2 = 1 + g\, x_{\gamma,i}' (X_\gamma' X_\gamma)^{-1} x_{\gamma,i},

where $x_{\gamma,i}'$ is the $i$th row of $X_\gamma$. If $S = \mathrm{diag}(s_1, \ldots, s_n)$, then the correlation matrix is

R(\gamma, g) = S^{-1} \Omega_\gamma S^{-1} = S^{-1}\left(I_n + g X_\gamma (X_\gamma' X_\gamma)^{-1} X_\gamma'\right) S^{-1}.    (7)

Thus, the implicit copula of the distribution of $\tilde z$ (conditional on $X$, $\gamma$ and $g$) is the Gaussian copula function

C_{Ga}(u | R(\gamma, g)) = \Phi_n\left(\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_n); R(\gamma, g)\right),

with density

c_{Ga}(u | R(\gamma, g)) = \phi_n(z; 0, R(\gamma, g)) \big/ \prod_{i=1}^{n} \phi(z_i),

where $\Phi$ and $\Phi_n(\cdot; R)$ are the $N(0,1)$ and $N(0, R)$ distribution functions, respectively, and $z_i = \Phi^{-1}(u_i)$ (Song, 2000). Note that the location and scale of $\tilde z$ are unidentified in its copula, with $\sigma^2$ not featuring in the copula parameter matrix $R(\gamma, g)$. Therefore, without loss of generality, throughout the paper we set $\sigma = 1$ and do not include an intercept in $X$.[1]

[1] We stress here that this does not mean the observational data has zero mean or fixed scale. Instead, these are captured through the margin $F_Y$ in Eq. (1).
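As a small numerical check of Eq. (7), the sketch below builds $R(\gamma, g)$ directly for a toy design matrix. The construction is only practical for small $n$, the function name is ours, and the estimation methods of Section 3 avoid ever forming this matrix.

```python
import numpy as np

def copula_correlation(X, gamma, g):
    """Correlation matrix R(gamma, g) of Eq. (7): standardize
    Omega = I_n + g * X_g (X_g'X_g)^{-1} X_g' to have unit diagonal."""
    Xg = X[:, gamma.astype(bool)]
    n = X.shape[0]
    if Xg.shape[1] == 0:
        return np.eye(n)                               # no covariates selected
    Omega = np.eye(n) + g * Xg @ np.linalg.solve(Xg.T @ Xg, Xg.T)
    s = np.sqrt(np.diag(Omega))                        # marginal standard deviations s_i
    return Omega / np.outer(s, s)

# Example: n = 6 observations, p = 3 covariates, with the second covariate excluded.
rng = np.random.default_rng(1)
X = rng.standard_normal((6, 3))
print(np.round(copula_correlation(X, gamma=np.array([1, 0, 1]), g=10.0), 3))
```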

Finally, we mix over $g$ with respect to its prior $p(g)$ to obtain the variable selection copula as a continuous mixture of Gaussian copulas, as defined below.

Definition 1.

Let $C_{Ga}(u | R)$ and $c_{Ga}(u | R)$ be the Gaussian copula function and density, respectively. Then if $\tilde z$ follows the linear model at Eq. (5), with the g-prior for $\beta_\gamma$ at Eq. (6), and $p(g)$ is a proper density for $g > 0$, then we call

C_{VS}(u | X, \gamma) = \int_0^\infty C_{Ga}(u | R(\gamma, g))\, p(g)\, dg

a variable selection copula, with density function

c_{VS}(u | X, \gamma) = \int_0^\infty c_{Ga}(u | R(\gamma, g))\, p(g)\, dg.

Moreover, it is straightforward to show the following Corollary:

Corollary 1.

The function $C_{VS}(u | X, \gamma)$ is a well-defined copula function.

We consider the three priors discussed by Liang et al. (2008) for $g$, along with a point mass, as listed below:

  • Hyper-g prior: with density $p(g) = \frac{a-2}{2}(1+g)^{-a/2}$ for $g > 0$, which is proper for $a > 2$ and is closely related to both the Jeffreys and reference priors. Note that this prior implies a Beta prior on the shrinkage factor $g/(1+g)$ with mean $2/a$, and we set $a = 4$, leading to a uniform prior on this shrinkage factor.

  • Hyper-g/n prior: with density $p(g) = \frac{a-2}{2n}\left(1 + g/n\right)^{-a/2}$. This prior is motivated by the lack of consistency of the hyper-g prior in the linear model.

  • Zellner-Siow prior: $g \sim \mathrm{Inverse\text{-}Gamma}(1/2, n/2)$ with density $p(g) = \frac{(n/2)^{1/2}}{\Gamma(1/2)} g^{-3/2} e^{-n/(2g)}$. This prior is less popular for a linear model because the marginal distributions of the dependent variable are not of closed form. However, the margin of $y_i$ is selected directly in the copula model at Eq. (1), so this is not an issue when constructing a variable selection copula.

  • Point mass prior: We also consider fixing $g$ at a constant value (i.e. a point mass prior) for comparison.

While computing the integral over $g$ in Defn. 1 is possible using numerical methods, in general it is difficult to evaluate $C_{VS}$ or $c_{VS}$ directly because $R(\gamma, g)$ is an $n \times n$ matrix. Instead, when computing inference we generate $g$ as part of a Markov chain Monte Carlo (MCMC) scheme, as discussed in Section 3.
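For small $n$, the mixture in Definition 1 can nevertheless be approximated numerically, which is useful for checking code. The sketch below estimates $c_{VS}(u | X, \gamma)$ by Monte Carlo, averaging the Gaussian copula density over draws of $g$ from the hyper-g prior, drawn through its shrinkage factor $g/(1+g)$ (uniform when $a = 4$). Function names are ours, and the helper rebuilding $R(\gamma, g)$ repeats the direct small-$n$ construction shown above.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal, beta

def copula_correlation(X, gamma, g):
    """Direct small-n construction of R(gamma, g) from Eq. (7)."""
    Xg = X[:, gamma.astype(bool)]
    Omega = np.eye(X.shape[0])
    if Xg.shape[1] > 0:
        Omega += g * Xg @ np.linalg.solve(Xg.T @ Xg, Xg.T)
    s = np.sqrt(np.diag(Omega))
    return Omega / np.outer(s, s)

def vs_copula_density_mc(u, X, gamma, g_draws):
    """Monte Carlo estimate of c_VS(u | X, gamma) = E_g[ c_Ga(u | R(gamma, g)) ]."""
    z = norm.ppf(u)
    vals = [np.exp(multivariate_normal.logpdf(z, cov=copula_correlation(X, gamma, g))
                   - norm.logpdf(z).sum()) for g in g_draws]
    return np.mean(vals)

# Draw g from the hyper-g prior with a = 4, via g/(1+g) ~ Beta(1, a/2 - 1) = U(0, 1).
rng = np.random.default_rng(2)
a = 4.0
w = beta.rvs(1.0, a / 2.0 - 1.0, size=2000, random_state=rng)
g_draws = w / (1.0 - w)

X = rng.standard_normal((6, 3))
u = np.array([0.2, 0.7, 0.4, 0.9, 0.1, 0.55])
print(vs_copula_density_mc(u, X, np.array([1, 0, 1]), g_draws))
```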

Last, for $\gamma$ we use the mass function $p(\gamma) = \left\{(p+1)\binom{p}{q_\gamma}\right\}^{-1}$, where $q_\gamma = \sum_{j=1}^{p} \gamma_j$. This has been used extensively in the Bayesian selection literature because it accounts for the multiplicity of the $2^p$ possible configurations of $\gamma$ (Scott and Berger, 2010). It implies a uniform prior distribution on the model size $q_\gamma$ and Bernoulli margins $\gamma_j \sim \mathrm{Bernoulli}(1/2)$.
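The mass function above, as reconstructed here, has a simple closed form; a minimal helper (with our own naming) is given below, and it is reused in the sampler sketches of Section 3.

```python
import numpy as np
from scipy.special import gammaln

def log_prior_gamma(gamma):
    """log p(gamma) for the beta-binomial(1, 1) prior: uniform on the model
    size q_gamma, then uniform over configurations of that size."""
    p = len(gamma)
    q = int(np.sum(gamma))
    log_binom = gammaln(p + 1) - gammaln(q + 1) - gammaln(p - q + 1)
    return -np.log(p + 1) - log_binom

print(np.exp(log_prior_gamma(np.array([1, 0, 1, 0, 0]))))   # p = 5, q = 2: 1/(6*10)
```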

2.3 Copula properties

Here, we state some elementary properties of the copula $C_{VS}$, with proofs given in the Online Appendix.

  (i) When $\gamma = 0$, so that no covariates are selected, $R(\gamma, g) = I_n$ and the variable selection copula is the independence copula (Nelsen, 2006, p.11), with $C_{VS}(u | X, 0) = \prod_{i=1}^{n} u_i$.

Below are some pairwise dependence metrics of the bivariate sub-copula of $C_{VS}$ in elements $(j, k)$. This is defined as $C_{VS,jk}(u_j, u_k | X, \gamma)$, obtained by evaluating $C_{VS}(u | X, \gamma)$ at $u_{-(j,k)} = (1, \ldots, 1)'$, with $u_{-(j,k)}$ the sub-vector of $u$ without elements $(j, k)$.

  (ii) For $0 < \alpha < 1$, the lower and upper quantile dependences are

\lambda^{L}_{jk}(\alpha) = \int_0^\infty \lambda^{L}_{Ga}(\alpha; r_{jk}(\gamma, g))\, p(g)\, dg,  \qquad  \lambda^{U}_{jk}(\alpha) = \int_0^\infty \lambda^{U}_{Ga}(\alpha; r_{jk}(\gamma, g))\, p(g)\, dg,

    where $\lambda^{L}_{Ga}$ and $\lambda^{U}_{Ga}$ are the lower and upper pairwise quantile dependences of a bivariate Gaussian copula with correlation parameter given by the $(j,k)$th element $r_{jk}(\gamma, g)$ of $R(\gamma, g)$ in Eq. (7).

  (iii) The lower and upper extremal tail dependences are the limits of $\lambda^{L}_{jk}(\alpha)$ as $\alpha \to 0^+$ and of $\lambda^{U}_{jk}(\alpha)$ as $\alpha \to 1^-$, respectively.

  (iv) Spearman's rho is

\rho^{S}_{jk} = \frac{6}{\pi} \int_0^\infty \arcsin\!\left( r_{jk}(\gamma, g)/2 \right) p(g)\, dg,

    where $r_{jk}(\gamma, g)$ is as defined above and is a function of $X$.

  (v) Kendall's tau is available as a similar, although more involved, integral with respect to $p(g)$.

The dependence metrics at properties (ii), (iv) and (v) above are functions of the varying dimension parameter $\gamma$, and also all the covariate values in $X$, rather than just those corresponding to the $j$th and $k$th observations.
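Because the bivariate $(j,k)$ margin of $C_{VS}$ is itself a mixture of bivariate Gaussian copulas with correlations $r_{jk}(\gamma, g)$, and both the quantile dependence and Spearman's rho are linear functionals of the copula function, they can be estimated by averaging the corresponding Gaussian-copula quantities over draws of $g$. The sketch below does this numerically; it is an illustration under that reconstruction, with function names of our own choosing.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def r_jk(X, gamma, g, j, k):
    """The (j, k) element of R(gamma, g) in Eq. (7)."""
    Xg = X[:, gamma.astype(bool)]
    A = np.linalg.inv(Xg.T @ Xg)
    s = lambda i: np.sqrt(1.0 + g * Xg[i] @ A @ Xg[i])
    return g * (Xg[j] @ A @ Xg[k]) / (s(j) * s(k))

def pairwise_metrics(X, gamma, j, k, g_draws, alpha=0.05):
    """Lower quantile dependence at level alpha and Spearman's rho of the (j, k)
    bivariate margin, estimated by averaging the Gaussian-copula values over g."""
    lam, rho = [], []
    for g in g_draws:
        r = r_jk(X, gamma, g, j, k)
        biv = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, r], [r, 1.0]])
        z = norm.ppf(alpha)
        lam.append(biv.cdf([z, z]) / alpha)           # lambda_L(alpha) for this g
        rho.append(6.0 / np.pi * np.arcsin(r / 2.0))  # Spearman's rho for this g
    return np.mean(lam), np.mean(rho)

# Example: hyper-g draws (a = 4) and a small design matrix.
rng = np.random.default_rng(3)
w = rng.uniform(size=500)            # shrinkage g/(1+g) is uniform when a = 4
g_draws = w / (1.0 - w)
X = rng.standard_normal((6, 3))
print(pairwise_metrics(X, np.array([1, 1, 0]), j=0, k=1, g_draws=g_draws))
```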

2.4 Margin of $y_i$

In the copula decomposition at Eq. (1), it is assumed that $y_i$ is marginally independent of the covariates $X$ and indicators $\gamma$. However, to see how the covariate values are linked to the dependent variable, we derive the distribution of $y_i$ conditional on $(X, \gamma)$ and also on $(\beta_\gamma, g)$. To do so, let $z_i = \tilde z_i / s_i$, where as before $\tilde z_i$ is the pseudo-response with $s_i^2 = 1 + g\, x_{\gamma,i}'(X_\gamma' X_\gamma)^{-1} x_{\gamma,i}$. From Eq. (5), this normalized pseudo-response has distribution $z_i | X, \gamma, g, \beta_\gamma \sim N(x_{\gamma,i}' \beta_\gamma / s_i,\ 1/s_i^2)$. Because the copula model at Eq. (1) is an implicit copula model, $y_i$ can be expressed as the transformation $y_i = F_Y^{-1}(\Phi(z_i))$ of $z_i$, with Jacobian $p_Y(y_i)/\phi(z_i)$. Then, the density

p(y_i | X, \gamma, g, \beta_\gamma) = \phi\!\left(z_i;\ x_{\gamma,i}' \beta_\gamma / s_i,\ 1/s_i^2\right) \frac{p_Y(y_i)}{\phi(z_i)}.    (8)

Thus, the distribution of $y_i$ is a function of the covariates when also conditioning on $(\beta_\gamma, g)$.

3 Estimation and Inference

Estimation of the copula model at Eq. (1) requires estimation of both the marginal $F_Y$ and the parameters $(\gamma, g)$. In the copula literature, it is popular to use two-stage estimators, where $F_Y$ is estimated first, followed by the copula parameters, because they are much faster and only involve a minor loss of efficiency in a likelihood context (Joe, 2005). Grazian and Liseo (2017) and Klein and Smith (2018) integrate out uncertainty for $F_Y$ using a Bayesian non-parametric estimator, but find that this does not improve the accuracy of inference meaningfully. Here, we adopt a two-stage estimator, and use the adaptive kernel density estimator (KDE) of Shimazaki and Shinomoto (2010) to estimate $F_Y$ in the first stage.
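As a concrete first-stage sketch, the snippet below estimates the margin with a kernel density estimator and maps the observations to the normal scores used later in Section 3.1. It is illustrative only: a fixed-bandwidth Gaussian KDE stands in for the adaptive estimator of Shimazaki and Shinomoto (2010), and the function names are ours.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def marginal_normal_scores(y, grid_size=2000):
    """Stage 1 of the two-stage estimator: estimate the margin F_Y with a KDE,
    then form the normal scores z_i = Phi^{-1}(F_Y_hat(y_i))."""
    kde = gaussian_kde(y)
    lo, hi = y.min() - 3 * y.std(), y.max() + 3 * y.std()
    grid = np.linspace(lo, hi, grid_size)
    cdf = np.cumsum(kde(grid))
    cdf /= cdf[-1]                                   # crude numerical CDF on the grid
    F_hat = lambda t: np.interp(t, grid, cdf)
    u = np.clip(F_hat(y), 1e-6, 1 - 1e-6)            # pseudo-observations u_i
    return norm.ppf(u), F_hat, kde

rng = np.random.default_rng(4)
y = rng.gamma(shape=2.0, scale=1.5, size=200)        # a skewed (non-Gaussian) response
z, F_hat, p_hat = marginal_normal_scores(y)
print(z[:5])
```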

3.1 Posterior evaluation

We follow George and McCulloch (1993) and subsequent authors, and evaluate the atoms in the posterior of $\gamma$ using Markov chain Monte Carlo (MCMC). However, direct computation of the posterior mass function at Eq. (2) is slow because computing $p(y | X, \gamma)$ requires numerical integration over $g$. We therefore pursue an alternative approach that also generates $g$ as part of the MCMC scheme. To implement the sampler we express the likelihood conditional on $g$, which following Klein and Smith (2018) can be obtained in closed form by transforming to the (normalized) pseudo-response as follows. Let $z_i = \tilde z_i / s_i$, where $\tilde z_i$ is the pseudo-response at Eq. (5), then it follows from Section 2.2 that $z = (z_1, \ldots, z_n)' | X, \gamma, g \sim N(0, R(\gamma, g))$. Moreover, $z_i$ can be expressed in terms of $y_i$ as $z_i = \Phi^{-1}(F_Y(y_i))$, which is neither a function of $\gamma$ nor $g$. By a change of variables from $y$ to $z$,

p(y | X, \gamma, g) = \phi_n(z; 0, R(\gamma, g)) \prod_{i=1}^{n} \frac{p_Y(y_i)}{\phi(z_i)},    (9)

where the Jacobian of the transformation is $\prod_{i=1}^{n} p_Y(y_i)/\phi(z_i)$.
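Under the reconstruction of Eqs. (7) and (9) above, the Gaussian term in Eq. (9) can be evaluated in $O(n q_\gamma^2)$ operations using the Woodbury identity and the matrix determinant lemma, so that the $n \times n$ matrix is never formed. The sketch below is ours (the identities in the paper's Appendix B may differ in detail) and includes a brute-force check on a small example.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def log_copula_likelihood(z, X, gamma, g):
    """log phi_n(z; 0, R(gamma, g)) without forming the n x n matrix, using the
    low-rank structure of Eq. (7). Adding sum_i [log p_Y(y_i) - log phi(z_i)]
    gives the log likelihood at Eq. (9)."""
    n = len(z)
    Xg = X[:, gamma.astype(bool)]
    q = Xg.shape[1]
    if q == 0:
        return norm.logpdf(z).sum()               # R = I_n when no covariates selected
    A = np.linalg.inv(Xg.T @ Xg)                  # q x q, cheap when q << n
    h = np.einsum('ij,jk,ik->i', Xg, A, Xg)       # leverages h_i = x_i' A x_i
    s2 = 1.0 + g * h                              # marginal variances s_i^2
    zt = np.sqrt(s2) * z                          # back-transform to the pseudo-response scale
    proj = Xg @ (A @ (Xg.T @ zt))                 # projection of zt onto the selected columns
    quad = zt @ zt - (g / (1.0 + g)) * (zt @ proj)     # zt' Omega^{-1} zt via Woodbury
    logdet = q * np.log1p(g) - np.log(s2).sum()        # log|R| via the determinant lemma
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

# Quick check against the dense computation for a small example.
rng = np.random.default_rng(5)
X = rng.standard_normal((8, 3))
z = rng.standard_normal(8)
gamma, g = np.array([1, 0, 1]), 25.0
Xg = X[:, gamma.astype(bool)]
Omega = np.eye(8) + g * Xg @ np.linalg.solve(Xg.T @ Xg, Xg.T)
s = np.sqrt(np.diag(Omega))
R = Omega / np.outer(s, s)
print(log_copula_likelihood(z, X, gamma, g),
      multivariate_normal.logpdf(z, mean=np.zeros(8), cov=R))
```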

The right-hand side of Eq. (9) is tractable, as long as the $n \times n$ matrix $R(\gamma, g)$ is not computed directly. To evaluate the posterior, we employ the following sampler.

MCMC Sampler

  1. Randomly partition $\{1, \ldots, p\}$ into pairs of elements.

  2. For each pair $(j, k)$, generate $(\gamma_j, \gamma_k)$ from $p(\gamma_j, \gamma_k | y, \gamma_{-(j,k)}, g)$.

  3. Generate $g$ from $p(g | y, \gamma)$ using Hamiltonian Monte Carlo.

In forming the partition in Step 1, if $p$ is odd-valued one element is simply selected twice, so that pairs of elements are always generated in Step 2. To implement Step 2, first note that from Eq. (9) the joint posterior of the indicators is

p(\gamma | y, g) \propto p(y | X, \gamma, g)\, p(\gamma).

Generating each pair $(\gamma_j, \gamma_k)$ involves computing $p(y | X, \gamma, g)\, p(\gamma)$ for the four possible configurations $(\gamma_j, \gamma_k) \in \{0,1\}^2$, and then setting

\Pr(\gamma_j = l_1, \gamma_k = l_2 | y, \gamma_{-(j,k)}, g) = \frac{p(y | X, \gamma^{(l_1,l_2)}, g)\, p(\gamma^{(l_1,l_2)})}{\sum_{(m_1,m_2) \in \{0,1\}^2} p(y | X, \gamma^{(m_1,m_2)}, g)\, p(\gamma^{(m_1,m_2)})},    (10)

where $\gamma^{(l_1,l_2)}$ denotes $\gamma$ with $(\gamma_j, \gamma_k) = (l_1, l_2)$, and we are careful to compute all ratios on the logarithmic scale. To implement this efficiently requires fast computation of $p(y | X, \gamma, g)$ for the four configurations in $\{0,1\}^2$, which we do using a number of matrix identities, as outlined in Appendix B. In doing so, at no stage is the $n \times n$ matrix $R(\gamma, g)$ computed directly, which would be prohibitive.
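A minimal sketch of the pairwise draw at Step 2, under the reconstruction at Eq. (10), is given below. It evaluates the four configurations on the log scale, and reuses the log_copula_likelihood and log_prior_gamma helpers from the earlier sketches; names and details are illustrative rather than the paper's own implementation.

```python
import numpy as np

def update_pair(z, X, gamma, g, j, k, rng):
    """Step 2 of the sampler: draw (gamma_j, gamma_k) from Eq. (10) by evaluating
    the four configurations on the log scale. Requires log_copula_likelihood and
    log_prior_gamma from the sketches above to be in scope."""
    configs = [(0, 0), (0, 1), (1, 0), (1, 1)]
    logw = np.empty(4)
    for m, (lj, lk) in enumerate(configs):
        gam = gamma.copy()
        gam[j], gam[k] = lj, lk
        logw[m] = log_copula_likelihood(z, X, gam, g) + log_prior_gamma(gam)
    probs = np.exp(logw - logw.max())     # normalize on the log scale for stability
    probs /= probs.sum()                  # the four probabilities in Eq. (10)
    lj, lk = configs[rng.choice(4, p=probs)]
    gamma = gamma.copy()
    gamma[j], gamma[k] = lj, lk
    return gamma, probs
```

Returning the four configuration probabilities alongside the draw is convenient, because they are reused for the Rao-Blackwellised inclusion probabilities in Section 3.2.1.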

Generating $g$ at Step 3 uses a Hamiltonian Monte Carlo (HMC) step. This is where $g$ is augmented by a momentum variable, and a draw is made from the extended target distribution that is proportional to the exponential of the Hamiltonian function. Dynamics specify how the Hamiltonian function evolves, and its volume-conserving property results in high acceptance rates of the proposed iterates. We use a variant of the leapfrog integrator of Neal (2011) to generate $g$, which employs the logarithm of the target density

\log p(g | y, \gamma) = \log p(y | X, \gamma, g) + \log p(g) + \text{const},

where $p(g)$ is one of the priors listed in Section 2.2, along with its gradient with respect to $g$. The step size $\epsilon$ and the number of leapfrog steps $L$ in each sweep are set using the dual averaging approach of Hoffman and Gelman (2014) as follows. A trajectory length is obtained by preliminary runs of the sampler using a small value of $\epsilon$ (to ensure a small discretization error) and a large value of $L$ (to move far). The dual averaging algorithm uses this trajectory length and adaptively changes $\epsilon$ during the iterations of the complete sampler, in order to achieve a desired rate of acceptance. In our examples we follow the recommendations of Hoffman and Gelman (2014). A reasonable starting value for $\epsilon$ is determined by Algorithm 4 of Hoffman and Gelman (2014).

3.2 Inference

The sampler above produces $J$ draws $\{(\gamma^{[t]}, g^{[t]});\ t = 1, \ldots, J\}$ from the posterior distribution $p(\gamma, g | y)$, which can be used to compute posterior estimates as detailed below.

3.2.1 Variable selection

Variables can be selected using the marginal posteriors $\Pr(\gamma_j = 1 | y)$, which are estimated as

\widehat{\Pr}(\gamma_j = 1 | y) = \frac{1}{J} \sum_{t=1}^{J} \Pr(\gamma_j = 1 | y, \gamma^{[t]}_{-(j,k)}, g^{[t]}).

To evaluate the term in this summation, at Step 2 of the sampler for the single pair $(j, k)$ that contains $j$, the following is computed:

\Pr(\gamma_j = 1 | y, \gamma_{-(j,k)}, g) = \Pr(\gamma_j = 1, \gamma_k = 0 | y, \gamma_{-(j,k)}, g) + \Pr(\gamma_j = 1, \gamma_k = 1 | y, \gamma_{-(j,k)}, g),

where the four values of the bivariate mass function at Eq. (10) are already computed at Step 2 of the sampler.
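In code, this Rao-Blackwellised estimate only needs the four configuration probabilities already returned by the pairwise update sketched in Section 3.1; a minimal illustration (with our own naming) is:

```python
import numpy as np

def inclusion_probability(pair_probs_per_sweep):
    """Rao-Blackwellised estimate of Pr(gamma_j = 1 | y). The argument is a list,
    over sweeps, of the 4-vector of probabilities returned by update_pair() for
    the pair containing covariate j, ordered (0,0), (0,1), (1,0), (1,1) for
    (gamma_j, gamma_k)."""
    probs = np.asarray(pair_probs_per_sweep)
    return np.mean(probs[:, 2] + probs[:, 3])   # the two configurations with gamma_j = 1
```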

3.2.2 Predictive density

In general, direct evaluation of the predictive density of a new observation $y_{n+1}$ with covariates $x_{n+1}$ at Eq. (3) is computationally infeasible because evaluating $c(\cdot | X, \gamma)$ is also. However, the posterior predictive density at Eq. (4) can still be evaluated by conditioning on $\beta_\gamma$ and $g$ as follows. Note that

p(y_{n+1} | y, X, x_{n+1}) = \sum_{\gamma} \int\!\!\int p(y_{n+1} | x_{n+1}, \gamma, g, \beta_\gamma)\, p(\beta_\gamma | y, \gamma, g)\, p(\gamma, g | y)\, d\beta_\gamma\, dg.    (11)

The predictive density inside the integrals above can be obtained by considering a change of variables from $y_{n+1}$ to $\tilde z_{n+1}$, with Jacobian $s_{n+1}\, p_Y(y_{n+1})/\phi(z_{n+1})$. From Eq. (5), the pseudo-responses are independent when conditioning on $\beta_\gamma$ (whereas the elements of $\tilde z$ are dependent unconditional on $\beta_\gamma$). Then, because $\tilde z_{n+1} = s_{n+1}\, \Phi^{-1}(F_Y(y_{n+1}))$,

p(y_{n+1} | x_{n+1}, \gamma, g, \beta_\gamma) = \phi\!\left(s_{n+1} z_{n+1};\ x_{n+1,\gamma}' \beta_\gamma,\ 1\right) s_{n+1} \frac{p_Y(y_{n+1})}{\phi(z_{n+1})},    (12)

where recall that $p_Y$ is a density, $z_{n+1} = \Phi^{-1}(F_Y(y_{n+1}))$, $s_{n+1}^2 = 1 + g\, x_{n+1,\gamma}'(X_\gamma' X_\gamma)^{-1} x_{n+1,\gamma}$, and $x_{n+1,\gamma}$ are the elements of $x_{n+1}$ that correspond to the selected covariates in $\gamma$. Notice that $\sigma$ cancels out in the above computation because it is unidentified in the copula, and plays no role in the predictions.

An expression for the posterior predictive density is obtained by plugging Eq. (12) into Eq. (11). The integrals and summation can be evaluated in the usual Bayesian fashion by averaging over Monte Carlo iterates from the posterior $p(\beta_\gamma, \gamma, g | y)$. However, this requires the additional generation of $\beta_\gamma$ at each sweep of the sampler. A faster approximation that avoids this, and which we have found to be almost as accurate empirically, is to plug in the posterior expectation of $\beta_\gamma$ given by

\hat\beta_\gamma = E(\beta_\gamma | y, \gamma, g) = \frac{g}{1+g} (X_\gamma' X_\gamma)^{-1} X_\gamma' \tilde z, \qquad \tilde z = S z.

Note that the main components required for the evaluation of $\hat\beta_\gamma$ have already been computed at each sweep of the sampler. Thus, a fast and accurate predictive density estimator can be constructed using the Monte Carlo iterates from the sampler as

\hat p(y_{n+1} | y, X, x_{n+1}) = \frac{1}{J} \sum_{t=1}^{J} p\!\left(y_{n+1} | x_{n+1}, \gamma^{[t]}, g^{[t]}, \hat\beta_\gamma^{[t]}\right),    (13)

with $\hat\beta_\gamma^{[t]} = E(\beta_\gamma | y, \gamma^{[t]}, g^{[t]})$.

3.2.3 Regression function estimator

The regression function can be estimated using the posterior predictive mean

E(y_{n+1} | y, X, x_{n+1}) = \sum_{\gamma} \int\!\!\int E(y_{n+1} | x_{n+1}, \gamma, g, \beta_\gamma)\, p(\beta_\gamma | y, \gamma, g)\, p(\gamma, g | y)\, d\beta_\gamma\, dg.    (14)

The expectation in the integrand above can be computed by a change of variables from $y_{n+1}$ to $\tilde z_{n+1}$ and plugging in the expression at Eq. (12) to give

E(y_{n+1} | x_{n+1}, \gamma, g, \beta_\gamma) = \int F_Y^{-1}\!\left(\Phi(\tilde z / s_{n+1})\right) \phi\!\left(\tilde z;\ x_{n+1,\gamma}' \beta_\gamma,\ 1\right) d\tilde z,

where the univariate integral with respect to $\tilde z$ is computed numerically. Evaluation of Eq. (14) then proceeds by plugging in $\hat\beta_\gamma^{[t]}$ and averaging over the Monte Carlo iterates in exactly the same manner as in Section 3.2.2 above.

3.2.4 Bayes factor

Last, we derive the Bayes factor for a pair of hypothesized subsets $\gamma_1$ and $\gamma_2$. For subset $\gamma_k$ ($k = 1, 2$), let $B_{\gamma_k}$ be an upper triangular Cholesky factor of $X_{\gamma_k}' X_{\gamma_k}$; the analogous quantities below are defined for each of the two subsets.

Proposition 1.

The Bayes factor comparing the model with varying dimension parameter $\gamma_1$ with the one with $\gamma_2$ is given by

where we call

the implicit copula coefficient of determination, which (as opposed to the ordinary coefficient of determination) depends on $g$ in the copula model.

Note that the expression for