Bayesian variable selection for multi-dimensional semiparametric regression models

11/30/2017 ∙ by Joseph Antonelli, et al. ∙ 0

Humans are routinely exposed to mixtures of chemical and other environmental factors, making the quantification of health effects associated with environmental mixtures a critical goal for establishing environmental policy sufficiently protective of human health. The quantification of the effects of exposure to an environmental mixture poses several statistical challenges. It is often the case that exposure to multiple pollutants interact with each other to affect an outcome. Further, the exposure-response relationship between an outcome and some exposures, such as some metals, can exhibit complex, nonlinear forms, since some exposures can be beneficial and detrimental at different ranges of exposure. To estimate the health effects of complex mixtures we propose sparse tensor regression, which uses tensor products of marginal basis functions to approximate complex functions. We induce sparsity using multivariate spike and slab priors on the number of exposures that make up the tensor factorization. We allow the number of components required to estimate the health effects of multiple pollutants to be unknown and estimate it from the data. The proposed approach is interpretable, as we can use the posterior probabilities of inclusion to identify pollutants that interact with each other. We illustrate our approach's ability to estimate complex functions using simulated data, and apply our method to a study of the association between exposure to metal mixtures and neurodevelopment.



There are no comments yet.


page 8

page 9

page 10

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The study of the health impacts of environmental mixtures, including both chemical and non-chemical stressors, on human health is a research priority in the environmental health sciences (Carlin et al. , 2013; Taylor et al. , 2016; Braun et al. , 2016). In the past, the majority of health effect studies upon which environmental regulation is based estimate the independent association between an outcome and a given exposure, either marginally or while controlling for co-exposures. Other approaches allow for interactions between exposures, perhaps with some variable selection technique applied, but these approaches are often based on models assuming simple, linear associations between exposures or other strong parametric assumptions. It has been recently shown that in some settings, such as in studies on the impacts of exposure to metal mixtures on neurodevelopment in children, the exposure-response relationship between the multivariate exposure of interest and a health outcome can be a complex, non-linear and non-additive function (Henn et al. , 2010, 2014; Bobb et al. , 2014). Therefore there is a need for statistical approaches that can handle complex interactions among a large number of exposures. Even in settings with a relatively small number of exposures, the number of potential interactions can become large, making some form of variable selection or dimension reduction necessary.

There exists a vast statistical literature on high-dimensional regression models, including approaches to reduce the dimension of the covariate space. The LASSO (Tibshirani, 1996) utilizes an L1 penalty on the absolute value of regression coefficients in a regression model, which effectively selects which variables should be included in the model by setting some coefficients equal to zero. Many approaches have built upon this same methodology (Zou & Hastie, 2005; Zou, 2006)

; however these all typically rely on relatively simple, parametric models. More recently, a number of papers

(Zhao et al. , 2009; Bien et al. , 2013; Hao et al. , 2014; Lim & Hastie, 2015) have adopted these techniques to identify interactions, although these approaches are only applicable to linear models with pairwise interactions. While similar ideas for nonlinear models were used by Radchenko & James (2010), these analyses were restricted to pairwise interactions. Moreover, quantification of estimation uncertainty can be quite challenging in high dimensions.

A number of approaches have been introduced that embed variable selection within nonlinear regression models. Shively et al.  (1999) utilized variable selection in Bayesian nonparametric regression based on an integrated Wiener process for each covariate in the model, and then extended these ideas in Wood et al.  (2002) to non additive models that allowed for interactions between covariates. They utilized the BIC approximation to the marginal likelihood in order to calculate posterior model probabilities, which requires enumerating all potential models. Even when the number of covariates in the model is not prohibitively large, inclusion of just two-way interactions can lead to a massive number of possible models. Reich et al.  (2009) utilized Gaussian process priors for all main effects and two-way interactions within a regression model and performed variable selection in order to assess which covariates are important predictors of the outcome. They do not allow for explicit identification of higher-order interactions, as all remaining interactions are absorbed into a function of the covariates that accounts for higher-order interactions. Yau et al.  (2003) and Scheipl et al.  (2012) utilized variable selection in an additive regression model framework with interaction terms within a Bayesian framework, though one must specify the terms that enter the model a priori and there can be a large number of potential terms to consider. Qamar & Tokdar (2014) used Gaussian processes to separately model subsets of the covariate space and used stochastic search variable selection to identify which covariates should enter into a given Gaussian process function, implying that they interact with each other. While this approach can be used to capture and identify complex interactions, it relies on Gaussian processes and therefore does not scale to larger data sets.

Complex models that allow for nonlinearities and interactions are becoming popular in the machine learning literature, with Gaussian process (or kernel machine) regression

(O’Hagan & Kingman, 1978)

, support vector machines

(Cristianini & Shawe-Taylor, 2000)

, neural networks

(Bengio et al. , 2003), Bayesian additive regression trees (BART, Chipman et al.  (2010)

), and random forests

(Breiman, 2001) being just a few examples. Bobb et al.  (2014)

used Bayesian kernel machine regression to quantify the health effects of an environmental mixture. This approach imposed sparsity within the Gaussian process to flexibly model multiple exposures. While these approaches and their many extensions can provide accurate predictions in the presence of complex relationships among exposures, they typically only provide a variable importance measure for each feature included in the model, and do not allow for formal inference on higher-order effects. An important question remains how to estimate the joint effect of a set of exposures that directly provides inference on interactions, while scaling to larger data sets without requiring a search over such a large space of potential models involving higher-order interactions. Further, full Markov Chain Monte Carlo (MCMC) inference in kernel machine regression does not scale with the sample size, limiting its applicability in large sample settings. While one can potentially use approximate Bayesian methods, such as variational inference to simplify computation, these methods are only approximate and can yield biased estimators for some parameters

(Guo et al. , 2016; Miller et al. , 2016).

In this paper we introduce a Bayesian semiparametric regression approach that addresses several of the aforementioned issues as it allows for nonlinear interactions between variables, identifies interactions between variables, and scales to larger data sets than those handled by traditional Gaussian process regression. We will introduce sparsity via spike and slab priors (Mitchell & Beauchamp, 1988; George & McCulloch, 1993) within a semiparametric regression framework that reduces the dimension of the parameter space, while yielding inference on interactions among individual exposures in the mixture.

2 Environmental mixture data

2.1 Chemical mixtures and neurodevelopment in Bangladesh

To study the health impact of chemical mixtures on neurodevelopment, we examine a study of 375 children in Bangladesh whose mothers were potentially exposed to a variety of chemical pollutants during pregnancy. The outcome of interest is the z-score of the motor composite score (MCS), a summary measure of a child’s psychomotor development derived from the Bayley Scales of Infant and Toddler Development, Third Edition (BSID-III)

(Bayley, 2006). The exposure measures are log-transformed values of Arsenic (As), Manganese (Mn), and Lead (Pb) measured in umbilical cord blood, which is taken to represent prenatal exposure to these metals. As potential confounders of the neurodevelopment - exposure associations, we will also control for the following covariates: gender, age at the time of neurodevelopment assessment, mother’s education, mother’s IQ, an indicator for which clinic the child visited, and the HOME score, which is a proxy for socioeconomic status. Previous studies have examined these data (Bobb et al. , 2014; Valeri et al. , 2017) and found harmful associations between the exposures and the MCS score. Further, these studies have suggested a possible interaction between Arsenic and Manganese, though it was unclear whether this was simply random variation or a true signal.

2.2 Effects of pollutants on telomere length in NHANES

We will utilize data from the 2001-2002 cycle of the National Health and Nutrition Examination Survey (NHANES). The data is a representative sample of the United States and consists of 1003 adults. The outcome of interest is leukocyte telomere length (LTL). Telomeres are segments of DNA at the ends of chromosomes that are used to help protect chromosomes, and their lengths generally decrease with increasing age. In particular, LTL has been used as a proxy for overall telomere length, and has been shown to be associated with a number of adverse health outcomes (Mitro et al. , 2015)

. In the NHANES data, we have measurements from 18 persistent organic pollutants, which consist of 11 polychlorinated biphenyls (PCBs), 3 dioxins, and 4 Furans. It is believed that these environmental exposures could impact telomere length, though it is unclear how telomere length is associated with this particular mixture of exposures. Also in the data are additional covariates for which we need to adjust including age, sex, BMI, education status, race, lymphocyte count, monocyte count, cotinine level, basophil count, eosinophil count, and neutrophil count. We will analyze these data to assess whether any of the aforementioned environmental exposures are associated with LTL, and to search for any interactions between the environmental exposures. This quickly becomes a high-dimensional statistical issue even when looking at two-way interactions only as there are 153 two-way interactions alone. Therefore, this data requires a statistical approach to estimate high-dimensional models that allow for interactions and nonlinearities in the associations between the environmental mixture and LTL.

3 Model formulation

Throughout, we will assume that we observe for , where is the sample size of the observed data, is the outcome, is a -dimensional vector of exposure variables, and is an -dimensional vector of covariates for subject . Building on the construction of Wood (2006), we will assume that we have low rank bases that represent smooth functions for each of the exposures. For instance, the main effect of , and an interaction effect between and could be modeled as follows:


where and represent dimensional basis function expansions of and , respectively. represents a dimensional basis expansion of the interaction between and . We will let and be natural spline basis functions of and

, respectively. For simplicity we will let the degrees of freedom for each exposure all be set to

, though this can easily be extended to allow differing degrees of freedom. Importantly, the construction above enforces that main effect terms are a subset of the terms included when we have an interaction. As we include higher-order interactions, we will impose that all lower level interactions between the exposures being modeled are included in the basis expansion. To ease the presentation of our model, we will use subscripts to denote the order of the interaction to which a coefficient applies. For instance, refers to parameters associated with two-way interaction terms between and (excluding main effect terms), while refers to the main effect parameters for . This process could be trivially extended to any order interaction up to a way interaction, which is the maximum possible interaction in our model. We will use the following fully Bayesian formulation:


where any probability statements are conditional on parameters introduced in later rows. The set is a collection of binary indicators of whether the exposure is included in the component of the model. We use spike and slab priors to effectively eliminate exposures from the model, and in this paper we consider the spike to be a point mass at , while the slab,

, is a multivariate normal distribution centered at

with covariance . The dimension of the spike and slab prior depends on how many exposures are currently in the model, since there are more parameters for higher-order interactions. We therefore let be a diagonal matrix with on the diagonals. In practice we set to a large number that flexibly captures all features in the data, with a subset of the column vectors in

equaling zero. The hyperparameters

and control the amount of sparsity induced by our prior and we discuss them in more detail in Section 3.3. Details of posterior sampling from this model can be found in Appendix A.

Note that each of the functions take the same form. The difference comes from the spike and slab formulation and the structure of the matrix, which ensures that each of the functions will be comprised of a different subset of the exposures. Importantly, the exposures included in one of the functions can not be a subset of the exposures included in another function as this would lead to redundant information in the two functions.

3.1 Identifiability of variable inclusion parameters

Two scientific goals in the study of environmental mixtures are to identify which exposures are associated with the outcome and to identify interactions among exposures. To this end, our prior encodes the condition that, for any two sets of variables and included in a given feature, either and or one of the two is an empty set. These conditions mean that each column vector of the matrix must either be unique and not a subset of another column vector, or be a vector of zeros. An illustrative example of why this condition is necessary is a scenario in which all elements of are zero with the exception of and . In this scenario, the first exposure has a nonzero effect in both the first and second features of the model. Regardless of the values of and , this model could be equivalently expressed as a function of one feature, rather than two. For instance, while represents the same model as , inference regarding would differ across these two models. In the first scenario it would appear that the first covariate is associated with the outcome, even though it is not. Our conditions preclude this from happening by restricting the columns of to be unique, and therefore will provide a meaningful and interpretable measure of the importance of each exposure and potential interaction.

Another issue affecting our model selection indicators is that low-order interactions are subsets of the models defined by higher-order interactions. For example, if the true model is such that there exists a main effect of both and , but no interaction between them, we could mistakenly estimate that there is an interaction between exposure 1 and 2 simply due to the main effects. (Wood, 2006; Reich et al. , 2009) addressed this issue in similar contexts by utilizing constraints that force the two-way interaction between and to only include information that is orthogonal to the main effects of the individual exposures. In our context, this is not so much an issue of identifiability, but rather an issue with MCMC chains getting stuck in local modes. When exploring the model space via MCMC we could accept a move that includes the two-way interaction, which is a local mode, and never be able to leave this model even if the simpler model is the true model.

To avoid this scenario we simply need to update particular parameters jointly when searching the model space. When proposing the inclusion of an exposure in feature that leads to higher-order interactions, we also propose excluding exposure in feature , but including exposure in a different feature with a subset of the exposures already included in feature . For the two exposure case, this can be illustrated as follows.

Imagine that and we seek to update , which indicates whether exposure 2 enters into feature 1. If we do this naively, we simply compare models and , which would result in a choice of even if there were only main effects of and . If, however, we update at the same time, we also evaluate model . If the true relationship is a main effect of the two exposures, then model will be favored by the data and we will estimate accurately. This setting generalizes to higher-order interactions and illustrates how we can avoid misleading conclusions about interactions.

3.2 Selection of spline degrees of freedom

An important tuning parameter in the proposed model is the number of degrees of freedom, , used to flexibly model the effects of the exposures. Throughout we will make the simplifying assumption that the degrees of freedom is the same for all exposures, though this can be relaxed to allow for differing amounts of flexibility across the marginal functions. We propose the use of the WAIC model selection criterion (Watanabe, 2010; Gelman et al. , 2014) to select . If we let index samples from the posterior distribution of the model parameters and let represent all parameters in the model, the WAIC is defined as


The WAIC approximates leave one out cross validation, and therefore provides an assessment of model fit based on out of sample prediction performance. We first draw samples from the posterior distributions of the model parameters under a range of values for , and then select the model that minimizes WAIC. While this is not a fully Bayesian approach to choosing , we will see in Section 5

that this approach still leads to credible intervals with good frequentist properties.

3.3 Prior choice

There are a number of hyperparameters that need to be chosen to complete the model specification. First, one must choose an appropriate value for , which controls the amount of variability and shrinkage of for those coefficients originating from the slab component of the prior. We will make the simplifying assumption that is a diagonal matrix with on the diagonals. It is well known that variable selection is sensitive to the choice of this parameter (Reich et al. , 2009), so it is crucially important to select an appropriate value. We propose the use of an empirical Bayes strategy where we estimate and then obtain posterior samples conditional on this value. A description of the algorithm for finding the empirical Bayes estimate is available in Appendix B. One could alternatively put a hyper prior on for a fully Bayesian analysis, though we have found that the empirical Bayes strategy works better in practice, so we restrict attention to that case.

In addition, one must also select a value for and

, which dictate the prior probability that an exposure has a nonzero association in a given feature. One can choose a value to induce a desired level of sparsity in the model. In practice, the appropriate degree of sparsity is typically not known and therefore, in order to accommodate higher-dimensional data, we choose values that scale with the number of exposures. We will set

, which allows the prior amount of sparsity to increase as the number of exposures increases. This strategy closely resembles other existing approaches in high-dimensional Bayesian models, such as those proposed by Zhou et al.  (2014) and Ročková & George (2016). We will discuss the implications of scaling the prior with in Section 4.

3.4 Lower bound on slab variance

One issue with the aforementioned empirical Bayes estimation of is that it will be estimated to be very small if there is little to no association between the exposures and the outcome. This result can lead to problems in the estimation of

. If the prior variance for the slab is very small, then it is almost indistinguishable from the spike, since in this case it approximates a point mass at zero. In this situation, updating

becomes essentially a coinflip for all and , which will lead to high posterior inclusion probabilities even when there is no true association. To avoid this, we estimate a lower bound, above which we believe that any high posterior inclusion probabilities reflect true signals in the data.

To estimate this lower bound, we first permute the rows of , thereby breaking any associations between the exposures and the outcome. We run our MCMC algorithm for a small number of iterations on the permuted data and keep track of the probability that when for all . We do this for a decreasing grid of potential values for and we take as our lower bound for the smallest value of such that a pre-chosen threshold is met. Examples of such criteria include the smallest value of such that the probability of including a main effect for any of the exposures is less than 0.25 or the smallest value such that the probability of any two-way interaction is less than 0.05. Alternatively, because one can calculate the average false discovery rate for each value of , one can use this strategy to implement Bayesian counterparts to traditional multiple testing adjustments that control frequentist error rates, such as the false discovery rate. It is important to note that this false discovery rate control is conditional on a fixed value of , which we choose to be 0.5. In practice, we let be random and estimated from the data. Nonetheless, this procedure should provide a reasonable lower bound for . If strict control of false discovery rate is required then one could specify a truncated prior for such that it is never greater than the value of used for the calculation of the lower bound (i.e., 0.5). It is important to note that this procedure is very fast computationally, generally taking only a small fraction of the computation time as the full MCMC for one data set.

To illustrate how this approach works for a given analysis, we simulated data with no associations between the exposure and outcome. In this case, the empirical Bayes estimate of will be very small as it is roughly estimated to be the average squared value of the nonzero coefficients in . Figure 1 shows the posterior inclusion probabilities averaged over all exposures in this example as a function of . The gray region represents the area where false positives start to arise because is too small, and we see that the empirical Bayes estimate is contained in this region. If we proceed with the empirical Bayes estimate, then we would conclude that all of the exposures had posterior inclusion probabilities near 0.5, incorrectly indicating they are important for predicting the outcome. Our permutation strategy to identifying a lower bound is able to find a value of that avoids high posterior inclusion probabilities for exposures with null associations with the outcome. In practice, we estimate both the lower bound and empirical Bayes estimate and proceed with the maximum of these two values, which in this simulated case is the lower bound.

Figure 1: Illustration of empirical Bayes estimate and lower bound estimate of

4 Properties of Prior Distributions

A primary goal in many analyses of the health effects of an environmental mixture is to identify interactions among exposures on risk. Therefore, it is important to understand the extent to which the proposed prior specification induces shrinkage on the interaction terms. Both and dictate the amount of shrinkage. We utilize empirical Bayes to select . Therefore, here we focus on the role of , which directly controls the probability that a order interaction is nonzero in the model. The prior for this parameter can be controlled through selection of and . The covariates are exchangeable a priori, so we examine the probability that a order interaction is zero by deriving the probability that the first covariates are included in an interaction term.

Lemma 3.1: Assuming Model 2, the prior probability that there does not exist a feature, in the model such that and is


Appendix C contains a proof of this result, which is useful because it expresses the probability of an interaction of any order being zero as a function of and . This result can guide the user in specifying values of and that best represents prior beliefs. It has been noted in the literature that a desirable feature of the prior is to place more shrinkage on higher-order interactions (Gelman et al. , 2008; Zhou et al. , 2014). Therefore we use (4) to derive conditions that accomplish this type of shrinkage. Specifically, Theorem 3.2 provides a default choice of that specifies the amount of shrinkage on an interaction term to be an increasing function of the order of that term.

Theorem 3.2: Assuming Model 2, such that , then the probability in Lemma 3.1 increases as a function of if

Appendix C contains a proof of this result, which shows that higher-order interactions are more aggressively forced out of the model under either the standard prior, which is often used in high-dimensional Bayesian models (Zhou et al. , 2014; Ročková & George, 2016), or if , which shrinks higher-order interactions even more aggressively than this beta prior. In practice, a common prior choice for is , with being some constant that does not change with the number of exposures. Nonetheless, this result shows that one can achieve stronger penalization of higher-order interaction terms as long as we scale the prior with .

5 Simulation study

Here we present results from a simulation study designed to assess the operating characteristics of the proposed approach in a variety of settings. In all cases we take the maximum of the empirical Bayes and lower bound estimates of . For each simulated data set, we fit the model for values of , and run a Gibbs sampler for 10,000 iterations with two chains, keeping every 8th sample and discarding the first 2,000 as burn-in. We present results for each value of , the value that is selected based on the WAIC, the spike and slab GAM (ssGAM) approach of Scheipl et al.  (2012), and for the Bayesian kernel machine regression (BKMR) approach of Bobb et al.  (2014). We assess the performance of the various models by calculating the mean squared error (MSE) and 95% credible interval coverages of the true relationship between the exposures and the outcome, , at all observed values of the exposure. We also assess the ability of the proposed model to identify nonzero exposure effects and interaction terms.

5.1 Polynomial interactions

First, we generate data with a sample size of , and generate exposures, , from a multivariate normal distribution with all pairwise correlations set to 0.3 and marginal variances set to 1. We generate one additional covariate, , from a standard normal distribution, and we set the residual variance to . For the outcome, we use a polynomial model that contains both a linear and a nonlinear interaction term as follows:


Figure 2 presents the results of the simulation study. The top left panel of Figure 2 shows the matrix of posterior probabilities that a pair of exposures interact with each other. Specifically, cell of this heatmap shows the posterior probability that exposures and interact with each other in the model averaged over all simulations. Results show that the model correctly identifies the pairs and as interacting with each other. The top right panel shows the average 95% credible interval coverage of for each approach considered. Our model based on WAIC achieves the nominal coverage indicating that we are accounting for all sources of uncertainty. The bottom left panel highlights the ability of our procedure to estimate the marginal effect of , which has a quadratic effect on the outcome, while holding constant the values of other exposures. The grey lines, which represent individual simulation estimates, reflect the true relationship, denoted by the red line. Finally, the bottom right panel of Figure 2 presents the mean squared error of the global effect estimates of . Results show that well chosen values of , including the automated choice using WAIC, lead to the best performance among the methods considered.

Figure 2: Results of the quadratic simulation. The top left panel shows the posterior probabilities of exposures interacting with each other for the model selected by the WAIC. The top right panel shows the 95% credible interval coverages of . The bottom left panel shows the estimated marginal functions of on Y for the model selected by the WAIC, where the red line represents the truth and the grey lines represent individual estimates from separate simulations. The bottom right panel shows the MSE in estimating for all models considered.

5.2 High-dimensional data with three-way interaction

Now we present a scenario with and

to highlight that our approach can handle large sample sizes and large exposure dimensions. We will generate the exposures from independent uniform random variables. We generate one additional covariate,

, from a standard normal distribution, and we set the residual variance to . The outcome model is closely related to one that was explored in Qamar & Tokdar (2014) and corresponds to a model that includes a three-way, nonlinear interaction as follows:


It is clear that the relationship between the exposures is nonlinear and non-additive. We do not show results for BKMR due to the sample size limitation from using Gaussian processes. Further, there are a large number of exposures considered here, as there are 4,950 possible two-way interactions and 161,700 possible three-way interactions. For this reason, we do not include ssGAM in this simulation as it requires specifying the form of the linear predictor, and there are too many terms to consider. Our approach considers interactions of any order, though it does not require pre-specification of which terms to consider, and aims to find the correct model among all possible interactions. Figure 3 shows results from this scenario. The top left panel shows the posterior probability of a three-way interaction between and . This probability is 1 for all models except for , which does not capture this interaction. The top right panel shows the 95% credible interval coverages of . We see that the coverage is low for and , which are not sufficiently flexible to capture the true function, while the coverage is low for since it does not capture the three-way interaction. Importantly, the model chosen by WAIC achieves the nominal coverage level. The bottom left panel of Figure 3 shows the estimated marginal effect of on the outcome, fixing the other exposures. Results suggest that in almost all cases the method accurately estimates the true effect. The bottom right panel of Figure 3 shows that the MSE of the model chosen by WAIC is competitive, or better, than any individual choice of , suggesting that WAIC is an effective method for choosing the degrees of freedom for the spline terms.

Figure 3: Results of the high-dimensional simulation. The top left panel shows the posterior probabilities of the 3-way interaction between and for all models. The top right panel shows the 95% credible interval coverages of . The bottom left panel shows the estimated marginal functions of on Y for the model selected by the WAIC, where the red line represents the truth and the grey lines represent individual estimates from separate simulations. The bottom right panel shows the MSE in estimating for all models considered. The MSE for and are too large to be seen on the plot.

6 Analysis of environmental mixtures

6.1 Persistent organic pollutants in NHANES

We first apply our proposed approach to estimate the relationship between 18 persistent organic pollutants and telomere length in the 2001-2002 cycle of the NHANES data. We ran the Markov chain for 40,000 iterations with , , and a non-informative, normal prior on . We applied our approach with degrees of freedom, and we will present the results for as it was the model that minimized the WAIC. We examined the convergence of our MCMC using both trace plots and the potential scale reduction factor (PSR, Gelman et al.  (2014)) in Appendix D, and find that our model has converged. Figure 4 shows the marginal inclusion probabilities for each of the 18 exposures. We see that all of the exposures have a posterior inclusion probability at or near zero with the exception of Furan1, which was included in the model 99% of the time. We do not detect any higher-order interactions, as the posterior inclusion probabilities were exactly zero for all two-way and higher-order interactions. This provides strong evidence of a relationship between Furan1 and telomere length.

Figure 4: Marginal posterior inclusion probabilities for each of the 18 exposures in the NHANES data.

Figure 5 shows the effect of Furan1 on telomere levels while fixing other exposures at their medians. It is important to note that fixing other exposures at their medians does not change the shape of the curve in this instance, because there are no interactions between Furan1 and other exposures. We estimate a positive association between Furan1 exposure and telomere length. The curve is somewhat nonlinear as seen in both figure 5 and the fact that the WAIC chose three degrees of freedom for the basis functions.

Figure 5: The relationship between Furan1 and telomere length, while fixing all the other exposures at their medians.

6.2 Metal mixtures in Bangladesh

Here we analyze data from a study of the impact of metal mixture exposures on neurodevelopment in young Bangladeshi children. We applied our proposed approach to identify the complex relationship between the three metal exposures and neurodevelopment. We ran the Markov chain for 40,000 iterations with , , and a non-informative, normal prior on . We ran the model for values of and present results for , as it provided the smallest WAIC value. Convergence of the MCMC was confirmed using trace plots and PSR values, and these results can be seen in Appendix E.

6.2.1 Posterior inclusion probabilities

One feature of the proposed approach is the ability to examine the posterior distribution of in order to identify whether individual exposures are important for outcome prediction, whether there is evidence of an interaction between two metals or any higher-order interactions. Figure 6 shows the posterior probability of two-way interactions between all metals in the model, and the marginal posterior inclusion probability for each metal exposure. Figure 6 provides strong evidence that each exposure is associated with a child’s MCS score. It further provides strong evidence of an interaction between Mn and As, as well as moderate evidence of an interaction between Pb and As. Not shown in the figure is the posterior probability of a three-way interaction among Pb, Mn, and As, which was estimated to be 0.0005.

Figure 6: The left panel contains posterior inclusion probabilities for two-way interactions, and the right panel shows the marginal inclusion probabilities for each exposure

6.2.2 Visualization of identified interactions

Now that we have identified interactions in the model, we plot posterior predictive distributions to assess the form of these associations. Figure


plots the relationship between Mn and MCS for different values of As, while fixing Pb at its median. At the median level of As, we estimate an inverted U-shaped effect of Mn on MCS scores. This matches results seen previously in a kernel machine regression analysis of these data

Bobb et al.  (2014), which provided graphical evidence of an interaction between Mn and As on the MCS score but didn’t provide a measure of the strength of this interaction.

Figure 7:

Posterior means and pointwise credible intervals of the effect of Mangenese on MCS at different quantiles of Arsenic while fixing Lead at its median.

We also examine surface plots that show the exposure-response relationship between As, Mn, and MCS across the range of As and Mn values, not just specific quantiles. Figure 8

shows both the posterior mean and pointwise standard deviations of this relationship. We see that the median levels of Mn and As are the most beneficial towards children in terms of their MCS score, which corresponds to the hump in the middle panel of Figure

7. This is a somewhat unexpected result given that As is not beneficial even at low levels, and could be due to residual confounding (Valeri et al. , 2017). The posterior standard deviations of this estimate indicate that the uncertainty around this surface is the smallest in the center, where most of the data is observed, and greater on the fringes of the observed data, where we are effectively extrapolating effects.

Figure 8: Posterior means and pointwise credible intervals of the relationship between Mangenese and Arsenic on MCS. The black dots represent the location of the observed data points.

7 Discussion

We have proposed a flexible method for analyzing complex, nonlinear associations between multiple exposures and a continuous outcome. Our method improves upon existing methodology for environmental mixtures in several ways. By not relying on Gaussian processes to introduce flexibility in the exposure-response relationship, our approach can handle larger data sets on the order of tens of thousands of subjects. The spike and slab formulation of our model also allows for formal inference on the presence of interactions among exposures, an important question in environmental epidemiology. Previous studies have used graphical tools and inference on summary statistics characterizing the effects of one pollutant given the levels of others to identify interactions between pollutants. While useful, these approaches do not provide uncertainty assessment of the interaction as a whole. Moreover, it can be difficult to visualize higher-order interactions. We have illustrated the strengths of this approach through simulated data, a study of the health effects of persistent organic pollutants in the NHANES data, and a study of the health effects of metal mixtures on child neurodevelopment in Bangladesh.

While our formulation enables the analysis of much larger data sets, there are some disadvantages relative to fully nonparametric techniques such as Gaussian processes. Our model requires us to specify the functional form of the relationships between the predictors and the outcome a priori. In the current implementation, we employed natural splines to flexibly model these relationships. While other formulations are possible, any formulation would require a subjective choice. Further, one must select the number of degrees of freedom of the splines, and we have restricted this number to be the same for all exposures across all features. This can be a limitation if some of the functions to be estimated are quite smooth while others are more complicated. Our approach is also not fully Bayesian, as we use the WAIC to select the degrees of freedom. However, our simulations show the proposed estimation procedure yields uncertainty estimates that are accurate in all of the scenarios we considered. Future extensions could consider a data-adaptive approach to the selection of the degrees of freedom for each function, or the utilization of other flexible structures that are both computationally fast and do not require pre-specification of the functional forms of the exposure-response functions.


An R package implementing the proposed methodology is available at


This publication was made possible by USEPA grant RD-835872-01. Its contents are solely the responsibility of the grantee and do not necessarily represent the official views of the USEPA. Further, USEPA does not endorse the purchase of any commercial products or services mentioned in the publication. This work was also supported by grants from the National Institutes of Health (ES000002, ES024332, ES007142, ES016454).

Appendix A Posterior computation

Samples from the posterior distributions of all unknown parameters in Model 3.2 can easily be obtained via Gibbs sampling, as all of the full conditional distributions have closed form. The algorithm is as follows:

  1. Sample from the following full conditional:



  2. Sample for from the following full conditional:

  3. To update let us first introduce some additional notation. Let be the subset of we are updating and be the corresponding regression coefficients. Further, let be the set of all parameters in our model excluding and . To update we need to calculate , which we can do via the following:


    where represents the multivariate normal density function. and represent the conditional posterior mean and variance for the elements of corresponding to the elements of that are equal to one and are defined as

    where is a design matrix with the appropriate main effects and interactions induced by the model , and . Intuitively, the numerator in (3) is the prior probability that all elements of corresponding to features in are zero and . The denominator in (3) represents the conditional posterior probability that all elements of corresponding to features in are zero.

  4. Update from a normal distribution with mean and variance as defined above.

  5. Update from the following distribution:


Appendix B Derivation of empirical Bayes variance

As seen in Casella (2001), the Monte Carlo EM algorithm treats the parameters as missing data and iterates between finding the expectation of the “complete data” log likelihood, where expectations are calculated from draws of a gibbs sampler, and maximizing this expression as a function of the hyperparameter values. After removing the terms not involving , we write the “complete data” log likelihood as

where is the prior mean for coefficient , and is the number of nonzero regression parameters. The estimate of at iteration uses the posterior samples from iteration , which were sampled with . Adopting the same notation that is typically used with the EM algorithm, we compute the expectation of this expression, where expectations are taken as the averages over the previous iterations posterior samples:

We then take the derivative of this expression with respect to to find that the maximum occurs at


Intuitively, to obtain the empirical Bayes estimate, one must run an MCMC chain and update every MCMC iterations. This process is continued until the estimate of converges, and then the MCMC can be run conditional on this estimated value.

Appendix C Results about prior shrinkage

Here we investigate the prior probability that a particular interaction term is equal to zero. Our goal is to penalize interactions of higher order more strongly, and the probability that we force them to zero is one component of this strategy. Since the covariates are exchangeable a priori, we just investigate the probability that and .

c.1 Probability increases as increases if

It is of interest how the variable inclusion proabilities vary as a function of , the order of each interaction. Ideally we would want to show that the probability increases as a function of . We don’t expect this to happen in general, but we can find conditions at which it does hold. To simplify expressions with the gamma function we restrict to be an integer, though we expect the results to extend to all real values of .

The expression increasing as a function of is equivalent to decreasing with , so we show that is a decreasing function of under certain conditions.

Let and let us examine .

We want this ratio to be greater than 1 for all for the function to be increasing. Clearly this ratio is getting smaller as gets larger so we can look at the largest value of , which is .

So clearly if we set , we have the desired result.

Appendix D MCMC convergence in the NHANES data

To assess convergence in the NHANES data set we kept track of the potential scale reduction factor (PSR, Gelman et al.  (2014)) for all identifiable parameters in our model. Therefore we examined for , all elements of , and . We found that the PSR value was below 1.005 for all parameters in our model, which leads us to believe that our MCMC has converged. A plot of the PSR values for can be found in Figure 9. Subsequently, we show the trace plots for a subset of parameters in and in Figure 10 and find that our chain appears to be converging.

Figure 9: Histogram of PSR values for in the NHANES data.
Figure 10: Trace plots for and in the NHANES data.

Appendix E MCMC convergence in the Bangladesh data

To assess convergence in the Bangladesh data set we again kept track of the potential scale reduction factor for all identifiable parameters in our model. Therefore we examined for , all elements of , and . We found that the PSR value was below 1.005 for all parameters in our model, which leads us to believe that our MCMC has converged. A plot of the PSR values for can be found in Figure 11. Subsequently, we show the trace plots for all parameters in and in Figure 12 and find that our chain appears to be converging.

Figure 11: Histogram of PSR values for in the Bangladesh data.
Figure 12: Trace plots for and in the Bangladesh data.


  • Bayley (2006) Bayley, Nancy. 2006. Bayley Scales of Infant and Toddler Development-Third Edition: Administration manual.
  • Bengio et al.  (2003) Bengio, Yoshua, Ducharme, Réjean, Vincent, Pascal, & Jauvin, Christian. 2003. A neural probabilistic language model. Journal of machine learning research, 3(Feb), 1137–1155.
  • Bien et al.  (2013) Bien, Jacob, Taylor, Jonathan, & Tibshirani, Robert. 2013. A lasso for hierarchical interactions. Annals of statistics, 41(3), 1111.
  • Bobb et al.  (2014) Bobb, Jennifer F., Valeri, Linda, Claus Henn, Birgit, Christiani, David C., Wright, Robert O., Mazumdar, Maitreyi, Godleski, John J., & Coull, Brent A. 2014. Bayesian kernel machine regression for estimating the health effects of multi-pollutant mixtures. Biostatistics, 16(3), 493–508.
  • Braun et al.  (2016) Braun, Joseph M., Gennings, Chris, Hauser, Russ, & Webster, Thomas F. 2016. What can epidemiological studies tell us about the impact of chemical mixtures on human health? Environmental Health Perspectives, 124(1), A6–A9.
  • Breiman (2001) Breiman, Leo. 2001. Random forests. Machine learning, 45(1), 5–32.
  • Carlin et al.  (2013) Carlin, Danielle J, Rider, Cynthia V, Woychik, Rick, & Birnbaum, Linda S. 2013. Unraveling the health effects of environmental mixtures: an NIEHS priority. Environmental health perspectives, 121(1), a6.
  • Casella (2001) Casella, George. 2001. Empirical bayes gibbs sampling. Biostatistics, 2(4), 485–500.
  • Chipman et al.  (2010) Chipman, Hugh A, George, Edward I, McCulloch, Robert E, et al. . 2010. BART: Bayesian additive regression trees. The Annals of Applied Statistics, 4(1), 266–298.
  • Cristianini & Shawe-Taylor (2000) Cristianini, Nello, & Shawe-Taylor, John. 2000. An introduction to support vector machines and other kernel-based learning methods. Cambridge university press.
  • Gelman et al.  (2008) Gelman, Andrew, Jakulin, Aleks, Pittau, Maria Grazia, & Su, Yu-Sung. 2008. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 1360–1383.
  • Gelman et al.  (2014) Gelman, Andrew, Carlin, John B, Stern, Hal S, & Rubin, Donald B. 2014. Bayesian data analysis. Vol. 2. Chapman & Hall/CRC Boca Raton, FL, USA.
  • George & McCulloch (1993) George, Edward I, & McCulloch, Robert E. 1993. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423), 881–889.
  • Guo et al.  (2016) Guo, Fangjian, Wang, Xiangyu, Fan, Kai, Broderick, Tamara, & Dunson, David B. 2016. Boosting Variational Inference. arXiv preprint arXiv:1611.05559.
  • Hao et al.  (2014) Hao, Ning, Feng, Yang, & Zhang, Hao Helen. 2014. Model Selection for High Dimensional Quadratic Regression via Regularization. 85721, 26.
  • Henn et al.  (2010) Henn, Birgit Claus, Ettinger, Adrienne S, Schwartz, Joel, Téllez-Rojo, Martha María, Lamadrid-Figueroa, Héctor, Hernández-Avila, Mauricio, Schnaas, Lourdes, Amarasiriwardena, Chitra, Bellinger, David C, Hu, Howard, et al. . 2010. Early postnatal blood manganese levels and children’s neurodevelopment. Epidemiology (Cambridge, Mass.), 21(4), 433.
  • Henn et al.  (2014) Henn, Birgit Claus, Coull, Brent A, & Wright, Robert O. 2014. Chemical mixtures and children’s health. Current opinion in pediatrics, 26(2), 223.
  • Lim & Hastie (2015) Lim, Michael, & Hastie, Trevor. 2015. Learning interactions via hierarchical group-lasso regularization. Journal of Computational and Graphical Statistics, 24(3), 627–654.
  • Miller et al.  (2016) Miller, Andrew C, Foti, Nicholas, & Adams, Ryan P. 2016. Variational Boosting: Iteratively Refining Posterior Approximations. arXiv preprint arXiv:1611.06585.
  • Mitchell & Beauchamp (1988) Mitchell, Toby J, & Beauchamp, John J. 1988.

    Bayesian variable selection in linear regression.

    Journal of the American Statistical Association, 83(404), 1023–1032.
  • Mitro et al.  (2015) Mitro, Susanna D, Birnbaum, Linda S, Needham, Belinda L, & Zota, Ami R. 2015. Cross-sectional associations between exposure to persistent organic pollutants and leukocyte telomere length among US adults in NHANES, 2001–2002. Environmental health perspectives, 124(5), 651–658.
  • O’Hagan & Kingman (1978) O’Hagan, Anthony, & Kingman, JFC. 1978. Curve fitting and optimal design for prediction. Journal of the Royal Statistical Society. Series B (Methodological), 1–42.
  • Qamar & Tokdar (2014) Qamar, Shaan, & Tokdar, ST. 2014. Additive Gaussian Process Regression. arXiv preprint arXiv:1411.7009,  28.
  • Radchenko & James (2010) Radchenko, Peter, & James, Gareth M. 2010. Variable selection using adaptive nonlinear interaction structures in high dimensions. Journal of the American Statistical Association, 105(492), 1541–1553.
  • Reich et al.  (2009) Reich, Brian J, Storlie, Curtis B, & Bondell, Howard D. 2009. Variable selection in Bayesian smoothing spline ANOVA models: Application to deterministic computer codes. Technometrics, 51(2), 110–120.
  • Ročková & George (2016) Ročková, Veronika, & George, Edward I. 2016. The spike-and-slab lasso. Journal of the American Statistical Association.
  • Scheipl et al.  (2012) Scheipl, Fabian, Fahrmeir, Ludwig, & Kneib, Thomas. 2012. Spike-and-slab priors for function selection in structured additive regression models. Journal of the American Statistical Association, 107(500), 1518–1532.
  • Shively et al.  (1999) Shively, Thomas S, Kohn, Robert, & Wood, Sally. 1999. Variable selection and function estimation in additive nonparametric regression using a data-based prior. Journal of the American Statistical Association, 94(447), 777–794.
  • Taylor et al.  (2016) Taylor, Kyla W, Joubert, Bonnie R, Braun, Joe M, Dilworth, Caroline, Gennings, Chris, Hauser, Russ, Heindel, Jerry J, Rider, Cynthia V, Webster, Thomas F, & Carlin, Danielle J. 2016. Statistical approaches for assessing health effects of environmental chemical mixtures in epidemiology: lessons from an innovative workshop. Environmental health perspectives, 124(12), A227.
  • Tibshirani (1996) Tibshirani, Robert. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.
  • Valeri et al.  (2017) Valeri, Linda, Mazumdar, Maitreyi M, Bobb, Jennifer F, Henn, Birgit Claus, Rodrigues, Ema, Sharif, Omar IA, Kile, Molly L, Quamruzzaman, Quazi, Afroz, Sakila, Golam, Mostafa, et al. . 2017. The joint effect of prenatal exposure to metal mixtures on neurodevelopmental outcomes at 20–40 months of age: Evidence from rural Bangladesh. Environmental health perspectives, 125(6).
  • Watanabe (2010) Watanabe, Sumio. 2010. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11(Dec), 3571–3594.
  • Wood et al.  (2002) Wood, Sally, Kohn, Robert, Shively, Tom, & Jiang, Wenxin. 2002. Model selection in spline nonparametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(1), 119–139.
  • Wood (2006) Wood, Simon N. 2006. Low-Rank Scale-Invariant Tensor Product Smooths for Generalized Additive Mixed Models. Biometrics, 62(4), 1025–1036.
  • Yau et al.  (2003) Yau, Paul, Kohn, Robert, & Wood, Sally. 2003. Bayesian variable selection and model averaging in high-dimensional multinomial nonparametric regression. Journal of Computational and Graphical Statistics, 12(1), 23–54.
  • Zhao et al.  (2009) Zhao, Peng, Rocha, Guilherme, & Yu, Bin. 2009. The composite absolute penalties family for grouped and hierarchical variable selection. The Annals of Statistics, 3468–3497.
  • Zhou et al.  (2014) Zhou, Jing, Bhattacharya, Anirban, Herring, Amy H., & Dunson, David B. 2014. Bayesian factorizations of big sparse tensors. Journal of the American Statistical Association, 1459(1994), 00–00.
  • Zou (2006) Zou, Hui. 2006. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476), 1418–1429.
  • Zou & Hastie (2005) Zou, Hui, & Hastie, Trevor. 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.