Assessing Bayes factor surfaces using interactive visualization and computer surrogate modeling

09/14/2018 ∙ by Christopher T. Franck, et al. ∙ 0

Bayesian model selection provides a natural alternative to classical hypothesis testing based on p-values. While many papers mention that Bayesian model selection is frequently sensitive to prior specification on the parameters, there are few practical strategies to assess and report this sensitivity. This article has two goals. First, we aim educate the broader statistical community about the extent of potential sensitivity through visualization of the Bayes factor surface. The Bayes factor surface shows the value a Bayes factor takes (usually on the log scale) as a function of user-specified hyperparameters. We provide interactive visualization through an R shiny application that allows the user to explore sensitivity in Bayes factor over a range of hyperparameter settings in a familiar regression setting. We compare the surface with three automatic procedures. Second, we suggest surrogate modeling via Gaussian processes (GPs) to visualize the Bayes factor surface in situations where computation of Bayes factors is expensive. That is, we treat Bayes factor calculation as a computer simulation experiment. In this context, we provide a fully reproducible example using accessible GP libraries to augment an important study of the influence of outliers in empirical finance. We suggest Bayes factor surfaces are valuable for scientific reporting since they (i) increase transparency, making potential instability in Bayes factors easy to visualize, (ii) generalize to simple and more complicated examples, and (iii) provide a path for researchers to assess the impact of prior choice on modeling decisions in a wide variety research areas.



There are no comments yet.


page 16

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In the current scientific landscape, multiple concerns surrounding the use of classical

-values for null hypothesis significance testing have been raised, see e.g.

Ioannidis (2005), Boos and Stefanski (2011), Young and Karr (2011), Trafimow and Marks (2015), and Wasserstein et al. (2016). Despite these concerns, interest in hypothesis testing persists. Bayesian model selection enables direct probabalistic statements about the plausibility of competing models and is thus a natural alternative to null hypothesis testing based on -values. However, Bayesian model selection also has drawbacks. Specifically, model selection from the Bayesian perspective is typically quite sensitive to prior choice on parameters, even ones deep within a hierarchy. Scientific reporting can be improved if these sensitivities are more broadly recognized and easily reported among practicing statisticians and the wider research community.

In the statistical literature, the sensitivity to hyperparameters (parameters to the priors on parameters) has been mentioned previously, see e.g. Berger (1985), Kass and Raftery (1995), Berger and Pericchi (1996), O’Hagan (1995), Chipman et al. (2001), and Yao et al. (2018). Yet, it remains difficult for practitioners who are interested in using Bayesian model selection in their research to understand the degree of sensitivity they might expect in their specific analysis plan. A researcher hopes their Bayesian model selection “will be relatively insensitive to reasonable choices [of hyperparameters]” (Berger, 1985, p. 146) but worries that “… such hypothesis tests can pose severe diffiulties in the Bayesian inferential paradigm where the Bayes factors may exhibit undesirable properties unless parameter prior distributions are chosen with exquisite care” (Ormerod et al., 2017).

There is currently a dearth of practical strategies to help researchers systematically assess and report sensitivity in Bayesian model selection. In order for an approach to be useful for this purpose, it must (i) be simple to implement, not requiring extensive re-conceptualization in different scenarios, (ii) be efficient in instances where computing Bayes factors is computationally expensive, (iii) readily reveal the extent to which Bayes factors are sensitive to prior specification and facilitate the reporting of potential sensitivity.

With the above requirements in mind, we propose systematic assessment of the Bayes factor surface with the aid of modern libraries and visualization tools. We define the Bayes factor surface as the value Bayes factors (usually on the log scale) take as a function of the hyperparameters in the model. Bayes factor surfaces are easy to visualize, require only the ability to form a Bayes factor in order to produce, can be compared with available automatic procedures which appear as planes in Bayes factor surface visualizations. We propose approximating Bayes factor surfaces using computer surrogate modeling (e.g., Santner et al., 2003) if computation of a given Bayes factor is expensive. Importantly, Bayes factor surfaces inform their audience how inferences on competing models might vary as a function of prior belief about the parameters. We suspect that the instability revealed by Bayes factor surfaces might be considerably higher than many practicing statisticians realize, thus revealing a potential barrier for the greater adoption of Bayesian model selection in data analysis.

Figure 1: Scatter plot (left panel) and Bayes factor surface contour plot (right panel). Contour plot shows Bayes factor comparing and hypotheses on natural log. Contours correspond to strength of evidence according to Kass and Raftery (1995).

Figure 1

shows the dilemma. Consider a simple linear regression problem where the researcher wishes to test whether the slope parameter

is equal to zero or not. The data in the left panel were simulated with

, error variance

, and sample size ). The classical -value for the slope is 0.0021, below the recently proposed criterion to declare statistical significance for new discoveries (Benjamin et al., 2018). The contour plot shows the Bayes factor (i.e., the Bayes factor surface) comparing and hypotheses. The - and -axes of the contour plot represent prior hyperparameters—user chosen inputs that encode prior belief. (We define these in detail in Section 3.1.) Blue favors and red favors . It is clear that the prior belief that the researcher brings to this analysis plays a major role in the conclusion. Decreasing prior precision (

-axis) yields equivocal evidence about hypotheses despite the apparent trend in the data. Precise prior beliefs near the least squares estimate yield Bayes factors “strongly” in favor of a non-zero slope

(according to the scale of evidence proposed by Kass and Raftery, 1995). Incorrect but precise prior beliefs (i.e., red in lower right corner) provide very strong evidence in favor of .

Figure 1 is, at first blush, rather discouraging. Least squares regression is among the simplest data analytic approaches in any analyst’s toolbox. Even in the simplest of cases, it is clear that priors on parameters must be handled with care. The good news is that within the class of linear models, several well-calibrated automatic approaches are available for Bayes factors, which we review later. In more complicated settings such as the problem described in Section 3.2, automatic procedures for model selection are usually unavailable (although Fonseca et al. (2008) is a good starting point to develop an automatic procedure for this example). In these latter cases, the Bayes factor surface can play an important role in scientific reporting.

A thorough overview of Bayes factors can be found in Kass and Raftery (1995). A useful guide for implementation of Bayesian model selection is Chipman et al. (2001). The remainder of this article is organized as follows. Section 2 reviews the general formulation of Bayesian model selection via Bayes factors and develops the idea of Bayes factor surfaces. Section 3 covers two examples of varying complexity. The first example includes an interactive R shiny (Chang et al., 2017) application hyperlink

that allows the user to explore Bayes factor surfaces in real time. The second example addresses the common situation where Bayes factor approximation is computationally intensive, and often a “noisy” enterprise owing to the stochastic nature of approximation via Markov chain Monte Carlo (MCMC). Therein we propose a space-filling design in the hyperparamter space, and surrogate modeling with Gaussian process models to emulate the Bayes factor surface. Modern libraries in

R, like hetGP (Binois and Gramacy, 2018) on CRAN, cope well with both mean and variance features common in Bayes factor surfaces. Our fully reproducible illustration, with code provided in the supplementary material, is designed to be easily ported to Bayes factor calculations in novel contexts. Discussion is included in Section 4.

2 Bayesian model selection overview

Let represent observed data. Let and represent two competing models (or hypotheses), and let and

represent vectors of parameters associated with each of

and , respectively. Index models and parameter vectors with . Let the distribution of the data given the parameters (i.e., the likelihood normalized as a proper distribution) be denoted .

The marginal likelihood for model (denoted

) arises when parameters are integrated out of the joint distribution of the parameters and data, i.e.,


The marginal likelihood plays a central role in Bayesian model comparison, selection, and averaging. The Bayes factor is the usual metric to compare and and is the ratio of those models’ marginal likelihoods:


In this work we consider Bayes factors based on parametric Bayesian models. Historically, scales of interpretation (Jeffreys, 1961; Kass and Raftery, 1995) have been used to describe strength of evidence in favor of over where large values of suggest that observed data support more than

. However, using scales in this manner implicitly assumes prior model probabilities

, which is the unique setting where the Bayes factor describes the ratio of how much more probable is compared with

. In general, interpreting Bayes factors as the weight of evidence in favor of one hypothesis over another is not strictly justified. Technically, Bayes factors summarize the extent to which the observed data update prior model odds to posterior model odds. In our discussion we will assume equal model priors but the reader can adopt other model priors and interpret the Bayes factor as a multiplicative factor which updates prior belief about models. For a more complete discussion of this and related issues, see

Lavine and Schervish (1999).

The notation for marginal likelihoods shown in Equation (1) is remarkably consistent across the statistical literature. Perhaps for brevity, this notation omits reference to the parameters that have been integrated, which facilitates an illusion that does not depend on the priors assigned to . In the remainder of this paper we show that choices about the priors on can exert strong influence on Bayes factors.

The first famous discussion about the sensitivity to priors for Bayesian model selection arose as the Jeffreys-Lindley-Bartlett paradox (Jeffreys, 1935, 1961; Bartlett, 1957; Lindley, 1957). The issue (which is not technically a paradox) was described (Lindley, 1957) in terms of a one parameter point null testing problem in which frequentist -values strongly favor rejecting the null while Bayes factors provide strong support for the null. This simple example illustrates that as the prior on the parameter being tested becomes vague, the Bayes factor increasingly favors the simpler point-null hypothesis. See Ormerod et al. (2017) for thorough discussion and numerous references to the “paradox” in the literature. In addition to the effects of prior vagueness, Bayes factor surfaces allow for the visualization of the effects of multiple hyperparameters simultaneously.

The methods described thus far apply when only two models are under consideration. When models are of interest, usually a series of Bayes factors are produced, where a specific “base” model is chosen, and a Bayes factor is formed comparing each other model to the base model. In cases such as these, the researcher may wish to examine Bayes factor surfaces corresponding to each of the Bayes factors that are considered.

2.1 Bayes factor surfaces

The method we propose in this paper is simple. We suggest producing a graphical display of the log Bayes factor as a function of hyperparameters of interest for inclusion in scientific reports. Basically, enumerate a grid in the hyperparameter space, collect Bayes factor estimates under each setting, and stretch over an appropriate mesh—off-loading the heavy visual lifting, projections, slices, etc., to any number of existing rich visualization suites. One drawback to this approach is that numerical integration to obtain marginal likelihoods (1) is frequently expensive. When Monte Carlo integration is used the output is frequently noisy. Although that noise can be reduced with further computation (i.e., longer MCMC chains), convergence is typically not uniform in the hyperparameter space. For a fixed simulation effort in all hyperparameter settings, the level of noise in the output Bayes factor may change. In other words, the Bayes factor response surface can be heteroskedastic. In situations such as these, we propose treating Bayes factor calculation as a computer simulation experiment: space-filling design in the hyperparameter space and surrogate modeling via Gaussian processes in lieu of direct inspection. We illustrate a modern library called hetGP which copes with input dependent variance and leads to nice surrogate Bayes factor surface visualizations in our examples.

The simplicity of the approach is a virtue. The core idea does not need to be reconceptualized in different settings and could be explored in any situation where Bayes factors are available and concern about sensitivity is present. For example, a Bayes factor surface plot could accompany statistical reporting as a method to indicate to reviewers and the broader audience whether the conclusions with respect to model selection are driven heavily by the specific priors chosen for . As a corollary this approach also fosters transparency since the audience can see how model selection would differ under other hyperparameter settings.

3 Examples

3.1 Ordinary regression

The first example comes from the ordinary regression setting with model


where is the -intercept, is the slope, ,

represents fixed known data from a predictor variable, and

is the outcome. We assume , where the error precision is and is the usual error variance. Specifying error in terms of precision is mathematically convenient for Bayesian analysis. Following Equation (2) and the color scale in Figure 1, under we have and under implies . The remaining model specifications for and share the following features:


The parameters in each model are vectors and . It has been shown previously that putting a flat prior on the intercept implicitly centers the and about their respective means during the computation of the marginal likelihood (Chipman et al., 2001). Denote and .

Choice of hyperparameters , , , and enable the researcher to incorporate prior information about parameters into the analysis. In the absence of specific prior information, the priors on and are frequently taken to be vague yet proper. For example, a researcher may center at zero and choose a small prior precision (i.e., large prior variance) and similar for they may choose to center at 1 and make the prior variance large. This is in an attempt to impart minimal information to the inference from the prior. The improper flat intercept contributes an unspecified constant to the marginal likelihood in each model which cancels once the Bayes factor is formed.

The included R shiny (Chang et al., 2017) application hyperlink

allows users to entertain various settings of hyperprior, sample size, and true underlying effect size

in order to explore how these affect the Bayes factor surface. Experimentation with this application illustrates that hyperparameters for parameters common to both models (e.g., and specified for ) tend to have much less influence on Bayes factors than hyperparameters for parameters that are being tested (e.g., and specified for ).

Under the simpler hypothesis and (57) the marginal likelihood has a convenient closed-form solution.


As noted in Section 2, the standard notation for marginal likelihoods, as in Eq. (8), omits reference to due to its integration. Yet, obviously specific choices about hyperparamters ( and in this case) associated with appear in the marginal likelihood. The form of prior densities also influences the marginal likelihood in a less visible manner. Even though standard notation tends to obscure this fact, prior choice on parameters directly influences marginal likelihoods and thus Bayesian model selection. Across the universe of possible Bayesian model selection endeavors, there are no ironclad guarantees that increasing diminishes the influence of hyperparameters.

Under the in Eqs. (47), integration of and can be handled analytically, but integration of requires numerical approximation.


Of course, adopting the priors (4-6) is only one possible strategy in a universe of available Bayesian priors. In this study we will also consider three specific automatic procedures to obtain Bayes factors that are readily available for regression. The term “automatic” is used in Bayesian procedures to describe cases where the researcher does not have to specify priors subjectively. We explore Zellner–Siow mixture -priors (Liang et al., 2008), Bayes factor approximation via Bayesian information Criterion (Schwarz, 1978, See also Kass and Raftery 1995), and Fractional Bayes factors (O’Hagan, 1995), each of which is briefly summarized next. As the reader probably expects, these automatic procedures are each defensible individually, yet do not necessarily agree and could potentially lead to different conclusions in different situations.

Zellner–Siow mixture -prior: The Zellner–Siow mixture -prior was introduced (Liang et al., 2008) as an automatic Bayesian specification to allow selection among nested ordinary linear models. Previous work (Zellner and Siow, 1980)

has shown that a multuivariate Cauchy distribution on regression coefficients satisfies basic model consistency requirements.

Liang et al. (2008) demonstrates that an inverse gamma with shape and rate parameters of and respectively, placed on the parameter of Zellner’s -prior (Zellner, 1986) induces Cauchy tails on the prior of while only requiring a one dimensional approximation to the integral for the marginal likelihood. The Appendix includes a brief description of the Bayesian model for the Zellner–Siow mixture -prior.

Approximation by BIC: Asymptotically, where is the usual Bayesian information criterion (Schwarz, 1978) for model . Further details relating BIC to approximate Bayesian model selection can be found in Kass and Raftery (1995).

Fractional Bayes factor: The final automatic method we consider is the fractional Bayes factor (O’Hagan, 1995). Fractional Bayes factors are a variant of partial Bayes factors (Berger and Pericchi, 1996) which reserve a training sample from the full data, update the prior using the training sample, then form a Bayes factor using the remainder of the data in the likelihood. A central appeal of the partial Bayes factor approach is that parameters with noninformative priors (which are frequently improper) can be tested in an essentially automatic fashion. While intrinsic Bayes factors (Berger and Pericchi, 1996) are obtained by averaging over many partial Bayes factors, Fractional Bayes factors obviate the need to choose a specific training sample through a clever approximation of the “partial” likelihood and are thus computationally less expensive than intrinsic Bayes factors. In this approach we adopt noninformative improper priors and compute a Bayes factor surface using fractional Bayes factors. More detail can be found in the Appendix.

R shiny application

To accompany this paper and allow the reader to interactively explore Bayes factor surfaces in the regression setting, we have produced a R shiny (Chang et al., 2017) application. The app allows the user to set sample size, parameter values and choose hyperparameters in Equations (3-7). The application provides 3 dimensional rotating Bayes factor surface plots with user options to superimpose planes corresponding to each of the three automatic procedures described. Figure 2 shows the interface with this specific output. Additionally, the application also provides scatter plots contour plots such as in Figure 1 and density plots for the priors. The specific surface visualized corresponds to a true , where user rotation reveals the same phenomnenon as Figure 1: strong evidence in favor of the correct model is achieved when the prior on the slope is precice and centered near the least squares estimate. As prior precision on the slope decreases, the Bayes factor reduces (i.e. Jeffreys-Lindley-Bartlett phenomenon), and precise but misguided prior beliefs about the slope incorrectly suggest the plausibility that . Each of the three automatic procedures provide strong evidence of a non-zero slope, where the Fractional Bayes factor plane (not shown) is between the BIC approximation and mixture g prior.

Figure 2: Screenshot of R shiny application. The application allows users to simulate regression data with varying slope and sample size to explore Bayes factor surfaces under various hyperparameter settings. The application includes 3d rotatable surfaces, contour plots, scatter plots, and prior densities. Above, the red lines show the surface at and . The grey plane corresponds to , i.e. a BF such that prior and posterior odds are equal (for a uniform model prior). Users can superimpose planes corresponding to the three automatic procedures, including mixture g prior (green), the BIC approximation to BF(blue), and the fractional Bayes factor (not shown). These and additional graphics and analyses can be interactively explored using the accompanying shiny app hosted insert hyperlink.

3.2 Surrogate modeling the BF surface

In the simple regression setting of Section 3.1, computation is cheap and many viable automatic procedures are available. In many other settings, obtaining marginal likelihoods is a delicate and computationally expensive task. For cases such as these, we propose that an adequate analogue of the visualizations provided above can be facilitated by emulating the Bayes factor surface in the hyperparameter space of interest. In other words, we propose treating expensive Bayes factor calculations, via MCMC say, as a (stochastic) computer simulation experiment. Bayes Factor calculations at a space-filling design in the (input) hyperparameter space can be used to map out the space and better understand the sensitivity of model selection to those settings.

As motivation, consider an experiment described by Gramacy and Pantaleo (2010, Section 3.3–3.4) involving Bayes factor calculations to determine if data are leptokurtic (Student- errors) or not (simply Gaussian). The context of that experiment is portfolio balancing with historical asset return data of varying length. Highly unbalanced history lengths necessitated a modeling in the Cholesky-space, through a cascade of least-squares style regressions decomposing a joint multivariate normal via conditionals. To address this, Gramacy and Pantaleo proposed Bayesian modernization of methods first introduced by Andersen (1957) in some generality, and Stambaugh (1997) for portfolio balancing. Gramacy and Pantaleo entertained modeling variations for those regressions including double-exponential (lasso), Gaussian (ridge), normal-gamma (NG), and horseshoe priors for the regression coefficients, potentially under a reversible-jump (RJ) scheme for variable selection. Their monomvn package (Gramacy, 2017) on CRAN provided RJ-MCMC inference for the regression subroutines, and subsequently for Andersen’s reconstruction of the resulting joint MVN distributions of the asset returns, which are funneled through quadratic programs for Markowitz (1959)-style portfolio balancing. With the proposed methods coming online just after the 2008 financial crisis, the authors were particularly interested in the appropriateness of the prevailing MVN modeling setup—theirs being one example—in the presence of data which may contain outliers, or otherwise heavy tails.

To accommodate robust inference in the face out outliers, Gramacy and Pantaleo tacked the latent-variable-based Student- sampler from Geweke (1992, 1993) onto their inferential framework. Next, to determine if the data warranted heavy-tailed modeling they further augmented the software with a Bayes factor calculation in the style Jacquier et al. (2004, Section 2.5.1), leveraging that Student- (St) and normal () models differ by just one parameter in the likelihood:

, the degrees of freedom. In such cases, one can write the BF as the expectation of the ratio of un-normalized posteriors with respect to the posterior under the bigger (St) model. That is,


where and collects the parameters shared by both models. A subsequent simulation study, however, cast doubt on whether or not one could reliably detect heavy tails (when indeed they were present) in situations where the generating -value was of moderate size, say . It is relevant to ask to what extent their conclusions are impacted by their hyperparameter choices, particularly on the prior for which they took to be . Their intention was to be diffuse, but ultimately they lacked an appropriate framework for studying sensitivity to this choice.

Consider the following warm-up experiment toward a more comprehensive analysis. We set up a grid of hyperparameter values in , evenly spaced in space from to 10m spanning “solidly Student-” (even Cauchy) to “essentially Gaussian” in terms of the mean of the prior over For each on the grid we ran the RJ-MCMC to approximate by feeding sample likelihood evaluations provided by monomvn’s blasso function through (10). The training data was generated following a setup identical to the one described in Gramacy and Pantaleo (2010), involving 200 observations randomly drawn from a linear model in seven input dimensions, with three of the coefficients being zero, and . Details are provided by our fully reproducible R implementation provided in the supplementary material. In order to understand the Monte Carlo variability in those calculations, ten replicates of the BFs under each hyperparameter setting were collected.

Figure 3: Log Bayes factors for varying hyperparameter prior settings in . The -axis () is shown in space. A predictive surface from hetTP is overlayed via mean and 95% error-bars.

The results are shown in Figure 3. Each open circle is a evaluation, plotted in space. Observe that outputs on the -axis span the full range of evidence classes (Kass and Raftery, 1995) for the hypothesis of Student- errors over the Gaussian alternative. When is small, the Student- is essentially a foregone conclusion; whereas if is large the Gaussian is. The data, via likelihood through to the posterior, is providing very little in the form of adjudication between these alternatives. An explanation is that the model is so flexible, the posterior is happy to attribute residual variability—what we know to be outliers from heavy tails (because we generated the data)—to mean variability via settings of the linear regression coefficients, . A seemingly innocuous hyperparameter setting is essentially determining the outcome of a model selection enterprise.

Each evaluation, utilizing MCMC samples, takes about 36 minutes to obtain on a 4.2 GHz Intel Core i7 processor, leading to a total runtime of about 120 hours to collect all 200 values used in the figure. Although that burden is tolerable (and perhaps we could have made due with fewer evaluations), extending to higher dimensions is problematic. Suppose we wanted to entertain , where the case reduces to above. If we tried to have a similarly dense grid, the runtime would balloon to 100 days, which is clearly unreasonable. Rather, we propose treating the

calculation as a computer experiment: build a surrogate model from a more limited space-filling design, and use the resulting posterior predictive surface to understand variability in Bayes factors in the hyperparameter space. Before providing an example, the simpler experiment Figure

3 points to some potential challenges. The surface is heteroskedastic, even after log transform, and may itself have heavy tails.

Fortunately, stochastic computer simulations with heteroskedastic errors are popping up all over the literature and some good libraries have recently been developed to cope. The hetGP package on CRAN (Binois and Gramacy, 2018), based on the work of Binois et al. (2018) with Student- extensions (Chung et al., 2018), is easy to deploy in our context. The predictive surface from a so-called hetTP surrogate, a fitted heteroskedastic Student- process, is overlayed on the figure via mean and 95% predictive intervals. Although this fitted surface doesn’t add much to the visualization here, it’s analog is essential in higher dimensions.

As an illustration, lets return now to . Our supplementary material provides a code which evaluates BFs on a space-filling design in -space, via a Latin hypercube sample (McKay et al., 1979) of size 40, using a recently updated version of the monomvn library to accommodate the Gamma prior. Five replicates are obtained at each input setting, for a total of 200 runs. I.e., at similar computational expense compared to the earlier Exp experiment despite the higher dimension.

Figure 4: Log Bayes factors for varying hyperparameter prior settings and in . The -axis () and -axis are both shown in space. A predictive surface from hetTP

is overlayed via mean (left) and standard deviation (right). The contours come from

Kass and Raftery (1995).

Figure 4 shows the outcome of that experiment via a fitted hetTP surface: mean on the left and standard deviation on the right. The numbers overlayed on the figure are the average obtained for the five replicates at each input location. Observe that the story here is much the same as before in terms of , which is maps to in the earlier experiment, especially nearby (i.e., ) where the equivalence is exact. The left panel shows that along that slice you can get just about whatever conclusion you want. Smaller values tell a somewhat more nuanced story, however. A rather large range of smaller values lead to somewhat less sensitivity in the outcome due to the hyperparameter setting, except when is quite large. It would appear that it is essential to have a small setting if the data are going to have any influence on model selection via BF. The right panel shows that the variance surface is indeed changing over the input space, justifying the heteroskedastic surrogate.

4 Discussion

The instability of Bayes factors as a function of prior choices on parameters has been commented on in the statistical literature that develops Bayesian model comparison and selection. While the issue has been frequently mentioned, less attention has been spent developing approaches to help researchers understand the extent of sensitivity in their own specific analyses. The two-fold purpose of this article has been to educate the broader statistical community about sensitivity in Bayes factors through interactive visualization, and to propose surrogate modeling in the form of computer surrogate models to examine the Bayes factor surface in cases where computing Bayes factors is too expensive to accomplish extensively. We hope these contributions will provide researchers the insight and tools necessary to asses sensitivity in Bayes factors for their own specific research interests.

Much has been said and written about the downside of the near ubiquitous use of -values for hypothesis testing across many branches of science. While it is unclear whether Bayesian model selection can or should ultimately supplant use of -values on a broad scale, illustrating these sensitivities helps researchers use Bayes factors and Bayesian model selection responsibly. Further, we hope availability of Bayes factor surface plots will help journal reviewers assess these sensitivities and prevent statistical malpractice (intentional or otherwise) through exploitation of these sensitivities. Many of the concerns about -values relate specifically to the practice of setting hard thresholds to declare results statistically significant or not (see, e.g., Lakens et al., 2018; Benjamin et al., 2018). Similar thresholding of Bayesian metrics surely will not serve as a panacea to these concerns.

Whereas research papers used to include visuals of prior to posterior updating at the parameter level, and trace plots from MCMC chains to indicate good mixing, these are now passé. If they are not cut entirely, they are frequently relegated to an appendix. The extra transparency such visuals provide are apparently no longer of value to the literature, perhaps because audiences have matured and readers/referees are now better able to sniff out troublesome modeling and inference choices through other, less direct, means. Yet a similar setup is sorely needed for model selection. Helpful visuals are not well-established and, as we show in our examples, it is easy to gain a false sense of security from the outcome of an experiment for which further analysis (e.g., visualization of a Bayes factor surface) reveals was actually balanced on a knife’s edge. Studies involving Bayesian model selection desperately need greater transparency.

We acknowledge that when many parameters are being tested simultaneously, it will be harder to visualize a single Bayes factor surface and multiple plots might be necessary. Individual surfaces can be helpful in facilitating lower-dimensional slices and projections. However, the quality of that fitted surrogate will depend on the input design, with space-filling becoming harder (as a function of a fixed design size) as the input dimension increases. An attractive option here is sequential design. Binois et al. (2018) show how an integrated mean-square prediction error criteria can be used to dynamically select design sites and degrees of replication in order for focus sampling on parts of the input space where the signal (e.g., in evaluations via (10), say) is harder to extract from the noise. Treating the Bayes factor surface surrogate as a sequential experiment in that context represents a promising avenue for further research. While this may seem potentially burdensome, we believe that the availability of Bayes factor surfaces will remain valuable. We advise researchers to prioritize surfaces that address parameters that differ between the hypotheses. Typically, hyperparameters for quantities common to both models affect Bayes factors to a much lesser degree than hyperparameters on the quantities being tested. We encourage users to use the accompanying R shiny application to explore the comparatively large impact of changing and (both related to the parameter that is being tested) compared with the smaller influence and have, as these latter hypeparameters govern the error precision which is common to both models.


  • Andersen (1957) Andersen, T. (1957, June).

    Maximum likelihood estimates for a multivariate normal distribution when some observations are missing.

    J. of the American Statistical Association 52, 200–203.
  • Bartlett (1957) Bartlett, M. S. (1957). A comment on D. V. Lindley’s statistical paradox. Biometrika 44(3-4), 533–534.
  • Benjamin et al. (2018) Benjamin, D. J., J. O. Berger, M. Johannesson, B. A. Nosek, E.-J. Wagenmakers, R. Berk, K. A. Bollen, B. Brembs, L. Brown, C. Camerer, D. Cesarini, C. D. Chambers, M. Clyde, T. D. Cook, P. De Boeck, Z. Dienes, A. Dreber, K. Easwaran, C. Efferson, E. Fehr, F. Fidler, A. P. Field, M. Forster, E. I. George, R. Gonzalez, S. Goodman, E. Green, D. P. Green, A. G. Greenwald, J. D. Hadfield, L. V. Hedges, L. Held, T. Hua Ho, H. Hoijtink, D. J. Hruschka, K. Imai, G. Imbens, J. P. A. Ioannidis, M. Jeon, J. H. Jones, M. Kirchler, D. Laibson, J. List, R. Little, A. Lupia, E. Machery, S. E. Maxwell, M. McCarthy, D. A. Moore, S. L. Morgan, M. Munafó, S. Nakagawa, B. Nyhan, T. H. Parker, L. Pericchi, M. Perugini, J. Rouder, J. Rousseau, V. Savalei, F. D. Schönbrodt, T. Sellke, B. Sinclair, D. Tingley, T. Van Zandt, S. Vazire, D. J. Watts, C. Winship, R. L. Wolpert, Y. Xie, C. Young, J. Zinman, and V. E. Johnson (2018, January). Redefine statistical significance. Nature Human Behaviour 2(1), 6–10.
  • Berger (1985) Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis, second edition. Springer.
  • Berger and Pericchi (1996) Berger, J. O. and L. R. Pericchi (1996). The intrinsic bayes factor for model selection and prediction. Journal of the American Statistical Association 91(433), 109–122.
  • Binois and Gramacy (2018) Binois, M. and R. B. Gramacy (2018). hetGP: Heteroskedastic Gaussian Process Modeling and Design under Replication. R package version 1.0.3.
  • Binois et al. (2018) Binois, M., R. B. Gramacy, and M. Ludkovski (2018). Practical heteroskedastic Gaussian process modeling for large simulation experiments. Journal of Computational and Graphical Statistics to appear. arXiv preprint arXiv:1611.05902.
  • Binois et al. (2018) Binois, M., J. Huang, R. B. Gramacy, and M. Ludkovski (2018). Replication or exploration? Sequential design for stochastic simulation experiments. Technometrics to appear. arXiv preprint arXiv:1710.03206.
  • Boos and Stefanski (2011) Boos, D. D. and L. A. Stefanski (2011). P-value precision and reproducibility. The American Statistician 65(4), 213–221.
  • Chang et al. (2017) Chang, W., J. Cheng, J. Allaire, Y. Xie, and J. McPherson (2017). shiny: Web Application Framework for R. R package version 1.0.5.
  • Chipman et al. (2001) Chipman, H., E. I. George, and R. E. McCulloch (2001). The Practical Implementation of Bayesian Model Selection, Volume Volume 38 of Lecture Notes–Monograph Series, pp. 65–116. Beachwood, OH: Institute of Mathematical Statistics.
  • Chung et al. (2018) Chung, M., R. B. Gramacy, M. Binois, D. Moquin, A. Smith, and A. Smith (2018). Parameter and uncertainty estimation for dynamical systems using surrogate stochastic processes. Technical report, Virginia Tech. preprint on arXiv:1802.00852.
  • Fonseca et al. (2008) Fonseca, T. C. O., M. A. R. Ferreira, and H. S. Migon (2008). Objective bayesian analysis for the student-t regression model. Biometrika 95(2), 325–333.
  • Geweke (1992) Geweke, J. (1992). Priors for microeconomic times series and their application. Technical Report Institute of Empirical Macroeconomics Discussion Paper No.64, Federal Reserve Bank of Minneapolis.
  • Geweke (1993) Geweke, J. (1993, Dec). Bayesian treatment of the independent student– linear model. J. of Applied Econometrics Vol. 8, Supplement: Special Issue on Econometric Inference Using Simulation Techniques, S19–S40.
  • Gramacy (2017) Gramacy, R. B. (2017). monomvn: Estimation for Multivariate Normal and Student- Data with Monotone Missingness. R package version 1.9-7.
  • Gramacy and Pantaleo (2010) Gramacy, R. B. and E. Pantaleo (2010, 06). Shrinkage regression for multivariate inference with missing data, and an application to portfolio balancing. Bayesian Anal. 5(2), 237–262.
  • Ioannidis (2005) Ioannidis, J. P. A. (2005, 08). Why most published research findings are false. PLOS Medicine 2(8).
  • Jacquier et al. (2004) Jacquier, E., N. Polson, and P. E. Rossi (2004). Bayesian analysis of stochastic volatility models with fat-tails and correlated errors. J. of Econometrics 122, 185–212.
  • Jeffreys (1935) Jeffreys, H. (1935). Some tests of significance, treated by the theory of probability. Mathematical Proceedings of the Cambridge Philosophical Society 31(2), 203–222.
  • Jeffreys (1961) Jeffreys, H. (1961). The Theory of Probability, Volume 2 of Oxford Classic Texts in the Physical Sciences. Oxford University Press.
  • Kass and Raftery (1995) Kass, R. E. and A. E. Raftery (1995). Bayes factors. Journal of the American Statistical Association 90(430), 773–795.
  • Lakens et al. (2018) Lakens, D., F. G. Adolfi, C. J. Albers, F. Anvari, M. A. J. Apps, S. E. Argamon, T. Baguley, R. B. Becker, S. D. Benning, D. E. Bradford, E. M. Buchanan, A. R. Caldwell, B. Van Calster, R. Carlsson, S.-C. Chen, B. Chung, L. J. Colling, G. S. Collins, Z. Crook, E. S. Cross, S. Daniels, H. Danielsson, L. DeBruine, D. J. Dunleavy, B. D. Earp, M. I. Feist, J. D. Ferrell, J. G. Field, N. W. Fox, A. Friesen, C. Gomes, M. Gonzalez-Marquez, J. A. Grange, A. P. Grieve, R. Guggenberger, J. Grist, A.-L. van Harmelen, F. Hasselman, K. D. Hochard, M. R. Hoffarth, N. P. Holmes, M. Ingre, P. M. Isager, H. K. Isotalus, C. Johansson, K. Juszczyk, D. A. Kenny, A. A. Khalil, B. Konat, J. Lao, E. G. Larsen, G. M. A. Lodder, J. Lukavský, C. R. Madan, D. Manheim, S. R. Martin, A. E. Martin, D. G. Mayo, R. J. McCarthy, K. McConway, C. McFarland, A. Q. X. Nio, G. Nilsonne, C. L. de Oliveira, J.-J. O. de Xivry, S. Parsons, G. Pfuhl, K. A. Quinn, J. J. Sakon, S. A. Saribay, I. K. Schneider, M. Selvaraju, Z. Sjoerds, S. G. Smith, T. Smits, J. R. Spies, V. Sreekumar, C. N. Steltenpohl, N. Stenhouse, W. Świątkowski, M. A. Vadillo, M. A. L. M. Van Assen, M. N. Williams, S. E. Williams, D. R. Williams, T. Yarkoni, I. Ziano, and R. A. Zwaan (2018, March). Justify your alpha. Nature Human Behaviour 2(3), 168–171.
  • Lavine and Schervish (1999) Lavine, M. and M. J. Schervish (1999). Bayes factors: What they are and what they are not. The American Statistician 53(2), 119–122.
  • Liang et al. (2008) Liang, F., R. Paulo, G. Molina, M. A. Clyde, and J. O. Berger (2008). Mixtures of g priors for Bayesian variable selection. Journal of the American Statistical Association 103(481), 410–423.
  • Lindley (1957) Lindley, D. V. (1957). A statistical paradox. Biometrika 44(1/2), 187–192.
  • Markowitz (1959) Markowitz, H. (1959). Portfolio Selection: Efficient Diversification of Investments. New York: John Wiley.
  • McKay et al. (1979) McKay, M. D., W. J. Conover, and R. J. Beckman (1979). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 21, 239–245.
  • O’Hagan (1995) O’Hagan, A. (1995). Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society. Series B (Methodological) 57(1), 99–138.
  • Ormerod et al. (2017) Ormerod, J. T., M. Stewart, W. Yu, and S. E. Romanes (2017). Bayesian hypothesis tests with diffuse priors: Can we have our cake and eat it too? arXiv preprint arXiv:1710.09146.
  • Santner et al. (2003) Santner, T. J., B. J. Williams, and W. I. Notz (2003). The Design and Analysis of Computer Experiments. New York, NY: Springer-Verlag.
  • Schwarz (1978) Schwarz, G. (1978, 03). Estimating the dimension of a model. Ann. Statist. 6(2), 461–464.
  • Stambaugh (1997) Stambaugh, R. F. (1997). Analyzing investments whose histories differ in lengh. J. of Financial Economics 45, 285–331.
  • Trafimow and Marks (2015) Trafimow, D. and M. Marks (2015). Editorial. Basic and Applied Social Psychology 37(1), 1–2.
  • Wasserstein et al. (2016) Wasserstein, R. L., N. A. Lazar, et al. (2016). The asa’s statement on p-values: context, process, and purpose. The American Statistician 70(2), 129–133.
  • Yao et al. (2018) Yao, Y., A. Vehtari, D. Simpson, and A. Gelman (2018). Using stacking to average bayesian predictive distributions. Advance publication.
  • Young and Karr (2011) Young, S. S. and A. Karr (2011). Deming, data and observational studies. Significance 8(3), 116–120.
  • Zellner (1986) Zellner, A. (1986).

    On assessing prior distributions and bayesian regression analysis with g-prior distributions.

    In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, Chapter 15, pp. 233–243. New York: Elsevier Science Publishers, Inc.
  • Zellner and Siow (1980) Zellner, A. and A. Siow (1980, Feb). Posterior odds ratios for selected regression hypotheses. Trabajos de Estadistica Y de Investigacion Operativa 31(1), 585–603.

5 Appendix

The Bayesian model for the mixture -prior comes from Liang et al. (2008):

The marginal likelihood can be obtained by analytically integrating , , and , then approximating the integral over . As in the original paper, we use Laplace approximation for this last integral.

Fractional Bayes factors: Let denote a training sample size, the sample size, and . The likelihood raised to approximates the likelihood of a training sample, a fact used to form fractional Bayes factors. Fractional Bayes factors are appealing since there is no need to choose an arbitrary training sample or spend computational resources averaging over some or all possible training samples to obtain an intrinsic Bayes factor (Berger and Pericchi, 1996). The prior distribution for the fractional Bayes factor analysis is the reference prior:

where is a constant. This prior is improper, though it yields a proper posterior distribution for inference on parameters and is eligible for model selection due to the availability of fractional Bayes factors. The minimal training sample in this study is . Further details about fractional Bayes factors are left to the literature.