Hypothesis testing is the primary tool for statistical inference across much of the biological and behavioral sciences. As such, most scientists are trained in classical null hypothesis significance testing (NHST). The scenario for testing a hypothesis is likely familiar to most readers of this journal. Suppose one wants to test a specific research hypothesis (e.g., some treatment has an effect on some outcome measure). NHST works by first assuming a null hypothesis (e.g., the treatment has no effect) and then computing some test statistic for a sample of data. This sample test statistic is then compared to a hypothetical distribution of test statistics that would arise if the null hypothesis were true. If the sample’s test statistic is in the tail of the distribution (that is, it would occur with low probability under the null hypothesis), the scientist decides to reject the null hypothesis in favor of the alternative hypothesis. Further, the $p$-value, which indicates how surprising the sample would be if the null hypothesis were true, is often taken as a measure of evidence: the lower the $p$-value, the stronger the evidence.
While orthodox across many disciplines, NHST has drawn philosophical criticism (Wagenmakers, 2007). Also, the $p$-value is prone to misinterpretation (Gigerenzer, 2004; Hoekstra et al., 2014). Finally, NHST is suited to providing support for the alternative hypothesis, but the procedure does not work when one wants to measure support for the null hypothesis. That is, we can reject the null, but we cannot accept the null. To overcome this limitation, we can use an alternative method for testing hypotheses that is based on Bayesian inference: the Bayes factor.
I.1 The Bayes factor
Bayesian inference is a method of measurement that is based on the computation of $p(H \mid D)$, which is called the posterior probability of a hypothesis $H$, given data $D$. Bayes’ theorem casts this probability as

$$p(H \mid D) = \frac{p(D \mid H)\, p(H)}{p(D)}. \tag{1}$$

One may think of Equation 1 in the following manner: before observing data $D$, one assigns a prior probability $p(H)$ to hypothesis $H$. After observing data, one can then update this prior probability to a posterior probability $p(H \mid D)$ by multiplying the prior by the likelihood $p(D \mid H)$. This product is then rescaled to a probability distribution (i.e., total probability = 1) by dividing by the marginal probability $p(D)$.
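Bayes’ theorem provides a natural way to test hypotheses. Suppose we have two competing hypotheses: an alternative hypothesis $H_1$ and a null hypothesis $H_0$. We can directly compare the posterior probabilities of $H_1$ and $H_0$ by computing their ratio; that is, we can compute the posterior odds in favor of $H_1$ over $H_0$ as $p(H_1 \mid D) / p(H_0 \mid D)$. Using Bayes’ theorem (Equation 1), and noting that the marginal probability $p(D)$ cancels in the ratio, it is easy to see that

$$\frac{p(H_1 \mid D)}{p(H_0 \mid D)} = \frac{p(D \mid H_1)}{p(D \mid H_0)} \cdot \frac{p(H_1)}{p(H_0)}. \tag{2}$$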
This equation can also be interpreted in terms of the “updating” metaphor that was explained above. Specifically, the posterior odds are equal to the prior odds multiplied by an updating factor. This updating factor is equal to the ratio of the likelihoods $p(D \mid H_1)$ and $p(D \mid H_0)$, and is called the Bayes factor (Jeffreys, 1961). Intuitively, the Bayes factor can be interpreted as the weight of evidence provided by a set of data $D$. For example, suppose that one assigned the prior odds of $H_1$ and $H_0$ equal to 1; that is, $H_1$ and $H_0$ are a priori assumed to be equally likely. Then, suppose that after observing data $D$, the Bayes factor was computed to be 10. Now, the posterior odds (the odds of $H_1$ over $H_0$ after observing data) are 10:1 in favor of $H_1$ over $H_0$. As such, the Bayes factor provides an easily interpretable measure of the evidence in favor of $H_1$.
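The updating step described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and converting odds to a probability assumes that the two hypotheses are the only ones under consideration.

```python
def posterior_odds(prior_odds: float, bayes_factor: float) -> float:
    """Update prior odds to posterior odds: posterior = Bayes factor x prior."""
    return bayes_factor * prior_odds


def odds_to_probability(odds: float) -> float:
    """Convert odds for H1 over H0 into the probability of H1.

    Assumes H1 and H0 are the only two hypotheses being compared.
    """
    return odds / (1.0 + odds)


# Values from the example in the text: equal prior odds and a Bayes factor of 10.
post_odds = posterior_odds(prior_odds=1.0, bayes_factor=10.0)
prob_h1 = odds_to_probability(post_odds)
print(f"posterior odds = {post_odds:g}:1, P(H1 | D) = {prob_h1:.3f}")
```

With equal prior odds, the posterior odds equal the Bayes factor itself, which is why the Bayes factor is often reported on its own.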
In order to help with interpreting Bayes factors, various classification schemes have been proposed. One simple scheme is a four-way classification proposed by Raftery (1995), where Bayes factors between 1 and 3 are considered weak evidence, those between 3 and 20 positive evidence, those between 20 and 150 strong evidence, and those beyond 150 very strong evidence.
Note that in the discussion above, there was no specific assumption about the order in which we addressed $H_1$ and $H_0$. If instead we wanted to assess the weight of evidence in favor of $H_0$ over $H_1$, Equation 2 could simply be adjusted by taking reciprocals. As such, implied direction is important when computing Bayes factors, so one must be careful to define notation when representing Bayes factors. A common convention is to define $BF_{10}$ as the Bayes factor for $H_1$ over $H_0$; similarly, $BF_{01}$ would represent the Bayes factor for $H_0$ over $H_1$. Note that $BF_{01} = 1 / BF_{10}$.
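The classification scheme of Raftery (1995) and the reciprocal relationship between the two directions of a Bayes factor can be sketched together as follows. The function name is hypothetical, and the treatment of values that fall exactly on a boundary (3, 20, or 150) is an assumption, since the scheme as described does not say which category they belong to.

```python
def classify_bayes_factor(bf: float) -> tuple[str, str]:
    """Return (favored hypothesis, evidence label) for a Bayes factor.

    A Bayes factor below 1 favors the hypothesis in the denominator, so it is
    inverted before the label is assigned. Boundary values are assigned to the
    lower category here, which is an assumption.
    """
    if bf <= 0:
        raise ValueError("A Bayes factor must be positive.")
    favored = "numerator" if bf >= 1 else "denominator"
    strength = bf if bf >= 1 else 1.0 / bf
    if strength <= 3:
        label = "weak"
    elif strength <= 20:
        label = "positive"
    elif strength <= 150:
        label = "strong"
    else:
        label = "very strong"
    return favored, label


bf10 = 10.0          # Bayes factor for H1 over H0
bf01 = 1.0 / bf10    # the same evidence, expressed for H0 over H1
print(classify_bayes_factor(bf10))
print(classify_bayes_factor(bf01))
```

Both calls report the same strength of evidence; only the favored hypothesis changes, which is why the direction of a reported Bayes factor must always be stated.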
In summary, the Bayes factor provides an index of preference for one hypothesis over another that has some advantages over NHST. First, the Bayes factor tells us by how much a data sample should update our belief in one hypothesis over a competing one. Second, though NHST does not allow one to accept a null hypothesis, doing so within a Bayesian framework makes perfect sense. Given these advantages, it may be surprising that Bayesian inference has not been used more often in the empirical sciences. One reason for the lack of more widespread adoption may be that Bayes factors are quite difficult to compute. We tackle this issue in the next section.
II Computing Bayes factors
As an example, suppose we are interested in computing the Bayes factor for a null hypothesis $H_0$ over an alternative hypothesis $H_1$, given data $D$. Recall from Equation 2 that this Bayes factor (denoted $BF_{01}$) is equal to

$$BF_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)}.$$
While this equation may seem conceptually quite simple, it is computationally much more difficult. This is because in order to compute the numerator and denominator, one must parameterize the hypotheses (or models, to be more clear), and then each likelihood is computed by conditioning over all possible parameter values and summing over this set. Since these potential parameter values are often over a continuous parameter space, this computation requires integration, and thus the formula for the Bayes factor amounts to

$$BF_{01} = \frac{\int_{\Theta_0} p(D \mid \theta_0, H_0)\, \pi_0(\theta_0)\, d\theta_0}{\int_{\Theta_1} p(D \mid \theta_1, H_1)\, \pi_1(\theta_1)\, d\theta_1}, \tag{3}$$

where $\Theta_0$ and $\Theta_1$ are the parameter spaces for models $H_0$ and $H_1$, respectively, and $\pi_0$ and $\pi_1$ are the prior probability density functions of the parameters of $H_0$ and $H_1$, respectively.
Thus, in order to compute $BF_{01}$, one must specify the priors $\pi_0$ and $\pi_1$ for $H_0$ and $H_1$. Further, the integrals usually do not have closed-form solutions, so numerical integration techniques are necessary. These requirements make computing the Bayes factor inaccessible to all but the most ardent researchers with more than a modest amount of mathematical training.
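To make the integration step concrete, here is a minimal sketch for a deliberately simple case that is not taken from the paper: binomial data with a point null ($\theta = 0.5$) against an alternative with a uniform prior on $\theta$. The example, the function names, and the midpoint-rule integrator are all illustrative assumptions. This case happens to have a closed-form marginal likelihood, $1/(n+1)$, which gives a check on the numerical result.

```python
from math import comb


def binomial_likelihood(k: int, n: int, theta: float) -> float:
    """Probability of k successes in n trials given success rate theta."""
    return comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


def marginal_likelihood_uniform(k: int, n: int, grid_size: int = 10_000) -> float:
    """Integrate likelihood x prior over theta in (0, 1) with the midpoint rule.

    The prior density is uniform, so it equals 1 everywhere on (0, 1).
    """
    width = 1.0 / grid_size
    total = 0.0
    for i in range(grid_size):
        theta = (i + 0.5) * width
        prior_density = 1.0
        total += binomial_likelihood(k, n, theta) * prior_density * width
    return total


def bf01_binomial(k: int, n: int, theta0: float = 0.5) -> float:
    """Bayes factor for the point null theta = theta0 over the uniform alternative."""
    # Under a point null there is nothing to integrate: the marginal
    # likelihood is the likelihood evaluated at theta0.
    return binomial_likelihood(k, n, theta0) / marginal_likelihood_uniform(k, n)


# Hypothetical data: 12 successes in 20 trials.
k, n = 12, 20
print("numerical marginal likelihood:", marginal_likelihood_uniform(k, n))
print("closed-form marginal likelihood:", 1.0 / (n + 1))
print("BF01:", bf01_binomial(k, n))
```

For realistic models the parameter space has several dimensions and the prior is not flat, so a one-dimensional grid like this one is no longer adequate.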
Fortunately, there is an increasing number of solutions that avoid a direct encounter with computations of the above type. Recently, researchers have proposed default priors for standard experimental designs such as $t$-tests (Rouder et al., 2009; Morey and Rouder, 2011) and ANOVA (Rouder et al., 2012). These default priors are implemented in software packages such as the R package BayesFactor (Morey and Rouder, 2015), and as such, have provided a user-friendly method for researchers to compute Bayes factors without the computational overhead needed in Equation 3. While these software solutions work quite well for computing Bayes factors from raw data, they are a bit limited in the following context. Suppose that in the course of reading some published literature, a researcher comes across a result that is presented as “nonsignificant” and is reported only as a test statistic with its associated $p$-value. In an NHST context, this nonsignificant result does not provide evidence for the null hypothesis; rather, it just implies that we cannot reject the null. A natural question would be what, if any, support does this result provide for the null hypothesis? Of course, a Bayes factor would be useful here, but without the raw data, we cannot use the previously mentioned software solutions. To this end, it would be advantageous if there were some easy way to compute a Bayes factor directly from the reported test statistic.
It turns out that this computation is indeed possible, at least in certain cases. In the following, I will show how one particular method for computing Bayes factors (the BIC approximation; Raftery, 1995) can be adapted to solve this problem, thus allowing researchers to compute approximate Bayes factors from summary statistics alone (with no need for raw data). Further, I will show through simulations that this method compares well to the default Bayes factors for ANOVA developed by Rouder et al. (2012).
III The BIC approximation of the Bayes factor
Wagenmakers (2007) demonstrated a method (based on earlier work by Raftery, 1995) for computing approximate Bayes factors using the BIC (Bayesian Information Criterion). For a given model $H_i$, the BIC is defined as

$$\mathrm{BIC}(H_i) = -2 \log L_i + k_i \log n,$$

where $n$ is the number of observations, $k_i$ is the number of free parameters of model $H_i$, and $L_i$ is the maximum likelihood for model $H_i$. He then showed that the Bayes factor for $H_0$ over $H_1$ can be approximated as

$$BF_{01} \approx \exp\left(\Delta \mathrm{BIC}_{10} / 2\right), \tag{4}$$

where $\Delta \mathrm{BIC}_{10} = \mathrm{BIC}(H_1) - \mathrm{BIC}(H_0)$. Further, Wagenmakers (2007) showed that when comparing an alternative hypothesis $H_1$ to a null hypothesis $H_0$,

$$\Delta \mathrm{BIC}_{10} = n \log\left(\frac{SSE_1}{SSE_0}\right) + (k_1 - k_0) \log n. \tag{5}$$

In this equation, $SSE_1$ and $SSE_0$ represent the sum of squares for the error terms in models $H_1$ and $H_0$, respectively. Both Wagenmakers (2007) and Masson (2011) give excellent examples of how to use this approximation to compute Bayes factors, assuming one is given information about $SSE_0$ and $SSE_1$, as is the case with most statistical software. However, we will now consider the situation where one is given the statistical summary (i.e., $F(df_1, df_2)$), but not the ANOVA output.
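When the error sums of squares are available, the approximation amounts to two short computations. The sketch below assumes the standard forms of the two equations, $\Delta \mathrm{BIC}_{10} = n \log(SSE_1/SSE_0) + (k_1 - k_0)\log n$ and $BF_{01} \approx \exp(\Delta \mathrm{BIC}_{10}/2)$; the function names and the numerical inputs are hypothetical.

```python
from math import exp, log


def delta_bic_10(sse1: float, sse0: float, n: int, k1: int, k0: int) -> float:
    """BIC(H1) - BIC(H0), computed from error sums of squares."""
    return n * log(sse1 / sse0) + (k1 - k0) * log(n)


def bf01_from_sse(sse1: float, sse0: float, n: int, k1: int, k0: int) -> float:
    """Approximate Bayes factor for H0 over H1 from the BIC difference."""
    return exp(delta_bic_10(sse1, sse0, n, k1, k0) / 2.0)


# Hypothetical ANOVA output: 40 observations, one extra parameter under H1.
bf01 = bf01_from_sse(sse1=100.0, sse0=120.0, n=40, k1=2, k0=1)
print(f"BF01 = {bf01:.3f}, BF10 = {1.0 / bf01:.3f}")
```

A negative $\Delta \mathrm{BIC}_{10}$ means that the alternative model has the lower BIC, so $BF_{01}$ falls below 1 and the data favor $H_1$.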
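Suppose we wish to examine an effect of some independent variable with associated $F$-ratio $F(df_1, df_2)$, where $df_1$ represents the degrees of freedom associated with the manipulation, and $df_2$ represents the degrees of freedom associated with the error term. Then, $F = \dfrac{SS_1 / df_1}{SS_2 / df_2} = \dfrac{SS_1}{SS_2} \cdot \dfrac{df_2}{df_1}$, where $SS_1$ and $SS_2$ are the sums of squares associated with the manipulation and the error term, respectively.

From Equation 5, we see that

$$\Delta \mathrm{BIC}_{10} = n \log\left(\frac{SS_2}{SS_1 + SS_2}\right) + df_1 \log n.$$

This equality holds because $SSE_1$ represents the sum of squares that is not explained by $H_1$, which is simply $SS_2$ (the error term). Similarly, $SSE_0$ is the sum of squares not explained by $H_0$, which is the sum of $SS_1$ and $SS_2$ (see Wagenmakers, 2007, p. 799). Finally, in the context of comparing $H_1$ and $H_0$ in an ANOVA design, we have $k_1 - k_0 = df_1$. Now, we can use algebra to re-express $\Delta \mathrm{BIC}_{10}$ in terms of $F$:

$$\Delta \mathrm{BIC}_{10} = n \log\left(\frac{1}{\frac{SS_1}{SS_2} + 1}\right) + df_1 \log n = n \log\left(\frac{1}{\frac{F\, df_1}{df_2} + 1}\right) + df_1 \log n = n \log\left(\frac{df_2}{F\, df_1 + df_2}\right) + df_1 \log n.$$

Substituting this into Equation 4, we can compute:

$$BF_{01} \approx \exp\left[\frac{1}{2}\left(n \log\left(\frac{df_2}{F\, df_1 + df_2}\right) + df_1 \log n\right)\right] = \left(\frac{df_2}{F\, df_1 + df_2}\right)^{n/2} n^{df_1/2} = \sqrt{\frac{n^{df_1}\, df_2^{\,n}}{(F\, df_1 + df_2)^{n}}}.$$

Rearranging this last expression slightly yields the approximation:

$$BF_{01} \approx \sqrt{n^{df_1} \left(1 + \frac{F\, df_1}{df_2}\right)^{-n}}. \tag{6}$$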
Practically speaking, the approximation given in Equation 6 offers nothing new over the previous formulations of the BIC approximation given in Wagenmakers (2007) and Masson (2011). However, it does have two advantages over these previous formulations. First, one can directly take reported ANOVA statistics (e.g., sample size, degrees of freedom, and the $F$-ratio) and compute $BF_{01}$ without having to compute $SSE_1$ or $SSE_0$. We should note that Masson (2011) correctly mentions that $SSE_1 / SSE_0 = 1 - \eta_p^2$, so if a paper reports $\eta_p^2$, the need for computing $SSE_1$ and $SSE_0$ is nullified. Second, the method of Masson (2011) is still essentially a two-step process; one first computes $\Delta \mathrm{BIC}_{10}$, which in turn is used to compute $BF_{01}$. In contrast, the expression derived in Equation 6 is a one-step process that can easily be implemented using a scientific calculator or a simple spreadsheet.
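As a sketch of the one-step computation, the function below takes the form of Equation 6 to be $BF_{01} \approx \sqrt{n^{df_1}(1 + F\,df_1/df_2)^{-n}}$ and evaluates it on the log scale to avoid overflow for large $n$. The function names and the ANOVA summary are hypothetical; the second function follows the two-step route through $\Delta \mathrm{BIC}_{10}$ so that the two can be compared.

```python
from math import exp, log, log1p


def bf01_from_f(f: float, df1: int, df2: int, n: int) -> float:
    """One-step approximation: sqrt(n^df1 * (1 + F*df1/df2)^(-n))."""
    log_bf01 = 0.5 * (df1 * log(n) - n * log1p(f * df1 / df2))
    return exp(log_bf01)


def bf01_from_sums_of_squares(ss1: float, ss2: float, df1: int, n: int) -> float:
    """Two-step route: compute Delta BIC_10 first, then exponentiate half of it."""
    delta_bic_10 = n * log(ss2 / (ss1 + ss2)) + df1 * log(n)
    return exp(delta_bic_10 / 2.0)


# Hypothetical ANOVA summary: effect and error sums of squares, df, sample size.
ss1, ss2, df1, df2, n = 10.0, 150.0, 1, 30, 32
f_ratio = (ss1 / df1) / (ss2 / df2)

one_step = bf01_from_f(f_ratio, df1, df2, n)
two_step = bf01_from_sums_of_squares(ss1, ss2, df1, n)
print(f"F = {f_ratio:g}, one-step BF01 = {one_step:.4f}, two-step BF01 = {two_step:.4f}")
```

The two routes agree up to floating-point error, but only the first can be used when a paper reports nothing beyond $F$, its degrees of freedom, and the sample size.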
IV Example computations
In this section, we will discuss two examples of using Equation 6 to compute Bayes factors. In the first example, I will show how to compute and interpret a Bayes factor for a reported null effect in the field of experimental psychology. In the second example, I will show how to modify Equation 6 to work with an independent samples $t$-test.
IV.1 Example 1
Sevos et al. (2016) performed an experiment to assess whether schizophrenics could internally simulate motor actions when perceiving graspable objects. The evidence for such internal simulation comes from a statistical interaction between response-orientation compatibility and the presence of an individual name prime. Sevos et al. reported that in a sample of schizophrenics, there was no interaction between this compatibility and name prime (a nonsignificant $F$ test). Critically, Sevos et al. claimed that this null effect was evidence for the absence of sensorimotor simulation, which, as they indicate, would imply that schizophrenics would have to rely on higher cognitive processes for even the simplest daily tasks. This claim is based on a null effect, which, as pointed out earlier, is problematic for a null hypothesis testing framework. I will now show how to compute a Bayes factor to assess the evidence for this null effect.
To this end, we substitute the reported sample size, degrees of freedom, and $F$-ratio into Equation 6, which gives $BF_{01} \approx 1.19$.
This Bayes factor can be interpreted as follows: after seeing the data, our belief in the null hypothesis should increase only by a factor of 1.19. In other words, these data are not very informative toward our belief in the null, which implies that the claim of a null effect in Sevos et al. (2016) may be a bit optimistic. According to the classification scheme of Raftery (1995), this result provides weak evidence for the null.
IV.2 Example 2
Borota et al. (2014) observed that participants who received 200 mg of caffeine performed significantly better on a test of object memory than a control group of participants who received a placebo, as assessed by an independent samples $t$-test. Borota et al. (2014) claimed this result as evidence that caffeine enhances memory consolidation.
As before, we can measure the evidence provided from this data sample by computing a Bayes factor. However, because Equation 6 casts the Bayes factor in terms of an $F$-ratio, it may not be immediately obvious whether we can use Equation 6 in this context. It turns out to be straightforward to modify Equation 6 to work for an independent samples $t$-test. All we need are two simple transformations: (1) $F = t^2$, and (2) $df_1 = 1$. Applying these to Equation 6, we get

$$BF_{01} \approx \sqrt{n \left(1 + \frac{t^2}{df}\right)^{-n}},$$

where $df$ denotes the degrees of freedom of the $t$-test (that is, the error degrees of freedom $df_2$ in Equation 6).
We can now apply this equation to the reported results of Borota et al. (2014). Doing so yields a Bayes factor $BF_{01}$ that is greater than 1 but falls in the weakest evidence category of Raftery (1995).
So, perhaps counterintuitively, the significant result reported in Borota et al. (2014) turns out to be weak evidence in support of the null! Such results are an example of Lindley’s paradox (Lindley, 1957), where “significant” $p$-values between 0.04 and 0.05 can actually imply evidence in favor of the null when analyzed in a Bayesian framework.
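The $t$-test version of the approximation can be sketched as below, taking its form to be $BF_{01} \approx \sqrt{n(1 + t^2/df)^{-n}}$ with $df = n - 2$ for two independent samples. The function name and the inputs are hypothetical and are not the values reported by Borota et al. (2014); they were chosen so that the two-sided $p$-value lies just under 0.05.

```python
from math import exp, log, log1p


def bf01_from_t(t: float, n: int) -> float:
    """Approximate BF01 for an independent samples t-test with n total observations.

    Uses F = t^2 and df1 = 1, with error degrees of freedom df = n - 2.
    """
    df = n - 2
    log_bf01 = 0.5 * (log(n) - n * log1p(t * t / df))
    return exp(log_bf01)


# Hypothetical result: t(98) = 2.0 from n = 100 participants.
# The two-sided critical value at the 0.05 level for 98 degrees of freedom
# is about 1.98, so this result would be declared "significant".
bf01 = bf01_from_t(t=2.0, n=100)
print(f"BF01 = {bf01:.2f}")
```

Here the approximate Bayes factor comes out above 1, so a result that NHST would call significant slightly favors the null, which is the pattern described by Lindley’s paradox.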
V Simulations: BIC approximation versus default Bayesian ANOVA
At this stage, it is clear that Equation 6 provides a straightforward method for computing an approximate Bayes factor, especially in cases when one is given only minimal output from a reported ANOVA or $t$ test. However, it is not yet clear to what extent this BIC approximation would result in the same decision if a Bayesian analysis of variance (e.g., Rouder et al., 2012) were performed on the raw data. To answer this question, I performed a series of simulations.
Each simulation consisted of 1000 randomly generated data sets under a factorial design. This design was chosen to replicate experimental conditions that are common across many applications in the biological and behavioral sciences. Further, I simulated varying common levels of statistical power in these experiments by testing 3 different cell-size conditions. Specifically, each data set consisted of a vector of observations generated as the sum of two main effects, an interaction effect, and an error term. The effects were generated from multivariate normal distributions with mean 0, and three different effect sizes were obtained by varying the variance of these distributions (as in Wang, 2017). In all, there were 9 different simulations, generated by crossing the 3 cell sizes with the 3 effect sizes.
For each data set, I computed (1) a Bayesian ANOVA using the BayesFactor package in R (Morey and Rouder, 2015) and (2) the BIC approximation using Equation 6 from the traditional ANOVA. Bayes factors were computed as $BF_{10}$ to assess evidence in favor of the alternative hypothesis over the null hypothesis. Similar to Wang (2017), I set the decision criterion to select the alternative hypothesis if $BF_{10}$ exceeded a fixed threshold, and the null hypothesis otherwise. Because the different cell sizes resulted in similar outcomes, for brevity I report only a single cell size condition in the summaries below. Also note that all BayesFactor models were fit with a “wide” prior, which is roughly equivalent to the unit-information prior used by Raftery (1995) for the BIC approximation.
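The comparison between the two methods reduces to applying the same decision rule to both sets of Bayes factors and counting agreements. The sketch below is not the simulation code used here: the function names are hypothetical, the Bayes factors are made-up numbers, and the threshold is passed explicitly because its value (1.0 in the example) is an assumption.

```python
def select_model(bf10: float, threshold: float) -> str:
    """Choose H1 if the Bayes factor exceeds the threshold, otherwise H0."""
    return "H1" if bf10 > threshold else "H0"


def consistency(bf10_a: list[float], bf10_b: list[float], threshold: float) -> float:
    """Proportion of data sets on which two methods select the same model."""
    if len(bf10_a) != len(bf10_b):
        raise ValueError("Both methods must be evaluated on the same data sets.")
    agreements = sum(
        select_model(a, threshold) == select_model(b, threshold)
        for a, b in zip(bf10_a, bf10_b)
    )
    return agreements / len(bf10_a)


# Made-up Bayes factors for four data sets (not simulation results).
default_bf10 = [0.5, 2.0, 3.0, 0.9]   # e.g., from a default Bayesian ANOVA
bic_bf10 = [0.7, 1.5, 0.8, 0.6]       # e.g., from the BIC approximation
print(consistency(default_bf10, bic_bf10, threshold=1.0))
```

Note that two methods can disagree in magnitude on every data set and still be fully consistent by this measure, since only the side of the threshold matters.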
First, I will report the results of computing Bayes factors for the first main effect in each of the three effect size conditions. Five-number summaries of the resulting Bayes factors are reported in Table 1, as well as the proportion of simulated data sets for which the Bayesian ANOVA and the BIC approximation from Equation 6 selected the same model.
As shown in Table 1, the BIC approximation from Equation 6 provides a similar distribution of Bayes factors compared to those computed from the BayesFactor package in R. Figure 1 shows this pattern of results quite clearly, as the kernel density plots for the two different types of Bayes factors exhibit a large amount of overlap for the two nonzero effect size conditions. It is notable that the BIC approximation tended to underestimate the BayesFactor output in the null-effect condition. However, as can be seen in the “Consistency” column of Table 1, regardless of effect size condition, the two different types of Bayes factors resulted in the same decision in a large proportion of simulations (at least 98.4% of simulation trials).
A similar picture emerges for the second main effect. As can be seen in Table 2 and Figure 2, the BIC approximation and the BayesFactor outputs are largely consistent and result in mostly the same model choice decisions. As with the results for the first main effect, there is some slight difference in the kernel density plots when simulating null effects. However, both methods chose the same model on at least 92.7% of simulated trials, showing a good amount of consistency.
Finally, we can see in Table 3 and Figure 3 that the BIC approximation closely mirrors the output of the BayesFactor package for the interaction effect. Indeed, the kernel density plots in Figure 3 show considerable overlap between the distributions of BIC-based Bayes factors and the distributions of BayesFactor outputs, and this picture is consistent across all three effect sizes. As expected from this picture, both methods arrive at very similar model choices, picking the same model on at least 97% of trials.
The BIC approximation given in Equation 6 provides an easy-to-use estimate of Bayes factors for simple between-subject ANOVA and $t$ test designs. It requires only minimal information, which makes it well-suited for use in a meta-analytic context. In simulations, the estimates derived from Equation 6 compare favorably to Bayes factors computed using existing software solutions with raw data. Thus, the researcher can confidently add this BIC approximation to the ever-growing collection of Bayesian tools for scientific measurement.
- Borota et al. (2014) Borota, D., Murray, E., Keceli, G., Chang, A., Watabe, J. M., Ly, M., Toscano, J. P., and Yassa, M. A. (2014). Post-study caffeine administration enhances memory consolidation in humans. Nature Neuroscience, 17(2):201–203.
- Gigerenzer (2004) Gigerenzer, G. (2004). Mindless statistics. The Journal of Socio-Economics, 33(5):587–606.
- Hoekstra et al. (2014) Hoekstra, R., Morey, R. D., Rouder, J. N., and Wagenmakers, E.-J. (2014). Robust misinterpretation of confidence intervals. Psychonomic Bulletin & Review, 21(5):1157–1164.
- Jeffreys (1961) Jeffreys, H. (1961). The Theory of Probability (3rd ed.). Oxford University Press, Oxford, UK.
- Lindley (1957) Lindley, D. V. (1957). A statistical paradox. Biometrika, 44(1-2):187–192.
- Masson (2011) Masson, M. E. J. (2011). A tutorial on a practical Bayesian alternative to null-hypothesis significance testing. Behavior Research Methods, 43(3):679–690.
- Morey and Rouder (2011) Morey, R. D. and Rouder, J. N. (2011). Bayes factor approaches for testing interval null hypotheses. Psychological Methods, 16(4):406–419.
- Morey and Rouder (2015) Morey, R. D. and Rouder, J. N. (2015). BayesFactor: Computation of Bayes Factors for Common Designs. R package version 0.9.12-2.
- Raftery (1995) Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25:111–163.
- Rouder et al. (2012) Rouder, J. N., Morey, R. D., Speckman, P. L., and Province, J. M. (2012). Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology, 56(5):356–374.
- Rouder et al. (2009) Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., and Iverson, G. (2009). Bayesian tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2):225–237.
- Sevos et al. (2016) Sevos, J., Grosselin, A., Brouillet, D., Pellet, J., and Massoubre, C. (2016). Is there any influence of variations in context on object-affordance effects in schizophrenia? Perception of property and goals of action. Frontiers in Psychology, 7:1551.
- Wagenmakers (2007) Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5):779–804.
- Wang (2017) Wang, M. (2017). Mixtures of g-priors for analysis of variance models with a diverging number of parameters. Bayesian Analysis, 12(2):511–532.