 # Estimating Bayes factors from minimal ANOVA summaries for repeated-measures designs

In this paper, we develop a formula for estimating Bayes factors from repeated measures ANOVA designs. The formula, which requires knowing only minimal information about the ANOVA (e.g., the $F$-statistic), is based on the BIC approximation of the Bayes factor, a common default method for Bayesian computation with linear models. In addition to several computational examples, we report a simulation study in which we demonstrate that despite its simplicity, our formula compares favorably to a recently developed, more complex method that accounts for correlation between repeated measurements. Our method provides a simple way for researchers to estimate Bayes factors from a minimal set of summary statistics, giving users a powerful index for estimating the evidential value of not only their own data, but also the data reported in published studies.


## I Background

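Let us begin with a simple case of a one-factor independent groups design. Consider a set of data $Y_{ij}$, on which we impose a linear model as follows:

$$Y_{ij} = \mu + \alpha_j + \varepsilon_{ij}; \quad i = 1, \dots, n; \quad j = 1, \dots, k,$$

where $\mu$ represents the grand mean, $\alpha_j$ represents the treatment effect associated with group $j$, and $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)$. In all, we have $N = nk$ independent observations. We define two hypotheses:

$$\mathcal{H}_0: \alpha_j = 0 \text{ for } j = 1, \dots, k \qquad \mathcal{H}_1: \alpha_j \neq 0 \text{ for some } j$$

Recall that for $\mathcal{H}_0$ and $\mathcal{H}_1$, the Bayes factor (Kass and Raftery, 1995), denoted $B_{01}$, is defined as the ratio of marginal likelihoods for $\mathcal{H}_0$ and $\mathcal{H}_1$, respectively. That is,

$$B_{01} = \frac{p(\text{data} \mid \mathcal{H}_0)}{p(\text{data} \mid \mathcal{H}_1)}.$$

This ratio indicates the extent to which the prior odds for $\mathcal{H}_0$ over $\mathcal{H}_1$ are updated after observing data.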

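Since the posterior odds equal the Bayes factor multiplied by the prior odds, assuming equal prior probabilities for the two hypotheses gives posterior odds equal to $B_{01}$. Because the two posterior probabilities sum to 1, the posterior model probability for $\mathcal{H}_0$ is

$$p(\mathcal{H}_0 \mid \text{data}) = \frac{B_{01}}{1 + B_{01}},$$

and, with $B_{10} = 1/B_{01}$, the posterior model probability for $\mathcal{H}_1$ is $p(\mathcal{H}_1 \mid \text{data}) = B_{10}/(1 + B_{10})$.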

In Faulkenberry (2018), it was shown that for any independent-groups design, one can use the results of an analysis of variance to compute an approximation of $B_{01}$ that is based on a unit information prior (Wagenmakers, 2007; Masson, 2011). Specifically,

$$B_{01} \approx \sqrt{N^{df_1}\left(1 + \frac{F \, df_1}{df_2}\right)^{-N}}, \tag{1}$$

where $F(df_1, df_2)$ is the $F$-ratio from a standard analysis of variance applied to these data.

As an example, consider a hypothetical dataset containing $k = 4$ groups of $n = 25$ observations each (for a total of $N = 100$ independent observations). Suppose that an ANOVA produces $F(3, 96) = 2.76$, $p = 0.046$. This result would be considered "significant" by conventional standards, and traditional practice would dictate that we reject $\mathcal{H}_0$ in favor of $\mathcal{H}_1$. But is this result really evidential for $\mathcal{H}_1$? We can apply Equation 1 as follows:

$$\begin{aligned}
B_{01} &\approx \sqrt{N^{df_1}\left(1 + \frac{F \, df_1}{df_2}\right)^{-N}} \\
&= \sqrt{100^{3}\left(1 + \frac{2.76 \cdot 3}{96}\right)^{-100}} \\
&= 15.98.
\end{aligned}$$

This result indicates quite the opposite: by definition of the Bayes factor, the observed data are almost 16 times more likely under $\mathcal{H}_0$ than under $\mathcal{H}_1$. The appearance of such contradictory conclusions from two different testing frameworks is a classic result known as Lindley's paradox (Lindley, 1957).
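The arithmetic in this example can be checked with a few lines of Python. This is a minimal sketch rather than code from the paper: the function name `bf01_between` is hypothetical, the computation is done on the log scale to avoid underflow for large $N$, it assumes $N = df_1 + df_2 + 1$ (as in a one-factor independent-groups ANOVA), and $F = 2.76$ is taken as the value that reproduces the reported Bayes factor of 15.98.

```python
import math

def bf01_between(F, df1, df2):
    """BIC-based approximation of B01 from Equation 1 (hypothetical helper)."""
    # Assumption: one-factor independent-groups design, so N = df1 + df2 + 1.
    N = df1 + df2 + 1
    # log B01 = 0.5 * [df1 * ln(N) - N * ln(1 + F * df1 / df2)]
    log_bf01 = 0.5 * (df1 * math.log(N) - N * math.log(1 + F * df1 / df2))
    return math.exp(log_bf01)

bf01 = bf01_between(F=2.76, df1=3, df2=96)
print(f"B01 = {bf01:.2f}")
```

Values above 1 favor $\mathcal{H}_0$ and values below 1 favor $\mathcal{H}_1$.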

## II The BIC approximation for repeated measures

Our goal now is to modify Equation 1 to the case where we have an experimental design with repeated measurements. For context, consider an experiment where $k$ measurements are taken from each of $n$ subjects. We then have a total of $N = nk$ observations, but they are no longer independent measurements. Assume a linear mixed model structure on the observations:

$$Y_{ij} = \mu + \alpha_j + \pi_i + \varepsilon_{ij}; \quad i = 1, \dots, n; \quad j = 1, \dots, k,$$

where $\mu$ represents the grand mean, $\alpha_j$ represents the treatment effect associated with group $j$, $\pi_i$ represents the effect of subject $i$, and $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)$. Due to the correlated structure of these data, we have $n(k-1)$ independent observations. We will define $\mathcal{H}_0$ and $\mathcal{H}_1$ as above.

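Prior work of Wagenmakers (2007) has demonstrated that $B_{01}$ can be approximated as $\exp(\Delta BIC_{10}/2)$, where

$$\Delta BIC_{10} = N \ln\left(\frac{SSE_1}{SSE_0}\right) + (\kappa_1 - \kappa_0)\ln(N).$$

Here, $N$ is equal to the number of independent observations; as noted above, this is equal to $n(k-1)$. $SSE_1$ represents the variability left unexplained by $\mathcal{H}_1$; for an ANOVA, this is equal to $SS_{\text{residual}}$. $SSE_0$ represents the variability left unexplained by $\mathcal{H}_0$; for an ANOVA, this is equal to the sum of $SS_{\text{treatment}}$ and $SS_{\text{residual}}$. Finally, $\kappa_1 - \kappa_0$ is equal to the difference in the number of parameters between $\mathcal{H}_1$ and $\mathcal{H}_0$; this is equal to $k - 1$.

We are now ready to derive a formula for $B_{01}$. First, we will re-express $\Delta BIC_{10}$ in terms of $F$:

$$\begin{aligned}
\Delta BIC_{10} &= N \ln\left(\frac{SSE_1}{SSE_0}\right) + (\kappa_1 - \kappa_0)\ln(N) \\
&= n(k-1)\ln\left(\frac{SS_{\text{residual}}}{SS_{\text{residual}} + SS_{\text{treatment}}}\right) + (k-1)\ln(n(k-1)) \\
&= n(k-1)\ln\left(\frac{1}{1 + \frac{SS_{\text{treatment}}}{SS_{\text{residual}}}}\right) + (k-1)\ln(n(k-1)) \\
&= n(k-1)\ln\left(\frac{\frac{df_{\text{residual}}}{df_{\text{treatment}}}}{\frac{df_{\text{residual}}}{df_{\text{treatment}}} + \frac{SS_{\text{treatment}}}{SS_{\text{residual}}} \cdot \frac{df_{\text{residual}}}{df_{\text{treatment}}}}\right) + (k-1)\ln(n(k-1)) \\
&= n(k-1)\ln\left(\frac{\frac{df_{\text{residual}}}{df_{\text{treatment}}}}{\frac{df_{\text{residual}}}{df_{\text{treatment}}} + F}\right) + (k-1)\ln(n(k-1)) \\
&= n(k-1)\ln\left(\frac{df_{\text{residual}}}{df_{\text{residual}} + F \cdot df_{\text{treatment}}}\right) + (k-1)\ln(n(k-1)) \\
&= n(k-1)\ln\left(\frac{(n-1)(k-1)}{(n-1)(k-1) + F(k-1)}\right) + (k-1)\ln(n(k-1)) \\
&= n(k-1)\ln\left(\frac{n-1}{n-1+F}\right) + (k-1)\ln(n(k-1)).
\end{aligned}$$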

Thus, we can write

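$$\begin{aligned}
B_{01} &\approx \exp(\Delta BIC_{10}/2) \\
&= \sqrt{(n(k-1))^{k-1} \cdot \left(\frac{n-1}{n-1+F}\right)^{n(k-1)}} \\
&= \sqrt{(nk-n)^{k-1} \cdot \left(\frac{n-1}{n-1+F}\right)^{nk-n}}.
\end{aligned}$$

If we invert the term containing $F$ and divide $n-1$ into the resulting numerator, we get the following formula:

$$B_{01} \approx \sqrt{(nk-n)^{k-1} \cdot \left(1 + \frac{F}{n-1}\right)^{n-nk}}, \tag{2}$$

where $n$ equals the number of subjects and $k$ equals the number of repeated measurements per subject.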
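As a rough numerical check on this derivation, the sketch below evaluates the Bayes factor two ways: once from the $F$-based formula (Equation 2) and once directly from $\exp(\Delta BIC_{10}/2)$ using sums of squares. The function names and the sums of squares are made up for illustration and do not come from the paper; the two results should agree up to floating-point error.

```python
import math

def bf01_repeated(F, n, k):
    """Equation 2, evaluated on the log scale (hypothetical helper)."""
    log_bf01 = 0.5 * ((k - 1) * math.log(n * k - n)
                      + (n - n * k) * math.log(1 + F / (n - 1)))
    return math.exp(log_bf01)

def bf01_from_ss(ss_treatment, ss_residual, n, k):
    """exp(Delta BIC_10 / 2) with N = n(k - 1), SSE1 = SS_residual and
    SSE0 = SS_residual + SS_treatment."""
    N = n * (k - 1)
    delta_bic10 = (N * math.log(ss_residual / (ss_residual + ss_treatment))
                   + (k - 1) * math.log(N))
    return math.exp(delta_bic10 / 2)

# Made-up sums of squares for a design with n = 10 subjects and k = 3 conditions.
n, k = 10, 3
ss_treatment, ss_residual = 40.0, 90.0

# F = MS_treatment / MS_residual with df_treatment = k - 1 and
# df_residual = (n - 1)(k - 1).
F = (ss_treatment / (k - 1)) / (ss_residual / ((n - 1) * (k - 1)))

print(bf01_repeated(F, n, k))
print(bf01_from_ss(ss_treatment, ss_residual, n, k))
```

The practical point is that the first function needs only $F$, $n$, and $k$, whereas the second needs the sums of squares.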

### II.1 Some examples

We can now apply Equation 2 to compute Bayes factors for a couple of examples. The examples below are based on data from Faulkenberry et al. (2018). In this experiment, subjects were presented with pairs of single digit numerals and asked to choose the numeral that was presented in the larger font size. For each of $n = 23$ subjects, median response times were calculated for each of $k = 2$ conditions: congruent trials and incongruent trials. Congruent trials were defined as those in which the physically larger digit was also the numerically larger digit (e.g., a small 2 paired with a large 8). Incongruent trials were defined such that the physically larger digit was numerically smaller (e.g., a large 2 paired with a small 8). The resulting ANOVA summary table is depicted in Table 1.

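Applying Equation 2 gives us the following:

$$\begin{aligned}
B_{01} &\approx \sqrt{(nk-n)^{k-1} \cdot \left(1 + \frac{F}{n-1}\right)^{n-nk}} \\
&= \sqrt{(23 \cdot 2 - 23)^{2-1}\left(1 + \frac{39.63}{23-1}\right)^{(23 - 23 \cdot 2)}} \\
&= \sqrt{23^{1}\left(1 + \frac{39.63}{22}\right)^{-23}} \\
&= 0.00003436.
\end{aligned}$$

The resulting Bayes factor displays quite powerful evidence against $\mathcal{H}_0$; if we cast the Bayes factor in favor of $\mathcal{H}_1$, we get $B_{10} = 1/B_{01} = 29104$, indicating that the observed data are approximately 30,000 times more likely under $\mathcal{H}_1$ than under $\mathcal{H}_0$. This provides overwhelming support for the presence of an effect of physical/numerical congruity on median response times. Converting the Bayes factor to a posterior model probability, we also see incredible evidence for $\mathcal{H}_1$:

$$\begin{aligned}
p(\mathcal{H}_1 \mid \text{data}) &= \frac{B_{10}}{1 + B_{10}} \\
&= \frac{29104}{1 + 29104} \\
&= 0.99997.
\end{aligned}$$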

Now let us consider our second example. In addition to analyzing median response times, Faulkenberry et al. (2018) also fit each subject's distribution of response times to a parametric model (i.e., the shifted Wald distribution; see Anders et al., 2016; Faulkenberry, 2017, for details), allowing them to investigate the effects of congruity on shape, scale, and location of the response time distributions. Specifically, they predicted that the leading edge, or shift, of the distributions would not differ between congruent and incongruent trials, thus providing support against an early encoding-based explanation of the observed size-congruity effect (Santens and Verguts, 2011; Faulkenberry et al., 2016; Sobel et al., 2016, 2017). The shift parameter was calculated for both of the $k = 2$ congruity conditions for each of the $n = 23$ subjects. The resulting ANOVA summary table is presented in Table 2.

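Applying Equation 2 gives us the following:

$$\begin{aligned}
B_{01} &\approx \sqrt{(nk-n)^{k-1} \cdot \left(1 + \frac{F}{n-1}\right)^{n-nk}} \\
&= \sqrt{(23 \cdot 2 - 23)^{2-1}\left(1 + \frac{1.336}{23-1}\right)^{(23 - 23 \cdot 2)}} \\
&= \sqrt{23^{1}\left(1 + \frac{1.336}{22}\right)^{-23}} \\
&= 2.435.
\end{aligned}$$

This Bayes factor tells us that the observed data are approximately 2.4 times more likely under $\mathcal{H}_0$ than under $\mathcal{H}_1$. Converting the Bayes factor to a posterior model probability, we also see positive evidence for $\mathcal{H}_0$:

$$\begin{aligned}
p(\mathcal{H}_0 \mid \text{data}) &= \frac{B_{01}}{1 + B_{01}} \\
&= \frac{2.435}{1 + 2.435} \\
&= 0.709.
\end{aligned}$$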
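Both worked examples can be reproduced from the three inputs $n$, $k$, and $F$ alone. The following is a minimal sketch, not code from the paper: the helper names `bf01_repeated` and `posterior_h0` are hypothetical, and the conversion to a posterior probability assumes equal prior probabilities for the two hypotheses unless `prior_h0` is changed.

```python
import math

def bf01_repeated(F, n, k):
    """Equation 2, evaluated on the log scale (hypothetical helper)."""
    log_bf01 = 0.5 * ((k - 1) * math.log(n * k - n)
                      + (n - n * k) * math.log(1 + F / (n - 1)))
    return math.exp(log_bf01)

def posterior_h0(bf01, prior_h0=0.5):
    """Posterior probability of H0: posterior odds = B01 * prior odds."""
    posterior_odds = bf01 * prior_h0 / (1 - prior_h0)
    return posterior_odds / (1 + posterior_odds)

# F statistics from the two examples (n = 23 subjects, k = 2 conditions).
examples = {"median response time": 39.63, "shift parameter": 1.336}

for label, F in examples.items():
    bf01 = bf01_repeated(F, n=23, k=2)
    p_h0 = posterior_h0(bf01)
    print(f"{label}: B01 = {bf01:.4g}, B10 = {1 / bf01:.4g}, "
          f"p(H0|data) = {p_h0:.5f}, p(H1|data) = {1 - p_h0:.5f}")
```

Working on the log scale matters little here, but it keeps the computation stable when $n(k-1)$ is large and the power term would otherwise underflow.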

## III Accounting for correlation between repeated measurements

In a recent paper, Nathoo and Masson (2016) took a slightly different approach to this problem, investigating the role of effective sample size in repeated measures designs (Jones, 2011). For single-factor repeated measures designs, effective sample size can be computed as $n_{\text{eff}} = \frac{nk}{1 + \rho(k-1)}$, where $\rho$ is the intraclass correlation. When $\rho = 0$, $n_{\text{eff}} = nk$, and when $\rho = 1$, $n_{\text{eff}} = n$. Though $\rho$ is unknown, Nathoo and Masson (2016) developed a method to estimate it from values in the ANOVA, leading to the following refined estimate:

$$\begin{aligned}
\Delta BIC_{10} ={}& n(k-1)\ln\left(\frac{SS_{\text{total}} - SS_{\text{treatment}} - SS_{\text{subject}}}{SS_{\text{total}} - SS_{\text{subject}}}\right) \\
&+ (k+2)\ln\left(\frac{n(SS_{\text{total}} - SS_{\text{treatment}})}{SS_{\text{subject}}}\right) \\
&- 3\ln\left(\frac{n \, SS_{\text{total}}}{SS_{\text{subject}}}\right)
\end{aligned}$$

Though this estimate certainly provides a better account of the correlation between repeated measurements, the benefit comes at the price of added complexity, and one cannot easily reduce this formula to a simple expression involving only $F$, as we do with Equation 2. This leads to the natural question: how well does our Equation 2 match up with the more complex approach of Nathoo and Masson (2016)?

As a first step toward answering this question, let us revisit the two examples presented above. If we apply the Nathoo and Masson formula to the ANOVA summary in Table 1, we obtain:

$$\begin{aligned}
\Delta BIC_{10} ={}& 23(2-1)\ln\left(\frac{356181 - 45360 - 285639}{356181 - 285639}\right) \\
&+ (2+2)\ln\left(\frac{23(356181 - 45360)}{285639}\right) - 3\ln\left(\frac{23(356181)}{285639}\right) \\
={}& 23\ln(0.3570) + 4\ln(25.028) - 3\ln(28.680) \\
={}& -20.879
\end{aligned}$$

We can convert $\Delta BIC_{10}$ to a Bayes factor, giving us $B_{01} = \exp(-20.879/2) \approx 2.9 \times 10^{-5}$. As above, we cast this Bayes factor in favor of $\mathcal{H}_1$ by inverting, so $B_{10} = 1/B_{01} \approx 34{,}000$. This implies $p(\mathcal{H}_1 \mid \text{data}) \approx 0.99997$. Note that the general interpretation of these results is on par with our earlier method; both indicate overwhelming support for $\mathcal{H}_1$. If anything, the approximation we obtained with Equation 2 is slightly conservative regarding support for $\mathcal{H}_1$. This is because the method of Nathoo and Masson was designed to reduce the BIC penalty for $\mathcal{H}_1$ when repeated measures conditions are highly correlated; compared to the formulation upon which Equation 2 is based, this will tend to increase the support for $\mathcal{H}_1$ (Nathoo and Masson, 2016).

We can do a similar computation with the data from Table 2:

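$$\begin{aligned}
\Delta BIC_{10} ={}& 23(2-1)\ln\left(\frac{116399 - 739 - 103984}{116399 - 103984}\right) \\
&+ (2+2)\ln\left(\frac{23(116399 - 739)}{103984}\right) - 3\ln\left(\frac{23(116399)}{103984}\right) \\
={}& 23\ln(0.9405) + 4\ln(25.583) - 3\ln(25.746) \\
={}& 1.812
\end{aligned}$$

This equates to a Bayes factor of $B_{01} = \exp(1.812/2) \approx 2.47$ and a posterior model probability of $p(\mathcal{H}_0 \mid \text{data}) \approx 0.712$. Clearly, these computations are quite similar to the ones we performed with Equation 2; both indicate positive evidence for $\mathcal{H}_0$ over $\mathcal{H}_1$.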
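For readers who want to repeat these two computations, the refined estimate can be sketched in Python as follows. The function name `delta_bic10_nm` is hypothetical, and the sums of squares are the ones appearing in the two computations above. Because the code carries full precision rather than the rounded intermediate ratios shown in the text, its output may differ from the reported values in the third decimal place.

```python
import math

def delta_bic10_nm(ss_total, ss_treatment, ss_subject, n, k):
    """Refined Delta BIC_10 of Nathoo and Masson (2016), as written above
    (hypothetical helper)."""
    ss_residual = ss_total - ss_treatment - ss_subject
    term1 = n * (k - 1) * math.log(ss_residual / (ss_total - ss_subject))
    term2 = (k + 2) * math.log(n * (ss_total - ss_treatment) / ss_subject)
    term3 = 3 * math.log(n * ss_total / ss_subject)
    return term1 + term2 - term3

# (SS_total, SS_treatment, SS_subject) from the two worked examples.
examples = {
    "Table 1 (median response time)": (356181, 45360, 285639),
    "Table 2 (shift parameter)": (116399, 739, 103984),
}

for label, (ss_total, ss_treatment, ss_subject) in examples.items():
    delta = delta_bic10_nm(ss_total, ss_treatment, ss_subject, n=23, k=2)
    bf01 = math.exp(delta / 2)
    p_h0 = bf01 / (1 + bf01)  # assumes equal prior probabilities
    print(f"{label}: dBIC10 = {delta:.3f}, B01 = {bf01:.4g}, "
          f"p(H0|data) = {p_h0:.5f}")
```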

## IV Simulation study

The computations in the previous section reflect two preliminary findings. First, the revised BIC formula of Nathoo and Masson (2016) yields Bayes factors and posterior model probabilities that take into account an estimate of the correlation between repeated measurements. This is a highly principled approach which our Equation 2 does not take. However, as we can see with both computations, the general conclusion remains the same regardless of whether we use Equation 2 or the Nathoo and Masson method. Given that our Equation 2 is (1) easy to use, and (2) requires only three inputs (the number of subjects $n$, the number of repeated measurement conditions $k$, and the $F$ statistic), could it be that Equation 2 produces results that are sufficient for day-to-day work, with the risk of conservatism outweighed by the simplicity of our formula? To answer this question, we conducted a Monte Carlo simulation to systematically investigate the relationship between Equation 2 and the Nathoo and Masson method across a wide variety of randomly generated datasets.

In this simulation, we randomly generated datasets that reflected the repeated-measures designs that we have discussed throughout this paper. Specifically, data were generated from the linear mixed model

$$Y_{ij} = \mu + \alpha_j + \pi_i + \varepsilon_{ij}; \quad i = 1, \dots, n; \quad j = 1, \dots, k,$$

where $\mu$ represents a grand mean, $\alpha_j$ represents a treatment effect, and $\pi_i$ represents a subject effect. For convenience, we held $k$ fixed at a single value, though similar results were obtained with other values of $k$ (not reported here). Also, we assume $\pi_i \sim \mathcal{N}(0, \sigma_\pi^2)$ and $\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)$. We systematically varied three components of the model:

1. The number of observations $n$ was set to one of three values;

2. The intraclass correlation $\rho$ between treatment conditions was set to one of two values, one small and one large;

3. The size of the treatment effect was manipulated to be either null, small, or medium. Specifically, these effects were defined as follows. Let $\mu_j = \mu + \alpha_j$ (i.e., the condition mean for treatment $j$). Then we define effect size as

$$\delta = \frac{\max(\mu_j) - \min(\mu_j)}{\sqrt{\sigma_\pi^2 + \sigma_\varepsilon^2}},$$

and correspondingly, we set $\delta$ to one of three values: $\delta = 0$ (null effect), a small positive value (small effect), and a larger value (medium effect). Also note that since we can write the intraclass correlation as

$$\rho = \frac{\sigma_\pi^2}{\sigma_\pi^2 + \sigma_\varepsilon^2},$$

it follows directly that we can alternatively parameterize effect size as

$$\delta = \frac{\sqrt{\rho}\,(\max(\mu_j) - \min(\mu_j))}{\sigma_\pi}.$$

Using this expression, we were able to set our marginal variance to be constant across the varying values of our simulation parameters.
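The relationship between the two parameterizations of effect size can be checked numerically. The sketch below is illustrative only: the helper names and the values of the intraclass correlation and effect size are assumptions chosen for the example, not the settings used in the simulation. It fixes the marginal variance, splits it into subject and error components using the intraclass correlation, and confirms that both expressions for the effect size give the same number.

```python
import math

def variance_components(rho, total_var=1.0):
    """Split a fixed marginal variance into subject and error SDs so that
    rho = sigma_pi^2 / (sigma_pi^2 + sigma_eps^2). Hypothetical helper."""
    sigma_pi = math.sqrt(rho * total_var)
    sigma_eps = math.sqrt((1 - rho) * total_var)
    return sigma_pi, sigma_eps

def mean_range(delta, rho, sigma_pi):
    """Invert delta = sqrt(rho) * range / sigma_pi for the range of the
    condition means. Requires rho > 0."""
    return delta * sigma_pi / math.sqrt(rho)

# Illustrative values only (assumptions, not taken from the paper).
rho, delta = 0.5, 0.4

sigma_pi, sigma_eps = variance_components(rho)
spread = mean_range(delta, rho, sigma_pi)

# Recompute delta from its original definition as a consistency check.
delta_check = spread / math.sqrt(sigma_pi ** 2 + sigma_eps ** 2)
print(sigma_pi, sigma_eps, spread, delta_check)
```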

For each combination of number of observations ($n$), effect size ($\delta$), and intraclass correlation ($\rho$), we generated 1000 simulated datasets. For each of the datasets, we applied a repeated-measures ANOVA model and extracted two posterior probabilities for $\mathcal{H}_0$: one based on Equation 2 and one based on the refined estimate of Nathoo and Masson (2016). The results are depicted in Figure 1.
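One cell of such a simulation might be sketched as follows. This is not the authors' code: it uses only the Python standard library, every function name is hypothetical, the parameter values (20 subjects, 3 conditions, intraclass correlation 0.2, effect size 0.5) are assumptions chosen for illustration, and the evenly spaced layout of the condition means is just one of many that satisfy the effect-size definition.

```python
import math
import random
import statistics

def simulate_dataset(n, k, rho, delta, rng, mu=0.0, total_var=1.0):
    """Draw an n-by-k dataset from Y_ij = mu + alpha_j + pi_i + eps_ij."""
    sigma_pi = math.sqrt(rho * total_var)
    sigma_eps = math.sqrt((1 - rho) * total_var)
    # Assumption: condition means evenly spaced, with max - min equal to
    # delta marginal SDs, centred so the effects sum to zero.
    spread = delta * math.sqrt(total_var)
    alpha = [spread * (j / (k - 1) - 0.5) for j in range(k)]
    data = []
    for _ in range(n):
        pi_i = rng.gauss(0.0, sigma_pi)
        data.append([mu + alpha[j] + pi_i + rng.gauss(0.0, sigma_eps)
                     for j in range(k)])
    return data

def rm_anova_ss(data):
    """Return (SS_total, SS_treatment, SS_subject) for a one-factor design."""
    n, k = len(data), len(data[0])
    grand = sum(map(sum, data)) / (n * k)
    cond_means = [sum(row[j] for row in data) / n for j in range(k)]
    subj_means = [sum(row) / k for row in data]
    ss_total = sum((y - grand) ** 2 for row in data for y in row)
    ss_treatment = n * sum((m - grand) ** 2 for m in cond_means)
    ss_subject = k * sum((m - grand) ** 2 for m in subj_means)
    return ss_total, ss_treatment, ss_subject

def delta_bic_eq2(ss_total, ss_treatment, ss_subject, n, k):
    """Delta BIC_10 implied by Equation 2, i.e. 2 * ln(B01)."""
    ss_residual = ss_total - ss_treatment - ss_subject
    F = (ss_treatment / (k - 1)) / (ss_residual / ((n - 1) * (k - 1)))
    return ((k - 1) * math.log(n * k - n)
            + (n - n * k) * math.log(1 + F / (n - 1)))

def delta_bic_nm(ss_total, ss_treatment, ss_subject, n, k):
    """Refined Delta BIC_10 of Nathoo and Masson (2016)."""
    ss_residual = ss_total - ss_treatment - ss_subject
    return (n * (k - 1) * math.log(ss_residual / (ss_total - ss_subject))
            + (k + 2) * math.log(n * (ss_total - ss_treatment) / ss_subject)
            - 3 * math.log(n * ss_total / ss_subject))

def posterior_h0(delta_bic10):
    """p(H0 | data) = B01 / (1 + B01) with B01 = exp(Delta BIC_10 / 2),
    assuming equal prior probabilities."""
    return 1.0 / (1.0 + math.exp(-delta_bic10 / 2.0))

rng = random.Random(1)              # fixed seed for repeatability
n, k, rho, delta = 20, 3, 0.2, 0.5  # illustrative values, not the paper's

p_eq2, p_nm = [], []
for _ in range(1000):
    ss = rm_anova_ss(simulate_dataset(n, k, rho, delta, rng))
    p_eq2.append(posterior_h0(delta_bic_eq2(*ss, n, k)))
    p_nm.append(posterior_h0(delta_bic_nm(*ss, n, k)))

print("median p(H0|data), Equation 2:     ", statistics.median(p_eq2))
print("median p(H0|data), Nathoo & Masson:", statistics.median(p_nm))
```

Looping this over a grid of sample sizes, correlations, and effect sizes, and drawing a boxplot per cell, would give a display in the spirit of Figure 1.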

The primary message of Figure 1 is clear: our Equation 2, which was derived from the original BIC method (Wagenmakers, 2007; Masson, 2011; Faulkenberry, 2018), performs comparably to the refined BIC method of Nathoo and Masson (2016) across a variety of empirical situations. In the cases where $\mathcal{H}_0$ was true (the first row of Figure 1), both Equation 2 and the Nathoo and Masson (2016) method produce posterior probabilities for $\mathcal{H}_0$ that are reasonably large. For both methods, the variation of these estimates decreases as the number of observations increases. When the intraclass correlation $\rho$ is small, the estimates from Equation 2 and the Nathoo and Masson (2016) method are virtually identical. When the intraclass correlation is large, the Nathoo and Masson (2016) method introduces slightly more variability in the posterior probability estimates. In all, these results indicate that Equation 2 is slightly more favorable when $\mathcal{H}_0$ is true.

For small effects (row 2 of Figure 1), the performance of both methods depended heavily on the correlation between repeated measurements. For small intraclass correlation, both methods were quite supportive of $\mathcal{H}_0$, even though $\mathcal{H}_1$ was the true model. This reflects the conservative nature of the BIC approximation (Wagenmakers, 2007); since the unit information prior is uninformative and puts reasonable mass on a large range of possible effect sizes, the predictive updating value for any positive effect (i.e., $B_{10}$) will be smaller than would be the case if the prior were more concentrated on smaller effects. As a result, the posterior probability for $\mathcal{H}_1$ is smaller as well. Regardless, the original BIC method (Equation 2) and the Nathoo and Masson (2016) method produce similar results. The picture is different when the intraclass correlation is large; both methods produce a wide range of posterior probabilities, though they are again highly comparable. It is worth pointing out that the posterior probability estimates all improve with increasing numbers of observations; this should not be surprising, given that the BIC approximation underlying both Equation 2 and the Nathoo and Masson (2016) method is a large-sample approximation technique.

For medium effects (row 3 of Figure 1), we see much the same message that we have already discussed. Both Equation 2 and the Nathoo and Masson (2016) method produce similar posterior probability values. Both methods improve with increasing sample size, and at least for medium-size effects, the computations are quite reliable for high values of correlation between repeated measurements.

Figure 1: Results from our simulation. Each boxplot depicts the distribution of the posterior probability $p(\mathcal{H}_0 \mid \text{data})$ for 1000 Monte Carlo simulations. White boxes represent posterior probabilities derived from Bayes factors that were computed using Equation 2. Gray boxes represent posterior probabilities that come from the refined Bayes factor of Nathoo and Masson (2016).

## V Conclusion

In this paper, we have proposed a formula for estimating Bayes factors from repeated measures ANOVA designs. These ideas extend previous work of Faulkenberry (2018), who presented such formulas for between-subject designs. Such formulas are advantageous for researchers in a wide variety of empirical disciplines, as they provide an easy-to-use method for estimating Bayes factors from a minimal set of summary statistics. This gives the user a powerful index for estimating evidential value from a set of experiments, even in cases where the only data available are the summary statistics published in a paper. We think this provides a welcome addition to the collection of tools for doing Bayesian computation with summary statistics (e.g., Ly et al., 2018).

Further, we demonstrated that our formula performs similarly to a more refined, yet more complex formula of Nathoo and Masson (2016), who were able to explicitly estimate and account for the correlation between repeated measurements. Though the Nathoo and Masson (2016) approach is certainly more principled than a "one-size-fits-all" approach, it does require knowledge of the various sums-of-squares components from the repeated-measures ANOVA, and to our knowledge, there is not yet any obvious way to recover the Nathoo and Masson (2016) estimates from the $F$ statistic alone. Thus, given the similar performance of our method and the Nathoo and Masson (2016) method, we think our method stands at a slight advantage, not only for its simplicity, but also for its power in light of minimal available information.

## References

• Anders et al. (2016) Anders, R., Alario, F.-X., and Van Maanen, L. (2016). The shifted Wald distribution for response time data analysis. Psychological Methods, 21(3):309–327.
• Faulkenberry (2017) Faulkenberry, T. J. (2017). A single-boundary accumulator model of response times in an addition verification task. Frontiers in Psychology, 8.
• Faulkenberry (2018) Faulkenberry, T. J. (2018). Computing Bayes factors to measure evidence from experiments: An extension of the BIC approximation. Biometrical Letters, 55(1):31–43.
• Faulkenberry et al. (2016) Faulkenberry, T. J., Cruise, A., Lavro, D., and Shaki, S. (2016). Response trajectories capture the continuous dynamics of the size congruity effect. Acta Psychologica, 163:114–123.
• Faulkenberry et al. (2018) Faulkenberry, T. J., Vick, A. D., and Bowman, K. A. (2018). A shifted Wald decomposition of the numerical size-congruity effect: Support for a late interaction account. Polish Psychological Bulletin, 49(4):391–397.
• Jones (2011) Jones, R. H. (2011). Bayesian information criterion for longitudinal and clustered data. Statistics in Medicine, 30(25):3050–3056.
• Kass and Raftery (1995) Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430):773–795.
• Lindley (1957) Lindley, D. V. (1957). A statistical paradox. Biometrika, 44(1-2):187–192.
• Ly et al. (2018) Ly, A., Raj, A., Etz, A., Marsman, M., Gronau, Q. F., and Wagenmakers, E.-J. (2018). Bayesian reanalyses from summary statistics: A guide for academic consumers. Advances in Methods and Practices in Psychological Science, 1(3):367–374.
• Masson (2011) Masson, M. E. J. (2011). A tutorial on a practical Bayesian alternative to null-hypothesis significance testing. Behavior Research Methods, 43(3):679–690.
• Nathoo and Masson (2016) Nathoo, F. S. and Masson, M. E. (2016). Bayesian alternatives to null-hypothesis significance testing for repeated-measures designs. Journal of Mathematical Psychology, 72:144–157.
• Rouder et al. (2012) Rouder, J. N., Morey, R. D., Speckman, P. L., and Province, J. M. (2012). Default Bayes factors for ANOVA designs. Journal of Mathematical Psychology, 56(5):356–374.
• Rouder et al. (2009) Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., and Iverson, G. (2009). Bayesian tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16(2):225–237.
• Santens and Verguts (2011) Santens, S. and Verguts, T. (2011). The size congruity effect: Is bigger always more? Cognition, 118(1):94–110.
• Sobel et al. (2016) Sobel, K. V., Puri, A. M., and Faulkenberry, T. J. (2016). Bottom-up and top-down attentional contributions to the size congruity effect. Attention, Perception, & Psychophysics, 78(5):1324–1336.
• Sobel et al. (2017) Sobel, K. V., Puri, A. M., Faulkenberry, T. J., and Dague, T. D. (2017). Visual search for conjunctions of physical and numerical size shows that they are processed independently. Journal of Experimental Psychology: Human Perception and Performance, 43(3):444–453.
• Wagenmakers (2007) Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5):779–804.