1 Introduction
Following Karl Pearson's first quantitative synthesis of clinical trials in 1904, meta-analysis gradually established itself as an indispensable statistical method in medicine.[o2007historical]
However, over a century later, meta-analysis still presents formidable statistical challenges to medical practitioners, especially when the number of primary studies is low. In that case the estimation of across-study heterogeneity (i.e., the across-study standard deviation τ) is problematic,[brockwell2001comparison; inthout2014hartung; gonnermann2015no] and these problematic estimates may subsequently distort the estimates of the overall treatment effect size μ.[brockwell2001comparison; berkey1995random] The practical relevance of the small-sample challenge is underscored by the fact that the median number of studies in a meta-analysis from the Cochrane Database of Systematic Reviews (CDSR) is only 3, with an interquartile range from 2 to 6.[Davey2011characteristics]

One statistical method that has been proposed to address the small-sample challenge is Bayesian estimation, either with weakly informative prior distributions,[williams2018bayesian; higgins2009re; chung2013avoiding] predictive prior distributions based on pseudo-data,[rhodes2016implementing] or prior distributions informed by earlier studies.[higgins1996borrowing] These Bayesian techniques are well suited to estimate the model parameters when the data are scarce; however, by assigning continuous prior distributions to μ and τ, these estimation techniques implicitly assume that the treatment is effective and that the studies are not homogeneous.^1
^1 Under a continuous prior distribution, the prior probability of any particular value (such as μ = 0, which represents the proposition that the treatment is ineffective) is zero.

In order to validate these strong assumptions we may adopt the framework of Bayesian testing. Developed in the second half of the 1930s by Sir Harold Jeffreys,[Jeffreys1935; Jeffreys1939] the Bayesian testing framework seeks to grade the evidence that the data provide for or against a specific value of interest, such as μ = 0 and τ = 0, which correspond to the null model of no effect and to the fixed-effect model, respectively. Jeffreys argued that the testing question logically precedes the estimation question, and that more complex models (e.g., the models used for estimation, where μ and τ are free parameters) ought to be adopted only after the data provide positive evidence in their favor: “Until such evidence is actually produced the simpler hypothesis holds the field; the onus of proof is always on the advocate of the more complicated hypothesis.” (Jeffreys1937SI, p. 252)

In the context of meta-analysis, Jeffreys's statistical philosophy demands that we acknowledge not only the uncertainty in the parameter values given a specific model, but also the uncertainty in the underlying models to which the parameters belong. Both types of uncertainty can be assessed and updated using a procedure known as Bayesian model-averaged (BMA) meta-analysis.^2 The BMA procedure applies different meta-analytic models to the data simultaneously and draws inferences by taking all models into account, with each model's impact determined by its predictive performance for the observed data.[GronauEtAl2017PowerPose; HinneEtAl2020; gronau2020primer]

^2 A different approach is Bayesian model selection based on posterior (out-of-sample) predictive performance, such as DIC/WAIC/LOO.[yao2018using; VehtariEtAl2017] However, these approaches are unable to provide compelling support in favor of simple models.[GronauWagenmakers2019LOO1; GronauWagenmakers2019LOO2]
As in other applications of Bayesian statistics, BMA requires that all parameters be assigned prior distributions. However, in contrast to Bayesian estimation, Bayesian testing does not permit the specification of vague or “uninformative” prior distributions on the parameters of interest: vague prior distributions assign most prior mass to implausibly large values, resulting in poor predictive performance.[Jeffreys1939; gronau2020primer; KassRaftery1995] In BMA, the relative impact of the models is determined by their predictive performance, and predictive performance is in turn determined partly by the prior distribution on the model parameters. In objective Bayesian statistics,[ConsonniEtAl2018] so-called default prior distributions have been proposed; these default distributions meet a list of desiderata[BayarriEtAl2012] and are intended for general use in testing. In contrast to this work, here we seek to construct and compare different prior distributions based on existing medical knowledge.[turner2015predictive; pullenayegum2011informed; rhodes2015predictive] Specifically, we propose empirical prior distributions for μ and τ as applied to meta-analyses of continuous outcomes in medicine. To this aim we first used 50% of the CDSR to develop candidate prior distributions and then used the remaining 50% of the CDSR to evaluate their predictive accuracy and that of the associated models.

Below we first outline the BMA approach to meta-analysis and then present the results of a preregistered analysis procedure to obtain and assess empirical prior distributions for μ and τ for the medical field as a whole. Next we propose empirical prior distributions for the 46 specific medical subdisciplines defined by the CDSR. Finally, we demonstrate with a concrete example how our results can be applied in practice using the open-source statistical programs R and JASP.[JASP15]
2 Bayesian Model-Averaged Meta-Analysis
The standard Bayesian random-effects meta-analysis assumes that a latent individual study effect is drawn from a Gaussian group-level distribution with mean treatment effect μ and between-study heterogeneity τ.[SpiegelhalterEtAl1994; SuttonAbrams2001] Inference then concerns the posterior distributions for μ and τ. This estimation approach allows researchers to answer important questions such as “given that the treatment effect is nonzero, how large is it?”^3 and “given that there is between-study heterogeneity, how large is it?” Because the standard model assumes that the effect is nonzero, it cannot address the arguably more fundamental questions that involve a hypothesis test,[HaafEtAl2019Nature; Jeffreys1961] such as “how strong is the evidence in favor of the presence or absence of a treatment effect?” and “how strong is the evidence in favor of between-study heterogeneity (i.e., a nonzero between-study standard deviation) versus homogeneity?” (Fisher1928, p. 274) Here we outline a BMA approach that allows for both hypothesis testing and parameter estimation in a single statistical framework.[RoverEtAl2019]

^3 More specific versions of this generic question are “given that the treatment effect is nonzero, is it positive or negative?” and “given that the treatment effect is nonzero, what is the posterior probability that it falls in a particular interval?”

Our generic meta-analysis setup[GronauEtAl2017PowerPose; HinneEtAl2020; LandyEtAl2020; ScheibehenneEtAl2017Reply] (for the conceptual basis see Jeffreys[Jeffreys1939, pp. 276-277 and p. 296]) consists of the following four qualitatively different candidate hypotheses:^4

^4 The terms ‘hypothesis’ and ‘model’ are used interchangeably.

the fixed-effect null hypothesis H0: μ = 0, τ = 0;

the fixed-effect alternative hypothesis H1: μ ~ g(μ), τ = 0;

the random-effects null hypothesis Hr0: μ = 0, τ ~ g(τ);

the random-effects alternative hypothesis Hr1: μ ~ g(μ), τ ~ g(τ),

where μ represents the group-level mean treatment effect, τ represents the between-study standard deviation (i.e., the treatment heterogeneity), and g(μ) and g(τ) represent prior distributions that quantify the uncertainty about μ and τ, respectively. The four prior probabilities of the rival hypotheses are denoted p(H0), p(H1), p(Hr0), and p(Hr1); these may or may not be set to 1/4, reflecting a position of prior equipoise. The main advantage of this framework is that it does not fully commit to any single model on purely a priori grounds. Although in many situations the random-effects alternative hypothesis Hr1 is an attractive option, it may be less appropriate when the number of studies is small; in addition, as mentioned above, Hr1 assumes the effect to be present, whereas assessing the degree to which the data undercut or support this assumption may often be one of the primary inferential goals.

In our framework, after specifying the requisite prior distributions g(μ) and g(τ), the data drive an update from prior to posterior model probabilities, and pertinent conclusions are then drawn using BMA.[Jevons18741913; HoetingEtAl1999]
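The four-model machinery can be sketched numerically. The following Python snippet is our illustration, not the implementation used in this paper: it computes the marginal likelihood of each hypothesis by brute-force numerical integration for a handful of hypothetical study effects, using placeholder priors g(μ) = N(0, 0.5²) and a half-normal g(τ).

```python
import numpy as np
from scipy import stats, integrate

y  = np.array([0.31, 0.12, 0.45, 0.28])   # hypothetical study effect sizes (SMD)
se = np.array([0.15, 0.20, 0.18, 0.12])   # hypothetical standard errors

def lik(mu, tau):
    """Likelihood of the data given mu and tau: y_i ~ N(mu, se_i^2 + tau^2)."""
    return np.prod(stats.norm.pdf(y, loc=mu, scale=np.sqrt(se**2 + tau**2)))

g_mu  = lambda mu:  stats.norm.pdf(mu, 0, 0.5)       # placeholder prior on mu
g_tau = lambda tau: 2 * stats.norm.pdf(tau, 0, 0.5)  # placeholder half-normal prior on tau

# Marginal likelihood of each hypothesis (free parameters integrated out).
m_H0  = lik(0.0, 0.0)
m_H1  = integrate.quad(lambda mu: lik(mu, 0.0) * g_mu(mu), -5, 5)[0]
m_Hr0 = integrate.quad(lambda tau: lik(0.0, tau) * g_tau(tau), 0, 5)[0]
m_Hr1 = integrate.dblquad(lambda tau, mu: lik(mu, tau) * g_mu(mu) * g_tau(tau),
                          -5, 5, 0, 5)[0]

# Posterior model probabilities under equal (1/4) prior model probabilities.
m = np.array([m_H0, m_H1, m_Hr0, m_Hr1])
post = m / m.sum()
print(dict(zip(["H0", "H1", "Hr0", "Hr1"], post.round(3))))
```

Normalizing the four marginal likelihoods in this way yields the posterior model probabilities on which the inclusion odds below are based.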
Specifically, the posterior odds of an effect being present, based on observed data y, is the ratio of the sum of the posterior model probabilities for H1 and Hr1 over the sum of the posterior model probabilities for H0 and Hr0:

[p(H1 | y) + p(Hr1 | y)] / [p(H0 | y) + p(Hr0 | y)].

In model-averaging terms, this quantity is referred to as the posterior inclusion odds, as it refers to the post-data odds of ‘including’ the effect size parameter μ. As a measure of evidence, one may consider the change, brought about by the data, from prior inclusion odds to posterior inclusion odds. This change is known as the inclusion Bayes factor[KassRaftery1995; Jeffreys1961; EtzWagenmakers2017]:
BF10 = ( [p(H1 | y) + p(Hr1 | y)] / [p(H0 | y) + p(Hr0 | y)] ) / ( [p(H1) + p(Hr1)] / [p(H0) + p(Hr0)] ).   (1)
One may similarly assess the posterior odds for the presence of heterogeneity by contrasting Hr0 and Hr1 with H0 and H1:

[p(Hr0 | y) + p(Hr1 | y)] / [p(H0 | y) + p(H1 | y)],

or one may quantify evidence by the change from prior to posterior inclusion odds:
BFrf = ( [p(Hr0 | y) + p(Hr1 | y)] / [p(H0 | y) + p(H1 | y)] ) / ( [p(Hr0) + p(Hr1)] / [p(H0) + p(H1)] ).   (2)
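Both inclusion Bayes factors reduce to simple arithmetic on the four model probabilities. A minimal sketch (the function name `inclusion_bf` and the posterior probabilities are hypothetical, not taken from the paper):

```python
def inclusion_bf(prior, post, include):
    """Posterior inclusion odds divided by prior inclusion odds.

    prior, post: dicts mapping model name -> probability;
    include: the set of models that 'include' the parameter of interest."""
    post_in  = sum(post[m] for m in include)
    prior_in = sum(prior[m] for m in include)
    return (post_in / (1 - post_in)) / (prior_in / (1 - prior_in))

prior = {"H0": 0.25, "H1": 0.25, "Hr0": 0.25, "Hr1": 0.25}  # equipoise
post  = {"H0": 0.10, "H1": 0.20, "Hr0": 0.15, "Hr1": 0.55}  # hypothetical

bf_effect        = inclusion_bf(prior, post, include={"H1", "Hr1"})   # Eq. (1)
bf_heterogeneity = inclusion_bf(prior, post, include={"Hr0", "Hr1"})  # Eq. (2)
print(round(bf_effect, 2), round(bf_heterogeneity, 2))
```

With these placeholder numbers the data raise the inclusion odds for the effect by a factor of 3, and those for heterogeneity by a factor of about 2.33.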
An attractive feature of this framework is that it allows a graceful data-driven transition from an emphasis on fixed-effect models to random-effects models: with only a few studies available, the fixed-effect models likely outpredict the random-effects models and therefore receive more weight; but as studies accumulate, and it becomes increasingly apparent that the treatment effect is indeed random, the influence of the random-effects models will wax and that of the fixed-effect models will wane, until inference is dominated by the random-effects models. In addition, the Bayesian framework allows researchers to monitor the evidence as studies accumulate, without the need for corrections for optional stopping.[BergerWolpert1988] This is particularly relevant because the accumulation of studies is usually not under the control of a central agency, and the stopping rule is ill-defined.[ter_schure_accumulation_2019]
Although theoretically promising, the practical challenge for our BMA meta-analysis is to determine appropriate prior distributions for μ and τ. Prior distributions that are too wide waste prior mass on highly implausible parameter values, thus incurring a penalty for complexity that could have been avoided by applying a more reasonably peaked prior distribution. On the other hand, prior distributions that are too narrow represent a highly risky bet: if the effect is not exactly where the peaked prior distribution guesses it to be, the model incurs a hefty penalty for predicting the data poorly, a penalty that could have been avoided by reasonably widening the prior distribution. There is no principled way around this dilemma: Bayes' rule dictates that evidence is quantified by predictive success, and predictions follow from the prior predictive distributions.[Evans2015] Thus, when the goal is to quantify evidence, the prior distributions warrant careful consideration.[GronauEtAl2020Informed; LambertEtAl2005; OHaganEtAl2006; OHagan2019; TurnerEtAl2012]
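The dilemma can be made concrete with a toy conjugate example (our illustration; all numbers are hypothetical): a single observed effect y ~ N(μ, σ²) with prior μ ~ N(0, s²) has prior predictive density N(0, σ² + s²), so the prior scale s directly controls predictive success.

```python
from math import sqrt
from scipy.stats import norm

y, sigma = 0.3, 0.1  # hypothetical observed effect and its standard error

def marginal_density(prior_scale):
    """Prior predictive density of y under mu ~ N(0, prior_scale^2)."""
    return norm.pdf(y, loc=0.0, scale=sqrt(sigma**2 + prior_scale**2))

too_narrow = marginal_density(0.01)  # bets nearly everything on mu being close to 0
moderate   = marginal_density(0.5)   # covers plausible effect sizes
too_wide   = marginal_density(10.0)  # wastes mass on implausible values

# The moderately informative prior predicts the datum best.
assert moderate > too_narrow and moderate > too_wide
print(round(too_narrow, 3), round(moderate, 3), round(too_wide, 3))
```

Both the overly narrow and the overly wide prior lose out to the moderate one, mirroring the penalties described above.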
Fortunately, the framework presented here contains only two key parameters, μ and τ; moreover, a large clinical literature is available to help guide the specification of reasonable prior distributions. Our goal in this work is to use meta-analyses from the CDSR to create a series of informed prior distributions for both the effect size parameter μ and the between-study heterogeneity parameter τ.[higgins2009re; turner2015predictive; pullenayegum2011informed; rhodes2015predictive] We then assess the predictive adequacy of the various models in conjunction with the prior distributions on a hold-out validation set.

3 Candidate Prior Distributions
We developed and assessed prior distributions for the μ and τ parameters suitable for BMA of continuous outcomes using data from the CDSR.^5 In the remainder of this work we adopt the terminology of Higgins et al.:[higgins2019cochrane] individual meta-analyses included in each Cochrane review are referred to as ‘comparisons’ and individual studies included in a comparison are referred to as ‘studies’. All analyses were conducted using Cohen's standardized mean differences (SMD). The analyses presented in this section were executed in accordance with a preregistration protocol (https://osf.io/zs3df/) unless explicitly mentioned otherwise.

^5 We identified systematic reviews in the CDSR through PubMed, limiting the period to Jan 2000 – May 2020. For that we used the NCBI's E-Utilities API with the following query: “Cochrane Database Syst Rev”[journal] AND (“2000/01/01”[PDAT]: “2020/05/31”[PDAT]). For each review, we downloaded the XML meta-analysis table file (rm5 format) associated with the review's latest version. We extracted the tables with continuous outcomes (i.e., MD and SMD) from these rm5 files with a custom PHP script. We labeled the tables with the Cochrane Review Group's taxonomy list for subfield analysis.
In order to assess the predictive adequacy of the various prior distributions and models, we first randomly partitioned the data of the Cochrane reviews into a training set and a test set. The training set consisted of 3,092 comparisons with a total of 23,333 individual studies, and the test set consisted of 3,091 comparisons with a total of 22,117 individual studies. We used the training set to develop prior distributions for the μ and τ parameters and then assessed predictive accuracy using the test set.
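The partition operates at the level of comparisons, so that all studies belonging to one comparison end up on the same side of the split. A minimal sketch (the comparison identifiers and seed are hypothetical):

```python
import random

# Hypothetical comparison identifiers; in the actual analysis these are
# the CDSR comparisons.
comparisons = [f"CD{i:06d}.{j}" for i in range(1, 101) for j in (1, 2)]

rng = random.Random(2020)  # fixed seed for reproducibility (illustrative)
shuffled = comparisons[:]
rng.shuffle(shuffled)

half = len(shuffled) // 2
training, test = shuffled[:half], shuffled[half:]

assert len(training) + len(test) == len(comparisons)
assert not set(training) & set(test)  # the two sets are disjoint
```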
3.1 Developing Prior Distributions Based on the Training Set
The left panel of Figure 1 outlines the data processing steps performed on the training set (further details are provided in the preregistration protocol, https://osf.io/zs3df/). First, in order to ensure that the training set yields estimates of τ that form a reliable basis for the construction of a prior distribution, we excluded comparisons with fewer than 10 studies. Second, we excluded comparisons for which at least one individual study was reported by the authors of the review to be “non-estimable” (i.e., the effect size of the original study could not be retrieved). Third, we transformed the reported raw mean differences to the SMD using the metafor R package.[viechtbauer2010metafor] Fourth, to ensure high consistency of the meta-analytic estimates, we re-estimated all comparisons using a frequentist random-effects meta-analytic model with the restricted maximum likelihood estimator in the metafor R package.[viechtbauer2010metafor] These steps resulted in a final training set featuring 423 comparisons containing a total of 8,044 individual studies. The histograms and tick marks in Figure 2 display the μ and τ estimates from each comparison in the training set.^6

^6 As specified in the preregistration protocol, we assumed that τ estimates below a small preregistered threshold are representative of τ = 0, and therefore these estimates were not used to determine candidate prior distributions for τ.
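To illustrate the re-estimation step, the sketch below uses the closed-form DerSimonian-Laird moment estimator as a stand-in for the REML estimator actually used (REML requires iterative optimization; metafor's `rma()` provides both). All input numbers are hypothetical.

```python
import numpy as np

def dersimonian_laird(y, v):
    """Random-effects estimates (mu_hat, tau2_hat) from effects y and variances v."""
    w = 1.0 / v
    mu_fe = np.sum(w * y) / np.sum(w)        # fixed-effect estimate
    q = np.sum(w * (y - mu_fe) ** 2)         # Cochran's Q statistic
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)  # moment estimate, truncated at zero
    w_re = 1.0 / (v + tau2)                  # random-effects weights
    mu_re = np.sum(w_re * y) / np.sum(w_re)
    return mu_re, tau2

y = np.array([0.05, 0.62, 0.10, 0.55, 0.30])  # hypothetical SMDs
v = np.array([0.04, 0.03, 0.05, 0.02, 0.04])  # hypothetical sampling variances
mu_hat, tau2_hat = dersimonian_laird(y, v)
print(round(mu_hat, 3), round(tau2_hat, 4))
```

The pairs (μ̂, τ̂) produced by this kind of re-estimation, one per comparison, are the raw material for the candidate prior distributions developed next.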
To develop candidate prior distributions for the μ and τ parameters, we used the maximum likelihood estimator implemented in the fitdistrplus R package[fitdistrplus] to fit several distributions to the frequentist meta-analytic estimates from the training set. For the μ parameter, we considered normal and Student's t distributions fitted to the training set and compared them to an uninformed Cauchy distribution with a default scale (a common choice in the field of psychology).[morey2015bayesfactor] For the τ parameter, we considered half-normal, inverse-gamma, and gamma distributions fitted to the training set and compared them to an uninformed uniform distribution on the range from 0 to 1.[LambertEtAl2005] The resulting distributions are summarized in Table 1 and their fit to the training set is visualized in Figure 2.
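The fitting step can be mimicked outside R. The sketch below applies scipy's maximum likelihood fitting to simulated stand-ins for the training-set estimates (the actual analysis used fitdistrplus; the simulated data and parameter values are placeholders).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu_estimates  = rng.standard_t(df=5, size=400) * 0.4              # stand-in mu estimates
tau_estimates = stats.invgamma(a=2, scale=0.3).rvs(400, random_state=rng)  # stand-in tau estimates

# Candidate priors for mu: normal and Student's t, fitted by maximum likelihood.
loc_n, scale_n = stats.norm.fit(mu_estimates)
df_t, loc_t, scale_t = stats.t.fit(mu_estimates)

# Candidate prior for tau: inverse-gamma with location pinned at zero.
shape, _, scale_ig = stats.invgamma.fit(tau_estimates, floc=0)

print(round(scale_n, 2), round(scale_t, 2), round(shape, 2), round(scale_ig, 2))
```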

Table 1: Prior distributions for the μ and τ parameters as obtained from the training set. The inverse-gamma and gamma distributions follow the shape and scale parametrization, and the Student's t distributions follow the location, scale, and degrees-of-freedom parametrization. See Figure 2.

3.2 Assessing Prior Distributions Based on the Test Set
The right panel of Figure 1 outlines the data processing steps performed on the test set. As with the training set, we removed non-estimable comparisons and transformed all effect sizes to the SMD. In contrast to the training set, however, we retained all comparisons featuring at least 3 studies: there is no reason to limit the assessment of predictive performance to comparisons with at least 10 studies. These data processing steps resulted in a final test set consisting of 2,416 comparisons containing a total of 18,479 individual studies. The median number of studies in a comparison was 5, with an interquartile range from 3 to 9.
For each possible pair of candidate prior distributions depicted in Table 1,^7 we computed posterior model probabilities and model-averaged Bayes factors with the metaBMA R package,[heck2017metabma] which uses numerical integration and bridge sampling.[GronauEtAl2017BSTutorial; GronauEtAl2020JSS; MengWong1996]

^7 We failed to estimate all BMA meta-analytic models for ten comparisons due to large outliers.

3.2.1 Performance of Prior Distribution Configurations Under Hr1
In the first analysis we evaluate the predictive performance associated with the different prior distribution configurations as implemented in Hr1, that is, the random-effects model that allows both μ and τ to be estimated from the data. Specifically, under Hr1 there are 12 prior configurations, and each is viewed as a model yielding predictions. The prior probability of each prior configuration is 1/12, and the predictive accuracy of each prior configuration is assessed with the 2,406 comparisons from the test set. Table 2 lists the 12 different prior configurations and summarizes the number of times their posterior probability attained each rank. The results show that the informed configurations generally outperformed the uninformed configurations (i.e., those involving the Cauchy distribution on μ and the uniform distribution on τ). The worst ranking performance was obtained with prior configuration 1 (i.e., uninformed distributions for both μ and τ). Prior configurations 2, 3, 4, 5, and 9 feature an uninformed prior distribution on either μ or τ, and also did not perform well in terms of posterior rankings. The same holds for the prior distribution configurations with the half-normal prior distribution on the τ parameter (i.e., prior configurations 2, 6, and 10). The best performing prior distribution configurations (i.e., numbers 7, 11, and 12) used more data-driven prior distributions for both μ (i.e., the fitted normal or t distributions) and τ (i.e., the fitted inverse-gamma or gamma distributions).
Table 2: Posterior-probability rankings of the 12 prior distribution configurations across the test-set comparisons (cell entries give the number of comparisons at each rank).

Configuration | Rank 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12
1. | 67 | 74 | 35 | 92 | 51 | 56 | 128 | 89 | 91 | 111 | 128 | 1484
2. | 7 | 54 | 39 | 38 | 62 | 82 | 103 | 170 | 134 | 327 | 1390 | 0
3. | 54 | 17 | 46 | 64 | 35 | 55 | 109 | 455 | 706 | 163 | 133 | 569
4. | 9 | 23 | 41 | 51 | 73 | 55 | 62 | 84 | 493 | 1249 | 261 | 5
5. | 282 | 180 | 50 | 139 | 75 | 101 | 155 | 853 | 443 | 59 | 39 | 30
6. | 8 | 110 | 350 | 217 | 253 | 470 | 933 | 38 | 13 | 9 | 5 | 0
7. | 247 | 226 | 211 | 724 | 143 | 120 | 106 | 229 | 139 | 191 | 66 | 4
8. | 123 | 189 | 128 | 230 | 677 | 808 | 196 | 22 | 15 | 6 | 8 | 4
9. | 227 | 162 | 79 | 283 | 549 | 303 | 304 | 166 | 72 | 55 | 94 | 112
10. | 30 | 197 | 1051 | 257 | 233 | 199 | 111 | 93 | 99 | 83 | 53 | 0
11. | 1122 | 207 | 111 | 137 | 60 | 47 | 101 | 140 | 92 | 52 | 155 | 182
12. | 230 | 967 | 265 | 174 | 195 | 110 | 98 | 67 | 109 | 101 | 74 | 16
Figure 3 displays the posterior probability of each of the 12 prior distribution configurations across the 2,406 comparisons. The color gradient ranges from white (representing low posterior probability) to dark red (representing high posterior probability). Figure 3 shows that, on average, the different prior distribution configurations perform similarly. As suggested by the posterior rankings from Table 2, configuration 1 predicted the data relatively poorly, resulting in the lowest average posterior probability; in contrast, configuration 11 predicted the data relatively well, resulting in the highest average posterior probability. However, these average posterior probabilities differ only modestly from the prior probability of 1/12, and the Bayes factor associated with the comparison of the posterior probabilities of these two configurations is less than 2.
In sum, among the 12 prior configurations under Hr1, the best predictive performance was consistently obtained by data-driven priors, and the worst predictive performance was obtained by uninformed priors (cf. the posterior rankings in Table 2). However, the extent of this predictive advantage is relatively modest: starting from a prior probability of 1/12, the average posterior probabilities of even the worst and the best prior configurations deviate only slightly from this value (cf. Figure 3).^8

^8 This modest change in plausibility is consistent with the fact that for overlapping prior distributions, there is an asymptotic limit to the evidence that is given by the ratio of prior ordinates evaluated at the maximum likelihood estimate.[LyWagenmakerssubmPeri]
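The limit mentioned in footnote 8 is easy to verify numerically: when the likelihood is highly peaked, the marginal likelihood under each prior approaches the prior ordinate at the estimate, so the Bayes factor between two models that differ only in their prior approaches the ratio of those ordinates. A sketch with illustrative priors and numbers:

```python
from scipy import stats, integrate
from math import sqrt

mle, se = 0.25, 0.02  # a very precise estimate, mimicking a large sample

def marginal(prior_pdf):
    """Marginal likelihood of a N(mle, se^2) likelihood under a given prior."""
    f = lambda mu: stats.norm.pdf(mle, loc=mu, scale=se) * prior_pdf(mu)
    return integrate.quad(f, -3, 3, points=[mle])[0]

prior_a = lambda mu: stats.norm.pdf(mu, 0, 0.5)      # a peaked, informed prior
prior_b = lambda mu: stats.cauchy.pdf(mu, 0, 0.707)  # a wider default prior

bf_ab = marginal(prior_a) / marginal(prior_b)
ordinate_ratio = prior_a(mle) / prior_b(mle)
print(round(bf_ab, 3), round(ordinate_ratio, 3))  # nearly identical
```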
3.2.2 Posterior Probability of the Four Model Types
In the second analysis we evaluate the predictive performance of the four meta-analytic model types (i.e., H0, H1, Hr0, and Hr1) by model-averaging across all prior distribution configurations, separately for each of the 2,406 comparisons. Table 3 shows the prior model probabilities obtained by first assigning probability 1/4 to each of the four model types, and then spreading that probability out evenly across the constituent prior distribution configurations. (Jeffreys1961, p. 47)
Table 3: Prior model probabilities and prior configuration probabilities.

Model | Prior Model Probability | Prior Configuration (Effect Size, Heterogeneity) | Prior Configuration Probability
For any particular comparison, a model type's model-averaged posterior probability is obtained by summing the posterior probabilities of its constituent prior distribution configurations. For example, the model-averaged posterior probability for Hr1 is obtained by summing the posterior probabilities of its 12 possible prior configurations, each of them associated with prior probability 1/48 (cf. Table 3).
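The summation scheme can be sketched as follows, assuming one configuration for H0 (no free parameters), three for H1 (the three μ priors), four for Hr0 (the four τ priors), and twelve for Hr1; the marginal likelihoods are random placeholders, not data from the paper.

```python
import random

random.seed(0)
n_config = {"H0": 1, "H1": 3, "Hr0": 4, "Hr1": 12}
prior = {m: 0.25 / k for m, k in n_config.items()}  # e.g., 1/48 per Hr1 configuration

# Hypothetical marginal likelihoods for every (model type, configuration) pair.
marg = {(m, c): random.uniform(0.1, 1.0)
        for m, k in n_config.items() for c in range(k)}

evidence = sum(prior[m] * ml for (m, c), ml in marg.items())
posterior = {m: sum(prior[m] * marg[(m, c)] for c in range(k)) / evidence
             for m, k in n_config.items()}

assert abs(sum(posterior.values()) - 1) < 1e-12
print({m: round(p, 3) for m, p in posterior.items()})
```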
Table 4 lists the four model types and summarizes the number of times their model-averaged posterior probability attained each rank. The results show that, across all comparisons, complex models generally received more support than simple models. The model that predicted the data best was Hr1, the random-effects model that assumes the presence of an effect; the model that predicted the data worst was H0, the fixed-effect model that assumes the absence of an effect. However, even H0 outpredicted the other three model types in roughly 28% of the comparisons. Table 4 also shows the model-averaged posterior model probability averaged across all comparisons. In line with the ranking results, the average probability for H0 decreased from .25 to .19, whereas that for Hr1 increased from .25 to .36. Nevertheless, the support for Hr1 across all comparisons is not overwhelming and does not appear to provide an empirical license to ignore H0 (or any of the other three model types) from the outset.
Table 4: Rankings of the model-averaged posterior probabilities of the four model types across the test-set comparisons.

Model | Rank 1 | 2 | 3 | 4 | PrMP* | Av. PoMP**
H0 | 662 | 177 | 183 | 1382 | .25 | .19
H1 | 573 | 334 | 1235 | 262 | .25 | .22
Hr0 | 406 | 1158 | 790 | 52 | .25 | .24
Hr1 | 765 | 737 | 196 | 708 | .25 | .36

*Prior model probability. **Average posterior model probability.
The left panel of Figure 4 displays the model-averaged posterior probability of each model type across the 2,406 comparisons. It is apparent that the posterior probability is highest for Hr1. However, for a substantial number of comparisons (roughly 68%, cf. Table 4) a different model type performs better. For instance – and in contrast to popular belief – the fixed-effect models H0 and H1 together show the best predictive performance in a slight majority of comparisons (roughly 51%).
3.2.3 Inclusion Bayes Factors
In the third analysis we assess the inclusion Bayes factors for a treatment effect (cf. Eq. 1) and for heterogeneity (cf. Eq. 2); that is, we model-average across all prior distribution configurations and across two model types, separately for each of the 2,406 comparisons. First, the inclusion Bayes factor BF10 for a treatment effect quantifies the evidence that the data provide for the presence vs. the absence of a group-level effect, taking into account the model uncertainty associated with whether the effect is fixed or random. The left panel of Figure 5 displays a histogram of the log of the model-averaged BF10 for the test set featuring 2,406 comparisons. The histogram is noticeably right-skewed, which affirms the regularity that it is easier to obtain compelling evidence for the presence than for the absence of an effect. (Jeffreys1939, pp. 196-197)^9 Evidence for the presence of an effect was obtained in a small majority of the comparisons.

Second, the inclusion Bayes factor for heterogeneity quantifies the evidence that the data provide for the presence vs. the absence of between-study variability, taking into account the model uncertainty associated with whether the group-level effect is present or absent. The right panel of Figure 5 displays a histogram of the log of the model-averaged inclusion Bayes factor for heterogeneity for the test set featuring 2,406 comparisons. The right skew again confirms the regularity: it is easier to find compelling evidence for heterogeneity than for homogeneity. Nevertheless, the data provide evidence for heterogeneity in only a slight majority of the comparisons.

^9 This is the case because most prior distributions assign considerable mass to values near the test-relevant one; it is not a universal property of Bayes factors.[JohnsonRossell2010]
In sum, the inclusion Bayes factors revealed that in nearly half of the comparisons from the test set, the data provide evidence in favor of the absence of a treatment effect and in favor of the absence of heterogeneity. The distribution of the log Bayes factors is asymmetric, indicating that it is easier to obtain compelling evidence for the presence of a treatment effect (rather than for its absence) and for the presence of heterogeneity (rather than for homogeneity).
3.3 Exploratory Analysis: Model-Averaging Across Prior Distributions Under Hr1
To further investigate the predictive performance of the prior distributions, we performed one additional analysis that was not preregistered in the original analysis plan. We focused on Hr1 and evaluated the prior distributions for each parameter by model-averaging across the possible prior distributions for the other parameter. For instance, to obtain the model-averaged posterior probability of the Cauchy prior distribution on the μ parameter, we consider the posterior probabilities of all 12 possible prior configurations and then sum across the four configurations that pair the Cauchy distribution with each of the possible prior distributions on the τ parameter (cf. Table 3). This way we obtain an assessment of the relative predictive performance of a particular prior distribution, averaging over the uncertainty about the prior distribution for the other parameter.
Table 5: Rankings of the model-averaged posterior probabilities of the candidate prior distributions under Hr1.

Prior distribution | Rank 1 | 2 | 3 | 4 | PrMP* | Av. PoMP**

μ parameter
 | 142 | 199 | 2065 | – | .33 | 0.25
 | 655 | 1727 | 24 | – | .33 | 0.35
 | 1609 | 480 | 317 | – | .33 | 0.39

τ parameter
 | 576 | 83 | 116 | 1631 | .25 | 0.23
 | 47 | 772 | 1587 | 0 | .25 | 0.25
 | 1418 | 172 | 67 | 749 | .25 | 0.27
 | 365 | 1379 | 636 | 26 | .25 | 0.25

*Prior model probability. **Average posterior model probability.
Table 5 lists the prior distributions and gives the number of times their model-averaged posterior probability attained each rank. Consistent with the results reported earlier, the more data-driven prior distributions generally received more support than the prior distributions that are less informed. For the μ parameter, the best performing prior distribution was the fitted Student's t distribution; for the τ parameter, the best performing prior distribution was the fitted inverse-gamma distribution. Although the preference for the data-driven prior distributions is relatively consistent, it is not particularly pronounced, echoing the earlier results. Specifically, Table 5 also shows the model-averaged posterior model probability averaged across all comparisons. For the μ parameter, the best performing prior distribution has a model-averaged posterior probability of 0.39 (up from .33), but the uninformed Cauchy prior retains a non-negligible probability of 0.25. For the τ parameter, the different prior distributions perform even more similarly: on average, the worst prior distribution has a model-averaged posterior probability of 0.23, down from .25 but only a little, and the on-average best prior distribution (the inverse-gamma) has a model-averaged posterior probability of 0.27, only modestly larger than .25.
The right panel of Figure 4 displays the model-averaged posterior probability of each prior distribution across the 2,406 comparisons. The figure confirms that the data-driven prior distributions perform somewhat better than the relatively uninformed prior distributions: the color band is darker red, on average, for the prior distributions with the highest posterior model probabilities, that is, the fitted Student's t distribution for μ and the fitted inverse-gamma distribution for τ.
4 Exploratory Analysis: Subfield-Specific Prior Distributions
Medical subfields may differ both in the typical size of the effects and in their degree of heterogeneity. In recognition of this fact we sought to develop empirical prior distributions for μ and τ that are subfield-specific. We differentiated between 47 medical subfields according to the taxonomy of the Cochrane Review Group. Based on their relatively good predictive performance detailed in the previous sections, we selected a Student's t distribution for the μ parameter and an inverse-gamma distribution for the τ parameter, with parameters estimated separately for each subfield.
To estimate the parameters of these distributions separately for each subfield, we used the complete data set and proceeded analogously to the training set preparation: we removed comparisons with non-estimable studies, used only comparisons with at least ten studies, re-estimated the comparisons with the restricted maximum likelihood estimator, and removed comparisons with very small τ estimates. These frequentist estimates were used as input for constructing the data-driven subfield-specific prior distributions. However, since many subfields contain only a limited number of comparisons, we used Bayesian hierarchical estimation with weakly informative priors on the hyperparameters. The hierarchical aspect of the estimation procedure shrinks the estimated parameter values towards the grand mean, a tendency that is more pronounced when the estimated field-specific value is both extreme and based on relatively little information.[LeeWagenmakersBayesBook; McElreath2016; RouderLu2005] Specifically, we assumed that all field-specific parameters (i.e., the scale and degrees of freedom of the Student's t distribution and the shape and scale of the inverse-gamma distribution) are governed by positive-only normal distributions. For the Student's t distribution, we assigned positive-only Cauchy prior distributions both to the across-field normal mean and to the across-field normal standard deviation of each parameter. For the inverse-gamma distribution, we likewise assigned positive-only Cauchy prior distributions to the across-field normal means and standard deviations of the shape and scale parameters. The hierarchical models were estimated using the rstan R package[rstan] that interfaces with the Stan probabilistic modeling language.[stan] The Stan code is available alongside the supplementary materials at https://osf.io/zs3df/.

Table 6 lists the 46 different subfields (the 47th subfield, “Multiple Sclerosis and Rare Diseases of the CNS”, featured two comparisons, both of which were excluded based on the inclusion criteria), the associated number of comparisons and studies, and the estimated distributions for both μ and τ. The scale estimates for the μ parameter show considerable variation, with the smallest scale obtained for “Developmental, Psychosocial and Learning Problems” and the largest (0.60) for “Hepato-Biliary”.^10 A similar variation is present in the estimated distributions for the τ parameter. Figure 6 visualizes the prior distributions for each subfield.

^10 Notice that the most extreme estimates come from fields with a relatively large number of comparisons, as these estimates are less subject to shrinkage.
Topic  Comparisons  Studies  Effect Size Prior  Heterogeneity Prior 

Acute Respiratory Infections  6  104  
Airways  46  815  
Anaesthesia  44  661  
Back and Neck  13  278  
Bone, Joint and Muscle Trauma  32  1221  
Colorectal  13  372  
Common Mental Disorders  17  264  
Consumers and Communication  6  72  
Cystic Fibrosis and Genetic Disorders  1  12  
Dementia and Cognitive Improvement  9  197  
Developmental, Psychosocial and Learning Problems  20  407  
Drugs and Alcohol  8  170  
Effective Practice and Organisation of Care  10  204  
Emergency and Critical Care  9  214  
ENT  17  273  
Eyes and Vision  14  347  
Gynaecological, Neuro-oncology and Orphan Cancer  1  10  
Gynaecology and Fertility  14  253  
Heart  88  2112  
Hepato-Biliary  34  1103  
HIV/AIDS  2  23  
Hypertension  27  524  
Incontinence  17  219  
Infectious Diseases  8  150  
Inflammatory Bowel Disease  1  12  
Injuries  3  54  
Kidney and Transplant  39  767  
Metabolic and Endocrine Disorders  25  503  
Methodology  5  106  
Movement Disorders  5  70  
Musculoskeletal  32  778  
Neonatal  11  259  
Oral Health  10  236  t(0, 0.51, 5)  Inv-Gamma(1.79, 0.28)  
Pain, Palliative and Supportive Care  16  283  
Pregnancy and Childbirth  32  539  
Public Health  2  22  
Schizophrenia  21  436  
Sexually Transmitted Infections  9  113  
Skin  6  85  
Stroke  21  357  
Tobacco Addiction  4  44  
Upper GI and Pancreatic Diseases  1  12  
Urology  2  33  
Vascular  3  35  
Work  2  24  
Wounds  7  103  
Pooled estimate  713  14876 
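To make the tabulated prior distributions concrete, the "Oral Health" subfield priors used in the example of Section 5 (a Student-t(0, 0.51, 5) prior on effect size and an Inv-Gamma(1.79, 0.28) prior on heterogeneity) can be evaluated with a short Python sketch. The densities are computed from scratch with the standard library; this is an illustration, not part of the original analysis:

```python
import math

# Densities of the "Oral Health" subfield priors from Table 6:
#   effect size    d   ~ Student-t(location = 0, scale = 0.51, nu = 5)
#   heterogeneity  tau ~ Inverse-Gamma(shape = 1.79, scale = 0.28)

def t_pdf(x, nu, loc=0.0, scale=1.0):
    """Density of a location-scale Student-t distribution."""
    z = (x - loc) / scale
    c = math.gamma((nu + 1) / 2) / (math.gamma(nu / 2) * math.sqrt(nu * math.pi) * scale)
    return c * (1 + z * z / nu) ** (-(nu + 1) / 2)

def invgamma_pdf(x, shape, scale):
    """Density of an inverse-gamma distribution (zero for x <= 0)."""
    if x <= 0:
        return 0.0
    return scale**shape / math.gamma(shape) * x ** (-shape - 1) * math.exp(-scale / x)

# Prior density of the effect size at d = 0, and the prior mean of tau,
# which for shape a > 1 equals scale / (a - 1).
print(round(t_pdf(0.0, nu=5, scale=0.51), 3))  # → 0.744
print(round(0.28 / (1.79 - 1), 3))             # → 0.354
```

The heterogeneity prior thus concentrates tau around roughly 0.35 while keeping the mode of the effect size prior at zero, consistent with the testing framework in which the null values retain nonzero prior mass.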
5 Example: Dentine Hypersensitivity
We demonstrate BMA meta-analysis with an example from oral health. Poulsen et al.poulsen2006potassium considered the effect of potassium-containing toothpaste on dentine hypersensitivity. Five studies with a tactile outcome assessment were subjected to a meta-analysis. In their review, Poulsen et al.poulsen2006potassium reported a meta-analytic effect size estimate (with 95% CI) of potassium-containing toothpastes on reducing tactile sensitivity ("Analysis 1.1. Comparison 1 Potassium containing toothpaste (update), Outcome 1 Tactile.").
We reanalyze the Poulsen et al.poulsen2006potassium comparison using the BMA meta-analysis implementation in the open-source statistical software package JASP (jasp-stats.org).JASP15; LoveEtAl2019JASP; LyEtAl2021ISBA; vanDoornEtAlinpressJASPGuidelines; WagenmakersEtAl2018PBRPartII Appendix A provides the same analysis in R using the metaBMA package.heck2017metabma Figure 7
shows the JASP graphical user interface with the left panel specifying the analysis settings and the right panel displaying the default output. After loading the data into JASP, the BMA meta-analysis can be performed by activating the "Meta-Analysis" module after clicking the blue "+" button in the top right corner, choosing "Meta-Analysis" from the ribbon at the top, and then selecting "Bayesian Meta-Analysis" from the dropdown menu. In the left input panel, we move the study effect sizes and standard errors into the appropriate boxes and adjust the prior distributions under the "Prior" tab to match the subfield-specific distributions given in Table
6. Specifically, for the "Oral Health" subfield the prior distributions are t(0, 0.51, 5) for effect size and Inv-Gamma(1.79, 0.28) for heterogeneity.

The JASP output panel displays the corresponding BMA meta-analysis results. The "Posterior Estimates per Model" table summarizes the estimates and evidence from the fixed-effect models, the random-effects models, and finally the model-averaged results. The final row of the table shows an effect size estimate of 1.09, 95% CI [0.69, 1.43], which is slightly lower and more conservative than the one provided by the frequentist random-effects meta-analysis, further quantified with extreme evidence for the presence of an effect (inclusion Bayes factor of about 218.5) and moderate evidence for the presence of heterogeneity (inclusion Bayes factor of about 3.5). The JASP output panel also presents a forest plot that visualizes the observed effect size estimates from the individual studies, the overall fixed-effect and random-effects meta-analytic estimates, and the corresponding model-averaged effect size estimate.
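As a check on the reported evidence, the inclusion Bayes factor for the effect can be reproduced from the model posterior probabilities in the metaBMA output of Appendix A. This Python sketch hard-codes those probabilities; since all four models receive prior probability 0.25, the prior inclusion odds equal 1 and the Bayes factor reduces to the posterior inclusion odds:

```python
# Reproduce the effect-size inclusion Bayes factor from the model posterior
# probabilities reported by metaBMA (each model has prior probability 0.25,
# so the prior inclusion odds equal 1).
post = {"fixed_H0": 1.95e-20, "fixed_H1": 2.21e-01,
        "random_H0": 4.56e-03, "random_H1": 7.74e-01}

# Posterior odds of the models that include an effect (H1) versus those that do not (H0).
posterior_odds_effect = (post["fixed_H1"] + post["random_H1"]) / \
                        (post["fixed_H0"] + post["random_H0"])
print(round(posterior_odds_effect, 1))  # → 218.2
```

The result matches the inclusion Bayes factor of 218.526 reported in the output up to the rounding of the tabulated posterior probabilities.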
The JASP interface provides additional options not discussed here, such as (a) visualizing the prior and posterior distributions; (b) visualizing the estimated effect sizes from individual studies; (c) performing one-sided hypothesis tests; (d) updating evidence sequentially, study-by-study; (e) adding ordinal constraints;haaf2020does and (f) adjusting for publication bias.maier2020robust; bartos2020adjusting
6 Concluding Comments
In this article, we introduced Bayesian model-averaged meta-analysis for continuous outcomes in medicine. The proposed methodology provides a principled way to integrate, quantify, and update uncertainty regarding both parameters and models. Specifically, the methodology allows researchers to simultaneously test for and estimate effect size and heterogeneity without committing to a particular model in an all-or-none fashion. In Bayesian model-averaged meta-analysis, multiple models are considered simultaneously, and inference is proportioned to the support that each model receives from the data. This eliminates the need for stage-wise, multi-step inference procedures that first identify a single preferred model (e.g., a fixed-effect model or a random-effects model) and then interpret the model parameters without acknowledging the uncertainty inherent in the model selection stage. The multi-model approach advocated here also decreases the potential impact of model misspecification.
Bayesian model-averaged meta-analysis comes with the usual advantages of Bayesian statistics: the ability to quantify evidence for or against any hypothesis (including the null hypothesis), the ability to discriminate between absence of evidence and evidence of absence,KeysersEtAl2020; Robinson2019 the ability to monitor evidence as individual studies accumulate,ter_schure_accumulation_2019 the straightforward interpretation of the results (i.e., probability statements that refer directly to parameters and hypotheses),OHaganForster2004 and the opportunity to incorporate historical information.berry2006bayesian; hobbs2007practical In this article, our goal was to take advantage of the existing medical knowledge base in order to propose and assess prior distributions that allow for more efficient inference.
Following a preregistered analysis plan, we fitted and assessed different prior distributions for both effect size and heterogeneity using comparisons of continuous outcomes from the Cochrane Database of Systematic Reviews. We fitted prior distributions based on a training set of randomly selected comparisons, and then evaluated predictive performance on a test set. The results showed that predictive performance on the test set was relatively similar for the different data-driven prior distributions. Moreover, and in contrast to popular belief and recommendations,higgins2009re; chung2013avoiding; council1992combining; mosteller1996understanding we did not find that the random-effects meta-analytic model provided a superior account of the data: the random-effects meta-analytic models outpredicted their fixed-effect counterparts in only 51.0% of comparisons. Although the random-effects alternative hypothesis showed the best predictive performance on average, the data increased its model-averaged posterior probability from the prior value of 0.25 only modestly, leaving substantial posterior probability for the three competing model types (i.e., a model with no heterogeneity, a model without an effect, and a model with neither).
Based on the outcome of our preregistered analysis, we used the data from the Cochrane Database of Systematic Reviews to develop empirical prior distributions for continuous outcomes in 46 different medical subfields. Finally, we applied Bayesian model-averaged meta-analysis with subfield-specific prior distributions to an example from oral health, using the free statistical software packages R and JASP. We believe that the proposed Bayesian methodology provides an alternative perspective on meta-analysis that is informed, efficient, and insightful.
Acknowledgments
This work was supported by The Netherlands Organisation for Scientific Research (NWO) through a Research Talent grant (to QFG; 406.16.528), a Vici grant (to EJW; 016.Vici.170.083), and a NWA Idea Generator grant (to WMO; NWA.1228.191.045).
Financial disclosure
None reported.
Data availability statement
Data and analysis scripts are publicly available at: https://osf.io/zs3df/files/.
Conflict of interest
František Bartoš, Alexander Ly, and Eric-Jan Wagenmakers declare their involvement in the open-source software package JASP (https://jasp-stats.org), a non-commercial, publicly funded effort to make Bayesian statistics accessible to a broader group of researchers and students.
References
Appendix A R Code for the Oral Health Example
The result for the example analysis can also be obtained with the statistical programming language R.R After initializing the R session, the metaBMA R packageheck2017metabma needs to be installed with the following command (this command needs to be executed only if the package is not already present):
install.packages("metaBMA")
After the metaBMA package has been installed, it must be loaded into the session, and the data set (containing the effect sizes and standard errors from the individual studies) can be read from the "Tactile Sensitivity.csv" file, as follows:
library("metaBMA")
data <- read.csv("Tactile Sensitivity.csv")
In order to perform the BMA meta-analysis with the subfield-specific prior distributions, we first assign the column containing the effect sizes, study_effect_size, to the y argument, and the column containing the standard errors, study_se, to the SE argument (the data = data argument specifies that both columns are located in the already loaded data set called data). Then, we specify the prior distributions for the effect size, the d argument, and the heterogeneity, the tau argument, according to the "Oral Health" row in Table 6. Finally, the control argument specifies an additional setting for the Markov chain Monte Carlo (MCMC) sampler, increasing the target acceptance probability from the default value of 0.80 to 0.90, which is often required in cases with a small number of studies (see https://mc-stan.org/misc/warnings.html for more information about Stan's warnings and possible solutions):

fit_example <- meta_bma(y = study_effect_size, SE = study_se, data = data,
                        d = prior("t", c(location = 0, scale = 0.51, nu = 5)),
                        tau = prior("invgamma", c(shape = 1.79, scale = 0.28)),
                        control = list(adapt_delta = .90))
To obtain the numerical summaries of the estimated model, we just execute the name of the object containing the fitted model:
fit_example
The resulting output corresponds to the JASP output (up to MCMC error):
> fit_example

### Meta-Analysis with Bayesian Model Averaging ###

   Fixed H0:  d = 0
   Fixed H1:  d ~ 't' (location=0, scale=0.51, nu=5)
              with support on the interval [-Inf,Inf].
   Random H0: d = 0,
              tau ~ 'invgamma' (shape=1.79, scale=0.28)
              with support on the interval [0,Inf].
   Random H1: d ~ 't' (location=0, scale=0.51, nu=5)
              with support on the interval [-Inf,Inf].
              tau ~ 'invgamma' (shape=1.79, scale=0.28)
              with support on the interval [0,Inf].

# Bayes factors:
 (denominator)
(numerator)  fixed_H0 fixed_H1 random_H0 random_H1
  fixed_H0   1.00e+00 8.83e-20  4.29e-18  2.52e-20
  fixed_H1   1.13e+19 1.00e+00  4.85e+01  2.85e-01
  random_H0  2.33e+17 2.06e-02  1.00e+00  5.88e-03
  random_H1  3.97e+19 3.50e+00  1.70e+02  1.00e+00

# Bayesian Model Averaging
  Comparison: (fixed_H1 & random_H1) vs. (fixed_H0 & random_H0)
  Inclusion Bayes factor: 218.526
  Inclusion posterior probability: 0.995

# Model posterior probabilities:
           prior posterior  logml
fixed_H0    0.25  1.95e-20 -51.22
fixed_H1    0.25  2.21e-01  -7.34
random_H0   0.25  4.56e-03 -11.23
random_H1   0.25  7.74e-01  -6.09

# Posterior summary statistics of average effect size:
          mean    sd  2.5%   50% 97.5% hpd95_lower hpd95_upper  n_eff  Rhat
averaged 1.085 0.183 0.686 1.091 1.432       0.705       1.446     NA    NA
fixed    1.092 0.118 0.860 1.093 1.325       0.853       1.317 3645.6 1.001
random   1.083 0.200 0.649 1.090 1.451       0.673       1.466 3863.5 1.000
The inclusion Bayes factor quantifying the evidence in favor of heterogeneity can be obtained using Equation 2 and output from the “Model posterior probabilities” table.
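Concretely, the computation amounts to dividing the posterior inclusion odds of the two random-effects models by the prior inclusion odds. A minimal Python sketch, with the posterior probabilities hard-coded from the "Model posterior probabilities" table above:

```python
# Heterogeneity inclusion Bayes factor via Equation 2: posterior inclusion
# odds of the random-effects models divided by the prior inclusion odds
# (which equal 1 here, since each of the four models has prior probability 0.25).
post = {"fixed_H0": 1.95e-20, "fixed_H1": 2.21e-01,
        "random_H0": 4.56e-03, "random_H1": 7.74e-01}

bf_rf = (post["random_H0"] + post["random_H1"]) / (post["fixed_H0"] + post["fixed_H1"])
print(round(bf_rf, 2))  # → 3.52, moderate evidence for heterogeneity
```

This reproduces the moderate evidence for heterogeneity reported for the dentine hypersensitivity example in Section 5.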