Meta-analysis is a term used broadly for a collection of approaches which aim to combine results from multiple related statistical analyses. In the standard formulation, these results are summary statistics computed from data, a typical example being point estimates for the effect size of some treatment. For the combination of point estimates, there exists well-established Bayesian methodology and a large body of literature (see e.g. Higgins et al., 2009, and references therein). However, while the natural outcome of a Bayesian analysis is a posterior distribution, the analogous task of combining posterior distributions has received little attention. In the frequentist paradigm, Xie et al. (2011) have previously introduced a framework for the combination of confidence distributions, a concept loosely related to Bayesian posteriors. In this paper, we develop a Bayesian framework for combining posterior distributions.
In the standard setting of Bayesian random effects meta-analysis, a summary statistic (or data set) , , has been observed for each of studies. The summary statistics aim to provide information about some quantity or ‘effect’ of interest. To reflect the general idea of the studies being non-identical but related, it is common to regard them as exchangeable (Gelman et al., 2013), with the study-specific effect represented by some parameter, say , and the overall effect by another parameter, say . This leads to the following hierarchical model:
where is typically modeled as with estimated from data. One of the primary goals of the above model is to estimate the overall effect , for which the marginal posterior density is given by
There are many compelling reasons for reporting analysis results as posterior distributions instead of data summaries, and subsequently combining them in a meta-analysis. First, a distribution describes the analyst’s uncertainty in the obtained results. Second, a posterior distribution can be directly specified on a quantity of interest, whereas a summary statistic often only provides indirect information about the quantity. Furthermore, the posterior distribution may also include prior knowledge not present in the data, but possibly obtained by expert elicitation (e.g. Albert et al., 2012). This is particularly important in problems where not enough data is available to inform a model about some of its parameters. For example, models describing complex biological phenomena may have so many parameters that they cannot be estimated without the use of informative priors (e.g. Kuikka et al., 2014). Thus, the posterior for a given quantity may be primarily informed by its prior and only indirectly by data, which could make the extraction of a meaningful summary statistic challenging. Finally, if we wish to combine the results of such studies in a meta-analysis, it is desirable to preserve the study-specific prior knowledge in the combined model. Unlike data summaries, this knowledge is naturally encapsulated in posterior distributions.
Consider now a setting, where instead of summary statistics , we have posterior distributions with densities available from each of studies, based on which we wish to update our prior knowledge about the global effect , in analogy with Equation (1). We build on the interpretation that each is a probabilistic representation of belief (Bernardo and Smith, 1994), which reflects our uncertainty about the value of the corresponding local effect . Updating prior knowledge (in our case regarding
) subject to uncertain or ‘soft’ evidence, represented as probability distributions instead of observed values, has been extensively studied as a philosophical topic in both statistics and artificial intelligence. The most well-known of such update rules, Jeffrey’s rule of conditioning(Diaconis and Zabell, 1982; Jeffrey, 2004; Smets, 1993; Zhao and Osherson, 2010)
, computes the updated probability for an event as a weighted average of the posterior probabilities under all possible values of the evidence. Due to its construction, Jeffrey’s rule is applicable in simple discrete cases but becomes computationally infeasible for more complex models with continuous variables. Instead, we directly marginalize away the uncertainty in the observations as they appear in the likelihood, which leads us to updateas
In addition to achieving computational tractability, the above formulation retains some basic properties of standard Bayesian inference, such as order-invariance in successive updates (in contrast with Jeffrey’s rule) and posterior concentration as . From a practical point of view, it is interesting to note that Equations (1) and (2) differ only in the local likelihood being replaced with the density . Intuitively, the former carries information provided by the data only, while the latter carries information provided by both the data and additional prior knowledge. A more subtle difference is that in Equation (2), the integration is thought of as being performed with respect to the measure defined by .
Our formulation of the meta-analysis problem also lends itself to an intuitive interpretation as message passing in probabilistic graphical models (Jordan, 2004; Koller and Friedman, 2009). Using the language of undirected models, updating into in Equation (2) is equivalent to propagating beliefs from leaf nodes to the root node in a tree-structured graphical model, with node potentials , and edge potentials , , see Figure 1. Further utilizing the induced graphical model structure, we may similarly update any into by propagating beliefs to the th leaf node from the remaining nodes in the model. This yields a way of updating study-specific posteriors post-hoc, borrowing strength from posteriors obtained in other studies. Besides its intuitive appeal, adopting a graphical model view enables a straightforward extension of the framework to more complex model structures, and it may be utilized in devising efficient computational strategies.
A major advantage of our meta-analysis framework is its flexibility. In particular, the study-specific inferences resulting in are independent of the combination model, , which is imposed by the meta-analyst. This means that, unlike in hierarchical models, all study-level complexities are hidden ‘under the hood’ and need not explicitly be included in the meta-analysis. For instance, in likelihood-free models (e.g. Lintusaari et al., 2017; Marin et al., 2012), the data can typically be summarized by a number of different statistics but there is no closed-form likelihood to relate these to the parameter of interest. In our framework, likelihood-free inferences can be conducted separately for each study using approximate Bayesian computation
, after which the resulting posteriors are directly combined in a meta-analysis. We highlight the benefit of being able to reuse results from computationally costly analyses, such as likelihood-free inferences, and update them without having to rerun the analyses themselves. In the current paper, we propose a straightforward computational strategy for our framework, where we first impose parametric approximations on the observed posteriors. Sampling from the joint distribution of all variables can then be done using general-purpose software such as Stan(Carpenter et al., 2017). Finally, the obtained joint distribution can be refined using importance sampling, from which any desired marginals can be extracted.
The paper is structured as follows. In Section 2, we develop a framework for conducting Bayesian inference with observed beliefs, which underlies our meta-analysis approach. In Section 3, we summarize the main equations for practical use, and provide further insight into the framework by highlighting connections to message passing in probabilistic graphical models, and standard forms of Bayesian meta-analysis. Section 4 introduces a straightforward computational strategy, which can be implemented using general-purpose software. In Section 5, we illustrate our method with both synthetic and real data. The paper ends with a brief discussion of related work in Section 6 and some concluding remarks in Section 7.
2 Bayesian inference with observed beliefs
In this section, we first develop a posterior update rule given observed beliefs, which is motivated by the problem of conducting meta-analysis for a set of related posterior distributions. The notion of relatedness is here characterized as the exchangeability of the quantities targeted by the posteriors. The proposed update rule is given in Equation (5). We then show in Section 2.1 that this rule retains some basic theoretical properties of standard Bayesian inference.
Let us first assume that is a collection of observable and exchangeable random quantities. Following standard theory, de Finetti’s representation theorem (e.g. Schervish, 1995) states that if is an infinitely exchangeable sequence of random quantities taking values in a Borel space , , then there exists a probability measure such that the joint distribution of the subsequence , i.e. the predictive distribution, has the form
where . Here, the set of probability measures on is taken to be a family , indexed by a parameter , such that is a probability measure on . Furthermore, we define the density function with respect to a reference measure (Lebesgue or counting measure).
In the above standard setting, the Bayesian learning process works through updating conditional on observed data. Following Equation (3), the posterior distribution of , given observed values , has the form
with , the Borel -algebra on . We will now build further on this setting, assuming that instead of directly observing the value of each , we have a set of distributions , expressing our currently available beliefs about the values of . Note that, while in our current context, we assume that each of the beliefs is obtained as the posterior distribution from a previously conducted analysis, this assumption is not essential to our developments. Importantly, the observed distributions are assumed independent of the distribution we seek to update. In the absence of fixed likelihood contributions for each observation, we propose to compute the expected likelihood contributions with respect to the available beliefs.
The proposed modification of the likelihood now leads to an update of the form
where, with slight abuse of notation, we write to denote conditioning on beliefs in analogy with conditioning on fully observed values; we give more context for this choice of notation below in Section 2.1.2. Equation (5) further induces a joint distribution on , which can be marginalized with respect to , resulting in a predictive distribution as follows:
such that . Throughout this work, we assume that is a probability measure. However, it is interesting to note that if we make an exception and allow to be the Lebesgue (or counting) measure for all
, which corresponds to having a uniformly distributed—possibly improper—belief about the value of, then the updated measure in Equation (5
) equals the prior probability measure. Moreover, with this choice of , Equation (6) reduces to the standard predictive distribution in Equation (3).
To gain an intuitive understanding of the update rule proposed in Equation (5), we consider inference in the following simple model:
In the standard case, we apply Bayes’ rule (4) to update the prior distribution , conditional on observed values of , into a posterior distribution, which by conjugacy is , with .
Next, let us assume that the values of cannot be directly observed, but instead, we are able to express beliefs about these values through binary distributions , , . Note that these distributions are independent of the Bernoulli-distribution posited by the modeller.
We now wish to update the prior
. Note that these distributions are independent of the Bernoulli-distribution posited by the modeller. We now wish to update the priorinto a distribution using the provided information. Following Equation (5), the standard Bernoulli likelihood takes a modified form
While the modified likelihood in general no longer permits analytical calculations through conjugacy, two analytically tractable special cases can be identified. Specifically, if , for all , then the observed distributions are uninformative about the value of , and coincides with the prior . Moreover, if , for all , then the ’s are in effect fully observed, and equals the standard posterior . A numerical example is provided in Figure 2.
2.1 Theoretical properties
Two well-known properties of standard Bayesian inference, which are of practical relevance in our meta-analysis setting, are (i) order-invariance in successive posterior updates of exchangeable models and (ii) posterior concentration. The former ensures that inferences conditional on exchangeable data are coherent. The latter tells us that the posterior distribution becomes increasingly informative about the quantity of interest, as we accumulate more data. We will now briefly discuss these properties in the context of the framework introduced above.
2.1.1 Order-invariance in successive posterior updates
Under the assumption of exchangeability, standard Bayesian inference can be constructed as a sequence of successive updates, invariant to the order in which the data are processed. The following proposition establishes that the update rule defined in Equation (5) retains the same property.
The update rule defined in Equation (5) is invariant to permutations of the indices .
As an alternative strategy to Equation (5), we could first attempt to formulate a posterior distribution according to Equation (4) and then, as a final step, integrate out the uncertainty in the conditioning set with respect to the observed beliefs. This is in essence the strategy of Jeffrey’s rule of conditioning. It is, however, well known that Jeffrey’s update rule is in general not order-invariant (Diaconis and Zabell, 1982).
2.1.2 Posterior concentration
Asymptotic theory states that if a consistent estimator of the true value (or an optimal one in terms KL-divergence) of the parameter exists, then the posterior distribution (4) concentrates in a neighborhood of this value, as (e.g. Schervish, 1995). Here we discuss conditions under which the same property holds for the measure , defined in Equation (5). Considerations of asymptotic normality will not be discussed here.
Our strategy is to first formulate a generative hierarchical model for the observed distributions . Then, we show that can be expressed as the marginal posterior distribution of in this model. Finally, we show that under some further technical conditions, standard asymptotic theory can be applied to this distribution. To this end, consider the following hierarchical model:
where is treated as a soft observation of the unobserved value of , and is the inference mechanism that produces . Note that in the particular case of being the Dirac measure, simply generates a point mass at the true value of , such that the hierarchical model reduces to an ordinary, non-hierarchical Bayesian model. Since is an inference over the values of , produced by , it is also a direct representation of the likelihood of under the model , given the observation itself. We finally note, that the generating mechanism may in general be different for each , which is highlighted in the notation by the superscript.
Assume now that the Radon-Nikodym derivative with respect to a dominating measure can be defined for all . The marginal posterior distribution of is then
where , taken as a function of , is the likelihood of given . On the other hand, according to our previous assumption, the likelihood is directly encapsulated in itself. We will therefore assume, by construction, that , where is the density function corresponding to . Using this equivalence, we state the following lemma:
Let . Then the measures and are equivalent.
To prove the claim, we only need to verify that the integrals and are equivalent. Since and are both probability distributions on , and their densities are defined with respect to the same reference measure , we may swap the roles of the integrand and the measure that we integrate against:
The essential difference between the standard posterior distribution defined in Equation (4) and that of Corollary 2.2.1 above, is that in the former, the observations are conditionally i.i.d., while in the latter they are conditionally independent but non-identically distributed.
Assuming that there exists, for all , a unique minimizer of the KL-divergence from the true distribution of to the parametrized representation , a key step in proving posterior concentration is to establish a limit for the log-likelihood ratio
. Since the summands are now non-identically distributed, standard forms of the strong law of large numbers cannot be applied to obtain this limit. However, with further conditions imposed on the second moment of each term in the sum, an alternative form can be used, which relaxes the requirement of the terms being identically distributed(Sen and Singer, 1993, Theorem 2.3.10). The conditions are stated in the following theorem.
Assume that the log-likelihood ratio terms are independent, and that and exist for all . Let , for . Then
In conclusion, if the measure can be written in the form of Equation (2.2.1) and the conditions of Theorem 2.3 hold, then posterior concentration falls back to the standard case, for which elementary proofs can be found in many sources (e.g. Bernardo and Smith, 1994; Gelman et al., 2013). A rigorous treatment is given in Schervish (1995). As an example, we provide a proof of posterior concentration of for discrete parameter spaces in Appendix A.
3 Meta-analysis of Bayesian analyses
We now turn to a practical view of the framework developed in the previous section. To this end, it is convenient to work with densities instead of measures. We are motivated by the problem of conducting meta-analysis for Bayesian analyses summarized as posterior distributions, and refer to our framework as meta-analysis of Bayesian analyses (MBA). The central belief updates of the framework are given in Equations (9) and (10), which update beliefs regarding global and local effects, respectively. Figures 2(a) and 2(b) visualize the updates by interpreting them as message passing in probabilistic graphical models.
To reiterate the setting laid out in Section 1, we assume that a set of posterior density functions is available, each expressing a belief about the value of a corresponding quantity of interest in a set . While the density functions can be thought of as resulting from previously conducted Bayesian analyses, it is worth pointing out that from a methodological point of view, we are agnostic to how they have been formed; instead of posteriors, some (or all) of the ’s could be purely subjective prior beliefs, or as discussed in Section 2, even directly observed values.
Judging the quantities to be exchangeable, the meta-analyst now formulates a model
with an appropriate prior placed on the parameter . Note that this model initially makes no reference to the ’s, and it is formulated as if the ’s were fully observable quantities. Then, to update based on the observed density functions, we apply Equation (5) in density form to have
where for brevity, we denote .
In a meta-analysis context, the parameter often has an interpretation as the central tendency of some shared property of , such as the mean or the covariance (or both jointly). As such, inference on is often of primary interest in providing a ‘consensus’ over a number of studies. As a secondary goal, we may also be interested in updating a (possibly weakly informative) belief about any individual quantity , subject to the information provided by observations on the remaining quantities. To do so, we first write Equation (6) in density form:
and then marginalize over all quantities but the one to be updated. Let be a set of indices and let be an arbitrary index in this set. The density function is then updated as follows:
According to Section 2.1.2, the density , defined in Equation (9), will under suitable conditions become increasingly peaked around some point , as . That does not behave similarly, becomes clear by the following considerations. First, we note that Equation (10) is equivalent to
where is a normalizing constant. As becomes increasingly peaked around , the integral in the above equation converges to . Consequently,
which can only be degenerate if either or is degenerate by design. Instead of degeneracy, exhibits shrinkage with respect to .
3.1 Interpretation as message passing
The formulation of the above meta-analysis framework, constructed as an extension of standard Bayesian inference, can also be viewed within the formalism of probabilistic graphical models. This provides both an intuitive interpretation and a visualization of Equations (9) and (10), and gives a straightforward way of extending the framework to more complex model structures. To elaborate further on this, consider a tree-structured undirected graphical model with leaf nodes and a root. This is a special case of a pairwise Markov random network (Koller and Friedman, 2009), where all factors, or clique potentials, are over single variables or pairs of variables, referred to as node and edge potentials, respectively. Note that the potential functions are simply non-negative functions, which may not integrate to 1. Choosing, for , the node potentials as and , and the edge potentials as , the model has the joint density
where is a normalizing constant. Finding the marginal density of can then be interpreted as propagating beliefs from each of the leaf nodes up to the root node in the form of messages, a process known as message passing or belief propagation (e.g. Yedidia et al., 2001). To that end, we specify the following messages to be sent from the th leaf node to the root:
The initial belief on is then updated according to
In a similar way, we may pass information to any single leaf node from the remaining leaf nodes. We now specify two kinds of messages: from leaf nodes indexed by to the root node, as given by Equation (11), and from the root node to the th leaf node,
The updated belief over is then
Although not directly utilized in this work, the graphical model view may also be useful in devising efficient computational strategies (see also the remarks in Section 7). Especially with more complex model structures, making use of the conditional independencies made explicit by the graphical model may bring considerable computational gains.
3.2 Bayesian meta-analysis as a special case
It is straightforward to show that Bayesian random-effects and fixed-effects meta-analyses can be recovered as special cases of the proposed framework. In its traditional formulation (e.g. Normand, 1999), random-effects meta-analysis (REMA) assumes that for each of studies, a summary statistic, , , has been observed, drawn from a distribution with study-specific mean
where the approximation of the distribution of
by a normal distribution is justified by the asymptotic normality of maximum likelihood estimates. The variancesare directly estimated from the data, while the means , are assumed to be drawn from some common distribution, typically
where the parameters and represent the average treatment effect and inter-study variation, respectively. Fixed-effects meta-analysis is a special case of REMA, where , such that .
The posterior density for the parameters in REMA can be written as
where denotes a Gaussian density function, is the likelihood function of given , and is the empirical variance of . To study the connection between the above posterior density and Equation (9), assume that instead of a summary statistic , each study has been summarized using a posterior distribution with density over its study-specific effect parameter . If the distribution has been computed under the data model given by equation (14), and using an improper uniform prior , the density is
resulting in the posterior density of being equivalent in both cases.
Here we describe a simple computational strategy, which is used in the numerical examples of Section 5 below. Some further alternatives are briefly discussed at the end of this section. Recall now that the density of the joint distribution of the parameters can be written as
Our goal is to produce joint samples from the above model, enabling any desired marginals to be extracted from them. Probabilistic programming languages (e.g. Carpenter et al., 2017; Salvatier et al., 2015; Tran et al., 2016; Wood et al., 2014) allow sampling from an arbitrary model, provided that the components of the (unnormalized) model can be specified in terms of probability distributions of some standard form. In the illustrations of this section, we use Hamiltonian Monte Carlo implemented in the Stan probabilistic programming language (Carpenter et al., 2017).
We first note that in the above joint model (15), the part specified by the meta-analyst, i.e. , can by design be composed using standard parametric distributions. The observed part of the model , however, is in general analytically intractable, and instead of having direct access to posterior density functions of standard parametric form, we typically have a sets of posterior samples , with . Our strategy is then to first find an intermediate parametric approximation for , which enables us to sample from an approximate joint distribution. Assuming that the true densities
can be evaluated using e.g. kernel density estimation, and that, the joint samples can be further refined using sampling/importance resampling (SIR; Smith and Gelfand, 1992). The steps of the computational scheme are summarized below:
For , fit a parametric density function to the samples .
Draw samples from the approximate joint model .
Compute importance weights , where
Note that the constant cancels in the computation of the normalized weights .
Resample with weights .
For problems with a very large number of studies or high dimensional local parameters, or if the imposed parametric densities approximate the actual posteriors poorly, the computation of importance weights may become numerically unstable. The issue could possibly be mitigated using more advanced importance sampling schemes, such as Pareto-smoothed importance sampling (Vehtari et al., 2015) or prior swap importance sampling (Neiswanger and Xing, 2017). If we are only interested in sampling from the density of the global parameter, as given by Equation (9), then an obvious alternative strategy would be to implement a Metropolis-Hastings algorithm, using the samples to compute Monte Carlo estimates of the integrals . However, this would lead to expensive MCMC updates as the integrals need to be re-estimated at every iteration of the algorithm. Finally, instead of directly sampling from the full joint distribution, we could try to utilize the induced graphical model structure (Section 3.1) to do localized inference, see also Section 7.
5 Numerical illustrations
In this section, we illustrate our meta-analysis framework, meta-analysis of Bayesian analyses (MBA), with numerical examples. In these examples, we consider the problem of combining results from analyses conducted using likelihood-free models. In such models, the data can typically be summarized by a number of different statistics but there is no closed-form likelihood to relate these to the quantity or effect of interest, which poses a challenge for traditional formulations of meta-analysis. In our framework, we directly utilize the inferred posteriors to build a joint model. In addition to modeling the shared central tendency of the inferred model parameters, we demonstrate that weakly informative or poorly identifiable posteriors for individual studies can be updated post-hoc through joint modeling. We begin with a brief review of likelihood-free inference using approximate Bayesian computation.
5.1 Likelihood-free inference using approximate Bayesian computation
Approximate Bayesian computation (ABC) is a paradigm for Bayesian inference in models, which either entirely lack an analytically tractable likelihood function, or for which it is costly to compute. The only requirement is that we are able to sample data from the model by fixing values for the parameters of interest, as is the case for simulator-based models. In the basic form of ABC, simulations are run for parameter proposals drawn from a prior distribution. The parameter proposals whose simulated data match the observed data are collected and constitute a sample from the posterior distribution. It can be shown that this process is equivalent to accepting parameter proposals in proportion to their likelihood value, given the observed data, as is done in traditional rejection sampling.
In practice, the simulated data virtually never exactly matches the observed data and likely no sample from the posterior distribution would be acquired. This problem can be circumvented by loosening the acceptance condition to accept samples whose simulations yield results similar enough to the observed data. For this purpose, a dissimilarity function and a scalar are defined such that a parameter proposal with respective simulation result is accepted if . This function is often defined in terms of summary statistics and . For example, could be defined as the Euclidean distance between and . The aforementioned relaxation results in samples being drawn from an approximate posterior instead of the actual posterior distribution, hence the name approximate Bayesian computation.
For a comprehensive introduction to ABC, see Marin et al. (2012). More recent developments are reviewed in Lintusaari et al. (2017). In the following numerical illustrations, posterior distributions obtained using ABC provide a starting point for meta-analysis. These likelihood-free inferences are implemented using the ELFI open-source software package (Lintusaari et al., 2018).
5.2 Example 1: MA process
In our first example, we use simulated data from a MA process of order . The MA process is a standard example in the literature on likelihood-free inference due to its simple structure but fairly complex likelihood and non-trivial relationship between parameters and observed data. Assuming zero mean, the process is defined as
where and , . The quantity of interest for which we conduct inference is . Following Marin et al. (2012), we use as prior for a uniform distribution over the set
which, by restriction of the parameter space, imposes a general identifiability condition for MA processes. Inference for is then conducted using ABC with rejection sampling, taking as summary statistics the empirical autocovariances of lags one and two, denoted as and , respectively. Furthermore, a Euclidean distance of 0.1 is used as acceptance threshold.
To illustrate our meta-analysis framework, we first sample realizations of using the following generating process:
Given each realization , , we then generate a series of data points, , according to Equation (16). For each time-series, we independently conduct likelihood-free inference as described above, generating samples from the posterior. The computed posterior distributions along with their corresponding true parameter values are shown in Figure 4. It can be seen that the very limited information given by the data in each of the analyses leaves the posteriors with a considerable amount of uncertainty.
For meta-analysis, we first specify a model for the study-specific effects as if they were observed quantities from an exchangeable sequence; see Equation (8
). As the true generating mechanism of the effects is typically unknown, the model must be specified according to the analyst’s judgment. To reflect this, we will here model the generating process as a Gaussian distribution with parameters,
For and the covariance matrix , we use Gaussian and inverse Wishart priors, respectively,
The above values were chosen to provide reasonable coverage of , the constrained support of . Furthermore, was chosen as to directly yield as t