For various statistical models the likelihood function cannot be computed pointwise, which prevents the use of standard Markov chain Monte Carlo (MCMC) algorithms such as Metropolis–Hastings (MH) for Bayesian inference. For example, the likelihood of latent variable models typically involves an intractable integral over the latent space. Classically, one can address this problem by designing MCMC algorithms on the joint space of parameters and latent variables. However, these samplers can mix poorly when latent variables and parameters are strongly correlated under the joint posterior distribution. Furthermore, these schemes cannot be implemented if we can only simulate the latent variables and not evaluate their probability density function (Andrieu et al., 2010, Section 2.3). Similarly, in the context of undirected graphical models, the likelihood function might involve an intractable integral over the observation space; see Møller et al. (2006) for examples from spatial statistics.
Pseudo-marginal methods have been proposed for these situations (Lin et al., 2000; Beaumont, 2003; Andrieu and Roberts, 2009), whereby unbiased Monte Carlo estimators of the likelihood are used within an MH acceptance mechanism while still producing chains that are ergodic with respect to the exact posterior distribution of interest, denoted by $\pi$. Pseudo-marginal algorithms and their extensions (Deligiannidis et al., 2015; Tran et al., 2016) are particularly adapted to latent variable models, such as random effects models and state space models, where the likelihood can be estimated without bias using importance sampling or particle filters (Beaumont, 2003; Andrieu and Roberts, 2009; Andrieu et al., 2010). Related schemes include the exchange algorithm (Murray et al., 2006; Andrieu et al., 2018), which applies to scenarios where the likelihood involves an intractable, parameter-dependent normalizing constant. Exchange algorithms rely on the simulation of synthetic observations to cancel out intractable terms in the MH acceptance ratio. As with any MCMC algorithm, the computation of each iteration requires the completion of the previous ones, which hinders the potential for parallel computation. Running independent chains in parallel is always possible, and averaging over independent chains leads to a linear decrease of the resulting variance. However, the inherent bias that comes from starting the chains outside of stationarity, also called the “burn-in bias”, remains (Rosenthal, 2000).
This burn-in bias has motivated various methodological developments in the MCMC literature; among these, some rely on coupling techniques, such as the circularly-coupled Markov chains of Neal (2017), regeneration techniques described in Mykland et al. (1995); Brockwell and Kadane (2005), and “coupling from the past” as proposed in Propp and Wilson (1996). Coupling methods have also been proposed for diagnosing convergence in Johnson (1996, 1998) and as a means to assess the approximation error for approximate MCMC kernels in Nicholls et al. (2012). Recently, a method has been proposed to completely remove the bias of Markov chain ergodic averages (Glynn and Rhee, 2014). An extension of this approach using coupling ideas was proposed by Jacob et al. (2017) and applied to a variety of MCMC algorithms. This methodology involves the construction of a pair of Markov chains, which are simulated until an event occurs. At this point, a certain function of the chains is returned, with the guarantee that its expectation is exactly the integral of interest. The output is thus an unbiased estimator of that integral. Averaging over i.i.d. copies of such estimators, we obtain estimators that are consistent as the number of copies goes to infinity, and the copies can be generated independently in parallel. Relevant limit theorems have been established in Glynn and Heidelberger (1990); Glynn and Whitt (1992)
, enabling the construction of valid confidence intervals. The methodology has already been demonstrated for various MCMC algorithms (Jacob et al., 2017; Heng and Jacob, 2017; Jacob et al., 2018), which were instances of geometrically ergodic Markov chain samplers under typical conditions. However, in the case of intractable likelihoods and pseudo-marginal samplers, the associated Markov chains are typically sub-geometrically ergodic (Andrieu and Vihola, 2015).
We show here that unbiased estimators of expectations with respect to the posterior distribution, with finite variance and finite computational cost, can also be derived from polynomially ergodic Markov chains such as those generated by pseudo-marginal methods. We provide results on the associated efficiency in comparison with standard MCMC estimators. We apply the methodology to particle MCMC algorithms for inference in generic state space models, with an application to a time series of neuron activation counts. We also consider a variant of the pseudo-marginal approach known as the block pseudo-marginal approach (Tran et al., 2016), as well as the exchange algorithm (Murray et al., 2006).
Accompanying code used for the simulations and to generate the figures is provided at https://github.com/lolmid/unbiased_intractable_targets.
1.2 Unbiased estimators from coupled Markov chains
Let $\pi$ be a probability measure on a topological space $\mathcal{Z}$ equipped with the Borel $\sigma$-algebra $\mathcal{B}(\mathcal{Z})$. In this section we recall how two coupled chains that are marginally converging to $\pi$ can be used to produce unbiased estimators of expectations $\mathbb{E}_\pi[h(X)]$ for any $\pi$-integrable test function $h$. Following Glynn and Rhee (2014); Jacob et al. (2017), we consider the following coupling of two Markov chains $(X_t)_{t \geq 0}$ and $(Y_t)_{t \geq 0}$. First, $X_0$ and $Y_0$ are drawn independently from an initial distribution $\pi_0$. Then, $X_1$ is drawn from a Markov kernel $P$ given $X_0$, which is denoted $X_1 \sim P(X_0, \cdot)$. Subsequently, at step $t \geq 1$, a pair $(X_{t+1}, Y_t)$ is drawn from a Markov kernel $\bar{P}$ given $(X_t, Y_{t-1})$, which is denoted $(X_{t+1}, Y_t) \sim \bar{P}((X_t, Y_{t-1}), \cdot)$. The kernel $\bar{P}$ is such that, marginally, $X_{t+1} \sim P(X_t, \cdot)$ and $Y_t \sim P(Y_{t-1}, \cdot)$. This implies that, marginally for all $t \geq 0$, $X_t$ and $Y_t$ have the same distribution. Furthermore, the kernel
$\bar{P}$ is constructed so that there exists a random variable $\tau$, termed the meeting time, such that $X_t = Y_{t-1}$ for all $t \geq \tau$, almost surely (a.s.). Then, for any integer $k \geq 0$, the following telescoping sum argument informally suggests an unbiased estimator of $\mathbb{E}_\pi[h(X)]$. We start from $\mathbb{E}[h(X_k)]$ and write
$$\mathbb{E}_\pi[h(X)] = \mathbb{E}[h(X_k)] + \sum_{t=k+1}^{\infty} \mathbb{E}\big[h(X_t) - h(Y_{t-1})\big] = \mathbb{E}\Big[h(X_k) + \sum_{t=k+1}^{\tau-1} \big(h(X_t) - h(Y_{t-1})\big)\Big].$$
The sum is treated as zero if $\tau - 1 < k + 1$. The suggested estimator is thus defined as
$$H_k(X, Y) = h(X_k) + \sum_{t=k+1}^{\tau-1} \big(h(X_t) - h(Y_{t-1})\big),$$
with $X$ and $Y$ denoting the chains $(X_t)_{t \geq 0}$ and $(Y_t)_{t \geq 0}$ respectively. As in Jacob et al. (2017), we average such estimators over a range of values $k \leq l \leq m$, for an integer $m \geq k$, resulting in the estimator
$$H_{k:m}(X, Y) = \frac{1}{m - k + 1} \sum_{l=k}^{m} H_l(X, Y).$$
Intuitively, $H_{k:m}(X, Y)$ can be understood as a standard Markov chain average over steps $k$ to $m$ after a burn-in period of $k$ steps (which would in general be biased for $\mathbb{E}_\pi[h(X)]$), plus a second term that can be shown to remove the burn-in bias. That “bias correction” term is a weighted sum of differences of the chains between step $k + 1$ and the meeting time $\tau$. In the following, we will write $H_{k:m}$ for brevity. The construction of $H_{k:m}$ is summarized in Algorithm 1, where the initial distribution of the chains is denoted by $\pi_0$, and the Markov kernels by $P$ and $\bar{P}$ as above. Standard MCMC estimators require the specification of $\pi_0$ and $P$, while the proposed method requires the additional specification of the coupled kernel $\bar{P}$. We will propose coupled kernels for the setting of intractable likelihoods, and study the estimator $H_{k:m}$ under conditions which cover pseudo-marginal methods.
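As a concrete illustration, the construction above can be sketched in Python (a minimal sketch; the function names are ours and the kernels are supplied by the user). The coupled chains are simulated until the larger of $m$ and the meeting time, and the estimator combines a standard average of the test function along the first chain with the weighted bias correction, following Jacob et al. (2017):

```python
import numpy as np

def coupled_chains(sample_pi0, kernel, coupled_kernel, m, rng):
    """Simulate the lag-one coupled chains until max(m, tau).
    sample_pi0(rng)           -> draw from the initial distribution pi_0
    kernel(x, rng)            -> draw from P(x, .)
    coupled_kernel(x, y, rng) -> draw a pair from the coupled kernel P-bar
    Returns trajectories X_0..X_{max(m,tau)}, Y_0..Y_{tau-1} and tau."""
    X = [sample_pi0(rng)]          # X_0
    Y = [sample_pi0(rng)]          # Y_0, independent of X_0
    X.append(kernel(X[0], rng))    # X_1 ~ P(X_0, .)
    t, tau = 1, None
    while tau is None or t < m:
        if tau is None:
            x_new, y_new = coupled_kernel(X[-1], Y[-1], rng)  # (X_{t+1}, Y_t)
            X.append(x_new)
            Y.append(y_new)
            t += 1
            if x_new == y_new:     # meeting: X_t = Y_{t-1}
                tau = t
        else:                      # chains have met; only X needs extending
            X.append(kernel(X[-1], rng))
            t += 1
    return X, Y, tau

def H_k_m(X, Y, tau, h, k, m):
    """Time-averaged unbiased estimator H_{k:m} from stored trajectories."""
    n = m - k + 1
    mcmc = sum(h(X[t]) for t in range(k, m + 1)) / n
    bias_correction = sum(min(1.0, (t - k) / n) * (h(X[t]) - h(Y[t - 1]))
                          for t in range(k + 1, tau))
    return mcmc + bias_correction
```

The two functions separate the simulation of the chains from the evaluation of the estimator, so that several test functions $h$ can be evaluated on the same stored trajectories.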
To see how coupled kernels can be constructed, we first recall a construction for simple MH kernels. Assume that $\pi$ admits a density, also denoted $\pi$, with respect to the Lebesgue measure. The standard MH algorithm relies on a proposal distribution $q(z, \cdot)$, for instance chosen as a Gaussian distribution centered at $z$. At iteration $t$, a proposal $Z^\star \sim q(X_t, \cdot)$ is accepted as the new state $X_{t+1}$ with probability $\min\{1, \pi(Z^\star) q(Z^\star, X_t) / (\pi(X_t) q(X_t, Z^\star))\}$, known as the MH acceptance probability. If $Z^\star$ is rejected, then $X_{t+1}$ is assigned the value of $X_t$. This defines the kernel $P$. To construct $\bar{P}$, following Jacob et al. (2017) we can consider a maximal coupling of the proposal distributions. This is described in Algorithm 2 for completeness; see also Johnson (1998). Here
$\mathcal{U}[0, 1]$ refers to the uniform distribution on the interval $[0, 1]$. The algorithm relies on draws from a maximal coupling (or $\gamma$-coupling) of the two proposal distributions $q(X_t, \cdot)$ and $q(Y_{t-1}, \cdot)$ at step $t$. Draws $(Z^\star, Z^{\star\star})$ from maximal couplings are such that the probability of the event $\{Z^\star = Z^{\star\star}\}$ is maximal over all couplings of $q(X_t, \cdot)$ and $q(Y_{t-1}, \cdot)$. Sampling from maximal couplings can be done with rejection sampling techniques as described in Jacob et al. (2017), in Section 4.5 of Chapter 1 of Thorisson (2000) and in Johnson (1998). On the event $\{Z^\star = Z^{\star\star}\}$, the two chains are given identical proposals, which are then accepted or not using a common uniform random number. In the event that both proposals are identical and accepted, the chains meet: $X_{t+1} = Y_t$. One can then check that the chains remain identical from that iteration onwards.
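For instance, a maximal coupling of two univariate Gaussian proposals with a common standard deviation can be sampled with the rejection technique mentioned above. This is a sketch with our own function names, not the paper's implementation:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def maximal_coupling(mu1, mu2, sigma, rng):
    """Draw (Z1, Z2) with Z1 ~ N(mu1, sigma^2) and Z2 ~ N(mu2, sigma^2),
    maximizing P(Z1 = Z2), via the rejection sampler described in
    Thorisson (2000) and Jacob et al. (2017)."""
    z1 = rng.normal(mu1, sigma)
    # Accept z1 as a common value with probability min(1, q2(z1) / q1(z1)).
    if rng.uniform() * normal_pdf(z1, mu1, sigma) <= normal_pdf(z1, mu2, sigma):
        return z1, z1
    # Otherwise, sample the second component from the residual part of q2.
    while True:
        z2 = rng.normal(mu2, sigma)
        if rng.uniform() * normal_pdf(z2, mu2, sigma) > normal_pdf(z2, mu1, sigma):
            return z1, z2
```

When the two means coincide, the first test always succeeds, so chains in identical states receive identical proposals; this is what allows chains that have met to remain together.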
The unbiasedness of $H_{k:m}$ has an important consequence for parallel computation. Consider $R$ independent copies, denoted by $H_{k:m}^{(r)}$ for $r = 1, \ldots, R$, and the average $\bar{H}_R = R^{-1} \sum_{r=1}^{R} H_{k:m}^{(r)}$. Then $\bar{H}_R$ is a consistent estimator of $\mathbb{E}_\pi[h(X)]$ as $R \to \infty$, for any fixed $k$ and $m$, and a central limit theorem holds provided that $\mathbb{V}[H_{k:m}] < \infty$; sufficient conditions are given in Section 1.3. Since $\tau$ is a random variable, the cost of generating $H_{k:m}$ is random. Neglecting the cost of drawing from $\pi_0$, the cost amounts to that of one draw from the kernel $P$, $\tau - 1$ draws from the kernel $\bar{P}$, and then $m - \tau$ draws from $P$ if $m > \tau$. Overall that leads to a cost of $2(\tau - 1) + \max(1, m + 1 - \tau)$ units, where each unit is the cost of drawing from $P$, and assuming that one sample from $\bar{P}$ costs two units. Theoretical considerations on variance and cost will be useful to guide the choice of the parameters $k$ and $m$, as discussed in Section 1.5.
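The resulting parallel procedure is then straightforward. The following sketch (names ours) combines $R$ independent estimators, computed on separate processors, into a point estimate with a CLT-based confidence interval:

```python
import numpy as np

def combine_estimators(estimators, z=1.96):
    """Average R i.i.d. unbiased estimators H^(r) and return a point
    estimate together with an approximate 95% confidence interval,
    justified by the central limit theorem when V[H_{k:m}] is finite."""
    H = np.asarray(estimators, dtype=float)
    R = H.size
    mean = H.mean()
    std_error = H.std(ddof=1) / np.sqrt(R)   # sample standard error
    return mean, (mean - z * std_error, mean + z * std_error)
```

The interval width shrinks at the usual $R^{-1/2}$ rate, and, crucially, no burn-in bias term remains in the error.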
1.3 Theoretical validity under polynomial tails
We provide here sufficient conditions under which the estimator $H_{k:m}$ is unbiased, has finite expected cost and finite variance. Below, Assumptions 1 and 3 are identical to Assumptions 2.1 and 2.3 in Jacob et al. (2017), whereas Assumption 2 is a polynomial tail assumption on the meeting time, weaker than the geometric tail assumption, namely $\mathbb{P}(\tau > t) \leq C \delta^t$ for all $t \geq 0$, for some constants $C < \infty$ and $\delta \in (0, 1)$, used in Jacob et al. (2017). Relaxing this assumption is useful in our context as the pseudo-marginal algorithm is polynomially ergodic under realistic assumptions (Andrieu and Vihola, 2015) and, as demonstrated in Section 1.4, this allows the verification of the polynomial tail assumption.
Each of the two chains marginally starts from a distribution $\pi_0$, evolves according to a transition kernel $P$, and is such that $\mathbb{E}[h(X_t)] \to \mathbb{E}_\pi[h(X)]$ as $t \to \infty$ for a real-valued function $h$. Furthermore, there exist constants $\eta > 0$ and $D < \infty$ such that $\mathbb{E}[|h(X_t)|^{2+\eta}] \leq D$ for all $t \geq 0$.
The two chains are such that there exists an almost surely finite meeting time $\tau$ such that $\mathbb{P}(\tau > t) \leq C t^{-\kappa}$ for some constants $C < \infty$ and $\kappa$ sufficiently large relative to $\eta$, where $\eta$ is as in Assumption 1.
The chains stay together after meeting, i.e. $X_t = Y_{t-1}$ for all $t \geq \tau$.
Under Assumption 2, $\mathbb{E}[\tau^n] < \infty$ for all $n < \kappa$, and thus in particular $\mathbb{E}[\tau] < \infty$ and the computational cost associated with $H_{k:m}$ has a finite expectation. When $\kappa > 2$, it also implies that this computational cost has a finite second moment.
The following result states that, under the above assumptions, $H_{k:m}$ not only has a finite expected cost but also has a finite variance, and that its expectation is indeed $\mathbb{E}_\pi[h(X)]$. The proof is provided in Appendix A.1.
1.4 Conditions for polynomial tails
We now proceed to establishing conditions that imply Assumption 2. To state the main result, we put assumptions on the probability of meeting at each iteration. We write $\mathcal{D}$ for the diagonal of the joint space $\mathcal{Z} \times \mathcal{Z}$, that is $\mathcal{D} = \{(z, z') \in \mathcal{Z} \times \mathcal{Z} : z = z'\}$. We identify the meeting time $\tau$ with the hitting time of the diagonal, $\tau = \inf\{t \geq 1 : (X_t, Y_{t-1}) \in \mathcal{D}\}$. The first assumption is on the ability of the pair of chains to hit the diagonal when it enters a certain subset of $\mathcal{Z} \times \mathcal{Z}$.
The kernel $\bar{P}$ is $\varphi$-irreducible for some measure $\varphi$: for any set $A$ such that $\varphi(A) > 0$ and all $(z, z') \in \mathcal{Z} \times \mathcal{Z}$, there exists some $t$ such that $\bar{P}^t((z, z'), A) > 0$. The kernel $\bar{P}$ is also aperiodic. Finally, there exist $\varepsilon > 0$, $t_0 \in \mathbb{N}$ and a set $\bar{\mathcal{C}} \subseteq \mathcal{Z} \times \mathcal{Z}$ such that
Next we will assume that the marginal kernel $P$ admits a polynomial drift condition and a small set $\mathcal{C}$; we will later take the set in Assumption 4 to be $\bar{\mathcal{C}} = \mathcal{C} \times \mathcal{C}$. Intuitively, the polynomial drift condition on $P$ will ensure regular entries of the pair of chains in the set $\bar{\mathcal{C}}$, from which the diagonal can be hit in $t_0$ steps under Assumption 4.
There exist $\varepsilon' > 0$, a probability measure $\nu$ on $\mathcal{Z}$ and a set $\mathcal{C}$ such that
In addition, there exist a measurable function $V : \mathcal{Z} \to [1, \infty)$, constants $b < \infty$ and $c > 0$, and a value $\alpha \in (0, 1)$ such that, for any $z \in \mathcal{Z}$,
We note the direct relation between the exponent $\alpha$ in the polynomial drift condition and the exponent $\kappa$ in the bound on the survival probabilities $\mathbb{P}(\tau > t)$. In turn this relates to the existence of finite moments for $\tau$, as discussed after Assumption 2. In particular, the larger the value of $\eta$ in Assumption 1, the smaller the exponent $\kappa$ required in Assumption 2, and the smaller the value of $\alpha$ that suffices according to Theorem 2; conversely, small values of $\eta$ require $\kappa$, and hence $\alpha$, to be closer to their upper limits. The condition on $\alpha$ will appear again in the next section.
1.5 Efficiency under polynomial tails
In removing the bias from MCMC estimators, we expect that $H_{k:m}$ will have an increased variance compared to an MCMC estimator of equivalent cost. In this section we study the overall efficiency of $H_{k:m}$ in comparison to standard MCMC estimators. This mirrors Proposition 3.3 in Jacob et al. (2017) in the case of geometrically ergodic chains.
We can define the inefficiency of the estimator $H_{k:m}$ as the product of its variance and of its expected computational cost, $\mathbb{E}[C(H_{k:m})] \, \mathbb{V}[H_{k:m}]$, where $C(H_{k:m})$ denotes the computational cost. This quantity appears in the study of estimators with random computing costs, since seminal works such as Glynn and Heidelberger (1990) and Glynn and Whitt (1992). The inefficiency can be understood as the asymptotic variance of the proposed estimator as the computing budget goes to infinity. The following provides a precise comparison between this inefficiency and the inefficiency of the standard “serial” algorithm. Since the cost is measured in units equal to the cost of sampling from $P$, the cost of obtaining a serial MCMC estimator based on $n$ iterations is equal to $n$ such units. The mean squared error associated with an MCMC estimator based on $n$ iterations is denoted by $\mathrm{MSE}_b$, where $b$ denotes the number of discarded burn-in iterations. We are particularly interested in the comparison between $\mathbb{E}[C(H_{k:m})] \, \mathbb{V}[H_{k:m}]$, the inefficiency of the proposed estimator with parameters $k$ and $m$, and $\lim_{n \to \infty} n \, \mathrm{MSE}_b$, the asymptotic inefficiency of the serial MCMC algorithm. Both correspond to asymptotic variances when the computing budget goes to infinity.
We first express the estimator $H_{k:m}$, for $m \geq k$, as $H_{k:m} = \mathrm{MCMC}_{k:m} + \mathrm{BC}_{k:m}$, where $\mathrm{MCMC}_{k:m} = (m - k + 1)^{-1} \sum_{t=k}^{m} h(X_t)$ and the bias correction term is
$$\mathrm{BC}_{k:m} = \sum_{t=k+1}^{\tau-1} \min\left(1, \frac{t - k}{m - k + 1}\right) \big(h(X_t) - h(Y_{t-1})\big).$$
Then the Cauchy–Schwarz inequality provides a relationship between the variance of $H_{k:m}$, the MCMC mean squared error, and the second moment of the bias correction term:
$$\mathbb{V}[H_{k:m}] \leq \mathbb{E}\big[(\mathrm{MCMC}_{k:m} - \mathbb{E}_\pi[h(X)])^2\big] + 2 \, \mathbb{E}\big[(\mathrm{MCMC}_{k:m} - \mathbb{E}_\pi[h(X)])^2\big]^{1/2} \, \mathbb{E}\big[\mathrm{BC}_{k:m}^2\big]^{1/2} + \mathbb{E}\big[\mathrm{BC}_{k:m}^2\big].$$
This relationship motivates the study of the second moment of $\mathrm{BC}_{k:m}$. The following result shows that if the Markov chains are mixing well enough, in the sense of the exponent $\alpha$ in the polynomial drift condition of Assumption 5 being close enough to one, then we can obtain a bound on $\mathbb{E}[\mathrm{BC}_{k:m}^2]$ which is explicit in $k$ and $m$. The proof can be found in Appendix A.3.
Suppose that the marginal chain evolving according to $P$ is $\pi$-irreducible and that the assumptions of Theorem 2 hold for some $\alpha$ and some measurable function $V$, such that $\sup_{z \in \mathcal{C}} V(z) < \infty$. In addition assume that $\alpha$ is sufficiently close to one. Then for any measurable function $h$ with a suitably bounded growth relative to $V$, and any integers $k \leq m$, we have that, for a constant $C < \infty$,
The fact that a restriction on the exponent $\alpha$ has to be specified to control the second moment of $\mathrm{BC}_{k:m}$ is to be expected: we have already seen in the previous section that such a restriction is also necessary to apply Theorem 2 to verify Assumption 2 with an adequate exponent $\kappa$, which, in turn, leads to a finite variance for $H_{k:m}$ through Theorem 1. The specific condition could perhaps be relaxed with a more refined technical analysis, thus we interpret the condition qualitatively: the chains are allowed to satisfy only a polynomial drift condition, but it needs to be “close” enough to a geometric drift condition. With pseudo-marginal MCMC algorithms this can be achieved by tuning the number of Monte Carlo samples per iteration.
The variance of $H_{k:m}$ is thus bounded by the mean squared error of an MCMC estimator, plus additive terms that vanish polynomially as $k$, $m$ and $m - k$ increase. To compare the efficiency of $H_{k:m}$ to that of MCMC estimators, we add simplifying assumptions as in Jacob et al. (2017). As $m$ increases and for fixed $k$, we expect $(m - k + 1) \, \mathrm{MSE}_{k}$ to converge to the asymptotic variance of the serial MCMC estimator, and we will make the simplifying assumption that this limit is attained for $m$ large enough. For $m$ large relative to $k$ and $\tau$, the bias correction term is negligible compared to the two other terms appearing on the right-hand side of (11), so we obtain the approximate inequality
where the cost of $H_{k:m}$ is approximated by the cost of $m$ calls to $P$. For the left-hand side to be comparable to the serial inefficiency, we can select $m$ as a large multiple of $k$, so that the ratio of the two costs is close to one. The second term on the right-hand side is then negligible as $k$ increases, and we see that the polynomial index determining the rate of decay is monotonic in $\alpha$.
2 Unbiased pseudo-marginal MCMC
2.1 Pseudo-marginal Metropolis–Hastings
The pseudo-marginal approach (Lin et al., 2000; Beaumont, 2003; Andrieu and Roberts, 2009) generates Markov chains that target a distribution of interest, while using only non-negative unbiased estimators of target density evaluations. For concreteness we focus on target distributions that are posterior distributions in a standard Bayesian framework. The likelihood function associated to data $y$ is denoted by $L(\theta; y)$, and a prior density $p(\theta)$ w.r.t. the Lebesgue measure is assigned to an unknown parameter $\theta \in \Theta$. We assume that we can compute a non-negative unbiased estimator of $L(\theta; y)$, for all $\theta$, denoted by $\hat{L}(\theta, U; y)$, where $U$ are random variables such that $U \sim m_\theta$, where for any $\theta$, $m_\theta$ denotes a Borel probability measure on a space $\mathcal{U}$. We assume that $m_\theta$ admits a density with respect to the Lebesgue measure, also denoted by $m_\theta$. The random variables $U$ represent the variables required in the construction of the unbiased estimator of $L(\theta; y)$. The pseudo-marginal algorithm targets a distribution with density
$$\bar{\pi}(\theta, u) = \pi(\theta) \, m_\theta(u) \, \frac{\hat{L}(\theta, u; y)}{L(\theta; y)},$$
where $\pi(\theta) \propto p(\theta) L(\theta; y)$ denotes the posterior density of interest.
The generated Markov chain takes values in $\Theta \times \mathcal{U}$. Since $\mathbb{E}_{U \sim m_\theta}[\hat{L}(\theta, U; y)] = L(\theta; y)$ for all $\theta$, marginally $\int \bar{\pi}(\theta, u) \, du = \pi(\theta)$, corresponding to the target of interest for the $\theta$-component of the chain. Sampling from $\bar{\pi}$ is achieved with an MH scheme, with proposal $q(\theta, \theta') \, m_{\theta'}(u')$. This results in an acceptance probability that simplifies to
$$\min\left(1, \frac{p(\theta') \, \hat{L}(\theta', u'; y) \, q(\theta', \theta)}{p(\theta) \, \hat{L}(\theta, u; y) \, q(\theta, \theta')}\right),$$
which does not involve any evaluation of the intractable likelihood $L(\theta; y)$. Thus the algorithm proceeds exactly as a standard MH algorithm with proposal density $q$, with the difference that likelihood evaluations are replaced by estimators $\hat{L}(\theta, U; y)$ with $U \sim m_\theta$. The performance of the pseudo-marginal algorithm depends on the likelihood estimator: lower variance estimators typically yield ergodic averages with lower asymptotic variance, but the cost of producing lower variance estimators tends to be higher, which leads to a trade-off analyzed in detail in Doucet et al. (2015); Schmon et al. (2018).
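One pseudo-marginal MH update can be sketched in Python as follows (a sketch with our own function and variable names, assuming a symmetric proposal). Note that the log of the likelihood estimate attached to the current state is carried along and reused until the next acceptance, never refreshed:

```python
import numpy as np

def pmmh_step(theta, log_lhat, log_prior, draw_log_lhat, propose, rng):
    """One pseudo-marginal MH update with a symmetric proposal.
    The chain state is (theta, log_lhat), where log_lhat is the log of the
    non-negative unbiased likelihood estimate produced when theta was last
    accepted. draw_log_lhat(theta, rng) returns a fresh log-estimate."""
    theta_prop = propose(theta, rng)
    log_lhat_prop = draw_log_lhat(theta_prop, rng)   # fresh estimator at proposal
    log_ratio = (log_prior(theta_prop) + log_lhat_prop
                 - log_prior(theta) - log_lhat)
    if np.log(rng.uniform()) < log_ratio:
        return theta_prop, log_lhat_prop   # accept: adopt the new estimate
    return theta, log_lhat                 # reject: keep the old estimate
```

For an asymmetric proposal, the log of the proposal density ratio would be added to `log_ratio`, recovering the acceptance probability displayed above.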
In the following we will generically write $\hat{L}(\theta)$ instead of $\hat{L}(\theta, U; y)$, with $U \sim m_\theta$, for notational simplicity. The above description defines a Markov kernel $P$, and we next proceed to defining a coupled kernel $\bar{P}$, to be used for unbiased estimation as in Algorithm 1.
2.2 Coupled pseudo-marginal Metropolis–Hastings
To define a kernel $\bar{P}$ that is marginally identical to $P$ but jointly allows the chains to meet, we proceed as follows, mimicking the coupled MH kernel in Algorithm 2. First, the proposed parameters $(\theta^\star, \tilde{\theta}^\star)$ are sampled from a maximal coupling of the two proposal distributions. If the two proposed parameters $\theta^\star$ and $\tilde{\theta}^\star$ are identical, we sample a unique likelihood estimator and use it in the acceptance step of both chains. Otherwise, we sample two estimators, $\hat{L}(\theta^\star)$ and $\hat{L}(\tilde{\theta}^\star)$. Denoting the two states of the chains at step $t$ by $X_t$ and $Y_{t-1}$, Algorithm 3 describes how to obtain $X_{t+1}$ and $Y_t$, thereby describing a kernel $\bar{P}$.
In step 2. of Algorithm 3 the two likelihood estimators $\hat{L}(\theta^\star)$ and $\hat{L}(\tilde{\theta}^\star)$ can be generated independently, as we will do below for simplicity. They can also be sampled together in a way that induces positive correlations, for instance using common random numbers and other methods described in Deligiannidis et al. (2015); Jacob et al. (2018). We leave the exploration of possible gains from correlating likelihood estimators in that step as a future avenue of research. An appealing aspect of Algorithm 3, particularly when using independent estimators in step 2., is that existing implementations of likelihood estimators can be readily used. In Section 3.2 we will exploit this by demonstrating the use of controlled sequential Monte Carlo (Heng et al., 2017) in the proposed framework. Likewise, one could explore the use of other advanced particle filters such as sequential quasi-Monte Carlo (Gerber and Chopin, 2015). To summarize, given an existing implementation of a pseudo-marginal kernel, Algorithm 3 involves only small modifications and the extra implementation of a maximal coupling, which itself is relatively simple following, for example, Jacob et al. (2017).
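The likelihood estimators plugged into such a scheme are often produced by particle filters. As a self-contained illustration (a minimal sketch with our own function names, using multinomial resampling), a bootstrap particle filter returns an estimate of the log-likelihood whose exponential is unbiased for the likelihood:

```python
import numpy as np

def bootstrap_log_likelihood(y, n_particles, sample_init, sample_transition,
                             log_obs_density, rng):
    """Bootstrap particle filter estimate of log L(theta; y_{1:T}).
    The exponential of the returned value is an unbiased estimator of the
    likelihood. sample_init(N, rng) returns N initial particles,
    sample_transition(x, rng) propagates a vector of particles, and
    log_obs_density(y_t, x) evaluates the log observation density particle-wise."""
    x = sample_init(n_particles, rng)
    log_lhat = 0.0
    for y_t in y:
        log_w = log_obs_density(y_t, x)
        max_lw = log_w.max()                    # log-sum-exp for stability
        w = np.exp(log_w - max_lw)
        log_lhat += max_lw + np.log(w.mean())   # log of the average weight
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        x = sample_transition(x[idx], rng)      # multinomial resampling + move
    return log_lhat
```

Independent calls to this function, one per chain, implement the independent-estimator variant of step 2.; correlating the calls would require sharing the underlying random numbers.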
2.3 Theoretical guarantees
We provide sufficient conditions to ensure that the coupled pseudo-marginal algorithm returns unbiased estimators with finite variance and finite expected computation time, i.e. sufficient conditions to satisfy the requirements of Theorem 2. By introducing the parameterization $w = \log\big(\hat{L}(\theta, u; y) / L(\theta; y)\big)$, and denoting by $Q_\theta$ the distribution of the noise $w$ when $U \sim m_\theta$, we can rewrite the pseudo-marginal kernel
where, in this parameterization, we write
and where $\rho(\theta, w)$ denotes the corresponding rejection probability. We first make assumptions about the target and proposal densities.
The target posterior density $\pi$ is strictly positive everywhere and continuously differentiable. Its tails are super-exponentially decaying and have regular contours, that is,
$$\lim_{|\theta| \to \infty} \frac{\theta}{|\theta|} \cdot \nabla \log \pi(\theta) = -\infty, \qquad \limsup_{|\theta| \to \infty} \frac{\theta}{|\theta|} \cdot \frac{\nabla \pi(\theta)}{|\nabla \pi(\theta)|} < 0,$$
where $|\theta|$ denotes the Euclidean norm of $\theta$. Moreover, the proposal distribution satisfies $q(\theta, \theta') = q(\theta' - \theta)$, with $q$ a bounded, symmetric density that is bounded away from zero on all compact sets.
We then make assumptions about the moments of the noise.
There exist constants $\gamma > 0$ and $M < \infty$ such that
where the essential supremum is taken with respect to the Lebesgue measure. Additionally, the family of distributions $\{Q_\theta : \theta \in \Theta\}$ defined by the noise densities is continuous with respect to $\theta$ in the topology of weak convergence.
Both assumptions are used in Andrieu and Vihola (2015) to establish a drift condition for the pseudo-marginal algorithm. Assumption 6 can be understood as a condition on the ‘ideal’ algorithm, i.e. the algorithm in which the likelihood could be evaluated exactly, and Assumption 7 ensures that the likelihood estimator has neither too much mass around zero nor in the tails. The following proposition follows from establishing minorization conditions for both the pseudo-marginal and coupled pseudo-marginal kernels, along with Andrieu and Vihola (2015, Theorem 38).
where $V$ is a drift function and $b < \infty$, $c > 0$ and $\alpha \in (0, 1)$ are constants. Additionally, the minorization conditions (3) and (4) hold for the same set and measure. Finally, under mild regularity conditions on the noise distributions, Assumptions 4 and 5 hold with the same $\alpha$ for the kernel induced by Algorithm 3.
If the assumptions of Proposition 2 are satisfied for $\alpha$ close enough to one then, by application of Theorem 2, the coupling times exhibit the required tail bounds of Assumption 2, provided $\pi_0$ also admits a density with respect to the Lebesgue measure and is supported on a compact set. We note that the uniform moment bounds of Assumption 7 might not be satisfied on many non-compact parameter spaces. A weaker assumption, under which the polynomial drift condition still holds, is provided in Andrieu and Vihola (2015, Condition 44) and could alternatively be used here.
2.4 Tails of meeting times in a toy experiment
We provide numerical experiments on the tails of the meeting time in a toy example, to illustrate the transition from geometric to polynomial tails. The target $\pi$ is a bivariate Normal distribution, and the initial distribution $\pi_0$ is uniform over the unit square. To emulate the pseudo-marginal setting, we pretend that we cannot evaluate the target density $\pi(z)$; instead we have access to $\hat{\pi}(z) = \pi(z) W$ for all $z$, where $W$ is a log-Normal variable, $\log W \sim \mathcal{N}(-\sigma^2/2, \sigma^2)$, so that $\mathbb{E}[W] = 1$, and $\sigma$ calibrates the amount of noise of the unbiased estimator of $\pi(z)$. We consider a pseudo-marginal Metropolis–Hastings algorithm with a Gaussian random walk proposal distribution, and a coupled version following Algorithm 3. Indeed, in this setting we are able to verify Assumptions 6 and 7 directly. We note that in the case $\sigma = 0$, we recover the standard MCMC setting.
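A one-dimensional analogue of this experiment can be sketched as follows (a simplification of the bivariate setting above, with our own function names): two pseudo-marginal chains are coupled through a maximal coupling of the random walk proposals, sharing the likelihood estimator and the uniform variable whenever the proposals coincide.

```python
import numpy as np

def norm_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def max_coupling(mu1, mu2, sd, rng):
    """Maximal coupling of N(mu1, sd^2) and N(mu2, sd^2) (rejection sampler)."""
    z1 = rng.normal(mu1, sd)
    if rng.uniform() * norm_pdf(z1, mu1, sd) <= norm_pdf(z1, mu2, sd):
        return z1, z1
    while True:
        z2 = rng.normal(mu2, sd)
        if rng.uniform() * norm_pdf(z2, mu2, sd) > norm_pdf(z2, mu1, sd):
            return z1, z2

def log_pi(z):
    return -0.5 * z * z                 # standard Normal target, unnormalized

def meeting_time(sigma, rng, prop_sd=1.0, max_iter=100000):
    """Meeting time of two coupled pseudo-marginal chains targeting N(0, 1)
    with multiplicative log-Normal noise calibrated by sigma."""
    noise = lambda: rng.normal(-0.5 * sigma ** 2, sigma) if sigma > 0 else 0.0
    x, y = rng.uniform(), rng.uniform()  # pi_0 = uniform on the unit interval
    wx, wy = noise(), noise()            # current log-noise of each chain
    # advance the first chain one step to create the lag
    zx, wz = rng.normal(x, prop_sd), noise()
    if np.log(rng.uniform()) < log_pi(zx) + wz - log_pi(x) - wx:
        x, wx = zx, wz
    for t in range(2, max_iter):
        zx, zy = max_coupling(x, y, prop_sd, rng)
        if zx == zy:
            wz = noise()
            wzx = wzy = wz               # shared likelihood estimator
        else:
            wzx, wzy = noise(), noise()
        log_u = np.log(rng.uniform())    # common uniform for both chains
        if log_u < log_pi(zx) + wzx - log_pi(x) - wx:
            x, wx = zx, wzx
        if log_u < log_pi(zy) + wzy - log_pi(y) - wy:
            y, wy = zy, wzy
        if x == y and wx == wy:
            return t
    return None
```

Survival probabilities $\mathbb{P}(\tau > t)$ can then be estimated from independent replicates of `meeting_time` for each value of `sigma`; once the chains have met they receive identical proposals, estimators and uniforms, and so remain together.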
We draw independent realizations of the meeting time $\tau$ for $\sigma$ in a grid of values. We then approximate the survival probabilities $\mathbb{P}(\tau > t)$ by their empirical counterparts, for $t$ between $1$ and a high quantile of the meeting times for each $\sigma$. The resulting estimates of $\mathbb{P}(\tau > t)$ are plotted against $t$ in Figure 1(a), where the y-axis is in log-scale. First note that in the case $\sigma = 0$, $\log \mathbb{P}(\tau > t)$ seems to be bounded by a linear function of $t$, which would correspond to $\mathbb{P}(\tau > t) \leq C \delta^t$ for some constants $C < \infty$ and $\delta \in (0, 1)$. This is indeed the expected behavior in the case of geometrically ergodic Markov chains (Jacob et al., 2017).
As $\sigma$ increases, $\mathbb{P}(\tau > t)$ decreases less rapidly as a function of $t$. To verify whether $\mathbb{P}(\tau > t)$ might be bounded by a polynomial $C t^{-\kappa}$ (as our theoretical considerations suggest), we plot $\mathbb{P}(\tau > t)$ against $t$ with both axes in log-scale in Figure 1(b), with a focus on the tails. The figure confirms that the log survival probability might indeed be upper bounded by a linear function of $\log t$, up to a constant offset, for large enough values of $t$. The figure suggests also that in this case the exponent $\kappa$ decreases with $\sigma$.
3 Experiments in state space models
State space models are a popular class of time series models. These latent variable models are defined by an unobserved Markov process $(x_t)_{t \geq 1}$ and an observation process $(y_t)_{t \geq 1}$, where the observations are conditionally independent given $(x_t)_{t \geq 1}$, with
$$x_1 \sim \mu_\theta(\cdot), \qquad x_{t+1} \mid x_t \sim f_\theta(\cdot \mid x_t), \qquad y_t \mid x_t \sim g_\theta(\cdot \mid x_t),$$
where $\theta$ parameterizes the distributions $\mu_\theta$, $f_\theta$ and $g_\theta$ (termed the ‘initial’, ‘transition’ and ‘observation’ distribution respectively). Given a realization $y_{1:T}$ of the observations, we are interested in performing Bayesian inference on the parameter $\theta$, to which we assign a prior density $p(\theta)$. The posterior density of interest is thus $\pi(\theta) \propto p(\theta) L(\theta; y_{1:T})$, where the likelihood $L(\theta; y_{1:T})$ is usually intractable. It is possible to obtain a non-negative unbiased estimator of $L(\theta; y_{1:T})$ using particle filtering, where here