 # Delayed acceptance ABC-SMC

Approximate Bayesian computation (ABC) is now an established technique for statistical inference used in cases where the likelihood function is computationally expensive or not available. It relies on the use of a model that is specified in the form of a simulator, and approximates the likelihood at a parameter θ by simulating auxiliary data sets x and evaluating the distance of x from the true data y. However, ABC is not computationally feasible in cases where using the simulator for each θ is very expensive. This paper investigates this situation in cases where a cheap, but approximate, simulator is available. The approach is to employ delayed acceptance Markov chain Monte Carlo (MCMC) within an ABC sequential Monte Carlo (SMC) sampler in order to, in a first stage of the kernel, use the cheap simulator to rule out parts of the parameter space that are not worth exploring, so that the "true" simulator is only run (in the second stage of the kernel) where there is a reasonable chance of accepting proposed values of θ. We show that this approach can be used quite automatically, with the only tuning parameter choice additional to ABC-SMC being the number of particles we wish to carry through to the second stage of the kernel. Applications to stochastic differential equation models and latent doubly intractable distributions are presented.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

### 1.1 Motivation

Approximate Bayesian computation (ABC) is a technique for approximate Bayesian inference originally introduced in the population genetics literature

(Pritchard et al., 1999; Beaumont et al., 2002), but which is now used for a wide range of applications. It is suitable for situations in which a model with parameters for data is readily available in the form of a simulator, but where as a function of (known as the likelihood) is intractable in that it cannot be evaluated pointwise. If is assigned prior distribution the object of inference is the posterior distribution . ABC yields approximations to the posterior distribution through approximating the likelihood, in the simplest case using

 ˆfϵ(y∣θ)=1MM∑m=1Pϵ(y∣x(m)). (1)

where is a kernel centred around and points are sampled from

. We may see Monte Carlo ABC algorithms as sampling from a joint distribution on

and proportional to , where is used as a proposal distribution for .

In this paper we describe methods that are designed to be applicable in cases where the likelihood estimation is computationally costly because the model

is expensive to simulate from, for example when studying an ordinary differential equation (ODE) or stochastic differential equation (SDE) model

(Picchini, 2014) that requires a solver with the small step size, or when using an individual based model (van der Vaart et al., 2015) with a large number of individuals. This situation is not uncommon, since even if the model takes only a matter of seconds to simulate, this cost is prohibitive when it needs to be simulated a number of times due to estimating the likelihood at a large number of Monte Carlo points in -space. We note that this situation is exacerbated when is of moderate to high dimension, since in these cases (as in any Monte Carlo method) it is difficult to design a proposal that can efficiently make moves on all dimensions of simultaneously. Most commonly, each dimension of is updated using a single component move, and in the ABC context this means using simulations from the likelihood when updating each dimension of .

### 1.2 Previous work and contributions

In this paper we also aim to limit the number of points at which we need to perform an expensive estimation of the likelihood. In section 2.2.1 we describe delayed-acceptance ABC-MCMC for exploring -space, and use it as a method for screening for and discarding

points that are unlikely to be in regions of high posterior density. Delayed acceptance (DA) decomposes an MCMC move into two stages. At the first stage a point is proposed and is accepted or rejected according to a posterior that is usually chosen to approximate the desired posterior. If accepted at the first stage, at the second stage an acceptance probability is used that “corrects” for the discrepancy between the approximate and the desired target. Thus we may think of the first stage as screening for points to take through to the second stage. Compared to the previous work, this approach is most similar to the “early rejection” of

Picchini and Forman (2016) who use the prior for this screening step.

Delayed-acceptance ABC-MCMC makes use at the first stage of an approximate, but computationally cheap, alternative to the full simulator. We will see that our approach is applicable in cases in which the is a cheap approximate simulator is performed independently to the full simulator, and also where the full simulator is a continuation of the initial cheap simulation. This latter situation is the same as that considered in “Lazy” ABC, in which is it possible to monitor a simulation as it is progressing, and to be able to assess before the full simulation whether the current point is likely to be in a region of high posterior density.

We will see that using delayed-acceptance ABC-MCMC introduces an additional tuning parameter: the tolerance used for the cheap simulator. When using delayed-acceptance ABC-MCMC directly, it is not always straightforward to set this parameter. In section 2.2.2 we embed delayed-acceptance ABC-MCMC in an ABC-SMC framework, since this is established as being an effective method for guiding Monte Carlo points in -space to regions of high posterior density, and also since we see that it allows for automating the choice of the additional tuning parameter. An appealing feature of this approach is that the only tuning parameter that the user is required to set is to choose the number of times the expensive simulator is run at each iteration of the SMC.

The resultant algorithm is particularly useful for distributions where it is difficult to design a useful proposal distribution. When our proposal has a poor acceptance rate, we may need to run the expensive simulator many times for a single acceptance. However, with delayed acceptance we may propose a very large number of points without the need to run the expensive simulator at all: instead we evaluate whether the proposed points in a high density region according to the cheap simulator. For points that lie in this region (i.e. that pass the first stage of DA), we must then run the expensive simulator, and expect to accept a reasonable proportion of these (as long as the cheap simulator is close to the expensive simulator). Essentially the first stage of the delayed acceptance acts to give us a well-targeted proposal.

Section 3 considers an application to SDEs, where we examine using cheap simulators that use a large step size in the numerical solver. Section 4 considers applications to doubly intractable distributions: applying ABC to these models (and other Markov random fields) has the complication in that no computationally feasible exact simulator is available. Instead an MCMC algorithm is commonly used as an approximate approach (Grelaud et al., 2009; Everitt, 2012). In such approaches the burn in of this MCMC method determines the accuracy of the approach: for a small burn in the method is biased, with this bias being eliminated by a large burn in which may be computationally expensive. We examine the cases of latent Ising models and latent exponential random graph models (ERGMs) previously studied in Everitt (2012), noting the potential advantage of ABC in these cases compared to methods such as the exchange algorithm. We then conclude with a discussion in section 5.

## 2 Delayed acceptance ABC-SMC

### 2.1 Delayed acceptance

This section introduces delayed acceptance (Christen and Fox, 2005) as being the algorithm that results when one Metropolis-Hastings kernel is used as a proposal within another. We note here the link with pseudo-marginal type algorithms that, as a special case, show how to use importance sampling or SMC within a Metropolis-Hastings algorithm.

Let Metropolis-Hastings transition kernel have invariant distribution and suppose our aim is to use the distribution as the proposal distribution in another Metropolis-Hastings kernel with invariant distribution . It turns out that we can use the output of as draw from as if it is a point drawn from . That is, we apply to to to obtain , then accept with probability

 α2(θ,θ∗)=min{1,π2(θ∗)π1(θ)π2(θ)π1(θ∗)}. (2)

Verifying this acceptance probability is simple. Since is simulated from , we have an acceptance probability for of

 α2(θ,θ∗)=min{1,π2(θ∗)K1(θ∣θ∗)π2(θ)K1(θ∗∣θ)}. (3)

Now, satisfies the detailed balance equation with respect to

 π1(θ∗)K1(θ∗∣θ)=π1(θ)K1(θ∣θ∗), (4)

and substituting this into 3 we obtain the desired result of equation 2.

In the literature on delayed acceptance, is chosen to be a desired target distribution that may be computationally expensive to evaluate, and is chosen to be a cheaper, approximate target distribution. Used in this way, the first stage of delayed acceptance (i.e. the result of applying ) provides a proposal that should be well suited to being used in an MH algorithm with target , e.g. Banterle et al. (2014) uses approximate likelihoods based on subsets of the data as the approximate targets. Strens (2004) uses a very similar idea, using a hierarchical decomposition of the likelihood. Banterle et al. (2014) notes that whilst a standard MH algorithm dominates delayed acceptance in terms of the Peskun ordering, it is possible that a reduced computational cost in the first stage of delayed acceptance can lead to more efficient algorithms in terms of the Monte Carlo error per computational time.

### 2.2 Delayed acceptance ABC-SMC

#### 2.2.1 Delayed acceptance with ABC

This paper makes use of delayed acceptance in the ABC setting. Suppose that there exists a computationally cheap alternative to our “true” simulator . We wish to perform the first stage of delayed acceptance using points simulated from , using this to determine the parameters at which we simulate points from from which we estimate the standard ABC likelihood. At the first stage of delayed acceptance, we use an ABC-MCMC move with the simulator , using a tolerance . We use the extended state space view of ABC-MCMC, where the move is seen to be a Metropolis-Hastings move on the space , with being proposed via and via . Further, in order to fit directly into the framework of section 2.1 we may view the move as being on the space , with being proposed via . although we will see that in practice this simulation does not need to be performed at the first stage (this construction is essentially the same as in Sherlock et al. (2017)).

The acceptance probability at the first stage is

 α1 = min{1,p(θ∗)l1(x∗1∣θ∗)Pϵ1(y1∣x∗1)l2(x∗2∣x∗1,θ∗)p(θ)l1(x1∣θ)Pϵ1(y1∣x1)l2(x2∣x1,θ)q(θ∣θ∗)l1(x1∣θ)l2(x2∣x1,θ)q(θ∗∣θ)l1(x∗1∣θ∗)l2(x∗2∣x∗1,θ∗)} = min{1,p(θ∗)Pϵ1(y1∣x∗1)p(θ)Pϵ1(y1∣x1)q(θ∣θ∗)q(θ∗∣θ)},

where may be the same as the complete data set (but is not required to be) and we observe that the marginal distribution of the target we have used is the ABC posterior using . and . Using equation 2, the acceptance probability at the second stage is

 α2 = min{1,p(θ∗)l1(x∗1∣θ∗)l2(x∗2∣x∗1,θ∗)Pϵ2(y∣x∗1,x∗2)p(θ)l1(x1∣θ)l2(x2∣x1,θ)Pϵ2(y∣x1,x2)p(θ)l1(x1∣θ)Pϵ1(y1∣x1)l2(x2∣x1,θ)p(θ∗)l1(x∗1∣θ∗)Pϵ1(y1∣x∗1)l2(x∗2∣x∗1,θ∗)} =

where the marginal distribution of the target we have used is the ABC posterior using . and . Note that we have included the conditioning on in and , which allows for the possibility that the “true” simulator is a composition of the simulators and , with being a partial simulation. If the true simulator is simply , we may drop this conditioning on from and .

#### 2.2.2 Use in ABC-SMC

The use of SMC with ABC originates in Sisson et al. (2007). The observation in this paper is that the ABC posterior is a natural candidate for simulating from using SMC, in that a sequence of posteriors with a decreasing sequence of tolerances is a natural and useful choice as a sequence of distributions to use in SMC. In this paper we examine the use of the delayed acceptance ABC-MCMC approach from the preceding section within the ABC-SMC algorithm of Del Moral et al. (2012). The appendix gives a thorough background of ABC-SMC as used in this paper. Most details remain the same the standard algorithms in the appendix, but we make the following remarks.

• The SMC sampler operates on a sequence of target distributions where decreases at each iteration, thus we use to denote its value at the -th iteration. and may also change between SMC iterations, and we denote their values at the -th iteration by and respectively.

• The target distribution used at iteration is .

• The weight update for each particle is given by

 ~wt+1 = wtp(θt)l1(x1,t∣θt)l2(x2,t∣x1,t,θt)Pϵ2,,t+1(y∣x1,t,x2,t)p(θt)l1(x1,t∣θt)l2(x2,t∣x1,t,θt)Pϵ2,,t(y∣x1,t,x2,t) = wtPϵ2,,t+1(y∣x1,t,x2,t)Pϵ2,,t(y∣x1,t,x2,t).
• The first stage of the delayed acceptance ABC-MCMC kernel is applied to every particle, before the second stage is applied to the particles that make it past the first.

#### 2.2.3 An adaptive first stage when using indicator functions

In this section we consider the situation when is chosen to be an indicator function; i.e. , where is a distance metric. In this case when specifying our acceptance probabilities we need to account for the our target distributions having zero density in some parts of the space. We follow the framework of Tierney (1998), in which for a target and proposal , the MH acceptance probability is written as

where . Thus, at the -th iteration of the SMC the acceptance probability at the first stage of the delayed acceptance is

 α1,t=⎧⎪⎨⎪⎩min{1,p(θ∗t)p(θt)q(θt∣θ∗t)q(θ∗t∣θt)}d(y1,t,x1,t),d(y1,t,x∗1,t)<ϵ1,t0otherwise. (5)

As in Picchini and Forman (2016), we may perform the first stage of delayed acceptance in two stages, which we will refer to as steps 1a and 1b, such that some simulations from may be avoided. At step 1a, is simulated from and an accept-reject step is performed using the acceptance probability

 α1a,t=min{1,p(θ∗t)p(θt)q(θt∣θ∗t)q(θ∗t∣θt)},

then at step 1b is simulated from and the entire move is accepted (to be used in stage 2) with probability

 α1b,t={1d(y1,t,x1,t),d(y1,t,x∗1,t)<ϵ1,t0otherwise. (6)

Splitting the first stage into two substages could itself be seen as a form of delayed acceptance, but its acceptance rate is the same as the single step implementation since it simply uses the fact that is either 1 or 0 in order to reorganise the single step calculation in a computationally efficient way.

The acceptance probability at the second stage is

 α2,t={1d(y1,t,x1,t),d(y1,t,x∗1,t)<ϵ1,t and d(y,x2,t),d(y,x∗2,t)<ϵ2,t0otherwise,

which may be seen directly from equation 3 when making the appropriate choices of and . Note that must be true for the particle to have non-zero weight, and must be satisfied to reach the second stage of DA, thus in practice we use

 α2,t={1d(y,x∗2,t)<ϵ2,t0otherwise.

Del Moral et al. (2012) make use of a useful property of ABC to create an adaptive algorithm: that once simulation of has been performed for any , it is computationally cheap to estimate the ABC likelihood for any tolerance . Here we use this property, together with the fact we have available a population of particles in the SMC algorithm, to automatically determine an appropriate value for at every iteration. We choose using the criterion that we desire to perform the second stage of the delayed acceptance for a fixed proportion of the particles; we choose such that particles are accepted to move forward to the second stage according to equation 6, where is chosen prior to running the SMC. This is achieved by choosing such that particles satisfy : a bisection method may be used to find such an . As with the bisection used in the adaptive approach of Del Moral et al. (2012), in practice the bisection will not always give precisely particles at the second stage of DA; and if fewer than particles pass early rejection due to the prior, is chosen to let all these particles through to the second stage.

#### 2.2.4 Discussion

This ABC-SMC algorithm, which we call DA-ABC-SMC, is described in full in the appendix. In brief, we use this algorithm with: adaptively to choosing the sequence such that there are unique particles after reweighting and resampling (Bernton et al., 2017)

; adaptively choosing the variance of the (Gaussian) MCMC proposals to be the sample variance if the current particle set. The method requires us to specify: the number of particles

, the number of particles that are allowed past the first stage of the DA and the number of unique particles . We now discuss how to choose each of these parameters.

• Number of unique particles. On terminating the algorithm, we effectively have a sample of size from the final ABC posterior, thus should be chosen to be the number of Monte Carlo points we wish to generate.

• Number of particles allowed past the first stage. dictates the computational cost of each iteration, since it is the number of expensive simulations we will perform per SMC iteration. It is our hope that by increasing , we will be able to choose a smaller ABC tolerance at the first stage of DA, thus that the particles that make it through to stage two of DA are more likely to generate simulations close to the real data, therefore increasing the expected acceptance rate of the second stage of the DA. Overall the DA-MCMC move will accept new particles, thus as long as is not too low (e.g. 5%) it is reasonable to assume that choosing is sufficient to ensure that useful reductions in the ABC tolerance will be made at each iteration of the SMC. (Further discussion about appropriate acceptance rates in DA may be found in Sherlock et al. (2015).)

• Total number of particles. needs to be chosen such that is not small. In practice to choose we need to consider the factor by which the cheap simulator is faster than the expensive simulator. For DA to be most effective, we require that the cost of running the cheap simulator is small compared to running the expensive simulator, i.e. that , but that so that the DA procedure results in a better MCMC proposal than would otherwise have been used.

If is chosen to be much larger than , we require a slightly non-standard initial step in our SMC algorithm in order that the computational expense of this step does not dominate the subsequent iterations. In a standard SMC sampler, the initial step would require initial values of and to be simulated for each of the particles, which requires running the expensive simulator times. Instead we simulate times from the initial distribution, and repeat these points times so that we have a sample of size from the initial distribution. This sample contains only unique points but, importantly, will result in proposed points at every MCMC move in the SMC.

The parameter plays a role that is slightly different to a standard SMC sampler, in which we might informally think of it as roughly the size of the Monte Carlo sample (in DA-ABC-SMC, plays this role instead). In DA-ABC-SMC, is one of three factors determining the “DA proposal” (i.e. the distribution of points that arrive at the second stage of DA), the remaining factors being: the choice of ; and the distribution of the particles from the previous iteration of the SMC. A useful DA proposal has similar characteristics to a good independent MCMC or importance sampling proposal: we wish it to be close to the target distribution, with wider tails. Since the DA proposal depends on the previous distribution of the particles, we observe empirically that it can to some extent automatically “track” the target distribution over SMC iterations, and provide a useful proposal distribution at each SMC iteration. However, this relies on: the ABC posterior using being centred on roughly the same parameter values as the ABC posterior using ; not being too small so that is large, so that the DA proposal is too diffuse; or not being too large so that is too small, so that the DA proposal is too narrow. In sections 3 and 4, we illustrate the performance of our algorithm for different choices of , which result in different sequences of DA proposals.

## 3 Application to stochastic differential equation models

### 3.1 Lotka-Volterra model

The Lotka-Volterra model is a stochastic Markov jump process that models the number of individuals in two populations of animals: one predator population, and the other being their prey. It is a commonly used example in ABC since it is possible to simulate exactly from the model using the Gillespie algorithm (Gillespie, 1977), but the likelihood is not available pointwise. We follow the description of the model in Wilkinson (2013), in which represent the number of predators and the number of prey. The following reactions may take place:

• A prey may be born, with rate , increasing by one.

• The predator-prey interaction in which increases by one and decreases by one, with rate (note that in other specifications of the the model this single reaction is separated into two: a predator birth and a predator eating a prey, each with different rates).

• A predator may die, with rate , decreasing by one.

Papamakarios and Murray (2016) note that for most parameters this model has the behaviour that the size of one population quickly decreases to zero (in the case of the predators dying out, this results in the prey population growing exponentially). The relatively small region of parameter space that contains values of the parameters that result in oscillating population sizes, such as those in figure 0(a), makes this a relatively challenging inference problem.

In this section we use this example to demonstrate the use of DA-ABC-SMC for SDE models that need to be simulated numerically. To do this, we use the chemical Langevin equation approximation to the Markov jump process, as detailed in Golightly et al. (2015). This results in two coupled non-linear SDEs, which we simulate numerically using the Euler-Maruyama method.

### 3.2 Results

We use the data “LVPerfect” (figure 0(a)) in the R packge smfsb (Wilkinson, 2011) (the numerical methods for simulating from the likelihood are also taken from this package), previously studied in Wilkinson (2013). In this data the simulation starts with initial populations and , and has 30 time units, with the values of and being recorded every 2 time units, resulting in 16 data points in each of the two time series. Our prior follows that in Wilkinson (2013), being uniform in the domain. Specifically we use

 p(log(θ))∝3∏i=1U(log(θi)∣lower=−6,upper=2).

Our ABC approach follows that in Wilkinson (2013); Papamakarios and Murray (2016)

: as summary statistics we use a 9-dimensional vector composed of the mean,

variance and first two autocorrelations of each time series, together with the cross-correlation between them. These statistics were normalised by the standard deviation of the statistics determined by a pilot run, precisely as in

Wilkinson (2013). The distance between the summary statistics used in ABC was taken to be the Euclidean norm between the normalised statistic vectors. In all our ABC algorithms we used a final tolerance : reducing the tolerance below this level does not appear to have a large impact on the posterior distribution.

We used DA-ABC-SMC with and two different choices of and two different choices of the Euler-Maruyama step size in the cheap simulator and , both of which result in very rough approximations of the dynamics. We compared these approaches with standard ABC-SMC, with particles, with the sequence of tolerances selected such that unique particles are retained at each iteration. All algorithms used an expensive simulator with Euler-Maruyama step size (which results in a very accurate approximation), and included the scheme of Picchini and Forman (2016) to avoid simulations from the likelihood where they may be rejected using the prior only. In all approaches the MCMC proposal is a Gaussian centred at the current point with variance given by the sample variance of the previous particles. To measure computational cost, we counted the total number of steps simulated using Euler-Maruyama, taking account of the fact that some simulations are cut short by the simulation diverging.

Figures 0(b) and 0(c) show respectively the number of sweeps of the Gibbs sampler used for the values of and that are chosen by the SMC, and figure 0(d) shows the number of accepted points at the second stage of the DA. We see that all configurations of the DA approach appear to significantly outperform standard ABC-SMC, with the best configuration achieving with a reduction in computational cost of approximately one order of magnitude. The effect of changing and are difficult to interpret from these results: this is likely due to a big change between the ABC posteriors as decreases. For larger the posterior is bi-modal with well-separated modes, and as decreases one of the modes disappears. When running ABC-SMC, at some random the sample posterior changes from having two modes to one, and the performance of the algorithm dramatically changes when this occurs. This effect is most obviously observed for the algorithm with , and at around Euler-Maruyama steps. A very large number of runs of the algorithm would be required to discover the average relative behaviour of the different algorithms. Since each run of the algorithm is expensive, we instead investigate the effect of choosing different values of and different cheap simulators on the application in section 4.2. (a) The LVperfect data from smfsb package, simulated from the Lotka-Volterra model.

## 4 Applications to doubly intractable distributions

### 4.1 Background

This section is concerned with the class of likelihoods that may not be evaluated pointwise due to the presence of a intractable normalising constant, i.e. that

 l(y∣θ)=γ(y∣θ)Z(θ),

where is tractable, but it is not computationally feasible to evaluate the normalising constant (also known as the partition function). Such a distribution is known as doubly intractable (Murray et al., 2006) since it is not possible to directly use the Metropolis-Hastings algorithm to simulate from the posterior , due to the presence of in both the numerator and denominator of the acceptance ratio. The most common occurrence of these distributions occurs where factorises as a Markov random field. The most well-established approaches to inference for these models are the single and multiple auxiliary variable approaches (Møller et al., 2006; Murray et al., 2006) and the exchange algorithm (Murray et al., 2006), which respectively use importance sampling estimates of the likelihood and acceptance ratio to avoid the calculation of the normalising constant. From here on we refer to these as “auxiliary variable methods”. ABC (Grelaud et al., 2009) and synthetic likelihood (Everitt et al., 2017) have also been used for inference in these models. See Stoehr (2017) for a recent, thorough review of the literature.

Previous work (e.g. (Friel, 2013) suggests that auxiliary variable methods are more effective than ABC for simulating from the posterior where has an intractable normalising constant. In the full data ABC case, Everitt et al. (2017) suggest that this is because the multiple auxiliary variable approach may be seen to be a carefully designed non-standard ABC method. However, in this paper we consider the situation in Everitt (2012), where our data are indirect observations of a latent (hidden) field , modelled with a joint distribution with having an intractable normalising constant. In this case we might expect ABC to become more competitive, or even to outperform other approaches. The most obvious approach is to use data augmentation: using MCMC to simulate from the joint posterior by alternating simulation from the full conditionals of and , using the exchange algorithm in the update. Everitt (2012) compares this approach with “exchange marginal particle MCMC”, in which an SMC algorithm is used to integrate out the variables at every iteration of an MCMC and finds the particle MCMC approach to be preferable. However, in order to perform well, both of these approaches require efficient simulation from the conditional distribution .

Now compare these “exact” approaches with ABC. In this case, for every , we simulate and , then use an ABC likelihood that compares with . The variables corresponding to a particular variable that are retained in the ABC sample are distributed according to an ABC approximation to ; indeed in order that the ABC is at all efficient, there must be a reasonable probability that simulated is in a region of high probability mass under this distribution. Compared to particle MCMC:

• Standard ABC has the disadvantage that the simulation of is, in contrast to particle MCMC, not conditioned on (although we note recent work in Prangle et al. (2017) in which for some models we may also refine the ABC likelihood by conditioning on ).

• ABC has the advantage that a posteriori is only conditioned on , rather than as in particle MCMC. This condition is often considerably less stringent. For example, consider the Ising model example described below. When conditioning on , the posterior is often restricted to relatively small regions of -space: for individual pixel of the posterior mass may be concentrated in a small region in order to match each individual data point. However, when conditioning on , the posterior may have non-negligible mass in many regions of -space: there are many different configurations of pixels that give rise to similar summary statistics.

In this paper we revist using ABC on these models, examining the data previously studied in Everitt (2012), and show that we may make significant computational savings using DA-ABC-SMC. We now introduce the cheap simulator that makes this improvement possible. In Markov random field models, exact simulation from is either very expensive or intractable. Grelaud et al. (2009) proposes to replace the exact simulator with taking the last point of a long MCMC run, and Everitt (2012) shows that under certain ergodicity conditions, as the length of this “internal” MCMC goes to infinity, the bias in our ABC posterior sample due to this particular approximation also goes to zero (following ideas in Andrieu and Roberts (2009)). However, empirically one finds that often using only a single MCMC iteration is sufficient to yield a posterior close to the desired posterior. In this paper we propose to use an internal MCMC run of a single iteration as the cheap simulator, and to run the internal MCMC for many iterations (we use 1,000) as the expensive simulator, where the final iteration of the cheap simulator is used as the first iteration of the expensive simulator.

In sections 4.2 and 4.3 we apply DA-ABC-SMC to latent Ising models and ERGMs, respectively.

### 4.2 Latent Ising model

An Ising model is a pairwise Markov random field model on binary variables, each taking values in

. Its distribution is given by

 l(xh∣θx)∝exp⎛⎝θx∑(i,j)∈Nxhixhj⎞⎠,

where , denotes the

-th random variable in

and where is a set that defines pairs of nodes that are “neighbours”. We consider the case where the neighbourhood structure is given by a regular 2-dimensional grid. Our data are noisy observations of this field: the -th variable in has distribution

 g(yi∣xhi,θy)∝exp(θyyixhi),

as in Everitt (2012) (with the normalising constant being tractable). We used independent Gaussian priors on and , each with mean 0 and standard deviation 5. Our DA-ABC-SMC algorithm used , and we examined different values of

. A single site Gibbs sampler was used to simulate from the likelihood, with the expensive simulator using the final point of 1,000 sweeps of the Gibbs sampler. The MCMC proposal was taken to be a Gaussian distribution centred at the current particle, with covariance given by the sample covariance of the particles from the previous iteration. We used a single statistic in the ABC: the number of equivalently valued neighbours

(noting that this is not sufficient, but that the particle MCMC results in Everitt (2012) indicate that this does not have a large impact on the posterior). The distance metric used in ABC on the summary statistics was the absolute value of the difference.

We used the same grid of data as was studied in Everitt (2012), shown in figure 1(a), which was generated from the model using . These parameter values represent a relatively weak interaction between neighbouring pixels, and also quite noisy observations. On a grid of this size, there is ambiguity as to whether this data may have been generated with either one or both of and being small, giving a posterior distribution shaped like a cross, similar to that shown in figure 1(e). Figure 1(e) gives an example of a DA proposal, i.e. points that pass the first stage of the DA-MCMC, from the early stages of a run of DA-ABC-SMC. We observe how this distribution would be a suitable independent MCMC proposal for the posterior.

We ran DA-ABC-SMC for several values of , and compared the results with standard ABC-SMC using all of the same algorithmic choices, including using the scheme of Picchini and Forman (2016) to avoid simulations from the likelihood where they may be rejected using the prior only, using and . The standard ABC-SMC algorithm used the expensive simulator, i.e. 1,000 iterations of the internal Gibbs sampler. In algorithms we terminated the method when , or when the number of sweeps of the internal Gibbs sampler reached , whichever was sooner. We focus on studying the total computational effort expended in each algorithm, measured by the total number of sweeps of the internal Gibbs sampler, and also the automatically chosen sequence of tolerances and .

We ran the algorithm with several different values of , and two different cheap simulators: the first where only a single sweep of the Gibbs sampler is used (); the second using the final point of 5 sweeps of the Gibbs sampler (). We chose such that in most cases the computational cost is dominated by the expensive simulations in the second stage of the DA, although when and the cost of the first stage dominates. Figures 1(b) and 1(c) show respectively the number of sweeps of the Gibbs sampler used for the values of and that are chosen by the SMC, and figure 1(d) shows the number of accepted points at the second stage of the DA. We see that all configurations of the DA approach appear to significantly outperform standard ABC-SMC: in all cases the DA approach reaches whereas the ABC-SMC only reaches . As increases we see that the chosen values of are smaller and, as desired, the acceptance rate larger. However, when , we see that has been reduced too quickly and sometimes observe a very large acceptance rate, the result of the DA proposal having lighter tails than the target. However, in most other cases we observe the desired behaviour that for almost all values of . When and the cost of the first stage dominates the second, but the overall cost is still lower than standard ABC-SMC due to the significant improvement in the acceptance rate. In general we observe that using leads to a better performance (lower overall cost, lower and higher acceptance rate) than , due to the improved proposal given by whilst maintaining a relatively low cost of the cheap simulator.

We compare the total number of sweeps of the internal Gibbs sampler in DA-ABC-SMC with the particle MCMC in Everitt (2012). In the results in this paper, sweeps of the Gibbs sampler were used every iteration of the MCMC, thus sweeps are required to produce 100 correlated MCMC points, and at least an order of magnitude more than this to produce an effective sample size of 100. For DA-ABC-SMC we see that effort can be as low as approximately sweeps for 100 unique points from the DA-ABC-SMC, giving a reduction in cost of at least two orders of magnitude.

### 4.3 Latent exponential random graph model

An ERGM is a model for network data in which the global network structure is modelled as having arisen through local interactions. In this section we consider the situation in which the network is not directly observed, thus is a hidden network made up of a random variable for each edge which takes value 1 if the edge is present and 0 if it is absent, and is a noisy observation of this network. The ERGM on is

 l(xh∣θx)∝exp(θTxS(xh)),

with an intractable normalising constant, and our noisy observations modelled by

 g(yi∣xhi,θy)∝exp(θy(2xhi−1)(2yi−1))

where the normalising constant is tractable. We studied the Dolphin network (figure 2(a)) (Lusseau et al., 2003), as also analysed in Caimo and Friel (2011) when the network is treated as directly observed, and used the same summary statistics and priors as in this paper. We used the statistics

 S1(xh)=∑i

with , the prior on and was ; and we used the Euclidean distance to compare simulated with observed statistics. The ergm package (Hunter et al., 2008) in R was used to simulate from , which uses the “tie no tie” (TNT) sampler and the expensive simulator used iterations. Our DA-ABC-SMC algorithm used , and again the MCMC proposal was taken to be a Gaussian distribution centred at the current particle, with covariance given by the sample covariance of the particles from the previous iteration.

We ran DA-ABC-SMC for , and a cheap simulator having (after exploratory runs suggested that this would be enough iterations to provide a useful DA proposal) and compared the results with standard ABC-SMC with the same configuration as in the previous section (both using iterations of the TNT sampler). Figure 2(b) shows the results from the two algorithms, this time showing the sequence alongside the sequence . Again we observe that DA-ABC-SMC outperforms standard ABC-SMC, and that the tolerance changes adaptively such that it provides a useful DA proposal. This data has not previously been studied using a latent ERGM. To use particle MCMC as in Everitt (2012) would require at every MCMC iteration to run an SMC sampler to integrate out the latent ERGM space, which consists of 1891 binary edge variables. We might expect that many SMC particles would be required to produce low variance marginal likelihood estimates, leading to a high computational cost. However, the acceptance rate was very low towards the end of our ABC runs, suggesting that a very large computational cost would be required to reduce to be close to zero.

## 5 Conclusions

In this paper we introduced DA-ABC-SMC as a means for reducing the computational cost of ABC-SMC in the case of an expensive simulator, through using a DA-ABC-MCMC move, in which a cheap simulator is used to “screen” the candidate values of proposed parameters so that less effort needs to be used running the expensive simulator. This cheap simulator may be completely independent of the expensive simulator, but preferably the expensive simulator may be conditioned on the output of the cheap simulator. The tolerance for the cheap simulator is chosen adaptively with DA-ABC-SMC, giving an algorithm that only has three tuning parameters that are relatively easy to choose. The key factor in the performance of the algorithm is the DA proposal, i.e. the distribution of the particles that pass the first stage of DA. For this proposal to be useful, the cheap simulator needs to produce a posterior that is centred around the same values as the expensive simulator, and the total number of particles in the DA-ABC-SMC needs to be chosen appropriately. The empirical results suggest that the method is relatively robust to the choice of , and that the DA proposal has the important property that it changes in an appropriate way as the SMC algorithm progresses, so that it “tracks” the ABC posterior at each step to some extent. This latter property suggests that DA-ABC-SMC may often be preferable to a method that simply uses the cheap simulator as means to construct, say, a non-adaptively designed importance sampling distribution.

Section 3 illustrates that DA-ABC-SMC shows promise for using ABC for inference in ODE or SDE models, where a numerical method is used to simulate from the model. Section 4 revisits the idea of using ABC for inference in latent Markov random field models, and suggests that in some cases this approach may be preferable to “exact” methods such as particle MCMC.

## Acknowledgements

Richard Everitt’s work was supported by BBSRC grant BB/N00874X/1 and EPSRC grant EP/N023927/1, and Paulina Rowińska’s work was supported by EPSRC grant EP/L016613/1 (the Centre for Doctoral Training in the Mathematics of Planet Earth).

## Appendix A Background on ABC-SMC

### a.1 SMC samplers

An SMC sampler (Del Moral et al., 2006) is an importance sampling based Monte Carlo method for generating weighted points from a target distribution through making use of a sequence of target distributions that bridge between some initial proposal distribution from which we can simulate, and the final target . Here we focus on the variant of the algorithm that uses an MCMC kernel, and provide a sketch of the main steps of the algorithm (further details may be found in Del Moral et al. (2006)). The algorithm begins by simulating weighted “particles” from (initially with equal weights). Then the particles are moved from target to through an importance sampling reweighting step, followed by a “move” step in which each particle is simulated from an MCMC kernel starting at its current value, and with target . Let be the value taken by the -th particle after iteration and be its (normalised) weight. The reweighting step then computes the (unnormalised) weight of the -th particle using

 ~w(i)t+1 = w(i)tπt+1(θ(i)t)πt(θ(i)t), (7)

followed by a normalisation step to find .

This algorithm yields an empirical approximation of

 ^πNt=N∑i=1w(i)tδθ(i)t, (8)

where is a Dirac mass at , and an estimate of its normalising constant

 ^Zt=t∏s=1N∑i=1w(i)sπs+1(θ(i)s)πs(θ(i)s). (9)

Usually the variance of the weights of the particles is monitored during the algorithm, and if this becomes too large (which would lead to high variance estimators if no intervention was made) a resampling step is performed after reweighting. The resampling step at iteration simulates particles from the empirical distribution .

SMC outperforms importance sampling in cases where one needs many points to obtain low variance importance sampling estimates when directly using as a proposal for , i.e. when the distance between and is too large. In order to ensure that that SMC estimates using the algorithm above are low variance, we require that the distance between and is small for all . A proxy for the distance between subsequent targets that may be estimated as the algorithm runs is given by the effective sample size (ESS) (Kong et al., 1994)

 ESS=(N∑i=1(w(i)t+1)2)−1. (10)

A large ESS corresponds to a small distance between targets although, as referred to in the following section, a high ESS is necessary but not sufficient for achieving a low Monte Carlo variance.

### a.2 Abc-Smc

Sisson et al. (2007) remark that the ABC posterior is a natural candidate for simulating from using SMC, in that a sequence of posteriors with a decreasing sequence of tolerances to is a natural and useful choice as a sequence of distributions to use in SMC. We follow Del Moral et al. (2012) in choosing the sequence of distributions

 πt(θt,xt∣y)∝p(θt)l(xt∣θt)Pϵt(y∣xt) (11)

such that , with being the corresponding auxiliary variable (from equation LABEL:eq:abc_approx) at iteration . This leads to the following weight update when moving from target to

 ~w(i)t+1=w(i)tPϵt+1(y∣x(i)t)Pϵt(y∣x(i)t). (12)

• The weight update is computationally cheap and requires no simulation to produce , which has been generated either in the initial step of the algorithm or as part of a previous MCMC move. Additionally it is cheap to compute for any choice of : a fact which we use below.

• In ABC a common choice for the kernel is , where is a distance metric, in which case the particle weights are either zero (“dead”) or non-zero (“alive”). In this situation we: use a resampling algorithm (such as stratified resampling) that ensures that every particle that is alive before resampling is represented in the resampled particles; and force the resampling step to be used at every iteration of the SMC to avoid propagating “dead” particles with zero weight. A consequence of this is that the denominator in equation 12 is positive and the same for every particle, thus in practice the reweighting step becomes

 ~w(i)t+1=w(i)tI(d(y,x(i)t)<ϵt+1)

(for simplicity we have omitted the additional normalising term for that is required in order to provide a correct marginal likelihood estimator - see Didelot et al. (2011) for details).

The first point is used in Del Moral et al. (2012) in order to devise an adaptive ABC-SMC algorithm that automatically determines the sequence . At every SMC iteration a bisection method is used to determine the that will result in an ESS that is some proportion (e.g. 90%) of (if resampling is not performed at every step, the conditional ESS (Zhou et al., 2015) should be used in place of the ESS defined above). The second point has the consequence that the ESS is equal to the number of “alive” particles that have non-zero weight after the update. This scheme is revised slightly in Bernton et al. (2017), and it is this revised scheme that is used in this paper. The revision addresses the issue that the acceptance rate of ABC-MCMC moves decreases as the ABC-SMC progresses (since the ABC tolerance decreases). When the MCMC moves have a poor acceptance rate, the ESS is not a good criterion for deciding on a new tolerance, since it is unaffected by the values taken by each particle: even if all particles take the same value, the ESS may be high. Therefore Bernton et al. (2017) suggest instead to choose such that some number (with ) of the particles will be unique after resampling has been performed. In practice the bisection method will not often yield precisely unique particles, but for simplicity we do not refer to this again in the paper. The ABC-SMC algorithm with this adaptation is given in the appendix.

It is also common practice to use the current set of particle to adaptively set the proposal to be used in the MCMC kernels used in the “move” step. The simplest scheme (used here) sets the proposal covariance to be some multiple of the empirical covariance of the particles (this choice being rooted in results on optimal proposals for some MCMC algorithms).

## References

• Andrieu and Roberts (2009) Andrieu, C. and Roberts, G. O. (2009, apr). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics 37(2), 697–725.
• Banterle et al. (2014) Banterle, M., Grazian, C., and Robert, C. P. (2014). Accelerating Metropolis-Hastings algorithms: Delayed acceptance with prefetching. arXiv.
• Beaumont et al. (2002) Beaumont, M. A., Zhang, W., and Balding, D. J. (2002, dec). Approximate Bayesian computation in population genetics. Genetics 162(4), 2025–35.
• Bernton et al. (2017) Bernton, E., Jacob, P. E., Gerber, M., and Robert, C. P. (2017). Inference in generative models using the Wasserstein distance. arXiv.
• Caimo and Friel (2011) Caimo, A. and Friel, N. (2011). Bayesian inference for exponential random graph models. Social Networks33, 41–55.
• Christen and Fox (2005) Christen, J. A. and Fox, C. (2005, dec). Markov chain Monte Carlo Using an Approximation. Journal of Computational and Graphical Statistics 14(4), 795–810.
• Del Moral et al. (2006) Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B 68(3), 411–436.
• Del Moral et al. (2012) Del Moral, P., Doucet, A., and Jasra, A. (2012). An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing 22(5), 1009–1020.
• Didelot et al. (2011) Didelot, X., Everitt, R. G., Johansen, A. M., and Lawson, D. J. (2011). Likelihood-free estimation of model evidence. Bayesian Analysis 6(1), 49–76.
• Everitt (2012) Everitt, R. G. (2012). Bayesian Parameter Estimation for Latent Markov Random Fields and Social Networks. Journal of Computational and Graphical Statistics 21(4), 940–960.
• Everitt et al. (2017) Everitt, R. G., Johansen, A. M., Rowing, E., and Evdemon-Hogan, M. (2017). Bayesian model comparison with un-normalised likelihoods. Statistics and Computing 27(2), 403–422.
• Everitt et al. (2017) Everitt, R. G., Prangle, D., Maybank, P., and Bell, M. (2017). Auxiliary variable marginal smc for doubly intractable models. arXiv.
• Friel (2013) Friel, N. (2013).

Evidence and Bayes factor estimation for Gibbs random fields.

Journal of Computational and Graphical Statistics 22(3), 518–532.
• Gillespie (1977) Gillespie, D. T. (1977). Exact stochastic simulation of coupled chemical reactions. Journal of Physical Chemistry 8(25), 2340–2361.
• Golightly et al. (2015) Golightly, A., Henderson, D. A., and Sherlock, C. (2015). Delayed acceptance particle MCMC for exact inference in stochastic biochemical network models. Statistics and Computing 25(5), 1039–1055.
• Grelaud et al. (2009) Grelaud, A., Robert, C. P., and Marin, J.-M. (2009). ABC likelihood-free methods for model choice in Gibbs random fields. Bayesian Analysis 4(2), 317–336.
• Hunter et al. (2008) Hunter, D. R., Handcock, M. S., Butts, C. T., Goodreau, S. M., and Morris, M. (2008). ergm: A Package to Fit, Simulate and Diagnose Exponential-Family Models for Networks. Journal of Statistical Software 24(3), nihpa54860.
• Kong et al. (1994) Kong, A., Liu, J. S., and Wong, W. H. (1994).

Sequential Imputations and Bayesian Missing Data Problems.

Journal of the American Statistical Association 89(425), 278–288.
• Lusseau et al. (2003) Lusseau, D., Schneider, K., Boisseau, O. J., Haase, P., Slooten, E., and Dawson, S. M. (2003, sep). The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behavioral Ecology and Sociobiology 54(4), 396–405.
• Møller et al. (2006) Møller, J., Pettitt, A. N., Reeves, R. W., and Berthelsen, K. K. (2006, jun). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93(2), 451–458.
• Murray et al. (2006) Murray, I., Ghahramani, Z., and MacKay, D. J. C. (2006). MCMC for doubly-intractable distributions. In UAI, pp. 359–366.
• Papamakarios and Murray (2016) Papamakarios, G. and Murray, I. (2016). Fast Epsilon-Free Inference of Simulation Models with Bayesian Conditional Density Estimation. arXiv.
• Picchini (2014) Picchini, U. (2014, apr). Inference for SDE models via Approximate Bayesian Computation. Journal of Computational and Graphical Statistics 23(4), 1080–1100.
• Picchini and Forman (2016) Picchini, U. and Forman, J. L. (2016). Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation: A case study. Journal of Statistical Computation and Simulation 86(1), 195–213.
• Prangle et al. (2017) Prangle, D., Everitt, R. G., and Kypraios, T. (2017). A rare event approach to high dimensional approximate Bayesian computation. Statistics and Computing.
• Pritchard et al. (1999) Pritchard, J. K., Seielstad, M. T., Perez-Lezaun, A., and Feldman, M. W. (1999, dec). Population growth of human Y chromosomes: a study of Y chromosome microsatellites. Molecular Biology and Evolution 16(12), 1791–8.
• Sherlock et al. (2017) Sherlock, C., Golightly, A., and Henderson, D. A. (2017). Adaptive, delayed-acceptance MCMC for targets with expensive likelihoods. Journal of Computational and Graphical Statistics 26(2), 434–444.
• Sherlock et al. (2015) Sherlock, C., Thiery, A. H., and Golightly, A. (2015). Efficiency of delayed acceptance random walk Metropolis algorithms. arXiv.
• Sisson et al. (2007) Sisson, S. A., Fan, Y., and Tanaka, M. M. (2007, feb). Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences of the United States of America 104(6), 1760–1765.
• Stoehr (2017) Stoehr, J. (2017). A review on statistical inference methods for discrete Markov random fields. arXiv.
• Strens (2004) Strens, M. (2004). Efficient hierarchical MCMC for policy search.

Proceedings of the 21st International Conference on Machine Learning

, 97.
• Tierney (1998) Tierney, L. (1998). A Note On Metropolis-Hastings Kernels For General State Spaces. Annals of Applied Probability 8(1), 1–9.
• van der Vaart et al. (2015) van der Vaart, E., Beaumont, M. A., Johnston, A. S. A., and Sibly, R. M. (2015). Calibration and evaluation of individual-based models using Approximate Bayesian Computation. Ecological Modelling312, 182–190.
• Wilkinson (2011) Wilkinson, D. J. (2011). Stochastic Modelling for Systems Biology (Second edi ed.). Boca Raton, Florida: Chapman & Hall/CRC Press.
• Wilkinson (2013) Wilkinson, D. J. (2013). Summary Stats for ABC.
• Zhou et al. (2015) Zhou, Y., Johansen, A. M., and Aston, J. A. D. (2015). Towards automatic model comparison: an adaptive sequential Monte Carlo approach. Journal of Computational and Graphical Statistics.