1 Introduction
In the natural and social sciences, one finds many examples of probabilistic models whose likelihood function is intractable. These include most models of noisy biological neural networks (Gerstner and Kistler, 2002), some time series and choice models in economics (Train, 2003), phylogenetic models in evolutionary biology (Beaumont, 2010), and spatial extremes in environmental statistics (Davison et al., 2012), among others. That the likelihood is intractable is unfortunate, because one would still like to perform the usual statistical tasks of parameter inference and model comparison, and the traditional statistical toolkit assumes that the likelihood function is either directly available or can be made so by introducing latent variables. This explains why researchers have often had to content themselves with semi-quantitative analyses showing that a model could reproduce some aspect of an empirical phenomenon for some values of the parameters; for two representative examples from vision science, see Nuthmann et al. (2010) and Brascamp et al. (2006).

A breakthrough was provided by the work of Tavaré et al. (1997), Pritchard et al. (1999) and Beaumont et al. (2002), in the form of the Approximate Bayesian Computation (ABC) algorithm, which enables Bayesian inference in the likelihood-free context. (See also Diggle and Gratton (1984) and Rubin (1984) for early versions of ABC.) Assuming some model $p(y|\theta)$ for the data $y$, and a prior $p(\theta)$ over the parameter $\theta$, the ABC algorithm iterates the following steps:

Draw $\theta^\star$ from the prior, $\theta^\star \sim p(\theta)$.

Draw a dataset $y^\star$ from the model, conditional on $\theta^\star$: $y^\star \sim p(y|\theta^\star)$.

If $d(y^\star, y) \leq \epsilon$, then keep $\theta^\star$, otherwise reject.
and therefore produces samples from the so-called ABC posterior:
(1.1) $\pi_\epsilon(\theta \mid y) \propto p(\theta) \int \mathbb{1}\{d(y^\star, y) \leq \epsilon\}\, p(y^\star \mid \theta)\, \mathrm{d}y^\star$
The pseudo-distance is usually taken to be $d(y^\star, y) = \|s(y^\star) - s(y)\|$, for some norm $\|\cdot\|$, where $s(y)$ is a vector of summary statistics, for example some empirical quantiles or moments of $y$. Unless $s$ is sufficient, the approximation error does not vanish as $\epsilon \to 0$. In that respect, the ABC posterior suffers from two levels of approximation: a nonparametric error, akin to the error of kernel density estimation, where $\epsilon$ plays the role of a bandwidth (see e.g. Blum, 2010), and a bias introduced by the summary statistics $s$. The more components we include in $s$, the smaller the bias induced by $s$ should be. On the other hand, as the dimensionality of $s$ increases, the lower the acceptance rate will be. We would then have to increase $\epsilon$, which leads to an approximation of lower quality. Thus, in practice, ABC requires some more or less arbitrary compromise between which summary statistics to include and how to set $\epsilon$. To establish that the results of the inference are somewhat robust to these choices, many runs of the algorithm may be required. Although several variants of the original ABC algorithm that aim at increasing acceptance rates exist (e.g. Beaumont et al., 2002), the current state of the matter is that an ABC analysis is very far from routine use, because it may take days to tune on real problems. The semi-automatic method for constructing summary statistics recently proposed by Fearnhead and Prangle (2012) seems to partly alleviate these problems, but it still requires at least one, and sometimes several, pilot runs.
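The three steps above can be sketched in a few lines. The following Python snippet is our own illustration, not from the paper: the Gaussian toy model, the prior, the sample-mean summary statistic and the tolerance are all assumptions chosen to make the example self-contained.

```python
import random
import statistics

def rejection_abc(y_obs, n_samples, eps, seed=0):
    """Toy rejection ABC: data are IID N(theta, 1), prior theta ~ N(0, 10^2),
    summary statistic s(y) = sample mean, distance |s(y*) - s(y)|."""
    rng = random.Random(seed)
    s_obs = statistics.fmean(y_obs)
    accepted = []
    while len(accepted) < n_samples:
        theta = rng.gauss(0.0, 10.0)                      # draw from the prior
        y_star = [rng.gauss(theta, 1.0) for _ in y_obs]   # simulate a dataset
        if abs(statistics.fmean(y_star) - s_obs) <= eps:  # keep or reject
            accepted.append(theta)
    return accepted

# Observed data centred near theta = 1; the accepted thetas approximate
# the ABC posterior, which concentrates near the sample mean here.
rng = random.Random(42)
y = [rng.gauss(1.0, 1.0) for _ in range(50)]
post = rejection_abc(y, n_samples=200, eps=0.15)
```

Note how the acceptance rate collapses as `eps` shrinks or as the summary statistic grows in dimension, which is precisely the trade-off discussed above.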
In this article we introduce EP-ABC, an adaptation of the Expectation Propagation (EP) algorithm (Minka, 2001a; Bishop, 2006, Chap. 10) to the likelihood-free setting. The main advantage of EP-ABC is that it is much faster than previous ABC algorithms: typically, it provides accurate results in a few minutes, whereas a standard ABC algorithm may need several hours, or even days. EP-ABC requires that the data may be decomposed into “chunks”, $y = (y_1, \dots, y_n)$ (of possibly different dimensionality or support), in such a way that it is possible to simulate the chunks sequentially; i.e. one is able to simulate from $p(y_i \mid y_{1:i-1}, \theta)$ for each $i$. More precisely, EP-ABC builds an EP approximation of the following type of ABC posterior distributions:
(1.2) $\pi_\epsilon(\theta \mid y) \propto p(\theta) \prod_{i=1}^{n} \int \mathbb{1}\{\|s_i(y_i^\star) - s_i(y_i)\| \leq \epsilon\}\, p(y_i^\star \mid y_{1:i-1}, \theta)\, \mathrm{d}y_i^\star$
where $s_i$ is a summary statistic specific to chunk $i$, and with the convention that $y_{1:0}$ is the empty set. Given the way it operates, we shall see that EP-ABC essentially replaces the initial, possibly difficult ABC problem by $n$ simpler (i.e. lower-dimensional) ABC problems. In particular, it may be easier to construct a set of “local” summary statistics $s_i$ for each chunk, rather than a global set $s$ for the whole dataset $y$, in such a way that the probability that $\|s_i(y_i^\star) - s_i(y_i)\| \leq \epsilon$ is not too small.

Of course, not all ABC models are amenable to the factorisation in (1.2), at least directly. We shall discuss this point in more detail in the paper. At the very least, factorisable models include situations with repeated experiments, which are not easily tackled through standard ABC, because of the difficulty of defining a single vector $s(y)$ which summarises all these experiments. Bazin et al. (2010) discuss this point, in the context of genetic data collected over a large number of loci, and recommend instead constructing loci-specific summary statistics, and combining the results from loci-specific ABC exercises.
In certain problems where $y_i$ is of low dimension, it may even be possible to take $s_i(y_i) = y_i$. In that case, EP-ABC provides an EP approximation of a summary-less ABC posterior; that is, one that converges (under appropriate measure-theoretic conditions), as $\epsilon \to 0$, to the true posterior $p(\theta \mid y)$. In such situations, EP-ABC also provides an approximation of the evidence $p(y)$, which is a very useful quantity for model comparison. Dean et al. (2011) show that, in the specific context of hidden Markov models with an intractable observation density (their work was motivated by the ABC filtering algorithm developed in Jasra et al. (2010)), the ABC error of a summary-less ABC algorithm should not be interpreted as a nonparametric error, but rather as a misspecification error, where the misspecified model is the true model, but with observations corrupted with noise. This misspecification implies a bias whose size is controlled by $\epsilon$. Thus, in addition to being more convenient for the user, as it avoids specifying summary statistics, summary-less ABC does not seem to suffer from the curse of dimensionality (in the dimension of $y$) of standard ABC. Allowing summary-less ABC inference in certain cases seems to be another advantage of EP-ABC.

We start with a generic description of the EP algorithm in Section 2, and explain in Section 3 how it can be adapted to the likelihood-free setting. We explain in Section 4 how EP-ABC can be made particularly efficient when datapoints are IID (independent and identically distributed). Section 5 contains three case studies drawn from finance, population ecology, and vision science. The first two examples are borrowed from known applications of ABC, and illustrate to what extent EP-ABC may outperform standard ABC techniques in realistic scenarios. To the best of our knowledge, our third example from vision science is a novel application of likelihood-free inference, which, as we shall argue, seems too challenging for standard ABC techniques.
Section 6 discusses how EP-ABC may cope with difficult posteriors, through two additional examples. Section 7 discusses possible extensions of EP-ABC, in particular for non-factorisable likelihoods; in such cases, one may replace the likelihood by some factorisable approximation, such as a composite likelihood. Section 8 concludes.
We use the following notations throughout the paper: bold letters refer to vectors or matrices, e.g. $\boldsymbol{\theta}$, $\mathbf{y}$, and so on. We also use bold face to distinguish complete sets of observations, i.e. $\mathbf{y}$, from their components $y_1, \dots, y_n$, although we do not assume that these components are necessarily scalar. For sub-vectors of observations, we use the colon notation: $y_{1:i} = (y_1, \dots, y_i)$. The notation $\|\cdot\|$ refers to a generic norm, and $\|\cdot\|_2$ refers to the Euclidean norm. The Kullback–Leibler divergence between probability densities $q$ and $p$ is denoted $\mathrm{KL}(q\|p)$. The letter $p$ always refers to probability densities concerning the model; i.e. $p(\theta)$ is the prior, $p(y_1 \mid \theta)$ is the likelihood of the first observation, and so on. The transpose of a matrix $A$ is denoted $A^T$, and the diagonal matrix with diagonal elements given by vector $v$ is $\mathrm{diag}(v)$.
2 Expectation Propagation
Expectation Propagation (EP, Minka, 2001a) is an algorithm for variational inference, a class of techniques that aim at finding a tractable probability distribution $q(\theta)$ that best approximates an intractable target density $\pi(\theta)$. One way to formulate this goal is as finding the member $q$ of some parametric family that is in some sense closest to $\pi$, where “closest” is defined by some divergence between probability distributions. A distinctive feature of EP is that it is based on the divergence $\mathrm{KL}(\pi\|q)$, whereas many variational methods (e.g. Variational Bayes, see Chap. 10 of Bishop, 2006) try to minimize $\mathrm{KL}(q\|\pi)$. For a good discussion of the relative merits of each divergence, see Bishop (2006, pp. 468–470) and Minka (2005). As a brief summary, minimizing $\mathrm{KL}(q\|\pi)$ tends to produce approximations that are too compact; see also Wang and Titterington (2005). The divergence $\mathrm{KL}(\pi\|q)$ does not suffer from this drawback, and is perhaps more appealing from a statistical point of view (if only because maximum likelihood estimation is based on the corresponding contrast), but on the other hand, minimizing $\mathrm{KL}(\pi\|q)$ when $\pi$ is multimodal tends to produce an approximation that covers all the modes, and therefore has too large a support. We shall return to this point.

2.1 Assumptions of Expectation Propagation
EP assumes that the target density decomposes into a product of simpler factors,
(2.1) $\pi(\theta) \propto \prod_{i=0}^{n} l_i(\theta)$
and exploits this factorisation in order to construct a sequence of simpler problems. For instance, $l_0$ may be the prior $p(\theta)$, if $y$ is a set of observations $y_1, \dots, y_n$, with the convention that $l_0(\theta) = p(\theta)$ and $l_i(\theta) = p(y_i \mid y_{1:i-1}, \theta)$ for $i \geq 1$.
EP uses an approximating distribution with a similar structure:
(2.2) $q(\theta) \propto \prod_{i=0}^{n} f_i(\theta)$
where the $f_i$'s are known as the “sites”. In a spirit close to a coordinate-descent optimization algorithm, each site is updated in turn, while the other sites are kept fixed.
For the sake of conciseness, we focus in this paper on Gaussian sites, expressed under their natural parametrisation: $f_i(\theta) = \exp\{-\frac{1}{2}\theta^T Q_i \theta + r_i^T \theta\}$, where $r_i$ and $Q_i$ are called from now on the site parameters. This generates the following Gaussian approximation $q$:
(2.3) $q(\theta) \propto \exp\{-\tfrac{1}{2}\theta^T Q \theta + r^T \theta\}, \qquad Q = \sum_{i=0}^{n} Q_i, \quad r = \sum_{i=0}^{n} r_i.$
In addition, we assume that the true prior is Gaussian, with natural parameters $r_0$ and $Q_0$. In that case, the site $f_0$ is kept equal to the prior, and only the sites $f_1$ to $f_n$ are updated.
We note however that EP may easily accommodate a non-Gaussian prior (by simply treating the prior as an additional factor to be approximated), or other types of parametric sites. In fact, the site update described in the following section may easily be adapted to any exponential family; see Seeger (2005).
2.2 Site update
Suppose that (2.2) is the current approximation, and one wishes to update site $f_i$. This is done by creating a “hybrid” distribution, obtained by substituting site $f_i$ with the true likelihood contribution $l_i$:
(2.4) $h(\theta) \propto l_i(\theta) \prod_{j \neq i} f_j(\theta)$
For Gaussian sites, this leads to:
$h(\theta) \propto l_i(\theta)\, q_{-i}(\theta), \qquad q_{-i}(\theta) \propto \exp\{-\tfrac{1}{2}\theta^T Q_{-i} \theta + r_{-i}^T \theta\}, \quad Q_{-i} = Q - Q_i, \quad r_{-i} = r - r_i,$
where $q_{-i}$, the so-called cavity distribution, is treated as a normalised Gaussian density. The new value of site $f_i$ is then obtained by minimising with respect to $f_i$ the Kullback–Leibler pseudo-distance between the hybrid and the Gaussian approximation $q$ (again keeping the other $f_j$'s fixed). When Gaussian sites are used, this minimisation is equivalent to taking $q$ to be the Gaussian density with moment parameters that match those of the hybrid distribution:
(2.5) $Z_h = \int l_i(\theta)\, q_{-i}(\theta)\, \mathrm{d}\theta, \quad \mu_h = \frac{1}{Z_h} \int \theta\, l_i(\theta)\, q_{-i}(\theta)\, \mathrm{d}\theta, \quad \Sigma_h = \frac{1}{Z_h} \int \theta \theta^T l_i(\theta)\, q_{-i}(\theta)\, \mathrm{d}\theta - \mu_h \mu_h^T$
A key observation at this stage is that the feasibility of EP is essentially dictated by how easily the moments above may be computed. These moments may be interpreted as the moments of a posterior distribution, based on a Gaussian prior with natural parameters $r_{-i}$ and $Q_{-i}$, and a likelihood consisting of a single factor $l_i(\theta)$.
Finally, from these moment parameters, one may recover the natural parameters of $q$ and deduce the new site parameters for $f_i$ as follows:
$Q_i = \Sigma_h^{-1} - Q_{-i}, \qquad r_i = \Sigma_h^{-1} \mu_h - r_{-i}.$
EP proceeds by looping over sites, updating each one in turn until convergence is achieved. In wellbehaved cases, one observes empirically that a small number of complete sweeps through the sites is sufficient to obtain convergence. However, there is currently no general theory on the convergence of EP.
Appendix A gives an algorithmic description of EP, in the more general case where an exponential family is used for the sites.
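To make the site update concrete, here is a minimal one-dimensional Python sketch (our own illustration, not the paper's Matlab implementation). The sites approximate exact Gaussian likelihood terms, so EP should recover the exact posterior, which makes the example easy to check; the hybrid moments are computed by naive grid quadrature, standing in for whatever problem-specific moment computation is available.

```python
import math

def ep_1d(y, q0_prec, r0, n_passes=2):
    """1-D EP with Gaussian sites in natural parametrisation (r_i, Q_i).
    Sites approximate l_i(theta) = N(y_i; theta, 1); hybrid moments are
    computed by grid quadrature, mimicking the generic moment-matching step."""
    grid = [-10.0 + 20.0 * k / 4000 for k in range(4001)]
    dx = grid[1] - grid[0]
    n = len(y)
    r_site, q_site = [0.0] * n, [0.0] * n
    r, q = r0, q0_prec                              # global natural parameters
    for _ in range(n_passes):
        for i in range(n):
            r_cav, q_cav = r - r_site[i], q - q_site[i]       # cavity
            # unnormalised hybrid: l_i(theta) * exp(-q_cav/2 th^2 + r_cav th)
            h = [math.exp(-0.5 * (t - y[i]) ** 2
                          - 0.5 * q_cav * t * t + r_cav * t) for t in grid]
            z = sum(h) * dx
            mu = sum(t * v for t, v in zip(grid, h)) * dx / z
            var = sum((t - mu) ** 2 * v for t, v in zip(grid, h)) * dx / z
            q_new, r_new = 1.0 / var, mu / var                # moment matching
            r_site[i], q_site[i] = r_new - r_cav, q_new - q_cav
            r, q = r_new, q_new
    return r / q, 1.0 / q   # mean and variance of the final approximation

# With exact Gaussian sites, EP matches the conjugate posterior:
# precision 0.01 + 4, mean sum(y) / (0.01 + 4).
mean, var = ep_1d([0.5, 1.2, -0.3, 0.9], q0_prec=0.01, r0=0.0)
```

The quadrature step is of course only viable in low dimension; the point of the algorithm of Section 3 is precisely to replace it by simulation.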
2.3 Approximation of the evidence
EP also provides an approximation of the normalising constant of (2.1), using the same idea of updating site approximations through moment matching. To that effect, we rewrite the (unnormalised) EP approximation with a normalising constant $C_i$ for each site (assuming again that the prior is Gaussian and does not need to be approximated):
(2.6) $q(\theta) = p(\theta) \prod_{i=1}^{n} C_i f_i(\theta)$
Then the update of site $i$ proceeds as before, by adjusting $r_i$, $Q_i$ and $C_i$ through moment matching. Simple calculations (see e.g. Seeger, 2005) lead to the following expression for the update of $C_i$:
(2.7) $\log C_i = \log Z_h + \psi(r_{-i}, Q_{-i}) - \psi(r, Q)$
where $Z_h$ is the normalising constant of the hybrid, as defined in (2.5), $(r, Q)$ (resp. $(r_{-i}, Q_{-i})$) are the natural parameters of the updated Gaussian approximation $q$ (resp. of the cavity $q_{-i}$), and $\psi$ is the log-normalising constant of an unnormalised Gaussian density:
$\psi(r, Q) = \log \int \exp\{-\tfrac{1}{2}\theta^T Q \theta + r^T \theta\}\, \mathrm{d}\theta = \tfrac{1}{2} r^T Q^{-1} r + \tfrac{1}{2} \log |2\pi Q^{-1}|.$
For each site update, one calculates $\log C_i$ as defined in (2.7). Then, at the end of the algorithm, one may return the quantity
$\sum_{i=1}^{n} \log C_i + \psi(r, Q) - \psi(r_0, Q_0)$
as an approximation to the logarithm of the evidence.
3 EP-ABC: Adapting EP to likelihood-free settings
3.1 Basic principle
As explained in the introduction, our objective is to approximate the following ABC posterior:
(3.1) $\pi_\epsilon(\theta \mid y) \propto p(\theta) \prod_{i=1}^{n} \int \mathbb{1}\{\|s_i(y_i^\star) - s_i(y_i)\| \leq \epsilon\}\, p(y_i^\star \mid y_{1:i-1}, \theta)\, \mathrm{d}y_i^\star$
which corresponds to a particular factorisation of the likelihood,
(3.2) $p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid y_{1:i-1}, \theta).$
Note that, in full generality, the $y_i$ may be any type of “chunk” of the observation vector $y$; i.e. the random variables $y_i$ may have different dimensions, or more generally different supports. For simplicity, we assume that the prior is Gaussian, with natural parameters $r_0$ and $Q_0$.

One may interpret (3.1) as an artificial posterior distribution, which decomposes into a prior times $n$ likelihood contributions $l_i(\theta)$, as in (2.1), with
$l_i(\theta) = \int \mathbb{1}\{\|s_i(y_i^\star) - s_i(y_i)\| \leq \epsilon\}\, p(y_i^\star \mid y_{1:i-1}, \theta)\, \mathrm{d}y_i^\star.$
We have seen that the feasibility of the EP algorithm is determined by the tractability of the following operation: computing the first two moments of a pseudo-posterior, made of a Gaussian prior $q_{-i}$, times a single likelihood contribution $l_i$. This immediately suggests the following EP-ABC algorithm. We use the EP algorithm, as described in Algorithm 3, where the moments of such a pseudo-posterior are computed as described in Algorithm 1, that is, as Monte Carlo estimates based on simulated pairs $(\theta^{(m)}, y_i^{(m)})$, where $\theta^{(m)} \sim q_{-i}$ and $y_i^{(m)} \sim p(y_i \mid y_{1:i-1}, \theta^{(m)})$.
Inputs: $M$, $\epsilon$, $i$, and the moment parameters $\mu_{-i}$, $\Sigma_{-i}$ of the Gaussian pseudo-prior $q_{-i}$.

Draw $M$ variates $\theta^{(m)}$ from a $N(\mu_{-i}, \Sigma_{-i})$ distribution.

For each $m$, draw $y_i^{(m)} \sim p(y_i \mid y_{1:i-1}, \theta^{(m)})$.

Compute the acceptance count $M_{\mathrm{acc}} = \sum_{m=1}^{M} \mathbb{1}\{\|s_i(y_i^{(m)}) - s_i(y_i)\| \leq \epsilon\}$ and the empirical moments
(3.3) $\hat{\mu} = \frac{1}{M_{\mathrm{acc}}} \sum_{m=1}^{M} \theta^{(m)}\, \mathbb{1}\{\|s_i(y_i^{(m)}) - s_i(y_i)\| \leq \epsilon\}$ (3.4) $\hat{\Sigma} = \frac{1}{M_{\mathrm{acc}}} \sum_{m=1}^{M} \theta^{(m)} (\theta^{(m)})^T\, \mathbb{1}\{\|s_i(y_i^{(m)}) - s_i(y_i)\| \leq \epsilon\} - \hat{\mu} \hat{\mu}^T$
Return $M_{\mathrm{acc}}$, $\hat{\mu}$ and $\hat{\Sigma}$.
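A minimal Python sketch of this moment-estimation step, for a scalar chunk with $s_i$ the identity (our own illustration; the toy chunk model and all settings below are assumptions chosen so the result can be checked against the conjugate answer):

```python
import random
import statistics

def abc_moments(mu, sigma, y_obs, eps, m_draws, simulate, seed=0):
    """Monte Carlo estimate of the moments of a pseudo-posterior:
    Gaussian pseudo-prior N(mu, sigma^2) times the ABC likelihood of one
    scalar data chunk (s_i is the identity). Assumes m_draws is large
    enough that at least one draw is accepted."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(m_draws):
        theta = rng.gauss(mu, sigma)       # draw from the pseudo-prior
        y_star = simulate(theta, rng)      # simulate the chunk given theta
        if abs(y_star - y_obs) <= eps:     # single local ABC constraint
            accepted.append(theta)
    m_acc = len(accepted)
    mu_hat = statistics.fmean(accepted)
    var_hat = statistics.pvariance(accepted, mu_hat)
    return m_acc, mu_hat, var_hat

# Toy chunk model: y_i | theta ~ N(theta, 1); with pseudo-prior N(0, 1)
# and y_obs = 0, the accepted thetas are close to N(0, 1/2) for small eps.
m_acc, mu_hat, var_hat = abc_moments(
    mu=0.0, sigma=1.0, y_obs=0.0, eps=0.2, m_draws=100_000,
    simulate=lambda theta, rng: rng.gauss(theta, 1.0))
```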
Since EP-ABC integrates one datapoint at a time, it does not suffer from a curse of dimensionality with respect to $n$: the rejection rate of Algorithm 1 corresponds to a single constraint $\|s_i(y_i^{(m)}) - s_i(y_i)\| \leq \epsilon$, not $n$ of them, and is therefore likely to be tolerably small even for small windows $\epsilon$. (Otherwise, in very challenging situations, one has the liberty to replace Algorithm 1 by a more elaborate ABC algorithm.)
The only requirement of EP-ABC is that the factorisation of the likelihood, (3.2), is chosen in such a way that simulating from the model, i.e. from $p(y \mid \theta)$, can be decomposed into a sequence of $n$ steps, where one samples from $p(y_i \mid y_{1:i-1}, \theta)$, for $i = 1, \dots, n$. We shall see in Section 5 that several important applications of likelihood-free inference fulfil this requirement. We shall also discuss in Section 7 how other likelihood-free situations may be accommodated by the EP-ABC approach.
3.2 Numerical stability
EP-ABC is a stochastic version of EP, which is a deterministic algorithm; hence some care must be taken to ensure numerical stability. We describe here three strategies towards this aim.
First, to ensure that the stochastic error introduced by each site update does not vary too much over the course of the algorithm, we adapt $M$, the number of simulated points, dynamically, as follows. For a given site update, we repeatedly sample pairs $(\theta^{(m)}, y_i^{(m)})$, as described in Algorithm 1, until the total number of accepted points exceeds some threshold. Then we compute the moments (3.3) and (3.4) based on all the accepted pairs.
Second, EP-ABC computes a great many Monte Carlo estimates based on IID (independent and identically distributed) samples, part of which are Gaussian. Thus, it seems worthwhile to implement variance reduction techniques specific to the Gaussian distribution. After some investigation, we recommend the following quasi-Monte Carlo approach. We generate a Halton sequence $(u^{(m)})$ of dimension $d$ (the dimension of $\theta$), which is a low-discrepancy sequence in $[0,1]^d$, and take
$\theta^{(m)} = \mu_{-i} + L\, \Phi^{-1}(u^{(m)})$
where $\mu_{-i}$, $\Sigma_{-i}$ are the moment parameters corresponding to the natural parameters $r_{-i}$, $Q_{-i}$, $L$ is the lower Cholesky triangle of $\Sigma_{-i}$, and $\Phi^{-1}$ returns a vector that contains the inverse standard Gaussian distribution function of each component of the input vector. We recall briefly that a low-discrepancy sequence in $[0,1]^d$ is a deterministic sequence that spreads more evenly over the hypercube than a sample from the uniform distribution would; we refer the reader to e.g. Chap. 3 of Gentle (2003) for a definition of Halton and other low-discrepancy sequences, and for the theory of quasi-Monte Carlo. Rigorously speaking, this quasi-Monte Carlo version of EP-ABC is a hybrid between Monte Carlo and quasi-Monte Carlo, because the $y_i^{(m)}$ are still generated using standard Monte Carlo. However, we do observe a dramatic improvement when using this quasi-Monte Carlo approach. An additional advantage is that one may save some computational time by generating a very large sequence of vectors once and for all, and storing it in memory for all subsequent runs of EP-ABC.

The third measure we may take is to slow down the progression of the algorithm so as to increase stability, by conservatively updating the parameters of the approximation in Step 3 of Algorithm 3; that is, each site update sets the new parameters to a convex combination of the previous values and the freshly computed ones, standard EP corresponding to putting all the weight on the freshly computed values. Updates of this type are suggested in Minka (2004).
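The Halton-based Gaussian sampling can be sketched as follows (illustrative Python, one-dimensional for brevity; a full implementation would combine the van der Corput sequences of several prime bases into a $d$-dimensional Halton sequence and multiply by the Cholesky factor of $\Sigma_{-i}$):

```python
from statistics import NormalDist

def van_der_corput(n, base=2):
    """First n points of the van der Corput sequence in base `base`
    (the 1-D building block of a Halton sequence); index 0 is skipped
    because it maps to 0, where the inverse Gaussian cdf is undefined."""
    points = []
    for idx in range(1, n + 1):
        f, r, k = 1.0, 0.0, idx
        while k > 0:
            f /= base
            r += f * (k % base)
            k //= base
        points.append(r)
    return points

def gauss_qmc(n, mu, sigma):
    """Map low-discrepancy uniforms to N(mu, sigma^2) draws through the
    inverse Gaussian distribution function (1-D analogue of mu + L Phi^{-1}(u))."""
    inv = NormalDist().inv_cdf
    return [mu + sigma * inv(u) for u in van_der_corput(n)]

draws = gauss_qmc(5000, mu=1.0, sigma=2.0)
m = sum(draws) / len(draws)
sd = (sum((x - m) ** 2 for x in draws) / len(draws)) ** 0.5
```

Even with only a few thousand points, the empirical mean and standard deviation match the target distribution far more closely than a pseudo-random sample of the same size would.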
In our experiments, we found that the first two strategies improve performance very significantly (in the sense of reducing Monte Carlo variability over repeated runs), and that the third strategy is sometimes useful, for example in our reaction time example; see Section 5.4.
3.3 Evidence approximation
In this section, we consider the special case $s_i(y_i) = y_i$, and we normalise the ABC posterior (3.1) as follows:
(3.5) $\pi_\epsilon(\theta \mid y) = \frac{p(\theta)}{Z_\epsilon} \prod_{i=1}^{n} \frac{1}{c_\epsilon} \int \mathbb{1}\{\|y_i^\star - y_i\| \leq \epsilon\}\, p(y_i^\star \mid y_{1:i-1}, \theta)\, \mathrm{d}y_i^\star$
where $c_\epsilon$ is the normalising constant of the uniform distribution with respect to the ball of centre $y_i$, radius $\epsilon$, and norm $\|\cdot\|$, i.e. the volume of that ball. For the Euclidean norm, and assuming that the $y_i$'s have the same dimension $d$, one has $c_\epsilon = v_d\, \epsilon^d$, with $v_d = \pi^{d/2} / \Gamma(d/2 + 1)$; e.g. $c_\epsilon = 2\epsilon$ if $d = 1$, $c_\epsilon = \pi \epsilon^2$ if $d = 2$.
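In practice the correction is applied on the log scale, as a subtraction of $n \log c_\epsilon$ from the EP log-evidence; the constant itself is easily computed (illustrative Python):

```python
import math

def log_ball_volume(d, eps):
    """Log-volume of the Euclidean ball of radius eps in R^d:
    c_eps = v_d * eps^d with v_d = pi^{d/2} / Gamma(d/2 + 1)."""
    return (0.5 * d * math.log(math.pi)
            - math.lgamma(0.5 * d + 1.0)
            + d * math.log(eps))

# Sanity checks against the closed forms quoted in the text:
c1 = math.exp(log_ball_volume(1, 0.1))   # 2 * eps
c2 = math.exp(log_ball_volume(2, 0.1))   # pi * eps^2
c3 = math.exp(log_ball_volume(3, 0.1))   # (4/3) * pi * eps^3
```

Working with `log_ball_volume` directly avoids underflow when $d$ is moderate and $\epsilon$ is small, since $c_\epsilon^n$ itself can be astronomically small.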
Wilkinson (2008) shows that a standard ABC posterior such as (1.1) can be interpreted as the exact posterior distribution of a new model, where the summary statistics are corrupted with uniformly distributed noise (assuming these summary statistics are sufficient). The expression above indicates that this interpretation also holds for this type of summary-less ABC posterior, except that the artificial model is now such that all the random variables $y_i$ are corrupted with noise (conditional on $y_{1:i-1}$).

The expression above also raises an important point regarding the approximation of the evidence. In (3.5), the normalising constant $Z_\epsilon$ is the evidence of the corrupted model, which converges to the evidence $p(y)$ of the actual model as $\epsilon \to 0$. On the other hand, EP-ABC targets (3.1), and in particular (see Section 2.3) produces an EP approximation of its normalising constant, which is $Z_\epsilon c_\epsilon^n$. Thus, one needs to divide this EP approximation by $c_\epsilon^n$ in order to recover an approximation of $Z_\epsilon$. We found in our simulations that, when properly normalised as just described, the approximation of the evidence provided by EP-ABC is particularly accurate; see Section 5. In contrast, standard ABC based on summary statistics cannot provide an approximation of the evidence, as explained in the Introduction.
4 Speeding up EP-ABC in the IID case
Typically, the main computational bottleneck of EP-ABC, as of other ABC algorithms, is simulating pseudo-datapoints from the model. In this section, we explain how these simulations may be recycled across iterations in the IID (independent and identically distributed) case, so as to significantly reduce the overall computational cost of EP-ABC.
Our recycling scheme is based on a straightforward importance sampling strategy. Consider an IID model, with likelihood $p(y \mid \theta) = \prod_{i=1}^{n} p(y_i \mid \theta)$. Also, for the sake of simplicity, take $s_i(y_i) = y_i$. Assume that, for a certain site $i$, $M$ pairs $(\theta^{(m)}, y^{(m)})$ have been generated from $q_{-i}(\theta)\, p(y \mid \theta)$, as described in Algorithm 1. We have removed the subscript $i$ in both $y^{(m)}$ and $p(y \mid \theta)$, to highlight the fact that the generative process of the datapoints is the same for all the sites. The next update, for site $i+1$, requires computing moments with respect to $q_{-(i+1)}(\theta)\, p(y \mid \theta)$. Thus, we may recycle the simulations of the previous site by assigning to each pair the importance sampling weight
$w^{(m)} = \frac{q_{-(i+1)}(\theta^{(m)})}{q_{-i}(\theta^{(m)})}$
and computing the corresponding weighted averages.

Obviously, this step may also be applied to the subsequent sites, $i+2$, $i+3$, and so on, until one reaches a stage where the weighted sample is too degenerate. When this happens, “fresh” simulations may be generated from the current site. Algorithm 2 describes this recycling strategy more precisely. To detect weight degeneracy, we use the standard ESS (effective sample size) criterion of Kong et al. (1994): we regenerate when the ESS is smaller than some threshold $ESS_{\min}$.
Inputs: $ESS_{\min}$, the current weighted sample $(w^{(m)}, \theta^{(m)}, y^{(m)})$, the moment parameters $\mu_0$ and $\Sigma_0$ that correspond to the site where data were regenerated for the last time, and the moment parameters $\mu_{-i}$ and $\Sigma_{-i}$ that correspond to the current Gaussian pseudo-prior $q_{-i}$.

Compute the importance sampling weights
$w^{(m)} = \frac{N(\theta^{(m)}; \mu_{-i}, \Sigma_{-i})}{N(\theta^{(m)}; \mu_0, \Sigma_0)}$
where $N(\theta; \mu, \Sigma)$ stands for the Gaussian probability density with moment parameters $(\mu, \Sigma)$, evaluated at $\theta$, and the effective sample size
$ESS = \frac{\left(\sum_{m=1}^{M} w^{(m)}\right)^2}{\sum_{m=1}^{M} (w^{(m)})^2}.$

If $ESS < ESS_{\min}$, replace the sample by $M$ fresh IID draws $(\theta^{(m)}, y^{(m)})$ from $q_{-i}(\theta)\, p(y \mid \theta)$, set all weights $w^{(m)} = 1$, and set $\mu_0 = \mu_{-i}$, $\Sigma_0 = \Sigma_{-i}$.

Compute the following importance sampling estimates:
$\hat{\mu} = \frac{1}{W} \sum_{m=1}^{M} w^{(m)}\, \theta^{(m)}\, \mathbb{1}\{\|y^{(m)} - y_i\| \leq \epsilon\}$
and
$\hat{\Sigma} = \frac{1}{W} \sum_{m=1}^{M} w^{(m)}\, \theta^{(m)} (\theta^{(m)})^T\, \mathbb{1}\{\|y^{(m)} - y_i\| \leq \epsilon\} - \hat{\mu} \hat{\mu}^T, \qquad W = \sum_{m=1}^{M} w^{(m)}\, \mathbb{1}\{\|y^{(m)} - y_i\| \leq \epsilon\}.$

Return $\hat{\mu}$, $\hat{\Sigma}$, $W$, the weighted sample, and $\mu_0$, $\Sigma_0$.
The more slowly the EP approximation evolves, the less often the pseudo-datapoints need to be regenerated, so that as the approximation gradually stabilises, we no longer need to draw any new samples. Since EP slows down rapidly during the first two passes, most of the computational effort is devoted to the early phase, and additional passes through the data come essentially for free.
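The weight computation and the ESS test at the heart of the recycling scheme can be sketched as follows (illustrative Python, one-dimensional; the pseudo-prior parameters below are arbitrary):

```python
import math
import random

def gauss_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) at x."""
    return (-0.5 * math.log(2 * math.pi * sigma * sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

def recycle_weights(thetas, mu_old, sig_old, mu_new, sig_new):
    """Importance weights for reusing draws theta ~ N(mu_old, sig_old^2)
    under the updated pseudo-prior N(mu_new, sig_new^2), together with the
    ESS criterion used to decide when to regenerate."""
    logw = [gauss_logpdf(t, mu_new, sig_new) - gauss_logpdf(t, mu_old, sig_old)
            for t in thetas]
    mx = max(logw)
    w = [math.exp(lw - mx) for lw in logw]     # rescale for numerical stability
    ess = sum(w) ** 2 / sum(v * v for v in w)  # (sum w)^2 / sum w^2
    return w, ess

rng = random.Random(1)
thetas = [rng.gauss(0.0, 1.0) for _ in range(2000)]
# Identical pseudo-priors: all weights equal, ESS = M.
_, ess_same = recycle_weights(thetas, 0.0, 1.0, 0.0, 1.0)
# Shifted, narrower pseudo-prior: weights vary, ESS drops below M.
_, ess_far = recycle_weights(thetas, 0.0, 1.0, 1.0, 0.8)
```

When `ess_far` drops below the chosen threshold, one regenerates fresh pairs from the current pseudo-prior, exactly as in Algorithm 2.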
In non-IID cases several options are still available. For some models the data may come in blocks, each block made of IID datapoints (think, for example, of a linear model with discrete predictors). We can then apply the strategy outlined above in a block-wise manner (see the reaction times example, Section 5.4). In other models there may be an easy transformation of the samples for datapoint $i$ such that they become samples for datapoint $i+1$, or one may be able to reuse part of the simulation.
5 Case studies
5.1 General methodology
In each scenario, we apply the following approach. In a first step, we run the EP-ABC algorithm. We may run the algorithm several times, to evaluate the Monte Carlo variability of the output, and we may also run it for different values of $\epsilon$, in order to assess the sensitivity to this approximation parameter. We use the first run to determine how many passes (complete cycles through the $n$ sites) are necessary to reach convergence. A simple way to monitor convergence is to plot the evolution of the expectation (or of the smallest eigenvalue of the covariance matrix) of the current Gaussian approximation along the site updates. Note however that, in our experience, it is quite safe to simply fix the number of complete passes to a small number. Finally, note that in each example we could take $s_i(y_i) = y_i$, so the point of determining appropriate local summary statistics is not discussed.

In a second step, we run alternative algorithms, that is, either an exact (but typically expensive) MCMC algorithm, or an ABC algorithm based on some set of summary statistics. The ABC algorithm we implement is a Gaussian random walk version of the MCMC-ABC algorithm of Marjoram et al. (2003). This algorithm targets a standard ABC approximation, i.e. (1.1), corresponding to a single constraint $\|s(y^\star) - s(y)\| \leq \epsilon$, for some vector of summary statistics $s$ and some $\epsilon$; the specific choices of $s$ and $\epsilon$ are discussed for each application. We calibrate the tuning parameters of these MCMC algorithms using the information provided by the first step: we use the expectation of the approximated posterior distribution provided by EP-ABC as the starting point of the MCMC chain, random walk scales are taken to be some fraction of the square root of the approximated posterior variances, and so on. This makes our comparisons particularly unfavourable to EP-ABC. Despite this, we find consistently that the EP-ABC algorithm is faster by several orders of magnitude, and leads to smaller overall approximation errors. We report computational loads both in terms of CPU time (e.g. 30 seconds) and in terms of the number of simulations of replicate datapoints $y_i^{(m)}$; the latter is typically the bottleneck of the computation.
All the computations were performed on a standard desktop PC in Matlab; programs are available from the first author’s web page.
5.2 First example: Alpha-stable models
Alpha-stable distributions are useful in areas (e.g. finance) concerned with noise terms that may be skewed, and may have heavy tails and an infinite variance. A univariate alpha-stable distribution does not admit a closed-form expression for its density, but may be specified through its characteristic function, which involves four parameters: $\alpha \in (0, 2]$ determines the tails, $\beta \in [-1, 1]$ determines the skewness, and $\gamma > 0$ and $\delta$ are respectively scale and location parameters; see Nolan (2012, Chap. 1) for a general introduction to stable distributions.
Peters et al. (2010) consider a model of IID observations $y_1, \dots, y_n$ from a univariate alpha-stable distribution, and propose to use the ABC approach to infer the parameters. Likelihood-free inference is appealing in this context, because sampling from an alpha-stable distribution is fast (using e.g. the algorithm of Chambers et al., 1976), while computing its density is cumbersome.
Trying EP-ABC on this example is particularly interesting for the following reasons: (a) Peters et al. (2010) show that choosing a reasonable set of summary statistics for this problem is difficult, and that several natural choices lead to strong biases; and (b) since alpha-stable distributions are very heavy-tailed, the posterior distribution may be heavy-tailed as well, which seems a challenging problem for a method based on a Gaussian approximation such as EP-ABC.
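For completeness, here is an illustrative Python sketch of the Chambers–Mallows–Stuck sampler for $\alpha \neq 1$ (our own illustration, not the paper's code; the parameter values below are arbitrary):

```python
import math
import random

def stable_rvs(alpha, beta, gamma, delta, n, seed=0):
    """Chambers-Mallows-Stuck sampler for alpha-stable variates, alpha != 1,
    in a parametrisation where alpha = 2 recovers N(delta, 2 * gamma^2)."""
    rng = random.Random(seed)
    t = beta * math.tan(math.pi * alpha / 2.0)
    b = math.atan(t) / alpha
    s = (1.0 + t * t) ** (1.0 / (2.0 * alpha))
    out = []
    for _ in range(n):
        u = rng.uniform(-math.pi / 2.0, math.pi / 2.0)  # uniform angle
        w = rng.expovariate(1.0)                        # unit exponential
        x = (s * math.sin(alpha * (u + b)) / math.cos(u) ** (1.0 / alpha)
             * (math.cos(u - alpha * (u + b)) / w) ** ((1.0 - alpha) / alpha))
        out.append(gamma * x + delta)
    return out

# Sanity check: alpha = 2, beta = 0 gives a Gaussian with variance 2*gamma^2.
xs = stable_rvs(alpha=2.0, beta=0.0, gamma=1.0, delta=0.0, n=200_000)
m = sum(xs) / len(xs)
v = sum((x - m) ** 2 for x in xs) / len(xs)
```

Each draw costs only two uniforms, an exponential, and a few trigonometric evaluations, which is why simulation is so much cheaper than density evaluation for this family.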
Our data consist of rescaled log-returns computed from daily exchange rates of AUD (Australian Dollar) recorded in GBP (British Pound) between 1 January 2005 and 1 December 2010. (These data are publicly available on the Bank of England website.) We transform the parameters to the real line through $\Phi$, the standard Gaussian cumulative distribution function, and we set the prior to a Gaussian in the transformed space. Note however that our results are expressed in terms of the initial parametrisation $(\alpha, \beta, \gamma, \delta)$; i.e. for each parameter we report the approximate marginal posterior distribution obtained through the appropriate change of variable applied to the Gaussian approximation produced by EP-ABC. We run the EP-ABC algorithm (recycling version, as the model is IID; see Section 4), with $\|\cdot\|$ in (3.1) set to the Euclidean norm in $\mathbb{R}$ (i.e. the constraints simplify to $|y_i^\star - y_i| \leq \epsilon$). Variations over ten runs are negligible. Average CPU time for one run is 39 minutes.
We first compare these results with the output of an exact random-walk Metropolis–Hastings algorithm, which relies on the evaluation of an alpha-stable probability density function for each datapoint (using numerical integration). Because of this, this algorithm is very expensive: we ran it for about 60 hours. One sees in Figure 5.1 that the difference between EP-ABC and the exact algorithm is negligible.

We then compare these results with those obtained by MCMC-ABC, for the set of summary statistics which performs best among those discussed by Peters et al. (2010, Section 3.1). This run requires many times more simulations from a univariate alpha-stable distribution than the EP-ABC runs above. Through pilot runs, we set $\epsilon$ as small as possible, subject to having a reasonable acceptance rate for this computational budget. In Figure 5.1, we see that the posterior output from this MCMC-ABC exercise is not as good an approximation as the output of EP-ABC. As explained in the previous section, we set the starting point of the MCMC-ABC chain to the posterior mode. If initialised from some other point, the sampler typically takes a much longer time to reach convergence, because the acceptance rate is significantly lower in regions far from the posterior mode.
Finally, we also use EP-ABC, with the same settings as above, in order to approximate the evidence of the model above and of two alternative models, namely a symmetric alpha-stable model, where $\beta$ is set to 0, and a Student model, with 3 parameters (degrees of freedom, scale, and location) and a Gaussian prior on the transformed parameters. (The standard deviation over repeated runs is small.) One sees that there is no strong evidence of skewness in the data, and that the Student distribution and a symmetric alpha-stable distribution seem to fit the data equally well. We obtained the same value for the evidence of the Student model when using the generalised harmonic mean estimator (Gelfand and Dey, 1994) based on a very long chain of an exact MCMC algorithm. For both alpha-stable models, this approach proved to be too expensive to allow for a reliable comparison.

5.3 Second example: Lotka–Volterra models
The stochastic Lotka–Volterra process describes the evolution of two species, $X$ (prey) and $Y$ (predator), through the reaction equations:
$X \xrightarrow{\theta_1} 2X, \qquad X + Y \xrightarrow{\theta_2} 2Y, \qquad Y \xrightarrow{\theta_3} \emptyset.$
This chemical notation means that, in an infinitesimal interval $[t, t + \mathrm{d}t]$, the probability that one prey is replaced by two preys is $\theta_1 X_t\, \mathrm{d}t$, and so on. Typically, the observed data are made of vectors in $\mathbb{N}^2$, which correspond to the population levels at integer times. This model is Markov, and one can efficiently simulate from $p(y_i \mid y_{i-1}, \theta)$ using Gillespie's (1977) algorithm. On the other hand, the transition density is intractable. This makes this model a clear candidate both for ABC, as noted by Toni et al. (2009), and for EP-ABC. Boys et al. (2008) show that MCMC remains feasible for this model, but in certain scenarios the proposed schemes are particularly inefficient, as also noted by Holenstein (2009, Chap. 4).
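Simulating one transition of this process with Gillespie's algorithm takes only a few lines. The following Python sketch is our own illustration; the rates and initial populations are arbitrary values chosen for the example, not those of the dataset analysed below.

```python
import random

def gillespie_lv(x, y, theta, t_max, rng):
    """Gillespie simulation of the stochastic Lotka-Volterra process over one
    observation interval of length t_max, started at (x, y) prey/predators.
    Reactions: X -> 2X (rate th1*X), X+Y -> 2Y (rate th2*X*Y), Y -> 0 (rate th3*Y)."""
    th1, th2, th3 = theta
    t = 0.0
    while True:
        rates = [th1 * x, th2 * x * y, th3 * y]
        total = sum(rates)
        if total == 0.0:                # both species extinct: nothing can happen
            return x, y
        t += rng.expovariate(total)     # exponential waiting time to next reaction
        if t > t_max:
            return x, y                 # no further reaction within the interval
        u = rng.uniform(0.0, total)     # pick a reaction proportionally to its rate
        if u < rates[0]:
            x += 1                      # prey birth
        elif u < rates[0] + rates[1]:
            x -= 1; y += 1              # predation
        else:
            y -= 1                      # predator death

rng = random.Random(7)
state = gillespie_lv(50, 100, (0.5, 0.0025, 0.3), t_max=1.0, rng=rng)
```

Calling this function once per unit interval yields exactly the chunk-wise simulations $y_i^{(m)} \sim p(y_i \mid y_{i-1}, \theta)$ that EP-ABC requires.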
Following the aforementioned papers, we consider a simulated dataset, corresponding to given rates $(\theta_1, \theta_2, \theta_3)$ and initial population values; see Figure 5.2. Since the observed data are integer-valued, we use the supremum norm in (3.1), and an integer-valued $\epsilon$; this is equivalent to imposing simultaneously the constraints $|X_i^\star - X_i| \leq \epsilon$ and $|Y_i^\star - Y_i| \leq \epsilon$ in the ABC posterior.
First, we run EP-ABC (standard version) with two values of $\epsilon$. We find that a single pass over the data is sufficient to reach convergence. For the larger (resp. smaller) value of $\epsilon$, CPU time for each run is 2.5 minutes (resp. 25 minutes); the marginal posteriors obtained through EP-ABC are reported in Figure 5.3.
When applying ABC to this model, Toni et al. (2009) use as a pseudo-distance between the actual data and the simulated data the sum of squared errors. In Wilkinson's (2008) perspective, discussed in Section 3.3, this is equivalent to considering a state-space model where the latent process is the Lotka–Volterra process described above, and the observation process is the same process corrupted with Gaussian noise (provided the indicator function in the ABC posterior is replaced by a Gaussian kernel). Thus, instead of a standard ABC algorithm, it seems more efficient to resort to an MCMC sampler specifically designed for state-space models in order to simulate from the ABC approximation of Toni et al. (2009). Following Holenstein (2009, Chap. 4), we consider a Gaussian random walk version of the marginal PMCMC sampler. This algorithm is a particular instance of the state-of-the-art PMCMC framework of Andrieu et al. (2010), which is based on the idea of approximating the marginal likelihood of the data by running a particle filter at each iteration of the MCMC sampler. The big advantage of PMCMC in this situation (compared to other MCMC approaches for state-space models) is that it does not require a tractable probability density for the Markov transition of the state-space model.
In Figure 5.3, we report the posterior output obtained from this sampler, run for about iterations and particles (2 days in CPU time, simulated transitions ), with random walk scales set so as to obtain approximately a acceptance rate. These plots correspond to , and a state-space model with a uniformly distributed observation noise. In Figure 5.3, one detects practically no difference between PMCMC and EP-ABC with (black lines), although the CPU time of the latter was about 1500 times smaller.
The difference between the two EP-ABC approximations (corresponding to and ) is a bit more noticeable. Presumably, the EP-ABC approximation corresponding to is slightly closer to the true posterior. We did not manage, however, to obtain reliable results from our PMCMC sampler in a reasonable time.
5.4 Third example: Race models of reaction times
Reaction time models seek to describe the decision behaviour of (human or animal) subjects in a choice task (Luce, 1991; Meyer et al., 1988; Ratcliff, 1978). In the typical experiment, subjects view a stimulus, and must choose an appropriate response. For example, the stimulus might be a set of moving points, and the subject must decide whether the points move to the left or to the right.
Assuming that the subject may choose between alternatives, one observes independent pairs, , where is the chosen alternative, and is the measured reaction time. For convenience, we drop for now the index in order to describe the random distribution of the pair .
Reaction time models assume that the brain processes information progressively, and that a decision is reached when a sufficient amount of information has been accumulated. In the model we use here (a variant of typical models found in e.g. Ratcliff and McKoon, 2008; Bogacz et al., 2007), parallel integrators represent the evidence in favour of each of the alternatives. The model is illustrated in Figure 5.4. The first accumulator to reach its boundary wins the race and determines which response the subject will make. Each accumulator undergoes a Wiener process with drift:
where the 's are the drift parameters, the 's are independent Wiener processes, and is a fixed time scale, . The measured reaction time is corrupted by a uniformly distributed noise , representing the "non-decisional time" (Ratcliff and McKoon, 2008), i.e. the time the subject needs to execute the decision (prepare a motor command, press an answer key, ...). This model is summarised by the following equations:
(We fix and to , , credible values from Ratcliff and McKoon (2008))
The model above captures the essential ideas of reaction time modelling, but it remains too basic for experimental data. We now describe several important extensions. First, a better fit is obtained if the boundaries are allowed to vary randomly from trial to trial (as in Ratcliff, 1978): we assume that , where and is a parameter to be estimated. Second, a mechanism is needed to ensure that the reaction times cannot be too large: we assume that if no boundary has been reached after 1 second, information accumulation stops and the highest accumulator determines the decision. Finally, one needs to account for lapses (Wichmann and Hill, 2001): on certain trials, subjects simply fail to pay attention to the stimuli and respond more or less at random. We account for this phenomenon by having a lapse probability of 5%. In case a lapse occurs, becomes uniformly distributed between 0 and 800 ms and the response is chosen between the alternatives with equal probability. Clearly, although this generalised model remains amenable to simulation, the corresponding likelihood is intractable.
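A forward simulator for this generalised model can be sketched as follows. This is a hedged illustration, not the paper's code: the Euler step size, the non-decision window, and the log-normal boundary perturbation are our assumptions, while the 1 s cutoff, the 5% lapse rate, and the uniform lapse time on (0, 800) ms come from the text.

```python
import math
import random

def simulate_trial(drifts, bounds, sigma=1.0, dt=0.001, rng=random):
    """One trial of the generalised race model (sketch; parameters illustrative).

    Each accumulator follows a Wiener process with drift; the first to hit its
    boundary wins. If no boundary is hit within 1 s, the highest accumulator
    decides. With probability 0.05 a lapse occurs: a random response and a
    reaction time uniform on (0, 0.8) s.
    """
    K = len(drifts)
    if rng.random() < 0.05:                       # lapse trial
        return rng.randrange(K), rng.uniform(0.0, 0.8)
    # trial-to-trial boundary variability (log-normal perturbation: our assumption)
    b = [bi * math.exp(0.1 * rng.gauss(0, 1)) for bi in bounds]
    x = [0.0] * K
    t = 0.0
    sdt = sigma * math.sqrt(dt)
    while t < 1.0:                                # accumulate until the 1 s cutoff
        t += dt
        for k in range(K):
            x[k] += drifts[k] * dt + sdt * rng.gauss(0, 1)
            if x[k] >= b[k]:                      # boundary reached: decision k
                return k, t + rng.uniform(0.1, 0.3)   # + uniform non-decision time
    k = max(range(K), key=lambda i: x[i])         # cutoff: highest accumulator wins
    return k, 1.0 + rng.uniform(0.1, 0.3)

choice, rt = simulate_trial(drifts=[0.5, 2.0], bounds=[1.0, 1.0])
```

Simulating a full dataset then consists of repeated independent calls, one per trial, with the drifts set according to the experimental condition.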
We apply this model to data from an unpublished experiment by M. Maertens (personal communication to Simon Barthelmé). The dataset consists of 1860 observations, obtained from a single human subject, who had to choose between
alternatives: "signal absent" (no light increment was presented) or "signal present" (a light increment was presented), under 15 different experimental conditions: 3 different locations on the screen, and 5 different contrast values. Following common practice in this field, trials with very high or very low reaction times (top and bottom 5%) were excluded from the dataset, because they have a high chance of being outliers (fast guesses, keyboard errors or inattention). The data are shown in Figure
5.5. From the description above, one sees that five parameters, , are required to describe the random behaviour of a single pair , when . To account for the heterogeneity introduced by the varying experimental conditions, we assume that the 2 accumulation rates, , vary across the 15 experimental conditions, while the 3 parameters related to the boundaries, , and , are shared across conditions. The parameter is therefore 33-dimensional.
We note that this model would present a challenge for inference even if the likelihood function were available. It is difficult to assign priors to the parameters, because they do not have a clear physical interpretation, and available prior knowledge (e.g., that reaction times will normally be less than 1 second) does not map easily onto them. Moreover, the model suffers in certain cases from weak identifiability problems. For instance, if one response dominates the dataset, there is little information available beyond the fact that one drift rate is much higher than the other (or one threshold much lower than the other, or both).
We reparametrised the positive parameters , as , , and assigned a prior to , , and . Taking a prior for these 3 quantities led to similar results. Some experimentation suggested that the drift rates could be constrained to lie between −0.1 and 0.1, because values outside this interval seem to yield improbable reaction times (too short or too long). We assigned a uniform prior to the 30 drift rates, and applied the appropriate transform, i.e. , in order to obtain a prior for the transformed parameters.
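The paper's exact transform is elided above; a standard choice for mapping a uniformly distributed rate on a bounded interval to the real line (so that a Gaussian EP approximation applies) is a logit-type transform, sketched below under that assumption:

```python
import math

LO, HI = -0.1, 0.1   # drift-rate bounds (from the text)

def to_real(v):
    """Map v in (LO, HI) to the real line via a logit transform.
    A sketch: the paper's actual transform is not reproduced here."""
    u = (v - LO) / (HI - LO)
    return math.log(u / (1.0 - u))

def to_bounded(z):
    """Inverse map: real line back to (LO, HI)."""
    return LO + (HI - LO) / (1.0 + math.exp(-z))

# Round trip: the two maps are inverses of each other
z = to_real(0.05)
assert abs(to_bounded(z) - 0.05) < 1e-12
```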
After a few unsuccessful attempts, we believe that this application is out of reach of standard ABC techniques. The main difficulty is the choice of the summary statistics. For instance, if one takes basic summaries (e.g. quartiles) of the distribution of reaction times, under each of the 15 experimental conditions, one ends up with a very large vector
of summary statistics. Due to the inherent curse of dimensionality of standard ABC (Blum, 2010), sampling enough datasets, of size 1860, that are sufficiently close (in terms of the pseudo-distance ) would require an enormous computational effort. On the other hand, taking a much smaller set of summary statistics would obviously lead to too poor an approximation. Some adjustments are needed for EP-ABC to work on this problem. First, notice that within a condition the datapoints are IID, so that the posterior distribution factorises over IID "blocks". We can therefore employ the recycling technique described in Section 4 to save simulation time, by going through the likelihood sites block by block. Second, since datapoints take values in , we adopt the following set of ABC constraints: , where denotes as usual the actual datapoints. Third, we apply the following two variance-reduction techniques. The first stems from the fact that each site likelihood does not depend on all 33 parameters, but only on a subset of size 5. In that case, using simple linear algebra, one can see that it is possible to update only the marginal distribution of the EP approximation with respect to these 5 parameters; see Appendix B for details. The second is a simple Rao-Blackwellisation scheme that exploits the fact that the non-decisional component is uniformly distributed, and may therefore be marginalised out when computing the EP update.
We report below the results obtained from EP-ABC with , , (see end of Section 3.2), and 3 complete passes over the data; CPU time was about 40 minutes. Results for smaller values of , e.g. , were mostly similar, but required a larger CPU time.
Since we could not compare the results to those of a standard ABC algorithm, we assess the results through posterior predictive checking. For each of the 15 experimental conditions, we generate 5,000 samples from the predictive density, and compare the simulated data with the real data, as follows. The marginal distribution of responses can be summarised by regressing the probability of response on stimulus contrast, separately for each stimulus position (as in Figure 5.5), using a binomial generalised linear model (with Cauchit link function). Figure 5.6 compares data and simulations, and shows that the predictive distribution successfully captures the marginal distribution of responses.
To characterise the distribution of reaction times, we look at means and interquantile intervals, conditional on the response and the experimental conditions. The results are presented in Figure 5.7. The predictive distributions capture the location and scale of the reaction time distributions quite well, at least for those conditions with enough data. Such results seem to indicate that, at the very least, the EP-ABC approximate posterior corresponds to a high-probability region of the true posterior.
6 Difficult posteriors
One obvious cause for concern is the behaviour of EP-ABC when confronted with difficult posterior distributions; in this section we explore a few possible scenarios and suggest methods for detecting and correcting potential problems. Of all the problematic shapes a posterior distribution could have, the worst is for it to have several modes, and we deal with this question first. Non-multimodal but otherwise problematic posteriors are discussed later.
6.1 Multimodality
We begin with a toy example of a multimodal posterior. Consider the following IID model: , , and . The dataset is obtained by sampling from the model, with . The true posterior may be computed exactly, and is plotted in the left-hand side of Figure 6.1. It features two symmetric, well-separated modes, around −2 and 2. The Gaussian pdf that minimises , where is the posterior (in other words, the Gaussian pdf with mean equal to the posterior expectation and variance equal to the posterior variance), is also represented, as a dashed line, but it cannot be distinguished from the EP-ABC approximation (thick line).
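The fact that the KL-optimal Gaussian simply matches the posterior mean and variance can be checked numerically. The sketch below uses a generic two-component Gaussian mixture as a stand-in for the toy posterior (whose exact specification is not reproduced here); all numbers are illustrative:

```python
import math

def norm_pdf(x, m, s):
    """Gaussian density with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def target(x):
    """Toy bimodal density: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2),
    standing in for the bimodal posterior discussed in the text."""
    return 0.5 * norm_pdf(x, -2, 0.5) + 0.5 * norm_pdf(x, 2, 0.5)

# The Gaussian q minimising KL(p || q) matches the mean and variance of p;
# compute both by quadrature on a grid.
xs = [-8 + 16 * i / 4000 for i in range(4001)]
dx = xs[1] - xs[0]
z = sum(target(x) for x in xs) * dx
mean = sum(x * target(x) for x in xs) * dx / z
var = sum((x - mean) ** 2 * target(x) for x in xs) * dx / z
print(mean, var)   # mean near 0, variance near 4.25 = 2^2 + 0.5^2
```

The resulting Gaussian is centred between the two modes and much wider than either of them, exactly the behaviour of the dashed line in Figure 6.1.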
The behaviour of EP-ABC in this case is instructive. If we run the standard EP-ABC algorithm (using the IID version, see Section 4 of the paper, and with , ), we obtain a result very close to the thick line in one pass (one cycle over all the observations). However, if the algorithm is run for two passes or more, there is a very high probability that it stops its execution before completion, because many site updates will produce negative definite contributions . To perform more than one pass, we have to use slow updates (as described in Section 3 of the paper), and set to a small value. The thick line in Figure 6.1 was obtained with three passes, and slow updates with . The right-hand side of Figure 6.1 shows the type of plot one may use to assess convergence of EP-ABC: it represents the evolution of the mean of the EP approximation along the iterations; one "iteration" corresponds to one site update, and since we performed three passes over sites, there are 150 iterations in total. This plot seems to indicate that the posterior expectation is slowly drifting to the right, and does not stabilise. If we try to run for more passes, we again end up with non-invertible covariance matrices. This behaviour was already described by Minka in his PhD thesis (Minka, 2001b) for the deterministic case.
The first test of whether the posterior distribution is reasonable is simply to run EP-ABC and see if it fails. There are two possible causes of failure: either one used too few samples and Monte Carlo variance led to negative definite updates, or the model itself is problematic for EP, in which case EP will still fail even with large sample sizes.
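In one dimension, the mechanics of a failing site update are easy to see: the site's precision contribution is the hybrid precision minus the cavity precision, which is negative whenever the hybrid is wider than the cavity, as tends to happen with multimodal hybrids. A minimal sketch (function name ours):

```python
def site_update_ok(var_cavity, var_hybrid):
    """In 1-D, an EP site update contributes precision
    1/var_hybrid - 1/var_cavity to the global approximation.
    The update is valid only if this quantity is positive, i.e. the
    hybrid distribution is narrower than the cavity. Multimodal hybrids
    often violate this (a sketch of the failure mode in the text)."""
    return 1.0 / var_hybrid - 1.0 / var_cavity > 0.0

print(site_update_ok(var_cavity=1.0, var_hybrid=0.8))   # True: valid update
print(site_update_ok(var_cavity=1.0, var_hybrid=2.5))   # False: negative contribution
```

In the multivariate case the same check amounts to testing whether the precision-matrix difference is positive definite, e.g. via an attempted Cholesky factorisation.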
Beyond this rather informal test, there are more principled things one could do. Paquet et al. (2009) introduce a very generic class of corrections to EP, described in Appendix C (Paquet-Winter-Opper corrections). Their first-order correction can be obtained relatively cheaply from the hybrid distributions, and can be used to build corrected marginals. In Figure 6.1 we show the first-order correction for our toy multimodal example: although the corrected marginal is still far from the true posterior, it is clearly bimodal and in this case serves as a warning that the Gaussian approximation is inappropriate. (We also applied the same type of correction in our third example, for which no MCMC gold standard is available, but the correction did not modify the marginal posterior distributions in any noticeable manner.) In a similar vein, one could use goodness-of-fit tests to see if the hybrid distributions show large deviations from normality.
Once the problem has been diagnosed, what can one do? The answer is, unfortunately, “not much”. If one can determine that multimodality is caused by ambiguity in one or two parameters, it is possible to effectively treat these parameters as hyperparameters. One separates
into , where is the set of problematic parameters. Running EP-ABC with will produce an estimate of , which can be used to form an approximation to the marginals in a manner analogous to the procedure used in INLA (Rue et al., 2009). Although this will require running EP-ABC several times for different values of , it might still be a lot cheaper than an MCMC procedure. If no information is available about the cause of the multimodality, or about where the different modes are, our opinion is that all existing ABC methods will have tremendous difficulties. Work might be better invested in injecting more prior information, or in changing the parameterisation so as to ensure there is only one mode.
6.2 Difficult unimodal posteriors
Compared to the toy example of the previous section, a more realistic scenario is a posterior distribution that is unimodal but otherwise still badly behaved. There are no theoretical results available on convergence in this case, so we evaluated EP's behaviour on a rather nasty posterior distribution, adapted from the model of reaction times described above.
We devised a problem in which we guessed the posterior would be, at a minimum, very badly scaled, and would probably have an inconvenient shape. In the context of the reaction times model described in Section 5.4, imagine that in a choice experiment the subject picked exclusively the second category, so that we have no observed reaction times for the first category. From the point of view of the model, this occurs whenever the threshold for the first category is high enough, compared to the accumulation speed, that the second accumulator always wins. This creates complete uncertainty as to the ratio of accumulation speed to threshold value for the first accumulator.
We generated a large dataset (1,000 datapoints) based on fixed values of : and . In this dataset there are no decisions favouring the first category, and the resulting reaction time distribution is plotted in Fig. 6.2. To keep the inference problem manageable using MCMC-ABC, we limit inference to three parameters: , (log-accumulation rates of the two accumulators) and (log-threshold for the first category). The other two parameters are considered known and fixed at their true values. We chose these parameters because we expected the posterior to show quasi-certainty for and very high uncertainty for . To further increase uncertainty, we chose a very vague prior .
We ran EP-ABC on this problem using the recycling strategy described in Section 3.2 (data are IID). In this case EP-ABC needs large sample sizes to be numerically stable, which is probably due to the very ill-conditioned covariance matrices that arise (themselves due to a very poorly-scaled posterior; more on that below). To eliminate any such problems, we deliberately chose to use a very high number of samples, 3 million, with a minimum expected sample size of 500. We used an acceptance window of 10 ms. We did 3 passes over the data, for a total computing time of 9 minutes on a standard laptop. To check for stability, we ran EP-ABC 10 times.
To check the accuracy of the results, we ran an MCMC-ABC sampler that used as its summary statistics 8 quantiles of the reaction time distribution, linearly spaced from 0.01 to 0.99, together with whether the first category was ever chosen. Quantiles were rescaled by a factor of . Samples were accepted if the second category was chosen less than 100% of the time, or if the Euclidean distance between was below . The latter value was found by iteratively adjusting to obtain an acceptance rate of 1/1000.
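The summary-statistic distance just described can be sketched as follows; the rescaling factor is elided in the text, so `scale` below is a placeholder, and the quantile interpolation rule is our choice:

```python
def quantiles(xs, probs):
    """Empirical quantiles by linear interpolation between order statistics
    (a simple sketch; other interpolation rules are possible)."""
    s = sorted(xs)
    out = []
    for p in probs:
        h = p * (len(s) - 1)
        i = int(h)
        frac = h - i
        out.append(s[i] + frac * (s[min(i + 1, len(s) - 1)] - s[i]))
    return out

def summary_distance(rt_sim, rt_obs, scale=1.0):
    """Euclidean distance between 8 rescaled reaction-time quantiles,
    linearly spaced from 0.01 to 0.99 as in the text. `scale` stands in
    for the (elided) rescaling factor."""
    probs = [0.01 + k * (0.99 - 0.01) / 7 for k in range(8)]
    qs, qo = quantiles(rt_sim, probs), quantiles(rt_obs, probs)
    return scale * sum((a - b) ** 2 for a, b in zip(qs, qo)) ** 0.5

d = summary_distance([0.3, 0.4, 0.5, 0.6], [0.3, 0.4, 0.5, 0.6])
```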
We would like to stress that an appropriate proposal distribution was found using the covariance matrix obtained from EP-ABC (and readjusted later based on short runs). The scaling of the posterior is such that using, e.g., the prior covariance matrix in a proposal distribution is impossible: one ends up using extremely high values of , and therefore with a very poor approximation of the posterior. As a starting value, we used the true parameters. After several adjustments, we ran an MCMC-ABC chain for 3 million iterations, which we trimmed down to 600 samples for visualisation in Fig. 6.3 (the autocorrelation in the chain being in any case extremely high).
MCMC-ABC samples are shown in Fig. 6.3, along with density contours representing the EP-ABC results. Although MCMC-ABC and EP-ABC target different densities, it is reasonable to hope that in this case these densities are fairly close, since we deliberately set the acceptance ratio to be very small, and the data are simple enough for the summary statistics to be adequate. The posterior distribution obtained using MCMC-ABC, as shown in Figure 6.3, is pathological, as expected. The scaling is very poor: one parameter varies over a very small range while the other two vary over and . The posterior also has a difficult shape, and appears truncated along certain dimensions. EP-ABC provides approximations that are located squarely within the posterior distribution, but seems to underestimate the posterior variance of the third parameter, . Given the difficult context, EP-ABC arguably still performs rather well. At the very least, it may be used to find an appropriate proposal distribution for MCMC-ABC, which may perform extremely poorly without such help.
7 Extensions
Our description of EP-ABC assumes that the prior is Gaussian, and produces a Gaussian approximation. With respect to the latter point, we have already mentioned (see also Appendix A) that EP, and therefore EP-ABC, may more generally propagate approximations from within a given exponential family; say the parametric family of Wishart distributions if the parameter is a covariance matrix. Regarding the prior, even when considering a Gaussian approximation, it is possible to accommodate a non-Gaussian prior by treating the prior as an additional site.
Thus, the main constraint on the application of EP-ABC is that the likelihood is factorisable, i.e. , in such a way that simulating from each factor is feasible.
In this paper, we focused our attention on important applications of likelihood-free inference where the likelihood is trivial to factorise, either because the datapoints are independent, or because they are Markov. But EP-ABC is not limited to these two situations. First, quite a few time series models may be simulated sequentially, even if they are not Markov. For instance, one may apply EP-ABC straightforwardly to a GARCH-stable model (e.g. Liu and Brorsen, 1995; Mittnik et al., 2000; Menn and Rachev, 2005), that is, a GARCH model (Bollerslev, 1986) with alpha-stable innovations. Second, one may obtain a simple factorisation of the likelihood by incorporating latent variables into the vector of unknown parameters. Third, one may replace the true likelihood by a certain approximation which is easy to factorise. The following section discusses and illustrates such an approach, based on the concept of composite likelihood.
7.1 Composite likelihood
Composite likelihood is an umbrella term for a family of techniques based on the idea of replacing an intractable likelihood by a factorisable pseudo-likelihood; see the excellent review of Varin et al. (2011). For the sake of simplicity, we focus on marginal composite likelihood, but we note that other versions of composite likelihood (where, for instance, the factors are conditional distributions) could be treated similarly.
In our EP-ABC context, replacing the likelihood by a marginal composite likelihood leads to the following type of ABC posterior
where is the marginal distribution of some subset of the observations, and is a non-negative weight. There exists some theory on how to choose the weights so that the composite likelihood is close enough to the true likelihood in some sense; see again Varin et al. (2011). For simplicity, we shall take . Clearly, EP-ABC may be applied straightforwardly to this new ABC posterior, provided one may sample independently from the marginal distributions . In fact, the factors may be treated as IID factors, which makes it possible to use the recycling strategy described in Section 4.
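For concreteness, a plausible form of this composite-likelihood ABC posterior is as follows; this reconstruction is our assumption, with $p_k(\cdot \mid \theta)$ denoting the marginal density of the $k$-th subset $y_k$, $y_k^{\star}$ the corresponding observed subset, and $w_k$ the weights:

```latex
\pi_{\epsilon}^{\mathrm{CL}}(\theta \mid y^{\star}) \propto
  p(\theta)\,
  \prod_{k=1}^{K}
  \left\{ \int \mathbb{1}\!\left\{ \lVert y_k - y_k^{\star} \rVert \le \epsilon \right\}
          \, p_k(y_k \mid \theta) \, \mathrm{d}y_k \right\}^{w_k}
```

With $w_k = 1$ for all $k$, as in the text, each bracketed factor is an ordinary ABC site, and EP-ABC applies without modification.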
To make this idea more concrete, we consider the class of hidden Markov models, where the datapoints are conditionally independent,