# Expectation-Propagation for Likelihood-Free Inference

Many models of interest in the natural and social sciences have no closed-form likelihood function, which means that they cannot be treated using the usual techniques of statistical inference. In the case where such models can be efficiently simulated, Bayesian inference is still possible thanks to the Approximate Bayesian Computation (ABC) algorithm. Although many refinements have been suggested, ABC inference is still far from routine. ABC is often excruciatingly slow due to very low acceptance rates. In addition, ABC requires introducing a vector of "summary statistics", the choice of which is relatively arbitrary, and often require some trial and error, making the whole process quite laborious for the user. We introduce in this work the EP-ABC algorithm, which is an adaptation to the likelihood-free context of the variational approximation algorithm known as Expectation Propagation (Minka, 2001). The main advantage of EP-ABC is that it is faster by a few orders of magnitude than standard algorithms, while producing an overall approximation error which is typically negligible. A second advantage of EP-ABC is that it replaces the usual global ABC constraint on the vector of summary statistics computed on the whole dataset, by n local constraints of the form that apply separately to each data-point. As a consequence, it is often possible to do away with summary statistics entirely. In that case, EP-ABC approximates directly the evidence (marginal likelihood) of the model. Comparisons are performed in three real-world applications which are typical of likelihood-free inference, including one application in neuroscience which is novel, and possibly too challenging for standard ABC techniques.

## Authors

• 12 publications
• 15 publications
09/11/2019

### Efficient Bayesian synthetic likelihood with whitening transformations

Likelihood-free methods are an established approach for performing appro...
05/22/2018

### Fast and Accurate Binary Response Mixed Model Analysis via Expectation Propagation

Expectation propagation is a general prescription for approximation of i...
10/15/2020

### Detecting conflicting summary statistics in likelihood-free inference

Bayesian likelihood-free methods implement Bayesian inference using simu...
10/03/2018

### An easy-to-use empirical likelihood ABC method

Many scientifically well-motivated statistical models in natural, engine...
11/12/2020

### On a Variational Approximation based Empirical Likelihood ABC Method

Many scientifically well-motivated statistical models in natural, engine...
12/16/2010

### Fast Convergent Algorithms for Expectation Propagation Approximate Bayesian Inference

We propose a novel algorithm to solve the expectation propagation relaxa...
12/16/2014

### Expectation propagation as a way of life: A framework for Bayesian inference on partitioned data

A common approach for Bayesian computation with big data is to partition...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In natural and social sciences, one finds many examples of probabilistic models whose likelihood function is intractable. This includes most models of noisy biological neural networks

(Gerstner and Kistler, 2002), some time series and choice models in Economics (Train, 2003), phylogenetic models in evolutionary Biology (Beaumont, 2010), spatial extremes in Environmental Statistics (Davison et al., 2012), among others. That the likelihood is intractable is unfortunate, because one would still like to perform the usual statistical tasks of parameter inference and model comparison, and the traditional statistical tool-kit assumes that the likelihood function is either directly available or can be made so by introducing latent variables. This explains that researchers have often had to content themselves with semi-quantitative analyses showing that a model could reproduce some aspect of an empirical phenomenon for some values of the parameters; for two representative examples from vision science, see Nuthmann et al. (2010), Brascamp et al. (2006).

A breakthrough was provided by the work of Tavaré et al. (1997), Pritchard et al. (1999) and Beaumont et al. (2002), in the form of the Approximate Bayesian Computation (ABC) algorithm, which enables Bayesian inference in the likelihood-free context. (See also Diggle and Gratton (1984) and Rubin (1984) for early versions of ABC.) Assuming some model for the data , and a prior over the parameter , the ABC algorithm iterates the following steps:

1. Draw from the prior, .

2. Draw a dataset from the model conditional on , .

3. If , then keep , otherwise reject.

and therefore produces samples from the so-called ABC posterior:

 pϵ(θ|y∗)∝p(θ)∫p(y|θ)1{d(y,y⋆)≤ϵ}dy. (1.1)

The pseudo-distance is usually taken to be , for some norm , where

is a vector of summary statistics, for example some empirical quantiles or moments of

. Unless is sufficient, the approximation error does not vanish as , as

. In that respect, the ABC posterior suffers from two levels of approximation: a nonparametric error, akin to the error of kernel density estimation, and where

plays the role of a bandwidth (see e.g. Blum, 2010), and a bias introduced by the summary statistics . The more we include in , the smaller the bias induced by should be. On the other hand, as the dimensionality of increases, the lower the acceptance rate will be. We would then have to increase , which leads to an approximation of lower quality.

Thus, ABC requires in practice some more or less arbitrary compromise between what summary statistics to include and how to set To establish that the results of the inference are somewhat robust to these choices, many runs of the algorithm may be required. Although several variants of the original ABC algorithm that aim at increasing acceptance rates exist (e.g. Beaumont et al., 2002), the current state of the matter is that an ABC analysis is very far from routine use because it may take days to tune on real problems. The semi-automatic method for constructing summary statistics recently proposed by Fearnhead and Prangle (2012) seems to alleviate partly these problems, but it still requires at least one, and sometimes several pilot runs.

In this article we introduce EP-ABC, an adaptation of the Expectation Propagation (EP) algorithm (Minka, 2001a; Bishop, 2006, Chap. 10) to the likelihood-free setting. The main advantage of EP-ABC is that it is much faster than previous ABC algorithms: typically, it provides accurate results in a few minutes, whereas a standard ABC algorithm may need several hours, or even days. EP-ABC requires that the data may be decomposed into “chunks”, (of possibly different dimensionality or support), in such a way that is possible to simulate sequentially the chunks; i.e. one is able to simulate from for each . More precisely, EP-ABC builds an EP approximation of the following type of ABC posterior distributions:

 pϵ(θ|y⋆)∝p(θ)n∏i=1{∫p(yi|y⋆1:i−1,θ)1{∥si(yi)−si(y⋆i)∥≤ε}dyi} (1.2)

where is a summary statistic specific to chunk , and with the convention that . Given the way it operates, we shall see that EP-ABC essentially replaces the initial, possibly difficult ABC problem, by ABC simpler (i.e. lower-dimensional) ABC problems. In particular, it may be easier to construct a set of “local” summary statistics for each chunk rather than a global set the whole dataset

, in such a way that the probability that

is not too small.

Of course, not all ABC models are amenable to the factorisation in (1.2), at least directly. We shall discuss this point more in detail in the paper. At the very least, factorisable models include situations with repeated experiments, which are not easily tackled through standard ABC, because of the difficulty to define some vector which summarises these experiments. Bazin et al. (2010) discusses this point, in the context of genetic data collected over a large number of loci, and recommend instead to construct loci-specific summary statistics, and combine results from loci-specific ABC exercises.

In certain problems such that is of low dimension, it may be even possible to take . In that case, EP-ABC provides an EP approximation of a summary-less ABC posterior; that is, when (under appropriate measure-theoretic conditions), , the true posterior. In such situations, EP-ABC also provides an approximation of the evidence , which is a very useful quantity for model comparison. Dean et al. (2011)

show that, in the specific context of hidden Markov models such that the observation density is intractable (as this work was motivated by the filtering ABC algorithm developed in

Jasra et al. (2010)), the ABC error of a summary-less ABC algorithm should not be interpreted as a non-parametric error, but rather as a misspecification error, where the misspecified model is the true model, but with observations corrupted with noise. This misspecification implies a bias of order

. Thus, in addition to being more convenient for the user, as it avoids specifying some summary statistics, summary-less ABC does not seem to suffer from the curse of dimensionality (in the dimension of

) of standard ABC. Allowing summary-less ABC inference in certain cases seems to be another advantage of EP-ABC.

We start with a generic description of the EP algorithm, in Section 2, and explain in Section 3 how it can be adapted to the likelihood-free setting. We explain in Section 4 how EP-ABC can be made particularly efficient when data-points are IID (independent and identically distributed). Section 5 contains three case studies drawn from finance, population ecology, and vision science. The two first examples are borrowed from already known applications of ABC, and illustrate to which extent EP-ABC may outperform standard ABC techniques in realistic scenarios. To the best of our knowledge, our third example from vision science is a novel application of likelihood-free inference, which, as which shall argue, seems too challenging for standard ABC techniques.

Section 6 discusses how EP-ABC may cope with difficult posteriors through two additional examples. Section 7 discusses possible extensions of EP-ABC; in particular, for non-factorisable likelihoods. In such cases, one may replace the likelihood by some factorisable approximation, such as a composite likelihood. Section 8 concludes.

We use the following notations throughout the paper: bold letters refer to vectors or matrices, e.g. , , and so on. We also use bold face to distinguish complete sets of observations, i.e. or , from their components, , and , , although we do not assume that these components are necessarily scalar. For sub-vectors of observations, we use the colon notation: . The notation refers to a generic norm, and

refers to the Euclidean norm. The Kullback-Leibler divergence between probability densities

and is denoted as

 KL(π∥q)=∫π(θ)log(π(θ)q(θ))dθ.

The letter always refers to probability densities concerning the model; i.e. is the prior, is the likelihood of the first observation, and so on. Transpose of a matrix is denoted , and the diagonal matrix with diagonal elements given by vector is .

## 2 Expectation Propagation

Expectation Propagation (EP, Minka, 2001a)

is an algorithm for variational inference, a class of techniques that aim at finding a tractable probability distribution

that best approximates an intractable target density . One way to formulate this goal is as finding the member of some parametric family that is in some sense closest to , where “closest” is defined by some divergence between probability distributions. A distinctive feature of EP is that it based on the divergence , whereas many variational methods (e.g. Variational Bayes, see Chap. 10 of Bishop, 2006) try to minimize . For a good discussion of the relative merits of each divergence, see Bishop (2006, p. 468-470) and Minka (2005). As a brief summary, minimizing tends to produce approximation that are too compact; see also Wang and Titterington (2005). The divergence does not suffer from this drawback, and is perhaps more appealing from a statistical point of view (if only because maximum likelihood estimation is based on the corresponding contrast), but on the other hand, minimizing when is multimodal tends to produce an approximation that covers all the modes, and therefore has too large a support. We shall return to this point.

### 2.1 Assumptions of Expectation Propagation

EP assumes that the target density decomposes into a product of simpler factors

 π(θ)=1Zπn∏i=0li(θ) (2.1)

and exploits this factorisation in order to construct a sequence of simpler problems. For instance, may be the prior , if is a set of observations , with the convention that , and

EP uses an approximating distribution with a similar structure:

 q(θ)∝n∏i=0fi(θ) (2.2)

where the ’s are known as the “sites”. In a spirit close to a coordinate-descent optimization algorithm, each site is updated in turn, while the other sites are kept fixed.

For the sake of conciseness, we focus in this paper on Gaussian sites, expressed under their natural parametrisation: , where and are called from now on the site parameters. This generates the following Gaussian approximation :

 q(θ)∝exp⎧⎨⎩−12θt(n∑i=0Qi)θ+(n∑i=0ri)tθ⎫⎬⎭. (2.3)

In addition, we assume that the true prior is Gaussian, with natural parameters and . In that case, the site is kept equal to the prior, and only the sites to are updated.

We note however that EP may easily accommodate a non-Gaussian prior (by simply treating the prior as additional factor to be approximated), or other types of parametric sites. In fact, the site update described in the following section may be easily adapted to any exponential family; see Seeger (2005).

### 2.2 Site update

Suppose that (2.2) is the current approximation, and one wishes to update site . This is done by creating a “hybrid” distribution, obtained by substituting site with the true likelihood contribution :

 h(θ)∝q−i(θ)li(θ),q−i(θ)=∏j≠ifj(θ). (2.4)

For Gaussian sites, this leads to:

 h(θ)∝li(θ)exp(−12θtQ−iθ+rt−iθ),Q−i=∑j≠iQj,r−i=∑j≠irj.

The new value of site is then obtained by minimising with respect to the Kullback-Leibler pseudo-distance between the hybrid and the Gaussian approximation (again keeping the ’s fixed). When Gaussian sites are used, this minimisation is equivalent to taking to be the Gaussian density with moment parameters that match those of the hybrid distribution

 Zh = ∫q−i(θ)li(θ)dθ, μh = 1Zh∫θq−i(θ)li(θ)dθ, Σh = 1Zh∫θθtq−i(θ)li(θ)dθ−μhμth. (2.5)

A key observation at this stage is that the feasibility of EP is essentially dictated by how easily the moments above may be computed. These moments may be interpreted as the moments of a posterior distribution, based on a Gaussian prior, with natural parameters and , and a likelihood consisting of a single factor .

Finally, from these moment parameters, one may recover the natural parameters of and deduce the new site parameters for as follows:

 Qi←Σ−1h−Q−i,ri←Σ−1hμh−r−i.

EP proceeds by looping over sites, updating each one in turn until convergence is achieved. In well-behaved cases, one observes empirically that a small number of complete sweeps through the sites is sufficient to obtain convergence. However, there is currently no general theory on the convergence of EP.

Appendix A gives an algorithmic description of EP, in the more general case where an exponential family is used for the sites.

### 2.3 Approximation of the evidence

EP also provides an approximation of the normalising constant of (2.1), using the same ideas of updating site approximations through moment matching. To that effect, we rewrite the EP approximation with normalising constants for each site (assuming again the prior is Gaussian and does not need to be approximated):

 q(θ)=p(θ)n∏i=1fi(θ)Ci,fi(θ)=exp(−12θtQiθ+rtiθ). (2.6)

Then the update of site proceeds as before, by adjusting , and through moment matching. Simple calculations (see e.g. Seeger, 2005) lead to the following expressions for the update of :

 log(Ci)=log(Zh)−Ψ(r,Q)+Ψ(r−i,Q−i) (2.7)

where is the normalising constant of the hybrid, as defined in (2.5), , (resp. , ) are the natural parameters of the current Gaussian approximation (resp. of ) and is the log-normalising constant of an unnormalised Gaussian density:

 Ψ(r,Q)=log{∫exp(−12θtQθ+rtθ)dθ}=−12log|Q/2π|+12rtQr.

For each site update, one calculates as defined in (2.7). Then, at the end of the algorithm, one may return the following quantity

 n∑i=1log(Ci)+Ψ(r,Q)−Ψ(r0,Q0)

as an approximation to the logarithm of the evidence.

## 3 EP-ABC: Adapting EP to likelihood-free settings

### 3.1 Basic principle

As explained in the introduction, our objective is to approximate the following ABC posterior

 pϵ(θ|y⋆)∝p(θ)n∏i=1{∫p(yi|y⋆1:i−1,θ)1{∥si(yi)−si(y⋆i)∥≤ε}dyi} (3.1)

which corresponds to a particular factorisation of the likelihood,

 p(y⋆|θ)=n∏i=1p(y⋆i|y⋆1:i−1,θ). (3.2)

Note that, in full generality, the may be any type of “chunk” of the observation vector

, i.e. the random variables

may have a different dimension, or more generally different supports. For simplicity, we assume that the prior is Gaussian, with natural parameters and .

One may interpret (3.1) as an artificial posterior distribution, which decomposes into a prior times likelihood contributions , as in (2.1), with

 li(θ)={∫p(yi|y⋆1:i−1,θ)1{∥si(yi)−si(y⋆i)∥≤ε}dyi}.

We have seen that the feasibility of the EP algorithm is determined by the tractability of the following operation: to compute the two first moments of a pseudo-posterior, made of a Gaussian prior , times a single likelihood contribution . This immediately suggests the following EP-ABC algorithm. We use the EP algorithm, as described in Algorithm 3, and where the moments of such a pseudo-posterior are computed using as described in Algorithm 1, that is, as Monte Carlo estimates, based on simulated pairs , where , and .

Since EP-ABC integrates one data-point at a time, it does not suffer from a curse of dimensionality with respect to : the rejection rate of Algorithm 1 corresponds to a single constraint , not of them, and is therefore likely to be tolerably small even for small windows . (Otherwise, in very challenging situations, one has the liberty to replace Algorithm 1 by a more elaborate ABC algorithm.)

The only requirement of EP-ABC is that the factorisation of the likelihood, (3.2), is chosen in such a way that simulating from the model, i.e. can be decomposed into a sequence of steps, where one samples from , for . We shall see in our examples section, see Section 5, that several important applications of likelihood-free inference fulfil this requirement. We shall also discuss in Section 7 how other likelihood-free situations may be accommodated by the EP-ABC approach.

### 3.2 Numerical stability

EP-ABC is a stochastic version of EP, a deterministic algorithm, hence some care must be taken to ensure numerical stability. We describe here three strategies towards this aim.

First, to ensure that the stochastic error introduced by each site update does not vary too much in the course of the algorithm, we adapt dynamically , the number of simulated points, as follows. For a given site update, we sample repetitively pairs , as described in Algorithm 1, until the total number of accepted points exceeds some threshold . Then we compute the moments (3.3) and (3.4) based on all the accepted pairs.

Second, EP-ABC computes a great deal of Monte Carlo estimates, based on IID (independent and identically distributed) samples, part of which are Gaussian. Thus, it seems worthwhile to implement variance reduction techniques that are specific to the Gaussian distribution. After some investigation, we recommend the following quasi-Monte Carlo approach. We generate a Halton sequence

of dimension , which is a low discrepancy sequence in , and take

where , are the moment parameters corresponding to the natural parameters , , is the Cholesky lower triangle of , and returns a vector that contains the inverse distribution function of each component of the input vector. We recall briefly that a low discrepancy sequence in is a deterministic sequence that spreads more evenly over the hyper-cube

than a sample from the uniform distribution would; we refer the readers to e.g. Chap. 3 of

Gentle (2003) for a definition of Halton and other low discrepancy sequences, and the theory of quasi-Monte Carlo. Rigorously speaking, this quasi-Monte Carlo version of EP-ABC is a hybrid between Monte Carlo and quasi-Monte Carlo, because the are still generated using standard Monte Carlo. However, we do observe a dramatic improvement when using this quasi-Monte Carlo approach. An additional advantage is that one may save some computational time by generating once and for all a very large sequence of vectors, and store it in memory for all subsequent runs of EP-ABC.

The third measure we may take is to slow down the progression of the algorithm such as to increase stability, by conservatively updating the parameters of the approximation in Step 3 of Algorithm 3, that is, . Standard EP is the special case with . Updates of this type are suggested in Minka (2004).

In our experiments, we found that the two first strategies improve performance very significantly (in the sense of reducing Monte Carlo variability over repeated runs), and that the third strategy is sometimes useful, for example in our reaction time example, see Section 5.4.

### 3.3 Evidence approximation

In this section, we consider the special case , and we normalise the ABC posterior (3.1) as follows:

 pϵ(θ|y⋆)=1pϵ(y⋆)p(θ)n∏i=1⎧⎪⎨⎪⎩∫p(yi|y⋆1:i−1,θ)1{∥yi−y⋆i∥≤ε}vi(ϵ)dyi⎫⎪⎬⎪⎭, (3.5)

where is the normalising constant of the uniform distribution with respect to the ball of centre , radius , and norm . For the Euclidean norm, and assuming that the ’s have the same dimension , one has: , with ; e.g. if , if .

Wilkinson (2008) shows that a standard ABC posterior such as (1.1) can be interpreted as the posterior distribution of a new model, where the summary statistics are corrupted with a uniformly-distributed noise (assuming these summary statistics are sufficient). The expression above indicates that this interpretation also holds for this type of summary-less ABC posterior, except that the artificial model is now such all the random variables are corrupted with noise (conditional on ).

The expression above also raises an important point regarding the approximation of the evidence. In (3.5), the normalising constant is the evidence of the corrupted model, which converges to the evidence of the actual model as . On the other hand, EP-ABC targets (3.1), and, in particular, see Section 2.3, produces an EP approximation of its normalising constant, which is . Thus, one needs to divide this EP approximation by in order recover an approximation of . We found in our simulations that, when properly normalised as we have just described, the approximation of the evidence provided by EP-ABC is particularly accurate, see Section 5. In contrast, standard ABC based on summary statistics cannot provide an approximation of the evidence, as explained in the Introduction.

## 4 Speeding up EP-ABC in the IID case

Typically, the main computational bottleneck of EP-ABC, or other types of ABC algorithms, is simulating pseudo data-points from the model. In this section, we explain how these simulations may be recycled throughout the iterations in the IID (independent and identically distributed) case, so as to significantly reduce the overall computational cost of EP-ABC.

Our recycling scheme is based on a straightforward importance sampling strategy. Consider an IID model, with likelihood . Also, for the sake of simplicity, take . Assume that, for a certain site , pairs are generated from , as described in Algorithm 1. We have removed the subscript in both and , to highlight the fact that the generative process of the data-points is the same for all the sites. The next update, for site , requires computing moments with respect to . Thus, we may recycle the simulations of the previous site by assigning to each pair the importance sampling weight:

 w[m]i+1=q−(i+1)(θ[m])q−i(θ[m])×1{∥y[m]−y⋆i+1∥≤ϵ}

and compute the corresponding weighted averages.

Obviously, this step may also be applied to the subsequent sites, , until one reaches a stage when the weighted sample is too degenerated. When this happens, “fresh” simulations may be generated from the current site. Algorithm 2 describes more precisely this recycling strategy. To detect weight degeneracy, we use the standard ESS (Effective Sample Size) criterion of Kong et al. (1994): we regenerate when the ESS is smaller that some threshold .

The slower the EP approximation evolves, the less often regenerating the pseudo data-points is necessary, so that as the approximation gradually stabilises, we do not need to draw any new samples any more. Since EP slows down rapidly during the first two passes, most of the computational effort will be devoted to the early phase, and additional passes through the data will come essentially for free.

In non-IID cases several options are still available. For some models the data may come in blocks, each block made of IID data-points (think for example of a linear model with discrete predictors). We can apply the strategy outlined above in a block-wise manner (see the reaction times example, section 5.4). In other models there may be an easy transformation of the samples for data-point such that they become samples for data-point , or one may be able to reuse part of the simulation.

## 5 Case studies

### 5.1 General methodology

In each scenario, we apply the following approach. In a first step, we run the EP-ABC algorithm. We may run the algorithm several times, to evaluate the Monte Carlo variability of the output, and we may also run it for different values of , in order to assess the sensitivity to this approximation parameter. We use the first run to determine how many passes (complete cycles through the

sites) are necessary to reach convergence. A simple way to monitor convergence is to plot the evolution of the expectation (or the smallest eigenvalue of the covariance matrix) of the current Gaussian approximation

along the site updates. Note however that, in our experience, it is quite safe to simply fix the number of complete passes to a small number like . Finally, note that in each example, we could take , so the point of determining appropriate local summary statistics is not discussed.

In a second step, we run alternative algorithms, that is, either an exact (but typically expensive) MCMC algorithm, or an ABC algorithm, based on some set of summary statistics. The ABC algorithm we implement is a Gaussian random walk version of the MCMC-ABC algorithm of Marjoram et al. (2003). This algorithm targets a standard ABC approximation, i.e. (1.1), that corresponds to a single constraint , for some vector of summary statistics , and some ; the specific choices of and are discussed for each application. We calibrate the tuning parameters of these MCMC algorithms using the information provided by the first step: we use as a starting point for the MCMC chain the expectation of the approximated posterior distribution provided by the EP-ABC algorithm, random walk scales are taken to be some fraction of the square root of the approximated posterior variances, and so on. This makes our comparisons particularly unfavourable to EP-ABC. Despite this, we find consistently that the EP-ABC algorithm is faster by several orders of magnitude, and leads to smaller overall approximation errors. We report computational loads both in terms of CPU time (e.g. 30 seconds) and in terms of the number of simulations of replicate data-points . The latter should be typically the bottleneck of the computation.

All the computations were performed on a standard desktop PC in Matlab; programs are available from the first author’s web page.

### 5.2 First example: Alpha-stable Models

Alpha-stable distributions are useful in areas (e.g. Finance) concerned with noise terms that may be skewed, may have heavy tails and an infinite variance. A univariate alpha-stable distribution does not admit a close-form expression for its density, buy may be specified through its characteristic function

where determines the tails, , determines skewness, , and and are respectively scale and location parameters; see Nolan (2012, Chap. 1) for a general introduction to stable distributions.

Peters et al. (2010) consider a model of i.i.d. observations , from a univariate alpha-stable distribution, and propose to use the ABC approach to infer the parameters. Likelihood-free inference is appealing in this context, because sampling from an alpha-stable distribution is fast (using e.g. the algorithm of Chambers et al., 1976), while computing its density is cumbersome.

Trying EP-ABC on this example is particularly interesting for the following reasons: (a) Peters et al. (2010) show that choosing a reasonable set of summary statistics for this problem is difficult, and that several natural choices lead to strong biases; and (b) since alpha-stable distributions are very heavy-tailed, the posterior distribution may be heavy-tailed as well, which seems a challenging problem for a method based on a Gaussian approximation such as EP-ABC.

Our data consist of rescaled log-returns, , computed from daily exchange rates of AUD (Australian Dollar) recorded in GBP (British Pound) between 1 January 2005 and 1 December 2010. (These data are publicly available on the Bank of England web-site.) We take where is the cumulative distribution function, and we set the prior to . Note however that our results are expressed in terms of the initial parametrisation , , and ; i.e. for each parameter we report the approximate marginal posterior distribution obtained through the appropriate variable transform of the Gaussian approximation produced by EP-ABC. We run the EP-ABC algorithm (recycling version, as model is IID, see Section 4), with , , , and set to the Euclidean norm in (i.e. the constraints in (3.1) simplify to ). Variations over ten runs are negligible. Average CPU time for one run is 39 minutes, and average number of simulated data-points over the course of the algorithm, is .

We first compare these results with the output of an exact random-walk Hastings-Metropolis algorithm, which relies on the evaluation of an alpha-stable probability density function for each data-point (using numerical integration). Because of this, this algorithm is very expensive. We ran the exact algorithm for about 60 hours (

iterations). One sees in Figure 5.1 that the difference between EP-ABC and the exact algorithm is negligible.

We then compare these results with those obtained by MCMC-ABC, for the set of summary statistics which performs best among those discussed by Peters et al. (2010, see in Section 3.1). We run iterations of this sampler, which leads to about times more simulations from an univariate alpha-stable distribution than in the EP-ABC runs above. Through pilot runs, we decided to set , which seems to be as small as possible, subject to having a reasonable acceptance rate () for this computational budget. In Figure 5.1, we see that the posterior output from this MCMC-ABC exercise is not as good an approximation as the output of EP-ABC. As explained in the previous section, we have set the starting point of the MCMC-ABC chain to the posterior mode. If initialised from some other point, the sampler typically takes a much longer time to reach convergence, because the acceptance rate is significantly lower in regions far from the posterior mode.

Finally, we also use EP-ABC, with the same settings as above, e.g. , in order to approximate the evidence of the model above () and two alternative models, namely a symmetric alpha-stable model, where is set to (, and a Student model (), with 3 parameters (scale , position

, and a Gaussian prior for

). (Standard deviation over repeated runs is below

.) One sees that there no strong evidence of skewness in the data, and that the Student distribution and a symmetric alpha-stable distribution seem to fit equally well the data. We obtained the same value (

) for the evidence of the Student model when using the generalised harmonic mean estimator

(Gelfand and Dey, 1994) based on a very long chain of an exact MCMC algorithm. For both alpha-stable models, this approach proved to be too expensive to allow for a reliable comparison.

### 5.3 Second example: Lotka-Volterra models

The stochastic Lotka-Volterra process describes the evolution of two species (prey) and (predator) through the reaction equations:

 Y1 r1→ 2Y1 Y1+Y2 r2→ 2Y2 Y2 r3→ ∅.

This chemical notation means that, in an interval , the probability that one prey is replaced by two preys is , and so on. Typically, the observed data are made of vectors in , which correspond to the population levels at integer times. We take . This model is Markov, for , and one can efficiently simulate from using Gillespie (1977)’s algorithm. On the other hand, the density is intractable. This makes this model a clear candidate both for ABC, as noted by Toni et al. (2009), and for EP-ABC. Boys et al. (2008) show that MCMC remains feasible for this model, but in certain scenarios the proposed schemes are particularly inefficient, as noted also by Holenstein (2009, Chap. 4).

Following the aforementioned papers, we consider a simulated dataset, corresponding to rates , , , initial population values , and ; see Figure 5.2. Since the observed data are integer-valued, we use the supremum norm in (3.1), and an integer-valued ; this is equivalent to imposing simultaneously the constraints and in the ABC posterior.

First, we run EP-ABC (standard version) with , and for both and . We find that a single pass over the data is sufficient to reach convergence. For (resp. ), CPU time for each run is 2.5 minutes (resp. 25 minutes), and number of simulated transitions is about (resp. ); marginal posteriors obtained through EP-ABC are reported in Figure 5.3.

When applying ABC to this model, Toni et al. (2009) uses as a pseudo-distance between the actual data and the simulated data the sum of of squared errors. In Wilkinson (2008)’s perspective discussed in Section 4, this is equivalent to considering a state-space model where the latent process is the Lotka-Volterra process described above, and the observation process is the same process, but corrupted with Gaussian noise (provided the indicator function in the ABC posterior is replaced by the kernel function ). Thus, instead of a standard ABC algorithm, it seems more efficient to resort to a MCMC sampler specifically designed for state-space models in order to simulate from the ABC approximation of Toni et al. (2009). Following Holenstein (2009, Chap. 4), we consider a Gaussian random walk version of the marginal PMCMC sampler. This algorithm is a particular instance of the state of the art PMCMC framework of Andrieu et al. (2010), which is based on the idea of approximating the marginal likelihood of the data by running a particle filter of size at each iteration of the MCMC sampler. The big advantage of PMCMC in this situation (comparatively to other MCMC approaches for state-space models), is that it does not requires a tractable probability density for the Markov transition of the state-space model.

In Figure 5.3, we report the posterior output obtained from this sampler, run for about iterations and particles (2 days in CPU time, simulated transitions ), with random walk scales set to obtain approximatively a acceptance rate. These plots correspond to , and a state-space model with an uniformly distributed observation noise. In Figure 5.3, one detects practically no difference between PMCMC and EP-ABC with (black lines), although the CPU time of the latter was about 1500 smaller.

The difference between the two EP-ABC approximations (corresponding to and ) is a bit more noticeable. Presumably, the EP-ABC approximation corresponding to is slightly closer to the true posterior. We did not manage however to obtain reliable results from our PMCMC sampler and in a reasonable time.

### 5.4 Third example: Race models of reaction times

Reaction time models seek to describe the decision behaviour of (human or animal) subjects in a choice task (Luce, 1991; Meyer et al., 1988; Ratcliff, 1978). In the typical experiment, subjects view a stimulus, and must choose an appropriate response. For example, the stimulus might be a set of moving points, and the subject must decide whether the points move to the left or to the right.

Assuming that the subject may choose between alternatives, one observes independent pairs, , where is the chosen alternative, and is the measured reaction time. For convenience, we drop for now the index in order to describe the random distribution of the pair .

Reaction time models assume that the brain processes information progressively, and that a decision is reached when a sufficient amount of information has been accumulated. In the model we use here (a variant of typical models found in e.g. Ratcliff and McKoon, 2008; Bogacz et al., 2007) parallel integrators represent the evidence in favour of each of the alternatives. The model is illustrated on Figure 5.4. The first accumulator to reach its boundary wins the race and determines which response the subject will make. Each accumulator undergoes a Wiener process with drift:

 τdej(t)=mjdt+dWjt

where the ’s are the drift parameters, the ’s are independent Wiener processes; and is a fixed time scale, . The measured reaction time is corrupted by a uniformly-distributed noise , representing the “non-decisional time” (Ratcliff and McKoon, 2008), i.e. the time the subject needs to execute the decision (prepare a motor command, press an answer key, …). This model is summarised by the following equations:

 r = rd+rnd,rnd∼U[a,b], rd = minjinft{t:ej(t)=bj}, d = argminjinft{t:ej(t)=bj}.

(We fix and to , , credible values from Ratcliff and McKoon (2008))

The model above captures the essential ideas of reaction time modelling, but it remains too basic for experimental data. We now describe several important extensions. First, a better fit is obtained if the boundaries are allowed to vary randomly from trial to trial (as in Ratcliff, 1978): we assume that , where and is a parameter to be estimated. Second, a mechanism is needed to ensure that the reaction times cannot be too large: we assume that if no boundary has been reached after 1 second, information accumulation stops and the highest accumulator determines the decision. Finally, one needs to account for lapses (Wichmann and Hill, 2001): on certain trials, subjects simply fail to pay attention to the stimuli and respond more or less at random. We account for this phenomenon by having a lapse probability of 5%. In case a lapse occurs, becomes uniformly distributed between 0 and 800 ms and the response is chosen between the alternatives with equal probability. Clearly, although this generalised model remains amenable to simulation, the corresponding likelihood is intractable.

We apply this model to data from an unpublished experiment by M. Maertens (personal communication to Simon Barthelmé). The dataset is made of 1860 observations, obtained from a single human subject, which had to choose between

alternatives: “signal absent” (no light increment was presented), or “signal present” (a light increment was presented), under 15 different experimental conditions: 3 different locations on the screen, and 5 different contrast values. Following common practice in this field, trials with very high or very low reaction times (top and bottom 5%) were excluded from the dataset, because they have a high chance of being outliers (fast guesses, keyboard errors or inattention). The data are shown on Figure

5.5.

From the description above, one sees that five parameters, , are required to describe the random behaviour of a single pair , when To account for the heterogeneity introduced by the varying experimental conditions, we assume that the 2 accumulation rates, , vary across the 15 experimental conditions, while the 3 parameters related to the boundaries, , and , are shared across conditions. The parameter is therefore 33-dimensional.

We note that this model would present a challenge for inference even if the likelihood function was available. It is difficult to assign priors to the parameters, because they do not have a clear physical interpretation, and available prior knowledge (e.g., that reaction times will normally be less than 1 second) does not map easily unto them. Moreover, the model is subject in certain cases to weak identifiability problems. For instance, if one response dominates the dataset, there is little information available beyond the fact that one drift rate is much higher than the other (or one threshold much lower than the other, or both).

We re-parametrised the positive parameters , as , , and assigned a prior to , , and . Taking a prior for these 3 quantities led to similar results. Some experimentation suggested that the drift rates could be constrained to lie between -0.1 and 0.1, because values outside of this interval seem to yield improbable reaction times (too short or too long). We assigned a uniform prior for the 30 drift rates, and applied the appropriate transform, i.e. , in order to obtain a prior for the transformed parameters.

After a few unsuccessful attempts, we believe that this application is out of reach of normal ABC techniques. The main difficulty is the choice of the summary statistics. For instance, if one takes basic summaries (e.g. quartiles) of the distribution of reactions times, under each of the 15 experimental conditions, one ends with a very large vector

of summary statistics. Due to the inherent curse of dimensionality of standard ABC (Blum, 2010), sampling enough datasets, of size 1860, which are sufficiently close (in terms of the pseudo-distance ) would require enormous computational effort. Obviously, taking a much smaller set of summary statistics would on the other hand lead to too poor an approximation.

Some adjustments are needed for ABC-EP to work on this problem. First, notice that within a condition the datapoints are IID so that the posterior distribution factorises over IID “blocks”. We can therefore employ the recycling technique described in Section 4 to save simulation time, by going through the likelihood sites block-by-block. Second, since datapoints take values in , we adopt the following set of ABC constraints: , where denotes as usual the actual datapoints. Third, we apply the following two variance-reduction techniques. One stems from the fact that each site likelihood does not depend on the whole 33 parameters but on a subset of size 5. In that case, using simple linear algebra, one can see that it is possible to update only the marginal distribution of the EP approximation with respect to these 5 parameters; see Appendix B for details. The second is a simple Rao-Blackwellisation scheme that uses the fact that the non-decisional component is uniformly distributed, and may therefore be marginalised out when computing the EP update.

We report below the results obtained from ABC-EP with , , (see end of Section 3.2), and 3 complete passes over the data; CPU time was about 40 minutes. Results for smaller values of , e.g. , were mostly similar, but required a larger CPU time.

Since we could not compare the results to those of a standard ABC algorithm, we assess the results through posterior predictive checking

. For each of the 15 experimental conditions, we generate 5,000 samples from the predictive density, and compare the simulated data with the real, as follows.

The marginal distribution of responses can be summarised by regressing the probability of response on stimulus contrast, separately for each stimulus position (as on Figure 5.5), and using a binomial generalized linear model (with Cauchit link function). Figure 5.6 compares data and simulations, and shows that the predictive distribution successfully captures the marginal distribution of responses.

To characterise the distribution of reaction times, we look at means and inter-quantile intervals, conditional on the response and the experimental conditions. The results are presented on Figure 5.7. The predictive distributions capture the location and scale of the reaction time distributions quite well, at least for those conditions with enough data. Such results seem to indicate that, at the very least, the ABC-EP approximate posterior corresponds to a high-probability region of the true posterior.

## 6 Difficult posteriors

One obvious cause for concern is the behaviour of EP-ABC when confronted with difficult posterior distributions, and in this section we explore a few possible scenarios and suggest methods for detecting and correcting potential problems. Of all potential problematic shapes a posterior distribution could have, the worst is for it to have several modes, and we deal with this question first. Nonmultimodal but otherwise problematic posteriors are discussed later.

### 6.1 Multimodality

We begin with a toy example of a multimodal posterior. Consider the following IID model: , , and . The dataset is obtained by sampling from the model, with . The true posterior may be computed exactly, and is plotted in the left hand side of Figure 6.1. It features two symmetric, well separated modes, around and 2. The Gaussian pdf that minimises where is the posterior (or, in other words, the Gaussian pdf, with mean equal to posterior expectation, variance equal to the posterior variance) is also represented, as a dashed line, but it cannot be distinguished from the EP-ABC approximation (thick line).

The behaviour of EP-ABC in this case is instructive. If we run the standard EP-ABC algorithm (using the IID version, see Section 4 of the paper, and with , ), we obtain a result very close to the the thick line in one pass (one cycle over all the observations). However, if the algorithm is run for two passes or more, there is a very high probability that the algorithm stops its execution before completion, because many site updates will produce negative definite contributions . To perform more than one pass, we have to use slow updates (as described in Section 3 of the paper), and set to a small value. The thick line in Figure 6.1 was obtained with three passes, and slow updates with . The right hand side of 6.1 shows the type of plot one may use to assess convergence of EP-ABC: it represents the evolution of the mean of the EP approximation along the iterations; one “iteration” corresponds to one site update, and since we have performed three passes over sites, there are 150 iterations in total. This plot seems to indicate that the posterior expectation is slowly drifting to the right, and does not stabilise. If we try to run for more passes, we end up again with non-invertible covariance matrices. This behaviour was already described by Minka in his PhD thesis (Minka, 2001b) for the deterministic case.

The first test to see if the posterior distribution is reasonable is simply to run EP-ABC and see if it fails. There are two possible causes for failures: either one used too few samples and Monte Carlo variance led to negative definite updates, or the model itself is problematic for EP, in which case EP will still fail when using large sample sizes.

Beyond this rather informal test there are more principled things one could do. In Paquet et al. (2009) a very generic class of corrections to EP is introduced, and is described in Appendix Appendix C: Paquet-Winter-Opper corrections. Their first-order correction can be obtained relatively cheaply from the hybrid distributions, and can be used to build corrected marginals. In Figure 6.1 we show the first-order correction for our toy multi-modal example: although the corrected marginal is still far from the true posterior, it is clearly bimodal and in this case serves as a warning that the Gaussian approximation is inappropriate (we also applied the same type of correction in our third example, for which no MCMC gold standard is available, but the correction did not modify in any noticeable manner the marginal posterior distributions) In a similar vein, one could use goodness-of-fit tests to see if the hybrid distributions show large deviations from normality.

Once the problem has been diagnosed, what can one do? The answer is, unfortunately, “not much”. If one can determine that multimodality is caused by ambiguity in one or two parameters, it is possible to effectively treat these parameters as hyperparameters. One separates

into where is the set of problematic parameters. Running EP-ABC with will produce an estimate of which can be used to form an approximation to the marginals in a manner analoguous to the procedure used in INLA (Rue et al., 2009). Although this will require running EP-ABC several times for different values of , it might still be a lot cheaper than a MCMC procedure.

If no information is available about what the cause of the multimodality is or where the different modes are, our opinion is that all existing ABC methods will have tremendous difficulties. Work might be better invested in injecting more prior information or changing the parameterisation such as to ensure there is only one mode.

### 6.2 Difficult unimodal posteriors

Compared to the toy example of the previous section, a more realistic scenario is a posterior distribution that is unimodal but otherwise still badly behaved. There are no theoretical results available on convergence in this case, so we evaluated EP’s behaviour in a case of a rather nasty posterior distribution, adapted from the model of reaction times described above.

We devised a problem in which we guessed the posterior would be at the minimum very badly scaled and would probably have an inconvenient shape. In the context of the reaction times model described in Section 5.4, imagine that in a choice experiment the subject picked exclusively the second category; so that we have no observed reaction times for the first category. From the point of the model this occurs whenever the threshold for the first category is high enough, compared to accumulation speed, that the second accumulator always wins. This creates complete uncertainty as to the ratio of accumulation speed vs. threshold value for the first accumulator.

We generated a large dataset (1,000 datapoints) based on fixed values of : and . In this dataset there are no decisions favouring the first category, and the resulting reaction time distribution is plotted on Fig. 6.2. To make the inference problem still manageable using MCMC-ABC, we limit inference to three parameters: , (log-accumulation rate of the two accumulators) and (log-threshold for the first category). The other two parameters are considered known and fixed at their true value. We chose these parameters because we expected the posterior to show quasi-certainty for and very high uncertainty for . To further increase uncertainty, we chose a very vague prior .

We ran EP-ABC on this problem using the recycling strategy described in section 3.2 (data are IID). In this case EP-ABC needs large sample sizes to be stable numerically, which is probably due to the very ill-conditioned covariance matrices that arise (itself due to a very poorly-scaled posterior, more on that below). To eliminate any such problems we deliberately chose to use a very high number of samples, 3 million, with a minimum expected sample size of 500. We used an acceptance window of 10ms. We did 3 passes over the data, for a total computing time of 9 minutes on a standard laptop. To check for stability, we ran EP-ABC 10 times.

To check the accuracy of the results, we ran a MCMC-ABC sampler that used as its summary statistics 8 quantiles of the reaction time distribution, linearly spaced from 0.01 to 0.99, and whether the first category was ever chosen. Quantiles were rescaled by a factor of . Samples were accepted if the second category was chosen less than 100% of the time, or if the Euclidean distance between was over . This latter value was found by iteratively adjusting to get an acceptance rate of 1/1000.

We would like to stress that an appropriate proposal distribution was found using the covariance matrix obtained using ABC-EP (and readjusted later based on short runs). The scaling of the posterior is such that using, e.g., the prior covariance matrix in a proposal distribution is impossible: one ends up using extremely high values of , and therefore with a very poor approximation of the posterior. As a starting value we used the true parameters. After several adjustments, we ran a MCMC-ABC chain for 3 million iterations, which we trimmed down to 600 samples for visualisation on Fig. 6.3 (auto-correlation in the chain being in any case extremely high).

MCMC-ABC samples are shown on Fig. 6.3, along with density contours representing EP-ABC results. Although MCMC-ABC and EP-ABC target different densities, it is reasonable to hope that in this case these densities must be fairly close, since we deliberately set the acceptance ratio to be very small, and the data are simple enough for the summary statistics to be adequate. The posterior distribution obtained using MCMC-ABC, as shown on figure 6.3 is pathological, as expected. Scaling is very poor: one parameter varies over a very small range while the other two vary over and . The posterior also has a difficult shape and appears truncated along certain dimensions. EP-ABC provides approximations that are located squarely within the posterior distribution, but seems to underestimate posterior variance for the third parameter, . Given the difficult context, EP-ABC still performs arguably rather well. At the very least it may be used to find an appropriate proposal distribution for MCMC-ABC, which may perform extremely poorly without such help.

## 7 Extensions

Our description of EP-ABC assumes the prior is Gaussian, and produces a Gaussian approximation. With respect to the latter point, we have already mentioned (see also Appendix A) that EP, and therefore EP-ABC, may propagate more generally approximations from within a given exponential family; say the parametric family of Wishart distributions if the parameter is a covariance matrix. Regarding the prior, even when considering Gaussian approximation, it is possible to accommodate a non Gaussian prior, by treating the prior as an additional site.

Thus, the main constraint regarding the application of EP-ABC is that the likelihood is factorizable, i.e. in such a way that simulating from each factor is feasible.

In this paper, we focused our attention on important applications of likelihood-free inference where the likelihood is trivial to factorise; either because the datapoints are independent, or Markov. But ABC-EP is not limited to these two situations. First, quite a few time series models may be simulated sequentially, even if they are not Markov. For instance, one may apply straightforwardly EP-ABC to a GARCH-stable model (e.g. Liu and Brorsen, 1995; Mittnik et al., 2000; Menn and Rachev, 2005), which is a GARCH model (Bollerslev, 1986) with an alpha stable-distributed innovation. Second, one may obtain a simple factorisation of the likelihood by incorporating latent variables into the vector of unknown parameters. Third, one may replace the true likelihood by a certain approximation which is easy to factorise. The following section discusses and illustrates such an approach, based on the concept of composite likelihood.

### 7.1 Composite likelihood

Composite likelihood is an umbrella term for a family of techniques based on the idea of replacing an intractable likelihood by a factorisable pseudo-likelihood; see the excellent review of Varin et al. (2011). For the sake of simplicity, we focus on marginal composite likelihood, but we note that other versions of composite likelihood (where for instance the factors are conditional distributions) could also be treated similarly.

In our EP-ABC context, replacing the likelihood by a marginal composite likelihood leads to the following type of ABC posterior

 pCLε(θ|y)∝p(θ)ns∏s=1∫p(ys|θ)ηs1{∥ys−y⋆s∥≤ε}dys

where is the marginal distribution of some subset of the observations, and is a non-negative weight. There exists some theory on how to choose the weights so that the composite likelihood is close enough to the true likelihood in some sense, see again Varin et al. (2011), but for simplicity we shall take . Clearly, EP-ABC may be applied straightforwardly to this new ABC posterior, provided one may sample independently from the marginal distributions . In fact, the factors may be treated as IID factors, which makes it possible to use the recycling strategy described in Section 4.

To make this idea more concrete, we consider the class of hidden Markov models, where the datapoints are conditionally independent,