Accelerating delayed-acceptance Markov chain Monte Carlo algorithms

Delayed-acceptance Markov chain Monte Carlo (DA-MCMC) samples from a probability distribution via a two-stage version of the Metropolis-Hastings algorithm, by combining the target distribution with a "surrogate" (i.e. an approximate and computationally cheaper version) of said distribution. DA-MCMC accelerates MCMC sampling in complex applications, while still targeting the exact distribution. We design a computationally faster DA-MCMC algorithm, which samples from an approximation of the target distribution. As a case study, we also introduce a novel stochastic differential equation model for protein folding data. We consider parameter inference in a Bayesian setting where a surrogate likelihood function is introduced in the delayed-acceptance scheme. In our applications we employ a Gaussian process as a surrogate likelihood, but other options are possible. In our accelerated algorithm the calculations in the "second stage" of the delayed-acceptance scheme are reordered in such a way that we can obtain a significant speed-up in the MCMC sampling when the evaluation of the likelihood function is computationally intensive. We consider both simulation studies and the analysis of real protein folding data. Simulation studies for the stochastic Ricker model and the novel stochastic differential equation model for protein folding data show that the speed-up is highly problem dependent: the more involved the computations of the likelihood function, the higher the acceleration obtained with our algorithm. Inference results for the standard delayed-acceptance algorithm and our approximated version are similar, indicating that our approximated algorithm returns reliable Bayesian inference.


1 Introduction

We introduce a new strategy to accelerate Markov chain Monte Carlo (MCMC) sampling when the evaluation of the target distribution is computationally expensive. We build on the “delayed-acceptance” (DA) strategy developed in Christen and Fox (2005), where a fast, two-stage DA-MCMC algorithm is proposed that still targets the desired distribution exactly. We produce an approximate and accelerated delayed-acceptance MCMC algorithm (ADA-MCMC), where in exchange for exactness we obtain results even more rapidly than with standard DA-MCMC. In a computationally intensive case study, the run-time for ADA-MCMC is 2–4 times faster than for standard DA-MCMC.

The methodology we consider is general, as our novel method pertains to sampling from arbitrary distributions. However, in the interest of our applications, we will focus on Bayesian inference, and then suggest how to implement our ideas for general problems. In Bayesian inference we aim at sampling from the posterior distribution π(θ|y) ∝ L(θ)p(θ), where θ are model parameters, y denotes data, L(θ) = p(y|θ) is the likelihood function, and p(θ) is the prior distribution of θ. We assume that the point-wise evaluation of the likelihood (or an approximation thereof) is computationally intensive, because the underlying probabilistic model is complex and/or the data is large. For those situations, DA-MCMC algorithms become particularly useful. In the approach originally outlined in Christen and Fox (2005) a DA strategy decomposes an MCMC move into two stages. At the first stage a proposal can either be rejected, according to a “surrogate of the posterior” (one that is computationally cheap to evaluate and chosen to approximate the desired posterior), or be sent to the second stage. If the proposal is not rejected at the first stage, at the second stage an acceptance probability is used that corrects for the discrepancy between the approximate surrogate and the desired posterior, and at this stage the proposal can finally be accepted or rejected. The advantage of using DA-MCMC is that the computationally expensive posterior only appears in the second stage, whereas the surrogate posterior in the first stage is cheap to evaluate. Therefore, in the first stage the surrogate posterior rapidly screens proposals and, if the surrogate model is reliable, rejects those that are unlikely to be accepted at the second stage. When considering a Bayesian approach, we build a surrogate of the computationally expensive likelihood function, while we assume the cost of evaluating the prior to be negligible. Therefore the expensive likelihood appears only in the second stage.
Some implementations of the DA approach in Bayesian inference can be found e.g. in Golightly et al. (2015), Sherlock et al. (2017), and Banterle et al. (2015), and similar approaches based on approximate Bayesian computation (ABC) can be found in Picchini (2014), Picchini and Forman (2016), and Everitt and Rowińska (2017).

In this work, the sequence of computations pertaining to the second stage of DA-MCMC is arranged so as to find further opportunities to avoid the evaluation of the expensive likelihood. This leads to our accelerated and approximated ADA-MCMC. The computational benefit of using ADA-MCMC is that, unlike DA-MCMC, once a parameter proposal reaches the second stage the expensive likelihood is not necessarily evaluated, but this comes at the price of introducing an approximation in the sampling procedure. We test and compare delayed-acceptance algorithms, particle marginal methods for exact Bayesian inference, and Markov-chain-within-Metropolis on two case studies: the stochastic Ricker model, and a novel state-space model for protein folding data, with dynamics expressed via a stochastic differential equation (SDE). Therefore, in this work we contribute: (i) a novel, approximate and accelerated delayed-acceptance MCMC algorithm, and (ii) a novel double-well potential state-space model for protein folding data. For practical applications, we use Gaussian processes to specify surrogates of the likelihood function, though this is not an essential component of our approach and other surrogates of the likelihood can be considered. We found that the acceleration produced by ADA-MCMC, compared to DA-MCMC, is dependent on the specific application. If the exact or approximate likelihood function used in the second stage of the algorithm is not computationally intensive to evaluate, then our method produces negligible benefits. Therefore, the use of our ADA-MCMC, just as the standard DA-MCMC, is beneficial when each evaluation of the likelihood has a non-negligible impact on the total computational budget. The time savings due to ADA-MCMC are then proportional to the number of MCMC iterations where the evaluation of the likelihood at the second stage is avoided.
In terms of inference quality, we find that ADA-MCMC returns results that are very close to DA-MCMC, so our approximations do not seem to harm the accuracy of the resulting inference.

The outline of this paper is as follows: The delayed-acceptance (DA) scheme and our novel accelerated DA algorithm are introduced in a general framework in Section 2. The Gaussian process (GP) surrogate model is introduced in Section 3. The DA-GP-MCMC algorithm and the accelerated version ADA-GP-MCMC are introduced in Section 4. A simulation study for the stochastic Ricker model is in Section 5.1. The protein folding data and the novel double-well potential stochastic differential equation model are introduced in Section 5.2. A discussion in Section 6 closes our work. Further supplementary material is available, outlining: particle Markov chain Monte Carlo methods for state-space models, implementation guidelines for the algorithms, a further simulation study, and diagnostic analyses. The code used to generate results can be found at https://github.com/SamuelWiqvist/adamcmcpaper and in the supplementary material.

2 Delayed-acceptance MCMC

We first introduce the delayed-acceptance (DA-MCMC) scheme due to Christen and Fox (2005) in full generality, and then specialize it for Bayesian inference. Our accelerated delayed-acceptance (ADA-MCMC) algorithm is introduced in Section 2.1. We are interested in sampling from some distribution p(x) using Metropolis-Hastings (Hastings, 1970). Metropolis-Hastings proceeds by evaluating random moves, produced by a Markov kernel, from the current value x to a new proposal x⋆. The sequence of accepted moves forms a Markov chain having p(x) as stationary distribution. Now, assume that the point-wise evaluation of p(x) is computationally expensive. The main idea behind a DA-MCMC approach is to delay (or avoid as much as possible) the evaluation of the computationally expensive p(x), by first trying to early-reject the proposal using some surrogate (cheap to evaluate) deterministic or stochastic model ~p(x). To enable early-rejections while still targeting the distribution p(x), a two-stage acceptance scheme is introduced in Christen and Fox (2005). Say that we are at the rth iteration of the Metropolis-Hastings algorithm, and denote with x_{r−1} the state of the chain produced at the previous iteration. At the “first stage” of DA-MCMC we evaluate the acceptance probability α1 (though at this stage we do not really accept any proposal, as explained below)

 α1 = min(1, ~p(x⋆)/~p(x_{r−1}) ⋅ g(x_{r−1}|x⋆)/g(x⋆|x_{r−1})), (1)

where g(x⋆|x_{r−1}) is the transition kernel used to generate proposals, i.e. at the rth iteration x⋆ ∼ g(·|x_{r−1}). If the proposal “survives” the first stage (i.e. if it is not rejected) it is then promoted to the second stage, where it is accepted with probability α2,

 α2 = min(1, p(x⋆)/p(x_{r−1}) ⋅ ~p(x_{r−1})/~p(x⋆)). (2)

Therefore x⋆ can only be accepted at the second stage, while it can be rejected both at the first and the second stage. A computational speed-up is obtained when x⋆ is early-rejected at the first stage, since the expensive p is not evaluated there. Hence, to obtain a significant speed-up it is important to early-reject “bad” proposals that would likely be rejected at the second stage. The probability α2 corrects for the approximation introduced in the first stage, and the resulting Markov chain has the correct stationary distribution p(x). This result holds if the chain is p-irreducible and reversible, and if p(x) > 0 implies ~p(x) > 0. From (2) it is evident how the surrogate model acts as a proposal distribution. See Franks and Vihola (2017) for a comparison, in terms of asymptotic variances of Monte Carlo estimators, of importance sampling, pseudo-marginal and delayed-acceptance methods.
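The two-stage scheme in (1)–(2) can be condensed into a few lines of code. Below is a minimal sketch on a toy Gaussian target; the function names, the symmetric random-walk kernel (which makes the g-ratio in (1) cancel), and the example target/surrogate pair are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

def da_mcmc_step(x, log_p, log_p_tilde, rng, step=0.5):
    """One DA-MCMC iteration. log_p is the 'expensive' log-target,
    log_p_tilde the cheap surrogate; the proposal kernel is symmetric."""
    x_star = x + step * rng.standard_normal()
    # First stage, eq. (1): screen the proposal with the surrogate only.
    log_a1 = min(0.0, log_p_tilde(x_star) - log_p_tilde(x))
    if np.log(rng.uniform()) >= log_a1:
        return x                      # early rejection: p never evaluated
    # Second stage, eq. (2): correct for the surrogate/target discrepancy.
    log_a2 = min(0.0, (log_p(x_star) - log_p(x))
                 - (log_p_tilde(x_star) - log_p_tilde(x)))
    return x_star if np.log(rng.uniform()) < log_a2 else x

rng = np.random.default_rng(1)
log_p = lambda x: -0.5 * x**2                  # "expensive" target: N(0, 1)
log_p_tilde = lambda x: -0.5 * (x / 1.2)**2    # cheap, slightly wide surrogate
chain = np.empty(5000)
chain[0] = 0.0
for r in range(1, 5000):
    chain[r] = da_mcmc_step(chain[r - 1], log_p, log_p_tilde, rng)
```

Only proposals surviving the cheap first-stage screen trigger a call to `log_p`, which is where the speed-up comes from when `log_p` is costly.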

In a Bayesian framework we are interested in sampling from the posterior π(θ|y). Furthermore, for the cases of interest to us, the log-likelihood function (or an approximation thereof) ℓ(θ) = log L(θ) is computationally expensive, while the prior distribution p(θ) is assumed cheap to evaluate. By introducing a deterministic or stochastic surrogate likelihood ~L(θ), DA has first stage acceptance probability α1, where

 α1 = min(1, ~L(θ⋆)/~L(θ_{r−1}) ⋅ p(θ⋆)/p(θ_{r−1}) ⋅ g(θ_{r−1}|θ⋆)/g(θ⋆|θ_{r−1})),

with transition kernel g(θ⋆|θ_{r−1}). Similarly, the second stage acceptance probability is

 α2 = min(1, L(θ⋆)/L(θ_{r−1}) ⋅ ~L(θ_{r−1})/~L(θ⋆)).

An extension of the DA-MCMC scheme, due to Sherlock et al. (2017), is to generate a proposal from a different transition kernel ~g, and with a small but positive probability β allow the evaluation of the proposal in an ordinary Metropolis-Hastings algorithm, with acceptance probability denoted αMH,

 αMH = min(1, L(θ⋆)/L(θ_{r−1}) ⋅ p(θ⋆)/p(θ_{r−1}) ⋅ ~g(θ_{r−1}|θ⋆)/~g(θ⋆|θ_{r−1})). (3)

In this case the proposal can be immediately accepted or rejected, as in a regular MCMC. The transition kernel ~g should have a somewhat larger variance than g. With probability 1 − β a proposal is instead evaluated using the two-stage DA-MCMC algorithm. When considering this “extended version” of DA-MCMC (where β is introduced) it is preferable to use a small β, in order not to lose too much of the acceleration implied by a DA approach. Our experience also indicates that this extension can be critical to better explore the tails of the posterior distribution, compared to a standard DA-MCMC that uses β = 0. This “mixture” of the two Metropolis-Hastings kernels (i.e. the acceptance kernel for the DA scheme, and the acceptance kernel in (3)) produces a valid MCMC algorithm, since both kernels target the correct posterior (Rosenthal and Roberts, 2007).
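The mixture just described amounts to a single Bernoulli(β) draw per iteration. The sketch below illustrates it on a toy target; the function names and the use of a plain Metropolis-Hastings update as a stand-in for the DA update are our own assumptions for the demonstration.

```python
import numpy as np

def mh_step(x, log_p, rng, step=1.0):
    """Ordinary Metropolis-Hastings with the wider symmetric kernel g~,
    so the g~-ratio in (3) cancels."""
    x_star = x + step * rng.standard_normal()
    log_a = min(0.0, log_p(x_star) - log_p(x))
    return x_star if np.log(rng.uniform()) < log_a else x

def extended_da_step(x, da_step, log_p, rng, beta=0.15):
    """With small probability beta evaluate the proposal via plain MH (3);
    otherwise run the two-stage DA update supplied as `da_step`."""
    if rng.uniform() < beta:
        return mh_step(x, log_p, rng)
    return da_step(x)

rng = np.random.default_rng(0)
log_p = lambda x: -0.5 * x**2
# For illustration only, a narrow-step MH update stands in for the DA update.
da = lambda x: mh_step(x, log_p, rng, step=0.5)
x, draws = 0.0, np.empty(4000)
for r in range(4000):
    x = extended_da_step(x, da, log_p, rng)
    draws[r] = x
```

Keeping β small preserves most of the DA acceleration while the occasional wide MH move helps the chain reach the posterior tails.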

2.1 Accelerated delayed-acceptance MCMC

There have been a number of attempts at accelerating the original DA-MCMC of Christen and Fox (2005). For example, in a Bayesian framework, Banterle et al. (2015) propose to break the posterior into a product of chunks. The Metropolis-Hastings acceptance ratio then becomes a product of acceptance ratios, each of which can be sequentially evaluated against an independent uniform variate. The acceleration comes from the possibility to “early-reject” a proposal as soon as one of those acceptance ratios leads to a rejection (in the same spirit as Solonen et al., 2012). However, an acceptance instead requires scanning all components, i.e. the full posterior. Quiroz et al. (2017) never use the full data set in the second stage of DA and instead construct an approximated likelihood from subsamples of the data, which is particularly relevant for Big Data problems (see references therein and Angelino et al., 2016). Remarkably, Quiroz et al. (2017) prove that even when the full likelihood is approximated using data subsamples, the resulting chain has the correct stationary distribution. However, they assume the data to be conditionally independent, a strong condition which does not apply to the case studies considered in the present work.

We now introduce the novel, accelerated DA-MCMC algorithm, in short ADA-MCMC. The main idea behind ADA-MCMC is that, under some assumptions on how the likelihood function and the surrogate model relate, it is possible to arrange the computations in the second stage so as to obtain an acceleration. This follows from the possibility to avoid the evaluation of the expensive likelihood in the second stage, in some specific circumstances. However, this also implies that ADA-MCMC is an approximated procedure, since a proposal can sometimes be accepted according to the surrogate model alone. We introduce ADA-MCMC in a Bayesian setting, where the surrogate model pertains to the likelihood function. However, the idea can straightforwardly be adapted to the case where a surrogate model of a generic distribution is used, as in equations (1)-(2); this more general setting is briefly described later in this section. As previously mentioned, at the rth iteration the DA algorithm is governed by the values of the likelihood function L(θ⋆) and L(θ_{r−1}), and the values of the surrogate model ~L(θ⋆) and ~L(θ_{r−1}). These four values can be arranged into four mutually exclusive scenarios:

 case 1) ~L(θ⋆) > ~L(θ_{r−1}) and L(θ⋆) > L(θ_{r−1}),
 case 2) ~L(θ⋆) < ~L(θ_{r−1}) and L(θ⋆) < L(θ_{r−1}),
 case 3) ~L(θ⋆) > ~L(θ_{r−1}) and L(θ⋆) < L(θ_{r−1}),
 case 4) ~L(θ⋆) < ~L(θ_{r−1}) and L(θ⋆) > L(θ_{r−1}).

We study each case separately to investigate opportunities for accelerating the computations in the second stage of DA-MCMC, under the assumption that the stated relations between the evaluations of L and ~L hold. Afterwards, we suggest ways to determine, approximately, which of the four cases we should assume to hold for any new proposal θ⋆, without evaluating the expensive likelihood L(θ⋆).

Case 1)

Under the assumption that ~L(θ⋆) > ~L(θ_{r−1}) and L(θ⋆) > L(θ_{r−1}), it is clear that L(θ⋆)/L(θ_{r−1}) > 1 and ~L(θ_{r−1})/~L(θ⋆) < 1. It also holds that

 ~L(θ_{r−1})/~L(θ⋆) < L(θ⋆)/L(θ_{r−1}) ⋅ ~L(θ_{r−1})/~L(θ⋆). (4)

Hence, the acceptance region for the second stage can be split in two parts, where one part is “governed” by ~L only. To clarify, at the second stage of the standard DA-MCMC, acceptance of a proposed θ⋆ takes place if u < α2, where u is uniformly distributed in [0,1]; hence the acceptance region is {u : u < α2}. However, because of (4) we are allowed to further decompose the acceptance region into {u : u < ~L(θ_{r−1})/~L(θ⋆)} and {u : ~L(θ_{r−1})/~L(θ⋆) ≤ u < α2}.

Hence, if a proposal has survived the first stage and we assume that we are in case 1, we can first check whether we can “early-accept” the proposal (i.e. without evaluating the expensive likelihood), that is, check if

 u < ~L(θ_{r−1})/~L(θ⋆), (5)

and if this is the case θ⋆ is (early-)accepted and stored, and we can move to the next iteration of ADA-MCMC. If θ⋆ is not early-accepted, we look into the remaining part of the [0,1] segment to determine whether the proposal can be accepted or rejected. Hence, when early-acceptance is denied, the expensive likelihood is evaluated and the proposal is accepted and stored if

 u < L(θ⋆)/L(θ_{r−1}) ⋅ ~L(θ_{r−1})/~L(θ⋆), (6)

and rejected otherwise, and we can move to the next iteration of ADA-MCMC. Since the acceptance region for the second stage is split in two parts (early-acceptance and acceptance), the same random number u is used in (5) and (6). By splitting the region it is possible to early-accept proposals without evaluating L(θ⋆), thereby obtaining a speed-up.

Case 2)

If this case holds, then L(θ⋆)/L(θ_{r−1}) < 1 while ~L(θ_{r−1})/~L(θ⋆) > 1, so the product of the two ratios entering α2 can be either smaller or larger than one. Hence, it is not possible to obtain any early-accept or early-reject opportunity in this case, and the expensive likelihood must be evaluated.

Case 3)

If this case holds, then L(θ⋆)/L(θ_{r−1}) < 1 and ~L(θ_{r−1})/~L(θ⋆) < 1. Hence, it also holds that

 ~L(θ_{r−1})/~L(θ⋆) > L(θ⋆)/L(θ_{r−1}) ⋅ ~L(θ_{r−1})/~L(θ⋆).

The rejection region is {u : u ≥ α2}, and this can be split in two parts, where one part is governed by ~L only: {u : u ≥ ~L(θ_{r−1})/~L(θ⋆)} and {u : α2 ≤ u < ~L(θ_{r−1})/~L(θ⋆)}.

By simulating a u ∼ U(0,1), we can first check if the proposal can be early-rejected. This happens if u ≥ ~L(θ_{r−1})/~L(θ⋆). If the proposal is not early-rejected, it is accepted if

 u < L(θ⋆)/L(θ_{r−1}) ⋅ ~L(θ_{r−1})/~L(θ⋆),

and rejected otherwise. Hence, in case 3 there is a chance to early-reject θ⋆ without evaluating L(θ⋆).

Case 4)

Under the assumption that ~L(θ⋆) < ~L(θ_{r−1}) and L(θ⋆) > L(θ_{r−1}), we have that L(θ⋆)/L(θ_{r−1}) > 1 and ~L(θ_{r−1})/~L(θ⋆) > 1, and we can immediately accept the proposal without evaluating L(θ⋆), since α2 = 1.
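Once one of the four cases has been assumed for a proposal that survived the first stage, the second-stage logic above can be condensed as follows. This is a sketch in our own notation: `eval_L_ratio` is a hypothetical callable standing for the expensive evaluation of L(θ⋆)/L(θ_{r−1}), and the function reports whether the proposal was accepted and whether L had to be evaluated.

```python
def ada_second_stage(case, u, Lt_prev, Lt_star, eval_L_ratio):
    """Second stage of ADA-MCMC given an assumed case (1-4).

    u            : single uniform draw shared by the split regions (5)-(6)
    Lt_prev      : surrogate value ~L(theta_{r-1})
    Lt_star      : surrogate value ~L(theta*)
    eval_L_ratio : callable returning L(theta*)/L(theta_{r-1}) (expensive)
    Returns (accepted, L_was_evaluated).
    """
    s = Lt_prev / Lt_star                 # surrogate ratio ~L(r-1)/~L(*)
    if case == 1:                         # early-acceptance possible
        if u < s:                         # region (5): L never evaluated
            return True, False
        return u < eval_L_ratio() * s, True   # region (6)
    if case == 2:                         # no early decision: evaluate L
        return u < eval_L_ratio() * s, True
    if case == 3:                         # early-rejection possible
        if u >= s:                        # L never evaluated
            return False, False
        return u < eval_L_ratio() * s, True
    return True, False                    # case 4: alpha_2 = 1, accept
```

In cases 1, 3 and 4 the expensive likelihood evaluation can sometimes (or, in case 4, always) be skipped, which is exactly where the ADA speed-up originates.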

Clearly, assuming a specific case to be the “right one” for proposal θ⋆ is a decision subject to probabilistic error. This is why ADA-MCMC is an approximate version of DA-MCMC. Of course, the crucial problem is to determine which of the four cases to assume to hold for the proposed θ⋆. One method is to consider a pre-run of some MCMC algorithm, to estimate the probability p_i of each of the four cases, where p_i is the true but unknown probability that case i holds, i = 1,...,4. This is a possibly computationally heavy procedure; however, for the specific algorithms we study in Section 4 such a pre-run is necessary anyway to construct the surrogate model for the log-likelihood, hence the estimation of the p_i comes as a simple by-product of the inference procedure. Then, once the estimates ^p_i are obtained, for a new θ⋆ one first checks whether ~L(θ⋆) > ~L(θ_{r−1}) or ~L(θ⋆) < ~L(θ_{r−1}). If ~L(θ⋆) < ~L(θ_{r−1}) then we can either be in case 2 or in case 4: we toss a biased coin and select case 2 with probability ^p₂/(^p₂ + ^p₄) (and case 4 otherwise). Correspondingly, if ~L(θ⋆) > ~L(θ_{r−1}) then we can be either in case 1 or in case 3: we toss a biased coin and select case 1 with probability ^p₁/(^p₁ + ^p₃) (and case 3 otherwise). Another approach is to model the case probabilities as functions of θ⋆, i.e. to estimate Pr(case i | θ⋆) for i = 1,...,4. For this task we can, for instance, use logistic regression or some other classification algorithm. The problem of the selection of cases 1–4 is discussed in detail in Section 4.1.

We stated earlier that ADA can also be used in a non-Bayesian setting, where we target a generic distribution p(x). In that case we need to introduce a corresponding surrogate model ~p(x). The rth iteration of ADA will then be governed by the four values p(x⋆), p(x_{r−1}), ~p(x⋆) and ~p(x_{r−1}), where x⋆ is a proposed value x⋆ ∼ g(·|x_{r−1}). These can be arranged into four cases, similarly to what was previously described: case 1) ~p(x⋆) > ~p(x_{r−1}) and p(x⋆) > p(x_{r−1}); case 2) ~p(x⋆) < ~p(x_{r−1}) and p(x⋆) < p(x_{r−1}); case 3) ~p(x⋆) > ~p(x_{r−1}) and p(x⋆) < p(x_{r−1}); case 4) ~p(x⋆) < ~p(x_{r−1}) and p(x⋆) > p(x_{r−1}). Therefore, by adapting the methodology, possibilities for early-rejection and early-acceptance of a proposal can straightforwardly be obtained regardless of whether we pursue a Bayesian analysis or not.

3 Modeling the log-likelihood function using Gaussian processes

We have outlined our methodology without reference to a specific choice for the surrogate likelihood. A possibility is to use Gaussian process regression to obtain a surrogate log-likelihood. Gaussian processes (GPs) are a class of statistical models that can be used to describe the uncertainty about an unknown function; in our case, the unknown function is the log-likelihood ℓ(θ) = log L(θ). A GP has the property that the joint distribution of the values of the unknown function, at any finite collection of points, is multivariate normal. As such, a Gaussian process is fully specified by a mean function m(θ) and a covariance function k(θ, θ′) (Rasmussen and Williams, 2006). We introduce a GP regression model, similar to the one used in Drovandi et al. (2018), as a computationally cheap proxy to the unknown log-likelihood ℓ(θ). Our GP model uses covariates that are powers and interactions of the parameters of interest (see the supplementary material). The GP model assumes

 ℓ(θ)∼GP(mβ(θ),kϕ(θ,θ′)),

where β and ϕ are the auxiliary parameters for the mean and covariance function, respectively. Since η = (β, ϕ) is in general unknown, it must be estimated by fitting the GP model to some “training data”. In our case, training data are obtained by running a number of preliminary MCMC iterations and collecting all generated parameter proposals and the corresponding log-likelihood values. The GP regression treats the log-likelihood values as “responses”, and the proposed parameters are used to construct the covariates. Once the estimate ^η is available, then for any new θ⋆ we obtain a proxy to the unknown log-likelihood that is computationally much faster to evaluate than ℓ(θ⋆). The training data we fit the GP model to are denoted D, and how these data are collected is explained in Section 4. Using the same assumptions for the Gaussian process model as in Drovandi et al. (2018), the predictive distribution for the GP model is available in closed form. Therefore, for given D and ^η we can easily produce a draw from said distribution, which is Gaussian and given by

 ℓ(θ⋆)|D,^η∼N(¯ℓ(θ⋆),Var(ℓ(θ⋆))). (7)

See the supplementary material for the definitions of ¯ℓ(θ⋆) and Var(ℓ(θ⋆)). It is computationally very fast to produce draws from (7) at any new θ⋆, which is why we use GP prediction as a surrogate of the log-likelihood within DA algorithms. The derivation of (7) and more details pertaining to the GP model are found in the supplementary material.
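To illustrate how cheap draws from a closed-form GP predictive such as (7) are, here is a minimal NumPy sketch. It uses a squared-exponential covariance and a zero prior mean for simplicity (the paper's model instead uses a polynomial mean function m_β and its own covariance k_ϕ); the toy "log-likelihood" sin(θ) and all hyperparameter values are our own assumptions.

```python
import numpy as np

def gp_predict(theta_tr, ell_tr, theta_new, lengthscale=1.0, sf=1.0, sn=0.1):
    """Closed-form GP posterior predictive mean/variance at theta_new,
    given training inputs theta_tr (n x d) and responses ell_tr (n,)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return sf**2 * np.exp(-0.5 * d2 / lengthscale**2)
    K = k(theta_tr, theta_tr) + sn**2 * np.eye(len(theta_tr))
    Ks = k(theta_new, theta_tr)                   # cross-covariances
    mean = Ks @ np.linalg.solve(K, ell_tr)
    var = sf**2 - np.diag(Ks @ np.linalg.solve(K, Ks.T))
    return mean, np.maximum(var, 0.0)

rng = np.random.default_rng(0)
theta_tr = np.linspace(-2.0, 2.0, 25)[:, None]
ell_tr = np.sin(theta_tr).ravel()                 # toy "log-likelihood"
m, v = gp_predict(theta_tr, ell_tr, np.array([[0.5]]))
draw = m + np.sqrt(v) * rng.standard_normal(1)    # one draw, as in (7)
```

The expensive work (one linear solve against the training data) can be factorized once; each subsequent prediction then costs only a couple of matrix-vector products, which is why GP surrogates are so cheap inside the DA loop.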

4 Delayed-acceptance Gaussian process Markov chain Monte Carlo

We now make use of the fitted GP model discussed in Section 3 as a surrogate of the log-likelihood function within DA-MCMC and ADA-MCMC. By sampling a GP log-likelihood from (7) for some θ, we denote with ^LGP(θ) the GP prediction of the corresponding likelihood function. In addition to being computationally intensive to evaluate, the true likelihood L(θ) might also be unavailable in closed form. However, it is often possible to obtain Monte Carlo approximations returning non-negative unbiased estimates of L(θ). We denote with ^Lu(θ) such an unbiased estimate. For our case studies, ^Lu(θ) is obtained via sequential Monte Carlo (SMC, also known as particle filtering; see Kantas et al., 2015 and Schön et al., 2018 for reviews). A simple example of an SMC algorithm (the bootstrap filter) and its use within particle-marginal methods (Andrieu and Roberts, 2009) for inference in state-space models are presented in the supplementary material, where we also give a brief technical presentation of two types of pseudo-marginal methods, particle MCMC (PMCMC) and Markov-chain-within-Metropolis (MCWM).

Notice that MCMC algorithms based on GP surrogates have already been considered, e.g. in Meeds and Welling (2014) and Drovandi et al. (2018). Meeds and Welling (2014) assume that the latent process has a Gaussian distribution with unknown moments, and these moments are estimated via simulations using “synthetic likelihoods”. There, the discrepancy between the simulated (Gaussian) latent states and observed data is evaluated using a Gaussian ABC kernel, where ABC stands for “approximate Bayesian computation”; see Marin et al. (2012) for a review. This computationally expensive setting is fitted to “training data”, then used in place of the (unknown) likelihood within a pseudo-marginal MCMC algorithm. The work in Drovandi et al. (2018) builds on the ideas found in Meeds and Welling (2014), with the difference that the former uses neither synthetic likelihoods nor ABC to produce training data. Instead, they use the MCWM algorithm to collect many log-likelihood evaluations at all proposed parameter values, then fit a GP regression model to these training data. Finally, they use the fitted GP regression in a pseudo-marginal algorithm, without ever resorting to expensive likelihood calculations. As opposed to Drovandi et al. (2018), we make use both of a surrogate of the likelihood and (with low frequency) of the expensive likelihood approximated via a particle filter. We call DA-GP-MCMC a delayed-acceptance MCMC algorithm using predictions from GP regression as a surrogate of the likelihood function. Similarly, we later introduce our accelerated version ADA-GP-MCMC.

The DA-GP-MCMC procedure is detailed in Algorithm 1. Using the notation in Section 2, we now have that the first stage acceptance probability for DA-GP-MCMC is

 α1 = min(1, ^LGP(θ⋆)/^LGP(θ_{r−1}) ⋅ p(θ⋆)/p(θ_{r−1}) ⋅ g(θ_{r−1}|θ⋆)/g(θ⋆|θ_{r−1})).

The second stage acceptance probability is

 α2 = min(1, ^Lu(θ⋆)/^Lu(θ_{r−1}) ⋅ ^LGP(θ_{r−1})/^LGP(θ⋆)).

As mentioned in Section 2, for our applications we found it beneficial to use the extended DA-MCMC introduced in Sherlock et al. (2017). However, this is in general not a requirement for using DA-MCMC. The DA-GP-MCMC algorithm is preceded by the following two steps, required to collect training data and fit the GP regression to these data:

1. Collect training data using MCWM:

An MCWM algorithm is run to approximately target π(θ|y), where a bootstrap particle filter with N particles is employed to obtain ^Lu(θ), until the chain has reached apparent stationarity. When using MCWM we do not target the exact posterior for a finite number of particles N; however, this is not a concern to us. In fact, we use MCWM as in Drovandi et al. (2018), namely to “harvest” a large number of (approximate) log-likelihood function evaluations, in order to learn the dependence between log-likelihoods and the corresponding parameters. Indeed, in this phase we store as training data all the proposed parameters (regardless of whether these are accepted or rejected by MCWM) and their corresponding log-likelihoods. Hence, all parameter proposals and corresponding log-likelihoods from MCWM (excluding a sufficiently long burn-in period) are stored as training data D = {θ(i), ℓ(i)}, where the superscript i ranges from 1 to the number of iterations post burn-in. We also collect the generated Markov chain and the corresponding log-likelihood estimates in a second set, which we denote D_acc. Basically, the difference between D and D_acc is that parameters in the latter are the standard output of a Metropolis-Hastings procedure, i.e. D_acc may contain “repeated parameters” (when rejections occur), while D contains all simulated proposals. We motivate the use of the set D_acc in Section 4.1.

2. Fit the GP model:

The Gaussian process model is fitted to the training data using the method described in Section 3.

4.1 Accelerated delayed-acceptance Gaussian process MCMC

Our accelerated delayed-acceptance Gaussian process MCMC algorithm (ADA-GP-MCMC) is described in Algorithm 2. As with DA-GP-MCMC, ADA-GP-MCMC is preceded by the same two phases (collection of training data and GP regression). After fitting the GP model, the training data are also used to produce a “selection method” for the four cases introduced in Section 2.1. As already mentioned in Section 2.1, we can either select which case to use independently of the current proposal θ⋆, or make the selection a function of θ⋆. We introduce three selection methods: the first selects which case to assume independently of θ⋆, while the other two depend on the proposal.

Biased coin:

In the most naive approach, selecting between cases 1 and 3, or between cases 2 and 4, can be viewed as the result of tossing a biased coin. Hence, we simply compute the relative frequency of occurrence of cases 1, 2, 3 and 4 (see Section 2.1) as observed in the training data. These frequencies are obtained as follows: using the fitted GP model we predict log-likelihoods via (7) for all proposals collected in the training data D, and from these we obtain the corresponding GP-based likelihoods ^LGP. Now, since the corresponding particle filter likelihoods ^Lu are already available as training data, it is possible to compute said relative frequencies of occurrence ^p_i for each case (i = 1,...,4). At iteration r of the ADA-GP-MCMC algorithm, for a proposal θ⋆ that has survived the first stage, if ^LGP(θ⋆) > ^LGP(θ_{r−1}) we draw from the distribution Bern(^p₁/(^p₁ + ^p₃)) and go for case 1 if the draw equals one, and go for case 3 otherwise. If instead ^LGP(θ⋆) < ^LGP(θ_{r−1}) we draw from Bern(^p₂/(^p₂ + ^p₄)) and go for case 2 if the draw equals one, and go for case 4 otherwise.
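The frequency computation above is a one-liner per case once the four likelihood arrays are lined up. Below is a sketch under our own naming assumptions: the four arguments hold, for each stored proposal/current pair in the training data, the GP-based and particle-filter-based likelihood values.

```python
import numpy as np

def case_frequencies(L_gp_star, L_gp_prev, L_u_star, L_u_prev):
    """Empirical relative frequencies of cases 1-4 from training data."""
    up_gp = L_gp_star > L_gp_prev          # surrogate likelihood increased
    up_u = L_u_star > L_u_prev             # particle-filter likelihood increased
    cases = np.select([up_gp & up_u, ~up_gp & ~up_u,
                       up_gp & ~up_u, ~up_gp & up_u], [1, 2, 3, 4])
    return np.array([(cases == i).mean() for i in (1, 2, 3, 4)])

# Toy data containing exactly one pair per case:
p = case_frequencies(np.array([2., 1., 2., 1.]), np.array([1., 2., 1., 2.]),
                     np.array([2., 1., 1., 2.]), np.array([1., 2., 2., 1.]))
# Biased-coin probabilities used at run time:
p_case1_given_up = p[0] / (p[0] + p[2])    # when the surrogate increases
p_case2_given_down = p[1] / (p[1] + p[3])  # when the surrogate decreases
```

The two conditional probabilities are exactly the coin biases used in the selection step described above.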

State-dependent selection:

The biased coin model does not take into account the specific value of the current proposal θ⋆; that is, the same probabilities ^p_i are applied to all proposals during a run of ADA-GP-MCMC. We could instead estimate Pr(case i | θ⋆) using logistic regression or a decision tree model. When using logistic regression, we have two regression models to estimate: one for cases 1 and 3, and one for cases 2 and 4. By combining the training data D and the accepted proposals stored in D_acc, we have access both to the particle filter evaluations corresponding to all generated proposals and to those for the accepted proposals. Using D and D_acc we can classify which case each proposal in the training data belongs to. This is done by computing GP predictions, independently for the parameters stored in D and in D_acc. Note that after computing the GP predictions we have (i) particle filter predictions and GP predictions for all proposals in D, and (ii) particle filter predictions and GP predictions for all accepted proposals in D_acc. We can now loop over the proposals in the training data and assign labels for which of the four cases each proposal belongs to. As an example, after labelling is performed, all proposals in the training data that are classified as belonging to case 1 or 3 are collected, and an associated indicator vector, having 1 for proposals belonging to case 1 and 0 for proposals belonging to case 3, is created. We then fit a logistic regression model to these data, where the proposals take the role of “covariates” and the indicators are binary “responses”. The resulting fitted probability of selecting case 1 is denoted ^p₁(θ⋆) (so that ^p₃(θ⋆) = 1 − ^p₁(θ⋆)). In a similar way, after labelling is performed, all proposals in the training data that are classified as belonging to case 2 or 4 are collected together with their indicator vector, and a logistic regression model is fitted to obtain ^p₂(θ⋆) (and ^p₄(θ⋆) = 1 − ^p₂(θ⋆)).

All the above is preliminary to starting ADA-GP-MCMC. We then proceed as described for the biased coin case, with minimal notational adjustment: for a new proposal θ⋆, if ^LGP(θ⋆) > ^LGP(θ_{r−1}) we decide between cases 1 and 3 by drawing from Bern(^p₁(θ⋆)); otherwise we draw from Bern(^p₂(θ⋆)) to decide between cases 2 and 4. Alternatively, in place of a logistic regression model we can use decision trees, employing the same ideas as for logistic regression. Decision trees can perform better at modeling non-linear dependencies in the data. Importantly, a decision tree does not produce an estimate of the probability of each case (hence, we do not obtain a direct estimate of ^p_i(θ⋆)); instead a classification decision is computed, which directly selects which case to assume for the given proposal θ⋆. We obtained the best results with the decision tree model. We have found it beneficial to include, as a covariate in the decision tree model, the ratio between the GP-based log-likelihood estimate at the current proposal and the previous log-likelihood estimate.
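The state-dependent selection only needs a binary classifier per pair of cases. As a self-contained illustration (a stand-in for any off-the-shelf classifier, with data and names of our own choosing), here is a plain gradient-ascent logistic regression mapping a proposal to the probability of case 1 versus case 3:

```python
import numpy as np

def fit_logistic(X, z, iters=2000, lr=0.5):
    """Gradient-ascent logistic regression; X holds proposals (one row
    each), z holds the indicators (1 for case 1, 0 for case 3)."""
    Xb = np.column_stack([np.ones(len(X)), X])   # add intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w += lr * Xb.T @ (z - p) / len(z)        # ascent on log-likelihood
    return w

def prob_case1(w, theta):
    """Fitted probability p1(theta) of selecting case 1 at a proposal."""
    xb = np.concatenate([[1.0], np.atleast_1d(theta)])
    return 1.0 / (1.0 + np.exp(-xb @ w))

# Toy labelled training proposals (case 1 for positive theta, case 3 otherwise):
X = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])[:, None]
z = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
w = fit_logistic(X, z)
```

A second classifier, fitted in the same way to the case 2/4 labels, gives the other coin; a decision tree would replace `prob_case1` with a hard classification decision.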

In conclusion, we have introduced three selection methods. In Algorithm 2 the selection methods are denoted (for selecting between cases 1 and 3) and (for selecting between cases 2 and 4), to highlight the fact that different selection methods are available. In the supplementary material we describe how to test the fit of the GP model and the performance of the selection method.

5 Case studies

In Section 5.1 we consider the Ricker model, which has been used numerous times as a toy model to compare inference methods (e.g. Fearnhead and Prangle (2012) and Fasiolo et al. (2016), to name a few). In Section 5.2 we consider a novel double-well potential stochastic differential equation (DWP-SDE) model for protein folding data, which is a considerably more complex case study. An additional simulation study for the DWP-SDE model, diagnostics and further methodological sections are presented in the supplementary material. The code can be found at https://github.com/SamuelWiqvist/adamcmcpaper.

5.1 Ricker Model

The Ricker model is used in ecology to describe how the size of a population varies over time, and is given by

\[
\begin{cases}
y_{t+1} \sim \mathcal{P}(\phi x_{t+1}),\\
x_{t+1} = r x_t e^{-x_t + \epsilon_t},\\
\epsilon_t \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2),
\end{cases} \tag{8}
\]

where $\mathcal{P}(\mu)$ denotes the Poisson distribution with mean $\mu$. The process $x_t$ is a latent (i.e. unobservable) Markov process, and realizations from the observable process $y_t$ are conditionally independent given the latent states, since the $\epsilon_t$ are assumed independent. Even though the model is fairly simple, its dynamics are highly non-linear and close to chaotic for some choices of the parameter values (Wood, 2010). The likelihood function is also both analytically and numerically intractable when evaluated at parameter values that are highly incompatible with the observed data; see Fasiolo et al. (2016) for a review of inference methods applied to the Ricker model.

We are interested in inferring , and we use PMCMC, MCWM, DA-GP-MCMC, and ADA-GP-MCMC for this task. That is, MCWM is not only used to provide the training data for fitting a GP regression, but also to provide inference results, in the interest of comparison between methods. PMCMC is used to provide exact Bayesian inference. A data set containing observations was generated from the model with ground-truth parameters at integer sampling times ; the starting value for the latent state was deterministically fixed and treated as a known constant throughout.

Results obtained with PMCMC and MCWM are produced using 52,000 iterations in total (including a burn-in period of 2,000 iterations) and particles (the standard deviation of the log-likelihood obtained from the particle filter is about ). The proposal distribution was adaptively tuned using the generalized AM algorithm (Andrieu and Thoms, 2008, Mueller, 2010), which is set to target an acceptance rate of 40%. For the DA-GP-MCMC algorithm, we used the last 2,000 iterations of a previous MCWM run to obtain training data. Prior to fitting the GP model we removed from the training data the 10% of cases having the lowest log-likelihood values, as these cases badly affected the GP predictions. After fitting the GP model, we used the “extended” version of the DA algorithm discussed in Section 2, set (that is, a 15% probability of skipping the delayed-acceptance step and executing a regular Metropolis-Hastings step) and , and ran DA-GP-MCMC for a further 50,000 iterations. The Gaussian kernels for the Metropolis random walks, and , were kept fixed during the entire run of the DA-GP-MCMC algorithm: specifically, used the covariance matrix returned by the final iteration of the MCWM algorithm that was used to collect training data, and was set to a kernel having slightly larger terms in the covariance, i.e. we used a covariance with . An important modification of DA-GP-MCMC, as described in Algorithm 1, is that in our case studies the value at the denominator of is “refreshed”; hence, we employ a MCWM updating procedure in the second stage. This is to obtain a reasonably high acceptance rate and to avoid problems with stickiness. The same modification was used for ADA-GP-MCMC. At the second stage of the -th iteration of ADA-GP-MCMC, a decision tree model was used to select a case among the four discussed in Sections 2.1 and 4.1.

Wide uniform priors were employed for all unknown parameters: , and . The starting values were also deliberately set far away from the true parameter values: , , and . Results are presented in Table 1 and Figure 1. We can conclude that all parameters are well inferred, and the results for the different algorithms are similar. The parameter with the highest estimation uncertainty is , which is not surprising since is the parameter that governs the noise in the model, and this is often the hardest parameter to estimate from discretely observed measurements. Notice that results produced by ADA-GP-MCMC are essentially identical to those from DA-GP-MCMC. We find this very encouraging, since the most relevant way to judge inference results from the accelerated ADA procedure is to compare them with those from the standard DA algorithm rather than, say, PMCMC.

Properties of the algorithms are presented in Table 2. Before discussing these results, we emphasize that the benefits of our accelerated procedure emerge when the case study has a likelihood that is computationally very challenging, which is not the case for the present example; see instead Section 5.2. The ADA-GP-MCMC algorithm is the fastest algorithm, though only marginally faster than DA-GP-MCMC (4.2 times faster than MCWM and 1.09 times faster than DA-GP-MCMC), while MCWM is the slowest one. Not surprisingly, PMCMC is almost twice as fast as MCWM: PMCMC only requires one evaluation of the particle filter per iteration, while MCWM requires two. The four algorithms are, however, essentially equally efficient, as shown by the ESS/sec values.

The estimated probabilities for the four different cases characterizing ADA-GP-MCMC (recall that and ), and the percentage of times each case holds, i.e. the probability that the selected case is indeed the correct one, are presented in Table 3. We notice that the probabilities for the different cases vary considerably, and also that the percentages at which the assumption holds vary across cases. We also notice that the performance of the selection algorithm is much better for case 2 than for case 4: this is due to the imbalance between the two classes, meaning that in our training data case 2 occurs more frequently than case 4, and therefore the latter case is more difficult to estimate accurately.

5.2 Double-well potential stochastic differential equation model for protein folding data

We now consider a computationally intensive case study concerning statistical inference for protein folding data. The challenges of this case study are: (a) the sample size is large, the data being a long time series (about observations); (b) the non-linear dynamics; and (c) the presence of local perturbations. “Protein folding” is the last and crucial step in the transformation of genetic information, encoded in DNA, into a functional protein molecule. Studying the time-dynamics of real protein folding results in a very high-dimensional problem, which is difficult to analyze using exact Bayesian methodology. Therefore, for reasons of simplification and tractability, the dynamics of a protein are often modelled as diffusions along a single “reaction coordinate”, that is, one-dimensional diffusion models are considered to model a projection of the actual dynamics in high-dimensional space (Best and Hummer, 2011).

The (reaction coordinate) data are shown in Figure 2. We notice that the data have a bimodal marginal structure, with irregular change-points where the mean of the data shifts, and a local noisy structure. A class of models shown to be suitable for statistical modeling of protein folding (at least when the data are a low-dimensional projection of the original data) is given by stochastic differential equations (SDEs); see Forman and Sørensen (2014) and Picchini and Forman (2016). Monte Carlo inference methods are very computationally intensive for these models (in Picchini and Forman (2016), data sub-sampling and special approximate Bayesian computation methods were used to accelerate inference). We now introduce a novel double-well potential stochastic differential equation (DWP-SDE) model for protein folding data. This model is faster to simulate than the one proposed in Forman and Sørensen (2014) and Picchini and Forman (2016).

The DWP-SDE model is defined as

\[
\begin{cases}
z_t = x_t + y_t,\\
\mathrm{d}x_t = -\nabla V(x_t)\,\mathrm{d}t + \sigma\,\mathrm{d}W^x_t,\\
\mathrm{d}y_t = -\kappa y_t\,\mathrm{d}t + \sqrt{2\kappa\gamma^2}\,\mathrm{d}W^y_t.
\end{cases} \tag{9}
\]

Here $z_t$ is the observable process, consisting of the sum of the solution $x_t$ to the double-well potential SDE and the process $y_t$, the latter being unobservable and representing autocorrelated error. Here $\nabla V(x_t)$ is the gradient of the double-well potential function with respect to $x_t$, further specified by the six parameters introduced in (10). Finally, $W^x_t$ and $W^y_t$ are independent standard Wiener processes, that is, their increments $\mathrm{d}W^x_t$ and $\mathrm{d}W^y_t$ are independent and Gaussian distributed with zero mean and variance equal to the length of the time increment. We consider the following double-well potential function

 (10)

which is based on the potential described in equation 1 of Fang et al. (2017). The formulation in (10) is fairly general, in the sense that many different potentials can be specified by varying its parameters. The parameters in (10) have the following interpretation: specifies the location of the potential (i.e. where the potential is centered); determines the spread of the potential; is an asymmetry parameter; compresses the two modes of the long-term (stationary) density of the process; the parameters and control the shape of the two modes (if they are set to low values, the long-term probability distribution becomes flatter, with less distinct modes); governs the noise in the latent process. The error model is an Ornstein-Uhlenbeck process specified by two parameters: is the autocorrelation level, and is the noise intensity. In principle, inference should be conducted for . However, the model parameters and are “stiff”, i.e. small changes in their values result in considerable changes in the output, and they are therefore hard to estimate. Estimating all the parameters of the DWP-SDE model is also a complex task, since a larger data set seems to be needed to capture the stationary distribution of the data. We will therefore consider the easier task of estimating the parameters . The remaining parameters, and , will be fixed to arbitrary values, as discussed later.

Simulating the process in (9) is easy, since the transition density for the Ornstein-Uhlenbeck process is known. We have that

\[
y_{t+\Delta t} \mid y_t = x \;\sim\; \mathcal{N}\left(x e^{-\kappa \Delta t},\; \gamma^2\left(1 - e^{-2\kappa \Delta t}\right)\right),
\]

where . The transition density for the process $x_t$ is not analytically known, and we use the Euler-Maruyama scheme to propagate the process, that is, we use

\[
x_{t+\delta t} \mid x_t = x \;\approx\; x - \nabla V(x)\,\delta t + \sigma \epsilon_t,
\]

where , and is the stepsize for the Euler-Maruyama numerical integration scheme (typically ).
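Combining the two transitions above, a trajectory of the observable process $z_t = x_t + y_t$ can be simulated as sketched below. Since the six-parameter potential of (10) is not reproduced here, `grad_V` is a stand-in quartic double-well gradient, and all parameter values are arbitrary assumptions for illustration only.

```python
import numpy as np

def simulate_dwp_sde(T, dt=0.02, sigma=0.7, kappa=1.0, gamma=0.3,
                     x0=0.0, y0=0.0, seed=0):
    """Simulate z_t = x_t + y_t from model (9): Euler-Maruyama for the
    double-well process x, exact transition density for the OU error y.
    grad_V is a hypothetical quartic double-well gradient, NOT the
    six-parameter potential of equation (10)."""
    rng = np.random.default_rng(seed)
    grad_V = lambda x: 4.0 * x * (x**2 - 1.0)   # stand-in potential gradient
    n = int(T / dt)
    x, y = x0, y0
    z = np.empty(n)
    # standard deviation of the exact OU transition over a step of size dt
    ou_sd = gamma * np.sqrt(1.0 - np.exp(-2.0 * kappa * dt))
    for i in range(n):
        # Euler-Maruyama: x' = x - grad_V(x) dt + sigma sqrt(dt) N(0, 1)
        x = x - grad_V(x) * dt + sigma * np.sqrt(dt) * rng.normal()
        # exact OU step: y' ~ N(y e^{-kappa dt}, gamma^2 (1 - e^{-2 kappa dt}))
        y = y * np.exp(-kappa * dt) + ou_sd * rng.normal()
        z[i] = x + y
    return z

z = simulate_dwp_sde(T=100.0)
```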

Let us now consider the likelihood function for the process in (9), for a set of discrete observations that we assume observed at integer sampling times . The corresponding (unobservable) values of the process at the same sampling times are . In addition, we denote with the set , which includes an arbitrary value from which simulations of the latent system are started. The likelihood function can be written as

\begin{align*}
L(\theta) &= p(z \mid \theta) = p(z_1 \mid \theta) \prod_{t=2}^{T} p(z_t \mid z_1, \ldots, z_{t-1}, \theta)\\
&= \int p(z_1, \ldots, z_T \mid x_0, \ldots, x_T, \theta)\, p(x_0, \ldots, x_T \mid \theta)\, \mathrm{d}x_0 \cdots \mathrm{d}x_T\\
&= \int p(z_1, \ldots, z_T \mid x_0, \ldots, x_T, \theta)\, p(x_0) \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \theta)\, \mathrm{d}x_0 \cdots \mathrm{d}x_T.
\end{align*}

The last product in the integrand is due to the Markov property of . Also, we have introduced a density ; if is deterministically fixed (as in our experiments), this density can be discarded. We cannot compute the likelihood function analytically (the integral is typically intractable), but we can use sequential Monte Carlo (for example, the bootstrap filter in the supplementary material) to compute an unbiased approximation , which allows us to use PMCMC or MCWM for the inference. Furthermore, the process is a transformation of the measurement noise that follows an Ornstein-Uhlenbeck process, and the density for is known (Picchini and Forman, 2016). We have that

where , and denotes the density function for the standard Gaussian distribution.
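As a generic illustration of how a bootstrap filter produces such an unbiased likelihood estimate, consider the sketch below. It assumes a toy Gaussian state-space model (the `propagate` and `log_obs_dens` functions are invented placeholders); the actual filter for model (9), with the Ornstein-Uhlenbeck observation density, is given in the supplementary material.

```python
import numpy as np

def bootstrap_pf_loglik(z, propagate, log_obs_dens, N=200, seed=0):
    """Generic bootstrap particle filter returning an unbiased estimate
    of the likelihood (reported on the log scale) for a state-space
    model: weight, resample, propagate."""
    rng = np.random.default_rng(seed)
    x = propagate(np.zeros(N), rng)            # particles at the first time point
    loglik = 0.0
    for zt in z:
        logw = log_obs_dens(zt, x)             # weight each particle
        m = logw.max()                         # log-sum-exp stabilization
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())         # log of (1/N) * sum of weights
        idx = rng.choice(N, size=N, p=w / w.sum())   # multinomial resampling
        x = propagate(x[idx], rng)             # propagate resampled particles
    return loglik

# Toy example: AR(1) latent state with Gaussian observation noise (assumed)
prop = lambda x, rng: 0.9 * x + rng.normal(0.0, 0.5, x.shape)
lod = lambda zt, x: -0.5 * ((zt - x) / 0.3) ** 2 - np.log(0.3 * np.sqrt(2 * np.pi))
z_obs = np.zeros(20)
ll = bootstrap_pf_loglik(z_obs, prop, lod)
```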

We now explain how an unbiased approximation to is computed. To facilitate this explanation we introduce the following notation: let denote the set of particles we have at time before resampling is performed (see the bootstrap filter algorithm in the supplementary material). Let denote the resampled particles that are used to propagate the latent system forward to time (using Euler-Maruyama). We approximate unbiasedly with as

\[
\widehat{p}(z \mid \theta) = \widehat{p}(z_1 \mid \theta) \prod_{t=2}^{T} \widehat{p}(z_t \mid z_1, \ldots, z_{t-1}, \theta) = \widehat{p}(z_1 \mid \theta) \left\{ \prod_{t=2}^{T} \frac{1}{N} \sum_{n=1}^{N} w^n_t \right\},
\]

where the weights are

\[
w^n_t = \frac{1}{\gamma\sqrt{1-e^{-2\kappa\Delta t}}}\; \phi\!\left(\frac{z_t - x^n_t - e^{-\kappa\Delta t}\left(z_{t-1} - \tilde{x}^n_{t-1}\right)}{\gamma\sqrt{1-e^{-2\kappa\Delta t}}}\right),
\]