Stochastic processes are used to model complex systems in almost all fields of science and engineering. Partially observed stochastic processes result in some of the most computationally challenging problems for Bayesian inference [1, 2, 3]. Given the ubiquity of stochastic processes in real applications, it is essential that efficient methods are developed to enable the analysis of modern high-resolution data sets without sacrificing accuracy.
An important application of stochastic processes occurs in the study of cellular processes [4, 5]. Here, stochastic models of biochemical reaction networks often provide a more accurate description of system dynamics than deterministic models. This is largely due to intrinsic noise in the system dynamics of many biochemical processes that are significantly influenced by relatively low populations of certain chemical species. For example, in eukaryotic cells, molecules that regulate gene expression occur in relatively low numbers; as a result, stochastic fluctuations have a direct effect on the production rates of proteins [4, 8]. In addition, there are other interesting phenomena that occur in biological systems, such as self-induced stochastic resonance [9, 10, 11], stochastic focussing, and stochastic bi-stability [13, 14], that can only be captured by stochastic models.
To quantify uncertainty in parameters or predictions, it is typical to consider an expectation of a function of an unknown vector of model parameters, $\boldsymbol{\theta}$, conditional on some observational data, $\mathcal{D}$,

$$\mathbb{E}\left[f(\boldsymbol{\theta})\right] = \int_{\boldsymbol{\Theta}} f(\boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathcal{D})\,\text{d}\boldsymbol{\theta}, \qquad (1)$$

where $p(\boldsymbol{\theta} \mid \mathcal{D})$ is the Bayesian posterior probability density,

$$p(\boldsymbol{\theta} \mid \mathcal{D}) = \frac{\mathcal{L}(\boldsymbol{\theta}; \mathcal{D})\, p(\boldsymbol{\theta})}{p(\mathcal{D})}. \qquad (2)$$

Here, $p(\boldsymbol{\theta})$ is the a priori probability density of the unknown parameters, $\mathcal{L}(\boldsymbol{\theta}; \mathcal{D})$ is the likelihood function that maps a parameter vector to the probability of the observations for the model with this parameter vector, and $p(\mathcal{D})$ is a normalising constant often referred to as the evidence.
The evidence term, given by

$$p(\mathcal{D}) = \int_{\boldsymbol{\Theta}} \mathcal{L}(\boldsymbol{\theta}; \mathcal{D})\, p(\boldsymbol{\theta})\,\text{d}\boldsymbol{\theta}, \qquad (3)$$

is rarely tractable, and the expectation in Equation (1) must be numerically estimated using sampling techniques, such as Markov chain Monte Carlo (MCMC) [16, 17] and sequential Monte Carlo (SMC), that only require point-wise evaluation of the likelihood function, $\mathcal{L}(\boldsymbol{\theta}; \mathcal{D})$. For partially observed Markov processes, point-wise evaluation of the likelihood requires the solution to the forward Kolmogorov equation, which must be computed approximately. Therefore, standard Bayesian tools cannot be applied and likelihood-free methods are needed, such as approximate Bayesian computation (ABC) [19, 20, 21], pseudo-marginal methods [22, 23, 24], and Bayesian synthetic likelihood (BSL) [25, 26, 27]. Regardless of the likelihood-free approach, realisations of the stochastic model are generated by stochastic simulation in place of likelihood function evaluation. Thus, the computational cost of evaluating Equation (1) depends on the efficiency of the stochastic simulation algorithm and the posterior sampler used as a basis for likelihood-free inference. For example, ABC can be implemented using rejection sampling [28, 29, 30], MCMC, or SMC approaches [32, 33, 34]. However, if the model is even moderately expensive to simulate, then such techniques are still infeasible.
Recently, there has been substantial research activity in the application of approximate model simulations and approximate posterior samplers, in combination with bias-correction adjustments, to accelerate likelihood-free applications that would be impractical otherwise; examples include delayed-acceptance [35, 36, 37, 38] and early-rejection approaches. Techniques such as transport maps and moment-matching transforms aim to transform a set of approximate posterior samples, generated using a surrogate or reduced model, into posterior samples under an expensive exact model. Other approaches, such as preconditioning, utilise approximate models to inform efficient proposal mechanisms. Various early-rejection and delayed-acceptance schemes [36, 37, 38, 39] probabilistically simulate the accurate model based on the rejection/acceptance status of an approximation; this family of methods is generalised for ABC schemes under the multifidelity framework [42, 43, 44].
There has also been substantial research in the last decade on the use of control variates to improve the rate of convergence in mean-square of Monte Carlo estimates of expectations. In particular, the multilevel Monte Carlo (MLMC) method [45, 46] expands an expectation as a telescoping summation of bias corrections and achieves variance reductions by exploiting path-wise convergence properties of numerical schemes for solving stochastic differential equations (SDEs) or discrete-state Markov processes [48, 49, 50]. Recently, the MLMC telescoping summation approach has also been successfully applied to Bayesian inference [51, 52, 53, 54, 55], as per Equation (1), including ABC-based samplers [56, 57, 58].
Many of the above approaches exploit approximate simulations in ways that are not mutually exclusive. However, it remains an open question how these various schemes could be combined to obtain compounding performance gains.
In this work, we derive and demonstrate an effective method for approximate Bayesian inference that combines the benefits of multifidelity ABC sampling with those of MLMC variance reduction applied to expectations with respect to an ABC posterior density. Acknowledging the wide range of variations in these algorithm classes, we specifically consider a natural application of these approaches to rejection samplers and extend the methods of Prescott and Baker and Warne et al. However, our results are equally relevant to other sampling approaches, such as SMC, using more recent developments by Prescott and Baker and Jasra et al. Using biochemical reaction networks as a characteristic application area, we present key computational features of ABC methods and introduce the fundamental algorithms developed by Prescott and Baker and Warne et al. We then highlight the distinct features of the inference problem exploited by each approach to accelerate ABC rejection sampling, and derive our novel approach by leveraging the complementary nature of these features. We explore various performance results using stochastic models of Michaelis–Menten enzyme kinetics and the repressilator gene regulatory network. Next, we establish requirements for practical application of our approach and provide rules-of-thumb to simplify the tuning of algorithm parameters. Finally, we demonstrate the efficacy of our multifidelity MLMC approach using a challenging model of a two-step MAPK cascade reaction, which forms the basis for many real biological functions such as cell-to-cell signalling through the epithelial growth factor receptor (EGFR) [59, 60]. Our results demonstrate the potential of multilevel and multifidelity methods to accelerate ABC sampling by several orders of magnitude.
In this section, we describe standard numerical methods for simulation and inference of stochastic biochemical network models [21, 61, 62]. Then we present recent developments in MLMC [21, 58] and multifidelity methods for inference, and develop novel extensions that achieve the benefits of both the MLMC and multifidelity approaches. Finally, we highlight practical challenges and provide guidelines for the application and tuning of our new method.
2.1 Stochastic models of biochemical reaction networks
We consider biochemical reaction network models that involve a well-mixed population of $N$ chemical species, $X_1, X_2, \ldots, X_N$, that react via a network of $M$ chemical reactions,

$$\sum_{i=1}^{N} \nu^{-}_{i,j} X_i \xrightarrow{k_j} \sum_{i=1}^{N} \nu^{+}_{i,j} X_i, \qquad j = 1, 2, \ldots, M,$$

where $\nu^{-}_{i,j}$ and $\nu^{+}_{i,j}$ are, respectively, the reactant and product stoichiometries for species $i$ in reaction $j$. Given the state of the system at time $t$, $\mathbf{X}_t = [X_{1,t}, X_{2,t}, \ldots, X_{N,t}]^{\top}$, with $X_{i,t}$ denoting the copy number of species $i$ at time $t$, it can be shown that, for a sufficiently small time interval $[t, t + \text{d}t)$, the probability of reaction $j$ occurring within this interval is proportional to $a_j(\mathbf{X}_t)\,\text{d}t$, where $a_j(\mathbf{X}_t)$ is the propensity function of reaction $j$. Generally, the propensity function will take the mass-action form

$$a_j(\mathbf{X}_t) = k_j \prod_{i=1}^{N} \frac{X_{i,t}!}{\left(X_{i,t} - \nu^{-}_{i,j}\right)!},$$

where $k_j$ is the kinetic rate parameter for reaction $j$; however, other more complex forms are also possible.
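As a concrete illustration, the mass-action propensity described above can be computed directly from the reactant stoichiometries via falling factorials. The following Python sketch is illustrative only; the function name and arguments are our own, not from any specific software package.

```python
def propensity(x, k, nu_minus):
    """Mass-action propensity of one reaction:
    a(x) = k * prod_i x_i! / (x_i - nu_i^-)!,
    where x holds copy numbers and nu_minus the reactant stoichiometries."""
    a = float(k)
    for xi, vi in zip(x, nu_minus):
        if xi < vi:
            return 0.0  # not enough reactant copies for this reaction
        # falling factorial: x_i * (x_i - 1) * ... * (x_i - v_i + 1)
        for d in range(vi):
            a *= (xi - d)
    return a

# Dimerisation 2X -> Y with k = 0.5 and 10 copies of X:
# a = 0.5 * 10 * 9 = 45.0
a_dimer = propensity([10], 0.5, [2])
```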
Mathematically, the stochastic dynamics of a biochemical network are governed by a discrete-state, continuous-time Markov process [62, 63, 64, 65]. The transitional density function of this process, $p(\mathbf{x}, t \mid \mathbf{y}, s)$, describes the probability of the system being in state $\mathbf{x}$ at time $t$ given state $\mathbf{y}$ at a previous time $s < t$. To obtain the transitional density one needs to solve the forward Kolmogorov equation, also referred to as the chemical master equation (CME) [62, 66],

$$\frac{\text{d} p(\mathbf{x}, t \mid \mathbf{y}, s)}{\text{d} t} = \sum_{j=1}^{M} \left[ a_j(\mathbf{x} - \boldsymbol{\nu}_j)\, p(\mathbf{x} - \boldsymbol{\nu}_j, t \mid \mathbf{y}, s) - a_j(\mathbf{x})\, p(\mathbf{x}, t \mid \mathbf{y}, s) \right], \qquad (4)$$

where $\boldsymbol{\nu}_j$ is the state change corresponding to the occurrence of reaction $j$, that is, $\boldsymbol{\nu}_j = \boldsymbol{\nu}^{+}_{j} - \boldsymbol{\nu}^{-}_{j}$.
2.1.1 Stochastic simulation methods
The CME is intractable for all but the simplest of models [21, 61, 62]. As a result, exact stochastic simulation [63, 67, 68] is required to study the system behaviour without introducing potentially substantial bias. These exact stochastic simulation schemes, such as Gillespie's direct method (Algorithm 1), simulate every reaction event, and are computationally prohibitive for systems with very large copy numbers or very high reaction rates.
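To make the exact simulation step concrete, a minimal Python sketch of Gillespie's direct method follows. The function names and the toy decay example are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def gillespie(x0, t_end, nu, propensities, rng):
    """Gillespie's direct method: simulate one exact trajectory up to t_end.
    x0: initial copy numbers; nu: (M, N) array of net state-change vectors;
    propensities: function mapping a state to the M propensity values."""
    t, x = 0.0, np.array(x0, dtype=int)
    while True:
        a = propensities(x)
        a0 = a.sum()
        if a0 == 0.0:                      # no reaction can fire; state is absorbed
            return x
        t += rng.exponential(1.0 / a0)     # exponential time to the next reaction
        if t > t_end:
            return x
        j = rng.choice(len(a), p=a / a0)   # select which reaction fires
        x = x + nu[j]

# Toy example: pure decay X -> 0 at rate 1.0 per molecule
rng = np.random.default_rng(1)
nu = np.array([[-1]])
x_final = gillespie([50], 100.0, nu, lambda x: np.array([1.0 * x[0]]), rng)
```

By t = 100 the decay process is extinct with overwhelming probability, so `x_final` is the zero state.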
2.1.2 Acceleration using multilevel Monte Carlo
Often the goal of stochastic simulation is to estimate the expectation, $\mathbb{E}\left[f(\mathbf{X}_t)\right]$, where $f$ is a function of the process state at time $t$. MLMC provides a mechanism to exploit approximations that computationally accelerate this estimation [21, 47, 49]. Assume we have a stochastic process, $\mathbf{X}$, such as a biochemical reaction network model, that is computationally expensive to simulate with the Gillespie direct method (Algorithm 1) or equivalent. Now consider a sequence of stochastic processes, $\mathbf{Z}_1, \mathbf{Z}_2, \ldots, \mathbf{Z}_L$, that approximate $\mathbf{X}$. This sequence is constructed such that the bias decreases and the computational cost increases with $l$. For example, $\mathbf{Z}_l$ could be a tau-leaping approximation with time step $\tau_l$, where $\tau_{l+1} = \tau_l / m$ for some constant scale factor $m > 1$ and all $l$. The insight of Giles was to expand the desired expectation as a telescoping summation,

$$\mathbb{E}\left[f(\mathbf{Z}_L)\right] = \mathbb{E}\left[f(\mathbf{Z}_1)\right] + \sum_{l=2}^{L} \mathbb{E}\left[f(\mathbf{Z}_l) - f(\mathbf{Z}_{l-1})\right], \qquad (5)$$
where each of the difference terms acts to correct for the bias of the initial approximation. While a direct Monte Carlo approximation to Equation (5) would necessarily be more computationally expensive than the original expectation, Giles demonstrated that substantial computational improvements can be obtained if positive correlations can be induced between the terms in the bias corrections, that is, between $f(\mathbf{Z}_l)$ and $f(\mathbf{Z}_{l-1})$ for $l = 2, \ldots, L$, and between $f(\mathbf{Z}_L)$ and $f(\mathbf{X})$ when a final bias correction against the exact process is required. For SDEs and discrete-state continuous-time Markov processes, coupling schemes that induce sufficiently strong positive correlations have been well studied [21, 46, 48, 50], and performance improvements of many orders of magnitude can be obtained without any loss in accuracy in terms of mean-square error.
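The structure of the telescoping estimator in Equation (5) can be sketched in a few lines of Python. The toy samplers below stand in for coupled simulations; the small noise on the correction terms mimics the variance reduction achieved by a good coupling. All names and numbers are illustrative.

```python
import numpy as np

def mlmc_estimate(sample_coarse, sample_correction, N):
    """Multilevel estimator of E[f(Z_L)] via the telescoping summation:
    a coarse estimate at level 1 plus bias-correction terms for l = 2..L.
    sample_coarse(n): n samples of f(Z_1); sample_correction(l, n): n coupled
    samples of f(Z_l) - f(Z_{l-1}); N: sample sizes N_1, ..., N_L."""
    est = np.mean(sample_coarse(N[0]))
    for l in range(2, len(N) + 1):
        est += np.mean(sample_correction(l, N[l - 1]))
    return est

# Toy levels with known biases; corrections have much smaller variance,
# so they need far fewer samples than the coarse level.
rng = np.random.default_rng(0)
bias = {1: 0.5, 2: 0.1, 3: 0.0}
coarse = lambda n: 1.0 + bias[1] + 0.1 * rng.standard_normal(n)
corr = lambda l, n: (bias[l] - bias[l - 1]) + 0.01 * rng.standard_normal(n)
est = mlmc_estimate(coarse, corr, [1000, 100, 10])   # close to the target 1.0
```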
2.2 Approximate Bayesian computation
For the purposes of inference, the likelihood function for a sequence of $n$ observations, denoted by $\mathcal{D} = [\mathbf{Y}_{\text{obs}}(t_1), \mathbf{Y}_{\text{obs}}(t_2), \ldots, \mathbf{Y}_{\text{obs}}(t_n)]$, is given by

$$\mathcal{L}(\boldsymbol{\theta}; \mathcal{D}) = \int \prod_{i=1}^{n} g\!\left(\mathbf{Y}_{\text{obs}}(t_i) \mid \mathbf{x}_i\right) p\!\left(\mathbf{x}_i, t_i \mid \mathbf{x}_{i-1}, t_{i-1}; \boldsymbol{\theta}\right) \text{d}\mathbf{x}_1 \cdots \text{d}\mathbf{x}_n, \qquad (6)$$

where $g(\mathbf{Y}_{\text{obs}}(t_i) \mid \mathbf{x}_i)$ is the probability density of the noisy observation $\mathbf{Y}_{\text{obs}}(t_i)$, given the true state, $\mathbf{x}_i$, at time $t_i$, and $p(\mathbf{x}_i, t_i \mid \mathbf{x}_{i-1}, t_{i-1}; \boldsymbol{\theta})$ is the solution to the CME for the state transition over the time interval $[t_{i-1}, t_i]$. Note that the CME in Equation (4) implicitly depends on the vector of rate parameters, $\boldsymbol{\theta} = [k_1, k_2, \ldots, k_M]$, through the propensity functions; here in Equation (6) we make this dependence explicit. The intractability of the CME immediately implies the intractability of the Bayesian inference problem. However, depending on the structure of the observation process, Equation (6) may still be intractable even when Equation (4) has an analytical solution. Therefore, almost any practical application of partially observed continuous-time Markov processes will require some form of likelihood-free inference.
ABC, pseudo-marginal methods, and BSL all avoid the dependence of Bayesian inference on the likelihood (Equation (6)) by generating simulated data from the likelihood using exact stochastic simulation along with the observation noise process. Here, we focus on the ABC approach, which is based on the posterior approximation,

$$p(\boldsymbol{\theta} \mid \mathcal{D}) \approx p\!\left(\boldsymbol{\theta} \mid \rho(\mathcal{D}, \mathcal{D}_s) \le \epsilon\right) \propto \mathbb{P}\!\left(\rho(\mathcal{D}, \mathcal{D}_s) \le \epsilon \mid \boldsymbol{\theta}\right) p(\boldsymbol{\theta}), \qquad (7)$$

where $\rho(\mathcal{D}, \mathcal{D}_s)$ is a discrepancy metric between the noisy observations, $\mathcal{D}$, and simulated noisy observations, $\mathcal{D}_s$, and $\epsilon$ is a sufficiently small discrepancy threshold. The most direct method to implement ABC is rejection sampling (Algorithm 3).
Here, independent, identically distributed samples of the rate parameters are drawn from the prior, and a sample is accepted if a resulting stochastic simulation of the model is within $\epsilon$ of the observations under the discrepancy metric. In practice, rejection sampling may not be computationally feasible due to prohibitively high rejection rates when the data dimensionality is high. There are many techniques that can be applied to improve the computational efficiency of ABC methods that we do not discuss here due to the wealth of available literature [19, 20, 31, 32, 34, 74]. Instead, we specifically investigate recent developments in MLMC and multifidelity methods [42, 58], as they provide a new avenue to explore computational improvements for ABC inference. While we focus on these methods as applied directly to ABC rejection sampling (Algorithm 3), we note that our work is also applicable to other schemes such as SMC [43, 57].
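A minimal Python sketch of ABC rejection sampling in the spirit of Algorithm 3 follows; the Poisson toy problem and all function names are our own illustrative choices.

```python
import numpy as np

def abc_rejection(n_accept, prior_sample, simulate, discrepancy, eps, rng):
    """ABC rejection sampling: draw theta from the prior and keep it
    whenever the simulated data fall within eps of the observations
    under the discrepancy metric."""
    samples = []
    while len(samples) < n_accept:
        theta = prior_sample(rng)
        d_sim = simulate(theta, rng)
        if discrepancy(d_sim) <= eps:
            samples.append(theta)
    return np.array(samples)

# Toy example: infer a Poisson rate from a single observed count y_obs = 5;
# with eps = 0 this targets the exact posterior (Gamma(6, 1) under a
# uniform prior), whose mean is 6.
y_obs = 5
rng = np.random.default_rng(42)
post = abc_rejection(
    n_accept=200,
    prior_sample=lambda rng: rng.uniform(0.0, 20.0),
    simulate=lambda th, rng: rng.poisson(th),
    discrepancy=lambda y: abs(y - y_obs),
    eps=0,
    rng=rng,
)
```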
2.3 Multilevel Monte Carlo and multifidelity methods for ABC inference
In this section, we describe two recent methods for the acceleration of ABC inference that are foundational to the main contribution of this work, presented by Warne et al. and Prescott and Baker, respectively. Both methods build upon the MLMC work of Giles and of Rhee and Glynn for efficient computation of expectations with respect to a stochastic process [47, 48]. We refer the reader to Warne et al., Lester et al., Schnoerr et al., and Peherstorfer et al. for accessible introductions to simulation and inference methods, including MLMC and multifidelity methods.
2.3.1 Multilevel rejection sampling
MLMC-ABC applies the telescoping summation of Equation (5) to a sequence of ABC rejection samplers,

$$\mathbb{E}\left[f(\boldsymbol{\theta}_L)\right] = \mathbb{E}\left[f(\boldsymbol{\theta}_1)\right] + \sum_{l=2}^{L} \mathbb{E}\left[f(\boldsymbol{\theta}_l) - f(\boldsymbol{\theta}_{l-1})\right], \qquad (8)$$

with $\boldsymbol{\theta}_l$ distributed according to the ABC posterior for threshold $\epsilon_l$, for $l = 1, 2, \ldots, L$, where $\epsilon_1 > \epsilon_2 > \cdots > \epsilon_L$, $\epsilon_1$ is a large discrepancy threshold leading to a high acceptance rate (typically close to the prior), and $\epsilon_L = \epsilon$. The greatest challenge in the application of MLMC for inference is the construction of a coupling to generate positively correlated sample pairs without introducing additional bias that would violate the telescoping summation. For this, Warne et al. apply a novel construction using the marginal empirical distributions at level $l$ and the inverse marginal distributions obtained from the bias corrections up to level $l-1$. That is, given i.i.d. samples $\boldsymbol{\theta}_l^{(1)}, \ldots, \boldsymbol{\theta}_l^{(N_l)}$, one generates

$$\hat{\theta}_{l-1,i}^{(j)} = \hat{F}_{l-1,i}^{-1}\!\left(\hat{F}_{l,i}\!\left(\theta_{l,i}^{(j)}\right)\right), \qquad i = 1, 2, \ldots, K, \qquad (9)$$

where $\hat{F}_{l,i}$ is the marginal empirical distribution of the $i$th dimension of $\boldsymbol{\theta}_l$, denoted by $\theta_{l,i}$, and $\hat{F}_{l-1,i}^{-1}$ is an estimate of the marginal distribution inverse for the $i$th dimension of $\boldsymbol{\theta}_{l-1}$. The result is a sampling procedure for $\hat{\boldsymbol{\theta}}_{l-1}^{(j)}$ with a positive correlation induced between the pairs $(\boldsymbol{\theta}_l^{(j)}, \hat{\boldsymbol{\theta}}_{l-1}^{(j)})$. Warne et al. note that this approach only strictly satisfies the telescoping summation in each marginal rather than in the full joint distribution of $\boldsymbol{\theta}$. As a result, the sequence of discrepancies, $\epsilon_1, \ldots, \epsilon_L$, needs to be chosen so that the correlation structures of $\boldsymbol{\theta}_l$ and $\boldsymbol{\theta}_{l-1}$ are similar. This requires the bias correction terms in Equation (8) to be computed in order from $l = 2$ to $l = L$, due to the dependence on the previous level for the estimation of the marginal distribution inverses at level $l$ (Equation (9)). The complete process is given in Algorithm 4. Given an appropriate discrepancy sequence, the sample size sequence, $N_1, N_2, \ldots, N_L$, can be optimised to achieve improved convergence rates in mean squared error. However, the target accuracy needs to be sufficiently small for this improved convergence rate to take effect, due to an overhead computational cost in generating the trial simulations needed to optimise the sequence $N_1, \ldots, N_L$.
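The marginal coupling of Equation (9) can be sketched with empirical CDFs and quantiles. The helper below treats one marginal at a time; the names and Gaussian stand-ins are illustrative, not the paper's implementation.

```python
import numpy as np

def couple_marginal(theta_l, samples_l, samples_lm1):
    """Approximate coupling (Equation (9)) for a single marginal:
    map theta_l through the empirical CDF of the level-l samples, then
    through the empirical inverse CDF of the level-(l-1) samples."""
    u = np.searchsorted(np.sort(samples_l), theta_l, side="right") / len(samples_l)
    return np.quantile(samples_lm1, min(u, 1.0))

# A level-l sample is mapped to the level-(l-1) sample at the same
# quantile, so the coupled pair is strongly positively correlated.
rng = np.random.default_rng(3)
fine = rng.normal(0.0, 1.0, 5000)     # stand-in for level-l marginal samples
coarse = rng.normal(1.0, 1.0, 5000)   # stand-in for level-(l-1) marginal samples
paired = couple_marginal(0.0, fine, coarse)   # median of fine -> ~median of coarse
```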
2.3.2 Multifidelity rejection sampling
An alternative approach developed by Prescott and Baker utilises the telescoping summation in a probabilistic manner, akin to the de-biasing approach of Rhee and Glynn. Instead of considering a sequence of ABC samplers defined in terms of acceptance thresholds, they consider ABC rejection samplers with different simulator fidelities. That is, a high-fidelity simulator that is computationally expensive, such as the Gillespie direct method, and a computationally cheaper low-fidelity simulator, such as a tau-leaping approximation with time step $\tau$. The idea is to perform ABC rejection sampling with the low-fidelity simulator with discrepancy $\tilde{\rho}$ and acceptance threshold $\tilde{\epsilon}$, and then to perform a probabilistic bias correction that requires ABC rejection using a high-fidelity simulation with discrepancy $\rho$ and acceptance threshold $\epsilon$. The resulting estimator is given by

$$\hat{\mathbb{E}}\left[f(\boldsymbol{\theta})\right] = \frac{\sum_{j=1}^{N} w^{(j)} f(\boldsymbol{\theta}^{(j)})}{\sum_{j=1}^{N} w^{(j)}}, \qquad (10)$$

where $\boldsymbol{\theta}^{(1)}, \ldots, \boldsymbol{\theta}^{(N)}$ are samples from the prior and the weight function is

$$w^{(j)} = \tilde{I}^{(j)} + \frac{Z^{(j)}}{\eta(\tilde{\mathcal{D}}_s^{(j)})}\left(I^{(j)} - \tilde{I}^{(j)}\right), \qquad (11)$$

with $\tilde{I}^{(j)} = \mathbb{1}[\tilde{\rho}(\mathcal{D}, \tilde{\mathcal{D}}_s^{(j)}) \le \tilde{\epsilon}]$ and $I^{(j)} = \mathbb{1}[\rho(\mathcal{D}, \mathcal{D}_s^{(j)}) \le \epsilon]$ the low- and high-fidelity acceptance indicators. Here, $Z^{(j)}$ is a random variable given by

$$Z^{(j)} \sim \text{Bernoulli}\!\left(\eta(\tilde{\mathcal{D}}_s^{(j)})\right), \qquad (12)$$

where $\eta(\tilde{\mathcal{D}}_s) \in (0, 1]$ is the probability of generating a high-fidelity simulation given a realisation from the low-fidelity simulation. This so-called continuation probability can take many forms; however, the method originally proposed by Prescott and Baker is

$$\eta(\tilde{\mathcal{D}}_s) = \eta_1 \mathbb{1}\!\left[\tilde{\rho}(\mathcal{D}, \tilde{\mathcal{D}}_s) \le \tilde{\epsilon}\right] + \eta_2 \mathbb{1}\!\left[\tilde{\rho}(\mathcal{D}, \tilde{\mathcal{D}}_s) > \tilde{\epsilon}\right], \qquad (13)$$

where $\eta_1$ and $\eta_2$ are, respectively, the continuation probabilities when the low-fidelity simulation, $\tilde{\mathcal{D}}_s$, is accepted and rejected. The sampler proceeds according to Algorithm 5.
Prescott and Baker prove that when $\eta_1 > 0$ and $\eta_2 > 0$, the multifidelity estimator (Equation (10)) is asymptotically unbiased, and under certain conditions $\eta_1$ and $\eta_2$ may be optimised such that the multifidelity ABC rejection sampler (Algorithm 5) is more computationally efficient than direct rejection sampling with the high-fidelity simulator (Algorithm 3). This improvement represents a decrease in the average computational cost for a given effective sample size, but it does not improve the convergence rate. However, the multifidelity approach does not incur a significant overhead for selecting the continuation probabilities, as an adaptive scheme can be applied (see Appendix A).
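The weight computation at the heart of the multifidelity estimator (Equation (10)) can be sketched as follows. The expensive simulation is wrapped in a thunk so that it only runs when the Bernoulli continuation variable fires; all names are our own illustrative choices.

```python
import numpy as np

def mf_abc_weight(accept_lo, simulate_hi_accept, eta1, eta2, rng):
    """Multifidelity ABC weight for a single prior draw.
    accept_lo: acceptance indicator of the low-fidelity simulation.
    simulate_hi_accept: zero-argument function that runs the expensive
    high-fidelity simulation and returns its acceptance indicator; it is
    only called when the continuation variable Z fires.
    eta1 / eta2: continuation probabilities after low-fidelity
    acceptance / rejection."""
    w = float(accept_lo)
    eta = eta1 if accept_lo else eta2
    if rng.uniform() < eta:                 # Z ~ Bernoulli(eta)
        accept_hi = simulate_hi_accept()
        w += (accept_hi - accept_lo) / eta  # probabilistic bias correction
    return w

# When the low- and high-fidelity acceptance states agree, the correction
# vanishes and the weight is simply 0 or 1; disagreements produce the
# (possibly negative) reweighted corrections that remove the bias.
rng = np.random.default_rng(0)
w = mf_abc_weight(True, lambda: True, eta1=0.5, eta2=0.1, rng=rng)
```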
2.4 Multifidelity MLMC for ABC inference
The MLMC and multifidelity approaches to ABC inference, MLMC-ABC and MF-ABC, obtain computational improvements in distinct ways that are also complementary. MLMC-ABC (Algorithm 4) combines a sequence of ABC rejection samplers with different discrepancy thresholds [21, 58], while the stochastic simulation scheme is fixed and must be unbiased. Conversely, MF-ABC (Algorithm 5) combines two stochastic simulation schemes, with the other elements of the ABC sampler remaining largely unchanged. Computationally, MLMC-ABC improves the convergence rate as $\epsilon \to 0$ at the expense of a tuning step that incurs an additional cost, whereas MF-ABC reduces the average simulation cost without improving the convergence rate. Our novel contribution in this work is to show how these methods can be combined to exploit the computational advantages of both.
We now derive our new method, called multifidelity MLMC for ABC (MF-MLMC-ABC). Assume we have two sequences of ABC acceptance thresholds, $\epsilon_1, \epsilon_2, \ldots, \epsilon_L$ and $\tilde{\epsilon}_1, \tilde{\epsilon}_2, \ldots, \tilde{\epsilon}_L$, with $\epsilon_l > \epsilon_{l+1}$ for all $l$, and a sequence of time-step lengths, $\tau_1, \tau_2, \ldots, \tau_L$. Note that there are no constraints on the relation between the two acceptance threshold sequences, nor any requirement that the sequence of time-step lengths be strictly decreasing or even monotonic. Given a set of model parameters, $\boldsymbol{\theta}$, let $\mathcal{D}_s$ and $\tilde{\mathcal{D}}_s^{\tau_l}$ denote output data from, respectively, exact stochastic simulation (i.e., using the Gillespie direct method) and approximate stochastic simulation with time step $\tau_l$ (i.e., using the tau-leaping method). Finally, let $\rho$ and $\tilde{\rho}$ be discrepancy metrics used to compare observed data with, respectively, exact simulation output, $\mathcal{D}_s$, or approximate simulation output, $\tilde{\mathcal{D}}_s^{\tau_l}$.
Given the above notation, for any $l = 1, 2, \ldots, L$ we can apply MF-ABC sampling using weights,

$$w_l^{(j)} = \tilde{I}_l^{(j)} + \frac{Z_l^{(j)}}{\eta_l(\tilde{\mathcal{D}}_s^{(j)})}\left(I_l^{(j)} - \tilde{I}_l^{(j)}\right), \qquad (14)$$

where $\tilde{I}_l^{(j)}$ and $I_l^{(j)}$ indicate acceptance of the approximate simulation at threshold $\tilde{\epsilon}_l$ and of the exact simulation at threshold $\epsilon_l$, respectively, with continuation probability function

$$\eta_l(\tilde{\mathcal{D}}_s) = \eta_{1,l} \mathbb{1}\!\left[\tilde{\rho}(\mathcal{D}, \tilde{\mathcal{D}}_s) \le \tilde{\epsilon}_l\right] + \eta_{2,l} \mathbb{1}\!\left[\tilde{\rho}(\mathcal{D}, \tilde{\mathcal{D}}_s) > \tilde{\epsilon}_l\right], \qquad (15)$$
for constants $\eta_{1,l}, \eta_{2,l} \in (0, 1]$, for each $l = 1, 2, \ldots, L$. Based on the results of Prescott and Baker, an expectation estimated this way is unbiased with respect to the ABC posterior under the exact stochastic simulation and discrepancy metric. This provides a connection to the MLMC-ABC telescoping summation in Equation (8). Therefore, we can apply the MF-ABC estimator (Equation (10)), using weights defined by Equation (14), to each of the terms in the MLMC-ABC telescoping summation (Equation (8)) and thereby arrive at the MF-MLMC-ABC estimator,

$$\hat{\mathbb{E}}\left[f(\boldsymbol{\theta}_L)\right] = \frac{\sum_{j=1}^{N_1} w_1^{(j)} f(\boldsymbol{\theta}_1^{(j)})}{\sum_{j=1}^{N_1} w_1^{(j)}} + \sum_{l=2}^{L} \left[\frac{\sum_{j=1}^{N_l} w_l^{(j)} f(\boldsymbol{\theta}_l^{(j)})}{\sum_{j=1}^{N_l} w_l^{(j)}} - \frac{\sum_{j=1}^{N_l} w_l^{(j)} f(\hat{\boldsymbol{\theta}}_{l-1}^{(j)})}{\sum_{j=1}^{N_l} w_l^{(j)}}\right], \qquad (16)$$

where each $\hat{\boldsymbol{\theta}}_{l-1}^{(j)}$ is constructed from $\boldsymbol{\theta}_l^{(j)}$ and the estimated marginal distribution functions obtained from the previous terms, as given in Equation (9), to implement an approximate coupling between levels [21, 58]. Due to the properties of MF-ABC, Equation (16) is an unbiased estimator of Equation (8), and therefore an unbiased estimator of $\mathbb{E}[f(\boldsymbol{\theta}_L)]$, up to the approximate coupling scheme.
We therefore arrive at the MF-MLMC-ABC method presented in Algorithm 6. Note that the proposed approach, just as with MLMC-ABC and MF-ABC, requires a number of algorithmic hyperparameters to be selected appropriately to ensure efficient sampling. In the next section, we discuss theoretical results that guide how these parameters should be selected, and in the Results section we provide examples of how they can be practically tuned.
2.5 Optimal algorithm configuration
There are several important algorithmic hyperparameters that must be appropriately chosen to practically apply the MF-MLMC-ABC method, each of which influences its accuracy and performance. In this section, we step through these algorithmic hyperparameters and provide theoretical results to optimally configure the method; practical guidelines for all components of the MF-MLMC-ABC method are provided in the next section. The algorithmic hyperparameters that require optimisation are the number of levels, $L$, the form of the sequences $\epsilon_1, \ldots, \epsilon_L$ and $\tau_1, \ldots, \tau_L$, the sequence of sample numbers to draw from each level, $N_1, \ldots, N_L$, and the sequence of continuation probabilities, $(\eta_{1,l}, \eta_{2,l})$ for $l = 1, \ldots, L$. Of these, $L$ and the sequences $\epsilon_1, \ldots, \epsilon_L$ and $\tau_1, \ldots, \tau_L$ need to be selected heuristically; however, for any given such choices, the sequences $N_1, \ldots, N_L$ and $(\eta_{1,l}, \eta_{2,l})$ may be optimised.
For a given level $l$, and assuming $\epsilon_l$, $\tilde{\epsilon}_l$, and $\tau_l$ are selected, the optimal $\eta_{1,l}$ and $\eta_{2,l}$ can be determined by optimising the limiting efficiency as the number of samples $N_l \to \infty$. Prescott and Baker show that this corresponds to minimising the product of the expected cost per sample and the second moment of the weights,

$$\phi(\eta_{1,l}, \eta_{2,l}) = \mathbb{E}\left[c(w_l)\right] \mathbb{E}\left[(w_l)^2\right], \qquad (18)$$

where $c(w_l)$ is the computational cost of computing the weight (Equation (14)). If the acceptance state of an approximate simulation is interpreted as a classifier for the predicted acceptance state of an exact simulation, and the true positive rate exceeds the false positive rate, then Equation (18) can be minimised for $(\eta_{1,l}, \eta_{2,l}) \in (0, 1]^2$ (see Lemmas 4.2 and 4.3 in Prescott and Baker). The optimal continuation probabilities, $\eta_{1,l}^*$ and $\eta_{2,l}^*$, are then given in terms of these classification rates and the relative simulation costs (Equations (19)–(21); see Corollary 4.4 in Prescott and Baker).
To optimise the sequence of sample numbers, $N_1, \ldots, N_L$, we aim to minimise the total expected computational cost of computing the MF-MLMC-ABC estimator, that is, $\sum_{l=1}^{L} N_l C_l$, subject to the constraint $\sum_{l=1}^{L} V_l / N_l \le h$, where $h$ is the target variance. Using a Lagrange multiplier, it can be shown (see Giles, Lester et al., and Warne et al.) that the following scaling is optimal,

$$N_l = \left\lceil \frac{1}{h} \sqrt{\frac{V_l}{C_l}} \sum_{m=1}^{L} \sqrt{V_m C_m}\, \right\rceil, \qquad (22)$$

where $C_l$ is the expected computational cost of the $l$th term of the estimator and $V_l$ is the variance of the $l$th bias correction term.
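The Lagrange-multiplier allocation of Equation (22) is straightforward to implement once per-level cost and variance estimates are available. A sketch, with illustrative numbers:

```python
import numpy as np

def optimal_allocation(V, C, h):
    """Optimal MLMC sample sizes: minimise the total cost sum_l N_l C_l
    subject to the total-variance constraint sum_l V_l / N_l <= h, giving
    N_l proportional to sqrt(V_l / C_l)."""
    V, C = np.asarray(V, dtype=float), np.asarray(C, dtype=float)
    scale = np.sum(np.sqrt(V * C)) / h
    return np.ceil(scale * np.sqrt(V / C)).astype(int)

# A cheap, high-variance level receives many samples; an expensive,
# low-variance correction level receives few.
N = optimal_allocation(V=[4.0, 1.0], C=[1.0, 4.0], h=1.0)   # -> [8, 2]
```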
2.6 Practical algorithm tuning
A practical choice for the selection of most components of MF-MLMC-ABC is immediately available from the target ABC-based inference problem; for example, the exact stochastic simulation process, the prior probability density, $p(\boldsymbol{\theta})$, the discrepancy metric, $\rho$, and the target acceptance threshold, $\epsilon_L = \epsilon$. Other choices are easily motivated. The largest acceptance threshold, $\epsilon_1$, can be chosen so that the acceptance rate is high and the level-one ABC posterior is close to the prior. The approximate stochastic simulation scheme could be chosen from a range of candidates, but a first-order method such as tau-leaping will be an appropriate default choice in many cases. Given a tau-leaping scheme for the approximate model, it will often be appropriate to take $\tilde{\rho} = \rho$, and therefore $\tilde{\epsilon}_l = \epsilon_l$ for all $l$. Of course, there is freedom and flexibility in all of these choices, and the discussion section highlights some potentially useful alternative strategies.
Next, the number of levels, $L$, needs to be determined. Unfortunately, there is no general theory for this choice. However, one common approach from the MLMC literature is to consider a geometric sequence of thresholds, $\epsilon_{l+1} = \epsilon_l / m$ for a scale factor $m > 1$, and then set $L$ so that $\epsilon_L = \epsilon$. Some heuristics do exist to determine an appropriate scale factor $m$. For MLMC applied to SDEs, Giles demonstrated an optimal choice of $m$; however, due to the approximate coupling scheme for MLMC-ABC, Warne et al. [21, 58] propose a smaller value of $m$ as a practical choice for inference.
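The geometric threshold heuristic amounts to repeatedly dividing by the scale factor until the target threshold is reached; a short illustrative sketch:

```python
def threshold_sequence(eps1, eps_target, m):
    """Geometric discrepancy thresholds eps_{l+1} = eps_l / m, stopping as
    soon as the target threshold is reached; the length of the returned
    list determines the number of levels L."""
    eps = [eps1]
    while eps[-1] > eps_target:
        eps.append(eps[-1] / m)
    return eps

# e.g. from eps_1 = 16 down to a target threshold of 1 with m = 2
eps = threshold_sequence(16.0, 1.0, 2.0)   # [16.0, 8.0, 4.0, 2.0, 1.0]
```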
The choice of the sequence of time steps, $\tau_1, \ldots, \tau_L$, can be guided by Equations (19)–(21). For each level $l$, $\tau_l$ must be small enough that the approximate simulation is a good predictor of the acceptance state of the exact simulation, but also large enough that the computational advantage is great enough to compensate for the mis-classifications. Ultimately, some experimentation is required to compare the differences in acceptance state between pairs of exact and approximate simulations. Fortunately, it is possible to identify poor choices of $\tau_l$, since the optimal continuation probabilities will then be close to one and the multifidelity step becomes less efficient than standard rejection sampling. It is also important to note that $\tau_1, \ldots, \tau_L$ need not be a strictly decreasing or even monotonic sequence, and we demonstrate in the results section that a constant time step, $\tau_l = \tau$ for all $l$, works quite well for many applications. Furthermore, introducing a coupling scheme between the exact and approximate simulations [42, 48, 49] can reduce the mis-classification rates for larger values of $\tau_l$ and lead to improved performance.
In practice, Equation (19) is solved for the optimal continuation probabilities through the generation of initial trial samples. In this work, we extend this through adaptive updates to $\eta_{1,l}$ and $\eta_{2,l}$ while generating samples at level $l$. For the trial samples, we generate samples using Algorithm 5 with the continuation probabilities set to one; initial estimates of the expectations in Equation (20) are then produced through direct Monte Carlo estimates. Next, gradient descent is applied to iteratively update $\eta_{1,l}$ and $\eta_{2,l}$ toward the optimum of Equation (19) while also refining the estimates in Equation (20). While it is possible to iteratively refine the solution to Equation (19) directly as the sampling proceeds, we find that this is extremely sensitive to the initial estimates. Therefore, we utilise our adaptive gradient descent MF-ABC sampler (see Appendix A) as a robust alternative to Algorithm 5.
Finally, to apply MF-MLMC-ABC, the sequence of sample numbers, $N_1, \ldots, N_L$, is needed. Fortunately, Equation (22) can be rewritten in terms of quantities that depend on the continuation probabilities, which highlights that the optimal sequence of sample numbers depends on the optimal continuation probabilities obtained using the adaptive gradient descent MF-ABC scheme at each level. Therefore, we can estimate the terms required for the optimal $N_1, \ldots, N_L$ directly from the same trial samples used to estimate the optimal $\eta_{1,l}$ and $\eta_{2,l}$.
The MF-ABC and MLMC-ABC methods exploit the multilevel telescoping summation and the properties of approximate stochastic simulation and ABC sampling in distinct and complementary ways. In the first instance, MF-ABC improves on the expected cost of stochastic simulation in ABC rejection sampling using a randomised bias correction term [42, 75]. MLMC-ABC assumes exact stochastic simulation for ABC rejection sampling, but applies MLMC techniques to a sequence of correlated samplers with decreasing discrepancy thresholds, with the effect of improving the convergence rate [57, 58]. We develop a new method, MF-MLMC-ABC, that results from the application of MF-ABC sampling to each of the terms in the MLMC-ABC telescoping summation. In the next section, we demonstrate how to tune these methods practically and show that the computational benefits of both can be exploited to achieve improvements of two orders of magnitude.
Using a variety of biologically relevant stochastic biochemical reaction network models, we demonstrate the substantial computational improvements using our new MF-MLMC-ABC method (Algorithm 6). First we consider the properties of the MLMC-ABC (Algorithm 4) and MF-ABC (Algorithm 5) methods for a fundamental biochemical building block, the Michaelis–Menten model, to show practically how to tune these methods. Then we apply these guidelines to tune our new MF-MLMC-ABC method for the repressilator gene regulatory network to show the performance benefits over MF-ABC and MLMC-ABC. Finally, we perform a realistic test on the two-step MAPK cascade network that is of fundamental importance in cell biology. For two computationally challenging networks we show that MF-MLMC-ABC effectively combines the advantages of both MF-ABC and MLMC-ABC to accelerate ABC rejection sampling by two orders of magnitude.
3.1 Initial explorations of MF-ABC and MLMC-ABC: Michaelis–Menten kinetics
Using a stochastic network of Michaelis–Menten enzyme kinetics [77, 78], we demonstrate the essential requirements and computational benefits of the MF-ABC and MLMC-ABC methods, in order to inform the configuration of MF-MLMC-ABC for more challenging networks. The Michaelis–Menten enzyme kinetics model describes the catalytic conversion of a substrate, $S$, into a product, $P$, via an enzymatic reaction involving an enzyme, $E$, and an intermediate enzyme–substrate complex, $C$:

$$E + S \xrightarrow{k_1} C, \qquad C \xrightarrow{k_2} E + S, \qquad C \xrightarrow{k_3} E + P,$$

with kinetic rate parameters $k_1$, $k_2$, and $k_3$. Biologically, this network is of interest since many intracellular processes are built from Michaelis–Menten sub-components. Computationally, the Michaelis–Menten model is a minimal example of a network without a closed-form solution to the CME; however, with only three rate parameters and four chemical species, ABC inference is feasible even with rejection sampling.
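As a concrete sketch, the stoichiometries and mass-action propensities of this network can be encoded as follows; the rate values used here are illustrative only, not fitted or taken from any particular study.

```python
import numpy as np

# Michaelis-Menten network: E + S -> C, C -> E + S, C -> E + P.
# Species order: [E, S, C, P]; one row per reaction.
nu_minus = np.array([[1, 1, 0, 0],   # reactants of E + S -> C
                     [0, 0, 1, 0],   # reactants of C -> E + S
                     [0, 0, 1, 0]])  # reactants of C -> E + P
nu_plus = np.array([[0, 0, 1, 0],
                    [1, 1, 0, 0],
                    [1, 0, 0, 1]])
nu = nu_plus - nu_minus              # net state changes used in the CME

def mm_propensities(x, k):
    """Mass-action propensities; k = (k1, k2, k3) are illustrative rates."""
    e, s, c, p = x
    return np.array([k[0] * e * s, k[1] * c, k[2] * c])

a = mm_propensities([10, 100, 0, 0], k=(0.001, 0.005, 0.01))   # [1.0, 0.0, 0.0]
```

Note that every row of `nu` conserves the total enzyme count E + C, as expected for a catalyst.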