# Exact Bayesian inference for discretely observed Markov Jump Processes using finite rate matrices

We present new methodologies for Bayesian inference on the rate parameters of a discretely observed continuous-time Markov jump processes with a countably infinite state space. The usual method of choice for inference, particle Markov chain Monte Carlo (particle MCMC), struggles when the observation noise is small. We consider the most challenging regime of exact observations and provide two new methodologies for inference in this case: the minimal extended state space algorithm (MESA) and the nearly minimal extended state space algorithm (nMESA). By extending the Markov chain Monte Carlo state space, both MESA and nMESA use the exponentiation of finite rate matrices to perform exact Bayesian inference on the Markov jump process even though its state space is countably infinite. Numerical experiments show improvements over particle MCMC of between a factor of three and several orders of magnitude.

## Authors

• 13 publications
• 8 publications
• ### Parameter elimination in particle Gibbs sampling

Bayesian inference in state-space models is challenging due to high-dime...
10/30/2019 ∙ by Anna Wigren, et al. ∙ 6

• ### Scalable Bayesian Inference for Population Markov Jump Processes

Bayesian inference for Markov jump processes (MJPs) where available obse...
04/17/2019 ∙ by Iker Perez, et al. ∙ 0

• ### The Mixture of Markov Jump Processes: Monte Carlo Method and the EM Estimation

This paper discusses tractable development and statistical estimation of...
12/19/2018 ∙ by H. Frydman, et al. ∙ 0

• ### Rare Event Simulation for Steady-State Probabilities via Recurrency Cycles

We develop a new algorithm for the estimation of rare event probabilitie...
04/05/2019 ∙ by Krzysztof Bisewski, et al. ∙ 0

• ### Unbiased Bayesian Inference for Population Markov Jump Processes via Random Truncations

We consider continuous time Markovian processes where populations of ind...
09/28/2015 ∙ by Anastasis Georgoulas, et al. ∙ 0

• ### Exact and computationally efficient Bayesian inference for generalized Markov modulated Poisson processes

Statistical modeling of point patterns is an important and common proble...
06/17/2020 ∙ by Flavio B. Gonçalves, et al. ∙ 0

• ### Bayesian inference based process design and uncertainty analysis of simulated moving bed chromatographic systems

Prominent features of simulated moving bed (SMB) chromatography processe...
11/27/2019 ∙ by Qiao-Le He, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The usual method of choice for exact inference on discretely observed Markov jump processes (MJPs) on a countably infinite state space is particle Markov chain Monte Carlo (particle MCMC, Andrieu et al., 2010) using a bootstrap particle filter (e.g. Andrieu et al., 2009; Golightly and Wilkinson, 2011; McKinley et al., 2014; Owen et al., 2015; Koblents and Miguez, 2015; Wilkinson, 2018). Although the particle Gibbs algorithm could also be used for such inference, we could find no examples of its use in this setting. Paths from the prior distribution of the process are simulated and then resampled according to weights that depend on the likelihood of the next observation given the simulated path. Typically, as the precision of an observation increases, its compatibility with most of the paths plummets, leading to low weights, and the efficiency of bootstrap particle MCMC decreases substantially. We consider the situation that is most challenging of all for a particle filter: when the observations are exact. Recently, paths proposed from alternative stochastic processes which take the next observation into account have successfully mitigated against this issue within particle MCMC Golightly and Wilkinson (2015); Golightly and Sherlock (2019), albeit at an increased computational cost. The first of these provides the primary particle-filter comparator for the very different inference methodology that we will introduce.

Our particular interest lies in Markov jump processes (MJPs) which arise from reaction networks: stochastic models for the joint evolution of one or more populations of species. These species may be biological species (e.g. Wilkinson, 2018), animal species (e.g. Drovandi and McCutchan, 2016), interacting groups of individuals at various stages of a disease (e.g. Andersson and Britton, 2000), or counts of sub-populations of alleles Moran (1958), for example. The state of the system is encapsulated by the number of each species that is present, and the system evolves via a set of reactions whose rates depend on the current state. Section 1.1 describes three examples of reaction networks. The number of possible ‘next’ states given the current state is bounded by the number of reactions, which is typically small; thus the infinitesimal generator of the process, which can be viewed as a countably infinite ‘matrix’, is sparse. The methods which we develop in this article can be applied to any MJP, but they are particularly effective when the generator of the MJP is sparse.

The likelihood for a discretely observed continuous-time Markov chain with a large but finite statespace and a rate matrix (or infinitesimal generator)

is the product of a set of transition probabilities, each of which requires the evaluation of

for an inter-observation time

and a non-negative vector

representing the state at an observation time. Fast algorithms for exactly this calculation, some specifically designed for sparse , are available (e.g. Sidje, 1998; Sidje and Stewart, 1999; Moler and Van Loan, 2003; Al-Mohy and Higham, 2011) and some are applicable even when the number of possible states, , is in the hundreds of thousands; however, many processes of interest have a countably infinite number of states, and an exact, matrix-exponential approach might appear impossible for such systems. Refuting this conjecture, Georgoulas et al. (2017) describes an ingenious pseudo-marginal MCMC algorithm Andrieu and Roberts (2009) that uses random truncations (e.g. McLeish, 2011; Glynn and Rhee, 2014) and coffin states to explore the parameter posteriors for MJPs with infinite statespaces using only exponentials of finite rate matrices. Unlike other pseudo-marginal algorithms which use random truncation (e.g. Lyne et al., 2015), the algorithm of Georgoulas et al. (2017)

is guaranteed to produce unbiased estimates of the likelihood which are non-negative. The algorithm, however, suffers from issues, mainly arising from the need to use a proposal distribution for the truncation level (see Section

2.2) and as a result it is less efficient than the most appropriate particle MCMC algorithm (see Section 4).

We describe the minimal and nearly minimal extended statespace algorithms, MESA and nMESA, inspired by the key novel idea in Georgoulas et al. (2017). These are fast and efficient algorithms for exact inference on Markov jump processes with infinite statespaces through exponentiation of finite-dimensional rate matrices. Essentially, a sequence of nested regions is defined, , with , the statespace of the MJP. The statespace of the MCMC Markov chain is then extended to include , the smallest region that contains the MJP, and the corresponding extended posterior can be calculated using only finite rate matrices. In the examples we investigate we find that MESA and nMESA are anything from a factor of to several orders of magnitude more efficient than the most efficient particle MCMC algorithm.

We conclude this section with three motivating examples of reaction networks; these three examples will be used to benchmark our algorithms, the algorithm of Georgoulas et al. (2017) and particle MCMC in Section 4. In Section 2 we describe the algorithm of Georgoulas et al. (2017), separating out the key idea of nested regions which is shared by our algorithms. Section 3 describes MESA and nMESA themselves, and the article concludes in Section 5 with a discussion.

### 1.1 Examples and motivation

To motivate the importance of inference on reaction networks we now present: the Lotka-Volterra predator-prey model, the Schlögel model, which is one of the simplest bistable networks, and a model for an auto-regulatory gene network. In all models the state vector consists of the (non-negative) counts of one or more physical, chemical or biological species. We will perform inference on these reaction networks in our simulation study in Section 4.

###### Examples 1.

The Lotka-Volterra predator-prey model (e.g. Boys et al., 2008). Two species, predators, , and prey, , with counts of and respectively evolve and interact through the following three reactions (with associated rates):

 Pred θ1X1⟶∅ Prey θ2X2⟶2Prey Pred+Prey θ3X1X2⟶2Pred.
###### Examples 2.

The Schlögel model (e.g. Vellela and Qian, 2009) has two stable meta states, and the frequency of transitions between the meta states is much lower than the frequency of transitions between states within a single meta state. The interactions between the single species, whose frequency is , and two chemical ‘pools’, and are:

 A+2Xθ1X(X−1)/2⟶3X,Bθ3⟶X\par3Xθ2X(X−1)(X−2)/6⟶2X+A,Xθ4X⟶B.
###### Examples 3.

The autoregulatory gene network of Golightly and Wilkinson (2005) models the production of from and of a protein from , as well as the extinction of both and , the reversible dimerisation of and the reversible binding of the dimer, to , where the binding inhibits production of . The total number of copies of , , is fixed, and the reactions are:

 DNA+P2θ1(G−X4)X3⟶DNA⋅P2,DNAθ3(G−X3)⟶DNA+RNA,DNA⋅P2θ2X4⟶DNA+P2,RNAθ4X1⟶RNA+P,2Pθ5X2(X2−1)/2⟶P2,RNAθ7X1⟶∅,P2θ6X3⟶2P,Pθ8X2⟶∅,

where denote the counts of , , , and respectively.

The potentially countably infinite set of possible states of a reaction network can be placed in one-to-one correspondence with the non-negative integers. The th entry of the corresponding rate ‘matrix’ () is the rate for moving from state to state .

### 1.2 Notation

Throughout this article, a scalar operation applied to a vector means that the operation is applied to each element of the vector in turn, leading to a new vector, e.g., for the -vector , . We denote the vector of s by . The symbol denotes the scalar or the vector or matrix of s as dictated by the context.

There is a one-to-one correspondence between any vector state , such as numbers of predators and prey in Example 1, and the associated non-negative integer state, which we denote by . Throughout this article, for simplicity of presentation, we abuse notation by abbreviating the element of a matrix , strictly , to .

## 2 Inference for MJPs with infinite statespaces using the rate matrix

For simplicity of presentation we assume a known initial condition, , though the methodology is trivially generalisable to an initial distribution. As in Georgoulas et al. (2017), we then consider the observation regime where particle filters typically struggle most: exact counts of all species are observed at discrete points in time, ; for simplicity, throughout, we present the case where for some inter-observation interval .

Throughout this article, the vector of positive reaction-rate parameters is denoted by , and Bayesian inference is performed on , to which a general prior is assigned.

For a finite-statespace Markov chain, whilst the rate matrix, , is the natural descriptor of the process, the likelihood for typical observation regimes involves the transition matrix, , the th element of which is exactly . By the Markov property, the likelihood for the exact observations is then

 L(ψ;x1:n)=n∏i=1[eQ(ψ)t]xi−1,xi. (1)

The above likelihood is used within the algorithm of Georgoulas et al. (2017), and within MESA and nMESA. All three algorithms share the same construction of nested regions which enables the use of (1) and which we now describe.

### 2.1 Set up for countably infinite state spaces

Let the MJP, , have a statespace of , start from an initial point and be observed precisely at time : . Consider an infinite sequence of regions, with and ; we permit equality so that the description also applies to MJPs with finite state spaces. Furthermore, should be chosen such that . Let be the number of states in .

Let be the infinitesimal generator for the MJP on . For finite , let be the submatrix of that involves transitions from to , and let be the be the rate-matrix for transitions inside except that it has an additional coffin state, , for any transitions of the full chain that would exit . Specifically

 Qr:=[Q(Rr,Rr)c00],

where, here and henceforth, denotes the scalar or vector (as appropriate) such that . We will, in fact, have a different sequence of regions defined for each inter-observation interval. We will denote the th region for the th inter-observation interval by and the associated transition matrix by . For clarity of exposition, we will suppress the superscript whenever it is not needed.

For each region , there is a one-to-one map and we add that . Using the shorthand of for

we define the Bernoulli random variables:

 B(X;t1,t2,R) := 1{Xs∈R ∀s∈[t1,t2]} (r≥0), Cr(X;t1,t2) := B(X;t1,t2,Rr)−B(X;t1,t2,Rr−1)   (r≥1) = 1{Xs∈Rr ∀s∈[t1,t2] % and ∃ s∈[t1,t2] such that Xs∉Rr−1}.

### 2.2 The method of Georgoulas et al. (2017)

In Georgoulas et al. (2017), henceforth abbreviated to GHS17, the random-truncation method of McLeish (2011) and Glynn and Rhee (2014) leads to an unbiased estimator of the likelihood of a set of observations, which feeds into a pseudo-marginal MCMC algorithm Andrieu and Roberts (2009) targetting the posterior . Unlike other uses of random truncation within MCMC (e.g. Lyne et al. (2015); see also Jacob and Thiery (2015)), however, the unbiased estimator of GHS17 can never be negative. We first briefly describe the random truncation method, before detailing the algorithm of GHS17.

Let be a sequence, with . Let be sampled from some mass function. Then

 ^Z:=z0+R∑j=1zj−zj−1P(R≥j)

is an unbiased estimator of . Subject to the condition of Fubini’s Theorem, the order of sum and expectation can be exchanged, so

 E[^Z]=z0+∞∑j=1(zj−zj−1)P(R≥j)E[I(R≥j)],

and the result follows from the telescoping sum.

At each iteration of the algorithm of GHS17, a value for is sampled at random from some discrete probability mass function . In GHS17, for some and with .

Since , . Thus, for two consecutive observations of the state, and , separated by a time , the quantity

 ^L(ψ;x,x′,r)=r∑j=11P(R≥j){[eQj(ψ)t]x,x′−[eQj−1(ψ)t]x,x′}, (2)

is a realisation from an unbiased estimator for the likelihood contribution (here ).

One estimator of the form (2), with its own independently sampled , is created for each inter-observation interval, and then provides a realisation from an unbiased estimator for the full likelihood.

Given a current position and a set of region indices we have a realisation of an unbiased likelihood estimate . One iteration of the pseudo-marginal algorithm of GHS17 proceeds as follows: first, propose a new position from some density then sample independently from the mass function to obtain a realisation, of an unbiased estimator for . This unbiased likelihood estimate is then used in a pseudo-marginal Metropolis-Hastings accept-reject step for :

 α(ψ,ψ′)=1∧π0(ψ′)^L(ψ′;r′1:n)q(ψ|ψ′)π0(ψ)^L(ψ;r1:n)q(ψ′|ψ).

The pseudo-marginal algorithm can be viewed as a Metropolis-Hastings Markov chain on and with a target distribution proportional to

 π0(ψ)^L(ψ;r1:n)n∏i=1p(ri),

and a proposal of . Because is unbiased, integrating out all of the auxiliary variables from the target leaves , as desired.

Typically, likelihood estimates obtained using random-truncation suffer from the potentially severe problem that, because they arise from a sequence of differences, they might be negative and hence unusable in an accept-reject step (see e.g. Lyne et al. (2015) and Jacob and Thiery (2015)). However, for an observation at and at ,

 [eQr(ψ)t]x,x′−[eQr−1(ψ)t]x,x′ =P(X1=x′,B(X;0,t,Rr)=1|X0=x,ψ) −P(X1=x′,B(X;0,t,Rr−1)=1|X0=x,ψ) (3) =P(X1=x′,Cr(X;0,t)=1|X0=x,ψ), (4)

which is non-negative; so, by construction, a negative likelihood estimate is impossible.

Although it can never be negative, the random truncation algorithm in (2) suffers from several related problems. The proposal should somehow reflect the patterns in the terms in the sum in (2): if as

then the unbiased estimate will be unstable and have a high, or even infinite, variance; if, on the other hand the ratio goes to zero then unnecessarily large regions will frequently be used, resulting in the exponentiation of unnecessarily large rate matrices with unnecessarily high rates, increasing the computational expense. The heuristic is very much akin to the requirements for a good independence sampler proposal and, indicates, the potential for non-geometrically ergodic behaviour that could result from a poor choice

(e.g. Roberts and Rosenthal, 2011).

In GHS17 it is stated that the distributional form of , was chosen partly since it describes the steady state of many simple queuing systems. However, the M/M/1 queue, for example, has a geometric stationary distribution (e.g. Grimmett and Stirzaker, 2001, Ch.11). Moreover , so the tails of are very light compared, for example, to geometric tails. Even if a heavier-tailed proposal were used, however, there is no obvious choice for its shape, or reason to believe the shape would be consistent across inter-observation intervals. Further, some species might have a different spread than others, requiring differently shaped regions. Finally, there is every reason to believe that this shape would depend on as discussed in the following remark.

###### Remark 1.

Consider, for example, a Lotka-Volterra model with the true rates divided by , which is equivalent to slowing down time by a factor of ; given the start time and the end time, the most likely paths would be those with close to a minimal number of events to get from to , so ; on the other hand, a large increase in all of the rates would see an approximately quasi-stationary distribution for most of the interval between and so larger regions would be more likely.

We will reformulate the likelihood, creating an explicit extended statespace and giving a different distribution for and ; as a result, there is no division by and, indeed, no requirement for a generic proposal . The potential for different amounts of variation between species and across intervals is allowed for by letting the shape of the cuboidal regions vary and for the nature of the cuboids themselves to vary between observations, all governed by two tuning parameters.

## 3 New Algorithms

We employ the same idea of a sequence of nested regions as in GHS17, with one sequence for each inter-observation interval. The nearly minimal extended statespace algorithm (nMESA) has one auxiliary variable per interval as in GHS17, whereas the minimal extended statespace algorithm (MESA) has a single auxiliary variable. In each case, however, instead of introducing the auxiliary variable(s) via random truncation we explicitly extend the statespace from to include the index of the outermost region visited by over the observation window (MESA) or between each pair of observations (nMESA), and perform MCMC directly on this extended statespace.

### 3.1 New regions

For simplicity, each region, , is cuboid. In GHS17, for an interval between observations of and , is the smallest cuboid that contains both and . However this cuboid does not necessarily allow a path between and . For example, since it has no reaction that increases predator numbers by , a Lotka-Volterra system starting with and ending with must have left the rectangle defined by the start and end states. We therefore define a minimum window-width parameter , which specifies, for Region , the smallest width for each species; if, for any species, the smallest region that contains the two observations is narrower than then the recursions below are performed until this is no longer the case. Subsequent regions are obtained recursively from the previous region by widening the bounds for each species, with the increase for a given species proportional to the current number of possible states for that species. Let the upper and lower bounds for species in region for inter-observation interval be and . Then

 u(j)k+1,s =us∧{u(j)k,s+[1∨γ(u(j)k,s−l(j)k,s+1)]}, l(j)k+1,s =ls∨{l(j)k,s−[1∨γ(u(j)k,s−l(j)k,s+1)]}

where and represent hard constraints on the species, such as non-negativity (), and where is a tuning parameter. The regions in GHS17 are a special case of the above formulation, with .

### 3.2 The minimal extended statespace and target

For the observation regime given at the start of Section 2, define

 ~Br(X):=n∏i=1B(X;ti−1,ti,R(i)r) (r≥0)   and   ~Cr(X):=~Br(X)−~Br−1(X) (r≥1).

Thus if in each inter-observation interval the process is confined to region for that interval but in at least one inter-observation interval it is not confined to that interval’s region . Letting , the smallest region index such that the whole path is contained within the regions , we target the extended posterior

 ~π(ψ,r)∝π0(ψ)P(x1:n,~r(X)=r∣ψ,x0),

where

 P(x1:n,~r(X)=r∣ψ,x0)=n∏i=1[eQr(ψ)t]xi−1,xi−n∏i=1[eQr−1(ψ)t]xi−1,xi.

The marginal for is , as required.

### 3.3 Nearly minimal extended statespace and target

Let be, for the th inter-observation interval, the first region to completely contain the MJP over that interval. We target the extended posterior

 ~π(ψ,r1:n)∝π0(ψ)n∏i=1P(xi,~ri(X)=ri∣ψ,xi−1)

where

 P(xi,~ri(X)=ri∣ψ,xi−1)=[eQr(ψ)t]xi−1,xi−[eQr−1(ψ)t]xi−1,xi.

As for the MESA, the marginal is the desired posterior, in this case since

 ∑r1:nn∏i=1P(xi,~ri(X)=ri∣ψ,xi−1)=n∏i=1∞∑ri=1P(xi,~ri(X)=ri∣ψ,xi−1)=n∏i=1P(xi∣ψ,xi−1).

### 3.4 The MCMC algorithms

Both MESA and nMESA use Metropolis-within-Gibbs algorithms. First, either (MESA) or (nMESA) is updated, then, respectively, or . For each algorithm, the update can, in principle, use any MCMC move that works on a continuous statespace; for simplicity and robustness we employ the random walk Metropolis. We propose

from a multivariate normal distribution centred on

and with a variance matrix proportional to the variance of obtained from an initial tuning run. For MESA we update using a discrete random walk, proposing or each with a probability of (when , the downward proposal is immediately rejected). For nMESA, conditional on , each component, , is updated independently via this symmetric discrete random-walk move. The random walk moves by a single region so as to save on computational cost. The new likelihood for a region involves quantities of the form and ; when is either or , one of these quantities is already available from the likelihood calculations for the current region.

Both stages of both algorithms require the computation of the exponential of at least rate matrices. As with the algorithm of GHS17, therefore, our algorithm is, well suited to parallelisation if the rate matrices and/or the largest required matrix power are large; we do not pursue this here.

## 4 Numerical comparisons

For each of the reaction networks given in Section 1.1 and a known initial condition, , we simulated a realisation from the stochastic process from time to an appropriate and then recorded the states at regular intervals so that there were observations each from a realisation of the Schlögel process and a realisation of the autoregulatory process, and observations of a Lotka-Volterra process; we name these data sets Sch50, AR50 and LV20, respectively. To investigate the effect of altering the inter-observation interval, for the Lotka-Volterra process, two further data sets, LV40 and LV10, were generated with and observations, respectively. To suppress the effect of inter-realisation variability, LV40 and LV10 were generated from the same realisation as LV20 by, repsectively, halving and doubling the inter-observation time interval; thus . Figure 1 shows the realisations of the stochastic processes from which LV20, LV40, LV10 and Sch50 arose, together with the observations in LV20 and Sch50. The realisation from the autoregulatory process and the observations AR50 are provided in Figure 3 in Appendix B.1, which also provides values for , and the true parameters for all three processes (see Table 5) as well as the prior distributions assigned to the parameters.

For each data set the output from a tuning run of iterations of nMESA was used to create an estimate, , of the variance matrix of the parameter vector, . For comparability, for all algorithms, proposals for the random walk on were of the form: , where is a realisation of a vector of standard Gaussians, is defined so that , and is a tuning parameter. The scaling, , of the random walk proposal was chosen using standard acceptance-rate heuristics (e.g. Roberts et al., 1997; Sherlock et al., 2010, 2015). The number of particles was chosen so that the variance of at points in the main posterior mass was not much above Sherlock et al. (2015); Doucet et al. (2015). No tuning advice is given in GHS17 so we proceeded by first tuning and for a fixed, sensible , and then tuning ; see Appendix B.2 for further details. Unless otherwise stated, each algorithm was run for iterations. In all cases, the first iterations were removed as burn in, as trace plots showed that this was all that was necessary.

Results for MESA are presented in terms of the CPU time in seconds, , the acceptance rate for the random walk update on the parameters, , the acceptance rate for the integer random walk update on the region, , and the number of effective samples per minute (rounded to the nearest integer) for the parameters, the region index and . The number of effective samples was calculated using effectiveSize command in the coda package in R Plummer et al. (2006). Quantities are the same for nMESA except that is the mean of the acceptance rates for the random walks on each of the region indices, and the effective samples per minute of the average region index is recorded. Neither particle MCMC nor GHS17 has a ‘region’ auxiliary variable, so the two fields for this are left blank. GHS17 strictly should have , but we also report the results for larger where this did not reduce the efficiency too much.

### 4.1 Numerical and computational issues

Calculations of the form were performed using the more efficient of two possible algorithms, chosen automatically at runtime on a case-by-case basis. The uniformisation method (e.g. Sidje and Stewart, 1999) and a variation on the scaling and squaring method (e.g. Moler and Van Loan, 2003). In our examples, is a sparse matrix where the number of entries is proportional to ; define . With this set up, the uniformisation method takes operations and has a memory footprint of , whereas the scaling and squaring method has a computational cost of and a memory footprint of . The latter was typically only used for some of the calculations for the Schlog̈el model where for some of the observation intervals, with the MESA and GHS17 algorithms, typically, but . See Appendix A for more details on the methods.

The maximum size of an unsigned integer in C++ is . Both methods of exponentiation require the evaluation of for some matrix, with no negative entries, for some integer power . Straightforward, exact (to a prescribed small tolerance) evaluation requires storing as an integer. For the Schlögel system using GHS17 or using MESA with small , for some inter-observation intervals on some iterations , sometimes considerably so. In such cases was truncated to so that inference was no longer exact, but could at least continue; in the remainder of this section we refer to this as the integer overflow problem. With an increase in code and algorithm complexity this issue could be overcome, but on the occasions when it occurred the algorithms for which it occurred, even with the inexact inference, were much less efficient than MESA with a larger or nMESA, so we did not pursue this further.

Unless stated otherwise runs were performed on a desktop machine using a single thread of a single i7-3770 core.

### 4.2 Lotka-Volterra model

Tables 1, 2 and 3 show the simulation results for a selection of the runs performed, respectively, for the LV20, LV40 and LV10 data sets. We focus on the LV20 data set, and point out any differences evident in the other two data sets.

Firstly, particle MCMC using the bridge of Golightly and Wilkinson (2015), henceforth referred to as GW15, was approximately a factor of times as efficient as the algorithm of GHS17 (factors of approximately and for LV40 and LV10 respectively). Further, runs of GHS17 with (best performance was for LV10, which is shown) were much less efficient than with . The most efficient MESA tuning was a factor of more efficient than particle MCMC (factors of approximately and for LV40 and LV10), whilst the most efficient nMESA was a factor of nearly times more efficient than particle MCMC (factors of approximately and for LV40 and LV10).

When , for each species is units wider than the . However, for regions of size , say, this is a very small relative increase in width, and as such, we might expect an associated very small probability of the process staying within . In other words, the range of region indices over which the process is likely to need to move is large. Since in MESA and nMESA the MCMC move to change region proposes either an increase or decrease of the region index by (see Section 3.4), region number mixes slowly. Since larger region indices are associated with larger reaction rates, for example, this also affects the mixing of , all of which suggests choosing . On the other hand, consider a so large that the process is almost certain to be in ; the dimension of will be approximately times the size of that of , and may contain unnecessarily large rates, making matrix exponentials expensive to calculate, yet a move to is proposed every other iteration. These heuristics suggest that there should be an optimal , and explain the variations in efficiency with for MESA and nMESA in Tables 1, 2 and 3.

Setting to the minimum allowable for the system (e.g. for the Lotka-Volterra model and for the Schlögel model) leads to the smallest sizes and the resulting matrix exponentials are very cheap to calculate. However, the larger the between-observation stochasticity the larger the region that is likely to be needed to contain the process between the observation times, yet some observations in the sequence will just happen, by chance, to be close together, so the minimal s will be too small. This is not an issue for nMESA which allows each inter-observation interval to find its own region level ; but MESA forces all inter-observation intervals to use the same region, . The disparity between some of these s very probably containing the process, and others very probably not containing the process leads to poor mixing for MESA when is at its minimum, and suggests increasing to a sensible minimum value for any interobservation interval. Increasing too far, for example beyond the range where the stochastic process is likely to lie, would lead to unnecessary computational expense, suggesting that there may be an optimal , too. Any trend (or drift) in the process between a pair of observations would be approximated by the observations differences, so the choice of should be driven by the expected stochasticity, and so should increase as the inter-observation time interval increases, as observed in Tables 1, 2 and 3.

### 4.3 Schlögel model

For the Schlögel model, nMESA is slightly more efficient than MESA, both of which are four orders of magnitude more efficient than GHS17. The particle MCMC scheme of Golightly and Wilkinson (2015) failed to converge, but a bootstrap particle filter driven scheme with a large number of particles did converge, although it was over two and a half orders of magnitude less efficient than MESA and nMESA. There are several distinct reasons for these striking results.

Firstly, for typical rate parameters, , and large values of the rates of the two reactions involving the reservoir are extremely high. Any particle filter must simulate all the reactions that occur, a number of the same order as the sum of the four reaction rates. If GHS17, MESA and nMESA had used the uniformisation method for matrix exponentiation then they would have suffered from this same issue; however for this dataset the scaling and squaring algorithm (see Section 4.1) was used, with a cost that increases with the logarithm of the of total rate. For reference, shorter runs at the optimal parameter choices, but forcing GHS17, MESA and nMESA to use the uniformisation method slowed these algorithms down by factors of , , and , meaning that MESA (resp. nMESA) would still have been three times (resp. an order of magnitude) more efficient than the bootstrap particle filter.

The bridge of Golightly and Wilkinson (2015) tries to drive the path for the stochastic process along an approximate straight line from the current position to the next observation. For transitions both between meta states and within the higher meta state this is an extremely poor approximation to the behaviour (see Figure 1). However, particle MCMC using a bootstrap particle filter did mix. As well as needing to simulate every single one of the reactions, the large computational cost arose from the filter needing an order of magnitude more particles than was used on the Lotka-Volterra examples.

Figure 1 shows that the largest changes from one observation to the next occur during transitions from one meta state to the other, but that it is in the higher meta state that the largest expansion in region coverage (from , the smallest box containing adjacent observations) is required. To fit the latter a relatively high region number (MESA with ) or a proposal distribution that leads to a relatively high region number (GHS17) is required, but this leads to a large statespace size as well as larger arising from the upper end of this large statespace. For MESA, this problem is overcome by choosing a larger .

The biggest problem with GHS17 was the requirement for the same proposal distribution for the truncation index whatever the meta-state of the process, whereas in reality the process is likely to need a high index when in the high meta state and a low index when in the low meta state. For lower values, such as (not shown), small region numbers were typically proposed, the calculations were relatively cheap but the chain mixed exceedingly poorly because of the enormous (possibly infinite) variance of the logarithm of the unbiased estimator of the likelihood through the quotient term in (2); the chains failed to converge. To bring the variance under control, the probability of proposing larger region numbers should be much higher, which would necessitate very expensive calculations because, for each inter-observation interval, the final region is larger, with even higher rates, and because there are typically more terms in the random truncation. Increasing similarly reduces the variance of the truncation estimator of the likelihood and increases the computational effort. Furthermore, once increased to or above with an large enough for visible mixing, integer overflow (see Section 4.1) became more and more frequent since the states in the larger proposed regions led to larger rates. Specifically, for the best run, which used , integer overflow occurred on of the iterations with a maximum true and values in excess of . For MESA with , integer overflow occurred on of the iterations, with a maximum true , and values up to ; no overflow occurred when . For nMESA the maximum value of was .