Fitting stochastic epidemic models to gene genealogies using linear noise approximation

by   Mingwei Tang, et al.
University of California, Irvine

Phylodynamics is a set of population genetics tools that aim at reconstructing demographic history of a population based on molecular sequences of individuals sampled from the population of interest. One important task in phylodynamics is to estimate changes in (effective) population size. When applied to infectious disease sequences such estimation of population size trajectories can provide information about changes in the number of infections. To model changes in the number of infected individuals, current phylodynamic methods use non-parametric approaches, parametric approaches, and stochastic modeling in conjunction with likelihood-free Bayesian methods. The first class of methods yields results that are hard-to-interpret epidemiologically. The second class of methods provides estimates of important epidemiological parameters, such as infection and removal/recovery rates, but ignores variation in the dynamics of infectious disease spread. The third class of methods is the most advantageous statistically, but relies on computationally intensive particle filtering techniques that limits its applications. We propose a Bayesian model that combines phylodynamic inference and stochastic epidemic models, and achieves computational tractability by using a linear noise approximation (LNA) --- a technique that allows us to approximate probability densities of stochastic epidemic model trajectories. LNA opens the door for using modern Markov chain Monte Carlo tools to approximate the joint posterior distribution of the disease transmission parameters and of high dimensional vectors describing unobserved changes in the stochastic epidemic model compartment sizes (e.g., numbers of infectious and susceptible individuals). We apply our estimation technique to Ebola genealogies estimated using viral genetic data from the 2014 epidemic in Sierra Leone and Liberia.



There are no comments yet.



A linear noise approximation for stochastic epidemic models fit to partially observed incidence counts

Stochastic epidemic models (SEMs) fit to incidence data are critical to ...

A modified Susceptible-Infected-Recovered model for observed under-reported incidence data

Fitting Susceptible-Infected-Recovered (SIR) models to incidence data is...

Inference in Stochastic Epidemic Models via Multinomial Approximations

We introduce a new method for inference in stochastic epidemic models wh...

Probabilistic predictions of SIS epidemics on networks based on population-level observations

We predict the future course of ongoing susceptible-infected-susceptible...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Phylodynamics is an area at the intersection of phylogenetics and population genetics that studies how epidemiological, immunological, and evolutionary processes affect viral phylogenies constructed based on molecular sequences sampled from the population of interest [Grenfell et al., 2004, Volz et al., 2013]. Phylodynamics is especially useful in infectious disease modeling because genetic data provide a source of information that is complimentary to the traditional disease case count data. Here we are interested in inferring parameters governing infectious disease dynamics from the genealogy/phylogeny estimated from infectious disease agent molecular sequences collected during the disease outbreak. Working in a Bayesian framework, we develop an efficient Markov chain Monte Carlo (MCMC) algorithm that allows us to work with stochastic models of infectious disease dynamics, properly accounting for stochastic nature of the dynamics.

Currently, learning about population-level infectious disease dynamics from molecular sequences can be accomplished using three general strategies. The first strategy relies on the coalescent theory — a set of population genetics tools that specify probability models for genealogies relating individuals randomly sampled from the population of interest [Kingman, 1982, Griffiths and Tavare, 1994, Donnelly and Tavare, 1995]. Using a subset of these models [Griffiths and Tavaré, 1994], it is possible to estimate changes in effective population size — the number of breeding individuals in an idealized population that evolves according to a Wright-Fisher model [Wright, 1931]. Such reconstruction can be done assuming parametric [Kuhner et al., 1998, Drummond et al., 2002] or nonparametric [Drummond et al., 2002, 2005, Minin et al., 2008, Palacios and Minin, 2013, Gill et al., 2013] functional forms of the effective population size trajectory. In the context of infectious disease phylodynamics, nonparametric inference is the norm and the estimated effective population size is often interpreted as the effective number of infections or the effective number of infectious individuals. However, reconstructed effective population size trajectories are not easy to interpret and estimation of parameters of disease dynamics is difficult to accomplish if one wishes to maintain statistical rigor [Pybus et al., 2001, Frost and Volz, 2010].

Another way to learn about infectious disease dynamics from molecular sequences is to model explicitly events that occur during the infectious disease spread and to link these events to the genealogy/phylogeny of sampled individuals using birth-death processes. For example, a Susceptible-Infectious-Removed (SIR) model includes two possible events: infections and removals (e.g., recoveries and deaths), represented by births and deaths in the corresponding birth-death model [Stadler et al., 2013, Kühnert et al., 2014]. Other SIR-like models (e.g., SI and SIS models) differ by the number and types of the events that are needed to accurately describe natural history of the infectious disease [Leventhal et al., 2013]. Although these methods are more principled than post-hoc processing of nonparametrically estimated disease dynamics, they are not easy to scale to large datasets and/or high dimensional models. For example, in order to fit phylodynamic birth-death models to genomic and epidemiological data Vaughan et al. [2018] use particle filter MCMC. However, computational burden of particle filter MCMC methods is usually very high. Moreover, these methods often struggle with convergence when the dimensionality of statistical model parameters is even moderately high [Andrieu et al., 2010].

Structured coalescent models provide the third strategy of inferring parameters governing spread of an infectious disease [Volz et al., 2009, Volz, 2012, Dearlove and Wilson, 2013]. These models assume infectious disease agent genetic data have been obtained from a random sample of infected individuals, allowing for serial sampling over time. Although similar to the birth-death modeling framework, the structured coalescent models have two advantages. First, one does not have to keep track, analytically or computationally, of extinct and not sampled genetic lineages. Second, the density of the genealogy can be obtained given the population level information about status of individuals: for example, in the SIR model it is sufficient to know the numbers of susceptible, (), infectious, (), and recovered, (), individuals at each time point . The second advantage comes with two caveats: 1) such densities can be obtained only approximately and 2) evaluating densities of genealogies is not straightforward and involves numerical solutions of differential equations. Even in cases when these caveats are manageable, the density of the assumed stochastic epidemic model population trajectory remains computationally intractable. One way around this intractability assumes a deterministic model of infectious disease dynamics [Volz et al., 2009, Volz, 2012, Volz and Pond, 2014], which potentially leads to overconfidence in estimation of model parameters. Particle filter MCMC offers another solution [Rasmussen et al., 2011, 2014], but, as we discussed already, these methods are difficult to use in practice, especially in high dimensional parameter spaces.

In this paper, we develop methods that allow us to bypass computationally unwieldy particle filter MCMC with the help of a linear noise approximation (LNA). LNA is a low order correction of the deterministic ordinary differential equation describing the asymptotic mean trajectories of compartmental models of population dynamics defined as Markov jump processes (e.g., chemical reaction models and SIR-like models of infectious disease dynamics)

[Kurtz, 1970, 1971, Van Kampen and Reinhardt, 1983]. LNA can also be viewed as a first order Taylor approximation of Markov population dynamics models represented by stochastic differential equations [Giagos, 2010, Wallace, 2010]. A key feature of the LNA method is that it approximates the transition density of a stochastic population model with a Gaussian density [Komorowski et al., 2009].

Inspired by recent applications of LNA to analysis of Google Flu Trends data [Fearnhead et al., 2014] and disease case counts [Buckingham-Jeffery et al., 2018]

, we develop a Bayesian framework that combines LNA for stochastic models of infectious disease dynamics with structured coalescent models for genealogies of infectious disease agent genetic samples. Our approach yields a latent Gaussian Markov model that closely resembles a Gaussian state-space model. We use this resemblance to develop an efficient MCMC algorithm that combines high dimensional elliptical slice sampler updates

[Murray et al., 2010] with low dimensional Metropolis-Hastings (MH) moves. Using simulations, we demonstrate that this algorithm can handle reasonably complex models, including an SIR model with a time-varying infection rate. We apply this SIR model to a recent Ebola outbreak in West Africa. Our analysis of data from Liberia and Sierra Leone illuminates significant changes in the Ebola infection rate over time, likely caused by the public health response measures and increased awareness of the outbreak in the population.

2 Methodology

2.1 Genealogy as data

We start with infectious disease agent molecular sequences obtained from infected individuals sampled uniformly at random from the total infected population. Further, we assume that a phylogenetic tree, or genealogy, relating these sequences has been estimated in such a way that the tree branch lengths respect the known sequence sampling times. Such estimation can be performed with, for example, BEAST — a leading software package for Bayesian phylogenetic studies, particularly popular among molecular epidemiologists who collect and analyze viral genetic sequences [Suchard et al., 2018]. The genealogy is represented by a tree structure with its nodes containing two sources of temporal information: coalescent and sampling times. The coalescent times correspond to the internal nodes of the tree, which are defined as the times at which two lineages in the tree are merged into a common ancestor. The sampling times, corresponding to the tips of the tree, are the times at which molecular sequences were sampled. Note that sampling times are observed directly, while coalescent times are estimated from molecular sequences during phylogenetic reconstruction.

To perform inference about infectious disease dynamics using the above genealogy we need a probability model that relates the genealogy and infectious disease dynamics model parameters. Without too much loss of generality, we assume that the infectious disease is spreading through the population according to the SIR model — a compartmental model that at each time point tracks the number of susceptible individuals , number of infected/infectious individuals , and number of removed individuals [Bailey, 1975, Anderson and May, 1992]. We assume that the population is closed so for all times , where is the population size that we assume to be known. This constraint implies that vector is sufficient to keep track of the population state at time . We follow common practice and model as a Markov jump process (MJP) with allowable instantaneous jumps shown in Figure 1 [O’Neill and Roberts, 1999]. The assumed MJP process is inhomogeneous, because we allow the infection rate and removal rate to be time-varying.


Figure 1: SIR Markov jump process. From the current state with the counts , the population can transition to state (an infection event) with rate or to state (a removal event) with rate . No other instantaneous transitions are allowed.

The structured coalescent models assume that only coalescent times provide information about the population dynamics. These times are modeled as jumps of an inhomogeneous pure death process with rate , where each “death” event corresponds to coalescence of two lineages and is called a coalescent rate. Then the density of the genealogy, which serves as a likelihood in our work, is written as

where denotes the most recent sequence sampling time. The dependence of coalescent rate on the assumed population dynamics can be complicated and mathematically intractable, but luckily approximations exist for some specific cases. For the SIR model the approximate coalescent rate can be obtained via the following formula:


where is the number of lineages present at time . Note that when the number of susceptibles is not changing significantly relative to the total population size (i.e., ) and infection rate is constant (i.e., ), the structured coalescent reduces to the classical Kingman’s coalescent, where we interpret as the effective population size trajectory [Kingman, 1982]. It is possible to find approximate coalescence rate for general compartmental models, but closed form expressions exist only for a few models with a low number of compartments (e.g., SI, SIR) [Volz et al., 2009, Volz, 2012, Dearlove and Wilson, 2013].

Since we allow sequences to be sampled at different times , some inter-coalescent times are censored. To deal with this censoring algebraically, each inter-coalesecent interval is partitioned by the sampling events into sub-intervals: . The intervals that end with a coalescent event are defined as , for and . Let the number of lineages in each interval be . Then the number of lineages at each time point can be written as . If the interval ends with a coalescent time, the number of lineages in the next interval will be decreased by . If the interval ends with a sampling event , then the number of lineages in the next interval is increased by — the number of sequences sampled at time . Figure 2 shows an example of a genealogy with labeled coalescent times, sampling times, number of lineages, and the corresponding intervals.

We are now ready to connect the SIR model and a genealogy with serially sampled tips with the help of a structured coalescent density/likelihood. First we discretize the time interval between the time to most recent common ancestor (time corresponding to the root of the tree) and the most recent sampling time using a regular grid ( and ). Using this grid, we discretize the latent epidemic trajectory by assuming that , where is a column vector. Similarly, we discretize the infectious disease dynamics parameter vector trajectory so that , where is also a column vector. We collect latent variables s and parameters s into matrices and respectively. The SIR structured coalescent density/likelihood then becomes


Since , , and are piecewise constant functions, the integrals in the above formula are readily available in closed form and are fast to compute.

Figure 2: Example of a genealogy. Black solid lines show the genealogy structure. The colescent times and sampling times are labeled with vertical dashed lines. The number of lineages is given in each intervals .

2.2 Bayesian data augmentation

2.2.1 Posterior distribution

Given genealogy , our goal is to infer the latent SIR population dynamic and rate parameters over time grid . Let and denote the prior densities for the initial compartment states and the SIR parameters respectively. The posterior distribution for the population trajectory and parameters given observed genealogy is


where is the structured coalescent likelihood introduced in Section 2.1 and is the likelihood function for discrete observations of trajectory given the initial value :


where the factorization comes from the assumed Markov property of the disease dynamics. However, the SIR transition density becomes intractable as population size grows large, making it difficult to perform likelihood-based inference for outbreaks in large populations.

2.2.2 Linear noise approximation

To furnish a feasible computation strategy for large populations, we use a linear noise approximation (LNA) method, in which the computationally intractable transition probability is approximated using a closed form Gaussian transition density. The LNA method replaces the MJP discrete state space with a continuous state space of to approximate the counts of at time , under the following constraints: and . To briefly explain how this approximation is obtained, we will need additional notation.

The SIR MJP instantaneous transitions, depicted in Figure 1, are encoded in an effect matrix


Each row in matrix (5) represents a type of transition event and each column corresponds to a change in the susceptible and infected populations. Next, we define a rate vector and a rate matrix :


The above notation, as well as subsequent developments based on it, can be generalized to other epidemic models and, more generally, to a large class of density dependent stochastic processes, such as chemical reaction and gene regulation models [Wilkinson, 2011]. See Section A-1 in the Appendix for more details on this generalization.

Consider a transition from at time to at . Recall that we assume that the SIR rates take constant values in . The LNA represents the value of the next state as , where is a deterministic component and is a stochastic component. The deterministic component can be obtained by solving the standard SIR ODE that in our notation can be written as


The stochastic part corresponds to the solution of the following SDE at time :


where is the Jacobian matrix of the deterministic part in (7) evaluated at . The solution of SDE (8), , is a Gaussian process and can be recovered by solving two ordinary differential equations governing the mean function and covariance function :



. A heuristic derivation of LNA, based on

Wallace [2010], is given in Section A-2 of the Appendix. Let denote the initial values of at time in differential equations (7), (9), and (10) respectively. There are two options for choosing these initial conditions: the non-restarting LNA of Komorowski et al. [2009] and the restarting LNA of Fearnhead et al. [2014]. In this paper, we will use the non-restarting LNA by Komorowski et al. [2009] with the following choice of initial conditions:

  1. , where was obtained by solving the ODE (7) using parameter vector over the interval ,

  2. ,

  3. .

Solving the system of ODEs (7), (9), (10), we obtain , , and . The solution will be a function of the initial value , the interval length and the SIR rates . To make this dependence explicit, we write . Since (9) is a first order homogeneous linear ODE, the solution is a linear function of . Hence, the transition from to

follows the following Gaussian distribution:


To summarize, the derived conditional Gaussian densities allow us to compute the density of the latent SIR trajectory (4). As a result, our augmented posterior distribution of and , shown in equation (3), can be computed up to proportionality constant and approximated via “standard” (not particle filter) MCMC approaches.

2.3 Reparameterization, priors, and MCMC algorithm

2.3.1 Reparameterizing SIR rates

We have experimented with multiple parameterizations of our inhomogeneous SIR model and found that the following parameterization works best with our proposed MCMC algorithm for approximating the posterior distribution (3). First, recall that we allow SIR rates to vary with time. Since it is much more likely for the infection rate to be time variable, we are going to assume a constant removal/recovery rate . This leaves us with the following parameters: infection rates on a grid , removal rate , and initial SIR state . Since we are interested in modeling an emerging infectious disease outbreak, we set the initial counts of susceptibles to . Initial counts of infected individuals, , is assumed to be low and treated as an unknown parameter with a lognormal prior distribution. Instead of the time-varying infection rate , we parameterize our SIR model with a time-varying basic reproduction number . The reproduction number is interpreted as the average number of cases that one case generates over its infectious period in a completely susceptible population. Since our infection rate changes in a piecewise manner, the basic reproduction number varies over time in a piecewise manner too:


where is the reproduction number corresponding to the time interval . Let be the initial basic reproductive number and be a normalized log ratio of over two successive time intervals. Then, interval-specific basic reproduction numbers can be written as


where we assume a priori that

s are independent standard normal random variables.

This construction implies that log-transformed piecewise constant reproduction numbers, s, a priori

follow a first order Gaussian Markov random field (GMRF) with standard deviation

that controls the a priori smoothness of trajectory [Rue, 2001, Rue and Held, 2005]. In addition to speeding MCMC convergence, working with is convenient, because this trajectory is dimensionless and retains its interpretation when one changes the population size . The initial is assigned a prior. We use a prior for the inverse of standard deviation .

2.3.2 Reparameterizing SIR latent trajectories

We reparameterize the latent SIR trajectory with a sequence of independent Gaussian random variables , following a non-centered parameterization framework of Papaspiliopoulos et al. [2007]. According to formula (11), conditional on , can be written as


where for and is a identity matrix. In our parameterization, we will treat as random latent variables and the SIR latent trajectory as a deterministic transformation of . More details about our non-centered parameterization of can be found in Section A-3 of the Appendix.

2.3.3 MCMC algorithm

Using our new parameterization, we are now interested in the posterior distribution of the initial number of infected individuals, , removal rate, , the initial basic reproduction number, , standardized vectors, and , and GMRF standard deviation, :

The latent variables and parameter vector are deterministic functions of new parameters , , , , , and . We use the following MCMC with block updates to approximate this posterior distribution. We update high dimensional vector using the efficient elliptical slice sampler [Murray et al., 2010]. Vector is updated the same way in a separate step. Initial number of invected individuals and removal rate are updated using univariate Metropolis steps. The full procedure is described in Algorithm 2, which together with details of the elliptical slice sampler can be found in Section A-4.1

of the Appendix. After MCMC is done, we report posterior summaries using natural parameterization. For example, we report posterior medians and 95% Bayesian credible intervals (BCIs) of the piecewise latent reproduction number trajectory,

for , and latent trajectory .

2.3.4 Implementation

Our R package called LNAPhylodyn provides an implementation of our MCMC algorithm. The package code is publicly available at This repository also contains scripts that should allow one to reproduce key numerical results in this manuscript.

3 Simulation experiments

3.1 Simulations based on single genealogy realizations

In this section, we use simulated genealogies to assess performance of our LNA-based method and to compare it with an ODE-based method, where we replace equation (14) with its simplified version: . Under our assumption of a fixed and known genealogy and constant , our ODE-based method closely resembles previously developed methods by Volz et al. [2009] and Volz and Siveroni [2018]. To compare ODE-based and LNA-based models in a Bayesian nonparametric setting, we equip the ODE model with the GMRF prior for time-varying , described in Section 2.3.1. We use the same MCMC algorithm for both LNA-based and ODE-based models, except we do not have a separate step to update latent vector (equivalently, ) in the ODE-based inference. See Algorithm 3 in the Appendix for a more detailed description of the ODE-based MCMC.

The simulation protocol consists of two steps. First, given the population size and pre-specified parameters , , and , we simulate one realization of the SIR population trajectory based on the MJP using the Gillespie algorithm [Gillespie, 1977]. Next, we generate realistic lineage sampling times and simulate coalescent times from the distribution specified by density (2) using a thinning algorithm by Palacios and Minin [2013].

We test LNA-based and ODE-based methods under three “true” trajectories over the time interval :

  1. Constant (CONST) . for . Recovery rate . Initial counts of infected individuals . Total population size is 100,000.

  2. Stepwise decreasing (SD) . , and . Recovery rate . Initial counts of infected individuals . Total population size 1,000,000.

  3. Non-monotonic (NM) . , and . Recovery rate . Initial counts of infected individuals . Total population size 1,000,000.

For all simulations, we use prior for . The parameters of the lognormal priors for the initial and inverse standard deviation are set to and respectively, in such a way that a priori trajectory stayed within a reasonable range of with 0.9 probability. We assign an informative prior for in each simulation scenario, because prior information about this parameter is usually available: (1) CONST: , (2) SD: , (3) NM: . We set the grid size to , with for . For both LNA-based and ODE-based methods, we use 300,000 MCMC iterations. All MCMC chains appeared to converge (trace plots are shown in Section A-5.3.1 of the Appendix). The effective sample sizes of all unknown quantities were above 100.

Figure 3: Analysis of 3 simulation scenarios. Columns correspond to CONST, SD, and NM simulated trajectories. The first row shows the estimated trajectories for the 3 scenarios, with the black solid lines representing the truth, the red dashed lines depicting the posterior median and the red-shaded area showing the 95% BCIs for the LNA-based method. For the ODE-based method, the posterior median is plotted in blue dotted lines, with blue shading showing the 95% BCIs. The second row corresponds to the estimation for the removal rate . Posterior density curves from the LNA are shown in red lines and the posterior density for ODE is plotted in blue lines, compared with prior density curve in green lines. The bottom two figures shows the estimated trajectory of and respectively.

The first row of Figure 3 shows point-wise posterior medians and 95% BCIs for the basic reproduction number trajectory, . Our LNA-based method performs well in capturing the continuous dynamics of . Though our approach may not perfectly catch the discontinuous changes in in the SD scenario, the method provides BCIs that are able to capture most of the trajectory. The ODE-based method yields similar results in the CONST case and the SD case, but fails to capture the decreasing trends in the NM scenario.

The second row in Figure 3 shows posterior summaries of removal rate . Both LNA-based and ODE-based methods provide good estimates in the CONST scenario, with posterior modes centered at the true value and higher posterior densities at truth when compared with the prior. In the SD and NM scenarios with the time varying , the posterior estimates from the LNA-based method and ODE-based method, though still centered at the truth, do not differ much from the prior distribution.

Posterior summaries of and are depicted in the third and fourth rows of Figure 3. The two methods produce similar results in the CONST and SD scenario, as both of them have narrow BCIs covering the true trajectories. However, in the NM case, while the LNA-based method manages to recover the latent SIR trajectory trend, the BCIs from the ODE-based method fail to cover the true prevalence trajectory in the middle and at the end of the epidemic. Somewhat counterintuitively, LNA-based method produces BCIs for the latent trajectories, and , that are narrower than its ODE counterparts. We suspect this is a result of the ODE-based method poor estimation of the basic reproduction number trajectory at the end of the epidemic.

3.2 Frequentist properties of posterior summaries

In this Section, we design a simulation study based on repeatedly simulating SIR trajectories using MJP with pre-specified parameters. The simulations are based on the non-monotonic trajectory scenario in Section 3.1 with the same parameter setup, except the parameters of the lognormal prior for the initial are set to . Simulating SIR dynamics under low initial number of infected individuals can end up with low prevalence trajectories that end at the beginning of the epidemic, or trajectories having unrealistically high prevalence, which are less likely to be observed during real infectious disease outbreaks. Therefore, while simulating SIR trajectories we reject such “unreasonable” realizations to arrive at 100 simulated trajectories. The details of the rejection criteria are given in Section A-5.2 of the Appendix. For each simulated SIR trajectory, a realization of a genealogy is generated using the structured coalescent process. We use both LNA-based and ODE-based model to approximate the posterior distribution of model parameters and latent variables for each genealogy.

We use three metrics to evaluate models based on their estimates of and : average error of point estimates (posterior medians), width of credible intervals, and frequentist coverage of credible intervals. Since the value of is greater than 0 and usually upper-bounded by 20 (i.e, it stays within the same order of magnitude), we will measure accuracy using an unnormalized mean absolute error (MAE):


where is the posterior median of . In contrast, varies from one at the beginning of the epidemic to thousands at the peak, so to evaluate accuracy of prevalence estimation we use the mean relative absolute error (MRAE):


where is the posterior median of I. We assess precision of estimation based on the mean credible interval width (MCIW):


where and denote the lower and upper bounds of the 95% BCI for . Similar as our measure of accuracy, precision of estimation is quantified via mean relative credible interval width (MRCIW):


where and specify the lower and upper bounds of the BCI of . In addition, we compute the “envelope” (ENV) — a measure of coverage of BCIs the true trajectory — for and as follows:


Sampling distribution boxplots of posterior summaries are depicted in the left three plots of Figure 4. The LNA-based method yields significantly lower MAE compared with the ODE-based method. As a trade-off, the MCIWs produced by the LNA-method are generally higher, as expected since the LNA-based method incorporates the stochasticity in the population dynamics. With less bias and wider BCIs, the LNA-based method BCIs result in better coverage than ODE-based BCIs, as shown by the envelope boxplots.

Sampling distribution boxplots of posterior summaries, shown in Figure 4, are similar to the results, with the LNA-based method generally having lower MRAEs, higher MRCIWs and a better coverage/envelope than the ODE-based method. Again, somewhat counterintuitively, the MRCIWs for the LNA-based method are smaller than the ODE counterparts. This is likely caused by significant bias in estimation by the ODE-based method.

We also report the absolute error (AE) and 95% BCI widths for removal rate in Figure 4. We note that an informative prior has been chosen for , because this parameter is weekly identifiably from genetic data alone. The LNA-based method yields a slightly higher AE than the ODE method. Both methods produce similar BCI widths.

Figure 4: Boxplots comparing performance of LNA-based and ODE-based methods using 100 simulated genealogies. The first three plots show mean absolute error (MAE), mean credible interval width (MCIW) and envelope for trajectory. The next three plots depict mean relative absolute error (MRAE), mean relative credible interval width (MRCIW), and envelope for (prevalence) trajectory. The last two plots show the absolute error (AE) and Bayesian credible interval (BCI) width for .

4 Analysis of Ebola outbreak in West Africa

We apply our LNA-based method to the Ebola genealogies reconstructed from molecular data collected in Sierra Leone and Liberia during the 2014–2015 epidemic in West Africa [Dudas et al., 2017]. We use a Sierra Leone genealogy, depicted in the top left plot of Figure 5, which was estimated from 1010 Ebola virus full genomes sampled from 2014-05-25 to 2015-09-12 in 15 cities. The Liberia genealogy, shown in the top left plot of Figure 6, was estimated from a smaller number of samples: 205 Ebola virus full genomes sampled from 2014-06-20 to 2015-02-14. The original sequence data and the reconstructed genealogies are publicly available at

When Ebola virus infections were detected in West Africa in mid-Spring of 2014, various intervention measures were proposed and implemented to change behavior of individuals in the populations through which Ebola was spreading. Border closures, encouragement to reduce individual day-to-day mobility, and recommendations on changing burial practices were among the broad spectrum of interventions attempted by multiple countries. It is reasonable to expect that these intervention measures resulted in lowering the contact rates among members of the populations, which in turn reduced the infection rate, or equivalently the basic reproduction number.

When analyzing the Sierra Leone and Liberia genealogies, we rely on conclusions of Dudas et al. [2017] and assume the population in each country to be well mixed. Furthermore, we assume Ebola spread to follow SIR dynamics. For each country, the population size is specified based on its census population size in 2014, with 7,000,000 for Sierra Leone and 4,400,000 for Liberia. As in our simulation study, we use the lognormal prior for with and and the lognormal prior for the inverse standard deviation with . Recall that this prior setting ensures that a priori stays within a reasonable range of with probability 0.9. For removal rate

, we use an informative lognormal prior with mean 3.4 and variance 0.2 based on previous studies

[Towers et al., 2014]. The parameter , interpreted as the length of the infectious period, is expected to be 8-18 days for each country a priori. The total time span for each genealogy is divided evenly into 40 intervals, which results in grid interval lengths, s, to be 12.41 days for Sierra Leone and 6.9 days for Liberia. We run the MCMC algorithm in Section 2.3 for 3,000,000 iterations for Sierra Leone data and 750,000 iterations for Liberia data. The posterior samples are obtained by discarding the first 100,000 iterations and saving every 30th iteration afterward. The trace plots in Section A-5.3.2 of the Appendix indicate the MCMC algorithm has converged and achieved good mixing in each case.

Figure 5: Analysis of the genealogy relating Ebola virus sequences collected in Sierra Leone. Top top left plot depicts the Ebola genealogy. The top right plot shows the estimated , with the red dashed line showing the posterior median and the salmon shaded area showing the 95% BCIs of the LNA-based method. The posterior median based on the ODE-based method is plotted as the blue dotted line with blue shading corresponding to the 95% BCIs. The medium left figure shows prior and posterior densities of the mean infection period . The prior density is shown in green, while the posterior densities based on LNA and ODE are plotted in red and blue respectively. The medium right and the bottom left figures show the estimated trajectory of and , using the same legend as in top right plot. The bottom right plot shows the predicted median and 95% BCIs for weekly reported incidence together with the reported incidence from WHO shown as crosses.

Figures 5 and 6 show results for Sierra Leone and Liberia respectively, with intervention events mapped onto the calendar time on the x-axis. Our LNA-based method estimates the initial in Sierra Leone during 2014–2015 to be 1.68, with 95% BCI of . Similarly, in Liberia during 2014–2015 has a point estimate 1.67 and a 95% BCI . Our estimate of initial in Sierra Leone is consistent with the estimates of Stadler et al. [2014], who fitted multiple birth-death models to 72 sequences at the early stages of the outbreak. Volz and Pond [2014] used a susceptible-exposed-infectious-recovered (SEIR) model with a constant and estimated it to be 2.40 (CI: ). Althaus [2014] assumed an exponentially decaying with an estimated initial of 2.52 (CI: ). The discrepancies between our and SEIR-based estimates are not unexpected, because SEIR models generally yield higher estimates than SIR models when applied to the same dataset [Wearing et al., 2005, Keeling and Rohani, 2011]. Our estimated for Liberia is in agreement with results of Althaus [2014], who fitted a SEIR model to incidence data and arrived at an estimated of 1.59 (CI: ).

The dynamics in the two countries share a similar pattern: with (1) a decreasing trend that starts in Spring/Summer of 2014, (2) a stable/constant period until the end of September 2014 and (3) a final decrease below 1.0 (epidemic is contained) around November 2014. Since the number of susceptible individuals did not change significantly over the course of the epidemic, relative to the total population size, the basic and effective reproduction numbers, and , are approximately equal. This allows us to compare our estimation results with previously estimated changes in . Our estimation of early dynamics in Sierra Leone agrees with results of Stadler et al. [2013], who concluded that the effective reproduction number did not significantly decrease until mid June. Our estimated trajectory suggests that later interventions, such as border closures and release of burial guides, may have been helpful in controlling the spread of the disease. The infectious period for Sierra Leone epidemic is estimated to be 11.2 days with a 95% BCI (7.6,16). For Liberia, the infection period has a point estimate of 9.8, with a 95% BCI . The posterior median of the total number of infected individuals (final epidemic size) is 7,284 and its 95% BCI is for Sierra Leone, which is close to 8,706 total confirmed number of cases reported by Centers for Disease Control (CDC). Liberia had a smaller epidemic than Sierra Leone, with estimated total infected individuals being 2,842 and a 95% BCI of . These results are also in agreement with 3,163 total confirmed cases from CDC.

We perform an out-of-sample validation by comparing our results with weekly reported confirmed incidence in Sierra Leone and Liberia from the World Health Organization

(WHO). The posterior predictive weekly incidence at time

, denoted by , is approximated by


where and are the posterior estimates of the infection rate, number of susceptible and number of infected individuals at time respectively, and corresponds the time interval of one week. We plot the posterior predictive estimates of weekly incidence together with the corresponding weekly reported confirmed incidence. For both countries, our model-based incidence 95% BCIs cover the reported incidence counts from WHO, suggesting that our time varying SIR model can estimate incidence well from genetic data alone. We note that our estimated latent incidence should be greater than the reported incidence, because not all Ebola cases were reported and recorded. However, the discrepancy between latent and reported incidence should not be large, because Ebola reporting rate was high. For example, Scarpino et al. [2014] estimated that 83% of Ebola cases were reported.

We also report results from the ODE-based method and superimpose these results over LNA-based results on Figures 5 and 6. For the relatively small Liberia genealogy, the ODE-based and LNA-based methods yield similar parameter estimates. However, the larger Sierra Leone genealogy produces substantial differences between ODE-based and LNA-based estimates of the . The ODE-based method captures the decreasing trend of in Spring and Summer of 2014, but provides narrow BCIs with unrealistic short term fluctuations in the basic reproduction number trajectory.

Figure 6: Analysis of the genealogy relating Ebola virus sequences collected in Liberia. See caption in Figure 5 for the explanation of the plots.

5 Discussion

In this paper, we propose a Bayesian phylodynamic inference method that can fit a stochastic epidemic model to an observed genealogy estimated from infectious disease genetic sequences sampled during an outbreak. Our statistical model can be viewed as semi-parametric: with (1) a parametric SIR model describing the infectious disease dynamics and (2) a non-parametric GMRF-based estimation of the time varying basic reproduction number. To the best of our knowledge, this is the first method combining a Bayesian nonparametric approach with a deterministic or stochastic SIR model for phylodynamic inference (although see [Xu et al., 2016] for a similar approach applied to more traditional epidemiological data). Our use of LNA allows us to devise an efficient MCMC algorithm to approximate high dimensional posterior distribution of model parameters and latent variables. Our LNA-based method produces posterior summaries with better frequentist properties than the state-of-the-art ODE-based method, underscoring the importance of working with stochastic models even in large populations. We showcase our method by applying it to the Ebola genealogies estimated from viral sequences collected in Sierra Leone and Liberia during the 2014–2015 outbreak. Our nonparametric estimates of in Sierra Lione and Liberia suggest that the basic reproduction number decreased in two-stages, where the second stage brought it below 1.0 — a sign of epidemic containment. The second stage of decrease closely follows in time implementation of interventions, pointing to their effectiveness.

Our method relies on the assumption that population dynamics follow a SIR model. However, it may be desirable to be able to relax this assumption. For example, in Ebola spread modeling some authors used a SEIR model that assumes a latent period during which an infected individual is not infectious [Althaus, 2014, Volz and Siveroni, 2018]. One future direction of this work is to generalize the LNA-based method to fit complicated compartmental epidemic models, including models with multi-stage infections like SEIR model and models with the population stratified by sex, age, geographic location, or other demographic variables. The structured coalescent likelihoods under these models may not have closed-form expressions. However, Volz [2012] and Müller et al. [2017] propose several strategies to approximate structured coalescent likelihoods. It would be interesting to combine these approximation strategies with LNA to estimate parameters of complex stochastic epidemic models from genealogies.

The experiments in Section 3.1 and Appendix Section A-6 indicate that one has to pay close attention to parameter identifiability when fitting SIR models to genealogies or to sequence data directly. Identifiability may not be a problem under an assumption of a constant . However, the removal rate tends to be only weakly identifiable in the scenarios with a time-varying basic reproduction number, in which the estimation can be sensitive to the choice of priors. In Section A-6 of the Appendix, we demonstrate that putting a weakly informative prior on the removal rate can cause bias not only in the estimation for removal rate, but also can lead to a failure in recovering the reproduction number and latent population dynamics. Therefore, successful inference of SIR model parameters using genealogical data should rely on a sound informative prior for the removal rate. This constraint is not a big shortcoming in practice, since prior information about the removal rate, or mean length of the infectious period, is usually readily available from patient hospitalization data [WHO Ebola Response Team, 2014].

Since parameter identifiability is a recurring problem in infectious disease modeling, integration of multiple sources of information is of great interest. Using particle filter MCMC, Rasmussen et al. [2011] demonstrated that jointly analyzing genealogy and incidence case counts considerably reduces the uncertainty in both estimation of latent population trajectory and SIR model parameters, compared with estimation based on a single source of information. We plan to use our LNA-based framework to perform similar integration of genealogical data and incidence time series. Another possible source of information is the distribution of genetic sequence sampling times. Karcher et al. [2016] proposed a preferential sampling approach that explicitly models dependence of the sampling times distribution on the effective population size. The authors demonstrated that accounting for preferential sampling helps decrease bias and results in more precise effective population size estimation. It would be interesting to incorporate preferential sampling into our LNA-based framework by assuming a probabilistic dependency between sampling times and latent prevalence .

Our method assumes a genealogy/phylogenetic tree is given to us. In reality, genealogies are not directly observed and need to be inferred from molecular sequences. Ideally, uncertainty in the genealogy should be handled by building a Bayesian hierarchical model and integrating over the space of genealogies using MCMC. In fact, implementations of such Bayesian hierarchical modeling already exist for nonparametric, birth-death, and ODE-based phylodynamic approaches [Drummond et al., 2005, Minin et al., 2008, Gill et al., 2013, Stadler et al., 2013, Volz and Siveroni, 2018]. Therefore, an important future direction will be to extend our LNA framework to fitting stochastic epidemic models to molecular sequences instead of genealogies. Similarly to the structured coalescent model implementation of Volz and Siveroni [2018], the easiest way to achieve this will be integration of our LNA MCMC algorithm into popular open source phylogenetic/phylodynamic software packages, such as BEAST, BEAST2, and RevBayes [Suchard et al., 2018, Bouckaert et al., 2014, Höhna et al., 2016].


We thank Jon Fintz for discussing the details of implementing Linear Noise Approximation. We are grateful to Michael Karcher and Julia Palacios for patiently answering questions about nonparametric phylodynamic inference and phylodyn package. M.T and V.N.M. were supported by the NIH grant R01 AI107034. M.T., T.B., and V.N.M. were supported by the NIH grant U54 GM111274. G.D. was supported by the Mahan postdoctoral fellowship from the Fred Hutchinson Cancer Research Center. This work was supported by NIH grant R35 GM119774-01 from the NIGMS to TB. TB is a Pew Biomedical Scholar.


  • Althaus [2014] CL Althaus. Estimating the reproduction number of Ebola virus (EBOV) during the 2014 outbreak in West Africa. PLoS Currents, 6, 2014.
  • Anderson and May [1992] RM Anderson and RM May. Infectious Diseases of Humans: Dynamics and Control, volume 28. Wiley Online Library, 1992.
  • Andrieu et al. [2010] C Andrieu, A Doucet, and R Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
  • Bailey [1975] NTJ Bailey. The Mathematical Theory of Infectious Diseases and Its Applications. Hafner Press/MacMillian Pub. Co, 1975.
  • Bernardo et al. [2003] JM Bernardo, MJ Bayarri, JO Berger, AP Dawid, D Heckerman, AFM Smith, and M West. Non-centered parameterisations for hierarchical models and data augmentation. In Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting, page 307. Oxford University Press, USA, 2003.
  • Bouckaert et al. [2014] R Bouckaert, J Heled, D Kühnert, T Vaughan, CH Wu, D Xie, MA Suchard, A Rambaut, and AJ Drummond. BEAST 2: A software platform for Bayesian evolutionary analysis. PLoS Computational Biology, 10(4):1–6, 04 2014.
  • Buckingham-Jeffery et al. [2018] E Buckingham-Jeffery, V Isham, and T House. Gaussian process approximations for fast inference from infectious disease data. Mathematical Biosciences, 301, 2018.
  • [8] Centers for Disease Control. Centers for Disease Control and Prevention. 2014–2016 Ebola outbreak in West Africa. Last accessed: Dec, 15, 2018.
  • Dearlove and Wilson [2013] B Dearlove and DJ Wilson. Coalescent inference for infectious disease: meta-analysis of hepatitis C. Philosophical Transactions of the Royal Society, Series B, 368(1614):20120314, 2013.
  • Donnelly and Tavare [1995] P Donnelly and S Tavare. Coalescents and genealogical structure under neutrality. Annual Review of Genetics, 29(1):401–421, 1995.
  • Drummond et al. [2002] AJ Drummond, GK Nicholls, AlG Rodrigo, and W Solomon. Estimating mutation parameters, population history and genealogy simultaneously from temporally spaced sequence data. Genetics, 161(3):1307–1320, 2002.
  • Drummond et al. [2005] AJ Drummond, A Rambaut, B Shapiro, and OG Pybus. Bayesian coalescent inference of past population dynamics from molecular sequences. Molecular Biology and Evolution, 22(5):1185–1192, 2005.
  • Dudas et al. [2017] G Dudas, LM Carvalho, T Bedford, AJ Tatem, G Baele, NR Faria, DJ Park, JT Ladner, A Arias, D Asogun, et al. Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature, 544(7650):309–315, 2017.
  • Fearnhead et al. [2014] P Fearnhead, V Giagos, and C Sherlock. Inference for reaction networks using the linear noise approximation. Biometrics, 70(2):457–466, 2014.
  • Frost and Volz [2010] SDW Frost and EM Volz. Viral phylodynamics and the search for an “effective number of infections” . Philosophical Transactions of the Royal Society B: Biological Sciences, 365(1548):1879–1890, 2010.
  • Giagos [2010] V Giagos. Inference for Auto-Regulatory Genetic Networks Using Diffusion Process Approximations. PhD thesis, Lancaster University, 2010.
  • Gill et al. [2013] MS Gill, P Lemey, NR Faria, A Rambaut, B Shapiro, and MA Suchard. Improving Bayesian population dynamics inference: a coalescent-based model for multiple loci. Molecular Biology and Evolution, 30(3):713–724, 2013.
  • Gillespie [1977] DT Gillespie. Exact stochastic simulation of coupled chemical reactions. The Journal of Physical Chemistry, 81(25):2340–2361, 1977.
  • Gillespie [2000] DT Gillespie. The chemical Langevin equation. The Journal of Chemical Physics, 113(1):297–306, 2000.
  • Grenfell et al. [2004] BT Grenfell, OG Pybus, JR Gog, JLN Wood, JM Daly, JA Mumford, and EC Holmes. Unifying the epidemiological and evolutionary dynamics of pathogens. Science, 303(5656):327–332, 2004.
  • Griffiths and Tavaré [1994] RC Griffiths and S Tavaré. Sampling theory for neutral alleles in a varying environment. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 344(1310):403–410, 1994.
  • Griffiths and Tavare [1994] RC Griffiths and S Tavare. Ancestral inference in population genetics. Statistical Science, pages 307–319, 1994.
  • Höhna et al. [2016] S Höhna, MJ Landis, TA Heath, B Boussau, N Lartillot, BR Moore, JP Huelsenbeck, and F Ronquist. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Systematic Biology, 65(4):726–736, 2016.
  • Karcher et al. [2016] MD Karcher, JA Palacios, T Bedford, MA Suchard, and VN Minin. Quantifying and mitigating the effect of preferential sampling on phylodynamic inference. PLoS Computational Biology, 12(3):e1004789, 2016.
  • Keeling and Rohani [2011] MJ Keeling and P Rohani. Modeling Infectious Diseases in Humans and Animals. Princeton University Press, 2011.
  • Kingman [1982] JFC Kingman. The coalescent. Stochastic Processes and their Applications, 13(3):235–248, 1982.
  • Komorowski et al. [2009] M Komorowski, B Finkenstädt, CV Harper, and DA Rand. Bayesian inference of biochemical kinetic parameters using the linear noise approximation. BMC Bioinformatics, 10(1):343, 2009.
  • Kuhner et al. [1998] MK Kuhner, J Yamato, and J Felsenstein. Maximum likelihood estimation of population growth rates based on the coalescent. Genetics, 149(1):429–434, 1998.
  • Kühnert et al. [2014] D Kühnert, T Stadler, TG Vaughan, and AJ Drummond. Simultaneous reconstruction of evolutionary history and epidemiological dynamics from viral sequences with the birth–death SIR model. Journal of the Royal Society Interface, 11(94):20131106, 2014.
  • Kurtz [1970] TG Kurtz. Solutions of ordinary differential equations as limits of pure jump Markov processes. Journal of Applied Probability, 7(1):49–58, 1970.
  • Kurtz [1971] TG Kurtz. Limit theorems for sequences of jump Markov processes. Journal of Applied Probability, 8(2):344–356, 1971.
  • Leventhal et al. [2013] GE Leventhal, HF Günthard, S Bonhoeffer, and T Stadler. Using an epidemiological model for phylogenetic inference reveals density dependence in HIV transmission. Molecular Biology and Evolution, 31(1):6–17, 2013.
  • Minin et al. [2008] VN Minin, EW Bloomquist, and MA Suchard. Smooth skyride through a rough skyline: Bayesian coalescent-based inference of population dynamics. Molecular Biology and Evolution, 25(7):1459–1471, 2008.
  • Müller et al. [2017] NF Müller, DA Rasmussen, and T Stadler. The structured coalescent and its approximations. Molecular Biology and Evolution, 34:2970–2981, 2017.
  • Murray et al. [2010] I Murray, RP Adams, and DJC MacKay. Elliptical slice sampling. In AISTATS, volume 13, pages 541–548, 2010.
  • O’Neill and Roberts [1999] PD O’Neill and GO Roberts. Bayesian inference for partially observed stochastic epidemics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 162(1):121–129, 1999.
  • Palacios and Minin [2013] JA Palacios and VN Minin. Gaussian process-based Bayesian nonparametric inference of population size trajectories from gene genealogies. Biometrics, 69(1):8–18, 2013.
  • Papaspiliopoulos et al. [2007] O Papaspiliopoulos, GO Roberts, and M Sköld. A general framework for the parametrization of hierarchical models. Statistical Science, pages 59–73, 2007.
  • Pybus et al. [2001] OG Pybus, MA Charleston, S Gupta, A Rambaut, EC Holmes, and PH Harvey. The epidemic behavior of the hepatitis C virus. Science, 292(5525):2323–2325, 2001.
  • Rasmussen et al. [2011] DA Rasmussen, O Ratmann, and K Koelle. Inference for nonlinear epidemiological models using genealogies and time series. PLoS Computational Biology, 7(8):e1002136, 2011.
  • Rasmussen et al. [2014] DA Rasmussen, EM Volz, and K Koelle. Phylodynamic inference for structured epidemiological models. PLoS Computational Biology, 10(4):e1003570, 2014.
  • Rue [2001] H Rue. Fast sampling of Gaussian Markov random fields. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2):325–338, 2001.
  • Rue and Held [2005] H Rue and L Held. Gaussian Markov Random Fields: Theory and Applications. CRC press, 2005.
  • Scarpino et al. [2014] SV Scarpino, A Iamarino, C Wells, D Yamin, M Ndeffo-Mbah, NS Wenzel, SJ Fox, T Nyenswah, FL Altice, AP Galvani, et al. Epidemiological and viral genomic sequence analysis of the 2014 Ebola outbreak reveals clustered transmission. Clinical Infectious Diseases, 60(7):1079–1082, 2014.
  • Stadler et al. [2013] T Stadler, D Kühnert, S Bonhoeffer, and AJ Drummond. Birth–death skyline plot reveals temporal changes of epidemic spread in hiv and hepatitis C virus (HCV). Proceedings of the National Academy of Sciences, 110(1):228–233, 2013.
  • Stadler et al. [2014] T Stadler, D Kühnert, DA Rasmussen, and L du Plessis. Insights into the early epidemic spread of Ebola in Sierra Leone provided by viral sequence data. PLoS Currents, 6, 2014.
  • Suchard et al. [2018] MA Suchard, P Lemey, G Baele, DL Ayres, AJ Drummond, and A Rambaut. Bayesian phylogenetic and phylodynamic data integration using BEAST 1.10. Virus Evolution, 4(1):vey016, 2018.
  • Towers et al. [2014] S Towers, O Patterson-Lomba, and C Castillo-Chavez. Temporal variations in the effective reproduction number of the 2014 west africa Ebola outbreak. PLoS Currents, 6, 2014.
  • Van Kampen and Reinhardt [1983] NG Van Kampen and WP Reinhardt. Stochastic processes in physics and chemistry, 1983.
  • Vaughan et al. [2018] TG Vaughan, GE Leventhal, DA Rasmussen, AJ Drummond, D Welch, and T Stadler. Estimating epidemic incidence and prevalence from genomic data. BioRxiv, page 142570, 2018. doi:
  • Volz and Pond [2014] E Volz and S Pond. Phylodynamic analysis of Ebola virus in the 2014 Sierra Leone epidemic. PLoS Currents, 6, 2014.
  • Volz and Siveroni [2018] E Volz and I Siveroni. Bayesian phylodynamic inference with complex models. BioRxiv, 2018. doi: 10.1101/268052.
  • Volz [2012] EM Volz. Complex population dynamics and the coalescent under neutrality. Genetics, 190(1):187–201, 2012.
  • Volz et al. [2009] EM Volz, SLK Pond, MJ Ward, AJL Brown, and SDW Frost. Phylodynamics of infectious disease epidemics. Genetics, 183(4):1421–1430, 2009.
  • Volz et al. [2013] EM Volz, K Koelle, and T Bedford. Viral phylodynamics. PLoS Computational Biology, 9(3):e1002947, 2013.
  • Wallace [2010] EWJ Wallace. A simplified derivation of the linear noise approximation. Arxiv preprint arXiv:1004.4280, 2010.
  • Wearing et al. [2005] HJ Wearing, P Rohani, and MJ Keeling. Appropriate models for the management of infectious diseases. PLoS Medicine, 2(7):e174, 2005.
  • WHO Ebola Response Team [2014] WHO Ebola Response Team. Ebola virus disease in West Africa —- the first 9 months of the epidemic and forward projections. New England Journal of Medicine, 371(16):1481–1495, 2014.
  • Wilkinson [2011] DJ Wilkinson. Stochastic Modelling for Systems Biology. CRC press, 2011.
  • [60] World Health Organization. World Health Organization. Ebola data and statistics., May 11, 2016. Last accessed: February 28, 2018.
  • Wright [1931] S Wright. Evolution in Mendelian populations. Genetics, 16(2):97–159, 1931.
  • Xu et al. [2016] X. Xu, T. Kypraios, and P.D. O’Neill. Bayesian non-parametric inference for stochastic epidemic models using Gaussian processes. Biostatistics, 17(4):619–633, 2016.

A-1 A general framework for stochastic kenetic models

A-1.1 Stochastic model generalization

In Section 2, we provide an example of the linear noise approximation (LNA) for the SIR model. The LNA framework can be also generalized to other types of the stochastic kinetic models in Infectious Disease Epidemiology and in Systems Biology. Here, we give a general representation of the stochastic kinetic model by viewing it as a reaction network system. The notation is based on the work of Fearnhead et al. [2014].

Let’s start with a reaction system with reactants and reactions. Without loss of generality, each reaction is assumed to have a constant rate parameter for and denotes the rate vector of the system (this framework can be extended to handle stochastic kinetic models with time-varying rates as in Section 2 of the main text). The transition event in the th reaction () has the following form:


where and are non-negative integers representing the number of in the th reaction equation. In a compartmental stochastic epidemic model, the coefficient will be either 0 or 1. The transitions in the reaction system can be encoded in an effect matrix,


with each row corresponding to a certain type of reaction event and each column representing the change in the counts of reactants. Let denote counts/population of the at , and the population state at time t can be tracked by vector . Let denote the reaction rate of the th reaction, where can be written as


Hence, following the same notation as in Section 2.2.1 of the main text, the rate vector and the rate matrix can be defined as


Given the above notation, the deterministic ordinary differential equation model of the reaction system can be written as


where is a vector of initial counts of reactants .

A-1.1.1 Example: SEIR model

Figure A-1: SEIR Markov jump process. From the current state with the counts , the population can transition to (1) state (an infection event) with rate or to (2) state (an event where infected individual becomes infectious) with rate or to (3) state (a removal event) with rate . No other instantaneous transitions are allowed.

The above general representation of stochastic kinetic models can be directly applied to stochastic epidemic models. Here, we illustrate this on a Susceptible-Exposed-Infected-Recovery (SEIR) model. SEIR model is an extension of the SIR model that assumes a latent period called “Exposed”, in which an infected individual does not have the ability to infect others. The exposed individual will eventually become infectious with rate . As in the SIR model, an infectious individual has removal/recovery rate . The transition events between different states of the SEIR model are depicted in Figure A-1.

Following the stochastic kinetic model representation, the SEIR model can be viewed as a reaction system of four reactants — susceptible, exposed, infected, and recovered individuals — and the following three “reactions”:


Since the recovered population never interacts with individuals in other compartments, we will only keep track of the counts of susceptible, exposed, and infectious individuals at time , denoted by , , respectively. The effect matrix for the SEIR model can be written as:


with columns representing compartments and rows representing reactant changes during reaction events.

If we let denote the state vector at time , then the rate vector for the SEIR model is


A-2 Derivation of the linear noise approximation

A-2.1 SDE approximation for MJP

A stochastic way to approximate the MJP model is to use the Stochastic Differential Equation (SDE) approximation, also known as the chemical Langevin equation (CLE) [Gillespie, 2000]. The SDE method can be viewed an approximation of the MJP at time

, obtained by applying a normal approximation to the Poisson distributed number of state transitions in a small interval of time

[Gillespie, 2000, Wallace, 2010]. The deterministic part in SDE corresponds to the right hand side of ODE (7) and stochastic part is related to the variance of the system. The SDE for general stochastic kinetic models can be written as


where denote a dimensional Wiener process and the square root means the Cholesky triangle of the covariance matrix.

A-2.2 LNA approximation of the SDE

Since in the main text we assume the rate varies in a piecewise constant way, without loss of generality, we use the notation for the rate in a given time interval where the LNA is applied.

Theorem A-2.1 (Linear Noise Approximation for SDE).

Let be the solution of ordinary differential equation (7) with initial value . Let be the system size, which is the total number of individuals in the system (In SIR model, will be the total population, i.e ), denote the vector of rate parameters in reactions. Then the solution of the SDE (A-11) satisfies the following equation


as .


The following derivation is based on [Wallace, 2010].

We rescale both the compartment size and reaction rates as follows:


where is the sum of coeffcients in the left hand side of -th reaction as in Section A-1. The transformed represents the proportion of individuals/reactants each compartment with respect to the total population size. Then we have and . Hence, the SDE (A-11) becomes


Let be the solution of the ODE


and we have , where is the solution of the ODE (7). can be viewed as a scaled version solution of ODE (7). Let denote the scaled residual, then the rescaled compartment size vector can be written as


After using first order Taylor expansion of and around , the SDE (A-15) becomes