Estimation from available data of model parameters describing epidemic dynamics is a major challenge in epidemiology, especially contributing to better understand the mechanisms underlying these dynamics and to provide reliable predictions. Epidemics can be recurrent over time and/or occur simultaneously in different regions. For example, influenza outbreaks in France are seasonal and can unfold in several distinct regions with different intensities at the same time. This translates into a non-negligible variability between epidemic phenomena. In practice, this inter-epidemic variability is often omitted, by not explicitly considering specific components for each entity (population, period). Instead, each data series is analysed separately and this variability is estimated empirically. Integrating in a unique model these sources of variability allows to study simultaneously the observed data sets corresponding to each spatial (e.g. region) or temporal entity (e.g. season). This approach should improve the statistical power and accuracy of the estimation of epidemic parameters as well as refine knowledge about underlying inter-epidemic variability.
An appropriate framework is represented by the mixed-effects models, which allow to describe the variability between subjects belonging to a same population from repeated data (see e.g. Pinheiro2000 , Lavielle2014
). These models are largely used in pharmacokinetics with intra-population dynamics usually modeled by ordinary differential equations (ODE) and, in order to describe the differences between individuals, random effects on the parameters ruling these dynamics (see e.g.Collin2020 ). This framework was later extended to models defined by stochastic differential equations incorporating mixed effects in the parameters of these diffusion processes (Donnet2009 , Delattre2013 , Donnet2013 , Delattre2018 ). To our knowledge, the framework of mixed-effects models has rarely been used to analyse epidemic data, except in a very few studies. Among these, in (prague:hal-02555100 ), the dynamics of the first epidemic wave of COVID-19 in France were analysed using an ODE system incorporating random parameters to take into account the variability of the dynamics between regions. Using a slightly different approach to tackle data from multiple epidemics, Breto2020 proposed a likelihood-based inference method using particle filtering techniques for non-linear and partially observed models. In particular, these models incorporate unit-specific parameters and shared parameters.
In addition to the specific problem of variability reflected in multiple data sets, observations of epidemic dynamics are often incomplete in various ways: only certain health states are observed (e.g. infected individuals), data are temporally discretized or aggregated, and subject to observation errors (e.g. under-reporting, diagnosis errors). Because of this incompleteness together with the non-linear structure of the epidemic models, the computation of the maximum likelihood estimator (MLE) is often not explicit. In hidden or latent variable models which are appropriate representations of incompletely observed epidemic dynamics, estimation techniques based on Expectation-Maximization (EM) algorithm can be implemented in order to compute the MLE (see e.g.Dempster1977 ). However, the E-step of the EM algorithm requires that, for each parameter value , the conditional expectation of the complete log-likelihood given the observed data, , can be computed. In mixed-effects models, there is generally no closed form expression for . In such cases, this quantity can be approximated using a Monte-Carlo procedure (MCEM, Wei1990 ), which is computationally very demanding. A more efficient alternative is the SAEM algorithm (Delyon1999 ), often used in the framework of mixed-effects models (Kuhn2005 ), which combines at each iteration the simulation of unobserved data under the conditional distribution given the observations and a stochastic approximation procedure of (see also Delattre2013 , Donnet2014 for the study and implementation of the SAEM algorithm for mixed-effects diffusion models).
In this paper, focusing on the inference for multiple epidemic dynamics, we intend to meet two objectives. The first objective is to propose a finer modeling of multiple epidemics through a unique mixed-effects model, incorporating a stochastic representation of each epidemic. The second objective is to develop an appropriate method for jointly estimating model parameters from noisy and partial observations, able to estimate rigorously and explicitly the inter-epidemic variability. Thus, the main expected contribution is to provide accurate estimates of common and epidemic-specific parameters and to provide elements for the interpretation of the mechanisms underlying the variability between epidemics of the same nature occurring in different locations or over distinct time periods.
For this purpose, we extend the Gaussian state-space model introduced in (Narci2020 ) for single epidemics to a model with mixed effects on the parameters describing simultaneously several epidemics and their observations. Then, following (Delattre2013 ) and building on the Kalman filtering-based inference method proposed in (Narci2020 ), we propose to couple the SAEM algorithm with Kalman-like filtering to estimate model parameters. The performances of the estimation method are investigated on simulations mimicking noisy prevalence data (i.e. the number of cases of disease in the population at a given time or over a given period of time). The method is then applied to the case of influenza epidemics in France over several years using noisy incidence data (i.e. the number of newly detected cases of the disease at a given time or over a given period of time), by proposing a new version of the filtering algorithm to handle this type of data.
The paper is organized as follows. In Section 2 we describe the epidemic model for a single epidemic, specified for both prevalence and incidence data, and its extension to account for several epidemics through a two-level representation using the framework of mixed-effects models. Section 3 contains the maximum likelihood estimation method and convergence results of the SAEM algorithm. In Section 4, the performances of our inference method are assessed on simulated noisy prevalence data generated by SIR epidemic dynamics sampled at discrete time points. Section 5 is dedicated to the application case, the influenza outbreaks in France from 1990 to 2017. Section 6 contains a discussion and concluding remarks.
2 A mixed-effects approach for a state-space epidemic model for multiple epidemics
First, we sum up the approach developed in (Narci2020 ) in the case of single epidemics for prevalence data and extend it to incidence data (Section 2.1). By extending this approach, we propose a model for simultaneously considering several epidemics, in the framework of mixed-effects models (Section 2.2).
2.1 The basics of the modeling framework for the case of a single epidemic
The epidemic model
Consider an epidemic in a closed population of size with homogeneous mixing, whose dynamics are represented by a stochastic compartmental model with compartments corresponding to the successive health states of the infectious process within the population. These dynamics are described by a density-dependent Markov jump process with state space and transition rates depending on a multidimensional parameter . Assuming that , the normalized process representing the respective proportions of population in each health state converges, as , to a classical and well-characterized ODE:
Two stochastic approximations of are available: a -dimensional diffusion process with drift coefficient and diffusion matrix (which is also easily deducible from the jump functions of the density-dependent jump process, see e.g. (Narci2020 )), and a time-dependent Gaussian process
with small variance coefficient (see e.g.Pardoux2020 ), having for expression
where is a centered Gaussian process with explicit covariance matrix. There is a link between these two processes: let be a Brownian motion in , then is the centered Gaussian process
and is the resolvent matrix associated to (1)
with denoting the matrix . In the sequel, we rely on the Gaussian process (2) to represent epidemic dynamics.
The epidemic is observed at discrete times , where is the number of observations. Let us assume that the observation times are regularly spaced, that is with the time step (but the following can be easily adapted to irregularly spaced observation times). Setting and , the model can be written under the auto-regressive AR(1) form
In the SIR model, and . The parameters involved in the transition rates are and and the initial proportions of susceptible and infectious individuals are . Denoting , the ODE satisfied by is
When there is no ambiguity, we denote by and the solution of (8). Then, the functions , and are
We refer the reader to Appendix A for the computation of , and in the SEIR model. Another parameterization, involving the basic reproduction number and the infectious period , is more often used for SIR models. Hence, we set .
Observation model for prevalence data
Following (Narci2020 ), we assume that observations are made at times , and that some health states are not observed. The dynamics is described by the -dimensional model detailed in (4). Some coordinates are not observed and various sources of noise systematically affect the observed coordinates (measurement errors, observation noises, under-reporting, etc.). This is taken into account by introducing an additional parameter , governing both the levels of noise and the amount of information which is available from the observed coordinates, and an operator
. Moreover, we assume that, conditionally on the random variables, these noises are independent but not identically distributed. We approximate their distributions by
-dimensional Gaussian distributions with covariance matrixdepending on and . This yields that the observations satisfy
Let us define a global parameter describing both the epidemic process and the observational process,
Finally, joining (4), (9) and (10) yields the formulation (for both epidemic dynamics and observation process) required to implement Kalman filtering methods in order to estimate the epidemic parameters:
Example: SIR model (continued). The available observations could be noisy proportions of the number of infectious individuals at discrete times . Denoting by the reporting rate, one could define the operator and the covariance error as with satisfying (8). The expression of mimics the variance that would arise from assuming the observations to be obtained as binomial draws of the infectious individuals.
Observation model for incidence data
For this purpose, we have extended the framework developed in (Narci2020 ). For some compartmental models, the observations (incidence) at times can be written as the increments of a single or more coordinates, that is where, as above, is a given operator and are emission parameters. Let us write the epidemic model in this framework. For , let
To model the errors that affect the data collected , we assume that, conditionally on , the observations are independent and proceed to the same approximation for their distributions
Contrary to (4), is not Markovian since it depends on all the past observations. Therefore, it does not possess the required properties of classical Kalman filtering methods. We prove in Appendix B that we can propose an iterative procedure and define a new filter to compute recursively the conditional distributions describing the updating and prediction steps together with the marginal distributions of the observations from the model (16).
Example: SIR model (continued). Here, and the number of new infectious individuals at times is given by . Observing a proportion of the new infectious individuals would lead to the operator . Mimicking binomial draws, the covariance error could be chosen as where satisfies (8).
2.2 Modeling framework for multiple epidemics
Consider now the situation where a same outbreak occurs in many regions or at different periods simultaneously. We use the index to describe the quantities for each unit (e.g. region or period), where is the total number of units.
Following Section 2.1, for unit , the epidemic dynamics are represented by the -dimensional process corresponding to infectious states (or compartments) with
state space . It is assumed that is observed at discrete times on , , where is a fixed time step and
is the number of observations, and that are the observations at times . Each of these dynamics has its own epidemic and observation parameters, denoted .
To account for intra- and inter-epidemic variability, a two level representation is considered, in the framework of mixed-effects models. First, using the discrete-time Gaussian state-space for prevalence (11) or for incidence data (16), the intra-epidemic variability is described. Second, the inter-epidemic variability is characterized by specifying a set of random parameters for each epidemic.
1. Intra-epidemic variability
2. Inter-epidemic variability
We assume that the epidemic-specific parameters are independent and identically distributed (i.i.d) random variables with distribution defined as follows,
. The vectorcontains known link functions (a classical way to obtain parameterizations easier to handle), is a vector of fixed effects and are random effects modeled by i.i.d centered random variables. The fixed and random effects respectively describe the average general trend shared by all epidemics and the differences between epidemics. Note that it is sometimes possible to propose a more refined description of the inter-epidemic variability by including unit-specific covariates in (19). This is not considered here, without loss of generality.
Example: SIR model (continued). Let and where is the population size in unit . The random parameter is and has to fulfill the constraints
To meet these constraints, one could introduce the following function :
where and .
In this example, we supposed that all the parameters have both fixed and random effects, but it is also possible to consider a combination of random-effect parameters and purely fixed-effect parameters (see Section 4.1 for instance).
3 Parametric inference
To estimate the model parameters , with and defined in (19), containing the parameters modeling the intra- and inter-epidemic variability, we develop an algorithm in the spirit of (Delattre2013 ) allowing to derive the maximum likelihood estimator (MLE).
3.1 Maximum likelihood estimation
The model introduced in Section 2.2 can be seen as a latent variable model with the observed data and the latent variables. Denote respectively by ), and
the probability density of the observed data, of the random effects and of the observed data given the unobserved ones. By independence of theepidemics, the likelihood of the observations is given by:
Computing the distribution of the observations for any epidemic requires the integration of the conditional density of the data given the unknown random effects with respect to the density of the random parameters:
Due to the non-linear structure of the proposed model, the
integral in (21) is not explicit. Moreover, the computation of is not straightforward due to the presence of latent states in the model. Therefore, the inference algorithm needs to account for these specific features.
Let us first deal with the integration with respect to the unobserved random variables . In latent variable models, the use of the EM algorithm (Dempster1977 ) allows to compute iteratively the MLE. Iteration of the EM algorithm combines two steps: (1) the computation of the conditional expectation of the complete log-likelihood given the observed data and the current parameter estimate , denoted (E-step); (2) the update of the parameter estimates by maximization of (M-step). In our case, the E-step cannot be performed because does not have a simple analytic expression. We rather implement a Stochastic Approximation-EM (SAEM, Delyon1999 ) which combines at each iteration the simulation of unobserved data under the conditional distribution given the observations (S-step) and a stochastic approximation of (SA-step).
a) General description of the SAEM algorithm
Given some initial value , iteration of the SAEM algorithm consists in the three following steps:
(S-step) Simulate a realization of the random parameters under the conditional distribution given the observations for a current parameter denoted .
(SA-step) Update according to
where is a sequence of positive step-sizes s.t. and .
(M-step) Update the parameter estimate by maximizing
In our case, an exact sampling under in the S-step is not feasible. In such intractable cases, MCMC algorithms such as Metropolis-Hastings algorithm can be used (Kuhn2004 ).
b) Computation of the S-step by combining the Metropoligs-Hastings algorithm with Kalman filtering techniques
In the sequel,
we combine the S-step of the SAEM algorithm with a MCMC procedure.
For a given parameter value , a single iteration of the Metropolis-Hastings algorithm consists in:
Generate a candidate for a given proposal distribution
To compute the rate of acceptation of the Metropolis-Hastings algorithm in (22), we need to calculate
Let , . In both models (17) and (18), the conditional densities are Gaussian densities. In model (17) involving prevalence data, their means and variances can be exactly computed with Kalman filtering techniques (see (Narci2020 )). In model (18), the Kalman filter can not be used in its standard form. We therefore develop an alternative filtering algorithm.
From now on, we omit the dependence in and for sake of simplicity.
Let us consider model (16). Assume that
Let and . Then, at iterations , the filtering steps are:
The equations are deduced in Appendix B, the difficult point lying in the prediction step, i.e. the derivation of the conditional distribution .
3.2 Convergence of the SAEM-MCMC algorithm
Generic assumptions guaranteeing the convergence of the SAEM-MCMC algorithm were stated in (Kuhn2004 ). These assumptions mainly concern the regularity of the model (see assumptions (M1-M5)) and the properties of the MCMC procedure used in step S (SAEM3’). Under these assumptions, and providing that the step sizes are such that and , then the sequence obtained through the iterations of the SAEM-MCMC algorithm converges almost surely toward a stationary point of the observed likelihood.
Let us remark that by specifying the inter-epidemic variability through the modeling framework of Section 2.2, our approach for multiple epidemics fulfills the exponentiality condition stated in (M1) provided that all the components of are random. Hence the algorithm proposed above converges almost surely toward a stationary point of the observed likelihood under the standard regularity conditions stated in (M2-M5) and assumption (SAEM3’).
4 Assessment of parameter estimators performances on simulated data
First, the performances of our inference method are assessed on simulated stochastic SIR dynamics. Second, the estimation results are compared with those obtained by an empirical two-step approach.
For a given population of size and given parameter values, we use the Gillespie algorithm (Gillespie1977 ) to simulate a two-dimensional Markov jump process . Then, choosing a sampling interval and a reporting rate , we consider prevalence data simulated as binomial trials from a single coordinate of the system .
4.1 Simulation setting
Recall that the epidemic-specific parameters are .
In the sequel, for all , we assume that and are random parameters. We also set (which means that the initial number of recovered individuals is zero), with being a random parameter. Moreover, we consider that the infectious period is a fixed parameter since the duration of the infectious period can reasonably be assumed constant between different epidemics. It is important to note that the case study is outside the scope of the exponential model since a fixed parameter has been included. We refer the reader to Appendix C for implementation details.
In other words, random effects on and fixed effect on are considered. Moreover, these random effects come from a priori independent sources, so that there is no reason to consider correlations between , and , and we can assume in this set-up a diagonal form for the covariance matrix , .
We consider two settings (denoted respectively (i) and (ii) below) corresponding to two levels of inter-epidemic variability (resp. high and moderate). The fixed effects values are chosen such that the intrinsic stochasticity of the epidemic dynamics is significant (a second set of fixed effects values leading to a lower intrinsic stochasticity is also considered; see Appendix D for details).
Setting (i): and corresponding to , ; ; , ; , ;
Setting (ii): and corresponding to , ; ; , ; , ;
where stands for the coefficient of variation of a random variable . Let us note that the link between and for and does not have an explicit expression.
The population size is fixed to . For each , data sets, each composed of SIR epidemic trajectories, are simulated. Independent samplings of , , , are first drawn according to model (23). Then, conditionally to each parameter set , a bidimensionnal Markov jump process is simulated. Normalizing with respect to and extracting the values of the normalized process at regular time points , , gives the ’s. A fixed discretization time step is used, i.e. the same value of is used to simulate all the epidemic data. For each epidemic, is defined as the first time point at which the number of infected individuals becomes zero. Two values of are considered () corresponding to an average number of time-point observations . Only trajectories that did not exhibit early extinction were considered for inference. The theoretical proportion of these trajectories is given by (Andersson2000 ). Then, given the simulated ’s and parameters ’s, the observations
are generated from binomial distributions.
4.2 Point estimates and standard deviations for inferred parameters
show the estimates of the expectation and standard deviation of the mixed effects, computed from the estimations of and using functions defined in (23), for settings (i) and (ii). For each parameter, the reported values are the mean of the parameter estimates , , and their standard deviations in brackets.