A marginal moment matching approach for fitting endemic-epidemic models to underreported disease surveillance counts

by   Johannes Bracher, et al.

Count data are often subject to underreporting, especially in infectious disease surveillance. We propose an approximate maximum likelihood method to fit count time series models from the endemic-epidemic class to underreported data. The approach is based on marginal moment matching where underreported processes are approximated through completely observed processes from the same class. We demonstrate that the reporting probability cannot be estimated from time series data alone and needs to be specified based on external information. Moreover, the form of the bias when underreporting is ignored or taken into account via multiplication factors is analysed. Notably, we show that this leads to a downward bias in model-based estimates of the effective reproductive number. The good performance of the proposed approach is demonstrated in simulation studies. An extension to time-varying parameters and reporting probabilities is discussed and applied in a case study on rotavirus gastroenteritis in Berlin, Germany.



There are no comments yet.


page 1

page 2

page 3

page 4


Multivariate endemic-epidemic models with higher-order lags and an application to outbreak detection

Multivariate time series models are an important tool for the analysis o...

Eliciting Disease Data from Wikipedia Articles

Traditional disease surveillance systems suffer from several disadvantag...

A zero-inflated endemic-epidemic model with an application to measles time series in Germany

Count data with excessive zeros are often encountered when modelling inf...

Spatio-Temporal Analysis of Surveillance Data

In this chapter, we consider space-time analysis of surveillance count d...

Generalised Joint Regression for Count Data with a Focus on Modelling Football Matches

We propose a versatile joint regression framework for count responses. T...

An integer-valued time series model for multivariate surveillance

In recent days different types of surveillance data are becoming availab...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Data from routine surveillance of infectious diseases rarely reflect the entire burden in a population (Gibbons et al., 2014; Noufaily, 2019). At the community level, not all infected individuals seek medical care, e.g. due to absence or mildness of symptoms, which is referred to as under-ascertainment. At the healthcare level, part of those seeking healthcare are not diagnosed correctly, lack lab confirmation or are not entered in the notification system. This is often referred to as underreporting, but we will use the term in a broader sense so that it also subsumes under-ascertainment. Various study types exist to estimate the degree of underreporting, including community-based studies, returning traveller studies, capture-recapture methods and serological surveys (Gibbons et al., 2014).

Underreporting has received a fair amount of attention in infectious disease modelling and statistics. Approaches of varying complexity have been suggested, starting from simple multiplication factor methods (e.g. Jandarov et al. 2014, Stocks et al. 2018). More sophisticated techniques have been described for branching processes (Fraser et al., 2009; White and Pagano, 2010; Azmon et al., 2014) and compartmental models (Cauchemez and Ferguson, 2008; Dorigatti et al., 2012; Gamado et al., 2014; Fintzi et al., 2017; Magal and Webb, 2018). Moreover, disease mapping approaches for underreported spatial (Bailey et al., 2005; Oliveira et al., 2017) and spatio-temporal (Stoner et al., 2019; Sharmin et al., 2018b) count data have been suggested. Underreported count outcomes in regression have been considered by Dvorzak and Wagner (2016). In statistical count time series modelling, however, underreporting is rarely addressed. Exceptions are Fernández-Fontelo et al. (2016, 2019) who suggested a hidden integer valued autoregressive (INAR) model and Sharmin et al. (2018a) who use a hidden log-linear autoregressive count model. Here we examine the impact of underreporting in the endemic-epidemic modelling framework (Held et al., 2005; Paul et al., 2008; Held and Paul, 2012; Meyer and Held, 2014, 2017) which in the univariate case has links to integer-valued GARCH (INGARCH) models (Ferland et al. 2006, Zhu 2011).

Our main contribution consists in an approximate maximum likelihood method to fit endemic-epidemic models to underreported data. It is based on a marginal moment matching approach where underreported processes are approximated by completely observed processes from the same class. The reporting probability needs to be specified based on external knowledge, as it is not identifiable from time series data alone. The sensitivity of estimates to the chosen value can subsequently be examined. The arguments underlying the likelihood approximation also have implications on the biases occurring when underreporting is ignored or taken into account using multiplication factors. This is of particular interest when endemic-epidemic models serve to estimate effective reproductive numbers, as suggested by Bauer and Wakefield (2018). In the following we focus on univariate models, but some implications for the multivariate case are addressed in the discussion. In this context we also comment on forecasting, while our main concern is parameter estimation.

The article is structured as follows. In Section 2 we describe the endemic-epidemic model in a basic form, extend it to account for underreporting and derive marginal moment properties. These form the basis of the approximate inference scheme introduced in Section 3 and assessed in simulation studies in Section 4. In Section 5 we extend the approach to models with time-varying parameters and apply it in a case study on rotavirus gastroenteritis in Berlin, Germany. Section 6 concludes with a discussion.

2 The endemic-epidemic model with underreporting

2.1 Model definition

The endemic-epidemic modelling framework for time series of infectious disease counts was introduced by Held et al. (2005) and extended in a series of articles (Paul et al. 2008; Held and Paul 2012; Meyer and Held 2014, 2017). While much of the existing work treats multivariate models, we here focus on the univariate version. Also, we assume all model parameters to be time-constant and postpone the case of time-varying parameters to Section 5.1 and Appendix A.

Conditional on the past, the number of cases of a given disease in week

is then assumed to follow a negative binomial distribution


with conditional mean and overdispersion parameter so that . Here we use to denote the history . The negative binomial distribution allows the model to adapt to the typically high dispersion in surveillance data. In the simplest case the conditional expectation is additively decomposed into


Here the term is interpreted as the endemic component of incidence, i.e. the part which cannot be explained by previous cases, e.g. due to travelling outside the study region or indirect transmission from environmental sources. The autoregressive term with captures the dynamics of the process and is referred to as the epidemic component.

In Bracher and Held (2019) we recently relaxed the AR(1) assumption from (2) by regressing on a weighted sum of past observations, e.g. using geometrically decaying weights. This reflects the possibility of serial intervals (the time span between symptom onset in a case and a secondary case infected by the first) exceeding one week. A related approach (compare equation (6) in Section 2.2) is to include an autoregression on , as is done in INGARCH models (Ferland et al., 2006; Fokianos, 2016):


where . The initial value is assumed to be random (a common assumption is that it follows the respective stationary distribution). This corresponds to a negative binomial INGARCH(1, 1) model (Zhu, 2011). We will work with this extended definition, which will turn out to be particularly useful in the context of underreporting.

In the following we consider the case where the process is not fully observable. Instead, we observe an underreported process which depends on through independent binomial thinning,


where is the reporting probability.

2.2 The effective reproductive number

Bauer and Wakefield (2018) recently derived the AR(1) version (1)–(2) of the endemic-epidemic model from a time-discrete SIR (susceptible-infected-removed) model with a large population and a relatively rare disease. They showed that the parameter can be interpreted as the local effective reproductive number , i.e. the average number of secondary cases an infected individual causes in the considered population, given all partial immunities and intervention measures. This approach had previously been taken by Wang et al. (2011).

For the extended model (3)–(4) the conditional mean can be re-written as (Fokianos 2016, p.7)


Thus, depends on all past values and impacts all future ones . The term can be interpreted as the probability that an individual infected in week is still infectious in week (compare Wang et al. 2011, Bauer and Wakefield 2018). The effective reproductive number is then

2.3 Properties of completely observed and underreported processes

As shown by Zhu (2011), is second-order stationary if

, with marginal mean, variance and autocorrelation function


respectively (see also Weiß 2018, p.86). Here we denote by


the first-order autocorrelation and the decay factor of the autocorrelation function, respectively.

The corresponding quantities for the underreported process are


where the attenuation factor can be written as

see Supplementary Material for the derivation of (9). The attenuation of the autocorrelations gets stronger (i.e.  gets smaller) as and the index of dispersion decrease.

2.4 Second-order equivalent models

An interesting implication of equations (7)–(9) is that different combinations of reporting probabilities and parameters for the latent process can lead to the same marginal second-order properties of the underreported process. In fact, for any underreported process with underlying parameters and reporting probability , and any one can find a process


so that the underreported process

has the same marginal second-order properties as :


We then call and second-order equivalent. While they are not equal in distribution, they behave very similarly (simulation study in the Supplementary Material). An important special case emerges for when and the fully observed process are second-order equivalent.

The equality of marginal means, variances and autocorrelation functions in (11) implies a system of four equations, which yields explicit formulas for the parameters of the process :


where . Derivations are given in the Supplementary Material, where it is also shown that for the parameters , , , and are all non-negative, i.e.  is a well-defined process. Moreover, and are monotonically decreasing in , while and are monotonically increasing (see Supplementary Material). This has implications on the biases occurring when underreporting is ignored, see Section 3.2. The presented argument can even be extended to models with time-varying parameters, see Section 5.1 and Appendix A.

3 Inference

3.1 Non-identifiability of the reporting probability

A practical implication of the results from Section 2.4 is that it is generally not possible to estimate the reporting probability from time series data alone. The reason is that equally good fits can be achieved with quasi-equivalent models featuring different reporting probabilities. In order to estimate one would need to fix one of the other parameters. Similar identifiability issues have been reported for count data regression models (Dvorzak and Wagner, 2016) and latent INAR time series models (Bracher, 2019). For compartmental models, simulation studies have led to related conclusions, e.g. Fraser et al. (2009, Supplementary Material, p.18) state that the reporting probability and epidemic parameters cannot be estimated jointly.

3.2 Naïve approaches and their biases

In practical analyses of surveillance data underreporting is often ignored, but this will lead to biased parameter estimates. Assume that the data come from a process with reporting probability and underlying parameters . If we wrongly assume full reporting, the parameter estimates can be expected to be around the parameters (12)–(15) of a process which is second-order equivalent to , but has reporting probability . The derivatives etc. given in the Supplementary Material imply that this leads to overestimation of and and underestimation of , and . Note that an upward bias in means we overestimate the relative importance of incidence from two or more weeks ago, compare (6).

A simple approach to account for underreporting is to inflate counts with a multiplication factor, given by the inverse of the assumed reporting probability, prior to analysis (e.g. , Jandarov et al. 2014, Stocks et al. 2018). This changes their overall level and variance, but leaves the autocorrelations unaffected. As and reflect the autocorrelation structure, we do not expect this approach to alleviate bias in these parameters. We can use (12)–(15) to find the parameters of a fully observed process which is second-order equivalent to the process . This tells us which biases to expect, see Section 4.2 for a simulation study exploring this in more detail.

3.3 Maximum likelihood estimation via marginal moment matching

The challenge in fitting the hierarchical model (3)–(5) to data is that the likelihood cannot be evaluated directly. Equations (12)–(15) offer two potential solutions. Both are based on marginal moment matching where underreported processes are approximated by second-order equivalent processes without underreporting, i.e. a reporting probability of one. This is inspired by a method for fitting Gaussian AR() models to mismeasured data by expressing them as ARMA(, ) models (Staudenmayer and Buonaccorsi, 2005).

A simple strategy is to fit a model with , i.e. ignoring underreporting, and subsequently de-bias the parameter estimates. The corrected estimates are given by the parameters underlying a process which is second-order equivalent to the fitted model , but underreported with the assumed reporting probability . These can be obtained using equations (12)–(15). However, this does not ensure that the corrected estimates for and are non-negative (as does not hold here; see Supplementary Material). Especially if the true is small this poses a problem (see simulation study in the Supplementary Material). Moreover, this approach does not extend well to models with time-varying parameters.

We therefore adopt a more elaborate approach where underreporting is accounted for during rather than after the likelihood optimization. In order to evaluate the likelihood of a model with underlying parameters and reporting probability , we consider the second-order equivalent model without underreporting () and parameters as in (12)–(15). Such a process always exists and is well-defined as . The likelihood of can then be approximated by that of , which is easier to evaluate. We also estimate the initial value (Ferland et al., 2006) and for simplicity set . We can then compute


where are the observed counts. For each the likelihood contribution is a negative binomial density with mean (equation (10)) and overdispersion parameter . Maximization of the likelihood function for a given can be done using standard numerical optimization and can be varied to assess the sensitivity of the estimates to this choice. This procedure can be extended to models with time-varying parameters, see Appendix A.

4 Simulation studies

4.1 Comparison to forward algorithm

If , i.e. if the latent process

is a first-order Markov chain, the likelihood of an underreported endemic-epidemic model can be evaluated via the forward algorithm

(Zucchini and MacDonald, 2009). This can serve as a comparison for the proposed approximation method.

To explore the agreement over a broad range of parameter combinations, we independently sampled

sets of parameters from uniform distributions

, , , , , where . As in the estimation method we treat the initial value as an additional parameter. Sets of parameters with , i.e. which imply non-stationarity, were discarded. For each set of parameters we generated a time series of length 100 from the respective underreported process . Subsequently we evaluated the log-likelihood at the true parameter values using the forward algorithm and the proposed approximation method.

Figure 1 shows the differences between the two methods, plotted against characteristics of the generating model. The agreement between both methods is very good, with absolute differences below 0.1 in 73% and below 1 in 97% of the cases. The approximation works somewhat less well as approaches 1, i.e. close to non-stationarity (middle panel), and when the reporting probability is low (right panel). The computation time required for the forward algorithm was two orders of magnitude longer than for the approximation method.

Figure 1: Difference between the log-likelihood values returned by the proposed approximation and the forward algorithm, evaluated at the true parameter values. The differences for 1000 randomly generated parameter combinations are plotted against the stationary mean of the underreported process (on a log scale, left); , which describes how close the process is to non-stationarity (threshold at 1, middle); the reporting probability (right).

4.2 Bias and loss of precision due to underreporting

In a second simulation study we examine whether the arguments from Section 3.2 describe the biases in naïve analyses well and whether the proposed estimation method can avoid them. We simulated 1000 sample paths of length 416 (i.e. eight years of weekly data, as in the case study in Section 5) from a model with and . This implies a marginal mean of

and standard deviation of

, while the effective reproductive number is . We generated underreported versions of these time series with five different reporting probabilities , where corresponds to full reporting. Selected sample paths can be found in the Supplementary Material. The underreported time series are analysed (a) using the proposed estimation method and the correct , (b) ignoring underreporting and (c) inflating the counts with a factor of prior to fitting the model (with subsequent rounding). The results are shown in Figure 2.

The proposed method avoids bias with increasing variability of the estimates as – and thus the data quality – decreases. Ignoring underreporting leads to pronounced biases for low values of . These are well described by equations (12)–(15) with , compare Section 3.2 (red lines in the plots). The multiplication factor method avoids bias in , but has similar biases in the other parameters, the bias for is even more pronounced than when underreporting is ignored. We re-ran the analysis with simulated time series of length 208 (four years of weekly data) and 832 (sixteen years), which gave similar results (see Supplementary Material).

Figure 2: Empirical distribution of parameter estimates under different degrees of underreporting. (a) With the proposed method and the correct . (b) Ignoring underreporting, i.e. assuming . (c) Applying the multiplication factor prior to analysis. Green lines show true parameter values, red lines expected biases.

5 Case study: rotavirus gastroenteritis in Berlin

Rotavirus causes gastroenteritis with symptoms including diarrhoea, vomiting and fever and affects predominantly young children (Heymann, 2015). It is transmitted via the faecal oral route and more rarely contaminated food or water. The average serial interval is five to six days (Richardson et al., 2001). Previous infection has an immunizing effect, but immunity is imperfect and waning over time. Since 2006 a vaccine has been available in Germany and since 2013 it has been recommended by the Standing Committee on Vaccination.

Figure 3: Weekly numbers of laboratory-confirmed rotavirus cases in Berlin, 2001–2008.

Rotavirus is a notifiable disease in Germany. We analyse weekly counts of laboratory-confirmed cases in Berlin, provided by the Robert Koch Institute via its web platform (www.survstat.rki.de). The data cover the years 2001–2008 and are displayed in Figure 3. Incidence in this period has previously been assumed to be largely unaffected by vaccination, while from 2009 onwards effects of increased coverage could be observed (Weidemann et al. 2014, Stocks et al. 2018).

Rotavirus surveillance counts are known to be strongly underreported (Tam et al., 2012). The only available estimate of the reporting probability in Germany comes from a study by Weidemann et al. (2014)

. For the period 2001–2008 it is quantified as 4.3% (95% credibility interval: 3.9% to 4.7%) in the Western Federal States including Berlin and 19.0% (17.6% to 20.5%) in the Eastern Federal States.

5.1 Model specification

We apply an endemic-epidemic model with underreporting to the rotavirus data. Rotavirus is often modelled using more complex mechanistic approaches describing the effects of births, partial immunity acquired by infection and the waning thereof (Atkins et al. 2012; Weidemann et al. 2014; Stocks et al. 2018). Such an approach is beneficial to inform intervention strategies, but a considerable number of parameters values must be pre-specified to enable identifiability of the remaining ones (Weidemann et al., 2014). With the endemic-epidemic model we take a different approach, where the above factors are subsumed in the effective (rather than basic) reproductive number and its variation over the course of a season. A similar approach has been taken by van Gaalen et al. (2017). The simplicity of the assumed epidemic model then enables us to explicitly assess the impact of reporting.

For the rotavirus data we extend model (3)–(5) by allowing and to vary over time,


where we set

to reflect yearly seasonality in weekly data (Held and Paul, 2012). The parameter is assumed to be time-constant and is treated as an additional unknown parameter. Exploratory analyses indicated that including more than one pair of sine/cosine waves or linear time trends into or can lead to slight improvements in AIC, but consistently worse BIC. As the respective parameter estimates were not significantly different from zero in any of the cases we did not include these additional terms.

The models are fitted using the approach described in Appendix A, which extends that from Section 3.3 to time-varying parameters. The idea is that given the initial value , one can recursively compute marginal means, variances and autocovariance functions of also if and vary over time. The parameters of a second-order equivalent process without underreporting, i.e. , can be obtained using a second recursive procedure. As in equation (16) the likelihood of the underreported model can then be approximated by that of , and likelihood optimization can be done using numerical optimization.

5.2 Results

We fitted model (17) for different reporting probabilities . Plots showing fitted and observed values as well as residual autocorrelation functions are provided in the Supplementary Material. The obtained parameter estimates are displayed in Figure 4. The effective reproductive number shows considerable seasonal variation, with higher values during winter and early spring. With as in Weidemann et al. (2014), ranges from 0.65 in calendar week 24 to 1.00 in calendar week 50 (highlighted in red). Under the assumption the range is from 0.48 to 0.98. The sensitivity of the estimates to the assumed reporting probability is thus stronger during the off-season. This is in line with the observation from Section 2.3 that correlation attenuation is inversely related to the index of dispersion, which is typically higher during the high season. In fact, the lines for slightly cross during the first weeks of the year, but this is likely an artefact due to the parsimonious parametrization of the seasonal variation. The small value obtained with implies that only incidence from the directly preceding week is relevant in equation (6). This is in line with the average serial interval slightly below a week given by Richardson et al. (2001), and more plausible than the value obtained with .

Figure 4: Parameter estimates of the underreported endemic-epidemic model applied to weekly rotavirus counts in Berlin, as functions of the assumed reporting probability . The bottom right panel shows the time-varying effective reproductive number . The fit with as in Weidemann et al. (2014) is highlighted in red.

Weidemann et al. (2014) state that the reporting probability may have increased in 2005 due to changed reimbursement schemes, and this is in line with the somewhat increased reported incidence in the seasons 2005–2008. We therefore also fitted a model where in the years 2001 through 2004, then increases linearly to 0.063 over the course of the year 2005 and stays there until 2008. However, as Figure 5 and Supplementary Table 1 show, the resulting parameter estimates are similar to those obtained with a time-constant . To gain an understanding of the uncertainty in the estimates, Figure 5 also shows week-wise 90% confidence bands. As and

are non-linear functions of the estimated parameters, the confidence bands were obtained by repeatedly sampling from a multivariate normal distribution with the maximum likelihood estimates as the mean and the inverse Fisher information as the covariance matrix. It can be seen that there is relatively large uncertainty around the estimates of


As rotavirus spreads predominantly among young children we re-ran our analysis restricted to cases in the age group of four years or younger (these represent approximately 70% of the notified cases). The results, provided in the Supplementary Material, are similar to those reported above. We also re-ran the analysis separately for each of Berlin’s twelve districts (see Supplementary Material). The results get somewhat less stable, and on average the effective reproductive numbers are lower than in the pooled analysis shown here. This is plausible as transmission between districts is ignored in this simple stratified analysis.

Our results are compatible with those from Stocks et al. (2018) who obtain a seasonal range of 0.9 to 1.2 for the basic reproductive number (the basic reproductive number naturally being larger than the effective). In van Gaalen et al. (2017), a study based on Dutch data, the effective reproductive number shows a similar seasonal pattern as in Figure 4, but is somewhat higher (between 0.85 and 1.15, Figure 3 in van Gaalen et al. 2017). Note that both Stocks et al. (2018) and van Gaalen et al. (2017) account for underreporting via multiplication factors.

Figure 5: Estimated time-varying endemic parameters and effective reproductive numbers when assuming (a) (purple), (b) (red) and (c) time-varying reporting probability increasing from to

over the course of 2005 (green). Point-wise 90% confidence intervals are shown as shaded areas.

6 Discussion

We addressed the problem of fitting endemic-epidemic models to underreported data. We demonstrated that reporting probabilities are not identifiable from time series data alone and discussed biases occurring in naïve analyses. In simulation studies we showed that the proposed likelihood approximation is highly accurate and that, combined with the correct reporting probability, our inference method avoids bias. The case study on rotavirus gastroenteritis showed that estimates of effective reproductive numbers increase when underreporting is taken into account.

A number of alternatives to the proposed approximate maximum likelihood method exist. Preliminary analyses with a full Bayesian approach have been hampered by two difficulties. Firstly, when using standard MCMC samplers, mixing for the latent process tended to be poor. To circumvent this, more sophisticated algorithms seem to be needed. Secondly, prior choice is challenging in endemic-epidemic models, compare Bauer and Wakefield (2018). Advantages, however, are that the uncertainty about the reporting probability could be included more easily and that the hidden process could be reconstructed. Alternatively it may be worthwhile to combine the proposed likelihood approximation with a hybrid Bayesian approach, allowing for uncertainty in the reporting probability but preventing feedback using the cut algorithm (Plummer, 2015). An alternative maximum likelihood estimation approach could be based on simulation-based particle filtering, as implemented in the R package pomp (King et al., 2016). While this represents a powerful and very general tool, it is computationally demanding and requires careful choice of starting values to obtain reliable results (Stocks et al., 2018).

We focused on univariate endemic-epidemic models, whereas much of the previous work on the class covered multivariate settings (most recently Meyer and Held 2017, Bracher and Held 2019). It is not entirely straightforward to extend our approach to the multivariate case, but our qualitative findings will still be relevant. They indicate that underreported processes are well approximated by processes with an autoregression on or, as in equation (6), higher-order lags. If the goal is prediction (Czado et al., 2009; Held et al., 2017) rather than parameter estimation, there should be little benefit in explicitly accounting for underreporting. Also, we observed that the biases in the epidemic parameters and were weaker when counts were high. They may thus be negligible if counts are in the hundreds or even thousands (e.g. Bauer and Wakefield 2018). An additional difficulty in multivariate settings is to account for differences in reporting probabilities between strata, where reliable information is rarely available.

The application to the reported incidence of rotavirus gastroenteritis in Berlin deserves some additional comments. Firstly, while the analysis with a time-discrete model on a weekly scale is appropriate for an average serial interval of 5–6 days as reported by Richardson et al. (2001), other studies have assumed shorter serial intervals for rotavirus, e.g. 2.1 days in van Gaalen et al. (2017)

. Dynamics within a week, which are not captured by our model, then become more important. Another aspect is that we did not account for different transmission patterns in different age groups, or spatial dynamics. Taking these into account in a multivariate analysis could yield more detailed insights, but would require information on reporting probabilities within age groups or districts.

A related problem not discussed here consists in reporting delays, where cases may be registered one or several weeks after their actual occurrence (Azmon et al. 2014, Noufaily 2019). This aspect is of particular importance when models are used for outbreak detection or nowcasting (Höhle and an der Heiden, 2014).

Appendix A: Marginal moment matching in the case of time-varying parameters

Consider a process where


and is treated as an additional parameter. This is more general than (17) as , and are also time-varying. The time-varying marginal means , variances and autocovariance functions of can be obtained recursively as


where . This follows from arguments outlined in the Supplementary Material. For the underreported process the same arguments as in the time-homogeneous case (see Supplementary Material) yield


As in the case of time constant parameters one can find a completely observed process


with the same marginal second-order properties as , i.e.

for . Its initial value and time-dependent parameters can be obtained step by step by matching the marginal means, variances and autocovariance functions:

  1. Initialization: Set and . This ensures and .

  2. Recursive matching of marginal moments: For iterate:

    1. Match decay factor of the autocovariance function: Set


      where . This ensures for .

    2. Match marginal mean: To ensure set

    3. Match first-order autocovariance: To ensure set


      With Step 2(a) this also implies a value for as .

    4. Compute

    5. Match marginal variance: To ensure set


Details on how expressions (26)–(29) are obtained are given in the Supplementary Material. Unlike in the time-homogeneous case we do not formally proof that the parameters of are always non-negative. However, numerical studies indicated that this is the case. As in the time-homogeneous case the likelihood can be approximated by , which can be evaluated along the lines of equation (16).


  • Atkins et al. (2012) Atkins, K. E., Shim, E., Pitzer, V. E., and Galvani, A. P. (2012). Impact of rotavirus vaccination on epidemiological dynamics in England and Wales. Vaccine, 30(3):552 – 564.
  • Azmon et al. (2014) Azmon, A., Faes, C., and Hens, N. (2014). On the estimation of the reproduction number based on misreported epidemic data. Stat Med, 33(7):1176–1192.
  • Bailey et al. (2005) Bailey, T., Carvalho, M., Lapa, T., Souza, W., and Brewer, M. (2005). Modeling of under-detection of cases in disease surveillance. Ann Epidemiol, 15(5):335–343.
  • Bauer and Wakefield (2018) Bauer, C. and Wakefield, J. (2018). Stratified space–time infectious disease modelling, with an application to hand, foot and mouth disease in China. J R Stat Soc C-Appl, 67(5):1379–1398.
  • Bracher (2019) Bracher, J. (2019). Comment on “Under-reported data analysis with INAR-hidden Markov chains”. Stat Med, 38(5):893–898.
  • Bracher and Held (2019) Bracher, J. and Held, L. (2019). Endemic-epidemic models with discrete-time serial interval distributions for infectious disease prediction. Preprint available at https://arxiv.org/abs/1901.03090.
  • Cauchemez and Ferguson (2008) Cauchemez, S. and Ferguson, N. M. (2008). Likelihood-based estimation of continuous-time epidemic models from time-series data: application to measles transmission in London. J R Soc Iinterface, 5(25):885–897.
  • Czado et al. (2009) Czado, C., Gneiting, T., and Held, L. (2009). Predictive model assessment for count data. Biometrics, 65(4):1254–1261.
  • Dorigatti et al. (2012) Dorigatti, I., Cauchemez, S., Pugliese, A., and Ferguson, N. M. (2012). A new approach to characterising infectious disease transmission dynamics from sentinel surveillance: Application to the Italian 2009–2010 A/H1N1 influenza pandemic. Epidemics, 4(1):9 – 21.
  • Dvorzak and Wagner (2016) Dvorzak, M. and Wagner, H. (2016). Sparse Bayesian modelling of underreported count data. Stat Model, 16(1):24–46.
  • Ferland et al. (2006) Ferland, R., Latour, A., and Oraichi, D. (2006). Integer-valued GARCH process. J Time Ser Anal, 27(6):923–942.
  • Fernández-Fontelo et al. (2016) Fernández-Fontelo, A., Cabaña, A., Puig, P., and Moriña, D. (2016). Under-reported data analysis with INAR-hidden Markov chains. Stat Med, 35(26):4875–4890.
  • Fernández-Fontelo et al. (2019) Fernández-Fontelo, A., Cabaña, A., Joe, H., Puig, P., and Moriña, D. (2019). Untangling serially dependent underreported count data for gender-based violence. Stat Med, 38(22):4404–4422.
  • Fintzi et al. (2017) Fintzi, J., Cui, X., Wakefield, J., and Minin, V. N. (2017). Efficient data augmentation for fitting stochastic epidemic models to prevalence data. J Comput Graph Stat, 26(4):918–929.
  • Fokianos (2016) Fokianos, K. (2016). Statistical analysis of count time series models: a GLM perspective. In Davis, R., Holan, S., Lund, R., and Ravishanker, N., editors, Handbook of Discrete-Valued Time Series, pages 3–28. Wiley, Boca Raton, Florida.
  • Fraser et al. (2009) Fraser, C., Donnelly, C. A., Cauchemez, S., Hanage, W. P., Van Kerkhove, M. D., Hollingsworth, T. D., Griffin, J., Baggaley, R. F., Jenkins, H. E., Lyons, E. J., Jombart, T., Hinsley, W. R., Grassly, N. C., Balloux, F., Ghani, A. C., Ferguson, N. M., Rambaut, A., Pybus, O. G., Lopez-Gatell, H., Alpuche-Aranda, C. M., Chapela, I. B., Zavala, E. P., Guevara, D. M. E., Checchi, F., Garcia, E., Hugonnet, S., Roth, C., and The WHO Rapid Pandemic Assessment Collaboration (2009). Pandemic potential of a strain of Influenza A (H1N1): Early findings. Science, 324(5934):1557–1561.
  • Gamado et al. (2014) Gamado, K. M., Streftaris, G., and Zachary, S. (2014). Modelling under-reporting in epidemics. J Mat Biol, 69(3):737–765.
  • Gibbons et al. (2014) Gibbons, C. L., Mangen, M.-J. J., Plass, D., Havelaar, A. H., Brooke, R. J., Kramarz, P., Peterson, K. L., Stuurman, A. L., Cassini, A., Fèvre, E. M., and Kretzschmar, M. E. (2014). Measuring underreporting and under-ascertainment in infectious disease datasets: a comparison of methods. BMC Public Health, 14(1):1–17.
  • Held et al. (2005) Held, L., Höhle, M., and Hofmann, M. (2005). A statistical framework for the analysis of multivariate infectious disease surveillance counts. Stat Model, 5(3):187–199.
  • Held et al. (2017) Held, L., Meyer, S., and Bracher, J. (2017). Probabilistic forecasting in infectious disease epidemiology: the 13th Armitage lecture. Stat Med, 36(22):3443–3460.
  • Held and Paul (2012) Held, L. and Paul, M. (2012). Modeling seasonality in space-time infectious disease surveillance data. Biometrical J, 54(6):824–843.
  • Heymann (2015) Heymann, D., editor (2015). Control of Communicable Diseases Manual. APHA Press, Washington, DC, 20th edition.
  • Höhle and an der Heiden (2014) Höhle, M. and an der Heiden, M. (2014). Bayesian nowcasting during the STEC O104:H4 outbreak in Germany, 2011. Biometrics, 70(4):993–1002.
  • Jandarov et al. (2014) Jandarov, R., Haran, M., Bjørnstad, O., and Grenfell, B. (2014). Emulating a gravity model to infer the spatiotemporal dynamics of an infectious disease. J R Stat Soc C-Appl, 63(3):423–444.
  • King et al. (2016) King, A., Nguyen, D., and Ionides, E. (2016). Statistical inference for partially observed Markov processes via the R package pomp. J Stat Softw, 69(12):1–43.
  • Magal and Webb (2018) Magal, P. and Webb, G. (2018). The parameter identification problem for SIR epidemic models: identifying unreported cases. J Mat Biol, 77(6):1629–1648.
  • Meyer and Held (2014) Meyer, S. and Held, L. (2014). Power-law models for infectious disease spread. Ann Appl Stat, 8(3):1612–1639.
  • Meyer and Held (2017) Meyer, S. and Held, L. (2017). Incorporating social contact data in spatio-temporal models for infectious disease spread. Biostatistics, 18(2):338–351.
  • Meyer et al. (2017) Meyer, S., Held, L., and Höhle, M. (2017). Spatio-temporal analysis of epidemic phenomena using the R package surveillance. J Stat Softw, 77(11):1–55.
  • Noufaily (2019) Noufaily, A. (2019). Underreporting and reporting delays. In Held, L., Hens, N., O’Neill, P., and Wallinga, J., editors, Handbook of Infectious Disease Data Analysis. Chapman and Hall/CRC, London, UK. To appear.
  • Oliveira et al. (2017) Oliveira, G. L., Loschi, R. H., and Assunção, R. M. (2017). A random‐censoring Poisson model for underreported data. Stat Med, 36(30):4873–4892.
  • Paul et al. (2008) Paul, M., Held, L., and Toschke, A. M. (2008). Multivariate modelling of infectious disease surveillance data. Stat Med, 27(29):6250–6267.
  • Plummer (2015) Plummer, M. (2015). Cuts in Bayesian graphical models. Stat Comput, 25(1):37–43.
  • Richardson et al. (2001) Richardson, M., Elliman, D., Maguire, H., Simpson, J., and Nicoll, A. (2001). Evidence base of incubation periods, periods of infectiousness and exclusion policies for the control of communicable diseases in schools and preschools. Pediatr Infect Dis J, 20(4):380–391.
  • Sharmin et al. (2018a) Sharmin, S., Glass, K., Viennet, E., and Harley, D. (2018a). A Bayesian approach for estimating under-reported dengue incidence with a focus on non-linear associations between climate and dengue in Dhaka, Bangladesh. Stat Methods Med Res, 27(4):991–1000.
  • Sharmin et al. (2018b) Sharmin, S., Glass, K., Viennet, E., and Harley, D. (2018b). Geostatistical mapping of the seasonal spread of under-reported dengue cases in Bangladesh. PLOS Neglect Trop D, 12(11):1–13.
  • Staudenmayer and Buonaccorsi (2005) Staudenmayer, J. and Buonaccorsi, J. P. (2005).

    Measurement error in linear autoregressive models.

    J Am Stat Assoc, 100(471):841–852.
  • Stocks et al. (2018) Stocks, T., Britton, T., and Höhle, M. (2018). Model selection and parameter estimation for dynamic epidemic models via iterated filtering: application to rotavirus in Germany. Biostatistics, page kxy057.
  • Stoner et al. (2019) Stoner, O., Economou, T., and Drummond, G. (2019). A hierarchical framework for correcting under-reporting in count data. J Am Stat Assoc, in press.
  • Tam et al. (2012) Tam, C. C., Rodrigues, L. C., Viviani, L., Dodds, J. P., Evans, M. R., Hunter, P. R., Gray, J. J., Letley, L. H., Rait, G., Tompkins, D. S., and O’Brien, S. J. (2012). Longitudinal study of infectious intestinal disease in the UK (IID2 study): incidence in the community and presenting to general practice. Gut, 61(1):69–77.
  • van Gaalen et al. (2017) van Gaalen, R., van de Kassteele, J., Hahné, S., Bruijning-Verhagen, P., and Wallinga, J. (2017). Determinants of rotavirus transmission. A lag nonlinear time series analysis. Epidemiology, 28(4):503–513.
  • Wang et al. (2011) Wang, Y., Feng, Z., Yang, Y., Self, S., Gao, Y., Longini, I. M., Wakefield, J., Zhang, J., Wang, L., Chen (Bauer), X., Yao, L., Stanaway, J. D., Wang, Z., and Yang, W. (2011). Hand, foot, and mouth disease in China: Patterns of spread and transmissibility. Epidemiology, 22(6):781–792.
  • Weidemann et al. (2014) Weidemann, F., Dehnert, M., Koch, J., Wichmann, O., and Höhle, M. (2014). Bayesian parameter inference for dynamic infectious disease modelling: Rotavirus in Germany. Stat Med, 33(9):1580–1599.
  • Weiß (2018) Weiß, C. (2018). An Introduction to Discrete-Valued Time Series. Wiley, Hoboken, NJ.
  • White and Pagano (2010) White, L. F. and Pagano, M. (2010). Reporting errors in infectious disease outbreaks, with an application to pandemic influenza A/H1N1. Epidemiol Perspect Innov, 7(1):1–12.
  • Zhu (2011) Zhu, F. (2011). A negative binomial integer-valued GARCH model. J Time Ser Anal, 32(1):54–67.
  • Zucchini and MacDonald (2009) Zucchini, W. and MacDonald, I. (2009). Hidden Markov Models for Time Series. Chapman and Hall/CRC, New York, NY.

Supporting Information

Additional supporting information will be made availabele online. The suggested methods have been implemented in R in the package hhh4underreporting. The implementation is largely independent of that for fully observed models in the surveillance package (Meyer et al., 2017). The package can be installed from the repository https://github.com/jbracher/hhh4underreporting which also contains data and code to replicate the presented results.