Comment on "Under-reported data analysis with INAR-hidden Markov chains"

by Johannes Bracher
Universität Zürich

In Fernández-Fontelo et al (Statist. Med. 2016, DOI 10.1002/sim.7026) hidden integer-valued autoregressive (INAR) processes are used to estimate reporting probabilities for various diseases. In this comment it is demonstrated that the Poisson INAR(1) model with time-homogeneous underreporting can be expressed equivalently as a completely observed INAR(∞) model with a geometric lag structure. This implies that estimated reporting probabilities depend on the assumed lag structure of the latent process.



1 Introduction

We read with great interest the article by Fernández-Fontelo et al [3] who discuss underreporting in count time series models, a problem which is encountered in many real-world settings. In case studies, reporting probabilities for several diseases are estimated under the assumption of a hidden INAR(1) process. In our comment we develop this idea further and point out an identifiability problem which arises when the INAR(1) assumption is relaxed. Specifically it is shown that it is not possible to distinguish between time-homogeneous underreporting and a geometric lag structure in Poisson INAR models. Estimation of reporting probabilities from time series data thus relies on the correct specification of the lag structure of the latent process.

The Poisson INAR(1) model [1] with parameters $\nu > 0$ and $0 < \alpha < 1$ is defined as

$$X_t = I_t + \alpha \circ X_{t-1}, \tag{1}$$

where the $I_t$ independently follow a $\text{Pois}(\nu)$ distribution. The operator $\circ$ denotes binomial thinning, i.e. $\alpha \circ X = \sum_{i=1}^{X} Z_i$ with $Z_i \overset{\text{iid}}{\sim} \text{Bern}(\alpha)$. The thinning operations are assumed to be independent of each other and of $\{I_t\}$; further, the thinning operations at each $t$ and $I_t$ are assumed to be independent of $X_{t-1}, X_{t-2}, \dots$. Fernández-Fontelo et al [3] introduce an underreported version of this model. However, they assume that counts are only subject to underreporting with a certain probability $\omega$; otherwise reporting is assumed to be complete. The observed process $Y_t$ is thus

$$Y_t = \begin{cases} X_t & \text{with probability } 1 - \omega, \\ q \circ X_t & \text{with probability } \omega, \end{cases} \tag{2}$$

where $q$ is a reporting probability and, given $X_t$, $Y_t$ is independent of the past. The assumption that reporting is 100% complete during some periods seems strong in many contexts. And indeed, in two case studies (weekly number of human papillomavirus cases in Girona, Spain; annual deaths from mesothelioma in Great Britain) the estimates $\hat\omega$ are close to 1 with confidence intervals including this value. This indicates that $\omega = 1$, i.e. time-homogeneous underreporting, is an important special case. In the following the focus will thus be on models where, instead of (2), the reporting process is

$$Y_t = q \circ X_t. \tag{3}$$
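The latent INAR(1) recursion and the time-homogeneous reporting step can be made concrete with a small simulation. The following is a minimal sketch using only the Python standard library; the names `nu` (immigration rate), `alpha` (thinning probability) and `q` (reporting probability), as well as the numerical values, are assumptions chosen for illustration and are not taken from [3].

```python
import math
import random


def poisson(rate, rng):
    """Draw a Poisson variate with Knuth's multiplication method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def thin(count, prob, rng):
    """Binomial thinning: keep each of `count` individuals independently with probability `prob`."""
    return sum(rng.random() < prob for _ in range(count))


def simulate_underreported_inar1(nu, alpha, q, n_steps, seed=1):
    """Return the latent counts X_t and the observed counts Y_t = q o X_t."""
    rng = random.Random(seed)
    x = poisson(nu / (1 - alpha), rng)  # start from the stationary Poisson marginal
    latent, observed = [], []
    for _ in range(n_steps):
        x = poisson(nu, rng) + thin(x, alpha, rng)  # latent INAR(1) step
        latent.append(x)
        observed.append(thin(x, q, rng))            # time-homogeneous underreporting
    return latent, observed


# Illustrative (hypothetical) parameter values.
latent, observed = simulate_underreported_inar1(nu=2.0, alpha=0.5, q=0.33, n_steps=20000)
mean_latent = sum(latent) / len(latent)
mean_observed = sum(observed) / len(observed)
print(mean_latent, mean_observed)
```

Under these assumptions the latent mean should be close to `nu / (1 - alpha)` and the observed mean close to `q * nu / (1 - alpha)`, so the observed series alone only pins down the product of reporting probability and latent mean.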


2 Interplay of underreporting and geometric lags in INAR models

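We now generalize the INAR(1) from (1) to an INAR($\infty$) model with geometric lags and show how closely this aspect is related to underreporting. The starting point is the INAR($p$) model introduced by Alzaid and Al-Osh [2] which is defined as

$$X_t = I_t + \sum_{i=1}^{p} \alpha_i \circ X_{t-i}, \tag{4}$$

$$(\alpha_1 \circ X_t, \dots, \alpha_p \circ X_t) \mid X_t \sim \text{Mult}(X_t; \alpha_1, \dots, \alpha_p) \tag{5}$$

with $\sum_{i=1}^{p} \alpha_i < 1$.

This INAR($p$) model does not allow for $p = \infty$ as the multinomial distribution requires a finite number of categories. We therefore reformulate the multinomial distribution in (5) as (compare e.g. [5], p. 33)

$$A_t \mid X_t \sim \text{Bin}\Big(X_t, \sum_{i=1}^{p} \alpha_i\Big), \tag{6}$$

$$\Pr(W_{t,j} = i) = \frac{\alpha_i}{\sum_{k=1}^{p} \alpha_k}, \quad i = 1, \dots, p; \; j = 1, \dots, A_t, \tag{7}$$

$$\alpha_i \circ X_t = \sum_{j=1}^{A_t} \mathbb{1}(W_{t,j} = i), \tag{8}$$

where $\mathbb{1}$ is the indicator function. In the interpretation given in Weiß [7] (p. 46), $A_t$ is the number of individuals (among the $X_t$ from time $t$) which will be renewed during $t+1, \dots, t+p$. The variables $W_{t,j}$ are the respective waiting times. This formulation extends more easily to the case $p = \infty$. Specifically, consider a geometric lag structure

$$\alpha_i = \kappa (1 - \phi) \phi^{i-1}, \quad i = 1, 2, \dots \tag{9}$$

with $0 < \kappa < 1$ and $0 \le \phi < 1$. As $\sum_{i=1}^{\infty} \alpha_i = \kappa$ we get $\Pr(W_{t,j} = i) = (1 - \phi)\phi^{i-1}$, i.e. the $W_{t,j}$ follow a geometric distribution with parameter $1 - \phi$ and support $\{1, 2, \dots\}$. The INAR($\infty$) process with parameters $\nu, \kappa$ and $\phi$ can thus be defined as

$$X_t = I_t + \sum_{i=1}^{\infty} \alpha_i \circ X_{t-i}, \quad I_t \overset{\text{iid}}{\sim} \text{Pois}(\nu), \tag{10}$$

$$A_t \mid X_t \sim \text{Bin}(X_t, \kappa), \tag{11}$$

$$W_{t,j} \overset{\text{iid}}{\sim} \text{Geom}(1 - \phi), \quad j = 1, \dots, A_t, \tag{12}$$

$$\alpha_i \circ X_t = \sum_{j=1}^{A_t} \mathbb{1}(W_{t,j} = i). \tag{13}$$

This ensures, like the multinomial distribution in (5), that $\alpha_i \circ X_t \mid X_t \sim \text{Bin}(X_t, \alpha_i)$ while $\sum_{i} \alpha_i \circ X_t \le X_t$. Again for each $t$ it is assumed that $I_t$ and, given $X_t$, the variables $A_t$ and $W_{t,j}$ are independent of the past history of the process. Such geometric lag structures are closely linked to underreporting as the following can be shown (proofs in the Appendix):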

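  (A) Consider a Poisson INAR(1) process $X_t$ with parameters $\nu, \alpha$ and the underreported process $Y_t = q \circ X_t$. There is a Poisson INAR($\infty$) process $Y^*_t$ of type (10)–(13) which is equivalent to $Y_t$; its parameters (immigration rate, total autoregressive weight and decay factor) are $\nu^* = \nu q / (1 - \alpha(1-q))$, $\kappa^* = \alpha q / (1 - \alpha(1-q))$ and $\phi^* = \alpha (1 - q)$.

  (B) Consider a Poisson INAR($\infty$) process $Y^*_t$ of type (10)–(13) with parameters $\nu^*$, $0 < \kappa^* < 1$ and $0 < \phi^* < 1$. The underreported process $Y_t = q \circ X_t$ where $X_t$ is an INAR(1) process with parameters $\alpha = \phi^* + \kappa^*(1 - \phi^*)$ and $\nu = \nu^*(1 - \phi^*)/q$, and where $q = \kappa^*(1 - \phi^*)/\alpha$, is equivalent to $Y^*_t$.

  (C) More generally, consider a Poisson INAR($\infty$) process $X_t$ with parameters $\nu, \kappa, \phi$ and $Y_t = q \circ X_t$. For any $\tilde q \in [q, 1]$ there is a Poisson INAR($\infty$) process $\tilde X_t$ so that $\tilde q \circ \tilde X_t$ is equivalent to $Y_t$. With $s = q / \tilde q$, the parameters of $\tilde X_t$ are $\tilde\nu = \nu s / (1 - \kappa(1 - s))$, $\tilde\kappa = \kappa s / (1 - \kappa(1 - s))$ and $\tilde\phi = \phi + \kappa (1 - \phi)(1 - s)$.

For each underreported INAR($\infty$) process and, as a special case, each underreported INAR(1) process, there are thus many different INAR($\infty$) representations. Each of them features a different combination of reporting probability and decay factor for the autoregressive parameters. In Fernández-Fontelo et al [3] identifiability is ensured by the assumption that the latent process is indeed INAR(1), i.e. $\phi = 0$. Without substantial prior knowledge to justify a precise value of $\phi$, however, the reporting probability cannot be estimated.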
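The family of equivalent representations can be traced numerically for the INAR(1) special case. The sketch below follows the route of the Appendix: the latent process is thinned by `q / q_tilde` and the result is rewritten as a geometric-lag process. The function name, the symbols `nu`, `alpha`, `q`, `q_tilde`, `kappa`, `phi` and the parameter values are assumptions made here for illustration, as is the parametrization of the lag weights as `kappa * (1 - phi) * phi ** (i - 1)`.

```python
def equivalent_representation(nu, alpha, q, q_tilde):
    """Parameters (nu~, kappa~, phi~) of a latent geometric-lag INAR(inf) process X~
    such that q_tilde o X~ has the same law as q o X for a Poisson INAR(1) process X."""
    if not q <= q_tilde <= 1:
        raise ValueError("q_tilde must lie in [q, 1]")
    r = q / q_tilde                # X~ behaves like X thinned with probability r
    phi = alpha * (1 - r)          # survive one step and stay unobserved
    kappa = alpha * r / (1 - phi)  # probability of ever being observed again
    nu_new = nu * r / (1 - phi)    # rate of first-time observations
    return nu_new, kappa, phi


def lag_weight(kappa, phi, i):
    """Autoregressive weight at lag i under the assumed geometric lag structure."""
    return kappa * (1 - phi) * phi ** (i - 1)


# Hypothetical latent INAR(1) parameters with reporting probability q.
nu, alpha, q = 2.0, 0.5, 0.33
results = {}
for q_tilde in (0.33, 0.5, 0.75, 1.0):
    nu_t, kappa_t, phi_t = equivalent_representation(nu, alpha, q, q_tilde)
    observed_mean = q_tilde * nu_t / (1 - kappa_t)  # stationary mean of q_tilde o X~
    results[q_tilde] = (nu_t, kappa_t, phi_t, observed_mean)
    print(q_tilde, round(nu_t, 3), round(kappa_t, 3), round(phi_t, 3), round(observed_mean, 3))
```

Every row has the same observed mean although the reporting probability ranges from `q` to 1, which is the practical face of the identifiability problem: the decay factor `phi` absorbs whatever the reporting probability gives up.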

3 Application to human papillomavirus cases in Girona, Spain

For illustration of the argument we revisit the analysis of reported human papillomavirus cases in Girona, Spain (2010–2014) from Fernández-Fontelo et al [3]. The authors assume the latent model (1) with reporting process (2), but as the estimate $\hat\omega$ comes with a confidence interval from 0.78 to 1.07 they suggest that the simpler version (3) could be used as well. For simplicity we will thus pretend that their parameter estimates come from this simpler model (even though the model then cannot accommodate overdispersion; this could be addressed via a different immigration distribution). The estimated data generating process (eq. (22)–(23) in [3]) is then


The most interesting result from a public health perspective is the estimated reporting probability of 0.33 as it directly translates to an estimate of the unobserved disease burden. Using statement (C) from Section 2, however, we can pick any $\tilde q \in [0.33, 1]$ and obtain the parameters of an INAR($\infty$) process $\tilde X_t$ so that $\tilde q \circ \tilde X_t$ is equivalent to $Y_t$. In Figure 1 these parameters are displayed as functions of $\tilde q$.

Figure 1: Parameter values of underreported INAR($\infty$) models with different reporting probabilities $\tilde q$ which are all equivalent to the underreported INAR(1) model from equations (14)–(15).

So if we relax the INAR(1) assumption from the original article and assume a latent INAR($\infty$) process, the data are equally compatible with a whole range of reporting probabilities. The most extreme re-formulation of $Y_t$ from equations (14), (15) is a completely observed INAR($\infty$) process with

where higher-order lagged terms are cut off as the respective parameters become very small. In certain settings generation time distributions of infectious diseases may give us an idea about an appropriate lag structure and thus the value of the decay factor $\phi$, making the model identifiable. The period of communicability of HPV is unknown, but likely to be at least as long as the persistence of lesions [4]. Development of lesions is assumed to take 2–3 months in most cases. For weekly data a more spread-out lag structure may thus be more appropriate than an INAR(1) specification.

Appendix: Derivation of statements (A), (B), (C) from Section 2

Consider a Poisson INAR(1) process $X_t$ with parameters $\nu$ and $\alpha$ and the underreported process $Y_t = q \circ X_t$. It is now shown that, as stated in (A), $Y_t$ is equivalent to an INAR($\infty$) process of type (10)–(13) with parameters $\nu^* = \nu q / (1 - \alpha(1-q))$, $\kappa^* = \alpha q / (1 - \alpha(1-q))$ and $\phi^* = \alpha(1-q)$. The argument is most easily understood when expressed in terms of the survival interpretation of $X_t$:

  (i) New individuals can be born at each time step $t$; their number $I_t$ follows a Poisson distribution with rate $\nu$.

  (ii) Individuals already present at $t - 1$ have a probability $\alpha$ of still being alive at time $t$.

  (iii) Alive individuals are observed with probability $q$ at each time step.

All births, deaths and observation events are assumed to be independent. $X_t$ is the number of individuals alive at time $t$, $Y_t$ is the number of those who are observed. Now denote by $N_t$ the number of individuals observed in $t$ which have not been observed previously, and by $O_t$ the number of individuals observed in $t$ which have already been observed at a previous time point so that

$$Y_t = N_t + O_t. \tag{16}$$

The term $N_t$ can be further decomposed by when the individuals were born; denoting by $N_t^{(t-i)}$ the number of individuals first observed in $t$ and born in $t - i$ one gets

$$N_t = \sum_{i=0}^{\infty} N_t^{(t-i)}. \tag{17}$$

Similarly, $O_t$ is decomposed by when the individuals were last observed; denoting by $O_t^{(t-i)}$ the individuals last observed in $t - i$ and observed again in $t$ this leads to

$$O_t = \sum_{i=1}^{\infty} O_t^{(t-i)}. \tag{18}$$

Figure 2 illustrates the definition of $N_t^{(t-i)}$ and $O_t^{(t-i)}$ with a simple example.

Figure 2: Stylized example to illustrate notations (17) and (18): individual-level display of a population of individuals following principles (i)–(iii). The total population size $X_t$ follows an INAR(1) process, but only an underreported version $Y_t$ is observed. Non-zero values of the newly introduced auxiliary variables $N_t^{(t-i)}$ and $O_t^{(t-i)}$ are added in the graph.
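The individual-level mechanism behind Figure 2 can also be simulated directly, which gives a quick plausibility check of the decomposition into newly observed and previously observed individuals. The following is a minimal sketch under stated assumptions: the symbols `nu`, `alpha`, `q` and all numerical values are illustrative, and the benchmark rate `nu * q / (1 - alpha * (1 - q))` for first-time observations is the geometric-series sum implied by the verbal argument (survive $i$ times, stay unobserved $i$ times, then be observed once).

```python
import math
import random


def poisson(rate, rng):
    """Draw a Poisson variate with Knuth's multiplication method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_individuals(nu, alpha, q, n_steps, burn_in=200, seed=7):
    """Track individuals and return per-step counts of first-time and repeat observations."""
    rng = random.Random(seed)
    seen_before = []  # one flag per alive individual: observed at an earlier time?
    new_counts, repeat_counts = [], []
    for step in range(burn_in + n_steps):
        # Each individual alive at the previous step survives with probability alpha.
        seen_before = [flag for flag in seen_before if rng.random() < alpha]
        # Newborn individuals have never been observed.
        seen_before += [False] * poisson(nu, rng)
        # Every alive individual is observed with probability q.
        newly, again = 0, 0
        for index, flag in enumerate(seen_before):
            if rng.random() < q:
                if flag:
                    again += 1
                else:
                    newly += 1
                    seen_before[index] = True
        if step >= burn_in:
            new_counts.append(newly)
            repeat_counts.append(again)
    return new_counts, repeat_counts


# Illustrative (hypothetical) parameter values.
nu, alpha, q = 2.0, 0.5, 0.33
new_counts, repeat_counts = simulate_individuals(nu, alpha, q, n_steps=20000)
mean_new = sum(new_counts) / len(new_counts)
mean_observed = (sum(new_counts) + sum(repeat_counts)) / len(new_counts)
print(mean_new, nu * q / (1 - alpha * (1 - q)))
print(mean_observed, q * nu / (1 - alpha))
```

The first printed pair compares the simulated mean number of first-time observations with the assumed benchmark rate; the second compares the mean of all observed individuals with the mean of the underreported INAR(1) process.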

Obviously an individual born in $t$ can only be observed for the first time in at most one out of $t, t+1, t+2, \dots$, i.e. $I_t$ is split up into $N_t^{(t)}, N_{t+1}^{(t)}, N_{t+2}^{(t)}, \dots$ and a part of individuals which is never observed. The probability that an individual born in $t$ is first observed in $t + i$ is $\alpha^i (1-q)^i q$ (the individual has to survive $i$ times, stay unobserved $i$ times and finally be observed once). The splitting property of the Poisson distribution ([6], p. 53) and the independence between the $I_t$ then imply that all $N_{t+i}^{(t)}$ independently follow Poisson distributions. Their rates depend on $i$, specifically $N_{t+i}^{(t)} \sim \text{Pois}(\nu \alpha^i (1-q)^i q)$. Consequently, the $N_t$, too, are independent of each other. As sums of independent Poisson random variables they likewise follow a Poisson distribution ([6], p. 14):

$$N_t \sim \text{Pois}\left(\frac{\nu q}{1 - \alpha(1-q)}\right). \tag{19}$$
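Similarly, an individual observed in $t$ can only be observed next in at most one out of $t+1, t+2, \dots$. The $Y_t$ observed individuals are thus split up into $O_{t+1}^{(t)}, O_{t+2}^{(t)}, \dots$ and a part of individuals which is never observed again. The probability that an individual is observed next in $t + i$ is $\alpha^i (1-q)^{i-1} q$ (the individual has to survive $i$ times, stay unobserved $i - 1$ times and finally be observed once) or $\kappa^* (1 - \phi^*) (\phi^*)^{i-1}$ with

$$\phi^* = \alpha (1 - q), \tag{20}$$

$$\kappa^* = \frac{\alpha q}{1 - \alpha(1-q)}. \tag{21}$$

The probability that the individual will be observed again at all is $\kappa^*$. Consequently, under the condition that the $j$-th of the $Y_t$ individuals will be observed again, the waiting time $W_{t,j}$ until this occurs has probability mass function $\Pr(W_{t,j} = i) = (1 - \phi^*)(\phi^*)^{i-1}$. It thus follows a geometric distribution with parameter $1 - \phi^*$. Denoting the number of individuals observed in $t$ which will be observed again by $R_t$ we can thus write

$$R_t \mid Y_t \sim \text{Bin}(Y_t, \kappa^*), \tag{22}$$

$$O_{t+i}^{(t)} = \sum_{j=1}^{R_t} \mathbb{1}(W_{t,j} = i). \tag{23}$$

Combining this with equations (16), (18) and (19) to

$$Y_t = N_t + \sum_{i=1}^{\infty} O_t^{(t-i)}, \quad N_t \sim \text{Pois}\left(\frac{\nu q}{1 - \alpha(1-q)}\right), \tag{24}$$

one can see that $Y_t$ indeed follows the form (10)–(13) with parameters $\nu^*, \kappa^*, \phi^*$ (the $N_t$ in equation (24) correspond to the $I_t$ from equation (10) and $R_t$ corresponds to $A_t$). Also it is clear that both $N_t$ and, given $Y_t$, $R_t$ and the waiting times $W_{t,j}$ are independent of the past history $Y_{t-1}, Y_{t-2}, \dots$ of the observed process.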

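Statement (B) follows directly as it is easily verified using (A) that the given $q \circ X_t$ and $Y^*_t$ are equivalent. The restrictions $0 < \kappa^* < 1$ and $0 < \phi^* < 1$ ensure that $0 < \alpha < 1$ and $0 < q < 1$ so that $X_t$ and $q \circ X_t$ are well-defined.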
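The verification behind statement (B) can be written out in a few lines. The derivation below is a sketch under an assumed notation: it takes the mapping of statement (A) in the form $\phi^* = \alpha(1-q)$, $\kappa^* = \alpha q/(1-\phi^*)$, $\nu^* = \nu q/(1-\phi^*)$ (reconstructed from the verbal argument above, not quoted from the original typeset equations) and inverts it.

```latex
\begin{align*}
\alpha q &= \kappa^* (1 - \phi^*)
  && \text{(from the definition of } \kappa^* \text{)} \\
\alpha &= \alpha (1 - q) + \alpha q = \phi^* + \kappa^* (1 - \phi^*) \\
q &= \frac{\alpha q}{\alpha} = \frac{\kappa^* (1 - \phi^*)}{\phi^* + \kappa^* (1 - \phi^*)} \\
\nu &= \frac{\nu^* (1 - \phi^*)}{q}
  && \text{(from the definition of } \nu^* \text{)}
\end{align*}
% Range check: alpha = phi^* * 1 + (1 - phi^*) * kappa^* is a convex combination
% of 1 and kappa^*, so kappa^* < alpha < 1 whenever 0 < kappa^*, phi^* < 1;
% q is then a ratio of a positive number to a strictly larger one, so 0 < q < 1.
```

Plugging these values of $\alpha$, $q$ and $\nu$ back into the mapping of (A) returns $\phi^*$, $\kappa^*$ and $\nu^*$, which is the equivalence claimed in (B).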

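Statement (C) follows from statements (A) and (B) in the following way. Consider an INAR($\infty$) process $X_t$ with parameters $\nu, \kappa, \phi$ and $Y_t = q \circ X_t$. Using (B) an INAR(1) process $Z_t$ with parameters $\nu_Z, \alpha_Z$ and a reporting probability $q_Z$ can be constructed so that $q_Z \circ Z_t$ is equivalent to $X_t$. Thus $Y_t$ is in turn equivalent to $q \circ (q_Z \circ Z_t)$, i.e. to $(q q_Z) \circ Z_t$. One can now choose any $\tilde q \in [q, 1]$ and define $\tilde X_t = (q q_Z / \tilde q) \circ Z_t$. The process $\tilde q \circ \tilde X_t$ is then obviously equivalent to $(q q_Z) \circ Z_t$ and thus $Y_t$. The proof is complete as statement (A) implies that $\tilde X_t$ has a representation as an INAR($\infty$) process with parameters

$$\tilde\nu = \frac{\nu s}{1 - \kappa(1 - s)}, \quad \tilde\kappa = \frac{\kappa s}{1 - \kappa(1 - s)}, \quad \tilde\phi = \phi + \kappa (1 - \phi)(1 - s), \quad \text{where } s = q / \tilde q.$$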

The author would like to thank Leonhard Held, Małgorzata Roos and Christian H. Weiß for helpful comments.


  • [1] MA Al-Osh and AA Alzaid. First-order integer-valued autoregressive (INAR(1)) process. J Time Ser Anal, 8(3):261–275, 1987.
  • [2] AA Alzaid and MA Al-Osh. An integer-valued pth-order autoregressive structure (INAR(p)) process. J Appl Probab, 27(2):314–324, 1990.
  • [3] A Fernández-Fontelo, A Cabaña, P Puig, and D Moriña. Under-reported data analysis with INAR-hidden Markov chains. Stat Med, 35(26):4875–4890, 2016.
  • [4] DL Heymann. Control of Communicable Diseases Manual. APHA Press, Washington, DC, 20th edition, 2015.
  • [5] NL Johnson, S Kotz, and N Balakrishnan. Discrete Multivariate Distributions. Wiley, New York, NY, 1997.
  • [6] JFC Kingman. Poisson Processes. Oxford University Press, Oxford, United Kingdom, 1993.
  • [7] CH Weiß. An Introduction to Discrete-Valued Time Series. Wiley, Hoboken, NJ, 2018.