We read with great interest the article by Fernández-Fontelo et al, who discuss underreporting in count time series models, a problem which is encountered in many real-world settings. In case studies, reporting probabilities for several diseases are estimated under the assumption of a hidden INAR(1) process. In our comment we develop this idea further and point out an identifiability problem which arises when the INAR(1) assumption is relaxed. Specifically, we show that it is not possible to distinguish between time-homogeneous underreporting and a geometric lag structure in Poisson INAR models. Estimation of reporting probabilities from time series data thus relies on the correct specification of the lag structure of the latent process.
The Poisson INAR(1) model  with parameters and is defined as
where the independently follow a distribution. The operator denotes binomial thinning, i.e. with . The thinning operations are assumed to be independent of each other and of ; further, the thinning operations at each and are assumed to be independent of . Fernández-Fontelo et al  introduce an underreported version of this model. However, they assume that counts are only subject to underreporting with a certain probability ; otherwise reporting is assumed to be complete. The observed process is thus
where is a reporting probability and, given , is independent of the past. The assumption that reporting is 100% complete during some periods seems strong in many contexts. Indeed, in two case studies (weekly number of human papillomavirus cases in Girona, Spain; annual deaths from mesothelioma in Great Britain) the estimates of the probability of underreporting are close to 1, with confidence intervals including this value. This indicates that time-homogeneous underreporting is an important special case. In the following the focus will thus be on models where, instead of (2), the reporting process is
2 Interplay of underreporting and geometric lags in INAR models
We now generalize the INAR(1) from (1) to an INAR() model with geometric lags and show how closely this aspect is related to underreporting. The starting point is the INAR() model introduced by Alzaid and Al-Osh  which is defined as
where is the indicator function. In the interpretation given in Weiß  (p. 46), is the number of individuals (among the from time ) which will be renewed during . The variables are the respective waiting times. This formulation extends more easily to the case , specifically consider a geometric lag structure
with . As we get , i.e. the follow a geometric distribution with parameter and support . The INAR() process with parameters and can thus be defined as
This ensures, like the multinomial distribution in (5), that while . Again for each it is assumed that and are independent of the past history, i.e. and for . Such geometric lag structures are closely linked to underreporting as the following can be shown (proofs in the Appendix):
More generally, consider a Poisson INAR() process with parameters and . For any there is a Poisson INAR() process so that is equivalent to . The parameters of are
For each underreported INAR() process and, as a special case, each underreported INAR(1) process, there are thus many different INAR() representations. Each of them features a different combination of reporting probability and decay factor for the autoregressive parameters. In Fernández-Fontelo et al  identifiability is ensured by the assumption that the latent process is indeed INAR(1), i.e. . Without substantial prior knowledge to justify a precise value of , however, the reporting probability cannot be estimated.
3 Application to human papillomavirus cases in Girona, Spain
For illustration we revisit the analysis of reported human papillomavirus cases in Girona, Spain (2010–2014) from Fernández-Fontelo et al. The authors assume the latent model (1) with reporting process (2), but as the estimated probability of underreporting has a confidence interval from 0.78 to 1.07, they suggest that the simpler version (3) could be used as well. For simplicity we will thus treat their parameter estimates as if they came from this simpler model (even though the model then cannot accommodate overdispersion; this could be addressed via a different immigration distribution). The estimated data generating process (eq. (22)–(23) in the original article) is then
The most interesting result from a public health perspective is the estimated reporting probability of 0.33, as it directly translates to an estimate of the unobserved disease burden. Using statement (C) from Section 2, however, we can pick any and obtain the parameters of an INAR() process so that is equivalent to . In Figure 1 these parameters are displayed as functions of .
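The trade-off between reporting probability and lag structure can be sketched numerically. The following is a minimal sketch, not the computation behind Figure 1. It assumes our own notation: `tau` and `kappa` for the immigration rate and survival probability of the latent INAR(1) process, `q` for its reporting probability, and `(nu, alpha, phi)` for the immigration rate, total autoregressive weight and geometric decay factor of the re-formulated latent process. The relations between them are those derived in the Appendix. Only the value 0.33 is taken from the text; the values of `tau` and `kappa` are hypothetical placeholders.

```python
def latent_inar_params(tau, kappa, q, q_assumed):
    """Latent geometric-lag INAR process which, when thinned with q_assumed,
    is equivalent to a q-thinned Poisson INAR(1) process (tau, kappa).

    Notation is an assumption of this sketch, not the article's.
    """
    if not (0 < q <= q_assumed <= 1):
        raise ValueError("need 0 < q <= q_assumed <= 1")
    q_star = q / q_assumed            # thinning already absorbed into the latent process
    phi = kappa * (1 - q_star)        # decay factor: survive and stay unobserved
    nu = tau * q_star / (1 - phi)     # rate of individuals entering the latent process
    alpha = kappa * q_star / (1 - phi)  # probability of ever re-entering it
    return nu, alpha, phi


# tau and kappa are hypothetical; q = 0.33 is the estimate quoted in the text.
tau, kappa, q = 1.0, 0.5, 0.33

for q_assumed in (0.33, 0.5, 0.75, 1.0):
    nu, alpha, phi = latent_inar_params(tau, kappa, q, q_assumed)
    # The stationary mean of the observed process is the same in every row.
    observed_mean = q_assumed * nu / (1 - alpha)
    print(f"q={q_assumed:.2f}  nu={nu:.3f}  alpha={alpha:.3f}  "
          f"phi={phi:.3f}  observed mean={observed_mean:.3f}")
```

With `q_assumed` equal to `q` the decay factor is zero and the latent process reduces to the INAR(1) model; with `q_assumed = 1` the process is treated as completely observed and all underreporting is absorbed into the lag structure.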
So if we relax the INAR(1) assumption from the original article and assume a latent INAR() process, the data are equally compatible with a whole range of reporting probabilities. The most extreme re-formulation of from equations (14), (15) is a completely observed INAR() process with
where the lagged terms after are cut off as the respective parameters become very small. In certain settings generation time distributions of infectious diseases may give us an idea about an appropriate lag structure and thus value of , making the model identifiable. The period of communicability of HPV is unknown, but likely to be at least as long as the persistence of lesions . Development of lesions is assumed to take 2–3 months in most cases. For weekly data a more spread-out lag structure may thus be more appropriate than an INAR(1) specification.
Appendix: Derivation of statements (A), (B), (C) from Section 2
Consider a Poisson INAR(1) process with
and the underreported process . It is now shown that, as stated in (A), is equivalent to an INAR() process of type (10)–(13) with parameters . The argument is most easily understood when expressed in terms of the survival interpretation of :
-  New individuals can be born at each time step ; their number follows a Poisson distribution with rate.
-  Individuals already present at have a probability of still being alive at time .
-  Alive individuals are observed with probability at each time step.
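The three mechanisms above can be turned into a small simulation. This is a minimal sketch using only the Python standard library. The names `tau` (birth rate), `kappa` (survival probability) and `q` (observation probability) are assumptions of this sketch, as are the helper functions `rpois` and `thin`. The Poisson sampler is Knuth's multiplication method, which is adequate only for small rates.

```python
import math
import random


def rpois(rate, rng):
    """Draw from a Poisson(rate) distribution (Knuth's method, small rates only)."""
    limit = math.exp(-rate)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def thin(n, prob, rng):
    """Binomial thinning: keep each of n individuals independently with probability prob."""
    return sum(rng.random() < prob for _ in range(n))


def simulate_underreported_inar1(tau, kappa, q, n_steps, seed=1, burn_in=200):
    """Simulate the latent counts (alive) and the observed counts."""
    rng = random.Random(seed)
    # Start from the stationary Poisson distribution of a Poisson INAR(1) process.
    x = rpois(tau / (1 - kappa), rng)
    latent, observed = [], []
    for t in range(burn_in + n_steps):
        x = thin(x, kappa, rng) + rpois(tau, rng)  # survivors plus newborns
        y = thin(x, q, rng)                        # each alive individual observed independently
        if t >= burn_in:
            latent.append(x)
            observed.append(y)
    return latent, observed


latent, observed = simulate_underreported_inar1(tau=2.0, kappa=0.5, q=0.3, n_steps=5000)
print("mean alive:", sum(latent) / len(latent))
print("mean observed:", sum(observed) / len(observed))
```

The sample means can be compared with the stationary means `tau / (1 - kappa)` and `q * tau / (1 - kappa)` of the latent and observed processes.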
All births, deaths and observation events are assumed to be independent. is the number of individuals alive at time , is the number of those who are observed. Now denote by the number of individuals observed in which have not been observed previously, and by the number of individuals observed in which have already been observed at a previous time point so that
The term can be further decomposed by when the individuals were born; denoting by the number of individuals first observed in and born in one gets
Similarly, is decomposed by when the individuals were last observed; denoting by the individuals last observed in and observed again in this leads to
Figure 2 illustrates the definition of and with a simple example.
Obviously an individual born in can only be observed for the first time in at most one out of , i.e. is split up into and a part of individuals which is never observed. The probability that an individual born in is first observed in is (the individual has to survive times, stay unobserved times and finally be observed once). The splitting property of the Poisson distribution (, 53) and the independence between the then imply that all independently follow Poisson distributions. Their rates depend on , specifically . Consequently, the , too, are independent of each other. As sums of independent Poisson random variables they likewise follow a Poisson distribution (, 14):
Similarly, an individual observed in can only be observed next in at most one out of . The observed individuals are thus split up into and a part of individuals which is never observed again. The probability that an individual is observed next in is (the individual has to survive times, stay unobserved times and finally be observed once) or with
The probability that the individual will be observed again at all is . Consequently, under the condition that the -th of the individuals will be observed again, the waiting time until this occurs has probability mass function . It thus follows a geometric distribution with parameter . Denoting the number of individuals observed in which will be observed again by we can thus write
one can see that indeed follows the form (10)–(13) with parameters (the in equation (24) correspond to the from equation (10) and corresponds to ). Also it is clear that both and are independent of the past history of the observed process , i.e. and
Statement (B) follows directly as it is easily verified using (A) that the given and are equivalent. The restriction ensures that and so that and are well-defined.
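The verification mentioned for statement (B) can be checked numerically. The sketch below is hedged in two ways. First, the notation (`tau`, `kappa`, `q` for the underreported INAR(1) process; `nu`, `alpha`, `phi` for the geometric-lag process) is an assumption of this sketch. Second, the forward map follows the probabilities derived above for statement (A), while the inverse map is obtained here simply by solving those relations and may be parametrized differently from the article's statement (B).

```python
import math


def underreported_inar1_to_geometric(tau, kappa, q):
    """Forward map as derived for statement (A)."""
    phi = kappa * (1 - q)           # survive and stay unobserved for one step
    nu = tau * q / (1 - phi)        # rate of first-time observations
    alpha = kappa * q / (1 - phi)   # probability of being observed again at all
    return nu, alpha, phi


def geometric_to_underreported_inar1(nu, alpha, phi):
    """Inverse map, obtained by solving the forward relations (assumed form of (B))."""
    if not (0 < alpha < 1 and 0 <= phi < 1):
        raise ValueError("need 0 < alpha < 1 and 0 <= phi < 1")
    kappa = phi + alpha * (1 - phi)
    q = alpha * (1 - phi) / kappa
    tau = nu * (1 - phi) / q
    return tau, kappa, q


# Round trip with illustrative (hypothetical) parameter values.
params = (2.0, 0.6, 0.4)
recovered = geometric_to_underreported_inar1(*underreported_inar1_to_geometric(*params))
print(all(math.isclose(a, b) for a, b in zip(params, recovered)))
```

In this parametrization, the condition `alpha < 1` guarantees `kappa < 1`, and `q` then lies in (0, 1], so the recovered INAR(1) process and reporting probability are well-defined.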
Statement (C) follows from statements (A) and (B) in the following way. Consider an INAR() process and . Using (B) an INAR(1) process with parameters can be constructed so that is equivalent to . Thus is in turn equivalent to . One can now choose any and define . The process is then obviously equivalent to and thus . The proof is complete as statement (A) implies that has a representation as an INAR() process with parameters
The author would like to thank Leonhard Held, Małgorzata Roos and Christian H. Weiß for helpful comments.
-  MA Al-Osh and AA Alzaid. First-order integer-valued autoregressive (INAR(1)) process. J Time Ser Anal, 8(3):261–275, 1987.
-  AA Alzaid and MA Al-Osh. An integer-valued pth-order autoregressive structure (INAR(p)) process. J Appl Probab, 27(2):314–324, 1990.
-  A Fernández-Fontelo, A Cabaña, P Puig, and D Moriña. Under-reported data analysis with INAR-hidden Markov chains. Stat Med, 35(26):4875–4890, 2016.
-  DL Heymann. Control of Communicable Diseases Manual. APHA Press, Washington, DC, 20th edition, 2015.
-  NL Johnson, S Kotz, and N Balakrishnan. Discrete Multivariate Distributions. Wiley, New York, NY, 1997.
-  JFC Kingman. Poisson Processes. Oxford University Press, Oxford, United Kingdom, 1993.
-  CH Weiss. An Introduction to Discrete-Valued Time Series. Wiley, Hoboken, NJ, 2018.