1 Introduction
We read with great interest the article by Fernández-Fontelo et al [3] who discuss underreporting in count time series models, a problem which is encountered in many real-world settings. In case studies, reporting probabilities for several diseases are estimated under the assumption of a hidden INAR(1) process. In our comment we develop this idea further and point out an identifiability problem which arises when the INAR(1) assumption is relaxed. Specifically, it is shown that it is not possible to distinguish between time-homogeneous underreporting and a geometric lag structure in Poisson INAR models. Estimation of reporting probabilities from time series data thus relies on the correct specification of the lag structure of the latent process.
The Poisson INAR(1) model [1] with parameters $\nu$ and $\alpha$ is defined as
(1) $X_t = I_t + \alpha \circ X_{t-1}$
where the $I_t$ independently follow a $\text{Pois}(\nu)$ distribution. The operator $\circ$ denotes binomial thinning, i.e. $\alpha \circ X = \sum_{j=1}^{X} Z_j$ with $Z_j \stackrel{\text{iid}}{\sim} \text{Bern}(\alpha)$. The thinning operations are assumed to be independent of each other and of $\{I_t\}$; further, the thinning operations at each $t$ and $I_t$ are assumed to be independent of $(X_s)_{s < t}$. Fernández-Fontelo et al [3] introduce an underreported version of this model. However, they assume that counts are only subject to underreporting with a certain probability $\omega$; otherwise reporting is assumed to be complete. The observed process $\tilde{X}_t$ is thus
(2) $\tilde{X}_t = \begin{cases} X_t & \text{with probability } 1 - \omega \\ q \circ X_t & \text{with probability } \omega \end{cases}$
where $q$ is a reporting probability and, given $X_t$, $\tilde{X}_t$ is independent of the past. The assumption that reporting is 100% complete during some periods seems strong in many contexts. And indeed, in two case studies (weekly number of human papillomavirus cases in Girona, Spain; annual deaths from mesothelioma in Great Britain) the estimates $\hat{\omega}$ are close to 1 with confidence intervals including this value. This indicates that $\omega = 1$, i.e. time-homogeneous underreporting, is an important special case. In the following the focus will thus be on models where, instead of (2), the reporting process is
(3) $\tilde{X}_t = q \circ X_t$
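The latent model (1) and the time-homogeneous reporting process (3) can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in [3]: the function names, the parameter values in the usage note and the choice to start the chain from a Pois($\nu/(1-\alpha)$) draw (the stationary marginal of the Poisson INAR(1) model) are assumptions made for this example.

```python
import random


def poisson(rate, rng):
    """Draw a Poisson(rate) variate by counting unit-rate exponential arrivals in [0, rate)."""
    count, clock = 0, rng.expovariate(1.0)
    while clock < rate:
        count += 1
        clock += rng.expovariate(1.0)
    return count


def thin(count, prob, rng):
    """Binomial thinning: each of `count` individuals is kept with probability `prob`."""
    return sum(rng.random() < prob for _ in range(count))


def simulate_underreported_inar1(nu, alpha, q, n_steps, seed=0):
    """Simulate X_t = I_t + alpha o X_{t-1} and the reported counts q o X_t."""
    rng = random.Random(seed)
    x = poisson(nu / (1 - alpha), rng)  # assumed start: stationary Poisson marginal
    latent, observed = [], []
    for _ in range(n_steps):
        x = poisson(nu, rng) + thin(x, alpha, rng)  # equation (1)
        latent.append(x)
        observed.append(thin(x, q, rng))            # equation (3)
    return latent, observed
```

Calling `simulate_underreported_inar1(nu=2.0, alpha=0.5, q=0.33, n_steps=250)` returns two lists of equal length; every reported count is at most the corresponding latent count.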
2 Interplay of underreporting and geometric lags in INAR models
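We now generalize the INAR(1) from (1) to an INAR($\infty$) model with geometric lags and show how closely this aspect is related to underreporting. The starting point is the INAR($p$) model introduced by Alzaid and Al-Osh [2] which is defined as
(4) $X_t = I_t + \sum_{i=1}^{p} \alpha_i \circ X_{t-i}$
(5) $(\alpha_1 \circ X_t, \dots, \alpha_p \circ X_t) \mid X_t \sim \text{Mult}(X_t; \alpha_1, \dots, \alpha_p)$
with $\alpha = \sum_{i=1}^{p} \alpha_i < 1$.
This INAR($p$) model does not allow for $p = \infty$ as the multinomial distribution requires a finite number of categories. We therefore reformulate the multinomial distribution in (5) as (compare e.g. [5], p. 33)
(6) $A_t \mid X_t \sim \text{Bin}(X_t, \alpha)$
(7) $\Pr(D_{t,j} = i) = \alpha_i / \alpha, \quad i = 1, \dots, p$, independently for $j = 1, \dots, A_t$
(8) $\alpha_i \circ X_t = \sum_{j=1}^{A_t} \mathbb{1}(D_{t,j} = i)$
where $\mathbb{1}$ is the indicator function. In the interpretation given in Weiß [7] (p. 46), $A_t$ is the number of individuals (among the $X_t$ from time $t$) which will be renewed during $t+1, \dots, t+p$. The variables $D_{t,j}$ are the respective waiting times. This formulation extends more easily to the case $p = \infty$. Specifically, consider a geometric lag structure
(9) $\alpha_i = \alpha (1 - \kappa) \kappa^{i-1}, \quad i = 1, 2, \dots$
with $0 \leq \kappa < 1$. As $\sum_{i=1}^{\infty} \alpha_i = \alpha$ we get $\Pr(D_{t,j} = i) = (1 - \kappa) \kappa^{i-1}$, i.e. the $D_{t,j}$ follow a geometric distribution with parameter $1 - \kappa$ and support $\{1, 2, \dots\}$. The INAR($\infty$) process with parameters $\nu$, $\alpha$ and $\kappa$ can thus be defined as
(10) $X_t = I_t + \sum_{i=1}^{\infty} \alpha_i \circ X_{t-i}$
(11) $A_t \mid X_t \sim \text{Bin}(X_t, \alpha)$
(12) $D_{t,j} \stackrel{\text{iid}}{\sim} \text{Geom}(1 - \kappa), \quad j = 1, \dots, A_t$
(13) $\alpha_i \circ X_t = \sum_{j=1}^{A_t} \mathbb{1}(D_{t,j} = i)$
This ensures, like the multinomial distribution in (5), that $\sum_{i=1}^{\infty} \alpha_i \circ X_t \leq X_t$ while $\alpha_i \circ X_t \mid X_t \sim \text{Bin}(X_t, \alpha_i)$. Again for each $t$ it is assumed that $I_t$ and $D_{t,j}$ are independent of the past history, i.e. $I_t \perp X_{t-i}$ and $D_{t,j} \perp X_{t-i}$ for $i \geq 1$. Such geometric lag structures are closely linked to underreporting as the following can be shown (proofs in the Appendix):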

More generally, consider a Poisson INAR() process with parameters and . For any there is a Poisson INAR() process so that is equivalent to . The parameters of are
For each underreported INAR($\infty$) process and, as a special case, each underreported INAR(1) process, there are thus many different INAR($\infty$) representations. Each of them features a different combination of reporting probability and decay factor $\kappa$ for the autoregressive parameters. In Fernández-Fontelo et al [3] identifiability is ensured by the assumption that the latent process is indeed INAR(1), i.e. $\kappa = 0$. Without substantial prior knowledge to justify a precise value of $\kappa$, however, the reporting probability cannot be estimated.
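The geometric-lag process of this section can be simulated directly from its waiting-time formulation: at each time step a binomial number of the current individuals is marked for renewal, and each marked individual reappears after a geometric waiting time. The sketch below assumes the parameterisation $\alpha_i = \alpha(1-\kappa)\kappa^{i-1}$ for the lag weights; the function names, the burn-in length and the start from an empty population are choices made for this illustration only.

```python
import random
from collections import defaultdict


def geometric_lag_weights(alpha, kappa, max_lag):
    """Lag weights alpha_i = alpha * (1 - kappa) * kappa**(i - 1) for i = 1..max_lag."""
    return [alpha * (1 - kappa) * kappa ** (i - 1) for i in range(1, max_lag + 1)]


def poisson(rate, rng):
    """Draw a Poisson(rate) variate by counting unit-rate exponential arrivals in [0, rate)."""
    count, clock = 0, rng.expovariate(1.0)
    while clock < rate:
        count += 1
        clock += rng.expovariate(1.0)
    return count


def geometric(success_prob, rng):
    """Waiting time on {1, 2, ...} with Pr(D = i) = success_prob * (1 - success_prob)**(i - 1)."""
    wait = 1
    while rng.random() >= success_prob:
        wait += 1
    return wait


def simulate_inar_inf(nu, alpha, kappa, n_steps, burn_in=200, seed=0):
    """Simulate the geometric-lag INAR process via renewals scheduled at waiting times."""
    rng = random.Random(seed)
    scheduled = defaultdict(int)  # scheduled[t]: renewals that arrive at time t
    path = []
    for t in range(burn_in + n_steps):
        x = poisson(nu, rng) + scheduled.pop(t, 0)             # immigration plus renewals
        renewed = sum(rng.random() < alpha for _ in range(x))  # A_t | X_t ~ Bin(X_t, alpha)
        for _ in range(renewed):
            scheduled[t + geometric(1 - kappa, rng)] += 1      # D_{t,j} ~ Geom(1 - kappa)
        if t >= burn_in:
            path.append(x)
    return path
```

With `kappa=0` every waiting time equals 1 and the simulation reduces to an INAR(1) process, which is the special case assumed in [3].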
3 Application to human papillomavirus cases in Girona, Spain
For illustration of the argument we revisit the analysis of reported human papillomavirus cases in Girona, Spain (2010–2014) from Fernández-Fontelo et al [3]. The authors assume the latent model (1) with reporting process (2), but as the estimate of the probability $\omega$ that a count is subject to underreporting comes with a confidence interval from 0.78 to 1.07, they suggest that the simpler version (3) could be used as well. For simplicity we will thus pretend that their parameter estimates come from this simpler model (even though the model then cannot accommodate overdispersion; this could be addressed via a different immigration distribution). The estimated data generating process (eq. (22)–(23) in [3]) is then
(14)  
(15) 
The most interesting result from a public health perspective is the estimated reporting probability of 0.33 as it directly translates to an estimate of the unobserved disease burden. Using statement (C) from Section 2, however, we can pick any reporting probability $q^*$ between 0.33 and 1 and obtain the parameters of an INAR($\infty$) process $X^*_t$ so that $q^* \circ X^*_t$ is equivalent to $\tilde{X}_t$. In Figure 1 these parameters are displayed as functions of $q^*$.
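How such a family of representations might be computed is sketched below. The sketch rests on several assumptions: the lag weights are parameterised as $\alpha_i = \alpha(1-\kappa)\kappa^{i-1}$, and the closed forms used (immigration rate $q\nu/\{1-(1-q)\alpha\}$, total autoregressive weight $q\alpha/\{1-(1-q)\alpha\}$, decay factor $(1-q)\alpha$) are those that follow from the survival argument in the Appendix rather than formulas quoted from the text. The function names and the latent parameters $\nu = 2$, $\alpha = 0.5$ are hypothetical; only the reporting probability 0.33 is taken from the analysis above.

```python
def observed_inar_inf_parameters(nu, alpha, q):
    """Geometric-lag parameters (nu_Y, alpha_Y, kappa_Y) of q o X_t for a Poisson INAR(1) X_t."""
    kappa_y = (1 - q) * alpha
    return q * nu / (1 - kappa_y), q * alpha / (1 - kappa_y), kappa_y


def equivalent_latent_process(nu, alpha, q, q_star):
    """Latent parameters (nu*, alpha*, kappa*) whose q_star-thinning matches q o X_t."""
    if not q <= q_star <= 1:
        raise ValueError("q_star must lie between q and 1")
    # Thinning X_t with q / q_star and then with q_star is the same as thinning with q.
    return observed_inar_inf_parameters(nu, alpha, q / q_star)


# Hypothetical latent parameters; only q = 0.33 comes from the text.
nu, alpha, q = 2.0, 0.5, 0.33
for q_star in (0.33, 0.5, 0.75, 1.0):
    nu_s, alpha_s, kappa_s = equivalent_latent_process(nu, alpha, q, q_star)
    mean_observed = q_star * nu_s / (1 - alpha_s)  # the same for every q_star
    print(f"q*={q_star:.2f} nu*={nu_s:.3f} alpha*={alpha_s:.3f} "
          f"kappa*={kappa_s:.3f} observed mean={mean_observed:.3f}")
```

For `q_star` equal to `q` the decay factor is zero and the latent process is the original INAR(1); for `q_star=1` the representation is a completely observed process with the most spread-out lags.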
So if we relax the INAR(1) assumption from the original article and assume a latent INAR($\infty$) process, the data are equally compatible with a whole range of reporting probabilities. The most extreme reformulation of $\tilde{X}_t$ from equations (14), (15) is a completely observed INAR() process with
where the higher-order lagged terms are cut off as the respective parameters become very small. In certain settings generation time distributions of infectious diseases may give us an idea about an appropriate lag structure and thus the value of the decay factor $\kappa$, making the model identifiable. The period of communicability of HPV is unknown, but likely to be at least as long as the persistence of lesions [4]. Development of lesions is assumed to take 2–3 months in most cases. For weekly data a more spread-out lag structure may thus be more appropriate than an INAR(1) specification.
Appendix: Derivation of statements (A), (B), (C) from Section 2
Consider a Poisson INAR(1) process $X_t$ with parameters $\nu$ and $\alpha$ and the underreported process $\tilde{X}_t = q \circ X_t$. It is now shown that, as stated in (A), $\tilde{X}_t$ is equivalent to an INAR($\infty$) process of type (10)–(13) with parameters $\nu_Y = q\nu / \{1 - (1-q)\alpha\}$, $\alpha_Y = q\alpha / \{1 - (1-q)\alpha\}$ and $\kappa_Y = (1-q)\alpha$. The argument is most easily understood when expressed in terms of the survival interpretation of $X_t$:
New individuals can be born at each time step $t$; their number $I_t$ follows a Poisson distribution with rate $\nu$.
Individuals already present at $t-1$ have a probability of $\alpha$ of still being alive at time $t$.
Alive individuals are observed with probability $q$ at each time step.
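The three rules above translate into a small individual-based simulation. The sketch below is an illustration under assumptions: the function name, the burn-in from an empty population and the bookkeeping of a per-individual "already observed" flag are choices made here, not part of the original argument. Besides the numbers of alive and observed individuals it records how many individuals are observed for the first time at each step.

```python
import random


def simulate_individuals(nu, alpha, q, n_steps, burn_in=200, seed=0):
    """Track births, survival and observation of single individuals.

    Returns three lists: individuals alive, individuals observed, and
    individuals observed for the first time at each recorded time step.
    """
    rng = random.Random(seed)
    seen = []  # one flag per living individual: True once it has been observed
    alive, observed, first_seen = [], [], []
    for t in range(burn_in + n_steps):
        # Survival: each individual stays alive with probability alpha.
        seen = [flag for flag in seen if rng.random() < alpha]
        # Births: Poisson(nu) newcomers, drawn by counting exponential arrivals.
        births, clock = 0, rng.expovariate(1.0)
        while clock < nu:
            births += 1
            clock += rng.expovariate(1.0)
        seen.extend([False] * births)
        # Observation: each living individual is observed with probability q.
        n_observed = n_first = 0
        for k, flag in enumerate(seen):
            if rng.random() < q:
                n_observed += 1
                if not flag:
                    n_first += 1
                    seen[k] = True
        if t >= burn_in:
            alive.append(len(seen))
            observed.append(n_observed)
            first_seen.append(n_first)
    return alive, observed, first_seen
```

Splitting the observed count into first-time and repeated observations, as recorded here, is the decomposition on which the remainder of the argument builds.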
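All births, deaths and observation events are assumed to be independent. $X_t$ is the number of individuals alive at time $t$, $\tilde{X}_t$ is the number of those who are observed. Now denote by $N_t$ the number of individuals observed in $t$ which have not been observed previously, and by $R_t$ the number of individuals observed in $t$ which have already been observed at a previous time point so that
(16) $\tilde{X}_t = N_t + R_t$
The term $N_t$ can be further decomposed by when the individuals were born; denoting by $N_{t-i,t}$ the number of individuals first observed in $t$ and born in $t-i$ one gets
(17) $N_t = \sum_{i=0}^{\infty} N_{t-i,t}$
Similarly, $R_t$ is decomposed by when the individuals were last observed; denoting by $R_{t-i,t}$ the individuals last observed in $t-i$ and observed again in $t$ this leads to
(18) $R_t = \sum_{i=1}^{\infty} R_{t-i,t}$
Figure 2 illustrates the definition of $N_{t-i,t}$ and $R_{t-i,t}$ with a simple example.
Obviously an individual born in $t$ can only be observed for the first time in at most one out of $t, t+1, t+2, \dots$, i.e. $I_t$ is split up into $N_{t,t}, N_{t,t+1}, \dots$ and a part of individuals which is never observed. The probability that an individual born in $t$ is first observed in $t+i$ is $\alpha^i (1-q)^i q$, where $\alpha$ is the survival probability and $q$ the observation probability (the individual has to survive $i$ times, stay unobserved $i$ times and finally be observed once). The splitting property of the Poisson distribution ([6], 53) and the independence between the $I_t$ then imply that all $N_{t,t+i}$ independently follow Poisson distributions. Their rates depend on $i$, specifically $N_{t,t+i} \sim \text{Pois}(\nu q \{(1-q)\alpha\}^i)$, where $\nu$ is the rate of $I_t$. Consequently, the $N_t$, too, are independent of each other. As sums of independent Poisson random variables they likewise follow a Poisson distribution ([6], 14):
(19) $N_t \sim \text{Pois}\left(\frac{q\nu}{1 - (1-q)\alpha}\right)$
Similarly, an individual observed in $t$ can only be observed next in at most one out of $t+1, t+2, \dots$. The $\tilde{X}_t$ observed individuals are thus split up into $R_{t,t+1}, R_{t,t+2}, \dots$ and a part of individuals which is never observed again. The probability that an individual is observed next in $t+i$ is $\alpha^i (1-q)^{i-1} q$ (the individual has to survive $i$ times, stay unobserved $i-1$ times and finally be observed once) or $\alpha_Y (1 - \kappa_Y) \kappa_Y^{i-1}$ with
(20) $\alpha_Y = \frac{q\alpha}{1 - (1-q)\alpha}, \quad \kappa_Y = (1-q)\alpha$
The probability that the individual will be observed again at all is $\alpha_Y$. Consequently, under the condition that the $j$th of the $\tilde{X}_t$ individuals will be observed again, the waiting time $D_{t,j}$ until this occurs has probability mass function $\Pr(D_{t,j} = i) = (1 - \kappa_Y) \kappa_Y^{i-1}$. It thus follows a geometric distribution with parameter $1 - \kappa_Y$. Denoting the number of individuals observed in $t$ which will be observed again by $A_t$ we can thus write
(21) $A_t \mid \tilde{X}_t \sim \text{Bin}(\tilde{X}_t, \alpha_Y)$
(22) $D_{t,j} \stackrel{\text{iid}}{\sim} \text{Geom}(1 - \kappa_Y), \quad j = 1, \dots, A_t$
(23) $R_{t,t+i} = \sum_{j=1}^{A_t} \mathbb{1}(D_{t,j} = i)$
Combining this with equations (16), (18) and (19) to
(24) $\tilde{X}_t = N_t + \sum_{i=1}^{\infty} R_{t-i,t}, \quad N_t \sim \text{Pois}\left(\frac{q\nu}{1 - (1-q)\alpha}\right)$
one can see that $\tilde{X}_t$ indeed follows the form (10)–(13) with parameters $\nu_Y = q\nu / \{1 - (1-q)\alpha\}$, $\alpha_Y$ and $\kappa_Y$ (the $N_t$ in equation (24) correspond to the $I_t$ from equation (10) and $R_{t-i,t}$ corresponds to $\alpha_i \circ X_{t-i}$). Also it is clear that both $N_t$ and $D_{t,j}$ are independent of the past history of the observed process $\tilde{X}_t$, i.e. $N_t \perp \tilde{X}_{t-i}$ and $D_{t,j} \perp \tilde{X}_{t-i}$ for $i \geq 1$.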
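The rate of the first-time observations and the parameters of the waiting-time distribution both come from a geometric series. The short worked version below is a sketch under stated assumptions: $N_t$ denotes the number of individuals first observed in $t$, $\nu$ and $\alpha$ are the immigration rate and survival probability of the latent INAR(1) process, $q$ is the reporting probability, and $\alpha_Y$, $\kappa_Y$ are labels chosen here for the total autoregressive weight and decay factor of the observed process.

```latex
\begin{align*}
\mathbb{E}(N_t)
  &= \sum_{i=0}^{\infty} \nu\, q\, \{(1-q)\alpha\}^{i}
   = \frac{q\nu}{1-(1-q)\alpha}, \\
\alpha_Y
  &= \sum_{i=1}^{\infty} q\, \alpha^{i} (1-q)^{i-1}
   = q\alpha \sum_{i=0}^{\infty} \{(1-q)\alpha\}^{i}
   = \frac{q\alpha}{1-(1-q)\alpha}, \\
\frac{q\, \alpha^{i} (1-q)^{i-1}}{\alpha_Y}
  &= \{1-(1-q)\alpha\}\, \{(1-q)\alpha\}^{i-1}
   = (1-\kappa_Y)\, \kappa_Y^{\,i-1},
   \qquad \kappa_Y = (1-q)\alpha .
\end{align*}
```

As a consistency check, the stationary mean implied by these parameters is $\nu_Y / (1 - \alpha_Y) = q\nu / (1 - \alpha)$ with $\nu_Y = \mathbb{E}(N_t)$, which is $q$ times the mean $\nu / (1-\alpha)$ of the latent INAR(1) process.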
Statement (B) follows directly as it is easily verified using (A) that the given and are equivalent. The restriction ensures that and so that and are well-defined.
Statement (C) follows from statements (A) and (B) in the following way. Consider an INAR() process and . Using (B) an INAR(1) process with parameters can be constructed so that is equivalent to . Thus is in turn equivalent to . One can now choose any and define . The process is then obviously equivalent to and thus . The proof is complete as statement (A) implies that has a representation as an INAR() process with parameters
Acknowledgements
The author would like to thank Leonhard Held, Małgorzata Roos and Christian H. Weiß for helpful comments.
References
 [1] MA Al-Osh and AA Alzaid. First-order integer-valued autoregressive (INAR(1)) process. J Time Ser Anal, 8(3):261–275, 1987.
 [2] AA Alzaid and MA Al-Osh. An integer-valued pth-order autoregressive structure (INAR(p)) process. J Appl Probab, 27(2):314–324, 1990.
 [3] A Fernández-Fontelo, A Cabaña, P Puig, and D Moriña. Under-reported data analysis with INAR-hidden Markov chains. Stat Med, 35(26):4875–4890, 2016.
 [4] DL Heymann. Control of Communicable Diseases Manual. APHA Press, Washington, DC, 20th edition, 2015.
 [5] NL Johnson, S Kotz, and N Balakrishnan. Discrete Multivariate Distributions. Wiley, New York, NY, 1997.
 [6] JFC Kingman. Poisson Processes. Oxford University Press, Oxford, United Kingdom, 1993.
 [7] CH Weiss. An Introduction to Discrete-Valued Time Series. Wiley, Hoboken, NJ, 2018.