Analysis and Methods to Mitigate Effects of Under-reporting in Count Data

by Jennifer Brennan et al.
University of Washington

Under-reporting of count data poses a major roadblock for prediction and inference. In this paper, we focus on the Pogit model, which deconvolves the generating Poisson process from the censoring process controlling under-reporting using a generalized linear modeling framework. We highlight the limitations of the Pogit model and address them by adding constraints to the estimation framework. We also develop uncertainty quantification techniques that are robust to model mis-specification. Our approach is evaluated using synthetic data and applied to real healthcare datasets, where we treat inpatient data as 'reported' counts and use held-out total injuries to validate the results. The methods make it possible to separate the Poisson process from the under-reporting process, given sufficient expert information. Code to implement the approach is available via an open-source Python package.




1 Introduction

Under-reporting of count data is a pervasive problem in many fields, including econometrics (Winkelmann, 1996), epidemiology (Stoner et al., 2019), and engineering (Wood et al., 2016). In population health, under-reporting of key statistics such as injuries and birth defects impedes estimation of the burden of disease (Vos et al., 2020). The goal of statistical modeling in this area is to develop a rigorous approach to deconvolve the data-generating mechanism from the under-reporting mechanism. However, accurately separating these two mechanisms is extremely challenging without contextual information, such as covariates related to under-reporting and those related to the true data-generating process.

The predominant approach to modeling under-reported counts focuses on a two-stage model. Events are first generated according to a Poisson process with mean λ. They are then reported according to a binomial process with probability p, with the resulting reported events having mean μ = λp. If covariate information is available, both λ and p can be modeled as functions of those covariates; otherwise, a mathematically convenient distribution over these parameters is assumed. When the probability of reporting is modeled using logistic regression (i.e., with a logit transform), the resulting model is called a Poisson-Logit (Pogit) model.

Unfortunately, the deconvolution problem of separating the Poisson and binomial processes, i.e., inferring λ and p from data that inform only their product μ = λp, is very difficult. Since only reported counts are observed, it is challenging to determine whether these counts result from many events with a low rate of reporting (large λ, small p) or few events with a high rate of reporting (small λ, large p). This problem is exacerbated when covariates are shared between the Poisson and binomial processes or when they are highly correlated, and results in a wide range of plausible solutions. Addressing this variability, in both theory and practice, requires strong assumptions on the form of one or both latent processes.

Contributions and roadmap.

Our research makes three core contributions. First, we develop an asymptotic covariance analysis for the Pogit model and use it to qualitatively describe the limitations of the standard Pogit approach, which we illustrate using numerical examples. Second, we introduce new constraints and priors to robustify the estimation process, and we improve the ability to deconvolve under-reporting from the true count process using rigorous sandwich estimation to evaluate the uncertainty of model estimates. Finally, we develop an open source implementation of the algorithm and illustrate its successful application on a large-scale dataset that has a gold standard, validating our results.

Paper organization.

Section 2 describes existing models for under-reported counts. We develop an asymptotic analysis and numerical examples highlighting the challenges of the deconvolution problem in Section 3. In Section 4, we present new methods to overcome these challenges using priors and constraints, as well as novel approaches for robust uncertainty quantification. We develop and analyze case studies using real data in Section 5 and present a brief concluding discussion in Section 6.

2 Current models for under-reported counts

The standard approach to modeling under-reported counts assumes a two-step data-generating process. Let y_i represent the reported number of counts for observations i = 1, …, n. In the first step, the unobserved true number of events z_i is drawn according to a Poisson distribution with rate λ_i:

    z_i ~ Poisson(λ_i).    (1)

In the second step, these events are filtered through a reporting process, where each event has probability p_i of being reported. The reported counts are modeled as a binomial random variable:

    y_i | z_i ~ Binomial(z_i, p_i).    (2)

This is equivalent to drawing the reported counts directly from the Poisson distribution

    y_i ~ Poisson(λ_i p_i).
Estimating the reported process mean μ_i = λ_i p_i and its underlying model can be done simply using Poisson regression and the corresponding generalized linear model. However, in the under-reported counts setting, we need accurate estimates of both the true rate of events λ and the reporting rate p. Previous work proposed several models to separate (deconvolve) λ and p using observations that inform only their product μ.
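The two-step process is easy to simulate. The following minimal sketch (with hypothetical values λ = 20 and p = 0.3, not taken from the paper) illustrates the Poisson thinning identity that makes the product λp, but not its individual factors, directly observable:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p, n_sim = 20.0, 0.3, 200_000  # hypothetical true rate and reporting probability

# Step 1: true event counts from the Poisson process.
n_true = rng.poisson(lam, size=n_sim)
# Step 2: each event is independently reported with probability p.
y = rng.binomial(n_true, p)

# Poisson thinning: the reported counts are themselves Poisson with mean lam * p,
# so the observed data alone reveal only the product of the two parameters.
print(y.mean())  # close to lam * p = 6.0
```

Any pair (λ', p') with λ'p' = λp would produce statistically indistinguishable reported counts, which is the deconvolution difficulty discussed above.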

2.1 Models for the true rate and reporting rate

The only way to deconvolve μ into its factors is to incorporate additional assumptions about each component; otherwise, λ = μ and p = 1 is always a valid solution. Previous work on models for under-reported counts used distributional assumptions on the parameters, depending on whether the observations are associated with covariates, as described below.

2.1.1 Modeling without covariates

In the absence of covariates, a popular approach is to adopt a hierarchical model, where λ and p are latent variables drawn from underlying prior distributions. When a Gamma prior is placed on λ and a Beta prior on p, the resulting model is called the Beta-Binomial/Negative Binomial distribution, described and analyzed by Schmittlein et al. (1985). The specific choice of priors makes it tractable to compute posterior estimates of the individual λ_i and p_i, as derived by Fader and Hardie (2000) for Empirical Bayes estimation.

2.1.2 Modeling with covariates

Given covariates x_i that predict the true rates and covariates w_i that predict the reporting rates, we can model λ_i and p_i as functions of these covariates. The most popular model is the Poisson-Logistic regression, or Pogit model, proposed by Winkelmann and Zimmermann (1993). In this model, the first step of the under-reported counts process (1) is modeled according to standard Poisson regression with coefficients β:

    λ_i = exp(x_i^T β).

The second step (2) is modeled according to logistic regression with coefficients α:

    p_i = 1 / (1 + exp(-w_i^T α)).

The generating distribution for y_i can now be written as the Poisson distribution

    y_i ~ Poisson( exp(x_i^T β) / (1 + exp(-w_i^T α)) ).
The Pogit model has been used to estimate worker absenteeism in econometrics (Winkelmann, 1996), tuberculosis incidence in epidemiology (Stoner et al., 2019), and traffic accidents in highway engineering (Wood et al., 2016).
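As a concrete sketch (on simulated data with hypothetical coefficients; this is not the paper's package), the Pogit likelihood can be maximized directly with a generic optimizer. Note that while the fitted mean matches the data well, the individual coefficient estimates need not be close to the truth:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, log_expit, gammaln

rng = np.random.default_rng(1)
n = 5000
x = rng.normal(size=n)       # covariate driving the true rate
w = rng.normal(size=n)       # covariate driving the reporting rate
lam = np.exp(1.0 + 0.5 * x)  # hypothetical Poisson rates
p = expit(0.5 + 1.0 * w)     # hypothetical reporting probabilities
y = rng.poisson(lam * p)     # only the thinned counts are observed

def neg_log_lik(theta):
    b0, b1, a0, a1 = theta
    log_mu = b0 + b1 * x + log_expit(a0 + a1 * w)  # log(lambda_i * p_i)
    return -(y * log_mu - np.exp(log_mu) - gammaln(y + 1)).mean()

fit = minimize(neg_log_lik, np.zeros(4), method="L-BFGS-B")
b0, b1, a0, a1 = fit.x
mu_hat = np.exp(b0 + b1 * x) * expit(a0 + a1 * w)
```

Because the intercept b0 is free, the first-order conditions force the fitted means to match the observed counts on average, even when the λ and p components themselves are poorly resolved.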

2.2 Parameter Estimation

Estimating the parameters of the Pogit model remains a deconvolution problem for the λ and p processes: a high observed count may be due to either a high underlying rate or a high reporting rate. Previous work addressed identifiability conditions for the Pogit model as well as several ways to include side information to improve the parameter estimates.

2.2.1 Conditions for parameter identifiability

A complete treatment of the difficulties in separating λ and p under the maximum likelihood framework is given by Papadopoulos and Silva (2012), who observed two distinct Pogit model parameterizations that lead to the same conditional law of y given the covariates. When the same covariates enter both predictors, the identity

    exp(x^T β) · σ(x^T α) = exp(x^T (β + α)) · σ(-x^T α),    (8)

where σ denotes the logistic function, shows that the parameterizations (β, α) and (β + α, -α) induce identical means. The authors show that identifiability can be regained either by knowing the sign of some nonzero element of α a priori or by restricting some covariates to enter λ only, excluding them from p. In earlier work (Papadopoulos and Santos Silva, 2008), the authors discussed problems that could arise if the restricted covariate is nearly collinear with the remaining covariates; in this case, they illustrated the resulting near-unidentifiability using employment data from the German Socio-Economic Panel (Wagner et al., 1993), which was previously analyzed by Winkelmann (2008).

The identifiability problem (8) shows how the inherent ambiguity in deconvolving a product into individual terms directly translates to ambiguity in the Pogit model. In this work, we reduce this ambiguity by incorporating additional information into the parameter estimation, i.e., using constraints and regularization, to better resolve λ and p (and their associated models) in the maximum likelihood framework.
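When the same covariate enters both linear predictors, the ambiguity is exact, as the following numerical check illustrates (the coefficients are arbitrary): flipping the sign of the logit part and absorbing it into the log-linear part leaves the Pogit mean unchanged.

```python
import numpy as np
from scipy.special import expit

x = np.linspace(-2.0, 2.0, 9)
b0, b1, a0, a1 = 1.0, 0.5, -0.3, 0.8  # arbitrary shared-covariate Pogit coefficients

# Original parameterization: lambda(x) * p(x).
mu1 = np.exp(b0 + b1 * x) * expit(a0 + a1 * x)
# Because expit(z) = exp(z) * expit(-z), the pair (beta + alpha, -alpha)
# produces a different (lambda, p) decomposition with an identical product.
mu2 = np.exp((b0 + a0) + (b1 + a1) * x) * expit(-(a0 + a1 * x))
```

The two decompositions cannot be distinguished from reported counts alone, which is why sign restrictions or exclusion restrictions are needed.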

2.2.2 Variance reduction methods

Correlation among the covariates for λ and p increases the risk of unidentifiability in the model and can manifest as high variance, as noted by Papadopoulos and Silva (2012). Another source of variance in the Pogit model is model misspecification, which can occur if there is overdispersion in the Poisson process. Several techniques have been developed to address these sources of variance.

One technique adds constraints or priors to the model, incorporating side information to reduce variance. This approach, used by Stoner et al. (2019), applies the Pogit model to the problem of estimating tuberculosis incidence in regions of Brazil using a Bayesian formulation. Here, the side information is a prior on the aggregate rate of tuberculosis reported across all regions, elicited from WHO estimates. The authors emphasize the strong dependence of the fitted model on this prior. Another type of side information useful for reducing variance is the presence of fully reported observations, for which the reporting rate is one. This type of regularization forces the fitted reporting function to pass through certain points and is used in the analyses of Stamey et al. (2006) and Dvorzak and Wagner (2016).

A second technique addresses model misspecification by increasing the Pogit model's flexibility. Overdispersion of the true observations could be addressed by replacing the Poisson model with a negative binomial, although to our knowledge this has not been done previously. In the Bayesian setting, Stoner et al. (2019) address overdispersion by including additional Gaussian noise in the relationships between λ and p and their covariates.

3 Characterizing the difficulties of λ, p deconvolution

With the exception of Papadopoulos and Silva (2012), no work has analyzed the shortcomings of the Pogit model from a theoretical perspective. In Section 3.1, we derive an asymptotic lower bound for the variance of the maximum likelihood estimate of the Pogit parameters under a simplified setting, where λ and p each depend on a single covariate. This analysis reveals a fundamental difficulty: the variance of α̂ grows with α, making it difficult to identify α, and hence p, in a setting with moderate or large α. We test this intuition using numerical simulations in Section 3.2, where we present a simple setting that nonetheless makes it impossible to infer even the sign of α for any value of α.

3.1 Theoretical analysis in the two-covariate setting

To characterize the behavior of the estimated Pogit model, we analyze a simple version of it. In our setting, λ and p are each determined by a single covariate:

    λ_i = exp(β x_i),   p_i = 1 / (1 + exp(-α w_i)),

so that

    y_i ~ Poisson( exp(β x_i) / (1 + exp(-α w_i)) ).    (13)

We are interested in the performance of estimators of the parameters θ = (β, α), measured by the mean squared error

    E[ ‖θ̂ - θ‖² ].

For i = 1, …, n, the covariates x_i and w_i are drawn independently, and y_i is drawn according to (13).

To analyze the behavior of the maximum likelihood estimates of β and α, we assume as a further technical condition that |β| ≤ B_β and |α| ≤ B_α for some constants B_β, B_α. These assumptions help us prove regularity conditions about the maximum likelihood estimator, but in practice the bounds can be chosen sufficiently large to be inactive at the solution.

First, we provide preliminary results about the Pogit model. We did not find these results in the literature and include them here for completeness, with proofs in Appendix A.

Recall that the Fisher information matrix is defined by

    I(θ) = E[ ∇_θ log f(y; θ) ∇_θ log f(y; θ)^T ],

where f(y; θ) is the probability density function of the data y given the covariates and the parameters θ. We now state the following lemma, which contains the regularity conditions required (1) to show that the MLE is asymptotically normally distributed, and (2) for the Cramér-Rao lower bound to hold:

Lemma 3.1.

Let (x_i, w_i, y_i), i = 1, …, n, be drawn as in (13). The following regularity conditions hold.

  1. θ_0 is identified: if θ ≠ θ_0 and θ ∈ Θ, then f(·; θ) ≠ f(·; θ_0) with respect to the dominating measure ν.

  2. θ_0 lies in the interior of Θ, which is assumed to be a compact subset of R².

  3. f(y; θ) is continuously differentiable at each θ ∈ Θ for all y (a.e. y will suffice).

  4. f(y; θ) > 0 for all θ ∈ Θ and all y.

  5. f(y; θ) is twice continuously differentiable, and f(y; θ) > 0, in a neighborhood, N, of θ_0.

  6. ‖∇_θ f(y; θ)‖ ≤ g_1(y) with ∫ g_1 dν < ∞, for all θ ∈ N and all y.

  7. Defining the score vector

        s(θ; y) = ∇_θ log f(y; θ),

    we have that I(θ_0) = E[ s(θ_0; y) s(θ_0; y)^T ] exists and is non-singular.

  8. ‖∇²_θ f(y; θ)‖ ≤ g_2(y) with ∫ g_2 dν < ∞, for all θ ∈ N and all y.

  9. ‖∇²_θ log f(y; θ)‖ ≤ g_3(y) with E[g_3(y)] < ∞, for all θ ∈ N and all y.

As a consequence, the maximum likelihood estimate θ̂_n of θ_0 is distributed asymptotically according to

    √n (θ̂_n - θ_0) →_d N(0, I(θ_0)^{-1}).
The inverse of the Fisher information matrix provides a lower bound on the variance of any unbiased estimator via the Cramér-Rao lower bound; bounds on the Fisher information matrix therefore provide insight into the fundamental difficulty of the estimation task. Our main result is given below.

Theorem 3.2.

Let (x_i, w_i, y_i), i = 1, …, n, be drawn as in (13). Let θ̂ be any unbiased estimator of θ = (β, α). Then, the covariance of θ̂ is lower bounded by


To gain a more intuitive understanding of the result in Theorem 3.2, we specialize the bound to particular values of β and α, in which case the asymptotic variances for β̂ and α̂ can be lower-bounded by


The standard deviation in estimating α grows at least proportionally to the value of α itself; for a fixed sample size, there will be at least a constant probability of incorrectly estimating the sign of α, which is equivalent to incorrectly determining whether p is an increasing or decreasing function of w. In applications where estimating p is important, additional information and constraints must be used to decrease the variance of α̂.
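This behavior can be checked numerically. The sketch below estimates the per-observation Fisher information of the single-covariate Pogit model by Monte Carlo, assuming (purely for illustration) standard normal covariates; the resulting Cramér-Rao variance bound for α̂ grows sharply with α, consistent with the discussion above.

```python
import numpy as np
from scipy.special import expit

def fisher_info(beta, alpha, n_mc=400_000, seed=2):
    """Monte Carlo estimate of the per-observation Fisher information for the
    single-covariate Pogit model mu = exp(beta * x) * expit(alpha * w),
    with x, w drawn (as an illustrative assumption) from N(0, 1)."""
    rng = np.random.default_rng(seed)
    x, w = rng.normal(size=n_mc), rng.normal(size=n_mc)
    p = expit(alpha * w)
    mu = np.exp(beta * x) * p
    # For y ~ Poisson(mu), the score is (y - mu) * g with g = (x, (1 - p) * w),
    # so the information is I = E[Var(y) * g g^T] = E[mu * g g^T].
    g = np.stack([x, (1.0 - p) * w])
    return (mu * g[:, None] * g[None, :]).mean(axis=-1)

# The Cramer-Rao variance bound for alpha-hat (the lower-right entry of the
# inverse information) grows rapidly as alpha increases:
bound_small = np.linalg.inv(fisher_info(1.0, 1.0))[1, 1]
bound_large = np.linalg.inv(fisher_info(1.0, 4.0))[1, 1]
```

Intuitively, for large α the reporting probability saturates near 0 or 1 for most covariate values, so few observations carry information about α.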

3.2 Simulation Study

We perform a simulation study to illustrate the high variance of β̂ and α̂ in this simple setting. For our simulations, we generate the covariates x_i and w_i independently, as in Section 3.1.

Figure 1: Simulation studies illustrating the dependence of β̂ and α̂ on β and α. The left panels vary α with β fixed, while the right panels vary β with α fixed. Solid blue lines show the average parameter estimate over repeated trials; the blue shaded region is the 95% confidence interval based on the empirical standard deviation. The red dashed lines show our theoretical lower bound on the 95% confidence interval.

For the first simulation, shown in Figure 1 (left), we fix β and vary α, and we report the 95% confidence interval based on the empirical standard deviation of β̂ and α̂ over the replicates. We compare this to our theoretical lower bound on the confidence interval for each parameter. Our experiments confirm the theoretical bound in this setting: the standard deviation of β̂ is independent of α, while the standard deviation of α̂ grows linearly with α, highlighting the difficulty of estimating α in this setting. We also see that α = 0 is always within the 95% confidence interval, indicating that no matter how large α truly is, we cannot determine whether p increases or decreases with w in this setting.

We perform a second simulation with α fixed while β varies, shown in Figure 1 (right). The results confirm our interpretation of the lower bound; the standard deviation of both β̂ and α̂ decreases as the magnitude of β increases.

These simulations illustrate that the estimated parameters can suffer from high variance; in particular, we can never reject the hypothesis that α = 0 for any value of α. In practical settings, we need additional information to reduce the variance in the model and successfully deconvolve λ and p. We next discuss a framework for incorporating priors and constraints as a powerful way to bring expert knowledge to bear on specific problems.

4 Incorporating prior knowledge into model building

We now discuss how we use prior knowledge to deconvolve the under-reporting and data-generating mechanisms while maintaining model identifiability (Section 4.1) and how we apply the sandwich estimation procedure to quantify the resulting model's uncertainty (Section 4.2).

4.1 Covariate Specifications

To deconvolve the under-reporting mechanism (parametrized by p) from the data-generating mechanism (parametrized by λ), we model both as functions of covariates. These functions consist of a linear predictor and a link function that transforms the linear predictor to the desired space, as in generalized linear models. For example, to model the parameter λ_i using a vector of covariates x_i, a vector of regression coefficients β, and a link function g, we write g(λ_i) = x_i^T β.

Linear predictors.

For λ and p, we can choose simple covariate specifications, like including a continuous predictor as a single linear term. Alternatively, nonparametric regression techniques (such as basis splines (De Boor, 1978)) let us flexibly parametrize the relationship between a covariate and λ or p. These spline specifications, which include the degree of the spline and the number and location of knots, are embedded in the linear predictor.

To encode knowledge about the shape of the relationship between the parameters and the covariates, we can use linear inequality constraints on the regression coefficients. For example, such constraints could force a regression coefficient to be positive, which would enforce an increasing relationship between the reporting rate and a given covariate. Furthermore, general linear constraints are particularly useful for working with basis splines. The second derivative of a basis spline can be represented as a linear function of its basis elements, so linear constraints on the regression coefficients let us constrain the second derivative to be positive or negative.
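As a minimal illustration (simulated data with hypothetical coefficients; the open-source package referenced in the paper is not used here), a sign constraint on a reporting-rate coefficient can be imposed with simple box bounds during maximum likelihood estimation:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, log_expit

rng = np.random.default_rng(3)
n = 2000
x, w = rng.normal(size=n), rng.normal(size=n)
y = rng.poisson(np.exp(1.0 + 0.5 * x) * expit(0.5 + 1.0 * w))

def neg_log_lik(theta):
    b0, b1, a0, a1 = theta
    log_mu = b0 + b1 * x + log_expit(a0 + a1 * w)
    return -(y * log_mu - np.exp(log_mu)).mean()

# Side information: the reporting rate is known to be non-decreasing in w,
# so the corresponding coefficient is constrained to be non-negative.
bounds = [(None, None)] * 3 + [(0.0, None)]
fit = minimize(neg_log_lik, np.zeros(4), method="L-BFGS-B", bounds=bounds)
```

More general linear inequality constraints (e.g., on spline second derivatives) can be handled the same way with a constrained solver; box bounds are simply the easiest special case to sketch.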

Finally, rather than constraining the regression coefficients of the linear predictor, we can include a quadratic regularizer, commonly known as a Gaussian prior or 'ridge' regression penalty (Hoerl and Kennard, 1970). By trading off bias for variance in the parameter estimation, we can avoid over-fitting the data and incorporate prior beliefs in a quantitative way.

Link functions.

In the Pogit model, we use different link functions for λ and p. Specifically, we use the logit function for p and the log function for λ. We can also use other functional forms, as needed. For example, if we understand from prior knowledge that the reporting rate lies between bounds a and b, the inverse link function

    p = a + (b - a) / (1 + exp(-z))

for the parameter p captures this information with no need for more complex constraints.
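A sketch of such a bounded inverse link (with hypothetical bounds 0.2 and 0.8 chosen only for illustration):

```python
import numpy as np
from scipy.special import expit

def bounded_expit(z, lo=0.2, hi=0.8):
    """Inverse link mapping the linear predictor into [lo, hi], encoding
    prior bounds on the reporting rate without explicit constraints."""
    return lo + (hi - lo) * expit(z)

z = np.linspace(-10.0, 10.0, 101)
p = bounded_expit(z)  # stays strictly inside [0.2, 0.8]
```

Because the bounds are built into the link, no constrained optimization is needed; any unconstrained value of the linear predictor maps to an admissible reporting rate.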


To show the impact of these innovations on the Pogit model, we use a simple synthetic example, where λ and p are taken to be simple nonlinear functions of covariates x and w, simulated as independent uniform random variables. The Pogit model is fitted using only reported data. Spline specifications for λ and p are used to capture the nonlinear relationships. Figure 2 shows the results for the predicted λ, p, and μ across 100 realizations of the experiment. Its first column presents results for the unconstrained spline Pogit approach; though the μ fit is correct (third row), resolving λ and p is far more difficult. In each column thereafter, we show the impact of the three techniques described above. The second column shows results for the modified link function that bounds p through its representation. The third column shows results for quadratic regularization pulling p toward its true value. Finally, the fourth column presents the effect of imposing convexity constraints on λ (as a function of x) and p (as a function of w). All three techniques improve the resolution of λ and p, with quadratic regularization helping the most: it provides specific (strong) information about the value of p rather than general (weaker) information about the bounds on p or the nature of the relationship between λ and x or p and w. All approaches are comparable in their recovery of μ, which underscores the fact that the μ fit alone cannot differentiate successful resolution of λ and p (e.g., as in column 3) from failure to resolve these parameters (as in the unconstrained results of column 1).

Figure 2: Improvements in fits vs. the baseline Pogit model (first column) from: the modified logit link (second column), the prior on p (third column), and convexity constraints (fourth column). All methods easily fit μ (third row), since it is directly informed by the observations. All three innovations improve the fit for λ (first row) and p (second row), with the prior on p showing the largest impact in this example.

4.2 Uncertainty quantification and model diagnostics

We use a robust approach to uncertainty quantification that allows estimation of the uncertainty for λ and p as well as for the covariate multipliers describing the relationships between these parameters and the covariates. Sandwich estimation (Kauermann and Carroll, 2001; Wakefield, 2013) is robust to model misspecification, which proves to be particularly important for the Pogit model. Specifically, our variance-covariance matrix is computed as

    Σ̂ = H^{-1} J H^{-1},

where H is the Hessian of the negative log likelihood and J is the Gauss-Newton Hessian approximation (the sum of outer products of the score contributions), both computed at the maximum likelihood estimate using empirical approximations of the expectations. In practice, this approach reports wider uncertainty intervals in the presence of model mis-specification, helping modelers to detect difficult cases.
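The sandwich construction can be sketched for a plain Poisson regression (a deliberately simplified stand-in for the full Pogit model, fit to simulated overdispersed data): the robust covariance H^{-1} J H^{-1} inflates the model-based covariance H^{-1} exactly when the Poisson variance assumption fails.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
n = 4000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
lam = np.exp(X @ np.array([1.0, 0.5]))
# Overdispersed (negative binomial) counts deliberately mis-specify the Poisson model.
y = rng.negative_binomial(5, 5.0 / (5.0 + lam))

neg_log_lik = lambda b: (np.exp(X @ b) - y * (X @ b)).mean()
beta = minimize(neg_log_lik, np.zeros(2), method="L-BFGS-B").x

mu = np.exp(X @ beta)
# H: expected Hessian of the log likelihood; J: outer products of the scores.
H = (mu[:, None, None] * X[:, :, None] * X[:, None, :]).sum(axis=0)
J = (((y - mu) ** 2)[:, None, None] * X[:, :, None] * X[:, None, :]).sum(axis=0)
model_cov = np.linalg.inv(H)
sandwich_cov = np.linalg.inv(H) @ J @ np.linalg.inv(H)
```

Under overdispersion the sandwich intervals are wider than the model-based ones, which is precisely the diagnostic signal described above.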

5 Case Studies: Validation on Injury Datasets

We present two case studies exploring the performance of the Pogit model on health-related datasets. Our two case studies, on interpersonal violence and diabetes, illustrate the use of prior knowledge in the form of covariates, constraints and regularization to address the challenges of identifiability and high variance (described in Section 3). In the interpersonal violence study, we estimate the rate of injuries warranting medical care using data from injuries warranting only inpatient medical care, allowing us to apply the “under-reporting” framework. For the diabetes study, we estimate the overall rate of medical encounters, again using only inpatient data. For each case, we validate our predictions from the Pogit model by comparing them to the total of inpatient and outpatient data.

5.1 Case Study: Interpersonal Violence

In the International Classification of Diseases 9 (ICD-9) and 10 (ICD-10) codes, injuries are classified by their cause (e.g., interpersonal violence) and/or their nature (e.g., traumatic brain injury) (Vos et al., 2020). In addition, they are reported separately based on outpatient or inpatient status. For this case study, we consider all injuries resulting from interpersonal violence and separate them by treatment inside or outside the hospital setting.

For the validation setting, λ is the true rate of all inpatient and outpatient injuries combined, p is the proportion of injuries that are seen in the inpatient setting, and μ = λp is the observed rate of injuries in the inpatient category. Our goal is to recover the total rate of injuries due to interpersonal violence from inpatient information alone. We use inpatient and outpatient interpersonal violence injury data aggregated at the national level for the US by the Global Burden of Disease study (National Center for Health Statistics and Prevention, 2018). Only two covariates, age and sex, are available in this dataset.

Interpersonal violence illustrates an interesting case for our models: the covariates controlling the reporting rate are a strict subset of the covariates controlling the true rate, leading to a loss of identifiability that can be recovered only by using constraints. In particular, as explained below, we model the rate of injury as a function of age and sex and the probability of inpatient care using only sex.

Figure 3 shows the observed (inpatient) injury rate per person per year, split by age and sex cohorts, for five-year periods between 1993 and 2012. There are clear age and sex effects in the data, with young adult males having the highest rates of injuries requiring inpatient care.

Figure 3: Observed (inpatient) interpersonal violence injuries in the United States, plotted by age/sex, over four five-year periods between 1993 and 2012.

Domain knowledge is a critical component of the modeling process, and identifiability of the Pogit model depends on proper modeling choices for λ and p. In both of our case studies, we use plots of the true λ and p, shown in Figure 4, to make reasonable choices for their functional forms. Based on these plots, we model λ as a spline in age and sex and p as a function of sex alone. This information would not be available to modelers in a real under-reporting setting, who would instead need domain knowledge to determine the functional forms of λ and p.

Figure 4: True plots of λ, the combined rate of inpatient and outpatient injuries (left), and p, the rate at which injuries are treated as inpatient (right). The data used to make these plots is not provided to the Pogit model, which sees only inpatient data.

5.1.1 Modeling λ as a function of age and sex.

We model the true injury rate, λ, as a function of age a, sex s (coded s = 1 for males and s = 0 for females), and a fitted intercept:

    log λ = β_0 + f(a) + β_s s.    (19)

Age enters the model as a cubic spline f(a) with a single interior knot. The placement of knots can be guided by domain knowledge, e.g., about change points in interpersonal violence based on age.

5.1.2 Modeling p as a function of sex alone.

As Figure 4 shows, the fraction of interpersonal violence injuries requiring inpatient care (the fraction "reported") is primarily a function of sex, with a higher rate for males than females. We model this as

    logit p = α_0 + α_s s.    (20)
5.1.3 Constraints and regularization.

The covariates modeling p are a subset of those used to model λ. As discussed by Papadopoulos and Santos Silva (2008) and in Section 2.2.1, this overlap in covariates renders the model unidentifiable. To recover identifiability, we add the constraint that males seek inpatient care at higher rates than females. This constrains the sign of the sex coefficient in Eqn (20) and results in an identifiable model.

Even with this constraint, substantial variance remains in the model predictions for λ and p. Our approach to further reduce this variance is to add regularization for p, pushing the fit toward prior information using a quadratic (ridge) penalty.

5.1.4 Results.

Even with regularization and constraints, the variance of the model predictions remains high. Figure 5 shows the fitted model and its components λ̂ and p̂. We see that (1) the fitted p̂ is higher than the true reporting rate for both sexes, and (2) the variance is so high that the confidence intervals span almost the entire range [0, 1], signalling that the problem is difficult. Nonetheless, we can still recover a reasonable estimate for λ, which is typically the more important quantity from a global health perspective.

We obtain a quantitative comparison of our estimate vs. baseline estimates using the Akaike Information Criterion (AIC). The AIC is given by 2k - 2 log L̂, where L̂ is the likelihood of the model on the fully reported data (see Equation (2)) and k is the number of parameters. The first model we compare to is the oracle model of λ. The oracle observes the combined inpatient and outpatient data (z, in the notation of Equation (1)) and fits the Poisson model (19) to that data. This represents the maximum likelihood fit within the model class of Equation (19). Our second comparison is to the naive baseline that ignores under-reporting. For this baseline, the Poisson model (19) is fit to the inpatient data only (y, in the notation of Equation (2)). This baseline, which represents a model that is unaware of the under-reporting problem, will have a low likelihood on the true counts z when the observations are severely under-reported.
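The comparison can be sketched as follows (with hypothetical values λ = 10 and p = 0.4, not the study's data): evaluating Poisson log likelihoods of the oracle mean and the naive under-reported mean on the true counts shows the penalty for ignoring under-reporting.

```python
import numpy as np
from scipy.special import gammaln

def poisson_aic(y, mu, k):
    """AIC = 2k - 2 log L for a Poisson model with fitted means mu
    and k estimated parameters."""
    log_lik = (y * np.log(mu) - mu - gammaln(y + 1)).sum()
    return 2 * k - 2 * log_lik

rng = np.random.default_rng(5)
lam, p, n = 10.0, 0.4, 1000   # hypothetical true rate and reporting rate
z = rng.poisson(lam, size=n)  # true (fully reported) counts

aic_oracle = poisson_aic(z, np.full(n, lam), k=1)     # fit matched to the true rate
aic_naive = poisson_aic(z, np.full(n, lam * p), k=1)  # ignores under-reporting
```

The naive model systematically under-predicts the true counts, so its likelihood on the fully reported data is far lower and its AIC correspondingly higher.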

Table 1 shows the AIC values for the three models we consider. Since all three models are of the same parametric form, the 2k term acts as a constant offset. We see that the Pogit fit is significantly better than ignoring under-reporting and significantly worse than the oracle fit (by a likelihood ratio test in both cases).

        Oracle Fit    Pogit Fit    Ignoring Under-reporting
AIC     18000         62000        156000

Table 1: AIC of the fit to the true interpersonal violence injury rate over all data points, reported under three different models: the "oracle" Poisson fit of the injury rate to model (19) when the total number of injuries z can be observed; the Poisson portion (19) of the Pogit model fit to inpatient injuries y; and the naive baseline that ignores under-reporting by fitting the Poisson model (19) to the observed inpatient data y.
Figure 5: Regularized Pogit fits for the interpersonal violence injuries model using only inpatient data. The estimated total rate λ̂ and fraction inpatient p̂ are plotted for each age/sex cohort against validation data not available to the model. Shaded intervals are 90% confidence intervals computed using sandwich estimation, as described in Section 4.2.

5.2 Case Study: Diabetes Care

In the second case study, we apply the Pogit model to estimate the rate of diabetes care (inpatient and outpatient visits) having observed only inpatient visits. We use MarketScan healthcare claims data that is processed for use in the Global Burden of Disease Study (Truven Health Analytics; Vos et al., 2020; National Center for Health Statistics and Prevention, 2010). The data is aggregated at the age and sex level for each US state for the year 2019; the state-level aggregation lets us use a richer set of covariates to model λ and p. Figure 6 shows the rate of inpatient diabetes cases per person per year across all fifty states, as a function of age, sex, and population average fasting plasma glucose (FPG) for each state/age/sex cohort. Both age and population average FPG correlate positively with diabetes inpatient admissions. We parametrize models for λ and p based on the plots shown in Figure 7.

Figure 6: Inpatient diabetes care for age/sex (left) and cohort-average FPG (right) in 2019.
Figure 7: True fraction of inpatient visits as a function of age and sex for 2019 (left). Total diabetes care as a function of population average FPG and sex for 2019 (right).

5.2.1 Modeling λ as a function of FPG and sex.

We model the true rate of diabetes care, λ, as a function of sex s and the state average FPG g, with a fitted intercept:

    log λ = β_0 + h(g) + β_s s.    (21)

FPG enters the model as a quadratic spline h(g).

5.2.2 Modeling p as a function of age.

The observed rate of inpatient diabetes care is driven by the true rate of diabetes care and by the fraction of care that is treated in the inpatient vs. outpatient setting. Based on Figure 7, we model p as a quadratic spline in age a:

    logit p = α_0 + q(a).    (22)

We apply several constraints and regularizers on p to reduce the variance of our estimate. First, we enforce that p decreases from age 25 to age 60. Second, we apply a quadratic regularization of the average fitted p toward its true average. Including this side information improves the model fits for both λ and p.

5.2.3 Results.

We fit the model described in (21) and (22) to the diabetes inpatient care data. Figure 8 shows the results. Both the fitted λ̂ and p̂ capture the important properties of their respective processes: p̂ is convex in age, while λ̂ is concave in FPG and shows a higher rate for males than females.

Figure 8: Regularized Pogit fits for the diabetes care model using only inpatient data. The estimated total rate of care and fraction of inpatient care are plotted for each age/sex cohort against validation data that was not available to the model. The fitted inpatient care rate is a function of age, sex and state average FPG, so we provide three plots for different age cohorts. Shaded intervals are 90% confidence intervals computed using sandwich estimation, as described in Section 4.2.

Following the methodology of Section 5.1.4, we quantitatively evaluated the fitted model of the combined rate of care, λ, using the AIC. Table 2 shows the AIC of the Pogit model, the oracle with access to both inpatient and outpatient data, and the naive baseline that ignores under-reporting. The Pogit fit is significantly better than the naive baseline, but not as good as the model that observes both inpatient and outpatient data (as confirmed by a likelihood ratio test in both cases). The Pogit model outperforms the naive baseline by a wider margin on the diabetes data than on the interpersonal violence study because the reporting rate is lower for diabetes, so the penalty for ignoring under-reporting is higher.
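For reference, the AIC is 2k - 2 log L̂ for a model with k parameters and maximized likelihood L̂. A minimal sketch for a Poisson model, with made-up counts rather than the paper's data:

```python
import numpy as np
from scipy.special import gammaln

def poisson_aic(y, mu, k):
    """AIC = 2k - 2 log-likelihood for a Poisson model with fitted means mu."""
    loglik = np.sum(y * np.log(mu) - mu - gammaln(y + 1))
    return 2 * k - 2 * loglik

y = np.array([3, 1, 4, 2, 0])
mu_hat = np.full(5, y.mean())   # intercept-only fit: fitted mean = sample mean
print(round(poisson_aic(y, mu_hat, k=1), 2))   # → 19.46
```

Lower AIC indicates a better trade-off between fit and model complexity, which is how the three columns of Table 2 are compared.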

        Oracle Fit    Pogit Fit    Ignoring Under-reporting
AIC     450           749          37400

Table 2: AIC of the true diabetes visit rate over all data points, reported under three different models: the “oracle” Poisson fit of the visit rate to model (21) when the number of total visits is observed; the Poisson portion of the Pogit model fit on inpatient visits only; and the naive baseline that ignores under-reporting by fitting the Poisson model (21) directly to the observed inpatient data.

6 Discussion

In this paper, we presented theoretical challenges in modeling under-reported count data using the Pogit model. We showed how priors and constraints can help resolve these issues and used real-world data from the Global Burden of Disease study to validate our approach. We found that the proposed formulation enables successful estimation, provided sufficient prior information can be specified. Examples of such information include aggregate measures (such as national reporting rate), shape of the relationships, and prior values for specific datapoints and covariate values. The tools used to create the results and test the methods are available in a publicly accessible repository.

The approach and analysis in this paper focused on the Pogit model. Future analysis and extensions can be made by considering other count models that better account for over-dispersion. A potential challenge for extensions is that additional flexibility may exacerbate the difficulty of the deconvolution problem.

Another interesting direction for future work is to use the approach developed here to aid in decision making about the kinds of data or information in which to invest. Specifically, decisions to obtain new data sources or conduct additional studies can use the available package to evaluate the type of information most effective for minimizing robust uncertainty estimates. A rigorous framing of this idea is left to future work.

7 Software

The techniques described in this paper have been implemented in the Python package Regmod, available on GitHub; tutorials are also available on GitHub.


The authors would like to thank Emily Johnson for providing the diabetes data set, Madeline Moberg and Erin Hamilton for providing and explaining the road injuries data set, and Dr. Liane Ong for providing expert insight into covariates to use for the diabetes model. We are also grateful to Sandy Kaplan for her review of the manuscript.

Conflict of Interest: None declared.


  • C. De Boor (1978) A practical guide to splines. Vol. 27, Springer-Verlag, New York. Cited by: §4.1.
  • M. Dvorzak and H. Wagner (2016) Sparse Bayesian modelling of underreported count data. Statistical Modelling 16 (1), pp. 24–46. Cited by: §2.2.2.
  • P. S. Fader and B. G. Hardie (2000) A note on modelling underreported Poisson counts. Cited by: §2.1.1.
  • A. E. Hoerl and R. W. Kennard (1970) Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 (1), pp. 55–67. Cited by: §4.1.
  • G. Kauermann and R. J. Carroll (2001) A note on the efficiency of sandwich covariance matrix estimation. Journal of the American Statistical Association 96 (456), pp. 1387–1396. Cited by: §4.2.
  • National Center for Health Statistics, Centers for Disease Control and Prevention (2010) United States National Hospital Discharge Survey. Cited by: §5.2.
  • National Center for Health Statistics, Centers for Disease Control and Prevention (2018) United States National Ambulatory Medical Care Survey. Cited by: §5.1.
  • G. Papadopoulos and J. Santos Silva (2008) Identification issues in models for underreported counts. Cited by: Appendix B, §2.2.1, §5.1.3.
  • G. Papadopoulos and J.M.C. Santos Silva (2012) Identification issues in some double-index models for non-negative data. Economics Letters 117 (1), pp. 365–367. External Links: Document, Link Cited by: §2.2.1, §2.2.2, §3.
  • D. C. Schmittlein, A. C. Bemmaor, and D. G. Morrison (1985) Why does the NBD model work? Robustness in representing product purchases, brand purchases and imperfectly recorded purchases. Marketing Science 4 (3), pp. 255–266. Cited by: §2.1.1.
  • J. D. Stamey, D. M. Young, and D. Boese (2006) A bayesian hierarchical model for poisson rate and reporting-probability inference using double sampling. Australian & New Zealand Journal of Statistics 48 (2), pp. 201–212. Cited by: §2.2.2.
  • O. Stoner, T. Economou, and G. Drummond Marques da Silva (2019) A hierarchical framework for correcting under-reporting in count data. Journal of the American Statistical Association 114 (528), pp. 1481–1492. Cited by: §1, §2.1.2, §2.2.2, §2.2.2.
  • Truven Health Analytics (Website). Cited by: §5.2.
  • T. Vos, S. S. Lim, C. Abbafati, K. M. Abbas, M. Abbasi, M. Abbasifard, M. Abbasi-Kangevari, H. Abbastabar, F. Abd-Allah, A. Abdelalim, et al. (2020) Global burden of 369 diseases and injuries in 204 countries and territories, 1990–2019: a systematic analysis for the global burden of disease study 2019. The Lancet 396 (10258), pp. 1204–1222. Cited by: §1, §5.1, §5.2.
  • G.G. Wagner, R.V. Burkhauser, and F. Behringer (1993) The english language public use file of the german socio-economic panel. Journal of Human Resources 28, pp. 429–433. Cited by: §2.2.1.
  • J. Wakefield (2013) Bayesian and frequentist regression methods. Springer Science & Business Media. Cited by: §4.2.
  • R. Winkelmann and K. F. Zimmermann (1993) Poisson-logistic regression. Volkswirtschaftl. Fakultät d. Ludwig-Maximilians-Univ. München. Cited by: §2.1.2.
  • R. Winkelmann (1996) Markov chain Monte Carlo analysis of underreported count data with an application to worker absenteeism. Empirical Economics 21 (4), pp. 575–587. Cited by: §1, §2.1.2.
  • R. Winkelmann (2008) Econometric analysis of count data. Springer Science & Business Media. Cited by: §2.2.1.
  • J. S. Wood, E. T. Donnell, and C. J. Fariss (2016) A method to account for and estimate underreporting in crash frequency research. Accident Analysis & Prevention 95, pp. 57–66. Cited by: §1, §2.1.2.

Appendix A Proof of the Estimation Error Lower Bound

In this section, we present proofs of lemmas and theorems in the paper.

Setting A.1 (Two-covariate Pogit Model).

For i = 1, …, n, let the covariates for λ and for p be drawn independently according to


and let the observations y_i be drawn according to


where the parameters lie between fixed lower and upper bounds. The existence of lower and upper bounds on the parameters is needed to prove certain regularity conditions about the maximum likelihood estimator, but the bounds can be chosen such that they are never attained in practical settings.

Appendix B Proof of Lemma 3.1

To prove this claim, it suffices to show that the following regularity conditions are satisfied:

  1. θ₀ is identified, in the sense that if θ ≠ θ₀, then the corresponding likelihoods differ with respect to the dominating measure μ.

  2. θ₀ lies in the interior of Θ, which is assumed to be a compact subset of Euclidean space.

  3. The likelihood is continuously differentiable at each θ ∈ Θ for all data values (a.e. will suffice).

  4. for all and .

  5. The likelihood is twice continuously differentiable, and positive, in a neighborhood, N, of θ₀.

  6. for all and .

  7. Defining the score vector as the gradient of the log likelihood,

    then its covariance, the Fisher information matrix, exists and is non-singular.

  8. for all and .

  9. for all and .

We first introduce the likelihood decomposition, which we will use repeatedly in the following proofs:


And similarly for the log likelihood:


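The likelihood decomposition rests on the Poisson thinning property: if N ~ Poisson(λ) and each event is reported independently with probability p, then the reported count y is distributed Poisson(λ·p). A quick Monte Carlo sanity check with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
lam, p = 10.0, 0.3
n_sim = 200_000

N = rng.poisson(lam, n_sim)   # latent total counts
y = rng.binomial(N, p)        # each count reported with probability p

# The thinned counts should behave like Poisson(lam * p): mean = variance = 3
print(round(y.mean(), 2), round(y.var(), 2))
```

This is exactly why the observed counts in the Pogit model are Poisson with mean λp, so λ and p enter the likelihood only through their product unless covariates separate them.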
Condition (1). Identifiability holds because the covariates for λ are independent of the covariates for p (as shown in Papadopoulos and Santos Silva (2008)).

Condition (2) holds by the assumption given in Setting A.1.

Condition (3). From (27) and the fact that λ is continuously differentiable in its parameters and p is continuously differentiable in its parameters, we know that the log likelihood is continuously differentiable in θ.

Condition (4). Since Θ is a compact set, for every observation we can attain the maximum and minimum of the likelihood with respect to θ:

We can then write our dominating (upper bound) function in terms of these extremes:

For simplicity, we use a single symbol to denote either branch.

The expectation in Condition (4) can be written as

Here, we argue that since the Poisson model satisfies the regularity condition, we know that

We assume that the distribution of the covariates satisfies the regularity condition as well:

We can therefore conclude that

Since the preceding argument applies to both the maximum and minimum branches, we have

Condition (5). Using the expression (26) and the fact that λ is twice differentiable with respect to its parameters and p is twice differentiable with respect to its parameters, we know that the likelihood is twice differentiable with respect to θ. Moreover, since λ is always positive in the neighborhood of θ₀ and p is always positive, we know that the likelihood is always positive in the neighborhood of θ₀.

To verify Condition (6), we begin with the decomposition of the likelihood and then use (26) to write

Since Θ is a compact set, the upper and lower bounds of the partial derivative can be attained. Here, we use the same notation as in the proof of Condition (4). Our dominating function can now be written as

Since the dominating measure is uniform over the covariate supports, we have

The Poisson likelihood satisfies the regularity condition because it is a generalized linear model, and therefore we have

We can now write an upper bound on the quantity of interest:

Since the Pogit mean is the product of an exponential and an expit function, its partial derivative with respect to the parameters can grow at most as fast as an exponential function, and it is therefore dominated by the Gaussian density of the covariates. We conclude that

which shows that Condition (6) is satisfied.

Condition (7) is satisfied by inspection of the Fisher Information Matrix, which is computed in the proof of Theorem 3.2.

The proof of Condition (8) and Condition (9) follows from the same arguments used to prove Condition (4) and Condition (6), respectively.

Since all conditions are satisfied, the conclusion follows immediately.

Appendix C Proof of Theorem 3.2

We will prove this result using the Cramér-Rao lower bound. The result states that under certain regularity conditions (which we show are satisfied in Lemma 3.1),

Cov(θ̂) ⪰ I(θ)⁻¹,

where I(θ) is the Fisher information matrix, defined as

I(θ) = E[ ∇ log L(θ) ∇ log L(θ)ᵀ ],

where L(θ) is the likelihood of the observed data under the parameters θ. Under mild regularity conditions, which we show in Lemma 3.1 are satisfied in this instance, we can write the Fisher information matrix as the negative expected Hessian of the log likelihood function:

I(θ) = -E[ ∇² log L(θ) ].
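This information identity (the outer product of scores agreeing with the negative expected Hessian) can be checked numerically; as a toy check for a plain Poisson model with rate λ, not the Pogit model itself:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 4.0
y = rng.poisson(lam, 500_000)

# Score of the Poisson log-likelihood: d/d(lam) [y log(lam) - lam] = y/lam - 1
score = y / lam - 1.0
# Negative second derivative of the log-likelihood: y / lam**2
neg_hess = y / lam ** 2

# Both Monte Carlo estimates should approach the Fisher information 1/lam = 0.25
print(round(np.mean(score ** 2), 3), round(np.mean(neg_hess), 3))
```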
We compute the covariance of the score with respect to the randomness in both the counts and the covariates:


where we have used the fact that the distribution of the covariates is independent of the parameters θ. Next, we compute the Hessian of the negative log conditional likelihood. The log conditional likelihood under this model is given by


and the Hessian of the negative log conditional likelihood has components


We will bound each of these quantities independently. Note that the matrix is symmetric; we begin by showing that the off-diagonal entry is zero:


Note that the second expectation is of an odd function. Since we assumed the covariate distribution is symmetric around zero, this term is zero, and


Next, we compute the first diagonal entry in the matrix. We begin by separating it into terms that depend on λ and terms that depend on p:


Next, we evaluate the first expectation using the known distribution of the covariates and then completing the square: