Count data often exhibit dispersion relative to a classical Poisson model. For overdispersed counts, where the conditional variance is larger than the conditional mean, models based on the negative binomial, Poisson inverse-Gaussian and other Poisson rate mixture distributions have been well-covered in the literature from both frequentist and Bayesian points of view(e.g., Wilmot, 1987; McCullagh and Nelder, 1989; Gelman & Hill, 2007). For underdispersed counts, where the conditional variance is smaller than the conditional mean, the options are far more limited due to the difficulty in constructing full probability distributions that can handle underdispersion. A recent survey of a handful of existing models can be found in Sellers & Morris (2017)
. The lack of options is particularly detrimental from a Bayesian point of view because a full probability distribution for the data is required for the application of Bayes’ theorem to determine the posterior distribution and subsequent inferences.
This note takes a first step towards filling this gap in the literature by presenting a Bayesian framework for the regression modelling of counts that can (i) handle both over and under dispersion, and (ii) retain the same level of parsimony and interpretability as classical counts models. We do this by building on a recent reparametrisation of the Conway-Maxwell-Poisson distribution (CMP) that allows the mean to be modelled directly(Huang, 2017), so that simple and interpretable models, such as a log-linear model, can be constructed for both over and under dispersed counts. This places underdispersed counts on the same footing as equidispersed and overdispersed counts which are readily handled by the familiar Poisson and negative binomial models, respectively.
More specifically, the mean-parametrized CMP distribution with mean and dispersion is characterized by the probability mass function
where the rate is a function of and given by the solution to
and is a normalizing function. It is easy to show that implies overdispersion and implies underdispersion relative to a Poisson distribution of the same mean. When , the distribution coincides with the Poisson distribution. We write CMP for the mean-parametrized CMP distribution (1.1) to distinguish it from the standard CMP distribution of Shmueli et. al. (2005).
CMP distributions are particularly useful for modelling dispersed counts because they retain all the attractive properties of standard CMP distributions whilst being able to model the mean directly (Huang, 2017; Andrianatos, 2017). They form two-parameter exponential families, and for fixed or given values of the dispersion they become one-parameter exponential families, making them immediately adaptable to regression modelling via the generalized linear model (GLM, McCullagh and Nelder, 1989)
framework. They also form a continuous bridge between other classical models, passing through the geometric and Poisson distributions as special cases, with the Bernoulli distribution as a limiting case. Moreover, unlike the generalized Poisson(Consul & Famoye, 1992), quasi-Poisson (Wedderburn, 1974), and the recently-proposed Extended Poisson-Tweedie (Bonat et. al., 2018) models, CMP distributions always correspond to full probability models for any choice of model parameters, covering both over and under dispersion. This makes them particularly suitable for Bayesian modelling and inferences.
From a Bayesian point of view, Kadane et. al. (2006)
explored the use of conjugate priors for the standard CMP distribution in the independent and identically distributed case. While this can be easily extended to mean-parametrized CMP distributions (see Online Supplement), there are three practical limitations of using the conjugate prior. First, the subsequent posterior is known only up to a normalizing constant that has no closed-form, so computing it requires either numerical integration, Markov Chain Monte Carlo (MCMC) sampling, or some other approximation method. There is therefore no practical advantage of using the conjugate prior over any other prior. Second, appropriate specification of the hyperparameters can be rather opaque due to the unfamiliar form of the conjugate prior.Kadane et. al. (2006)
offer an applet to translate prior information into an appropriate elucidation of hyperparameters, but a simpler and more interpretable prior can mitigate this issue altogether. Finally, the conjugacy property only holds for intercept-only models, and the extension to the regression case in which the mean of each observation may depend on a set of covariates is not at all immediate. This is in fact true of most Bayesian generalized linear models – for example, while the Gamma distribution is conjugate for the Poisson distribution for intercept-only models, there is no extension to the Poisson regression case(Hoff, 2009, page 173). Instead, multivariate normal priors are placed on the regression coefficients for interpretability and parsimony (as implemented in the popular MCMCpack package in R by Martin et. al. (2011), for example).
To the best of our knowledge, this note is the first to consider a Bayesian framework for the regression modelling of both over and under dispersed counts that circumvents these limitations, yet retains the parsimony and interpretability of familiar count models such as the log-linear Poisson and negative binomial models. This work is part of a larger ongoing project that looks at efficient Bayesian inferences for dispersed counts in hierarchical models.
2 Bayesian CMP generalized linear model for dispersed counts
Suppose we have independent observation pairs , where each is a count response and each is a corresponding set of covariates. A CMP generalized linear model for the data that can handle both over and under dispersion can be specified via
is a vector of regression coefficients,is a dispersion parameter, and (in a slight abuse of notation) is a user-specified mean-model or inverse-link function. For this note, we focus on the popular log-linear model,
so that each component of can be interpreted as the expected change in the mean response (on the log scale) for a unit increase in the corresponding component of . Of course, other link functions can be considered, as with any GLM. Indeed, the key advantage of the CMP distribution is that is directly parametrized via the mean so that simple, easily-interpretable mean-models can be considered. In contrast, standard CMP distributions (Shmueli et. al., 2005; Lord et. al., 2008; Sellers & Shmueli, 2010) and its variants (e.g., Guikema & Coffelt, 2008) model either a latent rate parameter or some power transformation of it, which cannot be interpreted as the mean.
For a Bayesian model specification, we need a prior distribution on the model parameters and . In this note, we focus on easily interpretable prior specifications, such as
so that the hyperparameters and have clear interpretations as prior means and variances. This makes it easy to translate prior beliefs about the data-generating process into sensible specification of hyperparameter values. In contrast, the conjugate prior of Kadane et. al. (2006) has no clear interpretation, making elucidation of hyperparameter values rather opaque. Of course, the prior distribution in (2.2) can be replaced with any user-specified prior, with the proposed method in this note being applicable for any prior specification. Note that taking the prior variances in (2.2) to be arbitrarily large leads to improper flat priors, , which we consider in Section 4.3 and in the Online Supplement.
When the normal and log-normal priors (2.2) are used, the joint posterior of the parameters and given the observed data has the form,
which is known up to a normalizing constant. Direct computation of this posterior distribution is not possible, but the explicit form of the density kernel in (2) means that a simple Metropolis-Hastings (MH) algorithm can be directly used. We illustrate this in Section 3.
A reviewer pointed out that a key advantage of the Bayesian approach over the frequentist approach of Huang (2017)
is that posterior predictive distributions can be obtained for a new observationcorresponding to some covariate , given the observed values, via
This can be estimated via Monte Carlo averaging of the likelihoodevaluated at draws from the posterior distribution (2) obtained from the proposed MH algorithm.
3 Bayesian inferences for dispersed counts
From a Bayesian point of view we are interested in evaluating posterior distributions of the form (2) and making inferences on model parameters
via computing posterior means and credible intervals. Recent work byChanialidis et. al. (2017) presents an efficient approach to Bayesian inference for CMP models via a rejection sampling based method called ”exchange algorithm”, which is applicable to situations where the likelihood function can be computed only up to a normalising constant. This novel algorithm does not require computation of normalizing constants for the Metropolis-Hastings acceptance ratio by incorporating an auxiliary data generated from the sampling model where is drawn from the proposed distribution of the model parameters . This approach allows us to cancel out all intractable normalising constants. Detailed explanation of the algorithm and its comparison with the standard Metropolis-Hastings algorithm is presented in Section 3.2 of Chanialidis et. al. (2017). Benson & Friel (2017) also develop a new, faster CMP rejection sampler by building two tractable enveloping bounds. They then link the sampler with the reciprocal normalising constant, which allows the intractable likelihood function to be estimated without bias. While these approaches work nicely for regression models that are based on the standard CMP of Shmueli et. al. (2005) and the reparameterized CMP by Guikema & Coffelt (2008), it has not (yet) been adapted for the mean-parametrized distribution – this has been flagged for immediate future research.
Instead, we present a direct Metropolis-Hastings (MH) algorithm that is particularly straightforward to implement. There is also no restriction on the types of proposal densities that can be used, and we can choose to alternate updates for each parameter component which we have found to improve convergence speed of the subsequent Markov Chain.
For simplicity of illustration, we consider multivariate normal proposals for in our MH algorithm, with mean of each proposal given by the current value and some covariance matrix . That is, we sample proposals
. Due to the symmetry of normal distributions, the acceptance probability of this proposal is simply the ratio of the likelihoods,
where is the current value of . Similarly, we can consider exponential proposals for with mean of each proposal given by the current value . That is, we sample proposals with density . The corresponding acceptance probability is therefore given by
where is the current value of . Of course, other proposal distributions can also be used. However, we find the exponential proposal distribution to be particularly simple to work with as there is no secondary “variance”-type parameter to select. Indeed, the normal–exponential pair of proposal distributions led to excellent mixing times in the all examples that we looked at.
The MH MCMC algorithm obtained via alternating these proposals is summarized in Algorithm 1 below. We readily admit that this is by no means the most efficient algorithm possible – indeed, this note is part of a larger study on efficient Bayesian inferences for dispersed counts in hierarchical models. However, we believe that this is the first framework that places the analysis of underdispersed counts on the same footing as classical Poisson and negative binomial regression models for equidispersed and overdispersed counts, respectively. The algorithm is implemented as a simple plug-in to the mpcmp package (Fung et. al., 2019) in R, and can be obtained via emailing the corresponding author.
4 Numerical examples
The proposed framework is illustrated on two data analysis examples, one exhibiting overdipsersion and the other underdispersion. These examples are accompanied by a simulation study that demonstrates some favourable frequentist properties of the approach. All examples and simulations were carried out in R 3.4.1 (R Development Core Team, 2017) on an iMac desktop computer with an i7 3.4 GHz Intel Core, 16.0GB RAM.
To implement Algorithm 1 in practice, we first obtain the MLE of the model parameters for two purposes. First, the MLE estimates and are used to initialize the starting values and in the MCMC sampler. This replaces the burn-in period. Second, the estimated variance-covariance matrix of is used for the variance matrix in the proposal distribution for . For each real or simulated example, MCMC samples were generated with a thinning factor of 10. Thinning was used to reduce the autocorrelation in the samples.
4.1 Overdispersed class attendance data
The GLM (2.1)–(2.2) is used to analyze the attendance dataset from http://www.ats.ucla.edu/stat/stata/dae/nb_data.dta which contains the total number of days absent in an academic year for each of 314 students sampled from two urban high schools. Explicitly, our model for the data is the log-linear CMP regression model,
where for each student , is the total number of days absent, is an indicator for being female, is an indicator for being in the academic program, is an indicator for being in the vocational program, and is the mathematics score. For our priors we consider
corresponding to vague a priori beliefs about the model parameters. Algorithm 1 was used to obtain a sequence of random draws from the posterior distribution arising from model (4.1)–(4.2). Graphical summaries of the MCMC output and subsequent Bayesian inferences for the model parameters using the visualisation tool from Marley & Wand (2010) are presented in Figure 1.
We see from column 2 of Figure 1 that even with no burn-in the trace plots indicate that the Markov chain exhibited good mixing. There is also negligible correlation with the lagged-one sample (column 3), as well as minimal autocorrelation across all other lags (column 4). Column 5 plots the kernel estimates of the marginal posterior densities for all model parameters. In these plots, we also overlay the endpoints of credible intervals (marked in maroon) as well as those from a standard Poisson log-linear model (marked in pink). The last column provides a numerical summary of the MCMC estimates with the posterior means and the endpoints of the 95% credible intervals.
The use of the simple log-linear mean-model allows the posterior estimates to be easily interpreted. For example, the posterior mean of estimates that female students miss times as many days of school compared to male students, with a 95% credible interval between and times. Analogously, students in the academic and vocational programs are each expected to miss an estimated and times as many days, respectively, than students in the General (baseline) program. Comparing the credible intervals of the model with that of the standard Poisson model, we note that the Poisson model has not accounted for the conditional overdispersion in the data, as evidenced by the narrower intervals across all regression coefficients. Finally, the estimated dispersion parameter has posterior mean of with 95% credible interval of , confirming strong conditional overdispersion in the data. Each MCMC update took approximately 0.2 seconds on average to run.
4.2 Underdispersed takeover bids data
where is the number of bids received by each firm , and the explanatory variables are described in the Online Supplement. For our priors we again consider
corresponding to vague a priori beliefs about the model parameters. Algorithm 1 was used to obtain a sequence of random draws from the posterior distribution arising from model (4.3)–(4.4). Graphical summaries of the MCMC output and subsequent Bayesian inferences for the model parameters are presented in Figure 2.
The interpretation of the posterior means of each parameter is analogous to those from the first example. For example, firms that employed a legal officer during takeover negotiations (leglrest = 1) are estimated to receive times as many bids as a comparable firm that did not. The 95% credible interval for this increase is between and times. The posterior distributions for the other regression parameters can be interpreted in a similar way. For the dispersion parameter, the posterior mean is 1.62 with a 95% credible interval of (1.15, 2.06). This indicates that the data exhibit strong underdispersion.
Again, overlaying the kernel density plots in the fifth column of Figure 2 are the 95% credible intervals obtained from both the CMP and standard Poisson models. We see that the credible intervals from the Poisson model are wider than their counterparts across all predictors. In this case, the Poisson distribution has failed to account for the strong underdispersion in these data. The average computation time for each MCMC update here was seconds.
4.3 Sensitivity to the choice of prior
To examine sensitivity of the posterior distributions to the choice of prior, we re-ran the two data analysis examples using improper flat priors , so that the posterior is directly proportional to the likelihood. These improper priors can be conceptualised as taking the variance hyperparameters in (2.2) to be arbitrarily large. The results from the corresponding MH MCMC are summarised in Figures 3 and 4 in the Online Supplement. We see that the posterior distributions, means and credible intervals for model parameters are all very similar to the earlier results, suggesting that the proposed framework can be quite robust to the choice of prior, at least for problems that are similar in size and complexity to these two examples.
4.4 Frequentist coverage rates of credible intervals
The data analysis examples demonstrate the parsimony of the proposed framework for handling both over and under dispersed counts. Here, we complement these examples by examining frequentist coverage rates of credible intervals for regression coefficients obtained from the Bayesian CMP GLM (2.1)–(2.2). We compare these coverage rates with those of credible intervals generated from a classical Bayesian Poisson or negative binomial regression model.
Synthetic data are generated from the following three distributional settings:
no dispersion – Poisson)
underdispersion – CMP
overdispersion – negative binomial() with mean and variance
where the means are given by the fitted posterior mean model
Each synthetic dataset contains samples, the same as the original takeover bids dataset, and are conditioned on the same set of covariates. For each simulation, the Bayesian CMP model (2.1)–(2.2) was fit to the data along with the standard Bayesian Poisson and negative binomial models. For all three models, the prior distribution for the mean parameters was set to , whilst the prior distribution for the dispersion parameter was set to Log-Normal for the CMP and negative binomial models. Each simulation setting was replicated times.
The coverage rates of nominal 90%, 95% and 99% credible intervals from all three models are displayed in Table 1. Note that we have focussed on the coefficients of the first four covariates which correspond to possible defensive actions that could be taken by a target firm. The remaining six coefficients exhibit similar behaviour, but their corresponding covariates are not actionable factors at the firm level and so are not as interesting to look at.
We see from Table 1
that for counts with no dispersion the credible intervals from all three models are in agreement with their nominal coverage rates. When counts are generated from an overdispersed negative binomial distribution, credible intervals from both the negative binomial and the misspecifiedmodels have coverage rates that remain close to their nominal levels and comparable to the correctly-specified negative binomial model, but the Poisson model undercovers considerably. For underdispersed counts, coverage rates of the negative binomial model are omitted because they cannot handle underdispersion – in fact, the limiting case would coincide with the Poisson model. Indeed, the Poisson posterior credible intervals are far too wide with the true mean covered at a much higher rate than the nominal levels. In contrast, the model performs well in all three scenarios, reflecting its flexibility in handling equi, over and under dispersion equally well.
We also examine coverage properties of posterior credible intervals for the dispersion parameter using the proposed Bayesian CMP approach. In Table 1 we see that for both Poisson and underdispersed CMP data, the coverage rates for the dispersion parameter were again close to their nominal rates. For overdispersed negative binomial data, the CMP model is misspecified – we can instead look at the power of the posterior credible intervals, noting that nominal 95% credible intervals they did not contain in 98.0% of simulations. For narrower 90% credible intervals the power improved to 99.1%, and for wider 99% credible intervals the power was at 93.4% in our simulations. These results again suggest that the framework can be equally useful for handling equi, over and under dispersion under both correctly specified and misspecified data-generating mechanisms. Moreover, additional simulations carried out by the second author under different sample sizes and different levels of dispersion yielded similar results to those presented in Table 1.
|Data generating distribution|
This note proposes a Bayesian generalized linear model framework for both over and under dispersed counts. While it is demonstrated to be particularly simple to implement in practice and exhibits good frequentist coverage properties, we postulate that the proposed approach can be further improved by making it more efficient using the techniques of Chanialidis et. al. (2017) and Benson & Friel (2017). This has been earmarked for future research, along with extending the approach to hierarchical mixed models for dependent counts as well as including covariates in the dispersion parameter to account for non-constant dispersion.
We thank the Associate Editor and two anonymous referees for comments and suggestions that improved this paper. We thank Dr. Thomas Fung (Macquarie University) for help with the mpcmp package in R.
The online supplement contains mathematical details of the conjugate prior for CMP distributions, a description of the predictors for the takeover bids dataset, and summaries of posterior inferences for the two data analysis examples using flat priors. R code implementing the proposed algorithm can be obtained by emailing the authors.
- Andrianatos (2017) Andrianatos, I., On robustness and compatibility of count regression models, Honours thesis, University of Queensland (2017).
- Benson & Friel (2017) Benson, A. & Friel, N., Bayesian inference, model selection and likelihood estimation using fast rejection sampling: the Conway-Maxwell-Poisson distribution, arXiv:1709.03471 (2017) .
- Bonat et. al. (2018) Bonat, W.H., Jorgensen, B., Kokonendji, C.C., Hinde, J., & Demetrio, C.G.B., Extended Poisson–Tweedie: Properties and regression models for count data, Statistical Modelling, 18, (2018).
- Chanialidis et. al. (2017) Chanialidis, C., Evers, L., Neocleous, T., & Nobile, A., Efficient Bayesian inference for COM-Poisson regression models, Statistics and Computing, (2017) (to appear).
- Consul & Famoye (1992) Consul, P.C., & Famoye, F., Generalized poisson regression model, Communications in Statistics - Theory and Methods, 21, 89–109 (1992).
- Fung et. al. (2019) Fung, T., Alwan, A., Wishart, J., & Huang, A., mpcmp: Mean-parametrized Conway-Maxwell-Poisson, R package, (2019)
- Gelman & Hill (2007) Gelman, A., & Hill, J., Data Analysis Using Regression and Multilevel/Hierarchical Models, Cambridge University Press, Cambridge (2007).
- Guikema & Coffelt (2008) Guikema, S.D., & Coffelt, J.P., A Flexible Count Data Regression Model for Risk Analysis, Risk Analysis, 28, 213–223 (2008).
Hoff, P. D., A First Course in Bayesian Statistical Methods, Springer Series in Statistics, New York (2009).
- Huang (2017) Huang, A., Mean-parametrized Conway-Maxwell-Poisson regression models for dispersed counts, Statistical Modelling, 17, 1–22 (2017).
- Kadane et. al. (2006) Kadane, J.B, Shmueli, G., Minka, T.P., Borle, S. & Boatwright, P., Conjugate Analysis of the Conway-Maxwell-Poisson Distribution, Bayesian Analysis, 1, 363–374 (2006).
- Lord et. al. (2008) Lord, D., Guikema, S. D., & Geedipally, S. R., Application of the Conway-Maxwell-Poisson generalized linear model for analyzing motor vehicle crashes. Accident analysis and prevention. 40, 1123-1134, (2008).
- Marley & Wand (2010) Marley, J. K., & Wand, M. P., Non-standard semiparametric regression via BRugs. Journal of Statistical Software. 37, 1-30, (2010).
- Martin et. al. (2011) Martin, A. D., Quinn, K. M., & Park, J. H., MCMCpack: Markov Chain Monte Carlo in R. Journal of Statistical Software. 42, 1-21, (2011) URL http://www.jstatsoft.org/v42/i09/.
- McCullagh and Nelder (1989) McCullagh, P., & Nelder, J.A., Generalized Linear Models, second ed., Chapman and Hall, London (1989).
- Sáez-Castillo & Conde-Sánchez (2013) Sáez-Castillo, A. J. and Conde-Sánchez, A., A hyper-Poisson regression model for overdispersed and underdispersed count data, Computational Statistics & Data Analysis, 61, 148–157 (2013).
- Sellers & Morris (2017) Sellers, K. F. and Morris, D. S., Underdispersion models: Models that are “under the radar”, Communications in Statistics - Theory and Methods, 46, 12075–12086 (2017).
- Sellers & Shmueli (2010) Sellers, K. F. and Shmueli, G., A flexible regression model for count data, Annals of Applied Statistics, 4, 943–961 (2010).
- Shmueli et. al. (2005) Shmueli, G., Minka, T.P., Kadane, J.B., Borle, & S., Boatwright, P., A useful distribution for fitting discrete data: revival of the Conway-Maxwell-Poisson distribution, Journal of the Royal Statistical Society: Series C (Applied Statistics), 54, 127–142 (2005).
- Wedderburn (1974) Wedderburn, R.W.M., Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method, Biometrika, 61, 439–447 (1974).
Wilmot, G.E., The Poisson-Inverse Gaussian distribution as an alternative to the negative binomial, Scandinavian Actuarial Journal, 113–127 (1987).
Supplementary material for “Bayesian generalized linear model for over and under dispersed counts”
Alan Huang, and Andy Sang Il Kim
S1. Conjugate prior for CMP distributions
distributions form two-parameter exponential families (Appendix 1.2, Huang, 2017). We can therefore obtain the conjugate prior and posterior for the distribution by substituting into equation (8) of Kadane et. al. (2006). Then, we can write the the density function of the conjugate prior for the distribution in the form of
where is a normalising constant given by
and , , and are hyperparameters restricted by (10) from Kadane et. al. (2006). The corresponding posterior is of the same form,
with , and .
S2. Description of predictors for takeover bids dataset
The following description of explanatory variables are taken from Sáez-Castillo & Conde-Sánchez (2013):
Defensive actions taken by management of target firm: indicator variable for legal defense by lawsuit (leglrest), proposed changes in asset structure (rearest), proposed change in ownership structure (finrest) and management invitation for friendly third-party bid (whtknght).
Firm-specific characteristics: bid price divided by price 14 working days before bid (bidprem), percentage of stock held by institutions (insthold), total book value of assets in billions of dollars (size) and book value squared (size).
Intervention by federal regulators: an indicator variable for Department of Justice intervention (regulatn).
S3. Class attendance and takeover bids data analysis examples using improper flat priors
The results of the class attendance and takeover bids data analysis examples using the improper flat priors, , are summarized in Figures 3 and 4 respectively. These results are very similar to the results based on the Normal and log-Normal pair of priors from the main text, suggesting that the framework is robust to the choice of prior, at least for problems with similar size and complexity to these two examples.