Bayesian inference, model selection and likelihood estimation using fast rejection sampling: the Conway-Maxwell-Poisson distribution

09/11/2017
by   Alan Benson, et al.
0

Bayesian inference for models with intractable likelihood functions represents a challenging suite of problems in modern statistics. In this work we analyse the Conway-Maxwell-Poisson (COM-Poisson) distribution, a two parameter generalisation of the Poisson distribution. COM-Poisson regression modelling allows the flexibility to model dispersed count data as part of a generalised linear model (GLM) with a COM-Poisson response, where exogenous covariates control the mean and dispersion level of the response. The major difficulty with COM-Poisson regression is that the likelihood function contains multiple intractable normalising constants and is not amenable to standard inference and MCMC techniques. Recent work by Chanialidis et al. (2017) has seen the development of a sampler to draw random variates from the COM-Poisson likelihood using a rejection sampling algorithm. We provide a new rejection sampler for the COM-Poisson distribution which significantly reduces the CPU time required to perform inference for COM-Poisson regression models. A novel extension of this work shows that for any intractable likelihood function with an associated rejection sampler it is possible to construct unbiased estimators of the intractable likelihood which proves useful for model selection or for use within pseudo-marginal MCMC algorithms (Andrieu and Roberts, 2009). We demonstrate all of these methods on a real-world dataset of takeover bids.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 16

01/22/2019

A Conway-Maxwell-Poisson GARMA Model for Count Data

We propose a flexible model for count time series which has potential us...
01/29/2018

Reparametrization of COM-Poisson Regression Models with Applications in the Analysis of Experimental Data

In the analysis of count data often the equidispersion assumption is not...
07/26/2017

On the "Poisson Trick" and its Extensions for Fitting Multinomial Regression Models

This article is concerned with the fitting of multinomial regression mod...
03/29/2022

Direct Sampling with a Step Function

The direct sampling method proposed by Walker et al. (JCGS 2011) can gen...
08/13/2020

Flexible Modeling of Hurdle Conway-Maxwell-Poisson Distributions with Application to Mining Injuries

While the hurdle Poisson regression is a popular class of models for cou...
07/15/2021

Multivariate Conway-Maxwell-Poisson Distribution: Sarmanov Method and Doubly-Intractable Bayesian Inference

In this paper, a multivariate count distribution with Conway-Maxwell (CO...
09/22/2017

Barker's algorithm for Bayesian inference with intractable likelihoods

In this expository paper we abstract and describe a simple MCMC scheme f...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Modelling count data continues to be a important area in the practice of statistics. Often covariate information is available which may prove useful in explaining the counts, for example, time of day influencing the number calls to a particular call centre or amount borrowed influencing the number of loan defaults by a consumer. Poisson regression is a standard framework for modelling covariate-dependent count data allowing the mean response to depend on the covariates through a log-linear link function. In practice, Poisson regression fails to capture the phenomenon of dispersion, where the mean and variance of the response differ significantly. Dispersion measures the spread of a random variable and is quantified through the ratio of the variance of a random variable to its mean. Count data often exhibits underdispersion (variance less than the mean), overdispersion (variance greater than the mean) or equidispersion (equality of mean and variance) but Poisson regression is only capable of modelling equidispersed data. Many alternative approaches have been suggested to capture dispersion, such as using quasilikelihood or the popular negative binomial regression

(Hilbe, 2011) which allows modelling of overdispersed data only. The COM-Poisson distribution (Conway and Maxwell, 1962) was first introduced to model queuing systems and was then revived as a statistical model for dispersed count data by Shmueli et al. (2005). This revival lead to the introduction of COM-Poisson regression models by Guikema and Goffelt (2008) as a flexible regression model for dispersed count data with covariates. However, this flexibility comes with the caveat that the COM-Poisson likelihood function contains an intractable normalising term which requires the need for non-standard methods to conduct parameter inference. This intractability has restricted the popularity of COM-Poisson regression as a viable solution to modelling dispersion. Recent work by Chanialidis et al. (2017) has seen a Bayesian analysis of COM-Poisson regression under the framework of doubly-intractable Bayesian inference, where the authors overcome the intractability through the use of a rejection sampling algorithm for the COM-Poisson distribution. This rejection sampler is then used within the exchange algorithm of Murray et al. (2006) to perform Bayesian parameter inference.

Our novel contribution in this paper is to provide a more efficient approach to that of Chanialidis et al. (2017) by designing a simpler rejection sampler for the COM-Poisson distribution which is computationally less intensive to sample from and leads to faster inference within COM-Poisson regression models. In one example in this paper, we observe at a greater than threefold reduction in CPU time by using our rejection sampler. The rejection sampler of Chanialidis et al. (2017) is derived from a discrete version of adaptive rejection sampling (Gilks and Wild, 1992)

and requires the costly construction of a piecewise truncated geometric enveloping distribution to tightly encapsulate the COM-Poisson probability density function. Our rejection sampler requires only a single envelope distribution depending on the dispersion parameter of the COM-Poisson distribution. This single envelope distribution does come with a higher rejection rate but completely bypasses the costly setup and sampling cost involved in piecewise truncated geometric envelope distributions. A further and important novelty of our work is to show that when a rejection sampling algorithm is available to sample from an intractable likelihood function of interest, then it is possible to construct an unbiased estimate of this intractable likelihood function using the rejection sampler. Our idea works by exploiting a link between rejection sampling efficiency and its relationship with the reciprocal intractable normalising constant of the likelihood function.

The remainder of this paper is organised as follows: Section 2 reviews the COM-Poisson distribution and COM-Poisson regression models under the framework of doubly-intractable Bayesian inference. Section 3 reviews rejection sampling for intractable likelihoods and our new COM-Poisson rejection sampler is introduced in Section 3.1. Section 3.2 describes the novel method of constructing unbiased estimators of intractable likelihood functions through rejection sampling. Section 4 shows how pseudo-marginal MCMC can be performed using this unbiased intractable likelihood estimator. Section 5 shows results of these methods applied firstly to a dataset with no covariates in Section 5.1 and then to a real-world COM-Poisson regression dataset of company takeover bids in Section 5.2. Finally, Section 6 gives discussion and some thoughts for future work.

2 Bayesian doubly-intractable inference for GLMs

The COM-Poisson distribution (Conway and Maxwell, 1962)

is a two-parameter discrete probability distribution that allows the flexibility to model count data with over-, under- and equi-dispersion. The probability mass function for a COM-Poisson random variable

with parameters and is defined over the non-negative integers as

The unnormalised component of the mass function is denoted and is an intractable normalising constant, having no closed form representation for this model. The mode of the COM-Poisson distribution is , with two consecutive modes at and in the case where

is an integer. The moments of the COM-Poisson distribution are unavailable directly due to the intractable normalising constant. The moments can however be approximated through the use of an asymptotic representation of

derived by Shmueli et al. (2005). The approximations for the mean and variance using this asymptotic representation are, respectively,

(1)

These approximations are quite accurate for a wide range of and , except for small or small . The purpose of the parameter is to control dispersion through this parameter’s inverse relationship with the variance as seen in (1). When the distribution exhibits equidispersion and the probability mass function reduces to that of a Poisson random variable with . Alternatively setting gives overdispersion and setting gives underdispersion.

The main application of the COM-Poisson distribution is to GLMs, referred to as COM-Poisson regression models, where responses are assumed to follow a COM-Poisson distribution with the parameters and of each response being conditional on available covariates for . Due to this conditioning we now index the parameters for observation as and . COM-Poisson regression modelling was first introduced by Guikema and Goffelt (2008) and Sellers and Shmueli (2010) in an effort to extend Poisson regression by allowing covariates to influence not only the location parameter but also the dispersion parameter . For notational simplicity we now let be the parameter for observation with derived from setting where is a link function of the covariates and are the parameters of this link function. The major difficulty is that the likelihood of the COM-Poisson regression model involves multiple intractable normalising constants,

(2)

The particular link function for the COM-Poisson was specified by Guikema and Goffelt (2008) as a dual-link function to allow the available covariates to enter both the and parameter. In this dual-link function scenario the parameter

is split into two sub-vectors as

, where are the coefficients for the link component and are the coefficients for the link component. The dual-link function is then

where some are set fixed to 0 when the covariate is not included in , similarly for .

Bayesian parameter inference for in COM-Poisson regression models is a doubly-intractable problem as we now demonstrate. Placing a prior on , the posterior distribution for is formed with this prior and the complete likelihood (2) as

(3)

This posterior is termed doubly-intractable since the likelihood is intractable, containing intractable normalising constants, and the posterior itself cannot be normalised. Without the ability to evaluate pointwise, the direct application of standard MCMC methods is infeasible. For example, a Metropolis-Hastings algorithm proposing a move in the link function from to using a proposal distribution requires evaluation of the intractable ratios to compute the acceptance ratio

(4)

Possible methods to compute this acceptance ratio include replacing the intractable ratios with approximations using the asymptotic approximation from Shmueli et al. (2007) or by a truncated sum but both of these methods introduce some bias into the acceptance ratio.

An ingenious solution to the issue of computing the acceptance ratio (4) was proposed by Murray et al. (2006) drawing on earlier work by Møller et al. (2006). The solution, known as the exchange algorithm, augments the doubly-intractable posterior (3) with auxiliary data, provided that it is possible to sample data exactly from the intractable likelihood. The original exchange algorithm overcomes a single intractable ratio i.e. however the algorithm can easily be extended to the case of intractable ratios as we now show. Starting from the posterior (3) which contains the product of independent non-identical intractable likelihood functions, the exchange algorithm proposes, as in standard Metropolis-Hastings, to update the parameter from the current state to proposed state using but in addition the posterior is augmented with auxiliary draws generated from the likelihood at the proposed parameters . The augmented posterior is written

(5)

and the marginal of this augmented posterior for is the target posterior of interest (3

). The algorithm can be seen as a Markov chain over an augmented parameter

. The acceptance ratio for the augmented posterior (5) is now tractable due to the cancellation of all intractable normalising constants

(6)

The clever cancellation of the normalising constants in (6) is due to the exchange of parameters () associated with the observed data () and the auxiliary data (), the auxiliary being data discarded after each move. The major requirement in the exchange algorithm is the ability to generate the auxiliary data, requiring exact draws from the likelihood at each proposed parameter. Perfect sampling (Propp and Wilson, 1996) from the likelihood is one possible method to do this but can be prohibitively expensive for many models. In certain models sampling from an intractable likelihood may even be as difficult as computing the intractable normalising constant itself and instead one could resort to sampling using an approximate likelihood sampler such as in the case of the exponential random graph model (Caimo and Friel, 2011). The COM-Poisson distribution was shown by Chanialidis et al. (2017) to be amenable to rejection sampling, allowing the auxiliary data to be generated. In Section 3.1 we present our faster rejection sampler having a cheaper sampling cost than the sampler of Chanialidis et al. (2017). Rejection sampling is not applicable to all intractable likelihoods but when rejection sampling is available we show in Section 3.2 that the intractable likelihood can be estimated without bias through an unbiased estimate of the reciprocal normalising constant. This unbiased intractable likelihood estimator is also guaranteed to be positive which differs from other estimates using Russian roulette stochastic truncation (Lyne et al., 2015) or truncated Markov chains (Wei and Murray, 2016).

3 Rejection sampling to an intractable likelihood estimator

Rejection sampling (von Neumann, 1951) is a sampling scheme to generate statistically independent samples from a target probability distribution of interest. We denote the target density (likelihood) as

where in this section we drop the observation index for ease of notation. This target density may be tractable or intractable but is assumed in any case to be difficult to sample from. The rejection sampling method works by constructing an envelope distribution over the target density, proposing samples from this envelope distribution and choosing to accept a portion of these samples. The envelope distribution should be chosen so that it is both computationally efficient to sample from and matches closely in shape to the target density. The portion of samples that will be accepted depends on the similarity of the envelope and the target distribution. The closer the envelope distribution matches to the target distribution, the higher the number of samples that will be accepted. To generate a single draw from using rejection sampling, we consider an envelope distribution where is its parameter. We assume that the envelope density can be written in a similar form to as

The envelope distribution’s normalising constant may be either tractable or intractable. A necessary condition for is that it dominates the support of i.e. . It is also necessary to bound using and a positive finite bounding constant that satisfies the envelope inequality . Finding such a constant can be a difficult task even for tractable densities. The optimal value of is denoted as . In the case of intractable likelihoods, this optimal constant is impossible to evaluate since at least one of either or is unknown leaving intractable. In the case where both and are tractable distributions and can be found, the vanilla rejection sampling algorithm to obtain a single draw from proceeds as in Algorithm 1.

Input: Target distribution , envelope distribution and
1 Sample a proposal from
2 Accept as a draw from by a Bernoulli trial with acceptance probability
(7)
otherwise repeat.
Algorithm 1 Rejection sampling

Algorithm 1 introduces the rejection sampling Bernoulli trial with the outcome conditional on the proposed through the acceptance probability . This Bernoulli trial is at the core of the rejection sampling method as it decides for a given what proportion of samples from to accept. Once intractability is introduced into this Bernoulli trial through , it would appear that becomes intractable but this is fortunately not the case. The intractable bounding constant can be decomposed into tractable and intractable components by extracting the reciprocal normalising constants from as follows

(8)

and this introduces the tractable bounding constant . can be viewed as the upper bound on the ratio of the unnormalised densities and , whereas is the upper bound on the associated normalised densities and . The advantage of rejection sampling algorithms is that they can be run without full knowledge of , it is sufficient to know which is usually more readily available. Substituting the expression into the acceptance probability (7) gives the tractable acceptance probability for the conditional Bernoull trial as

It is important to note that is also not required to be tractable, although it typically is for simple envelope distributions. A case where is intractable would be when the envelope distribution itself requires sampling by another rejection sampler and this opens the possibility of chaining many rejection samplers together to sample from evermore complex target distributions. The main purpose of rejection sampling in this paper is to provide a means to generate the auxiliary draws required within the exchange algorithm acceptance ratio (6) allowing the doubly-intractable inference to be performed but Section 3.2 will show another novel usage for rejection sampling. Before discussing this, we demonstrate our COM-Poisson rejection sampler in the next section.

3.1 COM-Poisson rejection sampler

For the COM-Poisson distribution, our sampler uses a choice of two envelope distributions dependent on the value of . This choice is motivated by Figure 1, where a Poisson distribution is used in the case of

and a geometric distribution in the case of

. Theorem 3.1 demonstrates and proves why this works. The COM-Poisson rejection sampling algorithm is given in pseudocode after this theorem.

Theorem 3.1 (COM-Poisson intractable rejection sampler).

Suppose that , a Poisson random variable with parameter and , a geometric random variable with parameter , for some , are used as two enveloping distributions as follows,

(9)
together with the following tractable enveloping bounds
(10)

then a sample from the COM-Poisson distribution can be drawn at any parameter value ().

Proof.

See Appendix A. ∎

Figure 1: Enveloping the COM-Poisson distribution for rejection sampling: In the underdispersed () case (left) a Poisson distribution, having a higher relative dispersion is scaled to envelope the COM-Poisson distribution. In the overdispersed () case (right) a geometric distribution, having a higher relative dispersion is used. The parameter of the geometric distribution in the overdispersed case controls the efficiency of the enveloping. To choose we match the first moment of the geometric distribution with the COM-Poisson distribution (11), shown as the black curve (right). The blue curve corresponds to an alternative choice of , showing how an inefficient choice of results in a geometric envelope with a high rejection region (area between blue and green curves).

Before we present the practical implementation of the COM-Poisson sampling algorithm developed from Theorem 3.1, it is necessary to provide some brief discussion on the choice of in the geometric envelope. The optimal parameter, say, would be one for which the acceptance rate of the rejection sampling is maximised. However it is not possible to compute since the acceptance rate, as we will show in Section 3.2, involves intractable normalising constants. Nevertheless any choice of will result in a valid rejection sampler and clearly the closer is to , the more efficient the sampler is. We have chosen to select by matching the first moment of the geometric to the approximate expected value of the COM-Poisson distribution using (1). Setting the first moment of the geometric distribution equal to the expected value approximation (1) implies,

(11)

Algorithm 2 presents pseudocode of the COM-Poisson sampler for coding on a computer. To analyse the efficiency of the COM-Poisson sampler we conducted an experiment in which the sampler was run for different values of and with results presented in Appendix C. We show that the sampler can achieve between and acceptance for and and acceptance for . A comparison of Algorithm 2 with the sampler from Chanialidis et al. (2017) is deferred to Section 5.3 where we demonstrate an example of using both algorithms. Another benefit of rejection sampling algorithms is that they can be implemented in parallel thus speeding up the computation, however care is needed when implementing parallel random number generators with potential issues discussed in Mascagni and Srinivasan (2000). Algorithm 2 gives the pseudocode for the COM-Poisson rejection sampler.

Input: Parameter
1 START
2 if  then
3       Sample using the algorithm from Ahrens and Dieter (1982).
4       Calculate using (10) and set .
5       Generate .
6       if  then
7            return
8      else
9            GOTO START
10       end if
11      
12 end if
13if  then
14       Compute .
15       Sample by first sampling and returning .
16       Calculate using (10) and set .
17       Generate .
18       if  then
19            return
20      else
21            GOTO START
22       end if
23      
24 end if
Note: For multiple draws from a COM-Poisson distribution, calculate the bound or once and then use it as an input to the algorithm.
Algorithm 2 Sampler for the COM-Poisson distribution

3.2 Unbiased likelihood estimation by monitoring rejection sampling efficiency

Note: In this section we reintroduce the index parameter to discuss rejection sampling at different parameters , each necessarily introduces indexed tractable and intractable sampling bounds and envelope parameters which we denote as , and respectively.

Literature on rejection sampling has explored some methods for recycling the rejected proposals from the envelope distribution. Casella and Robert (1996) consider reducing the variance of rejection sampling estimates by incorporating the rejected samples in a post-processing step using the Rao-Blackwell theorem. Rao et al. (2016)

uses the rejected proposals from a rejection sampler in a different way by considering the joint distribution of rejected and accepted draws leading to a method to perform doubly-intractable inference in the case where the likelihood has an associated rejection sampler. Our approach considers yet another use for the rejected proposals by examining the relationship of rejected proposals with the efficiency of the rejection sampler. The efficiency is used to provide an unbiased estimator of the reciprocal normalising constant

for each parameter . An unbiased estimator of the reciprocal normalising constant at each will lead directly to an unbiased estimator of the complete likelihood (2).

The efficiency of the rejection sampler is critical to the overall performance of the exchange algorithm. The efficiency is defined as the total number of draws required from the envelope distribution until the first acceptance of a draw from or in general the total number of draws required until acceptances from . The number of draws determines how closely the envelope distribution matches to the target distribution. The ideal envelope distribution would exactly match the target distribution and reject no proposals, however in this scenario we could simply circumvent rejection sampling and sample from the envelope distribution directly. In Algorithm 1, run at parameter , each of the samples proposed from the envelope is followed by a Bernoulli trial with conditional acceptance probability . To remove the conditional dependence on , consider the unconditional acceptance probability over all envelope proposals by integrating with respect to the envelope,

It is now obvious that the unconditional acceptance probability and from this the rejection sampler efficiency are controlled directly by the magnitude of the intractable bound . On average, one expects to accept a portion of proposals from as being from and to discard the remainder. Equivalently, the total number of draws until the first accepted draw follows a geometric distribution with success probability and mean i.e. . In general the total number of draws () until

acceptances follows a negative binomial distribution with mean

. A natural method to estimate is to run rejection sampling at the parameter and record the observed total number of draws required for acceptances. Then an unbiased estimate of based on acceptances is given as

By the central limit theorem increasing

will lead to lower variance estimates of since can be viewed as the average of independent geometric trials each having variance . To use to give an unbiased estimate of the likelihood, we have the final assumption that is known i.e. the envelope distribution is tractable. Then by rewriting (8) a link emerges between the inverse normalising constants of and as

(12)

By replacing with its unbiased estimate and multiplying both sides by an unbiased estimator of the likelihood for observation is given as

(13)

with all terms on the RHS of (13) known. The estimate of the complete likelihood follows as

(14)

by estimating at each .

This unbiased estimator of the intractable likelihood is very useful especially for model selection as it can be used to construct an estimate of the Bayesian information criterion (BIC) (Schwarz, 1978). Bayesian model choice is a challenging problem for models with intractable likelihood functions. The intractability prevents calculation of the likelihood which is required for most model selection criteria. To our knowledge rejection sampling efficiency has not previously been used to construct a unbiased estimate of an intractable likelihood. The BIC estimate from the unbiased likelihood estimator (14) is

(15)

where maximises the unbiased estimate. The variance of this BIC estimate depends on and we find in experiments that needs to be high () to achieve an accurate estimate of the BIC.

It may also be possible to use the unbiased likelihood estimator (13

) as part of other Bayesian model selection frameworks, for example, in the calculation of the marginal likelihood (or evidence) and Bayes factors. This could be a focus of future work. In addition to BIC estimation we consider plugging the unbiased estimator directly into the intractable acceptance ratio (

4) which is the basis of pseudo-marginal MCMC algorithms discussed in the next section.

Finally, we note that Chanialidis et al. (2017) presented results comparing models based on the deviance information criterion (DIC) (Spiegelhalter et al., 2002), without explaining how they overcame the need to evaluate the intractable likelihood for each draw from the MCMC sample. Following personal communication with the authors it turns out that they approximated the normalising constant, , using a truncated summation involving terms, for positive integers, , , where , as follows,

(16)

Here and are selected to satisfy the inequality , where, as before, is the mode of the COM-Poisson distribution. In turn, they approximated the likelihood for each draw from the sample generated from their MCMC algorithm by

(17)

which is then used, for example, to estimate the posterior expected (deviance) log-likelihood which is required for calculation of the DIC. Of course, this introduces an approximation of the likelihood into the estimation of the DIC. This raises questions as to how many terms are needed in order to get an accurate approximation of the likelihood as well as how to choose both and and finally raises the issue of the computational cost involved in evaluating the finite truncation for the terms in the product in (17).

4 Pseudo-marginal MCMC with the intractable likelihood estimator

MCMC methods for Bayesian inference require the ability the evaluate the unnormalised posterior distribution. For doubly-intractable problems where the posterior distribution cannot be evaluated as such, their exists pseudo-marginal MCMC algorithms (Andrieu and Roberts, 2009) which require only an unbiased positive estimate of the posterior distribution. Pseudo-marginal algorithms have been applied to a range of problems such as genetic modelling (Beaumont, 2003) and Markov jump processes (Georgoulas et al., 2017). This pseudo-marginal approach can be seen as an alternative to the exchange algorithm which uses exact sampling in place of unbiased posterior estimators.

We now describe the pseudo-marginal framework and show how our unbiased estimator of the likelihood (14) can be used. Consider a target distribution which cannot be pointwise evaluated, but assume that it is possible to construct an unbiased estimate of the target through the use of auxiliary variables , . Using in place of the true target in a Metropolis-Hastings algorithm leads to the approximate acceptance ratio

Pseudo-marginal MCMC algorithms can be viewed as Metropolis-Hastings algorithms over the joint parameter and auxiliary variable state space . Andrieu and Roberts (2009) describe two alternative forms: the “Monte Carlo within Metropolis” or MCWM algorithm and the “grouped independence Metropolis-Hastings” or GIMH algorithm, with the only latter algorithm having the correct target distribution . In MCWM, both the current and proposed estimate of the target are refreshed every iteration using new auxiliary draws and . In GIMH, only the proposed estimate is refreshed and is fixed to the estimate used when was last accepted.

The difficulty with pseudo-marginal algorithms is in constructing an unbiased estimate of the target. Using the rejection sampling unbiased estimator (14) the pseudo-marginal acceptance ratio for the to move in the COM-Poisson model is

(18)

This acceptance ratio is entirely tractable once the estimates have been computed by running rejection sampling at each for the GIMH algorithm or at both and for the MCWM algorithm.

5 Results

We present two examples of COM-Poisson modelling. The first example is a retail inventory dataset from Shmueli et al. (2005) which contains no covariate information. This first example is a toy example that demonstrates how the BIC can be approximated using the unbiased likelihood estimate. The second example is a takeover bids dataset which contains covariates which we model using a COM-Poisson regression model. The second example will firstly the reduction in computation time available using our sampler and also the ability to choose different regression models using the unbiased likelihood estimate.

5.1 Inventory Data

Figure 2: Inventory data: Sales of a particular item of clothing from a well-known clothing brand. The sample mean sales is 3.56 with sample variance 11.31.

Inventory data or stock count data is of high importance to retailers as it facilitates to them to make decisions regarding future stock levels. The counts we observe in this data are the quarterly sales counts of particular items of clothing in each store across retail stores. A frequency barplot of the data is shown in Figure 2. The maximum quarterly sales of the clothing item sold was 30 in one store. No stores sold between 22 and 29 items and many stores had 0 sales of the clothing item. The sample mean sales is 3.56, sample variance 11.31 and the sample dispersion index is 3.18, thus overdispersed. This data was analysed by Shmueli et al. (2005) using a maximum likelihood approach.

Since this data contains no extra covariate information, we model and directly rather than considering them as functions of covariates. This could be viewed as using the link function with only intercept where and but we consider it easier to model and . We place Gamma(1, 1) and Gamma(0.0625, 0.25) priors on and , respectively. We run the exchange algorithm for 200,000 iterations and discard the first 5,000 as a burn-in. Each parameter and were updated using a single-site update. The results are shown in Table 1 with the target acceptance rate of for single-site updates achieved.

Posterior Mean Posterior SD Accept Rate
0.8243 0.1444 44.39%
0.1286 0.0119 41.68%
Table 1: Inventory data posterior estimates: Posterior estimates from the exchange algorithm. The posterior estimate for shows overdispersion in data.
Figure 3: Inventory data posterior visualisation: The distribution of MCMC points for is shown along with posterior mean (left). The estimated autocorrelation function for (top right) and (bottom right). There is high correlation between and in this example.

We now compare the COM-Poisson distribution to the Poisson distribution using with . The BIC for the Poisson distribution can be calculated exactly since the likelihood is tractable. The comparison of both models is shown in Table 2 where the COM-Poisson is shown to outperform the Poisson under BIC.

Poisson COM-Poisson
 BIC 17927.68 15067.39
Table 2: Inventory data: The COM-Poisson model significantly outperforms the Poisson model using the BIC. The estimate BIC for the COM-Poisson was calculated using with .

5.2 Takeover bids COM-Poisson regression model

We consider the dataset from Jaggia and Thosar (1993)

on the number of bids received by U.S. firms that were targets of bid offers during the period 1978-85 and taken over within 1 year of the initial offer. The response variable is the count of takeover bids excluding the initial bid (

NUMBIDS) filed for the particular company. The covariates available include variables which capture the firm-specific characteristics and defensive action taken by the firm in response to their initial takeover bid. A detailed list of available predictors with is given in Appendix B. This data has previously been analysed using Poisson regression by Jaggia and Thosar (1993) and series expansions by Cameron and Johansson (1997). We reconsider this analysis by comparing two Poisson regression models to three COM-Poisson regression models. The three models considered in our analysis are listed in Table 3. We performed a prior variable selection by fitting a Poisson GLM by maximum likelihood to the data to establish which predictors to consider in the Bayesian GLM analysis. The first model and simplest considers NUMBIDS in a Poisson GLM where the linear predictor of contains an intercept and two predictors (BIDPREM and WHTKNGHT). The second model adds an extra linear predictor, SIZE, to Model 1. For COM-Poisson regression we consider 3 models, Model 3 moves the covariate SIZE into the link function. Model 4 considers dropping BIDPREM from the link function and finally Model 5 considers adding another covariate FINREST into the link function. For simplicity we use to denote coefficients within the link function and to denote coefficients within the link function.

Priors for and were set as . The exchange algorithm was run for 100,000 observations with a burnin of 10,000 draws discarded. Each coefficient was updated using a single-site update with a random walk proposal. The proposal variance of the random walk was tuned to achieve an acceptance rate of 44% for each site. The posterior results for all models are presented in Table 4. The BIC was calculated using (15) with set to 5000 to give an accurate estimate of the BIC. Model 5 was selected as the best model with the lowest BIC of 386.40. All 3 COM-Poisson regression models are selected as better than the Poisson models, Model 1 and Model 2 suggesting that including covariates in the dispersion link function is worthwhile.

Model Response () Linear Predictor(s)
 1 (Poisson) NUMBIDS 3
 2 (Poisson) NUMBIDS 4
 3 (COM-Poisson) NUMBIDS 5
 4 (COM-Poisson) NUMBIDS 4
 5 (COM-Poisson) NUMBIDS 6
Table 3: Takeover bids candidate models: The column gives the number of parameters estimated which will be used directly in the calculation of the BIC. Descriptions of all covariates are given in Appendix B.
Parameter Model 1 Model 2 Model 3 Model 4 Model 5
Mean (SD) Mean (SD) Mean (SD) Mean (SD) Mean (SD)
-1.130 (0.505) -1.063 (0.532) -1.077 (0.384) -0.329 (0.100) -0.354 (0.091)
-0.728 (0.368) -0.713 (0.382) -0.553 (0.281) - -
-0.583 (0.152) -0.576 (0.152) -0.458 (0.110) -0.463 (0.111) -0.431 (0.103)
- -0.035 (0.017) - - -
- - -0.674 (0.175) -0.646 (0.175) -0.789 (0.179)
- - -0.171 (0.051) -0.174 (0.052) -0.176 (0.049)
- - - - -0.952 (0.448)
Rank 4 5 (Worst) 2 3 1 (Best)
Table 4: Takeover bids: Posterior parameter estimates are shown for all 5 models. The lowest BIC is shown in bold. The COM-Poisson regression models provide the best fit to the data according to BIC with all COM-Poisson models having a BIC lower than the Poisson models.

5.3 Comparing sampler efficiency

To highlight the efficiency of our COM-Poisson rejection sampler we also fit each COM-Poisson regression model in Table 3 using the rejection sampler of Chanialidis et al. (2017). Their sampler involves constructing a tight envelope distribution around the COM-Poisson distribution using a piecewise envelope distribution constructed from 4 truncated geometric distributions. This piecewise geometric is then sampled from by using the inversion method for truncated geometric distributions. These truncated geometric distribution are designed to match closely to the COM-Poisson density function, however the computational cost of constructing this envelope is significant and cannot be disregarded in the CPU time. Our COM-Poisson sampler described in Algorithm 2 is much simpler and requires only a single envelope component depending on the value of . Although the single envelope results in a moderately higher rejection rate compared to Chanialidis et al. (2017), we show in Table LABEL:table:samplercompare that the higher rejection is offset, in CPU time, by the simplicity of constructing and sampling from the single envelope distribution. This outlines a consideration of rejection sampling in that the choice of using an efficient envelope must be considered simultaneously with the computational cost of sampling from and constructing this efficient envelope. To show the reduction in computational time required for our sampler we examine the MCMC draws per second in the COM-Poisson models (Models 3, 4 and 5) in Table 3. Since both our algorithm and the algorithm of Chanialidis et al. (2017) are exact algorithms, the difference in the number of MCMC draws per second is purely computational and not related to the mixing time of the MCMC chain. Our sampler provides a threefold increase in the number of MCMC draws per second in each of the COM-Poisson models.

Model 3 Model 4 Model 5
 MCMC draws per sec. (Algorithm 2) 9973 12534 9297
 MCMC draws per sec. (Chanialidis et al. (2017)) 2866 3585 2828
Relative efficiency 3.480 3.496 3.287
Table 5: Algorithm efficiency comparison: This table shows the number of MCMC draws per second for our rejection sampler (Algorithm 2) compared with the rejection sampler of Chanialidis et al. (2017). Our algorithm is at least 3 times faster for all 3 models.

5.4 Pseudo-marginal MCMC results

We choose Model 5 in Table 3 to implement the GIMH pseudo-marginal MCMC algorithm using the unbiased likelihood estimator (14) in the acceptance ratio. The unbiased likelihood estimator was calculated for 5 scenarios with . The posterior density estimates for each of the pseudo-marginal MCMC runs are shown in Table 6 along with the CPU time and multivariate effective sample size (mESS) (Vats et al., 2015). The mESS is a generalisation of the effective sample size (ESS), which captures the cross-autocorrelation between parameters in the MCMC output.

where is the number of MCMC draws, is the sample autocovariance matrix of the MCMC output and is the true covariance of the posterior which can be estimated by a batch-means method described in Vats et al. (2015). Figure 4 compares the pseudo-marginal () posterior MCMC chain, density and autocorrelation with the exchange algorithm. There is close agreement between the two methods, however the CPU time for the pseudo-marginal algorithm is dramatically increased.

Exchange Pseudo-marginal MCMC - GIMH
Mean () Mean () Mean () Mean () Mean () Mean ()
-0.354 (0.09) -0.337 (0.08) -0.361 (0.09) -0.354 (0.09) -0.354 (0.09) -0.356 (0.09)
-0.431 (0.10) -0.446 (0.10) -0.424 (0.10) -0.432 (0.10) -0.433 (0.10) -0.431 (0.11)
-0.789 (0.18) -0.785 (0.14) -0.796 (0.17) -0.790 (0.17) -0.794 (0.18) -0.793 (0.18)
-0.176 (0.05) -0.176 (0.05) -0.172 (0.05) -0.178 (0.05) -0.175 (0.05) -0.175 (0.05)
-0.952 (0.45) -0.817 (0.50) -0.981 (0.44) -0.944 (0.45) -0.962 (0.45) -0.955 (0.45)
CPU(s) 10.617 46.257 188.050 355.001 1668.14 3512.57
mESS 4962 345 904 2701 8820 10,922
Table 6: Pseudo-marginal MCMC results: The results for the pseudo-marginal MCMC algorithm run with the unbiased likelihood estimator.
Figure 4: Comparison of the pseudo-marginal MCMC algorithm and the exchange algorithm. Both algorithms show agreement in posterior density estimates. Pseudo-marginal MCMC shows a reduction in autocorrelation compared to the exchange algorithm for all parameters however this is at the expense of increased computational time shown in Table 6.

6 Discussion

This paper provides a new rejection sampler to sample from the COM-Poisson distribution. This rejection sampler allows for faster parameter inference to be performed in COM-Poisson regression models compared with the sampler of Chanialidis et al. (2017). Our rejection sampling algorithm shows that one must consider both the rejection rate of the enveloping distribution and the sampling time from the enveloping distribution in order to construct an efficient rejection sampler. We have also shown how the number of rejected proposals within rejection sampling can be used to construct an unbiased intractable likelihood estimator. This estimator can be used to perform model selection using BIC but future work should included this estimator within other Bayesian model choice strategies. Model selection for doubly-intractable problems such as COM-Poisson regression is a difficult problem and our approach offers one solution to perform model selection. The unbiased likelihood estimator was shown to work well when used within a pseudo-marginal MCMC algorithm (Andrieu and Roberts, 2009) as it provided an unbiased estimate of the acceptance ratio. This unbiased likelihood estimator could be constructed for other intractable models where a rejection sampling algorithm exists to sample from the likelihood, presenting opportunities for future research.

References

  • Ahrens and Dieter (1982) Ahrens, J. H. and U. Dieter (1982).

    Computer generation of Poisson deviates from modified Normal distributions.

    ACM Transactions on Mathematical Software 8(2), 163–179.
  • Andrieu and Roberts (2009) Andrieu, C. and G. O. Roberts (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics 37(2), 697–725.
  • Beaumont (2003) Beaumont, M. A. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics 164(3), 1139–1160.
  • Caimo and Friel (2011) Caimo, A. and N. Friel (2011). Bayesian inference for exponential random graph models. Social Networks 33(1), 41 – 55.
  • Cameron and Johansson (1997) Cameron, A. C. and P. Johansson (1997). Count data regression using series expansions: With applications. Journal of Applied Econometrics 12(3), 203–223.
  • Casella and Robert (1996) Casella, G. and C. P. Robert (1996). Rao-blackwellisation of sampling schemes. Biometrika 83(1), 81–94.
  • Chanialidis et al. (2017) Chanialidis, C., L. Evers, T. Neocleous, and A. Nobile (2017). Efficient Bayesian inference for COM-Poisson regression models. Statistics and Computing, (to appear). https://doi.org/10.1007/s11222-017-9750-x.
  • Conway and Maxwell (1962) Conway, R. W. and W. L. Maxwell (1962). A queuing model with state dependent service rates. Journal of Industrial Engineering 12(2), 132–136.
  • Georgoulas et al. (2017) Georgoulas, A., J. Hillston, and G. Sanguinetti (2017). Unbiased Bayesian inference for population Markov jump processes via random truncations. Statistics and Computing 27(4), 991–1002.
  • Gilks and Wild (1992) Gilks, W. R. and P. Wild (1992). Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics) 41(2), 337–348.
  • Guikema and Goffelt (2008) Guikema, S. D. and J. P. Goffelt (2008). A flexible count data regression model for risk analysis. Risk Analysis 28(1), 213–223.
  • Hilbe (2011) Hilbe, J. M. (2011). Negative binomial regression. Cambridge University Press.
  • Jaggia and Thosar (1993) Jaggia, S. and S. Thosar (1993). Multiple bids as a consequence of target management resistance: A count data approach. Review of Quantitative Finance and Accounting 3(4), 447–457.
  • Lyne et al. (2015) Lyne, A.-M., M. Girolami, Y. Atchadé, H. Strathmann, and D. Simpson (2015, 11). On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science 30(4), 443–467.
  • Mascagni and Srinivasan (2000) Mascagni, M. and A. Srinivasan (2000). Algorithm 806: SPRNG: A scalable library for pseudorandom number generation. ACM Transactions on Mathematical Software 26(3), 436–461.
  • Murray et al. (2006) Murray, I., Z. Ghahramani, and D. J. C. MacKay (2006). MCMC for doubly-intractable distributions. In

    Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06)

    , pp. 359–366. AUAI Press.
  • Møller et al. (2006) Møller, J., A. N. Pettitt, R. Reeves, and K. K. Berthelsen (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93(2), 451–458.
  • Propp and Wilson (1996) Propp, J. G. and D. B. Wilson (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms 9(1&2), 223–252.
  • Rao et al. (2016) Rao, V., L. Lin, and D. B. Dunson (2016). Data augmentation for models based on rejection sampling. Biometrika 103(2), 319–335.
  • Schwarz (1978) Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.
  • Sellers and Shmueli (2010) Sellers, K. F. and G. Shmueli (2010). A flexible regression model for count data. The Annals of Applied Statistics 4(2), 943–961.
  • Shmueli et al. (2005) Shmueli, G., T. P. Minka, J. B. Kadane, S. Borle, and P. Boatwright (2005). A useful distribution for fitting discrete data: Revival of the Conway-Maxwell-Poisson distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics) 54(1), 127–142.
  • Shmueli et al. (2007) Shmueli, G., R. P. Russo, and W. Jank (2007). The BARISTA: A model for bid arrivals in online auctions. The Annals of Applied Statistics 1(2), 412–441.
  • Spiegelhalter et al. (2002) Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. Van Der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(4), 583–639.
  • Vats et al. (2015) Vats, D., J. M. Flegal, and G. L. Jones (2015). Multivariate Output Analysis for Markov chain Monte Carlo. ArXiv e-prints. https://arxiv.org/abs/1512.07713.
  • von Neumann (1951) von Neumann, J. (1951). Various Techniques Used in Connection with Random Digits. National Bureau of Standards Applied Mathematics Series 12, 36–38.
  • Wei and Murray (2016) Wei, C. and I. Murray (2016). Markov Chain Truncation for Doubly-Intractable Inference. ArXiv e-prints. https://arxiv.org/abs/1610.05672.

Appendices

Appendix A COM-Poisson rejection sampler Proof

Proof.

There are two cases to consider depending on the value of .
(In this proof we use the Iverson bracket , where if the enclosed statement is true and 0 if the statement is false.)

Case 1: When

The enveloping bound as illustrated in Figure 1 (left) is the ratio of the COM-Poisson mass function to a Poisson mass function

To find the supremum, assume that the supremum occurs at the point and by the unimodality of the COM-Poisson distribution, it will satisfy