Modelling count data continues to be a important area in the practice of statistics. Often covariate information is available which may prove useful in explaining the counts, for example, time of day influencing the number calls to a particular call centre or amount borrowed influencing the number of loan defaults by a consumer. Poisson regression is a standard framework for modelling covariate-dependent count data allowing the mean response to depend on the covariates through a log-linear link function. In practice, Poisson regression fails to capture the phenomenon of dispersion, where the mean and variance of the response differ significantly. Dispersion measures the spread of a random variable and is quantified through the ratio of the variance of a random variable to its mean. Count data often exhibits underdispersion (variance less than the mean), overdispersion (variance greater than the mean) or equidispersion (equality of mean and variance) but Poisson regression is only capable of modelling equidispersed data. Many alternative approaches have been suggested to capture dispersion, such as using quasilikelihood or the popular negative binomial regression(Hilbe, 2011) which allows modelling of overdispersed data only. The COM-Poisson distribution (Conway and Maxwell, 1962) was first introduced to model queuing systems and was then revived as a statistical model for dispersed count data by Shmueli et al. (2005). This revival lead to the introduction of COM-Poisson regression models by Guikema and Goffelt (2008) as a flexible regression model for dispersed count data with covariates. However, this flexibility comes with the caveat that the COM-Poisson likelihood function contains an intractable normalising term which requires the need for non-standard methods to conduct parameter inference. This intractability has restricted the popularity of COM-Poisson regression as a viable solution to modelling dispersion. Recent work by Chanialidis et al. (2017) has seen a Bayesian analysis of COM-Poisson regression under the framework of doubly-intractable Bayesian inference, where the authors overcome the intractability through the use of a rejection sampling algorithm for the COM-Poisson distribution. This rejection sampler is then used within the exchange algorithm of Murray et al. (2006) to perform Bayesian parameter inference.
Our novel contribution in this paper is to provide a more efficient approach to that of Chanialidis et al. (2017) by designing a simpler rejection sampler for the COM-Poisson distribution which is computationally less intensive to sample from and leads to faster inference within COM-Poisson regression models. In one example in this paper, we observe at a greater than threefold reduction in CPU time by using our rejection sampler. The rejection sampler of Chanialidis et al. (2017) is derived from a discrete version of adaptive rejection sampling (Gilks and Wild, 1992)
and requires the costly construction of a piecewise truncated geometric enveloping distribution to tightly encapsulate the COM-Poisson probability density function. Our rejection sampler requires only a single envelope distribution depending on the dispersion parameter of the COM-Poisson distribution. This single envelope distribution does come with a higher rejection rate but completely bypasses the costly setup and sampling cost involved in piecewise truncated geometric envelope distributions. A further and important novelty of our work is to show that when a rejection sampling algorithm is available to sample from an intractable likelihood function of interest, then it is possible to construct an unbiased estimate of this intractable likelihood function using the rejection sampler. Our idea works by exploiting a link between rejection sampling efficiency and its relationship with the reciprocal intractable normalising constant of the likelihood function.
The remainder of this paper is organised as follows: Section 2 reviews the COM-Poisson distribution and COM-Poisson regression models under the framework of doubly-intractable Bayesian inference. Section 3 reviews rejection sampling for intractable likelihoods and our new COM-Poisson rejection sampler is introduced in Section 3.1. Section 3.2 describes the novel method of constructing unbiased estimators of intractable likelihood functions through rejection sampling. Section 4 shows how pseudo-marginal MCMC can be performed using this unbiased intractable likelihood estimator. Section 5 shows results of these methods applied firstly to a dataset with no covariates in Section 5.1 and then to a real-world COM-Poisson regression dataset of company takeover bids in Section 5.2. Finally, Section 6 gives discussion and some thoughts for future work.
2 Bayesian doubly-intractable inference for GLMs
The COM-Poisson distribution (Conway and Maxwell, 1962)
is a two-parameter discrete probability distribution that allows the flexibility to model count data with over-, under- and equi-dispersion. The probability mass function for a COM-Poisson random variablewith parameters and is defined over the non-negative integers as
The unnormalised component of the mass function is denoted and is an intractable normalising constant, having no closed form representation for this model. The mode of the COM-Poisson distribution is , with two consecutive modes at and in the case where
is an integer. The moments of the COM-Poisson distribution are unavailable directly due to the intractable normalising constant. The moments can however be approximated through the use of an asymptotic representation ofderived by Shmueli et al. (2005). The approximations for the mean and variance using this asymptotic representation are, respectively,
These approximations are quite accurate for a wide range of and , except for small or small . The purpose of the parameter is to control dispersion through this parameter’s inverse relationship with the variance as seen in (1). When the distribution exhibits equidispersion and the probability mass function reduces to that of a Poisson random variable with . Alternatively setting gives overdispersion and setting gives underdispersion.
The main application of the COM-Poisson distribution is to GLMs, referred to as COM-Poisson regression models, where responses are assumed to follow a COM-Poisson distribution with the parameters and of each response being conditional on available covariates for . Due to this conditioning we now index the parameters for observation as and . COM-Poisson regression modelling was first introduced by Guikema and Goffelt (2008) and Sellers and Shmueli (2010) in an effort to extend Poisson regression by allowing covariates to influence not only the location parameter but also the dispersion parameter . For notational simplicity we now let be the parameter for observation with derived from setting where is a link function of the covariates and are the parameters of this link function. The major difficulty is that the likelihood of the COM-Poisson regression model involves multiple intractable normalising constants,
The particular link function for the COM-Poisson was specified by Guikema and Goffelt (2008) as a dual-link function to allow the available covariates to enter both the and parameter. In this dual-link function scenario the parameter
is split into two sub-vectors as, where are the coefficients for the link component and are the coefficients for the link component. The dual-link function is then
where some are set fixed to 0 when the covariate is not included in , similarly for .
Bayesian parameter inference for in COM-Poisson regression models is a doubly-intractable problem as we now demonstrate. Placing a prior on , the posterior distribution for is formed with this prior and the complete likelihood (2) as
This posterior is termed doubly-intractable since the likelihood is intractable, containing intractable normalising constants, and the posterior itself cannot be normalised. Without the ability to evaluate pointwise, the direct application of standard MCMC methods is infeasible. For example, a Metropolis-Hastings algorithm proposing a move in the link function from to using a proposal distribution requires evaluation of the intractable ratios to compute the acceptance ratio
Possible methods to compute this acceptance ratio include replacing the intractable ratios with approximations using the asymptotic approximation from Shmueli et al. (2007) or by a truncated sum but both of these methods introduce some bias into the acceptance ratio.
An ingenious solution to the issue of computing the acceptance ratio (4) was proposed by Murray et al. (2006) drawing on earlier work by Møller et al. (2006). The solution, known as the exchange algorithm, augments the doubly-intractable posterior (3) with auxiliary data, provided that it is possible to sample data exactly from the intractable likelihood. The original exchange algorithm overcomes a single intractable ratio i.e. however the algorithm can easily be extended to the case of intractable ratios as we now show. Starting from the posterior (3) which contains the product of independent non-identical intractable likelihood functions, the exchange algorithm proposes, as in standard Metropolis-Hastings, to update the parameter from the current state to proposed state using but in addition the posterior is augmented with auxiliary draws generated from the likelihood at the proposed parameters . The augmented posterior is written
and the marginal of this augmented posterior for is the target posterior of interest (3
). The algorithm can be seen as a Markov chain over an augmented parameter. The acceptance ratio for the augmented posterior (5) is now tractable due to the cancellation of all intractable normalising constants
The clever cancellation of the normalising constants in (6) is due to the exchange of parameters () associated with the observed data () and the auxiliary data (), the auxiliary being data discarded after each move. The major requirement in the exchange algorithm is the ability to generate the auxiliary data, requiring exact draws from the likelihood at each proposed parameter. Perfect sampling (Propp and Wilson, 1996) from the likelihood is one possible method to do this but can be prohibitively expensive for many models. In certain models sampling from an intractable likelihood may even be as difficult as computing the intractable normalising constant itself and instead one could resort to sampling using an approximate likelihood sampler such as in the case of the exponential random graph model (Caimo and Friel, 2011). The COM-Poisson distribution was shown by Chanialidis et al. (2017) to be amenable to rejection sampling, allowing the auxiliary data to be generated. In Section 3.1 we present our faster rejection sampler having a cheaper sampling cost than the sampler of Chanialidis et al. (2017). Rejection sampling is not applicable to all intractable likelihoods but when rejection sampling is available we show in Section 3.2 that the intractable likelihood can be estimated without bias through an unbiased estimate of the reciprocal normalising constant. This unbiased intractable likelihood estimator is also guaranteed to be positive which differs from other estimates using Russian roulette stochastic truncation (Lyne et al., 2015) or truncated Markov chains (Wei and Murray, 2016).
3 Rejection sampling to an intractable likelihood estimator
Rejection sampling (von Neumann, 1951) is a sampling scheme to generate statistically independent samples from a target probability distribution of interest. We denote the target density (likelihood) as
where in this section we drop the observation index for ease of notation. This target density may be tractable or intractable but is assumed in any case to be difficult to sample from. The rejection sampling method works by constructing an envelope distribution over the target density, proposing samples from this envelope distribution and choosing to accept a portion of these samples. The envelope distribution should be chosen so that it is both computationally efficient to sample from and matches closely in shape to the target density. The portion of samples that will be accepted depends on the similarity of the envelope and the target distribution. The closer the envelope distribution matches to the target distribution, the higher the number of samples that will be accepted. To generate a single draw from using rejection sampling, we consider an envelope distribution where is its parameter. We assume that the envelope density can be written in a similar form to as
The envelope distribution’s normalising constant may be either tractable or intractable. A necessary condition for is that it dominates the support of i.e. . It is also necessary to bound using and a positive finite bounding constant that satisfies the envelope inequality . Finding such a constant can be a difficult task even for tractable densities. The optimal value of is denoted as . In the case of intractable likelihoods, this optimal constant is impossible to evaluate since at least one of either or is unknown leaving intractable. In the case where both and are tractable distributions and can be found, the vanilla rejection sampling algorithm to obtain a single draw from proceeds as in Algorithm 1.
Algorithm 1 introduces the rejection sampling Bernoulli trial with the outcome conditional on the proposed through the acceptance probability . This Bernoulli trial is at the core of the rejection sampling method as it decides for a given what proportion of samples from to accept. Once intractability is introduced into this Bernoulli trial through , it would appear that becomes intractable but this is fortunately not the case. The intractable bounding constant can be decomposed into tractable and intractable components by extracting the reciprocal normalising constants from as follows
and this introduces the tractable bounding constant . can be viewed as the upper bound on the ratio of the unnormalised densities and , whereas is the upper bound on the associated normalised densities and . The advantage of rejection sampling algorithms is that they can be run without full knowledge of , it is sufficient to know which is usually more readily available. Substituting the expression into the acceptance probability (7) gives the tractable acceptance probability for the conditional Bernoull trial as
It is important to note that is also not required to be tractable, although it typically is for simple envelope distributions. A case where is intractable would be when the envelope distribution itself requires sampling by another rejection sampler and this opens the possibility of chaining many rejection samplers together to sample from evermore complex target distributions. The main purpose of rejection sampling in this paper is to provide a means to generate the auxiliary draws required within the exchange algorithm acceptance ratio (6) allowing the doubly-intractable inference to be performed but Section 3.2 will show another novel usage for rejection sampling. Before discussing this, we demonstrate our COM-Poisson rejection sampler in the next section.
3.1 COM-Poisson rejection sampler
For the COM-Poisson distribution, our sampler uses a choice of two envelope distributions dependent on the value of . This choice is motivated by Figure 1, where a Poisson distribution is used in the case of
and a geometric distribution in the case of. Theorem 3.1 demonstrates and proves why this works. The COM-Poisson rejection sampling algorithm is given in pseudocode after this theorem.
Theorem 3.1 (COM-Poisson intractable rejection sampler).
Suppose that , a Poisson random variable with parameter and , a geometric random variable with parameter , for some , are used as two enveloping distributions as follows,
|together with the following tractable enveloping bounds|
then a sample from the COM-Poisson distribution can be drawn at any parameter value ().
See Appendix A. ∎
Before we present the practical implementation of the COM-Poisson sampling algorithm developed from Theorem 3.1, it is necessary to provide some brief discussion on the choice of in the geometric envelope. The optimal parameter, say, would be one for which the acceptance rate of the rejection sampling is maximised. However it is not possible to compute since the acceptance rate, as we will show in Section 3.2, involves intractable normalising constants. Nevertheless any choice of will result in a valid rejection sampler and clearly the closer is to , the more efficient the sampler is. We have chosen to select by matching the first moment of the geometric to the approximate expected value of the COM-Poisson distribution using (1). Setting the first moment of the geometric distribution equal to the expected value approximation (1) implies,
Algorithm 2 presents pseudocode of the COM-Poisson sampler for coding on a computer. To analyse the efficiency of the COM-Poisson sampler we conducted an experiment in which the sampler was run for different values of and with results presented in Appendix C. We show that the sampler can achieve between and acceptance for and and acceptance for . A comparison of Algorithm 2 with the sampler from Chanialidis et al. (2017) is deferred to Section 5.3 where we demonstrate an example of using both algorithms. Another benefit of rejection sampling algorithms is that they can be implemented in parallel thus speeding up the computation, however care is needed when implementing parallel random number generators with potential issues discussed in Mascagni and Srinivasan (2000). Algorithm 2 gives the pseudocode for the COM-Poisson rejection sampler.
3.2 Unbiased likelihood estimation by monitoring rejection sampling efficiency
Note: In this section we reintroduce the index parameter to discuss rejection sampling at different parameters , each necessarily introduces indexed tractable and intractable sampling bounds and envelope parameters which we denote as , and respectively.
Literature on rejection sampling has explored some methods for recycling the rejected proposals from the envelope distribution. Casella and Robert (1996) consider reducing the variance of rejection sampling estimates by incorporating the rejected samples in a post-processing step using the Rao-Blackwell theorem. Rao et al. (2016)
uses the rejected proposals from a rejection sampler in a different way by considering the joint distribution of rejected and accepted draws leading to a method to perform doubly-intractable inference in the case where the likelihood has an associated rejection sampler. Our approach considers yet another use for the rejected proposals by examining the relationship of rejected proposals with the efficiency of the rejection sampler. The efficiency is used to provide an unbiased estimator of the reciprocal normalising constantfor each parameter . An unbiased estimator of the reciprocal normalising constant at each will lead directly to an unbiased estimator of the complete likelihood (2).
The efficiency of the rejection sampler is critical to the overall performance of the exchange algorithm. The efficiency is defined as the total number of draws required from the envelope distribution until the first acceptance of a draw from or in general the total number of draws required until acceptances from . The number of draws determines how closely the envelope distribution matches to the target distribution. The ideal envelope distribution would exactly match the target distribution and reject no proposals, however in this scenario we could simply circumvent rejection sampling and sample from the envelope distribution directly. In Algorithm 1, run at parameter , each of the samples proposed from the envelope is followed by a Bernoulli trial with conditional acceptance probability . To remove the conditional dependence on , consider the unconditional acceptance probability over all envelope proposals by integrating with respect to the envelope,
It is now obvious that the unconditional acceptance probability and from this the rejection sampler efficiency are controlled directly by the magnitude of the intractable bound . On average, one expects to accept a portion of proposals from as being from and to discard the remainder. Equivalently, the total number of draws until the first accepted draw follows a geometric distribution with success probability and mean i.e. . In general the total number of draws () until
acceptances follows a negative binomial distribution with mean. A natural method to estimate is to run rejection sampling at the parameter and record the observed total number of draws required for acceptances. Then an unbiased estimate of based on acceptances is given as
By the central limit theorem increasingwill lead to lower variance estimates of since can be viewed as the average of independent geometric trials each having variance . To use to give an unbiased estimate of the likelihood, we have the final assumption that is known i.e. the envelope distribution is tractable. Then by rewriting (8) a link emerges between the inverse normalising constants of and as
By replacing with its unbiased estimate and multiplying both sides by an unbiased estimator of the likelihood for observation is given as
with all terms on the RHS of (13) known. The estimate of the complete likelihood follows as
by estimating at each .
This unbiased estimator of the intractable likelihood is very useful especially for model selection as it can be used to construct an estimate of the Bayesian information criterion (BIC) (Schwarz, 1978). Bayesian model choice is a challenging problem for models with intractable likelihood functions. The intractability prevents calculation of the likelihood which is required for most model selection criteria. To our knowledge rejection sampling efficiency has not previously been used to construct a unbiased estimate of an intractable likelihood. The BIC estimate from the unbiased likelihood estimator (14) is
where maximises the unbiased estimate. The variance of this BIC estimate depends on and we find in experiments that needs to be high () to achieve an accurate estimate of the BIC.
It may also be possible to use the unbiased likelihood estimator (13
) as part of other Bayesian model selection frameworks, for example, in the calculation of the marginal likelihood (or evidence) and Bayes factors. This could be a focus of future work. In addition to BIC estimation we consider plugging the unbiased estimator directly into the intractable acceptance ratio (4) which is the basis of pseudo-marginal MCMC algorithms discussed in the next section.
Finally, we note that Chanialidis et al. (2017) presented results comparing models based on the deviance information criterion (DIC) (Spiegelhalter et al., 2002), without explaining how they overcame the need to evaluate the intractable likelihood for each draw from the MCMC sample. Following personal communication with the authors it turns out that they approximated the normalising constant, , using a truncated summation involving terms, for positive integers, , , where , as follows,
Here and are selected to satisfy the inequality , where, as before, is the mode of the COM-Poisson distribution. In turn, they approximated the likelihood for each draw from the sample generated from their MCMC algorithm by
which is then used, for example, to estimate the posterior expected (deviance) log-likelihood which is required for calculation of the DIC. Of course, this introduces an approximation of the likelihood into the estimation of the DIC. This raises questions as to how many terms are needed in order to get an accurate approximation of the likelihood as well as how to choose both and and finally raises the issue of the computational cost involved in evaluating the finite truncation for the terms in the product in (17).
4 Pseudo-marginal MCMC with the intractable likelihood estimator
MCMC methods for Bayesian inference require the ability the evaluate the unnormalised posterior distribution. For doubly-intractable problems where the posterior distribution cannot be evaluated as such, their exists pseudo-marginal MCMC algorithms (Andrieu and Roberts, 2009) which require only an unbiased positive estimate of the posterior distribution. Pseudo-marginal algorithms have been applied to a range of problems such as genetic modelling (Beaumont, 2003) and Markov jump processes (Georgoulas et al., 2017). This pseudo-marginal approach can be seen as an alternative to the exchange algorithm which uses exact sampling in place of unbiased posterior estimators.
We now describe the pseudo-marginal framework and show how our unbiased estimator of the likelihood (14) can be used. Consider a target distribution which cannot be pointwise evaluated, but assume that it is possible to construct an unbiased estimate of the target through the use of auxiliary variables , . Using in place of the true target in a Metropolis-Hastings algorithm leads to the approximate acceptance ratio
Pseudo-marginal MCMC algorithms can be viewed as Metropolis-Hastings algorithms over the joint parameter and auxiliary variable state space . Andrieu and Roberts (2009) describe two alternative forms: the “Monte Carlo within Metropolis” or MCWM algorithm and the “grouped independence Metropolis-Hastings” or GIMH algorithm, with the only latter algorithm having the correct target distribution . In MCWM, both the current and proposed estimate of the target are refreshed every iteration using new auxiliary draws and . In GIMH, only the proposed estimate is refreshed and is fixed to the estimate used when was last accepted.
The difficulty with pseudo-marginal algorithms is in constructing an unbiased estimate of the target. Using the rejection sampling unbiased estimator (14) the pseudo-marginal acceptance ratio for the to move in the COM-Poisson model is
This acceptance ratio is entirely tractable once the estimates have been computed by running rejection sampling at each for the GIMH algorithm or at both and for the MCWM algorithm.
We present two examples of COM-Poisson modelling. The first example is a retail inventory dataset from Shmueli et al. (2005) which contains no covariate information. This first example is a toy example that demonstrates how the BIC can be approximated using the unbiased likelihood estimate. The second example is a takeover bids dataset which contains covariates which we model using a COM-Poisson regression model. The second example will firstly the reduction in computation time available using our sampler and also the ability to choose different regression models using the unbiased likelihood estimate.
5.1 Inventory Data
Inventory data or stock count data is of high importance to retailers as it facilitates to them to make decisions regarding future stock levels. The counts we observe in this data are the quarterly sales counts of particular items of clothing in each store across retail stores. A frequency barplot of the data is shown in Figure 2. The maximum quarterly sales of the clothing item sold was 30 in one store. No stores sold between 22 and 29 items and many stores had 0 sales of the clothing item. The sample mean sales is 3.56, sample variance 11.31 and the sample dispersion index is 3.18, thus overdispersed. This data was analysed by Shmueli et al. (2005) using a maximum likelihood approach.
Since this data contains no extra covariate information, we model and directly rather than considering them as functions of covariates. This could be viewed as using the link function with only intercept where and but we consider it easier to model and . We place Gamma(1, 1) and Gamma(0.0625, 0.25) priors on and , respectively. We run the exchange algorithm for 200,000 iterations and discard the first 5,000 as a burn-in. Each parameter and were updated using a single-site update. The results are shown in Table 1 with the target acceptance rate of for single-site updates achieved.
|Posterior Mean||Posterior SD||Accept Rate|
We now compare the COM-Poisson distribution to the Poisson distribution using with . The BIC for the Poisson distribution can be calculated exactly since the likelihood is tractable. The comparison of both models is shown in Table 2 where the COM-Poisson is shown to outperform the Poisson under BIC.
5.2 Takeover bids COM-Poisson regression model
We consider the dataset from Jaggia and Thosar (1993)
on the number of bids received by U.S. firms that were targets of bid offers during the period 1978-85 and taken over within 1 year of the initial offer. The response variable is the count of takeover bids excluding the initial bid (NUMBIDS) filed for the particular company. The covariates available include variables which capture the firm-specific characteristics and defensive action taken by the firm in response to their initial takeover bid. A detailed list of available predictors with is given in Appendix B. This data has previously been analysed using Poisson regression by Jaggia and Thosar (1993) and series expansions by Cameron and Johansson (1997). We reconsider this analysis by comparing two Poisson regression models to three COM-Poisson regression models. The three models considered in our analysis are listed in Table 3. We performed a prior variable selection by fitting a Poisson GLM by maximum likelihood to the data to establish which predictors to consider in the Bayesian GLM analysis. The first model and simplest considers NUMBIDS in a Poisson GLM where the linear predictor of contains an intercept and two predictors (BIDPREM and WHTKNGHT). The second model adds an extra linear predictor, SIZE, to Model 1. For COM-Poisson regression we consider 3 models, Model 3 moves the covariate SIZE into the link function. Model 4 considers dropping BIDPREM from the link function and finally Model 5 considers adding another covariate FINREST into the link function. For simplicity we use to denote coefficients within the link function and to denote coefficients within the link function.
Priors for and were set as . The exchange algorithm was run for 100,000 observations with a burnin of 10,000 draws discarded. Each coefficient was updated using a single-site update with a random walk proposal. The proposal variance of the random walk was tuned to achieve an acceptance rate of 44% for each site. The posterior results for all models are presented in Table 4. The BIC was calculated using (15) with set to 5000 to give an accurate estimate of the BIC. Model 5 was selected as the best model with the lowest BIC of 386.40. All 3 COM-Poisson regression models are selected as better than the Poisson models, Model 1 and Model 2 suggesting that including covariates in the dispersion link function is worthwhile.
|Model||Response ()||Linear Predictor(s)|
|Parameter||Model 1||Model 2||Model 3||Model 4||Model 5|
|Mean (SD)||Mean (SD)||Mean (SD)||Mean (SD)||Mean (SD)|
|1.130 (0.505)||1.063 (0.532)||1.077 (0.384)||0.329 (0.100)||0.354 (0.091)|
|-0.728 (0.368)||-0.713 (0.382)||-0.553 (0.281)||-||-|
|0.583 (0.152)||0.576 (0.152)||0.458 (0.110)||0.463 (0.111)||0.431 (0.103)|
|-||-||0.674 (0.175)||0.646 (0.175)||0.789 (0.179)|
|-||-||-0.171 (0.051)||-0.174 (0.052)||-0.176 (0.049)|
|Rank||4||5 (Worst)||2||3||1 (Best)|
5.3 Comparing sampler efficiency
To highlight the efficiency of our COM-Poisson rejection sampler we also fit each COM-Poisson regression model in Table 3 using the rejection sampler of Chanialidis et al. (2017). Their sampler involves constructing a tight envelope distribution around the COM-Poisson distribution using a piecewise envelope distribution constructed from 4 truncated geometric distributions. This piecewise geometric is then sampled from by using the inversion method for truncated geometric distributions. These truncated geometric distribution are designed to match closely to the COM-Poisson density function, however the computational cost of constructing this envelope is significant and cannot be disregarded in the CPU time. Our COM-Poisson sampler described in Algorithm 2 is much simpler and requires only a single envelope component depending on the value of . Although the single envelope results in a moderately higher rejection rate compared to Chanialidis et al. (2017), we show in Table LABEL:table:samplercompare that the higher rejection is offset, in CPU time, by the simplicity of constructing and sampling from the single envelope distribution. This outlines a consideration of rejection sampling in that the choice of using an efficient envelope must be considered simultaneously with the computational cost of sampling from and constructing this efficient envelope. To show the reduction in computational time required for our sampler we examine the MCMC draws per second in the COM-Poisson models (Models 3, 4 and 5) in Table 3. Since both our algorithm and the algorithm of Chanialidis et al. (2017) are exact algorithms, the difference in the number of MCMC draws per second is purely computational and not related to the mixing time of the MCMC chain. Our sampler provides a threefold increase in the number of MCMC draws per second in each of the COM-Poisson models.
|Model 3||Model 4||Model 5|
|MCMC draws per sec. (Algorithm 2)||9973||12534||9297|
|MCMC draws per sec. (Chanialidis et al. (2017))||2866||3585||2828|
5.4 Pseudo-marginal MCMC results
We choose Model 5 in Table 3 to implement the GIMH pseudo-marginal MCMC algorithm using the unbiased likelihood estimator (14) in the acceptance ratio. The unbiased likelihood estimator was calculated for 5 scenarios with . The posterior density estimates for each of the pseudo-marginal MCMC runs are shown in Table 6 along with the CPU time and multivariate effective sample size (mESS) (Vats et al., 2015). The mESS is a generalisation of the effective sample size (ESS), which captures the cross-autocorrelation between parameters in the MCMC output.
where is the number of MCMC draws, is the sample autocovariance matrix of the MCMC output and is the true covariance of the posterior which can be estimated by a batch-means method described in Vats et al. (2015). Figure 4 compares the pseudo-marginal () posterior MCMC chain, density and autocorrelation with the exchange algorithm. There is close agreement between the two methods, however the CPU time for the pseudo-marginal algorithm is dramatically increased.
|Exchange||Pseudo-marginal MCMC - GIMH|
|Mean ()||Mean ()||Mean ()||Mean ()||Mean ()||Mean ()|
|0.354 (0.09)||0.337 (0.08)||0.361 (0.09)||0.354 (0.09)||0.354 (0.09)||0.356 (0.09)|
|0.431 (0.10)||0.446 (0.10)||0.424 (0.10)||0.432 (0.10)||0.433 (0.10)||0.431 (0.11)|
|0.789 (0.18)||0.785 (0.14)||0.796 (0.17)||0.790 (0.17)||0.794 (0.18)||0.793 (0.18)|
|-0.176 (0.05)||-0.176 (0.05)||-0.172 (0.05)||-0.178 (0.05)||-0.175 (0.05)||-0.175 (0.05)|
|-0.952 (0.45)||-0.817 (0.50)||-0.981 (0.44)||-0.944 (0.45)||-0.962 (0.45)||-0.955 (0.45)|
This paper provides a new rejection sampler to sample from the COM-Poisson distribution. This rejection sampler allows for faster parameter inference to be performed in COM-Poisson regression models compared with the sampler of Chanialidis et al. (2017). Our rejection sampling algorithm shows that one must consider both the rejection rate of the enveloping distribution and the sampling time from the enveloping distribution in order to construct an efficient rejection sampler. We have also shown how the number of rejected proposals within rejection sampling can be used to construct an unbiased intractable likelihood estimator. This estimator can be used to perform model selection using BIC but future work should included this estimator within other Bayesian model choice strategies. Model selection for doubly-intractable problems such as COM-Poisson regression is a difficult problem and our approach offers one solution to perform model selection. The unbiased likelihood estimator was shown to work well when used within a pseudo-marginal MCMC algorithm (Andrieu and Roberts, 2009) as it provided an unbiased estimate of the acceptance ratio. This unbiased likelihood estimator could be constructed for other intractable models where a rejection sampling algorithm exists to sample from the likelihood, presenting opportunities for future research.
Ahrens, J. H. and U. Dieter (1982).
Computer generation of Poisson deviates from modified Normal distributions.ACM Transactions on Mathematical Software 8(2), 163–179.
- Andrieu and Roberts (2009) Andrieu, C. and G. O. Roberts (2009). The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics 37(2), 697–725.
- Beaumont (2003) Beaumont, M. A. (2003). Estimation of population growth or decline in genetically monitored populations. Genetics 164(3), 1139–1160.
- Caimo and Friel (2011) Caimo, A. and N. Friel (2011). Bayesian inference for exponential random graph models. Social Networks 33(1), 41 – 55.
- Cameron and Johansson (1997) Cameron, A. C. and P. Johansson (1997). Count data regression using series expansions: With applications. Journal of Applied Econometrics 12(3), 203–223.
- Casella and Robert (1996) Casella, G. and C. P. Robert (1996). Rao-blackwellisation of sampling schemes. Biometrika 83(1), 81–94.
- Chanialidis et al. (2017) Chanialidis, C., L. Evers, T. Neocleous, and A. Nobile (2017). Efficient Bayesian inference for COM-Poisson regression models. Statistics and Computing, (to appear). https://doi.org/10.1007/s11222-017-9750-x.
- Conway and Maxwell (1962) Conway, R. W. and W. L. Maxwell (1962). A queuing model with state dependent service rates. Journal of Industrial Engineering 12(2), 132–136.
- Georgoulas et al. (2017) Georgoulas, A., J. Hillston, and G. Sanguinetti (2017). Unbiased Bayesian inference for population Markov jump processes via random truncations. Statistics and Computing 27(4), 991–1002.
- Gilks and Wild (1992) Gilks, W. R. and P. Wild (1992). Adaptive rejection sampling for Gibbs sampling. Journal of the Royal Statistical Society: Series C (Applied Statistics) 41(2), 337–348.
- Guikema and Goffelt (2008) Guikema, S. D. and J. P. Goffelt (2008). A flexible count data regression model for risk analysis. Risk Analysis 28(1), 213–223.
- Hilbe (2011) Hilbe, J. M. (2011). Negative binomial regression. Cambridge University Press.
- Jaggia and Thosar (1993) Jaggia, S. and S. Thosar (1993). Multiple bids as a consequence of target management resistance: A count data approach. Review of Quantitative Finance and Accounting 3(4), 447–457.
- Lyne et al. (2015) Lyne, A.-M., M. Girolami, Y. Atchadé, H. Strathmann, and D. Simpson (2015, 11). On Russian roulette estimates for Bayesian inference with doubly-intractable likelihoods. Statistical Science 30(4), 443–467.
- Mascagni and Srinivasan (2000) Mascagni, M. and A. Srinivasan (2000). Algorithm 806: SPRNG: A scalable library for pseudorandom number generation. ACM Transactions on Mathematical Software 26(3), 436–461.
et al. (2006)
Murray, I., Z. Ghahramani, and D. J. C. MacKay (2006).
MCMC for doubly-intractable distributions.
Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), pp. 359–366. AUAI Press.
- Møller et al. (2006) Møller, J., A. N. Pettitt, R. Reeves, and K. K. Berthelsen (2006). An efficient Markov chain Monte Carlo method for distributions with intractable normalising constants. Biometrika 93(2), 451–458.
- Propp and Wilson (1996) Propp, J. G. and D. B. Wilson (1996). Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms 9(1&2), 223–252.
- Rao et al. (2016) Rao, V., L. Lin, and D. B. Dunson (2016). Data augmentation for models based on rejection sampling. Biometrika 103(2), 319–335.
- Schwarz (1978) Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics 6(2), 461–464.
- Sellers and Shmueli (2010) Sellers, K. F. and G. Shmueli (2010). A flexible regression model for count data. The Annals of Applied Statistics 4(2), 943–961.
- Shmueli et al. (2005) Shmueli, G., T. P. Minka, J. B. Kadane, S. Borle, and P. Boatwright (2005). A useful distribution for fitting discrete data: Revival of the Conway-Maxwell-Poisson distribution. Journal of the Royal Statistical Society: Series C (Applied Statistics) 54(1), 127–142.
- Shmueli et al. (2007) Shmueli, G., R. P. Russo, and W. Jank (2007). The BARISTA: A model for bid arrivals in online auctions. The Annals of Applied Statistics 1(2), 412–441.
- Spiegelhalter et al. (2002) Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. Van Der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(4), 583–639.
- Vats et al. (2015) Vats, D., J. M. Flegal, and G. L. Jones (2015). Multivariate Output Analysis for Markov chain Monte Carlo. ArXiv e-prints. https://arxiv.org/abs/1512.07713.
- von Neumann (1951) von Neumann, J. (1951). Various Techniques Used in Connection with Random Digits. National Bureau of Standards Applied Mathematics Series 12, 36–38.
- Wei and Murray (2016) Wei, C. and I. Murray (2016). Markov Chain Truncation for Doubly-Intractable Inference. ArXiv e-prints. https://arxiv.org/abs/1610.05672.
Appendix A COM-Poisson rejection sampler Proof
There are two cases to consider depending on the value of .
(In this proof we use the Iverson bracket , where if the enclosed statement is true and 0 if the statement is false.)
Case 1: When
The enveloping bound as illustrated in Figure 1 (left) is the ratio of the COM-Poisson mass function to a Poisson mass function
To find the supremum, assume that the supremum occurs at the point and by the unimodality of the COM-Poisson distribution, it will satisfy