Poisson log-linear models are common in statistics and represent one of the most popular choices to model how the distribution of count data varies with predictors. A typical assumption is that, under an independent sample of counts,
, the probability mass function of the genericconditionally on a
-dimensional vector of covariatesis
where is a -dimensional vector of unknown coefficients. Linking the linear predictor and the parameter with the logarithm represents the most natural choice, as the logarithm is the canonical link for the Poisson family (Nelder and Wedderburn, 1972). This model has broad application in several fields, including medicine and epidemiology (Frome, 1983; Frome and Checkoway, 1985; Hutchinson and Holtman, 2005), manufacturing process control (Lambert, 1992), analysis of accident rates (Joshua and Garber, 1990; Miaou, 1994), and crowd counting (Chan and Vasconcelos, 2009), among others.
In the Bayesian context, model (1) does not enjoy any conjugacy property and, thus, regardless of the prior used, the posterior distribution of
is not available in close form. Consequently, inference is conducted using Markov Chain Monte Carlo (MCMC) methods, which obtain a sample from the posterior distribution of the parameters. Several approaches have focused on how to easily obtain the posterior distribution of the coefficients of Poisson models without requiring complex tuning strategies or long computation times. In the context of count-valued time series,Frühwirth-Schnatter and Wagner (2006) proposed a formulation of the model based on two levels of data augmentation, to derive an efficient approximate Gibbs sampler. Frühwirth-Schnatter et al. (2009) exploited a data augmentation strategy to simplify the computation of hierarchical models for count and multinomial data. Data augmentation strategies have also been employed in the case of models for multivariate dependent count data (Karlis and Meligkotsidou, 2005; Bradley et al., 2018). However, the simplest Poisson regression in (1) still lacks a specific and efficient algorithm to sample from the posterior distribution of the parameters for any prior choice, making the Metropolis-Hastings (Hastings, 1970) or Hamiltonian Monte Carlo (HMC) (Neal, 2011) algorithms the only available options.
On the other hand, several efficient computational strategies for binary regression models have been proposed in the literature. Using the probit link, Albert and Chib (1993) proposed an efficient data augmentation based on a latent Gaussian variable, while the more recent contribution by Polson et al. (2013)
exploited the canonical logit link, introducing an efficient Pólya-gamma data augmentation scheme. LeveragingPolson et al. (2013) approach, we propose a novel approximation of the posterior distribution that can be exploited as proposal distribution of a Metropolis-Hastings algorithm or as importance density of an importance sampling for Poisson log-linear models with conditional Gaussian prior distributions on the regression parameters. With conditional Gaussian prior, we refer to a possibly hierarchical prior with conditional distribution , with and/or random. Examples include straightforward Gaussian prior distributions with informative fixed using prior information, and scale mixtures of Gaussian where
is set to zero and the variance has a suitable hierarchical representation, such as the Bayesian lasso prior(Park and Casella, 2008), the horseshoe prior, and its extensions (Carvalho et al., 2010; Piironen and Vehtari, 2017).
More specifically, we introduce an approximation of the posterior density that exploits the negative binomial convergence to the Poisson distribution. Thanks to this result, we are able to leverage the Pólya-gamma data augmentation scheme ofPolson et al. (2013) to derive an efficient sampling scheme. In the next section, we introduce and discuss the proposed algorithms, starting from the definition of an approximate posterior distribution whose sampling can be performed straightforwardly. Sampling from this approximate posterior is then used as proposal density for the Metropolis-Hastings or importance sampler. The performances of the proposed algorithms in terms of computational efficiency is compared with that of state-of-the-art methods in a simulation study. The paper concludes with two illustrative applications.
2 Efficient posterior sampling strategies
2.1 Approximate posterior distribution
Assume is an independent sample of counts from model (1). We introduce an approximation of the posterior density which exploits the negative binomial convergence of the Poisson distribution, i.e., we approximate the -th contribution to the likelihood function with where
which corresponds to the probability mass function of a negative binomial random variable with parameter, the number of failures until the experiment is stopped, and success probability . As approaches infinity, this quantity converges to a Poisson likelihood.
where denotes the density of a Pólya-gamma with parameters .
In what follows, we assume that prior knowledge about the unknown parameters is represented by a conditionally Gaussian prior, i.e. , with a possible hierarchical representation for the parameters and . Examples include default informative Gaussian with fixed or scale mixtures of Gaussian where is set to zero and the variance has a suitable hierarchical representation (Park and Casella, 2008; Carvalho et al., 2010; Piironen and Vehtari, 2017).
The approximate posterior based on the conditionally Gaussian prior and approximate likelihood is consistent with the successful Gibbs sampler of Polson et al. (2013); i.e., sampling from the approximate posterior is equivalent to sampling iteratively from the following full conditionals
where and , with and .
The adherence of this approximate posterior to the true posterior highly depends on the values of , with larger values of resulting in better approximations. However, when employing this result in posterior sampling, large values of imply longer computation time due to the computational cost of sampling Pólya-gamma random variables with large parameters. Although the specific choice of remains an open point—discussed later in Section 2.4—in the context of MCMC sampling, we propose to reduce the computational burden related to the sampling of
Pólya-gamma random variables marginalizing the Gaussian distribution in (3) with respect to the related Pólya-gamma density conditioned on , the last available sampled. Since this marginalization is not in a closed form we introduce a second level of approximation of the true posterior. Specifically, we introduce a density that depends on , defined as the first-order Taylor expansion of the marginalized Gaussian distribution, i.e.
where , , , , and for each the conditional expectation of each is simply
The above construction is eventually used as the building block of efficient Metropolis-Hastings and importance sampling algorithms, as described in the following sections.
2.2 Metropolis-Hastings sampler
We employ the above sampling mechanism as the proposal density in a Metropolis-Hastings algorithm. Consistent with this, at each iteration of the MCMC sampler, an additional step that accepts or rejects the proposed draw is introduced. Specifically, we assume that conditionally on the current state of the chain , a new value is sampled from (5). Then, the acceptance probability
is evaluated to decide whether to accept or reject the proposed , where is the exact posterior distribution.
To compute the acceptance probability in (6), the forward and backward transition densities and must be computed. Consistent with this, approximation (4) is particularly useful: without it, it would be necessary to compute the marginal density where the Pólya-gamma random variables are integrated out. However, the marginalization with respect to the Pólya-gamma density does not lead to a closed form expression; thus, the Metropolis-Hastings algorithm cannot be defined.
Clearly, for increasing the proposal density (5) is closer to the true full conditional distribution; hence, the related acceptance rate will be higher, and the Metropolis-Hastings algorithm will be similar to a Gibbs sampler. On the other hand, setting this parameter to get a lower acceptance rate can result in smaller autocorrelation, and hence a better mixing (Robert and Casella, 2010). We discuss an approach to choose balancing these two extremes in Section 2.4.
2.3 Adaptive importance sampler
The sampling mechanism (5) can also be exploited within the context of importance sampling, where the posterior expectation of a function of the parameter , is evaluated via Monte Carlo integration without direct sampling from . To this end, the general approach is to define an importance density that is used to sample values , which are eventually averaged to obtain an approximation of through
The efficiency of this algorithm is determined by the ability of the importance density to sample values relevant to the target density. To improve this aspect, we modify the original algorithm and, at each iteration, we simulate values from (5), updating the importance density with (4). Thus, the importance density is adaptively updated based on the previously extracted value and the weights become
2.4 Tuning parameters
The values of the parameters , , have to be tuned to balance the trade-off between acceptance rate and autocorrelation in the Metropolis-Hastings, and to control the mixing of the weights in the importance sampler. However, tuning parameters is not practical, especially for moderate to large . The first simple solution sets all parameters equal to a single value , however, in our experience, this resulted in a low effective sample size for some of the sampled chains.
As an alternative strategy, we choose to tune instead the distance of the proposal density from the target posterior. As the expression of the posterior distribution is unknown, we control the distance between the Poisson and negative binomial likelihood. Based on Teerapabolarn (2012)
, we consider the upper bound of the relative error between the Poisson and negative binomial cumulative distribution functions. This result is particularly useful owing to its simplicity, which allows to analytically derive adequate parameters to bound the error to a specific value. Specifically, ifis a Poisson random variable with mean and is a negative binomial random variable with parameters and , as defined in Section 2.1, we have the following result:
Hence, by setting an upper bound
for the distance between the Poisson and negative binomial distribution, all the values of the parameterscan be automatically derived to obtain a proposal density whose distance from the target posterior is constant for every , even for heterogeneous data. Under our notation , thus which is solved by
where and is the Lambert-W function (Lambert, 1758), which can be computed numerically using standard libraries. Hence, in the algorithm, at the beginning of each iteration, the values are computed according to (7) conditionally on the current value of .
3 Numerical illustrations
3.1 Synthetic data
We conducted a simulation study under various settings to compare the efficiency of the proposed Metropolis-Hastings and importance sampler with that of state-of-the-art methods. We focused on the Hamiltonian Monte Carlo approach—as implemented in the Stan software (Stan Development Team, 2021)—as the successful Metropolis-Hastings with standard random walk proposal would require, different from the proposed approaches, the tuning of parameters, which becomes cumbersome for moderate to elevate . The proposed methods are implemented via the R package bpr, which is written in efficient C++ language exploiting the Rcpp package (Eddelbuettel and Francois, 2011) and available at https://github.com/laura-dangelo/bpr.
Data were generated from a Poisson log-linear model with sample sizes and number of covariates . Specifically, for each combination of and , we consider 50 independent dimensional vectors of counts where each () is sampled from a Poisson distribution with mean , with common parameter . The covariates were generated from continuous or discrete/categorical random variables under the constraints that the continuous variables have mean zero and variance one and that . Reproducible scripts to generate the synthetic data are available at github.com/laura-dangelo/bpr/simulation and as Supplementary Materials.
Two prior distributions for the coefficients were assumed, namely a vanilla Gaussian prior with independent components , , and the more complex horseshoe prior (Carvalho et al., 2010) which allows for the following conditionally Gaussian representation
for , where
is the standard half-Cauchy distribution. To implement the samplers under the horseshoe prior, we used the details ofMakalic and Schmidt (2016), and fixed the global shrinkage parameter to the “optimal value” , where is the number of non-zero parameters (van der Pas et al., 2017).
Each method introduced in Section 2
was run for 10000 iterations with 5000 of them discarded as burn-in. The convergence of each algorithm was assessed by graphical inspection of the trace plots of the resulting chains. The convergence was satisfactory for all simulations and comparable for all algorithms, as no systematic bias was found in the posterior mean of the estimated parameters.
To assess the efficiency of the proposed methods, we used a proxy of the time per independent sample, which is estimated as the total time (in seconds) necessary to simulate the entire chain, over the effective sample size of the resulting chain. For the proposed adaptive importance sampler, an estimate of the effective sample size was obtained using the quantity , which takes values between 1 and (Robert and Casella, 2010). Notably, the burn-in samples were removed from the chains before computing the effective sample size. Thus, the obtained times per independent sample do not represent exactly the number of seconds necessary to generate one independent sample—they rather represent an overestimate. Nonetheless, this approach provides a robust and fair comparison between the different competing algorithms. The experiment has been run on a Linux machine with 8 GB DDR4 2400 MHz RAM, CPU Intel i7-7700HQ 3.8 GHz, running R 4.1.1.
Figure 1 and 2 show, for each combination of and , the distribution of the median time per independent sample for the three algorithms computed on the 50 replications under a Gaussian and horseshoe priors, respectively. The plots are presented in the logarithmic scale for clarity.
For the Gaussian prior the performances of the proposed algorithms are better than those obtained using the HMC implemented in Stan, for small values of the dimension . For , instead, the performances of the HMC are quite competitive with respect to the importance sampling and broadly comparable to the proposed efficient Metropolis-Hastings algorithm. Notably, the differences are less evident with increasing sample size.
For the horseshoe prior, the proposed Metropolis-Hastings presents a stable superior performance with respect to the HMC sampler implemented in Stan for each sample size ad number of covariates . The performance of the importance sampler remains competitive. As previously observed for the Gaussian prior, the differences are less evident for increasing sample size.
3.2 Spike train data
Herein, we illustrate the proposed sampling method on data of brain activity in mice in response to visual stimulation. This type of data is relatively new, and it arises from the observation of brain activity through the technique of calcium imaging. The novelty of this technique is that it allows to analyze the brain activity at a neuronal level, and it allows to study the associations between a stimulus and the cells’ response. The data set was generated using a small subset of data from the Allen Brain Observatory(Allen Institute for Brain Science, 2016), which is an extensive survey of the physiological activity of neurons in the mouse visual cortex. In the original data set, for each neuron the fluorescent calcium traces are recorded, which is a proxy of the neuronal activity, under different experimental conditions. From these traces, it is of interest to detect and analyze the activations of neurons, which correspond to transient spikes of the intracellular calcium level. We applied the method reported by Jewell et al. (2019) as described in de Vries et al. (2020) to extract and count the activations of each neuron, to understand how they are affected by the experimental conditions and the location of the neurons in the brain. In the context of neural studies, this approach is usually referred to as “encoding models”, as the interest is to predict the activity of a population of neurons in response to a given visual stimulus and other experimental conditions. Paninski et al. (2007) reviewed some methods commonly employed in encoding problems. Generalized linear models, which are flexible and yet interpretable, are one of the fundamental tools for investigating the response of neurons to external factors. In particular, the authors assert that assuming a Poisson distribution is a plausible assumption to model spike counts; hence, we regressed the estimated number of neurons’ activation on several continuous and categorical covariates available from the study.
Coefficients of the regression on the calcium imaging data set: posterior density, with the posterior mean and 95% credible interval (colored dot and segment).
The covariates are the depth of the neuron, the area of the visual cortex where the neuron is located (factor with 6 levels), the cre transgenic mouse line (factor with 13 levels), and the type of visual stimulation (factor with 4 levels). The depth of the neurons is discretized to 22 levels, ranging from 175 to 625 microns, thus, we could obtain a data set having a full factorial design with 5 replications for each available covariate combination. Moreover, we included a quadratic term of the depth to improve the fitting. The obtained data set is made of 920 observations on 23 variables.
We ran the proposed Metropolis-Hastings algorithm for 9000 iterations, discarding the first 5000 as burn-in. The computation time was 98 seconds. The posterior estimates of the coefficients of the dummies on three categorical variables are shown in Figure3; and for the numeric covariate depth, the posterior mean and 95% credible intervals were equal to for the linear term, and for the quadratic term. Given these estimates of the coefficients, the number of spikes increased with the largest depths. Moreover, as shown in Figure 3, the response of neurons is heterogeneous across the cre-lines and, coherent with the results of de Vries et al. (2020), we obtained that the mean response is lower for the VISam, VISpm and VISrl areas.
3.3 Betting data
Poisson regression models have been widely adopted in sports analytics, where the response variable is the match score or total number of points. Modelling match scores in association football has recently gained considerable interest owing to the popularity of the betting market, and several modelling approaches have been proposed. Classical methods use only the information of the teams playing(Dixon and Coles, 1997; Petretta et al., 2021)
, while other authors have explored the possibility of introducing additional information, such as historical data and bookmakers’ odds(Egidi et al., 2018; Groll et al., 2019). In general, to describe the number of goals scored by each team in a match, the Poisson distribution is considered a valid assumption (Maher, 1982; Lee, 1997). Herein, we considered data of match scores from the Italian Serie A 2020-2021 season, which are publicly available at http://www.football-data.co.uk. We considered the number of goals as the variable of interest, and, as covariates, we included the fixed effects of the team, several betting odds (for different betting types and bookmakers), and an indicator of whether the team is playing home. The resulting data set has 760 observations on 102 variables.
We used the proposed Metropolis-Hastings algorithm with both a Gaussian prior centered at zero and horseshoe prior distribution on the parameters; to compare the results and analyze the variables the two priors select. The results are depicted in Figure 4. For each explanatory variable, the posterior mean and credible interval were obtained using the two priors. The filled symbols indicate that the credible interval does not contain zero, showing that the shrinkage induced by the horseshoe prior selects only a few variables compared to the informative Gaussian prior. Moreover, the horseshoe prior induces a significant reduction of the amplitude of all credible intervals.
A more formal comparison between the two models is obtained using the conditional predictive ordinate (CPO) statistics (Geisser, 1993; Gelfand et al., 1992; Gelfand and Dey, 1994), which is defined for , as , where is the vector of observed data omitting the -th value. Figure 5 shows the boxplots of the resulting CPO statistics for the two models. The graph does not highlight any fundamental difference in the predictive capacity of the two models, implying that the horseshoe prior allows to obtain a more parsimonious model with a similar fit. This is also confirmed by the logarithm of the pseudo-marginal likelihood , which is commonly used as a summary of CPO’s (Ibrahim et al., 2014). It is equal to -1172.055 and to -1167.521 for the models based on the Gaussian prior and the horseshoe, respectively.
Motivated by the lack of specific computational tools for efficient sampling from the posterior distribution of regression parameters in Poisson log-linear models, we introduced an approximate posterior distribution used as building block for the Metropolis-Hastings and importance sampling algorithms. The performances of the proposed solutions, in terms of mixing and computation time, were comparable or superior to those of the efficient Stan implementation of HMC in all scenarios considered and particularly when a hierarchical prior is assumed. The ease of application of our methods is further enhanced by their availability via the R package bpr, which obtains the posterior distribution of several quantities of interest without the need for coding and with minimal tuning.
- Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association 88 (422), pp. 669–679. Cited by: §1.
- Allen brain observatory. Note: http://observatory.brain-map.org/visualcoding Cited by: §3.2.
- Computationally efficient multivariate spatio-temporal models for high-dimensional count-valued data (with discussion). Bayesian Analysis 13 (1), pp. 253–310. External Links: Cited by: §1.
- The horseshoe estimator for sparse signals. Biometrika 97 (2), pp. 465–480. Cited by: §1, §2.1, §3.1.
Bayesian Poisson regression for crowd counting.
2009 IEEE 12th International Conference on Computer Vision, Vol. , pp. 545–551. Cited by: §1.
- A large-scale standardized physiological survey reveals functional organization of the mouse visual cortex. Nature neuroscience 23 (1), pp. 138–151. Cited by: §3.2, §3.2.
- Modelling association football scores and inefficiencies in the football betting market. Journal of the Royal Statistical Society: Series C (Applied Statistics) 46 (2), pp. 265–280. Cited by: §3.3.
- Rcpp: seamless R and C++ integration. Journal of Statistical Software, Articles 40 (8), pp. 1–18. External Links: Cited by: §3.1.
- Combining historical data and bookmakers’ odds in modelling football scores. Statistical Modelling 18 (5-6), pp. 436–459. Cited by: §3.3.
- The analysis of rates using Poisson regression models. Biometrics 39 (3), pp. 665–674. Cited by: §1.
- Use of Poisson regression models in estimating incidence rates and ratios. American Journal of Epidemiology 121 (2), pp. 309–323. Cited by: §1.
- Improved auxiliary mixture sampling for hierarchical models of non-Gaussian data. Stat Comput 19 (479). Cited by: §1.
- Auxiliary mixture sampling for parameter-driven models of time series of counts with applications to state space modelling. Biometrika 93 (4), pp. 827–841. Cited by: §1.
- Predictive inference. Chapman and Hall/CRC. Cited by: §3.3.
- Model determination using predictive distributions with implementation via sampling-based-methods (with discussion). In Bayesian Statistics 4, Cited by: §3.3.
- Bayesian model choice: asymptotics and exact calculations. Journal of the Royal Statistical Society. Series B (Methodological) 56 (3), pp. 501–514. Cited by: §3.3.
A hybrid random forest to predict soccer matches in international tournaments. Journal of Quantitative Analysis in Sports 15 (4), pp. 271–287. Cited by: §3.3.
- Monte Carlo sampling methods using Markov chains and their applications. Cited by: §1.
- Analysis of count data using Poisson regression. Research in Nursing & Health 28 (5), pp. 408–418. Cited by: §1.
- Bayesian survival analysis. In Wiley StatsRef: Statistics Reference Online, pp. . External Links: Cited by: §3.3.
- Fast nonconvex deconvolution of calcium imaging data. Biostatistics 21 (4), pp. 709–726. Cited by: §3.2.
- Estimating truck accident rate and involvements using linear and Poisson regression models. Transportation Planning and Technology 15 (1), pp. 41–58. Cited by: §1.
- Multivariate Poisson regression with covariance structure. Stat Comput 15, pp. 255–265. Cited by: §1.
- Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics 34 (1), pp. 1–14. Cited by: §1.
- Observations variae in mathesin puram. Acta Helvitica, physico-mathematico-anatomico-botanico-medica 3, pp. 128–168. Cited by: §2.4.
- Modeling scores in the premier league: is Manchester United really the best?. Chance 10 (1), pp. 15–19. Cited by: §3.3.
- Modelling association football scores. Statistica Neerlandica 36 (3), pp. 109–118. Cited by: §3.3.
- A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters 23 (1), pp. 179–182. Cited by: §3.1.
- The relationship between truck accidents and geometric design of road sections: Poisson versus negative binomial regressions. Accident Analysis & Prevention 26 (4), pp. 471–482. Cited by: §1.
- MCMC using hamiltonian dynamics. Handbook of Markov chain Monte Carlo 2 (11), pp. 2. Cited by: §1.
- Generalized linear models. Journal of the Royal Statistical Society. Series A (General) 135 (3), pp. 370–384. Cited by: §1.
- Statistical models for neural encoding, decoding, and optimal stimulus design. In Computational Neuroscience: Theoretical Insights into Brain Function, P. Cisek, T. Drew, and J. F. Kalaska (Eds.), Progress in Brain Research, Vol. 165, pp. 493–507. Cited by: §3.2.
- The Bayesian lasso. Journal of the American Statistical Association 103 (482), pp. 681–686. Cited by: §1, §2.1.
- Mar-co: a new dependence structure to model match outcomes in football. External Links: Cited by: §3.3.
- Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics 11 (2), pp. 5018 – 5051. Cited by: §1, §2.1.
- Bayesian inference for logistic models using Pólya-gamma latent variables. Journal of the American Statistical Association 108 (504), pp. 1339–1349. Cited by: §1, §1, §2.1, §2.1.
- Introducing Monte Carlo methods with R. Springer. Cited by: §2.2, §3.1.
- Stan modeling language users guide and reference manual. Note: url: http://mc-stan.org/ Cited by: §3.1.
- The least upper bound on the Poisson-negative binomial relative error. Communications in Statistics - Theory and Methods 41 (10), pp. 1833–1838. Cited by: §2.4.
- Adaptive posterior contraction rates for the horseshoe. Electronic Journal of Statistics 11 (2), pp. 3196 – 3225. Cited by: §3.1.