Discrete data commonly arise either as natural counts or via data coarsening such as rounding. Bayesian regression models must account for this discreteness in order to provide a coherent data-generating process. However, discrete data often present challenging distributional features, including zero-inflation, over/under-dispersion, boundedness or censoring, and heaping. Customized Bayesian models that target these features are usually too complex for analytical posterior and predictive analysis and can incur significant computational burdens from the requisite approximation algorithms, such as MCMC. The contribution of this paper is to provide conjugate priors, closed-form posteriors, and efficient Monte Carlo posterior and predictive simulation algorithms for a broad class of discrete data regression models.
The impact of data coarsening is often nonignorable, both for coherency of the data-generating process and for correct and efficient inference. dempster1983rounding emphasized the surprisingly large impact of rounded data adjustments in least squares regression, which notably do not decay with the sample size. More broadly, Heitjan1989 reviewed inference and asymptotics for grouped data, where continuous data are observable only up to some coarsening such as an interval. Heitjan1991 studied multiple sources of data coarsening, including rounding, heaping, and, censoring, as well as random coarsening operators in the context of ignorability.
The perspective on discrete data offered by grouping or rounding has also been popularized as a modeling strategy for count and binary data. canale2011bayesian
introduced rounded Gaussian mixture models for count-valued Bayesian nonparametrics.canale2013nonparametric similarly applied rounded Gaussian processes for count regression modeling. Kowal2020a developed extensions that include both rounding and transformations to adapt linear, additive, and BART regression models for count and rounded data. Although these works demonstrate the utility of discretized continuous data models for count data, none of these methods provides analytic posterior or predictive distributions and instead use MCMC for inference. Among related non-Bayesian approaches, Siegfried2020 and Kowal2021b developed count data regression with an emphasis on learning a transformation for more adaptive discrete distributional modeling. Jia2021 deployed a similar strategy for count time series analysis.
Conjugate priors for discrete data regression have received renewed attention, including for binary (Durante2019) and multinomial (Fasano2020) data. Indeed, the closed-form posteriors for probit linear regression derived by Durante2019 are a special case of our results. However, there are additional challenges in our setting. First, our results generalize from binary data to discrete data , including count data, rounded data, and bounded or censored (discrete) data. These generalizations require a different class of distributions for conjugate modeling, and therefore distinct descriptions and computations of the relevant posterior and predictive quantities. Second, our framework incorporates a transformation, which is irrelevant for binary data but widely-used in count data models. For instance, count data are often modeled (incoherently) using continuous regression models after log or square-root transformations, while copula models incorporate transformations via the marginal distributions (Pitt2006). Third, we prioritize closed-form predictive distributions that are discrete to match the support of the data and joint across multiple covariate values. Naturally, a predictive distribution supported on (a subset of)
is a substantial generalization of a binary predictive distribution, which is completely characterized by the success probability. Our results admit direct predictive probability evaluations and Monte Carlo predictive sampling for discrete data , including for sparse regression, nonlinear models, and model-averaged scenarios.
Bayesian regression models for count data commonly build upon the Poisson distribution. Except in special cases, the posterior and predictive distributions do not have known closed forms and therefore require approximations such as MCMC. Further, the Poisson distribution is often inadequate for modeling, especially in the case of over/under-dispersion, zero-inflation, or heaping. Generalizations such as the Conway-Maxwell-Poisson distribution(sellers2010flexible) or the discrete Weibull distribution (peluso2019discrete) provide more flexible count distribution models, yet are burdened by more intensive computations. polson2013bayesian introduced a data augmentation strategy for Gibbs sampling with Negative Binomial likelihoods, while Bradley2018
proposed log-gamma distributions that are conditionally conjugate to the Poisson distribution. These approaches employ MCMC for all posterior and predictive inference and typically require further modifications to account for zero-inflation, bounded or censored data, and heaping.
Operating within a broad and empirically successful class of discrete data regression models, we introduce conjugate priors with closed-form posterior inference. These results provide new analytical clarity on the posterior distribution for count and rounded data regression. Perhaps most important, we develop Monte Carlo simulation algorithms for posterior and predictive inference, which notably avoid the customized diagnostics, lengthy simulation runs, and deteriorating performance in high dimensions that plague MCMC. These tools are adapted to linear and nonlinear regressions, model selection and averaging, and the high-dimensional sparse means problem with count and rounded data (Section 2). Multiple simulation studies demonstrate significant advantages in computational efficiency, predictive performance, and variable selection relative to state-of-the-art competitors (Section 3). Supplementary materials include proofs and derivations, R code to reproduce all results, and an R package on GitHub.
2 Conjugate priors for count and rounded data
2.1 Transformation and rounding models
For paired data with covariates and integer-valued responses , consider the following class of models:
where is a rounding operator, is a (monotone increasing) transformation, and is a covariate-dependent continuous distribution parametrized by . Informally, (1)–(2) is designed to adapt a continuous data model for coherent modeling of discrete data . We focus primarily on the Gaussian linear model for the latent continuous data ,
with modifications available for heteroskedastic or dependent errors and nonlinear versions (see Section 2.4).
It is useful to decouple the rounding operator , which defines the support of , and the transformation , which is a smooth monotone function. First, consider the rounding operator , which is defined via a partition of such that whenever .
Suppose that we observe rounded data instead of continuous data. Let the rounded digit be the first decimal place, which can be satisfied by pre-multiplying the data by the appropriate power of 10 such that . When is the identity, represents the unobserved continuous data version of . In this case, whenever , so and is the floor function.
Suppose that is count-valued and bounded by some known . The rounding operator defined by for , for , and for ensures the correct support for the data-generating process. Equivalently, the partition is , for , and . In addition, (1)–(2) coupled with this choice of produces a valid likelihood for data censored at (Kowal2020a). The case of unbounded counts simply requires . When , our results reproduce Durante2019 for probit regression.
Next, consider the monotone increasing transformation .
The (signed) Box-Cox family provides a parametric option, , for with for . In addition to the log transformation, this family includes the (shifted and scaled) identity () and square-root transformations. These transformations are common for (incoherent) application of continuous models (2) to count data without the rounding operator
; for example, the square-root transformation is the variance-stabilizing transformation of the Poisson distribution.Kowal2020a used this family for Bayesian modeling of (1)–(2).
Using copula-style models, an alternative strategy is to link the transformation to the cumulative distribution function (CDF) of(Liu2009; Siegfried2020; Kowal2021b). Suppose as in the examples above. The CDFs of and are related: By model (2), the latent are Gaussian. Consider a working marginal distribution , so for
. Rearranging terms in the above CDF relation motivates the following estimator:
along with the appropriate limiting behavior, and . Since the transformation is identifiable only up to location () and scale (), the estimator (4
) matches the first two moments ofwith those of : is the sample mean and
is the sample standard deviation of. Similarly, (4) substitutes a point estimate for using the nonparametric estimator that rescales the empirical CDF (ECDF) to avoid boundary issues. Lastly, Kowal2021b
proposed a smooth monotonic interpolation ofon the observed values, , with the limits of also imposed for .
The distinction between and highlights the importance of the rounding operator: is a step function with and therefore does not need the rounding operator. However, the data-generating process of (1)–(2) with transformation is supported only on the observed values: whenever , which occurs for when . Instead, the smoothness of appropriately expands the support of the data-generating process, while the interpolation ensures that preserves (4) at the observed data values .
The estimator (4) may instead use parametric distributions, such as Poisson or Negative Binomial CDFs. In this case, the role of the transformation is to map the latent data model (2) such that the marginal distribution of conforms to a pre-specified parametric family. The accompanying parameters of must be estimated; a useful default is method-of-moments estimation, which recycles and . Jia2021 adopted a related approach for count time series modeling, while Pitt2006 similarly used parametric regression models for the marginals in a multivariate Gaussian copula model.
2.2 Selection distributions
with conditional independence across . Given a prior , the posterior distribution is
where is elementwise.
Previous attempts at Bayesian modeling of (1)–(2) have applied Gibbs sampling algorithms for estimating functionals of (6) (canale2011bayesian; canale2013nonparametric; Kowal2020a). The common strategy is to introduce a data-augmentation step in the MCMC algorithm and iteratively sample from and . However, these MCMC algorithms fail to provide general insights about the posterior or predictive distribution under (1)–(2) and require customized diagnostics and lengthy simulation runs. Perhaps most important, the efficiency of these MCMC algorithms—both in terms of effective samples per draw and computing time per draw—deteriorates significantly as grows (see Section 3).
Here, we characterize the posterior distribution explicitly:
): a selection distribution is defined by random variables of the form
and parametrized in terms of the joint distribution forand the measurable constraint region . So, the observation that is sufficient.
General properties of selection distributions are provided by ArellanoValle2006. When the prior is closed under a set of transformations or closed under marginalization, these properties propagate to the posterior . When have a joint multivariate elliptically contoured distribution, the posterior density and distribution functions can be expressed in terms of the location parameters, dispersion parameters, and density generator function; see ArellanoValle2006 for the explicit forms.
For a more tractable analysis, suppose that the prior and the latent data model are both Gaussian. Specifically, construct the joint distribution
for and , where the moments may depend on the covariates .
The posterior distribution is defined by the joint moments ofin (7) and the constraint region implied by (1). Crucially, there exists a more constructive representation:
Let and be independent with for . Then .
Most important, Theorem 3 enables direct Monte Carlo simulation from the posterior, rather than iterative MCMC or other approximations. Specific examples are provided subsequently.
While simplifications are often available depending on the model form (see Sections 2.3 - 2.4), the primary computational bottleneck for evaluating the density (8) is the integration of an -dimensional multivariate normal density, whereas the Monte Carlo sampler in Theorem 3 is limited by the draw from the -dimensional truncated normal. Our computing strategies leverage recent algorithmic developments. In particular, Botev2017
introduced a minimax exponential tilting strategy that accurately estimates the normalizing constant of a high-dimensional multivariate normal distribution and provides an efficient importance sampling algorithm for truncated multivariate normal variables with an accept-reject algorithm that achieves a high acceptance rate. These algorithms are efficient and accurate for moderate, such as , and are implemented in the TruncatedNormal package in R.
2.3 Posterior distribution for the linear model
Lemma 1 enables evaluation of the posterior density (Theorem 2), direct Monte Carlo simulation from the posterior (Theorem 3), and computation of the posterior moments (equations (9) and (10)) for the linear model. To illustrate, consider Zellner’s -prior for : for . Algorithm 1 provides direct simulation from the posterior distribution . Notably, this Monte Carlo sampler only requires standard regression functionals of , which are a one-time computing cost across Monte Carlo draws.
Under the -prior, the posterior expectation of the regression coefficients also simplifies:
where is the marginal expectation of truncated to . Using Algorithm 1, is easily estimable by replacing with the sample mean of Monte Carlo draws of . When the prior is centered at zero, and is the posterior expectation of the latent data unconditional on . By comparison, the zero-centered -prior with a Gaussian likelihood for produces a posterior expectation of the same form, but with the observed data in place of . This comparison is particularly insightful in the context of the rounded data scenario from Example 1. In this case, is the latent continuous version of the rounded data , so is a reasonable proxy for the unobservable continuous data in the posterior expectation (11).
). In particular, the Gaussian distribution is a special case of the selection normal:, where the moments and constraints on are irrelevant as long as . By generalizing these additional arguments, we obtain a richer class of conjugate priors:
Perhaps the most useful setting of Lemma 2 is sequential updating based on multiple datasets, say and . When these datasets are conditionally independent and follow the same model (1) and (3), the total posterior decomposes as . Under a Gaussian prior for , Lemma 1 implies that the partial posterior is selection normal, which serves as the prior for the second dataset update. Hence, Lemma 2 provides the direct updates from the partial posterior to the total posterior . The concurrent work of King2021 generalizes Lemma 2 for dynamic linear models in (2) to provide analytic recursions for filtering and smoothing (selection normal) distributions.
2.4 Prediction and nonlinear models
. Consider the posterior predictive distribution at thecovariate matrix . Because of the link in (1), the joint predictive distribution of can be expressed via the latent variables :
We consider multiple computing strategies for (functionals of) the posterior predictive distribution. The simplest option is to leverage the Monte Carlo sampling algorithms for (e.g., Theorem 3) to obtain latent predictive samples from , and then map those to via (1):
The result is a direct consequence of the conditional independence assumptions in (2) and the deterministic link in (1). The samples are discrete to match the support of and capture the joint dependence among predictive variables at multiple covariates . For the linear model (3), the posterior sampling step is available via Monte Carlo sampling (e.g., Algorithm 1). As a result, Lemma 3 produces Monte Carlo samples from the joint posterior predictive distribution. The latent predictive step is simply , so Lemma 3 requires minimal additional computational cost given .
More broadly, suppose the goal is exclusively predictive inference, so the posterior distribution of under model (1)–(2) is not required. In this context, it can be advantageous to circumvent the posterior sampling of in Lemma 3 and draw directly from the predictive distribution . To motivate this scenario, consider the following generalization of (3):
where is modeled as a nonlinear function of via the known basis functions , such as splines or wavelets. canale2011bayesian applied a version of model (13) for count data analysis, but required an MCMC algorithm for posterior inference.
Let denote the basis matrix. Basis expansions such as (13) are usually accompanied by a smoothness prior of the form , where controls the smoothness and is a known positive semidefinite penalty matrix. Scheipl2012 showed that and can be reparametrized such that the prior distribution of is unchanged, but is the identity and is diagonal. Hence, we proceed under the assumption that and .
Algorithm 2 provides Monte Carlo simulation from the joint predictive distribution across multiple points for a discrete and nonlinear regression model. Most notably, the key terms are computable without any matrix inversions and avoid any excess sampling steps from the posterior distribution of . The most demanding step is the draw of from the -dimensional truncated normal, but fortunately this draw is shared among all choices of points .
Algorithm 2 is not limited to the nonlinear basis expansions model (13): in fact, it provides predictive simulation from any linear model (1) and (3) with a prior of the form for a known positive semidefinite matrix . Although the reparametrization of changes the interpretation of and , it does not change the data-generating process, so the predictive draws remain valid. Further, the reparametrization is a one-time computing cost, yet the computational simplifications are realized for every draw from Algorithm 2. Again, this strategy circumvents unnecessary posterior sampling when the goal is exclusively predictive inference. A related sampling strategy for the -prior is provided in the supplementary material.
Analytic computations for predictive inference that avoid Monte Carlo sampling are also available. The predictive distribution is computable using marginal densities, , where is the constant term in the denominator of (8) and and are determined by the model (2), e.g., and in the linear model setting of Lemma 1. The joint distribution proceeds similarly via the latent data : where
for the linear model setting of Lemma 1. These direct probability computations of are most useful when is small, such as . The predictive probabilities also provide analytic point predictions, such as
2.5 Model and variable selection
, the hyperparameters in the model (such as), and linear (3) against nonlinear (13) models. This setting also applies to variable selection by replacing each covariate matrix with , where is a variable inclusion indicator and each corresponds to a model . We focus on two goals: model selection and prediction via model-averaging.
Given model probabilities for
, the posterior probability of each model is
where denotes the marginal likelihood under . When the prior on and the latent continuous data model for are both Gaussian in the form of (7) for each , we derive the posterior probabilities directly:
Lemma 4 is naturally informative for model selection, for example by selecting for which is largest. More broadly, the posterior probabilities in (15) enable model uncertainty in posterior and predictive inference. Focusing on the latter setting, consider the goal of prediction that averages over the uncertainty about the models . Using the predictive simulators, which now depend on the specific model , we propose the following Monte Carlo sampler: for , sample with from (15); then sample from Lemma 3 or Algorithm 2. This sampler produces joint and independent draws of . By computing functionals of only, we effectively marginalize over the model uncertainty within yet retain a discrete and joint predictive distribution at .
2.6 The sparse means problem for discrete data
The classical sparse means problem considers continuous data with the goal of identifying the nonzero means, (CvdV). Instead, suppose we observe discrete data under model (1)–(2). In the context of Example 1, this scenario corresponds to the normal means model for rounded data. We introduce sparsity in via the spike-and-slab model
where is the inclusion indicator with . Conditional on the inclusion indicators, this model is a special case of (3) with and , where the prior for the active components is a special case of the Gaussian priors considered previously. This scenario is also useful for wavelet models and can be extended to other spike-and-slab priors.
The sparse means model with prior (16) offers certain computational simplifications, but also introduces new challenges. In particular, inference on the inclusion indicators proceeds via Lemma 4. Since the marginal latent mean is zero and the latent covariance is diagonal, , the key term in (15) simplifies considerably:
In place of an -dimensional integral of a multivariate normal density, we obtain a product of univariate normal integrals, which is a massive advantage for computing with large .
Despite these gains for evaluating each , the search over the model space requires consideration of models, which is computationally prohibitive even for moderate . Instead, we may employ a stochastic search to sample from . Specifically, we apply a Gibbs sampler that cycles through for , which observes that
for the (full conditional) odds of inclusion
where the terms of the form are given by (17). We may expand this Gibbs sampler to incorporate a prior on the inclusion probability, (Scott2010), which adds a simple sampling step of the form with an accompanying update for in (18) at each Gibbs sweep of .
Inference on the regression coefficients is readily available by observing that . Given samples of using the above procedure, we sample from using the following fast version of Algorithm 1: for , sample truncated to ; sample ; and set . We set for all . This sampler is parallelizable across yet produces draws from the joint posterior of . The draws of are not needed for sampling the inclusion indicators , but can be used to obtain predictive samples by application of Lemma 3.
3 Empirical results
3.1 Direct Monte Carlo and Gibbs sampling
To demonstrate the computational advantages of direct Monte Carlo sampling for posterior inference, we compare the proposed sampler from Algorithm 1 with the Gibbs sampler from Kowal2020a. The two samplers are applied to the same linear regression model (3) with a -prior (see Algorithm 1). We fix at the maximum likelihood estimator (MLE; Kowal2021b) and set . The rounding and transformation operators for (1) are given by Example 2 with and the ECDF from Example 4, respectively.
We simulate negative binomial data with independent Gaussian covariates using the design from Kowal2020a. The focus here is exclusively on computational performance. For varying sample sizes and number of covariates , we run each algorithm for 1000 samples (after a burn-in of 1000 for the Gibbs sampler) and repeat for 25 replicates.
For each simulation, we compute the median effective sample size across the regression coefficients and report the result as the percent of the total number of simulations (1000). In addition, we record the raw computing time (using R on a MacBook Pro, 2.8 GHz Intel Core i7). The results are presented in Figure 1. Most notably, the proposed Monte Carlo sampler achieves maximal efficiency for all considered. By comparison, the Gibbs sampler produces 40-70% efficiency for , but the efficiency declines as increases, including only around 5-20% efficiency for . In addition, the raw computing times are comparable for regardless of , while the proposed Monte Carlo sampler incurs some additional cost only for . In general, the proposed Monte Carlo sampler demonstrates substantial efficiency gains with increasing , especially when is moderate.
3.2 Prediction for nonlinear regression with synthetic data
We evaluate the predictive distributions from count-valued nonlinear regression models using simulated data. We consider two data-generating processes: one with a Negative Binomial likelihood (Neg-Bin) and one that resembles (1) and (13) (Mixture-CDF). In both cases, we generate a latent smooth curve on a grid of points, where are orthonormal polynomials of degree and are independent normal variables with mean zero and decreasing variance . For Neg-Bin, we set the mean function to be and the dispersion to be , which produces moderately overdispersed counts. We bound these data at to prevent exceedingly large values. For Mixture-CDF, we specify the rounding operator as in Example 2 with . The transformation is defined akin to Examples 4 or 5: we smooth the CDF of a random variable that places mass on a distribution, mass uniformly on the set of heaps , and mass uniformly on the set of boundaries . The data-generating model (13) is completed by setting to be the mean of the latent data and . The simulations are repeated for 100 iterations in each case.
Monte Carlo samples from the predictive distribution under model (1) and (13) are generated according to Algorithm 2 for the following transformations: identity, square-root, and log (see Example 3); the ECDF transformation (see Example 4); and the Poisson- and Negative Binomial-based parametric transformations (see Example 5). In each case, we set and estimate using the MLE. In addition, we compute the predictive distribution from a model-averaged version of all of these transformations (with equal prior weights) using the sampler following Lemma 4. To streamline the results, only a subset of the transformations are presented, yet all are included in the model averaging. For benchmarking, we include Poisson and Negative Binomial spline regression models implemented in stan_gamm4 in the rstanarm package in R using the default specifications.
The predictive distributions are evaluated using ranked probability scores, which is a proper scoring rule for discrete data predictions (Czado2009). In addition, we compute mean interval width and empirical coverage (on out-of-sample testing points) for 90% pointwise prediction intervals. The results are presented in Figure 2.
For the Neg-Bin setting, the predictive distributions all perform similarly, with the proposed model-averaged predictions offering slight improvements over the fixed transformations. The prediction intervals are all conservative, with the surprising result that the Negative Binomial-based parametric transformation within the proposed framework offers narrower intervals than the Negative Binomial model, even though the latter model is a direct match for the data-generating process. One plausible reason is the computational challenges of the Negative Binomial sampler: it is difficult to learn both the dispersion parameter and the nonlinear mean function, and the output from stan_gamm4 regularly reports low effective sample sizes.
For the Mixture-CDF case, the gains of the proposed strategy are now more apparent: the ranked probability scores especially favor the ECDF transformation (Example 4) and the model-averaged predictions. The prediction intervals confirm these advantages, with an additional warning regarding the Poisson model: the intervals remain narrower as in the previous case, but the empirical coverage is now far below the nominal 90%. Again the nonparametric transformation and the model-averaged predictions perform similarly: the model probabilities place nearly all of the weight on the ECDF transformation, so this result is consistent.
3.3 Sparse means estimation for synthetic rounded data
Lastly, we evaluate the selection capabilities of the sparse means model under rounding. The goal is to assess (i) whether an omission of the rounding operator impacts selection ability and (ii) whether the transformation is helpful for this task. Synthetic rounded data are generated as with and , where and denotes the signal strength with variable inclusion indicator . We use a small or moderate proportion of signals, . Results are presented for and ; increasing uniformly improved results and produced similar conclusions.
For all competing methods, we apply the same sparsity prior (16) accompanied by and . We include two variations of the proposed approach, both with the rounding operator from Example 1: one uses the identity transformation (as in the data-generating process) and the other uses the ECDF transformation (see Example 4). In addition, we include a Gaussian model that omits the rounding operation entirely. To maintain focus on the modeling specifications, we apply identical Gibbs sampling strategies as outlined in Section 2.6; the Gaussian case simply replaces (18) with the relevant Gaussian likelihoods. The additional Gibbs sampling step for all models is truncated to . Lastly, we fix at the MLE from a kmeans model with two clusters.
The primary inferential target is , which we use to select the nonzero means. The receiving operator characteristic (ROC) curves for this selection mechanism (averaged over 100 simulations) and the corresponding area under the curve (AUC) values (presented across 100 simulations) are in Figure 3. Perhaps most surprising, the ECDF transformation from Example 4 provides the best selection capabilities by a wide margin. Comparing the identity models with and without rounding, we find that the rounding operator offers only minor advantages for selection. Nonetheless, the best performer belongs to the proposed modeling class (1)–(2), which is a coherent data-generating process for the discrete (rounded) data.
Discrete data models for count and rounded data are often constructed by coarsening (or rounding) a latent continuous data model. These models have demonstrated widespread utility in a variety of applications, including toxicity studies and marketing data (canale2011bayesian), tumor counts and asthma inhaler usage (canale2013nonparametric), deer-vehicle collisions (Siegfried2020), species counts and hospital utilization (Kowal2020a), and self-reported mental health (Kowal2021b), among others. For this broad class of discrete data models, we have introduced conjugate priors with closed-form posteriors. These results are directly useful for analytical characterization of the posterior distribution and Monte Carlo simulation for efficient posterior and predictive inference. We developed customized Monte Carlo simulation algorithms for linear and nonlinear regression, prediction, and model averaging, along with a stochastic search algorithm for the sparse means problem with count or rounded data. Using simulated data, we demonstrated substantial improvements in computational performance, predictive accuracy, and variable selection relative to competing methods.
Despite the generality of the linear model (1) and (3), the posterior and predictive derivations treat the variance components as known. Our analyses estimate using maximum likelihood, although other estimators are available. Alternatively, we may place a prior on and marginalize over this unknown parameter. When is assigned an inverse-Gamma prior, the joint distribution of as in (7), but unconditional on , is a multivariate -distribution. The posterior distribution is therefore a selection -distribution, for which the density and distribution functions are given in ArellanoValle2006.
For more general regression scenarios with
or with hyperpriors onsuch as for hierarchical modeling, the posterior derivations remain relevant and provide a blocked Gibbs sampling strategy. Let denote the parameters excluding . A blocked Gibbs sampler proceeds iteratively by drawing and . The first block uses the selection normal posterior for , while