The binomial distribution is commonly used to model the number of successes obtained in a finite number of trials. In practice, however, the variance of the response variable often exceeds the theoretical variance of the binomial distribution. This phenomenon, known as extra-binomial variation (overdispersion), can lead to underestimated standard errors, loss of efficiency of the estimates and underestimation of the variance, which in turn can generate incorrect inferences about the regression parameters or the credible intervals (Williams, 1982; Cox, 1983; Collet, 1991).
There are several approaches to the study of overdispersed binomial datasets. Hinde and Demétrio (1998) categorized the majority of overdispersed binomial models into two classes: (1) models in which a more general form of the variance function is assumed by adding extra parameters; and (2) models in which the parameter of the distribution of the response variable is itself assumed to be a random variable. In the first class, the double exponential family of distributions yields double binomial models, which include a second parameter that controls the variance of the response variable independently of the mean and that can be modeled from a subset of the explanatory variables (Efron, 1986). In the second class, the beta binomial distribution results from assuming that the response variable follows a binomial distribution whose probability parameter follows a beta distribution. Based on the parameterization of the beta distribution in terms of its mean and dispersion parameters (Jorgensen, 1997), a parameterization of the beta binomial distribution in terms of its mean and dispersion parameters is presented in Cepeda-Cuervo and Cifuentes-Amado (2017).
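As a minimal numerical sketch of the second class, the following Python code evaluates a beta binomial probability function in a mean and dispersion parameterization and compares its variance with the binomial variance. It assumes the mapping alpha = mu * phi and beta = (1 - mu) * phi, which is one common choice and may differ from the parameterization of the cited papers; the names `mu`, `phi` and the numerical values are illustrative only.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def beta_binomial_pmf(y, n, mu, phi):
    """P(Y = y) when Y | p ~ Binomial(n, p) and p ~ Beta(mu*phi, (1-mu)*phi).

    The mapping (mu, phi) -> (alpha, beta) is an assumed parameterization.
    """
    alpha, beta = mu * phi, (1.0 - mu) * phi
    log_choose = lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
    return exp(log_choose + log_beta(y + alpha, n - y + beta) - log_beta(alpha, beta))


def mean_and_variance(pmf):
    """Mean and variance of a pmf given as a list indexed by y = 0, ..., n."""
    mean = sum(y * p for y, p in enumerate(pmf))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))
    return mean, var


n, mu, phi = 20, 0.3, 5.0  # illustrative values, not taken from the paper
pmf = [beta_binomial_pmf(y, n, mu, phi) for y in range(n + 1)]
bb_mean, bb_var = mean_and_variance(pmf)
binomial_var = n * mu * (1.0 - mu)  # variance of Binomial(n, mu) for comparison
print(bb_mean, bb_var, binomial_var)
```

Under this assumed parameterization the mean is n mu, as in the binomial case, while the variance is n mu (1 - mu) [1 + (n - 1) / (phi + 1)], so smaller values of phi correspond to stronger overdispersion.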
Despite the versatility of the beta distribution, Hahn (2008) proposed the rectangular beta distribution, a combination of the beta distribution and the uniform distribution, to admit heavier tails than the beta distribution allows. Subsequently, Hahn and López Martín (2015) introduced the tilted beta distribution, which has the beta rectangular and the beta distributions as particular cases.
In this article, we generalize the beta binomial regression models for fitting overdispersed binomial count datasets (Cepeda-Cuervo and Cifuentes-Amado, 2017) by introducing the tilted beta binomial linear regression model. To this end, the tilted beta binomial probability function is defined by assuming that the parameter of the binomial distribution follows the mean-parameterized tilted beta distribution. In addition, the beta rectangular binomial models are presented as particular cases of the proposed model, obtained by assuming that the parameter of the binomial distribution follows a beta rectangular distribution. The proposed models are fitted using Bayesian methods. Finally, to illustrate the tilted beta binomial model, we fit it to a seed germination count dataset and compare it with the rectangular beta binomial model and the binomial model using their DIC values.
This paper is organized as follows. After the introduction, in Section 2, the tilted and the reparameterized tilted beta distributions are presented. In Section 3, the tilted beta binomial distribution is introduced and the rectangular beta binomial distribution is presented as a particular case. In Section 4, the tilted beta binomial linear regression model is defined. Finally, in Section 5, we analyze how the proportion of seeds that germinated on each of 21 dishes is influenced by the type of seed and root, by fitting a tilted beta binomial linear regression model using the OpenBUGS software. The performance of the proposed model is compared with that of the binomial and beta binomial regression models.
2 The Tilted Beta Distribution
In many fields there is a need to model continuous random variables that take values in a bounded interval as a function of a set of explanatory variables. Cepeda-Cuervo (2001) proposed beta regression models in which the mean and dispersion parameters follow regression structures (see also Cepeda and Gamerman (2005) and Cepeda-Cuervo and Garrido (2015)). If the continuous variable takes values in a general bounded open interval, a beta regression model can still be proposed after rescaling the variable to the unit interval. However, in order to admit heavier tails than the beta distribution allows, Hahn (2008) proposed the rectangular beta distribution, which, like the beta distribution, has the open interval (0, 1) as its domain. The rectangular beta distribution is a convex combination of the beta distribution and the uniform distribution. Subsequently, Hahn and López Martín (2015) proposed the tilted beta distribution, a mixture of the beta distribution and the tilted distribution, which has the beta rectangular distribution and the beta distribution as particular cases. This section presents a reparameterization of the tilted beta distribution proposed by Hahn and López Martín (2015) in terms of the mean and the dispersion parameters of the beta distribution and the mean of the tilted beta distribution. The tilted beta binomial distribution then results from assuming that the probability parameter of the binomial distribution follows the reparameterized tilted beta distribution.
2.1 The Tilted Distribution
A random variable follows a tilted distribution with a parameter (Hahn and López Martín, 2015) if its density is given by:
The mean of , denoted , is equal to . By reparameterizing (1) in terms of the mean, the density function is defined by:
The moments of a random variable that follows the density function (2) are given by:
Its variance is given by:
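The quantities of this subsection can be checked numerically with a short sketch. It assumes that the tilted density has the linear form f(x | v) = 2v - 2(2v - 1)x on (0, 1), with 0 <= v <= 1, which is how we read the distribution of Hahn and López Martín (2015); the symbol names `v` and `delta` and the helper functions are illustrative and are not taken from the equations of this paper.

```python
def tilted_pdf(x, v):
    """Assumed tilted density on (0, 1): linear in x, with 0 <= v <= 1."""
    return 2.0 * v - 2.0 * (2.0 * v - 1.0) * x


def tilted_moment(r, v):
    """r-th raw moment, i.e. the integral of x**r * tilted_pdf(x, v) over (0, 1)."""
    return 2.0 * v / (r + 1) - 2.0 * (2.0 * v - 1.0) / (r + 2)


def tilted_pdf_mean_form(x, delta):
    """Same density written in terms of its mean delta = (2 - v) / 3."""
    v = 2.0 - 3.0 * delta
    return tilted_pdf(x, v)


def midpoint_integral(f, steps=20000):
    """Midpoint-rule approximation of the integral of f over (0, 1)."""
    h = 1.0 / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))


v = 0.8  # illustrative value
delta = tilted_moment(1, v)                     # mean of the tilted density
variance = tilted_moment(2, v) - delta ** 2     # variance from the first two moments
total_mass = midpoint_integral(lambda x: tilted_pdf(x, v))
print(delta, variance, total_mass)
```

Under this assumption the mean ranges over [1/3, 2/3] as v ranges over [0, 1], and v = 1/2 recovers the uniform density with mean 1/2 and variance 1/12.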
2.2 Reparameterized Tilted Beta Distribution
The tilted beta distribution was introduced by Hahn and López Martín (2015) as the convex combination of the tilted distribution and the beta distribution. If this distribution is obtained by combining the mean-parameterized tilted distribution (2) with the beta distribution parameterized in terms of its mean and dispersion, the density function of the tilted beta distribution is given by (4):
where . The notation is used to denote that follows a tilted beta distribution. Since the -th-moment of is given by:
the mean and the variance of the tilted beta distribution are:
3 (,,,) - Tilted Beta Binomial Distribution
Let be a random variable that follows the binomial distribution, where follows the tilted beta distribution, . Then follows a tilted beta binomial distribution with parameters , , and , denoted by . The probability function of this distribution is given by:
where denotes the beta function and denotes the probability function of the beta binomial distribution, parameterized in terms of the mean and the dispersion parameters.
The behavior of the ()-tilted beta binomial probability function is illustrated in Figure 1 for different vectors of parameter values:
The mean and variance of a random variable that follows the ()-tilted beta binomial probability function are given by:
where , denote the mean and variance of the beta distribution, respectively, and , denote the mean and variance of the tilted beta distribution.
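The construction of this section can be sketched in code as a two-component mixture: a beta binomial part and a part obtained by averaging the binomial probability over the tilted density. The sketch assumes the linear tilted density 2v - 2(2v - 1)p on (0, 1), a beta component with parameters mu * phi and (1 - mu) * phi, and a mixing weight `theta` placed on the beta component; these conventions and all names are assumptions made for illustration and may differ from the notation of the equations above.

```python
from math import exp, lgamma


def beta_binomial_pmf(y, n, mu, phi):
    """Beta binomial pmf with assumed Beta(mu*phi, (1-mu)*phi) mixing density."""
    alpha, beta = mu * phi, (1.0 - mu) * phi
    log_choose = lgamma(n + 1) - lgamma(y + 1) - lgamma(n - y + 1)
    log_num = lgamma(y + alpha) + lgamma(n - y + beta) - lgamma(n + alpha + beta)
    log_den = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return exp(log_choose + log_num - log_den)


def tilted_binomial_pmf(y, n, v):
    """Binomial pmf averaged over the assumed tilted density 2v - 2(2v-1)p.

    Uses the closed forms of the integrals of C(n,y) p^y (1-p)^(n-y) and of
    C(n,y) p^(y+1) (1-p)^(n-y) over (0, 1): 1/(n+1) and (y+1)/((n+1)(n+2)).
    """
    return (2.0 * v * (n + 2) - 2.0 * (2.0 * v - 1.0) * (y + 1)) / ((n + 1) * (n + 2))


def tilted_beta_binomial_pmf(y, n, mu, phi, v, theta):
    """Mixture pmf; theta is the assumed weight of the beta binomial component."""
    return (theta * beta_binomial_pmf(y, n, mu, phi)
            + (1.0 - theta) * tilted_binomial_pmf(y, n, v))


n, mu, phi, v, theta = 20, 0.3, 5.0, 0.8, 0.7  # illustrative values only
pmf = [tilted_beta_binomial_pmf(y, n, mu, phi, v, theta) for y in range(n + 1)]
tbb_mean = sum(y * p for y, p in enumerate(pmf))
# Law of total expectation: n times the mean of the mixing density.
expected_mean = n * (theta * mu + (1.0 - theta) * (2.0 - v) / 3.0)
print(sum(pmf), tbb_mean, expected_mean)
```

With v = 1/2 the tilted component reduces to the discrete uniform probability 1 / (n + 1), which corresponds to the beta rectangular binomial special case under the same assumptions.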
3.1 (,,)-Beta Rectangular Binomial Distribution
Let be a random variable that follows the binomial distribution, where follows the beta rectangular distribution (8). Thus, follows the (,,)-beta rectangular binomial distribution. This probability function can be obtained as a particular case of the tilted beta binomial distribution (9), by replacing by :
4 Tilted Beta Binomial Regression Model
Let , , be independent random variables with tilted beta binomial distribution. Let , and be the covariate vectors of the , and regression structures, and let , and be the respective regression parameter vectors, such that:
Thus, if is assumed to be constant, the likelihood function of the -regression model is:
where represents the beta function.
In order to define the Bayesian tilted beta binomial regression model, the following prior distributions are assumed for , , and :
5 Seeds Germination Regression Models
The dataset analyzed in this section is available in Spiegelhalter et al. (2003) and corresponds to the number of seeds that germinated out of an initial number placed on each of 21 dishes, arranged according to a 2 by 2 factorial design (2 seed types and 2 root types). These data were initially reported by Crowder (1978). The variables involved in the experiment are described below:
y: number of seeds germinated in each dish.
n: number of seeds initially arranged in each dish.
x1: seed type, (1) if it is O. aegyptiaca 75 and (2) if it is O. aegyptiaca 73.
x2: root type, (1) if it is bean and (2) if it is cucumber.
In this experiment, there are 21 observations (21 dishes). Since the variable counts the number of germinated seeds in each dish, this variable can be modeled by a linear regression TBB(,,,) model, which includes all the explanatory variables in each of the regression structures. After the process of eliminating the explanatory variables, the best model (the model with smallest DIC value) has the following regression structures:
where . The TBB(,,,) model was fitted to the data using OpenBUGS, a free program for Bayesian inference based on Gibbs sampling (Spiegelhalter et al., 2003). The posterior parameter inferences, obtained from a sample of size 100000 with a burn-in of the first 10000 iterations and keeping one draw out of every 10 iterations to reduce autocorrelation, are summarized in Table 1. The DIC value of this model is 121.9.
| Parameter | Mean | S.D. | 95% Cred. Interval | M.C. Error |
In Table 1, the M.C. error denotes an estimate of the Monte Carlo standard error, which measures the distance between the posterior estimate of the mean and the mean of the posterior distribution, and which is expected to converge to zero as the number of iterations goes to infinity. The Monte Carlo error estimates obtained with the OpenBUGS software (see Flegal, 2008) are close to zero for all the regression parameters. According to Figure 2, Pearson's residuals are close to zero, taking values between -0.4 and 0.2, and show no trend across the iterations.
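The sampling scheme and the Monte Carlo error described above can be illustrated with a small sketch. It assumes that the 100000 iterations include the burn-in and uses a batch means estimator, which is one common way of estimating the Monte Carlo standard error and is not necessarily the exact computation performed by OpenBUGS; the simulated chain is a stand-in for a real posterior sample, and the function names are hypothetical.

```python
import random
from math import sqrt


def burn_and_thin(chain, burn_in, thin):
    """Drop the first `burn_in` draws, then keep one draw out of every `thin`."""
    return chain[burn_in::thin]


def batch_means_mc_error(draws, n_batches=30):
    """Batch means estimate of the Monte Carlo standard error of the mean."""
    batch_size = len(draws) // n_batches  # trailing draws beyond full batches are ignored
    batch_means = [
        sum(draws[b * batch_size:(b + 1) * batch_size]) / batch_size
        for b in range(n_batches)
    ]
    grand_mean = sum(batch_means) / n_batches
    var_batch_means = sum((m - grand_mean) ** 2 for m in batch_means) / (n_batches - 1)
    return sqrt(var_batch_means / n_batches)


rng = random.Random(1)
# Stand-in for the chain of one regression parameter (independent normal draws).
chain = [rng.gauss(0.0, 1.0) for _ in range(100000)]
kept = burn_and_thin(chain, burn_in=10000, thin=10)
mc_error = batch_means_mc_error(kept)
print(len(kept), mc_error)
```

Under these assumptions 9000 draws are retained per chain, and the estimated Monte Carlo error shrinks roughly with the square root of the number of retained draws.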
5.1 Chain convergence in the tilted beta binomial model
In the parameter estimation process, three posterior samples were generated from different starting values. In all chains, the autocorrelation is close to zero for lags greater than or equal to 10 and a burn-in larger than 10000.
To check the convergence of the chains, two convergence diagnostics were applied: the Geweke diagnostic (Geweke, 1992) and the Brooks-Gelman-Rubin diagnostic (Brooks and Gelman, 1998). The Geweke-Brooks plot for the chains of the regression parameters is shown in Figure 3, where the value of the statistic is plotted against the number of iterations in order to determine the burn-in of the chains. This figure shows that the statistic remains within the acceptance zone even for a burn-in period equal to zero. The second method, the Brooks-Gelman-Rubin convergence diagnostic, compares the within-chain and between-chain variances through an estimate of the scale reduction factor R. Values of R well above 1 indicate that the chains have not converged. Figure 4 shows that, for the regression parameters of this example, the R factor is very close to 1 after 1000 iterations.
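The comparison of within-chain and between-chain variances can be sketched as follows. The code implements the classical variance-based form of the scale reduction factor, which may differ in detail from the version plotted by OpenBUGS; the simulated chains and the function name are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Variance-based scale reduction factor for equally long chains."""
    n = len(chains[0])
    within = mean([variance(c) for c in chains])          # average within-chain variance
    between_over_n = variance([mean(c) for c in chains])  # variance of the chain means
    pooled = (n - 1) / n * within + between_over_n        # pooled variance estimate
    return sqrt(pooled / within)


rng = random.Random(2)
# Three chains sampling the same distribution (converged case).
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(3)]
# Three chains centered at different values (non-converged case).
stuck = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0.0, 5.0, 10.0)]
r_mixed = potential_scale_reduction(mixed)
r_stuck = potential_scale_reduction(stuck)
print(r_mixed, r_stuck)
```

In this sketch the chains that share a common distribution give a factor close to 1, whereas the chains centered at different values give a factor well above 1.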
5.2 Models comparison
In order to assess the performance of the proposed model, the following models were also fitted to the seed germination dataset: binomial, beta binomial and beta rectangular binomial. The deviance and the deviance information criterion (DIC) for each of these models are given in Table 2, which shows that the lowest average deviance and the lowest DIC values correspond to the tilted beta binomial and the beta rectangular binomial models; of these, the former has the lowest DIC value and is therefore the best model.
| Mean | S.D. | 95% Cred. Interval | Median |
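The model comparison above relies on the standard definition DIC = Dbar + pD, where Dbar is the posterior mean of the deviance and pD = Dbar - D(posterior mean of the parameters) is the effective number of parameters. A minimal sketch of this computation is given below; the deviance values and model names are hypothetical and are not the values of Table 2.

```python
def dic_summary(deviance_draws, deviance_at_posterior_mean):
    """Return the mean deviance, the effective number of parameters and the DIC."""
    mean_deviance = sum(deviance_draws) / len(deviance_draws)
    p_d = mean_deviance - deviance_at_posterior_mean
    return {"Dbar": mean_deviance, "pD": p_d, "DIC": mean_deviance + p_d}


def best_by_dic(dic_by_model):
    """Name of the model with the smallest DIC value."""
    return min(dic_by_model, key=dic_by_model.get)


# Hypothetical deviance draws and plug-in deviance for a single model.
summary = dic_summary([10.0, 12.0, 14.0], deviance_at_posterior_mean=11.0)
# Hypothetical DIC values for two competing models.
best = best_by_dic({"model_a": 13.0, "model_b": 15.5})
print(summary, best)
```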
6 Conclusions
In this paper, two new distributions are proposed: the tilted beta binomial distribution and the beta rectangular binomial distribution. From these distributions, assuming that their parameters follow regression structures, new overdispersion regression models for count data are proposed: the tilted beta binomial regression model and the beta rectangular binomial regression model. These models are fitted using Bayesian methods and, in the application, show better performance than the beta binomial regression models in the statistical analysis of the seed germination dataset.
Because the tilted beta distribution is flexible and assigns, in varying amounts, greater likelihood than the beta distribution to extreme tail-area events, it can accommodate different relative likelihoods of high versus low extreme tail-area events. Thus, the proposed tilted beta binomial regression model, which defines a more general overdispersion regression model than the beta binomial regression model, can accommodate count events with high or low likelihood of occurrence and can improve the estimation of the regression parameters, the credible (or confidence) intervals and the statistical inferences in the analysis of overdispersed binomial-type data.
References
- Brooks and Gelman (1998) Brooks, S. and Gelman, A. (1998). General methods for monitoring convergence of iterative simulations. J. Comput. Graph. Stat., 7:434–455.
- Cepeda and Gamerman (2005) Cepeda, E. and Gamerman, D. (2005). Bayesian methodology for modeling parameters in the two parameter exponential family. Revista Estadistica, 57(168-169):93–105.
- Cepeda-Cuervo (2001) Cepeda-Cuervo, E. (2001). Variability Modeling in Generalized Linear Models. Unpublished Ph.D. thesis, Mathematics Institute, Universidade Federal do Rio de Janeiro.
- Cepeda-Cuervo and Cifuentes-Amado (2017) Cepeda-Cuervo, E. and Cifuentes-Amado, M. V. (2017). Double generalized beta-binomial and negative binomial regression models. Revista Colombiana de Estadística, 40:141.
- Cepeda-Cuervo and Garrido (2015) Cepeda-Cuervo, E. and Garrido, L. (2015). Bayesian beta regression models with joint mean and dispersion modeling. Monte Carlo Methods and Applications, 21(1):49–58.
- Collet (1991) Collet, D. (1991). Modeling Binary Data. Chapman & Hall, London.
- Cox (1983) Cox, D. (1983). Some remarks on overdispersion. Biometrika, 70(1):269–274.
- Crowder (1978) Crowder, M. J. (1978). Beta-binomial anova for proportions. Applied Statistics, 27:34.
- Efron (1986) Efron, B. (1986). Double exponential families and their use in generalized linear regression. J. Amer. Statist. Assoc., 81(395):709–721.
- Flegal (2008) Flegal, J. M. (2008). Doctoral dissertation, University of Minnesota. Major: Statistics.
- Geweke (1992) Geweke, J. (1992). Evaluating the accuracy of sampling-based approaches to calculating posterior moments. In Bernardo, J. M., Berger, J., Dawid, A. P., and Smith, J. F. M., editors, Bayesian Statistics 4, pages 169–193. Oxford University Press, Oxford.
- Hahn (2008) Hahn, E. (2008). Mixture densities for project management activity times: A robust approach to PERT. European Journal of Operational Research, 188:450–459.
- Hahn and López Martín (2015) Hahn, E. and López Martín, M. (2015). Robust project management with the tilted beta distribution. SORT, 39:253–272.
- Hinde and Demétrio (1998) Hinde, J. and Demétrio, C. (1998). Overdispersion: Models and estimation. Computational Statistics & Data Analysis, 27:151–170.
- Jorgensen (1997) Jorgensen, B. (1997). Proper dispersion models. Brazilian Journal of Probability and Statistics, 11(2):89–128.
- Spiegelhalter et al. (2003) Spiegelhalter, D., Thomas, A., Best, N., and Lunn, D. (2003). WinBUGS user manual. MRC Biostatistics Unit, Institute of Public Health, Cambridge, UK.
- Williams (1982) Williams, D. (1982). Extra-binomial variation in logistic linear models. Appl. Statist., 2:144–188.