1 Introduction
The concept of regression is very important in statistical data analysis (Jrgensen, 1997). In this context, generalized linear models (Nelder & Wedderburn, 1972) are regression models for response variables in the exponential family. Let be
independent random variables. According to
McCullagh & Nelder (1989), the random component of a generalized linear model (GLM) is specified by the probability density function
where is the canonical parameter, and are known appropriate functions. The mean and the variance of arerespectively, where is called the variance function, is regarded as a precision parameter and
vary in an interval of the real line. The variance function characterizes the distribution. For the normal, gamma, inverse Gaussian, binomial and Poisson distributions, we have
and , respectively. However, the linear exponential families with given meanvariance relationships do not always exist (BarLev & Enis, 1986).The main aim of this paper is to propose a regression model that is tailored for situations where the response variable is measured continuously on the positive real line that is in several aspects, like the generalized linear models. In particular, the proposed model is based on the assumption that the response is beta prime (BP) distributed. We considered a new parameterization of the BP distribution in terms of the mean and precision parameters. Under this parameterization, we propose a regression model, and we allow a regression structure for the mean and precision parameters by considering the mean and precision structure separately. The variance function of the proposed model assumes a quadratic form. The proposed regression model is convenient for modeling asymmetric data, and it is an alternative to the generalized linear models when the data presents skewness (see Section 2). Inference, diagnostic and selection tools for the proposed class of models will be presented.
The BP distribution (Keeping, 1962; McDonald, 1984)
is also known as inverted beta distribution or beta distribution of the second kind. However, only a few works have studied the BP distribution.
McDonald (1987) discussed its properties and obtained the maximum likelihood estimates of the model parameters. McDonald & Butler (1990) and Tulupyev et al. (2013) used the BP distribution while discussing regression models for positive random variables. However, these works have considered the usual parameterization of the BP distribution. For more detail and some generalizations of BP distribution, see Johnson et al. (1995).A random variable follows the BP distribution with shape parameters and , denoted by
, if its cumulative distribution function (cdf) is given by
(1) 
where is the incomplete beta function ratio, is the incomplete function, is the beta function and is the gamma function.
The probability density function (pdf) associated with (1) is
(2) 
It can be proved that the BP density function is decreasing with as , if ; is decreasing with mode at , if ; and increases and then decreases with mode at for . Furthermore, if , is concave; if , is convex and then upward, with inflection point at ; and if , is concave, then downward, then upward again, with inflection points at and , where
and
There are some stochastic representation of the BP random variable. Let follow the beta distribution with positive parameters and , then
(3) 
where stands for equality in distribution. Furthermore, let and
be independent random variables following standard (scale parameter equal to 1) gamma distributions with shape parameters
and , respectively, i.e., and , then(4) 
Note that we can use the stochastic representations (3) and (4) for estimating the parameters of the BP model (based on the EMalgorithm).
Remark 1.1.
The BP distribution can be written in the exponential family form, but is not a dispersion model. Then, the BP distribution has a twoparameter general exponential distribution with natural parameters
and and natural sufficient statistics and .The
th moment about zero of
is given byFor and , the th moment about zero simplifies to
In particular, the mean and the variance associated with (2) are given by
(5) 
respectively. Furthermore, and , where is the digamma function.
Some other properties of the BP distribution are as follows. If , then: (i) , that is, the BP distribution is closed under reciprocation; (ii) , that is, the BP distribution is closed under scale transformations; (iii) has positive skewness, but due to its flexibility, symmetric data can also be modeled by the BP distribution (see Figure 1); (iv) has several stochastic representations, see (3) and (4). It should be highlighted that, all of these properties are shared by few distributions, in particular, several of them are not shared by exponential family distributions used in GLM.
Furthermore, the BP distribution has several related distributions. In particular,

If has a Lomax distribution (Lomax, 1954), also known as a Pareto Type II distribution, with shape parameter and scale parameter , then ;

If has a Pareto distribution (Arnold, 2015) with minimum and shape parameter , then ;

If has a standard Pareto Type IV distribution (Rodriguez, 1977) with shape parameter and inequality parameter , then ;

If , then
has the F distribution
(Phillips, 1982) with degrees of the freedom in the numerator and degrees of freedom in the denominator.
The rest of the paper proceeds as follows. In Section 2, we introduce a new parameterization of the BP distribution that is indexed by the mean and precision parameters. Section 3 presents the BP regression model with varying mean and precision. Diagnostic measures are discussed in Section 4. In Section 5, some numerical results of the estimators are presented with a discussion of the obtained results. In Section 6, we discuss an application to real data that demonstrates the usefulness of the proposed model. Concluding remarks are given in the final section.
2 A BP distribution parameterized by its mean and precision parameters
Regression models are typically constructed to model the mean of a distribution. However, the density of the BP distribution is given in Equation (2), where it is indexed by and . In this context, in this section, we considered a new parameterization of the BP distribution in terms of the mean and precision parameters. Consider the parameterization and , i.e., and . Under this new parameterization, it follows from (5) that
From now on, we use the notation to indicate that is a random variable following a BP distribution with mean and precision parameter . Note that is similar to the variance function of the gamma distribution, for which the the variance has a quadratic relation with its mean. In fact, Morris (1982) wrote: “There are six basic natural exponential families having quadratic variance functions: normal, Poisson, gamma, binomial, negative binomial (geometric), and generalized hyperbolic secant distributions”. However, the BP distribution can be written in the exponential family form (see Remark 1.1) and has quadratic variance function (using the proposed parameterization). We note that this parameterization was not proposed in the statistical literature. Using the proposed parameterization, the BP density in (2) can be written as
(6) 
where and . Figure 1 displays some plots of the density function in (6) for some parameter values. It is evident that the distribution is very flexible and it can be an interesting alternative to other distributions with positive support.
The gamma (GA), reparameterized BirnbaumSaunders (RBS) (SantosNeto et al., 2016)
and BP distributions have common properties such as unimodality in the density function and quadratic variance function. The most important indices of the shape of a distribution are the skewness and kurtosis. Table
1 gives a summary of the two indices for the GA, RBS and BP distributions.Skewness  Kurtosis  

GA  
RBS  
BP 
3 BP regression model
Let be independent random variables, where each , , follows the pdf given in (6) with mean and precision parameter . Suppose the mean and the precision parameter of satisfies the following functional relations
(7) 
where and
are vectors of unknown regression coefficients which are assumed to be functionally independent,
and , with , and are the linear predictors, and and are observations on and known regressors, for . Furthermore, we assume that the covariate matrices and have rank and , respectively. The link functions and in (7) must be strictly monotone, positive and at least twice differentiable, such that and , with and being the inverse functions of and , respectively.The loglikelihood function has the form
(8) 
where
The score vector, obtained by differentiating the loglikelihood function, given in (8), with respect to the and has components
where , , , , and are given in Appendix A.
The maximum likelihood (ML) estimators and of and , respectively, can be obtained by solving simultaneously the nonlinear system of equations and . However, no closedform expressions for the ML estimates are possible. Therefore, we must use an iterative method for nonlinear optimization. We can estimate the parameters of the BP regression model defined in (7) using the gamlss function, contained in R (R Core Team, 2017) package of the same name (Stasinopoulos & Rigby, 2007). Theoretical results of this paper have been implemented in this programming language. In special, the glmBP package provides functions for fitting BP regression models using the gamlss function. Loglikelihood maximizations were carried out using the RS nonlinear optimization algorithm (Stasinopoulos & Rigby, 2007). The current version can be downloaded from GitHub via
This package contains a collection of routines for analyzing data from BP distribution. For example, the functions dBP, pBP, qBP and rBP
implement the density function, distribution function, quantile function and random number generation for the BP distribution. Other functions are:
residuals.pearson (Pearson residuals), envelope.BP (simulated envelope) and diag.BP (local influence).When is large and under some regularity conditions (see Cox & Hinkley, 1983), we have that
where means “approximately distributed”. Also, may be approximated by , where , , is the Hessian matrix evaluated at . The matrices and are provided in Appendix B. We then use the estimated parameters in to obtain and obtain
asymptotic confidence intervals (CI) for
.Consider the regression model defined in (7) and the corresponding loglikelihood function given in (8
). A test of the null hypothesis that the precision is constant can be conducted by testing the null hypothesis that
versus the alternate hypothesis that for at least one . Under (null hypothesis), large and regularity conditions, the likelihood ratio (LR) statistic follows a distribution with degrees of freedom. For testing , we can write the LR statistic asLR  (9)  
where and denote the restricted and unrestricted maximum likelihood estimators, respectively.
4 Diagnostic analysis
4.1 Residuals
In this section, to study departures from the error assumptions as well as the presence of outlying observations, we will consider two kinds of residuals for the BP regression model defined in (7).
4.1.1 Quantile residuals
Consider the cdf of the BP distribution given in (1). Since is continuous, then . Then, the quantile residual (Dunn & Smyth, 1996) under the BP regression model is defined by
where
is the cumulative distribution function of the standard normal distribution and
and are the maximum likelihood estimators (MLE’s) of and . Except for the variability due to the estimators of the parameters, the distribution of the quantile residuals is standard normal.4.1.2 Pearson residuals
The second residual is based on the difference and given by
where and are the MLE’s of and . For more details, see Leiva et al. (2014) and SantosNeto et al. (2016). According to Nelder & Wedderburn (1972) a disadvantage of the Pearson residual is that the distribution for nonnormal distributions is often markedly skewed, and so it may fail to have properties similar to those of the normaltheory residual. Despite this, we will study this residual because there is no information of its performance when considering the BP regression models.
4.2 Local influence
The local influence approach based on normal curvature is recommended when the concern is related to investigate the model sensitivity under some minor perturbations in the model (Cook, 1986). The likelihood displacement is defined as , where is the MLE of for a perturbed model and is a perturbation vector. Cook (1986) proposed to study the local behavior of around , which represents the null perturbation vector, such that .
The normal curvature for at the arbitrary direction , with , is given by , where is the Hessian matrix of evaluated at and is a perturbation matrix with elements
and being the loglikelihood function corresponding to the model perturbed by . The maximization of
is equivalent to finding the largest eigenvalue
of the matrix and the direction of largest perturbation around , denoted by, is the corresponding eigenvector. The index plot of
may be helpful in assessing the influence of small perturbation on likelihood displacement.In this work, the main idea is to study the normal curvatures for and in the unitary direction available at and . Such curvatures are expressed as , where
in case the interest is only on the vector , and , where
if the interest lies in studying the local influence on . To assess the influence of the perturbations we may consider the index plot for the normal curvature , where is the th diagonal element of . Lesaffre & Verbeke (1998) suggested to pay a special attention in those observations with , where .
In this paper, we calculate the matrix for five different perturbation schemes, namely: case weighting perturbation, response perturbation, mean covariate perturbation, precision covariate perturbation, and simultaneous covariate perturbation. These results are presented with details in Appendix C. Figure 2 presents a diagram with the perturbation matrices under different schemes.
5 Monte Carlo studies
In this section, we present two simulation scenarios: (i) The first batch of Monte Carlo experiments is conducted to assess the finite sample behavior of the MLE’s; and (ii) We perform a simulation study to examine the distributions of the residuals and .
For scenarios I and II, the Monte Carlo experiments were carried out using
(10) 
where the covariates and , for were generated from a standard uniform. The values of all regressors were kept constant during the simulations. The number of Monte Carlo replications was 5,000. All simulations and all the graphics were performed using the R programming language (R Core Team, 2017). Regressions were estimated for each sample using the gamlss() function.
5.1 Scenario I
The goal of this simulation experiment is to examine the finite sample behavior of the MLE’s. This was done 5,000 times each for sample sizes () of 50, 100 and 150. In order to analyze the point estimation results, we computed, for each sample size and for each estimator: mean (E), bias (B) and root mean square error (RMSE). We will use Monte Carlo simulations to estimate the empirical coverage probability of the asymptotic confidence intervals.
Mean parameter  Precision parameter  

50  1.996  1.601  0.004  0.001  0.114  0.226  2.742  2.117  0.142  0.117  0.776  1.019 
100  1.999  1.602  0.001  0.002  0.077  0.155  2.661  2.048  0.061  0.048  0.654  0.636 
150  1.998  1.601  0.002  0.001  0.063  0.122  2.642  2.025  0.042  0.025  0.281  0.508 
Mean parameter  Precision parameter  

50  0.116  0.225  0.110  0.216  0.763  1.012  3.852  4.291 
100  0.080  0.155  0.077  0.150  0.651  0.635  2.713  2.981 
150  0.064  0.124  0.062  0.122  0.279  0.508  0.271  0.491 
Table 2 presents the mean, bias and RMSE for the maximum likelihood estimators of , , and . The estimates of the regression parameters and are more accurate than those of and
. We note that the RMSEs tend to decrease as larger sample sizes are used, as expected. Finally, note that the standard deviations (SD) of the estimates very close to the asymptotic standard errors (SE) estimates when
tends towards infinity (see, Table 3).Mean parameter  Precision parameter  

50  0.982  0.927  0.872  0.982  0.935  0.875  0.975  0.919  0.865  0.986  0.941  0.882 
100  0.985  0.941  0.885  0.986  0.940  0.888  0.984  0.937  0.882  0.989  0.942  0.887 
150  0.986  0.942  0.889  0.988  0.944  0.890  0.985  0.937  0.888  0.988  0.947  0.891 
Table 4 presents the empirical coverage probabilities of the asymptotic confidence intervals of level , denoted by . Note that the asymptotic confidence intervals have an empirical coverage probability that is less than the nominal values (0.99, 0.95 and 0.90). Overall, we observe that the asymptotic confidence intervals have a good performance.
5.2 Scenario II
The second simulation study was performed to examine how well the distributions of the and are approximated by the standard normal distribution. In each of the 5,000 replications, we obtain 150 observations from the BP distribution and we fit the model given in (10) and generate the residuals.
In Figure 3, we perform a comparison between the empirical distribution of the residuals and the standard normal distribution, by using quantilequantile plots. The residuals ’s have a good agreement with the a standard normal distribution. However, note that the residuals ’s have a distribution skewed to the right. The results presented in Figure 3 show that the distribution of the ’s is better approximated by the standard normal distribution than that of the ’s .
6 Real data application
The analysis was carried out using the glmBP and gamlss packages. We will consider a randomized experiment described in Griffiths et al. (1993). In this study, the productivity of corn (pounds/acre) is studied considering different combinations of nitrogen contents and phosphate contents (40, 80, 120, 160, 200, 240, 280 and 320 pounds/acre). The response variable is the productivity of corn given the combination of nitrogen and phosphate corresponding to the th experimental condition and the data are presented in Figure 4.
In Figure 4, there is evidence of an increasing productivity trend with increased inputs. Moreover, note that there is increased variability with increasing amounts of nitrogen and phosphate, suggesting that the assumption of GA or RBS distributions (both with quadratic variance) for , i.e., we consider that follows , and distributions with a systematic component given by
We compared the BP regression model with the GA and RBS regression models.
Table 5 presents the estimates, standard errors (SE), Akaike information criterion (AIC) and Bayesian information criterion (BIC) for the BP, GA and RBS models. We can note that the BP, GA and RBS regression models present a similar fit according to the information criteria (AIC and BIC) used. In Figure 5, we see that most of the residuals lie inside the envelopes and that there is no noticeable pattern in the residuals. Visual inspection of the plots in Figure 6 suggest that there is no serious violation of the distributional assumptions.
Parameter  BP  GA  RBS  

Estimate  SE  Estimate  SE  Estimate  SE  
0.4471  0.2697  0.4687  0.2808  0.4589  0.2644  
0.3453  0.0399  0.3499  0.0421  0.3473  0.0397  
0.4191  0.0382  0.4100  0.0407  0.4146  0.0384  
45.650  12.260  46.592  11.987  92.570  23.900  
Selection criteria  
Loglikelihood  112.19  112.30  113.86  
AIC  232.38  232.59  232.41  
BIC  237.99  238.20  238.02 
Now, we will introduce an additive perturbation on the 10 observation (randomly selected) by making , where is the sample standard deviation of the response variable. The purpose of this perturbation is to show that the BP model can be less sensitive to the presence of atypical observations in sampled data. Figures 6–7 present the fitted values against the residuals , the simulated envelopes for the and the index plot of for under the caseweights scheme, respectively, for the disturbed data. From Figure 7, we observe that that case 10 is the most influential for the GA and RBS models. From now on, therefore, we will follow our analyzes considering the BP regression model.
Nonconstant variance in can be diagnosed by residual plots. Figure 8(b) note that the residual plot shows a pattern that indicates an evidence of a nonconstant precision because the variability is higher for lower phosphate concentrations. Thus, we will consider the following model for precision of the BP regression model
The ML estimates of its parameters, with estimated asymptotic standard errors (SE) in parenthesis, are: , , , and . Note that the coefficients are statistically significant at the usual nominal levels. We also note that there is a positive relationship between the mean response (the productivity of corn) and nitrogen, and that there is a positive relationship between the mean response and the phosphate. Moreover, the likelihood ratio test for varying precision (given in Equation (9)) is significant at the level of (value =0.0412), for the BP regression model with the structure above.
In Figure 9(a), we note that the quantile residuals under the BP regression model with varying precision have a smaller amplitude of their values. In Figure 9(b), we can see that there is no noticeable pattern in the residuals and there is no evidence against the distributional assumptions.
In Figures 1011, we present diagnostics for the BP regression model with varying precision. We can see in these figures that no observations stood out as potentially influential.
7 Concluding remarks
In this paper, we have developed a new parameterized BP distribution in terms of the mean and precision parameters. The variance function of the proposed model assumes a quadratic form. Furthermore, we have proposed a new regression model for modelling asymmetric positive real data. The proposed parameterization allows for a precision parameter, which also has a systematic component. An advantage of the proposed BP regression model in relation to the GA and RBS regression models is its flexibility for working with positive real data with high skewness, i.e., the proposed model may serve as a good alternative to the GA and RBS regression models for modelling asymmetric positive real data.
Maximum likelihood inference is implemented for estimating the model parameters and its good performance has been evaluated by means of Monte Carlo simulations. Furthermore, we provide closedform expressions for the score function and for Fisher’s information matrix. Diagnostic tools have been obtained to detect locally influential data in the maximum likelihood estimates. In particular, appropriate matrices for assessing local influence on the parameter estimates under different perturbation schemes are obtained. We have proposed two types of residuals for the proposed model and conducted a simulation study to establish their empirical properties in order to evaluate their performances. Finally, an application using a real data set was presented and discussed.
As part of future work, there are several extensions of the new model not considered in this paper that can be addressed in future research: an extension of BP regression model that allow for covariates to be measured with error, zeroaugmented BP regression assumes that the response variable has a mixed continuousdiscrete distribution with probability mass at zero, an extension of BP regression model with censored data may be developed. Additionally, future work should explore other estimation methods for the BP regression model as, for instance, the Bayesian, improved estimation (Cox & Snell, 1968) and EMalgorithm approaches. Finally, one may develop an R
package to perform inference in the BP regression model. Work on these problems is currently under progress and we hope to report these findings in a future paper. We hope that this new model may attract wider applications in regression analysis.
Appendix
Appendix A Score vector
First we will show how the elements of the score vector for this class of models were obtained. This elements are obtained by differentiating the loglikelihood function, given in (8), with respect to the and and are given by
The score function can be written in matrix form as
where , , , , , , and is the Kronecker delta for .
Appendix B Hessian matrix and Fisher’s information matrix
Under the BP regression model defined in Section , the second derivative of with respect:
(i) to and , for , is
where