Code for the paper Dalla Valle, L., Leisen, F., Rossini, L. and Zhu, W. (2019) - "A Polya-Gamma sampler for a generalized logistic regression"
In this paper we introduce a novel Bayesian data augmentation approach for estimating the parameters of the generalised logistic regression model. We propose a Pólya-Gamma sampler algorithm that allows us to sample from the exact posterior distribution, rather than relying on approximations. A simulation study illustrates the flexibility and accuracy of the proposed approach to capture heavy and light tails in binary response data of different dimensions. The methodology is applied to two different real datasets, where we demonstrate that the Pólya-Gamma sampler provides more precise estimates than the empirical likelihood method, outperforming approximate approaches.READ FULL TEXT VIEW PDF
Code for the paper Dalla Valle, L., Leisen, F., Rossini, L. and Zhu, W. (2019) - "A Polya-Gamma sampler for a generalized logistic regression"
Data augmentation (DA) is undoubtedly one of the most popular Markov Chain Monte Carlo (MCMC) methods. The idea is to consider the target distribution as the marginal of a joint distribution in an augmented space. Assuming it is easy to sample from the conditional distributions, the algorithm is a straightforward two steps procedure which makes use of a simple Gibbs sampling. Since the seminal work ofTanner and Wong (1987), data augmentation has been extensively studied, developed and used in different contexts. For instance, Meng and van Dyk (1999), Liu and Wu (1999), van Dyk and Meng (2001) developed strategies to speed up the basic DA algorithm. Hobert and Marchev (2008) studied the efficiency of DA algorithms. Recently, Leisen et al. (2017)
use a DA approach to perform Bayesian inference for the Yule-Simon distribution. The idea is used in the context of infinite hidden Markov models byHensley and Djurić (2017). The literature on DA is vast and it is difficult to list all the contributions to the field. A review of DA algorithms can be found in Brooks et al. (2011), Chapter 10.
In the context of binary regression, the article of Albert and Chib (1993) sets a milestone in the use of DA algorithms. The authors develop exact Bayesian methods for modeling categorical response data by introducing suitable auxiliary variables. In particular, they use a DA approach to estimate the parameters of the probit regression model. However, although some valid approaches have been proposed (Holmes et al., 2006), Bayesian inference for the logistic regression was not satisfactorily addressed until the paper of Polson et al. (2013). The authors proposed an elegant DA approach which makes use of the following identity:
where , , and is the density of a PG(,0) distribution. The distribution PG(
,0) is a Pólya-Gamma distribution with parametersand 0; we refer to the article of Polson et al. (2013) for details about Pólya-Gamma distributions. When is a linear function of predictors, the integrand is the kernel of a Gaussian likelihood in . This naturally suggests a DA scheme which simply requires to be able to sample from the multivariate normal and the Pólya-Gamma distributions. This algorithm is called the Pólya-Gamma sampler. A theoretical study of the algorithm can be found in Choi and Hobert (2013).
In this paper, we propose a Pólya-Gamma sampler to estimate the parameters of the generalized logistic regression model introduced in Dalla Valle et al. (2019). The authors studied immigration in Europe employing a novel regression model which makes use of a generalized logistic distribution. Their work is motivated by the need of a logistic model with heavy tails. The distribution considered in the paper has the following density function:
The parameter controls the tails of the distribution. The standard logistic distribution can be recovered with . Compared with the standard logistic distribution, means lighter tails and means heavier tails, as shown in Figure 1. Unfortunately, the distribution function is not explicit:
where , with , is the incomplete Beta function. Therefore, a straightforward use of the methodology of Polson et al. (2013) is not possible. To overcome the problem, Dalla Valle et al. (2019) use an approximate Bayesian computational method that relies on the empirical likelihood (see Mengersen et al. (2013) and Karabatsos and Leisen (2018)
). However, since the empirical likelihood approach is based on an approximation of the posterior distribution, this method may lead to wide credible intervals and sometimes ambiguous estimates, especially for the tail parameter.
In this paper, we propose a novel approach for estimating the generalised logistic regression, that allows us to draw samples from the exact posterior distribution and is able to overcome the drawbacks of the empirical likelihood method. We show how to set a DA scheme for the generalized logistic regression which makes use of the Pólya-Gamma identity in equation (1). We test the performance of the proposed approach on simulated data and on two different real dataset: the first dataset regards people’s opinions towards immigration in Europe; the second contains information about the recidivism of criminals detained in Iowa. We show that the Pólya-Gamma sampler yields accurate results in high-dimension and outperforms the empirical likelihood algorithm providing a more precise estimation of the model parameters.
The rest of the paper is organised as follows. In Section 2, we introduce the generalized logistic regression framework in a Bayesian setting. In Section 3, we provide the details of the Pólya-Gamma sampler for this model. Section 4 and Section 5 illustrate the algorithm performance with simulated and real data. Section 6 concludes.
Consider a binary regression set-up in which
are independent Bernoulli random variables such thatwhere is a vector of known covariates associated with , , is a vector of unknown regression coefficients, and is a distribution function. In this case, the likelihood function is given by
If is the distribution function of a Gaussian, then we are in the probit regression framework. If , then we are in the special case of the standard logistic regression. Albert and Chib (1993) proposed a DA approach for sampling from the posterior distribution of the probit regression model. Their method makes use of auxiliary variables which allow an easy implementation of the Gibbs sampler. Polson et al. (2013) proposed an elegant algorithm for tackling the logistic regression case.
The generalized logistic distribution in equation (2) has mean zero and scale one. Consider now a generalized logistic with mean and scale one, i.e. with probability density function:
We denote the above distribution with . Mimicking Albert and Chib (1993), suppose to sample from a , , and set if . Otherwise, if , then set . It is easy to see that
The joint posterior distribution is
where and are the prior distributions of and , respectively. With the probit model, the technique of Albert and Chib (1993)
works because the full conditional distributions can be resorted to normal distributions (or truncated normal distributions) which are easy to sample from. In our case, the usual normal prior would lead to a full conditional onwhich is not explicit, requiring the use of a Metropolis step (or alternative algorithms). The Pólya-Gamma identity in (1) allows to overcome this problem. In particular, we get
where is a Pólya-Gamma, , distribution. Note that is the unnormalized density of a random variable. Therefore, the augmented posterior distribution is
The algorithm proposed in Dalla Valle et al. (2019) makes use of an approximate likelihood, which relies on an approximation of the posterior distribution (see Mengersen et al. (2013) and Zhu and Leisen (2016)). The algorithm presented in this paper allows us to sample from the true posterior distribution, making it more appealing to perform the Bayesian inferential exercise.
Starting from the augmented posterior distribution in (2), this Section introduces the full conditional distributions required to run the algorithm. The variables involved are
where , and .
Full conditional for . The random variables are independent with full conditional distributions:
which are truncated normal distributions. More precisely,
if , then is distributed as a truncated to the left by ;
if , then is distributed as a truncated to the right by ;
Full conditional for . Following Polson et al. (2013), it is easy to see that the full conditional distribution of is:
An efficient method for sampling Pólya-Gamma random variables is implemented in a modified version of the R package BayesLogit (Windle et al., 2016).
Full conditional for . Assuming a multivariate prior for , the full conditional distribution of is
Full conditional for . The full conditional for is:
where is a . In our approach we assume a Gamma prior for and we integrate out the ’s in (2). This leads to the following full conditional distribution for :
In order to sample from , we use the slice sampling of Neal (2003). This algorithm requires the computation of the inverse of the above full conditional distribution. We used the function uniroot available in R to compute the numerical inverse. The next sections illustrate the performance of the algorithm with simulated and real data.
In this section, we assess the performance of the algorithm by implementing different simulation experiments. In particular, we use four different values of the tail parameter : and (heavy tails) and and (light tails). We focus on three scenarios:
, with ;
, with ;
, with .
We assume a standard normal distribution for the explanatory variables. For each scenario and each value of , we generate different datasets of sample sizes and , respectively. Regarding the prior choice, we consider vague priors for and . In particular, a multivariate normal distribution, , is chosen for with prior mean vector and prior covariance matrix . For we choose a gamma prior,
with hyperparameters. For each of the different simulated datasets, we run iterations of the MCMC algorithm and discard the first iterations as burn-in period.
In Table 1 we show the posterior means of the different values of and for the three-dimensional
of scenario 1. The values in brackets show the standard deviations over thesimulations. We notice that the posterior means of are very close to the true values, and increasing the dimensionality of from to leads to a better estimation of the parameter. This good performance is also confirmed for the posterior means of the vector of unknown coefficients . We note an improvement of the results as the sample size increases.
Tables 2 and 3 show the posterior mean and the posterior standard deviations for the five- and ten-dimensional scenarios, respectively. These results are in line with the three-dimensional case, thus leading us to conclude that increasing the number of covariates is not an issue in terms of estimation accuracy. Moreover, the posterior standard deviations in Tables 1, 2 and 3 substanstially decrease for all the different scenarios and values when we move from to .
In order to illustrate the complete inferential procedure, we show how all the parameters of the model are estimated. Figure 2 displays the posterior chains for a specific dataset with sample size , where and . As shown in Figure 2, the chains, after a burn-in of iterations, converge quickly to the true values of all parameters and give an excellent representation of the real parameter values.
We also performed a convergence analysis for scenario 1, with three-dimensional 111We performed other convergence tests for the five- and ten-dimensional cases. These results are omitted for lack of space but are available on request.. The analysis was carried out using the R coda package (Plummer et al., 2006). In particular, we used single datasets of sample sizes and , respectively, where we fixed . We computed Geweke’s convergence tests and the autocorrelation and partial autocorrelation functions for the values of and . Table 4 reports the results of Geweke’s convergence test (Geweke, 1992)
, that indicate no issues since the test statistics values are all lower than. As a final check, in Figures 3 and 4 we plotted the autocorrelation and the partial autocorrelation functions for the three-dimensional scenario with and , respectively. From these Figures, there is no indication of serious convergence problems.
|Variables||Test Statistic||Variables||Test Statistic|
In this Section, we present two different real data applications to illustrate the accurate and efficient performance of the Pólya-Gamma sampler for estimating the generalized logistic regression. The applications analyse novel datasets of different dimensions, which are related to current and topical problems in the social sciences. In the first application, we consider data selected from the European Social Survey (ESS), that were previously analysed in Dalla Valle et al. (2019) to identify the determinants of public opinions towards immigration. The second application analyses the motives behind the recidivism of offenders released from prison between 2016 and 2018 in the State of Iowa, USA.
The first dataset focuses on people’s attitudes towards immigration. The data was selected from the ESS and was collected following hour-long face-to-face interviews. We focused on questions regarding immigration from the ESS8 edition 1.0 published in October 2017 and collected between August 2016 and March 2017 (ESS8, 2016, 2018) for Great Britain (GB), Germany (DE) and France (FR), respectively. For these three countries, the total number of observations is , of which are for GB; for DE and for FR.
The dependent variable is immig, that indicates whether the respondent would allow immigrants from poorer countries outside Europe. More precisely, if the respondent is against immigration, if the respondent is in favor of immigration. Following Dalla Valle et al. (2019), we considered as covariates the variables: pplfair, trstep, trstun, happy, agea, edulvlb and hinctnta. The first three variables include answers to question ranging from (most negative) to (most positive): do you think that most people try to take advantage of you (pplfair)? Do you trust the European parliament (trstep)? Do you trust the United Nations (trstun)? The happy variable takes values from (unhappy) to (extremely happy). The last three variables are subject-specific and include the age, ranging from to (agea); the highest level of education, from primary education to doctoral degree (edulvlb) and the household’s total net income (hinctnta).
For this dataset, we run iterations of the Pólya-Gamma sampler algorithm and we discarded the first iterations as burn-in. We used the MLE estimates of and as starting points of the algorithm and we employed vague priors, as in the simulated examples.
In order to assess the performance of the Pólya-Gamma sampler, we report the results obtained using the empirical likelihood approach on the same dataset, adopting a similar setting. Table 5 lists the results of the Pólya-Gamma sampler (left-hand side) and the empirical likelihood approach (right-hand side). For each European country (GB, DE and FR) and for each parameter, Table 5
shows the posterior means, the 2.5% and 97.5% quantiles of the posterior distributions, the credible interval (CI) widths and the difference between the CIs of the Pólya-Gamma and the empirical likelihood methods. The effects of the parameters are similar across countries and suggest that the higher the level of education, the higher the probability for people to be favourable towards immigration. Also, trustful people and individuals with a strong confidence in the European parliament tend to be in favour of immigration. The posterior mean estimates obtained with the Pólya-Gamma method are in line with those obtained with the empirical likelihood. However, the CIs calculated with the first approach are narrower than those calculated with the second approach. This is illustrated in the last column of Table5, where the negative values show that the Pólya-Gamma CI widths are smaller than those of the empirical likelihood method. For some of the covariates, such as pplfair, trstep, agea and hinctnta, the empirical likelihood approach assigns posterior support to the value zero, while the corresponding Pólya-Gamma CIs do not include zero. In addition, although the estimates of the parameter for all countries suggest heavy tails for the generalised logistic regression, the empirical likelihood CIs cover the value , which correspond to the standard logistic regression. On the contrary, the results of the Pólya-Gamma method give no posterior support to the value for DE and FR data, and very limited support for BG data, clearly suggesting the need for the generalised logistic regression model. Therefore, the Pólya-Gamma sampler outperforms the empirical likelihood method for the estimation of the generalised logistic regression, since it provides more precise parameter estimates as for the as well as for the parameters.
|Pólya-Gamma Sampler||Empirical Likelihood|
|Country||Parameters||post. mean||2.50%||97.50%||CI width||post. mean||2.50%||97.50%||CI width||CI difference|
The second dataset contains information about the recidivism of offenders released from prison between 2016 and 2018 in Iowa222See the following website for details: https://data.iowa.gov/Correctional-System/3-Year-Recidivism-for-Offenders-Released-from-Pris/mw8r-vqy4.. Recidivism happens when a prisoner is released from jail and he or she relapses back into criminal behavior and returns to jail. We applied the generalised logistic regression to determine whether a prisoner is likely to recidivate based on his or her characteristics.
After removing missing values, the data consists of records corresponding to offenders detained in prison and released from 2016 to 2018. The dependent variable is recidivism, which is equal to if the prisoner is arrested again within a -year tracking period after being released, and it is equal to if he or she has not returned to prison within the tracking period.
We model the probability of being recidivist using a subsample of all the explanatory variables present in the original database: the sex of the offender (sex, which is equal to for men and for women); the age when released from prison (age); whether an offender committed a felony or misdemeanor (felony, which is equal to for felony and for misdemeanor); whether an offender was charged with a drug crime (drug); whether an offender was charged with a public order crime (puborder); whether an offender was charged with a violent crime (violent); and whether an offender was released on discharge or parole (discharge, which is equal to for discharge and for parole).
In this second example, we run the MCMC algorithm using iterations and discarding the first iterations. As in the immigration example, we adopted vague priors and we stated the initial values of and based on the corresponding MLE estimators. Figure 5 shows the posterior chain of the tail parameter , which demonstrates a good convergence performance.
|Pólya-Gamma Sampler||Empirical Likelihood|
|Parameters||post. mean||2.50%||97.50%||CI width||post. mean||2.50%||97.50%||CI width||CI difference|
As we did with the immigration data, we compared the performances of the Pólya-Gamma sampler and the empirical likelihood approaches, using the same settings for both algorithms. The results for the parameters and are listed in Table 6, that displays the Pólya-Gamma results on the left-hand side and the empirical likelihood results on the right-hand side. Table 6 illustrates the posterior means, the 2.5% and 97.5% quantiles of the posterior distributions, the credible interval (CI) widths and the difference between the CIs of the Pólya-Gamma and the empirical likelihood methods.
The results suggest that men are more likely to recidivate than women, younger prisoners are more likely to become repeat offenders than older prisoners, there is a lower probability that the criminal is a recidivist if he or she was charged with a public order or violent crime and offenders released on parole are more likely to return to jail than those released on discharge.
The sign of the posterior means for both approaches is the same, suggesting a similar direction for the effects obtained using the Pólya-Gamma and the empirical likelihood method. However, the CIs of the first approach are again narrower than those of second approach, as shown by the CIs differences in the last column of Table 6 , which are all negative. Moreover, the Pólya-Gamma method, unlike the empirical likelihood, does not give posterior support to zero to the variables sex and puborder, suggesting that the inclusion of these variables in the model is appropriate.
The posterior estimates of are different for the two approaches. The posterior mean estimated with the Pólya-Gamma method is greater than one, denoting light tails for the generalised logistic model, and there is no posterior support for the value one, suggesting that the tails need to be appropriately modelled. On the contrary, the posterior mean estimated with the empirical likelihood is lower than one and the CI is wide and includes the values one, denoting estimation inaccuracy and uncertainty.
In agreement with our findings with the immigration data, the results obtained with the recidivism dataset demonstrate that the Pólya-Gamma sampler method yields a higher estimates precision and accuracy compared to the empirical likelihood method, as for the covariate parameters as well as for the tail parameter . In particular, a more precise estimation of leads to more clarity on the tail behaviour of the generalised logistic regression.
This paper introduces a novel DA scheme for the generalized logistic regression model. This model is particularly flexible since it is able to accommodate both light and heavy tails in dichotomous response data. The proposed DA scheme makes use of the Pólya-Gamma identity and it is strongly related to the slice sampler algorithm. The Pólya-Gamma sampler allows us to implement a Bayesian method able to draw samples from the exact posterior distribution. On the contrary, other methods, such as the empirical likelihood approach, are based on approximations of the posterior and may lead to low precision in the estimation of the parameters. Our simulation study demonstrates the estimation accuracy of the newly proposed Pólya-Gamma sampler with datasets of different dimensions. We compared the performances of the Pólya-Gamma and the empirical likelihood methods for modelling new interesting datasets regarding the opinion on immigration in European countries and the probability of being recidivist for prisoners in Iowa, USA. Our results demonstrate the superiority of the Pólya-Gamma sampler over the empirical likelihood in terms of parameter precision. The Pólya-Gamma method allows a more accurate estimation of the tail parameter for the generalised logistic regression compared to approximate methods.
Evaluating the accuracy of sampling-based approaches to calculating posterior moments.In Bernardo, J. M., Berger, J., Dawid, A. P., and Smith, J. F. M., editors, Bayesian Statistics 4, pages 169–193. Oxford University Press, Oxford.