# Conjugate priors and bias reduction for logistic regression models

Logistic regression models for binomial responses are routinely used in statistical practice. However, the maximum likelihood estimate may not exist due to data separability. We address this issue by considering a conjugate prior penalty which always produces finite estimates. Such a specification has a clear Bayesian interpretation and enjoys several invariance properties, making it an appealing prior choice. We show that the proposed method leads to an accurate approximation of the reduced-bias approach of Firth (1993), resulting in estimators with smaller asymptotic bias than the maximum-likelihood and whose existence is always guaranteed. Moreover, the considered penalized likelihood can be expressed as a genuine likelihood, in which the original data are replaced with a collection of pseudo-counts. Hence, our approach may leverage well established and scalable algorithms for logistic regression. We compare our estimator with alternative reduced-bias methods, vastly improving their computational performance and achieving appealing inferential results.

## Authors

• 9 publications
• 7 publications
12/05/2018

### Jeffreys' prior, finiteness and shrinkage in binomial-response generalized linear models

This paper studies the finiteness properties of a reduced-bias estimator...
02/28/2018

### Maximum likelihood estimation of a finite mixture of logistic regression models in a continuous data stream

In marketing we are often confronted with a continuous stream of respons...
10/09/2020

### CryptoCredit: Securely Training Fair Models

When developing models for regulated decision making, sensitive features...
01/19/2021

### Firth's logistic regression with rare events: accurate effect estimates AND predictions?

Firth-type logistic regression has become a standard approach for the an...
01/19/2021

### On resampling methods for model assessment in penalized and unpenalized logistic regression

Penalized logistic regression methods are frequently used to investigate...
04/22/2019

### A Maximum Entropy Procedure to Solve Likelihood Equations

In this article we provide initial findings regarding the problem of sol...
08/19/2018

### The empirical likelihood prior applied to bias reduction of general estimating equations

The practice of employing empirical likelihood (EL) components in place ...

## Code Repositories

### logistic-bias-reduction

Bias reduction techniques for logistic regression models

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Logistic regression is arguably one of the most widely used generalized linear models in statistical practice. In such a model, it is assumed that each entry of the vector

is a realization of independent binomial random variables with number of trials

and success probabilities

. Moreover, suppose that each response is associated with a -dimensional covariate vector , for . Then, a logistic regression model has

 (yi∣mi,πi)∼Bin(mi,πi),πi=exp(xTiβ)1+exp(xTiβ),(i=1,…,n), (1)

where is a -dimensional vector of unknown regression coefficients. We customarily assume that the design matrix , whose rows are , is of full rank. Thus, the log-likelihood function is

 ℓ(β;y)=n∑i=1yi(xTiβ)−n∑i=1milog{1+exp(xTiβ)}, (2)

and the maximum likelihood estimate of is the maximizer of (2). Unfortunately, the maximum likelihood estimate may not exist, meaning that at least one component of the vector is infinite. It is well known that such an unpleasant phenomenon occurs in presence of data separation, as carefully described in Albert1984. If separation occurs, standard procedures may fail to converge, resulting in degenerate predicted success probabilities and misleading inferential conclusions.

In a seminal paper, Firth1993 showed that the maximizer of a suitable penalized likelihood, obtained by considering a Jeffrey’s prior penalty, has a smaller asymptotic bias compared to the maximum likelihood estimator, while Kosmidis2021 proved that Firth’s bias-reduced estimates always exist and solve the separability issue. The finiteness and shrinkage properties of Firth’s reduced-biased estimates spurred an interesting line of research, and specialized algorithms are described in Kosmidis2010; Kosmidis2020. Penalized methods for logistic regression models have become increasingly popular in health and medical sciences, and numerous developments of the original approach of Firth1993 have been proposed in such specialized literature (e.g., Greenland2015; Puhr2017).

In high-dimensional settings, that is when and are both large, existing implementations of Firth1993 may be computationally too demanding; refer for instance to Sur2019. Instead, we propose a bias-reduction approach which retains Firth’s appealing properties but it is much easier to implement, since it relies on a simple perturbation of the data. Specifically, we penalize the logistic regression likelihood by the conjugate prior of Diaconis1979, resulting in the following penalized log-likelihood

 ~ℓ(β;y)=ℓ(β;y)+p2mn∑i=1mi(xTiβ)−pmn∑i=1milog{1+exp(xTiβ)}, (3)

where . The penalized log-likelihood (3) can be rewritten, up to a multiplicative constant, in terms of a genuine log-likelihood, having replaced the actual data with a vector of pseudo-counts , defined as

 ~yi=pp+mmi2+mp+myi,(i=1,…,n). (4)

Each pseudo-count is a convex combinations of the actual data and . This is equivalent to adding successes and trials to each and , respectively, therefore shrinking the success proportions towards equiprobability and regularizing the estimation procedure. Importantly, it holds that and therefore well-established algorithms for logistic regression may be used to maximize (3) even in presence of large datasets. In Section 2 and 3 we will show that the maximizer of (3

) always exists and is unique. In addition, the considered conjugate prior penalty is a log-concave distribution, whose moments are always finite

(Diaconis1979; Chen2003).

Adding small corrections to the response of a binomial model is a common regularization strategy. In the simplest case , correction (4

) reduces to the familiar bias-reducing form of the empirical logit

(Haldane1955; Anscombe1956). In more general settings, these penalties have been referred to as data augmentation priors (e.g., Greenland2015). The scheme of Clogg1991 is closely related to (4), albeit with several critical distinctions. Their correction amounts to adding successes and trials to each and , respectively. Thus, Clogg1991 approach shrinks the success proportions towards the mean rather than , which is a key aspect if one aims at reducing the bias (Cordeiro1991; Firth1993). Furthermore, the correction of Clogg1991 depends on the specific aggregation of the data and it leads to a different amount of shrinkage compared to (4).

## 2 Conjugate Bayes for logistic regression

### 2.1 Diaconis and Ylvisaker conjugate priors

Bayesian inference is based on the posterior law , where is the normalizing constant. Let and let

be a vector of hyperparameters. Moreover, let us define a vector of real numbers

such that for . Thus, the conjugate prior of Diaconis1979 for a logistic regression model is

 p(β)=Cexp[τn∑i=1κi(xTiβ)−τn∑i=1milog{1+exp(xTiβ)}], (5)

where the normalizing constant , albeit generally not available in closed form, is necessarily finite (Theorem 1, Diaconis1979); additional considerations about (5) can be found in Chen2003; Greenland2003. The location parameter is the mode of the prior distribution and hence each ratio can be interpreted as the prior guess for the success probability . Instead, the parameter controls the variability and quantifies the strength of our prior beliefs about . More precisely, when then (5) reduces to a uniform improper prior for , whereas as the prior converges to a point mass at . The choice places equal weight to the prior and the likelihood, therefore one typically focuses on the case . In the special case of binomial model with , the prior (5) induces the usual conjugate prior on the probability , namely , with .

Application of Bayes theorem to the binomial log-likelihood (

2) under the prior (5) leads to the following posterior distribution

 p(β∣y)=C(y)exp[(τ+1)n∑i=1y∗i(xTiβ)−(τ+1)n∑i=1milog{1+exp(xTiβ)}], (6)

where is the normalizing constant and is a vector of pseudo-counts such that , for . The posterior is still in the class of Diaconis and Ylvisaker distributions, with updated location and precision . As we shall discuss in Section 3, in light of the connection with Firth1993, we propose a natural default choice with and , implying that , for .

### 2.2 Posterior inference

The posterior law (6) can be expressed as , that is, the posterior distribution is a function of the log-likelihood evaluated on the set of pseudo-counts . Hence, the maximum a posteriori coincides with the maximizer of the log-likelihood

. Heuristically, the pseudo-counts regularize the estimation problem, as each

belongs to the open set . Consequently, the mode of the posterior distribution always exists, effectively solving the separability issue. This is formalized in the following Theorem 1, which refers to a general set of parameters and , but it clearly applies also to the penalized log-likelihood (3), that is when and ; refer to the Supplementary Materials for a proof, which is a natural consequence of results by Diaconis1979 and well-known properties of exponential families.

###### Theorem 1.

Let be of full rank. Then the mode of the posterior distribution (6), corresponding to the maximizer of the penalized likelihood

 ℓ∗(β;y)=(τ+1)ℓ(β;y∗)=ℓ(β;y)+τn∑i=1κi(xTiβ)−τn∑i=1milog{1+exp(xTiβ)}

with respect to , exists and is unique.

The mode of the posterior distribution (6

) can be also regarded as the minimizer of the posterior expectation of a suitable entropy loss function

(Robert1996). Hence, the mode is a formal Bayesian estimator with strong theoretical foundations. Bayesian estimators obtained under the entropy loss are invariant under reparametrizations. To clarify, if denotes the maximizer of (6), then

is the Bayesian estimator of the odds-ratios under the entropy loss.

The optimal value maximizing

can be found through standard Fisher scoring or expectation-maximization

(Durante2019) algorithms for logistic regression, considering the pseudo-counts in place of the original binomial data . Alternative algorithms could be also exploited, including quasi-Newton or conjugate gradient ascent methods (e.g., Nocedal2006). In all settings, the existence and uniqueness of are guaranteed by Theorem 1, whereas the concavity of implies that if an algorithm converges to a stationary point, then it must be the global optimum. Finally, full Bayesian inference based on the posterior (6

) can be performed via Markov chain Monte Carlo. For instance, the data-augmentation scheme of

Polson2013 may be easily adapted to this setting, leading to a simple Gibbs sampling algorithm.

## 3 Penalized maximum likelihood

### 3.1 Shrinkage, bias reduction and data aggregation

The maximum likelihood estimates of in a logistic regression model are biased away from the point ; refer for example to McCullagh1989; Cordeiro1991. Thus, bias correction requires a certain amount of shrinkage towards that point. In Firth1993 this is achieved through a Jeffrey’s prior penalty, whose mode is indeed equal to (Kosmidis2021). In a similar fashion, we set the mode in the prior specification (5), which implies that , for . The amount of shrinkage is regulated by the precision parameter , whose choice is more delicate. Heuristically, one may set proportional to , so that for large enough the contribution of the prior becomes negligible compared to the weight of the likelihood in (6). In particular, if then the amount of shrinkage is proportional to the model complexity, which seems desirable. These choices lead to the penalized log-likelihood (3) and the pseudo-counts (4). Such an intuitive choice ensures an accurate approximation of the reduced-bias estimators of Firth1993, and that the resulting prior distribution is invariant under different aggregation of the data, in contrast with Clogg1991.

### 3.2 Penalized score equations and connection with Firth (1993)

Firth’s penalized maximum likelihood estimate is obtained as the solution of the system of penalized score equations for , having defined

 Ur,\textscfi(β)=n∑i=1(yi−miπi)xir−pn∑i=1(hip)(πi−1/2)xir,(r=1,…,p),

where represent the diagonal elements of the projection matrix

 H(β)=W(β)1/2X{XTW(β)X}−1XTW(β)1/2,

with . Instead, the maximizer of the penalized log-likelihood (3) corresponds to the solution of the system of penalized score equations for , where

 Ur(β)=∂∂βr~ℓ(β;y)=n∑i=1(yi−miπi)xir−pn∑i=1(mim)(πi−1/2)xir,(r=1,…,p).

Clearly, the penalized scores and differs only in their penalty term. Moreover, the matrix has rank and is idempotent and symmetric, implying that the sum of its diagonal terms is . Hence, the following approximation holds

 n∑i=1(hip)(πi−1/2)xir≈n∑i=1(mim)(πi−1/2)xir,(r=1,…,p), (7)

Our method considers the mean of the values , with weights , instead of the mean of the same quantities but with weights , as in Firth1993. Hence, the penalized score functions and are approximately interchangeable, as well as the corresponding solutions. From a computational perspective, Firth’s modified score equations can be solved via quasi-Fisher scoring (e.g. Kosmidis2010). However, this approach can be computationally challenging in settings with large and since the terms from depend on the current value of . The penalized scores, instead, can be solved with any software for logistic likelihood optimization, and therefore provide a computationally convenient option to approximate Firth’s scores in large dimensions.

To shed further light on the approximation (7

), let us consider a disaggregated representation of the data, in which the response variables

are binary indicators, for and , so that . Each is associated with the covariate vector , which is repeated times in the disaggregated design matrix. Furthermore, let be the leverage values of such a binary regression models, for and . We can rewrite the penalties appearing in (7) in terms of these disaggregated quantities, obtaining that and . The first equality is a consequence of the invariance property of Firth1993 under alternative representations of the data. Hence, approximation (7) can be equivalently written as follows

 n∑i=1mi∑j=1(hijp)(πi−1/2)xir≈1mn∑i=1mi∑j=1(πi−1/2)xir,(r=1,…,p), (8)

where the approximation is due to the replacement of a weighted mean with a simple arithmetic mean. Importantly, we have that , clarifying that equations (7) and (8) rely on the substitution of the leverages with their mean . The condition is known as approximate quadratic balance and has been exploited by Cordeiro1991 to obtain the reduced bias estimate , which indeed deflates the maximum likelihood estimate towards zero by the factor . The quadratic balance condition holds exactly true in some special cases. For instance, when then both Firth1993 and our approach correspond to a prior penalty for and coincide with the empirical logit correction of Haldane1955. More generally, in any saturated model with and with balanced number of trials the two approaches formally agree, since we must have that and , implying that . Replacing the diagonal elements of a projection matrix with their mean also appears in seemingly unrelated settings, for example to obtain the so-called generalized cross-validation index (Wasserman2005, §5.3).

### 3.3 Asymptotic properties and inference

Consistency of the penalized maximum likelihood estimator follows directly from the consistency of the maximum likelihood estimator . Moreover, and have the same asymptotic distribution and hence are both asymptotically unbiased and efficient. Indeed, the penalty terms described in (7) are of order and hence dominated by the score function, as discussed in Firth1993; Pagui2017. Thus, and are both asymptotically distributed as multivariate Gaussians centered on the true value and with asymptotic inverse covariance matrix

. Consequently, Wald-type confidence intervals may be constructed by plugging-in

into and then proceeding in the usual manner.

## 4 Illustrations

### 4.1 Infant birthweight study

We replicate an example presented in Kosmidis2020, which considers a study of low birthweight. Data comprises births and the binary outcome of interest is a dichotomization of infant birthweight (below or above kilograms). The probability of low birthweight is modelled as a function of an intercept and six covariates about the mother. The maximum likelihood estimate of the regression coefficients exists and is finite. We simulate datasets from a logistic regression model with parameter  and evaluate the inferential properties of the proposed estimator, comparing them with popular bias-correction methods in a regular scenario.

Results are presented in Table 1. The values for the maximum likelihood are computed excluding samples with data separation, occurring in replications out of . The maximum likelihood estimator performs significantly worse than all the reduced-bias methodologies in terms of bias and root mean squared error. Moreover, we observe a striking empirical similarity in terms of bias between the performance of the proposed penalized estimator and the one of Firth1993; this finding is in line with the considerations discussed in Section 3.2. Empirical results also confirm a limitation of Clogg1991, which fails at reducing the bias for the intercept parameter , since it shrinks the response towards the empirical proportion of successes rather than . Finally, the approach of Pagui2017 is designed for correcting median bias and therefore is expected to be worse, in terms of bias, than Firth1993. The proposed penalized estimator slightly improves the mean squared error compared to Firth1993. This aspect seems to be empirically confirmed also in the following illustrative example.

### 4.2 High dimensional synthetic dataset

We finally consider a synthetic dataset that mimicks the high dimensional scenario described in Sur2019. In particular, we simulate datasets from a binary logistic regression model with observations and

. Covariates are sampled from a normal distribution with mean

and variance

. Moreover, regression coefficients are divided in blocks of size , whose values are . The purpose of this study is to investigate the performance of the proposed estimator in computationally challenging scenarios.

In first place, we shall note that the execution times of the estimation procedure for , obtained via scalable algorithms for logistic regression, are orders of magnitude faster than the brglm2 implementation of Firth1993. For instance, a single replication with and required an average elapsed time of milliseconds in the former case, against approximately half a second in the latter, on a 2020 Macbook Pro. These differences are even more marked for larger values of and , to the extent that Firth1993 estimates could not be obtained within hours of running time with , and correlated design matrix .

Secondly, we computed the bias and the root mean squared error of maximum likelihood and reduced-bias estimators, which are depicted in Figure 1. Current empirical findings confirm the poor behaviour of maximum likelihood estimates, consistently with the existing literature (Kosmidis2021). The proposed correction provides an important improvement in terms of bias reduction compared to the maximum likelihood estimator, although the bias is slightly larger than that of Firth1993. This is expected and consistent with the findings of Sections 3.2 and 4.1. Furthermore, our penalized procedure achieves a lower mean squared error compared to both the maximum likelihood and the approach of Firth1993. This does not come as a contradiction, since Firth1993 approach does not explicitly reduce the mean squared error. This empirical finding is likely due to the different tail behavior of Diaconis1979 priors compared to Jeffrey’s priors, the latter being more dispersed and displaying heavier tails.

## Acknowledgements

The authors wish to thank David Dunson, Sonia Migliorati and Aldo Solari for their insightful comments on an early draft of this manuscript.

## Supplementary material

Supplementary material includes a detailed proof of Theorem 1, a graphical illustration of the proposed prior and an illustrative case study with separability issues. Software is available online at github.com/tommasorigon/logistic-bias-reduction.

## Appendix A Proof of Theorem 1

The posterior mode of coincides with the optimizer of the quantity , namely the log-likelihood evaluated on the vector of pseudo-counts . The Fisher information matrix is

 I(β):=−∂2∂β∂βTℓ(β;y∗)=XTW(β)X,W(β)=diag{m1π1(1−π1),…,mnπn(1−πn)},

which does not depend on . Since is full rank, then the matrix is positive definite for any value of , a well known result. Hence, the optimal value maximizing , if it exists, is unique. Since is full rank, then logistic regression model is a regular -dimensional exponential family and therefore the optimal value of exists if and only if the sufficient statistic

 s=(s1,…,sp)T=(n∑i=1xi1y∗i,…,n∑i=1xipy∗i)T,

belongs to the interior of the closed convex hull of the support of ; see Theorem 5.8 in Pace1997. This is indeed the case, as a consequence of Theorem 1 in Diaconis1979.

## Appendix B Graphical comparison

We provide further insights on the proposed Diaconis-Ylvisaker prior through a graphical comparison with the Cauchy prior of Gelman2008, and the Jeffrey’s prior of Firth1993. The Cauchy prior has been suggested as a default choice for regularizing logistic regression, and is a routinely used in statistical software as a default specification. In this illustrative example, we focus on standardized covariates with observations. Figure 2 depicts the contour plots of the different priors. The locations and scales of quite similar, with the distibutions centered on and assigning most of their mass in the interval . Instead, their shapes are markedly different. In particular, the Diaconis-Ylvisaker and Jeffrey’s prior depend on the specific covariates values and reflect correlations in the predictors. In addition, the Diaconis1979 prior is roughly ellipsoidal and with lighter tails compared to Cauchy and Jeffrey priors.

## Appendix C Endometrial cancer study

We consider the endometrial cancer grade dataset (Heinze2002), a study on patients which aims at evaluating the relationship between the histology of the endometrium (low against high), and three risk factors. A logistic regression model was fitted with parameter vector , with the first coefficient corresponding to an intercept term and the remaining to neovasculation, pulsatility index of arteria uterina and endometrium height, respectively. As shown in Heinze2002, the maximum likelihood estimate does not exist since the estimated value for the coefficient associated with neovasculation is divergent due to quasi-separability. The omission of the neovasculation information from the set of covariates is entirely inappropriate, as the other factors would not be properly adjusted for this highly informative risk factor; refer to Kosmidis2021 for a discussion on the implications of separability on inferential procedures. We therefore compare the proposed penalized maximum likelihood estimate with the approaches of Clogg1991 and Firth1993

. We also consider the median unbiased estimators of

Pagui2017, Kosmidis2020. Computations are performed using the brglm2 R package and the R function glm.

Estimates for the regression coefficients and the corresponding standard errors are reported in Table 2. As expected, all the reducing-bias methods produce finite estimates for , which are all close to and indicate a strong effect of neovasculation on the response probability. Point estimates and standard errors for the remaining coefficients are also quite similar. The approach of Clogg1991 and our penalized method lead to similar estimates since the former shrinks the estimated probabilities towards the sample proportion, which in this case is . Despite the stated aim of Clogg1991 was not performing bias reduction, it performs reasonably well in this example, and the reasons for its good empirical performance are likely due to its similarity with our approach, which in turn approximates the one of Firth1993.