An important challenge in statistical analysis concerns the control of the finite sample bias of estimators. This problem is typically magnified in models with a large number of variables that are possibly allowed to diverge with the sample size (see for example sur2019modern). Thereby, bias reduction techniques have been widely studied (a review can, for example, be found in kosmidis2014bias). These bias reductions can be achieved by simulation methods such as the jackknife (Efro:82) or the bootstrap (Efro:79), by using analytical approximations to the likelihood function (see for example BrDaRe:07; BrDa:08 and the references therein), or, alternatively, by modifications of the estimating equations, as proposed by Firt:93 and extended for example in KoFi:09; KoFi:11; Kosm:14; Kosm:17. Under appropriate conditions, the resulting bias reductions obtained from these techniques have been shown to achieve (at best) an order of .
The main purpose of this paper is to show the existence of a general method that optimally reduces (in an asymptotic sense) the bias of consistent estimators for parametric models. While the formal developments are presented in the next sections, we provide here an intuitive argument for the method we study. Consider a consistent estimator for , say , that is typically biased in finite samples, and define the bias where . In general, the function
is not available in closed-form but an unbiased estimator, say, is available via simulation-based techniques (and is formally presented further on). Similarly, we define a simulation-based version of the bias function , as . In this context, an unbiased estimator based on is therefore . The continuity of the function and the consistency of suggest . This approximation can be used to write
Since , satisfies . It is therefore approximately a fixed-point of the function
In this paper, we study the properties of the fixed-point of that we denote by
. While this estimator is not necessarily unbiased, the above heuristic reasoning implies a strong bias reduction ability. Indeed, we demonstrate thatachieves an asymptotically optimal unbiased property. In order to make this statement precise, we first recall that , where and are real-valued function, implies, by definition, that there exist a and a such that , for all . In this paper, we demonstrate, under some mild conditions, that
which means that , for all , or equivalently, that is unbiased for all . In other words, the result provided in (2) is optimal, as no stronger bias reduction result can be obtained using the big notation as a way to quantify the (asymptotic) bias of estimators. As will be explained further on, belongs to the class of indirect inference estimators (see gourieroux1993indirect) and is directly linked with the Iterative Bootstrap (IB) approach put forward in kuk1995asymptotically.
The assumptions needed to achieve (2) are weak enough so that the bias reduction method we propose is valid for a wide range of models and estimators and does not rely on model-based analytical transformations. For example, our framework allows to consider estimators that are discontinuous in (as is commonly the case when considering discrete data models) as well as high-dimensional settings in which
. Moreover, this bias reduction approach does not necessarily come at the price of an inflated variance and a trade off can be sought between efficiency and computational cost. Indeed, the asymptotic variance ofcan be made arbitrarily close to the one of by improving the “quality” (number of simulations) of the approximation of . We also show that the IB algorithm provides a computationally efficient method for computing . In addition, we demonstrate that the considered bias reduction method preserves its properties also when is inconsistent as long as the asymptotic bias of the latter has a specific form. The approach is therefore applicable in cases where a consistent estimator may be difficult to obtain but a reasonable approximation is available. These results are in line with the observations made by kuk1995asymptotically
. Indeed, while his primary focus was to construct simulation-based consistent estimators, using the IB, for Generalized Linear Mixed Models (GLMM), he noticed that “
[…] the method proposed can lead to estimates which are nearly unbiased even for the variance components while the standard errors are only slightly inflated”.
The paper is organized as follows. In Section 2, we present the mathematical setup in which we place our theoretical results. In particular, we present the simulation-based approach and we state and discuss the formal assumptions that are needed to derive the properties of the resulting estimator. These properties are formally stated in Section 3 while the proofs can be found in the appendix. In Section 4, we apply our approach to derive optimally bias reduced estimators for the logistic regression, with and without random intercept, in high-dimensional settings and possibly with data contamination. These estimators are based on the MLE and an estimator that is robust to data contamination and to the problem of separability. The theoretical results presented in Section 3 are in-line with the simulation studies of Section 4.
2 Mathematical Setup
Let denote a random sample generated under the model (possibly conditional on a set of fixed covariates), where
is the parameter vector we wish to estimate using the consistent estimator. In our setting, the dimension of is not fixed and is allowed to diverge together with the sample size . Our discussion throughout this paper considers cases where is “complex” enough in the sense that it has no closed-form solution and its finite sample bias is unknown.
The bias reduction technique we consider is based on the estimator , which is defined as the fixed-point of the function defined in (1). More precisely, can be expressed as follows:
The estimator can be seen as a special case of an indirect inference estimator (gourieroux1993indirect) when using for the following approximation
where is the same estimator as but computed on the simulated sample under model . We use the subscript to identify distinct samples111The random numbers (which are independent of ) used to construct are redrawn with the same seed values. Therefore, when is computed twice on simulated data from the same model , one gets the same value.. Moreover, the IB of kuk1995asymptotically is also directly related to . Indeed, the IB algorithm (sequence) can be generally defined as
with . Under appropriate conditions (discussed further on), the fixed-point corresponds to the limit (in ) of (4) and therefore can also be characterized as
Therefore, the IB approach can provide a natural and computational efficient algorithm to compute , or more generally, indirect inference estimators (see guerrier2018simulation for more details).
In what follows, we state and discuss the conditions allowing to satisfy the optimal bias reduction property defined in (2). Henceforth, we call the Optimally Bias REduced Estimator (OBREE).
The set is a compact subset of and .
Assumption A is very mild and is typically used in most settings where estimators have no closed-form. Indeed, the compactness of and are common regularity conditions for the consistency and asymptotic normality of the MLE, respectively (see for example newey1994large). Hence, Assumption A may be redundant depending on the requirements already brought by taking
to be consistent (and possibly asymptotic normally distributed).
The bias function exists and is once continuously differentiable in . Moreover, there exists a such that and .
Assumption B is likely to be satisfied in the majority of practical situations. Indeed, the bias function is typically assumed (at least implicitly) to have certain degree of smoothness (see for example kosmidis2014bias). Such property is expected to hold even in situations where the initial estimator may not be continuous (see Assumptions C and D for more details), as it is the case, for example, with discrete data models. Moreover, the second part of Assumption B is very mild. Indeed, many common estimators, including the MLE, can be expanded in decreasing powers of such that . In these cases, this requirement is always satisfied (since in our case ) and may be suitable in high-dimensional settings where . Assumption B is also particularly useful as it allows to decompose the estimator into a non-stochastic component and a random term . Indeed, we can write:
where is a zero-mean random vector. In our next assumptions, which are not required to establish the optimal bias reduction defined in (2), we impose additional restrictions on . These requirements will be used to prove the consistency and the asymptotic normality of .
The variance of exists and is finite for any . Moreover, there exists a such that and .
Assumption C is frequently employed and typically very mild. In the common situation where is -consistent, then we would have and Assumption C would simply require that . In our last assumption, we consider the limiting distribution of (and therefore of ). The latter is used to derive the asymptotic normality of , in order to evaluate, in particular, the potential efficiency loss from to .
For any such that we have
where is nonsingular and continuous in .
While Assumption D is frequently used, it may not necessarily be mild. In low-dimensional settings, this assumption is satisfied for a vast majority of commonly used estimators. However, in high-dimensional settings, the validity of the asymptotic normality of , as expressed in Assumption D, is often unknown for many models. As previously mentioned, this assumption allows to remove the requirements on and of Assumptions A and B provided that .
Finally, our assumption framework is not necessarily the weakest possible in theory and may be further relaxed. However, we do not attempt to pursue the weakest possible conditions to avoid overly technical treatments in establishing the theoretical results presented in the following section.
3 Main Results
In this section, we present the main properties of the OBREE under the assumptions presented in Section 2. Our results are valid for all and in high-dimensional settings and, in most cases, are applicable when . Theorem 1 shows that is an optimally bias reduced estimator in the sense of (2).
The proof of Theorem 1 is given in Appendix A. This result uses a technical lemma (also presented in Appendix A) which provides a general strategy for proving that a function is . This result is arguably a major improvement on bias reduction techniques as it provides an optimal asymptotic reduction while not relying on model-based analytical transformations. This approach is, therefore, widely and readily applicable. Interestingly, the result of Theorem 1 is, in some cases, valid in high-dimensional settings where . An equivalent statement of Theorem 1 is that there exists a finite sample size such that for all greater than . In our experience, as illustrated with the simulation studies presented in Section 4, the value seems to be reasonably small as the OBREE does indeed appear to be unbiased with relatively small sample sizes.
In addition, the result of Theorem 1 may still hold in the wider situation where is not consistent. While a thorough investigation is left for further research, a preliminary result can be found in Corollary 1 in Appendix A where is assumed to have a sub-linear asymptotic bias. As it will be illustrated in the simulation studies of Section 4, we find that when the asymptotic bias is “small”, the OBREE appears to preserve its finite sample bias properties. The latter is therefore applicable in cases where a consistent estimator may be difficult to obtain for example for computational (and/or numerical) reasons, but a “close” approximation is available.
The next result concerns the standard statistical properties of the OBREE, namely consistency and asymptotic normality. While, in our framework, these properties are trivially satisfied in low-dimensional settings, obtaining them in high-dimensional settings may be more challenging.
The proof of Proposition 1 is given in Appendix B. This result shows that the efficiency of the OBREE (relative to the initial estimator ) can be made arbitrarily close to one by increasing the number of simulations . Therefore, this highlights that a trade-off between efficiency and computational cost can be made. Interestingly, the finite sample variance of the OBREE can be smaller than the one of the initial estimator as, for example, illustrated in the simulation studies presented in Section 4. In low-dimensional settings, Proposition 1 is implied by the results on indirect inference estimators presented, for example, in gourieroux1996simulation.
The proof of Proposition 2 is given in Appendix C. Proposition 2 shows that the IB algorithm converges to (in norm) at an exponential rate. In our practical experience, the number of iterations necessary to reach a suitable neighbourhood of the solution appears to be relatively small (typically less than 20 iterations).
The results presented in this section can be used in a wide range of practical settings/models, to obtain estimators with optimal finite sample properties. In Section 4 below, we provide the results of simulation studies involving models for binary outcomes, with and without random effect, and with different initial estimators. For one setting, a comparison can be made with an available bias reduced estimator (KoFi:09), and for the others, our results provide new (optimally) bias reduced estimators.
4 Application to Binary Response Models
In this section, we apply the methodology developed in Sections 2 and 3 to investigate the performance of the OBREE. First, we consider the logistic regression (NeWe:72; McCuNe:89), for which the MLE is known to be biased. It is one of the most commonly used model for binary responses and several bias reduction methods have been proposed as for example the bias reduced estimator of KoFi:09. To illustrate the flexibility of the OBREE, we choose two different initial estimator , the MLE and a robust estimator. We compare these OBREEs to their initial estimators as well as to the bias reduced estimator proposed by KoFi:09
. Their respective performance is studied when the data is generated at the model but also under slight model misspecification where outliers are randomly created. Then, we extend the logistic regression to include a random intercept, a special case of GLMM(see for example LeNe:01; McCuSe:01; JiangBook2007), for which there is no closed-form expression for the likelihood function. In this case, the initial estimator is selected to be a numerically simple approximation of the MLE, and its finite sample behaviour is compared to the ones of the MLE computed using several more precise approximation methods.
4.1 Bias Reduced Estimators for the Logistic Regression
It is well known that in some quite frequent practical situations, the MLE of the logistic regression is biased and/or its computation can become very unstable, especially when performing some type of resampling scheme for inference. The underlying reasons are diverse, but the main ones are the possibly large ratio, separability (leading to regression slope estimates of infinite value) and data contamination (robustness). The first two sources are often confounded and practical solutions are continuously sought to overcome the difficulty in performing “reasonable” inference. For example, in medical studies, the bias of the MLE together with the problem of separability has led to a rule of thumb called the number of Events Per Variable (EPV), that is the number of occurrences of the least frequent event over the number of covariates, which is used in practice to choose the maximal number of covariates one is “allowed” to use in a logistic regression (see for example AuSt:17 and the references therein). Moreover, the MLE is known to be sensitive to slight model deviations that take the form of outliers in the data, leading to the proposal of several robust estimators for the logistic regression and more generally for Generalized Linear Model (GLM) (see for example CaRo:01b; Cize:08; HeCaCoVF:09 and the references therein).
Despite all the available estimators, to the best of our knowledge, none is able to handle the three potential sources of bias jointly, namely small EPV (high-dimensional settings), separation and data contamination. In this section, we make use of the OBREE, which is built through a simple adaptation of available estimators. Although our choices for the initial estimators may not be the optimal ones for this problem, they nevertheless provide OBREE with advantageous properties. Indeed, they appear to be unbiased and have similar finite sample Mean Squared Error (MSE) as the bias reduced MLE of KoFi:09 in uncontaminated data settings. Additionally, in data contaminated settings, the performance of the OBREE based on the robust initial estimator remains nearly unchanged. Moreover, in the latter case, we adapt the initial (robust) estimator so that it is not affected by the problem of separability, a problem that is more severe in the case of robust estimators (see Rousseeuw2003).
Consider the logistic regression with response and linear predictor , where is an matrix of fixed covariates with rows
, and with logit link. The MLE for is given by
and can be used as an initial estimator in order to obtain the OBREE, using for example the IB algorithm. When using, as initial estimator, the MLE defined in (6), we denote the resulting estimator as the OBREE-MLE.
We also consider the robust -estimator proposed by CaRo:01b, with general estimating equations (for GLMs) given by
with being the Pearson residuals and with consistency correction factor
where the expectation is taken over the (conditional) distribution of the responses (given ). For the logistic regression, we have . To avoid a potential problem of separatibility, we follow the suggestion of Rousseeuw2003 and compute the robust initial estimator on the transformed responses, or pseudo-values:
where is a fixed scalar close to zero. Using the pseudo-values leads to an initial (robust) estimator that is not consistent, even if we expect the asymptotic bias to be very small. The finite sample performance of this OBREE is of interest to investigate to what extent, the conditions needed for its bias property to hold, can be enlarged (as hinted in Corollary 1). To compute the initial estimator, we use the implementation of the glmrob function provided in the robustbase package in R (robustbase2018), with in (7
) being the Huber loss function (with default parameter, see huber1964robust) and , being the th diagonal element of the hat matrix . The resulting estimator, called OBREE-R, is in fact robust in the sense that it has a bounded influence function (Hamp:74). We perform a simulation study to validate the properties of the OBREE-MLE and OBREE-R and compare their finite sample performances to other well established estimators. In particular, as a benchmark, we also compute the MLE, the bias reduced MLE (BR-MLE) using the brglm function (with default parameters) of the brglm package in R (brglm2017), as well as the robust estimator (7) using the glmrob function in R without data transformation (ROB). We consider four situations that can occur with real data, which result from the combinations of balanced outcome classes (Setting I) and unbalanced outcome classes (Setting II) with and without data contamination. We also consider a large model with and choose as to provide EPV of respectively and , which are below the usually recommended value of . The parameter values for the simulations are provided in Table 1.
|Parameters||Setting I||Setting II|
The covariates were simulated (similarly to CaSu:19) independently from distributions for Setting I and
for Setting II, in order to ensure that the size of the log-odds ratiodoes not increase with , so that is not trivially equal to either or
. To contaminate the data, we choose a misclassification error that allows to observe a noticeable effect on the different estimators, which consists in permuting 2% of the responses with corresponding larger (smaller) fitted probabilities (expectations). The simulation results are presented in Figure1 as boxplots of the finite sample distribution, and in Figure 2 as the bias and Root Mean Squared Error (RMSE) of the different estimators.
The finite sample distributions presented in Figure 1, as well as the summary statistics given by the bias and RMSE presented in Figure 2, allow us to draw the following conclusions that support the theoretical results. In the uncontaminated case, the MLE and the robust estimator ROB are biased (except when the slope parameters are zero), however, the BR-MLE, OBREE-MLE and OBREE-R are all apparently unbiased. The results for the OBREE-MLE are in-line with its theoretical properties (in particular Theorem 1). The OBREE-R appears to enjoy the same properties although its initial estimator has a “small” asymptotic bias (see Corollary 1). This illustrates that our approach is applicable even when a consistent initial estimator is not available or when it is numerically unreliable (as is the case here). Moreover, the variability of all estimators is comparable, except for ROB which makes it rather inefficient in these settings. With 2% of contaminated data (missclassification error), the only estimator whose behaviour remains stable compared to the uncontaminated data setting is the OBREE-R. This is in line with a desirable property of robust estimators, that is stability with or without (slight) data contamination. The behaviour of all estimators remains the same in both settings, that is, whether or not the responses are balanced. Finally, as argued above, a better proposal for a robust, bias reduced and consistent estimator, as an alternative to OBREE-R, could in principle be proposed, but this is left for further research.
4.2 Bias Reduced Estimator for the Random Intercept Logistic Regression
An interesting way of extending the logistic regression to account for the dependence structure between the observed responses is to use the GLMM family (see for example LeNe:01; McCuSe:01; JiangBook2007, and the references therein). We consider here a commonly used model in practical settings, namely the random intercept model. The binary response is denoted by , where and . The expected value of the response is expressed as
where is a -vector of covariates (possibly accounting for a fixed intercept), is a -vector of regression coefficients and the random effect
is a normal random variable with zero mean and (unknown) variance.
Because the random effects are not observed, the MLE is derived on the marginal likelihood function where the random effects are integrated out. These integrals have no known closed-form solutions, so approximations to the (marginal) likelihood function have been proposed, including the Penalized Quasi-Likelihood (PQL) (see for example BrCl:93), Laplace Approximations (LA) (see for example RaYaYo:00) and adaptive Gauss-Hermite Quadrature (GHQ) (see for example PiCh:06). It is well known that PQL methods lead to biased estimators while LA and GHQ are more accurate (for extensive accounts of methods across software and packages, see for example BOLKER2009127; KiChEm:13). The lme4 R package (BaMaBo:10) uses both the LA and GHQ to compute the likelihood function, while the glmmPQL R function of the MASS library (VeRi:02) uses the PQL.
|Parameters||Setting I||Setting II|
In this section, we consider a simple approximation of the MLE as initial estimator for the OBREE. For computational efficiency, we choose the estimator defined through penalized iteratively reweighted least squares (P-IRLS) (see for example BaMaBo:10) as implemented in the function glmer function (with argument nAGQ set to 0) of the lme4 R package.
To study the behaviour of the OBREE and compare its performance in terms of bias and variance in finite samples to different approximations of the MLE (LA, GHQ or PQL), we perform a simulation study using the two settings described in Table 2. Both settings can be considered as high-dimensional in the sense that is relatively large with respect to . Moreover, while Setting II (, ) reflects a possibly more frequent situation, Setting I concerns the case where is small (and much smaller than ), a situation frequently encountered in cluster randomised trials (see for example CRT-Huang-2016; CRT-Leyrat-2017, and the references therein). As for the simulation study in Section 4.1 on the logistic regression, the covariates are simulated independently from the distribution .
The finite sample distributions are illustrated in Figure 3. Figure 4 presents the finite sample bias and RMSE of the approximated MLE estimators and the OBREE. It can be observed that the proposed OBREE has a drastically reduced finite sample estimation bias, especially for the random effect variance estimator. Moreover, the OBREE also achieves the lowest RMSE. These simulation results are in-line with the theoretical properties of the OBREE and the simulation-based findings of kuk1995asymptotically. This has important advantages when performing inference in practice, and/or when the parameter estimates, such as the random intercept variance estimate, are used, for example, to evaluate the sample size needed in subsequent randomized trials. The OBREE can eventually be based on another initial estimators in order to further improve efficiency for example, however this study is left for future research.
Appendix A Proof of Theorem 1
Before proving Theorem 1, we introduce a particular asymptotic notation and state a lemma that provides a strategy for proving the optimal bias reduction property. Let and be real-valued functions with being strictly positive for all . We write if and only if :
Let be a strictly positive real-valued function such that . If , then there exists a such that for all .
Since , there exists a such that . Without loss of generality we can suppose that the function is decreasing222Indeed, we can define a decreasing step function such that and , for all .. Therefore, for all we have and hence,
In other words, for all which ends the proof. ∎
he condition of Lemma 1, namely , may seem very strong. One may think for example that for all is sufficient. However, the exponential function provides a counter-example. Indeed, since for all , we have that for all , which means that
Since for all , for all appears clearly to be not sufficient for reaching the conclusion of Lemma 1. One may attribute this failure to the dependence in in (12) and propose the following stronger condition
Once again, the exponential function constitutes a counter-example. Indeed, setting and , we have, for all and all ,
implying that satisfies (13). However, we still have that for all and thus the conclusion of Lemma 1 cannot be reached. Both of these examples suggest that the stronger condition in (11) is indeed necessary.
Proof of Theorem 1.
By definition in (3), we have
Using Assumption B, we re-express each side of the above equation as follows:
where for all , is a zero-mean random vector. We directly obtain
elementwise. Consequently, we deduce from (14) that
The main idea is to re-evaluate using the mean value theorem as it allows to demonstrate by induction that, for all ,
Then, few extra steps enables to satisfy Lemma 1 which concludes the proof.
Applying the mean value theorem for vector-valued functions to we have
where and corresponds to a set of vectors lying in the segment for (with respect to the function ). By Assumptions A and B, is also a bounded random variable. Moreover, by Assumption B again, elementwise since elementwise and
for all . For simplicity, we denote
which implies that . Moreover, a consequence of (15) is that
for any and hence
Now, we have
Using Cauchy-Schwarz inequality, we have
Since and are bounded random variables on a compact set, and , so we have
Therefore, we obtain
and hence, using (16) we deduce that
Since , we have
Since and , one can repeat the same computations and deduce by induction that, for all ,
In order to show that
which implies , it is sufficient to demonstrate that
Since for any , we have and , we obtain that
Without loss of generality, we suppose that and . Therefore, setting and , our previous computations implies that, for all and all , we have
Hence (18) holds true which ends the proof. ∎
Interestingly, the optimal asymptotic bias reduction property holds true in settings that are (way) more extreme than required by Assumption B, namely . Indeed, according to (17), the optimal asymptotic bias reduction property holds as long as for some , and in particular, even when . However, the condition is used to prove the consistency of thus justifying Assumption B as it is stated for readability purposes.
As previously said, the optimal bias reduction property of can also be achieved when the initial estimator has a sub-linear asymptotic bias, i.e. can be written as follows,
where with and . In this case, the rate at which is allowed to grow with is more restrictive.
Even if the optimal bias reduction can be achieved when the asymptotic bias is large (there are no constraints on the norms of and ), in practice, the sub-linear assumption is more likely to be (nearly) satisfied when the asymptotic bias small as illustrated by our simulation studies (see Section 4).
which implies that
elementwise. The rest of the argument is similar to the one in the proof of Theorem 1 and leads to
Appendix B Proof of Proposition 1
The proof of Proposition 1 is split into two lemmas. The first concerns the consistency of the OBREE and the second its asymptotic normality.
This proof is directly obtained by verifying the conditions of Theorem 2.1 of newey1994large on the functions and defined as follow:
where and . Reformulating the requirements of this theorem to our setting, we have to show that (i) is compact, (ii) is continuous, (iii) is uniquely minimized at , (iv) converges uniformly in probability to .
On the one hand, Assumption A ensures that is compact. On the other hand, is trivially continuous and uniquely minimized at . What remains to be shown is that converges uniformly in probability to , which is equivalent to show that: for all and for all , there exists a sample size such that for all
Fix and . Using the above definitions, we have that
Considering the first term on the right hand side of (22), we have
by Assumption C. Similarly, we have