When estimating a causal effect of a treatment with observational data, methods that control for confounding using propensity scores [RR:83] are extensively used. The propensity score, the conditional probability of receiving a specific treatment conditional on observed covariates, are used in estimators for average causal effects in several different approaches, such as weighting, matching and regression adjustment, see e.g, Goetghebeur et al (2020) for an overview. In particular, inverse probability weighting (IPW) plays a central role in the theory of semiparametric estimation with missing data [AT:07, seaman2013review]
. A prototypical IPW estimator utilises a parametric model for the propensity score[LD:04, il2019efficiency]. The parametric IPW estimator relies on the correct specification of the propensity score model although IPW estimators under various semi-parametric assumptions of the propensity score have also been proposed [HIR:03, van2014targeted, liu2018alternative].
As an extension to model-based IPW, a class of estimators referred to as calibration or balancing estimators has been developed [hainmueller2012entropy, chan2014oracle, chan2016globally, tan2020regularized]
. In this class, the observed sample is reweighted by minimizing a distance measure or a loss function while imposing a set of balance constraints for the treated and controls in the reweighted sample in terms of means of functions of the confounding covariates. A similar procedure is used by the calibration estimators in unequal probability sampling[deville1992calibration]. As a hybrid between IPW and balancing, IR:14 use a combination of parametric propensity score modeling and calibration by adding estimating equations with balancing constraints for the covariates to the score equations of the propensity score model. Properties of an extended version of the estimation approach targeting a high-dimensional setting is derived in ning2020robust. These approaches relate to the general framework introduced in zhao2019covariate in which the propensity score is estimated using loss functions that are designed to give covariate balance. In a recent study, chattopadhyay2020balancing, referring broadly to balancing vs modelling, review and compare the two weighting approaches.
Several studies have considered weighting procedures derived from balance criteria beyond exact equality of sample moments. zubizarreta2015stable studies minimum-variance weights under approximate equality of moments and derives a link between the pre-specified level of tolerated imbalance and estimator bias given weak nonparametric assumptions for the outcome (response surface). For the same class of estimators, wang2020minimal presents further developments derived by a connection between approximate covariate balance and shrinkage estimation of the inverse propensity score. yiu2018covariate also consider minimum-variance weights, but under a balancing constraint derived from a propensity score model, which imposes weights to eliminate the association between covariates and treatment in the sample. wong2017kernel derive asymptotic properties of a weighting estimator obtained by optimizing balance over an infinite-dimensional reproducing-kernel Hilbert space of covariate functions.
In this paper we derive asymptotic properties for two prototypical entropy balancing estimators of the average treatment effect (ATE). Entropy balancing estimators uses balancing weights that minimize the divergence, or relative entropy, relative a set of uniform weights. By construction, the weights mean balance a set of covariate functions between the treated, controls, and the combined group, a property referred to as three-way balance. Two variants of the relative entropy are considered: the Kullback-Leibler divergence (KL), initially proposed by hainmueller2012entropy (for the average treatment effect of the treated, ATT), and the quadratic Rényi divergence (QR), which yields the minimum variance weights. Entropy balancing does not explicitly use models for the treatment or the outcome, a distinctive feature compared to other three-way balancing estimators, for example the calibration estimator in tan2020regularized and the empirical likelihood approach[qin2007empirical] which both utilize a parametric model for the propensity score.
Although reducing model dependence was expressed by hainmueller2012entropy as a main motivation to the introduction of entropy balancing, estimators obtained by minimizing the two divergences (henceforth referred to as KL and QR estimators) entail implicit functional forms given by the two divergence measures respectively. This means that the correctness of the implied propensity score models plays a role for the consistency of the estimators. Under additional assumptions of models for the outcome, we show that the KL and QR estimators are consistent for the average causal effect and asymptotically normal. The outcome model assumption can be relaxed by separating and combining assumptions for the propensity score and the outcome for the treated and controls. A notable difference for the role of the outcome model assumptions between the KL and QR estimators compared to, e.g., regression imputation estimators or augmented IPW-estimators is that the regression model per se is not used in the estimators. Instead, outcome regression model assumptions are (implicitly) included in the balancing constraints selected in the divergence minimization scheme. Additionally, we propose an estimator for the variance. The entropy balancing estimators and the proposed variance estimator are evaluated in a simulation study. The estimators are applied to a motivating example on the effect of school achievements on acute complications of childhood onset type 1 diabetes mellitus.
The entropy balancing estimators under fixed balance constraints studied here can be seen as parametric versions of the more general nonparametric calibration estimators described in chan2016globally and wang2020minimal where the number of balance constraints are assumed to grow with the sample size. Although these nonparametric versions constitute a larger class of asymptotically optimal calibration estimators, our approach provides guidance on the magnitude of the error resulting from the use of prototypical versions of the estimators applied in practice.
The remaining of the paper is organized as follows. Section 2 contains notation and assumptions together with a brief background to the problem. In Section 3, we introduce entropy balancing and provide details on the two particular cases that are studied here: the Kullback-Leibler- and quadratic Rényi estimators. In Section 4, we state our main results, including asymptotic limits for the estimation error, and conditions for consistency and asymptotic normality. Section 5 presents a simulation study that evaluates the estimators on an extension of a design previously used in the literature. A data example, with a study of acute complications of type 1 diabetes mellitus is provided in Section 6 and in Section 7 we conclude with a discussion.
2. Model and Theory
We use the potential outcome framework [Neyman23, DR:74] to define a causal effect of a treatment on an outcome of interest. Consider a binary treatment, for individuals under treatment and for individuals under no treatment. We define as the potential outcome under treatment and as the potential outcome under the control treatment and assume that the observed . We let denote a
-dimensional vector of covariates,. Thus, for each unit in an iid sample, , we observe the vector and for simplicity we omit the index when not needed. The probability of treatment conditional on the covariates, the propensity score, is denoted by . We consider the average causal effect as the estimand of interest, and let and be the marginal potential outcome means. The conditional mean functions are denoted by and . We use the notation for almost sure convergence (with probability 1), and let and denote convergence in probability and in distribution, respectively.
For identification of with the observed data we use the assumption of no unmeasured confounding and overlap as defined below.
Assumption 1 (No unmeasured confounding).
Assumption 2 (Overlap).
, for some .
In the conventional IPW approach, an empirical version of (2.1) is used to estimate , such as
and , where is an estimate of the propensity score . The missing outcomes for the respective groups are generated with and the estimator is obtained as the difference in mean outcomes between the groups in the generated data.
Due to sensitivity to model misspecification and poor finite sample performance of conventional IPW estimators [waernbaum2012model], it has been suggested to derive weights on the basis of covariate balance rather than propensity score estimation. In this setting, weighting by the inverse propensity is used as a change of probability measure that balances the covariate distribution between the treatment group and the combined group, and likewise creates covariate balance between the control group and the combined group. This is referred to as three-way balance in the sequel and is a formulation of the following property: for any real-valued function with ,
Here, we consider a finite-dimensional version of (2.3) which is specified by a set of user-defined functions . It is common to consider moment constraints up to some degree, for example, if we let equal the dimension of , and put , first order mean balance is obtained. We use the notation and , where , and let denote the corresponding sample vector. The following regularity assumption for the balance functions is used throughout the paper. Similar assumptions are used in HIR:03 and chan2016globally.
For some , almost surely, and the random vector has a nonsingular covariance matrix.
3. Entropy Balancing
Entropy balancing was originally proposed by hainmueller2012entropy. Although presented as a general weighting procedure for pre-processing observational data, the paper introduced the method by considering estimation of the average treatment effect of the treated, .
Here we study entropy balancing for the problem of estimating the average treatment effect . The estimator is of the same form as the IPW-estimator (2.2), but the weights are derived on the basis of balancing property (2.3) instead of propensity score estimation. The weight vector is obtained by minimizing the divergence , or relative entropy, between and a vector of reference weights while imposing empirical mean balance with respect to the functions . To be specific, is found as the solution of the optimization problem to find
subject to the three-way balance- and normalizing constraints
The divergence measure may be chosen in different ways. In information theory it is standard to apply a member of the Rényi divergences, a parametric family defined by
With a divergence from this class, problem (3.1) divides into two problems, one for each group, that can be solved separately. In this paper we investigate two of the most used Rényi divergences: the Kullback-Leibler divergence (KL) () and the quadratic Rényi divergence (QR) (), given by
The base weights are considered as fixed and can represent some prior information although a common choice are uniform base weights, i.e., [hainmueller2012entropy, chan2016globally]. In this study we use uniform base weights, which entails that the entropy of the weights is maximized [kallberg2012statistical]. Next we provide details on how to obtain entropy balancing estimators using KL- and QR divergences, respectively.
3.1. Kullback-Leibler divergence
When the KL-divergence is used, the entropy balancing weights for group , is obtained by solving
subject to the constraints
The solution of this convex and linearly constrained problem can be represented by an exponential tilting [erlander1981entropy], given by
where the -vector solves the unconstrained dual problem
The entropy balancing estimator corresponding to this weighting is given by
Problem (3.7) can be solved by applying an iterative Newton-type algorithm, e.g., as described in hainmueller2012entropy. This requires that the solution is unique, which holds when the observed covariate vectors spans , making the objective function strictly convex. We set when a unique solution of (3.7) does not exist, in which case becomes the (unadjusted) difference in mean. More intricate alternatives to this extended definition are possible, but since it is only technical and serves to enable asymptotic theory for the estimators, our crude variant is sufficient for our purposes.
3.2. Quadratic Rényi divergence
The entropy balancing weights for the QR-divergence in group , denoted by , are found as the solution of
subject to balancing condition (3.5). This is a quadratic problem with linear constraints, and consequently has a straightforward solution. Let and be matrices where row of is given by , and row of is given by . If has full rank, problem (3.9) admits a unique solution which can be represented by the linear expression
where the -vector is obtained as
The quadratic entropy balancing estimator is defined as
Similarly as for the KL-estimator, we define in case does not have full rank, where denotes sample size of group , which corresponds to the unadjusted difference in mean.
4. Asymptotic properties of the KL and QR estimators
The previous section describes how the entropy balancing weights are parameterized by the solutions and of the dual problems (3.7) and (3.11), respectively. The following result on convergence of the dual parameters is used to obtain asymptotic results for the corresponding estimators. Proofs of the results in this section are in Appendix A.
As described in Proposition 1, the sequences and converge to certain constant vectors and . Based on these limits, we introduce asymptotic counterparts of the entropy balancing weights. For the KL estimator, let
and for the QR estimator, let
In the following theorem, these limiting weights are used to formulate a result on weak convergence for the entropy balancing estimators. The subsequent corollary gives representations of the asymptotic estimation error and for ease of presentation we use to denote the KL and QR estimators for results that hold for both estimators.
We see that the asymptotic error depends on the covariance of the propensity score model error and the conditional outcome. As indicated by Theorem 1 and Corollary 1, further assumptions are needed to ensure consistency of . As shown in the following corollary, it suffices to assume a linear outcome regression (OR) model, or a propensity score (PS) model combined with a linear OR model for one of the groups. The purpose of this result is to illustrate the idea and is not intended to be exhaustive, see Remark 1.
Corollary 2 (Consistency, OR and PS models).
We note that a linear model assumption for and is not necessary for consistency of () but can be relaxed by imposing a log-linear (inverse-linear) functional form for the propensity score.
Remark 1 (Two-way balance).
The assumptions of Corollary 2 ensure that and , the components of , are consistent as estimators of the group-wise parameters . This is sufficient but not necessary for consistency of . For example, consider the case of no treatment effect, i.e., . Then the asymptotic estimation error is obtained from Theorem 1 as
Here is consistent when the (asymptotic) weights two-way balance the common conditional mean between the treatment and control group, i.e., when . In Appendix B we describe a class of distributions that results in this type of two-way balance.
In the next theorem, we show that the linear model for and , as in Corollary 2, is sufficient to obtain asymptotic normality of the two studied estimators.
Theorem 2 (Asymptotic normality, OR model).
Note that differences in asymptotic variance between and are given by differences in the limiting weights. The large sample properties of the entropy balancing estimators described above are parametric versions of the calibration method studied by chan2016globally, where an estimator of , denoted , is obtained by three-way balancing a set of functions that grows with the sample size, i.e., as . Under nonparametric assumptions it is shown that , where is the semi-parametric efficiency bound [Hahn1998], given by
The following simple example serves to illustrate the robustness property in Corollary 2. For brevity we consider estimation of the treatment component .
Example 1 (Model flexibility).
Let and be independent variables from a beta distribution with both shape parameters equal
be independent variables from a beta distribution with both shape parameters equal, and define outcome variables as , where the error and the conditional mean
Three different treatment variables , and are examined, which have binomial distributions conditional on
are examined, which have binomial distributions conditional onspecified by the propensity scores
We define estimators and of by balancing and , i.e., we use . Even though the conditional mean outcome of the treated is nonlinear in and , it holds that is consistent when the treatment variable is , and is consistent when the treatment variable is , as implied by the functional forms of the propensity scores and , see Corollary 2. To confirm this, we compute approximations of the asymptotic errors and using the covariance representation from Corollary 1. First the limiting parameters and are estimated using the dual parameters and obtained from two large samples, and then the covariances and are approximated using sample covariances based on new independent sample, with the estimates of the dual limits plugged in, see Figure 1.
4.1. Variance estimation
To make use of the normal approximation in Theorem 2, we need estimators for the asymptotic variances and . We use the plug-in principle to construct estimators and which, under the assumption of constant conditional variances, , have the form
and similar for . Here, different nonparametric estimators of the regression functions can be used for estimation of the variance of the difference of conditional expectations, the first term in (4.5), and the conditional variances , see e.g., ruppert1997local and fan1998efficient. In the simulation study we use a simple approach with linear least squares regression models. In this case, the variance estimators are consistent under the assumptions of Theorem 2 (and ). To see this, note that the two last terms in (4.5) converge weakly to the corresponding expected values, which can be verified by the limiting result for the dual parameters in Proposition 1
together with a weak law of large numbers for averages with estimated parameters[SB:13, Th. 7.3].
The R code used for the simulations are available online, see github.
We study an extension of the simulation scenario used in hainmueller2012entropy, which in turn originates from frolich2007propensity. This includes a six-dimensional covariate vector, , where is multivariate normal with mean zero and , and where , , .
The treatment variable is defined as and where the error term obeys . The outcome variables follow conditional normal models; , where and are independent and . Since the design in hainmueller2012entropy considers only the true treatment effect fixed at zero for all units we additionally use three different designs for the conditional mean , see Table 1 for details.
Both the Kullback-Leibler and the quadratic Rényi divergences are examined, along with two variations of the balancing vector , given by , and a vector that contains and all second order terms and interactions. Hence by the combinations of divergence measures and balancing vectors, four variations of entropy balancing estimators are included. We use the notation , for reference to which balancing vector that is used, either () or (). Approximations of the asymptotic error in Corollary 1 for the entropy balancing estimators in the considered simulation designs are also presented. The values of are obtained as described in Example 1 (here we use samples and average over 10 replications).
For comparison, we also include three IPW-estimators of type (2.2), based on the probit specifications
5.3. Variance estimators
Three different methods for estimating the variances of and and are investigated. In addition to the proposed parametric OLS estimator (4.5), we consider two nonparametric methods; an estimate of the efficiency bound (described below) and a simple bootstrap approach based on 10,000 resamples.
As discussed in previous section, and can be regarded as nonparametric estimators if the balancing vector is assumed to grow in dimension with , and then, under certain assumptions, the variances of and converge to the efficiency bound . chan2016globally introduce a non-parametric estimator of which is an empirical counterpart of the representation where is the efficient influence function given by
The estimator can be written as where is obtained by inserting estimates of the conditional mean , and the inverse probabilities and . The estimate of , denoted by , is the weighted least squares estimator that regresses the outcome on in group . For estimation of the inverse probabilities and , the corresponding sets of weights and are used. We refer to the paper for more details on this estimator.
To indicate which variance estimator is used, we let P denote the proposed parametric OLS estimator (4.5), NP denote the nonparametric estimator , and B denote the bootstrap estimator. For evaluation we use the ratio , where
is the Monte Carlo variance. The ratio is interpreted as the proportion of the mean of estimated variances in relation to the true variance and we use a large sample approximation of its standard error[SB:13, Eq. 9.4].
The bias, standard deviation and root mean squared error (RMSE) of the estimators are displayed in Table3 () and Table 6 (). In addition, Table 4 () and Table 7 () provide coverage, empirical standard errors, and means of estimated standard errors () for three different variance estimators. Evaluations of the variance estimators are presented in Table 5 () and Table 8 (). The error approximations displayed in Table 2 are consistent with the observed bias for the EB estimators.
The small bias of , , and in Design A, and of and in Designs B and C, is expected as the assumption of Corollary 2 is met for both balancing vectors and in Design A, and for in Designs B and C (approximately for Design B). We get a small bias for the correctly specified IPW1, whereas the two misspecified IPW estimators are heavily biased. All three IPW estimators have relatively large standard error. In Design C the standard errors of and increase with factors around 8 and 3 compared to and .
The results for the parametric variance estimators show that the means of the estimated variances are relatively close to the empirical for all four EB estimators. The bootstrap estimator has a similar performance, although it has a tendency to underestimate the variance of in Design C, and to overestimate the variances of and for the smaller sample size . The nonparametric variance estimator underestimates the variance in several cases, and this is more pronounced for the estimator .
Next we give separate comments for the two outcome designs no treatment effect and heterogeneous treatment effect
5.4.1. No treatment effect
The bias is small for and in Design B and C, which is unexpected given that the assumption of Corollary 2(i) is not satisfied for the shorter balancing vector . In Appendix B we provide a thorough examination of this result.
The interval coverage is close to for all estimators, with the exception of when the nonparametric variance estimator is used. This is caused by underestimation of the variance. The same pattern is seen for both sample sizes, although for the coverage varies slightly more.
5.4.2. Heterogeneous treatment effect
We see that and are biased in Design B, and heavily biased in Design C. Note also that is clearly worse, see the discussion in Appendix B. The coverage of and is lower than 95 in Designs B and C for all three variance estimators, where is clearly worse. This is expected given their bias. Note that in Design C, the variance of is underestimated by the nonparametric- and bootstrap methods, but not by the parametric approach, which is reflected in the differences in coverage. For and , the coverage is close to 95 in most cases although combined with the nonparametric variance estimator has a lower coverage in Design B, due to underestimation of the variance.
|1) No treatment effect|
|(True = 0, unadjusted )|
|2) Heterogenous treatment effect|
|(True , unadjusted )|