# Propensity Score Weighting for Covariate Adjustment in Randomized Clinical Trials

Imbalance in baseline characteristics due to chance is common in randomized clinical trials. Regression adjustment such as the analysis of covariance (ANCOVA) is often used to account for imbalance and increase precision of the treatment effect estimate. An objective alternative is through inverse probability weighting (IPW) of the propensity scores. Although IPW and ANCOVA are asymptotically equivalent, the former may demonstrate inferior performance in finite samples. In this article, we point out that IPW is a special case of the general class of balancing weights, and propose the overlap weighting (OW) method for covariate adjustment. The OW approach has a unique advantage of completely removing chance imbalance when the propensity score is estimated by logistic regression. We show that the OW estimator attains the same semiparametric variance lower bound as the most efficient ANCOVA estimator and the IPW estimator with a continuous outcome, and derive closed-form variance estimators for OW when estimating additive and ratio estimands. Through extensive simulations, we demonstrate OW consistently outperforms IPW in finite samples and improves the efficiency over ANCOVA and augmented IPW when the degree of treatment effect heterogeneity is moderate or when the outcome model is incorrectly specified. We apply the proposed OW estimator to the Best Apnea Interventions for Research (BestAIR) randomized trial to evaluate the effect of continuous positive airway pressure on patient health outcomes.

## Authors

• 11 publications
• 61 publications
• 176 publications
02/03/2021

### Covariate adjustment in subgroup analyses of randomized clinical trials: A propensity score approach

Background: Subgroup analyses are frequently conducted in randomized cli...
05/11/2022

### Leveraging baseline covariates to analyze small cluster-randomized trials with a rare binary outcome

Cluster-randomized trials (CRTs) involve randomizing entire groups of pa...
12/17/2020

### Increasing the efficiency of randomized trial estimates via linear adjustment for a prognostic score

Estimating causal effects from randomized experiments is central to clin...
04/01/2019

### A Regression Discontinuity Design for Ordinal Running Variables: Evaluating Central Bank Purchases of Corporate Bonds

We propose a regression discontinuity design which can be employed when ...
05/21/2019

### Robustness of ANCOVA in randomised trials with unequal randomisation

Randomised trials with continuous outcomes are often analysed using ANCO...
05/21/2019

### The P-LOOP Estimator: Covariate Adjustment for Paired Experiments

In paired experiments, participants are grouped into pairs with similar ...
11/30/2017

### Balancing Out Regression Error: Efficient Treatment Effect Estimation without Smooth Propensities

There has been a recent surge of interest in doubly robust approaches to...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Randomized controlled trials are the gold standard for evaluating the efficacy and safety of new treatments and interventions. Statistically, randomization ensures the optimal internal validity and balances both measured and unmeasured potential confounders in expectation. This allows the simple unadjusted difference-in-means estimator to provide an unbiased estimate of the intervention effect

(Rosenberger and Lachin, 2002). Frequently, important patient characteristics are collected at baseline; although over repeated experiments, they will be balanced across treatment arms, chance imbalance often arises in a single trial due to the random nature in allocating the treatment (e.g. Senn, 1989; Ciolino et al., 2015), especially when the sample size is limited (Thompson et al., 2015). If any of the baseline covariates are prognostic risk factors that are predictive of the outcome, adjusting for the imbalance of these factors in the analysis can improve the statistical power and provide a greater chance of identifying the treatment signals when they actually exist (e.g. Ciolino et al., 2015; Pocock et al., 2002; Hernández et al., 2004).

There are two general streams of methods for covariate adjustment in randomized trials: (outcome) regression adjustment (e.g. Yang and Tsiatis, 2001; Kahan et al., 2016; Leon et al., 2003; Tsiatis et al., 2008; Zhang et al., 2008) and the inverse probability of treatment weighting (IPW) based on propensity scores (e.g. Williamson et al., 2014; Shen et al., 2014; Colantuoni and Rosenblum, 2015). For regression adjustment with continuous outcomes, the analysis of covariance (ANCOVA) model is often used, where the outcome is regressed on the treatment, covariates and possibly their interactions (Tsiatis et al., 2008)

. The treatment effect is estimated by the coefficient of the treatment variable. With binary outcomes, a generalized linear model can be postulated to estimate the adjusted risk ratio or odds ratio, with the caveat that the regression coefficient of treatment may not represent the marginal effect potentially due to non-collapsability

(Williamson et al., 2014). By appealing to the semiparametric theory, Tsiatis and colleagues (e.g. Yang and Tsiatis, 2001; Leon et al., 2003; Tsiatis et al., 2008) developed a suite of broadly applicable ANCOVA estimators that improves efficiency over the unadjusted analysis in randomized trials. Lin (2013) further clarified that it is critical to incorporate a full set of covariate-by-treatment interaction terms in regression adjustment for efficiency gain. The finite-sample performance of these ANCOVA estimators has been examined by previous simulations (Tsiatis et al., 2008)

. An attractive feature of the ANCOVA estimators is that the point estimates as well as the standard error estimates when the randomization probability is

remain consistent even if the outcome model is misspecified (e.g. Yang and Tsiatis, 2001; Lin, 2013; Wang et al., 2019). However, misspecification of the outcome model can decrease precision in unbalanced experiments with heterogeneity (Freedman, 2008). An additional limitation of regression adjustment is the potential for inviting a ‘fishing expedition’ to search for an outcome model that gives the most dramatic treatment effect estimate, and thus jeopardizes the objectivity of causal inference with randomized trials (e.g. Tsiatis et al., 2008; Shen et al., 2014).

Originally developed in the context of survey sampling and observational studies (Lunceford and Davidian, 2004), IPW has been advocated as an objective alternative to ANCOVA in randomized trials (Williamson et al., 2014). To implement a typical IPW estimator, one first fits a logistic “working” model to estimate the propensity scores, defined as the conditional probability of receiving the treatment given the baseline covariates (Rosenbaum and Rubin, 1983). With the weights constructed as the inverse of the estimated propensity scores, the treatment effect is estimated by the weighted outcome difference between the treatment arms. In randomized trials, the allocation process of the treatment is completely controlled and the true propensity score is known. Therefore, the working model for the propensity scores is always correctly specified, and the IPW estimator is consistent to the marginal treatment effect. Moreover, with a continuous outcome, the IPW estimator under a logistic working model has the same large-sample variance as the efficient ANCOVA estimator (e.g. Shen et al., 2014; Williamson et al., 2014), but offers several advantages. First, IPW separates the design and analysis (Rubin, 2008) in that the propensity score model only involves baseline covariates and the treatment indicator; it does not require the access to the outcome and hence avoids the ‘fishing expedition’. As such, arguably IPW offers better transparency and objectivity in pre-specifying the analytical adjustment during the design stage. Second, IPW preserves the marginal treatment effect estimand with non-continuous outcomes, while the interpretation of the outcome regression coefficient may change according to different covariate specifications (e.g. Hauck et al., 1998; Robinson and Jewell, 1991). Third, IPW can easily obtain treatment effect estimates for rare binary or categorical outcomes whereas outcome regression models often fail to converge in such situations (Williamson et al., 2014). This is particularly the case when the target parameter is a risk ratio, where log-binomial models are known to have unsatisfying convergence properties (Zou, 2004). On the other hand, a major limitation of IPW is that it may be inefficient compared to ANCOVA in scenarios with limited sample sizes and unbalanced treatment allocations, as demonstrated in a recent simulation study(Raad et al., 2020).

In this paper, we point out that IPW is a special case of the general class of propensity score weights, called the balancing weights (Li et al., 2018), many members of which could be used for covariate adjustment in randomized trials. Within this class, we advocate to use a new weighting method, the overlap weighting (OW), which has been shown previously, in the context of observational studies, to offer theoretical and empirical gain in variance estimation compared to IPW (Li et al., 2019). In the context of randomized trials, a particularly attractive feature of OW is that, if the propensity score is estimated from a logistic regression “working” model, then OW leads to exact mean balance of any baseline covariate in the logistic model, and consequently remove the chance imbalance of that covariate. As a propensity score method, OW retains the aforementioned advantages of IPW while offers better finite-sample operating characteristics in terms of variance reduction (Section 2). Specifically, in Section 3, we demonstrate that the OW estimator, similar as IPW, achieves the same semiparametric variance lower bound and hence is asymptotically equivalent to the efficient ANCOVA estimator for continuous outcomes. For binary outcomes, we further provide closed-form variance estimators of the OW estimator for estimating marginal risk difference, risk ratio and odds ratio, which incorporates the uncertainty in estimating the propensity scores and achieves close to nominal coverage in finite samples. Through extensive simulations in Section 4, we demonstrate the efficiency advantage of OW under small to moderate sample sizes, and also validate the proposed variance estimator for OW. Finally, in Section 5 we apply the proposed method to the Best Apnea Interventions for Research (BestAIR) randomized trial and evaluate the treatment effect of continuous positive airway pressure (CPAP) on clinical outcomes.

## 2 Propensity score weighting for covariate adjustment

### 2.1 The Balancing Weights

We consider a randomized trial with two arms and patients, where and patients are randomized into the treatment and control arm, respectively. Let be the binary treatment indicator, with indicates treatment and control. Under the potential outcome framework (Rubin, 1974), each unit has a pair of potential outcomes , mapped to the treatment and control condition, respectively, of which only the one corresponding to the actual treatment assigned is observed. We maintain the consistency assumption (Hernan and Robins, 2010) so that the observed outcome . In randomized trials, a collection of baseline variables could be recorded for each patient, denoted by . Denote and as the marginal and conditional expectation of the outcome in arm (), respectively. A common estimand on the additive scale is the average treatment effect (ATE):

 τ=E{Yi(1)−Yi(0)}=μ1−μ0. (1)

We assume that the treatment is randomly assigned to patients, where , and is the randomization probability. The most typical study design ensures balanced assignment so that . Other values of may be possible, for example, when there is a perceived benefit of the treatment, and a larger proportion of patients are randomized to the intervention. Under randomization of treatment and the consistency assumption, we have , and thus an unbiased estimator for is the unadjusted difference-in-means estimator:

Below we generalize the ATE to a class of weighted average treatment effect (WATE) estimands to construct alternative weighting methods. Assume the study sample is drawn from a probability density , and let denote the covariate distribution density of a target population, possibly different from the one represented by the observed sample. The ratio is called a tilting function (Li and Li, 2019a), which re-weights the distribution of the baseline characteristics of the study sample to represent the target population. We can represent the ATE on the target population by a WATE estimand:

 τh=Eg[Yi(1)−Yi(0)]=E{h(x)(μ1(x)−μ0(x))}E{h(x)}. (3)

In practice, we usually pre-specify instead of . Most commonly is specified as a function of the propensity score or simply a constant. The propensity score (Rosenbaum and Rubin, 1983) is the conditional probability of treatment given the covariates, . Under the randomization assumption, for any baseline covariate value , and therefore as long as is a function of the propensity score , different corresponds to the same target population , and the WATE reduces to ATE, i.e. . This is distinct from observational studies, where the propensity scores are usually unknown and vary between units, and consequently different corresponds to different target populations and estimands. This special feature under randomized trials provides the basis for considering alternative weighting strategies to achieve better finite-sample performances.

In the context of confounding adjustment in observational studies, Li et al. (2018) proposed a class of propensity score weights, named the balancing weights, to estimate WATE. Specifically, given any , the balancing weights for patients in the treatment and control arm are defined as:

 w1(x)=h(x)/e(x),w0(x)=h(x)/{1−e(x)}, (4)

which balances the distributions of the covariates between treatment and control arms in the target population, so that , where is the conditional distribution of covariates in treatment arm (e.g. Wallace and Moodie, 2015; Li et al., 2018). Then, one can use the following Hájek-type estimator to estimate :

 ^τh=^μh1−^μh0=∑Ni=1w1(xi)ZiYi∑Ni=1w1(xi)Zi−∑Ni=1w0(xi)(1−Zi)Yi∑Ni=1w0(xi)(1−Zi). (5)

Though the function can take any form, we restrict our discussion to the following special cases relevant to randomized trials. For example, when , the balancing weights become the inverse probability weights, ; when , we have the overlap weights (Li et al., 2018), . Other examples of the balancing weights include the average treatment effect among treated (ATT) weights (Hirano and Imbens, 2001) and the matching weights (Li and Greene, 2013).

IPW is the most well-known case of the balancing weights. Specific to covariate adjustment in randomized trials, Williamson et al. (2014) and Shen et al. (2014) suggested the following IPW estimator of :

 ^τ\tiny{IPW}=∑Ni=1ZiYi/^ei∑Ni=1Zi/^ei−∑Ni=1(1−Zi)Yi/(1−^ei)∑Ni=1(1−Zi)/(1−^ei). (6)

We will point out in Section 3 that their findings on IPW are generally applicable to the balancing weights as long as is a smooth function of the true propensity score. The choice of , however, will affect the finite-sample operating characteristics of the weighting estimator. In particular, below we will closely examine the overlap weights.

### 2.2 The Overlap Weights

The overlap weights, , weigh each patient by its probability of being assigned to the opposite group, and thus gradually down-weigh the patients whose propensity scores are away from the center (0.5). In contrast to IPW, the patients with extreme (close to 0 and 1) propensity scores contribute the least in the estimation of the treatment effect. In observational studies, the overlap weights correspond to a target population with the most overlap in the baseline characteristics, and have been shown theoretically to give the smallest asymptotic variance of among all balancing weights (Li et al., 2018) as well as empirically reduce the variance of in finite samples(Li et al., 2019). In randomized trials, as discussed before, because the true propensity score is constant, the overlap weights and IPW target the same estimand , but their finite-sample operating characteristics can be markedly different, as elucidated below.

The OW estimator for the ATE in randomized trials is

 ^τ\tiny{OW}=^μ1−^μ0=∑Ni=1(1−^ei)ZiYi∑Ni=1(1−^ei)Zi−∑Ni=1^ei(1−Zi)Yi∑Ni=1^ei(1−Zi), (7)

where is the estimated propensity score from a logistic regression model:

 ei=e(Xi;θ)=exp(θ0+XTiθ1)1+exp(θ0+XTiθ1), (8)

with parameters and is the maximum likelihood estimate of . Regarding the selection of covariates in the propensity score model, the previous literature suggests to include stratification variables as well as a small number of key prognostic factors pre-specified in the design stage (e.g. Raab et al., 2000; Williamson et al., 2014). These guidelines are also applicable to the OW estimator.

The logistic propensity score model fit underpins a unique exact balance property of OW. Specifically, the overlap weights estimated from model (8) lead to exact mean balance of any predictor included in the model (Theorem 3 in Li et al. (2018)):

 ∑Ni=1(1−^ei)ZiXji∑Ni=1(1−^ei)Zi−∑Ni=1^ei(1−Zi)Xji∑Ni=1^ei(1−Zi)=0,for j=1,...,p. (9)

This property has important practical implications in randomized trials, namely, for any baseline covariate included in the propensity score model, the associated chance imbalance in a single randomized trial vanishes once the overlap weights are applied. If one reports the weighted mean differences in baseline covariates between arms (frequently included in the standard “Table 1” in primary trial reports), those differences are identically zero. Thus the application of OW enhances the face validity of the randomized study.

More importantly, the exact mean balance property translates into better efficiency in estimating . To illustrate the intuition, consider the following simple example. Suppose the true outcome surface is with . Denote the weighted chance imbalance in the baseline covariates by

 ΔX(w0,w1)=∑Ni=1w1(Xi)ZiXi∑Ni=1w1(Xi)Zi−∑Ni=1w0(Xi)(1−Zi)Xi∑Ni=1w0(Xi)(1−Zi),

and the weighted difference in random noise by

 Δε(w0,w1)=∑Ni=1w1(Xi)Ziεi∑Ni=1w1(Xi)Zi−∑Ni=1w0(Xi)(1−Zi)εi∑Ni=1w0(Xi)(1−Zi).

For the unadjusted estimator, substituting the true outcome surface in equation (2) gives , where measures the chance imbalance in a single random allocation, and is pure noise. This expression implies that the estimation error of is a sum of the chance imbalance and random noise, and becomes large when imbalanced covariates are highly prognostic (i.e. large magnitude of ). Similarly, if we substitute the true outcome surface in (6), we can show that the estimation error of IPW is . Intuitively, IPW controls for chance imbalance because we usually have , which reduces the variation of the estimation error over repeated experiments. However, because is not zero, the estimation error remains sensitive to the magnitude of . In contrast, because of the exact mean balance property of OW, we have ; consequently, substituting the true outcome surface in (7), we can see that the estimation error of OW equals , which is only noise and free of . This simple example illustrates that, for each realized randomization, OW should have the smallest estimation error, which translates into larger efficiency in estimating over repeated experiments.

Of note, more flexible models such as machine learning methods are available for estimating the propensity score

(Colantuoni and Rosenblum, 2015). In observational studies, these models could be more helpful because they reduce the chance for model misspecification. However, in randomized trials, the true propensity score is known and the propensity score model is merely a “working” model that is never misspecified. Thus, the benefit of more flexible propensity score models in randomized trials is negligible. More importantly, we employ the logistic propensity score model to obtain the overlap weights that achieve the exact mean balance on baseline covariates.

For non-continuous outcomes, besides the additive estimands we also consider ratio estimands. For example, for binary outcomes, the ATE is also known as the causal risk difference, . Two other standard estimands are the causal risk ratio (RR) and the causal odds ratio (OR) on the log scale, defined by

 τ\tiny{RR}=log(μ1μ0),     τ\tiny{OR}=log{μ1/(1−μ1)μ0/(1−μ0)}. (10)

The OW estimator for risk ratio and odds ratio are , and , respectively, with , being defined in (7).

## 3 Efficiency Considerations and Variance Estimation

In this section we demonstrate that in randomized trials the OW estimator leads to increased large-sample efficiency in estimating the treatment effect compared to the unadjusted estimator. We further propose a consistent variance estimator for the OW estimator of both the additive and ratio estimands.

### 3.1 Continuous Outcomes

Tsiatis et al. (2008) show that the family of regular and asymptotically linear estimators for the additive estimand is

 I:1NN∑i=1{ZiYir−(1−Zi)Yi1−r−Zi−rr(1−r){rg0(Xi)+(1−r)g1(Xi)}}+op(N−1/2), (11)

where is the randomization probability, and , are scalar functions of the baseline covariates . Several commonly used estimators for the treatment effect are members of the family , with different specifications of , . For example, setting , we obtain the unadjusted estimator . Setting , we obtain the “ANCOVA I” estimator in Yang and Tsiatis (2001), which is the least-squares solution of the coefficient of

in a linear regression of

on and . Further, setting and , we obtain the “ANCOVA II” estimator (e.g. Yang and Tsiatis, 2001; Tsiatis et al., 2008; Lin, 2013), which is the least-squares solution of the coefficient of in a linear regression of on and their interaction terms. This estimator achieves the semiparametric variance lower bound within the family , when the conditional mean functions and are correctly specified in the ANCOVA model (e.g. Robins et al., 1994; Leon et al., 2003). Another member of is the target maximum likelihood estimator (e.g. Moore and van der Laan, 2009; Moore et al., 2011; Colantuoni and Rosenblum, 2015), which is asymptotic efficient under correct outcome model specification. The IPW estimator is also a member of . Specifically, Shen et al. (2014) showed that if the logistic model (8) is used to estimate the propensity score , then the IPW estimator is asymptotically equivalent to the “ANCOVA II” estimator and becomes semiparametric efficient if the true and are linear functions of .

In the following Proposition we show that the OW estimator is also a member of and is asymptotically efficient under the linearity assumption. The proof of Proposition 1 is provided in Web Appendix A.

###### Proposition 1

(Asymptotic efficiency of overlap weighting)

(a) If the propensity score is estimated by a parametric model

with parameters that satisfies a set of mild regularity conditions (specified in Web Web Appendix A), then belongs to the class of estimators .
(b) Suppose and are two nested sets of baseline covariates with , and , are nested smooth parametric models. Write and as two OW estimators with the weights defined through and , respectively. Then the asymptotic variance of is no larger than that of .
(c) If the propensity score is estimated from the logistic regression (8), then is asymptotically equivalent to the “ANCOVA II” estimator, and becomes semiparametric efficient as long as the true and are linear in .

Proposition 1 summarizes the large-sample properties of the OW estimator in randomized trials, extending those demonstrated for IPW in Shen et al. (2014). In particular, adjusting for the baseline covariates using OW does not adversely affect efficiency in large samples than without adjustment. Further, the asymptotic equivalence between and the “ANCOVA II” estimator indicates that OW becomes fully semiparametric efficient when the conditional outcome surface is a linear function of the covariates adjusted in the logistic propensity score model. In the special case where the randomization probability , we show in Web Appendix B that the limit of the large-sample variance of is

where is the mean-centered outcome and measures the proportion of explained variance after regressing on . Similar definition of -squared was also used elsewhere when demonstrating efficiency gain with covariate adjustment (e.g. Moore and van der Laan, 2009; Moore et al., 2011; Wang et al., 2019). The amount of variance reduction is also a direct result from the asymptotic equivalence between the OW, IPW, and “ANCOVA II” estimators. Equation (12) shows that incorporating additional covariates into the propensity score model will not reduce the asymptotic efficiency because is non-decreasing when more covariates are considered. Although adding covariates does not hurt the asymptotic efficiency, in practice we recommend incorporating the covariates that exhibit severe baseline imbalance and that have large predictive power for the outcome (Williamson et al., 2014).

Perhaps more interestingly, the results in Proposition 1 apply more broadly to the family of balancing weights estimators, formalized in the following Proposition. The proof of Proposition 2 is presented in Web Appendix A.

###### Proposition 2

(Extension to balancing weights)
Proposition 1 holds for the general family of estimators (5) using balancing weights defined in (4), as long as the tilting function is a “smooth” function of the propensity score, where “smooth” is defined by satisfying a set of mild regularity conditions (specified in details in Web Appendix A).

### 3.2 Binary Outcomes

For binary outcomes, the target estimand could be the causal risk difference, risk ratio and odds ratio, denoted as , and , respectively. The discussions in Section 3.1 directly apply to the estimation of the additive estimand, . When estimating the ratio estimands, one should proceed with caution in interpreting regression parameters for the ANCOVA-type generalized linear models due to the potential non-collapsibility issue. Additionally, it is well-known that the log-binomial model frequently fails to converge with a number of covariates, and therefore one may have to resort to less efficient regression methods such as the modified Poisson regression (Zou, 2004). Williamson et al. (2014) showed that IPW can be used to adjust for baseline covariates without changing the interpretation of the marginal treatment effect estimands, and . Because of the asymptotic equivalence between the IPW and OW estimators (Proposition 1), OW shares the advantages of IPW in improving the asymptotic efficiency over the unadjusted estimators for risk ratio and odds ratio without compromising the interpretation of the marginal estimands. In addition, due to its ability to remove all chance imbalance associated with , OW is likely to give higher efficiency than IPW in finite samples, which we will demonstrate in Section 4.

### 3.3 Variance Estimation

To estimate the variance of propensity score estimators, it is important to incorporate the uncertainty in estimating the propensity scores (Lunceford and Davidian, 2004). Failing to do so leads to conservative variance estimates of the weighting estimator and therefore reduces power of the Wald test for treatment effect (Williamson et al., 2014). Below we use the M-estimation theory (e.g. Tsiatis, 2007) to derive a consistent variance estimator for OW. Specifically, we cast in equation (7), and in the logistic model (8) as the solutions to to the following joint estimation equations , where

 n∑i=1U(Yi,Xi,Zi,λ)=N∑i=1⎡⎢⎣Zi(Yi−μ1)(1−ei)(1−Zi)(Yi−μ0)ei~Xi(Zi−ei)⎤⎥⎦=0, (13)

where is the augmented covariates with intercept. Here, the first two rows represent estimating functions for and and the last rows are the score functions of the logistic model including an intercept and main effects of . If is of dimensions, equation (13) involves scalar estimating equations for parameters. Let ,, the asymptotic covariance matrix for can be written as . Extracting the covariance matrix for the first two components in , we can show that, as goes to infinity,

 √N[^μ1−μ1^μ0−μ0] →N{0,[Σ11,Σ12Σ21,Σ22]}, (14)

where the covariance matrix is defined as the corresponding elements in ,

 Σ11=[A−1BA−T]1,1,Σ22=[A−1BA−T]2,2,Σ12=ΣT21=[A−1BA−T]1,2. (15)

where denotes the th element of the matrix . Using the delta method, we can obtain the asymptotic variance of , , as a function of . Consistent plug-in estimators can then be obtained by estimating the expectations in the “sandwich” matrix by their corresponding sample averages. We summarize the estimators for the asymptotic variance of in the following equations,

where

and depends on the choice of estimands. For , we have ; for , we set ; for , we use with . Detailed derivation of the asymptotic variance and its consistent estimator can be found in Web Appendix B.

## 4 Simulation Studies

This section carries out extensive simulations to investigate the finite-sample operating characteristics of OW relative to IPW, direct regression adjustment and an augmented estimator that combined IPW and outcome regression. The main purpose of the simulation study is to empirically (i) illustrate that OW leads to marked finite-sample efficiency gain compared with IPW in estimating the treatment effect, and (ii) validate the sandwich variance estimator of OW developed in Section 3.3.

### 4.1 Continuous Outcomes

The first set of simulations involves continuous outcomes. We generate

baseline covariates from the standard normal distribution,

, . Fixing the randomization probability

, the treatment indicator is randomly generated from a Bernoulli distribution,

. Given the baseline covariates , we generate the potential outcomes from the following linear model (model 1): for ,

 Yi(z)∼N(zα+XTiβ0+zXTiβ1,σ2y),i=1,2,⋯,N (17)

where is the main effect of the treatment, and are the effects of the covariates and treatment-by-covariate interactions. The observed outcome is set to be . In our data generating process, because the baseline covariates have mean zero, the true average treatment effect on the additive scale . For simplicity, we fix and choose , . We specify the residual variance , and choose the multiplication factor

so that the signal-to-noise ratio (due to the main effects) is 1, namely,

. This specification mimics a scenario where the baseline covariates can explain up to of the variation in the outcome. We also assign different importance to each covariates. For example, the last two covariates, , , explain the majority of the variation, mimicking the scenario that one may have access to only a few strong prognostic risk factors at baseline. We additionally vary the value of to control the strength of treatment-by-covariate interactions. A larger value of indicates a higher level of treatment effect heterogeneity so that the baseline covariates are more strongly associated with the individual-level treatment contrast, . For the randomization probability , we consider two values: represents a balanced design with one-to-one randomization, and an unbalanced assignment where more patients are randomized to the treatment arm. Findings under other scenarios with are similar to those with and are omitted for brevity. We also vary the total sample sizes from to , with and mimicking a small and large sample scenario, respectively.

In each simulation scenario, we compare several different estimators for ATE, including the unadjusted estimator (UNADJ), the IPW estimator , the estimator based on linear regression (LR), and the OW estimator . For the IPW and OW estimators, we estimate the propensity score by logistic regression including all baseline covariates as linear terms, and the final estimator is given by the Hájek-type estimator (5) using the corresponding weights. For the LR estimator, we fit the correctly specified outcome model in (17) (model 1). To explore the performance of various estimators under model misspecification, we repeat the simulations by replacing the potential outcome generating process with the following model (model 2)

 Yi(z)∼N(zα+XTiβ0+zXTiβ1+XTi,intγ,σ2y), (18)

where represents interactions between pairs of covariates with consecutive indices and represents the strength of this interaction effect. The LR estimator omitting these additional interactions is thus considered as misspecified. For IPW and OW, the propensity score model is technically correctly specified (because the true randomization probability is a constant) even though it does not adjust for the interaction term . With a slight abuse of terminology, we refer to this scenario as “model misspecification”.

In addition to the previous four estimators, we also consider an augmented IPW (AIPW) estimator that augments IPW with an outcome regression (Lunceford and Davidian, 2004), which is also a member of the class :

 ^τ\tiny{AIPW}=^μ\tiny{AIPW}1−^μ\tiny{AIPW}0=1NN∑i=1{ZiYi^ei−(Zi−^ei)^μ1(Xi)^ei}−{(1−Zi)Yi1−^ei+(Zi−^ei)^μ0(Xi)1−^ei}, (19)

where is the prediction from the outcome regression. In the context of observational studies, such an estimator is also known as the doubly-robust estimator. Because AIPW hybrids propensity score weighting and outcome regression, it does not retain the objectivity of the former. Nonetheless, the AIPW estimator is often perceived as an improved version of IPW (Bang and Robins, 2005); therefore, we also compare it in the simulations to understand its operating characteristics in randomized trials.

For each scenario, we simulate 2000 replicates, and calculate the bias, Monte Carlo variance and mean squared error for each estimator of . Across all scenarios, as expected we find that the bias of all estimators is negligible, and thus the Monte Carlo variance and the mean squared error are almost identical. For this reason, we focus on reporting the efficiency comparisons using the Monte Carlo variance. We define the relative efficiency

of an estimator as the ratio between the Monte Carlo variance of that estimator and that of the unadjusted estimator. Relative efficiency larger than one indicates that estimator is more efficient than the unadjusted estimator. We also examine the empirical coverage rate of the associated 95% normality-based confidence intervals. Specifically, the confidence interval of

, , and is constructed based on the Huber-White estimator in Lin (2013), the sandwich estimator in Williamson et al. (2014), and the sandwich estimator developed in Section 3.3, respectively. Finally, the confidence interval of is the based on the sandwich variance derived based on the M-estimation theory; the details are presented in Web Appendix C.

Figure 1 presents the relative efficiency of the different estimators in four scenarios. In panel (a), where the outcomes are generated from model 1, with randomization probability and without treatment effect heterogeneity (), it is evident that , and are consistently more efficient than the unadjusted estimator, and the relative efficiency increases with a larger sample size . However, when the sample size is no larger than , OW leads to higher efficiency compared to LR, IPW and AIPW, while IPW is the least efficient among the adjusted estimators. In panel (b), in which we impose strong treatment effect heterogeneity , and become slightly more efficient than ; this is expected as the true outcome model is used and the design is balanced. The efficiency advantage decreases for and as moves closer to zero (Table 1). On the other hand, becomes more efficient than and when the randomization probability deviates from . For instance, in panel (c), when the randomization probability is and sample size is , and become even less efficient than the unadjusted estimator, while OW demonstrates substantial efficiency gain over the unadjusted estimator. The deteriorating performance of under supports the earlier findings in Freedman (2008). These results show that the relative performance between LR and OW is affected by the degree of treatment effect heterogeneity and the randomization probability. In the scenarios with small heterogeneous treatment effect or/and with unbalanced design (Table 1), OW tends to be more efficient than LR. Overall, the LR and OW are generally more comparable when model is correctly specified, both outperforming IPW. But OW becomes more efficient than LR when the outcome model is incorrectly specified. Namely, when the outcomes are generated from model 2, becomes the most efficient even if the propensity model omits important interaction terms in the true outcome model, as in panel (d) of Figure 1. In addition, LR and AIPW have almost identical finite-sample efficiency, suggesting that the regression component dominates the AIPW estimator in randomized trials. Throughout, is consistently more efficient than , regardless of sample size, randomization probability and the degree of treatment effect heterogeneity. These patterns persist when the sample size increases to , although there the differences between methods become smaller as a result of Proposition 1. Additional numerical results on relative efficiency are provided in Table 1.

Table 1 also summarizes the accuracy of the estimated variance and the coverage rate of each interval estimator. The former is measured by the ratio between the average estimated variance and the Monte Carlo variance of the estimator, and a ratio close to indicates adequate performance. In general, we find that estimated variance is close to the truth for both IPW and OW, but less so for the LR and AIPW estimator, especially in small samples such as or . Specifically, the sandwich variance of IPW and our variance for OW tend to adequately quantify the uncertainty, even when the sample size is as small as , when the outcomes are generated from model 1. For these settings, the variance of the Huber-White variance estimator for LR sometimes substantially underestimates the true variance, and leads to under-coverage of the interval estimator. Also, in the case where LR has a slight efficiency advantage (

), the coverage of LR is only around 70% even when the true linear regression model is fit. This results show that the Huber-White sandwich variance, although known to be robust in large samples to heteroscedasticity, could be severely biased towards zero in finite samples and under treatment effect heterogeneity. This finding provides an additional caveat for using LR. Similar to the Huber-White variance for LR, the sandwich variance of AIPW also frequently underestimates the true variance when

. On the other hand, when the outcomes are generated from model 2 and the randomization probability , all variance estimators tend to underestimate the truth, and the coverage rate slightly deteriorates. However, the coverage of IPW and OW estimators is still closer to nominal than LR and AIPW when sample size , suggesting that the sandwich variance of the propensity score weighting estimators have more stable performance.

### 4.2 Binary Outcomes

The second set of simulations involves binary outcomes generated from a generalized linear model. Specifically, we assume the potential outcome follows a logistic regression model (model 3): for ,

 logit{Pr(Yi(z)=1)}=η+zα+XTiβ0+zXTiβ1,   i=1,2,…,N, (20)

where

denotes the vector of

baseline covariates simulated as in Section 4.1, and the parameter to represent the baseline prevalence of the outcomes in the control arm, i.e., . We specify the main effects , where is chosen to be the same value used in Section 4.1. For the covariate-by-treatment interactions, we set and examine scenarios with and , with the latter representing strong treatment effect heterogeneity. Similarly, we set the true treatment effect to be zero . We vary the sample size from 50 to 500 to represent both small and moderately large sample sizes. Additionally, we vary the value of such that the baseline prevalence , representing scenarios where the outcome is common and rare. It is expected that the stability of regression adjustment becomes sensitive to the prevalence of outcome, while propensity score weighting estimators are less affected (Williamson et al., 2014). We also consider a data generating process with additional covariate interaction terms (model 4): for