G-computation and inverse probability weighting for time-to-event outcomes: a comparative study

by   A. Chatton, et al.
University of Nantes

Inverse probability weighting is increasingly used in causal inference, but the g-computation constitutes a promising alternative. We aimed to compare the performances of these methods with time-to-event outcomes. Given the limitations of the interpretability of the hazard ratio for causal inference, our target estimand of interest is the difference in restricted mean survival times. We report the findings of an extensive simulation study, showing that both inverse probability weighting and g-computation were unbiased under a correct model specification, but g-computation was generally more efficient. We also analyse two real-life datasets to illustrate these results. We update the R package RISCA to facilitate the implementation of g-computation.



There are no comments yet.


page 1

page 2

page 3

page 4


PSweight: An R Package for Propensity Score Weighting Analysis

Propensity score weighting is an important tool for causal inference and...

Navigated Weighting to Improve Inverse Probability Weighting for Missing Data Problems and Causal Inference

The inverse probability weighting (IPW) is broadly utilized to address m...

Propensity Score Weighting Analysis of Survival Outcomes Using Pseudo-observations

Survival outcomes are common in comparative effectiveness studies and re...

Quantifying Error in the Presence of Confounders for Causal Inference

Estimating average causal effect (ACE) is useful whenever we want to kno...

Inverse probability of censoring weighting for visual predictive checks of time-to-event models with time-varying covariates

When constructing models to summarize clinical data to be used for simul...

Regression with a right-censored predictor, using inverse probability weighting methods

In a longitudinal study, measures of key variables might be incomplete o...

A bracketing relationship between difference-in-differences and lagged-dependent-variable adjustment

Difference-in-differences is a widely-used evaluation strategy that draw...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Real-world evidence means scientific evidence obtained from data collected outside the context of randomised clinical trials (Sherman_RWE_2016)

. The absence of randomisation complicates the estimation of the marginal causal effect (hereafter referred merely as causal effect) of exposure (including treatment or intervention) due to a potential risk of confounding

(Hernan_2004). rosenbaum_central_1983 introduced the propensity score (PS) as a tool for causal inference in the presence of measured confounders. In a binary exposure setting, it has been shown that the estimated PS is a balancing score, meaning that conditionally on the estimated PS, the distribution of covariates is similar between exposed and unexposed patients. Following this property, the PS can be used in four different ways to provide estimates of the causal exposure effect: matching, stratification, adjustment, and inverse probability weighting (IPW, robins_marginal_2000). Stratification leads to residual confounding (lunceford_stratification_2004) and adjustment relies on stronger modelling assumptions (Vansteelandt_Daniel_2014). Although matching on PS has always been the most popular (ali_reporting_2015), IPW appears to be less biased and more precise in several studies (le_borgne_comparisons_2015; Hajage__2016; austin_performance_2013). Moreover, King_PSM argued for halting the use of PS matching for many reasons such as covariate imbalance, inefficiency, model dependence, and bias.

Causal effects can also be estimated using the G-computation (GC), a maximum likelihood substitution estimator of the g-formula (Snowden_2011; Keil_2014). While IPW rests on the exposure modelling, the GC relies on the prediction of the potential outcomes for each subject under each exposure status. Robins_1986 proposed the GC to estimate causal effects in the presence of time-varying confounders.

Several studies (see Chatton_2020, and references inside)

compared the IPW and the GC in different contexts. They reported a lower variance for the GC than the IPW. Nevertheless, to the best of our knowledge, no study focused on time-to-event outcomes. The presence of right-censoring and its magnitude is of prime importance since a small number of observed events due to censoring may impact the estimation of the outcome model involved in the GC. Contrariwise, the IPW may have good performance as long as the number of exposed patients is sufficient to estimate the PS and the total sample size large enough to limit variability in the estimated weights. Besides the lack of simulation-based studies comparing the GC and the IPW in a time-to-event context, the GC-based estimates, such as hazard ratio or restricted mean survival, are not straightforward to obtain. Time-to-event outcomes result in methodological difficulties, mainly due to the time-varying distribution of individual baseline characteristics.

In the present paper, we aimed to detail the statistical framework for using the GC in time-to-event analyses. We restricted our developments to time-invariant confounders because the joint modelling of time-to-event and longitudinal data requires for controlling the time-varying confounding by GC or IPW (Hernan_2001) is beyond the scope of this paper. We also compared the performances of the GC versus the IPW. The paper is structured as follows. In section 2, we detailed the methods. Section 3 presents the design and findings of a simulation study. In section 4, we proposed a practical comparison with two real-world applications related to treatment evaluations in multiple sclerosis and kidney transplantation. Finally, we discussed the results and provided practical recommendations to help analysts to choose the appropriate analysis method.

2 Methods

2.1 Notations


be the random variables associated with the subject

(. is the sample size, is the participating time, is the censoring indicator (0 if right-censoring and 1 otherwise), is a binary time-invariant exposure initiated at time (1 for exposed subjects and 0 otherwise), and

is the vector of the

measured time-invariant confounders. Let be the survival function of the group at time , and the corresponding instantaneous hazard function. Suppose the number of different observed times of event in the group . At a time (), the number of events and the number of at-risk subjects in the group can be defined as and .

2.2 Estimands

The hazard ratio () has become the estimand of choice to confounder-adjusted studies with time-to-event outcomes. However, it has also been contested (hernan_hazards_2010; Aalen_2015), mainly because the baseline distribution of confounders varies over time. For instance, consider a population in which there is no confounder at baseline, i.e., the distribution of is identical regardless of the exposure’s level . Suppose additionally that the quantitative covariate and the exposure are independently associated with a higher risk of death. In this situation, as illustrated in Figure 1, the difference between the mean values of the two groups increases over time. Then, even when the subject-specific proportional hazard (PH) assumption is true, i.e., the between an exposed and an unexposed patient with the same characteristics is constant over the time, the population-average may vary over the time. Instead of the , one can estimate an average over the time of the different time-specific (Schemper_2009): .

Nevertheless, Aalen_2015 concluded that it is still difficult to draw causal conclusions from such a relative estimand. hernan_hazards_2010 advocated for the use of the adjusted survival curves and related differences. For instance, the restricted mean survival time (RMST) allows us to summarise a survival curve for a specific time-window (Royston_Parmar_2013) and compare two curves by looking at the difference in RMST. The RMST difference up to time is formally defined as :


It corresponds to the difference in terms of mean event-free time between two groups of exposed and unexposed individuals followed-up to time . A further advantage of the RMST difference is its usefulness for public health decision-making (Poole_2010). Note that other alternatives exist, such as the attributable fraction or the number needed to treat (Sjolander_2018), that might dodge this problematic.

Hereafter, we considered and .

2.3 Weighting on the inverse of propensity score

Formally, the PS is , i.e., the probability for the subject to be exposed according to her/his characteristics

. In practice, analysts often used the logistic regression. Let

be the weight of the subject . xu_use_2010 defined , allowing to obtain a pseudo-population in which the distribution of covariates is balanced between exposure groups, enabling to estimate the causal effect in the entire population (Hernan_2004). The use of stabilised weights has been shown to produce a suitable estimate of the variance even when there are subjects with extremely large weights (robins_marginal_2000; xu_use_2010). The weighted number of events and at-risk subjects at time in the group are and , respectively. cole_adjusted_2004 proposed a weighted Kaplan-Meier estimator defined as :


To estimate the corresponding , they also suggested the use of a weighted univariate Cox PH model, in which the exposure is the single explanatory variable. We use the equation (1) to estimate the corresponding .

2.4 G-computation

Akin to the IPW, the GC involves two steps. The first one consists in estimating the outcome model, called the Q-model (Snowden_2011). When suitable, it can consist in a PH regression: where is the conditional hazard function of the subject at time , the baseline hazard function, and and the regression coefficients. The estimations of the cumulative baseline hazard and the regression coefficients () can be obtained by the joint likelihood proposed by Breslow_1972. The second step consists in predicting the counterfactual mean survival function if all subjects would have been exposed () or unexposed ():


Afterwards, could be computed by the mean of the individual counterfactual hazard ratios at the observed event times (Schemper_2009):


where , which is obtained from the equation (3) by numerical differentiation. We use the equation (1) to estimate the corresponding .

2.5 Software

We performed all the analyses using R version 3.6.0. (R). To facilitate their use in practice, we have implemented the previous methods in the R package entitled RISCA (versions 0.8.1), which is available at cran.r-project.org. We obtained the variances by bootstrap, as recently recommended by Austin_boot.

2.6 Identifiability conditions

As for standard regression models, the IPW and the GC require the assumptions of no measurement error, no model misspecification, and no interference (Hudgens_Halloran_2008). Three additional assumptions, called identifiability conditions, are necessary for causal inference (Naimi_2016). (a) The values of exposure under comparisons correspond to well-defined interventions that, in turn, correspond to the versions of exposure in the data. (b) The conditional probability of receiving every value of exposure depends only on the measured covariates. (c) The conditional probability of receiving every value of exposure is greater than zero. These assumptions were known as consistency, conditional exchangeability and positivity, respectively.

3 Simulations study

3.1 Data generation

We generated the data in three steps. (a) We simulated three covariates ( to

) from a Bernoulli distribution with parameter equal to 0.5 and three covariates (


) from a standard normal distribution (Figure

2). (b) We generated the exposure according to a Bernoulli distribution with probability obtained by the logistic model with the following linear predictor: . We set the intercept to obtain a prevalence of exposed individuals at 50%. (c) We generated the times-to-event from a Weibull PH model. We set the scale and shape parameters to 40.0 and 2.0, respectively. Based on a random variable

drawn from a standard uniform distribution, we then computed the time-to-event from a Weibull PH model as

, where

under the null hypothesis or

under the alternative one. We subsequently censored the times-to-event using a uniform distribution on [0,70] or [0,15], leading to approximately 40% and 90% censored observations, respectively.

We explored three sample sizes: 100, 500, or 2000. When the censoring rate was 90%, we did not investigate the smallest sample size due to the reduced number of events. For each scenario, we randomly generated 10,000 datasets. To compute the difference of , we defined in each dataset as the time where at least 10% of the individuals in each group (exposed or unexposed) were still at-risk.

We furthermore investigated the consideration of two sets of covariates: the risk factors of the outcome, or the true confounders.

3.2 Performance criteria

We computed the theoretical values of the and by averaging the estimations, respectively obtained from univariate Cox PH models ( as the only explanatory covariate) and by the equation (1) where the survival functions have been estimated by the Kaplan-Meier estimator, fitted from datasets simulated as above, except that was simulated independently of  (gayat_propensity_2012). We reported the following criteria: (a) the percentage of datasets without convergence ; (b) the mean absolute bias: , where is the estimand of interest ; (c) the mean square error: ; (d) the variance estimation bias: , where

is the asymptotic standard deviation and

the empirical standard deviation ; (e) the empirical coverage rate of the nominal 95% confidence interval (CI), defined as the percentage of 95% CIs including

; (f) the type I error, defined as the percentage of rejection of the null hypothesis under the null hypothesis ; and (g) the statistical power, defined as the percentage of rejections of the null hypothesis under the alternative hypothesis. We computed the Monte Carlo standard errors for each performance measure 


3.3 Results

The Monte Carlo errors were very weak. We have not encountered any convergence problems. Figures 3 and 4 present the results under the alternative hypotheses for and , respectively. The results under the null hypothesis were comparable and can be found in supplementary materials available online.

The MAB associated with IPW and GC were similar to zero in all scenarios, regardless of the considered covariates set (all the risk factors or only the true confounders). Nevertheless, the MAB under the alternative hypothesis was slightly lower with the GC when considering all the risk factors. For instance, in terms of when and a censoring rate at 40%, the MAB was 0.033 for GC versus 0.061 for IPW. When considering the true confounders, these values were 0.147 versus 0.066, respectively.

The GC, when considering all the outcome causes, resulted in the best results in terms of MSE, especially for small sample sizes. For instance, when and a censoring rate at 40%, the MSE related to the was 0.056 for GC versus 0.068 for IPW. When considering the true confounders, these values were 0.076 versus 0.079, respectively.

Regarding the VEB, the results were close between the GC and the IPW. One can nevertheless note lower values for the IPW when and estimating the . For instance, when the censoring rate was 40%, the VEB values were in-between 4.1% and 4.7% for GC, versus 2.4% and 3.7% for IPW.

All scenarios broadly respected the nominal coverage value of 95%. The power was the highest for the GC, especially when considering all the risk factors, with a gain in-between 10% and 13%.

With a censoring rate at 90%, the two methods led to close results, irrespective of the considered covariates.

4 Applications

We used the data from two studies performed by Laplaud_2019 in multiple sclerosis and Masset_2019 in kidney transplantation. We conducted these studies following the French law relative to clinical noninterventional research. Written informed consents were collected. Moreover, the French commission for data protection approved the collection (CNIL decisions DR-2014-327 and 914184, respectively). To guide variable selection, we asked experts which covariates were causes of the exposure or the outcome prognosis to define the causal structure, as proposed by VanderWeele_Shpitser_2011

. We checked the positivity assumption and the considered covariates balance (see supplementary materials available online). The log-linearity hypothesis of continuous covariates has been verified in the univariate analysis if the Bayesian information criterion

(BIC) was not reduced using natural spline transformation compared to the inclusion of the covariate in its natural scale. In case of violation, we used a natural spline transformation. We also assessed the PH assumption by the Grambsch-Therneau test at a significance level of 5% (Grambsch_Therneau_1994). For simplicity, we performed complete case analyses.

4.1 Dimethylfumarate versus Teriflunomide to prevent the relapse in multiple sclerosis

With the increasing number of available drugs for preventing relapses in multiple sclerosis and the lack of head-to-head randomised clinical trials, Laplaud_2019 aimed to compare Teriflunomide (TRF) and Dimethylfumarate (DMF) using data from the multicentric cohort OFSEP. We reanalysed the primary outcome, defined as the time-to-first relapse. We presented the cohort characteristics of 1770 included patients in Table 1: 1057 patients were in the DMF group (59.7%) versus 713 in the TRF group (40.3%). Approximately 39% of patients (40% in the DMF group versus 38% in the TRF group) had at least one relapse during the follow-up.

We presented the confounder-adjusted results in the left panel of the Figure 5. The GC and the IPW were equally robust to the considered set of covariates. For the difference of RMST, the width of the 95% CI was lower for the GC. For instance, when we considered all the risk factors, GC led to a width of 44.2 days versus 50.9 days for IPW.

The conclusion of no significant difference between TRF and DMF was unaffected by the method or the choice of the estimand.

4.2 Basiliximab versus Thymoglobulin to prevent post-transplant complications

Amongst the non-immunised kidney transplant recipients, one can expect similar rejection risk between Thymoglobulin (ATG) and Basiliximab (BSX), two possible immunosuppressive drugs proposed as induction therapy. However, the ATG may be associated with higher serious events, especially in the elderly. We aimed to reanalysed the difference in cardiovascular complications in ATG versus BSX patients (Masset_2019). Table 2 describes the 383 included patients from the multicentric DIVAT cohort: 204 patients were in the BSX group (53.3%) versus 179 in the ATG group (46.7%). Approximately 30% of patients (29% in the BSX group and 31% in the ATG group) had a least one cardiovascular complication during the follow-up. The median follow-up time was 1.8 years (min: 0.0; max: 8.2).

We presented in the right panel of the Figure 5 the confounder-adjusted RMST differences for a cohort followed up to 3 years. Firstly, the GC-based results obtained were slightly less sensitive than those obtained by the IPW to the considered set of covariates. Indeed, GC resulted in a equal to 2.6 (IC 95% from -1.2 to 7.0) and 4.0 (IC 95% from 0.2 to 8.5) months with all the risk factors and only the true confounders, respectively. In contrast, the IPW resulted in a equal to 1.5 (IC 95% from -2.5 to 5.8) and 3.4 (IC 95% from -0.5 to 7.9) months, respectively. Likewise, the log of obtained by GC was in-between -0.272 (IC 95% from -0.703 to 0.123) and -0.440 (IC 95% from -0.898 to -0.021) versus -0.188 (IC 95% from -0.639 to 0.226) and -0.384 (IC 95% from -0.830 to 0.002) by IPW. This variability of the estimations, according to the set of covariates, illustrates the importance of this choice. Secondly, we obtained the smallest 95% CIs by using the GC with all the risk factors, in line with the corresponding lower variance previously reported in the simulations. Thirdly, the conclusion differed depending on the method: the IPW-based results did not point to a significant difference of effect between BSX and ATG, in contrast to the GC with the true confounders.

5 Discussion

We aimed to explain and compare the performances of the GC and the IPW for estimating causal effects in time-to-event analyses with time-invariant confounders. We focused on the average HR and the RMST difference. The results of the simulations showed that the two methods performed similarly in terms of MAB, VEB, coverage rate, and type I error rate. Nevertheless, the GC outperformed the IPW in terms of statistical power, even when the censoring rate was high. Furthermore, the simulations also showed that the Q-model in the GC approach should preferentially include all the risk factors to ensure the smaller mean bias. Therefore, we reported that the main advantage of using the GC is the gain in statistical power.

In the two applications, the 95% CIs were also narrower by using the GC with all the risk factors. Moreover, while the first application in multiple sclerosis did not highlight relevant differences between the GC and the IPW, the second one in kidney transplantation illustrated the importance of the set of covariates to consider. By using the GC with all the risk factors, we concluded to the significant difference between Basiliximab and Thymoglobulin. In contrast, by using the GC with the true confounders, we reported a non-significant difference. Two arguments can explain that the consideration in the Q-model of all the risk factors was associated with the best performances. First, it reduces the residual variance and increases the precision of the predictions of the potential outcomes. Second, even if a risk factor is balanced between the exposure groups at baseline, it could become unbalanced over time (as illustrated in Figure 1).

Nevertheless, the higher power of the GC can be counterbalanced by the inability of GC to evaluate the characteristics balance between exposure group over time and need for bootstrapping to estimate the variance, analytic estimators being available for the IPW (Conner_Trinquart_2019; Hajage_2018). In practice, we must emphasise that bootstrapping the entire estimation procedure, has the advantage to consider a valid post-selection inference (Efron_2014). For instance, data-driven methods for variables selection have recently been developed, such as the super learner (vdl_2007), and may represent a promising perspective when such clinical knowledge is unavailable (Blakely_2019).

The IPW and the GC are not the only possible methods to estimate the causal effect. For instance, Conner_Trinquart_2019 compared IPW performances with other regression-based methods. Overall, the statistical performances were similar. However, the straight advantage of the IPW and the GC compared to other methods is the visualisation of the confounder-adjusted results in terms of survival curve or derivative indicator such as RMST. The use of Doubly Robust Estimators (DREs) (Benkeser_2018; Cai_vdl_2019) could be an extension of our work. The DREs combine both the PS and the GC approaches for a consistent estimation of the exposure effect if a least one model has been correctly specified, circumventing the potential model misspecification aforementioned (Lendle_2013). Unfortunately, DREs can increase bias when the two models are misspecified (Kang_Schafer_2007). Moreover, several studies reported a lower variance for the GC than the DREs (Chatton_2020; Kang_Schafer_2007; Lendle_2013). The use of the GC also represents a partial solution to prevent the selection of instrumental variables (Myers_2011) since it is independent of the exposure modelling.

Our study has several limitations. Firstly, we only studied logistic and Cox PH regression, but other models could be used. Keil_Edwards_2018 proposed a review of possible Q-model with a time-to-event outcome. Secondly, the results of the simulations and applications are not a theoretical argument for generalising to all situations. Thirdly, we only considered a reduced number of covariates, which could explain the equivalence mentioned above between the GC and the IPW with the extreme censoring rate. Lastly, we did not consider competing events or time-varying confounders that require specific estimation methods (Young_2020; Daniel_2013).

To conclude, the present study illustrates the higher power of the G-computation compared to the inverse probability weighting to estimate the causal effect with time-to-event outcomes. All the risk factors should be considered in the Q-model (except those on the causal pathway). Our work is in the continuity of the emerging literature that questions the near-exclusivity of propensity score-based methods in causal inference.


The authors would like to thank the members of DIVAT and OFSEP Groups for their involvement in the study, the physicians who helped recruit patients and all patients who participated in this study. We also thank the clinical research associates who participated in the data collection. We also thank David Laplaud and Magali Giral for their clinical expertise. The analysis and interpretation of these data are the responsibility of the authors. This work was partially supported by a public grant overseen by the French National Research Agency (ANR) to create the Common Laboratory RISCA (Research in Informatic and Statistic for Cohort Analyses, www.labcom-risca.com, reference: ANR-16-LCV1-0003-01) involving the development of Plug-Stat software. Arthur Chatton obtained a grant from IDBC for this work.