Data-adaptive doubly robust instrumental variable methods for treatment effect heterogeneity

by   Karla DiazOrdaz, et al.

We consider the estimation of the average treatment effect in the treated as a function of baseline covariates, where there is a valid (conditional) instrument. We describe two doubly robust (DR) estimators: a locally efficient g-estimator, and a targeted minimum loss-based estimator (TMLE). These two DR estimators can be viewed as generalisations of the two-stage least squares (TSLS) method to semi-parametric models that make weaker assumptions. We exploit recent theoretical results that extend to the g-estimator the use of data-adaptive fits for the nuisance parameters. A simulation study is used to compare standard TSLS with the two DR estimators' finite-sample performance, (1) when fitted using parametric nuisance models, and (2) using data-adaptive nuisance fits, obtained from the Super Learner, an ensemble machine learning method. Data-adaptive DR estimators have lower bias and improved coverage, when compared to incorrectly specified parametric DR estimators and TSLS. When the parametric model for the treatment effect curve is correctly specified, the g-estimator outperforms all others, but when this model is misspecified, TMLE performs best, while TSLS can result in large biases and zero coverage. Finally, we illustrate the methods by reanalysing the COPERS (COping with persistent Pain, Effectiveness Research in Self-management) trial to make inference about the causal effect of treatment actually received, and the extent to which this is modified by depression at baseline.


page 1

page 2

page 3

page 4


Double-robust and efficient methods for estimating the causal effects of a binary treatment

We consider the problem of estimating the effects of a binary treatment ...

Doubly Robust Estimation of Local Average Treatment Effects Using Inverse Probability Weighted Regression Adjustment

We revisit the problem of estimating the local average treatment effect ...

Doubly Robust Inference for Hazard Ratio under Informative Censoring with Machine Learning

Randomized clinical trials with time-to-event outcomes have traditionall...

Deep Neural Networks Guided Ensemble Learning for Point Estimation in Finite Samples

As one of the most important estimators in classical statistics, the uni...

The Bias-Variance Tradeoff of Doubly Robust Estimator with Targeted L_1 regularized Neural Networks Predictions

The Doubly Robust (DR) estimation of ATE can be carried out in 2 steps, ...

On the Evaluation of Surrogate Markers in Real World Data Settings

Shortcomings of randomized clinical trials are pronounced in urgent heal...

A Data-Adaptive Targeted Learning Approach of Evaluating Viscoelastic Assay Driven Trauma Treatment Protocols

Estimating the impact of trauma treatment protocols is complicated by th...

1 Introduction

There has been an increased interest in estimating the causal effect of treatment actually received (in addition to the intention-to-treat effect) in randomised controlled trials (RCTs) in the presence of treatment non-adherence (Wiles et al., 2014; Dodd et al., 2012; Zhang et al., 2014). An additional challenge is posed by appreciable treatment effect heterogeneity, which is often itself of interest. This is a a common issue with psychological interventions (Dunn and Bentall, 2007).

In this work, we consider methods for estimating the dependence of a causal average treatment effect on baseline covariates in RCTs with non-adherence. This is motivated by the COPERS (COping with persistent Pain, Effectiveness Research in Self-management) trial. The intervention introduced cognitive behavioural therapy approaches designed to promote self-efficacy to manage chronic pain, with the primary outcome being pain-related disability. The research team was interested in the causal effect of the received treatment, and whether this effect was modified by a number of baseline variables. Here, we will focus on one possible effect modifier: depression at baseline, measured by the Hospital Anxiety and Depression Scale (HADS).

Instrumental variable (IV) methods are often used to estimate the effect of treatment received in RCTs where randomised treatment is unconfounded by design, but treatment received is not. Assuming that randomised treatment is a valid instrument, and under some additional assumptions reviewed in Section 2, it is possible to identify the average treatment effect in the treated, conditional on baseline covariates . In addition to investigating effect modification by a subset of baseline covariates , it can in other settings be beneficial to use a larger set of baseline covariates for adjustment in the analysis: (i) if the IV assumptions are more plausible conditional on baseline covariates , or (ii) to increase the statistical efficiency of the estimators.

A relatively simple method of estimation for this is the so-called two stage least squares (TSLS). In its simplest form, i.e. when

is null, the first stage predicts the exposure based on an ordinary least squares regression of the exposure on the IV and baseline covariates

, while the second stage consists of regressing the outcome on the predicted exposure from the first stage and baseline covariates , also via ordinary least squares regression. The coefficient corresponding to the predicted exposure in this second model is the TSLS estimator of the desired causal treatment effect. TSLS is robust to the misspecification of the first stage model (Robins, 2000; Wooldridge, 2010) but may be inefficient, especially when the treatment-exposure relationship is non-linear (Vansteelandt and Didelez, 2018). However, as we shall see later, where is non-null and the treatment effect varies by baseline covariates, TSLS relies on the outcome model (the second stage) being correctly specified to obtain consistent effect estimates.

Doubly robust (DR) estimators are appealing in such settings, as they estimate consistently the parameter of interest if at least one of the models, for either the exposure or the outcome is correctly specified. In the context of linear IV models with null, Okui et al. (2012) proposed a locally-efficient estimating equations DR estimator for the causal effect of treatment in the treated, often called a g-estimator. It augments the TSLS estimating equation by adding a model for the instrument given the baseline covariates. This estimator is DR in the sense that it needs to specify correctly either the outcome model or the instrument model. This estimator was generalised to settings where is non-null by Vansteelandt and Didelez (2018) and shown to be locally efficient.

Recently, Tóth and van der Laan (2016) proposed a targeted maximum likelihood estimator (TMLE) for the treatment effect in a linear IV model. TMLE is a general approach for causal inference problems yielding semi-parametric substitution estimators that are doubly robust (van der Laan and Rose, 2011).

Although DR estimators offer partial protection in principle against model misspecification, concerns remain over their performance in practice, when all models are likely to be misspecified (Kang et al., 2007). To further avoid problems of model misspecification, TMLE is usually coupled with machine learning fits of nuisance models, in particular the Super Learner, a cross-validation based estimator selection approach (van der Laan et al., 2007). TMLE and other DR estimators possess a particular orthogonality property that leads to greater suitability with machine learning fits. Estimators based on a single nuisance model can perform poorly when combined with machine learning fits, since the estimator inherits the slow convergence (and hence high finite sample bias) and non-regularity of the machine learning fits themselves, with the latter property making valid statistical inferences extremely complex to obtain, as even the bootstrap is invalid (Benkeser et al., 2017)

. Some DR estimators, such as TMLE, on the other hand, when combined with machine learning fits, have faster convergence and make (asymptotic) analytic statistical inference tractable via the sampling variance of the corresponding influence functions, under conditions on the convergence of the machine learning algorithms used, which, although not guaranteed to hold by any means, are considerably weaker than what would be required to achieve the same with non-DR estimators

(Farrell, 2015).

Chernozhukov et al. (2016) further develop these ideas, and suggest the use of machine learning for estimating the nuisance parameters for a wide class of estimating equation DR estimators, to which the g-estimator presented by Vansteelandt and Didelez (2018) belongs.

We implement these two IV estimators, g-estimation and TMLE, with and without machine learning fits, and compare their performance with that of standard TSLS, in terms of mean bias, root mean squared error (RMSE) and confidence interval (CI) coverage using a simulation study. We also contrast the methods in the case study.

This paper is organised as follows. In the next section, we define the causal parameters of interest and the assumptions for the IV methods. In Section 3.1 we review the standard TSLS, while in Section 3.2, we introduce the g-estimator proposed by Vansteelandt and Didelez (2018). Section 3.3 briefly justifies the use of so-called double machine learning methods (data-adaptive DR estimators), and introduces the Super Learner. The TMLE estimator proposed by Tóth and van der Laan (2016) is described in Section 3.4. In Section 4 we present a simulation study, comparing the performance of these estimators. The proposed methods are then applied to the COPERS RCT in Section 5. We conclude with a discussion.

2 Linear instrumental variables models


Figure 1: DAG depicting a valid conditional instrument for exposure in the presence of possible effect modifiers and unobserved confounders , when the outcome is .

Let be a set of baseline variables, be the randomised treatment indicator, be the exposure of interest, the actual treatment received, presumed to be binary. Let be the continuous outcome of interest, and be the set of all unobserved common causes of and . For simplicity, but without loss of generality, assume that we are only interested in effect modification by a single baseline variable .

We use a subscript 0 to denote the true probability distributions, models and parameters. Let the vector of the observed data for the

-th individual be , where is the true underlying distribution from which an independent identically distributed random sample of size is drawn. The causal relationships between these variables are encoded by the directed acyclic graph (DAG) shown in Figure 1.

Let the counterfactual or potential outcome be the outcome that would occur if were set to . As usual, we assume that no interference and counterfactual consistency hold (Rubin, 1978; VanderWeele, 2009). Following Abadie (2003); Vansteelandt and Didelez (2018), we write the conditional version of the IV assumptions (Angrist et al., 1996), as follows:
(i) Conditional unconfoundedness: is conditionally independent of the unmeasured confounders, conditional on measured covariates , i.e. .
(ii) Exclusion restriction: Conditionally on , and the confounder , the instrument and the response are independent, i.e. ,
(iii) Instrument relevance: Also referred to as first stage assumption: is associated with conditional on , i.e. .

A special case is the simple RCT, where randomised treatment is often assumed to be an unconditional valid IV.

In addition, we assume the following conditional mean model for the outcome:


where and are unknown functions and represents the causal treatment effect given covariates .

Now by the technical assumption of counterfactual consistency, the conditional mean model (1), together with the conditional IV assumptions, implies (Vansteelandt and Didelez, 2018),


While this model can be motivated from model (1), it is often used explicitly as the departure point for causal treatment effect estimation. In fact, Vansteelandt and Didelez (2018) show that these two IV models imply the same restrictions on the observed data distribution. Let denote the statistical model for implied by the IV assumptions and either model (1) or (2). This is often called the linear IV model. Note that in model , the treatment effect in the treated does not depend on . This is known as the ‘no effect modification’ by assumption (Hernán and Robins, 2006).

The causal effect of interest, the average treatment effect in the treated, conditional on taking the value , can be written as a function of as


Since is the conditional expectation of given and , we will first focus on estimating .

Rearranging equation (2) and marginalising over , we can see that

the last equality holding by the conditional unconfoundedness assumption. Model thus implies


where or , depending on whether model (1) or (2) is assumed. Therefore, for a binary IV, we have

showing that under and the IV assumptions, is identified by


Estimation of the conditional expectations in equation (5) would typically involve specifying models for the mean exposure and the mean outcome . Let denote under , which according to equation (4) is a function of , and . We denote by the model for , the model for , and the model for .

3 Doubly robust estimation for the linear instrumental variable model

The methods presented below make different assumptions about the treatment effect model. As we shall see in Section 3.1 and 3.2, TSLS and the g-estimator assume is a known parametric function of , smooth in , with the true an unknown parameter of finite dimension .

To illustrate the methods, we consider throughout a situation where interest lies in the main effect modification by a single variable , i.e. . Under this working model for the effect modification by , the parameter of interest is , where represents the main causal treatment effect, and is the effect modification by . In this case, .

The TSLS and g-estimation require the specification of a parametric model for , so in what follows, these methods presume that this working model is the true model, i.e takes the parametric form


In contrast, the TMLE method does not assume a parametric model for , but instead obtains an initial estimate for using machine learning, which is then updated according to the TMLE algorithm (see details in Section 3.5). This updated estimate is then projected onto a user-supplied parametric working model for the treatment effect curve on , . Importantly, the working parametric model is not assumed to be the true model for . Using the working model , TMLE also estimates .

3.1 Tsls

Estimation of the expectations in equation (5) is often done via an approach known as two-stage least squares (TSLS). The first stage fits a linear treatment selection model, that is a model for conditional on the instrument and the baseline covariates of interest, and then, in a second stage, a regression model of the outcome on the predicted treatment received and baseline covariates is fitted. We write


In principle, there are many parametric choices for the second stage models, and . For TSLS to be consistent, the first stage model

must be the parametric linear regression implied by the second stage, that is, it must include all the covariates and interactions that appear in the second stage model.

For example, if we assume that , and , where abusing notation we assume the vector of ones is the first column of , then the first-stage would involve two equations, as follows


where again, includes 1 to allow for an intercept. Because estimation of these two first-stage models is done separately without acknowledging that the model for should imply the model for , the resulting TSLS estimator may be inefficient (Vansteelandt and Didelez, 2018).

Vansteelandt and Didelez (2018) show that the second stage of the TSLS, equation (8), can be written as the solution to an estimating equation


with any conformable index vector function of dimension . The estimators and obtained solving equation (10), after substituting for (the estimator from the first stage), are consistent asymptotically normal (CAN) , under , i.e. even when is misspecified (Robins, 2000; Wooldridge, 2010). Moreover, in the specific settings where the treatment effect model is linear in the covariates and the instrument is independent of , TSLS is also robust to misspecification of . We refer the interested reader to Vansteelandt and Didelez (2018), Appendix B Proposition 5, for the proof.

This means that for estimators which are DR in the more general settings, with either a non-linear or where depends on , methods beyond TSLS need to be considered.

3.2 G-estimation

Okui et al. (2012) propose an estimator for , constructed by augmenting the estimating equation (10), by multiplying the summand by . The model for the conditional distribution of the IV is often called the instrument propensity score, and it is assumed to be a known mass function , smooth on a finite dimensional parameter .

More specifically, the IV g-estimator of the causal parameter is obtained as a solution to the following estimating equation (Okui et al., 2012)


where is any conformable vector function (i.e. of the appropriate dimension).

This can be made (locally) efficient by choosing (Vansteelandt and Didelez, 2018)


and .

This estimator (denoted by IV-g) can be made feasible by replacing , and by their corresponding consistent estimators , and , and setting equal to 1 (as it is just a proportionality constant). It has been shown to be CAN under , and hence consistent whenever the model for is correctly specified, and either the instrument propensity score or outcome model are correctly specified (Okui et al., 2012). The addition of the instrument propensity score model is particularly helpful when the dependence between and the baseline covariates is known, as would be the case in a randomised trial, thus guaranteeing robustness against misspecification of the outcome model.

Moreover, the IV-g estimator is efficient when , and are correctly specified (Vansteelandt and Didelez, 2018), a property referred to as ‘locally efficiency’.

Since the IV g-estimator is CAN, its asymptotic variance is the variance of its influence function, denoted (Newey, 1990), i.e. . can be written as






Finally, the variance can be estimated by the sample variance of the estimated influence function, obtained by replacing , and by their corresponding consistent estimators , and ,

where we have used the subscript to denote the sample variance on a sample of size .

This variance estimator ignores the nuisance parameter estimation. Robust standard errors can be obtained via the bootstrap or a sandwich estimator.

3.3 Data-adaptive estimation for the IV-g estimator

The IV-g estimator presented thus far is restricted to using parametric working models for the nuisance parameters. In this section, we will relax this restriction by using instead machine learning methods to obtain data-adaptive estimates of the nuisance functionals and plugging these fits into the estimating equation (11). Unlike fixed parametric models where the functional form is chosen by the analyst, these methods use an algorithm, often called a learner, to find the best fitting estimator. In fact, we need not restrict ourselves to a single learner, instead we use meta-learning or ensemble approaches, in particular the Super Learner (SL) (van der Laan et al., 2007). The SL uses cross validation to find the optimal weighted convex combination of multiple candidate estimators specified by the user in the SL library. The library can include parametric and non-parametric estimators. The SL has been shown to perform asymptotically as well as the best learner included in its library, so that adding additional algorithms improves the performance of the SL. The finite sample performance of the SL has been demonstrated extensively in simulations (van der Laan et al., 2007; Porter et al., 2011; Pirracchio et al., 2015).

Machine learning estimation approaches for DR estimators for the IV model have been studied recently. Chernozhukov et al. (2016) give sufficient conditions to guarantee that using data-adaptive fits for the nuisance functionals in a DR estimator results in valid inferences. Briefly, the score function of the DR estimator needs to be Neyman orthogonal to the nuisance parameters i.e. the path-wise (or Gateaux) derivative of the score function exists and vanishes at the true value of the nuisance parameters. Then, as long as the data-adaptive estimators for all nuisance functionals converge to their respective truths, and the product of their convergence rates is faster than , the DR estimator will be CAN. Convergence rates for these data-adaptive estimators depend on the smoothness and number of covariates included (Györfi et al., 2006).

Under sufficient regularity conditions, and if is estimated using parametric models, the arguments used in Chernozhukov et al. (2016) Section 4.2 can be applied to show that the IV-g estimator is still CAN when using data-adaptive fits for and as long as:


where denotes the norm defined by . We refer the interested reader to Chernozhukov et al. (2016) for the technical details.

The IV-g estimator implemented here estimates and jointly using parametric models that include only main terms for both, but uses data-adaptive fits for and the instrument propensity score. Thus the rates in condition (16) are not quite sufficient. We conjecture that in addition we need that

The use of data-adaptive fits for nuisance functionals has been extensively exploited within the TMLE literature which we review next.

3.4 Targeted minimum loss estimation (TMLE)

Targeted minimum loss estimation (TMLE) is a general approach for causal inference, which has been adopted on a wide range of causal problem (Gruber and van der Laan, 2010; van der Laan and Rose, 2011; Zheng and van der Laan, 2012; van der Laan and Gruber, 2012; Petersen et al., 2014).

TMLE is a semi-parametric, double-robust estimation approach, that combines estimates of the nuisance functional and outcome mechanism in a way that guarantees a consistent estimator of the target parameter if either the nuisance parameters or the outcome mechanism are estimated consistently. If both are estimated consistently, TMLE is efficient (van der Laan and Rose, 2011). These initial estimates can be obtained by specifying parametric models or, most commonly via machine learning, in particular the Super Learner (van der Laan et al., 2007). The use of data-adaptive estimates reduces bias, and improves the chances of achieving efficiency, and accurate statistical inference.

TMLE estimators for the linear IV model have been recently proposed by Tóth and van der Laan (2016). We describe below in more detail the linear non-iterative TMLE, which we denote by IV-TMLE.

3.5 Iv-Tmle

Let be the target parameter mapping, from the space of all possible models for the true distribution of the data to , defined by the treatment effect modification working parametric model , as the solution to

We note that only depends on through and the distribution of the covariates . We denote this relevant part by , with , while under , depends on and , so that .

The TMLE algorithm starts by obtaining initial estimates of , , and the instrument propensity score . We denote these initial estimates by writing a hat over the object.

The next step in the construction of a TMLE requires the specification of a loss function

, such that , i.e. the expectation of the loss function is minimised at the true probability distribution. For the linear-fluctuation IV-TMLE, we use the square error loss function , where is the efficient influence function (EIF) under the linear IV model and the working model , which can be written as (see Tóth and van der Laan (2016) for the details):

where and

The term is associated with instrument strength, and can be written as


Now, we proceed to the targeting step, using a linear fluctuating model that updates the initial fit for . For this, initial fits for , and are used in model (4) to define the initial estimate for , . The linear fluctuating model is now:


where the so-called clever covariate is defined as


With these choices for and the fluctuating model, can be found by solving the EIF equation,

This is equivalent to solving for a system of linear equations:


where is obtained by plugging in the initial fits and into equation (19). The TMLE estimator of , denoted by , is now obtained by substituting , the solution to equation (20) into equation (18). Finally, we project onto the working model by OLS, obtaining , the TMLE estimator of the causal parameter of interest.

Tóth and van der Laan (2016) showed that this approach yields an estimator which is consistent when (i) the initial estimators of and are consistent, (ii) the initial estimators of and are consistent, or (iii) the initial estimators of and are consistent.

3.6 Comparison of the two estimators

Recall that the IV-g estimator assumes a parametric model for , e.g. (equation (6)), while the IV-TMLE obtains a first estimate of which can include all other covariates in not only , and then after updating this through the TMLE procedure projects the targeted estimator onto a user-supplied working parametric model, e.g. , which is not assumed to be correctly specified.

The IV-g estimator gains efficiency from assuming equation (6) when equation (6) holds. Suppose however that the model (6) is misspecified (e.g. that the treatment effect curve depends on more covariates, not just , or that the relationship is not linear). Interestingly, the IV-g estimator will continue to be a consistent estimator of the parameters , whenever the mean exposure model is correctly specified and is constant in . This is because the IV-g estimator solves


This is equivalent to

so that,


This converges to if is constant, and continues to hold in large samples, when and the instrument propensity score are replaced by their consistent estimators respectively.

If however is not constant in , then the IV-g estimator under a misspecified converges to a weighted average of the true estimand of interest.

3.6.1 Weak instrument instability for IV-TMLE

The other difference between these estimators is the reliance (or not) on the exposure model. In particular, the clever covariate used to define the IV-TMLE estimator depends on the inverse of . This captures the strength of the instrument in predicting the exposure given , and the variance of the IV-TMLE estimators become very large when this is very small. To stabilise the estimators, we choose the maximum of the estimated value of , and 0.025 when constructing the clever covariate for a given data set. Such corrections are not necessary for the IV-g estimator.

4 Simulation Study

We perform a small factorial simulation study to assess the finite sample performance of the alternative methods, under the different combinations of , or being in turn correctly specified or not, while the instrument model is always correct. We write as an indicator function, for scenarios where the assumed model indexed by is misspecified.

We generate the RCT data, with two-sided non-adherence, i.e. both randomly allocated groups have a non-zero probability of receiving the opposite treatment. We begin by generating five independent standard normal variables and . These are the observed baseline covariates, of which one is the effect modifier . We also generate a standard normal unobserved confounder . We generate randomised treatment also independently of the other variables, , and then simulate the binary treatment received as a function the baseline variables, the unobserved confounder, and the instrument:

Notice that we are generating in such a way that the condition necessary for the IV-g estimator to converge to the parameter of interest when is wrong is no longer satisfied.

The continuous outcome is then simulated as follows:

We simulate RCTs with two different sample sizes, and , and generate 1,000 replicates for each scenario. We perform analyses with TSLS, IV-g and IV-TMLE. The latter two are implemented with and without the use of SL.

Parametric TSLS, IV-g and TMLE are fitted using the main terms only. For TSLS the first stage are as per equation (3.1), while for IV-g and TMLE, these are logistic models. The IV propensity score is also a logistic model. The SEs of the parametric IV-g and TMLE are obtained by bootstrapping (percentile 95% CI using 1999 bootstrap samples).

For SL fits in IV-g, since and

are binary, we use a library including glm (generalised linear models), step (stepwise model selection using AIC), svm (support vector machines), gam (generalised additive models). For the glm, step and gam learners, linear and second-order terms are used.

For the IV-TMLE, we use the same library as above for and , while for and , we include only glm, step, svm and polymars (multivariate adaptive polynomial spline regression), in order to preserve the linear structure of model (4).

The methods described in Chernozhukov et al. (2016) require the use of sample splitting in order to guarantee valid inference when using machine learning for fitting the nuisance models. We do not perform such sample splitting for the IV-g or the TMLE estimators, other than that already being used for the cross validation embedded in the Super Learner. A limitation with this pragmatic approach, in that the variance estimator based on the (E)IF ignores the fact that these factors are not independent from each other, as we have used the data to both obtain the fits and then evaluate the (E)IF. van der Laan and Luedtke (2015) suggest using cross-validation. For example, if using 10-fold cross-validation we can estimate the variance,

where is the proportion of observations in each validation sample , and is the IF (or the EIF) of the parameter of interest.

We compute the mean bias of the estimates, coverage of 95% confidence intervals (CI), and root mean square error (RMSE).

4.1 Results from the simulation

Figure 2 shows the mean bias (top) and CI coverage rate (bottom) corresponding to scenarios with sample size of . The corresponding results when appear in Figure 3.

Figure 2: Performance (Bias and Coverage) of TSLS, TMLE and IV-g estimators, when the sample size is (). Scenarios with correct or misspecified and vary by column. Settings where the parametric is correct are plotted in blue, while mis-specified is plotted in red. The hollow shapes correspond to estimates obtained with parametric models, while the solid shapes correspond to SL. The true values of are , the bias is presented with its Monte Carlo error CIs. Dotted line in the bias plot is the 0 line, the dashed lines in the coverage plot are the 92.5 and 97.5 % coverage rates
Figure 3: Performance (Bias and Coverage) of TSLS, TMLE and IV-g estimators, when the sample size is (). Scenarios with correct or misspecified and vary by column. Settings where the parametric is correct are plotted in blue, while mis-specified is plotted in red. The hollow shapes correspond to estimates obtained with parametric models, while the solid shapes correspond to SL. The true values of are , the bias is presented with its Monte Carlo error CIs. Dotted line in the bias plot is the 0 line, the dashed lines in the coverage plot are the 92.5 and 97.5 % coverage rates.
(a) n=500
(b) n=10,000
Figure 4: RMSE of TSLS, TMLE and IV-g estimators, when the sample size is (a) (top) and (b) . Scenarios with correct or misspecified and vary by column. Settings where the parametric is correct are plotted in blue, while mis-specified is plotted in red. The hollow shapes correspond to estimates obtained with parametric models, while the solid shapes correspond to SL.

As expected, when all models are correctly specified, all methods show close to zero bias for both of the target parameters, already with small () sample size, with coverage levels close to the nominal (between 92.5 and 97.5%) for TSLS and IV-g. While the bootstrap CI corresponding to parametric TMLE resulted in over-coverage (99%), the SL IV-TMLE shows some under-coverage, which improves at large () sample. This is a consequence of the variance estimator used here ignoring the fact that the data have been used to both obtain the fits and then evaluate the EIF (instead of using sample splitting). The RMSE were indistinguishable across the methods. This pattern held both when the methods were implemented using parametric models, and when with the SL.

TSLS performs well when is correctly specified regardless of the other models, but performs very poorly when and are misspecified (bias 200 times the size of the effect and 0 coverage).

Under those misspecified scenarios when the double robust properties were expected to provide protection, these small levels of bias, and good coverage has generally held. Adding a misspecified treatment effect model has doubled RMSE across the methods, but has not affected bias or coverage, and this held true when a misspecified model was used, as long as the exposure model was still correctly specified. However, when the exposure model was misspecified (the last two columns of the Figures), the estimators started showing sensitivity. When the treatment effect model and the exposure model were both misspecified, all of the parametric estimators considered showed high levels of bias. When implementing the IV-g and IV-TMLE methods using the Super Learner, the bias and coverage, and consequently the RMSE, returned to the levels reported under correct specification. An exception is the coverage for the IV-g estimator of , which remained very low (<50%). This is mostly due to the substantial remaining bias, stemming from the fact that the estimator converges to a weighted average of the true when the parametric is mis-specified (equation (22)). When is also misspecified, we observe essentially the same levels of bias, RMSE and coverage, as under the previously described scenario. Across the scenarios, the effect of sample size was not significant in the reported bias, RMSE and coverage levels, except for the scenarios when and were misspecified. Here, the IV-g estimator with parametric implementation reported extremely high bias, RMSE and Monte Carlo errors. With large sample size, the bias was smaller, although still substantial, and the Monte Carlo errors reduced to close to zero.

We summarise the results of the simulation study by comparing estimated RMSEs, Figure 4. Under , we conclude that both DR estimators have reported performance according to their theoretical double-robust properties, and the TSLS method showed similar performance to the parametric implementation of the IV-g method. Both DR methods have benefitted from the data-adaptive estimation of the nuisance parameters: the performance of the estimators have not been harmed in the correctly specified scenarios, and RMSE has been greatly reduced in the scenarios when the DR properties did not provide protection against misspecification. When the assumed model is correctly specified, IV-g outperforms all other methods, with the smallest RMSE. In contrast, when is misspecified, IV-TMLE performs best.

5 Motivating example: the COPERS trial

We now illustrate the methods in practice by applying each in turn to the motivating example. The COping with persistent Pain, Effectiveness Research in Self-management trial (COPERS) was a randomised controlled trial across 27 general practices and community services in the UK. It recruited 703 adults with musculoskeletal pain of at least 3 months duration, and randomised 403 participants to the active intervention and a further 300 to the control arm. The mean age of participants was 59.9 years, with 81% white, 67% female, 23% employed, 85% with pain for at least 3 years, and 23% on strong opioids.

Intervention participants were offered 24 sessions introducing them to cognitive behavioural (CB) approaches designed to promote self-management of chronic back pain. The sessions were delivered over three days within the same week with a follow-up session 2 weeks later. At the end of the 3-day course participants received a relaxation CD and self-help booklet. Controls received usual care and the same relaxation CD and self-help booklet.

The primary outcome was pain-related disability at 12 months, using the Chronic Pain Grade (CPG) disability sub-scale. This is a continuous measure on a scale from 0 to 100, with higher scores indicating worse pain-related disability.

In the active treatment, only 179 (45%) attended all 24 sessions, and 322 (86.1%) received at least one session. The control arm participants had no access to the active intervention sessions. Participants and group facilitators were not masked to the study arm they belonged to.

The intention-to-treat analysis found no evidence that the COPERS intervention had an effect on improving pain-related disability at 12 months in people with long-established, chronic musculoskeletal pain (, % CI to ).

Poor attendance to the sessions was anticipated, and so obtaining causal treatment effect estimates was a pre-defined objective of the study. The original report included a causal treatment effect analysis using TSLS, using a binary indicator for treatment received (attending at least half of the sessions), and assuming that randomisation was a valid instrument for treatment received (Taylor et al., 2016). The IV model adjusted for the following baseline covariates: site of recruitment, age, gender and HADS score and the CPG score at baseline. This IV analysis found no evidence of a treatment effect on CPG at 12 months amongst the compliers ( , % CI to ).

The COPERS study also performed a number of subgroup analyses to investigate treatment effect heterogeneity, but did not carry out IV analysis with effect modification. However, treatment heterogeneity in the causal effect is still of interest.

For our re-analyses, the data set consists of 652 participants followed up for 12 months, 374 allocated to active treatment, and 278 in the control (93% of those recruited). Thirty-five individuals (5%) have missing primary outcome data, and a further 4 ( <1%) have missing baseline depression score, leaving a sample size of 613.

We focus on the causal effect of receiving at least one treatment session as a function of depression at baseline measured using the Hospital Anxiety and Depression Scale (HADS).

We argue that random allocation is a valid IV: the assumptions concerning unconfoundedness and instrument relevance are justified by design. The exclusion restriction assumption seems plausible with our choices for , as only those participants receiving at least one training sessions would know how to use the CB coping mechanisms and potentially to improve their disability. It is unlikely that that random allocation has a direct effect, though since participants were not blinded to their allocation, we cannot completely rule out some psychological effects of knowing one belongs to the control or active group on pain and disability.

We perform each of the methods in turn, TSLS, IV-g and IV-TMLE to estimate . As Table 1 summarises, the use of DR methods, even after using Super Learner does not result in a material change in the point estimates or SEs, compared to standard TSLS. All 5 estimators give the same inference, namely, there is no evidence of an average treatment effect in the treated, and there is no evidence of effect modification by baseline depression. The lack of statistical significance may be due to small numbers of participants in the trial. While not statistically significant, the direction of the treatment effect modification is interesting: the point estimates indicate the treatment may benefit more those with higher depression symptoms at baseline (the ATT resulting in a reduction in the disability score).

TSLS 2.94 4.67 -0.58 0.57
IV-g 2.78 4.66 -0.53 0.54
IV-g SL 2.10 4.75 -0.45 0.54
IV-TMLE 3.16 4.74 -0.64 0.56
IV-TMLE SL 2.22 4.88 -0.51 0.58
Table 1: ATT of the COPERS intervention on CPG, with all-or-nothing binary exposure , main effect and effect modification by depression) .

6 Discussion

This paper compared the performance of two doubly robust estimators for the ATT dependent on baseline covariates, in the presence of unmeasured confounding, but where a valid (conditional) IV is available. These estimators were implemented with and without the use of data-adaptive estimates of the nuisance parameters. We have demonstrated empirically through simulations that the IV-g estimator has good finite sample performance when using data-adaptive fits for the nuisance parameters, provided the parametric model assumed for the treatment effect curve is correctly specified. The IV-TMLE offers greater flexibility, not relying on a correctly specified parametric model, and performs very well under most scenarios. The price for this greater flexibility is less efficiency, compared with that of the IV-g, when the (parametric) model for the treatment effect curve is correctly specified. As the simulations show, the use of data-adaptive fits for the nuisance models greatly reduces bias, and improves coverage, resulting in much smaller RMSEs, when compared with using parametric nuisance models, and thus data-adaptive fits should be used.

The methods were motivated and tested in the context of estimating the ATT with effect modification in RCTs with non-adherence to randomised treatment with binary exposure and a continuous outcome. However, the methods presented here are applicable to other settings. One situation may be where the IV assumptions are believed to be satisfied only after conditioning on baseline covariates, making this applicable to certain observational settings. Extensions to situations with continuous exposure are also straight-forward if one is prepared to assume linearity of the treatment effect curve (Tóth and van der Laan, 2016; Vansteelandt and Didelez, 2018).

We have focused on the as the estimand of interest, but Ogburn et al. (2015) have shown that the same functional of the observed data can be used to identify under monotonicity the local average treatment effects conditional on baseline covariates, LATE(v). In fact, much of the previous literature regarding estimation of instrumental variable models with covariates has assumed monotonicity. In particular, for the special case where , previous methods include full parametric specifications suitable when both the IV and exposure are binary by Little and Yau (1998) and Hirano et al. (2000) and a semi-parametric model proposed by Abadie (2003). In the case where is null, Frölich (2007) characterised two distinct non-parametric estimation methods, while Tan (2006) proposed a DR estimator which is consistent when either the instrument propensity score and either the outcome or the exposure parametric models are correctly specified.

For the , Robins (1994) and Tan (2010) proposed DR estimators in the case where for , and is a strict subset of respectively. The DR estimator presented by Okui et al. (2012) and Vansteelandt and Didelez (2018) builds on the work of Tan (2010). For the special case when is null, Vansteelandt and Didelez (2018) proposed other DR estimators which are locally efficient, and also constructed a bias-reduced DR IV estimator. Several authors have proposed data-adaptive estimators for the linear IV model with no effect modification, beginning with a TSLS where the first stage in fitted using LASSO with a data-adaptive penalty (Belloni et al., 2012). The bias-reduced DR IV estimator has also been implemented when is null using data-adaptive fits for the conditional mean outcome in the unexposed (Vermeulen and Vansteelandt, 2016). Chernozhukov et al. (2016) proposed two other IV DR data-adaptive estimators and gave conditions under which data-adaptive fits can be used for , and . Comparing these DR estimators to the those presented here would be a promising avenue for future research.

The present study has some limitations. Formal proof of the regularity conditions and rates of convergence needed to obtain uniformly valid inferences when using data-adaptive techniques for the IV g-estimator presented here are still lacking, and should be the subject of further study. Similarly, the required convergence conditions for the IV-TMLE estimator have not been formally reported. Another limitation is that we did not seek to quantify the rates of convergence attained by algorithms included in the SL library. This is because in general the rates of convergence of the individual machine learning algorithms depend on the number of included variables, and other tuning parameters, making the assessment of rates of convergence complex. A potential promising solution for this could be to include the highly adaptive lasso (HAL) (Benkeser and Laan, 2016) in the SL library, as this has been proven under sufficient regularity conditions to converge at rates faster than .

A number of extensions to the work presented here are of interest. The IV-g method implemented here jointly estimates and , and thus used parametric models for both. This is not necessary, and an alternative strategy where is estimated beforehand and the fitted values plugged in into the estimating equation (11) is possible, thus allowing the use of data-adaptive fits for the model . Future work could extend the bias-reduced DR estimator to the linear IV model with effect modification, and compare this with IV-TMLE and a fully data-adaptive version of the IV-g estimator.


We thank Prof. Stephanie Taylor and the COPERS trial team for access to the data. We also thank Prof. Stijn Vansteelandt for commenting on an earlier draft of this paper, and Boriska Tóth for sharing her code implementing IV-TMLE.
Karla DiazOrdaz is supported by UK Medical Research Council Career development award in Biostatistics MR/L011964/1.


  • Abadie (2003) Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of econometrics, 113(2):231–263.
  • Angrist et al. (1996) Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):pp. 444–455.
  • Belloni et al. (2012) Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.
  • Benkeser et al. (2017) Benkeser, D., Carone, M., Laan, M. V. D., and Gilbert, P. (2017). Doubly robust nonparametric inference on the average treatment effect. Biometrika, 104(4):863–880.
  • Benkeser and Laan (2016) Benkeser, D. and Laan, M. V. D. (2016). The highly adaptive lasso estimator. In

    2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA)

    , pages 689–696.
  • Chernozhukov et al. (2016) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., et al. (2016). Double machine learning for treatment and causal parameters. arXiv preprint arXiv:1608.00060.
  • Dodd et al. (2012) Dodd, S., White, I., and Williamson, P. (2012). Nonadherence to treatment protocol in published randomised controlled trials: a review. Trials, 13(1):84.
  • Dunn and Bentall (2007) Dunn, G. and Bentall, R. (2007). Modelling treatment-effect heterogeneity in randomized controlled trials of complex interventions (psychological treatments). Statistics in Medicine, 26(26):4719–4745.
  • Farrell (2015) Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23.
  • Frölich (2007) Frölich, M. (2007). Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics, 139(1):35 – 75. Endogeneity, instruments and identification.
  • Gruber and van der Laan (2010) Gruber, S. and van der Laan, M. J. (2010). A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. The International Journal of Biostatistics, 6(1).
  • Györfi et al. (2006) Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2006). A distribution-free theory of nonparametric regression. Springer Science & Business Media.
  • Hernán and Robins (2006) Hernán, M. A. and Robins, J. M. (2006). Instruments for causal inference: an epidemiologist’s dream? Epidemiology, 17(4):360–372.
  • Hirano et al. (2000) Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, X.-H. (2000). Assessing the effect of an influenza vaccine in an encouragement design. Biostatistics, 1(1):69–88.
  • Kang et al. (2007) Kang, J. D., Schafer, J. L., et al. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical science, 22(4):523–539.
  • Little and Yau (1998) Little, R. J. and Yau, L. H. (1998). Statistical techniques for analyzing data from prevention trials: Treatment of no-shows using rubin’s causal model. Psychological Methods, 3(2):147.
  • Newey (1990) Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics, 5(2):99–135.
  • Ogburn et al. (2015) Ogburn, E. L., Rotnitzky, A., and Robins, J. M. (2015). Doubly robust estimation of the local average treatment effect curve. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(2):373–396.
  • Okui et al. (2012) Okui, R., Small, D. S., Tan, Z., and Robins, J. M. (2012). Doubly robust instrumental variable regression. Statistica Sinica, 22(1):173–205.
  • Petersen et al. (2014) Petersen, M., Schwab, J., Gruber, S., Blaser, N., Schomaker, M., and van der Laan, M. (2014). Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. Journal of causal inference, 2(2):147–185.
  • Pirracchio et al. (2015) Pirracchio, R., Petersen, M. L., and van der Laan, M. (2015). Improving propensity score estimators’ robustness to model misspecification using super learner. American journal of epidemiology, 181(2):108–119.
  • Porter et al. (2011) Porter, K. E., Gruber, S., van der Laan, M. J., and Sekhon, J. S. (2011). The relative performance of targeted maximum likelihood estimators. The international journal of biostatistics, 7(1):1–34.
  • Robins (1994) Robins, J. M. (1994). Correcting for non-compliance in randomized trials using structural nested mean models. Communications in Statistics - Theory and Methods, 23(8):2379–2412.
  • Robins (2000) Robins, J. M. (2000). Robust estimation in sequentially ignorable missing data and causal inference models. In

    Proceedings of the American Statistical Association, Section on Bayesian Statistical Science

    , pages pp. 6–10.
  • Rubin (1978) Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. The Annals of statistics, pages 34–58.
  • Tan (2006) Tan, Z. (2006). Regression and weighting methods for causal inference using instrumental variables. Journal of the American Statistical Association, 101(476):1607–1618.
  • Tan (2010) Tan, Z. (2010). Marginal and nested structural models using instrumental variables. Journal of the American Statistical Association, 105(489):157–169.
  • Taylor et al. (2016) Taylor, S. J., Carnes, D., and Homer, Kate, e. a. (2016). Improving the self-management of chronic pain: Coping with persistent pain, effectiveness research in self-management (copers). Programme Grants for Applied Research, 4(14).
  • Tóth and van der Laan (2016) Tóth, B. and van der Laan, M. J. (2016). TMLE for marginal structural models based on an instrument. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 350.
  • van der Laan and Rose (2011) van der Laan, M. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Series in Statistics.
  • van der Laan and Gruber (2012) van der Laan, M. J. and Gruber, S. (2012). Targeted minimum loss based estimation of causal effects of multiple time point interventions. The international journal of biostatistics, 8(1).
  • van der Laan and Luedtke (2015) van der Laan, M. J. and Luedtke, A. R. (2015). Targeted learning of the mean outcome under an optimal dynamic treatment rule. Journal of causal inference, 3(1):61–95.
  • van der Laan et al. (2007) van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super learner. Statistical applications in genetics and molecular biology, 6(1):1–21.
  • VanderWeele (2009) VanderWeele, T. J. (2009). Concerning the consistency assumption in causal inference. Epidemiology, 20(6):880–883.
  • Vansteelandt and Didelez (2018) Vansteelandt, S. and Didelez, V. (2018). Improving the robustness and efficiency of covariate adjusted linear instrumental variable estimators. to appear in Scandinavian Journal of Statistics.
  • Vermeulen and Vansteelandt (2016) Vermeulen, K. and Vansteelandt, S. (2016). Data-adaptive bias-reduced doubly robust estimation. The international journal of biostatistics, 12(1):253–282.
  • Wiles et al. (2014) Wiles, N. J., Fischer, K., Cowen, P., Nutt, D., Peters, T. J., Lewis, G., and White, I. R. (2014). Allowing for non-adherence to treatment in a randomized controlled trial of two antidepressants (citalopram versus reboxetine): an example from the genpod trial. Psychological Medicine, 44(13):2855–2866.
  • Wooldridge (2010) Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.
  • Zhang et al. (2014) Zhang, Z., Peluso, M. J., Gross, C. P., Viscoli, C. M., and Kernan, W. N. (2014). Adherence reporting in randomized controlled trials. Clinical Trials, 11(2):195–204.
  • Zheng and van der Laan (2012) Zheng, W. and van der Laan, M. J. (2012). Targeted maximum likelihood estimation of natural direct effects. The international journal of biostatistics, 8(1):1–40.