1 Introduction
There has been increased interest in estimating the causal effect of treatment actually received (in addition to the intention-to-treat effect) in randomised controlled trials (RCTs) in the presence of treatment non-adherence (Wiles et al., 2014; Dodd et al., 2012; Zhang et al., 2014). An additional challenge is posed by appreciable treatment effect heterogeneity, which is often itself of interest. This is a common issue with psychological interventions (Dunn and Bentall, 2007).
In this work, we consider methods for estimating the dependence of a causal average treatment effect on baseline covariates in RCTs with non-adherence. This is motivated by the COPERS (COping with persistent Pain, Effectiveness Research in Self-management) trial. The intervention introduced cognitive behavioural therapy approaches designed to promote self-efficacy to manage chronic pain, with the primary outcome being pain-related disability. The research team was interested in the causal effect of the received treatment, and whether this effect was modified by a number of baseline variables. Here, we will focus on one possible effect modifier: depression at baseline, measured by the Hospital Anxiety and Depression Scale (HADS).
Instrumental variable (IV) methods are often used to estimate the effect of treatment received in RCTs where randomised treatment is unconfounded by design, but treatment received is not. Assuming that randomised treatment is a valid instrument, and under some additional assumptions reviewed in Section 2, it is possible to identify the average treatment effect in the treated, conditional on baseline covariates V. In addition to investigating effect modification by a subset of baseline covariates V, it can in other settings be beneficial to use a larger set of baseline covariates X for adjustment in the analysis: (i) if the IV assumptions are more plausible conditional on baseline covariates X, or (ii) to increase the statistical efficiency of the estimators.
A relatively simple estimation method for this is so-called two-stage least squares (TSLS). In its simplest form, i.e. when V is null, the first stage predicts the exposure based on an ordinary least squares regression of the exposure on the IV and baseline covariates X, while the second stage consists of regressing the outcome on the predicted exposure from the first stage and baseline covariates X, also via ordinary least squares regression. The coefficient corresponding to the predicted exposure in this second model is the TSLS estimator of the desired causal treatment effect. TSLS is robust to misspecification of the first stage model (Robins, 2000; Wooldridge, 2010) but may be inefficient, especially when the treatment-exposure relationship is nonlinear (Vansteelandt and Didelez, 2018). However, as we shall see later, where V is non-null and the treatment effect varies by baseline covariates, TSLS relies on the outcome model (the second stage) being correctly specified to obtain consistent effect estimates.
Doubly robust (DR) estimators are appealing in such settings, as they estimate the parameter of interest consistently if at least one of the models, for either the exposure or the outcome, is correctly specified. In the context of linear IV models with V null, Okui et al. (2012) proposed a locally efficient estimating-equations DR estimator for the causal effect of treatment in the treated, often called a g-estimator. It augments the TSLS estimating equation by adding a model for the instrument given the baseline covariates. This estimator is DR in the sense that it requires correct specification of either the outcome model or the instrument model. This estimator was generalised to settings where V is non-null by Vansteelandt and Didelez (2018) and shown to be locally efficient.
Recently, Tóth and van der Laan (2016) proposed a targeted maximum likelihood estimator (TMLE) for the treatment effect in a linear IV model. TMLE is a general approach for causal inference problems yielding semiparametric substitution estimators that are doubly robust (van der Laan and Rose, 2011).
Although DR estimators offer partial protection in principle against model misspecification, concerns remain over their performance in practice, when all models are likely to be misspecified (Kang et al., 2007). To further mitigate problems of model misspecification, TMLE is usually coupled with machine learning fits of the nuisance models, in particular the Super Learner, a cross-validation-based estimator selection approach (van der Laan et al., 2007). TMLE and other DR estimators possess a particular orthogonality property that makes them better suited to machine learning fits. Estimators based on a single nuisance model can perform poorly when combined with machine learning fits, since the estimator inherits the slow convergence (and hence high finite-sample bias) and non-regularity of the machine learning fits themselves; the latter property makes valid statistical inference extremely complex to obtain, as even the bootstrap is invalid (Benkeser et al., 2017).
Some DR estimators, such as TMLE, on the other hand, when combined with machine learning fits, have faster convergence and make (asymptotic) analytic statistical inference tractable via the sampling variance of the corresponding influence functions, under conditions on the convergence of the machine learning algorithms used which, although not guaranteed to hold by any means, are considerably weaker than what would be required to achieve the same with non-DR estimators (Farrell, 2015).
Chernozhukov et al. (2016) develop these ideas further, and suggest the use of machine learning for estimating the nuisance parameters of a wide class of estimating-equation DR estimators, to which the g-estimator presented by Vansteelandt and Didelez (2018) belongs.
We implement these two IV estimators, g-estimation and TMLE, with and without machine learning fits, and compare their performance with that of standard TSLS, in terms of mean bias, root mean squared error (RMSE) and confidence interval (CI) coverage, using a simulation study. We also contrast the methods in the case study.
This paper is organised as follows. In the next section, we define the causal parameters of interest and the assumptions for the IV methods. In Section 3.1 we review the standard TSLS, while in Section 3.2 we introduce the g-estimator proposed by Vansteelandt and Didelez (2018). Section 3.3 briefly justifies the use of so-called double machine learning methods (data-adaptive DR estimators), and introduces the Super Learner. The TMLE estimator proposed by Tóth and van der Laan (2016) is described in Section 3.4. In Section 4 we present a simulation study, comparing the performance of these estimators. The proposed methods are then applied to the COPERS RCT in Section 5. We conclude with a discussion.
2 Linear instrumental variables models
Let X be a set of baseline variables, Z be the randomised treatment indicator, and A be the exposure of interest, namely the actual treatment received, presumed to be binary. Let Y be the continuous outcome of interest, and U be the set of all unobserved common causes of A and Y. For simplicity, but without loss of generality, assume that we are only interested in effect modification by a single baseline variable V, a component of X.
We use a subscript 0 to denote the true probability distributions, models and parameters. Let the vector of the observed data for the ith individual be O_i = (X_i, Z_i, A_i, Y_i), drawn as an independent identically distributed random sample of size n from the true underlying distribution P0. The causal relationships between these variables are encoded by the directed acyclic graph (DAG) shown in Figure 1.
Let the counterfactual or potential outcome Y(a) be the outcome that would occur if A were set to a. As usual, we assume that no interference and counterfactual consistency hold (Rubin, 1978; VanderWeele, 2009).
Following Abadie (2003); Vansteelandt and Didelez (2018), we write the conditional version of the IV assumptions (Angrist et al., 1996), as follows:
(i) Conditional unconfoundedness: Z is conditionally independent of the unmeasured confounders, conditional on the measured covariates X, i.e. Z ⊥ U | X.
(ii) Exclusion restriction: Conditionally on X, A and the confounders U, the instrument Z and the response Y are independent, i.e. Y ⊥ Z | A, X, U.
(iii) Instrument relevance: Also referred to as the first-stage assumption: Z is associated with A conditional on X, i.e. Z ⊥̸ A | X.
A special case is the simple RCT, where randomised treatment is often assumed to be an unconditional valid IV.
In addition, we assume the following conditional mean model for the outcome:

E(Y | A, X, U) = ω0(X, U) + γ0(V) A,   (1)

where ω0 and γ0 are unknown functions and γ0(V) represents the causal treatment effect given covariates V.
Now, by the technical assumption of counterfactual consistency, the conditional mean model (1), together with the conditional IV assumptions, implies (Vansteelandt and Didelez, 2018)

E{Y − γ0(V) A | Z, X} = E{Y − γ0(V) A | X}.   (2)
While this model can be motivated from model (1), it is often used explicitly as the departure point for causal treatment effect estimation. In fact, Vansteelandt and Didelez (2018) show that these two IV models imply the same restrictions on the observed data distribution. Let M denote the statistical model for the data distribution implied by the IV assumptions and either model (1) or (2). This is often called the linear IV model. Note that in model M, the treatment effect in the treated does not depend on Z. This is known as the ‘no effect modification by Z’ assumption (Hernán and Robins, 2006).
The causal effect of interest, the average treatment effect in the treated, conditional on V taking the value v, can be written as a function of γ0 as

E{Y(1) − Y(0) | A = 1, V = v} = γ0(v).   (3)

Since this estimand is determined by γ0, we will first focus on estimating γ0.
Rearranging equation (2) and marginalising over U, we can see that

E(Y | Z, X) = E{ω0(X, U) | Z, X} + γ0(V) E(A | Z, X) = E{ω0(X, U) | X} + γ0(V) E(A | Z, X),

the last equality holding by the conditional unconfoundedness assumption. Model M thus implies

E(Y | Z, X) = q0(X) + γ0(V) E(A | Z, X),   (4)

where q0(X) = E{ω0(X, U) | X} or q0(X) = E{Y − γ0(V) A | X}, depending on whether model (1) or (2) is assumed. Therefore, for a binary IV, we have

E(Y | Z = 1, X) − E(Y | Z = 0, X) = γ0(V) {E(A | Z = 1, X) − E(A | Z = 0, X)},

showing that under M and the IV assumptions, γ0(V) is identified by

γ0(V) = {E(Y | Z = 1, X) − E(Y | Z = 0, X)} / {E(A | Z = 1, X) − E(A | Z = 0, X)}.   (5)
3 Doubly robust estimation for the linear instrumental variable model
The methods presented below make different assumptions about the treatment effect model. As we shall see in Sections 3.1 and 3.2, TSLS and the g-estimator assume γ0 is a known parametric function γ(V; ψ), smooth in ψ, with the true ψ0 an unknown parameter of finite dimension p.
To illustrate the methods, we consider throughout a situation where interest lies in the main effect modification by a single variable V. Under this working model for the effect modification by V, the parameter of interest is ψ = (ψ1, ψ2), where ψ1 represents the main causal treatment effect, and ψ2 is the effect modification by V. In this case, p = 2.
TSLS and g-estimation require the specification of a parametric model for γ0, so in what follows these methods presume that this working model is the true model, i.e. γ0 takes the parametric form

γ(V; ψ) = ψ1 + ψ2 V.   (6)

In contrast, the TMLE method does not assume a parametric model for γ0, but instead obtains an initial estimate of γ0 using machine learning, which is then updated according to the TMLE algorithm (see details in Section 3.5). This updated estimate is then projected onto a user-supplied parametric working model for the dependence of the treatment effect curve on V, γ(V; ψ). Importantly, the working parametric model is not assumed to be the true model for γ0. Using the working model (6), TMLE also estimates ψ.
3.1 TSLS
Estimation of the expectations in equation (5) is often done via an approach known as two-stage least squares (TSLS). The first stage fits a linear treatment selection model, that is, a model for A conditional on the instrument Z and the baseline covariates of interest, and then, in a second stage, a regression model of the outcome on the predicted treatment received and baseline covariates is fitted. We write

E(A | Z, X) = m_A(Z, X; α),   (7)
E(Y | Z, X) = m(X; β) + γ(V; ψ) E(A | Z, X).   (8)

In principle, there are many parametric choices for the second stage models, m(X; β) and γ(V; ψ). For TSLS to be consistent, the first stage model must be the parametric linear regression implied by the second stage, that is, it must include all the covariates and interactions that appear in the second stage model.
For example, if we assume that m(X; β) = βᵀX and γ(V; ψ) = ψ1 + ψ2 V, where, abusing notation, we assume the vector of ones is the first column of X, then the first stage would involve two equations, as follows

E(A | Z, X) = α1ᵀ(X, Z, VZ),   E(VA | Z, X) = α2ᵀ(X, Z, VZ),   (9)

where again, X includes 1 to allow for an intercept. Because estimation of these two first-stage models is done separately, without acknowledging that the model for A should imply the model for VA, the resulting TSLS estimator may be inefficient (Vansteelandt and Didelez, 2018).
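The two-stage recipe with an effect-modified second stage can be sketched numerically as follows. Everything here (variable names, data-generating values) is illustrative and not taken from the paper; the point is that the first stage predicts both endogenous regressors, A and VA, from the implied instrument set.

```python
import numpy as np

# Hypothetical trial with non-adherence and a V-dependent treatment effect.
rng = np.random.default_rng(0)
n = 20000
X = rng.normal(size=n)                      # extra baseline covariate
V = rng.normal(size=n)                      # effect modifier
U = rng.normal(size=n)                      # unobserved confounder
Z = rng.binomial(1, 0.5, size=n)            # randomised treatment
logit = -0.5 + 1.5 * Z + 0.3 * X + 0.3 * V + 0.8 * U
A = rng.binomial(1, 1 / (1 + np.exp(-logit)))   # treatment received
Y = 1 + (2 + V) * A + 0.5 * X + 0.5 * V + U + rng.normal(size=n)

def ols_fit(design, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(design, y, rcond=None)[0]

# First stage: separate linear regressions of A and V*A on the instrument
# set implied by the second stage (Z and V*Z are the excluded instruments;
# the intercept, X and V are the included covariates).
D1 = np.column_stack([np.ones(n), Z, X, V, V * Z])
A_hat = D1 @ ols_fit(D1, A)
VA_hat = D1 @ ols_fit(D1, V * A)

# Second stage: regress Y on the predicted endogenous regressors.
D2 = np.column_stack([np.ones(n), A_hat, VA_hat, X, V])
beta = ols_fit(D2, Y)
psi_main, psi_mod = beta[1], beta[2]   # true values here are 2 and 1
```

Note that the confounder U biases a naive OLS of Y on A, but the just-identified two-stage fit remains consistent even though the true treatment-selection mechanism is logistic rather than linear.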
Vansteelandt and Didelez (2018) show that the second stage of the TSLS, equation (8), can be written as the solution to an estimating equation

Σᵢ d(Zᵢ, Xᵢ) {Yᵢ − γ(Vᵢ; ψ) E(A | Zᵢ, Xᵢ) − m(Xᵢ; β)} = 0,   (10)

with d any conformable index vector function of dimension p + dim(β). The estimators ψ̂ and β̂ obtained by solving equation (10), after substituting the first-stage estimator for E(A | Z, X), are consistent asymptotically normal (CAN) under correct specification of the second stage, i.e. even when the first stage model is misspecified (Robins, 2000; Wooldridge, 2010). Moreover, in the specific settings where the treatment effect model γ(V; ψ) is linear in the covariates and the instrument is independent of X, TSLS is also robust to misspecification of m(X; β). We refer the interested reader to Vansteelandt and Didelez (2018), Appendix B, Proposition 5, for the proof.
This means that to obtain estimators which are DR in more general settings, with either a nonlinear γ(V; ψ) or where Z depends on X, methods beyond TSLS need to be considered.
3.2 G-estimation
Okui et al. (2012) propose an estimator for ψ0, constructed by augmenting the estimating equation (10), multiplying the summand by Zᵢ − E(Z | Xᵢ). The model for the conditional distribution of the IV given the baseline covariates is often called the instrument propensity score, and it is assumed to be a known mass function π(X; θ), smooth in a finite-dimensional parameter θ.
More specifically, the IV g-estimator of the causal parameter is obtained as a solution to the following estimating equation (Okui et al., 2012)

Σᵢ d(Xᵢ) {Zᵢ − π(Xᵢ; θ)} {Yᵢ − γ(Vᵢ; ψ) Aᵢ − m(Xᵢ; β)} = 0,   (11)

where d is any conformable vector function (i.e. of the appropriate dimension).
This estimator (denoted by IV-g) can be made feasible by replacing the unknown nuisance quantities by their corresponding consistent estimators, and setting the leading constant equal to 1 (as it is just a proportionality constant). It has been shown to be CAN, and hence consistent, whenever the model for γ is correctly specified and either the instrument propensity score or the outcome model is correctly specified (Okui et al., 2012). The addition of the instrument propensity score model is particularly helpful when the dependence between Z and the baseline covariates is known, as would be the case in a randomised trial, thus guaranteeing robustness against misspecification of the outcome model.
Moreover, the IV-g estimator is efficient when γ, m and π are all correctly specified (Vansteelandt and Didelez, 2018), a property referred to as ‘local efficiency’.
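To make the estimating-equation recipe concrete, here is a minimal numerical sketch in the spirit of the IV-g estimator, for a randomised trial where the instrument propensity score is known to be 0.5 by design. The data-generating values, the deliberately crude working outcome model, and the index functions d(V) = (1, V) are our own illustrative choices, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
X = rng.normal(size=n)
V = rng.normal(size=n)
U = rng.normal(size=n)                      # unobserved confounder
Z = rng.binomial(1, 0.5, size=n)            # randomised, so pi(X) = 0.5
logit = -0.5 + 1.5 * Z + 0.3 * X + 0.3 * V + 0.8 * U
A = rng.binomial(1, 1 / (1 + np.exp(-logit)))
Y = 1 + (2 + V) * A + 0.5 * X + 0.5 * V + U + rng.normal(size=n)

pi = 0.5  # instrument propensity score, known by design in an RCT

# Crude working outcome model q(X, V), fitted ignoring A altogether; the
# estimator stays consistent because pi is correct (double robustness).
D = np.column_stack([np.ones(n), X, V])
q_hat = D @ np.linalg.lstsq(D, Y, rcond=None)[0]

w = Z - pi                                  # residualised instrument
d = np.column_stack([np.ones(n), V])        # index functions d(V) = (1, V)
treat = np.column_stack([A, V * A])         # regressors multiplying psi

# Solve the 2x2 linear system implied by the estimating equation:
# sum_i d(V_i) * w_i * (Y_i - q_hat_i - psi[0]*A_i - psi[1]*V_i*A_i) = 0.
M = d.T @ (w[:, None] * treat)
b = d.T @ (w * (Y - q_hat))
psi = np.linalg.solve(M, b)   # psi[0] ~ main effect, psi[1] ~ modification
```

Because the estimating function multiplies the residual by w = Z − π(X), any fixed working choice of q leaves the population equation unbiased when π is correct, which is the source of the robustness described above.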
Since the IV g-estimator is CAN, its asymptotic variance is the variance of its influence function, denoted IF (Newey, 1990). The influence function can be written as

(13)

with

(14)

and

(15)

Finally, the variance can be estimated by the sample variance of the estimated influence function, obtained by replacing the nuisance quantities and ψ by their corresponding consistent estimators, where we use a subscript n to denote the sample variance on a sample of size n.
This variance estimator ignores the nuisance parameter estimation. Robust standard errors can be obtained via the bootstrap or a sandwich estimator.
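As an illustration of percentile-bootstrap standard errors for an IV estimator, the following sketch bootstraps a simple unadjusted Wald-type ratio on toy data (not the IV-g estimator itself; data-generating values are hypothetical, with a true effect of 2.0).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
Z = rng.binomial(1, 0.5, size=n)
A = rng.binomial(1, 0.2 + 0.5 * Z)     # non-adherence in both arms
Y = 2.0 * A + rng.normal(size=n)

def wald(idx):
    """Unadjusted Wald ratio: ITT effect on Y over ITT effect on A."""
    z, a, y = Z[idx], A[idx], Y[idx]
    return (y[z == 1].mean() - y[z == 0].mean()) / (
        a[z == 1].mean() - a[z == 0].mean())

# Resample individuals with replacement and recompute the estimator.
boot = np.array([wald(rng.integers(0, n, n)) for _ in range(1999)])
ci = np.percentile(boot, [2.5, 97.5])  # percentile 95% CI
```

The same resampling loop applies unchanged to the feasible IV-g estimator: the whole fitting procedure, nuisance models included, is repeated on each bootstrap sample.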
3.3 Data-adaptive estimation for the IV-g estimator
The IV-g estimator presented thus far is restricted to using parametric working models for the nuisance parameters. In this section, we relax this restriction by instead using machine learning methods to obtain data-adaptive estimates of the nuisance functionals, plugging these fits into the estimating equation (11). Unlike fixed parametric models, where the functional form is chosen by the analyst, these methods use an algorithm, often called a learner, to find the best-fitting estimator. In fact, we need not restrict ourselves to a single learner; instead we use meta-learning or ensemble approaches, in particular the Super Learner (SL) (van der Laan et al., 2007). The SL uses cross-validation to find the optimal weighted convex combination of multiple candidate estimators specified by the user in the SL library. The library can include parametric and nonparametric estimators. The SL has been shown to perform asymptotically as well as the best learner included in its library, so that adding candidate algorithms can only improve its asymptotic performance. The finite-sample performance of the SL has been demonstrated extensively in simulations (van der Laan et al., 2007; Porter et al., 2011; Pirracchio et al., 2015).
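As an illustration of the cross-validated selection idea behind the Super Learner, the sketch below implements a discrete version, which picks the single candidate with the lowest cross-validated risk rather than the full convex weighting; the candidate learners and the data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
x = rng.uniform(-3, 3, size=n)
y = np.sin(x) + 0.3 * rng.normal(size=n)   # nonlinear truth plus noise

# Candidate learners: each maps training data to a prediction function.
def fit_mean(xtr, ytr):
    m = ytr.mean()
    return lambda xv: np.full_like(xv, m)

def fit_linear(xtr, ytr):
    b = np.polyfit(xtr, ytr, 1)
    return lambda xv: np.polyval(b, xv)

def fit_cubic(xtr, ytr):
    b = np.polyfit(xtr, ytr, 3)
    return lambda xv: np.polyval(b, xv)

learners = [fit_mean, fit_linear, fit_cubic]

# 5-fold cross-validated mean squared error for each learner.
folds = np.array_split(rng.permutation(n), 5)
cv_mse = np.zeros(len(learners))
for idx in folds:
    train = np.ones(n, dtype=bool)
    train[idx] = False
    for j, fit in enumerate(learners):
        pred = fit(x[train], y[train])
        cv_mse[j] += np.mean((y[idx] - pred(x[idx])) ** 2) / 5

best = int(np.argmin(cv_mse))  # discrete Super Learner: lowest CV risk
```

The full Super Learner replaces the final argmin with a constrained regression of the outcome on the cross-validated predictions, yielding convex combination weights across the library.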
Machine learning estimation approaches for DR estimators in the IV model have been studied recently. Chernozhukov et al. (2016) give sufficient conditions to guarantee that using data-adaptive fits for the nuisance functionals in a DR estimator results in valid inferences. Briefly, the score function of the DR estimator needs to be Neyman-orthogonal to the nuisance parameters, i.e. the pathwise (or Gateaux) derivative of the score function exists and vanishes at the true value of the nuisance parameters. Then, as long as the data-adaptive estimators for all nuisance functionals converge to their respective truths, and the product of their convergence rates is faster than n^{-1/2}, the DR estimator will be CAN. Convergence rates for these data-adaptive estimators depend on the smoothness and the number of covariates included (Györfi et al., 2006).
Under sufficient regularity conditions, and if γ is estimated using parametric models, the arguments used in Chernozhukov et al. (2016), Section 4.2, can be applied to show that the IV-g estimator is still CAN when using data-adaptive fits for m and π, as long as

‖m̂ − m0‖ × ‖π̂ − π0‖ = o_P(n^{-1/2}),   (16)

where ‖·‖ denotes the L2(P) norm defined by ‖f‖² = ∫ f² dP. We refer the interested reader to Chernozhukov et al. (2016) for the technical details.
The IV-g estimator implemented here estimates γ and m jointly using parametric models that include only main terms for both, but uses data-adaptive fits for the exposure model and the instrument propensity score. Thus the rates in condition (16) are not quite sufficient. We conjecture that, in addition, an analogous product-rate condition involving the exposure model fit is needed.
The use of data-adaptive fits for nuisance functionals has been extensively exploited within the TMLE literature, which we review next.
3.4 Targeted minimum loss estimation (TMLE)
Targeted minimum loss estimation (TMLE) is a general approach for causal inference, which has been applied to a wide range of causal problems (Gruber and van der Laan, 2010; van der Laan and Rose, 2011; Zheng and van der Laan, 2012; van der Laan and Gruber, 2012; Petersen et al., 2014).
TMLE is a semiparametric, doubly robust estimation approach that combines estimates of the nuisance functionals and the outcome mechanism in a way that guarantees a consistent estimator of the target parameter if either the nuisance parameters or the outcome mechanism are estimated consistently. If both are estimated consistently, TMLE is efficient (van der Laan and Rose, 2011). These initial estimates can be obtained by specifying parametric models or, most commonly, via machine learning, in particular the Super Learner (van der Laan et al., 2007). The use of data-adaptive estimates reduces bias, and improves the chances of achieving efficiency and accurate statistical inference.
TMLE estimators for the linear IV model have recently been proposed by Tóth and van der Laan (2016). We describe below in more detail the linear non-iterative TMLE, which we denote by IV-TMLE.
3.5 IV-TMLE
Let Ψ be the target parameter mapping, from the space of all possible models for the true distribution of the data to R^p, defined by the treatment effect modification working parametric model γ(V; ψ), as the solution to an OLS projection of the treatment effect curve onto that working model.
We note that Ψ only depends on the data distribution through E(Y | Z, X), E(A | Z, X) and the distribution of the covariates X; we denote this relevant part by Q. Under model M, E(Y | Z, X) depends on q0 and γ0 through equation (4).
The TMLE algorithm starts by obtaining initial estimates of E(Y | Z, X), E(A | Z, X), the distribution of X, and the instrument propensity score π. We denote these initial estimates by writing a hat over the object.
The next step in the construction of a TMLE requires the specification of a loss function L, such that the expectation of the loss function is minimised at the true probability distribution. For the linear-fluctuation IV-TMLE, we use the squared error loss function together with the efficient influence function (EIF) under the linear IV model and the working model γ(V; ψ); see Tóth and van der Laan (2016) for the details of its components.
The term associated with instrument strength can be written as

(17)
Now we proceed to the targeting step, using a linear fluctuation model that updates the initial fit for E(Y | Z, X). For this, the initial fits for q, γ0 and E(A | Z, X) are used in model (4) to define the initial estimate of E(Y | Z, X). The linear fluctuation model is now:
(18) 
where the so-called clever covariate is defined as
(19) 
With these choices for the loss function and the fluctuation model, the fluctuation parameter can be found by solving the EIF estimating equation.
This is equivalent to solving a system of linear equations:

(20)

where the clever covariate is obtained by plugging the initial fits of E(A | Z, X) and π into equation (19). The TMLE estimate of E(Y | Z, X) is obtained by substituting the solution to equation (20) into equation (18). Finally, we project the updated fit onto the working model γ(V; ψ) by OLS, obtaining ψ̂, the TMLE estimator of the causal parameter of interest.
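The final projection step can be illustrated in isolation. The sketch below takes a hypothetical targeted estimate of the treatment effect curve that depends on a covariate beyond V, and projects it onto a linear working model by OLS; since the working model is deliberately misspecified, the projection recovers its best linear approximation rather than the curve itself.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 3000
V = rng.normal(size=n)          # effect modifier in the working model
W = rng.normal(size=n)          # another baseline covariate

# Hypothetical targeted estimate of the treatment effect curve; it
# depends on W as well as V, so the linear working model is wrong.
tau_hat = 2.0 + 1.0 * V + 0.2 * W ** 2

# Project onto the working model psi[0] + psi[1] * V by OLS.
D = np.column_stack([np.ones(n), V])
psi = np.linalg.lstsq(D, tau_hat, rcond=None)[0]
# psi[1] recovers the V-slope (1.0); psi[0] absorbs the average of the
# omitted term, converging to 2.0 + 0.2 * E[W^2] = 2.2.
```

This is what makes the TMLE target well defined even when the working model is not the true model: the estimand is the projection, not the curve.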
Tóth and van der Laan (2016) showed that this approach yields an estimator which is consistent when (i) the initial estimators of and are consistent, (ii) the initial estimators of and are consistent, or (iii) the initial estimators of and are consistent.
3.6 Comparison of the two estimators
Recall that the IV-g estimator assumes a parametric model for γ0, e.g. equation (6), while the IV-TMLE obtains a first estimate of γ0 which can depend on all covariates in X, not only V, and then, after updating this through the TMLE procedure, projects the targeted estimator onto a user-supplied working parametric model, e.g. γ(V; ψ), which is not assumed to be correctly specified.
The IV-g estimator gains efficiency from assuming equation (6) when equation (6) holds. Suppose, however, that model (6) is misspecified (e.g. the treatment effect curve depends on further covariates, not just V, or the relationship is not linear). Interestingly, the IV-g estimator will continue to be a consistent estimator of the parameters of the working model whenever the mean exposure model is correctly specified and the implied weight is constant in X. This is because the IV-g estimator solves

(21)

This is equivalent to a weighted least-squares projection of the true treatment effect curve onto the working model, so that

(22)

This converges to the parameters of the projection of γ0 onto the working model if the weight is constant, and the result continues to hold in large samples when the exposure model and the instrument propensity score are replaced by their consistent estimators.
If, however, the weight is not constant in X, then the IV-g estimator under a misspecified γ(V; ψ) converges to a weighted average of the true estimand of interest.
3.6.1 Weak instrument instability for IV-TMLE
The other difference between these estimators is their reliance (or not) on the exposure model. In particular, the clever covariate used to define the IV-TMLE estimator depends on the inverse of the instrument-strength term. This term captures the strength of the instrument in predicting the exposure given X, and the variance of the IV-TMLE estimator becomes very large when it is very small. To stabilise the estimator, we take the maximum of the estimated value of this term and 0.025 when constructing the clever covariate for a given data set. Such corrections are not necessary for the IV-g estimator.
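The truncation rule is a one-liner; the sketch below applies the 0.025 floor to some illustrative estimates of the instrument-strength term.

```python
import numpy as np

# Illustrative estimated instrument-strength values across observations;
# the floor of 0.025 is the value used in the text.
delta_hat = np.array([0.40, 0.08, 0.004, 0.019])
delta_trunc = np.maximum(delta_hat, 0.025)
# The two near-zero values (0.004 and 0.019) are raised to 0.025, keeping
# the clever covariate, which divides by this term, bounded.
```
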
4 Simulation Study
We perform a small factorial simulation study to assess the finite-sample performance of the alternative methods, under different combinations of the outcome, treatment effect and exposure models being in turn correctly specified or not, while the instrument model is always correct. We use an indicator to flag scenarios where a given working model is misspecified.
We generate the RCT data with two-sided non-adherence, i.e. both randomly allocated groups have a non-zero probability of receiving the opposite treatment. We begin by generating five independent standard normal variables; these are the observed baseline covariates X, of which one is the effect modifier V. We also generate a standard normal unobserved confounder U. We generate randomised treatment Z, also independently of the other variables, and then simulate the binary treatment received A as a function of the baseline variables, the unobserved confounder, and the instrument:
Notice that we generate the treatment received in such a way that the condition necessary for the IV-g estimator to converge to the parameter of interest when γ is misspecified is no longer satisfied.
The continuous outcome is then simulated as follows:
We simulate RCTs with two different sample sizes, and generate 1,000 replicates for each scenario. We perform analyses with TSLS, IV-g and IV-TMLE. The latter two are implemented with and without the use of SL.
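A data-generating process of the kind described above can be sketched as follows; the coefficients are illustrative stand-ins, as the exact values used in the paper are not reproduced here.

```python
import numpy as np

def simulate_trial(n, rng):
    """One simulated RCT with two-sided non-adherence (hypothetical
    coefficients, not the paper's actual data-generating values)."""
    X = rng.normal(size=(n, 5))         # five baseline covariates
    V = X[:, 0]                         # the effect modifier
    U = rng.normal(size=n)              # unobserved confounder
    Z = rng.binomial(1, 0.5, size=n)    # randomised treatment
    lin = -0.5 + 1.5 * Z + 0.3 * V + 0.3 * X[:, 1] + 0.8 * U
    A = rng.binomial(1, 1 / (1 + np.exp(-lin)))   # treatment received
    Y = 1 + (2 + V) * A + 0.5 * X[:, 1] + U + rng.normal(size=n)
    return X, Z, A, Y

X, Z, A, Y = simulate_trial(10000, np.random.default_rng(6))
# Non-adherence is two-sided: each arm contains both treated and
# untreated individuals, because A depends on U and X as well as Z.
```
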
Parametric TSLS, IV-g and TMLE are fitted using main terms only. For TSLS, the first-stage models are the linear regressions of Section 3.1, while for IV-g and TMLE these are logistic models. The IV propensity score model is also logistic. The SEs of the parametric IV-g and TMLE are obtained by bootstrapping (percentile 95% CI using 1,999 bootstrap samples).
For the SL fits in IV-g, since the exposure and the instrument are binary, we use a library including glm (generalised linear models), step (stepwise model selection using AIC), svm (support vector machines) and gam (generalised additive models). For the glm, step and gam learners, linear and second-order terms are used.
For the IV-TMLE, we use the same library as above for the exposure model and the instrument propensity score, while for the remaining nuisance models we include only glm, step, svm and polymars (multivariate adaptive polynomial spline regression), in order to preserve the linear structure of model (4).
The methods described in Chernozhukov et al. (2016) require sample splitting in order to guarantee valid inference when using machine learning to fit the nuisance models. We do not perform such sample splitting for the IV-g or the TMLE estimators, other than that already used for the cross-validation embedded in the Super Learner. A limitation of this pragmatic approach is that the variance estimator based on the (E)IF ignores the fact that these factors are not independent of each other, as we have used the data both to obtain the fits and to evaluate the (E)IF. van der Laan and Luedtke (2015) suggest using cross-validation: for example, with 10-fold cross-validation, the variance can be estimated as a weighted average of the fold-specific sample variances of the estimated (E)IF of the parameter of interest, with weights given by the proportion of observations in each validation sample.
We compute the mean bias of the estimates, coverage of 95% confidence intervals (CI), and root mean square error (RMSE).
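The three performance measures can be computed from replicate-level results as follows; the replicate estimates here are synthetic stand-ins (true value assumed 2.0), not outputs of the actual simulation.

```python
import numpy as np

rng = np.random.default_rng(7)
true_psi = 2.0
n_reps = 1000

# Hypothetical replicate estimates and standard errors for one parameter.
est = true_psi + rng.normal(0, 0.1, size=n_reps)
se = np.full(n_reps, 0.1)

bias = np.mean(est - true_psi)                        # mean bias
rmse = np.sqrt(np.mean((est - true_psi) ** 2))        # root mean square error
lo, hi = est - 1.96 * se, est + 1.96 * se
coverage = np.mean((lo <= true_psi) & (true_psi <= hi))  # 95% CI coverage
```
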
4.1 Results from the simulation
Figure 2 shows the mean bias (top) and CI coverage rate (bottom) corresponding to scenarios with the smaller sample size. The corresponding results for the larger sample size appear in Figure 3.
As expected, when all models are correctly specified, all methods show close to zero bias for both target parameters, already at the smaller sample size, with coverage levels close to the nominal (between 92.5% and 97.5%) for TSLS and IV-g. While the bootstrap CI corresponding to parametric TMLE resulted in over-coverage (99%), the SL IV-TMLE shows some under-coverage, which improves at the larger sample size. This is a consequence of the variance estimator used here ignoring the fact that the data have been used both to obtain the fits and to evaluate the EIF (instead of using sample splitting). The RMSEs were indistinguishable across the methods. This pattern held both when the methods were implemented using parametric models and when using the SL.
TSLS performs well when the outcome model is correctly specified regardless of the other models, but performs very poorly when the outcome and treatment effect models are misspecified (bias 200 times the size of the effect and 0% coverage).
Under those misspecified scenarios where the double robustness properties were expected to provide protection, small levels of bias and good coverage generally held. Adding a misspecified treatment effect model doubled the RMSE across the methods, but did not affect bias or coverage, and this held true when a misspecified outcome model was used, as long as the exposure model was still correctly specified. However, when the exposure model was misspecified (the last two columns of the figures), the estimators started showing sensitivity. When the treatment effect model and the exposure model were both misspecified, all of the parametric estimators considered showed high levels of bias. When implementing the IV-g and IV-TMLE methods using the Super Learner, the bias and coverage, and consequently the RMSE, returned to the levels reported under correct specification. An exception is the coverage for the IV-g estimator of the effect modification parameter, which remained very low (<50%). This is mostly due to the substantial remaining bias, stemming from the fact that the estimator converges to a weighted average of the true treatment effect curve when the parametric γ is misspecified (equation (22)). When the outcome model is also misspecified, we observe essentially the same levels of bias, RMSE and coverage as under the previously described scenario. Across the scenarios, sample size had little effect on the reported bias, RMSE and coverage levels, except for the scenarios where the exposure and treatment effect models were misspecified. Here, the IV-g estimator with parametric implementation reported extremely high bias, RMSE and Monte Carlo errors. With the larger sample size, the bias was smaller, although still substantial, and the Monte Carlo errors reduced to close to zero.
We summarise the results of the simulation study by comparing estimated RMSEs in Figure 4. We conclude that both DR estimators performed according to their theoretical double robustness properties, and the TSLS method showed performance similar to the parametric implementation of the IV-g method. Both DR methods benefitted from the data-adaptive estimation of the nuisance parameters: the performance of the estimators was not harmed in the correctly specified scenarios, and the RMSE was greatly reduced in the scenarios where the DR properties did not provide protection against misspecification. When the assumed model for γ is correctly specified, IV-g outperforms all other methods, with the smallest RMSE. In contrast, when γ is misspecified, IV-TMLE performs best.
5 Motivating example: the COPERS trial
We now illustrate the methods in practice by applying each in turn to the motivating example. The COping with persistent Pain, Effectiveness Research in Self-management (COPERS) trial was a randomised controlled trial across 27 general practices and community services in the UK. It recruited 703 adults with musculoskeletal pain of at least 3 months' duration, and randomised 403 participants to the active intervention and a further 300 to the control arm. The mean age of participants was 59.9 years, with 81% white, 67% female, 23% employed, 85% with pain for at least 3 years, and 23% on strong opioids.
Intervention participants were offered 24 sessions introducing them to cognitive behavioural (CB) approaches designed to promote self-management of chronic back pain. The sessions were delivered over three days within the same week, with a follow-up session 2 weeks later. At the end of the 3-day course participants received a relaxation CD and a self-help booklet. Controls received usual care and the same relaxation CD and self-help booklet.
The primary outcome was pain-related disability at 12 months, using the Chronic Pain Grade (CPG) disability subscale. This is a continuous measure on a scale from 0 to 100, with higher scores indicating worse pain-related disability.
In the active arm, only 179 (45%) attended all 24 sessions, and 322 (86.1%) received at least one session. The control arm participants had no access to the active intervention sessions. Participants and group facilitators were not masked to the study arm to which they belonged.
The intention-to-treat analysis found no evidence that the COPERS intervention improved pain-related disability at 12 months in people with long-established, chronic musculoskeletal pain ( , 95% CI to ).
Poor attendance at the sessions was anticipated, and so obtaining causal treatment effect estimates was a pre-defined objective of the study. The original report included a causal treatment effect analysis using TSLS, with a binary indicator for treatment received (attending at least half of the sessions), and assuming that randomisation was a valid instrument for treatment received (Taylor et al., 2016). The IV model adjusted for the following baseline covariates: site of recruitment, age, gender, and HADS and CPG scores at baseline. This IV analysis found no evidence of a treatment effect on CPG at 12 months amongst the compliers ( , 95% CI to ).
The COPERS study also performed a number of subgroup analyses to investigate treatment effect heterogeneity, but did not carry out an IV analysis with effect modification. However, heterogeneity in the causal treatment effect remains of interest.
For our re-analyses, the data set consists of the 652 participants followed up for 12 months (93% of those recruited): 374 allocated to the active treatment, and 278 to the control. Thirty-five individuals (5%) have missing primary outcome data, and a further 4 (<1%) have a missing baseline depression score, leaving a sample size of 613.
We focus on the causal effect of receiving at least one treatment session as a function of depression at baseline measured using the Hospital Anxiety and Depression Scale (HADS).
We argue that random allocation is a valid IV: the assumptions concerning unconfoundedness and instrument relevance are justified by design. The exclusion restriction assumption seems plausible with our definition of treatment received, as only those participants receiving at least one training session would know how to use the CB coping mechanisms and thus potentially improve their disability. It is unlikely that random allocation has a direct effect, though since participants were not blinded to their allocation, we cannot completely rule out some psychological effect on pain and disability of knowing one belongs to the control or active group.
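As a minimal numerical sketch of the basic TSLS approach with a single instrument and baseline covariates (illustrative only, not the trial's analysis code; it returns the point estimate only, since the naive second-stage OLS standard errors are invalid and require the usual TSLS correction):

```python
import numpy as np

def tsls(y, d, z, X):
    """Two-stage least squares with a single instrument z for exposure d.

    y, d, z : 1-d arrays (outcome, treatment received, randomised arm).
    X       : 2-d array of baseline covariates, included in both stages.
    Returns the coefficient on the predicted exposure (the causal effect
    estimate under the IV assumptions).
    """
    n = len(y)
    ones = np.ones(n)
    # Stage 1: project treatment received on the instrument and covariates.
    W1 = np.column_stack([ones, z, X])
    d_hat = W1 @ np.linalg.lstsq(W1, d, rcond=None)[0]
    # Stage 2: regress the outcome on the predicted treatment and covariates.
    W2 = np.column_stack([ones, d_hat, X])
    beta = np.linalg.lstsq(W2, y, rcond=None)[0]
    return beta[1]
```

Because the same exogenous covariates appear in both stages, this two-step construction is numerically identical to the usual one-shot IV estimator.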
We apply each of the methods in turn (TSLS, IV-g and IV-TMLE) to estimate the ATT conditional on baseline depression. As Table 1 summarises, the use of DR methods, even with Super Learner, does not result in a material change in the point estimates or SEs compared with standard TSLS. All five estimators lead to the same inference: there is no evidence of an average treatment effect in the treated, and no evidence of effect modification by baseline depression. The lack of statistical significance may be due to the small number of participants in the trial. While not statistically significant, the direction of the treatment effect modification is interesting: the point estimates indicate the treatment may benefit more those with higher depression symptoms at baseline (the ATT resulting in a reduction in the disability score).
Table 1: Point estimates and standard errors for the ATT and for its modification by baseline HADS score, by estimation method (SL: Super Learner fits for the nuisance models).

Method       ATT estimate  SE     Effect modification  SE
TSLS         2.94          4.67   0.58                 0.57
IV-g         2.78          4.66   0.53                 0.54
IV-g SL      2.10          4.75   0.45                 0.54
IV-TMLE      3.16          4.74   0.64                 0.56
IV-TMLE SL   2.22          4.88   0.51                 0.58
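As an illustrative sketch (not the code used in the analysis), a linear treatment effect model with one effect modifier, such as the one summarised in Table 1, can be fitted by TSLS by treating both the randomised arm and its interaction with the modifier as instruments; all variable names here are hypothetical:

```python
import numpy as np

def tsls_effect_modification(y, d, z, v):
    """TSLS for a linear treatment effect model b0*d + b1*d*v.

    z and z*v instrument the endogenous regressors d and d*v; the modifier v
    (e.g. baseline HADS score) also enters the outcome model as a main effect.
    Returns (b0, b1), where b1 is the effect-modification parameter.  The
    naive second-stage OLS standard errors are not valid.
    """
    n = len(y)
    ones = np.ones(n)
    # Stage 1: project each endogenous regressor on the instrument set.
    W1 = np.column_stack([ones, z, z * v, v])
    d_hat = W1 @ np.linalg.lstsq(W1, d, rcond=None)[0]
    dv_hat = W1 @ np.linalg.lstsq(W1, d * v, rcond=None)[0]
    # Stage 2: outcome on the predicted endogenous terms plus the modifier.
    W2 = np.column_stack([ones, d_hat, dv_hat, v])
    beta = np.linalg.lstsq(W2, y, rcond=None)[0]
    return beta[1], beta[2]
```

With four instruments (constant, z, z*v, v) and four regressors, the model is just-identified, mirroring the standard TSLS setting with a single endogenous regressor.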
6 Discussion
This paper compared the performance of two doubly robust estimators for the ATT conditional on baseline covariates, in the presence of unmeasured confounding but where a valid (conditional) IV is available. These estimators were implemented with and without data-adaptive estimates of the nuisance parameters. We have demonstrated empirically through simulations that the IV-g estimator has good finite-sample performance when using data-adaptive fits for the nuisance parameters, provided the parametric model assumed for the treatment effect curve is correctly specified. IV-TMLE offers greater flexibility, not relying on a correctly specified parametric model, and performs very well under most scenarios. The price for this greater flexibility is lower efficiency, compared with that of IV-g, when the (parametric) model for the treatment effect curve is correctly specified. As the simulations show, the use of data-adaptive fits for the nuisance models greatly reduces bias and improves coverage, resulting in much smaller RMSEs than with parametric nuisance models, and thus data-adaptive fits should be used.
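To make concrete what data-adaptive estimation of a nuisance parameter involves, the following is a minimal "discrete" Super Learner, which selects the candidate learner with the lowest cross-validated risk (the full Super Learner instead takes an optimal convex combination of the candidates). The interface is a hypothetical sketch, not the SuperLearner package API:

```python
import numpy as np

def discrete_super_learner(X, y, candidates, n_folds=5, seed=0):
    """Select the candidate learner with the lowest cross-validated MSE.

    candidates : list of (fit, predict) pairs, where fit(X, y) returns a
    fitted model object and predict(model, X) returns predictions.
    Returns the winning model refitted on all data, its predict function,
    and the vector of cross-validated risks.
    """
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(y))  # random fold labels
    risks = []
    for fit, predict in candidates:
        sq_err = 0.0
        for k in range(n_folds):
            train, test = folds != k, folds == k
            model = fit(X[train], y[train])
            sq_err += np.sum((y[test] - predict(model, X[test])) ** 2)
        risks.append(sq_err / len(y))
    best_fit, best_predict = candidates[int(np.argmin(risks))]
    return best_fit(X, y), best_predict, risks
```

For example, with a constant-mean learner and an OLS learner as candidates, the cross-validated risks determine which fit is plugged in for the nuisance model; richer libraries would add penalised regressions, splines and tree ensembles.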
The methods were motivated and tested in the context of estimating the ATT with effect modification in RCTs with non-adherence to randomised treatment, with a binary exposure and a continuous outcome. However, the methods presented here are applicable to other settings. One such setting is where the IV assumptions are believed to be satisfied only after conditioning on baseline covariates, making the methods applicable to certain observational studies. Extensions to situations with a continuous exposure are also straightforward if one is prepared to assume linearity of the treatment effect curve (Tóth and van der Laan, 2016; Vansteelandt and Didelez, 2018).
We have focused on the conditional ATT as the estimand of interest, but Ogburn et al. (2015) have shown that the same functional of the observed data can be used to identify, under monotonicity, the local average treatment effect conditional on baseline covariates, LATE(v). In fact, much of the previous literature on the estimation of instrumental variable models with covariates has assumed monotonicity. In particular, previous methods include full parametric specifications, suitable when both the IV and the exposure are binary, proposed by Little and Yau (1998) and Hirano et al. (2000), and a semiparametric model proposed by Abadie (2003). For the case with no effect modifiers, Frölich (2007) characterised two distinct nonparametric estimation methods, while Tan (2006) proposed a DR estimator which is consistent when the instrument propensity score and either the outcome or the exposure parametric models are correctly specified.
For the conditional ATT, Robins (1994) and Tan (2010) proposed DR estimators in the cases where the effect modifiers comprise all baseline covariates, and a strict subset of them, respectively. The DR estimators presented by Okui et al. (2012) and Vansteelandt and Didelez (2018) build on the work of Tan (2010). For the special case with no effect modifiers, Vansteelandt and Didelez (2018) proposed other DR estimators which are locally efficient, and also constructed a bias-reduced DR IV estimator. Several authors have proposed data-adaptive estimators for the linear IV model with no effect modification, beginning with a TSLS where the first stage is fitted using the LASSO with a data-adaptive penalty (Belloni et al., 2012). The bias-reduced DR IV estimator has also been implemented, in the setting with no effect modification, using data-adaptive fits for the conditional mean outcome in the unexposed (Vermeulen and Vansteelandt, 2016). Chernozhukov et al. (2016) proposed two other DR data-adaptive IV estimators and gave conditions under which data-adaptive fits can be used for the nuisance models. Comparing these DR estimators to those presented here would be a promising avenue for future research.
The present study has some limitations. Formal proofs of the regularity conditions and rates of convergence needed to obtain uniformly valid inferences when using data-adaptive techniques for the IV g-estimator presented here are still lacking, and should be the subject of further study. Similarly, the required convergence conditions for the IV-TMLE estimator have not been formally established. Another limitation is that we did not seek to quantify the rates of convergence attained by the algorithms included in the SL library. This is because, in general, the rates of convergence of the individual machine learning algorithms depend on the number of included variables and on other tuning parameters, making the assessment of rates of convergence complex. A promising potential solution would be to include the highly adaptive lasso (HAL) (Benkeser and Laan, 2016) in the SL library, as HAL has been proven, under sufficient regularity conditions, to converge at a rate faster than n^{-1/4}.
A number of extensions to the work presented here are of interest. The IV-g method implemented here jointly estimates the treatment effect parameters and the nuisance models, and thus uses parametric models for both. This is not necessary: an alternative strategy is to estimate the nuisance model beforehand and plug the fitted values into the estimating equation (11), thus allowing the use of data-adaptive fits for this nuisance model. Future work could extend the bias-reduced DR estimator to the linear IV model with effect modification, and compare it with IV-TMLE and a fully data-adaptive version of the IV-g estimator.
Acknowledgements
We thank Prof. Stephanie Taylor and the COPERS trial team for access to the data. We also thank Prof. Stijn Vansteelandt for commenting on an earlier draft of this paper, and Boriska Tóth for sharing her code implementing IVTMLE.
Karla Diaz-Ordaz is supported by a UK Medical Research Council Career Development Award in Biostatistics, MR/L011964/1.
References
 Abadie (2003) Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics, 113(2):231–263.
 Angrist et al. (1996) Angrist, J. D., Imbens, G. W., and Rubin, D. B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91(434):444–455.
 Belloni et al. (2012) Belloni, A., Chen, D., Chernozhukov, V., and Hansen, C. (2012). Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica, 80(6):2369–2429.
 Benkeser et al. (2017) Benkeser, D., Carone, M., van der Laan, M., and Gilbert, P. (2017). Doubly robust nonparametric inference on the average treatment effect. Biometrika, 104(4):863–880.

 Benkeser and Laan (2016) Benkeser, D. and van der Laan, M. (2016). The highly adaptive lasso estimator. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 689–696.
 Chernozhukov et al. (2016) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., et al. (2016). Double machine learning for treatment and causal parameters. arXiv preprint arXiv:1608.00060.
 Dodd et al. (2012) Dodd, S., White, I., and Williamson, P. (2012). Nonadherence to treatment protocol in published randomised controlled trials: a review. Trials, 13(1):84.
 Dunn and Bentall (2007) Dunn, G. and Bentall, R. (2007). Modelling treatment-effect heterogeneity in randomized controlled trials of complex interventions (psychological treatments). Statistics in Medicine, 26(26):4719–4745.
 Farrell (2015) Farrell, M. H. (2015). Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23.
 Frölich (2007) Frölich, M. (2007). Nonparametric IV estimation of local average treatment effects with covariates. Journal of Econometrics, 139(1):35–75. Endogeneity, instruments and identification.
 Gruber and van der Laan (2010) Gruber, S. and van der Laan, M. J. (2010). A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. The International Journal of Biostatistics, 6(1).
 Györfi et al. (2006) Györfi, L., Kohler, M., Krzyzak, A., and Walk, H. (2006). A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media.
 Hernán and Robins (2006) Hernán, M. A. and Robins, J. M. (2006). Instruments for causal inference: an epidemiologist’s dream? Epidemiology, 17(4):360–372.
 Hirano et al. (2000) Hirano, K., Imbens, G. W., Rubin, D. B., and Zhou, X.H. (2000). Assessing the effect of an influenza vaccine in an encouragement design. Biostatistics, 1(1):69–88.
 Kang et al. (2007) Kang, J. D., Schafer, J. L., et al. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539.
 Little and Yau (1998) Little, R. J. and Yau, L. H. (1998). Statistical techniques for analyzing data from prevention trials: Treatment of no-shows using Rubin's causal model. Psychological Methods, 3(2):147.
 Newey (1990) Newey, W. K. (1990). Semiparametric efficiency bounds. Journal of Applied Econometrics, 5(2):99–135.
 Ogburn et al. (2015) Ogburn, E. L., Rotnitzky, A., and Robins, J. M. (2015). Doubly robust estimation of the local average treatment effect curve. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(2):373–396.
 Okui et al. (2012) Okui, R., Small, D. S., Tan, Z., and Robins, J. M. (2012). Doubly robust instrumental variable regression. Statistica Sinica, 22(1):173–205.
 Petersen et al. (2014) Petersen, M., Schwab, J., Gruber, S., Blaser, N., Schomaker, M., and van der Laan, M. (2014). Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. Journal of Causal Inference, 2(2):147–185.
 Pirracchio et al. (2015) Pirracchio, R., Petersen, M. L., and van der Laan, M. (2015). Improving propensity score estimators' robustness to model misspecification using super learner. American Journal of Epidemiology, 181(2):108–119.
 Porter et al. (2011) Porter, K. E., Gruber, S., van der Laan, M. J., and Sekhon, J. S. (2011). The relative performance of targeted maximum likelihood estimators. The International Journal of Biostatistics, 7(1):1–34.
 Robins (1994) Robins, J. M. (1994). Correcting for noncompliance in randomized trials using structural nested mean models. Communications in Statistics – Theory and Methods, 23(8):2379–2412.

 Robins (2000) Robins, J. M. (2000). Robust estimation in sequentially ignorable missing data and causal inference models. In Proceedings of the American Statistical Association, Section on Bayesian Statistical Science, pages 6–10.
 Rubin (1978) Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, pages 34–58.
 Tan (2006) Tan, Z. (2006). Regression and weighting methods for causal inference using instrumental variables. Journal of the American Statistical Association, 101(476):1607–1618.
 Tan (2010) Tan, Z. (2010). Marginal and nested structural models using instrumental variables. Journal of the American Statistical Association, 105(489):157–169.
 Taylor et al. (2016) Taylor, S. J., Carnes, D., Homer, K., et al. (2016). Improving the self-management of chronic pain: COping with persistent Pain, Effectiveness Research in Self-management (COPERS). Programme Grants for Applied Research, 4(14).
 Tóth and van der Laan (2016) Tóth, B. and van der Laan, M. J. (2016). TMLE for marginal structural models based on an instrument. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 350.
 van der Laan and Rose (2011) van der Laan, M. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Series in Statistics.
 van der Laan and Gruber (2012) van der Laan, M. J. and Gruber, S. (2012). Targeted minimum loss based estimation of causal effects of multiple time point interventions. The International Journal of Biostatistics, 8(1).
 van der Laan and Luedtke (2015) van der Laan, M. J. and Luedtke, A. R. (2015). Targeted learning of the mean outcome under an optimal dynamic treatment rule. Journal of Causal Inference, 3(1):61–95.
 van der Laan et al. (2007) van der Laan, M. J., Polley, E. C., and Hubbard, A. E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):1–21.
 VanderWeele (2009) VanderWeele, T. J. (2009). Concerning the consistency assumption in causal inference. Epidemiology, 20(6):880–883.
 Vansteelandt and Didelez (2018) Vansteelandt, S. and Didelez, V. (2018). Improving the robustness and efficiency of covariate adjusted linear instrumental variable estimators. To appear in Scandinavian Journal of Statistics.
 Vermeulen and Vansteelandt (2016) Vermeulen, K. and Vansteelandt, S. (2016). Data-adaptive bias-reduced doubly robust estimation. The International Journal of Biostatistics, 12(1):253–282.
 Wiles et al. (2014) Wiles, N. J., Fischer, K., Cowen, P., Nutt, D., Peters, T. J., Lewis, G., and White, I. R. (2014). Allowing for nonadherence to treatment in a randomized controlled trial of two antidepressants (citalopram versus reboxetine): an example from the genpod trial. Psychological Medicine, 44(13):2855–2866.
 Wooldridge (2010) Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.
 Zhang et al. (2014) Zhang, Z., Peluso, M. J., Gross, C. P., Viscoli, C. M., and Kernan, W. N. (2014). Adherence reporting in randomized controlled trials. Clinical Trials, 11(2):195–204.
 Zheng and van der Laan (2012) Zheng, W. and van der Laan, M. J. (2012). Targeted maximum likelihood estimation of natural direct effects. The International Journal of Biostatistics, 8(1):1–40.