1 Introduction
Across many fields, researchers are interested in the average effect of a treatment on an outcome. This “treatment” might correspond to a drug, a harmful exposure, or a policy intervention. Often, the treatment may not be randomized due to ethical or logistical reasons, which necessitates statistical methodology to address differences between those observed to take the treatment and those observed not to take the treatment (Rosenbaum and Rubin, 1983). These differences are often accounted for through regression adjustment, either through estimation of the mean outcome given treatment and confounders (the outcome regression
), the probability of treatment given confounders (the
propensity score), or both. Many popular techniques for generating efficient estimates of the average treatment effect (ATE) utilize these regression estimates as an intermediate step. For example, the inverse probability of treatment weighted (IPTW) estimator weights the observed outcomes according to the inverse of each observation's estimated probability of receiving treatment. More involved procedures use both the outcome regression and the propensity score to generate an asymptotically efficient estimate of the average treatment effect. These efficient procedures include augmented inverse probability of treatment weighted (AIPTW) estimators and targeted minimum loss-based estimators (TMLEs).

Commonly, the requisite regressions for estimation of the ATE are modeled using parametric techniques such as linear and logistic regression. However, misspecification of regression models can lead to extreme bias in estimates of the ATE
(Kang and Schafer, 2007), which has led to growing interest in the use of nonparametric methods (Robins et al., 2006; Hubbard et al., 2000; Farrell, 2015; Kennedy, 2018) and adaptive regression techniques to estimate the ATE (van der Laan and Rubin, 2006; Lee et al., 2010; Karim and Platt, 2017; Wyss et al., 2018; Karim et al., 2018). In particular, the field of targeted learning has emerged as a paradigm for deriving formal statistical inference about estimated treatment effects when machine learning techniques are used to fit regressions (van der Laan and Rose, 2011, 2018). A particularly interesting machine learning algorithm in this context is the highly adaptive lasso (HAL; Benkeser and van der Laan (2016); van der Laan (2017)). Under weak conditions, HAL achieves a fast convergence rate irrespective of the dimension of putative confounding variables. In fact, the convergence rate is sufficient to guarantee asymptotic efficiency of AIPTW estimators and TMLEs under weak conditions (van der Laan, 2017).

A key problem for any ATE estimation strategy is selecting confounders from a potentially larger set of measured variates. Recent studies have shown that including instrumental variables
– variates that affect the propensity score, but not the outcome regression – in the propensity score leads to inflation of the variance of the estimator of the average treatment effect relative to estimators that exclude such variables
(Schisterman et al., 2009; Rotnitzky et al., 2010; van der Laan et al., 2010; Schneeweiss et al., 2009; Ju et al., 2017). Shortreed and Ertefaie (2017) proposed a data-driven approach to variable selection in propensity score estimation. The proposal involves estimating the outcome regression using a generalized linear model and using the coefficients from this fit as evidence of how strongly each covariate is related to the outcome. The adaptive lasso (Zou, 2006) is used to fit the propensity score, where the penalty associated with each variable is proportional to the absolute value of the inverse of the variable's outcome regression coefficient. Thus, variables that are strongly related to the outcome (i.e., have a large coefficient in the outcome regression model) are penalized less, and are more likely to be included in the propensity score. The resultant propensity score estimate is referred to as the outcome-adaptive lasso. The authors use this estimated propensity score to construct an IPTW estimator of the ATE.

In this work, we consider the implications of using the proposal of Shortreed and Ertefaie (2017)
, but with HAL in place of parametric outcome regression and propensity score models. This seemingly straightforward extension turns out to have interesting implications for estimation and inference. In particular, the HAL representation of a function as a linear combination of tensor products of infinitesimal indicator basis functions leads to fine-grained screening for instruments. Specifically, the method screens for instrumental basis functions, the set of infinitesimal indicator basis functions that may be excluded from the outcome regression while still capturing its true form. The second interesting implication is that, by construction, the resulting outcome highly adaptive lasso (OHAL) propensity score estimate may be inconsistent for the true propensity score when instrumental basis functions are present. Recent works have demonstrated that inference derived from procedures that naïvely use inconsistent nuisance estimators is often severely biased, and that further steps are needed to provide robust inference (van der Laan, 2014; Benkeser et al., 2017). Building on these results, we propose a consistent estimator of the average treatment effect that has an asymptotic Normal sampling distribution and whose variance can be estimated in closed form, irrespective of whether true instruments are present in the data. We demonstrate the potential benefits of our approach via simulation.

The remainder of the article is organized as follows. Section 2 provides background for this work, including identification of the ATE, estimation using TMLE, outcome-adaptive estimation of the propensity score, and estimation using HAL. Section 3 proposes our estimator, describes its implementation, and discusses its weak convergence. Section 4 provides an empirical study of the proposed estimator. We conclude with a discussion of possible future directions.
2 Background
2.1 Identification of average treatment effect
Suppose we observe n independent copies of the data unit O = (W, A, Y), where W is a d-dimensional vector of baseline covariates, A ∈ {0, 1} is a binary treatment assignment, and Y is an outcome of interest. Without loss of generality, we assume that Y ∈ [0, 1]. We assume a nonparametric model for the distribution P_0 of O. Our interest is in evaluating the difference in average outcome if the entire population were assigned to receive A = 1 versus A = 0. Specifically, we follow Pearl (2009) and define a nonparametric structural equation model (NPSEM), W = f_W(U_W), A = f_A(W, U_A), Y = f_Y(W, A, U_Y), where f_W, f_A, f_Y are deterministic functions, and U_W, U_A, U_Y are exogenous variables. The model assumes data are generated sequentially: the baseline covariates W are generated based on f_W and U_W; then, the treatment A is generated based on f_A, W, and U_A; finally, the outcome Y is generated based on f_Y, W, A, and U_Y. We consider intervening in this system to deterministically set A = a, rather than allowing f_A to determine its value. This intervention generates so-called counterfactual random variables. For a ∈ {0, 1}, we denote by Y(a) the counterfactual outcome under the intervention A = a. Our parameter of interest is ψ_0 = E[Y(1)] − E[Y(0)], which we refer to as the average treatment effect (ATE).

The ATE is identifiable under the following assumptions: consistency, Y = Y(A); no interference, the counterfactual Y_i(a) of participant i does not depend on the treatment A_j of any other participant j; ignorability, Y(a) ⊥ A | W for a ∈ {0, 1}; positivity, 0 < P(A = 1 | W = w) < 1 for all w in the support of W. The first two assumptions are needed so that the counterfactual random variables are well defined. The ignorability condition states that there are no unmeasured confounders of A and Y, while the positivity criterion states that every participant has a nonzero probability of receiving A = 1 and A = 0. If these assumptions hold, the average treatment effect is identified based on the observed data according to the G-computation identification

(1)  ψ_0 = E[ E(Y | A = 1, W) − E(Y | A = 0, W) ].
2.2 Locally efficient estimation of the average treatment effect using TMLE
For simplicity, in the remainder, we explicitly consider estimation of ψ_0(1) = E[Y(1)]. A symmetric argument can be made for estimation of ψ_0(0) = E[Y(0)], and thus the ATE. Several nonparametric approaches have been proposed for estimating the ATE that use flexible estimates of the outcome regression and propensity score. Targeted minimum loss-based estimation (TMLE) is one such example. For each possible (a, w), we denote by Q̄_n(a, w) an estimate of Q̄_0(a, w) = E(Y | A = a, W = w), the true outcome regression at (a, w). Similarly, we denote by g_n(w) an estimate of the propensity score g_0(w) = P(A = 1 | W = w). Using these estimates, a TMLE is computed by defining a particular parametric working model. Because Y ∈ [0, 1], we may use a logistic working model for the outcome regression, logit Q̄_{n,ε}(1, w) = logit Q̄_n(1, w) + ε H_n(w), where H_n(w) = 1 / g_n(w). Note that this defines a logistic regression model with offset given by the logit of the initial estimator and single covariate H_n(W). We also note that the use of this particular submodel is not restricted to binary outcomes; in particular, any real-valued outcome can be scaled to the unit interval by subtracting the minimum value of Y and dividing by the range of Y, so that this same model can be used (Gruber and van der Laan, 2010). A maximum likelihood estimator ε_n of ε is found using data from observations with A = 1 (e.g., using iteratively reweighted least-squares) and the TMLE is ψ_n(1) = (1/n) Σ_{i=1}^n Q̄_{n,ε_n}(1, W_i).

Study of the stochastic properties of TMLEs is facilitated through a linearization of the G-computation parameter, which involves the efficient influence function, a key object in efficiency theory (Bickel et al., 1998). Given an outcome regression Q̄, a propensity score g, and a marginal cumulative distribution function Q_W of W, we define the nonparametric efficient influence function

(2)  D(Q̄, g, Q_W)(o) = (a / g(w)) { y − Q̄(1, w) } + Q̄(1, w) − ∫ Q̄(1, u) dQ_W(u).
We remark that the efficient influence function is a key ingredient in many causal effect estimators. Indeed, the choice of submodel used in the TMLE procedure described above is directly related to the form of the efficient influence function. The efficient influence function is also important for estimating equationsbased estimators and onestep estimators, which are asymptotically equivalent with TMLE (for more discussion, see van der Laan et al. (2018)).
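To make the targeting step concrete, the fluctuation described above can be sketched in a few lines of NumPy. This is a minimal illustration for the E[Y(1)] parameter only, not the authors' implementation; the function and helper names are ours, and the univariate logistic regression with offset is solved by Newton's method rather than by iteratively reweighted least-squares software.

```python
import numpy as np

def _logit(p):
    return np.log(p / (1 - p))

def _expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def tmle_update(y, a, qbar1, g, clip=1e-6, n_iter=100, tol=1e-10):
    """One TMLE fluctuation for E[Y(1)]: univariate logistic regression of Y
    on the 'clever covariate' H = 1/g_n with offset logit(Qbar_n(1, W)),
    fit among observations with A == 1 via Newton's method."""
    qbar1 = np.clip(qbar1, clip, 1 - clip)
    h = 1.0 / np.clip(g, clip, 1 - clip)      # clever covariate
    offset = _logit(qbar1)
    yt, ht, ot = y[a == 1], h[a == 1], offset[a == 1]
    eps = 0.0
    for _ in range(n_iter):
        p = _expit(ot + eps * ht)
        score = np.sum(ht * (yt - p))         # d loglik / d eps
        info = np.sum(ht ** 2 * p * (1 - p))  # Fisher information
        step = score / (info + 1e-12)
        eps += step
        if abs(step) < tol:
            break
    qbar1_star = _expit(offset + eps * h)     # targeted outcome regression
    return qbar1_star.mean(), qbar1_star
```

The returned mean of the targeted outcome regression is the TMLE plug-in estimate of E[Y(1)].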
We denote by Q_{W,n} the empirical cumulative distribution function of W. If the outcome regression and propensity score estimators are such that D(Q̄_n, g_n, Q_{W,n}) falls in a Donsker class with probability tending to one, and E[{D(Q̄_n, g_n, Q_{W,n})(O) − D(Q̄_0, g_0, Q_{W,0})(O)}^2] converges in probability to zero, then

(3)  ψ_n(1) − ψ_0(1) = (1/n) Σ_{i=1}^n D(Q̄_0, g_0, Q_{W,0})(O_i) + R_n + o_P(n^{−1/2}),

where the second-order remainder is

(4)  R_n = ∫ {(g_n(w) − g_0(w)) / g_n(w)} {Q̄_n(1, w) − Q̄_0(1, w)} dQ_{W,0}(w).

This term can be shown to be asymptotically negligible, in the sense that R_n = o_P(n^{−1/2}), provided

(5)  ‖Q̄_n(1, ·) − Q̄_0(1, ·)‖ ‖g_n − g_0‖ = o_P(n^{−1/2}).

If indeed R_n = o_P(n^{−1/2}), then ψ_n(1) − ψ_0(1) = (1/n) Σ_{i=1}^n D(Q̄_0, g_0, Q_{W,0})(O_i) + o_P(n^{−1/2}). Thus, the central limit theorem implies that the centered and scaled estimator, n^{1/2}{ψ_n(1) − ψ_0(1)}, converges in distribution to a mean-zero Gaussian random variable with variance σ_0^2 = E[D(Q̄_0, g_0, Q_{W,0})(O)^2]. The asymptotic variance may be consistently estimated by

(6)  σ_n^2 = (1/n) Σ_{i=1}^n D(Q̄_n, g_n, Q_{W,n})(O_i)^2.

The Wald-style confidence interval

(7)  ψ_n(1) ± z_{1−α/2} σ_n / n^{1/2}

will have asymptotic coverage probability no smaller than 1 − α.
2.3 Outcome-adaptive estimation of propensity scores
It is common in many applications that relevant baseline covariates must be selected from a larger set of variables. Therefore, an important consideration in any analysis is deciding which variables should comprise W. The ignorability assumption suggests that, at a minimum, W should include all variables that are related to both the treatment A and the outcome Y. However, there is also potential benefit in including variables related only to the outcome (Zhang et al., 2008; Moore and van der Laan, 2009), and there is risk in including variables related only to the treatment (Schisterman et al., 2009). Moreover, in finite samples it is often beneficial to remove true confounding variables that cause extreme values in the propensity score (Petersen et al., 2012). This has motivated many proposals for automated variable selection for estimation of the ATE (Rotnitzky et al., 2010; Schneeweiss et al., 2009; van der Laan et al., 2010; Ju et al., 2017, 2018).
An interesting approach to automated variable selection was discussed by Shortreed and Ertefaie (2017). Their proposal can be summarized as follows. First, we define a main-terms parametric regression model for the outcome regression, e.g., logit E(Y | A, W) = β_0 + β_A A + Σ_{j=1}^d β_j W_j, and estimate the coefficients via maximum likelihood. We denote by β_{n,j} the estimated coefficient for variable W_j in the outcome regression. Next, we define a logistic regression model for the propensity score, e.g., logit P(A = 1 | W) = α_0 + Σ_{j=1}^d α_j W_j, and estimate the coefficients via the adaptive lasso (Zou, 2006) with penalty weight for α_j proportional to |β_{n,j}|^{−γ}, where γ is a user-supplied tuning parameter. The authors discuss using the resultant propensity score estimate to construct a stabilized IPTW estimate of the average treatment effect, though in principle, the outcome regression and IPTW estimates could be used in a locally efficient procedure like TMLE.
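With standard lasso software, per-variable penalty weights can be imposed via the usual rescaling trick: multiplying column j of the design matrix by the absolute outcome-regression coefficient raised to the power γ before an ordinary L1 fit is equivalent to penalizing that column's propensity coefficient in inverse proportion to the same quantity. A hedged sketch using scikit-learn follows; the function name is ours, the outcome is assumed binary, and a near-unpenalized ridge fit stands in for plain maximum likelihood.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def outcome_adaptive_lasso_ps(w, a, y, gamma=1.0, C=1.0):
    """Outcome-adaptive lasso propensity score (illustrative sketch):
    1) near-unpenalized logistic outcome regression of Y on (A, W);
    2) L1 logistic regression of A on W, with column j of W rescaled by
       |beta_j|**gamma, which penalizes alpha_j with weight |beta_j|**(-gamma)."""
    # Step 1: main-terms outcome regression; coefficients on W gauge
    # each covariate's outcome-relevance.
    or_fit = LogisticRegression(C=1e6, max_iter=1000).fit(
        np.column_stack([a, w]), y)
    beta = or_fit.coef_[0, 1:]                      # coefficients on W only
    # Step 2: adaptive lasso via rescaling; a near-zero |beta_j| (e.g. an
    # instrument) implies a near-infinite penalty on that covariate.
    scale = np.abs(beta) ** gamma + 1e-12
    ps_fit = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(
        w * scale, a)
    g_hat = ps_fit.predict_proba(w * scale)[:, 1]   # estimated propensity
    alpha = ps_fit.coef_[0] * scale                 # coefs on original scale
    return g_hat, alpha
```

The rescaling is exact: if the scaled fit returns coefficient α̃_j, the coefficient on the original scale is α_j = α̃_j |β_{n,j}|^γ, and the L1 penalty on α̃_j equals the weighted penalty |β_{n,j}|^{−γ} |α_j|.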
2.4 Highly adaptive lasso
The highly adaptive lasso (HAL) estimator is a semiparametric regression estimator that is consistent for the true regression function assuming only that the true regression function has finite variation norm (Benkeser and van der Laan, 2016; van der Laan, 2017). Moreover, a TMLE constructed based on HAL estimates of the outcome regression and propensity score will generally satisfy the requisite regularity conditions discussed in section 2.2. HAL is therefore an appealing choice for use when estimating the average treatment effect in settings where flexible regression models are needed.
A HAL regression of an outcome on covariates W is achieved by mapping W into a set of indicator basis functions. If W is univariate, we generate a basis expansion for a typical observation w as φ_j(w) = I(w ≥ W_j), for j = 1, …, n, where W_1, …, W_n are the observed covariate values. An L1-penalized regression of the outcome is fit using these basis functions, with the optimal L1 norm chosen via cross-validation. In a single dimension, HAL is equivalent with zero-order trend filtering (Kim et al., 2009). In higher dimensions, tensor-product basis functions are included in the basis expansion; e.g., if W is two-dimensional, we generate first-order basis functions in each component as well as second-order basis functions of the form I(w_1 ≥ W_{j,1}) I(w_2 ≥ W_{j,2}) for j = 1, …, n. We use φ(w) to denote the vector of basis functions evaluated at a typical observation w. The intuition for the particular choice of basis functions used by HAL is as follows. For each value of the penalty parameter on the lasso solution path, the corresponding HAL estimate has been shown to be a minimum loss-based estimator (MLE) over the class of functions with variation norm no larger than the L1 norm of the coefficient vector. For each such value, this is a Donsker class, which implies a fast convergence rate of the MLE. Oracle properties of cross-validation imply that selecting the variation norm using cross-validation does not adversely affect the convergence rate. For more details pertaining to HAL, we refer readers to the original publications.
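The zero-order basis expansion can be sketched directly in NumPy. `hal_basis` is our own illustrative helper, not the authors' software; it enumerates all tensor products over nonempty covariate subsets, which is feasible only for small d, and deduplicates identical columns.

```python
import numpy as np
from itertools import combinations

def hal_basis(W):
    """Zero-order HAL basis for an n x d covariate matrix W: for every
    nonempty subset s of {1,...,d} and every observation j, the tensor-product
    indicator prod_{k in s} I(w_k >= W_{j,k}); at most n * (2**d - 1) columns."""
    W = np.asarray(W, dtype=float)
    n, d = W.shape
    cols = []
    for r in range(1, d + 1):
        for s in combinations(range(d), r):
            # block[i, j] = prod_{k in s} I(W[i, k] >= W[j, k]),
            # i.e. basis function with knot at observation j, evaluated at i
            block = np.ones((n, n), dtype=float)
            for k in s:
                block *= (W[:, [k]] >= W[:, k][None, :])
            cols.append(block)
    basis = np.hstack(cols)
    return np.unique(basis, axis=1)   # drop duplicate columns
```

An L1-penalized regression of the outcome on the columns of `hal_basis(W)` (penalty chosen by cross-validation) then yields the HAL fit.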
Given the formulation of HAL as a regression of an outcome onto a set of (binary indicator) variables, HAL provides a natural extension of the proposal of Shortreed and Ertefaie (2017) for outcomeadaptive propensity score estimation. In the next section, we discuss implications for estimation and inference.
3 Methodology
3.1 Outcome highly adaptive lasso propensity estimation
The reliance of Shortreed and Ertefaie (2017) on parametric regression models limits its applicability to settings where more flexible models are required. However, because the highly adaptive lasso is essentially a special case of the usual lasso, it is straightforward to accommodate outcome-adaptive weights in HAL-based estimation of the propensity score. An algorithm for generating a propensity score estimate is as follows. First, we compute the HAL estimate of the outcome regression, using Y as outcome and W as predictors, in the subgroup of participants with A = 1. We denote by β_n the vector of HAL-estimated coefficients associated with the HAL basis functions. Next, we fit HAL using the treatment indicator A as outcome and W as features, but using outcome-adaptive weights. Specifically, the HAL model for the propensity score is logit g(w) = α_0 + Σ_{j=1}^p α_j φ_j(w), where p is the dimension of φ(w), which is no larger than n(2^d − 1). Fitting HAL corresponds to the optimization problem of finding the coefficient vector minimizing the empirical risk subject to an L1 constraint in which the penalty on |α_j| is proportional to |β_{n,j}|^{−γ}, where γ is a user-supplied tuning parameter. The penalty parameter is selected via cross-validation. We call the resultant fit the outcome highly adaptive lasso (OHAL).
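A minimal univariate sketch of the OHAL construction follows, combining the HAL indicator basis with outcome-adaptive weights via the rescaling trick. The function name and defaults are ours, the outcome is assumed binary, cross-validation of the penalty is omitted for brevity, and basis functions with a zero outcome-regression coefficient are simply dropped (i.e., given infinite penalty).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ohal_propensity(w, a, y, gamma=1.0, C=1.0):
    """OHAL sketch for a univariate covariate w: (1) L1 logistic regression
    of Y on the HAL indicator basis among A == 1; (2) adaptive L1 logistic
    regression of A on the same basis, penalizing basis function j in
    proportion to |beta_j|**(-gamma), so instrumental basis functions
    (beta_j == 0) are excluded outright."""
    phi = (w[:, None] >= w[None, :]).astype(float)  # I(w >= W_j), j = 1..n
    or_fit = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(
        phi[a == 1], y[a == 1])
    beta = or_fit.coef_[0]
    scale = np.abs(beta) ** gamma                   # outcome-adaptive weights
    keep = scale > 0                                # drop instrumental bases
    if not np.any(keep):
        return np.full(w.shape[0], a.mean())        # degenerate fallback
    ps_fit = LogisticRegression(penalty="l1", C=C, solver="liblinear").fit(
        phi[:, keep] * scale[keep], a)
    return ps_fit.predict_proba(phi[:, keep] * scale[keep])[:, 1]
```

In practice both penalty parameters would be selected by cross-validation, and the basis would include tensor products for multivariate W.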
An interesting observation on OHAL propensity estimation is that individual basis functions, rather than entire variables, are penalized in the fit. Thus, we could say that OHAL screens for instrumental basis functions, those basis functions that parameterize directions in the covariate space along which the outcome regression does not vary. Therefore, the approach provides an extremely granular means of screening for instruments and removing them to avoid possibly extreme propensity score estimates. To illustrate the potential benefit, consider the data set illustrated in Figure 1.
In this example, the covariate is not an instrumental variable and, due to the overall linear trend in the outcome regression, would likely not be labeled as such by common screening procedures. However, when the covariate is viewed as a collection of infinitesimal indicator basis functions, any basis function whose knot falls in the region where the outcome regression is constant is an instrumental basis function. Moreover, the propensity score takes extreme values over this same region, which may cause erratic behavior of causal effect estimators. Thus, to improve performance of causal effect estimators, we may wish to remove these basis functions from the propensity score fit. This is precisely the goal of our proposed approach. There may be relevant benefits to this approach in both small and large samples. In small samples, avoiding extreme propensity score estimates may yield more stable estimates of treatment effects. In large samples, we note that OHAL will converge pointwise to the true propensity score in regions without instrumental basis functions, and to a value that pools the propensity score over the remaining regions. Indeed, in the illustrative data analysis, OHAL fits a function that is close to the true propensity score where the outcome regression varies, and that lies between the extreme values of the true propensity score where it does not. We recall that the asymptotic variance of a nonparametric efficient estimator is the variance of the efficient influence function, and that the efficient influence function involves dividing by the propensity score. Because the propensity score is near 0 for some covariate values, the efficient variance may be quite large. On the other hand, because the OHAL limit pools the propensity score over the instrumental region, we might expect that a locally efficient estimator based on OHAL, rather than, e.g., HAL, will have improved asymptotic variance relative to a nonparametric efficient estimator.
3.2 Inference when using OHAL
Unfortunately, there are generally repercussions with respect to inference when considering locally efficient estimators that are based on inconsistent estimates of either the outcome regression or propensity score (van der Laan, 2014; Benkeser et al., 2017). In particular, the remainder (4) may not be asymptotically negligible if instrumental variables (or instrumental basis functions) are present, due to the fact that OHAL estimates its pooled limit rather than the true propensity score. Thus, the remainder term may contribute to the first-order behavior of a locally efficient estimator. If the propensity score estimate is a well-specified MLE in a finite-dimensional regression model, then locally efficient estimators will still have an asymptotic Normal distribution, though the variance estimator (6) and confidence interval (7) will generally no longer be valid. In the case that the propensity score estimate is based on an adaptive algorithm, such as HAL, the repercussions on inference are severe: asymptotically, the confidence interval (7) has zero coverage probability, and it has been shown to behave poorly in finite samples (Benkeser et al., 2017). van der Laan (2014) proposed TMLE methods that yield confidence intervals that are robust to inconsistent estimation of one of the outcome regression or propensity score. An alternative TMLE approach was proposed and evaluated in Benkeser et al. (2017).

3.3 Proposed estimator
We propose a TMLE based on a HAL estimate of the outcome regression and an OHAL estimate of the propensity score. Because HAL is used, our outcome regression estimator is consistent under weak conditions. However, the OHAL propensity score estimator is consistent for its pooled limit, which may or may not equal the true propensity score, depending on whether instrumental basis functions are present. Because the analyst is unlikely to have a priori knowledge of whether such basis functions are present, we make use of the results of Benkeser et al. (2017) to develop an estimator and confidence intervals that have sound theoretical behavior in either situation, without needing to know which situation is true. Our proposed estimator is efficient when no instrumental basis functions are present and may be super-efficient otherwise. In both cases, the proposed estimator has a closed-form influence function, which we can use to construct robust Wald-style confidence intervals. Relative to a standard TMLE, this estimator requires estimation of two additional, univariate regression functions, and includes a two-step rather than one-step targeting procedure.
The proposed TMLE estimator can be implemented as follows.
Obtain initial estimates of nuisance parameters

use HAL to obtain an estimate of the outcome regression;

use OHAL, based on the HAL outcome regression fit, to obtain an estimate of the propensity score;

use HAL to obtain an estimate of the first “reduced-dimension” regression, fit with a single univariate covariate;

use HAL to obtain an estimate of the second “reduced-dimension” regression, again fit with a single univariate covariate.
Iterative targeting of outcome regression

define the auxiliary covariates used in the targeting regressions;

fit a logistic regression using the outcome, an offset, and a single covariate, using the relevant subset of observations; call the estimate of the univariate parameter of this regression the first fluctuation coefficient and update the outcome regression estimate accordingly;

fit a second logistic regression using the outcome, an offset, and a single covariate, using the relevant subset of observations; call the estimate of the univariate parameter of this regression the second fluctuation coefficient, and define the targeted outcome regression accordingly.
Repeat steps (iv)–(vi) above until both fluctuation coefficient estimates are smaller in absolute value than a tolerance that shrinks with sample size.
Compute plug-in estimator

the TMLE is the plug-in estimator computed from the targeted outcome regression, averaged with respect to the empirical distribution of W.
In the supplement, we describe an alternative implementation that avoids iteration. We found no practical difference between the estimators in our simulation.
The motivation for the “reduced-dimension” regressions can be explained as follows. If the OHAL limit differs from the true propensity score, i.e., instrumental basis functions are present, then the remainder (4) can be viewed as a nonparametric plug-in estimator of zero, which we may express as Φ(Q̄_n), where Φ maps from the outcome regression model space to the real line. In general, plug-in estimators based on nonparametric regression functions, like HAL, are not n^{1/2}-consistent estimators of their target (cf. certain sieve estimators (Chen, 2007)). For example, the plug-in estimator of ψ_0(1) based on HAL is not generally n^{1/2}-consistent. This fact motivates the use of TMLE, where we estimate an additional regression, namely the propensity score, and we modify our initial estimate of the outcome regression based on the estimated propensity score. A parallel can be drawn to the situation above: Φ(Q̄_n) is not a well-behaved estimator of zero, and so we use a step of TMLE that involves estimation of additional regression quantities, which we use to modify our initial estimate of the outcome regression. The result is that Φ, evaluated at the targeted outcome regression, estimates zero at the required n^{1/2}-rate.
3.4 Statistical inference
For a typical observation, and given an outcome regression and “reduced-dimension” regressions, we define the influence function of the proposed estimator. Note that this influence function is implicitly indexed by a given propensity score, though we omit this notation for simplicity. In the theorem, we assume this index is set to the OHAL limit. We now present a theorem describing the weak convergence of the proposed estimator.
Theorem 1.
Under regularity conditions explicitly stated in the appendix, the proposed estimator is asymptotically linear and, suitably centered and scaled, converges weakly to a mean-zero Normal variate with variance equal to the variance of its influence function.
Note that when there are no instrumental basis functions, the OHAL limit equals the true propensity score for all covariate values, and the influence function of the proposed estimator reduces to the efficient influence function. In words, when there are no instrumental variables nor instrumental basis functions, the proposed estimator is a nonparametric efficient estimator of the parameter of interest. On the other hand, if there are such basis functions, then the proposed estimator is still a consistent estimator of the parameter, though there is a first-order contribution to its influence function stemming from the intentionally inconsistent estimation of the propensity score.
In our numeric studies, we evaluate Wald-style confidence intervals based on two different standard error estimates. The first is a typical influence-function-based estimate, analogous to (6). The second is a partially cross-validated influence-function-based estimator. The HAL outcome regression and the OHAL propensity score regression each utilize V-fold cross-validation to select the lasso penalty parameter. Thus, after these regressions are fit, we have V lasso fits available for each of the outcome regression and the outcome-adaptive propensity score, one for each training fold. For v = 1, …, V, consider the HAL outcome regression fit estimated in the v-th training fold. Given an observation in the v-th validation fold, we say that the prediction from this fit is a partially cross-validated estimate of the outcome regression. We say partially cross-validated since that observation was used to select the lasso tuning parameter. Similarly, the outcome-adaptive HAL propensity score estimate obtained in the v-th training fold is partially cross-validated, since each validation-fold observation was used both to construct the HAL outcome regression estimator and to select the tuning parameter for the OHAL fit. Finally, the reduced-dimension regressions are likewise estimated in each training fold, based on the corresponding fold-specific outcome regression and propensity score fits. For v = 1, …, V, we denote by the v-th validation-fold empirical distribution the empirical distribution of observations in that fold, and define the partially cross-validated variance estimator as the average over folds of the fold-specific empirical second moment of the influence function, evaluated at the corresponding training-fold nuisance fits:

(8)
In spite of this estimator's reliance on cross-validation, it is obtained with essentially no additional computational burden beyond that needed to obtain the point estimate, since it merely recycles the cross-validated lasso fits needed to compute the point estimate itself.
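Given out-of-fold nuisance predictions, the partially cross-validated standard error can be sketched as follows. As a simplification, we use the uncorrected efficient influence function for E[Y(1)] as a stand-in for the full influence function of the proposed estimator (which includes additional reduced-dimension terms); the function name and argument layout are ours.

```python
import numpy as np

def cv_se(y, a, folds, qbar1_cv, g_cv, psi):
    """Partially cross-validated standard error for an estimate `psi` of
    E[Y(1)]: evaluate the influence function using, for each observation,
    nuisance predictions from the training fold that excluded it, then
    average the per-fold second moments. `folds` assigns each observation
    to a validation fold; `qbar1_cv` and `g_cv` are out-of-fold predictions."""
    n = len(y)
    var_terms = []
    for v in np.unique(folds):
        idx = folds == v
        # simplified EIF: A/g * (Y - Qbar(1, W)) + Qbar(1, W) - psi
        eif = (a[idx] / g_cv[idx]) * (y[idx] - qbar1_cv[idx]) \
            + qbar1_cv[idx] - psi
        var_terms.append(np.mean(eif ** 2))
    return np.sqrt(np.mean(var_terms) / n)
```

The fold-wise predictions come for free from the cross-validation already performed to tune the lasso fits, which is the point made above.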
4 Simulation
We designed a simulation to highlight the two potential benefits of our proposed methodology. First, our proposal uses flexible estimators of the outcome regression and can therefore adapt to complex underlying true regression functions. In such settings, we expect our approach to benefit relative to approaches that utilize misspecified parametric regression models, though other flexible regression methodologies could also capture these relationships. The second benefit of our proposal is that we use OHAL to estimate the propensity score, which may be expected to offer benefits in settings with instrumental variables and/or instrumental basis functions; here we expect improvements even relative to other nonparametric approaches. Therefore, we designed a simulation study that included both nonlinear relationships and instrumental variables, and compared our proposal to several other approaches in this setting.
We simulated one thousand data sets at each of three sample sizes, n ∈ {100, 500, 1000}. We drew the four-dimensional covariate vector W by independently sampling each component from a Uniform(−1, 1), Bernoulli(0.5), Uniform(−1, 1), and Uniform(0, 1) distribution, respectively. The treatment A was drawn from a Bernoulli distribution whose success probability depends on W, and the outcome Y was drawn from a Bernoulli distribution whose success probability depends on A and W. The average treatment effect implied by this data-generating distribution is approximately 0.20.

We studied several estimators of the average treatment effect. First, we considered two G-computation estimators (also known as standardization estimators (Robins, 1986)) based on parametric models. The first was based on a parsimonious, correctly specified logistic regression model for the outcome regression. The second was based on a misspecified, main-terms logistic regression model. We also considered two inverse probability of treatment weighted (IPTW) estimators, with one again based on a parsimonious, correctly specified logistic regression and one based on a misspecified, main-terms model for the propensity score. For both the G-computation and IPTW estimators, we view the estimators based on correctly specified regression models as idealized benchmarks, since in practice it is often difficult to correctly specify a parametric regression model. Confidence intervals for these estimators were constructed using a percentile-based nonparametric bootstrap with 500 bootstrap samples.
We also implemented the estimator of Shortreed and Ertefaie (2017), where the outcome regression was inconsistently estimated via a main-terms logistic regression and the propensity score was estimated using the outcome-adaptive lasso. Treatment effect estimators based on this approach should remove the deleterious effects of the instrumental variable, but will still suffer from incorrect specification of the outcome regression. We implemented the IPTW estimator proposed by Shortreed and Ertefaie (2017), in addition to a TMLE based on these regression fits. Confidence intervals for these estimators were constructed using the bootstrap approach recommended by those authors (Efron, 2014).
For a nonparametric treatment effect estimator, we used TMLE based on a HAL-estimated outcome regression and a HAL-estimated propensity score. This estimator should properly account for nonlinearity, but will not remove the instrumental variable, leaving the possibility of extreme estimated propensity scores. On the other hand, our proposed TMLE that uses an OHAL-estimated propensity score is expected to simultaneously account for nonlinearity, avoid extreme estimated propensity scores, and maintain low bias in spite of the inconsistent propensity score estimate. Two types of confidence intervals were computed for both TMLE HAL and TMLE OHAL. The first was based on the estimated influence function, as in (6); the second was based on the cross-validated estimate of the influence function, as in (8).
We compared these estimators on their Monte Carlo-estimated bias, standard error, and mean squared-error. We also compared the coverage of nominal 95% confidence intervals, as well as the median width of the confidence intervals.
We found that, as expected, the benchmark estimators were essentially unbiased and had low mean squared-error, with the correctly specified G-computation estimator performing slightly better than the IPTW estimator (top two rows, Table 1). On the other hand, the estimators based on misspecified parametric models exhibited large bias, resulting in large MSE. The benefits of the outcome-adaptive lasso are seen in the improved standard error of the estimators based on an OAL propensity estimate relative to those that do not penalize the inclusion of instrumental variables. However, the bias of both OAL-based estimators is large, owing to the misspecification of the relevant regressions. TMLE based on HAL had relatively large bias in small sample sizes, but bias improved in larger samples. By comparison, our proposed TMLE based on OHAL had low bias at all sample sizes. In the smallest sample size, our proposed estimator had improved bias and standard error relative to TMLE based on HAL. In larger sample sizes, our estimator had improved bias. Overall, TMLE with OHAL had MSE 66%, 75%, and 75% that of the TMLE based on HAL for sample sizes 100, 500, and 1000, respectively.
Bias  Standard error  Mean squared-error  

100  500  1000  100  500  1000  100  500  1000  
GCOMP GLM (C)  0.03  0.09  0.02  1.10  1.08  1.03  1.21  1.17  1.06 
IPTW GLM (C)  0.02  0.05  0.05  1.38  1.32  1.25  1.90  1.75  1.57 
GCOMP GLM (M)  0.56  1.34  1.82  1.09  1.10  1.03  1.51  3.02  4.36 
IPTW GLM (M)  0.61  1.43  1.90  1.22  1.17  1.11  1.85  3.41  4.85 
IPTW OAL (M)  0.59  1.42  1.90  1.15  1.15  1.10  1.68  3.33  4.82 
TMLE OAL (M)  0.66  1.48  1.98  1.17  1.15  1.09  1.81  3.51  5.11 
TMLE HAL  0.58  0.91  0.73  1.27  1.27  1.19  1.95  2.44  1.93 
DRTMLE OHAL  0.09  0.39  0.10  1.13  1.30  1.21  1.29  1.83  1.47 
The benchmark estimators had near-nominal coverage at all sample sizes (Table 2). Estimators based on misspecified parametric models had poor coverage. Surprisingly, the estimators based on OAL had near-nominal coverage in large sample sizes. Because of the bias of these estimators, we suspect that this coverage is due to conservative standard error estimates. Indeed, we find that in large samples, these confidence intervals tended to be considerably wider than the other intervals considered. For both nonparametric estimators, we found that the cross-validated standard error estimates provided better coverage in small samples than their non-cross-validated counterparts. Confidence intervals for our proposed estimator based on cross-validated standard error estimates had near-nominal coverage at each sample size, while those based on the TMLE that used HAL provided coverage less than the nominal level.
Table 2: CI coverage in percent (median interval width) by sample size (100, 500, 1000). CVSE = cross-validated standard error.

                          100           500          1000
GCOMP GLM (C)         94.3 (0.42)   93.6 (0.18)   95.6 (0.13)
IPTW GLM (C)          93.6 (0.52)   93.5 (0.23)   95.8 (0.16)
GCOMP GLM (M)         89.5 (0.42)   73.9 (0.19)   60.1 (0.13)
IPTW GLM (M)          90.9 (0.48)   75.6 (0.20)   61.4 (0.14)
IPTW OAL (M)          89.4 (0.45)   91.2 (0.28)   95.2 (0.24)
TMLE OAL (M)          86.1 (0.45)   89.8 (0.28)   94.2 (0.24)
TMLE HAL              75.1 (0.34)   74.4 (0.16)   81.5 (0.12)
TMLE HAL (CVSE)       93.1 (0.48)   86.2 (0.21)   91.9 (0.15)
DRTMLE OHAL           88.1 (0.38)   86.4 (0.18)   93.2 (0.14)
DRTMLE OHAL (CVSE)    96.7 (0.48)   91.9 (0.21)   96.4 (0.15)
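The coverage and median-width metrics in Table 2 correspond to Wald-style intervals, estimate ± z × standard error, evaluated across replicates. The function below is a hypothetical sketch of that calculation (the name and the 95% nominal level are assumptions for illustration):

```python
import numpy as np

def coverage_and_width(estimates, ses, truth, z=1.96):
    """Empirical coverage (%) and median width of Wald confidence intervals
    estimate +/- z * standard error, across simulation replicates."""
    estimates = np.asarray(estimates, dtype=float)
    ses = np.asarray(ses, dtype=float)
    lower, upper = estimates - z * ses, estimates + z * ses
    covered = (lower <= truth) & (truth <= upper)
    return 100 * covered.mean(), np.median(upper - lower)

# Well-calibrated standard errors yield roughly nominal (95%) coverage:
rng = np.random.default_rng(1)
truth = 2.0
est = rng.normal(truth, 0.1, size=5000)      # estimator with sampling sd 0.1
cov, width = coverage_and_width(est, np.full(5000, 0.1), truth)
```

Conservative (too-large) standard errors inflate both coverage and median width, which is the pattern observed for the OAL-based intervals in large samples.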
5 Discussion
OHAL provides a new tool for flexibly estimating the propensity score while simultaneously performing fine-grained variable selection. Our simulation results point to potential benefits of this approach over existing nonparametric efficient estimators in some situations. In future work, we intend to investigate the benefit of using OHAL for estimating the causal effect of a treatment administered over several time points, a setting where extreme propensity scores are commonly encountered in practice. Another avenue for future research is to explore methods for adaptively tuning the regularization of the HAL outcome regression. The outcome regression plays a particularly important role in OHAL, as it contributes to estimation of the average treatment effect both directly, as an estimator of the true outcome regression, and indirectly, through its role in propensity score estimation. Thus, errors in tuning the outcome regression may easily propagate to the estimate of the average treatment effect. We have begun experimenting with recursive tuning procedures, but so far have found no change in the estimators' performance. Another important consideration is selection of the hyperparameter in OHAL. In our simulations, we simply fixed this hyperparameter, but it is possible that more careful tuning would further improve the performance of estimators based on OHAL.
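As a rough illustration of the cross-validated tuning discussed above, the sketch below selects a regularization level by K-fold cross-validation. It is not the OHAL procedure: it uses a closed-form ridge fit as a simple stand-in for the HAL lasso fit, and all names are illustrative:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution; a simple stand-in for a penalized fit."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_choose_penalty(X, y, lambdas, n_folds=5, seed=0):
    """Pick the penalty minimizing K-fold cross-validated prediction error."""
    n = X.shape[0]
    folds = np.random.default_rng(seed).integers(0, n_folds, size=n)
    cv_mse = []
    for lam in lambdas:
        errs = []
        for k in range(n_folds):
            tr, te = folds != k, folds == k
            beta = ridge_fit(X[tr], y[tr], lam)
            errs.append(np.mean((y[te] - X[te] @ beta) ** 2))
        cv_mse.append(np.mean(errs))
    return lambdas[int(np.argmin(cv_mse))]

# Simulated data with a strong sparse signal; heavy shrinkage should lose.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
beta_true = np.array([2.0, 1.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=200)
best = cv_choose_penalty(X, y, lambdas=np.array([0.01, 0.1, 1.0, 10.0, 100.0]))
```

The point raised in the discussion is that, for OHAL, this tuning is delicate: the outcome regression feeds both the plug-in estimate and the propensity score weights, so a poorly tuned penalty can propagate through both channels.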
6 Acknowledgement
This project was supported by NIH grant R01 AI074345-08 and UC Berkeley Superfund Research Program grant P42 ES004705-29.
References
 Benkeser and van der Laan [2016] David Benkeser and Mark van der Laan. The highly adaptive lasso estimator. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 689–696. IEEE, 2016.
 Benkeser et al. [2017] David Benkeser, Marco Carone, MJ van der Laan, and PB Gilbert. Doubly robust nonparametric inference on the average treatment effect. Biometrika, 104(4):863–880, 2017.
 Bickel et al. [1998] Peter J Bickel, Chris AJ Klaassen, Ya’acov Ritov, Jon A Wellner, et al. Efficient and adaptive estimation for semiparametric models. Springer-Verlag, 1998.
 Chen [2007] Xiaohong Chen. Large sample sieve estimation of seminonparametric models. Handbook of Econometrics, 6:5549–5632, 2007.
 Efron [2014] Bradley Efron. Estimation and accuracy after model selection. Journal of the American Statistical Association, 109(507):991–1007, 2014.
 Farrell [2015] Max H Farrell. Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23, 2015.
 Gruber and van der Laan [2010] Susan Gruber and Mark J van der Laan. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. The International Journal of Biostatistics, 6(1), 2010.
 Hubbard et al. [2000] Alan E Hubbard, Mark J Van Der Laan, and James M Robins. Nonparametric locally efficient estimation of the treatment specific survival distribution with right censored data and covariates in observational studies. In Statistical Models in Epidemiology, the Environment, and Clinical Trials, pages 135–177. Springer, 2000.
 Ju et al. [2017] Cheng Ju, Susan Gruber, Samuel D Lendle, Antoine Chambaz, Jessica M Franklin, Richard Wyss, Sebastian Schneeweiss, and Mark J van der Laan. Scalable collaborative targeted learning for highdimensional data. arXiv preprint arXiv:1703.02237, 2017.
 Ju et al. [2018] Cheng Ju, Antoine Chambaz, and Mark J van der Laan. Collaborative targeted inference from continuously indexed nuisance parameter estimators. arXiv preprint arXiv:1804.00102, 2018.
 Kang and Schafer [2007] Joseph DY Kang and Joseph L Schafer. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, pages 523–539, 2007.
 Karim and Platt [2017] Mohammad Ehsanul Karim and Robert W Platt. Estimating inverse probability weights using super learner when weightmodel specification is unknown in a marginal structural cox model context. Statistics in Medicine, 36(13):2032–2047, 2017.
 Karim et al. [2018] Mohammad Ehsanul Karim, Menglan Pang, and Robert W Platt. Can we train machine learning methods to outperform the highdimensional propensity score algorithm? Epidemiology, 29(2):191–198, 2018.
 Kennedy [2018] Edward H Kennedy. Nonparametric causal effects based on incremental propensity score interventions. Journal of the American Statistical Association, pages 1–12, 2018.
 Kim et al. [2009] SeungJean Kim, Kwangmoo Koh, Stephen Boyd, and Dimitry Gorinevsky. ℓ1 trend filtering. SIAM Review, 51(2):339–360, 2009.
 Lee et al. [2010] Brian K Lee, Justin Lessler, and Elizabeth A Stuart. Improving propensity score weighting using machine learning. Statistics in Medicine, 29(3):337–346, 2010.
 Moore and van der Laan [2009] Kelly L Moore and Mark J van der Laan. Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation. Statistics in Medicine, 28(1):39–64, 2009.
 Pearl [2009] Judea Pearl. Causality. Cambridge University Press, 2009.
 Petersen et al. [2012] Maya L Petersen, Kristin E Porter, Susan Gruber, Yue Wang, and Mark J van der Laan. Diagnosing and responding to violations in the positivity assumption. Statistical Methods in Medical Research, 21(1):31–54, 2012.
 Robins [1986] James Robins. A new approach to causal inference in mortality studies with a sustained exposure period—application to control of the healthy worker survivor effect. Mathematical Modelling, 7(9–12):1393–1512, 1986.
 Robins et al. [2006] James Robins, Aad Van Der Vaart, et al. Adaptive nonparametric confidence sets. The Annals of Statistics, 34(1):229–253, 2006.
 Rosenbaum and Rubin [1983] Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, pages 41–55, 1983.
 Rotnitzky et al. [2010] Andrea Rotnitzky, Lingling Li, and Xiaochun Li. A note on overadjustment in inverse probability weighted estimation. Biometrika, 97(4):997–1001, 2010.
 Schisterman et al. [2009] Enrique F Schisterman, Stephen R Cole, and Robert W Platt. Overadjustment bias and unnecessary adjustment in epidemiologic studies. Epidemiology (Cambridge, Mass.), 20(4):488, 2009.
 Schneeweiss et al. [2009] Sebastian Schneeweiss, Jeremy A Rassen, Robert J Glynn, Jerry Avorn, Helen Mogun, and M Alan Brookhart. Highdimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology, 20(4):512, 2009.
 Shortreed and Ertefaie [2017] Susan M Shortreed and Ashkan Ertefaie. Outcomeadaptive lasso: Variable selection for causal inference. Biometrics, 73(4):1111–1122, 2017.
 van der Laan [2017] Mark van der Laan. A generally efficient targeted minimum loss based estimator based on the highly adaptive lasso. The International Journal of Biostatistics, 13(2), 2017.
 van der Laan [2014] Mark J van der Laan. Targeted estimation of nuisance parameters to obtain valid statistical inference. The International Journal of Biostatistics, 10(1):29–57, 2014.
 van der Laan and Rose [2011] Mark J van der Laan and Sherri Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer Science & Business Media, 2011.

 van der Laan and Rose [2018] Mark J van der Laan and Sherri Rose. Targeted Learning in Data Science: Causal Inference for Complex Longitudinal Studies. Springer, New York, NY, 2018. doi: 10.1007/978-3-319-65304-4.
 van der Laan and Rubin [2006] Mark J van der Laan and Daniel Rubin. Targeted maximum likelihood learning. The International Journal of Biostatistics, 2(1), 2006.
 van der Laan et al. [2010] Mark J van der Laan, Susan Gruber, et al. Collaborative double robust targeted maximum likelihood estimation. The International Journal of Biostatistics, 6(1):Article 17, 2010.
 van der Laan et al. [2018] Mark J van der Laan, David Benkeser, and Oleg Sofrygin. Targeted minimum lossbased estimation. Wiley StatsRef: Statistics Reference Online, pages 1–8, 2018.
 Wyss et al. [2018] Richard Wyss, Sebastian Schneeweiss, Mark van der Laan, Samuel D Lendle, Cheng Ju, and Jessica M Franklin. Using super learner prediction modeling to improve highdimensional propensity score estimation. Epidemiology, 29(1):96–106, 2018.
 Zhang et al. [2008] Min Zhang, Anastasios A Tsiatis, and Marie Davidian. Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics, 64(3):707–715, 2008.
 Zou [2006] Hui Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476):1418–1429, 2006.
Appendix
6.1 Proof of Theorem 1
In the proof, we adopt the shorthand notation $Pf$ to denote $\int f(o) \, dP(o)$ for any integrable function $f$ and distribution $P$. Consequently, using $P_n$ to denote the empirical distribution function of $n$ observations of $O$, we have $P_n f = n^{-1} \sum_{i=1}^{n} f(O_i)$. Our proof relies on the following regularity conditions:

A1. falls in a Donsker class with probability tending to one and converges in probability to 0.
A2. .
A3. .
A4. falls in a Donsker class with probability tending to one and converges in probability to 0.
We note that by design our estimators , , satisfy
(9)  
(10) 
We note also that the efficient influence function is doubly robust in the sense that , for any choice of .
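The displayed equations in this passage did not survive extraction. For reference, in the targeted learning literature the efficient influence function for the ATE and the double-robustness property just described are usually written as follows; the notation $\bar{Q}$ for the outcome regression, $g$ for the propensity score, and $\psi_0$ for the true ATE is an assumption on our part, not recovered from the original:

```latex
% Efficient influence function for the ATE at O = (W, A, Y):
D^*(\bar{Q}, g)(O)
  = \frac{2A - 1}{g(A \mid W)}\bigl\{Y - \bar{Q}(A, W)\bigr\}
    + \bar{Q}(1, W) - \bar{Q}(0, W) - \psi_0 .

% Double robustness: the mean is zero if either nuisance is correct,
% for any choice of the other:
P_0 D^*(\bar{Q}, g_0) = 0 \quad \text{for any } \bar{Q},
\qquad
P_0 D^*(\bar{Q}_0, g) = 0 \quad \text{for any } g .
```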
We begin with an exact second-order linearization of the parameter, . We add and subtract and rearrange terms. Noting the double robustness of the efficient influence function and (9), we arrive at the equation
where . By assumption A1, . Thus, it remains to examine the behavior of the remainder. Now,
where and . By assumptions A2 and A3, and . Now, allowing some abuse of our shorthand notation,
where and . We note that