1 Introduction
In many areas of research, the scientific question of interest is often answered by drawing statistical inference about the average effect of a treatment on an outcome. Depending on the setting, this “treatment” might correspond to an actual therapeutic treatment, a harmful exposure, or a policy intervention. We use Y(1) to denote the potential outcome of a typical data unit sampled from the population of interest when the unit receives the treatment of interest, and Y(0) to denote the potential outcome if that unit instead receives control. In this work, we focus on estimation of the average treatment effect (ATE), the average difference between Y(1) and Y(0) in the population of interest.
Often, due to ethical or logistical constraints, we cannot randomly assign data units to receive or not receive the treatment. Thus, in order to draw valid conclusions about the ATE, we require statistical methodology that can control for confounders of the treatment/outcome relationship. Although the epidemiological and other applied literatures are still considering the relative merits of various methodologies, the statistical literature has provided direction through consideration of semiparametric efficient methods. The literature provides many examples of such estimators. In some situations, regularized or sieve maximum likelihood estimators can be used (e.g., van der Vaart (1998)); however, this generally requires careful selection of tuning parameters. The existing literature lacks general guidelines for how such parameters can be chosen in practice, which limits the utility of these strategies. On the other hand, methods that are built around a causal effect parameter’s efficient influence function offer a more straightforward approach to estimation. Foremost amongst these approaches are one-step estimation (Ibragimov and Khasminskii, 1981; Pfanzagl, 1982; Bickel et al., 1997) and targeted minimum loss estimation (TMLE) (van der Laan and Rubin, 2006; van der Laan and Rose, 2011).
The efficient influence function often depends on the observed data distribution through certain key nuisance parameters. In the context of the ATE, these nuisance parameters are the conditional mean of the outcome given treatment and confounders (the so-called outcome regression, OR), the conditional probability of treatment given confounders (the so-called propensity score, PS), and the distribution function of confounders in the population of interest. Once estimators of these key quantities are available, each methodology provides its own recipe for combining the relevant nuisance estimators into an estimate of the causal effect of interest. Assuming the nuisance estimators satisfy certain regularity conditions, the resulting estimators of the ATE, when appropriately centered and scaled, converge in distribution to a mean-zero Gaussian variate with variance equal to the semiparametric Cramér-Rao efficiency bound for regular estimators. In addition to efficiency, these estimators are also doubly-robust, meaning that they are consistent for the causal effect of interest even if one of the OR or PS is inconsistently estimated.
One of the key assumptions underlying any methodology for estimating the ATE using observational data is the strong positivity assumption, which stipulates that the PS must be bounded away from zero and one almost everywhere. That is, if we define strata of data units based on their observed confounders, any stratum with positive support must have some probability of receiving and not receiving the treatment. If this condition fails, then the ATE is not estimable from the observed data. Moreover, even if the condition holds theoretically, in practice there may be small estimated propensity scores, so-called practical violations of the positivity assumption (Petersen et al., 2012). In such cases, the one-step estimator and TMLE can suffer from erratic, non-robust behavior.
To combat this behavior, various extensions have been proposed, including collaborative TMLE (CTMLE) (van der Laan and Gruber, 2010; Gruber and van der Laan, 2010; Stitelman and van der Laan, 2010; Wang et al., 2011). In this approach, the OR and PS are estimated collaboratively, by selecting an estimate of the PS based on how well it tunes an estimate of the OR. From a theoretical point of view, the goal of CTMLE is generally to provide an estimator that is more robust than TMLE, but that nevertheless maintains TMLE’s asymptotic efficiency. CTMLE enjoys additional theoretical benefits in settings where the OR is inconsistently estimated. In this case, the CTMLE enjoys a more general robustness property than TMLE, referred to as collaborative double-robustness (van der Laan and Gruber, 2010). Many proposed CTMLEs are designed specifically for settings with practical positivity violations, for example by: choosing a truncation level for estimated propensities, selecting variables to be included in the PS, or tuning particular machine learning algorithms (Ju et al., 2019, 2017, 2018). These works show that CTMLE can provide greater robustness than TMLE in challenging situations.
In spite of the putative benefits of CTMLE, these estimators are not widely used. There may be several reasons why. First, the approach involves many decision points for the analyst, who must select an increasingly complex sequence of estimators for the PS and implement each of these estimators. Second, the approach often involves extended computation time relative to traditional doubly-robust methods. In particular, cross-validation is needed to determine which PS method should be selected from amongst the user-chosen sequence. Third, from a theoretical perspective, performing robust inference based on existing CTMLE approaches is also challenging, involving either strong assumptions on nuisance estimators (see Appendix 17 of van der Laan and Rose (2011)) or additional iterative computational steps (van der Laan, 2014).
In this work, we seek to overcome these limitations by proposing a new approach to CTMLE. The twist in the present proposal relative to existing CTMLE approaches is that we assume the OR estimator is consistent at a fast rate. Assuming this consistency, we provide an alternative target for a PS estimator that is explicitly adaptive to the OR. In particular, rather than estimating the true PS, we recommend estimating the propensity for receiving treatment as a function of the estimated OR. This low-dimensional regression can be substituted in place of an estimator of the true PS in a standard TMLE or one-step procedure. We show that, when appropriately scaled, the resultant estimator is asymptotically Normal with variance that is generally smaller than that of an efficient estimator. Thus, our proposal provides a new approach to CTMLE that is tailored for both small- and large-sample performance.
2 Background
2.1 Identification of ATE
Suppose we observe n independent copies of the data unit O = (W, A, Y), where W is a vector of putative confounders, A ∈ {0, 1} is a binary treatment, and Y ∈ [0, 1] is the outcome of interest. Our assumption that the outcome is bounded in the unit interval does not sacrifice any generality of our proposed methodology, as any bounded outcome can be rescaled to [0, 1]. We denote by P_0 the probability distribution of O and assume that P_0 is an element of a statistical model M. We take M to be a nonparametric model. As above, we use Y(1) and Y(0) to denote the counterfactual outcomes under treatment and no treatment, respectively. For a ∈ {0, 1}, we denote by P_{a,0} the probability distribution of Y(a). The ATE is defined as E[Y(1)] - E[Y(0)], the difference in average outcome if the entire population were assigned to receive a = 1 versus a = 0. The ATE is identifiable from the observed data under the following assumptions: consistency, Y = Y(A); no interference, the counterfactual outcome of one unit does not depend on the treatment assigned to other units; ignorability, (Y(0), Y(1)) is independent of A given W; and positivity, P_0(A = 1 | W = w) ∈ (0, 1) for almost every w. The first two assumptions are needed in order to have well-defined counterfactual random variables, while the ignorability condition essentially states that there are no unmeasured confounders of A and Y. As mentioned in the introduction, the positivity criterion stipulates that every unit has a chance of receiving both a = 1 and a = 0. If these assumptions hold, the average treatment effect is identified by the G-computation formula

E_0[ E_0(Y | A = 1, W) ] - E_0[ E_0(Y | A = 0, W) ].   (1)
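Spelling out the identification argument for the a = 1 term (the a = 0 term is symmetric) may be helpful; each equality uses one of the assumptions above:

```latex
\begin{align*}
E[Y(1)] &= E\{\, E[\,Y(1) \mid W\,] \,\} && \text{(iterated expectation)} \\
        &= E\{\, E[\,Y(1) \mid A = 1, W\,] \,\} && \text{(ignorability)} \\
        &= E\{\, E[\,Y \mid A = 1, W\,] \,\}, && \text{(consistency)}
\end{align*}
```

where positivity guarantees that the inner conditional expectation is well defined for almost every value of W.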
2.2 Estimators of the ATE
For simplicity, we henceforth consider estimation of the treatment-specific mean E_0[E_0(Y | A = 1, W)]. Symmetric arguments can be made about E_0[E_0(Y | A = 0, W)], and thus the ATE. Hereafter, when we refer to the OR, it is understood that we are referring to the quantity Q_0: w ↦ E_0(Y | A = 1, W = w), the regression of Y on W amongst units observed to receive the treatment.
For each w, we denote by Q_0(w) the true OR evaluated at w and denote by Q_n(w) an estimate of Q_0(w) based on the observed data. We use Q to denote the model for the OR implied by M; that is, Q is the collection of all possible ORs allowed by the model M. Similarly, for each w, we denote by g_0(w) = P_0(A = 1 | W = w) the true PS evaluated at w, by g_n(w) an estimate of g_0(w), and by G the model for the PS implied by M. Finally, we denote by Q_{W,0} the distribution function of the vector of confounders. In the remainder, we use the empirical distribution Q_{W,n}: w ↦ n^{-1} Σ_i I(W_i ≤ w), where I is the indicator function, as an estimate of Q_{W,0}. We denote by Q_W the model for Q_{W,0} implied by M. It is useful to our discussion to regard the parameter of interest as a mapping ψ from Q × Q_W to the real line. That is, given a pair (Q, Q_W), ψ(Q, Q_W) = ∫ Q(w) dQ_W(w) is the value of the treatment-specific mean implied by the OR Q and confounder distribution Q_W. Thus, denoting by (Q_0, Q_{W,0}) the true values of these quantities, we have ψ_0 = ψ(Q_0, Q_{W,0}).
We remind readers that a regular (see Appendix A) estimator of ψ_0 is asymptotically linear if and only if it behaves approximately as an empirical mean of a mean-zero, finite-variance function of the observed data. This function is referred to as the estimator’s influence function. Depending on the chosen model, there may be a large class of possible influence functions of regular estimators. The influence function in this class that has the smallest variance is referred to as the efficient influence function (EIF). Any estimator with influence function equal to the EIF is said to be efficient amongst regular, asymptotically linear estimators. Given an OR Q, a PS g, a candidate parameter value ψ, and a typical observation o = (w, a, y), we define the efficient influence function for the treatment-specific mean relative to the nonparametric model as

D(Q, g, ψ)(o) = (a / g(w)) {y - Q(w)} + Q(w) - ψ.
Several frameworks exist for constructing estimators with a user-specified influence function. By selecting the EIF, these frameworks can be used to generate efficient estimators. We focus on the one-step and targeted minimum loss estimation frameworks. For our discussion, it is useful to introduce the idea of a plug-in estimator. We denote by Q_n an estimate of Q_0. A plug-in estimate has the form ψ(Q_n, Q_{W,n}) = n^{-1} Σ_i Q_n(W_i). The one-step estimator applies an EIF-based correction to the plug-in estimate,

ψ_{n,os} = ψ(Q_n, Q_{W,n}) + n^{-1} Σ_i D(Q_n, g_n, ψ(Q_n, Q_{W,n}))(O_i).
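To make the recipe concrete, the following Python sketch computes the plug-in and one-step estimates on simulated data. The data-generating process is an illustrative assumption, and for simplicity the true nuisance values stand in for fitted estimates of the OR and PS.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative simulated data: confounder W, treatment A, outcome Y.
n = 2000
W = rng.uniform(-1, 1, n)
g0 = 1 / (1 + np.exp(-W))            # true PS, P(A = 1 | W)
A = rng.binomial(1, g0)
Y = W + A + rng.normal(0, 1, n)      # true OR at a = 1: E[Y | A = 1, W] = W + 1

# Stand-ins for fitted nuisance estimators (here: the truth, for clarity).
Qn = W + 1                           # OR estimate evaluated at each W_i
gn = g0                              # PS estimate evaluated at each W_i

psi_plugin = Qn.mean()               # plug-in estimate psi(Q_n, Q_{W,n})
eif = A / gn * (Y - Qn) + Qn - psi_plugin   # estimated EIF at each O_i
psi_onestep = psi_plugin + eif.mean()       # EIF-based one-step correction
```

Note that the correction term `eif.mean()` reduces to the inverse-probability-weighted residual term, since the plug-in part of the EIF averages to zero by construction.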
TMLE is a general framework for constructing plug-in estimators that satisfy (possibly several) user-specified equations. The framework is implemented in two steps. First, initial estimators of relevant nuisance parameters (e.g., the OR and PS) are generated using a user-chosen technique. Subsequently, the initial estimates are carefully modified such that (i) the modified estimators inherit desirable properties of the initial estimators (e.g., their rate of convergence); and (ii) the relevant, user-specified equations are satisfied. For the present problem, a TMLE can be implemented by first generating an initial estimate Q_n of the OR and g_n of the PS. The OR estimator is subsequently updated to a targeted estimator Q_n^*, such that the EIF estimating equation,

n^{-1} Σ_i D(Q_n^*, g_n, ψ(Q_n^*, Q_{W,n}))(O_i) = 0,

is satisfied. This can be achieved, for example, by defining a logistic regression working model for the OR with logit(Q_n) as an offset, no intercept term, and a single covariate. For each a and w, we define this covariate as H_n(a, w) = a / g_n(w). The maximum likelihood estimate ε_n of the regression coefficient associated with the covariate is computed (e.g., via iteratively reweighted least squares). For each w, we define the so-called targeted OR estimator, Q_n^*(w) = expit{logit(Q_n(w)) + ε_n H_n(1, w)}. It is straightforward to show that the score of the coefficient, evaluated at ε_n and a typical observation o, equals H_n(a, w){y - Q_n^*(w)}; thus, we may deduce that the EIF estimating equation is satisfied by the updated OR estimate Q_n^*. The TMLE of the treatment-specific mean is computed as the plug-in estimator based on the modified OR estimator, ψ_n^* = ψ(Q_n^*, Q_{W,n}) = n^{-1} Σ_i Q_n^*(W_i). We remark that both the one-step and TMLE frameworks can be seen as first generating an initial estimate based on the OR and subsequently applying a correction procedure that involves an estimate of the PS. This view of the estimators is useful to our discussion below.
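To illustrate the targeting step, here is a hedged Python sketch that fits the working-model coefficient by Newton-Raphson rather than iteratively reweighted least squares. The binary outcome (so the logistic working model applies without rescaling) and the use of true nuisance values as stand-ins for Q_n and g_n are assumptions of the sketch.

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def logit(p):
    return np.log(p / (1 - p))

rng = np.random.default_rng(2)
n = 1000
W = rng.uniform(-1, 1, n)
gn = expit(W)                        # PS (truth used as a stand-in estimate)
A = rng.binomial(1, gn)
Y = rng.binomial(1, expit(W + A))    # binary outcome keeps Y in [0, 1]
Qn = expit(W + 1)                    # initial OR estimate at a = 1 (truth)

# Logistic working model: offset logit(Qn), no intercept, covariate H = a / gn.
H = A / gn
offset = logit(Qn)
eps = 0.0
for _ in range(50):                  # Newton-Raphson for the MLE of eps
    Qeps = expit(offset + eps * H)
    score = np.sum(H * (Y - Qeps))
    info = np.sum(H**2 * Qeps * (1 - Qeps))
    eps += score / info

Qstar = expit(offset + eps / gn)     # targeted OR evaluated at a = 1
psi_tmle = Qstar.mean()              # plug-in TMLE of the treatment-specific mean
eif_eq = np.mean(A / gn * (Y - expit(offset + eps * H)))  # EIF estimating eq.
```

At the fitted coefficient, `eif_eq` is zero up to numerical precision, which is exactly the EIF estimating equation described above.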
2.3 Largesample theory and smallsample considerations
We henceforth focus discussion on TMLE, with the understanding that similar arguments apply to the one-step estimator. The following expansion (derived under assumptions in Appendix B) is useful for studying the large-sample behavior of a TMLE:

ψ_n^* - ψ_0 = n^{-1} Σ_i D(Q_∞, g_∞, ψ_0)(O_i) + R_n + o_P(n^{-1/2}),   (2)

where (Q_∞, g_∞) is the in-probability limit of (Q_n^*, g_n). The term R_n is a second-order remainder that involves the product of differences between g_n and g_0 and between Q_n^* and Q_0:

R_n = ∫ {(g_0(w) - g_n(w)) / g_n(w)} {Q_0(w) - Q_n^*(w)} dQ_{W,0}(w).   (3)
A key step in establishing asymptotic linearity of a TMLE is showing that R_n is asymptotically negligible, i.e., R_n = o_P(n^{-1/2}). This requirement is satisfied if, for example, both Q_n^* - Q_0 and g_n - g_0 are o_P(n^{-1/4}) with respect to the L^2(P_0) norm. If so, equation (2) implies that the TMLE is asymptotically linear with influence function equal to the EIF and thus, by definition, the TMLE is efficient. Moreover, the central limit theorem implies that n^{1/2}(ψ_n^* - ψ_0) converges in distribution to a mean-zero normal variate with variance σ_0^2 = Var{D(Q_0, g_0, ψ_0)(O)}. We note an additional interesting feature of efficient estimators of the treatment-specific mean: they are doubly-robust. That is, the estimated treatment-specific mean is consistent for the true treatment-specific mean if either the estimated OR consistently estimates the true OR, the estimated PS consistently estimates the true PS, or both estimators are consistent.
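The sufficiency of these rate conditions follows from the Cauchy-Schwarz inequality applied to the remainder in equation (3); writing g_n for the PS estimate and Q_n^* for the targeted OR, we have

```latex
|R_n| \;\le\; \left\| \frac{g_0 - g_n}{g_n} \right\|_{2,P_0} \left\| Q_0 - Q_n^* \right\|_{2,P_0},
```

so if each factor is o_P(n^{-1/4}) and g_n is bounded away from zero, the product is o_P(n^{-1/2}). This bound also makes precise why small propensity estimates are dangerous: the factor 1/g_n inflates the first norm.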
As with any asymptotic analysis, these results provide conditions under which estimators are well-behaved in large samples, but provide no guarantees of small-sample performance. In particular, in settings where the target estimand is weakly identifiable, the TMLE and one-step estimator may be unstable despite their asymptotic optimality. For example, when the PS may assume very small values, the variance of the EIF may be large, which can cause erratic behavior of the EIF-based correction procedures. In the context of TMLE for the treatment-specific mean, this instability may manifest in the estimation of the working-model parameter ε_n: the covariate H_n in the parametric working model may take extremely large values, leading to a targeted OR estimator whose performance is considerably deteriorated relative to the initial OR estimator Q_n. Often, the analyst has little prior information that would suggest whether or not such issues are present. Thus, we are motivated to consider automated procedures for constructing OR and PS estimators that are adaptive to near positivity violations. One such proposal is CTMLE. A CTMLE is based on a sequence of PS estimators that increase in complexity. For example, in the context of estimating the treatment-specific mean, we may start with an intercept-only logistic regression model and build a sequence ranging from that simple estimator to a flexible semiparametric estimator such as kernel regression. The sequence of candidate PS estimators is used to generate a sequence of targeted ORs. The best of the targeted ORs is selected via cross-validation and is used to create a plug-in estimator. The principle underlying CTMLE is that the estimator searches for a reduced-dimension alternative to the true PS that is adaptive to how well the estimated OR fits the true OR. If the initial OR estimate provides a good fit, then there may be little benefit (or even detriment) to performing a TMLE correction based on a PS with extremely small values. On the other hand, if the initial OR is a poor fit, then we may in fact benefit from such a correction. CTMLE can adapt to each of these situations. Because CTMLE is based on a sequence of increasingly nonparametric PS estimators, the procedure will, with probability tending to one, select the last consistent estimator of the true PS in the sequence. Thus, in large samples, CTMLE is expected to behave similarly to a standard TMLE. In this respect, CTMLE may be viewed generally as a procedure that offers finite-sample improvements over standard TMLE, while maintaining its asymptotic efficiency.
The above discussion of efficiency and finite-sample considerations can be viewed through a more general lens than nonparametric estimation of the ATE. In particular, these issues apply to a broader set of problems involving observed data structures that can be represented as a coarsened-at-random (CAR) version of a full data structure; many models and parameters relevant to causal inference are special cases of this setting. The general CAR setting is discussed further in Appendix C.
3 Methods
We now propose a particular CTMLE for the treatment-specific mean that is robust to near positivity violations, but avoids the sequential PS estimator selection that is typical of other CTMLE proposals. The distinct aspect of the current proposal relative to previous CTMLE-based proposals is that we rely fully on Q_n converging to Q_0 faster than n^{-1/4} with respect to the L^2(P_0) norm. Because the OR estimator is consistent, any PS estimator will lead to a consistent estimate of the treatment-specific mean, owing to the double-robustness of the EIF. However, our procedure asks for a more stringent property of the adaptive PS estimator: we require that the resulting CTMLE be asymptotically linear, and thereby maintain a Normal limiting distribution. The challenge in so doing is that our selected PS estimator is generally inconsistent for the true PS. Previous work has shown that, even when a nonparametric estimate of the OR is consistent, inconsistent estimation of the PS can have serious implications for the behavior of one-step estimators and TMLEs (van der Laan, 2014; Benkeser et al., 2017). The issue stems from the fact that the second-order remainder is generally not asymptotically negligible. Thus, the key to achieving our goal is to choose an adaptive PS estimator such that the second-order remainder remains asymptotically negligible under reasonable conditions. In Theorem 1, we establish that this goal can be achieved by using an estimate of

g̃_0(w) := P_0(A = 1 | Q_0(W) = Q_0(w))

rather than an estimate of the true PS. In words, g̃_0 describes the probability of receiving treatment as a function of the conditional mean outcome under treatment. This adaptive PS estimate is substituted into the usual TMLE (or one-step) procedures for estimation and inference.
Our proposed CTMLE is implemented in the following steps:

1. Estimate the OR: regress Y on W amongst units observed to receive treatment to obtain the OR estimate Q_n.

2. Predict outcomes: use the estimated OR to obtain a prediction Q_n(W_i) for each observed data unit.

3. Estimate the adaptive PS: regress A on the predictions Q_n(W) to obtain the adaptive PS estimate g̃_n.

4. Predict propensities: use the estimated adaptive PS to obtain a prediction g̃_n(W_i) for each observed data unit.

5. Fit the OR working model: fit a logistic regression of the outcome on the covariate H_n(A, W) = A / g̃_n(W) with offset logit(Q_n(W)); denote by ε_n the estimated coefficient.

6. Target the OR estimate: use the OR working model to obtain a targeted prediction Q_n^*(W_i) = expit{logit(Q_n(W_i)) + ε_n / g̃_n(W_i)} for each observed data unit.

7. Compute the plug-in estimate: the CTMLE is ψ̃_n = n^{-1} Σ_i Q_n^*(W_i).
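The paper’s own implementation is the R code in Appendix F; as an informal illustration, the seven steps can be sketched in Python with simple parametric nuisance fits. The simulated data, the numpy Newton-Raphson logistic fitter, and the binary outcome (so the logistic working model applies without rescaling) are all assumptions of this sketch.

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def logit(p):
    return np.log(p / (1 - p))

def fit_logistic(X, y, offset=None, iters=100):
    """Newton-Raphson MLE for logistic regression; X carries any intercept."""
    off = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(off + X @ beta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(3)
n = 4000
W = rng.uniform(-1, 1, n)
A = rng.binomial(1, expit(W))
Y = rng.binomial(1, expit(W + A))            # binary outcome for simplicity
X = np.column_stack([np.ones(n), W])

# Steps 1-2: estimate the OR among treated units; predict for every unit.
b_or = fit_logistic(X[A == 1], Y[A == 1])
Qn = expit(X @ b_or)

# Steps 3-4: adaptive PS -- regress A on the OR predictions Qn(W); predict.
Z = np.column_stack([np.ones(n), Qn])
gn = expit(Z @ fit_logistic(Z, A))

# Step 5: logistic working model for Y with covariate A / gn, offset logit(Qn).
eps = fit_logistic((A / gn)[:, None], Y, offset=logit(Qn))[0]

# Steps 6-7: targeted OR predictions and the plug-in CTMLE.
Qstar = expit(logit(Qn) + eps / gn)
psi_ctmle = Qstar.mean()
```

With a continuous bounded outcome, Y would first be rescaled to the unit interval before the logistic fluctuation step.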
An analogous description of the collaborative one-step estimator is included in Appendix E. Sample R code to compute the estimators is included in Appendix F.
The following theorem establishes the weak convergence of the proposed estimator. We explicitly discuss regularity conditions in Appendix G and strategies for weakening these conditions in Appendix F.
Theorem 1.
Under the regularity conditions in Appendix G, ψ̃_n is an asymptotically linear estimator of ψ_0, and n^{1/2}(ψ̃_n - ψ_0) converges in distribution to a mean-zero Normal variate with variance σ̃_0^2 = Var{D(Q_0, g̃_0, ψ_0)(O)}.
The asymptotic variance σ̃_0^2 of ψ̃_n is generally smaller than the variance σ_0^2 of the standard TMLE, so that the proposed estimator is super-efficient. That is, at any fixed data distribution in M, this CTMLE will be asymptotically more efficient than the standard TMLE.
3.1 Variance estimation
We propose to estimate the standard error of ψ̃_n based on a cross-validated estimate of the variance of the influence function. Specifically, consider a K-fold cross-validation scheme, wherein the data are randomly partitioned into K blocks of approximately equal size. For k = 1, ..., K, denote by V_k the indices of units in the k-th block. For i ∈ V_k, we denote by Q_{n,k}(W_i) the predicted outcome for observation i based on an OR estimate fit when observation i was in the holdout block. That is, to obtain Q_{n,k}, we regress Y on W in units with A = 1 and i ∉ V_k; we then use this fitted regression to obtain predictions based on W_i for i ∈ V_k. Similarly, we denote by g̃_{n,k} the k-th estimated adaptive PS. This quantity is computed by regressing A on Q_{n,k}(W) in units with i ∉ V_k; this regression is then used to obtain predictions based on Q_{n,k}(W_i) for i ∈ V_k. Finally, we denote by ψ̃_{n,k} the plug-in estimate based on the empirical distribution of the k-th block. The cross-validated variance estimator is

σ_n^2 = n^{-1} Σ_{k=1}^{K} Σ_{i ∈ V_k} D(Q_{n,k}, g̃_{n,k}, ψ̃_{n,k})(O_i)^2.

Remark. For notational simplicity, we have focused on presenting an estimate of the treatment-specific mean. An estimate of the ATE can be obtained by repeating the entire procedure with the treatment labels switched. Alternatively, we also propose a CTMLE that directly targets the ATE in Appendix F. The major difference is that the adaptive PS now involves regression on both the OR in units who received treatment and the OR in units who did not receive treatment.
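As an informal illustration of the cross-validated variance estimator, here is a Python sketch. The simulated data, the parametric nuisance fits, and the fold-specific plug-in value used for centering are assumptions of the sketch.

```python
import numpy as np

def expit(x):
    return 1 / (1 + np.exp(-x))

def fit_logistic(X, y, iters=50):
    """Newton-Raphson MLE for logistic regression; X carries any intercept."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(X @ beta)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hess, X.T @ (y - p))
    return beta

rng = np.random.default_rng(4)
n, K = 3000, 5
W = rng.uniform(-1, 1, n)
A = rng.binomial(1, expit(W))
Y = rng.binomial(1, expit(W + A))
X = np.column_stack([np.ones(n), W])

folds = rng.permutation(n) % K          # random assignment to K blocks
eif = np.empty(n)
for k in range(K):
    out, hold = folds != k, folds == k
    # OR fit on out-of-fold treated units; predictions for both index sets.
    b_or = fit_logistic(X[out & (A == 1)], Y[out & (A == 1)])
    Q_hold, Q_out = expit(X[hold] @ b_or), expit(X[out] @ b_or)
    # Adaptive PS fit out-of-fold: regress A on the OR predictions.
    b_ps = fit_logistic(np.column_stack([np.ones(out.sum()), Q_out]), A[out])
    g_hold = expit(np.column_stack([np.ones(hold.sum()), Q_hold]) @ b_ps)
    # EIF evaluated on the held-out block, centered at its plug-in value.
    psi_k = Q_hold.mean()
    eif[hold] = A[hold] / g_hold * (Y[hold] - Q_hold) + Q_hold - psi_k

se = np.sqrt(np.mean(eif**2) / n)       # cross-validated standard error
```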
4 Simulations
We evaluated the performance of the proposed collaborative estimators relative to their standard counterparts in two simulation studies. We focus our presentation on comparing CTMLE and TMLE results, while a comparison of the one-step estimators is included in Appendix G. The first simulation evaluated the relative performance of CTMLE vs. TMLE as a function of sample size and strength of positivity violations. In this setting, the estimators of the OR and PS are based on correctly-specified parametric models. The results demonstrate the behavior of the proposed estimators as a function both of sample size and strength of positivity violations. The second simulation offers a competitive setting for comparing the various estimators. In this setting, both the OR and PS are highly nonlinear functions of the covariates and involve complex covariate interactions. To consistently estimate these complex functions, we utilize the highly adaptive lasso minimum loss estimator (HAL-MLE) (van der Laan, 2017; Benkeser and van der Laan, 2016). This estimator has been shown to satisfy the requisite regularity conditions of Theorem 1 under extremely mild assumptions on the true nuisance parameters. In both simulation settings, we generated and analyzed 1000 Monte Carlo replicate data sets at each sample size. We evaluated the estimators on their bias, variance, and mean-squared error. We also present visualizations of the estimated sampling distributions of the centered and scaled estimators. Further, we present the coverage probability of nominal 95% Wald-style confidence intervals based on the Monte Carlo standard deviation of the estimators (i.e., an oracle confidence interval) and based on influence-function-based standard-error estimates.
4.1 Simulation 1
For each sample size, we generated data as follows. The confounder W was an eight-dimensional vector. We drew the first seven components of W from a Uniform distribution, and the final component of W from a Bernoulli(0.5) distribution. Given W, the treatment A was drawn from a Bernoulli distribution with success probability given by a logistic model in W. Given W and A, the outcome Y was drawn from a Normal distribution with unit variance and mean linear in W and A. The true ATE in this setting is one. We induced positivity violations by choosing increasingly large values of a coefficient in the PS model. We evaluated three choices, which resulted in PSs bounded in (0.05, 0.95), (0.01, 0.99), and (0.003, 0.997), respectively. The standard TMLE and one-step estimators used a correctly-specified logistic regression for the PS and a correctly-specified linear regression for the OR. The CTMLE and collaborative one-step used a correctly-specified linear regression for the OR and HAL-MLE for the adaptive PS.
In settings with no positivity issues, we found that CTMLE and TMLE performed approximately equivalently, though CTMLE offered modest benefits at the smallest sample size (Figure 1). As the coefficient increased, propensity scores were pushed towards zero and one, and the performance of CTMLE improved relative to TMLE. CTMLE offered significant improvements in terms of both bias and variance, and was more than four times as efficient in the most extreme scenario. The sampling distribution of both estimators was well approximated by the reference asymptotic distribution, as indicated by nominal coverage of oracle confidence intervals (Figure 2, Panel A). However, while the estimated standard errors of the TMLE performed well in larger samples, the estimated standard errors of the CTMLE performed poorly, often underestimating the true variability of the estimator (Figure 2, Panel B).
Results for the collaborative one-step vs. standard one-step estimator were essentially the same as for TMLE (Appendix XXX).
4.2 Simulation 2
We based this simulation setting on the oft-cited Kang and Schafer (2007) simulation design. This design is notoriously challenging for causal effect estimators due to extremely nonlinear covariate relationships in the OR and PS and highly complex interactions between covariates. In our simulation, we drew one underlying covariate from a Uniform(0.5, 2) distribution and the remaining underlying covariates from a Uniform distribution. Given these covariates, the treatment A was drawn from a Bernoulli distribution; given the covariates and A, the outcome Y was drawn from a Normal distribution with unit variance. The true ATE in this setting is zero and the true PS is bounded between (0.004, 0.999). A challenge of this simulation setting is that the underlying covariates are not available to the analyst. Instead, we must base our estimation on a transformed covariate vector W that is generated from the underlying covariates.
As such, the true PS and OR, when expressed as functions of W, are highly nonlinear and involve interactions between the various components of W. To estimate these functions well, we require extremely flexible regression tools. Thus, we used HAL-MLE for the PS and OR in the standard TMLE and one-step estimators, and likewise used HAL-MLE for the OR and the adaptive PS in their collaborative counterparts.
As expected, both the TMLE and CTMLE struggled in this very challenging simulation study (Figure 3). While CTMLE offered modest benefits in terms of variance, the bias of the two estimators was comparable. We do see evidence of asymptotic linearity of both estimators, in that the sampling distributions of the centered and scaled estimators appear to be moving towards an appropriate center at zero. Nevertheless, both estimators had relatively large bias, even at the largest sample size, as shown by the less-than-nominal coverage of the oracle confidence intervals (Figure 4, Panel A). The cross-validated influence-function-based variance estimators overestimated the variability of the TMLE, which resulted in near-nominal coverage for those intervals (Panel B). However, as in Simulation 1, we found that the proposed variance estimator for the CTMLE significantly underestimated the variability of the estimator, resulting in poor coverage.
5 Data Analysis
We analyzed data collected via the Cebu Longitudinal Health and Nutrition Survey (CLHNS) (Adair et al., 2010). CLHNS is an ongoing study of a cohort of Filipino women who gave birth between May 1, 1983, and April 30, 1984. Children born to these women in that period have been followed through subsequent surveys over multiple years. We used these data to estimate the effect of term pregnancy (pre, full, post) on children’s schooling achievement in Cebuano, English, and mathematics. The putative confounders we considered were related to parental characteristics (maternal/paternal age, maternal height, maternal/paternal education, maternal age at first birth, maternal parity, maternal marital status), household characteristics (number of children in the household, child/adult ratio, child dependency ratio, crowding index, number of adult males/females in the household, urbanicity score, water availability, sanitation), and socioeconomic information (socioeconomic status, familial health care access, total family income). Our strategy for handling missing covariate data is described in Appendix H. Pre-, full-, and post-term pregnancies were defined as those less than 37 completed weeks, between 37 and 41 completed weeks, and greater than 41 completed weeks, respectively. Our outcome is the standardized total performance on the National Elementary Assessment Test (NEAT), collected in 1994-95. Children were scored on their achievement in Cebuano, English, and mathematics; we standardized and summed these three scores to create a composite school achievement outcome. We analyzed all children who completed these tests.
A priori, we may not expect positivity issues with these data, since previous studies have not uncovered strong predictors of birth term (Di Renzo et al., 2011). On the other hand, this same research has highlighted significant geographic variability in predictors of birth term. Thus, when proposing a pre-specified analysis in a population with little prior information available on predictors of term birth, we may worry about the potential for near positivity violations. Consequently, we may wish to select a method that hedges against such violations, like the proposed CTMLE.
We analyzed these data using TMLE, CTMLE, one-step, and collaborative one-step estimators. For the outcome regression, we used a super learner. Super learner, a generalized version of regression stacking (Wolpert, 1992; Breiman, 1996), is a cross-validation-based ensemble approach that creates a regression fit based on a weighted combination of candidate regression fits. The method is implemented in the SuperLearner R package, freely available through the Comprehensive R Archive Network (CRAN) (Polley and van der Laan, 2012). Our candidate regressions for the OR and PS included generalized linear models (function SL.glm in the SuperLearner package), polynomial multivariate regression splines (SL.earth), random forests (SL.ranger), lasso (SL.glmnet), gradient boosted regression trees (SL.gbm), an intercept-only regression (SL.mean), and a forward stepwise generalized linear model (SL.step.forward). Each super learner was based on five-fold cross-validation. We used HAL-MLE to estimate each adaptive PS utilized by the proposed CTMLE and collaborative one-step. There were three such scores: one each for the OR in pre-, full-, and post-term births. We constructed Wald-style confidence intervals using influence-function-based standard errors as described above. We also tested the null hypothesis that average schooling achievement was equivalent across the three categories of birth term using a two-degree-of-freedom Wald-style test. Finally, for comparison, we estimated each of the relevant quantities using a main-terms linear regression model, with nonparametric bootstrap confidence intervals (based on 500 bootstrap samples), and computed a p-value using a two-degree-of-freedom Wald-style test based on a sandwich variance estimator (White, 1980).

Method  Pre-term             Full-term            Post-term             p-value
TMLE    -0.03 (-0.14, 0.07)  0.02 (-0.03, 0.06)   -0.08 (-0.21, 0.06)   0.33
CTMLE   -0.03 (-0.13, 0.07)  0.02 (-0.03, 0.06)   -0.07 (-0.20, 0.06)   0.37
OS      -0.03 (-0.14, 0.08)  0.02 (-0.03, 0.06)   -0.07 (-0.21, 0.06)   0.36
COS     -0.03 (-0.13, 0.07)  0.02 (-0.03, 0.06)   -0.07 (-0.19, 0.06)   0.39
LM      -0.03 (-0.13, 0.07)  0.02 (-0.03, 0.06)   -0.18 (-0.46, 0.08)   0.27
Each of the methods provided similar point estimates, except that the estimate of the average standardized score in post-term births was lower for the linear-model-based estimator (Table 1). In each case, we fail to reject the null hypothesis of equal school achievement across birth term categories at any reasonable type-one error threshold. That the standard efficient estimators gave similar results to their collaborative counterparts is not surprising, considering that no significant positivity issues were uncovered in the analysis. In particular, the estimated probabilities of pre-, full-, and post-term birth for the observed participants were bounded between (0.12, 0.23), (0.68, 0.81), and (0.06, 0.10), respectively. The corresponding estimated adaptive PSs were bounded between (0.14, 0.17), (0.78, 0.78), and (0.07, 0.08), respectively.
6 Discussion
It has been recognized in the literature that efficient estimators such as TMLE and one-step can show erratic, non-robust behavior when the target estimand is weakly identifiable. This setting arises most often in observational studies with high-dimensional covariates and little a priori knowledge of which covariates are confounders of the treatment/outcome relationship. In these settings, various CTMLE estimators have been proposed. Relative to these existing estimators, our proposed estimator enjoys some important benefits. First, it avoids many of the decision points needed in a typical CTMLE implementation. The user need only choose a regression estimator for the OR and for the adaptive PS. Beyond those choices, standard TMLE software can be used simply by substituting an estimate of the adaptive PS for the true PS. Our theorem establishes weak convergence results similar to those previously proven for TMLE. Another potential benefit of our proposal is that, while the statistical motivation for our choice of adaptive PS is somewhat technical – ensuring that a second-order remainder of the linearized estimator is asymptotically negligible – the interpretation of the adaptive PS is easily explained to applied practitioners. For example, in a medical context, rather than estimating the true PS, which describes a patient’s propensity for receiving treatment as a function of their medical history and other confounders, we instead opt for the adaptive PS, which describes the propensity for receiving treatment as a function of treated patients’ estimated risk of disease.
We expect the proposed CTMLE to exhibit more robust finite-sample performance than standard TMLE in settings where the target estimand is weakly identifiable, and indeed the results of our first simulation explicitly demonstrated this phenomenon. On the other hand, because the proposed CTMLE is an irregular estimator, it may perform poorly for certain data-generating distributions. A more careful comparison between standard TMLE and our super-efficient CTMLE is thus warranted. First, it is relevant to note that, under lack of positivity, it may be harder to estimate the OR well, and both the small- and large-sample performance of our estimator is very much tied to the performance of the OR estimator. While recent developments such as HAL-MLE theoretically ensure that the consistency requirement is satisfied under weak conditions, there are no finite-sample guarantees of adequate performance. It therefore seems particularly important for our proposed estimator that a flexible estimation framework be employed for the OR. For example, a cross-validation-based regression stacking approach, such as the super learner used in our data analysis, may be especially well suited. Nevertheless, contrary to other CTMLE approaches that enjoy collaborative double robustness, our CTMLE is not doubly robust: we cannot compensate for an inconsistent OR estimator by using a consistent PS estimator. However, our proposed estimator appears to completely avoid the risk of a TMLE in which the targeting step degrades the performance of the initial OR estimator. As a final point of comparison, variance estimation for this CTMLE appears to be more challenging than for standard TMLE, as evidenced by the poor behavior of the confidence intervals in both simulations. It will thus be important in future work to consider alternative strategies for estimating the variance (e.g., based on a bootstrap scheme).
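The kind of cross-validation-based stacking we have in mind can be sketched as follows. The candidate library, function name, and NNLS meta-learner here are illustrative choices of ours; the SuperLearner R package used in the data analysis implements a richer version of the same idea.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor

def super_learner_weights(X, y, learners, folds=5):
    """Estimate convex stacking weights for a library of regression
    learners from their cross-validated (out-of-fold) predictions."""
    # out-of-fold predictions from each candidate learner, one column each
    Z = np.column_stack(
        [cross_val_predict(est, X, y, cv=folds) for est in learners]
    )
    # non-negative least squares meta-learning step, normalized to sum to one
    w, _ = nnls(Z, y)
    if w.sum() == 0:
        return np.full(len(learners), 1.0 / len(learners))
    return w / w.sum()
```

The cross-validated predictions prevent the meta-learning step from rewarding learners that overfit the training data; the final ensemble prediction is the weighted combination of the refit learners.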
Another strategy for providing additional robustness against poor estimation in small samples is to utilize TMLE frameworks designed for doubly robust inference (van der Laan, 2014; Benkeser et al., 2017). These estimators involve fitting additional parametric working models for both the OR and PS estimators beyond a typical TMLE implementation. The resultant effect estimator is asymptotically linear even if one of the OR or PS is inconsistently estimated (or is consistently estimated at too slow a rate). The additional fitting aims to minimize the contribution of the second-order remainder to the behavior of the estimator – essentially the same goal as with our CTMLE proposal. Even when both the OR and adaptive PS estimators are consistent at rates faster than n^{1/4}, this approach could result in finite-sample improvements for a TMLE or efficient CTMLE. We leave to future study whether and how such strategies can be used to further improve finite-sample behavior of estimators.
Overall, we conclude that the relative finite-sample performance of an efficient TMLE/CTMLE (possibly modified to accommodate doubly robust inference) and our proposed super-efficient CTMLE will depend on the particular data-generating distribution. Clearly, if one has a priori knowledge of the propensity score, in particular that it does not suffer from positivity issues, then an efficient TMLE or CTMLE strategy may be the preferable estimator. On the other hand, if the PS is poorly understood and/or the parameter is weakly identifiable, then the super-efficient CTMLE may be the better option.
In future work, we will generalize our asymptotic linearity results to the more general CAR setting described in Appendix B. Such extensions would allow us to tackle other challenging problems in causal inference, such as estimation of the counterfactual mean of a treatment administered at several time points subject to time-varying confounding.
References
Adair et al. (2010) Linda S Adair, Barry M Popkin, John S Akin, David K Guilkey, Socorro Gultiano, Judith Borja, Lorna Perez, Christopher W Kuzawa, Thomas McDade, and Michelle J Hindin. Cohort profile: the Cebu longitudinal health and nutrition survey. International Journal of Epidemiology, 40(3):619–625, 2010. doi: 10.18356/ca188036en.
Benkeser and van der Laan (2016) David Benkeser and Mark van der Laan. The highly adaptive lasso estimator. In Proceedings of the International Conference on Data Science and Advanced Analytics 2016, pages 689–696, 2016. doi: 10.1109/DSAA.2016.93.
Benkeser et al. (2017) David Benkeser, Marco Carone, M.J. van der Laan, and P.B. Gilbert. Doubly robust nonparametric inference on the average treatment effect. Biometrika, 104(4):863–880, 2017. doi: 10.1093/biomet/asx053.
Bickel et al. (1997) P.J. Bickel, C.A.J. Klaassen, Y. Ritov, and J. Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Springer, Berlin Heidelberg New York, 1997.
Breiman (1996) L. Breiman. Stacked regressions. Machine Learning, 24:49–64, 1996. doi: 10.1007/bf00117832.
Di Renzo et al. (2011) Gian Carlo Di Renzo, Irene Giardina, Alessia Rosati, Graziano Clerici, Michela Torricelli, Felice Petraglia, Italian Preterm Network Study Group, et al. Maternal risk factors for preterm birth: a country-based population analysis. European Journal of Obstetrics & Gynecology and Reproductive Biology, 159(2):342–346, 2011. doi: 10.1016/j.ejogrb.2011.09.024.
Groenwold et al. (2012) Rolf HH Groenwold, Ian R White, A Rogier T Donders, James R Carpenter, Douglas G Altman, and Karel GM Moons. Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis. Canadian Medical Association Journal, 184(11):1265–1269, 2012. doi: 10.1503/cmaj.110977.
Gruber and van der Laan (2010) S. Gruber and M.J. van der Laan. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. International Journal of Biostatistics, 6(1), 2010. doi: 10.2202/1557-4679.1182.
Ibragimov and Khasminskii (1981) I.A. Ibragimov and R.Z. Khasminskii. Statistical Estimation. Springer, 1981.
Ju et al. (2017) Cheng Ju, Richard Wyss, Jessica M Franklin, Sebastian Schneeweiss, Jenny Haggstrom, and Mark J van der Laan. Collaborative-controlled LASSO for constructing propensity score-based estimators in high-dimensional data. Statistical Methods in Medical Research, 2017. doi: 10.1177/0962280217744588.
Ju et al. (2018) Cheng Ju, Joshua Schwab, and Mark J van der Laan. On adaptive propensity score truncation in causal inference. Statistical Methods in Medical Research, 2018. doi: 10.1177/0962280218774817.
Ju et al. (2019) Cheng Ju, Susan Gruber, Samuel D Lendle, Antoine Chambaz, Jessica M Franklin, Richard Wyss, Sebastian Schneeweiss, and Mark J van der Laan. Scalable collaborative targeted learning for high-dimensional data. Statistical Methods in Medical Research, 28(2):532–554, 2019. doi: 10.1177/0962280217729845.
Kang and Schafer (2007) Joseph DY Kang and Joseph L Schafer. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539, 2007. doi: 10.1214/07-STS227.
Petersen et al. (2012) Maya L Petersen, Kristin E Porter, Susan Gruber, Yue Wang, and Mark J van der Laan. Diagnosing and responding to violations in the positivity assumption. Statistical Methods in Medical Research, 21(1):31–54, 2012. doi: 10.1177/0962280210386207.
Pfanzagl (1982) J. Pfanzagl. Contributions to a General Asymptotic Statistical Theory. Springer, 1982.
Polley and van der Laan (2012) E. Polley and M. van der Laan. Super Learner Prediction, 2012. R package version 2.0-6, available at http://cran.r-project.org/web/packages/SuperLearner/SuperLearner.pdf.
Silverman (1986) B.W. Silverman. Density Estimation. Chapman and Hall, London, UK, 1986.
Stitelman and van der Laan (2010) O.M. Stitelman and M.J. van der Laan. Collaborative targeted maximum likelihood for time-to-event data. International Journal of Biostatistics, 6(1):Article 21, 2010. doi: 10.2202/1557-4679.1249.
van der Laan and Gruber (2016) Mark van der Laan and Susan Gruber. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. The International Journal of Biostatistics, 12(1):351–378, 2016. doi: 10.1515/ijb-2015-0054.
van der Laan (2017) Mark J. van der Laan. A generally efficient targeted minimum loss based estimator based on the highly adaptive lasso. The International Journal of Biostatistics, 13(2), 2017. doi: 10.1515/ijb-2015-0097.
van der Laan et al. (2007) Mark J van der Laan, Eric C Polley, and Alan E Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007. doi: 10.2202/1544-6115.1309.
van der Laan (2014) M.J. van der Laan. Targeted estimation of nuisance parameters to obtain valid statistical inference. International Journal of Biostatistics, 10(1):29–57, 2014. doi: 10.1515/ijb-2012-0038.
van der Laan and Gruber (2010) M.J. van der Laan and S. Gruber. Collaborative double robust penalized targeted maximum likelihood estimation. International Journal of Biostatistics, 6(1):Article 17, 2010. doi: 10.2202/1557-4679.1181.
van der Laan and Rose (2011) M.J. van der Laan and S. Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, Berlin Heidelberg New York, 2011.
van der Laan and Rubin (2006) M.J. van der Laan and Daniel B. Rubin. Targeted maximum likelihood learning. International Journal of Biostatistics, 2(1), 2006. doi: 10.2202/1557-4679.1043.
van der Vaart (1998) A.W. van der Vaart. Asymptotic Statistics. Cambridge University Press, New York, 1998.
Wang et al. (2011) H. Wang, S. Rose, and M.J. van der Laan. Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning. Statistics and Probability Letters, 81(7):792–796, 2011. doi: 10.1016/j.spl.2010.11.001.
White (1980) Halbert White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4):817–838, 1980. doi: 10.2307/1912934.
Wolpert (1992) D.H. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992. doi: 10.1016/s0893-6080(05)80023-1.
Zheng and van der Laan (2011) W. Zheng and M.J. van der Laan. Cross-validated targeted minimum loss based estimation. In M.J. van der Laan and S. Rose, editors, Targeted Learning: Causal Inference for Observational and Experimental Studies. Springer, New York, 2011.
Appendix A. Regular estimators
We can view an estimator of as the byproduct of an algorithm applied to independent observations drawn from a distribution . We consider a fluctuation , where is an index set, satisfying
with and . We say that is a regular estimator of at if, for any such fluctuation, the estimator of the parameter obtained by applying the algorithm on independent observations , …, drawn from is such that converges in distribution to a random variate whose distribution does not depend on . See Chapter 24 of van der Vaart (1998) for more on regularity in the context of efficient estimation.
Appendix B. Linearization of plugin estimator
Below, we make use of empirical process notation, writing $Pf$ to denote $\int f \, dP$ for a given integrable function $f$ and distribution $P$. We also denote by $P_n$ the empirical distribution based on the observations, so that $P_n f = n^{-1} \sum_{i=1}^{n} f(O_i)$. A linearization of the parameter allows us to write
where is defined in (3). We assume:
(i) $P_0 D(\bar{Q}, g) = 0$;
(ii) $P_n D(\bar{Q}_n^\star, g_n) = o_P(n^{-1/2})$;
(iii) $(P_n - P_0)\{D(\bar{Q}_n^\star, g_n) - D(\bar{Q}, g)\} = o_P(n^{-1/2})$.
Assumption (i) stipulates that the EIF has mean zero, which is satisfied, for example, if either or . Assumption (ii) stipulates that the EIF estimating equation is approximately solved. Because is a targeted estimate of , this assumption is trivially satisfied, since . The third assumption is satisfied if, for example, converges in probability to zero and falls in a Donsker class with probability tending to 1.
Appendix C. Generalization to coarsened at random (CAR) data structures
In this Appendix, we describe how the ideas presented in the main text generalize to coarsened at random (CAR) data structures. Suppose we observe i.i.d. copies of for a full-data random variable , censoring variable , and a many-to-one mapping . We assume that the conditional distribution of , given , satisfies the CAR assumption. Our model for the observed data is implied by the full-data model for the full-data distribution and censoring model for the true conditional distribution of censoring. Thus, the data distribution is parameterized as . A full-data target parameter is defined as a mapping so that is the target estimand of interest. We assume that is pathwise differentiable at with EIF relative to this model . We define the exact second-order remainder as
We assume that the full-data parameter is identifiable, so that there exists a , where is the parameter space, such that for all . The statistical estimand is thereby defined as . We assume that is a pathwise differentiable function of with EIF under relative to , . Such parameters allow an expansion , where
is the exact second-order remainder for the observed data target parameter.
Suppose that the target estimand depends on through a functional , and that we have a loss function such that , where is the parameter space for . Therefore, with an abuse of notation, we may denote the statistical target parameter as , i.e., a function of . Though is a parameter of the observed data distribution, it is also a parameter of the full-data distribution , so it can be viewed both as a mapping as well as . In addition, let this parameter be chosen so that the EIF of the target parameter under relative to the model depends on the observed data distribution through and the censoring mechanism , so that we can parameterize the EIF as . Similarly, we may denote the second-order remainder by to emphasize that it involves a difference between and .
Given an initial estimator of and of , a TMLE of defines a least favorable parametric submodel through so that the loss-based score at along this path spans the estimated EIF . Such a path is called a local least favorable path, while a universal least favorable path requires that this score at any equals (van der Laan and Gruber, 2016). The TMLE estimates the fluctuation parameter of this least favorable path with standard minimum loss estimation , and the corresponding one-step TMLE of the target estimand is the corresponding substitution estimator , where . If the TMLE uses a universal least favorable submodel, then (van der Laan and Gruber, 2016), but, even when one uses a local least favorable submodel, if and are consistent estimators, then, under regularity conditions, (see the theorem in the Appendix of van der Laan (2017)). Thus, by similar arguments to those in Appendix A (i.e., appealing to consistency of the EIF),
where is the in-probability limit of and is the in-probability limit of . As in the case of the treatment-specific mean, we can generally argue that is asymptotically negligible whenever both and are in norm. Under these conditions, the TMLE is asymptotically efficient.
Collaborative TMLE is a general tool whereby an estimator of the censoring mechanism is constructed sequentially. Starting with an initial estimator, we iteratively select an update among a set of candidates by maximizing the gain in empirical risk during the corresponding TMLE step. Having built a sequence of such candidate censoring mechanism estimators (and corresponding TMLEs), we evaluate the cross-validated risk of each candidate to select the best TMLE among the candidate TMLEs of . The principle of CTMLE is that, by maximizing the gain in empirical risk during the TMLE step, we are targeting a dimension-reduced quantity that adjusts for a rich enough set of variables so that the TMLE stays consistent even when is inconsistent. There is a whole class of such targets that will result in this consistency. This preservation of consistency of the CTMLE for a that converges to a with possibly and , is referred to as collaborative double robustness (van der Laan and Gruber, 2010; Gruber and van der Laan, 2010).
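The candidate-selection step in this scheme, choosing among candidate estimators by cross-validated risk, can be sketched generically as follows. The interface is hypothetical and of our own choosing: in an actual CTMLE the candidates would be censoring-mechanism estimators paired with their TMLE updates and the loss would be the targeted loss function.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_by_cv_risk(candidates, X, y, loss, folds=5):
    """Return the index of the candidate fitting routine with the smallest
    cross-validated risk. Each candidate maps (X_train, y_train) to a
    fitted object with a .predict method (a hypothetical interface)."""
    risks = []
    for fit in candidates:
        fold_losses = []
        # estimate out-of-sample risk by averaging the loss over held-out folds
        for tr, te in KFold(folds, shuffle=True, random_state=0).split(X):
            est = fit(X[tr], y[tr])
            fold_losses.append(loss(y[te], est.predict(X[te])))
        risks.append(float(np.mean(fold_losses)))
    return int(np.argmin(risks))
```

Selecting on cross-validated rather than empirical risk is what protects the procedure from always preferring the most aggressive candidate.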
Our proposed CTMLE for the treatment-specific mean avoids iteratively building the fit of the censoring mechanism based on the gain during the TMLE step, but fully relies on being consistent. This strategy may be generalized to the broader CAR setting by choosing an adaptive target so that the second-order remainder
Such a CTMLE is asymptotically linear with influence function equal to , which will generally have smaller variance than the EIF . We leave to future work the identification of general strategies for selecting adaptive censoring mechanism target parameters.
Appendix D. Additional estimators
Collaborative one-step estimator
The collaborative one-step estimator can be implemented in the following steps:
1. estimate OR: regress on amongst units observed to receive treatment to obtain OR estimate ;
2. predict outcome: use the estimated OR to obtain a prediction for each observed data unit, ;
3. estimate adaptive PS: regress on the predictions to obtain adaptive PS estimate ;
4. predict PS: use the estimated PS to obtain a prediction for each observed data unit, ;
5. evaluate influence function: use the estimated OR and PS to compute for ;
6. compute one-step estimate: .
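The steps above can be sketched numerically as follows, for the treatment-specific mean with a binary treatment A, covariates W, and outcome Y. This is a minimal sketch under our own simplifying assumptions: plain parametric regressions stand in for the flexible learners (super learner, HAL-MLE) used in the paper, and the function name is illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def collaborative_one_step(W, A, Y):
    """Collaborative one-step estimate of the treatment-specific mean E[Y(1)],
    following the six steps above with illustrative parametric regressions."""
    # steps 1-2: estimate the OR among treated units, predict for everyone
    Qbar = LinearRegression().fit(W[A == 1], Y[A == 1]).predict(W)
    # steps 3-4: adaptive PS regresses treatment on the OR predictions,
    # rather than on the covariates themselves
    Z = Qbar.reshape(-1, 1)
    g = LogisticRegression().fit(Z, A).predict_proba(Z)[:, 1]
    # step 5: evaluate the estimated influence function at each observation
    plug_in = Qbar.mean()
    eif = A / g * (Y - Qbar) + Qbar - plug_in
    # step 6: one-step estimate is the plug-in plus the mean of the EIF
    return plug_in + eif.mean()
```

Note that the adaptive PS is a function of a single scalar (the OR prediction), which is what sidesteps positivity problems in the full covariate space.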
Collaborative TMLE that targets the ATE directly
Here we propose a TMLE that can be used to directly estimate the ATE. For each , we denote by