A nonparametric super-efficient estimator of the average treatment effect

by David Benkeser, et al.

Doubly robust estimators are a popular means of estimating causal effects. Such estimators combine an estimate of the conditional mean of the outcome given treatment and confounders (the so-called outcome regression) with an estimate of the conditional probability of treatment given confounders (the propensity score) to generate an estimate of the effect of interest. Beyond the double-robustness property itself, these estimators have additional benefits. First, flexible regression tools, such as those developed in the field of machine learning, can be used to estimate the relevant regressions, while the estimators of the treatment effects retain desirable statistical properties. Furthermore, these estimators are often statistically efficient, achieving the lower bound on the variance of regular, asymptotically linear estimators. However, in spite of their asymptotic optimality, in problems where causal estimands are weakly identifiable, these estimators may behave erratically. We propose two new estimation techniques for use in these challenging settings. Our estimators build on two existing frameworks for efficient estimation: targeted minimum loss estimation and one-step estimation. However, rather than using an estimate of the propensity score in their construction, we instead opt for an alternative regression quantity when building our estimators: the conditional probability of treatment given the conditional mean outcome. We discuss the theoretical implications and demonstrate the estimators' performance in simulated and real data.







1 Introduction

In many areas of research, the scientific question of interest is often answered by drawing statistical inference about the average effect of a treatment on an outcome. Depending on the setting, this “treatment” might correspond to an actual therapeutic treatment, a harmful exposure, or a policy intervention. We use Y(1) to denote the potential outcome of a typical data unit sampled from the population of interest when the unit receives the treatment of interest, and Y(0) to denote the potential outcome if that unit instead receives control. In this work, we focus on estimation of the average treatment effect (ATE), the average difference between Y(1) and Y(0) in the population of interest.

Often, due to ethical or logistical constraints, we cannot randomly assign data units to receive/not receive the treatment. Thus, in order to draw valid conclusions about the ATE, we require statistical methodology that can control for confounders of the treatment/outcome relationship. Although epidemiological and other applied literatures are still considering the relative merits of various methodologies, the statistical literature has provided direction through consideration of semiparametric efficient methods. The literature provides many examples of such estimators. In some situations, regularized or sieve maximum likelihood estimators can be used (e.g., van der Vaart (1998)); however, this generally requires careful selection of tuning parameters. Existing literature lacks general guidelines for how such parameters can be chosen in practice, which limits the utility of these strategies. On the other hand, methods that are built around a causal effect parameter’s efficient influence function offer a more straightforward approach to estimation. Foremost amongst these approaches are one-step estimation (Ibragimov and Khasminskii, 1981; Pfanzagl, 1982; Bickel et al., 1997) and targeted minimum loss estimation (TMLE) (van der Laan and Rubin, 2006; van der Laan and Rose, 2011).

The efficient influence function often depends on the observed data distribution through certain key nuisance parameters. In the context of the ATE, these nuisance parameters are the conditional mean of the outcome given treatment and confounders (the so-called outcome regression, OR), the conditional probability of treatment given confounders (the so-called propensity score, PS), and the distribution function of confounders in the population of interest. Once estimators of these key quantities are available, each methodology provides its own recipe for combining the relevant nuisance estimators into an estimate of the causal effect of interest. Assuming the nuisance estimators satisfy certain regularity conditions, the resulting estimators of the ATE, when appropriately centered and scaled, converge in distribution to a mean-zero Gaussian variate with variance equal to the semiparametric Cramer-Rao efficiency bound for regular estimators. In addition to efficiency, these estimators are also doubly-robust, meaning that they are consistent for the causal effect of interest even if one of the OR or PS is inconsistently estimated.

One of the key assumptions underlying any methodology for estimating the ATE using observational data is the strong positivity assumption, which stipulates that the PS must be bounded away from zero and one almost everywhere. That is, if we define strata of data units based on their observed confounders, any stratum with positive support must have some probability of both receiving and not receiving the treatment. If this condition fails, then the ATE is not estimable from the observed data. Moreover, even if the condition holds theoretically, in practice there may be small estimated propensity scores – so-called practical violations of the positivity assumption (Petersen et al., 2012). In such cases, the one-step estimator and TMLE can suffer from erratic, non-robust behavior.
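To illustrate the practical consequence, the following short simulation (our own illustration; all variable names are ours) compares the variance of the IPW-style correction term that appears in EIF-based estimators under mild versus severe practical positivity violations:

```python
import numpy as np

def correction_term_variance(gamma, n=5000, seed=1):
    """Variance of the IPW-style correction (A / g(W)) * (Y - E[Y | A=1, W])
    when the propensity is g(W) = expit(gamma * W); larger gamma pushes
    g(W) toward 0 and 1 (a practical positivity violation)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1, 1, n)
    g = 1.0 / (1.0 + np.exp(-gamma * W))
    A = rng.binomial(1, g)
    Y = W + rng.normal(size=n)   # outcome with E[Y | A=1, W] = W
    return np.var(A / g * (Y - W))

mild = correction_term_variance(gamma=1.0)    # PS well inside (0, 1)
severe = correction_term_variance(gamma=8.0)  # PS near 0 and 1 for many units
print(mild, severe)  # the severe setting has far larger variance
```

Even with the outcome regression known exactly, the rare treated units with tiny propensities receive enormous weights, inflating the variance of the correction term by orders of magnitude.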

To combat this behavior, various extensions have been proposed including collaborative TMLE (CTMLE) (van der Laan and Gruber, 2010; Gruber and van der Laan, 2010; Stitelman and van der Laan, 2010; Wang et al., 2011). In this approach the OR and PS are estimated collaboratively, by selecting an estimate of the PS based on how well it tunes an estimate of the OR. From a theoretical point of view, the goal of CTMLE is generally to provide an estimator that is more robust than TMLE, but that nevertheless maintains TMLE’s asymptotic efficiency. CTMLE enjoys additional theoretical benefits in settings where the OR is inconsistently estimated. In this case, the CTMLE enjoys a more general robustness property than TMLE referred to as collaborative double-robustness (van der Laan and Gruber, 2010). Many proposed CTMLEs are designed specifically for settings with practical positivity violations, for example by: choosing a truncation level for estimated propensities, selecting variables to be included in the PS, or tuning particular machine learning algorithms (Ju et al., 2019, 2017, 2018). These works show that CTMLE can provide greater robustness than TMLE in challenging situations.

In spite of the putative benefits of CTMLE, these estimators are not widely used. There may be several reasons why. First, the approach involves many decision points for the analyst, who must select an increasingly complex sequence of estimators for the PS and implement each of these estimators. Second, the approach often involves extended computation time relative to traditional doubly-robust methods. In particular, cross-validation is needed to choose the PS estimator from amongst the user-chosen sequence. Third, from a theoretical perspective, performing robust inference based on existing CTMLE approaches is also challenging, involving either strong assumptions on nuisance estimators (see Appendix 17 of van der Laan and Rose (2011)) or additional iterative computational steps (van der Laan, 2014).

In this work, we seek to overcome these limitations by proposing a new approach to CTMLE. The twist in the present proposal relative to existing CTMLE approaches is that we assume the OR estimator is consistent at a fast rate. Assuming this consistency, we provide an alternative target for a PS estimator that is explicitly adaptive to the OR. In particular, rather than estimating the true PS, we recommend estimating the propensity for receiving treatment as a function of the estimated OR. This low-dimensional regression can be substituted in place of an estimator of the true PS in a standard TMLE or one-step procedure. We show that, when appropriately centered and scaled, the resultant estimator is asymptotically Normal with variance that is generally smaller than that of an efficient estimator. Thus, our proposal provides a new approach to CTMLE that is tailored for both small- and large-sample performance.

2 Background

2.1 Identification of ATE

Suppose we observe n independent copies of the data unit O = (W, A, Y), where W is a vector of putative confounders, A ∈ {0, 1} is a binary treatment, and Y is the outcome of interest. Our assumption that Y is bounded in [0, 1] does not sacrifice any generality of our proposed methodology, as a bounded outcome can always be rescaled. We denote by P_0 the probability distribution of O and assume that P_0 is an element of a statistical model M. We take M to be a nonparametric model.

As above, we use Y(1) and Y(0) to denote the counterfactual outcomes under treatment and no treatment, respectively. For a ∈ {0, 1}, we denote by P_a the probability distribution of Y(a). The ATE is defined as E[Y(1)] − E[Y(0)], the difference in average outcome if the entire population were assigned to receive treatment versus control. The ATE is identifiable from the observed data under the following assumptions: consistency: Y = Y(A); no interference: the outcome of one unit does not depend on the treatment assigned to any other unit; ignorability: (Y(0), Y(1)) is independent of A given W; and positivity: P_0(A = a | W) > 0 almost surely, for a ∈ {0, 1}. The first two assumptions are needed in order to have well-defined counterfactual random variables, while the ignorability condition essentially states that there are no unmeasured confounders of A and Y. As mentioned in the introduction, the positivity criterion stipulates that every unit has a chance of receiving both treatment and control. If these assumptions hold, the average treatment effect is identified by the G-computation formula

E_0{ E_0(Y | A = 1, W) − E_0(Y | A = 0, W) }.


2.2 Estimators of the ATE

For simplicity, we hence consider estimation of ψ_0 = E_0{ E_0(Y | A = 1, W) }, which we refer to as the treatment-specific mean. Symmetric arguments apply to the control-specific mean E_0{ E_0(Y | A = 0, W) }, and thus to the ATE. Hereafter, when we refer to the OR, it is understood that we are referring to the quantity Qbar_0(w) = E_0(Y | A = 1, W = w), the regression of Y on W amongst units observed to receive the treatment.

For each w, we denote by Qbar_n(w) an estimate of the true OR Qbar_0(w) based on the observed data. We use Q to denote the model for the OR implied by M; that is, Q is the collection of all possible ORs allowed by the model M. Similarly, for each w, we denote by g_0(w) = P_0(A = 1 | W = w) the true PS evaluated at w, by g_n(w) an estimate of g_0(w), and by G the model for the PS implied by M. Finally, we denote by Q_{W,0} the distribution function of the vector of confounders. In the remainder, we use the empirical distribution Q_{W,n}, which places mass 1/n on each observed confounder vector W_i, as an estimate of Q_{W,0}. We denote by Q_W the model for the confounder distribution implied by M. It is useful to our discussion to regard the parameter of interest as a mapping Ψ from Q × Q_W to the real line. That is, given an OR Qbar and a confounder distribution Q_W, Ψ(Qbar, Q_W) is the value of the treatment-specific mean implied by Qbar and Q_W. Thus, denoting by Qbar_0 and Q_{W,0} the true values of these quantities, we have ψ_0 = Ψ(Qbar_0, Q_{W,0}).

We remind readers that a regular (see Appendix A) estimator ψ_n of ψ_0 is asymptotically linear if and only if ψ_n − ψ_0 behaves approximately as an empirical mean of a mean-zero, finite-variance function of the observed data. This function is referred to as the estimator’s influence function. Depending on the chosen model, there may be a large class of possible influence functions of regular estimators. The influence function in this class that has the smallest variance is referred to as the efficient influence function (EIF). Any estimator with influence function equal to the EIF is said to be efficient amongst regular, asymptotically linear estimators. Given an OR Qbar, a PS g, a confounder distribution Q_W, and a typical observation o = (w, a, y), we define the efficient influence function for the treatment-specific mean relative to M,

D*(Qbar, g, Q_W)(o) = (a / g(w)) { y − Qbar(w) } + Qbar(w) − Ψ(Qbar, Q_W).

Several frameworks exist for constructing estimators with a user-specified influence function. By selecting the EIF, these frameworks can be used to generate efficient estimators. We focus on the one-step and targeted minimum loss estimation frameworks. For our discussion, it is useful to introduce the idea of a plug-in estimator. We denote by Qbar_n an estimate of Qbar_0. A plug-in estimate has the form Ψ(Qbar_n, Q_{W,n}). The one-step estimator applies an EIF-based correction to the plug-in estimate,

ψ_n,+ = Ψ(Qbar_n, Q_{W,n}) + (1/n) Σ_i D*(Qbar_n, g_n, Q_{W,n})(O_i).

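Concretely, for the treatment-specific mean, the one-step estimator is the plug-in estimate plus the empirical mean of the estimated EIF. A minimal sketch (our own illustrative code; `Qbar1` and `g` stand for the estimated OR and PS evaluated at each observed confounder vector):

```python
import numpy as np

def one_step(Y, A, Qbar1, g):
    """One-step estimator of the treatment-specific mean E[E(Y | A=1, W)].

    Qbar1: estimated OR E(Y | A=1, W) evaluated at each W_i
    g:     estimated PS P(A=1 | W) evaluated at each W_i
    """
    plug_in = np.mean(Qbar1)
    # EIF-based correction: empirical mean of (A / g) * (Y - Qbar1);
    # the remaining piece of the EIF has empirical mean zero.
    return plug_in + np.mean(A / g * (Y - Qbar1))

# Toy check with the true nuisance values plugged in.
rng = np.random.default_rng(0)
n = 20000
W = rng.uniform(size=n)
g0 = 0.2 + 0.6 * W                         # true PS
A = rng.binomial(1, g0)
Y = W + A + rng.normal(scale=0.1, size=n)  # true OR at A=1 is W + 1
print(one_step(Y, A, W + 1, g0))           # close to E[W] + 1 = 1.5
```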
TMLE is a general framework for constructing plug-in estimators that satisfy, possibly several, user-specified equations. The framework is implemented in two steps. First, initial estimators of relevant nuisance parameters (e.g., the OR and PS) are generated using a user-chosen technique. Subsequently, the initial estimates are carefully modified such that (i) the modified estimators inherit desirable properties of the initial estimators (e.g., their rate of convergence); and (ii) relevant, user-specified equations are satisfied. For the present problem, a TMLE can be implemented by first generating an initial estimate Qbar_n of the OR and g_n of the PS. The OR estimator is subsequently updated to a targeted estimator Qbar_n*, such that the EIF estimating equation,

(1/n) Σ_i D*(Qbar_n*, g_n, Q_{W,n})(O_i) = 0,

is satisfied. This can be achieved, for example, by defining a logistic regression working model for the OR with logit(Qbar_n) as an offset, no intercept term, and a single covariate H_n. For each a and w, we define this covariate as H_n(a, w) = a / g_n(w). The maximum likelihood estimator (MLE) ε_n of the regression coefficient associated with the covariate is estimated (e.g., via iteratively re-weighted least squares). For each w, we define the so-called targeted OR estimator, Qbar_n*(w) = expit{ logit(Qbar_n(w)) + ε_n / g_n(w) }. It is straightforward to show that the score of the coefficient ε, evaluated at a typical observation o, equals (a / g_n(w)) { y − Qbar_{n,ε}(w) }; since the empirical mean of the score is zero at the MLE and the remaining portion of the EIF has empirical mean zero by construction of Q_{W,n}, we may deduce that the EIF estimating equation is satisfied by the updated OR estimate Qbar_n*. The TMLE of the treatment-specific mean is computed as the plug-in estimator based on the modified OR estimator, Ψ(Qbar_n*, Q_{W,n}).
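The targeting step just described can be sketched as follows (a sketch of our own, assuming a bounded outcome scaled to [0, 1]; we compute the MLE of the fluctuation coefficient by Newton's method rather than iteratively re-weighted least squares):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def tmle(Y, A, Qbar1, g, newton_steps=50):
    """TMLE of the treatment-specific mean, assuming Y in [0, 1].

    Fluctuates the initial OR along a logistic working model with
    offset logit(Qbar1) and covariate H = A / g, then plugs in."""
    Qbar1 = np.clip(Qbar1, 1e-6, 1 - 1e-6)
    offset = np.log(Qbar1 / (1 - Qbar1))
    H = A / g
    eps = 0.0
    for _ in range(newton_steps):            # MLE of the fluctuation coefficient
        p = expit(offset + eps * H)
        score = np.mean(H * (Y - p))         # mean score; zero at the MLE
        info = np.mean(H**2 * p * (1 - p))
        eps += score / info
    Qbar1_star = expit(offset + eps / g)     # targeted OR at A = 1
    return np.mean(Qbar1_star)               # plug-in TMLE

# Toy check with binary outcome and true nuisances.
rng = np.random.default_rng(2)
n = 20000
W = rng.uniform(size=n)
g0 = 0.3 + 0.4 * W
A = rng.binomial(1, g0)
Y = rng.binomial(1, 0.2 + 0.3 * W + 0.3 * A)  # E(Y | A=1, W) = 0.5 + 0.3 W
print(tmle(Y, A, 0.5 + 0.3 * W, g0))          # close to 0.5 + 0.3 * E[W] = 0.65
```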

We remark that both the one-step and TMLE frameworks can be seen as first generating an initial estimate based on the OR and subsequently applying a correction procedure that involves an estimate of the PS. This view of the estimators is useful to our discussion below.

2.3 Large-sample theory and small-sample considerations

We hence focus discussion on TMLE, with the understanding that similar arguments apply to the one-step estimator. The following expansion (derived, with assumptions, in Appendix B) is useful for studying the large-sample behavior of a TMLE ψ_n,

ψ_n − ψ_0 = (1/n) Σ_i D*(Qbar_0, g, Q_{W,0})(O_i) + R_n + o_P(n^{-1/2}),     (2)

where g is the in-probability limit of g_n. The term R_n is a second-order remainder that involves a product of differences between Qbar_n* and Qbar_0 and between g_n and g_0,

R_n = ∫ { (g_n(w) − g_0(w)) / g_n(w) } { Qbar_n*(w) − Qbar_0(w) } dQ_{W,0}(w).

A key step in establishing asymptotic linearity of a TMLE is showing that R_n is asymptotically negligible, i.e., R_n = o_P(n^{-1/2}). This requirement is satisfied if, for example, both Qbar_n* − Qbar_0 and g_n − g_0 are o_P(n^{-1/4}) with respect to the L²(Q_{W,0}) norm. If so, equation (2) implies that the TMLE is asymptotically linear with influence function equal to the EIF and thus, by definition, the TMLE is efficient. Moreover, the central limit theorem implies that n^{1/2}(ψ_n − ψ_0) converges in distribution to a mean-zero normal variate with variance equal to the variance of the EIF.

We note an additional interesting feature of efficient estimators of the treatment-specific mean: they are doubly-robust. That is, the estimated treatment-specific mean is consistent for the true treatment-specific mean if either the estimated OR consistently estimates the true OR, the estimated PS consistently estimates the true PS, or both estimators are consistent.

As with any asymptotic analysis, these results provide conditions under which estimators are well-behaved in large samples, but provide no guarantees of small-sample performance. In particular, in settings where the target estimand is weakly identifiable, in spite of their asymptotic optimality, the TMLE and one-step estimator may be unstable. For example, when the PS may assume very small values, the variance of the EIF may be large, which can cause erratic behavior of the EIF-based correction procedures. In the context of TMLE for the treatment-specific mean, this instability may manifest in the estimation of the working model coefficient ε. The covariate a / g_n(w) in the parametric working model may take extremely large values, leading to a targeted OR estimator whose performance is considerably deteriorated relative to the initial OR estimator.

Often, the analyst has little prior information that would suggest whether or not such issues are present. Thus, we are motivated to consider automated procedures for constructing OR and PS estimators that are adaptive to near positivity violations. One such proposal is CTMLE. A CTMLE is based on a sequence of PS estimators that increase in complexity. For example, in the context of estimating the treatment-specific mean, we may start our sequence with an intercept-only logistic regression model and build a sequence ranging from that simple estimator to a flexible semiparametric estimator such as kernel regression. The sequence of candidate PS estimators of g_0 is used to generate a sequence of targeted ORs. The best of the targeted ORs is selected via cross-validation and is used to create a plug-in estimator. The principle underlying CTMLE is that the estimator searches for a reduced-dimension alternative to the true PS that is adaptive to how well the estimated OR fits the true OR. If the initial OR estimate provides a good fit, then there may be little benefit (or even detriment) to performing a TMLE correction based on a PS with extremely small values. On the other hand, if the initial OR is a poor fit, then we may in fact benefit from such a correction. CTMLE can adapt to each of these situations. Because CTMLE is based on a sequence of increasingly nonparametric PS estimators, the procedure will, with probability tending to one, select a consistent estimator of the true PS in the sequence. Thus, in large samples, CTMLE is expected to have similar behavior to a standard TMLE. In this respect, CTMLE may be generally viewed as a procedure that offers finite-sample improvements over standard TMLE, while maintaining its asymptotic efficiency.
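A stripped-down sketch of this selection principle (our own illustration; a full CTMLE would also refit the OR and candidate PS estimators within each training fold and build the candidate sequence greedily, which we omit for brevity):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_eps(Y, A, offset, H, iters=25):
    """MLE of the fluctuation coefficient via Newton's method."""
    eps = 0.0
    for _ in range(iters):
        p = expit(offset + eps * H)
        eps += np.mean(H * (Y - p)) / np.mean(H**2 * p * (1 - p))
    return eps

def ctmle_select(Y, A, Qbar1, candidate_gs, K=5, seed=0):
    """Given an initial OR prediction Qbar1 and candidate PS estimates
    (ordered from simple to flexible), pick the candidate whose targeted
    OR attains the smallest cross-validated loss amongst treated units."""
    folds = np.random.default_rng(seed).integers(0, K, size=len(Y))
    Qbar1 = np.clip(Qbar1, 1e-6, 1 - 1e-6)
    offset = np.log(Qbar1 / (1 - Qbar1))
    losses = []
    for g in candidate_gs:
        H, loss = A / g, 0.0
        for k in range(K):
            tr, te = folds != k, folds == k
            eps = fit_eps(Y[tr], A[tr], offset[tr], H[tr])
            p = expit(offset[te] + eps * H[te])
            loss += np.sum(A[te] * (Y[te] - p) ** 2)
        losses.append(loss)
    return int(np.argmin(losses))

# Toy example: choose between an intercept-only PS and the true PS.
rng = np.random.default_rng(6)
n = 4000
W = rng.uniform(size=n)
g0 = 0.2 + 0.6 * W
A = rng.binomial(1, g0)
Y = rng.binomial(1, 0.3 + 0.4 * W)
choice = ctmle_select(Y, A, np.full(n, np.mean(Y[A == 1])),
                      [np.full(n, np.mean(A)), g0])
print(choice)
```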

The above discussion of efficiency and finite-sample considerations can be viewed through a more general lens than nonparametric estimation of the ATE. In particular, these issues apply to a broader set of problems involving observed data structures that can be represented as a coarsened at random (CAR) version of a full data structure, and many models and parameters relevant to causal inference are special cases of this setting. The general CAR setting is discussed further in Appendix C.

3 Methods

We now propose a particular CTMLE for the treatment-specific mean that is robust to near positivity violations, but avoids the sequential PS estimator selection that is typical of other CTMLE proposals. The distinct aspect of the current proposal relative to previous CTMLE-based proposals is that we rely fully on the OR estimate Qbar_n converging to Qbar_0 faster than n^{-1/4} with respect to the L²(Q_{W,0})-norm. Because the OR estimator is consistent, any PS estimator will lead to a consistent estimate of the treatment-specific mean, due to the double-robustness of the EIF. However, our procedure asks for a more stringent property on the adaptive PS estimator. We require that the resulting CTMLE be asymptotically linear, and thereby maintain a Normal limiting distribution. The challenge in so doing is that our selected PS estimator is generally inconsistent for the true PS. Previous work has shown that, even when a nonparametric estimate of the OR is consistent, inconsistent estimation of the PS can have serious implications for the behavior of one-step estimators and TMLEs (van der Laan, 2014; Benkeser et al., 2017). The issue stems from the fact that the second-order remainder R_n is generally not asymptotically negligible. Thus, the key in achieving our goal is to choose an adaptive PS estimator such that the second-order remainder remains asymptotically negligible under reasonable conditions. In Theorem 1, we establish that this goal can be achieved by using an estimate of the conditional probability of treatment given the conditional mean outcome, rather than an estimate of the true PS. In words, this adaptive PS describes the probability of receiving treatment as a function of the conditional mean outcome under treatment. The adaptive PS estimate is substituted into the usual TMLE (or one-step) procedures for estimation and inference.

Our proposed CTMLE is implemented in the following steps:

  1. estimate OR: regress Y on W amongst units observed to receive treatment to obtain the OR estimate Qbar_n;

  2. predict outcome: use the estimated OR to obtain a prediction Qbar_n(W_i) for each observed data unit;

  3. estimate adaptive PS: regress A on the predictions Qbar_n(W) to obtain the adaptive PS estimate g̃_n;

  4. predict PS: use the estimated adaptive PS to obtain a prediction g̃_n(W_i) for each observed data unit;

  5. fit OR working model: fit a logistic regression of the outcome on the covariate A / g̃_n(W) with offset logit(Qbar_n(W)); denote by ε_n the estimated coefficient;

  6. target OR estimate: use the OR working model to obtain a targeted prediction Qbar_n*(W_i) = expit{ logit(Qbar_n(W_i)) + ε_n / g̃_n(W_i) } for each observed data unit;

  7. compute plug-in estimate: the CTMLE is the empirical mean (1/n) Σ_i Qbar_n*(W_i).
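The seven steps above can be sketched end-to-end as follows (our own illustrative Python; the paper's Appendix F provides R code, and it recommends flexible estimators such as HAL-MLE where this sketch uses ordinary least squares for the OR and a one-covariate logistic regression for the adaptive PS):

```python
import numpy as np

def expit(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_fit(x, a, iters=25):
    """One-covariate logistic regression (intercept + slope) via Newton."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        p = expit(X @ beta)
        beta += np.linalg.solve((X * (p * (1 - p))[:, None]).T @ X,
                                X.T @ (a - p))
    return lambda xnew: expit(beta[0] + beta[1] * xnew)

def adaptive_ctmle(Y, A, W):
    """Sketch of the proposed CTMLE for the treatment-specific mean.

    Assumes Y takes values in [0, 1]; all regression choices here are
    illustrative placeholders for flexible estimators."""
    X = np.column_stack([np.ones(len(Y)), W])
    # steps 1-2: estimate OR amongst treated units, predict for all units
    coef, *_ = np.linalg.lstsq(X[A == 1], Y[A == 1], rcond=None)
    Qbar1 = np.clip(X @ coef, 1e-6, 1 - 1e-6)
    # steps 3-4: adaptive PS -- regress A on the OR predictions, not on W
    g_tilde = np.clip(logistic_fit(Qbar1, A)(Qbar1), 1e-3, 1 - 1e-3)
    # step 5: logistic working model, offset logit(Qbar1), covariate A/g_tilde
    offset = np.log(Qbar1 / (1 - Qbar1))
    H = A / g_tilde
    eps = 0.0
    for _ in range(25):                       # Newton steps for the MLE
        p = expit(offset + eps * H)
        eps += np.mean(H * (Y - p)) / np.mean(H**2 * p * (1 - p))
    # steps 6-7: targeted OR and plug-in estimate
    return np.mean(expit(offset + eps / g_tilde))

# Toy data: linear OR, so the least-squares OR estimate is consistent.
rng = np.random.default_rng(4)
n = 20000
W = rng.uniform(size=n)
A = rng.binomial(1, 0.3 + 0.4 * W)
Y = rng.binomial(1, 0.1 + 0.4 * W + 0.2 * A)  # E(Y | A=1, W) = 0.3 + 0.4 W
print(adaptive_ctmle(Y, A, W))                # close to 0.3 + 0.4/2 = 0.5
```

Note that the adaptive PS is fit on a one-dimensional feature (the OR predictions), so it is far less prone to producing extreme values than a PS fit on the full confounder vector.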

An analogous description of the collaborative one-step estimator is included in Appendix E. Sample R code to compute the estimators is included in Appendix F.

The following theorem establishes the weak convergence of the proposed estimator. We explicitly discuss regularity conditions in Appendix G and strategies for weakening these conditions in Appendix F.

Theorem 1.

Under the regularity conditions in Appendix G, the proposed CTMLE is asymptotically linear and, when centered at ψ_0 and scaled by n^{1/2}, converges in distribution to a mean-zero Normal variate with variance equal to the variance of its influence function.

The asymptotic variance of the proposed CTMLE is generally smaller than that of the standard TMLE, so that the proposed estimator is super-efficient. That is, at any fixed data distribution in the model M, this CTMLE will be asymptotically more efficient than the standard TMLE.

3.1 Variance estimation

We propose to estimate the standard error of the CTMLE based on a cross-validated estimate of the variance of the influence function. Specifically, consider a K-fold cross-validation scheme, wherein data are randomly partitioned into K blocks of approximately equal size. For k = 1, …, K, denote by V_k the indices of units in the k-th block. For i ∈ V_k, we denote by Qbar_{n,k}(W_i) the predicted outcome for observation i based on an OR estimate fit when observation i was in the hold-out block. That is, to obtain Qbar_{n,k}(W_i), we regress Y on W in treated units outside of V_k, and use this fitted regression to obtain a prediction based on W_i. Similarly, we denote by g̃_{n,k} the k-th estimated adaptive PS. This quantity is computed by regressing A on the cross-validated OR predictions for units outside of V_k. This regression is then used to obtain a prediction g̃_{n,k}(W_i) for each unit in V_k. Finally, we denote by Q_{W,n,k} the empirical distribution of W based on the units in V_k. The cross-validated variance estimator is

σ²_n = (1/n) Σ_{k} Σ_{i ∈ V_k} D*(Qbar_{n,k}, g̃_{n,k}, Q_{W,n,k})(O_i)².
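A sketch of this cross-validated variance estimator (our own illustration; `fit_or` and `fit_ps` are user-supplied training routines, here simple placeholders):

```python
import numpy as np

def fit_or(W_tr, Y_tr):
    """Placeholder OR fit: least squares of Y on (1, W)."""
    X = np.column_stack([np.ones(len(W_tr)), W_tr])
    coef, *_ = np.linalg.lstsq(X, Y_tr, rcond=None)
    return lambda Wnew: coef[0] + coef[1] * Wnew

def fit_ps(x_tr, a_tr):
    """Placeholder adaptive PS fit: a constant equal to the mean of A."""
    m = float(np.mean(a_tr))
    return lambda xnew: np.full(len(xnew), m)

def cv_variance(Y, A, W, K=10, seed=1):
    """Cross-validated estimate of the variance of the estimator.

    For each fold, nuisances are fit on the other folds and the estimated
    influence function is evaluated on the held-out units."""
    n = len(Y)
    folds = np.random.default_rng(seed).integers(0, K, size=n)
    eif = np.empty(n)
    for k in range(K):
        tr, te = folds != k, folds == k
        or_k = fit_or(W[tr & (A == 1)], Y[tr & (A == 1)])
        Qbar1 = or_k(W[te])
        # adaptive PS is fit on the cross-validated OR predictions
        g_tilde = fit_ps(or_k(W[tr]), A[tr])(Qbar1)
        psi_k = np.mean(Qbar1)            # fold-specific plug-in
        eif[te] = A[te] / g_tilde * (Y[te] - Qbar1) + Qbar1 - psi_k
    return np.var(eif) / n                # variance estimate for the estimator

rng = np.random.default_rng(5)
n = 2000
W = rng.uniform(size=n)
A = rng.binomial(1, 0.5, size=n)
Y = np.clip(0.3 + 0.4 * W + rng.normal(scale=0.05, size=n), 0, 1)
print(cv_variance(Y, A, W))               # a small positive number
```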

Remark. For notational simplicity, we have focused on presenting an estimate of the treatment-specific mean. An estimate of the ATE can be obtained by repeating the entire procedure with the treatment labels switched and differencing the two resulting estimates. Alternatively, we also propose a CTMLE that directly targets the ATE in Appendix F. The major difference is that the adaptive PS now involves regression on both the OR in units who received treatment, as well as the OR in units who did not receive treatment.

4 Simulations

We evaluated the performance of the proposed collaborative estimators relative to their standard counterparts in two simulation studies. We focus our presentation on comparing CTMLE and TMLE results, while a comparison of the one-step estimators is included in Appendix G. The first simulation evaluated the relative performance of CTMLE vs. TMLE as a function of sample size and strength of positivity violations. In this setting, the estimators of the OR and PS are based on correctly-specified parametric models. The results demonstrate the behavior of the proposed estimators as a function both of sample size and strength of positivity violations. The second simulation offers a competitive setting for comparing the various estimators. In this setting, both the OR and PS are highly nonlinear functions of the covariates and involve complex covariate interactions. To consistently estimate these complex functions, we utilize the highly adaptive lasso minimum loss estimator (HAL-MLE) (van der Laan, 2017; Benkeser and van der Laan, 2016). This estimator has been shown to satisfy the requisite regularity conditions of Theorem 1 under extremely mild assumptions on the true nuisance parameters.

In both simulation settings, we generated and analyzed 1000 Monte Carlo replicate data sets at each sample size. We evaluated the estimators on their bias, variance, and mean-squared error. We also present visualizations of the estimated sampling distributions of the centered and scaled estimators. Further, we present the coverage probability of nominal 95% Wald-style confidence intervals based on the Monte Carlo standard deviation of the estimators (i.e., an oracle confidence interval) and based on influence function-based standard error estimates.

4.1 Simulation 1

Figure 1: Results for simulation 1 comparing CTMLE and TMLE. Each panel displays a different performance metric and each sub-panel displays results for one of the three settings, representing, respectively, no positivity, moderate positivity, and extreme positivity violations. Panel A: Bias of the estimators. Panel B: Variance of the estimators. Panel C: Relative efficiency (defined as the ratio of mean squared error) of CTMLE to TMLE. Numbers below one indicate greater efficiency of CTMLE. Panel D: Kernel density estimates of sampling distributions using a Gaussian kernel and Silverman’s rule-of-thumb bandwidth (Silverman, 1986).

For each sample size we generated data as follows. W was an eight-variate vector. We drew the first seven components of W from a Uniform distribution, and the final component of W from a Bernoulli(0.5) distribution. Given W, the treatment A was drawn from a Bernoulli distribution whose success probability depended on W through a logistic model with a scale parameter controlling the strength of positivity violations. Given W and A, the outcome Y was drawn from a Normal distribution with unit variance and a mean linear in W and A. The true ATE in this setting is one. We induced positivity violations by choosing increasingly large values of the scale parameter. The three choices we evaluated resulted in PS bounded in (0.05, 0.95), (0.01, 0.99), and (0.003, 0.997), respectively. The standard TMLE and one-step estimators used a correctly-specified logistic regression for the PS and a correctly-specified linear regression for the OR. The CTMLE and collaborative one-step used the correctly-specified linear regression for the OR and HAL-MLE for the adaptive PS.

In settings with no positivity issues, we found that CTMLE and TMLE performed approximately equivalently, though CTMLE offered modest benefits at the smallest sample size (Figure 1). As the scale parameter increased, propensity scores were pushed towards zero and one, and we saw increased performance of CTMLE relative to TMLE. CTMLE offered significant improvements both in terms of bias and variance, and was more than four times as efficient in the most extreme positivity scenario. The sampling distribution of both estimators was well approximated by the reference asymptotic distribution, as indicated by nominal coverage of oracle confidence intervals (Figure 2, Panel A). However, while the estimated standard errors of the TMLE performed well in larger samples, the estimated standard errors of the CTMLE had poor performance, often underestimating the true variability of the estimator (Figure 2, Panel B).

Results for the collaborative one-step vs. standard one-step estimator were essentially the same as for TMLE (Appendix XXX).

Figure 2: Results for simulation 1 comparing confidence intervals for CTMLE and TMLE. Each panel displays coverage as a function of sample size and each sub-panel displays results for one of the three positivity settings. Panel A: Coverage probability of nominal 95% oracle confidence intervals. Panel B: Coverage probability of nominal 95% confidence intervals based on estimated standard errors.

4.2 Simulation 2

Figure 3: Results for simulation 2 comparing CTMLE and TMLE. Each panel displays a different performance metric and each sub-panel displays results for one of the settings considered, representing, respectively, no positivity, moderate positivity, and extreme positivity violations. Panel A: Bias of the estimators. Panel B: Variance of the estimators. Panel C: Relative efficiency (defined as the ratio of mean squared error) of CTMLE to TMLE. Numbers below one indicate greater efficiency of CTMLE. Panel D: Kernel density estimates of sampling distributions using a Gaussian kernel and Silverman’s rule-of-thumb bandwidth (Silverman, 1986).

We based this simulation setting on the oft-cited Kang and Schafer (2007) simulation design. This design is notoriously challenging for causal effect estimators due to extremely non-linear covariate relationships in the OR and PS and highly complex interactions between covariates. In our simulation, we drew some components of the covariate vector Z from a Uniform(0.5, 2) distribution and the remaining components from a second Uniform distribution. Given Z, the treatment A was drawn from a Bernoulli distribution. Given Z and A, the outcome Y was drawn from a Normal distribution with unit variance. The true ATE in this setting is zero and the true PS is bounded between (0.004, 0.999). A challenge of this simulation setting is that the covariates Z are not available to the analyst. Instead, we must base our estimation on a transformed vector W, generated from Z through a fixed set of non-linear transformations. As such, the true PS and OR, when expressed as functions of W, are highly non-linear and involve interactions between the various components of W. To estimate these functions well, we require extremely flexible regression tools. Thus, we used HAL-MLE for the PS and OR in the standard TMLE and one-step estimators and, similarly, HAL-MLE for the OR and the adaptive PS in the collaborative estimators.

Figure 4: Results for simulation 2 comparing confidence intervals for CTMLE and TMLE. Each panel displays coverage as a function of sample size and each sub-panel displays results for one of the settings considered. Panel A: Coverage probability of nominal 95% oracle confidence intervals. Panel B: Coverage probability of nominal 95% confidence intervals based on estimated standard errors.

As expected, both the TMLE and CTMLE struggled in this very challenging simulation study (Figure 3). While CTMLE offered modest benefits in terms of variance, the bias of the two estimators was comparable. We do see evidence of asymptotic linearity of both estimators, in that the sampling distributions of the scaled and centered estimators appear to be moving towards an appropriate center at zero. Nevertheless, both estimators had relatively large bias, even at the largest sample size considered, as shown by the less-than-nominal coverage of the oracle confidence intervals (Figure 4, Panel A). The cross-validated influence function-based variance estimators overestimated the variability of the TMLE, which resulted in near-nominal coverage for those intervals (Panel B). However, as in simulation 1, we found that the proposed variance estimators for the CTMLE significantly underestimated the variability of the estimator, resulting in poor coverage.

5 Data Analysis

We analyzed data collected via the Cebu Longitudinal Health and Nutrition Survey (CLHNS) (Adair et al., 2010). CLHNS is an ongoing study of a cohort of Filipino women who gave birth between May 1, 1983, and April 30, 1984. Children born to these women in that period have been followed through subsequent surveys over multiple years. We used these data to estimate the effect of term pregnancy (pre-, full-, post-) on children’s schooling achievement in Cebuano, English, and mathematics. The putative confounders we considered were related to parental characteristics (maternal/paternal age, maternal height, maternal/paternal education, maternal age at first birth, maternal parity, maternal marital status), household characteristics (number of children in the household, child/adult ratio, child dependency ratio, crowding index, number of adult males/females in the household, urbanicity score, water availability, sanitation), and socioeconomic information (socioeconomic status, familial health care access, total family income). Our strategy for handling missing covariate data is described in Appendix H. Pre-, full-, and post-term pregnancies were defined as those less than 37 completed weeks, between 37 and 41 completed weeks, and greater than 41 completed weeks, respectively. Our outcome is the standardized total performance on the National Elementary Assessment Test (NEAT), which was administered in 1994-95. Children were scored on their achievement in Cebuano, English, and mathematics. We standardized and summed these three scores to create a composite school achievement outcome. We analyzed a total of children who completed these tests.

A-priori, we may not expect positivity issues with these data since previous studies have not uncovered strong predictors of birth term (Di Renzo et al., 2011). On the other hand, this same research has highlighted significant geographic variability in predictors of birth term. Thus, when proposing a pre-specified analysis in a population with little prior information available on predictors of term birth, we may worry about the potential for near positivity violations. Consequently, we may wish to select a method that hedges against such violations, like the proposed CTMLE.

We analyzed these data using TMLE, CTMLE, one-step, and collaborative one-step. For the outcome regression, we used a super learner. Super learner, a generalized version of regression stacking (Wolpert, 1992; Breiman, 1996), is a cross-validation-based ensemble approach that creates a regression fit based on a weighted combination of candidate regression fits. The method is implemented in the SuperLearner R package, freely available through the Comprehensive R Archive Network (CRAN) (Polley and van der Laan, 2012). Our candidate regressions for the OR and PS included generalized linear models (function SL.glm in the SuperLearner package), polynomial multivariate regression splines (SL.earth), random forests (SL.ranger), lasso (SL.glmnet), gradient boosted regression trees (SL.gbm), an intercept-only regression (SL.mean), and a forward stepwise generalized linear model (SL.step.forward). Each super learner was based on five-fold cross-validation. We used the HAL-MLE to estimate each adaptive PS utilized by the proposed CTMLE and collaborative one-step. There were three such scores: one each for the OR in pre-, full-, and post-term births. We constructed Wald-style confidence intervals using influence function-based standard errors as described above. We also tested the null hypothesis that the average schooling achievement was equivalent across the three categories of birth term using a two-degree-of-freedom Wald-style test. Finally, for comparison, we estimated each of the relevant quantities using a main-terms linear regression model, with nonparametric bootstrap confidence intervals (based on 500 bootstrap samples), and computed a p-value using a two-degree-of-freedom Wald-style test based on a sandwich variance estimator (White, 1980).
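The super learner's cross-validated weighting can be sketched in a few lines. The toy version below is in Python rather than the SuperLearner R package, uses two stand-in learners (an intercept-only fit, analogous to SL.mean, and OLS, analogous to SL.glm), and replaces the package's constrained meta-regression with a grid search over a convex weight; it is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)  # simulated data; the true regression is linear

# Two candidate learners: intercept-only (like SL.mean) and simple OLS (like SL.glm).
def fit_mean(xtr, ytr):
    m = ytr.mean()
    return lambda xnew: np.full(len(xnew), m)

def fit_ols(xtr, ytr):
    X = np.column_stack([np.ones(len(xtr)), xtr])
    beta = np.linalg.lstsq(X, ytr, rcond=None)[0]
    return lambda xnew: beta[0] + beta[1] * xnew

learners = [fit_mean, fit_ols]

# Five-fold cross-validated (out-of-fold) predictions for each learner.
folds = np.array_split(rng.permutation(n), 5)
cv_preds = np.zeros((n, len(learners)))
for idx in folds:
    mask = np.ones(n, bool)
    mask[idx] = False
    for j, fit in enumerate(learners):
        cv_preds[idx, j] = fit(x[mask], y[mask])(x[idx])

# Grid-search a convex weight on the learners minimizing cross-validated MSE.
ws = np.linspace(0.0, 1.0, 101)
mse = [np.mean((y - (w * cv_preds[:, 0] + (1 - w) * cv_preds[:, 1])) ** 2) for w in ws]
w_best = ws[int(np.argmin(mse))]

# Refit both learners on all data; the ensemble uses the selected weight.
full_fits = [fit(x, y) for fit in learners]
def super_learner_predict(xnew):
    return w_best * full_fits[0](xnew) + (1 - w_best) * full_fits[1](xnew)
```

Because the simulated regression is truly linear, the cross-validated weight should land almost entirely on the OLS learner.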

Method Pre-term Full-term Post-term p-value
TMLE -0.03 (-0.14, 0.07) 0.02 (-0.03, 0.06) -0.08 (-0.21, 0.06) 0.33
CTMLE -0.03 (-0.13, 0.07) 0.02 (-0.03, 0.06) -0.07 (-0.20, 0.06) 0.37
OS -0.03 (-0.14, 0.08) 0.02 (-0.03, 0.06) -0.07 (-0.21, 0.06) 0.36
COS -0.03 (-0.13, 0.07) 0.02 (-0.03, 0.06) -0.07 (-0.19, 0.06) 0.39
LM -0.03 (-0.13, 0.07) 0.02 (-0.03, 0.06) -0.18 (-0.46, 0.08) 0.27
Table 1: Estimated average standardized schooling achievement score (95% confidence interval) for pre-, full-, and post-term births using various methods. TMLE = targeted minimum loss estimator, CTMLE = the proposed collaborative TMLE, OS = one-step estimator, COS = collaborative OS estimator, LM = estimator based on a main terms linear model. The p-value is from a test of the null hypothesis that the three means are equal.

Each of the methods provided similar point estimates, except that the estimate of the average standardized score in post-term births was lower for the linear model-based estimator (Table 1). In each case, we failed to reject the null hypothesis of equal school achievement by birth term category at any reasonable type-I error threshold. That the standard efficient estimators gave results similar to their collaborative counterparts is not surprising considering that no significant positivity issues were uncovered in the analysis. In particular, the estimated probabilities of pre-, full-, and post-term birth for the observed participants were bounded between (0.12, 0.23), (0.68, 0.81), and (0.06, 0.10), respectively. On the other hand, the corresponding estimated adaptive PS’s were bounded between (0.14, 0.17), (0.78, 0.78), and (0.07, 0.08), respectively.

6 Discussion

It has been recognized in the literature that efficient estimators such as TMLE and the one-step can exhibit erratic, non-robust behavior when the target estimand is weakly identifiable. This setting arises most often in observational studies with high-dimensional covariates and little a-priori knowledge of which covariates are confounders of the treatment/outcome relationship. In these settings, various CTMLE estimators have been proposed. Relative to these existing estimators, our proposed estimator enjoys some important benefits. First, it avoids many of the decision points needed in a typical CTMLE implementation. The user need only choose a regression estimator for the OR and for the adaptive PS. Beyond those choices, standard TMLE software can be used simply by substituting an estimate of the adaptive PS for the true PS. Our theorem establishes weak convergence results similar to those previously proven for TMLE. Another potential benefit of our proposal is that, while the statistical motivation for our choice of adaptive PS is somewhat technical – ensuring that a second-order remainder of the linearized estimator is asymptotically negligible – the interpretation of the adaptive PS is easily explained to applied practitioners. For example, in a medical context, rather than estimating the true PS, which describes a patient’s propensity for receiving treatment as a function of their medical history and other confounders, we instead opt for the adaptive PS, which describes the propensity for receiving treatment as a function of treated patients’ estimated risk of disease.

We expect that the proposed CTMLE will exhibit more robust finite-sample performance relative to standard TMLE in settings where the target estimand is weakly identifiable, and indeed the results of our first simulation explicitly demonstrated this phenomenon. On the other hand, because the proposed CTMLE is an irregular estimator, it may perform poorly for certain data-generating distributions. A more careful comparison between standard TMLE and our super-efficient CTMLE is thus warranted. First, it is relevant to note that under lack of positivity it may be harder to estimate the OR well, and both the small- and large-sample performance of our estimator is closely tied to the performance of the OR estimator. While recent developments such as HAL-MLE theoretically ensure that the -consistency requirement is satisfied under weak conditions, there are no finite-sample guarantees of adequate performance. Therefore, it would seem particularly beneficial for our proposed estimator that a flexible estimation framework be employed for the OR. For example, a cross-validation-based regression stacking approach, such as the super learner approach that we used in the data analysis, may be particularly beneficial. Nevertheless, contrary to other CTMLE approaches that enjoy collaborative double-robustness, our CTMLE is not doubly robust: we cannot compensate for an inconsistent OR estimator by using a consistent PS estimator. However, our proposed estimator appears to completely avoid the risk of a TMLE in which the targeting step degrades the performance of the initial OR estimator. As a final point of comparison, variance estimation for this CTMLE appears to be a more challenging task than for standard TMLE, as evidenced by the poor behavior of the confidence intervals in both simulations. It will thus be important in future work to consider alternative strategies for estimating the variance (e.g., based on a bootstrap scheme).
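The bootstrap strategy for variance estimation mentioned above can be sketched generically. Here `estimator` is a placeholder stand-in (a real implementation would recompute the full CTMLE on each resampled data set), and the data are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=1.0, size=500)  # placeholder data set

def estimator(sample):
    # Stand-in for the point estimate (e.g., a CTMLE) computed on a data set.
    return sample.mean()

# Nonparametric bootstrap: re-estimate on resampled data sets and take the
# empirical standard deviation of the replicates as the standard error.
B = 500
boot = np.array([
    estimator(rng.choice(data, size=len(data), replace=True))
    for _ in range(B)
])
se_boot = boot.std(ddof=1)

# Wald-style confidence interval using the bootstrap standard error.
est = estimator(data)
ci = (est - 1.96 * se_boot, est + 1.96 * se_boot)
```

For the sample mean, the bootstrap standard error should closely track the analytic value, the sample standard deviation divided by the square root of the sample size.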

Another strategy for providing additional robustness against poor estimation in small samples is to utilize TMLE frameworks designed for doubly robust inference (van der Laan, 2014; Benkeser et al., 2017). These estimators involve fitting additional parametric working models for both the OR and PS estimators beyond a typical TMLE implementation. The resulting effect estimator is asymptotically linear even if one of the OR or PS is inconsistently estimated (or is consistently estimated at too slow a rate). The additional fitting aims to minimize the contribution of the second-order remainder to the behavior of the estimator – essentially the same goal as with our CTMLE proposal. Even when both the OR and adaptive PS estimators are consistent at rates faster than , this approach could yield finite-sample improvements for a TMLE or efficient CTMLE. We leave to future study whether and how such strategies can be used to further improve the finite-sample behavior of estimators.

Overall, we conclude that the relative finite-sample performance of an efficient TMLE/CTMLE (possibly modified to accommodate doubly-robust inference) and our proposed super-efficient CTMLE will depend on the particular type of data distribution. Clearly, if one has a-priori knowledge of the propensity score, in particular that it does not suffer from positivity issues, then an efficient TMLE or CTMLE strategy may be the favorable estimator. On the other hand, if the PS is poorly understood and/or the parameter is weakly identifiable, then the super-efficient CTMLE may be a better option.

In future work, we will generalize our asymptotic linearity results to the more general CAR setting described in Appendix B. Such extensions would allow us to tackle other challenging problems in causal inference, such as estimation of the counterfactual mean of a treatment administered at several timepoints subject to time-varying confounding.


  • Adair et al. (2010) Linda S Adair, Barry M Popkin, John S Akin, David K Guilkey, Socorro Gultiano, Judith Borja, Lorna Perez, Christopher W Kuzawa, Thomas McDade, and Michelle J Hindin. Cohort profile: the Cebu Longitudinal Health and Nutrition Survey. International Journal of Epidemiology, 40(3):619–625, 2010. doi: 10.18356/ca188036-en.
  • Benkeser and van der Laan (2016) David Benkeser and Mark van der Laan. The highly adaptive lasso estimator. In Proceedings of the International Conference on Data Science and Advanced Analytics 2016, pages 689–696, 2016. doi: 10.1109/DSAA.2016.93.
  • Benkeser et al. (2017) David Benkeser, Marco Carone, MJ Van Der Laan, and PB Gilbert. Doubly robust nonparametric inference on the average treatment effect. Biometrika, 104(4):863–880, 2017. doi: 10.1093/biomet/asx053.
  • Bickel et al. (1997) P.J. Bickel, C.A.J. Klaassen, Y. Ritov, and J. Wellner. Efficient and adaptive estimation for semiparametric models. Springer, Berlin Heidelberg New York, 1997.
  • Breiman (1996) L. Breiman. Stacked regressions. Machine Learning, 24:49–64, 1996. doi: 10.1007/bf00117832.
  • Di Renzo et al. (2011) Gian Carlo Di Renzo, Irene Giardina, Alessia Rosati, Graziano Clerici, Michela Torricelli, Felice Petraglia, Italian Preterm Network Study Group, et al. Maternal risk factors for preterm birth: a country-based population analysis. European Journal of Obstetrics & Gynecology and Reproductive Biology, 159(2):342–346, 2011. doi: 10.1016/j.ejogrb.2011.09.024.
  • Groenwold et al. (2012) Rolf HH Groenwold, Ian R White, A Rogier T Donders, James R Carpenter, Douglas G Altman, and Karel GM Moons. Missing covariate data in clinical research: when and when not to use the missing-indicator method for analysis. Canadian Medical Association Journal, 184(11):1265–1269, 2012. doi: 10.1503/cmaj.110977.
  • Gruber and van der Laan (2010) S. Gruber and M.J. van der Laan. An application of collaborative targeted maximum likelihood estimation in causal inference and genomics. Int J Biostat, 6(1), 2010. doi: 10.2202/1557-4679.1182.
  • Ibragimov and Khasminskii (1981) I.A. Ibragimov and R.Z. Khasminskii. Statistical estimation. Springer, 1981.
  • Ju et al. (2017) Cheng Ju, Richard Wyss, Jessica M Franklin, Sebastian Schneeweiss, Jenny Haggstrom, and Mark J van der Laan. Collaborative-controlled LASSO for constructing propensity score-based estimators in high-dimensional data. Statistical Methods in Medical Research, 2017. doi: 10.1177/0962280217744588.
  • Ju et al. (2018) Cheng Ju, Joshua Schwab, and Mark J van der Laan. On adaptive propensity score truncation in causal inference. Statistical Methods in Medical Research, 2018. doi: 10.1177/0962280218774817.
  • Ju et al. (2019) Cheng Ju, Susan Gruber, Samuel D Lendle, Antoine Chambaz, Jessica M Franklin, Richard Wyss, Sebastian Schneeweiss, and Mark J van der Laan. Scalable collaborative targeted learning for high-dimensional data. Statistical Methods in Medical Research, 28(2):532–554, 2019. doi: 10.1177/0962280217729845.
  • Kang and Schafer (2007) Joseph DY Kang and Joseph L Schafer. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22(4):523–539, 2007. doi: 10.1214/07-sts227.
  • Petersen et al. (2012) Maya L Petersen, Kristin E Porter, Susan Gruber, Yue Wang, and Mark J van der Laan. Diagnosing and responding to violations in the positivity assumption. Statistical methods in medical research, 21(1):31–54, 2012. doi: 10.1177/0962280210386207.
  • Pfanzagl (1982) J. Pfanzagl. Contributions to a general asymptotic statistical theory. Springer, 1982.
  • Polley and van der Laan (2012) E. Polley and M. van der Laan. Super Learner Prediction, 2012. R package version 2.0-6, Available at http://cran.r-project.org/web/packages/SuperLearner/SuperLearner.pdf.
  • Silverman (1986) BW Silverman. Density Estimation. Chapman and Hall, London, UK, 1986.
  • Stitelman and van der Laan (2010) O.M. Stitelman and Mark J. van der Laan. Collaborative targeted maximum likelihood for time-to-event data. International Journal of Biostatistics, 6(1):Article 21, 2010. doi: 10.2202/1557-4679.1249.
  • van der Laan and Gruber (2016) Mark van der Laan and Susan Gruber. One-step targeted minimum loss-based estimation based on universal least favorable one-dimensional submodels. The International Journal of Biostatistics, 12(1):351–378, 2016. doi: 10.1515/ijb-2015-0054.
  • van der Laan (2017) Mark J. van der Laan. A generally efficient targeted minimum loss based estimator based on the highly adaptive lasso. The International Journal of Biostatistics, 13(2), 2017. doi: 10.1515/ijb-2015-0097.
  • Van der Laan et al. (2007) Mark J Van der Laan, Eric C Polley, and Alan E Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1), 2007. doi: 10.2202/1544-6115.1309.
  • van der Laan (2014) M.J. van der Laan. Targeted estimation of nuisance parameters to obtain valid statistical inference. International Journal of Biostatistics, 10(1):29–57, 2014. doi: 10.1515/ijb-2012-0038.
  • van der Laan and Gruber (2010) M.J. van der Laan and S. Gruber. Collaborative double robust penalized targeted maximum likelihood estimation. Int J Biostat, 6(1):Article 17, 2010. doi: 10.2202/1557-4679.1181.
  • van der Laan and Rose (2011) M.J. van der Laan and S. Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, Berlin Heidelberg New York, 2011.
  • van der Laan and Rubin (2006) M.J. van der Laan and Daniel B. Rubin. Targeted maximum likelihood learning. International Journal of Biostatistics, 2(1), 2006. doi: 10.2202/1557-4679.1043.
  • van der Vaart (1998) A.W. van der Vaart. Asymptotic Statistics. Cambridge, New York, 1998.
  • Wang et al. (2011) H. Wang, S. Rose, and M.J. van der Laan. Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning. Statistics and Probability Letters, 81(7):792–796, 2011. doi: 10.1016/j.spl.2010.11.001.
  • White (1980) Halbert White. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, pages 817–838, 1980. doi: 10.2307/1912934.
  • Wolpert (1992) D. H. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992. doi: 10.1016/s0893-6080(05)80023-1.
  • Zheng and van der Laan (2011) W. Zheng and M.J. van der Laan. Cross-validated targeted minimum loss based estimation. In M.J. van der Laan and S. Rose, editors, Targeted Learning: Causal Inference for Observational and Experimental Studies. Springer, New York, 2011.

Appendix A. Regular estimators

We can view an estimator of as the byproduct of an algorithm applied to independent observations drawn from a distribution . We consider a fluctuation , where is an index set, satisfying

with and . We say that is a regular estimator of at if, for any such fluctuation, the estimator of the parameter obtained by applying the algorithm on independent observations , …, drawn from is such that converges in distribution to a random variate whose distribution does not depend on . See Chapter 24 of van der Vaart (1998) for more on regularity in the context of efficient estimation.
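A standard symbolic statement of this requirement, writing $\psi_n$ for the estimator, $\Psi$ for the parameter, and $\{P_{0,1/\sqrt{n},h}\}$ for the fluctuation indexed by local direction $h$ (notation assumed here for illustration, following van der Vaart (1998)), is:

```latex
% Regularity of an estimator \psi_n of \Psi(P_0) at P_0:
\sqrt{n}\,\Bigl\{\psi_n - \Psi\bigl(P_{0,1/\sqrt{n},h}\bigr)\Bigr\}
  \;\xrightarrow{d}\; Z
  \quad \text{under sampling from } P_{0,1/\sqrt{n},h},
```

where the distribution of the limit $Z$ does not depend on the local direction $h$.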

Appendix B. Linearization of plug-in estimator

Below, we make use of empirical process notation, writing to denote for a given -integrable function and for each . We also denote by the empirical distribution function based on , so . A linearization of the parameter allows us to write

where is defined in (3). We assume

  1. ;

  2. ;

  3. .

Assumption (i) stipulates that the EIF have mean zero, which is satisfied, for example, if either or . Assumption (ii) stipulates that the EIF estimating equation is approximately solved. Because is a targeted estimate of , this assumption is trivially satisfied, since . The third assumption is satisfied if, for example, converges in probability to zero and falls in -Donsker class with probability tending to 1.
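Assumptions (i)-(iii) can be stated symbolically in their standard form. Writing $Pf = \int f\,dP$, $D^*$ for the EIF, and $\hat{P}_n$ for the estimated distribution (notation assumed here for illustration), they read:

```latex
\begin{align*}
  \text{(i)}\quad   & P_0\, D^*(\hat{P}_n) = o_P(n^{-1/2}), \\
  \text{(ii)}\quad  & P_n\, D^*(\hat{P}_n) = o_P(n^{-1/2}), \\
  \text{(iii)}\quad & (P_n - P_0)\bigl\{D^*(\hat{P}_n) - D^*(P_0)\bigr\} = o_P(n^{-1/2}),
\end{align*}
```

under which the plug-in estimator satisfies $\Psi(\hat{P}_n) - \Psi(P_0) = P_n D^*(P_0) + o_P(n^{-1/2})$, i.e., it is asymptotically linear with influence function $D^*(P_0)$.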

Appendix C. Generalization to coarsened at random (CAR) data structures

In this Appendix, we describe how the ideas presented in the main text generalize to coarsened at random (CAR) data structures. Suppose we observe i.i.d. copies of for a full-data random variable , censoring variable , and a many-to-one mapping . We assume that the conditional distribution of , given , satisfies the CAR assumption. Our model for the observed data is implied by the full-data model for the full-data distribution and the censoring model for the true conditional distribution of censoring. Thus, the data distribution is parameterized as . A full-data target parameter is defined as a mapping so that is the target estimand of interest. We assume that is pathwise differentiable at with EIF relative to this model . We define the exact second-order remainder as

We assume that the full-data parameter is identifiable so that there exists a , where is the parameter space, so that for all . The statistical estimand is thereby defined as . We assume that is a pathwise differentiable function of with EIF under relative to , . Such parameters allow an expansion , where

is the exact second order remainder for the observed data target parameter.

Suppose that the target estimand depends on through a functional , and that we have a loss function such that , where is the parameter space for . Therefore, with an abuse of notation, we may denote the statistical target parameter as , i.e., a function of . Though is a parameter of the observed data distribution, it is also a parameter of the full-data distribution , so it can be viewed both as a mapping as well as .

In addition, let this parameter be chosen so that the EIF of the target parameter under relative to the model depends on the observed data distribution through and the censoring mechanism , so that we can parameterize the EIF as . Similarly, we may denote the second-order remainder by to emphasize that it involves a difference between and .

Given an initial estimator of and of , a TMLE of defines a least-favorable parametric submodel through so that the loss-based score at along this path spans the estimated EIF . Such a path is called a local least favorable path, while a universal least favorable path requires that this score at any equals (van der Laan and Gruber, 2016). The TMLE estimates the fluctuation parameter of this least favorable path with standard minimum loss estimation , and the one-step TMLE of the target estimand is the corresponding substitution estimator , where . If the TMLE uses a universal least favorable submodel, then (van der Laan and Gruber, 2016); but, even when one uses a local least favorable submodel, if and are -consistent estimators, then, under regularity conditions, (see the theorem in the Appendix of van der Laan (2017)). Thus, by arguments similar to those in Appendix A (i.e., appealing to -consistency of the EIF),

where is the in-probability limit of and is the in-probability limit of . As in the case of the treatment-specific mean, we can generally argue that is asymptotically negligible whenever both and are in norm. Under these conditions, the TMLE is asymptotically efficient.

Collaborative TMLE is a general tool whereby an estimator of the censoring mechanism is constructed sequentially. Starting with an initial estimator, we iteratively select an update among a set of candidates by maximizing the reduction in empirical risk achieved during the corresponding TMLE step. Having built a sequence of such candidate censoring mechanism estimators (and corresponding TMLEs), we evaluate the cross-validated risk of each candidate to select the best among the candidate TMLEs of . The principle of CTMLE is that, by maximizing the reduction in empirical risk during the TMLE step, we target a dimension-reduced quantity that adjusts for a rich enough set of variables that the TMLE remains consistent even when is inconsistent. There is a whole class of such targets that will result in this consistency. This preservation of consistency of the CTMLE for a that converges to a with possibly and , is referred to as collaborative double robustness (van der Laan and Gruber, 2010; Gruber and van der Laan, 2010).

Our proposed CTMLE for the treatment-specific mean avoids iteratively building the fit of the censoring mechanism based on the gain achieved during the TMLE step, instead relying fully on being -consistent. This strategy may be generalized to the broader CAR setting by choosing an adaptive target so that the second-order remainder

Such a CTMLE is asymptotically linear with influence function equal to , which will generally have smaller variance than the EIF . We leave to future work the identification of general strategies for selecting adaptive censoring mechanism target parameters.

Appendix D. Additional estimators

Collaborative one-step estimator

The collaborative one-step estimator can be implemented in the following steps:

  1. estimate OR: regress on amongst units observed to receive treatment to obtain OR estimate ;

  2. predict outcome: use estimated OR to obtain a prediction for each observed data unit, ;

  3. estimate adaptive PS: regress on predictions to obtain adaptive PS estimate ;

  4. predict PS: use estimated PS to obtain prediction for each observed data unit, ;

  5. evaluate influence function: use estimated OR and PS to compute for ;

  6. compute one-step estimate: .
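On simple simulated data, the six steps above can be sketched as follows. OLS stands in for the OR estimator and a one-dimensional kernel smoother stands in for the HAL-MLE fit of the adaptive PS; both are illustrative substitutions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
w = rng.normal(size=n)                      # confounder
ps_true = 1 / (1 + np.exp(-w))              # true propensity score
a = rng.binomial(1, ps_true)                # treatment indicator
y = w + a + rng.normal(size=n)              # outcome; true treated mean E[Y(1)] = 1

# Step 1: estimate the OR by regressing Y on W among treated units (OLS stand-in).
X1 = np.column_stack([np.ones(a.sum()), w[a == 1]])
beta = np.linalg.lstsq(X1, y[a == 1], rcond=None)[0]

# Step 2: predicted outcome for every observed unit.
qbar = beta[0] + beta[1] * w

# Steps 3-4: adaptive PS = regression of A on the OR predictions; a simple
# Nadaraya-Watson kernel smoother stands in for HAL-MLE.
def adaptive_ps(q0, bandwidth=0.5):
    k = np.exp(-0.5 * ((qbar - q0) / bandwidth) ** 2)
    return k @ a / k.sum()

g_adapt = np.clip(np.array([adaptive_ps(q) for q in qbar]), 0.01, 0.99)

# Step 5: evaluate the influence-function correction term for each unit.
eif = a / g_adapt * (y - qbar)

# Step 6: collaborative one-step estimate of the treatment-specific mean.
psi_os = qbar.mean() + eif.mean()
```

With this data-generating distribution, the estimate should be close to the true treated mean of 1.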

Collaborative TMLE that targets the ATE directly

Here we propose a TMLE that can be used to directly estimate the ATE. For each , we denote by