Evaluating the Robustness of Targeted Maximum Likelihood Estimators via Realistic Simulations in Nutrition Intervention Trials

09/28/2021 ∙ by Haodong Li, et al. ∙ berkeley college 0

Several recently developed methods have the potential to harness machine learning in the pursuit of target quantities inspired by causal inference, including inverse weighting, doubly robust estimating equations and substitution estimators like targeted maximum likelihood estimation. There are even more recent augmentations of these procedures that can increase robustness, by adding a layer of cross-validation (cross-validated targeted maximum likelihood estimation and double machine learning, as applied to substitution and estimating equation approaches, respectively). While these methods have been evaluated individually on simulated and experimental data sets, a comprehensive analysis of their performance across “real-world” simulations have yet to be conducted. In this work, we benchmark multiple widely used methods for estimation of the average treatment effect using ten different nutrition intervention studies data. A realistic set of simulations, based on a novel method, highly adaptive lasso, for estimating the data-generating distribution that guarantees a certain level of complexity (undersmoothing) is used to better mimic the complexity of the true data-generating distribution. We have applied this novel method for estimating the data-generating distribution by individual study and to subsequently use these fits to simulate data and estimate treatment effects parameters as well as their standard errors and resulting confidence intervals. Based on the analytic results, a general recommendation is put forth for use of the cross-validated variants of both substitution and estimating equation estimators. We conclude that the additional layer of cross-validation helps in avoiding unintentional over-fitting of nuisance parameter functionals and leads to more robust inferences.



There are no comments yet.


page 1

page 25

page 26

page 27

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Epidemiological studies, particularly based on randomized trials, often aim to estimate the average treatment effect (ATE), or another causal parameter of interest, to understand the effect of a health intervention or exposure on an outcome of interest. Most commonly, in observational studies, inverse probability of treatment weighted (IPTW) estimation and its variants have been used for this purpose  

HorvitzGSRF1952, ipw_hajek_1971, AronowEACED2017. Alternative estimators for causal inference include substitution (or direct) estimators based on G-computation RobinsNACIJ1986, YuCCGFD2002, DanielGECED2011, WangGATEJ2017, those based on the approach of estimating equations (EE) RobinsMSMS2000, LaanUMCL2003, including IPTW and its augmented variant (A-IPTW), and substitution estimators developed within the framework of targeted learning (TL) (we also refer to targeted maximum likelihood estimator, TMLE, a product of this framework LaanTLCI2011). The latter of these has seen increasing use in recent years, both in biostatistical methodological research and applied public health and medical research PetersenAIUTJ2017, SkeemCPSOS2017, RoseRMLVO2018, PlattTERCJ2018, NeafseyGDPEN2015. In Table 1, we provide a list of studies that have examined the relative performance of TL-based and competing estimators (mainly against EE-based methods), including a summary of whether the results suggested superior, neutral, or poorer relative performance of TL-based estimators in comparison to other estimators (the “Pro/Con” column). Thus, while this work is contextualized within dozens of previous studies, few such studies performed “realistic” simulations, and even fewer compared several variants of TL estimators alongside corresponding EE approaches. For example, in Zivich and Breskin’s paper ZivichMLCIM2021

, the authors compared G-computation, IPTW, A-IPTW, TMLE and double cross-fit estimators with data generated from predefined parametric models. Exceptions are efforts that used the proposed realistic bootstrap 

PetersenDRVPF2012 to evaluate the performance for data-generating distributions modeled semiparametrically (using ensemble machine learning) from an existing data set. These include a study of estimating variable importance under positivity violations using collaborative targeted maximum likelihood estimation (C-TMLE) PirracchioCTMLJ2018. In this paper, we use an augmentation of this proposed methodology to examine the relative performance of several versions of both TL and EE estimators in ten realistic data simulations, each based on data collected as part of the Knowledge Integration (KI) database from the Bill & Melinda Gates Foundation  MertensCCCGJ2020. In so doing, we provide a realistic survey, across both different data-generating distributions and different study designs, of the relative performance of estimators of causal parameters.

Authors Title Year Description of Reuslts Pro/Con
Chatton, et al.ChattonGPSMJ2020 G-computation, propensity score-based methods, and targeted maximum likelihood estimator for causal inference with different covariates sets: a comparative simulation study 2020 Article compares different semi-parametic approaches, including TMLE and matching, but finds G-computation performs relatively best. Given their simulation, this was predictable because they simulated from a parametric model and used the same model for estimating the regression, thus showing the superiority of maximum likelihood estimation in parametric models. This is not a realistic setting. Con
Talbot and Beaudoin TalbotGDRBM2020 A generalized double robust Bayesian model averaging approach to causal effect estimation with application to the Study of Osteoporotic Fractures 2020 Proposed a Generalized Bayesian Causal Effect Estimation (GBCEE), which outperformed double robust alternatives(including C-TMLE). Also showed “target” A-IPTW is superior than C-TMLE in a non-realistic setting(only using true confounders). Con
Zivich and Breskin ZivichMLCIM2021 Machine learning for causal inference: on the use of cross-fit estimators 2020 A simulation study assessing the performance of G-computation, IPW, AIPW, TMLE, doubly robust cross-fit (DC) AIPW and DC-TMLE. With correctly specified parametric models, all of the estimators performed well. When used with machine learning, the DC estimators outperformed other estimators. Neutral
Ju, et al. JuSCTLF2019

Scalable collaborative targeted learning for high-dimensional data

2019 Results from simulations suggested superior performance of C-TMLE relative to both A-IPTW and non-collaborative ("standard") TMLE estimators. Pro
Ju, et al. JuAPSTJ2019 On adaptive propensity score truncation in causal inference 2019 By adaptively truncating the estimated propensity score with a more targeted objective function, the Positivity-C-TMLE estimator achieves the best performance for both point estimation and confidence interval coverage among all estimators considered. Pro
Bahamyirou, et al. BahamyirouUDPBJ2019 Understanding and diagnosing the potential for bias when using machine learning methods with doubly robust causal estimators 2019 Simulation results showed superior performance of C-TMLE and TMLE relative to IPTW. Pro
Wei, et al. WeiDTLAS2019 A Data-Adaptive Targeted Learning Approach of Evaluating Viscoelastic Assay Driven Trauma Treatment Protocols 2019 C-TMLE outperformed the other doubly robust estimators (IPTW, A-IPTW, stabilized IPTW, TMLE) in the simulation study. Pro
Rudolph, et al. RudolphCSDED2019 Complier Stochastic Direct Effects: Identification and Robust Estimation 2019 Showed that the EE and TMLE estimators have advantages over the IPTW estimator in terms of efficiency and reduced reliance on correct parametric model specification. Pro
Pirracchio, et al. PirracchioCTMLJ2018 Collaborative targeted maximum likelihood estimation for variable importance measure: Illustration for functional outcome prediction in mild traumatic brain injuries 2018 Showed much more robust performance of C-TMLE relative to TMLE using the same type of realistic parametric bootstrap as used in this paper. This was under severe near-positivity violations. Pro
Luque‐Fernandez, et al. Luque-FernandezTMLE2018 Targeted maximum likelihood estimation for a binary treatment: A tutorial 2018 Showed relatively superior performance of TMLE when compared with A-IPTW estimator in terms of bias. Pro
Levy, et al. LevyFMTEJ2021 A Fundamental Measure of Treatment Effect Heterogeneity 2018 Showed the advantage of CV-TMLE over TMLE in that TMLE was affected by overfitting while CV-TMLE appeared unaffected. Pro
Schuler and Rose SchulerTMLEJ2017 Targeted maximum likelihood estimation for causal inference in observational studies 2017 Showed superior performance of TMLE relative to misspecified parametric models. Pro
Pang, et al. PangEEPSN2016 Effect Estimation in Point-Exposure Studies with Binary Outcomes and High-Dimensional Covariate Data–A Comparison of Targeted Maximum Likelihood Estimation and Inverse Probability of Treatment Weighting 2016 Showed relatively superior performance for the TMLE to IPTW, which showed greater instability when positivity violations occurred. Pro
Schnitzer, et al. SchnitzerVSCCM2016 Variable Selection for Confounder Control, Flexible Modeling and Collaborative Targeted Minimum Loss-Based Estimation in Causal Inference 2016 Using IPTW with flexible prediction for the propensity score can result in inferior estimation, while TMLE and C-TMLE may benefit from flexible prediction and remain robust to the presence of variables that are highly correlated with treatment. Pro
Zheng, et al. ZhengDREEM2016 Doubly Robust and Efficient Estimation of Marginal Structural Models for the Hazard Function 2016 Showed that the TMLE for marginal structual model (MSM) for a hazard function has relatively superior performance. The bias reduction over a misspecified IPTW or Gcomp estimator is clear in the simulation studies even for a moderate sample size. Pro
Schnitzer, et al. SchnitzerDREEJ2016 Double robust and efficient estimation of a prognostic model for events in the presence of dependent censoring 2016 This study demonstrated that even when the analyst is ignorant of the true data generating form, TMLE with Super Learner can perform about as well as IPTW or TMLE with correct parametric model specification. Pro
Kreif, et al. KreifETEMO2016 Evaluating treatment effectiveness under model misspecification: A comparison of targeted maximum likelihood estimation with bias-corrected matching 2014 Examined the relative performance of TMLE, EE and matching estimators showing superior performance of TMLE when the outcome regression is misspecified. Pro
Schnitzer, et al. SchnitzerEBGIJ2014 Effect of breastfeeding on gastrointestinal infection in infants: A targeted maximum likelihood approach for clustered longitudinal data 2014

Compared TMLE with IPTW and G-computation, under the plausible scenario of being given transformed versions of the confounders. Only TMLE with Super Learner was able to unbiasedly estimate the parameter of interest.

Gruber and van der Laan GruberATMLM2013 An Application of Targeted Maximum Likelihood Estimation to the Meta-Analysis of Safety Data 2013 Reported superiority of both TMLE and A-IPTW to misspecified parametric models, but the data-generating distributions used resulted in little difference between the semi-parametric approaches. Neutral
Lendle, et al. LendleTMLEA2013 Targeted maximum likelihood estimation in safety analysis 2013 Showed superior performance of TMLE and C-TMLE relative to A-IPTW estimators in the context of positivity violations. Pro
Díaz and van der Laan DiazTDAED2013 Targeted Data Adaptive Estimation of the Causal Dose Response Curve 2013 Showed relatively superior performance of CV-TMLE relative to CV-A-IPTW estimators, especially in the presence of empirical violations of the positivity assumption. Pro
Schnitzer, et al. SchnitzerTMLEJ2013 Targeted maximum likelihood estimation for marginal time-dependent treatment effects under density misspecification 2013

In the simulation study, TMLE did not produce a reduction in finite-sample bias or variance for correctly specified densities compared with the G-computation estimator, but it had much better performance than G-computation when the outcome model was misspecified.

Petersen, et al. PetersenDRVPF2012 Diagnosing and responding to violations in the positivity assumption 2012 Showed superior performance of TMLE relative to misspecified parametric models, in comparison with A-IPTW, IPTW and G-computation. Pro
van der Laan and Gruber vanderLaanTMLB2012 Targeted Minimum Loss Based Estimation of Causal Effects of Multiple Time Point Interventions 2012 In the setting of multiple time point interventions, showed TMLE outperformed IPTW and MLE estimators. Pro
Porter, et al. PorterRPTM2011 The relative performance of targeted maximum likelihood estimators 2011 Showed relatively superior performance of C-TMLE relative to A-IPTW estimators particularly when there are covariates that are strongly associated with the missingness, while while being weakly or not at all associated with the outcome. Pro
Wang, et al. WangFQTLJ2011 Finding Quantitative Trait Loci Genes with Collaborative Targeted Maximum Likelihood Learning 2011 Based on actual genetic data, results suggested greater robustness of findings using C-TMLE relative to parametric approaches for high throughput genetic data. Pro
Díaz and van der Laan MunozPICEJ2012 Population Intervention Causal Effects Based on Stochastic Interventions 2011 Paper focused on new estimators for stochastic (e.g., shift) interventions relevant to estimating causal effects of continuous interventions. In their simulation, they did not observe significant differences between the TMLE and the A‐IPTW. Neutral
Gruber and van der Laan GruberACTM2010 An application of collaborative targeted maximum likelihood estimation in causal inference and genomics 2010 Showed more robust performance in high-dimensional simulations comparing TMLE to estimating equation approaches (A-IPTW). Pro
Stitelman and van der Laan StitelmanCTML2010 Collaborative Targeted Maximum Likelihood for Time to Event Data 2010 The results show that, compared with TMLE, IPTW and A-IPTW, the C-TMLE method does at least as well as the best estimator under every scenario and, in many of the more realistic scenarios, behaves much better than the next best estimator in terms of both bias and variance. Pro
Moore and van der Laan MooreCARTJ2009 Covariate adjustment in randomized trials with binary outcomes: targeted maximum likelihood estimation 2009 Demonstrated how the use of covariate information in randomized clinical trials could use the TMLE framework, which results in improved performance, without bias, relative to standard methods. Pro
Rose and van der Laan RoseSOWCS2008 Simple Optimal Weighting of Cases and Controls in Case-Control Studies 2008 IPTW method for causal parameter estimation was outperformed in conditions similar to a practical setting by the new case-control weighted TMLE methodology. Pro
Table 1: Overview of literature on comparison of TMLE and other estimators. The Pro/Con column refers to a simple binary classification of the relative performance of the TMLE estimators reported in the paper, "Pro" indicating that the TMLE performed superior to other competing estimators.

2 Background

As large and complex data sets have become increasingly more commonplace, the habitual use of parametric approaches is encountering more data science research for which they are ill-suited. . This has led to machine learning (ML) taking a more central role in deriving estimators of causal impacts in very big statistical models (semiparametric). The theory for the use of ML in the estimators discussed herein has been continuously refined, from developing double robust estimators (both A-IPTW and TMLE substitution estimators) to augmentations of these estimators that are more robust to the overfitting potentially introduced by flexible ML fits. The latter modifications to the original estimators are the cross-validated TMLE (or CV-TMLE, chapter 27 in van der Laan

LaanTLCI2011 and Zheng ZhengATCTN2010), and subsequently the proposal for an analogous modification to estimating equation approaches (double machine learning or cross-fitting ChernozhukovDDMLF2018).

While simulation studies have investigated all of these estimators, they have yet to be analyzed together in a single series of realistic simulation studies. Here, we seek to determine how well these estimators perform in realistic settings, under which conditions they perform best, which augmentations provide the most robustness, and whether or not the results support more general recommendations. In addition, there exist other choices of target parameter when the one being analyzed fails to have adequate performance for any of the competing estimators, such as realistic rules BembomPIIR2007. A recently developed machine learning algorithm (the highly adaptive lasso; HAL BenkeserHALE2016), is potentially an important improvement in estimating realistic DGDs for simulation studies such as ours. It can be optimally undersmoothed to dependably generate realistic estimates of the actual data generating distributions. HAL is particularly well suited to these types of simulations, as it uses a very large nonparametric model and can be tuned to be as flexible as the data support. In this paper, we explore the use of HAL as a basis in conducting realistic data-inspired simulations. The results suggest the proposed use of HAL for realistic data-generating simulations could provide a general method for choosing between machine-learning-based estimators for a particular parameter and data set.

We first introduce the data sets that were selected to motivate our realistic simulations, describe the steps taken for simulating data, including a short description of the estimators tested, and discuss the results. The simulations suggest a general recommendation for the use of an additional layer of cross-validation (CV-TMLE and double machine learning) to ensure robust inference in finite samples.

3 Methods

3.1 Study Selection

We utilized data from ten nutrition intervention trials conducted in Africa and South Asia. In all studies, the measured outcome was a height-to-age Z-score for children from birth to 24 months, which was calculated using World Health Organization (WHO) 2006 child growth standards 

WHOGrowth2006. Details about the resulting composite data, study design and data processing, can be found in companion technical reports MertensCCCGJ2020, MertensCWCSJ2020, Benjamin-ChungECLGJ2020. All interventions were nutrition-based, and for the purposes of this analysis, multi-level interventions were simplified to a binary treatment variable (e.g. nutrition intervention - yes/no). Although different baseline covariates were measured among these studies, there was significant overlap. The sample size of each study is shown in Table 2. We anonymized the study IDs and removed the location information due to confidentiality concerns. Details on each study can be found in the shuffled list in Section B of the Appendix.

Study ID n p
1 418 20
2 4863 26
3 7399 22
4 1204 36
5 2396 42
6 3265 18
7 1931 38
8 840 30
9 27275 42
10 5443 35
Table 2: Dimensions of datasets of Nutrition Intervention Trials, with representing the number of children in sample and being the number of covariates.

3.2 Data Processing

Data from each study was cleaned and processed for this analysis. Our goal for defining the analysis data used to simulate is different from the goals of the original studies and thus our data processing might differ from that used in the resulting publications of the study results. We note that the data are used to motivate the simulations, but, since we define the true DGD to be one that we estimate for each study, and at that point differences with the original study become irrelevant to our comparisons of estimators. Data was filtered down to the last height-to-age Z-score measurement taken at the end of each study for each subject. Subjects were dropped if either their treatment assignment () or outcome measurement () were missing. For covariates (

) that were missing, those that were continuous and discrete were imputed using the median and mode, respectively. In both cases, missingness indicator variables were added to the data set for each covariate with missing rows. As mentioned above, the treatment assignment variable (

) was binarized if it consisted of more than two treatment arms. The control and treatment groups were originally assigned in each study as described in Section B of the Appendix.

3.3 Simulation with Undersmoothed Highly Adaptive Lasso

3.3.1 Undersmoothed Highly Adaptive Lasso

We adopted the undersmoothed highly adaptive lasso (HAL) method to generate data for simulations in a (nearly) nonparametric model, with as few assumptions as possible. HAL is a nonparametric regression estimator, which neither relies upon local smoothness assumptions nor is constructed using local smoothing techniques BenkeserHALE2016. HAL has been shown to have competitive finite-sample performance relative to many other popular machine learning algorithms. With the assumption that the target parameter falls in the Donsker class of all cadlag functions (right-hand continuous, with left-hand limits) with finite variation norm, we have the following representationGillIEBS1995:

where and denotes the indices of sections of . Then let us further denote as the subvector with support of , and as the values of the subvector for the th observation. Now can be approximated by such thatBenkeserHALE2016:

Now if we consider a model with the basis functions as predictors and as coefficients, we haveBenkeserHALE2016:

By the assumption of finite sectional variation norm (an entropy assumption required of all but two of the estimators), cross-validated TMLE and double-robust EE) we have the corresponding subspace BenkeserHALE2016.

The HAL estimator starts with a very large number (at most ) of basis functions that are indicator functions with support at the observed data values. In practice, when some covariates are categorical or binary, the number of unique basis functions will be much fewer than the upper bound. Moreover, to avoid overfitting, one can define a subspace of the linear model such that: , for submodels where the -norm is bounded by

. The dimension of basis functions can be restricted. For example, one can consider only main-term indicators for each of the original predictors as well as all second order tensor products (interaction terms involving the main effect terms). One can use cross-validated selection of

to optimize the fit of the model to future observations from the data-generating distribution (DGD).

It has been recently shown that undersmoothing HAL (using a that is larger than that selected by cross-validation) can yield an asymptotically efficient estimators for functionals of the relevant portions of the DGD while preserving the same rate of convergence, and also solving the efficient score equation for any desired path-wise differentiable target feature of the data distribution vanderLaanEEPDA2019. As a consequence, an undersmoothed HAL results in an efficient plug-in estimator of the desired estimand, and moreover, it will also be efficient for any other smooth estimands of the data distribution vanderLaanEEPDA2019. This could serve as the basis for using HAL in our settings; that is, to estimate the DGD by HAL in a way that optimally preserves the relevant functionals. More intuitively, HAL, with the properly chosen , will result in a DGD for simulations that is as close as one can get nonparametrically to the true DGD, without blowing up the variance of estimation. So, it creates a comparison that is as faithful as possible to the DGD of interest, itself represented by a single data set (experiment). Thus, we argue that it can serve as the basis of a simulation where one wishes to compare estimators for the data in hand. We provide more rigorous justification below.

In our study, we only use the undersmoothed HAL to generate data without pre-specifying any parameter of interest. The stopping criterion for this undersmoothing process is to increase the initial bound until the score equations formed by the product of basis functions and residuals are solved at the rate of Laan_underHAL_2019. Namely, for all “non-trivial directions” (combinations of with non-zero coefficients selected by the initial fit) we need:


where is the empirical average function and . Following the convention of notation in LaanTLCI2011, we define and as its estimate. Also, we use , and denoting the possible value of true .
To justify this criterion, first fix and consider the target parameter . (Notice that for this target parameter, we can treat as a member of so that actually contains and ). The last equality is true since . We claim that is a component of the efficient influence curve (EIC) of , where we denote the EIC as . To prove this, we can start with the empirical estimator . Observe that

Thus the influence curve of the empirical estimator is , denote it as . With it, we can obtain by projecting onto the tangent space  LaanTLCI2011. In addition, since , the tangent space can be decomposed as:  LaanTLCI2011. So the projection of on is equal to the sum of the projections of on and , namely, . Thereby . Then


So, .
Now we have proved that is a component of . Another observation from the calculation above is that .
For different pairs of , each corresponds with an EIC for a particular target parameter . For each plug-in estimator being asymptotically linear we want at minimal for every , which is guaranteed by our choice of criterion. By doing so, we will also be solving for any vector with finite -norm, which enables us to approximate any function of with rate approximately equal to  vanderLaanEEPDA2019. So in this way we are rich enough to guarantee to solve any EIC that can be written as , thereby cover all EIC of features of Q. So the undersmoothing process essentially yields an estimator that is efficient for any target feature of Q that is pathwise differentiable. Combined with the fact that when the bias of an estimator is smaller than then it has minimal impact on coverage, we choose (1) as the stopping criterion based on the proof above.

The specific procedure is stated as follows:

Step 1. Fit the HAL with -norm, obtain the set of basis functions and a starting value of , denote it as . This is a CV-optimal value of the penalty parameter returned by the hal9001 package coyle2021hal9001-rpkg, HejaziHSHAS2020, which is essentially from the “cv.glmnet” function with 10-fold cross-validation.

Step 2. Calculate the absolute value of the normalized score equations for each direction:

Step 3. Take the maximum of this value from all subsets. Compare the max with . If the max is larger, then increase the bound (i.e. decrease the value of the penalty parameter ) and refit the HAL.

Step 4. Repeat 1,2,3 until the stopping criterion is satisfied.

In addition, we speed up the algorithm by controlling the number of basis functions in the initial HAL fits. First, we set the maximum interaction degree to , where is the number of covariates. Second, we use binning method to restrict the maximum number of knots to for the

degree basis functions. These hyperparameters can be set through the

hal9001 package coyle2021hal9001-rpkg, HejaziHSHAS2020. We make the decisions on hyperparameters based on two factors: they can help form a rich model with complex interaction terms and the computing time is acceptable. To make it more rigorous, a cross-validation-based tuning procedure can be considered in future practice.

In the Appendix, we provide a list showing the variables included in the models after undersmoothing (Table 7).

3.3.2 Data Generating Process

The DGD for each study was based upon the following structural causal model (SCM):

where , , and are, in time ordering, the confounders, the binary intervention of interest and the outcome, respectively, with the exogenous independent errors and deterministic functions, . Specifically, the following steps were taken:

  1. Covariates were sampled with replacement from the study data sets with sample size , where is the size of the original data set.

  2. The undersmoothed HAL fit was then used to predict . The intervention

    was then sampled using a binomial distribution with the predicted


  3. The outcome was then simulated with the undersmoothed HAL fit, using the sampled and simulated as input. A mean zero, normal random error was added to the simulated , using a variance based upon the residual variance of the predicted (Namely, ).

Note, we could have used other ways of estimating the error distribution in step 4, including density estimation using HAL, but we left this for future studies.

Steps 1 through 3 were repeated 500 times to generate the data sets for each simulation. For each of the study data (Table 2), we repeated these steps and analyzed the performance of the competing estimators separately by study.

3.3.3 Target parameter

Our treatment variable is binary, and our outcome is continuous, indicating a height-to-age Z-score. represents the measured covariates in each study. The data structure is defined as: with independent and identically distributed (i.i.d.) observations , where

denotes the set of possible probability distributions of

. The target parameter is a feature of that is our quantity of interest SchulerTMLEJ2017. We selected as our target parameter the average treatment effect (ATE), or , ; where denotes the collection of possible distributions of as described by the SCM, and is the outcome for a subject if, possibly contrary to fact, they received nutrition intervention . Given we simulated the data based upon on our causal model, under randomization assumption and positivity assumption we can show that this causal parameter is identified by the following statistical estimand PearlC2009:

We calculate the true ATE value for each study by first randomly drawing a large number of observations () from the empirical of and using:

where we define the and term using the fitted undersmooth HAL model. Note that our simulation process insures the randomization assumption is true and there is no asymptotic violation of the positivity assumption. However, there can be practical violations of positivity (close to 0 or 1 estimated probabilities of getting treatment for some observations given the ) which can deferentially impact estimator performance.

3.4 The Estimation Problem

The target parameter depends on the true DGD, , through the conditional mean and the marginal distribution of , so we can write , where . Our targeted learning estimation procedure begins with estimating the relevant part of the data-generating distribution needed for evaluating the target parameter LaanTMLLD2006.

The two general methods we compare are substitution and estimating equation estimators. Depending on the specific estimator, they can depend on estimators of of the propensity score, , the outcome model, , and sometimes both. We use consistent settings when modeling the outcome and the propensity score via Super Learner (see section 3.8 below for details).

The estimators we compare are not exhaustive and new methods will be developed, so such studies will continue to be important sources of information for deciding what to do in practice. We quickly describe the particular estimators compared in our study below.

3.5 Inverse Probability of Treatment Weighting Estimator

The inverse probability of treatment weighting (IPTW) is a method that relies on estimates of the conditional probability of treatment given covariates , referred to as the propensity score ROSENBAUMCRPSA1983. After it is estimated, the propensity score is used to weight observations such that a simple weighted average is a consistent estimate of the particular causal parameter if the propensity score model is consistent SchulerTMLEJ2017. For the ATE (if were known) the weight is .

The average treatment effect is then estimated by AustinMBPWD2015:

where is the estimate of the true propensity score (). IPTW is not a double robust estimator, in that its consistency depends on consistent estimation of the propensity score LaanUMCL2003. As it is not a substitution estimator, it is not as robust to sparsity SchulerTMLEJ2017. However, it is a commonly used estimator of the ATE, and its form and relationship to well-known inverse probability methods in the analysis of survey data make it relatively popular.

We derived statistical inference using a conservative standard error which assumes that is known (there is an extensive literature on IPTW estimators, but LaanUMCL2003 is a good reference for technical details). Specifically, the standard error for this estimator was constructed by multiplying

by the standard deviation of the plug-in resulting influence curve:

Since IPTW estimator has many problems such as not invariant to location transformation of the outcome and suffering from the extreme predictions of (close to 0 or 1), we use the Hajek/stablized IPTWipw_hajek_1971 by normalizing the weights of as follows:

3.5.1 Cross-Validated Inverse Probability of Treatment Weighting (CV-IPTW) Estimator

To avoid problems that arise when is overfit, we also implemented the CV-IPTW estimator by adding another layer of cross-validation when estimating the propensity score ChernozhukovDDMLF2018. Specifically, the same SL fitting procedure was implemented on training sets. Then, we use this estimate of on the corresponding validation sets; as such, we employ a nested cross-validation. In practice, we used the “Split Sequential SL” method, an approximation to the nested cross-validation proposed by Coyle CoyleCCTL2017, to speed up the estimation while obtaining similar results to standard nested cross-validation. More details on the implementation can be found in section 3.8 below.

3.6 Augmented Inverse Probability of Treatment Weighted (A-IPTW) Estimator

The other estimating equation method included in our study is an augmented version of the IPTW estimator, aptly named the augmented inverse probability of treatment weighted (A-IPTW) estimator GlynnIAIP2010. It is a double robust estimator that is consistent for the ATE as long as either the propensity score model () or the outcome regression () is correctly specified. When compared with the IPTW estimator in a Monte Carlo simulation, A-IPTW typically outperformed IPTW with a lower mean squared error when either the propensity score or outcome model was misspecified GlynnIAIP2010.

Intuitively, the A-IPTW improves upon IPTW by fully utilizing the information in the conditioning set of covariates , which contains both information about the probability of treatment and information about the outcome variable GlynnIAIP2010. More formal justification comes from the fact that the A-IPTW estimator arises as the solution to the efficient influence curve (a key quantity in semiparametric theory), and thus is locally efficient if both and are correctly specified.

For the ATE, A-IPTW estimator solves the mean of the empirical efficient influence curve and can be expressed explicitly for the average treatment effect as follows:

The standard error for this estimator was constructed by multiplying by the standard deviation of the plug-in efficient influence curve:

3.6.1 Cross-Validated Augmented Inverse Probability of Treatment Weighted (CV-A-IPTW) Estimator

Similar to CV-IPTW, to avoid overfitting of the outcome model () or propensity score model (), we implemented the CV-A-IPTW estimator by adding another layer of cross-validation when estimating the and . In practice, as discussed above for the IPTW estimator, we used the “Split Sequential SL” method proposed by Coyle CoyleCCTL2017 to speed up the estimation (for more details, see section 3.8 below).

3.7 Targeted Maximum Likelihood Estimator (TMLE)

The targeted maximum likelihood estimator (TMLE) is an augmented substitution estimator that, in context of the ATE, adds a targeting step to the original outcome model fit to optimize the bias-variance trade-off for the parameter of interest LaanTMLLD2006. Similar to A-IPTW, TMLE is doubly robust, producing unbiased estimates if either (i.e. ) or (i.e.

) is correctly specified. It is asymptotically efficient when both quantities are consistently estimated. As it is a substitution estimator, it is typically more robust to outliers and sparsity than EE estimators

SchulerTMLEJ2017. A finite sample advantage over estimating equation methods comes from the fact that the estimator respects constraints on the parameter bound, such as ensuring that an estimated probability in the range LaanTMLLD2006.

The TMLE, like the A-IPTW estimator, requires preliminary estimates of both and . The first step in TMLE is finding an initial estimate of the relevant part of data-generating distribution . For all estimators, we use an ensemble machine learning algorithm, the Super Learner (SL) algorithm. This avoids arbitrarily using a single algorithm and ensures that the corresponding fit will be optimal (with respect to the true risk) relative to the candidate algorithms used in the estimation. Once this initial estimate has been found, TMLE updates the initial fit to make an optimal bias-variance trade-off for the target parameter LaanTMLLD2006.

For the ATE, the TMLE first requires , the estimate of the conditional expectation of the outcome given the treatment and covariates SchulerTMLEJ2017. Next is the targeting step for optimizing the bias-variance trade-off for the parameter of interest. The propensity score () can also be estimated with a flexible algorithm like the Super Learner vanderLaanSL2007, and these fits are used to predict the conditional probability of treatment and no treatment for each subject (). These probabilities are used for updating the initial estimate of the outcome model. This updated estimate is then used to generate potential outcomes for when and . Like the G-computation estimator, the TMLE estimate of the ATE is calculated as the mean difference between these pairs SchulerTMLEJ2017.

With the ATE as our target parameter, the Super Learner substitution estimator is LaanTLCI2011:

where is the estimate of and the initial estimate of .

The next step is to update the estimator above toward the parameter of interest. The targeting process uses in a so-called clever covariate to define a one-dimensional model for fluctuating the initial estimator. The clever covariate is defined as:

A simple, one-variable logistic regression is then run for the outcome

on the clever covariate, using as the offset to estimate the fluctuation parameter . This is used for updating the initial estimate into a new estimate as follows:

where is the estimate of .

The updated fit is used to calculate the expected outcome under () and () for all subjects. These estimates are then plugged into the following equation for the final TMLE estimate of the ATE:

The fitting of both the and models to the entire data set for the substitution estimator requires entropy assumptions on the fits and underlying true models. It is possible to violate this assumption by an overfit of the models of the DGD, and this can occur even when cross-validation is used to choose the resulting fits (though, this helps tremendously). One can generalize both the estimating equation approach and TMLE to estimators that do not need these entropy assumptions by inclusion of an additional layer of cross-validation. This has also been described as double-machine learning in the context of estimating equations ChernozhukovDDMLF2018, though it had previously been proposed as a way of robustifying the TMLE ZhengATCTN2010, LaanTLCI2011.

The standard error estimate for TMLE can be constructed by multiplying by the standard deviation of the plug-in efficient influence curve:

3.7.1 Cross-Validated Targeted Maximum Likelihood Estimation (CV-TMLE)

Though TMLE is a doubly robust and efficient estimator, its performance suffers when the initial estimator is too adaptive LaanTLCI2011. Intuitively, if the initial estimator of is overfit, there is not realistic residual variation left for the targeting step and the update is unable to reduce residual bias.

To address these shortcomings of TMLE, cross-validated targeted maximum likelihood estimation (CV-TMLE) was developed ZhengATCTN2010. This modified implementation of TMLE utilizes 10-fold cross-validation for the initial estimator to make TMLE more robust in its bias reduction step. The result is that one has greater leeway to use adaptive methods to estimate components of the DGD while keeping realistic residual variation in the validation sample.

Whereas CV-TMLE can add robustness by making the estimator consistent in a larger statistical model, there is still another way for finite sample performance issues to enter estimation. Specifically, if the data suffers from a lack of experimentation such that gets too close to 0 or 1, then the estimator can begin to suffer from the unstable inverse weighting in the targeting step, a violation “positivity”. There are simple methods to avoid this, by choosing a fixed truncation point, such as truncating the estimate of : , for some small (typical value is . However, there exists a more sophisticated method that does a type of model selection in estimating the model which prevents the update from hurting the fit of the model. This is an area of active research and several collaborative-TMLE (C-TMLE) estimators have been proposed, including adaptive selection of the truncation level  LaanTMLLD2006, JuAPSTJ2019a.

3.7.2 Collaborative Targeted Maximum Likelihood Estimation (C-TMLE)

Collaborative Targeted Maximum Likelihood Estimation (C-TMLE) is an extension of TMLE. In the version used for estimation in this study, it applies variable/model selection for nuisance parameter (e.g. the propensity score) estimation in a “collaborative” way, by directly optimizing the empirical metric on the causal estimator vanderLaanCDRTM2010. In this case, we used the original C-TMLE proposed by van der Laan and Gruber vanderLaanCDRTM2010, which is also called “the greedy C-TMLE algorithm”. It consists of two major steps: first, a sequence of candidate estimators of the nuisance parameter is constructed from a greedy forward stepwise selection procedure; second, cross-validation is used to select the candidate from this sequence which minimizes a criterion that incorporates a measure of bias and variance with respect to the targeted parameter vanderLaanCDRTM2010. More recent development on C-TMLE includes scalable variable-selection C-TMLE JuSCTLF2019 and glmnet-C-TMLE algorithm JuCLCPA2019, which might have improved computational efficiency in high-dimensional setting.

3.8 Computation

Our simulation study was coded in the statistical programming language RR_general. We used hal9001coyle2021hal9001-rpkg, HejaziHSHAS2020 and glmnetFriedmanRPGL2010 packages to generate the data via undersmoothed HAL. We used sl3coyle2021sl3, tmle3coyle2021tmle3-rpkg and ctmleJuGruber_ctmle packages to implement each of the estimators described above. To estimate the propensity score and the conditional expectation of the outcome, linear models, mean, GAM (general additive models) HastieGAMA1986, ranger

(random forest)

WrightRFIRM2017, glmnet (lasso), and XGBoostChenXSTBA2016 with different tuning parameters were used to form the SL library. For “Study 9”, we dropped GAM and ranger from the learner library to improve the computational efficiency. Ten-fold cross-validation was chosen by default of sl3 package for every SL fit. We used logistic regression meta-learner for propensity scores, and non-negative least squares meta-learner for estimating conditional expectation of the outcome. We truncated the propensity score estimates between for all estimators.

Theoretically, when constructing CV-TMLE, CV-IPTW and CV-A-IPTW estimators, we need to implement nested SL by adding one more layer of cross-validation. Namely, we first split the data, then fit the SL model (which itself uses a cross-validation) on the training set and make predictions on the validation set. Then we rotate the roles of the validation set and finally obtain a vector of cross-validated predictions of propensity scores and conditional expectations. As discussed above, we used the “Split Sequential SL” method proposed by Coyle CoyleCCTL2017.

After we estimated the relevant parts of the DGD separately for each of the data study data using undersmoothed HAL, the resulting fits were used to simulate data 500 times for each of the 10 studies. Details of the implementation, including the code, can be found in the GitHub repository: https://github.com/HaodongL/realistic_simu.git

4 Results

4.1 Undersmoothed HAL Models and The True Average Treatment Effect

We implemented undersmoothed HAL on the real data and used the fitted model to generate sample for each simulation. Details of each model and the resulting true ATE values are presented in Table 3.

StudyID n p TrueATE Model Undersmoothed Num.coef. Lambda norm
Q T 167 2.4e+01 5.6e-03
1 418 20 -0.0109 g T 180 7.2e-01 6.2e-02
Q T 1747 3.1e-02 4.5e-01
2 4863 26 0.0507 g T 124 3.9e-01 1.4e-02
Q T 1496 3.0e-02 1.9e-01
3 7399 22 0.0007 g T 6 2.6e+01 1.6e-03
Q T 503 2.3e+00 2.1e-02
4 1204 36 -0.0468 g T 5 3.8e+02 2.2e-06
Q T 448 4.5e+00 6.9e-03
5 2396 42 -0.0136 g T 15 1.8e+02 9.0e-06
Q T 2724 3.9e-01 7.6e-02
6 3265 18 0.2523 g T 497 1.1e+00 2.4e-02
Q T 2274 2.3e-02 1.7e+00
7 1931 38 -0.0310 g F 0 9.7e+01 0.0e+00
Q T 138 1.4e+00 2.1e-02
8 840 30 -0.0442 g F 0 1.1e+02 0.0e+00
Q T 3700 1.8e-01 3.1e-02
9 27275 42 0.0089 g T 102 2.7e+01 7.9e-06
Q T 503 1.2e+00 7.3e-03
10 5443 35 0.0203 g F 0 3.5e+03 0.0e+00
Table 3: Statistics of the HAL fits to the individual studies, including the sample size, dimension, and number of basis functions used for the treatment model () and the corresponding outcome model (), the corresponding lambda penalty and the resulting norm.

For Study 7, 8 and 10, the initial HAL fits of models contain no variables, so one is randomized as in a clinical trial. Thereby, the undersoomthing process for

model was omitted for these three studies, and the initial HAL models were used instead. This is not surprising since all ten studies were randomized controlled trials (RCT). Grouping categorical intervention variables into binary variables at data cleaning step might preserve or change the randomization. The remainder of the studies included basis functions in

and so are more akin to observational studies. For only these studies, we also compare the performance of the estimators above with the standard difference-in-means estimates, which is also provides consistent estimators for the ATE for these three data-generating distributions. On the other hand, the counts of non-zero coefficients (“Num.coef.” in Table 3) in the undersmoothed models are large for the remaining studies, and so, regardless of the original treatment mechanism that underlied these studies, these ones do not come from a simple treatment randomization model. The details on the variables included after undersmoothing can be found in Table 7 in the Appendix.

4.2 Estimators’ Performance

The results are shown in Figure 1 and Table 4. Variance dominates bias for all estimators and so contributes overwhelmingly to the mean squared error (MSE) and the relative MSE (rMSE), where rMSE was relative to the IPTW estimator’s MSE. Putting aside Study 1 for now, the MSE/rMSE results suggest that the A-IPTW generally is more efficient than the other estimators, the TMLE, CV-TMLE, CV-A-IPTW and C-TMLE with similar MSE to each other, and the IPTW and CV-IPTW having more erratic performance. The bar plots of the main performance metrics in Table 4 can be found in the Appendix (see Figure 1 - 5)

The 95% confidence interval (CI) coverage, however, shows different relative performance (Figure 1, Table 4). Taking 92.5% as the lower bound defining consistent coverage, then we can observe that: The CV-A-IPTW consistent coverage for all studies. The CV-TMLE and C-TMLE had consistent coverage for all studies except study 1. The TMLE and A-IPTW had coverage ranging from 90% to 95% for most studies. IPTW and CV-IPTW estimates of CI had very conservative coverage (close to 100%) for most studies.

To examine more closely issues of CI coverage, we removed the bias introduced by the estimation procedure for the standard error by using the true sample variance of each estimator (i.e. the sample variance of the estimator across 500 simulations) to derive the standard error (“Coverage2” in Table 4). The coverage of this CI is the oracle coverage one would obtain if one is given the true variance. For this measurement, both CV-TMLE and CV-A-IPTW achieved 95% coverage in all studies, followed by TMLE, C-TMLE, IPTW and CV-IPTW with 95% coverage for nine studies. A-IPTW had 95% coverage for eight studies.

* The black dots represent the estimator-specific medians across all ten studies.

* The reference line is 1 for rMSE, 0 for bias and 0.95 for coverage.

* The original rMSE value (14.698) of TMLE estimator for Study 1 was truncated at 6.3.

Figure 1: Dot plot of the main metrics of performance

“Variance” is the true sample variance; “MSE” is mean-squared error; “rMSE” is the relative (to the IPTW estimator in denominator) mean-squared error; “Coverage” is the coverage using 95% Wald-type confidence intervals (CI) based upon standard error estimates, where “Coverage2” uses the true sample variance; Finally “CI width” is the average width of the “Coverage” CI’s.

Method StudyID TrueATE Variance Bias MSE rMSE Coverage Coverage2 CIwidth
1 -0.0109 0.0056 -0.0373 0.0070 0.704 0.912 0.1658
2 0.0507 0.0005 -0.0012 0.0005 0.934 0.958 0.0849
3 0.0007 0.0003 0.0007 0.0003 0.954 0.948 0.0737
4 -0.0468 0.0019 0.0109 0.0020 0.928 0.950 0.1612
5 -0.0136 0.0020 -0.0046 0.0020 0.926 0.952 0.1640
6 0.2523 0.0010 -0.0266 0.0017 0.672 0.868 0.0829
7 -0.0310 0.0012 0.0093 0.0013 0.862 0.938 0.1098
8 -0.0442 0.0037 0.0037 0.0037 0.914 0.952 0.2112
9 0.0089 0.0001 -0.0005 0.0001 0.940 0.948 0.0362
A-IPTW 10 0.0203 0.0006 -0.0001 0.0006 0.954 0.954 0.0961
1 -0.0109 0.0219 -0.0947 0.0309 0.836 0.890 0.4956
2 0.0507 0.0006 0.0016 0.0006 0.956 0.954 0.0993
3 0.0007 0.0004 0.0018 0.0004 0.948 0.948 0.0782
4 -0.0468 0.0026 0.0046 0.0026 0.948 0.950 0.2005
5 -0.0136 0.0027 -0.0087 0.0027 0.928 0.950 0.1882
6 0.2523 0.0011 -0.0124 0.0012 0.942 0.944 0.1295
7 -0.0310 0.0025 0.0012 0.0025 0.936 0.948 0.1875
8 -0.0442 0.0049 -0.0014 0.0049 0.922 0.960 0.2524
9 0.0089 0.0001 -0.0008 0.0001 0.942 0.950 0.0409
C-TMLE 10 0.0203 0.0006 0.0007 0.0006 0.952 0.940 0.1037
1 -0.0109 0.0565 -0.0262 0.0572 0.926 0.954 0.7789
2 0.0507 0.0006 0.0031 0.0006 0.960 0.954 0.0985
3 0.0007 0.0004 0.0009 0.0004 0.966 0.950 0.0793
4 -0.0468 0.0025 0.0063 0.0025 0.956 0.948 0.2008
5 -0.0136 0.0024 -0.0062 0.0024 0.940 0.946 0.1881
6 0.2523 0.0012 -0.0045 0.0012 0.938 0.944 0.1301
7 -0.0310 0.0020 0.0030 0.0020 0.936 0.940 0.1737
8 -0.0442 0.0045 0.0000 0.0045 0.950 0.952 0.2553
9 0.0089 0.0001 -0.0002 0.0001 0.942 0.948 0.0394
CV-A-IPTW 10 0.0203 0.0006 0.0011 0.0006 0.962 0.944 0.1026
1 -0.0109 0.1632 -0.1129 0.1759 0.984 0.944 2.3263
2 0.0507 0.0008 -0.0012 0.0008 1.000 0.948 0.2686
3 0.0007 0.0005 0.0020 0.0005 1.000 0.954 0.1831
4 -0.0468 0.0032 0.0270 0.0040 1.000 0.936 0.5065
5 -0.0136 0.0028 -0.0202 0.0032 0.982 0.936 0.2817
6 0.2523 0.0014 -0.0057 0.0014 1.000 0.954 0.3305
7 -0.0310 0.0033 0.0017 0.0033 1.000 0.948 0.5111
8 -0.0442 0.0062 0.0006 0.0062 1.000 0.948 0.6842
9 0.0089 0.0001 0.0143 0.0003 0.998 0.756 0.1016
CV-IPTW 10 0.0203 0.0007 0.0006 0.0007 1.000 0.956 0.2451
1 -0.0109 0.1868 -0.0291 0.1876 0.590 0.938 0.7006
2 0.0507 0.0006 0.0031 0.0006 0.958 0.966 0.0985
3 0.0007 0.0004 0.0009 0.0004 0.966 0.958 0.0793
4 -0.0468 0.0025 0.0064 0.0025 0.956 0.954 0.2008
5 -0.0136 0.0024 -0.0063 0.0024 0.940 0.958 0.1881
6 0.2523 0.0013 0.0046 0.0013 0.936 0.958 0.1301
7 -0.0310 0.0020 0.0030 0.0020 0.938 0.946 0.1737
8 -0.0442 0.0044 0.0000 0.0044 0.950 0.964 0.2551
9 0.0089 0.0001 -0.0002 0.0001 0.942 0.952 0.0394
CV-TMLE 10 0.0203 0.0006 0.0011 0.0006 0.962 0.956 0.1026
1 -0.0109 0.0280 -0.0736 0.0334 0.890 0.932 0.5945
2 0.0507 0.0008 -0.0028 0.0008 1.000 0.948 0.2544
3 0.0007 0.0005 0.0022 0.0005 1.000 0.954 0.1789
4 -0.0468 0.0033 0.0276 0.0040 1.000 0.926 0.4889
5 -0.0136 0.0028 -0.0201 0.0032 0.978 0.940 0.2712
6 0.2523 0.0014 -0.0091 0.0015 0.998 0.942 0.2536
7 -0.0310 0.0033 0.0023 0.0033 1.000 0.946 0.4838
8 -0.0442 0.0062 0.0003 0.0062 1.000 0.950 0.6504
9 0.0089 0.0001 0.0172 0.0004 0.992 0.684 0.0990
IPTW 10 0.0203 0.0007 0.0007 0.0007 1.000 0.958 0.2410
1 -0.0109 0.3860 -0.3235 0.4906 0.100 0.920 0.1681
2 0.0507 0.0006 0.0005 0.0006 0.932 0.960 0.0849
3 0.0007 0.0004 0.0007 0.0004 0.946 0.948 0.0737
4 -0.0468 0.0020 0.0099 0.0021 0.916 0.950 0.1611
5 -0.0136 0.0022 -0.0052 0.0022 0.922 0.948 0.1640
6 0.2523 0.0010 -0.0015 0.0010 0.806 0.942 0.0828
7 -0.0310 0.0013 0.0079 0.0014 0.852 0.932 0.1098
8 -0.0442 0.0040 0.0020 0.0040 0.900 0.952 0.2111
9 0.0089 0.0001 -0.0003 0.0001 0.936 0.948 0.0362
TMLE 10 0.0203 0.0006 0.0001 0.0006 0.952 0.954 0.0961
Table 4: Performance of targeted learning and estimating equation estimators by study within the HAL-based simulations. * (continued)

The simulations suggest, across 10 realistic data-generating distributions, that CV-A-IPTW, CV-TMLE and C-TMLE has overall relatively good performance in terms of MSE and reliable 95% coverage. The A-IPTW estimator had superior MSE-based performance, though the confidence interval coverage was sometimes between 90% and 95%. However, plugging in the true standard deviation of the A-IPTW estimator instead of the plug-in influence-curve based one typically used resulted in good coverage. This suggests more robust standard error (SE) estimators could make it a more compelling choice than the empirical performance in these simulations. In addition, CV-A-IPTW can improve the coverage of A-IPTW in most cases, but, due to the estimator being consistent in a bigger model, will have bigger MSE. The results at least show that the CV-TMLE as implemented in the tmle3 package coyle2021tmle3-rpkg can provide robust inferences, suggesting using it “off the shelf” provides reliable results. In next section, we will discuss situations where even the CV-TMLE under-performed, potentially because of small sample size and related empirical positivity violations PetersenDRVPF2012.

4.3 Exploration on Positivity Violation

We now consider Study 1, where the TMLE and CV-TMLE had significantly anti-conservative coverage. In this case, certainly one cause appears to insufficient experimentation of treatment within some covariate groups. Specifically, consider Figure 2, which shows the distributions of the adjustment variable, W_perdiar24 in Study 1. As one can see, there are large differences in the marginal distribution of this covariate; in fact, a fit without smoothing would result in a perfect positivity violation. However, given the variance-bias trade-off resulting in the estimators, it is possible that these empirical violations are smoothed over. A potential consequence of this positivity violation is that the resulting estimator, for the parameter which requires support in the data, will be unstable and biased.

Figure 2: Distributions of W_perdiar24 in Study 1 by intervention group

Table 5 shows the performance of estimators before and after dropping the variable W_perdiar24 in Study 1. We can observe that all estimators can benefit from removing the problematic variable in terms of higher coverage or lower MSE.

Method Dropperdiar TrueATE Variance Bias MSE Coverage Coverage2 CIwidth
No -0.0109 0.0056 -0.0373 0.0070 0.704 0.912 0.1658
A-IPTW Yes 0.0104 0.0084 -0.0010 0.0084 0.908 0.944 0.3117
No -0.0109 0.0219 -0.0947 0.0309 0.836 0.890 0.4956
C-TMLE Yes 0.0104 0.0171 0.0020 0.0171 0.930 0.952 0.4829
No -0.0109 0.0565 -0.0262 0.0572 0.926 0.954 0.7789
CV-A-IPTW Yes 0.0104 0.0131 0.0027 0.0131 0.950 0.940 0.4604
No -0.0109 0.1632 -0.1129 0.1759 0.984 0.944 2.3263
CV-IPTW Yes 0.0104 0.0198 0.0097 0.0199 1.000 0.948 1.3311
No -0.0109 0.1868 -0.0291 0.1876 0.590 0.938 0.7006
CV-TMLE Yes 0.0104 0.0132 0.0030 0.0132 0.950 0.946 0.4600
No -0.0109 0.0280 -0.0736 0.0334 0.890 0.932 0.5945
IPTW Yes 0.0104 0.0202 0.0113 0.0203 1.000 0.948 1.2379
No -0.0109 0.3860 -0.3235 0.4906 0.100 0.920 0.1681
TMLE Yes 0.0104 0.0096 -0.0006 0.0096 0.878 0.942 0.3116
Table 5: Estimators’ performance with/without W_perdiar24 in Study 1 to show the impact of one covariate on performance due to positivity violations. Columns are defined as in table 4.

4.4 Estimators’ Efficiency in Randomized Experiment Setting

As mentioned in earlier section, the initial HAL models for propensity score include no variables for Study 7, 8 and 10, which leads to randomized experiments in the corresponding simulations. In these cases, we add the “difference-in-means” estimator (i.e. ) with its variance estimator proposed by Neyman in 1923 Splawa-NeymanAPTAN1990. Table 6 shows that the CV-TMLE and CV-A-IPTW estimators still gain efficiency in the randomized experiments setting. This is consistent with proposals for using doubly robust estimators of the ATE in randomized trials if there are informative covariates that can increase efficiency over simple, unadjusted estimates MooreCARTJ2009, tsiatis2008covariate.

StudyID Method TrueATE Variance Bias MSE Coverage Coverage2 CIwidth
CV-A-IPTW -0.0310 0.0020 0.0030 0.0020 0.936 0.940 0.1737
CV-TMLE -0.0310 0.0020 0.0030 0.0020 0.938 0.946 0.1737
7 Diff-in-Mean -0.0310 0.0036 0.0021 0.0036 0.944 0.942 0.2435
CV-A-IPTW -0.0442 0.0045 0.0000 0.0045 0.950 0.952 0.2553
CV-TMLE -0.0442 0.0044 0.0000 0.0044 0.950 0.964 0.2551
8 Diff-in-Mean -0.0442 0.0068 -0.0001 0.0068 0.942 0.952 0.3115
CV-A-IPTW 0.0203 0.0006 0.0011 0.0006 0.962 0.944 0.1026
CV-TMLE 0.0203 0.0006 0.0011 0.0006 0.962 0.956 0.1026
10 Diff-in-Mean 0.0203 0.0008 0.0007 0.0008 0.964 0.948 0.1167
Table 6: Relative performance of the two CV-estimators with a simple difference in means in the context of the three studies for which treatment was unrelated to covariates (thus equivalent to randomized clinical trial). Columns are defined as in table 4.

5 Conclusion

The ultimate goal of studies, such as ours, is to move incrementally towards algorithms that can take information on the design, causal model and known constrains in order to produce a data-adaptively optimized estimator without relying on arbitrary model assumptions. Asymptotic theory can provide guidance on some of the choices, but asymptotic efficiency is not a guarantee for superior performance in finite samples. Thus, simulation studies that are based on realistic DGD’s are invaluable for both evaluating estimators and modifying them to increase finite-sample robustness. We provided results supporting the use of a strategically undersmoothed HAL for estimating the relevant components of the DGD in data-driven simulations. Though much remains unresolved, such an approach could be an approach for generating synthetic data mannino2019real.

Our results suggest that if accurate inferences are the highest priority, then the CV-A-IPTW, CV-TMLE, and C-TMLE are good choices for providing robust inferences. Specifically, the results suggest that CV-TMLE might serve as an “off the shelf” algorithm given that 1) it is an asymptotically linear estimator; 2) it is consistent in a large class of statistical models; 3) it allows for the use of aggressive ensemble learning, while protecting the final performance of the estimator with an outer layer of cross-validation; 4) its influence-curve-based standard error combined with the well-behaved (normal) distribution of the estimator results in near perfect coverage for all but one of the studies used. Our results also suggest that modifications to the algorithms for other estimators (such as improving the SE estimator for the A-IPTW) would result in an estimator with acceptable CI coverage and relatively low MSE. We also suggest one basis for deciding which estimator to use for particular data is to perform a similar simulation study for the data based upon fitting the undersmoothed HAL to derive the DGD. Then, one could choose to report the results from the estimator that provided the most reliable performance in such a simulation study. Of course, this is itself a form of over-fitting, since it uses the data both for estimator selection and for reporting the results of that estimator applied to the original data. However, it seems better than applying an arbitrary estimator and hoping that the advertised asymptotic performance matches the performance on the data of interest. Finally, our results support the observations that careful use of covariate information can be used to gain efficiency in the randomized experiment setting.


This research was financially supported by a global development grant (OPP1165144) from the Bill & Melinda Gates Foundation to the University of California, Berkeley, CA, USA. We would also like to thank the following collaborators on the included cohorts and trials for their contributions to study planning, data collection, and analysis: Muhammad Sharif, Sajjad Kerio, Ms. Urosa, Ms. Alveen, Shahneel Hussain, Vikas Paudel (Mother and Infant Research Activities), Anthony Costello (University College London), Noel Rouamba, Jean-Bosco Ouédraogo, Leah Prince, Stephen A Vosti, Benjamin Torun, Lindsey M Locks, Christine M McDonald, Roland Kupka, Ronald J Bosch, Rodrick Kisenge, Said Aboud, Molin Wang, Azaduzzaman, Abu Ahmed Shamim, Rezaul Haque, Rolf Klemm, Sucheta Mehra, Maithilee Mitra, Kerry Schulze, Sunita Taneja, Brinda Nayyar, Vandana Suri, Poonam Khokhar, Brinda Nayyar, Poonam Khokhar, Jon E Rohde, Tivendra Kumar, Jose Martines, Maharaj K Bhan, and all other members of the study staffs and field teams. We would also like to thank all study participants and their families for their important contributions. We are grateful to the LCNI5 and iLiNS research teams, participants and people of Lungwena, Namwera, Mangochi and Malindi, our research assistants for their positive attitude, support, and help in all stages of the studies.

In addition, this research used the Savio computational cluster resource provided by the Berkeley Research Computing program at the University of California, Berkeley (supported by the UC Berkeley Chancellor, Vice Chancellor for Research, and Chief Information Officer). The authors would like to further thank the university and the Savio group for providing computational resources.

Author contributions

Conceptualization: A.E.H., H.L., S.R.
Funding Acquisition: J.M.C., A.E.H., M.J.V., B.F.A.
Data curation: A.M., J.B., J.C.
Formal analyses: H.L., S.R.
Methodology: H.L., S.R., A.E.H., M.J.V.
Visualization: H.L., S.R., J.C.
Writing – Original Draft Preparation: H.L., S.R.
Writing – Review & Editing: A.E.H., H.L., R.V.P., N.H., I.M., B.F.A., J.B.

Financial disclosure

None reported.

Conflict of interest

The authors declare no potential conflict of interests.

Data availability

The data used in this analysis was held by Bill & Melinda Gates Foundation in a repository. The sensitive information contained in the data was still considered theoretically identifiable and can not be released to the public at this point, with the exception of the WASH Benefits trials. We provide the data from "WASH Benefits Bangladesh"LubyEWQSM2018 (Study 2) and "WASH Benefits Kenya"StewartEWQSA2018 (Study 3) as example data sets in the GitHub repository: https://github.com/HaodongL/realistic_simu.git

Appendix A . Supplemental figures and tables

StudyID AllCovariates IncludedCovariates
1 agedays, sex, month, brthmon, hdlvry, enstunt, enwast, W_mage, W_fage, W_meducyrs, W_feducyrs, W_nhh, W_parity, W_perdiar24, W_mhtcm, W_fhtcm, delta_W_mhtcm, delta_W_fhtcm, a agedays*sex*W_fhtcm, agedays*month*brthmon, agedays*month*enstunt, agedays*month*W_mage, agedays*month*W_meducyrs, agedays*month*W_feducyrs, agedays*month*W_mhtcm, agedays*brthmon*enstunt, agedays*brthmon*W_mage, agedays*brthmon*W_fage, agedays*brthmon*W_meducyrs, agedays*brthmon*W_feducyrs, agedays*brthmon*W_nhh, agedays*brthmon*W_fhtcm, agedays*hdlvry*W_mhtcm, agedays*hdlvry*W_fhtcm, agedays*enstunt*W_fage, agedays*enstunt*W_feducyrs, agedays*enstunt*W_mhtcm, agedays*enstunt*W_fhtcm, agedays*enwast*W_feducyrs, agedays*enwast*W_fhtcm, agedays*W_mage*W_fage, agedays*W_mage*W_meducyrs, agedays*W_mage*W_feducyrs, agedays*W_mage*W_nhh, agedays*W_mage*W_mhtcm, agedays*W_mage*W_fhtcm, agedays*W_fage*W_meducyrs, agedays*W_fage*W_feducyrs, agedays*W_fage*W_parity, agedays*W_fage*W_fhtcm, agedays*W_meducyrs*W_feducyrs, agedays*W_meducyrs*W_fhtcm, agedays*W_feducyrs*W_parity, agedays*W_feducyrs*W_perdiar24, agedays*W_feducyrs*W_mhtcm, agedays*W_feducyrs*W_fhtcm, agedays*W_feducyrs*a, agedays*W_parity*W_mhtcm, agedays*W_parity*W_fhtcm, agedays*W_perdiar24*W_fhtcm, W_mage*W_fage*W_mhtcm, W_fage*W_mhtcm*W_fhtcm, agedays*month*hdlvry, agedays*month*W_fage, agedays*month*W_nhh, agedays*month*W_parity, agedays*month*W_fhtcm, agedays*brthmon*W_parity, agedays*brthmon*W_mhtcm, agedays*W_mage*W_parity, agedays*W_fage*W_nhh, agedays*W_fage*W_mhtcm, agedays*W_fage*a, agedays*W_meducyrs*W_nhh, agedays*W_nhh*W_mhtcm, agedays*W_nhh*W_fhtcm, agedays*W_mhtcm*W_fhtcm, agedays*W_mhtcm*a
2 sex, month, brthmon, hfoodsec, enstunt, agedays, W_meducyrs, W_nhh, W_mage, W_mhtcm, W_mwtkg, W_mbmi, enwast, impsan, W_feducyrs, W_parity, delta_W_mage, delta_W_mhtcm, delta_W_mwtkg, delta_W_mbmi, delta_enwast, delta_impsan, delta_W_feducyrs, delta_W_parity, a sex*agedays, month*agedays, month*W_mhtcm, brthmon*agedays, brthmon*W_meducyrs, hfoodsec*agedays, hfoodsec*W_mage, enstunt*agedays, enstunt*W_mhtcm, enstunt*W_mwtkg, enstunt*W_feducyrs, agedays*W_meducyrs, agedays*W_nhh, agedays*W_mage, agedays*W_mhtcm, agedays*W_mwtkg, agedays*W_mbmi, agedays*enwast, agedays*impsan, agedays*W_feducyrs, agedays*W_parity, agedays*delta_W_mage, agedays*delta_W_mhtcm, agedays*delta_W_mwtkg, agedays*delta_W_mbmi, agedays*delta_enwast, agedays*delta_impsan, agedays*delta_W_feducyrs, agedays*delta_W_parity, agedays*a, W_meducyrs*W_mhtcm, W_meducyrs*W_parity, W_mage*W_mwtkg, W_mage*a, W_mhtcm*W_mwtkg, W_mhtcm*W_mbmi, W_mhtcm*enwast, W_mwtkg*enwast, W_mwtkg*W_feducyrs, agedays, sex*W_mage, sex*W_mhtcm, month*W_nhh, month*W_mage, month*W_mwtkg, month*W_feducyrs, brthmon*W_nhh, brthmon*W_mage, brthmon*W_mhtcm, brthmon*W_mwtkg, brthmon*W_feducyrs, hfoodsec*W_meducyrs, hfoodsec*W_mwtkg, W_meducyrs*W_mage, W_meducyrs*W_mwtkg, W_meducyrs*W_feducyrs, W_nhh*W_mage, W_nhh*W_mhtcm, W_nhh*W_mwtkg, W_nhh*W_feducyrs, W_mage*W_mhtcm, W_mhtcm*W_feducyrs, W_mhtcm*W_parity, W_mwtkg*W_mbmi, W_mwtkg*W_parity, W_mbmi*W_feducyrs
3 agedays, sex, month, brthmon, enstunt, enwast, cleanck, impfloor, W_mage, W_mhtcm, W_meducyrs, W_nhh, impsan, delta_cleanck, delta_impfloor, delta_W_mage, delta_W_mhtcm, delta_W_meducyrs, delta_W_nhh, delta_impsan, a agedays*sex, agedays*month, agedays*brthmon, agedays*enstunt, agedays*enwast, agedays*cleanck, agedays*impfloor, agedays*W_mage, agedays*W_mhtcm, agedays*W_meducyrs, agedays*W_nhh, agedays*impsan, agedays*delta_cleanck, agedays*delta_W_nhh, agedays*delta_impsan, agedays*a, month*W_mhtcm, brthmon*W_mhtcm, enstunt*W_mage, enstunt*W_mhtcm, impfloor*W_mhtcm, W_mage*W_mhtcm, W_mhtcm*W_meducyrs, W_mhtcm*W_nhh, W_mhtcm*impsan, W_mhtcm*delta_W_mage, W_mhtcm*delta_W_mhtcm, W_nhh*a, agedays, agedays*delta_impfloor, agedays*delta_W_mage, agedays*delta_W_mhtcm, agedays*delta_W_meducyrs, month*W_mage, month*W_meducyrs, month*W_nhh, brthmon*W_mage, brthmon*W_nhh, W_mage*W_meducyrs, W_mage*W_nhh, W_meducyrs*W_nhh
4 agedays, sex, month, brthmon, enstunt, single, W_gagebrth, W_mage, W_meducyrs, W_feducyrs, W_parity, hhwealth_quart, enwast, vagbrth, hdlvry, earlybf, hfoodsec, W_birthwt, W_mhtcm, W_mwtkg, W_mbmi, W_fage, impsan, delta_enwast, delta_vagbrth, delta_hdlvry, delta_earlybf, delta_hfoodsec, delta_W_birthwt, delta_W_mhtcm, delta_W_mwtkg, delta_W_mbmi, delta_W_fage, delta_impsan, a agedays*sex, agedays*brthmon, agedays*enstunt, agedays*W_gagebrth, agedays*W_birthwt, agedays*W_mhtcm, agedays*W_fage, agedays*delta_hdlvry, agedays*delta_hfoodsec, agedays*a, sex*W_birthwt, brthmon*W_birthwt, enstunt*W_birthwt, single*W_birthwt, W_gagebrth*W_birthwt, W_gagebrth*W_mwtkg, W_mage*W_birthwt, W_meducyrs*W_birthwt, W_feducyrs*W_birthwt, W_parity*W_birthwt, hhwealth_quart*W_birthwt, enwast*W_birthwt, vagbrth*W_birthwt, hdlvry*W_birthwt, earlybf*W_birthwt, hfoodsec*W_birthwt, W_birthwt*W_mhtcm, W_birthwt*W_mbmi, W_birthwt*W_fage, W_birthwt*impsan, W_birthwt*delta_enwast, W_birthwt*delta_hdlvry, W_birthwt*delta_earlybf, W_birthwt*delta_W_birthwt, agedays*month, agedays*W_mage, agedays*W_meducyrs, agedays*W_feducyrs, agedays*W_parity, agedays*hhwealth_quart, agedays*hfoodsec, agedays*W_mwtkg, agedays*W_mbmi, month*W_birthwt, W_birthwt*W_mwtkg, W_birthwt*delta_hfoodsec, W_birthwt*delta_W_fage
5 agedays, sex, month, brthmon, enstunt, anywast06, enwast, vagbrth, hdlvry, single, nchldlt5, hhwealth_quart, pers_wast, W_gagebrth, W_birthwt, W_mage, W_mhtcm, W_mwtkg, W_mbmi, W_meducyrs, W_feducyrs, W_nchldlt5, W_parity, delta_enwast, delta_vagbrth, delta_hdlvry, delta_single, delta_nchldlt5, delta_hhwealth_quart, delta_pers_wast, delta_W_gagebrth, delta_W_birthwt, delta_W_mage, delta_W_mhtcm, delta_W_mwtkg, delta_W_mbmi, delta_W_meducyrs, delta_W_feducyrs, delta_W_nchldlt5, delta_W_parity, a agedays*W_gagebrth, agedays*W_birthwt, agedays*W_mhtcm, agedays*W_meducyrs, sex*W_birthwt, brthmon*W_birthwt, enstunt*W_birthwt, anywast06*W_birthwt, enwast*W_birthwt, vagbrth*W_birthwt, nchldlt5*W_birthwt, hhwealth_quart*W_birthwt, pers_wast*W_birthwt, W_gagebrth*W_birthwt, W_birthwt*W_mage, W_birthwt*W_mhtcm, W_birthwt*W_mwtkg, W_birthwt*W_meducyrs, W_birthwt*W_feducyrs, W_birthwt*W_nchldlt5, W_birthwt*W_parity, W_birthwt*delta_hdlvry, W_birthwt*delta_single, W_birthwt*delta_pers_wast, W_birthwt*delta_W_birthwt, W_birthwt*delta_W_mage, W_birthwt*delta_W_parity, W_birthwt*a, agedays*month, agedays*brthmon, agedays*W_mage, agedays*W_mwtkg, agedays*W_mbmi, agedays*W_feducyrs, month*W_birthwt, W_gagebrth*W_mhtcm, W_birthwt*W_mbmi, W_birthwt*delta_W_gagebrth
6 agedays, sex, month, brthmon, enstunt, enwast, W_mage, W_mhtcm, W_mwtkg, W_mbmi, W_nchldlt5, delta_W_mage, delta_W_mhtcm, delta_W_mwtkg, delta_W_mbmi, delta_W_nchldlt5, a agedays*W_mage, agedays*W_mhtcm, agedays*W_mbmi, W_mhtcm*W_mwtkg, agedays*sex*month, agedays*sex*brthmon, agedays*sex*W_mage, agedays*sex*W_mhtcm, agedays*month*brthmon, agedays*month*enstunt, agedays*month*W_mage, agedays*month*W_mhtcm, agedays*month*W_mwtkg, agedays*month*W_mbmi, agedays*month*W_nchldlt5, agedays*month*delta_W_nchldlt5, agedays*brthmon*enstunt, agedays*brthmon*W_mage, agedays*brthmon*W_mhtcm, agedays*brthmon*W_mwtkg, agedays*brthmon*W_mbmi, agedays*brthmon*W_nchldlt5, agedays*brthmon*delta_W_mhtcm, agedays*brthmon*a, agedays*enstunt*W_mage, agedays*enstunt*W_mhtcm, agedays*enstunt*W_mwtkg, agedays*enstunt*W_mbmi, agedays*enstunt*W_nchldlt5, agedays*enstunt*a, agedays*enwast*W_mhtcm, agedays*enwast*W_mwtkg, agedays*enwast*a, agedays*W_mage*W_mhtcm, agedays*W_mage*W_mwtkg, agedays*W_mage*W_mbmi, agedays*W_mage*a, agedays*W_mhtcm*W_mwtkg, agedays*W_mhtcm*W_mbmi, agedays*W_mhtcm*W_nchldlt5, agedays*W_mhtcm*delta_W_mage, agedays*W_mhtcm*delta_W_mhtcm, agedays*W_mhtcm*delta_W_mwtkg, agedays*W_mhtcm*delta_W_mbmi, agedays*W_mhtcm*delta_W_nchldlt5, agedays*W_mhtcm*a, agedays*W_mwtkg*W_mbmi, agedays*W_mwtkg*W_nchldlt5, agedays*W_mwtkg*delta_W_mage, agedays*W_mwtkg*delta_W_mhtcm, agedays*W_mwtkg*delta_W_mwtkg, agedays*W_mwtkg*delta_W_mbmi, agedays*W_mwtkg*delta_W_nchldlt5, agedays*W_mbmi*W_nchldlt5, agedays*W_nchldlt5*a, sex*brthmon*W_mhtcm, sex*W_mage*W_mwtkg, month*W_mage*W_mhtcm, month*W_mhtcm*W_mwtkg, brthmon*enstunt*W_mhtcm, brthmon*W_mage*W_mhtcm, brthmon*W_mage*W_mwtkg, brthmon*W_mhtcm*W_mwtkg, enstunt*W_mage*W_mhtcm, enstunt*W_mhtcm*W_mwtkg, enwast*W_mage*W_mhtcm, enwast*W_mhtcm*W_mwtkg, W_mage*W_mhtcm*W_mwtkg, W_mage*W_mhtcm*W_nchldlt5, W_mage*W_mwtkg*W_mbmi, W_mhtcm*W_mwtkg*W_mbmi, W_mhtcm*W_mwtkg*W_nchldlt5, agedays*W_mwtkg, agedays*sex*enwast, agedays*sex*W_mwtkg, agedays*sex*W_mbmi, agedays*sex*W_nchldlt5, agedays*month*enwast, agedays*month*delta_W_mage, agedays*month*delta_W_mhtcm, agedays*month*delta_W_mwtkg, agedays*month*delta_W_mbmi, agedays*month*a, agedays*brthmon*enwast, agedays*enwast*W_mage, agedays*enwast*W_mbmi, agedays*enwast*W_nchldlt5, agedays*W_mage*W_nchldlt5, agedays*W_mwtkg*a, sex*W_mhtcm*W_mwtkg, month*brthmon*W_mwtkg, month*W_mage*W_mwtkg, month*W_mage*W_mbmi, month*W_mwtkg*W_mbmi, brthmon*W_mwtkg*W_mbmi, W_mage*W_mhtcm*W_mbmi, W_mage*W_mwtkg*W_nchldlt5
7 agedays, sex, month, brthmon, enstunt, enwast, single, cleanck, hfoodsec, W_mage, W_mhtcm, W_mwtkg, hhwealth_quart, safeh20, W_mbmi, W_fage, W_meducyrs, W_feducyrs, W_nrooms, W_nchldlt5, impsan, delta_single, delta_cleanck, delta_hfoodsec, delta_W_mage, delta_W_mhtcm, delta_W_mwtkg, delta_hhwealth_quart, delta_safeh20, delta_W_mbmi, delta_W_fage, delta_W_meducyrs, delta_W_feducyrs, delta_W_nrooms, delta_W_nchldlt5, delta_impsan, a agedays*sex, agedays*month, agedays*brthmon, agedays*enstunt, agedays*enwast, agedays*single, agedays*cleanck, agedays*hfoodsec, agedays*W_mage, agedays*W_mhtcm, agedays*W_mwtkg, agedays*hhwealth_quart, agedays*W_mbmi, agedays*W_fage, agedays*W_meducyrs, agedays*W_feducyrs, agedays*W_nrooms, agedays*W_nchldlt5, agedays*impsan, agedays*delta_single, agedays*delta_cleanck, agedays*delta_hfoodsec, agedays*delta_W_mage, agedays*delta_W_mhtcm, agedays*delta_W_mwtkg, agedays*delta_hhwealth_quart, agedays*delta_safeh20, agedays*delta_W_mbmi, agedays*delta_W_fage, agedays*delta_W_meducyrs, agedays*delta_W_feducyrs, agedays*delta_W_nrooms, agedays*delta_W_nchldlt5, agedays*delta_impsan, agedays*a, sex*month, month*single, month*W_mhtcm, month*W_mwtkg, month*hhwealth_quart, brthmon*enstunt, brthmon*W_mwtkg, brthmon*safeh20, brthmon*W_fage, brthmon*W_feducyrs, brthmon*W_nrooms, brthmon*W_nchldlt5, enstunt*W_mhtcm, enstunt*W_mbmi, enstunt*W_fage, enstunt*W_feducyrs, single*W_mhtcm, hfoodsec*W_mhtcm, hfoodsec*hhwealth_quart, hfoodsec*W_meducyrs, W_mage*W_mhtcm, W_mage*W_mwtkg, W_mage*hhwealth_quart, W_mage*W_fage, W_mage*W_meducyrs, W_mage*W_nchldlt5, W_mage*delta_W_nrooms, W_mhtcm*W_mwtkg, W_mhtcm*hhwealth_quart, W_mhtcm*W_mbmi, W_mhtcm*W_fage, W_mhtcm*W_nrooms, W_mhtcm*W_nchldlt5, W_mhtcm*delta_impsan, W_mwtkg*hhwealth_quart, W_mwtkg*W_fage, W_mwtkg*W_meducyrs, W_mwtkg*W_feducyrs, W_mwtkg*W_nchldlt5, W_mwtkg*impsan, W_mwtkg*delta_W_nrooms, W_mwtkg*delta_W_nchldlt5, W_mwtkg*a, W_fage*W_feducyrs, W_fage*W_nrooms, W_meducyrs*W_nrooms, W_meducyrs*W_nchldlt5, W_feducyrs*delta_W_nrooms, W_feducyrs*a, agedays, agedays*safeh20, sex*W_mwtkg, sex*W_fage, month*brthmon, month*W_mage, month*W_mbmi, month*W_fage, month*W_meducyrs, month*W_feducyrs, month*W_nrooms, brthmon*hfoodsec, brthmon*W_mage, brthmon*W_mhtcm, brthmon*W_mbmi, brthmon*W_meducyrs, enstunt*W_mage, enstunt*W_mwtkg, enwast*W_mage, single*W_mage, single*W_mwtkg, hfoodsec*W_mage, hfoodsec*W_mwtkg, hfoodsec*W_fage, W_mage*W_mbmi, W_mage*W_feducyrs, W_mage*W_nrooms, W_mhtcm*W_meducyrs, W_mhtcm*W_feducyrs, W_mwtkg*W_mbmi, W_mwtkg*W_nrooms, W_mwtkg*delta_impsan, hhwealth_quart*W_fage, hhwealth_quart*W_meducyrs, hhwealth_quart*W_feducyrs, W_mbmi*W_fage, W_mbmi*W_meducyrs, W_fage*W_meducyrs, W_fage*W_nchldlt5, W_meducyrs*W_feducyrs, W_feducyrs*W_nchldlt5
8 agedays, sex, month, brthmon, enstunt, enwast, W_mage, W_mhtcm, W_mwtkg, W_mbmi, W_meducyrs, W_feducyrs, W_nchldlt5, W_nhh, impsan, hhwealth_quart, safeh20, delta_W_mage, delta_W_mhtcm, delta_W_mwtkg, delta_W_mbmi, delta_W_meducyrs, delta_W_feducyrs, delta_W_nchldlt5, delta_W_nhh, delta_impsan, delta_hhwealth_quart, delta_safeh20, a agedays*sex, agedays*month, agedays*brthmon, agedays*enstunt, agedays*enwast, agedays*W_mage, agedays*W_mhtcm, agedays*W_mbmi, agedays*W_meducyrs, agedays*W_feducyrs, agedays*W_nchldlt5, agedays*W_nhh, agedays*hhwealth_quart, agedays*delta_W_mage, agedays*delta_W_feducyrs, agedays*delta_W_nchldlt5, W_mage*W_feducyrs, W_mhtcm*W_mwtkg, agedays*W_mwtkg, agedays*safeh20, agedays*a
9 agedays, sex, month, brthmon, enstunt, W_parity, vagbrth, impfloor, earlybf, hfoodsec, enwast, hhwealth_quart, safeh20, W_gagebrth, W_birthwt, W_birthlen, W_mage, W_mhtcm, W_nrooms, W_meducyrs, W_feducyrs, W_nchldlt5, impsan, delta_vagbrth, delta_impfloor, delta_earlybf, delta_hfoodsec, delta_enwast, delta_hhwealth_quart, delta_safeh20, delta_W_gagebrth, delta_W_birthwt, delta_W_birthlen, delta_W_mage, delta_W_mhtcm, delta_W_nrooms, delta_W_meducyrs, delta_W_feducyrs, delta_W_nchldlt5, delta_impsan, a agedays*month, agedays*brthmon, agedays*enstunt, agedays*W_parity, agedays*earlybf, agedays*enwast, agedays*W_gagebrth, agedays*W_birthwt, agedays*W_birthlen, agedays*W_mage, agedays*W_mhtcm, agedays*W_nrooms, agedays*W_meducyrs, agedays*W_feducyrs, agedays*W_nchldlt5, agedays*impsan, agedays*delta_enwast, agedays*delta_W_gagebrth, agedays*delta_W_birthwt, agedays*delta_W_meducyrs, sex*W_birthwt, month*W_birthwt, brthmon*W_birthwt, enstunt*W_birthwt, W_parity*W_birthwt, vagbrth*W_birthwt, impfloor*W_birthwt, earlybf*W_birthwt, hfoodsec*W_birthwt, enwast*W_birthwt, hhwealth_quart*W_birthwt, safeh20*W_birthwt, W_gagebrth*W_birthwt, W_gagebrth*W_birthlen, W_gagebrth*W_mage, W_gagebrth*W_mhtcm, W_birthwt*W_birthlen, W_birthwt*W_mage, W_birthwt*W_mhtcm, W_birthwt*W_nrooms, W_birthwt*W_meducyrs, W_birthwt*W_feducyrs, W_birthwt*W_nchldlt5, W_birthwt*delta_vagbrth, W_birthwt*delta_hfoodsec, W_birthwt*delta_enwast, W_birthwt*delta_W_birthlen, W_birthwt*delta_W_feducyrs, W_birthwt*delta_impsan, W_birthwt*a, W_mhtcm*W_meducyrs, W_birthwt, agedays*sex, agedays*impfloor, agedays*hfoodsec, agedays*hhwealth_quart, agedays*delta_impfloor, agedays*delta_earlybf, agedays*delta_hfoodsec, agedays*delta_hhwealth_quart, agedays*delta_W_birthlen, agedays*delta_W_mhtcm, agedays*delta_W_feducyrs, agedays*a, month*W_gagebrth, brthmon*W_gagebrth, W_gagebrth*W_meducyrs, W_gagebrth*W_feducyrs, W_birthwt*impsan, W_birthwt*delta_earlybf, W_birthwt*delta_W_gagebrth, W_birthwt*delta_W_mhtcm, W_birthwt*delta_W_meducyrs, W_mage*W_feducyrs, W_mhtcm*W_feducyrs
10 agedays, sex, month, brthmon, earlybf, enstunt, W_perdiar24, vagbrth, hdlvry, impfloor, hfoodsec, enwast, hhwealth_quart, safeh20, W_birthwt, W_birthlen, W_meducyrs, W_nrooms, W_feducyrs, impsan, delta_vagbrth, delta_hdlvry, delta_impfloor, delta_hfoodsec, delta_enwast, delta_hhwealth_quart, delta_safeh20, delta_W_birthwt, delta_W_birthlen, delta_W_meducyrs, delta_W_nrooms, delta_W_feducyrs, delta_impsan, a agedays*enstunt, agedays*hfoodsec, agedays*W_birthwt, agedays*W_birthlen, agedays*W_meducyrs, agedays*W_nrooms, agedays*delta_enwast, sex*W_birthwt, month*W_birthwt, brthmon*W_birthwt, earlybf*W_birthwt, enstunt*W_birthwt, vagbrth*W_birthwt, hdlvry*W_birthwt, impfloor*W_birthwt, hhwealth_quart*W_birthwt, W_birthwt*W_birthlen, W_birthwt*W_meducyrs, W_birthwt*W_nrooms, W_birthwt*W_feducyrs, W_birthwt*impsan, W_birthwt*delta_vagbrth, W_birthwt*delta_hfoodsec, W_birthwt*delta_W_feducyrs, W_birthwt*a, agedays*month, agedays*brthmon, agedays*hhwealth_quart, agedays*W_feducyrs, hfoodsec*W_birthwt, enwast*W_birthwt, W_birthwt*delta_hdlvry
Table 7: Variables included in the undersmoothed HAL models for the outcome
Figure 1: Coverage of confidence intervals in ten studies
Figure 2: Bias of estimators in ten studies
Figure 3: Variance of estimators in ten studies
Figure 4: Average width of confidence intervals in ten studies
Figure 5: Coverage of oracle confidence intervals in ten studies

Appendix B . Supplemental information on studies

b.1 iLiNS-Zinc

The iLiNS Zinc intervention was a placebo-controlled, cluster-randomized trial that enrolled children in Burkina Faso HessSLNS2015. Data used in this analysis was collected between 2010 and 2012 from 3,265 children in rural southwestern Burkina Faso. The objective of this study was to compare the effect of providing small-quantity lipid-based nutrient supplements (SQ-LNS) with varying amounts of zinc, along with illness treatment, to standard care on zinc-related outcomes. The outcome measure used here was a height-to-age z-score for children. For the purposes of this analysis, the treatment arms were split into a control group (placebo) and a treatment group (zinc supplementation or SQ-LNS with varying amounts of zinc).

b.2 iLiNS-DOSE

The iLiNS-DOSE intervention was a randomized, controlled, single-blind, parallel-group clinical trial that enrolled healthy infants between 5.5 and 6.5 months of age in Malawi MaletaP1LNA2015. Data used in this analysis was collected between 2009 and 2011 for 1,931 children in rural Malawi. The objective of this trial was to identify the lowest growth-promoting daily dose of modified lipid-based nutrient supplements. The outcome measure used here was the children’s height-to-age z-score. For the purposes of this analysis, treatment arms were split into a control group (no supplement during primary follow-up period with delayed supplementation from 18 to 30 months) and a treatment group consisting of multiple interventions (varying doses of milk-containing LNSs or milk-free LNSs between 6 and 18 months of age).

b.3 JiVitA-3: Impact of antenatal multiple micronutrient supplementation on infant mortality

The JiVitA-3 intervention was a cluster-randomized, double-masked trial that enrolled pregnant women from rural Bangladesh WestEMMMD2014; data used in this analysis was collected between 2008 and 2012. The objective of this trial was to compare the effects of daily maternal multiple micronutrient versus iron and folic acid supplementation on 6-month infant mortality and adverse birth outcomes. The outcome measure used here was a height-to-age z-score for infants. Treatment arms were split into a control group (folic acid and iron supplementation only) and a treatment group (supplement containing 15 essential vitamins and minerals).

b.4 JiVitA-4: Effect of fortified complementary food supplementation on child growth in rural Bangladesh

The JiViTa-4 intervention was an unblinded, cluster-randomized, controlled trial that enrolled children at 6 months of age and provided daily supplements for a year ChristianEFCFD2015. Data used in this analysis was collected between 2012 and 2014 for 5,443 children in rural Bangladesh. The objective of this trial was to compare the effects of three specially formulated complementary food supplements and an international product, Plumpy’doz, with a control (no food supplement) on children’s growth outcomes. The outcome measure used here was a height-to-age z-score for children. For the purposes of this analysis, treatment arms were split into a control (nutrition counseling only) and a treatment group (food supplements with nutrition counseling).

b.5 WASH Benefits Bangladesh

The WASH Benefits Bangladesh intervention was a cluster randomized trial that enrolled pregnant women from villages in rural Bangladesh LubyEWQSM2018; data used in this analysis was collected between 2012 and 2015 for 4,863 children. The outcome measure used here was a height-to-age z-score for children. For the purposes of this analysis, treatment arms were split into a control group (data collection only) and a treatment group. This treatment group consists of multiple interventions: chlorinated drinking water; upgraded sanitation; promotion of handwashing with soap; combination of water, sanitation, and handwashing; child nutrition counseling plus lipid-based supplements; and a combination of water, sanitation, handwashing, and nutrition.

b.6 WASH Benefits Kenya

The WASH Benefits Kenya intervention was a cluster randomized trial that enrolled pregnant women from western Kenya StewartEWQSA2018; data used in this analysis was collected between 2013 and 2015 for 7,399 children. The outcome measure used here was a height-to-age z-score for children. For the purposes of this analysis, treatment arms were split into a control group (no visits) and a treatment group. This treatment group consists of multiple interventions: chlorinated drinking water; improved sanitation; handwashing with soap; combined water, sanitation, and handwashing; nutrition counseling plus lipid-based supplements; and combined water, sanitation, handwashing, and nutrition.

b.7 SAS Food Supplementation

The SAS Food Supplementation intervention was a randomized trial that enrolled four-month-old infants in an urban slum in Delhi, India BhandariFSEFJ2001. Data for this analysis was collected between 1995 and 1996 from 418 children. The objective of this study was to measure the effect of feeding a micronutrient-fortified food supplement to infants on physical growth between 4 and 12 months of age. The measured outcome used in this analysis was a height-to-age z-score for children. For the purposes of this analysis, treatment arms were split into a control group (no counseling or food supplements) and a treatment group consisting of multiple interventions (milk-based cereal and nutritional counseling, nutritional counseling alone, or visitation alone).

b.8 iLiNS-DYAD-M

The iLiNS-DYAD-M intervention was a randomized trial that enrolled infants 6 to 18 months of age in rural Malawi AshornSMDPJ2015. Data was collected between 2011 and 2015 from 1,205 children. The objective of this study was to test whether small-quantity lipid-based nutrient supplements (SQ-LNS) would promote child growth with an outcome measure of the children’s height-to-age z-score at 18 months. Treatment arms were split into a control group (iron and folate supplementation during pregnancy) and a treatment group with multiple interventions (multiple micronutrient tablets or small-quantity lipid-based nutrient supplements (SQ-LNS) during pregnancy and first six months of lactation). Children in the SQ-LNS group received SQ-LNSs from 6 to 18 months of age, while children in the control group and multiple-micronutrient tablet group did not receive any supplementation.

b.9 Tanzania Child 2

The Trial of Zinc and Micronutrients in Tanzanian Children was a randomized 2 x 2 factorial, double-blind clinical trial LocksEZMSM2016. Data used in this analysis was collected from 2,396 children born to HIV-negative mothers in Tanzania. The secondary objective of this study was to determine whether daily administration of zinc or multivitamins from 6 weeks of age for 18 months would improve child growth; the measured outcome used in this analysis was the children’s height-to-age z-score. For the purposes of this analysis, treatment arms were split into a control group (placebo) and a treatment group consisting of multiple interventions (zinc only, multivitamins only, or zinc and multivitamins).

b.10 Lungwena Child Nutrition Intervention

The Lungwena Child Nutrition Intervention was a single-blind, parallel randomized trial that enrolled six-month-old healthy infants in the Lungwena area of the Mangochi District in rural Malawi ThakwalakwaPTCMS2009. Data for this analysis was collected between 2008 and 2014 from 840 children. The objective of this study was to determine whether lipid-based nutrient supplementation as a complementary food from 6 months to 18 months of age led to lower incidence of severe stunting in infants. The measured outcome used in this analysis was a height-to-age z-score for children. For the purposes of this analysis, treatment arms were split into a control group (no extra food supplements) and a treatment group consisting of multiple interventions (milk-powder-containing LNS, soy-powder protein supplement, or maize-soy flour supplement).