1 Introduction
In many trials, treatments are randomly allocated to groups of individuals, such as hospitals, schools, or communities, and outcomes are measured on individuals in those groups. These studies are known as group or cluster randomized trials (CRTs). They are implemented when the treatment is naturally delivered to the group or when substantial dependence between individuals within groups is expected [1, 2, 3, 4, 5]. Such dependence may arise from shared grouplevel factors and interactions between participants, including the spread of social behaviors and infectious diseases. Indeed, many grouplevel interventions may benefit both the individuals who receive the intervention and others who do not; CRTs provide an opportunity to assess such spillover effects [1, 6, 7, 8]. Many CRTs are pragmatic in that their goal is to expand results from prior efficacy studies into realworld effectiveness [9]. CRTs are rapidly increasing in popularity; a recent review found a 280fold increase in their use from 1995 to 2015 [10]. Nonetheless, despite extensive research dedicated to their design and conduct, this review also concluded only half of CRTs were analyzed appropriately. CRTs can provide goldstandard evidence of causality, but they face several methodological challenges.
First, missing participant outcomes occur in over 90% of CRTs [11]. When participants with missing outcomes differ meaningfully from those with measured outcomes, completecase analyses yield biased estimates [12, 13, 14, 15]. This potential for bias is exacerbated when, as commonly occurs, the cluster randomized intervention, itself, influences outcome measurement. Suppose, for example, that the clusterlevel intervention increases care engagement, which in turn improves both participants’ outcomes and their chances of having that outcome measured (Figure 1). In this scenario, an unadjusted comparison of outcomes between randomized arms can overestimate or underestimate the treatment effect. Even if key determinants of missingness, such as care engagement, are measured, standard analytic approaches to CRTs will also fail to control for this bias, because care engagement simultaneously mediates the treatmentoutcome relationship and confounds the missingnessoutcome relationship [16, 17, 13, 18].
Second, CRTs often randomize limited numbers of groups [1, 2, 3, 4, 10, 5]; a review found a median of 33 clusters randomized [19]. Even when some form of restricted randomization (e.g., pairmatching) is used, CRTs with few clusters are likely to suffer from chance imbalances between treatment arms on baseline determinants of the outcome. Adjustment for these covariates and others predictive of the outcome can increase statistical power (e.g., [20, 14, 21, 1, 22, 4, 10, 5]). The adjustment approach is often with an outcome regression, characterizing the expected outcome, given the treatment assignment and covariates. This regression must be a priori
specified to avoid inflating TypeI error rates. Additionally to avoid overfitting, a limited number of adjustment variables must be selected from a typically large set of candidates, risking forced adjustment for variables that prove useless for, or even detrimental to, precision
[23, 24, 25]. Thus, the challenge is to define a fully prespecified procedure for CRT analysis that optimizes power through dataadaptive adjustment of baseline covariates, while rigorously preserving TypeI error control.In this manuscript, we propose and and evaluate a novel estimator that addresses the dual challenges of bias due to missing outcomes and imprecision due to few randomized units in CRTs. Our approach uses targeted minimum lossbased estimation (TMLE) twice: first at the individuallevel to adjust for differential measurement of individuallevel outcomes and second at the clusterlevel to improve efficiency when estimating the intervention effect [21]. Therefore, we refer to our estimator as “TwoStage TMLE”. To the best of our knowledge, TwoStage TMLE is the first semiparametric efficient estimator that adaptively adjusts for both individuallevel missingness and clusterlevel imbalance in CRTs. It can be applied to estimate a range of causal parameters and under a range of CRT study designs, including differing randomization schemes (e.g., pairmatched or not) and approaches to participant followup within clusters (e.g., crosssectional sampling or longitudinal followup).
2 Background
2.1 Brief Review of Current Methods for CRT Analysis
We provide an overview of existing CRT methods in Table 1. A simple twostage approach to account for the dependence of participants within clusters is to aggregate the individuallevel data to the clusterlevel and then implement an effect estimator appropriate for independent data, such as a test. Use of an unadjusted effect estimator in the second stage avoids modeling assumptions and the risk of overfitting, but by ignoring covariate information is inefficient (e.g., [20, 14, 21, 1, 22, 4, 10, 5]).
Unadjusted  Compare clusterlevel outcomes by treatment arm; commonly implemented as a ttest. 
CARE  At the clusterlevel, compare observed outcomes with those predicted from a regression of the individuallevel outcome on individual and clusterlevel covariates, but not the clusterlevel treatment [14, 1]. 
Mixed Model  Point estimate and inference based the treatment coefficient in a regression of the individuallevel outcome on the clusterlevel treatment and individual and clusterlevel covariates; use random effects to account for dependence of individuals within a cluster [26]. 
GEE  Point estimate and inference based the treatment coefficient in a regression of the individuallevel outcome on the clusterlevel treatment and individual and clusterlevel covariates; use a working correlation matrix to account for dependence of individuals within a cluster [27]. 
AugmentedGEE  Modification to GEE for the marginal effect (i.e., GEE with only regression coefficients for the intercept and clusterlevel treatment) by including an additional “augmentation” term for the outcome regression (i.e., the conditional expectation of the outcome, given covariates and treatment) [28, 23, 29]. 
Hierarchical TMLE  At the clusterlevel, compare targeted predictions of the outcomes under the intervention and control; initial predictions of the outcome regression and propensity score (i.e., conditional probability of treatment, given the covariates) are made by adaptively selecting from baseline covariates at the individual or clusterlevel [30]. 
In contrast, mixed models and generalized estimating equations (GEE) typically adjust for a number of baseline individuallevel and clusterlevel covariates, providing an opportunity to improve precision of effect estimates [26, 27]
. However, neither address the need for a prespecified approach to select the adjustment variables that optimize efficiency, while preserving valid statistical inference. Further, both are susceptible to allowing the estimator choice to define the effect measure that is estimated (e.g., GEE with a logistic link yields estimates of the conditional odds ratio)
[31].To the best of our knowledge, only three methods generally allow for the estimation of marginal effects in CRTs while adjusting for individual and clusterlevel covariates. First, in the covariate adjusted residuals estimator (CARE), clusterlevel outcomes are compared with those predicted from an individuallevel regression of the outcome on individual and clusterlevel covariates, but not the clusterlevel treatment [14, 1]. Second, augmentedGEE extends GEE for the marginal effect by including an “augmentation” term, inspired by the efficient influence function [28, 23, 29]. Finally, hierarchical TMLE solves the efficient estimating equation for the marginal effect, while remaining a plugin estimator [30].
With regards to missingness, an unadjusted effect estimator requires the strongest identification assumption: there are no common causes of missingness and outcomes (i.e., the missingcompletelyatrandom, or MCAR, assumption holds) [12, 15]
. The other methods rely on a weaker identification assumption; essentially, that the outcome distributions among persons for which the outcome is measured versus missing are exchangeable conditional on the treatment arm and some subset of measured covariates. In this setting, mixed models, GEE, and CARE will yield nearly unbiased estimates of the intervention effect only if the common causes of missingness and outcomes occur at baseline and are included in a correctly specified outcome regression
[32, 4, 33, 34].Combining AugmentedGEE with inverse probability weighting yields a double robust estimator (“DRGEE”); it is nearly unbiased if either the outcome regression or the propensity score for the missingness mechanism (i.e., the conditional probability of outcome measurement given the treatment arm and covariates) is correctly specified [35]. To the best of our knowledge, DRGEE’s methodology and computing code have been limited to adjustment for baseline variables only. Hierarchical TMLE also offers the potential for integrated precision gains and double robust adjustment for differential measurement. However, these extensions remain to be fully studied. Here, we, instead, develop and evaluate TwoStage TMLE to (1) control for potentially differential missingness in each cluster separately, and (2) adaptively adjust for baseline covariates to improve efficiency when estimating treatment effect at the clusterlevel. Before doing so, we first present our motivating example.
2.2 Motivating Example
The SEARCH Study was a twoarmed, pragmatic CRT of 32 communities each with 10,000 persons in rural Kenya and Uganda (ClinicalTrials.gov: NCT01864603) [36]. SEARCH was designed to evaluate the populationlevel effects of annual multidisease testing and universal treatment for persons with HIV (intervention) versus baseline multidisease testing and countryguided treatment (active control) on a range of outcomes including incident HIV, viral suppression among persons with HIV, hypertension control, and incident tuberculosis. A key innovation of the SEARCH intervention was to shift from a standalone HIV service model to a multidisease model. HIV testing was offered at health fairs and at home as part of a broader package for community health screening and education [37, 38]. Treatment for those diagnosed positive was offered rapidly, in a welcoming environment with flexible patientcentered options, and through an integrated approach for chronic (e.g., hypertension and diabetes) and infectious diseases (e.g., malaria and tuberculosis) [39].
As with the vast majority of CRTs, SEARCH Study outcomes were not measured among all study participants and the MCAR assumption was unreasonable for many endpoints. Additionally, despite matching prior to randomization [40], covariate imbalance was expected, and to avoid overfitting, extensive adjustment was prohibited by having only 16 communities per arm. To reduce bias from missingness on individuallevel outcomes and to improve precision through covariate adjustment in the SEARCH Study, we developed TwoStage TMLE, which we describe and evaluate in the remainder of the manuscript.
3 TwoStage TMLE
3.1 Stage 1: Defining and Estimating ClusterSpecific Outcomes
In many CRTs, outcomes are assessed through longitudinal followup of a closed cohort of participants. In the SEARCH Study, for example, the primary outcome was the threeyear cumulative incidence of HIV: the proportion of community residents ( 15 years) who were HIVuninfected at baseline and became infected with HIV over the threeyear study. To assess the treatment effect on such endpoints, the cohort of participants who are at risk of the outcome is defined in each cluster. For each participant, let denote their baseline covariates, be their postbaseline covariates, be an indicator of outcome measurement at study close, and be an indicator of having the outcome of interest. The outcome is only observed when . We also observe clusterlevel covariates and the randomly assigned clusterlevel treatment . Throughout, superscript will be used to distinguish clusterlevel variables from individuallevel variables. In the SEARCH Study, for example, included baseline HIV prevalence and male circumcision coverage; included age, sex, marital status, occupation, education, and mobility; was a communitylevel indicator of randomization to the intervention; was interim HIV testing; was an indicator of HIV testing at year 3, and was an indicator of having a confirmed HIVpositive diagnosis at year 3 testing. More generally, we denote the observed data structure for a participant as .
Let denote the counterfactual outcome for a participant when setting the measurement indicator . Then for a given cluster, our Stage 1 causal parameter is the expected counterfactual outcome under a hypothetical intervention to ensure outcome measurement: . If in all clusters, MCAR held or, equivalently, , the causal parameter could be identified as and consistently estimated as the empirical mean among those measured within each cluster: . This missing data assumption can be relaxed by allowing measurement to depend on participant characteristics as well as the clusterlevel covariates and treatment . Specifically, if following the randomization assumption holds and there is sufficient data support (i.e., the positivity assumption holds [41]), our Stage 1 statistical estimand would be
(1) 
Within each cluster separately, this statistical parameter could be estimated by a variety of algorithms, including inverseweighting and Gcomputation [42, 16]. Importantly, by fully stratifying on cluster, we avoid having to specify interactions between individuallevel variables (, ) and clusterlevel variables (.
For statistical estimation, we focus on TMLE given its asymptotic properties and improved finite sample performance (e.g., [21, 43]). We refer readers [44, 45]
for an introduction. Briefly, TMLE combines estimates of the conditional mean outcome (a.k.a., the outcome regression) with those of the propensity score (e.g., conditional probability of being measured). In doing so, TMLE achieves a number of desirable properties, including double robustness: a consistent estimate is attained if either the outcome regression or the propensity score is consistently estimated. If both are consistently estimated at reasonable rates, TMLE will be efficient. In practice, we recommend implementing TMLE using Super Learner, an ensemble machine learning algorithm, to obtain initial estimates of the outcome regression and the propensity score
[46]. Stepbystep implementation of TMLE for Eq. 3 is given in the Supplementary Materials.When assessing effects on timetoevent endpoints, participants are followed longitudinally until the occurrence of the event of interest or rightcensoring (e.g., due to outmigration or study close). Examples of such endpoints in the SEARCH Study included the probability of treatment initiation, allcause mortality, and the cumulative risk of HIVassociated tuberculosis or death due to illness. The above framework can easily be extended to handle these survivaltype endpoints. Specifically, in Stage 1, we could estimate the clusterspecific endpoint using the KaplanMeier estimator when censoring is nondifferential or TMLE when censoring is differential [47, 48].
Other endpoints may be assessed using a crosssectional design, where participants are measured at a single timepoint. In these settings, we may have missingness on the characteristic defining the subpopulation of interest as well as the outcome of interest. Consider, for example, populationlevel HIV viral suppression, defined as the proportion of all HIVinfected persons whose plasma HIV RNA level is less than some limit, such as 500 copies/mL: . Both baseline and timevarying factors impact HIV status and its measurement, as well as viral suppression and its measurement. To handle missingness on both the outcome (viral suppression) and the conditioning set (HIVpositivity), we redefine the endpoint as the joint probability of being HIVpositive and suppressed, divided by HIV prevalence: . As detailed in [49], we can use TMLE to estimate each quantity separately and then take the ratio to estimate the clusterspecific endpoint in Stage 1.
3.2 Stage 2: Estimation of the Treatment Effect
After Stage 1, we focus on the clusterlevel data for the purposes of defining, estimating, and obtaining inference for treatment effect. Specifically, let be the counterfactual, clusterlevel outcome under an additional intervention to set the treatment to 1 (intervention) or 0 (control). The causal parameter for Stage 2 is a summary measure of the distribution of these counterfactuals. A common target is the population average treatment effect (PATE): . Alternatively, we could be interested in the sample average treatment effect (SATE), which is the effect for the study clusters [50], or in summary measures on the relative scale. In the SEARCH Study, for example, the primary analysis was for the sample risk ratio for the trial communities:
(2) 
where was the counterfactual cumulative HIV incidence in community .
We can consider a wider range of causal parameters by combining each summary measure with weights. Specifically, let be the size of cluster , and consider a weightedversion of the treatmentspecific sample mean: . Then setting gives equal weight to participants, while setting gives equal weight to clusters [51]. When there is an interaction between cluster size and the treatment, cluster size is said to be “informative” [52], and the resulting causal parameters will generally not be equivalent. In all settings, the target causal parameter should be prespecified and be driven by research question.
For estimation of the Stage 2 causal parameter, the observed data can be simplified to the clusterlevel: , where represents the baseline clusterlevel covariates; is an indicator of randomization to the intervention arm, and is the estimated clusterspecific endpoint, which appropriately accounts for differential measurement of individuallevel outcomes (e.g., Eq. 3). Using these data, an intuitive estimator of the intervention effect is the (weighted) average outcome among intervention clusters contrasted with the (weighted) average outcome among control clusters . Instead, we use TMLE to obtain a more efficient estimate of the intervention effect (e.g., [53, 54]). Briefly, an initial estimator of the clusterlevel outcome regression is updated based on an estimate of the clusterlevel propensity score to achieve a targeted estimator . For all clusters, targeted estimates of the expected outcome under the intervention and under the control are generated, averaged across clusters, and contrasted to estimate the intervention effect. Stepbystep implementation for Eq. 4 is given in the Supplementary Materials.
With limited numbers of clusters, it is difficult, if not impossible, to a priorispecify the optimal estimator of the outcome regression or the propensity score . In this setting, we recommend using Adaptive Prespecification to select the adjustment variables that maximize empirical efficiency [24]
. The procedure requires prespecification of (1) a set of candidate estimators for the outcome regression, (2) a set of candidate estimators for the known propensity score, (3) a loss function corresponding to the squared influence curve for the TMLE of the target estimand, and (4) a samplesplitting scheme to select the candidate estimator with the lowest crossvalidated risk estimate. In trials with limited randomized units, such as CRTs, leaveoneout or leaveonepairout crossvalidation is recommended. Altogether, the procedure will dataadaptively select, from a prespecified set, the candidate adjustment variables (and thus TMLE) which maximize precision. Importantly, if none of the prespecified covariates improves precision over the unadjusted effect estimator, then the approach will not adjust for any clusterlevel covariates (i.e., reduces to the unadjusted effect estimator). Thus, decisions about whether and how to adjust for precision gains are made with a rigorous procedure that does not compromise TypeI error.
3.3 Statistical Inference
Now we have a point estimate of the intervention effect and are ready to obtain statistical inference, which occurs at the clusterlevel. Under the following conditions, detailed in the Supplementary Materials, TwoStage TMLE is an asymptotically linear estimator of the intervention effect , meaning that , where is the clusterlevel influence curve and is the remainder term, going to zero in probability [55]:

Stage 1 estimation of the clusterlevel outcomes provides negligible contribution to

Stage 2 estimators of the clusterlevel outcome regression and the clusterlevel propensity score satisfy the usual regularity conditions (e.g., [53])
The second condition is automatically satisfied when Adaptive Prespecification is used to select among generalized linear models for the clusterlevel outcome regression and the known propensity score
. However, to satisfy the first condition we need (1) the Stage 1 estimators of the individuallevel outcome regression and the individuallevel measurement mechanism to converge to their targets at fast enough rates, (2) the within cluster dependence to be weak enough that the Central Limit Theorem applies in cluster size
, and (3) the cluster size is large relative to the total number of clusters (i.e., ; details in the Supplementary Materials). When is substantially larger , we can weaken the independence assumption to allow for a slower rate of convergence. The Stage 1 requirements highlight the importance of appropriate control for missingness on individuallevel outcomes in CRTs. Suppose, for example, individuallevel outcomes are not MCAR, and nonetheless, we use an unadjusted estimator in Stage 1. Then the clusterlevel outcomes will not be consistently estimated, and we will obtain biased point estimates and misleading conclusions in Stage 2. We note these conditions apply to all TwoStage estimators, including the Student’s test, the paired test, and CARE.When the above conditions hold, the limit distribution of the standardized estimator is normal with mean 0 and variance given by the variance of its influence curve. For example, the influence curve for TwoStage TMLE for the treatmentspecific population mean
is approximated as . We obtain a variance estimate with the sample variance of the estimated influence curve divided by the number of independent units . Then using the Student’s distribution with degrees of freedom as a finite sample approximation to the normal distribution [56, 14, 1], we can then construct WaldType 95% confidence intervals and conduct hypothesis testing.
Additionally, through the Delta Method, we can derive the influence curve and variance estimator for the intervention effect on any scale of interest. If, for example, we were interested in the absolute effect , then the estimated influence curve for TMLE would be . If, instead, we were interested in the relative effect, we would apply the Delta method on the logscale [53]. This approach to statistical inference also applies in trials where the treatment is randomized within matched pairs of clusters (Supplementary Materials). This approach could naturally be extended to matched triplets in a threearmed CRT. Analogous methods are used to obtain inference for the sample or conditional effects.
4 Simulation Study
We examine the finite sample performance of our proposed estimator using simulations to incorporate common CRT challenges, such as few randomized clusters and differential missingness. Specifically, we focus on a setting with clusters and where within each cluster, the number of individual participants is sampled with equal probability from {100, 150, 200}. In these simulations, both baseline and postbaseline covariates impact measurement of individuallevel outcomes. Additional simulations and computing code are given in the Supplementary Materials.
4.1 Data Generating Process
For each cluster , we independently generate the clusterspecific data as follows. First, two latent variables are drawn uniformly from (1, 1) and one additional variable is drawn independently from a standard normal distribution. Then, two individuallevel covariates are drawn independently from normal distributions with clusterspecific means: and . We set the observed clusterlevel covariates as the empirical mean of their individuallevel counterparts. The clusterlevel intervention is randomly allocated within pairs of clusters matched on ; therefore, clusters receive the intervention and the control.
The individuallevel mediator is generated as an indicator that , drawn from a Uniform(0,1), is less than the . The underlying, individuallevel outcome is generated as an indicator that , drawn from a Uniform(0,1), is less than . Finally, individuallevel measurement is generated as an indicator that , drawn from a Uniform(0,1), is less than . Thus, the measurement mechanism is highly differential by treatment arm. We set the observed outcomes to be missing for individuals with .
We also generate the counterfactual mediators and outcomes by setting the clusterlevel treatment and preventing missingness (i.e., setting ). The clusterlevel counterfactual outcome is the empirical mean within each cluster . By generating a population of 5000 clusters, we calculate the true value of the treatmentspecific population means for , the risk difference , and the risk ratio .
4.2 Estimators Compared in the Simulation Study
We compare a variety of estimators commonly implemented in CRTs. We consider four completecase approaches, in which the data are subset to exclude participants with missing outcomes (i.e., those with ): an unadjusted estimator, CARE, mixed models, and GEE. We also implement two approaches which use data on all participants including those without measured outcomes (i.e., ): DRGEE and the TwoStage TMLE proposed here.
For the unadjusted approach, we implement the Student’s test; we first aggregate the individuallevel outcomes to the clusterlevel by taking the empirical mean among those measured (i.e., ) and then contrast the average clusterlevel outcomes by treatment arm with inference from the
distribution. For CARE, we pool data across clusters and run logistic regression of the individuallevel outcome
on the individual and clusterlevel covariates ; calculate the residuals by taking the difference between the clusterlevel outcomes (the empirical mean among those measured) and those predicted from the previous regression, and finally contrast the clusterlevel residuals by treatment arm with a test.In mixed models and GEE, we again pool data across clusters and fit a loglinear regression of the individuallevel outcome
on the clusterlevel treatment , individuallevel covariates , and clusterlevel summaries . In DRGEE, we also estimate the measurement mechanism with a pooled logistic regression of on and the augmentation terms with armspecific, pooled, loglinear regressions of on . To account for within cluster dependence, we include a random clusterspecific intercept in mixed models and use an independent working correlation matrix in the GEEs. The default settings of lme4, geepack, and CRTgeeDRpackages are used for standard error estimation
[57, 58, 59]. In all approaches, inference for the intervention effect is based on inference for the treatment coefficient.For TwoStage TMLE, we first implement an individuallevel TMLE within each cluster separately to estimate the clusterspecific endpoint . In these TMLEs, the outcome regression and the measurement mechanism are estimated using Super Learner to combine predictions from main terms logistic regression, generalized additive models, and the empirical mean. In Stage 2, we compare these clusterspecific estimates by treatment arm using TMLE with Adaptive Prespecification to select the optimal adjustment variables from . Inference is obtained via the estimated influence curve and the Student’s distribution, as outlined in Section 3.3.
4.3 Simulation Results
The true values of the risk difference and risk ratio were 9.2% and 0.88, respectively. For both effects, Table 2 summarizes estimator performance when “breaking the matches” (i.e., ignoring the pairmatching scheme used for treatment randomization) and preserving the matches. The exception is for DRGEE, because to our knowledge, there does not yet exist an extension of DRGEE for pairmatched trials.
BREAKING THE MATCHES  KEEPING THE MATCHES  
bias  CI  power  bias  CI  power  
FOR THE RISK DIFFERENCE (true value RD=9.2%)  
ttest  32.1  22.9  0.047  0.050  0.4  100.0  32.1  22.9  0.047  0.048  0.8  100.0 
CARE  17.7  8.5  0.031  0.029  18.0  100.0  15.5  6.2  0.039  0.033  61.4  99.4 
TMLE  9.8  0.6  0.035  0.047  98.0  50.2  9.8  0.6  0.035  0.043  96.6  55.2 
FOR THE RISK RATIO (true value RR=0.88)  
Mixed  0.8  0.9  0.038  0.065  58.2  99.8  0.8  0.9  0.038  0.065  58.0  99.8 
GEE  0.8  0.9  0.038  0.044  18.6  99.8  0.8  0.9  0.044  0.033  5.4  99.8 
DRGEE  0.7  0.7  0.049  0.064  0.0  100.0  
TMLE  0.9  1.0  0.046  0.064  98.6  51.0  0.9  1.0  0.047  0.059  97.4  56.8 
: average point estimate (in % for the RD)  
bias: average deviation in the point estimates vs. true effect (additive scale for RD (in %) & relative scale for RR)  
: standard deviation of the point estimates (on logscale for RR) 

: average standard error estimate (on logscale for RR)  
CI: proportion of 95% confidence intervals containing the true effect (in %)  
power: proportion of trials correctly rejecting the false null hypothesis (in %) 
Focusing first on estimators of the risk difference, we see that test, which fails to adjust for any covariates, is highly biased, as expected given the differential measurement process. On average, it grossly overestimates the intervention effect by 22.9% and attains confidence interval coverage of 1%. By adjusting for covariates, CARE is less biased, but also overestimates the intervention effect by 8.5% when breaking the matches and by 6.2% when preserving the matches. The corresponding confidence interval coverages are much less than the nominal rate: 18.0% and 61.4%, respectively. In contrast, the bias of TwoStage TMLE for the risk difference is low (1%) and confidence interval coverage is good (96%). Also as predicted by theory [40], more power is achieved when preserving (55.2%) versus breaking the matches (50.2%).
Now focusing on estimators of the risk ratio, both mixed models and GEE overestimate the intervention effect (10% larger) on average. This bias is substantial enough to prevent accurate inference; the confidence interval coverage for mixed models is 58% and 5.4%18.6% for GEE. Indeed, we would only expect these estimators to be unbiased when there are only baseline causes of missingness and the outcome regressions are correctly specified. In this simulation, there are postbaseline drivers of measurement, which are simultaneously mediators of the treatmentoutcome relationship. Standard analytic approaches are expected to fail in this setting [16, 13, 18].
DRGEE is expected to reduce bias due to missing outcomes by incorporating weights corresponding to the measurement mechanism, and this is, indeed, seen when there are only baseline causes of measurement (Table 1 in Supplementary Materials). However, to the best of our knowledge, extensions of DRGEE to handle postbaseline causes of missingness do not yet exist, and in the main simulations, DRGEE is the most biased estimator for the risk ratio and attains 0% confidence interval coverage. In contrast, TwoStage TMLE is essentially unbiased and achieves goodtoconservative confidence interval coverage (95%). Again, more power is achieved when keeping (56.8%) versus breaking the matches (51.0%). In simulations under the null, TwoStage TMLE maintains nominal TypeI error control (Table 23 in Supplementary Materials).
5 Application to the SEARCH Study
The results of the SEARCH Study have been previously published in [36]; here, we focus on the efficiency gains from clusterlevel covariate adjustment in Stage 2, after adjusting for individuallevel missingness in Stage 1. For select endpoints, we describe the estimator implementation and then compare point estimates and inference for the intervention effect when using TMLE with Adaptive Prespecification in Stage 2 versus the unadjusted effect estimator in Stage 2. We also compare breaking versus keeping the matched pairs used for randomization.
As previously discussed, the primary outcome in the SEARCH Study was the threeyear cumulative HIV incidence, measured in each community through a cohort of residents who were aged 15+ years and HIVuninfected at baseline. The prespecified primary approach was TwoStage TMLE to assess the intervention effect on the relative scale, weighting communities equally, and keeping the matches. In Stage 1, we estimated the communityspecific, cumulative incidence of HIV with TMLE adjusting for possibly differential capture of final HIV status. These individuallevel TMLEs used Super Learner, an ensemble method to combine predictions from penalized regression, generalized additive models, main terms regression, and the simple mean. In Stage 2, the intervention effect was estimated with a second, communitylevel TMLE, using Adaptive Prespecification to select from the following adjustment variables: baseline HIV prevalence, baseline male circumcision coverage, or nothing (unadjusted).
A similar approach was taken for all secondary endpoints, including the incidence of HIVassociated tuberculosis (TB) or death due to illness, hypertension control among adults (30+ years) with baseline hypertension, and populationlevel HIV viral suppression (HIV RNA500 copies/mL). When assessing the impact on TB or death due to illness, we used the KaplanMeier method in Stage 1 to estimate the threeyear risk in each community, separately; we censored at death due to other causes, outmigration, and study close. When assessing the impacts on hypertension control and HIV viral suppression, we implemented individuallevel TMLEs in Stage 1 to adjust for baseline and postbaseline drivers of measurement. For all secondary endpoints, we used TMLE with Adaptive Prespecification to assess the intervention effect in Stage 2.
As shown in Table 3, the point estimates of the intervention effects are similar, but the precision gains from the Stage 2 approach are notable. Here, “efficiency” is the variance of the unadjusted effect estimator breaking the matches divided by the variance of an alternative approach. For the primary endpoint (HIV incidence), we see precision gains when keeping versus breaking the matches; specifically, the unadjusted effect estimator is 3.1times more efficient in the pairmatched analysis. As expected, TMLE with Adaptive Prespecification keeping the matches is the most efficient approach and 4.6times more efficient than the standard approach.
Similar results are seen for the incidence of HIVassociated TB and hypertension control (Table 3). TMLE using Adaptive Prespecification and keeping the matches is 2times more efficient than the unadjusted effect estimator ignoring the matches. In contrast, minimal gains in efficiency are seen when evaluating the effect on HIV viral suppression. This is because the adaptive approach used in TMLE defaults to the unadjusted effect estimator when adjustment does not improve precision. In this scenario, adjusting for the baseline prevalence of viral suppression or the proportion of youth (1524 years) with HIV did not improve precision over the unadjusted effect estimator. However, as shown in Table 3 of the Supplementary Materials, assuming MCAR and relying on the unadjusted estimator in Stage 1 resulted in vast overestimation of the endpoint in the intervention arm (85.2% vs. 79.0%) and control arm (75.8% vs. 67.8%).
Breaking the Matches  Keeping the Matches  

Stage 2  Effect (95% CI)  Efficiency  Effect (95% CI)  Efficiency  
HIV Incidence  Unadjusted  0.98 (0.66, 1.45)  1  0.98 (0.78, 1.24)  3.1 
TMLE  0.96 (0.73, 1.26)  2.1  0.96 (0.8, 1.17)  4.6  
TB Incidence  Unadjusted  0.79 (0.64, 0.98)  1  0.79 (0.69, 0.92)  2.2 
TMLE  0.8 (0.67, 0.95)  1.4  0.8 (0.69, 0.91)  2.6  
Hypertension Control  Unadjusted  1.19 (1.1, 1.3)  1  1.19 (1.11, 1.28)  1.7 
TMLE  1.18 (1.1, 1.26)  1.6  1.19 (1.11, 1.27)  1.8  
Viral Suppression  Unadjusted  1.15 (1.11, 1.2)  1  1.15 (1.11, 1.2)  1 
TMLE  1.16 (1.13, 1.2)  1.1  1.15 (1.11, 1.2)  1  
Efficiency: Variance estimate for the unadjusted effect estimator breaking the matches used for randomization,  
divided by the variance estimate of another approach. 
6 Discussion
Cluster randomized trials (CRTs) are essential for assessing the effectiveness of interventions delivered to groups of individuals (e.g., clinics or communities). There have been notable advances in the design and conduct of CRTs (e.g., [1, 2, 3, 4, 10, 5]). However, substantial challenges remain and threaten the quality of evidence generated by CRTs. Regardless of best intentions, most CRTs are prone to differential measurement of individuallevel outcomes and to imbalance on predictive covariates between randomized arms. In this paper, we proposed and evaluated a novel estimator, TwoStage TMLE, to address the dual challenges of bias due to missing individuallevel outcomes and imprecision due to few randomized units (i.e., clusters). In Stage 1, a TMLE is implemented within each cluster separately to estimate the clusterspecific endpoint, which appropriately controls for missingness on participant outcomes. Fully stratifying on the cluster allows the missingness mechanism to vary by cluster. In Stage 2, the treatment effect is estimated with a separate, clusterlevel TMLE to compare the clusterlevel endpoints, estimated from Stage 1. In Stage 2, crossvalidation is used to adaptively select the baseline covariates from a prespecified set that maximize precision [24]. Statistical inference is based on the estimated influence curve and the Student’s distribution. Finite sample simulations demonstrated the potential of the proposed method to overcome the shortcomings of existing CRT methods, especially when there are postbaseline causes of missingness. Application to real data from the SEARCH Study demonstrated the precision gains attained through adaptive adjustment for clusterlevel covariates in Stage 2.
To the best of our knowledge, TwoStage TMLE is the first CRT estimator that simultaneously addresses bias due to individuallevel missingness and adaptive adjustment, in a fully prespecified manner, to improve efficiency. The approach is applicable to a wide range of measurement schemes (e.g., single crosssectional sample, repeated crosssectional sampling, and longitudinal followup) and endpoint types (e.g., binary, continuous, timetoevent outcomes). The approach is also applicable to a wide range of causal parameters (e.g., population, conditional, and sample effects) and scales of inference (e.g., absolute or relative measures). Additionally, TwoStage TMLE should naturally generalize to hierarchical data settings with a nonrandomized, clusterlevel exposure. In such an observational setting, the clusterlevel TMLE implemented in Stage 2 would focus on confounding control, as opposed to efficiency improvement. However, the asymptotic properties and finite sample performance of such an estimator remain an area of future work.
Altogether, TwoStage TMLE alleviates, but does not fully resolve, the challenges that arise from missing data and covariate adjustment for efficiency gains in CRTs. In Stage 1, TMLE uses machine learning to flexibly adjust for baseline and timedependent causes of missingness and, as a plugin estimator, provides more stability under strong confounding or rare outcomes. However, adjustment for missingness in TwoStage TMLE occurs within each cluster separately, limiting the breadth of adjustment when the clusterspecific outcome is rare or the clusterspecific sample size is small (e.g., in subgroup analyses). Furthermore, Stage 2 adjustment for efficiency gains is limited to clusterlevel covariates. Future work will address these remaining challenges.
7 Acknowledgments
On behalf of the SEARCH Study, we thank the Ministry of Health of Uganda and of Kenya; our research and administrative teams in San Francisco, Uganda, and Kenya; collaborators and advisory boards; and especially all the communities and participants involved. We also thank Dr. Ted Westling for his thoughtful feedback on this project.
This work was supported by the National Institutes of Health [grant numbers U01AI150510, U01AI099959, UM1AI068636, and R01AI074345]; by the President’s Emergency Plan for AIDS Relief; and by Gilead Sciences, which provided Truvada.
8 Supplementary Materials
8.1 StepbyStep Implementation of the Stage 1 TMLE
For demonstration, we focus on implementation of TMLE for the clusterspecific endpoint
(3) 
where is the individuallevel outcome, is an indicator of measurement, are baseline individuallevel covariates, are postbaseline individuallevel covariates, are the baseline clusterlevel covariates, and is the clusterlevel treatment indicator. To estimate Eq. 3 with TMLE, we take the following steps within each cluster , separately. Throughout, indexes the participants of cluster . (For ease of notation, we drop the superscript when denoting the clustersize in the Supplementary Materials.)

Among those with measured outcomes (i.e., ), use Super Learner to flexibly model the relationship between the outcome and adjustment factors .

Use the output from #1 to predict the outcome for all participants, regardless of their measurement status: for .

Target these machine learningbased predictions with information in the estimated measurement mechanism , also fit with Super Learner.

Calculate the “clever covariate” for

Run logistic regression of outcome
on only the intercept, using the logit of the initial estimator
as offset (i.e., fixing its coefficient to 1) and the clever covariate as weight. 
Denote the resulting intercept as


Obtain targeted predictions of the outcome for all participants, regardless of their measurement status: for .

Add the estimated intercept to the logit of the initial estimates and transform back to the original scale (i.e., take the inverselogit):


Average the targeted predictions to obtain a clusterspecific endpoint estimate adjusted for missingness on individuallevel outcomes:
Because we are implementing TMLE in each cluster separately, we do not include the clusterlevel covariates or treatment in the above estimation procedure. Updating on the logitscale is recommended for binary and continuous individuallevel outcomes; for details see [60]. For timetoevent outcomes with potentially differential censoring (i.e., survivaltype endpoints), we would, instead, implement longitudinal TMLE within each cluster separately [47, 48].
8.2 StepbyStep Implementation of the Stage 2 TMLE
Given estimates of the clusterspecific endpoints for from Stage 1, we then implement a clusterlevel TMLE to more efficiently estimate the intervention effect in Stage 2. For demonstration, we focus on TMLE for the sample effect on the relative scale:
(4) 
where is the counterfactual, clusterlevel outcome under treatmentlevel .

Obtain an initial estimate of the conditional expectation of the clusterlevel outcome, given the clusterlevel treatment and covariates: . We could, for example, fit a “working” regression of the estimated outcome on an intercept with main terms for the clusterlevel treatment and selected clusterlevel covariates [53, 54].

Use the output from #1 to predict the outcome for all clusters under both the intervention and control conditions: and for .

Target the initial predictions using information in the estimated propensity score for .

To estimate the clusterlevel propensity score, we could again fit a “working” logistic regression of the clusterlevel treatment indicator on an intercept and selected clusterlevel covariates .

Calculate the twodimensional “clever” covariate: and for .

Run logistic regression of clusterlevel outcome on the clever covariates and , suppressing the intercept, and using the logit of the initial estimator as offset (i.e., fixing its coefficient to 1).

Denote the resulting coefficient estimates corresponding to and as and , respectively.


Obtain targeted predictions of the outcome for all clusters under both the intervention and control conditions:

Obtain a point estimate by dividing the average of the targeted predictions under the intervention condition by the average of the targeted predictions under the control condition:
If the known propensity score is not estimated (e.g., in twoarmed CRTs with balanced allocation), then the targeting step can be skipped. As detailed in [53], using a twodimensional clever covariate during updating (step 3) allows for simultaneous targeting of the treatmentspecific means and effects on the additive, relative, and odds ratio scales. As detailed in [61], implementation to obtain a point estimate is identical for the population, conditional, and sample effects. In other words, we would follow the same steps if our goal were .
8.3 Asymptotic Linearity of TwoStage TMLE
Briefly, an estimator is asymptotically linear if the difference between the estimator and the estimand behaves (in first order) as an empirical average of a meanzero and finite variance function, known as the influence curve, of the unit data [62, 63, 21]. An asymptotically linear estimator will be consistent and normally distributed in its limit. Therefore, the Central Limit Theorem can be applied to construct 95% confidence intervals and test the null hypothesis.
Recall that in Stage 1, we first define the clusterspecific outcome . If all individuallevel outcomes are completely measured, then could be defined as the expected individuallevel outcome within each cluster: . If the individuallevel outcomes are missingcompletelyatrandom, then could be defined as the expected individuallevel outcome among those measured: . Likewise, if measurement depends on individuallevel, baseline covariates , then could be defined as the expected individuallevel outcome given measurement and those covariates, standardized with respect to the covariate distribution: . Extensions to scenarios with postbaseline drivers of missingness and/or rightcensoring follow analogously.
Next, we estimate the clusterspecific outcome within each cluster , separately. When outcomes are completely measured () or are missingcompletelyatrandom (), a simple and intuitive estimator is the empirical mean outcome among those measured. When outcomes are missingatrandom within values of the adjustment variables (e.g., ), we recommend using TMLE for estimation of the clusterspecific outcome. The empirical mean outcome (among those measured) can be considered a special case of TMLE where the adjustment set is empty: .
To emphasize how the Stage 1 estimator depends on the individuallevel data within each cluster, let denote the true distribution of the individuallevel data in cluster . Likewise, let denote the targeted estimator of that distribution based on individuals in cluster . Then we can write the Stage 1 clusterspecific estimand as and the Stage 1 clusterspecific plugin estimator as .
The Stage 2 clusterlevel effect estimator is, therefore, a function of , . Consider, for example, the treatmentspecific sample mean as our Stage 2 target parameter. The clusterlevel TMLE of in Stage 2 would be
An unadjusted effect estimator in Stage 2 can again be considered a special case of the clusterlevel TMLE where the adjustment set is empty: . Altogether, TwoStage TMLE will be asymptotically linear under the following conditions
where represents the clusterlevel influence curve and is remainder term, going to zero in probability:

Deviations between the estimated clusterlevel outcomes and the true clusterlevel outcomes,
, provide a negligible contribution to the remainder term .
The conditions on Stage 2 estimation are satisfied when estimating the known, clusterlevel propensity score with a “working” logistic regression and when estimating the clusterlevel outcome regression with another “working” parametric regression (e.g., [53, 54]). However, to the best of our knowledge, all previously existing TwoStage estimators (e.g., a ttest on the clusterlevel means) have simply ignored the contribution from estimating the clusterlevel outcome to . Suppose, for example, our Stage 1 estimator is the average outcome within each cluster: . (Such an estimator would only be appropriate when the individuallevel outcomes are completely measured or are missingcompletelyatrandom.) Since the individuallevel outcomes are not i.i.d. within each cluster, we need the following to hold for this estimator’s contribution to to be essentially zero: (1) the within cluster dependence is weak enough that the Central Limit Theorem applies in , and (2) smallest cluster is much larger than the total number of clusters (i.e., ).
When the Stage 1 estimator is a TMLE of the Stage 1 estimand , the relevant component of the remainder term can be written as
(5) 
where and are the cluster specific efficient influence curve and remainder terms, respectively. As before, we need that the within cluster dependence is weak enough such that and that the ratio of total number of clusters to the clustersize goes to zero (i.e., ). We note that when the clustersize is substantially larger than , we can weaken this independence assumption to allow for a slower rate of convergence. Additionally, we need that estimators of the individuallevel outcome regression and the individuallevel missingness mechanism converge to their targets at fast enough rates such that [21]. Implementing Super Learner with highly adaptive LASSO (HAL) [64] or internal samplesplitting can help ensure these conditions hold in practice [65, 66].
This approach to statistical inference also applies in CRTs where the treatment is randomized within matched pairs of clusters. Briefly, let and denote the observed data for the first and second cluster within matched pair , respectively. To obtain statistical inference for sample effect in a pairmatched setting, we replace with the following paired version: [61]. Our variance estimator is then given by the sample variance of the paired influence curve divided by the number of pairs (), and we use the Student’s distribution with degrees of freedom [1]. This could naturally be extended to matched triplets in a threearmed trial.
8.4 Additional Simulation Study with Baseline (only) Causes of Missingness
Here, we consider a simplified scenario where only baseline (but not postbaseline) covariates impact the measurement of individuallevel outcomes. As before, we focus on a setting with clusters and where within each cluster, the number of individual participants is sampled with equal probability from {100, 150, 200}.
For each cluster , we independently generate the clusterspecific data as follows. First, one latent variable is drawn uniformly from (1.75, 2.25) and two additional variables are drawn independently from a standard normal distribution. Then, two individuallevel covariates are generated by drawing from a normal distribution with means depending on the clusterlevel latent factors: and . We set the observed clusterlevel covariates as the empirical mean of their individuallevel counterparts. The intervention is randomly allocated within pairs of clusters matched on ; therefore, clusters receive the intervention and the control.
The underlying, individuallevel outcome is generated as an indicator that , drawn from a Uniform(0,1), is less than . Finally, we incorporate individuallevel missingness by generating as an indicator that , drawn from a Uniform(0,1), is less than . Thus, participants in the intervention arm , and especially those with higher values of , are more likely to have the outcome and also be missing. The observed outcomes are set to be missing for individuals with .
We also generate the counterfactual, individuallevel outcomes and by setting the clusterlevel treatment to and , respectively, and preventing missingness (i.e., setting ). As before, the clusterlevel counterfactual outcome is the empirical mean within each cluster . The true values of the treatmentspecific population means for , the risk difference as and the risk ratio as are calculated for a population of 5000 clusters.
We compare the same estimators as the main simulation study. The implementation is identical, except the postbaseline variable does not exist and is, thus, excluded from the analysis.
8.4.1 Results from the Second Simulation Study
Table 4 illustrates estimator performance in this simplified setting. The true values of the treatmentspecific population means and are 47.4% and 39.6%, respectively. The corresponding risk difference and risk ratio are 7.7% and 1.20, respectively.
Focusing first on estimating the risk difference, we see that test, which fails to adjust for any covariates, is highly biased, as expected given the differential measurement process. On average, it grossly underestimates the intervention effect by 12% and attains a confidence interval coverage of 25%, much lower than the nominal rate of 95%. By adjusting for covariates that influence measurement and underlying outcomes, CARE is less biased, but still underestimates the intervention effect by 2.8% when breaking the matches and by 5.1% when preserving the matches. The corresponding confidence interval coverages for CARE are less than the nominal rate: 91.8% and 41.8%, respectively. In contrast, the bias of TwoStage TMLE for the risk difference is negligible, and as predicted by theory [61], the confidence interval coverage is conservative (99%). Also as predicted by theory [40], higher power is achieved when preserving, as compared to breaking, the matches: 56.6% versus 49.0%, respectively.
Now focusing on estimating the risk ratio, we see that both mixed models and GEE overestimate the intervention effect (1.4times higher on average). This bias is substantial enough to prevent accurate inference. The confidence interval coverage for mixed models is 38.6% and 17.2%35.6% for GEE. Lower coverage for GEE is likely due to underestimation of the standard errors (). While both mixed models and GEE are adjusting for the appropriate variables, both are relying on a misspecified regression to control for missing data and to estimate the intervention effect. Theoretically, DRGEE should remove this bias from GEE by incorporating estimates of the missingness mechanism. Indeed, DRGEE exhibits lower bias, but still does not obtain valid inference (confidence interval coverage of 53.6%). This again highlights the need for flexible (i.e., dataadaptive) estimators of the individuallevel outcome regression and missingness mechanism. In contrast, TwoStage TMLE for the risk ratio is essentially unbiased. As predicted by theory [40, 61], the confidence interval coverage is conservative (99%) and more power is achieved when preserving (62.2%) versus breaking the matched (53.2%). Importantly, in simulations under the null, TwoStage TMLE maintains nominal to conservative TypeI error control in all settings (results not shown).
BREAKING THE MATCHES  KEEPING THE MATCHES  
bias  CI  power  bias  CI  power  
FOR THE RISK DIFFERENCE (true value RD=7.7%)  
ttest  4.7  12.3  0.041  0.043  22.2  16.6  4.7  12.3  0.041  0.039  18.2  20.0 
CARE  4.9  2.8  0.019  0.028  91.8  29.6  2.6  5.1  0.012  0.023  41.8  0.6 
TMLE  7.3  0.4  0.023  0.037  99.6  49.0  7.3  0.4  0.024  0.032  99.0  56.6 
FOR THE RISK RATIO (true value RR=1.20)  
Mixed  1.7  1.4  0.146  0.144  38.6  91.4  1.7  1.4  0.137  0.145  38.6  92.6 
GEE  1.6  1.4  0.147  0.137  35.6  89.6  1.7  1.4  0.168  0.097  17.2  96.4 
DRGEE  1.4  1.2  0.102  0.085  53.6  94.2  
TMLE  1.2  1.0  0.051  0.083  99.4  53.2  1.2  1.0  0.052  0.073  99.0  62.2 
: average point estimate (in % for the RD)  
bias: average deviation in the point estimates vs. true effect (additive scale for RD (in %) & relative scale for RR)  
: standard deviation of the point estimates (on logscale for RR)  
: average standard error estimate (on logscale for RR)  
CI: proportion of 95% confidence intervals containing the true effect (in %)  
power: proportion of trials correctly rejecting the false null hypothesis (in %) 
8.5 Main Simulation Study Under the Null with and clusters
To assess Type I error control, we repeated the main simulation study when there was no treatment effect (RD=0; RR=1). Under the null, the individuallevel outcome is not impacted by either the clusterlevel treatment or the individuallevel postbaseline covariates . In other words, the postbaseline covariates are not mediators in this setting. Therefore, the baseline covariates are sufficient to control for incomplete measurement: .
The results are given in Table 5. As expected in this setting, all adjusted estimators exhibit much smaller bias. However, for CARE under pairmatching and the GEEs, we see lower than nominal confidence interval coverage (70.4%93.4%). This is partially explained by underestimation of the standard errors (i.e., ) in the setting of clusters. However, when increasing the number of clusters to , these estimators still exhibit poor coverage and high TypeI error (7.8%29.6%). In contrast, good confidence interval coverage and TypeI error control are exhibited by mixed models and TwoStage TMLE in all settings.
BREAKING THE MATCHES  KEEPING THE MATCHES  
bias  CI  bias  CI  
CLUSTERS  
FOR THE RISK DIFFERENCE (true value RD=0%)  
ttest  3.5  3.5  0.045  0.047  89.4  10.6  3.5  3.5  0.045  0.044  87.0  13.0 
CARE  1.5  1.5  0.025  0.028  96.6  3.4  1.2  1.2  0.023  0.022  93.4  6.6 
TMLE  0.3  0.3  0.036  0.043  97.0  3.0  0.3  0.3  0.036  0.039  97.2  2.8 
FOR THE RISK RATIO (true value RR=1.0)  
Mixed  1.0  1.0  0.041  0.064  99.6  0.4  1.0  1.0  0.041  0.065  99.6  0.4 
GEE  1.0  1.0  0.041  0.040  90.8  9.2  1.0  1.0  0.044  0.029  73.6  26.4 
DRGEE  1.0  1.0  0.040  0.036  91.6  8.4  
TMLE  1.0  1.0  0.051  0.060  96.8  3.2  1.0  1.0  0.051  0.055  97.0  3.0 
CLUSTERS  
FOR THE RISK DIFFERENCE (true value RD=0%)  
ttest  3.9  3.9  0.034  0.037  84.8  15.2  3.9  3.9  0.034  0.035  81.6  18.4 
CARE  1.6  1.6  0.019  0.022  93.8  6.2  1.3  1.3  0.018  0.017  90.0  10.0 
TMLE  0.6  0.6  0.026  0.034  98.4  1.6  0.6  0.6  0.026  0.031  97.6  2.4 
FOR THE RISK RATIO (true value RR=1.0)  
Mixed  1.0  1.0  0.029  0.049  99.8  0.2  1.0  1.0  0.029  0.049  99.8  0.2 
GEE  1.0  1.0  0.029  0.032  90.2  9.8  1.0  1.0  0.032  0.022  70.4  29.6 
DRGEE  1.0  1.0  0.028  0.028  92.2  7.8  
TMLE  1.0  1.0  0.037  0.048  98.4  1.6  1.0  1.0  0.037  0.044  97.4  2.6 
: average point estimate (in % for the RD)  
bias: average deviation between & true effect (additive scale for RD (in %) & relative scale for RR)  
: standard deviation of the point estimates (on logscale for RR)  
: average standard error estimate (on logscale for RR)  
CI: proportion of 95% confidence intervals containing the true effect (in %)  
: proportion of trials incorrectly rejecting the true null hypothesis (in %) 
8.6 Additional Results from the SEARCH Study
The full statistical analysis plan for the SEARCH Study is available at [67]. In Table 6, we provide a comparison of results when using an unadjusted estimator in Stage 1 and Stage 2 as compared to TwoStage TMLE when estimating populationlevel HIV viral suppression (the proportion of all persons with HIV who are suppressing viral replication 500 copiess/mL) in each arm and the intervention effect [49].
Breaking matches  Keeping matches  

Estimator  Intervention (95% CI)  Control (95% CI)  Effect (95% CI)  Effect (95% CI) 
Unadjusted  85.2% (83.5%, 86.8%)  75.8% (73.5%, 78.2%)  1.12 (1.08, 1.16)  1.12 (1.09, 1.16) 
TMLE  79% (77.1%, 80.8%)  67.8% (66.2%, 69.5%)  1.16 (1.13, 1.2)  1.15 (1.11, 1.2) 
8.7 Computing code
All simulations were conducted in R (v4.0.3) using the nbpMatching, lme4, geepack, CRTgeeDR ltmle, and SuperLearner packages [68, 69, 57, 58, 59, 70, 71]. Computing code to reproduce the simulation study is available at https://github.com/LauraBalzer/TwoStageTMLE. Computing code used to analyze the SEARCH Study data is available at https://github.com/LauraBalzer/SEARCH_Analysis_Adults.
References
 Hayes and Moulton [2009] R.J. Hayes and L.H. Moulton. Cluster Randomised Trials. Chapman & Hall/CRC, Boca Raton, 2009.
 Crespi [2016] C.M. Crespi. Improved designs for cluster randomized trials. Annu Rev Public Health, 37(1):1–16, 2016.
 Turner et al. [2017a] E.L. Turner, F. Li, J.A. Gallis, M. Prague, and D.M. Murray. Review of recent methodological developments in grouprandomized trials: Part 1design. Am J Public Health, 107(6):907–915, 2017a.
 Turner et al. [2017b] E.L. Turner, M. Prague, J.A. Gallis, F. Li, and D.M. Murray. Review of recent methodological developments in grouprandomized trials: Part 2analysis. Am J Public Health, 107(7):1078–1086, 2017b.
 Murray et al. [2020] D.M. Murray, M. Taljaard, E.L. Turner, and S.M. George. Essential ingredients and innovations in the design and analysis of grouprandomized trials. Annu Rev Public Health, 41:1–19, 2020.
 Halloran and Struchiner [1991] M.E. Halloran and C.J. Struchiner. Study designs for dependent happenings. Epidemiology, 2:331–338, 1991.
 Halloran and Struchiner [1995] M.E. Halloran and C.J. Struchiner. Causal inference in infectious diseases. Epidemiology, 6(2):142–151, 1995. PMID: 7742400.
 Hudgens and Halloran [2008] M.G. Hudgens and M.E. Halloran. Toward Causal Inference With Interference. J Am Stat Assoc, 103(482):832–842, 2008. PMCID: PMC2600548.
 Platt et al. [2010] R. Platt, S.U. Takvorian, E. Septimus, et al. Cluster randomized trials in comparative effectiveness research: randomizing hospitals to test methods for prevention of healthcareassociated infections. Med Care, 48(6Suppl):S52–57, 2010.
 Murray et al. [2018] D.M. Murray, S.L. Pals, S.M. George, A. Kuzmichev, G.Y. Lai, J.A. Lee, R.L. Myles, and S.M. Nelson. Design and analysis of grouprandomized trials in cancer: A review of current practices. Prev Med, 111:241–247, 2018.
 Fiero et al. [2016] M.H. Fiero, S. Huang, E. Oren, and M.L. Bell. Statistical analysis and handling of missing data in cluster randomized trials: a systematic review. Trials, 17, 2016.
 Rubin [1976] D.B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
 Robins et al. [1995] J.M. Robins, A. Rotnitzky, and L.P. Zhao. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc, 90:106–121, 1995.
 Gail et al. [1996] M.H. Gail, S.D. Mark, R.J. Carroll, S.B. Green, and D. Pee. On design considerations and randomizationbased inference for community intervention trials. Stat Med, 15:1069–1092, 1996.
 Little and Rubin [2002] R.J.A. Little and Donald B. Rubin. Statistical analysis with missing data. Wiley, Hoboken, 2nd edition, 2002.
 Robins [1986] J.M. Robins. A new approach to causal inference in mortality studies with sustained exposure periods–application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512, 1986. doi: 10.1016/02700255(86)900886.
 Robins et al. [2000] J.M. Robins, M.A. Hernán, and B. Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5):550–560, 2000.
 Robins and Hernán [2009] J.M. Robins and M.A. Hernán. Estimation of the causal effects of timevarying exposures. In G. Fitzmaurice, M. Davidian, G. Verbeke, and G. Molenberghs, editors, Longitudinal Data Analysis, chapter 23. Chapman & Hall/CRC, Boca Raton, FL, 2009.
 Selvaraj and Prasad [2013] S. Selvaraj and V. Prasad. Characteristics of cluster randomized trials: Are they living up to the randomized trial? JAMA Internal Medicine, 173(23):313, 2013. doi: 10.1001/jamainternmed.2013.1638.
 Fisher [1932] R.A. Fisher. Statistical methods for research workers. Oliver and Boyd Ltd., Edinburgh, 4th edition, 1932.
 van der Laan and Rose [2011] M. van der Laan and S. Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York Dordrecht Heidelberg London, 2011.
 Colantuoni and Rosenblum [2015] E. Colantuoni and M. Rosenblum. Leveraging prognostic baseline variables to gain precision in randomized trials. Stat Med, 34(26022617), 2015.
 Stephens et al. [2013] A.J. Stephens, E.J. Tchetgen Tchetgen, and V. De Gruttola. Flexible covariateadjusted exact tests of randomized treatment effects with application to a trial of HIV education. Ann Appl Stat, 7(4):2106–2137, 2013.
 Balzer et al. [2016a] L. Balzer, M. van der Laan, M. Petersen, and the SEARCH Collaboration. Adaptive prespecification in randomized trials with and without pairmatching. Statistics in Medicine, 35(10):4528–4545, 2016a. doi: 10.1002/sim.7023.
 Kahan et al. [2016] B.C. Kahan, G. Forbes, Y. Ali, V. Jairath, et al. Increased risk of type I errors in cluster randomised trials with small or medium numbers of clusters: a review, reanalysis, and simulation study. Trials, 17:438, 2016.
 Laird and Ware [1982] N.M. Laird and J.H. Ware. Randomeffects models for longitudinal data. Biometrics, 38(4):963–974, 1982. PMID: 7168798.
 Liang and Zeger [1986] K.Y. Liang and S.L. Zeger. Longitudinal Data Analysis Using Generalized Linear Models. Biometrika, 73(1):13–22, 1986. URL http://www.jstor.org/stable/2336267.
 Stephens et al. [2012] A.J. Stephens, E.J. Tchetgen Tchetgen, and V. DeGruttola. Augmented generalized estimating equations for improving efficiency and validity of estimation in cluster randomized trials by leveraging clusterlevel and individuallevel covariates. Statistics in Medicine, 31:915–930, 2012.
 Stephens et al. [2014] A. Stephens, E. Tchetgen Tchetgen, and V. DeGruttola. Locally efficient estimation of marginal treatment effects when outcomes are correlated: Is the prize worth the chase? The International Journal of Biostatistics, 10(1):59–75, 2014. doi: 10.1515/ijb20130031.
 Balzer et al. [2019] L.B. Balzer, W. Zheng, M.J. van der Laan, M.L. Petersen, and the SEARCH Collaboration. A new approach to hierarchical data analysis: Targeted maximum likelihood estimation for the causal effect of a clusterlevel exposure. Stat Meth Med Res, 28(6):1761–1780, 2019.
 Hubbard et al. [2010] A.E. Hubbard, J. Ahern, N.L. Fleischer, M. van der Laan, S.A. Lippman, N. Jewell, T. Bruckner, and W.A. Satariano. To GEE or not to GEE comparing population average and mixed models for estimating the associations between neighborhood risk factors and health. Epidemiology, 21(4):467–474, 2010.
 DeSouza et al. [2009] C.M. DeSouza, A.T.R. Legedza, and A.J. Sankoh. An overview of practical approaches for handling missing data in clinical trials. J Biopharm Stat, 19(6):1055–1073, 2009.
 Hossain et al. [2017a] A. Hossain, K. DiazOrdaz, and J.W. Bartlett. Missing binary outcomes under covariatedependent missingness in cluster randomised trials. Stat Meth Med Res, 36(19):3092–3109, 2017a.
 Hossain et al. [2017b] A. Hossain, K. DiazOrdaz, and J.W. Bartlett. Missing continuous outcomes under covariate dependent missingness in cluster randomised trials. Stat Meth Med Res, 26(3):1543–1562, 2017b.
 Prague et al. [2016] M. Prague, R. Wang, A. Stephens, E. Tchetgen Tchetgen, and V. De Gruttola. Accounting for interactions and complex intersubject dependency in estimating treatment effect in clusterrandomized trials with missing outcomes. Biometrics, 72(4):1066–1077, 2016. doi: 10.1111/biom.12519.
 Havlir et al. [2019] D.V. Havlir, L.B. Balzer, E. Charlebois, T.D. Clark, D. Kwarisiima, J. Ayieko, J Kabami, N. Sang, et al. HIV testing and treatment with the use of a community health approach in rural Africa. New England Journal of Medicine, 381:219–229, 2019.
 Chamie et al. [2012] G. Chamie, D. Kwarisiima, T.D. Clark, J. Kabami, V. Jain, E. Geng, M.L. Petersen, H. Thirumurthy, M.R. Kamya, D.V. Havlir, E.D. Charlebois, and the SEARCH Collaboration. Leveraging rapid communitybased HIV testing campaigns for noncommunicable diseases in rural Uganda. PLoS One, 7(8):e43400, 2012. doi: 10.1371/journal.pone.0043400.
 Chamie et al. [2016] G. Chamie, T.D. Clark, J. Kabami, K. Kadede, E. Ssemmondo, et al. A hybrid mobile HIV testing approach for populationwide HIV testing in rural East Africa. Lancet HIV, 3(3):e111–119, 2016.
 Kwarisiima et al. [2017] D. Kwarisiima, M. Kamya, A. Owaraganise, et al. High rates of viral suppression in adults and children with high CD4+ counts using a streamlined ART delivery model in the SEARCH trial in rural Uganda and Kenya. J Int AIDS Soc, Jul 21(20(Suppl 4)):21673, 2017.
 Balzer et al. [2015] L.B. Balzer, M.L. Petersen, M.J. van der Laan, and the SEARCH Consortium. Adaptive pairmatching in randomized trials with unbiased and efficient effect estimation. Statistics in Medicine, 34(6):999–1011, 2015. doi: 10.1002/sim.6380.
 Petersen et al. [2012] M.L. Petersen, K.E. Porter, S. Gruber, Y. Wang, and M.J. van der Laan. Diagnosing and responding to violations in the positivity assumption. Statistical Methods in Medical Research, 21(1):31–54, 2012. doi: 10.1177/0962280210386207.
 Horvitz and Thompson [1952] D.G. Horvitz and D.J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47:663–685, 1952. doi: 10.2307/2280784.
 Gruber and van der Laan [2012] S. Gruber and M.J. van der Laan. Targeted minimum loss based estimator that outperforms a given estimator. Int J Biostat, 8(1):Article 11, 2012. PMID: 22628356.
 Schuler and Rose [2017] M.S. Schuler and S. Rose. Targeted maximum likelihood estimation for causal inference in observational studies. American Journal of Epidemiology, 185(1):65–73, 2017.
 Blakely et al. [2019] T. Blakely, J. Lynch, K. Simons, R. Bentley, and S. Rose. Reflection on modern methods: when worlds collide  prediction, machine learning and causal inference. International Journal of Epidemiology, dyz132:1–7, 2019.
 van der Laan et al. [2007] M.J. van der Laan, E.C. Polley, and A.E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):25, 2007. doi: 10.2202/15446115.1309.
 Petersen et al. [2014] M.L. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M.J. van der Laan. Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. Journal of Causal Inference, 2(2), 2014. doi: 10.1515/jci20130007.
 Benkeser et al. [2019] D. Benkeser, P.B. Gilbert, and M. Carone. Estimating and testing vaccine sieve effects using machine learning. J Am Stat Assoc, 114(527):1038–1049, 2019.
 Balzer et al. [2020] L.B. Balzer, J. Ayieko, D. Kwarisiima, G. Chamie, E.D. Charlebois, J. Schwab, M.J. van der Laan, M.R. Kamya, D.V. Havlir, and M.L. Petersen. Far from MCAR: obtaining populationlevel estimates of HIV viral suppression. Epidemiology, 31(5):620–627, 2020.
 Rubin [1990] D.B. Rubin. Comment: Neyman (1923) and causal inference in experiments and observational studies. Statistical Science, 5(4):472–480, 1990.
 Benitez [2020] A. Benitez. Targeted machine learning approaches for leveraging data in resourceconstrained settings. PhD thesis, University of California, Berkeley, 2020.
 Seaman et al. [2014] S.R. Seaman, M. Pavlou, and A.J. Copas. Review of methods for handling confounding by cluster and informative cluster size in clustered data. Statistics in Medicine, 33:5371–5387, 2014.
 Moore and van der Laan [2009] K.L. Moore and M.J. van der Laan. Covariate adjustment in randomized trials with binary outcomes: Targeted maximum likelihood estimation. Stat Med, 28(1):39–64, 2009. doi: 10.1002/sim.3445.
 Rosenblum and van der Laan [2010] M. Rosenblum and M.J. van der Laan. Simple, efficient estimators of treatment effects in randomized trials using generalized linear models to leverage baseline variables. The International Journal of Biostatistics, 6(1):Article 13, 2010. doi: 10.2202/15574679.1138.
 van der Vaart [1998] A.W. van der Vaart. Asymptotic Statistics. Cambridge University Press, New York, 1998.
 Cornfield [1978] J. Cornfield. Randomization by group: a formal analysis. Am J Epidemiol, 108(2):100–102, 1978.
 Bates et al. [2015] D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting linear mixedeffects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015.
 Hojsgaard et al. [2006] S. Hojsgaard, U. Halekoh, and J. Yan. The R package geepack for generalized estimating equations. J Stat Softw, 15(2):1–11, 2006.
 Prague et al. [2017] M. Prague, R. Wang, , and V. De Gruttola. CRTgeeDR: an R package for doubly robust generalized estimating equations estimations in cluster randomized trials with missing data. The R Journal, 9(2):105–115, 2017.
 Gruber and van der Laan [2010] S. Gruber and M.J. van der Laan. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. The International Journal of Biostatistics, 6(1):Article 26, 2010. doi: 10.2202/15574679.1260.
 Balzer et al. [2016b] L.B. Balzer, M.L. Petersen, and M.J. van der Laan. Targeted estimation and inference of the sample average treatment effect in trials with and without pairmatching. Statistics in Medicine, 35(21):3717–3732, 2016b. doi: 10.1002/sim.6965.
 Bickel et al. [1993] P.J. Bickel, C.A.J. Klaassen, Y. Ritov, and J.A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore, 1993.
 van der Vaart and Wellner [1996] A.W. van der Vaart and J.A. Wellner. Weak convergence and empirical processes. Springer, Berlin Heidelberg New York, 1996.
 Benkeser and van der Laan [2016] D. Benkeser and M. van der Laan. The highly adaptive lasso estimator. Proc Int Conf Dat Sci Adv Anal, pages 689–696, 2016.
 van der Laan [2011] W. Zhengand M. van der Laan. Crossvalidated targeted minimumlossbased estimation. In M.J. van der Laan and S. Rose, editors, Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York Dordrecht Heidelberg London, 2011.
 Díaz [2019] I. Díaz. Machine learning in the estimation of causal effects: targeted minimum lossbased estimation and double/debiased machine learning. Biostatistics, kxz042, 2019.
 Balzer et al. [2018] L.B. Balzer, D.V. Havlir, J. Schwab, M.J. van der Laan, M.L. Petersen, and the SEARCH Collaboration. Statistical analysis plan for SEARCH phase I: Health outcomes among adults. Technical report, arXiv: https://arxiv.org/abs/1808.03231, 2018.
 R Core Team [2020] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020. URL http://www.Rproject.org.
 Beck et al. [2016] C. Beck, B. Lu, and R. Greevy. nbpMatching: functions for optimal nonbipartite optimal matching, 2016. URL https://CRAN.Rproject.org/package=nbpMatching. R package version 1.5.0.
 Schwab et al. [2017] J. Schwab, S. Lendle, M. Petersen, and M van der Laan. ltmle: Longitudinal Targeted Maximum Likelihood Estimation, 2017. URL http://CRAN.Rproject.org/package=ltmle.
 Polley et al. [2018] E. Polley, E. LeDell, C. Kennedy, and M. van der Laan. SuperLearner: Super Learner Prediction, 2018. URL http://CRAN.Rproject.org/package=SuperLearner. R package version 2.024.
Comments
There are no comments yet.