Two-Stage TMLE to Reduce Bias and Improve Efficiency in Cluster Randomized Trials

06/29/2021
by   Laura B. Balzer, et al.
9

Cluster randomized trials (CRTs) randomly assign an intervention to groups of individuals (e.g., clinics or communities), and measure outcomes on individuals in those groups. While offering many advantages, this experimental design introduces challenges that are only partially addressed by existing analytic approaches. First, outcomes are often missing for some individuals within clusters. Failing to appropriately adjust for differential outcome measurement can result in biased estimates and inference. Second, CRTs often randomize limited numbers of clusters, resulting in chance imbalances on baseline outcome predictors between arms. Failing to adaptively adjust for these imbalances and other predictive covariates can result in efficiency losses. To address these methodological gaps, we propose and evaluate a novel two-stage targeted minimum loss-based estimator (TMLE) to adjust for baseline covariates in a manner that optimizes precision, after controlling for baseline and post-baseline causes of missing outcomes. Finite sample simulations illustrate that our approach can nearly eliminate bias due to differential outcome measurement, while other common CRT estimators yield misleading results and inferences. Application to real data from the SEARCH community randomized trial demonstrates the gains in efficiency afforded through adaptive adjustment for cluster-level covariates, after controlling for missingness on individual-level outcomes.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

03/28/2022

Extending inferences from a cluster randomized trial to a target population

We describe methods that extend (generalize or transport) causal inferen...
10/18/2021

Comparative Methods for the Analysis of Cluster Randomized Trials

Across research disciplines, cluster randomized trials (CRTs) are common...
02/16/2019

Generalizing trial findings using nested trial designs with sub-sampling of non-randomized individuals

To generalize inferences from a randomized trial to the target populatio...
02/16/2019

Generalizing trial findings in nested trial designs with sub-sampling of non-randomized individuals

To generalize inferences from a randomized trial to the target populatio...
10/14/2019

The Statistical Performance of Matching-Adjusted Indirect Comparisons

Indirect comparisons of treatment-specific outcomes across separate stud...
04/26/2022

Outcome coding choice in randomized trials of programs to reduce violence

Over the last decade, the number of randomized trials of programs to red...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In many trials, treatments are randomly allocated to groups of individuals, such as hospitals, schools, or communities, and outcomes are measured on individuals in those groups. These studies are known as group or cluster randomized trials (CRTs). They are implemented when the treatment is naturally delivered to the group or when substantial dependence between individuals within groups is expected [1, 2, 3, 4, 5]. Such dependence may arise from shared group-level factors and interactions between participants, including the spread of social behaviors and infectious diseases. Indeed, many group-level interventions may benefit both the individuals who receive the intervention and others who do not; CRTs provide an opportunity to assess such spillover effects [1, 6, 7, 8]. Many CRTs are pragmatic in that their goal is to expand results from prior efficacy studies into real-world effectiveness [9]. CRTs are rapidly increasing in popularity; a recent review found a 280-fold increase in their use from 1995 to 2015 [10]. Nonetheless, despite extensive research dedicated to their design and conduct, this review also concluded only half of CRTs were analyzed appropriately. CRTs can provide gold-standard evidence of causality, but they face several methodological challenges.

First, missing participant outcomes occur in over 90% of CRTs [11]. When participants with missing outcomes differ meaningfully from those with measured outcomes, complete-case analyses yield biased estimates [12, 13, 14, 15]. This potential for bias is exacerbated when, as commonly occurs, the cluster randomized intervention, itself, influences outcome measurement. Suppose, for example, that the cluster-level intervention increases care engagement, which in turn improves both participants’ outcomes and their chances of having that outcome measured (Figure 1). In this scenario, an unadjusted comparison of outcomes between randomized arms can overestimate or underestimate the treatment effect. Even if key determinants of missingness, such as care engagement, are measured, standard analytic approaches to CRTs will also fail to control for this bias, because care engagement simultaneously mediates the treatment-outcome relationship and confounds the missingness-outcome relationship [16, 17, 13, 18].

Figure 1: Simplified causal graph to illustrate the challenges of adjustment for measurement impacted by the randomized treatment and post-baseline factors (here, being in care).

Second, CRTs often randomize limited numbers of groups [1, 2, 3, 4, 10, 5]; a review found a median of 33 clusters randomized [19]. Even when some form of restricted randomization (e.g., pair-matching) is used, CRTs with few clusters are likely to suffer from chance imbalances between treatment arms on baseline determinants of the outcome. Adjustment for these covariates and others predictive of the outcome can increase statistical power (e.g., [20, 14, 21, 1, 22, 4, 10, 5]). The adjustment approach is often with an outcome regression, characterizing the expected outcome, given the treatment assignment and covariates. This regression must be a priori

-specified to avoid inflating Type-I error rates. Additionally to avoid over-fitting, a limited number of adjustment variables must be selected from a typically large set of candidates, risking forced adjustment for variables that prove useless for, or even detrimental to, precision

[23, 24, 25]. Thus, the challenge is to define a fully pre-specified procedure for CRT analysis that optimizes power through data-adaptive adjustment of baseline covariates, while rigorously preserving Type-I error control.

In this manuscript, we propose and and evaluate a novel estimator that addresses the dual challenges of bias due to missing outcomes and imprecision due to few randomized units in CRTs. Our approach uses targeted minimum loss-based estimation (TMLE) twice: first at the individual-level to adjust for differential measurement of individual-level outcomes and second at the cluster-level to improve efficiency when estimating the intervention effect [21]. Therefore, we refer to our estimator as “Two-Stage TMLE”. To the best of our knowledge, Two-Stage TMLE is the first semiparametric efficient estimator that adaptively adjusts for both individual-level missingness and cluster-level imbalance in CRTs. It can be applied to estimate a range of causal parameters and under a range of CRT study designs, including differing randomization schemes (e.g., pair-matched or not) and approaches to participant follow-up within clusters (e.g., cross-sectional sampling or longitudinal follow-up).

2 Background

2.1 Brief Review of Current Methods for CRT Analysis

We provide an overview of existing CRT methods in Table 1. A simple two-stage approach to account for the dependence of participants within clusters is to aggregate the individual-level data to the cluster-level and then implement an effect estimator appropriate for independent data, such as a -test. Use of an unadjusted effect estimator in the second stage avoids modeling assumptions and the risk of over-fitting, but by ignoring covariate information is inefficient (e.g., [20, 14, 21, 1, 22, 4, 10, 5]).

Unadjusted

Compare cluster-level outcomes by treatment arm; commonly implemented as a t-test.

CARE At the cluster-level, compare observed outcomes with those predicted from a regression of the individual-level outcome on individual- and cluster-level covariates, but not the cluster-level treatment [14, 1].
Mixed Model Point estimate and inference based the treatment coefficient in a regression of the individual-level outcome on the cluster-level treatment and individual- and cluster-level covariates; use random effects to account for dependence of individuals within a cluster [26].
GEE Point estimate and inference based the treatment coefficient in a regression of the individual-level outcome on the cluster-level treatment and individual- and cluster-level covariates; use a working correlation matrix to account for dependence of individuals within a cluster [27].
Augmented-GEE Modification to GEE for the marginal effect (i.e., GEE with only regression coefficients for the intercept and cluster-level treatment) by including an additional “augmentation” term for the outcome regression (i.e., the conditional expectation of the outcome, given covariates and treatment) [28, 23, 29].
Hierarchical TMLE

At the cluster-level, compare targeted predictions of the outcomes under the intervention and control; initial predictions of the outcome regression and propensity score (i.e., conditional probability of treatment, given the covariates) are made by adaptively selecting from baseline covariates at the individual- or cluster-level

[30].
Table 1: Description of CRT effect estimators as commonly implemented.

In contrast, mixed models and generalized estimating equations (GEE) typically adjust for a number of baseline individual-level and cluster-level covariates, providing an opportunity to improve precision of effect estimates [26, 27]

. However, neither address the need for a pre-specified approach to select the adjustment variables that optimize efficiency, while preserving valid statistical inference. Further, both are susceptible to allowing the estimator choice to define the effect measure that is estimated (e.g., GEE with a logistic link yields estimates of the conditional odds ratio)

[31].

To the best of our knowledge, only three methods generally allow for the estimation of marginal effects in CRTs while adjusting for individual- and cluster-level covariates. First, in the covariate adjusted residuals estimator (CARE), cluster-level outcomes are compared with those predicted from an individual-level regression of the outcome on individual- and cluster-level covariates, but not the cluster-level treatment [14, 1]. Second, augmented-GEE extends GEE for the marginal effect by including an “augmentation” term, inspired by the efficient influence function [28, 23, 29]. Finally, hierarchical TMLE solves the efficient estimating equation for the marginal effect, while remaining a plug-in estimator [30].

With regards to missingness, an unadjusted effect estimator requires the strongest identification assumption: there are no common causes of missingness and outcomes (i.e., the missing-completely-at-random, or MCAR, assumption holds) [12, 15]

. The other methods rely on a weaker identification assumption; essentially, that the outcome distributions among persons for which the outcome is measured versus missing are exchangeable conditional on the treatment arm and some subset of measured covariates. In this setting, mixed models, GEE, and CARE will yield nearly unbiased estimates of the intervention effect only if the common causes of missingness and outcomes occur at baseline and are included in a correctly specified outcome regression

[32, 4, 33, 34].

Combining Augmented-GEE with inverse probability weighting yields a double robust estimator (“DR-GEE”); it is nearly unbiased if either the outcome regression or the propensity score for the missingness mechanism (i.e., the conditional probability of outcome measurement given the treatment arm and covariates) is correctly specified [35]. To the best of our knowledge, DR-GEE’s methodology and computing code have been limited to adjustment for baseline variables only. Hierarchical TMLE also offers the potential for integrated precision gains and double robust adjustment for differential measurement. However, these extensions remain to be fully studied. Here, we, instead, develop and evaluate Two-Stage TMLE to (1) control for potentially differential missingness in each cluster separately, and (2) adaptively adjust for baseline covariates to improve efficiency when estimating treatment effect at the cluster-level. Before doing so, we first present our motivating example.

2.2 Motivating Example

The SEARCH Study was a two-armed, pragmatic CRT of 32 communities each with 10,000 persons in rural Kenya and Uganda (ClinicalTrials.gov: NCT01864603) [36]. SEARCH was designed to evaluate the population-level effects of annual multi-disease testing and universal treatment for persons with HIV (intervention) versus baseline multi-disease testing and country-guided treatment (active control) on a range of outcomes including incident HIV, viral suppression among persons with HIV, hypertension control, and incident tuberculosis. A key innovation of the SEARCH intervention was to shift from a stand-alone HIV service model to a multi-disease model. HIV testing was offered at health fairs and at home as part of a broader package for community health screening and education [37, 38]. Treatment for those diagnosed positive was offered rapidly, in a welcoming environment with flexible patient-centered options, and through an integrated approach for chronic (e.g., hypertension and diabetes) and infectious diseases (e.g., malaria and tuberculosis) [39].

As with the vast majority of CRTs, SEARCH Study outcomes were not measured among all study participants and the MCAR assumption was unreasonable for many endpoints. Additionally, despite matching prior to randomization [40], covariate imbalance was expected, and to avoid over-fitting, extensive adjustment was prohibited by having only 16 communities per arm. To reduce bias from missingness on individual-level outcomes and to improve precision through covariate adjustment in the SEARCH Study, we developed Two-Stage TMLE, which we describe and evaluate in the remainder of the manuscript.

3 Two-Stage TMLE

3.1 Stage 1: Defining and Estimating Cluster-Specific Outcomes

In many CRTs, outcomes are assessed through longitudinal follow-up of a closed cohort of participants. In the SEARCH Study, for example, the primary outcome was the three-year cumulative incidence of HIV: the proportion of community residents ( 15 years) who were HIV-uninfected at baseline and became infected with HIV over the three-year study. To assess the treatment effect on such endpoints, the cohort of participants who are at risk of the outcome is defined in each cluster. For each participant, let denote their baseline covariates, be their post-baseline covariates, be an indicator of outcome measurement at study close, and be an indicator of having the outcome of interest. The outcome is only observed when . We also observe cluster-level covariates and the randomly assigned cluster-level treatment . Throughout, superscript will be used to distinguish cluster-level variables from individual-level variables. In the SEARCH Study, for example, included baseline HIV prevalence and male circumcision coverage; included age, sex, marital status, occupation, education, and mobility; was a community-level indicator of randomization to the intervention; was interim HIV testing; was an indicator of HIV testing at year 3, and was an indicator of having a confirmed HIV-positive diagnosis at year 3 testing. More generally, we denote the observed data structure for a participant as .

Let denote the counterfactual outcome for a participant when setting the measurement indicator . Then for a given cluster, our Stage 1 causal parameter is the expected counterfactual outcome under a hypothetical intervention to ensure outcome measurement: . If in all clusters, MCAR held or, equivalently, , the causal parameter could be identified as and consistently estimated as the empirical mean among those measured within each cluster: . This missing data assumption can be relaxed by allowing measurement to depend on participant characteristics as well as the cluster-level covariates and treatment . Specifically, if following the randomization assumption holds and there is sufficient data support (i.e., the positivity assumption holds [41]), our Stage 1 statistical estimand would be

(1)

Within each cluster separately, this statistical parameter could be estimated by a variety of algorithms, including inverse-weighting and G-computation [42, 16]. Importantly, by fully stratifying on cluster, we avoid having to specify interactions between individual-level variables (, ) and cluster-level variables (.

For statistical estimation, we focus on TMLE given its asymptotic properties and improved finite sample performance (e.g., [21, 43]). We refer readers [44, 45]

for an introduction. Briefly, TMLE combines estimates of the conditional mean outcome (a.k.a., the outcome regression) with those of the propensity score (e.g., conditional probability of being measured). In doing so, TMLE achieves a number of desirable properties, including double robustness: a consistent estimate is attained if either the outcome regression or the propensity score is consistently estimated. If both are consistently estimated at reasonable rates, TMLE will be efficient. In practice, we recommend implementing TMLE using Super Learner, an ensemble machine learning algorithm, to obtain initial estimates of the outcome regression and the propensity score

[46]. Step-by-step implementation of TMLE for Eq. 3 is given in the Supplementary Materials.

When assessing effects on time-to-event endpoints, participants are followed longitudinally until the occurrence of the event of interest or right-censoring (e.g., due to outmigration or study close). Examples of such endpoints in the SEARCH Study included the probability of treatment initiation, all-cause mortality, and the cumulative risk of HIV-associated tuberculosis or death due to illness. The above framework can easily be extended to handle these survival-type endpoints. Specifically, in Stage 1, we could estimate the cluster-specific endpoint using the Kaplan-Meier estimator when censoring is non-differential or TMLE when censoring is differential [47, 48].

Other endpoints may be assessed using a cross-sectional design, where participants are measured at a single timepoint. In these settings, we may have missingness on the characteristic defining the sub-population of interest as well as the outcome of interest. Consider, for example, population-level HIV viral suppression, defined as the proportion of all HIV-infected persons whose plasma HIV RNA level is less than some limit, such as 500 copies/mL: . Both baseline and time-varying factors impact HIV status and its measurement, as well as viral suppression and its measurement. To handle missingness on both the outcome (viral suppression) and the conditioning set (HIV-positivity), we redefine the endpoint as the joint probability of being HIV-positive and suppressed, divided by HIV prevalence: . As detailed in [49], we can use TMLE to estimate each quantity separately and then take the ratio to estimate the cluster-specific endpoint in Stage 1.

3.2 Stage 2: Estimation of the Treatment Effect

After Stage 1, we focus on the cluster-level data for the purposes of defining, estimating, and obtaining inference for treatment effect. Specifically, let be the counterfactual, cluster-level outcome under an additional intervention to set the treatment to 1 (intervention) or 0 (control). The causal parameter for Stage 2 is a summary measure of the distribution of these counterfactuals. A common target is the population average treatment effect (PATE): . Alternatively, we could be interested in the sample average treatment effect (SATE), which is the effect for the study clusters [50], or in summary measures on the relative scale. In the SEARCH Study, for example, the primary analysis was for the sample risk ratio for the trial communities:

(2)

where was the counterfactual cumulative HIV incidence in community .

We can consider a wider range of causal parameters by combining each summary measure with weights. Specifically, let be the size of cluster , and consider a weighted-version of the treatment-specific sample mean: . Then setting gives equal weight to participants, while setting gives equal weight to clusters [51]. When there is an interaction between cluster size and the treatment, cluster size is said to be “informative” [52], and the resulting causal parameters will generally not be equivalent. In all settings, the target causal parameter should be pre-specified and be driven by research question.

For estimation of the Stage 2 causal parameter, the observed data can be simplified to the cluster-level: , where represents the baseline cluster-level covariates; is an indicator of randomization to the intervention arm, and is the estimated cluster-specific endpoint, which appropriately accounts for differential measurement of individual-level outcomes (e.g., Eq. 3). Using these data, an intuitive estimator of the intervention effect is the (weighted) average outcome among intervention clusters contrasted with the (weighted) average outcome among control clusters . Instead, we use TMLE to obtain a more efficient estimate of the intervention effect (e.g., [53, 54]). Briefly, an initial estimator of the cluster-level outcome regression is updated based on an estimate of the cluster-level propensity score to achieve a targeted estimator . For all clusters, targeted estimates of the expected outcome under the intervention and under the control are generated, averaged across clusters, and contrasted to estimate the intervention effect. Step-by-step implementation for Eq. 4 is given in the Supplementary Materials.

With limited numbers of clusters, it is difficult, if not impossible, to a priori-specify the optimal estimator of the outcome regression or the propensity score . In this setting, we recommend using Adaptive Pre-specification to select the adjustment variables that maximize empirical efficiency [24]

. The procedure requires pre-specification of (1) a set of candidate estimators for the outcome regression, (2) a set of candidate estimators for the known propensity score, (3) a loss function corresponding to the squared influence curve for the TMLE of the target estimand, and (4) a sample-splitting scheme to select the candidate estimator with the lowest cross-validated risk estimate. In trials with limited randomized units, such as CRTs, leave-one-out or leave-one-pair-out cross-validation is recommended. Altogether, the procedure will data-adaptively select, from a pre-specified set, the candidate adjustment variables (and thus TMLE) which maximize precision. Importantly, if none of the pre-specified covariates improves precision over the unadjusted effect estimator, then the approach will not adjust for any cluster-level covariates (i.e., reduces to the unadjusted effect estimator). Thus, decisions about whether and how to adjust for precision gains are made with a rigorous procedure that does not compromise Type-I error.

3.3 Statistical Inference

Now we have a point estimate of the intervention effect and are ready to obtain statistical inference, which occurs at the cluster-level. Under the following conditions, detailed in the Supplementary Materials, Two-Stage TMLE is an asymptotically linear estimator of the intervention effect , meaning that , where is the cluster-level influence curve and is the remainder term, going to zero in probability [55]:

  1. Stage 1 estimation of the cluster-level outcomes provides negligible contribution to

  2. Stage 2 estimators of the cluster-level outcome regression and the cluster-level propensity score satisfy the usual regularity conditions (e.g., [53])

The second condition is automatically satisfied when Adaptive Pre-specification is used to select among generalized linear models for the cluster-level outcome regression and the known propensity score

. However, to satisfy the first condition we need (1) the Stage 1 estimators of the individual-level outcome regression and the individual-level measurement mechanism to converge to their targets at fast enough rates, (2) the within cluster dependence to be weak enough that the Central Limit Theorem applies in cluster size

, and (3) the cluster size is large relative to the total number of clusters (i.e., ; details in the Supplementary Materials). When is substantially larger , we can weaken the independence assumption to allow for a slower rate of convergence. The Stage 1 requirements highlight the importance of appropriate control for missingness on individual-level outcomes in CRTs. Suppose, for example, individual-level outcomes are not MCAR, and nonetheless, we use an unadjusted estimator in Stage 1. Then the cluster-level outcomes will not be consistently estimated, and we will obtain biased point estimates and misleading conclusions in Stage 2. We note these conditions apply to all Two-Stage estimators, including the Student’s -test, the paired -test, and CARE.

When the above conditions hold, the limit distribution of the standardized estimator is normal with mean 0 and variance given by the variance of its influence curve. For example, the influence curve for Two-Stage TMLE for the treatment-specific population mean

is approximated as . We obtain a variance estimate with the sample variance of the estimated influence curve divided by the number of independent units . Then using the Student’s -distribution with degrees of freedom as a finite sample approximation to the normal distribution [56, 14, 1]

, we can then construct Wald-Type 95% confidence intervals and conduct hypothesis testing.

Additionally, through the Delta Method, we can derive the influence curve and variance estimator for the intervention effect on any scale of interest. If, for example, we were interested in the absolute effect , then the estimated influence curve for TMLE would be . If, instead, we were interested in the relative effect, we would apply the Delta method on the log-scale [53]. This approach to statistical inference also applies in trials where the treatment is randomized within matched pairs of clusters (Supplementary Materials). This approach could naturally be extended to matched triplets in a three-armed CRT. Analogous methods are used to obtain inference for the sample or conditional effects.

4 Simulation Study

We examine the finite sample performance of our proposed estimator using simulations to incorporate common CRT challenges, such as few randomized clusters and differential missingness. Specifically, we focus on a setting with clusters and where within each cluster, the number of individual participants is sampled with equal probability from {100, 150, 200}. In these simulations, both baseline and post-baseline covariates impact measurement of individual-level outcomes. Additional simulations and computing code are given in the Supplementary Materials.

4.1 Data Generating Process

For each cluster , we independently generate the cluster-specific data as follows. First, two latent variables are drawn uniformly from (-1, 1) and one additional variable is drawn independently from a standard normal distribution. Then, two individual-level covariates are drawn independently from normal distributions with cluster-specific means: and . We set the observed cluster-level covariates as the empirical mean of their individual-level counterparts. The cluster-level intervention is randomly allocated within pairs of clusters matched on ; therefore, clusters receive the intervention and the control.

The individual-level mediator is generated as an indicator that , drawn from a Uniform(0,1), is less than the . The underlying, individual-level outcome is generated as an indicator that , drawn from a Uniform(0,1), is less than . Finally, individual-level measurement is generated as an indicator that , drawn from a Uniform(0,1), is less than . Thus, the measurement mechanism is highly differential by treatment arm. We set the observed outcomes to be missing for individuals with .

We also generate the counterfactual mediators and outcomes by setting the cluster-level treatment and preventing missingness (i.e., setting ). The cluster-level counterfactual outcome is the empirical mean within each cluster . By generating a population of 5000 clusters, we calculate the true value of the treatment-specific population means for , the risk difference , and the risk ratio .

4.2 Estimators Compared in the Simulation Study

We compare a variety of estimators commonly implemented in CRTs. We consider four complete-case approaches, in which the data are subset to exclude participants with missing outcomes (i.e., those with ): an unadjusted estimator, CARE, mixed models, and GEE. We also implement two approaches which use data on all participants including those without measured outcomes (i.e., ): DR-GEE and the Two-Stage TMLE proposed here.

For the unadjusted approach, we implement the Student’s -test; we first aggregate the individual-level outcomes to the cluster-level by taking the empirical mean among those measured (i.e., ) and then contrast the average cluster-level outcomes by treatment arm with inference from the

-distribution. For CARE, we pool data across clusters and run logistic regression of the individual-level outcome

on the individual- and cluster-level covariates ; calculate the residuals by taking the difference between the cluster-level outcomes (the empirical mean among those measured) and those predicted from the previous regression, and finally contrast the cluster-level residuals by treatment arm with a -test.

In mixed models and GEE, we again pool data across clusters and fit a log-linear regression of the individual-level outcome

on the cluster-level treatment , individual-level covariates , and cluster-level summaries . In DR-GEE, we also estimate the measurement mechanism with a pooled logistic regression of on and the augmentation terms with arm-specific, pooled, log-linear regressions of on . To account for within cluster dependence, we include a random cluster-specific intercept in mixed models and use an independent working correlation matrix in the GEEs. The default settings of lme4, geepack, and CRTgeeDR

packages are used for standard error estimation

[57, 58, 59]. In all approaches, inference for the intervention effect is based on inference for the treatment coefficient.

For Two-Stage TMLE, we first implement an individual-level TMLE within each cluster separately to estimate the cluster-specific endpoint . In these TMLEs, the outcome regression and the measurement mechanism are estimated using Super Learner to combine predictions from main terms logistic regression, generalized additive models, and the empirical mean. In Stage 2, we compare these cluster-specific estimates by treatment arm using TMLE with Adaptive Pre-specification to select the optimal adjustment variables from . Inference is obtained via the estimated influence curve and the Student’s -distribution, as outlined in Section 3.3.

4.3 Simulation Results

The true values of the risk difference and risk ratio were -9.2% and 0.88, respectively. For both effects, Table 2 summarizes estimator performance when “breaking the matches” (i.e., ignoring the pair-matching scheme used for treatment randomization) and preserving the matches. The exception is for DR-GEE, because to our knowledge, there does not yet exist an extension of DR-GEE for pair-matched trials.

BREAKING THE MATCHES KEEPING THE MATCHES
bias CI power bias CI power
FOR THE RISK DIFFERENCE (true value RD=-9.2%)
t-test -32.1 -22.9 0.047 0.050 0.4 100.0 -32.1 -22.9 0.047 0.048 0.8 100.0
CARE -17.7 -8.5 0.031 0.029 18.0 100.0 -15.5 -6.2 0.039 0.033 61.4 99.4
TMLE -9.8 -0.6 0.035 0.047 98.0 50.2 -9.8 -0.6 0.035 0.043 96.6 55.2
FOR THE RISK RATIO (true value RR=0.88)
Mixed 0.8 0.9 0.038 0.065 58.2 99.8 0.8 0.9 0.038 0.065 58.0 99.8
GEE 0.8 0.9 0.038 0.044 18.6 99.8 0.8 0.9 0.044 0.033 5.4 99.8
DR-GEE 0.7 0.7 0.049 0.064 0.0 100.0
TMLE 0.9 1.0 0.046 0.064 98.6 51.0 0.9 1.0 0.047 0.059 97.4 56.8
: average point estimate (in % for the RD)
bias: average deviation in the point estimates vs. true effect (additive scale for RD (in %) & relative scale for RR)

: standard deviation of the point estimates (on log-scale for RR)

: average standard error estimate (on log-scale for RR)
CI: proportion of 95% confidence intervals containing the true effect (in %)

power: proportion of trials correctly rejecting the false null hypothesis (in %)

Table 2: Over 500 simulated trials, the performance of CRT estimators when missingness depends on baseline and post-baseline variables. Results are shown when the target of inference is the risk difference (top 3 rows), when the target is the risk ratio (bottom 4 rows), when breaking the matches during analysis (left), and when preserving the matches during analysis (right).

Focusing first on estimators of the risk difference, we see that -test, which fails to adjust for any covariates, is highly biased, as expected given the differential measurement process. On average, it grossly overestimates the intervention effect by 22.9% and attains confidence interval coverage of 1%. By adjusting for covariates, CARE is less biased, but also overestimates the intervention effect by 8.5% when breaking the matches and by 6.2% when preserving the matches. The corresponding confidence interval coverages are much less than the nominal rate: 18.0% and 61.4%, respectively. In contrast, the bias of Two-Stage TMLE for the risk difference is low (1%) and confidence interval coverage is good (96%). Also as predicted by theory [40], more power is achieved when preserving (55.2%) versus breaking the matches (50.2%).

Now focusing on estimators of the risk ratio, both mixed models and GEE overestimate the intervention effect (10% larger) on average. This bias is substantial enough to prevent accurate inference; the confidence interval coverage for mixed models is 58% and 5.4%-18.6% for GEE. Indeed, we would only expect these estimators to be unbiased when there are only baseline causes of missingness and the outcome regressions are correctly specified. In this simulation, there are post-baseline drivers of measurement, which are simultaneously mediators of the treatment-outcome relationship. Standard analytic approaches are expected to fail in this setting [16, 13, 18].

DR-GEE is expected to reduce bias due to missing outcomes by incorporating weights corresponding to the measurement mechanism, and this is, indeed, seen when there are only baseline causes of measurement (Table 1 in Supplementary Materials). However, to the best of our knowledge, extensions of DR-GEE to handle post-baseline causes of missingness do not yet exist, and in the main simulations, DR-GEE is the most biased estimator for the risk ratio and attains 0% confidence interval coverage. In contrast, Two-Stage TMLE is essentially unbiased and achieves good-to-conservative confidence interval coverage (95%). Again, more power is achieved when keeping (56.8%) versus breaking the matches (51.0%). In simulations under the null, Two-Stage TMLE maintains nominal Type-I error control (Table 2-3 in Supplementary Materials).

5 Application to the SEARCH Study

The results of the SEARCH Study have been previously published in [36]; here, we focus on the efficiency gains from cluster-level covariate adjustment in Stage 2, after adjusting for individual-level missingness in Stage 1. For select endpoints, we describe the estimator implementation and then compare point estimates and inference for the intervention effect when using TMLE with Adaptive Pre-specification in Stage 2 versus the unadjusted effect estimator in Stage 2. We also compare breaking versus keeping the matched pairs used for randomization.

As previously discussed, the primary outcome in the SEARCH Study was the three-year cumulative HIV incidence, measured in each community through a cohort of residents who were aged 15+ years and HIV-uninfected at baseline. The pre-specified primary approach was Two-Stage TMLE to assess the intervention effect on the relative scale, weighting communities equally, and keeping the matches. In Stage 1, we estimated the community-specific, cumulative incidence of HIV with TMLE adjusting for possibly differential capture of final HIV status. These individual-level TMLEs used Super Learner, an ensemble method to combine predictions from penalized regression, generalized additive models, main terms regression, and the simple mean. In Stage 2, the intervention effect was estimated with a second, community-level TMLE, using Adaptive Pre-specification to select from the following adjustment variables: baseline HIV prevalence, baseline male circumcision coverage, or nothing (unadjusted).

A similar approach was taken for all secondary endpoints, including the incidence of HIV-associated tuberculosis (TB) or death due to illness, hypertension control among adults (30+ years) with baseline hypertension, and population-level HIV viral suppression (HIV RNA500 copies/mL). When assessing the impact on TB or death due to illness, we used the Kaplan-Meier method in Stage 1 to estimate the three-year risk in each community, separately; we censored at death due to other causes, outmigration, and study close. When assessing the impacts on hypertension control and HIV viral suppression, we implemented individual-level TMLEs in Stage 1 to adjust for baseline and post-baseline drivers of measurement. For all secondary endpoints, we used TMLE with Adaptive Pre-specification to assess the intervention effect in Stage 2.

As shown in Table 3, the point estimates of the intervention effects are similar, but the precision gains from the Stage 2 approach are notable. Here, “efficiency” is the variance of the unadjusted effect estimator breaking the matches divided by the variance of an alternative approach. For the primary endpoint (HIV incidence), we see precision gains when keeping versus breaking the matches; specifically, the unadjusted effect estimator is 3.1-times more efficient in the pair-matched analysis. As expected, TMLE with Adaptive Pre-specification keeping the matches is the most efficient approach and 4.6-times more efficient than the standard approach.

Similar results are seen for the incidence of HIV-associated TB and hypertension control (Table 3). TMLE using Adaptive Pre-specification and keeping the matches is 2-times more efficient than the unadjusted effect estimator ignoring the matches. In contrast, minimal gains in efficiency are seen when evaluating the effect on HIV viral suppression. This is because the adaptive approach used in TMLE defaults to the unadjusted effect estimator when adjustment does not improve precision. In this scenario, adjusting for the baseline prevalence of viral suppression or the proportion of youth (15-24 years) with HIV did not improve precision over the unadjusted effect estimator. However, as shown in Table 3 of the Supplementary Materials, assuming MCAR and relying on the unadjusted estimator in Stage 1 resulted in vast over-estimation of the endpoint in the intervention arm (85.2% vs. 79.0%) and control arm (75.8% vs. 67.8%).

Breaking the Matches Keeping the Matches
Stage 2 Effect (95% CI) Efficiency Effect (95% CI) Efficiency
HIV Incidence Unadjusted 0.98 (0.66, 1.45) 1 0.98 (0.78, 1.24) 3.1
TMLE 0.96 (0.73, 1.26) 2.1 0.96 (0.8, 1.17) 4.6
TB Incidence Unadjusted 0.79 (0.64, 0.98) 1 0.79 (0.69, 0.92) 2.2
TMLE 0.8 (0.67, 0.95) 1.4 0.8 (0.69, 0.91) 2.6
Hypertension Control Unadjusted 1.19 (1.1, 1.3) 1 1.19 (1.11, 1.28) 1.7
TMLE 1.18 (1.1, 1.26) 1.6 1.19 (1.11, 1.27) 1.8
Viral Suppression Unadjusted 1.15 (1.11, 1.2) 1 1.15 (1.11, 1.2) 1
TMLE 1.16 (1.13, 1.2) 1.1 1.15 (1.11, 1.2) 1
Efficiency: Variance estimate for the unadjusted effect estimator breaking the matches used for randomization,
divided by the variance estimate of another approach.
Table 3: Point estimates, 95% confidence intervals, and efficiency comparisons for key outcomes in the SEARCH Study when estimating the intervention effect in Stage 2 with the unadjusted estimator, TMLE with Adaptive Pre-specification, and when breaking versus keeping the matches.

6 Discussion

Cluster randomized trials (CRTs) are essential for assessing the effectiveness of interventions delivered to groups of individuals (e.g., clinics or communities). There have been notable advances in the design and conduct of CRTs (e.g., [1, 2, 3, 4, 10, 5]). However, substantial challenges remain and threaten the quality of evidence generated by CRTs. Regardless of best intentions, most CRTs are prone to differential measurement of individual-level outcomes and to imbalance on predictive covariates between randomized arms. In this paper, we proposed and evaluated a novel estimator, Two-Stage TMLE, to address the dual challenges of bias due to missing individual-level outcomes and imprecision due to few randomized units (i.e., clusters). In Stage 1, a TMLE is implemented within each cluster separately to estimate the cluster-specific endpoint, which appropriately controls for missingness on participant outcomes. Fully stratifying on the cluster allows the missingness mechanism to vary by cluster. In Stage 2, the treatment effect is estimated with a separate, cluster-level TMLE to compare the cluster-level endpoints, estimated from Stage 1. In Stage 2, cross-validation is used to adaptively select the baseline covariates from a pre-specified set that maximize precision [24]. Statistical inference is based on the estimated influence curve and the Student’s -distribution. Finite sample simulations demonstrated the potential of the proposed method to overcome the shortcomings of existing CRT methods, especially when there are post-baseline causes of missingness. Application to real data from the SEARCH Study demonstrated the precision gains attained through adaptive adjustment for cluster-level covariates in Stage 2.

To the best of our knowledge, Two-Stage TMLE is the first CRT estimator that simultaneously addresses bias due to individual-level missingness and adaptive adjustment, in a fully pre-specified manner, to improve efficiency. The approach is applicable to a wide range of measurement schemes (e.g., single cross-sectional sample, repeated cross-sectional sampling, and longitudinal follow-up) and endpoint types (e.g., binary, continuous, time-to-event outcomes). The approach is also applicable to a wide range of causal parameters (e.g., population, conditional, and sample effects) and scales of inference (e.g., absolute or relative measures). Additionally, Two-Stage TMLE should naturally generalize to hierarchical data settings with a non-randomized, cluster-level exposure. In such an observational setting, the cluster-level TMLE implemented in Stage 2 would focus on confounding control, as opposed to efficiency improvement. However, the asymptotic properties and finite sample performance of such an estimator remain an area of future work.

Altogether, Two-Stage TMLE alleviates, but does not fully resolve, the challenges that arise from missing data and covariate adjustment for efficiency gains in CRTs. In Stage 1, TMLE uses machine learning to flexibly adjust for baseline and time-dependent causes of missingness and, as a plug-in estimator, provides more stability under strong confounding or rare outcomes. However, adjustment for missingness in Two-Stage TMLE occurs within each cluster separately, limiting the breadth of adjustment when the cluster-specific outcome is rare or the cluster-specific sample size is small (e.g., in subgroup analyses). Furthermore, Stage 2 adjustment for efficiency gains is limited to cluster-level covariates. Future work will address these remaining challenges.

7 Acknowledgments

On behalf of the SEARCH Study, we thank the Ministry of Health of Uganda and of Kenya; our research and administrative teams in San Francisco, Uganda, and Kenya; collaborators and advisory boards; and especially all the communities and participants involved. We also thank Dr. Ted Westling for his thoughtful feedback on this project.

This work was supported by the National Institutes of Health [grant numbers U01AI150510, U01AI099959, UM1AI068636, and R01AI074345]; by the President’s Emergency Plan for AIDS Relief; and by Gilead Sciences, which provided Truvada.

8 Supplementary Materials

8.1 Step-by-Step Implementation of the Stage 1 TMLE

For demonstration, we focus on implementation of TMLE for the cluster-specific endpoint

(3)

where is the individual-level outcome, is an indicator of measurement, are baseline individual-level covariates, are post-baseline individual-level covariates, are the baseline cluster-level covariates, and is the cluster-level treatment indicator. To estimate Eq. 3 with TMLE, we take the following steps within each cluster , separately. Throughout, indexes the participants of cluster . (For ease of notation, we drop the superscript when denoting the cluster-size in the Supplementary Materials.)

  1. Among those with measured outcomes (i.e., ), use Super Learner to flexibly model the relationship between the outcome and adjustment factors .

  2. Use the output from #1 to predict the outcome for all participants, regardless of their measurement status: for .

  3. Target these machine learning-based predictions with information in the estimated measurement mechanism , also fit with Super Learner.

    1. Calculate the “clever covariate” for

    2. Run logistic regression of outcome

      on only the intercept, using the logit of the initial estimator

      as offset (i.e., fixing its coefficient to 1) and the clever covariate as weight.

    3. Denote the resulting intercept as

  4. Obtain targeted predictions of the outcome for all participants, regardless of their measurement status: for .

    1. Add the estimated intercept to the logit of the initial estimates and transform back to the original scale (i.e., take the inverse-logit):

  5. Average the targeted predictions to obtain a cluster-specific endpoint estimate adjusted for missingness on individual-level outcomes:

Because we are implementing TMLE in each cluster separately, we do not include the cluster-level covariates or treatment in the above estimation procedure. Updating on the logit-scale is recommended for binary and continuous individual-level outcomes; for details see [60]. For time-to-event outcomes with potentially differential censoring (i.e., survival-type endpoints), we would, instead, implement longitudinal TMLE within each cluster separately [47, 48].

8.2 Step-by-Step Implementation of the Stage 2 TMLE

Given estimates of the cluster-specific endpoints for from Stage 1, we then implement a cluster-level TMLE to more efficiently estimate the intervention effect in Stage 2. For demonstration, we focus on TMLE for the sample effect on the relative scale:

(4)

where is the counterfactual, cluster-level outcome under treatment-level .

  1. Obtain an initial estimate of the conditional expectation of the cluster-level outcome, given the cluster-level treatment and covariates: . We could, for example, fit a “working” regression of the estimated outcome on an intercept with main terms for the cluster-level treatment and selected cluster-level covariates [53, 54].

  2. Use the output from #1 to predict the outcome for all clusters under both the intervention and control conditions: and for .

  3. Target the initial predictions using information in the estimated propensity score for .

    1. To estimate the cluster-level propensity score, we could again fit a “working” logistic regression of the cluster-level treatment indicator on an intercept and selected cluster-level covariates .

    2. Calculate the two-dimensional “clever” covariate: and for .

    3. Run logistic regression of cluster-level outcome on the clever covariates and , suppressing the intercept, and using the logit of the initial estimator as offset (i.e., fixing its coefficient to 1).

    4. Denote the resulting coefficient estimates corresponding to and as and , respectively.

  4. Obtain targeted predictions of the outcome for all clusters under both the intervention and control conditions:

  5. Obtain a point estimate by dividing the average of the targeted predictions under the intervention condition by the average of the targeted predictions under the control condition:

If the known propensity score is not estimated (e.g., in two-armed CRTs with balanced allocation), then the targeting step can be skipped. As detailed in [53], using a two-dimensional clever covariate during updating (step 3) allows for simultaneous targeting of the treatment-specific means and effects on the additive, relative, and odds ratio scales. As detailed in [61], implementation to obtain a point estimate is identical for the population, conditional, and sample effects. In other words, we would follow the same steps if our goal were .

8.3 Asymptotic Linearity of Two-Stage TMLE

Briefly, an estimator is asymptotically linear if the difference between the estimator and the estimand behaves (in first order) as an empirical average of a mean-zero and finite variance function, known as the influence curve, of the unit data [62, 63, 21]. An asymptotically linear estimator will be consistent and normally distributed in its limit. Therefore, the Central Limit Theorem can be applied to construct 95% confidence intervals and test the null hypothesis.

Recall that in Stage 1, we first define the cluster-specific outcome . If all individual-level outcomes are completely measured, then could be defined as the expected individual-level outcome within each cluster: . If the individual-level outcomes are missing-completely-at-random, then could be defined as the expected individual-level outcome among those measured: . Likewise, if measurement depends on individual-level, baseline covariates , then could be defined as the expected individual-level outcome given measurement and those covariates, standardized with respect to the covariate distribution: . Extensions to scenarios with post-baseline drivers of missingness and/or right-censoring follow analogously.

Next, we estimate the cluster-specific outcome within each cluster , separately. When outcomes are completely measured () or are missing-completely-at-random (), a simple and intuitive estimator is the empirical mean outcome among those measured. When outcomes are missing-at-random within values of the adjustment variables (e.g., ), we recommend using TMLE for estimation of the cluster-specific outcome. The empirical mean outcome (among those measured) can be considered a special case of TMLE where the adjustment set is empty: .

To emphasize how the Stage 1 estimator depends on the individual-level data within each cluster, let denote the true distribution of the individual-level data in cluster . Likewise, let denote the targeted estimator of that distribution based on individuals in cluster . Then we can write the Stage 1 cluster-specific estimand as and the Stage 1 cluster-specific plug-in estimator as .

The Stage 2 cluster-level effect estimator is, therefore, a function of , . Consider, for example, the treatment-specific sample mean as our Stage 2 target parameter. The cluster-level TMLE of in Stage 2 would be

An unadjusted effect estimator in Stage 2 can again be considered a special case of the cluster-level TMLE where the adjustment set is empty: . Altogether, Two-Stage TMLE will be asymptotically linear under the following conditions

where represents the cluster-level influence curve and is remainder term, going to zero in probability:

  1. Stage 2 estimators of the cluster-level outcome regression and the cluster-level propensity score meet the usual regularity conditions, which are quite weak in a randomized trial (e.g., [53, 54]).

  2. Deviations between the estimated cluster-level outcomes and the true cluster-level outcomes,
    , provide a negligible contribution to the remainder term .

The conditions on Stage 2 estimation are satisfied when estimating the known, cluster-level propensity score with a “working” logistic regression and when estimating the cluster-level outcome regression with another “working” parametric regression (e.g., [53, 54]). However, to the best of our knowledge, all previously existing Two-Stage estimators (e.g., a t-test on the cluster-level means) have simply ignored the contribution from estimating the cluster-level outcome to . Suppose, for example, our Stage 1 estimator is the average outcome within each cluster: . (Such an estimator would only be appropriate when the individual-level outcomes are completely measured or are missing-completely-at-random.) Since the individual-level outcomes are not i.i.d. within each cluster, we need the following to hold for this estimator’s contribution to to be essentially zero: (1) the within cluster dependence is weak enough that the Central Limit Theorem applies in , and (2) smallest cluster is much larger than the total number of clusters (i.e., ).

When the Stage 1 estimator is a TMLE of the Stage 1 estimand , the relevant component of the remainder term can be written as

(5)

where and are the cluster -specific efficient influence curve and remainder terms, respectively. As before, we need that the within cluster dependence is weak enough such that and that the ratio of total number of clusters to the cluster-size goes to zero (i.e., ). We note that when the cluster-size is substantially larger than , we can weaken this independence assumption to allow for a slower rate of convergence. Additionally, we need that estimators of the individual-level outcome regression and the individual-level missingness mechanism converge to their targets at fast enough rates such that [21]. Implementing Super Learner with highly adaptive LASSO (HAL) [64] or internal sample-splitting can help ensure these conditions hold in practice [65, 66].

This approach to statistical inference also applies in CRTs where the treatment is randomized within matched pairs of clusters. Briefly, let and denote the observed data for the first and second cluster within matched pair , respectively. To obtain statistical inference for sample effect in a pair-matched setting, we replace with the following paired version: [61]. Our variance estimator is then given by the sample variance of the paired influence curve divided by the number of pairs (), and we use the Student’s -distribution with degrees of freedom [1]. This could naturally be extended to matched triplets in a three-armed trial.

8.4 Additional Simulation Study with Baseline (only) Causes of Missingness

Here, we consider a simplified scenario where only baseline (but not post-baseline) covariates impact the measurement of individual-level outcomes. As before, we focus on a setting with clusters and where within each cluster, the number of individual participants is sampled with equal probability from {100, 150, 200}.

For each cluster , we independently generate the cluster-specific data as follows. First, one latent variable is drawn uniformly from (1.75, 2.25) and two additional variables are drawn independently from a standard normal distribution. Then, two individual-level covariates are generated by drawing from a normal distribution with means depending on the cluster-level latent factors: and . We set the observed cluster-level covariates as the empirical mean of their individual-level counterparts. The intervention is randomly allocated within pairs of clusters matched on ; therefore, clusters receive the intervention and the control.

The underlying, individual-level outcome is generated as an indicator that , drawn from a Uniform(0,1), is less than . Finally, we incorporate individual-level missingness by generating as an indicator that , drawn from a Uniform(0,1), is less than . Thus, participants in the intervention arm , and especially those with higher values of , are more likely to have the outcome and also be missing. The observed outcomes are set to be missing for individuals with .

We also generate the counterfactual, individual-level outcomes and by setting the cluster-level treatment to and , respectively, and preventing missingness (i.e., setting ). As before, the cluster-level counterfactual outcome is the empirical mean within each cluster . The true values of the treatment-specific population means for , the risk difference as and the risk ratio as are calculated for a population of 5000 clusters.

We compare the same estimators as the main simulation study. The implementation is identical, except the post-baseline variable does not exist and is, thus, excluded from the analysis.

8.4.1 Results from the Second Simulation Study

Table 4 illustrates estimator performance in this simplified setting. The true values of the treatment-specific population means and are 47.4% and 39.6%, respectively. The corresponding risk difference and risk ratio are 7.7% and 1.20, respectively.

Focusing first on estimating the risk difference, we see that -test, which fails to adjust for any covariates, is highly biased, as expected given the differential measurement process. On average, it grossly underestimates the intervention effect by 12% and attains a confidence interval coverage of 25%, much lower than the nominal rate of 95%. By adjusting for covariates that influence measurement and underlying outcomes, CARE is less biased, but still underestimates the intervention effect by 2.8% when breaking the matches and by 5.1% when preserving the matches. The corresponding confidence interval coverages for CARE are less than the nominal rate: 91.8% and 41.8%, respectively. In contrast, the bias of Two-Stage TMLE for the risk difference is negligible, and as predicted by theory [61], the confidence interval coverage is conservative (99%). Also as predicted by theory [40], higher power is achieved when preserving, as compared to breaking, the matches: 56.6% versus 49.0%, respectively.

Now focusing on estimating the risk ratio, we see that both mixed models and GEE overestimate the intervention effect (1.4-times higher on average). This bias is substantial enough to prevent accurate inference. The confidence interval coverage for mixed models is 38.6% and 17.2%-35.6% for GEE. Lower coverage for GEE is likely due to underestimation of the standard errors (). While both mixed models and GEE are adjusting for the appropriate variables, both are relying on a misspecified regression to control for missing data and to estimate the intervention effect. Theoretically, DR-GEE should remove this bias from GEE by incorporating estimates of the missingness mechanism. Indeed, DR-GEE exhibits lower bias, but still does not obtain valid inference (confidence interval coverage of 53.6%). This again highlights the need for flexible (i.e., data-adaptive) estimators of the individual-level outcome regression and missingness mechanism. In contrast, Two-Stage TMLE for the risk ratio is essentially unbiased. As predicted by theory [40, 61], the confidence interval coverage is conservative (99%) and more power is achieved when preserving (62.2%) versus breaking the matched (53.2%). Importantly, in simulations under the null, Two-Stage TMLE maintains nominal to conservative Type-I error control in all settings (results not shown).

BREAKING THE MATCHES KEEPING THE MATCHES
bias CI power bias CI power
FOR THE RISK DIFFERENCE (true value RD=7.7%)
t-test -4.7 -12.3 0.041 0.043 22.2 16.6 -4.7 -12.3 0.041 0.039 18.2 20.0
CARE 4.9 -2.8 0.019 0.028 91.8 29.6 2.6 -5.1 0.012 0.023 41.8 0.6
TMLE 7.3 -0.4 0.023 0.037 99.6 49.0 7.3 -0.4 0.024 0.032 99.0 56.6
FOR THE RISK RATIO (true value RR=1.20)
Mixed 1.7 1.4 0.146 0.144 38.6 91.4 1.7 1.4 0.137 0.145 38.6 92.6
GEE 1.6 1.4 0.147 0.137 35.6 89.6 1.7 1.4 0.168 0.097 17.2 96.4
DR-GEE 1.4 1.2 0.102 0.085 53.6 94.2
TMLE 1.2 1.0 0.051 0.083 99.4 53.2 1.2 1.0 0.052 0.073 99.0 62.2
: average point estimate (in % for the RD)
bias: average deviation in the point estimates vs. true effect (additive scale for RD (in %) & relative scale for RR)
: standard deviation of the point estimates (on log-scale for RR)
: average standard error estimate (on log-scale for RR)
CI: proportion of 95% confidence intervals containing the true effect (in %)
power: proportion of trials correctly rejecting the false null hypothesis (in %)
Table 4: Over 500 simulated trials, the performance of CRT estimators when missingness is only impacted by baseline variables (i.e., the supplemental simulation study) . Results are shown when the target of inference is the risk difference (top 3 rows), when the target is the risk ratio (bottom 4 rows), when breaking the matches during analysis (left), and when preserving the matches during analysis (right).

8.5 Main Simulation Study Under the Null with and clusters

To assess Type I error control, we repeated the main simulation study when there was no treatment effect (RD=0; RR=1). Under the null, the individual-level outcome is not impacted by either the cluster-level treatment or the individual-level post-baseline covariates . In other words, the post-baseline covariates are not mediators in this setting. Therefore, the baseline covariates are sufficient to control for incomplete measurement: .

The results are given in Table 5. As expected in this setting, all adjusted estimators exhibit much smaller bias. However, for CARE under pair-matching and the GEEs, we see lower than nominal confidence interval coverage (70.4%-93.4%). This is partially explained by underestimation of the standard errors (i.e., ) in the setting of clusters. However, when increasing the number of clusters to , these estimators still exhibit poor coverage and high Type-I error (7.8%-29.6%). In contrast, good confidence interval coverage and Type-I error control are exhibited by mixed models and Two-Stage TMLE in all settings.

BREAKING THE MATCHES KEEPING THE MATCHES
bias CI bias CI
CLUSTERS
FOR THE RISK DIFFERENCE (true value RD=0%)
t-test -3.5 -3.5 0.045 0.047 89.4 10.6 -3.5 -3.5 0.045 0.044 87.0 13.0
CARE 1.5 1.5 0.025 0.028 96.6 3.4 1.2 1.2 0.023 0.022 93.4 6.6
TMLE -0.3 -0.3 0.036 0.043 97.0 3.0 -0.3 -0.3 0.036 0.039 97.2 2.8
FOR THE RISK RATIO (true value RR=1.0)
Mixed 1.0 1.0 0.041 0.064 99.6 0.4 1.0 1.0 0.041 0.065 99.6 0.4
GEE 1.0 1.0 0.041 0.040 90.8 9.2 1.0 1.0 0.044 0.029 73.6 26.4
DR-GEE 1.0 1.0 0.040 0.036 91.6 8.4
TMLE 1.0 1.0 0.051 0.060 96.8 3.2 1.0 1.0 0.051 0.055 97.0 3.0
CLUSTERS
FOR THE RISK DIFFERENCE (true value RD=0%)
t-test -3.9 -3.9 0.034 0.037 84.8 15.2 -3.9 -3.9 0.034 0.035 81.6 18.4
CARE 1.6 1.6 0.019 0.022 93.8 6.2 1.3 1.3 0.018 0.017 90.0 10.0
TMLE -0.6 -0.6 0.026 0.034 98.4 1.6 -0.6 -0.6 0.026 0.031 97.6 2.4
FOR THE RISK RATIO (true value RR=1.0)
Mixed 1.0 1.0 0.029 0.049 99.8 0.2 1.0 1.0 0.029 0.049 99.8 0.2
GEE 1.0 1.0 0.029 0.032 90.2 9.8 1.0 1.0 0.032 0.022 70.4 29.6
DR-GEE 1.0 1.0 0.028 0.028 92.2 7.8
TMLE 1.0 1.0 0.037 0.048 98.4 1.6 1.0 1.0 0.037 0.044 97.4 2.6
: average point estimate (in % for the RD)
bias: average deviation between & true effect (additive scale for RD (in %) & relative scale for RR)
: standard deviation of the point estimates (on log-scale for RR)
: average standard error estimate (on log-scale for RR)
CI: proportion of 95% confidence intervals containing the true effect (in %)
: proportion of trials incorrectly rejecting the true null hypothesis (in %)
Table 5: Over 500 simulated trials, the performance of CRT estimators when missingness depends on baseline and post-baseline variables (i.e., the main simulation study) and there is no intervention effect (i.e., under the null). Results are shown with clusters (top sub-table) and clusters (bottom sub-table), when the target of inference is the risk difference (top 3 rows of each sub-table), when the target is the risk ratio (bottom 4 rows of each sub-table), when breaking the matches during analysis (left), and when preserving the matches during analysis (right).

8.6 Additional Results from the SEARCH Study

The full statistical analysis plan for the SEARCH Study is available at [67]. In Table 6, we provide a comparison of results when using an unadjusted estimator in Stage 1 and Stage 2 as compared to Two-Stage TMLE when estimating population-level HIV viral suppression (the proportion of all persons with HIV who are suppressing viral replication 500 copiess/mL) in each arm and the intervention effect [49].

Breaking matches Keeping matches
Estimator Intervention (95% CI) Control (95% CI) Effect (95% CI) Effect (95% CI)
Unadjusted 85.2% (83.5%, 86.8%) 75.8% (73.5%, 78.2%) 1.12 (1.08, 1.16) 1.12 (1.09, 1.16)
TMLE 79% (77.1%, 80.8%) 67.8% (66.2%, 69.5%) 1.16 (1.13, 1.2) 1.15 (1.11, 1.2)
Table 6: Summary of arm-specific and effect measures for population-level HIV viral suppression in the SEARCH Study. Point estimates and 95% confidence intervals are provided when assuming MCAR in Stage 1 and using an unadjusted effect estimator in Stage 2 (“Unadjusted”) versus when using Two-Stage TMLE to control for missing individual-level outcomes and improve efficiency when estimating the intervention effect (“TMLE”), both when breaking the matches used for randomization and keeping the matches.

8.7 Computing code

All simulations were conducted in R (v4.0.3) using the nbpMatching, lme4, geepack, CRTgeeDR ltmle, and SuperLearner packages [68, 69, 57, 58, 59, 70, 71]. Computing code to reproduce the simulation study is available at https://github.com/LauraBalzer/TwoStageTMLE. Computing code used to analyze the SEARCH Study data is available at https://github.com/LauraBalzer/SEARCH_Analysis_Adults.

References

  • Hayes and Moulton [2009] R.J. Hayes and L.H. Moulton. Cluster Randomised Trials. Chapman & Hall/CRC, Boca Raton, 2009.
  • Crespi [2016] C.M. Crespi. Improved designs for cluster randomized trials. Annu Rev Public Health, 37(1):1–16, 2016.
  • Turner et al. [2017a] E.L. Turner, F. Li, J.A. Gallis, M. Prague, and D.M. Murray. Review of recent methodological developments in group-randomized trials: Part 1-design. Am J Public Health, 107(6):907–915, 2017a.
  • Turner et al. [2017b] E.L. Turner, M. Prague, J.A. Gallis, F. Li, and D.M. Murray. Review of recent methodological developments in group-randomized trials: Part 2-analysis. Am J Public Health, 107(7):1078–1086, 2017b.
  • Murray et al. [2020] D.M. Murray, M. Taljaard, E.L. Turner, and S.M. George. Essential ingredients and innovations in the design and analysis of group-randomized trials. Annu Rev Public Health, 41:1–19, 2020.
  • Halloran and Struchiner [1991] M.E. Halloran and C.J. Struchiner. Study designs for dependent happenings. Epidemiology, 2:331–338, 1991.
  • Halloran and Struchiner [1995] M.E. Halloran and C.J. Struchiner. Causal inference in infectious diseases. Epidemiology, 6(2):142–151, 1995. PMID: 7742400.
  • Hudgens and Halloran [2008] M.G. Hudgens and M.E. Halloran. Toward Causal Inference With Interference. J Am Stat Assoc, 103(482):832–842, 2008. PMCID: PMC2600548.
  • Platt et al. [2010] R. Platt, S.U. Takvorian, E. Septimus, et al. Cluster randomized trials in comparative effectiveness research: randomizing hospitals to test methods for prevention of healthcare-associated infections. Med Care, 48(6Suppl):S52–57, 2010.
  • Murray et al. [2018] D.M. Murray, S.L. Pals, S.M. George, A. Kuzmichev, G.Y. Lai, J.A. Lee, R.L. Myles, and S.M. Nelson. Design and analysis of group-randomized trials in cancer: A review of current practices. Prev Med, 111:241–247, 2018.
  • Fiero et al. [2016] M.H. Fiero, S. Huang, E. Oren, and M.L. Bell. Statistical analysis and handling of missing data in cluster randomized trials: a systematic review. Trials, 17, 2016.
  • Rubin [1976] D.B. Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
  • Robins et al. [1995] J.M. Robins, A. Rotnitzky, and L.P. Zhao. Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. J Am Stat Assoc, 90:106–121, 1995.
  • Gail et al. [1996] M.H. Gail, S.D. Mark, R.J. Carroll, S.B. Green, and D. Pee. On design considerations and randomization-based inference for community intervention trials. Stat Med, 15:1069–1092, 1996.
  • Little and Rubin [2002] R.J.A. Little and Donald B. Rubin. Statistical analysis with missing data. Wiley, Hoboken, 2nd edition, 2002.
  • Robins [1986] J.M. Robins. A new approach to causal inference in mortality studies with sustained exposure periods–application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512, 1986. doi: 10.1016/0270-0255(86)90088-6.
  • Robins et al. [2000] J.M. Robins, M.A. Hernán, and B. Brumback. Marginal structural models and causal inference in epidemiology. Epidemiology, 11(5):550–560, 2000.
  • Robins and Hernán [2009] J.M. Robins and M.A. Hernán. Estimation of the causal effects of time-varying exposures. In G. Fitzmaurice, M. Davidian, G. Verbeke, and G. Molenberghs, editors, Longitudinal Data Analysis, chapter 23. Chapman & Hall/CRC, Boca Raton, FL, 2009.
  • Selvaraj and Prasad [2013] S. Selvaraj and V. Prasad. Characteristics of cluster randomized trials: Are they living up to the randomized trial? JAMA Internal Medicine, 173(23):313, 2013. doi: 10.1001/jamainternmed.2013.1638.
  • Fisher [1932] R.A. Fisher. Statistical methods for research workers. Oliver and Boyd Ltd., Edinburgh, 4th edition, 1932.
  • van der Laan and Rose [2011] M. van der Laan and S. Rose. Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York Dordrecht Heidelberg London, 2011.
  • Colantuoni and Rosenblum [2015] E. Colantuoni and M. Rosenblum. Leveraging prognostic baseline variables to gain precision in randomized trials. Stat Med, 34(2602-2617), 2015.
  • Stephens et al. [2013] A.J. Stephens, E.J. Tchetgen Tchetgen, and V. De Gruttola. Flexible covariate-adjusted exact tests of randomized treatment effects with application to a trial of HIV education. Ann Appl Stat, 7(4):2106–2137, 2013.
  • Balzer et al. [2016a] L. Balzer, M. van der Laan, M. Petersen, and the SEARCH Collaboration. Adaptive pre-specification in randomized trials with and without pair-matching. Statistics in Medicine, 35(10):4528–4545, 2016a. doi: 10.1002/sim.7023.
  • Kahan et al. [2016] B.C. Kahan, G. Forbes, Y. Ali, V. Jairath, et al. Increased risk of type I errors in cluster randomised trials with small or medium numbers of clusters: a review, reanalysis, and simulation study. Trials, 17:438, 2016.
  • Laird and Ware [1982] N.M. Laird and J.H. Ware. Random-effects models for longitudinal data. Biometrics, 38(4):963–974, 1982. PMID: 7168798.
  • Liang and Zeger [1986] K.Y. Liang and S.L. Zeger. Longitudinal Data Analysis Using Generalized Linear Models. Biometrika, 73(1):13–22, 1986. URL http://www.jstor.org/stable/2336267.
  • Stephens et al. [2012] A.J. Stephens, E.J. Tchetgen Tchetgen, and V. DeGruttola. Augmented generalized estimating equations for improving efficiency and validity of estimation in cluster randomized trials by leveraging cluster-level and individual-level covariates. Statistics in Medicine, 31:915–930, 2012.
  • Stephens et al. [2014] A. Stephens, E. Tchetgen Tchetgen, and V. DeGruttola. Locally efficient estimation of marginal treatment effects when outcomes are correlated: Is the prize worth the chase? The International Journal of Biostatistics, 10(1):59–75, 2014. doi: 10.1515/ijb-2013-0031.
  • Balzer et al. [2019] L.B. Balzer, W. Zheng, M.J. van der Laan, M.L. Petersen, and the SEARCH Collaboration. A new approach to hierarchical data analysis: Targeted maximum likelihood estimation for the causal effect of a cluster-level exposure. Stat Meth Med Res, 28(6):1761–1780, 2019.
  • Hubbard et al. [2010] A.E. Hubbard, J. Ahern, N.L. Fleischer, M. van der Laan, S.A. Lippman, N. Jewell, T. Bruckner, and W.A. Satariano. To GEE or not to GEE comparing population average and mixed models for estimating the associations between neighborhood risk factors and health. Epidemiology, 21(4):467–474, 2010.
  • DeSouza et al. [2009] C.M. DeSouza, A.T.R. Legedza, and A.J. Sankoh. An overview of practical approaches for handling missing data in clinical trials. J Biopharm Stat, 19(6):1055–1073, 2009.
  • Hossain et al. [2017a] A. Hossain, K. Diaz-Ordaz, and J.W. Bartlett. Missing binary outcomes under covariate-dependent missingness in cluster randomised trials. Stat Meth Med Res, 36(19):3092–3109, 2017a.
  • Hossain et al. [2017b] A. Hossain, K. Diaz-Ordaz, and J.W. Bartlett. Missing continuous outcomes under covariate dependent missingness in cluster randomised trials. Stat Meth Med Res, 26(3):1543–1562, 2017b.
  • Prague et al. [2016] M. Prague, R. Wang, A. Stephens, E. Tchetgen Tchetgen, and V. De Gruttola. Accounting for interactions and complex inter-subject dependency in estimating treatment effect in cluster-randomized trials with missing outcomes. Biometrics, 72(4):1066–1077, 2016. doi: 10.1111/biom.12519.
  • Havlir et al. [2019] D.V. Havlir, L.B. Balzer, E. Charlebois, T.D. Clark, D. Kwarisiima, J. Ayieko, J Kabami, N. Sang, et al. HIV testing and treatment with the use of a community health approach in rural Africa. New England Journal of Medicine, 381:219–229, 2019.
  • Chamie et al. [2012] G. Chamie, D. Kwarisiima, T.D. Clark, J. Kabami, V. Jain, E. Geng, M.L. Petersen, H. Thirumurthy, M.R. Kamya, D.V. Havlir, E.D. Charlebois, and the SEARCH Collaboration. Leveraging rapid community-based HIV testing campaigns for non-communicable diseases in rural Uganda. PLoS One, 7(8):e43400, 2012. doi: 10.1371/journal.pone.0043400.
  • Chamie et al. [2016] G. Chamie, T.D. Clark, J. Kabami, K. Kadede, E. Ssemmondo, et al. A hybrid mobile HIV testing approach for population-wide HIV testing in rural East Africa. Lancet HIV, 3(3):e111–119, 2016.
  • Kwarisiima et al. [2017] D. Kwarisiima, M. Kamya, A. Owaraganise, et al. High rates of viral suppression in adults and children with high CD4+ counts using a streamlined ART delivery model in the SEARCH trial in rural Uganda and Kenya. J Int AIDS Soc, Jul 21(20(Suppl 4)):21673, 2017.
  • Balzer et al. [2015] L.B. Balzer, M.L. Petersen, M.J. van der Laan, and the SEARCH Consortium. Adaptive pair-matching in randomized trials with unbiased and efficient effect estimation. Statistics in Medicine, 34(6):999–1011, 2015. doi: 10.1002/sim.6380.
  • Petersen et al. [2012] M.L. Petersen, K.E. Porter, S. Gruber, Y. Wang, and M.J. van der Laan. Diagnosing and responding to violations in the positivity assumption. Statistical Methods in Medical Research, 21(1):31–54, 2012. doi: 10.1177/0962280210386207.
  • Horvitz and Thompson [1952] D.G. Horvitz and D.J. Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47:663–685, 1952. doi: 10.2307/2280784.
  • Gruber and van der Laan [2012] S. Gruber and M.J. van der Laan. Targeted minimum loss based estimator that outperforms a given estimator. Int J Biostat, 8(1):Article 11, 2012. PMID: 22628356.
  • Schuler and Rose [2017] M.S. Schuler and S. Rose. Targeted maximum likelihood estimation for causal inference in observational studies. American Journal of Epidemiology, 185(1):65–73, 2017.
  • Blakely et al. [2019] T. Blakely, J. Lynch, K. Simons, R. Bentley, and S. Rose. Reflection on modern methods: when worlds collide - prediction, machine learning and causal inference. International Journal of Epidemiology, dyz132:1–7, 2019.
  • van der Laan et al. [2007] M.J. van der Laan, E.C. Polley, and A.E. Hubbard. Super learner. Statistical Applications in Genetics and Molecular Biology, 6(1):25, 2007. doi: 10.2202/1544-6115.1309.
  • Petersen et al. [2014] M.L. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M.J. van der Laan. Targeted maximum likelihood estimation for dynamic and static longitudinal marginal structural working models. Journal of Causal Inference, 2(2), 2014. doi: 10.1515/jci-2013-0007.
  • Benkeser et al. [2019] D. Benkeser, P.B. Gilbert, and M. Carone. Estimating and testing vaccine sieve effects using machine learning. J Am Stat Assoc, 114(527):1038–1049, 2019.
  • Balzer et al. [2020] L.B. Balzer, J. Ayieko, D. Kwarisiima, G. Chamie, E.D. Charlebois, J. Schwab, M.J. van der Laan, M.R. Kamya, D.V. Havlir, and M.L. Petersen. Far from MCAR: obtaining population-level estimates of HIV viral suppression. Epidemiology, 31(5):620–627, 2020.
  • Rubin [1990] D.B. Rubin. Comment: Neyman (1923) and causal inference in experiments and observational studies. Statistical Science, 5(4):472–480, 1990.
  • Benitez [2020] A. Benitez. Targeted machine learning approaches for leveraging data in resource-constrained settings. PhD thesis, University of California, Berkeley, 2020.
  • Seaman et al. [2014] S.R. Seaman, M. Pavlou, and A.J. Copas. Review of methods for handling confounding by cluster and informative cluster size in clustered data. Statistics in Medicine, 33:5371–5387, 2014.
  • Moore and van der Laan [2009] K.L. Moore and M.J. van der Laan. Covariate adjustment in randomized trials with binary outcomes: Targeted maximum likelihood estimation. Stat Med, 28(1):39–64, 2009. doi: 10.1002/sim.3445.
  • Rosenblum and van der Laan [2010] M. Rosenblum and M.J. van der Laan. Simple, efficient estimators of treatment effects in randomized trials using generalized linear models to leverage baseline variables. The International Journal of Biostatistics, 6(1):Article 13, 2010. doi: 10.2202/1557-4679.1138.
  • van der Vaart [1998] A.W. van der Vaart. Asymptotic Statistics. Cambridge University Press, New York, 1998.
  • Cornfield [1978] J. Cornfield. Randomization by group: a formal analysis. Am J Epidemiol, 108(2):100–102, 1978.
  • Bates et al. [2015] D. Bates, M. Mächler, B. Bolker, and S. Walker. Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1):1–48, 2015.
  • Hojsgaard et al. [2006] S. Hojsgaard, U. Halekoh, and J. Yan. The R package geepack for generalized estimating equations. J Stat Softw, 15(2):1–11, 2006.
  • Prague et al. [2017] M. Prague, R. Wang, , and V. De Gruttola. CRTgeeDR: an R package for doubly robust generalized estimating equations estimations in cluster randomized trials with missing data. The R Journal, 9(2):105–115, 2017.
  • Gruber and van der Laan [2010] S. Gruber and M.J. van der Laan. A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. The International Journal of Biostatistics, 6(1):Article 26, 2010. doi: 10.2202/1557-4679.1260.
  • Balzer et al. [2016b] L.B. Balzer, M.L. Petersen, and M.J. van der Laan. Targeted estimation and inference of the sample average treatment effect in trials with and without pair-matching. Statistics in Medicine, 35(21):3717–3732, 2016b. doi: 10.1002/sim.6965.
  • Bickel et al. [1993] P.J. Bickel, C.A.J. Klaassen, Y. Ritov, and J.A. Wellner. Efficient and Adaptive Estimation for Semiparametric Models. Johns Hopkins University Press, Baltimore, 1993.
  • van der Vaart and Wellner [1996] A.W. van der Vaart and J.A. Wellner. Weak convergence and empirical processes. Springer, Berlin Heidelberg New York, 1996.
  • Benkeser and van der Laan [2016] D. Benkeser and M. van der Laan. The highly adaptive lasso estimator. Proc Int Conf Dat Sci Adv Anal, pages 689–696, 2016.
  • van der Laan [2011] W. Zhengand M. van der Laan. Cross-validated targeted minimum-loss-based estimation. In M.J. van der Laan and S. Rose, editors, Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York Dordrecht Heidelberg London, 2011.
  • Díaz [2019] I. Díaz. Machine learning in the estimation of causal effects: targeted minimum loss-based estimation and double/debiased machine learning. Biostatistics, kxz042, 2019.
  • Balzer et al. [2018] L.B. Balzer, D.V. Havlir, J. Schwab, M.J. van der Laan, M.L. Petersen, and the SEARCH Collaboration. Statistical analysis plan for SEARCH phase I: Health outcomes among adults. Technical report, arXiv: https://arxiv.org/abs/1808.03231, 2018.
  • R Core Team [2020] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2020. URL http://www.R-project.org.
  • Beck et al. [2016] C. Beck, B. Lu, and R. Greevy. nbpMatching: functions for optimal non-bipartite optimal matching, 2016. URL https://CRAN.R-project.org/package=nbpMatching. R package version 1.5.0.
  • Schwab et al. [2017] J. Schwab, S. Lendle, M. Petersen, and M van der Laan. ltmle: Longitudinal Targeted Maximum Likelihood Estimation, 2017. URL http://CRAN.R-project.org/package=ltmle.
  • Polley et al. [2018] E. Polley, E. LeDell, C. Kennedy, and M. van der Laan. SuperLearner: Super Learner Prediction, 2018. URL http://CRAN.R-project.org/package=SuperLearner. R package version 2.0-24.