References
1 Introduction
Cluster randomized trials (CRTs) provide an opportunity to assess the populationlevel effects of interventions, which are randomized to groups of individuals, such as communities, clinics, or schools. These groups are commonly called “clusters”. The choice to randomize clusters, instead of individuals, is often driven by the type of intervention as well as practical concerns [moultonhayes]. For example, interventions to improve medical practices are often randomized at the hospital or cliniclevel to reduce logistical burden and to minimize potential contamination between arms if individual patients were instead randomized. The design and conduct of CRTs has improved considerably [turner17a, turner17b, crepsi, Murray2020], and results from CRTs have been widely published in public health, education, policy, and economics literature [icc]. However, a recent review found that only 50% of CRTs were analyzed with appropriate methods [murray].
Due to the hierarchical nature of the data and the correlation of participant outcomes within clusters, the analysis of CRTs is fundamentally more complicated than for individually randomized trials [moultonhayes, murray]
. To start, there are many ways to define the causal parameter of interest in CRTs. For example, we can define the effect in terms of summaries of individuallevel outcomes or clusterlevel outcomes. Additionally, we may be interested in the effect for the sample included in the CRT or for a wider target population. Once the primary effect measure has been selected, statistical estimation and inference must respect that the cluster is the independent unit and all observations within the cluster are dependent to some degree. Ignoring this dependence can lead to power calculations based on the wrong sample size and underestimates of the standard error. Analyses ignoring the clustering may also inappropriately attribute the impact of a clusterlevel characteristic to the intervention and, together with variance underestimation, result in inflated TypeI errors. Many analytic approaches are available to account for clustering, including conducting a clusterlevel analysis or applying a correction factor
[moultonhayes, murray, turner17a, turner17b, crepsi].Once we have committed to an analytic approach that addresses clustering, the adjustment of baseline covariates is often considered for additional precision gains. Fortunately, many methods are available to estimate adjusted intervention effects in CRTs. Examples include wellestablished methods, such as generalized estimating equations (GEE) and covariate adjusted residuals estimator (CARE), as well as more recent developments, such as targeted maximum likelihood estimation (TMLE) and augmented GEE (AugGEE) [liangzeger, gailcare, moultonhayes, rose_tl, hierarchical, auggee]. While these methods differ in their exact implementation, each aims to improve statistical power of the CRT by controlling for individual or clusterlevel covariates when fitting the “outcome regression”: the conditional expectation of the outcome given the randomized intervention and the adjustment variables. Previous literature [crt_comparative_bellamy] has used simulation studies to compare the power and TypeI error of different CRT analysis approaches, including clusterlevel methods, GEE, random effects models, ‘designeffect’ models, and methods ignoring the clustering altogether. However, to the best of our knowledge, these comparisons have excluded the more recent approaches of AugGEE and TMLE.
This paper aims to clarify the strengths and weaknesses of common methods to analyze CRTs. To motivate this comparison, we consider the Preterm Birth Initiative (PTBi) study, a maternalinfant CRT which took place in 20 health facilities across Kenya and Uganda (ClinicalTrials.gov: NCT03112018). The trial assessed whether a facilitybased intervention, designed to improve uptake of evidencebased practices, was effective in reducing 28day mortality among preterm infants. Following PBTi, we focus on a setting where individual participants are grouped into clusters (e.g., patients in a health facility). However, our discussion and results are equally applicable to other hierarchical data structures that may be observed in a CRT. Examples include households within a community or classrooms within a school. Also to match the real data example, we focus on a binary individuallevel outcome throughout. However, our work also applies to CRTs with count or continuous outcomes. Finally, we note that modifications to each of the above analysis approaches can be made to account for pairmatching clusters prior to randomization [moultonhayes]. However, “breaking the match” and analyzing a pairmatched trial, such as PTBi, as if were completely randomized is also a valid approach and the focus of our discussion.
The remaining paper is organized as follows. In Section 2, we use structural causal models to describe the data generating process for a CRT [pearl]. We also discuss several effects commonly targeted in CRTs. We highlight how the endpoint may be defined at the individuallevel or at the clusterlevel, often as an aggregate of individuallevel outcomes. We additionally address the distinction between effects defined for the study sample versus a target population. In Section 3, we describe each estimation method, including its statistical properties and implementation. In Section 4, we provide two simulation studies to evaluate the finite sample performance of common CRT estimators and to demonstrate the distinction between different causal effects. In Section 5, we apply these methods to estimate intervention effects for the PTBi study. We conclude in Section 6 with a brief discussion.
2 Causal Methods
We begin by formalizing the notation that will be used throughout. We assume that each cluster, such as a hospital or smaller health facility, is sampled from some target population of interest and is indexed by . Each cluster is comprised of a finite set of individuals (i.e., participants), which are indexed . The clustersize, denoted , could be fixed or could vary by cluster. In each cluster, denotes the baseline covariates for the participants. Additional baseline covariates related to the cluster are denoted ; these could be summaries of individuallevel characteristics or may have no individuallevel counterpart. (We refer the reader to [hierarchical] for a discussion of the inclusion of clusterlevel summaries of in ).
In a CRT, the intervention is randomized among clusters. In the case of a pairmatched trial, which may increase the power of the study, the randomization occurs within pairs of clusters matched on characteristics expected to be predictive of the primary endpoint [moultonhayes, imai2009, balzer_matching]. Throughout, we will use as an indicator that cluster was randomized to the intervention. The outcome of interest is , which for simplicity we assume is measured for all participants in cluster . Extensions to handling missing data are important, but beyond the scope of this paper [diaz_missing, fiero, Balzer2021twostage].
2.1 Hierarchical Structural Causal Models
We now use Pearl’s structural causal model [pearl] to represent the hierarchical data generating process of a CRT. These models formally represent the relationships between the intervention, the outcome, measured characteristics, and unmeasured variables. First, we briefly review the hierarchical model, proposed in [hierarchical], and then we consider a more restricted hierarchical model that more accurately represents the PTBi study.
As detailed in [hierarchical], the following model encodes independence between clusters, but makes no restrictions on the dependence of observations within a cluster.
(1) 
Here, ) denote the set of functions that determine the values of the observed data variables: the clusterlevel covariates , the individuallevel covariate matrix , the intervention assignment
, and the vector of individuallevel outcomes
. These functions are assumed to be common in , and each accounts for unmeasured factors: . By design in a CRT, the unmeasured factors determining the intervention assignment are independent of other unmeasured factors. This model assumes clusters are independent, but encodes the many sources of dependence within a cluster. For example, the joint error term induces correlation among participants’ outcomes within a cluster. Furthermore, within cluster , an individual’s outcome may depend on the covariates of others in the same cluster but not on the covariates of individuals belonging to other clusters for .We now consider the data generating process for the PTBi study. We assume each health facility, representing the cluster, is sampled from a target population of interest. For cluster , we measure facilitylevel baseline characteristics , including the average monthly delivery volume, facility preparedness assessment score, staff to delivery ratio, and communitytype (i.e., urban versus rural). The facility is then randomly assigned to intervention or control . When a mother delivers her baby, the covariates for the motherbaby dyad are collected. These include the mother’s characteristics, such as age, last menstrual period, and health insurance type, and the baby’s characteristics, such as sex, weight, length, and arm circumference. Finally, the baby’s vital status is recorded; is an indicator of infant death within 28 days. Over the course of study followup, we observe many such deliveries, but for the primary outcome of interest, we restrict to preterm births, defined as born before 37 weeks of gestation. This process is repeated for sample size of facilities.
In the PTBi study, it is reasonable to assume that one motherbaby’s covariates are independent from another’s, and that one baby’s outcome does not impact another’s. In other words, the causal dependence is more restricted than in Model 1. Therefore, we propose the following model to represent the hierarchical data generating process for facility in the PTBi study:
(2) 
Now we assume the functions are common across . We also assume each individual’s outcome is drawn from a common distribution, and this outcome does not depend on the measured covariates of all other individuals in that cluster . This model may be more appropriate when outcomes are not socially or biologically transmitted, as in the realdata example for PTBi.
2.2 Counterfactuals and Target Causal Effects
For either causal model, we generate counterfactual outcomes by replacing the structural equation with our desired intervention. Let be the counterfactual outcome for individual in cluster if, possibly contrarytofact, their cluster received . In the PTBi study, represents the 28day vital status for infant in facility if that facility had been randomized to the intervention arm , while represents the 28day vital status for infant in facility if that facility had been randomized to the control arm .
We can also define counterfactual outcomes at the clusterlevel by taking aggregates of the individuallevel ones. While many summary measures are possible, we focus on weighted sums of the participants from cluster :
(3) 
Commonly, this weight is selected to be the inverse of the clustersize and thus constant across participants in a cluster: for . With this choice, is the average counterfactual outcome for the participants in cluster . For a binary individuallevel outcome, using the inverse clustersize yields a clusterlevel outcome
corresponding to a proportion or probability. In PTBi, for example,
is the counterfactual cumulative incidence of death by 28days if, possibly contrarytofact, facility received treatmentlevel . Of course, other weighting schemes can be used to summarize the individuallevel counterfactual outcomes to the clusterlevel.Then by applying a summary measure to the distribution of counterfactual outcomes, we define the causal effect corresponding to our research query. A wide variety of causal parameters can be expressed a the empirical mean over the sample [neyman1923, rubin1990, imbens_2004, imai_2008, balzer_sate]:
(4) 
where again is the userspecified weight. Throughout, superscript denotes a summary of clusterlevel outcomes, and superscript denotes a sample parameter. As the number of clusters grows (, the sample parameter converges to the expectation over the population of clusters:
(5) 
In words, is the expected clusterlevel counterfactual outcome if all clusters in the population had been assigned to treatmentlevel , whereas (in Eq. 4) is the average clusterlevel counterfactual outcome for the clusters in the CRT. For the PTBi study and weights , represents the expected incidence of 28day mortality among preterm infants if all health facilities in the target population had been assigned to treatment arm , while represents the average incidence if the health facilities included in the study had been assigned to treatment arm .
By taking contrasts of these treatmentspecific causal parameters, we can define causal effects on any scale of interest. For example, we may be interested in the relative effect for the clusters included in the study: . Alternatively, we may be interested in their difference at the populationlevel (a.k.a., the average treatment effect): . For simplicity we refer to contrasts defined using Eq. 4 or 5 as “clusterlevel effects”.
Letting be the total number of participants in the CRT, we can also use Eq. 4 to define causal effects at the individuallevel by setting :
(6) 
As sample size grows, this converges to the expectation over the target population of clusters, each containing a finite number of participants:
(7) 
In words, is the expected individuallevel outcome if all clusters in the target population received treatmentlevel , whereas (in Eq. 6) is the average individuallevel outcome for the participants in the CRT. In the PTBi study, represents the counterfactual risk of mortality for a preterm infant if all health facilities in the target population received treatmentlevel , while represents the counterfactual proportion of preterm infants who would die if all health facilities in the CRT received treatmentlevel . As before, we can take the difference or ratio of these treatmentspecific parameters to define causal effects. For simplicity we refer to contrasts defined using Eqs. 6 and 7 as “individuallevel effects”.
In summary, we can define a wide variety of causal parameters by considering alternative summary measures of the individuallevel or clusterlevel counterfactual outcomes. (Further generalizations are available in Appendix A.) When the clustersize varies (i.e., ), causal parameters giving equal weight to clusters (e.g., Eq. 4) will generally differ from causal parameters giving equal weight to participants (e.g., Eq. 6). Clusterlevel and individuallevel effects can diverge drastically when the clustersize modifies the intervention effect [Seaman2014, nevail_ics2014]. Even if the clustersize is constant (i.e., ), the clusterlevel and individuallevel effects have subtly different interpretations, as discussed previously. For completeness, we defined both samplespecific and populationlevel measures. In many scenarios, however, the selected study clusters correspond to the target population of interest. Additionally, estimation procedures are identical for the population and sample effects [balzer_sate]. Therefore, for ease of presentation, we focus on the population parameters (Eqs. 5 and Eqs. 7) for the remainder of the paper.
2.3 Observed Data and Identifiability
For a given cluster, the observed data are the set of measured clusterlevel covariates, the matrix of individuallevel covariates, the randomized treatment, and the vector of individuallevel outcomes:
As before, we can summarize the outcome vector in a variety of ways. We again consider weighted sums of the individuallevel outcomes within a cluster:
(8) 
where matches the definition of the clusterlevel counterfactual outcome in Eq. 3. Throughout, we focus on the empirical mean outcome within each cluster: for . However, as previously discussed, we can consider alternative weights depending on our research question.
We assume the observed data were generated by sampling times from a data generating process compatible with either of the causal models described above. This provides a link between the causal model and the statistical model, which is the set of possible distributions of the observed data [rose_tl]. Both causal models (Eqs. 1 and 2) encode that the clusterlevel treatment is randomized and thus imply a semiparametric statistical model . We denote the true underlying distribution that generated the observed data as . Thus, we have independent, identically distributed (i.i.d.) copies of .
To identify the expected counterfactual clusterlevel outcome or the expected counterfactual individuallevel outcome as a function of the observed data distribution, we require the following two assumptions, which are satisfied by design in a CRT. First, there must be no unmeasured confounding, such that . Second, there must be a positive probability of receiving each treatmentlevel: . Given these conditions are met in a CRT, we can express the clusterlevel causal parameter as the expected clusterlevel outcome under treatmentlevel of interest , where subscript 0 is used to denote the observed data distribution ; a proof is provided in [hierarchical]. Likewise, the individuallevel causal parameter equals the expected individuallevel outcome under treatmentlevel level of interest: .
We can gain efficiency in CRTs by adjusting for baseline covariates (e.g., [rubin_vdl_max_eff_2008, adaptive_prespec, murray, moultonhayes]). Specifically, the clusterlevel estimand can be written as the conditional expectation of the clusterlevel outcome, given the treatmentlevel of interest and the baseline covariates, averaged over the covariate distribution: [hierarchical]. In practice, we cannot adjust for the entire matrix of individuallevel covariates during estimation; however, we can still improve precision by considering summary measures of this matrix as clusterlevel covariates . Then, the clusterlevel statistical estimand can be written as
(9) 
Similarly, the individuallevel estimand can be written as the conditional expectation of the individuallevel outcome, given the treatmentlevel of interest and baseline covariates and then averaged over the covariate distribution:
(10) 
recognizing the slight abuse to notation because the observed data are the clusterlevel. Here, denotes the set of covariates for a given individual and denotes their outcome. We emphasize that covariate adjustment is being used for efficiency gains only and not to control for confounding.
As before, we can take contrasts of the clusterlevel or individuallevel estimands correspond to the causal effect of interest. To give context for the methods comparison, we will focus on relative scale for the remainder of the paper. Specifically, the relative effect is identified at the clusterlevel as
(11) 
and at the individuallevel as
(12) 
As detailed below, most analytic methods naturally only estimate one of the above statistical estimands, whereas few have the flexibility to estimate both.
3 Statistical Methods: Estimation
In this section, we compare the methods commonly used to analyze CRTs. We aim to describe their target of inference and their ability to adjust for baseline covariates to improve precision and thereby statistical power, while preserving TypeI error control. We broadly consider two classes of estimation methods: approaches based on clusterlevel data and approaches using both individual and clusterlevel data. Clusterlevel approaches immediately aggregate the data to the cluster; therefore, any covariate adjustment must be at the clusterlevel. Examples of these approaches include the ttest and TMLE, a general class of double robust, semiparametric efficient, plugin estimators [rose_tl]. Clusterlevel approaches naturally target clusterlevel effects (e.g., Eq. 11), but with the appropriate choice of weights can also target individual effects (Appendix A).
Individuallevel approaches allow for adjustment of individuallevel covariates, an appealing option, as these pair naturally with individuallevel outcomes. The individuallevel methods we will consider include GEE, CARE, and hierarchical TMLE [moultonhayes, liangzeger, hierarchical]. As detailed below, each of these individuallevel approaches commonly target different causal effects. To the best of our understanding, only TMLE can target effects defined at the clusterlevel (e.g., Eq. 11) or at the individuallevel (e.g., Eq. 12).
We now define the notation used throughout this section. Recall the cluster leveloutcome is defined as a weighted sum of individuallevel outcomes, as in Eq. 8. We denote the conditional expectation of the clusterlevel outcome given the clusterlevel intervention , clusterlevel covariates , and clusterlevel summaries as
(13) 
Likewise, we denote the conditional expectation of the individuallevel outcome given the clusterlevel intervention , clusterlevel covariates , and that individual’s covariates as
(14) 
Throughout, we refer to Eq. 13 and to Eq. 14 as the clusterlevel and individuallevel outcome regressions, respectively. The unadjusted expectations of the clusterlevel and individuallevel outcomes within intervention arm are defined as and , respectively. We denote the clusterlevel propensity score as
(15) 
and the individuallevel propensity score as
(16) 
We define unadjusted probabilities and analogously.
3.1 ClusterLevel Approaches
Clusterlevel approaches obtain point estimates and inference after the individuallevel data have been aggregated to the clusterlevel [moultonhayes, murray]. Most commonly, this aggregation is done by taking the empirical mean within each cluster. However, as previously detailed in Section 2.2, we can consider several ways to summarize the individuallevel data to the clusterlevel (i.e., weighting schemes).
3.1.1 Unadjusted
Once the data are aggregated to the clusterlevel, a common approach for estimation and inference is based on contrasts of the treatmentspecific averages:
(17) 
where denotes the unadjusted estimate of the clusterlevel propensity score (i.e., the proportion of clusters in the study receiving treatmentlevel ). For simplicity for the remainder of the manuscript, we assume the trial has equal allocation of arms, such that . Then, if we let denote the clusterlevel outcome for observation in treatmentarm , the treatmentspecific mean simplifies to . In a CRT,
provides an unbiased estimator of
. In PTBi, represents the average incidence of 28day infant mortality among facilities that received intervention level .With estimates and
, we can obtain a point estimate of the intervention effect by contrasting them on the scale of interest and obtaining inference with a ttest. Statistical power may be improved by considering alternative weighting schemes when summarizing individuallevel outcomes to the clusterlevel; however, as previously discussed, selections of different weights
imply different target parameters.For the relative effect (Eq. 11
), applying the logarithmic transformation is sometimes recommended when the clusterlevel summaries are skewed, which may be more common for ratetype outcomes
[moultonhayes]. However, it is important to note that depending on how this transformation is implemented, the resulting target parameter may be the ratio of the geometric means  as opposed to the ratio of the arithmetic means. (Recall for
observations of some variable , the geometric mean is whereas the arithmetic mean is ). Specifically, suppose we first take the log of the clusterlevel outcomes and then take the average within each arm:(18) 
where again denotes the clusterlevel outcome for cluster in treatment arm . Applying a ttest to the difference in these treatmentspecific means (and then exponentiating) targets the ratio of the geometric means. To avoid changing the target of inference, we can instead apply the Delta Method to obtain point estimates and inference for the arithmetic mean ratio (Eq. 11). We refer the reader to [Moore2009] for more details.
3.1.2 Clusterlevel TMLE with Adaptive Prespecification
As previously discussed, statistical power is often improved by adjusting for clusterlevel covariates and clusterlevel summaries of individuallevel covariates . Specifically, once the data have been aggregated to the clusterlevel, we can proceed with estimation and inference for , as if the data were i.i.d.. Examples of common algorithms for include parametric Gcomputation, inverse probability of treatment weighting estimators (IPTW), and TMLE [rose_tl]. Due to randomization of the intervention, these algorithms will be consistent, even under misspecification of the outcome regression [rct_rosenblum_2010]. Since these methods are wellestablished and implementation is identical to the nonclustered setting, we focus our discussion on alternative implementations that can harness both individual and clusterlevel covariates to increase precision. Before doing so, we briefly review the steps of a clusterlevel TMLE and discuss a solution for selection of the adjustment covariates in CRTs with limited numbers of clusters.
To implement a clusterlevel TMLE, we first obtain an initial estimator of the expected clusterlevel outcome . Next, we update this initial estimator using information contained in the estimated propensity score . Specifically, we define the “clever covariate” as the inverse of the estimated propensity score for cluster :
Then on the logitscale, we regress the clusterlevel outcome
on the covariates and with the initial estimator as the intercept. This provides the following targeted estimator, while simultaneously solving the efficient score equation:(19) 
where and denote the estimated coefficients for and , respectively. Finally, we obtain a point estimate of the treatmentspecific mean by averaging the targeted predictions of the clusterlevel outcomes across the clusters:
To recover the desired target effect, we can contrast of these estimates on the scale of interest, such as
. The variance of asymptotically linear estimators, such as the TMLE, may be estimated using the estimator’s influence function. These types of estimators enjoy properties that follow from the Central Limit Theorem, allowing us to construct 95% Waldtype confidence intervals. For the treatmentspecific mean
, for example, the influence function and asymptotic variance for the clusterlevel TMLE are wellapproximated as and , respectively. Alternatively, as described shortly, we can use a crossvalidated estimate of the influence function during variance estimation [adaptive_prespec].In a CRT, the propensity score is known and does not need to be estimated. However, further gains in efficiency can be achieved through estimation of the propensity score [rct_rosenblum_2010, rubin_vdl_max_eff_2008]. If both the clusterlevel outcome regression and clusterlevel propensity score are consistently estimated, the TMLE will be an asymptotically efficient estimator. However, consistent estimation of the outcome regression is nearly impossible when using an a priori
specified regression model. To improve precision while preserving TypeI error control, we previously proposed “Adaptive Prespecification”, a supervised learning approach using samplesplitting to choose the adjustment set that maximizes efficiency
[adaptive_prespec].Briefly, we prespecify a set of candidate generalized linear models (GLMs) for the clusterlevel outcome regression and propensity score . To reduce the potential for overfitting in small trials (e.g.,
), we recommend limiting these GLMs to having only one or two covariates. We also prespecify a crossvalidation scheme; for small trials, we recommend leaveoneclusterout. To measure performance, we prespecify the squared influence function as our loss function. Then we choose the candidate estimator of
that minimizes the crossvalidated variance estimate using the influence function based on the known propensity score (i.e., ). We then select the candidate estimator of propensity score that further minimizes the crossvalidated variance estimate using the influence function when combined with the previously selected estimator . Together, the selected estimators and form the “optimal” TMLE according to the principle of empirical efficiency maximization [rubin_vdl_max_eff_2008]. To account for the dataadaptive procedure used to select this TMLE, we can use a crossvalidated estimate of the influence function for inference. However, with very few candidates, the crossvalidated inference may be overly conservative and the standard approach to inference may be preferable [adaptive_prespec]. Finally, we note to incorporate a pairmatched design, consider the pair as the independent unit for variance estimation and samplesplitting; specifically, we recommend leaveonepairout crossvalidation.3.1.3 Targeting an Individuallevel Effect with a ClusterLevel Approach
With or without adaptive prespecification, the previously described clusterlevel TMLE naturally estimates a clusterlevel effect, defined using contrasts of clusterlevel counterfactuals for . Furthermore, the unadjusted effect estimator can be considered a special case of the TMLE where . In these clusterlevel analytic approaches, we can alternatively estimate a causal effect corresponding to contrasts of individuallevel counterfactuals for . To do so, we still implement a clusterlevel analysis but include weights in each step (Appendix A).
3.2 Individuallevel Approaches
We now discuss how to leverage individuallevel covariates when estimating effects in CRTs. This is can done by aggregating to the clusterlevel after estimating the expected individuallevel outcome or implementing a fully individuallevel approach. In all cases, clustering must be accounted for during variance estimation. These estimators vary in their flexibility to estimate both clusterlevel effects (e.g., Eq. 11) and individuallevel effects (e.g., Eq. 12).
3.2.1 Individuallevel TMLE
In Section 3.1, we discussed a clusterlevel TMLE for estimating effects in CRTs based on aggregating the data to the clusterlevel and applying a TMLE for i.i.d. data. We now present an individuallevel TMLE, which accounts for the hierarchical nature of the data while leveraging the natural pairing of individuallevel outcomes with individuallevel covariates [hierarchical]. In Appendix B, we present a second TMLE that can also be applied to estimate effects in CRTs. Both extend to a pairmatched design [hierarchical].
A valid analytic approach for clustered data settings, such as CRTs, is to apply an algorithm appropriate for individuallevel data for point estimation and then account for clustering during variance estimation [Schnitzer2014, hierarchical]. Specifically, we could pool participants over clusters, implement an individuallevel TMLE, and adjust statistical inference to respect that the cluster as the independent unit. Briefly, we now consider initial estimators of the individuallevel outcome , which are updated using individuallevel propensity score . Both estimators and would be fit pooling individuals across clusters. As before, we calculate the “clever covariate”, but now at the individuallevel:
for . Then on the logitscale, we would regress the individuallevel outcome on the individuallevel covariates and with the initial individuallevel estimator as the intercept. This provides the following updated estimator of the expectation of the individuallevel outcome, while simultaneously solving the efficient score equation:
(20) 
where and now denote the estimated coefficients for and . Then we would obtain a point estimate by averaging these targeted predictions:
where denotes the total number of participants. To obtain statistical inference for , we aggregate an individuallevel influence function to the clusterlevel and then take the sample variance of the estimated clusterlevel influence function, scaled by , as detailed in [Schnitzer2014]. Specifically, the influence function for is wellapproximated by where .
Altogether, this individuallevel approach naturally targets the individuallevel treatmentspecific mean and therefore the individuallevel effect, as in Eq. 12. However, as detailed in Balzer et al. [hierarchical], we can also target the clusterlevel treatmentspecific mean and therefore the clusterlevel effect, as in Eq. 11, by incorporating the weights in each step, including variance estimation. For either choice of target parameters, the ltmle package can be used to obtain point estimates and statistical inference by specifying id as the cluster and observation.weights equal to 1 for the individuallevel effect and equal to for the clusterlevel effect [ltmlepackage].
As before, we recommend using Adaptive Prespecification to select among candidate estimators of the individuallevel expected outcome and propensity score
that maximize the empirical efficiency for the target effect of interest. Leveraging the larger effective sample size in the individuallevel TMLE, we could consider use machine learning methods, such as Super Learner, to expand the candidate estimators of
and , beyond GLMs [superlearner_2007]. In all cases, the crossvalidation scheme must respect the cluster as the unit of independence.3.2.2 Covariateadjusted Residuals Estimator (CARE)
The covariateadjusted residuals estimator (CARE) was first proposed in [gailcare] and later popularized by [moultonhayes]. In CARE, we pool individuals across clusters and regress the individuallevel outcome on individuallevel and clusterlevel covariates of interest , but not the clusterlevel intervention
. Then the predictions from this regression are aggregated to the clusterlevel. Finally, a ttest comparing mean residuals (i.e., the discrepancies between observed and predicted outcomes) by arms is performed, since the average residuals should be the same between arms under the null hypothesis. This estimator’s target estimand is clusterlevel effect, such as Eq.
11 for the relative scale.Unlike TMLE with Adaptive Prespecification, CARE relies on a fixed specification of the individuallevel outcome regression [gailcare]. In more detail, for the relative effect (Eq. 11) and a binary outcome
, we could, for example, fit the following individuallevel logistic regression
(21) 
where and
denote the magnitude by which the log odds of the outcome for the
th individual in the th cluster is affected (linearly) by the clusterlevel covariates and individuallevel covariates , respectively. Once this model is fit, the expected number of events in the th cluster is calculated as , and compared the observed number of events through ratioresiduals:Hayes and Moulton [moultonhayes] note that these ratioresiduals are often rightskewed and recommend a logarithmic transformation. Specifically, they recommend applying a ttest to obtain point estimates and inference for the difference in the treatmentspecific averages of the logtransformed residuals:
(22) 
where denotes the ratioresidual for cluster in arm . As detailed in Section 3.1.1, after exponentiation, we recover estimates and inference for the ratio of the geometric means and thereby a different target parameter than the standard risk ratio, given in Eq. 11. A straightforward extension to pairmatched design is illustrated in [moultonhayes].
3.2.3 Generalized estimating equations (GEE)
We now consider a class of estimating equations, sometimes referred to as “populationaverage models”, for estimating effects in CRTs [warelaird]. In GEE, estimation and inference is conducted at the individuallevel and a working correlation matrix is used to account for the dependence of outcomes within clusters. Therefore, the target of inference is now the individuallevel effect, as in Eq. 12, instead of a clusterlevel effect, as in Eq. 11.
In GEE, the expected individuallevel response is modeled a function of the treatment and possibly covariates of interest [warelaird, hubbard_ahern]. Specifically, consider the following “marginal model” for the expected individuallevel outcome :
(23) 
where denotes the inverselink function. Commonly, the identity link is used for continuous outcomes, the loglink for count outcomes, and the logitlink for binary outcomes. Effect estimation in GEE is usually done by obtaining a point estimate and inference for the treatment coefficient . At the individuallevel, represents causal risk difference for the identity link, represents the causal risk ratio for the loglink, and represents the causal odds ratio for the logitlink. In other words, the link function often determines the scale on which the effect is estimated.
As with other CRT approaches, GEE may improve efficiency by adjusting for covariates. Consider, for example, the following “conditional model” for the expected individuallevel outcome :
(24) 
where again denotes the inverselink function. Except for linear and loglinear models without interaction terms, the interpretation of is generally not the same as in the marginal model. For the logistic link function, for example, in Eq. 24 would yield the conditional logodds ratio instead of the marginal logodds ratio. However, a recent modification to GEE, presented next, allows for estimation of marginal effects, while adjusting for individuallevel or clusterlevel covariates.
For either a marginal or conditional specification, the GEE estimator solves the following equation:
where is the vector containing individuallevel outcome regressions for cluster , is the gradient matrix, and is the working correlation matrix used to account for dependence of individuals within a cluster [warelaird].
GEE yield a consistent estimate of under the marginal model, even if the correlation matrix has been misspecified. However, under misspecification of the correlation matrix, the usual standard errors obtained are not valid, and the sandwich variance estimator must be used [warelaird]. In general, unless the number of clusters is relatively large and the number of participants within cluster is relatively small, the sandwichbased standard errors can underestimate the true variance of
, and yield confidence intervals with coverage probability below desired nominal level. Corrections to variance estimation exist for when there are fewer than 38 degrees of freedom
[murray, gee_correction]. The extension to a pairmatched analysis requires specifying a fixed effect for the pair while maintaining the correlation structure at the clusterlevel. However, pairmatched analyses are discouraged for studies with fewer than 40 clusters [murray, gee_correction, moultonhayes].3.2.4 AugmentedGEE
As discussed in the previous subsection, the coefficient resulting from GEE does not always correspond to the marginal effect. Recently, a modification to GEE was proposed to ensure can be interpreted as a marginal effect, while simultaneously adjusting for baseline covariates to improve efficiency [auggee]. This approach is referred to as AugmentedGEE (AugGEE) and again targets individuallevel effects, as in Eq. 12.
As commonly implemented, AugGEE modifies GEE by including an additional “augmentation” which incorporates the conditional expectation of the individuallevel outcome . The general form of the AugGEE for a binary treatment is given by
(25) 
with augmentation term
Here, for cluster , denotes the vector of marginal regressions (as in Eq. 23) and denotes the vector of conditional regressions (as in Eq. 24). The clusterlevel propensity score, defined in Eq. 15, is treated as known (i.e., ). By solving Eq. 25, we can obtain point estimates and inference for the marginal effects. As with the other estimators considered, if the conditional outcome regression is misspecified, the resulting estimator is asymptotically normal and consistent; however, it is not efficient [auggee]. Indeed, the efficiency of the estimator highly depends on the matrix . Stephens et al. [stephens_auggee14] show how to further improve AugGEE by deriving the semiparametric locally efficient estimator, but conclude the derivation of the highdimensional inverse covariance matrix presents a substantial barrier to any additional gains.
4 Simulation Studies
To examine the finite sample properties of the previously discussed CRT estimators and to demonstrate the importance of careful specification of the target of inference, we conducted two simulation studies.
4.1 Simulation I
We simulated a simplified data generating process reflecting the hierarchical data structure of the PTBi study, which randomized clusters, corresponding to health facilities. For each of the 20 clusters, we sampled clusterlevel covariates and and withincluster sample size . For each cluster
, we also simulated a random variable
to act as an unmeasured source of dependence within each cluster. To reflect the PTBi study design and examine the impacts of “breaking the match” during the analysis, we paired clusters on using the nonbipartite matching algorithm [nbp_match]. Within the pair, one cluster was randomized to the intervention arm and the other to the control arm.For participant in each cluster , we generated two individuallevel covariates: and . We also calculated clusterlevel summaries and as the empirical mean of and , respectively. Lastly, we simulated the individuallevel outcomes as a function of the intervention , cluster and individuallevel covariates, and unmeasured factor :
(26) 
We set the coefficients for intervention terms to assess the estimator’s power when there was an effect, and to assess TypeI Error under the null. For both settings, we generated counterfactual outcomes under the intervention and under the control by setting and , respectively, for a population of 9000 clusters. These counterfactuals were used to calculate the true values of the causal parameters, defined at both the cluster and individuallevel. Specifically, we assessed the performance of the ttest and TMLE as estimators of the clusterlevel effect (i.e., Eq. 11) and the GEEs as estimators of the individuallevel effect (i.e., Eq. 12). Additionally, we implemented CARE to estimate the geometric incidence ratio (i.e., the ratios of the geometric means of the clusterlevel outcomes).
We implemented two versions of TMLE: one using only clusterlevel data (hereafter called “clusterlevel TMLE”) and one incorporating individuallevel data (hereafter called “individuallevel TMLE”) For covariate adjustment, both TMLEs used Adaptive Prespecification to choose at most two covariates for adjustment if they improved efficiency relative to the unadjusted estimator. The clusterlevel TMLE could select from when estimating the clusterlevel outcome regression and clusterlevel propensity score, while the individuallevel TMLE could select from when estimating the individuallevel outcome regression and individuallevel propensity score. To limit the overall number of candidates and thus reduce the estimator’s variability, we removed a covariate from consideration for propensity score estimation if that covariate was already selected for adjustment in the outcome regression. Inference for the TMLEs was based on the crossvalidated influence function. CARE, GEE, and AugGEE relied on a fixed specification of the individuallevel outcome regression which adjusted for . In CARE, we applied the logtransformation to the ratio residuals, as recommended in by [moultonhayes], and obtained inference with a ttest. In the GEEs, we used the loglink function, and an exchangeable working correlation matrix for statistical inference.
4.2 Simulation I Results
The true value of the clusterlevel incidence ratio (Eq. 11), targeted by the ttest and the TMLEs, was 0.744 and very similar to the true value of the individuallevel risk ratio (Eq. 12), targeted by the GEEs: 0.741. However, the true value of the ratio of geometric means (Section 3.2.2), targeted by CARE, was: 0.71, indicating a larger effect size. In other words, the simulated intervention resulted in a 26% reduction in the clusterlevel outcome on the arithmetic scale and a 29% reduction on the geometric scale. Of course, there is no guarantee that this will always be true, and it is essential to specify a target parameter corresponding to the research query.
While the CRT estimators targeted different effects, it is still valid to compare their attained power, defined as the proportion of times the false null hypothesis was rejected at the 5% significance level, and confidence interval coverage, defined the proportion of times the 95% confidence intervals contained the true value of corresponding target effect. Additionally, we examined TypeI error, defined as the proportion of times the true null hypothesis was rejected at the 5% significance level.
When there is an effect  Under the null  

pt  bias  covg  power  pt  bias  covg  TypeI  
ttest  0.745  0.001  0.175  0.171  0.952  0.302  1.001  0.001  0.141  0.154  0.960  0.040 
Clust TMLE  0.740  0.004  0.112  0.106  0.962  0.722  1.001  0.001  0.103  0.098  0.959  0.041 
Indv TMLE  0.741  0.003  0.113  0.106  0.964  0.723  1.001  0.001  0.103  0.096  0.960  0.040 
CARE  0.737  0.007  0.110  0.083  0.954  0.757  1.006  0.006  0.101  0.104  0.951  0.052 
GEE  0.745  0.001  0.096  0.107  0.928  0.851  1.005  0.005  0.087  0.097  0.921  0.096 
AugGEE  0.745  0.001  0.094  0.107  0.922  0.855  1.006  0.006  0.086  0.100  0.918  0.101 

“Clust” and “Indv” refer to the TMLE using clusterlevel (only) and individuallevel data, respectively. Each is implemented with Adaptive Prespecification.

“pt” is the average point estimate; “bias” is the average difference between the point estimate and the target effect;
is the standard deviation of the point estimates;
is the average standard error estimate; “covg” is the proportion of times the 95% confidence interval contained the true effect, “power” is the proportion of times the false null hypothesis was rejected, and “TypeI” is the proportion of times the true null hypothesis was rejected.
These metrics are shown in Table 1 for each estimator (for its corresponding target estimand) across all 2500 iterations of the data generating process, each with clusters. The ttest, an unadjusted effect estimator, maintained nominal TypeI error (4%) under the null, but when there was an effect, it achieved low statistical power (30%) with nominal confidence interval coverage (95%). In these simulations, the clusterlevel TMLE and individuallevel TMLE performed similarly. Both provided substantial efficiency gains over the unadjusted estimator of the same parameter, which was the clusterlevel relative effect. Specifically, they achieved a statistical power of 72%, while maintaining nominal confidence interval coverage (96%) and TypeI error control (). As expected, however, the TMLEs differed in their selection of adjustment variables. The individuallevel TMLE selected as the adjustment variable for the outcome regression in 98% of simulations, while the clusterlevel TMLE selected the corresponding clusterlevel covariate in only 34% of simulations. Both selected an unadjusted estimator of the propensity scores in 93% of repetitions.
For estimation of the geometric incidence ratio, CARE maintained TypeI Error control (5%) and achieved slightly more power (76%) than the TMLEs; this is not surprising given that in this simulation the clusterlevel effect summarized with geometric means (0.71) was further from the null (1.0) than the clusterlevel effect summarized with arithmetic means (0.74). Finally, in this simulation, both GEEs, which targeted the individuallevel risk ratio, achieved the highest power (85%86%), but also exhibited inflated TypeI error rates (10%) and less than nominal confidence interval coverage (92%).
4.3 Simulation II
In many CRTs, participant outcomes are influenced by the cluster size. Suppose, for example, that the smallest health facilities have the fewest resources to the detriment of their patients’ health, while largest health facilities are overburdened also to the detriment of their patients’ health. In this setting, wide variation in cluster size can result in a divergence between clusterlevel and individuallevel effects. Intuitively, clusterlevel parameters (e.g., Eqs. 4 and 5) give equal weight to each cluster, regardless of its size, while individuallevel parameters (e.g., Eqs. 6 and 7) give equal weight to all trial participants. The distinction between the parameters is exacerbated when clustersize interacts with the intervention and said to be “informative” [nevail_ics2014, Seaman2014]. Therefore, in the second simulation study, we considered a more complex datagenerating process to highlight the distinction between the clusterlevel effects and individuallevel effects.
We again focused on a setting with clusters, reflecting the PTBi study. For each, we sampled a clusterlevel covariate , the withincluster sample size , and an unmeasured clusterlevel variable . Clusters were pairmatched on , and within the pair, one cluster was randomized to the intervention arm and the other to the control arm. For participant in each cluster , we generated two individuallevel covariates: and . As before, the clusterlevel aggregates and were estimated by taking the average value of and , respectively, within cluster. Lastly, we simulated the individuallevel outcomes as a function of the intervention, the cluster and individuallevel covariates, and unmeasured factor :
Unlike Simulation I, the probability of the individuallevel outcome is now a function of cluster size . Specifically, the outcome risk is higher for participants in larger clusters, especially large clusters in the control arm. As before, we also set the coefficients for the intervention and for its interactions with covariates to 0 to assess TypeI Error under the null. For both settings, we generated counterfactual outcomes under the intervention and under the control by setting and , respectively. Then for a population of 9,000 clusters, we calculated the relative effect at the clusterlevel (Eq. 11) and at the individuallevel (Eq. 12).
For this simulation, we focused on the performance of the clusterlevel and individuallevel TMLEs, given their flexibility to estimate effects defined at the clusterlevel or individuallevel and their ability to incorporate baseline covariates to improve precision and statistical power, while maintaining TypeI error control. As previously discussed, through the application of weights, the clusterlevel TMLE can estimate individuallevel effects. Likewise, the individuallevel TMLE can estimate clusterlevel effects. It is possible that the other CRT estimators have this flexibility, but the needed extensions remain to be fully studied.
As before, each TMLE used Adaptive Prespecification to choose at most two covariates for adjustment if their inclusion improved efficiency as compared to an unadjusted estimator. The candidate adjustment set for the clusterlevel TMLE included , while the candidate adjustment set for individuallevel TMLE included . As before, if a covariate was adjusted for in the outcome regression, it was not considered in estimation of the propensity score. For comparison, we also considered a clusterTMLE with fixed specification adjusting for and an individuallevel TMLE adjusting for . Inference for TMLEs using Adaptive Prespecification was based on a crossvalidated estimate of the influence function, while inference for the TMLEs based on fixed adjustment was based on the standard (i.e., not crossvalidated) estimate of the influence function. For simplicity, we focus on the analysis “breaking” the matches used for randomization.
4.4 Simulation II Results
In these simulations, the true value of the clusterlevel relative effect (Eq. 11) was 0.81, substantially larger than the true value of the individuallevel relative effect (Eq. 12) of 0.85. In other words, the intervention resulted in a 19% reduction in the incidence of outcome and only a 15% reduction in the individuallevel risk of the outcome. Of course, this is just one simulation study; in practice, there is no guarantee that the clusterlevel effect will be larger, or even different, from the individuallevel effect. For this simulation study, Table 2 shows the performance of the clusterlevel and individuallevel TMLEs with fixed and adaptive adjustment. Given the differing magnitude of the effects, it is unsurprising that estimators of the clusterlevel effect, shown on the left, achieved notably higher power than estimators of the individuallevel effect, shown on the right. Specifically, estimators of the clusterlevel effect achieved a maximum power of 58%, while estimators of the individuallevel effect achieved a maximum power of 41%.
Focusing on estimators of the clusterlevel effect (Table 2Left), we see that performance was quite similar across the approaches. Specifically, all analyses resulted in low absolute bias, similar variability (), and nearnominal confidence interval coverage (9395%). For a given adjustment strategy (fixed or adaptive), there was little practical difference in performance between analyses using clusterlevel or individuallevel data. However, the crossvalidated standard error estimates () from the TMLEs using Adaptive Prespecification were notable larger than those relying on fixed specification, resulting in notably lower power: 48% versus 58%, respectively. Of course, a prioriselection of fixed adjustment set is easy when there are limited covariates as in this simulation study. In most applications, there are many potential adjustment variables, and it often unclear which combination will maximize precision for the effect of interest, motivating the need for Adaptive Prespecification.
This is, in fact, seen for estimators of the individuallevel effect (Table 2Right). As before, all approaches exhibited minimal bias, similar variability (), and nearnominal to conservative confidence interval coverage (93%97%). Now, however, we see that the TMLEs using Adaptive Prespecification achieved notably higher power (%) as compared to the TMLEs relying on fixed specification of the adjustment variables (%). Again we see little practical difference between approaches relying on clusterlevel data or utilizing individuallevel data.
Clusterlevel effect: = 0.806  Individuallevel effect: = 0.845  
pt  bias  covg  power  pt  bias  covg  power  
Clust  0.807  0.001  0.106  0.095  0.930  0.583  0.843  0.002  0.092  0.102  0.965  0.341 
ClustAP  0.809  0.003  0.107  0.107  0.946  0.479  0.845  0.000  0.093  0.091  0.933  0.411 
Indv  0.806  0.000  0.106  0.096  0.931  0.583  0.844  0.001  0.091  0.101  0.967  0.336 
IndvAP  0.808  0.002  0.107  0.102  0.949  0.484  0.845  0.000  0.092  0.091  0.939  0.408 

“Clust” and “Indv” refer to the TMLE using clusterlevel and individuallevel data, respectively. Each is implemented fixed or adaptive adjustment via Adaptive Prespecification (“AP”).

“pt” is the average point estimate; “bias” is the average difference between the point estimate and the target effect; is the standard deviation of the point estimates; is the average standard error estimate; “covg” is the proportion of times the 95% confidence interval contained the true effect, and power is the proportion of times the false null hypothesis was rejected.
Altogether the results of this simulation demonstrate the ability of TMLE to estimate both clusterlevel and individuallevel effects, while adaptively adjusting for baseline covariates. These results also highlight the critical importance of prespecifying the primary effect measure and using an CRT estimator of that effect. Additional simulation results in Appendix C demonstrate the danger of using an estimator that does not target the primary effect measure and thereby fails to answer the research query of interest. Specifically, notable bias and poor coverage can arise by applying an estimator of the clusterlevel parameter when the target of inference is the individuallevel effect and, conversely, when applying an estimator of the individuallevel parameter when the target of inference is the clusterlevel effect. While we illustrate this critical step in CRT analysis with TMLE, similar challenges apply to all CRT estimators. In all settings, the research question should drive specification of the causal effect and thereby the statistical estimation approach [Petersen2014roadmap, BalzerMLcomm2021].
5 Real data application: The PTBi Study in Kenya and Uganda
In East Africa, preterm birth remains a leading risk factor for perinatal mortality, defined as stillbirth and firstweek deaths [ku_protocol]. Evidencebased practices, such as use of antenatal corticosteroids and skintoskin contact, are not routinely used and have the potential to improve outcomes for preterm infants during the critical intrapartum and immediate newborn periods. The PTBi study was a CRT designed to improve the quality of care at the time of birth through a health facility intervention for mothers and preterm infants (ClinicalTrials.gov: NCT03112018) [ku_protocol]. The primary endpoint was intrapartum stillbirth and 28day mortality among preterm infants delivered from October 2016 to May 2018 in Western Kenya and Eastern Uganda.
In more detail, 20 public sector health facilities, including large hospitals and smaller health centers, were selected for participation in the study. The facilities ranged in size, staffpatient ratio, and capacity to perform cesarean section (Csection), among other characteristics. The facilities were pairmatched within region and randomized to either the intervention or control arm [ku_protocol]. Facilities in the control arm received (1) strengthening of routine data collection and (2) introduction of WHO Safe Childbirth Checklist [WHO_checklist]. The facilities in the intervention arm received the components included in the control in addition to (1) PRONTO Simulation training [pronto], and (2) quality improvement collaboratives aimed to reinforce and optimize use of evidencebased practices. All study components consisted of known interventions and strategies aiming to improve quality of care, teamwork, communication, and data use [ku_protocol].
The results of the PTBi Study have been previously published in [ptb_ku_walker]. Here, we focus on the realworld impact of varying cluster sizes. Specifically, during the study period, an unforeseen political strike led to lack of medical providers at certain facilities, thereby decreasing volume at some facilities while increasing volume at others. The number of preterm births for a given facility ranged between 40366 in the intervention arm and between 31447 in the control arm. Differences in cluster size were also pronounced within matched pairs and ranged between 9211.
To study the impact of high variability in cluster size, we return to the consequences of specifying the effect in terms of clusterlevel outcomes (Eq. 11) versus individuallevel outcomes (Eq. 12). In PTBi, the individuallevel outcome was an indicator of preterm infant mortality by 28day followup. Infants dying before discharge (stillbirth and predischarge mortality) were also included in the study; for these infants, . For facility , the clusterlevel endpoint was the incidence of fresh stillbirth and 28day allcause mortality among preterm births and calculated as the empirical mean of the individuallevel outcomes: .
For both the clusterlevel and individuallevel effects, we compared estimates and inference from TMLEs using clusterlevel or individuallevel data. The clusterlevel TMLE used Adaptive Prespecification to select between no adjustment and adjustment for the proportion of mothers receiving a Csection, while the individuallevel TMLE used Adaptive Prespecification to select between no adjustment and adjustment for an indicator of receiving a Csection. For comparison, we also implemented the TMLEs without adjustment and each approach preserving or ignoring the matched pairs used for randomization.
5.1 PTBi Results
The average incidence of 28day mortality among preterm infants was 0.122 among facilities randomized to the intervention and 0.150 among facilities randomized to the control. Therefore, the unadjusted estimator of the clusterlevel relative effect (Eq. 11) was 0.815 (95%CI: 0.5951.116; p: 0.174), corresponding to 18.5% reduction in combined incidence of mortality at the facilitylevel. The results for the individuallevel outcome were markedly different. Specifically, the proportion of preterm infants who died within 28days was 0.153 in the intervention arm and 0.223 in the control arm. Therefore, the unadjusted estimator of the individuallevel, relative effect (Eq. 12) was 0.656 (95%CI: 0.5310.812; p0.005), corresponding to a 34.4% reduction in the mortality risk.
As shown in Table 3, all estimates from TMLE indicated the PTBi intervention reduced preterm mortality; however, estimates of both the treatmentspecific means and relative effects varied by the target of inference and the estimation approach. As seen with the unadjusted analyses, estimates of clusterlevel effect were less extreme than estimates of the individuallevel effect. We note the adjusted analyses were slightly less precise than the unadjusted ones, because, for demonstration, the adjusted analyses restricted to participants with known Csection status (), while the unadjusted analyses used all participants (). As expected, for a given TMLE (defined by target of inference and level of the analysis), the point estimates when breaking versus keeping the matches were identical. Additionally as predicted by theory [balzer_matching], the pairmatched analysis was more precise. The gain in efficiency from pairmatching was particularly notable for the individuallevel TMLE.
For the clusterlevel effect  For the individuallevel effect  
Ratio (95%CI)  pval  Ratio (95%CI)  pval  
Unmatched analysis  
Clust  0.128  0.145  0.885 (0.585, 1.338)  0.542  0.135  0.162  0.834 (0.567, 1.227)  0.337 
Indv  0.124  0.149  0.836 (0.495, 1.411)  0.481  0.148  0.213  0.695 (0.435, 1.109)  0.120 
Pairmatched analysis  
Clust  0.128  0.145  0.885 (0.587, 1.335)  0.518  0.135  0.162  0.834 (0.573, 1.214)  0.303 
Indv  0.124  0.149  0.836 (0.600, 1.164)  0.252  0.148  0.213  0.695 (0.549, 0.880)  0.007 

“Clust” and “Indv” refer to the TMLE using clusterlevel and individuallevel data, respectively.
Focusing specifically on estimators of the clusterlevel effect (Table 3Left), we see that estimates of the expected incidence of mortality under the intervention and of the expected incidence of mortality under the control varied whether or not an clusterlevel or individuallevel analysis was taken. In particular, the expectations were more extreme (i.e., lower incidence in the intervention and higher incidence in the control) with the individuallevel estimation approach. This may reflect that while both approaches can target the same parameter  here, the clusterlevel effect  the relationship between the clusterlevel covariates and outcome may be distinct from the relationship between the individuallevel covariates and outcome. Specifically, the impact of the proportion of women receiving a Csection on the clusterlevel incidence of mortality is likely to be different than the impact of an individual woman’s Csection on the risk of mortality for her child. This could contribute to the individuallevel TMLE being a more precise estimator of the clusterlevel effect in this application.
Finally focusing on estimators of the individuallevel effect (Table 3Right), we see that estimates of the risk of mortality under the intervention and of the risk of mortality under the control are larger with the individuallevel approach. Focusing on pairmatched analyses for simplicity, clusterlevel TMLE suggested that the effect of the PTBi intervention on reducing the individuallevel risk of preterm mortality was 0.834 (95%CI: 0.5731.214; p: 0.303). In contrast, the individuallevel TMLE found effect of the PTBi intervention to be 0.695 (95%CI: 0.5490.880; p: 0.007), corresponding to a 30.5% reduction. Altogether, these results highlight that the importance of prespecifying the research question, selecting a causal parameter corresponding to that question, and finally implementing an estimator for that effect.
6 Discussion
Cluster randomized trials (CRTs) are commonly used to evaluate the effects of interventions, which are randomized to groups of individuals, such as communities, clinics, or schools. In both resourcerich and resourcelimited settings, high costs and barriers to implementation often limit the number of clusters which can be enrolled and randomized. Therefore, it is important to understand the potential advantages and pitfalls of common analytic approaches in smallscale CRTs. In this manuscript, we aimed to provide indepth an comparison of methods available for CRT analysis, overall and when there were 30 clusters. To do so, we first demonstrated how causal models can be used to describe the data generating process of a CRT and intervened upon to generate counterfactual outcomes of interest. Using these counterfactuals, we explored a variety of causal parameters that could be defined, and we stressed the importance of a priorispecifying the primary effect measure. We then delved into common CRT methods, considered each theoretically and with finite sample simulations. Finally, we demonstrated the impact of key analytic decisions using real data from the PTBi study, a clinicbased CRT designed to reduce mortality among preterm infants in East Afica.
Our theoretical comparison of estimators revealed that common methods often target different causal effects. Results from our first simulation study demonstrated how targeted maximum likelihood estimation (TMLE) and the covariate adjusted residuals estimator (CARE) could improve statistical power by adjusting for covariates, while maintaining nominal confidence interval coverage. In this simulation study, approaches based on generalized estimating equations (GEE and AugGEE) failed to preserve TypeI error control; this is consistent with the literature warning against their application (without finite sample corrections) with fewer than 30 clusters [moultonhayes, murray]
. However, some studies may value minimizing Type II error over Type I error, and this should be carefully considered in the analysis plan.
Results from our second simulation study demonstrated the impact of informative cluster size, which occurs when the cluster size interacts with the intervention effect. In this setting, there can be a sharp divergence in the interpretation and magnitude of effects defined at the clusterlevel or at the individuallevel. This divergence was observed in the real data application, where larger facilities in the control arm had worse outcomes.
There are several limitations to the applied analysis, which merit additional consideration. First, high levels of missing data on baseline covariates limited our investigation of adjustment to a single covariate: having a Csection. Future work could address imputation of missing data on baseline covariates and its impact on statistical performance of adjusted estimators. Second, the adjusted analysis was restricted to motherinfant dyads with known Csection status. This could bias our results if the data are not missing at random; we plan to address this in a future analysis using the methods in
[Balzer2021twostage]. Third, during the delivery, the infant may have also received additional interventions which could influence their outcome. While the PTBi study was not designed to collect this information for its primary analysis, future work could focus on methods for assessing the joint impact of the cluster randomized intervention combined with another individuallevel exposure (randomized or not). Lastly, we assumed a onetoone relationship between mothers and their infants for simplicity; however, in further sensitivity analyses, we will formally consider accounting for all twins and triplets. Standard error estimates should account for this additional layer of clustering.In summary, CRTs are a popular approach for researchers seeking to scale up interventions from the individuallevel to the populationlevel. Many of the challenges faced by the PTBi study are common and highlight the importance of carefully selecting the target effect and corresponding analytic approach. We recommend considering both interpretation and impacts on statistical power when selecting the target effect. For estimation, we recommend using TMLE given its flexibility to estimate a variety of effects and dataadaptively adjust for both clusterlevel and individuallevel covariates, while preserving TypeI error control.
7 Appendix A: Generalization of the causal parameter definition
We can further generalize our definition of treatmentspecific mean as
(27) 
for userspecified weights such that . For example, setting recovers the clusterlevel parameter , as in Eq. 4. Let be the weights used to define the clusterlevel outcome: . Then setting recovers the individuallevel parameter , as in Eq. 6. To illustrate, we again focus on the setting where is the inverse clustersize. Then we have
(28) 
Altogether Eq. 27 allows us to consider a wide range of effects defined at the clusterlevel or individuallevel, regardless of whether the data are collected at the clusterlevel or individuallevel.
8 Appendix B: Hybrid TMLE
Recall that the first step of the clusterlevel TMLE (Section 3.1) is to obtain an initial estimator of the conditional expectation of the clusterlevel outcome . Instead of only considering clusterlevel approaches, we can expand our candidate estimators by including aggregates of individual approaches [hierarchical]. Consider, for example, the following specification of the expected individuallevel outcome :
We could estimate these coefficients by pooling individuals across clusters and running an individuallevel logistic regression. Afterwards, for each cluster , we would obtain and summarize the individuallevel predicted outcomes to generate a candidate estimator of the expected clusterlevel outcome:
with the selected corresponding to relevant clusterlevel summary of the individuallevel outcomes (i.e., Eqs. 3 and 8); throughout, we have been focusing on for . Then estimation and inference would proceed as described in Section 3.1 for the clusterlevel TMLE. Thus, this approach naturally targets the clusterlevel treatmentspecific mean , and therefore clusterlevel effects, as in Eq. 11. However, as previously discussed, we can modify the weights to, instead, target an individuallevel effect.
More importantly, we can now use Adaptive Prespecification to choose between candidate estimators of based on clusterlevel approaches or aggregates of individuallevel approaches. In the latter, the initial estimator of is based off of pooling individuals across clusters; therefore, we could consider methods that are more dataadaptive than GLMs. Specifically, we could use Super Learner, an ensemble machine learning algorithm [superlearner_2007].
9 Appendix C: Additional Simulation Results
As demonstrated in the Simulation Study II, TMLE can estimate effects defined at the clusterlevel (e.g., Eq.11) as well as effects defined at the individuallevel (e.g., Eq. 12) regardless of whether the data are at the individuallevel or the clusterlevel. Here, we again emphasize the importance of implementing an estimator corresponding to the CRT’s primary effect measure. To do so, we reconsider Simulation II and show the performance of the TMLEs when there is a mismatch between the researcher’s goal and the estimator’s target.
Specifically, in Table 4Left, we show the finite sample performance when the TMLEs are targeting the clusterlevel effect ( = 0.806), but the researcher’s goal is to evaluate the individuallevel effect (
Comments
There are no comments yet.