Comparative Methods for the Analysis of Cluster Randomized Trials

10/18/2021
by   Alejandra Benitez, et al.

Across research disciplines, cluster randomized trials (CRTs) are commonly implemented to evaluate interventions delivered to groups of participants, such as communities and clinics. Despite advances in the design and analysis of CRTs, several challenges remain. First, there are many possible ways to specify the intervention effect (e.g., at the individual-level or at the cluster-level). Second, the theoretical and practical performance of common methods for CRT analysis remains poorly understood. Here, we use causal models to formally define an array of causal effects as summary measures of counterfactual outcomes. Next, we provide a comprehensive overview of well-known CRT estimators, including the t-test and generalized estimating equations (GEE), as well as lesser-known methods, including augmented GEE and targeted maximum likelihood estimation (TMLE). In finite sample simulations, we illustrate the performance of these estimators and the importance of effect specification, especially when cluster size varies. Finally, our application to data from the Preterm Birth Initiative (PTBi) study demonstrates the real-world importance of selecting an analytic approach that corresponds to the research question. Given its flexibility to estimate a variety of effects and its ability to adaptively adjust for covariates for precision gains while maintaining Type-I error control, we conclude that TMLE is a promising tool for CRT analysis.


1 Introduction

Cluster randomized trials (CRTs) provide an opportunity to assess the population-level effects of interventions that are randomized to groups of individuals, such as communities, clinics, or schools. These groups are commonly called “clusters”. The choice to randomize clusters, instead of individuals, is often driven by the type of intervention as well as practical concerns [moultonhayes]. For example, interventions to improve medical practices are often randomized at the hospital or clinic level to reduce logistical burden and to minimize the potential contamination between arms that could occur if individual patients were instead randomized. The design and conduct of CRTs have improved considerably [turner17a, turner17b, crepsi, Murray2020], and results from CRTs have been widely published in the public health, education, policy, and economics literature [icc]. However, a recent review found that only 50% of CRTs were analyzed with appropriate methods [murray].

Due to the hierarchical nature of the data and the correlation of participant outcomes within clusters, the analysis of CRTs is fundamentally more complicated than for individually randomized trials [moultonhayes, murray]. To start, there are many ways to define the causal parameter of interest in CRTs. For example, we can define the effect in terms of summaries of individual-level outcomes or cluster-level outcomes. Additionally, we may be interested in the effect for the sample included in the CRT or for a wider target population. Once the primary effect measure has been selected, statistical estimation and inference must respect that the cluster is the independent unit and all observations within the cluster are dependent to some degree. Ignoring this dependence can lead to power calculations based on the wrong sample size and underestimates of the standard error. Analyses ignoring the clustering may also inappropriately attribute the impact of a cluster-level characteristic to the intervention and, together with variance underestimation, result in inflated Type-I errors. Many analytic approaches are available to account for clustering, including conducting a cluster-level analysis or applying a correction factor [moultonhayes, murray, turner17a, turner17b, crepsi].

Once we have committed to an analytic approach that addresses clustering, adjustment for baseline covariates is often considered for additional precision gains. Fortunately, many methods are available to estimate adjusted intervention effects in CRTs. Examples include well-established methods, such as generalized estimating equations (GEE) and the covariate-adjusted residuals estimator (CARE), as well as more recent developments, such as targeted maximum likelihood estimation (TMLE) and augmented GEE (Aug-GEE) [liang-zeger, gail-care, moultonhayes, rose_tl, hierarchical, aug-gee]. While these methods differ in their exact implementation, each aims to improve the statistical power of the CRT by controlling for individual-level or cluster-level covariates when fitting the “outcome regression”: the conditional expectation of the outcome given the randomized intervention and the adjustment variables. Previous literature [crt_comparative_bellamy] has used simulation studies to compare the power and Type-I error of different CRT analysis approaches, including cluster-level methods, GEE, random effects models, ‘design-effect’ models, and methods ignoring the clustering altogether. However, to the best of our knowledge, these comparisons have excluded the more recent approaches of Aug-GEE and TMLE.

This paper aims to clarify the strengths and weaknesses of common methods to analyze CRTs. To motivate this comparison, we consider the Preterm Birth Initiative (PTBi) study, a maternal-infant CRT which took place in 20 health facilities across Kenya and Uganda (ClinicalTrials.gov: NCT03112018). The trial assessed whether a facility-based intervention, designed to improve uptake of evidence-based practices, was effective in reducing 28-day mortality among preterm infants. Following PTBi, we focus on a setting where individual participants are grouped into clusters (e.g., patients in a health facility). However, our discussion and results are equally applicable to other hierarchical data structures that may be observed in a CRT. Examples include households within a community or classrooms within a school. Also to match the real data example, we focus on a binary individual-level outcome throughout. However, our work also applies to CRTs with count or continuous outcomes. Finally, we note that modifications to each of the above analysis approaches can be made to account for pair-matching clusters prior to randomization [moultonhayes]. However, “breaking the match” and analyzing a pair-matched trial, such as PTBi, as if it were completely randomized is also a valid approach and is the focus of our discussion.

The remainder of the paper is organized as follows. In Section 2, we use structural causal models to describe the data generating process for a CRT [pearl]. We also discuss several effects commonly targeted in CRTs. We highlight how the endpoint may be defined at the individual-level or at the cluster-level, often as an aggregate of individual-level outcomes. We additionally address the distinction between effects defined for the study sample versus a target population. In Section 3, we describe each estimation method, including its statistical properties and implementation. In Section 4, we provide two simulation studies to evaluate the finite sample performance of common CRT estimators and to demonstrate the distinction between different causal effects. In Section 5, we apply these methods to estimate intervention effects for the PTBi study. We conclude in Section 6 with a brief discussion.

2 Causal Methods

We begin by formalizing the notation used throughout. We assume that each cluster, such as a hospital or smaller health facility, is sampled from some target population of interest and is indexed by $j \in \{1, \ldots, J\}$. Each cluster is comprised of a finite set of individuals (i.e., participants), indexed by $i \in \{1, \ldots, N_j\}$. The cluster size, denoted $N_j$, could be fixed or could vary by cluster. In each cluster, $W_j = (W_{ij} : i = 1, \ldots, N_j)$ denotes the baseline covariates of the participants. Additional baseline covariates related to the cluster are denoted $E_j$; these could be summaries of individual-level characteristics or may have no individual-level counterpart. (We refer the reader to [hierarchical] for a discussion of the inclusion of cluster-level summaries of $W_j$ in $E_j$.)

In a CRT, the intervention is randomized among clusters. In the case of a pair-matched trial, which may increase the power of the study, the randomization occurs within pairs of clusters matched on characteristics expected to be predictive of the primary endpoint [moultonhayes, imai2009, balzer_matching]. Throughout, we will use $A_j \in \{0, 1\}$ as an indicator that cluster $j$ was randomized to the intervention. The outcome of interest is $Y_j = (Y_{ij} : i = 1, \ldots, N_j)$, which for simplicity we assume is measured for all participants in cluster $j$. Extensions to handling missing data are important, but beyond the scope of this paper [diaz_missing, fiero, Balzer2021twostage].

2.1 Hierarchical Structural Causal Models

We now use Pearl’s structural causal model [pearl] to represent the hierarchical data generating process of a CRT. These models formally represent the relationships between the intervention, the outcome, measured characteristics, and unmeasured variables. First, we briefly review the hierarchical model, proposed in [hierarchical], and then we consider a more restricted hierarchical model that more accurately represents the PTBi study.

As detailed in [hierarchical], the following model encodes independence between clusters, but makes no restrictions on the dependence of observations within a cluster.

(1)   $E_j = f_E(U_{E_j}); \quad W_j = f_W(E_j, U_{W_j}); \quad A_j = f_A(E_j, W_j, U_{A_j}); \quad Y_j = f_Y(E_j, W_j, A_j, U_{Y_j})$

Here, $(f_E, f_W, f_A, f_Y)$ denote the set of functions that determine the values of the observed data variables: the cluster-level covariates $E_j$, the individual-level covariate matrix $W_j$, the intervention assignment $A_j$, and the vector of individual-level outcomes $Y_j$. These functions are assumed to be common in $j$, and each accounts for unmeasured factors: $U_j = (U_{E_j}, U_{W_j}, U_{A_j}, U_{Y_j})$. By design in a CRT, the unmeasured factors $U_{A_j}$ determining the intervention assignment are independent of the other unmeasured factors. This model assumes clusters are independent, but encodes the many sources of dependence within a cluster. For example, the joint error term $U_{Y_j}$ induces correlation among participants’ outcomes within a cluster. Furthermore, within cluster $j$, an individual’s outcome may depend on the covariates of others in the same cluster, but not on the covariates of individuals belonging to other clusters $j' \neq j$.

We now consider the data generating process for the PTBi study. We assume each health facility, representing the cluster, is sampled from a target population of interest. For cluster $j$, we measure facility-level baseline characteristics $E_j$, including the average monthly delivery volume, facility preparedness assessment score, staff-to-delivery ratio, and community type (i.e., urban versus rural). The facility is then randomly assigned to the intervention ($A_j = 1$) or control ($A_j = 0$) arm. When a mother delivers her baby, the covariates $W_{ij}$ for the mother-baby dyad are collected. These include the mother’s characteristics, such as age, last menstrual period, and health insurance type, and the baby’s characteristics, such as sex, weight, length, and arm circumference. Finally, the baby’s vital status is recorded; $Y_{ij}$ is an indicator of infant death within 28 days. Over the course of study follow-up, we observe many such deliveries, but for the primary outcome of interest, we restrict to preterm births, defined as births before 37 weeks of gestation. This process is repeated for each of the $J$ facilities in the sample.

In the PTBi study, it is reasonable to assume that one mother-baby’s covariates are independent of another’s, and that one baby’s outcome does not impact another’s. In other words, the causal dependence is more restricted than in Model 1. Therefore, we propose the following model to represent the hierarchical data generating process for facility $j$ in the PTBi study:

(2)   $E_j = f_E(U_{E_j}); \quad W_{ij} = f_W(E_j, U_{W_{ij}}); \quad A_j = f_A(E_j, U_{A_j}); \quad Y_{ij} = f_Y(E_j, W_{ij}, A_j, U_{Y_{ij}})$

Now we assume the functions $(f_E, f_W, f_A, f_Y)$ are common across clusters $j$ and individuals $i$. We also assume each individual’s outcome is drawn from a common distribution, and that this outcome does not depend on the measured covariates of the other individuals in that cluster. This model may be more appropriate when outcomes are not socially or biologically transmitted, as in the real-data example of PTBi.

2.2 Counterfactuals and Target Causal Effects

For either causal model, we generate counterfactual outcomes by replacing the structural equation for the treatment with the desired intervention. Let $Y_{ij}(a)$ be the counterfactual outcome for individual $i$ in cluster $j$ if, possibly contrary-to-fact, their cluster received treatment-level $A_j = a$. In the PTBi study, $Y_{ij}(1)$ represents the 28-day vital status for infant $i$ in facility $j$ if that facility had been randomized to the intervention arm ($a = 1$), while $Y_{ij}(0)$ represents the 28-day vital status for infant $i$ in facility $j$ if that facility had been randomized to the control arm ($a = 0$).

We can also define counterfactual outcomes at the cluster-level by taking aggregates of the individual-level ones. While many summary measures are possible, we focus on weighted sums of the participants in cluster $j$:

(3)   $Y^c_j(a) = \sum_{i=1}^{N_j} \alpha_{ij} \, Y_{ij}(a)$

Commonly, this weight is selected to be the inverse of the cluster size and thus constant across participants in a cluster: $\alpha_{ij} = 1/N_j$ for all $i$. With this choice, $Y^c_j(a)$ is the average counterfactual outcome for the participants in cluster $j$. For a binary individual-level outcome, using the inverse cluster size yields a cluster-level outcome corresponding to a proportion or probability. In PTBi, for example, $Y^c_j(a)$ is the counterfactual cumulative incidence of death by 28 days if, possibly contrary-to-fact, facility $j$ received treatment-level $a$. Of course, other weighting schemes can be used to summarize the individual-level counterfactual outcomes to the cluster-level.

Then, by applying a summary measure to the distribution of counterfactual outcomes, we define the causal effect corresponding to our research query. A wide variety of causal parameters can be expressed as the empirical mean over the sample [neyman1923, rubin1990, imbens_2004, imai_2008, balzer_sate]:

(4)   $\psi^{c,S}(a) = \frac{1}{J} \sum_{j=1}^{J} Y^c_j(a) = \frac{1}{J} \sum_{j=1}^{J} \sum_{i=1}^{N_j} \alpha_{ij} \, Y_{ij}(a)$

where again $\alpha_{ij}$ is the user-specified weight. Throughout, superscript $c$ denotes a summary of cluster-level outcomes, and superscript $S$ denotes a sample parameter. As the number of clusters grows ($J \to \infty$), the sample parameter converges to the expectation over the population of clusters:

(5)   $\psi^{c}(a) = E\big[ Y^c(a) \big]$

In words, $\psi^{c}(a)$ is the expected cluster-level counterfactual outcome if all clusters in the population had been assigned to treatment-level $a$, whereas $\psi^{c,S}(a)$ (in Eq. 4) is the average cluster-level counterfactual outcome for the clusters in the CRT. For the PTBi study with weights $\alpha_{ij} = 1/N_j$, $\psi^{c}(a)$ represents the expected incidence of 28-day mortality among preterm infants if all health facilities in the target population had been assigned to treatment arm $a$, while $\psi^{c,S}(a)$ represents the average incidence if the health facilities included in the study had been assigned to treatment arm $a$.

By taking contrasts of these treatment-specific causal parameters, we can define causal effects on any scale of interest. For example, we may be interested in the relative effect for the clusters included in the study: $\psi^{c,S}(1) / \psi^{c,S}(0)$. Alternatively, we may be interested in the difference at the population-level (a.k.a., the average treatment effect): $\psi^{c}(1) - \psi^{c}(0)$. For simplicity, we refer to contrasts defined using Eq. 4 or Eq. 5 as “cluster-level effects”.

Letting $N_{\bullet} = \sum_{j=1}^{J} N_j$ be the total number of participants in the CRT, we can also use Eq. 4 to define causal effects at the individual-level by setting $\alpha_{ij} = J / N_{\bullet}$:

(6)   $\psi^{I,S}(a) = \frac{1}{N_{\bullet}} \sum_{j=1}^{J} \sum_{i=1}^{N_j} Y_{ij}(a)$

As the sample size grows, this converges to the expectation over the target population of clusters, each containing a finite number of participants:

(7)   $\psi^{I}(a) = \frac{E\big[ \sum_{i=1}^{N} Y_{i}(a) \big]}{E[N]}$

In words, $\psi^{I}(a)$ is the expected individual-level outcome if all clusters in the target population received treatment-level $a$, whereas $\psi^{I,S}(a)$ (in Eq. 6) is the average individual-level outcome for the participants in the CRT. In the PTBi study, $\psi^{I}(a)$ represents the counterfactual risk of mortality for a preterm infant if all health facilities in the target population received treatment-level $a$, while $\psi^{I,S}(a)$ represents the counterfactual proportion of preterm infants who would die if all health facilities in the CRT received treatment-level $a$. As before, we can take the difference or ratio of these treatment-specific parameters to define causal effects. For simplicity, we refer to contrasts defined using Eqs. 6 and 7 as “individual-level effects”.

In summary, we can define a wide variety of causal parameters by considering alternative summary measures of the individual-level or cluster-level counterfactual outcomes. (Further generalizations are available in Appendix A.) When the cluster size varies (i.e., $N_j \neq N$ for some $j$), causal parameters giving equal weight to clusters (e.g., Eq. 4) will generally differ from causal parameters giving equal weight to participants (e.g., Eq. 6). Cluster-level and individual-level effects can diverge drastically when the cluster size modifies the intervention effect [Seaman2014, nevail_ics2014]. Even if the cluster size is constant (i.e., $N_j = N$ for all $j$), the cluster-level and individual-level effects have subtly different interpretations, as discussed previously. For completeness, we defined both sample-specific and population-level measures. In many scenarios, however, the selected study clusters correspond to the target population of interest. Additionally, estimation procedures are identical for the population and sample effects [balzer_sate]. Therefore, for ease of presentation, we focus on the population parameters (Eqs. 5 and 7) for the remainder of the paper.
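As a toy illustration of this distinction (the numbers below are hypothetical and not drawn from PTBi), the following R snippet contrasts the cluster-level summary, which gives each cluster equal weight, with the individual-level summary, which gives each participant equal weight. When cluster size is associated with the outcome, the two summaries differ.

```r
# Minimal sketch (hypothetical numbers): with weights alpha_ij = 1/N_j, the
# cluster-level summary gives every cluster equal weight, whereas the
# individual-level summary gives every participant equal weight.
N  <- c(10, 20, 200)                      # cluster sizes
p  <- c(0.30, 0.20, 0.05)                 # within-cluster outcome proportions
Yc <- p                                   # cluster-level outcomes (alpha_ij = 1/N_j)

cluster_level    <- mean(Yc)              # equal weight per cluster      -> ~0.183
individual_level <- sum(N * p) / sum(N)   # equal weight per participant  -> ~0.074
c(cluster_level, individual_level)
```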

2.3 Observed Data and Identifiability

For a given cluster, the observed data are the set of measured cluster-level covariates, the matrix of individual-level covariates, the randomized treatment, and the vector of individual-level outcomes: $O_j = (E_j, W_j, A_j, Y_j)$. As before, we can summarize the outcome vector in a variety of ways. We again consider weighted sums of the individual-level outcomes within a cluster:

(8)   $Y^c_j = \sum_{i=1}^{N_j} \alpha_{ij} \, Y_{ij}$

where $\alpha_{ij}$ matches the definition of the cluster-level counterfactual outcome in Eq. 3. Throughout, we focus on the empirical mean outcome within each cluster: $\alpha_{ij} = 1/N_j$ for all $i$. However, as previously discussed, we can consider alternative weights depending on our research question.

We assume the observed data were generated by sampling $J$ times from a data generating process compatible with either of the causal models described above. This provides a link between the causal model and the statistical model, which is the set of possible distributions of the observed data [rose_tl]. Both causal models (Eqs. 1 and 2) encode that the cluster-level treatment is randomized and thus imply a semi-parametric statistical model $\mathcal{M}$. We denote the true underlying distribution that generated the observed data as $P_0 \in \mathcal{M}$. Thus, we have $J$ independent, identically distributed (i.i.d.) copies of $O_j \sim P_0$.

To identify the expected counterfactual cluster-level outcome or the expected counterfactual individual-level outcome as a function of the observed data distribution, we require the following two assumptions, which are satisfied by design in a CRT. First, there must be no unmeasured confounding, such that $Y_j(a) \perp A_j \mid (E_j, W_j)$. Second, there must be a positive probability of receiving each treatment-level: $P_0(A_j = a \mid E_j, W_j) > 0$ for $a \in \{0, 1\}$. Given these conditions are met in a CRT, we can express the cluster-level causal parameter as the expected cluster-level outcome under the treatment-level of interest: $\psi^{c}(a) = E_0(Y^c \mid A = a)$, where subscript 0 is used to denote the observed data distribution $P_0$; a proof is provided in [hierarchical]. Likewise, the individual-level causal parameter equals the corresponding expectation of the individual-level outcome under the treatment-level of interest.

We can gain efficiency in CRTs by adjusting for baseline covariates (e.g., [rubin_vdl_max_eff_2008, adaptive_prespec, murray, moultonhayes]). Specifically, the cluster-level estimand can be written as the conditional expectation of the cluster-level outcome, given the treatment-level of interest and the baseline covariates, averaged over the covariate distribution [hierarchical]. In practice, we cannot adjust for the entire matrix of individual-level covariates during estimation; however, we can still improve precision by considering summary measures of this matrix as cluster-level covariates $W^c_j$. Then, the cluster-level statistical estimand can be written as

(9)   $\Psi^{c}(a) = E_0\big[ E_0(Y^c \mid A = a, E, W^c) \big]$

Similarly, the individual-level estimand can be written as the conditional expectation of the individual-level outcome, given the treatment-level of interest and baseline covariates, and then averaged over the covariate distribution:

(10)   $\Psi^{I}(a) = E_0\big[ E_0(Y_{ij} \mid A = a, E, W_{ij}) \big]$

recognizing the slight abuse of notation because the observed data are at the cluster-level. Here, $W_{ij}$ denotes the set of covariates for a given individual and $Y_{ij}$ denotes their outcome. We emphasize that covariate adjustment is being used for efficiency gains only and not to control for confounding.

As before, we can take contrasts of the cluster-level or individual-level estimands corresponding to the causal effect of interest. To give context for the methods comparison, we focus on the relative scale for the remainder of the paper. Specifically, the relative effect is identified at the cluster-level as

(11)   $\Psi^{c}(1) \,/\, \Psi^{c}(0)$

and at the individual-level as

(12)   $\Psi^{I}(1) \,/\, \Psi^{I}(0)$

As detailed below, most analytic methods naturally only estimate one of the above statistical estimands, whereas few have the flexibility to estimate both.

3 Statistical Methods: Estimation

In this section, we compare the methods commonly used to analyze CRTs. We aim to describe their target of inference and their ability to adjust for baseline covariates to improve precision, and thereby statistical power, while preserving Type-I error control. We broadly consider two classes of estimation methods: approaches based on cluster-level data and approaches using both individual-level and cluster-level data. Cluster-level approaches immediately aggregate the data to the cluster; therefore, any covariate adjustment must be at the cluster-level. Examples of these approaches include the t-test and TMLE, a general class of doubly robust, semiparametric efficient, plug-in estimators [rose_tl]. Cluster-level approaches naturally target cluster-level effects (e.g., Eq. 11), but with the appropriate choice of weights can also target individual-level effects (Appendix A).

Individual-level approaches allow for adjustment for individual-level covariates, an appealing option, as these pair naturally with individual-level outcomes. The individual-level methods we will consider include GEE, CARE, and hierarchical TMLE [moultonhayes, liang-zeger, hierarchical]. As detailed below, these individual-level approaches commonly target different causal effects. To the best of our understanding, only TMLE can target effects defined at the cluster-level (e.g., Eq. 11) or at the individual-level (e.g., Eq. 12).

We now define the notation used throughout this section. Recall the cluster-level outcome $Y^c_j$ is defined as a weighted sum of the individual-level outcomes, as in Eq. 8. We denote the conditional expectation of the cluster-level outcome given the cluster-level intervention $A$, cluster-level covariates $E$, and cluster-level summaries $W^c$ as

(13)   $\bar{Q}^{c}(A, E, W^c) \equiv E_0(Y^c \mid A, E, W^c)$

Likewise, we denote the conditional expectation of the individual-level outcome given the cluster-level intervention $A$, cluster-level covariates $E$, and that individual’s covariates $W_{ij}$ as

(14)   $\bar{Q}(A, E, W_{ij}) \equiv E_0(Y_{ij} \mid A, E, W_{ij})$

Throughout, we refer to Eq. 13 and Eq. 14 as the cluster-level and individual-level outcome regressions, respectively. The unadjusted expectations of the cluster-level and individual-level outcomes within intervention arm $a$ are defined as $\bar{Q}^{c}(a) \equiv E_0(Y^c \mid A = a)$ and $\bar{Q}(a) \equiv E_0(Y_{ij} \mid A = a)$, respectively. We denote the cluster-level propensity score as

(15)   $g^{c}(a \mid E, W^c) \equiv P_0(A = a \mid E, W^c)$

and the individual-level propensity score as

(16)   $g(a \mid E, W_{ij}) \equiv P_0(A = a \mid E, W_{ij})$

We define the unadjusted probabilities $g^{c}(a) \equiv P_0(A = a)$ and $g(a)$ analogously.

3.1 Cluster-Level Approaches

Cluster-level approaches obtain point estimates and inference after the individual-level data have been aggregated to the cluster-level [moultonhayes, murray]. Most commonly, this aggregation is done by taking the empirical mean within each cluster. However, as previously detailed in Section 2.2, we can consider several ways to summarize the individual-level data to the cluster-level (i.e., weighting schemes).

3.1.1 Unadjusted

Once the data are aggregated to the cluster-level, a common approach for estimation and inference is based on contrasts of the treatment-specific averages:

(17)   $\hat{\Psi}^{c}_{unadj}(a) = \frac{1}{J} \sum_{j=1}^{J} \frac{\mathbb{I}(A_j = a)}{\hat{g}^{c}(a)} \, Y^c_j$

where $\hat{g}^{c}(a)$ denotes the unadjusted estimate of the cluster-level propensity score (i.e., the proportion of clusters in the study receiving treatment-level $a$). For simplicity, for the remainder of the manuscript, we assume the trial has equal allocation to the arms, such that $\hat{g}^{c}(1) = \hat{g}^{c}(0) = 0.5$. Then, if we let $Y^c_{j(a)}$ denote the cluster-level outcome for observation $j$ in treatment arm $a$, the treatment-specific mean simplifies to the arithmetic mean of the cluster-level outcomes among the $J/2$ clusters in that arm: $\hat{\Psi}^{c}_{unadj}(a) = \frac{2}{J} \sum_{j=1}^{J/2} Y^c_{j(a)}$. In a CRT, $\hat{\Psi}^{c}_{unadj}(a)$ provides an unbiased estimator of $\psi^{c}(a)$. In PTBi, $\hat{\Psi}^{c}_{unadj}(a)$ represents the average incidence of 28-day infant mortality among facilities that received intervention-level $a$.

With estimates $\hat{\Psi}^{c}_{unadj}(1)$ and $\hat{\Psi}^{c}_{unadj}(0)$, we can obtain a point estimate of the intervention effect by contrasting them on the scale of interest and obtaining inference with a t-test. Statistical power may be improved by considering alternative weighting schemes when summarizing individual-level outcomes to the cluster-level; however, as previously discussed, different choices of the weights $\alpha_{ij}$ imply different target parameters.

For the relative effect (Eq. 11), applying a logarithmic transformation is sometimes recommended when the cluster-level summaries are skewed, which may be more common for rate-type outcomes [moultonhayes]. However, it is important to note that, depending on how this transformation is implemented, the resulting target parameter may be the ratio of the geometric means, as opposed to the ratio of the arithmetic means. (Recall that for $n$ observations of some variable $X$, the geometric mean is $(\prod_{i=1}^{n} X_i)^{1/n}$, whereas the arithmetic mean is $\frac{1}{n} \sum_{i=1}^{n} X_i$.) Specifically, suppose we first take the log of the cluster-level outcomes and then take the average within each arm:

(18)   $\frac{2}{J} \sum_{j=1}^{J/2} \log\big( Y^c_{j(a)} \big)$

where again $Y^c_{j(a)}$ denotes the cluster-level outcome for cluster $j$ in treatment arm $a$. Applying a t-test to the difference in these treatment-specific means (and then exponentiating) targets the ratio of the geometric means. To avoid changing the target of inference, we can instead apply the Delta Method to obtain point estimates and inference for the arithmetic mean ratio (Eq. 11). We refer the reader to [Moore2009] for more details.
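For concreteness, the following R sketch (with simulated cluster-level proportions; the variable names and distributions are ours, not the paper’s) contrasts the arithmetic-mean ratio of Eq. 11 with the geometric-mean ratio implied by t-testing the log-transformed cluster-level outcomes.

```r
# Minimal sketch (hypothetical data): unadjusted cluster-level analysis on the
# relative scale, contrasting the arithmetic-mean ratio with the geometric-mean
# ratio obtained by averaging log-transformed cluster-level outcomes.
set.seed(1)
J  <- 20
A  <- rep(c(1, 0), each = J / 2)                        # randomized arm
Yc <- ifelse(A == 1, rbeta(J, 2, 18), rbeta(J, 3, 17))  # cluster-level proportions

# Arithmetic-mean ratio (targets Eq. 11)
arith_ratio <- mean(Yc[A == 1]) / mean(Yc[A == 0])

# Geometric-mean ratio: t-test on log-transformed cluster-level outcomes
tt <- t.test(log(Yc[A == 1]), log(Yc[A == 0]))
geom_ratio <- exp(mean(log(Yc[A == 1])) - mean(log(Yc[A == 0])))
geom_ci    <- exp(tt$conf.int)

c(arithmetic = arith_ratio, geometric = geom_ratio)
```

The two ratios generally differ, so the choice of transformation changes the target of inference, not just the analysis.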

3.1.2 Cluster-level TMLE with Adaptive Pre-specification

As previously discussed, statistical power is often improved by adjusting for cluster-level covariates $E$ and cluster-level summaries $W^c$ of the individual-level covariates. Specifically, once the data have been aggregated to the cluster-level, we can proceed with estimation and inference for $\Psi^{c}(a)$ as if the data were i.i.d. Examples of common algorithms for doing so include parametric G-computation, inverse probability of treatment weighting estimators (IPTW), and TMLE [rose_tl]. Due to randomization of the intervention, these algorithms will be consistent, even under misspecification of the outcome regression [rct_rosenblum_2010]. Since these methods are well-established and their implementation is identical to the non-clustered setting, we focus our discussion on alternative implementations that can harness both individual-level and cluster-level covariates to increase precision. Before doing so, we briefly review the steps of a cluster-level TMLE and discuss a solution for selection of the adjustment covariates in CRTs with limited numbers of clusters.

To implement a cluster-level TMLE, we first obtain an initial estimator $\hat{\bar{Q}}^{c}(A, E, W^c)$ of the expected cluster-level outcome. Next, we update this initial estimator using information contained in the estimated propensity score $\hat{g}^{c}(a \mid E, W^c)$. Specifically, we define the “clever covariate” as the inverse of the estimated propensity score for cluster $j$:

$\hat{H}_j(a) = \frac{\mathbb{I}(A_j = a)}{\hat{g}^{c}(a \mid E_j, W^c_j)}$

Then, on the logit scale, we regress the cluster-level outcome $Y^c_j$ on the covariates $\hat{H}_j(1)$ and $\hat{H}_j(0)$ with the initial estimator as the intercept. This provides the following targeted estimator, while simultaneously solving the efficient score equation:

(19)   $\mathrm{logit}\big[ \hat{\bar{Q}}^{c,*}(A_j, E_j, W^c_j) \big] = \mathrm{logit}\big[ \hat{\bar{Q}}^{c}(A_j, E_j, W^c_j) \big] + \hat{\epsilon}_1 \hat{H}_j(1) + \hat{\epsilon}_0 \hat{H}_j(0)$

where $\hat{\epsilon}_1$ and $\hat{\epsilon}_0$ denote the estimated coefficients for $\hat{H}_j(1)$ and $\hat{H}_j(0)$, respectively. Finally, we obtain a point estimate of the treatment-specific mean by averaging the targeted predictions of the cluster-level outcomes across the clusters:

$\hat{\Psi}^{c}(a) = \frac{1}{J} \sum_{j=1}^{J} \hat{\bar{Q}}^{c,*}(a, E_j, W^c_j)$

To recover the desired target effect, we can contrast these estimates on the scale of interest, such as $\hat{\Psi}^{c}(1) / \hat{\Psi}^{c}(0)$. The variance of asymptotically linear estimators, such as the TMLE, may be estimated using the estimator’s influence function. These types of estimators enjoy properties that follow from the Central Limit Theorem, allowing us to construct 95% Wald-type confidence intervals. For the treatment-specific mean $\Psi^{c}(a)$, for example, the influence function and asymptotic variance for the cluster-level TMLE are well-approximated as

$\hat{D}^{c}_j(a) = \hat{H}_j(a)\big( Y^c_j - \hat{\bar{Q}}^{c,*}(a, E_j, W^c_j) \big) + \hat{\bar{Q}}^{c,*}(a, E_j, W^c_j) - \hat{\Psi}^{c}(a)$

and $\widehat{\mathrm{Var}}\big[ \hat{D}^{c}_j(a) \big] / J$, respectively. Alternatively, as described shortly, we can use a cross-validated estimate of the influence function during variance estimation [adaptive_prespec].
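To make these steps concrete, here is a minimal R sketch of a cluster-level TMLE for the treatment-specific means and their ratio. The data, variable names, and candidate working models are of our own choosing for illustration; this is a sketch of the targeting and influence-function calculations above, not the authors’ software.

```r
# Minimal sketch: cluster-level TMLE with a logistic fluctuation and two clever
# covariates, followed by influence-function-based inference for the ratio.
set.seed(1)
J  <- 20
E  <- rnorm(J)                                   # cluster-level covariate
A  <- rep(c(1, 0), each = J / 2)                 # randomized arm
Yc <- plogis(-2 + 0.4 * E - 0.5 * A + rnorm(J, sd = 0.3))  # cluster-level proportion

# Step 1: initial outcome regression (quasibinomial handles proportions in (0,1))
Q.fit <- glm(Yc ~ A + E, family = quasibinomial())
Q.A   <- predict(Q.fit, type = "response")
Q.1   <- predict(Q.fit, newdata = data.frame(A = 1, E = E), type = "response")
Q.0   <- predict(Q.fit, newdata = data.frame(A = 0, E = E), type = "response")

# Step 2: propensity score (known by design; estimated here as the sample proportion)
g1 <- mean(A); g0 <- 1 - g1
H1 <- as.numeric(A == 1) / g1
H0 <- as.numeric(A == 0) / g0

# Step 3: targeting -- logistic fluctuation with the initial fit as offset
eps <- coef(glm(Yc ~ -1 + H1 + H0 + offset(qlogis(Q.A)), family = quasibinomial()))
Q1.star <- plogis(qlogis(Q.1) + eps["H1"] / g1)
Q0.star <- plogis(qlogis(Q.0) + eps["H0"] / g0)

# Step 4: plug-in estimates and influence-function-based inference for the ratio
psi1 <- mean(Q1.star); psi0 <- mean(Q0.star)
D1 <- H1 * (Yc - Q1.star) + Q1.star - psi1
D0 <- H0 * (Yc - Q0.star) + Q0.star - psi0
IC <- D1 / psi1 - D0 / psi0                      # IC for log(psi1/psi0) via Delta method
se <- sqrt(var(IC) / J)
exp(log(psi1 / psi0) + c(est = 0, lo = -1.96, hi = 1.96) * se)
```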

In a CRT, the propensity score is known and does not need to be estimated. However, further gains in efficiency can be achieved through estimation of the propensity score [rct_rosenblum_2010, rubin_vdl_max_eff_2008]. If both the cluster-level outcome regression and the cluster-level propensity score are consistently estimated, the TMLE will be an asymptotically efficient estimator. However, consistent estimation of the outcome regression is nearly impossible when using an a priori specified regression model. To improve precision while preserving Type-I error control, we previously proposed “Adaptive Prespecification”, a supervised learning approach that uses sample-splitting to choose the adjustment set that maximizes efficiency [adaptive_prespec].

Briefly, we prespecify a set of candidate generalized linear models (GLMs) for the cluster-level outcome regression $\bar{Q}^{c}$ and propensity score $g^{c}$. To reduce the potential for over-fitting in small trials, we recommend limiting these GLMs to having only one or two covariates. We also prespecify a cross-validation scheme; for small trials, we recommend leave-one-cluster-out. To measure performance, we prespecify the squared influence function as our loss function. Then we choose the candidate estimator of $\bar{Q}^{c}$ that minimizes the cross-validated variance estimate using the influence function based on the known propensity score (i.e., $g^{c}(a) = 0.5$). We then select the candidate estimator of the propensity score that further minimizes the cross-validated variance estimate using the influence function when combined with the previously selected estimator of $\bar{Q}^{c}$. Together, the selected estimators of $\bar{Q}^{c}$ and $g^{c}$ form the “optimal” TMLE according to the principle of empirical efficiency maximization [rubin_vdl_max_eff_2008]. To account for the data-adaptive procedure used to select this TMLE, we can use a cross-validated estimate of the influence function for inference. However, with very few candidates, the cross-validated inference may be overly conservative, and the standard approach to inference may be preferable [adaptive_prespec]. Finally, we note that to incorporate a pair-matched design, the pair should be considered the independent unit for variance estimation and sample-splitting; specifically, we recommend leave-one-pair-out cross-validation.
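The selection step can be sketched as follows in R. This is a simplified illustration under assumptions of our own (equal allocation, a risk-difference influence function, and hypothetical candidate adjustment variables E1 and E2); it is not the authors’ implementation.

```r
# Minimal sketch: choose the candidate outcome-regression adjustment that minimizes
# the leave-one-cluster-out cross-validated variance of the influence function,
# holding the propensity score fixed at its known value of 0.5.
cv_var_ic <- function(adj, data) {
  J  <- nrow(data)
  ic <- numeric(J)
  g1 <- 0.5                                     # known by design (equal allocation)
  for (j in seq_len(J)) {                       # leave-one-cluster-out
    train <- data[-j, ]
    valid <- data[j, , drop = FALSE]
    f  <- if (is.null(adj)) Yc ~ A else reformulate(c("A", adj), response = "Yc")
    Q  <- glm(f, data = train, family = quasibinomial())
    Q1 <- predict(Q, newdata = transform(valid, A = 1), type = "response")
    Q0 <- predict(Q, newdata = transform(valid, A = 0), type = "response")
    H  <- ifelse(valid$A == 1, 1 / g1, -1 / (1 - g1))
    # influence function for the risk difference with a known propensity score
    ic[j] <- H * (valid$Yc - ifelse(valid$A == 1, Q1, Q0)) + (Q1 - Q0)
  }
  var(ic)
}

# Hypothetical use with cluster-level data and three candidate working models
set.seed(1)
data <- data.frame(Yc = rbeta(20, 2, 10), A = rep(c(1, 0), 10),
                   E1 = rnorm(20), E2 = rnorm(20))
candidates <- list(NULL, "E1", "E2")            # unadjusted, adjust for E1, adjust for E2
cv_vars    <- sapply(candidates, cv_var_ic, data = data)
chosen     <- candidates[[which.min(cv_vars)]]
```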

3.1.3 Targeting an Individual-level Effect with a Cluster-Level Approach

With or without Adaptive Prespecification, the previously described cluster-level TMLE naturally estimates a cluster-level effect, defined using contrasts of the cluster-level counterfactuals $Y^c_j(a)$ for $a \in \{0, 1\}$. Furthermore, the unadjusted effect estimator can be considered a special case of the TMLE in which the outcome regression is estimated without covariate adjustment. With these cluster-level analytic approaches, we can alternatively estimate a causal effect corresponding to contrasts of the individual-level counterfactuals $Y_{ij}(a)$ for $a \in \{0, 1\}$. To do so, we still implement a cluster-level analysis but include the appropriate weights in each step (Appendix A).

3.2 Individual-level Approaches

We now discuss how to leverage individual-level covariates when estimating effects in CRTs. This can be done by aggregating to the cluster-level after estimating the expected individual-level outcome or by implementing a fully individual-level approach. In all cases, clustering must be accounted for during variance estimation. These estimators vary in their flexibility to estimate both cluster-level effects (e.g., Eq. 11) and individual-level effects (e.g., Eq. 12).

3.2.1 Individual-level TMLE

In Section 3.1, we discussed a cluster-level TMLE for estimating effects in CRTs based on aggregating the data to the cluster-level and applying a TMLE for i.i.d. data. We now present an individual-level TMLE, which accounts for the hierarchical nature of the data while leveraging the natural pairing of individual-level outcomes with individual-level covariates [hierarchical]. In Appendix B, we present a second TMLE that can also be applied to estimate effects in CRTs. Both extend to a pair-matched design [hierarchical].

A valid analytic approach for clustered data settings, such as CRTs, is to apply an algorithm appropriate for individual-level data for point estimation and then account for clustering during variance estimation [Schnitzer2014, hierarchical]. Specifically, we could pool participants over clusters, implement an individual-level TMLE, and adjust statistical inference to respect that the cluster is the independent unit. Briefly, we now consider an initial estimator $\hat{\bar{Q}}(A, E, W_{ij})$ of the individual-level outcome regression, which is updated using the individual-level propensity score $\hat{g}(a \mid E, W_{ij})$. Both estimators would be fit by pooling individuals across clusters. As before, we calculate the “clever covariate”, but now at the individual-level:

$\hat{H}_{ij}(a) = \frac{\mathbb{I}(A_j = a)}{\hat{g}(a \mid E_j, W_{ij})}$ for $a \in \{0, 1\}$.

Then, on the logit scale, we would regress the individual-level outcome $Y_{ij}$ on the individual-level covariates $\hat{H}_{ij}(1)$ and $\hat{H}_{ij}(0)$ with the initial individual-level estimator as the intercept. This provides the following updated estimator of the expectation of the individual-level outcome, while simultaneously solving the efficient score equation:

(20)   $\mathrm{logit}\big[ \hat{\bar{Q}}^{*}(A_j, E_j, W_{ij}) \big] = \mathrm{logit}\big[ \hat{\bar{Q}}(A_j, E_j, W_{ij}) \big] + \hat{\epsilon}_1 \hat{H}_{ij}(1) + \hat{\epsilon}_0 \hat{H}_{ij}(0)$

where $\hat{\epsilon}_1$ and $\hat{\epsilon}_0$ now denote the estimated coefficients for $\hat{H}_{ij}(1)$ and $\hat{H}_{ij}(0)$. Then we would obtain a point estimate by averaging these targeted predictions:

$\hat{\Psi}^{I}(a) = \frac{1}{N_{\bullet}} \sum_{j=1}^{J} \sum_{i=1}^{N_j} \hat{\bar{Q}}^{*}(a, E_j, W_{ij})$

where $N_{\bullet}$ denotes the total number of participants. To obtain statistical inference for $\hat{\Psi}^{I}(a)$, we aggregate an individual-level influence function to the cluster-level and then take the sample variance of the estimated cluster-level influence function, scaled by $J$, as detailed in [Schnitzer2014]. Specifically, the influence function for $\Psi^{I}(a)$ is well-approximated by the cluster-level summary $\hat{D}^{c}_j(a) = \frac{J}{N_{\bullet}} \sum_{i=1}^{N_j} \hat{D}_{ij}(a)$, where $\hat{D}_{ij}(a) = \hat{H}_{ij}(a)\big( Y_{ij} - \hat{\bar{Q}}^{*}(a, E_j, W_{ij}) \big) + \hat{\bar{Q}}^{*}(a, E_j, W_{ij}) - \hat{\Psi}^{I}(a)$.

Altogether, this individual-level approach naturally targets the individual-level treatment-specific mean and therefore the individual-level effect, as in Eq. 12. However, as detailed in Balzer et al. [hierarchical], we can also target the cluster-level treatment-specific mean and therefore the cluster-level effect, as in Eq. 11, by incorporating the weights in each step, including variance estimation. For either choice of target parameter, the ltmle package can be used to obtain point estimates and statistical inference by specifying id as the cluster and observation.weights equal to 1 for the individual-level effect and proportional to the inverse of the cluster size for the cluster-level effect [ltmlepackage].
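For illustration, a minimal call might look like the following R sketch with simulated data. The column names, data-generating values, and default learners are our own choices; the id and observation.weights arguments are those referenced above.

```r
# Minimal sketch (simulated data): individual-level TMLE via the ltmle package,
# treating the cluster as the independent unit for inference.
library(ltmle)

set.seed(1)
J  <- 20
Nj <- rpois(J, 30) + 10                         # cluster sizes
df <- data.frame(
  id = rep(1:J, Nj),                            # cluster identifier
  E  = rep(rnorm(J), Nj),                       # cluster-level covariate
  A  = rep(rep(c(1, 0), each = J / 2), Nj)      # cluster-level arm
)
df$W <- rnorm(nrow(df))                         # individual-level covariate
df$Y <- rbinom(nrow(df), 1, plogis(-1 + 0.3 * df$E + 0.2 * df$W - 0.4 * df$A))

fit <- ltmle(
  data   = df[, c("E", "W", "A", "Y")],         # nodes in temporal order
  Anodes = "A",
  Ynodes = "Y",
  abar   = list(1, 0),                          # compare A = 1 vs A = 0
  id     = df$id,                               # cluster as the independent unit
  # weights of 1 target the individual-level effect;
  # weights proportional to 1/N_j would target the cluster-level effect
  observation.weights = rep(1, nrow(df))
)
summary(fit)                                    # risk ratio and difference with CIs
```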

As before, we recommend using Adaptive Prespecification to select among candidate estimators of the individual-level outcome regression $\bar{Q}$ and propensity score $g$ the pair that maximizes the empirical efficiency for the target effect of interest. Leveraging the larger effective sample size of the individual-level TMLE, we could also consider using machine learning methods, such as Super Learner, to expand the candidate estimators of $\bar{Q}$ and $g$ beyond GLMs [superlearner_2007]. In all cases, the cross-validation scheme must respect the cluster as the unit of independence.

3.2.2 Covariate-adjusted Residuals Estimator (CARE)

The covariate-adjusted residuals estimator (CARE) was first proposed in [gail-care] and later popularized by [moultonhayes]. In CARE, we pool individuals across clusters and regress the individual-level outcome $Y_{ij}$ on the individual-level and cluster-level covariates of interest, but not on the cluster-level intervention $A_j$. Then the predictions from this regression are aggregated to the cluster-level. Finally, a t-test comparing the mean residuals (i.e., the discrepancies between observed and predicted outcomes) across arms is performed, since the average residuals should be the same between arms under the null hypothesis. This estimator’s target estimand is a cluster-level effect, such as Eq. 11 on the relative scale.

Unlike TMLE with Adaptive Prespecification, CARE relies on a fixed specification of the individual-level outcome regression [gail-care]. In more detail, for the relative effect (Eq. 11) and a binary outcome $Y_{ij}$, we could, for example, fit the following individual-level logistic regression

(21)   $\mathrm{logit}\big[ P(Y_{ij} = 1 \mid E_j, W_{ij}) \big] = \beta_0 + \beta_E E_j + \beta_W W_{ij}$

where $\beta_E$ and $\beta_W$ denote the magnitude by which the log odds of the outcome for the $i$th individual in the $j$th cluster is affected (linearly) by the cluster-level covariates $E_j$ and the individual-level covariates $W_{ij}$, respectively. Once this model is fit, the expected number of events in the $j$th cluster is calculated as $\hat{e}_j = \sum_{i=1}^{N_j} \hat{P}(Y_{ij} = 1 \mid E_j, W_{ij})$ and compared with the observed number of events $o_j = \sum_{i=1}^{N_j} Y_{ij}$ through the ratio-residuals:

$\hat{R}_j = o_j \,/\, \hat{e}_j$

Hayes and Moulton [moultonhayes] note that these ratio-residuals are often right-skewed and recommend a logarithmic transformation. Specifically, they recommend applying a t-test to obtain point estimates and inference for the difference in the treatment-specific averages of the log-transformed residuals:

(22)   $\frac{2}{J} \sum_{j=1}^{J/2} \log\big( \hat{R}_{j(a)} \big)$

where $\hat{R}_{j(a)}$ denotes the ratio-residual for cluster $j$ in arm $a$. As detailed in Section 3.1.1, after exponentiation, we recover estimates and inference for the ratio of the geometric means and thereby a different target parameter than the standard risk ratio given in Eq. 11. A straightforward extension to the pair-matched design is illustrated in [moultonhayes].
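The following R sketch walks through CARE on simulated data (the variable names, coefficients, and distributions are our own, chosen only for illustration): fit the individual-level model without the trial arm, form cluster-level ratio-residuals, and t-test their logarithms between arms.

```r
# Minimal sketch of CARE with ratio-residuals for a binary outcome.
set.seed(1)
J <- 20; Nj <- rpois(J, 30) + 10
cluster <- rep(1:J, Nj)
A  <- rep(rep(c(1, 0), each = J / 2), Nj)       # cluster-level arm, repeated per person
E  <- rep(rnorm(J), Nj)                          # cluster-level covariate
W  <- rnorm(sum(Nj))                             # individual-level covariate
Y  <- rbinom(sum(Nj), 1, plogis(-1 + 0.3 * E + 0.2 * W - 0.4 * A))

# Individual-level regression EXCLUDING the intervention (Eq. 21)
fit <- glm(Y ~ E + W, family = binomial())

observed <- tapply(Y, cluster, sum)              # observed events per cluster
expected <- tapply(fitted(fit), cluster, sum)    # model-expected events per cluster
R   <- observed / expected                       # ratio-residuals (assumes >0 events)
arm <- tapply(A, cluster, unique)                # arm of each cluster

# t-test on log ratio-residuals; exponentiating targets a geometric-mean ratio
tt <- t.test(log(R[arm == 1]), log(R[arm == 0]))
exp(c(estimate = diff(rev(tt$estimate)), tt$conf.int))
```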

3.2.3 Generalized estimating equations (GEE)

We now consider a class of estimating equations, sometimes referred to as “population-average models”, for estimating effects in CRTs [ware-laird]. In GEE, estimation and inference are conducted at the individual-level, and a working correlation matrix is used to account for the dependence of outcomes within clusters. Therefore, the target of inference is now the individual-level effect, as in Eq. 12, instead of a cluster-level effect, as in Eq. 11.

In GEE, the expected individual-level response is modeled as a function of the treatment and possibly covariates of interest [ware-laird, hubbard_ahern]. Specifically, consider the following “marginal model” for the expected individual-level outcome:

(23)   $E(Y_{ij} \mid A_j) = h\big( \beta_0 + \beta_1 A_j \big)$

where $h(\cdot)$ denotes the inverse-link function. Commonly, the identity link is used for continuous outcomes, the log-link for count outcomes, and the logit-link for binary outcomes. Effect estimation in GEE is usually done by obtaining a point estimate and inference for the treatment coefficient $\beta_1$. At the individual-level, $\beta_1$ represents the causal risk difference for the identity link, $\exp(\beta_1)$ represents the causal risk ratio for the log-link, and $\exp(\beta_1)$ represents the causal odds ratio for the logit-link. In other words, the link function often determines the scale on which the effect is estimated.

As with other CRT approaches, GEE may improve efficiency by adjusting for covariates. Consider, for example, the following “conditional model” for the expected individual-level outcome:

(24)   $E(Y_{ij} \mid A_j, E_j, W_{ij}) = h\big( \beta_0 + \beta_1 A_j + \beta_2 E_j + \beta_3 W_{ij} \big)$

where again $h(\cdot)$ denotes the inverse-link function. Except for linear and log-linear models without interaction terms, the interpretation of $\beta_1$ is generally not the same as in the marginal model. For the logistic link function, for example, $\beta_1$ in Eq. 24 would yield the conditional log-odds ratio instead of the marginal log-odds ratio. However, a recent modification to GEE, presented next, allows for estimation of marginal effects, while adjusting for individual-level or cluster-level covariates.

For either a marginal or conditional specification, the GEE estimator solves the following estimating equation:

$\sum_{j=1}^{J} D_j^{\top} V_j^{-1} \big( Y_j - \mu_j(\beta) \big) = 0$

where $\mu_j(\beta)$ is the vector containing the individual-level outcome regressions for cluster $j$, $D_j = \partial \mu_j(\beta) / \partial \beta$ is the gradient matrix, and $V_j$ is the working correlation (covariance) matrix used to account for the dependence of individuals within a cluster [ware-laird].

GEE yields a consistent estimate of $\beta_1$ under the marginal model, even if the correlation matrix has been misspecified. However, under misspecification of the correlation matrix, the usual standard errors are not valid, and the sandwich variance estimator must be used [ware-laird]. In general, unless the number of clusters is relatively large and the number of participants within each cluster is relatively small, the sandwich-based standard errors can underestimate the true variance of $\hat{\beta}_1$ and yield confidence intervals with coverage probability below the desired nominal level. Corrections to variance estimation exist for when there are fewer than 38 degrees of freedom [murray, gee_correction]. The extension to a pair-matched analysis requires specifying a fixed effect for the pair while maintaining the correlation structure at the cluster-level. However, pair-matched analyses are discouraged for studies with fewer than 40 clusters [murray, gee_correction, moultonhayes].
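For concreteness, the sketch below fits marginal and covariate-adjusted GEEs with the geepack package on simulated data (the names, coefficients, and distributions are ours). A Poisson working likelihood with a log link is used as one common, numerically stable way to obtain a risk-ratio-scale coefficient for a binary outcome; the sandwich standard errors reported by geeglm account for the within-cluster correlation.

```r
# Minimal sketch: individual-level GEE with an exchangeable working correlation
# and robust (sandwich) standard errors via the geepack package.
library(geepack)

set.seed(1)
J <- 20; Nj <- rpois(J, 30) + 10
dat <- data.frame(
  cluster = rep(1:J, Nj),
  A = rep(rep(c(1, 0), each = J / 2), Nj),
  E = rep(rnorm(J), Nj)
)
dat$W <- rnorm(nrow(dat))
dat$Y <- rbinom(nrow(dat), 1, plogis(-1 + 0.3 * dat$E + 0.2 * dat$W - 0.4 * dat$A))

# Marginal model (Eq. 23): treatment only
gee_marg <- geeglm(Y ~ A, id = cluster, data = dat,
                   family = poisson(link = "log"), corstr = "exchangeable")

# Conditional model (Eq. 24): adjusting for covariates changes the interpretation
gee_cond <- geeglm(Y ~ A + E + W, id = cluster, data = dat,
                   family = poisson(link = "log"), corstr = "exchangeable")

exp(coef(gee_marg)["A"])     # estimated marginal risk ratio
summary(gee_marg)            # robust (sandwich) standard errors
```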

3.2.4 Augmented-GEE

As discussed in the previous subsection, the treatment coefficient $\beta_1$ resulting from GEE does not always correspond to the marginal effect. Recently, a modification to GEE was proposed to ensure that $\beta_1$ can be interpreted as a marginal effect, while simultaneously adjusting for baseline covariates to improve efficiency [aug-gee]. This approach is referred to as Augmented-GEE (Aug-GEE) and again targets individual-level effects, as in Eq. 12.

As commonly implemented, Aug-GEE modifies GEE by including an additional “augmentation” term which incorporates the conditional expectation of the individual-level outcome. The general form of the Aug-GEE for a binary treatment is given by

(25)   $\sum_{j=1}^{J} \Big[ D_j^{\top} V_j^{-1} \big( Y_j - \mu_j(\beta) \big) \;-\; \sum_{a \in \{0,1\}} \big( \mathbb{I}(A_j = a) - g^{c}(a) \big) \, \phi_a(E_j, W_j) \Big] = 0$

with augmentation term

$\phi_a(E_j, W_j) = D_j^{\top} V_j^{-1} \big( B_j(a, E_j, W_j) - \mu_j(\beta) \big)$

Here, for cluster $j$, $\mu_j(\beta)$ denotes the vector of marginal regressions (as in Eq. 23) and $B_j(a, E_j, W_j)$ denotes the vector of conditional regressions (as in Eq. 24) evaluated at $A_j = a$. The cluster-level propensity score, defined in Eq. 15, is treated as known (i.e., $g^{c}(a) = 0.5$ under equal allocation). By solving Eq. 25, we can obtain point estimates and inference for the marginal effects. As with the other estimators considered, if the conditional outcome regression is misspecified, the resulting estimator remains asymptotically normal and consistent; however, it is not efficient [aug-gee]. Indeed, the efficiency of the estimator depends strongly on the choice of the working matrix $D_j^{\top} V_j^{-1}$. Stephens et al. [stephens_aug-gee14] show how to further improve Aug-GEE by deriving the semiparametric locally efficient estimator, but conclude that the derivation of the high-dimensional inverse covariance matrix presents a substantial barrier to any additional gains.

4 Simulation Studies

To examine the finite sample properties of the previously discussed CRT estimators and to demonstrate the importance of careful specification of the target of inference, we conducted two simulation studies.

4.1 Simulation I

We simulated a simplified data generating process reflecting the hierarchical data structure of the PTBi study, which randomized $J = 20$ clusters, corresponding to health facilities. For each of the 20 clusters, we sampled two cluster-level covariates and the within-cluster sample size $N_j$. For each cluster $j$, we also simulated a random variable $U_j$ to act as an unmeasured source of dependence within each cluster. To reflect the PTBi study design and examine the impacts of “breaking the match” during the analysis, we paired clusters using the nonbipartite matching algorithm [nbp_match]. Within each pair, one cluster was randomized to the intervention arm and the other to the control arm.

For participant $i$ in each cluster $j$, we generated two individual-level covariates. We also calculated their cluster-level summaries as the empirical means of the individual-level covariates within each cluster. Lastly, we simulated the individual-level outcomes $Y_{ij}$ as a function of the intervention $A_j$, the cluster-level and individual-level covariates, and the unmeasured factor $U_j$:

(26)

We set the coefficients on the intervention terms to non-zero values to assess the estimators' power when there was an effect, and set them to 0 to assess Type-I error under the null. For both settings, we generated counterfactual outcomes under the intervention and under the control by setting $A_j = 1$ and $A_j = 0$, respectively, for a population of 9,000 clusters. These counterfactuals were used to calculate the true values of the causal parameters, defined at both the cluster-level and the individual-level. Specifically, we assessed the performance of the t-test and TMLE as estimators of the cluster-level effect (i.e., Eq. 11) and the GEEs as estimators of the individual-level effect (i.e., Eq. 12). Additionally, we implemented CARE to estimate the geometric incidence ratio (i.e., the ratio of the geometric means of the cluster-level outcomes).

We implemented two versions of TMLE: one using only cluster-level data (hereafter called the “cluster-level TMLE”) and one incorporating individual-level data (hereafter called the “individual-level TMLE”). For covariate adjustment, both TMLEs used Adaptive Prespecification to choose at most two covariates for adjustment if they improved efficiency relative to the unadjusted estimator. The cluster-level TMLE could select from the cluster-level covariates and the cluster-level summaries of the individual-level covariates when estimating the cluster-level outcome regression and cluster-level propensity score, while the individual-level TMLE could select from the cluster-level and individual-level covariates when estimating the individual-level outcome regression and individual-level propensity score. To limit the overall number of candidates and thus reduce the estimator's variability, we removed a covariate from consideration for propensity score estimation if that covariate was already selected for adjustment in the outcome regression. Inference for the TMLEs was based on the cross-validated influence function. CARE, GEE, and Aug-GEE relied on a fixed specification of the individual-level outcome regression, which adjusted for the measured covariates. In CARE, we applied the log-transformation to the ratio-residuals, as recommended by [moultonhayes], and obtained inference with a t-test. In the GEEs, we used the log-link function and an exchangeable working correlation matrix for statistical inference.
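As a rough illustration of this kind of data-generating process (with distributions and coefficients of our own choosing, not the paper's specification), the following R sketch simulates clusters with a shared unmeasured factor and approximates the true cluster-level and individual-level effects from counterfactual risks.

```r
# Minimal sketch of a Simulation I-style hierarchical data-generating process:
# cluster-level covariates, variable cluster sizes, and a shared unmeasured
# cluster-level factor U inducing within-cluster dependence. The "true" effects
# are approximated from counterfactual risks over a large population of clusters.
set.seed(1)
n_pop <- 9000                                  # population of clusters for the truth
E1 <- rnorm(n_pop); E2 <- rnorm(n_pop)         # cluster-level covariates
Nj <- rpois(n_pop, 40) + 10                    # within-cluster sample sizes
U  <- rnorm(n_pop, sd = 0.5)                   # unmeasured cluster-level factor

# Counterfactual risk for each participant under arm a (assumed coefficients;
# individual-level covariates drawn independently here for simplicity)
risk <- function(j, a) {
  W1 <- rnorm(Nj[j]); W2 <- rnorm(Nj[j])
  plogis(-1.5 - 0.5 * a + 0.4 * E1[j] + 0.3 * W1 + 0.2 * W2 + U[j])
}

p1 <- lapply(seq_len(n_pop), risk, a = 1)      # risks under intervention
p0 <- lapply(seq_len(n_pop), risk, a = 0)      # risks under control

# Cluster-level effect: ratio of equally weighted cluster means (Eq. 11)
psi_c <- mean(sapply(p1, mean)) / mean(sapply(p0, mean))
# Individual-level effect: ratio of participant-level means (Eq. 12)
psi_i <- mean(unlist(p1)) / mean(unlist(p0))
c(cluster_level = psi_c, individual_level = psi_i)
```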

4.2 Simulation I Results

The true value of the cluster-level incidence ratio (Eq. 11), targeted by the t-test and the TMLEs, was 0.744 and very similar to the true value of the individual-level risk ratio (Eq. 12), targeted by the GEEs: 0.741. However, the true value of the ratio of the geometric means (Section 3.2.2), targeted by CARE, was 0.71, indicating a larger effect size. In other words, the simulated intervention resulted in a 26% reduction in the cluster-level outcome on the arithmetic scale and a 29% reduction on the geometric scale. Of course, there is no guarantee that this will always be true, and it is essential to specify a target parameter corresponding to the research query.

While the CRT estimators targeted different effects, it is still valid to compare their attained power, defined as the proportion of times the false null hypothesis was rejected at the 5% significance level, and confidence interval coverage, defined as the proportion of times the 95% confidence interval contained the true value of the corresponding target effect. Additionally, we examined Type-I error, defined as the proportion of times the true null hypothesis was rejected at the 5% significance level.

            When there is an effect                         Under the null
            pt     bias    SD     SE     covg   power      pt     bias    SD     SE     covg   Type-I
t-test      0.745  0.001   0.175  0.171  0.952  0.302      1.001  0.001   0.141  0.154  0.960  0.040
Clust TMLE  0.740  -0.004  0.112  0.106  0.962  0.722      1.001  0.001   0.103  0.098  0.959  0.041
Indv TMLE   0.741  -0.003  0.113  0.106  0.964  0.723      1.001  0.001   0.103  0.096  0.960  0.040
CARE        0.737  -0.007  0.110  0.083  0.954  0.757      1.006  0.006   0.101  0.104  0.951  0.052
GEE         0.745  0.001   0.096  0.107  0.928  0.851      1.005  0.005   0.087  0.097  0.921  0.096
Aug-GEE     0.745  0.001   0.094  0.107  0.922  0.855      1.006  0.006   0.086  0.100  0.918  0.101

  • “Clust” and “Indv” refer to the TMLE using cluster-level (only) and individual-level data, respectively. Each is implemented with Adaptive Prespecification.

  • “pt” is the average point estimate; “bias” is the average difference between the point estimate and the target effect; “SD” is the standard deviation of the point estimates; “SE” is the average standard error estimate; “covg” is the proportion of times the 95% confidence interval contained the true effect; “power” is the proportion of times the false null hypothesis was rejected; and “Type-I” is the proportion of times the true null hypothesis was rejected.

Table 1: Performance of common CRT estimators when there is an effect and under the null across 2500 iterations of Simulation I.

These metrics are shown in Table 1 for each estimator (for its corresponding target estimand) across all 2500 iterations of the data generating process, each with $J = 20$ clusters. The t-test, an unadjusted effect estimator, maintained nominal Type-I error (4%) under the null, but when there was an effect, it achieved low statistical power (30%) with nominal confidence interval coverage (95%). In these simulations, the cluster-level TMLE and individual-level TMLE performed similarly. Both provided substantial efficiency gains over the unadjusted estimator of the same parameter, which was the cluster-level relative effect. Specifically, they achieved a statistical power of 72%, while maintaining nominal confidence interval coverage (96%) and Type-I error control (4%). As expected, however, the TMLEs differed in their selection of adjustment variables. The individual-level TMLE selected an individual-level covariate as the adjustment variable for the outcome regression in 98% of simulations, while the cluster-level TMLE selected the corresponding cluster-level summary in only 34% of simulations. Both selected an unadjusted estimator of the propensity score in 93% of repetitions.

For estimation of the geometric incidence ratio, CARE maintained Type-I Error control (5%) and achieved slightly more power (76%) than the TMLEs; this is not surprising given that in this simulation the cluster-level effect summarized with geometric means (0.71) was further from the null (1.0) than the cluster-level effect summarized with arithmetic means (0.74). Finally, in this simulation, both GEEs, which targeted the individual-level risk ratio, achieved the highest power (85%-86%), but also exhibited inflated Type-I error rates (10%) and less than nominal confidence interval coverage (92%).

4.3 Simulation II

In many CRTs, participant outcomes are influenced by the cluster size. Suppose, for example, that the smallest health facilities have the fewest resources to the detriment of their patients' health, while the largest health facilities are overburdened, also to the detriment of their patients' health. In this setting, wide variation in cluster size can result in a divergence between cluster-level and individual-level effects. Intuitively, cluster-level parameters (e.g., Eqs. 4 and 5) give equal weight to each cluster, regardless of its size, while individual-level parameters (e.g., Eqs. 6 and 7) give equal weight to all trial participants. The distinction between the parameters is exacerbated when cluster size interacts with the intervention and is said to be “informative” [nevail_ics2014, Seaman2014]. Therefore, in the second simulation study, we considered a more complex data-generating process to highlight the distinction between cluster-level effects and individual-level effects.

We again focused on a setting with $J = 20$ clusters, reflecting the PTBi study. For each cluster, we sampled a cluster-level covariate $E_j$, the within-cluster sample size $N_j$, and an unmeasured cluster-level variable $U_j$. Clusters were pair-matched, and within each pair, one cluster was randomized to the intervention arm and the other to the control arm. For participant $i$ in each cluster $j$, we generated two individual-level covariates and, as before, calculated their cluster-level aggregates by taking the average value of each individual-level covariate within the cluster. Lastly, we simulated the individual-level outcomes as a function of the intervention, the cluster-level and individual-level covariates, and the unmeasured factor $U_j$.

Unlike Simulation I, the probability of the individual-level outcome is now a function of the cluster size $N_j$. Specifically, the outcome risk is higher for participants in larger clusters, especially large clusters in the control arm. As before, we also set the coefficients for the intervention and for its interactions with covariates to 0 to assess Type-I error under the null. For both settings, we generated counterfactual outcomes under the intervention and under the control by setting $A_j = 1$ and $A_j = 0$, respectively. Then, for a population of 9,000 clusters, we calculated the relative effect at the cluster-level (Eq. 11) and at the individual-level (Eq. 12).

For this simulation, we focused on the performance of the cluster-level and individual-level TMLEs, given their flexibility to estimate effects defined at the cluster-level or individual-level and their ability to incorporate baseline covariates to improve precision and statistical power, while maintaining Type-I error control. As previously discussed, through the application of weights, the cluster-level TMLE can estimate individual-level effects. Likewise, the individual-level TMLE can estimate cluster-level effects. It is possible that the other CRT estimators have this flexibility, but the needed extensions remain to be fully studied.

As before, each TMLE used Adaptive Prespecification to choose at most two covariates for adjustment if their inclusion improved efficiency as compared to an unadjusted estimator. The candidate adjustment set for the cluster-level TMLE included the cluster-level covariate and the cluster-level summaries of the individual-level covariates, while the candidate adjustment set for the individual-level TMLE additionally included the individual-level covariates. As before, if a covariate was adjusted for in the outcome regression, it was not considered in estimation of the propensity score. For comparison, we also considered a cluster-level TMLE with a fixed specification adjusting for a prespecified cluster-level covariate and an individual-level TMLE adjusting for its individual-level counterpart. Inference for the TMLEs using Adaptive Prespecification was based on a cross-validated estimate of the influence function, while inference for the TMLEs based on fixed adjustment was based on the standard (i.e., not cross-validated) estimate of the influence function. For simplicity, we focus on the analysis “breaking” the matches used for randomization.

4.4 Simulation II Results

In these simulations, the true value of the cluster-level relative effect (Eq. 11) was 0.81, corresponding to a substantially larger effect than the true individual-level relative effect (Eq. 12) of 0.85. In other words, the intervention resulted in a 19% reduction in the cluster-level incidence of the outcome, but only a 15% reduction in the individual-level risk of the outcome. Of course, this is just one simulation study; in practice, there is no guarantee that the cluster-level effect will be larger than, or even different from, the individual-level effect. For this simulation study, Table 2 shows the performance of the cluster-level and individual-level TMLEs with fixed and adaptive adjustment. Given the differing magnitude of the effects, it is unsurprising that estimators of the cluster-level effect, shown on the left, achieved notably higher power than estimators of the individual-level effect, shown on the right. Specifically, estimators of the cluster-level effect achieved a maximum power of 58%, while estimators of the individual-level effect achieved a maximum power of 41%.

Focusing on estimators of the cluster-level effect (Table 2-Left), we see that performance was quite similar across the approaches. Specifically, all analyses resulted in low absolute bias, similar variability (), and near-nominal confidence interval coverage (93-95%). For a given adjustment strategy (fixed or adaptive), there was little practical difference in performance between analyses using cluster-level or individual-level data. However, the cross-validated standard error estimates () from the TMLEs using Adaptive Prespecification were notable larger than those relying on fixed specification, resulting in notably lower power: 48% versus 58%, respectively. Of course, a priori-selection of fixed adjustment set is easy when there are limited covariates as in this simulation study. In most applications, there are many potential adjustment variables, and it often unclear which combination will maximize precision for the effect of interest, motivating the need for Adaptive Prespecification.

This is, in fact, seen for estimators of the individual-level effect (Table 2-Right). As before, all approaches exhibited minimal bias, similar variability (standard deviation of the point estimates of approximately 0.09), and near-nominal to conservative confidence interval coverage (93%-97%). Now, however, we see that the TMLEs using Adaptive Prespecification achieved notably higher power (approximately 41%) as compared to the TMLEs relying on fixed specification of the adjustment variables (approximately 34%). Again, we see little practical difference between approaches using cluster-level data and those using individual-level data.

              Cluster-level effect (truth = 0.806)         Individual-level effect (truth = 0.845)
           pt     bias    SD     SE     covg   power     pt     bias    SD     SE     covg   power
Clust      0.807  0.001   0.106  0.095  0.930  0.583     0.843  -0.002  0.092  0.102  0.965  0.341
Clust-AP   0.809  0.003   0.107  0.107  0.946  0.479     0.845  0.000   0.093  0.091  0.933  0.411
Indv       0.806  0.000   0.106  0.096  0.931  0.583     0.844  -0.001  0.091  0.101  0.967  0.336
Indv-AP    0.808  0.002   0.107  0.102  0.949  0.484     0.845  0.000   0.092  0.091  0.939  0.408

  • “Clust” and “Indv” refer to the TMLE using cluster-level and individual-level data, respectively. Each is implemented with either fixed adjustment or adaptive adjustment via Adaptive Prespecification (“-AP”).

  • “pt” is the average point estimate; “bias” is the average difference between the point estimate and the target effect; “SD” is the standard deviation of the point estimates; “SE” is the average standard error estimate; “covg” is the proportion of times the 95% confidence interval contained the true effect; and “power” is the proportion of times the false null hypothesis was rejected.

Table 2: Performance of TMLEs for the cluster-level and individual-level relative effects across 2500 iterations in Simulation II.

Altogether, the results of this simulation demonstrate the ability of TMLE to estimate both cluster-level and individual-level effects, while adaptively adjusting for baseline covariates. These results also highlight the critical importance of pre-specifying the primary effect measure and using a CRT estimator that targets that effect. Additional simulation results in Appendix C demonstrate the danger of using an estimator that does not target the primary effect measure and thereby fails to answer the research question of interest. Specifically, notable bias and poor coverage can arise when applying an estimator of the cluster-level parameter when the target of inference is the individual-level effect and, conversely, when applying an estimator of the individual-level parameter when the target of inference is the cluster-level effect. While we illustrate this critical step in CRT analysis with TMLE, similar challenges apply to all CRT estimators. In all settings, the research question should drive specification of the causal effect and thereby the statistical estimation approach [Petersen2014roadmap, BalzerMLcomm2021].

5 Real data application: The PTBi Study in Kenya and Uganda

In East Africa, preterm birth remains a leading risk factor for perinatal mortality, defined as stillbirth and first-week deaths [ku_protocol]. Evidence-based practices, such as antenatal corticosteroids and skin-to-skin contact, are not routinely used, despite their potential to improve outcomes for preterm infants during the critical intrapartum and immediate newborn periods. The PTBi study was a CRT designed to improve the quality of care at the time of birth through a health facility intervention for mothers and preterm infants (ClinicalTrials.gov: NCT03112018) [ku_protocol]. The primary endpoint was intrapartum stillbirth and 28-day mortality among preterm infants delivered from October 2016 to May 2018 in Western Kenya and Eastern Uganda.

In more detail, 20 public sector health facilities, including large hospitals and smaller health centers, were selected for participation in the study. The facilities ranged in size, staff-patient ratio, and capacity to perform cesarean section (C-section), among other characteristics. The facilities were pair-matched within region and randomized to either the intervention or control arm [ku_protocol]. Facilities in the control arm received (1) strengthening of routine data collection and (2) introduction of the WHO Safe Childbirth Checklist [WHO_checklist]. Facilities in the intervention arm received the control components in addition to (1) PRONTO Simulation training [pronto] and (2) quality improvement collaboratives aimed at reinforcing and optimizing the use of evidence-based practices. All study components consisted of known interventions and strategies aiming to improve quality of care, teamwork, communication, and data use [ku_protocol].

The results of the PTBi study have been previously published in [ptb_ku_walker]. Here, we focus on the real-world impact of varying cluster sizes. Specifically, during the study period, an unforeseen political strike led to a lack of medical providers at certain facilities, thereby decreasing volume at some facilities while increasing volume at others. The number of preterm births per facility ranged from 40 to 366 in the intervention arm and from 31 to 447 in the control arm. Differences in cluster size were also pronounced within matched pairs, ranging from 9 to 211.

To study the impact of high variability in cluster size, we return to the consequences of specifying the effect in terms of cluster-level outcomes (Eq. 11) versus individual-level outcomes (Eq. 12). In PTBi, the individual-level outcome was an indicator of preterm infant mortality by 28-day follow-up. Infants dying before discharge (stillbirth and predischarge mortality) were also included in the study; for these infants, the outcome indicator was set to one. For each facility, the cluster-level endpoint was the incidence of fresh stillbirth and 28-day all-cause mortality among preterm births, calculated as the empirical mean of the individual-level outcomes within that facility.
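To make the construction of these endpoints concrete, the following schematic shows how the facility-level endpoint and the two unadjusted arm-level summaries would be computed from individual-level records. The data frame and its column names are hypothetical placeholders, not PTBi data.

```python
# Schematic construction of the facility-level endpoint from
# individual-level outcomes, and the two unadjusted arm-level summaries
# (equal weight per facility versus equal weight per infant).
import pandas as pd

df = pd.DataFrame({
    "facility": [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4],
    "arm":      [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    "died":     [0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1],  # mortality indicator
})

# Facility-level endpoint: empirical mean of the individual indicators
fac = (df.groupby(["facility", "arm"], as_index=False)["died"]
         .mean()
         .rename(columns={"died": "incidence"}))

# Cluster-level summary: equal weight per facility
cluster_means = fac.groupby("arm")["incidence"].mean()

# Individual-level summary: equal weight per infant (pooled proportions)
indiv_means = df.groupby("arm")["died"].mean()

print(cluster_means[1] / cluster_means[0])  # unadjusted cluster-level ratio
print(indiv_means[1] / indiv_means[0])      # unadjusted individual-level ratio
```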

For both the cluster-level and individual-level effects, we compared estimates and inference from TMLEs using cluster-level or individual-level data. The cluster-level TMLE used Adaptive Prespecification to select between no adjustment and adjustment for the proportion of mothers receiving a C-section, while the individual-level TMLE used Adaptive Prespecification to select between no adjustment and adjustment for an indicator of receiving a C-section. For comparison, we also implemented the TMLEs without adjustment, and we implemented each approach both preserving and ignoring (i.e., “breaking”) the matched pairs used for randomization.

5.1 PTBi Results

The average incidence of 28-day mortality among preterm infants was 0.122 among facilities randomized to the intervention and 0.150 among facilities randomized to the control. Therefore, the unadjusted estimate of the cluster-level relative effect (Eq. 11) was 0.815 (95%CI: 0.595-1.116; p: 0.174), corresponding to an 18.5% reduction in the combined incidence of mortality at the facility level. The results for the individual-level outcome were markedly different. Specifically, the proportion of preterm infants who died within 28 days was 0.153 in the intervention arm and 0.223 in the control arm. Therefore, the unadjusted estimate of the individual-level relative effect (Eq. 12) was 0.656 (95%CI: 0.531-0.812; p<0.005), corresponding to a 34.4% reduction in the mortality risk.

As shown in Table 3, all estimates from TMLE indicated the PTBi intervention reduced preterm mortality; however, estimates of both the treatment-specific means and the relative effects varied by the target of inference and the estimation approach. As seen with the unadjusted analyses, estimates of the cluster-level effect were less extreme than estimates of the individual-level effect. We note the adjusted analyses were slightly less precise than the unadjusted ones because, for demonstration, the adjusted analyses were restricted to participants with known C-section status, while the unadjusted analyses used all participants. As expected, for a given TMLE (defined by the target of inference and the level of the analysis), the point estimates when breaking versus keeping the matches were identical. Additionally, as predicted by theory [balzer_matching], the pair-matched analysis was more precise. The gain in efficiency from pair-matching was particularly notable for the individual-level TMLE.

           For the cluster-level effect                              For the individual-level effect
           Intervention  Control  Ratio (95% CI)        pval        Intervention  Control  Ratio (95% CI)        pval
Unmatched analysis
Clust      0.128         0.145    0.885 (0.585, 1.338)  0.542       0.135         0.162    0.834 (0.567, 1.227)  0.337
Indv       0.124         0.149    0.836 (0.495, 1.411)  0.481       0.148         0.213    0.695 (0.435, 1.109)  0.120
Pair-matched analysis
Clust      0.128         0.145    0.885 (0.587, 1.335)  0.518       0.135         0.162    0.834 (0.573, 1.214)  0.303
Indv       0.124         0.149    0.836 (0.600, 1.164)  0.252       0.148         0.213    0.695 (0.549, 0.880)  0.007

  • “Clust” and “Indv” refer to the TMLE using cluster-level and individual-level data, respectively. “Intervention” and “Control” are the estimated treatment-specific means (expected incidence of mortality) under the intervention and control conditions; “Ratio” is their ratio with its 95% confidence interval; and “pval” is the corresponding p-value.

Table 3: Results from applying TMLE to estimate cluster-level and individual-level effects in the PTBi study

Focusing specifically on estimators of the cluster-level effect (Table 3-Left), we see that estimates of the expected incidence of mortality under the intervention and under the control varied according to whether a cluster-level or an individual-level analysis was taken. In particular, the expectations were more extreme (i.e., lower incidence under the intervention and higher incidence under the control) with the individual-level estimation approach. This may reflect that, while both approaches can target the same parameter (here, the cluster-level effect), the relationship between the cluster-level covariates and outcome may be distinct from the relationship between the individual-level covariates and outcome. Specifically, the impact of the proportion of women receiving a C-section on the cluster-level incidence of mortality is likely to be different from the impact of an individual woman’s C-section on the risk of mortality for her child. This could contribute to the individual-level TMLE being a more precise estimator of the cluster-level effect in this application.

Finally, focusing on estimators of the individual-level effect (Table 3-Right), we see that estimates of the risk of mortality under the intervention and under the control are larger with the individual-level approach. Focusing on the pair-matched analyses for simplicity, the cluster-level TMLE estimated the effect of the PTBi intervention on the individual-level risk of preterm mortality to be 0.834 (95%CI: 0.573-1.214; p: 0.303). In contrast, the individual-level TMLE estimated the effect of the PTBi intervention to be 0.695 (95%CI: 0.549-0.880; p: 0.007), corresponding to a 30.5% reduction. Altogether, these results highlight the importance of pre-specifying the research question, selecting a causal parameter corresponding to that question, and finally implementing an estimator targeting that effect.

6 Discussion

Cluster randomized trials (CRTs) are commonly used to evaluate the effects of interventions that are randomized to groups of individuals, such as communities, clinics, or schools. In both resource-rich and resource-limited settings, high costs and barriers to implementation often limit the number of clusters that can be enrolled and randomized. Therefore, it is important to understand the potential advantages and pitfalls of common analytic approaches in small-scale CRTs. In this manuscript, we aimed to provide an in-depth comparison of methods available for CRT analysis, overall and when there were 30 clusters. To do so, we first demonstrated how causal models can be used to describe the data generating process of a CRT and can be intervened upon to generate counterfactual outcomes of interest. Using these counterfactuals, we explored a variety of causal parameters that could be defined, and we stressed the importance of specifying the primary effect measure a priori. We then delved into common CRT methods, considering each theoretically and in finite sample simulations. Finally, we demonstrated the impact of key analytic decisions using real data from the PTBi study, a clinic-based CRT designed to reduce mortality among preterm infants in East Africa.

Our theoretical comparison of estimators revealed that common methods often target different causal effects. Results from our first simulation study demonstrated how targeted maximum likelihood estimation (TMLE) and the covariate adjusted residuals estimator (CARE) could improve statistical power by adjusting for covariates, while maintaining nominal confidence interval coverage. In this simulation study, approaches based on generalized estimating equations (GEE and Aug-GEE) failed to preserve Type-I error control; this is consistent with the literature warning against their application (without finite sample corrections) with fewer than 30 clusters [moultonhayes, murray]. However, some studies may value minimizing Type II error over Type I error, and this should be carefully considered in the analysis plan.

Results from our second simulation study demonstrated the impact of informative cluster size, which occurs when the cluster size interacts with the intervention effect. In this setting, there can be a sharp divergence in the interpretation and magnitude of effects defined at the cluster-level or at the individual-level. This divergence was observed in the real data application, where larger facilities in the control arm had worse outcomes.

There are several limitations to the applied analysis, which merit additional consideration. First, high levels of missing data on baseline covariates limited our investigation of adjustment to a single covariate: having a C-section. Future work could address imputation of missing data on baseline covariates and its impact on statistical performance of adjusted estimators. Second, the adjusted analysis was restricted to mother-infant dyads with known C-section status. This could bias our results if the data are not missing at random; we plan to address this in a future analysis using the methods in [Balzer2021twostage]. Third, during the delivery, the infant may have also received additional interventions which could influence their outcome. While the PTBi study was not designed to collect this information for its primary analysis, future work could focus on methods for assessing the joint impact of the cluster randomized intervention combined with another individual-level exposure (randomized or not). Lastly, we assumed a one-to-one relationship between mothers and their infants for simplicity; however, in further sensitivity analyses, we will formally consider accounting for all twins and triplets. Standard error estimates should account for this additional layer of clustering.

In summary, CRTs are a popular approach for researchers seeking to scale up interventions from the individual-level to the population-level. Many of the challenges faced by the PTBi study are common and highlight the importance of carefully selecting the target effect and corresponding analytic approach. We recommend considering both interpretation and impacts on statistical power when selecting the target effect. For estimation, we recommend using TMLE given its flexibility to estimate a variety of effects and data-adaptively adjust for both cluster-level and individual-level covariates, while preserving Type-I error control.

7 Appendix A: Generalization of the causal parameter definition

We can further generalize our definition of the treatment-specific mean as

\[
\Psi_{\omega}(a) \;=\; \mathbb{E}\left[\sum_{j=1}^{J} \omega_j \, Y^c_j(a)\right] \qquad (27)
\]

for user-specified weights $\omega_j$ such that $\sum_{j=1}^{J} \omega_j = 1$. For example, setting $\omega_j = 1/J$ recovers the cluster-level parameter, as in Eq. 4. Let $\alpha_{ij}$ be the weights used to define the cluster-level outcome: $Y^c_j = \sum_{i=1}^{N_j} \alpha_{ij} Y_{ij}$. Then setting $\omega_j = N_j / \sum_{k=1}^{J} N_k$ recovers the individual-level parameter, as in Eq. 6. To illustrate, we again focus on the setting where $\alpha_{ij}$ is the inverse cluster size: $\alpha_{ij} = 1/N_j$. Then we have

\[
\Psi_{\omega}(a) \;=\; \mathbb{E}\left[\sum_{j=1}^{J} \frac{N_j}{\sum_{k=1}^{J} N_k} \cdot \frac{1}{N_j} \sum_{i=1}^{N_j} Y_{ij}(a)\right] \;=\; \mathbb{E}\left[\frac{\sum_{j=1}^{J} \sum_{i=1}^{N_j} Y_{ij}(a)}{\sum_{j=1}^{J} N_j}\right]. \qquad (28)
\]

Altogether, Eq. 27 allows us to consider a wide range of effects defined at the cluster-level or individual-level, regardless of whether the data are collected at the cluster-level or individual-level.
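As a quick numerical sanity check of these weight choices (with made-up data, not from the study), the snippet below verifies that uniform weights return the average of the cluster-level outcomes, while size-proportional weights combined with within-cluster means return the pooled individual-level mean, as in Eq. 28.

```python
# Numerical check of the weight choices in Eq. 27 (hypothetical data).
# With cluster-level outcomes defined as within-cluster means
# (alpha_ij = 1/N_j), uniform weights omega_j = 1/J give the average of
# the cluster-level outcomes, while omega_j = N_j / sum_k N_k gives the
# pooled individual-level mean, matching Eq. 28.
import numpy as np

outcomes = [np.array([0, 1, 0]),                 # cluster 1 (N_1 = 3)
            np.array([1, 0, 1, 1, 0, 1, 0, 1])]  # cluster 2 (N_2 = 8)
N = np.array([len(y) for y in outcomes])
Yc = np.array([y.mean() for y in outcomes])      # alpha_ij = 1/N_j

cluster_param = np.sum((1 / len(Yc)) * Yc)       # omega_j = 1/J
indiv_param = np.sum((N / N.sum()) * Yc)         # omega_j = N_j / sum_k N_k
pooled_mean = np.concatenate(outcomes).mean()

print(cluster_param)                 # mean of the cluster-level outcomes
print(indiv_param, pooled_mean)      # identical, per Eq. 28
```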

8 Appendix B: Hybrid TMLE

Recall that the first step of the cluster-level TMLE (Section 3.1) is to obtain an initial estimator of the conditional expectation of the cluster-level outcome. Instead of only considering cluster-level approaches, we can expand our candidate estimators by including aggregates of individual-level approaches [hierarchical]. Consider, for example, the following logistic regression specification of the expected individual-level outcome:

\[
\text{logit}\, \mathbb{E}\big(Y_{ij} \mid A_j, W_{ij}\big) \;=\; \beta_0 + \beta_1 A_j + \beta_2^{\top} W_{ij},
\]

where $W_{ij}$ denotes the adjustment covariates (cluster-level, individual-level, or both) for individual $i$ in cluster $j$. We could estimate these coefficients by pooling individuals across clusters and running an individual-level logistic regression. Afterwards, for each cluster $j$, we would obtain and summarize the individual-level predicted outcomes to generate a candidate estimator of the expected cluster-level outcome:

\[
\hat{Y}^c_j(a) \;=\; \sum_{i=1}^{N_j} \alpha_{ij}\, \hat{\mathbb{E}}\big(Y_{ij} \mid A_j = a, W_{ij}\big),
\]

with the selected weights $\alpha_{ij}$ corresponding to the relevant cluster-level summary of the individual-level outcomes (i.e., Eqs. 3 and 8); throughout, we have been focusing on $\alpha_{ij} = 1/N_j$ for all $i$. Then estimation and inference would proceed as described in Section 3.1 for the cluster-level TMLE. Thus, this approach naturally targets the cluster-level treatment-specific mean, and therefore cluster-level effects, as in Eq. 11. However, as previously discussed, we can modify the weights to, instead, target an individual-level effect.
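A minimal sketch of this pooling-and-aggregating step is given below (Python with statsmodels; the column names, the particular covariates, and the software are illustrative assumptions, not the authors' implementation).

```python
# Sketch of the hybrid initial estimator: pool individuals across
# clusters, fit the individual-level logistic working model, and then
# aggregate the predicted outcomes within each cluster using the same
# weights that define the cluster-level outcome (here alpha_ij = 1/N_j,
# i.e., a simple within-cluster mean). Column names are placeholders.
import pandas as pd
import statsmodels.api as sm

def hybrid_initial_estimator(df, covariates=("W1", "W2")):
    """df: one row per individual with columns 'cluster', 'A' (arm),
    'Y' (binary outcome), and the adjustment covariates."""
    X = sm.add_constant(df[["A", *covariates]])
    fit = sm.GLM(df["Y"], X, family=sm.families.Binomial()).fit()

    # Predicted individual-level outcomes under each arm
    X1 = X.copy(); X1["A"] = 1
    X0 = X.copy(); X0["A"] = 0
    preds = df[["cluster"]].assign(pred1=fit.predict(X1),
                                   pred0=fit.predict(X0))

    # Aggregate to candidate estimates of the expected cluster-level
    # outcome under A=1 and A=0 (equal weight 1/N_j per individual)
    return preds.groupby("cluster")[["pred1", "pred0"]].mean()
```

The resulting cluster-level predictions would then enter the targeting step of the cluster-level TMLE in the same way as predictions from a cluster-level working model.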

More importantly, we can now use Adaptive Prespecification to choose between candidate estimators of the expected cluster-level outcome based on cluster-level approaches or on aggregates of individual-level approaches. In the latter, the initial estimator is based on pooling individuals across clusters; therefore, we could consider methods that are more data-adaptive than GLMs. Specifically, we could use Super Learner, an ensemble machine learning algorithm [superlearner_2007].

9 Appendix C: Additional Simulation Results

As demonstrated in Simulation Study II, TMLE can estimate effects defined at the cluster-level (e.g., Eq. 11) as well as effects defined at the individual-level (e.g., Eq. 12), regardless of whether the data are at the individual-level or the cluster-level. Here, we again emphasize the importance of implementing an estimator corresponding to the CRT’s primary effect measure. To do so, we reconsider Simulation II and show the performance of the TMLEs when there is a mismatch between the researcher’s goal and the estimator’s target.

Specifically, in Table 4-Left, we show the finite sample performance when the TMLEs are targeting the cluster-level effect (truth = 0.806), but the researcher’s goal is to evaluate the individual-level effect (