A note on Horvitz-Thompson estimators for rare subgroup analysis in the presence of interference

by Erin E Gabriel, et al.
Karolinska Institutet

When there is interference, a subject's outcome depends on the treatment of others and treatment effects may take on several different forms. This situation arises often, particularly in vaccine evaluation. In settings where interference is likely, two-stage cluster randomized trials have been suggested as a means of estimating some of the causal contrasts of interest. I work in the finite population setting to investigate rare and unplanned subgroup analyses using some of the estimators that have been suggested in the literature, including Horvitz-Thompson, Hájek, and what might be called the natural extension of the marginal estimators suggested in Hudgens and Halloran (2008). I define the estimands of interest conditional on individual, group, and both individual and group baseline variables, giving unbiased Horvitz-Thompson style estimates for each. I also provide variance estimators for several estimators. I show that the Horvitz-Thompson (HT) type estimators are always unbiased provided at least one subject within the group or population, whatever the level of interest for the estimator, is in the subgroup of interest. This is not true of the "natural" or the Hájek style estimators, which will often be undefined for rare subgroups.




1 Introduction

Interference is present in settings where a subject’s outcomes are not independent of other subjects’ treatments. Interference changes both the types of causal effects of treatment and their estimation. Much of the causal inference literature assumes that units of interest are independent, and causal inference becomes much more difficult when they are not. Interest in causal inference in the presence of interference has nevertheless increased over the last decade. The foundational work of Hudgens and Halloran (2008) was extended by Tchetgen and VanderWeele (2012), and many new works have followed (Liu and Hudgens, 2014; VanderWeele, 2013; Liu et al., 2016; Aronow et al., 2017; Sävje et al., 2017). Many papers focus on marginal effects and estimands; VanderWeele (2012) and VanderWeele (2013) deal with conditional estimands, as do Halloran and Hudgens (2012).

I am interested in randomization inference within subgroups defined by rare baseline variables, where the randomization has ignored the baseline variable(s) in question, so that one of the randomized arms may not contain any members of the subgroup. In this setting, what one might call the “natural” estimators or the Hájek estimators will be either undefined or biased. If one could randomize stratified by the desired conditioning variable, all Hudgens and Halloran (2008) and Tchetgen and VanderWeele (2012) theorems and propositions would apply directly. Although this may seem like a minor point, pre-specification of the intended analysis is often required in randomized clinical trials; thus, one cannot change the analysis after looking at the unblinded data.

I show that the Horvitz-Thompson estimators (Horvitz and Thompson, 1952) are unbiased and defined, provided there are any members of the subgroup in the population of interest, regardless of the realized randomization. I consider two-stage randomization, with a fixed number of clusters assigned at the first stage and then a fixed number of subjects within each cluster assigned at the second stage (Hudgens and Halloran, 2008). For simplicity, I make the same assumptions as Hudgens and Halloran (2008). However, as has now been shown in many works, there are estimators that are unbiased for interesting estimands under reduced assumptions (Sävje et al., 2017; Aronow et al., 2017).

2 Notation

Following the notation of Hudgens and Halloran (2008), suppose there are $N$ clusters of individuals. For $i = 1, \ldots, N$, let $n_i$ denote the number of subjects in the $i$th cluster, indexed by $j = 1, \ldots, n_i$. Let the treatment assignments for the $n_i$ individuals in cluster $i$ be denoted by the vector $Z_i = (Z_{i1}, \ldots, Z_{in_i})$, with the sub-vector excluding the treatment assignment of subject $j$ written as $Z_{i(j)}$. For simplicity of notation and illustration, the treatment assignment $Z_{ij}$ for subject $j$ in cluster $i$ is either 0 or 1, with the realization being denoted by $z_{ij}$. The realization $z_i$ of the cluster level treatment assignment can be any of $2^{n_i}$ possible values; realizations excluding the assignment for subject $j$ will be written as $z_{i(j)}$.

Let be some baseline variable(s) observed at the cluster level prior to randomization and be some, not necessarily related, baseline variable(s) at the individual level. Let these variables have an arbitrary domain, and let or denote some subset of that domain at the cluster or individual level, respectively. I will consider two coverage proportions for the same intervention, denoted by and . I define the vector of cluster assignments to these strategies as Q, where if cluster is assigned to strategy and is otherwise.

Let with and .
Let the number of clusters with be denoted by .
Let for and .
Let the number of clusters that have both and be denoted by .
Let the set of groups with , be denoted as , and the set of groups with as .

Let be the potential clinical outcome of subject in cluster given the cluster level treatment assignment was and under intervention for the subject. This allows the th subject’s outcome to be influenced by both and the treatment of other subjects in the same cluster, . Let the realized outcome for subjects in cluster be denoted as .

I will assume two-part randomization, under which I first randomize clusters to or then randomize subjects within each cluster to match the given strategy. The randomization strategy is assumed to be mixed, assigning a set number of clusters and then a set number of subjects within each cluster to treatment as in Hudgens and Halloran (2008). Let the set of all possible randomizations for a cluster of size that satisfy strategy be denoted as and the subset of these randomizations for which subject in this cluster is assigned be denoted by . It should be noted that this notation assumes that there is no interference between clusters.

Let denote the potential outcome of subject in cluster if the randomization within the cluster was realized to be . I define the individual average clinical outcome under for strategy as

The cluster average outcome is then given by . I can also define the marginal potential outcomes, marginalizing over within a cluster. For subject in cluster under strategy let be the individual average marginal clinical outcome defined by

I can now define the conditional estimands of interest. Based on these one can define cluster and population level summaries of these potential outcomes as in Hudgens and Halloran (2008), as well as contrasts of interest for defining causal effects.

The group average potential outcomes are and the population average potential outcomes are and
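For reference, the averages described in this section can be written out as below. This is only a sketch in the notation of Hudgens and Halloran (2008), not a recovery of the symbols used here: the strategy label $\psi$, the set $\mathcal{R}^{n}$ of binary assignment vectors of length $n$, the cluster count $N$, the cluster sizes $n_i$, and the bar notation are all assumptions made for illustration.

```latex
% Individual average potential outcome under treatment z and strategy \psi
% (notation assumed from Hudgens and Halloran, 2008)
\bar{Y}_{ij}(z;\psi) = \sum_{\omega \in \mathcal{R}^{n_i-1}}
  Y_{ij}(z_{ij}=z,\; z_{i(j)}=\omega)\,
  \Pr\nolimits_{\psi}\bigl(Z_{i(j)}=\omega \mid Z_{ij}=z\bigr)

% Marginal individual average potential outcome under strategy \psi
\bar{Y}_{ij}(\psi) = \sum_{z \in \mathcal{R}^{n_i}} Y_{ij}(z)\,
  \Pr\nolimits_{\psi}(Z_i = z)

% Cluster (group) and population averages
\bar{Y}_{i}(z;\psi) = \frac{1}{n_i}\sum_{j=1}^{n_i}\bar{Y}_{ij}(z;\psi),
\qquad
\bar{Y}(z;\psi) = \frac{1}{N}\sum_{i=1}^{N}\bar{Y}_{i}(z;\psi)
```

Under a mixed strategy the probabilities are zero for any assignment that does not treat the required number of subjects, so the sums effectively run over the admissible randomizations only.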

I can then define contrasts of these potential outcomes to define causal effects. The direct effect, as given in Hudgens and Halloran (2008) in a cluster is given by and the direct effect at the population level is defined as

The indirect effect at the population level comparing and is defined as and the population direct effect plus the indirect effect is the total effect. The population overall effect comparing and is defined as . Other definitions of the direct effects, as well as decompositions, have been considered (VanderWeele and Tchetgen, 2011).
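As a worked reference for these contrasts, the population level effects of Hudgens and Halloran (2008) can be sketched as follows. The strategy labels $\phi$ and $\psi$, the effect names, and the sign convention (untreated minus treated) are taken from that paper and are assumptions here, since the symbols of the present text are not recoverable.

```latex
% Population-level causal effects (notation and signs assumed from
% Hudgens and Halloran, 2008)
\overline{DE}(\psi)      = \bar{Y}(0;\psi) - \bar{Y}(1;\psi)   % direct
\overline{IE}(\phi,\psi) = \bar{Y}(0;\phi) - \bar{Y}(0;\psi)   % indirect
\overline{TE}(\phi,\psi) = \bar{Y}(0;\phi) - \bar{Y}(1;\psi)
                         = \overline{IE}(\phi,\psi) + \overline{DE}(\psi)   % total
\overline{OE}(\phi,\psi) = \bar{Y}(\phi) - \bar{Y}(\psi)       % overall
```

The decomposition of the total effect holds term by term: the $\bar{Y}(0;\psi)$ contributions of the indirect and direct effects cancel.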

2.1 Conditional estimands

I now consider baseline variable conditional versions of the above estimands that are conditional in three ways: conditional on individual level baseline covariates, conditional on cluster level covariates, and conditional on both. These are the same estimands as those considered in Hudgens and Halloran (2008), but within a subgroup defined by the baseline variable. Consider that individual in group has the individual average outcome under is and is zero otherwise. Similarly, if cluster has cluster level baseline variable and is zero otherwise. Finally, the cross-conditional estimand is equal to if cluster has cluster level baseline variable and is zero otherwise. Thus, individual and cluster level conditional estimands are special cases of the cross-conditional estimands, when all subjects or all clusters are within the range of interest for the individual or cluster level baseline variable. Hence, this is how they are displayed in Table 1. I discuss estimators for each type of conditioning, and their properties, separately for clarity.

Notation Definitions for given conditioning
All All
Table 1: Conditional Estimand Definitions

3 Assumptions

I first need to make assumptions to link the observable data to the desired counterfactuals and estimators.

  • Assumption (a): Consistency,

  • Assumption (b): No interference between clusters,

  • Assumption (c): All assignment strategies are mixed as defined in Hudgens and Halloran (2008) and Sobel (2006)

  • Assumption (d): Stratified interference as defined in Hudgens and Halloran (2008).

Assumption (a), “consistency”, simply means that if and then (VanderWeele, 2009). Assumption (b) is implicitly made in the notation and is explained in detail in Section 2. Under Assumption (c), all clusters have the same probability of being assigned a given strategy; is the probability that a cluster will receive coverage . Under the mixed strategy this is equal to , as a fixed number of clusters, will be randomized to . Likewise, within a cluster, each individual has the same probability of being assigned to treatment given the randomized coverage, which is denoted as and which under a mixed strategy is equal to , as a fixed number of subjects, will be randomized to . Assumption (d), stratified interference, is an assumption outlined in both Hudgens and Halloran (2008) and Tchetgen and VanderWeele (2012). It states that only a subject’s own treatment assignment and the total proportion of people assigned to within their cluster impact their outcome. Therefore, all possible counterfactuals will have the same value for all .
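The equal-probability property of a mixed strategy can be checked by brute-force enumeration. The sketch below lists every assignment that treats exactly k of n units and tallies how often each unit is treated. The function name `marginal_treatment_probs` and the unit counts are illustrative assumptions, chosen to echo the four clusters of four subjects used in the numerical example.

```python
from fractions import Fraction
from itertools import combinations


def marginal_treatment_probs(n, k):
    """Pr(unit is treated) for each of n units when exactly k are treated.

    Every admissible assignment is taken to be equally likely, which is the
    mixed-strategy assumption described in the text.
    """
    assignments = list(combinations(range(n), k))
    return [
        Fraction(sum(unit in treated for treated in assignments), len(assignments))
        for unit in range(n)
    ]


# First stage (hypothetical numbers): 2 of 4 clusters get the higher coverage.
cluster_probs = marginal_treatment_probs(4, 2)

# Second stage (hypothetical numbers): 1 of 4 subjects treated, 25% coverage.
subject_probs = marginal_treatment_probs(4, 1)

print("cluster-level probabilities:", cluster_probs)
print("subject-level probabilities:", subject_probs)
```

Each probability equals k/n, the fixed number assigned divided by the number of units, which is why the Horvitz-Thompson weights are known by design.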

4 Estimation and Inference

I will consider finite population inference following Hudgens and Halloran (2008). Throughout, the expected value E is taken with respect to the randomization distribution, that is, over every possible randomization pattern of the subjects and clusters, with the potential outcomes held fixed and, by consistency, observed under the realized assignment.

Within groups of clusters assigned to , let the cross-conditional outcome estimators be defined by:

with, similarly defined, and

This makes the estimator of ,

At the population level, the estimators and are then given by:

This makes the contrast estimators:
, and

Let the individual level baseline variables conditional outcome estimators be defined by:

with, similarly defined, and

The population level estimators are given by


with the contrast estimator following in the same way as above.

Let the cluster level baseline variables conditional outcome estimators be defined by:

with, similarly defined, and

The population level estimators are given by


Again, the contrast estimators follow in the same way as above. All of the estimators are similarly defined.


  1. Under Assumptions a-c and when

    • .

  2. Under assumptions a-c and when

    • .

  3. Under Assumptions a-c

  4. Under assumptions a-c and when

    • and

    • .

  5. Under assumptions a-c and when

    • .

  6. Under assumptions a-c and when

    • .

Theorem 1
Under assumptions a-d and and , where

with and defined similarly.

Theorem 2
Under assumptions a-d and and for all ,


with defined similarly. Proofs of all theorems and results are given in the supplementary materials. The proofs of theorems 1 and 2 consider conditioning under each type, group and individual level subgroups, separately as well as together.

5 Numerical Example

Group  ID  Y(1)  Y(0)  b  B
1      11    3     0   1  1
1      12    2     0   0  1
1      13   10     2   1  1
1      14    1     1   0  1
2      21    0     2   0  0
2      22    2     3   0  0
2      23    4     6   0  0
2      24    5     7   0  0
3      31    1     2   1  0
3      32    2     1   0  0
3      33    3     0   1  0
3      34   10     1   0  0
4      41    0     3   1  0
4      42    2     1   0  0
4      43    4     5   0  0
4      44    5     7   0  0
Note: Subjects only have two counterfactual outcomes that need to be considered because of our assumption of stratified interference.
Table 2: Example Data

Consider a setting in which there are four groups of four subjects each, in which two of the four groups will receive 50% coverage (, exactly 2 of 4 individuals per group receive treatment), and the other two groups will receive 25% coverage (, exactly 1 of 4 individuals per group receives treatment). Example data are given in Table 2. There are 6 total ways to randomize the clusters to 50% or 25% in a one-to-one ratio. Let the realized randomization at the group level be , and suppose that I wish to do a subgroup analysis based on the individual level covariate . For example, this subgroup may represent the sex of the subjects, which may not normally be rare, but could be in some settings. Under this randomization, there are six possible randomizations for group 4, with 50% coverage, , , , , , . By the definition of the true group average value given above, , as subject 41 has a value under placebo of 3. This also makes clear that the estimand definitions in this setting are sensible, as one would want to know the average value within the subgroup had all subjects within the group been assigned to placebo, but still under 50% coverage.

I want to estimate , the individual variable conditional group average outcome for the untreated. Let us consider the possible estimators. One possible “natural” estimator for the individual level conditional group average outcome under based on Hudgens and Halloran (2008) would be


where the superscript is for natural.

Similarly, I consider a possible subgroup conditional version of the Hájek estimator (Hájek, 1971),


Finally, the estimator given above is:


In group 4, the “natural” and Hájek estimators are undefined, denoted as “NA” in Table 3, whenever the one subject with is assigned to treatment; otherwise they are defined but take a different value than the HT estimator, as is displayed in Table 3. If I instead set the “natural” and Hájek estimators to zero when they were undefined, this would result in a bias towards 0. As defined in Theorem 1, will be equal to 18, 0, 0, 18, 0, 18, giving an average of 9, which is the true variance of over the possible randomizations, as can be seen in Table 3.

Estimator  Rand. 1  Rand. 2  Rand. 3  Rand. 4  Rand. 5  Rand. 6  Mean
Natural       3       NA       NA        3       NA        3      NA
Hájek         3       NA       NA        3       NA        3      NA
HT            6        0        0        6        0        6       3
Table 3: Estimator Values over the possible randomizations of Group 4
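The group 4 comparison can be reproduced by enumerating the six within-group randomizations. The sketch below is an illustration under assumptions rather than the estimators as originally typeset: the estimator forms (ratio of untreated subgroup outcomes for the natural and Hájek versions, design-probability weighting for HT) and the variance estimator are assumed forms, chosen because they reproduce the values quoted in the text, and all function and variable names are hypothetical. The enumeration order need not match the column order of Table 3.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

# Group 4 of Table 2 as (ID, Y(0), b); only Y(0) enters the untreated average.
GROUP4 = [(41, 3, 1), (42, 1, 0), (43, 5, 0), (44, 7, 0)]
K = 2  # 50% coverage: exactly 2 of the 4 subjects are treated


def estimators(group, treated):
    """Return (natural, Hajek, HT, HT variance estimate) for one randomization."""
    n, k = len(group), len(treated)
    p0 = (n - k) / n                     # Pr(Z_ij = 0) under the mixed strategy
    n_sub = sum(b for _, _, b in group)  # subgroup members in the whole group
    untreated = [(y0, b) for sid, y0, b in group if sid not in treated]
    num = sum(y0 * b for y0, b in untreated)
    den = sum(b for _, b in untreated)   # untreated subgroup members
    natural = num / den if den else None            # None stands in for "NA"
    hajek = (num / p0) / (den / p0) if den else None
    ht = num / (p0 * n_sub)              # denominator is fixed by design
    # Assumed variance estimator: the HT estimate is a sample mean of the
    # rescaled outcomes w, so use the finite-population sampling variance.
    w = [y0 * b * n / n_sub for y0, b in untreated]
    var_hat = (1 - (n - k) / n) * variance(w) / (n - k)
    return natural, hajek, ht, var_hat


ids = [sid for sid, _, _ in GROUP4]
results = [estimators(GROUP4, set(t)) for t in combinations(ids, K)]
ht_values = [r[2] for r in results]
var_hats = [r[3] for r in results]
n_undefined = sum(r[0] is None for r in results)
# Replacing "NA" by zero pulls the average of the natural estimator toward 0.
natural_zero_filled = mean([r[0] or 0 for r in results])

print("HT mean and true variance:", mean(ht_values), pvariance(ht_values))
print("average variance estimate:", mean(var_hats))
print("undefined natural estimates:", n_undefined)
print("natural estimator with NA set to 0, mean:", natural_zero_filled)
```

The HT estimate stays defined in every randomization because its denominator depends only on the design probability and the subgroup size, not on which subjects happen to be untreated.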

If instead one wanted to estimate , the individual variable conditional group average outcome for the untreated under 25% coverage, all estimators would be defined and unbiased. Note that the estimators are the same as given above for group 4 but with replaced with . Let us consider the variance of the HT estimator. As defined in Theorem 1, will take on the values 0, 0.44, 0.44, 0.44 over the randomizations, for an average of 0.33, which can be seen to be the true variance of over the 4 randomizations in Table 4. This is a lower variance than either of the other two estimators, which both have a true variance over the randomizations of 0.5.

Estimator  Rand. 1  Rand. 2  Rand. 3  Rand. 4  Mean
Natural       0        1        2        1       1
Hájek         0        1        2        1       1
HT            0      1.333    1.333    1.333     1
Table 4: Estimator Values over the possible randomizations of Group 3
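The variance comparison for group 3 can be checked the same way, by enumerating the four randomizations that treat exactly one subject. As before, this is a sketch under assumptions: the estimator and variance-estimator forms are assumed (they reproduce the values quoted in the text), and the names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance, variance

# Group 3 of Table 2 as (ID, Y(0), b).
GROUP3 = [(31, 2, 1), (32, 1, 0), (33, 0, 1), (34, 1, 0)]
K = 1  # 25% coverage: exactly 1 of the 4 subjects is treated


def natural_ht_var(group, treated):
    """Return (natural, HT, HT variance estimate) for one randomization."""
    n, k = len(group), len(treated)
    p0 = (n - k) / n                     # Pr(Z_ij = 0) under the mixed strategy
    n_sub = sum(b for _, _, b in group)  # subgroup members in the whole group
    untreated = [(y0, b) for sid, y0, b in group if sid not in treated]
    num = sum(y0 * b for y0, b in untreated)
    den = sum(b for _, b in untreated)
    natural = num / den if den else None  # never None here: 2 members, 1 treated
    ht = num / (p0 * n_sub)
    # Assumed variance estimator, as a sampling variance of rescaled outcomes.
    w = [y0 * b * n / n_sub for y0, b in untreated]
    var_hat = (1 - (n - k) / n) * variance(w) / (n - k)
    return natural, ht, var_hat


ids = [sid for sid, _, _ in GROUP3]
results = [natural_ht_var(GROUP3, set(t)) for t in combinations(ids, K)]
natural_values = [r[0] for r in results]
ht_values = [r[1] for r in results]
var_hats = [r[2] for r in results]

print("natural: mean", mean(natural_values), "variance", pvariance(natural_values))
print("HT:      mean", mean(ht_values), "variance", pvariance(ht_values))
print("average HT variance estimate:", mean(var_hats))
```

Both estimators average to the true subgroup mean of 1 here, and the averaged variance estimate matches the randomization variance of the HT estimator.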

I do not further speculate on how the other estimators would be extended to the cross-conditioning setting. However, in this setting the HT style estimator , for example, under the realized randomization , would be 0 and the variance estimate would be 0, as group 1, the only group with , is assigned to coverage.

6 Discussion

In this article I define a set of conditional estimators for rare subgroup analysis following the style of the HT marginal estimator. I show that these estimators are unbiased provided there is at least one subject in the group, or population, of interest within the subgroup of interest. I provide variance estimates for average groups level and population