Targeted Maximum Likelihood Estimation of Community-based Causal Effect of Community-Level Stochastic Interventions

06/15/2020 ∙ by Chi Zhang, et al.

Unlike commonly used parametric regression models such as mixed models, which can easily violate the required statistical assumptions and result in invalid statistical inference, targeted maximum likelihood estimation allows more realistic data-generating models and provides doubly robust, semi-parametric efficient estimators. Targeted maximum likelihood estimators (TMLEs) for the causal effect of a community-level static exposure were previously proposed by Balzer et al. In this manuscript, we build on this work, present identifiability results, and develop two semi-parametric efficient TMLEs for estimating the causal effect of a single time-point community-level stochastic intervention whose assignment mechanism can depend on measured and unmeasured environmental factors and on individual-level covariates. The first, community-level TMLE is developed under a general hierarchical non-parametric structural equation model, which can incorporate pooled individual-level regressions for estimating the outcome mechanism. The second, individual-level TMLE is developed under a restricted hierarchical model in which the additional assumption of no covariate interference within communities holds. The proposed TMLEs have several crucial advantages. First, both TMLEs can make use of individual-level data in the hierarchical setting, and can potentially reduce finite sample bias and improve estimator efficiency. Second, the stochastic intervention framework provides a natural way of defining and estimating causal effects when the exposure variables are continuous, are discrete with multiple levels, or cannot be directly intervened on. Third, the positivity assumption needed for our proposed causal parameters can be weaker than the version of positivity required for other causal parameters.




1 Introduction

1.1 Motivation

The literature in fields such as epidemiology, econometrics and social science on the causal impact of community-level interventions is rapidly evolving, both in observational studies and in randomized trials. In observational settings, there is a rich literature on the assessment of causal effects of families, schools and neighborhoods on child and adolescent development (5; 20). For instance, the problem addressed by (4) is to estimate the impact of community violence exposure on anxiety among children of African American mothers with depression. Similarly, randomized community trials have increased in recent years. As pointed out by (14) and (24), scientifically speaking, community randomized controlled trials (CRCTs) would be a superior strategy to estimate the effects of community-level exposures, given the self-selection and other difficulties that arise in observational studies. One example is the MTO study, which estimates the effects of lower-poverty neighborhoods on crime for female and male youth (11). Another CRCT example is the ongoing SEARCH study, which estimates the effects of community-level interventions for the elimination of HIV in rural communities in East Africa (25). Despite recent statistical advances, many current applications still rely on estimation techniques such as random effect models (or mixed models) (12) and the generalized estimating equations (GEE) approach (13; 7). However, those methods define the causal effect of interest as a coefficient in a most likely misspecified regression model, often resulting in bias and invalid statistical inference in observational settings, and in loss of efficiency in randomized community trials. By contrast, the targeted maximum likelihood estimator (TMLE) is constructed based on the efficient influence curve, and therefore inherits its double robustness and local efficiency properties. Instead of using the efficient influence curve directly to construct an efficient estimating equation, the TMLE is obtained by constructing a locally least favorable submodel whose score (derivative of the log-likelihood) spans the efficient influence curve (3; 26).

Deterministic interventions, in which each unit's treatment is set to a fixed value or to a value defined by a deterministic function of the covariates, are the main strategy implemented in the current literature for the estimation of causal effects from observational data. One causal assumption needed for parameter identifiability is the positivity assumption. For example, the strong positivity assumption requires that all individuals in the population have a nonzero probability of receiving all levels of the treatment. As argued by (17), this strong assumption could be quite unrealistic in many cases. For example, patients with certain characteristics may never receive a particular treatment. A stochastic intervention, on the other hand, is one in which each subject receives a probabilistically assigned treatment based on a known, specified mechanism. Because the form of the positivity assumption needed for identifiability is model- and parameter-specific, stochastic intervention causal parameters are natural candidates when a weaker version of positivity is needed than the one required by other causal parameters for continuous exposures. Furthermore, a policy intervention will lead to stochastic rather than deterministic interventions if the exposure of interest can only be manipulated indirectly, such as when studying the benefits of vigorous physical activity on a health outcome of interest in the elderly (2): it is unrealistic to force every elderly person to maintain a certain level of physical activity according to a deterministic rule. In light of these considerations, stochastic interventions can be a more flexible strategy for defining a question of interest, and one that is better supported by the data than deterministic interventions. Thus, stochastic intervention causal parameters are well suited to estimating causal effects of realistic policies, and can also be naturally used to define and estimate causal effects of continuous treatments or categorical multilevel treatments (9).

1.2 Organization of article

In this article, we apply the roadmap for targeted learning of a causal effect (18). The rest of the article is organized as follows. In Section 2 we specify the causal model through a non-parametric structural equation model (NPSEM), which allows us to define the community-level causal effect of interest for arbitrary community-level stochastic interventions as a parameter of the NPSEM, define the corresponding observed data structure, and establish the identifiability of the causal parameter from the observed data generating distribution. We allow for general types of single time-point interventions, including static, dynamic and stochastic interventions. In other words, there are no further restrictions on the intervention distributions, which can be either degenerate (for deterministic interventions) or non-degenerate (for stochastic interventions). Next, Sections 3 and 4 introduce two different TMLEs of the counterfactual mean outcome across communities under a community-level intervention, based on community-level and individual-level analysis, respectively. Both TMLEs can make use of individual-level data in the hierarchical setting. The first, community-level TMLE is developed under a general hierarchical causal model and can incorporate working models about the dependence structure in a community. In other words, the Super Learner library of candidate estimators for the outcome regression can be expanded to include pooled individual-level regressions based on the working model. The first TMLE also includes the case of observing one individual per community unit as a special case. The second, individual-level TMLE is developed under a more restricted hierarchical model in which the additional assumption of no covariate interference within communities holds.

2 Definition of statistical estimation problem

2.1 General hierarchical causal model

Throughout this article, we use bold capital letters to denote random vectors and matrices. In studies of community-level interventions, we begin with a simple scenario that involves randomly selecting $J$ independent communities from some target population of communities, sampling individuals from those chosen communities, and measuring baseline covariates and outcomes on each sampled individual at a single time point. The number of sampled individuals is not fixed across communities, so communities are indexed by $j = 1, \ldots, J$ and individuals within community $j$ are indexed by $i = 1, \ldots, N_j$.

After selection of the communities and individuals, pre-intervention covariates and a post-intervention outcome are measured on each sampled unit. Because only some of the pre-intervention covariates have a clear individual-level counterpart, the pre-intervention covariates separate into two sets. First, let $W_{j,i}$ denote the ($1 \times p$) vector of individual-level baseline characteristics, so that $\mathbf{W}_j = (W_{j,i} : i = 1, \ldots, N_j)$ is an $N_j \times p$ matrix of individual-level characteristics. Second, let $E_j$ represent the vector of community-level (environmental) baseline characteristics that have no individual-level counterpart and are shared by all community members, including the number of individuals selected within the community (i.e., $N_j \in E_j$). Last, $A_j$ is the exposure level assigned to, or naturally occurring in, community $j$, and $\mathbf{Y}_j = (Y_{j,i} : i = 1, \ldots, N_j)$ is the vector of individual outcomes of interest.

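In order to translate the scientific question of interest into a formal causal quantity, we first specify an NPSEM with endogenous variables $X = (E, \mathbf{W}, A, \mathbf{Y})$ that encodes our knowledge about the causal relationships among those variables and can be applied in both observational settings and randomized trials (15; 16):

$$E = f_E(U_E), \qquad \mathbf{W} = f_W(E, U_W), \qquad A = f_A(E, \mathbf{W}, U_A), \qquad \mathbf{Y} = f_Y(E, \mathbf{W}, A, U_Y), \tag{2.1}$$
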
where the components of $U = (U_E, U_W, U_A, U_Y)$ are exogenous error terms, which are unmeasured and random with an unknown distribution $P_U$. Given an input $U$, the functions $F = \{f_E, f_W, f_A, f_Y\}$ deterministically assign a value to each of the endogenous variables. For example, model (2.1) assumes that each individual's outcome $Y$ is affected by its baseline community-level and individual-level covariates $(E, \mathbf{W})$ together with its community-level intervention(s) $A$ and unobserved factors $U_Y$. First, while we might have a specification of $f_A$, the structural equations do not necessarily restrict the functional form of the causal relationships, which could be nonparametric (entirely unspecified), semiparametric, or parametric in a way that incorporates domain knowledge. Second, as summarized by (1), structural causal model (2.1) covers a wide range of practical scenarios, as it allows for the following types of between-individual dependencies within a community: (i) the individual-level covariates (and outcomes) among members of a community may be correlated as a consequence of shared measured and unmeasured community-level covariates $(E, U_E)$, and of possible correlations between unmeasured individual-level error terms $(U_W, U_Y)$; (ii) an individual $i$'s outcome $Y_{j,i}$ may influence another individual $l$'s outcome $Y_{j,l}$ within community $j$; and (iii) an individual's baseline covariates $W_{j,i}$ may influence another individual's outcome $Y_{j,l}$. We can also make a restricting assumption about the third type of between-individual dependence, in which case the structural equation $f_Y$ is specified under that assumption; more details are discussed in Section 4.4. Third, an important ingredient of this model is the assumption that distinct communities are causally independent and identically distributed. The NPSEM defines a collection of distributions of $(U, X)$, representing the full data model, where each distribution is determined by $F$ and $P_U$ (i.e., $P_{U,X,0}$ is the true probability distribution of $(U, X)$). We denote the model for $P_{U,X,0}$ with $\mathcal{M}^F$.

2.2 Counterfactuals and stochastic interventions

The full data model $\mathcal{M}^F$ allows us to define counterfactual random variables as functions of $U$, corresponding with arbitrary interventions. For example, with a static intervention on $A$, the counterfactual $\mathbf{Y}_a$ can be defined as $f_Y(E, \mathbf{W}, a, U_Y)$, replacing the structural equation $f_A$ with the constant $a$ (30). Thus, $\mathbf{Y}_{j,a}$ represents the vector of individual-level outcomes that would have been obtained in community $j$ if all individuals in that community had actually been treated according to the exposure level $a$. More generally, we can replace the data generating functions for $A$ that correspond with degenerate choices of distributions for drawing $A$, given $U$ and $(E, \mathbf{W})$, by user-specified conditional distributions of $A^*$. Such non-degenerate choices of intervention distributions are often referred to as stochastic interventions.

First, let $g^*$ denote our selection of a stochastic intervention identified by a set of multivariate conditional distributions of $A^*$, given the baseline covariates $(E, \mathbf{W})$. For convenience, we represent the stochastic intervention with a structural equation $A^* = f_{A^*}(E, \mathbf{W}, U_{A^*})$ in terms of random errors $U_{A^*}$, and so define $\mathbf{Y}_{g^*} = f_Y(E, \mathbf{W}, A^*, U_Y)$. Then $\mathbf{Y}_{j,g^*}$ denotes the corresponding vector of individual-level counterfactual outcomes for community $j$. Second, let $Y^c_j$ denote a scalar representing a community-level outcome that is defined as an aggregate of the outcomes measured among individuals who are members of a community, so that $Y^c_{j,g^*}$ is the corresponding community-level counterfactual of interest. One typical choice of $Y^c_j$ is the weighted average response among the $N_j$ individuals sampled from community $j$, i.e., $Y^c_j = \sum_{i=1}^{N_j} \alpha_{j,i} Y_{j,i}$, for some user-specified set of weights $\alpha$ for which $\sum_{i=1}^{N_j} \alpha_{j,i} = 1$. If the underlying community size $N_j$ differs across communities, a natural choice of $\alpha_{j,i}$ is the reciprocal of the community size (i.e., $\alpha_{j,i} = 1/N_j$).
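The weighted-average aggregation can be illustrated with a minimal Python sketch. The function name `community_outcome` and its arguments are hypothetical, introduced only for illustration; the default weights are the reciprocal of the community size mentioned in the text.

```python
def community_outcome(y, weights=None):
    """Aggregate individual outcomes of one community into a scalar.

    y       : list of individual-level outcomes for one community
    weights : optional user-specified weights that must sum to 1;
              defaults to 1/N for a community of size N (an assumption
              matching the "reciprocal of the community size" choice).
    """
    n = len(y)
    if n == 0:
        raise ValueError("a community needs at least one sampled individual")
    if weights is None:
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("weights and outcomes must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(a * yi for a, yi in zip(weights, y))


# Two communities of different sizes, each summarized by its mean outcome.
communities = [[1, 0, 1, 1], [0, 1]]
community_level = [community_outcome(y) for y in communities]
```

With equal weights the aggregate is simply the within-community mean, so communities of different sizes contribute one scalar each to the community-level analysis.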

2.3 Target parameter on the NPSEM

We focus on community-level causal effects where all communities in the target population receive the intervention $g^*$. Our causal parameter of interest is then given by

$$\Psi^F(P_{U,X,0}) = E_{U,X}\left[ Y^c_{g^*} \right].$$

To simplify the expression, we use in the remainder of the article. We also assume (without loss of generality) that the community-level outcome $Y^c$ is bounded in $[0, 1]$. If instead $Y^c \in [a, b]$, the original outcome is transformed into $(Y^c - a)/(b - a)$, and our target parameter then corresponds to the mean of the transformed counterfactual outcome, $E_{U,X}\left[ (Y^c_{g^*} - a)/(b - a) \right]$. Statistical inference such as the point estimate, limiting distribution and confidence interval for the latter target parameter can be immediately mapped into statistical inference for the original target parameter based on $Y^c$, by simply multiplying by $(b - a)$ (8).
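A minimal Python sketch of this rescaling follows. The bounds `a` and `b` and all function names are assumptions made for illustration. Note that mapping a point estimate or confidence limit back to the original scale multiplies by $(b - a)$ and also adds the lower bound $a$ back, while standard errors and interval widths only scale by $(b - a)$.

```python
def to_unit_interval(y, a, b):
    """Map an outcome bounded in [a, b] onto [0, 1]."""
    if not a < b:
        raise ValueError("need a < b")
    return (y - a) / (b - a)


def from_unit_interval(value, a, b):
    """Map a point estimate or confidence limit back to the [a, b] scale."""
    return a + (b - a) * value


def rescale_standard_error(se, a, b):
    """Standard errors only scale by the range; the shift by a drops out."""
    return (b - a) * se


# Hypothetical example: an outcome bounded in [10, 20].
y_scaled = to_unit_interval(15.0, 10.0, 20.0)
psi_original = from_unit_interval(0.5, 10.0, 20.0)
se_original = rescale_standard_error(0.02, 10.0, 20.0)
```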

One type of stochastic intervention is a shifted version of the current treatment mechanism, defined through a known shift function. A simple example is a constant shift of the observed exposure. Another, more complex type is a stochastic dynamic intervention, in which the intervention can be viewed as a random assignment among dynamic rules. A simple example corresponding to the previous shift function truncates the shift, so that the shifted exposure is always bounded by the minimum of the observed exposure.
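A shift intervention of this kind can be sketched in a few lines of Python. The function `shift_exposure`, the shift size `delta` and the optional cap `upper` are hypothetical names; the exact truncation rule used in the article is not recoverable from the text, so the cap is left as a user-supplied assumption.

```python
def shift_exposure(a, delta, upper=None):
    """Return the exposure after a constant shift, optionally truncated.

    a     : observed exposure value
    delta : constant shift applied to the exposure
    upper : optional cap on the shifted exposure (an assumption here);
            when given, the shifted value never exceeds it.
    """
    shifted = a + delta
    if upper is None:
        return shifted
    return min(shifted, upper)


observed = [0.5, 1.0, 2.5, 3.0]
# Plain constant shift.
shifted = [shift_exposure(a, 1.0) for a in observed]
# Truncated shift, capped at the largest observed exposure (one possible choice).
capped = [shift_exposure(a, 1.0, upper=max(observed)) for a in observed]
```

Capping the shift keeps the intervened exposure inside the range supported by the data, which is the practical motivation for truncated shifts.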

One might also be interested in contrasts of the expected community-level outcome across the target population of communities under different interventions, i.e.,

$$E_{U,X}\left[ Y^c_{g_1^*} \right] - E_{U,X}\left[ Y^c_{g_2^*} \right],$$

where $g_1^*$ and $g_2^*$ are two different stochastic interventions.

Finally, additive treatment effect is a special case of average causal effect with two static interventions and for any , i.e.,

2.4 Link to observed data

Consider the study design presented above where, for a randomly selected community, the observed data consist of the measured pre-intervention covariates, the intervention assignment, and the vector of individual-level outcomes. Formally, one observation on community $j$ is coded as

$$O_j = (E_j, \mathbf{W}_j, A_j, \mathbf{Y}_j),$$

which follows the typical time ordering for the variables measured on the individuals within the community.

Assume the observed data consist of $J$ independent and identically distributed copies of $O_j \sim P_0$, where $P_0$ is an unknown underlying probability distribution in a model space $\mathcal{M}$. Here $\mathcal{M}$ denotes the statistical model, that is, the set of possible distributions for the observed data, and it only involves modeling $g_0$ (i.e., specification of $f_A$). The true observed data distribution is thus $P_0 \in \mathcal{M}$.

2.5 Identifiability

By defining the causal quantity of interest in terms of stochastic interventions (and the target causal parameter as a parameter of the distribution $P_{U,X,0}$) on the NPSEM, and by providing an explicit link between this model and the observed data, we lay the groundwork for addressing identifiability through $P_0$.

In order to express $\Psi^F(P_{U,X,0})$ as a parameter of the distribution $P_0$ of the observed data $O$, we now need to address the identifiability of $\Psi^F(P_{U,X,0})$ by adding two key assumptions on the NPSEM: the randomization assumption, also called the "no unmeasured confounders" assumption (Assumption 1), and the positivity assumption (Assumption 2). The identifiability assumptions are briefly reviewed here; for details on identifiability, we refer to (21; 27; 28; 9).

Assumption 1.

$$A \perp \mathbf{Y}_a \mid E, \mathbf{W},$$

where the counterfactual random variable $\mathbf{Y}_a$ represents a collection of outcomes measured on the individuals from a community if its intervention is set to $A = a$ in causal model (2.1), replacing the structural equation $f_A$ with the constant $a$.

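Assumption 2.

$$\sup_{a \in \mathcal{A}} \frac{g^*(a \mid E, \mathbf{W})}{g_0(a \mid E, \mathbf{W})} < \infty \quad \text{almost everywhere},$$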

where , and assume for some small .

Informally, Assumption 1 restricts the allowed distribution for $U$ to ensure that $A$ and $\mathbf{Y}$ share no common causes beyond the measured variables in $(E, \mathbf{W})$. For example, Assumption 1 holds if $U_A$ is independent of $U_Y$, given $(E, \mathbf{W})$. This randomization assumption then implies $P(\mathbf{Y}_a = y \mid A = a, E, \mathbf{W}) = P(\mathbf{Y}_a = y \mid E, \mathbf{W})$. In addition, because $g^*$ is specified by the user in Assumption 2, a good selection of $g^*$ can define a causal parameter of interest without generating the unstable weights that cause violations of the positivity assumption. Therefore, this positivity assumption is easier to satisfy than the positivity assumptions required by other causal parameters for continuous interventions.
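To see why the choice of intervention matters for positivity, consider a toy case that is not taken from the article: the observed exposure is normal with a known mean and standard deviation, and the intervention shifts it by a constant `delta`. The Python sketch below (all names hypothetical) computes the density ratio of the intervened to the observed exposure mechanism, which is the quantity that has to stay bounded.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def shift_weight(a, mean, sd, delta):
    """Ratio g*(a) / g(a) when g is N(mean, sd^2) and g* is g shifted by delta."""
    return normal_pdf(a, mean + delta, sd) / normal_pdf(a, mean, sd)


# The ratio equals exp(((a - mean) * delta - delta**2 / 2) / sd**2), so it
# grows without bound in the upper tail when delta > 0.
weights = [shift_weight(a, 0.0, 1.0, 1.0) for a in (0.5, 2.0, 3.0)]
```

In this toy setting a small shift keeps the ratio moderate over the bulk of the data, while a large shift produces extreme weights in the tail; truncating the shift is one way to keep the ratio bounded.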

Under Assumptions 1 and 2, jointly with the consistency assumption (i.e., $A = a$ implies $\mathbf{Y}_a = \mathbf{Y}$),

So our counterfactual distribution can be written as:

by the law of iterated conditional expectation
by consistency assumption

with respect to some dominating measure .

Then, $\Psi^F(P_{U,X,0})$ is identified by the G-computation formula (21):

This provides us with a general identifiability result for , the causal effect of the community-level stochastic intervention on any community-level outcome that is some real valued function of the individual-level outcome :
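In expectation form, and in notation that is assumed here rather than recovered from the source ($\bar{Q}^c_0$ for the conditional mean of the community-level outcome $Y^c$ given exposure and covariates, and $g^*$ for the user-specified intervention distribution), the standard G-computation identification for a stochastic intervention reads:

$$E_{U,X}\left[ Y^c_{g^*} \right] = E_{E,\mathbf{W}}\left[ E_{g^*}\left[ \bar{Q}^c_0(A^*, E, \mathbf{W}) \mid E, \mathbf{W} \right] \right], \qquad \bar{Q}^c_0(a, e, w) = E_0\left[ Y^c \mid A = a, E = e, \mathbf{W} = w \right].$$

The inner expectation averages the outcome regression over exposures drawn from $g^*$ for fixed covariates; the outer expectation averages over the covariate distribution of the communities.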

2.6 The statistical parameter and model for observed data

If we assume only the randomization assumption of the previous section, then the statistical model $\mathcal{M}$ is nonparametric. Based on the identifiability result, we note that $\Psi$ represents a mapping from a probability distribution of $O$ into a real number, and $\Psi(P_0)$ denotes the target estimand corresponding to the target causal quantity $\Psi^F(P_{U,X,0})$.

Before defining the statistical parameter, we introduce some additional notation. First, we denote the marginal distribution of the baseline covariates $(E, \mathbf{W})$ by $Q_{E,W}$, with a well-defined density $q_{E,W}$ with respect to some dominating measure $\mu_{E,W}$. There is no additional assumption of independence for $Q_{E,W}$. Second, let $G$ denote the observed conditional distribution of the exposure $A$, with conditional density $g(A \mid E, \mathbf{W})$. Third, we assume that all $Y$ within a community are sampled from the distribution $Q_Y$ with density given by $q_Y(Y \mid A, E, \mathbf{W})$, conditional on the exposure $A$ and the baseline covariates $(E, \mathbf{W})$. We now introduce the notation $P = P_{Q,G}$ for $Q = (Q_Y, Q_{E,W})$, and the statistical model becomes $\mathcal{M} = \{ P_{Q,G} : Q \in \mathcal{Q}, G \in \mathcal{G} \}$, where $\mathcal{Q}$ and $\mathcal{G}$ denote the parameter spaces for $Q$ and $G$, respectively, and here $\mathcal{Q}$ is nonparametric.

Next, we define $G^*$ as the user-supplied intervention with a new density $g^*$, which replaces the observed conditional distribution $G$. So $g^*$ is a conditional distribution that describes how each intervened treatment is produced conditional on the baseline covariates $(E, \mathbf{W})$. Given $Q$ and $G^*$, we use $O^* = (E, \mathbf{W}, A^*, \mathbf{Y}^*)$ to denote a random variable generated under the post-intervention distribution $P_{Q,G^*}$. Namely, $P_{Q,G^*}$ is the G-computation formula for the post-intervention distribution of the observed data under the stochastic intervention $g^*$ (21), and the likelihood for $P_{Q,G^*}$ can be factorized as:

$$p_{Q,G^*}(O^*) = q_Y(\mathbf{Y}^* \mid A^*, E, \mathbf{W}) \, g^*(A^* \mid E, \mathbf{W}) \, q_{E,W}(E, \mathbf{W}).$$

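Thus our target statistical quantity is now defined as $\Psi(P_0) = E_{Q_0, G^*}\left[ Y^{c,*} \right]$, where $\Psi(P_0)$ is the target estimand of the true distribution of the observed data (i.e., $\Psi$ is a mapping from the statistical model $\mathcal{M}$ to $\mathbb{R}$). We then define $\bar{Q}(A, E, \mathbf{W})$ as the conditional mean of the individual-level outcome evaluated under the common-in-$i$ distribution $Q_Y$, and $\bar{Q}^c(A, E, \mathbf{W}) = E(Y^c \mid A, E, \mathbf{W})$ as the conditional mean of the community-level outcome. Now we can refer to $Q = (\bar{Q}^c, Q_{E,W})$ as the part of the observed data distribution that our target parameter is a function of (i.e., with a slight abuse of notation, $\Psi(P_0) = \Psi(Q_0)$), and the parameter can be written as:

$$\Psi(Q_0) = \int_{e, w} \int_{a \in \mathcal{A}} \bar{Q}^c_0(a, e, w) \, g^*(a \mid e, w) \, d\mu_a(a) \, q_{E,W,0}(e, w) \, d\mu_{E,W}(e, w),$$

with respect to some dominating measures $\mu_a$ and $\mu_{E,W}$, where $\mathcal{A}$ is the common support of $A$.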
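A plug-in version of this parameter can be approximated by Monte Carlo: for each community, draw exposures from the intervention distribution, average the fitted outcome regression over those draws, and then average over communities. The Python sketch below is illustrative only; `gcomp_plugin`, `q_bar` and `sample_a_star` are hypothetical names, and the outcome regression is assumed to be already fitted.

```python
import random


def gcomp_plugin(covariates, q_bar, sample_a_star, n_draws=200, seed=0):
    """Monte Carlo plug-in estimate of the mean outcome under g*.

    covariates    : list of (e, w) pairs, one per community
    q_bar         : fitted outcome regression, q_bar(a, e, w) -> float
    sample_a_star : draws one exposure from g*(. | e, w), called as
                    sample_a_star(rng, e, w)
    """
    rng = random.Random(seed)
    community_means = []
    for e, w in covariates:
        # Inner integral: average q_bar over exposures drawn from g*.
        draws = [q_bar(sample_a_star(rng, e, w), e, w) for _ in range(n_draws)]
        community_means.append(sum(draws) / n_draws)
    # Outer integral: empirical distribution of the baseline covariates.
    return sum(community_means) / len(community_means)


# Hypothetical inputs: a linear outcome regression and two interventions.
covs = [(0.0, 1.0), (1.0, 2.0)]
q_fit = lambda a, e, w: 0.1 + 0.2 * a + 0.1 * e
static = lambda rng, e, w: 1.0                      # degenerate g*
noisy = lambda rng, e, w: 1.0 + rng.gauss(0.0, 0.1)  # non-degenerate g*

psi_static = gcomp_plugin(covs, q_fit, static)
psi_noisy = gcomp_plugin(covs, q_fit, noisy, n_draws=2000)
```

A degenerate sampler recovers the static-intervention case, which shows how static interventions sit inside the stochastic framework as a special case.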

Sometimes researchers might be interested in target quantities defined as the difference or ratio of the parameters under two stochastic interventions. For example, one might define two target estimands $\Psi_{g_1^*}(P_0)$ and $\Psi_{g_2^*}(P_0)$ evaluated under two different interventions $g_1^*$ and $g_2^*$, and then define the target quantity as $\Psi_{g_1^*}(P_0) - \Psi_{g_2^*}(P_0)$. More generally, target quantities can be expressed as Euclidean-valued functions of a collection $\{ \Psi_{g^*}(P_0) : g^* \in \mathcal{G}^* \}$, where $\mathcal{G}^*$ denotes a finite set of possible stochastic interventions.

3 Estimation and inference under the general hierarchical causal model

In the previous section, we defined a statistical model $\mathcal{M}$ for the distribution of $O$, and a statistical target parameter mapping $\Psi$ for which $\Psi(P_0)$ depends on $P_0$ only through a relevant part $Q_0$ of $P_0$. Now we want to estimate $\Psi(Q_0)$ via a targeted maximum likelihood estimator (TMLE) and construct an asymptotically valid confidence interval through the efficient influence curve (EIC). Furthermore, we present a novel method for estimating the outcome regression that incorporates additional knowledge about the data-generating mechanism that might be known by design.

As a two-stage procedure, TMLE needs to estimate both the outcome regression $\bar{Q}_0$ and the treatment mechanism $g_0$. Since TMLE solves the EIC estimating equation, the estimator inherits the double robustness property of this EIC and is guaranteed to be consistent (i.e., asymptotically unbiased) if either $\bar{Q}_0$ or $g_0$ is consistently estimated. For example, in a community randomized controlled trial $g_0$ is known to be 0.5 and can be consistently estimated, so the TMLE will always be consistent. In addition, TMLE is efficient when both are consistently estimated. In other words, when the estimator of $g_0$ is consistent, a choice of the initial estimator for $\bar{Q}_0$ that better approximates the true value $\bar{Q}_0$ may improve the asymptotic efficiency, along with the finite sample bias and variance, of the TMLE.


3.1 The efficient influence curve

Before constructing a community-level TMLE of , we must understand its efficient influence curve. The EIC, evaluated at the true distribution , is given by:


Here and are defined as the projection of the EIC onto the tangent space of at and at , given , respectively. Note that the projection of the EIC onto the tangent space of (i.e., the exposure mechanism) is zero.
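For stochastic-intervention means of this type, the efficient influence curve is commonly written as a weighted residual plus a centered plug-in term: $\frac{g^*}{g_0}(A \mid E, \mathbf{W}) \left( Y^c - \bar{Q}^c_0(A, E, \mathbf{W}) \right) + E_{g^*}\left[ \bar{Q}^c_0 \mid E, \mathbf{W} \right] - \Psi_0$. The Python sketch below assumes that form (a reconstruction, since the displayed equation is missing from the source) and shows how its sample variance yields a Wald-type confidence interval. All argument names are hypothetical.

```python
import math


def eic_inference(weights, y, q_obs, q_star, psi, z=1.96):
    """Evaluate the assumed EIC per community and build a Wald interval.

    weights : g*(A|E,W) / g(A|E,W) evaluated at the observed exposure
    y       : community-level outcomes
    q_obs   : outcome regression at the observed exposure
    q_star  : outcome regression averaged over g*, given covariates
    psi     : point estimate of the target parameter
    """
    n = len(y)
    if n < 2:
        raise ValueError("need at least two communities")
    eic = [h * (yc - qo) + qs - psi
           for h, yc, qo, qs in zip(weights, y, q_obs, q_star)]
    mean = sum(eic) / n
    var = sum((d - mean) ** 2 for d in eic) / (n - 1)
    se = math.sqrt(var / n)
    return eic, se, (psi - z * se, psi + z * se)


# Tiny hypothetical data set with two communities.
eic, se, ci = eic_inference(
    weights=[1.0, 1.0], y=[0.2, 0.4], q_obs=[0.2, 0.4],
    q_star=[0.2, 0.4], psi=0.3,
)
```

The community is the unit of independence here, so the variance is divided by the number of communities rather than the number of individuals.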

3.2 The community-level TMLE

The community-level TMLE first obtains an initial estimate of the conditional mean $\bar{Q}^c_0$ of the community-level outcome, together with an estimate of the community-level density of the conditional treatment distribution $g_0$. The second, targeting step creates a targeted estimator of $\bar{Q}^c_0$ by updating the initial fit through a parametric fluctuation that exploits the information in the estimated density of the conditional treatment distribution. The plug-in community-level TMLE is then computed from the updated estimate and the empirical distribution of $(E, \mathbf{W})$. In this subsection, we describe the community-level TMLE algorithm for estimating the community-based effect under community-level stochastic interventions. For further discussion, please see (9; 22; 1).
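The targeting step can take several forms. The Python sketch below uses one common variant, assumed here for illustration and not necessarily the exact submodel of this article: an intercept-only logistic fluctuation of the initial fit, estimated by weighted maximum likelihood with weights equal to the estimated density ratio of the intervention to the observed exposure mechanism. The function and argument names are hypothetical, and outcomes are assumed to lie in [0, 1].

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def target_update(y, q_init, weights, n_iter=50, tol=1e-12):
    """Fit epsilon in logit(Q_eps) = logit(Q_init) + epsilon by weighted MLE.

    Newton steps solve sum_j weights_j * (y_j - Q_eps_j) = 0, which is the
    score equation of the weighted Bernoulli log-likelihood.
    """
    bound = 1e-9  # keep the initial fit away from 0 and 1 before taking logits
    offsets = [logit(min(max(q, bound), 1.0 - bound)) for q in q_init]
    eps = 0.0
    for _ in range(n_iter):
        preds = [expit(o + eps) for o in offsets]
        score = sum(w * (yc - p) for w, yc, p in zip(weights, y, preds))
        info = sum(w * p * (1.0 - p) for w, p in zip(weights, preds))
        if abs(score) < tol or info == 0.0:
            break
        eps += score / info
    return eps, [expit(o + eps) for o in offsets]


# Hypothetical inputs for four communities.
y = [0.1, 0.6, 0.9, 0.3]
q_init = [0.2, 0.5, 0.7, 0.4]
w = [1.0, 0.5, 2.0, 1.5]
eps, q_star = target_update(y, q_init, w)
```

After the update, the weighted residuals sum to zero, which is what makes the subsequent plug-in estimator solve the residual component of the efficient influence curve equation.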

3.2.1 Estimation of exposure mechanisms and

A data-adaptive estimator of a conditional density that can be used to estimate the exposure mechanism is proposed by Díaz and van der Laan (10). Here, we build on this work and present how to use the histogram-like estimator to estimate the community-level multivariate exposure mechanism . First let's define , where the exposures and baseline covariates denote the random variables drawn jointly from the distribution with the density . Here denotes the marginal density of the baseline covariates , and communities are indexed with . Then, let's denote . The fitting algorithm for the non-parametric estimator is equivalent, except that now the exposures and baseline covariates are randomly drawn from with the density defined as , where is determined by the user-supplied (stochastic) intervention.

Note that can be multivariate (i.e., ) where represents the number of treatment variables, and any of its components can be either binary, categorical or continuous. The joint probability model for can be factorized as a sequence:

where each of these conditional probability models is fitted separately, depending on the type of the -specific outcome variable . For binary , the conditional probability will be estimated by a user-specified library of candidate algorithms, including both parametric estimators and data-adaptive estimators. For continuous (or categorical) , consider a sequence of values that span the range of and define bins and the corresponding bin indicators, in which case each bin indicator is used as a binary outcome in a separate user-specified library of candidate algorithms, with predictors given by . In this way the joint probability is factorized into an entire tree of binary regression models.

For simplicity (and without loss of generality), we now suppose is univariate (i.e., ) and continuous and a general template of a fitting algorithm for is summarized below:

  1. Initialization. Consider the usual setting in which we observe independent and identically distributed copies of the random variable , where the observed exposures are continuous.

  2. Estimation of .

    1. As described above, consider a sequence of values that span the support of values into bin intervals for a continuous variable . Then any observed data point belongs to one of the intervals, in other words, for each possible value (even if this is not in the observed , there always exists a such that ), and the length (bandwidth) of the interval can be defined as .

    2. Then let the mapping denote a unique index of the indicator in that falls in, where if , namely . Moreover, we use to denote a binary indicator of whether the observed belongs to bin (i.e., for all ).

      • This is similar to methods for censored longitudinal data, which treat exposures as censored or missing once the indicator jumps from 0 to 1.

      • Since is a realization of the random variable for one community, the corresponding random binary indicator of whether belongs to bin can be denoted as:

    3. Then for each , a binary nonparametric regression is used to estimate the conditional probability , which corresponds to the probability of jumping from 0 to 1, given and the baseline covariates . Here for each , the corresponding nonparametric regression model is fitted only among observations that are uncensored (i.e., still at risk of getting with ). Note the above conditional probability

      which is the probability that the exposure belongs to the current interval, conditional on the exposure not belonging to any earlier interval and on the baseline covariates.

    4. Then the discrete conditional hazard function for each is defined as a normalization of the conditional probability using the corresponding interval bandwidth :

    5. Finally, for any given observation , we first find out the interval index to which belongs (i.e., ). Then the discretized conditional density of can be factorized by:

      which corresponds to the conditional probability that the exposure belongs to that interval and does not belong to any earlier interval, given the baseline covariates.

  3. The conditional density estimators of is now proportional to:


    can be estimated by either parametric or data-adaptive algorithms, or the combination of them (i.e., Super Learner). For example, using a main-term only logistic regression:

    where we assume that the dimension of is and the dimension of is , and indicates if falls within the interval

    . Alternatively, we can use Super Learner to build a convex combination of the candidate algorithms in the SL library to minimize the cross-validated risk, given a user-specified loss function.

Note that we need a principled way to determine the bin (interval) cutoffs for a continuous exposure. As proposed by Denby and Mallows (6), we can use a histogram-based method that is a compromise between the equal-bin-width and equal-area histogram methods, with the corresponding parameters selected by cross-validation. For details on constructing a histogram-like cross-validated density estimator, we refer to (10).
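The bin-hazard construction above can be sketched in Python in a deliberately simplified form. The sketch ignores baseline covariates, so each bin-specific hazard is just the proportion of at-risk observations falling in that bin; in the algorithm described in the text each hazard would instead come from a binary regression on the covariates. The function name `hazard_density` and the cutoffs are hypothetical.

```python
import bisect


def hazard_density(a_obs, cutoffs):
    """Histogram-like density estimate built from discrete bin hazards.

    a_obs   : observed continuous exposures
    cutoffs : increasing bin boundaries spanning the range of a_obs
    Returns the list of bin hazards and a function evaluating the density.
    """
    n_bins = len(cutoffs) - 1

    def bin_index(a):
        if a < cutoffs[0] or a > cutoffs[-1]:
            raise ValueError("exposure outside the range of the cutoffs")
        # The right endpoint of the last bin is treated as inside that bin.
        return min(bisect.bisect_right(cutoffs, a) - 1, n_bins - 1)

    counts = [0] * n_bins
    for a in a_obs:
        counts[bin_index(a)] += 1

    hazards = []
    at_risk = len(a_obs)
    for k in range(n_bins):
        # Hazard: share of observations not in an earlier bin that fall in bin k.
        hazards.append(counts[k] / at_risk if at_risk > 0 else 0.0)
        at_risk -= counts[k]

    def density(a):
        k = bin_index(a)
        survival = 1.0
        for m in range(k):
            survival *= 1.0 - hazards[m]
        # Probability of landing in bin k, normalized by the bin width.
        return hazards[k] * survival / (cutoffs[k + 1] - cutoffs[k])

    return hazards, density


hazards, dens = hazard_density([0.1, 0.2, 0.6, 1.5], [0.0, 0.5, 1.0, 2.0])
```

Without covariates the product of hazards collapses to the ordinary histogram density, which is a convenient sanity check; the value of the hazard formulation is that each factor can be replaced by a covariate-dependent binary regression.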

3.2.2 Loss function and initial (non-targeted) estimator of

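As an initial estimator of $\bar{Q}^c_0$, we can simply regress the community-level outcome $Y^c$ onto the exposure $A$ and baseline covariates $(E, \mathbf{W})$. This estimation can be carried out by either the usual parametric MLE or loss-based machine learning algorithms based on cross-validation, such as loss-based super learning. Given that $Y^c$ is bounded continuous or discrete for some known range, the estimation of $\bar{Q}^c_0$ can be based on the following negative Bernoulli log-likelihood loss function:

$$L(\bar{Q}^c)(O) = -\left[ Y^c \log \bar{Q}^c(A, E, \mathbf{W}) + (1 - Y^c) \log\left( 1 - \bar{Q}^c(A, E, \mathbf{W}) \right) \right]$$

or the squared error loss

$$L(\bar{Q}^c)(O) = \left( Y^c - \bar{Q}^c(A, E, \mathbf{W}) \right)^2.$$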
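Both losses are straightforward to evaluate. A minimal Python sketch follows, with hypothetical function names and assuming outcomes already scaled to [0, 1]; predictions are clipped away from 0 and 1 so the logarithms stay finite.

```python
import math


def bernoulli_loss(y, q, eps=1e-12):
    """Negative Bernoulli log-likelihood for an outcome y in [0, 1]."""
    q = min(max(q, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y * math.log(q) + (1.0 - y) * math.log(1.0 - q))


def squared_error_loss(y, q):
    """Squared error between the outcome and the predicted mean."""
    return (y - q) ** 2


def empirical_risk(loss, ys, qs):
    """Average loss over communities; candidate fits are compared on this."""
    return sum(loss(y, q) for y, q in zip(ys, qs)) / len(ys)


ys = [0.2, 0.8, 0.5]
good_fit = [0.25, 0.75, 0.5]
poor_fit = [0.9, 0.1, 0.5]
risk_good = empirical_risk(bernoulli_loss, ys, good_fit)
risk_poor = empirical_risk(bernoulli_loss, ys, poor_fit)
```

Both losses are minimized in expectation by the true conditional mean, which is why either can be used to rank candidate estimators by cross-validated risk.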

For example, for continuous , the fitted parameter in a least squares regression can be defined as:

3.2.3 Loss function and the least favorable fluctuation submodel that spans the efficient influence curve

Recall that the targeting step in the TMLE algorithm needs to define a fluctuation parametric submodel for and a corresponding user-specified loss function. Given the initial estimator of outcome mechanism , and the initial estimator of treatment mechanisms and for each community