1.1 Motivation and Background
In many areas of the social, economic and medical sciences, researchers are interested in assessing not just the association but the causal relationship between two variables, i.e., exposure to a condition and an outcome variable that is expected to be affected by the exposure. Causality plays a crucial role in decision making, because the decision maker must know the consequences of specific actions. Many studies conducted to assess the effect of exposure to a certain observed condition, as well as many experimental or non-experimental studies designed to evaluate the impact of public policies and programs, actually aim at inferring causal effects of the exposure to the observed condition or of the implementation of the program. In both cases, causal conclusions are typically used for predicting the causal consequences of a hypothetical intervention that manipulates the exposure or implements the public policy or program, possibly by selecting or improving specific components.
Causal inference can be drawn using experimental or non-experimental studies. In the former, the main challenge lies in the design, which involves both the sampling design and the assignment of subjects to different experimental conditions. In the latter, the main challenge is covariate imbalance across individuals in different conditions. Covariate adjustment methods are needed to compare similar units in different treatment arms (Imbens & Rubin, 2015; Hernán & Robins, 2018).
Most estimators of causal effects rely on the assumption of no interference between units, that is, a unit's outcome is assumed to depend only on the treatment that unit received. When, instead, interference is present, common estimators fail to estimate the causal effect of the treatment. Interference mechanisms are common in many fields, from economics to epidemiology. For example, recipients of conditional cash transfers may share resources with or purchase goods and services from ineligible households who live in the same area (Angelucci & De Giorgi, 2009); de-worming a group of children may also benefit untreated children by reducing disease transmission (Miguel & Kremer, 2004).
In the past two decades the literature on causal inference in the presence of interference has grown rapidly, with increasingly rapid advances in both experimental design (e.g., Hudgens & Halloran 2008; Toulis & Kao 2013; Ugander et al. 2013; Eckles et al. 2014) and statistical inference. The recently proposed approaches for the estimation of spillover effects can be categorized as dealing with one of the following cases: randomized experiments on clusters (e.g., Hudgens & Halloran 2008), randomized experiments on networks (e.g., Rosenbaum 2007; Bowers et al. 2013; Aronow 2012; Athey et al. 2015; Aronow & Samii 2017), observational studies on clusters (e.g., Hong & Raudenbush 2006; Tchetgen Tchetgen & VanderWeele 2012; Liu et al. 2016), and observational studies on networks (van der Laan, 2014; Sofrygin & van der Laan, 2017; Forastiere et al., 2018).
1.2 Contributions and Outline of the Paper
Building on recent work by Forastiere et al. (2018), we extend the proposed generalized propensity score estimator to more flexible functional forms of the outcome model and to incorporate neighborhood correlation. We develop a Bayesian estimation method for finite sample causal effects, which relies on a modular technique and the imputation approach to causal inference. As an alternative to the commonly used frequentist and Fisherian perspectives, this paper pioneers the use of Bayesian inference for the estimation of treatment and spillover effects in the presence of interference. The proposed Bayesian methodology allows flexible estimation of a large range of causal effects, incorporates different sources of uncertainty, and takes into account correlation among neighbors using community random effects. In addition, the modular technique preserves robustness to model misspecification.
The remainder of the paper is structured as follows. In Section 2 we review the potential outcome framework and the imputation-based and Bayesian approaches to causal inference, and provide an overview of propensity score-based methods for observational studies with both binary and continuous treatments. In Section 3 we introduce the problem of causal inference on networks and review the concepts of individual and neighborhood propensity scores. Section 4 is dedicated to our proposed Bayesian estimator based on the generalized propensity score; its performance is assessed with a simulation study reported in Section 5. Section 6 concludes the paper.
2 Potential Outcome Framework
The most widely used statistical framework for causal inference is the potential outcome framework (Rubin, 1974, 1978), also known as the Rubin Causal Model (RCM) (Holland, 1986). The first component of the RCM are the potential outcomes, defined as the potential values of the outcome variable that each unit would experience under each level of the treatment condition. Causal effects are then defined as comparisons of potential outcomes under different treatment conditions for the same set of units. The concept of potential outcomes was first proposed by Neyman (1923) in the context of randomized experiments, and was extended to observational studies by Rubin (1974). The fundamental problem of causal inference under the RCM is that, in one study, for each unit at most one of the potential outcomes is observed (the one corresponding to the treatment the unit is actually exposed to), and the other potential outcomes are missing. Therefore, unit-level causal effects are not identifiable without further assumptions. The second component of the RCM is the assignment mechanism: the process that determines which units receive which treatments, hence which potential outcomes are realized and thus can be observed, and which are missing. In randomized experiments the assignment mechanism is under the control of the experimenter. In contrast, in observational studies the assignment mechanism is the unknown process underlying the observed distribution of treatments and in general depends on units' characteristics. Identification of causal effects relies on specific assumptions on the assignment mechanism. The last, optional, component of the RCM is a model for the potential outcomes and covariates. Incorporating scientific understanding in a probability distribution allows formal probability statements about the causal effects.
2.2 Setup and Notation
Consider a study where we observe a set of $N$ units. Let $i$ be the index of a unit in the sample, with $i = 1, \ldots, N$.
The variable whose causal effect is under investigation can be the exposure to a certain condition (e.g., environmental exposure, socio-economic status, behavior) or a certain treatment or intervention. Throughout we will refer to this variable as the treatment. Let $Z_i$ be the treatment variable for unit $i$ and $Y_i$ the observed outcome on which we wish to estimate its effect. For each unit we also collect a vector of baseline covariates $\mathbf{X}_i$. Let $\mathbf{O} = (\mathbf{X}, \mathbf{Z}, \mathbf{Y})$ denote the observed data in the sample, where $\mathbf{X}$ is the collection of baseline covariates across all units, $\mathbf{Z}$ is the treatment vector in the sample, and $\mathbf{Y}$ is the corresponding outcome vector.
In principle, we should define the potential outcome of each unit as the potential value of the outcome variable that the unit would experience under a specific assignment of the whole treatment vector $\mathbf{Z} = \mathbf{z}$. Under this general definition, a potential outcome is denoted by $Y_i(\mathbf{Z} = \mathbf{z})$ or simply $Y_i(\mathbf{z})$. We then have for each unit as many potential outcomes as there are possible treatment vectors $\mathbf{z}$, but we can only observe the one corresponding to the treatment vector that was actually observed, i.e., $Y_i(\mathbf{Z})$. A dimensionality reduction is needed both for the definition and for the identification of causal estimands as comparisons between potential outcomes under different treatment conditions. Restrictions can be given by structural assumptions or by specific assignment mechanisms in randomized experiments.
The first basic assumption that is typically invoked is the stable unit treatment value assumption (SUTVA; Rubin, 1980). There are two components to this assumption. The first is that there is only one version of each treatment level possible for each unit (consistency). The second is the no interference assumption, that is, the treatment of one unit does not affect the potential outcomes of other units, formally:
Assumption 1 (Stable Unit Treatment Value, SUTVA). For all $\mathbf{z}, \mathbf{z}'$ such that $z_i = z'_i$: $Y_i(\mathbf{z}) = Y_i(\mathbf{z}')$.
Under this assumption we can index the potential outcomes of unit $i$ only by the treatment received by unit $i$, i.e., $Y_i(Z_i = z)$ or simply $Y_i(z)$. Therefore, under SUTVA, for each unit there exists only one potential outcome for each treatment level, with the observed outcome $Y_i = Y_i(Z_i)$.
2.4 Causal Estimands
Causal estimands are defined as comparisons of potential outcomes under different treatment levels. Unit-level estimands are comparisons at the unit level, while average estimands are average comparisons over the same set of units. A common comparison is the average difference. For the (finite) population of $N$ units, under SUTVA, this is the sample average treatment effect (SATE), or simply ATE:
\[ \tau = \frac{1}{N} \sum_{i=1}^{N} \big[ Y_i(z) - Y_i(z') \big]. \]
If the set of units is considered a sample from a larger (finite or infinite) population, then the same comparisons define population average causal effects, sometimes referred to as PATE:
\[ \tau^{P} = E\big[ Y_i(z) - Y_i(z') \big]. \]
2.5 Causal Inference as a Missing Data Problem
The problem of identifying and estimating causal estimands relies on the fundamental problem of causal inference (Holland, 1986), that is, the inability to simultaneously observe all the potential outcomes of the same unit. In fact, for each unit, we can observe at most the potential outcome corresponding to the treatment to which the unit is exposed, i.e., $Y_i(Z_i)$, whereas all the other potential outcomes $Y_i(z)$, with $z \neq Z_i$, are missing. The missing potential outcomes are often referred to as 'counterfactuals' in the literature. Therefore, causal inference is inherently a missing data problem and causal effects are not identifiable without further assumptions, which, in general, allow extrapolation of information on missing potential outcomes. The central identifying assumptions concern the assignment mechanism.
Because potential outcomes of the same unit are never simultaneously observed, we cannot in general estimate individual contrasts in potential outcomes. We could estimate a marginal contrast in potential outcomes, e.g., $E[Y_i(z) - Y_i(z')]$, if we could recover the distribution of $Y_i(z)$ for all units. This could be done by assuming that the distribution of potential outcomes is independent of the actual treatment received. This assumption is known as unconfoundedness, ignorability or exchangeability. Throughout we will use the term unconfoundedness. In the literature we can find alternative expressions of this assumption. Here we report a weak version, which assumes the marginal independence between each potential outcome and the treatment receipt. Formally, we have the following assumption.
Assumption 2 (Unconfoundedness of the Treatment)
Given an assignment mechanism, a treatment is unconfounded if it does not depend on the potential outcomes:
\[ Z_i \perp Y_i(z) \qquad \forall z. \]
This implies that for all individuals treatment assignment does not depend on the potential outcomes and, in turn, the distribution of the potential outcomes is independent of the treatment assignment. The unconfoundedness assumption holds by design in classical randomized experiments, where the assignment mechanism is known and the treatment is randomly assigned. Thus, the randomized assignment mechanism implies that units assigned to different treatment levels are exchangeable. The implied independence of the potential outcomes from the treatment assignment is the more intuitive way to see why the assumption is crucial to causal inference. In fact, unconfoundedness implies that $P(Y_i(z) \mid Z_i = z) = P(Y_i(z))$ and thus allows imputation of the missing potential outcomes $Y_i(z)$, with $z \neq Z_i$, from the distribution of observed outcomes in the treatment arm where the treatment received is actually $z$. This ensures the identifiability of causal effects from the observed data.
Specifically, under Assumption 2, $E[Y_i(z)] = E[Y_i \mid Z_i = z]$ is identified.
2.6 Imputation Approach to Causal Inference
Estimating causal effects requires properly handling the missing potential outcomes. As a matter of fact, all methods for causal inference can be viewed as imputation methods. In fact, because any causal estimand depends on missing potential outcomes, every estimator of causal estimands must explicitly or implicitly estimate or impute the missing potential outcomes.
The unconfoundedness assumption is crucial for this imputation: one can recover the distribution of missing potential outcomes from the distribution of observed outcomes of units in other treatment arms.
The imputation approach to causal inference explicitly imputes the missing potential outcomes for every unit in the sample. This can be done from the empirical distribution of observed outcomes of units in other treatment arms or from a parametric model.
A unit-level causal estimand $\tau_i$ is a comparison between potential outcomes of unit $i$ corresponding to different treatment levels. A finite sample causal estimand is a sample moment of the distribution of $\tau_i$ in the whole sample or in a sub-sample defined by observed treatment and covariates. Similarly, a population-level causal estimand is a population moment of the population distribution of $\tau_i$, unconditional or conditional on a treatment level and/or covariates. An imputation-based method hinges on a stochastic model for all potential outcomes, both observed and missing.
Such a model generally depends on some unknown parameters $\theta$. The observed data are used to learn about these parameters. The postulated model with the estimated parameters is then used to impute the missing potential outcomes, given the observed data, and to conduct inference for the estimands of interest. Typically, potential outcomes of different units are assumed independent and identically distributed. In addition, potential outcomes of a unit corresponding to different treatment levels are considered independent. Therefore, under the unconfoundedness assumption, we posit a model for each potential outcome.
Once the parameters are estimated and the potential outcomes for all units are imputed, the distribution of the causal estimands can be derived. Specifically, population average causal estimands are computed as a function of the parameters $\theta$. In contrast, finite sample causal estimands are derived using the imputed potential outcomes: first, unit-level estimands $\tau_i$ are computed from the imputed set of potential outcomes; then, average causal estimands are determined by sample moments of the distribution of $\tau_i$ in the sample.
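To make the imputation approach concrete, here is a minimal numpy sketch (an illustration, not part of the paper): a linear outcome model is fit within each treatment arm of a simulated confounded dataset, the missing potential outcome of every unit is imputed from the fitted model, and the finite sample ATE is computed from the unit-level contrasts. The data-generating process and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
X = rng.normal(size=N)
Z = rng.binomial(1, 1 / (1 + np.exp(-X)))       # treatment depends on X (confounding)
Y = 2.0 * Z + 1.5 * X + rng.normal(size=N)      # true ATE = 2

# Fit a linear outcome model E[Y(z) | X] = a_z + b_z * X within each treatment arm
def fit_arm(z):
    mask = Z == z
    A = np.column_stack([np.ones(mask.sum()), X[mask]])
    coef, *_ = np.linalg.lstsq(A, Y[mask], rcond=None)
    return coef

c0, c1 = fit_arm(0), fit_arm(1)

# Keep each observed outcome, impute the missing potential outcome from the model
Y1 = np.where(Z == 1, Y, c1[0] + c1[1] * X)
Y0 = np.where(Z == 0, Y, c0[0] + c0[1] * X)

sate = float(np.mean(Y1 - Y0))   # finite-sample ATE from unit-level contrasts
```

Because the outcome model is correctly specified in this simulation, the sample average of the imputed contrasts recovers the true effect up to sampling noise.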
2.7 Bayesian Causal Inference
The outline of Bayesian inference for causal effects was first proposed in Rubin (1978). From the Bayesian perspective, the observed outcomes are considered to be realizations of random variables and the unobserved potential outcomes are unobserved random variables. Thus, missing potential outcomes are treated as unknown parameters. Bayesian inference for the causal estimands is obtained from the posterior predictive distribution of the missing potential outcomes, given a model for the potential outcomes and a prior distribution for the parameters.
Let $Y_i^{mis}$ be the vector of missing potential outcomes for unit $i$ and $\mathbf{Y}^{mis}$ the corresponding vector for the whole sample. Given the prior distribution $p(\theta)$, under unconfoundedness and common independence assumptions, the posterior predictive distribution of these missing potential outcomes can be written as
\[ p(\mathbf{Y}^{mis} \mid \mathbf{O}) \propto \int p(\mathbf{Y}^{mis} \mid \theta) \Big[ \prod_{i=1}^{N} p(Y_i \mid Z_i, \mathbf{X}_i; \theta) \Big] p(\theta)\, d\theta. \qquad (1) \]
The product inside the integral of (1) is the likelihood function. Equation (1) suggests that, under unconfoundedness and iid units, one only needs to specify the potential outcome distribution and the prior distribution of the parameters to conduct Bayesian inference for causal effects. A closed-form posterior distribution of the causal estimand is generally not available. Instead, we can use Markov chain Monte Carlo (MCMC) methods such as the Gibbs sampler (Gelfand & Smith, 1990); specifically, one can simulate the joint posterior distribution of all unknown random variables, $p(\mathbf{Y}^{mis}, \theta \mid \mathbf{O})$, by iteratively drawing from the two conditional distributions $p(\mathbf{Y}^{mis} \mid \theta, \mathbf{O})$ and $p(\theta \mid \mathbf{Y}^{mis}, \mathbf{O})$. After obtaining the posterior draws of $\mathbf{Y}^{mis}$ and of the parameters $\theta$, the posterior distribution of a population average estimand is derived from the posterior draws of the parameters or of functions of the parameters. For finite-sample average estimands, one would instead take the posterior draws of the missing potential outcomes and, for each draw, compute the sample average of interest.
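As a toy illustration of this scheme (not from the paper), the sketch below uses a conjugate normal model with known unit variance for each arm of a simulated randomized experiment, so parameter draws come directly from the closed-form posterior rather than a full Gibbs loop; each parameter draw is followed by a posterior predictive draw of the missing potential outcomes, from which a finite-sample ATE draw is computed. The prior and all names are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
Z = rng.binomial(1, 0.5, size=N)                      # randomized treatment
Y = np.where(Z == 1, 3.0, 1.0) + rng.normal(size=N)   # true ATE = 2

# Conjugate model per arm: Y(z) ~ N(mu_z, 1) with prior mu_z ~ N(0, 100)
def posterior_mu(y):
    prec = 1.0 / 100.0 + len(y)         # posterior precision
    return y.sum() / prec, 1.0 / prec   # posterior mean and variance

m1, v1 = posterior_mu(Y[Z == 1])
m0, v0 = posterior_mu(Y[Z == 0])

draws = []
for _ in range(2000):
    mu1 = rng.normal(m1, np.sqrt(v1))                 # draw arm means
    mu0 = rng.normal(m0, np.sqrt(v0))
    # posterior predictive draw of every missing potential outcome
    Y1 = np.where(Z == 1, Y, rng.normal(mu1, 1.0, size=N))
    Y0 = np.where(Z == 0, Y, rng.normal(mu0, 1.0, size=N))
    draws.append(np.mean(Y1 - Y0))                    # finite-sample ATE draw

post_mean = float(np.mean(draws))
```

The spread of `draws` reflects both parameter uncertainty and the predictive uncertainty of the imputed potential outcomes.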
2.8 Unconfoundedness in Observational Studies
In randomized experiments Assumption 2 holds by design. Conversely, in observational studies the investigator does not control the assignment of treatments and cannot ensure that similar subjects receive different levels of treatment. Therefore, Assumption 2 cannot be ensured by design. Moreover, the assumption cannot be directly validated by the data because it involves missing quantities: the distribution of a potential outcome $Y_i(z)$ cannot be observed in the treatment arms where $Z_i \neq z$ and, hence, we are unable to say whether it would be the same across all treatment arms. In an observational setting we typically relax the unconditional unconfoundedness assumption by assuming exchangeability conditional on a set of observed covariates. The set of covariates that can make the unconfoundedness assumption more plausible must be chosen based on subject matter knowledge.
The conditional unconfoundedness assumption can be formally stated as follows.
Assumption 3 (Conditional Unconfoundedness of the Treatment)
Given an assignment mechanism, the treatment is unconfounded if it does not depend on the potential outcomes conditional on an observed covariate set $\mathbf{X}$:
\[ Z_i \perp Y_i(z) \mid \mathbf{X}_i \qquad \forall z. \]
Assumption 3 implies that the treatment is as good as randomized among units with the same value of the observed covariates. In other words, we can say that the treatment is unconfounded conditional on an observed set of covariates if there are no unmeasured confounders, that is, unobserved characteristics that, by affecting both the exposure to treatment and the potential outcomes, confound the relationship between the treatment and the outcome.
Under conditional unconfoudedness, identification of causal effects also requires a nonzero probability of all levels of the treatment for all covariate combinations, also known as the positivity assumption (Hernán & Robins, 2018).
2.9 Propensity Score for Binary Treatment
Most of the causal inference literature is concerned with settings with a binary treatment $Z_i \in \{0, 1\}$. In this case, for each unit there are only two potential outcomes, $Y_i(0)$ and $Y_i(1)$, one observed and one missing.
Several covariate-adjustment methods relying on the unconfoundedness assumption have been proposed. When the set of covariates needed to make the unconfoundedness assumption more plausible is large and includes continuous covariates, a simple stratification into groups where units share the same values of the covariates is not possible. In such settings, propensity score-based methods are particularly useful. For each unit, the propensity score, denoted by $e(\mathbf{X}_i)$, is defined as the probability of receiving the active treatment given the unit's covariates (Rosenbaum & Rubin, 1983). Formally, we can write
\[ e(\mathbf{X}_i) = P(Z_i = 1 \mid \mathbf{X}_i). \]
The propensity score has two important properties: (i) it is a balancing score, that is, it satisfies $Z_i \perp \mathbf{X}_i \mid e(\mathbf{X}_i)$; (ii) if the treatment assignment is unconfounded given $\mathbf{X}_i$, then it is also unconfounded given $e(\mathbf{X}_i)$, that is, $Z_i \perp Y_i(z) \mid e(\mathbf{X}_i)$ for $z = 0, 1$.
Therefore, under unconfoundedness, adjusting for the difference in the propensity scores between treated and control units would remove all biases due to covariate imbalance, i.e., we can compare the observed outcomes between treated and control units within groups defined by the value of the propensity score rather than by the value of covariates. Because the propensity score can be viewed as a scalar summary of the multivariate covariates, the use of the propensity score for covariate adjustment is often more convenient.
In randomized experiments, the propensity score is known and in classical randomized experiments it does not depend on covariates. On the contrary, in observational studies, the propensity score is typically unknown and must be estimated. It has been shown that a consistent estimate of the propensity score leads to more efficient estimation of the ATE than the true propensity score (Rosenbaum, 1987; Rubin & Thomas, 1996; Hahn, 1998; Hirano et al., 2003).
Propensity score methods usually involve two stages: in the first stage ('PS stage'), the propensity scores are determined by estimating the parameters of a model, usually through a logistic regression, and then computing the individual probabilities given covariates; in the second stage ('outcome stage'), the causal effects are estimated based on the estimated propensity scores, through three main alternative strategies, matching, stratification, and weighting, or a combination of these methods (for a review, see Stuart 2010 or Austin 2011).
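As an illustrative sketch of the PS stage (not part of the paper), the following fits a logistic regression for the propensity score by Newton-Raphson on simulated data; the covariates, coefficients, and iteration count are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 3000
X = rng.normal(size=(N, 2))
true_logit = 0.8 * X[:, 0] - 0.5 * X[:, 1]
Z = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

# PS stage: maximum likelihood for logit(e(X)) = b0 + b1*X1 + b2*X2 via Newton-Raphson
A = np.column_stack([np.ones(N), X])
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-A @ beta))
    grad = A.T @ (Z - p)                      # score vector
    H = A.T @ (A * (p * (1 - p))[:, None])    # observed information
    beta += np.linalg.solve(H, grad)

e_hat = 1 / (1 + np.exp(-A @ beta))           # estimated propensity scores
```

The estimated coefficients approach the data-generating values as the sample grows, and the fitted probabilities `e_hat` feed the outcome stage.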
For matching, arguably the most popular causal inference method, the propensity score is used as the distance metric for finding matched pairs of treated and control units, and the causal effects are estimated by the average within-pair difference in the observed outcome. The most common method for propensity score matching is one-to-one or pair matching, in which pairs of treated and control subjects are formed such that matched subjects have similar values of the propensity score. There are different sets of options for forming propensity score matched sets. The first choice is between matching with or without replacement. Matching with replacement allows a given subject to be included in more than one matched set, resulting in closer pairs but higher variance due to the fact that less information is used (Rosenbaum, 2002). On the contrary, in matching without replacement each unit is included in at most one matched set. In this case, one must choose between greedy and optimal matching (Rosenbaum, 2002). Greedy matching is a sequential process where at each step the nearest control unit is selected for matching to a given treated unit, even if that control unit would better serve as a match for a subsequent treated unit. An alternative to greedy matching is optimal matching, in which matches are formed so as to minimize the total within-pair difference of the propensity score. Different optimization algorithms can be used (e.g., Rosenbaum 2002; Hansen 2004).
When there are large numbers of control individuals, it is sometimes possible to get multiple good matches for each treated individual, called ratio matching (e.g., Rubin & Thomas 1996).
For stratification, one stratifies units into K (usually 5 or 6) subclasses based on quantiles of the propensity scores, so that treated and control units in each subclass have similar covariate distributions, and then calculates the difference in the average outcome between treated and control units in each subclass. The stratification estimator is the average of these within-subclass causal estimates weighted by the size of each subclass (Imbens & Rubin, 2015).
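A minimal sketch of the stratification estimator (not from the paper), again using the true propensity score of a simulated confounded dataset for clarity; it contrasts the naive difference in means with the subclassification estimate over propensity score quintiles. All names and the data-generating process are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 5000
X = rng.normal(size=N)
e = 1 / (1 + np.exp(-1.5 * X))          # true propensity score e(X)
Z = rng.binomial(1, e)
Y = Z + 2.0 * X + rng.normal(size=N)    # true ATE = 1, X confounds

naive = Y[Z == 1].mean() - Y[Z == 0].mean()   # biased by covariate imbalance

# Stratify on quintiles of the propensity score and average within-stratum contrasts,
# weighting by stratum size
edges = np.quantile(e, [0.2, 0.4, 0.6, 0.8])
strata = np.digitize(e, edges)
est, wsum = 0.0, 0
for s in range(5):
    m = strata == s
    if (Z[m] == 1).any() and (Z[m] == 0).any():
        diff = Y[m & (Z == 1)].mean() - Y[m & (Z == 0)].mean()
        est += m.sum() * diff
        wsum += m.sum()
est = est / wsum
```

Five subclasses remove most, though not all, of the confounding bias, consistent with the classical subclassification results cited above.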
For weighting, the basis is an analogy with the Horvitz-Thompson estimator in survey sampling (Horvitz & Thompson, 1952). One can define an inverse-probability weight (IPW) for each unit, $1/e(\mathbf{X}_i)$ for treated units and $1/(1 - e(\mathbf{X}_i))$ for control units; the difference of the weighted average outcome between treatment groups is then an unbiased estimator of the average causal effect (e.g., Hernán & Robins 2018).
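A short sketch of the weighting strategy (not from the paper), using normalized (Hájek-type) inverse-probability weights on a simulated confounded dataset with known propensity score; names and the data-generating process are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
N = 4000
X = rng.normal(size=N)
e = 1 / (1 + np.exp(-X))                # true propensity score
Z = rng.binomial(1, e)
Y = Z + 2.0 * X + rng.normal(size=N)    # true ATE = 1

# Inverse-probability weights: 1/e for treated units, 1/(1-e) for controls;
# the weighted means are normalized by the weight sums (Hajek form)
w = np.where(Z == 1, 1 / e, 1 / (1 - e))
ipw = (np.sum(w * Y * Z) / np.sum(w * Z)
       - np.sum(w * Y * (1 - Z)) / np.sum(w * (1 - Z)))
```

Normalizing by the realized weight sums rather than by N tends to stabilize the estimator when some weights are large.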
In general, variance estimation for these propensity score-based estimators relies on the common structure of weighted estimators (Imbens & Rubin, 2015). In addition, different empirical formulas have been proposed to account for correlation within matched sets as well as for the uncertainty in the estimation of the propensity score (e.g., Abadie & Imbens 2006, 2009). When researchers want to account for the uncertainty in the propensity score, a bootstrap procedure has been found to outperform other methods (Hill & Reiter, 2006), but standard bootstrap methods can yield invalid inferences when applied to matching estimators (Abadie & Imbens, 2008).
2.10 Propensity Score for Continuous Treatment
In many studies the treatment is not binary and units may receive different treatment levels. In such settings, where the treatment is discrete or continuous, propensity score methods for binary treatment cannot be used. In fact, when the treatment is multivalued there is a propensity score for each treatment level. Thus, adjusting for the difference in the propensity score corresponding to one specific treatment level does not yield unbiased estimates of causal estimands comparing potential outcomes under different treatment levels, i.e., $E[Y_i(z) - Y_i(z')]$ with $z \neq z'$.
In addition, when the treatment is discrete or continuous, we are usually not only interested in comparing specific treatment levels, instead the focus is on assessing the heterogeneity of treatment effects arising from different amounts of treatment exposure, that is, on estimating an average dose-response function.
Over the years, propensity score methods have been generalized to the case of discrete treatments (e.g., Imbens 2000; Lechner 2001) and, more recently, to continuous treatments and arbitrary treatment regimes (e.g., Hirano & Imbens 2004; Imai & van Dyk 2004; Flores et al. 2012; Kluve et al. 2012; Bia et al. 2014). Here we review the work by Hirano & Imbens (2004), who introduced the concept of the Generalized Propensity Score (GPS) and used it to estimate an average dose-response function (aDRF).
2.10.1 Generalized Propensity Score (GPS)
The GPS, denoted by $R_i = r(Z_i, \mathbf{X}_i)$, is the conditional density of the treatment given the covariates: $r(z, \mathbf{x}) = f_{Z \mid \mathbf{X}}(z \mid \mathbf{x})$. Each unit is then characterized by a different density of the treatment. For each unit the GPS corresponding to the actual treatment to which the unit is exposed, $r(Z_i, \mathbf{X}_i)$, is the probability for that unit of receiving the treatment it actually received given its characteristics $\mathbf{X}_i$. Like the classic propensity score, the GPS is a balancing score. In the continuous case, this means that, within strata with the same value of $r(z, \mathbf{X})$, the probability that $Z = z$ does not depend on the value of $\mathbf{X}$. In other words, for units with the same value of the GPS corresponding to a specific treatment level $z$, the distribution of covariates is the same between the arm assigned to treatment $z$ and all the other arms. Furthermore, the unconfoundedness assumption, combined with the balancing score property, implies that the treatment is unconfounded given the GPS. Formally,
\[ Z_i \perp Y_i(z) \mid r(z, \mathbf{X}_i) \qquad \forall z. \]
Thus, any bias caused by covariate imbalance across groups with different treatment levels can be removed by adjusting for the difference in the GPS. Under unconfoundedness of the treatment given $\mathbf{X}_i$, we have
\[ E\big[Y_i(z) \mid r(z, \mathbf{X}_i) = r\big] = E\big[Y_i \mid Z_i = z, R_i = r\big]. \qquad (4) \]
According to Equation (4), for each treatment level $z$, within strata with the same GPS at level $z$, $r(z, \mathbf{X}_i)$, the average potential outcome corresponding to $z$ can be estimated using the average outcome observed for units who were actually exposed to level $z$ of the treatment. Given the continuous nature of the treatment and the GPS, such stratification is infeasible in practice.
Hirano & Imbens (2004) estimate the aDRF using the estimated GPS by employing a parametric partial mean approach. Specifically, we posit a model for the treatment
\[ g(Z_i) \mid \mathbf{X}_i \sim f\big(h(\mathbf{X}_i; \alpha), \sigma^2\big), \]
where $g(\cdot)$ is a link function, $f$ is a probability density function (pdf), $h(\mathbf{X}_i; \alpha)$ is a flexible function of the covariates depending on a vector of parameters $\alpha$, and $\sigma$ is a scale parameter. For instance, $g(\cdot)$ can be the identity function, $f$ the normal pdf, $h(\mathbf{X}_i; \alpha)$ a linear combination of the covariates, i.e., $h(\mathbf{X}_i; \alpha) = \mathbf{X}_i' \alpha$, and $\sigma$ the standard deviation of $Z$. We also postulate a model for the potential outcomes given the GPS:
\[ g_Y\big(Y_i(z)\big) \mid r(z, \mathbf{X}_i) \sim f_Y\big(m(z, r(z, \mathbf{X}_i); \beta), \sigma_Y^2\big), \]
where $g_Y(\cdot)$ is a link function, $f_Y$ is a probability density function (pdf), $m(z, r; \beta)$ is a flexible function of the treatment and the GPS depending on a vector of parameters $\beta$, and $\sigma_Y$ is a scale parameter. In Hirano & Imbens (2004), $m(z, r; \beta)$ is a cubic polynomial function of the treatment $z$ and the generalized propensity score $r$, including their interaction term.
Bia et al. (2014) replace the parametric approach with a semiparametric estimator based on penalized spline techniques; in particular, they use penalized bivariate splines with radial basis functions.
Estimation Procedure Based on the GPS
We describe here the two-step estimator proposed by Hirano & Imbens (2004) for the estimation of the average DRF based on the generalized propensity score. The first step involves parametrically modeling and estimating the GPS. The second step consists of estimating the average DRF.
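The two steps can be sketched as follows (an illustration, not the authors' code): step 1 fits a normal linear model for the treatment and evaluates the implied GPS; step 2 regresses the outcome on a low-order polynomial in the treatment and the GPS (a quadratic with interaction here, where Hirano & Imbens use a cubic) and averages the fitted surface over the sample to obtain the aDRF at each level. The data-generating process is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 3000
X = rng.normal(size=N)
Z = 0.5 * X + rng.normal(size=N)        # continuous treatment, confounded by X
Y = Z + X + rng.normal(size=N)          # true aDRF is E[Y(z)] = z

# Step 1: model Z | X as normal and evaluate the GPS r(z, x)
A = np.column_stack([np.ones(N), X])
alpha, *_ = np.linalg.lstsq(A, Z, rcond=None)
sigma = (Z - A @ alpha).std()

def gps(z, x):
    mu = alpha[0] + alpha[1] * x
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

R = gps(Z, X)

# Step 2: fit a polynomial outcome model in (Z, R), then average over the sample
B = np.column_stack([np.ones(N), Z, Z ** 2, R, R ** 2, Z * R])
beta, *_ = np.linalg.lstsq(B, Y, rcond=None)

def adrf(z):
    Rz = gps(z, X)                      # GPS of every unit at treatment level z
    Bz = np.column_stack([np.ones(N), np.full(N, z), np.full(N, z ** 2),
                          Rz, Rz ** 2, z * Rz])
    return float((Bz @ beta).mean())
```

Evaluating `adrf` on a grid of treatment levels traces out the estimated dose-response curve; the averaging over all units at each level is the partial mean step.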
2.11 Bayesian Propensity Score Adjustment
After Rubin (1985) reflected on the usefulness of propensity scores for Bayesian inference, only recently has Bayesian estimation of causal effects been combined with propensity score methods (Hoshino, 2008; McCandless et al., 2009; An, 2010; McCandless et al., 2010; Kaplan & Chen, 2012; Zigler et al., 2013; McCandless et al., 2012; Zigler & Dominici, 2014). The first advantage of the Bayesian propensity score approach is that it allows embedding propensity score adjustment within broader Bayesian modeling strategies, incorporating prior information as well as complex models for hierarchical data, measurement error or missing data (Rubin, 1985). The second major motivation for using Bayesian inference for propensity scores is the propagation of propensity score uncertainty into the estimation of causal effects. In fact, traditional frequentist approaches accommodate the two-stage nature of propensity score methods with a separate and sequential estimation: the estimated model and the predicted propensity scores from the first stage are treated as fixed and known in the outcome stage. A limitation of this sequential approach is that confidence intervals for the treatment effect estimate are usually calculated without acknowledging uncertainty in the estimated propensity scores. On the contrary, Bayesian methods offer a natural strategy for modeling uncertainty in the propensity scores.
McCandless et al. (2009) proposed to model the joint distribution of the data and parameters with the propensity score as a latent variable. Let $\gamma$ and $\beta$ be the vectors of parameters of the propensity score and the outcome model, respectively. Markov chain Monte Carlo (MCMC) methods allow drawing from the posterior distribution of the model parameters, $p(\gamma, \beta \mid \mathbf{O})$, by successively drawing from the full conditional distributions $p(\gamma \mid \beta, \mathbf{O})$ and $p(\beta \mid \gamma, \mathbf{O})$.
In McCandless et al. (2009), as well as in the whole literature on Bayesian propensity score, the focus is on binary treatment.
Within the Bayesian framework, we denote the propensity score of a binary treatment by $e(\mathbf{X}_i; \gamma)$ to highlight its dependence on the parameters. With propensity score adjustment, we do not directly model the dependence of the outcome on covariates, but rather adjust for covariates by modeling, parametrically or semi-parametrically, the propensity score $e(\mathbf{X}_i; \gamma)$. As a consequence, the outcome model indirectly depends on all parameters, including the set of parameters $\gamma$ of the propensity score model, and thus the likelihood cannot be factorized into two parts that separately depend on different sets of parameters. Therefore, the posterior distribution of the parameters of the PS stage is in part informed by the outcome stage. Because of this phenomenon, referred to as 'model feedback' (e.g., McCandless et al. 2010; Zigler et al. 2013), joint Bayesian PS estimation has raised some concerns. First, since the propensity score adjustment is meant to approximate the design stage of a randomized experiment, it should be done without access to the outcome data (Rubin, 2007, 2008). Furthermore, a practical consequence is the propagation of error due to model misspecification. In fact, when the model for the relationship between the outcome and the propensity score is misspecified, the joint Bayesian approach has been shown to provide invalid inferences for $\gamma$, which distort the balancing property of the propensity score and yield incorrect estimates of the treatment effect (Zigler et al., 2013).
Various methods described as 'two-step Bayesian' have recently been proposed to 'cut the feedback' between the propensity score and outcome stages (Hoshino, 2008; McCandless et al., 2010; Kaplan & Chen, 2012). These methods represent a special case of the so-called 'modularization' in Bayesian inference (Liu et al., 2009). To limit feedback, the general idea is based on an approximate Bayesian technique that uses the posterior distribution of the PS model as an input when fitting the outcome model. Specifically, the posterior distribution of the parameters of the PS model is not updated from the full conditional, but rather from the approximate conditional distribution
which ignores the likelihood contribution from the outcome. This restricts the flow of information between models during MCMC computation, and is similar in spirit to two-stage estimation (Lunn et al., 2009).
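The cut-feedback scheme can be sketched numerically. The following is a minimal sketch under simplifying assumptions (simulated data, a flat prior, a random-walk Metropolis step for the PS parameters, and a linear outcome stage fit by least squares as a stand-in for a full outcome model); it is not the implementation of any of the cited methods. The key point is that the log-posterior used to update the PS parameters never touches the outcome.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: one confounder x, binary treatment z with a logistic
# propensity score, and a continuous outcome y (true treatment effect = 2).
n = 500
x = rng.normal(size=n)
z = rng.binomial(1, 1 / (1 + np.exp(-0.5 * x)))
y = 1.0 + 2.0 * z + 1.5 * x + rng.normal(size=n)

def log_post_ps(gamma):
    """Log-posterior of the PS model (logit link, flat prior).

    Feedback is cut: only z and x enter, never the outcome y."""
    lin = gamma[0] + gamma[1] * x
    return float(np.sum(z * lin - np.log1p(np.exp(lin))))

# Step 1: random-walk Metropolis draws of gamma from the PS-only posterior.
gamma, cur = np.zeros(2), log_post_ps(np.zeros(2))
draws = []
for _ in range(2000):
    prop = gamma + 0.1 * rng.normal(size=2)
    lp = log_post_ps(prop)
    if np.log(rng.uniform()) < lp - cur:
        gamma, cur = prop, lp
    draws.append(gamma.copy())
draws = np.array(draws[500:])  # discard burn-in

# Step 2: for each retained PS draw, compute e(x; gamma) and fit the
# outcome model y ~ 1 + z + e by least squares; keep the z coefficient.
effects = []
for g in draws[::10]:
    e = 1 / (1 + np.exp(-(g[0] + g[1] * x)))
    X = np.column_stack([np.ones(n), z, e])
    effects.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
effects = np.array(effects)
```

Averaging `effects` recovers the treatment effect near the simulated value of 2, while the PS posterior remains untouched by the outcome stage.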
3 Causal Inference under Interference on Network Data
In this section we describe the problem of interference on network data in observational studies. The presence of interference on networks poses two major challenges: i) spillover effects of a unit's treatment on other units' outcomes, including through a contagion mechanism; and ii) homophily, that is, the tendency of units with similar characteristics to form ties, which creates a dependence structure among interacting units both in pre-treatment characteristics and in the outcome.
In addition, in observational studies, where the treatment is not randomized, homophily can generate correlation among neighbors' treatments, due to a similar propensity of taking the treatment given similar covariates, as well as to peer influence in the treatment uptake.
We will first define the problem of interference on networks and review common neighborhood interference assumptions. Then, we will define causal estimands of interest, review identifying assumptions, and finally propose a novel Bayesian estimator.
3.1 Network Data
Consider a network of $N$ units, indexed by $i$, with adjacency matrix $A$, where element $A_{ij}$ represents the presence of a tie between unit $i$ and unit $j$. Ties are assumed to be fixed and known. Recall that for each unit $i$ we measure a vector of covariates $X_i$, a treatment variable $Z_i$, and an outcome variable $Y_i$. Here we focus on binary treatments $Z_i \in \{0, 1\}$. The adjacency matrix $A$ defines for each unit $i$ the set of units that have a direct tie with $i$. We refer to this set as the neighborhood of unit $i$, denoted by $\mathcal{N}_i$, and to the units belonging to this set as the neighbors of unit $i$. In real-world applications, neighbors can be geographical neighbors, or friends, partners or collaborators. Let $d_i = \sum_{j} A_{ij}$ denote the number of neighbors of unit $i$, referred to as the degree of unit $i$. The complement of $\mathcal{N}_i$ in the network, excluding $i$, is denoted by $\mathcal{N}_{-i}$.
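To fix ideas, the objects just introduced (neighborhoods $\mathcal{N}_i$ and degrees $d_i$) can be read directly off the adjacency matrix; the small 5-node network below is purely illustrative.

```python
import numpy as np

# Toy undirected network on N = 5 units: a path 0-1-2-3-4 plus the tie 1-3.
A = np.zeros((5, 5), dtype=int)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]:
    A[i, j] = A[j, i] = 1  # symmetric: ties are undirected

# Neighborhood of unit i: the indices j with A[i, j] = 1; the degree d_i
# is the size of the neighborhood, i.e. the i-th row sum of A.
neighborhoods = [np.flatnonzero(A[i]) for i in range(5)]
degrees = A.sum(axis=1)
```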
For each unit $i$, the partition $\{i, \mathcal{N}_i, \mathcal{N}_{-i}\}$ defines the following partitions of the treatment and outcome vectors: $\mathbf{Z} = (Z_i, \mathbf{Z}_{\mathcal{N}_i}, \mathbf{Z}_{\mathcal{N}_{-i}})$ and $\mathbf{Y} = (Y_i, \mathbf{Y}_{\mathcal{N}_i}, \mathbf{Y}_{\mathcal{N}_{-i}})$. With non-network data, $X_i$ typically includes individual characteristics or cluster-level characteristics representing contextual factors (e.g., demographic or socio-economic factors) or contextual covariates (e.g., geographical factors or presence of infrastructures). On the contrary, in network data $X_i$ might also include variables describing the network. In particular, it might contain variables representing features of the neighborhood $\mathcal{N}_i$, including its topology but also the distribution of individual-level characteristics, and it can contain node-level network properties representing the position of the unit's neighborhood in the graph (e.g., centrality, betweenness, number of shared neighbors, ...).
3.2 Neighborhood interference
In general, the potential outcome for unit $i$ depends on the entire treatment assignment vector $\mathbf{Z}$, i.e., $Y_i(\mathbf{Z})$. The no-interference assumption, or SUTVA, restricts the dependence to only the treatment received by unit $i$, i.e., $Y_i(Z_i)$. On the contrary, under interference the potential outcome for unit $i$ depends on the treatment received by other units. However, if each outcome depended on the whole treatment vector, then each unit would be observed under a single treatment vector $\mathbf{Z}$, and the data would not provide any information on the missing potential outcomes under different treatment conditions. It is clear that the definition of causal effects, as well as their estimation, requires assumptions that restrict the number of potential outcomes for each unit. Depending on the type of interference, the potential outcome can depend on the treatment received by a specific group of units. In network data, the adjacency matrix provides information on the interactions between units and thus on the flow of the treatment effect. Oftentimes, we have reason to assume that the outcome of a unit only depends on the treatments received by the neighbors, that is, by the units that individual $i$ is in direct contact with. In this case, the potential outcome can be written as $Y_i(Z_i, \mathbf{Z}_{\mathcal{N}_i})$. This assumption excludes spillover effects of the treatment received by higher-order connections. Nevertheless, when the number of neighbors is substantial, few units share the same neighborhood treatment vector under any given $\mathbf{Z}$, and information on the missing potential outcomes would still be hard to extrapolate.
For this reason, we introduce the concept of exposure mapping. In general terms, we define an exposure mapping as a function $g_i$ that maps a treatment vector $\mathbf{z}$, the adjacency matrix $A$ and unit-level characteristics to an exposure value: $G_i = g_i(\mathbf{z}, A, \mathbf{X})$, with $G_i \in \mathcal{G}_i$. Hudgens & Halloran (2008) consider the 'partial interference' assumption, which allows units to be affected only by the treatments received by units belonging to the same cluster. This can be expressed by a specific exposure mapping function that only depends on group indicators. In network data, a special case is the function $g_i(\mathbf{Z}_{\mathcal{N}_i}, A_i, X_i, \mathbf{X}_{\mathcal{N}_i})$, which receives as input only the treatment vector in the neighborhood, the unit's row of the adjacency matrix, the unit's covariates, and the covariates of units in the neighborhood. Given this definition, we can formalize the neighborhood interference assumption.
Assumption 4 (Neighborhood Interference)
Given a function $g_i$, $\forall\, \mathbf{Z}, \mathbf{Z}'$ such that $Z_i = Z_i'$ and $g_i(\mathbf{Z}_{\mathcal{N}_i}) = g_i(\mathbf{Z}'_{\mathcal{N}_i})$, then $Y_i(\mathbf{Z}) = Y_i(\mathbf{Z}')$.
Assumption 4 rules out the dependence of the potential outcomes of unit $i$ on the treatments received by units outside his neighborhood, i.e., $\mathbf{Z}_{\mathcal{N}_{-i}}$, but allows $Y_i$ to depend on the treatments received by his neighbors, i.e., $\mathbf{Z}_{\mathcal{N}_i}$. Moreover, this dependence is assumed to operate through a specific exposure mapping function $g_i$. This formulation is similar to the 'exposure mapping' introduced by Aronow & Samii (2017) and the one in Van der Laan (2014). In Assumption 4 the function $g_i$ is assumed to be known and well-specified. When analyzing network data, we must use substantive knowledge of the subject matter and judgment about the mechanism of interference to fix the exposure mapping function. We refer to $G_i = g_i(\mathbf{Z}_{\mathcal{N}_i})$ as the neighborhood treatment and, by contrast, we refer to $Z_i$ as the individual treatment. In general, we can write $G_i = g\big(\{A_{ij},\, h(X_i, X_j),\, Z_j\}_{j \in \mathcal{N}_i}\big)$, where $A_{ij}$ is the element of the adjacency matrix, which could be binary or weighted, and $h(X_i, X_j)$ is a function of the unit-level characteristics of the interacting units. This means that we assume that the extent to which the treatment of unit $j$ affects the outcome of unit $i$ depends on the level of interaction between the two units, encoded in $A_{ij}$, and on their similarity in terms of characteristics. In the simplest cases, $G_i$ can be the number or the proportion of treated neighbors, i.e., $G_i = \sum_{j \in \mathcal{N}_i} Z_j$ or $G_i = \frac{1}{d_i}\sum_{j \in \mathcal{N}_i} Z_j$, respectively. The domain of $G_i$ depends on how the function $g_i$ is defined. For example, if we consider the simple number of treated neighbors, then $\mathcal{G}_i = \{0, 1, \dots, d_i\}$. We denote the overall domain by $\mathcal{G} = \bigcup_{i} \mathcal{G}_i$.
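The two simplest exposure mappings just mentioned (number and proportion of treated neighbors) reduce to matrix operations; the network and treatment vector below are illustrative.

```python
import numpy as np

# Adjacency matrix of a 5-node network (path 0-1-2-3-4 plus tie 1-3)
# and a hypothetical individual treatment vector z.
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 1, 0],
              [0, 1, 0, 1, 0],
              [0, 1, 1, 0, 1],
              [0, 0, 0, 1, 0]])
z = np.array([1, 0, 1, 1, 0])

# Number of treated neighbors: G_i = sum_j A_ij z_j, i.e. one matrix-vector
# product; the proportion divides by the degree d_i.
g_count = A @ z
d = A.sum(axis=1)
g_prop = g_count / d
```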
Under Assumption 4, potential outcomes can be indexed just by the individual treatment and the neighborhood treatment, i.e., $Y_i(Z_i = z, G_i = g)$, which can be simplified to $Y_i(z, g)$. The potential outcome $Y_i(z, g)$ represents the outcome that unit $i$ would exhibit under individual treatment $z$ and if exposed to the value $g$ of a function of the treatment vector of his neighbors, $g_i(\mathbf{Z}_{\mathcal{N}_i})$.
A potential outcome $Y_i(z, g)$ is defined only for the subset of nodes where $G_i$ can take on value $g$. We denote this subset by $\mathcal{V}_g$. For instance, in the case where $G_i$ is the number of treated neighbors, $\mathcal{V}_g$ is the set of nodes with degree $d_i \geq g$, that is, with at least $g$ neighbors. It is worth noting that each unit can belong to different subsets $\mathcal{V}_g$, depending on the cardinality of its neighborhood.
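For the count exposure, the subsets $\mathcal{V}_g$ are determined by the degree sequence alone; a small illustration:

```python
import numpy as np

# Degree sequence of a hypothetical 5-node network.
d = np.array([1, 3, 2, 3, 1])

# For G_i = number of treated neighbors, V_g collects the nodes with
# degree d_i >= g: only for those units is the potential outcome
# Y_i(z, g) defined.
def eligible_set(g):
    return set(np.flatnonzero(d >= g).tolist())
```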
3.3 Causal estimands: Treatment and Spillover effects
We define here the causal estimands of interest under neighborhood interference. We focus on finite-sample causal estimands, that is, estimands that are defined on the network at hand. The advantage of this type of estimand is that its definition does not require the specification of a sampling mechanism from a larger population, which can be difficult to conceptualize in network settings. We first define the average potential outcome in a set of units $\mathcal{V}$ as:
$$\mu(z, g; \mathcal{V}) = \frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} Y_i(z, g).$$
$\mathcal{V}$ is a set of units, possibly defined by covariates, including individual or network characteristics. In order for the potential outcomes to be well-defined, $\mathcal{V}$ must be the set (or a subset of the set) of units where $g$ is a possible value of the neighborhood treatment, i.e., $\mathcal{V} \subseteq \mathcal{V}_g$. We can view $\mu(z, g; \mathcal{V})$ as an average dose-response function (ADRF), which measures the heterogeneity of potential outcomes arising from a variation in the bivariate treatment, i.e., the binary individual treatment and the neighborhood treatment, which is a discrete or continuous variable.
We can now define causal estimands as comparisons between average potential outcomes. We define the average (individual) treatment effect at neighborhood treatment level $g$ by
$$\tau(g; \mathcal{V}) = \mu(1, g; \mathcal{V}) - \mu(0, g; \mathcal{V}), \qquad (8)$$
which denotes the average causal effect of the individual treatment when the neighborhood treatment is set to level $g$. Again, here $\mathcal{V} \subseteq \mathcal{V}_g$. Instead of fixing the neighborhood treatment, we can consider a hypothetical intervention that assigns the neighborhood treatment to unit $i$ based on a probability distribution $\pi(g)$. Thus, we define the average treatment effect given the neighborhood treatment assignment $\pi$ as the average effect of the individual treatment marginalized over the probability distribution of the neighborhood treatment, that is
$$\tau(\pi; \mathcal{V}) = \sum_{g \in \mathcal{G}} \big(\mu(1, g; \mathcal{V}) - \mu(0, g; \mathcal{V})\big)\,\pi(g). \qquad (9)$$
We now define the causal effects of the neighborhood treatment, often referred to as spillover effects or peer effects. We define the average spillover effect of having the neighborhood treatment set to level $g$ versus $g'$, when the unit is under individual treatment $z$, by
$$\delta(g, g'; z, \mathcal{V}) = \mu(z, g; \mathcal{V}) - \mu(z, g'; \mathcal{V}). \qquad (10)$$
Notice that $\mathcal{V}$ must be a subset of units belonging to both $\mathcal{V}_g$ and $\mathcal{V}_{g'}$, i.e., $\mathcal{V} \subseteq \mathcal{V}_g \cap \mathcal{V}_{g'}$. Finally, we define the average spillover effect of intervention $\pi$ vs $\pi'$ by
$$\delta(\pi, \pi'; z, \mathcal{V}) = \sum_{g \in \mathcal{G}} \mu(z, g; \mathcal{V})\,\pi(g) - \sum_{g \in \mathcal{G}} \mu(z, g; \mathcal{V})\,\pi'(g). \qquad (11)$$
Hypothetical interventions $\pi$ vs $\pi'$ can be given by real experiments with an assignment mechanism that reflects into the probability distribution of the neighborhood treatment, or they can be directly defined as probability distributions of $G_i$ given covariates (see also Papadogeorgou et al., 2018).
The treatment effects in (8) and spillover effects in (10) are average comparisons of potential outcomes under fixed values of the individual and neighborhood treatments. Conversely, in the average treatment and spillover effects in (9) and (11), the individual treatment is kept fixed while the neighborhood treatment is drawn from the probability distribution of the hypothetical interventions.
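With a (hypothetically complete) table of potential outcomes, the estimands above are simple averages and contrasts. The sketch below simulates such a table with known effects (individual effect 2, a linear spillover of 0.5 per treated neighbor) and computes fixed-level effects and a marginalized effect under an illustrative distribution $\pi$; everything here is synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic potential outcomes Y[i, z, g] for N = 100 units, z in {0, 1},
# g in {0, ..., 4}: Y_i(z, g) = 1 + 2 z + 0.5 g + noise.
N, G = 100, 5
z_grid = np.array([0, 1])[None, :, None]
g_grid = np.arange(G)[None, None, :]
Y = 1 + 2 * z_grid + 0.5 * g_grid + rng.normal(scale=0.1, size=(N, 2, G))

mu = Y.mean(axis=0)           # average dose-response function mu(z, g)
tau = mu[1] - mu[0]           # treatment effect tau(g) at each level g
delta = mu[1, 3] - mu[1, 1]   # spillover of g = 3 vs g' = 1 under z = 1

# Marginalized treatment effect under an intervention pi over g.
pi = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
tau_pi = tau @ pi
```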
3.4 Identifying Assumption: Unconfoundedness
Because the causal effects of interest depend on comparisons between quantities with different values of the individual and neighborhood treatments, identification results can focus on the identification of the ADRF $\mu(z, g; \mathcal{V})$.
Assumption 5 (Unconfoundedness of Individual and Neighborhood Treatment)
$$Y_i(z, g) \perp\!\!\!\perp Z_i, G_i \mid \mathbf{X}_i \qquad \forall\, z \in \{0, 1\},\ \forall\, g \in \mathcal{G}_i.$$
This assumption states that the individual and neighborhood treatments are jointly independent of the potential outcomes of unit $i$, conditional on the vector of covariates $\mathbf{X}_i$.
Assumption 5 states that the vector $\mathbf{X}_i$ contains all the potential confounders of the relationship between the individual and neighborhood treatments and the potential outcomes for each unit $i$. The plausibility of this assumption depends on how the vector $\mathbf{X}_i$ is defined in relation to the probability distribution of the treatment and to the network structure. Assumption 5 rules out the presence of latent variables (not included in $\mathbf{X}_i$) that affect both the probability of taking the treatment and/or the value of the neighborhood treatment and the outcome. Neighborhood covariates, that is, the topology of the neighborhood or individual-level covariates among neighbors, are potential confounders only if they affect the outcome of the unit.
In addition, in principle the assumption does not rule out the presence of homophily, that is, the tendency of individuals who share similar characteristics to form ties. In fact, homophily does not violate the unconfoundedness assumption when the characteristics driving the homophily mechanism i) are included in $\mathbf{X}_i$, ii) do not affect the outcomes even if unobserved, or iii) correspond to the treatment variable, that is, when people who share the same treatment/exposure tend to form ties. The only situation where homophily is a threat to identification is when variables underlying the network formation process are not included in $\mathbf{X}_i$ and affect the outcome.
4 Bayesian Generalized Propensity Score Estimator for Causal Effects under Neighborhood Interference
Building on the generalized propensity score estimator proposed by Forastiere et al. (2018), here we develop a new Bayesian semi-parametric estimator for the ADRF $\mu(z, g; \mathcal{V})$, and in turn for the causal estimands in Section 3.3. The idea is to combine results on the generalized propensity score for multivalued treatments proposed by Hirano & Imbens (2004) (Section 2.10.1) and extended by Forastiere et al. (2018) to interference settings, with the two-step Bayesian propensity score estimator for a binary treatment without interference (see Section 2.11). The proposed estimator lies within the Bayesian imputation approach to causal inference reviewed in Section 2.7. In addition, we will replace the parametric partial mean approach of Hirano & Imbens (2004) with a semiparametric technique based on penalized Bayesian multivariate splines. To take into account the dependence in the outcome, we also include random effects in the outcome model, with groups defined by a community detection algorithm. This Bayesian approach allows us to easily quantify the uncertainty due to the assignment of $Z_i$ and $G_i$ and to the inherent variability of the (missing) potential outcomes.
4.1 Individual and Neighborhood Propensity Scores
Under the unconfoundedness assumption (Assumption 5), the ADRF could be estimated by averaging the observed outcomes within cells defined by covariates. Nevertheless, the presence of continuous covariates or a large number of covariates poses challenges to this estimation strategy. Under SUTVA, propensity score-based estimators are a common solution (see Section 2.9). Conversely, under the neighborhood interference assumption, Forastiere et al. (2018) propose a new propensity score-based estimator, based on adjustment for the so-called individual and neighborhood propensity scores.
The individual propensity score, denoted by $\psi(z; \mathbf{x})$, is the probability of having the individual treatment at level $z$ conditional on covariates, i.e., $\psi(z; \mathbf{x}) = P(Z_i = z \mid \mathbf{X}_i^{\mathrm{ind}} = \mathbf{x})$. Similarly, the neighborhood propensity score, denoted by $\lambda(g; z, \mathbf{x})$, is the probability of having the neighborhood treatment at level $g$ conditional on a specific value $z$ of the individual treatment and on the vector of covariates, i.e., $\lambda(g; z, \mathbf{x}) = P(G_i = g \mid Z_i = z, \mathbf{X}_i^{\mathrm{neigh}} = \mathbf{x})$. Here $\mathbf{X}_i^{\mathrm{ind}}$ is the subset of covariates affecting the individual treatment, and $\mathbf{X}_i^{\mathrm{neigh}}$ is the subset of covariates affecting the neighborhood treatment. Typically, $\mathbf{X}_i^{\mathrm{ind}}$ should include individual characteristics, and $\mathbf{X}_i^{\mathrm{neigh}}$ is likely to include neighborhood characteristics. Nevertheless, the two subsets could also coincide and both include all kinds of covariates.
Forastiere et al. (2018) show that the individual and neighborhood propensity scores satisfy the balancing and unconfoundedness properties. In particular, if Assumption 5 holds given $\mathbf{X}_i$, then the unconfoundedness assumption holds conditional on the two propensity scores separately, i.e., $Y_i(z, g) \perp\!\!\!\perp Z_i, G_i \mid \psi(z; \mathbf{X}_i^{\mathrm{ind}}), \lambda(g; z, \mathbf{X}_i^{\mathrm{neigh}})$. This property allows deriving a covariate-adjustment method that separately adjusts for the individual propensity score $\psi$ and for the neighborhood propensity score $\lambda$. Because $\psi$ is the propensity score of a binary treatment, we can always adjust for it with standard techniques. Forastiere et al. (2018) propose a subclassification approach on the individual propensity score and, within subclasses that are approximately homogeneous in $\psi$, a model-based approach for the neighborhood propensity score, similar to the one in Hirano & Imbens (2004). Here we replace the frequentist subclassification and generalized propensity score-based estimator with a semiparametric Bayesian approach.
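As a concrete (and deliberately simplified) illustration of the first ingredient, the individual propensity score of a binary treatment can be estimated by a logistic regression; the Newton-Raphson fit below on simulated data is a sketch, not the authors' estimator, and the neighborhood propensity score would be modeled analogously with a model for $G_i$ given $Z_i$ and covariates.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: one covariate x and a binary individual treatment z
# with true propensity P(z = 1 | x) = sigmoid(x).
n = 400
x = rng.normal(size=n)
z = rng.binomial(1, 1 / (1 + np.exp(-x)))

def fit_logit(X, y, iters=25):
    """Newton-Raphson for a logistic regression with design matrix X."""
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ b))
        W = p * (1 - p)
        b = b + np.linalg.solve((X * W[:, None]).T @ X, X.T @ (y - p))
    return b

gamma = fit_logit(np.column_stack([np.ones(n), x]), z)
psi = 1 / (1 + np.exp(-(gamma[0] + gamma[1] * x)))  # estimated individual PS
```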
4.2 Propensity Scores and Outcome Models
4.2.1 Individual and Neighborhood Propensity Scores Models
We first posit a model for the binary individual treatment
$$Z_i \mid \mathbf{X}_i^{\mathrm{ind}} \sim \mathrm{Bern}(\psi_i), \qquad \eta_z(\psi_i) = \mathbf{X}_i^{\mathrm{ind}\,\top} \gamma_z, \qquad (12)$$
where $\eta_z(\cdot)$ is the logit or probit link function, and a model for the neighborhood treatment
$$G_i \mid Z_i, \mathbf{X}_i^{\mathrm{neigh}} \sim f_g\big(\,\cdot\,;\ \eta_g^{-1}\big(h(\mathbf{X}_i^{\mathrm{neigh}}, Z_i; \gamma_g)\big),\ \sigma_g\big), \qquad (13)$$
where again $\eta_g(\cdot)$ is a link function, $f_g$ is a probability density function (pdf), $h(\,\cdot\,; \gamma_g)$ is a flexible function of the covariates depending on a vector of parameters $\gamma_g$, and $\sigma_g$ is a scale parameter.
4.2.2 Outcome Model with Penalized Splines and Random Effects
We now postulate a model for the potential outcomes given $\psi_i$ and $\lambda_i$:
$$Y_i(z, g) \mid \psi_i, \lambda_i \sim f_y\big(\,\cdot\,;\ \eta_y^{-1}\big(m(z, g, \psi_i, \lambda_i)\big),\ \sigma_y\big), \qquad (14)$$
where as usual $\eta_y(\cdot)$ is a link function, $f_y$ is a probability density function (pdf), and $\sigma_y$ is a scale parameter. The key feature here is the predictor $m(\cdot)$, which we model semiparametrically using a set of penalized spline basis functions. Splines yield several advantages, including flexibility as well as interpretability via representations that use a compact set of basis functions and coefficients. In particular, the predictor $m(\mathbf{t}_i)$, where $\mathbf{t}_i = (z, g, \psi_i, \lambda_i)$, can be written in the mixed model representation (Ruppert et al., 2003):
$$m(\mathbf{t}_i) = \boldsymbol{\beta}_1^{\top}\mathbf{t}_i + \boldsymbol{\beta}_2^{\top}(\mathbf{t}_i \otimes \mathbf{t}_i) + \mathbf{u}^{\top}\mathbf{z}(\mathbf{t}_i) + b_{j(i)}, \qquad \mathbf{u} \sim N(0, \sigma_u^2 I), \qquad (15)$$
where $\boldsymbol{\beta} = (\boldsymbol{\beta}_1^{\top}, \boldsymbol{\beta}_2^{\top})^{\top}$, such that the first two terms of Equation (15) represent the linear predictor with interactions, and $\mathbf{z}(\mathbf{t}_i)$ contains spline basis functions. In particular, we use multivariate smoothing splines with radial basis functions of the form
$$z_k(\mathbf{t}) = C(\|\mathbf{t} - \boldsymbol{\kappa}_k\|),$$
where $\|\cdot\|$ is the Euclidean norm, $\boldsymbol{\kappa}_k$ is the $k$-th knot, and $C(\cdot)$ is a basis function. Here our choice goes to thin plate splines of the form
$$C(r) = \begin{cases} r^{2m-d} & d \text{ odd}, \\ r^{2m-d} \log r & d \text{ even}, \end{cases}$$
where $m$ is an integer satisfying $2m > d$ that controls the order of the spline and its smoothness (Ruppert et al., 2003; Wood, 2003). The default is to use the smallest integer satisfying this condition. The advantage of radial basis functions in multivariate smoothing is that they are rotationally invariant. The Gaussian distributions of the coefficients $\mathbf{u}$ and $b_j$ provide a mixed model representation of the penalties. The variances $\sigma_u^2$ and $\sigma_y^2$ are indeed the parameters controlling the degree of smoothness, through the implied smoothing parameter $\lambda = \sigma_y^2 / \sigma_u^2$. A large value of $\lambda$, that is, a strong roughness penalty, leads to a smoother fit, while a small value (close to zero) leads to an irregular fit and essentially interpolation of the data. A key component in fitting splines is the choice of the number and placement of the knots $K$. We address this issue by using penalized splines, wherein we choose a number of knots large enough to capture the local nonlinear features present in the data, and control overfitting through the penalization of the basis coefficients. Knots are first placed at the data locations. For large datasets we randomly subsample a maximum number of data locations (the default maximum is 2000). Then a truncated eigen-decomposition is used to achieve a rank reduction (Wood, 2003, 2017).
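The mechanics of a penalized radial-basis fit can be seen in one dimension, where the thin-plate basis reduces to $C(r) = r^3$ (for $d = 1$, $m = 2$). The sketch below builds the design from an unpenalized linear part plus radial basis functions at fixed knots, and applies a ridge-type penalty only to the basis coefficients, mirroring the mixed-model representation; the data, knot grid, and penalty value are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy observations of a smooth curve.
n = 200
t = np.sort(rng.uniform(-1, 1, n))
y = np.sin(3 * t) + 0.1 * rng.normal(size=n)

# Design: unpenalized linear part [1, t] plus radial basis |t - k|^3 at
# 20 fixed knots (for d = 1, m = 2 the thin-plate basis is C(r) = r^3).
knots = np.linspace(-0.9, 0.9, 20)
X = np.column_stack([np.ones(n), t])
Z = np.abs(t[:, None] - knots[None, :]) ** 3
C = np.column_stack([X, Z])

# Penalized least squares: the penalty lam acts only on the basis
# coefficients, which is the frequentist face of the N(0, sigma_u^2)
# prior in the mixed-model representation (lam = sigma^2 / sigma_u^2).
lam = 1e-3
P = np.zeros((C.shape[1], C.shape[1]))
P[2:, 2:] = lam * np.eye(len(knots))
coef = np.linalg.solve(C.T @ C + P, C.T @ y)
fit = C @ coef
```

Increasing `lam` shrinks the basis coefficients toward zero and yields a smoother fit, while `lam` near zero essentially interpolates the data.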
In Equation (14), the outcome model depends on the individual treatment, the neighborhood treatment, and both the individual and the neighborhood propensity scores. An alternative approach is to replace the model-based adjustment for the individual propensity score with a matching approach. The idea is to match units on the individual propensity score to create a matched sample where covariates are balanced across the treated group ($Z_i = 1$) and the control group ($Z_i = 0$). Adjustment for the neighborhood propensity score is then handled, as previously, by a model-based generalized propensity score method applied to the matched samples. For matching with replacement or variable-ratio matching, weights need to be incorporated into the analysis (Dehejia & Wahba, 1999; Hill et al., 2004; Stuart, 2010). When matching with replacement, individuals receive a frequency weight that reflects the number of matched sets they belong to. When using variable-ratio matching, units receive a weight proportional to the actual number of units matched to them. In a Bayesian framework, weights can be incorporated by weighting the scale parameter. Therefore, we would assume a model for the outcome as in (14), with the scale parameter $\sigma_y$ scaled by the matching weights.
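The frequency weights just described arise mechanically from matching with replacement: a control matched to $k$ treated units receives weight $k$. A minimal sketch with hypothetical propensity scores:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical estimated individual propensity scores for 4 treated and
# 8 control units.
e = rng.uniform(size=12)
z = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
treated = np.flatnonzero(z == 1)
controls = np.flatnonzero(z == 0)

# 1:1 nearest-neighbor matching with replacement on the propensity score.
matches = [controls[np.argmin(np.abs(e[controls] - e[i]))] for i in treated]

# Frequency weights: each control's weight is the number of matched sets
# it belongs to; unmatched controls and treated units get weight 0 in
# this vector.
weights = np.bincount(matches, minlength=12)
```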
Finally, the last term in Equation (15), $b_j$, is the random effect for community $j$, with $b_j \sim N(0, \sigma_b^2)$. We include this term to take into account any dependence in the outcome data between a unit and his neighbors. In principle, each unit belongs to the neighborhoods of all units in his own neighborhood. Such overlapping nature of neighborhoods complicates the estimation of the correlation structure. We propose an alternative dependence structure by identifying larger non-overlapping clusters incorporating the neighborhoods of multiple units. By defining random effects at the cluster level, we assume the presence of a latent random variable that is shared by all units belonging to the same cluster. Clusters are defined using a community detection algorithm that identifies groups of nodes that are heavily connected among themselves, but sparsely connected to the rest of the network. This definition of communities enables taking into account both the ego-alter correlation and a broader cluster correlation structure.
4.2.3 Community Detection
Unfolding the communities of real networks is widely used to determine their structural properties. Community detection or clustering algorithms aim at finding groups of related nodes that are densely interconnected and have fewer connections with the rest of the network. These groups of nodes are called communities or clusters. As communities are often associated with important structural characteristics of a complex system, community detection is a common first step in the understanding and analysis of networks. The search for communities that optimize a given quantitative performance criterion is typically an NP-hard problem, so in most cases one must rely on approximate algorithms to identify community structure. The problem of how to find communities in networks has been extensively studied and a substantial amount of work has been done on developing clustering algorithms (an overview can be found in Schaeffer, 2007; Lancichinetti & Fortunato, 2009; Fortunato, 2010). In the simulation section (Section 5), we will describe the specific methods used for our simulations.
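While our simulations rely on standard community detection algorithms, the underlying idea can be illustrated with a minimal spectral bisection: the sign pattern of the Fiedler vector (the Laplacian eigenvector with the second-smallest eigenvalue) splits a graph into two sparsely connected groups. This is only a toy alternative to the modularity-based methods cited above.

```python
import numpy as np

# Two dense blocks {0, 1, 2} and {3, 4, 5} joined by a single bridge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1

# Graph Laplacian L = D - A; its eigenvector with the second-smallest
# eigenvalue (the Fiedler vector) changes sign across the bridge.
L = np.diag(A.sum(axis=1)) - A
eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]
community = (fiedler > 0).astype(int)  # 0/1 community labels
```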
Within the Bayesian framework we should posit a prior distribution for the parameter vector