Causal Inference under Networked Interference

02/20/2020
by   Yunpu Ma, et al.
Siemens AG

Estimating individual treatment effects from data of randomized experiments is a critical task in causal inference. The Stable Unit Treatment Value Assumption (SUTVA) is usually made in causal inference. However, interference can introduce bias when the assigned treatment on one unit affects the potential outcomes of the neighboring units. This interference phenomenon is known as spillover effect in economics or peer effect in social science. Usually, in randomized experiments or observational studies with interconnected units, one can only observe treatment responses under interference. Hence, how to estimate the superimposed causal effect and recover the individual treatment effect in the presence of interference becomes a challenging task in causal inference. In this work, we study causal effect estimation under general network interference using GNNs, which are powerful tools for capturing the dependency in the graph. After deriving causal effect estimators, we further study intervention policy improvement on the graph under capacity constraint. We give policy regret bounds under network interference and treatment capacity constraint. Furthermore, a heuristic graph structure-dependent error bound for GNN-based causal estimators is provided.

1 Introduction

A common assumption made in causal inference is the consistency and interference-free assumption, i.e., the Stable Unit Treatment Value Assumption (SUTVA) Rubin (1980), under which the individual treatment response is consistently defined and unaffected by variations in other individuals. However, this assumption is problematic in a social network setting since peers are not independent; as the poet John Donne wrote, “no man is an island.”

Interference occurs when the treatment response of an individual is influenced through the exposure to its social contacts’ treatments or affected by its social neighbors’ outcomes through peer effects Bowers et al. (2013); Toulis and Kao (2013). For instance, the treatment effect of an individual under a vaccination against an infectious disease might influence the health conditions of its surrounding individuals; or a personalized online advertisement might affect other individuals’ purchase of the advertised item through opinion propagation on social networks. Separating individual treatment effect and peer effect in causal inference becomes an intractable problem under interference since, in randomized experiments or observational studies, one can only observe the superposition of both effects. The issue of how to estimate causal responses and make optimal policies on the network is studied in this work.

One of the main objectives of treatment effect estimation is to make better treatment decision rules for individuals according to their characteristics. Population-averaged utility functions have been studied in Manski (2009); Athey and Wager (2017); Kallus (2018); Kallus and Zhou (2018). In those publications, a policy learner can adapt and improve its decision rules through the utility function. However, interactions among units are always ignored. On the other hand, a policy learner usually faces a capacity or budget constraint, as studied in Kitagawa and Tetenov (2017). Therefore, in this work, we develop a new type of utility function defined on interconnected units and investigate provable policy improvement with budget constraints.

1.1 Related Work

Causal inference with interference was studied in Hudgens and Halloran (2008); Tchetgen and VanderWeele (2012); Liu and Hudgens (2014). However, the assumption of group-level interference, having partial interference within the groups and independence across different groups, is often invalid. Hence, several works focus on unit-level causal effects under cross-unit interference and arbitrary treatment assignments, such as Aronow et al. (2017); Forastiere et al. (2016); Ogburn et al. (2017a, b); Viviano (2019). Other approaches for estimating causal effects on networks use graphical models, which are studied in Arbour et al. (2016); Tchetgen et al. (2017); Ogburn et al. (2018); Sherman and Shpitser (2018); Bhattacharya et al. (2019).

1.2 Notations and Previous Approaches

Let $\mathcal{G} = (\mathcal{N}, \mathcal{E}, A)$ denote a directed or undirected graph with a node set $\mathcal{N}$ of size $n$, an edge set $\mathcal{E}$, and an adjacency matrix $A$. For a node, or unit, $i \in \mathcal{N}$, let $\mathcal{N}_i$ indicate the set of neighboring nodes $j$ with $A_{ij} \neq 0$, excluding the node $i$ itself, and let $X_i$ denote the covariate vector of node $i$, which is defined in the space $\mathcal{X}$. We focus on the Neyman–Rubin causal inference model Rubin (1974); Splawa-Neyman et al. (1990) for the time being. Let $T_i$ be a binary variable with $T_i = 1$ indicating that node $i$ is in the treatment group, and $T_i = 0$ if $i$ is in the control group. Moreover, let $Y_i$ be the outcome variable with $Y_i(T_i = 1)$ indicating the potential outcome of $i$ under treatment $T_i = 1$ and $Y_i(T_i = 0)$ the potential outcome under control $T_i = 0$. We use $T_{\mathcal{N}_i}$ and $Y_{\mathcal{N}_i}$ to represent the treatment assignments and potential outcomes of the neighboring nodes $\mathcal{N}_i$, and $\mathbf{T}$ the entire treatment assignments vector.

Under the SUTVA assumption, the individual treatment effect on node $i$ is defined as the difference between outcomes under treatment and under control, i.e., $\tau_i = Y_i(T_i = 1) - Y_i(T_i = 0)$. To estimate treatment effects under network interference, an exposure variable $G_i$ is proposed in Toulis and Kao (2013); Bowers et al. (2013); Aronow et al. (2017). The exposure variable $G_i$ is a function of the neighboring treatments $T_{\mathcal{N}_i}$. For instance, $G_i$ can be a variable indicating the level of exposure to the treated neighbors, i.e., $G_i = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} T_j$.
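As a small illustration of the exposure variable described above, the following sketch computes the fraction of treated neighbors for every node of a toy graph. The function name `exposure` and the convention of returning 0.0 for isolated nodes are assumptions made for this example.

```python
def exposure(adjacency, treatment):
    """Fraction of treated neighbors per node (0.0 for isolated nodes, by assumption)."""
    n = len(adjacency)
    result = []
    for i in range(n):
        # Neighbors of i, excluding i itself.
        neighbors = [j for j in range(n) if j != i and adjacency[i][j]]
        if not neighbors:
            result.append(0.0)
            continue
        treated = sum(treatment[j] for j in neighbors)
        result.append(treated / len(neighbors))
    return result


# Toy path graph 0 - 1 - 2 in which nodes 0 and 1 are treated.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
T = [1, 1, 0]
print(exposure(A, T))  # [1.0, 0.5, 1.0]
```

Node 1 has one treated and one untreated neighbor, so its exposure is 0.5, while the two end nodes are fully exposed through their single treated neighbor.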

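Under the assumption that the outcome only depends on the individual treatment and neighborhood treatments, Forastiere et al. (2016) defines an individual treatment effect under the exposure $G_i = g$ as

$$\tau(g) = \mathbb{E}\left[Y_i(T_i = 1, G_i = g) - Y_i(T_i = 0, G_i = g)\right] \tag{1}$$

Moreover, the spillover effect under the treatment $T_i = t$ and the exposure $G_i = g$ is defined as $\delta(g; t) = \mathbb{E}\left[Y_i(T_i = t, G_i = g) - Y_i(T_i = t, G_i = 0)\right]$. Treatment and spillover effects are then estimated using generalized propensity score (GPS) weighted estimators.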

In general, the outcome model can be more complicated, depending on the network topology and the covariates of neighboring units. Ogburn et al. (2017a) investigates more general causal structural equations under a dimension-reducing assumption, in which the potential outcome of a unit depends on its own treatment and covariates and on summary functions of the neighborhood covariates and treatments, e.g., the summation or average of neighboring treatment assignments and covariates, respectively. Motivated by this causal structural equation model, we build GNN-based causal estimators with appropriate covariate and treatment aggregation functions as inputs.

Contributions This work has four major contributions. First, we propose GNN-based causal estimators for causal effect prediction and to recover direct treatment effect under interference (Section 2). Second, we define a novel utility function for policy optimization on a network and derive a graph-dependent policy regret bound (Section 3). Third, we provide an error bound for the GNN-based causal estimators (Section 3 and Appendix). Last, we conduct extensive experiments to verify the superiority of GNN-based causal estimators and show that the accuracy of a causal estimator is crucial for finding the optimal policy (Section 4).

2 GNN-based Causal Estimators

In this section, we introduce our GNN-based causal effect estimators under general network interference.

2.1 Structural Equation Model

Given the graph $\mathcal{G}$, the covariates $\mathbf{X}$ of all units in the graph, and the entire treatment assignments vector $\mathbf{T}$, the structural equation model describing the considered data generation process is given as follows

(2)

for units $i = 1, \dots, n$. This structural equation model encodes both the observational studies and the randomized experiments setting. In observational studies, e.g., on the Amazon dataset (see Section 4.1), the treatment depends on the covariates of the unit and an unknown treatment assignment function, or even on the neighboring units under network interference. In the setting of the randomized experiment, e.g., experiments on the Wave1 and Pokec datasets, the treatment assignment function is specified through a predefined treatment probability. The response function characterizes the causal response, which depends on, in addition to the unit's own covariates and treatment, the graph and the neighboring covariates and treatment assignments. If only influences from first-order neighbors are considered, the response generation depends only on the unit and its direct neighbors. When the graph structure is given and fixed, we leave the graph out of the notation.

2.2 Distribution Discrepancy Penalty

Even without network interference, a covariate shift problem of counterfactual inference is commonly observed, namely, the factual distribution $P(X, T)$ differs from the counterfactual distribution $P(X, 1 - T)$. To avoid biased inference, Johansson et al. (2016); Shalit et al. (2017) propose a balancing counterfactual inference using domain-adapted representation learning. Covariate vectors are first mapped to a feature space via a feature map $\Phi$. In the feature space, treated and control populations are balanced by penalizing the distribution discrepancy between $P(\Phi(X) \mid T = 1)$ and $P(\Phi(X) \mid T = 0)$ using the Integral Probability Metric. This approach is equivalent to finding a feature space such that the treatment assignment and representation become approximately disentangled, namely $P(\Phi(X), T) \approx P(\Phi(X)) P(T)$. We use the Hilbert-Schmidt Independence Criterion (HSIC) as the dependence test in the feature space and compute the empirical HSIC using a Gaussian RBF kernel (its expression is relegated to Appendix A). Note that incorporating the feature map and the representation balancing penalty is essential to tackle the imbalanced assignments in observational studies, e.g., on the Amazon dataset (see Section 4.1).
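To make the dependence penalty concrete, here is a minimal NumPy sketch of the standard biased empirical HSIC estimator, $\mathrm{tr}(KHLH)/(n-1)^2$, with Gaussian RBF kernels. It is an illustration under assumptions: the paper's exact expression is in its Appendix A and may differ, and the function names and the bandwidth `sigma` are hypothetical.

```python
import numpy as np


def rbf_kernel(x, sigma=1.0):
    """Gaussian RBF Gram matrix for the rows of x."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))


def hsic(features, treatment, sigma=1.0):
    """Biased empirical HSIC between feature rows and treatment assignments."""
    n = len(treatment)
    K = rbf_kernel(features, sigma)         # kernel on representations
    L = rbf_kernel(treatment, sigma)        # kernel on treatment assignments
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    return float(np.trace(K @ H @ L @ H)) / (n - 1) ** 2
```

A value near zero suggests that the representation carries little information about the treatment assignment, which is the balancing goal described above; in a training loop the same quantity would be added to the loss with a tunable weight.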

2.3 Graph Neural Networks

Graph neural networks can learn and aggregate feature information from distant neighbors, which makes them suitable candidates for capturing the spillover effect given by the neighboring units. Different GNNs are employed and compared in our model, and we briefly review them below.

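Graph Convolutional Network (GCN) Kipf and Welling (2016) The graph convolutional layer in GCN is defined as $H^{(l+1)} = \sigma\big(\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2} H^{(l)} W^{(l)}\big)$, where $H^{(l)}$ is the hidden output from the $l$-th layer with $H^{(0)}$ being the input features matrix, $W^{(l)}$ is the trainable weight matrix of the layer, and $\sigma$ is the activation function, e.g., ReLU. The modified adjacency $\tilde{A}$ with inserted self-connections is defined as $\tilde{A} = A + I$, and $\tilde{D}$ denotes the node degree matrix of $\tilde{A}$.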
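The graph convolutional layer can be sketched in a few lines of NumPy. This is a dense, single-layer illustration of the Kipf and Welling propagation rule with a ReLU activation, not the training code used in the experiments; the function name `gcn_layer` is an assumption.

```python
import numpy as np


def gcn_layer(A, H, W):
    """One GCN layer: ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_tilde = A + np.eye(A.shape[0])            # insert self-connections
    degrees = A_tilde.sum(axis=1)               # degrees of the modified graph
    D_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt   # symmetric normalization
    return np.maximum(A_hat @ H @ W, 0.0)       # ReLU activation


# Two connected nodes with one scalar feature each.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
H = np.array([[1.0],
              [3.0]])
W = np.array([[1.0]])
print(gcn_layer(A, H, W))  # both nodes receive the average feature, 2.0
```

With self-connections every node has degree 2 in this toy graph, so the normalized adjacency averages the two features.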

GraphSAGE GraphSAGE Hamilton et al. (2017) is an inductive framework for calculating node embeddings and aggregating neighbor information. The mean aggregation operator in GraphSAGE averages the hidden features over the neighborhood of each node before applying a learned transformation. Traditional GCN algorithms perform spectral convolution via eigen-decomposition of the full graph Laplacian. In contrast, GraphSAGE computes a localized convolution by aggregating the neighborhood around a node, which resembles the simulation protocol of linear treatment response with spillover effect for semi-synthetic experiments (see Section 4.1). Due to this resemblance, a better causal estimator is expected when using GraphSAGE as the aggregation function (see the beginning of Appendix G.3 for more heuristic motivations).

1-GNN 1-GNN Morris et al. (2018) is a variation of GraphSAGE, which performs separate transformations of node features and aggregated neighborhood features. Since the features of the considered unit and its neighbors contribute differently to the superimposed outcome, it is expected that the 1-GNN is more expressive than GraphSAGE. The convolutional operator of 1-GNN has the form $h_i^{(l+1)} = \sigma\big(W_1^{(l)} h_i^{(l)} + W_2^{(l)} \sum_{j \in \mathcal{N}_i} h_j^{(l)}\big)$.
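The difference between the two aggregation schemes can be sketched as follows. The mean layer assumes the GraphSAGE variant that averages a node together with its neighbors and applies one shared weight matrix, and the second layer follows the operator of Morris et al. (2018) with separate weights for a node and the sum over its neighbors; both are dense single-layer illustrations with hypothetical function names, not the implementation used in the experiments.

```python
import numpy as np


def sage_mean_layer(A, H, W):
    """Average each node with its neighbors, then apply one shared linear map and ReLU."""
    A_self = A + np.eye(A.shape[0])
    mean_agg = (A_self @ H) / A_self.sum(axis=1, keepdims=True)
    return np.maximum(mean_agg @ W, 0.0)


def separate_transform_layer(A, H, W_self, W_neigh):
    """Separate linear maps for a node's own features and the sum over its neighbors."""
    return np.maximum(H @ W_self + (A @ H) @ W_neigh, 0.0)


# Path graph 0 - 1 - 2 with one scalar feature per node.
A = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
H = np.array([[1.0], [2.0], [3.0]])
print(sage_mean_layer(A, H, np.array([[1.0]])))
print(separate_transform_layer(A, H, np.array([[1.0]]), np.array([[0.5]])))
```

The second layer can weight a unit's own features and its neighbors' features differently, which is the property the text relies on when separating individual and spillover contributions.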

Figure 1: Treated and control populations have different distributions in the covariate vector space. Through a feature map and the distribution discrepancy term HSIC, features and treatment assignments become disentangled in the feature space. The adjacency matrix remains the same before and after the feature map. After applying GNNs, the concatenated representation of each node is fed into one of two outcome prediction networks, depending on the treatment assignment. The loss function consists of the outcome prediction error and the distribution discrepancy in the feature space.

2.4 GNN-based Causal Estimators

We use the percentage of treated neighboring nodes, i.e., the random variable $G_i = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} T_j$, as the treatment summary function, and the output of GNNs as the covariate aggregation function. The concatenation of the resulting quantities for node $i$ is then fed into one of two outcome prediction networks, depending on the treatment $T_i$; both are neural networks with a scalar output. The treatment vector is also an input of the GNNs: during the implementation, the treatment assignment vector masks the covariates, and the GNN models use the masked covariates as inputs. In summary, the loss function for GNN-based estimators combines the outcome prediction error with the representation balancing penalties, whose weights are tunable hyperparameters. Our model is illustrated in Fig. 1. During the implementation, we incorporate two types of empirical representation balancing: balancing the outputs of the representation network to tackle imbalanced assignments, and balancing the outputs of the GNN representations to tackle imbalanced spillover exposure.

At this point, it is necessary to emphasize that only the causal responses of a part of the units in the graph are available to the models. The GNN-based models use this part of the causal responses, the network structure, and the covariates as input, and can predict the superimposed causal effects of the remaining units. Note that for GNN-based nonparametric models, the identifiability of causal response is guaranteed under reasonable assumptions similar to those given in Section 3.2 of Ogburn et al. (2017a). The proof is relegated to Appendix B.

Notice that the outcome prediction networks are trained to estimate the superposition of the individual treatment effect and the spillover effect. Still, after fitting the observed outcomes, we expect to extract the non-interfered individual treatment effect from the causal estimators by assuming that the considered unit is isolated. An individual treatment effect estimator can be defined similarly to Eq. 1. To be more specific, the individual treatment effect of unit $i$ is expected to be extracted from GNN-based estimators by setting its exposure to $0$ and its neighbors' covariates to $\mathbf{0}$ (the spillover effect can be extracted similarly), namely

(3)
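The extraction step can be illustrated with a toy stand-in for the fitted estimator. Everything in the sketch below is hypothetical: `fitted_outcome` plays the role of the trained outcome prediction networks, its coefficients are arbitrary, and a scalar `neighbor_summary` replaces the GNN output.

```python
def fitted_outcome(treatment, covariate, exposure, neighbor_summary):
    """Hypothetical stand-in for the fitted outcome prediction networks."""
    baseline = 1.0 + 0.5 * covariate
    direct = 2.0 * covariate * treatment              # individual treatment effect
    spill = 0.8 * exposure + 0.1 * neighbor_summary   # network contribution
    return baseline + direct + spill


def extract_ite(covariate):
    """Treat the unit as isolated: exposure 0 and zeroed neighbor covariates."""
    treated = fitted_outcome(1, covariate, exposure=0.0, neighbor_summary=0.0)
    control = fitted_outcome(0, covariate, exposure=0.0, neighbor_summary=0.0)
    return treated - control


print(extract_ite(1.5))  # 3.0, the direct effect only
```

Comparing a treated prediction under nonzero exposure with an isolated control prediction would instead mix the spillover term into the estimate, which is why both predictions are evaluated on the isolated unit.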

3 Intervention Policy on Graph

After obtaining the treatment effect estimator, we develop an algorithm for learning intervention assignments to maximize the utility on the entire graph; the learned assignment rule is called a policy. As suggested in Athey and Wager (2017), without interference a utility function is defined as $A(\pi) = \mathbb{E}\left[(2\pi(X_i) - 1)\,\tau(X_i)\right]$. An optimal policy $\hat{\pi}_n$ is obtained by maximizing the $n$-sample empirical utility function $\hat{A}_n(\pi) = \frac{1}{n} \sum_{i=1}^{n} (2\pi(X_i) - 1)\,\hat{\tau}(X_i)$ given the individual treatment response estimator $\hat{\tau}$, i.e., $\hat{\pi}_n \in \arg\max_{\pi \in \Pi} \hat{A}_n(\pi)$, where $\Pi$ indicates the policy function class. Notably, $\hat{\pi}_n$ tends to assign treatment to units with positive treatment effect and control to units with negative responses.
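A short sketch of the interference-free case, assuming the utility takes the form $(2\pi(X_i) - 1)\,\hat{\tau}(X_i)$ averaged over units, as in Athey and Wager (2017). The function names and the unconstrained thresholding rule are illustrative assumptions; in practice the policy is restricted to a function class.

```python
def empirical_utility(policy, tau_hat):
    """Sample average of (2 * pi_i - 1) * tau_hat_i over all units."""
    n = len(tau_hat)
    return sum((2 * p - 1) * t for p, t in zip(policy, tau_hat)) / n


def threshold_policy(tau_hat):
    """Unconstrained maximizer: treat exactly the units with a positive estimated effect."""
    return [1 if t > 0 else 0 for t in tau_hat]


tau_hat = [0.5, -1.0, 2.0]
best = threshold_policy(tau_hat)
print(best)                                   # [1, 0, 1]
print(empirical_utility(best, tau_hat))       # about 1.1667
print(empirical_utility([1, 1, 1], tau_hat))  # 0.5 when everyone is treated
```

Withholding treatment from the unit with a negative estimated effect raises the utility, which matches the remark that the learned policy treats units with positive effects and leaves the others in the control group.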

Now, consider the outcome variable under network interference. For notational simplicity and clarity of the later proof, we assume first-order interference from nearest neighboring units; hence the outcome of a unit depends only on its own treatment and covariates and on those of its nearest neighbors. Inspired by the interference-free utility, the utility function of a policy under interference is defined as

(4)

where the outcome with an empty graph represents the individual outcome under control without any network influence (hence the neighboring covariates and treatments are omitted in the expression). After some manipulations, the utility equals the sum of the individual treatment effect and the spillover effect.

To be more specific, the first term is the conventional individual treatment effect, while the second represents the spillover effect under the policy. Due to the network-dependency in the spillover effect, an optimal policy will not merely treat units with positive responses but also adjust its intervention on the entire graph to maximize the spillover effects.

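Next, we establish guarantees for the regret of the learned intervention policy. Let $\hat{\tau}$ and $\hat{\delta}$ denote the estimators of the individual treatment effect $\tau$ and the spillover effect $\delta$, respectively. Given the true models $\tau$ and $\delta$, let $S_n(\pi)$ be the empirical analogue of the utility $S(\pi)$, and let $\hat{S}_n(\pi)$ be the empirical utility with estimators plugged in. Using learned causal estimators, an optimal intervention policy from the empirical utility perspective can be obtained from $\hat{\pi}_n \in \arg\max_{\pi \in \Pi} \hat{S}_n(\pi)$. Moreover, the best possible intervention policy from the functional class $\Pi$ with respect to the utility $S(\pi)$ is written as $\pi^\star \in \arg\max_{\pi \in \Pi} S(\pi)$, and the policy regret between $\pi^\star$ and $\hat{\pi}_n$ is defined as $R(\hat{\pi}_n) = S(\pi^\star) - S(\hat{\pi}_n)$. Throughout the estimation of policy regret, we maintain the following assumptions.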

Assumption 1.

(BO) Bounded treatment and spillover effects: There exist such that the individual treatment effect satisfies and the spillover effect satisfies .
(WI) Weak independence assumption: For any node indices and , the weak independence assumption assumes that .
(LIP) Lipschitz continuity of the spillover effect w.r.t. policy: Given two treatment policies and , for any node the spillover effect satisfies , where the Lipschitz constant satisfies and .
(ES) Uniform consistency: after fitting experimental or observational data on , individual treatment effect estimator satisfies , and spillover estimator satisfies , , where and are scaling factors that characterize the errors of estimators.

Notice that the (ES) assumption requires consistent estimators of the individual treatment effect and the spillover effect, which is the fundamental problem of causal inference with interference. In our GNN-based model, these empirical errors are particularly difficult to estimate due to the lack of proper theoretical tools for understanding GNNs. To grasp how these GNN-based causal estimators are influenced by the network structure and network effect, we study in Appendix G.3 a particular class of GNNs, inspired by the surrogate model of nonlinear graph neural networks, and make the following claim.

Claim 1.

GNN-based causal estimators restricted to a particular class for predicting the superimposed causal effects have an error bound , where and is the maximal node degree in the graph.

The above claim indicates that an accurate and consistent causal estimator is difficult to obtain under large network effects. In the worst case, the convergence rate in the (ES) assumption becomes unreachable when depends on the number of units. The exact convergence rate of causal estimators is impossible to derive since it depends on the topology of the network, and it is beyond the theoretical scope of this work.

Besides, (LIP) assumes that the change of received spillover effect is bounded after modifying the treatment assignments of one unit’s neighbors. We will use hypergraph techniques, instead of chromatic number arguments, to give a tighter bound of policy regrets. Another advantage is that the weak independence (WI) assumption can be relaxed to support longer dependencies on the network. However, by relaxing (WI), the power of in Theorems 1 and 2 needs to be modified correspondingly. For example, if we assume a next-nearest neighbors dependency of covariates, i.e., for , then the term in Theorems 1 and 2 needs to be modified to .

Under Assumption 1, we can have the following bound.

Theorem 1.

By Assumption 1, for any small , the policy regret is bounded by with probability at least , where indicates the covering number on the functional class with radius (the covering number characterizes the capacity of a functional class; its definition is provided in Appendix G), and is the maximal node degree in the graph .

Proof.

Under (WI) and (BO), we can use concentration inequalities of networked random variables defined on a hypergraph, which is derived from the graph, to bound the convergence rate. Moreover, the Lipschitz assumption (LIP) allows an estimation of the covering number of the policy functional class. More discussions on the plausibility of Assumption 1 and the full proof are relegated to Appendix G. ∎

Suppose that the policy functional class is finite and its capacity is bounded by . According to Theorem 1, with probability at least , the policy regret is bounded by . This indicates that optimal policies are more difficult to find in a dense graph even under weak interactions between neighboring nodes.

In a real-world setting, treatments could be expensive, so the policymaker usually encounters a budget or capacity constraint, e.g., the proportion of patients receiving treatment is limited, and deciding who should be treated under constraints is a challenging problem Kitagawa and Tetenov (2017). Through the interference-free welfare function, a policy is trained to make treatment choices using only each individual’s features. In contrast, under interference, a smart policy should maximize the utility function Eq. (4) by deciding whether to treat an individual or expose it to neighboring treatment effects such that a required constraint can be satisfied. Therefore, in the second part of the experiments, after fitting causal estimators, we investigate policy networks that maximize the utility function on the graph and satisfy a treatment proportion constraint.

To be more specific, we consider the constraint where only a given percentage of the population can be assigned to treatment (note that this percentage differs from the treatment probability in the causal structural equations in the randomized experiment setting). The corresponding sample-averaged loss function for a policy network under the capacity constraint penalizes violations of the constraint, with a hyperparameter controlling the strength of the constraint. The optimal policy under the capacity constraint is obtained by minimizing this loss over the policy class. A capacity-constrained policy regret bound is provided in Theorem 2, which is proved in Appendix G.2. It indicates that if the allowed treatment percentage in the constraint is small, then the optimal capacity-constrained policy will be challenging to find. Increasing the treatment probability cannot guarantee the improvement of the group’s interest due to the non-linear network effect. Therefore, finding the balance between optimal treatment probability, treatment assignment, and the group’s welfare is a thought-provoking question in social science.
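One common way to encode such a capacity constraint is a hinge penalty on the treated fraction, sketched below. This penalty form is an assumption chosen for illustration and is not necessarily the loss used by the authors; `capacity_penalized_loss`, `utility_fn`, `capacity`, and `penalty_weight` are hypothetical names.

```python
def capacity_penalized_loss(treat_probs, utility_fn, capacity, penalty_weight):
    """Negative utility plus a hinge penalty when the treated fraction exceeds the capacity."""
    treated_fraction = sum(treat_probs) / len(treat_probs)
    overshoot = max(0.0, treated_fraction - capacity)
    return -utility_fn(treat_probs) + penalty_weight * overshoot


def toy_utility(probs):
    # Toy utility for illustration only: the average treatment probability.
    return sum(probs) / len(probs)


# Half of the units are treated although only a quarter is allowed.
print(capacity_penalized_loss([1, 1, 0, 0], toy_utility, capacity=0.25, penalty_weight=10.0))  # 2.0
```

With a large penalty weight the minimizer is pushed toward policies that respect the treatment budget, while a small weight lets the utility term dominate.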

Theorem 2.

By Assumption 1, for any small , the policy regret under the capacity constraint is bounded by with probability at least , where indicates the covering number on the functional class with radius , and is the maximal node degree in the graph .

4 Experiments

4.1 Datasets

The difficulty of evaluating the performance of the proposed estimators lies in the broad set of missing outcomes under counterfactual inference. Therefore, we conduct randomized experiments on two semi-synthetic datasets with ground-truth response generation functions, and observational studies on one real dataset with unknown treatment assignment and response generation functions. Notably, in the randomized experiment setting, we consider a linear response generation function inspired by Eq. 5 of Toulis and Kao (2013), in which the observed outcome is the sum of the outcome under control and without network interference, the individual treatment effect, the spillover effect, and Gaussian noise. The forms of the individual treatment effect and the spillover effect are dataset-dependent and discussed below.
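A minimal simulation sketch of such a linear response model follows. It assumes that the individual treatment effect is switched on by the unit's own treatment and that the spillover term is supplied per unit; the function and argument names are hypothetical, and the dataset-specific effect forms described in the appendices are not reproduced.

```python
import random


def simulate_linear_outcomes(baseline, tau, spillover, treatment, noise_std=0.0, seed=0):
    """Outcome = baseline + T_i * tau_i + spillover_i + Gaussian noise (assumed form)."""
    rng = random.Random(seed)
    return [
        y0 + t * effect + spill + rng.gauss(0.0, noise_std)
        for y0, effect, spill, t in zip(baseline, tau, spillover, treatment)
    ]


# Two units: the first is treated and receives spillover, the second is an isolated control.
outcomes = simulate_linear_outcomes(
    baseline=[1.0, 1.0], tau=[2.0, 2.0], spillover=[0.5, 0.0], treatment=[1, 0]
)
print(outcomes)  # [3.5, 1.0] with the noise switched off
```

Only the sum on the left-hand side is observed, which is why the estimators must separate the individual and spillover contributions after fitting.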

To further investigate the superiority of the GNN-based causal estimators on nonlinear causal responses, we consider a data generation function inspired by Section 4.2 of Toulis and Kao (2013), in which a parameter characterizes the strength of nonlinear effects. In addition, a more complicated nonlinear response generation function is considered, where the quadratic terms signify the spillover effect depending on the individual treatment effect.

Table 1: Experimental results of randomized experiments on the Wave1 and Pokec datasets (columns) using the linear response generation function. Compared methods (rows): DA GB, DA RF, DR GB, DR EN, GPS, and GCN, GraphSAGE, and 1-GNN with representation balancing; the last row reports the improvement. For Wave1, the node degree, the decay parameter, and the treatment probability are fixed, as are the corresponding settings for Pokec. Improvements are obtained by comparing with the best baselines.

Wave1

Wave1 is an in-school questionnaire dataset collected through the National Longitudinal Study of Adolescent Health project Chantala and Tabor (1999). The questionnaire contains questions such as age, grade, health insurance, etc. Due to the anonymity of Wave1, we use the symmetrized $k$-NN graph derived from the questionnaire data as the friendship network. In our experiments, we choose , and the resulting friendship network has nodes and links. We assume a randomized experiment conducted on the friendship network, which describes students’ improvements of performance through assignment to a tutoring program or through the peer effect. Hence, the three components of the response represent the overall performance of a student before assignment to a tutoring program and before being exposed to peer influences, the simulated performance difference after an assignment, and the synthetic peer effect. The exact forms of the first two components depend nonlinearly on the features of each student. Moreover, the first-order peer effect is simulated with a decay parameter that characterizes the decay of influence. In the randomized experiments reported in the main text, we randomly assign a fixed fraction of the population to the treatment. Details of the generating process and more experimental results with different settings are relegated to Appendix C and F.
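The construction of a symmetrized k-NN graph can be sketched as follows. The sketch assumes squared Euclidean distance between feature vectors and symmetrization by union (an edge is kept if either endpoint selects the other); the distance and the symmetrization rule actually used for Wave1 are not stated here, and the function name is hypothetical.

```python
def symmetrized_knn_graph(points, k):
    """Adjacency matrix linking each point to its k nearest neighbors, symmetrized by union."""
    n = len(points)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        # Squared Euclidean distances from point i to every other point.
        distances = sorted(
            (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
            for j in range(n)
            if j != i
        )
        for _, j in distances[:k]:
            adjacency[i][j] = 1
            adjacency[j][i] = 1  # keep the edge in both directions
    return adjacency


# Three points on a line; with k = 1 the result is the path 0 - 1 - 2.
print(symmetrized_knn_graph([[0.0], [1.0], [10.0]], k=1))
```

Because of the union rule a node can end up with more than k neighbors, as node 1 does in the toy example.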

Table 2: Experimental results on the pos Amazon dataset without representation balancing and under different imbalance penalties. Compared methods (rows): DA GB, DA RF, DR GB, DR EN, GPS, and GCN, GraphSAGE, and 1-GNN, each without representation balancing and with each of the two balancing penalties; the last row reports the improvement. Improvements are obtained by comparing with the best baselines.

Pokec The friendship network derived from the Wave1 questionnaire data may violate the power-law degree distribution of real networks. Hence, we further conduct experiments on the real social network Pokec Takac and Zabovsky (2012) with generated responses. Pokec is an online social network in Slovakia with profile data, including age, gender, education, etc. We consider randomized experiments on the Pokec social network, in which personalized advertisements of a new health medicine are pushed to some users. We assume that the response of exposed users to the advertisement only depends on a few properties, such as age, weight, smoking status, etc. We keep profiles with complete information on these properties, and the resulting Pokec social network contains nodes and links. Here, the three components of the response represent the purchase of this new health medicine without external influence on the decision, the purchase difference after seeing the advertisement, and the purchase difference due to social influences. For randomized experiments on the Pokec social network, we also consider peer effects from next-nearest neighbors, where a decay parameter characterizes the decay of influence. Details and more experimental results with different hyperparameter settings are given in Appendix D and F.

Amazon The co-purchase dataset from Amazon contains product details, review information, and a list of similar products. Therefore, there is a directed network of products that describes whether a substitutable or complementary product is getting co-purchased with another product Leskovec et al. (2007). To study the causal effect of reviews on the sales of products, Rakesh et al. (2018) generates a dataset containing products with only positive reviews from the Amazon co-purchase dataset, named as pos Amazon, and Amazon for short. In this dataset, all items have positive reviews, i.e., the average rating is larger than , and one item is considered to be treated if there are more than three reviews under this item; otherwise, an item is in the control group. In this setting, pos Amazon is an over-treated dataset with more than of products being in the treatment group. Word2vec embedding of an item’s review serves as the feature vector of this item. Moreover, the individual treatment effect of an item is approximated by matching it to other items having similar features and under minimal exposure to neighboring nodes’ treatments.

Table 3: Experimental results of randomized experiments on the Wave1 and Pokec datasets (columns) using the two nonlinear response generation functions. Compared methods (rows): DA GB, DA RF, DR GB, DR EN, GPS, GCN, GraphSAGE, and 1-GNN; the last row reports the improvement. For Wave1, the node degree, the decay parameter, and the treatment probability are fixed, as are the corresponding settings for Pokec. Both representation balancing penalties are deployed in the GNN-based estimators. Improvements are obtained by comparing with the best baselines.

4.2 Results of Causal Estimators

Evaluation Metrics

One evaluation metric is the square root of the MSE for the prediction of the observed outcomes on the test dataset, $\sqrt{\mathrm{MSE}} = \sqrt{\frac{1}{n_{\text{test}}} \sum_{i} (Y_i - \hat{Y}_i)^2}$, where the sum runs over the $n_{\text{test}}$ test units and $\hat{Y}_i$ denotes the output of the outcome prediction network (see the two prediction networks in Fig. 1). This metric reflects how well an estimator can predict the superimposed individual treatment and spillover effects on a network. Another evaluation metric that quantifies the quality of the extracted individual treatment effect is the Precision in Estimation of Heterogeneous Effect (PEHE) studied in Hill (2011), which compares the individual treatment effect estimate defined in Eq. (3) with the ground truth.
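Both metrics reduce to a root-mean-square error on different quantities, as the following sketch shows. Reporting PEHE in its square-root form is an assumption made here (the squared form is also common), and the function names are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared error between observed and predicted outcomes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def sqrt_pehe(tau_true, tau_pred):
    """Same error form, applied to individual treatment effects (assumed square-root PEHE)."""
    return rmse(tau_true, tau_pred)


print(rmse([0.0, 0.0], [3.0, 4.0]))       # sqrt(12.5), about 3.5355
print(sqrt_pehe([1.0, 2.0], [1.0, 2.0]))  # 0.0 for a perfect effect estimate
```

The first metric needs only observed outcomes, whereas the second requires ground-truth individual effects and is therefore only available on semi-synthetic data or through an approximation such as matching.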

Baselines

Baseline models are the domain adaptation method Künzel et al. (2019) with gradient boosting regression (DA GB) and with random forest regression (DA RF), and the doubly-robust estimator Funk et al. (2011) with gradient boosting regression (DR GB) and with elastic net regression (DR EN). They are implemented via EconML Research (2019) with grid-searched hyperparameters. These baselines incorporate the feature vectors as inputs and the exposure as the control variable into the model. For randomized experiments on Wave1 and Pokec, the predefined treatment probability is provided, while for the observational studies on the Amazon dataset, the covariate-dependent treatment probability is estimated. Moreover, the generalized propensity score (GPS) method is reproduced and enhanced for a fair comparison, equipped with the same feature map function. More details of the baselines, a sketch of the training procedure, and the hyperparameters are relegated to Appendix F.

Experiments

We use partial outcomes, both in the randomized experiments and observational settings, to train the GNN-based causal estimators. We investigate the effect of penalizing representation imbalance in the observational studies on the Amazon dataset. All data points are randomly divided into training, validation, and test sets. Note that the entire network and the covariates of all units are given during training and test, while only the causal responses of units in the training set are provided in the training phase. For the randomized experiments using the Wave1 and Pokec datasets, we repeat the experiments multiple times and use different random parameters in the response generation process each time.

Experimental results on the Wave1 and Pokec data generated via linear model are presented in Table 1. Both representation balancing and are deployed in the GNN-based estimators to search for the best performance. GNN-based estimators, especially the -GNN estimator, are superior for superimposed causal effects prediction. One can observe a improvement of the metric on the Wave1 dataset when comparing the -GNN estimator with the enhanced GPS method and a improvement on the Pokec dataset. The covariates of neighboring units in the Pokec dataset actually have strong cosine similarity, hence the improvement on the Pokec dataset is not significant, and the network effect can be approximately captured from the exposure variable. Table 2 shows the experimental results on the pos Amazon dataset in the observational study. In particular, we demonstrate the effects of without representation penalty, and with different penalties. The results show that representation penalties can significantly improve the individual treatment effect recovery, serving as a regularization to avoid over-fitting the network interference. Furthermore, GNN-based estimators using penalty are slightly better than those using penalty; however, this comes at the cost of the metric .

Table 4: Intervention policy improvements on the Wave1 and Pokec semi-synthetic datasets under treatment capacity constraint with . and represent utility differences evaluated from learned estimators and ground truth, respectively. Note that only reflects the real policy improvement. (Columns: Wave1, Pokec. Rows: DA GB, DA RF, DR GB, DR EN, GPS, GCN, GraphSAGE, -GNN.)

Table 3 reports the performance of GNN-based causal estimators on nonlinear response models. Nonlinear responses are generated via and under . For the metric, GNN-based estimators outperform the best baseline GPS dramatically, showing the effectiveness of predicting nonlinear causal responses. Moreover, a and performance improvement on the metric with the Wave1 dataset shows that setting an empty graph, i.e., , in the GNN-based estimators is an appropriate approach for extracting individual causal effect. Results of nonlinear responses with larger strength parameter are reported in Appendix C and D.

4.3 Results on Improved Intervention Policy

Experiment Settings

Table 5: Intervention policy improvements on the pos Amazon dataset under treatment capacity constraint with . Only domain adaptation methods and GPS are compared since they are the best baseline estimators according to Table 2. (Columns: DA GB, DA RF, GPS, GCN, GraphSAGE, -GNN.)

After obtaining the optimal causal effect estimators and feature map (see Fig. 1), we subsequently optimize intervention policy on the same graph. A simple 2-layer neural network, with ReLU activation between hidden layers and sigmoid activation at the end, is employed as the policy network. The output of the policy network lies in , and it is interpreted as the probability of treating a node. The real intervention choice is then sampled from this probability via the Gumbel-softmax trick Jang et al. (2016) such that gradients can be back-propagated. Sampled treatment choices along with corresponding node features are then fed into the feature map and subsequent causal estimators to evaluate the utility function under network interference defined in Eq. (4). Each experiment setting is repeated times until convergence. The hyperparameter in is tuned such that the constraint for the percentage is satisfied within the tolerance . More details of experiment settings and hyperparameters are relegated to Appendix D and E.
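The forward pass of such a policy network and the relaxed treatment sampling can be sketched as follows. This is a NumPy-only illustration under stated assumptions: the weight shapes, the temperature, and the function names `policy_probs` and `gumbel_softmax_binary` are hypothetical, and back-propagating through the relaxed sample would require an automatic differentiation framework, which is not shown here.

```python
import numpy as np

def policy_probs(x, w1, b1, w2, b2):
    """2-layer network: ReLU hidden layer, sigmoid output.

    Returns one treatment probability per node (row of x).
    """
    h = np.maximum(x @ w1 + b1, 0.0)
    logit = h @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logit))

def gumbel_softmax_binary(p, temperature, rng, eps=1e-10):
    """Gumbel-softmax sample of a binary treatment from probabilities p.

    Returns (soft, hard): soft is the relaxed weight of the 'treat' class,
    hard is the corresponding 0/1 treatment choice.
    """
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    # Two-class logits: index 0 = treat, index 1 = do not treat.
    logits = np.stack([np.log(p), np.log1p(-p)], axis=-1)
    u = rng.uniform(eps, 1.0 - eps, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    z = (logits + gumbel) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    soft = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    hard = (soft[..., 0] > 0.5).astype(float)
    return soft[..., 0], hard
```

A lower temperature pushes the soft sample toward the hard 0/1 choice, while a higher temperature gives smoother values; the hard choice follows the policy probabilities regardless of the temperature.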

To quantify the optimized policy , we evaluate the difference , where represents a randomized intervention under the same capacity constraint. The difference indicates how much a learned policy can outperform a randomized policy with the same constraint when evaluated via learned causal effect estimators. However, by its definition, this estimated improvement may be strongly biased: any “expected improvement” may simply stem from inaccurate causal estimators. Hence, for the Wave1 and Pokec datasets, knowing the generating process of treatment and spillover effects, we also compare the actual utility difference .
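The comparison against a randomized policy with the same capacity can be sketched as below. The utility is taken here to be the mean predicted outcome under a treatment assignment, which is a simplifying stand-in for the utility of Eq. (4), and `predict_outcome`, `utility`, and `utility_difference` are hypothetical names. Passing a learned estimator as `predict_outcome` gives the estimated difference; passing the known response-generating process gives the ground-truth difference.

```python
import numpy as np

def utility(predict_outcome, treatment):
    """Mean predicted outcome under a 0/1 treatment vector (assumed utility)."""
    return float(np.mean(predict_outcome(np.asarray(treatment, dtype=float))))

def utility_difference(predict_outcome, learned_treatment, rng, n_random=100):
    """Utility of a learned assignment minus the average utility of random
    assignments that treat the same number of nodes."""
    learned_treatment = np.asarray(learned_treatment, dtype=float)
    n = learned_treatment.size
    k = int(learned_treatment.sum())  # capacity actually used by the policy
    random_utils = []
    for _ in range(n_random):
        t = np.zeros(n)
        t[rng.choice(n, size=k, replace=False)] = 1.0
        random_utils.append(utility(predict_outcome, t))
    return utility(predict_outcome, learned_treatment) - float(np.mean(random_utils))
```

A large gap between the two differences would suggest that the apparent improvement comes from estimator error rather than from a better policy.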

Table 4 displays policy optimization results on the under-treated Wave1 and Pokec simulation datasets, where initially only of nodes are randomly assigned to treatment. The results show that an optimized policy network cannot even outperform a randomized policy in ground truth when the causal estimators perform poorly. Hence, policy networks learned from the utility function with plugged-in doubly-robust or domain adaptation estimators are not reliable. By contrast, the small difference between genuine utility improvement and estimated improvement for the GNN-based causal estimators indicates the reliability of the optimized policy. Moreover, comparing the ground-truth utility improvement for the GPS and GCN-based estimators shows that the policy network depends sensitively on the accuracy of the employed causal estimator. Furthermore, one might argue that with baseline estimators, a simple policy network cannot adjust its treatment choice according to neighboring nodes’ features and responses, unlike with GNN-based estimators. For a fair comparison, in Appendix D, we also provide experimental results using a GNN-based policy network. However, we still cannot observe genuine utility improvements on when using baseline models as causal estimators.

Next, we conduct experiments for intervention policy learning on the over-treated pos Amazon dataset under treatment capacity constraint. Since we do not have access to the ground truth of the pos Amazon dataset, Table 5 shows the utility difference under treatment capacity constraint with evaluated only from learned causal estimators. Although the optimized utility improvement achieves the best result via the GPS causal estimator, it might be unreliable compared to the ground truth. A GNN-based causal estimator is expected to give a more reliable policy improvement with a comparable utility improvement.

5 Conclusion

In this work, we first introduced the task of causal inference under general network interference and proposed causal effect estimators using GNNs of various types. We also defined a novel utility function for policy optimization on interconnected nodes, for which a graph-dependent policy regret bound can be derived theoretically. We conducted experiments on semi-synthetic simulation and real datasets. Experimental results show that GNN-based causal effect estimators, especially GraphSAGE and -GNN, with an HSIC distribution discrepancy penalty are superior in superimposed causal effects prediction, and the individual treatment effect can be recovered reasonably well. Subsequent experiments on intervention policy optimization under capacity constraint further confirm the importance of employing an optimal and reliable causal estimator for policy improvement. In future work, we will consider the scenario in which the network structure is only partially observed, or dynamic.

References

  • D. Arbour, D. Garant, and D. Jensen (2016) Inferring network effects from observational data. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 715–724. Cited by: §1.1.
  • P. M. Aronow, C. Samii, et al. (2017) Estimating average causal effects under general interference, with application to a social network experiment. The Annals of Applied Statistics 11 (4), pp. 1912–1947. Cited by: §1.1, §1.2.
  • S. Athey and S. Wager (2017) Efficient policy learning. arXiv preprint arXiv:1702.02896. Cited by: §1, §3.
  • R. Bhattacharya, D. Malinsky, and I. Shpitser (2019) Causal inference under interference and network uncertainty. arXiv preprint arXiv:1907.00221. Cited by: §1.1.
  • J. Bowers, M. M. Fredrickson, and C. Panagopoulos (2013) Reasoning about interference between units: A general framework. Political Analysis 21, pp. 97–124. External Links: Document Cited by: §1.2, §1.
  • K. Chantala and J. Tabor (1999) National longitudinal study of adolescent health: strategies to perform a design-based analysis using the add health data. Cited by: §4.1.
  • L. Forastiere, E. M. Airoldi, and F. Mealli (2016) Identification and estimation of treatment and interference effects in observational studies on networks. arXiv preprint arXiv:1609.06245. Cited by: §1.1, §1.2.
  • M. J. Funk, D. Westreich, C. Wiesen, T. Stürmer, M. A. Brookhart, and M. Davidian (2011) Doubly robust estimation of causal effects. American journal of epidemiology 173 (7), pp. 761–767. Cited by: §4.2.
  • W. Hamilton, Z. Ying, and J. Leskovec (2017) Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §2.3.
  • J. L. Hill (2011) Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics 20 (1), pp. 217–240. Cited by: §4.2.
  • M. G. Hudgens and M. E. Halloran (2008) Toward causal inference with interference. Journal of the American Statistical Association 103 (482). Cited by: §1.1.
  • E. Jang, S. Gu, and B. Poole (2016) Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144. Cited by: §4.3.
  • F. Johansson, U. Shalit, and D. Sontag (2016) Learning representations for counterfactual inference. In International Conference on Machine Learning, pp. 3020–3029. Cited by: §2.2.
  • N. Kallus and A. Zhou (2018) Confounding-robust policy improvement. In Advances in Neural Information Processing Systems, pp. 9269–9279. Cited by: §1.
  • N. Kallus (2018) Balanced policy evaluation and learning. In Advances in Neural Information Processing Systems, pp. 8895–8906. Cited by: §1.
  • T. N. Kipf and M. Welling (2016) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §2.3.
  • T. Kitagawa and A. Tetenov (2017) Who should be treated? empirical welfare maximization methods for treatment choice. Technical report Cemmap working paper. Cited by: §1, §3.
  • S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu (2019) Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences 116 (10), pp. 4156–4165. Cited by: §4.2.
  • J. Leskovec, L. A. Adamic, and B. A. Huberman (2007) The dynamics of viral marketing. ACM Transactions on the Web (TWEB) 1 (1), pp. 5. Cited by: §4.1.
  • L. Liu and M. G. Hudgens (2014) Large sample randomization inference of causal effects in the presence of interference. Journal of the american statistical association 109 (505), pp. 288–301. Cited by: §1.1.
  • C. F. Manski (2009) Identification for prediction and decision. Harvard University Press. Cited by: §1.
  • C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe (2018) Weisfeiler and leman go neural: higher-order graph neural networks. arXiv preprint arXiv:1810.02244. Cited by: §2.3.
  • E. L. Ogburn, I. Shpitser, and Y. Lee (2018) Causal inference, social networks, and chain graphs. arXiv preprint arXiv:1812.04990. Cited by: §1.1.
  • E. L. Ogburn, O. Sofrygin, I. Diaz, and M. J. van der Laan (2017a) Causal inference for social network data. arXiv preprint arXiv:1705.08527. Cited by: §1.1, §1.2, §2.4.
  • E. L. Ogburn, T. J. VanderWeele, et al. (2017b) Vaccines, contagion, and social networks. The Annals of Applied Statistics 11 (2), pp. 919–948. Cited by: §1.1.
  • V. Rakesh, R. Guo, R. Moraffah, N. Agarwal, and H. Liu (2018) Linked causal variational autoencoder for inferring paired spillover effects. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pp. 1679–1682. Cited by: §4.1.
  • Microsoft Research (2019) EconML: A Python Package for ML-Based Heterogeneous Treatment Effects Estimation. Note: https://github.com/microsoft/EconML. Version 0.x. Cited by: §4.2.
  • D. B. Rubin (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66 (5), pp. 688. Cited by: §1.2.
  • D. B. Rubin (1980) Randomization analysis of experimental data: the fisher randomization test comment. Journal of the American Statistical Association 75 (371), pp. 591–593. Cited by: §1.
  • U. Shalit, F. D. Johansson, and D. Sontag (2017) Estimating individual treatment effect: generalization bounds and algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3076–3085. Cited by: §2.2.
  • E. Sherman and I. Shpitser (2018) Identification and estimation of causal effects from dependent data. In Advances in neural information processing systems, pp. 9424–9435. Cited by: §1.1.
  • J. Splawa-Neyman, D. M. Dabrowska, and T. Speed (1990) On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Statistical Science, pp. 465–472. Cited by: §1.2.
  • L. Takac and M. Zabovsky (2012) Data analysis in public social networks. In International Scientific Conference and International Workshop Present Day Trends of Innovations, Vol. 1. Cited by: §4.1.
  • E. J. T. Tchetgen, I. Fulcher, and I. Shpitser (2017) Auto-g-computation of causal effects on a network. arXiv preprint arXiv:1709.01577. Cited by: §1.1.
  • E. J. T. Tchetgen and T. J. VanderWeele (2012) On causal inference in the presence of interference. Statistical methods in medical research 21 (1), pp. 55–75. Cited by: §1.1.
  • P. Toulis and E. Kao (2013) Estimation of causal peer influence effects. In International conference on machine learning, pp. 1489–1497. Cited by: §1.2, §1, §4.1, §4.1.
  • D. Viviano (2019) Policy targeting under network interference. arXiv preprint arXiv:1906.10258. Cited by: §1.1.