1 Introduction
Predicting large-scale societal events, such as disease outbreaks, organized crime, and civil unrest movements, from social media streams, web logs, and news media is of great significance for decision-making and resource allocation. Previous methods mainly focus on improving the predictive accuracy for a given event type or multiple event types using historical event data [zhao2015spatiotemporal, zhao2016hierarchical]. Recently, to enhance model explainability, many approaches identify salient features or supporting evidence, such as precursor documents [ning2016modeling], relationships represented as graphs [deng2019learning], and major actors participating in the events [10.1145/3394486.3403209]. However, existing work explains the occurrence of events based on correlation-based indicators.
Attempts to study causality in event analysis and prediction have focused on extracting pairs of causal events from unstructured text [radinsky2012learning], or on using human-defined causally related historical events to predict events of interest [radinsky2013mining]. Causal effect learning has shown advantages in improving predictions in various machine learning problems, such as recommender systems [bonner2018causal], disease diagnosis prediction [li2020teaching], and computer vision tasks [chen2021spatial]. This suggests the potential of causal effect learning for better prediction of societal events. Leveraging causal effects can presumably provide new insights into causal-level interpretation and improve the robustness of event prediction, e.g., by making it less susceptible to noise in the data. In this study, we explore societal event forecasting methods with the help of causal effect learning.

Traditionally, learning causal effects (aka treatment effects) from observational data involves estimating the causal effects of a treatment variable (e.g., medication) on an outcome variable (e.g., recovery) given observable covariates (e.g., gender). In practice, there are also unobserved covariates, i.e., hidden confounders, that affect both the treatment and outcome variables. For instance, consider a study to evaluate the effectiveness of a medication. Gender, as a covariate, affects whether a patient chooses to take the medication and the corresponding outcome. The patient’s living habits can be hidden confounders that affect both the patient’s medication and outcome. Exploring hidden confounders allows for more accurate estimations of treatment effects [louizos2017causal, guo2019learning, 10.1145/3437963.3441818].
In this work, we formulate the problem of estimating treatment effects in the context of societal events. Societal events can be classified into different types. Given a time window, we look at multiple types of events (e.g., “appeal”, “investigation”) at a location and define treatment variables to be the detection of increased counts of these events compared to the previous time window. If the sudden and frequent occurrence of such events triggers some event of interest, the implied causal effect can be used to guide and interpret event predictions. We define the outcome as the occurrence of an event of interest (e.g., “protest”) at a future time. Both treatment and outcome variables can be affected by hidden social factors (i.e., hidden confounders) that are difficult to explicitly capture due to complex dependencies. Intuitively, exploring hidden confounders can allow us to estimate causal effects more accurately. To this end, we formulate our main research question as:
can we build a robust event predictive model by incorporating treatment effect estimation with hidden confounder learning? Several challenges arise in solving this problem:
Societal events have geographical characteristics and exhibit a high degree of temporal dependency [zhao2016hierarchical, ning2016modeling, 10.1145/3394486.3403209]. Modeling spatiotemporal information requires an in-depth investigation of the dynamic spatial dependencies of societal events. However, few studies have focused on modeling spatiotemporal dependencies in causal effect learning, which poses a challenge for learning causal effects from societal events.

Events occur in a complex and evolving social environment. Many unknown social factors increase the difficulty of accurately estimating causal effects of events. Moreover, events are often caused by a variety of factors rather than a single determinant. Utilizing causal effects to assist in event prediction is a new challenge.
We address the above challenges by first introducing the task of Individual Treatment Effect (ITE) estimation from societal events. The ITE is defined as the expected difference between the treated outcome and the control outcome, where the outcome is the occurrence of a future event (e.g., protest) at a specific place and time, and the treatment is a change in some event (e.g., appeal) in the past. We consider multiple treatments (e.g., appeal, investigation, etc.) with the motivation that the underlying causes of societal events are often complex. We model the spatiotemporal dependencies in learning the representations of hidden confounders to estimate ITEs. We then present an approach to inject the learned causal information into a data-driven predictive model to improve its predictive power. Our contributions are summarized as follows:

We introduce a novel causal inference model for ITE estimation, which learns the representation of hidden confounders by capturing spatiotemporal dependencies of events in different locations.

We propose two robust learning modules for event prediction that take as prior knowledge the information learned from the causal inference model. Incorporating such modules can enable event prediction models to be more robust to data noise.
We evaluate the proposed method against other state-of-the-art methods on several real-world event datasets. Through extensive experiments, we demonstrate the strengths of the proposed method in treatment effect learning and robust event prediction.
2 Related Work
2.1 Event Prediction
Event prediction focuses on forecasting future events that have not yet happened based on various social indicators, such as event occurrence rates and news reports. Related research has been conducted in various fields and applications, such as election prediction [tumasjan2010predicting, o2010tweets], stock market forecasting [bollen2011twitter], disease outbreak simulation [signorini2011use, achrekar2011predicting], and crime prediction [wang2012automatic]
. Machine learning models such as linear regression [bollen2011twitter] and random forests [kallus2014predicting] were investigated to predict events of interest. Time-series methods such as autoregressive models were studied to capture the temporal evolution of event-related indicators [achrekar2011predicting]. With the increased availability of various data, more sophisticated features have been shown effective in predicting societal events, such as topic-related keywords [zhao2015spatiotemporal], document embeddings [ning2016modeling], word graphs [deng2019learning], and knowledge graphs [10.1145/3394486.3403209, deng2021understanding]. More advanced machine learning and deep learning-based models have emerged, such as multi-instance learning [ning2016modeling], multi-task learning [zhao2015multi, gao2019incomplete], and graph neural networks [deng2019learning, 10.1145/3394486.3403209]. Given the spatiotemporal dependencies of events, some existing research studied spatiotemporal correlations in event prediction [gerber2014predicting, wang2012spatio, zhao2014unsupervised]. However, few studies have explored causality in event prediction. Our proposed model incorporates causal effect learning into a spatiotemporal event prediction framework. This gives us the benefit of discovering the effects of different potential causes on predicting future events.

2.2 Individual Treatment Effect Estimation
Individual treatment effect (ITE) estimation refers to estimating the causal effect of a treatment variable on its outcome. A wealth of observational data facilitates treatment effect estimation in many fields, such as health care [anglemyer2014healthcare], education [gustafsson2013causal], online advertising [sun2015causal], and recommender systems [bonner2018causal]. Several methods have been studied for ITE estimation, including regression and tree-based models [hill2011bayesian, chipman2010bart], counterfactual inference [johansson2016learning], and representation learning [shalit2017estimating]. These approaches rely on the Ignorability assumption [rosenbaum1983central], which is often untenable in real-world studies. A deep latent variable model, CEVAE [louizos2017causal], learns representations of confounders through variational inference. Recent work relaxed the Ignorability assumption and studied ITE estimation from observational data with an auxiliary network structure in a static [guo2019learning] or dynamic environment [10.1145/3437963.3441818]. In addition to traditional causal effect estimation, a new line of causal inference involving multiple treatments and a single outcome has emerged, namely Multiple Causal Inference. Researchers have shown that, compared with traditional causal inference, it requires weaker assumptions [wang2019blessings]. ITE estimation would considerably benefit decision-making as it can provide potential outcomes under different treatment options. Our work introduces ITE estimation to societal event studies and exploits event-related causal information for event forecasting.
2.3 Knowledge Guided Machine Learning
Purely data-driven approaches might lead to unsatisfactory results when limited data are available to train well-performing and sufficiently generalized models. Such models may also violate natural laws or other guidelines [von2019informed]. These problems have led to an increasing amount of research that focuses on incorporating additional prior knowledge into the learning process to improve machine learning models. For example, logic rules [diligenti2017integrating, xu2018semantic] or algebraic equations [karpatne2017physics, stewart2017label, muralidhar2018incorporating] have been added as constraints to loss functions. Knowledge graphs have been adopted to enhance neural networks with information about relations between instances [battaglia2016interaction]. The growth of this research suggests that the combination of data- and knowledge-driven approaches is becoming relevant and showing benefits in a growing number of areas. Existing work has typically focused on pre-existing knowledge obtained from human experts. However, such approaches fail when prior knowledge is not available, e.g., for societal events. Some researchers explored causal knowledge-guided methods in health prediction [li2020teaching] and image-to-video adaptation [chen2021spatial]. In this work, we study causal effects between societal events and use the learned causal information as prior knowledge for event prediction.

Table 1. Important notations.

Notation | Description
$\mathcal{L}$, $\mathcal{T}$, $K$ | sets of locations and timestamps, and the number of event types
$\mathbf{X}_i^{<t}$ | covariates of the $i$-th location before time $t$
$\mathbf{x}_i^t$ | frequencies of events at time $t$ for location $i$
$\mathbf{a}_i^{<t}$ | observed treatment vector of the $i$-th location before time $t$
$a_{i,k}^{<t}$ | observed $k$-th treatment of the $i$-th location before time $t$
$y_{i,k}^{t+\tau}(1), y_{i,k}^{t+\tau}(0)$ | potential outcomes for the $k$-th treatment of the $i$-th location at time $t+\tau$
$\hat{y}_{i,k}^{t+\tau}(1), \hat{y}_{i,k}^{t+\tau}(0)$ | predicted potential outcomes for the $k$-th treatment of the $i$-th location at time $t+\tau$
$\mathbf{G}$ | connectivity of locations
$\mathrm{ITE}_{i,k}^{t}$ | ITE of the $i$-th location at time $t$ for the $k$-th treatment
$\mathbf{z}_{i,k}^{<t}$ | learned hidden confounders of the $i$-th location before time $t$ when the $k$-th treatment is considered
3 Problem Formulation
The objective of this study is twofold: (1) given multiple predefined treatment events (e.g., appeal, investigation, etc.), estimate their causal effects on a target event (i.e., protest) individually; (2) predict the probability of the target event occurring in the future with the help of the estimated causal information. In the following, we introduce the observational event data, individual treatment effect learning, and event prediction.
3.1 Observational Event Data
In this work, we focus on modeling the occurrence of one type of societal event (i.e., “protest”) by exploring the possible effects it might receive from other types of events (e.g., “appeal” and “investigation”). A total of $K$ categories of societal events are studied. These events happen at different locations and times. We use $\mathcal{L}$ and $\mathcal{T}$ to denote the sets of locations and timestamps of interest, respectively. The observational event data can be denoted as $\{\mathbf{X}, \mathbf{A}, \mathbf{Y}, \mathbf{G}\}$, where $\mathbf{X}$, $\mathbf{A}$, and $\mathbf{Y}$ denote the pre-treatment covariates/features, observed treatments, and outcomes, respectively. $\mathbf{G}$ represents the connectivity of locations, where each element can denote a fixed geographic distance or the degree of influence of events between locations. Important notations are presented in Table 1.
Covariates: We define the covariates $\mathbf{X}_i^{<t} = [\mathbf{x}_i^{t-w}, \ldots, \mathbf{x}_i^{t-1}]$ to be the historical events at location $i$ within a window of size $w$ up to time $t$. $\mathbf{x}_i^t$ is a vector representing the frequencies of the $K$ types of events that occurred at location $i$ at time $t$.
Treatments: The treatments can be represented by a binary vector $\mathbf{a}_i^{<t} \in \{0,1\}^K$, where each element indicates the occurrence state of one type of event (e.g., appeal). Specifically, the $k$-th element $a_{i,k}^{<t}$ indicates a notable (i.e., 50%) increase of the $k$-th event type in the window $[t-w, t)$ compared to the previous window $[t-2w, t-w)$. (The comparison of two windows is motivated by studies showing that short-term historical data can lead to favorable performance in event prediction [10.1145/3394486.3403209, jin2020Renet]. The threshold of 50% is selected heuristically; we leave variant treatment settings for future work.)
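To make the treatment definition concrete, the following sketch derives a binary treatment vector from raw event counts. The function name, the window convention, and the handling of event types with zero counts in the previous window are our illustrative assumptions, not part of the paper:

```python
import numpy as np

def treatment_vector(counts, t, w, threshold=0.5):
    """Binary treatment vector for one location.

    counts: (T, K) array of event counts for K event types.
    The k-th treatment is 1 if the count of event type k in the
    window [t - w, t) rose by more than `threshold` (50% by default)
    over the preceding window [t - 2w, t - w).
    """
    recent = counts[t - w:t].sum(axis=0)            # (K,)
    previous = counts[t - 2 * w:t - w].sum(axis=0)  # (K,)
    # Our assumption: an event type absent from the previous window
    # counts as "increased" only if it now occurs at all.
    with np.errstate(divide="ignore", invalid="ignore"):
        growth = np.where(previous > 0, recent / previous - 1.0,
                          np.where(recent > 0, np.inf, 0.0))
    return (growth > threshold).astype(int)

# Example: 3 event types over 6 days, window w = 3
counts = np.array([[1, 0, 4],
                   [1, 2, 4],
                   [1, 0, 4],
                   [2, 0, 4],
                   [2, 1, 4],
                   [2, 0, 4]])
a = treatment_vector(counts, t=6, w=3)  # type 0 doubles, others do not
```

Here the first event type grows from 3 to 6 occurrences (a 100% increase), so only its treatment indicator fires.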
A value of 1 means treated and 0 means controlled. For convenience, we refer to each element of the treatment vector as a treatment event. (Our setup differs from multiple causal inference [wang2019blessings, bica2020time], which estimates the potential outcome of a combination of multiple treatments; we are more interested in studying the potential outcome of each element of the treatment vector.)

Observed Outcome: The observed/factual outcome $y_i^{t+\tau}$ is a binary variable denoting whether an event of interest (i.e., protest) occurs at location $i$ in the future (at time $t+\tau$). $\tau$ is the lead time indicating the number of timestamps in advance for a prediction.

3.2 Individual Treatment Effect Learning
We first define potential outcomes in observational event data following wellstudied causal inference frameworks [rubin1978bayesian, rubin2005causal]. We ignore the location subscript for simplicity unless otherwise stated.
Potential Outcomes: In general, the potential outcome $y^{t+\tau}(a_k)$ denotes the outcome an instance would receive if the instance were to take treatment $a_k$. A potential outcome is distinct from the observed/factual outcome in that not all potential outcomes are observed in the real world. In our problem, there are two potential outcomes for each treatment event. Given a location at time $t$ and the $k$-th treatment event, we denote by $y^{t+\tau}(a_k{=}1)$ the potential outcome (i.e., occurrence of protest) if the $k$-th treatment event is treated, i.e., $a_k = 1$. Similarly, we denote by $y^{t+\tau}(a_k{=}0)$ the potential outcome we would observe if the treatment event is under control, i.e., $a_k = 0$.
The factual outcome is when the location has already received the treatment assignment before time . The counterfactual outcome is defined if the location obtains the opposite treatment assignment. In the observational study, only the factual outcomes are available, while the counterfactual outcomes can never be observed.
The Individual Treatment Effect (ITE) is the difference between the two potential outcomes of an instance, examining whether the treatment affects the outcome of the instance. In observational event data, for the $k$-th treatment event, we formulate the ITE for location $i$ at time $t$ in the form of the Conditional Average Treatment Effect (CATE) [shalit2017estimating, 10.1145/3437963.3441818]:

$$\mathrm{ITE}_{i,k}^{t} = \mathbb{E}\left[\, y_i^{t+\tau}(a_k{=}1) - y_i^{t+\tau}(a_k{=}0) \,\middle|\, \mathbf{X}_i^{<t}, \mathbf{G} \,\right] \quad (1)$$
We provide a toy example to illustrate ITE estimation on observational event data, as shown in Fig. 2.
In this study, we aim to estimate ITEs and then use them for event prediction. The challenge of ITE estimation lies in how to estimate the missing counterfactual outcome. Our estimation of ITE is built upon some essential assumptions. For simplicity and readability, we omit the subscripts for the location and the treatment event and use $a$ to represent the $k$-th treatment event.
Assumption 1.
No Interference. Assuming that one instance is defined as a location at a time in observational event data, the potential outcome on one instance should be unaffected by the particular assignment of treatments on other instances.
Assumption 2.
Consistency. The potential outcome of treatment $a$ equals the observed outcome if the actual treatment received is $a$, i.e., $y^{t+\tau} = y^{t+\tau}(a)$ when $a$ is the observed treatment.
Assumption 3.
Positivity. If the probability $P(\mathbf{X}^{<t}) \neq 0$, then the probability of receiving treatment assignment 0 or 1 is positive, i.e., $0 < P(a{=}1 \mid \mathbf{X}^{<t}) < 1$.
The Positivity assumption indicates that before time , each treatment assignment has a nonzero probability of being given to a location. This assumption is testable in practice. In addition to these assumptions, most existing work [shalit2017estimating, louizos2017causal, wager2018estimation] relies on the Ignorability assumption, which assumes that all confounding variables are observed and reliably measured by a set of features for each instance, i.e., hidden confounders do not exist.
Definition 1.
Ignorability Assumption. Given the pre-treatment covariates $\mathbf{X}^{<t}$, the potential outcome variables are independent of the treatment assignment: $y^{t+\tau}(1),\, y^{t+\tau}(0) \,\perp\, a \mid \mathbf{X}^{<t}$.
However, this assumption is untenable in societal event studies due to the complex environment in which societal events occur. We relax this assumption by introducing the existence of hidden confounders [guo2019learning]. Note that hidden confounders are unobserved in observational event data but will be learned in our approach through a spatiotemporal model. We define a causal graph, as shown in Fig. 1. The hidden confounders $\mathbf{z}$ causally affect the treatment and the outcome. (For the $k$-th treatment event, the hidden confounders can be written as $\mathbf{z}_k$.) The potential outcomes are independent of the observed treatment given the hidden confounders: $y^{t+\tau}(1),\, y^{t+\tau}(0) \,\perp\, a \mid \mathbf{z}$. In addition, we assume the features $\mathbf{X}$ and the connectivity of locations $\mathbf{G}$ are proxy variables for the hidden confounders $\mathbf{z}$; the unobservable hidden confounders can be measured with $\mathbf{X}$ and $\mathbf{G}$. Based on the temporal and spatial characteristics of our observational event data, we introduce the following assumption [10.1145/3437963.3441818]:
Assumption 4.
Spatiotemporal Dependencies in Hidden Confounders. In observational event data, hidden confounders capture spatial information among locations, reflected by $\mathbf{G}$, and show temporal dependencies of events across multiple historical steps (i.e., $t-w, \ldots, t-1$).
Note that this assumption does not contradict the No Interference assumption. We focus on the scenario in which spatiotemporal information can be exploited to control confounding bias.
3.3 Event Prediction
We present traditional event prediction and event prediction with causal knowledge proposed in this work.
Definition 2.
Event Prediction. Learn a classifier $f$ that predicts the probability of the target event occurring at location $i$ at time $t+\tau$ based on available data: $f: (\mathbf{X}_i^{<t}, \mathbf{G}) \rightarrow \hat{p}_i^{t+\tau}$.
Instead of learning a mapping function from input features to event labels, we are interested in estimating treatment effects under different treatment events individually and exploiting such causal information to enhance event prediction.
Definition 3.
Event Prediction with Causal Knowledge. Build an event forecaster using available data with causal information as prior knowledge: $f: (\mathbf{X}_i^{<t}, \mathbf{G}, \Phi(\mathbf{X}, \mathbf{A}, \mathbf{G})) \rightarrow \hat{p}_i^{t+\tau}$, where $\Phi$ is the trained causal inference model that takes the features, multiple treatments, and the connectivity information of locations as input and outputs potential outcomes.
The multiple-treatment setting (i.e., $K > 1$) aims to produce informative causal knowledge to assist event prediction. We discuss the proposed method of event prediction with causal knowledge in the following sections.
4 Methodology
We propose a novel framework, CAPE, which incorporates causal inference into the prediction of future event occurrences in a spatiotemporal environment. (Code is available at https://github.com/amydeng/cape.) In our framework, ITEs with different treatment events are jointly modeled in a spatiotemporal causal inference model, which contributes to the final event prediction by feeding the causal output (e.g., potential outcomes) to a non-causal data-driven prediction model. The overall framework, as illustrated in Fig. 3, consists of two parts: (1) causal inference and (2) event prediction. The causal inference component is designed to estimate the ITE and includes two essential modules: hidden confounder learning and potential outcome prediction. For each treatment event, it learns the representation of hidden confounders by capturing spatiotemporal dependencies and outputs the potential outcomes under different treatment assignments. The event prediction part comprises two robust learning modules: a feature reweighting module and an approximation constraint loss. They take the causal information learned by the causal inference model as prior knowledge to assist the training of a data-driven event prediction model. Next, we elaborate on these components.
4.1 Causal Inference
4.1.1 Hidden Confounder Learning
Hidden confounders are common in real-world observational data [pearl2009causal]. Assuming spatiotemporal dependencies exist in hidden confounders, we introduce a novel and effective network that models spatial and temporal information for each location at each time step. It consists of several temporal feature learning layers and spatial feature learning layers. Our network builds on the success of previous work [oord2016wavenet, dauphin2017language]. It is designed to be adaptable to a multi-task setting to learn hidden confounders of multiple treatments.
Temporal Feature Learning
Dilated causal convolution networks [yu2015multi] handle long sequences in a non-recursive manner, which facilitates parallel computation and alleviates the gradient explosion problem. Gating mechanisms have shown benefits in controlling the information flow through layers of convolution networks [oord2016wavenet, dauphin2017language, wu2019graph]. We employ dilated causal convolutions with a gating mechanism in temporal feature learning to capture a location’s temporal dependencies. For a location $i$ before time $t$, the multivariate time series of historical event occurrences is a matrix $\mathbf{X}_i^{<t}$, where each row indicates the frequency sequence of one type of event in the historical window of size $w$. We use a linear transformation to map the event frequency matrix into a latent space, i.e., $\mathbf{H} = \mathbf{W}\mathbf{X}_i^{<t}$, where $D$ indicates the feature dimension of the latent space. Then, we apply the dilated convolution to the sequence. For simplicity, we use $\mathbf{h} \in \mathbb{R}^{w}$ to denote a row in the matrix $\mathbf{H}$. Formally, for the 1-D sequence input $\mathbf{h}$ and a filter $f \in \mathbb{R}^{J}$, the dilated causal convolution operation on element $s$ of the sequence is defined as:

$$(\mathbf{h} \star_d f)(s) = \sum_{j=0}^{J-1} f(j)\, \mathbf{h}(s - d \times j) \quad (2)$$

where $\star_d$ denotes the dilated convolution with dilation factor $d$, $J$ is the filter size, $s$ and $j$ are indices, and $(\mathbf{h} \star_d f)$ is the output vector.
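A direct, unvectorized transcription of the dilated causal convolution may help fix the indexing. Treating out-of-range history as zero (causal left padding) is our assumption; the function name is illustrative:

```python
import numpy as np

def dilated_causal_conv(h, f, d):
    """Dilated causal convolution of a 1-D sequence h with filter f.

    out[s] = sum_j f[j] * h[s - d*j], where indices before the start
    of the sequence contribute zero, so out[s] never depends on
    future elements of h (causality).
    """
    J = len(f)
    out = np.zeros_like(h, dtype=float)
    for s in range(len(h)):
        for j in range(J):
            idx = s - d * j
            if idx >= 0:
                out[s] += f[j] * h[idx]
    return out

h = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
f = np.array([1.0, 1.0])            # filter of size J = 2
y = dilated_causal_conv(h, f, d=2)  # y[s] = h[s] + h[s-2]
```

With dilation factor $d=2$, each output element mixes the current value with the value two steps earlier, widening the receptive field without extra parameters.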
We further incorporate a gated dilated convolutional layer, which consists of two parallel dilated convolution layers:

$$\mathbf{H}^{(l+1)} = \tanh\!\big(\Theta_1 \star_d \mathbf{H}^{(l)}\big) \odot \sigma\!\big(\Theta_2 \star_d \mathbf{H}^{(l)}\big) \quad (3)$$

where $\Theta_1$ and $\Theta_2$ are filters for the dilated convolutional layers, and $\odot$ is the Hadamard product. $\tanh(\cdot)$ regularizes the features, and $\sigma(\cdot)$ is the sigmoid function that determines the ratio of information passed to the next layer. Specifically, we stack multiple gated dilated convolutional layers (Eq. 3) with increasing dilation factors (e.g., 1, 2, 4, 8). Residual and skip connections are applied to avoid the vanishing gradient problem [oord2016wavenet, wu2019graph]. To this end, the temporal dependencies are captured, and we use $\mathbf{H}_T$ to denote the learned temporal features for all locations at time $t$.

Spatial Feature Learning
Graph convolution is a powerful operation for learning node representations given a graph structure. To capture the spatial dependencies, we adopt the graph convolutional network (GCN) [kipf2016semi] to learn the spatial influence among locations by treating each location as a node in the graph:

$$\mathbf{H}_S = \tilde{\mathbf{A}}\, \mathbf{H}_T\, \mathbf{W}_g \quad (4)$$

where $\mathbf{W}_g$ is the weight matrix for a GCN layer. $\mathbf{H}_S$ denotes the spatiotemporal feature matrix for all locations, where each row captures the historical information of a specific location as well as its neighboring locations. $\tilde{\mathbf{A}}$ is a learnable adjacency matrix. The geographical adjacency matrix of locations usually cannot represent the connectivity of locations in the context of societal event forecasting. Therefore, we adopt the self-adaptive adjacency matrix [wu2019graph], which does not require any prior knowledge and is learned through training. We randomly initialize two node embedding matrices with learnable parameters $\mathbf{E}_1, \mathbf{E}_2 \in \mathbb{R}^{|\mathcal{L}| \times D_e}$. The self-adaptive adjacency matrix is defined as:
$$\tilde{\mathbf{A}} = \mathrm{Softmax}\big(\mathrm{ReLU}(\mathbf{E}_1 \mathbf{E}_2^{\top})\big) \quad (5)$$

where the ReLU activation function eliminates weak connections and the Softmax applies row-wise normalization.
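A minimal numpy sketch of the self-adaptive adjacency computation follows. The embedding dimension, number of locations, and random initialization are illustrative; in the model these embeddings would be trained by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)
L, De = 5, 8                      # number of locations, embedding dim
E1 = rng.normal(size=(L, De))     # learnable source-node embeddings
E2 = rng.normal(size=(L, De))     # learnable target-node embeddings

logits = np.maximum(E1 @ E2.T, 0.0)  # ReLU prunes weak/negative links
# Row-wise softmax normalization (numerically stabilized)
expl = np.exp(logits - logits.max(axis=1, keepdims=True))
A_adapt = expl / expl.sum(axis=1, keepdims=True)

# Each row of A_adapt is a normalized weight distribution over
# locations, so it can serve as the adjacency in a GCN layer.
```

Because every row is non-negative and sums to one, the matrix acts as a learned, directed weighting of neighbor influence rather than a fixed geographic adjacency.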
Hidden Confounder Learning
To learn the representation of hidden confounders, we utilize the spatiotemporal features $\mathbf{H}_S$ and a learnable embedding $\mathbf{e}_k$ specific to each treatment event (i.e., the $k$-th one). It is worth pointing out that the proposed framework includes multiple treatment events and is expected to estimate the ITE corresponding to each treatment event. The treatment-specific embedding aims to capture latent information about each treatment event and to distinguish the hidden confounder representations learned for each treatment effect learning task. Similar ideas of task embeddings have been studied in prior work [vuorio2019multimodal]. Given a location $i$ and a time $t$, the representation of hidden confounders for the $k$-th treatment is:

$$\mathbf{z}_{i,k}^{<t} = \big[\mathbf{H}_S(i)\, \|\, \mathbf{e}_k\big] \quad (6)$$

where $\|$ is concatenation and $\mathbf{H}_S(i)$ is the row of $\mathbf{H}_S$ corresponding to location $i$.
4.1.2 Potential Outcome Prediction
Using the above components, we obtain the representation of hidden confounders $\mathbf{z}_{i,k}^{<t}$. Following the predefined causal graph in Fig. 1, the learned hidden confounders can be used to estimate potential outcomes. We use two networks that output the two potential outcomes of the $k$-th treatment event, respectively:

$$\hat{y}_{i,k}^{t+\tau}(1) = f_k^{1}\big(\mathbf{z}_{i,k}^{<t}\big), \qquad \hat{y}_{i,k}^{t+\tau}(0) = f_k^{0}\big(\mathbf{z}_{i,k}^{<t}\big) \quad (7)$$

where $\hat{y}_{i,k}^{t+\tau}(1)$ and $\hat{y}_{i,k}^{t+\tau}(0)$ denote the inferred potential outcomes when the $k$-th treatment event is treated or controlled, respectively. $f_k^{1}$ and $f_k^{0}$ are parameterized by deep neural networks with a sigmoid function at the last layer. The networks are trained end-to-end, and one can estimate the potential outcomes under multiple treatment events.
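The two-head design can be sketched as follows. For brevity we use single linear layers with a sigmoid as the outcome heads, whereas the paper uses deeper networks; all names and dimensions here are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
Dz = 16                            # confounder representation size
z = rng.normal(size=(4, Dz))       # a batch of hidden confounders

# One pair of outcome heads per treatment event; a single treatment
# event k is shown. Both heads share the same confounder input.
W1 = rng.normal(size=Dz)           # "treated" head weights
W0 = rng.normal(size=Dz)           # "control" head weights

y1_hat = sigmoid(z @ W1)           # potential outcome if a_k = 1
y0_hat = sigmoid(z @ W0)           # potential outcome if a_k = 0
ite_hat = y1_hat - y0_hat          # per-sample ITE estimate
```

During training, only the head matching the observed treatment receives a supervised (factual) signal for each sample; the other head's output is the counterfactual estimate.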
4.1.3 Objective Function
Potential Outcome Loss
We use the binary cross-entropy loss as the objective factual loss for predicting potential outcomes. When only the $k$-th treatment event is considered (i.e., the general case of treatment effect learning [shalit2017estimating, guo2019learning, 10.1145/3437963.3441818]), the factual loss is:

$$\mathcal{L}_{k} = -\sum_{i,t} \Big[\, y_i^{t+\tau} \log \hat{y}_{i,k}^{t+\tau}\big(a_{i,k}^{<t}\big) + \big(1 - y_i^{t+\tau}\big) \log\Big(1 - \hat{y}_{i,k}^{t+\tau}\big(a_{i,k}^{<t}\big)\Big) \Big] \quad (8)$$
where $y_i^{t+\tau}$ is the observed outcome for location $i$ at time $t+\tau$, and $\hat{y}_{i,k}^{t+\tau}(a_{i,k}^{<t})$ is the predicted outcome given the observed treatment $a_{i,k}^{<t}$. Since our model predicts potential outcomes for multiple treatment events, we express the total factual loss as:

$$\mathcal{L}_{y} = \sum_{k=1}^{K} \mathcal{L}_{k} + \gamma \|\Theta\|_{2}^{2} \quad (9)$$

where $\|\Theta\|_{2}^{2}$ stands for the $L_2$-norm regularization over all training parameters and $\gamma$ is the weight scaling the regularization term.
Representation Balancing
Studies have shown that balancing the representations of the treated and control groups helps mitigate confounding bias and minimizes an upper bound on the outcome inference error [johansson2016learning, shalit2017estimating]. Therefore, we incorporate a representation balancing layer to force the distributions of the hidden confounders of the treated and controlled groups to be similar. Specifically, we adopt the integral probability metric (IPM) [shalit2017estimating] to measure the difference between the distributions of the treated and controlled instances in terms of their hidden confounder representations:

$$\mathcal{L}_{\mathrm{IPM}} = \sum_{k=1}^{K} \mathrm{IPM}\Big( \big\{\mathbf{z}_{i,k}^{<t}\big\}_{i:\, a_{i,k}=1},\; \big\{\mathbf{z}_{i,k}^{<t}\big\}_{i:\, a_{i,k}=0} \Big) \quad (10)$$

where the two sets contain the hidden confounder representations of the samples (in a batch) in the treated group and the controlled group, respectively. The IPM can be the Wasserstein or Maximum Mean Discrepancy (MMD) distance. $\beta$ is a hyperparameter that indicates the imbalance penalty.
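MMD is one admissible IPM. The sketch below uses the simplest member of that family, the linear-kernel MMD, which reduces to the squared distance between group means; the kernel choice and any scaling are assumptions on our part:

```python
import numpy as np

def linear_mmd(z_treated, z_control):
    """Squared linear-kernel MMD between two sets of representations:
    the squared Euclidean distance between their means. Driving this
    toward zero pushes the treated and control representation
    distributions to overlap (to first order)."""
    mu_t = z_treated.mean(axis=0)
    mu_c = z_control.mean(axis=0)
    return float(np.sum((mu_t - mu_c) ** 2))

z_t = np.array([[1.0, 0.0], [3.0, 0.0]])   # treated-group confounders
z_c = np.array([[0.0, 0.0], [0.0, 2.0]])   # control-group confounders
gap = linear_mmd(z_t, z_c)                 # (2-0)^2 + (0-1)^2 = 5.0
```

Richer kernels (e.g., RBF) or the Wasserstein distance capture higher-order distributional differences at extra computational cost, which is why they are the variants cited for representation balancing.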
Formally, we present the loss function of the proposed causal inference model as:

$$\mathcal{L}_{\mathrm{CI}} = \mathcal{L}_{y} + \beta\, \mathcal{L}_{\mathrm{IPM}} \quad (11)$$
4.2 Event Prediction with Causal Knowledge
To improve the robustness of event predictions with imperfect realworld data, we incorporate causal information output by the causal inference model as priors to forecast future events. We introduce two robust learning modules into the training of event predictors: (1) feature reweighting, which involves causal information to weight the original input features to obtain causally enhanced features, and (2) approximation constraints, which use the predicted potential outcomes as value range constraints applied to event prediction scores. Next, we introduce these two modules in detail.
4.2.1 Feature Reweighting
Feature reweighting was introduced in object detection [kang2019few], where a reweighting vector is learned to indicate the importance of meta features for detecting objects. Here, we introduce a new feature reweighting method that leverages causal information. We use the ITE estimated from the causal inference model to reweight the event frequency features to predict future events.
Causal Feature Gates
We define a feature gate based on the ITE calculated from the potential outcomes predicted by the causal inference model. For the $k$-th treatment event, the estimated ITE of location $i$ at time $t$ is:

$$\widehat{\mathrm{ITE}}_{i,k}^{t} = \hat{y}_{i,k}^{t+\tau}(1) - \hat{y}_{i,k}^{t+\tau}(0) \quad (12)$$
When considering multiple treatment events, we obtain the ITE vector $\widehat{\mathbf{ITE}}_i^{t} \in \mathbb{R}^{K}$, where each element corresponds to one treatment event. A linear layer with a sigmoid function is then applied to model the association between the effects of different treatment events:

$$\mathbf{g}_i^t = \sigma\big(\mathbf{W}_r\, \widehat{\mathbf{ITE}}_i^{t} + \mathbf{b}_r\big) \quad (13)$$

where $\mathbf{g}_i^t$ are the gating variables that will be applied to the original event frequency features. The sigmoid function converts the gating variables into soft gated signals in the range $(0, 1)$.
Reweighting Feature
We reweight the event frequency features using the gating variables defined above. It is worth emphasizing that the event frequency vector $\mathbf{x}_i^t$ has the same dimension as $\mathbf{g}_i^t$, and their corresponding elements represent the same event type. Nevertheless, we prefer not to apply the gating variables directly to the feature vector: the ITE examines whether a binary treatment variable affects the outcome of an instance, while the event frequency vector contains discrete counts. To address this issue, we transform the event frequency feature into a latent vector using a position-wise feed-forward network (FFN) [vaswani2017attention]. It maps the features into a continuous space, assuming that the gating variables can be aligned with the variables in this space. The formal procedure is defined as:

$$\mathbf{u}_i^t = \mathrm{FFN}\big(\mathbf{x}_i^t\big) \quad (14)$$

$$\tilde{\mathbf{x}}_i^t = \mathbf{g}_i^t \odot \mathbf{u}_i^t + \mathbf{u}_i^t \quad (15)$$

where the parameters of the FFN are learnable. A residual connection is added to ensure that the causally weighted elements still contain some original information. We denote the causality-enhanced features across the $w$ historical steps as $\tilde{\mathbf{X}}_i^{<t}$. Such features are fed into a predictor to perform event prediction, denoted as $f_p$.

4.2.2 Approximation Constraints
The approximation constraints method was proposed to limit the target variable to a reasonable range during model training in order to produce a more robust model [muralidhar2018incorporating]. We follow this idea and propose a new method of integrating the learned causal information into variable constraints. Given an event predictor $f_p$, we denote the model’s event prediction for location $i$ at time $t+\tau$ as $\hat{p}_i^{t+\tau}$. Then, we assume that the causal range of the target variable, i.e., the event prediction, is $[y_{\min}, y_{\max}]$. The sample-wise boundaries are defined as:

$$y_{\min} = \min\big(\mathcal{Y}_i^t\big), \qquad y_{\max} = \max\big(\mathcal{Y}_i^t\big) \quad (16)$$
where $\mathcal{Y}_i^t = \big\{\hat{y}_{i,k}^{t+\tau}(1),\, \hat{y}_{i,k}^{t+\tau}(0)\big\}_{k=1}^{K}$ is the set of predicted potential outcomes for all treatment events. The minimum and maximum values are the lower and upper limits of the target variable for a given sample. Based on the range obtained from causal knowledge, we define a constraint loss term:

$$\mathcal{L}_{c} = \mathrm{ReLU}\big(y_{\min} - \hat{p}_i^{t+\tau}\big) + \mathrm{ReLU}\big(\hat{p}_i^{t+\tau} - y_{\max}\big) \quad (17)$$
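A sample-wise version of this hinge-style range penalty can be sketched as follows. The batching convention and the mean reduction are our assumptions; the paper only specifies the per-sample bounds:

```python
import numpy as np

def constraint_loss(p_hat, y_pot):
    """Hinge penalty keeping each event prediction p_hat inside the
    range spanned by its predicted potential outcomes.

    p_hat: (N,) predicted event probabilities.
    y_pot: (N, 2K) predicted potential outcomes per sample
           (treated/control for each of K treatment events).
    """
    y_min = y_pot.min(axis=1)
    y_max = y_pot.max(axis=1)
    lower = np.maximum(y_min - p_hat, 0.0)   # penalty below the range
    upper = np.maximum(p_hat - y_max, 0.0)   # penalty above the range
    return (lower + upper).mean()

# Two samples, K = 2 treatments -> 4 potential outcomes each
y_pot = np.array([[0.2, 0.6, 0.3, 0.5],
                  [0.1, 0.4, 0.2, 0.3]])
p_hat = np.array([0.7, 0.25])  # first prediction overshoots by 0.1
loss = constraint_loss(p_hat, y_pot)  # approximately (0.1 + 0) / 2
```

Predictions already inside their causal range contribute zero penalty, so the term only activates when the data-driven predictor disagrees with the causal model's plausible outcome range.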
This loss term can be included during the training of the predictor $f_p$. Given the proposed robust learning modules for event prediction, we train the predictor by minimizing the following loss function:

$$\mathcal{L}_{\mathrm{pred}} = \mathcal{L}_{f} + \lambda\, \mathcal{L}_{c} \quad (18)$$

where $\mathcal{L}_{f}$ is the loss function defined by the predictor and $\lambda$ is a hyperparameter. The training steps of the proposed method are shown in Algorithm 1.
5 Experimental Evaluation
The goal of the experimental evaluation is to answer the following research questions: RQ1: How well does CAPE estimate ITEs in observational event data? RQ2: Can CAPE improve the robustness of event prediction models? RQ3: What causal information can we learn from studies of causally related event prediction?
Next, we will describe the experimental setup and then show the experimental results to address the above questions.
5.1 Datasets
Experimental evaluation is conducted on two data sources: the Integrated Conflict Early Warning System (ICEWS) [icews] and the Global Database of Events, Language, and Tone (GDELT) [leetaru2013gdelt]. Both sources contain daily events encoded from news reports (for GDELT, we only select root events identified in news reports). We construct event datasets for four countries, i.e., India, Nigeria, Australia, and Canada, based on their large volume of events. Events are categorized into 20 main categories (e.g., appeal, demand, protest) according to the CAMEO methodology [DVN/28075/SCJPXX_2015]. Each event is encoded with geolocation, time (day, month, year), category, etc. In this work, we focus on predicting one category of events, protest, as the target variable, and use historical data of all event types as feature variables. Data statistics are shown in Table 2, where Positive refers to the proportion of positive samples.
Dataset    #Locations  Positive  Location Unit      Time Span  Time Unit  Source
India      14          30.1%     State              2000-2017  3 days     ICEWS
Nigeria    6           65.7%     Geopolitical zone  2015-2020  1 day      GDELT
Australia  8           44.4%     State              2015-2020  1 day      GDELT
Canada     13          26.8%     State              2015-2020  1 day      GDELT
5.2 Evaluation Metrics
For ITE estimation, since there are no ground-truth counterfactual outcomes, we report the ATT error [shalit2017estimating], i.e., the absolute difference between the estimated average treatment effect on the treated (ATT) and the true ATT computed on a subset of samples that simulates a randomized controlled trial. Specifically, given the treatment event, we employ a 1-nearest-neighbor algorithm [yang2006distance] to find a matching control instance (without replacement) for each treated instance, using the Euclidean distance between feature vectors. The matching process is performed separately for each location.
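As an illustration, the matching step can be sketched as follows; this is a simplified NumPy version for a single location, with function and variable names of our own choosing, assuming at least as many control units as treated units:

```python
import numpy as np

def att_by_matching(X, t, y):
    """Estimate the true ATT via 1-nearest-neighbor matching without
    replacement.

    X: (n, d) feature vectors, t: (n,) binary treatments, y: (n,) outcomes.
    Each treated unit is matched to its closest unused control unit by
    Euclidean distance, mimicking a randomized controlled trial subset.
    """
    treated = np.where(t == 1)[0]
    controls = list(np.where(t == 0)[0])
    diffs = []
    for i in treated:
        # Distance from treated unit i to every remaining control unit
        d = np.linalg.norm(X[controls] - X[i], axis=1)
        j = controls.pop(int(np.argmin(d)))  # match without replacement
        diffs.append(y[i] - y[j])
    return float(np.mean(diffs))
```

The ATT error then compares this matched estimate with the model's average predicted ITE over the treated units.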
We quantify the predictive performance of event prediction with Balanced Accuracy (BACC), i.e., $\mathrm{BACC} = (\mathrm{TPR} + \mathrm{TNR})/2$, where TPR and TNR are the true positive rate and true negative rate, respectively. BACC is an appropriate metric when the classes are imbalanced.
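For concreteness, BACC can be computed with a few lines of plain Python:

```python
def balanced_accuracy(y_true, y_pred):
    """BACC = (TPR + TNR) / 2. Each class contributes equally,
    so a trivial majority-class predictor cannot score well on
    imbalanced data."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)
```

For example, predicting all-positive on a set that is 80% positive attains 80% plain accuracy but only 0.5 BACC, since the true negative rate is zero.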
5.3 Comparative Methods
For ITE estimation, we compare our causal inference model, denoted as CAPE, with two groups of baselines: (i) tree-based methods: Bayesian Additive Regression Trees (BART) [chipman2010bart] and Causal Forest (CF) [wager2018estimation]; (ii) representation learning based methods: Counterfactual Regression with MMD (CFR-MMD) and with the Wasserstein metric (CFR-WASS) [shalit2017estimating], Causal Effect Variational Autoencoder (CEVAE) [louizos2017causal], Network Deconfounder (NetDeconf) [guo2019learning], and Similarity Preserved Individual Treatment Effect (SITE) [yao2018representation].

We also study three variants of our model to examine the impact of its components: (i) a variant that removes the spatial feature learning; (ii) a variant that replaces the temporal feature learning with a simple linear transformation; (iii) a variant that removes the representation balancing loss term.
To evaluate the effectiveness of the proposed robust learning modules in event prediction, we adopt two spatiotemporal models as the predictor: (i) ColaGNN [deng2020cola], a graph-based framework for long-term influenza-like illness prediction; (ii) GWNet [wu2019graph], a state-of-the-art spatiotemporal graph model for traffic prediction. Given the spatiotemporal characteristics of societal event data, these models are well suited to our problem. Note that we do not adopt existing protest event prediction models [deng2019learning, deng2021understanding] because they rely on more complex data, such as text and knowledge graphs. We leave the causal exploration of such complex data to future work.
6 Implementation Details
For the causal inference model, we use three gated temporal convolutional layers with increasing dilation factors and two graph convolutional layers. The dimension of the hidden confounder representation is set to 10. The feature dimensions of all other hidden layers are set to be equal and tuned via grid search. The number of treatment events is 20, where each treatment event corresponds to an event type, such as appeal or protest. Following previous work [deng2019learning, 10.1145/3394486.3403209], we set the historical window size to 7 and the lead time to 1. The hyperparameter used for parameter regularization is fixed to 1e-5. We use the squared linear MMD for representation balancing [shalit2017estimating]; the imbalance penalty is tuned via grid search, as is the scaling term in Eq. 18. All parameters are initialized with Glorot initialization [glorot2010understanding] and trained using the Adam optimizer [kinga2015method] with a grid-searched learning rate and a dropout rate of 0.5. The batch size is set to 64. We use the objective value on the validation set for early stopping.
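Under a linear kernel, the squared MMD between two empirical distributions reduces to the squared Euclidean distance between their means, which is what makes the linear variant cheap to compute. A minimal sketch (shapes and names are our assumptions):

```python
import numpy as np

def squared_linear_mmd(h_treated, h_control):
    """Squared linear MMD between treated and control representations.

    With a linear kernel, the maximum mean discrepancy between two
    empirical distributions is the distance between their means, so
    the squared MMD is ||mean(h_t) - mean(h_c)||^2. Minimizing this
    term pushes the two representation distributions together.
    """
    gap = h_treated.mean(axis=0) - h_control.mean(axis=0)
    return float(gap @ gap)
```

During training, this penalty is weighted by the imbalance hyperparameter and added to the factual-outcome loss.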
For the causal inference baselines, CF (https://rdrr.io/cran/grf/man/causal_forest.html) and BART (https://rdrr.io/cran/BART/) are implemented using R packages. We implement CFR-MMD, CFR-WASS, and SITE ourselves, and use the source code of CEVAE (https://github.com/rikhelwegen/CEVAE_pytorch) and NetDeconf (https://github.com/rguo12/networkdeconfounderwsdm20). We apply parameter search to all baseline models. For the representation learning based approaches, the dimension and number of hidden layers are tuned via grid search. For models that introduce balancing representation learning, the balancing hyperparameter is also tuned via grid search. NetDeconf requires an auxiliary network, for which we use the geographic adjacency matrix of the locations.
For the event forecasting experiments, we run the source code of ColaGNN (https://github.com/amydeng/colagnn) and GWNet (https://github.com/nnzhan/GraphWaveNet). For the event prediction models, we fix the dimension of hidden layers to 32. ColaGNN takes the geographic adjacency matrix as input, while GWNet learns an adaptive adjacency matrix.
We report the average of 5 randomized trials for all experiments. In each trial, we randomly split the data into training, validation, and test sets at a ratio of 70%/15%/15% with a fixed seed value. All code is implemented in Python 3.7.7 and PyTorch 1.5.0 with CUDA 9.2.
7 Experimental Results
7.1 Results of ITE Estimation (RQ1)
Table 3. ITE estimation results (mean and standard deviation of ATT errors) on the India, Nigeria, Australia, and Canada datasets with treatment event Appeal, comparing BART, CF, CFR-MMD, CFR-WASS, CEVAE, SITE, and NetDeconf. Lower is better.

Table 4. ITE estimation results (mean and standard deviation of ATT errors) on the same datasets with treatment event Reject, for the same set of methods. Lower is better.
To evaluate the effectiveness of our proposed causal inference framework, we limit the number of treatment events to one and compare our model with the baselines. We focus on two treatment events: appeal and reject. The motivation is that appeal events might be a potential cause of protest events, as they express a serious or urgent request, typically to the public. Reject events represent verbal conflicts [DVN/28075/SCJPXX_2015], which convey dissatisfaction with the current state and may lead to a future occurrence of protest. Table 3 and Table 4 report the ATT errors of all causal inference models on the four datasets when the treatment variable is appeal and reject, respectively. The results show that the tree-based models perform worse than the representation learning based models, reflecting the limitations of tree-based models and highlighting the benefits of representation learning for estimating ITE from observational event data. CFR-MMD and CFR-WASS learn a balanced representation such that the induced treated and control distributions look similar. Both models achieve good results in most cases, demonstrating the importance of controlling representation distributions when predicting potential outcomes. CEVAE learns latent variables based on variational autoencoders, and SITE focuses on capturing local similarities to estimate ITE. These two models present the most stable and relatively small ATT errors in all settings, suggesting that learning latent variables and considering similarity information is useful for estimating ITE from observational event data. NetDeconf learns hidden confounders by leveraging network/spatial information but does not outperform the representation-based baselines. This may be because the model was designed for semi-synthetic datasets, and the spatial characteristics of observational event data differ from the networks used in the original paper.
Our proposed causal inference framework learns hidden confounders while capturing spatial and temporal information, and achieves the best performance. Among our model variants, we observe that removing the representation balancing worsens the results, and ignoring the temporal or spatial feature learning also degrades them. This reflects the possible spatiotemporal dependencies underlying the hidden confounders and demonstrates the capability of the proposed model to capture the spatiotemporal information of observational event data.
7.2 Robustness Tests in Event Prediction (RQ2)
In this subsection, we perform two robustness tests on event prediction for all datasets and conduct a case study on the proposed feature reweighting module.
7.2.1 Robustness to Test Noise
A model is considered robust if its output remains accurate when one or more input variables change drastically due to unforeseen circumstances. In this setting, we add Poisson noise to the validation and test sets while keeping the training data noise-free. We aim to verify whether our method maintains good prediction performance when the test input features are biased. We vary the rate parameter (i.e., the expectation) of the Poisson distribution from 1 to 25 and provide comparison results for the different noise levels in Fig. 4. We observe that training with the proposed robust learning modules leads to higher average BACC and lower variance over multiple runs. In most cases, the feature reweighting module (+F) contributes more to improving the prediction performance, and incorporating both modules (+F+L) leads to better overall results. The results suggest that forecasting events with learned causal information improves the robustness of the prediction.
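The perturbation itself is straightforward; an illustrative NumPy sketch (function name and seeding are our own) of corrupting count-valued test features with Poisson noise at a given rate:

```python
import numpy as np

def add_poisson_noise(features, rate, seed=0):
    """Add Poisson-distributed noise to count-valued features.

    `rate` is the Poisson rate parameter (its expectation); larger
    rates produce heavier corruption. Only validation/test inputs
    are perturbed; training data stay untouched in this setting.
    """
    rng = np.random.default_rng(seed)
    return features + rng.poisson(lam=rate, size=features.shape)

clean = np.zeros((4, 3), dtype=np.int64)   # toy event-frequency features
noisy = {rate: add_poisson_noise(clean, rate) for rate in (1, 5, 25)}
```

Sweeping the rate from 1 to 25 then reproduces the noise levels compared in Fig. 4.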
7.2.2 Robustness to Training Noise
Human errors or machine failures in real-world data collection usually reduce data accuracy. With this motivation, we assume that only the training data are biased and test whether our method can still achieve decent event prediction results on unbiased test data. As shown in Fig. 5, applying the robust learning modules helps the prediction model achieve better BACC as the noise level increases. Adding the approximation constraint loss (+L) can lead to a higher BACC than adding both modules (Fig. 4(b) and Fig. 4(c)). The results also illustrate that even with biased training data (with corrupted features), the trained causal inference model learns valuable information that contributes to event prediction.
7.2.3 Case Study of Feature Reweighting
To illustrate the functionality of the proposed feature reweighting module in robust event prediction, we provide several examples from the India dataset, as shown in Fig. 6. We use ColaGNN for this analysis, given its more apparent improvements when combined with the feature reweighting module. Specifically, we first train an event prediction model on the India dataset using ColaGNN with the feature reweighting module. We then select four corrupted test samples with random noise added to their input features (noise level of 5) and visualize the original features, the noisy features, and the features obtained from the feature reweighting module. We observe that the reweighted features encode patterns similar to the original features. This highlights the advantage of using ITEs in the feature reweighting module and demonstrates its ability to capture crucial information underlying the data distribution.
7.3 Causal Effect in Societal Events (RQ3)
In our study, the treatment of a location is whether there is a significant increase in a certain type of event (e.g., appeal) over the past window, and the outcome is the future occurrence of a target event, i.e., protest. In this case, the ITE measures the difference in the outcome of protest occurrence between the two treatment scenarios (i.e., increased or not). Thus, when the necessary assumptions hold, it implies a causal effect of the treatment event on protest. A higher ITE suggests that an increase in a treatment event is more influential on the occurrence of future protests than a decrease or no change. To better illustrate the effect of treatment events on future protests, we visualize the predicted ITEs based on Eq. 12. Violin plots for the four datasets are shown in Fig. 7. We select three treatment events for each dataset, with relatively low, moderate, and high ITEs on average, respectively. The results vary across datasets due to different social environments. In India and Australia, massive historical protests may lead to future protests. In Nigeria and Canada, events related to military posture and threats, respectively, are likely to be more dominant factors in future protests. Nevertheless, we can hardly conclude that protests will occur when a treatment event increases substantially, because both types of events can be affected by hidden variables (i.e., unknown social factors). These results can provide supporting evidence for conjectures on protest triggers and generate hypotheses for future experiments.
8 Conclusion and Future Work
Learning the causal effects of societal events is beneficial to decision-making and helps practitioners understand the underlying dynamics of events. In this paper, we introduce a deep learning framework that estimates the causal effects of societal events and predicts societal events simultaneously. We design a novel spatiotemporal causal inference model for estimating ITEs and propose two robust learning modules that use the learned causal information as prior knowledge for societal event prediction. Extensive experiments on several real-world event datasets show that our approach achieves the best results in ITE estimation and robust event prediction. One future direction is to examine other potential causes of event occurrence, such as events with specific themes and potentially biased media coverage.
9 Broader Impacts
This work aims to advance computational social science by investigating causal effects among societal events from observational data. The causal effects among different types of societal events have not been extensively studied. In this work, we provide preliminary results on estimating the individual causal effects of one type of event on another and incorporate this causal information to improve the predictive power of event prediction models. We hope to provide a way to understand human behavior from the societal and causal inference aspects and broaden the possibilities for future work on societal event studies.