Causal Knowledge Guided Societal Event Forecasting

Data-driven societal event forecasting methods exploit relevant historical information to predict future events. These methods rely on historical labeled data and cannot accurately predict events when data are limited or of poor quality. Studying causal effects between events goes beyond correlation analysis and can contribute to a more robust prediction of events. However, incorporating causality analysis in data-driven event forecasting is challenging due to several factors: (i) Events occur in a complex and dynamic social environment. Many unobserved variables, i.e., hidden confounders, affect both potential causes and outcomes. (ii) Given spatiotemporal non-independent and identically distributed (non-IID) data, modeling hidden confounders for accurate causal effect estimation is not trivial. In this work, we introduce a deep learning framework that integrates causal effect estimation into event forecasting. We first study the problem of Individual Treatment Effect (ITE) estimation from observational event data with spatiotemporal attributes and present a novel causal inference model to estimate ITEs. We then incorporate the learned event-related causal information into event prediction as prior knowledge. Two robust learning modules, including a feature reweighting module and an approximate constraint loss, are introduced to enable prior knowledge injection. We evaluate the proposed causal inference model on real-world event datasets and validate the effectiveness of proposed robust learning modules in event prediction by feeding learned causal information into different deep learning methods. Experimental results demonstrate the strengths of the proposed causal inference model for ITE estimation in societal events and showcase the beneficial properties of robust learning modules in societal event forecasting.




1 Introduction

Predicting large-scale societal events, such as disease outbreaks, organized crime, and civil unrest movements, from social media streams, web logs, and news media is of great significance for decision-making and resource allocation. Previous methods mainly focus on improving the predictive accuracy of a given event type or multiple event types using historical event data [zhao2015spatiotemporal, zhao2016hierarchical]. Recently, to enhance model explainability, many approaches identify salient features or supporting evidence, such as precursor documents [ning2016modeling], relationships represented as graphs [deng2019learning], and major actors participating in the events [10.1145/3394486.3403209]. However, existing work explains the occurrence of events based on correlation-based indicators.

Attempts to study causality in event analysis and prediction have focused on extracting pairs of causal events from unstructured text [radinsky2012learning], or on using human-defined causally related historical events to predict events of interest [radinsky2013mining]. Causal effect learning has shown advantages in improving predictions in various machine learning problems, such as recommender systems [bonner2018causal], disease diagnosis prediction [li2020teaching], and computer vision tasks [chen2021spatial]. This suggests the potential of causal effect learning for better prediction of societal events. Leveraging causal effects can presumably provide new insights into causal-level interpretation and improve the robustness of event prediction, e.g., by making it less susceptible to noise in the data. In this study, we explore societal event forecasting methods with the help of causal effect learning.

Traditionally, learning causal effects (aka treatment effects) from observational data involves estimating causal effects of a treatment variable (e.g., medication) on an outcome variable (e.g., recovery) given observable covariates (e.g., gender). In practice, there are also unobserved covariates, i.e., hidden confounders, that affect both treatment and outcome variables. For instance, consider a study to evaluate the effectiveness of a medication. Gender as a covariate affects whether a patient chooses to take the medication and the corresponding outcome. The patient’s living habits can be hidden confounders that affect both the patient’s medication and outcome. Exploring hidden confounders allows for more accurate estimations of treatment effects [louizos2017causal, guo2019learning, 10.1145/3437963.3441818].

In this work, we formulate the problem of estimating treatment effects in the context of societal events. Societal events can be classified into different types. Given a time window, we look at multiple types of events (e.g., “appeal”, “investigation”) at a location and define treatment variables to be the detection of increased counts of these events compared to the previous time window. If the sudden and frequent occurrence of such events triggers some event of interest, the implied causal effect can be used to guide and interpret event predictions. We define the outcome as the occurrence of an event of interest (e.g., “protest”) at a future time. Both treatment and outcome variables can be affected by hidden social factors (i.e., hidden confounders) that are difficult to explicitly capture due to complex dependencies. Intuitively, exploring hidden confounders can allow us to estimate causal effects more accurately. To this end, we formulate our main research question as: can we build a robust event predictive model by incorporating treatment effect estimation with hidden confounder learning? There are some challenges in solving this problem:

  • Societal events have geographical characteristics and exhibit a high degree of temporal dependency [zhao2016hierarchical, ning2016modeling, 10.1145/3394486.3403209]. Modeling spatiotemporal information requires an in-depth investigation of the dynamic spatial dependencies of societal events. However, few studies have focused on modeling spatiotemporal dependencies in causal effect learning, which poses a challenge for learning causal effects from societal events.

  • Events occur in a complex and evolving social environment. Many unknown social factors increase the difficulty of accurately estimating causal effects of events. Moreover, events are often caused by a variety of factors rather than a single determinant. Utilizing causal effects to assist in event prediction is a new challenge.

We address the above challenges by first introducing the task of Individual Treatment Effect (ITE) estimation from societal events. ITE is defined as the expected difference between the treated outcome and control outcome, where the outcome is the occurrence of a future event (e.g., protest) at a specific place and time, and the treatment is a change in some event (e.g., appeal) in the past. We consider multiple treatments (e.g., appeal, investigation, etc.) with the motivation that the underlying causes of societal events are often complex. We model the spatiotemporal dependencies in learning the representations of hidden confounders to estimate ITEs. We then present an approach to inject the learned causal information into a data-driven predictive model to improve its predictive power. Our contributions are summarized as follows:

  • We introduce a novel causal inference model for ITE estimation, which learns the representation of hidden confounders by capturing spatiotemporal dependencies of events in different locations.

  • We propose two robust learning modules for event prediction that take as prior knowledge the information learned from the causal inference model. Incorporating such modules can enable event prediction models to be more robust to data noise.

We evaluate the proposed method against other state-of-the-art methods on several real-world event datasets. Through extensive experiments, we demonstrate the strengths of the proposed method in treatment effect learning and robust event prediction.

2 Related Work

2.1 Event Prediction

Event prediction focuses on forecasting future events that have not yet happened based on various social indicators, such as event occurrence rates and news reports. Related research has been conducted in various fields and applications, such as election prediction [tumasjan2010predicting, o2010tweets], stock market forecasting [bollen2011twitter], disease outbreak simulation [signorini2011use, achrekar2011predicting], and crime prediction [wang2012automatic]. Machine learning models such as linear regression and random forests were investigated to predict events of interest. Time-series methods such as autoregressive models were studied to capture the temporal evolution of event-related indicators [achrekar2011predicting]. With the increased availability of various data, more sophisticated features have been shown effective in predicting societal events, such as topic-related keywords [zhao2015spatiotemporal], document embeddings [ning2016modeling], word graphs [deng2019learning], and knowledge graphs [10.1145/3394486.3403209, deng2021understanding]. More advanced machine learning and deep learning-based models have emerged, such as multi-instance learning [ning2016modeling], multi-task learning [zhao2015multi, gao2019incomplete], and graph neural networks [deng2019learning, 10.1145/3394486.3403209]. Given the spatiotemporal dependencies of events, some existing research has studied spatiotemporal correlations in event prediction [gerber2014predicting, wang2012spatio, zhao2014unsupervised]. However, few studies have explored causality in event prediction. Our proposed model incorporates causal effect learning in a spatiotemporal event prediction framework, which gives us the benefit of discovering the effects of different potential causes on predicting future events.

2.2 Individual Treatment Effect Estimation

Individual treatment effect (ITE) estimation refers to estimating the causal effect of a treatment variable on its outcome. A wealth of observational data facilitates treatment effect estimation in many fields, such as health care [anglemyer2014healthcare], education [gustafsson2013causal], online advertising [sun2015causal], and recommender systems [bonner2018causal]. Several methods have been studied for ITE estimation including regression and tree based model [hill2011bayesian, chipman2010bart], counterfactual inference [johansson2016learning], and representation learning [shalit2017estimating]. The former approaches rely on the Ignorability assumption [rosenbaum1983central], which is often untenable in real-world studies. A deep latent variable model, CEVAE [louizos2017causal] learns representations of confounders through variational inference. Recent work relaxed the Ignorability assumption and studied ITE estimation from observational data with an auxiliary network structure in a static [guo2019learning] or dynamic environment [10.1145/3437963.3441818]. In addition to the traditional causal effect estimation, a new study of causal inference, including multiple treatments and a single outcome, has emerged, namely, Multiple Causal Inference. Researchers have shown that compared with traditional causal inference, it requires weaker assumptions [wang2019blessings]. ITE estimation would considerably benefit decision-making as it can provide potential outcomes with different treatment options. Our work introduces ITE estimation to societal event studies and exploits event-related causal information for event forecasting.

2.3 Knowledge Guided Machine Learning

Purely data-driven approaches might lead to unsatisfactory results when limited data are available to train well-performing and sufficiently generalized models. Such models may also break natural laws or other guidelines [von2019informed]. These problems have led to an increasing amount of research that focuses on incorporating additional prior knowledge into the learning process to improve machine learning models. For example, logic rules [diligenti2017integrating, xu2018semantic] or algebraic equations [karpatne2017physics, stewart2017label, muralidhar2018incorporating] have been added as constraints to loss functions. Knowledge graphs have been adopted to enhance neural networks with information about relations between instances [battaglia2016interaction]. The growth of this research suggests that the combination of data- and knowledge-driven approaches is becoming relevant and showing benefits in a growing number of areas. Existing work has typically focused on pre-existing knowledge obtained from human experts. However, such approaches fail when prior knowledge is not available, e.g., for societal events. Some researchers have explored causal knowledge-guided methods in health prediction [li2020teaching] and image-to-video adaptation [chen2021spatial]. In this work, we study causal effects between societal events and use the learned causal information as prior knowledge for event prediction.

Notation — Description
L, T, K — sets of locations and timestamps, and the number of event types
x_i^{<t} — covariates of the i-th location before time t
x_{i,t} — frequency of events at time t for location i
c_i^{<t} — observed treatment vector of the i-th location before time t
c_{i,k}^{<t} — observed k-th treatment of the i-th location before time t
y_{i,t}^{k,1}, y_{i,t}^{k,0} — potential outcomes for the k-th treatment of the i-th location at time t
ŷ_{i,t}^{k,1}, ŷ_{i,t}^{k,0} — predicted potential outcomes for the k-th treatment of the i-th location at time t
A — connectivity of locations
τ_{i,t}^k — ITE of the i-th location at time t for the k-th treatment
z_{i,k}^{<t} — learned hidden confounders of the i-th location before time t when the k-th treatment is considered
Table 1: Important notations and descriptions.

3 Problem Formulation

The objective of this study is two-fold: (1) given multiple pre-defined treatment events (e.g., appeal, investigation, etc.), estimate their causal effect on a target event (i.e., protest) individually; (2) predict the probability of the target event occurring in the future with the help of estimated causal information. In the following, we will introduce the observational event data, individual treatment effect learning, and event prediction.

3.1 Observational Event Data

In this work, we focus on modeling the occurrence of one type of societal event (i.e., “protest”) by exploring the possible effects it might receive from other types of events (e.g., “appeal” and “investigation”). A total of K categories of societal events are studied. These events happen at different locations and times. We use L and T to denote the sets of locations and timestamps of interest, respectively. The observational event data can be denoted as (X, C, Y, A), where X, C, and Y denote the pre-treatment covariates/features, observed treatments, and outcomes, respectively. A represents the connectivity of locations, where each element can denote a fixed geographic distance or the degree of influence of events between locations. Important notations are presented in Table 1.

Covariates: We define the covariates x_i^{<t} to be the historical events at location i within a window of size w up to time t. x_{i,t} is a vector representing the frequencies of the K types of events that occurred at location i at time t.

Treatments: The treatments can be represented by a binary vector c_i^{<t} of dimension K, where each element indicates the occurrence state of one type of event (e.g., appeal). Specifically, the k-th element indicates a notable (i.e., 50%) increase of the k-th event type in the window [t − w, t) compared to the previous window [t − 2w, t − w).[1] A value of 1 means getting treated and 0 means getting controlled. For convenience, we refer to each element in the treatment vector as a treatment event.[2]

[1] The comparison of two windows is motivated by studies showing that short-term historical data can lead to favorable performance in event prediction [10.1145/3394486.3403209, jin2020Renet]. The threshold of 50% is selected heuristically; we leave variant treatment settings for future work.
[2] Our setup differs from multiple causal inference [wang2019blessings, bica2020time], which estimates the potential outcome of a combination of multiple treatments. We are more interested in studying the potential outcome of each element in the treatment vector.
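As a concrete illustration, the treatment construction above can be sketched as follows. This is a minimal sketch, not the authors' code: the function name, the zero-count handling, and the use of window means are our own assumptions.

```python
import numpy as np

def treatment_vector(counts, t, window=3, threshold=0.5):
    """Binary treatment vector for one location at time t.

    counts: (K, T) array of event frequencies (K event types, T timestamps).
    Element k is 1 if the mean count of event type k in the current window
    [t-window, t) rose by more than `threshold` (50%) over the previous
    window [t-2*window, t-window); otherwise 0.
    """
    cur = counts[:, t - window:t].mean(axis=1)
    prev = counts[:, t - 2 * window:t - window].mean(axis=1)
    # Assumed edge case: an event type absent before but present now is treated.
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(prev > 0, cur / prev, np.where(cur > 0, np.inf, 1.0))
    return (ratio > 1.0 + threshold).astype(int)
```

For example, with two event types over six timestamps, `treatment_vector(np.array([[1, 1, 1, 3, 3, 3], [2, 2, 2, 2, 2, 2]]), 6)` marks the first type as treated (its mean tripled) and the second as controlled.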

Observed Outcome: The observed/factual outcome y_{i,t+τ} is a binary variable denoting whether an event of interest (i.e., protest) occurs at location i in the future (time t + τ). τ is the lead time, indicating the number of timestamps in advance for a prediction.

Figure 1: Causal graph defined for ITE estimation on observational event data. The edges represent causal relations. For a location, we use x, c, z, and y to denote the covariates, the assignment of a treatment event, the representation of hidden confounders before time t, and the outcome at time t, respectively. A denotes the connectivity among locations.

3.2 Individual Treatment Effects Learning

We first define potential outcomes in observational event data following well-studied causal inference frameworks [rubin1978bayesian, rubin2005causal]. We ignore the location subscript for simplicity unless otherwise stated.

Potential Outcomes: In general, the potential outcome y^c denotes the outcome an instance would receive if the instance were to take treatment c. A potential outcome is distinct from the observed/factual outcome in that not all potential outcomes are observed in the real world. In our problem, there are two potential outcomes for each treatment event. Given a location at time t and the k-th treatment event, we denote by y^{k,1} the potential outcome (i.e., occurrence of protest) if the k-th treatment event is treated, i.e., c^k = 1. Similarly, we denote by y^{k,0} the potential outcome we would observe if the treatment event is under control, i.e., c^k = 0.

The factual outcome is the potential outcome corresponding to the treatment assignment the location actually received before time t. The counterfactual outcome is the potential outcome under the opposite treatment assignment. In an observational study, only the factual outcomes are available, while the counterfactual outcomes can never be observed.

The Individual Treatment Effect (ITE) is the difference between the two potential outcomes of an instance, examining whether the treatment affects the outcome of the instance. In observational event data, for the k-th treatment event, we formulate the ITE for location i at time t in the form of the Conditional Average Treatment Effect (CATE) [shalit2017estimating, 10.1145/3437963.3441818]:

    τ_{i,t}^k = E[ y_{i,t}^{k,1} − y_{i,t}^{k,0} | x_i^{<t}, A ]
We provide a toy example to illustrate ITE estimation on observational event data, as shown in Fig. 2.

Figure 2: An example of ITE estimation on observational event data. We use the horizontal axis to indicate time and the bars to indicate the count of events in a location. Taking the Appeal event as the treatment event, the treated and controlled scenarios denote whether the amount of appeal events in the current window (on average) is greater or less than the previous window. The window size is 3. The light orange bars indicate the appeal events of the past window. The outcomes denote whether the protest will occur at a future time given each treatment assignment. ITE is estimated from the two potential outcomes.

In this study, we aim to estimate ITEs and then use them for event prediction. The challenge of ITE estimation lies in how to estimate the missing counterfactual outcome. Our estimation of ITE is built upon some essential assumptions. For simplicity and readability, we omit the subscripts for the location and the treatment event and use c to represent the assignment of the k-th treatment event.

Assumption 1.

No Interference. Defining one instance as a location at a specific time in the observational event data, the potential outcome of an instance is unaffected by the particular assignment of treatments to other instances.

Assumption 2.

Consistency. The potential outcome of treatment c equals the observed outcome if c is the treatment actually received, i.e., y^c = y when the observed treatment assignment is c.

Assumption 3.

Positivity. For any covariate value x with P(x) ≠ 0, the probability of receiving treatment assignment 0 or 1 is positive, i.e., 0 < P(c = 1 | x) < 1.

The Positivity assumption indicates that before time , each treatment assignment has a non-zero probability of being given to a location. This assumption is testable in practice. In addition to these assumptions, most existing work [shalit2017estimating, louizos2017causal, wager2018estimation] relies on the Ignorability assumption, which assumes that all confounding variables are observed and reliably measured by a set of features for each instance, i.e., hidden confounders do not exist.

Definition 1.

Ignorability Assumption. Given the pre-treatment covariates x, the potential outcomes are independent of the treatment assignment, i.e., (y^1, y^0) ⊥ c | x.

However, this assumption is untenable in societal event studies due to the complex environment in which societal events occur. We relax this assumption by introducing the existence of hidden confounders [guo2019learning]. Note that hidden confounders are unobserved in observational event data but will be learned in our approach through a spatiotemporal model. We define a causal graph, as shown in Fig. 1. The hidden confounders z causally affect the treatment and outcome.[3] The potential outcomes are independent of the observed treatment given the hidden confounders: (y^1, y^0) ⊥ c | z. In addition, we assume the features x and the connectivity of locations A are proxy variables for the hidden confounders z; the unobservable hidden confounders can be measured through x and A. Based on the temporal and spatial characteristics of our observational event data, we introduce the following assumption [10.1145/3437963.3441818]:

[3] For the k-th treatment event, the hidden confounders can be written as z^k.

Assumption 4.

Spatiotemporal Dependencies in Hidden Confounders. In observational event data, hidden confounders capture spatial information among locations, reflected by the connectivity A, and show temporal dependencies of events across multiple historical steps (i.e., the hidden confounders before time t depend on events over the preceding window).

Note that this assumption does not contradict the No Interference assumption. We focus on the scenario in which spatiotemporal information can be exploited to control confounding bias.

3.3 Event Prediction

We present traditional event prediction and event prediction with causal knowledge proposed in this work.

Definition 2.

Event Prediction. Learn a classifier f that predicts the probability of the target event occurring at a location at time t + τ based on the available data: f(x, A) → y.

Instead of learning a mapping function from input features to event labels, we are interested in estimating treatment effects under different treatment events individually and exploiting such causal information to enhance event prediction.

Definition 3.

Event Prediction with Causal Knowledge. Build an event forecaster using the available data together with causal information as prior knowledge: f(x, A, Φ) → y, where Φ is the trained causal inference model that takes the features, multiple treatments, and the connectivity information of locations as input and outputs potential outcomes.

The multiple-treatment setting (i.e., K treatment events) aims to produce informative causal knowledge to assist event prediction. We discuss the proposed method of event prediction with causal knowledge in the following sections.

4 Methodology

We propose a novel framework CAPE, which incorporates causal inference into the prediction of future event occurrences in a spatiotemporal environment.[4] In our framework, ITEs with different treatment events are jointly modeled in a spatiotemporal causal inference model, which contributes to the final event prediction by feeding the causal output (e.g., potential outcomes) to a non-causal, data-driven prediction model. The overall framework, illustrated in Fig. 3, consists of two parts: (1) causal inference and (2) event prediction. The causal inference component is designed to estimate the ITE and includes two essential modules: hidden confounder learning and potential outcome prediction. For each treatment event, it learns the representation of hidden confounders by capturing spatiotemporal dependencies and outputs the potential outcomes under different treatment assignments. The event prediction part comprises two robust learning modules, a feature reweighting module and an approximation constraint loss. They take the causal information learned from the causal inference model as prior knowledge to assist the training of a data-driven event prediction model. Next, we elaborate on these components.

[4] Code is available online.

Figure 3: The overall architecture of the proposed framework. The proposed causal inference method learns to estimate ITEs in the presence of multiple treatment events of interest. The representation of hidden confounders is learned using a spatiotemporal model. Then, the vector representation of hidden confounders corresponding to each treatment event is fed into two neural networks (NNs) for individual treatment effect learning. Next, the framework takes the output of the causal inference model as prior knowledge to forecast events.

4.1 Causal Inference

4.1.1 Hidden Confounder Learning

Hidden confounders are common in real-world observational data [pearl2009causal]. Assuming spatiotemporal dependencies exist in hidden confounders, we introduce a novel and effective network that models spatial and temporal information for each location at each time step. It consists of several temporal feature learning layers and spatial feature learning layers. Our network is based on the success of previous work [oord2016wavenet, dauphin2017language]. It is designed to be adaptable to a multi-task setting to learn hidden confounders of multiple treatments.

Temporal Feature Learning

Dilated causal convolution networks [yu2015multi] handle long sequences in a non-recursive manner, which facilitates parallel computation and alleviates the gradient explosion problem. Gating mechanisms have been shown to help control the information flow through the layers of convolutional networks [oord2016wavenet, dauphin2017language, wu2019graph]. We employ dilated causal convolution with a gating mechanism in temporal feature learning to capture a location's temporal dependencies. For a location i before time t, the multivariate time series of historical event occurrences is a matrix x_i^{<t} ∈ R^{K×w}, where each row is the frequency sequence of one type of event in the historical window of size w. We use a linear transformation to map the event frequency matrix into a latent space of feature dimension d. Then, we apply the dilated convolution to the sequence. For simplicity, we use x to denote a row of this matrix. Formally, for a 1-D sequence input x and a filter f, the dilated causal convolution operation on element s of the sequence is defined as:

    (x ⋆_d f)(s) = Σ_{m=0}^{M−1} f(m) · x(s − d·m),

where ⋆_d denotes a d-dilated convolution, M is the filter size, s and m are indices over the sequence, and the result of the operation is the output vector.

We further incorporate a gated dilated convolutional layer, which consists of two parallel dilated convolution layers:

    h = tanh(Θ₁ ⋆ x) ⊙ σ(Θ₂ ⋆ x),

where Θ₁ and Θ₂ are the filters of the two dilated convolutional layers and ⊙ is the Hadamard product. tanh(·) regularizes the features, and σ(·) is the sigmoid function that determines the ratio of information passed to the next layer. Specifically, we stack multiple such gated dilated convolutional layers with increasing dilation factors (e.g., 1, 2, 4, …). Residual and skip connections are applied to avoid the vanishing gradient problem [oord2016wavenet, wu2019graph]. To this end, the temporal dependencies are captured, and we use h_{i,t} to denote the learned temporal features for location i at time t.
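A minimal sketch of the dilated causal convolution and the gated layer described above, in plain NumPy for clarity. The function names and filter values are illustrative, and the tanh/sigmoid split follows the WaveNet-style gating the text cites; this is not the paper's implementation.

```python
import numpy as np

def dilated_causal_conv(x, f, d):
    """1-D dilated causal convolution: out[s] = sum_m f[m] * x[s - d*m].

    x: (T,) input sequence; f: (M,) filter; d: dilation factor.
    Positions that reach before the start of the sequence are zero-padded,
    so out[s] never depends on future inputs (causality).
    """
    T, M = len(x), len(f)
    out = np.zeros(T)
    for s in range(T):
        for m in range(M):
            idx = s - d * m
            if idx >= 0:
                out[s] += f[m] * x[idx]
    return out

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gated_layer(x, f1, f2, d):
    """Gated dilated convolution: a tanh branch modulated by a sigmoid gate."""
    return np.tanh(dilated_causal_conv(x, f1, d)) * sigmoid(dilated_causal_conv(x, f2, d))
```

Stacking such layers with dilations 1, 2, 4, … grows the receptive field exponentially with depth, which is what lets the network cover a long history without recursion.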

Spatial Feature Learning

Graph convolution is a powerful operation for learning representations of nodes given a graph structure. To capture the spatial dependencies, we adopt the graph convolutional network (GCN) [kipf2016semi] to learn the spatial influence among locations by treating each location as a node in a graph:

    H = Ã h W,

where W is the weight matrix of a GCN layer and H denotes the spatiotemporal feature matrix over all locations, in which each row captures the historical information of a specific location as well as its neighboring locations. Ã is a learnable adjacency matrix. The geographical adjacency matrix of locations usually cannot represent the connectivity of locations in the context of societal event forecasting. Therefore, we adopt the self-adaptive adjacency matrix [wu2019graph], which does not require any prior knowledge and is learned through training. We randomly initialize two node embedding matrices with learnable parameters E₁ and E₂. The self-adaptive adjacency matrix is defined as:

    Ã = Softmax(ReLU(E₁ E₂ᵀ)),

where the ReLU activation function eliminates weak connections and the Softmax applies row-wise normalization.
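The self-adaptive adjacency construction can be sketched as follows. In the paper the embeddings are learned during training; here they are fixed arrays, and the function name is ours.

```python
import numpy as np

def self_adaptive_adjacency(E1, E2):
    """A_adp = Softmax(ReLU(E1 @ E2.T)), with row-wise softmax.

    E1, E2: (N, c) node embedding matrices (learnable parameters in the
    paper; passed in as plain arrays here for illustration).
    """
    logits = np.maximum(E1 @ E2.T, 0.0)          # ReLU prunes weak/negative links
    logits -= logits.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)      # each row is a normalized distribution
```

Because every entry is produced by trainable embeddings rather than geographic distance, the matrix can discover event-influence links between locations that are far apart on a map.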

Hidden Confounder Learning

To learn the representation of hidden confounders, we utilize the spatiotemporal features and a learnable embedding specific to each treatment event (i.e., e^k). It is worth pointing out that in the proposed framework, we include multiple treatment events and expect to estimate the ITE corresponding to each treatment event. The treatment-specific embedding aims to capture latent information about each treatment event and to distinguish the hidden confounder representations learned for each treatment effect learning task. Similar ideas of task embeddings have been studied in prior work [vuorio2019multimodal]. Given a location i and a time t, the representation of hidden confounders for the k-th treatment is:

    z_{i,t}^k = h_{i,t} ⊕ e^k,

where ⊕ is concatenation.

4.1.2 Potential Outcome Prediction

Using the above components, we obtain the representation of hidden confounders z^k. Following the predefined causal graph in Fig. 1, the learned hidden confounders can be used to estimate potential outcomes. We use two networks that output the two potential outcomes of the k-th treatment event, respectively:

    ŷ^{k,1} = f₁(z^k),    ŷ^{k,0} = f₀(z^k),

where ŷ^{k,1} and ŷ^{k,0} denote the inferred potential outcomes when the k-th treatment event is treated or controlled, respectively. f₁ and f₀ are parameterized by deep neural networks with a sigmoid function at the last layer. The networks are trained end-to-end, and one can estimate the potential outcomes under multiple treatment events.
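A toy sketch of the two outcome heads. Each head is reduced to a single linear layer plus sigmoid for brevity (the paper uses deeper networks), and all names here are illustrative.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

class OutcomeHeads:
    """Two outcome networks f1 (treated) and f0 (control) that share the
    hidden-confounder representation z as input. One linear layer plus a
    sigmoid per head; a stand-in for the deeper networks in the paper."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w1, self.b1 = rng.standard_normal(dim), 0.0
        self.w0, self.b0 = rng.standard_normal(dim), 0.0

    def __call__(self, z):
        y1 = sigmoid(z @ self.w1 + self.b1)  # potential outcome if treated
        y0 = sigmoid(z @ self.w0 + self.b0)  # potential outcome if controlled
        return y1, y0
```

Keeping two separate heads over a shared representation is the standard way to read off both potential outcomes for the same instance, which is exactly what the ITE difference needs.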

4.1.3 Objective Function

Potential Outcome Loss

We use the binary cross-entropy loss as the objective factual loss for predicting potential outcomes. When only the k-th treatment event is considered (i.e., the general case for treatment effect learning [shalit2017estimating, guo2019learning, 10.1145/3437963.3441818]), the factual loss is:

    L_k = − (1 / (|L||T|)) Σ_{i,t} [ y_{i,t} log ŷ_{i,t} + (1 − y_{i,t}) log(1 − ŷ_{i,t}) ],

where y_{i,t} is the observed outcome for location i at time t and ŷ_{i,t} is the predicted outcome given the observed treatment. Since our model predicts potential outcomes for multiple treatment events, we express the total factual loss as follows:

    L_factual = Σ_{k=1}^{K} L_k + λ ‖W‖₂,

where ‖W‖₂ stands for the ℓ₂-norm regularization over all training parameters and λ is the weight scaling the regularization term.
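The factual loss can be sketched as follows in a minimal NumPy version. The clipping constant, parameter interface, and default λ are our own assumptions for a self-contained sketch.

```python
import numpy as np

def factual_loss(y_obs, y_pred, params=None, lam=1e-4):
    """Binary cross-entropy over observed (factual) outcomes, plus an
    optional squared-L2 penalty on model parameters.

    y_obs:  observed binary outcomes.
    y_pred: predicted potential outcomes selected by the observed
            treatment assignment (only the factual branch is supervised).
    """
    eps = 1e-12                                   # assumed clip to keep log finite
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    bce = -np.mean(y_obs * np.log(y_pred) + (1.0 - y_obs) * np.log(1.0 - y_pred))
    reg = lam * sum(np.sum(p ** 2) for p in params) if params else 0.0
    return bce + reg
```

Only the factual branch of each outcome head receives a gradient here; the counterfactual branch is shaped indirectly, through the shared representation and the balancing term described next.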

Representation Balancing

Studies have shown that balancing the representations of the treated and control groups helps mitigate confounding bias and minimizes the upper bound of the outcome inference error [johansson2016learning, shalit2017estimating]. Therefore, we incorporate a representation balancing layer that forces the distributions of the hidden confounders of the treated and control groups to be similar. Specifically, we adopt the integral probability metric (IPM) [shalit2017estimating] to measure the difference between the distributions of the treated instances and the control instances in terms of their hidden confounder representations:

$$\mathcal{L}_{\mathrm{IPM}} = \mathrm{IPM}\big(\{\mathbf{z}_{l,t}\}_{a=1},\, \{\mathbf{z}_{l,t}\}_{a=0}\big),$$

where $\{\mathbf{z}_{l,t}\}_{a=1}$ and $\{\mathbf{z}_{l,t}\}_{a=0}$ indicate the sets of hidden confounders for the samples (in a batch) in the treated group and the control group, respectively. The IPM can be the Wasserstein or Maximum Mean Discrepancy (MMD) distance. A hyperparameter $\alpha$ scales this imbalance penalty.
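In our implementation we use the squared linear MMD (see the implementation details), which reduces to the squared Euclidean distance between the mean embeddings of the two groups. A minimal sketch, with a helper name of our own:

```python
def squared_linear_mmd(treated_reps, control_reps):
    """Squared linear MMD: the squared Euclidean distance between the mean
    hidden-confounder representations of the treated and control groups."""
    def mean(group, d):
        return sum(z[d] for z in group) / len(group)

    dim = len(treated_reps[0])
    return sum((mean(treated_reps, d) - mean(control_reps, d)) ** 2
               for d in range(dim))
```

The penalty is zero when the group means coincide and grows as the two representation clouds drift apart.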

Formally, we present the loss function of the proposed causal inference model as:

$$\mathcal{L} = \mathcal{L}_{\mathrm{factual}} + \alpha\,\mathcal{L}_{\mathrm{IPM}}.$$
4.2 Event Prediction with Causal Knowledge

To improve the robustness of event predictions with imperfect real-world data, we incorporate causal information output by the causal inference model as priors to forecast future events. We introduce two robust learning modules into the training of event predictors: (1) feature reweighting, which involves causal information to weight the original input features to obtain causally enhanced features, and (2) approximation constraints, which use the predicted potential outcomes as value range constraints applied to event prediction scores. Next, we introduce these two modules in detail.

4.2.1 Feature Reweighting

Feature reweighting was introduced in object detection [kang2019few], where a reweighting vector is learned to indicate the importance of meta features for detecting objects. Here, we introduce a new feature reweighting method that leverages causal information. We use the ITE estimated from the causal inference model to reweight the event frequency features to predict future events.

Causal Feature Gates

We define a feature gate based on the ITE calculated using the predicted potential outcomes from the causal inference model. For the $k$-th treatment event, the estimated ITE of a location $l$ at time $t$ is as follows:

$$\widehat{\mathrm{ITE}}^{k}_{l,t} = \hat{y}^{k,1}_{l,t} - \hat{y}^{k,0}_{l,t}.$$
When considering multiple treatment events, we obtain the ITE vector $\widehat{\mathbf{ITE}}_{l,t} \in \mathbb{R}^{K}$, where each element corresponds to one treatment event. A linear layer with a sigmoid function is then applied to model the association between the effects of different treatment events:

$$\mathbf{g}_{l,t} = \sigma\big(W_g\,\widehat{\mathbf{ITE}}_{l,t} + \mathbf{b}_g\big),$$

where $\mathbf{g}_{l,t}$ contains the gating variables that will be applied to the original event frequency features. The sigmoid function converts each gating variable into a soft gated signal with a range of $(0,1)$.

Reweighting Feature

We reweight the event frequency features using the gating variables defined above. It is worth emphasizing that the event frequency vector $\mathbf{x}_{l,t} \in \mathbb{R}^{K}$ has the same dimension as $\mathbf{g}_{l,t}$, and their corresponding elements represent the same event type. Nevertheless, we prefer not to apply the gating variables directly to the feature vector: the ITE examines whether a binary treatment variable affects the outcome of an instance, while the event frequency vector contains discrete counts. To address this mismatch, we transform the event frequency feature into a latent vector using a position-wise feed-forward network (FFN) [vaswani2017attention]. It maps the features into a continuous space, assuming that the gating variables can be aligned with the variables in this space. The formal procedure is defined as follows:

$$\mathbf{h}_{l,t} = \mathrm{FFN}\big(\mathbf{x}_{l,t}\big), \qquad \tilde{\mathbf{x}}_{l,t} = \mathbf{g}_{l,t} \odot \mathbf{h}_{l,t} + \mathbf{h}_{l,t},$$

where the parameters of the FFN are learnable and $\odot$ denotes element-wise multiplication. The residual connection ensures that the causally weighted elements still contain some original information. We denote the causality-enhanced features across historical steps as $\tilde{X}_{l,t}$. Such features are fed into a predictor $f_p$ to perform event prediction.
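The gating and reweighting steps above can be sketched in plain Python; the weight matrix, bias, and FFN here are caller-supplied placeholders standing in for learned parameters, not the model's actual weights:

```python
import math

def sigmoid(v):
    """Logistic function: maps a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))

def causal_feature_gate(ite_vec, weights, bias):
    """Linear layer + sigmoid over the ITE vector, yielding soft gates in (0, 1)."""
    return [sigmoid(sum(w * ite for w, ite in zip(row, ite_vec)) + b)
            for row, b in zip(weights, bias)]

def reweight_features(freq_vec, gates, ffn):
    """Map raw event counts into a latent space with `ffn`, gate the result,
    and add a residual connection so original information is preserved."""
    latent = ffn(freq_vec)
    return [g * h + h for g, h in zip(gates, latent)]
```

With an identity FFN and zero gate weights, every gate is 0.5 and each latent feature is scaled by 1.5 through the residual path, so the output never collapses to zero.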

4.2.2 Approximation Constraints

The approximation constraints method was proposed to limit the target variable to a reasonable range during model training so as to produce a more robust model [muralidhar2018incorporating]. We follow this idea and propose a new method of integrating the learned causal information into variable constraints. Given an event predictor $f_p$, we denote the model's event prediction for location $l$ at time $t$ as $\hat{e}_{l,t}$. We then assume that the causal range of the target variable, i.e., the event prediction, is $[y^{\min}_{l,t}, y^{\max}_{l,t}]$. The sample-wise boundaries are defined as:

$$y^{\min}_{l,t} = \min\big(\mathcal{Y}_{l,t}\big), \qquad y^{\max}_{l,t} = \max\big(\mathcal{Y}_{l,t}\big),$$

where $\mathcal{Y}_{l,t}$ is the set of predicted potential outcomes for all treatment events. The minimum and maximum values are the lower and upper limits of the target variable for a given sample. Based on the range obtained from causal knowledge, we define a constraint loss term:

$$\mathcal{L}_c = \frac{1}{|\mathcal{D}|}\sum_{(l,t)\in\mathcal{D}} \Big[ \mathrm{ReLU}\big(y^{\min}_{l,t} - \hat{e}_{l,t}\big) + \mathrm{ReLU}\big(\hat{e}_{l,t} - y^{\max}_{l,t}\big) \Big].$$

This loss term can be included during the training of the predictor $f_p$. Given the proposed robust learning modules for event prediction, we train the predictor by minimizing the following loss function:

$$\mathcal{L}_{\mathrm{pred}} = \mathcal{L}_p + \gamma\,\mathcal{L}_c,$$

where $\mathcal{L}_p$ is the loss function defined by the predictor and $\gamma$ is a hyperparameter. The training steps of the proposed method are shown in Algorithm 1.
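A minimal sketch of the range constraint in plain Python; the function names and the default weight `gamma=0.1` are illustrative choices, not the tuned setting:

```python
def constraint_loss(pred, potential_outcomes):
    """Hinge (ReLU-style) penalty for a prediction outside the range spanned
    by the predicted potential outcomes of all treatment events."""
    lo, hi = min(potential_outcomes), max(potential_outcomes)
    return max(0.0, lo - pred) + max(0.0, pred - hi)

def combined_loss(prediction_loss, preds, outcome_sets, gamma=0.1):
    """Predictor loss plus the gamma-weighted mean constraint penalty."""
    penalty = sum(constraint_loss(p, s)
                  for p, s in zip(preds, outcome_sets)) / len(preds)
    return prediction_loss + gamma * penalty
```

Predictions inside the causal range incur no penalty, so the constraint only activates when the predictor disagrees with the causal inference model's plausible range.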

Input: Observational event data $\mathcal{D}$; a predictor $f_p$ with randomly initialized parameters; the initialized model CAPE, including the causal inference model $f_c$ and the robust learning modules for event prediction.
// Train the causal inference model $f_c$.
1 while $f_c$ has not converged do
2     Input $\mathcal{D}$ to $f_c$ to obtain predicted potential outcomes;
      Calculate the loss;
      Update $f_c$ by optimizing Eq. 11.
3 Freeze the causal inference model $f_c$.
// Train the predictor $f_p$ and the robust learning modules.
4 while $f_p$ and the robust modules have not converged do
5     Input $\mathcal{D}$ to $f_c$ to obtain predicted potential outcomes;
      Calculate the causally reweighted features via the feature reweighting module using the predicted potential outcomes;
      Input the reweighted features to the predictor $f_p$ to obtain the event prediction;
      Calculate the prediction loss;
      Calculate the constraint loss using the predicted potential outcomes;
      Update the predictor and the robust modules by optimizing Eq. 18.
Output: the updated model CAPE.
Algorithm 1 CAPE

5 Experimental Evaluation

The goal of the experimental evaluation is to answer the following research questions: RQ1: How well does CAPE estimate ITEs in observational event data? RQ2: Can CAPE improve the robustness of event prediction models? RQ3: What causal information can we learn from studies of causally related event prediction?

Next, we will describe the experimental setup and then show the experimental results to address the above questions.

5.1 Datasets

Experimental evaluation is conducted on two data sources: the Integrated Conflict Early Warning System (ICEWS) [icews] and the Global Database of Events, Language, and Tone (GDELT) [leetaru2013gdelt]. These two data sources include daily events encoded from news reports. (For event data from GDELT, we only select root events identified in news reports.) We construct event datasets for four countries, i.e., India, Nigeria, Australia, and Canada, based on their large volume of events. Events are categorized into 20 main categories (e.g., appeal, demand, protest, etc.) according to the CAMEO methodology [DVN/28075/SCJPXX_2015]. Each event is encoded with geolocation, time (day, month, year), category, etc. In this work, we focus on predicting one category of events, protest, as the target variable, and use the historical data of all event types as feature variables. Data statistics are shown in Table 2, where Positive refers to the proportion of positive samples.

Dataset   | #Locations | Positive | Location          | Time      | Time Unit | Source
India     | 14         | 30.1%    | State             | 2000-2017 | 3 days    | ICEWS
Nigeria   | 6          | 65.7%    | Geopolitical zone | 2015-2020 | 1 day     | GDELT
Australia | 8          | 44.4%    | State             | 2015-2020 | 1 day     | GDELT
Canada    | 13         | 26.8%    | State             | 2015-2020 | 1 day     | GDELT
Table 2: Dataset statistics. #Locations is the number of locations in each dataset. Positive indicates the ratio of positive samples, i.e., samples in which a protest event has occurred. For India, we select the top locations based on the total number of events. Location represents the geographical level of events.

5.2 Evaluation Metrics

For the ITE estimation, since there are no ground-truth counterfactual outcomes, we report the error on the average treatment effect on the treated, $\epsilon_{\mathrm{ATT}} = |\mathrm{ATT} - \widehat{\mathrm{ATT}}|$ [shalit2017estimating], where the true ATT is computed on a subset of samples that simulates a randomized controlled trial. Specifically, given the treatment event, we employ a 1-nearest-neighbor algorithm [yang2006distance] to find a matching control instance (without replacement) for each treated instance, with the Euclidean distance adopted to compare feature vectors. The matching process is performed for each location.
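The matching procedure can be sketched as follows; the greedy processing order and helper names are our own simplifications of the setup described above:

```python
import math

def match_controls(treated, controls):
    """Greedy 1-nearest-neighbor matching without replacement: each treated
    feature vector is paired with its closest remaining control instance
    under the Euclidean distance."""
    remaining = list(range(len(controls)))
    pairs = []
    for i, x in enumerate(treated):
        nearest = min(remaining, key=lambda j: math.dist(x, controls[j]))
        remaining.remove(nearest)  # without replacement
        pairs.append((i, nearest))
    return pairs
```

Because matching is without replacement, each control instance is used at most once, mimicking a paired randomized design.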

We quantify the predictive performance of event prediction using Balanced Accuracy (BACC), i.e., $\mathrm{BACC} = (\mathrm{TPR} + \mathrm{TNR})/2$, where TPR and TNR are the true positive rate and true negative rate, respectively. BACC is an appropriate metric when the classes are imbalanced.
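A minimal BACC computation for binary labels:

```python
def balanced_accuracy(y_true, y_pred):
    """BACC = (TPR + TNR) / 2 for binary labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)
```

Unlike plain accuracy, a trivial all-negative predictor scores only 0.5 under BACC regardless of class imbalance.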

5.3 Comparative Methods

For the ITE estimation, we compare our causal inference model with two groups of baselines: (i) tree based methods: Bayesian Additive Regression Trees (BART) [chipman2010bart] and Causal Forest (CF) [wager2018estimation]; (ii) representation learning based methods: Counterfactual Regression with the MMD metric (CFR-MMD) [shalit2017estimating] and the Wasserstein metric (CFR-WASS) [shalit2017estimating], Causal Effect Variational Autoencoder (CEVAE) [louizos2017causal], Network Deconfounder (Net-Deconf) [guo2019learning], and Similarity Preserved Individual Treatment Effect (SITE) [yao2018representation].

We study three variants of our model to examine the impact of its different components: (i) a variant that removes the spatial feature learning; (ii) a variant that replaces the temporal feature learning with a simple linear transformation; and (iii) a variant that removes the representation balancing loss term.

To evaluate the effectiveness of the proposed robust learning modules in event prediction, we adopt two spatiotemporal models as the predictor $f_p$: (i) Cola-GNN [deng2020cola]: a graph-based framework for long-term influenza-like illness prediction; (ii) GWNet [wu2019graph]: a state-of-the-art spatiotemporal graph model for traffic prediction. Given the spatiotemporal characteristics of societal event data, these models can be well applied to our problem. Note that we do not adopt existing protest event prediction models [deng2019learning, deng2021understanding] because they model more complex data, such as text and knowledge graphs. We leave the causal exploration of such complex data to future work.

6 Implementation Details

For the causal inference model, we use three gated temporal convolutional layers with increasing dilation factors and two graph convolutional layers. The dimension of the hidden confounder representation is set to 10. The feature dimensions of all other hidden layers are set to be equal and tuned via grid search.

The number of treatment events is 20, where each treatment event corresponds to an event type, such as appeal or protest. Following previous work [deng2019learning, 10.1145/3394486.3403209], we set the historical window size to 7 and the lead time to 1. The hyperparameter used for parameter regularization is fixed to 1e-5. We use the squared linear MMD for representation balancing [shalit2017estimating]. The imbalance penalty $\alpha$ and the scaling term $\gamma$ in Eq. 18 are tuned via grid search. All parameters are initialized with Glorot initialization [glorot2010understanding] and trained using the Adam optimizer [kinga2015method] with a tuned learning rate and a dropout rate of 0.5. The batch size is set to 64. We use the objective value on the validation set for early stopping.

For the causal inference baselines, CF and BART are implemented using R packages. We implement the causal inference models CFR-MMD, CFR-WASS, and SITE ourselves and use the released source code of CEVAE and Net-Deconf. We apply parameter search to all baseline models. For the representation learning based approaches, the dimension and the number of hidden layers are tuned via grid search. For models that introduce balancing representation learning, we also search the imbalance penalty hyperparameter. The model Net-Deconf involves an auxiliary network, for which we use the geographic adjacency matrix for locations.

For the experiments on event forecasting, we run the released source code of Cola-GNN and GWNet. For the event prediction models, we fix the dimension of hidden layers to 32. Cola-GNN takes the geographic adjacency matrix as input, and GWNet learns an adaptive adjacency matrix.

We report the average of 5 randomized trials for all experiments. In each trial, we randomly split the data into training, validation, and test sets at a ratio of 70%-15%-15% with a fixed seed value. All code is implemented in Python 3.7.7 and PyTorch 1.5.0 with CUDA 9.2.

7 Experimental Results

7.1 Results of ITE Estimation (RQ1)

Table 3: ITE estimation results showing the mean and standard deviation of ATT errors on all datasets (India, Nigeria, Australia, Canada) with the treatment event being Appeal. Lower is better.

Table 4: ITE estimation results showing the mean and standard deviation of ATT errors on all datasets (India, Nigeria, Australia, Canada) with the treatment event being Reject. Lower is better.

To evaluate the effectiveness of our proposed causal inference framework, we limit the number of treatment events to one and compare our model with the baselines. We focus on two treatment events: appeal and reject. The motivation is that appeal events might be a potential cause of protest events, as they express a serious or urgent request, typically to the public. Reject events represent verbal conflicts [DVN/28075/SCJPXX_2015], which convey dissatisfaction with the current state and may lead to a future occurrence of protest. Table 3 and Table 4 report the ATT errors of all causal inference models on the four datasets when the treatment variable is set to appeal and reject, respectively. The results show that the tree-based models perform worse than the representation learning based models. These findings reflect the limitations of tree-based models and highlight the benefits of representation learning for estimating ITEs from observational event data. CFR-MMD and CFR-WASS learn a balanced representation such that the induced treated and control distributions look similar. Both models achieve good results in most cases, demonstrating the importance of controlling representation distributions when predicting potential outcomes. CEVAE learns latent variables based on variational autoencoders, and SITE focuses on capturing local similarities to estimate ITEs. These two models present the most stable and relatively small ATT errors in all settings, suggesting that learning latent variables and considering similarity information is useful for estimating ITEs from observational event data. The model Net-Deconf learns hidden confounders by leveraging network/spatial information. However, it does not outperform the representation-based baselines. This may be because the model was designed for semi-synthetic datasets, and the spatial characteristics of observational event data differ from the networks used in the original paper.
Our proposed causal inference framework learns hidden confounders while capturing spatial and temporal information and achieves the best performance. For our model variants, we observe that removing the representation balancing makes the results worse. Ignoring the temporal or spatial feature learning can also deteriorate the results. This reflects the possible spatiotemporal dependencies underlying the hidden confounders. It also demonstrates the capability of the proposed model in capturing the spatiotemporal information of the observational event data.

7.2 Robustness Tests in Event Prediction (RQ2)

In this subsection, we perform two robustness tests on event prediction for all datasets and conduct a case study on the proposed feature reweighting module.

7.2.1 Robustness to Test Noise

A model is considered robust if its output remains consistently accurate when one or more input variables change drastically due to unforeseen circumstances. In this setting, we add Poisson noise to the validation and test sets while keeping the training data noise-free. We aim to verify whether our method guarantees good prediction performance when the test input features are biased. We vary the rate parameter (i.e., the expectation) of the Poisson distribution from 1 to 25 and provide comparison results for the different noise levels in Fig. 4. We notice that training with the proposed robust learning modules leads to higher average BACC and lower variance over multiple runs. In most cases, the feature reweighting module (+F) contributes more to improving the prediction performance. Incorporating both modules (+F+L) leads to better overall results. The results suggest that forecasting events with learned causal information improves the robustness of the predictions.
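The noise injection can be sketched with the standard library alone, using Knuth's multiplication method to draw Poisson samples; the seed and helper names here are illustrative, not the experiment's actual configuration:

```python
import math
import random

def poisson_sample(lam, rng):
    """Draw one sample from Poisson(lam) via Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def add_poisson_noise(features, lam, seed=0):
    """Corrupt a feature vector by adding independent Poisson(lam) noise."""
    rng = random.Random(seed)
    return [x + poisson_sample(lam, rng) for x in features]
```

Because Poisson noise is non-negative and its variance equals its rate, raising the rate parameter from 1 to 25 both shifts and increasingly spreads the corrupted event counts.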

(a) India
(b) Nigeria
(c) Australia
(d) Canada
Figure 4: BACC value of event prediction when varying the noise level in validation and test sets. Higher is better.

7.2.2 Robustness to Training Noise

Human errors or machine failures in real-world data collection usually reduce data accuracy. With this motivation, we assume that only the training data are biased and test whether our method can achieve decent event prediction results on unbiased test data. As shown in Fig. 5, applying the robust learning modules helps the prediction model achieve a better BACC as the noise level increases. Adding only the approximation constraint loss (+L) can lead to a higher BACC than adding both modules (Fig. 5(b) and Fig. 5(c)). The results also illustrate that even with biased data (with corrupted features), the trained causal inference model learns valuable information that contributes to event prediction.

(a) India
(b) Nigeria
(c) Australia
(d) Canada
Figure 5: BACC value of event prediction when varying the noise level in the training set. Higher is better.

7.2.3 Case Study of Feature Reweighting

To illustrate the functionality of the proposed feature reweighting on robust event prediction, we provide several examples in the India dataset, as shown in Fig. 6. We use Cola-GNN  for analysis, given the more apparent improvements when it is applied with the feature reweighting module. Specifically, we first train an event prediction model on the India dataset using Cola-GNN  with the feature reweighting module. We select four corrupted test samples with random noise added to their input features (noise level of 5). We visualize the original features, the noisy features, and the ones obtained from the feature reweighting module. We can observe that the reweighted features can encode similar patterns of original features. It highlights the advantages of the ITE used in the feature reweighting module and demonstrates its ability to capture crucial information underlying the data distribution.

(a) Sample 1
(b) Sample 2
(c) Sample 3
(d) Sample 4
Figure 6: Distributions of original, noisy, and reweighted features of corrupted samples on the India dataset.

7.3 Causal Effect in Societal Events (RQ3)

In our study, whether there is a significant increase in certain types of events (e.g., appeal) over the past window is defined as the treatment of a location. The outcome is the future occurrence of a target event, i.e., protest. In this case, the ITE measures the difference in the outcome of a protest occurring between the two treatment scenarios (i.e., increased or not). Thus, when the necessary assumptions hold, it implies a causal effect of the treatment event on the protest. A higher ITE suggests that an increase in a treatment event is more influential on the occurrence of future protests than a decrease or no change. To better illustrate the effect of treatment events on future protests, we visualize the predicted ITEs based on Eq. 12. Violin plots for the four datasets are shown in Fig. 7. We select three treatment events for each dataset, with relatively low, moderate, and high average ITEs, respectively. The results vary across datasets due to different social environments. In India and Australia, massive historical protests may lead to future protests. In Nigeria and Canada, events related to military posture and threats, respectively, are likely to be more dominant factors in future protests. Nevertheless, we can hardly conclude that protests will occur whenever a treatment event increases substantially, because both types of events can be affected by hidden variables (i.e., unknown social factors). These results can provide supporting evidence for conjectures on protest triggers and generate hypotheses for future experiments.

(a) India
(b) Nigeria
(c) Australia
(d) Canada
Figure 7: ITE distribution of different treatment events on the outcome protest in violin plot. The loosely dashed line represents the median.

8 Conclusion and Future Work

Learning causal effects of societal events is beneficial to decision-making and helps practitioners understand the underlying dynamics of events. In this paper, we introduce a deep learning framework that can estimate the causal effects of societal events and predict societal events simultaneously. We design a novel spatiotemporal causal inference model for estimating ITEs and propose two robust learning modules that use the learned causal information as prior knowledge for societal event prediction. We conducted extensive experiments on several real-world event datasets and showed that our approach achieves the best results in ITE estimation and robust event prediction. One future direction is to examine other potential causes of event occurrence, such as events with specific themes and potentially biased media coverage.

9 Broader Impacts

This work aims to advance computational social science by investigating causal effects among societal events from observational data. The causal effects among different types of societal events have not been extensively studied. In this work, we provide preliminary results on estimating the individual causal effects of one type of event on another and incorporate this causal information to improve the predictive power of event prediction models. We hope to provide a way to understand human behavior from the societal and causal inference aspects and broaden the possibilities for future work on societal event studies.