
Deep Structured Neural Network for Event Temporal Relation Extraction

by   Rujun Han, et al.

We propose a novel deep structured learning framework for event temporal relation extraction. The model consists of 1) a recurrent neural network (RNN) to learn scoring functions for pair-wise relations, and 2) a structured support vector machine (SSVM) to make joint predictions. The neural network automatically learns representations that account for long-term contexts to provide robust features for the structured model, while the SSVM incorporates domain knowledge such as transitive closure of temporal relations as constraints to make better globally consistent decisions. By jointly training the two components, our model combines the benefits of both data-driven learning and knowledge exploitation. Experimental results on three high-quality event temporal relation datasets (TCR, MATRES, and TB-Dense) demonstrate that incorporated with pre-trained contextualized embeddings, the proposed model achieves significantly better performances than the state-of-the-art methods on all three datasets. We also provide thorough ablation studies to investigate our model.





1 Introduction

Event temporal relation extraction aims at building a graph where nodes correspond to events within a given text, and edges reflect temporal relations between the events. Figure 1(a) illustrates an example of such a graph for the text shown above. Different types of edges specify different temporal relations: the event filed is SIMULTANEOUS with claiming, overruled is BEFORE claiming, and overruled is also BEFORE filed. Temporal relation extraction is beneficial for many downstream tasks such as question answering, information retrieval, and natural language generation. An event graph can potentially be leveraged to help time-series forecasting and to provide guidance for natural language generation. The CaTeRs dataset (Mostafazadeh et al., 2016), which annotates temporal and causal relations, is constructed for this purpose.

A major challenge in temporal relation extraction stems from its nature as a structured prediction problem. Although a relation graph can be decomposed into individual relations on each event pair, any local model that is not informed by the whole event graph will usually fail to make globally consistent predictions, thus degrading the overall performance. Figure 1(b) gives an example where the local model classifies the relation between overruled and claiming incorrectly because it only considers pairwise predictions: the graph's temporal transitivity constraint is violated, given that the relation between filed and claiming is SIMULTANEOUS. In Figure 1(c), the structured model changes the predicted relation between overruled and claiming from AFTER to BEFORE to ensure compatibility of all predicted edge types.

Prior works on event temporal relation extraction mostly formulate it as a pairwise classification problem (Bethard, 2013; Laokulrat et al., 2013; Chambers, 2013; Chambers et al., 2014), disregarding the global structure. Several works (Bramsen et al., 2006; Chambers and Jurafsky, 2008; Do et al., 2012; Ning et al., 2018) explore leveraging global inference to ensure consistency of all pairwise predictions. A few prior works directly model global structure in the training process (Yoshikawa et al., 2009; Ning et al., 2017; Leeuwenberg and Moens, 2017). However, these structured models rely on hand-crafted features using linguistic rules and local context, which cannot adequately capture potential long-term dependencies between events. In the example shown in Figure 1, filed occurs in much earlier context than overruled. Thus, incorporating long-term contextual information can be critical for correctly predicting temporal relations.

In this paper, we propose a novel deep structured learning model to address the shortcomings of previous methods. Specifically, we adapt the structured support vector machine (SSVM) (Finley and Joachims, 2008) to incorporate linguistic constraints and domain knowledge for making joint predictions on event temporal relations. Furthermore, we augment this framework with recurrent neural networks (RNNs) to learn long-term contexts. Despite the recent success of neural network models for event temporal relation extraction (Tourille et al., 2017a; Cheng and Miyao, 2017; Meng et al., 2017; Meng and Rumshisky, 2018), these systems make pairwise predictions and do not take advantage of the problem's structure.

We develop a joint end-to-end training scheme that enables feedback from the global structure to directly guide the neural network's representation learning, and hence allows our deep structured model to combine the benefits of both data-driven learning and knowledge exploitation. In the ablation study, we further demonstrate the importance of each global constraint, the influence of linguistic features, and the usage of contextualized word representations in the local model.

To summarize, our main contributions are:

  • We propose a deep SSVM model for event temporal relation extraction.

  • We show strong empirical results and establish new state-of-the-art for three event relation benchmark datasets.

  • Extensive ablation studies and thorough error analysis are conducted to understand the capacity and limitations of the proposed model, which provide insights for future research on temporal relation extraction.

2 Related Work

Temporal Relation Data.

Temporal relation corpora such as TimeBank (Pustejovsky et al., 2003) and RED (O’Gorman et al., 2016) facilitate research in temporal relation extraction. The common issue in these corpora is missing annotation. Collecting densely annotated temporal relation corpora with all event pairs fully annotated has been reported to be a challenging task, as annotators can easily overlook some pairs (Cassidy et al., 2014; Bethard et al., 2007; Chambers et al., 2014). The TB-Dense dataset mitigates this issue by forcing annotators to examine all pairs of events within the same or neighboring sentences. Recent data construction efforts such as MATRES (Ning et al., 2018) and TCR (Ning et al., 2018) further enhance data quality by using a multi-axis annotation scheme and adopting the start points of events to improve inter-annotator agreement. However, densely annotated datasets are relatively small both in number of documents and in number of event pairs, which restricts the complexity of machine learning models used in previous research.

Event Temporal Relation Extraction.

The series of TempEval competitions (Verhagen et al., 2007, 2010; UzZaman et al., 2013) attracted broad research interest in predicting event temporal relations. Early attempts (Mani et al., 2006; Verhagen et al., 2007; Chambers et al., 2007; Verhagen and Pustejovsky, 2008) only use local pair-wise classification with hand-engineered features. Later efforts, such as ClearTK (Bethard, 2013), UTTime (Laokulrat et al., 2013), and NavyTime (Chambers, 2013), improve on earlier work by feature engineering with linguistic and syntactic rules. A noteworthy work, CAEVO (Chambers et al., 2014), builds a pipeline with ordered sieves. Each sieve is either a rule-based classifier or a machine learning model; sieves are sorted by precision, i.e. decisions from a lower-precision classifier cannot contradict those from a higher-precision model.

More recently, neural network-based methods have been employed for event temporal relation extraction (Tourille et al., 2017a; Cheng and Miyao, 2017; Meng et al., 2017; Han et al., 2019a) and achieved impressive results. However, they all treat the task as a pairwise classification problem. Meng and Rumshisky (2018) considered incorporating global context for pairwise relation predictions, but they do not explicitly model the output graph structure of event temporal relations.

A few prior works explore structured learning for temporal relation extraction (Yoshikawa et al., 2009; Ning et al., 2017; Leeuwenberg and Moens, 2017). However, their local models use hand-engineered linguistic features. Despite the effectiveness of hand-crafted features in previous research, such feature design usually fails to capture long-term context in the discourse. Therefore, we propose to enhance the hand-crafted features with contextual representations learned through RNN models and develop an integrated joint training process.

3 Methods

We adapt the notation from Ning et al. (2018): R denotes the set of all possible relations and E denotes the set of all event entities.

3.1 Deep SSVM

Our deep SSVM model adapts the SSVM loss as

L(Φ) = Σ_n (C / M^n) · max_{ŷ^n} ( 0, Δ(y^n, ŷ^n) + S(ŷ^n, x^n) − S(y^n, x^n) ) + ‖Φ‖²   (1)

where Φ denotes model parameters, n indexes instances, and M^n is the number of event pairs in instance n. y^n and ŷ^n denote the gold and predicted global assignments for instance n, each of which consists of one-hot vectors representing the true and predicted relation labels for each event pair. Δ(y^n, ŷ^n) is a distance measure between the gold and predicted assignments; we simply use the Hamming distance. C is a hyper-parameter balancing the loss and the regularizer, and S is a pair-wise scoring function to be learned.

The intuition behind the SSVM loss is that it requires the score of the gold output structure to be greater than the score of the best output structure under the current model by a margin (note that if the best prediction is the same as the gold structure, the margin is zero); otherwise there will be some loss.
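As a minimal pure-Python sketch (illustrative variable names, not the authors' implementation), the per-instance margin-rescaled hinge term can be written as:

```python
def ssvm_instance_loss(score, gold, pred, num_pairs, C=1.0):
    """Margin-rescaled hinge loss for one instance (illustrative sketch).

    score:     dict mapping a global assignment (tuple of labels) to its score
    gold/pred: gold and best-predicted assignments (tuples of labels)
    num_pairs: number of event pairs in the instance (M^n in the text)
    """
    # Hamming distance between the gold and predicted label assignments
    delta = sum(g != p for g, p in zip(gold, pred))
    # Loss is zero when the gold structure outscores the best prediction
    # by at least the Hamming margin
    margin = delta + score[pred] - score[gold]
    return (C / num_pairs) * max(0.0, margin)

# Toy instance with two event pairs; gold assignment is (BEFORE, AFTER)
scores = {("BEFORE", "AFTER"): 3.0, ("AFTER", "AFTER"): 3.5}
loss = ssvm_instance_loss(scores, ("BEFORE", "AFTER"), ("AFTER", "AFTER"), num_pairs=2)
# margin = 1 + 3.5 - 3.0 = 1.5, so loss = 1.5 / 2 = 0.75
```

When the best prediction equals the gold structure, both the Hamming distance and the score gap vanish and the hinge is zero, matching the intuition above.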

The major difference between our deep SSVM and the traditional SSVM model is the scoring function: traditional SSVM uses a linear function over hand-crafted features to compute the scores, whereas we propose to use an RNN for estimation.

Figure 2: An overview of the proposed deep structured event relation extraction framework. The input representations consist of BERT representations and POS tag embeddings. They are concatenated and passed through BiLSTM layers and classification layers to get pairwise local scores. Incompatible local pairwise predictions (denoted by red lines) are corrected by the SSVM layer. Edge notation follows Figure 1; x_1, …, x_N denote tokens in the input sentence.

3.2 RNN-Based Scoring Function

We introduce an RNN-based pair-wise scoring function to learn features in a data-driven way and capture long-term context in the input. The local neural architecture is inspired by prior work in entity relation extraction such as Tourille et al. (2017b). As shown in Figure 2, the input layer consists of word representations and part-of-speech (POS) tag embeddings of each token in the input sentence. (Following the convention of the event relation prediction literature (Chambers et al., 2014; Ning et al., 2018), we only consider event pairs that occur in the same or neighboring sentences, but the architecture can be easily adapted to the case where inputs are longer than two sentences.) The word representations are obtained via pre-trained BERT (Devlin et al., 2018); we use the pre-trained bert-base-uncased model, and these representations are fixed throughout training, while the POS tag embeddings are tuned. The word and POS tag embeddings are concatenated to represent an input token and then fed into a Bi-LSTM layer to get contextualized representations.

We assume the events are labeled in the text and use indices i and j to denote the tokens associated with an event pair (i, j) in input sentences of length N. For each event pair (i, j), we take the forward and backward hidden vectors corresponding to each event token, namely f_i, b_i, f_j, b_j, to encode the event tokens. These hidden vectors are concatenated to form the input to a final linear layer, which produces a softmax distribution over all possible pair-wise relations; we refer to this as the RNN-based scoring function.
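As a toy illustration of this step (pure Python, with hypothetical one-dimensional hidden states and a hand-set weight matrix standing in for the trained BiLSTM and linear layer), the concatenate-then-softmax computation looks like:

```python
import math

def pair_scores(hidden_fwd, hidden_bwd, i, j, W, b):
    """Softmax scores over relation labels for event pair (i, j): a sketch
    of the RNN-based scoring function. hidden_fwd/hidden_bwd hold per-token
    forward/backward BiLSTM states; W (one row per label) and b parameterize
    the final linear layer. All shapes here are illustrative."""
    # Concatenate forward and backward states of both event tokens
    feats = hidden_fwd[i] + hidden_bwd[i] + hidden_fwd[j] + hidden_bwd[j]
    logits = [sum(w * f for w, f in zip(row, feats)) + b_r
              for row, b_r in zip(W, b)]
    # Numerically stable softmax over all possible pair-wise relations
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]
```

The returned distribution serves as the local score for each candidate relation of the pair.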

3.3 Inference

Inference is needed both during training, to obtain ŷ^n in the loss function (Equation 1), and at test time, to get globally compatible assignments. The inference framework is established by constructing a global objective function using scores from the local model and imposing several global constraints: symmetry and transitivity, as in Bramsen et al. (2006); Chambers and Jurafsky (2008); Denis and Muller (2011); Do et al. (2012); Ning et al. (2017); Han et al. (2019b), as well as linguistic rules and temporal-causal constraints proposed by Ning et al. (2018), to ensure global consistency. In this work, we incorporate the symmetry, transitivity, and temporal-causal constraints.

Objective Function.

The objective function of the global inference maximizes the score of the global assignment, as specified in Equation 2 (the objective function is specified at the instance level):

ŷ = argmax_y Σ_{(i,j)} Σ_{r ∈ R} y_{i,j}^r · S(y_{i,j}^r, x),   y_{i,j}^r ∈ {0, 1}   (2)

where y_{i,j}^r is a binary indicator specifying whether the global prediction for event pair (i, j) is equal to label r, and S is the scoring function obtained from the local model. The output of the global inference is a collection of optimal label assignments for all event pairs in a fixed context. The constraint following immediately from the objective function is that the global inference should assign exactly one label to each pair, i.e. Σ_{r ∈ R} y_{i,j}^r = 1.
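For intuition, on a toy instance this inference can be sketched as exhaustive search over joint assignments, with a pluggable consistency predicate standing in for the structural constraints; in practice an ILP solver replaces this enumeration:

```python
from itertools import product

def global_inference(pairs, local_scores, labels, consistent):
    """Return the joint label assignment maximizing summed local scores,
    subject to a consistency predicate (illustrative sketch only; the
    actual system solves an ILP). Exactly one label per pair is enforced
    by construction of the search space."""
    best, best_score = None, float("-inf")
    for labeling in product(labels, repeat=len(pairs)):
        assignment = dict(zip(pairs, labeling))
        if not consistent(assignment):
            continue
        score = sum(local_scores[p][r] for p, r in assignment.items())
        if score > best_score:
            best, best_score = assignment, score
    return best
```

With `consistent = lambda a: True` this reduces to an independent per-pair argmax; a predicate encoding the symmetry and transitivity constraints yields globally compatible assignments.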

Symmetry and Transitivity constraint.

The symmetry and transitivity constraints are used across all models and experiments in the paper. They can be specified as follows:

Symmetry:      y_{i,j}^r = y_{j,i}^r̄                                    ∀ (i,j), ∀ r ∈ R
Transitivity:  y_{i,j}^{r1} + y_{j,k}^{r2} − Σ_{r3 ∈ Trans(r1,r2)} y_{i,k}^{r3} ≤ 1   ∀ (i,j), (j,k), (i,k)

where r̄ denotes the reverse relation of r and Trans(r1, r2) is the set of relations compatible with r1 and r2. Intuitively, the symmetry constraint forces two pairs with opposite order to have reversed relations; for example, if y_{i,j} = BEFORE, then y_{j,i} = AFTER. The transitivity constraint requires that if the (i, j), (j, k) and (i, k) pairs exist in the graph, the label (relation) predicted for the (i, k) pair must fall into the transitivity set specified by the (i, j) and (j, k) pairs. The full transitivity table can be found in Ning et al. (2018).
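A transitivity violation on a predicted graph can be detected with a small helper; the TRANS fragment below is an illustrative subset, not the full table from Ning et al. (2018) (the symmetry constraint, by contrast, reduces to a reverse-label lookup):

```python
# Illustrative fragment of the transitivity table: Trans(r1, r2) is the set of
# labels allowed for (i, k) when (i, j) has label r1 and (j, k) has label r2.
TRANS = {("BEFORE", "BEFORE"): {"BEFORE"},
         ("BEFORE", "SIMULTANEOUS"): {"BEFORE"},
         ("SIMULTANEOUS", "BEFORE"): {"BEFORE"}}

def violates_transitivity(rels):
    """Check a relation dict {(i, j): label} for transitivity violations."""
    for (i, j), r1 in rels.items():
        for (j2, k), r2 in rels.items():
            if j2 != j or (i, k) not in rels:
                continue
            allowed = TRANS.get((r1, r2))
            if allowed is not None and rels[(i, k)] not in allowed:
                return True
    return False
```

On the Figure 1 example, predicting overruled AFTER claiming while overruled is BEFORE filed and filed is SIMULTANEOUS with claiming is flagged as a violation.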

Temporal-causal Constraint.

The temporal-causal constraint is used for the TCR dataset, which is the only dataset in our experiments that contains causal pairs. It can be written as:

y_{i,j}^c ≤ y_{i,j}^b   ∀ (i,j)

where c and c̄ correspond to the labels CAUSES and CAUSED_BY, and b represents the label BEFORE. This constraint specifies that if event i causes event j, then i has to occur before j. Note that this constraint holds for only 91.9% of the causal pairs in the TCR data (Ning et al., 2018), but it can help improve model performance based on our experiments.
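A mechanical check of this constraint on a predicted graph can be sketched as follows (the dict layout is hypothetical, for illustration only):

```python
def violates_temporal_causal(temporal, causal):
    """Return True if any pair labeled CAUSES lacks a BEFORE temporal label.
    temporal and causal map event pairs (i, j) to their predicted labels
    (illustrative data layout)."""
    return any(temporal.get(pair) != "BEFORE"
               for pair, label in causal.items() if label == "CAUSES")
```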

3.4 Learning

We develop a two-stage learning approach to optimize the neural SSVM. We first train the local scoring function without feedback from global constraints; that is, the local neural network model is optimized using only pair-wise relations in the first stage by minimizing cross-entropy loss. During the second stage, we switch to the global objective function in Equation 1 and re-optimize the network to adjust for global properties. (We experimented with optimizing the SSVM loss directly from the start, but model performance degraded significantly; we leave further investigation for future work.) We denote the local scoring model from the first stage as the local model, and the final model as the global model in the following sections.

4 Experimental Setup

In this section, we describe the three datasets that are used in the paper. Then we define the evaluation metrics. Finally, we provide details regarding our model implementation and experiments.

4.1 Data

Experiments are conducted on the TB-Dense, MATRES and TCR datasets; an overview of data statistics is shown in Table 1. We focus on event relations, so all numbers refer to event pairs (for TCR, we also include causal pairs in the table). Note that in all three datasets, event pairs are always annotated in their order of appearance in the text, i.e. given a labeled event pair (e_1, e_2), event e_1 always appears before event e_2 in the text. Following Meng et al. (2017), we augment event pairs with flipped-order pairs: if a pair (e_1, e_2) exists, the pair (e_2, e_1) is also added to our dataset with the opposite label. The augmentation is applied to the training and development splits, but the test set remains unaugmented (note that if the symmetry constraint is applied, scores for testing on the augmented and unaugmented sets are equal).
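The flipped-pair augmentation can be sketched as follows; the triple format and the reverse-label map are illustrative (the label inventory differs across the three datasets):

```python
# Reverse-label map used to flip pair order (illustrative subset of labels)
REVERSE = {"BEFORE": "AFTER", "AFTER": "BEFORE",
           "INCLUDES": "IS_INCLUDED", "IS_INCLUDED": "INCLUDES",
           "SIMULTANEOUS": "SIMULTANEOUS", "VAGUE": "VAGUE"}

def augment_with_flipped(pairs):
    """Add the flipped-order counterpart of every labeled event pair.
    pairs: list of (event_1, event_2, label) triples."""
    out = list(pairs)
    out.extend((e2, e1, REVERSE[label]) for e1, e2, label in pairs)
    return out
```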

                 TB-Dense   MATRES   TCR
# of Documents
  Train          22         22       20
  Dev            5          5        0
  Test           9          9        5
# of Pairs
  Train          4032       1098     1992
  Dev            629        229      0
  Test           1427       310      1008
Table 1: Data Overview


TB-Dense (Cassidy et al., 2014) is based on the TimeBank corpus but addresses the sparse-annotation issue in the original data by introducing the VAGUE label and requiring annotators to label all pairs of events/times in a given window.


MATRES (Ning et al., 2018) is based on the TB-Dense data, but filters out non-verbal events. The authors project events onto multiple axes and only keep those on the main axis. These two factors explain the large decrease in event pairs in Table 1. A start-point temporal scheme was adopted when out-sourcing the annotation task, which contributes to the performance improvement of machine learning models built on this dataset.


TCR (Ning et al., 2018) follows the same annotation scheme as MATRES for temporal pairs. It is additionally annotated with causal pairs; to obtain causal pairs, the authors select candidates based on the EventCausality dataset (Do et al., 2011).

4.2 Evaluation Metrics

To be consistent with the evaluation metrics used in baseline models, we adopt two slightly different calculations of metrics.


Micro-average

For all datasets, we compute micro-average scores. For densely annotated data, the micro-average metric has identical precision, recall and F1 scores. However, since VAGUE pairs are excluded from the micro-average calculations for TCR and MATRES (for fair comparison with the baseline models), the micro-average precision, recall and F1 scores differ when reporting results on those two datasets.
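One plausible implementation of this exclusion-aware micro-average is sketched below (the baselines' exact evaluation scripts may differ in detail):

```python
def micro_prf(gold, pred, exclude=("VAGUE",)):
    """Micro-averaged precision/recall/F1 over parallel gold/pred label
    lists, dropping excluded labels from the counts (as done for TCR and
    MATRES). With nothing excluded, precision = recall = F1 on densely
    annotated data, since every gold pair receives exactly one prediction."""
    correct = sum(g == p for g, p in zip(gold, pred) if g not in exclude)
    pred_n = sum(p not in exclude for p in pred)   # predictions that count
    gold_n = sum(g not in exclude for g in gold)   # gold labels that count
    precision = correct / pred_n if pred_n else 0.0
    recall = correct / gold_n if gold_n else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```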

Temporal Awareness (TE3)

For the TB-Dense dataset, the TE3 evaluation scheme (UzZaman et al., 2013) is also adopted, as in previous research (Ning et al., 2017, 2018). The TE3 score not only takes into account the number of correct pairs but also captures how “useful” a temporal graph is. We report this score for TB-Dense results only. For more details of this metric, please refer to the original paper (UzZaman et al., 2013).

                      TB-Dense   MATRES   TCR
Local Model
  hid_size            60         40       30
  dropout             0.5        0.7      0.5
  BiLSTM layers       1          2        1
  learning rate       0.002      0.002    0.002
Structured Learning
  learning rate       0.05       0.08     0.08
  decay               0.7        0.7      0.9
Table 2: Best hyper-parameters

4.3 Implementation Details

Since our work focuses on event relations, we build our models to predict relations between event pairs only when conducting experiments; thus, all micro-average F1 scores only consider event pairs. Note that TB-Dense also labels time entities, denoted as T. (T, T) and (E, T) pairs are generally easier to predict using rule-based classifiers or date normalization techniques (Do et al., 2012; Chambers et al., 2014; Mirza and Tonelli, 2016). To be consistent with the baseline models (Ning et al., 2018) for the TB-Dense data, we add (T, T) and (E, T) pairs for the TE3 evaluation metric (we rely on annotated data to distinguish different pair types, i.e. (E, E), (E, T) and (T, T) pairs are assumed to be given).

In the two-stage learning procedure, the local model is trained by minimizing cross-entropy loss with the Adam optimizer. We use pre-trained BERT embeddings with 768 dimensions as the input word representations and a one-layer MLP as the classification layer. For the structured learning stage, we observe a performance boost by switching from the Adam optimizer to an SGD optimizer with decay and momentum (the weight decay in SGD corresponds exactly to the regularizer in Equation 1; we set the momentum in SGD to 0.9 for all datasets). To solve the ILP in the inference process specified in Section 3.3, we leverage the off-the-shelf solver provided by the Gurobi optimizer, i.e. the best solutions from the Gurobi optimizer are inputs to the global training.

The hyper-parameters are chosen by performance on the development set (for TCR, we randomly select 4 documents from the training set as the development set), and the best combinations of hyper-parameters can be found in Table 2. We run experiments with 3 different random seeds and report the average results.

Note that for the TCR data, we need a separate classifier for causal relations. Because of the small number of causal pairs, we simply build an independent final linear layer apart from the original linear layer in Figure 2. In other words, there are two final linear layers, and only one of them is active when training on temporal or causal pairs.

Figure 3: Model Performance (F1 Score) Overview. Our local and global models’ performances are averaged over 3 different random seeds for robustness.

5 Results and Analysis

Figure 3 shows an overview of our model performance on three different datasets. As the chart illustrates, our RNN-based local models outperform state-of-the-art (SOTA) results and the global models further improve the performance over local models across all three datasets.

5.1 TCR

Detailed model performance for the TCR dataset is shown in Table 3. We only report model performance on temporal pairs. Both our local and global models outperform the baseline. Our global model improves overall performance by more than 1.2% over our local model; per McNemar's test, this improvement is statistically significant.

                     Local Model         Global Model
                     P     R     F1      P     R     F1
Before               82.1  86.9  84.3    81.3  90.0  85.4
After                67.1  73.2  69.7    70.9  70.9  70.9
Simultaneous         0.0   0.0   0.0     0.0   0.0   0.0
Vague                0.0   0.0   0.0     0.0   0.0   0.0
Micro-average        77.1  82.5  79.7    78.2  83.9  80.9**
Ning et al. (2018)                                   71.1
Table 3: Model performance breakdown for TCR. To make a fair comparison, we exclude VAGUE pairs from the micro-average score, which is why P, R and F1 differ. ** indicates the global model outperforms the local model per McNemar's test.

5.2 MATRES

Detailed model performance for the MATRES dataset can be found in Table 4. Similar to TCR, both our local and structured models outperform the baseline, and the global model improves overall performance by 1.4%; per McNemar's test, this improvement is statistically significant.

                     Local Model         Global Model
                     P     R     F1      P     R     F1
Before               79.7  88.1  83.6    80.1  89.6  84.6
After                70.5  83.3  76.3    72.3  84.8  78.0
Simultaneous         0.0   0.0   0.0     0.0   0.0   0.0
Vague                0.0   0.0   0.0     0.0   0.0   0.0
Micro-average        76.2  84.9  80.3    77.4  86.4  81.7*
Ning et al. (2018)                                   69
Table 4: Model performance breakdown for MATRES. Again, we exclude VAGUE pairs from the micro-average score. * indicates the global model outperforms the local model per McNemar's test.

5.3 TB-Dense

Table 5 shows the breakdown of performance across all labels, as well as the improvement from the local to the global model from adopting the two-stage structured learning method, on the TB-Dense dataset. Both our local and global models outperform the previous SOTA in the micro-average metric (reported by Meng and Rumshisky (2018)) and in the TE3 metric (results from Ning et al. (2018)).

Per McNemar's test, the improvement from the local to the global model is not statistically significant. We think one of the reasons is the large share of VAGUE pairs (42.6%). VAGUE pairs make our transitivity rules less conclusive: for example, if (i, j) = BEFORE and (j, k) = VAGUE, then (i, k) can be any relation type. Moreover, this impact is magnified by our local model's prediction bias towards VAGUE pairs. As we can see in Table 5, the recall score for VAGUE pairs is much higher than for other relation types, whereas the precision score is moderate. Our global model leverages local output structure to enforce global prediction consistency, but when local predictions contain many VAGUE pairs, this also introduces a lot of noise.

To make a fair comparison between our model and the best reported TE3 F1 score from Ning et al. (2018), we follow their strategy and add the CAEVO system's predictions on (T, T) and (E, T) pairs in the evaluation. The scores are shown in Table 5. Our overall system outperforms the baseline by over 10% on both the micro-average and TE3 F1 scores.

                              Local Model         Global Model
                              P     R     F1      P     R     F1
Before                        73.5  52.7  61.3    71.1  58.9  64.4
After                         71.6  60.8  65.3    75.0  55.6  63.5
Includes                      17.5  4.8   7.4     24.6  4.2   6.9
Is_Included                   69.1  4.4   8.0     57.9  5.7   10.2
Simultaneous                  0.0   0.0   0.0     0.0   0.0   0.0
Vague                         57.9  81.5  67.7    58.3  81.2  67.8
Micro-average                             62.6                63.2
Chambers et al. (2014)                    49.4
Cheng and Miyao (2017)                    52.9
Meng and Rumshisky (2018)                 57.0
TE3 Metrics
  (E, E) only                 62.1  61.9  62.2    62.7  58.9  62.5
  + CAEVO (T, T), (E, T)      58.6  63.6  61.0    59.0  64.0  61.4
  Ning et al. (2018)                                          52.1
Table 5: Model performance breakdown for TB-Dense (all values are percentages). Upper table: for event pairs only, we adopt the standard micro-average score. Lower table: TE3 refers to the temporal awareness score adopted by the TE-3 workshop. To make a fair comparison with Ning et al. (2018), we add CAEVO predictions on (T, T) and (E, T) pairs back into the calculation.

5.4 Error Analysis

To understand why both the local and structured models make mistakes, we randomly sample 50 pairs from the 345 cases (across all 3 random seeds) where both models' predictions are incorrect. We analyze these pairs qualitatively and categorize them into four cases, as shown in Table 6, with each case (except "other") paired with an example.

The first case illustrates that correct prediction requires broader contextual knowledge. For example, the gold label for transition and discuss is BEFORE: the nominal event transition refers to a specific era in history that ends before discuss in the second sentence. Human annotators can easily infer this relation based on their knowledge of history, but it is difficult for machines without prior knowledge. We observe this as a very common mistake, especially for pairs with nominal events. The second case shows that negation can completely change the temporal order. The gold label for the event pair planned and plans is AFTER because the negation token no postpones the event planned indefinitely. Our models do not pick up this signal and hence predict the relation as VAGUE.

Finally, “intention” events can make temporal relation prediction difficult (Ning et al., 2018). Case 3 demonstrates that our models can ignore “intention” tokens such as aimed at in the example, and hence make the incorrect prediction VAGUE between doubling and signed, whereas the true label is AFTER because doubling is an intention that has not occurred.

Case 1 (32%): Connection with broader context
The program also calls for coordination of economic
reforms and joint improvement of social programs in the
two countries, where many people have become
impoverished during the chaotic post - Soviet transition
to capitalism. Kuchma also planned to visit Russian gas
giant Gazprom, most likely to discuss Ukraine’s DLRS
1.2 billion debt to the company.
Case 2 (20%): Negation
Annan has no trip planned so far. Meanwhile, Secretary
of State Madeleine Albright, Berger and Defense
Secretary William Cohen announced plans to travel
to an unnamed city in the us heartland next week,
to explain to the American people just why military
force will be necessary if diplomacy fails.
Case 3 (14%): Intention Axis
A major goal of Kuchma’s four - day state visit was the
signing of a 10-year economic program aimed at
doubling the two nations’ trade turnover, which fell to
DLRS 14 billion last year, down DLRS 2.5 billion from
1996. The two presidents on Friday signed the plan,
which calls for cooperation in the metallurgy, fuel,
energy, aircraft building, missile, space and chemical
Case 4: (34%) Other
Table 6: Error Categories and Examples in TB-Dense

6 Ablation Studies

Although we have presented strong empirical results, the isolated contribution of each component of our model has not yet been investigated. In this section, we perform a thorough ablation study to understand the importance of the structured constraints, linguistic features, and BERT representations.

6.1 Effect of the structured constraints

One of our core claims is that learning benefits from modeling the structural constraints of the event temporal graph. To study the contribution of the structured constraints, we provide an ablation study on the two constraints that are applied to all three datasets: symmetry and transitivity.

A straightforward ablation study of the symmetry constraint is to remove it from our global inference step. However, even if we eliminate the symmetry constraint explicitly in global inference, it is still utilized implicitly in our data augmentation step (Section 4.1). To better understand the benefits of the symmetry constraint, we study both the contribution of explicitly applying it in our SSVM and its implicit impact through data augmentation.

Hence, in this section, we view a pair with original order and flipped order as different instances for learning and evaluation. We denote the pairs with original order as “forward” data, their flipped-order counterparts as “backward” data, and their combinations as “both-way” data.

We train four additional models to study the impacts of the symmetry and transitivity constraints: 1) a local model trained on forward data; 2) a global model with the transitivity constraint trained on forward data; 3) a local model trained on both-way data; 4) a global model with the transitivity constraint trained on both-way data. We denote these as L_f, L_f + T, L_b, and L_b + T respectively. L_f and L_f + T are models that do not apply the symmetry property at all; L_b and L_b + T are models that utilize the symmetry property implicitly.

Additionally, the evaluation setup should be re-scrutinized if we remove the symmetry constraint. In the standard evaluation setup of prior works, evaluation is only performed on pairs in their original order in the text (forward data). This evaluation assumes a model will work equally well on both forward and backward data, which certainly holds when we explicitly impose the symmetry constraint. However, as we observe in the following analysis, this assumption fails when we remove the symmetry constraint. To demonstrate the improvement of model robustness on backward data, we propose to test the model on both forward and both-way data. If a model is robust, it should perform well in both scenarios.

                        TB-Dense     MATRES       TCR
                        F_f   F_b    F_f   F_b    F_f   F_b
L_f                     62.9  61.9   80.4  74.7   80.5  75.7
L_f + T                 63.2  62.0   81.7  75.7   81.0  76.3
L_b                     62.6  62.7   80.3  80.4   79.7  79.6
L_b + T                 63.1  63.0   81.4  81.4   80.3  80.2
L_b + T + S (Proposed)  63.2  63.2   81.7  81.7   80.9  80.9
Table 7: Ablation over the global constraints: Symmetry and Transitivity. Tests are conducted on the forward test set and the both-way test set, denoted F_f and F_b respectively. The local models trained on forward data and both-way data are denoted L_f and L_b. The Symmetry and Transitivity constraints are denoted S and T. The results demonstrate that the symmetry and transitivity constraints both improve the model's performance.

We summarize our analysis of the results in Table 7 (F1 scores) as follows:

  • Impact of Transitivity: Comparing L_f with L_f + T, and L_b with L_b + T, the consistent improvements across all three datasets demonstrate the effectiveness of the global transitivity constraint.

  • Impact of Implicit Symmetry (data augmentation): Examining the contrast between L_f and L_b, as well as between L_f + T and L_b + T, we see significant improvements in the both-way evaluation despite slight performance drops in the forward evaluation. These comparisons imply that data augmentation can help improve model robustness. Note that Meng and Rumshisky (2018) leveraged this data augmentation trick in their model.

  • Impact of Explicit Symmetry: By comparing the proposed model with L_B + T, the consistent improvements across all datasets demonstrate the benefit of using the explicit symmetry property.

  • Model Robustness: Although L_F and L_F + T show competitive results when evaluated on forward test data, their performance degrades significantly in the both-way evaluation. In contrast, the proposed model achieves strong performance in both test scenarios (best F1 scores except in one case), which proves the robustness of our proposed method.

6.2 Effect of linguistic features

Previous research has established the success of leveraging linguistic features in event relation prediction. One advantage of contextualized word embeddings is that they provide rich semantic representations and could potentially remove the need for extra linguistic features. Here, we study the impact of incorporating linguistic features into our model, using the simple features provided in the original datasets: token distance, tense, and polarity of event entities. These features are concatenated with the Bi-LSTM hidden states before the linear layer (i.e., the hidden vectors illustrated in Figure 2). Table 8 shows the F1 scores of our local and global models with and without linguistic features. These additional features likely cause over-fitting, and hence do not improve model performance on any of the three datasets we test. This set of experiments shows that linguistic features do not improve the predictive power of our current framework.
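The feature concatenation described above can be sketched as follows; the encodings and names are illustrative assumptions, not the datasets' actual schemas.

```python
# Sketch of the feature-concatenation ablation: simple features
# (token distance, tense, polarity) are appended to an event pair's
# Bi-LSTM hidden vector before the linear scoring layer.
# The categorical encodings below are toy assumptions.
TENSE = {"PAST": 0, "PRESENT": 1, "FUTURE": 2, "NONE": 3}
POLARITY = {"POS": 0, "NEG": 1}

def with_linguistic_features(hidden, token_distance, tense, polarity):
    """hidden: list of floats (Bi-LSTM state for one event pair)."""
    return hidden + [float(token_distance),
                     float(TENSE[tense]),
                     float(POLARITY[polarity])]

h = [0.1, -0.3, 0.7]                     # stand-in for a hidden vector
x = with_linguistic_features(h, token_distance=4,
                             tense="PAST", polarity="POS")
# x has len(h) + 3 entries and would feed the linear scoring layer
```

In the "w/o feat." setting of Table 8, the scoring layer consumes `hidden` alone.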

           local      global     local       global
           w/ feat.   w/ feat.   w/o feat.   w/o feat.
TB-Dense   62.5       63.0       62.6        63.2
MATRES     81.4       81.7       80.3        81.7
TCR        79.5       80.7       79.7        80.9
Table 8: Ablation study on linguistic feature usage. Additional linguistic features do not lead to significant improvements and even hurt performance on 2 out of 3 datasets. The results show that our proposed framework learns rich semantic representations and can forgo additional linguistic features.

6.3 Effect of BERT representations

In this section, we explore the impact of contextualized BERT representations under our deep SSVM framework. We replace the BERT representations with GloVe Pennington et al. (2014) word embeddings. Table 9 shows the F1 scores of our local and global models using BERT and GloVe respectively (for the GloVe models, additional linguistic features are used). BERT improves performance by a significant margin. Moreover, even without BERT representations, our RNN-based local model and deep structured global model still outperform (MATRES and TCR) or are comparable with (TB-Dense) the current SOTA. These results confirm the effectiveness of our method.

           previous   local    global   local   global
           SOTA       GloVe    GloVe    BERT    BERT
TB-Dense   57.0       56.6     57.0     62.6    63.2
MATRES     69.0       71.8     75.6     80.3    81.7
TCR        71.1       73.5     76.5     79.7    80.9
Table 9: Ablation over word representations: BERT vs. GloVe. Although BERT representations contribute largely to the performance boost, our proposed framework remains strong and outperforms previous SOTA approaches on MATRES and TCR even when GloVe is used.

7 Conclusion

In this paper, we propose a novel deep structured model based on an SSVM that combines the benefits of structured models' ability to encode structural knowledge and deep neural architectures' ability to learn long-range features. Our experimental results demonstrate the effectiveness of this approach for event temporal relation extraction.

One interesting future direction is to further leverage commonsense knowledge, domain knowledge about temporal relations, and linguistic information to create more robust and comprehensive global constraints for structured learning. Another is to improve feature representations by designing novel neural architectures that better capture negation and hypothetical phrases, as discussed in the error analysis. We also plan to leverage large amounts of unannotated corpora to help event temporal relation extraction.


Acknowledgments

This work is partially funded by DARPA Contracts W911NF-15-1-0543 and an NIH R01 (LM012592). The authors thank the anonymous reviewers for their helpful suggestions and members of the USC PLUS lab for early feedback. Any opinions, findings, conclusions, or recommendations expressed here are those of the authors and do not necessarily reflect the views of the sponsors.


References

  • S. Bethard, J. H. Martin, and S. Klingenstein (2007) Timelines from text: identification of syntactic temporal relations. In Proceedings of the International Conference on Semantic Computing, ICSC '07, Washington, DC, USA, pp. 11–18.
  • S. Bethard (2013) ClearTK-TimeML: a minimalist approach to TempEval 2013. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pp. 10–14.
  • P. Bramsen, P. Deshpande, Y. K. Lee, and R. Barzilay (2006) Inducing temporal graphs. In EMNLP, Sydney, Australia.
  • T. Cassidy, B. McDowell, N. Chambers, and S. Bethard (2014) An annotation framework for dense event ordering. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 501–506.
  • N. Chambers (2013) NavyTime: event and time ordering from raw text. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, pp. 73–77.
  • N. Chambers, T. Cassidy, B. McDowell, and S. Bethard (2014) Dense event ordering with a multi-pass architecture. In ACL.
  • N. Chambers and D. Jurafsky (2008) Jointly combining implicit constraints improves temporal ordering. In EMNLP, Honolulu, United States.
  • N. Chambers, S. Wang, and D. Jurafsky (2007) Classifying temporal relations between events. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL '07, Stroudsburg, PA, USA, pp. 173–176.
  • F. Cheng and Y. Miyao (2017) Classifying temporal relations by bidirectional LSTM over dependency paths. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 1–6.
  • P. Denis and P. Muller (2011) Predicting globally-coherent temporal structures from texts via endpoint inference and graph decomposition. In IJCAI, Barcelona, Spain.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Q. X. Do, Y. S. Chan, and D. Roth (2011) Minimally supervised event causality identification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 294–303.
  • Q. X. Do, W. Lu, and D. Roth (2012) Joint inference for event timeline construction. In EMNLP, Jeju, Korea.
  • T. Finley and T. Joachims (2008) Training structural SVMs when exact inference is intractable. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, New York, NY, USA, pp. 304–311.
  • R. Han, M. Liang, B. Alhafni, and N. Peng (2019a) Contextualized word embeddings enhanced event temporal relation extraction for story understanding. arXiv preprint arXiv:1904.11942.
  • R. Han, Q. Ning, and N. Peng (2019b) Joint event and temporal relation extraction with shared representations and structured prediction. arXiv preprint arXiv:1909.05360.
  • N. Laokulrat, M. Miwa, Y. Tsuruoka, and T. Chikayama (2013) UTTime: temporal relation classification using deep syntactic features. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), Atlanta, Georgia, USA, pp. 88–92.
  • A. Leeuwenberg and M. Moens (2017) Structured learning for temporal relation extraction from clinical records. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pp. 1150–1158.
  • I. Mani, M. Verhagen, B. Wellner, C. M. Lee, and J. Pustejovsky (2006) Machine learning of temporal relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, ACL-44, Stroudsburg, PA, USA, pp. 753–760.
  • Y. Meng, A. Rumshisky, and A. Romanov (2017) Temporal information extraction for question answering using syntactic dependencies in an LSTM-based architecture. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 887–896.
  • Y. Meng and A. Rumshisky (2018) Context-aware neural model for temporal information extraction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
  • P. Mirza and S. Tonelli (2016) CATENA: causal and temporal relation extraction from natural language texts. In COLING, Osaka, Japan.
  • N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, and J. Allen (2016) A corpus and evaluation framework for deeper understanding of commonsense stories. In NAACL, San Diego, USA.
  • Q. Ning, Z. Feng, and D. Roth (2017) A structured learning approach to temporal relation extraction. In EMNLP, Copenhagen, Denmark.
  • Q. Ning, Z. Feng, H. Wu, and D. Roth (2018) Joint reasoning for temporal and causal relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2278–2288.
  • Q. Ning, H. Wu, and D. Roth (2018) A multi-axis annotation scheme for event temporal relations. In ACL.
  • T. O'Gorman, K. Wright-Bettner, and M. Palmer (2016) Richer event description: integrating event coreference with temporal, causal and bridging annotation. In Proceedings of the 2nd Workshop on Computing News Storylines, pp. 47–56.
  • J. Pennington, R. Socher, and C. D. Manning (2014) GloVe: global vectors for word representation. In EMNLP.
  • J. Pustejovsky, P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, and L. Ferro (2003) The TimeBank corpus. In Corpus Linguistics, pp. 647–656.
  • J. Tourille, O. Ferret, A. Neveol, and X. Tannier (2017a) Neural architecture for temporal relation extraction: a Bi-LSTM approach for detecting narrative containers. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 224–230.
  • J. Tourille, O. Ferret, X. Tannier, and A. Névéol (2017b) LIMSI-COT at SemEval-2017 Task 12: neural architecture for temporal information extraction from clinical narratives. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 597–602.
  • N. UzZaman, H. Llorens, L. Derczynski, J. Allen, M. Verhagen, and J. Pustejovsky (2013) SemEval-2013 Task 1: TempEval-3: evaluating time expressions, events, and temporal relations. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pp. 1–9.
  • M. Verhagen, R. Gaizauskas, F. Schilder, M. Hepple, G. Katz, and J. Pustejovsky (2007) SemEval-2007 Task 15: TempEval temporal relation identification. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval '07, Stroudsburg, PA, USA, pp. 75–80.
  • M. Verhagen and J. Pustejovsky (2008) Temporal processing with the TARSQI toolkit. In 22nd International Conference on Computational Linguistics: Demonstration Papers, COLING '08, Stroudsburg, PA, USA, pp. 189–192.
  • M. Verhagen, R. Saurí, T. Caselli, and J. Pustejovsky (2010) SemEval-2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, SemEval '10, Stroudsburg, PA, USA, pp. 57–62.
  • K. Yoshikawa, S. Riedel, M. Asahara, and Y. Matsumoto (2009) Jointly identifying temporal relations with Markov logic. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1, pp. 405–413.