Fine-Grained Temporal Relation Extraction

02/04/2019 ∙ by Siddharth Vashishtha, et al.

We present a novel semantic framework for modeling temporal relations and event durations that maps pairs of events to real-valued scales for the purpose of constructing document-level event timelines. We use this framework to construct the largest temporal relations dataset to date, covering the entirety of the Universal Dependencies English Web Treebank. We use this dataset to train models for jointly predicting fine-grained temporal relations and event durations. We report strong results on our data and show the efficacy of a transfer-learning approach for predicting standard, categorical TimeML relations.




1 Introduction

Natural languages provide a myriad of formal and lexical devices for conveying the temporal structure of complex events – e.g. tense, aspect, auxiliaries, adverbials, coordinators, subordinators, etc. Yet, these devices are generally insufficient for determining the fine-grained temporal structure of such events. Consider the narrative in 1.

(1) At 3pm, a boy broke his neighbor’s window. He was running away, when the neighbor rushed out to confront him. His parents were called but couldn’t arrive for two hours because they were still at work.

Most native English speakers would have little difficulty drawing a timeline for these events, likely producing something like that in Figure 1. But how do we know that the breaking, the running away, the confrontation, and the calling were short, while the parents being at work was not? And why should the first four be in sequence, with the last containing the others?

The answers to these questions likely involve a complex interplay between linguistic information, on the one hand, and common sense knowledge about events and their relationships, on the other  (Minsky, 1975; Schank and Abelson, 1975; Lamport, 1978; Allen and Hayes, 1985; Hobbs et al., 1987). But it remains an open question how best to capture this interaction.

Figure 1: A typical timeline for the narrative in 1.

A promising line of attack lies in the task of temporal relation extraction. Prior work in this domain has approached this task as a classification problem, labeling pairs of event-referring expressions – e.g. broke or be at work in 1 – and time-referring expressions – e.g. 3pm or two hours – with categorical temporal relations (Pustejovsky et al., 2003; Styler IV et al., 2014; Minard et al., 2016). The downside of this approach is that we must rely on time-referring expressions to express duration information. But as example 1 highlights, nearly all temporal duration information can be left implicit, meaning that such approaches encode duration information only when it is linguistically realized.

In this paper, we develop a novel framework for temporal relation representation that puts event duration front and center. Like standard approaches using the TimeML standard, we draw inspiration from Allen’s (1983) seminal work on interval representations of time. But instead of annotating text for categorical temporal relations, we map event pairs directly to real-valued relative timeline representations, in addition to mapping events to their likely durations. This change not only supports the goal of giving a more central role to event duration, it also allows us to better reason about the temporal structure of complex events as described by entire documents.

We begin with a discussion of the literature on temporal relation extraction (§2) and then discuss our own framework and data collection methodology (§3). The resulting Universal Decompositional Semantics Time (UDS-T) dataset is the largest temporal relation dataset to date, covering all of the Universal Dependencies (Silveira et al., 2014; De Marneffe et al., 2014; Nivre et al., 2015) English Web Treebank (Bies et al., 2012). We use this dataset to train a variety of neural models (§4) to jointly predict fine-grained (real-valued) temporal relations and event durations (§5), showing not only that our models obtain strong results on our dataset, but also that the representations they learn can be straightforwardly transferred to standard categorical relation datasets (§6).

2 Background

We review prior work on temporal relations frameworks and associated corpora as well as systems for temporal relation extraction.


Most large datasets capturing temporal relations between events use the TimeML standard (Pustejovsky et al., 2003; Styler IV et al., 2014; Minard et al., 2016). TimeBank is one of the earliest large corpora built using this standard, capturing event pairs that annotators felt were salient  (Pustejovsky et al., 2003). The TempEval competitions improved on the number of temporal relations by covering relations between all the events and times in a sentence, but only one of the TempEval tasks covered inter-sentential event relations (Verhagen et al., 2007, 2010; UzZaman et al., 2013, and see Chambers et al. 2014).

Efforts have been made to address the issue of sparsity in event-graphs with corpora such as the TimeBank-Dense (Cassidy et al., 2014) where annotators label all local-edges irrespective of ambiguity. TimeBank-Dense does not capture the complete graph over events and times relations, instead attempting to achieve completeness by capturing all relations within a sentence and the neighboring sentence. We take inspiration from this work for our own annotation protocol.

The Richer Event Description (RED) corpus uses a multi-stage annotation pipeline in which various event-event phenomena, including temporal relations and sub-event relations, are annotated together in the same dataset (O’Gorman et al., 2016). Similarly, Hong et al. (2016) build a cross-document event corpus that covers fine-grained event-event relations and roles with a larger number of event types and sub-types. Another framework, GAF (Fokkens et al., 2013), captures event identification through both textual and non-textual sources to track events across news articles.

Most of the corpora mentioned above required skilled workers to build the annotations as they follow specific ontologies. We take an alternative approach of capturing temporal relations by designing a protocol that asks simple questions about events which can be answered by any native speaker of English, finding surprisingly high agreement among annotators (see §3).


A variety of approaches have been taken to identifying the temporal relations between pairs of events. Early approaches use hand-tagged features modeled with multinomial logistic regression and support vector machines (Mani et al., 2006; Bethard, 2013; Lin et al., 2015). Other approaches use a combination of rule-based and learning-based approaches (D’Souza and Ng, 2013) and sieve-based architectures (Chambers et al., 2014; Mirza and Tonelli, 2016). Ning et al. (2018) jointly model causal and temporal relations using Constrained Conditional Models and formulate the problem as an Integer Linear Programming problem.

In recent years, neural network-based approaches have used both recurrent (Tourille et al., 2017; Cheng and Miyao, 2017; Leeuwenberg and Moens, 2018) and convolutional architectures (Dligach et al., 2017). Leeuwenberg and Moens (2018) use such models to predict relative timelines constructed from a set of temporal relations. Our annotations allow us to directly predict relative timelines between a pair of events, which we then use to create document timelines anchored to some specific event.

The pairwise classification can result in inconsistent temporal graphs, and efforts have been made to avert this issue by employing temporal reasoning (Chambers and Jurafsky, 2008; Yoshikawa et al., 2009; Denis and Muller, 2011; Do et al., 2012; Laokulrat et al., 2016; Ning et al., 2017; Leeuwenberg and Moens, 2017).

Other work models event durations from text (Pan et al., 2007; Gusev et al., 2011; Williams and Katz, 2012), but does not tie these durations directly to temporal relations. On the other hand, Filatova and Hovy (2001) assign a time-stamp to every clause in text, but the durations of events are not taken into consideration.

Attention-based models have proven effective in the neural machine translation literature (Bahdanau et al., 2014; Luong et al., 2015; Vaswani et al., 2017), but to our knowledge, they have not been explored in identifying temporal relations. We follow up on this work in our models, using a variation of dot-product attention (Luong et al., 2015; Vaswani et al., 2017) to predict the event timelines and durations, as described in §4. To cater to temporal reasoning, we treat the document timeline as a hidden representation and build it from the actual pairwise annotations, as described in §4.


3 Data Collection

Figure 2: An annotated example from our protocol

We collect the Universal Decompositional Semantics Time (UDS-T) dataset, which is annotated on top of the Universal Dependencies (Silveira et al., 2014; De Marneffe et al., 2014; Nivre et al., 2015) English Web Treebank (Bies et al., 2012). The main advantages of UD-EWT over other similar corpora are: (i) unlike most other datasets, it covers text from a variety of genres; (ii) it is built upon gold standard Universal Dependency parses; and (iii) it is compatible with various other semantic annotations which use the same predicate extraction standard (White et al., 2016; Zhang et al., 2017; Rudinger et al., 2018). Table 1 compares UDS-T against other temporal relations datasets.

Dataset #Events #Event-Event Relations
TimeBank 7,935 3,481
TempEval 2010 5,688 3,308
TempEval 2013 11,145 5,272
TimeBank-Dense 1,729 8,130
Hong et al. (2016) 863 25,610
UDS-T 32,302 70,368
Table 1: Number of total events, and event-event temporal relations captured in various corpora
Figure 3: Our heuristic finds fly as (the root of) the pivot predicate in “Has anyone considered that perhaps George Bush just wanted to fly jets?”

Protocol design

Annotators are given two contiguous sentences from a document with two highlighted event-referring expressions (predicates). If the predicate contains a copula, the whole predicate starting from the copula is highlighted. Otherwise, only the root of the predicate is highlighted. They are then asked (i) to provide relative timelines on a 0-100 scale for the pair of events referred to by the highlighted predicates; and (ii) to give the likely duration of the event referred to by each predicate from the following list: instantaneous, seconds, minutes, hours, days, weeks, months, years, decades, centuries, forever. In addition, annotators were asked to give confidence ratings for their relation annotation and each of their two duration annotations on the same five-point scale: not at all confident (0), not very confident (1), somewhat confident (2), very confident (3), totally confident (4).

An example of the annotation instrument is shown in Figure 2. Henceforth, we refer to the situation referred to by the predicate that comes first in linear order (feed in Figure 2) as $e_1$ and the situation referred to by the predicate that comes second in linear order (sick in Figure 2) as $e_2$.

Predicate extraction

We extract predicates from UD-EWT using PredPatt (White et al., 2016; Zhang et al., 2017), which identifies 33,935 predicates from 16,622 sentences. We consider predicates with POS tags in: [ADJ, NOUN, NUM, DET, PROPN, PRON, VERB, AUX].

We concatenate two adjacent sentences to form a combined sentence, which allows us to capture inter-sentential temporal relations. Considering all possible pairs of events in the combined sentence results in a rapidly growing number of event-event comparisons. Therefore, to reduce the total number of comparisons, we find the pivot-predicate of the antecedent of the combined sentence as follows: find the root predicate of the antecedent and, if it governs a CCOMP, CSUBJ, or XCOMP, follow that dependency to the next predicate until a predicate is found that doesn’t govern a CCOMP, CSUBJ, or XCOMP. We then take all pairs of the antecedent predicates and pair every predicate of the consequent only with the pivot-predicate. This results in $\binom{N}{2} + M$ predicate pairs instead of $\binom{N+M}{2}$ per combined sentence, where N and M are the number of predicates in the antecedent and consequent respectively. This heuristic allows us to find a predicate that loosely denotes the topic being talked about in the sentence. Figure 3 shows an example of finding the pivot predicate.
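The pivot heuristic and the resulting pairing scheme can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the `children` dictionary is a hand-built toy parse of the Figure 3 sentence rather than PredPatt output, and the consequent predicates are hypothetical.

```python
from itertools import combinations

# Dependency labels the heuristic follows, as listed in the text.
CLAUSAL = {"ccomp", "csubj", "xcomp"}


def find_pivot(root, children):
    """Follow CCOMP/CSUBJ/XCOMP edges from the root predicate until none is left.

    `children` maps a predicate to a list of (label, child_predicate) pairs.
    This toy structure is an assumption made for illustration.
    """
    current = root
    seen = {current}
    while True:
        nxt = next(
            (child for label, child in children.get(current, [])
             if label in CLAUSAL and child not in seen),
            None,
        )
        if nxt is None:
            return current
        seen.add(nxt)
        current = nxt


def candidate_pairs(antecedent, consequent, pivot):
    """All antecedent-internal pairs, plus (pivot, p) for each consequent predicate."""
    pairs = list(combinations(antecedent, 2))
    pairs += [(pivot, p) for p in consequent]
    return pairs


# Toy parse: considered -ccomp-> wanted -xcomp-> fly
children = {
    "considered": [("ccomp", "wanted")],
    "wanted": [("xcomp", "fly")],
}
antecedent = ["considered", "wanted", "fly"]
consequent = ["feed", "sick"]  # hypothetical predicates of the next sentence

pivot = find_pivot("considered", children)
pairs = candidate_pairs(antecedent, consequent, pivot)
print(pivot, len(pairs))
```

With N = 3 antecedent predicates and M = 2 consequent predicates, the scheme produces 3 + 2 = 5 pairs, compared with 10 if every pair in the combined sentence were annotated.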

Figure 4: Distribution of event durations in training and development sets.


We recruited 765 annotators from Amazon Mechanical Turk to annotate predicate pairs in groups of ten. Each predicate pair contained in the UD-EWT training set was annotated by a single annotator, and each predicate pair in the UD-EWT development and test sets was annotated by three annotators.


We normalize the slider responses for each event pair by subtracting the minimum slider value from all values, then dividing all such shifted values by the maximum value (after shifting). This ensures that the earliest beginning point for every event pair lies at 0 and that the right-most end-point lies at 1, while preserving the ratio between the durations implied by the sliders.
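As a small worked illustration of this normalization, assuming the four slider values for a pair are given in the order beginning and end of the first event, then beginning and end of the second (the function name is ours):

```python
def normalize_sliders(sliders):
    """Shift so the minimum is 0, then divide by the shifted maximum."""
    lowest = min(sliders)
    shifted = [s - lowest for s in sliders]
    highest = max(shifted)
    if highest == 0:
        # Degenerate case: all sliders coincide, so there is nothing to scale.
        return [0.0 for _ in shifted]
    return [s / highest for s in shifted]


raw = [20, 40, 30, 100]  # beg(first), end(first), beg(second), end(second)
norm = normalize_sliders(raw)
print(norm)  # [0.0, 0.25, 0.125, 1.0]
```

The implied durations are 20 and 70 before normalization and 0.25 and 0.875 after, so their ratio (2:7) is unchanged.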

Summary statistics

Figure 4 shows the distribution of duration responses in the training and development sets. There is a relatively high density of events lasting minutes, with a relatively even distribution across durations of years or less and few events lasting decades or more.

The raw slider positions themselves are somewhat difficult to interpret directly, and so it is not particularly informative to show their distribution. To improve interpretability, we rotate the slider position space to construct four new dimensions: (i) priority, which is positive when $e_1$ starts and/or ends earlier than $e_2$ and most negative when $e_2$ starts and/or ends earlier than $e_1$; (ii) containment, which is most negative when $e_2$ contains more of $e_1$ and most positive when $e_1$ contains more of $e_2$; (iii) equality, which is largest when both $e_1$ and $e_2$ have the same temporal extents and smallest when they are most unequal; and (iv) shift, which moves the events forward or backward in time. Here $e_1$ and $e_2$ are the situations referred to by the first and second predicate in linear order. We construct these dimensions by solving for the rotated coordinates given the matrix of slider positions for our datapoints, with columns in the following order: beg($e_1$), end($e_1$), beg($e_2$), end($e_2$).

Figure 5: Distribution of event relations in training and development sets.

Figure 5 shows the embedding of the event pairs on the first three of these dimensions. The triangular pattern near the top and bottom of the plot arises because strict priority – i.e. extreme positivity or negativity on the vertical axis – precludes any temporal overlap between the two events, and as we move toward the center of the plot, different priority relations mix with different overlap relations – e.g. the upper-middle left corresponds to event pairs where most of $e_1$ comes toward the beginning of $e_2$, while the upper-middle right of the plot corresponds to event pairs where most of $e_2$ comes toward the end of $e_1$.

We see that there is a strong bias for $e_1$ to start and/or end earlier than $e_2$ – evidenced by the higher density of points near the upper center of Figure 5 than near the lower center – and a slight bias for $e_1$ to contain more of $e_2$ – evidenced by slightly higher density of points near the right center of Figure 5 than near the left center.

Inter-annotator agreement

We measure interannotator agreement for the temporal relation sliders by calculating the rank (Spearman) correlation between the normalized slider positions for each pair of annotators that annotated a particular group of ten predicate pairs in the development set. Rank correlation is a useful measure in this case because it tells us how much different annotators agree on the relative position of each slider. The average rank correlation between annotators was 0.665 (95% CI=[0.661, 0.669]).
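For readers unfamiliar with the measure, the rank correlation between two annotators' normalized slider positions can be computed as below. This is a plain-Python sketch using average ranks for ties; the two response vectors are invented for illustration and are not drawn from UDS-T.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


annotator_a = [0.0, 0.25, 0.125, 1.0]  # illustrative normalized sliders
annotator_b = [0.0, 0.5, 0.2, 1.0]
print(spearman(annotator_a, annotator_b))
```

Because only the ordering of the sliders matters, two annotators who place the sliders at different absolute positions but in the same order still agree perfectly under this measure.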

We measure interannotator agreement for the durations by calculating the absolute difference in duration rank between the duration responses for each pair of annotators that annotated a particular group of ten predicate pairs in the development set. On average, annotators disagree by 2.24 scale points (95% CI=[2.21, 2.25]), though there is heavy positive skew (skewness = 1.16, 95% CI=[1.15, 1.18]) – evidenced by the fact that the modal rank difference is 1 (25.3% of the response pairs), with rank difference 0 as the next most likely (24.6%) and rank difference 2 as a distant third (15.4%).

Annotation coherence

Annotators were asked to approximate the relative duration of the two events that they were annotating using the distance between the sliders. This means that an annotation is coherent insofar as the ratio of distances between the slider responses for each event matches the ratio of the categorical duration responses. We rejected annotations wherein there was gross mismatch between the categorical responses and the slider responses – i.e. one event is annotated as having a longer duration but is given a shorter slider response – but because this does not guarantee that the exact ratios are preserved, we assess that here using a canonical correlation analysis (CCA; Hotelling 1936) between the categorical duration responses and the slider responses.

Figure 6: Scores from canonical correlation analysis comparing categorical duration annotations and slider relation annotations.

Figure 6 shows the CCA scores. We find that the first canonical correlation, which captures the ratios between unequal events, is 0.765; and the second, which captures the ratios between roughly equal events, is 0.427. This preservation of the ratios is quite impressive in light of the fact that our slider scales are bounded; though we hoped for at least a non-linear relationship between the categorical durations and the slider distances, we did not expect such a strong linear relationship.

4 Model

For a given event pair in a sentence, we aim to jointly predict each event’s duration alongside the relative event timelines. We then use these relative timelines to construct timelines for entire documents with a separate model.

Relative timelines

The relative timeline model consists of three components: an event model, a duration model, and a relation model. These components use multiple layers of dot product attention (Luong et al., 2015) on top of an embedding for a sentence tuned on the three contextual embeddings produced by ELMo (Peters et al., 2018) for that sentence, concatenated together. [Footnote 1: We found that correctly aligning BERT’s wordpiece representations with the predicate spans produced by PredPatt was not possible in general. In future work, we aim to introduce tunable intermediary alignment models for this purpose.]

where is the dimension for the tuned embeddings, , and .

Event model

We define the model’s representation for the event referred to by predicate as , where is the embedding size. We build this representation using a variant of dot-product attention, based on the predicate root.

where ; is the hidden representation of the predicate’s root; and is obtained by stacking the hidden representations of the entire predicate.

Figure 7: Network diagram for model. Dashed arrows are only included in some models.

The idea here is that the predicate root itself may be indicative of where within the predicate the relevant temporal information lies. For example, the predicate been sick for now in Figure 2 has sick as its root, and thus we would take the hidden representation for sick as the root representation. Similarly, the span representation would be obtained by taking the hidden-state representations of been sick for now and stacking them together. Then, if the model learns that tense information is important, it may weight been using the attention mechanism.

Duration model

The temporal duration representation for the event referred to by the predicate is defined similarly to the event representation, but instead of stacking the predicate’s span, we stack the hidden representations of the entire sentence .

where and .

We consider two models of the categorical durations: a softmax model and a binomial model. The main difference is that the binomial model enforces that the probabilities over the 11 duration values be concave in the duration rank, whereas the softmax model has no such constraint. We employ a cross-entropy loss for both models.

In the softmax model, we pass the duration representation for an event through a multilayer perceptron (MLP) with a single hidden layer and ReLU activations, to yield probabilities over the 11 durations.

In the binomial distribution model, we again pass the duration representation through an MLP with a single hidden layer of ReLU activations, but in this case, we yield only a single value $\pi$. With $\pi$ as defined above:

$$P(d = k;\, n, \pi) = \binom{n}{k}\, \pi^{k} (1 - \pi)^{n - k}$$

where $k \in \{0, 1, \ldots, 10\}$ represents the ranked durations – instant (0), seconds (1), minutes (2), …, centuries (9), forever (10) – and $n$ is the maximum class rank (10).
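To make the binomial parameterization concrete, the sketch below turns a single success probability into a distribution over the 11 duration ranks. It assumes the MLP output has already been squashed into (0, 1); the variable name `pi` and the helper function are ours, and only the standard binomial probability mass function is used.

```python
from math import comb

DURATIONS = [
    "instantaneous", "seconds", "minutes", "hours", "days", "weeks",
    "months", "years", "decades", "centuries", "forever",
]


def binomial_duration_probs(pi, n=10):
    """Binomial pmf over ranks 0..n for a success probability pi in (0, 1)."""
    return [comb(n, k) * pi ** k * (1 - pi) ** (n - k) for k in range(n + 1)]


probs = binomial_duration_probs(0.2)
best_rank = max(range(len(probs)), key=lambda k: probs[k])
print(DURATIONS[best_rank], round(probs[best_rank], 3))
```

A single parameter therefore controls where the probability mass peaks, and the mass falls off on either side of the peak. This single-peaked shape is the constraint the softmax model lacks.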

Relation model

To represent the temporal relation between the events referred to by the two predicates, we again use a similar attention mechanism.

where and .

The main idea behind our temporal model is to map events and states directly to a timeline, which we represent via a reference interval. For each situation, we aim to predict its beginning point and end-point on this interval.

We predict these values by passing the relation representation through an MLP with one hidden layer of ReLU activations and four real-valued outputs, representing the estimated relative beginning points and durations for the two events. We then calculate the predicted slider values from these outputs.

The predicted values are then normalized in the same fashion as the true slider values prior to being entered into the loss. We constrain these normalized predictions using four L1 losses.
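One way the four outputs might be turned into slider positions and compared with the annotations is sketched below. Several choices here are assumptions made for illustration rather than details confirmed by the text: squashing with a sigmoid, taking the absolute value of the duration output so that each end-point follows its beginning point, and using one absolute-error term per slider position as the four L1 terms.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predicted_sliders(beg_i, dur_i, beg_j, dur_j):
    """Map raw (beginning, duration) outputs to four slider positions in (0, 1)."""
    sliders = []
    for beg, dur in ((beg_i, dur_i), (beg_j, dur_j)):
        start = sigmoid(beg)
        end = sigmoid(beg + abs(dur))  # abs() keeps end >= start (assumption)
        sliders.extend([start, end])
    return sliders


def normalize(sliders):
    """Same min-shift / max-scale normalization applied to the true sliders."""
    lowest = min(sliders)
    shifted = [s - lowest for s in sliders]
    highest = max(shifted)
    return [s / highest for s in shifted] if highest > 0 else shifted


def l1_relation_loss(pred, true):
    """Sum of four absolute errors, one per slider position (assumed form)."""
    return sum(abs(p - t) for p, t in zip(normalize(pred), normalize(true)))


pred = predicted_sliders(-1.0, 0.5, 0.0, 2.0)  # hypothetical MLP outputs
true = [0.0, 0.25, 0.125, 1.0]                 # illustrative annotation
loss = l1_relation_loss(pred, true)
print(loss)
```

Normalizing predictions and annotations in the same way means the loss compares relative layouts of the two events rather than absolute slider positions.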

The final loss function is then

with the weighting hyperparameter set to a fixed value of 2 (see §5).

Duration-relation connections

We also experiment with four architectures wherein the duration and relation models are connected to each other in the Dur → Rel or Dur ← Rel directions.

In the first Dur → Rel architecture, we modify the relation representation by additionally concatenating the two predicates’ duration probabilities from the binomial distribution model.

In the second Dur → Rel architecture, we do not use the relation representation model at all, just using the two predicates’ duration probabilities from the binomial distribution model.

In the first Dur ← Rel architecture, we modify the duration representation by concatenating the corresponding predictions from the relation model.

In the second Dur ← Rel architecture, we do not use the duration representation model at all, and instead use the predicted relative duration obtained from the relation model, passing it through the binomial distribution model.

Model (Duration, Relation, Connection); Duration (ρ, rank diff., R1); Relation (Absolute ρ, Relative ρ, R1)
softmax - 32.63 1.86 -8.59 77.91 68.00 -2.82
binomial - 37.75 1.75 -13.73 77.87 67.68 -2.35
- Dur Rel 22.65 3.08 -51.68 71.65 66.59 -6.09
binomial - Dur Rel 36.52 1.76 -13.17 77.58 66.36 -0.85
binomial Dur Rel 38.38 1.75 -13.85 77.82 67.73 -2.58
binomial Dur Rel 38.12 1.75 -13.68 78.12 68.22 -2.96
Table 2: Results on test data based on different model representations; ρ denotes the Spearman correlation coefficient; rank diff. is the duration rank difference. The model highlighted in blue performs best overall on dev-data. The numbers highlighted in bold are the best-performing numbers in the respective columns.

Document timelines

From the timeline model, we learn the hidden document timelines for the UDS-T development set using: (i) actual pairwise slider annotations; (ii) slider values predicted by the best-performing model on the UDS-T development set. We assume a hidden timeline with one row for each predicate in the document, where the two dimensions represent the beginning point and the duration of the predicates. We then construct predicted relative timelines from this hidden timeline.

We learn for each document under the relation loss . We further constrain to predict the categorical durations using the binomial distribution model on the durations implied by , assuming .

5 Experiments

We implement the neural model and attention in PyTorch 1.0. We use the concatenated ELMo layers as word embeddings, which are then tuned to a lower dimension of 256. For all experiments, we use stochastic gradient descent to train the ELMo-tuned embeddings, attention, and MLP parameters. The loss hyperparameter (see §4) is set to 2.0. Both the relation and duration MLPs have a single hidden layer with 128 nodes. We weight the duration loss and the relation loss by the ridit-scored confidence ratings of event durations and event relations, respectively.

To predict TimeML relations in TempEval3 (Task C - relation only) (UzZaman et al., 2013) and TimeBank-Dense (Cassidy et al., 2014), we use a transfer learning approach. We first use the best-performing model on the UDS-T development set to obtain the relation representation for each pair of annotated predicates in TempEval3 and TimeBank-Dense. We then use this vector as the input features to an SVM classifier with a Gaussian kernel (sklearn 0.20.0; Pedregosa et al., 2011) to predict the temporal relation on these datasets. We run a hyperparameter grid search with 4-fold cross-validation over C: (0.1, 1, 10) and gamma: (0.001, 0.01, 0.1, 1). The best-performing configuration on cross-validation (C=10 and gamma=0.001) is then evaluated on the test sets of TempEval3 and TimeBank-Dense.

Since we require spans of predicates for our model, we pre-process TempEval3 and TimeBank-Dense by removing all XML tags from the sentences and then passing them through Stanford CoreNLP 3.9.2 (Manning et al., 2014) to get the corresponding CoNLL-U format. Roots and spans of predicates are then extracted using PredPatt. For our purposes, the identity and simultaneous relations in TempEval-3 are equivalent when comparing event-event relations. Hence, they are collapsed into one single relation.

Following recent work using continuous labels in event factuality prediction (Lee et al., 2015; Stanovsky et al., 2017; Rudinger et al., 2018; White et al., 2018) and genericity prediction (Govindarajan et al., 2019), we report three metrics for the duration prediction: Spearman correlation (ρ), mean rank difference (rank diff), and proportion rank difference explained (R1). We report three metrics for the relation prediction: the Spearman correlation between the normalized values of actual beginning and end points and the predicted ones (absolute ρ), the Spearman correlation between the actual and predicted values in the relation loss (relative ρ), and the proportion of MAE explained (R1).

In both cases, the R1 metric corresponds closely to the related $R^2$ metric, which measures the amount of variance in the data explained by the model, but is defined in terms of mean absolute error (MAE), which assumes an L1 space.

$$\mathrm{R1} = 1 - \frac{\mathrm{MAE}_{\text{model}}}{\mathrm{MAE}_{\text{baseline}}}$$

where $\mathrm{MAE}_{\text{baseline}}$ is the MAE obtained by always guessing the median. For both ρ and R1, we report the value scaled by 100 for readability.
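The R1 metric is easy to compute directly. The sketch below assumes the median baseline is taken over the same true values being scored; the function names and the toy data are ours.

```python
from statistics import median


def mean_absolute_error(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def r1(pred, true):
    """Proportion of MAE explained relative to always guessing the median."""
    baseline = [median(true)] * len(true)
    mae_model = mean_absolute_error(pred, true)
    mae_baseline = mean_absolute_error(baseline, true)
    return 1.0 - mae_model / mae_baseline


true = [0, 1, 2, 3, 4]
print(r1([0, 1, 2, 3, 4], true))  # perfect predictions
print(r1([2, 2, 2, 2, 2], true))  # no better than the median baseline
print(r1([4, 3, 2, 1, 0], true))  # worse than the baseline, so negative
```

As with $R^2$, the value is 1 for perfect predictions, 0 for a model that matches the median baseline, and negative for a model whose absolute error exceeds the baseline's, which is why negative R1 values can appear in a results table.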

As Govindarajan et al. (2019) note, these metrics are useful, since ρ tells us how similar the predictions are to the true values, ignoring scale, and R1 tells us how close the predictions are to the true values, after accounting for variability in the data.

One difficulty that arises in computing metrics for the relation annotations on our test set is that we obtained three annotations each, and taking, e.g., the mean for each slider value in these annotations can result in a qualitatively different temporal relation, with different duration and relation characteristics, than any of the three annotations themselves. So instead of aggregating either the duration or relation annotations, we compute our metrics on all three annotations separately and then aggregate over them. Note that this will result in higher errors than we might see if we aggregate, but we believe it is the fairest way to report.

6 Results

Table 2 shows the results of different model architectures on the UDS-T test set, and Table 3 shows the results of our transfer-learning approach on TempEval-3 and TimeBank-Dense.

Systems Data F1-micro F1-macro
CAEVO TD 0.494 -
CATENA TD 0.519 -
Cheng and Miyao (2017) TD 0.529 -
This work TD 0.566 0.327
This work TE3 0.489 0.208
Table 3: Results of our transfer learning experiment on event-event relations in TimeBank-Dense (TD) and TempEval-3 (TE3) compared against other systems.

UDS-T results

The overarching pattern we see is that most of our models are able to predict the relative position of the beginning and ending of events very well (high relation ρ) and the relative duration of events somewhat well (relatively low duration ρ), but they have a lot more trouble predicting relation exactly and relatively less trouble predicting duration exactly.

Duration model

The binomial distribution model outperforms the softmax model for duration prediction by a large margin, though it has effectively no effect on the accuracy of the relation model, with the binomial and softmax models performing comparably. This suggests two things. First, the fact that the duration and relation models share the weights associated with the predicate representation does not affect the models this representation feeds into – i.e. having a bad duration representation does not entail having a bad relation representation, even if they are built upon the same foundation. Second, it seems that enforcing concavity in duration rank on the duration probabilities helps the model better predict durations. Indeed, as an elaboration on the first point, it may not be that the duration representations for the softmax model are worse than for the binomial models, it may just be that the extra constraints from the binomial model are helping.

Duration attention
Word Attention (mean) Rank (mean) Freq
soldiers 0.911 1.28 69
months 0.844 1.38 264
Nothing 0.777 5.07 114
minutes 0.768 1.33 81
astronauts 0.756 1.37 81
hour 0.749 1.41 84
Palestinians 0.735 1.72 288
month 0.721 2.03 186
cartoonists 0.714 1.35 63
years 0.708 1.94 588
days 0.635 1.39 84
thoughts 0.592 2.90 60
us 0.557 2.09 483
week 0.531 2.23 558
advocates 0.517 2.30 105
Relation attention
Word Attention (mean) Rank (mean) Freq
occupied 0.685 1.33 54
massive 0.522 2.71 66
social 0.510 1.68 57
general 0.410 3.52 168
few 0.394 3.07 474
mathematical 0.393 7.66 132
are 0.387 3.47 4415
comes 0.339 2.39 51
or 0.326 3.50 3137
and 0.307 4.86 17615
emerge 0.305 2.67 54
filed 0.303 7.14 66
s 0.298 4.03 1152
were 0.282 3.49 1308
gets 0.239 7.36 228
Table 4: The top 15 words in the dev-data which had the highest mean duration-attention and relation-attention weights. For duration, the words highlighted in bold directly correspond to some duration class. For relation, the words in bold are either conjunctions or words containing tense information.


Connecting the duration and relation model doesn’t improve performance in general. In fact, when the durations are directly predicted from the temporal relation model – i.e. without using the duration representation model – the model’s performance drops by a large margin, with the Spearman correlation down by roughly 15 percentage points. This indicates that constraining the relations model to predict the durations is not enough and that the duration representation is needed to predict durations well.

On the other hand, predicting temporal relations directly from the duration probability distribution – i.e. without using the relation representation model – results in a similar score as that of the top-performing model. This indicates that the duration representation is able to capture most of the relation characteristics of the sentence. Using both duration representation and relation representation separately (model highlighted in blue) results in the best performance overall on the UDS-T development set. This is interesting in light of the fact that, as noted in §3, there is a strong linear relationship between the categorical durations and the durations implied by the relation annotations.

TempEval-3 and TimeBank-Dense

We report F1-micro and F1-macro scores on TempEval-3 and TimeBank-Dense in Table 3 and compare our results with some of the other systems. The TD F1-micro scores for these systems are reported by Cheng and Miyao (2017).[2] Our system beats the TD F1-micro scores of all other systems reported in Table 3. The top performing system on TE3 (Mirza and Tonelli, 2016) reports an F1 score of 0.619 over all relations. This indicates that our model is able to achieve competitive performance on other standard temporal classification problems.

[2] We do not report the temporal awareness scores (F1) of other systems on TE3 as they report their metrics on all relations, including timex-timex, and event-timex relations. Hence, it is not a fair comparison against our model. For TD, only those systems are reported which report F1-micro scores.
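For readers less familiar with the two averaging schemes, the following is a minimal sketch of how F1-micro and F1-macro differ on a single-label classification task. The function name `f1_scores` and the toy labels are illustrative assumptions, and the exact evaluation protocol behind Table 3 (for example, which labels are included) may differ.

```python
from collections import Counter

def f1_scores(gold, pred):
    """Return (micro_f1, macro_f1) for two equal-length label sequences."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed

    def f1(t, false_pos, false_neg):
        denom = 2 * t + false_pos + false_neg
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro

# Toy labels, invented for illustration only.
gold = ["BEFORE", "BEFORE", "AFTER", "INCLUDES", "BEFORE", "AFTER"]
pred = ["BEFORE", "AFTER", "AFTER", "BEFORE", "BEFORE", "AFTER"]
micro_f1, macro_f1 = f1_scores(gold, pred)
```

Because the rare INCLUDES label is never predicted in this toy example, the macro score falls well below the micro score, which is why both are worth reporting on datasets with skewed label distributions.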

Document timelines

We apply the document timeline model described in §4 to both the annotations on the development set and the best-performing model’s predictions to obtain timelines for all documents in the development set. Figure 8 shows an example, comparing the two resulting document timelines.

Figure 8: Learned Timeline for the following document based on actual (black) and predicted (red) annotations: “A+. I would rate Fran pcs an A + because the price was lower than everyone else , i got my computer back the next day , and the professionalism he showed was great . He took the time to explain things to me about my computer , i would recommend you go to him. David”

For these two timelines, we compare the induced beginning points and durations, obtaining a mean Spearman correlation of 0.28 for beginning points and -0.097 for durations. This suggests that the model agrees to some extent with the annotations about the beginning points of events in most documents but is struggling to find the correct duration spans. One possible reason for poor prediction of durations could be the lack of a direct source of duration information. The model currently tries to identify the duration based only on the slider values, which leads to poor performance in the Dur Rel model.
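The comparison above can be sketched as a per-document Spearman correlation averaged over documents. This is a minimal illustration in plain Python; the helper names and the toy values are assumptions, not the code or data used for the reported numbers.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    # Undefined (division by zero) if either input is constant.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def mean_spearman(docs):
    """docs: list of (annotation_values, prediction_values), one per document."""
    scores = [spearman(g, p) for g, p in docs]
    return sum(scores) / len(scores)

# Toy beginning points for two hypothetical documents.
docs = [
    ([0.0, 0.2, 0.5, 0.9], [0.1, 0.3, 0.4, 0.8]),  # same ordering
    ([0.0, 0.4, 0.7], [0.6, 0.3, 0.1]),            # reversed ordering
]
mean_rho = mean_spearman(docs)
```

Averaging per-document correlations, rather than pooling all events, keeps long documents from dominating the score; whether the reported means were computed exactly this way is not stated in the text.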

7 Model Analysis and Timelines

We investigate three aspects of the best-performing model on the development set (highlighted in blue in Table 2): what our duration and relation representations attend to, how well we reconstruct the relation space defined in §3, and how well document timelines constructed from the model’s predictions match those constructed from the annotations themselves.


The advantage of using an attention mechanism is that we can often interpret what linguistic information the model is using by analyzing the attention weights. We extract these attention weights for both the duration representation and the relation representation from our best model on the development set. We then compute the mean attention weight for these two attention models for each word type across the corpus. We also compute the mean rank of the attention weight for each word token within a sentence, with rank 1 assigned to the word with highest attention weight. Table 4 shows the top 15 words in the UDS-T development set according to mean attention weight, excluding words with frequency of less than 50 in EWT.
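The aggregation just described can be sketched as follows. The function `attention_summary`, its input layout, the lowercasing of tokens, and the toy sentences are all assumptions made for illustration; the text only specifies the mean weight per word type, the mean within-sentence rank, and the frequency cutoff of 50.

```python
from collections import defaultdict

def attention_summary(sentences, corpus_freq, min_freq=50):
    """Summarize attention per word type.

    sentences: list of (tokens, weights) pairs, weights aligned with tokens.
    corpus_freq: dict mapping word type -> frequency in the reference corpus.
    Returns [(word, mean_weight, mean_rank)], sorted by mean_weight descending.
    """
    weight_sum = defaultdict(float)
    rank_sum = defaultdict(float)
    count = defaultdict(int)
    for tokens, weights in sentences:
        # Rank 1 goes to the token with the highest attention weight;
        # ties are broken by position in the sentence.
        order = sorted(range(len(tokens)), key=lambda i: -weights[i])
        rank_of = {i: r for r, i in enumerate(order, start=1)}
        for i, tok in enumerate(tokens):
            word = tok.lower()  # assumption: case-insensitive word types
            weight_sum[word] += weights[i]
            rank_sum[word] += rank_of[i]
            count[word] += 1
    rows = [
        (w, weight_sum[w] / count[w], rank_sum[w] / count[w])
        for w in count
        if corpus_freq.get(w, 0) >= min_freq  # drop rare word types
    ]
    return sorted(rows, key=lambda r: -r[1])

# Toy attention weights, invented for illustration only.
sentences = [
    (["we", "waited", "for", "minutes"], [0.1, 0.2, 0.1, 0.6]),
    (["minutes", "passed"], [0.7, 0.3]),
]
corpus_freq = {"we": 900, "waited": 12, "for": 5000, "minutes": 81, "passed": 40}
rows = attention_summary(sentences, corpus_freq)
```

Reporting the mean rank alongside the mean weight is useful because raw attention weights depend on sentence length, whereas a rank close to 1 shows that a word is consistently the most attended token in its sentence.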


Words that denote some time period – e.g. month(s), minutes, hour, years, days, week – are among the top words in the duration model, with seven of the top 15 words directly denoting one of the duration classes. This is exactly what one might expect this model to rely heavily on, since time expressions are likely highly informative for making predictions about duration. It also may suggest that we do not need to directly encode relations between event-referring and time-referring expressions in our framework – as do annotation standards like TimeML – since our models may discover these relations.

The remainder of the top words in the duration model are plurals or mass nouns. This may suggest that the plurality of a predicate’s arguments is an indicator of the likely duration of the event referred to by that predicate. To investigate this possibility, we compute a multinomial regression predicting the attention weights $\alpha_s$ for each sentence $s$ from the $K$ morphological features of each word in that sentence, $F_s \in \{0,1\}^{\mathrm{length}(s) \times K}$, which are extracted from the UD-EWT features column and binarized. To do this, we optimize coefficients $c$ in $\arg\min_{c} \sum_{s} D\left(\alpha_s \,\|\, \mathrm{softmax}(F_s c)\right)$, where $D$ is the KL divergence. We find that the five most strongly weighted positive features in $c$ are all features of nouns – number=plur, case=acc, prontype=prs, number=sing, gender=masc – suggesting that a good portion of duration information can be gleaned from the arguments of a predicate. We believe this may be because nominal information can be useful in determining whether the clause is about particular events or generic events (Govindarajan et al., 2019). This is corroborated by the fact that the five most strongly weighted negative features in $c$ tend to be features of function words or predicates: prontype=Rel, degree=sup, numtype=mult, voice=pass, numtype=ord.
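A regression of this kind can be sketched as follows: for each sentence, a softmax over a linear function of the binarized per-word features is fit to the observed attention weights by minimizing the summed KL divergence. The sketch below uses plain gradient descent, no intercept, and two made-up features; these choices, the function names, and the toy data are assumptions, since the optimizer and other details are not given in the text.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # D(p || q); terms with p_i = 0 contribute nothing.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def predict(F, c):
    """Predicted attention distribution over the words of one sentence."""
    return softmax([sum(f * w for f, w in zip(row, c)) for row in F])

def objective(data, c):
    return sum(kl_divergence(alpha, predict(F, c)) for F, alpha in data)

def fit(data, n_features, lr=0.25, steps=500):
    """Gradient descent on the summed KL divergence.

    For one sentence, the gradient with respect to c is F^T (q - alpha),
    where q = softmax(F c) and alpha sums to 1.
    """
    c = [0.0] * n_features
    for _ in range(steps):
        grad = [0.0] * n_features
        for F, alpha in data:
            q = predict(F, c)
            for row, a, qi in zip(F, alpha, q):
                for k, f in enumerate(row):
                    grad[k] += f * (qi - a)
        c = [w - lr * g for w, g in zip(c, grad)]
    return c

# Toy data: each row of F is one word, columns are two hypothetical binary
# features (plural noun, verb); alpha holds that sentence's attention weights.
data = [
    ([[1, 0], [0, 1], [0, 0]], [0.6, 0.1, 0.3]),
    ([[0, 1], [1, 0]], [0.2, 0.8]),
]
coef = fit(data, n_features=2)
```

In this toy setting the coefficient for the first feature comes out positive and the second negative, mirroring how the sign and size of each coefficient are read as a feature's association with higher or lower attention.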


A majority of the top words in the relation model are either coordinators – such as “or” and “and” – or bearers of tense information – i.e. lexical verbs and auxiliaries. The first makes sense in light of the fact that, in context, coordinators can carry information about temporal sequencing (Bar-Lev and Palacas, 1980; Carston, 1993; Wilson and Sperber, 1998). The second makes sense in that information about the tense of predicates being compared likely helps the model determine relative ordering of the events they refer to.

To further investigate the role of morphological information, we compute a multinomial regression in the same way as for the duration model, using the same morphological featurization. We find that the five most strongly weighted positive features in the learned coefficients are all features of verbs or auxiliaries – person=1, person=3, tense=pres, tense=past, mood=ind – suggesting that a majority of the information relevant to relation can be gleaned from the tense-bearing units in a clause. This is corroborated by the fact that the five most strongly weighted negative features tend to be features of nouns or non-coordinator function words: case=acc, degree=cmp, gender=neut, prontype=Rel, numtype=ord.

Relation space

We rotate the predicted slider positions in the relation space defined in §3 and compare it with the rotated space of actual slider positions. We see a Spearman correlation of 0.19 for priority, 0.23 for containment, and 0.17 for equality. This suggests that our model is best able to capture containment relations and slightly worse at capturing priority and equality relations, though all the numbers are quite low compared to the absolute and relative metrics reported in Table 2. This may indicate that our models do somewhat poorly on predicting more fine-grained aspects of an event relation, and in the future it may be useful to jointly train against the more interpretable priority, containment, and equality measures instead of or in conjunction with the slider values.

8 Conclusion

We present a new semantic framework which allows us to get annotations of fine-grained temporal relations and event durations. Based on this framework, we construct the largest temporal relations dataset to date, which is built on top of the Universal Dependencies English Web Treebank. Our neural model architecture learns the fine-grained relations with a Spearman correlation of 0.7804, suggesting that these fine-grained relations can be learned fairly well. We also showcase how our model can be used to predict standard temporal relation classification tasks using a transfer learning approach. We present an analysis over different components of the model and show that the attention model focusses on interesting linguistic features to predict the durations and temporal relations. Finally, we present a simple model to generate document timelines from the learned fine-grained relations. These timelines can be used in other tasks to keep track of events temporally.

9 Acknowledgment

We thank the FACTS.lab at the University of Rochester for useful comments on framework and protocol design. This research was supported by the University of Rochester, JHU HLTCOE, and DARPA AIDA. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government.


  • Allen (1983) James F Allen. 1983. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832–843.
  • Allen and Hayes (1985) James F Allen and Patrick J Hayes. 1985. A common-sense theory of time. In IJCAI, volume 85, pages 528–531.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Bar-Lev and Palacas (1980) Zev Bar-Lev and Arthur Palacas. 1980. Semantic command over pragmatic priority. Lingua, 51(2-3):137–146.
  • Bethard (2013) Steven Bethard. 2013. Cleartk-timeml: A minimalist approach to tempeval 2013. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), volume 2, pages 10–14.
  • Bies et al. (2012) Ann Bies, Justin Mott, Colin Warner, and Seth Kulick. 2012. English web treebank. Linguistic Data Consortium, Philadelphia, PA.
  • Carston (1993) Robyn Carston. 1993. Conjunction, explanation and relevance. Lingua, 90(1-2):27–48.
  • Cassidy et al. (2014) Taylor Cassidy, Bill McDowell, Nathanael Chambers, and Steven Bethard. 2014. An annotation framework for dense event ordering. Technical report, Carnegie Mellon University, Pittsburgh, PA.
  • Chambers et al. (2014) Nathanael Chambers, Taylor Cassidy, Bill McDowell, and Steven Bethard. 2014. Dense event ordering with a multi-pass architecture. Transactions of the Association for Computational Linguistics, 2:273–284.
  • Chambers and Jurafsky (2008) Nathanael Chambers and Dan Jurafsky. 2008. Jointly combining implicit constraints improves temporal ordering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 698–706. Association for Computational Linguistics.
  • Cheng and Miyao (2017) Fei Cheng and Yusuke Miyao. 2017. Classifying temporal relations by bidirectional lstm over dependency paths. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 1–6.
  • De Marneffe et al. (2014) Marie-Catherine De Marneffe, Timothy Dozat, Natalia Silveira, Katri Haverinen, Filip Ginter, Joakim Nivre, and Christopher D. Manning. 2014. Universal Stanford dependencies: A cross-linguistic typology. In Proceedings of LREC, volume 14, pages 4585–4592.
  • Denis and Muller (2011) Pascal Denis and Philippe Muller. 2011. Predicting globally-coherent temporal structures from texts via endpoint inference and graph decomposition. In IJCAI-11-International Joint Conference on Artificial Intelligence.
  • Dligach et al. (2017) Dmitriy Dligach, Timothy Miller, Chen Lin, Steven Bethard, and Guergana Savova. 2017. Neural temporal relation extraction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, volume 2, pages 746–751.
  • Do et al. (2012) Quang Xuan Do, Wei Lu, and Dan Roth. 2012. Joint inference for event timeline construction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 677–687. Association for Computational Linguistics.
  • D’Souza and Ng (2013) Jennifer D’Souza and Vincent Ng. 2013. Classifying temporal relations with rich linguistic knowledge. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 918–927.
  • Filatova and Hovy (2001) Elena Filatova and Eduard Hovy. 2001. Assigning time-stamps to event-clauses. In Proceedings of the workshop on Temporal and spatial information processing-Volume 13, page 13. Association for Computational Linguistics.
  • Fokkens et al. (2013) Antske Fokkens, Marieke van Erp, Piek Vossen, Sara Tonelli, Willem Robert van Hage, BV SynerScope, Luciano Serafini, Rachele Sprugnoli, and Jesper Hoeksema. 2013. Gaf: A grounded annotation framework for events. NAACL HLT 2013, page 11.
  • Govindarajan et al. (2019) Venkata Subrahmanyan Govindarajan, Benjamin Van Durme, and Aaron Steven White. 2019. Decomposing generalization: Models of generic, habitual, and episodic statements. arXiv preprint arXiv:1901.11429.
  • Gusev et al. (2011) Andrey Gusev, Nathanael Chambers, Pranav Khaitan, Divye Khilnani, Steven Bethard, and Dan Jurafsky. 2011. Using query patterns to learn the duration of events. In Proceedings of the ninth international conference on computational semantics, pages 145–154. Association for Computational Linguistics.
  • Hobbs et al. (1987) Jerry R Hobbs, William Croft, Todd Davies, Douglas Edwards, and Kenneth Laws. 1987. Commonsense metaphysics and lexical semantics. Computational linguistics, 13(3-4):241–250.
  • Hong et al. (2016) Yu Hong, Tongtao Zhang, Tim O’Gorman, Sharone Horowit-Hendler, Heng Ji, and Martha Palmer. 2016. Building a cross-document event-event relation corpus. In Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016), pages 1–6.
  • Hotelling (1936) Harold Hotelling. 1936. Relations between two sets of variates. Biometrika, 28(3/4):321–377.
  • Lamport (1978) Leslie Lamport. 1978. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565.
  • Laokulrat et al. (2016) Natsuda Laokulrat, Makoto Miwa, and Yoshimasa Tsuruoka. 2016. Stacking approach to temporal relation classification with temporal inference. Information and Media Technologies, 11:53–78.
  • Lee et al. (2015) Kenton Lee, Yoav Artzi, Yejin Choi, and Luke Zettlemoyer. 2015. Event detection and factuality assessment with non-expert supervision. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1643–1648.
  • Leeuwenberg and Moens (2018) Artuur Leeuwenberg and Marie-Francine Moens. 2018. Temporal information extraction by predicting relative time-lines. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1237–1246.
  • Leeuwenberg and Moens (2017) Tuur Leeuwenberg and Marie-Francine Moens. 2017. Structured learning for temporal relation extraction from clinical records. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 1150–1158.
  • Lin et al. (2015) Chen Lin, Dmitriy Dligach, Timothy A Miller, Steven Bethard, and Guergana K Savova. 2015. Multilayered temporal modeling for the clinical domain. Journal of the American Medical Informatics Association, 23(2):387–395.
  • Luong et al. (2015) Minh-Thang Luong, Hieu Pham, and Christopher D Manning. 2015. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025.
  • Mani et al. (2006) Inderjeet Mani, Marc Verhagen, Ben Wellner, Chong Min Lee, and James Pustejovsky. 2006. Machine learning of temporal relations. In Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics, pages 753–760. Association for Computational Linguistics.
  • Manning et al. (2014) Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. 2014. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations, pages 55–60.
  • Minard et al. (2016) Anne-Lyse Myriam Minard, Manuela Speranza, Ruben Urizar, Begona Altuna, Marieke van Erp, Anneleen Schoen, and Chantal van Son. 2016. Meantime, the newsreader multilingual event and time corpus. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association (ELRA).
  • Minsky (1975) Marvin Minsky. 1975. A framework for representing knowledge. The Psychology of Computer Vision.
  • Mirza and Tonelli (2016) Paramita Mirza and Sara Tonelli. 2016. Catena: Causal and temporal relation extraction from natural language texts. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 64–75.
  • Ning et al. (2017) Qiang Ning, Zhili Feng, and Dan Roth. 2017. A structured learning approach to temporal relation extraction. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1027–1037.
  • Ning et al. (2018) Qiang Ning, Zhili Feng, Hao Wu, and Dan Roth. 2018. Joint reasoning for temporal and causal relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2278–2288.
  • Nivre et al. (2015) Joakim Nivre, Zeljko Agic, Maria Jesus Aranzabe, Masayuki Asahara, Aitziber Atutxa, Miguel Ballesteros, John Bauer, Kepa Bengoetxea, Riyaz Ahmad Bhat, Cristina Bosco, Sam Bowman, Giuseppe G. A. Celano, Miriam Connor, Marie-Catherine de Marneffe, Arantza Diaz de Ilarraza, Kaja Dobrovoljc, Timothy Dozat, Tomaž Erjavec, Richárd Farkas, Jennifer Foster, Daniel Galbraith, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Yoav Goldberg, Berta Gonzales, Bruno Guillaume, Jan Hajič, Dag Haug, Radu Ion, Elena Irimia, Anders Johannsen, Hiroshi Kanayama, Jenna Kanerva, Simon Krek, Veronika Laippala, Alessandro Lenci, Nikola Ljubešić, Teresa Lynn, Christopher Manning, Cătălina Mărănduc, David Mareček, Héctor Martínez Alonso, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Anna Missilä, Verginica Mititelu, Yusuke Miyao, Simonetta Montemagni, Shunsuke Mori, Hanna Nurmi, Petya Osenova, Lilja Øvrelid, Elena Pascual, Marco Passarotti, Cenel-Augusto Perez, Slav Petrov, Jussi Piitulainen, Barbara Plank, Martin Popel, Prokopis Prokopidis, Sampo Pyysalo, Loganathan Ramasamy, Rudolf Rosa, Shadi Saleh, Sebastian Schuster, Wolfgang Seeker, Mojgan Seraji, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Kiril Simov, Aaron Smith, Jan Štěpánek, Alane Suhr, Zsolt Szántó, Takaaki Tanaka, Reut Tsarfaty, Sumire Uematsu, Larraitz Uria, Viktor Varga, Veronika Vincze, Zdeněk Žabokrtský, Daniel Zeman, and Hanzhi Zhu. 2015. Universal Dependencies 1.2.
  • O’Gorman et al. (2016) Tim O’Gorman, Kristin Wright-Bettner, and Martha Palmer. 2016. Richer event description: Integrating event coreference with temporal, causal and bridging annotation. In Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016), pages 47–56.
  • Pan et al. (2007) Feng Pan, Rutu Mulkar-Mehta, and Jerry R Hobbs. 2007. Modeling and learning vague event durations for temporal reasoning. In Proceedings of the 22nd national conference on Artificial intelligence-Volume 2, pages 1659–1662. AAAI Press.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12:2825–2830.
  • Peters et al. (2018) Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
  • Pustejovsky et al. (2003) James Pustejovsky, Patrick Hanks, Roser Sauri, Andrew See, Robert Gaizauskas, Andrea Setzer, Dragomir Radev, Beth Sundheim, David Day, Lisa Ferro, et al. 2003. The timebank corpus. In Corpus linguistics, volume 2003, page 40. Lancaster, UK.
  • Rudinger et al. (2018) Rachel Rudinger, Aaron Steven White, and Benjamin Van Durme. 2018. Neural models of factuality. arXiv preprint arXiv:1804.02472.
  • Schank and Abelson (1975) Roger C Schank and Robert P Abelson. 1975. Scripts, plans, and knowledge. In IJCAI, pages 151–157.
  • Silveira et al. (2014) Natalia Silveira, Timothy Dozat, Marie-Catherine De Marneffe, Samuel R Bowman, Miriam Connor, John Bauer, and Christopher D Manning. 2014. A gold standard dependency corpus for english. In LREC, pages 2897–2904.
  • Stanovsky et al. (2017) Gabriel Stanovsky, Judith Eckle-Kohler, Yevgeniy Puzikov, Ido Dagan, and Iryna Gurevych. 2017. Integrating deep linguistic features in factuality prediction over unified datasets. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 352–357.
  • Styler IV et al. (2014) William F Styler IV, Steven Bethard, Sean Finan, Martha Palmer, Sameer Pradhan, Piet C de Groen, Brad Erickson, Timothy Miller, Chen Lin, Guergana Savova, et al. 2014. Temporal annotation in the clinical domain. Transactions of the Association for Computational Linguistics, 2:143.
  • Tourille et al. (2017) Julien Tourille, Olivier Ferret, Aurelie Neveol, and Xavier Tannier. 2017. Neural architecture for temporal relation extraction: A bi-lstm approach for detecting narrative containers. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 224–230.
  • UzZaman et al. (2013) Naushad UzZaman, Hector Llorens, Leon Derczynski, James Allen, Marc Verhagen, and James Pustejovsky. 2013. Semeval-2013 task 1: Tempeval-3: Evaluating time expressions, events, and temporal relations. In Second Joint Conference on Lexical and Computational Semantics (* SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), volume 2, pages 1–9.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Verhagen et al. (2007) Marc Verhagen, Robert Gaizauskas, Frank Schilder, Mark Hepple, Graham Katz, and James Pustejovsky. 2007. Semeval-2007 task 15: Tempeval temporal relation identification. In Proceedings of the 4th international workshop on semantic evaluations, pages 75–80. Association for Computational Linguistics.
  • Verhagen et al. (2010) Marc Verhagen, Roser Sauri, Tommaso Caselli, and James Pustejovsky. 2010. Semeval-2010 task 13: Tempeval-2. In Proceedings of the 5th international workshop on semantic evaluation, pages 57–62. Association for Computational Linguistics.
  • White et al. (2016) Aaron Steven White, Drew Reisinger, Keisuke Sakaguchi, Tim Vieira, Sheng Zhang, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2016. Universal decompositional semantics on universal dependencies. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1713–1723, Austin, TX. Association for Computational Linguistics.
  • White et al. (2018) Aaron Steven White, Rachel Rudinger, Kyle Rawlins, and Benjamin Van Durme. 2018. Lexicosyntactic inference in neural models. arXiv preprint arXiv:1808.06232.
  • Williams and Katz (2012) Jennifer Williams and Graham Katz. 2012. Extracting and modeling durations for habits and events from twitter. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers-Volume 2, pages 223–227. Association for Computational Linguistics.
  • Wilson and Sperber (1998) Deirdre Wilson and Dan Sperber. 1998. Pragmatics and time. Pragmatics and Beyond New Series, pages 1–22.
  • Yoshikawa et al. (2009) Katsumasa Yoshikawa, Sebastian Riedel, Masayuki Asahara, and Yuji Matsumoto. 2009. Jointly identifying temporal relations with markov logic. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 1-Volume 1, pages 405–413. Association for Computational Linguistics.
  • Zhang et al. (2017) Sheng Zhang, Rachel Rudinger, and Benjamin Van Durme. 2017. An evaluation of predpatt and open ie via stage 1 semantic role labeling. In IWCS 2017—12th International Conference on Computational Semantics—Short papers.