NarrativeTime: Dense High-Speed Temporal Annotation on a Timeline

08/29/2019 ∙ by Anna Rogers et al., UMass Lowell

We present NarrativeTime, a new timeline-based annotation scheme for temporal order of events in text, and a new densely annotated fiction corpus comparable to TimeBank-Dense. NarrativeTime is considerably faster than schemes based on event pairs such as TimeML, and it produces more temporal links between events than TimeBank-Dense, while maintaining comparable agreement on temporal links. This is achieved through new strategies for encoding vagueness in temporal relations and an annotation workflow that takes into account the annotators' chunking and commonsense reasoning strategies. NarrativeTime comes with new specialized web-based tools for annotation and adjudication.







1 Introduction

One of the biggest challenges in annotation for modern data-hungry NLP systems is temporal annotation: to date, it has been complex, expensive, slow, and plagued by low inter-annotator agreement (IAA). The largest available gold-standard resources are not large by current standards. TimeBank 1.2 Pustejovsky et al. (2003b) includes 183 news articles; MEANTime Minard et al. (2016) includes 120 news stories per language; CaTeRS includes 320 one-paragraph stories; TimeBank-Dense Cassidy et al. (2014b) used a subset of 36 TimeBank documents, which were later reused in several other studies Reimers et al. (2016); O’Gorman (2017). Temporal relations are notably absent from larger resources annotated for events, such as ACE Walker et al. (2006), the ECB+ corpus Cybulska and Vossen (2014, 2015), and projects based on the ERE annotation scheme Song et al. (2015, 2016).

We present NarrativeTime, a new annotation scheme for temporal order that takes a radical break from the mainstream approaches. Instead of focusing on relations in event pairs, NarrativeTime builds a dynamic interactive timeline representation as the annotators go through the text. We propose a novel way to handle underspecification by incorporating it in event type definitions rather than at a separate temporal link (TLink) layer, and we leverage chunking processes to ease temporal ordering. The result is a significant reduction in the number of clicks needed to annotate an event: a 350-word text can be processed in under 30 minutes. A pilot study on English fiction texts (3.5K words) showed a Cohen’s kappa of 0.58, which is comparable or superior to other proposals, while providing much faster and denser annotation.

NarrativeTime comes with a new annotated corpus of fiction texts, comparable in size to TimeBank-Dense (currently the default resource for temporal information extraction systems).

Annotation scheme | TLink types | Realis types | Events IAA | TLinks IAA | TLink type IAA | Corpus size¹ | IAA metric
TimeML 1.2.1 and ISO-TimeML Pustejovsky et al. (2005, 2010) | 13 | 6² | 0.78 | n/a | 0.55 | 10 (news) | AvgPnR
TempEval-1 Verhagen et al. (2007, 2009) | 6 | 1² | n/a | n/a | 0.47 | TimeBank | Cohen’s kappa
TempEval-3 UzZaman et al. (2012) | 13 | 1² | 0.87 | n/a | n/a | 6k words (web) | F1
THYME-TimeML Styler et al. (2014) | 5 | 4 | 0.7899 | 0.5012 | 0.499 | 107 (clin. notes) | Krippendorff’s alpha
Temporal Dependency Structure Kolomiyets et al. (2012); Bethard et al. (2012) | 6 | 1 | 0.856 | 0.822 | 0.7 | 100 (fables) | Krippendorff’s alpha
Multi-Axis Annotation Scheme Ning et al. (2018) | 4 | 5 | 0.85 | n/a | 0.84 | 36 (news) | Cohen’s kappa
RED O’Gorman et al. (2016); Ikuta et al. (2014) | 4 | 4 | 0.861 | 0.729 | 0.184–0.544 | 55 (news) | F1
TimeBank-Dense Cassidy et al. (2014b) | 6 | 3 | n/a | n/a | 0.56–0.64 | 36 (news) | (Cohen’s?) kappa
NewsReader Minard et al. (2016); Tonelli et al. (2014); van Erp et al. (2015) | 13 | 2 | 0.68 | n/a | n/a | 30 (news) | Dice’s coefficient
Araki et al. (2018) | 2 | n/a | 0.802 (F1) | n/a | 0.108–0.139 | 100 (simple wiki) | Fleiss’ kappa
CaTeRS Mostafazadeh et al. (2016) | 4 | n/a | 0.91 | n/a | 0.51 | 20 mini-stories | Fleiss’ kappa

  • ¹ Corpus size refers to the sample for which IAA was reported.

  • ² Each event instance is annotated for modality and polarity attributes.

Table 1: Current temporal annotation schemes

2 Related work

Temporal reasoning has attracted a lot of attention in recent years, with numerous annotation schemes aiming to improve on different aspects of the classical TimeML Pustejovsky et al. (2005). The best-known proposals are listed in Table 1, which summarizes the number of temporal relations, realis types, and the IAA reported in various studies. TimeML recognizes the largest number of both temporal relations and event types.

Two major problems that all of these schemes have to address are underspecification and annotation density. For underspecification, the chief solution is to introduce additional restrictions so as to avoid annotating non-actual events Bethard et al. (2012) or, more recently, to place them on separate axes Ning et al. (2018). The density problem is inherent in annotating event pairs, since a complete set of relations for a text is quadratic in the number of events. Most schemes therefore limit the scope of the task in some way: annotating TLinks only within the same or adjacent sentences Verhagen et al. (2007, 2010); UzZaman et al. (2012); Minard et al. (2016), restricting annotation to a specific construction Bethard et al. (2007), and attempting to infer the missing TLinks later by computing transitive closure.

We are by no means the first to use timelines in temporal annotation: similar representations have been proposed by Kolomiyets et al. (2012), Do et al. (2012), and Caselli and Vossen (2016, 2017), and go back to the early work of Verhagen et al. (2006). However, they all used temporal graphs only as a representation of the final result, while the annotation itself was still based on event pairs.

NarrativeTime differs radically from all the above-mentioned approaches in handling vagueness at the level of both event types and TLinks, and in leveraging the annotator’s chunking strategies for single-click annotation of clusters of consecutive or roughly-simultaneous events. To the best of our knowledge, NarrativeTime also produces the densest annotation, surpassing both TimeBank-Dense Cassidy et al. (2014b), which only focused on pairs of events in a given window, and Do et al. (2012), a study in which “the annotator was not required to annotate all pairs of event mentions, but as many as possible”.

3 Why event pairs are problematic: motivation in psychology

We believe that the chief reason temporal annotation has been so slow and expensive is the focus on individual event pairs, which runs against the natural reading comprehension process and forces the annotator to either ignore underspecified relations or make tortured decisions about them.

The exact mechanism of reading comprehension is still debated Rayner and Reichle (2010), but there are good reasons to believe that we gradually build a mental model of the whole narrative van der Meer et al. (2002); Zwaan (2016). This model has a directional representation of time and temporal distance between events, and is built correctly even if the text is not organized chronologically, e.g. if there are flashbacks Claus (2012).

We also know that texts pre-chunked in semantically coherent segments are easier to process Frase and Schwartz (1979); O’Shea and Sindelar (1983); Rajendran et al. (2013). For dynamic situations, “semantic coherence” is best explained in terms of scripts/frames, mental representations of stereotypical complex activities. They have internal organization, with possibly complex sub-elements that can be managed without losing track of the overall goal of the script Farag et al. (2010).

The process of constructing a mental model of a narrative is likely to be subject to the same on-line constraints as the rest of language processing. (Reading comprehension in particular is influenced by working memory capacity Seigneuric et al. (2000), vocabulary proficiency Quinn et al. (2015), and even individual differences in statistical learning Misyak et al. (2010).) This brings into play “good-enough processing” Christianson (2016); Ferreira et al. (2009). Not all temporal relations can be inferred, since writers focus on advancing their story in an engaging way rather than spelling out every detail. Readers, in their turn, have limited time and attention, and are interested in the major, salient developments involving the characters, saving processing effort on minor details.

Counter-intuitively, readers do not save effort by looking at each segment only once: we regress as needed Schotter et al. (2014), even across sentence boundaries Shebilske and Reid (1979). This suggests that during reading a good-enough representation of the narrative is constructed, with readers anticipating developments Coll-Florit and Gennari (2011) and filling the most glaring gaps with their world knowledge. The variation is particularly notable with regard to the length of durative events Coll-Florit and Gennari (2011).

If the above view of reading comprehension is correct, it is the opposite of the reading process in annotation based on event pairs. The annotators have to detach a pair of events from the rest of the story and consider their temporal relation in isolation. A given pair may or may not be in the category of events that were salient enough in the discourse to be easily orderable. Furthermore, there is no allowance for the fact that underspecified relations are not just “vague”: if they are salient enough, their order will be interpreted, but that interpretation may well differ between annotators with different backgrounds or even simply different working memory capacity.

4 NarrativeTime annotation scheme

4.1 What counts as an event

NarrativeTime annotation is performed in two stages: (1) identification of events of interest and their coreference, and (2) their temporal ordering. This paper focuses on our new temporal ordering strategies: as shown in section 2, detection of events is an easier task with relatively high IAA, and we do not introduce anything new here.

As in TimeML, we define events as anything that happens or occurs, or as a state in which something obtains or holds true Pustejovsky et al. (2003a). Events can be expressed as verbs, nominals, adjectives/participles, or phrases/clauses.

We concur with Ning et al. (2018) that different realis types involve different timeline axes, the temporal relations between which are vague. Our main timeline currently represents only actual events. Hypotheticals, imperatives, future events, and questions that do not explicitly contribute information about actual events are all annotated with a special irrealis event type [I].

4.2 Event types

Most current temporal annotation schemes adopt a model of temporal relations based on interval algebra Allen (1984). The start and end points of two events form 13 possible relations: before/after, immediately before/after, overlaps/is overlapped, ends/is ended by, starts/is started by, during/includes, and identity.

However, mentally tracking all the start and end points is psychologically unrealistic. Ning et al. (2018) suggest focusing on start points because of the variation in perceived event durations Coll-Florit and Gennari (2011), but this assumes that the start of an event is always more salient than its other phases. That can hardly be the case, since focus depends on contextual saliency: for example, we would be more concerned with the end of a resuscitation attempt than with its beginning.

We propose integrating some temporal order information in event definitions rather than leaving it all to TLinks. The annotators need to be able to focus on the start, end, or the ongoing phase of an event, or any combination thereof that is salient in the context, and leave out the underspecified parts. This idea owes a lot to the huge body of linguistic work on verb aspect and event structure Dowty (1986); Pustejovsky (1991); Moens and Steedman (1988); Smith (1997), verb classes Vendler (1957); Levin (1993); Chipman et al. (2017), and particularly the geometric event phase representations of Croft (2012). To the best of our knowledge, the phase approach has not previously been applied to full-text annotation of temporal relations.

To achieve this, NarrativeTime distinguishes between bounded, unbounded and partially bounded events, defined as follows. (We hope that the linguist reader will excuse our re-defining the term “boundedness”, which is an established term in the verb aspect literature.)

Bounded events [B]

are events (of any nature and duration) that are known to start roughly after the end of the nearest other event on the timeline and to end before the next event starts (with or without a temporal gap). For example, in the sentence John started working when Mary came in and stopped when she left, the event of John’s working is “bounded” by the events of Mary’s coming and leaving.

Figure 1: Bounded events

Creating a [B] event requires indicating its position on the timeline with respect to other events, which is input simply as a number starting at 1. It is possible for one [B] event to span other [B] events, as in Figure 1, where one bounded event overlaps with both of its neighbors.

Unbounded events {U}

are events (of any nature and duration) whose exact start and end points are not known, but which are known to overlap with some other event on the timeline, and also (in an underspecified way) with its nearest neighbors.

Figure 2: Unbounded events

For example, consider the sentence Mary went to the coffee shop and found John there. He was working. She left. The event of John’s working started at an underspecified point, possibly even before Mary’s deciding to go to the coffee shop. We also do not know when he stops working: maybe immediately after Mary’s leaving, maybe hours later. But we do know that he was working when she found him, and this is what {U} events encode in NarrativeTime. The temporal location of the anchoring [B] event serves as the temporal “center” of the {U} event.

A big advantage of this definition of unbounded events is that it enables inference about the relations between the {U} event and the events surrounding the anchor [B] event, based on world knowledge. The authors’ intuition is that it did not take Mary long to get to the coffee shop, so John was probably working while she was on her way. Specifying such guesses is definitely not in the scope of temporal annotation, but {U} annotations would enable formulating a new task for future AI systems: imitating the commonsense reasoning about when a {U} event most likely started or ended.

The unanchored {U} event in Figure 2 represents a state that holds equally true throughout the narrative. We use this mechanism to account for relatively permanent characteristics of characters and entities. The only difference from a “centered” {U} event is that it is not associated with any particular slot on the timeline. For example, this could be the color of John’s eyes (if he does not get plastic surgery in the narrative) or his job (if he does not get fired). We also use this mechanism for generic events such as “people like coffee”, which can be conceptualized as occurring “all the time”.

Partially bounded events [U}, {U]

are a combination of the two types above, used when one endpoint of an event is known and the other is underspecified. For example, in the sentence John started working when Mary came in, the event of Mary’s coming in “anchors” the [U} event of John’s working, which lasts for the duration of her visit plus some underspecified time.

Figure 3: Partially bounded events

These three event types account for all possible ambiguities between events that can be placed on a coherent timeline. Vagueness due to events not being on the same timeline is handled with the branching mechanism (subsection 4.5).
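For concreteness, the five span types could be modeled with a small data structure. This is an illustrative sketch only, not part of the NarrativeTime tooling; the class and field names are our own:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class SpanType(Enum):
    BOUNDED = "[B]"         # start and end known relative to neighboring events
    UNBOUNDED = "{U}"       # overlaps an anchor event; both endpoints underspecified
    LEFT_BOUNDED = "[U}"    # start known, end underspecified
    RIGHT_BOUNDED = "{U]"   # start underspecified, end known
    IRREALIS = "[I]"        # not on the main (actual-events) timeline

@dataclass
class EventSpan:
    text: str
    span_type: SpanType
    # Numeric timeline slot; None for constant states, generics, and [I] spans.
    position: Optional[float] = None
```

A constant state such as "John was short" would then be an `EventSpan` with `position=None`, while an anchored {U} event carries the slot of its [B] anchor.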

NarrativeTime gives annotators freedom in the granularity of event order. For example, in a crime story it may be a crucial detail that Mary came in seconds after John did; in that case NarrativeTime treats the two events as separate and consecutive. In most contexts, however, such events would simply be treated as roughly simultaneous. Figure 4 shows how fine-grained relations like “includes” and “begins_by” can be treated as roughly simultaneous when the precise temporal location is not salient in the text, and how before/after relations apply with or without a temporal gap between neighboring events.

Figure 4: Coarse temporal relations in NarrativeTime

Original text: Two Travellers were on the road together, when a Bear suddenly appeared on the scene. Before he could see them, one made for a tree at the side of the road, and climbed up into the branches and hid there. The other was not so nimble as his companion; he threw himself on the ground and pretended to be dead. The Bear came up and sniffed all round him, but he kept perfectly still and held his breath: for they say that a bear will not touch a dead body. The Bear took him for a corpse, and went away.

Figure 5: NarrativeTime representation of the Two Travellers fable (excerpt)

4.3 Event clusters

Consistent with what is known about chunking and the role of scripts/frames in reading comprehension (section 3), NarrativeTime actively encourages the annotators to think in terms of event clusters rather than single events. Instead of defining an individual event, they can define a span containing several events with the same temporal position, which can be positioned on the timeline in a single action. All the necessary TLinks are then inferred in post-processing (see subsection 4.8).

In practice, we found the following types of clusters to be the most useful.

Clusters of roughly-simultaneous bounded events.

These are clusters in which the events are either clearly roughly-simultaneous, or their order does not matter for the purposes of the current narrative (e.g. John called, texted and left voicemails for Mary incessantly). In the annotation interface, a [B] span can be applied either to a single event or to multiple events.

Clusters of consecutive events.

Most often these are mini-scripts (John brushed his teeth and got dressed) or combinations of cause/effect and enabling/enabled events that could only happen in that order (John woke up and thought of Mary). We defined a special span type [S], which amounts to a sequence of [B]…[B] events.

Clusters of unbounded events.

Narratives often contain long descriptive sequences, such as “John was a short, fat man with a red face and a bald patch”.

Clusters considerably reduce the annotation effort by positioning several events at once. Moreover, they enable a more natural reading flow by visualizing the chunks that the annotators create in the comprehension process anyway. They should be easier to store in short-term memory while the annotator considers their potential placement, and they enable a high-level view of the plot of the narrative. In contrast, annotation based on event pairs forces the annotators to saccade between isolated pairs of events, relying only on re-reading and on their memory of the whole narrative for retrieving the (underspecified) temporal relations.

Note that annotators do not have to create identical event clusters; this would not be realistic, as everybody differs in their reading strategies. This is not a problem, since temporal relations are ultimately established for the pre-defined events in the clusters. Annotated files are post-processed to normalize the annotations before adjudication, merging and splitting differing spans to make them as comparable as possible, and the final representation conforms to the TimeML representation that is standard for current temporal reasoning systems.

4.4 Timeline representation

Given the event types introduced in subsection 4.2, and the ability to cluster them (subsection 4.3), the goal of the annotation task is to construct a timeline representation of a story. Since in NarrativeTime events can be bounded and unbounded, conventional temporal graph visualizations such as TDAG Bramsen et al. (2006) are not sufficient. We opted for a “multi-track” representation that shows timeline positions for [B] and [S] events in order of occurrence (for readability). Only two temporal relations are needed: roughly-simultaneous and before/after.

A small example is shown in Figure 5. This story contains 18 events, which even with minimal pairwise annotation of TLinks between adjacent events would require 34 TLinks. NarrativeTime’s combination of clusters of roughly-simultaneous and consecutive events constructs a full temporal representation of all TLinks between the 18 events in only 11 annotations, with 1–2 actions per annotation.

The last two events are kept separate to enable markup of information that does not come from the narrator (since characters can be mistaken or lie). For consistency and extended reasoning, we also enable timeline positions for implicit speech events.

4.5 Timeline branches

The kinds of vagueness about temporal relations that are encoded in the bounded/unbounded event span definitions (subsection 4.2) only help with events that are on the same coherent timeline. However, events often are not, even within the same realis type. Consider the following example:

John came back to New York. (…) John bought the ticket, had a quick coffee and headed to the movie theater. He had already read the book and he liked it. The movie started.

It is not clear whether John read the book before or after coming to NY, but we do know that he did it before watching the movie. NarrativeTime handles this by creating a branch on the main timeline; the relations between events in branches and events prior to the point of attachment are vague. A branch is defined as a mini-timeline, linked with a before/after relation to some location on the main timeline (Figure 6).

Note that with a little reasoning about how long it takes to get to a movie theater, and how long it takes to read a book, we can infer that the book was probably read before John bought the movie ticket. This is obviously a big source of disagreement, and we experimented with forcing the annotators to attach branches simply where they were mentioned. However, this goes against the natural reading comprehension process, and the annotators were not able to do that consistently. We believe this is one of the reasons why temporal annotation generally suffers from relatively low IAA.

Figure 6: Branching timelines in NarrativeTime
Figure 7: NarrativeTime annotation interface

4.6 Temporal expressions

NarrativeTime follows previous work Pustejovsky et al. (2005) in defining temporal expressions; we make no contribution in this area, and temporal expressions are pre-marked before the timeline annotation. What NarrativeTime does improve is their linking with events: annotators only need to include a temporal expression in the span of the event cluster that it anchors, so spans function as temporal containers Pustejovsky and Stubbs (2011). For example, if [John met Mary on Monday] is chosen as the event span, the meeting event is linked to Monday. This approach echoes treating temporal expressions as arguments of events, which Reimers et al. (2016) report to reduce annotation effort by 85% compared to TimeBank-Dense.

4.7 Annotation workflow

NarrativeTime comes with new web-based tools for annotation and adjudication, which will be released as open source upon the publication of the paper. A screenshot is provided in the supplementary materials.

The annotation workflow is designed to minimize the number of clicks. The event types described above can be activated with a button or a keyboard shortcut. As the annotators highlight events or event clusters in the text, the annotation list is automatically populated with consecutive integer timeline positions, so ordering a simple narrative is as easy as highlighting the event spans in chronological order. If anything needs to be reordered, the annotators only change the numbers in the Tml column. If a new event needs to be inserted between two existing events, this is done simply with an appropriate intermediate value (e.g. 4.5 between events at positions 4 and 5), with no need to reorder anything.

Discontinuous spans are handled by adding indexing characters to timeline positions: two events both at location 1 are interpreted as simultaneous events at position 1, while two spans both at location 1% are fragments of a single event at position 1. Multiple discontinuous events at the same location are also possible (using different indexing characters).
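The position notation (an integer or fractional slot, optionally followed by an indexing character) is simple enough to sketch a parser for. The exact format beyond the paper's examples is our assumption, and the function names are hypothetical:

```python
import re
from typing import Optional, Tuple

def parse_position(raw: str) -> Tuple[float, Optional[str]]:
    """Split a timeline position such as '4.5' or '1%' into
    (numeric slot, optional indexing character)."""
    m = re.fullmatch(r"(\d+(?:\.\d+)?)(\D?)", raw)
    if not m:
        raise ValueError(f"bad timeline position: {raw!r}")
    return float(m.group(1)), m.group(2) or None

def is_same_event(a: str, b: str) -> bool:
    """Fragments of one discontinuous event share both the slot and a
    non-empty indexing character; two plain positions with the same
    number are distinct simultaneous events."""
    (pa, ia), (pb, ib) = parse_position(a), parse_position(b)
    return pa == pb and ia is not None and ia == ib
```

Under this reading, `is_same_event("1%", "1%")` holds (one discontinuous event), while `is_same_event("1", "1")` does not (two simultaneous events).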

4.8 Post-processing

The goal of post-processing is to convert the chunk-style annotation of NarrativeTime to the ISO-standard TimeML format. This is achieved roughly as follows.

All events (pre-marked before temporal order annotation) are collected, and their clusters are identified. For each cluster, we infer any TLinks between its events that follow from the cluster definition (e.g. the events inside a cluster of consecutive events get before/after TLinks). For each pair of clusters, the appropriate TLinks between their constituent events are inferred from the timeline locations. Overlaps are interpreted as while, and everything else as before/after.

Events in branches have the vague relation to all events before (or after) the attachment point of the branch, and the before/after relation to all other events. The unbounded events for generic events and constant states receive the while TLink to every other event in the story (even events in a branch).

Any (pre-marked) temporal expressions get TLinks to the event(s) in the cluster in which they are found (defining the clusters appropriately is discussed in the guidelines). E.g. if “Wednesday” is mentioned inside a cluster, all events in that cluster are assumed to happen on Wednesday.
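The core of the pairwise inference from timeline locations can be sketched as follows. This is a deliberately simplified illustration assuming point-like bounded events, not the authors' actual post-processing code:

```python
from itertools import combinations
from typing import List, Tuple

def infer_tlinks(events: List[Tuple[str, float]]) -> List[Tuple[str, str, str]]:
    """events: (event_id, timeline_position) pairs for bounded events.
    Emits a dense set of pairwise TLinks: shared positions become
    'while' (roughly-simultaneous), everything else 'before', with
    each pair oriented in timeline order."""
    tlinks = []
    for (id1, p1), (id2, p2) in combinations(events, 2):
        if p1 == p2:
            tlinks.append((id1, "while", id2))
        elif p1 < p2:
            tlinks.append((id1, "before", id2))
        else:
            tlinks.append((id2, "before", id1))
    return tlinks
```

Three events at positions 1, 1, and 2 thus yield one while link and two before links, i.e. the full quadratic set that pair-based schemes would have to annotate by hand.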

5 Results

5.1 Inter-annotator agreement

We evaluate IAA on a batch of fiction texts totalling 3.5K words, with 474 events and 11,709 TLinks. Events were pre-marked according to ECB+ guidelines. (We experimented with automatic markup with the NewsReader NLP pipeline, but the results were unsatisfactory for fiction.) Before the start of the project, the annotators completed 3 training rounds (4 single-paragraph Aesop fables and 2 300-word fiction texts in total, with discussion and feedback sessions).

We found Cohen’s kappa for TLinks between all possible pairs of the 474 events to be 0.58, with 0.57 agreement on span types. These numbers indicate “moderate” IAA (“substantial” IAA starts at 0.6), and they are comparable or superior to all the pair-based schemes discussed in section 2 (except the Multi-Axis scheme). However, a direct comparison is not quite fair, because NarrativeTime is the only scheme that offers full human annotation of TLinks without computed transitive closure, so our annotators were performing a harder task. Consider also the significant variation in genres: one-paragraph crowdsourced stories Mostafazadeh et al. (2016), Wikipedia Araki et al. (2018), fables Kolomiyets et al. (2012), and news Cassidy et al. (2014a). Arguably, all of these are easier than modern fiction, which is full of flashbacks, modals, indirect and implicit speech, long descriptions, and other hard cases. (We cannot address these challenges here due to space limitations, but we outline our solutions in the guidelines, which will be released together with the corpus and the annotation tool upon publication.) Given the above, we believe that NarrativeTime is at least comparable to pair-based alternatives in IAA while being superior in speed and annotation density.
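The agreement figure is standard Cohen's kappa over the TLink labels the two annotators assign to the same list of event pairs; for reference, a minimal implementation of the formula κ = (p_o − p_e)/(1 − p_e):

```python
from collections import Counter
from typing import Sequence

def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators' labels over the same items."""
    assert len(labels_a) == len(labels_b) and len(labels_a) > 0
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)  # undefined when p_e == 1
```

For example, two annotators who agree on 3 of 4 TLink labels with balanced marginals score 0.5, well below their raw 75% agreement.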

Since IAA is computed on the standard TLinks and not on the spans themselves, NarrativeTime is not compromised by differences in the annotators’ reading/chunking strategies. (We observed that one of our annotators was more conservative and consistently preferred working with single events, while the other made bolder use of chunking; this is likely to change as they gain more experience with the tool.)

It is an open question how high an IAA on temporal ordering is achievable in principle, given the genuine differences in interpretation that are unavoidable as long as two people with different backgrounds process the same text. For NarrativeTime there is clearly some room for improvement in future work: the confusion matrix in Table 2 suggests that the major sources of confusion are bounded vs. unbounded events, and unbounded events vs. irrealis.

    | [B] | [I] | [U} | {U] | {U}
[B] | 146 |   4 |  10 |   6 |  14
[I] |   4 |  70 |   1 |   1 |  11
[U} |  11 |   0 |   8 |   2 |   9
{U] |   9 |   2 |   0 |   2 |   6
{U} |  16 |  21 |   8 |  10 | 103
Table 2: Confusion matrix for event types

5.2 Annotation speed

Ideally, a new annotation scheme would be compared to existing schemes in experiments with different teams of annotators. However, temporal annotation is a highly skilled task, and guidelines are typically over 30 pages long, which means that annotation quality (and therefore agreement) depends on both annotator training and experience.

Therefore, to conduct such a study one would need to provide multiple training sessions for several independent teams of annotators that have not previously worked with temporal annotation. However, no one research team could be relied upon to provide the same quality of training for schemes that they did not develop, which would make such a comparison inherently unfair. Perhaps this is why no study referenced in section 2 conducted such an experiment.

Alternatively, annotation speed can be crudely estimated as the number of actions required for ordering pre-marked events in a text. Such a comparison between pair-based and timeline-based annotation is shown in Table 3. For pair-based annotation, we assume that a link between two events can be created in a single click (which is rarely the case in practice) and that one more action is required to choose the TLink type; depending on the scheme and annotation tool, it may take many more clicks. For a fair comparison, we give counts for dense annotation of all TLinks between all events in a segment. Arguably, NarrativeTime provides both a much faster and a more intuitive way to annotate.

Narrative type | Pair-based annotation | NarrativeTime
2 consecutive events | 2 | 1
5 consecutive events (action narrative) | 20 | 1
5 non-consecutive events (flashbacks) | 20 | 5
5 states (a descriptive paragraph) | 20 | 1
a permanent state + 5 consecutive events | 30 | 2
Table 3: Number of actions for an annotation segment
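The pair-based column follows directly from the number of event pairs: with two actions per pair (create the link, choose its type), dense annotation of n events costs n(n−1) actions. A back-of-the-envelope check (the function name is ours):

```python
def pair_based_actions(n_events: int) -> int:
    """Dense pair-based annotation cost: one action to create each TLink
    and one to choose its type, over all n*(n-1)/2 event pairs."""
    return n_events * (n_events - 1)  # == 2 * C(n, 2)
```

This reproduces the pair-based column: 2 events cost 2 actions, 5 events cost 20, and a permanent state plus 5 consecutive events (6 events in total) costs 30, against a constant 1–5 actions for the timeline.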

Note that the structure of texts is often coherent: action narratives tend to list sequences of events, and descriptive paragraphs group many unbounded events together. NarrativeTime leverages this, positioning them all in a single click.

5.3 Annotation density

The timeline representation naturally forces the annotators to form a complete temporal representation of the text (explicitly marking up any irrealis spans that cannot be represented on the timeline). This makes it the densest annotation scheme available. As discussed in section 2, TimeBank-Dense Cassidy et al. (2014b) only focused on pairs of events in a given window, and for Do et al. (2012) “the annotator was not required to annotate all pairs of event mentions, but as many as possible”. All other temporal annotation projects we are aware of only process certain pairs of events, and attempt to infer the missing TLinks automatically.

5.4 Annotated corpus

We have completed annotation of 36 documents drawn from CC-licensed fiction texts. The resulting fiction corpus is comparable to TimeBank-Dense (which is roughly the same size, but is based on news, and is currently the de-facto standard training dataset in temporal information extraction) and will be released under an open-source license upon publication of this paper.

6 Conclusion

We present NarrativeTime, a new timeline-based annotation scheme for temporal order of events. While most current annotation schemes are based on TLinks between pairs of events, NarrativeTime offers an interactive timeline representation constructed as the annotators work through the text.

NarrativeTime introduces a novel way to handle underspecification by incorporating it into event type definitions rather than at a separate TLink layer. We also leverage the annotators' chunking processes to ease temporal ordering. The result is significantly faster and denser annotation: a representation of all possible TLinks in a 350-word text can be created in under 30 minutes, with IAA comparable or superior to that of alternative schemes.

NarrativeTime comes with a new fiction corpus comparable in size to TimeBank-Dense, thus effectively doubling the size and domain coverage of the current de-facto standard resource for temporal information extraction.


  • J. F. Allen (1984) Towards a general theory of action and time. Artificial intelligence 23 (2), pp. 123–154. Cited by: §4.2.
  • J. Araki, L. Mulaffer, A. Pandian, Y. Yamakawa, K. Oflazer, and T. Mitamura (2018) Interoperable Annotation of Events and Event Relations across Domains. In Proceedings 14th Joint ACL - ISO Workshop on Interoperable Semantic Annotation, pp. 10–20. External Links: Link Cited by: Table 1, §5.1.
  • S. Bethard, J. H. Martin, and S. Klingenstein (2007) Timelines from Text: Identification of Syntactic Temporal Relations. In International Conference on Semantic Computing (ICSC 2007), pp. 11–18. External Links: Document Cited by: §2.
  • S. Bethard, O. Kolomiyets, and M. Moens (2012) Annotating Story Timelines as Temporal Dependency Structures. In Language Resources and Evaluation Conference, pp. 2721–2726. Cited by: Table 1, §2.
  • P. Bramsen, P. Deshpande, Y. K. Lee, and R. Barzilay (2006) Inducing Temporal Graphs. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, EMNLP ’06, Stroudsburg, PA, USA, pp. 189–198. External Links: ISBN 978-1-932432-73-2, Link Cited by: §4.4.
  • T. Cassidy, B. McDowell, N. Chambers, and S. Bethard (2014a) An Annotation Framework for Dense Event Ordering. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 501–506. External Links: Document, Link Cited by: §5.1.
  • T. Cassidy, B. McDowell, N. Chambers, and S. Bethard (2014b) An Annotation Framework for Dense Event Ordering. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 501–506. External Links: Document, Link Cited by: Table 1, §1, §2, §5.3.
  • S. E. F. Chipman, M. Palmer, C. Bonial, and J. Hwang (2017) VerbNet: capturing English verb behavior, meaning, and usage. In The Oxford Handbook of Cognitive Science, S. E. F. Chipman (Ed.), External Links: Link, ISBN 9780199842193 Cited by: §4.2.
  • K. Christianson (2016) When language comprehension goes wrong for the right reasons: Good-enough, underspecified, or shallow language processing. The Quarterly Journal of Experimental Psychology 69 (5), pp. 817–828. External Links: ISSN 1747-0218, Document, Link Cited by: §3.
  • B. Claus (2012) Processing Narrative Texts: Melting Frozen Time?. Constraints in Discourse 3: Representing and Inferring Discourse Structure 223, pp. 17. Cited by: §3.
  • M. Coll-Florit and S. P. Gennari (2011) Time in language: event duration in language comprehension. Cognitive Psychology 62 (1), pp. 41–79 (eng). External Links: ISSN 1095-5623, Document Cited by: §3, §4.2.
  • A. Cybulska and P. Vossen (2014) Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), External Links: Link Cited by: §1.
  • A. Cybulska and P. Vossen (2015) Guidelines for ECB+ Annotation of Events and their Coreference. Technical report Technical Report NWR-2014-1, VU Amsterdam (en). External Links: Link Cited by: §1.
  • D. R. Dowty (1986) The effects of aspectual class on the temporal structure of discourse: semantics or pragmatics?. Linguistics and philosophy 9 (1), pp. 37–61. External Links: Link Cited by: §4.2.
  • C. Farag, V. Troiani, M. Bonner, C. Powers, B. Avants, J. Gee, and M. Grossman (2010) Hierarchical Organization of Scripts: Converging Evidence from fMRI and Frontotemporal Degeneration. Cerebral Cortex 20 (10), pp. 2453–2463 (en). External Links: ISSN 1047-3211, Document, Link Cited by: §3.
  • F. Ferreira, P. E. Engelhardt, P. Engelhardt, and M. W. Jones (2009) Good Enough Language Processing: A Satisficing Approach. Proceedings of the Annual Meeting of the Cognitive Science Society 31, pp. 413–418 (en). External Links: Link Cited by: §3.
  • L. T. Frase and B. J. Schwartz (1979) Typographical cues that facilitate comprehension.. Journal of Educational Psychology 71 (2), pp. 197–206 (en). External Links: ISSN 0022-0663, Document, Link Cited by: §3.
  • R. Ikuta, W. Styler, M. Hamang, T. O’Gorman, and M. Palmer (2014) Challenges of adding causation to richer event descriptions. In Proceedings of the Second Workshop on EVENTS: Definition, Detection, Coreference, and Representation, pp. 12–20. Cited by: Table 1.
  • O. Kolomiyets, S. Bethard, and M. Moens (2012) Extracting Narrative Timelines as Temporal Dependency Structures. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 88–97. External Links: Link Cited by: Table 1, §5.1.
  • B. Levin (1993) English verb classes and alternations: a preliminary investigation. University of Chicago Press, Chicago. External Links: ISBN 978-0-226-47532-5 978-0-226-47533-2, LCCN PE1271 .L48 1993 Cited by: §4.2.
  • A. Minard, M. Speranza, R. Urizar, M. van Erp, A. Schoen, and C. van Son (2016) MEANTIME, the NewsReader Multilingual Event and Time Corpus. In Language Resources and Evaluation Conference, pp. 4417–4422 (en). Cited by: Table 1, §1, §2.
  • J. B. Misyak, M. H. Christiansen, and J. B. Tomblin (2010) On-Line Individual Differences in Statistical Learning Predict Language Processing. Frontiers in Psychology 1 (English). External Links: ISSN 1664-1078, Document, Link Cited by: footnote 1.
  • M. Moens and M. Steedman (1988) Temporal ontology and temporal reference. Computational linguistics 14 (2), pp. 15–28. External Links: Link Cited by: §4.2.
  • N. Mostafazadeh, A. Grealish, N. Chambers, J. Allen, and L. Vanderwende (2016) CaTeRS: Causal and Temporal Relation Scheme for Semantic Annotation of Event Structures. In Proceedings of the Fourth Workshop on Events, pp. 51–61. External Links: Document, Link Cited by: Table 1, §5.1.
  • Q. Ning, H. Wu, and D. Roth (2018) A Multi-Axis Annotation Scheme for Event Temporal Relations. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1318–1328. External Links: Link Cited by: Table 1, §2.
  • T. O’Gorman, K. Wright-Bettner, and M. Palmer (2016) Richer Event Description: Integrating event coreference with temporal, causal and bridging annotation. In Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016), pp. 47–56. Cited by: Table 1.
  • T. O’Gorman (2017) RicherEventDescription: A living document with the guidelines for RED annotation, and any scripts useful for working with RED documents. External Links: Link Cited by: §1.
  • L. J. O’Shea and P. T. Sindelar (1983) The Effects of Segmenting Written Discourse on the Reading Comprehension of Low- and High-Performance Readers. Reading Research Quarterly 18 (4), pp. 458–465. External Links: ISSN 0034-0553, Document, Link Cited by: §3.
  • J. Pustejovsky, J. M. Castaño, R. Ingria, R. Saurí, R. J. Gaizauskas, A. Setzer, G. Katz, and D. R. Radev (2003a) TimeML: Robust Specification of Event and Temporal Expressions in Text. In New Directions in Question Answering, External Links: Link Cited by: §4.1.
  • J. Pustejovsky, P. Hanks, R. Saurí, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L. Ferro, and M. Lazo (2003b) The TIMEBANK Corpus. In Proceedings of Corpus Linguistics, pp. 647–656 (en). Cited by: §1.
  • J. Pustejovsky, B. Ingria, R. Sauri, J. Castano, J. Littman, R. Gaizauskas, A. Setzer, G. Katz, and I. Mani (2005) The specification language TimeML. The language of time: A reader, pp. 545–557. External Links: Link Cited by: Table 1, §2, §4.6.
  • J. Pustejovsky, K. Lee, H. Bunt, and L. Romary (2010) ISO-TimeML: An International Standard for Semantic Annotation. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC’10), External Links: Link Cited by: Table 1.
  • J. Pustejovsky and A. Stubbs (2011) Increasing Informativeness in Temporal Annotation. In Proceedings of the 5th Linguistic Annotation Workshop, pp. 152–160. External Links: Link Cited by: §4.6.
  • J. Pustejovsky (1991) The syntax of event structure. Cognition 41 (1), pp. 47–81. External Links: Link Cited by: §4.2.
  • J. M. Quinn, R. K. Wagner, Y. Petscher, and D. Lopez (2015) Developmental Relations Between Vocabulary Knowledge and Reading Comprehension: A Latent Change Score Modeling Study. Child Development 86 (1), pp. 159–175 (en). External Links: ISSN 1467-8624, Document, Link Cited by: footnote 1.
  • D. J. Rajendran, A. T. Duchowski, P. Orero, J. Martínez, and P. Romero-Fresco (2013) Effects of text chunking on subtitling: A quantitative and qualitative examination. Perspectives 21 (1), pp. 5–21 (en). External Links: ISSN 0907-676X, 1747-6623, Document, Link Cited by: §3.
  • K. Rayner and E. D. Reichle (2010) Models of the Reading Process. Wiley interdisciplinary reviews. Cognitive science 1 (6), pp. 787–799. External Links: ISSN 1939-5078, Document, Link Cited by: §3.
  • N. Reimers, N. Dehghani, and I. Gurevych (2016) Temporal Anchoring of Events for the TimeBank Corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2195–2204. External Links: Document, Link Cited by: §1.
  • [39] (2016-04) Rich ERE Annotation Guidelines Overview V4.2. Technical report Linguistic Data Consortium. External Links: Link Cited by: §1.
  • E. R. Schotter, R. Tran, and K. Rayner (2014) Don’t Believe What You Read (Only Once): Comprehension Is Supported by Regressions During Reading. Psychological Science 25 (6), pp. 1218–1226 (en). External Links: ISSN 0956-7976, Document, Link Cited by: §3.
  • A. Seigneuric, M. Ehrlich, J. V. Oakhill, and N. M. Yuill (2000) Working memory resources and children’s reading comprehension. Reading and Writing 13 (1), pp. 81–103 (en). External Links: ISSN 1573-0905, Document, Link Cited by: footnote 1.
  • W. L. Shebilske and L. S. Reid (1979) Reading Eye Movements, Macro-structure and Comprehension Processes. In Processing of Visible Language, P. A. Kolers, M. E. Wrolstad, and H. Bouma (Eds.), Nato Conference Series, pp. 97–110 (en). External Links: ISBN 978-1-4684-0994-9, Link, Document Cited by: §3.
  • C. S. Smith (1997) The parameter of aspect. 2. ed edition, Studies in Linguistics and Philosophy, Kluwer Academic Publ, Dordrecht (eng). External Links: ISBN 978-0-7923-4659-3 978-0-7923-4657-9 Cited by: §4.2.
  • Z. Song, A. Bies, S. Strassel, J. Ellis, T. Mitamura, H. T. Dang, Y. Yamakawa, and S. Holm (2016) Event Nugget and Event Coreference Annotation. In Proceedings of the Fourth Workshop on Events, pp. 37–45. External Links: Document, Link Cited by: §1.
  • Z. Song, A. Bies, S. Strassel, T. Riese, J. Mott, J. Ellis, J. Wright, S. Kulick, N. Ryant, and X. Ma (2015) From Light to Rich ERE: Annotation of Entities, Relations, and Events. In Proceedings of the 3rd Workshop on EVENTS at the NAACL-HLT 2015, pp. 89–98 (en). External Links: Document, Link Cited by: §1.
  • W. F. Styler, S. Bethard, S. Finan, M. Palmer, S. Pradhan, P. C. de Groen, B. Erickson, T. Miller, C. Lin, G. Savova, and J. Pustejovsky (2014) Temporal Annotation in the Clinical Domain. Transactions of the Association for Computational Linguistics 2, pp. 143–154. External Links: ISSN 2307-387X, Link Cited by: Table 1.
  • S. Tonelli, R. Sprugnoli, M. Speranza, and A. Minard (2014) NewsReader Guidelines for Annotation at Document Level. Technical report Technical Report NWR-2014-2-2, ICT 316404 (en). External Links: Link Cited by: Table 1.
  • N. UzZaman, H. Llorens, J. Allen, L. Derczynski, M. Verhagen, and J. Pustejovsky (2012) Tempeval-3: Evaluating events, time expressions, and temporal relations. arXiv preprint arXiv:1206.5333. External Links: Link Cited by: Table 1, §2.
  • E. van der Meer, R. Beyer, B. Heinze, and I. Badel (2002) Temporal order relations in language comprehension. Journal of Experimental Psychology. Learning, Memory, and Cognition 28 (4), pp. 770–779 (eng). External Links: ISSN 0278-7393 Cited by: §3.
  • M. van Erp, P. Vossen, R. Agerri, A. Minard, M. Speranza, R. Urizar, E. Laparra, I. Aldabe, and G. Rigau (2015) Annotated Data, version 2. Technical report Technical Report D3-3-2, VU Amsterdam (en). External Links: Link Cited by: Table 1.
  • Z. Vendler (1957) Verbs and Times. The Philosophical Review 66 (2), pp. 143. External Links: ISSN 00318108, Document, Link Cited by: §4.2.
  • M. Verhagen, R. Gaizauskas, F. Schilder, M. Hepple, G. Katz, and J. Pustejovsky (2007) SemEval-2007 Task 15: TempEval Temporal Relation Identification. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pp. 75–80. External Links: Link Cited by: Table 1, §2.
  • M. Verhagen, R. Gaizauskas, F. Schilder, M. Hepple, J. Moszkowicz, and J. Pustejovsky (2009) The TempEval challenge: identifying temporal relations in text. Language Resources and Evaluation 43 (2), pp. 161–179 (en). External Links: ISSN 1574-020X, 1574-0218, Document, Link Cited by: Table 1.
  • M. Verhagen, R. Sauri, T. Caselli, and J. Pustejovsky (2010) SemEval-2010 Task 13: TempEval-2. In Proceedings of the 5th International Workshop on Semantic Evaluation, Uppsala, Sweden, 15-16 July 2010, pp. 57–62 (en). External Links: Link Cited by: §2.
  • C. Walker, S. Strassel, J. Medero, and K. Maeda (2006) ACE 2005 Multilingual Training Corpus. External Links: Link Cited by: §1.
  • R. A. Zwaan (2016) Situation models, mental simulations, and abstract concepts in discourse comprehension. Psychonomic Bulletin & Review 23 (4), pp. 1028–1034 (en). External Links: ISSN 1531-5320, Document, Link Cited by: §3.