Extracting Temporal and Causal Relations between Events

04/27/2016 ∙ by Paramita Mirza, et al. ∙ 0

Structured information resulting from temporal information processing is crucial for a variety of natural language processing tasks, for instance to generate timeline summarizations of events from news documents, or to answer temporal/causal questions about events. In this thesis we present a framework for an integrated temporal and causal relation extraction system. We first develop a robust extraction component for each type of relation, i.e. temporal order and causality. We then combine the two extraction components into an integrated relation extraction system, CATENA---CAusal and Temporal relation Extraction from NAtural language texts---by utilizing the presumption about event precedence in causality: causing events must happen BEFORE resulting events. Several resources and techniques to improve our relation extraction systems are also discussed, including word embeddings and training data expansion. Finally, we report our efforts to adapt temporal information processing to languages other than English, namely Italian and Indonesian.







1 Motivations and Goals

Temporal Relations

TimeML is the annotation framework used in a series of evaluation campaigns for temporal information processing called TempEval (Verhagen et al., 2007; Verhagen et al., 2010; UzZaman et al., 2013), in which the ultimate goal is the automatic identification of temporal expressions, events and temporal relations within a text. In TempEval, the temporal information processing task is divided into several sub-problems. Given a text, the extraction task basically includes: (i) identifying temporal entities mentioned in the text and (ii) identifying the temporal relations between them. In this research, we take the best performing systems in TempEval as our baseline.

The best performing system for complete temporal information extraction achieves a 30.98% F1-score. According to the results reported in TempEval, the main limiting factor seems to be the low performance of temporal relation extraction systems (36.26% F1-score). This is the main reason why we focus our research on temporal relation extraction. Meanwhile, the extraction systems for temporal entities already achieve quite good results (>80% F1-scores). Therefore, and to limit the scope of our thesis, we assume that the annotation of temporal entities is already given.

In our attempt to improve the performance of the extraction system for temporal relations, we explore several research directions, which will be explained in the following paragraphs.

Causal Relations

A cause should always precede its effect. — Anonymous

The first research direction for improving the performance of temporal relation extraction is related to the connection between temporal and causal relations, based on the assumption that there is a temporal constraint in causality regarding event precedence. We aimed to investigate whether extracting causal relations between events can benefit temporal relation extraction. Apart from the efforts to improve the temporal relation extraction system, the recognition of causality between events is also crucial to reconstruct a causal chain of events in a story. This could be exploited, for example, in question answering systems, decision making support systems and for predicting future events given a chain of events. Having an integrated extraction system for both temporal and causal relations is one of the goals of this research.

Unfortunately, unlike for temporal relations, there was no corpus available for building (and evaluating) an automatic extraction system for event causality, specifically one that provides a comprehensive account of how causality can be expressed in a text without limiting the effort to specific connectives. This motivated us to build annotation guidelines for explicit causality in text, and to annotate the TimeBank corpus, in which gold annotated events and temporal relations were already present. The resulting causality corpus, which we called Causal-TimeBank, enabled the adaptation of existing temporal processing systems to the extraction of causal information, and made it easier for us to investigate the relation between temporal and causal information.

Word Embeddings

You shall know a word by the company it keeps. — J. R. Firth (1957)

Word embeddings and deep learning techniques are gaining momentum in NLP research, as they are seen as powerful tools for several NLP tasks, such as language modelling, relation extraction and sentiment analysis. A word embedding captures the semantics of a word in a low-dimensional vector, based on the distribution of the other words that occur around it.

In this research, we explored the effect of using lexical semantic information about event words, based on word embeddings, on temporal relation extraction between events. For example, we ask whether word embeddings can capture that an attack often happens BEFORE someone is injured.
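As an illustration of the intuition, cosine similarity over embedding vectors scores related event words higher than unrelated ones. The toy 4-dimensional vectors below are invented for this example, not taken from a trained model:

```python
import math

# Toy 4-dimensional embeddings (illustrative values, not from a trained model).
embeddings = {
    "attack":  [0.9, 0.1, 0.8, 0.0],
    "injured": [0.8, 0.2, 0.7, 0.1],
    "banana":  [0.0, 0.9, 0.1, 0.8],
}

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Semantically related event words should score higher than unrelated ones.
sim_related = cosine(embeddings["attack"], embeddings["injured"])
sim_unrelated = cosine(embeddings["attack"], embeddings["banana"])
```

In a real setting the vectors would come from a model trained on large corpora; the point here is only that vector similarity can serve as a feature for relating event words.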

Training Data Expansion

We don’t have better algorithms. We just have more data. — Google’s Research Director Peter Norvig

The scarcity of annotated data is often an issue in building extraction systems with a supervised learning approach. One widely known way to gain more training examples is semi-supervised learning, as for some NLP tasks it has been shown that unlabelled data, when used in conjunction with a small amount of labelled data, can produce considerable improvement in learning accuracy.

We investigated two approaches to expand the training data for temporal and causal relation extraction, namely (i) temporal reasoning on demand for temporal relation type classification and (ii) self-training, a wrapper method for semi-supervised learning, for causal relation extraction.


To have another language is to possess a second soul. — Charlemagne

Research on temporal information processing has been gaining a lot of attention from the NLP community, but most research efforts have focused only on English. In this research we explore the adaptation of our temporal information processing system for two languages other than English, i.e. Italian and Indonesian.

2 Contributions

The following contributions are presented in this thesis:


  • A hybrid approach for building an improved temporal relation extraction system, partly inspired by the sieve-based architecture of CAEVO (Chambers et al., 2014). Our approach is arguably more efficient than CAEVO, because (i) the temporal closure inference over extracted temporal relations is run only once and (ii) we use fewer classifiers overall.

  • Annotation guidelines for annotating explicit causality between events, strongly inspired by TimeML. Compared with existing attempts for annotating causality in text, we aim to provide a more comprehensive account of how causality can be expressed in a text, without limiting the effort to specific connectives.

  • An event causality corpus, Causal-TimeBank, is made available to the research community, to support evaluations or developments of supervised learning systems for extracting causal relations between events.

  • A hybrid approach for building an improved causal relation extraction system, making use of the constructed event causality corpus.

  • An integrated extraction system for temporal and causal relations, which exploits the assumption about event precedence when two events are connected by causality.

  • Preliminary results on how word embeddings can be exploited for temporal relation extraction.

  • An investigation into the impact of training data expansion for temporal and causal relation extraction.

  • A summary of our adaptation efforts of temporal information processing for Italian and Indonesian languages.

3 Structure of the Thesis

This thesis is organized as follows. In Chapter , we provide background information about natural language processing and information extraction, and discuss approaches widely used for information extraction tasks. Chapter  introduces the task of temporal information processing that comprises the TimeML annotation standard, annotated corpora and related evaluation campaigns. We also give a brief overview of state-of-the-art methods for extracting temporal information from text.

Chapter  focuses on our hybrid approach for building an improved temporal relation extraction system. In Chapter  we present annotation guidelines for explicit causality between events. We also provide some statistics from the resulting causality-annotated corpus, Causal-TimeBank, on the behaviour of causal cues in a text. Chapter  provides details on the hybrid approach for extracting causal relations between events from a text. In Chapter  we describe our approach for building an integrated system for both temporal and causal relations, making use of the assumption about the temporal constraint of causality.

Chapter  provides preliminary results on the effects of using word embeddings for extracting temporal relations between events. Chapter  discusses the impacts of our training data expansion approaches for temporal relation type classification and causal relation extraction. In Chapter  we address the multilinguality issue, by providing a review of our adaptation efforts of the temporal information processing task for Italian and Indonesian.

Finally, Chapter  discusses the lessons learned from this research work, and possible fruitful directions for future research.

4 Natural Language Processing

4.1 Morphological Analysis

Morphological analysis refers to the identification, analysis and description of the structure and formation of a given language's morphemes and other linguistic units, such as stems, affixes, parts-of-speech, intonations and stresses, or implied context. A morpheme is defined as the smallest meaningful unit of a language. Consider a word like unhappiness, containing three morphemes, each carrying a certain amount of meaning: un- means “not”, -ness means “being in a state or condition”, and happy carries the core meaning. Happy is a free morpheme, and is considered a root, because it can appear on its own. Bound morphemes, typically affixes, have to be attached to a free morpheme; thus, we cannot have sentences in English such as “Jason feels very un ness today”. Morphological analysis is a very important step for natural language processing, especially when dealing with morphologically complex languages.

Stemming and Lemmatization

A stem may be a root (e.g. run) or a word with derivational morphemes (e.g. the derived verb standard-ize). For instance, the root of destabilized is stabil- (i.e. a form of stable that does not occur alone), and the stem is de-stabil-ize, which includes the derivational affixes de- and -ize but not the inflectional past tense suffix -(e)d. In other words, a stem is the part of a word that inflectional affixes attach to. A lemma refers to the dictionary form of a word. A typical example is the set of word forms see, sees, seeing and saw, which all share the lemma see.
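The stem/lemma distinction can be illustrated with a deliberately crude Python sketch. The suffix list and lemma table below are toy assumptions; real systems use a proper stemmer (e.g. Porter's) or a dictionary-based lemmatizer:

```python
# Toy suffix-stripping "stemmer": strip the first matching suffix, provided
# enough of the word remains. Real stemmers apply ordered rewrite rules.
SUFFIXES = ["izing", "ized", "ize", "ing", "ness", "ed", "s"]

def crude_stem(word):
    """Strip the first matching derivational/inflectional suffix."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization maps inflected forms to a dictionary form, which suffix
# stripping alone cannot do for irregular forms like "saw".
LEMMAS = {"see": "see", "sees": "see", "seeing": "see", "saw": "see"}

def lemma(word):
    return LEMMAS.get(word, word)
```

Note that `crude_stem("destabilized")` yields the stem-like form "destabil", while only the dictionary lookup can map the irregular "saw" to its lemma "see".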

Part-of-Speech Tagging

In natural language, words are divided into two broad categories: open and closed classes. Open classes do not have a fixed word membership, and encompass nouns, verbs, adjectives and adverbs. Closed classes, by contrast, have a relatively fixed word membership. They include function words, such as articles, prepositions, auxiliary verbs and pronouns, which occur with high frequency in linguistic expressions. Part-of-speech (PoS) tagging is the problem of assigning each word in a sentence the part of speech that it assumes in that sentence, according to its lexical category (noun, verb, adjective, adverb, preposition, pronoun, etc.).

A PoS tagset specifies the set of PoS categories being distinguished and provides a list of tags used to denote each of those categories. Commonly used PoS tagsets include the Penn Treebank PoS Tagset (http://www.comp.leeds.ac.uk/amalgam/tagsets/upenn.html; Marcus et al., 1993; Santorini, 1993), and the British National Corpus (BNC) Tagset and the BNC Enriched Tagset (http://www.natcorp.ox.ac.uk/docs/gramtag.html; Leech et al., 1994). The difference between text annotated with the Penn Treebank PoS Tagset and the BNC (Basic) Tagset is exemplified in sentences (i) and (ii) below, respectively.

  1. I/PRP saw/VBD a/DT boy/NN with/IN a/DT dog/NN ./.

  2. I/PNP saw/VVD a/AT0 boy/NN1 with/PRP a/AT0 dog/NN1 ./PUN

There are a total of 48 tags in the Penn Treebank PoS Tagset, while the BNC Basic Tagset, also known as the C5 Tagset, distinguishes a total of 61 categories. Notably, the C5 Tagset includes separate categories for the various forms of the verbs be, do and have.

The Penn Treebank PoS Tagset is used in the Stanford CoreNLP tool suite (http://stanfordnlp.github.io/CoreNLP/; Manning et al., 2014). Meanwhile, the TextPro tool suite (http://hlt-services2.fbk.eu/textpro/; Pianta et al., 2008), which is the one mainly used in our research, employs the BNC Basic Tagset.
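For illustration, the word/TAG notation used in sentences (i) and (ii) can be split into (token, tag) pairs with a few lines of Python:

```python
# Parse the word/TAG notation into (token, tag) pairs. rsplit on the last
# "/" keeps the punctuation token "./." intact as (".", ".").
def parse_tagged(sentence):
    pairs = []
    for item in sentence.split():
        token, tag = item.rsplit("/", 1)
        pairs.append((token, tag))
    return pairs

penn = "I/PRP saw/VBD a/DT boy/NN with/IN a/DT dog/NN ./."
bnc  = "I/PNP saw/VVD a/AT0 boy/NN1 with/PRP a/AT0 dog/NN1 ./PUN"
```

Note how the same word "with" receives the Penn tag IN (preposition) but the BNC tag PRP, a tag which in the Penn tagset instead denotes a personal pronoun: tags are only interpretable relative to their tagset.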

4.2 Syntactic Analysis
Syntactic Parsing

Parsing means taking an input and producing some sort of linguistic structure for it. — Jurafsky and Martin (2000)

A syntactic parser takes a sentence as input and produces a syntactic structure that corresponds to a semantic interpretation of the sentence. For example, the sentence “I saw a boy with a dog” can be parsed in two different ways (Figure .1). This divergence is caused by two possible interpretations of the sentence. While both are grammatically correct, they reflect two different meanings: in (a) the phrase “a dog” is attached to “a boy”, meaning that the dog accompanied the boy; in (b) the phrase “a dog” is attached to “saw”, meaning that the dog was the instrument used to make the observation. The major challenge for a syntactic parser is to find the correct parse(s) among an exponential number of possible parses.

[.ROOT [.S [.NP [.PRP I ] ] [.VP [.VBD saw ] [.NP [.NP [.DT a ] [.NN boy ] ] [.PP [.IN with ] [.NP [.DT a ] [.NN dog ] ] ] ] ] ] ] ]


[.ROOT [.S [.NP [.PRP I ] ] [.VP [.VBD saw ] [.NP [.DT a ] [.NN boy ] ] [.PP [.IN with ] [.NP [.DT a ] [.NN dog ] ] ] ] ] ]

Figure .1: Two variants of syntactic parse trees for “I saw a boy with a dog”. The interpretation represented by (a) is the most likely semantic representation and means that “the boy” was “with a dog”.

In terms of its overall structure, the parse tree is always rooted at a node ROOT, with the terminal elements that relate to actual words in the sentence. Each of its sub-parses, or internal nodes, spans over several tokens, and is characterized by a set of syntactic types (e.g., NP and VP, which denote noun and verb phrases, resp.). The most important word in that span is called the head word. In this work we will also refer to syntactically dominant and governing verbs. Syntactically dominant verbs are the verbs that are located closer to the root of the entire parse tree. For a number of words in a textual span, governing verbs are the verbs in verb phrases that are the roots of the corresponding sub-trees. For example, for the sentence in Figure .1, the verb “saw” is the syntactically dominant verb of the sentence, and the governing verb for the textual span “a boy with a dog”.
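For illustration, the bracketed notation used in Figure .1 can be parsed into nested (label, children) structures with a short Python sketch (the parser is an assumption for this example, not part of a standard toolkit):

```python
# Parse "[.LABEL child1 child2 ]" bracket notation into nested
# (label, children) tuples; leaves are plain word strings.
def parse_tree(text):
    tokens = text.replace("]", " ] ").split()

    def parse(pos):
        tok = tokens[pos]
        if tok.startswith("[."):
            label, children, pos = tok[2:], [], pos + 1
            while pos < len(tokens) and tokens[pos] != "]":
                child, pos = parse(pos)
                children.append(child)
            return (label, children), pos + 1  # skip the closing "]"
        return tok, pos + 1                    # a leaf word

    tree, _ = parse(0)
    return tree

tree_a = parse_tree(
    "[.ROOT [.S [.NP [.PRP I ] ] [.VP [.VBD saw ] "
    "[.NP [.NP [.DT a ] [.NN boy ] ] [.PP [.IN with ] "
    "[.NP [.DT a ] [.NN dog ] ] ] ] ] ] ] ]"
)
```

Walking the resulting structure makes notions such as "the VP sub-tree rooted by the governing verb" concrete: the VBD node under the VP dominates "saw".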

(a) Stanford dependencies: root(ROOT, saw), nsubj(saw, I), dobj(saw, boy), det(boy, a), nmod(saw, dog), case(dog, with), det(dog, a)

(b) CoNLL-style dependencies: ROOT(saw), SBJ(saw, I), OBJ(saw, boy), NMOD(boy, a), ADV(saw, with), PMOD(with, dog), NMOD(dog, a)

Figure .2: Dependency trees for a sentence “I saw a boy with a dog”, using (a) Stanford CoreNLP tool suite and (b) Mate tools.
Dependency Parsing

In contrast to syntactic parsing, where the linguistic structure is formulated by the grammar that organizes a sentence's words into phrases, the word dependency formalism orders them according to binary dependency relations between the words (as between a head and a dependent). The word dependency formalism is often referenced as an effective means of representing the linguistic structure of languages with a relatively free word order.

Examples of dependency parses for the sentence “I saw a boy with a dog” are presented in Figure .2. There are several dependency representations, such as (a) Stanford (Typed) Dependencies (de Marneffe and Manning, 2008) used in Stanford CoreNLP, and (b) the CoNLL-2008 Shared Task Syntactic Dependencies (Surdeanu et al., 2008) used in Mate tools (https://code.google.com/archive/p/mate-tools/; Björkelund et al., 2010).

4.3 Information Extraction

Information extraction is a broad research field that uses computer algorithms to extract predefined structured information from natural language text, where elements of the structure relate to textual spans in the input. With the exception of temporal information processing, which will be explained further in Chapter , the different tasks of information extraction are listed in the following sections.

Named-Entity Recognition

Named-entity recognition is an information extraction task that categorizes single textual elements in a text according to a set of common criteria (persons, organizations, locations, times, numbers, etc.).

The violent clashes between the security forces and protesters have lasted [two days Date] in [Cairo Location] and other cities.

In the example, the textual span “two days” is identified and classified as an instance of Date, while the span of “Cairo” is identified and classified as an instance of Location.

Word-Sense Disambiguation

The task of word-sense disambiguation is to assign a label to every noun phrase, (non-auxiliary) verb phrase, adverb and adjective in a text. This label indicates the meaning of its attached word, and is chosen from a dictionary of meanings for a large number of phrases.

The [violent violent.01] [clashes clash.04] between the [security security.03] [forces force.01] and [protesters protester.02] have [lasted last.01] [two two.01] [days day.04] in [Cairo Cairo.02] and [other other.01] [cities city.01].

In the example, the meanings are assigned the labels of synsets in the WordNet lexical database Fellbaum-98, e.g., the word “clashes” receives the label clash.04 which means “fight” or “fighting”, whereas the most common sense clash.01 stands for “clang” or “noise”.

Semantic Role Labelling

Semantic Role Labelling (SRL) consists of the detection of the semantic arguments associated with the predicate or verb of a sentence, and their classification into their specific roles. For example, given a sentence like “Mary sold the car to John”, the task would be to recognize the verb “to sell” as the predicate, “Mary” as the seller (agent), “the car” as the goods (theme) and “John” as the recipient. The task is seen as an important step towards making sense of the meaning of a sentence, which is at a higher-level of abstraction than a syntactic tree. For instance, “The car has been sold by Mary to John” has a different syntactic form, but the same semantic roles.

The FrameNet project (Baker et al., 1998) produced the first major computational lexicon that systematically described many predicates and their corresponding roles. Gildea and Jurafsky (2002) developed the first automatic semantic role labelling system based on FrameNet. FrameNet additionally captures relationships between different frames, including among others:

Precedes, which captures a temporal order that holds between subframes of a complex scenario, and Causative_of, which expresses causality between frames.

Another project related to semantic role labelling is the PropBank project (Palmer et al., 2005), which added semantic role—or predicate-argument relation—annotations to the syntactic trees of the Penn Treebank corpus (Prasad et al., 2008). The PropBank annotation is exemplified in the following sentence:

[The violent clashes Arg1] between the security forces and protesters have [lasted last.01] [two days Arg2] in [Cairo and other cities Arg-Loc].

Here the verb “lasted” has a predicate label last.01, which means “extend for some period of time”. The related words have semantic roles:

  • Arg1 for “The violent clashes”, denoting thing that lasts

  • Arg2 for “two days”, denoting period of time

  • Arg-Loc for “Cairo and other cities”, denoting location

Coreference Resolution

Given a sentence or larger chunk of text, the task is to determine which words—mentions—refer to the same objects—entities. Anaphora resolution is a special case of this task, which is concerned with matching up pronouns with the nouns or names that they refer to.

Another typical coreference problem is to find links between previously-extracted named entities. For example, “International Business Machines” and “IBM” might refer to the same real-world entity. If we take the two sentences “M. Smith likes fishing. But he doesn’t like biking”, it would be beneficial to detect that “he” is referring to the previously detected person “M. Smith”.

Relationship Extraction

This task basically deals with the identification of relations between entities, including:

  • Compound noun relations: recognition of relations between two nouns.

  • (Geo)spatial analysis: recognition of trajectors, landmarks, frames of reference, paths, regions, directions and motions, and relations between them.

  • Discourse analysis: recognition of non-overlapping text spans and discourse relations between them.

5 Techniques for Information Extraction

5.1 Rule-based Methods

Rule-based methods are the earliest ones used in information extraction. A rule-based system makes use of a database of predefined and hand-crafted rules that specify knowledge, typically in the form of regular expressions. Regular expressions are a linguistic formalism based on a regular grammar, one of the simplest classes of formal language grammars (Chomsky, 1959).

Regular expressions are a declarative mechanism for specifying languages based on regular grammars. Regular grammars are recognized by a computational device called a finite state automaton (FSA). A finite state automaton (Hopcroft and Ullman, 1969) is a five-tuple (Q, q0, Σ, δ, F), where Q is a finite set of states, q0 ∈ Q is the initial state, Σ is a finite set of alphabet symbols, δ is a relation from states and alphabet symbols to states, and F ⊆ Q is a set of final states. The extension of δ to handle input strings is standard and denoted by δ*; δ*(q0, w) denotes the state reached from q0 on reading the string w. A string w is said to be accepted by an FSA A if δ*(q0, w) ∈ F. The language L(A) is the set of all strings accepted by A; strings that are not accepted by A are outside of L(A).
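As a minimal illustration of the five-tuple definition, the following Python sketch encodes a DFA over the alphabet {a, b} that accepts exactly the strings containing an even number of a's:

```python
# A concrete five-tuple (Q, q0, Sigma, delta, F): a DFA that accepts
# strings over {"a", "b"} with an even number of "a"s.
states = {"even", "odd"}          # Q
start = "even"                    # q0
alphabet = {"a", "b"}             # Sigma
delta = {                         # transition function: Q x Sigma -> Q
    ("even", "a"): "odd", ("even", "b"): "even",
    ("odd", "a"): "even", ("odd", "b"): "odd",
}
finals = {"even"}                 # F

def accepts(string):
    """Extended transition: run the DFA and check the final state."""
    state = start
    for symbol in string:
        if symbol not in alphabet:
            return False          # symbols outside Sigma are rejected
        state = delta[(state, symbol)]
    return state in finals
```

The empty string is accepted (zero is even), which matches the definition: δ*(q0, ε) = q0 ∈ F.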

Systems based on regular expressions are considered as rule-based systems in which knowledge about the domain is encoded in regular expressions. If the input string is accepted, i.e., it matches one of the regular expressions, it is labelled with a class label associated with that particular rule. In natural language processing, rule-based approaches were applied for, among others, tokenization—identifying the spans of single tokens in a text, stemming—finding the stem of a token, and Part-of-Speech (PoS) tagging.

time = ^((0?[0-9]|1[012])([:.][0-9]{2})?(\s?[ap]m)|([01]?[0-9]|2[0-3])([:.][0-9]{2})?)$
date = ^([1-9]|[1-2][0-9]|3[0-1])$

Figure .3: Regular expressions for extracting time and date in the POSIX Extended Regular Expression (ERE) syntax.

Figure .3 shows regular expression examples in the POSIX Extended Regular Expressions (ERE) syntax to extract time and date from a text.
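As an illustration, the time pattern of Figure .3 can be translated into Python's `re` syntax and used to validate candidate strings; this is a minimal sketch for the example, not the extraction component used in this thesis:

```python
import re

# The time pattern of Figure .3, in Python re syntax: a 12-hour time with
# an optional am/pm suffix, or a 24-hour time, optionally with minutes.
TIME = re.compile(
    r"^((0?[0-9]|1[012])([:.][0-9]{2})?(\s?[ap]m)"
    r"|([01]?[0-9]|2[0-3])([:.][0-9]{2})?)$"
)

def is_time(text):
    """True if the whole string is a recognizable clock time."""
    return TIME.match(text) is not None
```

In a rule-based extractor, a match against such a pattern assigns the span the class label associated with the rule (here, a time expression).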

Rule-based techniques have traditionally been popular, and have long been utilized for small-size applications and applications in new domains. However, with the development of large annotated corpora, machine learning techniques have grown increasingly popular, and their performance has frequently been compared with that of rule-based methods. These comparative studies have found that rule-based systems are very difficult to maintain and do not scale well. Nevertheless, there are problems which can only be solved by the rule-based approach. The main reasons to still employ rule-based systems are:


  • New, small or restricted application domains.

  • Short development time for a set of generally applicable and observable rules.

  • Absence of annotated training data.

  • Poor quality of training data.

5.2 Supervised Machine Learning

Since rule-writing requires enormous human effort, an easier approach is to use existing examples, i.e. annotations, to extract the rules automatically, or to use statistics to predict the labels of words, phrases, sentences or even entire documents. In the following sections we describe a number of state-of-the-art supervised machine learning methods currently used in the field of natural language processing and information extraction. The focus of supervised approaches in NLP has hitherto been largely limited to feature extraction (how an object under consideration is represented numerically as a vector of features) and to selecting appropriate machine learning methods.

Formal Definitions

In terms of supervised machine learning, the labelling task can be defined as follows: given a set of observations x1, ..., xn with their corresponding target class values y1, ..., yn ∈ Y, the goal is to predict the value of y for an unseen instance x. More formally, the task is to learn a prediction function f: X → Y, where each instance x is represented as a vector of feature values, i.e., x = (f1, f2, ..., fm), with m being the total number of features used in the representation. Depending on the number of distinct target values in Y, one distinguishes between binary (with two target values) and multi-class classification. In the following sections we describe the commonly used machine learning methods to model the prediction function f.

Support Vector Machines

Support Vector Machines (SVMs) (Cortes and Vapnik, 1995) are a well-known discriminative machine learning classifier that models the data as points in a high-dimensional space and spatially separates the classes as far as possible. Technically, an SVM constructs a hyperplane, which can be used for classification, regression or other tasks. The best separation of data points is achieved by the hyperplane that has the largest distance to the nearest data point of any class.

Figure .4: Support Vector Machines with two characteristic hyperplanes H1 and H2 (Burges, 1998). The data points that lie on the hyperplanes H1 and H2 are called support vectors (circled), satisfying yi(w · xi + b) = 1, where w is normal to the hyperplane, |b|/‖w‖ is the perpendicular distance from the hyperplane to the origin, and ‖w‖ is the Euclidean norm of w.

Formally, an SVM is defined as follows: given a set of observations x1, ..., xl with a corresponding set of labels y1, ..., yl, where yi ∈ {−1, 1}, the separating hyperplane that divides the data points in space can be defined as:

w · x + b = 0    (.1)

where w is the normal vector to the hyperplane, x is any point lying on the hyperplane, and · denotes the dot product (see Figure .4).

We can select two other hyperplanes H1 and H2 which also separate the data, defined as:

w · x + b = 1    (.2)

w · x + b = −1    (.3)

so that the separating hyperplane is equidistant from H1 and H2, taking into consideration the constraint that no data point lies between the two hyperplanes. Equations (.2) and (.3) can be combined into a single constraint:

yi(w · xi + b) ≥ 1    (.4)

The optimal hyperplane is the unique one that separates the training data with a maximal margin, i.e., the distance 2/‖w‖ between the two hyperplanes H1 and H2 is maximal. This means that the optimal hyperplane is the unique one that minimizes ½‖w‖² under the constraint (.4).

Consider the case where the training data cannot be separated without error. In this case one may want to separate the training set with a minimal number of errors. To express this formally, some non-negative slack variables ξi ≥ 0, i = 1, ..., l, are introduced. The problem of finding the optimal soft-margin hyperplane is then defined as:

min over w, b, ξ:  ½‖w‖² + C Σ(i=1..l) ξi
subject to  yi(w · φ(xi) + b) ≥ 1 − ξi,  ξi ≥ 0    (.5)

where the training vectors xi are mapped to a higher dimensional space by the function φ, and C > 0 is the penalty parameter of the error term. Furthermore, K(xi, xj) = φ(xi) · φ(xj) is called the kernel function. Though new kernels are continually being proposed by researchers, the basic kernels include:

  • linear: K(xi, xj) = xi · xj

  • polynomial: K(xi, xj) = (γ(xi · xj) + r)^d, γ > 0

  • radial basis function (RBF): K(xi, xj) = exp(−γ‖xi − xj‖²), γ > 0

  • sigmoid: K(xi, xj) = tanh(γ(xi · xj) + r)

The earliest implementation used for SVM multi-class classification is probably the one-against-all method. It constructs k SVM models, where k is the number of classes. The ith SVM is trained with all of the examples in the ith class given positive labels, and all other examples given negative labels. Thus, given training data (x1, y1), ..., (xl, yl), where xj ∈ R^n and yj ∈ {1, ..., k} is the class of xj, the ith SVM solves the following problem:

min over w^i, b^i, ξ^i:  ½(w^i · w^i) + C Σ(j=1..l) ξ^i_j
subject to  w^i · φ(xj) + b^i ≥ 1 − ξ^i_j  if yj = i,
            w^i · φ(xj) + b^i ≤ −1 + ξ^i_j  if yj ≠ i,  ξ^i_j ≥ 0    (.6)

After solving (.6), there are k decision functions w^1 · φ(x) + b^1, ..., w^k · φ(x) + b^k. We say x is in the class which has the largest value of the decision function:

class of x ≡ argmax over i = 1, ..., k of (w^i · φ(x) + b^i)    (.7)

Another major method is the one-against-one method, introduced by Knerr et al. (1990). This method constructs k(k − 1)/2 classifiers, where each one is trained on data from two classes. For the training data from the ith and the jth classes, we solve the following binary classification problem:

min over w^ij, b^ij, ξ^ij:  ½(w^ij · w^ij) + C Σt ξ^ij_t
subject to  w^ij · φ(xt) + b^ij ≥ 1 − ξ^ij_t  if yt = i,
            w^ij · φ(xt) + b^ij ≤ −1 + ξ^ij_t  if yt = j,  ξ^ij_t ≥ 0    (.8)

There are different methods for doing the testing after all k(k − 1)/2 classifiers are constructed. For instance, the following voting strategy suggested by Friedman (1996) may be used: if the classifier for classes i and j says x is in the ith class, then the vote for the ith class is increased by one; otherwise, the vote for the jth class is increased by one. We then predict x to be in the class with the largest vote. This voting approach is also called the “Max Wins” strategy. In case two classes have identical votes, the one with the smaller index is usually selected, though this may not be a good strategy.
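The Max Wins voting step can be sketched in Python; the pairwise `decide` function below is a hypothetical nearest-prototype stand-in for a trained binary SVM, invented for this example:

```python
from collections import Counter
from itertools import combinations

# "Max Wins" voting for one-against-one multi-class classification.
# decide(i, j, x) plays the role of the binary classifier trained on
# classes i and j.
def max_wins(x, classes, decide):
    votes = Counter({c: 0 for c in classes})
    for i, j in combinations(classes, 2):
        votes[decide(i, j, x)] += 1
    # Ties are broken in favour of the smaller class index, as noted above.
    return max(sorted(classes), key=lambda c: votes[c])

# Toy pairwise classifier: pick whichever of the two classes has the
# nearer prototype point (a stand-in, not a trained SVM).
prototypes = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 1.0)}

def decide(i, j, x):
    def dist2(c):
        px, py = prototypes[c]
        return (x[0] - px) ** 2 + (x[1] - py) ** 2
    return i if dist2(i) <= dist2(j) else j
```

With k classes, `max_wins` runs the k(k − 1)/2 pairwise classifiers and returns the class with the most votes.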

Logistic Regression

Logistic regression (the explanations in this section follow Tom Mitchell's draft chapter, http://www.cs.cmu.edu/~tom/NewChapters.html) is an approach to learning functions of the form f: X → Y, or P(Y|X) in the case where Y is discrete-valued, and X = (X1, ..., Xn) is any vector containing discrete or continuous variables.

Figure .5: Form of the logistic function. In Logistic Regression, P(Y|X) is assumed to follow this form.

Logistic Regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data. The parametric model assumed by Logistic Regression in the case where Y is boolean is:

P(Y = 1|X) = 1 / (1 + exp(w0 + Σ(i=1..n) wi Xi))    (.9)

P(Y = 0|X) = exp(w0 + Σ(i=1..n) wi Xi) / (1 + exp(w0 + Σ(i=1..n) wi Xi))    (.10)

Note that equation (.10) follows directly from equation (.9), because the sum of these two probabilities must equal 1.

One highly convenient property of this form for P(Y|X) is that it leads to a simple linear expression for classification. To classify any given X we generally want to assign the value yk that maximizes P(Y = yk|X). Put another way, we assign the label Y = 0 if the following condition holds:

1 < P(Y = 0|X) / P(Y = 1|X)    (.11)

Substituting from equations (.9) and (.10), this becomes

1 < exp(w0 + Σ(i=1..n) wi Xi)    (.12)

and taking the natural log of both sides we have a linear classification rule that assigns label Y = 0 if X satisfies

0 < w0 + Σ(i=1..n) wi Xi    (.13)

and assigns Y = 1 otherwise.

One reasonable approach to training a logistic regression model is to choose parameter values that maximize the conditional data likelihood. The conditional data likelihood is the probability of the observed Y values in the training data, conditioned on their corresponding X values. We choose parameters W that satisfy

W ← argmax over W of Π(l) P(Y^l | X^l, W)    (.14)

where W = (w0, w1, ..., wn) is the vector of parameters to be estimated, Y^l denotes the observed value of Y in the lth training example, and X^l denotes the observed value of X in the lth training example. The expression to the right of the argmax is the conditional data likelihood. Equivalently, we can work with the log of the conditional likelihood:

W ← argmax over W of Σ(l) ln P(Y^l | X^l, W)    (.15)
Above we considered using Logistic Regression to learn P(Y|X) only for the case where Y is a boolean variable, i.e. binary classification. If Y can take on any of the discrete values {y_1, …, y_K}, then the form of P(Y = y_k|X) for k < K is:

P(Y = y_k|X) = exp(w_{k0} + Σ_{i=1..n} w_{ki} X_i) / (1 + Σ_{j=1..K−1} exp(w_{j0} + Σ_{i=1..n} w_{ji} X_i))

When k = K, it is

P(Y = y_K|X) = 1 / (1 + Σ_{j=1..K−1} exp(w_{j0} + Σ_{i=1..n} w_{ji} X_i))

Here w_{ji} denotes the weight associated with the j-th class Y = y_j and with input X_i. It is easy to see that our earlier expressions for the case where Y is boolean (equations (.9) and (.10)) are a special case of the above expressions. Note also that the form of the expression for P(Y = y_K|X) assures that the K probabilities sum to 1.

The primary difference between these expressions and those for boolean Y is that when Y takes on K possible values, we construct K − 1 different linear expressions to capture the distributions P(Y = y_k|X) for the different values of Y. The distribution for the final, K-th, value of Y is simply one minus the probabilities of the first K − 1 values.
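The training procedure described above can be sketched in a few lines of plain Python. This is an illustrative implementation, not the system used in the thesis; note that it follows the common convention P(Y = 1|X) = 1/(1 + e^(−z)), which differs from the text's equations (.9)–(.10) only by a sign flip of the weights. The function names `train_logistic` and `classify` are ours.

```python
import math

def train_logistic(examples, labels, lr=0.1, epochs=200):
    """Fit weights w_0..w_n by stochastic gradient ascent on the
    conditional log-likelihood  sum_l ln P(y^l | x^l, W)  for boolean Y."""
    n = len(examples[0])
    w = [0.0] * (n + 1)  # w[0] is the intercept w_0
    for _ in range(epochs):
        for x, y in zip(examples, labels):
            # P(Y=1|X) under the logistic form
            z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            p1 = 1.0 / (1.0 + math.exp(-z))
            # gradient of the log-likelihood w.r.t. each weight: (y - P(Y=1|X)) * x_i
            err = y - p1
            w[0] += lr * err
            for i in range(n):
                w[i + 1] += lr * err * x[i]
    return w

def classify(w, x):
    # linear decision rule: predict 1 iff w_0 + sum_i w_i x_i > 0
    z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if z > 0 else 0
```

On linearly separable data, the learned decision boundary is exactly the linear rule derived above.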

5.3 Hybrid Approaches

Hybrid approaches are another kind of method employed in natural language processing, which combine rule-based with machine learning methods. Hybrid approaches are considered a reasonable solution for a number of problems in which the training data exhibit irregularities and exceptions.



Figure .6: Examples of hybrid architecture for information processing.

Figure .6 exemplifies two hybrid architectures: (a) a concurrent information processing pipeline in which different tasks are performed by either rule-based or statistical methods, and (b) an information processing pipeline in which the output of one family of methods is used as input for the other. Hybrid approaches are very popular in NLP applications such as machine translation, parsing, information extraction, etc. schaefer2007 provides a good overview of integrating deep and shallow NLP components into hybrid architectures.

5.4 Semi-supervised Machine Learning

As the name suggests, semi-supervised learning (we take the explanations about semi-supervised learning from Zhu:2009:ISL:1717872) is somewhere between unsupervised and supervised learning. In fact, most semi-supervised learning strategies are based on extending either unsupervised or supervised learning to include additional information typical of the other learning paradigm. Specifically, semi-supervised learning encompasses several different settings, including:

  • Semi-supervised classification. Also known as classification with labelled and unlabelled data (or partially labelled data), this is an extension to the supervised classification problem. The training data consists of both labelled instances (x_1, y_1), …, (x_l, y_l) and unlabelled instances x_{l+1}, …, x_{l+u}. One typically assumes that there is much more unlabelled data than labelled data, i.e., u ≫ l. The goal of semi-supervised classification is to train a classifier from both the labelled and unlabelled data, such that it is better than the supervised classifier trained on the labelled data alone.

  • Constrained clustering. This is an extension to unsupervised clustering. The training data consists of unlabelled instances x_1, …, x_n, as well as some “supervised information” about the clusters. For example, such information can be so-called must-link constraints, stating that two instances x_i and x_j must be in the same cluster, and cannot-link constraints, stating that x_i and x_j cannot be in the same cluster. One can also constrain the size of the clusters. The goal of constrained clustering is to obtain better clustering than the clustering from unlabelled data alone.

Semi-supervised learning has tremendous practical value. In many tasks, there is a paucity of labelled data. The labels may be difficult to obtain because they require human annotators, special devices, or expensive and slow experiments. In this thesis, we will focus on a simple semi-supervised classification model: self-training.


Self-training

Self-training is characterized by the fact that the learning process uses its own predictions to teach itself. For this reason, it is also called self-teaching or bootstrapping (not to be confused with the statistical procedure of the same name). Self-training can be either inductive or transductive, depending on the nature of the predictor f. The algorithm for self-training is as follows:

0:  labelled data (X_l, Y_l), unlabelled instances X_u.
1:  Initially, let L = (X_l, Y_l) and U = X_u.
2:  repeat
3:     Train f from L using supervised learning.
4:     Apply f to the unlabelled instances in U.
5:     Remove a subset S from U; add S, with the labels predicted by f, to L.
6:  until U is empty.

The main idea is to first train f on the labelled data. The function f is then used to predict the labels for the unlabelled data. A subset S of the unlabelled data, together with their predicted labels, is then selected to augment the labelled data. Typically, S consists of the few unlabelled instances with the most confident predictions. The function f is re-trained on the now larger set of labelled data, and the procedure repeats. It is also possible for S to be the whole unlabelled data set. In this case, L and U remain the whole training sample, but the labels assigned to the unlabelled instances might vary from iteration to iteration.
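The algorithm above can be sketched as a generic wrapper. This is an illustrative sketch: the learner interface (`fit` returning a model, `predict` returning a (label, confidence) pair) is a hypothetical one we assume here, precisely because self-training leaves the choice of learner open.

```python
def self_train(fit, predict, L, U, per_iter=5):
    """Self-training wrapper. `fit(L)` trains a model on labelled pairs
    (x, y); `predict(model, x)` returns (label, confidence). The wrapper
    never looks inside the learner, which is what makes it a wrapper method."""
    L, U = list(L), list(U)
    while U:
        model = fit(L)                                   # step 3: train f from L
        scored = [(predict(model, x), x) for x in U]     # step 4: apply f to U
        scored.sort(key=lambda t: t[0][1], reverse=True)
        # step 5: move the most confident subset S, with predicted labels, into L
        for (label, conf), x in scored[:per_iter]:
            L.append((x, label))
            U.remove(x)
    return fit(L)
```

Any supervised learner that exposes a confidence score can be plugged in; for instance, a nearest-centroid classifier over one-dimensional points works as a minimal test case.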

Self-Training Assumption The assumption of self-training is that its own predictions, at least the high confidence ones, tend to be correct.

The major advantages of self-training are its simplicity and the fact that it is a wrapper method. This means that the choice of the learner f in step 3 is left completely open. The self-training procedure “wraps” around the learner without changing its inner workings. This is important for many real-world tasks related to natural language processing, where the learners can be complicated black boxes not amenable to changes.

On the other hand, it is conceivable that an early mistake made by f (which is not perfect to start with, due to a small initial L) can reinforce itself by generating incorrectly labelled data. Re-training with this data will lead to an even worse f in the next iteration.

5.5 Word Embeddings

Image and audio processing systems typically work with rich, high-dimensional datasets encoded as vectors, e.g., the individual raw pixel intensities for image data, or power spectral density coefficients for audio data. For tasks like object or speech recognition we know that all the information required to successfully perform the task is encoded in the data. However, natural language processing systems traditionally treat words as discrete atomic symbols, and therefore provide no useful information to the system regarding the relationships that may exist between the individual symbols. This means that a model can leverage very little of what it has learned about cat when it is processing data about dog, for instance, that they are both animals, four-legged, pets, and so on. This kind of representation can lead to data sparsity, and usually means that we need more data in order to successfully train statistical models. Vector representations of words can overcome these obstacles.

It has been shown that for words in the same language, the more often two words can be substituted into the same contexts the more similar in meaning they are judged to be miller1991. This phenomenon that words that occur in similar contexts tend to have similar meanings has been widely known as Distributional Hypothesis harris54, which can be stated in the following way:

Distributional Hypothesis The degree of semantic similarity between two linguistic expressions A and B is a function of the similarity of the linguistic contexts in which A and B can appear.

This hypothesis is the core behind the application of vector-based models for semantic representation of words, which are variously known as word space sahlgren2006, semantic spaces mitchell2010, vector space models (VSMs) turney2010 or distributional semantic models (DSMs) baroni2010.

To better illustrate the distributional hypothesis, consider an unknown word such as wampimuk, occurring in these two sentences: (1) He filled the wampimuk, passed it around and we all drank some, and (2) We found a little, hairy wampimuk sleeping behind the tree. We could infer that the meaning of wampimuk is either ‘cup’ or ‘animal’, depending heavily on whether its context is sentence (1) or (2), respectively.

The different approaches that leverage this principle can be divided into two categories baroni-dinu-kruszewski:2014:P14-1: (i) count-based models, e.g. Latent Semantic Analysis (LSA) pa:deerwester90indexing, and (ii) predictive models, e.g. neural probabilistic language models journals/corr/abs-1301-3781.

Count-based models

Count-based models compute the statistics of how often some word co-occurs with its neighbouring words in a large text corpus, and then map these count-statistics down to a small, dense vector for each word sahlgren2006,pado2007, bullinaria2007,agirre-EtAl:2009:NAACLHLT09, baroni2010.
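The first half of this recipe, collecting raw co-occurrence counts, can be sketched directly (leaving out the dimensionality-reduction step). This is an illustrative toy, not any of the cited systems; the function names `cooccurrence_vectors` and `cosine` are ours.

```python
import math
from collections import Counter

def cooccurrence_vectors(tokens, window=2):
    """Map each word to a sparse vector of counts of its neighbours
    within a symmetric context window."""
    vectors = {}
    for i, w in enumerate(tokens):
        ctx = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        vectors.setdefault(w, Counter()).update(ctx)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Even on a toy corpus, words that share contexts (e.g. two nouns appearing in the same frames) end up closer in this space than words that do not, which is the distributional hypothesis in miniature.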

One widely known algorithm that falls under this category is GloVe (we take the explanations about GloVe from http://nlp.stanford.edu/projects/glove/) pennington2014glove. GloVe is essentially a log-bilinear model with a weighted least-squares objective. The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. For example, consider the co-occurrence probabilities for target words ice and steam with various probe words from the vocabulary. Here are some actual probabilities from a 6 billion word corpus:

Probability and Ratio | k = solid | k = gas | k = water | k = fashion
P(k | ice) | 1.9 × 10^-4 | 6.6 × 10^-5 | 3.0 × 10^-3 | 1.7 × 10^-5
P(k | steam) | 2.2 × 10^-5 | 7.8 × 10^-4 | 2.2 × 10^-3 | 1.8 × 10^-5
P(k | ice) / P(k | steam) | 8.9 | 8.5 × 10^-2 | 1.36 | 0.96

As one might expect, ice co-occurs more frequently with solid than it does with gas, whereas steam co-occurs more frequently with gas than it does with solid. Both words co-occur with their shared property water frequently, and both co-occur with the unrelated word fashion infrequently. Only in the ratio of probabilities does noise from non-discriminative words like water and fashion cancel out, so that large values (much greater than 1) correlate well with properties specific to ice, and small values (much less than 1) correlate well with properties specific to steam. In this way, the ratio of probabilities encodes some crude form of meaning associated with the abstract concept of thermodynamic phase.

The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words’ probability of co-occurrence. Owing to the fact that the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space. Because these ratios can encode some form of meaning, this information gets encoded as vector differences as well.
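A minimal sketch of this objective, assuming the weighted least-squares form J = Σ_ij f(X_ij)(w_i·w̃_j + b_i + b̃_j − log X_ij)² from the GloVe paper, can be written as plain SGD over the non-zero co-occurrence counts. The function name `glove_step` and the tiny dimensions are ours; a real implementation would use AdaGrad and sparse matrices.

```python
import math

def glove_step(W, W_ctx, b, b_ctx, cooc, lr=0.05, x_max=100, alpha=0.75):
    """One SGD pass over non-zero co-occurrence counts, descending on the
    weighted least-squares objective
        sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 ."""
    for (i, j), x in cooc.items():
        f = min(1.0, (x / x_max) ** alpha)  # weighting function f(X_ij)
        diff = (sum(a * c for a, c in zip(W[i], W_ctx[j]))
                + b[i] + b_ctx[j] - math.log(x))
        g = 2.0 * f * diff
        for d in range(len(W[i])):
            wi, wj = W[i][d], W_ctx[j][d]  # save before updating
            W[i][d] -= lr * g * wj
            W_ctx[j][d] -= lr * g * wi
        b[i] -= lr * g
        b_ctx[j] -= lr * g
```

Repeated calls drive each dot product w_i·w̃_j (plus biases) towards log X_ij, which is exactly the property the text describes: differences of word vectors come to encode (logarithms of) co-occurrence ratios.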

Predictive models

In predictive models, instead of first collecting context vectors and then re-weighting these vectors based on various criteria, the vector weights are directly set to optimally predict the contexts in which the corresponding words tend to appear Bengio:2003:NPL:944919.944966,Collobert:2008:UAN:1390156.1390177,Collobert:2011:NLP:1953048.2078186,huang-EtAl:2012:ACL20122,journals/corr/abs-1301-3781,turian-ratinov-bengio:2010:ACL.

Word2Vec journals/corr/abs-1301-3781 (we take the explanations about Word2Vec from http://www.tensorflow.org/versions/r0.7/tutorials/word2vec/) is a particularly computationally-efficient predictive model for learning word embeddings from raw text. It comes in two flavours: the Continuous Bag-of-Words model (CBOW) and the Skip-Gram model. Algorithmically, these models are similar, except that CBOW predicts target words (‘mat’) from source context words (‘the cat sits on the’), whereas skip-gram does the inverse and predicts source context words from the target words. This inversion might seem like an arbitrary choice, but statistically it has the effect that CBOW smoothes over a lot of the distributional information, by treating an entire context as one observation. For the most part, this turns out to be a useful feature for smaller datasets. On the other hand, skip-gram treats each context-target pair as a new observation, and this tends to do better when we have larger datasets.

Neural probabilistic language models are traditionally trained using the maximum likelihood (ML) principle to maximize the probability of the target word w_t given the previous words (history) h in terms of a softmax function:

P(w_t | h) = softmax(score(w_t, h)) = exp(score(w_t, h)) / Σ_{w' in V} exp(score(w', h))

where score(w_t, h) computes the compatibility of word w_t with the context h, typically using a dot product. The model is trained by maximizing its log-likelihood on the training set, i.e. by optimizing:

J_ML = log P(w_t | h) = score(w_t, h) − log Σ_{w' in V} exp(score(w', h))

This yields a properly normalized probabilistic model for language modelling. However, it is very expensive, because we need to compute and normalize each probability using the scores for all other words w' in the current context h, at every training step.

In Word2Vec, a full probabilistic model is not needed. Instead, the CBOW and skip-gram models are trained using a binary classification objective (logistic regression) to discriminate the real target words w_t from k imaginary (noise) words w̃ in the same context. Mathematically, the objective is to maximize:

J_NEG = log Q_θ(D = 1 | w_t, h) + k · E_{w̃ ∼ P_noise}[log Q_θ(D = 0 | w̃, h)]

where Q_θ(D = 1 | w, h) is the binary logistic regression probability under the model of seeing the word w in the context h in the dataset D, calculated in terms of the learned embedding vectors θ. The expectation is approximated by drawing k contrastive words from the noise distribution.

This objective is maximized when the model assigns high probabilities to the real words, and low probabilities to noise words. Technically, this is called Negative Sampling, and there is a good mathematical motivation for using this loss function: the updates it proposes approximate the updates of the softmax function in the limit. Computationally it is especially appealing because computing the loss function now scales only with the number of noise words that are selected (k), instead of with the size of the vocabulary.
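The negative-sampling objective for a single (target, context) pair can be sketched as follows, modelling Q(D = 1 | w, h) as a sigmoid of a dot product. This is an illustrative sketch; the function name `neg_sampling_objective` is ours, and a real trainer would also backpropagate through this quantity.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_sampling_objective(target_vec, context_vec, noise_vecs):
    """Negative-sampling objective for one training pair:
        log Q(D=1 | w_t, h) + sum_k log Q(D=0 | w~_k, h)
    where Q(D=1 | w, h) = sigmoid(w . h). Higher is better."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    pos = math.log(sigmoid(dot(target_vec, context_vec)))
    neg = sum(math.log(1.0 - sigmoid(dot(n, context_vec))) for n in noise_vecs)
    return pos + neg
```

The cost of evaluating this is proportional to the number of noise vectors passed in, not to the vocabulary size, which is the computational advantage described above.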

6 Modelling Temporal Information

In NLP, the definition of an event can vary depending on the target application. In topic detection and tracking allan2002, the term event is used interchangeably with topic, which describes something that happens and is usually used to identify a cluster of documents, e.g., Olympics, wars. On the other hand, information extraction provides a finer granularity of event definitions, in which events are entities that happen/occur within the scope of a document.

Events, especially within a narrative text, are naturally anchored to temporal attributes, which are often expressed with time expressions such as ‘two days ago’ or ‘Friday the ’. However, an event can also have non-temporal attributes such as event participants and the location where the event took place. Here is where event modelling plays its part in automatic event extraction, to define the structure of events one wants to extract from a text.

There are several annotation frameworks for events and time expressions that can be viewed as event models, TimeML pustejovsky2003 and ACE ldc2005 being the prominent ones. There are other event models based on web ontology (RDFS+OWL) such as LODE shaw2009, SEM vanhage2011 and DOLCE gangemi2002, which encode knowledge about events as triples. While event triples can be seen as ways to store the extracted knowledge to perform reasoning on, event annotations and the corresponding annotated corpora are geared towards automatic event extraction from texts in natural language.


TimeML

TimeML is a language specification for events and time expressions, which was developed in the context of the TERQAS workshop (http://www.timeml.org/site/terqas/index.html) supported by the AQUAINT program. The main purpose is to identify and extract events and their temporal anchoring from a text, such that it can be used to support a question answering system in answering temporally-based questions like “In which year did Iraq finally pull out of Kuwait during the war in the 1990s?”.


ACE

The ACE annotation framework was introduced by the ACE program, which provides annotated data, evaluation tools, and periodic evaluation exercises for a variety of information extraction tasks. It covers the annotation of five basic kinds of extraction targets: entities, values, time expressions, relations (between entities) and events ace2005.

TimeML vs ACE

Both TimeML and ACE define an event as something that happens/occurs or a state that holds true, which can be expressed by a verb, a noun, an adjective, as well as a nominalization either from verbs or adjectives. However, the two event models are designed for different purposes, hence resulting in different annotations of events. In addition to the basic features of events existing in both models (tense, aspect, polarity and modality), ACE events have more complex structures involving event arguments, which can either be event participants (entities participating in the corresponding events) or event attributes (place and time of the corresponding events) ldc2005.

While in TimeML all events are annotated, because every event takes part in the temporal network (except for generics, as in “Use of corporate jets for political travel is legal.” timeml2006), in ACE only ‘interesting’ events falling into a set of particular types and subtypes are annotated.

In annotating temporal expressions, ACE and TimeML use similar temporal annotations. ACE uses the TIMEX2 model ferro2001, which was developed under DARPA’s Translingual Information Detection, Extraction and Summarization (TIDES) program, whereas TimeML introduces the TIMEX3 annotation modelled on TIMEX setzer2001 as well as TIMEX2.

The most important attribute of TimeML that differs from ACE is the separation of the representation of events and time expressions from the anchoring or ordering dependencies. Instead of treating a time expression as an event argument, TimeML introduces temporal link annotations to establish dependencies (temporal relations) between events and time expressions pustejovsky2003. This annotation is important in (i) anchoring an event to a time expression (event time-stamping) and (ii) determining the temporal order between events. This distinctive feature was the main reason why we chose TimeML as the event model for our research.


According to TimeML, we can formalize the definitions of temporal information as follows:

  • Events are expressions in text denoting situations that happen or occur, or predicates describing states or circumstances in which something obtains or holds true. They can be punctual or last for a period of time.

  • Time expressions, temporal expressions, or simply timexes are expressions in text denoting time “when” something happens, how often it happens, or how long it lasts.

  • Temporal relations represent the temporal order holding between two arguments, i.e., event and event, event and timex, or timex and timex.

  • Temporal signals are specific types of word indicating or providing a cue of an explicit temporal relation between two arguments.

7 TimeML Annotation Standard

TimeML introduces 4 major data structures: EVENT for events, TIMEX3 for time expressions, SIGNAL for temporal signals, and LINK for relations among EVENTs and TIMEX3s pustejovsky2003,timeml2006. There are three types of LINK tags: TLINK, SLINK and ALINK, which will be further explained in the following section. Note that TimeML EVENTs never participate in a link. Instead, their corresponding event instance IDs, which are realized through the MAKEINSTANCE tag, are used.

For clarity, henceforth, snippets of text annotated with events, timexes and temporal signals serving as examples will mark each entity type in its respective form. For example, “John drove for 5 hours.”

7.1 TimeML Tags

Events in a text can be expressed by tensed or untensed (phrasal) verbs (1), nominalizations (2), adjectives (3), predicative clauses (4), or prepositional phrases (5).

  1. Foreign ministers of member-states has agreed to set up a seven-member panel to investigate who shot down Rwandan President Juvenal Habyarimana’s plane.

  2. The financial assistance from the World Bank and the International Monetary Fund are not helping.

  3. Philippine volcano, dormant for six centuries, began exploding with searing gases, thick ash and deadly debris.

  4. Those observers looking for a battle between uncompromising representatives and very different ideologies will, in all likelihood, be disappointed.

  5. All 75 people on board the Aeroflot Airbus died.

Note that some events may be sequentially discontinuous in some context as exhibited in (4). In order to simplify the annotation process, only the word considered as the syntactic head is annotated, shown with bold letters in the examples, except for prepositional phrases.

The attributes for the EVENT tag include:

  • eid – unique ID number.

  • class – class of the event: OCCURRENCE, PERCEPTION, REPORTING, ASPECTUAL, STATE, I_STATE or I_ACTION.
  • stem – stem of the event’s head.


The MAKEINSTANCE tag is an auxiliary tag used to distinguish event tokens from event instances. A typical example of its usage is to annotate the markable ‘taught’ in “He taught on Monday and Tuesday.” as two event instances happening at different times. The attributes for this tag include:

  • eiid – unique ID number.

  • eventID – unique ID to the referenced EVENT found in the text.

  • tense – tense of the event: PAST, PRESENT, FUTURE, INFINITIVE, PRESPART, PASTPART or NONE.

  • aspect – aspect of the event: PROGRESSIVE, PERFECTIVE, PERFECTIVE_PROGRESSIVE or NONE.
  • pos – part-of-speech tag of the event: ADJECTIVE, NOUN, VERB, PREPOSITION or OTHER.

  • polarity – polarity of the event: POS or NEG.

  • modality – the modal word modifying the event (if exists).


The TIMEX3 tag is used to mark up explicit temporal expressions, including dates, times, durations, and sets of dates and times. There are three major types of TIMEX3 expressions: (i) fully specified timexes, e.g., June 11 1989, summer 2002; (ii) underspecified timexes, e.g. Monday, next month, two days ago; (iii) durations, e.g., three months.

This tag allows specification of a temporal anchor, which facilitates the use of temporal functions to calculate the value of an underspecified timex. For example, within an article with a document creation time such as ‘January 3, 2006’, the temporal expression ‘today’ may occur. By anchoring the TIMEX3 for ‘today’ to the document creation time, we can determine the exact value of the TIMEX3.
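The anchoring mechanism described above can be sketched with a handful of toy resolution rules. This is an illustrative sketch only (the function `resolve_timex` and its rule set are ours); real timex normalizers cover a far richer grammar of expressions.

```python
from datetime import date, timedelta

def resolve_timex(expression, dct):
    """Resolve a few underspecified timexes against the document creation
    time (DCT), returning a normalized ISO 8601 value as the TIMEX3
    `value` attribute requires. Toy rules only."""
    expression = expression.lower()
    if expression == 'today':
        return dct.isoformat()
    if expression == 'yesterday':
        return (dct - timedelta(days=1)).isoformat()
    if expression == 'two days ago':
        return (dct - timedelta(days=2)).isoformat()
    if expression == 'next month':
        year, month = (dct.year + 1, 1) if dct.month == 12 else (dct.year, dct.month + 1)
        return '%04d-%02d' % (year, month)
    raise ValueError('no rule for: ' + expression)
```

With the DCT from the example above (‘January 3, 2006’), ‘today’ resolves to 2006-01-03, exactly the behaviour the anchorTimeID mechanism supports.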

The attributes of the TIMEX3 tag which are of particular interest in the scope of this work include (the full set of attributes with their descriptions can be found in timeml2006):

  • tid – unique ID number.

  • type – type of timex: DATE, TIME, DURATION or SET.

  • value – normalized temporal value of the annotated timex represented in an extended ISO 8601 format.

  • functionInDocument – function of a TIMEX3 in providing a temporal anchor for other temporal expressions in the document: CREATION_TIME, MODIFICATION_TIME, PUBLICATION_TIME, RELEASE_TIME, RECEPTION_TIME, EXPIRATION_TIME or NONE.

  • anchorTimeID – (optional) the timex ID of the timex to which the TIMEX3 markable is temporally anchored.


The SIGNAL tag is used to mark up textual elements that make explicit the relations holding between two temporal elements; these are generally:

  • Temporal prepositions: on, in, at, from, to, during, etc.

  • Temporal conjunctions: before, after, while, when, etc.

  • Prepositions signaling modality: to.

  • Special characters: ‘-’ and ‘/’, in temporal expressions denoting ranges, e.g., September 4-6 or Apr. 1999/Jul. 1999.

The only attribute for the SIGNAL tag is sid, corresponding to the unique ID number.


The TLINK, Temporal Link, tag is used to (i) establish a temporal order between two events (event-event pair), (ii) anchor an event to a time expression (event-timex pair), and (iii) establish a temporal order between two time expressions (timex-timex pair). Each temporal link has a temporal relation type assigned to it. The temporal relation types are modelled based on Allen’s interval algebra between two intervals allen1983. Table .1 shows the TimeML temporal relation types corresponding to relation types existing in Allen’s interval logic.

Interval interpretation | Allen’s Relation | TimeML TLINK Type
end(X) < start(Y) | X is before Y | BEFORE, AFTER
end(X) = start(Y) | X meets Y | IBEFORE, IAFTER
start(X) < start(Y) < end(X) < end(Y) | X overlaps with Y | -
start(X) = start(Y), end(X) < end(Y) | X starts Y | BEGINS, BEGUN_BY
start(Y) < start(X), end(X) < end(Y) | X is during Y | DURING, DURING_INV
start(Y) < start(X), end(X) = end(Y) | X finishes Y | ENDS, ENDED_BY
start(X) = start(Y), end(X) = end(Y) | X is equal to Y | SIMULTANEOUS
Table .1: Allen’s atomic relations, their semantics when interpreted over the real line, and their corresponding TimeML TLINK type

The Allen’s overlap relation is not represented in TimeML. However, TimeML introduces three more types of temporal relations (IDENTITY, INCLUDES and IS_INCLUDED), resulting in a set of 14 relation types. IDENTITY relation is used to encode event co-reference as exhibited in the following sentence, “John drove to Boston. During his drive he ate a donut.”

According to the TimeML 1.2.1 annotation guidelines timeml2006, the difference between DURING and IS_INCLUDED (and their inverses) is that a DURING relation is specified when an event persists throughout a temporal duration (1), while IS_INCLUDED is specified when an event happens within a temporal expression (2). Moreover, the INCLUDES and IS_INCLUDED relations are also used to specify a set/subset relationship between events (3).

  1. John drove for 5 hours.

  2. John arrived on Tuesday.

  3. The police looked into the slayings of 14 women. In six of the cases suspects have already been arrested.
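Given two temporal entities with known interval endpoints, the Allen-style mapping of Table .1 can be sketched as a small classifier. This is an illustrative sketch (the function name `tlink_type` is ours): endpoint containment is mapped to INCLUDES/IS_INCLUDED, while DURING and IDENTITY, which require more than interval endpoints, are not handled.

```python
def tlink_type(x, y):
    """Map two intervals (start, end) to a TimeML TLINK type following
    Allen-style interval semantics. Returns None for Allen's overlap,
    which has no TimeML counterpart."""
    xs, xe = x
    ys, ye = y
    if xe < ys: return 'BEFORE'
    if ye < xs: return 'AFTER'
    if xe == ys: return 'IBEFORE'
    if ye == xs: return 'IAFTER'
    if (xs, xe) == (ys, ye): return 'SIMULTANEOUS'
    if xs == ys: return 'BEGINS' if xe < ye else 'BEGUN_BY'
    if xe == ye: return 'ENDS' if xs > ys else 'ENDED_BY'
    if ys < xs and xe < ye: return 'IS_INCLUDED'
    if xs < ys and ye < xe: return 'INCLUDES'
    return None  # partially overlapping intervals: no TimeML TLINK type
```

For example, an event spanning (2, 3) relative to a timex spanning (1, 5) yields IS_INCLUDED, matching sentence (2) above.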

The attributes of the TLINK tag include:

  • lid – unique ID number.

  • eventInstanceID or timeID – unique ID of the annotated MAKEINSTANCE or TIMEX3 involved in the temporal link.

  • relatedToEventInstance or relatedToTime – unique ID of the annotated MAKEINSTANCE or TIMEX3 that is being related to.

  • relType – temporal relation type holding between the entities: BEFORE, AFTER, IBEFORE, IAFTER, BEGINS, BEGUN_BY, ENDS, ENDED_BY, DURING, DURING_INV, SIMULTANEOUS, IDENTITY, INCLUDES or IS_INCLUDED.
  • signalID – (optional) the ID of SIGNAL explicitly signalling the temporal relation.


The SLINK, Subordination Link, tag is used to introduce a directional relation going from the main to the subordinated verb (indicated with s), which can be in one of the following contexts:

  • Modal, e.g., “Mary wanted John to buy s some wine.”

  • Factive, e.g., “John managed to go s to the supermarket.”

  • Counter-factive, e.g., “John forgot to buy s some wine”.

  • Evidential, e.g., “Mary saw John only carrying s beer.”

  • Negative-evidential, e.g., “John denied he bought s only beer.”

  • Conditional, e.g., “If John brings s only beer, Mary will buy some wine.”

The attributes of the SLINK tag include:

  • lid – unique ID number.

  • eventInstanceID – unique ID of the annotated MAKEINSTANCE involved in the subordination link.

  • subordinatedEventInstance – unique ID of the subordinated MAKEINSTANCE that is being related to.

  • relType – subordination relation holding between the event instances: MODAL, EVIDENTIAL, NEG_EVIDENTIAL, FACTIVE, COUNTER_FACTIVE or CONDITIONAL.

  • signalID – (optional) the ID of SIGNAL explicitly signalling the subordination relation.


The ALINK, Aspectual Link, tag represents the relationship between an aspectual event (indicated with a) and its argument event, belonging to one of the following:

  • Initiation, e.g., “John started a to read.”

  • Culmination, e.g., “John finished a assembling the table.”

  • Termination, e.g., “John stopped a talking.”

  • Continuation, e.g., “John kept a talking.”

  • Reinitiation, e.g., “John resumed a talking.”

The attributes of the ALINK tag include:

  • lid – unique ID number.

  • eventInstanceID – unique ID of the annotated (aspectual) MAKEINSTANCE involved in the aspectual link.

  • relatedToEventInstance – unique ID of the MAKEINSTANCE that is being related to.

  • relType – relation holding between the event instances: INITIATES, CULMINATES, TERMINATES, CONTINUES or REINITIATES.

  • signalID – (optional) the ID of SIGNAL explicitly signalling the relation.


Figure .1 shows an excerpt of news text annotated with temporal entities and temporal relations in TimeML annotation standard.

<TimeML>
<DOCID>wsj_0679</DOCID>
<DCT><TIMEX3 tid="t0" type="DATE" value="1989-10-30" temporalFunction="false" functionInDocument="CREATION_TIME">1989-10-30</TIMEX3></DCT>
<TEXT>
According to the filing, Hewlett-Packard <EVENT eid="e24" class="OCCURRENCE">acquired</EVENT> 730,070 common shares from Octel as a result of an <TIMEX3 tid="t25" type="DATE" value="1988-08-10" functionInDocument="NONE">Aug. 10, 1988</TIMEX3>, stock purchase <EVENT eid="e26" class="I_ACTION">agreement</EVENT>. That <EVENT eid="e27" class="I_ACTION">accord</EVENT> also <EVENT eid="e28" class="I_ACTION">called</EVENT> for Hewlett-Packard to <EVENT eid="e29" class="OCCURRENCE">buy</EVENT> 730,070 Octel shares in the open market <SIGNAL sid="s30">within</SIGNAL> <TIMEX3 tid="t31" type="DURATION" value="P18M" functionInDocument="NONE">18 months</TIMEX3>.
</TEXT>
<MAKEINSTANCE eventID="e24" eiid="ei24" tense="PAST" aspect="NONE" polarity="POS" pos="VERB"/>
<MAKEINSTANCE eventID="e26" eiid="ei26" tense="NONE" aspect="NONE" polarity="POS" pos="NOUN"/>
<MAKEINSTANCE eventID="e27" eiid="ei27" tense="NONE" aspect="NONE" polarity="POS" pos="NOUN"/>
<MAKEINSTANCE eventID="e28" eiid="ei28" tense="PAST" aspect="NONE" polarity="POS" pos="VERB"/>
<MAKEINSTANCE eventID="e29" eiid="ei29" tense="INFINITIVE" aspect="NONE" polarity="POS" pos="VERB"/>
<TLINK lid="l21" relType="AFTER" timeID="t31" relatedToTime="t25"/>
<TLINK lid="l22" relType="DURING" eventInstanceID="ei29" relatedToTime="t31" signalID="s30"/>
<TLINK lid="l23" relType="AFTER" eventInstanceID="ei23" relatedToEventInstance="ei26"/>
</TimeML>

Figure .1: Text excerpt annotated with temporal entities and temporal relations in TimeML standard.
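Since TimeML is plain XML, documents like the one in Figure .1 can be consumed with a standard XML parser. The following is an illustrative sketch (the function name `extract_tlinks` is ours) that pulls out TLINK triples, resolving whichever of the event-instance or timex attributes is present on each link; it assumes well-formed XML input.

```python
import xml.etree.ElementTree as ET

def extract_tlinks(timeml_string):
    """Return (source, relType, target) triples for every TLINK in a
    TimeML document, where source/target are event instance or timex IDs."""
    root = ET.fromstring(timeml_string)
    triples = []
    for link in root.iter('TLINK'):
        src = link.get('eventInstanceID') or link.get('timeID')
        tgt = link.get('relatedToEventInstance') or link.get('relatedToTime')
        triples.append((src, link.get('relType'), tgt))
    return triples
```

Applied to Figure .1, this would yield three triples, e.g. ('ei29', 'DURING', 't31') for the TLINK anchoring the buy event to the 18-month duration.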
7.2 ISO Related Standards
ISO 8601

ISO 8601 is an international standard providing an unambiguous method for representing dates and times, which is used by TimeML to represent the value attribute of the TIMEX3 tag. The standard is based on the following principles (note that in the formulations below, the emphasized letters denote place-holders for number or designated letter values):

  • Date, time and duration unit values are organized from the most to the least significant: year, month (or week), day, hour, minute, second and fraction of second.

  • Calendar dates are represented in the form of YYYY-MM-DD, where YYYY, MM and DD are the place-holders for the year, month and day number values respectively, or YYYY-Www with ww for the week-of-year number value.

  • Times are represented with respect to the 24-hour clock system and follow the format of hh:mm:ss.ff, with hh, mm, ss and ff for the hour, minute, second and fraction of second values respectively.

  • The combination of a date and a time is represented in the format of YYYY-MM-DDThh:mm:ss, where the designator T separates the date and time parts.

  • Durations follow the format of PnX, where n is the duration number value, and X is the duration unit which can be one of the following units: Y, M, W, D, H, M and S for year, month, week, day, hour, minute and second respectively.

  • The combination of duration units follows the format of: PnYnMnDTnHnMnS or PnW.
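The duration format described above can be sketched as a small formatting helper. This is an illustrative sketch (the function name `iso_duration` is ours), covering only the PnYnMnDTnHnMnS and PnW forms listed in the principles above.

```python
def iso_duration(years=0, months=0, weeks=0, days=0,
                 hours=0, minutes=0, seconds=0):
    """Build an ISO 8601 duration string: PnW for weeks, otherwise
    PnYnMnDTnHnMnS with zero-valued units omitted."""
    if weeks:
        return 'P%dW' % weeks
    date_part = ''.join('%d%s' % (v, u) for v, u in
                        ((years, 'Y'), (months, 'M'), (days, 'D')) if v)
    time_part = ''.join('%d%s' % (v, u) for v, u in
                        ((hours, 'H'), (minutes, 'M'), (seconds, 'S')) if v)
    return 'P' + date_part + ('T' + time_part if time_part else '')
```

The T designator disambiguates months from minutes, so a duration of 18 months renders as P18M, the same TIMEX3 value that appears in Figure .1.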

An extended version of ISO 8601 was proposed in the TIDES annotation standard ferro2001 to address the ambiguity and vagueness of natural language:

  • Parts of day, weekend, seasons, decades and centuries were introduced as new concepts. For example, YYYY-Www-WE (where WE indicates weekend), YYYY-MM-DDTPoD (where PoD takes one of the following values: NI, MO, MI, AF and EV for night, morning, midday, afternoon and evening respectively), etc.

  • Additional temporal values are used for temporal expressions such as ‘nowadays’ to refer to either the past, present or future, i.e., PAST_REF, PRESENT_REF or FUTURE_REF respectively.


ISO-TimeML

Adopting the existing TimeML annotation standard, ISO-TimeML aims to define a mark-up language for annotating documents with information about time and events. Several changes to TimeML have been proposed to better capture temporal semantics in text: (i) stand-off annotations rather than in-line annotations, which do not modify the text being annotated, and (ii) the introduction of a new link for measuring out events (MLINK), which characterizes a temporal expression of DURATION type as a temporal measurement of an event. The resulting standard, ISO 24617-1:2012, SemAF-Time, specifies a formalized XML-based mark-up language facilitating the exchange of temporal information PUSTEJOVSKY10.55.

8 TimeML Annotated Corpora

We list below several corpora annotated with either simplified or extended versions of the TimeML annotation standard, which are freely available for research purposes (to access the Clinical TempEval corpus, users must agree to handle the data appropriately, formalized in the requirement that users must sign a data use agreement with the Mayo Clinic). Most corpora are in English, but a few are in other languages, such as Chinese, French, Italian, Korean and Spanish.

TimeBank 1.2

The TimeBank 1.2 corpus timebank is an annotated corpus of temporal semantics that was created following the TimeML 1.2.1 specification timeml2006. It contains 183 news articles, with just over 61,000 non-punctuation tokens, coming from a variety of news reports, specifically from the ACE program and PropBank. The ones taken from the ACE program are originally transcribed broadcast news from the following sources: ABC, CNN, PRI, VOA, and news-wire from AP and NYT. Meanwhile, PropBank contains articles from the Wall Street Journal. TimeBank 1.2 is freely distributed by the Linguistic Data Consortium (http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T08).


AQUAINT

The AQUAINT corpus contains 73 news report documents and is freely available for download (http://timeml.org/site/timebank/aquaint-timeml/aquaint_timeml_1.0.tar.gz). It is often referred to as the Opinion corpus.

TempEval Related Corpora

The corpora released in the context of the TempEval evaluation campaigns (see Section 9), which serve as development and evaluation datasets, are mostly based on the TimeML annotation standard, with some exceptions in the early tasks for the purpose of simplification:

  • The corpus created for the first TempEval task verhagen-EtAl:2007:SemEval-2007 at SemEval-2007 employs a simplified version of TimeML. For example, there is no event instance annotation (realized with the MAKEINSTANCE tag), and the TLINK types include only three core relations (BEFORE, AFTER and OVERLAP), two less specific relations (BEFORE-OR-OVERLAP and OVERLAP-OR-AFTER) for ambiguous cases, and VAGUE for cases where no particular relation can be established.

  • As the TempEval-2 task verhagen-EtAl:2010:SemEval at SemEval-2010 attempted to address multilinguality, the corpus released within this task includes texts in Chinese, English, French, Italian, Korean and Spanish. The annotation contains the same set of TLINK types used in the previous TempEval.

  • The TempEval-3 corpus created for the TempEval-3 task uzzaman-EtAl:2013:SemEval-2013 at SemEval-2013, however, is based on the latest TimeML annotation guideline version 1.2.1 timeml2006, with the complete set of 14 TLINK types. The corpus contains (i) the enhanced existing corpora, TimeBank 1.2 and AQUAINT, resulting in TBAQ-cleaned as the development data for the task, (ii) the TempEval-3 silver corpus, obtained by running automatic annotation systems (TIPSem and TIPSem-B llorens-saquete-navarro:2010:SemEval, and TRIOS uzzaman-allen:2010:SemEval) on a 600K-word corpus collected from Gigaword, and (iii) the newly released TE3-Platinum as the evaluation corpus.

  • The creation of the evaluation corpus for QA-TempEval llorens-EtAl:2015:SemEval does not require manual annotation of all TimeML elements in the documents. The annotators created temporal-related questions from the documents, such as “Will Manchester United and Liverpool play each other after they topped their respective groups?”, provided the correct yes/no answers, then annotated the corresponding entities and relations in the text following the TimeML annotation format. There are 294 questions in total, coming from 28 documents belonging to three different domains: news articles, Wikipedia articles (history, biographical) and informal blog posts (narrative).

  • The Clinical TempEval corpus bethard-EtAl:2015:SemEval comprises 600 clinical notes and pathology reports from cancer patients at the Mayo Clinic. The documents are annotated using an extended TimeML annotation framework, which includes new temporal expression types (e.g., PrePostOp for post-operative), new EVENT attributes (e.g., degree=LITTLE for slight nausea) and a new temporal relation type (CONTAINS).

  • The evaluation dataset released for the “TimeLine: Cross-Document Event Ordering” task minard-EtAl:2015:SemEval consists of 90 Wikinews articles within specific topics (e.g., Airbus, General Motors, Stock Market) surrounding the target entities for which the event timelines are created. An event timeline is represented as ordered events, which are anchored to time with granularity ranging from DAY to YEAR. There are 37 event timelines for target entities of type PERSON (e.g., Steve Jobs), ORGANISATION (e.g., Apple Inc.), PRODUCT (e.g., Airbus A380) and FINANCIAL (e.g., Nasdaq), with around 24 events and 18 event chains per timeline on average.


Ita-TimeBank

The Ita-TimeBank corpus caselli-EtAl:2011:LAW is composed of two corpora (more than 150K tokens) that have been developed in parallel following the It-TimeML annotation scheme for the Italian language. The two corpora are (i) the CELCT corpus, containing news articles taken from the Italian Content Annotation Bank (I-CAB) magnini:2006:lrec2006, and (ii) the ILC corpus, which consists of 171 news articles collected from the Italian Syntactic-Semantic Treebank, the PAROLE corpus and the web.


TimeBank-Dense

The TimeBank-Dense corpus chambers-etal:2014:TACL was created to address the sparsity issue in the existing TimeML corpora. Using a specialized annotation tool, annotators are prompted to label all pairs of events and time expressions in the same sentence, all pairs of events and time expressions in the immediately following sentence, and all pairs of events and the document creation time. The VAGUE relation introduced at the first TempEval task verhagen-EtAl:2007:SemEval-2007 is adopted to cope with ambiguous temporal relations, or to indicate pairs for which no clear temporal relation exists. The resulting corpus contains 12,715 temporal relations, under the labels BEFORE, AFTER, INCLUDES, IS_INCLUDED, SIMULTANEOUS and VAGUE, over 36 documents taken from TimeBank (in contrast, the full TimeBank corpus contains only 6,418 temporal relations over 183 documents).

9 TempEval Evaluation Campaigns

TempEval is a series of evaluation campaigns, which are part of SemEval (Semantic Evaluation), an ongoing series of evaluations of computational semantic analysis systems. The ultimate goal of TempEval is the automatic identification of temporal expressions (timexes), events, and temporal relations within a text as specified in TimeML annotation pustejovsky2003. However, since addressing this aim in a first evaluation challenge was deemed too difficult, a staged approach was employed.


The first TempEval verhagen-EtAl:2007:SemEval-2007 focused only on the categorization of temporal relations into simplified TimeML TLINK types, and only for English. Three tasks were proposed, each one determining the TLINK type of:

  • pairs of event and timex within the same sentence,

  • pairs of event and DCT (Document Creation Time), and

  • pairs of main events of adjacent sentences, where the main event is usually the syntactically dominant verb in a sentence.


TempEval-2 verhagen-EtAl:2010:SemEval extended the first TempEval, growing into a multilingual task, and adding three more tasks:

  • determining the extent of time expressions (TIMEX3 tagging) and the attribute values for type and value,

  • determining the extent of events (EVENT tagging) and the attribute values for class, tense, aspect, polarity and modality,

  • determining the TLINK type of pairs of events where one event syntactically dominates the other.


TempEval-3 uzzaman-EtAl:2013:SemEval-2013 is different from its predecessor in several aspects:

  • Dataset     In terms of size, the task provided 100K words of gold standard data and 600K words of silver standard data for training, compared to the 50K-word corpus used in TempEval-1 and TempEval-2. A new evaluation dataset, TE3-Platinum, was developed based on manual annotations by experts over new text.

  • End-to-end extraction task     The temporal information extraction tasks are performed on raw text. Participants need to recognize EVENTs and TIMEX3s first, determine which ones to link, then label the links with TLINK types. In previous TempEvals, gold annotated EVENTs, TIMEX3s and TLINKs (without type) were given.

  • TLINK types     The full set of relation types according to TimeML is used, as opposed to the simplified one used in earlier TempEvals.

  • Evaluation     A single score, temporal awareness score, was reported to rank the participating systems.

There were three main tasks proposed in TempEval-3 focusing on TimeML entities and relations:

  • Task A     Determine the extent of timexes in a text as defined by the TIMEX3 tag, and determine the value of their type and value attributes.

  • Task B     Determine the extent of events in a text as defined by the EVENT tag, and assign the value of the class attribute.

  • Task ABC     The end-to-end task that goes from raw text to TimeML annotation of EVENTs, TIMEX3s and TLINKs, which entails performing tasks A and B.

In addition to the main tasks, two extra temporal relation tasks were also included:

  • Task C     Given gold annotated EVENTs and TIMEX3s, identify the pairs of entities having temporal link (TLINK) and classify the relation type.

  • Task C relation type only     Given gold annotated EVENTs, TIMEX3s and TLINKs (without type), classify the relation type.

TempEval Continuation

At SemEval-2015, there were several tasks related to temporal processing, taking it further into different directions: cross-document event ordering minard-EtAl:2015:SemEval, temporal-related question answering llorens-EtAl:2015:SemEval and the clinical domain bethard-EtAl:2015:SemEval.

We are particularly interested in the QA TempEval task llorens-EtAl:2015:SemEval, which requires the participants to perform end-to-end TimeML annotation from the plain text in the same way as in TempEval-3 (Task ABC), but evaluates the systems in terms of correctly answered questions instead of using common information extraction performance measures. The task focuses on answering yes/no questions in the following format: IS <entity1> <RELATION> <entity2> ?, e.g., Is event1 BEFORE event2 ?. The systems are ranked based on the accuracy in answering the questions.

10 State-of-the-art Methods

The problem of temporal information processing can be decomposed into several sub-problems, as defined in TempEval-3. Hence, the best participating systems in TempEval-3 for each task, i.e., timex extraction (Task A), event extraction (Task B) and temporal relation extraction (Task C), can be perceived as the state-of-the-art systems (see Table .2); note that we only report the high-performing participating systems (and their best system runs) in TempEval-3. Apart from TempEval-3, the efforts towards complete temporal information processing are still ongoing, so we also report systems claiming to be better than the best systems in TempEval-3. For some tasks, TIPSem llorens-saquete-navarro:2010:SemEval, the best performing system in TempEval-2, also performed best in TempEval-3. However, it was used by the annotators to pre-label the evaluation corpus, so it was excluded from the ranking.

System                                         F1      P       R     F1      F1      F1
                                                                  (strict) (value) (class)
Timex Extraction
HeidelTime-t strotgen2013                    90.30   93.08   87.68   81.34   77.61     -
NavyTime-1,2 chambers:2013:SemEval-2013      90.32   89.36   91.30   79.57   70.97     -
SUTime chang-manning:2012:SUTime             90.32   89.36   91.30   79.57   67.38     -
ClearTK-1,2 bethard:2013:EMNLP               90.23   93.75   86.96   82.71   64.66     -
Event Extraction
TIPSem llorens-saquete-navarro:2010:SemEval  82.89   83.51   82.28     -       -     75.59
ATT-1 jung2013                               81.05   81.44   80.67     -       -     71.88
KUL                                          79.32   80.69   77.99     -       -     70.17
Temporal Relation Extraction
TIPSem llorens-saquete-navarro:2010:SemEval  44.25   39.71   49.94     -       -       -
ClearTK-2 bethard:2013:EMNLP                 36.26   37.32   35.25     -       -       -
UTTime-5 laokulrat-EtAl:2013:SemEval-2013    34.90   35.94   33.92     -       -       -
NavyTime-1 chambers:2013:SemEval-2013        31.06   35.48   27.62     -       -       -
UTTime-1 laokulrat-EtAl:2013:SemEval-2013    24.65   15.18   65.64     -       -       -
Temporal Relation Type Classification
UTTime-1,4 laokulrat-EtAl:2013:SemEval-2013  56.45   55.58   57.35     -       -       -
Temporal Information Processing
TIPSem llorens-saquete-navarro:2010:SemEval  42.39   38.79   46.74     -       -       -
ClearTK-2 bethard:2013:EMNLP                 30.98   34.08   28.40     -       -       -
Table .2: State-of-the-art temporal information processing systems according to TempEval-3
10.1 Timex Extraction

In terms of recognizing the extent of timexes in a text, both rule-based and data-driven strategies are equally good. The rule-engineering systems HeidelTime, NavyTime and SUTime performed best at relaxed matching with 90.3%, 90.32% and 90.32% F1-score respectively, while the statistical system ClearTK performed best at strict matching with 82.71% F1-score. Strict match is when there is an exact match between the system entity and gold entity, e.g., sunday morning vs sunday morning, whereas relaxed match is when there is at least an overlap between the system entity and gold entity, e.g., sunday vs sunday morning.

The rule-engineering systems commonly rely on regular expression (regex) matching to find temporal expressions in a text, whereas the data-driven approaches regard the problem as a BIO token-chunking task, building a classifier to decide whether a token is at the B(eginning) of, I(nside) of or O(utside) of a timex.
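To make the contrast concrete, here is a toy regex-based recognizer in the spirit of the rule-engineering systems; the patterns are illustrative, not HeidelTime's or SUTime's actual rules. A data-driven system would instead run a trained BIO classifier over the tokens.

```python
import re

# A few toy timex patterns; real systems ship hundreds of such rules.
PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),                        # 1998-01-14
    re.compile(r"\b(?:Monday|Tuesday|Wednesday|Thursday|Friday|"
               r"Saturday|Sunday)(?:\s+morning|\s+evening)?\b", re.I),
    re.compile(r"\b(?:yesterday|today|tomorrow)\b", re.I),
]

def find_timexes(text):
    """Return sorted (start, end) character spans of matched timexes."""
    spans = []
    for pattern in PATTERNS:
        spans.extend(m.span() for m in pattern.finditer(text))
    return sorted(set(spans))

text = "The deal signed on Sunday morning will close tomorrow."
print([text[s:e] for s, e in find_timexes(text)])  # ['Sunday morning', 'tomorrow']
```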

In TempEval-3, the timex recognition task also includes determining the type of a timex (DATE, TIME, DURATION or SET) and normalizing its value, e.g., the day before yesterday would be normalized into 2015-12-30 (assuming that today is 2016-01-01). The normalization task is currently (and perhaps inherently) done best by rule-engineered systems, HeidelTime being the best with 77.61% F1-score. (The F1-score for value captures the performance of extracting a timex and identifying the attribute value together; ClearTK included another classifier to determine timex types, but used TIMEN LLORENS12.128.L12-1015, which is rule-based, to normalize timex values.)

Most rule-based approaches for timex normalization use a string-to-string translation approach, i.e., each word in the expression is looked up in a normalization lexicon, then the resulting sequence is mapped directly to the normalized form. Both HeidelTime and TIMEN follow this approach. A drawback of this approach is that a separate rule is needed for each expression, e.g., yesterday and the day before yesterday, regardless of the compositional relation that may hold between them, namely that the day before yesterday is one day before yesterday.
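A minimal sketch of this string-to-string approach, with a toy lexicon (not HeidelTime's or TIMEN's actual rules); note that the day before yesterday needs its own entry, which illustrates the drawback just described:

```python
import datetime

# Toy normalization lexicon: expressions map directly either to a day offset
# resolved against the document creation time (DCT) or to a symbolic value.
LEXICON = {
    "yesterday": -1,
    "the day before yesterday": -2,   # its own rule, despite being compositional
    "tomorrow": +1,
    "nowadays": "PRESENT_REF",
}

def normalize(expression, dct):
    """Return the normalized TIMEX3 value of an expression, or None if unknown."""
    rule = LEXICON.get(expression.lower())
    if rule is None:
        return None
    if isinstance(rule, int):
        return (dct + datetime.timedelta(days=rule)).isoformat()
    return rule

dct = datetime.date(2016, 1, 1)
print(normalize("the day before yesterday", dct))  # 2015-12-30
print(normalize("nowadays", dct))                  # PRESENT_REF
```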

TimeNorm bethard:2013:EMNLP exploits a synchronous context free grammar for timex normalization to address these shortcomings. Synchronous rules map the source language to formally defined operators for manipulating times. Time expressions are then parsed using an extended CYK+ algorithm, and converted to a normalized form by applying the operators recursively. UWTime lee-EtAl:2014:P14-1 uses a Combinatory Categorial Grammar (CCG) to construct compositional meaning representations, while also considering contextual cues (e.g. the document creation time, the governing verb’s tense) to compute the normalized value of a timex.

Evaluated on the TempEval-3 evaluation corpus, UWTime achieved 82.4% F1-score on the value resolution task, while TimeNorm achieved 81.6% accuracy given gold annotated timex extents, compared with 78.5% and 74.1% accuracies achieved by HeidelTime and TIMEN, respectively.

10.2 Event Extraction

All high performing systems for event recognition in TempEval-3 used machine learning approaches. Typically a system consists of separate classifiers for recognizing events and for determining their class attribute. For event extent recognition, since in TimeML the annotated events are usually single-word events, the problem is often regarded as a binary token-classification task (some systems in TempEval-3 also modelled the problem as a BIO token-chunking task, e.g., ClearTK, as for timex recognition). Meanwhile, since there are 7 event classes in TimeML, the task of determining the class attribute is modelled as a multi-class classification task.
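As an illustration of the token-classification formulation, a system might extract features like the following for each token and feed them to a binary classifier (SVM, MaxEnt, etc.); the feature set here is a simplified sketch, not any participant's actual one:

```python
# Sketch: event extraction as binary token classification. Each token gets a
# feature vector; a trained classifier would then predict EVENT vs O.
def token_features(tokens, pos_tags, i):
    """Toy feature dictionary for the i-th token of a sentence."""
    return {
        "word": tokens[i].lower(),
        "pos": pos_tags[i],
        "is_verb": pos_tags[i].startswith("VB"),
        "prev_pos": pos_tags[i - 1] if i > 0 else "BOS",
        "next_pos": pos_tags[i + 1] if i < len(tokens) - 1 else "EOS",
    }

tokens = ["Hewlett-Packard", "acquired", "730,070", "shares"]
pos_tags = ["NNP", "VBD", "CD", "NNS"]
print(token_features(tokens, pos_tags, 1)["is_verb"])  # True
```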

The best performing system is ATT, with 81.05% and 71.88% F1-scores for extent and class respectively, followed by KUL with 79.32% and 70.17%. (The F1-score for class captures the performance of extracting an event and identifying the attribute class together.) These systems, as well as TIPSem, use semantic information obtained through semantic role labelling as features, which proves to play an important role in event recognition.

10.3 Temporal Relation Extraction

In TempEval-3, identifying which pairs of entities are connected by a temporal relation is a new task in the series of TempEval challenges; in TempEval and TempEval-2, the pairs of entities were given and limited to specific syntactic constructs. TempEval-3 participants approached the problem with rule-based, data-driven and also hybrid methods. The rules are typically based on the possible TLINK candidates enumerated in the task description: (i) main events of consecutive sentences, (ii) pairs of events in the same sentence, (iii) event and timex in the same sentence and (iv) event and document creation time.

For the (ii) candidate pairs, TIPSem only considered pairs of events where one is subordinated by the other. ClearTK included three different multi-class classification models (for the (ii), (iii) and (iv) candidate pairs) for temporal relation identification, as well as temporal relation type classification. Given a pair of entities, the classifiers have to predict the temporal relation type (BEFORE, AFTER, SIMULTANEOUS, etc.) or NORELATION if no relation exists. UTTime-1 relied only on rules to consider candidate pairs as having temporal relations, whereas UTTime-5 used re-trained classifiers with an additional relation type, UNKNOWN, to filter the candidate pairs, in the same way as ClearTK.

The hybrid method, employed by NavyTime, combines candidate-pair rules with four binary classifiers (for the (i), (ii), (iii) and (iv) candidate pairs) that decide whether a candidate pair has a temporal relation or not.

On the other hand, for classifying the temporal relation types, all participants resort to data-driven approaches. Both TIPSem and UTTime used sentence-level semantic information as features, obtained via semantic role labelling and deep syntactic parsing, respectively.

Regarding the classifiers used, ClearTK relied on Mallet (http://mallet.cs.umass.edu/) MaxEnt, OpenNLP (http://opennlp.apache.org/) MaxEnt, and LIBLINEAR REF08a, and picked the final classifiers by running a grid search over models and parameters on the training data. UTTime used two LIBLINEAR REF08a classifiers (L2-regularized logistic regression): one for event-event pairs, i.e., the (i) and (ii) candidate pairs, and another for event-timex pairs, i.e., the (iii) and (iv) candidate pairs. In addition to four binary classifiers for identifying candidate pairs having temporal links, NavyTime trained four MaxEnt classifiers for temporal relation classification.

For both the temporal relation identification and temporal relation type classification tasks, ClearTK is the best performing system with 36.26% F1-score. The organizers also provided the gold annotated temporal links to measure the performance of systems in classifying the temporal relation types (Task C relation type only). UTTime with semantic features performed best with 56.45% F1-score. Using only rules to determine the candidate pairs, UTTime-1 achieved the highest recall (65.64%) at the expense of precision (15.18%). UTTime-5 obtained a better F1-score by reducing the recall significantly, but still placed second after ClearTK with 34.90% F1-score.

10.4 Temporal Information Processing

For complete temporal annotation from raw text, which is the Task ABC in TempEval-3, the best performing system is ClearTK, with 30.98% F1-score.

11 Conclusions

We presented an introduction to temporal information processing, particularly using TimeML as the annotation framework. The separation of temporal entities from temporal anchoring/dependency representations in TimeML, as well as the fact that events are not limited to specific types, were the main reasons why we chose TimeML over ACE for temporal information modelling in our research. The TimeML annotation standard has been described (Section 7), along with several corpora annotated with TimeML (Section 8).

We have also given an overview of state-of-the-art methods for extracting temporal information from text (Section 10), according to the TempEval evaluation campaigns (Section 9). TempEval-3 results reported by uzzaman-EtAl:2013:SemEval-2013 show that even though the performances of systems for extracting TimeML entities are quite good (>80% F1-score), the overall performance of end-to-end temporal information extraction systems suffers due to the low performance on extracting temporal relations. The state-of-the-art performance on the temporal relation extraction task yields only around 36% F1-score. This is the main reason underlying our choice to focus this work on the extraction of temporal relations.

Identifying temporal relations in a full discourse is a task that is difficult to define. In general it involves the classification of temporal relations between every possible pair of events and timexes. Hence, without a completely labelled graph of events and timexes, we cannot speak about true extraction, but rather about matching human labelling decisions that were constrained by time and effort. The TimeBank-Dense corpus mentioned in Section 8 was created to cope with this problem.

Several tasks in line with TempEval (Section 9) approach the problem by changing the evaluation scheme used. In QA TempEval, the task is no longer about annotation accuracy, but rather the accuracy for targeted questions. The “TimeLine: Cross-Document Event Ordering” task limited the extraction of event timelines only to events related to specific target entities.

12 Introduction

Temporal relations, or temporal links, are annotations that bring together pieces of markable temporal information in a text, and make formal representation of temporally ordered events possible. In TimeML, temporal relation types have been modelled based on Allen’s interval algebra between two intervals allen1983. In Table .1 in Section 7.1 we show the relation types defined in Allen’s interval logic, along with the corresponding TLINK types in TimeML.

TimeML pustejovsky2003 is the annotation framework used in the TempEval series, evaluation exercises focused on temporal information processing, i.e. the extraction of temporal expressions (timexes), events and temporal relations in a text (see Section 7 and Section 9). According to TempEval-3 results reported by uzzaman-EtAl:2013:SemEval-2013, while systems for timex extraction and event extraction tasks yield quite high performances with over 80% F1-scores, the best performing system achieved very low performance on the temporal relation extraction task, bringing down the overall performance on the end-to-end temporal information processing task to only around 30% F1-score. This is the main reason why we focus our attention on the automatic extraction of temporal relations.

Identifying temporal relations in a full discourse is a very difficult task. In general it involves the classification of temporal relations between every possible pair of events and timexes. With n markable elements in a text, the total number of possible temporal links is n(n-1)/2. Most of the research done so far focused on estimating the relation type, given an annotated pair of temporal events and timexes. In TempEval-1 and TempEval-2, participants were given gold annotated temporal links, which are missing the type annotation, between temporal entities following predefined syntactic constructs, e.g., pairs of main events of adjacent sentences.

In TempEval-3, participants were required to identify pairs of temporal entities connected by a temporal link (TLINK), but possible TLINK candidates were only: (i) main events of consecutive sentences, (ii) pairs of events in the same sentence, (iii) event and timex in the same sentence and (iv) event and document creation time. Moreover, compared to earlier TempEval campaigns, TempEval-3 required the recognition of the full set of temporal relations in TimeML (14 TLINK types) instead of a simplified set, increasing the task complexity.

In this chapter, we describe our methods in building an improved temporal relation extraction system (Section 16), then evaluate our system following the TempEval-3 evaluation scheme to be able to compare it with the state-of-the-art systems (Section 17.1).

However, the sparse annotation of temporal relations in the TempEval corpora makes it difficult to build an automatic extraction system and to evaluate its performance, particularly on identifying temporal links. The best system in TempEval-3 for labelling the temporal links with 14 temporal relation types (Task C classifying only), UTTime laokulrat-EtAl:2013:SemEval-2013, achieved around 56% F1-score. When the system was evaluated on the temporal relation extraction task (Task C: identifying + classifying), its performance dropped to 24.65% F1-score, even though it gained a very high recall of 65.64%. The best performing system for Task C in TempEval-3, ClearTK bethard:2013:EMNLP, optimized only relation classification and intentionally left many pairs unlabelled, balancing precision and recall into a 36.26% F1-score.

The TimeBank-Dense corpus (Section 8) was created to cope with this sparsity issue. Using a specialized annotation tool, annotators are prompted to label all possible pairs of temporal entities, resulting in a complete graph of temporal relations. On the other hand, one of the continuations of the TempEval series, QA TempEval (Section 9), approached the problem by changing the evaluation scheme: the task is no longer about annotation accuracy, but rather the accuracy of answering targeted questions.

Therefore, we also evaluate our system following the TimeBank-Dense and QA TempEval evaluation methodologies (Section 17.2 and Section 17.3, respectively), to give a complete overview on how well our system can extract temporal relations between temporal entities in a text.

13 Related Work

Supervised learning for temporal relation extraction has already been explored in several earlier works. Most existing models formulate temporal ordering as a pairwise classification task, where each pair of temporal entities is classified into temporal relation types Mani_2007.three, chambers-wang-jurafsky:2007:PosterDemo.

Several works have tried to exploit an external temporal reasoning module to improve the supervised learning models for temporal relation extraction, through training data expansion Mani_2007.three, tatu-srikanth:2008:PAPERS, or testing data validation, i.e., replacing the inconsistencies in automatically identified relations (if any) with the next best relation types tatu-srikanth:2008:PAPERS. Some other works tried to take advantage of global information to ensure that the pairwise classifications satisfy temporal logic transitivity constraints, using frameworks like Integer Linear Programming and Markov Logic Networks chambers-jurafsky:2008:EMNLP, yoshikawa-EtAl:2009:ACLIJCNLP, uzzaman-allen:2010:SemEval. The gains have been small, likely because of the disconnectedness that is common in sparsely annotated corpora chambers-jurafsky:2008:EMNLP.

In the context of TempEval evaluation campaigns (Section 9), which is a series of evaluations of temporal information processing systems, our research on temporal relation extraction is based on the third instalment of the series, TempEval-3. For the tasks related to temporal relation extraction (Task C and Task C relation type only), there were five participants in total, including ClearTK bethard:2013:EMNLP, UTTime laokulrat-EtAl:2013:SemEval-2013 and NavyTime chambers:2013:SemEval-2013, which are reported in Section 10.3. These systems resorted to data-driven approaches for classifying the temporal relation types, using morphosyntactic information (e.g., PoS tags, syntactic parsing information) and lexical semantic information (e.g., WordNet synsets) as features. UTTime additionally used sentence-level semantic information (i.e., predicate-argument structure) as features.

Our proposed approach for temporal relation type classification is inspired by recent works on hybrid classification models dsouza-ng:2013:NAACL-HLT, chambers-etal:2014:TACL. dsouza-ng:2013:NAACL-HLT introduce 437 hand-coded rules along with supervised classification models using lexical relation features (extracted from Merriam-Webster dictionary and WordNet), as well as semantic and discourse features. CAEVO, a CAscading EVent Ordering architecture by chambers-etal:2014:TACL, combines rule-based and data-driven classifiers in a sieve-based architecture for temporal ordering. The classifiers are ordered by their individual precision. After each classifier-sieve proposes its labels, the architecture infers transitive links from the new labels, adds them to the temporal label graph and informs the next classifier-sieve about this decision.
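The transitive inference step between sieves can be sketched as follows; for simplicity we close only BEFORE links under transitivity, whereas CAEVO composes the full set of relation types:

```python
# Sketch of the transitive inference step in a sieve architecture: after a
# sieve proposes labels, BEFORE links are closed under transitivity before
# the graph is handed to the next (lower-precision) sieve.
def infer_transitive(links):
    """links: set of (source, 'BEFORE', target) triples. Returns the closure."""
    closed = set(links)
    changed = True
    while changed:
        changed = False
        for a, _, b in list(closed):
            for c, _, d in list(closed):
                if b == c and (a, "BEFORE", d) not in closed:
                    closed.add((a, "BEFORE", d))
                    changed = True
    return closed

sieve_output = {("e1", "BEFORE", "e2"), ("e2", "BEFORE", "t1")}
print(("e1", "BEFORE", "t1") in infer_transitive(sieve_output))  # True
```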

14 Related Publications

In mirza-tonelli:2014:EACL, we argue that using a simple set of features, avoiding complex pre-processing steps (e.g., discourse parsing, deep syntactic parsing, semantic role labelling), combined with carefully selected contributing features, could result in a better performance compared with the work of dsouza-ng:2013:NAACL-HLT and the best system in TempEval-3 (Task C relation type only), UTTime laokulrat-EtAl:2013:SemEval-2013.

For the QA TempEval task, we submitted our temporal information processing system, HLT-FBK mirza-minard:2015:SemEval, which ranked 1st in all three domains: News, Wikipedia and Blogs (informal narrative text).

Both works serve as the basis of our proposed temporal relation extraction system that will be described in the following sections.

15 Formal Task Definition

For temporal relation extraction, we perform two tasks: identification and classification. We first identify pairs of temporal entities having temporal relations, then classify the temporal relation types of these pairs.

Temporal Relation Identification

Given a text annotated with a document creation time (DCT) and temporal entities, which can be events or timexes, identify which entity pairs are considered as having temporal relations.

Temporal Relation Type Classification

Given an ordered pair of entities (e1, e2), which can be a timex-timex (T-T), event-DCT (E-D), event-timex (E-T) or event-event (E-E) pair, assign a label to the pair, which can be one of the 14 TLINK types: BEFORE, AFTER, INCLUDES, IS_INCLUDED, DURING, DURING_INV, SIMULTANEOUS, IAFTER, IBEFORE, IDENTITY, BEGINS, ENDS, BEGUN_BY or ENDED_BY.


Consider the following excerpt taken from the TimeBank corpus, annotated with events and temporal expressions:

DCT = 1989-10-30 (t0)
According to the filing, Hewlett-Packard acquired (e1) 730,070 common shares from Octel as a result of an Aug. 10, 1988 (t1) stock purchase agreement (e2). That accord (e3) also called (e4) for Hewlett-Packard to buy (e5) 730,070 Octel shares in the open market within 18 months (t2).

The temporal relation extraction system should be able to identify, among others:

  • timex-timex: [t0 AFTER t1], [t2 AFTER t1]

  • event-DCT: [e1 BEFORE t0], [e2 BEFORE t0]

  • event-timex: [e2 IS_INCLUDED t1], [e5 DURING t2]

  • event-event: [e1 AFTER e2], [e3 INCLUDES e4]

16 Method

We propose a hybrid approach for temporal relation extraction, as illustrated in Figure .1. Our system, TempRelPro, is composed of two main modules: (i) temporal relation identification, which is based on a simple set of rules, and (ii) temporal relation type classification, which is a combination of rule-based and supervised classification modules, and a temporal reasoner component in between.

Figure .1: Our proposed temporal relation extraction system, TempRelPro
16.1 Temporal Relation Identification

All possible pairs having temporal relations according to the TempEval-3 task description are extracted using a set of simple rules; pairs of temporal entities satisfying one of the following rules are considered as having temporal links (TLINKs):

  • pairs of main events of consecutive sentences

  • pairs of events in the same sentence

  • pairs of event and timex in the same sentence

  • pairs of event and document creation time

  • pairs of all possible timexes (including the document creation time) linked with each other (note that this is not included in the enumerated possible TLINKs in the TempEval-3 task description)

These pairs are then grouped together into four different groups: timex-timex (T-T), event-DCT (E-D), event-timex (E-T) and event-event (E-E).
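The identification rules above can be sketched as follows. This is a simplified illustration, not the actual implementation: the entity and sentence representations (dicts with `events`, `timexes` and `main_event` fields) are hypothetical stand-ins for the system's internal data structures.

```python
# Sketch of the TLINK candidate identification rules (Section 16.1).
# Each sentence is a dict with 'events', 'timexes' and 'main_event';
# these field names are illustrative, not the system's actual API.

def identify_candidate_pairs(sentences, dct):
    pairs = {"T-T": [], "E-D": [], "E-T": [], "E-E": []}
    timexes = [dct]
    for i, sent in enumerate(sentences):
        timexes.extend(sent["timexes"])
        evs = sent["events"]
        # pairs of events in the same sentence
        for a in range(len(evs)):
            for b in range(a + 1, len(evs)):
                pairs["E-E"].append((evs[a], evs[b]))
        for e in evs:
            # pairs of event and timex in the same sentence
            for t in sent["timexes"]:
                pairs["E-T"].append((e, t))
            # pairs of event and document creation time
            pairs["E-D"].append((e, dct))
        # pairs of main events of consecutive sentences
        if i > 0 and sentences[i - 1]["main_event"] and sent["main_event"]:
            pairs["E-E"].append((sentences[i - 1]["main_event"],
                                 sent["main_event"]))
    # all timexes (including the DCT) linked with each other
    for a in range(len(timexes)):
        for b in range(a + 1, len(timexes)):
            pairs["T-T"].append((timexes[a], timexes[b]))
    return pairs
```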

16.2 Temporal Relation Type Classification

Our approach for temporal relation type classification is inspired by CAEVO chambers-etal:2014:TACL, which combines rule-based and supervised classifiers in a sieve-based architecture. One of the benefits of this architecture is the seamless enforcement of transitivity constraints, by inferring all transitive relations from each classifier-sieve’s output before the graph is passed on to the next one. The classifiers are ordered by precision: the most precise, linguistically motivated rule-based ones are executed first, followed by the machine-learned ones.

We also follow the idea of a sieve-based architecture. However, our proposed system differs from CAEVO in the following respects:

  • We consider all rule-based classifiers as one sieve component (rule-based sieve), and all Support Vector Machine (SVM) classifiers as another one (machine-learned sieve).

  • Instead of running transitive inference after each classifier, we run our temporal reasoner module (Section 16.2.5) on the output of the rule-based sieve, only once.

  • We use the output of the rule-based sieve as features for the machine-learned sieve, specifically:

    • The timex-DCT link label proposed by the timex-timex rules (Section 16.2.1) is used as a feature in the event-timex SVM (Section 16.2.6)

    • The event-DCT link label proposed by the event-DCT rules (Section 16.2.1) is used as a feature in the event-event SVM (Section 16.2.6)

  • In Table .1 we report the comparison between CAEVO’s sieves and ours. Several sieves are not implemented in our system. Some others are different in terms of generality. Note that the last sieve of CAEVO, which labels all unlabelled pairs with VAGUE, is not implemented in our system, but implicitly embedded in our machine-learned models.

CAEVO                      | TempRelPro         | Differences
Rules-Verb/Time Adjacent   | Event-timex Rules  |
Rules-TimeTime             | Timex-timex Rules  |
Rules-Reporting Governor   | Event-event Rules  |
Rules-Reichenbach          | Event-event Rules  |
Rules-General Governor     | Event-event Rules  |
Rules-WordNet              | -                  |
Rules-Reporting DCT        | Event-DCT Rules    | TempRelPro considers all types of events, not only reporting events
ML-E-T SameSent            | Event-timex SVM    | TempRelPro considers both intra- and inter-sentential pairs
ML-E-E SameSent            | Event-event SVM    | TempRelPro considers both intra- and inter-sentential pairs
ML-E-E Dominate            | -                  |
ML-E-DCT                   | Event-DCT SVM      |
Rules-AllVague             | -                  |

Table .1: CAEVO’s classifiers vs TempRelPro’s classifiers
16.2.1 Timex-timex Rules

Only temporal expressions of types DATE and TIME are considered in the hand-crafted set of rules, based on their normalized values. For example, 7 PM tonight with value = 2015-12-12T19:00 IS_INCLUDED in today with value = 2015-12-12.
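One such rule can be illustrated with normalized ISO values: a more fine-grained value that extends another value is included in it. This is a sketch of the idea only (the function name and the prefix test are ours; the actual rule set covers more cases):

```python
# Illustrative timex-timex rule over normalized ISO values: a timex whose
# value refines another's value with finer granularity IS_INCLUDED in it.

def tt_relation(value1, value2):
    if value1 == value2:
        return "SIMULTANEOUS"  # identical normalized values
    if value1.startswith(value2 + "T") or value1.startswith(value2 + "-"):
        return "IS_INCLUDED"   # e.g., a TIME inside its DATE
    if value2.startswith(value1 + "T") or value2.startswith(value1 + "-"):
        return "INCLUDES"
    return None                # no rule applies
```

For example, `tt_relation("2015-12-12T19:00", "2015-12-12")` reproduces the "7 PM tonight IS_INCLUDED in today" example above.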

16.2.2 Event-DCT Rules

The rules for E-D pairs are based on the tense and/or aspect of the event word:

  • If tense = PAST and aspect = PERFECTIVE then [E BEFORE D]

  • If tense = PRESENT and aspect = PROGRESSIVE then [E INCLUDES D]

  • If tense = PRESENT and aspect = PERFECTIVE_PROGRESSIVE then [E INCLUDES D]

  • If tense = FUTURE then [E AFTER D]
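The four rules above can be transcribed directly (the function name is ours; tense and aspect values follow the TimeML attribute names):

```python
# Direct transcription of the event-DCT rules based on tense and aspect.
def ed_relation(tense, aspect):
    if tense == "PAST" and aspect == "PERFECTIVE":
        return "BEFORE"
    if tense == "PRESENT" and aspect in ("PROGRESSIVE",
                                         "PERFECTIVE_PROGRESSIVE"):
        return "INCLUDES"
    if tense == "FUTURE":
        return "AFTER"
    return None  # no rule applies; left to the machine-learned sieve
```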

16.2.3 Event-timex Rules

Many prepositions in English have temporal senses, as discussed in The Preposition Project (TPP) Litkowski:Hargraves:06 and the Pattern Dictionary of English Prepositions (PDEP) litkowski:2014:P14-1. We took the list of temporal prepositions (http://www.clres.com/db/classes/ClassTemporal.php) and built a set of rules for E-T pairs based on their temporal senses (tsense). The rules are only applied when a temporal preposition establishes a temporal-modifier relationship between an event (E) and a timex (T), based on the existing dependency path:

  • If tsense = TimePoint (e.g., in, at, on) then [E IS_INCLUDED T]

  • If tsense = TimePreceding (e.g., before) then [E BEFORE T]

  • If tsense = TimeFollowing (e.g., after) then [E AFTER T]

  • If tsense = Duration (e.g., during, throughout) then [E DURING T]

  • If tsense = StartTime (e.g., from, since) then [E BEGUN_BY T]

  • If tsense = EndTime (e.g., until) then [E ENDED_BY T]

In the absence of a temporal preposition, a timex might simply be a temporal modifier of an event, as exemplified in “Police confirmed E Friday T that the body was found…”. In this case, we assume that [E IS_INCLUDED T].

Moreover, sometimes events are modified by temporal expressions marking the starting time and ending time in a duration pattern, for example ‘between tmx_begin and tmx_end’, ‘from tmx_begin to/until tmx_end’ or ‘tmx_begin-tmx_end’. We define the rules as follows:

  • If T matches tmx_begin then [E BEGUN_BY T]

  • If T matches tmx_end then [E ENDED_BY T]
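The preposition-sense rules above, together with the bare-temporal-modifier default, can be sketched as a simple lookup (the function and parameter names are ours; the sense labels follow the TPP/PDEP classes listed above):

```python
# Sketch of the event-timex rules: temporal-preposition senses mapped to
# TLINK types, with IS_INCLUDED as the default when a timex is a bare
# temporal modifier of the event ("Police confirmed Friday that ...").
TSENSE_TO_TLINK = {
    "TimePoint": "IS_INCLUDED",     # in, at, on
    "TimePreceding": "BEFORE",      # before
    "TimeFollowing": "AFTER",       # after
    "Duration": "DURING",           # during, throughout
    "StartTime": "BEGUN_BY",        # from, since
    "EndTime": "ENDED_BY",          # until
}

def et_relation(tsense=None, is_temporal_modifier=False):
    if tsense in TSENSE_TO_TLINK:
        return TSENSE_TO_TLINK[tsense]
    if is_temporal_modifier:
        return "IS_INCLUDED"
    return None
```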

16.2.4 Event-event Rules

The first set of rules applied to E-E pairs is based on the existing dependency path (dep) between the first event and the second event (the dependency path syntax follows the CoNLL-2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies surdeanu-EtAl:2008:CONLL):

  • If E is the logical subject of E (a passive verb), i.e., dep = LGS-PMOD then [E AFTER E], e.g., “The disastrous chain reaction touched E off by the collapse E of Lehman Brothers…”

  • If E is the locative adverb of E, i.e., dep = LOC-PMOD then [E IS_INCLUDED E], e.g., “China’s current economic policies cause an enormous surge E in coal consumption E.”

  • If E is the predicative complement of E (a raising/control verb), i.e., dep = OPRD-IM or dep = OPRD:

    • If E is an aspectual verb for initiation (e.g., begin, start) then [E BEGINS E], e.g., “The situation began E to relax E in the early 1990s.”

    • If E is an aspectual verb for culmination/termination (e.g., finish, stop) then [E ENDS E], e.g., “There ’s some price at which we ’d stop E bidding E.”

    • If E is an aspectual verb for continuation (e.g., continue, keep) then [E INCLUDES E], e.g., “The maturing industry ’s growth continues E to slow E.”

    • If E is a general verb and aspect = PERFECTIVE_PROGRESSIVE then [E SIMULTANEOUS E], e.g., “Hewlett-Packard have been working E to develop E quantum computers.”

    • If E is a general verb then [E BEFORE E], e.g., “The AAR consortium attempted E to block E a drilling joint venture.”
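The predicative-complement rules above (dep = OPRD-IM or OPRD) can be sketched as follows. The small aspectual-verb lists are the examples given in the text, not the full lexicon used by the system, and the function name is ours:

```python
# Sketch of the predicative-complement rules for event-event pairs.
ASPECTUAL_INIT = {"begin", "start"}        # initiation
ASPECTUAL_END = {"finish", "stop"}         # culmination/termination
ASPECTUAL_CONT = {"continue", "keep"}      # continuation

def ee_oprd_relation(governor_lemma, aspect=None):
    if governor_lemma in ASPECTUAL_INIT:
        return "BEGINS"        # "began to relax"
    if governor_lemma in ASPECTUAL_END:
        return "ENDS"          # "stop bidding"
    if governor_lemma in ASPECTUAL_CONT:
        return "INCLUDES"      # "continues to slow"
    if aspect == "PERFECTIVE_PROGRESSIVE":
        return "SIMULTANEOUS"  # "have been working to develop"
    return "BEFORE"            # general verb: "attempted to block"
```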

The other sets of rules are taken from CAEVO chambers-etal:2014:TACL, including:

  • Rules for links between a reporting event and another event that is syntactically dominated by the reporting event, based on the tense and aspect of both events.

  • Reichenbach rules based on the analysis of the role played by various tenses of English verbs in conveying temporal discourse reichenbach47.

16.2.5 Temporal Reasoner

Consider as an example the following news excerpt taken from the TimeBank corpus, annotated with events and temporal expressions:

She (Madeleine Albright, Ed.) then lavished praise, and the State Department’s award for heroism, on embassy staffers before meeting e with bombing victims at the Muhimbili Medical Center and with government officials. […] (During the meeting, Ed.) Albright announced e a gift of 500 pounds (225 kilograms) of medical supplies to Tanzania and Kenya from the Walter Reed Army Medical Center. She also pledged e to ask e Congress to approve […]

The annotated temporal relations of the document are the following: [e BEFORE e], [e BEFORE e], [e IS_INCLUDED e] (this is an annotation error that later causes an inconsistency in the temporal graph during consistency checking) and [e INCLUDES e].

An annotated TimeML document can be mapped into a constraint problem according to how TLINKs are mapped into Allen relations (Table .1). A possible mapping is as follows:

  • {before} and {after} for BEFORE and AFTER

  • {during} and {contains} for DURING and DURING_INV

  • {during} and {contains} for IS_INCLUDED and INCLUDES

  • {starts} and {started-by} for BEGINS and BEGUN_BY

  • {finishes} and {finished-by} for ENDS and ENDED_BY

For instance, the TLINKs in the previous excerpt can be mapped as follows: {before} for BEFORE between announced and pledged; {during} for IS_INCLUDED between meeting and announced; and {contains} for INCLUDES between meeting and pledged. Other mappings are possible, e.g., by relaxing the mapping of BEFORE and its inverse AFTER into {before, meets} and {after, met-by}, respectively, considering vagueness in interpreting temporal annotations. For example, {before} for BEFORE between announced and pledged could be replaced by {before, meets} in case of uncertainty whether one event is before or immediately before the other.

These and other mappings are handled by the Service-oriented Qualitative Temporal Reasoner (SQTR), which was developed for reasoning on TimeML documents within the TERENCE FP7 project (GA n. 257410) in a Service-Oriented Architecture context erl2004service. SQTR is used to check consistency and perform deduction, and relies on the Generic Qualitative Reasoner (GQR), a fast solver for generic qualitative constraint problems such as Allen constraint problems. The rationale for preferring GQR over other solutions, such as fast SAT solvers, is its scalability, simplicity of use and efficiency DBLP:conf/ijcai/WestphalW09.

SQTR behaves as follows:

  • In case of consistency checking, SQTR maps the TimeML document into a GQR constraint problem, invokes GQR, and returns a true/false value. In case of consistency, it also returns the mapping for which consistency is found for informing the deduction operation. If we consider the previous example, the system will detect an inconsistency, caused by the annotation of IS_INCLUDED between meeting and announced, which should be INCLUDES instead for the set of TLINKs to be consistent.

  • In case of deduction, SQTR maps the TimeML document into a GQR constraint problem, invokes GQR, maps the GQR output to a TimeML document, marks the deduced TLINKs with an attribute deduced set to true, and returns such a document as the result. The system will deduce, for example, a new relation BEFORE between announced and ask, because the same relation holds between announced and pledge and between pledge and ask.

Note that the temporal reasoner only deduces new TLINKs if the TimeML document is found to be consistent.
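The deduction step can be illustrated with the announced/pledged/ask example. The real SQTR/GQR reasoner works over full Allen constraint problems; this sketch (with a function name of our choosing) covers only the transitive composition of BEFORE:

```python
# Minimal illustration of deduction by transitivity: BEFORE composed with
# BEFORE yields BEFORE, as in "announced BEFORE pledged" and
# "pledged BEFORE ask" entailing "announced BEFORE ask".

def deduce_before(tlinks):
    """tlinks: set of (source, 'BEFORE', target) tuples."""
    closed = set(tlinks)
    changed = True
    while changed:
        changed = False
        for (a, _, b) in list(closed):
            for (c, _, d) in list(closed):
                if b == c and (a, "BEFORE", d) not in closed:
                    closed.add((a, "BEFORE", d))
                    changed = True
    # return only the newly deduced links (marked deduced=true by SQTR)
    return closed - set(tlinks)
```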

16.2.6 SVM Classifiers

We built three supervised classification models, one each for event-DCT (E-D), event-timex (E-T) and event-event (E-E) pairs, using the LIBLINEAR REF08a L2-regularized L2-loss linear SVM (dual) with default parameters, and a one-vs-rest strategy for multi-class classification.
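The one-vs-rest decision rule used by these classifiers can be sketched in pure Python: one linear scorer per label, predicting the label with the highest score. The weights here are toy values standing in for the LIBLINEAR-trained ones, and the function name is ours:

```python
# Sketch of the one-vs-rest decision rule over linear scorers.
def ovr_predict(x, weight_vectors):
    """x: feature vector; weight_vectors: dict label -> weight list.
    Returns the label whose linear scorer gives the highest score."""
    def score(w):
        return sum(wi * xi for wi, xi in zip(w, x))
    return max(weight_vectors, key=lambda lab: score(weight_vectors[lab]))
```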

Tools and Resources

Several external tools and resources are used to extract features from each temporal entity pair, including:

  • The TextPro tool suite (http://textpro.fbk.eu/) PIANTA08.645 to get the morphological analysis (PoS tags, shallow phrase chunks) of each token in the text.

  • Mate tools (http://code.google.com/archive/p/mate-tools/) bjorkelund-EtAl:2010:COLING-DEMOS to extract the dependency path between tokens in the document.

  • A WordNet similarity module (http://ws4jdemo.appspot.com/) to compute (Lin) semantic similarity/relatedness Lin:1998:IDS:645527.657297 between words.

  • Temporal signal lists as described in mirza-tonelli:2014:EACL. We further expand the lists using the Paraphrase Database ganitkevitch2013ppdb, and manually cluster some signals together, e.g., {before, prior to, in advance of}. In total we have 50 timex-related and 138 event-related temporal signals, clustered into 27 and 35 clusters, respectively (Appendix A).

Feature (E-D / E-T / E-E; Rep.) and description:

Morphosyntactic information
  PoS (x x x; one-hot): Part-of-speech tags of both entities.
  phraseChunk (x x x; one-hot): Shallow phrase chunks of both entities.
  samePoS (x x; binary): Whether both entities have the same PoS.
Textual context
  entityOrder (x; binary): Appearance order of the entities in the text. (The order of the entities in event-event pairs always follows the appearance order in the text, while in event-timex pairs the second entity is always a timex regardless of the appearance order.)
  sentenceDistance (x x; binary): 0 if both entities are in the same sentence, 1 otherwise.
  entityDistance (x x; binary): 0 if the entities are adjacent, 1 otherwise.
EVENT attributes
  class (x x x; one-hot): EVENT attributes as specified in TimeML.
  tense (x x x; one-hot)
  aspect (x x x; one-hot)
  polarity (x x x; one-hot)
  sameClass (x; binary): Whether both events have the same EVENT attributes.
  sameTenseAspect (x; binary)
  samePolarity (x; binary)
TIMEX3 attributes
  type (x x; one-hot): TIMEX3 attributes as specified in TimeML.
Dependency information
  dependencyPath (x; one-hot): Dependency path between the entities.
  isMainVerb (x x x; binary): Whether the event is the main verb of the sentence.
Temporal signals
  signalTokens (x x; one-hot): Tokens (cluster) of the temporal signal around the entities.
  signalPosition (x x; one-hot): Temporal signal position w.r.t. the entity, e.g., BETWEEN, BEFORE, BEGIN, etc.
  signalDependency (x x; one-hot): Dependency path between the signal tokens and the entity.
Lexical semantic information
  wnSim (x; one-hot): WordNet similarity computed between the lemmas of the two events.
TLINK labels from the rule-based sieve
  timex-DCT label (x; one-hot): The TLINK type of the timex and DCT pair (if any).
  event-DCT label (x; one-hot): The TLINK types of the event and DCT pairs (if any).

Table .2: Feature set for event-DCT (E-D), event-timex (E-T) and event-event (E-E) classification models, along with each feature representation (Rep.) in the feature vector and feature descriptions.
Feature Set

The implemented features are listed in Table .2. Some features are computed independently from either entity of the temporal entity pair, while others are pairwise features, computed from both entities. In order to keep the feature vector at a reasonable size, we simplified the possible values of some features during the one-hot encoding:

  • dependencyPath We only consider a few dependency paths between the event pair, denoting, e.g., coordination, subordination, subject and object relations.

  • signalTokens The cluster ID of the signal cluster, e.g., {before, prior to, in advance of}, is used as a feature instead of the raw signal tokens.

  • signalDependency For each atomic label in a vector of syntactic dependency labels according to surdeanu-EtAl:2008:CONLL (we manually selected a subset of such labels that are relevant for temporal modifiers), if the signal dependency path contains the atomic label, the corresponding value in the feature vector is flipped to 1. Hence, TMP-SUB and SUB-TMP have the same one-hot representation.

  • wnSim The value of the WordNet similarity measure is discretized into a fixed set of ranges.

Note that several features from mirza-tonelli:2014:EACL, such as string features and temporal discourse connectives (the information about discourse connectives was acquired using the addDiscourse tool pitler-nenkova:2009:Short, which identifies connectives based on syntactic constructions and assigns them to one of four semantic classes: Temporal, Expansion, Contingency and Comparison), are not used. String features, i.e., the token and lemma of the temporal entities, are removed in order to increase the classifiers’ robustness in dealing with completely new texts with different vocabularies; instead, we include WordNet similarity in the feature set. Temporal discourse connectives are no longer included as features because they did not prove beneficial.

Label Simplification

During the feature extraction process for training the classification models, we collapse some labels, i.e., IBEFORE into BEFORE, IAFTER into AFTER, DURING and DURING_INV into SIMULTANEOUS, in order to simplify the learning process, also considering the sparse annotation of such labels in the datasets.
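The label collapsing described above amounts to a simple mapping (the function and dict names are ours):

```python
# Label simplification applied before training the classification models.
COLLAPSE = {
    "IBEFORE": "BEFORE",
    "IAFTER": "AFTER",
    "DURING": "SIMULTANEOUS",
    "DURING_INV": "SIMULTANEOUS",
}

def simplify(label):
    return COLLAPSE.get(label, label)  # other labels are left unchanged
```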

17 Evaluation

17.1 TempEval-3 Evaluation

We use the same training and test data released in the context of TempEval-3. Two types of training data were made available in the challenge: TBAQ-cleaned and TE3-Silver-data. The former includes the cleaned and improved versions of the TimeBank 1.2 corpus and the AQUAINT TimeML corpus (see Section 8). TE3-Silver-data, instead, is a 600K-word corpus annotated by the best performing systems at TempEval-2, which we do not use because it proved not to be useful for the temporal relation extraction task uzzaman-EtAl:2013:SemEval-2013. For evaluation, the newly created TempEval-3-platinum evaluation corpus is used. The distribution of the relation types in the training and test datasets is shown in Table .3.

TBAQ-cleaned (training)              TempEval-3-platinum (test)
T-T  E-D  E-T  E-E    Relation    T-T  E-D  E-T  E-E
168 1366 230 1797 BEFORE 7 90 5 210
29 205 124 1141 AFTER 2 26 5 184
2 1 2 70 IBEFORE 2 5 2
1 1 4 33 IAFTER 1 7 1
111 5 742 IDENTITY 15
2 2 56 519 SIMULTANEOUS 4 0 6 81
107 554 141 462 INCLUDES 8 37 2 40
29 471 1782 262 IS_INCLUDED 4 14 114 47
61 119 38 DURING
1 19 42 DURING_INV 1
1 23 48 BEGINS 1 1 1
3 11 56 38 BEGUN_BY 1 1
1 65 33 ENDS 2 1
14 8 47 46 ENDED_BY 2
468 2681 2673 5271 Total 30 167 150 583
Table .3: The distribution of each relation type in the datasets for each type of temporal entity pairs: timex-timex (T-T), event-DCT (E-D), event-timex (E-T) and event-event (E-E).
Evaluation Metrics

TempEval-3 introduced an evaluation metric uzzaman-allen:2011:ACL-HLT2011 capturing temporal awareness in terms of precision, recall and F1-score. To compute precision and recall, the correctness of annotated temporal links is verified using temporal closure, by checking the existence of the identified relations in the closure graph. There is, however, a minor variation of the formula: the reduced graph of relations is considered instead of all relations of the system and reference.

(Details can be found in Chapter 6 of uzzaman-2012.)

Precision is the proportion of reduced system relations that can be verified from the reference annotation’s temporal closure graph, out of the number of temporal relations in the reduced system relations. Similarly, Recall is the proportion of reduced reference annotation relations that can be verified from the system’s temporal closure graph, out of the number of temporal relations in the reduced reference annotation. In order to replicate this type of evaluation, we use the scorer made available to the task participants.
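Given the reduced and closed relation graphs (the graph reduction and closure computations themselves are not shown here), the temporal-awareness scores described above can be sketched as:

```python
# Sketch of the temporal-awareness metric: precision checks reduced system
# relations against the reference closure; recall checks reduced reference
# relations against the system closure. Relations are modelled as hashable
# tuples collected in sets; computing the reduced/closure graphs is assumed.

def temporal_awareness(sys_reduced, ref_closure, ref_reduced, sys_closure):
    p = sum(1 for rel in sys_reduced if rel in ref_closure) / len(sys_reduced)
    r = sum(1 for rel in ref_reduced if rel in sys_closure) / len(ref_reduced)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```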

Evaluation Results

We compare in Table .4 the performance of TempRelPro to the other systems participating in temporal relation tasks of TempEval-3, Task C and Task C relation type only, according to the figures reported in uzzaman-EtAl:2013:SemEval-2013. We also compare TempRelPro performance with our preliminary results reported in mirza-tonelli:2014:EACL for Task C relation type only.

For the temporal relation type classification task (Task C relation type only), TempRelPro achieves the best performance with 61.86% F1-score. For the temporal relation extraction task (Task C), our approach is most similar to UTTime-1, which obtained the highest recall in TempEval-3. In comparison with UTTime-1, we double the precision without sacrificing too much recall. TempRelPro achieves the best F1-score of 40.15%, an almost 4-point increase over the best system in TempEval-3, ClearTK-2.

System Task C Task C relation type only
P R F1 P R F1
TempRelPro 30.30 59.49 40.15 62.13 61.59 61.86
mirza-tonelli:2014:EACL - - - 58.80 58.17 58.48
ClearTK-2 37.32 35.25 36.26 - - -
UTTime-5 35.94 33.92 34.90 53.85 55.58 54.70
NavyTime-1 35.48 27.62 31.06 46.59 47.07 46.83
JU-CSE 21.04 35.47 26.41 35.07 34.48 34.77
UTTime-1 15.18 65.64 24.65 55.58 57.43 56.45
Table .4: Tempeval-3 evaluation on temporal relation extraction tasks
Sieve T-T E-D E-T E-E Overall
P R P R P R P R P R F1
Temporal Relation Identification
0.03 0.67 0.37 0.99 0.33 0.99 0.42 0.97 0.40 0.96 0.56
Without T-T 0.53 0.95 0.68
Temporal Relation Type Classification
RB 0.85 0.57 1 0.08 0.91 0.39 0.91 0.05 0.91 0.13 0.22
ML 0.77 0.76 0.73 0.72 0.53 0.51 0.61 0.58 0.59
RB + TR 0.85 0.57 1 0.17 0.92 0.48 0.89 0.06 0.92 0.16 0.28
RB + ML 0.85 0.57 0.78 0.77 0.73 0.72 0.53 0.51 0.62 0.59 0.61
RB + TR + ML 0.85 0.57 0.79 0.78 0.75 0.81 0.53 0.51 0.63 0.61 0.62
Majority labels 0.35 0.23 0.55 0.54 0.77 0.76 0.37 0.36 0.47 0.45 0.46
Table .5: TempRelPro performances per module on temporal relation identification and type classification, evaluated on the TempEval-3 evaluation corpus. RB = rule-based sieve, ML = machine-learned sieve and TR = temporal reasoner.

We also report in Table .5 the performances of each module included in TempRelPro, evaluated on TempEval-3-platinum. The temporal relation identification module (Section 16.1) obtains a very low precision for T-T pairs because the dataset contains very few annotated timex-timex links. If we remove the T-T pairs, we can increase the F1-score for the temporal relation identification task by 12%. Therefore, in our final annotated TimeML documents for the TempEval-3 evaluation, T-T pairs are not included, even though they play a big role in the temporal relation type classification task.

Regarding the temporal relation type classification modules (Section 16.2), there is no significant improvement by combining rule-based and machine-learned sieves (RB + ML), compared with only using machine-learned classifiers (ML), particularly for E-T and E-E pairs. However, introducing the temporal reasoner in between (RB + TR + ML) results in significant improvement especially for E-T pairs, since recall increases from 72% to 81%. We also compare TempRelPro performances for this classification task to a majority class baseline for each temporal entity type according to the distribution of temporal relation types in the training data (Table .3), i.e., BEFORE for T-T, BEFORE for E-D, IS_INCLUDED for E-T and BEFORE for E-E pairs.

17.2 TimeBank-Dense Evaluation

We follow the experimental setup in chambers-etal:2014:TACL, in which the TimeBank-Dense corpus (mentioned in Section 8) is split into a 22-document training set, a 5-document development set and a 9-document test set (available at http://www.usna.edu/Users/cs/nchamber/caevo/). All the classification models for the machine-learned sieve are trained using the training set. We evaluate our system performance on the test set.


The set of TLINK types used in the TimeBank-Dense corpus is different from the one used in TempEval-3. Some relation types are not used, and the VAGUE relation introduced at the first TempEval task verhagen-EtAl:2007:SemEval-2007 is adopted to cope with ambiguous temporal relations, or to indicate pairs for which no clear temporal relation exists. The final set of TLINK types in TimeBank-Dense includes: BEFORE, AFTER, INCLUDES, IS_INCLUDED, SIMULTANEOUS and VAGUE. Therefore, we map the relation types of TLINKs labelled by TempRelPro as follows (we tried different mappings in our experiments and found this mapping to give the best outcome):




Moreover, we introduce some rules for E-D and E-T pairs to recover the VAGUE relations, such as:

  • If the PoS tag of the event in an E-D pair is an adjective then [E VAGUE D]

  • If the timex value in an E-T pair is PAST_REF, PRESENT_REF or FUTURE_REF then [E VAGUE T]
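The two VAGUE-recovery rules above can be transcribed as predicate functions (the function names are ours; the PoS and timex-value conventions follow the chapter):

```python
# Direct transcription of the VAGUE-recovery rules for E-D and E-T pairs.
def ed_vague(event_pos):
    """E-D pair is VAGUE if the event word is an adjective."""
    return event_pos == "ADJECTIVE"

def et_vague(timex_value):
    """E-T pair is VAGUE if the timex value is an unresolved reference."""
    return timex_value in ("PAST_REF", "PRESENT_REF", "FUTURE_REF")
```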

System T-T E-D E-T E-E Overall
P/R/F1 P/R/F1 P/R/F1 P/R/F1 P R F1
TempRelPro 0.780 0.518 0.556 0.487 0.512 0.510 0.511
CAEVO 0.712 0.553 0.494 0.494 0.508 0.506 0.507
Table .6: TempRelPro performances evaluated on the TimeBank-Dense test set and compared with CAEVO.
Evaluation Results

In Table .6 we report the performances of TempRelPro compared with CAEVO. We achieve a small improvement in the overall F1-score, i.e., 51.1% vs 50.7%. For each temporal entity pair type, since we label all possible links, precision and recall are the same. TempRelPro is significantly better than CAEVO in labelling T-T and E-T pairs.

Sieve                    T-T       E-D       E-T       E-E       Overall (P R F1)     CAEVO (P R F1)
Temporal Relation Type Classification
RB                       0.780  0.667 0.070  0.705 0.073  0.609 0.010  0.727 0.049 0.092
ML                       0.473  0.480  0.488  0.484 0.471 0.478  0.458 0.202 0.280
RB + TR                  0.780  0.722 0.125  0.700 0.166  0.546 0.013  0.713 0.076 0.138
RB + ML                  0.780  0.495  0.480  0.488  0.495 0.493 0.494  0.486 0.240 0.321
RB + TR + ML             0.780  0.518  0.556  0.487  0.512 0.510 0.511  0.505 0.328 0.398
RB + TR + ML + AllVague                                                 0.507 0.507 0.507

Table .7: TempRelPro performances per module on temporal relation type classification, evaluated on the TimeBank-Dense test set, and compared with CAEVO. RB = rule-based sieve, ML = machine-learned sieve and TR = temporal reasoner.

We also report in Table .7 the performances of each module composing TempRelPro, evaluated on the TimeBank-Dense test set. Note that one of the differences between TempRelPro and CAEVO is that in TempRelPro the machine-learned sieve (ML) is the last sieve, while in CAEVO AllVague is. This explains the big difference in F1-score for the RB + TR + ML composition between TempRelPro and CAEVO, i.e., 51.1% vs 39.8%.

In general, combining RB and ML modules results in a slight improvement (47.8% to 49.4% F1-score), especially for T-T (since there is no ML classifier for T-T pairs) and E-D pairs, but not for E-T and E-E pairs. Introducing the TR module in between (RB + TR + ML) is even more beneficial, resulting in overall 51.1% F1-score, especially for E-T pairs with an increase from 48% to 55.6% F1-score. This is in line with the results of the TempEval-3 evaluation (Section 17.1).

With only two sieves, TempRelPro is arguably more efficient than CAEVO, because (i) the temporal closure inference over extracted TLINKs is run only once and (ii) we use fewer classifiers in general (see Table .1). Our decision to consider all rule-based classifiers as one sieve is motivated by the hypothesis that the entity pairs generated by each rule-based classifier, i.e., E-D, E-T and E-E pairs, are independent of each other. Using the consistency-checking module of the temporal reasoner, we found that all the documents in the test set, as annotated by the rule-based classifiers, are consistent, which supports our hypothesis.

17.3 QA TempEval Evaluation

The training data set is the TimeML annotated data released by the task organizers, which includes the TBAQ-cleaned and TE3-Platinum corpora reused from the TempEval-3 task uzzaman-EtAl:2013:SemEval-2013. The test data are 30 plain texts from the news, Wikipedia and blog domains (10 documents each). For evaluating the system, 294 temporal-based questions and the test data annotated with the entities relevant for the questions are used.

Temporal Entity Extraction System

We use the same systems reported in mirza-minard:2015:SemEval for timex and event extraction.

Evaluation System

Given the documents labelled by the participating systems, the evaluation process consists of three main steps llorens-EtAl:2015:SemEval:

  • ID normalization: this step is performed because systems may provide different IDs to the same temporal entities annotated in the gold standard test data.

  • Timegraph generation: Timegraph gerevini1995 is used to compute temporal closure as proposed by journals/ci/MillerS90. Timegraph is first initialized by adding the system’s explicit TLINKs. Then the Timegraph’s reasoning mechanism infers implicit relations through rules such as transitivity.

  • Question processing: queries are converted into point-based queries in order to check the necessary point relations in Timegraph to verify an interval relation. For example, to answer a question of the form “is A AFTER B”, the evaluation system verifies the corresponding point relations; if they are verified then the answer is true (YES), if they conflict with the Timegraph then it is false (NO), otherwise it is UNKNOWN.
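The point-based query step can be sketched as follows for the AFTER relation: "A AFTER B" holds iff end(B) precedes start(A). The data representation (point ids and a set of `<` facts standing in for the Timegraph) is ours:

```python
# Sketch of answering an interval query from point relations, as in the
# QA TempEval evaluation. Intervals are (start, end) point-id pairs;
# point_order is a set of (x, '<', y) facts playing the role of Timegraph.

def answer_after(a, b, point_order):
    start_a, _ = a
    _, end_b = b
    if (end_b, "<", start_a) in point_order:
        return "YES"      # end(B) before start(A): A AFTER B verified
    if (start_a, "<", end_b) in point_order:
        return "NO"       # conflicts with A AFTER B
    return "UNKNOWN"      # Timegraph cannot decide
```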

Evaluation Metrics

For each question the obtained answer from the Timegraph (created with system annotations) is compared with the expected answer (human annotated).

Recall (QA accuracy) is used as the main metric to rank the systems, and F1-score is used in case of ties in recall. Coverage measures how many questions a system can answer, regardless of correctness.

Evaluation Results

We compare TempRelPro with our previous system submitted for QA TempEval, HLT-FBK mirza-minard:2015:SemEval, in Table .8. HLT-FBK shows a significant improvement when an event co-reference rule is included (HLT-FBK + coref): whenever two events co-refer, the E-E pair is excluded from the classifier and automatically labelled SIMULTANEOUS. The event co-reference information was obtained from the NewsReader pipeline (more information about the NewsReader pipeline, as well as a demo, is available on the project website http://www.newsreader-project.eu/results/). For TempRelPro, we include the event co-reference rule in the rule-based sieve for E-E pairs (Section 16.2.4). Using event co-reference, the overall performance of TempRelPro (TempRelPro + coref) is slightly improved, especially in the Blogs domain. In general, HLT-FBK + coref is very good at covering the number of questions answered (Cov), but not at answering accurately.

System               News (Cov P R F1)        Wikipedia (Cov P R F1)   Blogs (Cov P R F1)       All (R)
TempRelPro           0.62 0.62 0.38 0.48      0.55 0.74 0.41 0.52      0.34 0.45 0.15 0.23      0.34
TempRelPro + coref   0.61 0.63 0.38 0.48      0.55 0.74 0.41 0.52      0.37 0.50 0.18 0.27      0.35
HLT-FBK              0.36 0.56 0.20 0.30      0.29 0.58 0.17 0.26      0.29 0.47 0.14 0.21      0.17
HLT-FBK + coref      0.69 0.43 0.29 0.35      0.58 0.62 0.36 0.46      0.58 0.34 0.20 0.25      0.30

Table .8: TempRelPro performances in terms of coverage (Cov), precision (P), recall (R) and F1-score (F1), compared with HLT-FBK.
System Cov P R F1
TempRelPro 0.53 0.65 0.34 0.45
TempRelPro + coref 0.53 0.66 0.35 0.46
HLT-FBK + trefl 0.48 0.61 0.29 0.39
HLT-FBK + coref + trefl 0.67 0.51 0.34 0.40
HITSZ-ICRC + trefl 0.15 0.58 0.09 0.15
CAEVO + trefl 0.36 0.60 0.21 0.32
TIPSemB + trefl 0.37 0.64 0.24 0.35
TIPSem + trefl 0.40 0.68 0.27 0.38
Table .9: TempRelPro performances in terms of coverage (Cov), precision (P), recall (R) and F1-score (F1) for all domains, compared with systems in QA TempEval augmented with TREFL.

The QA TempEval organizers also provide an extra evaluation, augmenting the participating systems with a time expression reasoner (TREFL) as a post-processing step llorens-EtAl:2015:SemEval. The TREFL component adds TLINKs between timexes based on their resolved values. Note that TempRelPro already includes the T-T links in the final TimeML documents produced, based on the output of the rule-based sieve for T-T pairs (Section 16.2.1). In Table .9 we report the performance of TempRelPro compared with participating systems in QA TempEval, augmented with TREFL, as reported in llorens-EtAl:2015:SemEval. A comparison with off-the-shelf systems not optimized for the task, i.e., CAEVO chambers-etal:2014:TACL, which is the same system reported in Section 17.2, and TIPSemB and TIPSem llorens-saquete-navarro:2010:SemEval, was also provided. TempRelPro + coref achieves the best performance with 35% recall and 46% F1-score.

18 Conclusions

Our decision to focus on temporal relation extraction is driven by the low performance of state-of-the-art systems in the TempEval-3 evaluation campaign for this particular task (36.26% F1-score), compared with system performances for the temporal entity extraction tasks (>80% F1-score). We have described our approach in building an improved temporal relation extraction system, TempRelPro, which is inspired by the sieve-based architecture for temporal ordering introduced by chambers-etal:2014:TACL with their system, CAEVO. However, our approach differs from CAEVO in adopting a simpler architecture, considering all rule-based classifiers as one sieve and all machine-learned classifiers as another one. Hence, we run our temporal reasoner module only once, in between the two sieves. Moreover, we introduced a novel method that uses the rule-based sieve output, particularly the labels of timex-DCT and event-DCT links, as features for the supervised event-timex and event-event classifiers.

We have evaluated TempRelPro using the TempEval-3 evaluation scheme, obtaining 40.15% F1-score, a significant improvement over the best performing system in TempEval-3, ClearTK-2, with 36.26% F1-score. Unfortunately, building and evaluating an automatic temporal relation extraction system is not trivial, given the sparsely annotated temporal relations as in TempEval-3. Without a completely labelled graph of temporal entities, we cannot speak of true extraction, but rather of matching human annotation decisions that were constrained by time and effort. This is shown by the low precision achieved by TempRelPro, since it extracts many TLINKs whose true labels are unknown. Therefore, we also evaluated TempRelPro following the TimeBank-Dense evaluation methodology chambers-etal:2014:TACL and QA TempEval llorens-EtAl:2015:SemEval. In general, TempRelPro performs best in both evaluation methodologies.

According to the TempEval-3 and TimeBank-Dense evaluation schemes, component-wise, combining rule-based and machine-learned sieves (RB + ML) results in a slight improvement. However, introducing the temporal reasoner module in between the sieves (RB + TR + ML) is quite beneficial, especially for E-T pairs.
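One inference rule a temporal reasoner applies, and the reason it helps especially where rule-based labels are dense, is transitivity: if A is BEFORE B and B is BEFORE C, then A is BEFORE C. A minimal sketch of this single rule (a real reasoner handles the full relation algebra, not only BEFORE):

```python
# Illustrative sketch: transitive closure of BEFORE links,
# given as a set of (earlier, later) entity-id pairs.

def close_before(links):
    """Repeatedly apply BEFORE transitivity until a fixed point."""
    closed = set(links)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed
```

For example, from a rule-derived event-DCT link and a timex-DCT link, closure can yield a new event-timex link that neither sieve produced directly.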

If we look into each type of temporal entity pair, TempRelPro still performs poorly on E-E pairs. In the TempEval-3 evaluation, TempRelPro's performances when labelling T-T, E-D and E-T pairs are already above 70% F1-score, but only around 50% F1-score for E-E pairs. In the TimeBank-Dense evaluation, its performance for E-E pairs is likewise around 50% F1-score. Our efforts to improve the performance of the supervised classifier for E-E pairs, i.e., combining it with a rule-based classifier, introducing a temporal reasoner module, or including event-DCT labels as features, do not result in a better outcome.

There are several directions that we look into regarding this issue. The first is building a causal relation extraction system, because causal relations carry a temporal constraint: the causing event always happens BEFORE the resulting event. In the following chapters we will discuss the interaction between these two types of relations, and whether extracting causal relations can help improve the output of a temporal relation extraction system, especially for pairs of events (Chapter ).
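The precedence presumption can be expressed very compactly: whenever a causal link connects two events, the temporal label of that event pair is forced to BEFORE (and its inverse to AFTER). The function and data shapes below are a hypothetical sketch of this constraint, not CATENA's actual interface.

```python
# Illustrative sketch: if CLINK(cause, effect) holds, then the
# TLINK for (cause, effect) must be BEFORE. Names are hypothetical.

def apply_causal_constraint(tlinks, clinks):
    """Override temporal labels of causally linked event pairs."""
    corrected = dict(tlinks)
    for cause, effect in clinks:
        corrected[(cause, effect)] = "BEFORE"
        corrected[(effect, cause)] = "AFTER"
    return corrected
```

In an integrated system this correction is most useful exactly where the temporal classifier is weakest, i.e., on E-E pairs.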

The other direction would be to exploit lexical semantic information about the event words in building a supervised classifier for E-E pairs, using word embeddings and deep learning techniques (Chapter ). Currently, the only lexical semantic information used by the event-event SVM classifier in TempRelPro is the WordNet similarity measure between pairs of words.
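As a sketch of what an embedding-based alternative to the WordNet similarity feature might look like: the cosine similarity between the embedding vectors of the two event words can serve as a single real-valued feature for the classifier. The vectors below are toy values; in practice they would come from pre-trained embeddings.

```python
import math

# Hypothetical sketch: cosine similarity between two word vectors,
# usable as a lexical-semantic feature for an E-E pair.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Unlike a WordNet measure, this feature is defined for any word in the embedding vocabulary, including words absent from WordNet.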

19 Introduction

While there is wide consensus in the NLP community over the modeling of temporal relations between events, mainly based on Allen’s interval algebra allen1983, the question of how to model other types of event relations is still open. In particular, the linguistic annotation of causal relations, which have been widely investigated from a philosophical and logical point of view, is still under debate. This leads, in turn, to the lack of a standard benchmark for evaluating causal relation extraction systems, making it difficult to compare system performances and to identify the state-of-the-art approach for this particular task.

Although several resources exist in which causality has been annotated, they cover only a few aspects of causality and do not model it in a global way, comparable to what has been proposed for temporal relations in TimeML. See, for instance, the annotation of causal arguments in PropBank propbank and of causal discourse relations in the Penn Discourse Treebank PRASAD08.754.

In Section 22, we propose annotation guidelines for explicit constructions of causality inspired by TimeML, trying to take advantage of the clear definitions of events, signals and relations proposed by pustejovsky2003. This is the first step towards the annotation of a TimeML corpus with causality.

We annotated TimeBank, a freely available corpus, with the aim of making it available to the research community for further evaluations. Our annotation effort results in Causal-TimeBank, a TimeML corpus annotated with both temporal and causal information (Section 23). We chose TimeBank because it already contains gold annotated temporal information, including temporal entities (events and temporal expressions) and temporal relations. The other reason is that we want to investigate the strict connection between temporal and causal relations. In fact, there is a temporal constraint in causality, i.e. the cause must occur BEFORE the effect. We believe that investigating this precondition on a corpus basis can contribute to improving the performance of temporal and causal relation extraction systems.

20 Related Work

Unlike temporal order, which has a clear definition, there is no consensus in the NLP community on how to define causality. Causality is not a linguistic notion: although language can be used to express causality, causality exists as a psychological tool for understanding the world independently of language neeleman2012. In the field of psychology, several models of causality have been proposed, including the counterfactual model lewis1973, the probabilistic contrast model cheng1991,cheng1992 and the dynamics model wolff:2003,wolff:2005,wolff:2007, which is based on Talmy’s force dynamics account of causality talmy1985,talmy:1988.

Several attempts have been made to annotate causal relations in texts. A common approach is to look for specific cue phrases like because or since or to look for verbs that contain a cause as part of their meaning, such as break (cause to be broken) or kill (cause to die) Khoo:2000:ECK:1075218.1075261,Sakaji2008.PAKM,girju-EtAl:2007:SemEval-2007. In PropBank propbank, causal relations are annotated in the form of predicate-argument relations, where argm-cau is used to annotate ‘the reason for an action’, for example: “They