Event understanding is intuitive for humans and important for daily decision making. For example, given the raw text shown in Figure 1, a person can infer a great deal of information, including the event trigger and type, event-related arguments (e.g., agent, patient, location), event duration, and temporal relations between events, based on linguistic and commonsense knowledge. This understanding helps people comprehend the situation and prepare for future events. Event and temporal knowledge is helpful for many downstream applications, including question answering meng2017temporal; huang2019cosmos; peng2018towards; yao2019plan; goldfarb2019plan; goldfarb2020content, and forecasting wang2017integrating; granroth2016happens; li2018constructing.
Despite this importance, there are relatively few tools available for text-based temporal event understanding. Researchers have built natural language processing (NLP) analysis tools for “core NLP” tasks gardner2018allennlp; manning2014stanford; khashabi2018cogcompnlp. However, systems that target the semantic understanding of events and their temporal information remain under-explored. There are individual works on event extraction, temporal relation detection and event duration detection, but they were developed separately and thus cannot provide comprehensive and coherent temporal event knowledge.
We present EventPlus, the first pipeline system integrating several high-performance temporal event information extraction models for comprehensive temporal event understanding. Specifically, EventPlus contains event extraction (both on defined ontology and for novel event triggers), event temporal relation prediction, event duration detection and event-related arguments and named entity recognition, as shown in Figure 2. We provide an introductory video at https://pluslabnlp.github.io/eventplus.
EventPlus is designed with multi-domain support in mind. Particularly, we present an initial effort to adapt EventPlus to the biomedical domain. We summarize the contributions as follows:
We present the first event pipeline system with comprehensive event understanding capabilities: it extracts event triggers and arguments, temporal relations among events, and event duration, providing an event-centric natural language understanding (NLU) tool to facilitate downstream applications.
Each component in EventPlus has comparable performance to the state-of-the-art, which assures the quality and efficacy of our system for temporal event reasoning.
In this section, we introduce each component in our system, as shown in Figure 3. We use a multi-task learning model for event trigger and temporal relation extraction (§ 2.1). The model introduced in § 2.2 extracts semantically rich events following the ACE ontology, and the model introduced in § 2.3 predicts the event duration. Note that our system handles two types of event representations: one represents an event as the trigger word pustejovsky2003timeml (as in the event extraction model in § 2.1), and the other represents an event as a complex structure including trigger, type and arguments ahn2006stages (as in the event extraction model in § 2.2). Corpora following the former definition usually have broader coverage, while the latter can provide richer information. Therefore, we develop models that combine the benefits of both.
2.1 Multi-task Learning of Event Trigger and Temporal Relation Extraction
The event trigger extraction component takes the input of raw text and outputs single-token event triggers. The input to the temporal relation extraction model is raw text and a list of detected event triggers and the model will predict temporal relationships between each pair of events. In previous literature han-etal-2019-joint, multi-task learning of these two tasks can significantly improve performance on both tasks following the intuition that event relation signals can be helpful to distinguish event triggers and non-event tokens.
The model feeds BERT embeddings devlin2019bert of the input text to a shared BiLSTM layer for encoding task-specific contextual information. The output of the BiLSTM is passed to an event scoring function and a relation scoring function, which are MLP classifiers that calculate the probability of being an event (for event extraction) or a probability distribution over all possible relations (for temporal relation extraction). We train the multi-task model on MATRES NingWuRo18, which contains the temporal relations before, after, simultaneous and vague. Though the model performs both tasks during training, it can be used separately for each individual task during inference.
2.2 Event Extraction on ACE Ontology
Although event triggers indicate the occurrence of events, they are not sufficient to convey the semantically rich information of events. The ACE 2005 corpus (https://www.ldc.upenn.edu/collaborations/past-projects/ace) defines an event ontology that represents an event as a structure with triggers and corresponding event arguments (participants) with specific roles doddington4r. The ACE program provides annotated data for five kinds of extraction targets: entities, times, values, relations and events; we focus only on the event and entity data in this paper. Our system is trained on the ACE 2005 corpus and is thus capable of extracting events with this complex structure. ACE focuses on events of a particular set of types including life, movement, transaction, business, conflict, contact, personnel and justice, where each type has corresponding sub-types. Following prior works wadden-etal-2019-entity; lin-etal-2020-joint, we keep 7 entity types (person, organization, location, geo-political entity, facility, vehicle, weapon), 33 event sub-types, and 22 argument roles that are associated with sub-types.
Similar to han-etal-2019-joint, we build our event extraction component for the ACE ontology upon a multi-task learning framework that consists of trigger detection, argument role detection and entity detection. These tasks share the same BERT encoder, which is fine-tuned during training. The entity detector predicts the argument candidates for all events in an input sentence. The trigger detector labels the input sequence with the event sub-types at the token level. The argument role detector finds the argument sequence (represented using BIO encoding) for each detected trigger via an attention mechanism. For example, for the sentence in Figure 1, the target trigger sequence has a Movement:Transport label at the position of the “toured” token, and the argument sequence for this Movement:Transport event has B-artifact and I-artifact labels at the position of “George Pataki” and a B-destination label at the position of “counties”. The entire multi-task learning framework is jointly trained.
During inference, our system detects arguments solely based on triggers. To make our system better leverage information from argument candidates, we developed the following constraints during decoding based on the predicted entities (argument candidates) and other specific definitions in ACE:
Entity-Argument constraint. The argument role label for a token can take one of the 22 argument roles if and only if the token at this position belongs to a predicted entity.
Entity-Trigger constraint. The trigger label for a token can take one of the 33 event sub-types if and only if the token at this position does not belong to a predicted entity.
Valid Trigger-Argument constraint. Based on the definitions in ACE05, each event sub-type takes certain types of argument roles. We enforce that given the predicted trigger label, the argument roles in this sequence can only take those that are valid for this trigger.
To account for these constraints, we set the probability of all invalid configurations to be 0 during decoding.
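The three constraints can be illustrated with a small decoding sketch. This is not the system's actual implementation: the function names, the per-token probability dictionaries and the `VALID_ROLES` excerpt are assumptions made for illustration, and the sketch enforces only the restrictive direction of each constraint (a role label is allowed only inside a predicted entity, a trigger label only outside one), with the "O" label always permitted.

```python
# Hypothetical sketch: zero out invalid labels, then take the argmax per token.

# Illustrative excerpt of a trigger-to-role validity table (an assumption,
# not the complete ACE05 definition).
VALID_ROLES = {
    "Movement:Transport": {"artifact", "destination", "origin"},
    "Conflict:Attack": {"attacker", "target", "place"},
}

def masked_argmax(probs, is_valid):
    """Set the probability of invalid labels to 0 and return the best label."""
    masked = {label: (p if is_valid(label) else 0.0) for label, p in probs.items()}
    if all(p == 0.0 for p in masked.values()):
        return "O"  # nothing valid has probability mass: fall back to "O"
    return max(masked, key=masked.get)

def decode_arguments(token_probs, in_entity, trigger_type):
    """Apply the Entity-Argument and Valid Trigger-Argument constraints."""
    allowed = VALID_ROLES.get(trigger_type, set())
    labels = []
    for probs, inside in zip(token_probs, in_entity):
        def is_valid(label, inside=inside):
            if label == "O":
                return True
            role = label.split("-", 1)[1]  # strip the BIO prefix
            return inside and role in allowed
        labels.append(masked_argmax(probs, is_valid))
    return labels

def decode_triggers(token_probs, in_entity):
    """Apply the Entity-Trigger constraint: no trigger inside a predicted entity."""
    return [
        masked_argmax(probs, lambda label, inside=inside: label == "O" or not inside)
        for probs, inside in zip(token_probs, in_entity)
    ]

# Toy probabilities for three tokens: an entity, a non-entity, an entity.
arg_probs = [
    {"O": 0.1, "B-attacker": 0.6, "B-artifact": 0.3},
    {"O": 0.4, "B-destination": 0.6},
    {"O": 0.3, "B-destination": 0.7},
]
arg_labels = decode_arguments(arg_probs, [True, False, True], "Movement:Transport")
```

In this toy input, the first token would have been labeled B-attacker without masking; because that role is not valid for a Movement:Transport trigger in the assumed table, the decoder falls back to the next most probable valid label.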
2.3 Event Duration Detection
This component classifies event triggers into duration categories. While many datasets have covered time expressions which are explicit timestamps for events pustejovsky2003timebank; cassidy2014annotation; TACL1218; bethard-etal-2017-semeval, they do not target categorical event duration. To supplement this, vashishtha-etal-2019-fine introduces the UDS-T dataset, which provides 11 duration categories that we adopt for our event pipeline: instant, seconds, minutes, hours, days, weeks, months, years, decades, centuries and forever. Pan2006LearningED also present a news-domain duration annotation dataset containing 58 articles developed from the TimeBank corpus (referred to as Typical-Duration in the following); it provides 7 duration categories (a subset of the 11 categories in UDS-T, from seconds to years).
We developed two models for the event duration detection task. Given a sentence along with the predicate root and span, the models perform duration classification. In the first method, we fine-tune a BERT language model devlin2019bert on single sentences, take the hidden states of the event tokens from the output of the last layer, and feed them into a multi-layer perceptron for classification.
The second model is adapted from the UDS-T baseline model, which is trained under the multi-task objectives of duration and temporal relation extraction. The model computes ELMo embeddings peters2018deep followed by attention layers to compute the attended representation of the predicate given the sentence. The final MLP layers predict the duration category. Even though this model can detect temporal relations, it underperforms the model described in § 2.1, so we exclude its temporal relation output during inference.
We design a pipeline system to enable the interaction among components with state-of-the-art performance introduced in § 2 and provide a comprehensive output for events and visualize the results. Figure 3 shows the overall system design.
3.1 Pipeline Design
EventPlus takes in raw text and feeds the tokenized text to two event extraction modules, one trained on ACE ontology-based datasets and one on free-formatted event triggers. The ACE ontology extraction module produces event triggers (“toured” is a trigger), event types (it is a Movement:Transport event), arguments and their roles (the artifact is “George Pataki” and the destination is “counties”) and NER results (“New York” and “counties” are geo-political entities, and “governor” and “George Pataki” are persons). The trigger-only extraction model produces all event triggers (“continues”, “maintain” and “declared” are also event triggers, but no arguments are predicted for them). The trigger-only events are then merged into the ACE-style event list to create a combined event list from the two models.
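The merging step can be sketched as follows. The exact merging rule is not specified here, so this sketch assumes that events are matched by trigger token index and that the richer ACE-style record is kept when both models fire on the same token; the field names and token indices are hypothetical.

```python
# Hypothetical sketch of merging trigger-only events into the ACE-style list.

def merge_events(ace_events, trigger_only_events):
    """Return one event list, ordered by trigger position.

    Assumption: two events are the same event if they share a trigger token
    index, and the ACE-style record (with type and arguments) takes priority.
    """
    combined = {event["trigger_idx"]: dict(event) for event in ace_events}
    for event in trigger_only_events:
        if event["trigger_idx"] not in combined:
            # Trigger-only events carry no type or arguments.
            combined[event["trigger_idx"]] = {**event, "type": None, "arguments": []}
    return [combined[idx] for idx in sorted(combined)]

ace_events = [{
    "trigger": "toured", "trigger_idx": 5, "type": "Movement:Transport",
    "arguments": [{"text": "George Pataki", "role": "artifact"},
                  {"text": "counties", "role": "destination"}],
}]
trigger_only_events = [
    {"trigger": "toured", "trigger_idx": 5},
    {"trigger": "declared", "trigger_idx": 12},
    {"trigger": "continues", "trigger_idx": 1},
]
merged = merge_events(ace_events, trigger_only_events)
```

Keying on the trigger position rather than the trigger string avoids collapsing two distinct mentions of the same word in one sentence.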
Duration Detection and Temporal Relation Extraction
The combined event list is passed to the event duration detection model to detect the duration of each extracted event (“tours” will take days etc.) and to the temporal relation extraction component to detect temporal relations among each pair of events (“toured” is after “declared” etc.). Note that duration detection and temporal relation extraction are based on the context sentence in addition to the event triggers themselves, and they are designed to consider the contextualized information contained in sentences. Therefore “take (a break)” can take minutes in the scenario of “Dr. Porter is now taking a break and will be able to see you soon” but take days in the context of “Dr. Porter is now taking a Christmas break” ning2019understanding.
To keep the resulting temporal graph clear, we remove predicted vague relations, since they indicate that the model cannot confidently predict temporal relations for those event pairs. Finally, all model outputs are gathered and passed to the front-end for visualization.
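A minimal sketch of this filtering step is shown below. The output format (a dictionary of nodes and edges), the node label pattern and the example durations for “declared” and “continues” are assumptions for illustration, not the format actually exchanged with the front-end.

```python
# Hypothetical sketch: drop vague relations before building the temporal graph.

def build_temporal_graph(durations, relations):
    """Build nodes labeled with durations and edges for non-vague relations.

    durations: dict mapping an event trigger to its predicted duration category
    relations: list of (source, relation, target) triples
    """
    edges = [(src, rel, tgt) for src, rel, tgt in relations if rel != "vague"]
    nodes = [{"id": ev, "label": f"{ev} ({dur})"} for ev, dur in durations.items()]
    return {"nodes": nodes, "edges": edges}

durations = {"toured": "days", "declared": "seconds", "continues": "weeks"}
relations = [
    ("toured", "after", "declared"),
    ("continues", "vague", "declared"),
]
graph = build_temporal_graph(durations, relations)
```

In this sketch an event whose relations are all vague still appears as an isolated node; whether such nodes are displayed is a design choice the text does not settle.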
3.2 Interface Design
Figure 2 shows the interface design of EventPlus. A walk-through instruction is available to help first-time end users get familiar with EventPlus; please see our video for more information. We display the NER result with wavy underlines and highlight event triggers and their corresponding arguments in the same color upon clicks. In addition, if there are any temporal relations among events, we represent them in a directed graph using d3 (https://d3js.org/), where we also indicate each event’s duration in the label of each event node.
Each capability in the pipeline has its own input and output protocol and they require various datasets to learn implicit knowledge independently. In this section, we describe the performance for each capability on corresponding labeled datasets.
4.1 Event Trigger Extraction
We report the evaluation of the event trigger extraction component on TB-Dense cassidy2014annotation and MATRES NingWuRo18, two event extraction datasets in the news domain han-etal-2019-joint. We show the results in Table 1. Compared with CAEVO chambers-etal-2014-dense and DEER han2020deer on TB-Dense, and with ning-etal-2018-cogcomptime on MATRES, the model we use achieves the best F1 scores and yields state-of-the-art performance.
4.2 Event Extraction on ACE Ontology
We evaluate our event extraction component on the test set of ACE 2005 dataset using the same data split as prior works lin-etal-2020-joint; wadden-etal-2019-entity. We follow the same evaluation criteria:
Entity: An entity is correct if its span and type are both correct.
Trigger: A trigger is correctly identified (Trig-I) if its span is correct. It is correctly classified (Trig-C) if its type is also correct.
Argument: An argument is correctly identified (Arg-I) if its span and event type are correct. It is correctly classified (Arg-C) if its role is also correct.
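The trigger and argument criteria can be made concrete with a short scoring sketch. It assumes that predictions and gold annotations are represented as sets of exact-match tuples and that precision, recall and F1 are computed over them; the tuple layouts and function names are hypothetical.

```python
# Hypothetical sketch of the identification/classification criteria.

def prf1(pred, gold):
    """Precision, recall and F1 over two sets of exact-match tuples."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def trigger_scores(pred, gold):
    """pred/gold: sets of (start, end, event_type) tuples."""
    trig_i = prf1({(s, e) for s, e, _ in pred}, {(s, e) for s, e, _ in gold})
    trig_c = prf1(pred, gold)  # span and type must both match
    return trig_i, trig_c

def argument_scores(pred, gold):
    """pred/gold: sets of (start, end, event_type, role) tuples."""
    arg_i = prf1({(s, e, t) for s, e, t, _ in pred},
                 {(s, e, t) for s, e, t, _ in gold})
    arg_c = prf1(pred, gold)  # span, event type and role must all match
    return arg_i, arg_c

gold_triggers = {(2, 3, "Movement:Transport"), (7, 8, "Conflict:Attack")}
pred_triggers = {(2, 3, "Movement:Transport"), (7, 8, "Life:Die")}
trig_i, trig_c = trigger_scores(pred_triggers, gold_triggers)
```

In the toy example both trigger spans are found, so identification is perfect, while one of the two types is wrong, which halves the classification scores.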
In Table 2, we compare the performance of our system with the current state-of-the-art method OneIE lin-etal-2020-joint. Our system outperforms OneIE in terms of entity detection performance. However, our trigger and argument detection performance is lower than that of OneIE. We leave improvements for triggers and arguments to future work.
4.3 Event Duration Detection
We evaluate the event duration detection models on Typical-Duration and the newly annotated ACE-Duration dataset to reflect performance on the generic news domain, for which our system is optimized. Since the UDS-T dataset vashishtha-etal-2019-fine is imbalanced and has limited samples for some duration categories, we do not use it as an evaluation benchmark, but we sample 466 data points with high inter-annotator agreement (IAA) from it as training resources. We split the Typical-Duration dataset and use 1790 samples for training, 224 for validation and 224 for testing.
To create ACE-Duration, we sample 50 unique triggers with related sentences from the ACE dataset, conduct manual annotation with three annotators and take the majority vote as the gold duration category. Given the natural ordering among duration categories, the following metrics are employed: accuracy over 7 duration categories (Acc), coarse accuracy (Acc-c, where a prediction is counted as correct if it falls in a category within a distance of 1 from the ground truth) and Spearman correlation (Corr).
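The three metrics can be sketched in plain Python as follows. The sketch assumes the 7 categories are indexed in their natural order from seconds to years, that Acc-c also counts exact matches as correct, and that Spearman correlation is computed as the Pearson correlation of average ranks; the function names are hypothetical.

```python
# Hypothetical sketch of Acc, Acc-c and Spearman correlation over ordered categories.

DURATIONS = ["seconds", "minutes", "hours", "days", "weeks", "months", "years"]
INDEX = {name: i for i, name in enumerate(DURATIONS)}

def accuracy(pred, gold):
    """Fraction of exact category matches."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

def coarse_accuracy(pred, gold):
    """Fraction of predictions at most one category away from the gold label."""
    return sum(abs(INDEX[p] - INDEX[g]) <= 1 for p, g in zip(pred, gold)) / len(gold)

def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(pred, gold):
    """Pearson correlation of the rank vectors (undefined if either is constant)."""
    x = average_ranks([INDEX[p] for p in pred])
    y = average_ranks([INDEX[g] for g in gold])
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

gold = ["seconds", "minutes", "days", "years"]
pred = ["seconds", "hours", "days", "months"]
scores = (accuracy(pred, gold), coarse_accuracy(pred, gold), spearman(pred, gold))
```

In the toy example only two of four predictions match exactly, yet every prediction is within one category of the gold label and the ordering is preserved, which is why the coarse and rank-based metrics complement plain accuracy.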
Experimental results in Table 3 show that the BERT model is generally better than the UDS-T ELMo-based model, and that data augmentation is especially helpful for improving performance on ACE-Duration. Due to the limited size of ACE-Duration, we give more weight to the Typical-Duration dataset and select BERT (T) as the best configuration. To the best of our knowledge, this is the state-of-the-art performance on the event duration detection task.
4.4 Temporal Relation Extraction
We report temporal relation extraction performance on the TB-Dense and MATRES datasets. TB-Dense considers the duration of events, so its labels are includes, included in, before, after, simultaneous and vague, while MATRES uses the start point as the temporal anchor of an event and hence its labels exclude includes and included in. In EventPlus, we augment extracted events from multiple components, so we report temporal relation extraction results given gold events as relation candidates to better reflect single-task performance.
Table 4 shows the experimental results. (The MATRES experiment result in Table 4 uses 183 documents for training and 20 for testing, developed from the entire TempEval-3 dataset. han-etal-2019-deep reports a higher F1 score, but it uses a subset of MATRES (22 documents for train, 5 for dev and 9 for test) and has a different setting.) Our model in § 2.1 achieves the best result on temporal relation extraction and is significantly better than vashishtha-etal-2019-fine mentioned in § 2.3. (The latest state-of-the-art work han2020deer only reports end-to-end event extraction and temporal relation extraction results; pure temporal relation extraction results given ground-truth events are not provided, so we are not able to compare with it directly.)
5 Extension to Biomedical Domain
With our flexible design, each component of EventPlus can be easily extended to other domains with little modification. We explore two approaches to extending the event extraction capability (§ 2.2) to the biomedical domain: 1) multi-domain training (MDT) with GENIA kim-etal-2009-overview, a dataset containing biomolecular interaction events from scientific literature, with shared token embeddings, which enables the model to predict on both news and biomedical text; 2) replacing the current component with an in-domain event extraction component, SciBERT-FT huang-etal-2020-biomedical, which is a biomedical event extraction system based on fine-tuned SciBERT beltagy-etal-2019-scibert.
While MDT on ACE and GENIA datasets from different domains improves the performance on GENIA, it is still lower than SciBERT-FT (Figure 4). Therefore, we decide to pursue the second extension approach to incorporate SciBERT-FT and extend EventPlus to the biomedical domain.
6 Related Works
Existing NLP toolkits manning2014stanford; khashabi2018cogcompnlp provide an interface for a set of useful models. Some tools integrate several models in a pipeline fashion peng-etal-2015-concrete; noji-miyao-2016-jigg. The majority of these systems focus on token-level tasks like tokenization, lemmatization, part-of-speech tagging, or sentence-level tasks like syntactic parsing, semantic role labeling etc. There are only a few systems that can provide capabilities of event extraction and temporal information detection tao2013eventcube; ning2019understanding.
For event extraction, some systems only provide results within a certain defined ontology, such as AIDA li-etal-2019-multilingual; there are also some works utilizing data from multiple modalities li-etal-2020-gaia; li-etal-2020-cross. Some works can handle novel events, but they are either restricted to a certain domain yang-etal-2018-dcfee or lack competitive performance because of their rule-based algorithms valenzuela-escarcega-etal-2015-domain. For temporal information detection, ning-etal-2019-improved proposes a neural temporal relation extraction system with knowledge injection. Most related to our work, ning-etal-2018-cogcomptime demonstrates a temporal understanding system that extracts time expressions and implicit temporal relations among detected events, but this system cannot provide event-related arguments, entities or event duration information.
These previous works either are not capable of event understanding or just focus on one perspective of event-related features. There is no existing system that incorporates a comprehensive set of event-centric features including event extraction and related arguments and entities, temporal relations and event duration.
7 Conclusion and Future Work
We present EventPlus, a pipeline system that takes raw text as input and produces a set of temporal event understanding annotations, including event trigger and type, event arguments, event duration and temporal relations. To the best of our knowledge, EventPlus is the first available system that provides such a comprehensive set of temporal event knowledge extraction capabilities with state-of-the-art components integrated. We believe EventPlus will provide insights for understanding narratives and facilitating downstream tasks.
In the future, we plan to further improve EventPlus by tightly integrating the event duration prediction and temporal relation extraction modules. We also plan to improve the performance of trigger and argument detection under the ACE ontology, and to develop joint training models that optimize for all event-related features in an end-to-end fashion.