Event Extraction as Natural Language Generation

08/29/2021 ∙ by I-Hung Hsu, et al. ∙ USC Information Sciences Institute

Event extraction (EE), the task that identifies event triggers and their arguments in text, is usually formulated as a classification or structured prediction problem. Such models usually reduce labels to numeric identifiers, making them unable to take advantage of label semantics (e.g., an event type named Arrest is related to words like arrest, detain, or apprehend). This prevents generalization to new event types. In this work, we formulate EE as a natural language generation task and propose GenEE, a model that not only captures complex dependencies within an event but also generalizes well to unseen or rare event types. Given a passage and an event type, GenEE is trained to generate a natural sentence following a predefined template for that event type. The generated output is then decoded into trigger and argument predictions. The autoregressive generation process naturally models the dependencies among the predictions: each newly predicted word depends on those already generated in the output sentence. Using carefully designed input prompts during generation, GenEE is able to capture label semantics, which enables generalization to new event types. Empirical results show that our model achieves strong performance on event extraction tasks under all zero-shot, few-shot, and high-resource scenarios. In particular, in the high-resource setting, GenEE outperforms the state-of-the-art model on argument extraction and achieves results competitive with the current best on end-to-end EE tasks.


1 Introduction

Event extraction (EE) is a crucial information extraction task that extracts structured event information, i.e., an event trigger and its arguments, from unstructured text. In the literature, EE is often divided into two subtasks: (1) event detection, which identifies event triggers and their types, and (2) event argument extraction, which aims to extract the entities that serve as event arguments and their corresponding roles for the given event. EE has been shown to benefit a wide range of applications, e.g., building knowledge graphs DBLP:conf/www/ZhangLPSL20, question answering berant-etal-2014-modeling, and supporting decision making hogenboom2016survey.

Figure 1: We illustrate how different formulations of EE predict event arguments and their corresponding strengths. Our method aims to enjoy all three benefits of the other approaches: capturing label semantics, modeling event structures, and generalizing to new event types.

Early studies mostly formulate EE as a classification problem Chen2015dmcnn; wang-etal-2019-hmeae, and many of them employ structured prediction, which models complex dependencies among output variables to better decide the final outputs li-etal-2014-constructing; Lin20oneie. Despite their success, classification-based methods usually lack generalizability to new or rare event classes, partially because they ignore label semantics, which has been shown to be helpful for zero-shot and few-shot learning DBLP:conf/aaai/ChangRRS08. Take the sentence in Figure 1 as an example: when a new event type “Justice:Execute” is presented, it is hard to identify “execution” as an event trigger if the model uses a numeric label to represent “Justice:Execute”. However, for a system exploiting label semantics, the name of the event type “Justice:Execute” alone could help identify “execution”. Similarly, “Supreme Court” could be associated with the role “Adjudicator” based on the semantic similarity between the text and the label.

Recently, several works have attempted to leverage label semantics. Liu2020rceeer; Du20qa cast EE as a machine reading comprehension (MRC) problem. By encoding the lexical features of labels into questions, MRC methods can transfer argument extraction knowledge and thus improve generalizability. Nevertheless, these frameworks usually need multiple turns of independent question answering to make all predictions, so dependencies among event triggers and arguments are neglected. Another work, TANL Paolini21tacl, treats EE as a translation task between augmented natural languages, as shown in Figure 1. Its generation process helps it capture output dependencies and exploit label semantics. However, it does not take label semantics from the input, which limits the possibility of importing new semantics into the model and prevents TANL from addressing zero-shot EE.

In this work, we frame EE as a natural language generation task and propose GenEE, which models event structures, exploits label semantics, and generalizes to new or rare event types. GenEE is an encoder-decoder generative model initialized from pretrained BART lewis-etal-2020-bart. Given a passage and an event type, GenEE generates a natural sentence following a predefined template, from which the final triggers and arguments can be decoded. The input to GenEE consists of the passage and a carefully designed prompt, which includes the event definition and other task-specific information, to guide the generation. GenEE captures the complex dependencies among triggers and arguments because of its autoregressive decoding process: each new word prediction depends on the words that have already been generated. The input prompt helps the model capture label semantics by encoding the event definition, enabling GenEE to address zero-shot and few-shot scenarios. Furthermore, training the model to produce outputs in a natural language style facilitates knowledge transfer from the pretrained language model.

We evaluate our method on the ACE 2005 doddington-etal-2004-automatic and ERE-EN English event extraction datasets. The results show that GenEE surpasses the previous best EE model Lin20oneie on event argument extraction by up to 6.3% absolute F1-score when given gold triggers, and achieves results competitive with the state-of-the-art model on end-to-end EE tasks. Moreover, when tested under the few-shot and zero-shot scenarios with new or rare event type queries, GenEE trained in the zero-shot setting outperforms two other baseline models trained in the 5-shot setting.

2 Event Extraction as Natural Language Generation

Figure 2: An illustration of our method. Given an event type, we feed a sequence containing a passage and a set of prompts to the model. The prompts vary depending on the queried type and the task (ED, EAE, or E2E). The model is trained to generate output texts following event-type-specific templates, which contain placeholders for filling in triggers or arguments. The final event predictions can be made by comparing the template and the output text.

We formulate event extraction (EE) as a natural language generation task and propose GenEE and its variant GenE2E to solve EE tasks. GenEE is a pipelined model stacking a generative trigger detection component (GenED) with a generative argument extraction component (GenEAE). In contrast, GenE2E is an end-to-end model that predicts event triggers and their arguments at the same time.

While addressing different tasks, GenED, GenEAE, and GenE2E all share the same architecture, illustrated in Figure 2. The network follows the design of pretrained BART lewis-etal-2020-bart, an encoder-decoder generative model. Given a passage and an event type, the goal of the model is to generate an output text that can be processed into trigger and argument predictions. We convert the input passage and the queried event type into an input sequence, which contains the passage and a prompt connected by a special separator [SEP]. The selection of prompts differs by task, while the generated output text is expected to follow a predefined format to facilitate the final decoding. We give more details about the exact input and output formats for each task in this section.
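To make this interface concrete, the following minimal sketch (assuming the Hugging Face transformers library and the public facebook/bart-large checkpoint; the helper function, example passage, and example prompt are illustrative, not the released implementation) shows how a passage and a prompt could be joined and passed to the generator. The concrete prompt pieces for each task are described in the following subsections.

    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

    def build_input(passage: str, prompt: str) -> str:
        # The paper writes the separator as [SEP]; we assume BART's native
        # separator token ("</s>") plays that role here.
        return f"{passage} {tokenizer.sep_token} {prompt}"

    passage = "The man detonated a bomb near the embassy."
    prompt = ("The event is related to conflict and some violent physical act. "
              "Event trigger is <Trigger>.")  # an illustrative ED prompt for Conflict:Attack

    inputs = tokenizer(build_input(passage, prompt), return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=64)
    output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    # A finetuned GenED model would ideally produce "Event trigger is detonated.";
    # the raw pretrained checkpoint used here will not.
    print(output_text)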

2.1 Event Detection Model

GenED is designed to extract triggers for the given event types. The prompt for GenED contains:


  • Event Type Description: provides a definition of the given event type. (The definition can be derived from the annotation guidelines of each dataset, e.g., the ACE guidelines: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/english-events-guidelines-v5.4.3.pdf.) For example, “The event is related to conflict and some violent physical act.” describes a “Conflict:Attack” event.

  • Event Keywords: presents some words that are semantically related to the given event type. In practice, we collect three words that appear as triggers in the example sentences from the annotation guidelines.

  • ED Template: the predefined output format for event detection, specifically “Event trigger is <Trigger>.” (“<Trigger>” is a special token.) It is included in the prompt in order to guide the generation.

The goal of GenED is to generate an output that replaces “<Trigger>” in the ED Template with one or more triggers of the appropriate type from the given passage, if they are present. For example, the gold output text for GenED in Figure 2 is “Event trigger is detonated.”, where “detonated” is the trigger for the queried “Conflict:Attack” type. When there are multiple triggers, we connect them with the word “and”. For cases containing no triggers of the given event type, we keep the original ED Template as the desired output.
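Decoding the generated text back into trigger predictions can be done by comparing it against the ED Template. The sketch below is a simplified, assumed procedure (the function name and the omission of the offset-mapping step are ours, not the released code):

    import re

    def decode_triggers(output_text: str, ed_template: str = "Event trigger is <Trigger>.") -> list:
        """Recover trigger strings by comparing the generated text with the ED Template."""
        # If the model kept the template unchanged, no trigger of this type was found.
        if output_text.strip() == ed_template:
            return []
        # Extract the span that replaced the "<Trigger>" placeholder.
        prefix, suffix = ed_template.split("<Trigger>")
        match = re.fullmatch(re.escape(prefix) + r"(.+?)" + re.escape(suffix), output_text.strip())
        if match is None:
            return []
        # Multiple triggers are joined by "and" in the output, so split them back apart.
        return [t.strip() for t in match.group(1).split(" and ")]

    print(decode_triggers("Event trigger is detonated."))           # ["detonated"]
    print(decode_triggers("Event trigger is attacked and fired."))  # ["attacked", "fired"]
    print(decode_triggers("Event trigger is <Trigger>."))           # []

Mapping the recovered strings back to token offsets in the passage is omitted in this sketch.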

During inference, GenED must be run once for each event type in the target ontology, since it considers only one event type at a time. During training, we randomly select a subset of event types as queries for each training example to avoid the training signal being too sparse; the number of sampled event types is a hyper-parameter (see Appendix B).

Example 1: Conflict:Demonstrate
  Event Type Description: The event is related to a large number of people coming together to protest.
  Event Keywords: rally; protest; demonstration
  ED Template: Event trigger is <Trigger>.
  Output Text for ED: Event trigger is demonstrate.
  Argument Roles (Role: Argument): Entity: Iraqis; Place: fallujah
  EAE Template: some people or some organization protest at somewhere.
  Output Text for EAE: Iraqis protest at fallujah.

Example 2: Justice:Sue
  Event Type Description: The event is related to a court proceeding has been initiated and someone sue the other.
  Event Keywords: sue; lawsuit; suit
  ED Template: Event trigger is <Trigger>.
  Output Text for ED: Event trigger is lawsuits.
  Argument Roles (Role: Argument): Defendant: gunmakers, dealers; Plaintiff: victims, families; Place: none; Adjudicator: none
  EAE Template: somebody was sued by some other in somewhere. The adjudication was judged by some adjudicator.
  Output Text for EAE: gunmakers and dealers was sued by victims and families in somewhere. The adjudication was judged by some adjudicator.

Table 1: Two examples of event-type-specific input prompts and output texts for GenED and GenEAE. The events in the table follow the guidelines of the ACE 2005 dataset. The placeholder words in the templates (underlined in the original paper) can be replaced with triggers or arguments.

2.2 Event Argument Extraction Model

GenEAE is designed to identify arguments and their corresponding roles for a given event trigger. Similar to GenED, we design three kinds of prompts for GenEAE, as illustrated in Figure 2:


  • Event Type Description: the same as for GenED.

  • Event Trigger: indicates the trigger word whose arguments are being sought, e.g., “detonated”.

  • EAE Template: an event-type-specific natural-language-style template. Distinct templates are devised for each type to represent its unique structure. For example, in Figure 2, “some attacker attacked some facility, someone, or some organization by some way in somewhere.” is the EAE Template tailored for “Conflict:Attack” events. Each underlined part starting with “some-” serves as a placeholder corresponding to an argument role in a “Conflict:Attack” event. The full list of EAE Templates can be found in Appendix D.

Our strategy for creating an EAE Template is to first identify all valid argument roles for the event type. (The valid roles for each event type are predefined in the event ontology of each dataset.) Then, for each argument role, we select natural and fluent words reflecting the semantics of the role type to form its placeholder. This design provides a simple way to help the model learn the roles’ label semantics. The final step is to write a natural language sentence that connects all these placeholders.

Similar to GenED, GenEAE is trained to generate output texts that replace the placeholders in the EAE Template with arguments whenever possible.
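Argument decoding can be sketched analogously by aligning the generated text against the EAE Template, with the fixed words between placeholders acting as anchors. The procedure and the placeholder-to-role mapping below are illustrative assumptions, shown on a trimmed version of the Justice:Sue template from Table 1:

    import re

    def decode_arguments(output_text: str, eae_template: str, placeholder_to_role: dict) -> dict:
        """Recover a role -> arguments mapping by aligning the output with the EAE Template."""
        # Turn the template into a regex with one named group per placeholder.
        pattern = re.escape(eae_template)
        for i, placeholder in enumerate(placeholder_to_role):
            pattern = pattern.replace(re.escape(placeholder), f"(?P<p{i}>.+?)", 1)
        match = re.fullmatch(pattern, output_text.strip())

        roles = {role: [] for role in placeholder_to_role.values()}
        if match is None:
            return roles
        for i, (placeholder, role) in enumerate(placeholder_to_role.items()):
            span = match.group(f"p{i}").strip()
            if span and span != placeholder:          # an unchanged placeholder means no argument
                roles[role].extend(a.strip() for a in span.split(" and "))
        return roles

    template = "somebody was sued by some other in somewhere."
    mapping = {"somebody": "Defendant", "some other": "Plaintiff", "somewhere": "Place"}
    print(decode_arguments("gunmakers and dealers was sued by victims and families in somewhere.",
                           template, mapping))
    # {'Defendant': ['gunmakers', 'dealers'], 'Plaintiff': ['victims', 'families'], 'Place': []}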

2.3 Event Extraction Model

Pipelined event extraction model.  We combine the two proposed models, GenED and GenEAE, into a pipelined EE model called GenEE. GenEE has several advantages: (1) The autoregressive generation framework lets the model naturally consider output dependencies, where later predictions are made based on previous ones. (2) The input prompts help the model capture knowledge about events, including label semantics from the event definition and event structures from the output templates. (3) We intentionally write templates as natural sentences, which encourages the model to leverage the powerful pretrained knowledge in language models.
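As a rough illustration of how the pipeline could be wired, the sketch below reuses the illustrative build_input, decode_triggers, and decode_arguments helpers from the earlier sketches; the ontology dictionary layout and the generate_fn wrapper are assumptions, not the released code.

    def extract_events(passage: str, ontology: dict, generate_fn) -> list:
        """Hypothetical GenEE pipeline: GenED once per event type, then GenEAE once per trigger."""
        events = []
        for event_type, info in ontology.items():
            # Event detection: query one event type at a time.
            ed_prompt = " ".join([info["description"], info["keywords"], info["ed_template"]])
            triggers = decode_triggers(generate_fn(build_input(passage, ed_prompt)),
                                       info["ed_template"])
            # Argument extraction: one query per predicted trigger.
            for trigger in triggers:
                eae_prompt = " ".join([info["description"], trigger, info["eae_template"]])
                roles = decode_arguments(generate_fn(build_input(passage, eae_prompt)),
                                         info["eae_template"], info["placeholder_to_role"])
                events.append({"type": event_type, "trigger": trigger, "arguments": roles})
        return events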

End-to-end event extraction model.  To further exploit the dependencies within events, we propose GenE2E, which, for a given event type, generates an output text containing all triggers and their corresponding arguments. The input prompt design of GenE2E is similar to that of GenED, except that its template combines the ED Template and the EAE Template, as shown in Figure 2. The goal of GenE2E is to generate texts that replace the placeholders in this combined template with real triggers and arguments.

2.4 Zero-Shot Event Extraction

Our flexible prompt design helps GenEE adapt to the zero-shot setting easily, because the semantic knowledge of new concepts can be imported into the model through the input prompts. To add a new event type, all that is required are its Event Type Description, Event Keywords, ED Template, and EAE Template. Compared to the effort of collecting real annotations, these materials can be easily obtained from annotation guidelines or created quickly by humans.
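For illustration, adding a new event type could amount to one new entry in the ontology dictionary from the pipeline sketch above. The description and keywords below are invented examples; the EAE template is the Justice:Execute template from Appendix D, and the role names are assumed from the ACE ontology.

    ontology["Justice:Execute"] = {
        "description": "The event is related to the execution of somebody.",  # invented wording
        "keywords": "execute; execution; death penalty",                      # invented keywords
        "ed_template": "Event trigger is <Trigger>.",
        "eae_template": "somebody was executed by somebody or some organization at somewhere.",
        "placeholder_to_role": {                  # role names assumed from the ACE ontology
            "somebody": "Person",
            "somebody or some organization": "Agent",
            "somewhere": "Place",
        },
    }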

Model ACE05-E ACE05-E+ ERE-EN
Arg-I Arg-C Arg-I Arg-C Arg-I Arg-C
DyGIE++ Wadden19dygiepp 66.2 60.7 - - - -
BERT_QA* Du20qa 68.2 65.4 - - - -
OneIE Lin20oneie 73.2 69.3 73.3 70.6 75.3 70.0
GenEAE (ours) 76.0 73.5 75.2 73.0 80.2 76.3
Table 2: Results for event argument extraction. Models generate arguments and roles based on the given gold triggers. Highest scores are in bold. *We report the numbers from the original paper.
Model ACE05-E ACE05-E+ ERE-EN
Trig-I Trig-C Arg-I Arg-C Trig-I Trig-C Arg-I Arg-C Trig-I Trig-C Arg-I Arg-C
TANL* Paolini21tacl 72.9 68.4 50.1 47.6 - - - - - - - -
DyGIE++ Wadden19dygiepp 74.3 70.0 53.2 50.0 - - - - - - - -
BERT_QA* Du20qa 75.8 72.4 55.3 53.3 - - - - - - - -
OneIE* Lin20oneie 78.2 74.7 59.2 56.8 75.6 72.8 57.3 54.8 68.4 57.0 50.1 46.5
OneIE Lin20oneie 76.6 73.5 57.9 55.5 75.8 73.5 60.7 58.6 67.7 58.0 53.2 49.4
GenE2E 71.7 69.1 55.0 54.0 73.7 69.9 57.4 55.3 65.8 57.1 52.0 49.6
GenEE (GenED + GenEAE) 75.7 72.0 57.6 56.0 74.3 70.9 59.6 58.0 63.6 56.6 53.5 51.1
Table 3: Results for event extraction. Highest scores are in bold and the second best scores are underlined. *We report the numbers from the original paper.

3 High-Resource Event Extraction

We conduct experiments on the event argument extraction task (Section 3.2) and the end-to-end event extraction task (Section 3.3) on benchmark datasets, and analyze our model through detailed ablation studies (Section 3.4).

3.1 Settings

Datasets.  We consider ACE 2005 doddington-etal-2004-automatic, a widely used EE dataset that provides entity, value, time, relation, and event annotations. We follow the pre-processing in Wadden19dygiepp and Lin20oneie, resulting in two variants: ACE05-E and ACE05-E+. The two variants differ in that ACE05-E filters out multi-token event triggers, but both contain 33 event types and 22 argument roles. In addition, we consider ERE-EN, a dataset created under the Deep Exploration and Filtering of Text (DEFT) program. Adopting the pre-processing in Lin20oneie, we keep 38 event types and 21 argument roles for ERE-EN. Appendix A lists more details.

Evaluation.  We adopt the same criteria as previous work Wadden19dygiepp; Lin20oneie and report F1-scores in our experiments.


  • Event detection (ED): a trigger is correctly identified (Tri-I) if its offset matches the gold trigger’s, and it is correctly classified (Tri-C) if its event type also matches the gold trigger’s.

  • Event argument extraction (EAE): an argument is correctly identified (Arg-I) if its offset and event type match the gold argument’s, and it is correctly classified (Arg-C) if its role label also matches the gold argument’s. (A minimal scoring sketch for these criteria is given after this list.)
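The sketch below illustrates these criteria under the assumption that predictions and gold annotations are represented as tuples; set-based micro F1 is a simplification that ignores duplicate mentions.

    def micro_f1(pred: set, gold: set) -> float:
        """Set-based micro F1 over prediction tuples (simplified)."""
        tp = len(pred & gold)
        precision = tp / len(pred) if pred else 0.0
        recall = tp / len(gold) if gold else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Assumed tuple layouts: trigger = (sent_id, start, end, event_type),
    #                        argument = (sent_id, start, end, event_type, role).
    gold_trig = {(0, 2, 3, "Conflict:Attack")}
    pred_trig = {(0, 2, 3, "Conflict:Attack"), (1, 5, 6, "Life:Die")}

    tri_i = micro_f1({t[:3] for t in pred_trig}, {t[:3] for t in gold_trig})  # offsets only
    tri_c = micro_f1(pred_trig, gold_trig)                                    # offsets + event type
    print(f"Tri-I = {tri_i:.2f}, Tri-C = {tri_c:.2f}")                        # 0.67, 0.67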

Compared baselines.  We consider the following baselines: (1) DyGIE++ Wadden19dygiepp, a classification-based model with a span graph propagation technique. (2) OneIE Lin20oneie, the state-of-the-art structured prediction model trained with global features and multi-tasking. (3) BERT_QA Du20qa, which casts EE as a series of extractive question answering problems. (4) TANL Paolini21tacl, which treats EE as a translation task between augmented natural languages. The implementation details for all baselines and our model are given in Appendix B.

3.2 Results for Event Argument Extraction

In this section, we focus on the event argument extraction task. Given gold triggers and their event types, the models are expected to predict the event arguments and their corresponding roles. Table 2 shows Arg-I and Arg-C F1-scores on the three datasets. GenEAE improves over the state-of-the-art model (OneIE) on all three datasets, by 4.2%, 2.4%, and 6.3% absolute Arg-C F1, respectively. It is worth noting that GenEAE requires no entity annotations, which are used by both OneIE and DyGIE++. By using the pretrained knowledge in BART and generating structured sentences, GenEAE is able to identify entity boundaries and accurately connect entities to triggers at the same time.

3.3 Results for Event Extraction

Next, we evaluate our method on end-to-end EE tasks. As shown in Table 3, GenEE is superior to most baselines (TANL, DyGIE++, BERT_QA) on all four evaluation criteria. We observe that the improvements of GenEE on event detection (Tri-I and Tri-C) over the baselines are less pronounced than its improvements on event argument extraction (Arg-I and Arg-C). We hypothesize that this is because the dependencies among triggers are weaker than those among arguments, so GenEE, whose main benefit comes from modeling output dependencies, gains less on event detection. GenEE is slightly worse than OneIE in terms of Tri-I and Tri-C; however, since GenEAE is much more powerful, it still achieves results competitive with OneIE on Arg-I and Arg-C, and it even outperforms OneIE on the ERE-EN dataset by 1.7% absolute Arg-C F1.

Similar to GenEE, GenE2E performs better than most baselines under the end-to-end scenario when measured by Arg-I and Arg-C. Nevertheless, there is a small performance gap between GenEE and GenE2E. A possible reason is that jointly considering event detection and event argument extraction makes learning more difficult. The observation that pipelined models outperform end-to-end models has also been reported by other work Zhong21pipe; we leave it as future work to study how to facilitate the learning of an end-to-end model while preserving its benefit of modeling richer dependencies between triggers and their corresponding arguments.

3.4 Ablation Study

We conduct a comprehensive study on the design of the proposed model, including prompt choices and the output format of GenED and GenEAE.

What input information is helpful?  As mentioned in Section 2, the designed prompts contain several types of information. To better understand their contributions, we test different combinations of prompt choices and report the results of GenEAE on the ACE05-E dataset in Table 4.

Ablating the event trigger from the prompt drops the model performance the most: triggers help the model identify the given event and provide potential information about the event type. The EAE Template enables the model to generate a more accurate output format, resulting in performance improvements. Although the model with only the Event Type Description as a prompt is the least effective, it is still helpful to provide the description to GenEAE.

Next, Table 5 shows the impact of our prompt selections on GenED. In contrast to GenEAE, a different prompt component plays the most important role; although the other two components are relatively less helpful, it remains most effective to combine all of them.

Input Format Arg-I Arg-C
Full GenEAE model 76.0 73.5
- w/o 74.5 71.1
- w/o 71.4 69.0
- w/o 73.8 70.4
- only 71.4 68.2
- only 71.2 69.4
- only 71.4 68.6
Table 4: Input ablation study for event argument extraction with gold triggers provided on ACE05-E.
Input Format Tri-I Tri-C
Full GenED model 74.0 69.7
- w/o 71.5 66.9
- w/o 73.1 69.2
- w/o 73.9 69.5
- only 72.6 68.9
- only 70.8 66.2
Table 5: Input ablation study for event detection on the ACE05-E dataset.

Is natural language output important?  To verify our design of using natural sentences as output texts, we consider five variants of the EAE Template:

  1. Natural sentence: our proposed templates described in Section 2.2, e.g., “somebody was born in somewhere.” Placeholders are replaced whenever there are corresponding arguments.

  2. Partially natural sentence w/ different role tokens: natural sentence templates with placeholders replaced by role-specific special tokens, e.g., “<Person> was born in <Place>.”

  3. Partially natural sentence w/ a unified role token: natural sentence templates with placeholders replaced by a single unified special token, e.g., “<Role> was born in <Role>.”

  4. Slot filling style w/ different role tags: a slot-filling-style sequence with role-specific tags, e.g., “<Person> placeholder1 </Person> <Place> placeholder2 </Place>”.

  5. Slot filling style w/ a unified role tag: a slot-filling-style sequence with a unified role tag, e.g., “<Role> placeholder1 </Role> <Role> placeholder2 </Role>”. The order of the tags is used to assign predictions to the corresponding role labels.

Table 6 shows the results of all EAE Template variants on ACE05-E. We notice that natural-sentence outputs perform better than slot-filling-style outputs, and natural-language-style placeholders provide further improvements. This validates our assumption that natural outputs help the model leverage the knowledge in the pretrained language model. We also observe that using different role tokens or tags is better than using a unified one, confirming that label semantics is important.

Output Format Arg-I Arg-C
Natural sentence 76.0 73.5
Partially natural sentence w/ different role tokens 75.2 73.1
Partially natural sentence w/ a unified role token 73.9 71.3
Slot filling style w/ different role tags 71.0 68.9
Slot filling style w/ a unified role tag 67.2 62.9
Table 6: Results of GenEAE with different output formats on the ACE05-E dataset.
(a) Results for the top-5 common event types.
(b) Results for the top-10 common event types.
Figure 3: Zero/few-shot experimental results. Left: results on the event detection task, reported as trigger classification F1. Middle: results when gold triggers are given, evaluated with the argument classification criterion. Right: results on the end-to-end event extraction task, which predicts triggers and their corresponding arguments; we report argument classification F1. The full numerical results can be found in Appendix E.

4 Zero/Few-Shot Event Extraction

We conduct zero-shot and few-shot experiments on the ACE05-E dataset using GenEE in this section.

Settings.  We select the most common event types as “seen” types (the top-5 and top-10 settings; the selected types are listed in Appendix C) and use the rest as “unseen/rare” types. To simulate a zero-shot scenario, we remove all events with “unseen/rare” types from the training data. To simulate a few-shot scenario, we keep only a small number of event examples for each “unseen/rare” type (the n-shot setting keeps n examples per type). Although inference is run on the full test set, we calculate micro F1-scores only for these “unseen/rare” types.
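A sketch of how such a split could be constructed is given below. The seen-type set is the top-5 setting from Appendix C; the per-example "event_type" field and the tuple layout in the evaluation helper are assumptions for illustration.

    SEEN_TYPES = {"Conflict:Attack", "Movement:Transport", "Life:Die",
                  "Contact:Meet", "Personnel:Elect"}  # top-5 common types (Appendix C)

    def build_few_shot_train(event_examples, k):
        """Keep all events of seen types and at most k per unseen/rare type (k = 0 gives zero-shot)."""
        kept, budget = [], {}
        for ex in event_examples:            # each example is assumed to carry its event type
            etype = ex["event_type"]
            if etype in SEEN_TYPES:
                kept.append(ex)
            elif budget.get(etype, 0) < k:
                kept.append(ex)
                budget[etype] = budget.get(etype, 0) + 1
        return kept

    def restrict_to_unseen(trigger_tuples):
        # Inference runs on the full test set, but scores are computed only over
        # unseen/rare types (tuple layout as in the Section 3.1 scoring sketch).
        return {t for t in trigger_tuples if t[-1] not in SEEN_TYPES}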

Compared baselines.  We consider the following baselines: (1) BERT_QA Du20qa, which encodes label semantics into questions and is able to perform zero-shot event argument extraction. However, its trigger model is a sequence tagging model whose set of predictable event types is fixed after training, which prevents it from addressing zero-shot event detection; we therefore evaluate its event detection only in the few-shot experiments. (2) OneIE Lin20oneie, a classification model that always makes a fixed-size categorical prediction and hence cannot generalize to the zero-shot setting; we only test OneIE in the few-shot scenario. (3) Matching baseline, a proposed baseline that makes trigger predictions by string matching between the input passage and the Event Keywords. (4) Lemmatization baseline, another proposed baseline that performs string matching between the lemmatized input passage and the Event Keywords. (Baselines (3) and (4) apply only to event detection.)

Experimental results.  Figure 3 shows the results for the top-5 and top-10 common settings. From the two subfigures in the left column, we see that GenED achieves promising results in the zero-shot setting: it performs better than BERT_QA trained in the 10-shot setting and OneIE trained in the 5-shot setting. This demonstrates the great potential of GenED to discover new event types. Interestingly, our two proposed baselines perform surprisingly well, suggesting that the trigger annotations in ACE05-E are not very diverse. Despite their strong performance, GenED still outperforms the matching baseline by over 4.7% absolute trigger classification F1 in both the top-5 and top-10 settings in the zero-shot scenario. Additionally, with only one training instance per unseen type, GenED outperforms both proposed baselines.

Next, we compare results on the event argument extraction task. From the two middle subfigures, we observe that, when given gold triggers, our model outperforms all baselines by a large margin. Lastly, we train models for both trigger and argument extraction and report the final argument classification scores in the two right subfigures. The results show that GenEE generalizes well to unseen event types and outperforms BERT_QA and OneIE even when both are trained in the 5-shot setting.

5 Related Work

Classification-based event extraction.  Event extraction (EE) has been studied for over a decade ahn-2006-stages; ji-grishman-2008-refining and is usually formulated as a classification problem. Classification-based models employ various neural architectures, including CNNs nguyen-grishman-2015-event; wang-etal-2019-hmeae, RNNs Nguyen16jrnn; wang-etal-2019-hmeae, and Transformers yang-etal-2019-exploring. Most of them follow a pipelined design, which leads to error propagation and prevents interactions between local predictions. As a solution, some works incorporate global features to capture the dependencies among local classifiers and apply joint inference Lin20oneie; Li13jointbeam; yang-mitchell-2016-joint; Nguyen16jrnn. Despite these improvements, such models usually do not exploit label semantics and cannot handle new event types. Our method models event structure, mitigates error propagation by excluding entity prediction steps, and generalizes to the zero-shot scenario.

Event extraction as MRC.  Liu2020rceeer; Du20qa; li-etal-2020-event cast EE as an extractive machine reading comprehension problem. They use different strategies to frame questions that query each component of an event, such as manually created questions Du20qa, unsupervised question generation Liu2020rceeer, and a successive question-answering paradigm li-etal-2020-event, and models are trained to point to the corresponding answers. These methods exploit label semantics by phrasing questions that are semantically related to the labels. However, their frameworks require multiple independent steps to extract all components of an event, which ignores the dependencies between predictions, and they need a complex threshold-tuning process to handle cases with missing or multiple answers. In contrast, our method considers the dependencies among triggers and arguments and generates predictions without complicated threshold tuning.

Generative model for event extraction.  TANL Paolini21tacl, which is finetuned from T5 t5model, is a generative model that makes structured predictions by autoregressively generating augmented languages containing patterns that represent the final outputs. Both TANL and our GenEE capture output dependencies through generative modeling. However, TANL cannot address zero-shot EE, while GenEE can deal with events of new types. Moreover, TANL's augmented language is not natural language and is rarely seen during pretraining, which might hinder TANL from fully using the knowledge in the pretrained language model. In contrast, the input and output of GenEE are written as natural sentences to better exploit the pretrained knowledge.

Another generative model for EE is proposed by concurrent work that focuses on document-level event argument extraction li2021document. That work shares a similar design with ours in using templates to extract event arguments. However, there are three major differences: (1) Their templates contain special tokens as placeholders, whereas in our design all placeholders are natural words, which Section 3.4 shows to be useful. (2) They use a special token “<tgr>” to mark targeted triggers in the passage, while GenEE uses a natural-language-style input, which can better exploit the pretrained knowledge in language models. (3) Our framework is flexible enough to address event detection, event argument extraction, and end-to-end event extraction and demonstrates competitive results, while their generative model is only used to extract event arguments. These designs help GenEE achieve better performance on the ACE05-E dataset than their model (their results on ACE05-E can be found at https://github.com/raspberryice/gen-arg/blob/main/NAACL_2021_Appendix.pdf).

6 Conclusion

In this paper, we formulate event extraction as a natural language generation task. We propose a framework that encodes the given information into carefully designed prompts and generates a natural sentence from which the final triggers and arguments can be decoded. It successfully models event structures and exploits label semantics, resulting in promising performance on event extraction tasks under zero-shot, few-shot, and high-resource scenarios. In the future, we plan to extend the framework to a more general paradigm in order to address more structured prediction problems.

Ethics Considerations

Our proposed models are based on a pretrained language model trained on a large text corpus. It is known that pretrained language models can capture biases reflected in their training data. Therefore, our models may generate offensive or biased content learned from the pretrained language model. We suggest carefully examining potential biases before deploying the models in any real-world applications.

References

Appendix A Dataset Statistics

Table 7 lists the statistics of ACE05-E, ACE05-E+, and ERE-EN.

Dataset Split #Sents #Events #Args
ACE05-E Train 17172 4202 4859
Dev 923 450 605
Test 832 403 576
ACE05-E+ Train 19216 4419 6607
Dev 901 468 759
Test 676 424 689
ERE-EN Train 14736 6208 8924
Dev 1209 525 730
Test 1163 551 822
Table 7: Dataset statistics.

Appendix B Implementation Details

The implementation details of all baselines and our models are as follows. For all of GenED, GenEAE, and GenE2E, we finetune pretrained BART-large lewis-etal-2020-bart using the AdamW optimizer with fixed learning rate and weight decay. The number of epochs is 45. We set the batch size to 6 for GenEAE and 32 for GenED and GenE2E, and the number of negative examples to 18.
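A minimal training-setup sketch consistent with the hyper-parameters above is shown below (Hugging Face transformers and PyTorch are assumed; the learning rate and weight decay values are not recoverable from this text, so the AdamW call leaves them at library defaults, which are not the paper's choices).

    import torch
    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
    model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

    NUM_EPOCHS = 45
    BATCH_SIZE = {"GenEAE": 6, "GenED": 32, "GenE2E": 32}
    NUM_NEGATIVE_EXAMPLES = 18  # negative event-type queries sampled per training example

    # Learning rate and weight decay are set in the paper but not stated here;
    # the library defaults below are placeholders only.
    optimizer = torch.optim.AdamW(model.parameters())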

Appendix C Common Event Types

Table 8 lists the top-5 and top-10 common event types in ACE05-E.

n Seen Event Types for Training/Development
5 Conflict:Attack, Movement:Transport, Life:Die, Contact:Meet, Personnel:Elect
10 Conflict:Attack, Movement:Transport, Life:Die, Contact:Meet, Personnel:Elect, Life:Injure, Personnel:End-Position, Justice:Trial-Hearing, Contact:Phone-Write, Transaction:Transfer-Money
Table 8: Common event types in ACE05-E.

Appendix D List of EAE Templates

Tables 9 and 10 list all EAE Templates for ACE05-E, ACE05-E+, and ERE-EN.

Event Type EAE Template
Life:Be-Born somebody was born in somewhere.
Life:Marry somebody got married in somewhere.
Life:Divorce somebody divorced in somewhere.
Life:Injure somebody or some organization led to some victim injured by some way in somewhere.
Life:Die somebody or some organization led to some victim died by some way in somewhere.
Movement:Transport something was sent to somewhere from some place by some vehicle. somebody or some organization was responsible for the transport.
Transaction:Transfer-Ownership someone got something from some seller in somewhere.
Transaction:Transfer-Money someone paid some other in somewhere.
Business:Start-Org somebody or some organization launched some organzation in somewhere.
Business:Merge-Org some organzation was merged.
Business:Declare-Bankruptcy some organzation declared bankruptcy.
Business:End-Org some organzation dissolved.
Conflict:Attack some attacker attacked some facility, someone, or some organization by some way in somewhere.
Conflict:Demonstrate some people or some organization protest at somewhere.
Contact:Meet some people or some organization met at somewhere.
Contact:Phone-Write some people or some organization called or texted messages at somewhere.
Personnel:Start-Position somebody got new job and was hired by some people or some organization in somewhere.
Personnel:End-Position somebody stopped working for some people or some organization at somewhere.
Personnel:Nominate somebody was nominated by somebody or some organization to do a job.
Personnel:Elect somebody was elected a position, and the election was voted by some people or some organization in somewhere.
Justice:Arrest-Jail somebody was sent to jailed or arrested by somebody or some organization in somewhere.
Justice:Release-Parole somebody was released by some people or some organization from somewhere.
Justice:Trial-Hearing somebody, prosecuted by some other, faced a trial in somewhere. The hearing was judged by some adjudicator.
Justice:Charge-Indict somebody was charged by some other in somewhere. The adjudication was judged by some adjudicator.
Justice:Sue somebody was sued by some other in somewhere. The adjudication was judged by some adjudicator.
Justice:Convict somebody was convicted of a crime in somewhere. The adjudication was judged by some adjudicator.
Justice:Sentence somebody was sentenced to punishment in somewhere. The adjudication was judged by some adjudicator.
Justice:Fine some people or some organization in somewhere was ordered by some adjudicator to pay a fine.
Justice:Execute somebody was executed by somebody or some organization at somewhere.
Justice:Extradite somebody was extradicted to somewhere from some place. somebody or some organization was responsible for the extradition.
Justice:Acquit somebody was acquitted of the charges by some adjudicator.
Justice:Pardon somebody received a pardon from some adjudicator.
Justice:Appeal some other in somewhere appealed the adjudication from some adjudicator.
Table 9: All EAE Templates for ACE05-E and ACE05-E+.
Event Type EAE Template
Life:Be-Born somebody was born in somewhere.
Life:Marry somebody got married in somewhere.
Life:Divorce somebody divorced in somewhere.
Life:Injure somebody or some organization led to some victim injured by some way in somewhere.
Life:Die somebody or some organization led to some victim died by some way in somewhere.
Movement:Transport-Person somebody was moved to somewhere from some place by some way. somebody or some organization was responsible for the movement.
Movement:Transport-Artifact something was sent to somewhere from some place. somebody or some organization was responsible for the transport.
Business:Start-Org somebody or some organization launched some organzation in somewhere.
Business:Merge-Org some organzation was merged.
Business:Declare-Bankruptcy some organzation declared bankruptcy.
Business:End-Org some organzation dissolved.
Conflict:Attack some attacker attacked some facility, someone, or some organization by some way in somewhere.
Conflict:Demonstrate some people or some organization protest at somewhere.
Contact:Meet some people or some organization met at somewhere.
Contact:Correspondence some people or some organization contacted each other at somewhere.
Contact:Broadcast some people or some organization made announcement to some publicity at somewhere.
Contact:Contact some people or some organization talked to each other at somewhere.
Manufacture:Artifact something was built by somebody or some organization in somewhere.
Personnel:Start-Position somebody got new job and was hired by some people or some organization in somewhere.
Personnel:End-Position somebody stopped working for some people or some organization at somewhere.
Personnel:Nominate somebody was nominated by somebody or some organization to do a job.
Personnel:Elect somebody was elected a position, and the election was voted by somebody or some organization in somewhere.
Transaction:Transfer-Ownership The ownership of something from someone was transferred to some other at somewhere.
Transaction:Transfer-Money someone paid some other in somewhere.
Transaction:Transaction someone give some things to some other in somewhere.
Justice:Arrest-Jail somebody was sent to jailed or arrested by somebody or some organization in somewhere.
Justice:Release-Parole somebody was released by somebody or some organization from somewhere.
Justice:Trial-Hearing somebody, prosecuted by some other, faced a trial in somewhere. The hearing was judged by some adjudicator.
Justice:Charge-Indict somebody was charged by some other in somewhere. The adjudication was judged by some adjudicator.
Justice:Sue somebody was sued by some other in somewhere. The adjudication was judged by some adjudicator.
Justice:Convict somebody was convicted of a crime in somewhere. The adjudication was judged by some adjudicator.
Justice:Sentence somebody was sentenced to punishment in somewhere. The adjudication was judged by some adjudicator.
Justice:Fine some people or some organization in somewhere was ordered by some adjudicator to pay a fine.
Justice:Execute somebody was executed by somebody or some organization at somewhere.
Justice:Extradite somebody was extradicted to somewhere from some place. somebody or some organization was responsible for the extradition.
Justice:Acquit somebody was acquitted of the charges by some adjudicator.
Justice:Pardon somebody received a pardon from some adjudicator.
Justice:Appeal somebody in somewhere appealed the adjudication from some adjudicator.
Table 10: All EAE Templates for ERE-EN.

Appendix E Full Results of Zero/Few-Shot Event Extraction

The full results of the zero/few-shot settings on ACE05-E are shown in Table 11.

Trigger Model / Argument Model        Common 5                          Common 10
                                      Tri-I  Tri-C  Arg-I  Arg-C       Tri-I  Tri-C  Arg-I  Arg-C
Event Argument Extraction
Gold Triggers BERT_QA 0-shot 100.0 100.0 55.8 37.9 100.0 100.0 57.2 46.7
Gold Triggers BERT_QA 1-shot 100.0 100.0 55.8 44.3 100.0 100.0 57.8 47.2
Gold Triggers BERT_QA 5-shot 100.0 100.0 56.6 49.6 100.0 100.0 59.1 50.6
Gold Triggers BERT_QA 10-shot 100.0 100.0 58.8 52.9 100.0 100.0 60.5 52.8
Gold Triggers OneIE 1-shot 100.0 100.0 40.9 36.5 100.0 100.0 48.3 44.2
Gold Triggers OneIE 5-shot 100.0 100.0 55.6 51.4 100.0 100.0 58.6 55.0
Gold Triggers OneIE 10-shot 100.0 100.0 59.4 56.7 100.0 100.0 62.0 59.5
Gold Triggers GenEAE 0-shot 100.0 100.0 56.1 48.0 100.0 100.0 66.5 53.3
Gold Triggers GenEAE 1-shot 100.0 100.0 65.2 55.2 100.0 100.0 65.4 54.7
Gold Triggers GenEAE 5-shot 100.0 100.0 70.9 62.2 100.0 100.0 68.0 61.7
Gold Triggers GenEAE 10-shot 100.0 100.0 71.1 64.2 100.0 100.0 71.6 64.3
Gold Triggers BERT_QA (Full) 100.0 100.0 63.1 57.9 100.0 100.0 62.1 56.5
Gold Triggers OneIE (Full) 100.0 100.0 70.8 66.4 100.0 100.0 67.9 64.1
Gold Triggers GenEAE (Full) 100.0 100.0 74.5 70.6 100.0 100.0 73.6 68.9
Event Extraction
Matching Baseline 42.7 42.1 - - 46.3 46.3 - -
Lemmatization Baseline 51.5 50.2 - - 56.6 56.0 - -
BERT_QA 1-shot 10.0 1.4 1.3 1.3 8.2 1.6 1.1 1.1
BERT_QA 5-shot 14.0 12.6 11.1 10.8 20.8 15.4 14.6 13.9
BERT_QA 10-shot 37.8 33.5 22.9 22.1 32.0 27.8 19.5 18.6
OneIE 1-shot 4.2 4.2 1.5 1.5 4.1 2.7 2.0 2.0
OneIE 5-shot 39.3 38.5 24.8 22.8 41.9 41.9 29.7 27.2
OneIE 10-shot 54.8 53.3 36.0 34.9 61.5 57.8 41.4 39.2
GenED 0-shot GenEAE 0-shot 53.3 46.8 29.6 25.1 60.9 54.5 42.0 31.4
GenED 1-shot GenEAE 1-shot 60.1 53.3 38.8 31.6 61.2 60.9 41.1 34.7
GenED 5-shot GenEAE 5-shot 57.8 55.5 40.6 36.1 65.8 64.8 45.3 42.7
GenED 10-shot GenEAE 10-shot 63.8 61.2 46.0 42.0 72.1 68.8 52.5 48.4
OneIE (Full) 72.7 70.5 52.3 49.9 74.5 73.0 51.2 48.9
GenED (Full) GenEAE (Full) 68.4 66.0 51.9 48.7 72.0 69.8 52.5 49.2
Table 11: Full results of zero/few-shot setting on ACE05-E. BERT_QA refers to the model from Du20qa, and OneIE is from Lin20oneie.