Few-Shot Event Detection with Prototypical Amortized Conditional Random Field

by   Xin Cong, et al.

Event Detection, a fundamental task of Information Extraction, tends to struggle when it needs to recognize novel event types from only a few samples, i.e., Few-Shot Event Detection (FSED). The previous identify-then-classify paradigm attempts to solve this problem in a pipeline manner but ignores the trigger discrepancy between event types, and thus suffers from error propagation. In this paper, we present a novel unified joint model which converts the task to a few-shot tagging problem with a double-part tagging scheme. To this end, we first design the Prototypical Amortized Conditional Random Field (PA-CRF) to model the label dependency in the few-shot scenario; it builds prototypical amortization networks to approximate the transition scores between labels based on the label prototypes. A Gaussian distribution is then introduced to model the transition scores in PA-CRF, which alleviates the uncertain estimation resulting from insufficient data. We conduct experiments on the benchmark dataset FewEvent, and the experimental results show that the tagging-based methods are better than existing pipeline and joint learning methods. In addition, the proposed PA-CRF achieves the best results on the public dataset.






Event detection (ED) systems extract events of specific types from the given text. Traditionally, researchers use pipeline approaches Ahn (2006), where a trigger identification (TI) system is used to identify event triggers in a sentence and then a trigger classifier (TC) is used to find the event types of the extracted triggers. Such a framework makes the task easy to conduct but ignores the interaction and correlation between the two subtasks, making it susceptible to cascading errors. In the last few years, several neural network-based models have been proposed to jointly identify triggers and classify event types from a sentence Chen et al. (2015); Nguyen and Grishman (2015, 2018); Liu, Luo, and Huang (2018); Yan et al. (2019); Cui et al. (2020). These models have achieved promising performance and proved the effectiveness of solving ED in the joint framework. However, almost all of them follow the supervised learning paradigm and depend on large-scale human-annotated datasets, while new event types emerge every day and most of them lack sufficient annotated data. With such insufficient resources, existing joint models cannot recognize novel, unseen event types from only a few samples, i.e., Few-Shot Event Detection (FSED).

Split   Event type   Sentence
Train   Marry        It served as the location of Bogart’s [wedding] to Bacall.
Test    E-Mail       If you have a better idea, please [e-mail] me.
Table 1: An example from the FewEvent dataset revealing the trigger discrepancy. Brackets mark the event trigger.

One intuitive way to solve this problem is to first identify event triggers and then classify the event types based on Few-Shot Classification methods Vinyals et al. (2016); Snell, Swersky, and Zemel (2017); Sung et al. (2018); the two subtasks can be trained jointly by parameter sharing. Such an identify-then-classify paradigm Deng et al. (2020) seems convincing because TI aims to recognize triggers and does not need to adapt to novel classes, so only TC needs to be solved in the few-shot manner. Unfortunately, our preliminary experiments reveal that TI tends to struggle when recognizing triggers of novel event types, because novel events usually contain triggers completely different from those of the known events. Table 1 gives an example in which the trigger “e-mail” would only occur in the event E-Mail but not in Marry. Experiments on FewEvent (a benchmark dataset for FSED) show that 59.21% of the triggers in the test set do not appear in the training set, and the F1 score of TI with the SOTA TI model BERT-tagger Yang et al. (2019) is only 31.82%. Thus, the performance of the identify-then-classify paradigm is limited by the TI part due to cascading errors.

In this paper, we present a new unified method to solve FSED. Specifically, we convert this task to a sequence labeling problem and design a double-part tagging scheme using trigger and event parts to describe the features of each word in a sentence. The key to the sequence labeling framework is to model the dependency between labels. Conditional Random Field (CRF) is a popular choice to capture such dependency by learning transition scores over the fixed label space of the training dataset. Nevertheless, in FSED, CRF cannot be applied directly due to the non-adaptation problem: the label space of the test set does not overlap with that of the training set, since FSED aims to recognize novel event types. Therefore, the transition scores that CRF learns from the training set do not model the dependency of the novel labels in the test set.

To address the non-adaptation problem, we propose the Prototypical Amortized Conditional Random Field (PA-CRF), which generates the transition scores based on the label prototypes instead of learning them by optimization, i.e., in the amortized manner Kingma and Welling (2014); Gordon et al. (2019). Specifically, we introduce Prototypical Amortization Networks (PAN). PAN first applies the self-attention mechanism to capture the dependency information between labels, and then maps the label prototype pairs to the corresponding transition scores. In this way, PAN can produce label-specific transition scores based on the few support samples, which can adapt to arbitrary novel event types. However, predicting the transition score as a single fixed value amounts to point estimation, which usually requires a large amount of annotated data to be accurate. Since the transition scores are predicted from only a handful of samples, the estimated transition scores may be uncertain due to random data bias. To alleviate this issue, we treat the transition score as a random variable and use a Gaussian distribution to approximate its distribution and thereby model the uncertainty. Our PAN therefore estimates the parameters of the Gaussian distribution rather than the transition scores directly. In the inference phase, the Probabilistic Inference Gordon et al. (2019) is employed based on the Gaussian distribution over all possible values; it makes the inference robust by taking the possible perturbation of transition scores into account, since the perturbation is also learned in a way that coherently explains the uncertainty of the samples.

To summarize, our contributions are as follows:

  • We devise a tagging-based joint model for FSED. To the best of our knowledge, we are the first to solve this task in a unified manner, free from the cascading errors.

  • We propose a novel model, PA-CRF, which estimates the distributions of transition scores for modeling the label dependency in the few-shot sequence labeling setting.

  • We demonstrate the effectiveness of our method on the benchmark FewEvent and achieve SOTA results. Further analyses show that the performance of non-unified models is limited by TI and that our unified model alleviates this limitation.

Related Work

Our work is inspired by two lines of research: few-shot event detection and few-shot sequence labeling.

Few-shot Event Detection. Event Detection (ED) aims to recognize the specific type of events in a sentence. In recent years, various neural models have been proposed and achieved promising performance in ED. Chen et al. (2015) and Nguyen and Grishman (2015) proposed convolutional architectures to capture the semantic information in the sentence. Nguyen, Cho, and Grishman (2016) introduced the recurrent neural network to model the sequential contextual information of words. Recently, GCN-based models Nguyen and Grishman (2018); Liu, Luo, and Huang (2018); Yan et al. (2019); Cui et al. (2020) have been proposed to exploit syntactic dependency information and achieved state-of-the-art performance. However, all these models are data-hungry, which dramatically limits their usability and deployability in real-world scenarios.

Recently, there has been increasing research interest in solving Few-Shot Event Detection (FSED) Deng et al. (2020); Lai, Dernoncourt, and Nguyen (2020a, b) by exploiting Few-Shot Learning (FSL) Vinyals et al. (2016); Snell, Swersky, and Zemel (2017); Sung et al. (2018); Finn, Abbeel, and Levine (2017); Cong et al. (2020). Lai, Dernoncourt, and Nguyen (2020a) focused on few-shot trigger classification and split part of the support set off to act as an auxiliary query set for training the model. Lai, Dernoncourt, and Nguyen (2020b) also focused on few-shot trigger classification and introduced two regularization losses to improve the performance of models. Deng et al. (2020) first proposed the benchmark dataset, FewEvent, for FSED. They utilized the conventional trigger identification task as an auxiliary task to train the few-shot trigger classifier jointly and only evaluated the model performance on the few-shot trigger classification task. All these models focus on few-shot trigger classification, which classifies the event type of a trigger according to its context based on few samples, treating triggers as being provided by human annotators. This is unrealistic, as triggers are usually predicted by existing toolkits whose errors might be propagated to the event classification. Moreover, our preliminary experiments reveal that the conventional trigger identification model tends to struggle when recognizing triggers of novel event types because of the trigger discrepancy between event types. Different from the previous identify-then-classify framework, we propose, for the first time, a unified model which solves the two subtasks of FSED jointly.

Few-shot Sequence Labeling. In recent years, several works Hou et al. (2020); Fritzler, Logacheva, and Kretov (2019) have been proposed to solve few-shot named entity recognition using sequence labeling methods. Fritzler, Logacheva, and Kretov (2019) applied the vanilla CRF in the few-shot scenario directly. Hou et al. (2020) introduced a collapsed dependency transfer mechanism into CRF, which learns label dependency patterns of a set of task-agnostic abstract labels and utilizes these patterns as transition scores for novel labels. Different from these methods, which learn the transition scores by optimization, we build prototypical amortization networks to generate the transition scores based on the label prototypes instead. In this way, we can generate label-specific transition scores for arbitrary novel event types to achieve adaptation ability. We further introduce a Gaussian distribution to estimate the uncertainty caused by data bias. Experiments demonstrate the effectiveness of our method over the previous methods.

Problem Formulation

Figure 1: Architecture of our proposed PA-CRF. It consists of three modules: a) Emission Module calculates the emission scores for the query instance based on the prototypes derived from the support set. b) Transition Module generates the Gaussian distributions of transition scores with respect to prototypes. c) Decoding Module exploits the emission scores and approximated Gaussian distributions for transition scores to decode the predicted label sequence with the Monte Carlo Sampling.

We convert event detection to a sequence labeling task. Each word is assigned a label that contributes to detecting the events. Tag “O” represents the “Other” tag, which means that the corresponding word is independent of the target events. In addition, the other tags consist of two parts: the word position in the trigger and the event type. We use the “BI” (Begin, Inside) signs to represent the position information of a word in the event trigger. The event type information is obtained from a predefined set of events. Thus, the total number of tags is $2N+1$ ($N$ for B-EventType, $N$ for I-EventType, and an additional O label to denote other tokens), where $N$ is the number of predefined event types.
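The tagging scheme above can be illustrated with a minimal Python sketch. The helper names `build_label_set` and `tag_sentence`, and the representation of a trigger as a `(start, end_exclusive, event_type)` span, are assumptions made for illustration and are not taken from the paper.

```python
def build_label_set(event_types):
    """Return O plus one B- and one I- tag for each event type."""
    labels = ["O"]
    for event_type in event_types:
        labels.append(f"B-{event_type}")
        labels.append(f"I-{event_type}")
    return labels


def tag_sentence(tokens, triggers):
    """Tag tokens given triggers as (start, end_exclusive, event_type) spans."""
    tags = ["O"] * len(tokens)
    for start, end, event_type in triggers:
        tags[start] = f"B-{event_type}"
        for i in range(start + 1, end):
            tags[i] = f"I-{event_type}"
    return tags


tokens = "If you have a better idea , please e-mail me .".split()
labels = build_label_set(["Marry", "E-Mail"])      # two event types, five tags
tags = tag_sentence(tokens, [(8, 9, "E-Mail")])    # "e-mail" is token 8
```

With two event types the label set has five tags, and every token outside the trigger span keeps the O tag.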

Furthermore, we formulate the Few-Shot Event Detection (FSED) problem in the typical $N$-way-$K$-shot paradigm. Let $x = \{w_1, w_2, \dots, w_n\}$ denote an $n$-word sequence, and $y = \{y_1, y_2, \dots, y_n\}$ denote the label sequence of $x$. Given a support set $\mathcal{S}$ which contains $N$ event types, each with only $K$ instances, FSED aims to predict the labels of an unlabeled query set $\mathcal{Q}$ based on the support set $\mathcal{S}$. Formally, a pair $(\mathcal{S}, \mathcal{Q})$ is called an $N$-way-$K$-shot task $\mathcal{T}$. There exist two datasets, each consisting of a set of tasks $\mathcal{T}$: $\mathcal{D}_{train} = \{\mathcal{T}^{(i)}\}_{i=1}^{M_{train}}$ and $\mathcal{D}_{test} = \{\mathcal{T}^{(i)}\}_{i=1}^{M_{test}}$, where $M_{train}$ and $M_{test}$ denote the number of tasks in the two datasets respectively. As the names suggest, $\mathcal{D}_{train}$ is used to train models in the training phase while $\mathcal{D}_{test}$ is for evaluation. Note that the two datasets have their own event types, which means that their label spaces are disjoint.



As described above, we formulate FSED as a few-shot sequence labeling task with interdependent labels. Following the widely used CRF framework, we propose a novel PA-CRF model to capture such label dependency in the few-shot setting and decode the best-predicted label sequence. The CRF framework uses emission scores to measure the possible label for each token and transition scores to measure the dependency between labels, and then exploits both of them to decode the globally optimal label sequence. Accordingly, our PA-CRF contains three modules: 1) Emission Module: It first computes the prototype of each label based on the support set, and then calculates the similarity between prototypes and each token in the query set as the emission scores. 2) Transition Module: It exploits the prototypes from the Emission Module to generate the parameters of the Gaussian distributions of the transition scores for decoding. 3) Decoding Module: Based on the emission scores and the Gaussian distributions of transition scores, the Decoding Module calculates the probabilities of possible label sequences for the given query set and decodes the predicted label sequence. Figure 1 gives an illustration of PA-CRF. We detail each component from the bottom to the top.

Emission Module

The Emission Module assigns emission scores to each token of the sentences in the query set $\mathcal{Q}$ with regard to each label, based on the support set $\mathcal{S}$.

Base Encoder

Base Encoder aims to embed tokens in both the support set $\mathcal{S}$ and the query set $\mathcal{Q}$ into real-valued embedding vectors to capture the semantic information of tokens.

Since BERT Devlin et al. (2019), with its advanced ability to capture sequence information, has recently been widely used in NLP tasks, we use it as the backbone. Given an input word sequence $x = \{w_1, w_2, \dots, w_n\}$, BERT first maps all tokens into hidden embedding representations. We denote this operation as:

$$\{h_1, h_2, \dots, h_n\} = \mathrm{BERT}(\{w_1, w_2, \dots, w_n\})$$

where $h_i \in \mathbb{R}^{l}$ refers to the hidden representation of token $w_i$, and $l$ is the dimension of the hidden representation.

Prototype Layer

Prototype Layer derives the prototype of each label from the support set $\mathcal{S}$. As described in the problem formulation, we use the BIO schema to annotate the event trigger, so $N$ event types yield $2N+1$ labels and hence $2N+1$ prototypes. Following the previous work Snell, Swersky, and Zemel (2017), we calculate the prototype of each label by averaging all the word representations with that label in the support set $\mathcal{S}$:

$$c_i = \frac{1}{|\mathcal{W}(\mathcal{S}, y_i)|} \sum_{w \in \mathcal{W}(\mathcal{S}, y_i)} h$$

where $c_i$ denotes the prototype for label $y_i$, $\mathcal{W}(\mathcal{S}, y_i)$ refers to the token set containing all words in the support set $\mathcal{S}$ with label $y_i$, $h$ represents the corresponding hidden representation of token $w$, and $|\cdot|$ is the number of set elements.
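The averaging step can be sketched in a few lines of plain Python. The function name `compute_prototypes` and the input format (a flat list of `(label, embedding)` pairs, with toy two-dimensional embeddings standing in for BERT outputs) are assumptions for illustration only.

```python
def compute_prototypes(support_tokens):
    """Average the embeddings of all support tokens that share a label.

    support_tokens: iterable of (label, embedding) pairs.
    Returns a dict mapping each label to its mean embedding.
    """
    sums, counts = {}, {}
    for label, embedding in support_tokens:
        if label not in sums:
            sums[label] = [0.0] * len(embedding)
            counts[label] = 0
        sums[label] = [s + e for s, e in zip(sums[label], embedding)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


support = [("O", [0.0, 1.0]), ("O", [2.0, 3.0]), ("B-Marry", [4.0, 0.0])]
prototypes = compute_prototypes(support)
```

Here the O prototype is the mean of its two token embeddings, while B-Marry has a single token and therefore equals that token's embedding.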

Emission Scorer

Emission Scorer calculates the emission score for each token in the query set $\mathcal{Q}$. The emission scores are calculated according to the similarities between tokens and prototypes. The emission score of the label $y_j$ for the word $w_i$ is defined as:

$$f_E(y_j, w_i, \mathcal{S}) = d(h_i, c_j)$$

where $d(\cdot, \cdot)$ is the similarity function. In practice, we choose the dot product operation to measure the similarity.
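A minimal sketch of dot-product emission scoring follows. It assumes prototypes are stored in a dict keyed by label and uses toy two-dimensional vectors; the helper `sentence_emission_score`, which sums the token-level scores along one label sequence, is likewise a hypothetical name.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def emission_scores(token_embeddings, prototypes):
    """Return one {label: score} dict per query token (dot-product similarity)."""
    return [{label: dot(h, c) for label, c in prototypes.items()}
            for h in token_embeddings]


def sentence_emission_score(scores, label_sequence):
    """Sum the token-level emission scores along a given label sequence."""
    return sum(token_scores[label]
               for token_scores, label in zip(scores, label_sequence))


prototypes = {"O": [1.0, 0.0], "B-Marry": [0.0, 1.0]}
query = [[2.0, 0.5], [0.1, 3.0]]                  # two query tokens
scores = emission_scores(query, prototypes)
total = sentence_emission_score(scores, ["O", "B-Marry"])
```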

Finally, given a word sequence $x$, the emission score of the whole sentence with its corresponding ground-truth label sequence $y$ is computed as:

$$E(y, x, \mathcal{S}) = \sum_{i=1}^{n} f_E(y_i, w_i, \mathcal{S})$$
Transition Module

In vanilla CRF, the transition scores are learnable parameters optimized on large-scale data to model the label dependency. However, in few-shot scenarios, the learned transition scores cannot adapt to the novel label set due to the disjoint label space. To overcome this problem, we use neural networks to generate the transition scores based on the label prototypes instead of learning them by optimization, which provides the adaptation ability. A remaining problem is that generating transition scores from a few support instances with random data bias leads to uncertain estimation and may result in wrong inference. To model the uncertainty, we treat the transition score as a random variable and use the Gaussian distribution to approximate its distribution. Specifically, we propose the Prototypical Amortization Networks (PAN) to generate the distributional parameters (mean and variance) of the transition scores based on the label prototypes. PAN consists of two layers: 1) Prototype Interaction Layer and 2) Distribution Approximator. Each layer is detailed below.

Prototype Interaction Layer

Since the transition score models the dependency between labels, it is hard to generate transition scores from individual prototypes, which carry little dependency information. Thus, we propose a Prototype Interaction Layer which exploits the self-attention mechanism to capture the dependency between labels.

We first calculate the attention scores of each prototype with the others:

$$\alpha_{ij} = \frac{\exp(c_i^{Q} \cdot c_j^{K})}{\sum_{k} \exp(c_i^{Q} \cdot c_k^{K})}$$

where $c^{Q}$ and $c^{K}$ are transformed from $c$ by two linear layers respectively.

Given the attention scores, the prototype with dependency information is calculated as follows:

$$\hat{c}_i = \sum_{j} \alpha_{ij} c_j^{V}$$

where $c^{V}$ is also transformed linearly from $c$.
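The interaction step can be sketched as single-head self-attention over the label prototypes. This is a sketch under assumptions: the weight matrices `w_q`, `w_k`, `w_v` are hypothetical bias-free linear layers, and the dot products are left unscaled because the text does not state whether a scaling factor is applied.

```python
import math


def linear(weights, vector):
    """Apply a bias-free linear layer given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interact(prototypes, w_q, w_k, w_v):
    """Mix every prototype with all prototypes via attention weights."""
    queries = [linear(w_q, c) for c in prototypes]
    keys = [linear(w_k, c) for c in prototypes]
    values = [linear(w_v, c) for c in prototypes]
    mixed = []
    for q in queries:
        alpha = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        mixed.append([sum(a * v[d] for a, v in zip(alpha, values))
                      for d in range(len(values[0]))])
    return mixed


identity = [[1.0, 0.0], [0.0, 1.0]]   # toy weights, for illustration only
protos = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed = interact(protos, identity, identity, identity)
```

Each output row is a convex combination of the value vectors, so every prototype now carries information about the other labels.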

Distribution Approximator

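This module generates the mean and variance of the Gaussian distributions based on the prototypes with dependency information.

We first denote the transition score matrix as $T \in \mathbb{R}^{(2N+1) \times (2N+1)}$ for all label pairs, and denote the element in the $i$-th row and $j$-th column of $T$ as $T_{ij}$, which refers to the transition score for label $y_i$ transiting to label $y_j$. Treating $T_{ij}$ as a random variable, we use the Gaussian distribution $\mathcal{N}(\mu_{ij}, \sigma_{ij}^2)$ to approximate $T_{ij}$, where $\mathcal{N}$ refers to the Gaussian distribution. To estimate the mean $\mu_{ij}$ and variance $\sigma_{ij}^2$ of $T_{ij}$, we concatenate the corresponding prototypes $\hat{c}_i$ and $\hat{c}_j$ and feed them into two feed-forward neural networks respectively:

$$\mu_{ij} = F_{\mu}([\hat{c}_i ; \hat{c}_j]), \qquad \sigma_{ij} = F_{\sigma}([\hat{c}_i ; \hat{c}_j])$$

where $[\cdot ; \cdot]$ means the concatenation operation. We denote the approximated transition score from label $y_i$ to label $y_j$ based on $\mathcal{S}$ as:

$$f_T(y_i, y_j, \mathcal{S}) = T_{ij}, \qquad T_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma_{ij}^2)$$

Given a label sequence $y = \{y_1, y_2, \dots, y_n\}$, the transition score of the whole label sequence is approximated by:

$$T(y, \mathcal{S}) = \sum_{i=1}^{n-1} f_T(y_i, y_{i+1}, \mathcal{S})$$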
Decoding Module

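Decoding Module derives the probabilities for a specific label sequence of the query set according to the emission scores and the approximated Gaussian distributions of transition scores.

Since the approximated transition score is a Gaussian random variable rather than a single value, we denote the probability density function of the approximated transition score matrix as $q(T \mid \mathcal{S})$. According to the Probabilistic Inference Gordon et al. (2019), the probability of a label sequence $y$ of a word sequence $x$ based on the support set $\mathcal{S}$ is calculated as:

$$P(y \mid x, \mathcal{S}) = \int P(y \mid x, \mathcal{S}, T)\, q(T \mid \mathcal{S})\, dT$$

Following the CRF algorithm, the probability $P(y \mid x, \mathcal{S}, T)$ can be calculated based on Equation 4 and Equation 10:

$$P(y \mid x, \mathcal{S}, T) = \frac{\exp(f(y, x, \mathcal{S}))}{Z}$$

$$f(y, x, \mathcal{S}) = E(y, x, \mathcal{S}) + T(y, \mathcal{S})$$

$$Z = \sum_{y' \in \mathcal{Y}} \exp(f(y', x, \mathcal{S}))$$

where $E(y, x, \mathcal{S})$ is the emission score of the whole sentence, $T(y, \mathcal{S})$ is the transition score of the whole label sequence, and $\mathcal{Y}$ refers to all possible label sequences.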
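For a fixed transition matrix, the CRF probability of a label sequence can be checked by brute force on a tiny example. The sketch below assumes labels are integer indices and enumerates every label sequence to obtain the normaliser; a practical implementation would use the forward algorithm instead, and the toy scores are illustrative only.

```python
import math
from itertools import product


def sequence_score(emissions, transitions, labels):
    """Sum of emission scores plus sum of transition scores along labels."""
    score = sum(emissions[i][y] for i, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score


def sequence_probability(emissions, transitions, labels):
    """Normalise exp(score) over all label sequences (brute force)."""
    num_labels = len(transitions)
    all_sequences = product(range(num_labels), repeat=len(emissions))
    normaliser = sum(math.exp(sequence_score(emissions, transitions, seq))
                     for seq in all_sequences)
    return math.exp(sequence_score(emissions, transitions, labels)) / normaliser


emissions = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]   # 3 tokens, 2 labels
transitions = [[0.5, -0.5], [-1.0, 1.0]]           # one fixed sample of T
prob = sequence_probability(emissions, transitions, (0, 1, 1))
total = sum(sequence_probability(emissions, transitions, seq)
            for seq in product(range(2), repeat=3))
```

Summing the probability over all eight label sequences gives one, up to floating-point error, which is a quick sanity check on the normaliser.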

Objective Function

In the training phase, we use the negative log-likelihood loss as our objective function:

$$\mathcal{L} = -\log P(y \mid x, \mathcal{S})$$

Because the integral in Equation 12 is hard to compute, in practice we use the Monte Carlo sampling technique Gordon et al. (2019) to approximate it. To make the sampling process differentiable for optimization, we employ the reparameterization trick Kingma and Welling (2014) for each transition score $T_{ij}$:

$$T_{ij} = \mu_{ij} + \sigma_{ij}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, 1)$$

In the inference phase, the Viterbi algorithm Forney (1973) is employed to decode the best-predicted label sequence for the query set.
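The sampling and decoding steps can be sketched as follows. The reparameterised sample draws standard Gaussian noise and shifts and scales it by the predicted mean and standard deviation. How the sampled matrices are combined at inference time is not spelled out above, so averaging the samples before running Viterbi is an assumption of this sketch, as are all function names and toy values.

```python
import random


def sample_transitions(mu, sigma, rng):
    """Reparameterised sample: mu + sigma * eps with eps drawn from N(0, 1)."""
    return [[m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_row, sigma_row)]
            for mu_row, sigma_row in zip(mu, sigma)]


def viterbi(emissions, transitions):
    """Return the highest-scoring sequence of label indices."""
    num_labels = len(transitions)
    best = list(emissions[0])
    backpointers = []
    for token_scores in emissions[1:]:
        step_best, step_back = [], []
        for j in range(num_labels):
            prev = max(range(num_labels),
                       key=lambda i: best[i] + transitions[i][j])
            step_back.append(prev)
            step_best.append(best[prev] + transitions[prev][j] + token_scores[j])
        best = step_best
        backpointers.append(step_back)
    label = max(range(num_labels), key=lambda j: best[j])
    path = [label]
    for step_back in reversed(backpointers):
        label = step_back[label]
        path.append(label)
    return path[::-1]


rng = random.Random(0)
mu = [[0.5, -0.5], [-1.0, 1.0]]
sigma = [[0.1, 0.1], [0.1, 0.1]]
samples = [sample_transitions(mu, sigma, rng) for _ in range(100)]
# Assumption: decode with the average of the sampled transition matrices.
mean_t = [[sum(s[i][j] for s in samples) / len(samples) for j in range(2)]
          for i in range(2)]
emissions = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
path = viterbi(emissions, mean_t)
```

During training, the same sampler would feed each sampled matrix into the CRF likelihood and average the results, which keeps the mean and standard deviation inside the computation graph.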



We conduct experiments on the benchmark FewEvent dataset introduced in Deng et al. (2020), which is currently the largest few-shot dataset for event detection. It contains 70,852 instances for 100 event types, and each event type has about 700 instances on average. Following the previous work Deng et al. (2020), we use 80 event types as the training set (67,982 instances), 10 event types as the dev set (2,173 instances), and the remaining 10 event types as the test set (697 instances).


We follow the evaluation metrics in previous event detection work Chen et al. (2015); Liu, Luo, and Huang (2018); Cui et al. (2020): an event trigger is marked correct if and only if its event type and its offsets in the sentence are both correct. (Note that our evaluation metrics are different from the previous work DMBPN. As described in the caption of Table 1 of the original paper, DMBPN only evaluates trigger classification without considering whether the trigger is correct or not.) We adopt the standard micro F1 score to evaluate the results. For Precision and Recall, please refer to the Appendix. For fair comparisons, we report the averaged test results over 5 randomly initialized runs of our model.
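The matching criterion can be made concrete with a small sketch. It assumes each trigger is represented as a `(sentence_id, start, end, event_type)` tuple, a representation chosen here for illustration, so that a prediction counts as correct only when both its offsets and its event type match a gold trigger.

```python
def micro_f1(gold, predicted):
    """Micro F1 over sets of (sentence_id, start, end, event_type) tuples."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = {(0, 8, 9, "E-Mail"), (1, 7, 8, "Marry")}
predicted = {
    (0, 8, 9, "E-Mail"),   # correct offsets and type
    (1, 7, 8, "E-Mail"),   # correct offsets, wrong type: not counted
    (2, 0, 1, "Marry"),    # spurious trigger
}
score = micro_f1(gold, predicted)
```

In this toy case one of three predictions is correct and one of two gold triggers is found, so precision is 1/3 and recall is 1/2.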

Implementation Details

We employ BERT-BASE-UNCASED Devlin et al. (2019) as the base encoder. The maximum sentence length is set to 128. Our model is trained using the AdamW optimizer with a learning rate of 1e-5, selected by search over a set of candidate values. We train our model for 20,000 iterations on the training set and evaluate its performance with 3,000 iterations on the test set following the episodic paradigm Vinyals et al. (2016). All the hyper-parameters are tuned on the validation set. We run all experiments using PyTorch 1.5.1 on an Nvidia Tesla T4 GPU and an Intel(R) Xeon(R) Silver 4110 CPU with 256GB memory on Red Hat 4.8.3 OS. In the training phase, we follow the widely used episodic training Vinyals et al. (2016) in few-shot learning. Episodic training aims to mimic the N-way-K-shot scenario in the training phase. In each epoch, we randomly sample N event types from the training set, and for each event type we randomly sample K instances as the support set and another M instances as the query set.
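The episode construction described above can be sketched with the standard library alone. The function name `sample_episode`, the dictionary layout of the data, and the toy instance strings are assumptions for illustration.

```python
import random


def sample_episode(instances_by_type, n_way, k_shot, m_query, rng):
    """Sample one N-way-K-shot episode.

    Returns (support, query), each a list of (event_type, instance) pairs.
    Support and query instances of the same event type never overlap.
    """
    event_types = rng.sample(sorted(instances_by_type), n_way)
    support, query = [], []
    for event_type in event_types:
        chosen = rng.sample(instances_by_type[event_type], k_shot + m_query)
        support += [(event_type, inst) for inst in chosen[:k_shot]]
        query += [(event_type, inst) for inst in chosen[k_shot:]]
    return support, query


# Toy data: 8 event types with 10 instances each.
data = {f"type{t}": [f"type{t}-sent{i}" for i in range(10)] for t in range(8)}
support, query = sample_episode(data, n_way=5, k_shot=2, m_query=3,
                                rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps episodes reproducible across runs.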


To investigate the effectiveness of our proposed method, we compare it with a range of baselines and state-of-the-art models, which can be categorized into two classes: non-unified and unified.

Non-unified models first perform trigger identification (named TI) and then classify the event types based on Few-Shot Learning methods (named FSTC). We investigate two non-unified paradigms: separate and multi-task. We first exploit the state-of-the-art BERT tagger, named BertTI, for the TI task. It uses BERT Devlin et al. (2019) and a linear layer to tag the trigger in the sentence as a sequence labeling task. Since TI just aims to recognize the occurrence of the trigger, the label set only contains three labels: O, B-Trigger, and I-Trigger. For the FSTC task, we implement two models: ProtoTC and DMBPN. ProtoTC applies BERT as the encoder and Prototypical Networks as the few-shot classifier. The [CLS] representation is used as the representation of a sentence, and the average of the sentence representations of an event type is calculated as the prototype. The dot product is used as the similarity metric, as in our model. DMBPN is the SOTA few-shot event classification method Deng et al. (2020) using GRU. For a fair comparison, we reimplement it based on the BERT encoder. In the separate paradigm, the few-shot classifier and BertTI are trained separately without parameter sharing. We denote the separate paradigm of ProtoTC and DMBPN as SEP-ProtoTC and SEP-DMBPN, respectively. In the multi-task paradigm, the few-shot classifier is jointly trained with BertTI with shared BERT parameters. Similarly, we name the multi-task paradigm of ProtoTC and DMBPN as Multi-ProtoTC and Multi-DMBPN, respectively.

Unified models perform few-shot event detection with a single model without task decomposition. Because we are the first to solve this task in a unified way, there is no previous unified model to compare with. For a comprehensive evaluation of our proposed PA-CRF model, we therefore construct two groups of variants of PA-CRF: 1) non-CRF models, and 2) CRF-based models. Non-CRF models use emission scores to predict via softmax and do not take the label dependency into account. We implement four typical few-shot classifiers: 1) UNI-Match Vinyals et al. (2016) uses the cosine function to measure the similarity, 2) UNI-Proto Snell, Swersky, and Zemel (2017) uses Euclidean distance as the similarity metric, 3) UNI-Proto-Dot uses the dot product to compute the similarity, and 4) UNI-Relation Sung et al. (2018) builds a two-layer neural network to measure the similarity. All these models use BERT as the base encoder to embed each sentence. Since CRF, with its capacity for modeling label dependency, is widely used in sequence labeling tasks, we implement two kinds of CRF-based models as our baselines: 1) Vanilla CRF: Although the transition scores that vanilla CRF learns from the training set cannot adapt to the test set, we still implement it for the FSED task. 2) Collapsed CRF Hou et al. (2020): As it is the SOTA for the few-shot NER task, we re-implement it according to the official code and adapt it to the FSED task to replace our Transition Module. For a fair comparison, the emission module of these two CRF-based baseline models is the same as in our PA-CRF.
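The three closed-form similarity metrics that distinguish UNI-Match, UNI-Proto, and UNI-Proto-Dot can be sketched as follows. Returning the negative Euclidean distance, so that a larger value always means "more similar", is an assumption of this sketch; the baselines' actual implementations may differ, for example by using the squared distance.

```python
import math


def dot_similarity(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    norm_u = math.sqrt(dot_similarity(u, u))
    norm_v = math.sqrt(dot_similarity(v, v))
    return dot_similarity(u, v) / (norm_u * norm_v)


def negative_euclidean(u, v):
    """Negated distance, so that a larger value means closer vectors."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


token = [3.0, 4.0]
prototype = [6.0, 8.0]     # same direction as the token, twice the length
sims = {
    "dot": dot_similarity(token, prototype),
    "cosine": cosine_similarity(token, prototype),
    "euclidean": negative_euclidean(token, prototype),
}
```

The example shows why the metrics can rank prototypes differently: cosine ignores vector length, while the dot product and the Euclidean distance are both sensitive to it.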

Experimental Results

Model            5-Way-5-Shot   5-Way-10-Shot   10-Way-5-Shot   10-Way-10-Shot
SEP-ProtoTC      29.87          30.87           28.73           29.64
SEP-DMBPN        30.39          31.12           29.87           30.37
Multi-ProtoTC    30.56          30.88           29.32           30.22
Multi-DMBPN      31.01          31.22           30.10           31.31
UNI-Match        39.93          46.02           30.88           35.91
UNI-Proto        50.11          52.97           43.51           42.70
UNI-Proto-Dot    58.82          61.01           55.04           58.78
UNI-Relation     28.91          29.83           18.49           21.47
Vanilla CRF      59.01          62.21           56.00           59.35
Collapsed CRF    59.30          62.77           56.41           59.44
PA-CRF           62.25          64.45           58.48           61.64
Table 2: F1 scores (%) of different models on the FewEvent test set. PA-CRF obtains the highest score in every setting, and Collapsed CRF the second-highest.

Table 2 summarizes the results of our method PA-CRF against other baseline models on the FewEvent test set.

Comparison with non-unified models

1) All non-unified models perform worse than our PA-CRF by a large gap of about 30%, which strongly supports the effectiveness of our unified framework. 2) Compared with the separate paradigm (SEP-ProtoTC and SEP-DMBPN), the multi-task paradigm (Multi-ProtoTC and Multi-DMBPN) improves the performance but still cannot catch up with the unified paradigm. 3) DMBPN works slightly better than ProtoTC but still handles FSED poorly due to the limitation of TI. We discuss the bottleneck of the non-unified paradigm in a later section.

Comparison with unified models

(1) Over the best non-CRF baseline model UNI-Proto-Dot, PA-CRF achieves substantial improvements of 3.43%, 3.44%, 3.44% and 2.86% in the four few-shot scenarios respectively, which confirms the effectiveness and rationality of PA-CRF in modeling the label dependency. (2) Comparing the four few-shot scenarios, we find that the F1 score generally increases as the number of shots K increases, which shows that more support samples provide more information about the event type. The F1 score decreases as the number of ways N increases when the shot number is fixed, which reveals that a larger way number means more event types to predict and thus makes correct detection more difficult. (3) Vanilla CRF performs better than the non-CRF baseline methods, which demonstrates that CRF is able to improve the performance by modeling the label dependency, even if the learned transition scores do not match the label space of the test set. (4) Compared to Vanilla CRF, Collapsed CRF achieves slightly higher F1 scores (0.29%, 0.56%, 0.41% and 0.09% in the four scenarios), indicating that the transition scores of abstract BIO labels can improve the model performance to some extent. (5) PA-CRF outperforms Collapsed CRF by absolute gaps of 2.95%, 1.68%, 2.07% and 2.20% respectively. We attribute this to the fact that Collapsed CRF learns the transition scores of abstract labels and therefore cannot model the exact dependency of a specific label set, so its adaptation ability is limited. In contrast, PA-CRF generates label-specific transition scores based on the label prototypes, which can capture the dependency for specific novel event types.

To summarize, we can draw the following conclusions: (1) The non-unified paradigm is incapable of solving the FSED task. (2) Compared to the non-unified paradigm, the unified paradigm works more effectively for the FSED task. (3) By generating transition scores from the label prototypes rather than learning them by optimization, our PA-CRF achieves better adaptation to novel event types.

Analysis and Discussion

Bottleneck Analysis

Model                   TI      FSTC    FSED
SEP-DMBPN (BertTI)      31.82   94.40   30.39
Multi-DMBPN (BertTI)    32.31   95.44   31.01
SEP-DMBPN (ProtoTI)     52.69   95.39   51.50
Multi-DMBPN (ProtoTI)   54.69   95.49   53.93
PA-CRF                  63.68   96.76   62.25
Table 3: Comparison of PA-CRF and baselines on two subtasks. F1 scores are reported on the FewEvent test set in the 5-way-5-shot setting.

To investigate the bottleneck of the non-unified paradigm, we evaluate SEP-DMBPN, Multi-DMBPN, and PA-CRF on the two subtasks, TI and FSTC, separately in the 5-way-5-shot setting on the FewEvent test set. The experimental results are reported in Table 3. From Table 3, we find that: (1) All three models achieve more than 90% F1 score on the FSTC task, indicating that both the non-unified and unified frameworks are capable of solving the FSTC problem. (2) For the TI task, the two non-unified baselines obtain F1 scores of 31.82% and 32.31% respectively, which demonstrates that the conventional TI model has difficulty adapting to novel event triggers. Hence, due to cascading errors, the poorly performing TI module limits the performance of the non-unified models. (3) PA-CRF achieves a 63.68% F1 score on the TI task, which exceeds the two non-unified models significantly. Unlike non-unified models, which recognize triggers based on seen triggers, PA-CRF utilizes the trigger representations from the support set of the novel event types to identify novel triggers, so our unified model works better on the TI task of FSED.

Additionally, to verify the effectiveness of the unified framework, we further adapt our best baseline model, Collapsed CRF, to the TI task to recognize triggers in the few-shot manner. It recognizes triggers based on the similarity between tokens and label prototypes calculated from the support set. In this case, we rename it ProtoTI, combine it with DMBPN, and evaluate it in both the separate and multi-task paradigms. Results are also reported for the 5-way-5-shot setting in Table 3. From Table 3, we observe that the two ProtoTI-based models achieve 52.69% and 54.69% on the TI task, exceeding the BertTI-based models, which shows that solving TI in the few-shot manner by utilizing the support set can reduce the trigger discrepancy to some extent. Although their FSTC performance is similar to that of the BertTI models, owing to the improvements on the TI task, their final FSED performance exceeds that of the BertTI-based models by about 20%. However, they are still inferior to PA-CRF by a large gap (10.99% and 8.99% on the TI task respectively), which shows that solving FSED in the unified manner can exploit the correlation between the two subtasks to improve the model performance significantly.

Ablation Study

Model         5-Way-5-Shot   5-Way-10-Shot   10-Way-5-Shot   10-Way-10-Shot
PA-CRF        44.39          51.06           41.82           46.88
 - Dist Est   43.47          49.41           40.80           45.51
 - Interact   41.62          45.74           38.97           43.50
 - Trans      39.83          45.07           37.95           42.25
Table 4: Ablation study of PA-CRF. F1 scores in different settings are reported on the FewEvent dev set.

To study the contribution of each component of PA-CRF, we run an ablation study on the FewEvent dev set. From these ablations (see Table 4), we find that: (1) - Dist Est: To study whether distributional estimation helps, we remove it and make the Distribution Approximator directly generate a single value as the transition score, i.e. a point estimate. Inference is then based on the generated transition scores without Probabilistic Inference. As a result, the F1 score drops by 0.92%, 1.65%, 1.02% and 1.37% in the four settings, respectively. We attribute these gaps to the proposed Gaussian-based distributional estimation, which can model the data bias and relieve the influence of uncertainty. (2) - Interact: To verify that the Prototype Interaction Layer helps capture information shared between prototypes, we remove it and evaluate in the four settings. Table 4 shows that F1 scores decrease by 2.77%, 5.32%, 2.85% and 3.38% respectively, which indicates that the Prototype Interaction Layer is able to capture the dependency among prototypes. (3) - Trans: To measure the contribution of label dependency, we remove the Transition Module and use only the emission score for prediction. Without transition scores, performance drops by 4.56%, 5.99%, 3.87% and 4.63% respectively, the largest drop among the three ablations, which shows that the transition score improves performance on this sequence labeling task.
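
As a small sanity check, the per-setting drops can be recomputed directly from the Table 4 entries. The sketch below only performs that subtraction; the names `TABLE4`, `SETTINGS` and `f1_drops` are illustrative and not part of the original work.

```python
# F1 scores copied from Table 4 (FewEvent dev set).
SETTINGS = ["5-way-5-shot", "5-way-10-shot", "10-way-5-shot", "10-way-10-shot"]
TABLE4 = {
    "PA-CRF": [44.39, 51.06, 41.82, 46.88],
    "- Dist Est": [43.47, 49.41, 40.80, 45.51],
    "- Interact": [41.62, 45.74, 38.97, 43.50],
    "- Trans": [39.83, 45.07, 37.95, 42.25],
}


def f1_drops(table, full="PA-CRF"):
    """Return the absolute F1 drop of each ablated variant relative to the full model."""
    base = table[full]
    return {
        name: [round(b - s, 2) for b, s in zip(base, scores)]
        for name, scores in table.items()
        if name != full
    }


drops = f1_drops(TABLE4)
for name, values in drops.items():
    # One line per ablation, with the drop in each of the four settings.
    print(name, dict(zip(SETTINGS, values)))
```

The differences are absolute F1 points, so they can be compared across settings without normalizing by the full model's score.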

Case Study

Event         Model           Output
Sponsorship   Multi-DMBPN     Candlestick Park was dropped when the [sponsorship] agreement expired.
              PA-CRF          Candlestick Park was dropped when the [sponsorship] agreement expired.
Jail          Collapsed CRF   Willmore will tell everyone for wanting to keep the poor man [locked] [up].
              PA-CRF          Willmore will tell everyone for wanting to keep the poor man [locked] [up].
Table 5: Output of PA-CRF, Multi-DMBPN and Collapsed CRF on samples from the FewEvent test set. The subscripts denote the labels tagged by the models.

We compare our method with the best identify-then-classify baseline, Multi-DMBPN, and the best unified baseline, Collapsed CRF, on the cases shown in Table 5.

In the first example, “Candlestick Park was dropped when the sponsorship agreement expired” is a Sponsorship event whose trigger is the word “sponsorship”. Under the identify-then-classify paradigm, Multi-DMBPN fails to identify this trigger. Multi-DMBPN uses the conventional trigger identifier, BertTI, to identify event triggers. Since BertTI is trained on the training set, and “sponsorship” never appears there as the trigger of a Sponsorship event, BertTI is unable to identify the Sponsorship trigger. Because of such cascading errors, the performance of identify-then-classify models on the FSED task is limited. In contrast, PA-CRF identifies the trigger successfully because it uses the support set of the Sponsorship event, in which the word “sponsorship” appears and acts as the trigger.

In the second example, “Willmore will tell everyone for wanting to keep the poor man locked up”, an instance of the Jail event, the best unified baseline, Collapsed CRF, wrongly tags the first trigger word “locked” with the I-Jail label. This is because Collapsed CRF learns abstract transition scores among a set of abstract labels, which cannot model the label dependency of this specific event type. Thanks to the PAN, which models the label dependency based on the label prototypes from the support set of the Jail event, PA-CRF correctly tags the word “locked” with the B-Jail label.
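
To illustrate why transition scores matter in this case, the following sketch decodes a three-token toy sequence with and without them. It is not the PA-CRF implementation: the emission and transition values are hand-picked assumptions, whereas PA-CRF generates its transition scores from the label prototypes. The decoder is a plain Viterbi search (Forney 1973), and the helper names `viterbi` and `greedy` are hypothetical.

```python
LABELS = ["O", "B-Jail", "I-Jail"]


def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: one dict per token, mapping label -> emission score.
    transitions: dict mapping (previous_label, label) -> transition score.
    """
    # best[i][y] = (score of the best path ending in label y at token i, backpointer)
    best = [{y: (emissions[0][y], None) for y in labels}]
    for i in range(1, len(emissions)):
        column = {}
        for y in labels:
            prev, score = max(
                ((p, best[i - 1][p][0] + transitions[(p, y)]) for p in labels),
                key=lambda item: item[1],
            )
            column[y] = (score + emissions[i][y], prev)
        best.append(column)
    # Backtrack from the best final label.
    path = [max(labels, key=lambda y: best[-1][y][0])]
    for i in range(len(emissions) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


def greedy(emissions, labels):
    """Emission-only decoding: pick the best label for each token independently."""
    return [max(labels, key=lambda y: e[y]) for e in emissions]


# Toy emission scores (assumed values) for the tokens "man", "locked", "up".
# The emission for "locked" slightly prefers I-Jail, mimicking the baseline error.
emissions = [
    {"O": 2.0, "B-Jail": 0.0, "I-Jail": 0.0},
    {"O": 0.0, "B-Jail": 1.0, "I-Jail": 1.2},
    {"O": 0.2, "B-Jail": 0.0, "I-Jail": 1.0},
]

# Toy transition scores (assumed values): penalize O -> I, reward B -> I.
transitions = {(p, y): 0.0 for p in LABELS for y in LABELS}
transitions[("O", "I-Jail")] = -2.0
transitions[("B-Jail", "I-Jail")] = 1.0

print(greedy(emissions, LABELS))
print(viterbi(emissions, transitions, LABELS))
```

With these toy numbers, emission-only decoding starts the trigger with I-Jail, while decoding with transition scores recovers the B-Jail, I-Jail pattern.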

Error Study

Support Set   Instance #1    Cult members [visited] and built a laser weapon mounted on a truck
              Instance #2    Israel [leave] the West Bank and Gaza and dismantle Jewish settlements.
Query Set     Ground Truth   Refugees have been [pouring] [out] of Fallujah over the last few days.
              Prediction     Refugees have been [pouring] [out] of Fallujah over the last few days.
Table 6: A case of the wrong prediction from the FewEvent test set. The subscripts denote the triggers and their event types.

Although our method outperforms all baseline models, we still observe some failure cases. Table 6 gives a typical example of a wrong prediction. For the query instance, the ground truth event trigger is “pouring out”: “pouring” should be labeled B-Transport and “out” should be labeled I-Transport. However, our model only detects “pouring” with B-Transport and misses “out”. Inspecting the support set, we find that every support instance of this event type contains only a one-word trigger and therefore no token labeled I-Transport, so the prototype of I-Transport is a zero vector. As a result, the emission score for the label I-Transport is zero for every query token, and the transition scores, which are generated from the prototypes, are affected as well. Therefore, our model cannot detect the I-Transport label correctly in this case. In future work, we will study how to solve this missing I-label problem.
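
The failure mode can be reproduced in a few lines. The sketch below is a simplification under stated assumptions: prototypes are plain averages of support token embeddings, the emission score is a dot product between a query token and a prototype, and the 2-dimensional embeddings are made up for illustration. The function names `label_prototypes` and `emission_scores` are hypothetical.

```python
LABELS = ["O", "B-Transport", "I-Transport"]


def label_prototypes(support_tokens, labels, dim):
    """Average the embeddings of the support tokens carrying each label.

    support_tokens: list of (embedding, label) pairs.
    A label with no support tokens receives an all-zero prototype.
    """
    sums = {y: [0.0] * dim for y in labels}
    counts = {y: 0 for y in labels}
    for embedding, label in support_tokens:
        counts[label] += 1
        sums[label] = [s + e for s, e in zip(sums[label], embedding)]
    return {
        y: [s / counts[y] for s in sums[y]] if counts[y] else [0.0] * dim
        for y in labels
    }


def emission_scores(token, prototypes):
    """Dot-product similarity between one token embedding and each label prototype."""
    return {
        y: sum(t * p for t, p in zip(token, proto))
        for y, proto in prototypes.items()
    }


# Support set with one-word triggers only: no token is labeled I-Transport.
support = [
    ([1.0, 0.0], "O"),
    ([0.8, 0.2], "O"),
    ([0.0, 1.0], "B-Transport"),
    ([0.2, 0.8], "B-Transport"),
]

prototypes = label_prototypes(support, LABELS, dim=2)
scores = emission_scores([0.3, 0.7], prototypes)  # made-up embedding for "out"

print(prototypes["I-Transport"])
print(scores)
```

Because the I-Transport prototype is all zeros, its dot product with any query token is zero, so under these assumptions the emission score alone can never favor I-Transport for this event type.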


Conclusion

In this paper, we explore a new approach to few-shot event detection that solves the task in a unified manner. Specifically, we propose a prototypical amortized conditional random field that generates the transition scores from the label prototypes, which lets the model adapt to novel event types. Furthermore, we present a Gaussian-based distributional estimation of the transition scores to relieve the uncertainty caused by data bias. Finally, experimental results on the benchmark FewEvent dataset demonstrate the effectiveness of our proposed method. In the future, we plan to adapt our method to other few-shot sequence labeling tasks such as named entity recognition.


References

  • Ahn (2006) Ahn, D. 2006. The stages of event extraction. In Proceedings of the Workshop on Annotating and Reasoning about Time and Events, 1–8. Sydney, Australia: Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W06-0901.
  • Chen et al. (2015) Chen, Y.; Xu, L.; Liu, K.; Zeng, D.; and Zhao, J. 2015. Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 167–176. Beijing, China: Association for Computational Linguistics. doi:10.3115/v1/P15-1017. URL https://www.aclweb.org/anthology/P15-1017.
  • Cong et al. (2020) Cong, X.; Yu, B.; Liu, T.; Cui, S.; Tang, H.; and Wang, B. 2020. Inductive Unsupervised Domain Adaptation for Few-Shot Classification via Clustering.
  • Cui et al. (2020) Cui, S.; Yu, B.; Liu, T.; Zhang, Z.; Wang, X.; and Shi, J. 2020. Event Detection with Relation-Aware Graph Convolutional Neural Networks. CoRR abs/2002.10757. URL https://arxiv.org/abs/2002.10757.
  • Deng et al. (2020) Deng, S.; Zhang, N.; Kang, J.; Zhang, Y.; Zhang, W.; and Chen, H. 2020. Meta-Learning with Dynamic-Memory-Based Prototypical Network for Few-Shot Event Detection. In Caverlee, J.; Hu, X. B.; Lalmas, M.; and Wang, W., eds., WSDM ’20: The Thirteenth ACM International Conference on Web Search and Data Mining, Houston, TX, USA, February 3-7, 2020, 151–159. ACM.
  • Devlin et al. (2019) Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT.
  • Finn, Abbeel, and Levine (2017) Finn, C.; Abbeel, P.; and Levine, S. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017.
  • Forney (1973) Forney, G. D. 1973. The Viterbi algorithm. Proceedings of the IEEE 61(3): 268–278.
  • Fritzler, Logacheva, and Kretov (2019) Fritzler, A.; Logacheva, V.; and Kretov, M. 2019. Few-shot classification in Named Entity Recognition Task.
  • Gordon et al. (2019) Gordon, J.; Bronskill, J.; Bauer, M.; Nowozin, S.; and Turner, R. 2019. Meta-Learning Probabilistic Inference for Prediction. In International Conference on Learning Representations. URL https://openreview.net/forum?id=HkxStoC5F7.
  • Hou et al. (2020) Hou, Y.; Che, W.; Lai, Y.; Zhou, Z.; Liu, Y.; Liu, H.; and Liu, T. 2020. Few-shot Slot Tagging with Collapsed Dependency Transfer and Label-enhanced Task-adaptive Projection Network. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics.
  • Kingma and Welling (2014) Kingma, D. P.; and Welling, M. 2014. Auto-Encoding Variational Bayes. In Bengio, Y.; and LeCun, Y., eds., 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
  • Lai, Dernoncourt, and Nguyen (2020a) Lai, V. D.; Dernoncourt, F.; and Nguyen, T. H. 2020a. Exploiting the Matching Information in the Support Set for Few Shot Event Classification. In Lauw, H. W.; Wong, R. C.; Ntoulas, A.; Lim, E.; Ng, S.; and Pan, S. J., eds., Advances in Knowledge Discovery and Data Mining - 24th Pacific-Asia Conference, PAKDD 2020, Singapore, May 11-14, 2020, Proceedings, Part II, volume 12085 of Lecture Notes in Computer Science, 233–245. Springer.
  • Lai, Dernoncourt, and Nguyen (2020b) Lai, V. D.; Dernoncourt, F.; and Nguyen, T. H. 2020b. Extensively Matching for Few-shot Learning Event Detection. CoRR abs/2006.10093.
  • Liu, Luo, and Huang (2018) Liu, X.; Luo, Z.; and Huang, H. 2018. Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 1247–1256. Brussels, Belgium: Association for Computational Linguistics. doi:10.18653/v1/D18-1156. URL https://www.aclweb.org/anthology/D18-1156.
  • Nguyen, Cho, and Grishman (2016) Nguyen, T. H.; Cho, K.; and Grishman, R. 2016. Joint Event Extraction via Recurrent Neural Networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 300–309. San Diego, California: Association for Computational Linguistics. doi:10.18653/v1/N16-1034. URL https://www.aclweb.org/anthology/N16-1034.
  • Nguyen and Grishman (2015) Nguyen, T. H.; and Grishman, R. 2015. Event Detection and Domain Adaptation with Convolutional Neural Networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 365–371. Beijing, China: Association for Computational Linguistics. doi:10.3115/v1/P15-2060. URL https://www.aclweb.org/anthology/P15-2060.
  • Nguyen and Grishman (2018) Nguyen, T. H.; and Grishman, R. 2018. Graph Convolutional Networks With Argument-Aware Pooling for Event Detection. In McIlraith, S. A.; and Weinberger, K. Q., eds., Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, 5900–5907. AAAI Press.
  • Snell, Swersky, and Zemel (2017) Snell, J.; Swersky, K.; and Zemel, R. 2017. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, 4077–4087.
  • Sung et al. (2018) Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P. H. S.; and Hospedales, T. M. 2018. Learning to Compare: Relation Network for Few-Shot Learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018.
  • Vinyals et al. (2016) Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D.; et al. 2016. Matching networks for one shot learning. In Advances in neural information processing systems, 3630–3638.
  • Yan et al. (2019) Yan, H.; Jin, X.; Meng, X.; Guo, J.; and Cheng, X. 2019. Event Detection with Multi-Order Graph Convolution and Aggregated Attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5766–5770. Hong Kong, China: Association for Computational Linguistics. doi:10.18653/v1/D19-1582. URL https://www.aclweb.org/anthology/D19-1582.
  • Yang et al. (2019) Yang, S.; Feng, D.; Qiao, L.; Kan, Z.; and Li, D. 2019. Exploring Pre-trained Language Models for Event Extraction and Generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics. URL https://www.aclweb.org/anthology/P19-1522.